Monthly Archives: October 2014

Moving to AWS…

October 9, 2014 | Categories: Cloud Computing, DataScience | Tags: AWS, becoming a data scientist, cloud computing, creating your AMI, EC2, pandas, python packages install | Author: priyamvadadesai

As the project grew, we started downloading tweets from various journal websites and tried to set up an algorithm to find tweets related to particular papers and link them to those papers, thereby producing another “metric” to compare papers by. In addition, we started keeping detailed records of all the authors of the papers and attempted to create a citation database. As the complexity grew, we found that our local server was too slow and, after some research, decided to take the plunge and move our stuff to AWS (Amazon Web Services). We created a VPC (Virtual Private Cloud), moved our database to Amazon’s RDS (Relational Database Service) and created buckets for storage on Amazon’s S3 (Simple Storage Service). It’s relatively easy to do and the AWS documentation is pretty good. What I found really helpful was the masterclass webinar series.

I launched a Linux-based instance and then installed all the software I needed, like python2.7, pandas, numpy, ipython-pylab, matplotlib, scipy etc. It was interesting to note that on many of the Amazon machine images, the default python version was 2.6, not 2.7. I scouted the web a fair bit for help configuring my instance and am sharing some of the commands below.

General commands to install python 2.7 on AWS. The yum commands apply to Amazon Linux/RedHat instances and the apt-get commands to Ubuntu instances:

Start python to check the installation’s unicode type. If you have to deal with a fair amount of unicode data like I do, then make sure you have the “wide build”. I learned this the hard way.

>>> import sys
>>> print sys.maxunicode

It should NOT be 65535 (that is a narrow build); a wide build reports 1114111.
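The same check can be wrapped in a small script to run right after launching an instance. This is only a sketch: the function name `is_wide_build` is mine, and the two constants are the standard values of `sys.maxunicode` for narrow and wide builds.

```python
import sys

NARROW_BUILD_MAX = 0xFFFF    # 65535: narrow (UCS-2) build
WIDE_BUILD_MAX = 0x10FFFF    # 1114111: wide (UCS-4) build


def is_wide_build(maxunicode=None):
    """Return True when the interpreter is a wide unicode build."""
    if maxunicode is None:
        maxunicode = sys.maxunicode
    return maxunicode > NARROW_BUILD_MAX


if __name__ == "__main__":
    kind = "wide" if is_wide_build() else "narrow"
    print("sys.maxunicode = %d (%s build)" % (sys.maxunicode, kind))
```

The narrow/wide distinction only exists for python 2 and for python 3 before 3.3; later versions always report 1114111.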

# install the AWS command line interface
>>wget https://s3.amazonaws.com/aws-cli/awscli-bundle.zip
>>unzip awscli-bundle.zip
>>sudo ./awscli-bundle/install -i /usr/local/aws -b /usr/local/bin/aws
# install build tools
>>sudo yum install make automake gcc gcc-c++ kernel-devel git-core -y
# install python 2.7 and change default python symlink
>>sudo yum install python27-devel -y
>>sudo rm /usr/bin/python
>>sudo ln -s /usr/bin/python2.7 /usr/bin/python
# yum still needs 2.6, so back up the script and point it at python2.6
>>sudo cp /usr/bin/yum /usr/bin/_yum_before_27
>>sudo sed -i s/python/python2.6/g /usr/bin/yum
# this should now display 2.7.5 or later
>>python
>>sudo yum install httpd
# now install pip for 2.7
>>sudo curl -o /tmp/ez_setup.py https://bitbucket.org/pypa/setuptools/raw/bootstrap/ez_setup.py
>>sudo /usr/bin/python27 /tmp/ez_setup.py
>>sudo /usr/bin/easy_install-2.7 pip
>>sudo pip install virtualenv
# on Ubuntu instances, use apt-get instead of yum
>>sudo apt-get update
>>sudo apt-get install git
# should display current versions
>>pip -V && virtualenv --version
Installing all the python library modules:
sudo pip install ipython
sudo yum install numpy scipy python-matplotlib ipython python-pandas sympy python-nose
sudo yum install xorg-x11-xauth.x86_64 xorg-x11-server-utils.x86_64
sudo pip install pyzmq tornado jinja2
sudo yum groupinstall "Development Tools"
sudo yum install python-devel
sudo pip install matplotlib
sudo pip install networkx
sudo pip install cython
sudo pip install boto
sudo pip install pandas
Some modules could not be installed using pip, so use the following instead (on Ubuntu):
>>sudo apt-get install python-mpi4py python-h5py python-tables python-pandas python-sklearn python-scikits.statsmodels
Note that to install h5py or pytables you must install the following dependencies first:
-numpy
-numexpr
-Cython
-dateutil
HDF5
HDF5 can be installed from source, using wget to fetch it:
>>wget http://www.hdfgroup.org/ftp/HDF5/current/src/hdf5-1.8.9.tar.gz
>>tar xvfz hdf5-1.8.9.tar.gz; cd hdf5-1.8.9
>>./configure --prefix=/usr/local
>>make; sudo make install
Pytables
pip install git+https://github.com/PyTables/PyTables.git@v.3.1.1#egg=tables

**To install h5py, make sure the HDF5 library is on the path.
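After a long install session like this, it helps to confirm that everything actually imports. Here is a minimal sketch; the helper name `check_modules` and the exact module list are my own choices, based on the packages installed above.

```python
import importlib

# Import names for the packages installed above; adjust as needed.
MODULES = ["numpy", "scipy", "matplotlib", "pandas", "networkx",
           "IPython", "boto", "tables", "h5py"]


def check_modules(names):
    """Try to import each module; map its name to a version string or None."""
    report = {}
    for name in names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            report[name] = None  # not installed, or a dependency is missing
            continue
        report[name] = getattr(module, "__version__", "unknown")
    return report


if __name__ == "__main__":
    for name, version in sorted(check_modules(MODULES).items()):
        print("%-12s %s" % (name, version or "MISSING"))
```

Anything reported as MISSING needs another look before you create an image of the machine.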

I really liked the concept of AMIs: you create a machine that has the configuration you want, then create an “image” of it and give it a name. You have then created a virtual machine that you can launch any time you want, in as many copies as you want! What is also great is that when you create the image, all the files that are in your directory at that time are also part of that virtual machine …so it’s almost like you have a snapshot of your environment at that time. So create images often (you can delete old images of course!) and you can launch your environment at any time. Of course you can and should always upload new files to S3, as that is your storage repository.
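Creating images often is easier when it is scripted. The sketch below is not what I ran at the time (I used the console). It assumes a client object that behaves like today's `boto3.client("ec2")`, whose `create_image` call takes `InstanceId`, `Name` and `Description`; the helper name, the name prefix and the timestamp format are hypothetical.

```python
from datetime import datetime, timezone


def snapshot_instance(ec2_client, instance_id, prefix="analysis-env", now=None):
    """Create an AMI from an instance; return (image name, image id).

    ec2_client is assumed to behave like boto3.client("ec2").
    """
    now = now or datetime.now(timezone.utc)
    # A timestamped name keeps frequent images apart.
    name = "%s-%s" % (prefix, now.strftime("%Y%m%d-%H%M%S"))
    response = ec2_client.create_image(
        InstanceId=instance_id,
        Name=name,
        Description="Snapshot of the working environment",
    )
    return name, response["ImageId"]
```

With boto3 installed and credentials configured, this would be called as `snapshot_instance(boto3.client("ec2"), instance_id)`, passing the id of your own instance. Note that creating an image reboots the instance by default.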

Another neat trick I learned: if you install the EC2 (Elastic Compute Cloud) CLI (Command Line Interface) on your local machine and set up your IAM, you don’t need the .pem file!! Even if you are accessing an AMI in a private cloud, make sure you click on “get public IP” when you launch your instance, and log in with ssh -Y ubuntu@ec2-xx.yyy.us-west1.compute.amazonaws.com. You can then enjoy xterm and the graphical display from matplotlib, as if you were running everything on your local terminal. Very cool indeed!

Hierarchical clustering – what does that even mean, in terms of my topic models?

October 6, 2014 | Categories: DataScience | Tags: becoming a data scientist, hierarchical clustering, latent dirichlet allocation, LDA, topic modelling | Author: priyamvadadesai

(continued from Topic Modeling …)

So great, I ran LDA, got 150 topics, and now I wanted to see if one could group these topics together using clustering. How can one go about doing that? Well, as part of the process, LDA basically creates a “vocabulary” consisting of all the words from the corpus. As this number may get unmanageably large, as part of the LDA preprocessing one removes words like a, an, the, if, where (also called stopwords), as they may not really help decide whether a document belongs to a certain topic or not. There are other text learning tricks like stemming and lemmatization that I thought were not necessarily useful in this context, but they can often be useful and help control the size of the vocabulary. Well, the vocabulary from my run contained ~650,000 words, and mallet allows you to output, for every topic, the word counts for all these words! So now you have a representation of all the topics in terms of their “word vectors”! And one can use these word vectors to calculate “distances” between topics.
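To make the idea of a “distance” between topics concrete, here is a small sketch using cosine distance on sparse word-count vectors. Cosine distance is just one reasonable choice that I am assuming for illustration, and the toy counts are invented; they are not from my mallet output.

```python
import math


def cosine_distance(counts_a, counts_b):
    """Cosine distance between two sparse word-count vectors (dicts)."""
    dot = sum(n * counts_b.get(word, 0) for word, n in counts_a.items())
    norm_a = math.sqrt(sum(n * n for n in counts_a.values()))
    norm_b = math.sqrt(sum(n * n for n in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # treat an empty topic as maximally distant
    return 1.0 - dot / (norm_a * norm_b)


# Toy word counts, invented for illustration only.
topics = {
    "topic_a": {"tumor": 40, "cancer": 35, "cell": 10},
    "topic_b": {"cancer": 30, "tumor": 20, "therapy": 15},
    "topic_c": {"soil": 50, "water": 30, "pollution": 20},
}

if __name__ == "__main__":
    print(cosine_distance(topics["topic_a"], topics["topic_b"]))
    print(cosine_distance(topics["topic_a"], topics["topic_c"]))
```

Topics that share many high-count words come out close together, while topics with no words in common get the maximum distance of 1.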

So after some data wrangling and manipulation, I had the topics represented in a numerical matrix, and ran the clustering algorithm on them. There are many variations of hierarchical clustering algorithms, and I tried most of them to see which one seemed the best. I finally went with average linkage, and shown below are some of the branches that clustered together. Instead of showing topic numbers as leaves, we are displaying the word cloud represented by the topic at that leaf:
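For readers who have not met average linkage before, here is a deliberately naive sketch of it in plain python. It is not the code I ran: in practice a library routine such as scipy's `scipy.cluster.hierarchy.linkage` with `method="average"` is the usual route. The function name and the toy distance matrix are made up for illustration.

```python
def average_linkage(dist, n_clusters):
    """Naive agglomerative clustering with average linkage.

    dist is a symmetric matrix (list of lists) of pairwise distances.
    Returns a sorted list of clusters, each a sorted list of item indices.
    """
    n_clusters = max(1, n_clusters)
    clusters = [[i] for i in range(len(dist))]

    def avg(c1, c2):
        # Average linkage: mean distance over all cross-cluster pairs.
        total = sum(dist[i][j] for i in c1 for j in c2)
        return total / float(len(c1) * len(c2))

    while len(clusters) > n_clusters:
        # Find the closest pair of clusters and merge it.
        pairs = ((x, y) for x in range(len(clusters))
                 for y in range(x + 1, len(clusters)))
        a, b = min(pairs, key=lambda p: avg(clusters[p[0]], clusters[p[1]]))
        merged = sorted(clusters[a] + clusters[b])
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return sorted(clusters)


# Toy distances between four topics: 0 and 1 are close, 2 and 3 are close.
toy = [
    [0.0, 0.1, 0.9, 0.8],
    [0.1, 0.0, 0.85, 0.9],
    [0.9, 0.85, 0.0, 0.2],
    [0.8, 0.9, 0.2, 0.0],
]

if __name__ == "__main__":
    print(average_linkage(toy, 2))  # [[0, 1], [2, 3]]
```

Each merge joins the two clusters whose members are closest on average. The sequence of merges is what a dendrogram draws as branches.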

The figure on the left could be thought to represent a neuroimaging cluster and the figure on the right could be thought to represent disease and trauma. These images are courtesy of Natalie Tellis. Thanks Natalie!!

Unfortunately, we had to do a significant amount of manual curation, as some of the clusters didn’t make sense the way we humanly think of these topics …though algorithmically speaking they probably were “sisters”. We wound up having twenty supertopics or umbrella topics which contained the topics that LDA had produced. The naming of topics was done manually and was strongly influenced by the top most words for that topic.

For example, the supertopics called “Genetics and Genomics” and “cellBiology” have the following subtopics:

Topic Modeling- continued…

October 5, 2014 | Categories: DataScience | Tags: latent dirichlet allocation, LDA, mallet, topic modelling | Author: priyamvadadesai

So continuing with Topic Modeling…(see earlier post)

Well, the time had come to confront pubmed, the real data I was going to work with. To start with, I decided to use only 2013 pubmed data to see if I could run LDA on it and get out meaningful topics. What do I mean by pubmed data? As I explained earlier, pubmed is a repository containing almost *all* research literature pertaining to the biomedical field. Since it is maintained and funded by NIH, we as the taxpayers can access or scrape data from it! The only caveat is that only a small subset of papers have the entire text, and they are housed in what is called Pubmed Central. For the rest of the data we can access things like: title, abstract, keywords (if any), journal name, journal ISSN number, date of publication, date created, date last modified etc.

In preparing the text corpus for LDA, we decided to use only the title and abstract. So, for all the records published in 2013, I parsed out the title and abstract, and created a text corpus containing one record per line, with the pmid as the record identifier. It looked something like this:
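As a rough sketch of how such a one-record-per-line file could be produced: the tab separator, the function name `corpus_line` and the sample record below are my assumptions for illustration, not the exact format or data I used.

```python
def corpus_line(pmid, title, abstract):
    """Join one record into a single line: pmid, a tab, then the text."""
    # Collapse newlines and repeated spaces so the record stays on one line.
    text = " ".join((title + " " + abstract).split())
    return "%s\t%s" % (pmid, text)


if __name__ == "__main__":
    # Hypothetical record, invented for illustration only.
    print(corpus_line("00000001", "An example title",
                      "First sentence of the abstract.\nSecond sentence."))
```

Whatever the layout, the text must stay on one line per record with the pmid kept alongside it, so that topic assignments can be traced back to papers later.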

The real challenge for LDA is figuring out how many topics or categories you should try to divide the corpus into. I played around with K=10,12,14…500. It seemed to me that the larger the K, the more fine-grained my topics. But was there an “intrinsic” number of topics that pubmed was naturally divided into? Remember that each paper or record has a non-zero probability of belonging to every topic: it’s a mixture of topics. So we could think of each “topic” as a dimension, and of each paper as a point in this K-dimensional space. Intuitively I felt that if we went with a really large K, we would get a really high resolution among topics (which we could think of as subtopics), and then if we ran hierarchical clustering on this large number of topics, we could “cluster” similar topics, thus naturally forming the “supertopics”. I was excited. The challenge was that as K grew large, so did my job of making sure that all the topics indeed made sense.

The road was quite bumpy. For example, I initially included keywords along with the abstracts in the text corpus, but found that keywords were present in only about 30% of the papers! Earlier I had thought that perhaps the keywords could make up the main “vocabulary” of the corpus and be used to describe it, but that did not seem to be the case. Furthermore, in many cases the keywords tended to be names of particular chemical compounds, which could not really “describe” the paper. I also had to check whether the topics made sense. One way to do this was, for a given value of K, to look at the top 20 words of a topic and see if the words seemed to point to a coherent topic. If so, I would pull out all the papers that had a high probability of being assigned to that topic (say >0.7) and look at them.
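That sanity check is easy to express in code. The sketch below assumes the model output has already been loaded into plain python structures, a dict of word counts per topic and a dict of per-paper topic probabilities. Both function names and the toy values are hypothetical.

```python
def top_words(word_counts, n=20):
    """Return the n most frequent words of one topic."""
    ranked = sorted(word_counts.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:n]]


def papers_for_topic(doc_topics, topic, threshold=0.7):
    """Return the pmids whose probability for `topic` exceeds the threshold."""
    return sorted(pmid for pmid, probs in doc_topics.items()
                  if probs[topic] > threshold)


# Toy data, invented for illustration: two topics, three papers.
word_counts = {"tumor": 9, "cancer": 5, "cell": 5, "the": 1}
doc_topics = {"111": [0.8, 0.2], "222": [0.3, 0.7], "333": [0.75, 0.25]}

if __name__ == "__main__":
    print(top_words(word_counts, n=3))
    print(papers_for_topic(doc_topics, topic=0))
```

Reading the titles of the papers returned for a topic next to its top words is a quick way to judge whether the topic is coherent.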

To see how good my topic modeling really was, I decided to ask the inverse question: if I picked specific journals (that I knew were represented in the corpus), pulled out all the papers from those journals, and summed the topic probabilities of those papers, I would get a “topic distribution” for that set of journals. What did that look like? I decided to pick specialized journals like “Cancer”, “Oncology Letters”, “Oncoimmunology” and “OncoTargets” (which could represent a specific topic, “cancer”) and The Science of the total environment, Environmental pollution, Environmental toxicology and pharmacology, and Environmental toxicology and chemistry, which could be thought to represent “environmental science”.
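The summing step can be sketched as follows. The record layout (a dict with "journal" and "topics" keys), the function name and the toy probabilities are assumptions for illustration, and the optional normalization is my addition; the analysis described here only sums.

```python
def journal_topic_distribution(papers, journals, normalize=False):
    """Sum per-paper topic probabilities over the chosen journals."""
    totals = None
    for paper in papers:
        if paper["journal"] not in journals:
            continue
        if totals is None:
            totals = [0.0] * len(paper["topics"])
        for k, p in enumerate(paper["topics"]):
            totals[k] += p
    if totals is None:
        return []  # none of the papers came from the chosen journals
    if normalize:
        s = sum(totals)
        totals = [t / s for t in totals]
    return totals


# Toy records, invented for illustration: two topics only.
papers = [
    {"journal": "Cancer", "topics": [0.8, 0.2]},
    {"journal": "Cancer", "topics": [0.6, 0.4]},
    {"journal": "Environmental pollution", "topics": [0.1, 0.9]},
]

if __name__ == "__main__":
    print(journal_topic_distribution(papers, {"Cancer"}))
```

Plotting the returned list as a bar chart, with one bar per topic, gives a topic distribution for that set of journals.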

Figure A below is the topic distribution for the journal “Cancer”. As you can see, topic 13 has a very high representation. Figure B shows the topic distribution if papers published in “Oncology Letters”, “Oncoimmunology” and “OncoTargets” are included. You can see that topic 13 continues to have a high representation, but the representation for topic 3 has also gone up.

Figure A.

Figure B.

So what are topic 13 and topic 3? Here are word cloud representations of their top most words:

topic 13 topic 3

These word clouds make sense given that the journals were Cancer, Oncology letters etc; the high representation of topic 13 is very heartening.

Figure C below is the topic distribution for the journal The Science of the total environment, and Figure D is the topic distribution for papers from the journals The Science of the total environment, Environmental pollution, Environmental toxicology and pharmacology, and Environmental toxicology and chemistry. Topics 4 and 2 are the most dominant.

Figure C.

Figure D.

And here is the word cloud for topic 4:

As you can see, it doesn’t conjure up the topic “environmental science”. But then the question is: is the value of K too small to “resolve” the topic “environmental science”? We can look at what happens when K is larger. Figure E shows the distribution of the same papers as included in Figure D, but assuming that we have run LDA with K=50 on them.

As you can see, it is topic 15 that dominates. Here is the word cloud for topic 15, and this looks much more like environmental science!!

So what value of K (i.e. the number of topics) should we go with? After conducting many experiments like these, we decided to go with K=150. Yes, K=150 is a large number, but the thinking was that we could run hierarchical clustering on the topics themselves and see how they clustered together; each cluster could then be considered a “supertopic” or category, and the topics contained in it would be the finer classification. On the other hand, 150 was manageable in case we needed to perform some manual curation.

MyJourneyAsaDataScientist

About eighteen months ago I decided to leave astronomy, change my career trajectory and follow the Data Science Bandwagon- this is a blog about that ongoing journey…
