Tag Archives: becoming a data scientist

Data Science Skills checklist:

What to learn, in what order?

As I try to re-inspire myself to be regular about blogging – and documenting what I learn …

More really good resources:

Moving to AWS…

As the project grew, we started downloading tweets from various journal websites and tried to set up an algorithm to parse tweets that were related to particular papers and link them to the paper, thereby producing another “metric” to compare papers by. In addition, we started keeping detailed records of all the authors of the papers and attempted to create a citation database. As complexity grew, we found that our local server was too slow and, after some research, decided to take the plunge and move our stuff to AWS (Amazon Web Services). We created a VPC (Virtual Private Cloud), moved our database to Amazon’s RDS (Relational Database Service) and created buckets, or storage, on Amazon’s S3 (Simple Storage Service). It’s relatively easy to do and the AWS documentation is pretty good. What I found really helpful were the masterclass webinar series.

I launched a Linux-based instance and then installed all the software versions I needed, like python2.7, pandas, numpy, ipython-pylab, matplotlib, scipy etc. It was interesting to note that on many of the Amazon machines, the default python version loaded was 2.6, not 2.7. I scouted the web a fair bit to help me configure my instance and am sharing some of the commands below.

General commands to install python 2.7 on AWS; these should work on most instances running Ubuntu/RedHat Linux (the yum commands are for RedHat-style instances, the apt-get commands for Ubuntu):

Start python to check the installation unicode type. If you have to deal with a fair amount of unicode data like I do, then make sure you have the “wide build”. I learned this the hard way.

  • >>import sys  
  • >>print sys.maxunicode

It should NOT be 65535 (that is a narrow build); a wide build reports 1114111.
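The same check can be wrapped in a small function, which is handy if you want a script to refuse to run on a narrow build. This is just a minimal sketch: `is_wide_build` and `NARROW_MAX` are names made up for this example. Note that Python 3.3 and later always report 1114111, so the distinction only matters on Python 2.

```python
import sys

NARROW_MAX = 0xFFFF  # 65535: the value sys.maxunicode takes on a narrow build


def is_wide_build(maxunicode=None):
    """Return True if sys.maxunicode (or the value passed in) exceeds 0xFFFF."""
    if maxunicode is None:
        maxunicode = sys.maxunicode
    return maxunicode > NARROW_MAX


if __name__ == "__main__":
    print("sys.maxunicode = %d, wide build: %s" % (sys.maxunicode, is_wide_build()))
```

The optional argument is only there so the function can be tried with both values without needing two different interpreters.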

  • >>wget https://s3.amazonaws.com/aws-cli/awscli-bundle.zip  
  •   >> unzip awscli-bundle.zip          
  • >> sudo ./awscli-bundle/install -i /usr/local/aws -b /usr/local/bin/aws
  • # install build tools
  • >>sudo yum install make automake gcc gcc-c++ kernel-devel git-core -y
  • # install python 2.7 and change default python symlink
  • >>sudo yum install python27-devel -y          
  • >>sudo rm /usr/bin/python  
  • >>sudo ln -s /usr/bin/python2.7 /usr/bin/python
  • # yum still needs 2.6, so back up the script and write python2.6 into it
  • >>sudo cp /usr/bin/yum /usr/bin/_yum_before_27
  • >>sudo sed -i s/python/python2.6/g /usr/bin/yum
  • # this should now display 2.7.5 or later:
  • >>python
  • >>sudo yum install httpd
  • # now install pip for 2.7
  • >>sudo curl -o /tmp/ez_setup.py https://bitbucket.org/pypa/setuptools/raw/bootstrap/ez_setup.py
  • >>sudo /usr/bin/python27 /tmp/ez_setup.py
  • >>sudo /usr/bin/easy_install-2.7 pip  
  • >>sudo pip install virtualenv
  • >>sudo apt-get update    
  •  >>sudo apt-get install git
  • # should display current versions:
  • >>pip -V && virtualenv --version
  • Installing all the python library modules:
  • sudo pip install ipython
  • sudo yum install numpy scipy python-matplotlib ipython python-pandas sympy python-nose
  • sudo yum install xorg-x11-xauth.x86_64 xorg-x11-server-utils.x86_64
  • sudo pip install pyzmq tornado jinja2
  • sudo yum groupinstall "Development Tools"
  • sudo yum install python-devel
  • sudo pip install matplotlib
  • sudo pip install networkx
  • sudo pip install cython
  • sudo pip install boto
  • sudo pip install pandas                    
  • Some modules could not be loaded using pip, so use the following instead:
  • >>sudo apt-get install python-mpi4py python-h5py python-tables python-pandas python-sklearn python-scikits.statsmodels
  • Note that to install h5py or pytables you must install the following dependencies first:
  • -numpy
  • -numexpr
  • -Cython
  • -dateutil
  • -HDF5
  • HDF5 can be installed using wget:
  • >>wget http://www.hdfgroup.org/ftp/HDF5/current/src/hdf5-1.8.9.tar.gz
  • >>tar xvfz hdf5-1.8.9.tar.gz; cd hdf5-1.8.9
  • >>./configure --prefix=/usr/local
  • >>make; make install
  • Pytables
  • pip install git+https://github.com/PyTables/PyTables.git@v.3.1.1#egg=tables

**To install h5py, make sure HDF5 is in the path.
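After a long install session like this, a quick sanity check is to try importing each library and print the version that actually got picked up. The snippet below is only a sketch: `report_versions` is a helper name invented here, and the module list simply mirrors the packages mentioned above.

```python
import importlib


def report_versions(module_names):
    """Map each module name to its version string, or None if it fails to import."""
    versions = {}
    for name in module_names:
        try:
            module = importlib.import_module(name)
        except Exception:
            # A missing module raises ImportError; a broken binary dependency
            # (for example a missing HDF5 library) can raise other errors too.
            versions[name] = None
            continue
        versions[name] = getattr(module, "__version__", "unknown")
    return versions


if __name__ == "__main__":
    wanted = ["numpy", "scipy", "pandas", "matplotlib",
              "networkx", "boto", "h5py", "tables"]
    for name, version in sorted(report_versions(wanted).items()):
        print("%-12s %s" % (name, version if version else "NOT INSTALLED"))
```

Anything listed as NOT INSTALLED is a candidate for the apt-get or source-build route described above.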

I really liked the concept of AMIs: you create a machine that has the configuration that you want and then create an “image” and give it a name. You have then created a virtual machine that you can launch anytime you want, in as many copies as you want! What is also great is that when you create the image, all the files that may be in your directory at that time are also part of that virtual machine …so it’s almost like you have a snapshot of your environment at the time. So create images often (you can delete old images, of course!) and you can launch your environment at any time. Of course you can and should always upload new files to S3, as that is your storage repository.

Another neat trick was to learn that if you install the EC2 (Elastic Compute Cloud) CLI (Command Line Interface) on your local machine and set up your IAM, you don’t need the .pem file!! Even if you are accessing an AMI in a private cloud, make sure you click on “get public IP” when you launch your instance, and log in as ssh -Y ubuntu@ec2-xx.yyy.us-west1.compute.amazonaws.com. You can then enjoy xterm and the graphical display from matplotlib, as if you are running everything on your local terminal. Very cool indeed!

Hierarchical clustering – what does that even mean, in terms of my topic models?

(continued from Topic Modeling …)

So great, I ran LDA, got 150 topics, and now I wanted to see if one could group these topics together using clustering. How can one go about doing that? Well, as part of the process, LDA basically creates a “vocabulary” consisting of all the words from the corpus. As this number may get unmanageably large, as part of the LDA preprocessing one removes words like a, an, the, if, where (also called stopwords), as they may not really help decide whether a document belongs to a certain topic or not. There are other text learning tricks like stemming and lemmatization that I thought were not necessarily useful in this context, but can often be useful and help control the size of the vocabulary. Well, the vocabulary from my run contained ~650,000 words and MALLET allows you to output, for every topic, the word counts for all these words! So now you have a representation of all the topics in terms of their “word vectors”! And one can use these word vectors to calculate “distances” between topics.
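To make the idea of a “distance” between topics concrete, here is a minimal sketch using cosine distance on word-count vectors. Cosine distance is an assumption picked for illustration, not necessarily the measure used in the project, and the topic names and counts are invented toy values rather than output from the actual MALLET run.

```python
import math


def cosine_distance(counts_a, counts_b):
    """Cosine distance (1 - cosine similarity) between two sparse word-count dicts."""
    shared = set(counts_a) & set(counts_b)
    dot = sum(counts_a[w] * counts_b[w] for w in shared)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("a topic with no word counts has no direction")
    return 1.0 - dot / (norm_a * norm_b)


# Toy topics: each maps a word to its count within that topic (invented numbers).
topics = {
    "imaging": {"mri": 40, "brain": 30, "cortex": 10},
    "neuro": {"brain": 25, "cortex": 20, "neuron": 15},
    "genetics": {"gene": 50, "dna": 30, "mutation": 10},
}

if __name__ == "__main__":
    names = sorted(topics)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            print("%s vs %s: %.3f" % (a, b, cosine_distance(topics[a], topics[b])))
```

Topics that share heavily weighted words end up close together, while topics with no words in common sit at the maximum distance of 1. Storing only the non-zero counts in a dict keeps this workable even when the full vocabulary is very large.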

So after some data wrangling and manipulation, I had the topics represented in a numerical matrix, and ran the clustering algorithm on them. There are many variations of hierarchical clustering algorithms, and I tried most of them to see which one seemed the best. I finally went with average linkage, and shown below are some of the branches that clustered together. Instead of showing topic numbers as leaves, we are displaying the word cloud represented by the topic at that leaf:
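For readers who want to try this, average-linkage clustering of a topic-by-word matrix can be sketched with scipy as below. This is an illustration under assumptions, not the project's actual code: the matrix is a tiny invented example, and the cosine metric and the cut into two clusters are choices made only for the demo.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Toy topic-by-word count matrix (rows = topics, columns = vocabulary words).
# The numbers are invented for illustration.
topic_word_counts = np.array([
    [40, 30, 10, 0, 0, 0],   # topic 0: imaging-like
    [35, 25, 20, 0, 0, 0],   # topic 1: imaging-like
    [0, 0, 0, 50, 30, 10],   # topic 2: genetics-like
    [0, 0, 5, 45, 35, 15],   # topic 3: genetics-like
], dtype=float)

# Average linkage: the distance between two clusters is the mean of all
# pairwise distances between their members. The cosine metric is an assumption.
tree = linkage(topic_word_counts, method="average", metric="cosine")

# Cut the tree into two flat clusters (the "supertopics" of this toy example).
labels = fcluster(tree, t=2, criterion="maxclust")
print(labels)
```

Swapping `method="average"` for `"single"`, `"complete"` or `"ward"` is how one would compare the different linkage variants (Ward assumes Euclidean distances), and `scipy.cluster.hierarchy.dendrogram(tree)` draws the resulting tree.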

The figure on the left could be thought to represent a neuroimaging cluster and the figure on the right could be thought to represent disease and trauma. These images are courtesy of Natalie Tellis. Thanks Natalie!!


[Images: disease_trauma, research]

Unfortunately, we had to do a significant amount of manual curation, as some of the clusters didn’t make sense the way we humanly think of these topics …though algorithmically speaking they probably were “sisters”. We wound up having twenty supertopics, or umbrella topics, which contained the topics that LDA had produced. The naming of topics was done manually and was strongly influenced by the topmost words for that topic.

For example, the supertopics called “Genetics and Genomics” and “cellBiology” have the following subtopics:


Making of a recommendation system -continued..

While it’s been a few months since we started working on this project, I am hoping to document its algorithm development. The main vision for this project was to have a website for biomedical literature, which would be able to recommend new and recent articles to you based on your past browsing history, sort of like a Netflix or Amazon for biomedical literature! We have been inspired by sites such as Goodreads and reddit. As our ideas developed, we decided that what we wanted was a place where like-minded researchers could go, upload their past/present publications or research interests and come back daily to see an updated “recommendation list” of new articles in their area. As ideas went, we thought it would be great to have a “virtual coffee shop”, where you could post comments on articles and up-vote or down-vote articles; basically a place to hang out with like-minded people!

PubMed, hosted and maintained by NIH, is the go-to repository for biomed literature, and our first task was to be able to access all of that and make sense of their corpus of twenty million articles!! As the algorithms person, my first job was to figure out how to classify all that literature. I had no experience with anything like that: Zero, Zilch, Nada. I wasn’t quite sure how people did that and I started wondering about and reading up on how people made catalogs, classified objects and organized libraries. Was there a way to automate that? Turns out it’s a really hot area of research and people have been doing some really interesting and cool stuff. There was a whole area of computer science dedicated to that: Text Mining and Natural Language Processing!!

The making of a recommendation system-

About eighteen months ago I decided to leave astronomy and follow the Data Science Bandwagon; this is a blog about that journey. I spent a few months studying Data Science courses on Coursera and Udacity and was fortunate enough to become part of a project to build a “recommender system for Biomedical Literature”.

Some background: Turns out that the biomedical field is growing so rapidly that it is getting really difficult to keep up with the literature. For newcomers to the field, it’s hard to figure out what research papers to read and where to start, as a few thousand articles are published daily and new/open source journals are popping up regularly. For veterans, it’s hard to keep up, and there are not enough hours in the day to scan through the articles to figure out what is relevant, new and exciting in their area of research. This is true not just for academic researchers but also for those in the related fields of medicine and bioinformatics. Here is a recent plot I made of the number of papers/month uploaded to PubMed (a popular biomedical research literature repository). As you can see, there are ~92,000 new publications a month…


I have been working mainly on the algorithm design and development for this project and my intention with this blog is to focus on that and my growth as a data scientist.