Category Archives: DataScience

Learning Python, DataScience – some great resources.

There are some AMAZING resources online for learning Python and Data Science in general. Here is an incomplete list of sites/pages I like to go to. I have started using Pandas/SciPy a lot and love it. I plan to keep adding to this page.


Data Science     

Using R



Topic Modeling: LDA… LSI.. Oh My!

As I explored text mining and machine learning, I learned that there is an area of research called Topic Modeling. A topic model is a type of statistical model for discovering the abstract “topics” that occur in a collection of documents. A topic modeling tool takes a collection of unstructured texts and looks for patterns in the use of words; a “topic” consists of a cluster of words that frequently occur together. Using contextual clues, topic models can connect words with similar meanings and distinguish between uses of words with multiple meanings. Note that the algorithm has no knowledge of the semantic meaning of the words. It's an exercise in statistical probability. As a result, every piece of text has some non-zero probability of belonging to every topic. Naively, you say text A belongs to topic XX if it has a high probability of belonging to XX. A piece of text can also belong to multiple topics. Pretty cool!
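To make the "naive" assignment rule concrete, here is a minimal Python sketch. The probabilities are made up for illustration, and the helper names (`best_topic`, `topics_above`) and the 0.3 threshold are my own assumptions, not part of any particular topic modeling tool.

```python
# Hypothetical per-document topic probabilities, as a topic model might report them.
# Every topic gets a non-zero share; the numbers are invented for illustration.
doc_topic_probs = {
    "text_A": [0.90, 0.06, 0.04],
    "text_B": [0.48, 0.47, 0.05],
}


def best_topic(probs):
    """Return (topic_id, probability) for the most probable topic."""
    topic_id = max(range(len(probs)), key=lambda t: probs[t])
    return topic_id, probs[topic_id]


def topics_above(probs, threshold=0.3):
    """Return every topic id whose probability is at least `threshold`."""
    return [t for t, p in enumerate(probs) if p >= threshold]


for name, probs in doc_topic_probs.items():
    print(name, best_topic(probs), topics_above(probs))
```

Here text_A sits clearly in one topic, while text_B is split almost evenly between two, which is the "belongs to multiple topics" case.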

I couldn’t wait to get my hands dirty. In their first incarnation (~1998), topic models were called Latent Semantic Indexing (LSI). Today, the most popular topic model is Latent Dirichlet Allocation (LDA). You can find links to many more resources on the site of David Blei, one of the best-known experts in topic modeling. I also learned a lot from digital humanities sites like the Programming Historian and Topic Modelling for Humanists. (At the time of writing this post, about ten months after I was learning about this, LDA has become the new rage and there are many more implementations of it with easy interfaces.)

I played with a bunch of freely available LDA implementations including Gensim, MALLET, and lda-c. I found MALLET the easiest to use, and it gave me the most sensible results.

For my experiment, I went into PubMed and searched for documents with the following terms in the abstract or title. I tried to pick terms that (I thought) were distinct. My hypothesis was that if I took the abstracts and titles of the documents retrieved this way, combined them into a randomly ordered list, fed it to LDA, and specified the number of “topics” I wanted, I could get back my classified list.
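The setup can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the toy texts stand in for real PubMed titles and abstracts, and `build_shuffled_corpus` is a hypothetical helper, not something from MALLET or Gensim.

```python
import random

# Toy stand-ins for the PubMed downloads: search label -> list of "title + abstract" strings.
# Both the labels and the texts are placeholders, not real records.
docs_by_query = {
    "microfluidics": ["droplet flow on a chip", "microfluidic cell sorting device"],
    "typhoid": ["salmonella typhi vaccine trial", "typhoid fever diagnosis"],
}


def build_shuffled_corpus(docs_by_query, seed=0):
    """Pool all documents, shuffle them, and keep the original labels aside."""
    pooled = [(label, text) for label, docs in docs_by_query.items() for text in docs]
    random.Random(seed).shuffle(pooled)
    true_labels = [label for label, _ in pooled]  # held back for scoring later
    texts = [text for _, text in pooled]          # the only thing the model sees
    return texts, true_labels


texts, true_labels = build_shuffled_corpus(docs_by_query)
num_topics = len(docs_by_query)  # one requested topic per search query
```

The topic model only ever sees `texts` and `num_topics`; `true_labels` stays aside so the output can be checked against the original searches afterwards.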

      Terms in abstract or title                             # of documents

  1.  male breast cancer[Title/Abstract]                        811
  2.  gluten free diet[Title/Abstract]                         2790
  3.  childhood schizophrenia[Title/Abstract]                  1683
  4.  microfluidics[Title/Abstract]                            2279
  5.  high protein diet[Title/Abstract]                        1340
  6.  (malaria[Title/Abstract]) AND india[Title/Abstract]      1178
  7.  chromatin associated[Title/Abstract]                     1138
  8.  juvenile diabetes[Title/Abstract]                         621
  9.  typhoid fever[Title/Abstract]                            2414
  10. bioethics                                                2818

The results were quite interesting.  Here are the top 20 words of the topics that were returned:

0        microfluidics microfluidic cell chemistry methods instrumentation analysis techniques based cells flow detection analytical high chip surface systems system devices

1        typhoid fever salmonella typhi bacterial patients drug immunology humans blood vaccines infections diagnosis infection vaccine therapy test treatment strains

2        schizophrenia disorders childhood child diagnosis disorder children humans adult studies risk onset adolescent patients age psychology male factors female

3        malaria india epidemiology health falciparum control diseases population plasmodium disease humans cases countries vivax study drug species incidence water

4        proteins dna chromatin genetics protein metabolism cells cell gene expression binding sequence nuclear transcription genes histone rna molecular genetic

5        bioethics ethics health research medical care humans ethical human issues patient social public professional moral approach life medicine clinical

6        protein diet high rats metabolism dietary blood effects fed animals low administration weight proteins dosage increased body intake activity

7       breast cancer male neoplasms aged brca genetics patients female risk humans genetic carcinoma disease mutations cases factors mutation analysis

8        diabetes complications patients humans disease blood mellitus type etiology female diseases diagnosis male insulin adult juvenile patient child therapy

9        disease celiac gluten patients diet free coeliac cd immunology humans intestinal diagnosis antibodies adult aged blood female child children

Pretty cool, eh? The top words seem to indicate that the topics were pretty well categorized. But how well were the documents really classified? To explore this further, I took all the documents that had an original topic id of, say, “cancer”, and plotted the maximum probability predicted for each document. The color indicates the topic that probability belongs to. For example, the plots below show all the documents that were originally identified as cancer and chromatin.

[Plots: docs identified as cancer topics; docs identified as chromatin topics]

All the dots in a plot represent papers that were originally selected for that topic, so they carry the true label. The color of a dot represents the topic the algorithm thinks the paper belongs to. Comparing with the legend, the fact that the majority of the dots in the chromatin plot are peach and in the cancer plot are olive green suggests a pretty high accuracy rate. In fact:

# of incorrectly classified cancer docs = 16       out of 614     ~2.6%
# of incorrectly classified chromatin docs = 27    out of 876     ~3.0%
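One way counts like these could be produced is sketched below. It is an assumption on my part, not a record of the exact procedure: since LDA topic ids are arbitrary, the sketch treats whichever topic wins most often within a label group as the "correct" one and counts everything else as incorrectly classified. The input pairs are invented.

```python
from collections import Counter

# Hypothetical input: for each document, its original search label and the
# topic id with the highest predicted probability. The values are invented.
predictions = [
    ("cancer", 7), ("cancer", 7), ("cancer", 4), ("cancer", 7),
    ("chromatin", 4), ("chromatin", 4), ("chromatin", 7),
]


def misclassified_per_label(predictions):
    """Map label -> (wrong, total), treating the label's majority topic as correct."""
    topics_by_label = {}
    for label, topic in predictions:
        topics_by_label.setdefault(label, []).append(topic)

    result = {}
    for label, topics in topics_by_label.items():
        _, majority_count = Counter(topics).most_common(1)[0]
        result[label] = (len(topics) - majority_count, len(topics))
    return result


for label, (wrong, total) in misclassified_per_label(predictions).items():
    print(f"{label}: {wrong} out of {total} ({100 * wrong / total:.1f}%)")
```

A majority vote is the simplest mapping from topic ids to labels; it would mislead if two label groups ended up dominated by the same topic.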

Unfortunately, things don’t look so good for all the topics. Here are the plots for malaria and typhoid:


As you can see, the accuracy here is significantly lower:

# of incorrectly classified malaria docs = 166     out of 853     ~19.46%
# of incorrectly classified typhoid docs = 734     out of 1841    ~39.87%

Here are the numbers for the rest of the topics:

# of incorrectly classified gluten docs = 279           out of 2094    ~13.3%
# of incorrectly classified schizophrenia docs = 61     out of 1327    ~4.6%
# of incorrectly classified southbeach docs = 94        out of 1027    ~9.1%    (southbeach is the high protein diet set)
# of incorrectly classified microfluidics docs = 67     out of 1342    ~4.9%
# of incorrectly classified diabetes docs = 127         out of 489     ~25.9%
# of incorrectly classified bioethics docs = 69         out of 2166    ~3.18%

So the overall error in prediction is 1640 / 12629 = ~12.98%. While this is a good start, it would be worth exploring what could be done to bring the error rate down.
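The overall figure is just the sum of the per-topic counts reported above. A short Python check, using only those reported numbers (the dictionary layout and variable names are mine):

```python
# (incorrectly classified, total) per set, copied from the counts reported above.
counts = {
    "cancer": (16, 614),
    "chromatin": (27, 876),
    "malaria": (166, 853),
    "typhoid": (734, 1841),
    "gluten": (279, 2094),
    "schizophrenia": (61, 1327),
    "southbeach": (94, 1027),
    "microfluidics": (67, 1342),
    "diabetes": (127, 489),
    "bioethics": (69, 2166),
}

total_wrong = sum(wrong for wrong, _ in counts.values())
total_docs = sum(total for _, total in counts.values())
overall_error = 100 * total_wrong / total_docs


def error_rate(item):
    """Fraction of incorrectly classified docs for one (name, (wrong, total)) item."""
    wrong, total = item[1]
    return wrong / total


# Per-set error rates, worst first.
for name, (wrong, total) in sorted(counts.items(), key=error_rate, reverse=True):
    print(f"{name:14s} {wrong:4d} / {total:5d} = {100 * wrong / total:5.2f}%")
print(f"overall        {total_wrong} / {total_docs} = {overall_error:.2f}%")
```

Sorting worst-first makes it obvious that typhoid, diabetes and malaria account for most of the errors, so those are the sets to look at first.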

Making of a recommendation system -continued..

While it's been a few months since we started working on this project, I am hoping to document its algorithm development. The main vision for this project was to have a website for biomedical literature that would recommend new and recent articles to you based on your past browsing history, sort of like a Netflix or Amazon for biomedical literature! We have been inspired by sites such as Goodreads and reddit. As our ideas developed, we decided that what we wanted was a place where like-minded researchers could go, upload their past/present publications or research interests, and come back daily to see an updated “recommendation list” of new articles in their area. As ideas went, we thought it would be great to have a “virtual coffee shop” where you could post comments on articles and up-vote or down-vote them; basically a place to hang out with like-minded people!

PubMed, hosted and maintained by the NIH, is the go-to repository for biomedical literature, and our first task was to access all of it and make sense of its corpus of twenty million articles! As the algorithms person, my first job was to figure out how to classify all that literature. I had no experience with anything like that. Zero. Zilch. Nada. I wasn’t quite sure how people did it, so I started reading up on how people make catalogs, classify objects and organize libraries. Was there a way to automate that? Turns out it's a really hot area of research and people have been doing some really interesting and cool stuff. There is a whole area of computer science dedicated to it: Text Mining and Natural Language Processing!

The making of a recommendation system-

About eighteen months ago I decided to leave astronomy and follow the Data Science bandwagon; this is a blog about that journey. I spent a few months taking Data Science courses on Coursera and Udacity and was fortunate enough to become part of a project to build a “recommender system for Biomedical Literature”.

Some background: it turns out that the biomedical field is growing so rapidly that it is getting really difficult to keep up with the literature. For newcomers to the field, it's hard to figure out which research papers to read and where to start, as a few thousand articles are published daily and new open-source journals are popping up regularly. For veterans, it's hard to keep up, and there are not enough hours in the day to scan through the articles and figure out what is relevant, new and exciting in their area of research. This is true not just for academic researchers but also for those in the related fields of medicine and bioinformatics. Here is a recent plot I made of the number of papers per month uploaded to PubMed (a popular biomedical research literature repository). As you can see, there are about 92,000 new publications a month…


I have been working mainly on the algorithm design and development for this project and my intention with this blog is to focus on that and my growth as a data scientist.