
Topic Modeling: LDA… LSI.. Oh My!

As I explored text mining and machine learning, I learned that there is an area of research called topic modeling. A topic model is a type of statistical model for discovering the abstract “topics” that occur in a collection of documents. A topic modeling tool takes a collection of unstructured texts and looks for patterns in the use of words; a “topic” consists of a cluster of words that frequently occur together. Using contextual clues, topic models can connect words with similar meanings and distinguish between uses of words with multiple meanings. Note that the algorithm has no knowledge of the semantic meaning of the words. It’s an exercise in statistical probability. As a result, every piece of text has some non-zero probability of belonging to every topic. Naively, you say text A belongs to topic XX if it has a high probability of belonging to XX. By the same token, a piece of text could belong to multiple topics. Pretty cool!
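To make the "probability of belonging to every topic" idea concrete, here is a minimal sketch. The probabilities are made up for illustration, and the 0.3 cut-off for "belongs to multiple topics" is an arbitrary assumption, not something a topic model prescribes.

```python
# Hypothetical per-document topic probabilities (invented numbers).
# Each row sums to 1 and every entry is non-zero.
doc_topics = {
    "text_A": [0.82, 0.10, 0.08],
    "text_B": [0.45, 0.44, 0.11],
}


def best_topic(probs):
    """Index of the most probable topic (the naive assignment)."""
    return max(range(len(probs)), key=lambda k: probs[k])


def topics_above(probs, threshold=0.3):
    """All topics whose probability reaches the (assumed) threshold."""
    return [k for k, p in enumerate(probs) if p >= threshold]


for name, probs in doc_topics.items():
    print(name, "best:", best_topic(probs), "above threshold:", topics_above(probs))
```

In this toy data, text_A sits clearly in one topic, while text_B is split almost evenly between two, which is the "multiple topics" case.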

I couldn’t wait to get my hands dirty. In their first incarnation (~1998), topic models were called Latent Semantic Indexing (LSI). Today, the most popular topic model is Latent Dirichlet Allocation (LDA). You can find links to many more resources on the site of David Blei, one of the best-known experts in topic modeling. I also learned a lot from some digital humanities sites like the programming historian and topic modelling for humanists. (At the time of writing this post, roughly ten months after I was learning about this, LDA has become the new rage and there are many more implementations of it with easy interfaces.)

I played with a bunch of freely available LDA implementations, including Gensim, MALLET, and lda-c. I found MALLET the easiest to use, and it gave me the most sensible results.

For my experiment, I went into PubMed and searched for documents with the following terms in the abstract or title. I tried to pick terms that (I thought) were distinct. My hypothesis was that if I pulled out the abstracts and titles of the documents retrieved this way, combined them into a randomly ordered list, fed that list to LDA, and specified the number of “topics” I wanted, I could get back my classified list.

       Terms in abstract or title                                # of documents

   1.  male breast cancer[Title/Abstract]                               811
   2.  gluten free diet[Title/Abstract]                                2790
   3.  childhood schizophrenia[Title/Abstract]                         1683
   4.  microfluidics[Title/Abstract]                                   2279
   5.  high protein diet[Title/Abstract]                               1340
   6.  (malaria[Title/Abstract]) AND india[Title/Abstract]             1178
   7.  chromatin associated[Title/Abstract]                            1138
   8.  juvenile diabetes[Title/Abstract]                                621
   9.  typhoid fever[Title/Abstract]                                   2414
  10.  bioethics                                                       2818
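The pooling step can be sketched as follows. This is only an illustration of the idea, not the script I actually ran: the function name, the file name, the single-word labels and the toy documents are all hypothetical. It writes one document per line as "id, label, text", which is the default layout MALLET's import-file command reads; the label is not used by the topic model and is kept only so the results can be checked afterwards.

```python
import os
import random
import tempfile


def build_mallet_input(docs_by_query, out_path, seed=0):
    """Pool (label, text) pairs, shuffle them, and write one document per line.

    Labels must not contain whitespace, because each output line is
    "<doc id> <label> <text>" separated by tabs.
    """
    pooled = [(label, " ".join(text.split()))  # collapse newlines and extra spaces
              for label, texts in docs_by_query.items()
              for text in texts]
    random.Random(seed).shuffle(pooled)  # fixed seed so the order is reproducible
    with open(out_path, "w", encoding="utf-8") as out:
        for i, (label, text) in enumerate(pooled):
            out.write(f"doc{i}\t{label}\t{text}\n")
    return pooled


# Toy stand-ins for the title + abstract text returned by each query.
toy_docs = {
    "cancer": ["Male breast cancer and BRCA mutations.", "Carcinoma risk factors in men."],
    "chromatin": ["Histone binding and chromatin associated proteins."],
}
demo_path = os.path.join(tempfile.mkdtemp(), "pooled.txt")
build_mallet_input(toy_docs, demo_path)

# Assumed MALLET invocation afterwards (10 topics, one per query):
#   bin/mallet import-file --input pooled.txt --output pooled.mallet \
#       --keep-sequence --remove-stopwords
#   bin/mallet train-topics --input pooled.mallet --num-topics 10 \
#       --output-topic-keys topic_keys.txt --output-doc-topics doc_topics.txt
```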

The results were quite interesting.  Here are the top 20 words of the topics that were returned:

0        microfluidics microfluidic cell chemistry methods instrumentation analysis techniques based cells flow detection analytical high chip surface systems system devices

1        typhoid fever salmonella typhi bacterial patients drug immunology humans blood vaccines infections diagnosis infection vaccine therapy test treatment strains

2        schizophrenia disorders childhood child diagnosis disorder children humans adult studies risk onset adolescent patients age psychology male factors female

3        malaria india epidemiology health falciparum control diseases population plasmodium disease humans cases countries vivax study drug species incidence water

4        proteins dna chromatin genetics protein metabolism cells cell gene expression binding sequence nuclear transcription genes histone rna molecular genetic

5        bioethics ethics health research medical care humans ethical human issues patient social public professional moral approach life medicine clinical

6        protein diet high rats metabolism dietary blood effects fed animals low administration weight proteins dosage increased body intake activity

7       breast cancer male neoplasms aged brca genetics patients female risk humans genetic carcinoma disease mutations cases factors mutation analysis

8        diabetes complications patients humans disease blood mellitus type etiology female diseases diagnosis male insulin adult juvenile patient child therapy

9        disease celiac gluten patients diet free coeliac cd immunology humans intestinal diagnosis antibodies adult aged blood female child children

Pretty cool, eh? The top words suggest that the topics were pretty well separated. But how well were the documents really classified? To explore this further, I took all the documents that had an original topic id of, say, “cancer”, and plotted the maximum probability predicted for each document. The color indicates the topic that this maximum probability belongs to. For example, the plots below show all the documents that were originally identified as cancer and as chromatin.

[Plots: documents originally identified as cancer; documents originally identified as chromatin]

All the dots in a plot represent papers that were originally selected for that topic, that is, the papers that truly belong to it. The color of a dot represents which topic the algorithm thinks the paper belongs to. Comparing with the legend, the fact that the majority of the dots in the chromatin plot are peach and in the cancer plot are olive green suggests a pretty high accuracy rate. In fact:

# of incorrectly classified cancer docs = 16 out of 614 (~2.6%)
# of incorrectly classified chromatin docs = 27 out of 876 (~3.1%)
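A count like this can be computed roughly as below. This is a sketch under an assumption the post does not spell out: each original label is matched to the topic that most of its documents land in, and every document whose most probable topic differs from that one counts as incorrectly classified. The function name and the toy probabilities are hypothetical.

```python
from collections import Counter, defaultdict


def error_by_label(true_labels, doc_topic_probs):
    """Return {label: (matched_topic, n_wrong, n_docs)}.

    true_labels      -- original query label of each document
    doc_topic_probs  -- per-document list of topic probabilities
    """
    # Naive assignment: the topic with the maximum probability.
    predicted = [max(range(len(p)), key=lambda k: p[k]) for p in doc_topic_probs]

    topics_per_label = defaultdict(list)
    for label, topic in zip(true_labels, predicted):
        topics_per_label[label].append(topic)

    report = {}
    for label, topics in topics_per_label.items():
        # Assumed matching rule: the majority topic stands for this label.
        matched_topic, hits = Counter(topics).most_common(1)[0]
        report[label] = (matched_topic, len(topics) - hits, len(topics))
    return report


# Toy example with two labels and three topics (invented numbers).
labels = ["cancer", "cancer", "cancer", "chromatin", "chromatin"]
probs = [
    [0.7, 0.2, 0.1],
    [0.6, 0.3, 0.1],
    [0.2, 0.1, 0.7],   # a cancer document that drifts to topic 2
    [0.1, 0.8, 0.1],
    [0.2, 0.7, 0.1],
]
for label, (topic, wrong, total) in error_by_label(labels, probs).items():
    print(f"{label}: topic {topic}, {wrong} of {total} wrong ({100 * wrong / total:.1f}%)")
```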

Unfortunately, things don’t look as good for every topic. Here are the plots for malaria and typhoid:


As you can see, the accuracy here is significantly lower:

# of incorrectly classified malaria docs = 166 out of 853 (~19.46%)
# of incorrectly classified typhoid docs = 734 out of 1841 (~39.87%)

Here are the numbers for the rest of the topics:

# of incorrectly classified gluten docs = 279 out of 2094 (~13.3%)
# of incorrectly classified schizophrenia docs = 61 out of 1327 (~4.6%)
# of incorrectly classified southbeach (high protein diet) docs = 94 out of 1027 (~9.2%)
# of incorrectly classified microfluidics docs = 67 out of 1342 (~5.0%)
# of incorrectly classified diabetes docs = 127 out of 489 (~26.0%)
# of incorrectly classified bioethics docs = 69 out of 2166 (~3.19%)

So the overall error in prediction is 1640 / 12629 = ~12.99%. While this is a good start, it would be worth exploring what one could do to bring the error rate lower.
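The overall figure is just the sum of the per-topic counts reported above. A quick check, with the counts copied from the numbers in this post (the dictionary keys are shorthand labels chosen for this snippet):

```python
# (misclassified, total) per original topic, copied from the counts above.
counts = {
    "cancer": (16, 614),
    "chromatin": (27, 876),
    "malaria": (166, 853),
    "typhoid": (734, 1841),
    "gluten": (279, 2094),
    "schizophrenia": (61, 1327),
    "southbeach": (94, 1027),
    "microfluidics": (67, 1342),
    "diabetes": (127, 489),
    "bioethics": (69, 2166),
}

total_wrong = sum(wrong for wrong, _ in counts.values())
total_docs = sum(n for _, n in counts.values())

# Per-topic error rates, lowest first.
for label, (wrong, n) in sorted(counts.items(), key=lambda kv: kv[1][0] / kv[1][1]):
    print(f"{label:15s}{wrong:5d} / {n:5d}  {100 * wrong / n:6.2f}%")

print(f"{'overall':15s}{total_wrong:5d} / {total_docs:5d}  {100 * total_wrong / total_docs:6.2f}%")
```

Typhoid alone accounts for 734 of the 1640 errors, so it dominates the overall rate.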