Monthly Archives: September 2014

Data Science: learning from MOOCs…

Coursera has changed my life. My husband calls me a MOOC junkie. For the uninitiated, MOOCs are “Massive Open Online Courses”, and when I decided I wanted to switch fields, they were a godsend. There are a lot of them around now, but in my opinion, the best ones by far are:

For statistics, machine learning, artificial intelligence, and computer science, these places can give you a great education almost for free. When I first started, about 18 months ago, they were all new and free. Coursera came out of Stanford, Udacity was also founded by an ex-Stanford professor, and edX by Harvard/MIT on the east coast. Interestingly enough, free courses have been available online for a long time: Stanford Online, Carnegie Mellon University’s Open Learning Initiative, UC Berkeley’s lectures on YouTube, and MIT’s OpenCourseWare (OCW). But they never really caught on the way Coursera and Udacity did. Here are some of the classes I have taken on Coursera that I think have helped me in this new field:

Machine Learning
Data Analysis
R Programming
Natural Language Processing
Mathematical Biostatistics Boot Camp 1
Introduction to Data Science
Algorithms: Design and Analysis, Part 1
Introduction to Recommender Systems
Core Concepts in Data Analysis
Social Network Analysis

About six months ago they started offering “Certification Programs” in certain areas; Johns Hopkins University has a nine-course certification at $49 a course. Rice University offers a great certification called “Fundamentals in Computing”, which is done entirely in Python. Their courses are quite challenging. EdX has some great courses too:

Learning From Data
Introduction to Statistics: Probability
Introduction to Statistics: Descriptive Statistics
Introduction to Statistics: Inference

All the above courses have a fixed schedule, like a regular class. You have assignments due every week, lectures to listen to, and reading to do, so it’s not “self paced”. Some of the courses have been quite demanding and time consuming, but very rewarding. There are course projects that are then “peer assessed” based on a rubric you are provided. It’s not perfect, but I think it works rather well.

Udacity has a different model: it is “self paced”. It used to be free, but a few months ago they added a paid option where you can “check in” with a coach and have your work reviewed. I thought it was a bit expensive at ~$150/month. I prefer to be given deadlines rather than set them myself! They have a lot of Data Science courses as well.

Learning Python and Data Science – some great resources.

There are some AMAZING resources online for learning Python and Data Science in general. Here is an incomplete list of sites/pages I like to go to. I have started using Pandas/SciPy a lot and love it. I plan to keep adding to this page.

Python:

Data Science

Using R

Topic Modeling: LDA… LSI… Oh My!

As I explored topics in text mining and machine learning, I learned that there is an area of research called topic modeling. A topic model is a type of statistical model for discovering the abstract “topics” that occur in a collection of documents. A topic modeling tool takes a collection of unstructured texts and looks for patterns in the use of words; a “topic” consists of a cluster of words that frequently occur together. Using contextual clues, topic models can connect words with similar meanings and distinguish between uses of words with multiple meanings. Note that the algorithm has no knowledge of the semantic meaning of the words; it’s an exercise in statistical probability. As a result, every piece of text has some non-zero probability of belonging to every topic. Naively, you say text A belongs to topic XX if it has a high probability of belonging to XX. At the same time, a piece of text could belong to multiple topics. Pretty cool!

I couldn’t wait to get my hands dirty. In their first incarnation (~1990), topic models took the form of Latent Semantic Indexing (LSI). Today, the most popular topic model is Latent Dirichlet Allocation (LDA). You can find links to lots more resources on the site of David Blei (one of the best known experts in topic modeling). I also learned a lot from some digital humanities sites like The Programming Historian and Topic Modeling for Humanists. (At the time of writing this blog post, roughly ten months after I was learning about all this, LDA has become the new rage and there are a lot more implementations of it with easy interfaces.)
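To give a feel for the LSI idea mentioned above, here is a toy sketch (not anything from the original experiment): LSI builds a term-document count matrix and takes a truncated SVD; the top singular vectors act as latent “topics”, and documents that share vocabulary land close together in that low-dimensional space. The tiny matrix below is made up for illustration.

```python
import numpy as np

# Toy term-document matrix: rows = terms, columns = documents.
# Docs 0-1 share "gene/protein" vocabulary; docs 2-3 share "malaria/fever".
terms = ["gene", "protein", "malaria", "fever"]
X = np.array([
    [3, 2, 0, 0],   # gene
    [2, 3, 0, 0],   # protein
    [0, 0, 2, 1],   # malaria
    [0, 0, 1, 2],   # fever
], dtype=float)

# LSI = truncated SVD of the term-document matrix.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2                                     # keep the top-2 latent "topics"
doc_vecs = (np.diag(s[:k]) @ Vt[:k]).T    # each row: a document in topic space
```

Documents 0 and 1 end up pointing in the same direction in topic space, while documents from the other theme are orthogonal to them.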

I played with a bunch of freely available LDA implementations, including Gensim, MALLET, and lda-c. I found MALLET the easiest to use, and it gave me the most sensible results.
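MALLET itself is a command-line tool, but the algorithm it uses for LDA is collapsed Gibbs sampling, which can be sketched in plain Python. Everything below (function name, hyperparameters, the tiny corpus) is illustrative, not MALLET’s actual implementation: each word token is repeatedly reassigned to a topic with probability proportional to how often its document uses that topic and how often that topic uses that word.

```python
import random
from collections import defaultdict

def lda_gibbs(docs, num_topics, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA over tokenized documents."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    ndk = [[0] * num_topics for _ in docs]               # doc -> topic counts
    nkw = [defaultdict(int) for _ in range(num_topics)]  # topic -> word counts
    nk = [0] * num_topics                                # tokens per topic
    z = []                                               # topic of each token
    for d, doc in enumerate(docs):                       # random initialization
        zs = []
        for w in doc:
            k = rng.randrange(num_topics)
            zs.append(k)
            ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1
        z.append(zs)
    for _ in range(iters):                               # resample every token
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]                              # drop current assignment
                ndk[d][k] -= 1; nkw[k][w] -= 1; nk[k] -= 1
                weights = [(ndk[d][t] + alpha) * (nkw[t][w] + beta)
                           / (nk[t] + vocab_size * beta)
                           for t in range(num_topics)]
                k = rng.choices(range(num_topics), weights)[0]
                z[d][i] = k
                ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1
    # Smoothed per-document topic distributions (the "theta" matrix).
    theta = [[(ndk[d][t] + alpha) / (len(doc) + num_topics * alpha)
              for t in range(num_topics)]
             for d, doc in enumerate(docs)]
    return theta, nkw

# Tiny made-up corpus with two obvious themes, for demonstration only.
docs = [s.split() for s in [
    "gene protein dna chromatin histone",
    "protein dna gene expression chromatin",
    "malaria fever india plasmodium vivax",
    "fever plasmodium malaria vivax india",
]]
theta, nkw = lda_gibbs(docs, num_topics=2)
```

Each row of `theta` is one document’s probability distribution over topics, which is exactly the “non-zero probability of belonging to every topic” behavior described earlier.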

For my experiment, I went to PubMed and searched for documents with the following terms in the abstract or title. I tried to pick terms that (I thought) were distinct. My hypothesis was that if I pulled out the abstracts and titles of the documents retrieved this way, combined them into a randomly ordered list, fed that to LDA, and specified the number of “topics” I wanted, I could get back my classified list.

      term in abstract or title                            # of documents

  1.  male breast cancer[Title/Abstract]                        811
  2.  gluten free diet[Title/Abstract]                         2790
  3.  childhood schizophrenia[Title/Abstract]                  1683
  4.  microfluidics[Title/Abstract]                            2279
  5.  high protein diet[Title/Abstract]                        1340
  6.  malaria[Title/Abstract] AND india[Title/Abstract]        1178
  7.  chromatin associated[Title/Abstract]                     1138
  8.  juvenile diabetes[Title/Abstract]                         621
  9.  typhoid fever[Title/Abstract]                            2414
  10. bioethics                                                2818
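The setup step (pooling the labeled abstracts into one randomly ordered, unlabeled list while keeping the true labels aside for scoring) can be sketched like this. The pools and abstract strings below are placeholders, not actual PubMed results:

```python
import random

# Hypothetical labeled pools, standing in for the PubMed query results above.
pools = {
    "male breast cancer": ["abstract a1", "abstract a2"],
    "typhoid fever": ["abstract b1", "abstract b2", "abstract b3"],
}

# Flatten to (label, text) pairs and shuffle so LDA sees no ordering signal;
# the hidden labels are kept aside to score the recovered topics afterwards.
pairs = [(label, text) for label, texts in pools.items() for text in texts]
random.Random(42).shuffle(pairs)
labels = [label for label, _ in pairs]   # ground truth, held out
corpus = [text for _, text in pairs]     # what the topic model actually sees
```

The model only ever sees `corpus`; `labels` is used at the end to check how well the discovered topics match the original queries.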

The results were quite interesting. Here are the top words of each topic that was returned:

0        microfluidics microfluidic cell chemistry methods instrumentation analysis techniques based cells flow detection analytical high chip surface systems system devices

1        typhoid fever salmonella typhi bacterial patients drug immunology humans blood vaccines infections diagnosis infection vaccine therapy test treatment strains

2        schizophrenia disorders childhood child diagnosis disorder children humans adult studies risk onset adolescent patients age psychology male factors female

3        malaria india epidemiology health falciparum control diseases population plasmodium disease humans cases countries vivax study drug species incidence water

4        proteins dna chromatin genetics protein metabolism cells cell gene expression binding sequence nuclear transcription genes histone rna molecular genetic

5        bioethics ethics health research medical care humans ethical human issues patient social public professional moral approach life medicine clinical

6        protein diet high rats metabolism dietary blood effects fed animals low administration weight proteins dosage increased body intake activity

7       breast cancer male neoplasms aged brca genetics patients female risk humans genetic carcinoma disease mutations cases factors mutation analysis

8        diabetes complications patients humans disease blood mellitus type etiology female diseases diagnosis male insulin adult juvenile patient child therapy

9        disease celiac gluten patients diet free coeliac cd immunology humans intestinal diagnosis antibodies adult aged blood female child children

Pretty cool, eh? The top words seem to indicate that the topics were categorized pretty well. But how well were the documents really classified? To explore this further, I took all the documents that had an original topic id of, say, “cancer”, and plotted the maximum probability predicted for each document, with the color indicating the topic predicted by that probability. For example, the plots below show all the documents that were originally identified as cancer and chromatin.

[Plot: docs identified as cancer]  [Plot: docs identified as chromatin]

All the dots in a plot represent papers that were originally selected for that topic, i.e. the true positives. The color of a dot represents which topic the algorithm thinks the paper belongs to. Comparing with the legend, the fact that the majority of the dots in the chromatin plot are peach and in the cancer plot are olive green suggests a pretty high accuracy rate. In fact:

# of incorrectly classified cancer docs = 16 out of 614 (~2.6%)
# of incorrectly classified chromatin docs = 27 out of 876 (~3.0%)
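These per-topic error counts boil down to an argmax comparison: a document counts as misclassified when its highest-probability topic is not the one matched to its original query. A hedged sketch, where the document-topic rows and the topic index for “cancer” are made up for illustration:

```python
# Each row: LDA's topic distribution for one document whose true label we know.
# The values here are illustrative, not the actual model output.
doc_topic = [
    [0.85, 0.10, 0.05],   # true label: cancer
    [0.70, 0.20, 0.10],   # true label: cancer
    [0.15, 0.05, 0.80],   # true label: cancer (misclassified)
]
true_topic = 0  # suppose topic 0 was matched to "cancer" via its top words

# Predicted topic = index of the maximum probability in each row.
predicted = [max(range(len(row)), key=row.__getitem__) for row in doc_topic]
errors = sum(p != true_topic for p in predicted)
rate = errors / len(predicted)
```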

Unfortunately, things don’t look so good for all the topics. Here are the plots for malaria and typhoid:

[Plot: docs identified as typhoid]  [Plot: docs identified as malaria]

As you can see, the accuracy here is significantly lower:

# of incorrectly classified malaria docs = 166 out of 853 (~19.5%)
# of incorrectly classified typhoid docs = 734 out of 1841 (~39.9%)

Here are the numbers for the rest of the topics:

# of incorrectly classified gluten docs = 279 out of 2094 (~13.3%)
# of incorrectly classified schizophrenia docs = 61 out of 1327 (~4.6%)
# of incorrectly classified southbeach docs = 94 out of 1027 (~9.1%)
# of incorrectly classified microfluidics docs = 67 out of 1342 (~5.0%)
# of incorrectly classified diabetes docs = 127 out of 489 (~26.0%)
# of incorrectly classified bioethics docs = 69 out of 2166 (~3.2%)

So, the overall error in prediction is 1640/12629 ≈ 13.0%. While this is a good start, it would be worth exploring what one could do to bring the error rate down.
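The overall rate is just the pooled per-topic counts reported above:

```python
# (errors, total) per topic, taken from the counts reported above.
counts = {
    "cancer": (16, 614), "chromatin": (27, 876),
    "malaria": (166, 853), "typhoid": (734, 1841),
    "gluten": (279, 2094), "schizophrenia": (61, 1327),
    "southbeach": (94, 1027), "microfluidics": (67, 1342),
    "diabetes": (127, 489), "bioethics": (69, 2166),
}
errors = sum(e for e, _ in counts.values())   # 1640
total = sum(n for _, n in counts.values())    # 12629
rate = errors / total                         # ~0.13
```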