Monthly Archives: September 2014

Data Science: learning from MOOCs…

September 21, 2014 | DataScience | artificial intelligence, Coursera and Udacity, data science, machine learning | priyamvadadesai

Coursera has changed my life. My husband calls me a MOOC junkie. For the uninitiated, MOOCs are “Massive Open Online Courses”, and when I decided I wanted to switch fields, they were a godsend. There are a lot of them around now, but in my opinion, the best ones by far are Coursera, Udacity and edX.

For statistics, machine learning, artificial intelligence and computer science, these places can give you a great education for almost free. When I first started, about 18 months ago, they were all new and free. Coursera came out of Stanford, Udacity was also started by an ex-Stanford professor, and edX was founded by Harvard and MIT on the East Coast. Interestingly enough, there have been free courses available online for a long time: Stanford Online, Carnegie Mellon University’s Open Learning Initiative, UC Berkeley’s lectures on YouTube, and MIT’s OpenCourseWare (OCW). But they never really caught on like Coursera and Udacity did. Here are some of the classes I have taken on Coursera that I think have helped me in this new field:

Machine Learning

Data Analysis

R Programming

Natural Language Processing

Mathematical Biostatistics Boot Camp 1

Introduction to Data Science

Algorithms: Design and Analysis, Part 1

Introduction to Recommender Systems

Core Concepts in Data Analysis

Social Network Analysis

About six months ago they started offering “Certification Programs” in certain areas; Johns Hopkins University has a nine-course certification at $49 a course. Rice University offers a great certification called “Fundamentals in Computing”, which is all done using Python. Their courses are quite challenging. edX has some great courses too:

Learning From Data

Introduction to Statistics: Probability

Introduction to Statistics: Descriptive Statistics

Introduction to Statistics: Inference

All the above courses have a fixed schedule, like a regular class. You have assignments due every week, lectures to listen to and reading to do, so it’s not “self paced”. Some of the courses have been quite demanding and time consuming, but very rewarding. There are course projects that are then “peer assessed” based on a rubric that you are provided. It’s not perfect, but it works rather well, I think.

Udacity has a different model. Their courses are “self paced”. They used to be free, but a few months ago they added a paid option where you can “check in” with a coach and have your work reviewed. I thought that was a bit expensive at ~$150/month. I prefer to be given deadlines rather than set them myself! They have a lot of Data Science courses as well.

Learning Python, DataScience – some great resources.

September 19, 2014 | DataScience | artificial intelligence, data mining, pandas, Python, R | priyamvadadesai

There are some AMAZING resources online to learn Python and Data Science in general. Here is an incomplete list of sites/pages I like to go to. I have started using Pandas/SciPy a lot and love it. I plan to keep adding to this page.

Python:

Data Science

Using R

Topic Modeling: LDA… LSI.. Oh My!

September 19, 2014 | DataScience | artificial intelligence, latent dirichlet allocation, LDA, mallet, nltk, text mining, topic modelling | priyamvadadesai

As I explored the topics in text mining and machine learning, I learned that there is an area of research called topic modeling. A topic model is a type of statistical model for discovering the abstract “topics” that occur in a collection of documents. A topic modeling tool takes a collection of unstructured texts and looks for patterns in the use of words; a “topic” consists of a cluster of words that frequently occur together. Using contextual clues, topic models can connect words with similar meanings and distinguish between uses of words with multiple meanings. Note that the algorithm has no knowledge of the semantic meaning of the words. It’s an exercise in statistical probability. As a result, every piece of text has some non-zero probability of belonging to every topic. Naively, you say text A belongs to topic XX if it has a high probability of belonging to XX. By the same token, a piece of text could belong to multiple topics. Pretty cool!
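To make the naive assignment rule concrete, here is a minimal Python sketch. The topic names and probabilities are made up for illustration, and the 0.3 cut-off in `topics_above` is an arbitrary assumption, not something a topic model hands you.

```python
# Hypothetical topic probabilities for one document. A topic model gives
# every topic a non-zero share, and the shares sum to 1.
doc_topics = {"diet": 0.55, "diabetes": 0.35, "malaria": 0.07, "ethics": 0.03}

def best_topic(probs):
    """Naive rule: pick the single most probable topic."""
    return max(probs, key=probs.get)

def topics_above(probs, threshold=0.3):
    """Allow multiple topics: keep every topic at or above the threshold."""
    return sorted(t for t, p in probs.items() if p >= threshold)

print(best_topic(doc_topics))
print(topics_above(doc_topics))
```

With these made-up numbers the document is mostly about one topic, but a second topic also clears the threshold, which is the "multiple topics" case.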

I couldn’t wait to get my hands dirty. In their first incarnation (~1998), topic models were called Latent Semantic Indexing (LSI). Today, the most popularly used topic model is Latent Dirichlet Allocation (LDA). You can find links to lots more resources on the site of David Blei, one of the best known experts in topic modeling. I also learned a lot from some digital humanities sites like The Programming Historian and Topic Modeling for Humanists. (At the time of writing this blog, which is about ten months after I was learning about this, LDA has become the new rage and there are a lot more implementations of it with easy interfaces.)

I played with a bunch of freely available LDA implementations including Gensim, MALLET, and lda-c. I found MALLET to be the easiest to use and to give me the most sensible results.

For my experiment, I went into PubMed and searched for documents with the following terms in the abstract or title. I tried to pick terms that (I thought) were distinct. My hypothesis was that if I pulled out the abstracts and titles of the documents found in this way, combined them into a randomly ordered list, fed it to LDA, and specified the number of “topics” I wanted, I could get back my classified list.

Terms in abstract or title                              # of documents
male breast cancer[Title/Abstract]                      811
gluten free diet[Title/Abstract]                        2790
childhood schizophrenia[Title/Abstract]                 1683
microfluidics[Title/Abstract]                           2279
high protein diet[Title/Abstract]                       1340
(malaria[Title/Abstract]) AND india[Title/Abstract]     1178
chromatin associated[Title/Abstract]                    1138
juvenile diabetes[Title/Abstract]                       621
typhoid fever[Title/Abstract]                           2414
bioethics                                               2818
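The set-up described above (pool the documents from every query, shuffle them, and keep the original query aside as a hidden label) could be sketched roughly like this. The document strings are placeholders, not real PubMed abstracts, and `build_corpus` is a hypothetical helper, not part of any library.

```python
import random

def build_corpus(docs_by_query, seed=0):
    """Pool documents from all queries and shuffle them.

    Returns (texts, labels): texts go to the topic model, labels are
    kept aside so the result can be scored afterwards.
    """
    pairs = [(text, query)
             for query, docs in docs_by_query.items()
             for text in docs]
    random.Random(seed).shuffle(pairs)  # fixed seed keeps the order reproducible
    texts = [text for text, _ in pairs]
    labels = [query for _, query in pairs]
    return texts, labels

# Placeholder documents standing in for title + abstract text.
docs_by_query = {
    "microfluidics": ["placeholder abstract A", "placeholder abstract B"],
    "typhoid fever": ["placeholder abstract C"],
}
texts, labels = build_corpus(docs_by_query)
```

Shuffling text and label together as pairs keeps them aligned, so `labels[i]` is always the query that produced `texts[i]`.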

The results were quite interesting. Here are the top words (about 20 each) of the topics that were returned:

0 microfluidics microfluidic cell chemistry methods instrumentation analysis techniques based cells flow detection analytical high chip surface systems system devices

1 typhoid fever salmonella typhi bacterial patients drug immunology humans blood vaccines infections diagnosis infection vaccine therapy test treatment strains

2 schizophrenia disorders childhood child diagnosis disorder children humans adult studies risk onset adolescent patients age psychology male factors female

3 malaria india epidemiology health falciparum control diseases population plasmodium disease humans cases countries vivax study drug species incidence water

4 proteins dna chromatin genetics protein metabolism cells cell gene expression binding sequence nuclear transcription genes histone rna molecular genetic

5 bioethics ethics health research medical care humans ethical human issues patient social public professional moral approach life medicine clinical

6 protein diet high rats metabolism dietary blood effects fed animals low administration weight proteins dosage increased body intake activity

7 breast cancer male neoplasms aged brca genetics patients female risk humans genetic carcinoma disease mutations cases factors mutation analysis

8 diabetes complications patients humans disease blood mellitus type etiology female diseases diagnosis male insulin adult juvenile patient child therapy

9 disease celiac gluten patients diet free coeliac cd immunology humans intestinal diagnosis antibodies adult aged blood female child children

Pretty cool, eh? The top words seem to indicate that the topics were pretty well categorized. But how well were the documents really classified? To explore this further, I took all the documents that had an original topic id of, say, “cancer”, and plotted the maximum probability predicted for each document. The color indicates the topic predicted by that probability. For example, the plots below show all the documents that were originally identified as cancer and chromatin.

All the dots in a plot represent papers that were originally selected for that topic, so they are the true members of that topic. The color of a dot represents which topic the algorithm thinks the paper belongs to. Comparing with the legend, the fact that the majority of the dots in the chromatin plot are peach and in the cancer plot are olive green suggests a pretty high accuracy rate. In fact:

# of incorrectly classified cancer docs = 16 out of 614 ~2.6%
# of incorrectly classified chromatin docs = 27 out of 876 ~3.0%
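Counts like these come from comparing each document's original query with its most probable topic. Here is a minimal sketch, assuming topic ids 7 and 4 correspond to cancer and chromatin (read off the top-word lists above). The three example documents and their probabilities are invented, and `count_errors` is a hypothetical helper.

```python
# Assumed mapping from topic id to original query label, read off the
# top words of each topic (7 = breast cancer, 4 = chromatin).
topic_to_label = {7: "cancer", 4: "chromatin"}

# Invented examples: (original label, {topic id: probability}).
docs = [
    ("cancer", {7: 0.81, 4: 0.19}),
    ("cancer", {7: 0.40, 4: 0.60}),   # most probable topic is chromatin: an error
    ("chromatin", {7: 0.10, 4: 0.90}),
]

def count_errors(docs, topic_to_label):
    """Return {label: (misclassified, total)} using the most probable topic."""
    stats = {}
    for label, probs in docs:
        predicted = topic_to_label[max(probs, key=probs.get)]
        wrong, total = stats.get(label, (0, 0))
        stats[label] = (wrong + int(predicted != label), total + 1)
    return stats

print(count_errors(docs, topic_to_label))
```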

Unfortunately, things don’t look so good for all the topics. Here are the plots for malaria and typhoid:

As you can see, the accuracy here is significantly lower:

# of incorrectly classified malaria docs = 166 out of 853 ~19.46%
# of incorrectly classified typhoid docs = 734 out of 1841 ~39.87%

Here are the numbers for the rest of the topics:

# of incorrectly classified gluten docs = 279 out of 2094 ~13.3%
# of incorrectly classified schizophrenia docs = 61 out of 1327 ~4.6%
# of incorrectly classified southbeach (the high protein diet query) docs = 94 out of 1027 ~9.1%
# of incorrectly classified microfluidics docs = 67 out of 1342 ~4.9%
# of incorrectly classified diabetes docs = 127 out of 489 ~25.9%
# of incorrectly classified bioethics docs = 69 out of 2166 ~3.18%

So the overall error in prediction is 1640/12629 ≈ 12.99%. While this is a good start, it would be worth exploring what one could do to bring the error rate lower.
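As a check on the arithmetic, the per-topic counts quoted above can be totalled in a few lines of Python. The dictionary simply restates the numbers from this post, using the same short labels.

```python
# (misclassified, total) per original topic, copied from the figures above.
counts = {
    "cancer": (16, 614),
    "chromatin": (27, 876),
    "malaria": (166, 853),
    "typhoid": (734, 1841),
    "gluten": (279, 2094),
    "schizophrenia": (61, 1327),
    "southbeach": (94, 1027),
    "microfluidics": (67, 1342),
    "diabetes": (127, 489),
    "bioethics": (69, 2166),
}

# Print topics from lowest to highest error rate.
for topic, (wrong, total) in sorted(counts.items(),
                                    key=lambda kv: kv[1][0] / kv[1][1]):
    print(f"{topic:15s} {wrong:4d} / {total:4d} = {100 * wrong / total:5.2f}%")

total_wrong = sum(wrong for wrong, _ in counts.values())
total_docs = sum(total for _, total in counts.values())
print(f"overall {total_wrong} / {total_docs} = {100 * total_wrong / total_docs:.2f}%")
```

Sorting by error rate puts diabetes and typhoid at the bottom of the list, so those are probably the topics worth looking at first.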

MyJourneyAsaDataScientist

About eighteen months ago I decided to leave astronomy, change my career trajectory and follow the Data Science Bandwagon- this is a blog about that ongoing journey…