Monthly Archives: August 2014

Making of a recommendation system - continued

While it's been a few months since we started working on this project, I am hoping to document its algorithm development. The main vision for this project was to have a website for biomedical literature that would recommend new and recent articles to you based on your past browsing history - sort of like a Netflix or Amazon for biomedical literature! We have been inspired by sites such as Goodreads and reddit. As our ideas developed, we decided that what we wanted was a place where like-minded researchers could go, upload their past and present publications or research interests, and come back daily to see an updated “recommendation list” of new articles in their area. As ideas went, we thought it would be great to have a “virtual coffee shop” where you could post comments on articles and up-vote or down-vote articles - basically a place to hang out with like-minded people!

PubMed, hosted and maintained by the NIH, is the go-to repository for biomedical literature, and our first task was to access all of it and make sense of its corpus of twenty million articles! As the algorithms person, my first job was to figure out how to classify all that literature. I had no experience with anything like that - zero, zilch, nada. I wasn't quite sure how people did it, so I started reading up on how people make catalogs, classify objects and organize libraries. Was there a way to automate that? It turns out it's a really hot area of research, and people have been doing some really interesting and cool stuff. There is a whole area of computer science dedicated to it: Text Mining and Natural Language Processing!


The making of a recommendation system

About eighteen months ago I decided to leave astronomy and follow the Data Science bandwagon - this is a blog about that journey. I spent a few months taking data science courses on Coursera and Udacity and was fortunate enough to become part of a project to build a “recommender system for biomedical literature”.

Some background: It turns out that the biomedical field is growing so rapidly that it is getting really difficult to keep up with the literature. For newcomers to the field, it's hard to figure out which research papers to read and where to start, as a few thousand articles are published daily and new/open source journals are popping up regularly. For veterans, it's hard to keep up, and there are not enough hours in the day to scan through the articles to figure out what is relevant, new and exciting in their area of research. This is true not just for academic researchers but also for those in the related fields of medicine and bioinformatics. Here is a recent plot I made of the number of papers per month uploaded to PubMed (a popular biomedical research literature repository). As you can see, there are about 92,000 new publications a month…
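
For readers curious how a papers-per-month tally like this could be put together, here is a minimal sketch in Python. It is not the code behind the plot: it assumes you already have one publication date per record, and the function name `papers_per_month` and the sample dates are made up purely for illustration.

```python
from collections import Counter
from datetime import date


def papers_per_month(pub_dates):
    """Count records per (year, month), returned in chronological order."""
    counts = Counter((d.year, d.month) for d in pub_dates)
    return sorted(counts.items())


# Made-up dates for illustration only; these are not real PubMed records.
sample = [date(2014, 6, 3), date(2014, 6, 21), date(2014, 7, 9)]

for (year, month), n in papers_per_month(sample):
    print(f"{year}-{month:02d}: {n}")
```

The sorted list of (year, month) counts can then be handed to whatever plotting tool you prefer.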


I have been working mainly on the algorithm design and development for this project and my intention with this blog is to focus on that and my growth as a data scientist.