While it's been a few months since we started working on this project, I am hoping to document its algorithm development. The main vision for this project was to have a website for biomedical literature that would recommend new and recent articles to you based on your past browsing history, sort of like a Netflix or Amazon for biomedical literature! We have been inspired by sites such as Goodreads and reddit. As our ideas developed, we decided that what we wanted was a place where researchers could go, upload their past and present publications or research interests, and come back daily to see an updated “recommendation list” of new articles in their area. As ideas went, we thought it would be great to have a “virtual coffee shop”, where you could post comments on articles and up-vote or down-vote them. Basically, a place to hang out with like-minded people!
PubMed, hosted and maintained by the NIH, is the go-to repository for biomedical literature, and our first task was to access all of it and make sense of its corpus of twenty million articles! As the algorithms person, my first job was to figure out how to classify all that literature. I had no experience with anything like that. Zero. Zilch. Nada. I wasn’t quite sure how people did it, so I started reading up on how people made catalogs, classified objects and organized libraries. Was there a way to automate that? Turns out it's a really hot area of research, and people have been doing some really interesting and cool stuff. There is a whole area of computer science dedicated to it: Text Mining and Natural Language Processing!