sckit-learn | MyJourneyAsaDataScientist

Migrating Storage to HDF5/Pytables

Very often, part of trying to provide on demand, instantaneous recommendations involves optimizing calculations between large matrices. In our case, one of the most challenging problems was reading in, and efficiently calculating cosine similarity between tfidf features for every pair of documents in the two corpuses – the papers that could “potentially” be recommended and the users library of papers that they were interested in. Most recommendation engines try to ease this computational load by doing as much of the prep work before as possible. Its almost like running a good restaurant – you try to pre-cook (read pre-calculate) all the potential sauces, vegetables (read possible features ) you may need, so you can whip up a great meal (read recommendations) with minimum delay. For us, this is how we handled it:

First create a dictionary of possible words from a corpus containing all abstracts from pubmed articles from a couple of years (after removing the common english stopwords).
Take the corpus of potential recent papers (that could be recommended) and vectorize it ( i.e. basically represent the collection of text documents as numerical feature vectors.) There are many ways of computing the term-weights or values of the vector for a document, and after experimenting with a few, we went with one of the most commonly used weighting schemes –tf-idf. We used python’s scikit-learn library.
The actual word-similarity calculation often involved multiplication between matrices of the order of ~650000x 15000 and these reading these in and multiplying them can become both computationally very intensive and impossible to store in memory.
Enter HDF5, sparse matrices and pickle to the rescue !
Hierarchical Data Format version 5, or “HDF5,” is a great mechanism for storing large quantities of numerical data which can be read in rapidly and allow for sub-setting and partial I/O of datasets. H5py package/PyTables is a Pythonic interface to the HDF5 binary data format which allows you to easily manipulate that data from NumPy. Thousands of datasets can be stored in a single file, and you can interact with them in a very pythonic way: e.g you can iterate over datasets in a file, or check out the .shape or .dtype attributes of datasets. We chose to store our tf-idf matrices in the hdf5 format.
Given that we were creating our “vectors” for each paper abstract using all the words in the dictionary; it was very likely that the matrix was very sparse and contained a lot of 0’s. Scipy’s sparse matrices are regular matrices that only store non-zero elements, thus greatly reducing the memory requirements. We stored our sparse tf-idf matrices in using scipy’s sparse matrices in HDF5 format. Further more, Scipy’s sparse calculations are coded in C/C++ thus significantly speeding performance.
pickle and cpickle (written in C and about ~1000 times faster) is a python module that implements a fundamental, but powerful algorithm for serializing and de-serializing a Python object structure i.e it is the process of converting a python object into a byte stream in order to store it in a file/database, maintain program state across sessions, or to transport data over the network. Pickle uses a simple stack-based virtual machine that records the instructions used to reconstruct the object. It is incredibly useful to store python objects and we used it to store the Word dictionary, and easily recreate it everytime we had new papers we needed to convert to word vectors.

MyJourneyAsaDataScientist

About eighteen months ago I decided to leave astronomy, change my career trajectory and follow the Data Science Bandwagon- this is a blog about that ongoing journey…

Tag Archives: sckit-learn

HDF5, Pytables and Sparse Matrices: Optimizing calculations