**Migrating Storage to HDF5/Pytables**

Very often, providing on-demand, near-instantaneous recommendations comes down to optimizing calculations between large matrices. In our case, one of the most challenging problems was reading in, and efficiently calculating, cosine similarity between tf-idf features for every pair of documents across two corpora – the papers that could potentially be recommended, and the papers in a user's library that they were interested in. Most recommendation engines ease this computational load by doing as much of the prep work ahead of time as possible. It's almost like running a good restaurant – you try to pre-cook (read *pre-calculate*) all the potential sauces and vegetables (read *possible features*) you may need, so you can whip up a great meal (read *recommendations*) with minimum delay. For us, this is how we handled it:

- First, create a **dictionary** of possible words from a corpus containing all abstracts from PubMed articles from the last couple of years (after removing the common English stopwords).
- Take the corpus of potential recent papers (that could be recommended) and **vectorize** it, i.e. represent the collection of text documents as numerical feature vectors. There are many ways of computing the **term-weights** or values of the vector for a document, and after experimenting with a few, we went with one of the most commonly used weighting schemes – **tf-idf**. We used Python's **scikit-learn** library.
- The actual word-similarity calculation often involved multiplication between matrices on the order of ~650,000 x 15,000, and reading these in and multiplying them can be both computationally **very** intensive and impossible to fit in memory.
- Enter **HDF5**, **sparse matrices** and **pickle** to the rescue!
- Hierarchical Data Format version 5, or **HDF5**, is a great mechanism for storing large quantities of numerical data which can be **read in rapidly** and allows for sub-setting and partial I/O of datasets. The **h5py** package (and **PyTables**) is a Pythonic interface to the **HDF5** binary data format which allows you to easily manipulate that data from NumPy. Thousands of datasets can be stored in a single file, and you can interact with them in a very Pythonic way: e.g. you can iterate over the datasets in a file, or check out the `.shape` or `.dtype` attributes of a dataset. We chose to store our tf-idf matrices in the HDF5 format.
- Given that we were creating our "vectors" for each paper abstract using **all** the words in the dictionary, the matrix was inevitably very sparse and contained a lot of 0's. **SciPy's** sparse matrices are regular matrices that **only** store the non-zero elements, greatly reducing the memory requirements. We stored our sparse tf-idf matrices, built with SciPy, in HDF5 format. Furthermore, SciPy's sparse calculations are coded in C/C++, which significantly speeds up performance.
- Finally, **pickle** and **cPickle** (written in C and roughly ~1000 times faster) use a simple stack-based virtual machine that records the instructions needed to reconstruct an object. Pickle is incredibly useful for storing Python objects, and we used it to store the word dictionary and easily recreate it every time we had new papers to convert to word vectors.
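The first two steps above – building the dictionary and vectorizing the corpus with tf-idf – can be sketched in a few lines of scikit-learn. The `abstracts` list here is a hypothetical stand-in for the real PubMed corpus:

```python
# Sketch of the dictionary + tf-idf vectorization steps.
# `abstracts` is a hypothetical stand-in for the PubMed abstract corpus.
from sklearn.feature_extraction.text import TfidfVectorizer

abstracts = [
    "protein folding dynamics in yeast",
    "deep learning for protein structure prediction",
    "yeast cell cycle regulation",
]

# Builds the word dictionary (vocabulary) and the tf-idf matrix in one
# pass, dropping common English stopwords along the way.
vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(abstracts)

print(tfidf.shape)   # (n_documents, n_vocabulary_words)
print(type(tfidf))   # a SciPy sparse matrix -- only non-zeros are stored
```

Conveniently, `fit_transform` already returns a SciPy sparse matrix, so the output plugs straight into the sparse-storage approach described above.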
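The cosine-similarity step reduces to a single sparse matrix product once the rows are L2-normalized, since the cosine of two unit vectors is just their dot product. A minimal sketch, using small random matrices in place of the real ~650,000 x 15,000 ones (the variable names are illustrative, not our actual code):

```python
# Sketch: cosine similarity between two tf-idf corpora as one sparse
# matrix product. The matrices here are small random stand-ins for the
# real candidate-papers and user-library tf-idf matrices.
import scipy.sparse as sp
from sklearn.preprocessing import normalize

candidate_tfidf = sp.random(1000, 500, density=0.01, format="csr", random_state=0)
library_tfidf = sp.random(20, 500, density=0.01, format="csr", random_state=1)

# L2-normalize each row so that a plain dot product equals cosine similarity.
candidate_norm = normalize(candidate_tfidf)
library_norm = normalize(library_tfidf)

# The multiplication runs in SciPy's C code and never densifies the data.
similarity = candidate_norm @ library_norm.T

print(similarity.shape)   # one score per (candidate, library-paper) pair
```

Each row of `similarity` can then be reduced (e.g. by max or mean over the user's library) to rank the candidate papers for that user.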
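Finally, the storage side: a CSR sparse matrix is fully described by its three underlying arrays (`data`, `indices`, `indptr`) plus its shape, so one way to persist it in HDF5 via h5py is to save those arrays as separate datasets, with pickle handling the vocabulary dictionary. A sketch under those assumptions (file names and dataset layout are illustrative, not necessarily our exact scheme):

```python
# Sketch: persisting a SciPy CSR matrix in HDF5 by saving its three
# underlying arrays, plus pickling the vocabulary dictionary.
import pickle
import h5py
import scipy.sparse as sp

tfidf = sp.random(100, 50, density=0.05, format="csr", random_state=0)
vocabulary = {"protein": 0, "yeast": 1}  # hypothetical word -> column map

# Write the three CSR arrays as HDF5 datasets; keep the shape as an attribute.
with h5py.File("tfidf.h5", "w") as f:
    f.create_dataset("data", data=tfidf.data)
    f.create_dataset("indices", data=tfidf.indices)
    f.create_dataset("indptr", data=tfidf.indptr)
    f.attrs["shape"] = tfidf.shape

# Pickle the word dictionary so it can be recreated for new papers.
with open("vocabulary.pkl", "wb") as pf:
    pickle.dump(vocabulary, pf)

# Reading it back: HDF5 supports partial I/O, but here we load fully.
with h5py.File("tfidf.h5", "r") as f:
    restored = sp.csr_matrix(
        (f["data"][:], f["indices"][:], f["indptr"][:]),
        shape=tuple(f.attrs["shape"]),
    )

with open("vocabulary.pkl", "rb") as pf:
    restored_vocab = pickle.load(pf)
```

Because HDF5 datasets support slicing, the `data`/`indices`/`indptr` arrays can also be read partially when only a subset of rows is needed.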