Launching an instance in Google cloud and using user scripts

Many of you may remember, a while ago I had written about my experiences with AWS.  Recently, I was asked to launch a Google Cloud Instance,perform some tasks, output the files to a Google cloud bucket, and then shut down the instance. It was a fun experience as I had never used Google cloud and was curious to use it, and compare and contrast it with AWS.  The specs were:

  •   Load an Ubuntu OS.
  •   Create the instance in the Western US region.
  •   Specify a 10 GB boot disk.
  •   Explicitly specify option to delete boot disk when the VM instance is deleted.
  •   Provide read-write access to Google Cloud Storage.
  •   Run a startup script on the instance as soon as it is launched that does the following:
  •        –  download some publicly available tools and install them
  •         – run a compute job and copy the output to  google cloud bucket and share it publicly
  •          – Get generic memory/availability info about the instance, output it to a file and upload it to the cloud bucket.
  •         – shutdown the instance

Google cloud services are very similar to the AWS and if you are familiar with AWS,  Google Cloud Platform for AWS Professionals  can be a really useful site. While the syntax and semantics of the SDK, APIs, or command-line tools provided by AWS and Google Cloud Platform, the underlying infrastructure and logic is very similar.

One of the first things you need to do is download, install and initialize the cloud SDK on your local machine. It contains gcloud, gsutil, and bq, which you can use to access Google Compute Engine, Google Cloud Storage, Google BigQuery, and other products and services from the command-line. You can run these tools interactively or in your automated scripts (as you will see later in this example). The SDK has a set of properties that govern the behavior of the gcloud command-line tool and other SDK tools. You can use these properties to control the behavior of gcloud commands across invocations.

You also need to create a  gcloud account if you dont have one – and its really helpful if you have a gmail account that you can connect it to. This way when you initialize gcloud on your local machine, you can authorize access to Google Cloud Platform using a service account. The gcloud console is very similar to the AWS dashboard and you can view/control all your instances/resources from the console. The console is  an easy to use interface and quite intuitive. Most of what you can do from the console can be done from the command line or using a Rest API. However, one of the things you cannot do from the command line is create a ‘Project‘ – and this is something that is different in gcloud compared to AWS. You must first create a project, get a project id which you have to reference when you spin up instances  directly from the command line.

Initializing/configuring the SDK is fairly straightforward- just follow the instructions. I set up my default region to be us-west1a and set up my default project as “stanford”. I was given a project id “stanford-147200″ and I needed to reference that every time I wanted to access an instance associated with that project, start a new instance associated with that project etc. Note: you can set up multiple configurations for multiple projects and move between them quite easily because of this partitioning into “projects”.  Its like creating multiple virtual environments!

If you want to store the output from the jobs run on the instance, you need to set up a cloud bucket.  This is analogous to to s3 on AWS. This can be done via the console or the command line using the gsutil. You can create a bucket directly  from the instance as well (as part of a script) -which is what I will do. Note: when you launch an instance, you have to specify the project id, and the bucket is created for that project.

So lets get started.  I have SDK and gsutil installed on my machine. My project id is “stanford-147200″.  I can launch an instance in the Western zone with an Ubuntu OS and specific properties  with the following command:

gcloud compute --project "stanford-147200" instances create "stanford2" \
--zone "us-west1-a" \
--boot-disk-auto-delete \
--machine-type "f1-micro" \
--metadata-from-file startup-script="" \
--subnet "default" --maintenance-policy "MIGRATE" \
--scopes default="" \
--tags "http-server","https-server" \
--image "/ubuntu-os-cloud/ubuntu-1404-trusty-v20161020" \
--boot-disk-size "10" --boot-disk-type "pd-standard" \
--boot-disk-device-name "stanford2"

So what do the options mean? – breaking it down:
gcloud compute : using the command line interface to interact with the Google compute                                            facility
 –project “stanford-147200” instances create “stanford2” : create an instance called                                                                                       ‘stanford2’ under project id “stanford-147200”
–zone “us-west1-a” :specifically create the instance in the Western US Region
 –boot-disk-auto-delete : Delete the boot disk when VM instance is deleted -note this is                                                                                 the default
–machine-type “f1-micro” : specify machine type from the choice of publicly available                                                                                    images
 –metadata-from-file startup-script=”” : metadata to be made available to                                                                              the guest operating system running on the instance                                                                                 is uploaded from a local file
 –subnet “default” : Specifies the network that the instances will be part of. Default is                                                                                      “default”
–maintenance-policy “MIGRATE” :Specifies the behavior of the instances when their                                                                                     host machines undergo maintenance. The default                                                                             is MIGRATE.
 –scopes default=”” : specifies the                                                                                     access scopes of the instance- I will allow it to                                                                                       access all cloud API’s- this will allow it to                                                                                                 read/write from the Google storage/BigQuery and                                                                                 all other Cloud services. This is very wide access –                                                                                One can specify it to only access specific services
–tags “http-server”,”https-server” : allow http and https traffic ie open port 80 and                                                                                       443.
 –image “/ubuntu-os-cloud/ubuntu-1404-trusty-v20161020” : Use this publicly                                                                                              available image to create your instance -Using                                                                                       image with Ubuntu OS
–boot-disk-size “10” –boot-disk-type “pd-standard” : specify boot disk size to 10GB
–boot-disk-device-name “stanford2” : The name the guest operating system will see                                                                                  for the boot disk as.


You can go to the cloud console and see your instance being booted up, and its CPU usage. Furthermore, the instance will upload the run the startup script we have specified, change its permissions and run it. So what is in the Startup-script? Its a bash script :

#! /bin/bash
# Description: SCGPM Cloud Technical challenge (2/2). 
# This is the startup script that is called by the Google
# Compute Instance launcher script.
# Author: Priya Desai 
# Date: 10/25/2016

## behave like superuser
sudo apt-get update

## usually need to make sure the the java version installed is 1.8 or higher. Download the JDK from Oracle website 
# thanks to this StackOverflow post How to automate download and installation of Java JDK on Linux?.
sudo wget --no-check-certificate --no-cookies --header "Cookie: oraclelicense=accept-securebackup-cookie" -qO- | tar xvz
# create symbolic link to allow update of Java/access the directory without changing the rest
sudo ln -s jdk1.8.0_05 jdk
#cd ~

## Install picard-tools
sudo wget -O picard.jar

## Copy bam & bai files from public bucket
sudo gsutil cp gs://gbsc-gcp-project-cba-public/Challenge/noBarcode.bai .
sudo gsutil cp gs://gbsc-gcp-project-cba-public/Challenge/noBarcode.bam .
## Run picard-tools BamIndexStats and write results to "bamIndexStats.txt"
sudo jdk/bin/java -jar picard.jar BamIndexStats INPUT=noBarcode.bam >> bamIndexStat.txt


echo "Memory information": >> instance_info.txt
## Append memory info for current instance to "instance_info.txt"
free -m >>instance_info.txt
echo " " >> instance_info.txt
echo "cat /proc/meminfo:">> instance_info.txt
cat /proc/meminfo >>instance_info.txt
echo " " >> instance_info.txt
echo " " >> instance_info.txt
echo "Filesystem information": >> instance_info.txt
## Append filesystem info for current instance to "instance_info.txt"
df -T>>instance_info.txt

echo " ">>instance_info.txt
mount >>instance_info.txt
echo "mount output: "

## Upload the two output files to Google Cloud Storage bucket
##This bucket was created earlier- Can create a bucket in the script using gsutil mb 
sudo gsutil mb gs://stanford-cloud-challenge-priya/
sudo gsutil cp *.txt gs://stanford-cloud-challenge-priya

## Shutdown instance. Note: this just shuts down the instance-not delete it.
sudo shutdown -h now

Most of the script was reasonably straight forward, but its important to note the following interesting points:

  1. how to install Java 1.8 (needed to run Picard tools) on the command line: Oracle has put a prevention cookie on the download link to force you to agree to the terms even though the license agreement to use Java clearly states that merely by using Java you ‘agree’ to the license….but there is a workaround -thanks to the above StackOverflow site!  I could get Java 1.8 to install be using:                                                        “sudo wget –no-check-certificate –no-cookies –header “Cookie: oraclelicense=accept-securebackup-cookie” -qO- | tar xvz”
  2. The cloud storage bucket can be created in the script, but in order to make it publicly accessible, you need to set the permissions correctly. You can do this on the console or on the command line using gsutil. I do this in the script itself. My bucket is called ‘stanford-challenge-priya’ ,  and to make  the bucket is accessible as: gs://stanford-challenge-priya/ , I edit the permissions with:
     gsutil acl ch -u AllUsers:R gs://stanford-cloud-challenge-priya/

3 . Its important to realize that the user_script runs on the instance as “root”- You can             see this by getting the output of whoami. However, when you log into your instance,              you  log in as a user (with your username). Hence you will not see the files that your                  script may  have downloaded- but if you change to superuser status, you can see the              files that have  been downloaded.

4. You can shutdown the instance, after your jobs have completed using:

                           sudo shutdown -h now

Note- this only shuts down the instance , it does not delete it. So you could restart it  and you will find all your files still there.

5. user_script is a great way for you to remotely, launch an instance, configure it, run jobs and shut it down so you don’t waste money keeping an instance running. If you have very complex environments that take a while to configure, you could create an instance with the configuration you like and save it as an image, give it a name and then launch that image as you need, saving you trouble of having to configure it every time!






Lessons in Linear regression- Analytics Edge in python: Week 2

As I have said before, this course is in R- and trying to perform all the tasks in python is proving to be an interesting challenge. I want to document the new and interesting  things, I am learning along the way. It turns out that the way most generic way to run linear regression in R, is to use the lm function. As part of the summary results, you get a bunch of extra information like confidence intervals and p values for each of the coefficients,  adjusted R2, F statistic etc that you don’t get as part of the output-in sklearn -the most popular machine learning package in python. Digging a bit deeper, you can see why:

  • scikit-learn is doing machine learning with emphasis on predictive modeling with often large and sparse data.
  • statsmodels is doing “traditional” statistics and econometrics, with much stronger emphasis on parameter estimation and (statistical) testing. Its linear models, generalized linear models and discrete models have been around for several years and are verified against Stata and R – and the output parameters are almost identical to what you would get in R.

For this project; where I am trying to translate R to python; statsmodels is a better choice. Statsmodels can be used by importing statsmodels.api, or the statsmodels.formula.api and I have played around with both. Statsmodels.formula.api uses R like syntax as well, but Statsmodels.api has a very sklearn -like syntax. I have explored using linear regression in a few different kinds of datasets: (github repo)

  • Climate data
  • Sales data for a company
  • Detecting flu epidemics via search engine query
  • Analyzing Test scores
  • State data including things like population, per capita income,murder rates etc

There are a lot of wonderful  blogs/tutorials on Linear Regression. Some of my favorite are:

What is different about Analytics edge is that that is a lot of emphasis on the intuitive understanding of the output of the regression algorithm.

So what is a linear regression model?     

This question has confused me at times because :

  • y=Bo+B1X1+B2X2
  • y=B0+B1(lnX1)
  • y=Bo+B1X1+B2X**2   are all linear models, but they are not all linear in X !!

From  what I understand now, a regression model is linear, when it is linear in the parameters. What that means is there is only one parameter in each term of the model and each parameter is a multiplicative constant on the independent variable. So in the above example, (lnX1) is the parameter or variable, and the function is linear in that parameter. Similarly, (X**2) is the variable and the function is linear in that  variable.

Examples of non-linear models:

  • y=B1.exp(-B2.X2)+K
  • y=B1+(B2-B1)exp(-B3X) +K
  • y= (B1X)/(B2+X) +K

A really important analysis to perform before you run any multivariate  linear regression model is to calculate the correlations between different variables -this is if you include because highly correlated variables as independent variables in your model, you would get some unintuitive an d inaccurate answers!! For example, if I run a multiple regression model to predict average global temperature using atmospheric concentrations of most of the greenhouse gases available from the Climatic Research Unit at the University of East Anglia, this is my output:


If we examine the output, and look at the coefficients and their corresponding  p_values, ie the P>|t| column, we see that  according to this model, N2O and CH4 which are both considered to be green house gases, are not significant variables!  ie the P>|t| values for those coefficients is higher that the rest.

So what is  P>|t|?                                                                                                                                           P >|t| is  probability that the coefficient for that variable is Zero, given the data used to build the model. The smaller this probability, the less likely the coefficient is actually Zero. We want independent variables with very small values for this. Note that if the coefficient of a variable is almost Zero, that variable is not helping to predict the dependent variable.

Furthermore, the coefficient of N2O is negative, implying that the global temperature is inversely proportional to the atmospheric concentration of N2O. This is contrary to what we know!  So what is going on?

It becomes clear when  we examine the correlations between the different variables:

screen-shot-2016-09-08-at-1-40-13-pmWe can see that there is  a really high correlation between N2O and CH4:                 corr(N2O,CH4)=0.89  

Ok, so what does that mean?  Multicollinearity refers to the situation when two independent variables are highly correlated – and this can cause the coefficients to have non-intuitive signs. What that means is that if, N2O and CH4 are highly correlated, it may be sufficient to include only one of them in the model. We should look at other highly correlated variables.

Our aim is to find a linear relationship between the fewest independent variables that explain most of the variance.

How do you measure the quality of the regression model?

  • Typically when people say linear regression, they are using Ordinary Least Squares (OLS) to fit their model. This is what is also called the L2 norm. The goal is to find the line which minimizes the sum of the squared errors.
  • If you are trying to minimize the sum of the absolute errors, you are using the L1 norm.
  • The errors or residuals are difference between actual value and  predicted value. Root mean squared error (RMSE ) is preferred over Sum of Squared Errors (SSE) because it scales with n. So if you have double the number of data points, and thus maybe double the  SSE, it does not imply that the model is twice as bad!
  • Furthermore, the units of SSE are hard to understand-RMSE is normalized by n and has the same units as the dependent variable.
  • R2: the regression coefficient is often used as a measure of how good the model is. R2 basically compares how the model to a  baseline model ( which does not depend on any variable) and is just the mean of the dependent variable (ie a constant). So R2 captures the value added from using a regression model.                                                         R2= 1 – (SSE/SST)      where SST=total sum of squared errors using the baseline model.
  • For a single variable linear model, R2= Corr( independent Var, dependent Var) **2.
  • When you use multiple variables in your model, you can calculate something called Adjusted R2: the adjusted R2 accounts for the number of independent variables used relative to number of data points in the observation. A model with just one variable, also gives an R2 -but the R2 for a combined linear model is not equal to the sum of R2’s of independent models (of individual independent variables). ie R2 is not additive. 
  • R2 will always increase as you add more independent variables – but adjusted R2 will decrease if you add an independent variable that does not help the model (for example if the newly added  variable introduces collinearity issues!) This is good way to determine if an additional variable should even be included ! However, adjusted R2 which penalizes model complexity to control for overfitting, generally under penalizes complexity. The best approach to feature selection is actually Cross validation.
  • Cross-validation provides a more reliable estimate of out-of-sample error, and thus is a better way to choose which of your models will best generalize to out-of-sample data. There is extensive functionality for cross-validation in scikit-learn, including automated methods for searching different sets of parameters and different models. Importantly, cross-validation can be applied to any model, whereas the methods described above (ie using R2) only apply to linear models.

For skewed distributions it is often useful to predict the logarithm of the dependent variable instead of the dependent variable itself. This prevents  the small number of unusually large or small observations from having an undue effect on the sum of squared errors of predictive models. For example, in the Detecting Flu Epidemics via Search engine query data  python notebook, a histogram of  the percentage of Influenza like Illnesses (ILI) related physician visits is:

screen-shot-2016-09-08-at-5-32-56-pm     suggesting the distribution is right skewed- ie most of the ILI values are small with a relatively large number of values being large. However a plot of the natural log of ILI vs Queries shows that their probably is a positive linear relation between ln(ILI) and Queries.

:  screen-shot-2016-09-08-at-5-37-16-pm

Time Series Model: 

In Google Flu trends problem set, initially we attempt to model log(ILI) as a function of Queries:


but since the observations in this dataset are consecutive weekly measurements of the dependent and independent variable, it can be thought of as a time-series.  Often Statistical models can be improved by predicting current value of the dependent variable using past values. Since most the ILI data has a 1-2 week lag- we need will use data from 2 weeks ago and earlier. Well how many such data points are there -ie  many values are missing from this variable ?  Calculating lag:


Plotting the log of ILI_lag2 vs log IL1, we notice a strong linear relationship.


We can thus use both Queries and log(ILI_lag2), to predict log(ILI) and this significantly improves the model- increasing the R2 from 0.70 to 0.90. and the RMSE on the test set also reduces significantly.

Running regression models using all numerical variables is interesting, but one can use Categorical variables as well in the model!

Encoding Categorical Variables : To include categorical features in a Linear regression model, we would need to convert the categories into integers. In python, this functionality is available in DictVectorizer from scikit-learn, or “get_dummies()” function. One Hot encoder is another option. The difference is as follows:

  1. OneHotEncoder takes as input categorical values encoded as integers – you can get them from LabelEncoder.
  2. DictVectorizer expects data as a list of dictionaries, where each dictionary is a data row with column names as keys

One can also use Patsy (another python library) or get_dummies() see- Reading test scores. If you are using  statsmodels, you could also use the  C() function (see Forecasting Elantra sales)

Pandas ‘get_dummies’ is I think the easiest option. For example, in the Reading Test Scores Assignment, we  wish to use the categorical variable ‘Race’ which has the following values:


Using get_dummies:


But we need only k-1 = 5 dummy variables for  the Race categorical variable. The reason is if we were to make k dummies for any of your variables, we would have a collinearity. One  can think of the k-1 dummies as being contrasts between the effects of their corresponding levels, and the level whose dummy is left out. Usually the most frequent category is is used as the reference level.

Other great Resources:

edX’s Analytics Edge in Python: exploring the power of pandas ‘groupby’ and value_counts!

There is an amazing course for beginners in Data Science on edX by MIT: Analytics Edge. The material is great – the assignments are plentiful and I think its great practice – my only problem is that its in R-and I have decided to focus more on python. I decided it would be an interesting exercise to try and complete all the assignments in python  -and boy it it has been so worthwhile! I’ve had to hunt around for R-equivalent code/ syntax and realized that there are some things that are so simple in R but convoluted in python!

I will be posting all my python notebooks on github (see github repo: Analytiq Edge in Python), along with the associated data files as well as the assignment questions. This blog post deals with data analysis of assignments posted in Week 1.

I had not understood the power of ‘value_counts’  and ‘groupby’ command in python. They are really useful and powerful. For example, in the Analytical detective notebook, where we are analyzing Chicago street crime data from 2001-2012, and we need to figure out which month had the  most arrests, one can create a ‘month’ column using the lambda function and then plot the value_counts.

Screen Shot 2016-08-19 at 9.41.07 PMScreen Shot 2016-08-19 at 9.43.04 PM

To find what the trends were over the 12 years -its useful to create a boxplot of the variable “Date”, sorted by the variable “Arrest” showing that the number of arrests made in the first half of the time period are significantly more, though total number of crimes is more similar over the first and second half of the time period.

Screen Shot 2016-08-19 at 9.48.15 PM

Another way to check if that makes sense, is to plot the number of arrests by year:

Screen Shot 2016-08-19 at 9.50.30 PM

groupby- with MultiIndex 

With hierarchically indexed data, one can group by one of the levels of the hierarchy. This can be very useful. For example, to answer the question:  “On which day of the week do the most motor vehicle thefts at gas stations happen? “,  we can first define a new dataframe as:Screen Shot 2016-08-24 at 11.39.33 AM

and then groupby level 0 and then sum-note we are not asking when most arrests happen, but most thefts happen-so we need the sum of arrests and no arrests!Screen Shot 2016-08-24 at 11.42.22 AM

Pretty cool huh?

Assignment using ‘Demographics and Employments in the US’ dataset also uses some neat usage of groupby. For example to find ‘How many states had all interviewees living in a non-metropolitan area (aka they have a missing MetroAreaCode value)?’, one can doScreen Shot 2016-08-24 at 11.46.49 AM

and if one wants just the list of states:

Screen Shot 2016-08-24 at 11.48.49 AM.png

To get how many states had all interviewees living in a metropolitan area, ie urban and all rural:

Screen Shot 2016-08-24 at 11.53.00 AM.png

which region of the US had largest proportion of interviewees living in a non metropolitan area? One can even find proportions:

Screen Shot 2016-08-24 at 11.57.09 AMThe dataset with stock prices ( see Stock_dynamics.ipynb)  is really useful the play around with different plotting routines.  Visualizing the stock prices of the five companies over  a 10 year span -and seeing what happened after the Oct 1997 crash:

Screen Shot 2016-08-24 at 12.06.25 PMUsing groupby to plot monthly trends:

Screen Shot 2016-08-24 at 12.08.59 PM

Pretty cool I think!

Take  a look at the datasets- and have fun!


AWS and Django intricacies – I wish I had known earlier….

I have recently taken up managing the AWS deployment of our website. Our current site is powered by the python’s Django web framework and while that is a great framework to use, it has a somewhat steep learning curve. Atleast it did for me!

The recurring memory leak bug:

We had one AWS instance that hosted the site and also ran two cron jobs in the background. We had another AWS machine (pretty powerful one) which ran a bunch of cron jobs that mostly consisted of periodically downloading data from certain repositories, and storing them into an RDS database  as well as downloading twitter data every hour. This job performed  multiple reads and writes into the database. Those mostly seemed to work great – except we noticed something rather peculiar: that almost on cue, the jobs would crash every 72 hours and had to be manually restarted!  We put in diagnostics to monitor the memory usage after every job was completed, and noticed that the memory usage would rise continously (even when a job was finished) till it reached ~99% – and then we would get a segmentation fault. I went over the code multiple times,  learnt about python gc ( a very useful utility by the way), went over the garbage collection ..but just could not figure out what was going on.

I was not the author of the original django code, and could not find anything wrong with the logic and was at my wits end. Finally I discovered what the problem was – (yes of course, on StackOverflow). It turns out that the usual way of coding up your web application (in django) while it is still in development is done keeping DEBUG=TRUE in the django settings file. This allows the developer to run tests- but it also means that host validation is disabled. Note: this can be very dangerous in production, as any host is now accepted and it makes your site/database vulnerable to hacks. Well our site was live; with DEBUG=TRUE!!  Ok- thats bad ..but how did that relate to do with the problem we were having?? or did it? Turns out it did. Here is the other less talked about side effects of that:  apparently when DEBUG=True,  django will store all the SQL queries you make which over time adds up, and looks suspiciously like a memory leak!! So if you have a lot of reads and writes to the database – it stores all those connections as using up the machine’s CPU- till it just crashes! Yes making DEBUG=False did fix the majority of that problem. Here are some really good links:

Ah- but my woes did not end there. We had a domain name and DNS provider, and had pointed the domain names to the AWS server. This worked fine when debug=True; but after I made debug =False, suddenly, I could not access my site on the browser. I could see the django code was running on the sever, but trying to access my domain name gave me a 404 Error.  Remember, I had said that debug=True accepted all hosts. Well, if you make debug=False, the ALLOWED_HOSTS field in which by default is empty, needs to be populated with the list of strings representing the host/domain names that this Django site can serve (see allowed_hosts). The reason this variable exists is to prevent an attacker from poisoning caches and password reset emails with links to malicious hosts by submitting requests with a fake HTTP Host header, which is possible even under many seemingly-safe webserver configurations.

Ok , great that was figured out but I still could not access our site using the domain name. If I put ALLOWED_HOSTS=[*] i.e wildcard, I could see the site , but not if I put in the actual domain name or server IP address. But doing that is a bad idea, as it basically renders the feature useless. Very strange.

Remember I said we were hosting this on AWS, and my AWS instance had an elastic IP address. Moreover, I was putting it behind an elastic load balancer. Well AWS and Django have  a peculiar dance that you need to know about to make this work. Since the internal IP address the EC2 instance uses could change over time and because we want our settings to work no matter how many instances we spin up, use  ec2metadata to dynamically get and add the internal IP to ALLOWED_HOSTS. This still gives us the same security/traffic benefits because the IP space is reserved for internal networks only; meaning that external web traffic cannot easily fake your internal IP address when requesting URIs.You can use the  python-requests library.

Add the following to you file

import requests
 EC2_PRIVATE_IP = requests.get('', timeout = 0.01).text
except requests.exceptions.RequestException:

Note that ‘’ is the IP address you would use , regardless of the elastic IP of you web server.

Many many thanks to this blog –  and AWS customer service for helping me figure out how to make this work!

HDF5, Pytables and Sparse Matrices: Optimizing calculations

Migrating Storage to HDF5/Pytables

Very often, part of trying to provide on demand, instantaneous recommendations involves optimizing calculations between large matrices. In our case, one of the most challenging problems was reading in, and efficiently calculating cosine similarity between tfidf features for every pair of documents in the two corpuses – the papers that could “potentially” be recommended and the users library of papers that they were interested in.   Most recommendation engines try to ease this computational load by doing as much of the prep work before as possible.  Its almost like running a good restaurant – you try to pre-cook (read pre-calculate) all the potential sauces, vegetables (read possible features ) you may need, so you can whip up a great meal (read recommendations) with minimum delay. For us, this is how we handled it:

  • First create a dictionary of possible words from a corpus containing all abstracts from pubmed articles from a couple of years (after removing the  common english  stopwords).
  • Take the corpus of potential recent papers (that could be recommended)  and vectorize it ( i.e. basically represent the collection of text documents as  numerical feature vectors.) There are many ways of computing the term-weights or values of the vector for a document, and after experimenting with a few, we went with one of the most commonly used weighting schemes tf-idf.  We used python’s  scikit-learn library.
  • The actual word-similarity calculation often involved multiplication between matrices of the order of ~650000x 15000 and these reading these in and multiplying them can become both computationally very intensive and impossible to store in memory.
  • Enter HDF5, sparse matrices and pickle to the rescue !
  • Hierarchical Data Format version 5, or “HDF5,” is a great mechanism for storing large quantities of numerical data which can be read in rapidly and allow for sub-setting and partial I/O of datasets. H5py package/PyTables is a Pythonic interface to the HDF5 binary data format which allows you to easily manipulate that data from NumPy.  Thousands of datasets can be stored in a single file,  and you can interact with them in a very pythonic way: e.g you can iterate over datasets in a file, or check out the .shape or .dtype attributes of datasets.  We chose to store our tf-idf matrices in the hdf5 format.
  • Given that we were creating our “vectors” for each paper abstract using all the words in the dictionary; it was very likely that the matrix was very sparse and contained a lot of 0’s. Scipy’s sparse matrices are regular matrices that  only store  non-zero elements, thus greatly reducing the memory requirements. We stored our sparse tf-idf matrices in  using scipy’s sparse matrices in HDF5 format. Further more, Scipy’s sparse  calculations are coded in C/C++ thus  significantly  speeding performance.
  • pickle and cpickle (written in C and about ~1000 times faster)  is a python module that implements a fundamental, but powerful algorithm for serializing and de-serializing a Python object structure i.e it is the process of converting a python object into a byte stream in order to store it in a file/database, maintain program state across sessions, or to transport data over the network. Pickle uses a simple stack-based virtual machine that records the instructions used to reconstruct the object. It is incredibly useful to store python objects and we used it to store the Word dictionary, and easily recreate it everytime we had new papers we needed to convert to word vectors.

Data Science Skills checklist:

What to learn, in what order?

As I try to re-inspire myself to be regular about blogging – and documenting what I learn …

More really good resources:

running a ipython Notebook on AWS

I recently started using iPython notebooks …they are so cool! (… but it took some finagling to figure out how to get an ipython Notebook running on an AWS. I struggled with this some as my data was on an AWS server  and while I could “run” the ipython notebook on the server, I could not “view” it on my local browser ! My AWS server was on a VPC and setting up a profile_nbserver did not work.

Its a really neat hack:  the basic idea is that you use the AWS default settings (ie configuration from/home/ubuntu/.ipython/profile_default), not any particular  profile_nbserver setting (which is what the tutorials ask you to do.)  Essentially you  you want to  run the  ipython notebook loaclly on the AWS server and mirror it on your local machine. From what I understand, we are remote port forwarding, ie transferring a remote (AWS)  connection to your machine.

So after installing ipython on your AWS server;

  • On the AWS machine terminal run:  
    • ipython notebook  – this runs the the default configuration (not ipython notebook inline –profile=nbserver )
    • on your local machine type:                                                                                                                                                     ssh -L 9000:localhost:8888 (or whatever                                      your server public DNS name is) – oh and of course make sure you have a public IP address!
    • open your favorite browser and go to                                                                                                                                        http://localhost:9000/

and  Voila!! enjoy!

becoming a Data Scientist — more resources online

The online world is exploding with tutorials, how-to manuals and free resources to help you get into the world of Data Science. In the last 5 months I have seen the addition of so many new MOOCs – looks like every university wants to make sure they don’t get left behind. But this is great news for the consumers- most of these programs are pretty darn good!

Newer offerings on Coursera, Udacity and EdX:                                                                                         In addition to JHU’s Data Science Specialization, here is a partial list of more “data related” specializations (which typically consist of 4-10 courses followed by a project) ie “mini degrees ” you can rack up. Coursera offers “Specializations“, Udacity offers “Nanodegrees” and EdX offers “XSeries“:

Also a  host of new short “paid” programs have sprung up. theses are typically 6-12 week long bootcamps. They typically range between $10,000-$15,000 and can be competitive to get in.

Here is a link to a bootcamp finder in the city of your choice for a price that works for you! And here is another link great link List of data sciencebootcamps

The Data Science Apprenticeship is free, and another way to go. They have  a really nice comprehensive list of Data Science Resources and a cool cheatsheet.

Generating Topic and Personal recommendations

Here is an overview of the features we wanted to use to determine the “score” of a paper that we would then rank and output to the user as recommendations.

recommendation system overview


We also decided that since we had created  these “topics”,  and were running the LDA inferencer on all the new papers everyday classifying  them into topics, we would provide topic based recommendations as well- so if  a new user came in, and was browsing the topics- they could see the top papers in that topic. Ofcourse,  in addition to having high topic probability, these papers were ranked by recency of publication, impact factor of their host journal and tweet counts (if any)!

For personalized recommendations, we decided  we would first use topic similarity between the users papers (or library)  and the corpus of all recent papers to filter or shortlist possible candidate papers to recommend, and then use word similarity to further refine the selection. The final ranking would use our special ‘sauce’ based on tweet counts, date of publication, author quality etc to order these papers and present to the user!

This involved connecting various pieces of the pipeline and by September 2014,  we had a working pipeline that generated and  displayed topic recommendations and library recommendations (if a user had uploaded a personal library) on the website!!


Here is a list of books/talks I found useful:                                                                                                           Introduction to Recommender systems  (coursera)                                                                                    Intro to recommender Systems – a four hour lecture by Xavier Amatriain                                         Coursera: Machine Learning class:  Section on Recommender systems

Moving to AWS…

As the project grew, we started downloading tweets from various journal websites and tried to set up an algorithm to parse tweets that were related to particular papers and link them to the paper; thereby producing another “metric” to compare papers by. In addition; we started keeping a detailed records all the authors of the papers and attempted to create a citation database. As complexity grew; we found that our local server was too slow and after some research; decided to take the plunge and move our stuff to AWS-the Amazon Web Server. We created a VPC (a virtual Private cloud), moved our database to Amazon’s RDS (Relational Database Service) and created buckets or  storage on Amazon’s S3 (Simple Storage Service). Its relatively easy to do and the AWS documentation is pretty good. What I found really helpful were the masterclass webinar series.

I launched a Linux based instance and then installed all the software versions I needed like python2.7, pandas, numpy, ipython-pylab, matplotlib, scipy etc . It was interesting to note that on many of the amazon machine, the default python version loaded was 2.6, not 2.7. I scouted  the web a fair bit to help me configure my instance the am sharing soem of the commands below

General commands to install python 2.7  on AWS- should work on most instances running Ubuntu/RedHat Linux:                                   

Start python to check the installation unicode type. If you have to deal with a fair amount of unicode data like I do then make sure you have the “wide build” . I learned this the hard way.                                                     

  • >>import sys  
  • >>print sys.maxunicode

It should NOT be 65564

  • >>wget  
  •   >> unzip          
  • >> sudo ./awscli-bundle/install -i /usr/local/aws -b /usr/local/bin/aws
  • # install build tools
  • >>sudo yum install make automake gcc gcc-c++ kernel-devel git-core -y
  • # install python 2.7 and change default python symlink
  • >>sudo yum install python27-devel -y          
  • >>sudo rm /usr/bin/python  
  • >>sudo ln -s /usr/bin/python2.7 /usr/bin/python
  • # yum still needs 2.6, so write it in and backup script
  •  >>sudo cp /usr/bin/yum /usr/bin/_yum_before_27  
  •  >>sudo sed -i s/python/python2.6/g /usr/bin/yum                                                                                                                                                                       
  • #This  should display now 2.7.5 or later:                                                                                                       >>python  
  •   >>sudo yum install httpd
  • # now install pip for 2.7
  • >>sudo curl -o /tmp/
  • >>sudo /usr/bin/python27 /tmp/
  • >>sudo /usr/bin/easy_install-2.7 pip  
  • >>sudo pip install virtualenv
  • >>sudo apt-get update    
  •  >>sudo apt-get install git
  • # should display current versions:                                                                                                                   pip -V && virtualenv –version
  • Installing all the python library modules:
  • sudo pip install ipython
  • sudo yum install numpy scipy python-matplotlib ipython python-pandas sympy python-nose
  • sudo yum install xorg-x11-xauth.x86_64 xorg-x11-server-utils.x86_64
  • sudo pip install pyzmq tornado jinja2
  • sudo yum groupinstall “Development Tools”
  • sudo yum install python-devel
  • sudo pip install matplotlib
  • sudo pip install networkx
  • sudo pip install cython
  • sudo pip install boto
  • sudo pip install pandas                    
  • Some modules could Not be loaded using pip, so use the following instead:                       >>sudo apt-get install python-mpi4py python-h5py python-tables python-pandas python-sklearn python-scikits.statsmodels 
  • Note that to install h5py or pytables you must install the following dependencies first:
  • -numpy
  • -numexpr
  • -Cython
  • -dateutil
  • HDF5
  • HDF5 can be installed using wget:
  • >> wget        
  •  >>tar xvfz hdf5-1.8.9.tar.gz; cd hdf5-1.8.9   
  •   >> ./configure –prefix=/usr/local
  • >> make; make install
  • Pytables
  • pip install git+

**to install h5py make sure hdf5 is in the path.

I really liked the concept of AMI’s : you create a machine that has the configuration that you want and then create an “image” and can give it a name. You have then created a virtual machine that you can launch anytime you want and as many copies of as you want! What is also great is that when you create the image, all the files that may be in your directory at that time are also part of that virtual machine …so its almost like you have a  snapshot of your environment at the time. So create images often (you can delete old images ofcourse!) and you can launch your environment at any time.  Of course you can and should always upload new files to S3 as that it you storage repository.

Another neat trick was to learn that if you can install EC (Elastic Cloud) CLI(Command Line Interface) on your local machine, and set up your IAM you don’t need the you don’t need the .pem file!! Even if you are accessing an AMI in a private cloud; when you set up the instance make sure you click on “get public IP” when you launch your instance and log in as ssh –Y ubuntu@ You can them enjoy xterm and the graphical display from matplotlib, as if you are running everything  on your local terminal. Very cool indeed!