edX’s Analytics Edge in Python: exploring the power of pandas ‘groupby’ and value_counts!

There is an amazing course for beginners in Data Science on edX by MIT: Analytics Edge. The material is great, the assignments are plentiful, and I think it's great practice. My only problem is that it's in R, and I have decided to focus more on python. I decided it would be an interesting exercise to try and complete all the assignments in python, and boy, it has been so worthwhile! I’ve had to hunt around for R-equivalent code and syntax, and realized that there are some things that are so simple in R but convoluted in python!

I will be posting all my python notebooks on github (see github repo: Analytiq Edge in Python), along with the associated data files as well as the assignment questions. This blog post deals with data analysis of assignments posted in Week 1.

I had not understood the power of the ‘value_counts’ and ‘groupby’ commands in pandas. They are really useful and powerful. For example, in the Analytical Detective notebook, we analyze Chicago street crime data from 2001-2012 and need to figure out which month had the most arrests. One can create a ‘month’ column using a lambda function and then plot the value_counts.

[Screenshots]
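As a rough sketch of the month-column approach described above: the rows below are made up for illustration, and the column names `Date` and `Arrest` are assumed to match the data in the notebook.

```python
import pandas as pd

# Made-up rows standing in for the Chicago crime data (illustration only).
crimes = pd.DataFrame({
    "Date": pd.to_datetime([
        "2001-01-15", "2001-01-20", "2005-01-03",
        "2006-07-04", "2010-07-19", "2012-12-31",
    ]),
    "Arrest": [True, True, True, False, True, False],
})

# Derive a 'Month' column (1-12) from each timestamp with a lambda.
crimes["Month"] = crimes["Date"].apply(lambda d: d.month)

# Keep only rows where an arrest was made, then count rows per month.
# value_counts sorts by count, so the month with the most arrests comes first.
arrests_by_month = crimes.loc[crimes["Arrest"], "Month"].value_counts()
print(arrests_by_month)
```

Calling `.plot(kind="bar")` on the result should draw the counts, provided matplotlib is installed.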

To see the trend over the 12 years, it's useful to create a boxplot of the variable “Date”, sorted by the variable “Arrest”. It shows that significantly more arrests were made in the first half of the time period, even though the total number of crimes is more similar between the first and second halves.

[Screenshot]

Another way to check whether that makes sense is to plot the number of arrests by year:

[Screenshot]

groupby with MultiIndex

With hierarchically indexed data, one can group by one of the levels of the hierarchy. This can be very useful. For example, to answer the question “On which day of the week do the most motor vehicle thefts at gas stations happen?”, we can first define a new dataframe as:

[Screenshot]

and then group by level 0 and sum. Note that we are not asking when most arrests happen but when most thefts happen, so we need the sum of the arrest and no-arrest counts!

[Screenshot]
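The two steps above can be sketched roughly as follows. The data is made up, the `Weekday` and `Arrest` column names are assumptions, and the intermediate object here is a Series with a MultiIndex rather than necessarily the same dataframe built in the notebook.

```python
import pandas as pd

# Made-up motor vehicle thefts at gas stations (illustration only).
thefts = pd.DataFrame({
    "Weekday": ["Saturday", "Saturday", "Saturday", "Monday", "Monday", "Friday"],
    "Arrest": [True, False, False, False, True, False],
})

# Counting by (Weekday, Arrest) gives a Series with a two-level MultiIndex.
counts = thefts.groupby(["Weekday", "Arrest"]).size()

# Level 0 is Weekday: summing within it adds the arrest and no-arrest
# counts together, giving the total number of thefts per day.
thefts_per_day = counts.groupby(level=0).sum()
print(thefts_per_day.sort_values(ascending=False))
```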

Pretty cool huh?

The assignment using the ‘Demographics and Employment in the US’ dataset also makes some neat use of groupby. For example, to find ‘How many states had all interviewees living in a non-metropolitan area (aka they have a missing MetroAreaCode value)?’, one can do:

[Screenshot]

and if one wants just the list of states:

[Screenshot]

To find how many states had all interviewees living in a metropolitan area (i.e. all urban rather than all rural):

[Screenshot]
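A minimal sketch of both state-level questions, using made-up rows and hypothetical state names. `MetroAreaCode` comes from the assignment text; the `State` column name is an assumption.

```python
import numpy as np
import pandas as pd

# Made-up interviewees; state names are placeholders (illustration only).
cps = pd.DataFrame({
    "State": ["StateA", "StateA", "StateB", "StateB", "StateC", "StateC", "StateD"],
    "MetroAreaCode": [np.nan, np.nan, np.nan, np.nan, 1.0, np.nan, 2.0],
})

# True where the interviewee lives in a non-metropolitan area.
missing = cps["MetroAreaCode"].isna()

# For each state: is the code missing for every interviewee?
all_non_metro = missing.groupby(cps["State"]).all()
print(all_non_metro.sum())                           # number of such states
print(all_non_metro[all_non_metro].index.tolist())   # just the list of states

# The mirror-image question: states where no interviewee has a missing code.
all_metro = (~missing).groupby(cps["State"]).all()
print(all_metro[all_metro].index.tolist())
```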

Which region of the US had the largest proportion of interviewees living in a non-metropolitan area? One can even find proportions:
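A minimal sketch of the proportion calculation, relying on the fact that the mean of a boolean Series is the fraction of True values. The rows and region names are made up, and the `Region` column name is an assumption.

```python
import numpy as np
import pandas as pd

# Made-up interviewees; region names are placeholders (illustration only).
cps = pd.DataFrame({
    "Region": ["RegionA", "RegionA", "RegionA", "RegionA", "RegionB", "RegionB"],
    "MetroAreaCode": [np.nan, 1.0, 2.0, 3.0, np.nan, 4.0],
})

# Fraction of interviewees per region with a missing MetroAreaCode.
non_metro_share = cps["MetroAreaCode"].isna().groupby(cps["Region"]).mean()
print(non_metro_share.sort_values(ascending=False))
```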

[Screenshot]

The dataset with stock prices (see Stock_dynamics.ipynb) is really useful to play around with different plotting routines. For example, one can visualize the stock prices of the five companies over a 10-year span and see what happened after the Oct 1997 crash:

[Screenshot]

Using groupby to plot monthly trends:

[Screenshot]
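One way such a monthly trend might be computed, sketched on made-up prices for a single company. The `Date` and `StockPrice` column names are assumptions, and the notebook may well do this differently.

```python
import pandas as pd

# Made-up monthly prices for one company (illustration only).
stock = pd.DataFrame({
    "Date": pd.to_datetime(["2000-01-01", "2000-02-01", "2001-01-01", "2001-02-01"]),
    "StockPrice": [100.0, 80.0, 110.0, 90.0],
})

# Group by calendar month (1-12) across all years and average the price.
monthly_mean = stock.groupby(stock["Date"].dt.month)["StockPrice"].mean()
print(monthly_mean)
```

Calling `.plot()` on `monthly_mean` should draw the trend, provided matplotlib is installed.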

Pretty cool I think!

Take a look at the datasets, and have fun!


