groupby | MyJourneyAsaDataScientist

There is an amazing course for beginners in Data Science on edX by MIT: Analytics Edge. The material is great – the assignments are plentiful and I think its great practice – my only problem is that its in R-and I have decided to focus more on python. I decided it would be an interesting exercise to try and complete all the assignments in python -and boy it it has been so worthwhile! I’ve had to hunt around for R-equivalent code/ syntax and realized that there are some things that are so simple in R but convoluted in python!

I will be posting all my python notebooks on github (see github repo: Analytiq Edge in Python), along with the associated data files as well as the assignment questions. This blog post deals with data analysis of assignments posted in Week 1.

I had not understood the power of ‘value_counts’ and ‘groupby’ command in python. They are really useful and powerful. For example, in the Analytical detective notebook, where we are analyzing Chicago street crime data from 2001-2012, and we need to figure out which month had the most arrests, one can create a ‘month’ column using the lambda function and then plot the value_counts.

Screen Shot 2016-08-19 at 9.41.07 PM Screen Shot 2016-08-19 at 9.43.04 PM

To find what the trends were over the 12 years -its useful to create a boxplot of the variable “Date”, sorted by the variable “Arrest” showing that the number of arrests made in the first half of the time period are significantly more, though total number of crimes is more similar over the first and second half of the time period.

Screen Shot 2016-08-19 at 9.48.15 PM

Another way to check if that makes sense, is to plot the number of arrests by year:

Screen Shot 2016-08-19 at 9.50.30 PM

groupby- with MultiIndex

With hierarchically indexed data, one can group by one of the levels of the hierarchy. This can be very useful. For example, to answer the question: “On which day of the week do the most motor vehicle thefts at gas stations happen? “, we can first define a new dataframe as: Screen Shot 2016-08-24 at 11.39.33 AM

and then groupby level 0 and then sum-note we are not asking when most arrests happen, but most thefts happen-so we need the sum of arrests and no arrests! Screen Shot 2016-08-24 at 11.42.22 AM

Pretty cool huh?

Assignment using ‘Demographics and Employments in the US’ dataset also uses some neat usage of groupby. For example to find ‘How many states had all interviewees living in a non-metropolitan area (aka they have a missing MetroAreaCode value)?’, one can do Screen Shot 2016-08-24 at 11.46.49 AM

and if one wants just the list of states:

Screen Shot 2016-08-24 at 11.48.49 AM.png

To get how many states had all interviewees living in a metropolitan area, ie urban and all rural:

Screen Shot 2016-08-24 at 11.53.00 AM.png

which region of the US had largest proportion of interviewees living in a non metropolitan area? One can even find proportions:

Screen Shot 2016-08-24 at 11.57.09 AM The dataset with stock prices ( see Stock_dynamics.ipynb) is really useful the play around with different plotting routines. Visualizing the stock prices of the five companies over a 10 year span -and seeing what happened after the Oct 1997 crash:

Screen Shot 2016-08-24 at 12.06.25 PM Using groupby to plot monthly trends:

Screen Shot 2016-08-24 at 12.08.59 PM

Pretty cool I think!

Take a look at the datasets- and have fun!

MyJourneyAsaDataScientist

About eighteen months ago I decided to leave astronomy, change my career trajectory and follow the Data Science Bandwagon- this is a blog about that ongoing journey…

Tag Archives: groupby

edX’s Analytics Edge in Python: exploring the power of pandas ‘groupby’ and value_counts!