
AWS and Django intricacies – I wish I had known earlier….

I have recently taken up managing the AWS deployment of our website. Our current site is powered by Python's Django web framework, and while that is a great framework to use, it has a somewhat steep learning curve. At least it did for me!

The recurring memory leak bug:

We had one AWS instance that hosted the site and also ran two cron jobs in the background. We had another AWS machine (a pretty powerful one) which ran a bunch of cron jobs that mostly consisted of periodically downloading data from certain repositories and storing it in an RDS database, as well as downloading Twitter data every hour. This job performed multiple reads and writes against the database. Those mostly seemed to work great, except we noticed something rather peculiar: almost on cue, the jobs would crash every 72 hours and had to be manually restarted! We put in diagnostics to monitor the memory usage after every job completed, and noticed that the memory usage would rise continuously (even when a job was finished) until it reached ~99%, and then we would get a segmentation fault. I went over the code multiple times, learnt about Python's gc module (a very useful utility, by the way), went over the garbage collection, but just could not figure out what was going on.
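
For illustration, here is a minimal sketch of the kind of per-job diagnostic we used; the job runner and labels are hypothetical, and note that ru_maxrss is reported in kilobytes on Linux:

import gc
import resource

def log_memory(label):
    # Peak resident set size of this process so far (kilobytes on Linux).
    peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # Number of objects the garbage collector is currently tracking.
    tracked = len(gc.get_objects())
    print("[%s] peak RSS: %d KB, gc-tracked objects: %d" % (label, peak_kb, tracked))

def run_job(job):
    job()
    gc.collect()  # force a collection pass before measuring
    log_memory(job.__name__)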

I was not the author of the original Django code, could not find anything wrong with the logic, and was at my wits' end. Finally I discovered what the problem was (yes, of course, on StackOverflow). It turns out that the usual way of coding up your web application in Django while it is still in development is to keep DEBUG = True in the Django settings file. This allows the developer to run tests, but it also means that host validation is disabled. Note: this can be very dangerous in production, as any host is now accepted, and it makes your site/database vulnerable to hacks. Well, our site was live with DEBUG = True!! OK, that's bad, but how did that relate to the problem we were having? Or did it? Turns out it did. Here is the other, less talked about, side effect: when DEBUG = True, Django stores every SQL query you make (in django.db.connection.queries), which over time adds up and looks suspiciously like a memory leak!! So if you have a lot of reads and writes to the database, it keeps all those queries in memory until the process just crashes. Yes, setting DEBUG = False did fix the majority of that problem.
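
To make the failure mode concrete, here is a small sketch (myapp and MyModel are hypothetical) showing the query log growing under DEBUG = True, and django.db.reset_queries() as a stopgap if a long-running process must keep DEBUG on:

from django.db import connection, reset_queries
from myapp.models import MyModel  # hypothetical app/model

# With DEBUG = True, every executed query is appended to
# connection.queries; a long-running cron job accumulates
# these entries indefinitely.
print(len(connection.queries))
MyModel.objects.count()
print(len(connection.queries))  # one more entry than before

reset_queries()  # clears the log; a stopgap if DEBUG must stay on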

Ah, but my woes did not end there. We had a domain name and DNS provider, and had pointed the domain name to the AWS server. This worked fine when DEBUG = True; but after I made DEBUG = False, suddenly I could not access my site in the browser. I could see the Django code was running on the server, but trying to access my domain name gave me a 404 error. Remember, I had said that DEBUG = True accepted all hosts. Well, if you make DEBUG = False, the ALLOWED_HOSTS field in settings.py, which by default is empty, needs to be populated with the list of strings representing the host/domain names that this Django site can serve (see the ALLOWED_HOSTS documentation). The reason this setting exists is to prevent an attacker from poisoning caches and password reset emails with links to malicious hosts by submitting requests with a fake HTTP Host header, which is possible even under many seemingly safe web server configurations.
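
For reference, a minimal sketch of what the production settings normally look like; the domain names here are placeholders, so use your own:

# settings.py
DEBUG = False
# Hosts/domains this Django site is allowed to serve.
ALLOWED_HOSTS = ['example.com', 'www.example.com']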

OK, great, that was figured out, but I still could not access our site using the domain name. If I put ALLOWED_HOSTS = ['*'], i.e. the wildcard, I could see the site, but not if I put in the actual domain name or server IP address. And the wildcard is a bad idea, as it basically renders the feature useless. Very strange.

Remember, I said we were hosting this on AWS, and my AWS instance had an elastic IP address. Moreover, I was putting it behind an elastic load balancer. Well, AWS and Django have a peculiar dance that you need to know about to make this work. The load balancer addresses each instance by its private IP (its health checks send that IP as the HTTP Host header), the internal IP address the EC2 instance uses can change over time, and we want our settings to work no matter how many instances we spin up. So instead of hard-coding it, query the EC2 instance metadata service to dynamically fetch the internal IP and add it to ALLOWED_HOSTS. This still gives us the same security/traffic benefits, because the 10.0.0.0/8 IP space is reserved for internal networks only, meaning that external web traffic cannot easily fake your internal IP address when requesting URIs. You can use the python-requests library.

Add the following to your settings.py file:

import requests

EC2_PRIVATE_IP = None
try:
    # 169.254.169.254 is the EC2 instance metadata service; this returns
    # the instance's private IP, regardless of any elastic IP attached.
    EC2_PRIVATE_IP = requests.get(
        'http://169.254.169.254/latest/meta-data/local-ipv4',
        timeout=0.01,
    ).text
except requests.exceptions.RequestException:
    # Not running on EC2, or the metadata service did not answer in time.
    pass

if EC2_PRIVATE_IP:
    ALLOWED_HOSTS.append(EC2_PRIVATE_IP)

Note that 169.254.169.254 is the IP address you would use regardless of the elastic IP of your web server; it is the fixed address of the instance metadata service.

Many, many thanks to this blog and to AWS customer service for helping me figure out how to make this work!