
AWS and Django intricacies – I wish I had known earlier….

I have recently taken up managing the AWS deployment of our website. Our current site is powered by Python's Django web framework, and while that is a great framework to use, it has a somewhat steep learning curve. At least it did for me!

The recurring memory leak bug:

We had one AWS instance that hosted the site and also ran two cron jobs in the background. We had another AWS machine (a pretty powerful one) which ran a bunch of cron jobs that mostly consisted of periodically downloading data from certain repositories and storing it in an RDS database, as well as downloading Twitter data every hour. This job performed multiple reads and writes on the database. Those mostly seemed to work great, except we noticed something rather peculiar: almost on cue, the jobs would crash every 72 hours and had to be manually restarted! We put in diagnostics to monitor the memory usage after every job completed, and noticed that the memory usage would rise continuously (even when a job was finished) till it reached ~99%, and then we would get a segmentation fault. I went over the code multiple times, learned about Python's gc module (a very useful utility, by the way), went over the garbage collection... but just could not figure out what was going on.
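For reference, the memory diagnostics we put in looked roughly like this gc-based sketch (the function name and the placeholder job call are mine for illustration, not the actual code):

import gc

def report_memory(label):
    # force a full garbage collection, then report how many objects survive
    collected = gc.collect()
    print '%s: %d objects tracked, %d collected' % (
        label, len(gc.get_objects()), collected)

report_memory('before job')
# run_job()   # placeholder for one iteration of the cron job
report_memory('after job')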

I was not the author of the original Django code, could not find anything wrong with the logic, and was at my wits' end. Finally I discovered what the problem was (yes, of course, on StackOverflow). It turns out that the usual way of coding up your web application in Django while it is still in development is to keep DEBUG=True in the Django settings file. This allows the developer to run tests, but it also means that host validation is disabled. Note: this can be very dangerous in production, as any host is now accepted, and it makes your site/database vulnerable to attacks. Well, our site was live with DEBUG=True!! OK, that's bad... but how did that relate to the problem we were having? Or did it? Turns out it did. Here is the other, less talked about side effect: when DEBUG=True, Django stores a record of every SQL query you make, which over time adds up and looks suspiciously like a memory leak!! So if you have a lot of reads and writes to the database, those stored queries keep eating the machine's memory until the process just crashes. Yes, setting DEBUG=False did fix the majority of that problem.
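For the curious, this is easy to see from a Django shell (a minimal sketch, run inside any configured Django project with DEBUG=True; the exact counts depend on your queries):

from django.db import connection, reset_queries

# with DEBUG=True, Django appends a record of every SQL query to
# connection.queries, and the list is never trimmed
print len(connection.queries)   # grows with every ORM read/write
reset_queries()                 # clears the list by hand; DEBUG=False is the real fix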

Ah, but my woes did not end there. We had a domain name and DNS provider, and had pointed the domain name to the AWS server. This worked fine when DEBUG=True; but after I made DEBUG=False, suddenly I could not access my site in the browser. I could see the Django code was running on the server, but trying to access my domain name gave me a 404 error. Remember, I said that DEBUG=True accepted all hosts. Well, if you make DEBUG=False, the ALLOWED_HOSTS field in settings.py, which by default is empty, needs to be populated with the list of strings representing the host/domain names that this Django site can serve (see the ALLOWED_HOSTS documentation). The reason this variable exists is to prevent an attacker from poisoning caches and password-reset emails with links to malicious hosts by submitting requests with a fake HTTP Host header, which is possible even under many seemingly-safe webserver configurations.
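The settings change itself is just a couple of lines (the domain names here are placeholders for our real ones):

# settings.py
DEBUG = False
ALLOWED_HOSTS = ['example.com', 'www.example.com']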

OK, great, that part was figured out, but I still could not access our site using the domain name. If I put ALLOWED_HOSTS = ['*'], i.e. a wildcard, I could see the site, but not if I put in the actual domain name or server IP address. And using the wildcard is a bad idea, as it basically renders the feature useless. Very strange.

Remember, I said we were hosting this on AWS, and my AWS instance had an Elastic IP address. Moreover, I was putting it behind an Elastic Load Balancer. Well, AWS and Django have a peculiar dance that you need to know about to make this work. Since the internal IP address the EC2 instance uses can change over time, and because we want our settings to work no matter how many instances we spin up, we dynamically fetch the internal IP from the EC2 metadata service and add it to ALLOWED_HOSTS. This still gives us the same security/traffic benefits, because the 10.0.0.0 IP space is reserved for internal networks only, meaning that external web traffic cannot easily fake your internal IP address when requesting URIs. You can use the python-requests library for this.

Add the following to your settings.py file:

import requests

# Ask the EC2 instance metadata service for this machine's private IP.
# The short timeout makes this a quick no-op when not running on EC2.
EC2_PRIVATE_IP = None
try:
    EC2_PRIVATE_IP = requests.get(
        'http://169.254.169.254/latest/meta-data/local-ipv4',
        timeout=0.01).text
except requests.exceptions.RequestException:
    pass

if EC2_PRIVATE_IP:
    ALLOWED_HOSTS.append(EC2_PRIVATE_IP)

Note that 169.254.169.254 is the address of the EC2 instance metadata service; it is the same link-local IP on every instance, regardless of the Elastic IP of your web server.

Many, many thanks to this blog and to AWS customer service for helping me figure out how to make this work!

Moving to AWS…

As the project grew, we started downloading tweets from various journal websites and set up an algorithm to parse tweets that were related to particular papers and link them to the paper, thereby producing another "metric" to compare papers by. In addition, we started keeping detailed records of all the authors of the papers and attempted to create a citation database. As complexity grew, we found that our local server was too slow, and after some research we decided to take the plunge and move our stuff to AWS (Amazon Web Services). We created a VPC (Virtual Private Cloud), moved our database to Amazon's RDS (Relational Database Service), and created buckets, i.e. storage, on Amazon's S3 (Simple Storage Service). It's relatively easy to do, and the AWS documentation is pretty good. What I found really helpful were the masterclass webinar series.
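As a taste of how simple the S3 side is, here is a minimal boto sketch for uploading a file to a bucket (the bucket and file names are made up; boto reads credentials from the environment or its config file):

import boto
from boto.s3.key import Key

# connect using AWS credentials from the environment or ~/.boto
conn = boto.connect_s3()
bucket = conn.get_bucket('my-paper-data')        # hypothetical bucket name
key = Key(bucket)
key.key = 'tweets/sample.json'                   # object name in the bucket
key.set_contents_from_filename('tweets.json')    # local file to upload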

I launched a Linux-based instance and then installed all the software I needed, like Python 2.7, pandas, numpy, ipython-pylab, matplotlib, scipy, etc. It was interesting to note that on many of the Amazon machines, the default Python version was 2.6, not 2.7. I scoured the web a fair bit to figure out how to configure my instance, and am sharing some of the commands below.

General commands to install Python 2.7 on AWS; these should work on most instances running Ubuntu/RedHat Linux:

Start Python to check which Unicode build your installation has. If you have to deal with a fair amount of Unicode data, like I do, then make sure you have the "wide build". I learned this the hard way.

  • >> import sys
  • >> print sys.maxunicode

It should NOT be 65535 (65535 means a narrow build; on a wide build it is 1114111).

  • >> wget https://s3.amazonaws.com/aws-cli/awscli-bundle.zip
  • >> unzip awscli-bundle.zip
  • >> sudo ./awscli-bundle/install -i /usr/local/aws -b /usr/local/bin/aws
  • # install build tools
  • >> sudo yum install make automake gcc gcc-c++ kernel-devel git-core -y
  • # install python 2.7 and change the default python symlink
  • >> sudo yum install python27-devel -y
  • >> sudo rm /usr/bin/python
  • >> sudo ln -s /usr/bin/python2.7 /usr/bin/python
  • # yum still needs python 2.6, so point it back at 2.6 and keep a backup copy
  • >> sudo cp /usr/bin/yum /usr/bin/_yum_before_27
  • >> sudo sed -i s/python/python2.6/g /usr/bin/yum
  • # starting python should now display 2.7.5 or later:
  • >> python
  • >> sudo yum install httpd
  • # now install pip for 2.7
  • >> sudo curl -o /tmp/ez_setup.py https://bitbucket.org/pypa/setuptools/raw/bootstrap/ez_setup.py
  • >> sudo /usr/bin/python27 /tmp/ez_setup.py
  • >> sudo /usr/bin/easy_install-2.7 pip
  • >> sudo pip install virtualenv
  • # on Ubuntu, use apt-get to update and install git
  • >> sudo apt-get update
  • >> sudo apt-get install git
  • # should display the current versions:
  • >> pip -V && virtualenv --version
Installing all the Python library modules:

  • >> sudo pip install ipython
  • >> sudo yum install numpy scipy python-matplotlib ipython python-pandas sympy python-nose
  • >> sudo yum install xorg-x11-xauth.x86_64 xorg-x11-server-utils.x86_64
  • >> sudo pip install pyzmq tornado jinja2
  • >> sudo yum groupinstall "Development Tools"
  • >> sudo yum install python-devel
  • >> sudo pip install matplotlib
  • >> sudo pip install networkx
  • >> sudo pip install cython
  • >> sudo pip install boto
  • >> sudo pip install pandas
Some modules could not be installed using pip, so use the following instead:

  • >> sudo apt-get install python-mpi4py python-h5py python-tables python-pandas python-sklearn python-scikits.statsmodels

Note that to install h5py or PyTables you must first install the following dependencies: numpy, numexpr, Cython, dateutil, and HDF5.

HDF5 can be installed using wget:

  • >> wget http://www.hdfgroup.org/ftp/HDF5/current/src/hdf5-1.8.9.tar.gz
  • >> tar xvfz hdf5-1.8.9.tar.gz; cd hdf5-1.8.9
  • >> ./configure --prefix=/usr/local
  • >> make; make install

PyTables:

  • >> pip install git+https://github.com/PyTables/PyTables.git@v.3.1.1#egg=tables

** To install h5py, make sure HDF5 is on your path.
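Once everything is installed, a quick sanity check from the Python prompt is worth doing. Here is a minimal sketch (trim the import list to match what you actually installed):

# verify that the scientific stack imports cleanly and report versions
import numpy, scipy, pandas, matplotlib, networkx
for mod in (numpy, scipy, pandas, matplotlib, networkx):
    print mod.__name__, mod.__version__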

I really liked the concept of AMIs: you configure a machine the way you want, then create an "image" of it and give it a name. You have then created a virtual machine that you can launch any time you want, and as many copies of as you want! What is also great is that when you create the image, all the files that may be in your directory at the time become part of that virtual machine, so it's almost like having a snapshot of your environment. So create images often (you can delete old images, of course!) and you can relaunch your environment at any time. Of course, you can and should also keep uploading new files to S3, as that is your storage repository.
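Image creation can even be scripted with boto (a rough sketch, assuming boto 2 and configured AWS credentials; the region, instance ID, and AMI name are all placeholders):

import boto.ec2

# connect to the region the instance lives in
conn = boto.ec2.connect_to_region('us-west-1')
# create an AMI from a running instance (hypothetical instance ID)
ami_id = conn.create_image('i-0123abcd', 'my-configured-machine-v1',
                           description='snapshot of my configured environment')
print 'created image', ami_id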

Another neat trick I learned: if you install the EC2 (Elastic Compute Cloud) CLI (Command Line Interface) on your local machine and set up your IAM credentials, you don't need the .pem file!! Even if you are accessing an AMI in a private cloud, make sure you click on "get public IP" when you launch your instance, and log in with ssh -Y ubuntu@ec2-xx.yyy.us-west1.compute.amazonaws.com. You can then enjoy xterm and the graphical display from matplotlib, as if you were running everything on your local terminal. Very cool indeed!
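To check that the X forwarding actually works, a tiny matplotlib test from the remote Python prompt is enough (a minimal sketch):

# if X forwarding is set up, this opens a plot window on your local display
import matplotlib.pyplot as plt
plt.plot([1, 2, 3], [1, 4, 9])
plt.show()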