Practical introduction to web mining: data wrangling

Most of the programming work in a data analysis project is spent in the data preparation stage, because the collected data is rarely in the structure that your data processing application expects. Fortunately, Twitter data is structured, so we won't spend a lot of time in this stage.

The first thing we have to do is load the collected data. There's nothing special here; we only need the json Python module. Below is the code:

import json

def load_tweets(path):
    tweets = []
    with open(path, 'r') as file_stream:
        for line in file_stream:
            # each line holds one JSON-encoded tweet
            tweets.append(json.loads(line))
    return tweets

tweets_list = load_tweets("PL_tweets.txt")
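Collected files are sometimes imperfect: a crashed collector can leave blank or truncated lines behind. As a defensive variant (a sketch, not part of the original script; the name `load_tweets_safe` is mine), we can skip lines that fail to parse:

```python
import json

def load_tweets_safe(path):
    """Load one JSON-encoded tweet per line, skipping blank or malformed lines."""
    tweets = []
    with open(path, 'r') as file_stream:
        for line in file_stream:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            try:
                tweets.append(json.loads(line))
            except ValueError:  # malformed line, e.g. a truncated write
                continue
    return tweets
```

This keeps one bad line from aborting the whole load.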

1. Pandas

Next we will create a pandas DataFrame. Pandas is an open source Python library providing high-level data structures and tools for data analysis. Pandas has mainly two data structure types:

  • Series: one-dimensional array containing an array of data and an associated array of index labels.
  • DataFrame: tabular data structure containing a collection of columns. DataFrame has both a row and column index. In other words, a DataFrame is a collection of Series.

Let’s first explore the tweet structure. If you don’t have an idea about the Twitter API, it’s a good idea to look at the official documentation before continuing this tutorial. Personally, I think that the key attributes of a tweet are:

  • id: the tweet identifier
  • text: the text of the tweet itself
  • lang: acronym for the language (e.g. “en” for English, “fr” for French)
  • created_at: the date of creation
  • favorite_count, retweet_count: the number of favorites and retweets
  • place, coordinates, geo: geo-location information if available
  • user: the author’s full profile
  • entities: list of entities like URLs, @-mentions, hashtags and symbols
  • in_reply_to_user_id: user identifier if the tweet is a reply to a specific user
  • in_reply_to_status_id: status identifier if the tweet is a reply to a specific status
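To make the mapping concrete, here is a hand-crafted, heavily simplified payload (not a real API response) showing where these attributes live:

```python
import json

# A simplified, made-up payload for illustration only
raw = '''{"id": 42, "text": "Hello #python", "lang": "en",
          "created_at": "Mon Jan 05 20:15:00 +0000 2015",
          "retweet_count": 3, "favorite_count": 1,
          "user": {"screen_name": "jdoe", "name": "John Doe", "location": null},
          "entities": {"hashtags": [{"text": "python"}]},
          "in_reply_to_user_id": null, "in_reply_to_status_id": null}'''

tweet = json.loads(raw)
print(tweet['text'])                 # top-level attribute
print(tweet['user']['screen_name'])  # nested author profile
print([h['text'] for h in tweet['entities']['hashtags']])
```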

The code below will create a pandas DataFrame object containing the most useful tweet metadata that we will use in the next post of this series:

import pandas as pd

# create Pandas DataFrame
tweets = pd.DataFrame()

# create some columns
tweets['tweetID'] = [ tweet['id'] for tweet in tweets_list ]
tweets['tweetText'] = [ tweet['text'] for tweet in tweets_list ]
tweets['tweetLang'] = [ tweet['lang'] for tweet in tweets_list ]
tweets['tweetCreatedAt'] = [ tweet['created_at'] for tweet in tweets_list ]
tweets['tweetRetweetCount'] = [ tweet['retweet_count'] for tweet in tweets_list ]
tweets['tweetFavoriteCount'] = [ tweet['favorite_count'] for tweet in tweets_list ]
tweets['tweetGeo'] = [ tweet['geo'] for tweet in tweets_list ]
tweets['tweetCoordinates'] = [ tweet['coordinates'] for tweet in tweets_list ]
tweets['tweetPlace'] = [ tweet['place'] for tweet in tweets_list ] 

# tweeple information 
tweets['userScreenName'] = [ tweet['user']['screen_name'] for tweet in tweets_list ]
tweets['userName'] = [ tweet['user']['name'] for tweet in tweets_list ]
tweets['userLocation'] = [ tweet['user']['location'] for tweet in tweets_list ]

# tweet interaction 
tweets['tweetIsReplyToUserId'] = [ tweet['in_reply_to_user_id'] for tweet in tweets_list ]
tweets['tweetIsReplyToStatusId'] = [ tweet['in_reply_to_status_id'] for tweet in tweets_list ]

Super! We created our first data frame. Pandas data frames provide a beautiful and rich API, from visualizing to interacting with the data frame:

  • head(N): returns first N rows
  • tail(N): returns last N rows
  • iteritems(): iterator over (column name, Series) pairs
  • etc.

The code below will display the first 5 rows in our data frame:

>>> tweets.head(5)

2. Cleaning Data

Unfortunately, acquired data is usually dirty and has a lot of inconsistencies: duplicated entries, bad values, non-normalized values, etc. So, the cleanup process should mainly include:

  • removing duplicate entries
  • stripping whitespace
  • normalizing numbers, dates, etc.

The output of this process is a clean dataset: a dataset consisting only of valid and normalized values, which will ensure that our analysis code WILL NOT CRASH!

2.1 Missing data

If you followed the previous steps in this tutorial, you probably noticed, as shown in the figure below, the NaN values in some columns. NaN is a special value that denotes missing data.


fig 1. Missing data

Now, we have to handle these missing values. In fact, we have mainly two options:

  • replacing all NaN values with None
  • treat each column separately. For example, replacing NaN by None for tweetIsReplyToUserId and tweetIsReplyToStatusId columns, and replacing both None and NaN by “Unknown” for userLocation column, etc.

Personally, I will opt for the second option, and I’ll use the fillna method, which fills NaN values with the given value:

# let's handle the userLocation column
tweets.userLocation.fillna("Unknown", inplace=True)
# Now let's replace the other NaN values by None
tweets = tweets.where(pd.notnull(tweets), None)

Note that I set the inplace argument of the fillna method to True explicitly. Otherwise, the userLocation series would not be modified.

2.2 Bad data

If you took a look at the Twitter documentation earlier, you probably know that the values of the tweetCreatedAt column are string representations of a date and time. We have to convert these values to datetime objects.

You can use the strptime function of the datetime module, which parses a string representation of a date and/or time. But I prefer to use Pandas’ to_datetime function, which will parse and convert the entire series.

tweets.tweetCreatedAt = pd.to_datetime(tweets.tweetCreatedAt)
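For reference, the same conversion with the standard library alone: Twitter’s created_at strings follow a fixed format, which datetime.strptime can parse (a sketch assuming Python 3, where %z accepts the +0000 offset):

```python
from datetime import datetime

# Twitter's created_at format, e.g. "Wed Aug 27 13:08:45 +0000 2008"
TWITTER_DATE_FORMAT = "%a %b %d %H:%M:%S %z %Y"

def parse_created_at(value):
    return datetime.strptime(value, TWITTER_DATE_FORMAT)

dt = parse_created_at("Wed Aug 27 13:08:45 +0000 2008")
print(dt.year, dt.hour)
```

pandas’ to_datetime typically infers this format on its own, so the explicit format string is only needed for the stdlib route.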

2.3 Duplicated data

Honestly, I didn’t expect to have duplicated entries in my dataset. But as the collection script crashed several times, I wasn’t surprised. Pandas provides some methods to deal with duplicated data. The duplicated method annotates each row with a boolean specifying whether that row is a duplicate or not. By default, row identity is defined by checking all columns, but you can restrict it to specific columns. For our example, we can specify only the tweetID column, as it’s a unique ID for the tweet.

>>> tweets.duplicated(['tweetID'])
0        False
1        False
2        False
3        False
4        False
5        False
6         True
7        False
8        False
9        False
10       False
11       False

You can drop duplicated rows using drop_duplicates method, as below:

>>> tweets.drop_duplicates(['tweetID'])
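The same deduplication can be sketched without pandas, directly on the raw tweet list, by remembering which IDs we have already seen (the helper name below is mine):

```python
def drop_duplicate_tweets(tweet_list):
    """Keep the first occurrence of each tweet id, preserving order."""
    seen = set()
    unique = []
    for tweet in tweet_list:
        if tweet['id'] not in seen:
            seen.add(tweet['id'])
            unique.append(tweet)
    return unique

sample = [{'id': 1}, {'id': 2}, {'id': 1}, {'id': 3}]
print(len(drop_duplicate_tweets(sample)))  # 3
```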


I think I covered the most important tips and steps of the data wrangling stage. But note that Twitter data is structured and fairly clean, and this is not the regular case. In fact, real-world data is dirty: you’ll have to do more work on it to be able to use it.

Waiting for your comments and suggestions.


Scheduled jobs with Celery, Django and Redis

Setting up a deferred task queue for your Django application can be a pain, and it shouldn’t be. Some people use cron, which is not only a bad solution but a disaster. Personally, I use Celery. In this post, I’ll show you how to set up a deferred task queue for your Django application using Celery.

What’s Celery ?

Celery is an asynchronous task queue/job queue based on distributed message passing. It is focused on real-time operation, but supports scheduling as well.

The promise of Celery is to allow you to run code later, or regularly according to a schedule. Unfortunately, running deferred tasks through Celery is not trivial. But it’s useful and beneficial, as it has a distributed architecture that scales as you need. Any Celery installation is composed of three core components:

  1. Celery client: used to issue background jobs.
  2. Celery workers: the processes responsible for running jobs. Workers can be local or remote, so you can start with a single worker on the same web application server, and later add workers as your traffic and load grow.
  3. Message broker: the client communicates with the workers through a message queue, and Celery supports several ways to implement these queues. The most commonly used brokers are RabbitMQ and Redis.

Installing requirements

First, let’s install Redis:

$ sudo apt-get install redis-server

Now, let’s install some python packages:

pip install celery
pip install django-celery

Configuring Django for Celery

Once the installation is completed, you’re ready to set up the scheduler. Let’s configure Celery:


BROKER_URL = 'redis://'
CELERYBEAT_SCHEDULER = 'djcelery.schedulers.DatabaseScheduler'

The above lines configure Celery: which broker will you use, and which scheduler for the beat events?

As you added the djcelery package to your INSTALLED_APPS, you need to create the Celery database tables. Instructions for that differ depending on your environment. If using South or migrations (Django >= 1.7) for schema migrations:

$ python manage.py migrate

Otherwise:

$ python manage.py syncdb

Below is the file used for setting up the scheduler for your Django project:

# file
from __future__ import absolute_import

import os
import django

from celery import Celery
from django.conf import settings

os.environ.setdefault('DJANGO_SETTINGS_MODULE', 'demo.settings')

app = Celery('Scheduler')

app.autodiscover_tasks(lambda: settings.INSTALLED_APPS)

Write some tasks

Let’s assume that you have a task that should be executed periodically, a good example might be a twitter bot or a scraper.

import tweepy

api = tweepy.API()

def get_recent_tweets(query):
    # api.search is tweepy's search endpoint (pre-4.0 API)
    for tweet in tweepy.Cursor(api.search, q=query,
                               rpp=100, result_type="recent").items():
        print tweet.created_at, tweet.text
        # Save tweet into database

Now, we need to create a Celery task for get_recent_tweets:

    ## /project_name/app_name/

    from celery.decorators import task

    from utils import twitter

    @task
    def get_recent_tweets(*args):
        # Just an example: delegate to the plain function defined above
        twitter.get_recent_tweets(*args)

N.B: Things can get a lot more complicated than this.

Scheduling it

Now, we have to schedule our tasks. We will run the get_bigdata_tweets task every hour, as this is an interesting subject that I want to follow. For this purpose, I’ll use the celery.beat scheduler. In your settings file, add this code:

from datetime import timedelta

CELERYBEAT_SCHEDULER = "djcelery.schedulers.DatabaseScheduler"
CELERYBEAT_SCHEDULE = {
    "get_bigdata_tweets": {
        'task': "bots.twitter.tasks.get_recent_tweets",
        # Every 1 hour
        'schedule': timedelta(hours=1),
        'args': ("bigdata",),
    },
}

For further details about scheduler configuration, see the documentation.
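If you prefer a wall-clock schedule over a fixed interval, celery.schedules.crontab expresses “at minute 0 of every hour” directly (a configuration sketch; the task path is the same one used above):

```python
from celery.schedules import crontab

CELERYBEAT_SCHEDULE = {
    "get_bigdata_tweets": {
        'task': "bots.twitter.tasks.get_recent_tweets",
        'schedule': crontab(minute=0),  # at minute 0 of every hour
        'args': ("bigdata",),
    },
}
```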

Merge querysets from different django models

If you were ever in a situation where you needed to merge two querysets from different models into one, you’ve surely seen this error:

Cannot combine queries on two different base models.

The solution is to use itertools.chain, which makes an iterator that is the concatenation of the given iterators.

from itertools import chain

result_lst = list(chain(queryset1, queryset2))

Now, you can sort the resulting list by any common field, e.g. creation date:

from itertools import chain
from operator import attrgetter

# assuming both models have a created_at field
result_lst = sorted(
    chain(queryset1, queryset2),
    key=attrgetter('created_at'))
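The same pattern works on any iterables, not only querysets. A self-contained sketch with stand-in objects (Post and Comment are made-up names):

```python
from collections import namedtuple
from itertools import chain
from operator import attrgetter

# Stand-ins for querysets of two different models
Post = namedtuple('Post', ['title', 'created_at'])
Comment = namedtuple('Comment', ['body', 'created_at'])

posts = [Post('first', 3), Post('second', 1)]
comments = [Comment('nice!', 2)]

merged = sorted(chain(posts, comments), key=attrgetter('created_at'))
print([item.created_at for item in merged])  # [1, 2, 3]
```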

Installing different python versions in ubuntu

Since I write Python code that should run on different Python versions, I have to install multiple Python versions on my workstation. As usual, I believe that we should do everything as well as we can :).

This post is a description of my procedure to get different Python versions installed on my Ubuntu workstation.

Installing Multiple Versions

Ubuntu typically only supports one Python 2.x version and one 3.x version at a time. There’s a popular PPA called “deadsnakes” that contains older versions of Python. To install it, run the commands below:

$ sudo add-apt-repository ppa:fkrull/deadsnakes
$ sudo apt-get update

I have Ubuntu 14.04 already installed on my workstation (so I have both Python 2.7 and Python 3.4). So, I’ll install versions 2.6 and 3.3.

$ sudo apt-get install python2.6 python3.3

Let’s check the default python version by running `python -V`:

$ python -V
Python 2.7.6

Now, to manage the different Python versions I will use an amazing Linux command: update-alternatives. According to the Linux man page, “update-alternatives maintains symbolic links determining default commands”.

First, let’s install the different alternatives:

$ sudo update-alternatives --install /usr/bin/python python /usr/bin/python2.6 10
$ sudo update-alternatives --install /usr/bin/python python /usr/bin/python2.7 20
$ sudo update-alternatives --install /usr/bin/python python /usr/bin/python3.3 30
$ sudo update-alternatives --install /usr/bin/python python /usr/bin/python3.4 40

To choose the default python version you should run the below command:

$ sudo update-alternatives --config python

Secondly, I can switch between the different Python versions easily with the previous command. However, Ubuntu runs multiple maintenance scripts, and those scripts may break if I choose Python 2.6 as the default version.

Using virtualenv

I assume that you have different Python versions installed on your machine and you didn’t change the default Python version (which is 2.7 in my case).

1. Installing virtualenv

$ sudo apt-get install python-virtualenv

2. Managing different python version
Suppose that I will start a new project which will run on Python 2.6. Using this solution, I will be able to manage different versions of Python and different versions of any package I use. Great!

$ virtualenv -p /usr/bin/python2.6 ~/.envs/project_x_py2.6
Running virtualenv with interpreter /usr/bin/python2.6
New python executable in ~/.envs/project_x_py2.6/bin/python2.6
Also creating executable in ~/.envs/project_x_py2.6/bin/python
Installing distribute....................................done.
Installing pip.....................done.

3. Activating virtualenv
Before you can install any package for this project, you should activate it:

$ source ~/.envs/project_x_py2.6/bin/activate

Now, let’s check the default python version used for this project:

$ python -V
Python 2.6.9
$ which python

When you’re done with the project, just deactivate its virtualenv; you can come back to it whenever you need by activating it again:

$ deactivate

Memento design pattern: Part 2

As promised in the last post (part 1), I will try to improve the official implementation of the Memento pattern, inspired by the Java code on Wikipedia. I will try to improve these points:

  • the CareTaker should create a Memento object for every change in the Originator behind the scenes
  • a CareTaker object should be created implicitly for each Originator class

Coding time

First, let’s improve the Memento and CareTaker classes.

class Memento(object):
    def __init__(self, state):
        self.__state = state

    @property
    def state(self):
        return self.__state

    def __repr__(self):
        return "<Memento: {} >".format(str(self.__state))

class CareTaker(object):
    def __init__(self):
        self.__mementos = []

    @property
    def mementos(self):
        return self.__mementos

    def save(self, memento):
        self.__mementos.append(memento)

    def restore(self):
        return self.__mementos.pop()

For the first enhancement, we will use a magic method, __setattr__, which gives us the possibility to control attribute assignment. Consider this example:

class A(object):
    def __setattr__(self, attr, val):
        print "Permission denied."

a = A()
a.x = 4  # "Permission denied"

As you see, modifying an attribute value goes through a Python call to the special method __setattr__. In our example, we replaced the default behaviour.

In our case, we will use this method to create a Memento object implicitly for every change made on the Originator class.

class Originator(object):
    """Any originator class should inherit from this class."""

    def __init__(self, *args, **kw):
        # Let's create a caretaker for this originator
        self.__caretaker = kw.pop('caretaker', None) or CareTaker()
        super(Originator, self).__init__(*args, **kw)

    @property
    def caretaker(self):
        return self.__caretaker

    def __setattr__(self, attr, val):
        # Avoid keeping track of private attribute changes,
        # especially the `caretaker` attribute
        if not attr.startswith('_'):
            # Let's save both the attribute and its old value
            self.__caretaker.save(Memento({
                'attr': attr,
                'value': getattr(self, attr, None)
            }))
        super(Originator, self).__setattr__(attr, val)

class Settings(Originator):
    pass

settings = Settings()
settings.font = 'Arial'
settings.font = 'Calibri'
caretaker = settings.caretaker
print 'We have {} mementos'.format(len(caretaker.mementos))

The downside of this implementation is that we have to call the Originator’s __init__ method first when we override it in a subclass. Consider this example:

class User1(Originator):
    def __init__(self, login, password):
        self.login = login
        self.password = password
        super(User1, self).__init__()

user = User1('john', 'password') # AttributeError

class User2(Originator):
    def __init__(self, login, password):
        # Initialise Originator class in the first place
        super(User2, self).__init__()
        self.login = login
        self.password = password

user = User2('john', 'password') # works

The problem appears when Python initializes the User1 object: it implicitly calls the __setattr__ method, which tries to save a memento (for the login attribute), but the caretaker object is not yet created. To fix this, we will only create memento objects after instance initialisation:

class Originator(object):
    """Any originator class should inherit from this class."""

    def __init__(self, *args, **kw):
        # Let's create a caretaker for this originator
        self.__caretaker = kw.pop('caretaker', None) or CareTaker()
        super(Originator, self).__init__(*args, **kw)

    @property
    def caretaker(self):
        return self.__caretaker

    def __setattr__(self, attr, val):
        if hasattr(self, '_Originator__caretaker'):
            # Let's save both the attribute and its old value
            self.__caretaker.save(Memento({
                'attr': attr,
                'value': getattr(self, attr, None)
            }))
        super(Originator, self).__setattr__(attr, val)

It’s mostly done; we should now add an undo method to the Originator class:

class Originator(object):

    def undo(self):
        memento = self.caretaker.restore()
        setattr(self, memento.state['attr'], memento.state['value'])

Great! However, there is a bug in this code: if we restore the Originator object, another memento object will be created, which is an issue; and if we restore another time we will go back to the last state, which is terrible. Consider this example:

settings = Settings()
caretaker = settings.caretaker

for color in ('red', 'blue', 'green', 'yellow'):
    settings.color = color
    print 'We have {} mementos'.format(len(caretaker.mementos))

for i in range(7):
    settings.undo()
    print 'We have {} mementos ## color: {}'.format(len(caretaker.mementos), settings.color)

and below is the output:

We have 1 mementos
We have 2 mementos
We have 3 mementos
We have 4 mementos
We have 4 mementos ## color: green
We have 4 mementos ## color: yellow
We have 4 mementos ## color: green
We have 4 mementos ## color: yellow
We have 4 mementos ## color: green
We have 4 mementos ## color: yellow
We have 4 mementos ## color: green

To fix this, we will add a flag indicating whether __setattr__ is being executed in restore mode or not:

class Originator(object):
    def __setattr__(self, attr, val):
        restore = getattr(self, 'restore_mode', False)
        if (not restore and hasattr(self, '_Originator__caretaker')
                and attr != 'restore_mode'):
            self.__caretaker.save(Memento({
                'attr': attr,
                'value': getattr(self, attr, None)
            }))
        super(Originator, self).__setattr__(attr, val)

    def undo(self):
        memento = self.caretaker.restore()
        self.restore_mode = True
        setattr(self, memento.state['attr'], memento.state['value'])
        self.restore_mode = False

Now, we have only two issues left:

  • handle the IndexError exception raised by the restore method
  • for the moment, a newly created attribute is considered to be set to None before its creation, which is confusing

The code below fixes both, using an Empty sentinel class to mark attributes that did not exist before:

class Empty:
    """Sentinel marking an attribute that did not exist before."""

class Originator(object):
    def __setattr__(self, attr, val):
        restore = getattr(self, 'restore_mode', False)
        if (not restore and hasattr(self, '_Originator__caretaker')
                and attr != 'restore_mode'):
            self.__caretaker.save(Memento({
                'attr': attr,
                'value': getattr(self, attr, Empty())
            }))
        super(Originator, self).__setattr__(attr, val)

    def undo(self):
        try:
            memento = self.caretaker.restore()
        except IndexError:
            return
        if isinstance(memento.state['value'], Empty):
            delattr(self, memento.state['attr'])
        else:
            self.restore_mode = True
            setattr(self, memento.state['attr'], memento.state['value'])
            self.restore_mode = False

I hope you liked today’s post and as I don’t think that’s perfect, you are welcome to give me your opinion and feedback. Comments here or @benzid_wael.

Memento design pattern: Part 1

Today, I will show you how to implement the Memento design pattern in Python. Assume that you are in a position where you want to implement an undo system: you have an object and you should keep track of all the changes that the user makes on it.

How it works?

If you take a look at the Memento pattern on Wikipedia, you’ll find this:

The memento pattern is implemented with three objects: the originator, a caretaker and a memento. The originator is some object that has an internal state. The caretaker is going to do something to the originator, but wants to be able to undo the change. The caretaker first asks the originator for a memento object. Then it does whatever operation (or sequence of operations) it was going to do. To roll back to the state before the operations, it returns the memento object to the originator. The memento object itself is an opaque object (one which the caretaker cannot, or should not, change). When using this pattern, care should be taken if the originator may change other objects or resources – the memento pattern operates on a single object.

Coding time

Let’s now rewrite the Java example from the wiki in Python:
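A straightforward transcription of the Wikipedia Java example into Python might look like this (my sketch; the class and method names follow the Java version):

```python
class Memento(object):
    def __init__(self, state):
        self._state = state

    def get_saved_state(self):
        return self._state

class Originator(object):
    def set(self, state):
        print("Originator: Setting state to " + state)
        self._state = state

    def save_to_memento(self):
        print("Originator: Saving to Memento.")
        return Memento(self._state)

    def restore_from_memento(self, memento):
        self._state = memento.get_saved_state()
        print("Originator: State after restoring: " + self._state)

# The caretaker is just a list of mementos here
saved_states = []
originator = Originator()
originator.set("State1")
originator.set("State2")
saved_states.append(originator.save_to_memento())
originator.set("State3")
saved_states.append(originator.save_to_memento())
originator.set("State4")
originator.restore_from_memento(saved_states[0])
```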

My opinion

I don’t like this implementation; it has many shortcomings, especially if you are a Pythonista:

  1. The Originator class is not general purpose
  2. We have to manage which attributes to save in the Memento
  3. We have to create a CareTaker explicitly for every object

As an enhancement to this implementation, I want:

  • to create a CareTaker object implicitly for each Originator class
  • the CareTaker to create a Memento object for every change in the Originator behind the scenes
  • the possibility to have a Memento for a group of changes

Ok, next time I will implement these enhancements and discuss the solution. Nice day.

Python notes: Part II

Last time I listed some of Python’s features, but I’d like to reserve a post for the loveliest one, which is the Python philosophy. The code below enumerates the main idioms around which Python has been designed:

>>> import this
The Zen of Python, by Tim Peters

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!

Through this post, I will try to mention an example for each idiom:

* Beautiful is better than ugly.

The snippets below highlight how Python code is more beautiful than code in C or any other programming language:

void permute(int *a, int *b)
{
    int swap;
    swap = *a;
    *a = *b;
    *b = swap;
}

In Python, a swap is a one-liner:

>>> a, b = b, a

* Explicit is better than implicit.

So, you should avoid writing code like this:

>>> from os import *
>>> print getcwd()

Instead, you should write:

>>> import os
>>> print os.getcwd()

As a general rule, try to always be explicit and clear when coding.

* Simple is better than complex.

Simplicity should be a key goal in design. For example, Python offers the in operator to iterate over some structures, whereas in C/C++ you have to manage the iteration yourself:

int array[6] = {4, 5, 45, 3, 9, 7};
int i;
for (i = 0; i < 6; i++) {
    printf("i=%d\n", array[i]);
}

Fortunately, Python makes my life much easier:

>>> l = [4, 5, 45, 3, 9, 7]
>>> for i in l:
        print "i=", i

* Complex is better than complicated.

Complicated means something hard to understand and analyse. If we cannot have a simple solution, having a complex solution is better than a complicated one.

* Flat is better than nested.

Nested code is harder to read and maintain:

if year % 4 == 0:
    if year % 100 == 0:
        if year % 400 == 0:
            leap = True
        else:
            leap = False
    else:
        leap = True
else:
    leap = False

Instead, flat code is more readable:

if year % 4 == 0 and year % 100 != 0:
    leap = True
elif year % 400 == 0:
    leap = True
else:
    leap = False

* Sparse is better than dense.

It’s always about readability; don’t write dense code, because it is difficult to understand:

  • Put empty lines between blocks of unrelated code within functions.
  • Put spaces around operators in most cases.
  • Put two empty lines between method or function definitions.

* Readability counts.

You can notice this from:

  • Use of white spaces
  • Python is bracket-less; instead it uses indentation, which is more elegant and brings more clarity to Python code
  • Documentation

* Special cases aren’t special enough to break the rules.

Everything is an object in Python, and basic types like integer, float, boolean, etc. are not special enough to break the rule.

* Although practicality beats purity.

Sometimes, the rules have to be broken.

* Errors should never pass silently.

try:
    import json
except ImportError:
    print "Can not load json module"

* Unless explicitly silenced.

try:
    price = prices[k]
except KeyError:
    price = default_price

* In the face of ambiguity, refuse the temptation to guess.

Consider this code:

>>> 1 + 1
2
>>> '1' + '1'
'11'
>>> 1 + '1'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unsupported operand type(s) for +: 'int' and 'str'

As you see, the interpreter is in an ambiguous situation: it does not know whether it’s a string concatenation or a simple addition. So, it doesn’t try to guess; instead it raises a TypeError.

* There should be one– and preferably only one –obvious way to do it.

This idiom is against another famous idioms which is:

There Is More Than One Way To Do It

The preceding idiom is generally associated with the Perl programming language and Larry Wall (the inventor of Perl), who thinks that phrasing a sentence differently might make it clearer. Pythonistas don’t agree with him: we think that even if there are many ways to do it, there is always one good way to do it.

try:
    val = plants["carrot"]
except KeyError:
    val = "No carrot found"

Instead, It’s better to use the get method:

val = plants.get("carrot", "No carrot found")

* Although that way may not be obvious at first unless you’re Dutch.

Maybe, you have seen or written code like that:

hits = {}
for log in logs:
    if hits.has_key(log.url):
        hits[log.url] += 1
    else:
        hits[log.url] = 1

But if you know the collections module, you will write:

import collections

hits = collections.defaultdict(int)
for log in logs:
    hits[log.url] += 1

This is the preferable way to do it.
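As a related aside: when all you need is the counts, the standard library’s collections.Counter states the intent even more directly (the URLs below are made up for illustration):

```python
import collections

urls = ["/home", "/about", "/home", "/home"]  # stand-in for (log.url for log in logs)
hits = collections.Counter(urls)
print(hits["/home"])        # 3
print(hits.most_common(1))  # [('/home', 3)]
```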

* Now is better than never.

Always fix the problem as soon as possible, and do not worry about perfection; Python 3 needed years to see the light.

* Although never is often better than *right* now.

Just take your time to do it well.

* If the implementation is hard to explain, it’s a bad idea.

* If the implementation is easy to explain, it may be a good idea.

Each of the two idioms above is a reformulation of the KISS principle.


* Namespaces are one honking great idea — let’s do more of those!

Using namespaces enables us to follow the SoC (Separation of Concerns) principle.