Practical introduction to web mining: collect data

Web mining is the application of natural language processing techniques to web content in order to retrieve relevant information. It has become more important recently because of the exponential increase in digital content, especially with the appearance of social media platforms such as Twitter, which constitute a rich and reliable source of information.

In this series, I’ll explain how to collect Twitter data, manipulate it and extract knowledge from it. As I am a fan of Python, I’ll try to compare Python to other programming languages such as Java, Ruby and PHP, based on information that we will collect from Twitter.

In this tutorial, we will start by collecting data from Twitter, and introduce tweepy and the structure of Twitter data.

1. Create a Twitter application

First of all, you need some Twitter keys to be able to connect to the Twitter API and gather data from it. In particular, we need an API key, an API secret, an access token and an access token secret. To get this information, follow the steps below:

  1. Go to https://apps.twitter.com and log in with your Twitter account.
  2. Create a new Twitter application.
  3. On the next page, in the “API keys” tab, you can find both the API key and the API secret.
  4. Scroll down and generate your access token and access token secret.


Once you created a new Twitter app and generated your keys, you can move to the next step and start collecting data.

2. Getting Data From Twitter

We will use the Twitter Stream API to collect tweets related to 4 keywords: python, java, php and ruby. Happily, the Twitter Stream API gives us the possibility to filter tweets by keywords. The code below will fetch tweets that contain at least one of the keywords mentioned earlier:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

from tweepy.streaming import StreamListener
from tweepy import OAuthHandler
from tweepy import Stream

# User credentials for Twitter API 
access_token = "ENTER YOUR ACCESS TOKEN"
access_token_secret = "ENTER YOUR ACCESS TOKEN SECRET"
consumer_key = "ENTER YOUR API KEY"
consumer_secret = "ENTER YOUR API SECRET"


class StdoutListener(StreamListener):

    def on_data(self, data):
        print data
        return True

    def on_error(self, status):
        print status


if __name__ == '__main__':
    # Twitter authentication
    listener = StdoutListener()
    auth = OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_token_secret)
    stream = Stream(auth, listener)

    # Filter Twitter Streams to capture data by the keywords
    stream.filter(track=['python', 'java', 'php', 'ruby'])

Now if you run this command:


python get_tweets.py >>PL_tweets.txt

you’ll get the tweets containing at least one of the keywords python, java, php and ruby in the specified text file.

3. Understand Twitter response

The data collected previously is in JSON format, so it’s easy to read and understand. Still, I’ll take the time here to highlight some useful information inside the Twitter response.


As you probably noticed, the tweet contains information about the author, the list of hashtags and URLs that appear in the tweet, the main text of the tweet, the retweet count, the favourite count, etc.
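
To make these fields concrete, here is a minimal sketch that reads one JSON-encoded tweet and keeps only the parts mentioned above. The sample tweet is hypothetical and heavily trimmed, and the helper name `summarize_tweet` is my own; the field names (`text`, `user`, `entities`, `retweet_count`, `favorite_count`) are assumed to follow the usual layout of a Twitter response.

```python
import json

# Hypothetical, heavily trimmed tweet: real responses carry many more fields.
SAMPLE_LINE = json.dumps({
    "text": "Learning #python with tweepy http://t.co/abc",
    "user": {"screen_name": "some_user"},
    "entities": {
        "hashtags": [{"text": "python"}],
        "urls": [{"expanded_url": "http://example.com/post"}],
    },
    "retweet_count": 3,
    "favorite_count": 5,
})


def summarize_tweet(line):
    """Return a few selected fields from one JSON-encoded tweet."""
    tweet = json.loads(line)
    entities = tweet.get("entities", {})
    return {
        "author": tweet.get("user", {}).get("screen_name"),
        "text": tweet.get("text"),
        "hashtags": [tag["text"] for tag in entities.get("hashtags", [])],
        "urls": [url["expanded_url"] for url in entities.get("urls", [])],
        "retweets": tweet.get("retweet_count", 0),
        "favorites": tweet.get("favorite_count", 0),
    }


summary = summarize_tweet(SAMPLE_LINE)
print(summary["author"])
```

When applying the same function to a file of collected tweets, it is probably wise to skip lines that are empty or that are not valid JSON.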

Awesome, now you should start collecting data. The next posts of this series will be hot and exciting, and you will need a lot of data for them: the more data, the better the experience.

Stay tuned...


Installing different python versions in ubuntu

Since I write Python code that should run on different Python versions, I have to install multiple Python versions on my workstation. As usual, I believe that we should do everything as well as we can :).

This post describes my procedure for getting different Python versions installed on my Ubuntu workstation.

Installing Multiple Versions

Ubuntu typically only supports one Python 2.x version and one 3.x version at a time. There’s a popular PPA called deadsnakes that contains older versions of Python. To add it, run the commands below:

$ sudo add-apt-repository ppa:fkrull/deadsnakes
$ sudo apt-get update

I already have Ubuntu 14.04 installed on my workstation (so I have both python2.7 and python3.4). So, I’ll install versions 2.6 and 3.3.

$ sudo apt-get install python2.6 python3.3

Let’s check the default Python version by running `python -V`:

$ python -V
Python 2.7.6

Now, to manage the different Python versions I will use an amazing Linux command: update-alternatives. According to its man page, update-alternatives will “maintain symbolic links determining default commands”.

First, let’s install the different alternatives:

$ sudo update-alternatives --install /usr/bin/python python /usr/bin/python2.6 10
$ sudo update-alternatives --install /usr/bin/python python /usr/bin/python2.7 20
$ sudo update-alternatives --install /usr/bin/python python /usr/bin/python3.3 30
$ sudo update-alternatives --install /usr/bin/python python /usr/bin/python3.4 40

To choose the default Python version, run the command below:

$ sudo update-alternatives --config python

I can then switch between the different Python versions easily with the previous command. However, Ubuntu runs multiple maintenance scripts, and those scripts may break if I choose Python 2.6 as the default version.

Using virtualenv

I assume that you have different Python versions installed on your machine and that you didn’t change the default Python version (which is 2.7 in my case).

1. Installing virtualenv

$ sudo apt-get install python-virtualenv

2. Managing different Python versions
Suppose that I start a new project which will run on Python 2.6. Using this solution, I will be able to manage different versions of Python and different versions of any package I use. Great!

$ virtualenv -p /usr/bin/python2.6 ~/.envs/project_x_py2.6
Running virtualenv with interpreter /usr/bin/python2.6
New python executable in ~/.envs/project_x_py2.6/bin/python2.6
Also creating executable in ~/.envs/project_x_py2.6/bin/python
Installing distribute....................................done.
Installing pip.....................done.

3. Activating virtualenv
Before you can install any package for this project, you should activate its virtualenv:

$ source ~/.envs/project_x_py2.6/bin/activate

Now, if we check the default Python version used for this project:

$ python -V
Python 2.6.9
$ which python
~/.envs/project_x_py2.6/bin/python
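
The same check can also be made from inside Python, which is handy when a script wants to report which interpreter it is running under. This is only a small sketch; the helper name `describe_interpreter` is mine.

```python
import sys


def describe_interpreter():
    """Return the path and version of the interpreter running this code."""
    return {
        "executable": sys.executable,
        "version": "%d.%d.%d" % sys.version_info[:3],
    }


info = describe_interpreter()
print("%s (Python %s)" % (info["executable"], info["version"]))
```

Run inside an activated virtualenv, the reported path should point into the environment's bin directory rather than to /usr/bin.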

When you’re done with the project, just deactivate its virtualenv; you can get back to it whenever you need by activating it again:

$ deactivate

Memento design pattern: Part 2

As promised in the last post (part 1), I will try to improve the official implementation of the memento pattern inspired by the Java code on Wikipedia. I will try to improve these points:

  • the CareTaker creates a Memento object behind the scenes for every change in the Originator
  • a CareTaker object is created implicitly for each Originator

Coding time

First, let’s improve the Memento and CareTaker classes.


class Memento(object):
    def __init__(self, state):
        self.__state = state

    @property
    def state(self):
        return self.__state

    def __repr__(self):
        return "<Memento: {} >".format(str(self.__state))

class CareTaker(object):
    def __init__(self):
        self.__mementos = []

    @property
    def mementos(self):
        return self.__mementos

    def save(self, memento):
        self.__mementos.append(memento)

    def restore(self):
        return self.mementos.pop()

For the first enhancement we will use the magic method __setattr__, which gives us the possibility to control attribute assignment. Consider this example:

class A(object):
    def __setattr__(self, attr, val):
        print "Permission denied."

a = A()
a.x = 4  # "Permission denied"

As you see, assigning to an attribute goes through a Python call to the special method __setattr__. In our example, we replaced the default behaviour, so the value is never stored.
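
In practice we usually want to hook into the assignment without blocking it. Here is a minimal sketch, using a hypothetical `Logged` class of my own, that records every assignment and then delegates to the default behaviour:

```python
class Logged(object):
    """Record every attribute assignment, then perform it as usual."""

    def __init__(self):
        # Bypass our own __setattr__ so the log itself is not logged.
        object.__setattr__(self, 'log', [])

    def __setattr__(self, attr, val):
        self.log.append((attr, val))
        # Delegate to the default behaviour so the value is really stored.
        super(Logged, self).__setattr__(attr, val)


obj = Logged()
obj.x = 4
obj.x = 5
print(obj.log)
```

The call to the parent's __setattr__ is what makes the difference with the example above: the attribute ends up set, and the log keeps the history.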

In our case, we will use this method to create a Memento object implicitly for every change made on the Originator class.

class Originator(object):
    """
    Any originator class should inherit from this class.
    """
    def __init__(self, *args, **kw):
        # Let's create a caretaker for this originator
        self.__caretaker = kw.pop('caretaker', None) or CareTaker()
        super(Originator, self).__init__(*args, **kw)

    @property
    def caretaker(self):
        return self.__caretaker

    def __setattr__(self, attr, val):
        # Avoid keeping trace of private attributes changes,
        # especially the `caretaker` attribute
        if not attr.startswith('_'):
            # Let's save both attribute and its value
            self.__caretaker.save(Memento({
              'attr': attr,
              'value': getattr(self, attr, None)
            }))
        super(Originator, self).__setattr__(attr, val)

class Settings(Originator):
    pass

settings = Settings()
settings.font = 'Arial'
settings.font = 'Calibri'
caretaker = settings.caretaker
print 'We have {} mementos'.format(len(caretaker.mementos))

The downside of this implementation is that, when we override __init__ in a subclass, we must explicitly call the Originator’s __init__ method first. Consider this example:

class User1(Originator):
    def __init__(self, login, password):
        self.login = login
        self.password = password
        super(User1, self).__init__()

user = User1('john', 'password') # AttributeError

class User2(Originator):
    def __init__(self, login, password):
        # Initialise Originator class in the first place
        super(User2, self).__init__()
        self.login = login
        self.password = password

user = User2('john', 'password') # works

The problem appears when Python initialises the User1 object: the assignment implicitly calls the __setattr__ method, which tries to save a memento (for the login attribute), but the caretaker object has not been created yet. To fix this, we will only create memento objects after instance initialisation:

class Originator(object):
    """
    Any originator class should inherit from this class.
    """
    def __init__(self, *args, **kw):
        # Let's create a caretaker for this originator
        self.__caretaker = kw.pop('caretaker', None) or CareTaker()
        super(Originator, self).__init__(*args, **kw)

    @property
    def caretaker(self):
        return self.__caretaker

    def __setattr__(self, attr, val):
        if hasattr(self, '_Originator__caretaker'):
            # Let's save both attribute and its value
            self.__caretaker.save(Memento({
              'attr': attr,
              'value': getattr(self, attr, None)
            }))
        super(Originator, self).__setattr__(attr, val)

It’s mostly done; we should now add an undo method to the Originator class:

class Originator(object):
    ...

    def undo(self):
        memento = self.caretaker.restore()
        setattr(self, memento.state['attr'], memento.state['value'])

Great! However, there is a bug in this code: when we restore the Originator object, another memento object is created, which is an issue, and if we restore once more we get back to the state we just undid, which is terrible. Consider this example:

settings = Settings()
caretaker = settings.caretaker

for color in ('red', 'blue', 'green', 'yellow'):
    settings.color = color
    print 'We have {} mementos'.format(len(caretaker.mementos))

for i in range(7):
    settings.undo()
    print 'We have {} mementos ## color: {}'.format(len(caretaker.mementos), settings.color)

and below is the output:

We have 1 mementos
We have 2 mementos
We have 3 mementos
We have 4 mementos
We have 4 mementos ## color: green
We have 4 mementos ## color: yellow
We have 4 mementos ## color: green
We have 4 mementos ## color: yellow
We have 4 mementos ## color: green
We have 4 mementos ## color: yellow
We have 4 mementos ## color: green

To fix this, we will add a flag indicating whether __setattr__ is being executed in restore mode or not:

class Originator(object):
    ...
    def __setattr__(self, attr, val):
        restore = getattr(self, 'restore_mode', False)
        if (not restore and hasattr(self, '_Originator__caretaker')
              and attr != 'restore_mode'):
            self.__caretaker.save(Memento({
              'attr': attr,
              'value': getattr(self, attr, None)
            }))
        super(Originator, self).__setattr__(attr, val)

    def undo(self):
        memento = self.caretaker.restore()
        self.restore_mode = True
        setattr(self, memento.state['attr'], memento.state['value'])
        self.restore_mode = False

Now, we have only two issues left:

  • Handle the IndexError exception raised by the restore method
  • For the moment, a newly created attribute is considered to have been set to None before its creation, which is confusing

# Marker for attributes that did not exist before the assignment
class Empty(object):
    pass

class Originator(object):
    ...
    def __setattr__(self, attr, val):
        restore = getattr(self, 'restore_mode', False)
        if (not restore and hasattr(self, '_Originator__caretaker')
              and attr != 'restore_mode'):
            self.__caretaker.save(Memento({
              'attr': attr,
              'value': getattr(self, attr, Empty())
            }))
        super(Originator, self).__setattr__(attr, val)

    def undo(self):
        try:
            memento = self.caretaker.restore()
        except IndexError:
            return
        if isinstance(memento.state['value'], Empty):
            delattr(self, memento.state['attr'])
        else:
            self.restore_mode = True
            setattr(self, memento.state['attr'], memento.state['value'])
            self.restore_mode = False
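
Putting the pieces of this post together, here is a self-contained sketch that can be run as is. It assumes that undoing the creation of an attribute should delete that attribute by its saved name, and it drops the optional `caretaker` keyword argument for brevity.

```python
class Empty(object):
    """Marker meaning 'the attribute did not exist yet'."""


class Memento(object):
    def __init__(self, state):
        self.__state = state

    @property
    def state(self):
        return self.__state


class CareTaker(object):
    def __init__(self):
        self.__mementos = []

    @property
    def mementos(self):
        return self.__mementos

    def save(self, memento):
        self.__mementos.append(memento)

    def restore(self):
        return self.__mementos.pop()


class Originator(object):
    def __init__(self):
        self.__caretaker = CareTaker()

    @property
    def caretaker(self):
        return self.__caretaker

    def __setattr__(self, attr, val):
        restore = getattr(self, 'restore_mode', False)
        if (not restore and hasattr(self, '_Originator__caretaker')
                and attr != 'restore_mode'):
            self.__caretaker.save(Memento({
                'attr': attr,
                'value': getattr(self, attr, Empty()),
            }))
        super(Originator, self).__setattr__(attr, val)

    def undo(self):
        try:
            memento = self.caretaker.restore()
        except IndexError:
            return  # nothing left to undo
        if isinstance(memento.state['value'], Empty):
            delattr(self, memento.state['attr'])
        else:
            self.restore_mode = True
            setattr(self, memento.state['attr'], memento.state['value'])
            self.restore_mode = False


class Settings(Originator):
    pass


settings = Settings()
for color in ('red', 'blue', 'green'):
    settings.color = color

settings.undo()
color_after_one_undo = settings.color   # back to the previous value
settings.undo()
settings.undo()                         # undoes the very first assignment
color_removed = not hasattr(settings, 'color')
settings.undo()                         # no memento left: silently ignored
```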

I hope you liked today’s post. As I don’t think it’s perfect, you are welcome to give me your opinion and feedback, in the comments here or via @benzid_wael.

Memento design pattern: Part 1

Today, I will show you how to implement the Memento design pattern in Python. Assume that you are in a position where you want to implement an undo system, so you have an object for which you should keep all the changes that the user made to it.

How does it work?

If you take a look at the Memento pattern on Wikipedia, you’ll find this:

The memento pattern is implemented with three objects: the originator, a caretaker and a memento. The originator is some object that has an internal state. The caretaker is going to do something to the originator, but wants to be able to undo the change. The caretaker first asks the originator for a memento object. Then it does whatever operation (or sequence of operations) it was going to do. To roll back to the state before the operations, it returns the memento object to the originator. The memento object itself is an opaque object (one which the caretaker cannot, or should not, change). When using this pattern, care should be taken if the originator may change other objects or resources – the memento pattern operates on a single object.

Coding time

Let’s now rewrite the Java example from the wiki in Python:
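
What follows is only a minimal sketch of such a translation, assuming an originator with a single `state` value and a caretaker reduced to a plain list. The method names (`save_to_memento`, `restore_from_memento`, `get_saved_state`) are illustrative choices, not a verbatim copy of any listing.

```python
class Memento(object):
    """Opaque snapshot of an originator's state."""

    def __init__(self, state):
        self._state = state

    def get_saved_state(self):
        return self._state


class Originator(object):
    def __init__(self):
        self._state = None

    @property
    def state(self):
        return self._state

    def set(self, state):
        self._state = state

    def save_to_memento(self):
        return Memento(self._state)

    def restore_from_memento(self, memento):
        self._state = memento.get_saved_state()


# The caretaker keeps the mementos but never looks inside them.
saved_states = []
originator = Originator()
originator.set("State1")
originator.set("State2")
saved_states.append(originator.save_to_memento())
originator.set("State3")
saved_states.append(originator.save_to_memento())
originator.set("State4")

originator.restore_from_memento(saved_states[1])
print(originator.state)
```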

My opinion

I don’t like this implementation; it has many shortcomings, especially if you are a Pythonista:

  1. The Originator class is not general-purpose
  2. We have to manage which attributes to save in the Memento
  3. We have to create a CareTaker explicitly for every object

As an enhancement to this implementation, I want:

  • a CareTaker object to be created implicitly for each Originator
  • the CareTaker to create a Memento object behind the scenes for every change in the Originator
  • the possibility to have a Memento for a group of changes

OK, next time I will implement these enhancements and discuss the solution. Have a nice day.

Python notes: Part II

Last time I listed some of Python’s features, but I wanted to reserve a post for the loveliest one: the Python philosophy. The code below enumerates the main idioms around which Python has been designed:

>>> import this
The Zen of Python, by Tim Peters

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!

Through this post, I will try to mention an example for each idiom:

* Beautiful is better than ugly.

The snippets below highlight how Python code is more beautiful than code in C or any other programming language:

void permute(int *a, int *b)
{
 int swap;
 swap = *a;
 *a = *b;
 *b = swap;
}

>>> a, b = b, a

* Explicit is better than implicit.

So, you should avoid writing code like this:

>>> from os import *
>>> print getcwd()

Instead, you should write:

>>> import os
>>> print os.getcwd()

As a general rule, always try to be explicit and clear when coding.

* Simple is better than complex.

Simplicity should be a key goal in design. For example, Python offers the in operator to iterate over some structures, whereas in C/C++ you have to manage the iteration yourself:

int array[6] = {4, 5, 45, 3, 9, 7};
int i;
for (i = 0; i < 6; i++) {
  printf("i=%i\n", array[i]);
}

Fortunately, Python makes my life much easier:

>>> l = [4, 5, 45, 3, 9, 7]
>>> for i in l:
...     print "i=", i

* Complex is better than complicated.

Something complicated is hard to understand and analyse. If we cannot have a simple solution, a complex solution is better than a complicated one.

* Flat is better than nested.

Flat code is easier to read and maintain. Compare this nested code:

if (year % 4 == 0):
    if (year % 100 == 0):
        if (year % 400 == 0):
            leap = True
        else:
            leap = False
    else:
        leap = True
else:
    leap = False

with the flat version, which is more readable:

if year % 4 == 0 and year % 100 != 0:
    leap = True
elif year % 400 == 0:
    leap = True
else:
    leap = False
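
The flat version can be flattened one step further into a single boolean expression. This is a small sketch (the function name `is_leap` is mine) that also checks the result against `calendar.isleap` from the standard library:

```python
import calendar


def is_leap(year):
    """Gregorian leap-year rule written as one flat boolean expression."""
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)


for year in (1900, 2000, 2012, 2013):
    # calendar.isleap is the standard-library implementation of the same rule
    assert is_leap(year) == calendar.isleap(year)
```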

* Sparse is better than dense.

It’s always about readability: don’t write dense code, because it is difficult to understand:

  • Put empty lines between blocks of unrelated code within functions.
  • Put spaces around operators in most cases.
  • Put two blank lines between top-level function or class definitions.

* Readability counts.

You can notice this from:

  • Use of white spaces
  • Python is bracket-less. Instead, it uses indentation, which is more elegant and brings more clarity to Python code
  • Documentation

* Special cases aren’t special enough to break the rules.

Everything is an object in Python, and basic types like integer, float, boolean, etc. are not special enough to break the rule.

* Although practicality beats purity.

Sometimes, the rules have to be broken.

* Errors should never pass silently.

try:
    import json
except ImportError:
    print "Can not load json module"

* Unless explicitly silenced.

try:
    price = prices[k]
except KeyError:
    price = default_price

* In the face of ambiguity, refuse the temptation to guess.

Consider this code:

>>> 1 + 1
2
>>> '1' + '1'
'11'
>>> 1 + '1'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unsupported operand type(s) for +: 'int' and 'str'

As you see, the interpreter is in an ambiguous situation: it does not know whether this is a string concatenation or a simple addition. So it doesn’t try to guess; instead, it raises a TypeError.
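
The way out of the ambiguity is to state the intent with an explicit conversion. A minimal sketch, with two hypothetical helpers:

```python
def add_numbers(a, b):
    """Numeric addition: both operands are converted to int explicitly."""
    return int(a) + int(b)


def join_text(a, b):
    """String concatenation: both operands are converted to str explicitly."""
    return str(a) + str(b)


print(add_numbers(1, '1'))   # numeric intent
print(join_text(1, '1'))     # textual intent
```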

* There should be one– and preferably only one –obvious way to do it.

This idiom goes against another famous idiom:

There Is More Than One Way To Do It

That idiom is generally associated with the Perl programming language and with Larry Wall (the inventor of Perl), who thinks that phrasing a sentence differently might make it clearer. Pythonistas don’t agree with him: we think that even if there are many ways to do it, there is always one good way to do it. For example, you could write:

try:
    val = plants["carrot"]
except KeyError:
    val = "No carrot found"

Instead, it’s better to use the get method:

val = plants.get("carrot", "No carrot found")

* Although that way may not be obvious at first unless you’re Dutch.

Maybe you have seen or written code like this:

hits = {}
for log in logs:
    if hits.has_key(log.url):
        hits[log.url] += 1
    else:
        hits[log.url] = 1

But if you know the collections module, you will write:

import collections

hits = collections.defaultdict(int)
for log in logs:
    hits[log.url] += 1

This is the preferred way to do it.
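
For pure counting, the same module also offers `collections.Counter` (available since Python 2.7). A short sketch, where the list of URLs is made up and stands in for the `log.url` values:

```python
import collections

# Hypothetical log entries, reduced to their URL for the sake of the example.
urls = ['/home', '/about', '/home', '/contact', '/home']

hits = collections.Counter(urls)
top_url, top_count = hits.most_common(1)[0]
print("%s %d" % (top_url, top_count))
```

A Counter returns 0 for keys it has never seen, so no special case is needed for missing URLs.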

* Now is better than never.

Always fix the problem as soon as possible, and do not worry about perfection: Python 3 needed years to see the light.

* Although never is often better than *right* now.

Just take your time to do it well.

* If the implementation is hard to explain, it’s a bad idea.

* If the implementation is easy to explain, it may be a good idea.

Each of the two idioms above is a reformulation of the KISS principle:

KEEP IT SIMPLE, STUPID

* Namespaces are one honking great idea — let’s do more of those!

Using namespaces enables us to follow the SoC (Separation of Concerns) principle.

What’s new in Python 3.x

In this post, you will take a tour of most of the new features of Python 3.x compared to Python 2.7.

1. print function

The print statement has been removed from Python 3.x, and replaced by the print function.

>>> # This is a python 3.x interpreter
>>> print("Hello World!")
Hello World!
>>> print("Hello", "World!")
Hello World!
>>> print "Hello World!"
  File "<stdin>", line 1
 print "Hello World!"

SyntaxError: invalid syntax

2. Integer division

In Python 2.x, the / operator performs integer division when both operands are integers, and real division has to be requested explicitly: only float(5)/3 or 5.0/3 returns the real division of 5 by 3. Python 3.x removes this confusion by dedicating the / operator to real division and // to integer division.

>>> # This is a python 3.x interpreter
>>> print (3 / 2)
1.5
>>> print (3 // 2)
1
>>> print ( 3/ 2.0)
1.5
>>> print ( 3 // 2.0)
1.0
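
One detail worth knowing: // is a floor division, so it rounds towards negative infinity rather than towards zero. A short sketch, assuming a Python 3 interpreter:

```python
# Python 3 semantics
true_quotient = 7 / 2        # real division, always a float
floor_quotient = 7 // 2      # integer (floor) division
negative_floor = -7 // 2     # floors towards negative infinity
quotient, remainder = divmod(7, 2)

print(true_quotient, floor_quotient, negative_floor, quotient, remainder)
```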

3. Unicode

Finally, Python 3.x cleans up its string types: there is no longer a str type holding encoded bytes (ASCII, UTF-8, etc.) next to a separate unicode type. In Python 3.x we have two types: str for text (always Unicode) and bytes for binary data. As a consequence of this change in philosophy:

  • We no longer need u”…” literals for Unicode text (the prefix was removed in 3.0 and is accepted again since Python 3.3 for compatibility). However, we should use b”…” for binary data
  • Encoded Unicode text is represented as binary data (bytes)
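
The split between text and binary data is easiest to see with a non-ASCII character. A small sketch, assuming a Python 3 interpreter:

```python
text = "caf\u00e9"                 # str: a sequence of Unicode code points
data = text.encode("utf-8")        # bytes: the UTF-8 encoded form

print(type(text).__name__, len(text))   # length counted in characters
print(type(data).__name__, len(data))   # length counted in bytes
round_trip = data.decode("utf-8")
```

The accented letter is one character in the str but two bytes once encoded in UTF-8, which is why the two lengths differ.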

4. range

Fortunately, range now returns a lazy iterable object instead of a list, just like xrange in Python 2.x.
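
A quick sketch of what this means in practice, assuming a Python 3 interpreter: a huge range costs almost nothing to create, yet still supports length, indexing and membership tests.

```python
big = range(10 ** 9)      # no list of a billion integers is built

print(type(big).__name__)
print(len(big), big[5], big[-1])
print(999999999 in big)   # membership is computed, not searched

first_five = list(range(5))   # ask for a list explicitly when one is needed
```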

5. Comparison

In Python 2.x, we can compare unorderable types:

>>> # This is a python 2.x interpreter
>>> [1, 2] > "foo"
False
>>> (1, 2) > "bar"
True

However, Python 3 raises a TypeError to avoid subtle bugs:

>>> # This is a python 3.x interpreter
>>> 1 > "two"
TypeError: unorderable types: int() > str()

6. yield from

Python 3.3 introduced the yield from expression (PEP 380), which permits generator delegation to subgenerators or arbitrary subiterators. So, instead of writing

>>> # This is a python 2.x interpreter
>>> def wrapper():
...     for i in gen():
...         yield i

we can just write:

>>> # This is a python 3.x interpreter
>>> def wrapper():
...     yield from gen()

This is much prettier and shorter: the delegation takes just one line.
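
Here is a complete, runnable sketch of the idea for Python 3.3 or later. The helper `chain` is my own name; it delegates to any iterable, not only to generators.

```python
def gen():
    yield 1
    yield 2


def chain(*iterables):
    """Yield every item of every iterable, delegating with `yield from`."""
    for iterable in iterables:
        yield from iterable


result = list(chain(gen(), "ab", [3]))
print(result)
```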

7. Annotation

PEP 3107 introduced function annotations in Python. Here is a small example:

>>> # This is a python 3.x interpreter
>>> def greet(name: str):
...     print ("Hello {0}".format(name))

Prior to Python 3, developers relied on docstring conventions or 3rd-party libraries to annotate functions. Below is an example:

def greet(name):
    """
    Print a greet message
    :type name: str
    :param name: The name of person to be greeted.
    """
    print "Hello %s" % name

You can see that this method has some drawbacks: first of all, it does not respect the DRY principle; second, it’s not standardised, because there are several docstring formats (Sphinx, epydoc, etc.).
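
Annotations, by contrast, are stored on the function object itself, where tools can read them. A short sketch for Python 3; the extra `times` parameter and the return annotation are additions of mine for illustration.

```python
def greet(name: str, times: int = 1) -> str:
    return ", ".join(["Hello {0}".format(name)] * times)


print(greet.__annotations__)
unchecked = greet(42)   # annotations are not enforced at run time
```

Note that the interpreter only records the annotations; it does not check the arguments against them.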

As you see, there are many great new features in Python 3.x, and maybe you agree with me that it’s time to change and upgrade to Python 3.


Grid Walk problem

Description:

There is a monkey which can walk around on a planar grid. The monkey can move one space at a time left, right, up or down. That is, from (x, y) the monkey can go to (x+1, y), (x-1, y), (x, y+1), and (x, y-1). Points where the sum of the digits of the absolute value of the x coordinate plus the sum of the digits of the absolute value of the y coordinate is less than or equal to 19 are accessible to the monkey. For example, the point (59, 79) is inaccessible because 5 + 9 + 7 + 9 = 30, which is greater than 19. Another example: the point (-5, -7) is accessible because abs(-5) + abs(-7) = 5 + 7 = 12, which is less than 19. How many points can the monkey access if it starts at (0, 0), including (0, 0) itself?

Input sample:

There is no input for this program.

Output sample:

Print out how many points the monkey can access. (The number should be printed as an integer, e.g. if the answer is 10 (it’s not!), print out 10, not 10.0 or 10.00, etc.)

Solution:

def sum_of_digits(number):
    """
    Calculate the sum of the digits of the specified parameter.
    """
    return sum(map(int, str(number)))


def is_accessible_point(x, y):
    """
    Verify if the given point is accessible or not.
    """
    return (sum_of_digits(abs(x)) + sum_of_digits(abs(y))) <= 19


def append_neighbors(points, seen, x, y):
    """
    Append the accessible neighbors of (x, y) to the list `points`
    if they have not been seen yet.
    """
    for (i, j) in ((x+1, y), (x, y+1), (x-1, y), (x, y-1)):
        if (i, j) not in seen and is_accessible_point(i, j):
            seen.add((i, j))
            points.append((i, j))


if __name__ == '__main__':
    points = [(0, 0)]
    seen = set(points)  # set membership tests keep the search fast
    i = 0
    while i < len(points):
        append_neighbors(points, seen, points[i][0], points[i][1])
        i += 1

    print "Number of accessible points is: %d" % len(points)

Gist: https://gist.github.com/benzid-wael/e8d4b16ddc566e60b75c
Pastebin: http://pastebin.com/AC25hvGF
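
A convenient way to gain confidence in such a search is to make the threshold a parameter and try it on values small enough to count by hand. Below is a sketch of that idea as a breadth-first search; the names `digit_sum` and `count_accessible` are mine, and the problem's own threshold corresponds to `count_accessible(19)`.

```python
from collections import deque


def digit_sum(number):
    """Sum of the decimal digits of the absolute value of `number`."""
    return sum(int(digit) for digit in str(abs(number)))


def count_accessible(limit):
    """Count the points reachable from (0, 0) for a given digit-sum limit."""
    seen = {(0, 0)}
    queue = deque(seen)
    while queue:
        x, y = queue.popleft()
        for point in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if point in seen:
                continue
            if digit_sum(point[0]) + digit_sum(point[1]) <= limit:
                seen.add(point)
                queue.append(point)
    return len(seen)


print(count_accessible(1))
```

With a limit of 1, only the origin and its four direct neighbors are reachable: a point such as (10, 0) satisfies the digit rule, but the monkey cannot get there because (2, 0) is already forbidden.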