Strange Python Import Error

Recently, while installing a big Django project on my macOS Sierra machine, I got stuck on a strange ImportError.

ImportError: No module named certs

Here certs.py is a module of the requests package. At first glance, I thought I hadn’t installed the right version of the requests package. But I was wrong: the file was right there, under the expected directory in my filesystem.

Only two options remained: either the virtualenv was not properly configured, or there was a problem with Python on macOS. The first option was quickly eliminated, as the shell was working fine and I was able to import the same module in an interactive session. Then I made another attempt and checked file permissions, even though that made little sense at the time: the user had full privileges on the whole directory.

Honestly, I couldn’t find any explanation and I didn’t know where to start. But after a discussion with a colleague, he told me that another colleague had struggled with the same issue for a week and fixed it by increasing the max open files limit. Now everything started to make sense, in some way or another: as you may have guessed, Python was not handling this error properly, and hitting the open files limit surfaced as a misleading ImportError instead of a clear “too many open files” message.

Solution

Now that you know the problem, the solution is to increase the max open files limits:
* For older macOS (Lion or earlier):

You may add the following line to /etc/launchd.conf (owner: root:wheel, mode: 0644):

limit maxfiles 262144 524288

* For Mountain Lion:
You may add the following lines to /etc/sysctl.conf (owner: root:wheel, mode: 0644):

kern.maxfiles=524288
kern.maxfilesperproc=262144

* For Mavericks, Yosemite, El Capitan, and Sierra:

You have to create a file at /Library/LaunchDaemons/limit.maxfiles.plist (owner: root:wheel, mode: 0644):

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
  <dict>
    <key>Label</key>
    <string>limit.maxfiles</string>
    <key>ProgramArguments</key>
    <array>
      <string>launchctl</string>
      <string>limit</string>
      <string>maxfiles</string>
      <string>262144</string>
      <string>524288</string>
    </array>
    <key>RunAtLoad</key>
    <true/>
    <key>ServiceIPC</key>
    <false/>
  </dict>
</plist>
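
To apply the new limits without rebooting, you can load the daemon and check the result. Something like the following should work on Mavericks and later (these are the classic launchctl subcommands; they may differ on newer macOS releases):

$ sudo launchctl load -w /Library/LaunchDaemons/limit.maxfiles.plist
$ launchctl limit maxfiles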



Practical introduction to web mining: data wrangling

Most of the programming work in a data analysis project is spent in the data preparation stage, because collected data is rarely in the structure your data processing application expects. Fortunately, Twitter data is structured, so we won’t spend a lot of time in this stage.

The first thing we have to do is load the collected data. There’s nothing special here; we only need the json Python module. Below is the code:

import json

def load_tweets(path):
    tweets = []
    with open(path, 'r') as file_stream:
        for line in file_stream:
            try:
                tweet = json.loads(line)
                tweets.append(tweet)
            except ValueError:
                # skip lines that are not valid JSON
                continue
    return tweets

tweets_list = load_tweets("PL_tweets.txt")

1. Pandas

Next we will create a pandas DataFrame. Pandas is an open-source Python library providing high-level data structures and tools for data analysis. Pandas has two main data structure types (a short example follows the list):

  • Series: a one-dimensional array of data with an associated index array.
  • DataFrame: a tabular data structure containing a collection of columns. A DataFrame has both a row and a column index. In other words, a DataFrame is a collection of Series.
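
Here is a quick sketch of both structures, with made-up values just for illustration:

import pandas as pd

# a Series: values plus an associated index
langs = pd.Series([10, 7, 3], index=['en', 'fr', 'ar'])

# a DataFrame: a collection of named columns (each column is a Series)
df = pd.DataFrame({'lang': ['en', 'fr', 'ar'],
                   'count': [10, 7, 3]})

print langs['en']   # 10
print df['count']   # the 'count' column, as a Series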

Let’s first explore the tweet structure. If you’re not familiar with the Twitter API, it’s a good idea to look at the official documentation before continuing with this tutorial. Personally, I think the key attributes of a tweet are:

  • id: the tweet identifier
  • text: the text of the tweet itself
  • lang: acronym for the language (e.g. “en” for English, “fr” for French)
  • created_at: the date of creation
  • favorite_count, retweet_count: the number of favorites and retweets
  • place, coordinates, geo: geo-location information if available
  • user: the author’s full profile
  • entities: list of entities like URLs, @-mentions, hashtags and symbols
  • in_reply_to_user_id: user identifier if the tweet is a reply to a specific user
  • in_reply_to_status_id: status identifier if the tweet is a reply to a specific status

The code below creates a pandas DataFrame object containing the most useful tweet metadata, which we will use in the next post of this series:

import pandas as pd

# create Pandas DataFrame
tweets = pd.DataFrame()

# create some columns
tweets['tweetID'] = [ tweet['id'] for tweet in tweets_list ]
tweets['tweetText'] = [ tweet['text'] for tweet in tweets_list ]
tweets['tweetLang'] = [ tweet['lang'] for tweet in tweets_list ]
tweets['tweetCreatedAt'] = [ tweet['created_at'] for tweet in tweets_list ]
tweets['tweetRetweetCount'] = [ tweet['retweet_count'] for tweet in tweets_list ]
tweets['tweetFavoriteCount'] = [ tweet['favorite_count'] for tweet in tweets_list ]
tweets['tweetGeo'] = [ tweet['geo'] for tweet in tweets_list ]
tweets['tweetCoordinates'] = [ tweet['coordinates'] for tweet in tweets_list ]
tweets['tweetPlace'] = [ tweet['place'] for tweet in tweets_list ] 

# tweeple information 
tweets['userScreenName'] = [ tweet['user']['screen_name'] for tweet in tweets_list ]
tweets['userName'] = [ tweet['user']['name'] for tweet in tweets_list ]
tweets['userLocation'] = [ tweet['user']['location'] for tweet in tweets_list ]

# tweet interaction 
tweets['tweetIsReplyToUserId'] = [ tweet['in_reply_to_user_id'] for tweet in tweets_list ]
tweets['tweetIsReplyToStatusId'] = [ tweet['in_reply_to_status_id'] for tweet in tweets_list ]

Super! We’ve created our first data frame. Pandas data frames provide a beautiful and rich API for visualizing and interacting with the data:

  • head(N): returns first N rows
  • tail(N): returns last N rows
  • iteritems(): iterates over (column name, Series) pairs
  • etc.

The code below will display the first 5 rows in our data frame:

>>> tweets.head(5)

2. Cleaning Data

Unfortunately, acquired data is usually dirty and full of inconsistencies: duplicated entries, bad values, non-normalized values, etc. So the cleanup process should mainly include:

  • removing duplicate entries
  • stripping whitespace
  • normalizing numbers, dates, etc.

The output of this process is a clean dataset: a dataset consisting only of valid and normalized values. This will ensure that our analysis code WILL NOT CRASH!

2.1 Missing data

If you followed the previous steps of this tutorial, you probably noticed, as shown in the figure below, the NaN values in some columns. NaN is a special value denoting missing data.

fig 1. Missing data (NaN values in some columns of the data frame)

Now we have to handle these missing values. We have mainly two options:

  • replacing all NaN values with None
  • treating each column separately: for example, replacing NaN with None for the tweetIsReplyToUserId and tweetIsReplyToStatusId columns, replacing both None and NaN with “Unknown” for the userLocation column, etc.

Personally, I’ll opt for the second option, using the fillna method, which fills NaN values with a given value:

# let's handle the userLocation column
tweets.userLocation.fillna("Unknown", inplace=True)
# now let's replace the remaining NaN values with None
tweets = tweets.where(pd.notnull(tweets), None)

Note that I set the inplace argument of fillna to True explicitly; otherwise the userLocation series would not be modified. The where call, on the other hand, returns a new data frame, which is why I assign the result back.

2.2 Bad data

If you already took a look at the Twitter documentation, you probably know that the values of the tweetCreatedAt column are string representations of a date and time. We have to convert these values into datetime objects.

You could use the strptime function from the datetime module, which parses a string representation of a date and/or time. But I prefer pandas’ to_datetime method, which parses and converts the entire series at once.

tweets.tweetCreatedAt = pd.to_datetime(tweets.tweetCreatedAt)
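
For comparison, here is what a manual parse of a single value with strptime could look like, assuming Twitter’s usual created_at layout (e.g. "Wed Aug 27 13:08:45 +0000 2008"):

from datetime import datetime

# the UTC offset is matched literally, as %z is unreliable in Python 2's strptime
created_at = "Wed Aug 27 13:08:45 +0000 2008"
parsed = datetime.strptime(created_at, "%a %b %d %H:%M:%S +0000 %Y")
print parsed   # 2008-08-27 13:08:45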

2.3 Duplicated data

Honestly, I didn’t expect duplicated entries in my dataset. But as the collection script crashed several times, I wasn’t surprised to find some. Pandas provides some methods to deal with duplicated data. The duplicated method annotates each row with a boolean specifying whether that row is a duplicate. By default, row identity is defined by checking all columns, but you can restrict the check to specific columns. For our example, we can use only the tweetID column, as it’s a unique identifier for the tweet.


>>> tweets.duplicated(['tweetID'],
                      keep="last")
0        False
1        False
2        False
3        False
4        False
5        False
6         True
7        False
8        False
9        False
10       False
11       False
...

You can drop duplicated rows using the drop_duplicates method, as below. Note that it returns a new data frame by default, so assign the result back (or pass inplace=True):


>>> tweets = tweets.drop_duplicates(['tweetID'],
                                    keep="last")

Conclusion

I think I’ve covered the most important tips and steps of the data wrangling stage. But note that Twitter data is structured and fairly clean, and this is not the usual case. In fact, real-world data is dirty: you’ll have to do more work on it before you can use it.

Waiting for your comments and suggestions.

Scheduled jobs with Celery, Django and Redis

Setting up a deferred task queue for your Django application can be a pain, and it shouldn’t be. Some people use cron, which is not only a bad solution, it’s a disaster. Personally, I use Celery. In this post, I’ll show you how to set up a deferred task queue for your Django application using Celery.

What’s Celery?

Celery is an asynchronous task queue/job queue based on distributed message passing. It is focused on real-time operation, but supports scheduling as well.

The promise of Celery is to let you run code later, or regularly according to a schedule. Setting up deferred tasks through Celery is not trivial, but it’s worth the effort: Celery has a distributed architecture that scales as you need. Any Celery installation is composed of three core components:

  1. Celery client: used to issue background jobs.
  2. Celery workers: the processes responsible for running jobs. Workers can be local or remote, so you can start with a single worker on the same web application server and add workers later as your traffic and load grow.
  3. Message broker: the client communicates with the workers through a message queue, and Celery supports several ways to implement these queues. The most commonly used brokers are RabbitMQ and Redis.

Installing requirements

First, let’s install Redis:

$ sudo apt-get install redis-server

Now, let’s install some Python packages:

$ pip install celery
$ pip install django-celery

Configuring Django for Celery

Once the installation is complete, you’re ready to set up your scheduler. Let’s configure Celery in settings.py:

# add djcelery to your existing INSTALLED_APPS
INSTALLED_APPS = (
    'djcelery',
)

BROKER_URL = 'redis://127.0.0.1:6379/0'
BROKER_TRANSPORT = 'redis'
CELERYBEAT_SCHEDULER = 'djcelery.schedulers.DatabaseScheduler'

The above lines configure Celery: which broker to use, and which scheduler to use for beat events.

Since you added the djcelery package to your INSTALLED_APPS, you need to create the Celery database tables. Instructions for that differ depending on your environment. If you’re using South or the built-in migrations (Django >= 1.7) for schema migrations:

$ python manage.py migrate

Otherwise:

$ python manage.py syncdb

Below is the celery.py file used to set up the scheduler for your Django project:

# celery.py file
from __future__ import absolute_import

import os
import django

from celery import Celery
from django.conf import settings

os.environ.setdefault('DJANGO_SETTINGS_MODULE', 'demo.settings')
django.setup()

app = Celery('Scheduler')

app.config_from_object('django.conf:settings')
app.autodiscover_tasks(lambda: settings.INSTALLED_APPS)

Writing some tasks

Let’s assume you have a task that should be executed periodically; a good example might be a Twitter bot or a scraper.

import tweepy

# note: a real bot needs an authenticated API instance here
api = tweepy.API()

def get_recent_tweets(query):
    for tweet in tweepy.Cursor(api.search, q=query,
                               rpp=100, result_type="recent",
                               include_entities=True,
                               lang="en").items():
        print tweet.created_at, tweet.text
        # Save tweet into database
        ...

Now, we need to create a Celery task for get_recent_tweets:

## /project_name/app_name/tasks.py

from celery.decorators import task

from utils import twitter

@task
def get_recent_tweets(*args):
    # Just an example
    twitter.get_recent_tweets(*args)

N.B: Things can get a lot more complicated than this.

Scheduling it

Now, we have to schedule our tasks. We will run the get_bigdata_tweets task every hour, as big data is a subject I want to follow. For this purpose, I’ll use the celery beat scheduler. In your settings.py file, add this code:

from celery.schedules import crontab

CELERYBEAT_SCHEDULER = "djcelery.schedulers.DatabaseScheduler"
CELERYBEAT_SCHEDULE = {
    "get_bigdata_tweets": {
        'task': "bots.twitter.tasks.get_recent_tweets",
        # every hour, at the top of the hour
        'schedule': crontab(minute=0),
        # args must be a tuple: note the trailing comma
        'args': ("bigdata",),
    },
}
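
Finally, the schedule only fires if both a worker and the beat process are running. With django-celery, something like the command below should work (exact command names vary between Celery versions, so check the documentation for yours):

$ python manage.py celery worker --beat --loglevel=info
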
For further details about scheduler configuration, see the documentation.

Merge querysets from different django models

If you’ve ever been in a situation where you needed to merge two querysets from different models into one, you’ve surely seen this error:

Cannot combine queries on two different base models.

The solution is to use itertools.chain, which makes an iterator that runs through the given iterables one after another.

from itertools import chain

result_lst = list(chain(queryset1, queryset2))

Now you can sort the resulting list by any common field, e.g. the creation date:

from itertools import chain
from operator import attrgetter

result_lst = sorted(
    chain(queryset1, queryset2),
    key=attrgetter('created_at'))
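
If you want the newest objects first, just pass reverse=True to sorted:

result_lst = sorted(
    chain(queryset1, queryset2),
    key=attrgetter('created_at'),
    reverse=True)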

Installing different python versions in ubuntu

Since I write Python code that should run on different Python versions, I have to install multiple Python versions on my workstation. As usual, I believe we should do everything as well as we can :).

This post describes my procedure for installing different Python versions on my Ubuntu workstation.

Installing Multiple Versions

Ubuntu typically supports only one Python 2.x version and one 3.x version at a time. There’s a popular PPA called deadsnakes that contains older versions of Python. To install it, run the commands below:

$ sudo add-apt-repository ppa:fkrull/deadsnakes
$ sudo apt-get update

I already have Ubuntu 14.04 installed on my workstation (so I have both python2.7 and python3.4), so I’ll install versions 2.6 and 3.3:

$ sudo apt-get install python2.6 python3.3

Let’s check the default Python version by running `python -V`:

$ python -V
Python 2.7.6

Now, to manage the different Python versions, I will use an amazing Linux command: update-alternatives. According to its man page, “update-alternatives maintain symbolic links determining default commands”.

First, let’s register the different alternatives:

$ sudo update-alternatives --install /usr/bin/python python /usr/bin/python2.6 10
$ sudo update-alternatives --install /usr/bin/python python /usr/bin/python2.7 20
$ sudo update-alternatives --install /usr/bin/python python /usr/bin/python3.3 30
$ sudo update-alternatives --install /usr/bin/python python /usr/bin/python3.4 40
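
You can list the registered alternatives to verify that everything went well; given the commands above, it should print the four interpreters:

$ update-alternatives --list python
/usr/bin/python2.6
/usr/bin/python2.7
/usr/bin/python3.3
/usr/bin/python3.4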

To choose the default Python version, run the command below:

$ sudo update-alternatives --config python

I can now switch between the different Python versions easily with the previous command. However, Ubuntu runs multiple maintenance scripts, and those scripts may break if I choose Python 2.6 as the default version.

Using virtualenv

I assume that you now have different Python versions installed on your machine and that you didn’t change the default version (which is 2.7 in my case).

1. Installing virtualenv

$ sudo apt-get install python-virtualenv

2. Managing different python versions
Suppose I’m starting a new project that will run on Python 2.6. With this approach, I can manage different versions of Python and different versions of any package I use. Great!

$ virtualenv -p /usr/bin/python2.6 ~/.envs/project_x_py2.6
Running virtualenv with interpreter /usr/bin/python2.6
New python executable in ~/.envs/project_x_py2.6/bin/python2.6
Also creating executable in ~/.envs/project_x_py2.6/bin/python
Installing distribute....................................done.
Installing pip.....................done.

3. Activating virtualenv
Before you can install any packages for this project, you should activate it:

$ source ~/.envs/project_x_py2.6/bin/activate

Now, if we check the default Python version used for this project:

$ python -V
Python 2.6.9
$ which python
~/.envs/project_x_py2.6/bin/python

When you’re done with the project, just deactivate its virtualenv; you can come back to it whenever you need by activating it again:

$ deactivate

Memento design pattern: Part 2

As promised in the last post (part 1), I will try to improve the official implementation of the memento pattern, inspired by the Java code on Wikipedia. I will try to improve these points:

  • the CareTaker should create a Memento object for every change in the Originator, behind the scenes
  • a CareTaker object should be created implicitly for each Originator class

Coding time

First, let’s improve the Memento and CareTaker classes.


class Memento(object):
    def __init__(self, state):
        self.__state = state

    @property
    def state(self):
        return self.__state

    def __repr__(self):
        return "<Memento: {} >".format(str(self.__state))

class CareTaker(object):
    def __init__(self):
        self.__mementos = []

    @property
    def mementos(self):
        return self.__mementos

    def save(self, memento):
        self.__mementos.append(memento)

    def restore(self):
        return self.mementos.pop()
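
Before wiring these into the Originator, here is a quick sanity check of how the pair behaves on its own:

caretaker = CareTaker()
caretaker.save(Memento({'attr': 'font', 'value': 'Arial'}))
caretaker.save(Memento({'attr': 'font', 'value': 'Calibri'}))
print len(caretaker.mementos)   # 2
print caretaker.restore()       # the last saved memento
print len(caretaker.mementos)   # 1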

For the first enhancement, we will use a magic method, __setattr__, which gives us the ability to control attribute assignment. Consider this example:

class A(object):
    def __setattr__(self, attr, val):
        print "Permission denied."

a = A()
a.x = 4  # "Permission denied"

As you can see, modifying an attribute’s value goes through a Python call to the special method __setattr__. In this example, we replaced the default behaviour.

In our case, we will use this method to create a Memento object implicitly for every change made on the Originator class.

class Originator(object):
    """
    Any originator class should inherit from this class.
    """
    def __init__(self, *args, **kw):
        # Let's create a caretaker for this originator
        self.__caretaker = kw.pop('caretaker', None) or CareTaker()
        super(Originator, self).__init__(*args, **kw)

    @property
    def caretaker(self):
        return self.__caretaker

    def __setattr__(self, attr, val):
        # Avoid keeping track of private attribute changes,
        # especially the `caretaker` attribute
        if not attr.startswith('_'):
            # Let's save both attribute and its value
            self.__caretaker.save(Memento({
              'attr': attr,
              'value': getattr(self, attr, None)
            }))
        super(Originator, self).__setattr__(attr, val)

class Settings(Originator):
    pass

settings = Settings()
settings.font = 'Arial'
settings.font = 'Calibri'
caretaker = settings.caretaker
print 'We have {} mementos'.format(len(caretaker.mementos))

The downside of this implementation is that we must call the Originator’s __init__ method first when we override it in a subclass. Consider this example:

class User1(Originator):
    def __init__(self, login, password):
        self.login = login
        self.password = password
        super(User1, self).__init__()

user = User1('john', 'password') # AttributeError

class User2(Originator):
    def __init__(self, login, password):
        # Initialise Originator class in the first place
        super(User2, self).__init__()
        self.login = login
        self.password = password

user = User2('john', 'password') # works

The problem appears when Python initializes the User1 object: it implicitly calls the __setattr__ method, which tries to save a memento (for the login attribute), but the caretaker object is not yet created. To fix this, we will only create memento objects after instance initialisation:

class Originator(object):
    """
    Any originator class should inherit from this class.
    """
    def __init__(self, *args, **kw):
        # Let's create a caretaker for this originator
        self.__caretaker = kw.pop('caretaker', None) or CareTaker()
        super(Originator, self).__init__(*args, **kw)

    @property
    def caretaker(self):
        return self.__caretaker

    def __setattr__(self, attr, val):
        if hasattr(self, '_Originator__caretaker'):
            # Let's save both attribute and its value
            self.__caretaker.save(Memento({
              'attr': attr,
              'value': getattr(self, attr, None)
            }))
        super(Originator, self).__setattr__(attr, val)

It’s mostly done; we now just have to add an undo method to the Originator class:

class Originator(object):
    ...

    def undo(self):
        memento = self.caretaker.restore()
        setattr(self, memento.state['attr'], memento.state['value'])

Great! However, there is a bug in this code: each undo creates another memento object, which is an issue in itself; worse, a second undo brings us back to the state we just left, which is terrible. Consider this example:

settings = Settings()
caretaker = settings.caretaker

for color in ('red', 'blue', 'green', 'yellow'):
    settings.color = color
    print 'We have {} mementos'.format(len(caretaker.mementos))

for i in range(7):
    settings.undo()
    print 'We have {} mementos ## color: {}'.format(len(caretaker.mementos), settings.color)

And below is the output:

We have 1 mementos
We have 2 mementos
We have 3 mementos
We have 4 mementos
We have 4 mementos ## color: green
We have 4 mementos ## color: yellow
We have 4 mementos ## color: green
We have 4 mementos ## color: yellow
We have 4 mementos ## color: green
We have 4 mementos ## color: yellow
We have 4 mementos ## color: green

To fix this, we will add a flag indicating whether __setattr__ is being executed in restore mode:

class Originator(object):
    ...
    def __setattr__(self, attr, val):
        restore = getattr(self, 'restore_mode', False)
        if (not restore and hasattr(self, '_Originator__caretaker')
              and attr != 'restore_mode'):
            self.__caretaker.save(Memento({
              'attr': attr,
              'value': getattr(self, attr, None)
            }))
        super(Originator, self).__setattr__(attr, val)

    def undo(self):
        memento = self.caretaker.restore()
        self.restore_mode = True
        setattr(self, memento.state['attr'], memento.state['value'])
        self.restore_mode = False

Now, we have only two issues left:

  • handling the IndexError exception raised by the restore method
  • newly created attributes are considered to have been None before their creation, which is confusing

Both issues are handled below:

class Empty(object):
    pass

class Originator(object):
    ...
    def __setattr__(self, attr, val):
        restore = getattr(self, 'restore_mode', False)
        if (not restore and hasattr(self, '_Originator__caretaker')
              and attr != 'restore_mode'):
            self.__caretaker.save(Memento({
              'attr': attr,
              'value': getattr(self, attr, Empty())
            }))
        super(Originator, self).__setattr__(attr, val)

    def undo(self):
        try:
            memento = self.caretaker.restore()
        except IndexError:
            return
        if isinstance(memento.state['value'], Empty):
            delattr(self, memento.state['attr'])
        else:
            self.restore_mode = True
            setattr(self, memento.state['attr'], memento.state['value'])
            self.restore_mode = False
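
A quick check of the final behaviour, reusing the Settings subclass from earlier (assuming it now inherits from this final Originator): undoing past the creation of an attribute removes it instead of setting it to None.

settings = Settings()
settings.color = 'red'    # first assignment: an Empty marker is saved
settings.color = 'blue'   # second assignment: the previous value 'red' is saved

settings.undo()
print settings.color              # red
settings.undo()
print hasattr(settings, 'color')  # False: the attribute was removed
settings.undo()                   # empty history: returns quietly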

I hope you liked today’s post. As I don’t think it’s perfect, you are welcome to give me your opinion and feedback, in the comments here or @benzid_wael.

Memento design pattern: Part 1

Today, I will show you how to implement the Memento design pattern in Python. Suppose you want to implement an undo system: you have an object, and you need to keep track of all the changes the user makes to it.

How does it work?

If you take a look at the Memento pattern on Wikipedia, you’ll find this:

The memento pattern is implemented with three objects: the originator, a caretaker and a memento. The originator is some object that has an internal state. The caretaker is going to do something to the originator, but wants to be able to undo the change. The caretaker first asks the originator for a memento object. Then it does whatever operation (or sequence of operations) it was going to do. To roll back to the state before the operations, it returns the memento object to the originator. The memento object itself is an opaque object (one which the caretaker cannot, or should not, change). When using this pattern, care should be taken if the originator may change other objects or resources – the memento pattern operates on a single object.

Coding time

Let’s now rewrite the Java example from the wiki in Python. Below is a fairly direct port (method names follow the Wikipedia version, adapted to Python naming conventions):
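
class Memento(object):
    def __init__(self, state):
        self.__state = state

    @property
    def state(self):
        return self.__state


class Originator(object):
    def set(self, state):
        print "Originator: setting state to", state
        self.__state = state

    def save_to_memento(self):
        print "Originator: saving to Memento."
        return Memento(self.__state)

    def restore_from_memento(self, memento):
        self.__state = memento.state
        print "Originator: state after restoring:", self.__state


# the caretaker simply holds the mementos on behalf of the originator
saved_states = []

originator = Originator()
originator.set("State1")
originator.set("State2")
saved_states.append(originator.save_to_memento())
originator.set("State3")
saved_states.append(originator.save_to_memento())
originator.set("State4")

originator.restore_from_memento(saved_states[0])  # back to "State2"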

My opinion

I don’t like this implementation; it has many shortcomings, especially if you are a Pythonista:

  1. The Originator class is not general-purpose
  2. We have to manage which attributes to save in the Memento
  3. We have to create a CareTaker explicitly for every object

As an enhancement to this implementation, I want:

  • to create a CareTaker object implicitly for each Originator class
  • the CareTaker to create a Memento object for every change in the Originator, behind the scenes
  • the possibility to have a Memento for a group of changes

OK, next time I will implement these enhancements and discuss the solution. Have a nice day.