Practical introduction to web mining: collect data

Web mining is the application of natural language processing techniques to web content in order to retrieve relevant information. It has become more important with the exponential growth of digital content, especially since the appearance of social media platforms such as Twitter, which constitute a rich and reliable source of information.

In this series, I’ll explain how to collect Twitter data, manipulate it and extract knowledge from it. As I am a fan of Python, I’ll try to compare Python to other programming languages such as Java, Ruby and PHP based on information that we will collect from Twitter.

In this tutorial, we will start by collecting data from Twitter, and introduce tweepy and the structure of Twitter data.

1. Create a Twitter application

First of all, you need some Twitter keys to connect to the Twitter API and gather data from it. Specifically, we need an API key, an API secret, an access token and an access token secret. To get them, follow the steps below:

  1. Go to https://apps.twitter.com and log in with your Twitter account.
  2. Create a new Twitter application.
  3. On the next page, in the “API keys” tab, you can find both the API key and the API secret.
  4. Scroll down and generate your access token and access token secret.

[Image: create_twitter_app]

Once you have created a new Twitter app and generated your keys, you can move to the next step and start collecting data.

2. Getting Data From Twitter

We will use the Twitter Streaming API to collect tweets related to 4 keywords: python, java, php and ruby. Happily, the Streaming API gives us the possibility to filter tweets by keyword. The code below will fetch public tweets, as they are posted, that contain any of the keywords mentioned earlier:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

from tweepy.streaming import StreamListener
from tweepy import OAuthHandler
from tweepy import Stream

# User credentials for the Twitter API
access_token = "ENTER YOUR ACCESS TOKEN"
access_token_secret = "ENTER YOUR ACCESS TOKEN SECRET"
consumer_key = "ENTER YOUR API KEY"
consumer_secret = "ENTER YOUR API SECRET"


class StdoutListener(StreamListener):
    """Print every raw message received from the stream to stdout."""

    def on_data(self, data):
        print(data)
        return True  # keep the stream open

    def on_error(self, status):
        print(status)


if __name__ == '__main__':
    # Twitter authentication
    listener = StdoutListener()
    auth = OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_token_secret)
    stream = Stream(auth, listener)

    # Filter the Twitter stream to capture tweets matching the keywords
    stream.filter(track=['python', 'java', 'php', 'ruby'])

Now if you run this command:


python get_tweets.py >>PL_tweets.txt

the tweets containing any of the keywords python, java, php or ruby will be appended to the specified text file.
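To check what ended up in the file, it can be read back with a few lines of Python. The sketch below is a minimal reader, assuming each tweet was written as one JSON document per line, possibly separated by blank lines. The function name `load_tweets` is my own (hypothetical), not part of tweepy.

```python
import json


def load_tweets(path):
    """Yield one parsed object per valid JSON line in the file at `path`."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # blank separator line between tweets
            try:
                yield json.loads(line)
            except ValueError:
                continue  # truncated or non-JSON line, ignore it
```

Calling `list(load_tweets("PL_tweets.txt"))` would then give you the collected tweets as Python dictionaries. Skipping bad lines instead of raising is a deliberate choice here, since a stream that was interrupted may leave a half-written last line.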

3. Understand the Twitter response

The data collected previously is in JSON format, so it’s easy to read and understand. Still, I’ll take the time here to highlight some useful information inside the Twitter response.

[Image: tweet_sample]

As you probably noticed, the tweet contains information about the user who posted it, the list of hashtags and URLs that appear in the tweet, the main text of the tweet, the retweet count, the favourite count, etc.
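The fields mentioned above can be pulled out of a decoded tweet with plain dictionary access. This is a small sketch on a trimmed, made-up tweet. The field names (`user.screen_name`, `entities.hashtags`, `entities.urls`, `retweet_count`, `favorite_count`) are the ones I know from Twitter API v1.1 responses, so check them against your own data. `summarize_tweet` is a hypothetical helper.

```python
import json

# A trimmed, made-up tweet that keeps only the fields discussed above.
SAMPLE = '''{
  "text": "Learning #python with tweepy http://t.co/abc",
  "user": {"screen_name": "example_user"},
  "entities": {
    "hashtags": [{"text": "python"}],
    "urls": [{"expanded_url": "http://example.com/post"}]
  },
  "retweet_count": 3,
  "favorite_count": 5
}'''


def summarize_tweet(tweet):
    """Pick the commonly used fields out of a decoded tweet dict."""
    entities = tweet.get("entities", {})
    return {
        "author": tweet.get("user", {}).get("screen_name"),
        "text": tweet.get("text"),
        "hashtags": [tag["text"] for tag in entities.get("hashtags", [])],
        "urls": [url["expanded_url"] for url in entities.get("urls", [])],
        "retweets": tweet.get("retweet_count", 0),
        "favorites": tweet.get("favorite_count", 0),
    }


summary = summarize_tweet(json.loads(SAMPLE))
print(summary["author"], summary["hashtags"], summary["retweets"])
```

Using `.get()` with defaults keeps the helper from failing on stream messages that are not tweets (delete notices, for example) and therefore lack these keys.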

Awesome, now you should start collecting data. The next posts in this series will be hot and exciting, and you will need a lot of data for them: the more data, the better the experience.

Stay tuned ….

Update your git repositories at once

If you have multiple git repositories, here is a CLI tool that lets you update them all at once: it’s called Gitup. Gitup is a cross-platform tool designed to update a large number of git repositories at once, and it can manage different remotes, branches, etc.

To install Gitup:

git clone git://github.com/earwig/git-repo-updater.git
cd git-repo-updater
sudo python setup.py install

If all your repositories are in the same directory (here repos), just launch gitup like this:

gitup ~/repos

and you get the latest version of all files at once. You can also bookmark your repositories with the --add parameter:

gitup --add ~/repos
gitup --add ~/repos2

and then pull all of them just by running the command below:

gitup

Practical, isn’t it? For further details, see the GitHub repository.
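To make the idea of “a directory full of repositories” concrete, here is a minimal Python sketch that lists the direct subdirectories of a folder that look like git repositories. This is not Gitup’s actual code, only an illustration of the concept. `find_git_repos` is a hypothetical name, and a repository is detected simply by the presence of a `.git` entry.

```python
import os


def find_git_repos(root):
    """Return sorted paths of direct subdirectories of `root` holding a .git entry."""
    repos = []
    for name in sorted(os.listdir(root)):
        path = os.path.join(root, name)
        if os.path.isdir(path) and os.path.exists(os.path.join(path, ".git")):
            repos.append(path)
    return repos
```

For example, `find_git_repos(os.path.expanduser("~/repos"))` would return the repositories that a call like `gitup ~/repos` is meant to cover.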

Scheduled jobs with Celery, Django and Redis

Setting up a deferred task queue for your Django application can be a pain, and it shouldn’t be. Some people use cron, which is not only a bad solution but a disaster. Personally, I use Celery. In this post, I’ll show you how to set up a deferred task queue for your Django application using Celery.

What’s Celery?

Celery is an asynchronous task queue/job queue based on distributed message passing. It is focused on real-time operation, but supports scheduling as well.

The promise of Celery is to allow you to run code later, or regularly according to a schedule. Unfortunately, running deferred tasks through Celery is not trivial. But it’s useful and beneficial, as it has a distributed architecture that scales as you need. Any Celery installation is composed of three core components:

  1. Celery client: used to issue background jobs.
  2. Celery workers: the processes responsible for running jobs. Workers can be local or remote, so you can start with a single worker on the same server as the web application, and later add workers as your traffic and load grow.
  3. Message broker: the client communicates with the workers through a message queue, and Celery supports several ways to implement these queues. The most commonly used brokers are RabbitMQ and Redis.
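The three roles above can be imitated with nothing but the Python standard library. The sketch below is only a toy model, not how Celery is implemented: an in-process `queue.Queue` stands in for the broker, a thread stands in for a worker process, and the main program plays the client.

```python
import queue
import threading

broker = queue.Queue()   # stands in for Redis or RabbitMQ
results = []


def worker():
    """Worker: pull jobs off the broker and run them until told to stop."""
    while True:
        job = broker.get()
        if job is None:      # sentinel: shut the worker down
            break
        func, args = job
        results.append(func(*args))


def add(x, y):
    return x + y


thread = threading.Thread(target=worker)
thread.start()

# Client: enqueue background jobs and carry on without waiting for them.
broker.put((add, (2, 3)))
broker.put((add, (10, 5)))

broker.put(None)   # ask the worker to stop once the queue is drained
thread.join()
print(results)     # [5, 15]
```

The point to take away is the decoupling: the client only talks to the queue, so workers can be added or moved to other machines without touching the code that issues the jobs.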

Installing requirements

First of all, let’s install Redis:

$ sudo apt-get install redis-server

Now, let’s install some python packages:

pip install celery
pip install django-celery

Configuring Django for Celery

Once the installation is complete, you’re ready to set up your scheduler. Let’s configure Celery:

INSTALLED_APPS = (
    'djcelery',
)

BROKER_URL = 'redis://127.0.0.1:6379/0'
BROKER_TRANSPORT = 'redis'
CELERYBEAT_SCHEDULER = 'djcelery.schedulers.DatabaseScheduler'

The lines above configure Celery: which broker to use, and which scheduler to use for the periodic (beat) events.

As you added the djcelery package to your INSTALLED_APPS, you need to create the Celery database tables. The instructions differ depending on your environment. If you are using South or the built-in migrations (Django >= 1.7) for schema migrations:

$ python manage.py migrate

Otherwise:

$ python manage.py syncdb

Below is the celery.py file used to set up the scheduler for your Django project:

# celery.py file
from __future__ import absolute_import

import os
import django

from celery import Celery
from django.conf import settings

os.environ.setdefault('DJANGO_SETTINGS_MODULE', 'demo.settings')
django.setup()

app = Celery('Scheduler')

app.config_from_object('django.conf:settings')
app.autodiscover_tasks(lambda: settings.INSTALLED_APPS)

Write some tasks

Let’s assume that you have a task that should be executed periodically; a good example might be a Twitter bot or a scraper.

import tweepy

api = tweepy.API()

def get_recent_tweets(query):
    for tweet in tweepy.Cursor(api.search, q=query,
                               rpp=100, result_type="recent",
                               include_entities=True,
                               lang="en").items():
        print("%s %s" % (tweet.created_at, tweet.text))
        # Save tweet into database
        ...

Now, we need to create a Celery task for get_recent_tweets:

## /project_name/app_name/tasks.py

from celery.decorators import task

from utils import twitter

@task
def get_recent_tweets(*args):
    # Just an example
    twitter.get_recent_tweets(*args)

N.B: Things can get a lot more complicated than this.

Scheduling it

Now we have to schedule our task. We will run the get_bigdata_tweets entry every hour, since big data is an interesting subject that I want to follow. For this purpose, I’ll use the celery beat scheduler. In your settings.py file, add this code:

from datetime import timedelta

CELERYBEAT_SCHEDULER = "djcelery.schedulers.DatabaseScheduler"
CELERYBEAT_SCHEDULE = {
    "get_bigdata_tweets": {
        'task': "bots.twitter.tasks.get_recent_tweets",
        # Every 1 hour
        'schedule': timedelta(hours=1),
        # A one-element tuple needs the trailing comma
        'args': ("bigdata",),
    },
}

For further details about scheduler configuration, see the documentation.
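Two details in a schedule entry like this one are easy to get wrong: the 'args' value must be a real tuple, and the interval must match what the comment promises. The snippet below is a plain Python check, with no Celery involved, assuming the scheduler passes 'args' positionally to the task. `show_args` is a hypothetical stand-in for the task.

```python
from datetime import timedelta

not_a_tuple = ("bigdata")    # parentheses alone do not make a tuple
one_tuple = ("bigdata",)     # the trailing comma does


def show_args(*args):
    """Return the positional arguments a task would receive."""
    return args


print(type(not_a_tuple).__name__)    # str
print(show_args(*not_a_tuple))       # one argument per character
print(show_args(*one_tuple))         # ('bigdata',)

hourly = timedelta(hours=1)
print(hourly.total_seconds())        # 3600.0
```

With the plain string, the task would be called with seven one-letter arguments instead of a single query, which is rarely what you want.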

Getting Started with Vagrant

In the last post, I showed you how to install VirtualBox on Ubuntu 12.04. Today, I’ll show you how to improve your ability to manage VMs by using an awesome tool. Yes, I’m talking about Vagrant.

What’s Vagrant?

Vagrant is an open source tool for managing virtual machines (VMs) developed by Mitchell Hashimoto and John Bender. A virtual machine is a full implementation of a computer with a virtual disk, memory and CPU. A Box, or base image, is the pre-packaged virtual machine that Vagrant will manage.

Vagrant is a CLI tool. Calling vagrant without additional arguments will provide the list of available commands:

  • init — create the base configuration file.
  • up — start a new instance of the virtual machine.
  • suspend — suspend the running guest.
  • halt — stop the running guest, similar to hitting the power button on a real machine.
  • resume — restart the suspended guest.
  • reload — reboot the guest.
  • status — determine the status of vagrant for the current Vagrantfile.
  • provision — run the provisioning commands.
  • destroy — remove the current instance of the guest, delete the virtual disk and associated files.
  • box — the set of commands used to add, list, remove or repackage box files.
  • package — used for the creation of new box files.
  • ssh — ssh to a running guest.

Install it

This is easy. Go to the Vagrant downloads page and download the latest release, which is v1.7.4 at the time of writing. You can download it with the command below:

wget https://dl.bintray.com/mitchellh/vagrant/vagrant_1.7.4_x86_64.deb

Next, install it with the following command:

sudo dpkg -i vagrant_1.7.4_x86_64.deb

Great! Now you have Vagrant installed on your machine. You can test it by adding boxes, creating a new box, etc.

Your first Vagrant project

Download the Ubuntu 12.04 (Precise) Vagrant box:

vagrant box add precise64 http://files.vagrantup.com/precise64.box # 323MB, faster download

Once it’s downloaded, let’s initialize a new Vagrant project based on the downloaded box:

mkdir -p ~/vagrant-tutorial/ && cd ~/vagrant-tutorial/
vagrant init precise64 # creates a default Vagrantfile in the current directory

For more details about Vagrantfile, see the official documentation.

Creating VM

Now let’s start up a VM defined by the created Vagrantfile:

vagrant up

The VM is now running. You can ssh into it with the following command:

vagrant ssh

Note that on the new VM, /vagrant is a shared directory mapped to ~/vagrant-tutorial on the host:

ls /vagrant

Soon, you’ll start making modifications to the Vagrantfile. There are three ways to rebuild the VM.

# Fastest method: re-runs the provisioner without stopping the VM.
vagrant provision 

# Restarts VM, provisions. Use this if you changed virtualbox settings (e.g shared folders)
vagrant reload 

# Destroys the active VM, and rebuilds.
# Slow, but guarantees stability.
vagrant destroy --force && vagrant up

Thanks for reading !

Install VirtualBox on Ubuntu using PPA

This is a brief post in which I’m going to show you how to install VirtualBox on Ubuntu. I assume that you already know what VirtualBox is, so I’ll get straight to the point.

Dependency

To avoid any errors, we should install the dkms package. Just type:

sudo apt-get install dkms

Installation

First of all, press Ctrl + Alt + T to open a terminal. When it’s open, type:

wget -q http://download.virtualbox.org/virtualbox/debian/oracle_vbox.asc -O- | sudo apt-key add -

which will add the VirtualBox repository’s key. Next, add the VirtualBox repository to your /etc/apt/sources.list:

sudo sh -c 'echo "deb http://download.virtualbox.org/virtualbox/debian precise contrib" >> /etc/apt/sources.list'

Finally, let’s update the package lists and install the latest VirtualBox (which is v4.3 on Ubuntu 12.04):

sudo apt-get update && sudo apt-get install virtualbox-4.3

Extra

Now that you understand the different steps to install VirtualBox on Ubuntu 12.04 using PPA, here is a single command that will install VirtualBox for you regardless of your Ubuntu version:

sudo sh -c "echo 'deb http://download.virtualbox.org/virtualbox/debian '$(lsb_release -cs)' contrib non-free' > /etc/apt/sources.list.d/virtualbox.list" && wget -q http://download.virtualbox.org/virtualbox/debian/oracle_vbox.asc -O- | sudo apt-key add - && sudo apt-get update && sudo apt-get install virtualbox-4.3 dkms
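The only part of that one-liner that changes from one Ubuntu version to another is the release codename returned by lsb_release -cs. As an illustration, here is how the repository line it writes is put together, sketched in Python. `virtualbox_sources_line` is a hypothetical helper, and the components are the ones used in the command above.

```python
def virtualbox_sources_line(codename, components=("contrib", "non-free")):
    """Build the APT source line for the given Ubuntu release codename."""
    base = "deb http://download.virtualbox.org/virtualbox/debian"
    return "{} {} {}".format(base, codename, " ".join(components))


print(virtualbox_sources_line("precise"))
# deb http://download.virtualbox.org/virtualbox/debian precise contrib non-free
```

On another release, only the codename in the middle of the line would differ.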

Enjoy 🙂 !

 

Merge querysets from different django models

If you have ever needed to merge two querysets from different models into one, you’ve surely seen this error:

Cannot combine queries on two different base models.

The solution is to use itertools.chain, which makes an iterator that yields the items of each given iterable in turn.

from itertools import chain

result_lst = list(chain(queryset1, queryset2))

Now you can sort the merged result by any common field, e.g. the creation date:

from itertools import chain
from operator import attrgetter

result_lst = sorted(
    chain(queryset1, queryset2),
    key=attrgetter('created_at'))
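If you want to try the pattern without a Django project at hand, the sketch below uses namedtuples as stand-ins for model instances. `Article` and `Video` are made-up models that only share a `created_at` field. It also passes `reverse=True` to get the newest items first.

```python
from collections import namedtuple
from datetime import date
from itertools import chain
from operator import attrgetter

# Stand-ins for instances of two different Django models.
Article = namedtuple("Article", "title created_at")
Video = namedtuple("Video", "name created_at")

articles = [Article("Intro", date(2015, 1, 10)), Article("Part 2", date(2015, 3, 1))]
videos = [Video("Demo", date(2015, 2, 5))]

# Merge both sequences and sort them on the shared field, newest first.
merged = sorted(chain(articles, videos), key=attrgetter("created_at"), reverse=True)
print([type(item).__name__ for item in merged])
# ['Article', 'Video', 'Article']
```

Keep in mind that the result is a plain Python list, not a queryset, so you can no longer call .filter() or .order_by() on it.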