Practical introduction to web mining: data wrangling

Most of the programming work in a data analysis project is spent in the data preparation stage, because the collected data is rarely in the structure your data processing application expects. Fortunately, Twitter data is structured, so we won’t spend much time on this stage.

The first thing we have to do is load the collected data. There’s nothing special here: we only need the json Python module. Here’s the code:

import json

def load_tweets(path):
    """Load tweets from a file containing one JSON object per line."""
    tweets = []
    with open(path, 'r') as file_stream:
        for line in file_stream:
            try:
                tweet = json.loads(line)
                tweets.append(tweet)
            except ValueError:
                # skip lines that are not valid JSON (e.g. truncated writes)
                pass
    return tweets

tweets_list = load_tweets("PL_tweets.txt")

1. Pandas

Next, we will create a pandas DataFrame. Pandas is an open-source Python library providing high-level data structures and tools for data analysis. Pandas has two main data structures:

  • Series: a one-dimensional array of data with an associated array of index labels.
  • DataFrame: a tabular data structure containing a collection of columns, with both a row and a column index. In other words, a DataFrame is a collection of Series. A minimal illustration of both structures follows.
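To make this concrete, here is a tiny sketch of both structures, with made-up values (not from our Twitter dataset):

import pandas as pd

# a Series: one-dimensional data plus an associated index
counts = pd.Series([120, 80], index=['python', 'ruby'])
print(counts['python'])  # access by index label -> 120

# a DataFrame: a collection of Series sharing the same row index
frame = pd.DataFrame({'lang': ['python', 'ruby'], 'count': [120, 80]})
print(frame.head())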

Let’s first explore the tweet structure. If you don’t have an idea about the Twitter API, it’s a good idea to look at the official documentation before completing this tutorial. Personally, I think the key attributes of a tweet are:

  • id: the tweet identifier
  • text: the text of the tweet itself
  • lang: acronym for the language (e.g. “en” for English, “fr” for French)
  • created_at: the date of creation
  • favorite_count, retweet_count: the number of favorites and retweets
  • place, coordinates, geo: geo-location information if available
  • user: the author’s full profile
  • entities: list of entities like URLs, @-mentions, hashtags and symbols
  • in_reply_to_user_id: user identifier if the tweet is a reply to a specific user
  • in_reply_to_status_id: status identifier if the tweet is a reply to a specific status

The code below will create a Pandas DataFrame object containing the most useful tweet metadata, which we will use in the next post of this series:

import pandas as pd

# create Pandas DataFrame
tweets = pd.DataFrame()

# create some columns
tweets['tweetID'] = [ tweet['id'] for tweet in tweets_list ]
tweets['tweetText'] = [ tweet['text'] for tweet in tweets_list ]
tweets['tweetLang'] = [ tweet['lang'] for tweet in tweets_list ]
tweets['tweetCreatedAt'] = [ tweet['created_at'] for tweet in tweets_list ]
tweets['tweetRetweetCount'] = [ tweet['retweet_count'] for tweet in tweets_list ]
tweets['tweetFavoriteCount'] = [ tweet['favorite_count'] for tweet in tweets_list ]
tweets['tweetGeo'] = [ tweet['geo'] for tweet in tweets_list ]
tweets['tweetCoordinates'] = [ tweet['coordinates'] for tweet in tweets_list ]
tweets['tweetPlace'] = [ tweet['place'] for tweet in tweets_list ] 

# tweeple information 
tweets['userScreenName'] = [ tweet['user']['screen_name'] for tweet in tweets_list ]
tweets['userName'] = [ tweet['user']['name'] for tweet in tweets_list ]
tweets['userLocation'] = [ tweet['user']['location'] for tweet in tweets_list ]

# tweet interaction 
tweets['tweetIsReplyToUserId'] = [ tweet['in_reply_to_user_id'] for tweet in tweets_list ]
tweets['tweetIsReplyToStatusId'] = [ tweet['in_reply_to_status_id'] for tweet in tweets_list ]

Super! We’ve created our first data frame. Pandas DataFrames provide a beautiful and rich API, from visualizing to interacting with the data:

  • head(N): returns the first N rows
  • tail(N): returns the last N rows
  • iteritems(): iterates over (column name, Series) pairs
  • etc.

The code below will display the first 5 rows of our data frame:

>>> tweets.head(5)
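For instance, iterating over the columns with iteritems (renamed items in recent pandas versions) looks like this:

# print each column's name and dtype
for column_name, series in tweets.iteritems():
    print("%s: %s" % (column_name, series.dtype))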

2. Cleaning Data

Unfortunately, acquired data is usually dirty and has a lot of inconsistencies: duplicated entries, bad values, non-normalized values, etc. So the cleanup process should mainly include:

  • removing duplicate entries
  • stripping whitespace
  • normalizing numbers, dates, etc.

The output of this process is a clean dataset: a dataset consisting only of valid and normalized values, which will ensure that our analysis code WILL NOT CRASH!

2.1 Missing data

If you followed the previous steps of this tutorial, you probably noticed, as shown in the figure below, the NaN values in some columns. NaN is a special value used to denote missing data.

fig 1. Missing data (NaN values in some columns of the DataFrame)
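Before fixing anything, a quick way to quantify the missing data per column is:

# count missing (NaN) values in each column
print(tweets.isnull().sum())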

Now we have to handle these missing values. We have mainly two options:

  • replacing all NaN values with None
  • treating each column separately: for example, replacing NaN with None for the tweetIsReplyToUserId and tweetIsReplyToStatusId columns, and replacing both None and NaN with “Unknown” for the userLocation column, etc.

Personally, I’ll opt for the second option, using the fillna method, which fills NaN values with a given value:

# let's handle the userLocation column
tweets.userLocation.fillna("Unknown", inplace=True)
# now replace the remaining NaN values with None; fillna cannot fill
# with None, so use where, which keeps values where the condition holds
tweets = tweets.astype(object).where(pd.notnull(tweets), None)

Note that I set the inplace argument of the fillna method to True explicitly; otherwise the userLocation series would not be modified.
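Equivalently, without inplace, you would reassign the result:

# same effect as fillna(..., inplace=True)
tweets['userLocation'] = tweets['userLocation'].fillna("Unknown")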

2.2 Bad data

If you took a look at the Twitter documentation earlier, you probably know that the values of the tweetCreatedAt column are string representations of a date and time. We have to convert these values into datetime objects.

You could use the strptime function of the datetime module, which parses a string representation of a date and/or time. But I prefer Pandas’ to_datetime function, which parses and converts the entire series at once.

tweets.tweetCreatedAt = pd.to_datetime(tweets.tweetCreatedAt)
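For comparison, here is roughly what the strptime route would look like for a single value (a sketch assuming Python 3, where the %z directive handles the UTC offset used in Twitter’s created_at format):

from datetime import datetime

# Twitter dates look like "Wed Aug 27 13:08:45 +0000 2008"
created_at = datetime.strptime("Wed Aug 27 13:08:45 +0000 2008",
                               "%a %b %d %H:%M:%S %z %Y")
print(created_at.year)  # 2008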

2.3 Duplicated data

Really, I didn’t expect to have duplicated entries in my dataset. But as the collection script crashed several times, I wasn’t surprised. Pandas provides some methods to deal with duplicated data. The duplicated method annotates each row with a boolean specifying whether that row is a duplicate. By default, row identity is defined by checking all columns, but you can restrict it to specific columns. For our example, we can specify only the tweetID column, as it’s a unique ID for the tweet.


>>> tweets.duplicated(['tweetID'],
                      keep="last")
0        False
1        False
2        False
3        False
4        False
5        False
6         True
7        False
8        False
9        False
10       False
11       False
...

You can drop duplicated rows using the drop_duplicates method, as below:


>>> tweets.drop_duplicates(['tweetID'],
                           keep="last")
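Note that drop_duplicates returns a new DataFrame rather than modifying tweets in place, so to keep (and sanity-check) the result you would do something like:

tweets = tweets.drop_duplicates(['tweetID'], keep="last")

# sanity check: every tweetID should now be unique
assert not tweets.duplicated(['tweetID']).any()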

Conclusion

I think I’ve covered the most important tips and steps of the data wrangling stage. But note that while Twitter data is structured and fairly clean, this is not the usual case. In fact, real-world data is dirty: you’ll have to do more work on it before you can use it.

Waiting for your comments and suggestions.

Practical introduction to web mining: collect data

Web mining is the application of natural language processing techniques to web content in order to retrieve relevant information. It has become more important these days due to the exponential increase in digital content, especially with the appearance of social media platforms, and in particular Twitter, which constitutes a rich and reliable information source.

In this series, I’ll explain how to collect Twitter data, manipulate it, and extract knowledge from it. As I am a fan of Python, I’ll try to compare Python to other programming languages such as Java, Ruby, and PHP based on the information we collect from Twitter.

In this tutorial, we will start by collecting data from Twitter, introducing Tweepy and the structure of Twitter data.

1. Create a Twitter application

First of all, you need some Twitter keys to be able to connect to the Twitter API and gather data from it. Specifically, we need the API key, API secret, access token, and access token secret. To get this information, follow the steps below:

  1. Go to https://apps.twitter.com and log in with your Twitter account.
  2. Create a new Twitter application.
  3. On the next page, in the “API keys” tab, you can find both the API key and the API secret.
  4. Scroll down and generate your access token and access token secret.

fig. Creating a new Twitter application

Once you’ve created a new Twitter app and generated your keys, you can move to the next step and start collecting data.

2. Getting Data From Twitter

We will use the Twitter Streaming API to collect tweets related to 4 keywords: python, java, php, and ruby. Happily, the Streaming API gives us the possibility to filter tweets by keywords. The code below will capture, in real time, the tweets that contain one of the keywords mentioned earlier:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

from tweepy.streaming import StreamListener
from tweepy import OAuthHandler
from tweepy import Stream

# User credentials for the Twitter API
access_token = "ENTER YOUR ACCESS TOKEN"
access_token_secret = "ENTER YOUR ACCESS TOKEN SECRET"
consumer_key = "ENTER YOUR API KEY"
consumer_secret = "ENTER YOUR API SECRET"


class StdoutListener(StreamListener):
    """Listener that writes every received tweet (raw JSON) to stdout."""

    def on_data(self, data):
        print(data)
        return True  # returning False would disconnect the stream

    def on_error(self, status):
        print(status)


if __name__ == '__main__':
    # Twitter authentication
    listener = StdoutListener()
    auth = OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_token_secret)
    stream = Stream(auth, listener)

    # Filter the Twitter stream to capture tweets matching the keywords
    stream.filter(track=['python', 'java', 'php', 'ruby'])

Now if you run this command:


python get_tweets.py >> PL_tweets.txt

you’ll have the tweets containing one of the keywords python, java, php, and ruby appended to the specified text file, one JSON object per line.
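A small sketch to check how many tweets you have collected so far (assuming, as above, one JSON object per line):

with open("PL_tweets.txt") as f:
    print(sum(1 for line in f if line.strip()))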

3. Understanding the Twitter response

The data collected previously is in JSON format, so it’s easy to read and understand. But I’ll take the time here to highlight some useful information inside the Twitter response.

fig. A sample tweet in JSON format

As you probably noticed, the tweet contains information about the tweeple (the tweet’s author), the list of hashtags and URLs appearing in the tweet, the main text of the tweet, the retweet count, the favorite count, etc.
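If you want to poke at the structure yourself, here is a small sketch that loads the first collected tweet and lists its top-level fields (assuming the PL_tweets.txt file produced above):

import json

with open("PL_tweets.txt") as f:
    tweet = json.loads(f.readline())

# top-level keys of the tweet object, then the tweet text itself
print(sorted(tweet.keys()))
print(tweet['text'])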

Awesome, now you can start collecting data. The next posts of this series will be hot and exciting, and you’ll need a lot of data for them: the more data, the better the experience.

Stay tuned ….