Twitter Text Classification¶
Posted on June 10, 2018
In this post, we consider the problem of classifying tweets into a set of categories. In particular, our goal is to build a model that classifies tweets into categories such as technology, politics, weather, and health. Given the large number of tweets posted each day, or even each hour, tweet classification can be very useful when our goal is to filter noise and extract relevant information from tweets. For example, in election season we can select the tweets that fall into the politics category before running a sentiment analysis to find which candidates are favored by the population. As another example, we can extract technology-related tweets to find out what is trending in the technology domain. Later in this post, we will pick out tweets that fall into the weather category to find out in which parts of the country people are talking about the weather.
The two broad approaches that can be used for this problem are supervised and unsupervised learning. In this post we focus on supervised learning. The unsupervised approach centers on text clustering and topic modeling, which we will explore in a separate post.
Successful application of supervised learning requires a large enough dataset of labeled tweets. One approach is to label the tweets manually, but that would be a time-consuming and painstaking task. Instead, the approach we use here is to collect tweets from well-established Twitter accounts, such as those of news agencies, that focus on a particular topic, e.g., politics, technology, or weather. The drawback of this method is that, inevitably, some tweets will be mislabeled. For example, a Twitter account that tweets about technology may occasionally tweet about health or politics as well, and this could interfere with our classification and decrease its accuracy. We pursue this approach here and later evaluate its performance to find out whether it is practical.
Collecting Tweets¶
To collect tweets, in addition to a Twitter account, we need to create a Twitter app associated with the account. Once we register the app, we obtain the credentials necessary to authorize our application to access Twitter data as if it were the Twitter account itself. Once we have the credentials, we can use the Python tweepy library to access our Twitter account data. The following code sets up the tweepy API, assuming the app credentials are stored in 'credentials.csv'.
import pandas as pd
import tweepy
from tweepy import OAuthHandler

# credentials.csv is assumed to have the columns: consumer_key,
# consumer_secret, access_token, access_secret
credentials = pd.read_csv('credentials.csv')
consumer_key = credentials['consumer_key'][0]
consumer_secret = credentials['consumer_secret'][0]
access_token = credentials['access_token'][0]
access_secret = credentials['access_secret'][0]

auth = OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)
api = tweepy.API(auth, wait_on_rate_limit=True)

import warnings
warnings.simplefilter('ignore')
As mentioned before, we are going to follow some well-known Twitter accounts to prepare our labeled data. Here are the accounts we have chosen to follow:
users = api.friends('<twitter account>', count=100)
[u.screen_name for u in users]
As we can see, we are following 45 accounts that tweet about technology, politics, health, weather, traffic, and sports. Next, we set up a timer to periodically collect new tweets from these accounts and save them to a MongoDB database. MongoDB gives us data persistence and an easy way to manage Twitter data. In addition, it accepts JSON documents, which is very convenient since the Twitter API returns tweets as JSON.
import threading
import time
from pymongo import MongoClient
from pymongo.errors import DuplicateKeyError

def collect_tweets(users):
    curr_count = news_tweets.count()
    now = time.time()
    print("Started collecting tweets")
    for user in users:
        try:
            tweets = tweepy.Cursor(api.user_timeline, screen_name=user.screen_name).items()
        except tweepy.TweepError:
            continue
        for tw in tweets:
            # See if the tweet already exists in the database
            if news_tweets.find_one({'id': tw.id}):
                break
            try:
                news_tweets.insert_one(tw._json)
            except DuplicateKeyError:
                break
    new_count = news_tweets.count()
    new_time = time.time() - now
    print("Collected %d tweets in %.2f seconds" % (new_count - curr_count, new_time))
class TimerThread(threading.Thread):
    def __init__(self, interval, task, args=[], kwargs={}):
        super(TimerThread, self).__init__()
        self.stop_ = False
        self.interval = interval
        self.args = args
        self.kwargs = kwargs
        self.task = task
        self.start()

    def run(self):
        while not self.stop_:
            self.task(*self.args, **self.kwargs)
            time.sleep(self.interval)

    def stop(self):
        self.stop_ = True
client = MongoClient()
news_tweets = client.tweets_db.new_tweets
tthread = TimerThread(15 * 60, collect_tweets, [users])

# ... later, once enough tweets have been collected:
tthread.stop()
In the above code:

- We set up a timer thread that wakes up every 15 minutes. Once the thread wakes up, it goes through the followed accounts and downloads the tweets posted since the last invocation.
- For each new tweet, we check the database to see whether we have already collected it. To do this we use the 'id' field, which is unique for each tweet (see the index snippet after this list).
- Once we see a duplicate tweet for an account, it means we have collected all new tweets generated by that account in the past 15 minutes, so we move on to the next account.
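Note that for the DuplicateKeyError branch above to ever fire, the 'id' field needs a unique index; by default MongoDB only enforces uniqueness on its own _id field. We can create that index once:

# Reject inserts whose 'id' is already in the collection; without this
# index, insert_one would never raise DuplicateKeyError on tweet ids.
news_tweets.create_index('id', unique=True)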
After running the code for a few hours, we have collected around 140,000 tweets:
from pymongo import MongoClient
client = MongoClient()
news_tweets = client.tweets_db.new_tweets
news_tweets.count()
We store the relevant information from the tweets in a data-frame and also save it as a pickle file.
import pandas as pd
tweets_df = pd.DataFrame([[tw['id'], tw['text'], tw['user']['screen_name']]
for tw in news_tweets.find()], columns=['id', 'text', 'user'])
tweets_df.to_pickle('data/tweets_df.pkl')
tweets_df.head()
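Later sessions can reload the saved data-frame directly, without querying MongoDB again:

# Reload the pickled data-frame in a fresh session.
tweets_df = pd.read_pickle('data/tweets_df.pkl')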
The following piece of code assigns the Twitter accounts that generated the tweets to the Health, Politics, Sports, Tech, Traffic, and Weather categories based on the account username.
import numpy as np

users_df = pd.DataFrame(np.unique(tweets_df.user).tolist(), columns=['user'])
cats = [('Traffic', None),
        ('Tech', ['tech', 'breakingbytes', 'CNET']),
        ('Health', ['health', 'KHNews']),
        ('Sports', 'sport'),
        ('Weather', None),
        ('Politics', ['politic', 'rtetwip', 'realclear', 'PnPCBC']),
        ]

for cat in cats:
    cat_name, keywords = cat
    # If no keywords are given, match on the category name itself
    if keywords is None:
        keywords = cat_name
    if not isinstance(keywords, list):
        keywords = [keywords]
    print(cat)
    for index, u in zip(users_df.index, users_df.user):
        for kw in keywords:
            if u.lower().find(kw.lower()) != -1:
                print("\t%s %s" % (u, kw))
                users_df.loc[index, 'cat'] = cat_name
                break
We can now add a new column to the tweet data-frame that indicates each tweet's category:
users_df.set_index('user', inplace=True)
tweets_df['cat'] = tweets_df['user'].map(users_df['cat'])
To prepare the category data for classification we need to translate categories to labels using a label-encoder:
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()
tweets_df['cat_encoded'] = label_encoder.fit_transform(tweets_df.cat)
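As a quick sanity check, we can inspect the mapping the label-encoder has learned; classes_ lists the category names in the order of their integer labels:

# classes_[i] is the category mapped to integer label i (alphabetical order).
print(label_encoder.classes_)
# Map a few integer labels back to their category names.
print(label_encoder.inverse_transform([0, 1, 2]))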
Here is the resulting data-frame after shuffling the rows:
tweets_df = tweets_df.sample(frac=1)
tweets_df.reset_index(inplace=True, drop=True)
tweets_df.head()
Let's have a look at the number of tweets we have collected from each category.
tweets_df['cat'].value_counts()
We observe that some categories, like Weather and Traffic, contain fewer tweets, since the corresponding Twitter accounts generate tweets at a slower pace. However, this is not problematic, since all categories contain enough tweets for the classification task to be meaningful. If the imbalance were more severe, we could reweight the classes, as sketched below.
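The following is a sketch only, which we did not need here: scikit-learn's logistic regression can reweight classes inversely to their frequencies so that the smaller categories are not drowned out by the larger ones.

# Sketch: class weights inversely proportional to class frequencies.
from sklearn.linear_model import LogisticRegression
lr_balanced = LogisticRegression(class_weight='balanced')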
Text Preprocessing and Feature Extraction¶
Next, we develop a transformer that performs some cleaning and preprocessing of tweet texts. For more details about this transformer please see the article Insults Detection in Social Media.
from sklearn.base import BaseEstimator, TransformerMixin
import re
from nltk.stem import WordNetLemmatizer

wordnet_lemmatizer = WordNetLemmatizer()

class Preprocessor(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        comments_clean = []
        for c in X:
            c = c.replace('\\\\', '\\')
            c = c.replace('\\n', ' ')
            c = re.sub(r'[-_"]', '', c)
            c = re.sub(r'[*%&,?!;]', ' ', c)
            c = re.sub(r"(.)\1{2,}", r'\g<1>', c)    # collapse runs of 3+ repeated characters
            c = re.sub(r'\.(\s+|$)', ' ', c)         # strip word-final periods
            c = re.sub(r'[^\x00-\x7F]+', ' ', c)     # drop non-ASCII characters
            c = re.sub(r'https?://[\w./]+', ' ', c)  # drop URLs
            # Lemmatize each word, first as a verb and then as a noun
            c = [wordnet_lemmatizer.lemmatize(wordnet_lemmatizer.lemmatize(w, pos='v')).lower()
                 for w in c.split()]
            c = " ".join(w for w in c if len(w) > 2)
            comments_clean.append(c)
        return comments_clean
prep = Preprocessor()
tweets_df['text_processed'] = prep.fit_transform(tweets_df['text'])
tweets_df.head()
Using the processed text we can extract useful text features by performing text vectorization. Again, for a more detailed explanation please refer to Insults Detection in Social Media.
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.sparse import hstack

# Word n-grams (1-3); the token pattern also captures #hashtags and @mentions.
tfidf_word_vectorizer = TfidfVectorizer(ngram_range=(1, 3), max_df=0.5, max_features=None,
                                        min_df=2, stop_words='english',
                                        token_pattern=r'(?u)\b\w\w+\b|[#@]\w+',
                                        use_idf=True)

# Character n-grams (1-7). Note that stop_words and token_pattern only
# take effect for the word analyzer, so they are omitted here.
tfidf_char_vectorizer = TfidfVectorizer(ngram_range=(1, 7), analyzer='char', max_df=0.5,
                                        max_features=None, min_df=2, use_idf=True)

tfidf_word = tfidf_word_vectorizer.fit_transform(tweets_df['text_processed'])
tfidf_char = tfidf_char_vectorizer.fit_transform(tweets_df['text_processed'])
tfidf_word_char = hstack([tfidf_word, tfidf_char]).tocsr()
Now our tweet data is ready for the classification task. In particular, we have the input data 'X' in 'tfidf_word_char', while the target variable 'y' is given by tweets_df['cat_encoded'].
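As a quick check, we can verify that the feature matrix and the target vector line up row for row:

# One row of features and one label per tweet.
print(tfidf_word_char.shape, tweets_df['cat_encoded'].shape)
assert tfidf_word_char.shape[0] == tweets_df['cat_encoded'].shape[0]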
Text Classification¶
We first divide our data into cv and test sets. We use the cv set to train the classifier using k-fold cross-validation. We then examine the performance of the resulting classifier on the test set.
from sklearn.model_selection import StratifiedShuffleSplit

sss = StratifiedShuffleSplit(n_splits=1, test_size=0.33, random_state=42)
cv_index, test_index = next(sss.split(tfidf_word_char, tweets_df['cat_encoded']))
X_cv, y_cv = tfidf_word_char[cv_index], tweets_df['cat_encoded'][cv_index]
X_test, y_test = tfidf_word_char[test_index], tweets_df['cat_encoded'][test_index]
For the classifier, we use logistic regression, an efficient model that achieves strong results on text classification tasks.
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
Using this classifier, we perform a grid-search on the cv set to obtain the best regularization parameter C:
from sklearn.model_selection import GridSearchCV
gs = GridSearchCV(lr, param_grid={'C': np.logspace(-1, 3, 5)}, return_train_score=True, n_jobs=3)
gs.fit(X_cv, y_cv)
Here are the cross-validation results:
gs.cv_results_
gs.best_params_
Next, we use the obtained classifier to predict the labels of the test data.
clf = gs.best_estimator_
clf.fit(X_cv, y_cv)
y_pred = clf.predict(X_test)
Performance Evaluation¶
The sklearn library provides several useful tools for performance evaluation of multi-class classifiers. In the following, we generate a classification report, calculate the overall accuracy, and plot the confusion matrix. We will obtain the learning curves in the next section.
from sklearn import metrics
target_names = tweets_df[['cat_encoded', 'cat']].drop_duplicates().sort_values(['cat_encoded'])['cat'].tolist()
print(metrics.classification_report(y_test, y_pred, target_names=target_names))
All scores are around 94%, which is a great classification result. We can also obtain the overall accuracy as:
overall_accuracy = float(np.sum(y_test == y_pred))/y_test.shape[0]
overall_accuracy
which is also around 94%.
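Equivalently, we could let sklearn do this computation for us:

# Same quantity via sklearn's built-in helper.
print(metrics.accuracy_score(y_test, y_pred))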
To explore the classification performance visually, we can plot the confusion matrix.
import itertools
import matplotlib.pyplot as plt

def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title, fontsize=18)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45, fontsize=14)
    plt.yticks(tick_marks, classes, fontsize=14)

    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black", fontsize=14)

    plt.tight_layout()
    plt.ylabel('True label', fontsize=16)
    plt.xlabel('Predicted label', fontsize=16)
We first plot the raw counts and then the counts normalized per true-label class.
import itertools
# Compute confusion matrix
cnf_matrix = metrics.confusion_matrix(y_test, y_pred)
np.set_printoptions(precision=2)
class_names = target_names
# Plot non-normalized confusion matrix
plt.figure(figsize=(7.5, 5.5))
plot_confusion_matrix(cnf_matrix, classes=class_names,
title='Confusion matrix, without normalization')
# Plot normalized confusion matrix
plt.figure(figsize=(7.5, 5.5))
plot_confusion_matrix(cnf_matrix, classes=class_names, normalize=True,
title='Normalized confusion matrix')
plt.show()
Learning Curves¶
To get more insight into the performance of the classifier, and to explore ways to improve it, it is helpful to plot the learning curves. We can use tools from the sklearn library for this:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve

def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None, scoring=None,
                        n_jobs=1, train_sizes=np.linspace(.1, 1.0, 5)):
    plt.figure(figsize=(9, 7))
    plt.title(title, fontsize=19)
    if ylim is not None:
        plt.ylim(*ylim)
    plt.xlabel("Training examples", fontsize=16)
    plt.xticks(fontsize=16)
    plt.ylabel("Score", fontsize=16)
    plt.yticks(fontsize=16)

    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, scoring=scoring, cv=cv, n_jobs=n_jobs,
        train_sizes=train_sizes)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)

    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1,
                     color="r")
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1, color="g")
    plt.plot(train_sizes, train_scores_mean, 'o-', color="r",
             label="Training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
             label="Cross-validation score")

    plt.legend(loc="lower right", fontsize=16)
    plt.grid(True)
    return train_sizes, train_scores, test_scores
estimator = LogisticRegression()
title = "Learning Curves (Logistic Regression)"
train_sizes, train_scores, test_scores = plot_learning_curve(
    estimator, title, tfidf_word_char,
    tweets_df['cat_encoded'], n_jobs=4, cv=3)
plt.show();
The above figure shows the training and cross-validation scores as a function of the number of training samples. The score here is the accuracy, i.e., the fraction of samples that are classified correctly. The learning curves show a considerable gap between the train and cv scores; in other words, the classifier suffers from some degree of over-fitting.
There are several ways to overcome over-fitting; common ones are stronger regularization, reducing the number of features, and collecting more data. We have already tuned the regularization, and a sketch of the feature-reduction route is shown below. Judging from the shape of the curve, we may benefit substantially from collecting more data, since even at the larger sample sizes the cross-validation score is still increasing at a substantial rate. Next, we collect more tweets to see how that affects our classifier's performance.
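For completeness, here is what the feature-reduction route might look like; this is a sketch we did not pursue, and the particular min_df and max_features values are illustrative assumptions:

# Sketch: a stricter min_df and a cap on vocabulary size shrink the
# feature space, which can reduce over-fitting at some cost in accuracy.
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_word_small = TfidfVectorizer(ngram_range=(1, 2), min_df=5,
                                   max_features=50000, stop_words='english')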
Collecting More Tweets¶
We run the tweet collector for a while longer to collect a total of 400,000 tweets.
tweets_df.shape
The number of tweets for each category:
tweets_df.cat.value_counts()
We follow the steps in the previous sections to build a classifier for our new tweet dataset. We can also obtain the performance measures for the new classifier using a similar approach:
from sklearn import metrics
print(metrics.classification_report(y_test, y_pred, target_names=target_names))
overall_accuracy = float(np.sum(y_test == y_pred))/y_test.shape[0]
overall_accuracy
The classifier performance has greatly improved: all performance measures are now around 99%!
Let's also plot the confusion matrices as before. They also show that performance has improved in all categories.
import itertools
# Compute confusion matrix
cnf_matrix = metrics.confusion_matrix(y_test, y_pred)
np.set_printoptions(precision=2)
class_names = target_names
# Plot non-normalized confusion matrix
plt.figure(figsize=(7.5, 5.5))
plot_confusion_matrix(cnf_matrix, classes=class_names,
title='Confusion matrix, without normalization')
# Plot normalized confusion matrix
plt.figure(figsize=(7.5, 5.5))
plot_confusion_matrix(cnf_matrix, classes=class_names, normalize=True,
title='Normalized confusion matrix')
plt.show();
Weather Talks¶
To see our classifier in action, we next use it to detect when people are talking about the weather. In particular, we use the Twitter streaming API to access tweets sent by users in real time, and then use our classifier to pick out the tweets that fall into the Weather category.
from pymongo import MongoClient
import json
from tweepy import Stream
from tweepy.streaming import StreamListener
import numpy as np

text_global = ""

class TweetListener(StreamListener):
    def on_data(self, data):
        global cnt, text_global
        try:
            tweet = json.loads(data)
            tweet_list.append(tweet)
            if 'extended_tweet' in tweet:
                text = tweet['extended_tweet']['full_text']
            else:
                text = tweet['text']
            text_processed = prep.fit_transform([text])
            # Skip tweets that are too short after preprocessing
            if len(text_processed[0].split()) < 2:
                return True
            tfidf_word = tfidf_word_vectorizer.transform(text_processed)
            tfidf_char = tfidf_char_vectorizer.transform(text_processed)
            tfidf_word_char = hstack([tfidf_word, tfidf_char])
            y_pred_proba = lr.predict_proba(tfidf_word_char)
            # Keep only tweets that are Weather-related with high probability
            if y_pred_proba[0][1] < 0.9:
                return True
            coordinates = None
            try:
                coordinates = tweet['place']['bounding_box']['coordinates']
            except (KeyError, TypeError):
                # No location information; skip this tweet
                return True
            text_global = text
            tweet_output = ("Text: %s\nUser: %s\nProbability: %s\nCoordinates: %s\n"
                            % (text, tweet['user']['screen_name'],
                               y_pred_proba[0][1], str(coordinates[0])))
            print("--------------------------------------")
            print(tweet_output)
            coordinates_mean = np.array(coordinates[0]).mean(axis=0).tolist()
            f.write("%s, %s\n" % tuple(coordinates_mean))
            f.flush()
            cnt += 1
            if cnt % 1000 == 0:
                print(cnt)
            if cnt == max_count:
                twitter_stream.disconnect()
                print("done!")
        except BaseException as e:
            print("Error on_data: %s" % str(e))
            twitter_stream.disconnect()
        return True

    def on_error(self, status):
        print(status)
        return True
To achieve even better classification performance, we can train a one-vs-all classifier. In other words, we transform the target variable by mapping the Weather category to one and all other categories to zero. The result is a binary classifier that detects tweets that specifically belong to the Weather category.
weather_df = (tweets_df['cat'] == 'Weather').astype(int)
lr = LogisticRegression(C=100)
lr.fit(tfidf_word_char, weather_df)
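The columns of predict_proba follow the order of lr.classes_; since our binary labels are 0 and 1, column 1 holds the Weather probability that the stream listener above thresholds at 0.9:

# classes_ is array([0, 1]), so predict_proba(...)[:, 1] is P(Weather).
print(lr.classes_)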
Next we run the Twitter streaming API to collect 30 tweets for which the probability of belonging to the Weather category is higher than 90%. The streaming API accepts a location parameter which can be used to restrict the stream to tweets that originate from a certain geographical area. Using this parameter, we instruct the API to only collect tweets from the US and Canada.
f = open("data/weather_tweets_coordinates", "a")
cnt = 0
tweet_list = []
max_count = 30
twitter_stream = Stream(auth=auth, listener=TweetListener())
GEOBOX_US_CANADA = [-128.755117, 26.415893, -52.437305, 54.093165]
twitter_stream.filter(locations=GEOBOX_US_CANADA)
f.close()
As we can see, the classifier has done a great job of detecting weather-related tweets. Occasionally we see a misclassified tweet; for example, the tweet "Jon Snow a real one." is not about the weather. We can reduce the chance of such tweets slipping through by filtering out short tweets, e.g., tweets with fewer than 10 words. In addition, some of the detected tweets appear to come from weather agencies and are therefore not posted by real people. We can remove these as well by preparing a list of business Twitter accounts that are not owned by individuals and dropping their tweets. A minimal sketch of both filters follows.
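In this sketch, the 10-word threshold and the account list are illustrative assumptions, not values we have tuned:

# Sketch: drop short tweets and tweets from known business accounts.
BUSINESS_ACCOUNTS = {'ExampleWeatherAgency'}  # hypothetical list

def keep_tweet(tweet, text_processed):
    if len(text_processed.split()) < 10:                   # too short
        return False
    if tweet['user']['screen_name'] in BUSINESS_ACCOUNTS:  # agency account
        return False
    return True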
In the above, in addition to the tweet text, we collected the coordinate information of the tweets. We can use this information to create a heat-map visualization of the weather tweets. We use Google Maps and the gmaps library for this purpose. Similar to Twitter, we need to register an app with Google Maps to be able to access the map data. Once registered, we receive an API key which we can use together with gmaps to create the heatmap visualization.
import gmaps

# Read back the (longitude, latitude) pairs saved by the stream listener
f = open("data/weather_tweets_coordinates")
locations = []
for l in f.readlines():
    longitude, latitude = map(float, l.split(','))
    locations.append((latitude, longitude))
f.close()

gmaps.configure(api_key='<google-maps api-key>')
fig = gmaps.figure()
heatmap = gmaps.heatmap_layer(locations)
heatmap.point_radius = 20
fig.add_layer(heatmap)
Today there is a heat wave going on in the afternoon, as was the case on many days during this summer (summer 2018). This could explain why we see a higher number of tweets posted in Florida and California. Also, right now there is a thunderstorm happening in Ottawa, which explains the high concentration of tweets around that city.