Insult Detection in Social Media¶
In this article, we consider the problem of detecting insults in social media comments. This is an important problem from a practical point of view, since many social media sites, such as YouTube and Yelp, are interested in detecting and filtering out comments that insult other users. We use the data provided in this competition hosted by Kaggle, and we follow the same setup and performance criterion as the competition so that our results can be compared to those of the winners.
Data and Problem Setup¶
In this competition, "train.csv" was made available first and contestants were asked to submit their preliminary models. Here are five rows sampled randomly from the train data:
import pandas as pd
pd.set_option('max_colwidth', 1000)
train = pd.read_csv("data/train.csv")
train.sample(5)
The public and private leaderboard evaluation at this first stage was done on a test dataset that was later made available as "test_with_solutions.csv". The contestants then submitted their final models, which were evaluated on a verification set. This set was provided after the competition ended and was disclosed as "impermium_verification_labels.csv".
test_with_solutions = pd.read_csv("data/test_with_solutions.csv")
verification_set = pd.read_csv("data/impermium_verification_labels.csv")
Since we are interested in building the final model, we need to combine the "train" and "test_with_solutions" datasets and use the resulting dataset as training data:
train_total = pd.concat([train, test_with_solutions.iloc[:, :-1]]).reset_index(drop=True)
After building our model, we will evaluate the model performance on the "verification" dataset above which serves as test data. Here is a glimpse of our training data:
train_total.sample(10)
As we can see, we are provided with the text of the comments as well as the date each comment was posted. The Insult column shows whether the comment is an insult; this is what we are required to predict for the test data.
We also see that the comments are not cleaned. For example, non-breaking spaces appear as the literal escape sequence "\xa0" and should be removed. There are also literal escape sequences like "\n" which should be replaced by a white space.
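As a quick check (assuming, as the samples above suggest, that these escape sequences appear literally in the raw text), we can count how many comments contain them:
# Count comments that contain literal "\x.." or "\n" escape sequences (literal substring match, no regex)
print(train_total.Comment.str.contains('\\x', regex=False).sum())
print(train_total.Comment.str.contains('\\n', regex=False).sum())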
Text Cleaning¶
We construct a transformer which takes the comments as input and transforms them to cleaned comments:
import re
from scipy import sparse
from sklearn.base import BaseEstimator, TransformerMixin
from nltk.stem import WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()
class Preprocessor(BaseEstimator, TransformerMixin):
def __init__(self):
pass
def fit(self, X, y=None):
return self
def transform(self, X):
comments_clean = []
for c in X:
c = c.replace('\\\\', '\\')
c = c.replace('\\n', ' ')
c = re.sub(r'http[s]?://[^\s]*', ' ', c)
c = re.sub(r'\\x[0-9a-f]{2}', ' ', c)
c = re.sub(r'\\u[0-9a-f]{4}', ' ', c)
c = re.sub(r'[-"]', '', c)
c = re.sub(r'[:*#%&,.?!\']', ' ', c)
            c = re.sub(r"(.)\1{2,}", r'\g<1>', c)
c = " ".join([wordnet_lemmatizer.lemmatize(wordnet_lemmatizer.lemmatize(w, pos='v')).lower()
for w in c.split()])
comments_clean.append(c)
return comments_clean
preprocessor = Preprocessor()
The transformer above performs the following cleaning operations on the data:
- collapse doubled backslashes and replace literal "\n" escapes with a space using
c = c.replace('\\\\', '\\')
and c = c.replace('\\n', ' ')
- remove URLs using
c = re.sub(r'http[s]?://[^\s]*', ' ', c)
- remove punctuation using
c = re.sub(r'[:*#%&,.?!\']', ' ', c)
- remove literal non-ASCII and unicode escape sequences using
c = re.sub(r'\\x[0-9a-f]{2}', ' ', c)
and c = re.sub(r'\\u[0-9a-f]{4}', ' ', c)
- replace runs of three or more repeated characters with a single character, so that e.g. the word "heeelllll" is converted to "hel" (see the short demonstration after this list)
- replace each word with its lemmatized version and convert all words to lower case
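Here is a small illustration of the last two steps (the example strings are made up):
# collapse runs of three or more repeated characters to a single character
print(re.sub(r"(.)\1{2,}", r'\g<1>', "heeelllll"))  # -> 'hel'
# lemmatize as a verb first, then as a noun, and lower-case the result
print(wordnet_lemmatizer.lemmatize(wordnet_lemmatizer.lemmatize("running", pos='v')).lower())  # -> 'run'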
Here we see the original and cleaned comments side-by-side:
train_sample = train_total.sample(10)
train_sample['Comment_clean'] = preprocessor.fit_transform(train_sample['Comment'])
train_sample[['Comment', 'Comment_clean']]
Feature Extraction¶
The most common way to extract features from text is text vectorization. In this approach, a bag-of-words representation of the documents is first built, containing all the words that appear in any of the documents. The number of times each word appears in a document is then counted and used as a feature vector representing that document. Optionally, we can re-weight these counts by the inverse document frequency, which reduces the effect of very common words that do not convey much information (the tf-idf scheme). A detailed explanation of text vectorization techniques can be found here.
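As a toy illustration (the two documents below are made up), the vectorizer maps each document to a vector of word counts, and the tf-idf variant re-weights those counts:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
docs = ["you are great", "you are a troll"]
cv = CountVectorizer()
print(cv.fit_transform(docs).toarray())  # raw word counts per document
print(sorted(cv.vocabulary_))            # the bag-of-words vocabulary
tf = TfidfVectorizer()
print(tf.fit_transform(docs).toarray())  # counts re-weighted by inverse document frequency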
from sklearn.feature_extraction.text import TfidfVectorizer
tfidv_char = TfidfVectorizer(ngram_range=(1, 2), analyzer='char', stop_words='english')
tfidv_word = TfidfVectorizer(ngram_range=(1, 2), analyzer='word', stop_words='english')
In its plain form, text vectorization only considers word frequency and does not take into account the position of words relative to each other; in other words, it ignores the "context" of the text. To alleviate this problem, as seen above, we can use the ngram_range parameter of the vectorizer. We also note that, to extract a richer set of features, we have used two vectorizers: one that works on words and one that works on characters.
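To see what the two vectorizers produce, here is a quick look at the n-grams they extract from a single made-up comment:
sample = ["you are a troll"]
print(sorted(tfidv_word.fit(sample).vocabulary_))       # word 1- and 2-grams (after stop-word removal)
print(sorted(tfidv_char.fit(sample).vocabulary_)[:10])  # a few of the character 1- and 2-grams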
In addition to these standard features, we develop the following feature extractor transformer, tailored to the problem of insult detection. It extracts the following features:
- The count of bad words used in the comment.
- The number of part-of-speech tags such as verb, noun, adjective, etc., obtained using the nltk library (illustrated right after this list).
- The number of @ mentions in the comment, since we are asked to detect insults directed at other members as opposed to public figures like politicians or celebrities.
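For reference, this is roughly what the nltk part-of-speech tags look like on a made-up comment (tags starting with VB, JJ, RB, and NN are the ones counted below):
import nltk
from nltk import word_tokenize
print(nltk.pos_tag(word_tokenize("you are a stupid troll")))
# e.g. [('you', 'PRP'), ('are', 'VBP'), ('a', 'DT'), ('stupid', 'JJ'), ('troll', 'NN')]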
from nltk import word_tokenize
import nltk
from scipy import sparse
from sklearn.base import BaseEstimator
class FeatureExtractor(BaseEstimator):
def __init__(self):
with open('bad_words.txt') as f:
self.badwords = set(w.strip() for w in f.readlines())
def fit(self, X, y=None):
self.fit_transform(X, y)
return self
def transform(self, X):
return self.fit_transform(X)
def fit_transform(self, X, y=None):
pos_tags = ["VB", "JJ", "RB", "NN"]
def pos_tag_count(comment):
pos = nltk.pos_tag(word_tokenize(comment))
d = {}
for tag in pos_tags:
d[tag] = 0
for p in pos:
for tag in pos_tags:
if p[1].startswith(tag):
d[tag] += 1
            return list(d.values())  # return the counts as a list so further features can be appended
def bad_word_count(comment):
cnt = 0
for w in comment.split():
if w.strip() in self.badwords:
cnt += 1
return cnt
def at_mention_count(comment):
return len(re.findall(r'@[^\s]+', comment))
comments_features = []
for c in X:
features = pos_tag_count(c)
features.append(bad_word_count(c))
features.append(at_mention_count(c))
comments_features.append(features)
return sparse.csr_matrix(comments_features)
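As a quick sanity check (the comment below is made up, and the bad-word count depends on the contents of bad_words.txt), each comment is mapped to a six-dimensional vector: the four part-of-speech counts followed by the bad-word count and the @-mention count:
fe_demo = FeatureExtractor()
print(fe_demo.fit_transform(["you are a stupid troll @someuser"]).toarray())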
The next step is to combine the extracted features using FeatureUnion and create a pipeline which consists of preprocessing and cleaning, feature extraction, and classification.
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import FeatureUnion
tfidv_char = TfidfVectorizer(ngram_range=(1, 10), analyzer='char', stop_words='english', min_df=6)
tfidv_word = TfidfVectorizer(ngram_range=(1, 3), analyzer='word', stop_words='english', min_df=2)
fe = FeatureExtractor()
fu = FeatureUnion([('fe', fe), ('tfidv_char', tfidv_char), ('tfidv_word', tfidv_word)])
# clf_rf = RandomForestClassifier()
clf_lr = LogisticRegression(C=3)
estimators = [('prep', preprocessor), ('fu', fu), ('clf', clf_lr)]
pl = Pipeline(estimators)
The metric used for the competition is the ROC AUC score, which can be calculated as follows:
from sklearn.metrics import roc_auc_score
def scoring(estimator, X, y):
y_pred = estimator.predict_proba(X)
return roc_auc_score(y, y_pred[:, 1])
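For intuition about the metric itself, here is a tiny made-up example; the AUC is the probability that a randomly chosen insult is scored higher than a randomly chosen clean comment:
# three of the four insult/clean pairs are ranked correctly, so the AUC is 0.75
print(roc_auc_score([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))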
Using this scoring function, we can estimate the performance of our pipeline with cross-validation:
from sklearn.model_selection import cross_val_score
cross_val_score(pl, train_total.Comment, train_total.Insult, scoring=scoring)
The validation score looks great and shows little variation across the cross-validation folds. We should keep in mind, however, that this is our local validation score and does not necessarily translate directly to the leaderboard score. Nevertheless, we can use it as a proxy for the leaderboard score and employ it to improve the classification performance.
In the following, we use a grid search to perform an exhaustive search over a range of parameters for the feature extraction and classification steps in our pipeline.
from sklearn.model_selection import GridSearchCV
import numpy as np
param_grid = {'clf__C':np.logspace(-1, 2, 5),
'fu__tfidv_word__ngram_range': [(1, 1), (1, 2)],
'fu__tfidv_char__ngram_range': [(1, 1), (1, 3)],
'fu__tfidv_char__min_df': [1, 2, 4],
'fu__tfidv_char__stop_words': ['english', None],
}
gs = GridSearchCV(pl, param_grid, scoring=scoring, n_jobs=2)
gs.fit(train_total.Comment, train_total.Insult)
gs.best_params_
gs.best_score_
The fact that we obtain the highest score for the largest values of the ngram_range parameters suggests that we may be able to improve the pipeline performance by increasing them further. To explore this, we perform the following optimization:
from sklearn.model_selection import GridSearchCV
import numpy as np
param_grid = {'clf__C':np.logspace(0, 1.5, 10),
'fu__tfidv_word__ngram_range': [(1, 3)],
'fu__tfidv_char__ngram_range': [(1, 3), (1, 5), (1, 7), (1, 9)],
'fu__tfidv_char__min_df': [4],
'fu__tfidv_char__stop_words': ['english'],
}
gs = GridSearchCV(pl, param_grid, scoring=scoring, n_jobs=4, verbose=10)
gs.fit(train_total.Comment, train_total.Insult)
gs.best_score_
import matplotlib.pyplot as plt
x_data = [3, 5, 7, 9]
y_data = gs.cv_results_['mean_test_score'].reshape(10, 4).max(axis=0)
plt.plot(x_data, y_data, 'r-o');
plt.xlabel('n_grams', fontsize=14)
plt.xticks(fontsize=12)
plt.ylabel('AUC ROC score', fontsize=14)
plt.yticks(fontsize=12)
plt.grid(True)
plt.show()
Test Set Validation¶
We take the model obtained from the above optimization as our final model. The next step is to evaluate it on the test set provided for the competition (the verification set, "impermium_verification_labels.csv").
from sklearn.metrics import roc_auc_score
clf = gs.best_estimator_
clf.fit(train_total.Comment, train_total.Insult)
y_ver_pred = clf.predict_proba(verification_set.Comment)[:, 1]
y_ver = verification_set.Insult
roc_auc_score(y_ver, y_ver_pred)
Our final score is considerably lower than the local validation score of 0.9081. However, if we turn to the private leaderboard of the competition, we find that the best score was 0.84248. This means that we have achieved a score of around 97% of the winning solution's score, which is great, especially considering that we used logistic regression, which is not considered a complex classifier. Moreover, logistic regression is very fast and efficient, which is advantageous when using the model on real-world data to classify comments as they are received.
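As a closing sketch (the example comments below are made up), the fitted pipeline can score new comments directly, which is what an online filtering system would do as comments arrive:
new_comments = ["have a great day everyone", "you are a pathetic idiot"]
# predicted probability that each comment is an insult
print(clf.predict_proba(new_comments)[:, 1])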