Twitter Sentiment Modeling: Detecting Racist or Sexist Tweets

1. Introduction

Twitter is one of the most popular social media platforms today. It provides an online news and social networking service on which users post and interact with messages known as "tweets". Registered users can post, like, and retweet tweets, but unregistered users can only read them.

Sentiment Analysis refers to the use of natural language processing, text analysis, computational linguistics, and biometrics to systematically identify, extract, quantify, and study affective states and subjective information. (Source: Wikipedia)

Sentiment analysis on tweets can be extremely useful. By analyzing tweets, we can gauge public sentiment on a given topic and understand people's opinions, which in turn helps us choose the right strategies and responses.

In this post, our main goal is to build a model to identify tweets with racist or sexist sentiment.

2. The data we use

We have a training set of 31,962 tweets with labeled sentiments. The dataset is provided as a CSV file, with each line storing a tweet id, its label, and the tweet text. Label '1' denotes that the tweet is racist/sexist and label '0' denotes that it is not.

We also have a test set of 17,197 tweets. The test file contains only tweet ids and tweet text, with one tweet per line.

The dataset we use can be downloaded from here.

3. Tweets cleaning and preprocessing

Load tweets

In [1]:
# importing libraries
import pandas as pd
import numpy as np
import re
import string
import nltk
import warnings
warnings.filterwarnings("ignore")

#import plotting libraries
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

# loading training and test tweets
train = pd.read_csv('data\\train_E6oV3lV.csv')
test = pd.read_csv('data\\test_tweets_anuFYb8.csv')
print("The size of the training tweets is {}".format(train.shape))
print("The size of the test set is {}".format(test.shape))
The size of the training tweets is (31962, 3)
The size of the test set is (17197, 2)

Let's take a look at the first 5 rows of the training data.

In [2]:
train.head()
Out[2]:
id label tweet
0 1 0 @user when a father is dysfunctional and is s...
1 2 0 @user @user thanks for #lyft credit i can't us...
2 3 0 bihday your majesty
3 4 0 #model i love u take with u all the time in ...
4 5 0 factsguide: society now #motivation
In [3]:
train['label'].value_counts()
Out[3]:
0    29720
1     2242
Name: label, dtype: int64

We calculated the number of tweets with each label in our training data. As can be seen, the distribution is quite uneven. We have to take this into account when building our prediction model.
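To quantify the imbalance (a quick check that is not part of the original notebook), we can normalize the counts; only about 7% of the training tweets are labeled racist/sexist:

# proportion of each label in the training data
train['label'].value_counts(normalize = True)
# 0    ~0.93
# 1    ~0.07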

The data has 3 columns: id, label, and tweet. id is the unique id of each tweet. label is the binary target variable, where '1' denotes that the tweet is racist/sexist and '0' denotes that it is not. tweet contains the tweet text that we will clean and preprocess.

Tweets cleaning

After taking a look at the first 5 records, we have some initial thoughts on how to clean our training data:

  1. The Twitter handles are already masked as @user due to privacy concerns, so they give hardly any information about the nature of the tweet and need to be removed.
  2. Punctuation, numbers, and special characters also need to be removed.
  3. Short words, such as "is", "are", "for", and "and", which do not add much value to the tweets, also need to be removed.
  4. Tokenization. Tokenization is an essential step in any NLP task; it splits every tweet into individual tokens/words.
  5. Stemming/lemmatizing. This reduces different variations of words sharing the same root; for example, 'love', 'loving', and 'loves' all come from the root 'love'. It lets us reduce the total number of unique words in our data without losing a significant amount of information.

1. Removing handles (@user)

For convenience, let's combine the train and test sets before performing the cleaning.

In [4]:
# combining training and test tweets into one DataFrame for joint cleaning
all_tweets = pd.concat([train, test], ignore_index = True)

We define a function remove_pattern to remove unwanted patterns from our tweet text. In our case, the unwanted pattern is '@user'.

In [5]:
# function to remove unwanted patterns
def remove_pattern(input_txt, pattern):
    # find every substring matching the pattern
    matches = re.findall(pattern, input_txt)
    for match in matches:
        # escape the match so it is treated as literal text, then remove it
        input_txt = re.sub(re.escape(match), '', input_txt)

    return input_txt
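As a quick sanity check (a hypothetical example, not a cell from the original notebook), applying the function to a toy string removes the handle as expected:

remove_pattern("@user when a father is dysfunctional", "@[\w]*")
# returns ' when a father is dysfunctional'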

Now let's create a new column, tidy_tweet, which will contain the cleaned and processed tweets. Note that we pass "@[\w]*" as the pattern to the remove_pattern function. This regular expression matches any word starting with '@'.

In [6]:
# remove twitter handles (@username)
all_tweets['tidy_tweet'] = np.vectorize(remove_pattern)(
                                    all_tweets['tweet'], "@[\w]*")

2. Removing punctuation, numbers, and special characters

We again use regular expressions in this step, this time to remove punctuation, numbers, and special characters.

In [7]:
# remove punctuation, numbers, and special characters (keep letters and '#')
all_tweets['tidy_tweet'] = all_tweets['tidy_tweet'].str.replace(
                            "[^a-zA-Z#]", " ", regex = True)

3. Removing short words

Some short words carry little useful information, so we remove them from the tweet text as well. Let's have a look at the data after these three cleaning steps.

In [8]:
all_tweets['tidy_tweet'] = all_tweets['tidy_tweet'].apply(
                            lambda x: ' '.join([w for w in x.split() 
                                               if len(w) > 3]))
all_tweets.head()
Out[8]:
id label tweet tidy_tweet
0 1 0.0 @user when a father is dysfunctional and is s... when father dysfunctional selfish drags kids i...
1 2 0.0 @user @user thanks for #lyft credit i can't us... thanks #lyft credit cause they offer wheelchai...
2 3 0.0 bihday your majesty bihday your majesty
3 4 0.0 #model i love u take with u all the time in ... #model love take with time
4 5 0.0 factsguide: society now #motivation factsguide society #motivation

As can be clearly seen, the tweets in the tidy_tweet column are much shorter than the original ones. They contain only the important words, and the noise (numbers, punctuation, and special characters) has been removed effectively.

4. Tokenization

In [9]:
# tokenization
tokenized_tweet = all_tweets['tidy_tweet'].apply(
                                    lambda x: x.split())
tokenized_tweet.head()
Out[9]:
0    [when, father, dysfunctional, selfish, drags, ...
1    [thanks, #lyft, credit, cause, they, offer, wh...
2                              [bihday, your, majesty]
3                     [#model, love, take, with, time]
4                   [factsguide, society, #motivation]
Name: tidy_tweet, dtype: object

5. Stemming/Lemmatizing

Stemming is a rule-based process of stripping suffixes ("ing", "ly", "es", "s", etc.) from a word.

In [10]:
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()
tokenized_tweet = tokenized_tweet.apply(lambda x:
                                       [stemmer.stem(i) for i in x])
tokenized_tweet.head()
Out[10]:
0    [when, father, dysfunct, selfish, drag, kid, i...
1    [thank, #lyft, credit, caus, they, offer, whee...
2                              [bihday, your, majesti]
3                     [#model, love, take, with, time]
4                         [factsguid, societi, #motiv]
Name: tidy_tweet, dtype: object

Now let's stitch these tokens back together.

In [11]:
for i in range(len(tokenized_tweet)):
    tokenized_tweet[i] = ' '.join(tokenized_tweet[i])
    
all_tweets['tidy_tweet'] = tokenized_tweet
all_tweets.head()
Out[11]:
id label tweet tidy_tweet
0 1 0.0 @user when a father is dysfunctional and is s... when father dysfunct selfish drag kid into dys...
1 2 0.0 @user @user thanks for #lyft credit i can't us... thank #lyft credit caus they offer wheelchair ...
2 3 0.0 bihday your majesty bihday your majesti
3 4 0.0 #model i love u take with u all the time in ... #model love take with time
4 5 0.0 factsguide: society now #motivation factsguid societi #motiv
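
Step 5 of our cleaning plan also mentioned lemmatization as an alternative to stemming. For completeness, here is a minimal sketch using NLTK's WordNetLemmatizer (it assumes the WordNet corpus has been downloaded; it is not part of the pipeline used in the rest of this post):

from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')  # one-time download of the WordNet corpus
lemmatizer = WordNetLemmatizer()

# unlike the stemmer, the lemmatizer maps words to valid dictionary forms
[lemmatizer.lemmatize(w) for w in ['loves', 'loving', 'kids', 'majesty']]
# e.g. ['love', 'loving', 'kid', 'majesty'] -- verb forms are only reduced when pos='v' is passed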

4. Insights and visualization from cleaned tweets

A few questions we are going to ask are:

  1. What are the most common words in the entire dataset?
  2. What are the most common words in the dataset for negative and positive tweets, respectively?
  3. How many hashtags are there in a tweet?
  4. Which trends are associated with our dataset?
  5. Which trends are associated with either of the sentiments? Are they compatible with the sentiments?

1. Most common words: WordCloud

A word cloud is a useful tool for finding the most common words appearing in a dataset. The most frequent words appear in larger sizes and the less common words appear in smaller sizes.

In [12]:
all_words = ' '.join([text for text in all_tweets['tidy_tweet']])

from wordcloud import WordCloud
wordcloud = WordCloud(width = 800, height = 500, random_state = 21,
                     max_font_size = 110).generate(all_words)
plt.figure(figsize = (10,7))
plt.imshow(wordcloud, interpolation = 'bilinear')
plt.axis('off')
plt.show()

We can see most of the words are positive or neutral, with happy, thank, today, and love being the most frequent ones. However, this doesn't tell us anything about the words associated with the racist/sexist tweets. Hence, we will plot separate word clouds for the two classes (racist/sexist or not) in our training data.

2. Most common words in normal/positive tweets

In [13]:
normal_words = ' '.join([text for text in 
                         all_tweets['tidy_tweet'][all_tweets['label'] == 0]])
wordcloud = WordCloud(width = 800, height = 500, random_state = 21,
                     max_font_size = 110).generate(normal_words)
plt.figure(figsize=(10, 7))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis('off')
plt.show()

As expected, the most common words in normal/positive tweets are either positive or neutral.

3. Racist/sexist tweets

In [14]:
negative_words = ' '.join([text for text in 
                all_tweets['tidy_tweet'][all_tweets['label'] == 1]])

wordcloud = WordCloud(width = 800, height = 500, random_state = 21,
                     max_font_size = 110).generate(negative_words)
plt.figure(figsize = (10, 7))
plt.imshow(wordcloud, interpolation = 'bilinear')
plt.axis('off')
plt.show()

As we can clearly see, most of the words have negative connotations. So, it seems we have pretty good text data to work with. Next, we will examine the hashtags/trends in our Twitter data.

4. Exploring hashtags on tweets sentiment

Hashtags on Twitter reflect the trends ongoing at any particular point in time, and they may be valuable in helping us find negative tweets.

We will first store all the hashtags into two separate lists: one for normal/positive tweets and the other for racist/sexist tweets.

In [15]:
# function to collect hashtags
def hashtag_extract(x):
    hashtags = []
    # loop over the words in the tweet
    for i in x:
        ht = re.findall(r"#(\w+)", i)
        hashtags.append(ht)
    
    return hashtags
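
A quick check on a toy list (a hypothetical example, not an original cell) shows what the function returns:

hashtag_extract(['love #life #run', 'no tags here'])
# [['life', 'run'], []]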
In [16]:
#extracting hashtags from non racist/sexist tweets

HT_regular = hashtag_extract(all_tweets['tidy_tweet']
                            [all_tweets['label'] == 0])

# extracting hashtags from racist/sexist tweets
HT_negative = hashtag_extract(all_tweets['tidy_tweet']
                            [all_tweets['label']== 1])

# unnesting list
HT_regular = sum(HT_regular, [])
HT_negative = sum(HT_negative, [])

Top hashtags in non-racist/sexist tweets

In [17]:
a = nltk.FreqDist(HT_regular)
d = pd.DataFrame({'Hashtag': list(a.keys()), 
                  'Count': list(a.values())})
# selecting top 10 most frequent hashtags
d = d.nlargest(columns = 'Count', n = 10)
plt.figure(figsize = (16, 5))
ax = sns.barplot(data = d, x = "Hashtag", y = "Count")
ax.set(ylabel = 'Count')
plt.show()

Top hashtags in racist/sexist tweets

In [18]:
b = nltk.FreqDist(HT_negative)
e = pd.DataFrame({'Hashtag': list(b.keys()), 'Count': list(b.values())})

# selecting top 10 most frequent hashtags
e = e.nlargest(columns = "Count", n = 10)
plt.figure(figsize = (16, 5))
ax = sns.barplot(data = e, x = "Hashtag", y = "Count")
ax.set(ylabel = 'Count')
plt.show()

As can be seen, the hashtags from regular tweets are positive or neutral, while the hashtags from racist/sexist tweets are distinctly negative.

5. Feature extraction from processed tweets

Depending upon the usage, text features can be constructed using assorted techniques – Bag-of-Words, term frequency–inverse document frequency (TF-IDF), and Word Embeddings. In this post, we will be covering only Bag-of-Words and TF-IDF.

Bag-of-words features

Bag-of-Words is a method to convert text into numerical features. Bag-of-Words features can be easily created using sklearn's CountVectorizer. We set the parameter max_features = 1000 to select only the top 1,000 terms, ordered by term frequency across the corpus.

In [19]:
from sklearn.feature_extraction.text import CountVectorizer
bow_vectorizer = CountVectorizer(max_df = 0.9, min_df = 2,
                                max_features = 1000, stop_words = 'english')
# bag-of-words feature matrix
bow = bow_vectorizer.fit_transform(all_tweets['tidy_tweet'])
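
If you want to inspect what the vectorizer learned (an optional check, not in the original notebook), the vocabulary and the shape of the sparse matrix are easy to examine:

# one row per tweet (train + test), one column per term
print(bow.shape)  # (49159, 1000)
print(bow_vectorizer.get_feature_names_out()[:10])  # get_feature_names() in older scikit-learn versions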

TF-IDF features

Term frequency–inverse document frequency (TF-IDF) takes into account not just the occurrence of a word in a single document (or tweet) but in the entire corpus. TF-IDF penalizes common words by assigning them lower weights, while giving importance to words that are rare in the corpus as a whole but appear frequently in a few documents.

  • TF = (number of times term t appears in a document) / (total number of terms in the document)
  • IDF = log(N/n), where N is the total number of documents and n is the number of documents in which term t appears
  • TF-IDF = TF * IDF
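
To make these formulas concrete, here is a tiny hand computation on a hypothetical three-document corpus (note that sklearn's TfidfVectorizer uses a smoothed variant of IDF, so its values will differ slightly):

import math

docs = [['love', 'happy'], ['love', 'life'], ['sad', 'life']]
N = len(docs)              # 3 documents

# TF of 'love' in the first document: 1 occurrence out of 2 terms
tf = 1 / 2                 # 0.5
# 'love' appears in 2 of the 3 documents
idf = math.log(N / 2)      # log(1.5) ~ 0.405
tfidf_score = tf * idf     # ~ 0.203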
In [20]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer(max_df = 0.9, min_df = 2,
                                  max_features = 1000, stop_words = 'english')
# TF-IDF feature matrix
tfidf = tfidf_vectorizer.fit_transform(all_tweets['tidy_tweet'])

6. Sentiment modeling

Since this is a binary classification problem, we will use logistic regression to build our prediction model. It estimates the probability that a tweet belongs to the positive class by fitting the data to a logistic (sigmoid) function.

As mentioned before, the label distribution in our training data is quite uneven, so plain prediction accuracy cannot properly evaluate our model's performance. Here we use the F1 score instead, which is more informative in this situation. This metric can be understood as follows:

Precision = TP / (TP + FP)

Recall = TP / (TP + FN)

F1 Score = 2 * (Precision * Recall) / (Precision + Recall)

where TP = true positives, FP = false positives, TN = true negatives, and FN = false negatives.
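
As a quick numeric illustration with hypothetical counts (not from our model): with TP = 8, FP = 2, and FN = 4, precision is 0.8, recall is about 0.67, and the F1 score is about 0.73.

# hypothetical confusion-matrix counts
TP, FP, FN = 8, 2, 4
precision = TP / (TP + FP)                          # 0.80
recall = TP / (TP + FN)                             # ~0.67
f1 = 2 * precision * recall / (precision + recall)  # ~0.73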

Building a model using bag-of-words features

In [21]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

train_bow = bow[:31962, :]
test_bow = bow[31962:, :]

# splitting the train_bow into training and validation set
xtrain_bow, xvalid_bow, ytrain, yvalid = train_test_split(
                 train_bow, train['label'], random_state = 42, test_size = 0.3)
lreg = LogisticRegression()
lreg.fit(xtrain_bow, ytrain)

# make prediction on validation set
prediction = lreg.predict_proba(xvalid_bow)
prediction_int = prediction[:, 1] >= 0.3  # label 1 if the predicted probability of class 1 is >= 0.3, else 0
prediction_int = prediction_int.astype(int)

f1 = f1_score(yvalid, prediction_int)  # calculate the F1 score

print("The f1-score for using only the bag-of-words features is : {}".format(f1))
The f1-score for using only the bag-of-words features is : 0.5307820299500832
In [22]:
test_pred = lreg.predict_proba(test_bow)
test_pred_int = test_pred[:, 1] >= 0.3  # use the class-1 probability with the same 0.3 threshold
test_pred_int = test_pred_int.astype(int)
# test['label'] = test_pred_int
# submission = test[['id','label']]
# submission.to_csv('sub_lreg_bow.csv', index=False) # writing data to a CSV file
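
The 0.3 probability threshold above was chosen manually. A simple way to pick it more systematically (a sketch, not part of the original notebook, reusing lreg, xvalid_bow, yvalid, and f1_score defined above) is to sweep candidate thresholds on the validation set and keep the one with the best F1 score:

# sweep probability thresholds and keep the one with the best validation F1
probs = lreg.predict_proba(xvalid_bow)[:, 1]
best_t, best_f1 = max(
    ((t, f1_score(yvalid, (probs >= t).astype(int))) for t in np.arange(0.1, 0.6, 0.05)),
    key = lambda pair: pair[1])
print("best threshold: {:.2f}, best F1: {:.3f}".format(best_t, best_f1))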

Building a model using TF-IDF features

In [23]:
train_tfidf = tfidf[:31962, :]
test_tfidf = tfidf[31962:, :]

xtrain_tfidf = train_tfidf[ytrain.index]
xvalid_tfidf = train_tfidf[yvalid.index]

lreg.fit(xtrain_tfidf, ytrain)

prediction = lreg.predict_proba(xvalid_tfidf)
prediction_int = prediction[:, 1] >= 0.3
prediction_int = prediction_int.astype(int)

f1 = f1_score(yvalid, prediction_int)

print("The f1-score for using only the TF-IDF features is : {}".format(f1))
The f1-score for using only the TF-IDF features is : 0.5446507515473032

Using the TF-IDF features, the validation F1 score is about 0.545, slightly better than the 0.531 obtained with the bag-of-words features. The public leaderboard F1 score is 0.564.


