Twitter Sentiment Modeling: Detecting Racist or Sexist Tweets¶
1. Introduction¶
Twitter is one of the most popular social media platforms today. It provides an online news and social networking service on which users post and interact with messages known as "tweets". Registered users can post, like, and retweet tweets, while unregistered users can only read them.
Sentiment Analysis refers to the use of natural language processing, text analysis, computational linguistics, and biometrics to systematically identify, extract, quantify, and study affective states and subjective information. (Source: Wikipedia)
Sentiment analysis on tweets can be extremely useful. By analyzing tweets, we can gauge how people feel about a particular topic and understand public opinion, which in turn helps us respond with the right strategies.
In this post, our main goal is to build a model to identify tweets with racist or sexist sentiment.
2. The data we use¶
We have a training set of 31,962 tweets with labeled sentiments. The dataset is provided as a CSV file in which each line stores a tweet id, its label, and the tweet text. Label '1' denotes that the tweet is racist/sexist and label '0' denotes that it is not.
We also have a test set of 17,197 tweets. The test file contains only the tweet ids and the tweet text, with each tweet on a new line.
The dataset we use can be downloaded from here.
3. Tweets cleaning and preprocessing¶
Load tweets¶
# importing libraries
import pandas as pd
import numpy as np
import re
import string
import nltk
import warnings
warnings.filterwarnings("ignore")
#import plotting libraries
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
# loading training and test tweets
train = pd.read_csv('data\\train_E6oV3lV.csv')
test = pd.read_csv('data\\test_tweets_anuFYb8.csv')
print("The size of the training tweets is {}".format(train.shape))
print("The size of the test set is {}".format(test.shape))
Let's take a look at the first 5 rows of the training data.
train.head()
train['label'].value_counts()
We calculated the number of tweets with each label in our training data. As can be seen, the distribution is quite uneven. We have to take this into account when building our prediction model.
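The raw counts above can also be viewed as proportions, which makes the imbalance explicit. A quick, optional check:
# proportion of tweets with each label in the training data
train['label'].value_counts(normalize = True)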
The data has 3 columns: id, label, and tweet. id is a unique id for each tweet. label is the binary target variable, where label '1' denotes that the tweet is racist/sexist and label '0' denotes that it is not. tweet contains the tweet text that we will clean and preprocess.
Tweets cleaning¶
After taking a look at the first 5 records, we can form some initial thoughts regarding the cleaning of our training data:
- The Twitter handles have already been masked as @user due to privacy concerns, so they give virtually no information about the nature of a tweet and need to be removed.
- Punctuation, numbers, and special characters also need to be removed.
- Short words, such as "is", "are", "for", and "and", which add little value to the tweets, also need to be removed.
- Tokenization. Tokenization is an essential step in any NLP task; it splits every tweet into individual tokens/words.
- Stemming/lemmatizing. This reduces different variations of words to their common root. For example, 'love', 'loving', and 'loves' all come from the root 'love'. We can reduce the total number of unique words in our data without losing a significant amount of information.
1. Removing handles (@user)
For our convenience, let's combine the train and test sets so we can perform the cleaning in one pass.
# combine train and test (DataFrame.append was removed in recent pandas versions)
all_tweets = pd.concat([train, test], ignore_index = True)
We define a function remove_pattern to remove unwanted patterns from our tweets. In our case, the unwanted pattern is '@user'.
# function to remove every occurrence of a pattern from a piece of text
def remove_pattern(input_txt, pattern):
    matches = re.findall(pattern, input_txt)
    for match in matches:
        # escape the matched text so re.sub treats it literally
        input_txt = re.sub(re.escape(match), '', input_txt)
    return input_txt
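As a quick sanity check, here is how the function behaves on a made-up tweet (not from the dataset):
# example on a made-up tweet: both handles are stripped out
sample = "@user @user thanks for the #lyft credit!"
print(remove_pattern(sample, r"@[\w]*"))
# output: '  thanks for the #lyft credit!' (the handles are removed; extra spaces remain)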
Now let's create a new column, tidy_tweet, which will contain the cleaned and processed tweets. Note that we pass "@[\w]*" as the pattern to the remove_pattern function. It is a regular expression that matches any word starting with '@'.
# remove twitter handles (@username)
all_tweets['tidy_tweet'] = np.vectorize(remove_pattern)(
    all_tweets['tweet'], r"@[\w]*")
2. Removing punctuation, numbers, and special characters
We again use a regular expression to remove punctuation, numbers, and special characters in this step.
# remove punctuation, numbers, and special characters (keep letters and '#')
all_tweets['tidy_tweet'] = all_tweets['tidy_tweet'].str.replace(
    "[^a-zA-Z#]", " ", regex = True)
3. Removing short words
Some short words carry little useful information, so we remove them from our tweets. Let's have a look at the data after these three cleaning steps.
all_tweets['tidy_tweet'] = all_tweets['tidy_tweet'].apply(
lambda x: ' '.join([w for w in x.split()
if len(w) > 3]))
all_tweets.head()
As can be clearly seen, the tweets in the tidy_tweet column are much shorter than the original ones. They contain only the important words, and the noise (numbers, punctuation, and special characters) has been removed effectively.
4. Tokenization
# tokenization
tokenized_tweet = all_tweets['tidy_tweet'].apply(
lambda x: x.split())
tokenized_tweet.head()
5. Stemming/Lemmatizing
Stemming is a rule-based process of stripping the suffixes ("ing", "ly", "es", "s" etc) from a word.
from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()
tokenized_tweet = tokenized_tweet.apply(lambda x:
[stemmer.stem(i) for i in x])
tokenized_tweet.head()
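To get a feel for what the Porter stemmer does, here is a quick illustration on a few standalone words (not output from our data):
# quick illustration of Porter stemming on a few words
print([stemmer.stem(w) for w in ['love', 'loving', 'loves', 'running']])
# expected output: ['love', 'love', 'love', 'run']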
Now let's stitch these tokens back together.
for i in range(len(tokenized_tweet)):
    tokenized_tweet[i] = ' '.join(tokenized_tweet[i])
all_tweets['tidy_tweet'] = tokenized_tweet
all_tweets.head()
4. Insights and visualization from cleaned tweets¶
A few questions we are going to ask are:
- What are the most common words in the entire dataset?
- What are the most common words in the dataset for negative and positive tweets, respectively?
- How many hashtags are there in a tweet?
- Which trends are associated with our dataset?
- Which trends are associated with either of the sentiments? Are they compatible with the sentiments?
1. Most common words: WordCloud
A word cloud is a useful tool for finding the most common words appearing in the dataset. The most frequent words appear in larger sizes and the less common words appear in smaller sizes.
all_words = ' '.join([text for text in all_tweets['tidy_tweet']])
from wordcloud import WordCloud
wordcloud = WordCloud(width = 800, height = 500, random_state = 21,
max_font_size = 110).generate(all_words)
plt.figure(figsize = (10,7))
plt.imshow(wordcloud, interpolation = 'bilinear')
plt.axis('off')
plt.show()
We can see that most of the words are positive or neutral, with happy, thank, today, and love being the most frequent ones. However, this doesn't give us any idea about the words associated with the racist/sexist tweets. Hence, we will plot separate wordclouds for the two classes (racist/sexist or not) in our training data.
2. Most common words in normal/positive tweets
normal_words = ' '.join([text for text in
all_tweets['tidy_tweet'][all_tweets['label'] == 0]])
wordcloud = WordCloud(width = 800, height = 500, random_state = 21,
max_font_size = 110).generate(normal_words)
plt.figure(figsize=(10, 7))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis('off')
plt.show()
As expected, the most common words in normal/positive tweets are either positive or neutral.
3. Racist/sexist tweets
negative_words = ' '.join([text for text in
all_tweets['tidy_tweet'][all_tweets['label'] == 1]])
wordcloud = WordCloud(width = 800, height = 500, random_state = 21,
max_font_size = 110).generate(negative_words)
plt.figure(figsize = (10, 7))
plt.imshow(wordcloud, interpolation = 'bilinear')
plt.axis('off')
plt.show()
As we can clearly see, most of the words have negative connotations. So, it seems we have pretty good text data to work with. Next we will explore the hashtags/trends in our Twitter data.
4. Exploring hashtags on tweets sentiment
Hashtags on Twitter are synonymous with the trends ongoing at any particular point in time. The hashtags may be valuable in helping us find negative tweets.
We will first store all the hashtags in two separate lists - one for normal/positive tweets and the other for racist/sexist tweets.
# function to collect hashtags
def hashtag_extract(x):
    hashtags = []
    # loop over the tweets and collect the hashtags found in each one
    for i in x:
        ht = re.findall(r"#(\w+)", i)
        hashtags.append(ht)
    return hashtags
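Note that hashtag_extract returns one sub-list per tweet, which is why the results are flattened further below. A made-up example:
# made-up example: one sub-list of hashtags per input string
print(hashtag_extract(["love #summer #sun", "no tags here"]))
# expected output: [['summer', 'sun'], []]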
#extracting hashtags from non racist/sexist tweets
HT_regular = hashtag_extract(all_tweets['tidy_tweet']
[all_tweets['label'] == 0])
# extracting hashtags from racist/sexist tweets
HT_negative = hashtag_extract(all_tweets['tidy_tweet']
[all_tweets['label']== 1])
# unnesting list
HT_regular = sum(HT_regular, [])
HT_negative = sum(HT_negative, [])
Top hashtags in non-racist/sexist tweets
a = nltk.FreqDist(HT_regular)
d = pd.DataFrame({'Hashtag': list(a.keys()),
'Count': list(a.values())})
# selecting top 10 most frequent hashtags
d = d.nlargest(columns = 'Count', n = 10)
plt.figure(figsize = (16, 5))
ax = sns.barplot(data = d, x = "Hashtag", y = "Count")
ax.set(ylabel = 'Count')
plt.show()
Top hashtags in racist/sexist tweets
b = nltk.FreqDist(HT_negative)
e = pd.DataFrame({'Hashtag': list(b.keys()), 'Count': list(b.values())})
# selecting top 10 most frequent hashtags
e = e.nlargest(columns = "Count", n = 10)
plt.figure(figsize = (16, 5))
ax = sns.barplot(data = e, x = "Hashtag", y = "Count")
ax.set(ylabel = 'Count')
plt.show()
As can be seen, the hashtags in the regular tweets are positive or neutral, while the hashtags in the racist/sexist tweets are clearly negative.
5. Feature extraction from processed tweets¶
Depending upon the usage, text features can be constructed using assorted techniques – Bag-of-Words, term frequency–inverse document frequency (TF-IDF), and Word Embeddings. In this post, we will be covering only Bag-of-Words and TF-IDF.
Bag-of-words features
Bag-of-Words is a method to convert text into numerical features. Bag-of-Words features can be easily created using sklearn's CountVectorizer. We set the parameter max_features = 1000 to keep only the top 1,000 terms ordered by term frequency across the corpus.
from sklearn.feature_extraction.text import CountVectorizer
bow_vectorizer = CountVectorizer(max_df = 0.9, min_df = 2,
max_features = 1000, stop_words = 'english')
# bag-of-words feature matrix
bow = bow_vectorizer.fit_transform(all_tweets['tidy_tweet'])
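To make the representation concrete, here is a small illustration on a toy corpus (not our tweets): each row is one document and each column counts one vocabulary term. (get_feature_names_out assumes a reasonably recent scikit-learn; older versions use get_feature_names.)
# toy corpus, not our tweets
toy = ["love this day", "love love summer"]
toy_vectorizer = CountVectorizer()
toy_bow = toy_vectorizer.fit_transform(toy)
print(toy_vectorizer.get_feature_names_out())  # ['day' 'love' 'summer' 'this']
print(toy_bow.toarray())  # rows: [1 1 0 1] and [0 2 1 0]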
TF-IDF features
Term frequency–inverse document frequency (TF-IDF) takes into account not just the occurrence of a word in a single document (or tweet) but in the entire corpus. TF-IDF penalizes common words by assigning them lower weights while giving importance to words that are rare in the corpus as a whole but appear in good numbers in a few documents. A small toy example follows the formulas below.
- TF = (number of times term t appears in a document) / (total number of terms in the document)
- IDF = log(N/n), where N is the total number of documents and n is the number of documents in which term t appears.
- TF-IDF = TF * IDF
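As a small illustration of these formulas on a toy corpus (not our tweet data), we can inspect TfidfVectorizer directly. Note that sklearn uses a smoothed variant, IDF = log((1 + N) / (1 + n)) + 1, and L2-normalizes each row, so its numbers differ slightly from the textbook definition above:
from sklearn.feature_extraction.text import TfidfVectorizer
# toy corpus, not our tweet data
toy_docs = ["love this day", "love love summer", "hate this"]
toy_tfidf = TfidfVectorizer()
toy_matrix = toy_tfidf.fit_transform(toy_docs)
# 'love' appears in 2 of the 3 documents while 'hate' appears in only 1,
# so 'hate' receives a larger IDF weight than 'love'
print(toy_tfidf.get_feature_names_out())
print(toy_matrix.toarray().round(2))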
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer(max_df = 0.9, min_df = 2,
max_features = 1000, stop_words = 'english')
# TF-IDF feature matrix
tfidf = tfidf_vectorizer.fit_transform(all_tweets['tidy_tweet'])
6. Sentiment modeling¶
Since this is a binary classification problem, we will use logistic regression to build our prediction model. Logistic regression models the probability of the positive class by passing a linear combination of the input features through the logistic (sigmoid) function.
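For reference, here is a minimal sketch of the sigmoid function that maps a linear score to a probability (illustrative values only):
# minimal sketch of the logistic (sigmoid) function
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# large negative scores map close to 0, large positive scores map close to 1
print(sigmoid(np.array([-4.0, 0.0, 4.0])))  # approx. [0.018, 0.5, 0.982]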
As mentioned before, the label distribution in our training data is quite uneven, so plain prediction accuracy cannot properly evaluate our model's performance. Here, we use the F1 score, which is more informative in this situation. This metric can be understood as follows (a small worked example follows the formulas):
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 Score = 2 * (Precision * Recall) / (Precision + Recall)
where TP = true positives, FP = false positives, TN = true negatives, FN = false negatives.
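As a quick sanity check of these definitions, here is a toy example with made-up labels (not our data), computed both by hand and with sklearn:
from sklearn.metrics import precision_score, recall_score, f1_score

# made-up labels: 3 actual positives, the model finds 2 of them
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
# TP = 2, FP = 1, FN = 1
# Precision = 2 / (2 + 1) = 0.667, Recall = 2 / (2 + 1) = 0.667, F1 = 0.667
print(precision_score(y_true, y_pred),
      recall_score(y_true, y_pred),
      f1_score(y_true, y_pred))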
Building a model using bag-of-words features
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
train_bow = bow[:31962, :]
test_bow = bow[31962:, :]
# splitting the train_bow into training and validation set
xtrain_bow, xvalid_bow, ytrain, yvalid = train_test_split(
train_bow, train['label'], random_state = 42, test_size = 0.3)
lreg = LogisticRegression()
lreg.fit(xtrain_bow, ytrain)
# make prediction on validation set
prediction = lreg.predict_proba(xvalid_bow)
prediction_int = prediction[:,1] >= 0.3  # if the predicted probability is >= 0.3, label as 1, else 0
prediction_int = prediction_int.astype(int)  # np.int has been removed from recent NumPy
f1 = f1_score(yvalid, prediction_int)  # calculate the F1 score
print("The f1-score for using only the bag-of-words features is : {}".format(f1))
test_pred = lreg.predict_proba(test_bow)
test_pred_int = test_pred[:, 1] >= 0.3  # use the probability of the positive class
test_pred_int = test_pred_int.astype(int)
# test['label'] = test_pred_int
# submission = test[['id','label']]
# submission.to_csv('sub_lreg_bow.csv', index=False) # writing data to a CSV file
Building a model using TF-IDF features
train_tfidf = tfidf[:31962, :]
test_tfidf = tfidf[31962:, :]
xtrain_tfidf = train_tfidf[ytrain.index]
xvalid_tfidf = train_tfidf[yvalid.index]
lreg.fit(xtrain_tfidf, ytrain)
prediction = lreg.predict_proba(xvalid_tfidf)
prediction_int = prediction[:,1] >= 0.3
prediction_int = prediction_int.astype(int)  # np.int has been removed from recent NumPy
f1 = f1_score(yvalid, prediction_int)
print("The f1-score for using only the TF-IDF features is : {}".format(f1))
Using the TF-IDF features, the validation F1 score is about 0.545, which is reasonably good. The public leaderboard F1 score is 0.564.