Sentiment Analysis using SimpleRNN, LSTM and GRU

Intro

Recurrent Neural Networks (RNNs) are good at processing sequence data for predictions. Therefore, they are extremely useful for deep learning applications like speech recognition, speech synthesis, natural language understanding, etc.

There are three main types of RNNs: SimpleRNN, Long Short-Term Memory (LSTM), and Gated Recurrent Unit (GRU). SimpleRNNs are good for processing sequence data for predictions but suffer from short-term memory. LSTMs and GRUs were created to mitigate short-term memory using mechanisms called gates.

In this post, I am not going to discuss the theory behind these RNNs. Instead, I am going to show you how to apply these RNNs in your own application. We'll take Twitter sentiment analysis as an example. I'm also going to show how to use pre-trained word embeddings in these RNNs.

I've uploaded the code to my repo; feel free to use it.

Download Dependencies

You can use the following code to download the pre-trained Twitter embeddings from Kaggle (you need to install the Kaggle API to use the code; otherwise you can use the link) and to download the tweets data.

# download and unzip the glove model
!kaggle datasets download fullmetal26/glovetwitter27b100dtxt 
!unzip glovetwitter27b100dtxt.zip
# download the tweets data
!wget https://raw.githubusercontent.com/haochen23/nlp-rnn-lstm-sentiment/master/training.1600000.processed.noemoticon.csv
In [4]:
# import necessary libraries
import tensorflow.keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import InputLayer, Dense, SimpleRNN, Activation, Dropout, Conv1D
from tensorflow.keras.layers import Embedding, Flatten, LSTM, GRU
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.callbacks import EarlyStopping
import pandas as pd
import numpy as np
import spacy
from sklearn.metrics import classification_report

Exploratory Data Analysis

First, let's take a look at the whole dataset. There are 20,000 tweets in this dataset, and 6 attributes for each tweet. The dataset is well prepared and has no missing values.

In [5]:
data = pd.read_csv("training.1600000.processed.noemoticon.csv", header=None, encoding='latin-1')
print("The shape of the original dataset is {}".format(data.shape))
data.head()
The shape of the original dataset is (20000, 6)
Out[5]:
   0  1           2                             3         4                5
0  0  1467810369  Mon Apr 06 22:19:45 PDT 2009  NO_QUERY  _TheSpecialOne_  @switchfoot http://twitpic.com/2y1zl - Awww, t...
1  0  1467810672  Mon Apr 06 22:19:49 PDT 2009  NO_QUERY  scotthamilton    is upset that he can't update his Facebook by ...
2  0  1467810917  Mon Apr 06 22:19:53 PDT 2009  NO_QUERY  mattycus         @Kenichan I dived many times for the ball. Man...
3  0  1467811184  Mon Apr 06 22:19:57 PDT 2009  NO_QUERY  ElleCTF          my whole body feels itchy and like its on fire
4  0  1467811193  Mon Apr 06 22:19:57 PDT 2009  NO_QUERY  Karoli           @nationwideclass no, it's not behaving at all....
In [8]:
# Check missing values
data.isnull().any()
Out[8]:
0    False
1    False
2    False
3    False
4    False
5    False
dtype: bool

Data Preparation

Define Util Functions

Before we start data preparation, we first define some util functions:

  • load_glove_model loads the Twitter embeddings model we downloaded. This model was trained on 2 billion tweets, containing 27 billion tokens and a vocabulary of 1.2 million words.
  • remove_stopwords removes the stop words from a sentence.
  • lemmatize performs lemmatization on a sentence.
  • sent_vectorizer converts a sentence into a vector using the glove_model. This function may be used if we want a different type of input to the RNNs.
In [9]:
def load_glove_model(glove_file):
    print("[INFO]Loading GloVe Model...")
    model = {}
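    # the GloVe file is UTF-8 encoded; be explicit so the platform default is not used
    with open(glove_file, 'r', encoding='utf-8') as f: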
        for line in f:
            split_line = line.split()
            word = split_line[0]
            embeddings = [float(val) for val in split_line[1:]]
            model[word] = embeddings
    print("[INFO] Done...{} words loaded!".format(len(model)))
    return model
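# adapted from utils.py
# "en" was a shortcut link in older spaCy releases; spaCy 3 needs the full
# package name (install with: python -m spacy download en_core_web_sm)
nlp = spacy.load("en_core_web_sm")

def remove_stopwords(sentence):
    '''
    function to remove stopwords and punctuation
        input: sentence - string of sentence
        output: string with the remaining tokens joined by spaces
    '''
    new = []
    # tokenize sentence
    sentence = nlp(sentence)
    for tk in sentence:
        # keep tokens that are neither stop words nor punctuation
        if not tk.is_stop and tk.pos_ != "PUNCT":
            # Token.text is used because Token.string is not available in spaCy 3
            new.append(tk.text.strip())
    # convert back to sentence string
    c = " ".join(new)
    return c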


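def lemmatize(sentence):
    '''
    function to do lemmatization
        input: sentence - string of sentence
        output: spaCy Doc built from the lemmatized sentence
    '''
    sentence = nlp(sentence)
    # join the lemma of every token back into one string
    s = " ".join(w.lemma_ for w in sentence)
    return nlp(s)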

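def sent_vectorizer(sent, model):
    '''
    sentence vectorizer using the pretrained glove model
        input: sent - string of sentence, model - dict of word -> vector
        output: sum of the 200-d vectors of all words found in the model
    '''
    sent_vector = np.zeros(200)
    for w in sent.split():
        # add up all token vectors to a sent_vector,
        # skipping words that are missing from the glove model
        if w in model:
            sent_vector = np.add(sent_vector, model[w])
    return sent_vector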
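
As a quick sanity check of the sentence-vector idea, here is a minimal, self-contained sketch on a made-up 3-dimensional embedding dictionary (the real GloVe vectors used in this post have 200 dimensions). The names toy_model, sum_vectorizer and mean_vectorizer are assumptions for illustration only and are not part of the original utilities. The averaged variant is shown as one possible alternative, since it keeps the vector scale independent of tweet length.

```python
import numpy as np

# hypothetical toy embeddings; the real GloVe vectors have 200 dimensions
toy_model = {
    "good": [1.0, 0.0, 2.0],
    "day": [0.0, 1.0, 0.0],
}

def sum_vectorizer(sent, model, dim=3):
    """Sum the vectors of all words found in the model (unknown words are skipped)."""
    vec = np.zeros(dim)
    for w in sent.split():
        if w in model:
            vec = vec + np.asarray(model[w])
    return vec

def mean_vectorizer(sent, model, dim=3):
    """Average instead of sum, dividing by the number of known words."""
    known = [model[w] for w in sent.split() if w in model]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)

summed = sum_vectorizer("good day zzz", toy_model)
averaged = mean_vectorizer("good day zzz", toy_model)
print(summed.tolist())    # [1.0, 1.0, 2.0]
print(averaged.tolist())  # [0.5, 0.5, 1.0]
```

The unknown word "zzz" contributes nothing in either variant, which mirrors how words missing from the GloVe vocabulary are skipped.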

Preparing Data

We only care about the tweet text and the tweet sentiment, which are stored in column 5 and column 0 of the dataset, respectively. In the sentiment column, 0 represents negative, and 1 represents positive.

We organize the data so that data_X contains all the tweet text and data_y contains the one-hot encoded labels.

In [10]:
data_X = data[data.columns[5]].to_numpy()
data_y = data[data.columns[0]]
data_y = pd.get_dummies(data_y).to_numpy()
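
To make the label encoding concrete, here is a small sketch of what pd.get_dummies does to a label column. The toy_labels values are made up for illustration; the point is only that each distinct label value becomes one column, so the classes later show up as index 0 and index 1 in the classification reports regardless of the raw values.

```python
import pandas as pd

# hypothetical raw labels; two distinct values stand in for negative/positive
toy_labels = pd.Series([0, 4, 4, 0])

# one column per distinct label value, in sorted order
one_hot = pd.get_dummies(toy_labels).to_numpy(dtype="float32")
print(one_hot.shape)          # (4, 2)

# argmax recovers the class index (0 = first label value, 1 = second)
class_index = one_hot.argmax(axis=1)
print(class_index.tolist())   # [0, 1, 1, 0]
```

Casting to float32 here is a design choice for the sketch: recent pandas versions return boolean dummy columns, and a float array is the more conventional target format for a softmax output.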

The following code cell converts the tweet text in data_X to the sequence format that will be fed into the RNNs.

In [11]:
# load the glove model
glove_model = load_glove_model("glove.twitter.27B.200d.txt")
# number of most frequent words to keep
max_vocab = 18000
# length of each sequence after padding/truncation
max_len = 15

tokenizer = Tokenizer(num_words=max_vocab)
tokenizer.fit_on_texts(data_X)
sequences = tokenizer.texts_to_sequences(data_X)
word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))
data_keras = pad_sequences(sequences, maxlen=max_len, padding="post")
[INFO]Loading GloVe Model...
[INFO] Done...1193514 words loaded!
Found 30256 unique tokens.
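
To illustrate what the padding step produces, here is a pure-NumPy sketch that mimics pad_sequences with padding="post" and the default truncating="pre". The pad_post helper is a hypothetical stand-in written for this example, not the Keras implementation, and the toy sequences are made up.

```python
import numpy as np

def pad_post(sequences, maxlen, value=0):
    """Pad short sequences at the end; cut long ones from the front."""
    out = np.full((len(sequences), maxlen), value, dtype="int32")
    for row, seq in enumerate(sequences):
        kept = seq[-maxlen:]          # keep the last maxlen tokens
        out[row, :len(kept)] = kept   # remaining positions stay at the pad value
    return out

# hypothetical word-index sequences of different lengths
toy_sequences = [[5, 2, 9], [7, 1, 4, 8, 3, 6]]
padded = pad_post(toy_sequences, maxlen=4)
print(padded.tolist())  # [[5, 2, 9, 0], [4, 8, 3, 6]]
```

With max_len = 15, a tweet longer than 15 tokens therefore keeps its last 15 tokens, and shorter tweets are filled with zeros at the end.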

Split Data into Training and Evaluation Sets

In [12]:
from sklearn.model_selection import train_test_split
train_X, valid_X, train_y, valid_y = train_test_split(data_keras, data_y, test_size = 0.3, random_state=42)

Preparing Word Embeddings using the GloVe Model

In [13]:
# calculate the number of words (+1 because index 0 is reserved for padding)
nb_words = len(tokenizer.word_index) + 1

# obtain the word embedding matrix
embedding_matrix = np.zeros((nb_words, 200))
for word, i in word_index.items():
    embedding_vector = glove_model.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
print('Null word embeddings: %d' % np.sum(np.sum(embedding_matrix, axis=1) == 0))
Null word embeddings: 12567
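
The same construction can be followed on a toy vocabulary. This is a minimal sketch with made-up names (toy_word_index, toy_glove) and 2-dimensional vectors instead of 200; it shows which rows end up empty and how the matrix is then used as a lookup table.

```python
import numpy as np

# hypothetical stand-ins for tokenizer.word_index and the GloVe dictionary
toy_word_index = {"good": 1, "day": 2, "zzz": 3}
toy_glove = {"good": [0.1, 0.2], "day": [0.3, 0.4]}
dim = 2

# one row per word index, plus row 0 for the padding index
toy_matrix = np.zeros((len(toy_word_index) + 1, dim))
for word, i in toy_word_index.items():
    vector = toy_glove.get(word)
    if vector is not None:
        toy_matrix[i] = vector

# rows that are entirely zero: padding row 0 and the unknown word "zzz"
null_rows = np.where(~toy_matrix.any(axis=1))[0]
print(null_rows.tolist())  # [0, 3]

# an embedding layer is a row lookup: sequence of indices -> sequence of vectors
padded_tweet = np.array([1, 2, 3, 0])
looked_up = toy_matrix[padded_tweet]
print(looked_up.shape)     # (4, 2)
```

Note that row 0 is always empty because word indices start at 1, so the "Null word embeddings" count above includes the padding row in addition to the words missing from GloVe.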

Build RNN Models

In [14]:
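# adapted from sent_tran_eval.py
def build_model(nb_words, rnn_model="SimpleRNN", embedding_matrix=None):
    '''
    build_model function:
    inputs: 
        nb_words - size of the vocabulary (number of rows in the embedding layer)
        rnn_model - which type of RNN layer to use, choose in (SimpleRNN, LSTM, GRU)
        embedding_matrix - pretrained embedding weights; if None, embeddings are learned from scratch
    '''
    model = Sequential()
    # add an embedding layer
    if embedding_matrix is not None:
        # pretrained embeddings: load the GloVe weights and keep them frozen
        model.add(Embedding(nb_words, 
                        200, 
                        weights=[embedding_matrix], 
                        input_length= max_len,
                        trainable = False))
    else:
        # no pretrained embeddings: the layer must be trainable,
        # otherwise it stays at its random initialization
        model.add(Embedding(nb_words, 
                        200, 
                        input_length= max_len,
                        trainable = True))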
        
    # add an RNN layer according to rnn_model
    if rnn_model == "SimpleRNN":
        model.add(SimpleRNN(200))
    elif rnn_model == "LSTM":
        model.add(LSTM(200))
    else:
        model.add(GRU(200))
    # model.add(Dense(500,activation='relu'))
    # model.add(Dense(500, activation='relu'))
    model.add(Dense(2, activation='softmax'))
    
    model.compile(loss='categorical_crossentropy', 
                optimizer='adam',
                metrics=['accuracy'])
    return model
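
To get a feel for how the three variants differ in size, the trainable parameter counts can be worked out by hand. The sketch below uses the standard formulas for these layers with 200-dimensional inputs and 200 units, and assumes the TensorFlow 2 default reset_after=True for GRU; the helper names are hypothetical. The frozen embedding layer is left out because it is not trained.

```python
def simple_rnn_params(input_dim, units):
    # input kernel + recurrent kernel + bias
    return units * (input_dim + units + 1)

def lstm_params(input_dim, units):
    # four gate/candidate blocks, each shaped like a SimpleRNN layer
    return 4 * units * (input_dim + units + 1)

def gru_params(input_dim, units, reset_after=True):
    # three gate/candidate blocks; reset_after=True keeps two bias vectors per block
    biases = 2 if reset_after else 1
    return 3 * units * (input_dim + units + biases)

def dense_params(input_dim, units):
    # weights + bias of the final softmax layer
    return units * (input_dim + 1)

counts = {}
for name, fn in [("SimpleRNN", simple_rnn_params),
                 ("LSTM", lstm_params),
                 ("GRU", gru_params)]:
    counts[name] = fn(200, 200) + dense_params(200, 2)
    print(name, counts[name])
# SimpleRNN 80602
# LSTM 321202
# GRU 241602
```

The gated layers carry roughly three to four times as many trainable weights as the SimpleRNN layer, which is the price of the gating mechanism.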

Training and Evaluation

Now we'll train and evaluate the SimpleRNN, LSTM, and GRU networks on our prepared dataset.

We are using the pre-trained word embeddings from the glove.twitter.27B.200d.txt data. Using pre-trained word embeddings as weights for the Embedding layer usually leads to better results and faster convergence.

We set each model to run for up to 20 epochs, but we also set an EarlyStopping rule to prevent overfitting: training stops once val_accuracy has not improved for 3 consecutive epochs. The results of the SimpleRNN, LSTM, and GRU models can be seen below.
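
The patience rule can be sketched in a few lines of plain Python. This is a simplified illustration of the idea (monitoring a metric in 'max' mode with patience=3), not the Keras implementation; the early_stop_epoch helper is hypothetical, and the scores are the val_accuracy values copied from the training logs below.

```python
def early_stop_epoch(val_scores, patience=3):
    """Return the (1-based) epoch at which training would stop."""
    best = float("-inf")
    wait = 0
    for epoch, score in enumerate(val_scores, start=1):
        if score > best:
            best = score      # new best: reset the counter
            wait = 0
        else:
            wait += 1         # no improvement this epoch
            if wait >= patience:
                return epoch
    return len(val_scores)    # ran through all epochs without stopping

# val_accuracy per epoch, copied from the logs in this post
rnn_scores = [0.7485, 0.7533, 0.7522, 0.7513, 0.7483]
lstm_scores = [0.7783, 0.7828, 0.7802, 0.7837, 0.7925, 0.7943, 0.7855, 0.7903, 0.7822]
gru_scores = [0.7640, 0.7800, 0.7748, 0.7877, 0.7810, 0.7748, 0.7863]

print(early_stop_epoch(rnn_scores))   # 5
print(early_stop_epoch(lstm_scores))  # 9
print(early_stop_epoch(gru_scores))   # 7
```

One thing to keep in mind: as far as I know, EarlyStopping keeps the weights from the last epoch unless restore_best_weights=True is passed, so the classification reports describe the final epoch rather than the best one.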

In [15]:
model_rnn = build_model(nb_words, "SimpleRNN", embedding_matrix)
model_rnn.fit(train_X, train_y, epochs=20, batch_size=120,
          validation_data=(valid_X, valid_y),
          # callbacks is documented as a list of callback instances
          callbacks=[EarlyStopping(monitor='val_accuracy', mode='max', patience=3)])
predictions = model_rnn.predict(valid_X)
predictions = predictions.argmax(axis=1)
print(classification_report(valid_y.argmax(axis=1), predictions))
Epoch 1/20
117/117 [==============================] - 2s 17ms/step - loss: 0.5889 - accuracy: 0.6806 - val_loss: 0.5029 - val_accuracy: 0.7485
Epoch 2/20
117/117 [==============================] - 2s 13ms/step - loss: 0.4855 - accuracy: 0.7658 - val_loss: 0.4932 - val_accuracy: 0.7533
Epoch 3/20
117/117 [==============================] - 2s 15ms/step - loss: 0.4454 - accuracy: 0.7939 - val_loss: 0.4954 - val_accuracy: 0.7522
Epoch 4/20
117/117 [==============================] - 2s 14ms/step - loss: 0.4117 - accuracy: 0.8104 - val_loss: 0.5041 - val_accuracy: 0.7513
Epoch 5/20
117/117 [==============================] - 2s 14ms/step - loss: 0.3785 - accuracy: 0.8316 - val_loss: 0.5172 - val_accuracy: 0.7483
              precision    recall  f1-score   support

           0       0.76      0.73      0.74      3016
           1       0.74      0.77      0.75      2984

    accuracy                           0.75      6000
   macro avg       0.75      0.75      0.75      6000
weighted avg       0.75      0.75      0.75      6000

In [16]:
model_lstm = build_model(nb_words, "LSTM", embedding_matrix)
model_lstm.fit(train_X, train_y, epochs=20, batch_size=120,
          validation_data=(valid_X, valid_y),
          # callbacks is documented as a list of callback instances
          callbacks=[EarlyStopping(monitor='val_accuracy', mode='max', patience=3)])
predictions = model_lstm.predict(valid_X)
predictions = predictions.argmax(axis=1)
print(classification_report(valid_y.argmax(axis=1), predictions))
Epoch 1/20
117/117 [==============================] - 1s 10ms/step - loss: 0.5441 - accuracy: 0.7188 - val_loss: 0.4777 - val_accuracy: 0.7783
Epoch 2/20
117/117 [==============================] - 1s 7ms/step - loss: 0.4764 - accuracy: 0.7664 - val_loss: 0.4651 - val_accuracy: 0.7828
Epoch 3/20
117/117 [==============================] - 1s 6ms/step - loss: 0.4514 - accuracy: 0.7817 - val_loss: 0.4604 - val_accuracy: 0.7802
Epoch 4/20
117/117 [==============================] - 1s 6ms/step - loss: 0.4241 - accuracy: 0.8017 - val_loss: 0.4538 - val_accuracy: 0.7837
Epoch 5/20
117/117 [==============================] - 1s 6ms/step - loss: 0.3923 - accuracy: 0.8206 - val_loss: 0.4442 - val_accuracy: 0.7925
Epoch 6/20
117/117 [==============================] - 1s 6ms/step - loss: 0.3589 - accuracy: 0.8394 - val_loss: 0.4700 - val_accuracy: 0.7943
Epoch 7/20
117/117 [==============================] - 1s 6ms/step - loss: 0.3275 - accuracy: 0.8539 - val_loss: 0.4635 - val_accuracy: 0.7855
Epoch 8/20
117/117 [==============================] - 1s 6ms/step - loss: 0.2867 - accuracy: 0.8772 - val_loss: 0.4842 - val_accuracy: 0.7903
Epoch 9/20
117/117 [==============================] - 1s 6ms/step - loss: 0.2450 - accuracy: 0.8966 - val_loss: 0.5190 - val_accuracy: 0.7822
              precision    recall  f1-score   support

           0       0.79      0.78      0.78      3016
           1       0.78      0.79      0.78      2984

    accuracy                           0.78      6000
   macro avg       0.78      0.78      0.78      6000
weighted avg       0.78      0.78      0.78      6000

In [17]:
model_gru = build_model(nb_words, "GRU", embedding_matrix)
model_gru.fit(train_X, train_y, epochs=20, batch_size=120,
          validation_data=(valid_X, valid_y),
          # callbacks is documented as a list of callback instances
          callbacks=[EarlyStopping(monitor='val_accuracy', mode='max', patience=3)])
predictions = model_gru.predict(valid_X)
predictions = predictions.argmax(axis=1)
print(classification_report(valid_y.argmax(axis=1), predictions))
Epoch 1/20
117/117 [==============================] - 1s 10ms/step - loss: 0.5824 - accuracy: 0.6819 - val_loss: 0.4902 - val_accuracy: 0.7640
Epoch 2/20
117/117 [==============================] - 1s 6ms/step - loss: 0.4761 - accuracy: 0.7703 - val_loss: 0.4661 - val_accuracy: 0.7800
Epoch 3/20
117/117 [==============================] - 1s 6ms/step - loss: 0.4483 - accuracy: 0.7876 - val_loss: 0.4689 - val_accuracy: 0.7748
Epoch 4/20
117/117 [==============================] - 1s 6ms/step - loss: 0.4231 - accuracy: 0.8029 - val_loss: 0.4518 - val_accuracy: 0.7877
Epoch 5/20
117/117 [==============================] - 1s 6ms/step - loss: 0.3916 - accuracy: 0.8202 - val_loss: 0.4637 - val_accuracy: 0.7810
Epoch 6/20
117/117 [==============================] - 1s 6ms/step - loss: 0.3650 - accuracy: 0.8358 - val_loss: 0.4784 - val_accuracy: 0.7748
Epoch 7/20
117/117 [==============================] - 1s 6ms/step - loss: 0.3193 - accuracy: 0.8609 - val_loss: 0.4802 - val_accuracy: 0.7863
              precision    recall  f1-score   support

           0       0.80      0.76      0.78      3016
           1       0.77      0.81      0.79      2984

    accuracy                           0.79      6000
   macro avg       0.79      0.79      0.79      6000
weighted avg       0.79      0.79      0.79      6000

As we can see from the above results, even with this small RNN architecture, LSTM and GRU perform noticeably better than SimpleRNN (validation accuracy of 0.78 and 0.79 versus 0.75). This is consistent with general practice.
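
When reading the classification reports above, it helps to remember that the f1-score is the harmonic mean of precision and recall. The short sketch below recomputes it for class 0 from the rounded precision and recall values printed in the three reports; the f1_score helper is written here for illustration and is not the scikit-learn function.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (precision, recall) for class 0, copied from the three reports above
reports = {
    "SimpleRNN": (0.76, 0.73),
    "LSTM": (0.79, 0.78),
    "GRU": (0.80, 0.76),
}
f1_values = {name: round(f1_score(p, r), 2) for name, (p, r) in reports.items()}
for name, value in f1_values.items():
    print(name, value)
# SimpleRNN 0.74
# LSTM 0.78
# GRU 0.78
```

Because the inputs are already rounded to two decimals, this only approximately reproduces the reported numbers, but it matches the f1-score column for class 0 in all three cases.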

You can also try training the RNNs without the pre-trained word embeddings and see what results you'll get.

Summary

In this post, we've seen how to use RNNs for a sentiment analysis task in NLP.

SimpleRNNs are good for processing sequence data for predictions but suffer from short-term memory. LSTMs and GRUs were created to mitigate short-term memory using mechanisms called gates, and they usually perform better than SimpleRNNs. Practitioners usually experiment with both LSTM and GRU to see which one gives better results.


