NLP Sincereness Detector Using PyTorch

In this post, we will use PyTorch to train an NLP Sincereness Detector that detects whether a question is asked sincerely or not. I will walk you through data preparation, model architecture, model training, and evaluation.

I've put all the code, including the dataset, in my GitHub repo. Feel free to clone or download it.

Understanding the Problem

The dataset we are using is a Quora questions dataset. Quora wants to keep track of insincere questions on their platform so as to make users feel safe while sharing their knowledge. An insincere question in this context is defined as a question intended to make a statement rather than looking for helpful answers. To break this down further, here are some characteristics that can signify that a particular question is insincere:

  • Has a non-neutral tone
  • Is disparaging or inflammatory
  • Isn’t grounded in reality
  • Uses sexual content (incest, bestiality, pedophilia) for shock value, and not to seek genuine answers

The dataset contains the question that was asked, and a flag denoting whether it was identified as insincere (target = 1).
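To get a feel for the data, you can peek at the CSV with pandas. This is just a quick sketch; the file path and the column names (qid, question_text, target) are assumptions based on the standard Quora Insincere Questions format and may differ slightly in the repo.

# Quick look at the dataset (sketch; path and column names are assumptions).
import pandas as pd

df = pd.read_csv("train.csv")
print(df.head())                       # qid, question_text, target
print(df["target"].value_counts())     # 0 = sincere, 1 = insincere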

Load and Prepare the Data

First, clone my repo from GitHub and change into the cloned directory. I will only show the important parts of the code here; for the detailed implementation, please see the source code in the repo.

!git clone https://github.com/haochen23/pytorch-nlp-classifier.git
%cd pytorch-nlp-classifier/

Use the following function to load and prepare the dataset.

# Adapted from utils.py
# (imports below are inferred from what the function uses; `seed` is defined at
#  module level in utils.py)
import random

import torch
from torchtext import data   # older torchtext API (moved to torchtext.legacy in later releases)

import config                # project settings, e.g. BATCH_SIZE and device

def load_data(file_path):
    '''
    load and prepare dataset to training and validation iterator
    '''
    TEXT = data.Field(tokenize="spacy", batch_first=True, include_lengths=True)
    LABEL = data.LabelField(dtype=torch.float, batch_first=True)

    fields = [(None,None), ('text',TEXT), ('label',LABEL)]

    total_data = data.TabularDataset(path=file_path,
                                    format="csv", 
                                    fields=fields,
                                    skip_header=True)
    # split data
    train_data, valid_data = total_data.split(split_ratio=0.7, random_state = random.seed(seed))
    # initialize glove embeddings
    TEXT.build_vocab(train_data, min_freq=3, vectors="glove.6B.100d")
    LABEL.build_vocab(train_data)
    batch_size = config.BATCH_SIZE
    device = config.device
    train_iterator, valid_iterator = data.BucketIterator.splits(
        (train_data, valid_data),
        batch_size = batch_size,
        sort_key = lambda x: len(x.text),
        sort_within_batch = True,
        device=device
    )

    return train_iterator, valid_iterator, TEXT, LABEL

This function prepares the data and returns four objects (a short usage sketch follows the list):

  • train_iterator: iterator over the training data, used for model training
  • valid_iterator: iterator over the validation data, used to validate the model during training
  • TEXT: a torchtext.data.Field object that holds the vocabulary and text-processing information built from train_data
  • LABEL: a torchtext.data.LabelField object that holds the label information built from train_data
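Putting it together, calling load_data looks roughly like this (the CSV file name is an assumption; check the repo for the actual path):

# Sketch of how load_data is consumed (file name is an assumption).
train_iterator, valid_iterator, TEXT, LABEL = load_data("train.csv")

print(f"Size of TEXT vocabulary: {len(TEXT.vocab)}")
print(f"Labels: {LABEL.vocab.stoi}")   # mapping from label strings to indices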

Model Architecture

We define a classifier class that inherits from torch.nn.Module. In this class we define two functions:

  1. __init__: defines all the layers that will be used in the model.
  2. forward: the forward pass of the model.

Next, let's take a look at the layers used in the model.

  1. Embedding layer: Embeddings are extremely important for any NLP task since they represent words in numerical form. The embedding layer creates a lookup table in which each row is the embedding of one word, converting a sequence of integer token indices into a dense vector representation.

  2. LSTM: The LSTM is an improvement over the plain RNN that is capable of capturing long-term dependencies.

  3. Linear: The linear layer is a fully connected (dense) layer.

  4. Pack Padding: Pack padding is used to make the recurrent network handle variable-length inputs. Without it, the padding tokens are also processed by the RNN, and the returned hidden state reflects the padded positions. pack_padded_sequence is a wrapper that simply skips the padded values and returns the hidden state of the last non-padded element (see the short sketch after this list).
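Here is a minimal, self-contained sketch (not from the repo) of what pack padding does on a toy batch:

# Toy example of pack_padded_sequence: padded positions are skipped by the LSTM.
import torch
import torch.nn as nn

embedded = torch.randn(2, 3, 4)             # [batch, max_len, emb_dim]; second sequence is padded
lengths = torch.tensor([3, 2])              # true lengths of the two sequences

packed = nn.utils.rnn.pack_padded_sequence(embedded, lengths, batch_first=True)
lstm = nn.LSTM(input_size=4, hidden_size=8, batch_first=True)
packed_output, (hidden, cell) = lstm(packed)

# hidden contains the state at each sequence's true last step, ignoring the padding
print(hidden.shape)                         # torch.Size([1, 2, 8])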

Now let's see the code that defines the model architecture.

# Adapted from classifer.py
import torch
import torch.nn as nn


class classifer(nn.Module):
    #define all the layers used in the model
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim, n_layers, 
                 bidirectional, dropout):
        super().__init__()

        # embedding layer
        self.embedding = nn.Embedding(vocab_size, embedding_dim)

        # lstm layer
        self.lstm = nn.LSTM(embedding_dim,
                            hidden_dim,
                            num_layers = n_layers,
                            bidirectional=bidirectional,
                            dropout=dropout,
                            batch_first=True)
        # dense layer
        self.fc = nn.Linear(hidden_dim*2, output_dim)

        # activation
        self.act = nn.Sigmoid()

    def forward(self, text, text_lengths):
        # text=[batch size, sent_length]
        embedded = self.embedding(text)
        # embedded = [batch size, sent_len, emb dim]

        #packed sequence
        packed_embedded = nn.utils.rnn.pack_padded_sequence(embedded, 
                                                            text_lengths,
                                                            batch_first=True)
        packed_output, (hidden, cell) = self.lstm(packed_embedded)


        # concat the final forward and backward hidden state
        hidden = torch.cat((hidden[-2,:,:], hidden[-1,:,:]), dim=1)


        dense_outputs=self.fc(hidden)

        #final activation
        outputs = self.act(dense_outputs)
        return outputs
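For reference, here is a sketch of how the classifier might be instantiated. The hyperparameter values are taken from the model summary printed during training later in this post; the actual variable names live in config.py in the repo and may differ. Copying the GloVe vectors into the embedding layer is a common step at this point, shown here as an assumption rather than a quote from the repo.

# Sketch of instantiating the classifier (hyperparameters match the printed summary;
# exact config variable names in the repo may differ).
model = classifer(vocab_size=len(TEXT.vocab),   # 17119 in the run shown below
                  embedding_dim=100,            # matches glove.6B.100d
                  hidden_dim=32,
                  output_dim=1,
                  n_layers=2,
                  bidirectional=True,
                  dropout=0.2)

# Common (assumed) step: initialize the embedding layer with the GloVe vectors
# that TEXT.build_vocab downloaded.
model.embedding.weight.data.copy_(TEXT.vocab.vectors)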

Define Training Process

The following two functions are used for training and validation during the model training process. The comments in the code explain what each step does.

# Adapted from train.py
def train(model, iterator, optimizer, criterion):

    # initialize every epoch
    epoch_loss = 0
    epoch_acc = 0

    # set the model in training mode
    model.train()

    for batch in iterator:
        #reset the gradients after every batch
        optimizer.zero_grad()

        # retrieve text and num of words
        text, text_lengths = batch.text

        # convert to 1D tensor
        predictions = model(text, text_lengths).squeeze()

        # compute the loss
        loss = criterion(predictions, batch.label)
        # print(loss.item())
        # compute binary accuracy
        acc = binary_accuracy(predictions, batch.label)
        # print(acc.item())

        # backpropogate the loss and compute the gradients
        loss.backward()

        # update weights
        optimizer.step()

        # loss and accuracy
        epoch_loss += loss.item()
        epoch_acc += acc.item()
    return epoch_loss/len(iterator), epoch_acc/len(iterator)

def evaluate(model, iterator, criterion):

    # initialize every epoch
    epoch_loss = 0
    epoch_acc = 0

    #set model to eval mode
    model.eval()

    # deactivates auto_grad
    with torch.no_grad():
        for batch in iterator:
            # retrieve text and num of words
            text, text_lengths = batch.text 

            # convert to 1D tensor
            predictions = model(text, text_lengths).squeeze()

            loss = criterion(predictions, batch.label)
            acc = binary_accuracy(predictions, batch.label)

            # keep track of loss and accuracy
            epoch_loss += loss.item()
            epoch_acc += acc.item()

    return epoch_loss/len(iterator), epoch_acc/len(iterator)
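Both functions rely on a binary_accuracy helper defined in train.py. A typical implementation looks like the following sketch (the repo's version may differ slightly):

# Sketch of a binary accuracy helper: round the sigmoid outputs and compare to the labels.
def binary_accuracy(preds, y):
    rounded_preds = torch.round(preds)          # probabilities -> 0/1 predictions
    correct = (rounded_preds == y).float()
    return correct.sum() / len(correct)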

We use a for loop to train the model for N_EPOCHS epochs.

# Adapted from train.py
optimizer = optim.Adam(model.parameters())
criterion = nn.BCELoss()
N_EPOCHS = config.N_EPOCH
best_valid_loss = float('inf')

# start training
for epoch in range(N_EPOCHS):

    # train the model
    train_loss, train_acc = train(model, train_iterator, optimizer, criterion)

    # evaluate the model
    valid_loss, valid_acc = evaluate(model, valid_iterator, criterion)

    # save the best model
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        # torch.save(model.state_dict(), './output/best_model.pt')
        torch.save(model, './output/best_model.pt')
    print("Epoch {}/{}:".format(epoch+1, N_EPOCHS))
    print(f'\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. Acc: {valid_acc*100:.2f}%')
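Note that the script saves the entire model object rather than just the state_dict, so loading it back for inference is a single call (a sketch; it assumes the classifer class is importable at load time):

# Load the best model back for inference (requires the classifer class to be importable,
# since torch.save(model, ...) pickles the whole object).
model = torch.load('./output/best_model.pt')
model.eval()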

Demonstration

Now that we've reviewed the important code blocks above, let's see how to actually use the code.

Train the model

The first time you run the code, it needs to download the glove.6B.100d embeddings in order to build the TEXT vocabulary. This may take a while depending on your internet speed, since the pre-trained embeddings are about 870 MB.

The best_model will be saved in the output folder together with TEXT, which contains the vocabulary information for our training data and will also be used for prediction.

In [1]:
%cd pytorch-nlp-classifier/
/content/pytorch-nlp-classifier
In [2]:
!python3 train.py
Size of TEXT vocabulary: 17119
classifer(
  (embedding): Embedding(17119, 100)
  (lstm): LSTM(100, 32, num_layers=2, batch_first=True, dropout=0.2, bidirectional=True)
  (fc): Linear(in_features=64, out_features=1, bias=True)
  (act): Sigmoid()
)
The model has 1,771,357 trainable parameters
Epoch 1/5:
	Train Loss: 0.331 | Train Acc: 84.24%
	 Val. Loss: 0.285 |  Val. Acc: 88.86%
Epoch 2/5:
	Train Loss: -0.329 | Train Acc: 89.81%
	 Val. Loss: 0.337 |  Val. Acc: 89.21%
Epoch 3/5:
	Train Loss: -1.319 | Train Acc: 91.68%
	 Val. Loss: 0.336 |  Val. Acc: 88.88%
Epoch 4/5:
	Train Loss: -1.484 | Train Acc: 92.73%
	 Val. Loss: 0.385 |  Val. Acc: 88.58%
Epoch 5/5:
	Train Loss: -1.578 | Train Acc: 93.98%
	 Val. Loss: 0.391 |  Val. Acc: 88.23%
/usr/local/lib/python3.6/dist-packages/torch/storage.py:34: FutureWarning: pickle support for Storage will be removed in 1.5. Use `torch.save` instead
  warnings.warn("pickle support for Storage will be removed in 1.5. Use `torch.save` instead", FutureWarning)

We obtained about 89% accuracy on the validation data in 5 epochs.

Inference

We'll look at a few examples to check how our model performs at identifying sincere and insincere questions.

Example 1: "Why people vote for Donald Trump?"

In [3]:
!python3 predict.py --question "Why people vote for Donald Trump?"
Results close to 1 represent insincere questions.
Results close to 0 represent sincere questions.
------
The result for 'Why people vote for Donald Trump?' is 0.9592183232307434

The prediction result is very close to 1, indicating that this is an insincere question. Wow, I'll take this, as it's rarely neutral when people talk about Trump. LOL.

Example 2: "What is Quora and why should you care?"

In [4]:
!python3 predict.py --question "What is Quora and why should you care?"
Results close to 1 represent insincere questions.
Results close to 0 represent sincere questions.
------
The result for 'What is Quora and why should you care?' is 0.23973408341407776

The prediction result is ~0.24, close to zero, which means it is a sincere question. And yes, there is not much emotion involved in this question; it's neutral.

Example 3: " What is your biggest weakness?"

In [5]:
!python3 predict.py --question " What is your biggest weakness?"
Results close to 1 represent insincere questions.
Results close to 0 represent sincere questions.
------
The result for ' What is your biggest weakness?' is 0.02312156930565834

This is a pretty common interview question, and the result shows that it is a sincere question. These examples indicate that our trained sincereness detector works pretty well.
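Under the hood, predict.py loads the saved model and the TEXT field, then turns the question into index tensors before calling the model. The sketch below only illustrates the idea; the function name, the spaCy model name, and the exact loading steps are assumptions, so refer to predict.py in the repo for the real implementation.

# Rough sketch of the prediction step (names and details are assumptions; see predict.py).
import spacy
import torch

nlp = spacy.load("en_core_web_sm")   # assumes the spaCy English model is installed

def predict_sincerity(model, TEXT, question, device):
    model.eval()
    tokens = [tok.text for tok in nlp.tokenizer(question)]
    indices = [TEXT.vocab.stoi[t] for t in tokens]              # map tokens to vocab indices
    length = torch.LongTensor([len(indices)])
    tensor = torch.LongTensor(indices).unsqueeze(0).to(device)  # shape [1, sent_len]
    with torch.no_grad():
        prediction = model(tensor, length)
    return prediction.item()   # close to 1 => insincere, close to 0 => sincere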

Summary

In this post, we used PyTorch to train an NLP Sincereness Detector that detects whether a question is asked sincerely or not. We walked through the complete PyTorch workflow for training the model and performing inference.


