NLP Sincerity Detector using PyTorch¶
In this post, we will use PyTorch to train an NLP sincerity detector that determines whether a question is asked sincerely or not. I will walk you through data preparation, model architecture, model training, and evaluation.
I've put all the code, including the dataset, in my GitHub repo. Feel free to clone or download it.
Understanding the Problem¶
The dataset we are using is a Quora questions dataset. Quora wants to keep track of insincere questions on its platform so that users feel safe while sharing their knowledge. An insincere question in this context is defined as a question intended to make a statement rather than to look for helpful answers. To break this down further, here are some characteristics that can signify that a particular question is insincere:
- Has a non-neutral tone
- Is disparaging or inflammatory
- Isn’t grounded in reality
- Uses sexual content (incest, bestiality, pedophilia) for shock value, and not to seek genuine answers
The dataset contains the question that was asked, and a flag denoting whether it was identified as insincere (target = 1).
Load and Prepare the Data¶
First, clone my repo from GitHub and change into the cloned directory. I will only show the important parts of the code here; for the detailed implementation, please see the source code in the repo.
!git clone https://github.com/haochen23/pytorch-nlp-classifier.git
%cd pytorch-nlp-classifier/
Use the following function to load and prepare the dataset.
# Adapted from utils.py
import random

import torch
from torchtext import data   # on torchtext >= 0.9 this module lives in torchtext.legacy

import config                # the repo's config.py (BATCH_SIZE, device, ...)

seed = 42                    # illustrative value; the repo defines its own seed

def load_data(file_path):
    '''
    Load and prepare the dataset, returning training and validation iterators.
    '''
    TEXT = data.Field(tokenize="spacy", batch_first=True, include_lengths=True)
    LABEL = data.LabelField(dtype=torch.float, batch_first=True)
    # the first CSV column (the question id) is ignored
    fields = [(None, None), ('text', TEXT), ('label', LABEL)]
    total_data = data.TabularDataset(path=file_path,
                                     format="csv",
                                     fields=fields,
                                     skip_header=True)

    # split the data into training and validation sets
    train_data, valid_data = total_data.split(split_ratio=0.7,
                                              random_state=random.seed(seed))

    # build the vocabulary and initialize it with pretrained GloVe embeddings
    TEXT.build_vocab(train_data, min_freq=3, vectors="glove.6B.100d")
    LABEL.build_vocab(train_data)

    batch_size = config.BATCH_SIZE
    device = config.device

    # bucket questions of similar length together to minimize padding
    train_iterator, valid_iterator = data.BucketIterator.splits(
        (train_data, valid_data),
        batch_size=batch_size,
        sort_key=lambda x: len(x.text),
        sort_within_batch=True,
        device=device
    )
    return train_iterator, valid_iterator, TEXT, LABEL
This function prepares the data and returns four objects:
- train_iterator: iterator over the training data, used for model training
- valid_iterator: iterator over the validation data, used for model validation during training
- TEXT: torchtext data.Field object, containing all the textual (vocabulary) information of train_data
- LABEL: torchtext data.LabelField object, containing all the label information of train_data
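As a quick sanity check, load_data can be called like this (the CSV path is a placeholder; point it at wherever the dataset sits in your clone of the repo):
train_iterator, valid_iterator, TEXT, LABEL = load_data("train.csv")   # hypothetical path
print("Vocabulary size:", len(TEXT.vocab))
print("Label mapping:", LABEL.vocab.stoi)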
Model Architecture¶
We define a classifier class which inherits from torch.nn.Module. In this class we define two functions:
- __init__: we will define all the layers that will be used in our model.
- forward: forward pass of our model.
Then, let's take a look at the layers used in the model.
Embedding layer: Embeddings are extremely important for any NLP-related task since they represent words in a numerical format. The embedding layer creates a look-up table in which each row is the embedding of one word; it converts the integer sequence into a dense vector representation.
LSTM: The LSTM is an improvement over the plain RNN that is capable of capturing long-term dependencies.
Linear: The linear layer is a fully connected (dense) layer.
Pack padding: Pack padding is used to build a dynamic recurrent neural network. Without pack padding, the padded positions are also processed by the RNN, which then returns hidden states for the padded elements. Pack padding is a wrapper that hides the padded inputs: it simply ignores the padding values and returns the hidden state of the last non-padded element.
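To make pack padding concrete, here is a small standalone sketch (not from the repo) that shows pack_padded_sequence hiding the padded positions from an LSTM:
import torch
import torch.nn as nn

# toy batch of two sequences, padded with zeros to length 4 (batch_first=True)
batch = torch.tensor([[1, 2, 3, 4],
                      [5, 6, 0, 0]])          # the second sequence really has length 2
lengths = torch.tensor([4, 2])                # true lengths, sorted in descending order

embedding = nn.Embedding(10, 8)
lstm = nn.LSTM(8, 16, batch_first=True)

packed = nn.utils.rnn.pack_padded_sequence(embedding(batch), lengths, batch_first=True)
packed_output, (hidden, cell) = lstm(packed)

# hidden holds the state at each sequence's true last step, not at the padding
print(hidden.shape)   # torch.Size([1, 2, 16])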
Now let's see the code that defines the model architecture.
# Adapted from classifer.py
import torch
import torch.nn as nn

class classifer(nn.Module):
    # define all the layers used in the model
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim, n_layers,
                 bidirectional, dropout):
        super().__init__()
        # embedding layer
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        # lstm layer
        self.lstm = nn.LSTM(embedding_dim,
                            hidden_dim,
                            num_layers=n_layers,
                            bidirectional=bidirectional,
                            dropout=dropout,
                            batch_first=True)
        # dense layer (hidden_dim * 2 because the LSTM is bidirectional)
        self.fc = nn.Linear(hidden_dim * 2, output_dim)
        # activation
        self.act = nn.Sigmoid()

    def forward(self, text, text_lengths):
        # text = [batch size, sent_length]
        embedded = self.embedding(text)
        # embedded = [batch size, sent_len, emb dim]

        # packed sequence: the LSTM skips the padded positions
        packed_embedded = nn.utils.rnn.pack_padded_sequence(embedded,
                                                            text_lengths,
                                                            batch_first=True)
        packed_output, (hidden, cell) = self.lstm(packed_embedded)

        # concat the final forward and backward hidden states
        hidden = torch.cat((hidden[-2, :, :], hidden[-1, :, :]), dim=1)
        dense_outputs = self.fc(hidden)

        # final activation
        outputs = self.act(dense_outputs)
        return outputs
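To connect the model to the data pipeline, it can be instantiated roughly as follows. The hyperparameter values below are illustrative only (the repo reads its actual values from config.py), and the last two lines copy the pretrained GloVe vectors built in load_data into the embedding layer:
# illustrative hyperparameters; the repo defines its own values in config.py
size_of_vocab = len(TEXT.vocab)
embedding_dim = 100        # must match glove.6B.100d
hidden_dim = 64
output_dim = 1
n_layers = 2
bidirectional = True
dropout = 0.2

model = classifer(size_of_vocab, embedding_dim, hidden_dim, output_dim,
                  n_layers, bidirectional, dropout)

# initialize the embedding layer with the pretrained GloVe vectors
model.embedding.weight.data.copy_(TEXT.vocab.vectors)
model = model.to(config.device)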
Define Training Process¶
The following two functions are used for training and validation during the model training process. The comments in the code largely explain what each function does.
# Adapted from train.py
def train(model, iterator, optimizer, criterion):
    # initialize every epoch
    epoch_loss = 0
    epoch_acc = 0

    # set the model in training mode
    model.train()

    for batch in iterator:
        # reset the gradients after every batch
        optimizer.zero_grad()
        # retrieve the text and the number of words
        text, text_lengths = batch.text
        # convert to 1D tensor
        predictions = model(text, text_lengths).squeeze()
        # compute the loss
        loss = criterion(predictions, batch.label)
        # compute binary accuracy
        acc = binary_accuracy(predictions, batch.label)
        # backpropagate the loss and compute the gradients
        loss.backward()
        # update weights
        optimizer.step()
        # accumulate loss and accuracy
        epoch_loss += loss.item()
        epoch_acc += acc.item()

    return epoch_loss / len(iterator), epoch_acc / len(iterator)
def evaluate(model, iterator, criterion):
    # initialize every epoch
    epoch_loss = 0
    epoch_acc = 0

    # set model to eval mode
    model.eval()

    # deactivate autograd
    with torch.no_grad():
        for batch in iterator:
            # retrieve the text and the number of words
            text, text_lengths = batch.text
            # convert to 1D tensor
            predictions = model(text, text_lengths).squeeze()
            loss = criterion(predictions, batch.label)
            acc = binary_accuracy(predictions, batch.label)
            # keep track of loss and accuracy
            epoch_loss += loss.item()
            epoch_acc += acc.item()

    return epoch_loss / len(iterator), epoch_acc / len(iterator)
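Both functions rely on a binary_accuracy helper from the repo's utils.py, which is not shown above. A minimal version consistent with how it is used here (the repo's implementation may differ slightly) would be:
def binary_accuracy(preds, y):
    # round the sigmoid outputs to 0/1 and compare them with the ground-truth labels
    rounded_preds = torch.round(preds)
    correct = (rounded_preds == y).float()
    return correct.sum() / len(correct)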
We use a for loop to train the model for N_EPOCHS epochs.
# Adapted from train.py
import torch.optim as optim

optimizer = optim.Adam(model.parameters())
criterion = nn.BCELoss()

N_EPOCHS = config.N_EPOCH
best_valid_loss = float('inf')

# start training
for epoch in range(N_EPOCHS):
    # train the model
    train_loss, train_acc = train(model, train_iterator, optimizer, criterion)
    # evaluate the model
    valid_loss, valid_acc = evaluate(model, valid_iterator, criterion)

    # save the best model
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        # torch.save(model.state_dict(), './output/best_model.pt')
        torch.save(model, './output/best_model.pt')

    print("Epoch {}/{}:".format(epoch + 1, N_EPOCHS))
    print(f'\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%')
    print(f'\t Val. Loss: {valid_loss:.3f} | Val. Acc: {valid_acc*100:.2f}%')
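Note that we save the whole model object rather than just its state_dict, so restoring it later takes a single call (a sketch; map_location controls which device the tensors are loaded onto):
# restore the best model saved during training, e.g. at inference time
model = torch.load('./output/best_model.pt', map_location=config.device)
model.eval()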
Demonstration¶
Now that we have reviewed the important code blocks above, it's time to see how to actually use the code.
Train the model
The first time you run the code, it needs to download the glove.6B.100d embeddings in order to build the TEXT vocabulary. This step may take a while depending on your internet speed, since the pre-trained embeddings are about 870 MB.
The best_model will be saved in the output folder together with TEXT, which contains the vocabulary information for our training data; both will also be used for prediction.
%cd pytorch-nlp-classifier/
!python3 train.py
We obtained about 89% accuracy on the validation data in 5 epochs.
Inference¶
We'll look at a few examples to check how our model performs at identifying sincere and insincere questions.
Example 1: "Why people vote for Donald Trump?"
!python3 predict.py --question "Why people vote for Donald Trump?"
The prediction result is very close to 1, indicating that this is an insincere question. I'll take that, since questions about Trump are rarely neutral. LOL.
Example 2: "What is Quora and why should you care?"
!python3 predict.py --question "What is Quora and why should you care?"
The prediction result is about 0.24, close to zero, which means it is a sincere question. And indeed, there is not much emotion involved in this question; it's neutral.
Example 3: "What is your biggest weakness?"
!python3 predict.py --question "What is your biggest weakness?"
This is a pretty common interview question, and the result shows that it is a sincere question. These examples indicate that our trained sincerity detector works pretty well.
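Under the hood, predict.py has to reproduce the preprocessing that torchtext performed during training. The sketch below shows the core steps, assuming the trained model and the TEXT field saved to ./output/ have already been reloaded; the actual script in the repo may differ in its details.
import spacy
import torch

nlp = spacy.load("en_core_web_sm")   # a spaCy English model for tokenization

def predict_sincerity(model, TEXT, question):
    # tokenize the question the same way torchtext's "spacy" tokenizer does
    tokens = [tok.text for tok in nlp.tokenizer(question)]
    # map the tokens to the vocabulary indices built during training
    indexed = [TEXT.vocab.stoi[t] for t in tokens]
    length = torch.LongTensor([len(indexed)])
    tensor = torch.LongTensor(indexed).unsqueeze(0)   # shape [1, sent_length]
    # scores close to 1 mean insincere, scores close to 0 mean sincere
    return model(tensor, length).item()

print(predict_sincerity(model, TEXT, "Why people vote for Donald Trump?"))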
Summary¶
In this post, we used PyTorch to train an NLP sincerity detector that detects whether a question is asked sincerely or not. We walked through the complete PyTorch workflow for training the model and performing inference.