Transfer Learning in NLP - BERT as Service for Text Classification

BERT stands for Bidirectional Encoder Representations from Transformers. It is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context. As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of NLP tasks. In this sense, BERT plays a role in NLP similar to that of pre-trained models in Computer Vision.

(source: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, https://arxiv.org/pdf/1810.04805v2.pdf)

In this post, I will show you how to use BERT as a feature extractor (embedding extractor) and perform text classification on the outputs of BERT.

Setting Up the Environment

Use the following commands to install the BERT server and client, downgrade TensorFlow to a version the BERT server supports, and download and unzip the pre-trained BERT model. We are using the BERT-Base Uncased model.

!pip install bert-serving-server  # server
!pip install bert-serving-client  # client, independent of `bert-serving-server`
!pip install tensorflow-gpu==1.15.0 # downgrade tensorflow to 1.15.0

# download and unzip the BERT Model
!wget https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-12_H-768_A-12.zip  
!unzip uncased_L-12_H-768_A-12.zip

Starting the BERT Server

Run the following command to start the BERT server. If you are using Google Colab, run the second command instead, which launches the server in the background and writes its logs to out.file.

bert-serving-start -model_dir=./uncased_L-12_H-768_A-12 -num_worker=4 -max_seq_len 50
# run this instead if you are using Google Colab
!nohup bert-serving-start -model_dir=./uncased_L-12_H-768_A-12 -num_worker=4 -max_seq_len 50 > out.file 2>&1 &
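
On Colab the server runs in the background and its output goes to out.file, so you can peek at that log to check whether the model has finished loading. A minimal sketch:

# optional: inspect the server log to check start-up progress (assumes the nohup redirect above)
!tail -n 20 out.file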

The BERT server should now be running, so let's quickly test whether it is working properly.

In [1]:
from bert_serving.client import BertClient
bc = BertClient()
print(bc.encode(['We are using BERT model', 'BERT is amazing']).shape)
(2, 768)

Perfect! Both the BERT server and the client are working properly. As we can see from the output above, the BERT model converts each sentence into a 768-dimensional vector, also called an embedding. We can feed these embeddings to a classifier to get classification results.

Data Preparation

For this text classification task, we are going to use the same Twitter sentiment dataset as in our previous post. This dataset contains 20,000 labeled tweets. Use the following command to download it.

!wget https://raw.githubusercontent.com/haochen23/nlp-rnn-lstm-sentiment/master/training.1600000.processed.noemoticon.csv

Load and preview the dataset

In this data table, the 0th column holds the sentiment label and the 5th column contains the actual tweet text.

In [2]:
# import necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
import re
from keras.models import Sequential
from keras.layers import Dense, Dropout

data = pd.read_csv("training.1600000.processed.noemoticon.csv", header=None, encoding='latin-1')
print("The shape of the original dataset is {}".format(data.shape))
data.head()
Using TensorFlow backend.
The shape of the original dataset is (20000, 6)
Out[2]:
0 1 2 3 4 5
0 0 1467810369 Mon Apr 06 22:19:45 PDT 2009 NO_QUERY _TheSpecialOne_ @switchfoot http://twitpic.com/2y1zl - Awww, t...
1 0 1467810672 Mon Apr 06 22:19:49 PDT 2009 NO_QUERY scotthamilton is upset that he can't update his Facebook by ...
2 0 1467810917 Mon Apr 06 22:19:53 PDT 2009 NO_QUERY mattycus @Kenichan I dived many times for the ball. Man...
3 0 1467811184 Mon Apr 06 22:19:57 PDT 2009 NO_QUERY ElleCTF my whole body feels itchy and like its on fire
4 0 1467811193 Mon Apr 06 22:19:57 PDT 2009 NO_QUERY Karoli @nationwideclass no, it's not behaving at all....

Clean the texts

For simplicity, we use the clean_text function below to clean the text content by removing non-alphabetic characters, stripping non-ASCII characters, and converting everything to lowercase.

In [3]:
# clean text from noise
def clean_text(text):
    # filter to allow only alphabets
    text = re.sub(r'[^a-zA-Z\']', ' ', text)
    
    # remove Unicode characters
    text = re.sub(r'[^\x00-\x7F]+', '', text)
    
    # convert to lowercase to maintain consistency
    text = text.lower()
       
    return text

data['clean_text'] = data[data.columns[5]].apply(clean_text)
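
As a quick sanity check, here is what clean_text does to a tweet-like string (an illustrative example, not a row from the dataset): punctuation, digits, and URL symbols are replaced with spaces, and the text is lower-cased.

# illustrative example: digits, punctuation, and URL symbols become spaces, text is lower-cased
print(clean_text("@switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer!"))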

Split the dataset for training and testing

In [4]:
# column 0 holds the sentiment label; one-hot encode it for the softmax output
data_y = data[data.columns[0]].to_numpy()
data_y = pd.get_dummies(data_y).to_numpy()
# 70/30 train/test split on the cleaned tweets
X_tr, X_test, y_tr, y_test = train_test_split(data['clean_text'].to_numpy(), data_y, test_size=0.3, random_state=42)
print('X_tr shape:',X_tr.shape)
print('X_test shape:',X_test.shape)
X_tr shape: (14000,)
X_test shape: (6000,)

Transform the text data into BERT embeddings using BERT server and client

As mentioned before, each tweet is converted into a 768-dimensional embedding by BERT. Once the tweets are converted to embeddings, we can feed them to any type of classifier for text classification.

In [5]:
from bert_serving.client import BertClient

# make a connection with the BERT server using its IP address
bc = BertClient()
# get the embeddings for the train and test sets
X_tr_bert = bc.encode(X_tr.tolist())
X_test_bert = bc.encode(X_test.tolist())
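
Encoding all 20,000 tweets can take a while depending on your hardware, so it can be handy to cache the embeddings on disk and reload them later instead of re-encoding. A minimal sketch using NumPy (the file names here are arbitrary):

# optional: cache the embeddings so they do not need to be re-computed
np.save("X_tr_bert.npy", X_tr_bert)
np.save("X_test_bert.npy", X_test_bert)
# later: X_tr_bert = np.load("X_tr_bert.npy")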

Model Training

Now you will see how easy it is to train a classifier on top of the extracted BERT embeddings. This time, we simply use a feed-forward network with only one hidden layer as the classification head, and Dropout to combat overfitting. You can play with the model architecture.

Build and train the model

In [6]:
model = Sequential()
# hidden layer on top of the 768-dimensional BERT embeddings
model.add(Dense(100, activation='relu', input_dim=768))
# model.add(Dense(100, activation='relu'))
model.add(Dropout(0.5))
# output layer: 2 classes with softmax probabilities
model.add(Dense(2, activation='softmax'))

model.compile(loss="categorical_crossentropy", optimizer='adam', metrics=['accuracy'])
H = model.fit(X_tr_bert, y_tr, epochs=15, validation_split=0.2)

Train on 11200 samples, validate on 2800 samples
Epoch 1/15
11200/11200 [==============================] - 1s 129us/step - loss: 0.5363 - accuracy: 0.7265 - val_loss: 0.4728 - val_accuracy: 0.7825
Epoch 2/15
11200/11200 [==============================] - 1s 99us/step - loss: 0.4786 - accuracy: 0.7733 - val_loss: 0.4629 - val_accuracy: 0.7821
Epoch 3/15
11200/11200 [==============================] - 1s 99us/step - loss: 0.4605 - accuracy: 0.7807 - val_loss: 0.4553 - val_accuracy: 0.7871
Epoch 4/15
11200/11200 [==============================] - 1s 99us/step - loss: 0.4493 - accuracy: 0.7871 - val_loss: 0.4580 - val_accuracy: 0.7893
Epoch 5/15
11200/11200 [==============================] - 1s 99us/step - loss: 0.4395 - accuracy: 0.7921 - val_loss: 0.4732 - val_accuracy: 0.7764
Epoch 6/15
11200/11200 [==============================] - 1s 98us/step - loss: 0.4282 - accuracy: 0.8012 - val_loss: 0.4554 - val_accuracy: 0.7954
Epoch 7/15
11200/11200 [==============================] - 1s 102us/step - loss: 0.4193 - accuracy: 0.8066 - val_loss: 0.4570 - val_accuracy: 0.7864
Epoch 8/15
11200/11200 [==============================] - 1s 100us/step - loss: 0.4111 - accuracy: 0.8135 - val_loss: 0.4736 - val_accuracy: 0.7732
Epoch 9/15
11200/11200 [==============================] - 1s 97us/step - loss: 0.3959 - accuracy: 0.8173 - val_loss: 0.4577 - val_accuracy: 0.7950
Epoch 10/15
11200/11200 [==============================] - 1s 100us/step - loss: 0.3922 - accuracy: 0.8210 - val_loss: 0.4593 - val_accuracy: 0.8004
Epoch 11/15
11200/11200 [==============================] - 1s 99us/step - loss: 0.3827 - accuracy: 0.8316 - val_loss: 0.4633 - val_accuracy: 0.7971
Epoch 12/15
11200/11200 [==============================] - 1s 96us/step - loss: 0.3679 - accuracy: 0.8387 - val_loss: 0.4573 - val_accuracy: 0.8021
Epoch 13/15
11200/11200 [==============================] - 1s 100us/step - loss: 0.3579 - accuracy: 0.8393 - val_loss: 0.4613 - val_accuracy: 0.7889
Epoch 14/15
11200/11200 [==============================] - 1s 97us/step - loss: 0.3574 - accuracy: 0.8442 - val_loss: 0.4714 - val_accuracy: 0.7989
Epoch 15/15
11200/11200 [==============================] - 1s 97us/step - loss: 0.3423 - accuracy: 0.8535 - val_loss: 0.4833 - val_accuracy: 0.7943

Model Evaluation

Now, let's plot the training and validation loss, and the training and validation accuracy, over the course of training.

In [7]:
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use("ggplot")
N = 15
plt.plot(np.arange(0, N), H.history["loss"], label="train_loss")
plt.plot(np.arange(0, N), H.history["val_loss"], label="val_loss")
plt.plot(np.arange(0, N), H.history["accuracy"], label="train_acc")
plt.plot(np.arange(0, N), H.history["val_accuracy"], label="val_acc")
plt.title("Training Loss and Accuracy")
plt.xlabel("Epoch #")
plt.ylabel("Loss/Accuracy")
plt.legend(loc="upper left")
plt.show()

Using the BERT-extracted embeddings, we obtained results quite similar to those in our previous post, where we tried different types of Recurrent Neural Networks, namely SimpleRNN, LSTM, and GRU. You can refer to that post for comparison.

Evaluate on test set

In [8]:
predictions = model.predict(X_test_bert)
predictions = predictions.argmax(axis=1)
print(classification_report(y_test.argmax(axis=1), predictions))
              precision    recall  f1-score   support

           0       0.78      0.79      0.79      3016
           1       0.79      0.78      0.78      2984

    accuracy                           0.79      6000
   macro avg       0.79      0.78      0.78      6000
weighted avg       0.79      0.79      0.78      6000

We obtained an accuracy of 0.79 on the test set, which is consistent with the training and validation accuracy, so the model seems to generalize pretty well. Although the accuracy is not very high, you can try different architectures to improve it. Besides neural networks, you can also train classical machine learning algorithms on the extracted embeddings, for example a Logistic Regression classifier or an SVM classifier, as sketched below.
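
For instance, here is a minimal sketch of a Logistic Regression classifier trained on the same embeddings (note that scikit-learn expects integer class labels rather than the one-hot vectors used above):

from sklearn.linear_model import LogisticRegression

# train a logistic regression on the BERT embeddings; labels are converted from one-hot to integers
lr = LogisticRegression(max_iter=1000)
lr.fit(X_tr_bert, y_tr.argmax(axis=1))
lr_preds = lr.predict(X_test_bert)
print(classification_report(y_test.argmax(axis=1), lr_preds))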

Summary

In this post, we've seen how to perform transfer learning on an NLP task. We used bert-as-service with the BERT-Base model as an embedding extractor, and fed the resulting embeddings into a classification head to do text classification. The overall workflow is quite easy to follow.


