Transfer Learning in NLP - BERT as Service for Text Classification¶
BERT stands for Bidirectional Encoder Representations from Transformers. It is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context. As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of NLP tasks. In that sense, BERT plays a role in NLP similar to that of pre-trained backbone models in Computer Vision.
(source: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, https://arxiv.org/pdf/1810.04805v2.pdf)
In this post, I will show you how to use BERT as a feature extractor (embedding extractor) and perform text classification on the outputs of BERT.
Setting Up the Environment¶
Use the following code to install the BERT server and client, downgrade TensorFlow (the BERT server requires TensorFlow 1.x), and download and unzip the pre-trained BERT model. We are using the BERT-Base Uncased model.
!pip install bert-serving-server # server
!pip install bert-serving-client # client, independent of `bert-serving-server`
!pip install tensorflow-gpu==1.15.0 # downgrade tensorflow to 1.15.0
# download and unzip the BERT Model
!wget https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-12_H-768_A-12.zip
!unzip uncased_L-12_H-768_A-12.zip
Starting the BERT Server¶
Run the following command to start the BERT server. If you are using Google Colab, run the second command instead, which launches the server in the background.
bert-serving-start -model_dir=./uncased_L-12_H-768_A-12 -num_worker=4 -max_seq_len 50
#run this if use Colab
!nohup bert-serving-start -model_dir=./uncased_L-12_H-768_A-12 -num_worker=4 -max_seq_len 50 > out.file 2>&1 &
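On Colab the server loads the model in the background, so give it a minute or two before connecting. You can peek at the log file to check its progress (a small convenience check; the exact log messages depend on your bert-serving-server version):
# check the server log to see whether the model has finished loading
!tail -n 5 out.file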
Once the server is up, we can quickly test whether it is working properly.
from bert_serving.client import BertClient
bc = BertClient()
print(bc.encode(['We are using BERT model', 'BERT is amazing']).shape)
Perfect! Both the BERT server and client are working properly. As we can see from the output above, the BERT model converts each sentence into a 768-dimensional vector, also called an embedding. We can feed these embeddings to a classifier to get classification results.
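Since each sentence is now a fixed-length vector, we can also compare sentences directly. As a quick illustration (not part of the classification pipeline below), here is the cosine similarity between the two test sentences:
import numpy as np
# encode the two test sentences and compute the cosine similarity of their embeddings
emb = bc.encode(['We are using BERT model', 'BERT is amazing'])
cos_sim = np.dot(emb[0], emb[1]) / (np.linalg.norm(emb[0]) * np.linalg.norm(emb[1]))
print("Cosine similarity:", cos_sim)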
Data Preparation¶
For this text classification task, we are going to use the Twitter sentiment dataset from our previous post. The dataset contains 20000 labeled tweets. Use the following code to download it.
!wget https://raw.githubusercontent.com/haochen23/nlp-rnn-lstm-sentiment/master/training.1600000.processed.noemoticon.csv
Load and preview the dataset
In this data table, the 0th column holds the sentiment label, and the 5th column contains the actual tweet text.
# import necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
import re
from keras.models import Sequential
from keras.layers import Dense, Dropout
data = pd.read_csv("training.1600000.processed.noemoticon.csv", header=None, encoding='latin-1')
print("The shape of the original dataset is {}".format(data.shape))
data.head()
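Before cleaning the text, it is worth checking the label distribution. In the original Sentiment140 coding, 0 denotes a negative tweet and 4 a positive one; verify this against your copy of the data:
# count how many tweets fall into each sentiment class
print(data[0].value_counts())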
Clean the texts
For simplicity, we use the clean_text function to briefly clean the text content by removing non-alphabetic and non-ASCII characters and converting everything to lowercase.
# clean text from noise
def clean_text(text):
# filter to allow only alphabets
text = re.sub(r'[^a-zA-Z\']', ' ', text)
# remove any remaining non-ASCII characters
text = re.sub(r'[^\x00-\x7F]+', '', text)
# convert to lowercase to maintain consistency
text = text.lower()
return text
data['clean_text'] = data[data.columns[5]].apply(clean_text)
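A quick before/after check on one tweet shows what the cleaning step does:
# compare a raw tweet with its cleaned version
sample = data[data.columns[5]].iloc[0]
print("raw:  ", sample)
print("clean:", clean_text(sample))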
Split the dataset for training and testing
data_y = data[data.columns[0]].to_numpy()
data_y = pd.get_dummies(data_y).to_numpy()
X_tr, X_test, y_tr, y_test = train_test_split(data['clean_text'].to_numpy(), data_y, test_size = 0.3, random_state=42)
print('X_tr shape:',X_tr.shape)
print('X_test shape:',X_test.shape)
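The labels are one-hot encoded by pd.get_dummies, so each row of y_tr has one column per sentiment class; this is why the classification head below ends in a 2-unit softmax layer. A quick look at the encoded labels:
# inspect the one-hot encoded labels
print('y_tr shape:', y_tr.shape)
print(y_tr[:3])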
Transform the text data into BERT embeddings using BERT server and client
As mentioned before, BERT converts each tweet into a 768-dimensional embedding. After converting the tweets to embeddings, we can feed them to any type of classifier for text classification.
from bert_serving.client import BertClient
# make a connection with the BERT server using its IP address
bc = BertClient()
# get the embedding for train and val sets
X_tr_bert = bc.encode(X_tr.tolist())
X_test_bert = bc.encode(X_test.tolist())
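Encoding tens of thousands of tweets can take a while, so it may be worth caching the embeddings to disk (an optional convenience step, not required for the rest of the post):
# save the embeddings so they can be reloaded later without re-encoding
np.save('X_tr_bert.npy', X_tr_bert)
np.save('X_test_bert.npy', X_test_bert)
# later: X_tr_bert = np.load('X_tr_bert.npy')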
Model Training (Classifier on BERT Embeddings)¶
Now you will see how easy it is to build a classifier on top of the extracted BERT embeddings (the BERT weights themselves stay frozen). This time, we simply use a feed-forward network with only one hidden layer as the classification head, and add Dropout to combat overfitting. You can play with the model architecture.
Build and train the model
model = Sequential()
model.add(Dense(100, activation='relu', input_dim=768))
# model.add(Dense(100, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(2, activation='softmax'))
model.compile(loss="categorical_crossentropy", optimizer='adam', metrics=['accuracy'])
H = model.fit(X_tr_bert, y_tr, epochs=15, validation_split=0.2)
Model Evaluation¶
Now, let's plot the training and validation loss, and the training and validation accuracy, over the course of training.
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use("ggplot")
N = 15
plt.plot(np.arange(0, N), H.history["loss"], label="train_loss")
plt.plot(np.arange(0, N), H.history["val_loss"], label="val_loss")
plt.plot(np.arange(0, N), H.history["accuracy"], label="train_acc")
plt.plot(np.arange(0, N), H.history["val_accuracy"], label="val_acc")
plt.title("Training Loss and Accuracy")
plt.xlabel("Epoch #")
plt.ylabel("Loss/Accuracy")
plt.legend(loc="upper left")
plt.show()
Using the BERT-extracted embeddings, we obtained results pretty similar to those in our previous post, where we tried different types of Recurrent Neural Networks, namely SimpleRNN, LSTM, and GRU. You can refer to that post for comparison.
Evaluate on test set
predictions = model.predict(X_test_bert)
predictions = predictions.argmax(axis=1)
print(classification_report(y_test.argmax(axis=1), predictions))
We obtained an accuracy of 0.79 on the test set, which is consistent with the training and validation accuracy. It seems our model generalizes pretty well. Although the accuracy is not very high, you can try different architectures to improve it. Besides neural networks, you can also try classic machine learning algorithms on the extracted embeddings, for example a Logistic Regression classifier or an SVM classifier, as shown below.
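As a minimal sketch of that suggestion, here is a scikit-learn Logistic Regression classifier trained on the same BERT embeddings (hyperparameters are illustrative, not tuned):
from sklearn.linear_model import LogisticRegression
# train a logistic regression classifier on the BERT embeddings;
# the one-hot labels are converted back to class indices with argmax
log_reg = LogisticRegression(max_iter=1000)
log_reg.fit(X_tr_bert, y_tr.argmax(axis=1))
lr_predictions = log_reg.predict(X_test_bert)
print(classification_report(y_test.argmax(axis=1), lr_predictions))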
Summary¶
In this post, we've seen how to perform transfer learning on NLP tasks. We used BERT-as-Service with the BERT-Base Uncased model as an embedding extractor. The resulting embeddings can then be fed into a classification head to do text classification. The overall workflow is quite easy to follow.