NLP with State-of-the-Art Language Models¶
In this post, we'll see how to use state-of-the-art language models to perform downstream NLP tasks with Transformers.
Transformers (previously known as pytorch-transformers) provides state-of-the-art general-purpose architectures (BERT, GPT-2, RoBERTa, XLM, DistilBERT, XLNet, T5, CTRL...) for Natural Language Processing. Transformers currently supports 19 primary architectures with variations in model depth and size, all based on the Transformer structure. Details about the Transformer can be found in the paper Attention Is All You Need. There are thousands of pretrained models, including community models, available in Transformers.
Leveraging State-of-the-Art Language Models on NLP tasks¶
Install Transformers
!pip install transformers
Transformers Pipeline API
Transformers' pipeline() method provides a high-level, easy-to-use API for doing inference over a variety of downstream tasks, including:
- Sentence Classification (Sentiment Analysis): Indicates whether the overall sentence is positive or negative, i.e. a binary classification (logistic regression) task.
- Token Classification (Named Entity Recognition, Part-of-Speech tagging): Assigns a label to each sub-entity (token) in the input, i.e. a classification task.
- Question-Answering: Given a (`question`, `context`) tuple, the model finds the span of text in `context` that answers the `question`.
- Mask-Filling: Suggests possible word(s) to fill the masked input with respect to the provided `context`.
- Summarization: Summarizes the input article into a shorter article.
- Translation: Translates the input from one language to another.
- Feature Extraction: Maps the input to a higher-dimensional representation learned from the data.
Pipelines encapsulate the overall workflow of every NLP task:
- Tokenization: Splits the initial input into multiple sub-entities (i.e. tokens).
- Inference: Maps every token to a more meaningful representation.
- Decoding: Uses the above representations to generate and/or extract the final output for the underlying task.
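To make these three stages concrete, here is a minimal sketch of what a sentiment-analysis pipeline roughly does under the hood. The checkpoint name below is an assumption based on the pipeline's default at the time of writing:

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Assumed checkpoint - the sentiment pipeline's default at the time of writing
model_name = 'distilbert-base-uncased-finetuned-sst-2-english'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# 1. Tokenization: split the text into tokens and map them to input ids
inputs = tokenizer('What a game for Kobe Bryant!', return_tensors='pt')

# 2. Inference: map the token ids to per-class logits
with torch.no_grad():
    logits = model(**inputs)[0]

# 3. Decoding: turn the logits into the final label
label_id = int(logits.argmax(dim=-1))
print(model.config.id2label[label_id])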
The `pipeline()` method can be used in three ways:
- Using the default model and tokenizer by only specifying the task name:
from transformers import pipeline
pipeline("<task-name>")
- Using a user-defined model by specifying the task name and model name:
pipeline("<task-name>", model="<model_name>")
- Using a user-defined model and tokenizer:
pipeline('<task-name>', model='<model name>', tokenizer='<tokenizer_name>')
Usually the default models work pretty well on specific tasks. You can also use your own model by providing your model path to the `model` parameter.
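For example, the following builds a sentiment-analysis pipeline with an explicitly chosen checkpoint (any other compatible checkpoint, or a local path, would work the same way):

nlp = pipeline('sentiment-analysis',
               model='distilbert-base-uncased-finetuned-sst-2-english',
               tokenizer='distilbert-base-uncased-finetuned-sst-2-english')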
Note: Not all tasks are supported by every pretrained model in Transformers. For example, the Summarization task is only supported by `bart` and `t5` models. You can go to this page to check which models support a specific task.
from transformers import pipeline
Usage Examples¶
I will only present the default models for each task, so you have an idea of whether to use the default model and tokenizer or choose another one based on your requirements. Don't forget to check which models support which NLP tasks on this page.
1. Sentence Classification - Sentiment Analysis¶
The default model for Sentiment Analysis is an uncased version of DistilBERT - a smaller, faster version of BERT.
nlp_sentiment = pipeline('sentiment-analysis')
nlp_sentiment('What a game for Kobe Bryant!')
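The pipeline returns a list with one dictionary per input, each holding the predicted label and a confidence score:

result = nlp_sentiment('What a game for Kobe Bryant!')[0]
print(result['label'], round(result['score'], 4))  # e.g. POSITIVE with a score close to 1.0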
2. Named Entity Recognition¶
The default model for Named Entity Recognition is BERT (`bert-large-cased`).
nlp_ner = pipeline('ner')
nlp_ner('What a game for Kobe Bryant !')
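The output is a list of recognized tokens, each with its predicted entity label and a confidence score:

for token in nlp_ner('What a game for Kobe Bryant !'):
    print(token['word'], token['entity'], round(token['score'], 4))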
3. Question Answering¶
The default model for Question Answering is DistilBERT, and it uses the `bert-base-cased` tokenizer.
nlp_qa = pipeline('question-answering')
nlp_qa(context='Kobe Bryant was an American professional basketball player.', question='Who is Kobe Bryant ?')
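The result is a dictionary containing the extracted answer, a confidence score, and the character offsets of the answer span within the context:

answer = nlp_qa(context='Kobe Bryant was an American professional basketball player.',
                question='Who is Kobe Bryant ?')
print(answer['answer'], round(answer['score'], 4), answer['start'], answer['end'])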
4. Text Generation - Mask Prediction¶
The default model for Mask Prediction is DistilRoBERTa.
nlp_fill_mask = pipeline('fill-mask')
nlp_fill_mask('Kobe Bryant was an American professional basketball' + nlp_fill_mask.tokenizer.mask_token)
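The pipeline returns the top candidate completions, each with the filled-in sequence and its probability score:

for prediction in nlp_fill_mask('Kobe Bryant was an American professional basketball'
                                + nlp_fill_mask.tokenizer.mask_token):
    print(round(prediction['score'], 4), prediction['sequence'])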
5. Text Summarization¶
As mentioned earlier, Summarization is currently supported by `Bart` and `T5`, and the default model is `bart-large-cnn`.
TEXT_TO_SUMMARIZE = """
Kobe Bean Bryant was an American professional basketball player.
As a shooting guard, Bryant entered the National Basketball Association (NBA) directly from high school, and played his entire 20-season professional career in the league with the Los Angeles Lakers.
Bryant won many accolades: five NBA championships, 18-time All-Star, 15-time member of the All-NBA Team, 12-time member of the All-Defensive Team, 2008 NBA Most Valuable Player (MVP), two-time NBA Finals MVP winner.
Widely regarded as one of the greatest players of all time, he led the NBA in scoring during two seasons, ranks fourth on the league's all-time regular season scoring and all-time postseason scoring lists.
Bryant was the son of former NBA player Joe Bryant.
He attended Lower Merion High School in Pennsylvania, where he was recognized as the top high-school basketball player in the country.
Upon graduation, he declared for the 1996 NBA draft and was selected by the Charlotte Hornets with the 13th overall pick; the Hornets then traded him to the Lakers.
As a rookie, Bryant earned himself a reputation as a high-flyer and a fan favorite by winning the 1997 Slam Dunk Contest, and he was named an All-Star by his second season.
Despite a feud with teammate Shaquille O'Neal, the pair led the Lakers to three consecutive NBA championships from 2000 to 2002.
In 2003, Bryant was accused of sexual assault by a 19-year-old hotel clerk.
Criminal charges were brought and then dropped after the accuser refused to testify, with a civil suit later settled out of court.
Bryant denied the assault charge, but admitted to a sexual encounter and issued a public apology.
"""
nlp_summarizer = pipeline('summarization')
nlp_summarizer(TEXT_TO_SUMMARIZE, max_length=30)
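You can steer the summary length with generation parameters such as `min_length` and `max_length`; the output is a list of dictionaries with a `summary_text` key:

summary = nlp_summarizer(TEXT_TO_SUMMARIZE, min_length=10, max_length=50)
print(summary[0]['summary_text'])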
6. Translation¶
Translation is currently supported by `T5` for the language mappings English-to-French (`translation_en_to_fr`), English-to-German (`translation_en_to_de`) and English-to-Romanian (`translation_en_to_ro`).
# English to French
translator = pipeline('translation_en_to_fr')
translator("Kobe Bean Bryant was an American professional basketball player. As a shooting guard, Bryant entered the National Basketball Association (NBA) directly from high school, and played his entire 20-season professional career in the league with the Los Angeles Lakers. ")
# English to German
translator = pipeline('translation_en_to_de')
translator("Kobe Bean Bryant was an American professional basketball player. As a shooting guard, Bryant entered the National Basketball Association (NBA) directly from high school, and played his entire 20-season professional career in the league with the Los Angeles Lakers. ")
7. Text Generation¶
Text generation is currently supported by GPT-2, OpenAI GPT, Transformer-XL, XLNet, CTRL and Reformer, and the default model is GPT-2.
text_generator = pipeline("text-generation")
text_generator("It's a sunny day. Let's ")
8. Feature Extraction¶
Feature Extraction outputs a 3D tensor (samples, tokens, embedding size for each token). These embeddings can then be used as input features to other models, e.g. a classifier for sentiment analysis.
import numpy as np
nlp_features = pipeline('feature-extraction')
output = nlp_features('Kobe Bryant was an American professional basketball player.')
np.array(output).shape
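A common way to turn the per-token embeddings into a single sentence vector is to mean-pool over the token axis; a minimal sketch:

embeddings = np.array(output)                 # shape: (samples, tokens, hidden_size)
sentence_vector = embeddings.mean(axis=1)[0]  # average over the token axis
print(sentence_vector.shape)                  # (hidden_size,), e.g. (768,)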
Summary¶
In this post, we've seen how to use the Transformers pipeline API to perform various downstream NLP tasks. It's really cool that we can leverage state-of-the-art language models with only one or two lines of code. Transformers makes language models easy to use for everyone.