NLP with State-of-the-Art Language Models

In this post, we'll see how to use state-of-the-art language models to perform downstream NLP tasks with Transformers.

Transformers (previously known as pytorch-transformers) provides state-of-the-art general-purpose architectures (BERT, GPT-2, RoBERTa, XLM, DistilBert, XLNet, T5, CTRL...) for Natural Laguage Processing. Transformers currently support 19 primary architectures with variations in model depth and size. These models are all based on the Transformer structure. Details about transformer can be found in this paper - Attention Is All You Need. There are thousands of pretained models including community models available in transformers.

Leveraging State-of-the-Art Language Models on NLP tasks

Install Transformers

!pip install transformers

Transformers Pipeline API

Transformers' pipeline() method provides a high-level, easy to use, API for doing inference over a variety of downstream-tasks, including:

  • Sentence Classification (Sentiment Analysis): Indicate if the overall sentence is either positive or negative, i.e. binary classification task or logitic regression task.
  • Token Classification (Named Entity Recognition, Part-of-Speech tagging): For each sub-entities (tokens) in the input, assign them a label, i.e. classification task.
  • Question-Answering: Provided a tuple (question, context) the model should find the span of text in content answering the question.
  • Mask-Filling: Suggests possible word(s) to fill the masked input with respect to the provided context.
  • Summarization: Summarizes the input article to a shorter article.
  • Translation: Translates the input from a language to another language.
  • Feature Extraction: Maps the input to a higher, multi-dimensional space learned from the data.

Pipelines encapsulate the overall process of every NLP process:

  1. Tokenization: Split the initial input into multiple sub-entities with ... properties (i.e. tokens).
  2. Inference: Maps every tokens into a more meaningful representation.
  3. Decoding: Use the above representation to generate and/or extract the final output for the underlying task.

The pipeline() method can be used in three ways:

  1. Using the default model and tokenizer by only specifying the task name.
    from transformers import pipeline
    pipeline("<task-name>")
    
  2. Using user defined model by specifying task -name and model-name
    pipeline("<task-name>", model="<model_name>")
    
  3. Using user-defined model and tokenizer
    pipeline('<task-name>', model='<model name>', tokenizer='<tokenizer_name>')
    

Usually the defualt models work pretty well on specific tasks, you can also use your own model by providing your model path to the model parameter.

Note: Not all tasks are supported by every pretrained model in transformers.For example, Summarization task is only supported by bart and t5 models. You can go to this page to check which models support a specific task.

In [1]:
from transformers import pipeline

Usage Examples

I will only present you with the default models for each task, so you have an idea whether to use the default model and tokenizer or choose another one based on your requirements. Don't forget to check models supportability to the NLP tasks in this page

1. Sentence Classification - Sentiment Analysis

The default model for Sentiment Analysis is DistilBERT uncased version - a smaller, faster version of BERT.

In [2]:
nlp_sentiment = pipeline('sentiment-analysis')
nlp_sentiment('What a game for Kobe Bryant!')

Out[2]:
[{'label': 'POSITIVE', 'score': 0.9981902837753296}]

2. Named Entity Recognition

The default model for Name Entity Recognition is Bert (bert-large-cased).

In [3]:
nlp_ner = pipeline('ner')
nlp_ner('What a game for Kobe Bryant !')

Out[3]:
[{'entity': 'I-PER', 'index': 5, 'score': 0.9990301132202148, 'word': 'Kobe'},
 {'entity': 'I-PER',
  'index': 6,
  'score': 0.9992470741271973,
  'word': 'Bryant'}]

3. Question Answering

The default model for Question Answering is DistilBERT and it's using bert-base-cased tokenizer.

In [4]:
nlp_qa = pipeline('question-answering')
nlp_qa(context='Kobe Bryant was an American professional basketball player.', question='Who is Kobe Bryant ?')

Out[4]:
{'answer': 'an American professional basketball player.',
 'end': 58,
 'score': 0.6634031543230492,
 'start': 16}

4. Text Generation - Mask Prediction

The default model for Text Generation is DistilRoBERTa.

In [5]:
nlp_fill_mask = pipeline('fill-mask')
nlp_fill_mask('Kobe Bryant was an American professional basketball' + nlp_fill_mask.tokenizer.mask_token)

Out[5]:
[{'score': 0.5135236978530884,
  'sequence': '<s> Kobe Bryant was an American professional basketball player</s>',
  'token': 869},
 {'score': 0.13336943089962006,
  'sequence': '<s> Kobe Bryant was an American professional basketball legend</s>',
  'token': 7875},
 {'score': 0.10051079839468002,
  'sequence': '<s> Kobe Bryant was an American professional basketball coach</s>',
  'token': 704},
 {'score': 0.07933259010314941,
  'sequence': '<s> Kobe Bryant was an American professional basketball star</s>',
  'token': 999},
 {'score': 0.05176172032952309,
  'sequence': '<s> Kobe Bryant was an American professional basketball superstar</s>',
  'token': 10896}]

5. Text Summarization

As mentioned earlies, Summarization is currently supported by Bart and T5. And the default model is bart-large-cnn.

In [6]:
TEXT_TO_SUMMARIZE = """ 
Kobe Bean Bryant was an American professional basketball player. 
As a shooting guard, Bryant entered the National Basketball Association (NBA) directly from high school, and played his entire 20-season professional career in the league with the Los Angeles Lakers. 
Bryant won many accolades: five NBA championships, 18-time All-Star, 15-time member of the All-NBA Team, 12-time member of the All-Defensive Team, 2008 NBA Most Valuable Player (MVP), two-time NBA Finals MVP winner. 
Widely regarded as one of the greatest players of all time, he led the NBA in scoring during two seasons, ranks fourth on the league's all-time regular season scoring and all-time postseason scoring lists.

Bryant was the son of former NBA player Joe Bryant. 
He attended Lower Merion High School in Pennsylvania, where he was recognized as the top high-school basketball player in the country. 
Upon graduation, he declared for the 1996 NBA draft and was selected by the Charlotte Hornets with the 13th overall pick; the Hornets then traded him to the Lakers. 
As a rookie, Bryant earned himself a reputation as a high-flyer and a fan favorite by winning the 1997 Slam Dunk Contest, and he was named an All-Star by his second season. 
Despite a feud with teammate Shaquille O'Neal, the pair led the Lakers to three consecutive NBA championships from 2000 to 2002. 
In 2003, Bryant was accused of sexual assault by a 19-year-old hotel clerk. 
Criminal charges were brought and then dropped after the accuser refused to testify, with a civil suit later settled out of court. 
Bryant denied the assault charge, but admitted to a sexual encounter and issued a public apology.
"""

nlp_summarizer = pipeline('summarization')
nlp_summarizer(TEXT_TO_SUMMARIZE, max_length=30)
Out[6]:
[{'summary_text': 'Kobe Bean Bryant was an American professional basketball player. He played for the Los Angeles Lakers for his entire 20-season professional career.'}]

6. Translation

Translation is currently supported by T5 for the language mappings English-to-French (translation_en_to_fr), English-to-German (translation_en_to_de) and English-to-Romanian (translation_en_to_ro).

In [7]:
# English to French
translator = pipeline('translation_en_to_fr')
translator("Kobe Bean Bryant was an American professional basketball player. As a shooting guard, Bryant entered the National Basketball Association (NBA) directly from high school, and played his entire 20-season professional career in the league with the Los Angeles Lakers. ")

Out[7]:
[{'translation_text': 'Kobe Bean Bryant est un joueur de basketball professionnel américain qui, en tant que gardien de tir, est entré dans la National Basketball Association (NBA) dès son école secondaire et a joué toute sa carrière professionnelle pendant 20 saisons avec les Los Angeles Lakers.'}]
In [8]:
# English to German
translator = pipeline('translation_en_to_de')
translator("Kobe Bean Bryant was an American professional basketball player. As a shooting guard, Bryant entered the National Basketball Association (NBA) directly from high school, and played his entire 20-season professional career in the league with the Los Angeles Lakers. ")

Out[8]:
[{'translation_text': 'Kobe Bean Bryant ist ein US-amerikanischer Basketballspieler.'}]

7. Text Generation

Text generation is currently supported by GPT-2, OpenAi-GPT, TransfoXL, XLNet, CTRL and Reformer. And the default model is GPT-2.

In [9]:
text_generator = pipeline("text-generation")
text_generator("It's a sunny day. Let's ")

Setting `pad_token_id` to 50256 (first `eos_token_id`) to generate sequence
Out[9]:
[{'generated_text': "It's a sunny day. Let's  go get some tea. I want to make these as quickly as possible because of your advice. You know how it is sometimes difficult to create a new relationship. Don't be afraid to talk and to talk."}]

8. Features Extraction

Feature Extraction outputs' a 3D tensor (Samples, Tokens, Embeddings for each token). These embeddings can then be used as input features to other models, e.g. to a classifer for sentiment analysis.

In [10]:
import numpy as np
nlp_features = pipeline('feature-extraction')
output = nlp_features('Kobe Bryant was an American professional basketball player.')
np.array(output).shape   

Out[10]:
(1, 11, 768)

Summary

In this post, we've seen how to use Transfomers pipeline API to perform various NLP downstream tasks. It's really a cool thing that we can leverage State-of-the-Art language models with only one or two lines of code. Transformers makes language models easy to use for everyone.



Comments

comments powered by Disqus