
Basic Use of the Transformers Library

2022-06-25 01:31:00 Empty cup realm

This article introduces the basic use of the Transformers library.

1.1 Introduction to the Transformers library

Transformers is an open-source library; all the pre-trained models it provides are based on the Transformer architecture.

1.1.1 Transformers library

With the APIs provided by the Transformers library, we can easily download and train state-of-the-art pre-trained models. Using pre-trained models reduces computation cost and saves the time of training a model from scratch. These models cover tasks in different modalities, for example:

  • Text: text classification, information extraction, question answering, text summarization, machine translation, and text generation.
  • Images: image classification, object detection, and image segmentation.
  • Audio: speech recognition and audio classification.
  • Multimodal: table question answering, OCR, information extraction from scanned documents, video classification, and visual question answering.

The Transformers library supports the three most popular deep learning frameworks (PyTorch, TensorFlow, and JAX).

The relevant resources are listed below:

GitHub repository: https://github.com/huggingface/transformers
Official documentation: https://huggingface.co/docs/transformers/index
Pre-trained model hub: https://huggingface.co/models

1.1.2 Models and frameworks supported by the Transformers library

The Transformers library currently supports the models listed below. For each model, the official documentation indicates whether a slow tokenizer, a fast tokenizer, PyTorch, TensorFlow, and Flax are supported:

ALBERT
BART
BEiT
BERT
Bert Generation
BigBird
BigBirdPegasus
Blenderbot
BlenderbotSmall
CamemBERT
Canine
CLIP
ConvBERT
ConvNext
CTRL
Data2VecAudio
Data2VecText
Data2VecVision
DeBERTa
DeBERTa-v2
Decision Transformer
DeiT
DETR
DistilBERT
DPR
DPT
ELECTRA
Encoder decoder
FairSeq Machine-Translation
FlauBERT
Flava
FNet
Funnel Transformer
GLPN
GPT Neo
GPT-J
Hubert
I-BERT
ImageGPT
LayoutLM
LayoutLMv2
LED
Longformer
LUKE
LXMERT
M2M100
Marian
MaskFormer
mBART
MegatronBert
MobileBERT
MPNet
mT5
Nystromformer
OpenAI GPT
OpenAI GPT-2
OPT
Pegasus
Perceiver
PLBart
PoolFormer
ProphetNet
QDQBert
RAG
Realm
Reformer
RegNet
RemBERT
ResNet
RetriBERT
RoBERTa
RoFormer
SegFormer
SEW
SEW-D
Speech Encoder decoder
Speech2Text
Speech2Text2
Splinter
SqueezeBERT
Swin
T5
TAPAS
TAPEX
Transformer-XL
TrOCR
UniSpeech
UniSpeechSat
VAN
ViLT
Vision Encoder decoder
VisionTextDualEncoder
VisualBert
ViT
ViTMAE
Wav2Vec2
WavLM
XGLM
XLM
XLM-RoBERTa
XLM-RoBERTa-XL
XLMProphetNet
XLNet
YOLOS

Note: a slow tokenizer (Tokenizer slow) implements tokenization in pure Python, while a fast tokenizer (Tokenizer fast) is based on the Rust library Tokenizers.
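
For example, whether a fast or a slow tokenizer is used can be controlled with the use_fast argument of AutoTokenizer.from_pretrained() and checked with the is_fast attribute. A minimal sketch ("bert-base-chinese" is only an example checkpoint and is downloaded on first use):

from transformers import AutoTokenizer

fast_tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese", use_fast=True)
slow_tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese", use_fast=False)

print(fast_tokenizer.is_fast)  # True: backed by the Rust Tokenizers library
print(slow_tokenizer.is_fast)  # False: pure Python implementation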


1.2 Pipeline

The pipeline() function performs inference with pre-trained models, and it supports all models that can be downloaded from the model hub (https://huggingface.co/models).

1.2.1 Task types supported by pipeline()

pipeline() supports many common tasks:

  • Text
    • Sentiment analysis
    • Text generation
    • Named entity recognition (NER)
    • Question answering
    • Fill-mask
    • Summarization
    • Translation
    • Feature extraction
  • Images
    • Image classification
    • Image segmentation
    • Object detection
  • Audio
    • Audio classification
    • Automatic speech recognition (ASR)

Note: the tasks supported by pipeline() can be found in the Transformers source code (see the SUPPORTED_TASKS definition in transformers/pipelines/__init__.py); different versions support different task types.
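For instance, in recent versions the task names can be listed directly from that definition; this is a minimal sketch that relies on an internal, version-dependent import rather than a public API:

# A sketch: list the pipeline task names supported by the installed version.
# SUPPORTED_TASKS is an internal definition and may move between versions.
from transformers.pipelines import SUPPORTED_TASKS

for task_name in SUPPORTED_TASKS:
    print(task_name)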


1.2.2 Using pipeline()

(1) Basic usage

For example, suppose we need to run a sentiment analysis inference task. We can directly use the following code:

from transformers import pipeline


classifier = pipeline("sentiment-analysis")
result = classifier("We are very happy to show you the 🤗 Transformers library.")
print(result)

The output is as follows:

[{'label': 'POSITIVE', 'score': 0.9997795224189758}]

In the code above, pipeline("sentiment-analysis") downloads and caches a default pre-trained model for sentiment analysis, together with the corresponding tokenizer. For other task types, the corresponding names are documented under the task parameter of pipeline() in the official documentation; the default pre-trained model for each task type can be found in the Transformers source code (see the SUPPORTED_TASKS definition in transformers/pipelines/__init__.py).

When we need to run inference on more than one sentence at a time, we can pass them in as a list:

from transformers import pipeline

classifier = pipeline("sentiment-analysis")
results = classifier(["We are very happy to show you the 🤗 Transformers library.",
                      "We hope you don't hate it."])
print(results)

The output is as follows:

[{'label': 'POSITIVE', 'score': 0.9997795224189758},
 {'label': 'NEGATIVE', 'score': 0.5308570265769958}]

(2) Choosing a model

The examples above used the default model for the task during inference. Sometimes we want to use a specific model instead; this can be done through the model parameter of pipeline().

The first method:

from transformers import pipeline


classifier = pipeline("sentiment-analysis",
                      model="IDEA-CCNL/Erlangshen-Roberta-110M-Sentiment")
result = classifier(" I am in a good mood today ")
print(result)

The output is as follows:

[{'label': 'Positive', 'score': 0.9374911785125732}]

The second method (it loads the same model as the method above, but it also allows a local model to be used for inference):

from transformers import AutoModelForSequenceClassification
from transformers import AutoTokenizer
from transformers import pipeline


model_path = r"../pretrained_model/IDEA-CCNL(Erlangshen-Roberta-110M-Sentiment)"
model = AutoModelForSequenceClassification.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

classifier = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer)
result = classifier(" I am in a good mood today ")
print(result)

The output is as follows:

[{'label': 'Positive', 'score': 0.9374911785125732}]

Summary: this section described how to run inference for a text classification task with pipeline(). For other text tasks, as well as image and audio tasks, the usage is essentially the same; see the official documentation for details.
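
Other task types work the same way, only the task name changes. A sketch with the fill-mask task (the default model is downloaded automatically; the mask token is read from the tokenizer because it differs between models):

from transformers import pipeline

unmasker = pipeline("fill-mask")
mask = unmasker.tokenizer.mask_token  # e.g. "[MASK]" or "<mask>", depending on the model
results = unmasker(f"The goal of life is {mask}.")
for r in results:
    print(r["token_str"], r["score"])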

1.3 Loading models

Next, we introduce several ways to load a model.

1.3.1 Randomly initializing model weights

Sometimes the model weights need to be initialized randomly (for example, when pre-training on your own data). First initialize a config object, then pass the config object to the model as a parameter:

from transformers import BertConfig
from transformers import BertModel


config = BertConfig()
model = BertModel(config)

The config above uses default values, but we can modify individual parameters as needed. We can also use AutoConfig.from_pretrained() to load another model's config:

from transformers import AutoConfig
from transformers import AutoModel


model_path = r"../pretrained_model/IDEA-CCNL(Erlangshen-Roberta-110M-Sentiment)"
config = AutoConfig.from_pretrained(model_path)
model = AutoModel.from_config(config)
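
As mentioned above, individual configuration values can also be overridden when the config is created; a minimal sketch (the values below are arbitrary and only illustrate the mechanism):

from transformers import BertConfig
from transformers import BertModel

# A smaller, randomly initialized BERT: fewer layers, smaller hidden size
config = BertConfig(num_hidden_layers=6, hidden_size=384, num_attention_heads=6)
model = BertModel(config)
print(model.config.num_hidden_layers)  # 6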

1.3.2 Initializing model weights from pre-trained weights

Sometimes we need to load the weights of a pre-trained model. In general, AutoModelForXXX.from_pretrained() is used to load the pre-trained model for the corresponding task; the XXX part varies because different task types use different classes. For example, to load a model for text sequence classification, we use AutoModelForSequenceClassification.

from transformers import AutoModelForSequenceClassification


model = AutoModelForSequenceClassification.from_pretrained(
    "IDEA-CCNL/Erlangshen-Roberta-110M-Sentiment")

The first parameter of AutoModelForSequenceClassification.from_pretrained(), pretrained_model_name_or_path, can be either a model name on the Hugging Face Hub (a string) or a local folder path.

from transformers import AutoModelForSequenceClassification


model_path = r"../pretrained_model/IDEA-CCNL(Erlangshen-Roberta-110M-Sentiment)"
model = AutoModelForSequenceClassification.from_pretrained(model_path)

We can also use a concrete model class, such as BertForSequenceClassification below:

from transformers import BertForSequenceClassification


model_path = r"../pretrained_model/IDEA-CCNL(Erlangshen-Roberta-110M-Sentiment)"
model = BertForSequenceClassification.from_pretrained(model_path)

Note: the model classes above are all PyTorch models. To use TensorFlow models, prefix the PyTorch class name with TF; for example, the TensorFlow counterpart of BertForSequenceClassification is TFBertForSequenceClassification.
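
For illustration, a rough sketch of the TensorFlow counterpart, assuming TensorFlow is installed (from_pt=True converts a checkpoint that only ships PyTorch weights):

from transformers import TFAutoModelForSequenceClassification

model = TFAutoModelForSequenceClassification.from_pretrained(
    "IDEA-CCNL/Erlangshen-Roberta-110M-Sentiment", from_pt=True)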

Summary: the official documentation recommends loading pre-trained models with AutoModelForXXX and TFAutoModelForXXX, because this ensures the correct architecture is loaded every time.

1.4 Preprocessing

The model itself cannot understand raw text, images, or audio, so the data must be converted into a form the model can accept before being passed in.

1.4.1 NLP: AutoTokenizer

The main tool for processing text data is the tokenizer. First, the tokenizer splits the text into tokens according to a set of rules. These tokens are then converted to numbers (according to the vocabulary, i.e. the vocab), and the numbers are assembled into tensors that become the model inputs. Any additional inputs the model needs are also added by the tokenizer.

When using a pre-trained model, be sure to use the corresponding pre-trained tokenizer. Only then is the text split in the same way as the pre-training corpus, with the same token-to-index mapping (the same vocab).

(1) Tokenize

Use AutoTokenizer.from_pretrained() to load a pre-trained tokenizer, then pass the text to it:

from transformers import AutoTokenizer


model_path = r"../pretrained_model/IDEA-CCNL(Erlangshen-Roberta-110M-Sentiment)"
tokenizer = AutoTokenizer.from_pretrained(model_path)

encoded_input = tokenizer("今天天气真好")  # "The weather is really nice today"
print(encoded_input)

The output is as follows:

{'input_ids': [101, 791, 1921, 1921, 3698, 4696, 1962, 102],
 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0],
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1]}

You can see that the output above consists of three parts:

  • input_ids: the index of each token in the sentence.
  • token_type_ids: when there is more than one sequence, indicates which sequence a token belongs to (see the sentence-pair sketch after this list).
  • attention_mask: indicates whether the corresponding token should be attended to (1 = attended to, 0 = ignored; this relates to the attention mechanism).
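
For example, token_type_ids becomes informative when a sentence pair is passed to the tokenizer; a small sketch reusing the tokenizer loaded above:

pair_input = tokenizer("今天天气真好", "适合出游")
print(pair_input["token_type_ids"])  # 0 for tokens of the first sentence, 1 for the second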

We can also use the tokenizer to decode input_ids back to the original input:

decoded_input = tokenizer.decode(encoded_input["input_ids"])
print(decoded_input)

The output is as follows:

[CLS] 今 天 天 气 真 好 [SEP]

In the output above, [CLS] and [SEP] appear in addition to the original text; they are special tokens used by BERT and similar models.

To process several sentences at once, pass the texts to the tokenizer as a list.

(2) Padding

When we process a batch of sentences, they do not always have the same length, but model inputs need to have a uniform shape. Padding is a strategy to achieve this: special padding tokens are added to the sentences that have fewer tokens.

Pass the parameter padding=True to tokenizer():

batch_sentences = ["今天天气真好",            # "The weather is really nice today"
                   "今天天气真好，适合出游"]  # "The weather is really nice today, good for an outing"
encoded_inputs = tokenizer(batch_sentences, padding=True)
print(encoded_inputs)

The output is as follows:

{'input_ids':
  [[101, 791, 1921, 1921, 3698, 4696, 1962, 102, 0, 0, 0, 0, 0],
   [101, 791, 1921, 1921, 3698, 4696, 1962, 8024, 6844, 1394, 1139, 3952, 102]],
 'token_type_ids':
  [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
   [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]],
 'attention_mask':
  [[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0],
   [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}

As you can see, the tokenizer padded the first sentence with 0s.

(3) Truncation

When sentences are too short, padding can be used. Sometimes, however, a sentence may be too long for the model to handle; in that case the sentence can be truncated.

Simply pass the parameter truncation=True to tokenizer().

For more information about the padding and truncation parameters of tokenizer(), see the official documentation.
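
A small sketch, reusing the tokenizer and batch_sentences defined above (max_length=8 is an arbitrary illustrative value):

encoded_inputs = tokenizer(batch_sentences, truncation=True, max_length=8)
print([len(ids) for ids in encoded_inputs["input_ids"]])  # no sequence is longer than 8 tokens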

(4) Building tensors

Finally, if we want the tokenizer to return the actual tensors that are fed into the model, we need to set the return_tensors parameter: "pt" for a PyTorch model, or "tf" for a TensorFlow model.

batch_sentences = ["今天天气真好",            # "The weather is really nice today"
                   "今天天气真好，适合出游"]  # "The weather is really nice today, good for an outing"
encoded_inputs = tokenizer(batch_sentences,
                           padding=True, truncation=True,
                           return_tensors="pt")
print(encoded_inputs)

The output is as follows:

{'input_ids':
  tensor([[ 101,  791, 1921, 1921, 3698, 4696, 1962,  102,    0,    0,    0,    0,    0],
          [ 101,  791, 1921, 1921, 3698, 4696, 1962, 8024, 6844, 1394, 1139, 3952,  102]]),
 'token_type_ids':
  tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
          [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]),
 'attention_mask':
  tensor([[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0],
          [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}

1.4.2 Other modalities

For audio data, preprocessing mainly consists of resampling, feature extraction (Feature Extractor), padding, and truncation. For image data, preprocessing mainly consists of feature extraction (Feature Extractor) and data augmentation. For multimodal data, each data type is preprocessed with the corresponding methods described above. See the official documentation for details. Although the preprocessing steps differ between data types, the ultimate goal is the same: transform the raw data into a form the model can accept.
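
As a rough illustration of image preprocessing, the sketch below loads a feature extractor for a ViT checkpoint and converts a local image into model inputs ("google/vit-base-patch16-224" and "example.jpg" are only example names):

from PIL import Image
from transformers import AutoFeatureExtractor

feature_extractor = AutoFeatureExtractor.from_pretrained("google/vit-base-patch16-224")
image = Image.open("example.jpg")  # any local RGB image
inputs = feature_extractor(images=image, return_tensors="pt")
print(inputs["pixel_values"].shape)  # e.g. torch.Size([1, 3, 224, 224])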

1.5 Fine-tuning a pre-trained model

The following uses multi-class text classification as an example to briefly show how to train a classification model with our own data.

1.5.1 Preparing the data

Before fine-tuning the pre-trained model, we need to prepare the data. We can load a dataset with load_dataset from the Datasets library:

from datasets import load_dataset


# Step 1: prepare the data
# Load the raw data from local files
datasets = load_dataset('./my_dataset.py')
# Print the first example in the training set
print(datasets["train"][0])

Note that because we are training the model with our own data, the parameter passed to load_dataset above is the path to a .py file. This .py file reads the data files and returns training examples according to the Datasets library conventions; see the Datasets documentation for details.

If we only want to learn how to use the Transformers library, we can use one of the datasets preset in the Datasets library; in that case the parameter passed to load_dataset is a string (for example, load_dataset("imdb")), and the corresponding dataset is downloaded automatically.
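
A small sketch of this case, assuming network access to download the dataset:

from datasets import load_dataset

imdb = load_dataset("imdb")
print(imdb)              # shows the available splits
print(imdb["train"][0])  # first training example, with "text" and "label" fields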

1.5.2 Preprocessing

Before feeding the data to the model, it needs to be preprocessed (tokenization, padding, truncation, and so on).

from transformers import AutoTokenizer


# Step 2: preprocess the data
# 2.1 Load the tokenizer (configure is the author's own dict of paths and settings)
tokenizer = AutoTokenizer.from_pretrained(configure["model_path"])

def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

# 2.2 Apply tokenization to the whole dataset
tokenized_datasets = datasets.map(tokenize_function, batched=True)
print(tokenized_datasets["train"][0])

First, the tokenizer is loaded; then datasets.map() is used to generate the preprocessed data. Calling tokenizer() on the data directly would not keep it in dataset format, so datasets.map() is used to apply the preprocessing.

1.5.3 Loading the model

The ways to load a model were introduced earlier; here we use AutoModelForXXX.from_pretrained() to load the model:

from transformers import AutoModelForSequenceClassification

# Step 3: load the model
# num_labels must be the number of classes; get_num_labels() is the author's own helper
classification_model = AutoModelForSequenceClassification.from_pretrained(
    configure["model_path"], num_labels=get_num_labels())

The difference from the previous sections is the num_labels parameter in the code above: it must be set to the number of classes in our dataset.

1.5.4 Setting metrics

During training, we want to report the model's performance metrics (such as accuracy, precision, recall, and F1) to understand how training is going. This can be done with load_metric() from the Datasets library. The following code computes accuracy:

import numpy as np
from datasets import load_metric


# Step 4: set the metric
metric = load_metric("./accuracy.py")  # "./accuracy.py" is a local metric script

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

For more information, see the official documentation.

1.5.5 Setting training hyperparameters

Some hyperparameters also need to be set for model training; the Transformers library provides the TrainingArguments class for this.

from transformers import TrainingArguments


# Step 5: set the training hyperparameters
training_args = TrainingArguments(output_dir=configure["output_dir"],
                                  evaluation_strategy="epoch")

In the code above we set two parameters: output_dir specifies the output path where the model is saved, and evaluation_strategy decides when the model is evaluated; setting it to "epoch" means the model is evaluated after every training epoch, using the metrics configured in the previous step.

For more parameters and their exact meanings, see the official documentation.
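
For illustration, a few other commonly used arguments look like this (the values are placeholders and should be tuned for the actual task):

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir=configure["output_dir"],
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
)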

1.5.6 Training and saving the model

After the previous series of steps, we can finally start training. The Transformers library provides the Trainer class, which makes training simple and convenient. First create a Trainer, then call train() to start training. When training finishes, call save_model() to save the model.

from transformers import Trainer


# Step 6: create the Trainer and start training
trainer = Trainer(model=classification_model,
                  args=training_args,
                  train_dataset=tokenized_datasets["train"],
                  eval_dataset=tokenized_datasets["validation"],
                  tokenizer=tokenizer,
                  compute_metrics=compute_metrics)
trainer.train()

# Save the trained model
trainer.save_model()

Sometimes we need to debug the model and write our own training loop; for the detailed method, see the official documentation.
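
As a rough illustration, a manual loop in plain PyTorch might look like the sketch below. It assumes the tokenized_datasets and classification_model from the previous steps, and that the raw columns are named "text" and "label"; the hyperparameter values are placeholders:

import torch
from torch.optim import AdamW
from torch.utils.data import DataLoader

# Keep only the columns the model expects and rename "label" to "labels"
train_dataset = tokenized_datasets["train"].remove_columns(["text"]).rename_column("label", "labels")
train_dataset.set_format("torch")
train_dataloader = DataLoader(train_dataset, shuffle=True, batch_size=8)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
classification_model.to(device)
optimizer = AdamW(classification_model.parameters(), lr=5e-5)

classification_model.train()
for epoch in range(3):
    for batch in train_dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = classification_model(**batch)  # the loss is computed because "labels" is present
        outputs.loss.backward()
        optimizer.step()
        optimizer.zero_grad()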

1.5.7 Summary

With the introduction above, we can now start training our own multi-class text classification model.

The previous sections used multi-class text classification as the example for fine-tuning a pre-trained model with the Transformers library. Other task types differ somewhat from text classification; for specific guidance, refer to the following links:

Text classification: https://huggingface.co/docs/transformers/tasks/sequence_classification
Token classification (e.g. NER): https://huggingface.co/docs/transformers/tasks/token_classification
Question answering: https://huggingface.co/docs/transformers/tasks/question_answering
Language modeling: https://huggingface.co/docs/transformers/tasks/language_modeling
Translation: https://huggingface.co/docs/transformers/tasks/translation
Summarization: https://huggingface.co/docs/transformers/tasks/summarization
Multiple choice: https://huggingface.co/docs/transformers/tasks/multiple_choice
Audio classification: https://huggingface.co/docs/transformers/tasks/audio_classification
Automatic speech recognition (ASR): https://huggingface.co/docs/transformers/tasks/asr
Image classification: https://huggingface.co/docs/transformers/tasks/image_classification

