
Basic Use of the Transformers Library

2022-06-25 01:31:00 Empty cup realm

This article introduces the basic use of the Transformers library.

1.1 Introduction to the Transformers library

Transformers is an open-source library; all the pre-trained models it provides are based on the Transformer architecture.

1.1.1 Transformers library

With the APIs provided by the Transformers library, we can easily download and train state-of-the-art pre-trained models. Using pre-trained models reduces computation cost and saves the time of training a model from scratch. These models cover tasks in different modalities, for example:

  • Text: text classification, information extraction, question answering, text summarization, machine translation, and text generation.
  • Images: image classification, object detection, and image segmentation.
  • Audio: speech recognition and audio classification.
  • Multimodal: table question answering, OCR, information extraction from scanned documents, video classification, and visual question answering.

The Transformers library supports the three most popular deep learning frameworks (PyTorch, TensorFlow, and JAX).

The relevant resources are listed below:

GitHub repository: https://github.com/huggingface/transformers
Official documentation: https://huggingface.co/docs/transformers/index
Pre-trained model hub: https://huggingface.co/models

1.1.2 Models and frameworks supported by the Transformers library

The Transformers library currently supports the models listed below. For each model, the official documentation indicates whether a slow tokenizer, a fast tokenizer, PyTorch, TensorFlow, and Flax are supported:

ALBERT
BART
BEiT
BERT
Bert Generation
BigBird
BigBirdPegasus
Blenderbot
BlenderbotSmall
CamemBERT
Canine
CLIP
ConvBERT
ConvNext
CTRL
Data2VecAudio
Data2VecText
Data2VecVision
DeBERTa
DeBERTa-v2
Decision Transformer
DeiT
DETR
DistilBERT
DPR
DPT
ELECTRA
Encoder decoder
FairSeq Machine-Translation
FlauBERT
Flava
FNet
Funnel Transformer
GLPN
GPT Neo
GPT-J
Hubert
I-BERT
ImageGPT
LayoutLM
LayoutLMv2
LED
Longformer
LUKE
LXMERT
M2M100
Marian
MaskFormer
mBART
MegatronBert
MobileBERT
MPNet
mT5
Nystromformer
OpenAI GPT
OpenAI GPT-2
OPT
Pegasus
Perceiver
PLBart
PoolFormer
ProphetNet
QDQBert
RAG
Realm
Reformer
RegNet
RemBERT
ResNet
RetriBERT
RoBERTa
RoFormer
SegFormer
SEW
SEW-D
Speech Encoder decoder
Speech2Text
Speech2Text2
Splinter
SqueezeBERT
Swin
T5
TAPAS
TAPEX
Transformer-XL
TrOCR
UniSpeech
UniSpeechSat
VAN
ViLT
Vision Encoder decoder
VisionTextDualEncoder
VisualBert
ViT
ViTMAE
Wav2Vec2
WavLM
XGLM
XLM
XLM-RoBERTa
XLM-RoBERTa-XL
XLMProphetNet
XLNet
YOLOS

Note: a slow tokenizer (Tokenizer slow) implements tokenization in pure Python, while a fast tokenizer (Tokenizer fast) is based on the Rust library Tokenizers.
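
For example, whether a fast or a slow tokenizer is used can be controlled with the use_fast argument of AutoTokenizer.from_pretrained() and checked with the is_fast attribute. A minimal sketch ("bert-base-chinese" is only an example checkpoint and is downloaded on first use):

from transformers import AutoTokenizer

fast_tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese", use_fast=True)
slow_tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese", use_fast=False)

print(fast_tokenizer.is_fast)  # True: backed by the Rust Tokenizers library
print(slow_tokenizer.is_fast)  # False: pure Python implementation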


1.2 Pipeline

The pipeline() function performs inference with pre-trained models, and it supports all models that can be downloaded from the model hub (https://huggingface.co/models).

1.2.1 Task types supported by pipeline()

pipeline() supports many common tasks:

  • Text
    • Sentiment analysis
    • Text generation
    • Named entity recognition (NER)
    • Question answering
    • Fill-mask
    • Summarization
    • Translation
    • Feature extraction
  • Images
    • Image classification
    • Image segmentation
    • Object detection
  • Audio
    • Audio classification
    • Automatic speech recognition (ASR)

Note: the tasks supported by pipeline() can be found in the Transformers source code (see the SUPPORTED_TASKS definition in transformers/pipelines/__init__.py); different versions support different task types.
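For instance, in recent versions the task names can be listed directly from that definition; this is a minimal sketch that relies on an internal, version-dependent import rather than a public API:

# A sketch: list the pipeline task names supported by the installed version.
# SUPPORTED_TASKS is an internal definition and may move between versions.
from transformers.pipelines import SUPPORTED_TASKS

for task_name in SUPPORTED_TASKS:
    print(task_name)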


1.2.2 Using pipeline()

(1) Basic usage

For example, suppose we need to run a sentiment analysis inference task. We can directly use the following code:

from transformers import pipeline


classifier = pipeline("sentiment-analysis")
result = classifier("We are very happy to show you the 🤗 Transformers library.")
print(result)

The output is as follows:

[{'label': 'POSITIVE', 'score': 0.9997795224189758}]

In the code above, pipeline("sentiment-analysis") downloads and caches a default pre-trained model for sentiment analysis, together with the corresponding tokenizer. For other task types, the corresponding names are documented under the task parameter of pipeline() in the official documentation; the default pre-trained model for each task type can be found in the Transformers source code (see the SUPPORTED_TASKS definition in transformers/pipelines/__init__.py).

When we need to run inference on more than one sentence at a time, we can pass them in as a list:

from transformers import pipeline

classifier = pipeline("sentiment-analysis")
results = classifier(["We are very happy to show you the 🤗 Transformers library.",
                      "We hope you don't hate it."])
print(results)

The output is as follows:

[{'label': 'POSITIVE', 'score': 0.9997795224189758},
 {'label': 'NEGATIVE', 'score': 0.5308570265769958}]

(2) Choosing a model

The examples above used the default model for the task during inference. Sometimes we want to use a specific model instead; this can be done through the model parameter of pipeline().

The first method:

from transformers import pipeline


classifier = pipeline("sentiment-analysis",
                      model="IDEA-CCNL/Erlangshen-Roberta-110M-Sentiment")
result = classifier(" I am in a good mood today ")
print(result)

The output is as follows:

[{'label': 'Positive', 'score': 0.9374911785125732}]

The second method (it loads the same model as the method above, but it also allows a local model to be used for inference):

from transformers import AutoModelForSequenceClassification
from transformers import AutoTokenizer
from transformers import pipeline


model_path = r"../pretrained_model/IDEA-CCNL(Erlangshen-Roberta-110M-Sentiment)"
model = AutoModelForSequenceClassification.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

classifier = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer)
result = classifier(" I am in a good mood today ")
print(result)

The output is as follows:

[{'label': 'Positive', 'score': 0.9374911785125732}]

Summary: this section described how to run inference for a text classification task with pipeline(). For other text tasks, as well as image and audio tasks, the usage is essentially the same; see the official documentation for details.
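
Other task types work the same way, only the task name changes. A sketch with the fill-mask task (the default model is downloaded automatically; the mask token is read from the tokenizer because it differs between models):

from transformers import pipeline

unmasker = pipeline("fill-mask")
mask = unmasker.tokenizer.mask_token  # e.g. "[MASK]" or "<mask>", depending on the model
results = unmasker(f"The goal of life is {mask}.")
for r in results:
    print(r["token_str"], r["score"])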

1.3 Loading models

Next, we introduce several ways to load a model.

1.3.1 Randomly initializing model weights

Sometimes the model weights need to be initialized randomly (for example, when pre-training on your own data). First initialize a config object, then pass the config object to the model as a parameter:

from transformers import BertConfig
from transformers import BertModel


config = BertConfig()
model = BertModel(config)

The config above uses default values, but we can modify individual parameters as needed. We can also use AutoConfig.from_pretrained() to load another model's config:

from transformers import AutoConfig
from transformers import AutoModel


model_path = r"../pretrained_model/IDEA-CCNL(Erlangshen-Roberta-110M-Sentiment)"
config = AutoConfig.from_pretrained(model_path)
model = AutoModel.from_config(config)
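
As mentioned above, individual configuration values can also be overridden when the config is created; a minimal sketch (the values below are arbitrary and only illustrate the mechanism):

from transformers import BertConfig
from transformers import BertModel

# A smaller, randomly initialized BERT: fewer layers, smaller hidden size
config = BertConfig(num_hidden_layers=6, hidden_size=384, num_attention_heads=6)
model = BertModel(config)
print(model.config.num_hidden_layers)  # 6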

1.3.2 Initializing model weights from pre-trained weights

Sometimes we need to load the weights of a pre-trained model. In general, AutoModelForXXX.from_pretrained() is used to load the pre-trained model for the corresponding task; the XXX part varies because different task types use different classes. For example, to load a model for text sequence classification, we use AutoModelForSequenceClassification.

from transformers import AutoModelForSequenceClassification


model = AutoModelForSequenceClassification.from_pretrained(
    "IDEA-CCNL/Erlangshen-Roberta-110M-Sentiment")

The first parameter of AutoModelForSequenceClassification.from_pretrained(), pretrained_model_name_or_path, can be either a model name on the Hugging Face Hub (a string) or a local folder path.

from transformers import AutoModelForSequenceClassification


model_path = r"../pretrained_model/IDEA-CCNL(Erlangshen-Roberta-110M-Sentiment)"
model = AutoModelForSequenceClassification.from_pretrained(model_path)

We can also use a concrete model class, such as BertForSequenceClassification below:

from transformers import BertForSequenceClassification


model_path = r"../pretrained_model/IDEA-CCNL(Erlangshen-Roberta-110M-Sentiment)"
model = BertForSequenceClassification.from_pretrained(model_path)

Note: the model classes above are all PyTorch models. To use TensorFlow models, prefix the PyTorch class name with TF; for example, the TensorFlow counterpart of BertForSequenceClassification is TFBertForSequenceClassification.
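
For illustration, a rough sketch of the TensorFlow counterpart, assuming TensorFlow is installed (from_pt=True converts a checkpoint that only ships PyTorch weights):

from transformers import TFAutoModelForSequenceClassification

model = TFAutoModelForSequenceClassification.from_pretrained(
    "IDEA-CCNL/Erlangshen-Roberta-110M-Sentiment", from_pt=True)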

Summary: the official documentation recommends loading pre-trained models with AutoModelForXXX and TFAutoModelForXXX, because this ensures the correct architecture is loaded every time.

1.4 Preprocessing

The model itself cannot understand raw text, images, or audio, so the data must be converted into a form the model can accept before being passed in.

1.4.1 NLP: AutoTokenizer

The main tool for processing text data is the tokenizer. First, the tokenizer splits the text into tokens according to a set of rules. These tokens are then converted to numbers (according to the vocabulary, i.e. the vocab), and the numbers are assembled into tensors that become the model inputs. Any additional inputs the model needs are also added by the tokenizer.

When using a pre-trained model, be sure to use the corresponding pre-trained tokenizer. Only then is the text split in the same way as the pre-training corpus, with the same token-to-index mapping (the same vocab).

(1) Tokenize

Use AutoTokenizer.from_pretrained() to load a pre-trained tokenizer, then pass the text to it:

from transformers import AutoTokenizer


model_path = r"../pretrained_model/IDEA-CCNL(Erlangshen-Roberta-110M-Sentiment)"
tokenizer = AutoTokenizer.from_pretrained(model_path)

encoded_input = tokenizer("今天天气真好")  # "The weather is really nice today"
print(encoded_input)

The output is as follows:

{'input_ids': [101, 791, 1921, 1921, 3698, 4696, 1962, 102],
 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0],
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1]}

You can see that the output above consists of three parts:

  • input_ids: the index of each token in the sentence.
  • token_type_ids: when there is more than one sequence, indicates which sequence a token belongs to (see the sentence-pair sketch after this list).
  • attention_mask: indicates whether the corresponding token should be attended to (1 = attended to, 0 = ignored; this relates to the attention mechanism).
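
For example, token_type_ids becomes informative when a sentence pair is passed to the tokenizer; a small sketch reusing the tokenizer loaded above:

pair_input = tokenizer("今天天气真好", "适合出游")
print(pair_input["token_type_ids"])  # 0 for tokens of the first sentence, 1 for the second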

We can also use the tokenizer to decode input_ids back to the original input:

decoded_input = tokenizer.decode(encoded_input["input_ids"])
print(decoded_input)

The output is as follows:

[CLS] 今 天 天 气 真 好 [SEP]

In the output above, [CLS] and [SEP] appear in addition to the original text; they are special tokens used by BERT and similar models.

To process several sentences at once, pass the texts to the tokenizer as a list.

(2) Padding

When we process a batch of sentences, they do not always have the same length, but model inputs need to have a uniform shape. Padding is a strategy to achieve this: special padding tokens are added to the sentences that have fewer tokens.

Pass the parameter padding=True to tokenizer():

batch_sentences = ["今天天气真好",            # "The weather is really nice today"
                   "今天天气真好，适合出游"]  # "The weather is really nice today, good for an outing"
encoded_inputs = tokenizer(batch_sentences, padding=True)
print(encoded_inputs)

The output is as follows:

{'input_ids':
  [[101, 791, 1921, 1921, 3698, 4696, 1962, 102, 0, 0, 0, 0, 0],
   [101, 791, 1921, 1921, 3698, 4696, 1962, 8024, 6844, 1394, 1139, 3952, 102]],
 'token_type_ids':
  [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
   [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]],
 'attention_mask':
  [[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0],
   [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}

As you can see, the tokenizer padded the first sentence with 0s.

(3) Truncation

When sentences are too short, padding can be used. Sometimes, however, a sentence may be too long for the model to handle; in that case the sentence can be truncated.

Simply pass the parameter truncation=True to tokenizer().

For more information about the padding and truncation parameters of tokenizer(), see the official documentation.
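
A small sketch, reusing the tokenizer and batch_sentences defined above (max_length=8 is an arbitrary illustrative value):

encoded_inputs = tokenizer(batch_sentences, truncation=True, max_length=8)
print([len(ids) for ids in encoded_inputs["input_ids"]])  # no sequence is longer than 8 tokens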

(4) Building tensors

Finally, if we want the tokenizer to return the actual tensors that are fed into the model, we need to set the return_tensors parameter: "pt" for a PyTorch model, or "tf" for a TensorFlow model.

batch_sentences = ["今天天气真好",            # "The weather is really nice today"
                   "今天天气真好，适合出游"]  # "The weather is really nice today, good for an outing"
encoded_inputs = tokenizer(batch_sentences,
                           padding=True, truncation=True,
                           return_tensors="pt")
print(encoded_inputs)

The output is as follows:

{'input_ids':
  tensor([[ 101,  791, 1921, 1921, 3698, 4696, 1962,  102,    0,    0,    0,    0,    0],
          [ 101,  791, 1921, 1921, 3698, 4696, 1962, 8024, 6844, 1394, 1139, 3952,  102]]),
 'token_type_ids':
  tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
          [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]),
 'attention_mask':
  tensor([[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0],
          [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}

1.4.2 Other modalities

For audio data, preprocessing mainly consists of resampling, feature extraction (Feature Extractor), padding, and truncation. For image data, preprocessing mainly consists of feature extraction (Feature Extractor) and data augmentation. For multimodal data, each data type is preprocessed with the corresponding methods described above. See the official documentation for details. Although the preprocessing steps differ between data types, the ultimate goal is the same: transform the raw data into a form the model can accept.
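
As a rough illustration of image preprocessing, the sketch below loads a feature extractor for a ViT checkpoint and converts a local image into model inputs ("google/vit-base-patch16-224" and "example.jpg" are only example names):

from PIL import Image
from transformers import AutoFeatureExtractor

feature_extractor = AutoFeatureExtractor.from_pretrained("google/vit-base-patch16-224")
image = Image.open("example.jpg")  # any local RGB image
inputs = feature_extractor(images=image, return_tensors="pt")
print(inputs["pixel_values"].shape)  # e.g. torch.Size([1, 3, 224, 224])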

1.5 Fine-tuning a pre-trained model

The following uses multi-class text classification as an example to briefly show how to train a classification model with our own data.

1.5.1 Preparing the data

Before fine-tuning the pre-trained model, we need to prepare the data. We can load a dataset with load_dataset from the Datasets library:

from datasets import load_dataset


# Step 1: prepare the data
# Load the raw data from local files
datasets = load_dataset('./my_dataset.py')
# Print the first example in the training set
print(datasets["train"][0])

Note that because we are training the model with our own data, the parameter passed to load_dataset above is the path to a .py file. This .py file reads the data files and returns training examples according to the Datasets library conventions; see the Datasets documentation for details.

If we only want to learn how to use the Transformers library, we can use one of the datasets preset in the Datasets library; in that case the parameter passed to load_dataset is a string (for example, load_dataset("imdb")), and the corresponding dataset is downloaded automatically.
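
A small sketch of this case, assuming network access to download the dataset:

from datasets import load_dataset

imdb = load_dataset("imdb")
print(imdb)              # shows the available splits
print(imdb["train"][0])  # first training example, with "text" and "label" fields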

1.5.2 Preprocessing

Before feeding the data to the model, it needs to be preprocessed (tokenization, padding, truncation, and so on).

from transformers import AutoTokenizer


# Step 2: preprocess the data
# 2.1 Load the tokenizer (configure is the author's own dict of paths and settings)
tokenizer = AutoTokenizer.from_pretrained(configure["model_path"])

def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

# 2.2 Apply tokenization to the whole dataset
tokenized_datasets = datasets.map(tokenize_function, batched=True)
print(tokenized_datasets["train"][0])

First, the tokenizer is loaded; then datasets.map() is used to generate the preprocessed data. Calling tokenizer() on the data directly would not keep it in dataset format, so datasets.map() is used to apply the preprocessing.

1.5.3 Loading the model

The ways to load a model were introduced earlier; here we use AutoModelForXXX.from_pretrained() to load the model:

from transformers import AutoModelForSequenceClassification

# Step 3: load the model
# num_labels must be the number of classes; get_num_labels() is the author's own helper
classification_model = AutoModelForSequenceClassification.from_pretrained(
    configure["model_path"], num_labels=get_num_labels())

The difference from the previous sections is the num_labels parameter in the code above: it must be set to the number of classes in our dataset.

1.5.4 Setting metrics

During training, we want to report the model's performance metrics (such as accuracy, precision, recall, and F1) to understand how training is going. This can be done with load_metric() from the Datasets library. The following code computes accuracy:

import numpy as np
from datasets import load_metric


# Step 4: set the metric
metric = load_metric("./accuracy.py")  # "./accuracy.py" is a local metric script

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

For more information, see the official documentation.

1.5.5 Setting training hyperparameters

Some hyperparameters also need to be set for model training; the Transformers library provides the TrainingArguments class for this.

from transformers import TrainingArguments


# Step 5: set the training hyperparameters
training_args = TrainingArguments(output_dir=configure["output_dir"],
                                  evaluation_strategy="epoch")

In the code above we set two parameters: output_dir specifies the output path where the model is saved, and evaluation_strategy decides when the model is evaluated; setting it to "epoch" means the model is evaluated after every training epoch, using the metrics configured in the previous step.

For more parameters and their exact meanings, see the official documentation.
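
For illustration, a few other commonly used arguments look like this (the values are placeholders and should be tuned for the actual task):

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir=configure["output_dir"],
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
)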

1.5.6 Training and saving the model

After the previous series of steps, we can finally start training. The Transformers library provides the Trainer class, which makes training simple and convenient. First create a Trainer, then call train() to start training. When training finishes, call save_model() to save the model.

from transformers import Trainer


# Step 6: create the Trainer and start training
trainer = Trainer(model=classification_model,
                  args=training_args,
                  train_dataset=tokenized_datasets["train"],
                  eval_dataset=tokenized_datasets["validation"],
                  tokenizer=tokenizer,
                  compute_metrics=compute_metrics)
trainer.train()

# Save the trained model
trainer.save_model()

Sometimes we need to debug the model and write our own training loop; for the detailed method, see the official documentation.
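
As a rough illustration, a manual loop in plain PyTorch might look like the sketch below. It assumes the tokenized_datasets and classification_model from the previous steps, and that the raw columns are named "text" and "label"; the hyperparameter values are placeholders:

import torch
from torch.optim import AdamW
from torch.utils.data import DataLoader

# Keep only the columns the model expects and rename "label" to "labels"
train_dataset = tokenized_datasets["train"].remove_columns(["text"]).rename_column("label", "labels")
train_dataset.set_format("torch")
train_dataloader = DataLoader(train_dataset, shuffle=True, batch_size=8)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
classification_model.to(device)
optimizer = AdamW(classification_model.parameters(), lr=5e-5)

classification_model.train()
for epoch in range(3):
    for batch in train_dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = classification_model(**batch)  # the loss is computed because "labels" is present
        outputs.loss.backward()
        optimizer.step()
        optimizer.zero_grad()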

1.5.7 Summary

With the introduction above, we can now start training our own multi-class text classification model.

The previous sections used multi-class text classification as the example for fine-tuning a pre-trained model with the Transformers library. Other task types differ somewhat from text classification; for specific guidance, refer to the following links:

Text classification: https://huggingface.co/docs/transformers/tasks/sequence_classification
Token classification (e.g. NER): https://huggingface.co/docs/transformers/tasks/token_classification
Question answering: https://huggingface.co/docs/transformers/tasks/question_answering
Language modeling: https://huggingface.co/docs/transformers/tasks/language_modeling
Translation: https://huggingface.co/docs/transformers/tasks/translation
Summarization: https://huggingface.co/docs/transformers/tasks/summarization
Multiple choice: https://huggingface.co/docs/transformers/tasks/multiple_choice
Audio classification: https://huggingface.co/docs/transformers/tasks/audio_classification
Automatic speech recognition (ASR): https://huggingface.co/docs/transformers/tasks/asr
Image classification: https://huggingface.co/docs/transformers/tasks/image_classification

