Basic Use of the Transformers Library
2022-06-25 01:31:00 【Empty cup realm】
This article introduces the basic use of the Transformers library.
1.1 Overview of the Transformers Library
The Transformers library is an open-source library whose pretrained models are all based on the Transformer architecture.
1.1.1 The Transformers library
With the APIs provided by the Transformers library, we can easily download and train state-of-the-art pretrained models. Using pretrained models reduces computational cost and saves the time of training a model from scratch. These models cover tasks in different modalities, for example:
- Text: text classification, information extraction, question answering, text summarization, machine translation, and text generation.
- Images: image classification, object detection, and image segmentation.
- Audio: speech recognition and audio classification.
- Multimodal: table question answering, OCR, information extraction from scanned documents, video classification, and visual question answering.
The Transformers library supports the three most popular deep learning frameworks: PyTorch, TensorFlow, and JAX.
The related resources are listed below:
Resource | URL
---|---
Library GitHub address | https://github.com/huggingface/transformers
Official documentation | https://huggingface.co/docs/transformers/index
Pretrained model download address | https://huggingface.co/models
1.1.2 Models and frameworks supported by the Transformers library
The Transformers library currently supports the following model architectures; each model's tokenizer implementation (slow/fast) and framework support (PyTorch/TensorFlow/Flax) are listed in the official documentation:
ALBERT, BART, BEiT, BERT, Bert Generation, BigBird, BigBirdPegasus, Blenderbot, BlenderbotSmall, CamemBERT, Canine, CLIP, ConvBERT, ConvNext, CTRL, Data2VecAudio, Data2VecText, Data2VecVision, DeBERTa, DeBERTa-v2, Decision Transformer, DeiT, DETR, DistilBERT, DPR, DPT, ELECTRA, Encoder decoder, FairSeq Machine-Translation, FlauBERT, Flava, FNet, Funnel Transformer, GLPN, GPT Neo, GPT-J, Hubert, I-BERT, ImageGPT, LayoutLM, LayoutLMv2, LED, Longformer, LUKE, LXMERT, M2M100, Marian, MaskFormer, mBART, MegatronBert, MobileBERT, MPNet, mT5, Nystromformer, OpenAI GPT, OpenAI GPT-2, OPT, Pegasus, Perceiver, PLBart, PoolFormer, ProphetNet, QDQBert, RAG, Realm, Reformer, RegNet, RemBERT, ResNet, RetriBERT, RoBERTa, RoFormer, SegFormer, SEW, SEW-D, Speech Encoder decoder, Speech2Text, Speech2Text2, Splinter, SqueezeBERT, Swin, T5, TAPAS, TAPEX, Transformer-XL, TrOCR, UniSpeech, UniSpeechSat, VAN, ViLT, Vision Encoder decoder, VisionTextDualEncoder, VisualBert, ViT, ViTMAE, Wav2Vec2, WavLM, XGLM, XLM, XLM-RoBERTa, XLM-RoBERTa-XL, XLMProphetNet, XLNet, YOLOS.
Note: a "slow" tokenizer implements the tokenization process in pure Python, while a "fast" tokenizer is built on the Rust-based Tokenizers library.
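As an illustration (not part of the original article), the implementation can be selected with the use_fast argument of AutoTokenizer.from_pretrained(); a minimal sketch, using bert-base-chinese purely as an example checkpoint:
```python
from transformers import AutoTokenizer

# The Rust-backed fast tokenizer is used by default when one is available
fast_tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
# Force the pure-Python (slow) implementation
slow_tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese", use_fast=False)

print(fast_tokenizer.is_fast, slow_tokenizer.is_fast)  # True False
```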
1.2 Pipeline
The pipeline() function runs inference with a pretrained model, and it supports all of the models available for download from the model hub listed above.
1.2.1 Task types supported by pipeline()
pipeline() supports many common tasks:
- Text
  - Sentiment analysis
  - Text generation
  - Named entity recognition (NER)
  - Question answering
  - Fill-mask
  - Summarization
  - Translation
  - Feature extraction
- Images
  - Image classification
  - Image segmentation
  - Object detection
- Audio
  - Audio classification
  - Automatic speech recognition (ASR)
Note: the tasks supported by pipeline() can be checked in the library's source code (see the SUPPORTED_TASKS definition in transformers/pipelines/__init__.py); the supported task types differ between versions.
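As a version-dependent sketch (SUPPORTED_TASKS is an internal object of the library, so its location and contents may change between releases), the task names can also be inspected programmatically:
```python
from transformers.pipelines import SUPPORTED_TASKS

# Print the task identifiers registered in the installed version of the library
print(sorted(SUPPORTED_TASKS.keys()))
# e.g. ['audio-classification', 'fill-mask', 'text-classification', ...]
```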
1.2.2 Using pipeline()
(1) Basic usage
For example, suppose we want to run a sentiment-analysis inference task. We can use the following code directly:
```python
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
result = classifier("We are very happy to show you the Transformers library.")
print(result)
```
The output is:
```
[{'label': 'POSITIVE', 'score': 0.9997795224189758}]
```
In the code above, pipeline("sentiment-analysis") downloads and caches a default pretrained sentiment-analysis model together with its tokenizer. The names accepted for other task types are documented in the description of the task parameter of pipeline() in the official docs; the default pretrained model for each task type can be found in the library's source code (see the SUPPORTED_TASKS definition in transformers/pipelines/__init__.py).
When we need to run inference on more than one sentence at a time, the sentences can be passed in as a list:
```python
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
results = classifier(["We are very happy to show you the Transformers library.",
                      "We hope you don't hate it."])
print(results)
```
The output is:
```
[{'label': 'POSITIVE', 'score': 0.9997795224189758},
 {'label': 'NEGATIVE', 'score': 0.5308570265769958}]
```
(2) Choosing a model
The code above uses the default model for the task. Sometimes, however, we want to use a specific model, which can be done through the model parameter of pipeline().
The first way:
```python
from transformers import pipeline

classifier = pipeline("sentiment-analysis",
                      model="IDEA-CCNL/Erlangshen-Roberta-110M-Sentiment")
result = classifier("今天天气真好")
print(result)
```
The output is:
```
[{'label': 'Positive', 'score': 0.9374911785125732}]
```
The second way (this loads the same model as above, but it also allows using a local model for inference):
```python
from transformers import AutoModelForSequenceClassification
from transformers import AutoTokenizer
from transformers import pipeline

model_path = r"../pretrained_model/IDEA-CCNL(Erlangshen-Roberta-110M-Sentiment)"
model = AutoModelForSequenceClassification.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)
classifier = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer)
result = classifier("今天天气真好")
print(result)
```
The output is:
```
[{'label': 'Positive', 'score': 0.9374911785125732}]
```
Summary: this section described how to run inference for a text classification task with pipeline(). Other text tasks, as well as image and audio tasks, are used in essentially the same way; see the official documentation for details.
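As one more illustration (not from the original article), the same pattern applies to other task types; a hedged sketch of a fill-mask pipeline using the default model for that task:
```python
from transformers import pipeline

unmasker = pipeline("fill-mask")  # downloads a default fill-mask model
# Build a sentence containing the model's own mask token (e.g. [MASK] or <mask>)
text = f"Transformers provides thousands of {unmasker.tokenizer.mask_token} models."
print(unmasker(text))  # top candidate tokens with their scores
```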
1.3 Loading models
Next, we introduce several ways to load a model.
1.3.1 Randomly initializing model weights
Sometimes the model weights need to be initialized randomly (for example, when pretraining on your own data). First initialize a config object, then pass this config to the model as a parameter:
```python
from transformers import BertConfig
from transformers import BertModel

config = BertConfig()
model = BertModel(config)
```
The config above uses default values, but individual parameters can be modified as needed (see the sketch below). We can also use AutoConfig.from_pretrained() to load another model's config:
```python
from transformers import AutoConfig
from transformers import AutoModel

model_path = r"../pretrained_model/IDEA-CCNL(Erlangshen-Roberta-110M-Sentiment)"
config = AutoConfig.from_pretrained(model_path)
model = AutoModel.from_config(config)
```
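For illustration (the specific values below are arbitrary and not from the original article), configuration fields can be overridden either in the constructor or as keyword arguments to from_pretrained():
```python
from transformers import BertConfig, AutoConfig

# A smaller BERT built from scratch by overriding a few defaults
small_config = BertConfig(num_hidden_layers=6, hidden_size=384, num_attention_heads=6)

# A pretrained config with selected fields overridden (e.g. for a 3-class task)
config = AutoConfig.from_pretrained("bert-base-chinese", num_labels=3)
```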
1.3.2 Initializing model weights from pretrained weights
Sometimes we need to load weights from a pretrained model. In general, AutoModelForXXX.from_pretrained() is used to load the pretrained model for the corresponding task; the XXX part differs because different task types use different classes. For example, to load a text sequence classification model, we need AutoModelForSequenceClassification:
```python
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "IDEA-CCNL/Erlangshen-Roberta-110M-Sentiment")
```
The first parameter of AutoModelForSequenceClassification.from_pretrained(), pretrained_model_name_or_path, can be either a model name string or a local folder path:
```python
from transformers import AutoModelForSequenceClassification

model_path = r"../pretrained_model/IDEA-CCNL(Erlangshen-Roberta-110M-Sentiment)"
model = AutoModelForSequenceClassification.from_pretrained(model_path)
```
We can also use a concrete model class, such as BertForSequenceClassification below:
```python
from transformers import BertForSequenceClassification

model_path = r"../pretrained_model/IDEA-CCNL(Erlangshen-Roberta-110M-Sentiment)"
model = BertForSequenceClassification.from_pretrained(model_path)
```
Note: the model classes above are all PyTorch models. To use a TensorFlow model, prefix the PyTorch class name with TF; for example, the TensorFlow counterpart of BertForSequenceClassification is TFBertForSequenceClassification.
Summary: the official documentation recommends using AutoModelForXXX and TFAutoModelForXXX to load pretrained models, since this ensures the correct architecture is loaded every time.
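A minimal sketch, assuming TensorFlow is installed; from_pt=True converts PyTorch weights on the fly when a checkpoint has no native TensorFlow weights:
```python
from transformers import TFAutoModelForSequenceClassification

tf_model = TFAutoModelForSequenceClassification.from_pretrained(
    "IDEA-CCNL/Erlangshen-Roberta-110M-Sentiment", from_pt=True)
```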
1.4 Preprocessing
The model itself cannot understand raw text, images, or audio, so the data must be converted into a form the model can accept before being passed to it.
1.4.1 NLP: AutoTokenizer
The main tool for processing text data is the tokenizer. First, the tokenizer splits the text into tokens according to a set of rules. Then these tokens are converted to numbers (according to the vocabulary, i.e. the vocab), and the numbers are assembled into tensors that serve as the model's input. Any additional inputs the model requires are also added by the tokenizer.
When using a pretrained model, be sure to use the corresponding pretrained tokenizer. Only then is the text split in the same way as the pretraining corpus, using the same token-to-index mapping (the vocab).
(1) Tokenize
Use AutoTokenizer.from_pretrained() to load a pretrained tokenizer, then pass the text to it:
```python
from transformers import AutoTokenizer

model_path = r"../pretrained_model/IDEA-CCNL(Erlangshen-Roberta-110M-Sentiment)"
tokenizer = AutoTokenizer.from_pretrained(model_path)
encoded_input = tokenizer("今天天气真好")
print(encoded_input)
```
The output is:
```
{'input_ids': [101, 791, 1921, 1921, 3698, 4696, 1962, 102],
 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0],
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1]}
```
The output above consists of three parts:
- input_ids: the index of each token in the sentence.
- token_type_ids: when there are multiple sequences, identifies which sequence each token belongs to.
- attention_mask: indicates whether the corresponding token should be attended to (1 = attended to, 0 = not attended to; this relates to the attention mechanism).
We can also use the tokenizer to decode the input_ids back into text:
```python
decoded_input = tokenizer.decode(encoded_input["input_ids"])
print(decoded_input)
```
The output is:
```
[CLS] 今 天 天 气 真 好 [SEP]
```
We can see that the decoded output contains [CLS] and [SEP] in addition to the original text; these are special tokens used by BERT and similar models.
If you need to process multiple sentences at the same time, pass them to the tokenizer as a list.
(2) Padding (Pad)
When we process a batch of sentences, their lengths are not always the same, but the model's input needs to have a uniform shape. Padding is a strategy for meeting this requirement: special padding tokens are added to sentences that have fewer tokens.
Pass the argument padding=True to tokenizer():
```python
batch_sentences = ["今天天气真好",
                   "今天天气真好，适合出游"]
encoded_inputs = tokenizer(batch_sentences, padding=True)
print(encoded_inputs)
```
The output is:
```
{'input_ids':
  [[101, 791, 1921, 1921, 3698, 4696, 1962, 102, 0, 0, 0, 0, 0],
   [101, 791, 1921, 1921, 3698, 4696, 1962, 8024, 6844, 1394, 1139, 3952, 102]],
 'token_type_ids':
  [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
   [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]],
 'attention_mask':
  [[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0],
   [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}
```
You can see that the tokenizer padded the first (shorter) sentence with 0s.
(3) Truncation
When sentences are too short, padding can be used. But sometimes a sentence may be too long for the model to handle; in that case the sentence can be truncated.
Pass the argument truncation=True to tokenizer().
For more information about the padding and truncation parameters of tokenizer(), refer to the official documentation.
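A short sketch (max_length=8 is an arbitrary value chosen only to force truncation of the longer sentence; it reuses the tokenizer and batch_sentences defined above):
```python
encoded_inputs = tokenizer(batch_sentences,
                           padding="max_length",
                           truncation=True,
                           max_length=8)
# Both sequences are now exactly 8 tokens long; the longer one has been truncated
print([len(ids) for ids in encoded_inputs["input_ids"]])  # [8, 8]
```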
(4) Building tensors (Build tensors)
Finally, if we want the tokenizer to return actual tensors that can be fed to the model, we need to set the return_tensors parameter: "pt" for a PyTorch model, or "tf" for a TensorFlow model.
```python
batch_sentences = ["今天天气真好",
                   "今天天气真好，适合出游"]
encoded_inputs = tokenizer(batch_sentences,
                           padding=True, truncation=True,
                           return_tensors="pt")
print(encoded_inputs)
```
The output is:
```
{'input_ids':
  tensor([[ 101,  791, 1921, 1921, 3698, 4696, 1962,  102,    0,    0,    0,    0,    0],
          [ 101,  791, 1921, 1921, 3698, 4696, 1962, 8024, 6844, 1394, 1139, 3952,  102]]),
 'token_type_ids':
  tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
          [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]),
 'attention_mask':
  tensor([[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0],
          [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}
```
1.4.2 Other modalities
For audio data, preprocessing mainly consists of resampling, feature extraction, padding, and truncation; see the official documentation for details. For image data, preprocessing mainly consists of feature extraction and data augmentation. For multimodal data, each type of data is preprocessed with the corresponding method described above. Although the preprocessing methods differ across data types, the ultimate goal is the same: to transform the raw data into a form the model can accept.
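As a single illustration (the ViT checkpoint name and the image path are examples, not from the original article), image preprocessing follows the same pattern via a feature extractor:
```python
from PIL import Image
from transformers import AutoFeatureExtractor

feature_extractor = AutoFeatureExtractor.from_pretrained("google/vit-base-patch16-224")
image = Image.open("example.jpg")  # hypothetical local image file
inputs = feature_extractor(images=image, return_tensors="pt")
print(inputs["pixel_values"].shape)  # e.g. torch.Size([1, 3, 224, 224])
```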
1.5 Fine-tuning a pretrained model
The following uses multi-class text classification as an example to briefly introduce how to train a classification model on our own data.
1.5.1 Preparing the data
Before fine-tuning a pretrained model, we need to prepare the data. We can load a dataset with load_dataset from the Datasets library:
```python
from datasets import load_dataset

# Step 1: prepare the data
# Read the raw data from files
datasets = load_dataset("./my_dataset.py")
# Print the first sample of the training set
print(datasets["train"][0])
```
Note that because we are training with our own data, the argument passed to load_dataset above is the path to a .py file. This .py file reads the data files and returns the training data according to the rules of the Datasets library; for more information, see the official documentation.
If you just want to experiment with the Transformers library, you can use one of the datasets preset in the Datasets library. In that case the argument passed to load_dataset is a string (for example, load_dataset("imdb")), and the corresponding dataset is downloaded automatically, as shown below.
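A small sketch (downloading "imdb" requires network access); a preset dataset already comes with ready-made splits:
```python
from datasets import load_dataset

imdb = load_dataset("imdb")
print(imdb)  # a DatasetDict with 'train', 'test' and 'unsupervised' splits
print(imdb["train"][0]["label"], imdb["train"][0]["text"][:80])
```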
1.5.2 Preprocessing
Before feeding the data to the model, it needs to be preprocessed (tokenization, padding, truncation, and so on).
```python
from transformers import AutoTokenizer

# Step 2: preprocess the data
# 2.1 Load the tokenizer ("configure" is the author's own configuration dictionary)
tokenizer = AutoTokenizer.from_pretrained(configure["model_path"])

def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

# 2.2 Tokenize the dataset
tokenized_datasets = datasets.map(tokenize_function, batched=True)
print(tokenized_datasets["train"][0])
```
First we load the tokenizer; then we use datasets.map() to generate the preprocessed data. We use datasets.map() because data processed directly with tokenizer() is no longer in the dataset format.
1.5.3 Loading the model
The previous section introduced how to load a model; we can use AutoModelForXXX.from_pretrained() to load it:
```python
from transformers import AutoModelForSequenceClassification

# Step 3: load the model
classification_model = AutoModelForSequenceClassification.from_pretrained(
    configure["model_path"], num_labels=get_num_labels())
```
The difference from the previous section is the num_labels parameter in the code above, which must be set to the number of classes in our dataset.
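get_num_labels() is the author's own helper and its implementation is not shown; a minimal sketch of what it might do, assuming the label column is named "label":
```python
def get_num_labels():
    # Count the distinct labels in the training split (assumes a "label" column)
    return len(set(datasets["train"]["label"]))
```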
1.5.4 Setting the metrics
During training, we want the model's performance metrics (such as accuracy, precision, recall, and F1 score) to be reported so that we can track how training is going. This can be done with load_metric() provided by the Datasets library. The following code computes accuracy:
```python
import numpy as np
from datasets import load_metric

# Step 4: set the metrics
metric = load_metric("./accuracy.py")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
```
For more information, refer to the official documentation.
1.5.5 Setting the training hyperparameters
Model training also requires some hyperparameters; the Transformers library provides the TrainingArguments class for this.
```python
from transformers import TrainingArguments

# Step 5: set the training hyperparameters
training_args = TrainingArguments(output_dir=configure["output_dir"],
                                  evaluation_strategy="epoch")
```
In the code above we set two parameters: output_dir specifies the output path where the model is saved, and evaluation_strategy decides when the model is evaluated; setting it to "epoch" means the model is evaluated after every training epoch, using the metrics defined in the previous step.
To learn about more parameters and their meanings, refer to the official documentation, or see the sketch below.
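For illustration (the specific values are arbitrary, not from the original article), a few commonly used hyperparameters can be set in the same call:
```python
training_args = TrainingArguments(
    output_dir=configure["output_dir"],
    evaluation_strategy="epoch",
    learning_rate=2e-5,               # initial learning rate for the AdamW optimizer
    per_device_train_batch_size=16,   # training batch size per device
    per_device_eval_batch_size=16,    # evaluation batch size per device
    num_train_epochs=3,               # total number of training epochs
    weight_decay=0.01,                # weight decay applied by the optimizer
)
```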
1.5.6 Training and saving the model
After the previous steps, we can finally start training. The Transformers library provides the Trainer class, which makes training simple and convenient. First create a Trainer, then call its train() method to start training. When training has finished, call save_model() to save the model.
```python
from transformers import Trainer

# Step 6: train the model
trainer = Trainer(model=classification_model,
                  args=training_args,
                  train_dataset=tokenized_datasets["train"],
                  eval_dataset=tokenized_datasets["validation"],
                  tokenizer=tokenizer,
                  compute_metrics=compute_metrics)
trainer.train()
# Save the model
trainer.save_model()
```
Sometimes we need finer control for debugging, which requires writing our own training loop; for the detailed method, refer to the official documentation.
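A heavily condensed sketch of such a loop, assuming the tokenized dataset from step 2 and that the raw text column is named "text" and the label column "label" (rename or drop columns as needed so the batches match the model's forward signature):
```python
import torch
from torch.optim import AdamW
from torch.utils.data import DataLoader

# Keep only the columns the model expects and return PyTorch tensors
train_dataset = (tokenized_datasets["train"]
                 .remove_columns(["text"])
                 .rename_column("label", "labels"))
train_dataset.set_format("torch")
train_dataloader = DataLoader(train_dataset, shuffle=True, batch_size=8)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
classification_model.to(device)
optimizer = AdamW(classification_model.parameters(), lr=5e-5)

classification_model.train()
for epoch in range(3):
    for batch in train_dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = classification_model(**batch)  # the loss is computed because "labels" is present
        outputs.loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```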
1.5.7 Summary
With the steps above, we can now train our own multi-class text classification model.
The previous sections used multi-class text classification as the example for fine-tuning a pretrained model with the Transformers library. Other task types differ somewhat from text classification; for specific guidance, refer to the following links:
Task type | Reference link
---|---
Text classification | https://huggingface.co/docs/transformers/tasks/sequence_classification
Token classification (e.g. NER) | https://huggingface.co/docs/transformers/tasks/token_classification
Question answering | https://huggingface.co/docs/transformers/tasks/question_answering
Language modeling | https://huggingface.co/docs/transformers/tasks/language_modeling
Translation | https://huggingface.co/docs/transformers/tasks/translation
Summarization | https://huggingface.co/docs/transformers/tasks/summarization
Multiple choice | https://huggingface.co/docs/transformers/tasks/multiple_choice
Audio classification | https://huggingface.co/docs/transformers/tasks/audio_classification
Automatic speech recognition (ASR) | https://huggingface.co/docs/transformers/tasks/asr
Image classification | https://huggingface.co/docs/transformers/tasks/image_classification
References:
[1] Transformers GitHub repository: https://github.com/huggingface/transformers
[2] Official documentation: https://huggingface.co/docs/transformers/index
[4] https://github.com/nlp-with-transformers/notebooks
[5] https://github.com/datawhalechina/learn-nlp-with-transformers