Basic use of the Transformers library
This article introduces the basic use of the Transformers library.
1.1 Overview of the Transformers library
Transformers is an open-source library; all of the pre-trained models it provides are based on the Transformer architecture.
1.1.1 Transformers library
We can use the APIs provided by the Transformers library to easily download and train state-of-the-art pre-trained models. Using pre-trained models reduces computational cost and saves the time of training a model from scratch. These models can be used for tasks in different modalities, for example:
- Text: text classification, information extraction, question answering, summarization, machine translation, and text generation.
- Images: image classification, object detection, and image segmentation.
- Audio: speech recognition and audio classification.
- Multimodal: table question answering, OCR, information extraction from scanned documents, video classification, and visual question answering.
The Transformers library supports the three most popular deep learning frameworks (PyTorch, TensorFlow, and JAX).
The relevant resources are listed below:
| Resource | URL |
|---|---|
| GitHub repository | https://github.com/huggingface/transformers |
| Official documentation | https://huggingface.co/docs/transformers/index |
| Pre-trained models (Model Hub) | https://huggingface.co/models |
1.1.2 Models and frameworks supported by the Transformers library
The Transformers library currently supports the models listed below. For each model, the official documentation also indicates whether a slow tokenizer, a fast tokenizer, and PyTorch, TensorFlow, and Flax implementations are available; see the support matrix in the official documentation for those per-model details.

ALBERT, BART, BEiT, BERT, Bert Generation, BigBird, BigBirdPegasus, Blenderbot, BlenderbotSmall, CamemBERT, Canine, CLIP, ConvBERT, ConvNext, CTRL, Data2VecAudio, Data2VecText, Data2VecVision, DeBERTa, DeBERTa-v2, Decision Transformer, DeiT, DETR, DistilBERT, DPR, DPT, ELECTRA, Encoder decoder, FairSeq Machine-Translation, FlauBERT, Flava, FNet, Funnel Transformer, GLPN, GPT Neo, GPT-J, Hubert, I-BERT, ImageGPT, LayoutLM, LayoutLMv2, LED, Longformer, LUKE, LXMERT, M2M100, Marian, MaskFormer, mBART, MegatronBert, MobileBERT, MPNet, mT5, Nystromformer, OpenAI GPT, OpenAI GPT-2, OPT, Pegasus, Perceiver, PLBart, PoolFormer, ProphetNet, QDQBert, RAG, Realm, Reformer, RegNet, RemBERT, ResNet, RetriBERT, RoBERTa, RoFormer, SegFormer, SEW, SEW-D, Speech Encoder decoder, Speech2Text, Speech2Text2, Splinter, SqueezeBERT, Swin, T5, TAPAS, TAPEX, Transformer-XL, TrOCR, UniSpeech, UniSpeechSat, VAN, ViLT, Vision Encoder decoder, VisionTextDualEncoder, VisualBert, ViT, ViTMAE, Wav2Vec2, WavLM, XGLM, XLM, XLM-RoBERTa, XLM-RoBERTa-XL, XLMProphetNet, XLNet, YOLOS.
Note: a slow tokenizer is implemented in pure Python, while a fast tokenizer is backed by the Rust library Tokenizers.
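As a quick illustration of that distinction, here is a minimal sketch: the use_fast flag of AutoTokenizer.from_pretrained() selects between the two implementations, and is_fast reports which one was actually loaded. The checkpoint name bert-base-chinese is only an example and the snippet assumes network access.

```python
# A minimal sketch: loading both tokenizer variants of the same checkpoint.
# "bert-base-chinese" is just an example model name.
from transformers import AutoTokenizer

slow_tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese", use_fast=False)
fast_tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese", use_fast=True)

print(slow_tokenizer.is_fast)  # False -> pure-Python implementation
print(fast_tokenizer.is_fast)  # True  -> Rust-backed implementation
```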
1.2 Pipeline
pipeline() runs inference with a pre-trained model; it supports all models that can be downloaded from the Hugging Face Model Hub.
1.2.1 Task types supported by pipeline
pipeline() supports many common tasks:
- Text
  - Sentiment analysis
  - Text generation
  - Named entity recognition (NER)
  - Question answering
  - Fill-mask
  - Summarization
  - Translation
  - Feature extraction
- Images
  - Image classification
  - Image segmentation
  - Object detection
- Audio
  - Audio classification
  - Automatic speech recognition (ASR)
Note: the tasks supported by pipeline() can be found in the Transformers source code (see the SUPPORTED_TASKS definition in transformers/pipelines/__init__.py); different library versions support different task types.
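For example, a hedged way to inspect that table at runtime is sketched below. SUPPORTED_TASKS is an internal structure of the transformers.pipelines module, so its exact contents may change between versions, and some familiar names (such as "sentiment-analysis") may only appear as aliases of other tasks.

```python
# A hedged sketch: listing the task names registered in the pipelines module.
# SUPPORTED_TASKS is an internal dict and may change between library versions.
from transformers.pipelines import SUPPORTED_TASKS

for task_name in sorted(SUPPORTED_TASKS):
    print(task_name)
```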
1.2.2 Using pipeline
(1) Basic usage
For example, suppose we want to run a sentiment-analysis inference task. We can directly use the following code:
from transformers import pipeline
classifier = pipeline("sentiment-analysis")
result = classifier("We are very happy to show you the Transformers library.")
print(result)
The output is:
[{'label': 'POSITIVE', 'score': 0.9997795224189758}]
In the code above, pipeline("sentiment-analysis") downloads and caches a default pre-trained sentiment-analysis model together with its tokenizer. For other task types, the corresponding task names are documented under the task parameter of pipeline(); the default pre-trained model for each task can be found in the Transformers source code (see the SUPPORTED_TASKS definition in transformers/pipelines/__init__.py).
When we need to run inference on more than one sentence at a time, the sentences can be passed in as a list:
from transformers import pipeline
classifier = pipeline("sentiment-analysis")
results = classifier(["We are very happy to show you the Transformers library.",
"We hope you don't hate it."])
print(results)
The output is:
[{'label': 'POSITIVE', 'score': 0.9997795224189758},
 {'label': 'NEGATIVE', 'score': 0.5308570265769958}]
(2) Choosing a model
Above, inference used the default model for the task, but sometimes we want to use a specific model. This can be done via the model parameter of pipeline().
Method 1:
from transformers import pipeline
classifier = pipeline("sentiment-analysis",
model="IDEA-CCNL/Erlangshen-Roberta-110M-Sentiment")
result = classifier("今天心情很好")  # Chinese for "I am in a good mood today"
print(result)
The output is:
[{'label': 'Positive', 'score': 0.9374911785125732}]
Method 2 (this loads the same model as Method 1, but it also works with a locally stored copy of the model):
from transformers import AutoModelForSequenceClassification
from transformers import AutoTokenizer
from transformers import pipeline
model_path = r"../pretrained_model/IDEA-CCNL(Erlangshen-Roberta-110M-Sentiment)"
model = AutoModelForSequenceClassification.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)
classifier = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer)
result = classifier("今天心情很好")  # Chinese for "I am in a good mood today"
print(result)
The output is:
[{'label': 'Positive', 'score': 0.9374911785125732}]
Summary: the section above showed how to run inference for a text-classification task with pipeline(). Other text tasks, as well as image and audio tasks, are used in essentially the same way; see the official documentation for details.
1.3 Loading models
Next, we introduce several ways to load a model.
1.3.1 Randomly initializing the model weights
Sometimes the model weights need to be initialized randomly (for example, when pre-training on your own data). First initialize a config object, then pass it to the model as a parameter:
from transformers import BertConfig
from transformers import BertModel
config = BertConfig()
model = BertModel(config)
The config above uses default values, but we can change its parameters as needed. We can also load another model's config with AutoConfig.from_pretrained():
from transformers import AutoConfig
from transformers import AutoModel
model_path = r"../pretrained_model/IDEA-CCNL(Erlangshen-Roberta-110M-Sentiment)"
config = AutoConfig.from_pretrained(model_path)
model = AutoModel.from_config(config)
1.3.2 Initializing model weights from a pre-trained model
Sometimes we need to load the weights of a pre-trained model. In general, AutoModelForXXX.from_pretrained() is used to load the pre-trained model for the corresponding task; XXX varies because different task types use different classes. For example, to load a text sequence-classification model we need AutoModelForSequenceClassification:
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained(
"IDEA-CCNL/Erlangshen-Roberta-110M-Sentiment")
The first parameter of AutoModelForSequenceClassification.from_pretrained(), pretrained_model_name_or_path, can be either a model name (a string) or a local folder path.
from transformers import AutoModelForSequenceClassification
model_path = r"../pretrained_model/IDEA-CCNL(Erlangshen-Roberta-110M-Sentiment)"
model = AutoModelForSequenceClassification.from_pretrained(model_path)
We can also use a concrete model class, such as BertForSequenceClassification:
from transformers import BertForSequenceClassification
model_path = r"../pretrained_model/IDEA-CCNL(Erlangshen-Roberta-110M-Sentiment)"
model = BertForSequenceClassification.from_pretrained(model_path)
Note: the model classes above are all PyTorch models. For TensorFlow models, prefix the PyTorch class name with TF; for example, the TensorFlow counterpart of BertForSequenceClassification is TFBertForSequenceClassification.
Summary: the official recommendation is to load pre-trained models with AutoModelForXXX and TFAutoModelForXXX, which ensures that the correct architecture is loaded every time.
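A minimal sketch of the TF naming convention follows; it assumes TensorFlow is installed and reuses the assumed local model path from above, and from_pt=True converts the PyTorch weights on the fly when only a PyTorch checkpoint is available.

```python
# A minimal sketch (assumes TensorFlow is installed; the local path is the same
# assumed path used elsewhere in this article).
from transformers import TFAutoModelForSequenceClassification

model_path = r"../pretrained_model/IDEA-CCNL(Erlangshen-Roberta-110M-Sentiment)"
# from_pt=True converts the PyTorch checkpoint to TensorFlow weights on the fly.
tf_model = TFAutoModelForSequenceClassification.from_pretrained(model_path, from_pt=True)
```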
1.4 Preprocessing
Because the model itself cannot understand raw text, images, or audio, the data must first be converted into a form the model can accept and then passed to the model.
1.4.1 NLP: AutoTokenizer
The main tool for processing text data is the tokenizer. The tokenizer first splits text into tokens according to a set of rules; the tokens are then converted into numbers (according to the vocabulary, i.e. the vocab), and these numbers are assembled into tensors that become the model's input. Any other inputs the model requires are also added by the tokenizer.
When using a pre-trained model, be sure to use the corresponding pre-trained tokenizer. Only then is the text split in the same way as the pre-training corpus, using the same token-to-index mapping (the vocab).
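The two steps described above can be made visible with the lower-level tokenize() and convert_tokens_to_ids() methods. The sketch below assumes the same local model path used elsewhere in this article; the tokenizer() call shown in the next subsection performs both steps, plus adding special tokens, in a single call.

```python
# A minimal sketch of the two steps: split the text into tokens, then map tokens to vocab indices.
from transformers import AutoTokenizer

model_path = r"../pretrained_model/IDEA-CCNL(Erlangshen-Roberta-110M-Sentiment)"
tokenizer = AutoTokenizer.from_pretrained(model_path)

tokens = tokenizer.tokenize("今天天气真好")      # step 1: text -> tokens
ids = tokenizer.convert_tokens_to_ids(tokens)    # step 2: tokens -> vocabulary indices
print(tokens)
print(ids)  # no [CLS]/[SEP] yet; calling tokenizer(...) adds these special tokens itself
```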
(1) Tokenize
Use AutoTokenizer.from_pretrained() to load a pre-trained tokenizer, then pass the text to it:
from transformers import AutoTokenizer
model_path = r"../pretrained_model/IDEA-CCNL(Erlangshen-Roberta-110M-Sentiment)"
tokenizer = AutoTokenizer.from_pretrained(model_path)
encoded_input = tokenizer("今天天气真好")
print(encoded_input)
The output is:
{'input_ids': [101, 791, 1921, 1921, 3698, 4696, 1962, 102],
 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0],
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1]}
You can see that the output consists of three parts:
- input_ids: the vocabulary index of each token in the sentence.
- token_type_ids: when there are multiple sequences, identifies which sequence each token belongs to (a sentence-pair sketch follows this list).
- attention_mask: indicates whether the corresponding token should be attended to (1 = attend, 0 = ignore; this relates to the attention mechanism).
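For instance, token_type_ids is easiest to see with a sentence pair. The sketch below reuses the tokenizer loaded above and two Chinese phrases used elsewhere in this article; it is an illustrative assumption, not output from the original article, and the exact ids depend on the tokenizer.

```python
# A brief sketch: encoding a sentence pair, so the two segments get different token_type_ids.
pair_input = tokenizer("今天天气真好", "适合出游")
print(pair_input["token_type_ids"])  # 0s for the first segment (incl. its special tokens), 1s for the second
```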
We can also use the tokenizer to decode input_ids back into the original input:
decoded_input = tokenizer.decode(encoded_input["input_ids"])
print(decoded_input)
The output is:
[CLS] 今 天 天 气 真 好 [SEP]
In the decoded output, [CLS] and [SEP] appear in addition to the original text; they are special tokens used by BERT and similar models.
To process several sentences at once, pass the texts to the tokenizer as a list.
(2) Padding (Pad)
When we process a batch of sentences, their lengths are usually not the same, but the model's input must have a uniform shape. Padding is a strategy to meet this requirement: special padding tokens are added to the sentences with fewer tokens.
Pass padding=True to tokenizer():
batch_sentences = ["今天天气真好",
                   "今天天气真好，适合出游"]
encoded_inputs = tokenizer(batch_sentences, padding=True)
print(encoded_inputs)
The output is:
{'input_ids': [[101, 791, 1921, 1921, 3698, 4696, 1962, 102, 0, 0, 0, 0, 0],
               [101, 791, 1921, 1921, 3698, 4696, 1962, 8024, 6844, 1394, 1139, 3952, 102]],
 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
                    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]],
 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0],
                    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}
You can see that the tokenizer padded the first sentence with 0s.
(3) Truncation
Padding handles sentences that are too short, but sometimes a sentence may be too long for the model to handle. In that case the sentence can be truncated.
Simply pass truncation=True to tokenizer().
For more information about the padding and truncation parameters of tokenizer(), see the official documentation.
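As a small illustration (not from the original article), padding and truncation can also be combined with an explicit max_length; the value 10 below is arbitrary and the tokenizer is the same assumed local one used above.

```python
# A minimal sketch: pad short sentences and truncate long ones to a fixed length of 10.
encoded = tokenizer(["今天天气真好", "今天天气真好，适合出游"],
                    padding="max_length", truncation=True, max_length=10)
print([len(ids) for ids in encoded["input_ids"]])  # [10, 10]
```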
(4) Building tensors (Build tensors)
Finally, if we want the tokenizer to return actual tensors that can be passed to the model, we need to set the return_tensors parameter: "pt" for a PyTorch model, "tf" for a TensorFlow model.
batch_sentences = ["今天天气真好",
                   "今天天气真好，适合出游"]
encoded_inputs = tokenizer(batch_sentences,
padding=True, truncation=True,
return_tensors="pt")
print(encoded_inputs)
The output is:
{'input_ids': tensor([[ 101,  791, 1921, 1921, 3698, 4696, 1962,  102,    0,    0,    0,    0,    0],
                      [ 101,  791, 1921, 1921, 3698, 4696, 1962, 8024, 6844, 1394, 1139, 3952,  102]]),
 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
                           [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]),
 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0],
                           [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}
1.4.2 Other modalities
For audio data, preprocessing mainly consists of resampling, feature extraction, padding, and truncation. For image data, it mainly consists of feature extraction and data augmentation. For multimodal data, each data type is preprocessed with the corresponding method described above. See the official documentation for details. Although the preprocessing methods differ across data types, the goal is the same: to transform the raw data into a form the model can accept.
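As an illustrative sketch only: for images, the corresponding preprocessing object can be loaded with AutoFeatureExtractor. The checkpoint name google/vit-base-patch16-224 and the image file example.jpg below are assumptions, not part of the original article.

```python
# A hedged sketch of image preprocessing ("google/vit-base-patch16-224" and
# "example.jpg" are placeholder choices).
from PIL import Image
from transformers import AutoFeatureExtractor

feature_extractor = AutoFeatureExtractor.from_pretrained("google/vit-base-patch16-224")
image = Image.open("example.jpg")
inputs = feature_extractor(images=image, return_tensors="pt")
print(inputs["pixel_values"].shape)  # resized/normalized pixel tensor, e.g. [1, 3, 224, 224]
```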
1.5 Fine-tuning a pre-trained model
Using text multi-class classification as an example, the following briefly shows how to train a classification model on our own data.
1.5.1 Preparing the data
Before fine-tuning the pre-trained model, we first need to prepare the data. We can load a dataset with load_dataset from the Datasets library:
from datasets import load_dataset
# Step 1: prepare the data
# Load the raw data from files
datasets = load_dataset('./my_dataset.py')
# Print the first example of the training set
print(datasets["train"][0])
Note that because we use our own data for training, the argument passed to load_dataset above is the path to a .py file. That file reads the raw files and returns the training data according to the Datasets library's conventions; see the Datasets documentation for details.
If you only want to learn the basics of the Transformers library, you can instead use one of the datasets preset in the Datasets library. In that case the argument to load_dataset is a string (for example, load_dataset("imdb")), and the corresponding dataset is downloaded automatically, as in the sketch below.
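A minimal sketch with a built-in dataset (it assumes network access to download IMDB):

```python
# A minimal sketch: loading a preset dataset by name instead of a local script.
from datasets import load_dataset

imdb = load_dataset("imdb")
print(imdb)              # DatasetDict with train/test/unsupervised splits
print(imdb["train"][0])  # one example with "text" and "label" fields
```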
1.5.2 Preprocessing
Before being fed to the model, the data needs to be preprocessed (tokenization, padding, truncation, and so on).
from transformers import AutoTokenizer
# Step 2: preprocess the data
# 2.1 Load the tokenizer ("configure" is a user-defined dict holding paths such as the model path)
tokenizer = AutoTokenizer.from_pretrained(configure["model_path"])
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)
# 2.2 Tokenize the whole dataset
tokenized_datasets = datasets.map(tokenize_function, batched=True)
print(tokenized_datasets["train"][0])
First, load the tokenizer; then use datasets.map() to generate the preprocessed data. We use datasets.map() because data processed directly by tokenizer() is no longer in dataset format.
1.5.3 Loading the model
Model loading was introduced in an earlier section; we can use AutoModelForXXX.from_pretrained() to load the model:
from transformers import AutoModelForSequenceClassification
# Step 3: load the model
classification_model = AutoModelForSequenceClassification.from_pretrained(
    configure["model_path"], num_labels=get_num_labels())
The difference from the earlier section is the num_labels parameter, which must be set to the number of classes in our dataset.
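get_num_labels() is the author's own helper and its implementation is not shown; a plausible sketch, assuming the dataset's "label" column is a ClassLabel feature, might look like this:

```python
# A hypothetical implementation of get_num_labels(); it assumes the "label" column
# of the dataset loaded in step 1 is a ClassLabel feature.
def get_num_labels():
    return datasets["train"].features["label"].num_classes
```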
1.5.4 Setting metrics
During training we want the model's performance metrics (accuracy, precision, recall, F1, and so on) to be reported so that we can follow how training is going. This can be done with load_metric() from the Datasets library. The following code computes accuracy:
import numpy as np
from datasets import load_metric
# Step 4: set up the metric
metric = load_metric("./accuracy.py")
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
See the Datasets documentation for more information.
1.5.5 Setting training hyperparameters
Model training also requires some hyperparameters; the Transformers library provides the TrainingArguments class for this.
from transformers import TrainingArguments
# Step 5: set the training hyperparameters
training_args = TrainingArguments(output_dir=configure["output_dir"],
                                  evaluation_strategy="epoch")
In the code above we set two parameters: output_dir specifies where the trained model is saved, and evaluation_strategy decides when the model is evaluated. Setting it to "epoch" runs an evaluation after every training epoch, using the metrics configured in the previous step.
For more parameters and their exact meanings, see the documentation of TrainingArguments.
1.5.6 Training and saving the model
After the steps above, we can finally start training. The Transformers library provides the Trainer class, which makes training simple and convenient: create a Trainer, then call train() to start training. When training finishes, call save_model() to save the model.
from transformers import Trainer

# Step 6: train the model
trainer = Trainer(model=classification_model,
                  args=training_args,
                  train_dataset=tokenized_datasets["train"],
                  eval_dataset=tokenized_datasets["validation"],
                  tokenizer=tokenizer,
                  compute_metrics=compute_metrics)
trainer.train()
# Save the model
trainer.save_model()
Sometimes, to debug the model, we need to write our own training loop; the detailed procedure is in the official documentation, and a rough sketch follows.
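As a rough sketch only (not the article's code): a native PyTorch loop over the objects prepared in the previous steps might look as follows. Column names such as "text" and "label" are assumptions about the dataset schema, and the batch size, learning rate, and epoch count are arbitrary.

```python
# A hedged sketch of a manual training loop; it reuses tokenized_datasets and
# classification_model from the steps above and assumes "text"/"label" columns.
import torch
from torch.optim import AdamW
from torch.utils.data import DataLoader

train_data = tokenized_datasets["train"].remove_columns(["text"])
train_data = train_data.rename_column("label", "labels")
train_data.set_format("torch")
train_loader = DataLoader(train_data, batch_size=8, shuffle=True)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
classification_model.to(device)
optimizer = AdamW(classification_model.parameters(), lr=5e-5)

classification_model.train()
for epoch in range(3):
    for batch in train_loader:
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = classification_model(**batch)  # the model returns a loss when "labels" is present
        outputs.loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```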
1.5.7 Summary
With the introduction above, we can now start training our own text multi-class classification model.
The previous sections used text multi-class classification as an example of fine-tuning a pre-trained model with the Transformers library. Other task types differ somewhat from text classification; for specific guidance, refer to the following links:
| Task type | Reference link |
|---|---|
| Text classification | https://huggingface.co/docs/transformers/tasks/sequence_classification |
| Token classification (e.g. NER) | https://huggingface.co/docs/transformers/tasks/token_classification |
| Question answering | https://huggingface.co/docs/transformers/tasks/question_answering |
| Language modeling | https://huggingface.co/docs/transformers/tasks/language_modeling |
| Translation | https://huggingface.co/docs/transformers/tasks/translation |
| Summarization | https://huggingface.co/docs/transformers/tasks/summarization |
| Multiple choice | https://huggingface.co/docs/transformers/tasks/multiple_choice |
| Audio classification | https://huggingface.co/docs/transformers/tasks/audio_classification |
| Automatic speech recognition (ASR) | https://huggingface.co/docs/transformers/tasks/asr |
| Image classification | https://huggingface.co/docs/transformers/tasks/image_classification |