
Text classification and fine-tuning with the Transformers BERT pre-trained model

2022-06-24 05:56:00 Goose

0. A brief introduction to BERT

BERT stands for Bidirectional Encoder Representations from Transformers. Unlike ELMo, BERT pre-trains deep bidirectional representations by jointly conditioning on both the left and the right context in all layers, and it strengthens long-range semantic understanding by taking long, concatenated sentences as input. BERT can be fine-tuned for a wide range of tasks by adding just one extra output layer, without any task-specific changes to the model architecture, and it has achieved state-of-the-art results on text classification, semantic understanding, and other tasks.

The BERT paper describes two ways to use the pre-trained BERT model for domain-specific downstream tasks: fine-tuning and feature extraction.

  • The fine-tuning approach loads the pre-trained BERT model (which is, in essence, a pile of trained network weights), feeds it the dataset of the specific downstream task, and continues back-propagation so that the original weights are gradually adjusted into a model for the new task. In other words, the pre-trained BERT weights serve as the initialization of the network, which is a standard transfer-learning technique.
  • The feature-extraction approach calls the pre-trained BERT model to encode the sentences of the new task, turning sentences of arbitrary length into fixed-length vectors. Those vectors then become the input of whatever model you design yourself (an LSTM, an SVM, or anything else). In other words, BERT acts purely as a sentence feature encoder; no gradients are propagated back into it, and how the fixed-length sentence vectors are later fed into an LSTM and trained is no longer BERT's concern. This is a common way of using language models, similar to how ELMo is typically used. (A conceptual sketch of the two approaches follows this list.)
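
To make the distinction concrete, here is a minimal conceptual sketch (my own illustration, not from the original post): feature extraction freezes BERT's weights and only reads out sentence vectors, while fine-tuning leaves them trainable.

import transformers

model = transformers.DistilBertModel.from_pretrained('distilbert-base-uncased')

# (1) Feature extraction: freeze every BERT weight; the model is only used
#     to encode sentences, and no gradients flow back into it.
for p in model.parameters():
    p.requires_grad = False

# (2) Fine-tuning: leave requires_grad=True (the default) and keep training
#     the whole network, usually with a small task-specific output layer on
#     top, so back-propagation keeps adjusting the pre-trained weights.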

Let's first look at how to do text classification with the feature-extraction approach.

1. Background

This post records the process of using the Transformers BERT model for text classification. The model takes a sentence (a movie review) as input and outputs 1 (positive sentiment) or 0 (negative sentiment). We use the feature-extraction approach described above: BERT produces a sentence vector, which is then fed to a downstream classifier.

2. Loading the dataset and the pre-trained model

First, import the libraries and the dataset; here we use the SST2 movie-review dataset.

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
import torch
import transformers as ppb
import warnings
warnings.filterwarnings('ignore')
df = pd.read_csv('https://github.com/clairett/pytorch-sentiment-classification/raw/master/data/SST2/train.tsv', delimiter='\t', header=None)

# Work on a manageable subset; the notebook this post follows uses the first 2,000 reviews
batch_1 = df[:2000]

Next, load the pre-trained model and tokenizer with the transformers library.

# For DistilBERT:
model_class, tokenizer_class, pretrained_weights = (ppb.DistilBertModel, ppb.DistilBertTokenizer, 'distilbert-base-uncased')

## Want BERT instead of distilBERT? Uncomment the following line:
#model_class, tokenizer_class, pretrained_weights = (ppb.BertModel, ppb.BertTokenizer, 'bert-base-uncased')

# Load pretrained model/tokenizer
tokenizer = tokenizer_class.from_pretrained(pretrained_weights)
model = model_class.from_pretrained(pretrained_weights)

3. Model input

Before diving into the training code, let's look at how a trained model computes a prediction.

First, let's try to classify the sentence a visually stunning rumination on love.

Before the sentence can be fed into the model, it has to be tokenized:

Step 1: the BERT tokenizer splits the English words into standard tokens (for Chinese text it would split into characters/words);

Step 2: special tokens for sentence classification are added ([CLS] at the beginning and [SEP] at the end of the sentence);

Step 3: the tokenizer replaces every token with its id from the embedding table that ships with the trained model.
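
For example, here is what those three steps look like with the tokenizer loaded above (a sketch; the exact ids depend on the distilbert-base-uncased vocabulary):

sentence = "a visually stunning rumination on love"

tokens = tokenizer.tokenize(sentence)                      # step 1: WordPiece tokens
ids = tokenizer.encode(sentence, add_special_tokens=True)  # steps 2 and 3 in one call:
                                                           # adds [CLS]/[SEP] and maps
                                                           # every token to its vocab id
print(tokens)
print(ids)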

Once tokenization is done, the tokenized sequences are assembled into a two-dimensional array so that a whole batch can be fed into the BERT model at once, which is much faster than processing one sentence at a time.

# Tokenize every review and add the special [CLS]/[SEP] tokens
tokenized = batch_1[0].apply(lambda x: tokenizer.encode(x, add_special_tokens=True))

# Pad every sequence with zeros up to the length of the longest one
max_len = 0
for i in tokenized.values:
    if len(i) > max_len:
        max_len = len(i)

padded = np.array([i + [0]*(max_len-len(i)) for i in tokenized.values])
# 1 marks a real token, 0 marks padding
attention_mask = np.where(padded != 0, 1, 0)

Because padded now contains padding, the model cannot tell which positions hold real words and which are empty. That is what attention_mask is for: 1 means a real token, 0 means padding.

4. Using the pre-trained BERT model

Now we turn the padded token matrix (and the attention mask) into tensors, which is the input DistilBERT expects.

input_ids = torch.tensor(padded)
attention_mask = torch.tensor(attention_mask)

# Feature extraction only: no gradients are needed
with torch.no_grad():
    last_hidden_states = model(input_ids, attention_mask=attention_mask)

After this step, last_hidden_states holds DistilBERT's output: a tuple whose first element is a tensor of shape (batch_size, sequence_length, hidden_size).

For sentence classification we only care about BERT's output at the position of the [CLS] token, so we take a single slice of that 3-D tensor as the feature input of the downstream classification model:

# Slice out the hidden state of the first token ([CLS]) of every sentence
features = last_hidden_states[0][:,0,:].numpy()
labels = batch_1[1]
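
As a quick sanity check (my own addition), printing the shapes makes the slicing clear: distilbert-base-uncased has a hidden size of 768, so the [CLS] slice yields one 768-dimensional vector per sentence.

print(last_hidden_states[0].shape)  # (2000, max_len, 768): one vector per token
print(features.shape)               # (2000, 768): one vector per sentence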

5. Classification model training

Next, split the data into training and test sets and classify with a logistic-regression (LR) model.

train_features, test_features, train_labels, test_labels = train_test_split(features, labels)
lr_clf = LogisticRegression()
lr_clf.fit(train_features, train_labels)
lr_clf.score(test_features, test_labels)
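
The GridSearchCV imported earlier can be used to tune the regularization strength C; a small sketch along the lines of the referenced notebook (the value grid is just an example):

parameters = {'C': np.linspace(0.0001, 100, 20)}
grid_search = GridSearchCV(LogisticRegression(), parameters)
grid_search.fit(train_features, train_labels)
print('best parameters: ', grid_search.best_params_)
print('best score: ', grid_search.best_score_)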

For reference, the highest accuracy score reported for this dataset is currently 96.8. DistilBERT itself can be trained further to improve its score on this task, a process called fine-tuning, which updates BERT's weights so that it performs better on sentence classification (our downstream task). Fine-tuned DistilBERT reaches an accuracy of 90.7, and the standard BERT model reaches 94.9.

6. Appendix: fine-tuning without freezing BERT parameters

For Chinese text classification with PyTorch and BERT, see:

https://zhuanlan.zhihu.com/p/72448986

https://zhuanlan.zhihu.com/p/145192287

On whether to freeze BERT's parameters:

https://github.com/huggingface/transformers/issues/400

On model deployment (TorchServe):

https://zhuanlan.zhihu.com/p/344364948

7. Appendix: trying fine-tuning

Fine-tuning comes with certain constraints. The architecture of a pre-trained model is designed around its pre-training tasks, so if we want to continue back-propagation on top of the pre-trained model, the way we frame our downstream task has to be compatible with the pre-training tasks. So what exactly does BERT's pre-training do? BERT is trained on two tasks.

Task 1: Masked Language Model (Masked LM)

This task resembles a cloze test: some words of a corpus sentence are covered with the [MASK] symbol, and the model predicts which word was covered. For example, the original sentence my dog is hairy becomes my dog is [MASK] after masking. The vector of the final hidden layer at the [MASK] position is fed into a softmax layer over the vocabulary to predict the masked word. The concrete implementation has further subtleties, for example [MASK] appears in the training data but never at test time; see the paper for details, which we will not go into here.
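
A quick way to see the MLM head in action is the fill-mask pipeline from transformers (my own addition, not part of the original post's code):

from transformers import pipeline

# Loads bert-base-uncased together with its masked-language-model head
unmasker = pipeline("fill-mask", model="bert-base-uncased")
# Returns the top candidate tokens for the [MASK] position with their scores
print(unmasker("my dog is [MASK]."))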

Task 2: Next Sentence Prediction

Sentences in the corpus are ordered and adjacent. [SEP] is used as the separator between sentences and [CLS] as the sentence-level classification symbol. Some sentence pairs in the corpus are kept adjacent while others are shuffled and re-paired, producing training samples like the following:

Input = [CLS] the man went to [MASK] store [SEP] he bought a gallon [MASK] milk [SEP]
Label = IsNext

Input = [CLS] the man [MASK] to the store [SEP] penguin [MASK] are flight ##less birds [SEP]
Label = NotNext

These inputs go through the network, and the final hidden-layer embedding of the [CLS] symbol is fed into a softmax layer for binary classification, predicting whether the two sentences are adjacent.
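
The pre-trained NSP head can be queried directly; a sketch using the standard HuggingFace class (my own addition, assuming a reasonably recent transformers version):

import torch
from transformers import BertTokenizer, BertForNextSentencePrediction

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForNextSentencePrediction.from_pretrained('bert-base-uncased')

sent_a = "the man went to the store"
sent_b = "he bought a gallon of milk"
encoding = tokenizer(sent_a, sent_b, return_tensors='pt')  # adds [CLS] and [SEP]

with torch.no_grad():
    logits = model(**encoding).logits  # shape (1, 2)

# index 0 scores "B follows A" (IsNext), index 1 scores "B is random" (NotNext)
is_next = logits.argmax(dim=1).item() == 0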

As you can see, both tasks learn embeddings for the input tokens during training and then simply add an output layer on top of the last layer's embeddings to complete the task. The process may also introduce special symbols, and learning the embedding of a special symbol such as [CLS] is what makes the specific task possible.

So if we want to fine-tune for our own task, the idea is the same: build appropriately on top of the original model; usually adding a single output layer is enough.

In the figure from the BERT paper, (a) and (b) are sequence-level tasks while (c) and (d) are token-level tasks. (a) is sentence-pair classification and (b) is single-sentence classification; the structure is very simple: the hidden-layer output corresponding to [CLS] (the red arrow in the figure) is fed into a softmax output layer. (c) is reading comprehension and (d) is named entity recognition (NER); the construction is similar, with the hidden-layer outputs of the tokens indicated by the arrows fed into a classification output layer.

With task designs like the above, the pre-trained model can be fine-tuned for a variety of tasks. But this does not always apply: some NLP tasks are not well represented by the Transformer encoder architecture and instead need a model architecture tailored to the task. That is where the feature-based approach comes in handy.

Fine-tuning with HuggingFace is also very convenient; the code is as follows (small_train_dataset and small_eval_dataset are tokenized datasets; one way to build them is sketched after the code):

from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

# A BERT encoder with a fresh 2-class classification head on top
model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=2)

# Default training arguments; "test_trainer" is the output directory
training_args = TrainingArguments("test_trainer")

trainer = Trainer(
    model=model, args=training_args, train_dataset=small_train_dataset, eval_dataset=small_eval_dataset
)
trainer.train()
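
The small_train_dataset and small_eval_dataset above are not defined in the original post; one way to prepare them, following the HuggingFace training tutorial listed in the references (the IMDB dataset here is just an example choice):

from datasets import load_dataset
from transformers import AutoTokenizer

raw_datasets = load_dataset("imdb")
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

def tokenize_fn(examples):
    # Pad/truncate every review so that all examples have the same length
    return tokenizer(examples["text"], padding="max_length", truncation=True)

tokenized_datasets = raw_datasets.map(tokenize_fn, batched=True)
small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(1000))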

8. Appendix: further optimizations

You can try :

  1. Try different pre-trained models, such as RoBERTa, BERT-WWM, ALBERT
  2. Besides the [CLS] vector, the sentence can also be represented with average or max pooling over the token embeddings, or even by combining different layers (see the pooling sketch after this list)
  3. Continue pre-training incrementally on in-domain data
  4. Ensemble distillation: train several large models and distill the ensemble into a single model
  5. Train with multi-task learning first, then transfer to your own task
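
As an illustration of item 2, a minimal masked mean-pooling sketch (my own addition), reusing the last_hidden_states and attention_mask tensors from section 4:

# Average the token embeddings, ignoring padding positions
mask = attention_mask.unsqueeze(-1).float()          # (batch, seq_len, 1)
summed = (last_hidden_states[0] * mask).sum(dim=1)   # (batch, hidden)
counts = mask.sum(dim=1).clamp(min=1e-9)             # number of real tokens per sentence
mean_pooled = (summed / counts).numpy()              # an alternative to the [CLS] slice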

Ref

  1. A Visual Notebook to Using BERT for the First Time: https://colab.research.google.com/github/jalammar/jalammar.github.io/blob/master/notebooks/bert/A_Visual_Notebook_to_Using_BERT_for_the_First_Time.ipynb
  2. SST dataset: https://nlp.stanford.edu/sentiment/index.html
  3. https://cloud.tencent.com/developer/article/1555590
  4. https://work.padeoe.com/notes/bert.html
  5. HuggingFace BERT fine-tuning tutorial: https://huggingface.co/transformers/training.html
  6. BERT text classification and optimization: https://zhuanlan.zhihu.com/p/349086747