Summary of NLP Data Augmentation Methods
2022-06-25 08:29:00 【Happy little yard farmer】
NLP Data Augmentation
1. UDA (Unsupervised Data Augmentation) [recommended]
References:
[1]: https://github.com/google-research/uda “Unsupervised Data Augmentation”
[2]: https://arxiv.org/abs/1904.12848 “Unsupervised Data Augmentation for Consistency Training”
UDA is a semi-supervised learning method that reduces the need for labeled data and increases the utilization of unlabeled data.


The language augmentation technique used by UDA is back-translation: it keeps the semantics of a sentence unchanged while generating diverse sentence patterns.
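The back-translation mechanism can be sketched in a few lines. The dictionary-based "translators" below are toy stand-ins for a real machine translation system (in practice both directions would be neural MT models); the function names are illustrative assumptions, not UDA's actual API.

```python
def back_translate(sentence, to_pivot, from_pivot):
    """Round-trip a sentence through a pivot language to obtain a paraphrase."""
    return from_pivot(to_pivot(sentence))

# Toy word-level "translators" standing in for a real MT system.
EN2DE = {"the": "die", "cat": "katze", "sleeps": "schläft"}
DE2EN = {"die": "the", "katze": "cat", "schläft": "is sleeping"}

def en_to_de(s):
    return " ".join(EN2DE.get(w, w) for w in s.split())

def de_to_en(s):
    return " ".join(DE2EN.get(w, w) for w in s.split())

print(back_translate("the cat sleeps", en_to_de, de_to_en))
# the round trip yields a paraphrase: "the cat is sleeping"
```

Because the two directions are not exact inverses, the round trip produces a sentence with the same meaning but different surface form, which is exactly the diversity UDA wants.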
The key problem UDA solves: given only a small amount of labeled data, how can the use of unlabeled data be increased?
For the given labeled data, a model $M = p_{\theta}(y|x)$ can be learned by standard supervised learning. For the unlabeled data, semi-supervised learning is performed: taking the labeled data distribution as a reference, the model learns $p_{\theta}(y|\hat{x})$ after noise is added to the unlabeled data. To enforce consistency training, the difference between the model's predictions on the original and the noised example must be minimized, i.e. the KL divergence between the two distributions: $\min \; D_{KL}\!\left(p_{\theta}(y|x) \,\|\, p_{\theta}(y|\hat{x})\right)$, where $\hat{x} = q(x, \epsilon)$ is the augmented example obtained by adding noise to the unlabeled data. So how should the noise $\epsilon$ be added to obtain the augmented data $\hat{x}$? The noise should satisfy three properties:
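The consistency objective above can be sketched directly; a minimal version assuming discrete class distributions (the function name and the example probabilities are illustrative, and a real implementation would use the framework's KL loss over model logits):

```python
import math

def kl_divergence(p, q):
    """D_KL(p || q) between two discrete distributions over the same labels."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# p_theta(y|x): prediction on a clean unlabeled example
# p_theta(y|x_hat): prediction on its noised/augmented version
p_clean = [0.7, 0.2, 0.1]
p_aug   = [0.6, 0.3, 0.1]
consistency_loss = kl_divergence(p_clean, p_aug)  # minimized during training
```

The loss is zero exactly when the two predictions agree, so minimizing it pushes the model to be invariant to the augmentation noise.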
- valid noise: the predictions on the original and the augmented unlabeled data should remain consistent.
- diverse noise: make large changes to the input without changing its label, increasing sample diversity, rather than only making local changes such as Gaussian perturbations.
- targeted inductive biases: different tasks require different inductive biases.
The UDA paper reports experiments on both image classification and text classification, using a different augmentation strategy for each:
- Image classification: RandAugment.
- Text classification: back-translation, which preserves semantics; a machine translation system translates through multiple languages to increase sentence diversity.
- Text classification: word replacement with TF-IDF. Back-translation can keep the global semantics unchanged, but it cannot control whether a particular word is retained. For classification tasks, certain keywords carry more information for determining the topic. The additional augmentation therefore replaces uninformative words with low TF-IDF scores while retaining words with high TF-IDF scores.
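A minimal sketch of the TF-IDF-based word replacement described above. The whitespace tokenization, toy corpus, threshold, and replacement vocabulary are all illustrative assumptions; UDA's actual implementation samples replacement words by corpus frequency.

```python
import math
import random

def tfidf_scores(doc, corpus):
    """Per-word TF-IDF for one whitespace-tokenized document."""
    words = doc.split()
    tf = {w: words.count(w) / len(words) for w in words}
    idf = {w: math.log(len(corpus) / sum(1 for d in corpus if w in d.split()))
           for w in tf}
    return {w: tf[w] * idf[w] for w in tf}

def tfidf_word_replace(doc, corpus, vocab, threshold, rng):
    """Replace low-TF-IDF (uninformative) words; keep high-TF-IDF keywords."""
    scores = tfidf_scores(doc, corpus)
    return " ".join(w if scores[w] >= threshold else rng.choice(vocab)
                    for w in doc.split())

corpus = ["the match was great", "the stock fell", "the football match"]
out = tfidf_word_replace("the football match", corpus, ["banana"], 0.1,
                         random.Random(0))
print(out)  # banana football match
```

"the" appears in every document, so its IDF (and TF-IDF) is zero and it gets replaced, while topic-bearing words like "football" survive, which is the behavior the augmentation relies on.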
2. EDA (Easy Data Augmentation)
References:
[1]: https://github.com/zhanlaoban/EDA_NLP_for_Chinese “EDA_NLP_for_Chinese”
[2]: https://arxiv.org/abs/1901.11196 “EDA: Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks”
[3]: https://github.com/jasonwei20/eda_nlp “EDA: Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks”
EDA's four data augmentation operations:
- Synonym Replacement (SR): randomly choose n words from the sentence that are not stop words, and replace each with a randomly chosen synonym;
- Random Insertion (RI): randomly pick a word in the sentence that is not a stop word, find a random synonym of it, and insert that synonym at a random position in the sentence; repeat n times;
- Random Swap (RS): randomly choose two words in the sentence and swap their positions; repeat n times;
- Random Deletion (RD): remove each word in the sentence with probability p;
A caveat when using EDA: control the number of generated samples and augment only lightly, without expanding the data too much, because applying EDA operations too aggressively may change the semantics and thus reduce model performance.
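The four EDA operations can be sketched as follows. The tiny synonym dictionary and stop-word set are illustrative placeholders; in practice one would use WordNet (or a Chinese thesaurus, as in the EDA_NLP_for_Chinese repo).

```python
import random

SYNONYMS = {"quick": ["fast", "swift"], "happy": ["glad", "joyful"]}  # toy thesaurus
STOPWORDS = {"the", "a", "is", "was"}

def synonym_replacement(words, n, rng):
    """SR: replace up to n non-stop-words with a random synonym."""
    out = list(words)
    candidates = [i for i, w in enumerate(out)
                  if w in SYNONYMS and w not in STOPWORDS]
    for i in rng.sample(candidates, min(n, len(candidates))):
        out[i] = rng.choice(SYNONYMS[out[i]])
    return out

def random_insertion(words, n, rng):
    """RI: insert a synonym of a random word at a random position, n times."""
    out = list(words)
    for _ in range(n):
        with_syns = [w for w in out if w in SYNONYMS]
        if not with_syns:
            break
        syn = rng.choice(SYNONYMS[rng.choice(with_syns)])
        out.insert(rng.randrange(len(out) + 1), syn)
    return out

def random_swap(words, n, rng):
    """RS: swap the words at two random positions, n times."""
    out = list(words)
    for _ in range(n):
        i, j = rng.sample(range(len(out)), 2)
        out[i], out[j] = out[j], out[i]
    return out

def random_deletion(words, p, rng):
    """RD: drop each word independently with probability p."""
    kept = [w for w in words if rng.random() > p]
    return kept if kept else [rng.choice(words)]  # never return an empty sentence
```

Each operation takes and returns a token list, so they compose easily; keeping n small and p low follows the caveat above about not distorting semantics.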