Summary of NLP Data Augmentation Methods
2022-06-25 08:29:00 【Happy little yard farmer】
NLP data augmentation
1. UDA (Unsupervised Data Augmentation) [recommended]
References:
[1]: https://github.com/google-research/uda “Unsupervised Data Augmentation”
[2]: https://arxiv.org/abs/1904.12848 “Unsupervised Data Augmentation for Consistency Training”
UDA is a semi-supervised learning method that reduces the need for labeled data and makes better use of unlabeled data.


The text augmentation technique used in UDA is back-translation: translating a sentence into another language and back preserves its meaning while producing varied sentence structures.
The key question UDA addresses is how to make better use of unlabeled data when only a small amount of labeled data is available.
For labeled data, a model $M = p_{\theta}(y|x)$ can be trained with ordinary supervised learning. For unlabeled data, semi-supervised learning is applied: noise is added to an unlabeled example $x$, and the same model predicts $p_{\theta}(y|\hat{x})$ on the noised version. Consistency training requires these two predictions to agree, that is, it minimizes the KL divergence between the two distributions: $\min \; D_{KL}\left(p_{\theta}(y|x) \,\|\, p_{\theta}(y|\hat{x})\right)$. Here $\hat{x} = q(x, \epsilon)$ is the augmented example obtained by adding noise $\epsilon$ to the unlabeled example. So how should the noise $\epsilon$ be chosen to obtain the augmented data $\hat{x}$?
- Valid noise: keeps the model's predictions on the original and the augmented unlabeled data consistent.
- Diverse noise: changes the input substantially without changing its label, which increases sample diversity instead of making only local changes as Gaussian noise does.
- Targeted inductive biases: different tasks call for different inductive biases.
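The consistency term above can be sketched in a few lines of plain Python. This is a minimal illustration, not the paper's implementation: the helper names (`softmax`, `kl_divergence`, `consistency_loss`) and the example logits are made up for the sketch, and in practice the logits would come from the model being trained.

```python
import math

def softmax(logits):
    """Convert raw scores into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """D_KL(p || q) for two discrete distributions given as lists."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )

def consistency_loss(logits_orig, logits_aug):
    """Mean KL divergence between predictions on x and on its noised version x_hat."""
    losses = [
        kl_divergence(softmax(lo), softmax(la))
        for lo, la in zip(logits_orig, logits_aug)
    ]
    return sum(losses) / len(losses)

# Hypothetical logits for two unlabeled examples and their augmented versions.
logits_orig = [[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]]
logits_aug = [[1.8, 0.7, -0.9], [0.1, 0.2, 3.0]]
loss = consistency_loss(logits_orig, logits_aug)
```

The loss is zero when the predictions on the original and the augmented inputs match exactly and grows as they diverge. During training this term would be combined with the usual supervised loss on the labeled data.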
The UDA paper runs experiments on image classification and text classification, using a different augmentation strategy for each:
- Image Classification: RandAugment
- Text Classification: Back-translation. A machine translation system translates the text through other languages and back, preserving the meaning while increasing sentence diversity.
- Text Classification: Word replacing with TF-IDF. Back-translation keeps the overall meaning intact but gives no control over which words are retained. In topic classification, certain keywords carry most of the information that determines the topic. This additional method therefore replaces uninformative words with low TF-IDF scores while keeping words with high TF-IDF scores.
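The TF-IDF replacement idea can be illustrated with a simplified sketch. It is not the UDA implementation: the function names, the `keep_ratio` cutoff, the choice of the lowest-IDF third of the vocabulary as replacement candidates, and the toy corpus are all assumptions made for this example.

```python
import math
import random
from collections import Counter

def idf_table(docs):
    """Inverse document frequency for every token in a tokenized corpus."""
    n_docs = len(docs)
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    return {tok: math.log(n_docs / df) for tok, df in doc_freq.items()}

def tfidf_replace(doc, idf, p=0.5, keep_ratio=0.5, rng=random):
    """Replace low TF-IDF tokens with probability p; keep high TF-IDF tokens."""
    tf = Counter(doc)
    scores = {tok: (tf[tok] / len(doc)) * idf[tok] for tok in tf}
    # Rank distinct tokens by score; the top keep_ratio share is never replaced.
    ranked = sorted(scores, key=scores.get, reverse=True)
    protected = set(ranked[: max(1, int(len(ranked) * keep_ratio))])
    # Replacement candidates: the lowest-IDF third of the vocabulary (assumption).
    vocab = sorted(idf, key=lambda t: (idf[t], t))
    candidates = vocab[: max(1, len(vocab) // 3)]
    out = []
    for tok in doc:
        if tok not in protected and rng.random() < p:
            out.append(rng.choice(candidates))
        else:
            out.append(tok)
    return out

# Toy corpus: the topic words are rare, the function words appear everywhere.
docs = [
    ["the", "movie", "was", "a", "thriller"],
    ["the", "match", "was", "a", "draw"],
    ["the", "election", "was", "a", "landslide"],
]
idf = idf_table(docs)
augmented = tfidf_replace(docs[0], idf, p=1.0, rng=random.Random(0))
```

In this toy run the keywords `movie` and `thriller` stay in place, and only the function words may be swapped for other uninformative words.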
2. EDA (Easy Data Augmentation)
References:
[1]: https://github.com/zhanlaoban/EDA_NLP_for_Chinese “EDA_NLP_for_Chinese”
[2]: https://arxiv.org/abs/1901.11196 “EDA: Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks”
[3]: https://github.com/jasonwei20/eda_nlp “EDA: Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks”
EDA consists of four data augmentation operations:
- Synonym Replacement (SR): randomly choose n words in the sentence that are not stop words and replace each with a randomly chosen synonym;
- Random Insertion (RI): pick a random word in the sentence that is not a stop word, pick one of its synonyms at random, and insert that synonym at a random position in the sentence. Repeat n times;
- Random Swap (RS): randomly choose two words in the sentence and swap their positions. Repeat n times;
- Random Deletion (RD): remove each word in the sentence independently with probability p.
A note on using EDA: control the number of augmented samples. Generate only a few per original sentence, because applying EDA too aggressively can change the meaning of the text and thereby hurt model performance.
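The four operations can be sketched as follows. This is a minimal illustration rather than the code from the repositories listed above: the stop-word set and the synonym dictionary are tiny hypothetical stand-ins for a real stop-word list and thesaurus, and the function names are chosen for this example.

```python
import random

# Toy resources, purely illustrative (assumptions for this sketch).
STOP_WORDS = {"the", "a", "is", "on", "of"}
SYNONYMS = {"quick": ["fast", "speedy"], "happy": ["glad", "joyful"], "movie": ["film"]}

def synonym_replacement(words, n, rng):
    """SR: replace up to n non-stop words that have a known synonym."""
    out = list(words)
    idxs = [i for i, w in enumerate(out) if w not in STOP_WORDS and w in SYNONYMS]
    rng.shuffle(idxs)
    for i in idxs[:n]:
        out[i] = rng.choice(SYNONYMS[out[i]])
    return out

def random_insertion(words, n, rng):
    """RI: n times, insert a synonym of a random non-stop word at a random position."""
    out = list(words)
    for _ in range(n):
        pool = [w for w in out if w not in STOP_WORDS and w in SYNONYMS]
        if not pool:
            break
        synonym = rng.choice(SYNONYMS[rng.choice(pool)])
        out.insert(rng.randint(0, len(out)), synonym)
    return out

def random_swap(words, n, rng):
    """RS: n times, swap the words at two randomly chosen positions."""
    out = list(words)
    if len(out) < 2:
        return out
    for _ in range(n):
        i, j = rng.sample(range(len(out)), 2)
        out[i], out[j] = out[j], out[i]
    return out

def random_deletion(words, p, rng):
    """RD: drop each word independently with probability p; never return an empty list."""
    if len(words) <= 1:
        return list(words)
    kept = [w for w in words if rng.random() >= p]
    return kept if kept else [rng.choice(words)]

rng = random.Random(42)
sentence = "the quick fox is happy".split()
sr = synonym_replacement(sentence, 1, rng)
ri = random_insertion(sentence, 2, rng)
rs = random_swap(sentence, 1, rng)
rd = random_deletion(sentence, 0.2, rng)
```

Keeping n and p small, as in this usage, limits how far each augmented sentence drifts from the original meaning.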