Summary of NLP Data Augmentation Methods
2022-06-25 08:29:00 【Happy little yard farmer】
NLP Data Augmentation
1. UDA (Unsupervised Data Augmentation) [recommended]
References:
[1]: https://github.com/google-research/uda “Unsupervised Data Augmentation”
[2]: https://arxiv.org/abs/1904.12848 “Unsupervised Data Augmentation for Consistency Training”
UDA is a semi-supervised learning method that reduces the need for labeled data and increases the utilization of unlabeled data.


The language augmentation technique used by UDA is back-translation: it keeps the semantics of a sentence unchanged while generating diverse sentence patterns.
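The back-translation mechanism can be sketched in a few lines. The dictionary-based "translators" below are toy stand-ins for a real machine translation system (in practice both directions would be neural MT models); the function names are illustrative assumptions, not UDA's actual API.

```python
def back_translate(sentence, to_pivot, from_pivot):
    """Round-trip a sentence through a pivot language to obtain a paraphrase."""
    return from_pivot(to_pivot(sentence))

# Toy word-level "translators" standing in for a real MT system.
EN2DE = {"the": "die", "cat": "katze", "sleeps": "schläft"}
DE2EN = {"die": "the", "katze": "cat", "schläft": "is sleeping"}

def en_to_de(s):
    return " ".join(EN2DE.get(w, w) for w in s.split())

def de_to_en(s):
    return " ".join(DE2EN.get(w, w) for w in s.split())

print(back_translate("the cat sleeps", en_to_de, de_to_en))
# the round trip yields a paraphrase: "the cat is sleeping"
```

Because the two directions are not exact inverses, the round trip produces a sentence with the same meaning but different surface form, which is exactly the diversity UDA wants.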
The key problem UDA solves: given only a small amount of labeled data, how can the use of unlabeled data be increased?
For the given labeled data, a model $M = p_{\theta}(y|x)$ can be learned by standard supervised learning. For the unlabeled data, semi-supervised learning is performed: taking the labeled data distribution as a reference, the model learns $p_{\theta}(y|\hat{x})$ after noise is added to the unlabeled data. To enforce consistency training, the difference between the model's predictions on the original and the noised example must be minimized, i.e. the KL divergence between the two distributions: $\min \; D_{KL}\!\left(p_{\theta}(y|x) \,\|\, p_{\theta}(y|\hat{x})\right)$, where $\hat{x} = q(x, \epsilon)$ is the augmented example obtained by adding noise to the unlabeled data. So how should the noise $\epsilon$ be added to obtain the augmented data $\hat{x}$? The noise should satisfy three properties:
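The consistency objective above can be sketched directly; a minimal version assuming discrete class distributions (the function name and the example probabilities are illustrative, and a real implementation would use the framework's KL loss over model logits):

```python
import math

def kl_divergence(p, q):
    """D_KL(p || q) between two discrete distributions over the same labels."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# p_theta(y|x): prediction on a clean unlabeled example
# p_theta(y|x_hat): prediction on its noised/augmented version
p_clean = [0.7, 0.2, 0.1]
p_aug   = [0.6, 0.3, 0.1]
consistency_loss = kl_divergence(p_clean, p_aug)  # minimized during training
```

The loss is zero exactly when the two predictions agree, so minimizing it pushes the model to be invariant to the augmentation noise.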
- valid noise: the predictions on the original and the augmented unlabeled data should remain consistent.
- diverse noise: make large changes to the input without changing its label, increasing sample diversity, rather than only making local changes such as Gaussian perturbations.
- targeted inductive biases: different tasks require different inductive biases.
The UDA paper reports experiments on both image classification and text classification, using a different augmentation strategy for each:
- Image classification: RandAugment.
- Text classification: back-translation, which preserves semantics; a machine translation system translates through multiple languages to increase sentence diversity.
- Text classification: word replacement with TF-IDF. Back-translation can keep the global semantics unchanged, but it cannot control whether a particular word is retained. For classification tasks, certain keywords carry more information for determining the topic. The additional augmentation therefore replaces uninformative words with low TF-IDF scores while retaining words with high TF-IDF scores.
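A minimal sketch of the TF-IDF-based word replacement described above. The whitespace tokenization, toy corpus, threshold, and replacement vocabulary are all illustrative assumptions; UDA's actual implementation samples replacement words by corpus frequency.

```python
import math
import random

def tfidf_scores(doc, corpus):
    """Per-word TF-IDF for one whitespace-tokenized document."""
    words = doc.split()
    tf = {w: words.count(w) / len(words) for w in words}
    idf = {w: math.log(len(corpus) / sum(1 for d in corpus if w in d.split()))
           for w in tf}
    return {w: tf[w] * idf[w] for w in tf}

def tfidf_word_replace(doc, corpus, vocab, threshold, rng):
    """Replace low-TF-IDF (uninformative) words; keep high-TF-IDF keywords."""
    scores = tfidf_scores(doc, corpus)
    return " ".join(w if scores[w] >= threshold else rng.choice(vocab)
                    for w in doc.split())

corpus = ["the match was great", "the stock fell", "the football match"]
out = tfidf_word_replace("the football match", corpus, ["banana"], 0.1,
                         random.Random(0))
print(out)  # banana football match
```

"the" appears in every document, so its IDF (and TF-IDF) is zero and it gets replaced, while topic-bearing words like "football" survive, which is the behavior the augmentation relies on.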
2. EDA (Easy Data Augmentation)
References:
[1]: https://github.com/zhanlaoban/EDA_NLP_for_Chinese “EDA_NLP_for_Chinese”
[2]: https://arxiv.org/abs/1901.11196 “EDA: Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks”
[3]: https://github.com/jasonwei20/eda_nlp “EDA: Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks”
EDA's four data augmentation operations:
- Synonym Replacement (SR): randomly choose n words from the sentence that are not stop words, and replace each with a randomly chosen synonym;
- Random Insertion (RI): randomly pick a word in the sentence that is not a stop word, find a random synonym of it, and insert that synonym at a random position in the sentence; repeat n times;
- Random Swap (RS): randomly choose two words in the sentence and swap their positions; repeat n times;
- Random Deletion (RD): remove each word in the sentence with probability p;
A caveat when using EDA: control the number of generated samples and augment only lightly, without expanding the data too much, because applying EDA operations too aggressively may change the semantics and thus reduce model performance.
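The four EDA operations can be sketched as follows. The tiny synonym dictionary and stop-word set are illustrative placeholders; in practice one would use WordNet (or a Chinese thesaurus, as in the EDA_NLP_for_Chinese repo).

```python
import random

SYNONYMS = {"quick": ["fast", "swift"], "happy": ["glad", "joyful"]}  # toy thesaurus
STOPWORDS = {"the", "a", "is", "was"}

def synonym_replacement(words, n, rng):
    """SR: replace up to n non-stop-words with a random synonym."""
    out = list(words)
    candidates = [i for i, w in enumerate(out)
                  if w in SYNONYMS and w not in STOPWORDS]
    for i in rng.sample(candidates, min(n, len(candidates))):
        out[i] = rng.choice(SYNONYMS[out[i]])
    return out

def random_insertion(words, n, rng):
    """RI: insert a synonym of a random word at a random position, n times."""
    out = list(words)
    for _ in range(n):
        with_syns = [w for w in out if w in SYNONYMS]
        if not with_syns:
            break
        syn = rng.choice(SYNONYMS[rng.choice(with_syns)])
        out.insert(rng.randrange(len(out) + 1), syn)
    return out

def random_swap(words, n, rng):
    """RS: swap the words at two random positions, n times."""
    out = list(words)
    for _ in range(n):
        i, j = rng.sample(range(len(out)), 2)
        out[i], out[j] = out[j], out[i]
    return out

def random_deletion(words, p, rng):
    """RD: drop each word independently with probability p."""
    kept = [w for w in words if rng.random() > p]
    return kept if kept else [rng.choice(words)]  # never return an empty sentence
```

Each operation takes and returns a token list, so they compose easily; keeping n small and p low follows the caveat above about not distorting semantics.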