NLP data augmentation techniques
2022-06-26 01:37:00 【Green Lantern swordsman】
Yesterday I ran into an old friend who asked me what I knew about NLP data augmentation. I was caught off guard. Data augmentation originated in image processing, and I later read a detailed explanation of it in the book 《Baimian Machine Learning》. But data augmentation for NLP? As it turns out, I had used it before.
In the voice assistant project, the first thing I did was augment the defined data that serves as the input corpus. Our main model, fastText, also involves data augmentation, so I have written several exploratory summaries on it as well.
After getting back, I came across some material on NLP data augmentation, so I decided to look into it.
This article mainly draws on a well-known Zhihu post [1].
- Word substitution
1.1 Dictionary-based substitution
(1) Verbs are collected into a small set; interjections and person-reference words each have a dictionary, so they can be swapped freely.
(2) Replacing resource names is the hard part. Take the children's education skill as an example: some resources belong exclusively to children's education and others do not, so care is needed, and counterexample (negative) sentences should be added as well. In this situation we also rely on word embeddings to get a handle on the problem.
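Dictionary-based substitution as described in 1.1 can be sketched as follows. This is a minimal illustration, not the voice assistant's actual code: the `SUBSTITUTIONS` table, the function name `dictionary_substitute`, and the replacement probability `p` are all assumptions made for the example.

```python
import random

# Hypothetical substitution dictionary: each key maps to words that may replace it.
SUBSTITUTIONS = {
    "play": ["start", "put on"],
    "please": ["kindly"],
    "I": ["we"],
}

def dictionary_substitute(tokens, table, rng, p=0.5):
    """Replace each token that has a dictionary entry with probability p."""
    out = []
    for tok in tokens:
        choices = table.get(tok)
        if choices and rng.random() < p:
            out.append(rng.choice(choices))  # pick one interchangeable word
        else:
            out.append(tok)                  # keep tokens without an entry
    return out

rng = random.Random(0)  # fixed seed so runs are repeatable
augmented = dictionary_substitute("please play a story".split(), SUBSTITUTIONS, rng, p=1.0)
print(" ".join(augmented))
```

Resource names are deliberately absent from the table here, since the text notes that they cannot be swapped freely across skills.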
1.2 Word-vector-based substitution
I have never used this. The pretrained models are too large, so it is probably not easy to put into practice.
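To make the idea of word-vector-based substitution concrete, here is a small sketch that replaces each known word with its nearest neighbour by cosine similarity. The three-dimensional vectors in `TOY_VECTORS` are invented for illustration; a real setup would load pretrained embeddings, which is exactly the size problem mentioned above.

```python
import math

# Toy vectors, invented for illustration only (not from any pretrained model).
TOY_VECTORS = {
    "good":  [0.9, 0.1, 0.0],
    "great": [0.8, 0.2, 0.1],
    "bad":   [-0.9, 0.1, 0.0],
    "movie": [0.0, 0.9, 0.3],
    "film":  [0.1, 0.8, 0.4],
}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest_word(word, vectors):
    """Return the most similar other word, or the word itself if it has no vector."""
    if word not in vectors:
        return word
    others = (w for w in vectors if w != word)
    return max(others, key=lambda w: cosine(vectors[word], vectors[w]))

def embedding_substitute(tokens, vectors):
    """Replace every token that has a vector with its nearest neighbour."""
    return [nearest_word(t, vectors) for t in tokens]

print(embedding_substitute(["a", "good", "movie"], TOY_VECTORS))
# → ['a', 'great', 'film']
```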
1.3 Masked-language-model-based substitution (MLM augmentation)
Hard to apply, and the output is not easy to control.
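The mechanics of MLM augmentation can be sketched without a model: mask one token, then let a predictor propose what goes in the gap. In the sketch below the predictor is a stand-in (`toy_fill_mask`, purely hypothetical); a real one would query a pretrained masked language model, and that step is where the control problem arises, because the model may propose a word that changes the label.

```python
import random

MASK = "[MASK]"

def mask_one_token(tokens, rng):
    """Return a copy of tokens with one random position masked, plus that position."""
    pos = rng.randrange(len(tokens))
    masked = list(tokens)
    masked[pos] = MASK
    return masked, pos

def mlm_augment(tokens, fill_mask, rng):
    """Mask one token and let fill_mask(masked_tokens, pos) choose its replacement."""
    masked, pos = mask_one_token(tokens, rng)
    masked[pos] = fill_mask(masked, pos)
    return masked

# Stand-in predictor (hypothetical): always proposes the same word.
def toy_fill_mask(masked_tokens, pos):
    return "really"

rng = random.Random(0)
print(mlm_augment(["this", "is", "very", "cool"], toy_fill_mask, rng))
```

One way to regain some control, under the same assumptions, would be to have `fill_mask` return several scored candidates and keep only those above a threshold.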
1.4 TF-IDF-based word replacement
TF (term frequency) rewards words that occur often; IDF (inverse document frequency) rewards words that discriminate well between documents. The original idea of this technique is to get rid of the unimportant words, that is, those with low TF-IDF scores, by replacing them.
- Back translation
This method is very useful for augmenting text-similarity datasets.
- Text form transformation
This refers specifically to expanding or contracting word forms, for example "it is" and "it's".
- Random noise injection
This rests on one assumption: a small perturbation of a sample should leave the model's prediction unchanged.
4.1 Introduce spelling errors
We do this one. We save the erroneous outputs of speech recognition and use them as aliases for the official resource names.
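The alias approach described in 4.1 can be sketched as a lookup table from each official resource name to its misrecognized forms. The table contents and both function names below are hypothetical examples, not the project's real data or code.

```python
# Hypothetical alias table: official resource name -> misrecognized forms
# collected from speech recognition output.
ALIASES = {
    "Snow White": ["Snow Wide", "Slow White"],
}

def expand_with_aliases(sentence, aliases):
    """Return one noisy variant of the sentence per recorded alias."""
    variants = []
    for name, wrong_forms in aliases.items():
        if name in sentence:
            for wrong in wrong_forms:
                variants.append(sentence.replace(name, wrong))
    return variants

def canonical_name(text, aliases):
    """Map a possibly misrecognized name back to the official one, else None."""
    for name, wrong_forms in aliases.items():
        if text == name or text in wrong_forms:
            return name
    return None

print(expand_with_aliases("play Snow White", ALIASES))
# → ['play Snow Wide', 'play Slow White']
print(canonical_name("Snow Wide", ALIASES))
# → Snow White
```

The same table serves two purposes in this sketch: generating noisy training sentences and resolving a noisy name at query time.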
4.2 Unigram noise and blank noise
The first mixes in useless high-frequency words; the second mixes in a fixed placeholder symbol.
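A minimal sketch of the two kinds of noise in 4.2 follows. It assumes the noise replaces existing tokens, which is one common formulation; inserting the noise instead would be a small change. The function names, the probability `p`, and the `"_"` placeholder are assumptions for the example.

```python
import random
from collections import Counter

def unigram_noise(tokens, corpus_tokens, rng, p=0.1):
    """With probability p, replace a token by a word drawn according to corpus frequency."""
    counts = Counter(corpus_tokens)
    vocab = list(counts)
    weights = [counts[w] for w in vocab]  # frequent words are drawn more often
    return [rng.choices(vocab, weights=weights)[0] if rng.random() < p else t
            for t in tokens]

def blank_noise(tokens, rng, p=0.1, blank="_"):
    """With probability p, replace a token by a fixed placeholder symbol."""
    return [blank if rng.random() < p else t for t in tokens]

rng = random.Random(0)
corpus = "the the the a a of".split()  # toy corpus, invented for illustration
print(unigram_noise(["play", "some", "music"], corpus, rng, p=0.5))
print(blank_noise(["play", "some", "music"], rng, p=0.5))
```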
4.3 Other simple methods
(1) Sentence order shuffle
(2) Randomly exchange the order of two words
(3) Insert words or sentences randomly
(4) Random deletion
- Instance crossover augmentation
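Instance crossover augmentation is listed here without further detail. One common reading, assumed in this sketch, is that two samples with the same label are each split in half and their second halves are swapped, producing two new samples that keep the label. The function names and the sample data are hypothetical.

```python
import random

def crossover(text_a, text_b):
    """Swap the second halves of two whitespace-tokenized texts."""
    a, b = text_a.split(), text_b.split()
    child_1 = a[:len(a) // 2] + b[len(b) // 2:]
    child_2 = b[:len(b) // 2] + a[len(a) // 2:]
    return " ".join(child_1), " ".join(child_2)

def crossover_augment(samples, rng):
    """samples: list of (text, label). Cross randomly paired texts that share a label."""
    by_label = {}
    for text, label in samples:
        by_label.setdefault(label, []).append(text)
    out = []
    for label, texts in by_label.items():
        rng.shuffle(texts)
        # Pair neighbours after shuffling; an odd leftover text is skipped.
        for a, b in zip(texts[::2], texts[1::2]):
            for child in crossover(a, b):
                out.append((child, label))
    return out

print(crossover("the food was really great", "I loved this place"))
# → ('the food this place', 'I loved was really great')
```

The children are not guaranteed to be grammatical; the assumption is only that the label survives the swap.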