NLP - monocleaner
2022-06-27 13:54:00 【Yizhi code】
About monocleaner
monocleaner is a tool for scoring the fluency of monolingual sentences.
I recommend using monocleaner on Linux: its dependency FastSpell failed to install on my Mac (if you manage to install it, please tell me how), so I do not recommend using it on macOS.
- A training tool, monocleaner-train, is available; you can also use the ready-made language packs directly.
- You can use the monocleaner-download tool to download the latest data, or download it from https://github.com/bitextor/monocleaner-data/releases/latest .
Install
python3.7 -m pip install monocleaner
Dependencies
- Most dependencies are downloaded automatically when monocleaner is installed.
- KenLM must be installed in advance. See: https://blog.csdn.net/lovechris00/article/details/125424808
- monocleaner also depends on FastSpell. I could not install this library on macOS, so I use monocleaner only on Linux.
FastSpell: https://github.com/mbanon/fastspell
- FastSpell depends on python-dev and libhunspell-dev (install: sudo apt install python-dev libhunspell-dev).
- To work with one of the languages that FastSpell lists as similar, you also need the matching Hunspell dictionary, for example hunspell-es (sudo apt-get install hunspell-es), or a dictionary downloaded from an external source such as https://github.com/wooorm/dictionaries/tree/main/dictionaries
You can also configure the path of the Hunspell dictionary folder:
- If you installed with pip, set it in venv/lib/python3.7/site-packages/fastspell/config/hunspell.yaml
- If you installed with setup.py, set it in /config/hunspell.yaml
- If you run directly from the source code, the default dictionary path is /usr/share/hunspell.
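For a pip installation, the location of that configuration file depends on the active environment. The following is a minimal sketch for printing where it would be, assuming the file sits at config/hunspell.yaml inside the installed fastspell package as described above; the helper name find_hunspell_config is made up for this illustration.

```python
import importlib.util
from pathlib import Path
from typing import Optional


def find_hunspell_config(package: str = "fastspell") -> Optional[Path]:
    """Return the expected path of config/hunspell.yaml inside an installed package.

    Hypothetical helper: it only builds the path described in this article,
    it does not check that the file really exists.
    """
    spec = importlib.util.find_spec(package)
    if spec is None or spec.origin is None:
        return None  # the package is not installed in this environment
    # spec.origin points at <site-packages>/<package>/__init__.py
    return Path(spec.origin).parent / "config" / "hunspell.yaml"


if __name__ == "__main__":
    config = find_hunspell_config()
    if config is None:
        print("fastspell is not installed in this environment")
    else:
        print(f"{config} (exists: {config.exists()})")
```

Run it with the same interpreter that you used for pip install, otherwise it looks in a different site-packages folder.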
After a successful installation, three executables are generated: monocleaner, monocleaner-train, and monocleaner-download. They are placed under python/installation/prefix/bin, for example:
- On Mac, mine are under /Library/Frameworks/Python.framework/Versions/3.7/bin/
- On Linux, I use the Python from Anaconda, so the executables are under /home/newtranx/anaconda3/bin
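If you are not sure where the executables ended up, a small check like the one below can help. It is only a sketch: it uses the standard shutil.which to look each name up on the current PATH, and the function name locate_tools is my own.

```python
import shutil
from typing import Dict, Iterable, Optional

# The three executables that the installation is expected to generate.
TOOLS = ("monocleaner", "monocleaner-train", "monocleaner-download")


def locate_tools(names: Iterable[str] = TOOLS) -> Dict[str, Optional[str]]:
    """Map each executable name to its full path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for name, path in locate_tools().items():
        print(f"{name}: {path or 'not found on PATH'}")
```

A result of "not found on PATH" usually means that the bin folder of the Python installation is not on PATH, not necessarily that the installation failed.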
View version information and help
$ monocleaner -v
monocleaner Version 1.1.0 # 2021-03-07 # Add lang ident column # Jaume Zaragoza
$ monocleaner -h
usage: monocleaner [-h] [--scol SCOL] [--disable_lang_ident] [--disable_hardrules] [--disable_minimal_length] [--score_only]
[--add_lang_ident] [--annotated_output] [--debug] [-q] [-v]
model_dir [input] [output]
positional arguments:
model_dir Model directory to store LM file and metadata.
input Input file. If omitted, read from 'stdin'.
output Output tab-separated text file adding monocleaner score. When omitted output will be written to stdout.
optional arguments:
-h, --help show this help message and exit
--scol SCOL Sentence column (starting in 1)
--disable_lang_ident Disables language identification in hardrules
--disable_hardrules Disables the hardrules filtering (only monocleaner fluency scoring is applied)
--disable_minimal_length Don't apply minimal length (3 words) rule
--score_only Only print the score for each sentence, omit all fields
--add_lang_ident Add another column with the identified language if it's not disabled.
--annotated_output Add hardrules annotation for each sentence
--debug
-q, --quiet
-v, --version show version of this script and exit
Scoring
- monocleaner is mainly used to score the fluency of monolingual sentences.
- Each sentence gets a fluency score in the interval 0–1. The higher the score, the more fluent the sentence.
- In addition to the continuous score, a set of hard-coded rules (hardrules) assigns a score of 0 to sentences that are obviously problematic.
- The input file must have one sentence per line.
- The output file has the same number of lines as the input file, with an extra column containing the score.
The command syntax is as follows:
monocleaner [-h]
[--disable_minimal_length]
[--disable_hardrules]
[--score_only]
[--annotated_output]
[--add_lang_ident]
[--debug]
[-q]
model_dir [input] [output]
Parameter description
- Positional arguments:
  - model_dir: directory where the model is stored.
  - input: path of the input file. If omitted, input is read from stdin.
  - output: output file, tab-separated, with the monocleaner score added. If omitted, output is written to stdout.
- Optional arguments:
  - --score_only: print only the score for each sentence. (Default: False)
  - --add_lang_ident: add another column with the identified language, unless language identification is disabled.
  - --disable_hardrules: disable the hardrules filtering, so that only the fluency score is applied. (Default: False)
  - --disable_minimal_length: do not apply the minimal length rule (3 words). (Default: False)
- Logging:
  - -q, --quiet: quiet logging mode. (Default: False)
  - --debug: debug logging mode. (Default: False)
  - -v, --version: show version information.
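When monocleaner is called from a script, it can be convenient to assemble the argument list in one place. The sketch below does only that: build_command is a hypothetical helper, the flags are the ones from the help text above, and input.txt and scored.tsv are placeholder file names.

```python
from typing import List, Optional


def build_command(model_dir: str,
                  input_path: Optional[str] = None,
                  output_path: Optional[str] = None,
                  score_only: bool = False,
                  add_lang_ident: bool = False,
                  disable_hardrules: bool = False,
                  disable_minimal_length: bool = False,
                  quiet: bool = False) -> List[str]:
    """Build the argv list for: monocleaner [options] model_dir [input] [output]."""
    if output_path is not None and input_path is None:
        # The positional order is model_dir, input, output,
        # so an output file cannot be given without an input file.
        raise ValueError("output_path requires input_path")

    cmd = ["monocleaner"]
    if score_only:
        cmd.append("--score_only")
    if add_lang_ident:
        cmd.append("--add_lang_ident")
    if disable_hardrules:
        cmd.append("--disable_hardrules")
    if disable_minimal_length:
        cmd.append("--disable_minimal_length")
    if quiet:
        cmd.append("--quiet")

    cmd.append(model_dir)
    if input_path is not None:
        cmd.append(input_path)
    if output_path is not None:
        cmd.append(output_path)
    return cmd


if __name__ == "__main__":
    # Placeholder file names; replace them with real paths.
    cmd = build_command("xx/monocleaner/models/en", "input.txt", "scored.tsv", quiet=True)
    print(" ".join(cmd))
    # On a machine where monocleaner is installed, the list can be passed
    # to subprocess.run(cmd, check=True).
```

Passing a list instead of one shell string avoids quoting problems when a path contains spaces.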
Examples of use:
Run the command, then type sentences in the terminal; each sentence is echoed back with its score:
$ monocleaner xx/monocleaner/models/en
2022-06-25 13:17:35,372 - WARNING - Downloading FastText model...
Warning : `load_model` does not return WordVectorModel or SupervisedModel any more, but a `FastText` object which is very similar.
2022-06-25 13:18:01,280 - INFO - Start scoring text
hello, this my name is
hello, this my name is 0.676
hello, this is my name
hello, this is my name 0.706
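The scored output can then be filtered with a few lines of Python. This is only a sketch under two assumptions: the default options are used, so the score is the last tab-separated column, and 0.7 is an arbitrary threshold that I picked for the illustration, not a value recommended by monocleaner.

```python
from typing import Iterable, List, Tuple


def parse_scored_line(line: str) -> Tuple[str, float]:
    """Split one output line into (text, score); the score is the last tab-separated column."""
    text, score = line.rstrip("\n").rsplit("\t", 1)
    return text, float(score)


def keep_fluent(lines: Iterable[str], threshold: float = 0.7) -> List[Tuple[str, float]]:
    """Keep only the sentences whose score is at least `threshold`; blank lines are skipped."""
    scored = (parse_scored_line(line) for line in lines if line.strip())
    return [(text, score) for text, score in scored if score >= threshold]


if __name__ == "__main__":
    # The two lines from the example above, written with an explicit tab.
    sample = [
        "hello, this my name is\t0.676\n",
        "hello, this is my name\t0.706\n",
    ]
    for text, score in keep_fluent(sample, threshold=0.7):
        print(f"{score:.3f}\t{text}")
```

With this threshold only the second sentence is kept. Sentences that the hardrules scored as 0 are dropped by any positive threshold.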
Show scores only
$ monocleaner --score_only xx/monocleaner/models/en
Warning : `load_model` does not return WordVectorModel or SupervisedModel any more, but a `FastText` object which is very similar.
2022-06-25 13:23:13,298 - INFO - Start scoring text
hi, I wanna fly to the sky!
0.603
you're beautiful in white
0.800
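With --score_only the sentences themselves are not printed, so the scores have to be matched back to the input by line number. A minimal sketch, relying on the fact that the output has the same number of lines as the input; attach_scores is a name I made up:

```python
from typing import Iterable, List, Tuple


def attach_scores(sentences: Iterable[str], scores: Iterable[str]) -> List[Tuple[str, float]]:
    """Pair each input sentence with the score printed on the same line number."""
    sentence_list = [s.rstrip("\n") for s in sentences]
    score_list = [float(s) for s in scores]
    if len(sentence_list) != len(score_list):
        raise ValueError(
            f"{len(sentence_list)} sentences but {len(score_list)} scores"
        )
    return list(zip(sentence_list, score_list))


if __name__ == "__main__":
    # Sentences and scores taken from the example above.
    sentences = ["hi, I wanna fly to the sky!\n", "you're beautiful in white\n"]
    scores = ["0.603\n", "0.800\n"]
    for sentence, score in attach_scores(sentences, scores):
        print(f"{score:.3f}\t{sentence}")
```

The length check is there so that a truncated output file is noticed instead of silently shifting the scores.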
Download data with monocleaner-download
monocleaner-download does not seem to support a version check; running it with --version just prints the usage instructions:
$ monocleaner-download --version
Wrong number of arguments: --version
Script to download Bicleaner language packs.
Usage: monocleaner-download <lang> <download_path>
<lang> Language code.
<download_path> Path where downloaded language pack should be placed.
Then we can download whatever data we need:
$ monocleaner-download es xx/monocleaner/models/
PS: I do not see zh data at the moment. You can use monocleaner-train to train a model yourself.
You can also download the data from https://github.com/bitextor/monocleaner-data/releases/latest , or check there which languages are supported.
Train a model with monocleaner-train
$ monocleaner-train -h
usage: monocleaner-train [-h] -l LANGUAGE [--dev_size DEV_SIZE]
[--lm_type {PLACEHOLDER,CHARACTER}]
[--tokenizer_command TOKENIZER_COMMAND] [--debug]
[-q]
train model_dir
- positional arguments:
  - train: training dataset file, monolingual text with one sentence per line.
  - model_dir: model directory to store the LM file and metadata.
- optional arguments:
  - -h, --help: show this help message and exit
  - -l LANGUAGE, --language LANGUAGE: language code of the model.
  - --dev_size DEV_SIZE: number of sentences used to estimate mean and stddev perplexity on noisy and clean text. Extracted from the training corpus.
  - --lm_type {PLACEHOLDER,CHARACTER}
  - --tokenizer_command TOKENIZER_COMMAND: tokenizer command to replace the Moses tokenizer when using the PLACEHOLDER LMType.
  - --debug
  - -q, --quiet
I did not do any training here, so I cannot describe the training results or the problems you may run into. I will add them when I get the chance.
Yizhi 2022-06-25 (Saturday)