Flashtext, a data cleaning tool that improves efficiency by dozens of times
2022-06-28 02:39:00 【Python concentration camp】
For everyday small-scale data filtering and cleaning, regular expressions are the most commonly used tool. As the data grows, however, regular expressions start to fall short.
Finding 15k keywords in a corpus of 10k terms takes a regular expression about 0.165 seconds, while Flashtext needs only about 0.002 seconds. On this task, Flashtext is therefore roughly 82 times faster than regular expressions.
The performance comparison charts (see section 8) show that as the number of keywords to process grows, the time taken by regular expressions increases almost linearly, whereas the time taken by Flashtext stays almost constant.
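One way to see why lookup time need not grow with the number of keywords is the trie, the data structure flashtext is built around. The sketch below is a simplified illustration in plain Python, not flashtext's actual implementation; `build_trie`, `find_keywords`, `WORD_CHARS`, and the `END` marker are hypothetical names chosen for this example.

```python
import string

# Characters treated as part of a word (an assumption for this sketch).
WORD_CHARS = set(string.ascii_letters + string.digits + "_")
END = "_end_"  # hypothetical marker key that stores the alias of a keyword


def build_trie(keywords):
    """Build a nested-dict trie from a {keyword: alias} mapping."""
    root = {}
    for keyword, alias in keywords.items():
        node = root
        for ch in keyword.lower():
            node = node.setdefault(ch, {})
        node[END] = alias
    return root


def find_keywords(text, trie):
    """Return the aliases of all keywords found in text, in order."""
    found = []
    lowered = text.lower()
    i, n = 0, len(lowered)
    while i < n:
        # Only start matching at the beginning of a word.
        if lowered[i] not in WORD_CHARS or (i > 0 and lowered[i - 1] in WORD_CHARS):
            i += 1
            continue
        node, j, best = trie, i, None
        while j < n and lowered[j] in node:
            node = node[lowered[j]]
            j += 1
            # Accept a match only if it ends on a word boundary.
            if END in node and (j == n or lowered[j] not in WORD_CHARS):
                best = (node[END], j)
        if best:
            found.append(best[0])
            i = best[1]
        else:
            i += 1
    return found


trie = build_trie({'Python': 'Python', 'Scala': 'Java'})
print(find_keywords('I like Python and Scala.', trie))
# ['Python', 'Java']
```

Each step of the inner loop is a single dictionary lookup, so the work per character depends on the length of the keywords rather than on how many keywords are stored. A regular expression built as a long alternation, by contrast, may have to try many alternatives at each position.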
1、 Prepare the flashtext environment
Install flashtext with pip (other installation methods work too). The command below uses the Tsinghua University mirror.
pip install flashtext -i https://pypi.tuna.tsinghua.edu.cn/simple
With the flashtext environment ready, let's walk through the main steps of using flashtext so that we can carry out data cleaning more effectively.
2、 Add keywords
Keywords are added to the keyword processor one at a time with the add_keyword function. The first parameter is the keyword to add, and the second is an optional alias for it. When the keyword is found, the alias is returned in its place; if no alias is given, the keyword itself is returned.
from flashtext import KeywordProcessor
# Initialize the keyword processor
processor = KeywordProcessor()
# Add a keyword without an alias
processor.add_keyword('Python')
# Add a keyword with an alias: 'Scala' will be reported as 'Java'
processor.add_keyword('Scala', 'Java')
The required keywords have now been added to the keyword processor in both ways.
3、 Extract keywords
After the previous step, the keywords are stored in the keyword processor, so extract_keywords can now be used to extract them from text.
# Extract keyword information from a string
found = processor.extract_keywords('I like Python and Scala.')
# result
print(found)
# ['Python', 'Java']
The result is as expected, and Scala is reported as Java.
4、 Replace keywords
Use the replace_keywords function to replace keywords. The premise is that the keyword has an alias in the keyword processor, in the same way that Scala was reported as Java above.
For example, because the alias for Scala is Java, every Scala in the string should be replaced by Java.
replaced = processor.replace_keywords('I like Scala.')
# result
print(replaced)
# I like Java.
# Any occurrence of Scala is replaced by Java.
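For comparison, the same replacement written with a regular expression might look like the following. This is only a sketch of the regex approach that flashtext is measured against; the `replacements` mapping and the `replace_with_regex` helper are hypothetical names for this example.

```python
import re

# Map each lowercase keyword to its replacement (same alias as above).
replacements = {'scala': 'Java'}

# One alternation of all keywords, matched on word boundaries, ignoring case.
pattern = re.compile(
    r'\b(?:' + '|'.join(map(re.escape, replacements)) + r')\b',
    re.IGNORECASE,
)


def replace_with_regex(text):
    """Replace every keyword in text with its alias."""
    return pattern.sub(lambda m: replacements[m.group(0).lower()], text)


print(replace_with_regex('I like Scala.'))
# I like Java.
```

With only a few keywords this works well. The pattern grows with every keyword added, which is where the slowdown discussed at the start of the article comes from.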
5、 Get all keywords
Sometimes you may not remember which keywords have been added to a KeywordProcessor. In that case, use the get_all_keywords function to get all the current keywords.
all_keywords = processor.get_all_keywords()
# result
print(all_keywords)
# {'python': 'Python', 'scala': 'Java'}
6、 Add keywords in batch
When many keywords are needed, you can add them in batches from a list or a dictionary, using the add_keywords_from_list and add_keywords_from_dict functions respectively. In the dictionary form, each key is the alias and its list holds the keywords that map to it.
# Initialize a dictionary for batch addition
dict_ = {
'java': ['java_ee', 'java_se', 'java_me'],
'python': ['pandas', 'all']
}
# Add keywords in batches through dictionaries
processor.add_keywords_from_dict(dict_)
# Extract keywords using the entries added in batch
result = processor.extract_keywords('looking for java_ee and pandas.')
# result
print(result)
# ['java', 'python']
# Add keywords in batch from a list
processor.add_keywords_from_list(['scala', 'python', 'scala', 'go'])
# Use get_all_keywords to look at all the keywords
all_keywords = processor.get_all_keywords()
# result
print(all_keywords)
# {'python': 'python', 'pandas': 'python', 'scala': 'scala', 'java_ee': 'java', 'java_se': 'java', 'java_me': 'java', 'all': 'python', 'go': 'go'}
All the keywords have been added to the keyword processor, and duplicates are not added a second time. Note that re-adding 'scala' without an alias also changed its displayed name from Java to scala.
7、 Delete keywords in batch
Keywords can also be removed from the keyword processor in batches in two ways, from a list or from a dictionary, using the remove_keywords_from_list and remove_keywords_from_dict functions respectively.
# Remove keywords in batch using a list
processor.remove_keywords_from_list(['python','java_ee','java_me'])
# Remove keywords in batch using a dictionary
processor.remove_keywords_from_dict({'python': ['pandas','all']})
# Use get_all_keywords to look at all the keywords
all_keywords = processor.get_all_keywords()
# result
print(all_keywords)
# {'scala': 'scala', 'java_se': 'java', 'go': 'go'}
All the keywords that needed to be removed have been removed.
8、 Comparison of execution efficiency
For a more vivid illustration, here are two charts comparing the efficiency of flashtext and regular expressions when searching for and replacing keywords; the difference is clear at a glance.
Figure: search efficiency, flashtext vs. regular expressions
Figure: search-and-replace efficiency, flashtext vs. regular expressions