Jieba word segmentation: principles of the tokenizer
2022-06-28 09:24:00 【Java architects must see】
Install the jieba library: pip3 install jieba
# -*- coding:utf-8 -*-
# Jieba word segmentation
import jieba

sent = ' Tianshan intelligence is a business intelligence enterprise BI、 Data analysis 、 Technical community in the field of data mining and big data technology www.hellobi.com . Content from the initial business intelligence BI The field has also been extended to data analysis 、 Data mining is related to big data In the field of technology , Include R、Python、SPSS、Hadoop、Spark、Hive、Kylin etc. , Become a vertical community focused on the data field . Tianshan intelligence is committed to building an ecosystem based on the data field , Link everything through the community Data related resources : For example, the data itself 、 people 、 Data solution providers and enterprises , Work together with everyone to promote big data 、 business intelligence BI Popularization and development in China .'
print(sent)

The jieba word segmentation module has three segmentation modes:
1. Full mode: scans out every word in the sentence that can be formed from the dictionary. It is very fast, but it cannot resolve ambiguity. Because every dictionary match is emitted, the output contains overlapping, repeated fragments, which is clearly not what we need here.
2. Accurate mode: tries to cut the sentence as precisely as possible, which suits text analysis (similar to LTP word segmentation). This mode is closer to what we want.
3. Search engine mode: building on accurate mode, splits long words again to raise recall, which suits search engine segmentation. This mode also works well and gives finer-grained output.
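The difference between the three modes can be illustrated with a toy, dictionary-only sketch. This is not jieba's implementation: greedy longest matching stands in for accurate mode here, and `TOY_DICT`, `full_mode`, `longest_match` and `search_mode` are made-up names used only for illustration.

```python
# Toy dictionary (an assumption for this sketch); a real segmenter ships a
# large dictionary with word frequencies.
TOY_DICT = {"我", "来到", "北京", "清华", "清华大学", "华大", "大学"}


def full_mode(sentence, dictionary=TOY_DICT):
    """Return every dictionary word found in the sentence, overlaps included."""
    max_len = max(len(w) for w in dictionary)
    words = []
    for start in range(len(sentence)):
        last_end = min(start + max_len, len(sentence))
        for end in range(start + 1, last_end + 1):
            piece = sentence[start:end]
            if piece in dictionary:
                words.append(piece)
    return words


def longest_match(sentence, dictionary=TOY_DICT):
    """Greedy forward maximum matching: one non-overlapping segmentation."""
    max_len = max(len(w) for w in dictionary)
    words, start = [], 0
    while start < len(sentence):
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(max_len, len(sentence) - start), 0, -1):
            piece = sentence[start:start + size]
            if size == 1 or piece in dictionary:
                words.append(piece)
                start += size
                break
    return words


def search_mode(sentence, dictionary=TOY_DICT):
    """Emit shorter dictionary words inside each long word, then the word itself."""
    words = []
    for word in longest_match(sentence, dictionary):
        for size in (2, 3):
            if len(word) > size:
                for i in range(len(word) - size + 1):
                    piece = word[i:i + size]
                    if piece in dictionary:
                        words.append(piece)
        words.append(word)
    return words


text = "我来到北京清华大学"
print("|".join(full_mode(text)))      # 我|来到|北京|清华|清华大学|华大|大学
print("|".join(longest_match(text)))  # 我|来到|北京|清华大学
print("|".join(search_mode(text)))    # 我|来到|北京|清华|华大|大学|清华大学
```

The full-mode output repeats overlapping fragments of 清华大学, the non-overlapping segmentation keeps it whole, and the search-style output adds the shorter words back in for recall.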
# Full mode
wordlist = jieba.cut(sent, cut_all=True)
print('|'.join(wordlist))

# Accurate mode (the default)
wordlist = jieba.cut(sent)
print('|'.join(wordlist))

# Search engine mode
wordlist = jieba.cut_for_search(sent)
print('|'.join(wordlist))

A new problem, and the fix (a user-defined dictionary): looking back at the accurate-mode results, some new or domain-specific words, such as "Tianshan intelligence" and "big data", have been cut apart when they should stay whole. On top of the default dictionary, we can therefore load a custom dictionary. In the jieba module directory there is a dictionary file, dict.txt. Opening it shows that each line holds: 1. the word; 2. a number (the word frequency; the higher it is, the more readily the word is matched); 3. the part of speech. For convenience, we define and add a dictionary named userdict.txt.
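As a rough sketch of what such a userdict.txt might contain: jieba's user dictionary takes one entry per line, with the word first, followed by an optional frequency and an optional part-of-speech tag, separated by spaces. The entries, frequencies and tags below are made up for illustration (including the assumed Chinese forms of the example words), and `parse_userdict_line` is a hypothetical helper for checking the file, not part of jieba.

```python
import os
import tempfile

# Hypothetical entries. Each line is: word [frequency] [part-of-speech tag].
SAMPLE_LINES = [
    "天善智能 10 nt",  # assumed Chinese form of "Tianshan intelligence"
    "大数据 5 n",      # "big data"
    "商业智能",        # frequency and tag omitted
]


def parse_userdict_line(line):
    """Return (word, freq, tag) for one line; omitted fields come back as None."""
    word, *rest = line.strip().split(" ")
    freq, tag = None, None
    for field in rest:
        if field.isdigit():
            freq = int(field)
        elif field:
            tag = field
    return word, freq, tag


with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "userdict.txt")
    # Write as UTF-8 so the Chinese entries survive the round trip.
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(SAMPLE_LINES) + "\n")
    # Read the file back and parse every non-empty line.
    with open(path, encoding="utf-8") as f:
        entries = [parse_userdict_line(line) for line in f if line.strip()]

for word, freq, tag in entries:
    print(word, freq, tag)
```

Pointing `jieba.load_userdict` at a UTF-8 file in this format is what makes the extra words available to the segmenter.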
# Load the user-defined dictionary
jieba.load_userdict('D:\\Anaconda3\\Lib\\site-packages\\jieba\\userdict.txt')
wordlist = jieba.cut(sent)
print('|'.join(wordlist))

Reference material:
https://zhuanlan.zhihu.com/p/29747350?utm_source=qq&utm_medium=social&utm_oi=780081763178258432