spaCy tutorial (continuously updated...)
2022-06-28 14:42:00 【诸神缄默不语】
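Last updated: 2022.6.27
First published: 2022.6.27
This post shows how to use spaCy models, i.e. it is a tutorial for the spaCy API. Most of the spaCy API has to be used through a specific model (a trained pipeline). The examples here mainly use English (en_core_web_sm) and Chinese (zh_core_web_sm), since those are the only two languages I know.
spaCy models site: Trained Models & Pipelines · spaCy Models Documentation
1. Tokenization
Example from the official site (it can be run directly online, via docker):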
import spacy
from spacy.lang.en.examples import sentences
nlp = spacy.load("en_core_web_sm")
doc = nlp(sentences[0])
print(doc.text)
for token in doc:
    print(token.text, token.pos_, token.dep_)
Output:
Apple is looking at buying U.K. startup for $1 billion
Apple PROPN nsubj
is AUX aux
looking VERB ROOT
at ADP prep
buying VERB pcomp
U.K. PROPN dobj
startup NOUN advcl
for ADP prep
$ SYM quantmod
1 NUM compound
billion NUM pobj
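As the output shows, the model tokenizes the sentence and gives each token's part-of-speech tag (pos_) and dependency relation (dep_), i.e. the label of the syntactic relation between a token and its head. (I am not familiar with the details myself; for an introduction see DependencyParser · spaCy API Documentation.)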
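For readers who, like me, do not know the labels by heart, spaCy ships a small glossary that can be queried with spacy.explain. The following is a minimal sketch that looks up a few of the labels from the output above; the choice of labels and the variable names are mine, and no trained pipeline is needed for it.

```python
import spacy

# A few labels copied from the output above (POS tags and dependency labels).
labels = ["PROPN", "AUX", "nsubj", "pobj"]

# spacy.explain returns a short description from spaCy's built-in glossary.
explanations = {label: spacy.explain(label) for label in labels}

for label, description in explanations.items():
    print(f"{label}: {description}")
```

The same call can be used inside the token loop, e.g. spacy.explain(token.dep_), to print a description next to each label.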
2. Stop word list
For the Defaults documentation, see Language · spaCy API Documentation
import spacy
sp=spacy.load('en_core_web_sm')
StopWord=sp.Defaults.stop_words
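StopWord is a set of stop words (as strings).
3. Sentence segmentation
For the Sentencizer documentation, see Sentencizer · spaCy API Documentation
A sentence here means a complete sentence, delimited by punctuation such as the full stop.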
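A typical use of the stop word set is filtering tokens. The sketch below is an illustration under my own assumptions: it uses a blank English pipeline (spacy.blank("en")) instead of en_core_web_sm so that no model download is needed, and the sample sentence is made up. The stop word list comes from the language defaults, so both pipelines should expose it the same way.

```python
import spacy

# A blank English pipeline: tokenizer and language defaults only, no trained model.
nlp = spacy.blank("en")

# The same attribute as in the example above.
stop_words = nlp.Defaults.stop_words

doc = nlp("This is a short example sentence about the stop word list")

# token.is_stop tells whether the token is in the language's stop word list.
kept = [token.text for token in doc if not token.is_stop]
print(kept)
```

Checking membership directly, e.g. "the" in stop_words, works as well, because the object is an ordinary Python set.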
import spacy
from spacy.lang.zh.examples import sentences
nlp = spacy.load("zh_core_web_sm")
total_doc=''.join(sentences)
nlp.add_pipe('sentencizer', name='sentence_segmenter', before='parser')
doc = nlp(total_doc)
print(doc.text)
for token in doc:
    print(token)
    print(token.is_sent_start)
for sent in doc.sents:
    print(sent)
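Output omitted. In short, a token whose is_sent_start attribute is True is the first token of a sentence, and doc.sents is an iterator over the sentences.
In addition, the following way of splitting sentences was written for spaCy v2.0. It cannot be used in spaCy v3.0 (I am on 3.2.4), and I have not tried it myself: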
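Since the output is omitted above, here is a smaller sketch of the same idea whose result is easy to check by eye. It is only an illustration: it assumes a blank English pipeline with nothing but the sentencizer added, and the sample text is made up.

```python
import spacy

# Blank pipeline plus the rule-based sentencizer (no parser, no trained model).
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

doc = nlp("This is the first sentence. This is the second one! Is this the third?")

# doc.sents yields one Span per sentence.
sentences = [sent.text for sent in doc.sents]

# Tokens that open a sentence have is_sent_start set to True.
starts = [token.text for token in doc if token.is_sent_start]

for sentence in sentences:
    print(sentence)
print(starts)
```

Because doc.sents is an iterator rather than a list, it has to be wrapped in list(...) or a comprehension, as here, when indexing or a length is needed.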
from seg.newline.segmenter import NewLineSegmenter # note that pip package is called spacyss
import spacy
nlseg = NewLineSegmenter()
nlp = spacy.load('en')
nlp.add_pipe(nlseg.set_sent_starts, name='sentence_segmenter', before='parser')
doc = nlp(my_doc_text)
Required package: spacyss · PyPI
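In spaCy v3, add_pipe expects the string name of a registered component instead of a callable, and the 'en' shortcut for spacy.load is gone, which is probably why the snippet above no longer runs. The following is a rough sketch of how a newline-based splitter could be written in the v3 style. It is not the spacyss implementation: the component name newline_segmenter and the sample text are hypothetical, and a blank pipeline is used to keep the example self-contained.

```python
import spacy
from spacy.language import Language


# Hypothetical component name; registered so that add_pipe can find it by string.
@Language.component("newline_segmenter")
def newline_segmenter(doc):
    # A token starts a sentence if it is the first token
    # or if the previous token contains a line break.
    for i, token in enumerate(doc):
        token.is_sent_start = i == 0 or "\n" in doc[i - 1].text
    return doc


nlp = spacy.blank("en")
nlp.add_pipe("newline_segmenter")

doc = nlp("Buy milk\nCall the bank\nWrite the report")

# Strip the trailing line break that stays attached to each sentence.
lines = [sent.text.strip() for sent in doc.sents]
print(lines)
```

With a trained pipeline such as en_core_web_sm, the component would have to be added with before='parser', as in the sentencizer example, because sentence boundaries cannot be set on a Doc that has already been parsed.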