Corpus Data Processing Case Studies (POS Tagging and Lemmatization)
2022-06-23 03:49:00 【Triumph19】
7.2 POS Tagging
7.2.1 Basic POS Tagging Operations
- Part-of-speech tagging (POS tagging) is one of the most common text-processing tasks in corpus linguistics. The NLTK library also provides a POS tagging module. Consider the following example.
import nltk
string = "My father's name being Pririp,and my Christian name Philip,my infant tongue could make of both names nothing longer or more explicit than Pip. So,I called myself Pip,and came to be called Pip."
# Tokenize first, then tag each token with its part of speech
string_tokenized = nltk.word_tokenize(string)
string_postagged = nltk.pos_tag(string_tokenized)
string_postagged
- Before POS tagging, the sentence must be tokenized, so we first use the nltk.word_tokenize() function to tokenize string. Then the nltk.pos_tag() function assigns a POS tag to each token in the resulting list. The printed tagging result is as follows:
[('My', 'PRP$'),
('father', 'NN'),
("'s", 'POS'),
('name', 'NN'),
('being', 'VBG'),
('Pririp', 'NNP'),
(',', ','),
('and', 'CC'),
('my', 'PRP$'),
('Christian', 'JJ'),
('name', 'NN'),
('Philip', 'NNP'),
(',', ','),
('my', 'PRP$'),
('infant', 'JJ'),
('tongue', 'NN'),
('could', 'MD'),
('make', 'VB'),
('of', 'IN'),
('both', 'DT'),
('names', 'NNS'),
('nothing', 'NN'),
('longer', 'RB'),
('or', 'CC'),
('more', 'JJR'),
('explicit', 'NNS'),
('than', 'IN'),
('Pip', 'NNP'),
('.', '.'),
('So', 'NNP'),
(',', ','),
('I', 'PRP'),
('called', 'VBD'),
('myself', 'PRP'),
('Pip', 'NNP'),
(',', ','),
('and', 'CC'),
('came', 'VBD'),
('to', 'TO'),
('be', 'VB'),
('called', 'VBN'),
('Pip', 'NNP'),
('.', '.')]
- As the result shows, nltk.pos_tag() returns a list after tagging; each element of the list is a tuple, and each tuple has two elements: a word and its POS tag.
- Printing or writing out this result directly is not very readable. To improve readability, we can reformat each pair as "word_tag". The following code implements this:
for i in string_postagged:
    print(i[0] + '_' + i[1])
My_PRP$
father_NN
's_POS
name_NN
being_VBG
Pririp_NNP
,_,
and_CC
my_PRP$
Christian_JJ
name_NN
Philip_NNP
,_,
my_PRP$
infant_JJ
tongue_NN
could_MD
make_VB
of_IN
both_DT
names_NNS
nothing_NN
longer_RB
or_CC
more_JJR
explicit_NNS
than_IN
Pip_NNP
._.
So_NNP
,_,
I_PRP
called_VBD
myself_PRP
Pip_NNP
,_,
and_CC
came_VBD
to_TO
be_VB
called_VBN
Pip_NNP
._.
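- The same "word_tag" formatting can also be produced in one line; here is a minimal equivalent sketch using a generator expression and str.join():
# Join each (word, tag) pair into "word_tag" and print them space-separated
print(' '.join(word + '_' + tag for word, tag in string_postagged))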
7.2.2 Sentence Splitting and POS Tagging of a Text
- In this subsection's example, we want to process a text so that it is written out to a text file sentence by sentence, with every word in each sentence POS tagged. The code is as follows:
import nltk
string = "My father's name being Pririp,and my Christian name Philip,my infant tongue could make of both names nothing longer or more explicit than Pip. So,I called myself Pip,and came to be called Pip."
# Split the string into sentences with the Punkt sentence tokenizer
sent_splitter = nltk.data.load('tokenizers/punkt/english.pickle')
sents_splitted = sent_splitter.tokenize(string)
file_out = open(r'D:\works\文本分析\sent_postagged.txt', 'a')
# POS tag each sentence and write it out, one sentence per line
for sent in sents_splitted:
    # POS tag the sentence
    sent_tokenized = nltk.word_tokenize(sent)
    sent_postag = nltk.pos_tag(sent_tokenized)
    # save the POS tagged sentence in sent_postagged.txt
    for i in sent_postag:
        output = i[0] + '_' + i[1] + ' '
        file_out.write(output)
    file_out.write('\n')
file_out.close()
- The result is as follows:
My_PRP$ father_NN 's_POS name_NN being_VBG Pririp_NNP ,_, and_CC my_PRP$ Christian_JJ name_NN Philip_NNP ,_, my_PRP$ infant_JJ tongue_NN could_MD make_VB of_IN both_DT names_NNS nothing_NN longer_RB or_CC more_JJR explicit_NNS than_IN Pip_NNP ._.
So_RB ,_, I_PRP called_VBD myself_PRP Pip_NNP ,_, and_CC came_VBD to_TO be_VB called_VBN Pip_NNP ._.
- In effect, the text is split at the period "." into two sentences, and every word in each sentence is POS tagged.
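- Note that nltk.sent_tokenize() wraps the same Punkt sentence splitter, so the splitting-and-tagging step can also be written more compactly; a minimal sketch that prints instead of writing to a file:
import nltk
string = "My father's name being Pririp,and my Christian name Philip,my infant tongue could make of both names nothing longer or more explicit than Pip. So,I called myself Pip,and came to be called Pip."
# nltk.sent_tokenize() uses the Punkt model under the hood
for sent in nltk.sent_tokenize(string):
    tagged = nltk.pos_tag(nltk.word_tokenize(sent))
    print(' '.join(word + '_' + tag for word, tag in tagged))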
7.3 Lemmatization
- Lemmatization refers to reducing inflected word forms to their base form (lemma). For example, desks can be reduced to desk, and the verbs went or going to go. NLTK has a built-in wordnet module, which provides the lemmatization tool WordNetLemmatizer, so we can use WordNetLemmatizer for lemmatization. See the code example below.
import nltk
from nltk.stem.wordnet import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize('books','n')) #book
print(lemmatizer.lemmatize('went','v')) #go
print(lemmatizer.lemmatize('better','a')) #good
print(lemmatizer.lemmatize('geese')) #goose
- The above demonstrates the most basic use of the WordNet lemmatizer on individual words. If we have a longer text rather than single words, using the WordNet lemmatizer directly becomes cumbersome: we must first tokenize and POS tag the text, then extract each word together with its POS tag, convert the POS tags into the tag codes the WordNet lemmatizer accepts, and only then lemmatize with the WordNet tool.
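- Here is a minimal sketch of that pipeline, assuming the usual mapping from Penn Treebank tag prefixes to WordNet POS codes (unmapped tags default to noun); the example sentence is illustrative, not from the original text:
import nltk
from nltk.corpus import wordnet
from nltk.stem.wordnet import WordNetLemmatizer
def treebank_to_wordnet(tag):
    # Map Penn Treebank tag prefixes to WordNet POS codes
    if tag.startswith('J'):
        return wordnet.ADJ   # 'a'
    elif tag.startswith('V'):
        return wordnet.VERB  # 'v'
    elif tag.startswith('R'):
        return wordnet.ADV   # 'r'
    else:
        return wordnet.NOUN  # 'n' (default)
lemmatizer = WordNetLemmatizer()
sentence = "The geese were flying over the lakes"
tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
lemmas = [lemmatizer.lemmatize(word, treebank_to_wordnet(tag)) for word, tag in tagged]
print(lemmas)  # expected: ['The', 'goose', 'be', 'fly', 'over', 'the', 'lake']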
- Of course, other tools can also be used for lemmatization, for example the lemmatizer in the Stanford CoreNLP package. We discuss that approach in section 7.12 of this chapter.
7.5 Extracting N-grams
- One active topic in corpus linguistics is the study of n-grams (chunks). By length, n-grams can be divided into unigrams (single words), bigrams, trigrams, four-grams, and so on. For example, from the string "To be or not to be" we can extract five bigrams: "To be", "be or", "or not", "not to", "to be".
- NLTK's nltk.util module provides the ngrams() function, which extracts n-grams from a text. Its basic usage is ngrams(tokens, n): the first argument is a tokenized word list and the second is the n-gram length. See the following code.
import nltk
from nltk.util import ngrams
string = "My father's name being Pririp,and my Christian name Philip,my infant tongue could make of both names nothing longer or more explicit than Pip. So,I called myself Pip,and came to be called Pip."
# Lowercase and tokenize the string, then extract 4-grams
string_tokenized = nltk.word_tokenize(string.lower())
n = 4
n_grams = ngrams(string_tokenized, n)
for grams in n_grams:
    print(grams)
- We first import nltk and the ngrams function with the statements import nltk and from nltk.util import ngrams. Then nltk.word_tokenize(string.lower()) lowercases and tokenizes string, and we define the n-gram length (n = 4). Next, ngrams(string_tokenized, n) extracts all 4-grams from the token list. Finally, a for...in loop prints the extracted n-grams. The result is as follows:
('my', 'father', "'s", 'name')
('father', "'s", 'name', 'being')
("'s", 'name', 'being', 'pririp')
('name', 'being', 'pririp', ',')
('being', 'pririp', ',', 'and')
('pririp', ',', 'and', 'my')
(',', 'and', 'my', 'christian')
('and', 'my', 'christian', 'name')
('my', 'christian', 'name', 'philip')
('christian', 'name', 'philip', ',')
('name', 'philip', ',', 'my')
('philip', ',', 'my', 'infant')
(',', 'my', 'infant', 'tongue')
('my', 'infant', 'tongue', 'could')
('infant', 'tongue', 'could', 'make')
('tongue', 'could', 'make', 'of')
('could', 'make', 'of', 'both')
('make', 'of', 'both', 'names')
('of', 'both', 'names', 'nothing')
('both', 'names', 'nothing', 'longer')
('names', 'nothing', 'longer', 'or')
('nothing', 'longer', 'or', 'more')
('longer', 'or', 'more', 'explicit')
('or', 'more', 'explicit', 'than')
('more', 'explicit', 'than', 'pip')
('explicit', 'than', 'pip', '.')
('than', 'pip', '.', 'so')
('pip', '.', 'so', ',')
('.', 'so', ',', 'i')
('so', ',', 'i', 'called')
(',', 'i', 'called', 'myself')
('i', 'called', 'myself', 'pip')
('called', 'myself', 'pip', ',')
('myself', 'pip', ',', 'and')
('pip', ',', 'and', 'came')
(',', 'and', 'came', 'to')
('and', 'came', 'to', 'be')
('came', 'to', 'be', 'called')
('to', 'be', 'called', 'pip')
('be', 'called', 'pip', '.')
- We can of course process these results further as research needs dictate, for example by removing tuples (n-grams) that contain punctuation tokens, such as the last one above. The following code finds and removes from the n_grams generated above every tuple containing a token made up only of non-alphanumeric characters, and prints only the remaining n-grams.
import re
import nltk
from nltk.util import ngrams
string = "My father's name being Pririp,and my Christian name Philip,my infant tongue could make of both names nothing longer or more explicit than Pip. So,I called myself Pip,and came to be called Pip."
string_tokenized = nltk.word_tokenize(string.lower())
n = 4
n_grams = ngrams(string_tokenized, n)
n_grams_AlphaNum = []
for gram in n_grams:
    # test whether any token in the n-gram consists only of
    # non-alphanumeric characters (e.g. punctuation)
    for i in range(4):
        if re.search(r'^\W+$', gram[i]):  # \W matches any non-word character, i.e. [^A-Za-z0-9_]
            break
    else:  # this for-else branch runs only if no break occurred
        n_grams_AlphaNum.append(gram)
for j in n_grams_AlphaNum:
    print(j)
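- The same filter can be written more compactly; a minimal equivalent sketch using a list comprehension and any():
n_grams = ngrams(string_tokenized, n)  # the generator above is exhausted, so recreate it
n_grams_AlphaNum = [gram for gram in n_grams
                    if not any(re.search(r'^\W+$', tok) for tok in gram)]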
