Case examples of corpus data processing (cases related to sentence retrieval)
2022-06-24 07:53:00 【Triumph19】
7.8 Cases related to sentence retrieval
- This section discusses two cases of sentence retrieval. In the first, we retrieve the sentences in a text that contain a given word or class of words. In the second, we retrieve the passive sentences in a text.
7.8.1 Retrieve all sentences containing a word
- In this case, we search the ge.txt text for sentences that contain the word "children". The letter c may be in upper or lower case. See the code below:
import re
file_in = open(r'D:\works\ Text analysis \leopythonbookdata-master\texts\ge.txt','r')
file_out = open(r'D:\works\ Text analysis \ge_children.txt','a')
for line in file_in.readlines():
    if re.search(r'[Cc]hildren',line):
        file_out.write(line)
file_in.close()
file_out.close()
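The same search can be wrapped in a small function that works on any list of lines, which makes it easy to try without the data files. This is only a minimal sketch: the function name `lines_matching` and the sample lines are hypothetical and do not come from the original code.

```python
import re

def lines_matching(lines, pattern):
    """Return the lines in which the regular expression `pattern` is found."""
    regex = re.compile(pattern)
    return [line for line in lines if regex.search(line)]

# Hypothetical sample lines standing in for the contents of ge.txt
sample = [
    "Children were playing near the marshes.\n",
    "My sister brought me up by hand.\n",
    "The five little children lay in a row.\n",
]
hits = lines_matching(sample, r'[Cc]hildren')
print(len(hits))
```

Passing the pattern as a parameter means the same function also covers the "-tional" search in the next case.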
7.8.2 Retrieve all sentences containing a word ending in "-tional"
- Next we search the ge.txt text for sentences that contain a word ending in "-tional". See the code below. Compared with the code above, we made two changes. The first is to change the name of the output file to "ge_tional.txt". The second is to change the regular expression to r'\w+tional\b', where '\b' matches a word boundary.
import re
file_in = open(r'D:\works\ Text analysis \leopythonbookdata-master\texts\ge.txt','r')
file_out = open(r'D:\works\ Text analysis \ge_tional.txt','a')
for line in file_in.readlines():
    if re.search(r'\w+tional\b',line):
        file_out.write(line) # the line already ends with a newline
file_in.close()
file_out.close()
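To see what the word boundary contributes, the pattern can be tried on a few strings. The sentences below are made up for illustration and are not taken from ge.txt.

```python
import re

pattern = re.compile(r'\w+tional\b')

# "rational" ends in "tional" and is followed by a space, so \b succeeds
match_rational = pattern.search("It was a rational decision.")

# in "nationality" the letters "tional" are followed by "ity", so \b fails
match_nationality = pattern.search("He spoke of nationality at length.")

# findall returns every matching word in the string
found = pattern.findall("An additional, conditional offer.")

print(match_rational.group())
print(match_nationality)
print(found)
```

Without '\b', the second sentence would also be retrieved, because "nationality" contains "national".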
7.9 Implementing the functions of the Range software
- The study of academic English vocabulary is an important research topic in applied linguistics and English for academic purposes. Nation (2001) divides words into four classes: common words, academic vocabulary, technical vocabulary and rare words. According to Nation (2001), learners usually learn common vocabulary first, then academic vocabulary, and finally technical vocabulary and rare words. Common words generally refers to the General Service List (GSL) developed by West (1953), which contains the 2,000 most frequently used words (or word families) in English. Scholars have also developed many academic English word lists to help researchers learn and use academic English vocabulary. Among them, the Academic Word List (AWL) developed by Coxhead (2000) has been the most influential in recent years; it contains 570 high-frequency academic words (or word families). To help researchers apply the AWL in research and in the development of teaching materials, Paul Nation also developed the Range software. The software has three built-in word lists: the 1,000 most common GSL words (GSL1000 for short), the 2,000 most common GSL words (GSL2000 for short), and the AWL. Range analyzes which words in a corpus text belong to GSL1000, GSL2000 and the AWL, and reports the number of words from each list and their percentage of the total number of words in the text (coverage).
- Implementing the functions of Range takes roughly four steps. First, read in the text and build its word frequency table. Second, read in the three word lists GSL1000.txt, GSL2000.txt and AWL.txt. Third, determine which words in the text belong to each of the three lists, and compute their counts and percentages. Finally, write the results to a new text file. Below, we present the code in four parts that follow these steps.
- Here is the first part of the code, which builds the word frequency table of the ge.txt text. We first read ge.txt, use the regular expression '\W' to replace every non-alphanumeric character with a space, split each line into words with split(), and store the resulting words in the list all_words. Next, we define an empty dictionary, wordlist_freq_dict, to hold the frequency table. We then loop over the words in all_words: if a word is already a key of the dictionary, its frequency is increased by 1; otherwise, its frequency is set to 1. In this way, the words are stored as dictionary keys and their frequencies as dictionary values.
(1) Build the word frequency table of ge.txt
# Recognizing_gsl_awl_words.py,part 1
# the following is to make the wordlist with freq
# and store the info in a dictionary (wordlist_freq_dict)
import re
file_in = open(r'D:\works\ Text analysis \leopythonbookdata-master\texts\ge.txt','r')
all_words = []
for line in file_in.readlines():
    line2 = line.lower()
    line3 = re.sub(r'\W',r' ',line2) # Replace non-alphanumeric characters with spaces
    wordlist = line3.split()
    for word in wordlist:
        all_words.append(word)
wordlist_freq_dict = {}
for word in all_words:
    if word in wordlist_freq_dict.keys():
        wordlist_freq_dict[word] += 1
    else:
        wordlist_freq_dict[word] = 1
file_in.close()
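The counting loop above can also be written with `collections.Counter` from the standard library. The sketch below is an alternative formulation, not the original author's code; the function name `build_freq_table` and the sample lines are hypothetical.

```python
import re
from collections import Counter

def build_freq_table(lines):
    """Count lowercase word tokens, treating non-word characters as separators."""
    counts = Counter()
    for line in lines:
        # Same cleaning as above: lowercase, then replace \W with a space
        counts.update(re.sub(r'\W', ' ', line.lower()).split())
    return counts

# Hypothetical sample lines standing in for ge.txt
sample = ["The cat saw the dog.\n", "The dog ran!\n"]
freq = build_freq_table(sample)
print(freq['the'], freq['dog'], freq['cat'])
```

A `Counter` behaves like a dictionary, so the later steps that look up `wordlist_freq_dict[word]` would work unchanged on it.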
(2) Read in the word lists
- Here is the second part of the code, which reads in the three word lists GSL1000.txt, GSL2000.txt and AWL.txt. We first create an empty dictionary, gsl_awl_dict, then read the three word list files in turn and store their words as dictionary keys. The dictionary values for the words of the three lists are set to 1, 2 and 3 respectively.
# Recognizing_gsl_awl_words.py,Part 2
# the following is to read the GSL and the AWL words
# and save them in a dictionary
gsl1000_in = open(r'D:\works\ Text analysis \leopythonbookdata-master\texts\GSL1000.txt','r')
gsl2000_in = open(r'D:\works\ Text analysis \leopythonbookdata-master\texts\GSL2000.txt','r')
awl_in = open(r'D:\works\ Text analysis \leopythonbookdata-master\texts\AWL.txt','r')
gsl_awl_dict = {}
for word in gsl1000_in.readlines():
    gsl_awl_dict[word.strip()] = 1
for word in gsl2000_in.readlines():
    gsl_awl_dict[word.strip()] = 2
for word in awl_in.readlines():
    gsl_awl_dict[word.strip()] = 3
gsl1000_in.close()
gsl2000_in.close()
awl_in.close()
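The three loops follow the same pattern, so they can be folded into one helper. The following is a sketch under assumptions: `build_list_dict` is a hypothetical name, the miniature lists stand in for the real files, and skipping blank lines is an addition that the original code does not make.

```python
def build_list_dict(word_lists):
    """Map each word to the number (1, 2, 3, ...) of the list it appears in."""
    lookup = {}
    for list_id, words in enumerate(word_lists, start=1):
        for word in words:
            word = word.strip()
            if word:  # assumption: ignore blank lines in the list files
                lookup[word] = list_id
    return lookup

# Hypothetical miniature lists; the real ones are read from
# GSL1000.txt, GSL2000.txt and AWL.txt
gsl1000 = ["the\n", "child\n"]
gsl2000 = ["marsh\n", "\n"]
awl = ["analyse\n"]
gsl_awl_dict = build_list_dict([gsl1000, gsl2000, awl])
print(gsl_awl_dict)
```

As in the original code, a word that occurs in more than one list keeps the number of the last list in which it appears.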
(3) Determine which word list each word in ge.txt belongs to
- Here is the third part of the code, which determines which words in the text belong to each of the three lists and computes their counts and percentages. We first define four empty dictionaries: three to store the words of the three lists together with their frequencies (gsl1000_words, gsl2000_words, awl_words) and one to store the words outside the three lists (other_words). We then loop over the keys of wordlist_freq_dict, that is, the words of the ge.txt text (wordlist_freq_dict.keys()). If a word is not in the GSL or AWL lists (if word not in gsl_awl_dict.keys()), it becomes a key of other_words with the value 4 (other_words[word] = 4). If the word is in GSL1000, it becomes a key of gsl1000_words, and the value is its frequency in ge.txt. The other lists are handled in the same way.
# Recognizing_gsl_awl_words.py,part 3-1
# the following is to categorize the words in wordlist_freq_dict
# into dictionaries of GSL1000 words,GSL2000 words,AWL words or others
gsl1000_words = {}
gsl2000_words = {}
awl_words = {}
other_words = {}
for word in wordlist_freq_dict.keys():
    if word not in gsl_awl_dict.keys():
        other_words[word] = 4
    elif gsl_awl_dict[word] == 1:
        gsl1000_words[word] = wordlist_freq_dict[word] # frequency in ge.txt of a word that belongs to GSL1000
    elif gsl_awl_dict[word] == 2:
        gsl2000_words[word] = wordlist_freq_dict[word]
    elif gsl_awl_dict[word] == 3:
        awl_words[word] = wordlist_freq_dict[word]
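The if/elif chain can be condensed by using `dict.get` with 4 as the default for words found in none of the lists. This is a sketch rather than the original method: the name `categorize` and the sample data are hypothetical, and unlike the code above it stores the frequency (not the constant 4) for the "other" words.

```python
def categorize(freq_table, list_lookup):
    """Split a frequency table into four dicts keyed by list number (4 = other)."""
    groups = {1: {}, 2: {}, 3: {}, 4: {}}
    for word, freq in freq_table.items():
        # Words missing from list_lookup fall into group 4
        groups[list_lookup.get(word, 4)][word] = freq
    return groups

# Hypothetical sample data
wordlist_freq_dict = {"the": 3, "marsh": 1, "analyse": 2, "pip": 5}
gsl_awl_dict = {"the": 1, "marsh": 2, "analyse": 3}
groups = categorize(wordlist_freq_dict, gsl_awl_dict)
print(groups[1], groups[2], groups[3], groups[4])
```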
(4) Compute the frequency of each class of words in ge.txt (common words, academic vocabulary, rare words)
- We then compute the total frequency of the words from each list and store it in variables such as gsl1000_freq_total. To do this, we first define a variable with the value 0 (e.g. gsl1000_freq_total = 0). Then, for each word in GSL1000, we add its frequency in ge.txt to that total.
# Recognizing_gsl_awl_words.py,part 3-2
# compute freq total
gsl1000_freq_total = 0
gsl2000_freq_total = 0
awl_freq_total = 0
other_freq_total = 0
for word in gsl1000_words:
    gsl1000_freq_total += wordlist_freq_dict[word]
for word in gsl2000_words:
    gsl2000_freq_total += wordlist_freq_dict[word]
for word in awl_words:
    awl_freq_total += wordlist_freq_dict[word]
for word in other_words:
    other_freq_total += wordlist_freq_dict[word]
(4) Count the word types of each class in ge.txt (common words, academic vocabulary, rare words)
- Next, we count the word types in each list. The calculation is simple: the number of word types is the length of each dictionary. We then add up the total frequencies of the four parts to obtain the total word frequency of ge.txt, and add up the numbers of word types of the four parts to obtain the total number of word types in ge.txt.
# Recognizing_gsl_awl_words.py,part 3-3
# to compute the number of words in gsl1000,gsl2000,awl and other words
gsl1000_num_of_words = len(gsl1000_words)
gsl2000_num_of_words = len(gsl2000_words)
awl_num_of_words = len(awl_words)
other_num_of_words = len(other_words)
# Compute the total word frequency (number of word tokens) in ge.txt
freq_total = gsl1000_freq_total + gsl2000_freq_total + awl_freq_total + other_freq_total
# Compute the total number of word types in ge.txt
num_of_words_total = gsl1000_num_of_words + gsl2000_num_of_words + awl_num_of_words + other_num_of_words
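The difference between word types (distinct word forms) and total frequency (word tokens) can be checked on a tiny example. The sketch below assumes each group is a dictionary mapping words to frequencies, including the "other" group; the function name `summarize` and the data are hypothetical.

```python
def summarize(groups):
    """Return (type_counts, token_counts) for a dict of word->frequency dicts."""
    # Types: how many distinct words are in each group
    type_counts = {name: len(words) for name, words in groups.items()}
    # Tokens: the sum of the frequencies in each group
    token_counts = {name: sum(words.values()) for name, words in groups.items()}
    return type_counts, token_counts

# Hypothetical sample groups
groups = {
    "gsl1000": {"the": 3, "dog": 2},
    "gsl2000": {"marsh": 1},
    "awl": {},
    "other": {"pip": 4},
}
type_counts, token_counts = summarize(groups)
num_of_words_total = sum(type_counts.values())
freq_total = sum(token_counts.values())
print(num_of_words_total, freq_total)
```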
- Here is the fourth part of the code, which writes the results to the range_wordlist_results.txt file. We first open a file handle for writing to range_wordlist_results.txt. Then we write the title 'RESULTS OF WORD ANALYSIS', followed by two newline characters ('\n\n'). Next, we write the total number of word types in the text and the number of word types in each part. Note that whatever is written to a text file must be a string, so a numeric variable must first be converted with str(), as in str(num_of_words_total).
# Recognizing_gsl_awl_words.py,part 4-1
# the following is to write out the results
# first,define the file to save the results
file_out = open(r'D:\works\ Text analysis \range_wordlist_results.txt','a')
# then,write out the results
file_out.write('RESULTS OF WORD ANALYSIS\n\n')
file_out.write('Total No. of word types in Great Expectations: ' + str(num_of_words_total) + '\n\n')
file_out.write('Total No. of GSL1000 word types : ' + str(gsl1000_num_of_words) + '\n\n')
file_out.write('Total No. of GSL2000 word types : ' + str(gsl2000_num_of_words) + '\n')
file_out.write('Total No. of AWL word types : ' + str(awl_num_of_words) + '\n')
file_out.write('Total No. of other word types : ' + str(other_num_of_words) + '\n')

- Next, we write the total word frequency of the text, the total frequency of each part, and each part's percentage of the total frequency. One point needs attention: the frequencies are integers, and in Python 2.7 dividing one integer by another yields an integer. When computing a percentage, therefore, one of the integers must first be converted to a floating-point value, as in float(freq_total). In Python 3 the / operator already returns a float, so the conversion is harmless but no longer required.
# Recognizing_gsl_awl_words.py,part 4-2
file_out.write('\n\n')
file_out.write('Total word frequency of Great Expectations: ' + str(freq_total) + '\n\n')
file_out.write('Total frequency of GSL1000 words: ' + str(gsl1000_freq_total) + '\n')
file_out.write('Frequency percentage of GSL1000 words: ' + str(gsl1000_freq_total / float(freq_total)) + '\n\n')
file_out.write('Total frequency of GSL2000 words: ' + str(gsl2000_freq_total) + '\n')
file_out.write('Frequency percentage of GSL2000 words: ' + str(gsl2000_freq_total / float(freq_total)) + '\n\n')
file_out.write('Total frequency of AWL words: ' + str(awl_freq_total) + '\n')
file_out.write('Frequency percentage of AWL words: ' + str(awl_freq_total / float(freq_total)) + '\n\n')
file_out.write('Total frequency of other words: ' + str(other_freq_total) + '\n')
file_out.write('Frequency percentage of other words: ' + str(other_freq_total / float(freq_total)) + '\n')
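The values written above are proportions between 0 and 1. If a percentage with two decimals is preferred, the format specification `.2%` can be used. The sketch below uses made-up numbers; the function name `coverage` is hypothetical.

```python
def coverage(part_freq, total_freq):
    """Share of the total frequency, as a float between 0 and 1."""
    # float() keeps the division correct under Python 2 as well
    return part_freq / float(total_freq)

# Hypothetical frequencies: 3 tokens out of 8
share = coverage(3, 8)
formatted = '{:.2%}'.format(share)
print(share)
print(formatted)
```

The `%` format type multiplies the value by 100 and appends a percent sign, so no manual scaling is needed.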

(5) Report the words of ge.txt and their frequencies by part
- Finally, we write the words of the three word lists with their frequencies, and the words outside the three lists with their frequencies, to the results file.
# Recognizing_gsl_awl_words.py, Part 4-3
# write out the GSL1000 words
file_out.write('\n\n')
file_out.write('##########\n')
file_out.write('Words in GSL1000\n\n')
for word in sorted(gsl1000_words.keys()):
    file_out.write(word + '\t' + str(gsl1000_words[word]) + '\n')
# write out the GSL2000 words
file_out.write('\n\n')
file_out.write('##########\n')
file_out.write('Words in GSL2000\n\n')
for word in sorted(gsl2000_words.keys()):
    file_out.write(word + '\t' + str(gsl2000_words[word]) + '\n')
# write out the AWL words
file_out.write('\n\n')
file_out.write('##########\n')
file_out.write('Words in AWL\n\n')
for word in sorted(awl_words.keys()):
    file_out.write(word + '\t' + str(awl_words[word]) + '\n')
# write out other words
file_out.write('\n\n')
file_out.write('##########\n')
file_out.write('Other words\n\n')
for word in sorted(other_words.keys()):
    file_out.write(word + '\t' + str(wordlist_freq_dict[word]) + '\n')
file_out.close()
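The four write-out sections differ only in their title and dictionary, so they could share one helper. The sketch below writes to an in-memory `io.StringIO` buffer so that it runs without creating a file; the function name `write_word_table` and the sample words are hypothetical.

```python
import io

def write_word_table(out, title, word_freqs):
    """Write a titled, alphabetically sorted word/frequency table to `out`."""
    out.write('\n\n')
    out.write('##########\n')
    out.write(title + '\n\n')
    for word in sorted(word_freqs):
        out.write(word + '\t' + str(word_freqs[word]) + '\n')

# Hypothetical sample data written to an in-memory buffer
buffer = io.StringIO()
write_word_table(buffer, 'Words in AWL', {'section': 2, 'area': 5})
result = buffer.getvalue()
print(result)
```

Any object with a `write` method can be passed as `out`, including a file handle opened with `open()`.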
- The results show that the Great Expectations text contains 10,764 word types and 188,955 word tokens (total frequency). GSL1000 words account for 84.78% of the total frequency, GSL2000 words for 5.26%, AWL words for 1.17%, and other words for 8.79%. According to previous vocabulary studies, AWL words account for roughly 8% to 10% of the total frequency of academic texts. Great Expectations is a novel, so its AWL words account for only 1.17%.