
Case examples of corpus data processing (cases related to sentence retrieval)

2022-06-24 07:53:00 【Triumph19】

7.8 Sentence retrieval related cases

This section presents two cases related to sentence retrieval. In the first, we retrieve the sentences in a text that contain a given word or class of words. In the second, we retrieve the passive sentences in a text.

7.8.1 Retrieve all sentences containing a word

In this case, we first retrieve the sentences in ge.txt that contain the word "children", allowing the letter c to be upper or lower case. The script tests each line of the file, so the unit retrieved is a line of text. The code is as follows:

import re
file_in = open(r'D:\works\ Text analysis \leopythonbookdata-master\texts\ge.txt','r')
file_out = open(r'D:\works\ Text analysis \ge_children.txt','a')
for line in file_in.readlines():
    if re.search(r'[Cc]hildren',line):
        file_out.write(line)
file_in.close()
file_out.close()
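
The same filter can be sketched on in-memory data, which makes it easy to check what the pattern actually matches. The helper name `lines_matching` and the sample lines are assumptions made for illustration; they are not part of the original script. Note that `[Cc]hildren` also matches inside longer words such as "grandchildren", and does not match the all-capitals form.

```python
import re

def lines_matching(pattern, lines):
    """Return the lines in which the regular expression matches somewhere."""
    regex = re.compile(pattern)
    return [line for line in lines if regex.search(line)]

# Hypothetical sample lines standing in for ge.txt.
sample = [
    "Children were playing in the yard.\n",
    "The child was alone.\n",
    "She had three grandchildren.\n",
    "No CHILDREN here.\n",
]

# Keeps the first and third lines only.
hits = lines_matching(r'[Cc]hildren', sample)
```

An open file object can be passed as `lines`, because iterating over a file yields its lines one at a time.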

7.8.2 Retrieve all sentences containing a word ending in "-tional"

Next, we retrieve the sentences in ge.txt that contain a word ending in "-tional". Compared with the code above, we make two changes: the output file is renamed to "ge_tional.txt", and the search pattern becomes r'\w+tional\b', where '\b' matches a word boundary. The code is as follows:

import re
file_in = open(r'D:\works\ Text analysis \leopythonbookdata-master\texts\ge.txt','r')
file_out = open(r'D:\works\ Text analysis \ge_tional.txt','a')
for line in file_in.readlines():
    if re.search(r'\w+tional\b',line):
        file_out.write(line)
file_in.close()
file_out.close()
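
To see what the word boundary contributes, the pattern can be tried on a few short strings. This is a minimal sketch; the test strings are invented for illustration and are not taken from ge.txt.

```python
import re

pattern = re.compile(r'\w+tional\b')

# Hypothetical test strings.
examples = [
    "a national holiday",   # "national" ends at a word boundary
    "her nationality",      # "tional" is followed by "i", so \b fails
    "it is optional.",      # punctuation after the word is a boundary
    "tional",               # \w+ needs at least one character before "tional"
]

matched = [s for s in examples if pattern.search(s)]
```

Only the first and third strings are kept: `\b` rejects "nationality", and `\w+` rejects the bare string "tional".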

7.9 Implementing the functions of the Range software

Academic English vocabulary is an important research topic in applied linguistics and English for academic purposes. Nation (2001) divides words into common (high-frequency) words, academic words, technical words and rare words, and holds that learners usually learn common words first, then academic words, and finally technical and rare words. Common words generally refers to the roughly 2,000 most frequent English words (or word families) in the General Service List (GSL) developed by West (1953). Scholars have also compiled many academic English word lists to help researchers learn and use academic vocabulary. Among these, the Academic Word List (AWL) developed by Coxhead (2000) has been the most influential in recent years; it contains 570 high-frequency academic words (or word families).
To make the AWL easier to use in research and in the development of teaching materials, Paul Nation also developed the Range software. It has three built-in word lists: the 1,000 most frequent GSL words (GSL1000), the second 1,000 most frequent GSL words (GSL2000), and the AWL. Range identifies which words in a corpus text belong to GSL1000, GSL2000 and the AWL, and reports the number of words from each list and their percentage of the total number of words in the text (coverage).
Reproducing the function of Range takes roughly four steps. First, read in the text and build its word frequency table. Second, read in the three word lists GSL1000.txt, GSL2000.txt and AWL.txt. Third, determine which words in the text belong to each of the three lists and compute their counts and percentages. Finally, write the results to a new text file. Below, we present the code in four parts that follow these steps.
The first part of the code builds the word frequency table for ge.txt. We read ge.txt, convert each line to lower case, replace every non-word character (anything other than a letter, digit or underscore) with a space using the regular expression '\W', split the result into words with split(), and store the words in the list all_words. We then define an empty dictionary, wordlist_freq_dict, to hold the frequency table. We loop over the words in all_words: if a word is already a key of the dictionary, its frequency is increased by 1; otherwise its frequency is set to 1. In this way each word is stored as a dictionary key and its frequency as the corresponding value.

(1) Build the word frequency table for ge.txt

# Recognizing_gsl_awl_words.py,part 1

# the following is to make the wordlist with freq
# and store the info in a dictionary (wordlist_freq_dict)

import re
file_in = open(r'D:\works\ Text analysis \leopythonbookdata-master\texts\ge.txt','r')
all_words = []
for line in file_in.readlines():
    line2 = line.lower()
    line3 = re.sub(r'\W',r' ',line2) # replace non-word characters with spaces
    wordlist = line3.split()
    for word in wordlist:
        all_words.append(word)

wordlist_freq_dict = {}
for word in all_words:
    if word in wordlist_freq_dict.keys():
        wordlist_freq_dict[word] += 1
    else:
        wordlist_freq_dict[word] = 1
file_in.close()
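
The counting loop can also be written with `collections.Counter` from the standard library. The following is a minimal sketch of that alternative, not the code used in this section; the function name `word_frequencies` and the sample lines are assumptions. It also shows a side effect of replacing '\W' with a space: a contraction such as "didn't" is split into "didn" and "t".

```python
import re
from collections import Counter

def word_frequencies(lines):
    """Lowercase each line, replace non-word characters with spaces, count the words."""
    counts = Counter()
    for line in lines:
        counts.update(re.sub(r'\W', ' ', line.lower()).split())
    return counts

# Hypothetical sample lines standing in for ge.txt.
sample = ["The cat saw the dog.\n", "The dog didn't move.\n"]
freq = word_frequencies(sample)

# The three most frequent words, ties in order of first appearance.
top_words = freq.most_common(3)
```

A `Counter` is a dictionary subclass, so `freq[word]` and `freq.keys()` behave like the plain dictionary built above.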

(2) Read in the word lists

The second part of the code reads in the three word lists GSL1000.txt, GSL2000.txt and AWL.txt. We first create an empty dictionary, gsl_awl_dict, then read the three word-list files in turn, storing each word as a dictionary key. The values for words from the three lists are 1, 2 and 3 respectively.

# Recognizing_gsl_awl_words.py,Part 2
# the following is to read the GSL and the AWL words
# and save them in a dictionary
gsl1000_in = open(r'D:\works\ Text analysis \leopythonbookdata-master\texts\GSL1000.txt','r')
gsl2000_in = open(r'D:\works\ Text analysis \leopythonbookdata-master\texts\GSL2000.txt','r')
awl_in = open(r'D:\works\ Text analysis \leopythonbookdata-master\texts\AWL.txt','r')

gsl_awl_dict = {}
for word in gsl1000_in.readlines():
    gsl_awl_dict[word.strip()] = 1

for word in gsl2000_in.readlines():
    gsl_awl_dict[word.strip()] = 2

for word in awl_in.readlines():
    gsl_awl_dict[word.strip()] = 3

gsl1000_in.close()
gsl2000_in.close()
awl_in.close()
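
The three near-identical loops can be folded into one helper. The sketch below uses in-memory `io.StringIO` objects with invented contents in place of the real word-list files, and the name `load_word_lists` is an assumption. Unlike the script above, it skips blank lines so that an empty string never becomes a key.

```python
import io

def load_word_lists(sources):
    """Map each word to the label of the list it appears in.

    `sources` is a sequence of (file_like, label) pairs. A word that occurs
    in several lists keeps the label of the list read last.
    """
    labels = {}
    for handle, label in sources:
        for line in handle:
            word = line.strip()
            if word:                 # skip blank lines
                labels[word] = label
    return labels

# Hypothetical miniature lists standing in for GSL1000.txt, GSL2000.txt and AWL.txt.
gsl1000 = io.StringIO("the\nname\n\n")
gsl2000 = io.StringIO("garden\n")
awl = io.StringIO("analyse\nconcept\n")

labels = load_word_lists([(gsl1000, 1), (gsl2000, 2), (awl, 3)])
```

With real files, each `io.StringIO` object would be replaced by an open file handle.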

(3) Determine which word list each word in ge.txt belongs to

The third part of the code determines which words in the text belong to the three lists and computes their counts and percentages. We first define four empty dictionaries: three to store the words of each list together with their frequencies (gsl1000_words, gsl2000_words, awl_words) and one for the words outside the three lists (other_words). We then loop over the keys of wordlist_freq_dict, that is, the words of ge.txt (wordlist_freq_dict.keys()). If a word is not in the GSL or AWL lists (if word not in gsl_awl_dict.keys()), it becomes a key of other_words with the value 4 (other_words[word] = 4). If the word is in GSL1000, it becomes a key of gsl1000_words whose value is the word's frequency in ge.txt, and likewise for the other two lists.

# Recognizing_gsl_awl_words.py,part 3-1
# the following is to categorize the words in wordlist_freq_dict
# into dictionaries of GSL1000 words,GSL2000 words,AWL words or others
gsl1000_words = {}
gsl2000_words = {}
awl_words = {}
other_words = {}

for word in wordlist_freq_dict.keys():
    if word not in gsl_awl_dict.keys():
        other_words[word] = 4
    elif gsl_awl_dict[word] == 1:
        gsl1000_words[word] = wordlist_freq_dict[word] # frequency in ge.txt of a word that belongs to GSL1000
    elif gsl_awl_dict[word] == 2:
        gsl2000_words[word] = wordlist_freq_dict[word]
    elif gsl_awl_dict[word] == 3:
        awl_words[word] = wordlist_freq_dict[word]
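
The if/elif chain can be sketched more compactly with `dict.get`, which returns a default when a key is missing. This is an illustrative variant under stated assumptions: the function name `categorize` and the sample data are invented, and, unlike the code above, the group for unlisted words stores each word's frequency rather than the constant 4.

```python
def categorize(freq, labels):
    """Split a frequency dict into four dicts keyed by list label (4 = in no list)."""
    groups = {1: {}, 2: {}, 3: {}, 4: {}}
    for word, count in freq.items():
        # labels.get(word, 4) yields 1, 2 or 3 for listed words and 4 otherwise
        groups[labels.get(word, 4)][word] = count
    return groups

# Hypothetical inputs in the shape of wordlist_freq_dict and gsl_awl_dict.
freq = {'the': 3, 'garden': 1, 'concept': 2, 'pip': 5}
labels = {'the': 1, 'garden': 2, 'concept': 3}

groups = categorize(freq, labels)
```

Here `groups[1]` plays the role of gsl1000_words, `groups[2]` of gsl2000_words, `groups[3]` of awl_words and `groups[4]` of other_words.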

(4) Compute the total frequency of each word category in ge.txt (common words, academic words, rare words)

Next, we compute the total frequency of the words in each list and store the results in gsl1000_freq_total and the corresponding variables. To do so, we first define a variable with the value 0 (such as gsl1000_freq_total = 0); then, for each word that belongs to GSL1000, we add its frequency in ge.txt to that variable.

# Recognizing_gsl_awl_words.py,part 3-2
# compute freq total
gsl1000_freq_total = 0
gsl2000_freq_total = 0
awl_freq_total = 0
other_freq_total = 0
for word in gsl1000_words:
    gsl1000_freq_total += wordlist_freq_dict[word]
for word in gsl2000_words:
    gsl2000_freq_total += wordlist_freq_dict[word]
for word in awl_words:
    awl_freq_total += wordlist_freq_dict[word]
for word in other_words:
    other_freq_total += wordlist_freq_dict[word]
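
The same totals can be obtained with the built-in `sum`. The sketch below uses small invented dictionaries in the shape produced by part 3-1. It also shows why the frequencies of the other words have to be looked up in wordlist_freq_dict: other_words stores the label 4 as its value, not the frequency.

```python
# Hypothetical data in the shape built by part 3-1.
wordlist_freq_dict = {'the': 3, 'name': 2, 'garden': 1, 'pip': 7}
gsl1000_words = {'the': 3, 'name': 2}
other_words = {'pip': 4}          # the value is the label 4, not a frequency

# The GSL and AWL dictionaries already hold frequencies, so their values can be summed.
gsl1000_freq_total = sum(gsl1000_words.values())

# For other_words, look up each word's frequency in the full table.
other_freq_total = sum(wordlist_freq_dict[word] for word in other_words)
```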

(4) Count the word types in each category of ge.txt (common words, academic words, rare words)

Next, we compute the number of word types in each list. This is simple: the number of word types is the length of the corresponding dictionary. We then add up the total frequencies of the four groups to obtain the total word frequency (number of tokens) of ge.txt, and add up the numbers of word types to obtain the total number of word types in the text.

# Recognizing_gsl_awl_words.py,part 3-3
# to compute the number of words in gsl1000,gsl2000,awl and other words
gsl1000_num_of_words = len(gsl1000_words)
gsl2000_num_of_words = len(gsl2000_words)
awl_num_of_words = len(awl_words)
other_num_of_words = len(other_words)
# compute the total word frequency (tokens) of ge.txt
freq_total = gsl1000_freq_total + gsl2000_freq_total + awl_freq_total + other_freq_total

# compute the total number of word types in ge.txt
num_of_words_total = gsl1000_num_of_words + gsl2000_num_of_words + awl_num_of_words + other_num_of_words
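
Because every word falls into exactly one of the four groups, the two totals should agree with the full frequency table. The following is a minimal sanity check on invented data; the list `categories` is an assumption that stands in for the four group dictionaries, each mapping a word to its frequency.

```python
# Hypothetical full frequency table and its four groups (word -> frequency).
wordlist_freq_dict = {'the': 3, 'garden': 1, 'concept': 2, 'pip': 5}
categories = [{'the': 3}, {'garden': 1}, {'concept': 2}, {'pip': 5}]

num_of_words_total = sum(len(group) for group in categories)       # word types
freq_total = sum(sum(group.values()) for group in categories)      # word tokens

# Every type and every token should be counted exactly once.
assert num_of_words_total == len(wordlist_freq_dict)
assert freq_total == sum(wordlist_freq_dict.values())
```

If either assertion fails on real data, a word has been dropped or counted twice during categorization.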

The fourth part of the code writes the results to range_wordlist_results.txt. We first open a file handle for writing to range_wordlist_results.txt, then write the title 'RESULTS OF WORD ANALYSIS' followed by two newline characters ('\n\n'). We then write the total number of word types in the text and the number of word types in each group. Note that only strings can be written to a text file, so a numeric variable must first be converted with str(), as in str(num_of_words_total).

# Recognizing_gsl_awl_words.py,part 4-1
# the following is to write out the results
# first,define the file to save the results
file_out = open(r'D:\works\ Text analysis \range_wordlist_results.txt','a')

# then,write out the results
file_out.write('RESULTS OF WORD ANALYSIS\n\n')
file_out.write('Total No. of word types in Great Expectations: ' + str(num_of_words_total) + '\n\n')
file_out.write('Total No. of GSL1000 word types : ' + str(gsl1000_num_of_words) + '\n\n')
file_out.write('Total No. of GSL2000 word types : ' + str(gsl2000_num_of_words) + '\n')
file_out.write('Total No. of AWL word types : ' + str(awl_num_of_words) + '\n')
file_out.write('Total No. of other word types : ' + str(other_num_of_words) + '\n')
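
The need for str() can be demonstrated without touching the disk. This sketch writes to an in-memory `io.StringIO` object instead of the results file; the variable `wrote_int` and the example count are assumptions made for illustration. An f-string (Python 3.6 or later) performs the conversion implicitly.

```python
import io

out = io.StringIO()          # in-memory stand-in for the results file
num_of_words_total = 10764   # example value

try:
    out.write(num_of_words_total)      # writing an int raises TypeError
except TypeError:
    wrote_int = False
else:
    wrote_int = True

# Explicit conversion with str(), as in the script above.
out.write('Total No. of word types: ' + str(num_of_words_total) + '\n')
# Equivalent f-string; the number is converted inside the braces.
out.write(f'Total No. of word types: {num_of_words_total}\n')

text = out.getvalue()
```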


Next, we write the total word frequency of the text, the total frequency of each group, and each group's share of the total frequency. One point needs attention: the group frequencies and the total frequency are all integers, and in Python 2.7 dividing one integer by another yields an integer. When computing the share, we therefore convert one operand to a floating-point value first, as in float(freq_total). In Python 3 the / operator always performs true division, so the conversion is harmless but no longer required.

# Recognizing_gsl_awl_words.py,part 4-2
file_out.write('\n\n')
file_out.write('Total word frequency of Great Expectations: ' + str(freq_total) + '\n\n')

file_out.write('Total frequency of GSL1000 words: ' + str(gsl1000_freq_total) + '\n')
file_out.write('Frequency percentage of GSL1000 words: ' + str(gsl1000_freq_total / float(freq_total)) + '\n\n')

file_out.write('Total frequency of GSL2000 words: ' + str(gsl2000_freq_total) + '\n')
file_out.write('Frequency percentage of GSL2000 words: ' + str(gsl2000_freq_total / float(freq_total)) + '\n\n')

file_out.write('Total frequency of AWL words: ' + str(awl_freq_total) + '\n')
file_out.write('Frequency percentage of AWL words: ' + str(awl_freq_total / float(freq_total)) + '\n\n')

file_out.write('Total frequency of other words: ' + str(other_freq_total) + '\n')
file_out.write('Frequency percentage of other words: ' + str(other_freq_total / float(freq_total)) + '\n')
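
The script writes each share as a fraction such as 0.8478. If a percentage is preferred, the format specifier `.2%` multiplies by 100 and appends a percent sign. The counts below are invented so that the arithmetic is easy to follow; they are not results from ge.txt.

```python
# Hypothetical counts.
gsl1000_freq_total = 8478
freq_total = 10000

floor_result = gsl1000_freq_total // freq_total   # integer division, as "/" behaves in Python 2.7
ratio = gsl1000_freq_total / freq_total           # true division in Python 3

as_fraction = str(ratio)       # the form written by the script above
as_percent = f'{ratio:.2%}'    # percentage with two decimal places
```

Here `floor_result` is 0, which is the value that integer division would have written to the file without the float() conversion.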


(5) Report the total word frequency of ge.txt and the word frequencies of each group

Finally, we write the words of each of the three lists, together with their frequencies, and the words outside the three lists, together with their frequencies, to the results file.

# Recognizing_gsl_awl_words.py, Part 4-3

# write out the GSL1000 words
file_out.write('\n\n')
file_out.write('##########\n')
file_out.write('Words in GSL1000\n\n')
for word in sorted(gsl1000_words.keys()):
    file_out.write(word + '\t' + str(gsl1000_words[word]) + '\n')

# write out the GSL2000 words
file_out.write('\n\n')
file_out.write('##########\n')
file_out.write('Words in GSL2000\n\n')
for word in sorted(gsl2000_words.keys()):
    file_out.write(word + '\t' + str(gsl2000_words[word]) + '\n')

# write out the AWL words
file_out.write('\n\n')
file_out.write('##########\n')
file_out.write('Words in AWL\n\n')
for word in sorted(awl_words.keys()):
    file_out.write(word + '\t' + str(awl_words[word]) + '\n')

# write out other words
file_out.write('\n\n')
file_out.write('##########\n')
file_out.write('Other words\n\n')
for word in sorted(other_words.keys()):
    file_out.write(word + '\t' + str(wordlist_freq_dict[word]) + '\n')

file_out.close()
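
The script lists each group alphabetically. A frequency-ordered listing is a common alternative for word lists; the sketch below shows one way to produce it, with ties broken alphabetically. The sample dictionary is invented, and the lines are collected in a list rather than written to a file.

```python
# Hypothetical group dictionary (word -> frequency).
gsl1000_words = {'the': 3, 'name': 2, 'a': 3, 'was': 1}

# Sort by descending frequency, then alphabetically for equal frequencies.
by_freq = sorted(gsl1000_words.items(), key=lambda item: (-item[1], item[0]))

# Same tab-separated line format as the script above.
lines = [word + '\t' + str(count) + '\n' for word, count in by_freq]
```

Each element of `lines` can be passed to file_out.write() in place of the line built inside the original loop.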

The results show that Great Expectations contains 10,764 word types and 188,955 word tokens (total frequency). GSL1000 words account for 84.78% of the tokens, GSL2000 words for 5.26%, AWL words for 1.17%, and other words for 8.79%. According to previous vocabulary studies, AWL words account for roughly 8% to 10% of the total frequency in academic texts. Great Expectations is a novel, so its AWL words account for only 1.17%.

Copyright notice
This article was written by [Triumph19]. Please include a link to the original when reposting. Thank you.
https://yzsam.com/2022/175/202206240314465264.html
