Case examples of corpus data processing (cases related to sentence retrieval)

2022-06-24 07:53:00 Triumph19

7.8 Sentence retrieval cases

  • This section discusses two cases related to sentence retrieval. In the first, we retrieve sentences that contain a given word or class of words. In the second, we retrieve passive sentences.

7.8.1 Retrieve all sentences containing a word

  • In this case, we first search ge.txt for sentences that contain the word "children", allowing the letter c to be upper or lower case. The code below reads the file line by line and writes every matching line to ge_children.txt:
import re
file_in = open(r'D:\works\ Text analysis \leopythonbookdata-master\texts\ge.txt','r')
file_out = open(r'D:\works\ Text analysis \ge_children.txt','a')
for line in file_in.readlines():
    if re.search(r'[Cc]hildren',line):
        file_out.write(line)
file_in.close()
file_out.close()
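  • The filtering logic above can be tried without touching any files. The following is a minimal sketch, not the book's code: the helper name filter_lines and the sample lines are hypothetical. It also shows a stricter variant with '\b' word boundaries, which is an assumption beyond the original pattern, to make clear that r'[Cc]hildren' also matches inside longer words.

```python
import re

def filter_lines(lines, pattern):
    """Return the lines in which the regular expression `pattern` is found."""
    regex = re.compile(pattern)
    return [line for line in lines if regex.search(line)]

# Hypothetical sample lines standing in for ge.txt
sample = [
    "Children were playing outside.\n",
    "The grandchildren arrived late.\n",
    "No match here.\n",
]

# Pattern from the text: matches "children" anywhere, even inside a longer word
loose = filter_lines(sample, r'[Cc]hildren')

# Stricter variant (an assumption): word boundaries keep only the whole word
strict = filter_lines(sample, r'\b[Cc]hildren\b')
```

  • Which variant is appropriate depends on whether forms such as "grandchildren" should count as hits.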

7.8.2 Retrieve all sentences containing words ending in "-tional"

  • Next, we search ge.txt for sentences that contain words ending in "-tional". Compared with the code above, we make two changes. The first is to change the name of the output file to "ge_tional.txt". The second is to change the regular expression to r'\w+tional\b', where '\b' matches a word boundary. Note that the code below also appends '\n' to each line it writes; since each line read by readlines() already ends with a newline, this leaves a blank line between the results.
import re
file_in = open(r'D:\works\ Text analysis \leopythonbookdata-master\texts\ge.txt','r')
file_out = open(r'D:\works\ Text analysis \ge_tional.txt','a')
for line in file_in.readlines():
    if re.search(r'\w+tional\b',line):
        file_out.write(line+'\n')
file_in.close()
file_out.close()
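  • To see what the pattern r'\w+tional\b' does and does not match, here is a small sketch with hypothetical sample sentences. Because '\w+' needs at least one word character before "tional" and '\b' requires the word to end right after it, "national" matches but "nationality" does not.

```python
import re

pattern = re.compile(r'\w+tional\b')

# Hypothetical sample sentences
sample = [
    "This is a national holiday.",
    "Her nationality is unknown.",
    "An additional, optional step.",
]

# For each sentence, list the words that end in "-tional"
matches = [pattern.findall(sentence) for sentence in sample]
```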

7.9 Implementing the functions of the Range software

  • The study of academic English vocabulary is an important research topic in applied linguistics and English for academic purposes. Nation (2001) divides vocabulary into common words, academic vocabulary, technical vocabulary, and rare words. According to Nation (2001), learners tend to learn common vocabulary first, then academic vocabulary, and finally technical and rare words. Common vocabulary generally refers to the 2,000 most frequently used English words (or word families) in the General Service List (GSL) developed by West (1953). Scholars have also developed many academic English word lists to help researchers learn and use academic English vocabulary. Among them, the Academic Word List (AWL) developed by Coxhead (2000) has been the most influential in recent years; it contains 570 high-frequency academic words (or word families). To help researchers apply the AWL in research and in the development of teaching materials, Paul Nation also developed the Range software. The software has three built-in word lists: the most common 1,000 GSL words (GSL1000 for short), the most common 2,000 GSL words (GSL2000 for short), and the AWL. Range analyzes which words in a corpus text belong to GSL1000, GSL2000, and the AWL, and reports the number of words from each list and their percentage of the total number of words in the text (coverage).
  • Implementing the functions of Range takes roughly four steps. First, read in the text and build its word frequency table. Second, read in the three word lists GSL1000.txt, GSL2000.txt, and AWL.txt. Third, determine which words in the text belong to each of the three lists, and count their number and percentage. Finally, write the results to a new text file. Below, we present the code in four parts, following these steps.
  • Here is the first part of the code, which builds the word frequency table for ge.txt. We first read ge.txt, lowercase each line, and use the regular expression '\W' to replace every non-word character (anything other than a letter, digit, or underscore) with a space. We then split each line into words with split() and store the words in the list all_words. Next, we define an empty dictionary, wordlist_freq_dict, to hold the frequency table. We loop over the words in all_words: if a word is already a key of the dictionary, its frequency is increased by 1; otherwise, its frequency is set to 1. In this way, the words are stored as dictionary keys and their frequencies as the corresponding values.

(1) Build the word frequency table for ge.txt

# Recognizing_gsl_awl_words.py,part 1

# the following is to make the wordlist with freq
# and store the info in a dictionary (wordlist_freq_dict)

import re
file_in = open(r'D:\works\ Text analysis \leopythonbookdata-master\texts\ge.txt','r')
all_words = []
for line in file_in.readlines():
    line2 = line.lower()
    line3 = re.sub(r'\W',r' ',line2) # replace non-word characters with spaces
    wordlist = line3.split()
    for word in wordlist:
        all_words.append(word)

wordlist_freq_dict = {}
for word in all_words:
    if word in wordlist_freq_dict.keys():
        wordlist_freq_dict[word] += 1
    else:
        wordlist_freq_dict[word] = 1
file_in.close()
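  • The same frequency table can be built more compactly with collections.Counter from the standard library. This is an alternative sketch rather than the book's method; the function name word_frequencies and the sample lines are hypothetical.

```python
import re
from collections import Counter

def word_frequencies(lines):
    """Lowercase each line, replace non-word characters with spaces, count the words."""
    counts = Counter()
    for line in lines:
        cleaned = re.sub(r'\W', ' ', line.lower())
        counts.update(cleaned.split())
    return counts

# Hypothetical two-line sample standing in for ge.txt
sample = ["The cat sat.\n", "The cat's hat!\n"]
freq = word_frequencies(sample)
```

  • One side effect of replacing every non-word character is visible here: the apostrophe in "cat's" becomes a space, so "cat" and "s" are counted as separate words. The original code behaves the same way.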

(2) Read in the word lists

  • Here is the second part of the code, which reads in the three word lists GSL1000.txt, GSL2000.txt, and AWL.txt. We first create an empty dictionary, gsl_awl_dict, then read the three word list files in turn and store their words as keys of the dictionary. The values for the words of the three lists are set to 1, 2, and 3 respectively.
# Recognizing_gsl_awl_words.py,Part 2
# the following is to read the GSL and the AWL words
# and save them in a dictionary
gsl1000_in = open(r'D:\works\ Text analysis \leopythonbookdata-master\texts\GSL1000.txt','r')
gsl2000_in = open(r'D:\works\ Text analysis \leopythonbookdata-master\texts\GSL2000.txt','r')
awl_in = open(r'D:\works\ Text analysis \leopythonbookdata-master\texts\AWL.txt','r')

gsl_awl_dict = {}
for word in gsl1000_in.readlines():
    gsl_awl_dict[word.strip()] = 1

for word in gsl2000_in.readlines():
    gsl_awl_dict[word.strip()] = 2

for word in awl_in.readlines():
    gsl_awl_dict[word.strip()] = 3

gsl1000_in.close()
gsl2000_in.close()
awl_in.close()
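  • The three loops above follow one pattern, which can be sketched as a single helper. The name build_level_dict and the miniature lists are hypothetical. Skipping blank lines is an addition not found in the original code. The sketch also shows that a word appearing in more than one list keeps the level of the last list read, which is also how the original code behaves.

```python
def build_level_dict(word_lists):
    """Map each word to the level (1, 2, 3, ...) of the list it appears in.

    `word_lists` is a sequence of iterables of lines, given in level order.
    """
    level_of = {}
    for level, lines in enumerate(word_lists, start=1):
        for line in lines:
            word = line.strip()
            if word:                    # skip blank lines (not in the original)
                level_of[word] = level  # a later list overwrites an earlier one
    return level_of

# Hypothetical miniature lists standing in for the three files
gsl1000 = ["the\n", "cat\n"]
gsl2000 = ["hat\n", "\n"]
awl = ["analyse\n", "cat\n"]  # "cat" is repeated on purpose to show the overwrite

gsl_awl = build_level_dict([gsl1000, gsl2000, awl])
```

  • Because the words of ge.txt are lowercased in part 1, the lookup only succeeds if the words in the list files are lowercase as well; that is assumed here.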

(3) Determine which word list each word in ge.txt belongs to

  • Here is the third part of the code, which determines which words in the text belong to the three lists and counts their number and percentage. We first define four empty dictionaries: three to store the words of the three lists together with their frequencies (gsl1000_words, gsl2000_words, awl_words), and one for the words outside the three lists (other_words). We then loop over the keys of wordlist_freq_dict, that is, the words of ge.txt (wordlist_freq_dict.keys()). If a word is not in the GSL or AWL lists (if word not in gsl_awl_dict.keys()), the word becomes a key of other_words with the value 4 (other_words[word] = 4). If the word is in GSL1000, the word becomes a key of gsl1000_words and its value is the word's frequency in ge.txt; the other two lists are handled in the same way.
# Recognizing_gsl_awl_words.py,part 3-1
# the following is to categorize the words in wordlist_freq_dict
# into dictionaries of GSL1000 words,GSL2000 words,AWL words or others
gsl1000_words = {}
gsl2000_words = {}
awl_words = {}
other_words = {}

for word in wordlist_freq_dict.keys():
    if word not in gsl_awl_dict.keys():
        other_words[word] = 4
    elif gsl_awl_dict[word] == 1:
        gsl1000_words[word] = wordlist_freq_dict[word] # frequency of this GSL1000 word in ge.txt
    elif gsl_awl_dict[word] == 2:
        gsl2000_words[word] = wordlist_freq_dict[word]
    elif gsl_awl_dict[word] == 3:
        awl_words[word] = wordlist_freq_dict[word]
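  • The if/elif chain above can also be written with dict.get and a default level. This is a sketch under stated assumptions: the function name categorize and the sample data are hypothetical, and unlike the original code it stores the frequency (not the marker value 4) for words outside the lists.

```python
def categorize(freq, level_of):
    """Split a word-frequency dict into one dict per level; level 4 means 'other'."""
    by_level = {1: {}, 2: {}, 3: {}, 4: {}}
    for word, count in freq.items():
        level = level_of.get(word, 4)  # words found in no list fall into level 4
        by_level[level][word] = count
    return by_level

# Hypothetical sample data
freq = {"the": 2, "hat": 1, "analyse": 1, "pip": 3}
level_of = {"the": 1, "hat": 2, "analyse": 3}
groups = categorize(freq, level_of)
```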
    

(4) Count the frequency of each category of words in ge.txt (common, academic, and rare words)

  • We then calculate the total frequency of the words in each list and store the results in variables such as gsl1000_freq_total. To do so, we first define a variable with the value 0 (e.g., gsl1000_freq_total = 0); then, for each word that belongs to GSL1000, we add its frequency in ge.txt to that variable.
# Recognizing_gsl_awl_words.py,part 3-2
# compute freq total
gsl1000_freq_total = 0
gsl2000_freq_total = 0
awl_freq_total = 0
other_freq_total = 0
for word in gsl1000_words:
    gsl1000_freq_total += wordlist_freq_dict[word]
for word in gsl2000_words:
    gsl2000_freq_total += wordlist_freq_dict[word]
for word in awl_words:
    awl_freq_total += wordlist_freq_dict[word]
for word in other_words:
    other_freq_total += wordlist_freq_dict[word]
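  • Each of these loops is a sum over a group of words, so it can be expressed with the built-in sum(). The sketch below uses a hypothetical helper total_frequency and toy data. It also illustrates why the frequencies are looked up in the full frequency table rather than taken from the group dictionaries: the values in other_words are the marker 4, not frequencies.

```python
def total_frequency(words, freq):
    """Sum the text frequencies of the given words."""
    return sum(freq[word] for word in words)

# Hypothetical toy data
freq = {"the": 2, "hat": 1, "pip": 3}
gsl1000_words = {"the": 2}
other_words = {"pip": 4}  # as in the text, the value 4 is only a marker

gsl1000_total = total_frequency(gsl1000_words, freq)
other_total = total_frequency(other_words, freq)
```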

(4) Count the number of word types in each category of words in ge.txt (common, academic, and rare words)

  • Next, we calculate the number of word types in each list. This is simple: the number of word types is the length of the corresponding dictionary. We then add up the total frequencies of the four parts to obtain the total word frequency of ge.txt, and add up the numbers of word types of the four parts to obtain the total number of word types in ge.txt.
# Recognizing_gsl_awl_words.py,part 3-3
# to compute the number of words in gsl1000,gsl2000,awl and other words
gsl1000_num_of_words = len(gsl1000_words)
gsl2000_num_of_words = len(gsl2000_words)
awl_num_of_words = len(awl_words)
other_num_of_words = len(other_words)
# compute the total word frequency (number of tokens) in ge.txt
freq_total = gsl1000_freq_total + gsl2000_freq_total + awl_freq_total + other_freq_total

# compute the total number of word types in ge.txt
num_of_words_total = gsl1000_num_of_words + gsl2000_num_of_words + awl_num_of_words + other_num_of_words
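  • Since every word of the text lands in exactly one of the four dictionaries, the two totals should agree with the full frequency table. The following sketch, with hypothetical toy data, shows this consistency check; it is a suggested sanity test, not part of the original program.

```python
# Hypothetical toy data: full frequency table and the four category dictionaries
freq = {"the": 2, "hat": 1, "analyse": 1, "pip": 3}
groups = [{"the": 2}, {"hat": 1}, {"analyse": 1}, {"pip": 4}]

# Totals computed from the parts, as in the text
num_types = sum(len(group) for group in groups)
num_tokens = sum(freq[word] for group in groups for word in group)

# The same totals computed directly from the full table
types_ok = num_types == len(freq)
tokens_ok = num_tokens == sum(freq.values())
```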
  • Here is the fourth part of the code, which writes the results to range_wordlist_results.txt. We first open a file handle for writing to range_wordlist_results.txt. We then write the title 'RESULTS OF WORD ANALYSIS', followed by two newlines ('\n\n'). Next, we write the total number of word types in the text and the number of word types in each part. Note that only strings can be written to a text file, so a numeric variable must first be converted with str(), as in str(num_of_words_total).
# Recognizing_gsl_awl_words.py,part 4-1
# the following is to write out the results
# first,define the file to save the results
file_out = open(r'D:\works\ Text analysis \range_wordlist_results.txt','a')

# then,write out the results
file_out.write('RESULTS OF WORD ANALYSIS\n\n')
file_out.write('Total No. of word types in Great Expectations: ' + str(num_of_words_total) + '\n\n')
file_out.write('Total No. of GSL1000 word types : ' + str(gsl1000_num_of_words) + '\n\n')
file_out.write('Total No. of GSL2000 word types : ' + str(gsl2000_num_of_words) + '\n')
file_out.write('Total No. of AWL word types : ' + str(awl_num_of_words) + '\n')
file_out.write('Total No. of other word types : ' + str(other_num_of_words) + '\n')
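  • In Python 3.6 and later, f-strings perform the conversion to a string implicitly, which avoids the explicit str() calls. The sketch below is an alternative, not the book's code: the helper write_type_counts and the counts are hypothetical, and io.StringIO stands in for the real output file.

```python
import io

def write_type_counts(out, title, counts):
    """Write a title and one 'label: number' line per entry to a file-like object."""
    out.write(f"{title}\n\n")
    for label, number in counts.items():
        out.write(f"Total No. of {label} word types: {number}\n")

# Hypothetical counts; io.StringIO stands in for the real output file
buffer = io.StringIO()
write_type_counts(buffer, "RESULTS OF WORD ANALYSIS", {"GSL1000": 2, "AWL": 1})
report = buffer.getvalue()
```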

  • Next, we write the total word frequency of the text, the total frequency of each part, and each part's share of the total frequency. One point needs attention: the frequencies are all integers, and in Python 2.7 dividing one integer by another yields an integer. Therefore, when calculating the share, one operand must first be converted to a floating-point value, as in float(freq_total). In Python 3 the / operator always performs true division, so the conversion is harmless but no longer required. Note also that the code writes each share as a fraction between 0 and 1 rather than as a value multiplied by 100.
# Recognizing_gsl_awl_words.py,part 4-2
file_out.write('\n\n')
file_out.write('Total word frequency of Great Expectations: ' + str(freq_total) + '\n\n')

file_out.write('Total frequency of GSL1000 words: ' + str(gsl1000_freq_total) + '\n')
file_out.write('Frequency percentage of GSL1000 words: ' + str(gsl1000_freq_total / float(freq_total)) + '\n\n')

file_out.write('Total frequency of GSL2000 words: ' + str(gsl2000_freq_total) + '\n')
file_out.write('Frequency percentage of GSL2000 words: ' + str(gsl2000_freq_total / float(freq_total)) + '\n\n')

file_out.write('Total frequency of AWL words: ' + str(awl_freq_total) + '\n')
file_out.write('Frequency percentage of AWL words: ' + str(awl_freq_total / float(freq_total)) + '\n\n')

file_out.write('Total frequency of other words: ' + str(other_freq_total) + '\n')
file_out.write('Frequency percentage of other words: ' + str(other_freq_total / float(freq_total)) + '\n')
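  • To print the shares as percentages with two decimals, the '%' format type can be used, which multiplies by 100 and appends a percent sign. The sketch assumes Python 3; the helper name coverage and the guard for an empty text are additions for illustration.

```python
def coverage(part, total):
    """Return part/total formatted as a percentage with two decimals."""
    if total == 0:
        return "n/a"  # avoid ZeroDivisionError on an empty text
    return f"{part / total:.2%}"

# Toy values, not results from ge.txt
examples = [coverage(3, 4), coverage(1, 3), coverage(0, 0)]
```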

(5) Report the total word frequency of ge.txt and the word frequencies of each part

  • Finally, we write the words of the three word lists, and the words outside the three lists, to the result file together with their frequencies.
# Recognizing_gsl_awl_words.py, Part 4-3

# write out the GSL1000 words
file_out.write('\n\n')
file_out.write('##########\n')
file_out.write('Words in GSL1000\n\n')
for word in sorted(gsl1000_words.keys()):
    file_out.write(word + '\t' + str(gsl1000_words[word]) + '\n')

# write out the GSL2000 words
file_out.write('\n\n')
file_out.write('##########\n')
file_out.write('Words in GSL2000\n\n')
for word in sorted(gsl2000_words.keys()):
    file_out.write(word + '\t' + str(gsl2000_words[word]) + '\n')

# write out the AWL words
file_out.write('\n\n')
file_out.write('##########\n')
file_out.write('Words in AWL\n\n')
for word in sorted(awl_words.keys()):
    file_out.write(word + '\t' + str(awl_words[word]) + '\n')

# write out other words
file_out.write('\n\n')
file_out.write('##########\n')
file_out.write('Other words\n\n')
for word in sorted(other_words.keys()):
    file_out.write(word + '\t' + str(wordlist_freq_dict[word]) + '\n')

file_out.close()
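  • The four blocks above differ only in their heading and in the dictionary they iterate over, so they can be folded into one helper. This is a sketch with hypothetical names (write_section) and toy data; io.StringIO again stands in for the output file. Frequencies are always looked up in the full frequency table, which also works for other_words.

```python
import io

def write_section(out, heading, words, freq):
    """Write a heading followed by 'word<TAB>frequency' lines in alphabetical order."""
    out.write("\n\n##########\n")
    out.write(heading + "\n\n")
    for word in sorted(words):
        out.write(f"{word}\t{freq[word]}\n")

# Hypothetical toy data
freq = {"the": 2, "cat": 2, "pip": 3}
buffer = io.StringIO()
write_section(buffer, "Words in GSL1000", {"the": 2, "cat": 2}, freq)
write_section(buffer, "Other words", {"pip": 4}, freq)
text = buffer.getvalue()
```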
  • The results show that Great Expectations contains 10,764 word types and 188,955 word tokens (total frequency). GSL1000 words account for 84.78% of the total frequency, GSL2000 words for 5.26%, AWL words for 1.17%, and other words for 8.79%. According to previous vocabulary studies, AWL words account for roughly 8% to 10% of the total frequency of academic texts. Great Expectations, however, is a novel, so its AWL words account for only 1.17%.
Original site

Copyright notice
This article was written by [Triumph19]. Please include a link to the original when reposting. Thank you.
https://yzsam.com/2022/175/202206240314465264.html