Theory and Application of Natural Language Processing
2022-06-22 06:54:00 【C don't laugh】
Introduction to natural language processing
What is natural language?

- Natural language takes speech as its material shell and is a symbolic system made up of two parts: vocabulary and grammar.
What is natural language processing (NLP)?

- Natural language processing is the technology of using computers as tools to carry out various kinds of processing of the natural language information humans use in its specific written and oral forms. (NLP is the bridge between humans and machines.)

What is natural language understanding (NLU)?
Natural language understanding (NLU) is the general name for all the models and tasks that let machines understand text content. NLU plays a very important role in text information processing systems; it is an essential module of recommendation, question answering, search, and similar systems.
Natural language processing tasks and methods
Preliminary knowledge
Language models
What is a language model?
A language model estimates the probability of a piece of text, and plays an important role in tasks such as information retrieval, machine translation, and speech recognition. Language models divide into statistical language models and neural network language models.
Common language models
N-gram language model (n-gram model)
An n-gram is a statistical language model used to predict the n-th item from the previous (n−1) items. Depending on the application, these items can be phonemes (speech recognition), characters (input methods), words (word segmentation), or base pairs (genetic information). In general, an n-gram model can be trained from a large-scale text or audio corpus.
A common application of the n-gram model
Search engines such as Google or Baidu, and input method suggestions: when we type one or more words, the search box usually offers several completions in a drop-down menu.
Shortcomings:
- Sparsity: the larger the window n, the more likely a count is 0.
- Storage: all the counts must be stored in advance, and there are far too many of them.
A minimal sketch of the counting follows.
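To make the counting concrete, here is a minimal bigram (n = 2) sketch over a made-up toy corpus; P(w2 | w1) is estimated as count(w1 w2) / count(w1), and the zero result illustrates the sparsity problem above:

from collections import Counter

# Toy corpus; a real model would be trained on large-scale text
corpus = "the cat sat on the mat the cat ate".split()

# Count bigrams and unigrams
bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)

def bigram_prob(w1, w2):
    # P(w2 | w1) = count(w1 w2) / count(w1); 0 if the bigram was never seen
    return bigrams[(w1, w2)] / unigrams[w1] if unigrams[w1] else 0.0

print(bigram_prob("the", "cat"))  # 2/3
print(bigram_prob("the", "dog"))  # 0.0 -- unseen bigram, the sparsity problem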
Neural network language model (NNLM)

Advantages: it avoids the sparsity and storage problems of the n-gram language model.
Shortcomings: better performance requires a larger window, and the larger the window, the larger the number of parameters.
RNN language model

Advantages of the RNN language model:
- It can handle text sequences of any length, while the number of parameters stays the same.
- It can handle longer contexts than an n-gram model, with no sparsity problem.
A structural sketch follows.
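A minimal structural sketch of an RNN language model in PyTorch (assuming PyTorch is available; the vocabulary size and dimensions are illustrative). Note how the parameter count is independent of sequence length, as the list above says:

import torch
import torch.nn as nn

class RNNLanguageModel(nn.Module):
    def __init__(self, vocab_size=10000, embed_dim=128, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.RNN(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)  # scores for the next word

    def forward(self, tokens):               # tokens: (batch, seq_len) word ids
        h, _ = self.rnn(self.embed(tokens))  # one hidden state per position
        return self.out(h)                   # (batch, seq_len, vocab_size)

model = RNNLanguageModel()
logits = model(torch.randint(0, 10000, (2, 7)))  # any sequence length works
print(logits.shape)  # torch.Size([2, 7, 10000])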
Text vectorization
Text representation transforms unstructured text into structured information, so that we can compute over it and carry out everyday tasks such as text classification and sentiment judgment.
One-hot encoding | one-hot representation
Suppose there are 4 words in total: cat, dog, cattle, sheep (see the sketch below).
Shortcomings:
- It cannot express relationships between words.
- The vectors are too sparse, making computation and storage inefficient.
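A minimal sketch of the one-hot encoding for the 4-word vocabulary above; each word gets a vector with a single 1, and every pair of distinct vectors is orthogonal, which is why no similarity between words can be expressed:

import numpy as np

vocab = ["cat", "dog", "cattle", "sheep"]
one_hot = {w: np.eye(len(vocab), dtype=int)[i] for i, w in enumerate(vocab)}

print(one_hot["cat"])    # [1 0 0 0]
print(one_hot["sheep"])  # [0 0 0 1]
# Orthogonal vectors: the dot product of any two different words is 0,
# so "cat" looks no closer to "dog" than to "sheep"
print(one_hot["cat"] @ one_hot["dog"])  # 0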
Integer encoding

Shortcomings:
- It cannot express relationships between words.
- Integer codes can be hard for a model to interpret.
Word embedding | word embedding
Common algorithms
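word2vec and GloVe are among the commonly used algorithms. Below is a hedged sketch of training dense word vectors with gensim's Word2Vec (assuming gensim 4.x, where the dimension parameter is named vector_size; the toy corpus is made up):

from gensim.models import Word2Vec

# Toy corpus: a list of tokenized sentences (a real corpus would be far larger)
sentences = [["cat", "meows"], ["dog", "barks"], ["cat", "and", "dog", "play"],
             ["sheep", "and", "cattle", "graze"]]

# sg=1 selects skip-gram; min_count=1 keeps every word in this tiny corpus
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

print(model.wv["cat"].shape)              # (50,) dense vector, unlike one-hot
print(model.wv.similarity("cat", "dog"))  # embeddings can express word similarity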

Hidden Markov model (HMM)
An HMM is a probabilistic model of sequences: a hidden Markov chain generates an unobservable random sequence of hidden states, and each state generates an observation, producing an observed random sequence. Each position in the sequence is called a time step.
Two assumptions
- Hidden state independence assumption (homogeneous Markov assumption): the current state depends only on the previous state.
- Observation independence assumption: the current observation is generated only by the current hidden state, independent of other observations.
An HMM model example
import numpy as np
from hmmlearn import hmm
# Set of hidden states
states = ["box 1", "box 2", "box 3"]
n_states = len(states)
# Set of observation states
observations = ["red", "white"]
n_observations = len(observations)
# Initial state distribution
start_probability = np.array([0.2, 0.4, 0.4])
# State transition probability matrix
transition_probability = np.array([
    [0.5, 0.2, 0.3],
    [0.3, 0.5, 0.2],
    [0.2, 0.3, 0.5]
])
# Observation (emission) probability matrix
emission_probability = np.array([
    [0.5, 0.5],
    [0.4, 0.6],
    [0.7, 0.3]
])
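The snippet above only defines the parameters. A hedged completion that runs Viterbi decoding (assuming hmmlearn >= 0.2.8, where the discrete-observation class is named CategoricalHMM; in older versions it was MultinomialHMM):

# Build the model from the parameters defined above
model = hmm.CategoricalHMM(n_components=n_states)
model.startprob_ = start_probability
model.transmat_ = transition_probability
model.emissionprob_ = emission_probability

# Observed sequence red, white, red, as a column of indices into `observations`
seen = np.array([[0, 1, 0]]).T
logprob, hidden = model.decode(seen, algorithm="viterbi")
print("Observations:", ", ".join(observations[o] for o in seen.flatten()))
print("Most likely hidden states:", ", ".join(states[s] for s in hidden))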

Conditional random field model (CRF)
- A conditional random field (Conditional Random Field, CRF) drops HMM's two independence assumptions: it treats label transitions and contextual inputs as global features and normalizes probabilities globally, which solves HMM's label bias problem and its inability to use contextual features. CRFs are widely used in scenarios such as word segmentation, entity recognition, and part-of-speech tagging. With the rise of deep learning, models such as BiLSTM+CRF, BERT+CRF, and Transformer+CRF have appeared one after another, and in these tagging scenarios the results have improved significantly.
- Conditional random fields are used for sequence labeling and show good results in natural language processing tasks such as Chinese word segmentation, Chinese name recognition, and disambiguation. The principle: given an observation sequence and a label sequence, build a conditional probability model. CRFs can be applied to different prediction problems, and their learning method is usually maximum likelihood estimation.
- A CRF model also has to solve three basic problems: feature selection, parameter training, and decoding.


Generative and discriminative models
- Generative model: models the joint distribution directly, e.g. Gaussian mixture model, hidden Markov model, Markov random field.
- Discriminative model: models the conditional distribution, e.g. conditional random field, support vector machine, logistic regression.
Bidirectional recurrent neural network + conditional random field model (BiLSTM+CRF)
LSTM is a variant of the recurrent neural network (RNN); BiLSTM denotes a bidirectional LSTM network. Compared with the traditional CRF approach, BiLSTM+CRF learns contextual features more effectively, needs no hand-designed features, and can handle longer context dependencies. A minimal structural sketch follows.
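A minimal structural sketch (not a canonical implementation), assuming PyTorch plus the third-party pytorch-crf package for the CRF layer; the vocabulary size, tag set, and dimensions are illustrative:

import torch
import torch.nn as nn
from torchcrf import CRF  # third-party package: pip install pytorch-crf

class BiLSTMCRF(nn.Module):
    def __init__(self, vocab_size=5000, num_tags=4, embed_dim=100, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # The bidirectional LSTM reads each sentence left-to-right and right-to-left,
        # so every position sees its full context -- no hand-designed features needed
        self.lstm = nn.LSTM(embed_dim, hidden_dim // 2, bidirectional=True, batch_first=True)
        self.fc = nn.Linear(hidden_dim, num_tags)   # per-token emission scores
        self.crf = CRF(num_tags, batch_first=True)  # learns tag-transition constraints

    def loss(self, tokens, tags, mask):
        emissions = self.fc(self.lstm(self.embed(tokens))[0])
        return -self.crf(emissions, tags, mask=mask)  # negative log-likelihood

    def predict(self, tokens, mask):
        emissions = self.fc(self.lstm(self.embed(tokens))[0])
        return self.crf.decode(emissions, mask=mask)  # Viterbi-decoded tag sequences

model = BiLSTMCRF()
tokens = torch.randint(0, 5000, (2, 6))  # 2 sentences of 6 token ids each
mask = torch.ones(2, 6, dtype=torch.bool)
print(model.predict(tokens, mask))       # e.g. [[1, 0, 2, 2, 3, 0], ...]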
Key technologies
Word segmentation
What is word segmentation?
Word segmentation breaks long text such as sentences, paragraphs, and articles down into word-level data structures, to facilitate subsequent processing and analysis.
Word segmentation example
import jieba
testSentence = "利用python进行数据分析"
print("1. Accurate mode result: " + "/".join(jieba.cut(testSentence, cut_all=False)))
print("2. Full mode result: " + "/".join(jieba.cut(testSentence, cut_all=True)))
print("3. Search engine mode result: " + "/".join(jieba.cut_for_search(testSentence)))
print("4. Default (accurate mode) result: " + "/".join(jieba.cut(testSentence)))

Remarks
- Accurate mode: cuts the sentence as precisely as possible; suitable for text analysis.
- Full mode: scans out every word in the sentence that can form a word; very fast, but does not resolve ambiguity.
- Search engine mode: on top of accurate mode, further splits long words to raise recall; suitable for search engine segmentation.
# Load the dictionary
print("Load the dictionary")
def load_dictionary():
    dic = set()
    # Read the dictionary file line by line, keeping the string before the first space on each line
    for line in open("CoreNatureDictionary.mini.txt", "r", encoding='utf-8'):
        dic.add(line[0:line.find(' ')])
    return dic
dic = load_dictionary()
print(dic)
print("Find all words in a piece of text")
# Find all words in a piece of text
def fully_segment(text, dic):
    word_list = []
    for i in range(len(text)):                 # i ranges over the start indices of text
        for j in range(i + 1, len(text) + 1):  # j ranges over the interval [i + 1, len(text)]
            word = text[i:j]                   # the substring on the interval [i, j)
            if word in dic:                    # if it is in the dictionary, treat it as a word
                word_list.append(word)
    return word_list
dic = load_dictionary()
print(fully_segment('就读于北京大学', dic))
# Forward longest matching
def forward_segment(text, dic):
    word_list = []
    i = 0
    while i < len(text):
        longest_word = text[i]                     # the single character at the current position
        for j in range(i + 1, len(text) + 1):      # every possible end position
            word = text[i:j]                       # substring from the current position to position j
            if word in dic:                        # it is in the dictionary
                if len(word) > len(longest_word):  # and it is longer
                    longest_word = word            # so prefer it
        word_list.append(longest_word)             # output the longest word
        i += len(longest_word)                     # advance the scan position
    return word_list
print("Forward longest matching")
dic = load_dictionary()
print(forward_segment('就读于北京大学', dic))
print(forward_segment('研究生命起源', dic))
# Backward longest matching
def backward_segment(text, dic):
    word_list = []
    i = len(text) - 1
    while i >= 0:                                  # the scan position serves as the end point
        longest_word = text[i]                     # the single character at the scan position
        for j in range(0, i):                      # try every start position in [0, i)
            word = text[j: i + 1]                  # candidate word on the interval [j, i]
            if word in dic:
                if len(word) > len(longest_word):  # longer words take priority
                    longest_word = word
                    break                          # the smallest j gives the longest word
        word_list.insert(0, longest_word)          # scanning backwards, so earlier finds go later
        i -= len(longest_word)
    return word_list
print("Backward longest matching")
dic = load_dictionary()
print(backward_segment('研究生命起源', dic))
print(backward_segment('项目的研究', dic))
# Bidirectional longest matching
print("Bidirectional longest matching")
def count_single_char(word_list: list):  # count the single-character words in a result
    return sum(1 for word in word_list if len(word) == 1)

def bidirectional_segment(text, dic):
    f = forward_segment(text, dic)
    b = backward_segment(text, dic)
    if len(f) < len(b):      # fewer words take priority
        return f
    elif len(f) > len(b):
        return b
    else:
        if count_single_char(f) < count_single_char(b):  # fewer single characters take priority
            return f
        else:
            return b         # when equal, backward matching takes priority
print(bidirectional_segment('研究生命起源', dic))
print(bidirectional_segment('项目的研究', dic))

Why segment words?
- It turns a complex problem into a mathematical one.

- The word is a more appropriate granularity.

The 3 major difficulties of Chinese word segmentation

There is no uniform standard.
How to segment ambiguous phrases: the classic example 乒乓球拍卖完了 can be segmented as 乒乓球 / 拍卖 / 完了 ("the table tennis balls have been auctioned off") or as 乒乓球拍 / 卖 / 完了 ("the table tennis rackets are sold out").
Recognition of new words: internet slang such as "YYDS" or "小趴菜" keeps appearing and is absent from dictionaries.
Part-of-speech tagging
What is part-of-speech tagging?
Part-of-speech tagging (POS tagging) is the procedure of assigning the correct part of speech to each word in a segmented sentence, i.e., determining whether each word is a noun, verb, adjective, or some other part of speech. For example: march-toward/v full-of/v hope/n of/uj new/a century/n. POS tagging is a preprocessing step for many NLP tasks, such as parsing and information extraction; POS-annotated text brings great convenience, but it is not an indispensable step.
Given a sequence of words with their annotations, we can determine the most likely part of speech of the next word.

import jieba.posseg
testSentence = "利用python进行数据分析"
words = jieba.posseg.cut(testSentence)
for item in words:
    print(item.word + "----" + item.flag)

Named entity recognition
Named entity recognition (NER) is one of the basic tasks of natural language processing (NLP). Its goal is to extract named entities from text and classify them, e.g. person names, place names, organizations, times, currencies, percentages, and so on.

import jieba.analyse
print(jieba.analyse.extract_tags(" I like Guangzhou small man waist ", 3))
print(jieba.analyse.extract_tags(" I like Guangzhou small Manyao ", 3))
print(jieba.analyse.extract_tags(" I like Guangzhou Guangzhou small Manyao ", 3))
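Note that jieba.analyse.extract_tags above performs keyword extraction rather than entity recognition proper. As a rough, hedged sketch of entity spotting, one can filter jieba's part-of-speech tags, where nr, ns, and nt mark person, place, and organization names (a heuristic, not a trained NER model; the sample sentence is made up):

import jieba.posseg

# Heuristic entity spotting via POS tags: nr = person, ns = place, nt = organization
entity_flags = {"nr": "PERSON", "ns": "PLACE", "nt": "ORGANIZATION"}
testSentence = "马云在杭州创办了阿里巴巴"  # made-up sample: "Jack Ma founded Alibaba in Hangzhou"
for item in jieba.posseg.cut(testSentence):
    if item.flag in entity_flags:
        print(item.word + "----" + entity_flags[item.flag])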

import jieba.analyse
print("1. Accurate mode result:")
print([item for item in jieba.tokenize("数据分析与数据挖掘的应用")])
print("-------------------")
print("2. Search mode result:")
print([item for item in jieba.tokenize("数据分析与数据挖掘的应用", mode="search")])

Syntactic parsing
Syntactic parsing analyzes the structure of sentences and phrases; the purpose is to find out the relationships between words and phrases and their respective functions in the sentence.
Semantic analysis
Semantic analysis works out the meaning of words and structures and of their combinations, so as to determine the real meaning or concept the language expresses.


Example analysis
In daily life, when people want to book a flight, they express it in many natural ways:
"Book me a flight";
"Is there a flight to Shanghai?";
"Check flights leaving for New York next Tuesday";
"I'm going on a trip, look up tickets for me";
Intent detection can be rule-based,
or intent recognition can be based on NLU.
General steps of an NLP task

Application system
Text classification
Text classification ( text classification), also called Document classification ( document classification), It refers to the natural language processing task of classifying a document into one or more categories . Text classification has a wide range of application scenarios , Cover spam filtering 、 Spam filtering 、 Auto Label 、 Emotional analysis and any other occasions that need to automatically archive text .
The category of text is sometimes called label , All categories make up the dimension set , The output result of text classification must belong to the annotation set .
Text categorization is a typical supervised learning task , The process cannot be separated from manual guidance : Manually mark the categories of documents , Training model with corpus , Use the model to predict the categories of documents .
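As a hedged illustration of this supervised pipeline (manual labels, training, prediction), here is a minimal sketch using scikit-learn's TfidfVectorizer and MultinomialNB; the tiny corpus and labels are made up for the example:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up training corpus with manual labels: 1 = sports, 0 = tech
docs = ["the team won the match", "new phone released today",
        "the striker scored a goal", "the chip doubles compute"]
labels = [1, 0, 1, 0]

clf = make_pipeline(TfidfVectorizer(), MultinomialNB())
clf.fit(docs, labels)                               # manual labels guide the model
print(clf.predict(["a great goal in the final"]))   # -> [1], i.e. sports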
Text clustering
Many apps have recommendation features: NetEase Cloud Music has daily song recommendations, some reading apps recommend books to read, and so on. Recommendation is generally user-based or content-based, and content-based recommendation may involve computing text similarity (combined, of course, with other dimensions, such as the style of the music). Likewise, a search engine ranks web pages by their similarity to the search keywords. Next we implement a TF-IDF-weighted text similarity calculation.
The TF-IDF algorithm
TF-IDF is a weighting technique commonly used in information retrieval and text mining.
(1) TF (term frequency)
Term frequency: how frequently a given word occurs in a document. It is computed as the number of times the word appears in the document divided by the total number of word occurrences in the document.
The denominator (the total number of word occurrences in the document) normalizes the count to prevent a bias towards long documents (the same word, important or not, tends to occur more often in a long document than in a short one).
(2) IDF (inverse document frequency)
Inverse document frequency measures how generally important a word is. It is computed as the total number of documents divided by the number of documents containing the word, then taking the logarithm of that quotient.
(3) A worked example
Suppose the word "cow" appears 3 times in a document of 100 words, the corpus contains 10,000,000 documents in total, and "cow" appears in 1,000 of them. Then the term frequency of "cow" in the document is 3/100 = 0.03, its inverse document frequency is log(10,000,000 / 1,000) = 4, and the final TF-IDF score is 0.03 * 4 = 0.12. The snippet below checks this arithmetic.
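The same arithmetic in Python, as a quick check of the worked example (base-10 logarithm, as the example implies):

import math

tf = 3 / 100                           # term frequency of "cow" in the document
idf = math.log10(10_000_000 / 1_000)   # inverse document frequency = 4.0
print(tf * idf)                        # 0.03 * 4 = 0.12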
'''
Use gensim to build a TF-IDF model and compute text similarity
'''
from gensim import corpora, models, similarities
import jieba
# 1. Input sentences
sentence1 = "我喜欢吃番薯"
sentence2 = "番薯是个好东西"
sentence3 = "利用python进行文本挖掘"
# 2. Word segmentation
data1 = " ".join(jieba.cut(sentence1))
data2 = " ".join(jieba.cut(sentence2))
data3 = " ".join(jieba.cut(sentence3))
# 3. Convert to the format "word1 word2 ... wordN"
#    (split on spaces -- list(dataN) would split into single characters, not words)
texts = [data1.split(), data2.split(), data3.split()]
# 4. Build a dictionary from the texts
dictionary = corpora.Dictionary(texts)
featureNum = len(dictionary.token2id.keys())  # number of dictionary features
dictionary.save("./dictionary.txt")  # save the dictionary
# 5. Build a new corpus from the dictionary
corpus = [dictionary.doc2bow(text) for text in texts]
# 6. TF-IDF transformation
tfidf = models.TfidfModel(corpus)
'''
# Print the tfidf value of each word in each sentence
corpus_tfidf = tfidf[corpus]
for doc in corpus_tfidf:
    print(doc)
'''
# 7. Load the query sentence and put it into the same format
query = "吃东西"
new_doc = " ".join(jieba.cut(query))
# 8. Convert the query sentence into a sparse vector
new_vec = dictionary.doc2bow(new_doc.split())
# 9. Compute the similarities
index = similarities.SparseMatrixSimilarity(tfidf[corpus], num_features=featureNum)
sim = index[tfidf[new_vec]]
for i in range(len(sim)):
    print("Similarity between the query and sentence " + str(i + 1) + ": " + str(sim[i]))

Machine translation
Given source-language text, automatic translation produces text in another language. Machine translation has moved from the earliest rule-based methods, through the statistical methods of twenty years ago, to today's neural (encoder-decoder) methods, gradually forming a fairly rigorous methodology.
Question answering system
Given a question expressed in natural language, a question answering system returns an accurate answer. This requires some semantic analysis of the natural language query, including entity linking and relation recognition, to form a logical expression; the knowledge base is then searched for candidate answers, and a ranking mechanism picks the best one.
Information filtering
Covers techniques such as sensitive-word filtering and anti-spam text detection; a toy sketch follows.
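As a toy sketch of sensitive-word filtering (the word list and replacement policy are made up for illustration; production systems typically use trie/DFA matching and curated lists for speed and coverage):

# Replace each sensitive word with asterisks of the same length.
SENSITIVE_WORDS = {"badword", "scamlink"}

def filter_text(text: str) -> str:
    for w in SENSITIVE_WORDS:
        text = text.replace(w, "*" * len(w))
    return text

print(filter_text("this badword hides a scamlink"))
# -> "this ******* hides a ********"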
Automatic summarization
from textrank4zh import TextRank4Keyword, TextRank4Sentence
import jieba
import logging
# Silence jieba's log output
jieba.setLogLevel(logging.INFO)
text = """
With the development of the Internet, the speed of uploading and downloading files has improved enormously, so on today's infrastructure more and more demanding applications have become feasible. Buzzwords such as artificial intelligence, deep learning, and natural language processing appear in our field of vision all the time. So what exactly is natural language processing? Let's talk about what it is, what it does, and what it can do for us.
First, natural language processing is a branch of artificial intelligence. Like artificial intelligence in general, its ultimate goal is to imitate human behavior: to understand language, to let computers read and understand language the way people do, and to give answers that fit human thinking. There are many ways to realize it, based on statistics or on methods such as deep learning.
Simply put, natural language processing analyzes and applies text files in various languages on a computer: for example, analyzing whether the sentiment of a text is negative, or distinguishing the nouns and adjectives in a passage.
More concretely, natural language processing (Natural Language Processing, abbreviated NLP) splits into "natural language" and "processing". First, natural language: all the languages in the world today, including Chinese, English, and French, are natural languages. Then "processing": this refers to processing by computer. But a computer is not a person, and it cannot process text the way a human does; it needs its own methods. Natural language processing therefore means that the computer accepts a user's natural language input, processes and computes over it internally with algorithms defined by humans to simulate human understanding of natural language, and returns the result the user expects.
Just as machinery liberated human hands, the purpose of natural language processing (NLP) is to let computers process large-scale natural language information in place of humans. It is a cross-disciplinary field of artificial intelligence, computer science, and information engineering, involving knowledge of statistics, linguistics, and more. Because language is the proof of human thought, natural language processing is held to be the highest realm of artificial intelligence, known as "the pearl on the crown of artificial intelligence".
So what can natural language processing bring us? Its applications are extensive and practical. Below, natural language processing is broken down into its main subfields; let's take a closer look!
Semantic understanding:
Semantic understanding technology, simply put, lets a computer understand text the way a human does and answer questions related to it. It emphasizes understanding context and controlling the accuracy of the answers. For example: give the computer a passage, then ask a question about it; the computer runs an algorithmic model and outputs the answer according to the semantics of the text. After reading and understanding the Passage and the Question, it answers accordingly.
2.2 Text summarization
Give the computer a passage of text or a whole paper, and it outputs a summary of that text. The core technique is to focus on the core parts of the text and automatically generate an abstract. This imitates a uniquely human trait: attention. Faced with many things, we always have priorities. It is like being in a bar where many people are talking: if someone calls your name, or someone interests you, your brain filters out the other sounds and focuses on what you care about. If computers can do this too, aren't they becoming more and more human?
2.3 Natural language inference and abductive natural language inference (aNLI)
Natural language inference: the computer takes two sentences as input and judges the relationship between them, for example entailment or cause and effect.
Abductive natural language inference (aNLI): this is the computer imitating human imagination. When a person is asked, say, "what is blue?", many blue-related things come to mind: the sky, blue whales, blue cars, even things only loosely related to blue, such as baskets or orchids. aNLI is the technology that lets a computer imitate everyday human conversation; compared with plain natural language inference it is more imaginative and closer to how humans actually communicate.
2.4 Sentiment analysis
Sentiment analysis is a specific application of text classification in natural language processing.
Text classification means using a computer to automatically classify and label texts (or other entities) according to some classification scheme or standard. With the explosive growth of information, manually labeling the category of every item has become very time-consuming, and the quality is low because it depends on the annotator's subjectivity. Handing the repetitive, tedious classification work over to the computer effectively overcomes these problems, while making the resulting labels consistent and of high quality.
Sentiment analysis classifies the emotion of a text after the computer has learned the characteristics of human emotion, recognizing the sentiment of a given text (for instance: very negative, negative, neutral, positive, very positive). When a sentence is phrased plainly, such as "I don't like winter weather", sentiment analysis can be very simple. But when the system meets inversion or negation, for example "To say that I hate winter weather would be completely inaccurate", it becomes much more challenging. The core difficulties of sentiment analysis are understanding the emotion of text and measuring the distance between texts.
2.5 Machine translation
Simply put, machine translation uses computer technology to translate from one natural language into another. Statistical and, more recently, deep learning methods have broken through the limitations of the earlier rule-based and example-based approaches, and translation quality has improved greatly. Neural machine translation has already shown great potential in scenarios such as everyday spoken language. As contextual representations and knowledge-based reasoning develop, and natural language knowledge graphs keep expanding, machine translation will make further progress in multi-turn dialogue translation and document translation.
2.6 Question answering systems and dialogue systems
Strictly speaking, question answering and dialogue systems combine multiple branches of artificial intelligence, and natural language processing is an indispensable part of them. A dialogue system accepts a user's questions and returns answers the way a person would; the common forms are retrieval-based, extraction-based, and generation-based, and interactive systems have gradually attracted attention in recent years, with intelligent customer service as a typical application. Question answering systems have much in common with them; the difference is that a question answering system aims to give an accurate answer directly, and whether the answer is colloquial is not the main concern, while a dialogue system aims to solve the user's problem through spoken natural language dialogue. Dialogue systems currently divide into chit-chat systems, used in products such as Siri and XiaoIce, and task-oriented systems, used for example in in-car assistants. (Dialogue systems and question answering systems are probably the fields closest to NLP's ultimate goal.)
3 General steps of implementing an application
So natural language processing imitates many aspects of humans: imagination, attention, understanding, emotion, conversation, and so on. How, then, do we actually make computers realize these technologies? Let's look at the basic techniques of natural language processing!
In fact, implementing these technologies today must be supported by big data. It is like a person: how someone reacts to a thing comes, to a very large extent, from what they have experienced before, that is, experience. With experience, you handle things more deftly. For example, in an exam, if you have done many similar problems before, you will have gained a lot of experience, and based on that experience the choices you make will generally be right. The same is true of computers: behind big data lies what a computer calls "experience", and with data the computer can imitate humans better and more correctly.
In addition, the many subdivided application scenarios of natural language processing generally share several indispensable steps. Before the implementation details, let's warm up with a simple practical example. Ready?
Suppose you want to build a model that lets the computer analyze whether a person could be your boyfriend or girlfriend. First, you already have a rough measure of a large number of real people: you know which of them meet your expectations and are suitable, and which definitely are not. Then you extract that person's features: for example, the person you like should be at least 1.6 meters tall, good-looking, own a car and a house, and so on. The computer can judge by these concrete, quantified features and output one of two answers, "suitable" or "unsuitable". If the computer's output does not meet your expectation, you adjust the input feature parameters (at first the computer does not know your height requirement is 1.6 m): tune the height feature to 1.55 m, or adjust that feature's weight among all the features, then recompute the output. It is a constant loop of adjusting feature parameters until the output probability meets your expectation. And with that, a boyfriend/girlfriend judgment model is built.
Now that we understand how a simple model is built, let's talk about what each step does.
The first step is to obtain a dataset: give the computer its "exercises and answers".
The second is to preprocess the dataset. Preprocessing removes unused and duplicate words so that the computer receives high-quality data. The data then has to be segmented into words, because the computer cannot directly recognize and understand characters one by one; to let the computer quantify each word, we break the dataset into individual words. Next comes feature engineering, turning each word into a vector. Word vectors follow certain rules (such as one-hot or word2vec); no word vector is generated randomly: each is computed together with the surrounding words, so every word vector is related to its neighbors, and it is in this way that the computer grasps the relationships between words. Then specific features are selected for the particular application (common algorithms include DF, MI, IG, CHI, WLLR, WFO, and so on). At this point, a set of words that a person can understand has become a set of word vectors that a computer can understand.
The third is to build a specific model for the application. After the word vectors are fed to the computer, it computes according to some model (that is, an algorithm) over the dataset's "experience" to produce the result we want. If the computed result does not meet our expectations, the weight of each input feature is adjusted, optimizing in a constant loop until it gradually reaches what we want, just like doing practice problems over and over until experience lets you get them right. This process is the model's learning: turning knowledge into experience.
The last step is to evaluate the model. We usually split the dataset into a training set and a test set. The training set drives the training process, during which the weights of the input features are continuously optimized. The test set measures the accuracy of the model; the difference is that using the test set does not change the weights: we only observe whether the model's output meets our expectations.
These are the steps that natural language processing must generally go through.
Common models divide into machine learning models and deep learning models:
Common machine learning models: KNN, SVM, Naive Bayes, decision trees, GBDT, K-means, etc.
Common deep learning models: CNN, RNN, LSTM, Seq2Seq, FastText, TextCNN, etc.
4 Future development
Almost no one can live without the Internet now, and a huge amount of text data is stored on the web, producing a vast number of natural language text files: an enormous resource. Natural language processing is developing very fast; more and more people are paying attention to it, many applications have made great progress, and accuracy keeps approaching, and in some respects has surpassed, human performance. However, all of these applications still rest on doing one very simple thing, far from what a person can actually do: most of the tasks correspond only to human perception, such as recognizing something in a picture or a video, things a human does in a few seconds, while tasks that take a human hours or even days are not yet touched. So natural language processing has a great deal of room to develop.
"""
def get_key_words(text, num=3):
    """Extract keywords"""
    tr4w = TextRank4Keyword()
    tr4w.analyze(text, lower=True)
    key_words = tr4w.get_keywords(num)
    return [item.word for item in key_words]

def get_summary(text, num=3):
    """Extract a summary"""
    tr4s = TextRank4Sentence()
    tr4s.analyze(text=text, lower=True, source='all_filters')
    return [item.sentence for item in tr4s.get_key_sentences(num)]
words = get_key_words(text)
print(words)
# e.g. ['computer', 'natural language', 'human']
summary = get_summary(text)
print(summary)
# e.g. the three sentences TextRank judges most central: the paragraph on how fast
# NLP is developing, the one on NLP as a branch of AI whose ultimate goal is
# understanding language, and the one summarizing NLP as analyzing text by computer.

Information extraction
Information extraction pulls important information out of a given text: times, places, people, events, causes, results, numbers, dates, currencies, proper nouns, and so on. Put plainly, it is about knowing who did what to whom, when, why, and with what result.
LDA topic model keyword extraction
A topic model is a statistical model for discovering the abstract "topics" that occur in a collection of documents. Topic modeling is a common text mining tool for finding the hidden semantic structure of a text corpus.
from gensim import corpora, models
import jieba.posseg as jp

# Simple text preprocessing
def get_text(texts):
    flags = ('n', 'nr', 'ns', 'nt', 'eng', 'v', 'd')  # parts of speech to keep
    stopwords = ('of', 'just', 'is', 'use', 'also', 'in', 'on', 'as')  # stop words
    words_list = []
    for text in texts:
        words = [w.word for w in jp.cut(text) if w.flag in flags and w.word not in stopwords]
        words_list.append(words)
    return words_list

# Build the LDA model
def LDA_model(words_list):
    # Build the dictionary
    # Dictionary() traverses all texts, assigns a unique integer id to each distinct
    # word, and collects word counts and related statistics
    dictionary = corpora.Dictionary(words_list)
    print(dictionary)
    print('The id of each word:')
    print(dictionary.token2id)
    # Convert the dictionary into a bag of words
    # doc2bow() converts each document into a bag-of-words vector. The result, corpus,
    # is a list of vectors, one per document; each vector is a list of (word id, frequency) tuples
    corpus = [dictionary.doc2bow(words) for words in words_list]
    print('The vector of each document:')
    print(corpus)
    # LDA topic model
    # num_topics -- required, the number of topics to generate
    # id2word -- required, the dictionary that maps ids back to strings
    # passes -- optional, the number of passes over the corpus; more passes give a more
    #           accurate model but take longer on a very large corpus
    lda_model = models.ldamodel.LdaModel(corpus=corpus, num_topics=2, id2word=dictionary, passes=10)
    return lda_model

if __name__ == "__main__":
    texts = ['As one of the few thousand-yuan phones with a true full screen, the OPPO K3 was swarmed by fans as soon as it launched',
             'Many people who bought the OPPO K3 for its screen found that the K3 is more than just the screen',
             'Consumers of the OPPO K3 are generally very satisfied with this phone',
             'Geely Boyue PRO unveiled the new GKUI19 smart ecosystem at the press conference on July 3',
             "At this year's Shanghai auto show, the Changan CS75 PLUS made its debut",
             'The standard version adopts a dual-side dual-outlet exhaust layout; the sport version adopts a dual-side four-outlet layout']
    # Get the segmented text list
    words_list = get_text(texts)
    print('Texts after word segmentation:')
    print(words_list)
    # Train the LDA model
    lda_model = LDA_model(words_list)
    # Use print_topic / print_topics to inspect the topics
    # Print all topics, showing 5 words per topic
    topic_words = lda_model.print_topics(num_topics=2, num_words=5)
    print('All topics, 5 words each:')
    print(topic_words)
    # Output the words of a topic and their weights
    words_list = lda_model.show_topic(0, 5)
    print('Words of topic 0 and their weights:')
    print(words_list)

Public opinion analysis
Public opinion analysis collects and processes massive amounts of information and automatically analyzes online public opinion, so that internet opinion can be responded to in a timely manner.
# -*- coding:utf-8 -*-
import pandas as pd
import jieba

# Compute a sentiment score based on the BosonNLP sentiment dictionary
def getscore(text):
    df = pd.read_table(r"BosonNLP_dict\BosonNLP_sentiment_score.txt", sep=" ", names=['key', 'score'])
    key = df['key'].values.tolist()
    score = df['score'].values.tolist()
    # jieba word segmentation
    segs = jieba.lcut(text, cut_all=False)  # returns a list
    # Sum the scores of the words that appear in the dictionary
    score_list = [score[key.index(x)] for x in segs if x in key]
    return sum(score_list)

# Read a file
def read_txt(filename):
    with open(filename, 'r', encoding='utf-8') as f:
        txt = f.read()
    return txt

# Append to a file
def write_data(filename, data):
    with open(filename, 'a', encoding='utf-8') as f:
        f.write(data)

if __name__ == '__main__':
    text = read_txt(r'test_data\Microblogging.txt')
    lists = text.split('\n')
    # Manually annotated sentiment labels, one per line of the input
    # (could also be given inline, e.g. al_senti = ['none', 'positive', 'negative', ...])
    al_senti = read_txt(r'test_data\Artificial emotion tagging.txt').split('\n')
    i = 0
    for line in lists:
        if line != '':
            sentiments = round(getscore(line), 2)
            # A positive score means positive sentiment; a negative score means negative
            print(line)
            print("Sentiment score:", sentiments)
            print('Manually annotated sentiment: ' + al_senti[i])
            if sentiments > 0:
                print("Machine-judged sentiment: positive\n")
                s = "Machine-judged sentiment: positive\n"
            else:
                print('Machine-judged sentiment: negative\n')
                s = "Machine-judged sentiment: negative\n"
            sentiment = 'Sentiment score: ' + str(sentiments) + '\n'
            al_sentiment = 'Manually annotated sentiment: ' + al_senti[i] + '\n'
            # Write the results to file
            filename = r'result_data\BosonNLP sentiment analysis results.txt'
            write_data(filename, 'Analyzed text: ')
            write_data(filename, line + '\n')   # the text being analyzed
            write_data(filename, sentiment)     # the sentiment score
            write_data(filename, al_sentiment)  # the manually annotated sentiment
            write_data(filename, s + '\n')      # the machine-judged sentiment
            i = i + 1