Theory and Application of Natural Language Processing
2022-06-22 06:54:00 【C don't laugh】
Introduction to natural language processing
What is natural language?

- Natural language takes speech as its material shell and is a symbolic system made up of two parts: vocabulary and grammar.
What is natural language processing (NLP)?

- Natural language processing is the technology of using computers as tools to carry out various kinds of processing of the natural language information humans use in its specific written and oral forms. (NLP is the bridge between humans and machines.)

What is natural language understanding (NLU)?
Natural language understanding (NLU) is the general name for all the models and tasks that let machines understand text content. NLU plays a very important role in text information processing systems; it is an essential module of recommendation, question answering, search, and similar systems.
Natural language processing tasks and methods
Preliminary knowledge
Language models
What is a language model?
A language model estimates the probability of a piece of text, and plays an important role in tasks such as information retrieval, machine translation, and speech recognition. Language models divide into statistical language models and neural network language models.
Common language models
N-gram language model (n-gram model)
An n-gram is a statistical language model used to predict the n-th item from the previous (n−1) items. Depending on the application, these items can be phonemes (speech recognition), characters (input methods), words (word segmentation), or base pairs (genetic information). In general, an n-gram model can be trained from a large-scale text or audio corpus.
A common application of the n-gram model
Search engines such as Google or Baidu, and input method suggestions: when we type one or more words, the search box usually offers several completions in a drop-down menu.
Shortcomings:
- Sparsity: the larger the window n, the more likely a count is 0.
- Storage: all the counts must be stored in advance, and there are far too many of them.
A minimal sketch of the counting follows.
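To make the counting concrete, here is a minimal bigram (n = 2) sketch over a made-up toy corpus; P(w2 | w1) is estimated as count(w1 w2) / count(w1), and the zero result illustrates the sparsity problem above:

from collections import Counter

# Toy corpus; a real model would be trained on large-scale text
corpus = "the cat sat on the mat the cat ate".split()

# Count bigrams and unigrams
bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)

def bigram_prob(w1, w2):
    # P(w2 | w1) = count(w1 w2) / count(w1); 0 if the bigram was never seen
    return bigrams[(w1, w2)] / unigrams[w1] if unigrams[w1] else 0.0

print(bigram_prob("the", "cat"))  # 2/3
print(bigram_prob("the", "dog"))  # 0.0 -- unseen bigram, the sparsity problem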
Neural network language model (NNLM)

Advantages: it avoids the sparsity and storage problems of the n-gram language model.
Shortcomings: better performance requires a larger window, and the larger the window, the larger the number of parameters.
RNN language model

Advantages of the RNN language model:
- It can handle text sequences of any length, while the number of parameters stays the same.
- It can handle longer contexts than an n-gram model, with no sparsity problem.
A structural sketch follows.
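A minimal structural sketch of an RNN language model in PyTorch (assuming PyTorch is available; the vocabulary size and dimensions are illustrative). Note how the parameter count is independent of sequence length, as the list above says:

import torch
import torch.nn as nn

class RNNLanguageModel(nn.Module):
    def __init__(self, vocab_size=10000, embed_dim=128, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.RNN(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)  # scores for the next word

    def forward(self, tokens):               # tokens: (batch, seq_len) word ids
        h, _ = self.rnn(self.embed(tokens))  # one hidden state per position
        return self.out(h)                   # (batch, seq_len, vocab_size)

model = RNNLanguageModel()
logits = model(torch.randint(0, 10000, (2, 7)))  # any sequence length works
print(logits.shape)  # torch.Size([2, 7, 10000])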
Text vectorization
Text representation transforms unstructured text into structured information, so that we can compute over it and carry out everyday tasks such as text classification and sentiment judgment.
One-hot encoding | one-hot representation
Suppose there are 4 words in total: cat, dog, cattle, sheep (see the sketch below).
Shortcomings:
- It cannot express relationships between words.
- The vectors are too sparse, making computation and storage inefficient.
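A minimal sketch of the one-hot encoding for the 4-word vocabulary above; each word gets a vector with a single 1, and every pair of distinct vectors is orthogonal, which is why no similarity between words can be expressed:

import numpy as np

vocab = ["cat", "dog", "cattle", "sheep"]
one_hot = {w: np.eye(len(vocab), dtype=int)[i] for i, w in enumerate(vocab)}

print(one_hot["cat"])    # [1 0 0 0]
print(one_hot["sheep"])  # [0 0 0 1]
# Orthogonal vectors: the dot product of any two different words is 0,
# so "cat" looks no closer to "dog" than to "sheep"
print(one_hot["cat"] @ one_hot["dog"])  # 0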
Integer encoding

Shortcomings:
- It cannot express relationships between words.
- Integer codes can be hard for a model to interpret.
Word embedding | word embedding
Common algorithms
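word2vec and GloVe are among the commonly used algorithms. Below is a hedged sketch of training dense word vectors with gensim's Word2Vec (assuming gensim 4.x, where the dimension parameter is named vector_size; the toy corpus is made up):

from gensim.models import Word2Vec

# Toy corpus: a list of tokenized sentences (a real corpus would be far larger)
sentences = [["cat", "meows"], ["dog", "barks"], ["cat", "and", "dog", "play"],
             ["sheep", "and", "cattle", "graze"]]

# sg=1 selects skip-gram; min_count=1 keeps every word in this tiny corpus
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

print(model.wv["cat"].shape)              # (50,) dense vector, unlike one-hot
print(model.wv.similarity("cat", "dog"))  # embeddings can express word similarity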

Hidden Markov model (HMM)
An HMM is a probabilistic model of sequences: a hidden Markov chain generates an unobservable random sequence of hidden states, and each state generates an observation, producing an observed random sequence. Each position in the sequence is called a time step.
Two assumptions
- Hidden state independence assumption (homogeneous Markov assumption): the current state depends only on the previous state.
- Observation independence assumption: the current observation is generated only by the current hidden state, independent of other observations.
An HMM model example
import numpy as np
from hmmlearn import hmm
# Set of hidden states
states = ["box 1", "box 2", "box 3"]
n_states = len(states)
# Set of observation states
observations = ["red", "white"]
n_observations = len(observations)
# Initial state distribution
start_probability = np.array([0.2, 0.4, 0.4])
# State transition probability matrix
transition_probability = np.array([
    [0.5, 0.2, 0.3],
    [0.3, 0.5, 0.2],
    [0.2, 0.3, 0.5]
])
# Observation (emission) probability matrix
emission_probability = np.array([
    [0.5, 0.5],
    [0.4, 0.6],
    [0.7, 0.3]
])
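The snippet above only defines the parameters. A hedged completion that runs Viterbi decoding (assuming hmmlearn >= 0.2.8, where the discrete-observation class is named CategoricalHMM; in older versions it was MultinomialHMM):

# Build the model from the parameters defined above
model = hmm.CategoricalHMM(n_components=n_states)
model.startprob_ = start_probability
model.transmat_ = transition_probability
model.emissionprob_ = emission_probability

# Observed sequence red, white, red, as a column of indices into `observations`
seen = np.array([[0, 1, 0]]).T
logprob, hidden = model.decode(seen, algorithm="viterbi")
print("Observations:", ", ".join(observations[o] for o in seen.flatten()))
print("Most likely hidden states:", ", ".join(states[s] for s in hidden))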

Conditional random field model (CRF)
- A conditional random field (Conditional Random Field, CRF) drops HMM's two independence assumptions: it treats label transitions and contextual inputs as global features and normalizes probabilities globally, which solves HMM's label bias problem and its inability to use contextual features. CRFs are widely used in scenarios such as word segmentation, entity recognition, and part-of-speech tagging. With the rise of deep learning, models such as BiLSTM+CRF, BERT+CRF, and Transformer+CRF have appeared one after another, and in these tagging scenarios the results have improved significantly.
- Conditional random fields are used for sequence labeling and show good results in natural language processing tasks such as Chinese word segmentation, Chinese name recognition, and disambiguation. The principle: given an observation sequence and a label sequence, build a conditional probability model. CRFs can be applied to different prediction problems, and their learning method is usually maximum likelihood estimation.
- A CRF model also has to solve three basic problems: feature selection, parameter training, and decoding.


Generative and discriminative models
- Generative model: models the joint distribution directly, e.g. Gaussian mixture model, hidden Markov model, Markov random field.
- Discriminative model: models the conditional distribution, e.g. conditional random field, support vector machine, logistic regression.
Bidirectional recurrent neural network + conditional random field model (BiLSTM+CRF)
LSTM is a variant of the recurrent neural network (RNN); BiLSTM denotes a bidirectional LSTM network. Compared with the traditional CRF approach, BiLSTM+CRF learns contextual features more effectively, needs no hand-designed features, and can handle longer context dependencies. A minimal structural sketch follows.
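A minimal structural sketch (not a canonical implementation), assuming PyTorch plus the third-party pytorch-crf package for the CRF layer; the vocabulary size, tag set, and dimensions are illustrative:

import torch
import torch.nn as nn
from torchcrf import CRF  # third-party package: pip install pytorch-crf

class BiLSTMCRF(nn.Module):
    def __init__(self, vocab_size=5000, num_tags=4, embed_dim=100, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # The bidirectional LSTM reads each sentence left-to-right and right-to-left,
        # so every position sees its full context -- no hand-designed features needed
        self.lstm = nn.LSTM(embed_dim, hidden_dim // 2, bidirectional=True, batch_first=True)
        self.fc = nn.Linear(hidden_dim, num_tags)   # per-token emission scores
        self.crf = CRF(num_tags, batch_first=True)  # learns tag-transition constraints

    def loss(self, tokens, tags, mask):
        emissions = self.fc(self.lstm(self.embed(tokens))[0])
        return -self.crf(emissions, tags, mask=mask)  # negative log-likelihood

    def predict(self, tokens, mask):
        emissions = self.fc(self.lstm(self.embed(tokens))[0])
        return self.crf.decode(emissions, mask=mask)  # Viterbi-decoded tag sequences

model = BiLSTMCRF()
tokens = torch.randint(0, 5000, (2, 6))  # 2 sentences of 6 token ids each
mask = torch.ones(2, 6, dtype=torch.bool)
print(model.predict(tokens, mask))       # e.g. [[1, 0, 2, 2, 3, 0], ...]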
Key technologies
Word segmentation
What is word segmentation?
Word segmentation breaks long text such as sentences, paragraphs, and articles down into word-level data structures, to facilitate subsequent processing and analysis.
Word segmentation example
import jieba
testSentence = "利用python进行数据分析"
print("1. Accurate mode result: " + "/".join(jieba.cut(testSentence, cut_all=False)))
print("2. Full mode result: " + "/".join(jieba.cut(testSentence, cut_all=True)))
print("3. Search engine mode result: " + "/".join(jieba.cut_for_search(testSentence)))
print("4. Default (accurate mode) result: " + "/".join(jieba.cut(testSentence)))

Remarks
- Accurate mode: cuts the sentence as precisely as possible; suitable for text analysis.
- Full mode: scans out every word in the sentence that can form a word; very fast, but does not resolve ambiguity.
- Search engine mode: on top of accurate mode, further splits long words to raise recall; suitable for search engine segmentation.
# Load the dictionary
print("Load the dictionary")
def load_dictionary():
    dic = set()
    # Read the dictionary file line by line, keeping the string before the first space on each line
    for line in open("CoreNatureDictionary.mini.txt", "r", encoding='utf-8'):
        dic.add(line[0:line.find(' ')])
    return dic
dic = load_dictionary()
print(dic)
print("Find all words in a piece of text")
# Find all words in a piece of text
def fully_segment(text, dic):
    word_list = []
    for i in range(len(text)):                 # i ranges over the start indices of text
        for j in range(i + 1, len(text) + 1):  # j ranges over the interval [i + 1, len(text)]
            word = text[i:j]                   # the substring on the interval [i, j)
            if word in dic:                    # if it is in the dictionary, treat it as a word
                word_list.append(word)
    return word_list
dic = load_dictionary()
print(fully_segment('就读于北京大学', dic))
# Forward longest matching
def forward_segment(text, dic):
    word_list = []
    i = 0
    while i < len(text):
        longest_word = text[i]                     # the single character at the current position
        for j in range(i + 1, len(text) + 1):      # every possible end position
            word = text[i:j]                       # substring from the current position to position j
            if word in dic:                        # it is in the dictionary
                if len(word) > len(longest_word):  # and it is longer
                    longest_word = word            # so prefer it
        word_list.append(longest_word)             # output the longest word
        i += len(longest_word)                     # advance the scan position
    return word_list
print("Forward longest matching")
dic = load_dictionary()
print(forward_segment('就读于北京大学', dic))
print(forward_segment('研究生命起源', dic))
# Backward longest matching
def backward_segment(text, dic):
    word_list = []
    i = len(text) - 1
    while i >= 0:                                  # the scan position serves as the end point
        longest_word = text[i]                     # the single character at the scan position
        for j in range(0, i):                      # try every start position in [0, i)
            word = text[j: i + 1]                  # candidate word on the interval [j, i]
            if word in dic:
                if len(word) > len(longest_word):  # longer words take priority
                    longest_word = word
                    break                          # the smallest j gives the longest word
        word_list.insert(0, longest_word)          # scanning backwards, so earlier finds go later
        i -= len(longest_word)
    return word_list
print("Backward longest matching")
dic = load_dictionary()
print(backward_segment('研究生命起源', dic))
print(backward_segment('项目的研究', dic))
# Bidirectional longest matching
print("Bidirectional longest matching")
def count_single_char(word_list: list):  # count the single-character words in a result
    return sum(1 for word in word_list if len(word) == 1)

def bidirectional_segment(text, dic):
    f = forward_segment(text, dic)
    b = backward_segment(text, dic)
    if len(f) < len(b):      # fewer words take priority
        return f
    elif len(f) > len(b):
        return b
    else:
        if count_single_char(f) < count_single_char(b):  # fewer single characters take priority
            return f
        else:
            return b         # when equal, backward matching takes priority
print(bidirectional_segment('研究生命起源', dic))
print(bidirectional_segment('项目的研究', dic))

Why segment words?
- It turns a complex problem into a mathematical one.

- The word is a more appropriate granularity.

The 3 major difficulties of Chinese word segmentation

There is no uniform standard.
How to segment ambiguous phrases: the classic example 乒乓球拍卖完了 can be segmented as 乒乓球 / 拍卖 / 完了 ("the table tennis balls have been auctioned off") or as 乒乓球拍 / 卖 / 完了 ("the table tennis rackets are sold out").
Recognition of new words: internet slang such as "YYDS" or "小趴菜" keeps appearing and is absent from dictionaries.
Part-of-speech tagging
What is part-of-speech tagging?
Part-of-speech tagging (POS tagging) is the procedure of assigning the correct part of speech to each word in a segmented sentence, i.e., determining whether each word is a noun, verb, adjective, or some other part of speech. For example: march-toward/v full-of/v hope/n of/uj new/a century/n. POS tagging is a preprocessing step for many NLP tasks, such as parsing and information extraction; POS-annotated text brings great convenience, but it is not an indispensable step.
Given a sequence of words with their annotations, we can determine the most likely part of speech of the next word.

import jieba.posseg
testSentence = "利用python进行数据分析"
words = jieba.posseg.cut(testSentence)
for item in words:
    print(item.word + "----" + item.flag)

Named entity recognition
Named entity recognition (NER) is one of the basic tasks of natural language processing (NLP). Its goal is to extract named entities from text and classify them, e.g. person names, place names, organizations, times, currencies, percentages, and so on.

import jieba.analyse
print(jieba.analyse.extract_tags(" I like Guangzhou small man waist ", 3))
print(jieba.analyse.extract_tags(" I like Guangzhou small Manyao ", 3))
print(jieba.analyse.extract_tags(" I like Guangzhou Guangzhou small Manyao ", 3))
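Note that jieba.analyse.extract_tags above performs keyword extraction rather than entity recognition proper. As a rough, hedged sketch of entity spotting, one can filter jieba's part-of-speech tags, where nr, ns, and nt mark person, place, and organization names (a heuristic, not a trained NER model; the sample sentence is made up):

import jieba.posseg

# Heuristic entity spotting via POS tags: nr = person, ns = place, nt = organization
entity_flags = {"nr": "PERSON", "ns": "PLACE", "nt": "ORGANIZATION"}
testSentence = "马云在杭州创办了阿里巴巴"  # made-up sample: "Jack Ma founded Alibaba in Hangzhou"
for item in jieba.posseg.cut(testSentence):
    if item.flag in entity_flags:
        print(item.word + "----" + entity_flags[item.flag])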

import jieba.analyse
print("1. Accurate mode result:")
print([item for item in jieba.tokenize("数据分析与数据挖掘的应用")])
print("-------------------")
print("2. Search mode result:")
print([item for item in jieba.tokenize("数据分析与数据挖掘的应用", mode="search")])

Syntactic parsing
Syntactic parsing analyzes the structure of sentences and phrases; the purpose is to find out the relationships between words and phrases and their respective functions in the sentence.
Semantic analysis
Semantic analysis works out the meaning of words and structures and of their combinations, so as to determine the real meaning or concept the language expresses.


Example analysis
In daily life, when people want to book a flight, they express it in many natural ways:
"Book me a flight";
"Is there a flight to Shanghai?";
"Check flights leaving for New York next Tuesday";
"I'm going on a trip, look up tickets for me";
Intent detection can be rule-based,
or intent recognition can be based on NLU.
General steps of an NLP task

Application system
Text classification
Text classification ( text classification), also called Document classification ( document classification), It refers to the natural language processing task of classifying a document into one or more categories . Text classification has a wide range of application scenarios , Cover spam filtering 、 Spam filtering 、 Auto Label 、 Emotional analysis and any other occasions that need to automatically archive text .
The category of text is sometimes called label , All categories make up the dimension set , The output result of text classification must belong to the annotation set .
Text categorization is a typical supervised learning task , The process cannot be separated from manual guidance : Manually mark the categories of documents , Training model with corpus , Use the model to predict the categories of documents .
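As a hedged illustration of this supervised pipeline (manual labels, training, prediction), here is a minimal sketch using scikit-learn's TfidfVectorizer and MultinomialNB; the tiny corpus and labels are made up for the example:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up training corpus with manual labels: 1 = sports, 0 = tech
docs = ["the team won the match", "new phone released today",
        "the striker scored a goal", "the chip doubles compute"]
labels = [1, 0, 1, 0]

clf = make_pipeline(TfidfVectorizer(), MultinomialNB())
clf.fit(docs, labels)                               # manual labels guide the model
print(clf.predict(["a great goal in the final"]))   # -> [1], i.e. sports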
Text clustering
Many apps have recommendation features: NetEase Cloud Music has daily song recommendations, some reading apps recommend books to read, and so on. Recommendation is generally user-based or content-based, and content-based recommendation may involve computing text similarity (combined, of course, with other dimensions, such as the style of the music). Likewise, a search engine ranks web pages by their similarity to the search keywords. Next we implement a TF-IDF-weighted text similarity calculation.
The TF-IDF algorithm
TF-IDF is a weighting technique commonly used in information retrieval and text mining.
(1) TF (term frequency)
Term frequency: how frequently a given word occurs in a document. It is computed as the number of times the word appears in the document divided by the total number of word occurrences in the document.
The denominator (the total number of word occurrences in the document) normalizes the count to prevent a bias towards long documents (the same word, important or not, tends to occur more often in a long document than in a short one).
(2) IDF (inverse document frequency)
Inverse document frequency measures how generally important a word is. It is computed as the total number of documents divided by the number of documents containing the word, then taking the logarithm of that quotient.
(3) A worked example
Suppose the word "cow" appears 3 times in a document of 100 words, the corpus contains 10,000,000 documents in total, and "cow" appears in 1,000 of them. Then the term frequency of "cow" in the document is 3/100 = 0.03, its inverse document frequency is log(10,000,000 / 1,000) = 4, and the final TF-IDF score is 0.03 * 4 = 0.12. The snippet below checks this arithmetic.
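The same arithmetic in Python, as a quick check of the worked example (base-10 logarithm, as the example implies):

import math

tf = 3 / 100                           # term frequency of "cow" in the document
idf = math.log10(10_000_000 / 1_000)   # inverse document frequency = 4.0
print(tf * idf)                        # 0.03 * 4 = 0.12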
'''
Use gensim to build a TF-IDF model and compute text similarity
'''
from gensim import corpora, models, similarities
import jieba
# 1. Input sentences
sentence1 = "我喜欢吃番薯"
sentence2 = "番薯是个好东西"
sentence3 = "利用python进行文本挖掘"
# 2. Word segmentation
data1 = " ".join(jieba.cut(sentence1))
data2 = " ".join(jieba.cut(sentence2))
data3 = " ".join(jieba.cut(sentence3))
# 3. Convert to the format "word1 word2 ... wordN"
#    (split on spaces -- list(dataN) would split into single characters, not words)
texts = [data1.split(), data2.split(), data3.split()]
# 4. Build a dictionary from the texts
dictionary = corpora.Dictionary(texts)
featureNum = len(dictionary.token2id.keys())  # number of dictionary features
dictionary.save("./dictionary.txt")  # save the dictionary
# 5. Build a new corpus from the dictionary
corpus = [dictionary.doc2bow(text) for text in texts]
# 6. TF-IDF transformation
tfidf = models.TfidfModel(corpus)
'''
# Print the tfidf value of each word in each sentence
corpus_tfidf = tfidf[corpus]
for doc in corpus_tfidf:
    print(doc)
'''
# 7. Load the query sentence and put it into the same format
query = "吃东西"
new_doc = " ".join(jieba.cut(query))
# 8. Convert the query sentence into a sparse vector
new_vec = dictionary.doc2bow(new_doc.split())
# 9. Compute the similarities
index = similarities.SparseMatrixSimilarity(tfidf[corpus], num_features=featureNum)
sim = index[tfidf[new_vec]]
for i in range(len(sim)):
    print("Similarity between the query and sentence " + str(i + 1) + ": " + str(sim[i]))

Machine translation
Given source-language text, automatic translation produces text in another language. Machine translation has moved from the earliest rule-based methods, through the statistical methods of twenty years ago, to today's neural (encoder-decoder) methods, gradually forming a fairly rigorous methodology.
Question answering system
Given a question expressed in natural language, a question answering system returns an accurate answer. This requires some semantic analysis of the natural language query, including entity linking and relation recognition, to form a logical expression; the knowledge base is then searched for candidate answers, and a ranking mechanism picks the best one.
Information filtering
Covers techniques such as sensitive-word filtering and anti-spam text detection; a toy sketch follows.
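As a toy sketch of sensitive-word filtering (the word list and replacement policy are made up for illustration; production systems typically use trie/DFA matching and curated lists for speed and coverage):

# Replace each sensitive word with asterisks of the same length.
SENSITIVE_WORDS = {"badword", "scamlink"}

def filter_text(text: str) -> str:
    for w in SENSITIVE_WORDS:
        text = text.replace(w, "*" * len(w))
    return text

print(filter_text("this badword hides a scamlink"))
# -> "this ******* hides a ********"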
Automatic summarization
from textrank4zh import TextRank4Keyword, TextRank4Sentence
import jieba
import logging
# Silence jieba's log output
jieba.setLogLevel(logging.INFO)
text = """
With the development of the Internet, the speed of uploading and downloading files has improved enormously, so on today's infrastructure more and more demanding applications have become feasible. Buzzwords such as artificial intelligence, deep learning, and natural language processing appear in our field of vision all the time. So what exactly is natural language processing? Let's talk about what it is, what it does, and what it can do for us.
First, natural language processing is a branch of artificial intelligence. Like artificial intelligence in general, its ultimate goal is to imitate human behavior: to understand language, to let computers read and understand language the way people do, and to give answers that fit human thinking. There are many ways to realize it, based on statistics or on methods such as deep learning.
Simply put, natural language processing analyzes and applies text files in various languages on a computer: for example, analyzing whether the sentiment of a text is negative, or distinguishing the nouns and adjectives in a passage.
More concretely, natural language processing (Natural Language Processing, abbreviated NLP) splits into "natural language" and "processing". First, natural language: all the languages in the world today, including Chinese, English, and French, are natural languages. Then "processing": this refers to processing by computer. But a computer is not a person, and it cannot process text the way a human does; it needs its own methods. Natural language processing therefore means that the computer accepts a user's natural language input, processes and computes over it internally with algorithms defined by humans to simulate human understanding of natural language, and returns the result the user expects.
Just as machinery liberated human hands, the purpose of natural language processing (NLP) is to let computers process large-scale natural language information in place of humans. It is a cross-disciplinary field of artificial intelligence, computer science, and information engineering, involving knowledge of statistics, linguistics, and more. Because language is the proof of human thought, natural language processing is held to be the highest realm of artificial intelligence, known as "the pearl on the crown of artificial intelligence".
So what can natural language processing bring us? Its applications are extensive and practical. Below, natural language processing is broken down into its main subfields; let's take a closer look!
Semantic understanding:
Semantic understanding technology, simply put, lets a computer understand text the way a human does and answer questions related to it. It emphasizes understanding context and controlling the accuracy of the answers. For example: give the computer a passage, then ask a question about it; the computer runs an algorithmic model and outputs the answer according to the semantics of the text. After reading and understanding the Passage and the Question, it answers accordingly.
2.2 Text summarization
Give the computer a passage of text or a whole paper, and it outputs a summary of that text. The core technique is to focus on the core parts of the text and automatically generate an abstract. This imitates a uniquely human trait: attention. Faced with many things, we always have priorities. It is like being in a bar where many people are talking: if someone calls your name, or someone interests you, your brain filters out the other sounds and focuses on what you care about. If computers can do this too, aren't they becoming more and more human?
2.3 Natural language inference and abductive natural language inference (aNLI)
Natural language inference: the computer takes two sentences as input and judges the relationship between them, for example entailment or cause and effect.
Abductive natural language inference (aNLI): this is the computer imitating human imagination. When a person is asked, say, "what is blue?", many blue-related things come to mind: the sky, blue whales, blue cars, even things only loosely related to blue, such as baskets or orchids. aNLI is the technology that lets a computer imitate everyday human conversation; compared with plain natural language inference it is more imaginative and closer to how humans actually communicate.
2.4 Sentiment analysis
Sentiment analysis is a specific application of text classification in natural language processing.
Text classification means using a computer to automatically classify and label texts (or other entities) according to some classification scheme or standard. With the explosive growth of information, manually labeling the category of every item has become very time-consuming, and the quality is low because it depends on the annotator's subjectivity. Handing the repetitive, tedious classification work over to the computer effectively overcomes these problems, while making the resulting labels consistent and of high quality.
Sentiment analysis classifies the emotion of a text after the computer has learned the characteristics of human emotion, recognizing the sentiment of a given text (for instance: very negative, negative, neutral, positive, very positive). When a sentence is phrased plainly, such as "I don't like winter weather", sentiment analysis can be very simple. But when the system meets inversion or negation, for example "To say that I hate winter weather would be completely inaccurate", it becomes much more challenging. The core difficulties of sentiment analysis are understanding the emotion of text and measuring the distance between texts.
2.5 Machine translation
Simply put, machine translation uses computer technology to translate from one natural language into another. Statistical and, more recently, deep learning methods have broken through the limitations of the earlier rule-based and example-based approaches, and translation quality has improved greatly. Neural machine translation has already shown great potential in scenarios such as everyday spoken language. As contextual representations and knowledge-based reasoning develop, and natural language knowledge graphs keep expanding, machine translation will make further progress in multi-turn dialogue translation and document translation.
2.6 Question answering systems and dialogue systems
Strictly speaking, question answering and dialogue systems combine multiple branches of artificial intelligence, and natural language processing is an indispensable part of them. A dialogue system accepts a user's questions and returns answers the way a person would; the common forms are retrieval-based, extraction-based, and generation-based, and interactive systems have gradually attracted attention in recent years, with intelligent customer service as a typical application. Question answering systems have much in common with them; the difference is that a question answering system aims to give an accurate answer directly, and whether the answer is colloquial is not the main concern, while a dialogue system aims to solve the user's problem through spoken natural language dialogue. Dialogue systems currently divide into chit-chat systems, used in products such as Siri and XiaoIce, and task-oriented systems, used for example in in-car assistants. (Dialogue systems and question answering systems are probably the fields closest to NLP's ultimate goal.)
3 General steps of implementing an application
So natural language processing imitates many aspects of humans: imagination, attention, understanding, emotion, conversation, and so on. How, then, do we actually make computers realize these technologies? Let's look at the basic techniques of natural language processing!
In fact, implementing these technologies today must be supported by big data. It is like a person: how someone reacts to a thing comes, to a very large extent, from what they have experienced before, that is, experience. With experience, you handle things more deftly. For example, in an exam, if you have done many similar problems before, you will have gained a lot of experience, and based on that experience the choices you make will generally be right. The same is true of computers: behind big data lies what a computer calls "experience", and with data the computer can imitate humans better and more correctly.
In addition, the many subdivided application scenarios of natural language processing generally share several indispensable steps. Before the implementation details, let's warm up with a simple practical example. Ready?
Suppose you want to build a model that lets the computer analyze whether a person could be your boyfriend or girlfriend. First, you already have a rough measure of a large number of real people: you know which of them meet your expectations and are suitable, and which definitely are not. Then you extract that person's features: for example, the person you like should be at least 1.6 meters tall, good-looking, own a car and a house, and so on. The computer can judge by these concrete, quantified features and output one of two answers, "suitable" or "unsuitable". If the computer's output does not meet your expectation, you adjust the input feature parameters (at first the computer does not know your height requirement is 1.6 m): tune the height feature to 1.55 m, or adjust that feature's weight among all the features, then recompute the output. It is a constant loop of adjusting feature parameters until the output probability meets your expectation. And with that, a boyfriend/girlfriend judgment model is built.
Now that we understand how a simple model is built, let's talk about what each step does.
The first step is to obtain a dataset: give the computer its "exercises and answers".
The second is to preprocess the dataset. Preprocessing removes unused and duplicate words so that the computer receives high-quality data. The data then has to be segmented into words, because the computer cannot directly recognize and understand characters one by one; to let the computer quantify each word, we break the dataset into individual words. Next comes feature engineering, turning each word into a vector. Word vectors follow certain rules (such as one-hot or word2vec); no word vector is generated randomly: each is computed together with the surrounding words, so every word vector is related to its neighbors, and it is in this way that the computer grasps the relationships between words. Then specific features are selected for the particular application (common algorithms include DF, MI, IG, CHI, WLLR, WFO, and so on). At this point, a set of words that a person can understand has become a set of word vectors that a computer can understand.
The third is to build a specific model for the application. After the word vectors are fed to the computer, it computes according to some model (that is, an algorithm) over the dataset's "experience" to produce the result we want. If the computed result does not meet our expectations, the weight of each input feature is adjusted, optimizing in a constant loop until it gradually reaches what we want, just like doing practice problems over and over until experience lets you get them right. This process is the model's learning: turning knowledge into experience.
The last step is to evaluate the model. We usually split the dataset into a training set and a test set. The training set drives the training process, during which the weights of the input features are continuously optimized. The test set measures the accuracy of the model; the difference is that using the test set does not change the weights: we only observe whether the model's output meets our expectations.
These are the steps that natural language processing must generally go through.
Common models divide into machine learning models and deep learning models:
Common machine learning models: KNN, SVM, Naive Bayes, decision trees, GBDT, K-means, etc.
Common deep learning models: CNN, RNN, LSTM, Seq2Seq, FastText, TextCNN, etc.
4 Future development
Almost no one can live without the Internet now, and a huge amount of text data is stored on the web, producing a vast number of natural language text files: an enormous resource. Natural language processing is developing very fast; more and more people are paying attention to it, many applications have made great progress, and accuracy keeps approaching, and in some respects has surpassed, human performance. However, all of these applications still rest on doing one very simple thing, far from what a person can actually do: most of the tasks correspond only to human perception, such as recognizing something in a picture or a video, things a human does in a few seconds, while tasks that take a human hours or even days are not yet touched. So natural language processing has a great deal of room to develop.
"""
def get_key_words(text, num=3):
    """Extract keywords"""
    tr4w = TextRank4Keyword()
    tr4w.analyze(text, lower=True)
    key_words = tr4w.get_keywords(num)
    return [item.word for item in key_words]

def get_summary(text, num=3):
    """Extract a summary"""
    tr4s = TextRank4Sentence()
    tr4s.analyze(text=text, lower=True, source='all_filters')
    return [item.sentence for item in tr4s.get_key_sentences(num)]
words = get_key_words(text)
print(words)
# e.g. ['computer', 'natural language', 'human']
summary = get_summary(text)
print(summary)
# e.g. the three sentences TextRank judges most central: the paragraph on how fast
# NLP is developing, the one on NLP as a branch of AI whose ultimate goal is
# understanding language, and the one summarizing NLP as analyzing text by computer.

Information extraction
Information extraction pulls important information out of a given text: times, places, people, events, causes, results, numbers, dates, currencies, proper nouns, and so on. Put plainly, it is about knowing who did what to whom, when, why, and with what result.
LDA topic model keyword extraction
A topic model is a statistical model for discovering the abstract "topics" that occur in a collection of documents. Topic modeling is a common text mining tool for finding the hidden semantic structure of a text corpus.
from gensim import corpora, models
import jieba.posseg as jp

# Simple text preprocessing
def get_text(texts):
    flags = ('n', 'nr', 'ns', 'nt', 'eng', 'v', 'd')  # parts of speech to keep
    stopwords = ('of', 'just', 'is', 'use', 'also', 'in', 'on', 'as')  # stop words
    words_list = []
    for text in texts:
        words = [w.word for w in jp.cut(text) if w.flag in flags and w.word not in stopwords]
        words_list.append(words)
    return words_list

# Build the LDA model
def LDA_model(words_list):
    # Build the dictionary
    # Dictionary() traverses all texts, assigns a unique integer id to each distinct
    # word, and collects word counts and related statistics
    dictionary = corpora.Dictionary(words_list)
    print(dictionary)
    print('The id of each word:')
    print(dictionary.token2id)
    # Convert the dictionary into a bag of words
    # doc2bow() converts each document into a bag-of-words vector. The result, corpus,
    # is a list of vectors, one per document; each vector is a list of (word id, frequency) tuples
    corpus = [dictionary.doc2bow(words) for words in words_list]
    print('The vector of each document:')
    print(corpus)
    # LDA topic model
    # num_topics -- required, the number of topics to generate
    # id2word -- required, the dictionary that maps ids back to strings
    # passes -- optional, the number of passes over the corpus; more passes give a more
    #           accurate model but take longer on a very large corpus
    lda_model = models.ldamodel.LdaModel(corpus=corpus, num_topics=2, id2word=dictionary, passes=10)
    return lda_model

if __name__ == "__main__":
    texts = ['As one of the few thousand-yuan phones with a true full screen, the OPPO K3 was swarmed by fans as soon as it launched',
             'Many people who bought the OPPO K3 for its screen found that the K3 is more than just the screen',
             'Consumers of the OPPO K3 are generally very satisfied with this phone',
             'Geely Boyue PRO unveiled the new GKUI19 smart ecosystem at the press conference on July 3',
             "At this year's Shanghai auto show, the Changan CS75 PLUS made its debut",
             'The standard version adopts a dual-side dual-outlet exhaust layout; the sport version adopts a dual-side four-outlet layout']
    # Get the segmented text list
    words_list = get_text(texts)
    print('Texts after word segmentation:')
    print(words_list)
    # Train the LDA model
    lda_model = LDA_model(words_list)
    # Use print_topic / print_topics to inspect the topics
    # Print all topics, showing 5 words per topic
    topic_words = lda_model.print_topics(num_topics=2, num_words=5)
    print('All topics, 5 words each:')
    print(topic_words)
    # Output the words of a topic and their weights
    words_list = lda_model.show_topic(0, 5)
    print('Words of topic 0 and their weights:')
    print(words_list)

Public opinion analysis
Public opinion analysis collects and processes massive amounts of information and automatically analyzes online public opinion, so that internet opinion can be responded to in a timely manner.
# -*- coding:utf-8 -*-
import pandas as pd
import jieba

# Compute a sentiment score based on the BosonNLP sentiment dictionary
def getscore(text):
    df = pd.read_table(r"BosonNLP_dict\BosonNLP_sentiment_score.txt", sep=" ", names=['key', 'score'])
    key = df['key'].values.tolist()
    score = df['score'].values.tolist()
    # jieba word segmentation
    segs = jieba.lcut(text, cut_all=False)  # returns a list
    # Sum the scores of the words that appear in the dictionary
    score_list = [score[key.index(x)] for x in segs if x in key]
    return sum(score_list)

# Read a file
def read_txt(filename):
    with open(filename, 'r', encoding='utf-8') as f:
        txt = f.read()
    return txt

# Append to a file
def write_data(filename, data):
    with open(filename, 'a', encoding='utf-8') as f:
        f.write(data)

if __name__ == '__main__':
    text = read_txt(r'test_data\Microblogging.txt')
    lists = text.split('\n')
    # Manually annotated sentiment labels, one per line of the input
    # (could also be given inline, e.g. al_senti = ['none', 'positive', 'negative', ...])
    al_senti = read_txt(r'test_data\Artificial emotion tagging.txt').split('\n')
    i = 0
    for line in lists:
        if line != '':
            sentiments = round(getscore(line), 2)
            # A positive score means positive sentiment; a negative score means negative
            print(line)
            print("Sentiment score:", sentiments)
            print('Manually annotated sentiment: ' + al_senti[i])
            if sentiments > 0:
                print("Machine-judged sentiment: positive\n")
                s = "Machine-judged sentiment: positive\n"
            else:
                print('Machine-judged sentiment: negative\n')
                s = "Machine-judged sentiment: negative\n"
            sentiment = 'Sentiment score: ' + str(sentiments) + '\n'
            al_sentiment = 'Manually annotated sentiment: ' + al_senti[i] + '\n'
            # Write the results to file
            filename = r'result_data\BosonNLP sentiment analysis results.txt'
            write_data(filename, 'Analyzed text: ')
            write_data(filename, line + '\n')   # the text being analyzed
            write_data(filename, sentiment)     # the sentiment score
            write_data(filename, al_sentiment)  # the manually annotated sentiment
            write_data(filename, s + '\n')      # the machine-judged sentiment
            i = i + 1