Chapter 1: Overview of Natural Language Processing and Deep Learning


As the chapter title suggests, this chapter is mostly a basic overview, but some of the library functions it introduces are quite interesting.

Stop-word removal and tokenization

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Run nltk.download('punkt') and nltk.download('stopwords') first
sent = 'deep learning for natural language processing is very interesting'
stop_words = set(stopwords.words('english'))
word_tokens = word_tokenize(sent)
filtered_sent = [w for w in word_tokens if w not in stop_words] 
print(filtered_sent)
# ['deep', 'learning', 'natural', 'language', 'processing', 'interesting']

Count vectorization: generating one-hot-style word count vectors

from sklearn.feature_extraction.text import CountVectorizer

texts = ['Ramiess sing classic songs', 'he listens to old pop', 'and rock music']
cv = CountVectorizer()
cv_fit = cv.fit_transform(texts)
print(cv.get_feature_names())  # in scikit-learn >= 1.0 use get_feature_names_out()
# ['and', 'classic', 'he', 'listens', 'music', 'old', 'pop', 'ramiess', 'rock', 'sing', 'songs', 'to']
print(cv_fit.toarray())
# [[0 1 0 0 0 0 0 1 0 1 1 0]
# [0 0 1 1 0 1 1 0 0 0 0 1]
# [1 0 0 0 1 0 0 0 1 0 0 0]]

TF-IDF score

TF-IDF (term frequency–inverse document frequency) is a common measure in information retrieval. The importance of a word increases in proportion to how often it appears in a document, but decreases with how often it appears across the corpus. In other words, if a word or phrase has a high term frequency (TF) in one article and rarely appears in other articles, it is considered to have good discriminative power and is suitable for classification.
$tf\text{-}idf_i = tf_{i,j} \cdot idf_i$

$tf_{i,j} = \dfrac{n_{i,j}}{\sum_k n_{k,j}}$

where $n_{i,j}$ is the number of times word $i$ appears in document $j$, and $n_{k,j}$ is the number of times word $k$ appears in document $j$ (so the denominator is the total word count of document $j$). Thus $tf_{i,j}$ is the proportion of document $j$ made up by word $i$, i.e. the term frequency.

$idf_i = \lg\dfrac{\vert D \vert}{\vert \{j : t_i \in d_j\} \vert}$

where $\vert D \vert$ is the total number of documents in the corpus, and $\vert \{j : t_i \in d_j\} \vert$ is the number of documents that contain word $i$. Dividing the two and taking the base-10 logarithm gives the inverse document frequency.
For example: suppose a document contains 100 words and the word "happy" appears 5 times; the term frequency is 5/100 = 0.05. Suppose the corpus has 10,000,000 documents and "happy" appears in 1,000 of them; the inverse document frequency is lg(10,000,000/1,000) = 4, so tf-idf = 0.05 × 4 = 0.2.
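As a quick sanity check, the arithmetic in this example can be reproduced directly (a minimal sketch; the numbers are the ones given above):

import math

# Classic tf-idf for the "happy" example: a 100-word document in which "happy"
# appears 5 times, in a corpus of 10,000,000 documents, 1,000 of which contain "happy".
tf = 5 / 100                            # term frequency = 0.05
idf = math.log10(10_000_000 / 1_000)    # inverse document frequency = 4.0
print(tf * idf)                         # tf-idf = 0.2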

from sklearn.feature_extraction.text import TfidfVectorizer
texts = ['aaa bbb aaa', 'aaa ccc']
vert = TfidfVectorizer()
X = vert.fit_transform(texts)
print(X.todense())
#[[0.81818021 0.57496187 0. ]
# [0.57973867 0. 0.81480247]]

If you run the code above, you will find that the results differ from what the formula gives. This is because sklearn uses a different formula, shown below:
$v_{i,j} = tf_{i,j} \cdot idf_i$

$tf\text{-}idf_i = \dfrac{v_{i,j}}{\sqrt{\sum_k v_{k,j}^2}}$

The denominator is the Euclidean (L2) norm over all words in the document.

$tf_{i,j} = n_{i,j}$

That is, $tf$ is simply the raw count of occurrences.

$idf_i = \ln\dfrac{\vert D \vert + 1}{\vert \{j : t_i \in d_j\} \vert + 1} + 1$

The logarithm is natural (base $e$), and the formula itself also differs from the classic one.
However, if you use vert = TfidfVectorizer(smooth_idf=False), then

$idf_i = \ln\dfrac{\vert D \vert}{\vert \{j : t_i \in d_j\} \vert} + 1$

i.e. the "+1" smoothing added to the numerator and denominator is dropped (the trailing "+1" remains). The default is smooth_idf=True, which gives the smoothed formula above.
The corpus in the sample code is ['aaa bbb aaa', 'aaa ccc']. Take bbb in the first document as an example; the program outputs 0.57496187.
bbb appears only once in the first document, so $tf_{i,j} = 1$.
The corpus contains 2 documents in total, and 1 of them contains bbb, so $idf_i = \ln\frac{3}{2} + 1 = 1.4054651081081644$.
Therefore, for bbb, $v_{i,j} = 1 \times 1.4054651081081644 = 1.4054651081081644$.
Similarly, for aaa in the first document, $v_{i,j} = 2 \times (\ln\frac{3}{3} + 1) = 2$.
So for bbb, $tf\text{-}idf_i = \frac{1.4054651081081644}{\sqrt{1.4054651081081644^2 + 2^2}} = 0.5749618667993135$.
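To double-check this, the matrix can be recomputed by hand with numpy (a small sketch, assuming the default smooth_idf=True and the vocabulary order ['aaa', 'bbb', 'ccc']); the result matches the TfidfVectorizer output shown earlier, including the 0.57496187 entry for bbb.

import numpy as np

# Raw term counts for the corpus ['aaa bbb aaa', 'aaa ccc'], columns = ['aaa', 'bbb', 'ccc']
counts = np.array([[2, 1, 0],
                   [1, 0, 1]], dtype=float)
n_docs = counts.shape[0]
df = (counts > 0).sum(axis=0)                         # document frequency of each term
idf = np.log((1 + n_docs) / (1 + df)) + 1             # smoothed idf with the trailing +1
v = counts * idf                                      # unnormalized tf-idf
tfidf = v / np.linalg.norm(v, axis=1, keepdims=True)  # L2-normalize each row
print(tfidf)
# [[0.81818021 0.57496187 0.        ]
#  [0.57973867 0.         0.81480247]]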

A small example of building an MLP model with Keras

The blood transfusion dataset used in this example can be obtained from the download link for this article.
The dataset has 748 records and 4 attributes:

  • Recency, months since last donation
  • Frequency, total number of donations
  • Monetary, total blood donated in c.c.
  • Time, months since first donation
Plus a binary attribute (1 or 0) indicating whether the person donated blood in March 2007.

First, read the file and extract the independent and dependent variables:
import keras
from keras.layers import Dense
from keras.models import Sequential
import numpy as np

#  Modify according to the specific file path 
trans = np.genfromtxt('D:\\py3\\book_dl_for_nlp\\transfusion.csv', delimiter=',', skip_header=1)
X = trans[:, 0:4]
Y = trans[:, 4]

Next, create the network structure: the first hidden layer consists of 8 neurons and the second hidden layer consists of 6 neurons, both using the ReLU activation function; the output layer uses a sigmoid activation for binary classification.

mlp_keras = Sequential()
mlp_keras.add(Dense(8, input_dim=4, kernel_initializer='uniform', activation='relu'))
mlp_keras.add(Dense(6, kernel_initializer='uniform', activation='relu'))
mlp_keras.add(Dense(1, kernel_initializer='uniform', activation='sigmoid'))

Set the training parameters, train and evaluate the model, and print the result:

mlp_keras.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
mlp_keras.fit(X, Y, epochs=200, batch_size=8, verbose=0)
accuracy = mlp_keras.evaluate(X, Y)
print('Accuracy:%0.2f%%' % (accuracy[1]*100))

The complete code is as follows:

import keras
from keras.layers import Dense
from keras.models import Sequential
import numpy as np

#%%
trans = np.genfromtxt('D:\\py3\\book_dl_for_nlp\\transfusion.csv', delimiter=',', skip_header=1)
X = trans[:, 0:4]
Y = trans[:, 4]

#%%
mlp_keras = Sequential()
mlp_keras.add(Dense(8, input_dim=4, kernel_initializer='uniform', activation='relu'))
mlp_keras.add(Dense(6, kernel_initializer='uniform', activation='relu'))
mlp_keras.add(Dense(1, kernel_initializer='uniform', activation='sigmoid'))

#%%
mlp_keras.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
mlp_keras.fit(X, Y, epochs=200, batch_size=8, verbose=0)
accuracy = mlp_keras.evaluate(X, Y)
print('Accuracy:%0.2f%%' % (accuracy[1]*100))
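Once trained, the model can also be used for prediction; here is a minimal usage sketch (not from the original text), reusing the X array loaded above:

# Sigmoid outputs are probabilities in [0, 1]; threshold at 0.5 to get class labels
probs = mlp_keras.predict(X[:5])
labels = (probs > 0.5).astype(int)
print(probs.ravel(), labels.ravel())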