Chapter 1: Overview of Natural Language Processing and Deep Learning


As the chapter title suggests, this chapter is mostly a basic overview, but some of the library functions it introduces are quite interesting.

Stop-word removal and tokenization

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Run nltk.download('punkt') and nltk.download('stopwords') first
sent = 'deep learning for natural language processing is very interesting'
stop_words = set(stopwords.words('english'))
word_tokens = word_tokenize(sent)
filtered_sent = [w for w in word_tokens if w not in stop_words] 
print(filtered_sent)
# ['deep', 'learning', 'natural', 'language', 'processing', 'interesting']

Count vectorization: generating one-hot-style word count vectors

from sklearn.feature_extraction.text import CountVectorizer

texts = ['Ramiess sing classic songs', 'he listens to old pop', 'and rock music']
cv = CountVectorizer()
cv_fit = cv.fit_transform(texts)
print(cv.get_feature_names())  # in scikit-learn >= 1.0 use get_feature_names_out()
# ['and', 'classic', 'he', 'listens', 'music', 'old', 'pop', 'ramiess', 'rock', 'sing', 'songs', 'to']
print(cv_fit.toarray())
# [[0 1 0 0 0 0 0 1 0 1 1 0]
# [0 0 1 1 0 1 1 0 0 0 0 1]
# [1 0 0 0 1 0 0 0 1 0 0 0]]

TF-IDF score

TF-IDF (term frequency–inverse document frequency) is a common measure in information retrieval. The importance of a word increases in proportion to how often it appears in a document, but decreases with how often it appears across the corpus. In other words, if a word or phrase has a high term frequency (TF) in one article and rarely appears in other articles, it is considered to have good discriminative power and is suitable for classification.
$tf\text{-}idf_i = tf_{i,j} \cdot idf_i$

$tf_{i,j} = \dfrac{n_{i,j}}{\sum_k n_{k,j}}$

where $n_{i,j}$ is the number of times word $i$ appears in document $j$, and $n_{k,j}$ is the number of times word $k$ appears in document $j$ (so the denominator is the total word count of document $j$). Thus $tf_{i,j}$ is the proportion of document $j$ made up by word $i$, i.e. the term frequency.

$idf_i = \lg\dfrac{\vert D \vert}{\vert \{j : t_i \in d_j\} \vert}$

where $\vert D \vert$ is the total number of documents in the corpus, and $\vert \{j : t_i \in d_j\} \vert$ is the number of documents that contain word $i$. Dividing the two and taking the base-10 logarithm gives the inverse document frequency.
For example: suppose a document contains 100 words and the word "happy" appears 5 times; the term frequency is 5/100 = 0.05. Suppose the corpus has 10,000,000 documents and "happy" appears in 1,000 of them; the inverse document frequency is lg(10,000,000/1,000) = 4, so tf-idf = 0.05 × 4 = 0.2.
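As a quick sanity check, the arithmetic in this example can be reproduced directly (a minimal sketch; the numbers are the ones given above):

import math

# Classic tf-idf for the "happy" example: a 100-word document in which "happy"
# appears 5 times, in a corpus of 10,000,000 documents, 1,000 of which contain "happy".
tf = 5 / 100                            # term frequency = 0.05
idf = math.log10(10_000_000 / 1_000)    # inverse document frequency = 4.0
print(tf * idf)                         # tf-idf = 0.2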

from sklearn.feature_extraction.text import TfidfVectorizer
texts = ['aaa bbb aaa', 'aaa ccc']
vert = TfidfVectorizer()
X = vert.fit_transform(texts)
print(X.todense())
#[[0.81818021 0.57496187 0. ]
# [0.57973867 0. 0.81480247]]

If you run the code above, you will find that the results differ from what the formula gives. This is because sklearn uses a different formula, shown below:
$v_{i,j} = tf_{i,j} \cdot idf_i$

$tf\text{-}idf_i = \dfrac{v_{i,j}}{\sqrt{\sum_k v_{k,j}^2}}$

The denominator is the Euclidean (L2) norm over all words in the document.

$tf_{i,j} = n_{i,j}$

That is, $tf$ is simply the raw count of occurrences.

$idf_i = \ln\dfrac{\vert D \vert + 1}{\vert \{j : t_i \in d_j\} \vert + 1} + 1$

The logarithm is natural (base $e$), and the formula itself also differs from the classic one.
However, if you use vert = TfidfVectorizer(smooth_idf=False), then

$idf_i = \ln\dfrac{\vert D \vert}{\vert \{j : t_i \in d_j\} \vert} + 1$

i.e. the "+1" smoothing added to the numerator and denominator is dropped (the trailing "+1" remains). The default is smooth_idf=True, which gives the smoothed formula above.
The corpus in the sample code is ['aaa bbb aaa', 'aaa ccc']. Take bbb in the first document as an example; the program outputs 0.57496187.
bbb appears only once in the first document, so $tf_{i,j} = 1$.
The corpus contains 2 documents in total, and 1 of them contains bbb, so $idf_i = \ln\frac{3}{2} + 1 = 1.4054651081081644$.
Therefore, for bbb, $v_{i,j} = 1 \times 1.4054651081081644 = 1.4054651081081644$.
Similarly, for aaa in the first document, $v_{i,j} = 2 \times (\ln\frac{3}{3} + 1) = 2$.
So for bbb, $tf\text{-}idf_i = \frac{1.4054651081081644}{\sqrt{1.4054651081081644^2 + 2^2}} = 0.5749618667993135$.
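To double-check this, the matrix can be recomputed by hand with numpy (a small sketch, assuming the default smooth_idf=True and the vocabulary order ['aaa', 'bbb', 'ccc']); the result matches the TfidfVectorizer output shown earlier, including the 0.57496187 entry for bbb.

import numpy as np

# Raw term counts for the corpus ['aaa bbb aaa', 'aaa ccc'], columns = ['aaa', 'bbb', 'ccc']
counts = np.array([[2, 1, 0],
                   [1, 0, 1]], dtype=float)
n_docs = counts.shape[0]
df = (counts > 0).sum(axis=0)                         # document frequency of each term
idf = np.log((1 + n_docs) / (1 + df)) + 1             # smoothed idf with the trailing +1
v = counts * idf                                      # unnormalized tf-idf
tfidf = v / np.linalg.norm(v, axis=1, keepdims=True)  # L2-normalize each row
print(tfidf)
# [[0.81818021 0.57496187 0.        ]
#  [0.57973867 0.         0.81480247]]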

A small example of building an MLP model with Keras

The blood transfusion dataset used in this example can be obtained from the download link for this article.
The dataset has 748 records and 4 attributes:

  • Recency, months since last donation
  • Frequency, total number of donations
  • Monetary, total blood donated in c.c.
  • Time, months since first donation
Plus a binary attribute (1 or 0) indicating whether the person donated blood in March 2007.

First, read the file and extract the independent and dependent variables:
import keras
from keras.layers import Dense
from keras.models import Sequential
import numpy as np

#  Modify according to the specific file path 
trans = np.genfromtxt('D:\\py3\\book_dl_for_nlp\\transfusion.csv', delimiter=',', skip_header=1)
X = trans[:, 0:4]
Y = trans[:, 4]

Next, create the network structure: the first hidden layer consists of 8 neurons and the second hidden layer consists of 6 neurons, both using the ReLU activation function; the output layer uses a sigmoid activation for binary classification.

mlp_keras = Sequential()
mlp_keras.add(Dense(8, input_dim=4, kernel_initializer='uniform', activation='relu'))
mlp_keras.add(Dense(6, kernel_initializer='uniform', activation='relu'))
mlp_keras.add(Dense(1, kernel_initializer='uniform', activation='sigmoid'))

Set the training parameters, train and evaluate the model, and print the result:

mlp_keras.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
mlp_keras.fit(X, Y, epochs=200, batch_size=8, verbose=0)
accuracy = mlp_keras.evaluate(X, Y)
print('Accuracy:%0.2f%%' % (accuracy[1]*100))

The complete code is as follows:

import keras
from keras.layers import Dense
from keras.models import Sequential
import numpy as np

#%%
trans = np.genfromtxt('D:\\py3\\book_dl_for_nlp\\transfusion.csv', delimiter=',', skip_header=1)
X = trans[:, 0:4]
Y = trans[:, 4]

#%%
mlp_keras = Sequential()
mlp_keras.add(Dense(8, input_dim=4, kernel_initializer='uniform', activation='relu'))
mlp_keras.add(Dense(6, kernel_initializer='uniform', activation='relu'))
mlp_keras.add(Dense(1, kernel_initializer='uniform', activation='sigmoid'))

#%%
mlp_keras.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
mlp_keras.fit(X, Y, epochs=200, batch_size=8, verbose=0)
accuracy = mlp_keras.evaluate(X, Y)
print('Accuracy:%0.2f%%' % (accuracy[1]*100))
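Once trained, the model can also be used for prediction; here is a minimal usage sketch (not from the original text), reusing the X array loaded above:

# Sigmoid outputs are probabilities in [0, 1]; threshold at 0.5 to get class labels
probs = mlp_keras.predict(X[:5])
labels = (probs > 0.5).astype(int)
print(probs.ravel(), labels.ravel())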