Chapter 1: An Overview of Natural Language Processing and Deep Learning
2022-06-22 13:36:00 【A hundred years of literature have been written on the left sid】
Chapter 1: An Overview of Natural Language Processing and Deep Learning
As the chapter title suggests, this chapter is mostly a basic overview; still, some of the library functions it introduces are quite interesting.
Removing stop words and tokenization
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
# Run nltk.download('punkt') and nltk.download('stopwords') first
sent = 'deep learning for natural language processing is very interesting'
stop_words = set(stopwords.words('english'))
word_tokens = word_tokenize(sent)
filtered_sent = [w for w in word_tokens if w not in stop_words]
print(filtered_sent)
# ['deep', 'learning', 'natural', 'language', 'processing', 'interesting']
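A small variant (my addition, not from the book): NLTK's English stop-word list is all lowercase, so real-world text is usually lowercased before filtering, and punctuation tokens can be dropped at the same time. A minimal sketch:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

sent = 'Deep Learning for Natural Language Processing is very interesting!'
stop_words = set(stopwords.words('english'))
# Lowercase before comparing with the (lowercase) stop-word list,
# and keep only alphabetic tokens so punctuation is dropped.
tokens = [w.lower() for w in word_tokenize(sent) if w.isalpha()]
filtered_sent = [w for w in tokens if w not in stop_words]
print(filtered_sent)
# ['deep', 'learning', 'natural', 'language', 'processing', 'interesting']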
Count vectorization: generating bag-of-words (one-hot style) word vectors
from sklearn.feature_extraction.text import CountVectorizer
texts = ['Ramiess sing classic songs', 'he listens to old pop', 'and rock music']
cv = CountVectorizer()
cv_fit = cv.fit_transform(texts)
print(cv.get_feature_names())  # in scikit-learn >= 1.0 use cv.get_feature_names_out()
# ['and', 'classic', 'he', 'listens', 'music', 'old', 'pop', 'ramiess', 'rock', 'sing', 'songs', 'to']
print(cv_fit.toarray())
# [[0 1 0 0 0 0 0 1 0 1 1 0]
# [0 0 1 1 0 1 1 0 0 0 0 1]
# [1 0 0 0 1 0 0 0 1 0 0 0]]
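The matrix above contains raw counts. If strictly 0/1 "one-hot style" presence vectors are wanted, CountVectorizer also accepts binary=True (a small addition of mine, not from the book):
from sklearn.feature_extraction.text import CountVectorizer

texts = ['Ramiess sing classic songs', 'he listens to old pop', 'and rock music']
# binary=True records only whether a word occurs, not how many times
cv_bin = CountVectorizer(binary=True)
print(cv_bin.fit_transform(texts).toarray())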
TF-IDF score
TF-IDF (term frequency–inverse document frequency) is a common measure in information retrieval. The importance of a word increases in proportion to the number of times it appears in a document, but decreases with how often it appears across the corpus. In other words, if a word or phrase has a high term frequency (TF) in one article but rarely appears in other articles, it is considered to have good discriminative power and to be suitable for classification.
$tf\text{-}idf_{i,j} = tf_{i,j} \cdot idf_i$

$tf_{i,j} = \dfrac{n_{i,j}}{\sum_k n_{k,j}}$

where $n_{i,j}$ is the number of times word $i$ appears in document $j$, and $n_{k,j}$ is the number of times word $k$ appears in document $j$ (so the denominator is the total number of word occurrences in document $j$).
Thus $tf_{i,j}$ is the proportion of document $j$ taken up by word $i$, i.e. the term frequency.

$idf_i = \lg \dfrac{\vert D \vert}{\vert \{ j : t_i \in d_j \} \vert}$

where $\vert D \vert$ is the total number of documents in the corpus, and $\vert \{ j : t_i \in d_j \} \vert$ is the number of documents that contain word $i$.
Dividing the two and taking the base-10 logarithm gives the inverse document frequency.
For example: suppose a document contains 100 words and the word happy appears 5 times; the term frequency is then 5/100 = 0.05. Suppose the corpus contains 10,000,000 documents and happy appears in 1,000 of them; the inverse document frequency is lg(10,000,000/1,000) = 4, so tf-idf = 0.05 × 4 = 0.2.
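A quick sketch of the same arithmetic in Python (my addition, just evaluating the textbook formula):
import math

tf = 5 / 100                            # 'happy' appears 5 times in a 100-word document
idf = math.log10(10_000_000 / 1_000)    # lg(total documents / documents containing 'happy')
print(tf, idf, tf * idf)                # 0.05 4.0 0.2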
from sklearn.feature_extraction.text import TfidfVectorizer
texts = ['aaa bbb aaa', 'aaa ccc']
vert = TfidfVectorizer()
X = vert.fit_transform(texts)
print(X.todense())
#[[0.81818021 0.57496187 0. ]
# [0.57973867 0. 0.81480247]]
If you run the code above, you will find that the results do not match the formula given earlier. This is because scikit-learn uses a different formula, namely:
$v_{i,j} = tf_{i,j} \cdot idf_i$

$tf\text{-}idf_{i,j} = \dfrac{v_{i,j}}{\sqrt{\sum_k v_{k,j}^2}}$

The denominator is the Euclidean (L2) norm of document $j$'s vector over all its words.

$tf_{i,j} = n_{i,j}$

that is, the tf value is simply the raw occurrence count.

$idf_i = \ln \dfrac{\vert D \vert + 1}{\vert \{ j : t_i \in d_j \} \vert + 1} + 1$

The logarithm here is natural (base $e$), and the formula itself also changes, adding the smoothing terms shown above.
This corresponds to the default smooth_idf=True, which adds 1 to the document counts in both the numerator and the denominator (as if one extra document containing every word had been seen). If you construct the vectorizer as vert = TfidfVectorizer(smooth_idf=False) instead, those smoothing terms are dropped and

$idf_i = \ln \dfrac{\vert D \vert}{\vert \{ j : t_i \in d_j \} \vert} + 1$
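A quick check (my addition): the fitted vectorizer exposes the per-word idf values through its idf_ attribute, so both formulas can be verified directly (get_feature_names_out requires scikit-learn >= 1.0):
import math
from sklearn.feature_extraction.text import TfidfVectorizer

texts = ['aaa bbb aaa', 'aaa ccc']
smoothed = TfidfVectorizer().fit(texts)                   # smooth_idf=True (default)
unsmoothed = TfidfVectorizer(smooth_idf=False).fit(texts)

print(smoothed.get_feature_names_out())  # ['aaa' 'bbb' 'ccc']
print(smoothed.idf_)    # [1.         1.40546511 1.40546511]  i.e. ln((2+1)/(df+1)) + 1
print(unsmoothed.idf_)  # [1.         1.69314718 1.69314718]  i.e. ln(2/df) + 1
print(math.log(3 / 2) + 1, math.log(2 / 1) + 1)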
The corpus in the sample code is ['aaa bbb aaa', 'aaa ccc'].
Take bbb in the first document as an example; the program outputs 0.57496187.
bbb appears only once in the first document, so $tf_{i,j} = 1$.
The corpus contains 2 documents in total and bbb appears in 1 of them, so $idf_i = \ln\frac{3}{2} + 1 = 1.4054651081081644$.
Therefore bbb's $v_{i,j} = 1 \times 1.4054651081081644 = 1.4054651081081644$.
Similarly, for aaa in the first document, $v_{i,j} = 2 \times (\ln\frac{3}{3} + 1) = 2$.
So bbb's $tf\text{-}idf = \frac{1.4054651081081644}{\sqrt{1.4054651081081644^2 + 2^2}} = 0.5749618667993135$.
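The same value can be reproduced numerically (my addition, not from the book):
import math
import numpy as np

# Raw counts times idf for the first document 'aaa bbb aaa'
v_aaa = 2 * (math.log(3 / 3) + 1)   # count 2, idf = ln(3/3) + 1 = 1
v_bbb = 1 * (math.log(3 / 2) + 1)   # count 1, idf = ln(3/2) + 1
v = np.array([v_aaa, v_bbb, 0.0])   # 'ccc' does not occur in this document

print(v / np.linalg.norm(v))        # [0.81818021 0.57496187 0.        ]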
A small example of building an MLP model with Keras
The blood transfusion dataset used in this example is available from the download link for this article.
The dataset has 748 records and 4 attributes:
- Recency: months since the last donation
- Frequency: total number of donations
- Monetary: total blood donated, in c.c.
- Time: months since the first donation
plus a binary target attribute (1 or 0) indicating whether the person donated blood in March 2007.
First, read the file and extract the independent variables (features) and the dependent variable (label):
import keras
from keras.layers import Dense
from keras.models import Sequential
import numpy as np
# Adjust to the actual location of transfusion.csv
trans = np.genfromtxt('D:\\py3\\book_dl_for_nlp\\transfusion.csv', delimiter=',', skip_header=1)
X = trans[:, 0:4]  # features: Recency, Frequency, Monetary, Time
Y = trans[:, 4]    # label: donated blood in March 2007 (0/1)
Next, create the network structure: the first hidden layer has 8 neurons and the second has 6, both using the ReLU activation function; the output layer uses a sigmoid activation for binary classification.
mlp_keras = Sequential()
mlp_keras.add(Dense(8, input_dim=4, kernel_initializer='uniform', activation='relu'))
mlp_keras.add(Dense(6, kernel_initializer='uniform', activation='relu'))
mlp_keras.add(Dense(1, kernel_initializer='uniform', activation='sigmoid'))
Set the training parameters, train the model, and report the result:
mlp_keras.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
mlp_keras.fit(X, Y, epochs=200, batch_size=8, verbose=0)
accuracy = mlp_keras.evaluate(X, Y)
print('Accuracy:%0.2f%%' % (accuracy[1]*100))
The complete code is as follows:
import keras
from keras.layers import Dense
from keras.models import Sequential
import numpy as np
#%%
trans = np.genfromtxt('D:\\py3\\book_dl_for_nlp\\transfusion.csv', delimiter=',', skip_header=1)
X = trans[:, 0:4]
Y = trans[:, 4]
#%%
mlp_keras = Sequential()
mlp_keras.add(Dense(8, input_dim=4, kernel_initializer='uniform', activation='relu'))
mlp_keras.add(Dense(6, kernel_initializer='uniform', activation='relu'))
mlp_keras.add(Dense(1, kernel_initializer='uniform', activation='sigmoid'))
#%%
mlp_keras.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
mlp_keras.fit(X, Y, epochs=200, batch_size=8, verbose=0)
accuracy = mlp_keras.evaluate(X, Y)
print('Accuracy:%0.2f%%' % (accuracy[1]*100))
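One caveat (my note, not from the book): the code above evaluates on the same data it was trained on, which overstates real-world performance. A minimal sketch of evaluating on a held-out split instead, reusing X, Y and mlp_keras from above together with scikit-learn's train_test_split:
from sklearn.model_selection import train_test_split

# Hold out 20% of the records for evaluation
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)

mlp_keras.fit(X_train, Y_train, epochs=200, batch_size=8, verbose=0)
loss, acc = mlp_keras.evaluate(X_test, Y_test, verbose=0)
print('Held-out accuracy: %0.2f%%' % (acc * 100))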