XGBoost text classification with the sklearn NLP library TfidfVectorizer
2022-06-23 21:33:00 【Goose】
1. Background
XGBoost is often used to quickly build a baseline for text classification tasks. When working with text data, TF-IDF is needed to turn the documents into word-frequency-based vectors before they can be fed into XGBoost for classification. This post briefly walks through an XGBoost text-classification implementation and some of the principles behind it.
2. Implementation
import string

import pandas as pd
import xgboost as xgb
import jieba
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn import metrics
from sklearn.model_selection import train_test_split
from matplotlib import pyplot

def word_seg(x):
    # Concatenate the two text columns, strip punctuation, then tokenize with jieba
    content = str(x['a']) + ' ' + str(x['b'])
    for i in string.punctuation + ''.join([r'\N', '\t', '\n', '(', ')', ',', '.', ':', '】', '【']):
        content = content.replace(i, '')
    return ' '.join(jieba.lcut(content))
train_data = pd.read_csv('./data/train.csv')
train_data['content'] = train_data.apply(word_seg, axis=1)
x_train, x_test, y_train, y_test = train_test_split(train_data['content'], train_data['label'], test_size=0.2)
# Turn the corpus into bag-of-words count vectors, then compute TF-IDF weights from the counts
vectorizer = CountVectorizer(max_features=5000)
tf_idf_transformer = TfidfTransformer()
tf_idf = tf_idf_transformer.fit_transform(vectorizer.fit_transform(x_train))
x_train_weight = tf_idf.toarray()  # TF-IDF weight matrix for the training set
tf_idf = tf_idf_transformer.transform(vectorizer.transform(x_test))
x_test_weight = tf_idf.toarray()  # TF-IDF weight matrix for the test set
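# Note (my addition, not in the original post): XGBClassifier.fit also accepts
# scipy sparse matrices, so the .toarray() calls above could be skipped to save
# memory when the vocabulary is large, e.g. model.fit(tf_idf, y_train).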
# Classify via the scikit-learn interface
# Train the model
eval_set = [(x_train_weight, y_train), (x_test_weight, y_test)]
model = xgb.XGBClassifier(max_depth=6, learning_rate=0.1, n_estimators=60, objective='binary:logistic')
model.fit(x_train_weight, y_train, eval_set=eval_set, verbose=True)
y_predict = model.predict(x_test_weight)
results = model.evals_result()  # per-round logloss for both eval sets
epochs = len(results['validation_0']['logloss'])
x_axis = range(0, epochs)
# Plot the log-loss curves
fig, ax = pyplot.subplots()
ax.plot(x_axis, results['validation_0']['logloss'], label='Train')
ax.plot(x_axis, results['validation_1']['logloss'], label='Test')
ax.legend()
pyplot.ylabel('Log Loss')
pyplot.title('XGBoost Log Loss')
pyplot.show()

# Use an already-trained model to predict directly:
# xgb_model = xgb.Booster(model_file='data/xgb_model')  # load the saved model
# dtest = xgb.DMatrix('data/test.buffer')               # load the binary data buffer
# xgb_test(dtest, xgb_model)                            # helper defined elsewhere in the original post
# y_predict = xgb_model.predict(dtest)                  # predict with the loaded Booster
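# Sketch (my addition, not from the original post): the sklearn wrapper can also
# persist and restore itself directly:
# model.save_model('data/xgb_model')    # save after training
# loaded = xgb.XGBClassifier()
# loaded.load_model('data/xgb_model')   # restore later
# y_predict = loaded.predict(x_test_weight)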
label_all = [0, 1]
confusion_mat = metrics.confusion_matrix(y_test, y_predict)
df = pd.DataFrame(confusion_mat, columns=label_all)
df.index = label_all
print('Accuracy:\n', metrics.accuracy_score(y_test, y_predict))
print('confusion_matrix:\n', df)
print('Classification report:\n', metrics.classification_report(y_test, y_predict))
print('AUC: %.4f' % metrics.roc_auc_score(y_test, y_predict))

Accuracy:
0.9730405840669959
confusion_matrix:
       0      1
0  48544      0
1   2511  42085
Classification report:
              precision    recall  f1-score   support

           0       0.95      1.00      0.97     48544
           1       1.00      0.94      0.97     44596

    accuracy                           0.97     93140
   macro avg       0.98      0.97      0.97     93140
weighted avg       0.97      0.97      0.97     93140

AUC: 0.9718

3. TfidfVectorizer principle
Here is a brief introduction to TfidfVectorizer, an open-source class in scikit-learn for natural-language text processing. It is the combination of two other classes, CountVectorizer and TfidfTransformer. Before the description, links to the three official documents are given (this article is largely a translation of them; if anything is unclear, refer to the documentation):
Method 1: TfidfVectorizer; Method 2: CountVectorizer, TfidfTransformer
OK, on to the main content.
The central idea behind how TfidfVectorizer processes text is TF-IDF (term frequency - inverse document frequency). Since the focus of this article is the module itself, TF-IDF is not explained in much depth; if needed, see the more detailed earlier article for reference: TF-IDF and related knowledge.
Using TfidfVectorizer is equivalent to calling the CountVectorizer method and then the TfidfTransformer method, so to understand TfidfVectorizer we have to start with those two.
CountVectorizer:
Function:
Converts a collection of text documents into a sparse matrix of token counts; internally it is built on the scipy.sparse.csr_matrix module. Also, if CountVectorizer() is given no a-priori dictionary and no analyzer that performs feature selection, the number of feature words will equal the vocabulary found by directly analyzing the data.
Code example:
from sklearn.feature_extraction.text import CountVectorizer

corpus = ['This is the first document.',
          'This document is the second document.',
          'And this is the third one.',
          'Is this the first document?']
vectorizer = CountVectorizer()  # no a-priori dictionary is provided here
# vectorizer.fit(corpus)            # first fit on the incoming text data
# X = vectorizer.transform(corpus)  # then transform the text into a sparse count matrix
X = vectorizer.fit_transform(corpus)  # fit_transform can replace the two lines above
print(vectorizer.get_feature_names())  # the vocabulary the model found by analyzing the data (the word set printed below)
print(X.toarray())  # converting to a dense array shows the count of each word per document
['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
[[0 1 1 1 0 0 1 0 1]   # 4 rows: the data samples
 [0 2 0 1 0 1 1 0 1]   # 9 columns: the feature words
 [1 0 0 1 1 0 1 1 1]
 [0 1 1 1 0 0 1 0 1]]
Brief description of parameters:
The above is the simplest use of the CountVectorizer module: almost no parameters or methods were used, yet it already achieves a decent 【text → sparse word-count matrix】 conversion. Some of its parameters are listed below, with a usage sketch after the list.
input : string {‘filename’, ‘file’, ‘content’}
encoding : string, ‘utf-8’ by default.
stop_words : string {‘english’}, list, or None (default)
max_df : float in range [0.0, 1.0] or int, default=1.0
min_df : float in range [0.0, 1.0] or int, default=1
max_features : int or None, default=None
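As a usage sketch (the parameter values here are illustrative assumptions, not from the original post), these options are typically combined like this:

from sklearn.feature_extraction.text import CountVectorizer

corpus = ['This is the first document.',
          'This document is the second document.',
          'And this is the third one.',
          'Is this the first document?']
vectorizer = CountVectorizer(
    stop_words='english',  # drop common English stop words
    max_df=0.9,            # ignore terms appearing in more than 90% of documents
    min_df=1,              # keep terms appearing in at least 1 document
    max_features=5000,     # cap the vocabulary at the 5000 most frequent terms
)
X_capped = vectorizer.fit_transform(corpus)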
TfidfTransformer:
Function:
Converts a count matrix (such as the one printed above) into a normalized tf or tf-idf representation. Tf means term frequency, while tf-idf means term frequency multiplied by inverse document frequency. This is a term-weighting scheme commonly used in information retrieval, and it also works well in document classification. The formula used to compute tf-idf is tf-idf(d, t) = tf(t) * idf(d, t).
Code example:
from sklearn.feature_extraction.text import TfidfTransformer

transform = TfidfTransformer()  # apply TF-IDF (term frequency, inverse document frequency) to the sparse matrix
Y = transform.fit_transform(X)  # X is the count matrix produced by CountVectorizer above
print(Y.toarray())  # the matrix after conversion to tf-idf
print(vectorizer.get_feature_names())  # print the feature names
[[0.         0.46979139 0.58028582 0.38408524 0.         0.         0.38408524 0.         0.38408524]
 [0.         0.6876236  0.         0.28108867 0.         0.53864762 0.28108867 0.         0.28108867]
 [0.51184851 0.         0.         0.26710379 0.51184851 0.         0.26710379 0.51184851 0.26710379]
 [0.         0.46979139 0.58028582 0.38408524 0.         0.         0.38408524 0.         0.38408524]]
['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
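To make the formula concrete, here is a small sketch (my addition, not from the original post) that reproduces the first row of the matrix above by hand, using scikit-learn's defaults (smooth_idf=True, norm='l2'), where idf(t) = ln((1 + n) / (1 + df(t))) + 1:

import numpy as np

X_counts = X.toarray()                 # count matrix from CountVectorizer above
n = X_counts.shape[0]                  # number of documents (4)
df = (X_counts > 0).sum(axis=0)        # document frequency of each term
idf = np.log((1 + n) / (1 + df)) + 1   # smoothed idf, sklearn's default formula
tfidf = X_counts * idf                 # raw tf-idf
tfidf /= np.linalg.norm(tfidf, axis=1, keepdims=True)  # l2-normalize each row
print(tfidf[0])  # matches the first row above, e.g. 0.46979139 for 'document'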
Brief description of parameters:
The above uses TfidfTransformer directly to convert the count matrix produced by CountVectorizer into a normalized tf-idf matrix 【sparse word-count matrix → normalized tf-idf】. Some of its parameters are given below, with a short sketch of their effect after the list.
norm : ‘l1’, ‘l2’ or None, optional
Norm used to normalize term vectors. None for no normalization.
use_idf : boolean, default=True
Enable inverse-document-frequency reweighting.
smooth_idf : boolean, default=True
Smooth idf weights by adding one to document frequencies, as if an extra document was seen containing every term in the collection exactly once. Prevents zero divisions.
sublinear_tf : boolean, default=False
Apply sublinear tf scaling, i.e. replace tf with 1 + log(tf).
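As a quick sketch (reusing X from above; the settings are illustrative, not from the original post) of how these parameters change the weighting:

from sklearn.feature_extraction.text import TfidfTransformer

raw = TfidfTransformer(norm=None, smooth_idf=False)  # unnormalized, unsmoothed tf * idf
sub = TfidfTransformer(sublinear_tf=True)            # replace tf with 1 + log(tf)
print(raw.fit_transform(X).toarray()[1])  # the repeated term ('document' appears twice) keeps its full count
print(sub.fit_transform(X).toarray()[1])  # the repeated term's weight is damped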
Finally, we can briefly describe TfidfVectorizer.
TfidfVectorizer
Function:
Remember that TfidfVectorizer is equivalent to the combination of the two: it calls CountVectorizer and then TfidfTransformer in turn (the code is simplified, but the idea is the same), and its parameters are basically the same as theirs.
Code example:
from sklearn.feature_extraction.text import TfidfVectorizer

VT = TfidfVectorizer()  # calls CountVectorizer and then TfidfTransformer in turn (simplified code, same idea)
result = VT.fit_transform(corpus)
print(result.toarray())
print(VT.get_feature_names())
[[0.         0.46979139 0.58028582 0.38408524 0.         0.         0.38408524 0.         0.38408524]
 [0.         0.6876236  0.         0.28108867 0.         0.53864762 0.28108867 0.         0.28108867]
 [0.51184851 0.         0.         0.26710379 0.51184851 0.         0.26710379 0.51184851 0.26710379]
 [0.         0.46979139 0.58028582 0.38408524 0.         0.         0.38408524 0.         0.38408524]]
['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
As you can see, the result is identical to the one produced above by CountVectorizer plus TfidfTransformer, so it really is a combination of the two.
Its parameters and methods agree with those of CountVectorizer and TfidfTransformer and are not repeated here.
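Tying this back to section 2: here is a sketch (an assumption on my part, not from the original post) of how the CountVectorizer + TfidfTransformer pair used there could be collapsed into a single TfidfVectorizer inside a scikit-learn Pipeline with the XGBoost classifier:

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
import xgboost as xgb

pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(max_features=5000)),  # replaces CountVectorizer + TfidfTransformer
    ('clf', xgb.XGBClassifier(max_depth=6, learning_rate=0.1, n_estimators=60,
                              objective='binary:logistic')),
])
pipeline.fit(x_train, y_train)         # x_train holds the tokenized text from section 2
y_predict = pipeline.predict(x_test)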