XGBoost text classification with the sklearn NLP library TfidfVectorizer
2022-06-23 21:33:00 【Goose】
1. Background
XGBoost is often used to quickly build a baseline for text classification tasks. When working with text data, TF-IDF is needed to turn the documents into word-frequency-based vectors before they can be fed into XGBoost for classification. This post briefly walks through an XGBoost text-classification implementation and some of the principles behind it.
2. Implementation
import string

import pandas as pd
import xgboost as xgb
import jieba
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn import metrics
from sklearn.model_selection import train_test_split
from matplotlib import pyplot

def word_seg(x):
    # Concatenate the two text columns, strip punctuation, then tokenize with jieba
    content = str(x['a']) + ' ' + str(x['b'])
    for i in string.punctuation + ''.join([r'\N', '\t', '\n', '(', ')', ',', '.', ':', '】', '【']):
        content = content.replace(i, '')
    return ' '.join(jieba.lcut(content))
train_data = pd.read_csv('./data/train.csv')
train_data['content'] = train_data.apply(word_seg, axis=1)
x_train, x_test, y_train, y_test = train_test_split(train_data['content'], train_data['label'], test_size=0.2)
# Turn the corpus into bag-of-words count vectors, then compute TF-IDF weights from the counts
vectorizer = CountVectorizer(max_features=5000)
tf_idf_transformer = TfidfTransformer()
tf_idf = tf_idf_transformer.fit_transform(vectorizer.fit_transform(x_train))
x_train_weight = tf_idf.toarray()  # TF-IDF weight matrix for the training set
tf_idf = tf_idf_transformer.transform(vectorizer.transform(x_test))
x_test_weight = tf_idf.toarray()  # TF-IDF weight matrix for the test set
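# Note (my addition, not in the original post): XGBClassifier.fit also accepts
# scipy sparse matrices, so the .toarray() calls above could be skipped to save
# memory when the vocabulary is large, e.g. model.fit(tf_idf, y_train).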
# Classify via the scikit-learn interface
# Train the model
eval_set = [(x_train_weight, y_train), (x_test_weight, y_test)]
model = xgb.XGBClassifier(max_depth=6, learning_rate=0.1, n_estimators=60, objective='binary:logistic')
model.fit(x_train_weight, y_train, eval_set=eval_set, verbose=True)
y_predict = model.predict(x_test_weight)
results = model.evals_result()  # per-round logloss for both eval sets
epochs = len(results['validation_0']['logloss'])
x_axis = range(0, epochs)
# Plot the log-loss curves
fig, ax = pyplot.subplots()
ax.plot(x_axis, results['validation_0']['logloss'], label='Train')
ax.plot(x_axis, results['validation_1']['logloss'], label='Test')
ax.legend()
pyplot.ylabel('Log Loss')
pyplot.title('XGBoost Log Loss')
pyplot.show()

# Use an already-trained model to predict directly:
# xgb_model = xgb.Booster(model_file='data/xgb_model')  # load the saved model
# dtest = xgb.DMatrix('data/test.buffer')               # load the binary data buffer
# xgb_test(dtest, xgb_model)                            # helper defined elsewhere in the original post
# y_predict = xgb_model.predict(dtest)                  # predict with the loaded Booster
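# Sketch (my addition, not from the original post): the sklearn wrapper can also
# persist and restore itself directly:
# model.save_model('data/xgb_model')    # save after training
# loaded = xgb.XGBClassifier()
# loaded.load_model('data/xgb_model')   # restore later
# y_predict = loaded.predict(x_test_weight)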
label_all = [0, 1]
confusion_mat = metrics.confusion_matrix(y_test, y_predict)
df = pd.DataFrame(confusion_mat, columns=label_all)
df.index = label_all
print('Accuracy:\n', metrics.accuracy_score(y_test, y_predict))
print('confusion_matrix:\n', df)
print('Classification report:\n', metrics.classification_report(y_test, y_predict))
print('AUC: %.4f' % metrics.roc_auc_score(y_test, y_predict))

Accuracy:
0.9730405840669959
confusion_matrix:
       0      1
0  48544      0
1   2511  42085
Classification report:
              precision    recall  f1-score   support

           0       0.95      1.00      0.97     48544
           1       1.00      0.94      0.97     44596

    accuracy                           0.97     93140
   macro avg       0.98      0.97      0.97     93140
weighted avg       0.97      0.97      0.97     93140

AUC: 0.9718

3. TfidfVectorizer principle
Here is a brief introduction to TfidfVectorizer, an open-source class in scikit-learn for natural-language text processing. It is the combination of two other classes, CountVectorizer and TfidfTransformer. Before the description, links to the three official documents are given (this article is largely a translation of them; if anything is unclear, refer to the documentation):
Method 1: TfidfVectorizer; Method 2: CountVectorizer, TfidfTransformer
OK, on to the main content.
The central idea behind how TfidfVectorizer processes text is TF-IDF (term frequency - inverse document frequency). Since the focus of this article is the module itself, TF-IDF is not explained in much depth; if needed, see the more detailed earlier article for reference: TF-IDF and related knowledge.
Using TfidfVectorizer is equivalent to calling the CountVectorizer method and then the TfidfTransformer method, so to understand TfidfVectorizer we have to start with those two.
CountVectorizer:
Function:
Converts a collection of text documents into a sparse matrix of token counts; internally it is built on the scipy.sparse.csr_matrix module. Also, if CountVectorizer() is given no a-priori dictionary and no analyzer that performs feature selection, the number of feature words will equal the vocabulary found by directly analyzing the data.
Code example:
from sklearn.feature_extraction.text import CountVectorizer

corpus = ['This is the first document.',
          'This document is the second document.',
          'And this is the third one.',
          'Is this the first document?']
vectorizer = CountVectorizer()  # no a-priori dictionary is provided here
# vectorizer.fit(corpus)            # first fit on the incoming text data
# X = vectorizer.transform(corpus)  # then transform the text into a sparse count matrix
X = vectorizer.fit_transform(corpus)  # fit_transform can replace the two lines above
print(vectorizer.get_feature_names())  # the vocabulary the model found by analyzing the data (the word set printed below)
print(X.toarray())  # converting to a dense array shows the count of each word per document
['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
[[0 1 1 1 0 0 1 0 1]   # 4 rows: the data samples
 [0 2 0 1 0 1 1 0 1]   # 9 columns: the feature words
 [1 0 0 1 1 0 1 1 1]
 [0 1 1 1 0 0 1 0 1]]
Brief description of parameters:
The above is the simplest use of the CountVectorizer module: almost no parameters or methods were used, yet it already achieves a decent 【text → sparse word-count matrix】 conversion. Some of its parameters are listed below, with a usage sketch after the list.
input : string {‘filename’, ‘file’, ‘content’}
encoding : string, ‘utf-8’ by default.
stop_words : string {‘english’}, list, or None (default)
max_df : float in range [0.0, 1.0] or int, default=1.0
min_df : float in range [0.0, 1.0] or int, default=1
max_features : int or None, default=None
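As a usage sketch (the parameter values here are illustrative assumptions, not from the original post), these options are typically combined like this:

from sklearn.feature_extraction.text import CountVectorizer

corpus = ['This is the first document.',
          'This document is the second document.',
          'And this is the third one.',
          'Is this the first document?']
vectorizer = CountVectorizer(
    stop_words='english',  # drop common English stop words
    max_df=0.9,            # ignore terms appearing in more than 90% of documents
    min_df=1,              # keep terms appearing in at least 1 document
    max_features=5000,     # cap the vocabulary at the 5000 most frequent terms
)
X_capped = vectorizer.fit_transform(corpus)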
TfidfTransformer:
Function:
Converts a count matrix (such as the one printed above) into a normalized tf or tf-idf representation. Tf means term frequency, while tf-idf means term frequency multiplied by inverse document frequency. This is a term-weighting scheme commonly used in information retrieval, and it also works well in document classification. The formula used to compute tf-idf is tf-idf(d, t) = tf(t) * idf(d, t).
Code example:
from sklearn.feature_extraction.text import TfidfTransformer

transform = TfidfTransformer()  # apply TF-IDF (term frequency, inverse document frequency) to the sparse matrix
Y = transform.fit_transform(X)  # X is the count matrix produced by CountVectorizer above
print(Y.toarray())  # the matrix after conversion to tf-idf
print(vectorizer.get_feature_names())  # print the feature names
[[0.         0.46979139 0.58028582 0.38408524 0.         0.         0.38408524 0.         0.38408524]
 [0.         0.6876236  0.         0.28108867 0.         0.53864762 0.28108867 0.         0.28108867]
 [0.51184851 0.         0.         0.26710379 0.51184851 0.         0.26710379 0.51184851 0.26710379]
 [0.         0.46979139 0.58028582 0.38408524 0.         0.         0.38408524 0.         0.38408524]]
['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
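To make the formula concrete, here is a small sketch (my addition, not from the original post) that reproduces the first row of the matrix above by hand, using scikit-learn's defaults (smooth_idf=True, norm='l2'), where idf(t) = ln((1 + n) / (1 + df(t))) + 1:

import numpy as np

X_counts = X.toarray()                 # count matrix from CountVectorizer above
n = X_counts.shape[0]                  # number of documents (4)
df = (X_counts > 0).sum(axis=0)        # document frequency of each term
idf = np.log((1 + n) / (1 + df)) + 1   # smoothed idf, sklearn's default formula
tfidf = X_counts * idf                 # raw tf-idf
tfidf /= np.linalg.norm(tfidf, axis=1, keepdims=True)  # l2-normalize each row
print(tfidf[0])  # matches the first row above, e.g. 0.46979139 for 'document'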
Brief description of parameters:
The above uses TfidfTransformer directly to convert the count matrix produced by CountVectorizer into a normalized tf-idf matrix 【sparse word-count matrix → normalized tf-idf】. Some of its parameters are given below, with a short sketch of their effect after the list.
norm : ‘l1’, ‘l2’ or None, optional
Norm used to normalize term vectors. None for no normalization.
use_idf : boolean, default=True
Enable inverse-document-frequency reweighting.
smooth_idf : boolean, default=True
Smooth idf weights by adding one to document frequencies, as if an extra document was seen containing every term in the collection exactly once. Prevents zero divisions.
sublinear_tf : boolean, default=False
Apply sublinear tf scaling, i.e. replace tf with 1 + log(tf).
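As a quick sketch (reusing X from above; the settings are illustrative, not from the original post) of how these parameters change the weighting:

from sklearn.feature_extraction.text import TfidfTransformer

raw = TfidfTransformer(norm=None, smooth_idf=False)  # unnormalized, unsmoothed tf * idf
sub = TfidfTransformer(sublinear_tf=True)            # replace tf with 1 + log(tf)
print(raw.fit_transform(X).toarray()[1])  # the repeated term ('document' appears twice) keeps its full count
print(sub.fit_transform(X).toarray()[1])  # the repeated term's weight is damped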
Finally, we can briefly describe TfidfVectorizer.
TfidfVectorizer
Function:
Remember that TfidfVectorizer is equivalent to the combination of the two: it calls CountVectorizer and then TfidfTransformer in turn (the code is simplified, but the idea is the same), and its parameters are basically the same as theirs.
Code example:
from sklearn.feature_extraction.text import TfidfVectorizer

VT = TfidfVectorizer()  # calls CountVectorizer and then TfidfTransformer in turn (simplified code, same idea)
result = VT.fit_transform(corpus)
print(result.toarray())
print(VT.get_feature_names())
[[0.         0.46979139 0.58028582 0.38408524 0.         0.         0.38408524 0.         0.38408524]
 [0.         0.6876236  0.         0.28108867 0.         0.53864762 0.28108867 0.         0.28108867]
 [0.51184851 0.         0.         0.26710379 0.51184851 0.         0.26710379 0.51184851 0.26710379]
 [0.         0.46979139 0.58028582 0.38408524 0.         0.         0.38408524 0.         0.38408524]]
['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
As you can see, the result is identical to the one produced above by CountVectorizer plus TfidfTransformer, so it really is a combination of the two.
Its parameters and methods agree with those of CountVectorizer and TfidfTransformer and are not repeated here.
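Tying this back to section 2: here is a sketch (an assumption on my part, not from the original post) of how the CountVectorizer + TfidfTransformer pair used there could be collapsed into a single TfidfVectorizer inside a scikit-learn Pipeline with the XGBoost classifier:

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
import xgboost as xgb

pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(max_features=5000)),  # replaces CountVectorizer + TfidfTransformer
    ('clf', xgb.XGBClassifier(max_depth=6, learning_rate=0.1, n_estimators=60,
                              objective='binary:logistic')),
])
pipeline.fit(x_train, y_train)         # x_train holds the tokenized text from section 2
y_predict = pipeline.predict(x_test)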