Calling TF-IDF from sklearn

2022-07-13 18:09:00 【清风2022】

sklearn

Official documentation: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html

class sklearn.feature_extraction.text.TfidfVectorizer(input='content', encoding='utf-8', decode_error='strict', strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, analyzer='word', stop_words=None, token_pattern='(?u)\b\w\w+\b', ngram_range=(1, 1), max_df=1.0, min_df=1, max_features=None, vocabulary=None, binary=False, dtype=<class 'numpy.float64'>, norm='l2', use_idf=True, smooth_idf=True, sublinear_tf=False)

Parameter descriptions:

input : string, one of {'filename', 'file', 'content'}

If 'filename', the sequence passed as an argument to fit is expected to be a list of filenames that need reading to fetch the raw content to analyze.

If 'file', the sequence items must have a 'read' method (file-like object) that is called to fetch the bytes in memory.

Otherwise, the input is expected to be a sequence of strings or bytes items, which are analyzed directly.
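The difference between input='filename' and the default input='content' can be sketched as follows. This is a minimal illustration, not part of the original post: the two sample sentences and the temporary file names are hypothetical, and the files are created only so the example is self-contained.

```python
import os
import tempfile

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical documents, written to temporary files just for this demo.
texts = ["the cat sat on the mat", "the dog chased the cat"]

with tempfile.TemporaryDirectory() as tmp_dir:
    paths = []
    for i, text in enumerate(texts):
        path = os.path.join(tmp_dir, f"doc_{i}.txt")
        with open(path, "w", encoding="utf-8") as handle:
            handle.write(text)
        paths.append(path)

    # input='filename': fit_transform receives paths and reads each file itself.
    from_files = TfidfVectorizer(input="filename").fit_transform(paths)

# input='content' (the default): fit_transform receives the raw strings.
from_content = TfidfVectorizer(input="content").fit_transform(texts)

# Both routes should produce the same TF-IDF matrix.
same_matrix = np.allclose(from_files.toarray(), from_content.toarray())
print(same_matrix)
```

Passing file paths is mainly convenient when the corpus already lives on disk, since the vectorizer then opens and decodes each file using the encoding parameter.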

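max_df : float in the range [0.0, 1.0], or int, default=1.0

When building the vocabulary, terms whose document frequency is strictly higher than max_df are ignored. A float is read as a proportion of the documents and an integer as an absolute number of documents. For example, a word that appears in every document has a document frequency of 1.0.

min_df : float in the range [0.0, 1.0], or int, default=1

When building the vocabulary, terms whose document frequency is strictly lower than min_df are ignored. As with max_df, a float is a proportion of the documents and an integer is an absolute number of documents.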
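To make the two thresholds concrete, the sketch below fits the vectorizer on a four-sentence corpus with different max_df and min_df settings and prints the surviving vocabulary. The helper name vocabulary is hypothetical and exists only for this illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]


def vocabulary(**kwargs):
    """Fit a TfidfVectorizer with the given options and return its sorted terms."""
    vectorizer = TfidfVectorizer(**kwargs)
    vectorizer.fit(corpus)
    return sorted(vectorizer.vocabulary_)


# Defaults keep every term.
print(vocabulary())
# ['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']

# min_df=2 (integer): drop terms found in fewer than 2 documents.
print(vocabulary(min_df=2))
# ['document', 'first', 'is', 'the', 'this']

# max_df=0.9 (float): drop terms found in more than 90% of the documents.
print(vocabulary(max_df=0.9))
# ['and', 'document', 'first', 'one', 'second', 'third']

# Both thresholds together.
print(vocabulary(min_df=2, max_df=0.9))
# ['document', 'first']
```

Here 'is', 'the' and 'this' occur in all four documents, so max_df=0.9 removes them, while min_df=2 removes the terms that occur in only one document.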

Example from the official documentation

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print(vectorizer.get_feature_names_out().tolist())
# ['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
print(X.shape)
# (4, 9)
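To see what the numbers in the resulting matrix mean, the sketch below recomputes them by hand for the default settings in the signature (use_idf=True, smooth_idf=True, sublinear_tf=False, norm='l2'). It assumes the smoothed formula given in the scikit-learn documentation, idf(t) = ln((1 + n) / (1 + df(t))) + 1, and checks the result against TfidfVectorizer rather than taking it on trust. The variable names are illustrative only.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

# Raw term counts: one row per document, one column per term.
counts = CountVectorizer().fit_transform(corpus).toarray().astype(float)
n_docs = counts.shape[0]

# Document frequency: in how many documents each term occurs.
doc_freq = (counts > 0).sum(axis=0)

# Smoothed inverse document frequency (assumed formula for smooth_idf=True).
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# tf * idf, then scale every row to unit Euclidean length (norm='l2').
weighted = counts * idf
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

# Compare with what TfidfVectorizer computes using its defaults.
reference = TfidfVectorizer().fit_transform(corpus).toarray()
print(np.allclose(manual, reference))
```

If the comparison prints True, the hand computation matches the library output for this corpus, which is a quick way to confirm one's reading of the default parameters.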

Usage example:

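# -*- coding: utf-8 -*-
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

# Read the data: the first row of demo.csv is a header, each later row is one document
data = pd.read_csv('demo.csv', encoding='utf-8', header=0)
data.columns = ['word']
data1 = data['word']

tfidf_vectorizer = TfidfVectorizer()
# Parameters can also be set, e.g. TfidfVectorizer(min_df=3, max_df=0.9)

# Convert the documents into a TF-IDF matrix (returned as a sparse matrix)
TFIDFvec = tfidf_vectorizer.fit_transform(data1)

# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
result = pd.DataFrame(TFIDFvec.toarray(), columns=tfidf_vectorizer.get_feature_names_out())
print(result)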
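For readers without a demo.csv at hand, the same steps can be tried on an in-memory file. The CSV text below is a hypothetical stand-in, not the author's data; it only assumes the layout the script expects, namely a single column with a header row and one document per line.

```python
import io

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-in for demo.csv: a header row, then one document per row.
csv_text = (
    "word\n"
    "machine learning is fun\n"
    "deep learning is powerful\n"
    "learning never stops\n"
)

data = pd.read_csv(io.StringIO(csv_text), header=0)
documents = data["word"]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(documents)

# vocabulary_ maps each term to its column index, so sorting by index
# recovers the column order of the matrix.
columns = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

result = pd.DataFrame(tfidf.toarray(), columns=columns)
print(result.round(3))
```

Each row of the printed table is one document and each column one term; a term shared by every document, such as "learning" here, should receive a lower weight than the rarer terms in the same row.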

Sample input data for demo.csv