TFIDF with Sklearn: Code Usage
2022-07-13 18:09:00 【清风2022】
sklearn
class sklearn.feature_extraction.text.TfidfVectorizer(input='content', encoding='utf-8', decode_error='strict', strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, analyzer='word', stop_words=None, token_pattern='(?u)\b\w\w+\b', ngram_range=(1, 1), max_df=1.0, min_df=1, max_features=None, vocabulary=None, binary=False, dtype=<class 'numpy.float64'>, norm='l2', use_idf=True, smooth_idf=True, sublinear_tf=False)
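Parameters:
input : string, one of {'filename', 'file', 'content'}
If 'filename', the sequence passed as an argument to fit is expected to be a list of filenames that need to be read to fetch the raw content to analyze.
If 'file', the sequence items must have a 'read' method (file-like object) that is called to fetch the bytes in memory.
Otherwise ('content'), the input is expected to be a sequence of str or bytes items that are analyzed directly.
max_df : float in the range [0.0, 1.0] or int, default=1.0
When building the vocabulary, terms whose document frequency is strictly higher than max_df are ignored. A float is interpreted as a proportion of the documents and an int as an absolute number of documents. For example, a word that appears in every document has a document frequency of 1.0.
min_df : float in the range [0.0, 1.0] or int, default=1
When building the vocabulary, terms whose document frequency is strictly lower than min_df are ignored. As with max_df, a float is a proportion of the documents and an int is an absolute count.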
Example from the official documentation
from sklearn.feature_extraction.text import TfidfVectorizer
corpus = [
'This is the first document.',
'This document is the second document.',
'And this is the third one.',
'Is this the first document?']
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
print(list(vectorizer.get_feature_names_out()))  # get_feature_names() was removed in scikit-learn 1.2
#['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
print(X.shape)
#(4, 9)
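The defaults use_idf=True, smooth_idf=True and norm='l2' determine the values stored in X. As a rough check, and not a substitute for the library's own implementation, the sketch below recomputes the smoothed idf as ln((1 + n) / (1 + df)) + 1 and compares it with the fitted `idf_` attribute, then confirms that each row has unit L2 norm. The variable names `df`, `manual_idf` and `row_norms` are chosen for this illustration only.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

vectorizer = TfidfVectorizer()  # defaults: use_idf=True, smooth_idf=True, norm='l2'
X = vectorizer.fit_transform(corpus)

n_docs = X.shape[0]
# Document frequency: the number of documents in which each term occurs.
df = np.asarray((X > 0).sum(axis=0)).ravel()

# Smoothed idf, as used when smooth_idf=True.
manual_idf = np.log((1 + n_docs) / (1 + df)) + 1
print(np.allclose(vectorizer.idf_, manual_idf))

# With norm='l2', every document vector is scaled to unit Euclidean length.
row_norms = np.sqrt(np.asarray(X.multiply(X).sum(axis=1)).ravel())
print(row_norms)
```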
Usage example:
# -*- coding: utf-8 -*-
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd
# Read the data
data = pd.read_csv('demo.csv', encoding='utf-8', header=0)
data.columns = ['word']
data1 = data['word']
tfidf_vectorizer = TfidfVectorizer()
# Parameters can also be set, e.g. TfidfVectorizer(min_df=3, max_df=0.9)
TFIDFvec = tfidf_vectorizer.fit_transform(data1)  # convert the documents into a TF-IDF matrix
# fit_transform returns a sparse matrix; toarray() makes it dense for the DataFrame
result = pd.DataFrame(TFIDFvec.toarray(), columns=tfidf_vectorizer.get_feature_names_out())
print(result)
Sample input data in demo.csv

Sample output of the code
