Feature preprocessing
2022-08-05 04:11:00 【Mika grains】
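2.4.1 What is feature preprocessing
Why normalize and standardize?
Features measured in different units or on very different scales are not directly comparable, and a feature with large values can dominate the result. Scaling makes the features dimensionless.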
2.4.2 Normalization
Map the original data into a fixed range ([0, 1] by default) with a linear transformation.
Outliers: normalization depends on the maximum and minimum values, so if an outlier changes the maximum or minimum, the results change noticeably.
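The sensitivity to outliers is easy to see with a small sketch. The function below is a hand-written min-max rescaling in plain Python (the name minmax_scale and the sample values are made up for illustration; it is not the scikit-learn implementation), applied to the same data with and without one extreme value.

```python
def minmax_scale(values, feature_range=(0.0, 1.0)):
    """Rescale a list of numbers linearly into feature_range."""
    lo, hi = feature_range
    v_min, v_max = min(values), max(values)
    if v_max == v_min:
        # A constant column has no spread; map everything to the lower bound.
        return [lo for _ in values]
    return [(v - v_min) / (v_max - v_min) * (hi - lo) + lo for v in values]


# Hypothetical data: the same column with and without one outlier.
clean = [10.0, 20.0, 30.0, 40.0]
with_outlier = clean + [1000.0]

scaled_clean = minmax_scale(clean)
scaled_outlier = minmax_scale(with_outlier)

print("without outlier:", scaled_clean)
print("with outlier:   ", scaled_outlier)
```

With the outlier present, the four ordinary values are squeezed into a narrow band near 0, because the new maximum stretches the denominator.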
2.4.3 Standardization
(x - mean) / std
Standard deviation: a measure of how concentrated or spread out the data is.
Outliers: when there is a reasonable amount of data, a small number of outliers has little effect on the mean, so the standard deviation changes only slightly.
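A minimal sketch of standardization in plain Python, assuming the population standard deviation (dividing by n), which is what StandardScaler uses. The helper names and the sample data are made up for illustration. It also checks how far one outlier moves the mean and standard deviation of 1000 values.

```python
import math


def mean_and_std(values):
    """Return the mean and the population standard deviation."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return mean, std


def standardize(values):
    """Apply (x - mean) / std to every value."""
    mean, std = mean_and_std(values)
    if std == 0:
        # A constant column cannot be scaled; return zeros.
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


# Hypothetical data: the values 0..9 repeated 100 times, then one outlier.
base = [float(i % 10) for i in range(1000)]
with_outlier = base + [30.0]

print("without outlier (mean, std):", mean_and_std(base))
print("with outlier    (mean, std):", mean_and_std(with_outlier))

z = standardize([1.0, 2.0, 3.0, 4.0, 5.0])
print("standardized:", z)
```

The maximum of the column more than triples when the outlier is added, while the mean and standard deviation move only a little, which is why standardization is the more robust choice here.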
import pandas as pd
from sklearn.feature_selection import VarianceThreshold
from sklearn.preprocessing import MinMaxScaler, StandardScaler


def minmax_demo():
    """
    Normalization
    :return:
    """
    # 1. Get the data and keep the first three columns
    data = pd.read_csv("lizi")
    data = data.iloc[:, :3]
    print("data:\n", data)
    # 2. Instantiate a transformer
    # transfer = MinMaxScaler()
    transfer = MinMaxScaler(feature_range=(2, 3))
    # 3. Call fit_transform
    data_new = transfer.fit_transform(data)
    print("data_new:\n", data_new)
    return None


def stand_demo():
    """
    Standardization
    :return:
    """
    # 1. Get the data and keep the first three columns
    data = pd.read_csv("lizi")
    data = data.iloc[:, :3]
    print("data:\n", data)
    # 2. Instantiate a transformer (note the parentheses: an instance, not the class)
    transfer = StandardScaler()
    # 3. Call fit_transform
    data_new = transfer.fit_transform(data)
    print("data_new:\n", data_new)
    return None

I could not find factor_returns.csv, so I do not know whether the following function runs.

def variance_demo():
    """
    Filter low-variance features
    :return:
    """
    # 1. Get the data, dropping the first column and the last two
    data = pd.read_csv("factor_returns.csv")
    data = data.iloc[:, 1:-2]
    print("data:\n", data)
    # 2. Instantiate a transformer
    transfer = VarianceThreshold(threshold=10)
    # 3. Call fit_transform
    data_new = transfer.fit_transform(data)
    print("data_new:\n", data_new, data_new.shape)
    return None

2.5.1 Dimensionality reduction
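For an ndarray, the number of dimensions is the number of levels of nesting.
Here the data is a 2D array (samples by features).
Dimensionality reduction here means reducing the number of features (columns), not the nesting depth of the array.
Result: a set of features that are not correlated with one another.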
2.5.1 Dimensionality reduction
Feature selection
Filter
Variance threshold: filter out low-variance features
Correlation coefficient: measures how strongly two features are correlated
Value range: -1 to 1
When two features are highly correlated:
1) Keep only one of them
2) Combine them with a weighted sum
3) Use principal component analysis
Embedded: decision trees, regularization, deep learning

import pandas as pd
from scipy.stats import pearsonr
from sklearn.feature_selection import VarianceThreshold


def variance_demo():
    """
    Filter low-variance features and compute correlation coefficients
    :return:
    """
    # 1. Get the data, dropping the first column and the last two
    data = pd.read_csv("factor_returns.csv")
    data = data.iloc[:, 1:-2]
    print("data:\n", data)
    # 2. Instantiate a transformer
    transfer = VarianceThreshold(threshold=10)
    # 3. Call fit_transform
    data_new = transfer.fit_transform(data)
    print("data_new:\n", data_new, data_new.shape)
    # Compute the correlation coefficient between two variables
    # (pearsonr returns the coefficient together with a p-value)
    r1 = pearsonr(data["pe_ratio"], data["pb_ratio"])
    print("Correlation between pe_ratio and pb_ratio:\n", r1)
    r2 = pearsonr(data["revenue"], data["total_expense"])
    print("Correlation between revenue and total_expense:\n", r2)
    return None
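Since factor_returns.csv is not available, the two filter methods can also be tried on a tiny made-up table. The sketch below is plain Python written for illustration: the function names and the columns "a", "b" and "flat" are hypothetical, and the variance filter keeps a column only when its population variance is strictly greater than the threshold, which is how I understand VarianceThreshold to behave.

```python
import math


def variance(values):
    """Population variance of a list of numbers."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / n


def keep_high_variance(columns, threshold=0.0):
    """Return the names of columns whose variance exceeds threshold."""
    return [name for name, values in columns.items()
            if variance(values) > threshold]


def pearson_r(x, y):
    """Pearson correlation coefficient; the result lies in [-1, 1].

    Not defined for a constant column (the denominator would be zero).
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(ss_x * ss_y)


# Hypothetical columns: "b" is exactly 2 * "a", and "flat" never changes.
columns = {
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.0],
    "flat": [7.0, 7.0, 7.0, 7.0, 7.0],
}

kept = keep_high_variance(columns, threshold=0.0)
print("kept columns:", kept)
print("r(a, b):", pearson_r(columns["a"], columns["b"]))
print("r(a, reversed b):", pearson_r(columns["a"], columns["b"][::-1]))
```

The constant column is dropped by the variance filter, and the remaining two columns are perfectly correlated, so in practice only one of them would need to be kept.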
Principal component analysis:
2.6.1 What is Principal Component Analysis (PCA)
sklearn.decomposition.PCA(n_components=None)
n_components:
A decimal (between 0 and 1): the fraction of information (variance) to retain
An integer: the number of features to reduce to

from sklearn.decomposition import PCA


def pca_demo():
    """
    PCA dimensionality reduction
    :return:
    """
    data = [[2, 8, 4, 5], [6, 3, 0, 8], [5, 4, 9, 1]]
    # 1. Instantiate a transformer that reduces the data to 2 features
    transfer = PCA(n_components=2)
    # 2. Call fit_transform
    data_new = transfer.fit_transform(data)
    print("data_new:\n", data_new)
    return None
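To make the two meanings of n_components concrete, here is a rough NumPy sketch of PCA via singular value decomposition. It is an illustration only, not the scikit-learn implementation: the function name pca_reduce is made up, the rule for the decimal case (take the smallest number of components whose cumulative variance ratio reaches the requested fraction) is my assumption of the intended behavior, and the signs of the output columns may differ from what sklearn prints.

```python
import numpy as np


def pca_reduce(data, n_components):
    """Project data onto its leading principal components.

    n_components: an int (number of features to keep) or a float in (0, 1)
    (fraction of the total variance to retain).
    Returns the reduced data and the variance ratio of each kept component.
    """
    X = np.asarray(data, dtype=float)
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions; s holds the singular values.
    _, s, Vt = np.linalg.svd(X_centered, full_matrices=False)
    ratio = s ** 2 / np.sum(s ** 2)
    if isinstance(n_components, float) and 0 < n_components < 1:
        # Smallest k whose cumulative variance ratio reaches the target.
        k = int(np.searchsorted(np.cumsum(ratio), n_components)) + 1
        k = min(k, len(s))
    else:
        k = int(n_components)
    return X_centered @ Vt[:k].T, ratio[:k]


data = [[2, 8, 4, 5], [6, 3, 0, 8], [5, 4, 9, 1]]

reduced_int, ratio_int = pca_reduce(data, 2)
print("n_components=2:\n", reduced_int)
print("variance ratio per component:", ratio_int)

reduced_frac, ratio_frac = pca_reduce(data, 0.95)
print("n_components=0.95 keeps", reduced_frac.shape[1], "feature(s)")
```

With only three samples, the centered data has at most two independent directions, so two components already retain all of the variance in this example.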
2.6.2 Case study of user preferences for item categories