Data preprocessing - normalization and standardization
2022-06-25 15:10:00 【A window full of stars and milky way】
Standardization (normalization) rescales data so that it falls within a small, specific range.
It removes the units from the data, converting it into dimensionless pure values, so that indicators measured in different units or on different scales can be compared and weighted.
The most typical case is normalization, which maps the data uniformly onto the interval [0, 1].
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import preprocessing
%matplotlib inline
0-1 Standardization
Also called Max-Min standardization. The formula: x' = (x − min) / (max − min)
# Create data
df = pd.DataFrame({'value1': np.random.rand(100) * 10,
                   'value2': np.random.rand(100) * 100})
print(df.head())
print('---------------')
def maxmin(df, *cols):
    df_m = df.copy()
    for col in cols:
        ma = df[col].max()
        mi = df[col].min()
        df_m[col + '_m'] = (df[col] - mi) / (ma - mi)
    return df_m
df1 = maxmin(df,'value1','value2')
print(df1.head())
value1 value2
0 7.363287 15.749935
1 5.713568 33.233757
2 6.108123 21.522650
3 0.804442 85.003204
4 6.387467 21.264910
---------------
value1 value2 value1_m value2_m
0 7.363287 15.749935 0.740566 0.151900
1 5.713568 33.233757 0.574296 0.329396
2 6.108123 21.522650 0.614062 0.210505
3 0.804442 85.003204 0.079521 0.854962
4 6.387467 21.264910 0.642216 0.207888
# Use sklearn's scaling functions
minmax_scaler = preprocessing.MinMaxScaler()  # create a MinMaxScaler object
df_m1 = minmax_scaler.fit_transform(df)       # fit and transform in one step
df_m1 = pd.DataFrame(df_m1,columns=['value1_m','value2_m'])
df_m1.head()
|   | value1_m | value2_m |
|---|----------|----------|
| 0 | 0.740566 | 0.151900 |
| 1 | 0.574296 | 0.329396 |
| 2 | 0.614062 | 0.210505 |
| 3 | 0.079521 | 0.854962 |
| 4 | 0.642216 | 0.207888 |
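The table above comes from fitting and transforming the same data in one step. In practice the scaler is usually fit on the training data only and then reused on new data; values outside the training min/max then map outside [0, 1]. A minimal sketch (the toy numbers are made up for illustration):

```python
import pandas as pd
from sklearn import preprocessing

train = pd.DataFrame({'value1': [1.0, 5.0, 9.0]})
scaler = preprocessing.MinMaxScaler()
scaler.fit(train)  # learns min=1, max=9 from the training data only

# New data is transformed with the training min/max; values outside
# the training range fall outside [0, 1].
new = pd.DataFrame({'value1': [0.0, 5.0, 11.0]})
scaled = scaler.transform(new).ravel()
print(scaled)  # values: -0.125, 0.5, 1.25
```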
Z-Score
Also called the z-score, it is a quantity expressed in units of standard deviation. It is the quotient of the difference between a raw score and the group mean, divided by the standard deviation: it measures how many standard deviations the raw score lies above or below the mean.
- It is an abstract, dimensionless value, unaffected by the original units of measurement, and is suitable for further statistical processing.
- The processed values follow the standard normal distribution, with mean 0 and variance 1.
- As a centering method it changes the distribution of the original data, so it is not suitable for sparse data.

z = (x − μ) / σ
def data_Znorm(df, *cols):
    df_n = df.copy()
    for col in cols:
        u = df_n[col].mean()
        std = df_n[col].std()
        df_n[col + '_Zn'] = (df_n[col] - u) / std
    return df_n
# Standardize the data with the function defined above
df_z = data_Znorm(df, 'value1', 'value2')
u_z = df_z['value1_Zn'].mean()
std_z = df_z['value1_Zn'].std()
print(df_z.head())
print('After standardization, value1 mean: %.2f, standard deviation: %.2f' % (u_z, std_z))
# The processed data follow the standard normal distribution: mean 0, standard deviation 1
# When is Z-score standardization useful?
# In classification and clustering algorithms, when distance is used to measure similarity, Z-score performs better
value1 value2 value1_Zn value2_Zn
0 7.363287 15.749935 0.744641 -1.164887
1 5.713568 33.233757 0.196308 -0.550429
2 6.108123 21.522650 0.327450 -0.962008
3 0.804442 85.003204 -1.435387 1.268973
4 6.387467 21.264910 0.420298 -0.971066
After standardization, value1 mean: -0.00, standard deviation: 1.00
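The comment about distance-based algorithms can be made concrete: when features sit on very different scales, Euclidean distance is dominated by the larger-scale column, and standardizing can even change which point is the nearest neighbor. A small sketch with made-up points:

```python
import numpy as np
from sklearn import preprocessing

# Two features: value1 on a 0-10 scale, value2 on a 0-100 scale.
X = np.array([[1.0, 10.0],
              [2.0, 90.0],
              [9.0, 12.0]])

# Raw Euclidean distances from point 0: value2 dominates.
d_raw_01 = np.linalg.norm(X[0] - X[1])  # ~80.0, driven almost entirely by value2
d_raw_02 = np.linalg.norm(X[0] - X[2])  # ~8.2

# After Z-score standardization both features contribute comparably,
# and point 2 is no longer point 0's clear nearest neighbor.
Xz = preprocessing.StandardScaler().fit_transform(X)
d_z_01 = np.linalg.norm(Xz[0] - Xz[1])
d_z_02 = np.linalg.norm(Xz[0] - Xz[2])
print(d_raw_01, d_raw_02)
print(d_z_01, d_z_02)
```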
# Z-Score Standardization
zscore_scale = preprocessing.StandardScaler()
df_z1 = zscore_scale.fit_transform(df)
df_z1 = pd.DataFrame(df_z1,columns=['value1_z','value2_z'])
df_z1.head()
|   | value1_z  | value2_z  |
|---|-----------|-----------|
| 0 | 0.748393  | -1.170755 |
| 1 | 0.197297  | -0.553202 |
| 2 | 0.329100  | -0.966855 |
| 3 | -1.442619 | 1.275366  |
| 4 | 0.422416  | -0.975959 |
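The sklearn values here differ slightly from the manual table (e.g. 0.748393 vs 0.744641 in row 0). The reason: pandas' `std()` uses the sample standard deviation (ddof=1), while `StandardScaler` divides by the population standard deviation (ddof=0). A quick check:

```python
import pandas as pd
from sklearn import preprocessing

s = pd.Series([1.0, 2.0, 3.0, 4.0])

manual = ((s - s.mean()) / s.std()).to_numpy()                           # ddof=1
sk = preprocessing.StandardScaler().fit_transform(s.to_frame()).ravel()  # ddof=0

print(s.std(), s.std(ddof=0))  # ~1.2910 vs ~1.1180
print(manual[-1], sk[-1])      # ~1.1619 vs ~1.3416
```

For n rows the two results differ by a factor of sqrt(n / (n − 1)), which vanishes as n grows; for the 100-row example above the factor is about 1.005.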
MaxAbs
Maximum-absolute-value standardization. It is similar to the Max-Min method, mapping the data into a fixed range, here [-1, 1]. Unlike Max-Min, however, MaxAbs does not shift the data, so it does not destroy the data structure; it can therefore be used on sparse data, including sparse CSR (Compressed Sparse Row) and CSC (Compressed Sparse Column) matrices (two storage formats for sparse matrices).
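To illustrate the sparse-data point, MaxAbsScaler accepts a SciPy CSR matrix directly and only divides the nonzero entries, so zeros stay zero and the sparsity structure is preserved (a minimal sketch with a made-up matrix):

```python
import numpy as np
from scipy import sparse
from sklearn import preprocessing

# A sparse CSR matrix: most entries are zero.
X = sparse.csr_matrix(np.array([[1.0, 0.0,  0.0],
                                [0.0, 0.0, -3.0],
                                [2.0, 0.0,  0.0]]))

scaler = preprocessing.MaxAbsScaler()
X_ma = scaler.fit_transform(X)  # works directly on sparse input

# Each column is divided by its maximum absolute value;
# zero entries are untouched, so the matrix stays sparse.
print(X_ma.toarray())
```

By contrast, MinMaxScaler rejects sparse input, because its shift term would turn the zeros into nonzeros and densify the matrix.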
# MaxAbs Standardization
maxabs_scaler = preprocessing.MaxAbsScaler()
df_ma = maxabs_scaler.fit_transform(df)
df_ma = pd.DataFrame(df_ma,columns=['value1_ma','value2_ma'])
df_ma.head()
|   | value1_ma | value2_ma |
|---|-----------|-----------|
| 0 | 0.740969  | 0.158626  |
| 1 | 0.574957  | 0.334715  |
| 2 | 0.614661  | 0.216766  |
| 3 | 0.080951  | 0.856112  |
| 4 | 0.642772  | 0.214170  |
RobustScaler
In some cases the data contain outliers. Z-Score standardization can still be applied, but the result is often not ideal: after standardization, the outliers tend to lose their outlier character. In such cases RobustScaler can be used to standardize data with outliers.
This method provides stronger, more robust control over how the data's center and scale are computed.
——《Python Data Analysis and Data Operation》
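To see the quoted point on data with an outlier (toy values, not from the book): StandardScaler lets one extreme value inflate the mean and standard deviation, squashing the inliers together, while RobustScaler, which by default centers on the median and scales by the interquartile range (quantile_range=(25.0, 75.0)), keeps the inliers spread out and the outlier extreme.

```python
import numpy as np
from sklearn import preprocessing

# One extreme outlier in an otherwise tight column.
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

z = preprocessing.StandardScaler().fit_transform(x).ravel()
r = preprocessing.RobustScaler().fit_transform(x).ravel()

print(z)  # inliers squashed into ~[-0.54, -0.46]; outlier ~2.0
print(r)  # inliers spread over [-1, 0.5]; outlier stays extreme at 48.5
```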
# RobustScaler Standardization
robustscaler = preprocessing.RobustScaler()
df_r = robustscaler.fit_transform(df)
df_r = pd.DataFrame(df_r,columns=['value1_r','value2_r'])
df_r.head()
value1_r | value2_r | |
---|---|---|
0 | 0.360012 | -0.644051 |
1 | 0.055296 | -0.303967 |
2 | 0.128174 | -0.531764 |
3 | -0.851457 | 0.703016 |
4 | 0.179770 | -0.536777 |
Plot scatter diagrams of the standardized data
data_list = [df, df_m1, df_ma, df_z1, df_r]
title_list = ['source_data', 'maxmin_scaler',
              'maxabs_scaler', 'zscore_scaler',
              'robustscaler']
fig = plt.figure(figsize=(12, 6))
for i, j in enumerate(data_list):
    # enumerate turns an iterable (e.g. a list or string) into an indexed sequence,
    # yielding both the index and the value; it is mostly used for counting in for loops
    plt.subplot(2, 3, i + 1)
    # Each frame has exactly two columns: plot column 0 against column 1
    plt.scatter(j.iloc[:, 0], j.iloc[:, 1])
    plt.title(title_list[i])