
Data preprocessing

2022-07-25 09:28:00 【Bubble Yi】

1. Introduction

(1) Why preprocess data?

Real-world data is "dirty": the more data you collect, the more kinds of problems show up. Typical problems are incomplete data (missing values), noisy data, and inconsistent data.

(2) Why is data preprocessing important?

Because without high-quality data, there can be no high-quality mining results.

2. The following example walks through the data preprocessing process in detail. (The steps do not have to be carried out in this order. A different order can give different results, so analyze each problem on its own terms and adapt as needed.)

Step 1:

Make a deep copy of the data, check the basic data types, and compute descriptive statistics for the numeric columns. The code is as follows:

import numpy as np
import pandas as pd

data = pd.read_excel('D:/a/ data mining /Students.xls')  # use forward slashes in the path
my_data = data.copy()        # deep copy, so the raw data stays untouched
print(my_data.head(4))       # first four rows
my_data.info()               # column data types and non-null counts (info() prints by itself)
print(my_data.describe())    # descriptive statistics for the numeric columns
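Because the Excel file is not available to every reader, here is a minimal sketch of the same step on a small in-memory table. The column names `age` and `gender` and all values are invented for illustration. It shows that `copy()` makes a deep copy by default, so changing the copy leaves the original data untouched.

```python
import pandas as pd

# Hypothetical stand-in for the Excel file; names and values are made up.
data = pd.DataFrame({"age": [18, 19, 20], "gender": ["M", "F", "M"]})

my_data = data.copy()          # deep=True is the default
my_data.loc[0, "age"] = 99     # modify only the copy

print(data.loc[0, "age"])      # the original is unchanged
print(my_data.dtypes)          # data type of each column
print(my_data.describe())      # count, mean, std, quartiles of numeric columns
```

Working on a copy means a preprocessing mistake can always be undone by copying the raw data again.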

Step 2:

Remove duplicate rows.

(1) First, check whether there are duplicate rows. The code is as follows:

print(my_data.duplicated())  # True marks a row that repeats an earlier row

(2) Delete the duplicate rows. The code is as follows:

my_data = my_data.drop_duplicates()  # assign first; an assignment cannot go inside print()
print(my_data)
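On a large table the full boolean column is hard to read, so it can help to count the duplicates instead. The following is a small sketch on invented data (the `name` and `age` columns are assumptions, not the author's data set):

```python
import pandas as pd

# Toy data: the third row repeats the first one.
df = pd.DataFrame({"name": ["Ann", "Bob", "Ann"], "age": [18, 19, 18]})

n_dup = df.duplicated().sum()                            # number of repeated rows
deduped = df.drop_duplicates().reset_index(drop=True)    # drop them and renumber the index

print(n_dup)
print(deduped)
```

`reset_index(drop=True)` is optional. It only renumbers the rows so the index has no gaps after the deletion.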

Step 3:

Delete the useless features from the feature data set (these are generally character data; pick them out by hand).
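The article gives no code for this step. One possible way to do it, sketched here with hypothetical column names (`student_id` and `name` are assumptions chosen for illustration), is `DataFrame.drop`:

```python
import pandas as pd

# Toy data; the column names are hypothetical.
df = pd.DataFrame({
    "student_id": ["s1", "s2"],
    "name": ["Ann", "Bob"],
    "age": [18, 19],
})

useless = ["student_id", "name"]   # chosen by hand, as the step describes
df = df.drop(columns=useless)

print(df.columns.tolist())
```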

Step 4: Handle missing values (they can be replaced with the mean, median, or mode, depending on the specific case)

(1) Show where the missing values are (as a boolean table)

print(my_data.isnull())

(2) Delete every row that contains a missing value (note: after running this code, you can call info() to check the result)

my_data=my_data.dropna()
print(my_data.head(4))
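The code above deletes rows, while the step heading also mentions substitution. A minimal sketch of substitution, on invented data, might look like this; the choice of median for the numeric column and mode for the categorical column is an assumption for illustration, not the author's recommendation:

```python
import numpy as np
import pandas as pd

# Toy data with one missing value per column.
df = pd.DataFrame({
    "age": [18.0, np.nan, 20.0, 22.0],
    "gender": ["M", "F", None, "F"],
})

print(df.isnull().sum())   # number of missing values in each column

filled = df.copy()
# Numeric column: fill with the median of the observed values.
filled["age"] = filled["age"].fillna(filled["age"].median())
# Categorical column: fill with the most frequent value (the mode).
filled["gender"] = filled["gender"].fillna(filled["gender"].mode()[0])

print(filled)
```

Substitution keeps every row, which matters when the data set is small; deletion is simpler when only a few rows are affected.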

Step 5:

Strip leading and trailing spaces from the text fields in the character-type feature columns

my_data[" Gender "]=my_data[" Gender "].str.strip()
my_data[" Whether to donate blood for free "]=my_data[" Whether to donate blood for free "].str.strip()
my_data[" Whether parents encourage "]=my_data[" Whether parents encourage "].str.strip()
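When there are many text columns, listing them one by one becomes tedious. A possible alternative, sketched on invented data (the `gender`, `donor`, and `age` columns are assumptions), is to loop over the columns and strip only those that hold strings:

```python
import pandas as pd

# Toy data with stray spaces in the text columns.
df = pd.DataFrame({
    "gender": [" M", "F "],
    "donor": [" yes ", "no"],
    "age": [18, 19],
})

for col in df.columns:
    # Only string columns have the .str accessor; numeric columns are skipped.
    if pd.api.types.is_string_dtype(df[col]):
        df[col] = df[col].str.strip()

print(df)
```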

Step 6:

Normalize the letter case of the character feature values

(1) First use the unique() function to see which spellings occur. The code is as follows:

print(my_data[" Gender "].unique())
print(my_data[" Whether to donate blood for free "].unique())
print(my_data[" Whether parents encourage "].unique())
print(my_data[" Whether to participate in "].unique())

(2) Use the str accessor of the Series to convert the case. The code is as follows:

my_data[" Gender "]=my_data[" Gender "].str.title()
my_data[" Whether to donate blood for free "]=my_data[" Whether to donate blood for free "].str.title()
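To make the effect of the conversion concrete, here is a small sketch on invented values. It assumes the inconsistency is purely a matter of upper and lower case:

```python
import pandas as pd

gender = pd.Series(["m", "M", "f", "F"])
answer = pd.Series(["yes", "YES", "No"])

print(gender.unique())                 # four spellings for two categories

gender_norm = gender.str.upper()       # single letters: upper case is enough
answer_norm = answer.str.title()       # words: first letter upper, rest lower

print(gender_norm.unique())
print(answer_norm.unique())
```

Running `unique()` again after the conversion is a quick check that each category now has exactly one spelling.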

Step 7:

Detect and handle outliers in the numeric feature columns

(1) Outlier detection: clustering, box plots, or the $\bar{x} \pm 3\sigma$ rule.

import matplotlib as mpl
import matplotlib.pyplot as plt

mpl.rcParams['font.sans-serif'] = ['SimHei']  # only needed when labels contain Chinese characters
plt.figure(figsize=(5, 4))
plt.boxplot(my_data[' Age '])
plt.xlabel('Age')
plt.ylabel('Variable value')
plt.show()

def up_low_value(d):
    """Return the upper and lower box-plot limits of a numeric Series."""
    Q1 = d.quantile(0.25)
    Q3 = d.quantile(0.75)
    IQR = Q3 - Q1                # interquartile range
    up_value = Q3 + 1.5 * IQR    # values above this are outliers
    low_value = Q1 - 1.5 * IQR   # values below this are outliers
    return up_value, low_value

up1, low1 = up_low_value(my_data[' Age '])
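The $\bar{x} \pm 3\sigma$ rule is mentioned above but not shown. A minimal sketch on invented ages follows; note that `Series.std()` returns the sample standard deviation (ddof=1), which is an implementation detail of this sketch rather than something the article specifies:

```python
import pandas as pd

# Toy data: twenty plausible ages plus one obviously wrong entry.
ages = pd.Series([18, 19, 20, 19, 18, 20, 19, 21, 20, 19] * 2 + [95])

mean = ages.mean()
std = ages.std()                 # sample standard deviation (ddof=1)
low, up = mean - 3 * std, mean + 3 * std

outliers = ages[(ages < low) | (ages > up)]
print(outliers.tolist())
```

In very small samples a single extreme value inflates the standard deviation so much that it can stay inside the three-sigma band, which is one reason the box-plot limits are often the safer choice there.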

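(2) Outlier handling: set the outliers to missing, then fill them with the mean

# Mark values outside the box-plot limits as missing
my_data.loc[(my_data[' Age '] > up1) | (my_data[' Age '] < low1), ' Age '] = np.nan
# Fill with the column means; numeric_only=True skips the text columns
my_data = my_data.fillna(my_data.mean(numeric_only=True))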
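The two lines can be checked end to end on a small invented table. This is a sketch with a hypothetical `age` column, not the author's data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [18.0, 19.0, 20.0, 21.0, 95.0],
    "gender": ["M", "F", "F", "M", "F"],
})

# Box-plot limits, as in the function above.
q1, q3 = df["age"].quantile(0.25), df["age"].quantile(0.75)
iqr = q3 - q1
low, up = q1 - 1.5 * iqr, q3 + 1.5 * iqr

is_outlier = (df["age"] < low) | (df["age"] > up)
df.loc[is_outlier, "age"] = np.nan                 # blank out the outliers first
df["age"] = df["age"].fillna(df["age"].mean())     # mean of the remaining values

print(df)
```

Because the outliers are blanked before the mean is computed, they do not distort the value used to fill the gaps.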

Step 8:

Convert the qualitative features to numbers. The code is as follows:

print(my_data[' Gender '].unique())
size_mapping={'M':0,' male ':0,'F':1,' Woman ':1}
my_data[' Gender ']=my_data[' Gender '].map(size_mapping)
print(my_data[' Gender '].unique())
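One thing to watch with `map`: any value that is missing from the dictionary silently becomes NaN. A small sketch of a check for this, using invented values and a hypothetical mapping, could be:

```python
import pandas as pd

gender = pd.Series(["M", "F", "male", "unknown"])
mapping = {"M": 0, "male": 0, "F": 1, "female": 1}   # hypothetical mapping

encoded = gender.map(mapping)              # values not in the mapping become NaN
unmapped = gender[encoded.isna()].unique() # original values that were not covered

print(encoded.tolist())
print(unmapped)
```

If `unmapped` is not empty, either extend the mapping or go back to the cleaning steps (stripping spaces, fixing case) before encoding.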

Original site

Copyright notice
This article was written by [Bubble Yi]. Please include a link to the original when reposting. Thank you.
https://yzsam.com/2022/206/202207250918549804.html
