Data preprocessing
2022-07-25 09:28:00 【Bubble Yi】
1. Introduction
(1) Why preprocess data?
Real-world data is "dirty": the more data there is, the more kinds of problems appear. For example, data can be incomplete, noisy, or inconsistent.
(2) Why is data preprocessing important?
Because without high-quality data there can be no high-quality mining results.
2. The following example walks through the data preprocessing process in detail. (The steps do not have to be carried out in this order; a different order can give different results, so adapt to the specific problem.)
Step 1:
Load the data file and make a deep copy of it, check the basic data types, and print a statistical description of the numerical columns. The code is as follows:
import numpy as np
import pandas as pd
data = pd.read_excel('D:/a/ data mining /Students.xls')  # use forward slashes in the path
my_data = data.copy()  # copy() is a deep copy by default, so data stays untouched
print(my_data.head(4))
my_data.info()  # column data types and non-null counts (info() prints by itself)
print(my_data.describe())  # statistical description of the numerical columns
Step 2:
Remove duplicate rows.
(1) First, check whether there are duplicate rows. The code is as follows:
print(my_data.duplicated())
(2) Delete the duplicate rows. The code is as follows:
my_data = my_data.drop_duplicates()  # assign first; an assignment cannot be passed to print()
print(my_data)
Step 3:
Delete the features that are of no use for the analysis from the feature data set (these are usually character-type columns, and they are picked by hand).
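No code is given for this step. A minimal sketch, assuming a toy frame in which the hypothetical columns `Name` and `StudentID` have been judged by hand to carry no information for the analysis (neither name is taken from Students.xls), might look like this:

```python
import pandas as pd

# Toy data; the column names here are hypothetical, not taken from Students.xls.
toy = pd.DataFrame({
    "Name": ["Ann", "Bob", "Cid"],
    "StudentID": ["s01", "s02", "s03"],
    "Age": [19, 20, 21],
    "Gender": ["F", "M", "M"],
})

# Columns chosen by hand as carrying no information for the analysis.
useless_columns = ["Name", "StudentID"]

# errors="ignore" skips names that are not present instead of raising KeyError.
toy_reduced = toy.drop(columns=useless_columns, errors="ignore")
print(toy_reduced.columns.tolist())
```

Keeping the dropped names in one list makes the manual choice easy to review and to change later.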
Step 4: Handle missing values (they can be replaced with the mean, the median, or the mode; which one fits depends on the specific case)
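The substitution options named in this heading can be sketched as follows. This is only an illustration on a toy frame with hypothetical columns (`Age`, `Score`, `Gender`), and the choice of statistic per column is an assumption made for the example:

```python
import numpy as np
import pandas as pd

# Toy data with gaps; the column names are hypothetical.
toy = pd.DataFrame({
    "Age": [18.0, 20.0, np.nan, 22.0],
    "Score": [60.0, np.nan, 70.0, 100.0],
    "Gender": ["F", "M", None, "M"],
})

filled = toy.copy()
# Numeric column without extreme values: fill with the mean.
filled["Age"] = filled["Age"].fillna(filled["Age"].mean())
# Numeric column that may contain extreme values: the median is more robust.
filled["Score"] = filled["Score"].fillna(filled["Score"].median())
# Categorical column: fill with the most frequent value (the mode).
filled["Gender"] = filled["Gender"].fillna(filled["Gender"].mode().iloc[0])

print(filled)
```

Unlike deleting rows, substitution keeps every record, at the cost of pulling the filled column toward its typical value.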
(1) Show where the missing values are (as a Boolean mask)
print(my_data.isnull())
(2) Delete every row that contains a missing value (note: after running this code, the info() function can be used to check the result)
my_data = my_data.dropna()
print(my_data.head(4))
Step 5:
Strip the leading and trailing spaces from the text fields in the character-type feature columns
my_data[" Gender "] = my_data[" Gender "].str.strip()
my_data[" Whether to donate blood for free "] = my_data[" Whether to donate blood for free "].str.strip()
my_data[" Whether parents encourage "] = my_data[" Whether parents encourage "].str.strip()
Step 6:
Normalize the letter case of the values in the character-type feature columns
(1) First use the unique() function to list the distinct values. The code is as follows:
print(my_data[" Gender "].unique())
print(my_data[" Whether to donate blood for free "].unique())
print(my_data[" Whether parents encourage "].unique())
print(my_data[" Whether to participate in "].unique())
(2) Use the str accessor of the Series to convert the case. The code is as follows:
my_data[" Gender "] = my_data[" Gender "].str.title()
my_data[" Whether to donate blood for free "] = my_data[" Whether to donate blood for free "].str.title()
Step 7:
Detect and handle outliers in the numerical feature columns
(1) Outlier detection: clustering, box plot, 3σ rule.
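The box plot is the method worked through in this step. As a complementary sketch, a standard-deviation cutoff (often called the 3σ rule) can be written as below. The helper name `three_sigma_mask`, the threshold parameter `k`, and the toy ages are assumptions made for illustration only:

```python
import pandas as pd

def three_sigma_mask(s: pd.Series, k: float = 3.0) -> pd.Series:
    """Return True where a value lies more than k standard deviations from the mean."""
    mean = s.mean()
    std = s.std()  # sample standard deviation (ddof=1)
    return (s - mean).abs() > k * std

# Toy ages: twenty ordinary values plus one implausible entry.
ages = pd.Series([18, 19, 20, 21, 22] * 4 + [95])
mask = three_sigma_mask(ages)
print(ages[mask].tolist())
```

This cutoff assumes roughly normally distributed data, and an extreme value inflates the standard deviation it is measured against, so on small samples it may flag nothing; the quartile-based box plot limits are less sensitive to that effect.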
import matplotlib as mpl
import matplotlib.pyplot as plt
mpl.rcParams['font.sans-serif'] = ['SimHei']  # a font that can render Chinese labels
plt.figure(figsize=[5, 4])
plt.boxplot(my_data[' Age '])
plt.xlabel(' Age ')
plt.ylabel('Value of the variable')
plt.show()
def up_low_value(d):  # find the upper and lower limits of the box plot
    Q1 = d.quantile(0.25)
    Q3 = d.quantile(0.75)
    IQR = Q3 - Q1  # interquartile range
    up_value = Q3 + 1.5 * IQR
    low_value = Q1 - 1.5 * IQR
    return up_value, low_value
up1, low1 = up_low_value(my_data[' Age '])
(2) Outlier handling: set the outliers to missing and fill them with the mean
my_data.loc[(my_data[' Age '] > up1) | (my_data[' Age '] < low1), [' Age ']] = None
my_data = my_data.fillna(my_data.mean(numeric_only=True))  # average the numeric columns only; text columns have no mean
Step 8:
Quantify the qualitative features. The code is as follows:
print(my_data[' Gender '].unique())
size_mapping = {'M': 0, ' male ': 0, 'F': 1, ' Woman ': 1}  # keys must match the cleaned values exactly
my_data[' Gender '] = my_data[' Gender '].map(size_mapping)  # values missing from the mapping become NaN
print(my_data[' Gender '].unique())