
How to Get Started with Data Mining Competitions (Beginner's Edition)

2022-07-25 05:57:00 【Datawhale】

Datawhale practical guide

Contributors: Herding bear, Luoxiutao, Si Yuxin, Pan Shuyu, and others

This is a simple competition tutorial. Our goal is to help students take the first step toward becoming an AI training master. There is a lot to learn in data mining, so we suggest that beginners not worry at first about understanding the principles behind every piece of code. Get the code running first, then look up material on the knowledge points the code involves. This makes your study more targeted and makes it easier to find the fun in learning. A journey of a thousand miles begins with a single step: start your AI learning journey here!

Contributors: Herding bear, Luoxiutao

1. Preparation steps

1.1 Platform registration and competition sign-up

Competition link:
https://challenge.xfyun.cn/topic/info?type=diabetes&ch=ds22-dw-gzh02
Register (remember to fill in your personal information):

Click "Register" in the top-right corner of the page.

Fill in your personal information to complete registration.

Click to sign up for the competition; the page shows that enrollment succeeded.

Click "Entrants".

Sign-up is complete.

1.2 Data download

Data acquisition

Download the data from the official website. This involves downloading the data and completing real-name authentication.
Detailed steps are available at: https://xj15uxcopw.feishu.cn/docx/doxcn11gwo7cEuAXWhCrDld4Inb
Place the data files and the code file in the same folder so that the code runs correctly.

1.3 Reference material

To set up the Python environment, refer to:

Mac: "The most comprehensive tutorial for installing Anaconda on Mac" https://zhuanlan.zhihu.com/p/350828057
Windows: "Super-detailed Anaconda installation tutorial" https://blog.csdn.net/fan18317517352/article/details/123035625

2. Practical approach

This is a data mining competition. Participants build a model on the training set, use it to predict the test set, and submit the predictions.

The task is to build a model that predicts whether a patient has diabetes from the patient's test data. This is a typical binary classification problem (has diabetes / does not have diabetes), so the model outputs 0 or 1 (has diabetes: 1, no diabetes: 0).

In machine learning, classification tasks usually bring to mind algorithms such as logistic regression and decision trees. In this baseline, we use a decision tree to build the model. The code follows the usual process for solving a machine learning problem: data exploration, data preprocessing, feature engineering, model training, and results output.
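
To make the choice between these two algorithms more concrete, here is a minimal sketch that compares them with 5-fold cross-validation. It is not part of the baseline: the data is a synthetic stand-in generated by `make_classification`, and the feature count, sample size, and accuracy metric are all assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the competition data (assumption: 8 numeric features).
X, y = make_classification(
    n_samples=500, n_features=8, n_informative=5, random_state=0
)

# The two candidate classifiers mentioned in the text.
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=0),
}

# Mean 5-fold cross-validated accuracy for each candidate.
cv_scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    for name, model in candidates.items()
}

for name, score in cv_scores.items():
    print(f"{name}: mean accuracy = {score:.3f}")
```

On real data the ranking can go either way, so a comparison like this is only a starting point for picking a model.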

2.1 Code implementation

Run the following code in Jupyter Notebook or another Python environment.

# Install the dependencies. On Windows, run pip install in the cmd command prompt; see the environment setup references above
#!pip install scikit-learn
#!pip install pandas
#---------------------------------------------------
# Import libraries
#---------------- Data exploration ----------------
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
# Data preprocessing 
data1=pd.read_csv(' Game training set .csv',encoding='gbk')
data2=pd.read_csv(' Competition test set .csv',encoding='gbk')
# Mark the test-set label as -1 so the two sets can be separated again after merging
data2[' Signs of diabetes ']=-1
# Merge the training set and the test set
data=pd.concat([data1,data2],axis=0,ignore_index=True)
# Fill missing values in the diastolic pressure feature with -1
data[' diastolic pressure ']=data[' diastolic pressure '].fillna(-1)

#---------------- Feature Engineering ----------------
"""
 Convert the year of birth into age 
"""
data[' Age ']=2022-data[' Year of birth ']  # Change to age 

"""
 For adults, a normal body mass index is between 18.5 and 24
 Below 18.5 is underweight
 Between 24 and 27 is overweight
 Above 27 is considered obese
 Above 32 is severely obese
"""
def BMI(a):
    if a<18.5:
        return 0
    elif 18.5<=a<=24:
        return 1
    elif 24<a<=27:
        return 2
    elif 27<a<=32:
        return 3
    else:
        return 4

data['BMI']=data[' Body mass index '].apply(BMI)

# Family history of diabetes 
"""
 No record 
 One uncle or aunt has diabetes / One uncle or aunt has diabetes 
 One parent has diabetes 
"""
def FHOD(a):
    if a==' No record ':
        return 0
    elif a==' One uncle or aunt has diabetes ' or a==' One uncle or aunt has diabetes ':
        return 1
    else:
        return 2

data[' Family history of diabetes ']=data[' Family history of diabetes '].apply(FHOD)
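"""
 The normal diastolic pressure range is 60-90
"""
def DBP(a):
    if 0<=a<60:
        return 0  # below the normal range
    elif 60<=a<=90:
        return 1  # within the normal range
    elif a>90:
        return 2  # above the normal range
    else:
        return a  # missing values were filled with -1 above and stay -1
data['DBP']=data[' diastolic pressure '].apply(DBP)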

#------------------------------------
# Split the processed data back into the training set and the test set. The training set is used to train the model; the test set is what the model predicts for submission
# The ID number has no relationship to whether the patient has diabetes, so this irrelevant feature is dropped
train=data[data[' Signs of diabetes '] !=-1]
test=data[data[' Signs of diabetes '] ==-1]
train_label=train[' Signs of diabetes ']
train=train.drop([' Number ',' Signs of diabetes ',' Year of birth '],axis=1)
test=test.drop([' Number ',' Signs of diabetes ',' Year of birth '],axis=1)

#---------------- model training ----------------
model = DecisionTreeClassifier()
model.fit(train, train_label) 
y_pre=model.predict(test)
y_pre

#---------------- Results output ----------------
result=pd.read_csv(' Submit sample .csv')
result['label']=y_pre
result.to_csv('result-de.csv',index=False)
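
The baseline trains on every labelled row, so it gives no local estimate of model quality before you submit. A minimal sketch of a hold-out check is shown here. It is an illustration only: the DataFrame, its column names (`age`, `bmi`, `dbp`), and the label rule are made up to keep the example self-contained, and accuracy and F1 are shown as two common metrics rather than the competition's official one.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the processed training features (hypothetical columns).
rng = np.random.default_rng(0)
n = 400
demo_train = pd.DataFrame({
    "age": rng.integers(20, 80, size=n),
    "bmi": rng.normal(25, 4, size=n),
    "dbp": rng.normal(80, 10, size=n),
})
# Made-up label rule so the tree has a pattern to learn.
demo_label = ((demo_train["bmi"] > 27) | (demo_train["dbp"] > 90)).astype(int)

# Hold out 20% of the labelled rows, keeping the class ratio (stratify).
X_tr, X_val, y_tr, y_val = train_test_split(
    demo_train, demo_label, test_size=0.2, stratify=demo_label, random_state=0
)

# Train on the remaining 80% and predict the held-out rows.
clf = DecisionTreeClassifier(random_state=0)
clf.fit(X_tr, y_tr)
val_pred = clf.predict(X_val)

val_accuracy = accuracy_score(y_val, val_pred)
val_f1 = f1_score(y_val, val_pred)
print(f"validation accuracy: {val_accuracy:.3f}, F1: {val_f1:.3f}")
```

With the real data, you would pass `train` and `train_label` from the baseline to `train_test_split` in place of the synthetic frame, then refit on all labelled rows before predicting the test set.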