How to Get Started with a Data Mining Competition: Beginner's Edition
2022-07-25 05:57:00 【Datawhale】
Datawhale practical content
Contributors: Herding bear, Luoxiutao, Si Yuxin, Pan Shuyu, et al.
This is a simple competition tutorial. Our goal is to help students take the first step toward becoming AI training masters. There is a lot to learn in data mining, so beginners should not worry at first about the principles behind every piece of code. Get the code running first, then look up the concepts it uses and study the related material. This makes your study more targeted and makes it easier to find the fun in learning. A journey of a thousand miles begins with a single step: start your AI learning journey here!
(Contributors: Herding bear, Luoxiutao)

1. Preparation steps
1.1 Platform registration and competition sign-up
Competition link:
https://challenge.xfyun.cn/topic/info?type=diabetes&ch=ds22-dw-gzh02 (register, and remember to fill in your personal information)


Click to sign up; a message confirms that enrollment succeeded.


1.2 Data download
Getting the data
Download the data from the official website: this requires completing real-name authentication.
Detailed steps: https://xj15uxcopw.feishu.cn/docx/doxcn11gwo7cEuAXWhCrDld4Inb
Put the data files and the code file in the same folder so that the code runs correctly.
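File names passed to a CSV reader without a folder are resolved relative to the current working directory, which is one reason to keep the data and the code together. As a minimal sketch, the hypothetical helper below (`missing_files` is not part of the original tutorial) reports which expected files are absent from a folder. The file names in the usage part are placeholders, not the competition's actual file names.

```python
from pathlib import Path


def missing_files(folder, names):
    """Return the entries of `names` that do not exist as files in `folder`."""
    base = Path(folder)
    return [name for name in names if not (base / name).is_file()]


if __name__ == "__main__":
    # Placeholder names: replace them with the files you actually downloaded.
    expected = ["train.csv", "test.csv", "sample_submission.csv"]
    missing = missing_files(".", expected)
    if missing:
        print("Missing from the current folder:", missing)
    else:
        print("All data files found.")
```

Running this from the folder that holds the code gives a quick check before starting the main script.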
1.3 Reference material
For setting up the Python environment, refer to:
Mac: The most comprehensive tutorial for installing Anaconda on Mac https://zhuanlan.zhihu.com/p/350828057
Windows: A super detailed Anaconda installation tutorial
https://blog.csdn.net/fan18317517352/article/details/123035625
2. Practical approach
This is a data mining competition. Participants build a model from the training set, then make predictions for the test set and submit the results.
The task is to build a model that predicts whether a patient has diabetes from the patient's test data. This is a typical binary classification problem (has diabetes / does not have diabetes), and the model outputs 0 or 1 (has diabetes: 1, does not have diabetes: 0).
For classification tasks in machine learning, we usually think of algorithms such as logistic regression and decision trees. In this baseline, we use a decision tree to build the model. When solving a machine learning problem, we generally follow a fixed process, which the code in section 2.1 mirrors: data exploration, data preprocessing, feature engineering, model training, and results output.
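As a concrete illustration of the decision tree approach mentioned above, here is a minimal sketch of fitting a decision tree to a binary classification problem. It uses synthetic data from scikit-learn's `make_classification` rather than the competition data, and the sample count, feature count, and `random_state` values are arbitrary assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for patient data: 200 samples, 5 numeric features,
# and a binary label (0 or 1).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Train on the first 150 rows and treat the last 50 as unseen data.
X_train, y_train = X[:150], y[:150]
X_new = X[150:]

model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)

# predict() returns one class label per row; every value is 0 or 1.
predictions = model.predict(X_new)
print(predictions[:10])
```

The baseline follows the same fit-then-predict pattern, with the engineered patient features in place of the synthetic matrix.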

2.1 Code implementation
Run the following code in Jupyter Notebook or another Python environment.
# Install the dependencies. On Windows, run pip install from the cmd command prompt; see the environment setup references above.
# The sklearn module is provided by the scikit-learn package.
#!pip install scikit-learn
#!pip install pandas
#---------------------------------------------------
# Import libraries
#---------------- Data exploration ----------------
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
# Data preprocessing
# Note: the file and column names here are English renderings; they must match the names in the downloaded data exactly.
data1=pd.read_csv('Game training set.csv',encoding='gbk')
data2=pd.read_csv('Competition test set.csv',encoding='gbk')
# The test set has no label, so mark it with the placeholder -1
data2['Signs of diabetes']=-1
# Merge the training set and the test set so both get the same feature processing
data=pd.concat([data1,data2],axis=0,ignore_index=True)
# Fill missing values in the diastolic pressure feature with -1
data['diastolic pressure']=data['diastolic pressure'].fillna(-1)
#---------------- Feature Engineering ----------------
"""
Convert the year of birth into age
"""
data['Age']=2022-data['Year of birth'] # age as of 2022
"""
For adults, the normal body mass index (BMI) range is 18.5-24
Below 18.5 is underweight
24-27 is overweight
Above 27 is considered obese
Above 32 is severely obese
"""
def BMI(a):
    if a<18.5:
        return 0
    elif 18.5<=a<=24:
        return 1
    elif 24<a<=27:
        return 2
    elif 27<a<=32:
        return 3
    else:
        return 4
data['BMI']=data['Body mass index'].apply(BMI)
# Family history of diabetes
"""
Categories:
No record
One uncle or aunt has diabetes
One parent has diabetes
"""
def FHOD(a):
    if a=='No record':
        return 0
    # The original code compares against two spellings of this category,
    # which are identical in this English rendering.
    elif a=='One uncle or aunt has diabetes':
        return 1
    else:
        return 2
data['Family history of diabetes']=data['Family history of diabetes'].apply(FHOD)
"""
The normal diastolic pressure range is 60-90
"""
def DBP(a):
    if 0<=a<60:
        return 0 # below the normal range
    elif 60<=a<=90:
        return 1 # within the normal range
    elif a>90:
        return 2 # above the normal range
    else:
        return a # negative values (the -1 used for missing data) are kept as-is
data['DBP']=data['diastolic pressure'].apply(DBP)
#------------------------------------
# Split the engineered data back into the training set and the test set. The training set is used to fit the model; the test set is the data we predict on for submission.
# The record number has no relationship with whether the patient has diabetes, so this irrelevant feature is dropped. Year of birth is dropped as well, since Age was derived from it.
train=data[data['Signs of diabetes'] !=-1]
test=data[data['Signs of diabetes'] ==-1]
train_label=train['Signs of diabetes']
train=train.drop(['Number','Signs of diabetes','Year of birth'],axis=1)
test=test.drop(['Number','Signs of diabetes','Year of birth'],axis=1)
#---------------- Model training ----------------
model = DecisionTreeClassifier()
model.fit(train, train_label)
y_pre=model.predict(test)
y_pre # in a notebook, this displays the predicted labels
#---------------- Results output ----------------
result=pd.read_csv('Submit sample.csv')
result['label']=y_pre
result.to_csv('result-de.csv',index=False)
2.2 Submitting results
On the result submission page, submit the prediction results CSV (the file generated by the program), then check your score and ranking.
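The baseline in section 2.1 fits on all of the labelled data, so the leaderboard is its only feedback. One common way to estimate performance before submitting is to hold out part of the labelled data, sketched below with scikit-learn's `train_test_split`. The data is synthetic, the 80/20 split and `random_state` values are arbitrary choices, and accuracy and F1 appear only as example metrics, since the competition's scoring metric is not stated here.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic labelled data standing in for the processed training set.
X, y = make_classification(n_samples=500, n_features=8, random_state=42)

# Hold out 20% of the labelled rows; stratify keeps the class ratio similar.
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)
valid_pred = model.predict(X_valid)

# Scores on rows the model never saw during training.
print("accuracy:", accuracy_score(y_valid, valid_pred))
print("F1:", f1_score(y_valid, valid_pred))
```

With the real data, `X` and `y` would be the `train` and `train_label` objects from section 2.1, and the final model can still be refit on all labelled rows before predicting the test set.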



