Hands on data analysis unit 3 model building and evaluation
2022-06-24 13:29:00 【51CTO】
hands-on-data-analysis Unit 3: Model building and evaluation
Table of Contents
- hands-on-data-analysis Unit 3: Model building and evaluation
- 1. Model building
- 1.1. Import related libraries
- 1.2. Loading of data sets
- 1.3. Dataset analysis
- 1.4. Model building
- 1.5. Import model
- 1.5.1. Logistic regression model with default parameters
- 1.5.2. Logistic regression model with adjusted parameters
- 1.5.3. Random forest classification model with default parameters
- 1.5.4. Random forest classification model with adjusted parameters
- 1.6. Model prediction
- 2. Model evaluation
- 3. References
1. Model building
1.1. Import related libraries
1.2. Loading of data sets
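The original loading code is not preserved here, so the following is only a minimal sketch of how such a CSV is typically read with pandas. The in-memory `csv_text` is a stand-in for the real file (its name and path are not given in this article), and its three rows are copied from the table in section 1.3.

```python
import io

import pandas as pd

# Stand-in for the CSV file on disk (assumption): three rows copied from
# the table shown in section 1.3.
csv_text = """PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
"""

# With a real file you would pass its path instead of the StringIO buffer.
train = pd.read_csv(io.StringIO(csv_text))

print(train.shape)   # (rows, columns)
print(train.head())  # first rows; empty Cabin fields are read as NaN
```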
1.3. Dataset analysis
Output:
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th… | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
These data clearly still need cleaning. The cleaned dataset looks like this:
| | PassengerId | Pclass | Age | SibSp | Parch | Fare | Sex_female | Sex_male | Embarked_C | Embarked_Q | Embarked_S |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | 22.0 | 1 | 0 | 7.2500 | 0 | 1 | 0 | 0 | 1 |
| 1 | 1 | 1 | 38.0 | 1 | 0 | 71.2833 | 1 | 0 | 1 | 0 | 0 |
| 2 | 2 | 3 | 26.0 | 0 | 0 | 7.9250 | 1 | 0 | 0 | 0 | 1 |
| 3 | 3 | 1 | 35.0 | 1 | 0 | 53.1000 | 1 | 0 | 0 | 0 | 1 |
| 4 | 4 | 3 | 35.0 | 0 | 0 | 8.0500 | 0 | 1 | 0 | 0 | 1 |
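The cleaning step itself is not shown, so here is one plausible sketch of how the raw table could be turned into the numeric feature table: drop the label and the sparse text column, then one-hot encode the categorical columns with `pd.get_dummies`. The choice of columns to drop is an assumption, `Name` and `Ticket` are left out of the toy frame for brevity, and because these five rows contain no passenger who embarked at Q, no `Embarked_Q` column appears here.

```python
import pandas as pd

# The five raw rows from the table above (Name and Ticket omitted for brevity).
raw = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4, 5],
    "Survived": [0, 1, 1, 1, 0],
    "Pclass": [3, 1, 3, 1, 3],
    "Sex": ["male", "female", "female", "female", "male"],
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0],
    "SibSp": [1, 1, 0, 1, 0],
    "Parch": [0, 0, 0, 0, 0],
    "Fare": [7.25, 71.2833, 7.925, 53.1, 8.05],
    "Cabin": [None, "C85", None, "C123", None],
    "Embarked": ["S", "C", "S", "S", "S"],
})

# Separate the label from the features and drop the mostly-missing Cabin column
# (which columns to drop is an assumption of this sketch).
y = raw["Survived"]
features = raw.drop(columns=["Survived", "Cabin"])

# One-hot encode the categorical columns into 0/1 indicator columns.
X = pd.get_dummies(features, columns=["Sex", "Embarked"], dtype=int)

print(X.columns.tolist())
print(X)
```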
1.4. Model building
sklearn's algorithm selection path

Split the dataset
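A minimal sketch of the split with `train_test_split`. The synthetic data from `make_classification` is an assumption standing in for the cleaned feature table and the `Survived` label; `stratify=y` keeps the class proportions similar in both parts, and the default test size is 25%.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cleaned features X and the labels y (assumption).
X, y = make_classification(n_samples=200, n_features=11, n_informative=5,
                           random_state=0)

# Stratified split: by default 75% for training and 25% for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

print(X_train.shape, X_test.shape)
print(y_train.mean(), y_test.mean())  # class balance in each part
```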
1.5. Import model
1.5.1. Logistic regression model with default parameters
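As a hedged sketch of this step, the code below fits `LogisticRegression` with its default parameters and reports accuracy on both splits. The data is synthetic (an assumption), so the scores are not the ones from the original notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cleaned dataset (assumption).
X, y = make_classification(n_samples=200, n_features=11, n_informative=5,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

# Default parameters: L2 penalty with regularization strength C=1.0.
lr = LogisticRegression()
lr.fit(X_train, y_train)

# score() returns mean accuracy for classifiers.
train_score = lr.score(X_train, y_train)
test_score = lr.score(X_test, y_test)
print(f"Training set score: {train_score:.2f}")
print(f"Testing set score: {test_score:.2f}")
```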
1.5.2. Logistic regression model with adjusted parameters
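Which parameters the original adjusted is not recorded here. A common choice is the inverse regularization strength `C`, so this sketch (an assumption, on synthetic data) compares a few values of `C`; `max_iter` is raised only to avoid convergence warnings.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cleaned dataset (assumption).
X, y = make_classification(n_samples=200, n_features=11, n_informative=5,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

# Smaller C means stronger regularization; larger C means weaker regularization.
scores = {}
for C in (0.01, 1.0, 100.0):
    model = LogisticRegression(C=C, max_iter=1000)
    model.fit(X_train, y_train)
    scores[C] = (model.score(X_train, y_train), model.score(X_test, y_test))

for C, (train_acc, test_acc) in scores.items():
    print(f"C={C}: train={train_acc:.2f}, test={test_acc:.2f}")
```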
1.5.3. Random forest classification model with default parameters
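A sketch of the default random forest on synthetic data (an assumption). Only `random_state` is set, so that repeated runs give the same result.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cleaned dataset (assumption).
X, y = make_classification(n_samples=200, n_features=11, n_informative=5,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

# Default hyperparameters; trees are grown until the leaves are pure.
rfc = RandomForestClassifier(random_state=0)
rfc.fit(X_train, y_train)

train_score = rfc.score(X_train, y_train)
test_score = rfc.score(X_test, y_test)
print(f"Training set score: {train_score:.2f}")
print(f"Testing set score: {test_score:.2f}")
```

Fully grown trees usually fit the training set almost perfectly, which is why the training score alone says little about generalization.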
1.5.4. Random forest classification model with adjusted parameters
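The adjusted values used originally are not preserved, so the numbers below (`n_estimators=100`, `max_depth=5`) are illustrative assumptions. Limiting `max_depth` is a typical way to reduce overfitting in tree ensembles.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cleaned dataset (assumption).
X, y = make_classification(n_samples=200, n_features=11, n_informative=5,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

# Illustrative settings: 100 trees, each limited to depth 5.
rfc = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=0)
rfc.fit(X_train, y_train)

train_score = rfc.score(X_train, y_train)
test_score = rfc.score(X_test, y_test)
print(f"Training set score: {train_score:.2f}")
print(f"Testing set score: {test_score:.2f}")
```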
1.6. Model prediction
Supervised models in sklearn generally have a predict method that outputs predicted labels, and a predict_proba method that outputs the probability of each label.
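The difference between the two methods can be sketched as follows, again on synthetic data (an assumption). Each row of `predict_proba` holds one probability per class, in the order given by `classes_`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cleaned dataset (assumption).
X, y = make_classification(n_samples=200, n_features=11, n_informative=5,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

lr = LogisticRegression(max_iter=1000).fit(X_train, y_train)

pred = lr.predict(X_test)         # predicted class label for each sample
proba = lr.predict_proba(X_test)  # one column of probabilities per class

print(pred[:10])
print(proba[:5])
```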
2. Model evaluation
2.1. Cross validation
There are many kinds of cross-validation. The simplest, and the first that comes to mind, is to split the dataset into two parts: a training set and a test set.
However, this simple approach has two drawbacks.
1. The final model and parameter choices depend heavily on how you happen to split the training and test sets.
2. Only part of the data is used to train the model, so the dataset is not fully exploited.
To address these problems, practitioners have developed several refinements. One of them is K-fold cross-validation:
Each test set now contains multiple samples rather than just one; exactly how many depends on the choice of K. For example, if K=5, the steps of 5-fold cross-validation are:
1. Split the whole dataset into 5 parts.
2. Take each part in turn, without repetition, as the test set, train the model on the other 4 parts, and compute the MSE of that model on the test set.
3. Average the 5 MSE values to get the final MSE.
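The three steps above can be sketched by hand with `KFold` and then checked against `cross_val_score`, which runs the same loop internally. This is an illustrative sketch on synthetic data (an assumption). The steps mention MSE, but the example is a classifier, so the per-fold score here is accuracy, sklearn's default for classifiers.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the cleaned dataset (assumption).
X, y = make_classification(n_samples=200, n_features=11, n_informative=5,
                           random_state=0)

# Step 1: split the data into K=5 parts.
kfold = KFold(n_splits=5, shuffle=True, random_state=0)

# Step 2: each part serves once as the test set; train on the other 4 parts.
fold_scores = []
for train_idx, test_idx in kfold.split(X):
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])
    fold_scores.append(model.score(X[test_idx], y[test_idx]))

# Step 3: average the 5 per-fold scores.
manual_mean = np.mean(fold_scores)

# The same procedure in one call, using the same folds.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=kfold)

print(fold_scores)
print(manual_mean, scores.mean())
```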

2.2. Confusion matrix
A confusion matrix summarizes the results of a classifier. For k-class classification it is a k x k table that records the classifier's predictions.

In sklearn, the confusion matrix function lives in the sklearn.metrics module.
It takes the true labels and the predicted labels as input.
Precision, recall, and F-score can be obtained with classification_report.
As a quick check of model quality, look at the main diagonal of the confusion matrix, which holds the correctly classified samples.
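A short sketch of these calls on synthetic data (an assumption). It also shows why the main diagonal matters: the diagonal sum divided by the total count is exactly the accuracy.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix)
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cleaned dataset (assumption).
X, y = make_classification(n_samples=200, n_features=11, n_informative=5,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

lr = LogisticRegression(max_iter=1000).fit(X_train, y_train)
pred = lr.predict(X_test)

# Rows are true labels, columns are predicted labels.
cm = confusion_matrix(y_test, pred)
print(cm)

# Correct predictions sit on the main diagonal.
diag_accuracy = cm.trace() / cm.sum()
print(diag_accuracy, accuracy_score(y_test, pred))

# Per-class precision, recall and F1-score.
print(classification_report(y_test, pred))
```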
2.3. ROC curve
The ROC curve originated in the way radar operators judged radar signals during World War II. Each operator's task was to interpret the radar signal, but the technology of the time was not very advanced and the signal was noisy, so whenever something appeared on the screen the operator had to decide what it was. Some operators were more cautious and tended to interpret every signal as an enemy bomber; others were more easygoing and tended to interpret it as a bird. A set of evaluation metrics was therefore urgently needed to summarize each operator's predictions and assess the reliability of the radar. This led to one of the earliest methods of ROC curve analysis. Since then, the ROC curve has been widely used in medicine and machine learning.
ROC stands for Receiver Operating Characteristic curve; its Chinese name translates literally as "subject working characteristic curve".
In sklearn, the ROC curve function is in the sklearn.metrics module.
The larger the area under the ROC curve, the better.
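A minimal sketch of computing the curve and the area under it with `roc_curve` and `auc`, on synthetic data (an assumption). Plotting `fpr` against `tpr`, for example with matplotlib, draws the curve itself; the plot is left out to keep the example short.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import auc, roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cleaned dataset (assumption).
X, y = make_classification(n_samples=200, n_features=11, n_informative=5,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

lr = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# roc_curve needs a continuous score per sample, not hard labels.
decision_scores = lr.decision_function(X_test)
fpr, tpr, thresholds = roc_curve(y_test, decision_scores)

# Area under the curve: 0.5 is chance level, 1.0 is a perfect ranking.
roc_auc = auc(fpr, tpr)
print(f"AUC = {roc_auc:.3f}")
```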

3. References
[Machine Learning] Cross-Validation explained in detail - Zhihu (zhihu.com)