
Hands on data analysis unit 3 model building and evaluation

2022-06-24 13:29:00 51CTO



1. Model building

1.1. Import related libraries

import pandas as pd
import numpy as np
# matplotlib.pyplot and seaborn are plotting libraries
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import Image

     
# Display figures inline in the notebook
%matplotlib inline

     
plt.rcParams['font.sans-serif'] = ['SimHei']  # Display Chinese labels correctly
plt.rcParams['axes.unicode_minus'] = False  # Display minus signs correctly
plt.rcParams['figure.figsize'] = (10, 6)  # Default figure size

     

1.2. Loading the dataset

# Read the original dataset
train = pd.read_csv('train.csv')
train.shape

     

Output:

(891, 12)

     

1.3. Dataset analysis

train.head()

     

Output:

   PassengerId  Survived  Pclass                                               Name     Sex   Age  SibSp  Parch            Ticket     Fare Cabin Embarked
0            1         0       3                            Braund, Mr. Owen Harris    male  22.0      1      0         A/5 21171   7.2500   NaN        S
1            2         1       1   Cumings, Mrs. John Bradley (Florence Briggs Th…  female  38.0      1      0          PC 17599  71.2833   C85        C
2            3         1       3                             Heikkinen, Miss. Laina  female  26.0      0      0  STON/O2. 3101282   7.9250   NaN        S
3            4         1       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1      0            113803  53.1000  C123        S
4            5         0       3                           Allen, Mr. William Henry    male  35.0      0      0            373450   8.0500   NaN        S
train.info()

     
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 # Column Non-Null Count Dtype 
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB

     

As you can see, this data still needs cleaning. The cleaned dataset looks like this:

# Read the cleaned dataset
data = pd.read_csv('clear_data.csv')

     
data.head()

     
   PassengerId  Pclass   Age  SibSp  Parch     Fare  Sex_female  Sex_male  Embarked_C  Embarked_Q  Embarked_S
0            0       3  22.0      1      0   7.2500           0         1           0           0           1
1            1       1  38.0      1      0  71.2833           1         0           1           0           0
2            2       3  26.0      0      0   7.9250           1         0           0           0           1
3            3       1  35.0      1      0  53.1000           1         0           0           0           1
4            4       3  35.0      0      0   8.0500           0         1           0           0           1
data.info()

     

Output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 11 columns):
 # Column Non-Null Count Dtype 
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Pclass       891 non-null    int64  
 2   Age          891 non-null    float64
 3   SibSp        891 non-null    int64  
 4   Parch        891 non-null    int64  
 5   Fare         891 non-null    float64
 6   Sex_female   891 non-null    int64  
 7   Sex_male     891 non-null    int64  
 8   Embarked_C   891 non-null    int64  
 9   Embarked_Q   891 non-null    int64  
 10  Embarked_S   891 non-null    int64  
dtypes: float64(2), int64(9)
memory usage: 76.7 KB

     

1.4. Model building

The sklearn algorithm selection path:

[Figure: sklearn algorithm selection flowchart]

Split the dataset

# train_test_split is a function for splitting datasets
from sklearn.model_selection import train_test_split

     
# Usually we take out X and y first and then split; in some cases the unsplit data is used.
# Here X is the cleaned feature data and y is the 'Survived' column we want to predict.
X = data
y = train['Survived']

     
# Split the dataset (by default 75% train / 25% test; stratify=y keeps the class ratio equal in both parts)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

     
# Check the shapes of the splits
X_train.shape, X_test.shape

     

Output:

((668, 11), (223, 11))

     
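Since stratify=y was passed, the survival rate should come out nearly identical in the two splits. A minimal sketch of that check on hypothetical stand-in arrays with the same shapes as the Titanic data (on the real data, just compare y_train.mean() with y_test.mean()):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in arrays, for illustration only: same shape as the Titanic data,
# with labels drawn at roughly the Survived rate
rng = np.random.RandomState(0)
X = rng.rand(891, 11)
y = (rng.rand(891) < 0.38).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# stratify=y keeps the positive-class rate nearly identical in both splits
print(y_train.mean(), y_test.mean())
```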
X_train.info()

     

Output:

<class 'pandas.core.frame.DataFrame'>
Int64Index: 668 entries, 671 to 80
Data columns (total 11 columns):
 # Column Non-Null Count Dtype 
---  ------       --------------  -----  
 0   PassengerId  668 non-null    int64  
 1   Pclass       668 non-null    int64  
 2   Age          668 non-null    float64
 3   SibSp        668 non-null    int64  
 4   Parch        668 non-null    int64  
 5   Fare         668 non-null    float64
 6   Sex_female   668 non-null    int64  
 7   Sex_male     668 non-null    int64  
 8   Embarked_C   668 non-null    int64  
 9   Embarked_Q   668 non-null    int64  
 10  Embarked_S   668 non-null    int64  
dtypes: float64(2), int64(9)
memory usage: 82.6 KB

     
X_test.info()

     

Output:

<class 'pandas.core.frame.DataFrame'>
Int64Index: 223 entries, 288 to 633
Data columns (total 11 columns):
 # Column Non-Null Count Dtype 
---  ------       --------------  -----  
 0   PassengerId  223 non-null    int64  
 1   Pclass       223 non-null    int64  
 2   Age          223 non-null    float64
 3   SibSp        223 non-null    int64  
 4   Parch        223 non-null    int64  
 5   Fare         223 non-null    float64
 6   Sex_female   223 non-null    int64  
 7   Sex_male     223 non-null    int64  
 8   Embarked_C   223 non-null    int64  
 9   Embarked_Q   223 non-null    int64  
 10  Embarked_S   223 non-null    int64  
dtypes: float64(2), int64(9)
memory usage: 30.9 KB

     

1.5. Importing and fitting models

1.5.1. Logistic regression model with default parameters

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

     
lr = LogisticRegression()
lr.fit(X_train, y_train)

     

Output:

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

     
# Print the accuracy score on the training and test sets
print("Training set score: {:.2f}".format(lr.score(X_train, y_train)))
print("Testing set score: {:.2f}".format(lr.score(X_test, y_test)))

     
Training set score: 0.80
Testing set score: 0.79

     

1.5.2. Logistic regression model with adjusted parameters

lr2 = LogisticRegression(C=100)
lr2.fit(X_train, y_train)

     

Output:

LogisticRegression(C=100, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

     
print("Training set score: {:.2f}".format(lr2.score(X_train, y_train)))
print("Testing set score: {:.2f}".format(lr2.score(X_test, y_test)))

     

Output:

Training set score: 0.79
Testing set score: 0.78

     
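In LogisticRegression, C is the inverse of regularization strength, so C=100 means weaker regularization than the default C=1.0. A hedged sketch of sweeping C on synthetic stand-in data (make_classification is used here for self-containedness; on the real task you would reuse X_train/X_test from above):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with the same sample/feature counts as the Titanic set
X, y = make_classification(n_samples=891, n_features=11, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

scores = {}
for C in [0.01, 0.1, 1.0, 10.0, 100.0]:
    # Smaller C = stronger L2 regularization; max_iter raised to avoid convergence warnings
    model = LogisticRegression(C=C, max_iter=1000).fit(X_tr, y_tr)
    scores[C] = model.score(X_te, y_te)

for C, s in scores.items():
    print(f"C={C:>6}: test accuracy {s:.2f}")
```

On real data a sweep like this is usually wrapped in cross-validation, e.g. with GridSearchCV or LogisticRegressionCV.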

1.5.3. Random forest classification model with default parameters

rfc = RandomForestClassifier()
rfc.fit(X_train, y_train)

     

Output:

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)

     
print("Training set score: {:.2f}".format(rfc.score(X_train, y_train)))
print("Testing set score: {:.2f}".format(rfc.score(X_test, y_test)))

     

Output:

Training set score: 1.00
Testing set score: 0.82

A perfect training score alongside a clearly lower test score is a sign of overfitting; constraining the trees (for example with max_depth, as in the next section) narrows the gap.

     

1.5.4. Random forest classification model with adjusted parameters

rfc2 = RandomForestClassifier(n_estimators=100, max_depth=5)
rfc2.fit(X_train, y_train)

     

Output:

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=5, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)

     
print("Training set score: {:.2f}".format(rfc2.score(X_train, y_train)))
print("Testing set score: {:.2f}".format(rfc2.score(X_test, y_test)))

     

Output:

Training set score: 0.87
Testing set score: 0.81

     

1.6. Model prediction

Supervised models in sklearn generally provide a predict method, which outputs the predicted labels, and a predict_proba method, which outputs the probability of each label.

# Predict labels
pred = lr.predict(X_train)

     
# The result is an array of 0s and 1s
pred[:10]

     

Output:

array([0, 1, 1, 1, 0, 0, 1, 0, 1, 1])

     
# Predicted label probabilities
pred_proba = lr.predict_proba(X_train)

     
pred_proba[:10]

     

Output:

array([[0.60884602, 0.39115398],
       [0.17563455, 0.82436545],
       [0.40454114, 0.59545886],
       [0.1884778 , 0.8115222 ],
       [0.88013064, 0.11986936],
       [0.91411123, 0.08588877],
       [0.13260197, 0.86739803],
       [0.90571178, 0.09428822],
       [0.05273217, 0.94726783],
       [0.10924951, 0.89075049]])

     
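Each row of predict_proba is [P(class 0), P(class 1)] for one passenger, the two entries sum to 1, and predict simply picks the class with the larger probability. A minimal sketch of that relationship on hypothetical toy data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy one-feature data standing in for the Titanic features (hypothetical)
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])

lr = LogisticRegression().fit(X, y)
proba = lr.predict_proba(X)  # one [P(class 0), P(class 1)] row per sample

# predict returns the class whose probability column is larger
same = np.array_equal(lr.predict(X), lr.classes_[proba.argmax(axis=1)])
print("rows sum to 1:", bool(np.allclose(proba.sum(axis=1), 1.0)))
print("predict matches argmax of predict_proba:", same)
```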

2. Model evaluation

2.1. Cross validation

There are many kinds of cross-validation. The first is the simplest and the easiest to think of: divide the dataset into two parts, a training set and a test set.

However, this simple approach has two drawbacks:

1. The final model and parameter choices depend heavily on how the training and test sets happen to be divided.

2. Only part of the data is used to train the model, so the dataset is not fully exploited.

To address these problems, practitioners developed several refinements, the best known being K-fold cross-validation:

Instead of a single test set, every sample ends up in a test set exactly once; how many rounds that takes depends on the choice of K. For example, with K = 7, the steps of 7-fold cross-validation are:

1. Divide the whole dataset into 7 parts.

2. In turn, use each part once as the test set, train the model on the other 6 parts, and compute the MSE (or another score) on the held-out part.

3. Average the 7 scores to obtain the final estimate.

[Figure: K-fold cross-validation diagram]
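The three steps above can be written as an explicit loop. A minimal sketch on hypothetical synthetic regression data (the MSE wording in the text comes from the regression setting; the cross_val_score call below performs the same split-fit-score-average loop for classification):

```python
import numpy as np
from sklearn.model_selection import KFold

# Hypothetical synthetic regression data: 70 samples, 3 features
rng = np.random.RandomState(0)
X = rng.rand(70, 3)
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.randn(70)

mses = []
for train_idx, test_idx in KFold(n_splits=7, shuffle=True, random_state=0).split(X):
    # Step 2: fit on 6 parts (ordinary least squares as a stand-in model)...
    coef, *_ = np.linalg.lstsq(X[train_idx], y[train_idx], rcond=None)
    # ...and score the held-out part with MSE
    pred = X[test_idx] @ coef
    mses.append(np.mean((pred - y[test_idx]) ** 2))

# Step 3: average over the 7 folds
print(f"mean MSE over 7 folds: {np.mean(mses):.4f}")
```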

from sklearn.model_selection import cross_val_score

     
lr = LogisticRegression(C=100)
scores = cross_val_score(lr, X_train, y_train, cv=10)

     
# The k-fold cross-validation scores (one per fold)
scores

     

Output:

array([0.82089552, 0.74626866, 0.74626866, 0.7761194 , 0.88059701,
       0.8358209 , 0.76119403, 0.8358209 , 0.74242424, 0.75757576])

     
#  Average cross validation score 
print("Average cross-validation score: {:.2f}".format(scores.mean()))

     

Output:

Average cross-validation score: 0.79

     

2.2. Confusion matrix

A confusion matrix summarizes the results of a classifier. For k-class classification it is a k × k table that records the classifier's predictions against the true labels.

[Figure: confusion matrix layout]

The confusion matrix lives in sklearn's sklearn.metrics module.

It takes the true labels and the predicted labels as input.

Precision, recall, and the F-score can be obtained with the classification_report function.

As a quick read on model quality, look at how many samples fall on the main diagonal of the confusion matrix.

from sklearn.metrics import confusion_matrix

     
# Train the model
lr = LogisticRegression(C=100)
lr.fit(X_train, y_train)

     
LogisticRegression(C=100, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

     
# Predict on the training set
pred = lr.predict(X_train)

     
#  Confusion matrix 
confusion_matrix(y_train, pred)

     
array([[354,  58],
       [ 83, 173]])

     
# Classification report
from sklearn.metrics import classification_report

     
# Precision, recall, and f1-score
print(classification_report(y_train, pred))

     
              precision    recall  f1-score   support

           0       0.81      0.86      0.83       412
           1       0.75      0.68      0.71       256

    accuracy                           0.79       668
   macro avg       0.78      0.77      0.77       668
weighted avg       0.79      0.79      0.79       668
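The precision, recall, and f1 numbers for the positive class can be recomputed by hand from the confusion matrix entries shown earlier ([[354, 58], [83, 173]]); a sketch:

```python
# Entries of confusion_matrix(y_train, pred) shown earlier:
# rows = true labels (0, 1), columns = predicted labels (0, 1)
tn, fp, fn, tp = 354, 58, 83, 173

precision = tp / (tp + fp)   # of the passengers predicted 1, how many truly were 1
recall = tp / (tp + fn)      # of the passengers truly 1, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

This reproduces the 0.75 / 0.68 / 0.71 line for class 1 in the report above.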


     

2.3. ROC curve

The ROC curve originated in World War II with radar operators' judgments of radar signals. Each operator's task was to interpret the signals on the screen, but the radar technology of the day was noisy, so every blip had to be judged. Cautious operators tended to read any signal as an enemy bomber; jumpier ones tended to read it as a flock of birds. A set of metrics was urgently needed to summarize each operator's predictions and evaluate the reliability of the radar, and the earliest ROC analysis was born for that purpose. ROC curves have since been widely used in medicine and machine learning.

ROC stands for Receiver Operating Characteristic curve.

The ROC utilities are in sklearn's sklearn.metrics module.

The larger the area under the ROC curve, the better the classifier.
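That area is the AUC (area under the curve), computable with roc_auc_score. A minimal sketch on hypothetical toy labels and scores (the Titanic equivalent would pass y_test and lr.decision_function(X_test)):

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Toy true labels and decision scores (hypothetical, for illustration)
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

fpr, tpr, thresholds = roc_curve(y_true, y_score)
auc = roc_auc_score(y_true, y_score)
print(f"AUC = {auc:.2f}")  # 1.0 is a perfect ranking, 0.5 is random guessing
```

For this toy ranking, 3 of the 4 positive-negative pairs are ordered correctly, giving an AUC of 0.75.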

from sklearn.metrics import roc_curve

     
fpr, tpr, thresholds = roc_curve(y_test, lr.decision_function(X_test))
plt.plot(fpr, tpr, label="ROC Curve")
plt.xlabel("FPR")
plt.ylabel("TPR (recall)")
# Find the threshold closest to zero, the default cutoff of decision_function
close_zero = np.argmin(np.abs(thresholds))
plt.plot(fpr[close_zero], tpr[close_zero], 'o', markersize=10, label="threshold zero", fillstyle="none", c='k', mew=2)
plt.legend(loc=4)

     

[Figure: ROC curve with the point closest to threshold zero marked]

3. Reference material

[Machine learning] Cross-Validation explained in detail - Zhihu (zhihu.com)

 https://www.jianshu.com/p/2ca96fce7e81

Original site

Copyright notice
This article was created by [51CTO]; please keep the original link when reposting.
https://yzsam.com/2022/175/202206241233192759.html