Six supervised learning methods: classification of poisonous mushrooms
2022-06-23 22:18:00 【PIDA】
This is the third article in my Kaggle case-sharing series. The competition is: Mushroom Classification, Safe to eat or deadly poison?
The data comes from UCI: https://archive.ics.uci.edu/ml/datasets/mushroom
Kaggle source notebook: https://www.kaggle.com/nirajvermafcb/comparing-various-ml-models-roc-curve-comparison
Ranking
Here is the Kaggle leaderboard for this competition. The first-place notebook focuses on feature selection and barely uses this competition's own data, which I personally feel strays from the topic; the second place centers on Bayesian classification, which is beyond my current ability, so I will come back to it after learning Bayesian methods.
Therefore, I chose the third-place notebook to study. Its author compares the modeling and model-evaluation process of 6 supervised learning methods on this dataset.
Dataset
This dataset was donated by UCI to Kaggle. It contains 8124 samples in total: 6513 for training and 1611 for testing. Among them, 4208 samples (51.8%) are edible and 3916 (48.2%) are poisonous. Each sample is described by 22 attributes, such as shape, odor, and so on.
Poisoning from eating wild mushrooms happens from time to time. Mushrooms vary widely in shape, and non-experts cannot tell poisonous mushrooms from edible ones by appearance, form, or color; there is no simple rule for distinguishing them. To find out whether a mushroom is edible, mushrooms with different characteristic attributes must be collected and analyzed for toxicity.
By analyzing the 22 characteristic attributes of mushrooms, we can build a mushroom edibility model and better predict whether a mushroom is edible.
Here is the detailed data description shown on UCI:
Explanation of the attribute features:
Data EDA
Import data
import pandas as pd
import numpy as np
import plotly_express as px
from matplotlib import pyplot as plt
import seaborn as sns
# Ignore the warning
import warnings
warnings.filterwarnings('ignore')

The raw data has 8124 records and 23 columns, with no missing values.
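The loading step is not shown in this excerpt; a minimal sketch, assuming the Kaggle CSV is named mushrooms.csv (the file name is my assumption):

```python
# load the raw data and check its shape and missing values
data = pd.read_csv("mushrooms.csv")

print(data.shape)            # (8124, 23)
print(data.isnull().sum())   # no missing values in any column
```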
Poisonous vs. edible comparison
Count the number of poisonous and edible samples:
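The counting code is not shown; presumably something along these lines, assuming the label column is named class, as in the boxplot code later on:

```python
# count samples per class: e = edible, p = poisonous
data["class"].value_counts()
```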
Visual analysis
Cap color
First, let's look at cap color: the number of samples for each cap color.
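The excerpt does not show how the cap and cap_class frames used below are built; a sketch of one plausible construction (column and variable names follow the plotting code, the exact steps are my assumption):

```python
# counts per cap color, as a tidy frame with "color" and "number" columns
cap = (data["cap-color"]
       .value_counts()
       .rename_axis("color")
       .reset_index(name="number"))

# counts per (cap color, class) pair, for the grouped bar chart
cap_class = (data.groupby(["cap-color", "class"])
             .size()
             .reset_index(name="number")
             .rename(columns={"cap-color": "color"}))
```

The odor and odor_class frames used in the next subsection would be built the same way from the odor column.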
fig = px.bar(cap,x="color",
y="number",
color="number",
text="number",
color_continuous_scale="rainbow")
# fig.update_layout(text_position="outside")
fig.show()

Which cap colors are more often poisonous? Count the color distribution separately for poisonous and edible mushrooms:
fig = px.bar(cap_class,x="color",
y="number",
color="class",
text="number",
barmode="group",
)
fig.show()

Summary: for colors n, g, and e, the poisonous (p) cases are numerous.
Odor
Count the number of samples for each odor:
fig = px.bar(odor,
x="odor",
y="number",
color="number",
text="number",
color_continuous_scale="rainbow")
fig.show()

The above is for the data as a whole; below we split into poisonous and edible and continue the discussion:
fig = px.bar(odor_class,
x="odor",
y="number",
color="class",
text="number",
barmode="group",
)
fig.show()

Summary: from the two charts above we can see that odor f is the one most strongly associated with poisonous mushrooms.
Feature correlation
Plot the correlation coefficients between features as a heatmap to see the distribution:

# note: corr() only works on numeric columns, so this presumably runs after the label encoding shown below
corr = data.corr()
sns.heatmap(corr)
plt.show()
Feature Engineering
Feature conversion
The features in the original data are all text; we convert them to numeric values to make the subsequent analysis easier:
1. Before the conversion
2. Implement the conversion
from sklearn.preprocessing import LabelEncoder  # label encoding
labelencoder = LabelEncoder()

for col in data.columns:
    data[col] = labelencoder.fit_transform(data[col])

# after the conversion
data.head()

3. View the conversion result for some of the attributes
Data distribution
View the data distribution after the encoding:
ax = sns.boxplot(x='class',
y='stalk-color-above-ring',
data=data)
ax = sns.stripplot(x="class",
y='stalk-color-above-ring',
data=data,
jitter=True,
edgecolor="gray")
plt.title("Class w.r.t stalkcolor above ring",fontsize=12)
plt.show()

Separate features and labels
X = data.iloc[:, 1:23]  # features
y = data.iloc[:, 0]     # label
Data standardization
# normalization / standardization
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X = scaler.fit_transform(X)
X
Principal component analysis PCA
The PCA process
Not all of the 22 attributes in the raw data are necessarily useful; in other words, some attributes are related to each other and carry overlapping information. So we use principal component analysis to identify the key components first:
# 1. run PCA
from sklearn.decomposition import PCA

pca = PCA()
pca.fit_transform(X)

# 2. get the covariance matrix
covariance = pca.get_covariance()

# 3. get the variance explained by each component
explained_variance = pca.explained_variance_
explained_variance
Plot the explained variance of each principal component:
with plt.style.context("dark_background"):  # background
    plt.figure(figsize=(6, 4))  # size
    plt.bar(range(22),  # number of principal components
            explained_variance,  # variance values
            alpha=0.5,  # transparency
            align="center",
            label="individual explained variance"  # label
            )
    plt.ylabel('Explained variance ratio')  # axis names and legend
    plt.xlabel('Principal components')
    plt.legend(loc="best")
    plt.tight_layout()  # automatically adjust subplot parameters
    plt.show()

Conclusion: from the figure above, the last 4 principal components account for very little of the variance, while the first 17 components retain more than 90% of the variance and can be kept as the principal components.
Data distribution under 2 principal components
Next we project the data onto 2 principal components and run K-means clustering on it:
1. Distribution of the raw data under 2 principal components

N = data.values
pca = PCA(n_components=2)
x = pca.fit_transform(N)

plt.figure(figsize=(5, 5))
plt.scatter(x[:, 0], x[:, 1])
plt.show()

2. Distribution after the clustering:
from sklearn.cluster import KMeans
km = KMeans(n_clusters=2,random_state=5)
N = data.values # numpy Array form
X_clustered = km.fit_predict(N) # Modeling results 0-1
label_color_map = {0:"g", # The classification result is only 0 and 1, To mark
1:"y"}
label_color = [label_color_map[l] for l in X_clustered]
plt.figure(figsize=(5,5))
# x = pca.fit_transform(N)
plt.scatter(x[:,0],x[:,1], c=label_color)
plt.show()

Modeling with 17 principal components
One thing I did not fully understand here: there are 22 attributes in total, and the figure above singled out 4 low-variance components, so why is the modeling based on 17 principal components?
First, transform the data onto 17 principal components:
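The transformation code itself is not shown in the excerpt; a minimal sketch, assuming it reuses the standardized X from above (the variable name X_pca is mine):

```python
# project the standardized features onto the first 17 principal components
pca_17 = PCA(n_components=17)
X_pca = pca_17.fit_transform(X)
```

The train/test split below would then be applied to this transformed matrix (or to the standardized X; the excerpt does not make it explicit).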
Split the dataset: the ratio of training set to test set is 8:2.
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=4)
Here begins the detailed walkthrough of the 6 supervised learning methods:
Model 1: Logistic regression
from sklearn.linear_model import LogisticRegression  # logistic regression (classification)
from sklearn.model_selection import cross_val_score  # cross-validation score
from sklearn import metrics                          # model evaluation

# build the model
model_LR = LogisticRegression()
model_LR.fit(X_train, y_train)
See the specific prediction results:
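The prediction step is not shown in the excerpt; a sketch following the same pattern the notebook uses for the other models:

```python
# predicted probability of class 1, then threshold at 0.5 to get hard labels
y_prob = model_LR.predict_proba(X_test)[:, 1]
y_pred = np.where(y_prob > 0.5, 1, 0)
```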
# note: score(X_test, y_pred) compares the predictions with themselves, so it always returns 1.0;
# the real test accuracy would be model_LR.score(X_test, y_test)
model_LR.score(X_test, y_pred)
# result
1.0
Confusion matrix under logistic regression :
confusion_matrix = metrics.confusion_matrix(y_test, y_pred)
confusion_matrix
# result
array([[815, 30],
       [ 36, 744]])

The specific AUC value:
auc_roc = metrics.roc_auc_score(y_test, y_pred)  # true labels and predicted labels
auc_roc
# result
0.9591715976331362
True and false positive rates

from sklearn.metrics import roc_curve, auc

false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, y_prob)
roc_auc = auc(false_positive_rate, true_positive_rate)
roc_auc
# result
0.9903474434835382
ROC curve
import matplotlib.pyplot as plt
plt.figure(figsize=(10,10))
plt.title("ROC") # Receiver Operating Characteristic
plt.plot(false_positive_rate,
true_positive_rate,
color="red",
label="AUC = %0.2f"%roc_auc
)
plt.legend(loc="lower right")
plt.plot([0,1],[0,1],linestyle="--")
plt.axis("tight")
# True positive: predicted class is 1 (positive) and the prediction is correct (True)
plt.ylabel("True Positive Rate")
# False positive: predicted class is 1 (positive) but the prediction is wrong (False)
plt.xlabel("False Positive Rate")
plt.show()

Next the logistic regression model is tuned. The tuning mainly uses grid search to select the best parameters, and then the model is rebuilt. The grid search process:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn import metrics
# An unoptimized model
LR_model= LogisticRegression()
# Parameters to be determined
tuned_parameters = {"C": [0.001, 0.01, 0.1, 1, 10, 100, 1000],
                    "penalty": ['l1', 'l2']  # different regularization methods, to prevent overfitting
                                             # (note: 'l1' requires a solver such as 'liblinear' or 'saga' in newer scikit-learn)
                    }
# Grid search module
from sklearn.model_selection import GridSearchCV
# Add grid search function
LR = GridSearchCV(LR_model, tuned_parameters,cv=10)
# Search and then model
LR.fit(X_train, y_train)
# Determining parameters
print(LR.best_params_)
{'C': 100, 'penalty': 'l2'}

View the predictions after tuning:
Confusion matrix and AUC:
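The evaluation code for the tuned model is not shown in the excerpt; a sketch, using the fitted GridSearchCV object LR from above:

```python
# predictions from the tuned model, then confusion matrix and AUC
y_prob = LR.predict_proba(X_test)[:, 1]
y_pred = np.where(y_prob > 0.5, 1, 0)

print(metrics.confusion_matrix(y_test, y_pred))
print(metrics.roc_auc_score(y_test, y_pred))
```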
The ROC curve:
from sklearn.metrics import roc_curve, auc
false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, y_prob)
roc_auc = auc(false_positive_rate, true_positive_rate)
import matplotlib.pyplot as plt
plt.figure(figsize=(10,10))
plt.title("ROC") # Receiver Operating Characteristic
plt.plot(false_positive_rate,
true_positive_rate,
color="red",
label="AUC = %0.2f"%roc_auc
)
plt.legend(loc="lower right")
plt.plot([0,1],[0,1],linestyle="--")
plt.axis("tight")
# True positive: predicted class is 1 (positive) and the prediction is correct (True)
plt.ylabel("True Positive Rate")
# False positive: predicted class is 1 (positive) but the prediction is wrong (False)
plt.xlabel("False Positive Rate")
plt.show()

Model 2: Gaussian naive Bayes
Modeling

from sklearn.naive_bayes import GaussianNB

# build the model
model_naive = GaussianNB()
model_naive.fit(X_train, y_train)

# predicted probabilities, thresholded at 0.5
y_prob = model_naive.predict_proba(X_test)[:, 1]
y_pred = np.where(y_prob > 0.5, 1, 0)

# same caveat as with logistic regression: comparing the predictions with themselves always gives 1
model_naive.score(X_test, y_pred)
# result
1
The number of mismatches between the predicted and true values: 111.
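The counting step is not shown; presumably something like:

```python
# how many test samples the naive Bayes model gets wrong
(y_pred != y_test).sum()   # 111
```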
Cross validation
scores = cross_val_score(model_naive,
X,
y,
cv=10,
scoring="accuracy"
)
scores

Confusion matrix and AUC
True and false positive rates

# import the evaluation functions
from sklearn.metrics import roc_curve, auc

# evaluation metrics
false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, y_prob)

# area under the ROC curve
roc_auc = auc(false_positive_rate, true_positive_rate)
roc_auc
# result
0.9592201486876043
ROC curve
The AUC value is 0.96:
# mapping
import matplotlib.pyplot as plt
plt.figure(figsize=(10,10))
plt.title("ROC")
plt.plot(false_positive_rate,true_positive_rate,color="red",label="AUC=%0.2f"%roc_auc)
plt.legend(loc="lower right")
plt.plot([0,1],[0,1],linestyle='--')
plt.axis("tight")
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.show()

Model 3: Support vector machine (SVM)
Support vector machine process with default parameters
Modeling process
from sklearn.svm import SVC
svm_model = SVC()
tuned_parameters = {
    'C': [1, 10, 100, 500, 1000],
    'gamma': [1, 0.1, 0.01, 0.001, 0.0001],
    'kernel': ['rbf']   # a 'linear' kernel was also listed, but duplicate dict keys keep only the last value
}

Random parameter search: RandomizedSearchCV
from sklearn.model_selection import RandomizedSearchCV
# Establish a random search model
model_svm = RandomizedSearchCV(
svm_model, # Model to be searched
tuned_parameters, # Parameters
cv=10,                  # 10-fold cross-validation
scoring="accuracy", # Standard for evaluation
n_iter=20 # The number of iterations
)
# Training models
model_svm.fit(X_train, y_train)

RandomizedSearchCV(cv=10,
estimator=SVC(),
n_iter=20,
param_distributions={'C': [1, 10, 100, 500, 1000],
'gamma': [1, 0.1, 0.01, 0.001, 0.0001],
'kernel': ['rbf']},
scoring='accuracy')

# best score
print(model_svm.best_score_)
# 1.0
The best parameters found by the search:
# predict
y_pred = model_svm.predict(X_test)

# classification accuracy: compare the predictions with the true labels
metrics.accuracy_score(y_pred, y_test)
# result
1.0
Confusion matrix
See the specific confusion matrix and prediction :
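The confusion matrix itself is not shown in the excerpt; presumably just:

```python
# confusion matrix for the tuned SVM predictions
metrics.confusion_matrix(y_test, y_pred)
```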
ROC curve
from sklearn.metrics import roc_curve, auc
false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, y_pred)
roc_auc = auc(false_positive_rate, true_positive_rate)
import matplotlib.pyplot as plt
plt.figure(figsize=(10,10))
plt.title('ROC')
plt.plot(false_positive_rate,true_positive_rate, color='red',label = 'AUC = %0.2f' % roc_auc)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],linestyle='--')
plt.axis('tight')
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()

Model 4: Random forest
Modeling and fitting
from sklearn.ensemble import RandomForestClassifier

# modeling
model_RR = RandomForestClassifier()

# fitting
model_RR.fit(X_train, y_train)
Prediction score
Confusion matrix
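The prediction, scoring, and confusion-matrix code for the random forest is not shown in the excerpt; a sketch following the notebook's pattern (it also defines the y_prob used by the ROC code below):

```python
# predicted probabilities and hard labels for the random forest
y_prob = model_RR.predict_proba(X_test)[:, 1]
y_pred = np.where(y_prob > 0.5, 1, 0)

print(model_RR.score(X_test, y_test))            # accuracy on the test set
print(metrics.confusion_matrix(y_test, y_pred))  # confusion matrix
```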
ROC curve
from sklearn.metrics import roc_curve, auc
false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, y_prob)
roc_auc = auc(false_positive_rate, true_positive_rate)
roc_auc # 1
import matplotlib.pyplot as plt
plt.figure(figsize=(10,10))
plt.title('ROC')
plt.plot(false_positive_rate,true_positive_rate, color='red',label = 'AUC = %0.2f' % roc_auc)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],linestyle='--')
plt.axis('tight')
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()

Model 5: Decision tree (CART)
Modeling

from sklearn.tree import DecisionTreeClassifier

# modeling
model_tree = DecisionTreeClassifier()
model_tree.fit(X_train, y_train)

# predicted probabilities, converted into 0/1 classes
y_prob = model_tree.predict_proba(X_test)[:, 1]
y_pred = np.where(y_prob > 0.5, 1, 0)

# same caveat as before: comparing the predictions with themselves always gives 1
model_tree.score(X_test, y_pred)
# result
1
Confusion matrix
The various evaluation metrics:
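The code behind the metrics is not shown; one plausible way to produce a confusion matrix plus the usual precision/recall/F1 summary:

```python
from sklearn.metrics import classification_report

print(metrics.confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
```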
ROC curve
from sklearn.metrics import roc_curve, auc
false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, y_prob)
roc_auc = auc(false_positive_rate, true_positive_rate)
roc_auc # 1
import matplotlib.pyplot as plt
plt.figure(figsize=(10,10)) # canvas
plt.title('ROC') # title
plt.plot(false_positive_rate, # mapping
true_positive_rate,
color='red',
label = 'AUC = %0.2f' % roc_auc)
plt.legend(loc = 'lower right') # Legend location
plt.plot([0, 1], [0, 1],linestyle='--') # Positive proportional line
plt.axis('tight')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.show()

Model 6: Neural network (ANN)
Modeling
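The modeling code is not shown in this excerpt; a minimal sketch with default parameters, following the same pattern as the other models (the variable name model_ann is mine), which also defines the y_prob and y_pred used below:

```python
from sklearn.neural_network import MLPClassifier

# build and fit a multi-layer perceptron with default parameters
model_ann = MLPClassifier()
model_ann.fit(X_train, y_train)

# predicted probabilities and hard labels
y_prob = model_ann.predict_proba(X_test)[:, 1]
y_pred = np.where(y_prob > 0.5, 1, 0)
```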
Confusion matrix
ROC curve
# True false positive
from sklearn.metrics import roc_curve, auc
false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, y_prob)
roc_auc = auc(false_positive_rate, true_positive_rate)
roc_auc # 1
# draw ROC curve
import matplotlib.pyplot as plt
plt.figure(figsize=(10,10))
plt.title('ROC')
plt.plot(false_positive_rate,true_positive_rate, color='red',label = 'AUC = %0.2f' % roc_auc)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],linestyle='--')
plt.axis('tight')
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()

Next, the parameters of the neural network are tuned:
- hidden_layer_sizes: sizes of the hidden layers (number of neurons in each layer)
- activation: activation function
- alpha: L2 regularization strength (not the learning rate)
- max_iter: maximum number of iterations
Random search over these parameters
from sklearn.neural_network import MLPClassifier
# Instantiation
mlp_model = MLPClassifier()
# Parameters to be adjusted
tuned_parameters={'hidden_layer_sizes': range(1,200,10),
'activation': ['tanh','logistic','relu'],
'alpha':[0.0001,0.001,0.01,0.1,1,10],
'max_iter': range(50,200,50)
}
model_mlp= RandomizedSearchCV(mlp_model,
tuned_parameters,
cv=10,
scoring='accuracy',
n_iter=5,
n_jobs= -1,
random_state=5)
model_mlp.fit(X_train, y_train)

Model properties
The model's properties and the best parameters after tuning:
ROC curve
from sklearn.metrics import roc_curve, auc
false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, y_prob)
roc_auc = auc(false_positive_rate, true_positive_rate)
roc_auc # 1
import matplotlib.pyplot as plt
plt.figure(figsize=(10,10))
plt.title('ROC')
plt.plot(false_positive_rate,true_positive_rate, color='red',label = 'AUC = %0.2f' % roc_auc)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],linestyle='--')
plt.axis('tight')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.show()

Confusion matrix and ROC
Here is a good article explaining the confusion matrix and ROC: https://www.cnblogs.com/wuliytTaotao/p/9285227.html
1. What is a confusion matrix?
2. The four basic counts
TP, FP, TN, FN: the second letter indicates which class the sample is predicted as, and the first letter indicates whether that predicted class matches the true class.
3. Accuracy
4. Precision and recall
5. F1 and Fβ (a small worked example of items 2-5 follows this list)
6. ROC curve
AUC stands for Area Under Curve and denotes the area under a curve; the AUC of the ROC curve can be used to evaluate a model. The ROC curve is shown in Figure 1:
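To make items 2-5 concrete, here is a small worked example in Python, plugging in the four counts read off the logistic-regression confusion matrix shown earlier (treating class 1 as the positive class):

```python
# counts taken from the confusion matrix array([[815, 30], [36, 744]]):
# rows are the true class, columns are the predicted class, in class order 0, 1
TN, FP, FN, TP = 815, 30, 36, 744

accuracy  = (TP + TN) / (TP + TN + FP + FN)   # fraction of all predictions that are correct
precision = TP / (TP + FP)                    # of the samples predicted positive, how many really are
recall    = TP / (TP + FN)                    # of the truly positive samples, how many were found
f1        = 2 * precision * recall / (precision + recall)  # harmonic mean of precision and recall

print(accuracy, precision, recall, f1)
```

Fβ generalizes F1 by weighting recall β times as much as precision.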
Summary
After reading this notebook's source code, the main things to take away are:
- The overall workflow of machine-learning modeling: choosing a model, building it, tuning it with grid search, evaluating it, and (for classification) plotting the ROC curve
- Feature-engineering techniques: encoding conversion, data standardization, dataset splitting
- Evaluation metrics: the confusion matrix and the ROC curve, which the article linked above explains in detail
Notice: later on I (Peter) will write a dedicated article modeling and analyzing this dataset with purely original ideas; stay tuned~