Six supervised learning methods: classification of poisonous mushrooms
2022-06-23 22:18:00 【PIDA】
This is the third article in my Kaggle case-sharing series. The competition is: Mushroom Classification, Safe to eat or deadly poison?
The data comes from UCI: https://archive.ics.uci.edu/ml/datasets/mushroom
Kaggle source notebook: https://www.kaggle.com/nirajvermafcb/comparing-various-ml-models-roc-curve-comparison
Ranking
Here is the Kaggle leaderboard for this competition. The first-place notebook focuses on feature selection and barely uses this competition's own data, which I personally feel strays from the topic; the second place centers on Bayesian classification, which is beyond my current ability, so I will come back to it after learning Bayesian methods.
Therefore, I chose the third-place notebook to study. Its author compares the modeling and model-evaluation process of 6 supervised learning methods on this dataset.
Dataset
This dataset was donated by UCI to Kaggle. It contains 8124 samples in total: 6513 for training and 1611 for testing. Among them, 4208 samples (51.8%) are edible and 3916 (48.2%) are poisonous. Each sample is described by 22 attributes, such as shape, odor, and so on.
Poisoning from eating wild mushrooms happens from time to time. Mushrooms vary widely in shape, and non-experts cannot tell poisonous mushrooms from edible ones by appearance, form, or color; there is no simple rule for distinguishing them. To find out whether a mushroom is edible, mushrooms with different characteristic attributes must be collected and analyzed for toxicity.
By analyzing the 22 characteristic attributes of mushrooms, we can build a mushroom edibility model and better predict whether a mushroom is edible.
Here is the detailed data description shown on UCI:
Explanation of the attribute features:
Data EDA
Import data
import pandas as pd
import numpy as np
import plotly_express as px
from matplotlib import pyplot as plt
import seaborn as sns
# Ignore the warning
import warnings
warnings.filterwarnings('ignore')

The raw data has 8124 records and 23 columns, with no missing values.
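The loading step is not shown in this excerpt; a minimal sketch, assuming the Kaggle CSV is named mushrooms.csv (the file name is my assumption):

```python
# load the raw data and check its shape and missing values
data = pd.read_csv("mushrooms.csv")

print(data.shape)            # (8124, 23)
print(data.isnull().sum())   # no missing values in any column
```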
Poisonous vs. edible comparison
Count the number of poisonous and edible samples:
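The counting code is not shown; presumably something along these lines, assuming the label column is named class, as in the boxplot code later on:

```python
# count samples per class: e = edible, p = poisonous
data["class"].value_counts()
```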
Visual analysis
Cap color
First, let's look at cap color: the number of samples for each cap color.
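The excerpt does not show how the cap and cap_class frames used below are built; a sketch of one plausible construction (column and variable names follow the plotting code, the exact steps are my assumption):

```python
# counts per cap color, as a tidy frame with "color" and "number" columns
cap = (data["cap-color"]
       .value_counts()
       .rename_axis("color")
       .reset_index(name="number"))

# counts per (cap color, class) pair, for the grouped bar chart
cap_class = (data.groupby(["cap-color", "class"])
             .size()
             .reset_index(name="number")
             .rename(columns={"cap-color": "color"}))
```

The odor and odor_class frames used in the next subsection would be built the same way from the odor column.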
fig = px.bar(cap,x="color",
y="number",
color="number",
text="number",
color_continuous_scale="rainbow")
# fig.update_layout(text_position="outside")
fig.show()

Which cap colors are more often poisonous? Count the color distribution separately for poisonous and edible mushrooms:
fig = px.bar(cap_class,x="color",
y="number",
color="class",
text="number",
barmode="group",
)
fig.show()

Summary: for colors n, g, and e, the poisonous (p) cases are numerous.
Odor
Count the number of samples for each odor:
fig = px.bar(odor,
x="odor",
y="number",
color="number",
text="number",
color_continuous_scale="rainbow")
fig.show()

The above is for the data as a whole; below we split into poisonous and edible and continue the discussion:
fig = px.bar(odor_class,
x="odor",
y="number",
color="class",
text="number",
barmode="group",
)
fig.show()

Summary: from the two charts above we can see that odor f is the one most strongly associated with poisonous mushrooms.
Feature correlation
Plot the correlation coefficients between features as a heatmap to see the distribution:

# note: corr() only works on numeric columns, so this presumably runs after the label encoding shown below
corr = data.corr()
sns.heatmap(corr)
plt.show()
Feature Engineering
Feature conversion
The features in the original data are all text; we convert them to numeric values to make the subsequent analysis easier:
1. Before the conversion
2. Implement the conversion
from sklearn.preprocessing import LabelEncoder  # label encoding
labelencoder = LabelEncoder()

for col in data.columns:
    data[col] = labelencoder.fit_transform(data[col])

# after the conversion
data.head()

3. View the conversion result for some of the attributes
Data distribution
View the data distribution after the encoding:
ax = sns.boxplot(x='class',
y='stalk-color-above-ring',
data=data)
ax = sns.stripplot(x="class",
y='stalk-color-above-ring',
data=data,
jitter=True,
edgecolor="gray")
plt.title("Class w.r.t stalkcolor above ring",fontsize=12)
plt.show()

Separate features and labels
X = data.iloc[:, 1:23]  # features
y = data.iloc[:, 0]     # label
Data standardization
# normalization / standardization
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X = scaler.fit_transform(X)
X
Principal component analysis PCA
The PCA process
Not all of the 22 attributes in the raw data are necessarily useful; in other words, some attributes are related to each other and carry overlapping information. So we use principal component analysis to identify the key components first:
# 1. run PCA
from sklearn.decomposition import PCA

pca = PCA()
pca.fit_transform(X)

# 2. get the covariance matrix
covariance = pca.get_covariance()

# 3. get the variance explained by each component
explained_variance = pca.explained_variance_
explained_variance
Plot the explained variance of each principal component:
with plt.style.context("dark_background"):  # background
    plt.figure(figsize=(6, 4))  # size
    plt.bar(range(22),  # number of principal components
            explained_variance,  # variance values
            alpha=0.5,  # transparency
            align="center",
            label="individual explained variance"  # label
            )
    plt.ylabel('Explained variance ratio')  # axis names and legend
    plt.xlabel('Principal components')
    plt.legend(loc="best")
    plt.tight_layout()  # automatically adjust subplot parameters
    plt.show()

Conclusion: from the figure above, the last 4 principal components account for very little of the variance, while the first 17 components retain more than 90% of the variance and can be kept as the principal components.
Data distribution under 2 principal components
Next we project the data onto 2 principal components and run K-means clustering on it:
1. Distribution of the raw data under 2 principal components

N = data.values
pca = PCA(n_components=2)
x = pca.fit_transform(N)

plt.figure(figsize=(5, 5))
plt.scatter(x[:, 0], x[:, 1])
plt.show()

2. Distribution after the clustering:
from sklearn.cluster import KMeans
km = KMeans(n_clusters=2,random_state=5)
N = data.values # numpy Array form
X_clustered = km.fit_predict(N) # Modeling results 0-1
label_color_map = {0:"g", # The classification result is only 0 and 1, To mark
1:"y"}
label_color = [label_color_map[l] for l in X_clustered]
plt.figure(figsize=(5,5))
# x = pca.fit_transform(N)
plt.scatter(x[:,0],x[:,1], c=label_color)
plt.show()

Modeling with 17 principal components
One thing I did not fully understand here: there are 22 attributes in total, and the figure above singled out 4 low-variance components, so why is the modeling based on 17 principal components?
First, transform the data onto 17 principal components:
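The transformation code itself is not shown in the excerpt; a minimal sketch, assuming it reuses the standardized X from above (the variable name X_pca is mine):

```python
# project the standardized features onto the first 17 principal components
pca_17 = PCA(n_components=17)
X_pca = pca_17.fit_transform(X)
```

The train/test split below would then be applied to this transformed matrix (or to the standardized X; the excerpt does not make it explicit).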
Split the dataset: the ratio of training set to test set is 8:2.
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=4)
Here begins the detailed walkthrough of the 6 supervised learning methods:
Model 1: Logistic regression
from sklearn.linear_model import LogisticRegression  # logistic regression (classification)
from sklearn.model_selection import cross_val_score  # cross-validation score
from sklearn import metrics                          # model evaluation

# build the model
model_LR = LogisticRegression()
model_LR.fit(X_train, y_train)
See the specific prediction results:
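The prediction step is not shown in the excerpt; a sketch following the same pattern the notebook uses for the other models:

```python
# predicted probability of class 1, then threshold at 0.5 to get hard labels
y_prob = model_LR.predict_proba(X_test)[:, 1]
y_pred = np.where(y_prob > 0.5, 1, 0)
```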
# note: score(X_test, y_pred) compares the predictions with themselves, so it always returns 1.0;
# the real test accuracy would be model_LR.score(X_test, y_test)
model_LR.score(X_test, y_pred)
# result
1.0
Confusion matrix under logistic regression :
confusion_matrix = metrics.confusion_matrix(y_test, y_pred)
confusion_matrix
# result
array([[815, 30],
       [ 36, 744]])

The specific AUC value:
auc_roc = metrics.roc_auc_score(y_test, y_pred)  # true labels and predicted labels
auc_roc
# result
0.9591715976331362
True and false positive rates

from sklearn.metrics import roc_curve, auc

false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, y_prob)
roc_auc = auc(false_positive_rate, true_positive_rate)
roc_auc
# result
0.9903474434835382
ROC curve
import matplotlib.pyplot as plt
plt.figure(figsize=(10,10))
plt.title("ROC") # Receiver Operating Characteristic
plt.plot(false_positive_rate,
true_positive_rate,
color="red",
label="AUC = %0.2f"%roc_auc
)
plt.legend(loc="lower right")
plt.plot([0,1],[0,1],linestyle="--")
plt.axis("tight")
# True positive: predicted class is 1 (positive) and the prediction is correct (True)
plt.ylabel("True Positive Rate")
# False positive: predicted class is 1 (positive) but the prediction is wrong (False)
plt.xlabel("False Positive Rate")
plt.show()

Next the logistic regression model is tuned. The tuning mainly uses grid search to select the best parameters, and then the model is rebuilt. The grid search process:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn import metrics
# An unoptimized model
LR_model= LogisticRegression()
# Parameters to be determined
tuned_parameters = {"C": [0.001, 0.01, 0.1, 1, 10, 100, 1000],
                    "penalty": ['l1', 'l2']  # different regularization methods, to prevent overfitting
                                             # (note: 'l1' requires a solver such as 'liblinear' or 'saga' in newer scikit-learn)
                    }
# Grid search module
from sklearn.model_selection import GridSearchCV
# Add grid search function
LR = GridSearchCV(LR_model, tuned_parameters,cv=10)
# Search and then model
LR.fit(X_train, y_train)
# Determining parameters
print(LR.best_params_)
{'C': 100, 'penalty': 'l2'}

View the predictions after tuning:
Confusion matrix and AUC:
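The evaluation code for the tuned model is not shown in the excerpt; a sketch, using the fitted GridSearchCV object LR from above:

```python
# predictions from the tuned model, then confusion matrix and AUC
y_prob = LR.predict_proba(X_test)[:, 1]
y_pred = np.where(y_prob > 0.5, 1, 0)

print(metrics.confusion_matrix(y_test, y_pred))
print(metrics.roc_auc_score(y_test, y_pred))
```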
The ROC curve:
from sklearn.metrics import roc_curve, auc
false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, y_prob)
roc_auc = auc(false_positive_rate, true_positive_rate)
import matplotlib.pyplot as plt
plt.figure(figsize=(10,10))
plt.title("ROC") # Receiver Operating Characteristic
plt.plot(false_positive_rate,
true_positive_rate,
color="red",
label="AUC = %0.2f"%roc_auc
)
plt.legend(loc="lower right")
plt.plot([0,1],[0,1],linestyle="--")
plt.axis("tight")
# True positive: predicted class is 1 (positive) and the prediction is correct (True)
plt.ylabel("True Positive Rate")
# False positive: predicted class is 1 (positive) but the prediction is wrong (False)
plt.xlabel("False Positive Rate")
plt.show()

Model 2: Gaussian naive Bayes
Modeling

from sklearn.naive_bayes import GaussianNB

# build the model
model_naive = GaussianNB()
model_naive.fit(X_train, y_train)

# predicted probabilities, thresholded at 0.5
y_prob = model_naive.predict_proba(X_test)[:, 1]
y_pred = np.where(y_prob > 0.5, 1, 0)

# same caveat as with logistic regression: comparing the predictions with themselves always gives 1
model_naive.score(X_test, y_pred)
# result
1
The number of mismatches between the predicted and true values: 111.
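The counting step is not shown; presumably something like:

```python
# how many test samples the naive Bayes model gets wrong
(y_pred != y_test).sum()   # 111
```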
Cross validation
scores = cross_val_score(model_naive,
X,
y,
cv=10,
scoring="accuracy"
)
scores

Confusion matrix and AUC
True and false positive rates

# import the evaluation functions
from sklearn.metrics import roc_curve, auc

# evaluation metrics
false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, y_prob)

# area under the ROC curve
roc_auc = auc(false_positive_rate, true_positive_rate)
roc_auc
# result
0.9592201486876043
ROC curve
The AUC value is 0.96:
# mapping
import matplotlib.pyplot as plt
plt.figure(figsize=(10,10))
plt.title("ROC")
plt.plot(false_positive_rate,true_positive_rate,color="red",label="AUC=%0.2f"%roc_auc)
plt.legend(loc="lower right")
plt.plot([0,1],[0,1],linestyle='--')
plt.axis("tight")
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.show()

Model 3: Support vector machine (SVM)
Support vector machine process with default parameters
Modeling process
from sklearn.svm import SVC
svm_model = SVC()
tuned_parameters = {
    'C': [1, 10, 100, 500, 1000],
    'gamma': [1, 0.1, 0.01, 0.001, 0.0001],
    'kernel': ['rbf']   # a 'linear' kernel was also listed, but duplicate dict keys keep only the last value
}

Random parameter search: RandomizedSearchCV
from sklearn.model_selection import RandomizedSearchCV
# Establish a random search model
model_svm = RandomizedSearchCV(
svm_model, # Model to be searched
tuned_parameters, # Parameters
cv=10,                  # 10-fold cross-validation
scoring="accuracy", # Standard for evaluation
n_iter=20 # The number of iterations
)
# Training models
model_svm.fit(X_train, y_train)

RandomizedSearchCV(cv=10,
estimator=SVC(),
n_iter=20,
param_distributions={'C': [1, 10, 100, 500, 1000],
'gamma': [1, 0.1, 0.01, 0.001, 0.0001],
'kernel': ['rbf']},
scoring='accuracy')

# best score
print(model_svm.best_score_)
# 1.0
The best parameters found by the search:
# predict
y_pred = model_svm.predict(X_test)

# classification accuracy: compare the predictions with the true labels
metrics.accuracy_score(y_pred, y_test)
# result
1.0
Confusion matrix
See the specific confusion matrix and prediction :
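The confusion matrix itself is not shown in the excerpt; presumably just:

```python
# confusion matrix for the tuned SVM predictions
metrics.confusion_matrix(y_test, y_pred)
```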
ROC curve
from sklearn.metrics import roc_curve, auc
false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, y_pred)
roc_auc = auc(false_positive_rate, true_positive_rate)
import matplotlib.pyplot as plt
plt.figure(figsize=(10,10))
plt.title('ROC')
plt.plot(false_positive_rate,true_positive_rate, color='red',label = 'AUC = %0.2f' % roc_auc)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],linestyle='--')
plt.axis('tight')
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()

Model 4: Random forest
Modeling and fitting
from sklearn.ensemble import RandomForestClassifier

# modeling
model_RR = RandomForestClassifier()

# fitting
model_RR.fit(X_train, y_train)
Prediction score
Confusion matrix
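The prediction, scoring, and confusion-matrix code for the random forest is not shown in the excerpt; a sketch following the notebook's pattern (it also defines the y_prob used by the ROC code below):

```python
# predicted probabilities and hard labels for the random forest
y_prob = model_RR.predict_proba(X_test)[:, 1]
y_pred = np.where(y_prob > 0.5, 1, 0)

print(model_RR.score(X_test, y_test))            # accuracy on the test set
print(metrics.confusion_matrix(y_test, y_pred))  # confusion matrix
```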
ROC curve
from sklearn.metrics import roc_curve, auc
false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, y_prob)
roc_auc = auc(false_positive_rate, true_positive_rate)
roc_auc # 1
import matplotlib.pyplot as plt
plt.figure(figsize=(10,10))
plt.title('ROC')
plt.plot(false_positive_rate,true_positive_rate, color='red',label = 'AUC = %0.2f' % roc_auc)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],linestyle='--')
plt.axis('tight')
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()

Model 5: Decision tree (CART)
Modeling

from sklearn.tree import DecisionTreeClassifier

# modeling
model_tree = DecisionTreeClassifier()
model_tree.fit(X_train, y_train)

# predicted probabilities, converted into 0/1 classes
y_prob = model_tree.predict_proba(X_test)[:, 1]
y_pred = np.where(y_prob > 0.5, 1, 0)

# same caveat as before: comparing the predictions with themselves always gives 1
model_tree.score(X_test, y_pred)
# result
1
Confusion matrix
The various evaluation metrics:
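The code behind the metrics is not shown; one plausible way to produce a confusion matrix plus the usual precision/recall/F1 summary:

```python
from sklearn.metrics import classification_report

print(metrics.confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
```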
ROC curve
from sklearn.metrics import roc_curve, auc
false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, y_prob)
roc_auc = auc(false_positive_rate, true_positive_rate)
roc_auc # 1
import matplotlib.pyplot as plt
plt.figure(figsize=(10,10)) # canvas
plt.title('ROC') # title
plt.plot(false_positive_rate, # mapping
true_positive_rate,
color='red',
label = 'AUC = %0.2f' % roc_auc)
plt.legend(loc = 'lower right') # Legend location
plt.plot([0, 1], [0, 1],linestyle='--') # Positive proportional line
plt.axis('tight')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.show()

Model 6: Neural network (ANN)
Modeling
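The modeling code is not shown in this excerpt; a minimal sketch with default parameters, following the same pattern as the other models (the variable name model_ann is mine), which also defines the y_prob and y_pred used below:

```python
from sklearn.neural_network import MLPClassifier

# build and fit a multi-layer perceptron with default parameters
model_ann = MLPClassifier()
model_ann.fit(X_train, y_train)

# predicted probabilities and hard labels
y_prob = model_ann.predict_proba(X_test)[:, 1]
y_pred = np.where(y_prob > 0.5, 1, 0)
```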
Confusion matrix
ROC curve
# True false positive
from sklearn.metrics import roc_curve, auc
false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, y_prob)
roc_auc = auc(false_positive_rate, true_positive_rate)
roc_auc # 1
# draw ROC curve
import matplotlib.pyplot as plt
plt.figure(figsize=(10,10))
plt.title('ROC')
plt.plot(false_positive_rate,true_positive_rate, color='red',label = 'AUC = %0.2f' % roc_auc)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],linestyle='--')
plt.axis('tight')
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()

Next, the parameters of the neural network are tuned:
- hidden_layer_sizes: sizes of the hidden layers (number of neurons in each layer)
- activation: activation function
- alpha: L2 regularization strength (not the learning rate)
- max_iter: maximum number of iterations
Random search over these parameters
from sklearn.neural_network import MLPClassifier
# Instantiation
mlp_model = MLPClassifier()
# Parameters to be adjusted
tuned_parameters={'hidden_layer_sizes': range(1,200,10),
'activation': ['tanh','logistic','relu'],
'alpha':[0.0001,0.001,0.01,0.1,1,10],
'max_iter': range(50,200,50)
}
model_mlp= RandomizedSearchCV(mlp_model,
tuned_parameters,
cv=10,
scoring='accuracy',
n_iter=5,
n_jobs= -1,
random_state=5)
model_mlp.fit(X_train, y_train)

Model properties
The model's properties and the best parameters after tuning:
ROC curve
from sklearn.metrics import roc_curve, auc
false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, y_prob)
roc_auc = auc(false_positive_rate, true_positive_rate)
roc_auc # 1
import matplotlib.pyplot as plt
plt.figure(figsize=(10,10))
plt.title('ROC')
plt.plot(false_positive_rate,true_positive_rate, color='red',label = 'AUC = %0.2f' % roc_auc)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],linestyle='--')
plt.axis('tight')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.show()

Confusion matrix and ROC
Here is a good article explaining the confusion matrix and ROC: https://www.cnblogs.com/wuliytTaotao/p/9285227.html
1. What is a confusion matrix?
2. The four basic counts
TP, FP, TN, FN: the second letter indicates which class the sample is predicted as, and the first letter indicates whether that predicted class matches the true class.
3. Accuracy
4. Precision and recall
5. F1 and Fβ (a small worked example of items 2-5 follows this list)
6. ROC curve
AUC stands for Area Under Curve and denotes the area under a curve; the AUC of the ROC curve can be used to evaluate a model. The ROC curve is shown in Figure 1:
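To make items 2-5 concrete, here is a small worked example in Python, plugging in the four counts read off the logistic-regression confusion matrix shown earlier (treating class 1 as the positive class):

```python
# counts taken from the confusion matrix array([[815, 30], [36, 744]]):
# rows are the true class, columns are the predicted class, in class order 0, 1
TN, FP, FN, TP = 815, 30, 36, 744

accuracy  = (TP + TN) / (TP + TN + FP + FN)   # fraction of all predictions that are correct
precision = TP / (TP + FP)                    # of the samples predicted positive, how many really are
recall    = TP / (TP + FN)                    # of the truly positive samples, how many were found
f1        = 2 * precision * recall / (precision + recall)  # harmonic mean of precision and recall

print(accuracy, precision, recall, f1)
```

Fβ generalizes F1 by weighting recall β times as much as precision.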
Summary
After reading this notebook's source code, the main things to take away are:
- The overall workflow of machine-learning modeling: choosing a model, building it, tuning it with grid search, evaluating it, and (for classification) plotting the ROC curve
- Feature-engineering techniques: encoding conversion, data standardization, dataset splitting
- Evaluation metrics: the confusion matrix and the ROC curve, which the article linked above explains in detail
Notice: later on I (Peter) will write a dedicated article modeling and analyzing this dataset with purely original ideas; stay tuned~