A quick review of soft voting and hard voting in ensemble methods
An ensemble method combines the results of two or more separate machine learning algorithms and attempts to produce results that are more accurate than any single algorithm.
In soft voting, the probabilities for each class are averaged to produce the result. For example, if algorithm 1 predicts that an object is a rock with 40% probability and algorithm 2 predicts it is a rock with 80% probability, the ensemble predicts that the object is a rock with (80 + 40) / 2 = 60% probability.
In hard voting, each algorithm's prediction counts as one vote, and the class with the most votes is selected. For example, if three algorithms predict the colour of a particular wine as "white", "white" and "red", the ensemble predicts "white".
The simplest summary is: soft voting averages probabilities, while hard voting takes a majority vote over the predicted labels.
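As a minimal sketch, here is what the two toy examples above look like in code (the numbers come from the text, not from real models):

import numpy as np
from statistics import mode

# Soft voting: average the per-class probabilities of the two algorithms
print(np.mean([0.40, 0.80]))            # ≈ 0.6 -> the ensemble says "rock" with 60% probability

# Hard voting: take the majority label across the three algorithms
print(mode(["white", "white", "red"]))  # white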
Generate test data
Now let's start writing the code. First, import some libraries and define a few configuration constants:
import pandas as pd
import numpy as np
import copy as cp
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from typing import Tuple
from statistics import mode
from sklearn.ensemble import VotingClassifier
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier
from xgboost import XGBClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from lightgbm import LGBMClassifier

RANDOM_STATE : int = 42
N_SAMPLES : int = 10000
N_FEATURES : int = 25
N_CLASSES : int = 3
N_CLUSTERS_PER_CLASS : int = 2
FEATURE_NAME_PREFIX : str = "Feature"
TARGET_NAME : str = "Target"
N_SPLITS : int = 5

np.set_printoptions(suppress=True)
We also need some data to classify. The make_classification_dataframe function creates test data containing features and a target.
Here the number of classes is set to 3 so that the soft and hard voting implementations work for multi-class problems (more than 2 classes); the same code also handles binary classification.
def make_classification_dataframe(n_samples : int = 10000, n_features : int = 25, n_classes : int = 2, n_clusters_per_class : int = 2, feature_name_prefix : str = "Feature", target_name : str = "Target", random_state : int = 42) -> pd.DataFrame:
    X, y = make_classification(n_samples=n_samples, n_features=n_features, n_classes=n_classes, n_informative=n_classes*n_clusters_per_class, random_state=random_state)
    feature_names = [feature_name_prefix + " " + str(v) for v in np.arange(1, n_features+1)]
    return pd.concat([pd.DataFrame(X, columns=feature_names), pd.DataFrame(y, columns=[target_name])], axis=1)
df_data = make_classification_dataframe(n_samples=N_SAMPLES, n_features=N_FEATURES, n_classes=N_CLASSES, n_clusters_per_class=N_CLUSTERS_PER_CLASS, feature_name_prefix=FEATURE_NAME_PREFIX, target_name=TARGET_NAME, random_state=RANDOM_STATE)
X = df_data.drop([TARGET_NAME], axis=1).to_numpy()
y = df_data[TARGET_NAME].to_numpy()
df_data.head()
df_data.head() shows the first few rows of the generated features and target.
Cross validation
We use cross-validation instead of train_test_split because it provides a more robust evaluation of algorithm performance.
The cross_val_predict helper function provides the code to do this:
def cross_val_predict(model, kfold : KFold, X : np.array, y : np.array) -> Tuple[np.array, np.array, np.array]:
    model_ = cp.deepcopy(model)
    no_classes = len(np.unique(y))
    actual_classes = np.empty([0], dtype=int)
    predicted_classes = np.empty([0], dtype=int)
    predicted_proba = np.empty([0, no_classes])
    for train_ndx, test_ndx in kfold.split(X):
        train_X, train_y, test_X, test_y = X[train_ndx], y[train_ndx], X[test_ndx], y[test_ndx]
        actual_classes = np.append(actual_classes, test_y)
        model_.fit(train_X, train_y)
        predicted_classes = np.append(predicted_classes, model_.predict(test_X))
        try:
            predicted_proba = np.append(predicted_proba, model_.predict_proba(test_X), axis=0)
        except:
            predicted_proba = np.append(predicted_proba, np.zeros((len(test_X), no_classes), dtype=float), axis=0)
    return actual_classes, predicted_classes, predicted_proba
The call to predict_proba is wrapped in a try block because not all algorithms support probabilities, and there is no consistent warning or error type that can be caught explicitly.
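As a small illustration of the problem (a toy check that is not part of the article's pipeline), an SVC created without probability=True cannot produce class probabilities:

from sklearn.datasets import make_classification
from sklearn.svm import SVC

X_demo, y_demo = make_classification(n_samples=100, random_state=0)
clf = SVC(probability=False).fit(X_demo, y_demo)   # probabilities disabled
try:
    clf.predict_proba(X_demo)
except Exception as exc:                           # the exact exception type varies by version
    print(type(exc).__name__)                      # e.g. AttributeError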
Before going further, let's take a quick look at cross_val_predict for a single algorithm:
lr = LogisticRegression(random_state=RANDOM_STATE)
kfold = KFold(n_splits=N_SPLITS, random_state=RANDOM_STATE, shuffle=True)
%time actual, lr_predicted, lr_predicted_proba = cross_val_predict(lr, kfold, X, y)
print(f"Accuracy of Logistic Regression: {accuracy_score(actual, lr_predicted)}")
lr_predicted
Wall time: 309 ms
Accuracy of Logistic Regression: 0.6821
array([0, 0, 1, ..., 0, 2, 1])
The cross_val_predict function returns both the probabilities and the predicted classes, and the predicted classes are shown in the cell output above.
The first few data points are predicted to belong to classes 0, 0, 1 and so on.
Predictions from multiple classifiers
The next step is to generate predictions and probabilities from several classifiers. The algorithms chosen here are random forest, XGBoost and extremely randomized trees.
def cross_val_predict_all_classifiers(classifiers : dict) -> Tuple[np.array, np.array]:
    predictions = [None] * len(classifiers)
    predicted_probas = [None] * len(classifiers)
    for i, (name, classifier) in enumerate(classifiers.items()):
        %time actual, predictions[i], predicted_probas[i] = cross_val_predict(classifier, kfold, X, y)
        print(f"Accuracy of {name}: {accuracy_score(actual, predictions[i])}")
    return actual, predictions, predicted_probas

classifiers = dict()
classifiers["Random Forest"] = RandomForestClassifier(random_state=RANDOM_STATE)
classifiers["XG Boost"] = XGBClassifier(use_label_encoder=False, eval_metric='logloss', random_state=RANDOM_STATE)
classifiers["Extra Random Trees"] = ExtraTreesClassifier(random_state=RANDOM_STATE)

actual, predictions, predicted_probas = cross_val_predict_all_classifiers(classifiers)
Wall time: 17.1 s
Accuracy of Random Forest: 0.8742
Wall time: 24.6 s
Accuracy of XG Boost: 0.8838
Wall time: 6.2 s
Accuracy of Extra Random Trees: 0.8754
The predictions variable is a list holding each algorithm's array of predicted classes:
[array([2, 0, 0, ..., 0, 2, 1]),
array([2, 0, 2, ..., 0, 2, 1], dtype=int64),
array([2, 0, 0, ..., 0, 2, 1])]
predicted_probas is also a list, but it contains the predicted probabilities for each target. Each array has shape (10000, 3), where:
- 10,000 is the number of data points in the sample dataset; each array has one row per data point
- 3 is the number of classes (our target has 3 classes, i.e. this is not a binary classifier)
[array([[0.17, 0.02, 0.81],
[0.58, 0.07, 0.35],
[0.54, 0.1 , 0.36],
...,
[0.46, 0.08, 0.46],
[0.15, 0. , 0.85],
[0.01, 0.97, 0.02]]),
array([[0.05611309, 0.00085733, 0.94302952],
[0.95303732, 0.00187497, 0.04508775],
[0.4653917 , 0.01353438, 0.52107394],
...,
[0.75208634, 0.0398241 , 0.20808953],
[0.02066649, 0.00156501, 0.97776848],
[0.00079027, 0.99868006, 0.00052966]]),
array([[0.33, 0.02, 0.65],
[0.54, 0.14, 0.32],
[0.51, 0.17, 0.32],
...,
[0.52, 0.06, 0.42],
[0.1 , 0.03, 0.87],
[0.05, 0.93, 0.02]])]
The first row of the output above can be read as follows: for the first algorithm's prediction on the first data point (i.e. the first row of the DataFrame), there is a 17% probability that it belongs to class 0, a 2% probability that it belongs to class 1 and an 81% probability that it belongs to class 2 (the three probabilities add up to 100%).
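As a quick sanity check (assuming the variables from the code above are still in scope), we can confirm this reading of the first row:

first_row = predicted_probas[0][0]       # first classifier, first data point
print(first_row)                         # e.g. [0.17 0.02 0.81]
print(first_row.argmax())                # 2 -> class 2 is the most likely
print(np.isclose(first_row.sum(), 1.0))  # the three probabilities sum to ~1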
Soft voting and hard voting
Now we get to the topic of this article: soft voting and hard voting can each be implemented in just a few lines of Python.
def soft_voting(predicted_probas : list) -> np.array:
    sv_predicted_proba = np.mean(predicted_probas, axis=0)
    sv_predicted_proba[:,-1] = 1 - np.sum(sv_predicted_proba[:,:-1], axis=1)
    return sv_predicted_proba, sv_predicted_proba.argmax(axis=1)

def hard_voting(predictions : list) -> np.array:
    return [mode(v) for v in np.array(predictions).T]

sv_predicted_proba, sv_predictions = soft_voting(predicted_probas)
hv_predictions = hard_voting(predictions)

for i, (name, classifier) in enumerate(classifiers.items()):
    print(f"Accuracy of {name}: {accuracy_score(actual, predictions[i])}")
print(f"Accuracy of Soft Voting: {accuracy_score(actual, sv_predictions)}")
print(f"Accuracy of Hard Voting: {accuracy_score(actual, hv_predictions)}")
Accuracy of Random Forest: 0.8742
Accuracy of XG Boost: 0.8838
Accuracy of Extra Random Trees: 0.8754
Accuracy of Soft Voting: 0.8868
Accuracy of Hard Voting: 0.881
As the output shows, soft voting is 0.3% more accurate than the best-performing single algorithm (88.68% vs 88.38%), while hard voting is slightly worse (88.10% vs 88.38%). Both mechanisms are explained in detail below.
Implementation of the voting algorithms
Soft voting
sv_predicted_proba = np.mean(predicted_probas, axis=0)
sv_predicted_proba
array([[0.18537103, 0.01361911, 0.80100984],
[0.69101244, 0.07062499, 0.23836258],
[0.50513057, 0.09451146, 0.40035798],
...,
[0.57736211, 0.05994137, 0.36269651],
[0.09022216, 0.01052167, 0.89925616],
[0.02026342, 0.96622669, 0.01350989]])
numpy's mean function averages element-wise across the three classifiers' probability arrays (axis 0 of the stacked list). In theory this should be all there is to soft voting, since it produces the mean of the three sets of output probabilities, and the result looks right.
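To make the averaging concrete, here is a toy illustration with made-up probabilities for three classifiers, two data points and three classes (not the real output above):

p1 = np.array([[0.2, 0.3, 0.5], [0.6, 0.3, 0.1]])
p2 = np.array([[0.1, 0.4, 0.5], [0.8, 0.1, 0.1]])
p3 = np.array([[0.3, 0.2, 0.5], [0.7, 0.2, 0.1]])
print(np.mean([p1, p2, p3], axis=0))
# [[0.2 0.3 0.5]
#  [0.7 0.2 0.1]]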
print(np.sum(sv_predicted_proba[0]))
sv_predicted_proba[0]
0.9999999826153119
array([0.18537103, 0.01361911, 0.80100984])
However, because of rounding errors, the values in a row do not always add up to exactly 1, even though each data point's probabilities across the three classes should sum to 1.
If we only use the probabilities to pick the top class (for example with a top-k or argmax selection), this error has no effect. But sometimes downstream processing requires the probabilities to sum to exactly 1, in which case a simple fix is enough: set the value in the last column to 1 minus the sum of the values in the other columns.
sv_predicted_proba[:,-1] = 1-np.sum(sv_predicted_proba[:,:-1], axis=1)
1.0
array([0.18537103, 0.01361911, 0.80100986])
Now the data is consistent: the probabilities in each row add up to 1, just as they should.
Next, numpy's argmax function is used to pick the class with the highest probability as the prediction (i.e. for each row, whether soft voting predicts class 0, 1 or 2).
sv_predicted_proba.argmax(axis=1)
array([2, 0, 0, ..., 0, 2, 1], dtype=int64)
The argmax function selects the index of the maximum value along the axis given by the axis parameter, so it selects 2 for the first row, 0 for the second row, 0 for the third row, and so on.
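As a toy illustration with made-up values:

toy = np.array([[0.2, 0.3, 0.5],
                [0.7, 0.2, 0.1]])
print(toy.argmax(axis=1))   # [2 0] -> class 2 for the first row, class 0 for the second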
Hard voting
hv_predicted = [mode(v) for v in np.array(predictions).T]
Here np.array(predictions).T simply transposes the array, turning its shape from (3, 10000) into (10000, 3) so that each row holds the three algorithms' votes for one data point.
print(np.array(predictions).shape)
np.array(predictions).T
(3, 10000)
array([[2, 2, 2],
[0, 0, 0],
[0, 2, 0],
...,
[0, 0, 0],
[2, 2, 2],
[1, 1, 1]], dtype=int64)
The list comprehension then takes each element (row) and applies statistics.mode to it, selecting the class that received the most votes from the algorithms:
np.array(hv_predicted)
array([2, 0, 0, ..., 0, 2, 1], dtype=int64)
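As a toy check of how mode behaves on a single row of votes (hypothetical values):

print(mode([2, 2, 0]))   # 2 -> two of the three classifiers voted for class 2
print(mode([0, 0, 0]))   # 0 -> unanimous vote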
Using Scikit-Learn
Writing the code from scratch helps us understand how the voting mechanisms work, but there is no need to reinvent the wheel: scikit-learn's VotingClassifier can do all of this for us.
estimators = list(classifiers.items())
vc_sv = VotingClassifier(estimators=estimators, voting="soft")
vc_hv = VotingClassifier(estimators=estimators, voting="hard")
%time actual, vc_sv_predicted, vc_sv_predicted_proba = cross_val_predict(vc_sv, kfold, X, y)
%time actual, vc_hv_predicted, _ = cross_val_predict(vc_hv, kfold, X, y)
print(f"Accuracy of SciKit-Learn Soft Voting: {accuracy_score(actual, vc_sv_predicted)}")
print(f"Accuracy of SciKit-Learn Hard Voting: {accuracy_score(actual, vc_hv_predicted)}")
Wall time: 1min 4s
Wall time: 55.3 s
Accuracy of SciKit-Learn Soft Voting: 0.8868
Accuracy of SciKit-Learn Hard Voting: 0.881
The scikit-learn implementation produces exactly the same results as our hand-written algorithms: 88.68% accuracy for soft voting and 88.1% for hard voting.
Why does hard voting perform worse?
Consider the following situation, using binary classification as an example.

Suppose five models predict class B with probabilities 0.01, 0.51, 0.6, 0.1 and 0.7, so the corresponding probabilities for class A are 0.99, 0.49, 0.4, 0.9 and 0.3.

With hard voting, B gets 3 votes and A gets 2, so the result is B.

With soft voting:
B: (0.01 + 0.51 + 0.6 + 0.1 + 0.7) / 5 ≈ 0.38
A: (0.99 + 0.49 + 0.4 + 0.9 + 0.3) / 5 ≈ 0.62
so the result is A.

The hard voting outcome is decided by models whose probabilities are only slightly in favour of B (0.51, 0.6, 0.7), while the soft voting outcome is driven by the models that are highly confident about A (0.99, 0.9). Soft voting effectively gives more weight to confident models, which is why it usually performs better than hard voting.
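Here is a minimal sketch of this toy example in code (the probabilities are the made-up numbers from the text):

import numpy as np
from statistics import mode

p_b = np.array([0.01, 0.51, 0.6, 0.1, 0.7])     # each model's probability of class B
p_a = 1 - p_b                                   # and therefore of class A

hard_votes = ["B" if p > 0.5 else "A" for p in p_b]
print(mode(hard_votes))                         # B  (three of the five models vote B)

print(round(p_b.mean(), 3), round(p_a.mean(), 3))   # 0.384 0.616
print("A" if p_a.mean() > p_b.mean() else "B")      # A  (soft voting picks A)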
How much can ensemble learning improve things?
Let's see how much improvement ensemble learning can deliver on the accuracy metric. We'll use 6 common algorithms and see how much performance we can squeeze out of the ensemble...
classifiers = dict()
classifiers["Random Forest"] = RandomForestClassifier(random_state=RANDOM_STATE)
classifiers["XG Boost"] = XGBClassifier(use_label_encoder=False, eval_metric='logloss', random_state=RANDOM_STATE)
classifiers["Extra Random Trees"] = ExtraTreesClassifier(random_state=RANDOM_STATE)
classifiers['Neural Network'] = MLPClassifier(max_iter=1000, random_state=RANDOM_STATE)
classifiers['Support Vector Machine'] = SVC(probability=True, random_state=RANDOM_STATE)
classifiers['Light GBM'] = LGBMClassifier(random_state=RANDOM_STATE)
estimators = list(classifiers.items())
Method 1: Using the hand-written code
%%time
actual, predictions, predicted_probas = cross_val_predict_all_classifiers(classifiers) # Get a collection of predictions and probabilities from the selected algorithms
sv_predicted_proba, sv_predictions = soft_voting(predicted_probas) # Combine those collections into a single set of predictions
hv_predictions = hard_voting(predictions)
print(f"Accuracy of Soft Voting: {accuracy_score(actual, sv_predictions)}")
print(f"Accuracy of Hard Voting: {accuracy_score(actual, hv_predictions)}")
Wall time: 14.9 s
Accuracy of Random Forest: 0.8742
Wall time: 32.8 s
Accuracy of XG Boost: 0.8838
Wall time: 5.78 s
Accuracy of Extra Random Trees: 0.8754
Wall time: 3min 2s
Accuracy of Neural Network: 0.8612
Wall time: 36.2 s
Accuracy of Support Vector Machine: 0.8674
Wall time: 1.65 s
Accuracy of Light GBM: 0.8828
Accuracy of Soft Voting: 0.8914
Accuracy of Hard Voting: 0.8851
Wall time: 4min 34s
Method 2: Using SciKit-Learn and cross_val_predict
%%time
vc_sv = VotingClassifier(estimators=estimators, voting="soft")
vc_hv = VotingClassifier(estimators=estimators, voting="hard")
%time actual, vc_sv_predicted, vc_sv_predicted_proba = cross_val_predict(vc_sv, kfold, X, y)
%time actual, vc_hv_predicted, _ = cross_val_predict(vc_hv, kfold, X, y)
print(f"Accuracy of SciKit-Learn Soft Voting: {accuracy_score(actual, vc_sv_predicted)}")
print(f"Accuracy of SciKit-Learn Hard Voting: {accuracy_score(actual, vc_hv_predicted)}")
Wall time: 4min 11s
Wall time: 4min 41s
Accuracy of SciKit-Learn Soft Voting: 0.8914
Accuracy of SciKit-Learn Hard Voting: 0.8859
Wall time: 8min 52s
Method 3: Using SciKit-Learn and cross_val_score
%time print(f"Accuracy of SciKit-Learn Soft Voting using cross_val_score: {np.mean(cross_val_score(vc_sv, X, y, cv=kfold))}")
Accuracy of SciKit-Learn Soft Voting using cross_val_score: 0.8914
Wall time: 4min 46s
All three different ways of scoring the soft-voting accuracy give the same result, which again confirms that our hand-written implementation is correct.
Addendum: a Scikit-Learn limitation with hard voting
from catboost import CatBoostClassifier
from sklearn.model_selection import cross_val_score

classifiers["Cat Boost"] = CatBoostClassifier(silent=True, random_state=RANDOM_STATE)
estimators = list(classifiers.items())
vc_hv = VotingClassifier(estimators=estimators, voting="hard")
print(cross_val_score(vc_hv, X, y, cv=kfold))
This produces an error:
ValueError: could not broadcast input array from shape (2000,1) into shape (2000)
The problem is easy to solve: CatBoost's predictions come back with shape (2000, 1) while hard voting expects shape (2000,), so np.squeeze can be used to remove the extra dimension.
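One possible workaround (a sketch based on the idea above, not code from the original article; the wrapper class name is made up) is to wrap the classifier so that its predict method returns a 1-D array before it reaches the voting logic:

class SqueezedCatBoostClassifier(CatBoostClassifier):
    """CatBoostClassifier whose predict() returns a 1-D array, as hard voting expects."""
    def predict(self, X, **kwargs):
        # np.squeeze turns shape (n_samples, 1) into (n_samples,) and leaves 1-D arrays alone
        return np.squeeze(super().predict(X, **kwargs))

classifiers["Cat Boost"] = SqueezedCatBoostClassifier(silent=True, random_state=RANDOM_STATE)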
Summary
By adding a neural network, a support vector machine and LightGBM to the mix, the soft-voting accuracy improved by 0.46%, from 88.68% to 89.14%, and the new soft-voting ensemble is 0.76% more accurate than the best individual algorithm (XG Boost at 88.38%). Adding Light GBM, whose accuracy is lower than XG Boost's, still improved the ensemble by 0.1%! In other words, a model that is not the best on its own can still improve the soft-voting ensemble. In a Kaggle competition, 0.76% can be huge and could move you a long way up the leaderboard.
https://www.overfit.cn/post/a4a0f83dafad46f3b4135b5faf1ca85a
Author: Graham Harrison