A quick review of soft voting and hard voting in ensemble methods
An ensemble method combines the results of two or more separate machine learning algorithms and attempts to produce results that are more accurate than any single algorithm.
In soft voting, the probabilities for each class are averaged to produce the result. For example, if algorithm 1 predicts that an object is a rock with 40% probability and algorithm 2 predicts it is a rock with 80% probability, the ensemble predicts that the object is a rock with (80 + 40) / 2 = 60% probability.
In hard voting, each algorithm's prediction counts as one vote, and the class with the most votes is selected. For example, if three algorithms predict the colour of a particular wine as "white", "white" and "red", the ensemble predicts "white".
The simplest summary is: soft voting averages probabilities, while hard voting takes a majority vote over the predicted labels.
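As a minimal sketch, here is what the two toy examples above look like in code (the numbers come from the text, not from real models):

import numpy as np
from statistics import mode

# Soft voting: average the per-class probabilities of the two algorithms
print(np.mean([0.40, 0.80]))            # ≈ 0.6 -> the ensemble says "rock" with 60% probability

# Hard voting: take the majority label across the three algorithms
print(mode(["white", "white", "red"]))  # white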
Generate test data
Now let's start writing the code. First, import some libraries and define a few configuration constants:
import pandas as pd
import numpy as np
import copy as cp
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from typing import Tuple
from statistics import mode
from sklearn.ensemble import VotingClassifier
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier
from xgboost import XGBClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from lightgbm import LGBMClassifier

RANDOM_STATE : int = 42
N_SAMPLES : int = 10000
N_FEATURES : int = 25
N_CLASSES : int = 3
N_CLUSTERS_PER_CLASS : int = 2
FEATURE_NAME_PREFIX : str = "Feature"
TARGET_NAME : str = "Target"
N_SPLITS : int = 5

np.set_printoptions(suppress=True)
We also need some data to classify. The make_classification_dataframe function creates test data containing features and a target.
Here the number of classes is set to 3 so that the soft and hard voting implementations work for multi-class problems (more than 2 classes); the same code also handles binary classification.
def make_classification_dataframe(n_samples : int = 10000, n_features : int = 25, n_classes : int = 2, n_clusters_per_class : int = 2, feature_name_prefix : str = "Feature", target_name : str = "Target", random_state : int = 42) -> pd.DataFrame:
    X, y = make_classification(n_samples=n_samples, n_features=n_features, n_classes=n_classes, n_informative=n_classes*n_clusters_per_class, random_state=random_state)
    feature_names = [feature_name_prefix + " " + str(v) for v in np.arange(1, n_features+1)]
    return pd.concat([pd.DataFrame(X, columns=feature_names), pd.DataFrame(y, columns=[target_name])], axis=1)
df_data = make_classification_dataframe(n_samples=N_SAMPLES, n_features=N_FEATURES, n_classes=N_CLASSES, n_clusters_per_class=N_CLUSTERS_PER_CLASS, feature_name_prefix=FEATURE_NAME_PREFIX, target_name=TARGET_NAME, random_state=RANDOM_STATE)
X = df_data.drop([TARGET_NAME], axis=1).to_numpy()
y = df_data[TARGET_NAME].to_numpy()
df_data.head()
df_data.head() shows the first few rows of the generated features and target.
Cross validation
We use cross-validation instead of train_test_split because it provides a more robust evaluation of algorithm performance.
The cross_val_predict helper function provides the code to do this:
def cross_val_predict(model, kfold : KFold, X : np.array, y : np.array) -> Tuple[np.array, np.array, np.array]:
    model_ = cp.deepcopy(model)
    no_classes = len(np.unique(y))
    actual_classes = np.empty([0], dtype=int)
    predicted_classes = np.empty([0], dtype=int)
    predicted_proba = np.empty([0, no_classes])
    for train_ndx, test_ndx in kfold.split(X):
        train_X, train_y, test_X, test_y = X[train_ndx], y[train_ndx], X[test_ndx], y[test_ndx]
        actual_classes = np.append(actual_classes, test_y)
        model_.fit(train_X, train_y)
        predicted_classes = np.append(predicted_classes, model_.predict(test_X))
        try:
            predicted_proba = np.append(predicted_proba, model_.predict_proba(test_X), axis=0)
        except:
            predicted_proba = np.append(predicted_proba, np.zeros((len(test_X), no_classes), dtype=float), axis=0)
    return actual_classes, predicted_classes, predicted_proba
The call to predict_proba is wrapped in a try block because not all algorithms support probabilities, and there is no consistent warning or error type that can be caught explicitly.
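As a small illustration of the problem (a toy check that is not part of the article's pipeline), an SVC created without probability=True cannot produce class probabilities:

from sklearn.datasets import make_classification
from sklearn.svm import SVC

X_demo, y_demo = make_classification(n_samples=100, random_state=0)
clf = SVC(probability=False).fit(X_demo, y_demo)   # probabilities disabled
try:
    clf.predict_proba(X_demo)
except Exception as exc:                           # the exact exception type varies by version
    print(type(exc).__name__)                      # e.g. AttributeError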
Before going further, let's take a quick look at cross_val_predict for a single algorithm:
lr = LogisticRegression(random_state=RANDOM_STATE)
kfold = KFold(n_splits=N_SPLITS, random_state=RANDOM_STATE, shuffle=True)
%time actual, lr_predicted, lr_predicted_proba = cross_val_predict(lr, kfold, X, y)
print(f"Accuracy of Logistic Regression: {accuracy_score(actual, lr_predicted)}")
lr_predicted
Wall time: 309 ms
Accuracy of Logistic Regression: 0.6821
array([0, 0, 1, ..., 0, 2, 1])
The cross_val_predict function returns both the probabilities and the predicted classes, and the predicted classes are shown in the cell output above.
The first few data points are predicted to belong to classes 0, 0, 1 and so on.
Predictions from multiple classifiers
The next step is to generate predictions and probabilities from several classifiers. The algorithms chosen here are random forest, XGBoost and extremely randomized trees.
def cross_val_predict_all_classifiers(classifiers : dict) -> Tuple[np.array, np.array]:
    predictions = [None] * len(classifiers)
    predicted_probas = [None] * len(classifiers)
    for i, (name, classifier) in enumerate(classifiers.items()):
        %time actual, predictions[i], predicted_probas[i] = cross_val_predict(classifier, kfold, X, y)
        print(f"Accuracy of {name}: {accuracy_score(actual, predictions[i])}")
    return actual, predictions, predicted_probas

classifiers = dict()
classifiers["Random Forest"] = RandomForestClassifier(random_state=RANDOM_STATE)
classifiers["XG Boost"] = XGBClassifier(use_label_encoder=False, eval_metric='logloss', random_state=RANDOM_STATE)
classifiers["Extra Random Trees"] = ExtraTreesClassifier(random_state=RANDOM_STATE)

actual, predictions, predicted_probas = cross_val_predict_all_classifiers(classifiers)
Wall time: 17.1 s
Accuracy of Random Forest: 0.8742
Wall time: 24.6 s
Accuracy of XG Boost: 0.8838
Wall time: 6.2 s
Accuracy of Extra Random Trees: 0.8754
The predictions variable is a list holding each algorithm's array of predicted classes:
[array([2, 0, 0, ..., 0, 2, 1]),
array([2, 0, 2, ..., 0, 2, 1], dtype=int64),
array([2, 0, 0, ..., 0, 2, 1])]
predicted_probas is also a list, but it contains the predicted probabilities for each target. Each array has shape (10000, 3), where:
- 10,000 is the number of data points in the sample dataset; each array has one row per data point
- 3 is the number of classes (our target has 3 classes, i.e. this is not a binary classifier)
[array([[0.17, 0.02, 0.81],
[0.58, 0.07, 0.35],
[0.54, 0.1 , 0.36],
...,
[0.46, 0.08, 0.46],
[0.15, 0. , 0.85],
[0.01, 0.97, 0.02]]),
array([[0.05611309, 0.00085733, 0.94302952],
[0.95303732, 0.00187497, 0.04508775],
[0.4653917 , 0.01353438, 0.52107394],
...,
[0.75208634, 0.0398241 , 0.20808953],
[0.02066649, 0.00156501, 0.97776848],
[0.00079027, 0.99868006, 0.00052966]]),
array([[0.33, 0.02, 0.65],
[0.54, 0.14, 0.32],
[0.51, 0.17, 0.32],
...,
[0.52, 0.06, 0.42],
[0.1 , 0.03, 0.87],
[0.05, 0.93, 0.02]])]
The first row of the output above can be read as follows: for the first algorithm's prediction on the first data point (i.e. the first row of the DataFrame), there is a 17% probability that it belongs to class 0, a 2% probability that it belongs to class 1 and an 81% probability that it belongs to class 2 (the three probabilities add up to 100%).
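As a quick sanity check (assuming the variables from the code above are still in scope), we can confirm this reading of the first row:

first_row = predicted_probas[0][0]       # first classifier, first data point
print(first_row)                         # e.g. [0.17 0.02 0.81]
print(first_row.argmax())                # 2 -> class 2 is the most likely
print(np.isclose(first_row.sum(), 1.0))  # the three probabilities sum to ~1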
Soft voting and hard voting
Now we get to the topic of this article: soft voting and hard voting can each be implemented in just a few lines of Python.
def soft_voting(predicted_probas : list) -> np.array:
    sv_predicted_proba = np.mean(predicted_probas, axis=0)
    sv_predicted_proba[:,-1] = 1 - np.sum(sv_predicted_proba[:,:-1], axis=1)
    return sv_predicted_proba, sv_predicted_proba.argmax(axis=1)

def hard_voting(predictions : list) -> np.array:
    return [mode(v) for v in np.array(predictions).T]

sv_predicted_proba, sv_predictions = soft_voting(predicted_probas)
hv_predictions = hard_voting(predictions)

for i, (name, classifier) in enumerate(classifiers.items()):
    print(f"Accuracy of {name}: {accuracy_score(actual, predictions[i])}")
print(f"Accuracy of Soft Voting: {accuracy_score(actual, sv_predictions)}")
print(f"Accuracy of Hard Voting: {accuracy_score(actual, hv_predictions)}")
Accuracy of Random Forest: 0.8742
Accuracy of XG Boost: 0.8838
Accuracy of Extra Random Trees: 0.8754
Accuracy of Soft Voting: 0.8868
Accuracy of Hard Voting: 0.881
As the output shows, soft voting is 0.3% more accurate than the best-performing single algorithm (88.68% vs 88.38%), while hard voting is slightly worse (88.10% vs 88.38%). Both mechanisms are explained in detail below.
Implementation of the voting algorithms
Soft voting
sv_predicted_proba = np.mean(predicted_probas, axis=0)
sv_predicted_proba
array([[0.18537103, 0.01361911, 0.80100984],
[0.69101244, 0.07062499, 0.23836258],
[0.50513057, 0.09451146, 0.40035798],
...,
[0.57736211, 0.05994137, 0.36269651],
[0.09022216, 0.01052167, 0.89925616],
[0.02026342, 0.96622669, 0.01350989]])
numpy's mean function averages element-wise across the three classifiers' probability arrays (axis 0 of the stacked list). In theory this should be all there is to soft voting, since it produces the mean of the three sets of output probabilities, and the result looks right.
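To make the averaging concrete, here is a toy illustration with made-up probabilities for three classifiers, two data points and three classes (not the real output above):

p1 = np.array([[0.2, 0.3, 0.5], [0.6, 0.3, 0.1]])
p2 = np.array([[0.1, 0.4, 0.5], [0.8, 0.1, 0.1]])
p3 = np.array([[0.3, 0.2, 0.5], [0.7, 0.2, 0.1]])
print(np.mean([p1, p2, p3], axis=0))
# [[0.2 0.3 0.5]
#  [0.7 0.2 0.1]]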
print(np.sum(sv_predicted_proba[0]))
sv_predicted_proba[0]
0.9999999826153119
array([0.18537103, 0.01361911, 0.80100984])
However, because of rounding errors, the values in a row do not always add up to exactly 1, even though each data point's probabilities across the three classes should sum to 1.
If we only use the probabilities to pick the top class (for example with a top-k or argmax selection), this error has no effect. But sometimes downstream processing requires the probabilities to sum to exactly 1, in which case a simple fix is enough: set the value in the last column to 1 minus the sum of the values in the other columns.
sv_predicted_proba[:,-1] = 1-np.sum(sv_predicted_proba[:,:-1], axis=1)
1.0
array([0.18537103, 0.01361911, 0.80100986])
Now the data is consistent: the probabilities in each row add up to 1, just as they should.
Next, numpy's argmax function is used to pick the class with the highest probability as the prediction (i.e. for each row, whether soft voting predicts class 0, 1 or 2).
sv_predicted_proba.argmax(axis=1)
array([2, 0, 0, ..., 0, 2, 1], dtype=int64)
The argmax function selects the index of the maximum value along the axis given by the axis parameter, so it selects 2 for the first row, 0 for the second row, 0 for the third row, and so on.
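As a toy illustration with made-up values:

toy = np.array([[0.2, 0.3, 0.5],
                [0.7, 0.2, 0.1]])
print(toy.argmax(axis=1))   # [2 0] -> class 2 for the first row, class 0 for the second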
Hard voting
hv_predicted = [mode(v) for v in np.array(predictions).T]
Here np.array(predictions).T simply transposes the array, turning its shape from (3, 10000) into (10000, 3) so that each row holds the three algorithms' votes for one data point.
print(np.array(predictions).shape)
np.array(predictions).T
(3, 10000)
array([[2, 2, 2],
[0, 0, 0],
[0, 2, 0],
...,
[0, 0, 0],
[2, 2, 2],
[1, 1, 1]], dtype=int64)
The list comprehension then takes each element (row) and applies statistics.mode to it, selecting the class that received the most votes from the algorithms:
np.array(hv_predicted)
array([2, 0, 0, ..., 0, 2, 1], dtype=int64)
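As a toy check of how mode behaves on a single row of votes (hypothetical values):

print(mode([2, 2, 0]))   # 2 -> two of the three classifiers voted for class 2
print(mode([0, 0, 0]))   # 0 -> unanimous vote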
Using Scikit-Learn
Writing the code from scratch helps us understand how the voting mechanisms work, but there is no need to reinvent the wheel: scikit-learn's VotingClassifier can do all of this for us.
estimators = list(classifiers.items())
vc_sv = VotingClassifier(estimators=estimators, voting="soft")
vc_hv = VotingClassifier(estimators=estimators, voting="hard")
%time actual, vc_sv_predicted, vc_sv_predicted_proba = cross_val_predict(vc_sv, kfold, X, y)
%time actual, vc_hv_predicted, _ = cross_val_predict(vc_hv, kfold, X, y)
print(f"Accuracy of SciKit-Learn Soft Voting: {accuracy_score(actual, vc_sv_predicted)}")
print(f"Accuracy of SciKit-Learn Hard Voting: {accuracy_score(actual, vc_hv_predicted)}")
Wall time: 1min 4s
Wall time: 55.3 s
Accuracy of SciKit-Learn Soft Voting: 0.8868
Accuracy of SciKit-Learn Hard Voting: 0.881
The scikit-learn implementation produces exactly the same results as our hand-written algorithms: 88.68% accuracy for soft voting and 88.1% for hard voting.
Why does hard voting perform worse?
Consider the following situation, using binary classification as an example.

Suppose five models predict class B with probabilities 0.01, 0.51, 0.6, 0.1 and 0.7, so the corresponding probabilities for class A are 0.99, 0.49, 0.4, 0.9 and 0.3.

With hard voting, B gets 3 votes and A gets 2, so the result is B.

With soft voting:
B: (0.01 + 0.51 + 0.6 + 0.1 + 0.7) / 5 ≈ 0.38
A: (0.99 + 0.49 + 0.4 + 0.9 + 0.3) / 5 ≈ 0.62
so the result is A.

The hard voting outcome is decided by models whose probabilities are only slightly in favour of B (0.51, 0.6, 0.7), while the soft voting outcome is driven by the models that are highly confident about A (0.99, 0.9). Soft voting effectively gives more weight to confident models, which is why it usually performs better than hard voting.
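Here is a minimal sketch of this toy example in code (the probabilities are the made-up numbers from the text):

import numpy as np
from statistics import mode

p_b = np.array([0.01, 0.51, 0.6, 0.1, 0.7])     # each model's probability of class B
p_a = 1 - p_b                                   # and therefore of class A

hard_votes = ["B" if p > 0.5 else "A" for p in p_b]
print(mode(hard_votes))                         # B  (three of the five models vote B)

print(round(p_b.mean(), 3), round(p_a.mean(), 3))   # 0.384 0.616
print("A" if p_a.mean() > p_b.mean() else "B")      # A  (soft voting picks A)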
How much can ensemble learning improve things?
Let's see how much improvement ensemble learning can deliver on the accuracy metric. We'll use 6 common algorithms and see how much performance we can squeeze out of the ensemble...
classifiers = dict()
classifiers["Random Forest"] = RandomForestClassifier(random_state=RANDOM_STATE)
classifiers["XG Boost"] = XGBClassifier(use_label_encoder=False, eval_metric='logloss', random_state=RANDOM_STATE)
classifiers["Extra Random Trees"] = ExtraTreesClassifier(random_state=RANDOM_STATE)
classifiers['Neural Network'] = MLPClassifier(max_iter=1000, random_state=RANDOM_STATE)
classifiers['Support Vector Machine'] = SVC(probability=True, random_state=RANDOM_STATE)
classifiers['Light GBM'] = LGBMClassifier(random_state=RANDOM_STATE)
estimators = list(classifiers.items())
Method 1: Using the hand-written code
%%time
actual, predictions, predicted_probas = cross_val_predict_all_classifiers(classifiers) # Get a collection of predictions and probabilities from the selected algorithms
sv_predicted_proba, sv_predictions = soft_voting(predicted_probas) # Combine those collections into a single set of predictions
hv_predictions = hard_voting(predictions)
print(f"Accuracy of Soft Voting: {accuracy_score(actual, sv_predictions)}")
print(f"Accuracy of Hard Voting: {accuracy_score(actual, hv_predictions)}")
Wall time: 14.9 s
Accuracy of Random Forest: 0.8742
Wall time: 32.8 s
Accuracy of XG Boost: 0.8838
Wall time: 5.78 s
Accuracy of Extra Random Trees: 0.8754
Wall time: 3min 2s
Accuracy of Neural Network: 0.8612
Wall time: 36.2 s
Accuracy of Support Vector Machine: 0.8674
Wall time: 1.65 s
Accuracy of Light GBM: 0.8828
Accuracy of Soft Voting: 0.8914
Accuracy of Hard Voting: 0.8851
Wall time: 4min 34s
Method 2: Using SciKit-Learn and cross_val_predict
%%time
vc_sv = VotingClassifier(estimators=estimators, voting="soft")
vc_hv = VotingClassifier(estimators=estimators, voting="hard")
%time actual, vc_sv_predicted, vc_sv_predicted_proba = cross_val_predict(vc_sv, kfold, X, y)
%time actual, vc_hv_predicted, _ = cross_val_predict(vc_hv, kfold, X, y)
print(f"Accuracy of SciKit-Learn Soft Voting: {accuracy_score(actual, vc_sv_predicted)}")
print(f"Accuracy of SciKit-Learn Hard Voting: {accuracy_score(actual, vc_hv_predicted)}")
Wall time: 4min 11s
Wall time: 4min 41s
Accuracy of SciKit-Learn Soft Voting: 0.8914
Accuracy of SciKit-Learn Hard Voting: 0.8859
Wall time: 8min 52s
Method 3: Using SciKit-Learn and cross_val_score
%time print(f"Accuracy of SciKit-Learn Soft Voting using cross_val_score: {np.mean(cross_val_score(vc_sv, X, y, cv=kfold))}")
Accuracy of SciKit-Learn Soft Voting using cross_val_score: 0.8914
Wall time: 4min 46s
All three different ways of scoring the soft-voting accuracy give the same result, which again confirms that our hand-written implementation is correct.
Addendum: a Scikit-Learn limitation with hard voting
from catboost import CatBoostClassifier
from sklearn.model_selection import cross_val_score

classifiers["Cat Boost"] = CatBoostClassifier(silent=True, random_state=RANDOM_STATE)
estimators = list(classifiers.items())
vc_hv = VotingClassifier(estimators=estimators, voting="hard")
print(cross_val_score(vc_hv, X, y, cv=kfold))
This produces an error:
ValueError: could not broadcast input array from shape (2000,1) into shape (2000)
The problem is easy to solve: CatBoost's predictions come back with shape (2000, 1) while hard voting expects shape (2000,), so np.squeeze can be used to remove the extra dimension.
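One possible workaround (a sketch based on the idea above, not code from the original article; the wrapper class name is made up) is to wrap the classifier so that its predict method returns a 1-D array before it reaches the voting logic:

class SqueezedCatBoostClassifier(CatBoostClassifier):
    """CatBoostClassifier whose predict() returns a 1-D array, as hard voting expects."""
    def predict(self, X, **kwargs):
        # np.squeeze turns shape (n_samples, 1) into (n_samples,) and leaves 1-D arrays alone
        return np.squeeze(super().predict(X, **kwargs))

classifiers["Cat Boost"] = SqueezedCatBoostClassifier(silent=True, random_state=RANDOM_STATE)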
Summary
By adding a neural network, a support vector machine and LightGBM to the mix, the soft-voting accuracy improved by 0.46%, from 88.68% to 89.14%, and the new soft-voting ensemble is 0.76% more accurate than the best individual algorithm (XG Boost at 88.38%). Adding Light GBM, whose accuracy is lower than XG Boost's, still improved the ensemble by 0.1%! In other words, a model that is not the best on its own can still improve the soft-voting ensemble. In a Kaggle competition, 0.76% can be huge and could move you a long way up the leaderboard.
https://www.overfit.cn/post/a4a0f83dafad46f3b4135b5faf1ca85a
Author: Graham Harrison