Mixture Density Networks (MDN) for Multiple Regression: Explanation and Code Examples

This article first briefly explains what a mixture density network (MDN) is, then builds an MDN model in Python, and finally uses that model for multiple regression and tests how well it performs.

Regression

“Regression predictive modeling is the task of approximating a mapping function (f) from input variables (X) to a continuous output variable (y) [...] A regression problem requires predicting a specific value. A problem with multiple input variables is usually called a multiple regression problem. For example, a predicted house value may be between 100,000 and 200,000 US dollars.”

Here is a visual explanation of the difference between classification problems and regression problems:

Another example

Density

What does “density” mean? Here is a quick, informal example:

Suppose you deliver pizzas for Pizza Hut. You record the time (in minutes) of every delivery you make. After 1,000 deliveries, you visualize the data to see how well you are doing. This is the result:

This is the “density” of the pizza delivery time distribution. On average, a delivery takes 30 minutes (the peak in the figure). It also tells us that 95% of the time (within 2 standard deviations, 2sd), a delivery takes between 20 and 40 minutes. The density represents the “frequency” of the time outcomes. The difference between “frequency” and “density” is:

· Frequency: if you draw a histogram under this curve and add up the counts of all bins, the total is some integer (it depends on the total number of observations in the dataset).

· Density: if you draw a histogram under this curve and add up the areas of all bins, the total is 1. We can also call this curve the probability density function (pdf).
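
The difference between the two bullets can be checked numerically. The sketch below is only an illustration in plain Python: the delivery times are simulated (mean 30 minutes, standard deviation 5 minutes), and the bin width and range are arbitrary choices made for this example.

```python
import random

random.seed(0)
# Simulated delivery times in minutes (hypothetical data, not from the article)
times = [random.gauss(30, 5) for _ in range(1000)]

bin_width = 2.0
n_bins = 30  # bins cover 0 to 60 minutes

# Frequency histogram: count the observations that fall in each bin
counts = [0] * n_bins
for t in times:
    idx = int(t // bin_width)
    idx = min(max(idx, 0), n_bins - 1)  # clamp rare outliers into the edge bins
    counts[idx] += 1

# Density histogram: rescale so that the total area of the bars is 1
total = sum(counts)
densities = [c / (total * bin_width) for c in counts]

frequency_sum = sum(counts)
density_area = sum(d * bin_width for d in densities)

print(frequency_sum)           # number of observations
print(round(density_area, 6))  # total area under the density histogram
```

The frequency total changes whenever more deliveries are recorded, while the density area stays at 1, which is what makes densities comparable across datasets of different sizes.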

In statistical terms, this is a nice normal (Gaussian) distribution. A normal distribution has two parameters:

· Mean

· Standard deviation: “The standard deviation is a number used to describe how spread out a set of measurements is from the average (mean) or expected value. A low standard deviation means that most of the numbers are close to the average. A high standard deviation means that the numbers are more spread out.”

Changing the mean and the standard deviation changes the shape of the distribution. For example:
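
As a small numerical illustration of how the two parameters act, the sketch below evaluates the normal density by hand. The function name normal_pdf and the example values are chosen for this illustration only.

```python
import math

def normal_pdf(x, mean, std):
    """Density of a normal distribution with the given mean and std at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

# Height of the curve at its own center for three parameter choices
peak_narrow = normal_pdf(30, mean=30, std=5)   # tall, narrow curve
peak_wide = normal_pdf(30, mean=30, std=10)    # doubling std halves the peak
peak_shifted = normal_pdf(40, mean=40, std=5)  # changing the mean only moves the curve

print(peak_narrow, peak_wide, peak_shifted)
```

The mean decides where the curve sits, and the standard deviation decides how wide and how tall it is.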

There are many other types of distributions, each with its own kinds of parameters. For example:

Mixture density

Now let's look at these 3 distributions:

If we take this bimodal distribution (also called a general distribution):

Mixture density networks rely on the assumption that any general distribution, such as this bimodal one, can be decomposed into a mixture of normal distributions (the mixture can also be built from other distribution types, such as the Laplace distribution):
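
The decomposition idea can be sketched in a few lines: a mixture density is a weighted sum of component densities whose weights sum to 1. The weights, means and standard deviations below are made-up values, chosen only to produce a bimodal curve.

```python
import math

def normal_pdf(x, mean, std):
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def mixture_pdf(x, weights, means, stds):
    """Weighted sum of normal densities; the weights must sum to 1."""
    return sum(w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, stds))

# Hypothetical two-component mixture
weights = [0.4, 0.6]
means = [-2.0, 2.0]
stds = [0.7, 1.0]

# A simple numerical integration confirms the mixture is still a valid density
step = 0.01
grid = [-12.0 + i * step for i in range(2401)]
area = sum(mixture_pdf(x, weights, means, stds) * step for x in grid)

density_left_peak = mixture_pdf(-2.0, weights, means, stds)
density_valley = mixture_pdf(0.0, weights, means, stds)
density_right_peak = mixture_pdf(2.0, weights, means, stds)

print(round(area, 4), density_left_peak, density_valley, density_right_peak)
```

The density is high around both component means and low in between, which gives the two-peak shape.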

Network architecture

A mixture density network is also an artificial neural network. Here is a classic example of a neural network:

Input layer (yellow), hidden layer (green) and output layer (red).

Suppose we define the goal of the neural network as learning to output a continuous value given some input features. In the example above, given features such as age, gender and education, the neural network performs a regression.

Density network

A density network is also a neural network, but its goal is not simply to learn to output a single continuous value. Instead, it learns to output the parameters of a distribution (here, the mean and the standard deviation) given some input features. In the example above, given features such as age, gender and education level, the neural network learns to predict the mean and standard deviation of the expected salary distribution. Predicting a distribution has many advantages over predicting a single value; for example, it gives uncertainty bounds for the prediction. This is the “Bayesian” approach to regression problems. Here is a good example of predicting a distribution for each expected continuous value:

The following figure shows the expected value distribution of each predicted instance:
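
To make the idea of uncertainty bounds concrete, here is a minimal sketch that turns a predicted mean and standard deviation into a 95% interval. The salary figures are invented, and the sketch assumes the predicted distribution is normal.

```python
from statistics import NormalDist

# Hypothetical output of a density network for one person
pred_mean = 52000.0  # predicted mean salary
pred_std = 4000.0    # predicted standard deviation

dist = NormalDist(mu=pred_mean, sigma=pred_std)

# Central 95% interval of the predicted distribution
lower = dist.inv_cdf(0.025)
upper = dist.inv_cdf(0.975)

print(round(lower, 2), round(upper, 2))
```

A model that predicts only a single value gives no such interval; here the width of the interval comes directly from the predicted standard deviation.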

Mixture density networks

Finally, back to the main topic. The goal of a mixture density network is to learn to output the parameters of all the distributions mixed into the general distribution (here, the mean, the standard deviation and Pi). The new parameter “Pi” is the mixing parameter: it gives the weight (probability) of a given distribution in the final mixture.
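
One common way to get these three groups of parameters out of a network is sketched below in plain Python. This is an assumption made for illustration, not the code of the MDN class used later: the names mdn_head and negative_log_likelihood are hypothetical, a softmax keeps the Pi weights positive and summing to 1, and an exponential keeps the standard deviations positive. Training would then minimize the negative log-likelihood of the observed y under the mixture.

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def normal_pdf(x, mean, std):
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def mdn_head(raw_outputs, n_mixtures):
    """Split raw network outputs into (pis, means, stds), one value per mixture."""
    logits = raw_outputs[:n_mixtures]
    means = raw_outputs[n_mixtures:2 * n_mixtures]
    log_stds = raw_outputs[2 * n_mixtures:]
    pis = softmax(logits)                   # mixing weights
    stds = [math.exp(v) for v in log_stds]  # always positive
    return pis, means, stds

def negative_log_likelihood(y, pis, means, stds):
    """Loss for one observation: minus the log of the mixture density at y."""
    likelihood = sum(p * normal_pdf(y, m, s) for p, m, s in zip(pis, means, stds))
    return -math.log(likelihood)

# Hypothetical raw outputs for a two-component mixture
raw = [0.0, 0.0, -1.0, 1.0, math.log(0.5), math.log(0.5)]
pis, means, stds = mdn_head(raw, n_mixtures=2)

loss_near = negative_log_likelihood(1.0, pis, means, stds)  # y close to a component mean
loss_far = negative_log_likelihood(5.0, pis, means, stds)   # y far from both components

print(pis, means, stds, loss_near, loss_far)
```

The loss is small when the observed y falls where the mixture puts a lot of density and large when it falls far from every component.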

The final result looks like this:

Example 1: the MDN class on univariate data

With the definitions and theory covered, let's move on to the code:

import numpy as np
import pandas as pd

from mdn_model import MDN

from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.linear_model import LinearRegression
from sklearn.kernel_ridge import KernelRidge

plt.style.use('ggplot')

Generate the famous “half moons” dataset:

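# make_moons returns 2D points plus class labels; the labels are discarded below
X, y = make_moons(n_samples=2500, noise=0.03)
# Use the second coordinate as the regression target and the first as the only feature
y = X[:, 1].reshape(-1,1)
X = X[:, 0].reshape(-1,1)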

x_scaler = StandardScaler()
y_scaler = StandardScaler()

X = x_scaler.fit_transform(X)
y = y_scaler.fit_transform(y)

plt.scatter(X, y, alpha = 0.3)

Plot the density distribution of the target value (y):

sns.kdeplot(y.ravel(), shade=True)

Looking at the data, we can see two overlapping clusters:

This is a nice multimodal distribution (general distribution). Let's try a standard linear regression to predict y from X on this dataset:

model = LinearRegression()
model.fit(X.reshape(-1,1), y.reshape(-1,1))
y_pred = model.predict(X.reshape(-1,1))

plt.scatter(X, y, alpha = 0.3)
plt.scatter(X,y_pred)
plt.title('Linear Regression')

sns.kdeplot(y_pred.ravel(), shade=True, alpha = 0.15, label = 'Linear Pred dist')      
sns.kdeplot(y.ravel(), shade=True, label = 'True dist')

The result is clearly not good! Now let's try a nonlinear model (kernel ridge regression with a radial basis function kernel):

model = KernelRidge(kernel = 'rbf')
model.fit(X, y)
y_pred = model.predict(X)


plt.scatter(X, y, alpha = 0.3)
plt.scatter(X,y_pred)
plt.title('Non Linear Regression')

sns.kdeplot(y_pred.ravel(), shade=True, alpha = 0.15, label = 'NonLinear Pred dist')      
sns.kdeplot(y.ravel(), shade=True, label = 'True dist')

Although the result is still not satisfactory, it is much better than the linear regression above.

The main reason both models fail is that the same X value has several different y values. More specifically, there appears to be more than one possible y distribution for the same X. A regression model only tries to find the function that minimizes the error and takes no account of the mixture of densities. The X values in the middle therefore have no single Y solution; they have two possible solutions, which leads to the problems above.
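
This failure mode can be reproduced with a toy calculation. In the sketch below, the y values observed for one X are invented and split evenly between two clusters; the single prediction that minimizes the mean squared error lands between the clusters, where no data exists.

```python
# Hypothetical y values observed for one X in the middle of the data:
# half of them belong to the lower cluster, half to the upper cluster
ys = [-1.0] * 50 + [1.0] * 50

def mse(constant, values):
    """Mean squared error of predicting the same constant for every value."""
    return sum((v - constant) ** 2 for v in values) / len(values)

# Try a grid of candidate predictions from -1.5 to 1.5
candidates = [i / 10 for i in range(-15, 16)]
best = min(candidates, key=lambda c: mse(c, ys))

print(best, mse(best, ys), mse(1.0, ys))
```

The error-minimizing prediction is the average of the two clusters, so it matches neither of the two plausible answers.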

Now let's try an MDN model, using a custom Python MDN class that is quick and easy to use, with a “fit-predict”, “sklearn-like” interface. If you want to use it yourself, here is the link to the Python code (please note: this MDN class is experimental and has not been widely tested): https://github.com/CoteDave/b...

To use this class you need sklearn, tensorflow probability, Tensorflow < 2, umap and hdbscan (used by the class's custom visualization functions).

EPOCHS = 10000
BATCH_SIZE=len(X)

model = MDN(n_mixtures = -1, 
            dist = 'laplace',
            input_neurons = 1000, 
            hidden_neurons = [25], 
            gmm_boost = False,
            optimizer = 'adam',
            learning_rate = 0.001, 
            early_stopping = 250,
            tf_mixture_family = True,
            input_activation = 'relu',
            hidden_activation = 'leaky_relu')

model.fit(X, y, epochs = EPOCHS, batch_size = BATCH_SIZE)

The parameters of the class are summarized below:

· n_mixtures: the number of distributions the MDN mixes. If set to -1, it “automatically” finds the best number of mixtures using a Gaussian mixture model (GMM) and an HDBSCAN model on X and y.

· dist: the type of distribution used in the mixture. Currently there are two options: “normal” or “laplace”. (Based on some experiments, the Laplace distribution gives better results than the normal distribution.)

· input_neurons: the number of neurons in the input layer of the MDN.

· hidden_neurons: the hidden layer architecture of the MDN, given as a list with the number of neurons per hidden layer. This parameter lets you choose both the number of hidden layers and the number of neurons in each one.

· gmm_boost: Boolean. If set to True, cluster features are added to the dataset.

· optimizer: the optimization algorithm to use.

· learning_rate: the learning rate of the optimization algorithm.

· early_stopping: prevents overfitting during training. It sets how many epochs the metric may go without improving before training is stopped.

· tf_mixture_family: Boolean. If set to True (recommended), the tf_mixture family is used: the Mixture object allows a batch of mixed distributions.

· input_activation: the activation function of the input layer.

· hidden_activation: the activation function of the hidden layers.
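
The two dist options differ in the shape of each mixture component. As an illustration only (plain Python, not taken from the MDN class), the sketch below compares a normal and a Laplace density with the same location and scale; the function names are chosen for this example.

```python
import math

def normal_pdf(x, loc, scale):
    z = (x - loc) / scale
    return math.exp(-0.5 * z * z) / (scale * math.sqrt(2.0 * math.pi))

def laplace_pdf(x, loc, scale):
    """Laplace density: the exponent uses an absolute value instead of a square."""
    return math.exp(-abs(x - loc) / scale) / (2.0 * scale)

# Density at the center and far out in the tail, with loc = 0 and scale = 1
center_normal = normal_pdf(0.0, 0.0, 1.0)
center_laplace = laplace_pdf(0.0, 0.0, 1.0)
tail_normal = normal_pdf(4.0, 0.0, 1.0)
tail_laplace = laplace_pdf(4.0, 0.0, 1.0)

print(center_normal, center_laplace)
print(tail_normal, tail_laplace)
```

With these settings the Laplace density has a sharper peak and heavier tails than the normal density. Whether that helps depends on the data, which is why the choice is exposed as a parameter.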

Now that the MDN model has been fitted to the data, sample from the mixture density distribution and plot the probability density function:

model.plot_distribution_fit(n_samples_batch = 1)

Our MDN model fits the true general distribution very well! Now decompose the final mixture into its individual distributions and take a look:

model.plot_all_distribution_fit(n_samples_batch = 1)

Use the learned mixture distribution to sample some new Y data and compare the generated samples with the real samples:

model.plot_samples_vs_true(X, y, alpha = 0.2)

The samples are very close to the actual data. Given X, you can also generate multiple batches of samples and use them to compute statistics such as quantiles and the mean:

generated_samples = model.sample_from_mixture(X, n_samples_batch = 10)
generated_samples
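
How such statistics are computed from the samples depends on the shape of the returned object, which the article does not show. The sketch below therefore works on a plain list of sampled y values for a single X; the helper names percentile and summarize and the sample values are hypothetical.

```python
def percentile(sorted_values, q):
    """Percentile with linear interpolation, for q between 0 and 1."""
    pos = q * (len(sorted_values) - 1)
    lower = int(pos)
    upper = min(lower + 1, len(sorted_values) - 1)
    frac = pos - lower
    return sorted_values[lower] * (1.0 - frac) + sorted_values[upper] * frac

def summarize(samples, q_low=0.05, q_high=0.95):
    """Mean and two quantiles of the samples drawn for one input."""
    ordered = sorted(samples)
    return {
        "mean": sum(ordered) / len(ordered),
        "q_low": percentile(ordered, q_low),
        "q_high": percentile(ordered, q_high),
    }

# Hypothetical batch of 10 sampled y values for a single X
samples_for_one_x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
summary = summarize(samples_for_one_x)

print(summary)
```

With more samples per X, the quantile estimates become more stable.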

Plot the mean of each learned distribution along with its mixture weight (pi):

plt.scatter(X, y, alpha = 0.2)
model.plot_predict_dist(X, with_weights = True, size = 250)

Since we have the mean and standard deviation of each distribution, we can also plot the full uncertainty. Suppose we plot the means with a 95% confidence interval:

plt.scatter(X, y, alpha = 0.2)
model.plot_predict_dist(X, q = 0.95, with_weights = False)

When the distributions are mixed and the same X has several y distributions, we use the highest Pi parameter value to select the most probable mixture:

Y_preds = for each X, the Y mean of the distribution with the highest probability/weight (Pi parameter)

plt.scatter(X, y, alpha = 0.3)
model.plot_predict_best(X)
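
The selection rule itself is short. Here is a minimal sketch of it for a single X, with invented Pi weights and means; the function name predict_best is hypothetical and is not the method of the MDN class.

```python
def predict_best(pis, means):
    """Return the mean of the mixture component with the highest weight."""
    best_index = max(range(len(pis)), key=lambda k: pis[k])
    return means[best_index]

# Hypothetical mixture parameters predicted for one X
pis = [0.35, 0.65]
means = [1.44, -0.84]

y_pred = predict_best(pis, means)
print(y_pred)
```

When the two weights are close to each other, a small change in Pi flips the prediction from one cluster to the other.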

This approach is not ideal, because the data clearly contains two different overlapping clusters with almost equal density, so the error will be higher than with a standard regression model. It may also mean that the dataset lacks an important feature that would help keep the clusters from overlapping in a higher dimension.

We can also choose to combine the Pi parameters with the means of all the distributions:

· Y_preds = (mean_1 * Pi1) + (mean_2 * Pi2)

plt.scatter(X, y, alpha = 0.3)
model.plot_predict_mixed(X)
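
Written out for a single X, the weighted combination looks like the sketch below. The weights and means are invented, and predict_mixed is a hypothetical name rather than the class's own method.

```python
def predict_mixed(pis, means):
    """Weighted average of the component means, using Pi as the weights."""
    return sum(p * m for p, m in zip(pis, means))

# Hypothetical mixture parameters predicted for one X
pis = [0.35, 0.65]
means = [1.44, -0.84]

y_pred = predict_mixed(pis, means)
print(round(y_pred, 3))
```

The result lies between the two component means, so it belongs to neither cluster.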

If we add the 95% confidence interval:

This option gives almost the same result as the nonlinear regression model: it blends everything to minimize the distance between the points and the function. In this very particular case, my preferred option is to assume that in some regions of the data X has more than one Y, while in other regions only one of the mixtures applies:

For example, when X = 0, there may be two different Y solutions, one from each mixture. When X = -1.5, there is a single Y solution, from mixture 1. Depending on the use case or business context, an action or decision can be triggered when the same X has multiple solutions.

With this option, rows are duplicated where the distributions overlap (that is, when both mixture probabilities are >= a given probability threshold):

plt.scatter(X, y, alpha = 0.3)
model.plot_predict_with_overlaps(X)
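
The duplication rule can be sketched as follows. This is an assumption about the logic, not the class's implementation: the function name, the threshold of 0.3 and the parameter values are all hypothetical.

```python
def predict_with_overlaps(x, pis, means, threshold=0.3):
    """Return one (x, y, probability) row per mixture whose weight reaches the threshold."""
    rows = [(x, m, p) for p, m in zip(pis, means) if p >= threshold]
    if not rows:
        # No weight reaches the threshold: fall back to the most likely mixture
        k = max(range(len(pis)), key=lambda i: pis[i])
        rows = [(x, means[k], pis[k])]
    return rows

# Overlap region: both mixtures are plausible, so the row is duplicated
rows_overlap = predict_with_overlaps(0.0, [0.35, 0.65], [1.44, -0.84])

# Non-overlap region: a single mixture dominates, so one row is returned
rows_single = predict_with_overlaps(-1.5, [0.98, 0.02], [0.5, -1.2])

print(rows_overlap)
print(rows_single)
```

Because some inputs produce two rows, the prediction table ends up with more rows than the input data.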

With the 95% confidence interval:

The dataset grows from 2500 rows to 4063. The final prediction dataset looks like this:

In this table, when X = -0.276839, Y can be 1.43926 (mixture_0, with probability 0.351525), but it can also be -0.840593 (mixture_1, with probability 0.648475).

Instances with multiple distributions also carry important information: something is going on in the data that may need more analysis. It could be a data quality problem, or it could indicate that the dataset lacks an important feature!

“Traffic scene prediction is a good example of where mixture density networks can be used. In traffic scene prediction, we need a distribution over the behaviors an agent can exhibit: for example, an agent can turn left, turn right or go straight. A mixture density network can therefore be used to represent “behaviors”, where each behavior consists of a probability and a trajectory ((x, y) coordinates over some time horizon in the future).”

Example 2: multivariate regression with MDN

Finally, does MDN also do well on multiple regression?

We will use the following dataset:

· age: age of the primary beneficiary

· gender: gender of the insurance contractor (female, male)

· bmi: body mass index, which indicates whether a person's weight is relatively high or low for their height; an objective index of body weight (kg / m ^ 2) based on weight relative to height, ideally 18.5 to 24.9

· children: number of children covered by health insurance / number of dependents

· smoker: whether the person smokes

· region: the beneficiary's residential area in the US: northeast, southeast, southwest or northwest

· charges: individual medical costs billed by health insurance. This is the target we want to predict

The problem statement is: can the insurance cost (charges) be predicted accurately?

Now let's import the dataset:

"""
#################
# 2-IMPORT DATA #
#################
"""
dataset = pd.read_csv('insurance_clean.csv', sep = ';')

##### BASIC FEATURE ENGINEERING
dataset['age2'] = dataset['age'] * dataset['age']
dataset['BMI30'] = np.where(dataset['bmi'] > 30, 1, 0)
dataset['BMI30_SMOKER'] = np.where((dataset['bmi'] > 30) & (dataset['smoker_yes'] == 1), 1, 0)
"""
######################
# 3-DATA PREPARATION #
######################
"""
###### SPLIT TRAIN TEST
from sklearn.model_selection import train_test_split
X = dataset[dataset.columns.difference(['charges'])]
y = dataset[['charges']]

X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size=0.25, 
                                                    stratify = X['smoker_yes'], 
                                                    random_state=0)


test_index = y_test.index.values
train_index = y_train.index.values
features = X.columns.tolist()

##### FEATURE SCALING 
from sklearn.preprocessing import StandardScaler
x_scaler = StandardScaler()
y_scaler = StandardScaler()

X_train = x_scaler.fit_transform(X_train)
#X_calib = x_scaler.transform(X_calib)
X_test = x_scaler.transform(X_test)

y_train = y_scaler.fit_transform(y_train)
#y_calib = y_scaler.transform(y_calib)
y_test = y_scaler.transform(y_test)

y_test_scaled = y_test.copy()

The data is ready for training

EPOCHS = 10000
BATCH_SIZE=len(X_train)

model = MDN(n_mixtures = -1, #-1
            dist = 'laplace',
            input_neurons = 1000, #1000
            hidden_neurons = [], #25
            gmm_boost = False,
            optimizer = 'adam',
            learning_rate = 0.0001, #0.00001
            early_stopping = 200,
            tf_mixture_family = True,
            input_activation = 'relu',
            hidden_activation = 'leaky_relu')

model.fit(X_train, y_train, epochs = EPOCHS, batch_size = BATCH_SIZE)

After training, predict on the test dataset using the “best mixture probability (Pi parameter) strategy” and plot the results (y_pred vs y_test):

y_pred = model.predict_best(X_test, q = 0.95, y_scaler = y_scaler)
model.plot_pred_fit(y_pred, y_test, y_scaler = y_scaler)

model.plot_pred_vs_true(y_pred, y_test, y_scaler = y_scaler)
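
Fits like this are usually summarized with the R2 score and the mean absolute error (MAE). A minimal sketch of both metrics follows, using invented values on the original scale of the target rather than the article's data.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between true and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def r2_score(y_true, y_pred):
    """Share of the variance of y_true that the predictions explain."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Hypothetical true and predicted charges
y_true = [1000.0, 2000.0, 3000.0, 4000.0]
y_pred = [1100.0, 1900.0, 3200.0, 3900.0]

mae = mean_absolute_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)
print(mae, round(r2, 3))
```

MAE is expressed in the units of the target, so it should be computed after undoing the scaling of y. R2 is unit-free and is often reported as a percentage.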

R2 is 89.09 and MAE is 882.54, so the MDN does very well. Let's plot the fitted distribution against the true distribution:

model.plot_distribution_fit(n_samples_batch = 1)

It is almost exactly the same! Let's decompose the mixture model to see what is going on:

A total of six different distributions are mixed.

Generate multivariate samples from the fitted mixture model (PCA is applied to visualize the results in 2D):

model.plot_samples_vs_true(X_test, y_test, alpha = 0.35, y_scaler = y_scaler)

The generated samples are very close to the real samples! If we want, we can also predict from each distribution:

y_pred_dist = model.predict_dist(X_test, q = 0.95, y_scaler = y_scaler)
y_pred_dist

Summary

· Compared with classic linear and nonlinear ML models, MDN performs well on a univariate regression dataset in which two clusters overlap and a single X may have more than one Y output.

· MDN also does well on the multiple regression problem and can compete with popular models such as XGBoost.

· MDN is an excellent and unique tool in ML that can solve specific problems other models cannot (it can learn from data drawn from a mixture of distributions).

· Because MDN learns distributions, it can also compute the uncertainty of its predictions or generate new samples from the learned distributions.

This article contains a lot of code. The full notebook is available here and can be downloaded and run directly:

https://www.overfit.cn/post/20245a8446ae43e3982b48e4320991ab

Author: Dave Cote, M.Sc.