In this article, we first briefly explain what a Mixture Density Network (MDN) is, then build an MDN model in Python, and finally use the model for multivariate regression and test how well it works.
Regression
“Regression predictive modeling is the task of approximating a mapping function (f) from input variables (X) to a continuous output variable (y) [...] A regression problem requires the prediction of a specific quantity. A problem with multiple input variables is often called a multivariate regression problem. For example, the value of a house may be predicted to fall somewhere between 100,000 and 200,000 US dollars.”
Here is a visual explanation of the difference between a classification problem and a regression problem:
Another example:
Density
What does “density” mean? Here is a quick, intuitive example:
Suppose you deliver pizzas for Pizza Hut and record the time of every delivery you make (in minutes). After 1,000 deliveries, you visualize the data to see how well you are doing. This is the result:
This is the “density” of the pizza delivery time data. On average, a delivery takes 30 minutes (the peak in the figure). The figure also tells us that 95% of the time (within 2 standard deviations), a delivery takes between 20 and 40 minutes. The density represents the “frequency” of the time outcomes. The difference between “frequency” and “density” is:
· Frequency: if you draw a histogram under this curve and add up the counts of all bins, the total is an integer (it depends on the total number of observations captured in the dataset).
· Density: if you draw a histogram under this curve and add up the areas of all bins, the total is 1. We can also call this curve the probability density function (pdf).
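The difference between the two bullets can be checked numerically. The sketch below simulates delivery times (the normal shape, the mean of 30 and the standard deviation of 5 are assumptions chosen to match the figures quoted above) and compares the two kinds of histogram.

```python
import numpy as np

# Simulated delivery times in minutes (assumption: normal, mean 30, sd 5).
rng = np.random.default_rng(0)
times = rng.normal(loc=30.0, scale=5.0, size=1000)

# Frequency: the bin counts add up to the number of observations.
counts, edges = np.histogram(times, bins=20)
total_count = int(counts.sum())

# Density: bin height times bin width (the bin area) adds up to 1.
density, edges = np.histogram(times, bins=20, density=True)
total_area = float(np.sum(density * np.diff(edges)))

print(total_count, round(total_area, 6))
```

The first total changes with the size of the dataset; the second stays at 1 whatever the number of observations.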
In statistical terms, this is a nice normal (Gaussian) distribution. The normal distribution has two parameters:
· Mean
· Standard deviation: “The standard deviation is a number used to describe how spread out a set of measurements is around the average (mean) or expected value. A low standard deviation means that most numbers are close to the average. A high standard deviation means that the numbers are more spread out.”
Changing the mean and the standard deviation changes the shape of the distribution. For example:
There are many other distribution types, each with its own kinds of parameters. For example:
Mixture density
Now let's look at these 3 distributions:
If we take this bimodal distribution (also called a general distribution):
Mixture density networks rely on the assumption that any general distribution, like this bimodal one, can be decomposed into a mixture of normal distributions (the mixture can also be built from other distribution types, such as Laplace):
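As a rough illustration of this decomposition, the sketch below builds a bimodal density as a weighted sum of two normal densities. The weights, means and standard deviations are illustrative values only, and the helper names `normal_pdf` and `mixture_pdf` are hypothetical.

```python
import numpy as np

def normal_pdf(y, mean, sd):
    """Density of a normal distribution evaluated at y."""
    return np.exp(-0.5 * ((y - mean) / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))

def mixture_pdf(y, weights, means, sds):
    """Weighted sum of normal densities; the weights must sum to 1."""
    return sum(w * normal_pdf(y, m, s) for w, m, s in zip(weights, means, sds))

# Illustrative values only: two components centered at -2 and 2.
grid = np.linspace(-10.0, 10.0, 2001)
pdf = mixture_pdf(grid, weights=[0.4, 0.6], means=[-2.0, 2.0], sds=[0.7, 0.7])

# Riemann sum of the curve: a valid density integrates to (about) 1.
area = float(np.sum(pdf) * (grid[1] - grid[0]))
# Interior local maxima of the curve: one per mode.
n_peaks = int(np.sum((pdf[1:-1] > pdf[:-2]) & (pdf[1:-1] > pdf[2:])))

print(round(area, 4), n_peaks)
```

Because the weights sum to 1, the mixture is itself a valid density, and with well-separated means it shows one peak per component.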
Network architecture
A mixture density network is also an artificial neural network. Here is a classic example of a neural network:
Input layer (yellow), hidden layers (green), and output layer (red).
Suppose we define the goal of the neural network as learning to output a continuous value given some input features. In the example above, given features such as age, gender, and education, the neural network performs regression.
Density network
A density network is also a neural network, but its goal is not simply to learn to output a single continuous value. Instead, it learns to output the parameters of a distribution (here, the mean and the standard deviation) given some input features. In the example above, given features such as age, gender, and education level, the network learns to predict the mean and standard deviation of the expected salary distribution. Predicting a distribution has several advantages over predicting a single value; for example, it gives uncertainty bounds for the prediction. This is the “Bayesian” approach to a regression problem. Here is a good example of predicting a distribution for each expected continuous value:
The following figure shows the expected value distribution for each predicted instance:
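The article does not say how such a network is trained. A common choice, assumed here, is to minimize the negative log-likelihood of the observed targets under the predicted distribution. The sketch below computes that quantity for a normal distribution; `gaussian_nll` is a hypothetical helper and the salary figures are made up.

```python
import numpy as np

def gaussian_nll(y, mean, sd):
    """Average negative log-likelihood of y under N(mean, sd^2)."""
    y = np.asarray(y, dtype=float)
    mean = np.asarray(mean, dtype=float)
    sd = np.asarray(sd, dtype=float)
    return float(np.mean(0.5 * np.log(2.0 * np.pi * sd ** 2)
                         + 0.5 * ((y - mean) / sd) ** 2))

# Made-up salaries (in thousands) and two candidate predictions.
y_true = [50.0, 52.0, 48.0]
good = gaussian_nll(y_true, mean=[50.0] * 3, sd=[2.0] * 3)
overconfident = gaussian_nll(y_true, mean=[60.0] * 3, sd=[0.5] * 3)

print(good < overconfident)
```

A prediction that is wrong and also claims a small standard deviation is penalized heavily, which is what pushes the network to report honest uncertainty.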
Mixture density networks
Finally, back to the main topic: the goal of a mixture density network is to learn to output the parameters of every distribution in the mixture that makes up the general distribution (here, the mean, the standard deviation, and Pi). The new parameter “Pi” is the mixing parameter; it gives the weight, or probability, of a given distribution in the final mixture.
The final result looks like this:
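To make the three groups of outputs concrete, here is a minimal sketch of an MDN output head in NumPy. It assumes normal components, a softmax to turn raw scores into Pi, and an exponential to keep the standard deviation positive. These are common choices, not necessarily what the MDN class used later does, and the names `mdn_head` and `mixture_nll` are hypothetical.

```python
import numpy as np

def mdn_head(raw, n_mixtures):
    """Split raw outputs of shape (n, 3*k) into pi, mean and sd, each (n, k)."""
    logits, mean, log_sd = np.split(np.asarray(raw, dtype=float), 3, axis=1)
    assert logits.shape[1] == n_mixtures
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    pi = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)  # softmax
    sd = np.exp(log_sd)  # the exponential keeps the standard deviation positive
    return pi, mean, sd

def mixture_nll(y, pi, mean, sd):
    """Average negative log-likelihood of y under the normal mixture."""
    y = np.asarray(y, dtype=float).reshape(-1, 1)
    comp = np.exp(-0.5 * ((y - mean) / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))
    return float(-np.mean(np.log(np.sum(pi * comp, axis=1))))

# Two inputs, two components: columns are [logits | means | log-sds].
raw = np.array([[0.0, 0.0, -1.0, 1.0, 0.0, 0.0],
                [2.0, 0.0, -1.0, 1.0, 0.0, 0.0]])
pi, mean, sd = mdn_head(raw, n_mixtures=2)
loss = mixture_nll([1.0, -1.0], pi, mean, sd)

print(pi.round(3), loss)
```

Each row of `pi` sums to 1, so it can be read as the probability that the target for that input comes from each component.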
Example 1: MDN class on univariate data
With the definitions and theory covered, let's move on to the code:
import numpy as np
import pandas as pd
from mdn_model import MDN
from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.kernel_ridge import KernelRidge
plt.style.use('ggplot')
Generate the famous “half moons” dataset:
X, y = make_moons(n_samples=2500, noise=0.03)
y = X[:, 1].reshape(-1,1)
X = X[:, 0].reshape(-1,1)
x_scaler = StandardScaler()
y_scaler = StandardScaler()
X = x_scaler.fit_transform(X)
y = y_scaler.fit_transform(y)
plt.scatter(X, y, alpha = 0.3)
Plot the density distribution of the target value (y):
sns.kdeplot(y.ravel(), shade=True)
Looking at the data, we can see two overlapping clusters:
This is a nice multimodal distribution (general distribution). Let's try a standard linear regression on this dataset to predict y from X:
model = LinearRegression()
model.fit(X.reshape(-1,1), y.reshape(-1,1))
y_pred = model.predict(X.reshape(-1,1))
plt.scatter(X, y, alpha = 0.3)
plt.scatter(X,y_pred)
plt.title('Linear Regression')
sns.kdeplot(y_pred.ravel(), shade=True, alpha = 0.15, label = 'Linear Pred dist')
sns.kdeplot(y.ravel(), shade=True, label = 'True dist')
The result is clearly poor! Now let's try a nonlinear model (kernel ridge regression with a radial basis function kernel):
model = KernelRidge(kernel = 'rbf')
model.fit(X, y)
y_pred = model.predict(X)
plt.scatter(X, y, alpha = 0.3)
plt.scatter(X,y_pred)
plt.title('Non Linear Regression')
sns.kdeplot(y_pred.ravel(), shade=True, alpha = 0.15, label = 'NonLinear Pred dist')
sns.kdeplot(y.ravel(), shade=True, label = 'True dist')
The result is still not satisfactory, but it is much better than the linear regression above.
The main reason both models fail is that the same X maps to several different y values. More precisely, there appears to be more than one possible distribution of y for the same X. A regression model only looks for the single function that minimizes the error and does not account for a mixture of densities. The X values in the middle have no unique Y solution; they have two possible solutions, which causes the problems above.
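A small numeric sketch shows why minimizing the error fails here. The target values below are made up: two tight clusters observed at the same X. The prediction that minimizes the mean squared error is the mean of the targets, which lands between the clusters, far from every observed point.

```python
import numpy as np

# Made-up targets observed at one X value: two tight clusters.
y_at_x = np.array([-1.02, -0.98, -1.00, 0.99, 1.01, 1.00])

# Scan candidate predictions and keep the one with the lowest squared error.
candidates = np.linspace(-1.5, 1.5, 301)
mse = [np.mean((y_at_x - c) ** 2) for c in candidates]
best = float(candidates[int(np.argmin(mse))])

# Distance from the best prediction to the closest observed target.
gap = float(np.min(np.abs(y_at_x - best)))

print(round(best, 2), round(gap, 2))
```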
Now let's try an MDN model, using a custom Python MDN class that is quick and easy to use, with an sklearn-like “fit-predict” interface. If you want to use it yourself, here is the link to the Python code (please note: this MDN class is experimental and has not been widely tested): https://github.com/CoteDave/b...
To use this class, you need sklearn, tensorflow probability, Tensorflow < 2, umap, and hdbscan (used by the class's custom visualization functions).
EPOCHS = 10000
BATCH_SIZE=len(X)
model = MDN(n_mixtures = -1,
            dist = 'laplace',
            input_neurons = 1000,
            hidden_neurons = [25],
            gmm_boost = False,
            optimizer = 'adam',
            learning_rate = 0.001,
            early_stopping = 250,
            tf_mixture_family = True,
            input_activation = 'relu',
            hidden_activation = 'leaky_relu')
model.fit(X, y, epochs = EPOCHS, batch_size = BATCH_SIZE)
The parameters of the class are summarized below:
· n_mixtures: the number of distributions the MDN mixes. If set to -1, the best number of mixtures is found “automatically” using a Gaussian mixture model (GMM) and an HDBSCAN model on X and y.
· dist: the type of distribution used in the mixture. Currently there are two options: “normal” or “laplace”. (Based on some experiments, the Laplace distribution gives better results than the normal distribution.)
· input_neurons: the number of neurons in the input layer of the MDN.
· hidden_neurons: the hidden layer architecture of the MDN, given as a list with the number of neurons in each hidden layer. This parameter lets you choose both the number of hidden layers and the number of neurons per layer.
· gmm_boost: Boolean. If set to True, cluster features are added to the dataset.
· optimizer: the optimization algorithm to use.
· learning_rate: the learning rate of the optimization algorithm.
· early_stopping: used to avoid overfitting during training. It sets the number of epochs without change in the monitored metric after which training stops.
· tf_mixture_family: Boolean. If set to True, the tf_mixture family is used (recommended): the Mixture object implements batched mixture distributions.
· input_activation: the activation function of the input layer.
· hidden_activation: the activation function of the hidden layers.
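The early_stopping parameter follows the usual patience idea. The sketch below is a generic version of that rule, not the code of the MDN class; the function name `stopping_epoch` and the loss values are hypothetical.

```python
def stopping_epoch(losses, patience):
    """Return the epoch index at which training stops, or None if it never does.

    Training stops once `patience` consecutive epochs pass without the loss
    improving on the best value seen so far.
    """
    best = float("inf")
    epochs_without_gain = 0
    for epoch, loss in enumerate(losses):
        if loss < best:
            best = loss
            epochs_without_gain = 0
        else:
            epochs_without_gain += 1
            if epochs_without_gain >= patience:
                return epoch
    return None

# Made-up loss curve: it improves for three epochs, then plateaus.
history = [1.0, 0.8, 0.7, 0.7, 0.71, 0.7, 0.69]
stopped_at = stopping_epoch(history, patience=3)

print(stopped_at)
```

With a larger patience the same curve would keep training, because the small improvement at the last epoch resets the counter.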
Now that the MDN model has been fitted to the data, sample from the mixture density distribution and plot the probability density function:
model.plot_distribution_fit(n_samples_batch = 1)
Our MDN model fits the true general distribution very well! Next, break the final mixture distribution down into its individual distributions to see what it looks like:
model.plot_all_distribution_fit(n_samples_batch = 1)
Use the learned mixture distribution to generate new Y samples and compare them with the real samples:
model.plot_samples_vs_true(X, y, alpha = 0.2)
The generated samples are very close to the real data. Given X, you can also generate several batches of samples to compute statistics such as quantiles and the mean:
generated_samples = model.sample_from_mixture(X, n_samples_batch = 10)
generated_samples
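Once several batches are available, the statistics are simple reductions over the batch axis. The sketch below uses a stand-in array, because the exact shape returned by `sample_from_mixture` is not documented here; the shape (batches, inputs) is an assumption.

```python
import numpy as np

# Stand-in for generated samples: 10 batches for 4 inputs (assumed shape).
rng = np.random.default_rng(1)
samples = rng.normal(loc=[0.0, 1.0, 2.0, 3.0], scale=0.1, size=(10, 4))

# Per-input statistics computed across the batch axis.
sample_mean = samples.mean(axis=0)
lower, median, upper = np.quantile(samples, [0.025, 0.5, 0.975], axis=0)

print(sample_mean.round(2))
print(lower.round(2), upper.round(2))
```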
Plot the mean of each learned distribution along with its mixing weight (pi):
plt.scatter(X, y, alpha = 0.2)
model.plot_predict_dist(X, with_weights = True, size = 250)
Since we have the mean and standard deviation of each distribution, we can also plot the full uncertainty. For example, plot the means with a 95% confidence interval:
plt.scatter(X, y, alpha = 0.2)
model.plot_predict_dist(X, q = 0.95, with_weights = False)
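For a single normal component, a 95% interval follows directly from the mean and the standard deviation. The sketch below assumes normal components (the quantiles of a Laplace component differ) and reuses the pizza example, with a standard deviation of 5 inferred from the 20 to 40 minute range. How `plot_predict_dist` computes its band is not shown in the article, and `central_interval` is a hypothetical helper.

```python
from statistics import NormalDist

def central_interval(mean, sd, q=0.95):
    """Central interval containing a fraction q of a normal distribution."""
    dist = NormalDist(mu=mean, sigma=sd)
    tail = (1.0 - q) / 2.0
    return dist.inv_cdf(tail), dist.inv_cdf(1.0 - tail)

# Pizza example: mean 30 minutes, standard deviation 5 (assumed).
low, high = central_interval(30.0, 5.0)

print(round(low, 1), round(high, 1))
```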
Combining the distributions: when several y distributions exist for the same X, we use the highest Pi value to select the most probable one:
Y_preds = for each X, the mean of the distribution with the highest probability / weight (Pi parameter)
plt.scatter(X, y, alpha = 0.3)
model.plot_predict_best(X)
This approach is not ideal, because the data clearly contains two distinct, overlapping clusters with almost equal density, so the error will be higher than with a standard regression model. It may also mean that the dataset is missing an important feature that would separate the clusters in a higher dimension.
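The highest-Pi strategy described above amounts to an argmax over the mixing weights. The arrays below are made-up outputs for three inputs and two components; they are not produced by the MDN class.

```python
import numpy as np

# Made-up mixture outputs for 3 inputs and 2 components.
pi = np.array([[0.35, 0.65], [0.90, 0.10], [0.50, 0.50]])
means = np.array([[1.44, -0.84], [0.20, 1.10], [-0.30, 0.70]])

# For each row, keep the mean of the component with the largest weight.
best_component = pi.argmax(axis=1)
y_best = means[np.arange(len(pi)), best_component]

print(y_best)
```

The last row shows the weakness: with equal weights the choice is arbitrary (argmax picks the first component), and the other equally likely solution is discarded.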
We can also combine the Pi parameters with the means of all distributions:
· Y_preds = (mean_1 * Pi1) + (mean_2 * Pi2)
plt.scatter(X, y, alpha = 0.3)
model.plot_predict_mixed(X)
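The weighted combination is the expected value of the mixture. A minimal sketch with made-up weights and means:

```python
import numpy as np

# Made-up mixture outputs for 2 inputs and 2 components.
pi = np.array([[0.35, 0.65], [0.90, 0.10]])
means = np.array([[1.44, -0.84], [0.20, 1.10]])

# Expected value of the mixture: sum of mean_k * Pi_k over the components.
y_mixed = np.sum(pi * means, axis=1)

print(y_mixed.round(3))
```

For the first input the result falls between the two component means, close to neither of them.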
If we add the 95% confidence interval:
This option gives almost the same result as the nonlinear regression model: it blends everything to minimize the distance between the points and the function. In this very particular case, my preferred option is to assume that X has several Y solutions in some regions of the data, and that only one of the mixtures applies in the other regions:
For example, when X = 0, there can be two different Y solutions, one from each mixture. When X = -1.5, there is only one Y solution, from mixture 1. Depending on the use case or business context, the presence of several solutions for the same X can trigger an action or a decision.
With this option, rows are duplicated wherever the distributions overlap (that is, when both mixing probabilities are >= a given probability threshold):
plt.scatter(X, y, alpha = 0.3)
model.plot_predict_with_overlaps(X)
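The row duplication can be sketched as follows. This is an illustration of the idea, not the code behind `plot_predict_with_overlaps`; the function below, its threshold of 0.3, and the input values are all hypothetical.

```python
import numpy as np

def predict_with_overlaps(x, pi, means, threshold=0.3):
    """Return one (x, y, pi, component) row per component passing the threshold."""
    rows = []
    for i in range(len(x)):
        keep = np.flatnonzero(pi[i] >= threshold)
        if keep.size == 0:  # fall back to the most likely component
            keep = np.array([pi[i].argmax()])
        for k in keep:
            rows.append((float(x[i]), float(means[i, k]), float(pi[i, k]), int(k)))
    return rows

# Made-up outputs: the first input has two plausible solutions, the second one.
x = np.array([-0.28, -1.50])
pi = np.array([[0.35, 0.65], [0.98, 0.02]])
means = np.array([[1.44, -0.84], [0.10, 2.00]])
rows = predict_with_overlaps(x, pi, means)

for row in rows:
    print(row)
```

Two inputs produce three rows, which is the same mechanism that makes the prediction table longer than the original dataset.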
With the 95% confidence interval:
The dataset grows from 2,500 rows to 4,063, and the final prediction dataset looks like this:
In this table, when X = -0.276839, Y can be 1.43926 (mixture_0, with probability 0.351525), but it can also be -0.840593 (mixture_1, with probability 0.648475).
Instances with several distributions also carry important information: something is going on in the data that may deserve further analysis. It could be a data quality problem, or it could indicate that an important feature is missing from the dataset!
“Traffic scene prediction is a good example of where mixture density networks can be used. In traffic scene prediction, we need a distribution over the behaviors an agent can exhibit; for example, an agent can turn left, turn right, or go straight. A mixture density network can therefore be used to represent these “behaviors”, where a behavior consists of a probability and a trajectory ((x, y) coordinates up to some time horizon in the future).”
Example 2: Multivariate regression with MDN
Finally, does the MDN do well on multivariate regression?
We will use a dataset with the following columns:
· age: age of the primary beneficiary
· gender: gender of the insurance contractor (female, male)
· bmi: body mass index, which indicates whether body weight is relatively high or low for a given height; an objective index of body weight (kg / m^2) based on the ratio of weight to height, ideally 18.5 to 24.9
· children: number of children covered by health insurance / number of dependants
· smoker: whether the beneficiary smokes
· region: the beneficiary's residential area in the United States (northeast, southeast, southwest, northwest)
· charges: individual medical costs billed by health insurance. This is the target we want to predict.
The problem statement is: can insurance costs (charges) be predicted accurately?
Now let's import the dataset:
"""
#################
# 2-IMPORT DATA #
#################
"""
dataset = pd.read_csv('insurance_clean.csv', sep = ';')
##### BASIC FEATURE ENGINEERING
dataset['age2'] = dataset['age'] * dataset['age']
dataset['BMI30'] = np.where(dataset['bmi'] > 30, 1, 0)
dataset['BMI30_SMOKER'] = np.where((dataset['bmi'] > 30) & (dataset['smoker_yes'] == 1), 1, 0)
"""
######################
# 3-DATA PREPARATION #
######################
"""
###### SPLIT TRAIN TEST
from sklearn.model_selection import train_test_split
X = dataset[dataset.columns.difference(['charges'])]
y = dataset[['charges']]
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.25,
                                                    stratify = X['smoker_yes'],
                                                    random_state=0)
test_index = y_test.index.values
train_index = y_train.index.values
features = X.columns.tolist()
##### FEATURE SCALING
from sklearn.preprocessing import StandardScaler
x_scaler = StandardScaler()
y_scaler = StandardScaler()
X_train = x_scaler.fit_transform(X_train)
#X_calib = x_scaler.transform(X_calib)
X_test = x_scaler.transform(X_test)
y_train = y_scaler.fit_transform(y_train)
#y_calib = y_scaler.transform(y_calib)
y_test = y_scaler.transform(y_test)
y_test_scaled = y_test.copy()
The data is ready for training:
EPOCHS = 10000
BATCH_SIZE=len(X_train)
model = MDN(n_mixtures = -1, #-1
            dist = 'laplace',
            input_neurons = 1000, #1000
            hidden_neurons = [], #25
            gmm_boost = False,
            optimizer = 'adam',
            learning_rate = 0.0001, #0.00001
            early_stopping = 200,
            tf_mixture_family = True,
            input_activation = 'relu',
            hidden_activation = 'leaky_relu')
model.fit(X_train, y_train, epochs = EPOCHS, batch_size = BATCH_SIZE)
After training, predict the test dataset using the “best mixing probability (Pi parameter)” strategy and plot the results (y_pred vs y_test):
y_pred = model.predict_best(X_test, q = 0.95, y_scaler = y_scaler)
model.plot_pred_fit(y_pred, y_test, y_scaler = y_scaler)
model.plot_pred_vs_true(y_pred, y_test, y_scaler = y_scaler)
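Fit quality for predictions like these is usually summarized with R2 and MAE. As a reminder of what the two metrics measure, here is a minimal NumPy sketch of their standard definitions; the numbers are made up and unrelated to the insurance data, and the function names are chosen for illustration.

```python
import numpy as np

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between targets and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))

# Made-up values, in the original (unscaled) units.
y_true = [1000.0, 2000.0, 3000.0, 4000.0]
y_hat = [1100.0, 1900.0, 3200.0, 3800.0]
r2 = r2_score(y_true, y_hat)
mae = mean_absolute_error(y_true, y_hat)

print(r2, mae)
```

Both metrics are easier to interpret on the original scale of the target, which is why the scaler is passed to the prediction call so the values can be converted back.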
R2 is 89.09% and MAE is 882.54, so the MDN performs very well. Let's plot the fitted distribution against the true distribution:
model.plot_distribution_fit(n_samples_batch = 1)
It's almost exactly the same! Let's decompose the mixture model to see what is going on:
A total of six different distributions are mixed.
Generate multivariate samples from the fitted mixture model (PCA is applied to visualize the results in 2D):
model.plot_samples_vs_true(X_test, y_test, alpha = 0.35, y_scaler = y_scaler)
The generated samples are very close to the real samples! If we want, we can also predict from each distribution:
y_pred_dist = model.predict_dist(X_test, q = 0.95, y_scaler = y_scaler)
y_pred_dist
Summary
· Compared with classical linear or nonlinear ML models, the MDN performed well on the univariate regression dataset, in which two clusters overlap and a single X can have more than one Y output.
· The MDN also did well on the multivariate regression problem and can compete with popular models such as XGBoost.
· The MDN is an excellent and unique tool in ML that can solve specific problems other models cannot (it can learn from data drawn from a mixture of distributions).
· Because the MDN learns distributions, it is also possible to quantify the uncertainty of predictions or to generate new samples from the learned distributions.
This article contains a lot of code; the complete notebook is available here and can be downloaded and run directly:
https://www.overfit.cn/post/20245a8446ae43e3982b48e4320991ab
Author: Dave Cote, M.Sc.