
Flow-Based Deep Generative Models

2022-06-28 03:10:00 Ghost road 2022

1 Introduction

So far, neither of the two popular generative models, $\mathrm{GAN}$ and $\mathrm{VAE}$, can exactly learn the probability distribution $p(\mathbf{x})$ of real data $\mathbf{x}\in\mathcal{D}$. Taking latent-variable generative models as an example, computing the integral $p(\mathbf{x})=\int p(\mathbf{x}\mid\mathbf{z})\,d\mathbf{z}$ requires traversing all latent variables $\mathbf{z}$, which is very difficult and impractical. Flow-based generative models overcome this problem with the help of normalizing flows, a powerful tool for estimating probability distributions. A good estimate of $p(\mathbf{x})$ enables many tasks, such as data generation, estimating the probability of future events, and data-sample augmentation.

2 Types of Generative Models

There are currently three main types of generative models: GAN-based, VAE-based, and flow-based models:

  • Generative adversarial network (GAN): a GAN consists of two neural networks, a generator and a discriminator. The generator learns to produce realistic data samples $\mathbf{x}^{\prime}$ from noise $\mathbf{z}$, while the discriminator learns to distinguish real samples $\mathbf{x}$ from generated samples $\mathbf{x}^{\prime}$. During training, the two networks compete in a $\min\text{-}\max$ game and improve each other.
  • Variational autoencoder (VAE): a VAE also consists of two neural networks, an encoder and a decoder. The encoder maps a data sample $\mathbf{x}$ to a latent vector $\mathbf{z}$, and the decoder maps the latent vector $\mathbf{z}$ back to a sample $\mathbf{x}^{\prime}$. A VAE optimizes the data log-likelihood only approximately, by maximizing a variational lower bound.
  • Flow-based generative model: a flow-based generative model is built from a sequence of invertible transformations. It allows the model to learn the data distribution $p(\mathbf{x})$ exactly; its loss function is the negative log-likelihood.

3 Preliminaries

Before diving into flow-based generative models, three key mathematical concepts are needed: the Jacobian matrix, the determinant, and the change-of-variables theorem.

3.1 Jacobian Matrix and Determinant

Given a mapping function $f:\mathbb{R}^n\rightarrow\mathbb{R}^m$ that maps an $n$-dimensional input vector $\mathbf{x}$ to an $m$-dimensional output vector, the Jacobian matrix collects all first-order partial derivatives of $f$ with respect to the components of $\mathbf{x}$:
$$\mathbf{J}=\begin{bmatrix}\frac{\partial f_1}{\partial x_1}&\cdots&\frac{\partial f_1}{\partial x_n}\\ \vdots&\ddots&\vdots\\ \frac{\partial f_m}{\partial x_1}&\cdots&\frac{\partial f_m}{\partial x_n}\end{bmatrix}$$

The determinant is defined for a square matrix and yields a real-valued scalar. Its absolute value can be interpreted as a measure of how much multiplication by the matrix expands or contracts space. The determinant of an $n\times n$ matrix $M$ is

$$\mathrm{det}(M)=\mathrm{det}\begin{bmatrix}a_{11}&a_{12}&\cdots&a_{1n}\\ a_{21}&a_{22}&\cdots&a_{2n}\\ \vdots&\vdots&\ddots&\vdots\\ a_{n1}&a_{n2}&\cdots&a_{nn}\end{bmatrix}=\sum_{j_1j_2\cdots j_n}(-1)^{\tau(j_1j_2\cdots j_n)}a_{1j_1}a_{2j_2}\cdots a_{nj_n}$$

where the index $j_1j_2\cdots j_n$ runs over all permutations of the set $\{1,\cdots,n\}$, so the sum has $n!$ terms, and $\tau$ denotes the sign of the permutation. A matrix $M$ is non-invertible exactly when its determinant equals $0$, and vice versa. The product rule for determinants is $\mathrm{det}(AB)=\mathrm{det}(A)\cdot\mathrm{det}(B)$.
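As a quick numerical illustration (a minimal sketch in PyTorch, not part of the original post; the mapping `f` below is a made-up example), the Jacobian of a mapping and the log-absolute-determinant can be computed as follows:

```python
import torch
from torch.autograd.functional import jacobian

# A toy mapping f: R^3 -> R^3 (hypothetical example, chosen only for illustration).
def f(x):
    return torch.stack([x[0] + x[1] ** 2,
                        x[1] * 3.0,
                        torch.tanh(x[2]) + x[0]])

x = torch.tensor([0.5, -1.0, 2.0])
J = jacobian(f, x)                          # 3x3 Jacobian of f evaluated at x
sign, logabsdet = torch.linalg.slogdet(J)   # log|det J|, computed in a numerically stable way

print(J)
print(sign * torch.exp(logabsdet))          # det(J) itself
```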

3.2 Change-of-Variables Theorem

Given a univariate random variable $z$ with known probability distribution $z\sim\pi(z)$, suppose we construct a new random variable $x$ through a mapping $f$, i.e. $x=f(z)$, where $f$ is invertible with $z=f^{-1}(x)$. The probability distribution of the new random variable is derived as follows:

$$\int p(x)\,dx=\int\pi(z)\,dz=1$$

$$p(x)=\pi(z)\left|\frac{dz}{dx}\right|=\pi(f^{-1}(x))\left|\frac{df^{-1}}{dx}\right|=\pi(f^{-1}(x))\,\bigl|(f^{-1})^{\prime}(x)\bigr|$$

By definition, the integral $\int\pi(z)\,dz$ is the sum of infinitely many rectangles of infinitesimal width $\Delta z$, where the height of the rectangle at position $z$ is the value of the density function $\pi(z)$. Substituting variables via $z=f^{-1}(x)$ gives $\frac{\Delta z}{\Delta x}=(f^{-1}(x))^{\prime}$, i.e. $\Delta z=(f^{-1}(x))^{\prime}\,\Delta x$, so $|(f^{-1}(x))^{\prime}|$ is the ratio between the rectangle areas defined in the two different coordinate systems. The multivariate version is as follows:
$$\mathbf{z}\sim\pi(\mathbf{z}),\quad\mathbf{x}=f(\mathbf{z}),\quad\mathbf{z}=f^{-1}(\mathbf{x})$$

$$p(\mathbf{x})=\pi(\mathbf{z})\cdot\left|\mathrm{det}\left(\frac{d\mathbf{z}}{d\mathbf{x}}\right)\right|=\pi(f^{-1}(\mathbf{x}))\cdot\left|\mathrm{det}\left(\frac{df^{-1}}{d\mathbf{x}}\right)\right|$$

where $\mathrm{det}\left(\frac{\partial f}{\partial\mathbf{z}}\right)$ is the determinant of the Jacobian matrix of $f$.
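A small sanity check of the theorem (again a sketch of my own, assuming the toy invertible map $f(z)=2z+1$): push Gaussian samples through $f$ and compare their empirical density with the density predicted by the change-of-variables formula.

```python
import numpy as np

# Base density pi(z): standard normal. Invertible map x = f(z) = 2z + 1 (a toy choice).
pi = lambda z: np.exp(-0.5 * z ** 2) / np.sqrt(2 * np.pi)
f_inv = lambda x: (x - 1.0) / 2.0           # f^{-1}(x)
df_inv = 0.5                                # |d f^{-1} / dx|

# Change of variables: p(x) = pi(f^{-1}(x)) * |d f^{-1} / dx|.
xs = np.linspace(-6.0, 8.0, 200)
p_x = pi(f_inv(xs)) * abs(df_inv)

# Monte-Carlo check: transform samples of z and compare the histogram with p_x.
z = np.random.default_rng(0).standard_normal(200_000)
x = 2.0 * z + 1.0
hist, edges = np.histogram(x, bins=80, range=(-6.0, 8.0), density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
print(np.max(np.abs(np.interp(centers, xs, p_x) - hist)))   # close to 0
```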

4 Normalizing Flows

Good density estimation has direct applications in many machine learning problems, but it is very difficult. For example, since we need to run back-propagation through deep learning models, the distribution of the latent variable (the posterior $p(\mathbf{z}\mid\mathbf{x})$) must be simple enough that its density and derivatives can be computed easily and efficiently. This is why the Gaussian distribution is so often used in latent-variable generative models, even though most real-world distributions are far more complex than a Gaussian. Normalizing flow models provide a better and more powerful way to approximate distributions: a normalizing flow transforms a simple distribution into a complex one by applying a sequence of invertible transformation functions. Flowing through this chain of transformations, we repeatedly apply the change-of-variables theorem to the new variable and eventually obtain the probability distribution of the final target variable.

Each transformation step $f_i$ of such a flow corresponds to the following formulas:
$$\begin{aligned}\mathbf{z}_{i-1}&\sim p_{i-1}(\mathbf{z}_{i-1})\\ \mathbf{z}_i&=f_i(\mathbf{z}_{i-1}),\quad\mathbf{z}_{i-1}=f_i^{-1}(\mathbf{z}_i)\\ p_i(\mathbf{z}_i)&=p_{i-1}(f_i^{-1}(\mathbf{z}_i))\cdot\left|\mathrm{det}\frac{df_i^{-1}}{d\mathbf{z}_i}\right|\end{aligned}$$

From this, the probability distribution $p_i(\mathbf{z}_i)$ can be expressed as

$$\begin{aligned}p_i(\mathbf{z}_i)&=p_{i-1}(f_i^{-1}(\mathbf{z}_i))\cdot\left|\mathrm{det}\left(\frac{df_i^{-1}}{d\mathbf{z}_i}\right)\right|\\&=p_{i-1}(\mathbf{z}_{i-1})\cdot\left|\mathrm{det}\left(\frac{df_i}{d\mathbf{z}_{i-1}}\right)^{-1}\right|\\&=p_{i-1}(\mathbf{z}_{i-1})\cdot\left|\mathrm{det}\left(\frac{df_i}{d\mathbf{z}_{i-1}}\right)\right|^{-1}\\ \log p_i(\mathbf{z}_i)&=\log p_{i-1}(\mathbf{z}_{i-1})-\log\left|\mathrm{det}\left(\frac{df_i}{d\mathbf{z}_{i-1}}\right)\right|\end{aligned}$$

The derivation above uses the inverse function theorem: if $y=f(x)$ and $x=f^{-1}(y)$, then
$$\frac{df^{-1}(y)}{dy}=\frac{dx}{dy}=\left(\frac{dy}{dx}\right)^{-1}=\left(\frac{df(x)}{dx}\right)^{-1}$$

It also uses the corresponding property of determinants: the determinant of the inverse of an invertible matrix is the reciprocal of the determinant of the matrix, i.e. $\mathrm{det}(M^{-1})=(\mathrm{det}(M))^{-1}$, because $\mathrm{det}(M)\cdot\mathrm{det}(M^{-1})=\mathrm{det}(M\cdot M^{-1})=\mathrm{det}(I)=1$. Given such a chain of probability density functions, we know the relationship between each pair of consecutive variables, so we can expand the log-likelihood of the output $\mathbf{x}$ step by step until we trace back to the initial distribution $\mathbf{z}_0$:

$$\begin{aligned}\mathbf{x}=\mathbf{z}_K&=f_K\circ f_{K-1}\circ\cdots\circ f_1(\mathbf{z}_0)\\ \log p(\mathbf{x})=\log\pi_K(\mathbf{z}_K)&=\log\pi_{K-1}(\mathbf{z}_{K-1})-\log\left|\det\left(\frac{df_K}{d\mathbf{z}_{K-1}}\right)\right|\\&=\log\pi_{K-2}(\mathbf{z}_{K-2})-\log\left|\det\left(\frac{df_{K-1}}{d\mathbf{z}_{K-2}}\right)\right|-\log\left|\det\left(\frac{df_K}{d\mathbf{z}_{K-1}}\right)\right|\\&=\cdots\\&=\log\pi_0(\mathbf{z}_0)-\sum_{i=1}^K\log\left|\det\left(\frac{df_i}{d\mathbf{z}_{i-1}}\right)\right|\end{aligned}$$

The path traversed by the random variables $\mathbf{z}_i=f_i(\mathbf{z}_{i-1})$ is called a flow, and the full chain formed by the successive distributions $\pi_i$ is called a normalizing flow. For these equations to be computable, a transformation function $f_i$ must satisfy two properties: it must be easy to invert, and the determinant of its Jacobian must be easy to compute.
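The log-likelihood accumulation above maps directly to code. The sketch below (an illustrative construction, not from the original post) composes a few element-wise affine flows $f_i(\mathbf{z})=\mathbf{z}\odot e^{\mathbf{s}_i}+\mathbf{t}_i$, whose log-Jacobian-determinant is simply $\sum_j s_{i,j}$:

```python
import torch

class AffineFlow(torch.nn.Module):
    """Element-wise affine transform z -> z * exp(s) + t with trainable s, t."""
    def __init__(self, dim):
        super().__init__()
        self.s = torch.nn.Parameter(torch.zeros(dim))
        self.t = torch.nn.Parameter(torch.zeros(dim))

    def forward(self, z):
        x = z * torch.exp(self.s) + self.t
        return x, self.s.sum()            # log|det df/dz| for an element-wise affine map

    def inverse(self, x):
        z = (x - self.t) * torch.exp(-self.s)
        return z, -self.s.sum()           # log|det df^{-1}/dx|

dim, K = 2, 4
flows = [AffineFlow(dim) for _ in range(K)]
base = torch.distributions.Normal(torch.zeros(dim), torch.ones(dim))   # pi_0(z_0)

def log_prob(x):
    # log p(x) = log pi_0(z_0) + sum_i log|det df_i^{-1}/dz_i|
    log_det_sum = 0.0
    for flow in reversed(flows):
        x, log_det = flow.inverse(x)
        log_det_sum = log_det_sum + log_det
    return base.log_prob(x).sum(-1) + log_det_sum

x = torch.randn(5, dim)
print(log_prob(x))
```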

5 Models with Normalizing Flows

With normalizing flows, computing the exact log-likelihood of the input data becomes tractable, so the training loss of a flow-based generative model is simply the negative log-likelihood over the training set $\mathcal{D}$:

$$\mathcal{L}(\mathcal{D})=-\frac{1}{|\mathcal{D}|}\sum_{\mathbf{x}\in\mathcal{D}}\log p(\mathbf{x})$$
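Reusing the hypothetical `flows` and `log_prob` helper from the previous sketch, and assuming some data loader `sample_training_batch()` (a placeholder, not defined here), the negative log-likelihood can be minimized in the usual way:

```python
import torch

params = [p for flow in flows for p in flow.parameters()]
optimizer = torch.optim.Adam(params, lr=1e-3)

for step in range(1000):
    batch = sample_training_batch()      # assumed data loader returning a (batch, dim) tensor
    loss = -log_prob(batch).mean()       # L(D) = -1/|D| * sum_x log p(x)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```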

5.1 RealNVP

The RealNVP model implements a normalizing flow by stacking a sequence of invertible bijective transformations. In each bijection $f:\mathbf{x}\rightarrow\mathbf{y}$, called an affine coupling layer, the input dimensions are split into two parts:

  • the first $d$ dimensions remain unchanged;
  • dimensions $d+1$ to $D$ undergo an affine transformation ("scale and shift"), where the scale and shift parameters are functions of the first $d$ dimensions:
$$\begin{aligned}\mathbf{y}_{1:d}&=\mathbf{x}_{1:d}\\ \mathbf{y}_{d+1:D}&=\mathbf{x}_{d+1:D}\odot\exp(s(\mathbf{x}_{1:d}))+t(\mathbf{x}_{1:d})\end{aligned}$$
Here $s(\cdot)$ and $t(\cdot)$ are the scale and translation functions, both mapping $\mathbb{R}^d\rightarrow\mathbb{R}^{D-d}$, and the $\odot$ operator denotes the element-wise product.

Condition 1 of a normalizing flow, that the transformation be invertible, is very easy to satisfy in the RealNVP model; the inverse is given by

$$\left\{\begin{aligned}\mathbf{y}_{1:d}&=\mathbf{x}_{1:d}\\ \mathbf{y}_{d+1:D}&=\mathbf{x}_{d+1:D}\odot\exp(s(\mathbf{x}_{1:d}))+t(\mathbf{x}_{1:d})\end{aligned}\right.\iff\left\{\begin{aligned}\mathbf{x}_{1:d}&=\mathbf{y}_{1:d}\\ \mathbf{x}_{d+1:D}&=(\mathbf{y}_{d+1:D}-t(\mathbf{y}_{1:d}))\odot\exp(-s(\mathbf{y}_{1:d}))\end{aligned}\right.$$
Condition 2, that the Jacobian determinant be easy to compute, also holds for RealNVP: the Jacobian is a lower-triangular matrix,

$$\mathbf{J}=\begin{bmatrix}\mathbb{I}_d&\mathbf{0}_{d\times(D-d)}\\ \frac{\partial\mathbf{y}_{d+1:D}}{\partial\mathbf{x}_{1:d}}&\mathrm{diag}(\exp(s(\mathbf{x}_{1:d})))\end{bmatrix}$$

so the determinant is simply the product of the terms on the diagonal:

$$\mathrm{det}(\mathbf{J})=\prod_{j=1}^{D-d}\exp(s(\mathbf{x}_{1:d}))_j=\exp\left(\sum_{j=1}^{D-d}s(\mathbf{x}_{1:d})_j\right)$$

So far, the affine coupling layer looks well suited for building a normalizing flow. Even better, since computing $f^{-1}$ does not require inverting $s$ or $t$, and computing the Jacobian determinant does not involve the Jacobians of $s$ or $t$, these functions can be arbitrarily complex; both can be modeled with deep neural networks. In one affine coupling layer, some dimensions (channels) remain unchanged. To make sure every input has a chance to be changed, the model reverses the split ordering from layer to layer, so that a different subset of components stays fixed each time. In this alternating pattern, the set of units kept unchanged in one coupling layer is always modified in the next. Batch normalization helps train models with very deep stacks of coupling layers. In addition, RealNVP can operate in a multi-scale architecture to build more efficient models for large inputs. The multi-scale architecture applies several "sampling" operations to ordinary affine layers, including spatial checkerboard-pattern masking, a squeeze operation, and channel-wise masking.
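A minimal sketch of an affine coupling layer as described above (my own illustration, not the reference RealNVP code): the first $d$ dimensions pass through unchanged and parameterize the scale $s(\cdot)$ and shift $t(\cdot)$ applied to the remaining $D-d$ dimensions.

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    def __init__(self, D, d, hidden=64):
        super().__init__()
        self.d = d
        # s(.) and t(.) map R^d -> R^{D-d}; they can be arbitrarily complex networks.
        self.scale_net = nn.Sequential(nn.Linear(d, hidden), nn.ReLU(), nn.Linear(hidden, D - d))
        self.shift_net = nn.Sequential(nn.Linear(d, hidden), nn.ReLU(), nn.Linear(hidden, D - d))

    def forward(self, x):
        x1, x2 = x[:, :self.d], x[:, self.d:]
        s, t = self.scale_net(x1), self.shift_net(x1)
        y1, y2 = x1, x2 * torch.exp(s) + t
        log_det = s.sum(dim=1)                 # det J = exp(sum_j s_j)
        return torch.cat([y1, y2], dim=1), log_det

    def inverse(self, y):
        y1, y2 = y[:, :self.d], y[:, self.d:]
        s, t = self.scale_net(y1), self.shift_net(y1)
        x1, x2 = y1, (y2 - t) * torch.exp(-s)  # inverting never requires the inverse of s or t
        return torch.cat([x1, x2], dim=1), -s.sum(dim=1)

layer = AffineCoupling(D=6, d=3)
x = torch.randn(8, 6)
y, log_det = layer(x)
x_rec, _ = layer.inverse(y)
print(torch.allclose(x, x_rec, atol=1e-5))     # True: the coupling layer is exactly invertible
```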
The NICE model is the predecessor of RealNVP. The transformation in NICE is an affine coupling layer without the scale term, known as an additive coupling layer:

$$\left\{\begin{aligned}\mathbf{y}_{1:d}&=\mathbf{x}_{1:d}\\ \mathbf{y}_{d+1:D}&=\mathbf{x}_{d+1:D}+m(\mathbf{x}_{1:d})\end{aligned}\right.\iff\left\{\begin{aligned}\mathbf{x}_{1:d}&=\mathbf{y}_{1:d}\\ \mathbf{x}_{d+1:D}&=\mathbf{y}_{d+1:D}-m(\mathbf{y}_{1:d})\end{aligned}\right.$$

5.2 Glow

The Glow model extends the earlier reversible generative models NICE and RealNVP, and simplifies the architecture by replacing the reverse permutation of channel ordering with an invertible $1\times1$ convolution. One step of flow in Glow contains three sub-steps:

  • Activation normalization (actnorm): it performs an affine transformation using a scale and bias parameter per channel, similar to batch normalization but working with a batch size of $1$. The parameters are trainable, but are initialized so that the first mini-batch of data has mean $0$ and standard deviation $1$ after actnorm.
  • Invertible $1\times1$ convolution: between the coupling layers of RealNVP, the channel ordering is reversed so that all data dimensions get a chance to be changed. A $1\times1$ convolution with the same number of input and output channels is a generalization of any channel permutation. Suppose we have an invertible $1\times1$ convolution applied to an input tensor $\mathbf{h}\in\mathbb{R}^{h\times w\times c}$ with weight matrix $\mathbf{W}\in\mathbb{R}^{c\times c}$. The output is an $h\times w\times c$ tensor, written $f=\mathrm{conv2d}(\mathbf{h};\mathbf{W})$. To apply the change-of-variables theorem, we need the Jacobian determinant $\left|\mathrm{det}\left(\frac{\partial f}{\partial\mathbf{h}}\right)\right|$. Each entry $\mathbf{x}_{ij}$ of $\mathbf{h}$ (with $i=1,\cdots,h$ and $j=1,\cdots,w$) is a vector of $c$ channels, and each is multiplied by the weight matrix to obtain the corresponding entry $\mathbf{y}_{ij}$ of the output. The derivative of each entry is $\frac{\partial\mathbf{x}_{ij}\mathbf{W}}{\partial\mathbf{x}_{ij}}=\mathbf{W}$, and there are $h\times w$ such entries in total:
    $$\log\left|\det\left(\frac{\partial\,\mathrm{conv2d}(\mathbf{h};\mathbf{W})}{\partial\mathbf{h}}\right)\right|=\log\left(|\mathrm{det}(\mathbf{W})|^{h\cdot w}\right)=h\cdot w\cdot\log|\det(\mathbf{W})|$$
    Inverting the $1\times1$ convolution relies on the inverse matrix $\mathbf{W}^{-1}$. Since the weight matrix is relatively small, the cost of computing its determinant and inverse remains manageable (see the sketch after this list).
  • Affine coupling layer: the structure of Glow's affine coupling layer is the same as the affine coupling layer in RealNVP.
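The log-determinant contribution of the invertible $1\times1$ convolution can be checked numerically. The sketch below uses a random orthogonal $c\times c$ weight matrix for illustration only; the Glow paper additionally proposes an LU parameterization of $\mathbf{W}$ to cheapen the determinant, which is not shown here.

```python
import torch
import torch.nn.functional as F

h, w, c = 16, 16, 8
x = torch.randn(1, c, h, w)                        # NCHW tensor
W = torch.linalg.qr(torch.randn(c, c))[0]          # random orthogonal init keeps W invertible

y = F.conv2d(x, W.reshape(c, c, 1, 1))             # invertible 1x1 convolution over channels
log_det = h * w * torch.linalg.slogdet(W)[1]       # h * w * log|det W|

# Inverting the layer only needs the c x c matrix inverse, which is cheap for small c.
x_rec = F.conv2d(y, torch.inverse(W).reshape(c, c, 1, 1))
print(torch.allclose(x, x_rec, atol=1e-4), log_det.item())
```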

6 Autoregressive Flow Models

The autoregressive constraint is a way to model sequential data $\mathbf{x}=[x_1,\cdots,x_D]$: each output depends only on data observed in the past, never on future data. In other words, the probability of observing $x_i$ is conditioned on $x_1,\cdots,x_{i-1}$, and the product of these conditional probabilities gives the probability of observing the complete sequence:

$$p(\mathbf{x})=\prod_{i=1}^D p(x_i\mid x_1,\cdots,x_{i-1})=\prod_{i=1}^D p(x_i\mid x_{1:i-1})$$

6.1 MADE

MADE is a specially designed architecture that enforces the autoregressive property efficiently within an autoencoder. Instead of feeding the autoencoder inputs with different observation windows multiple times, MADE removes the contribution of certain hidden units by multiplying the weights element-wise with binary mask matrices, so that each input dimension is reconstructed only from the dimensions preceding it in a given ordering, in a single forward pass. Given a fully connected neural network with $L$ hidden layers, weight matrices $\mathbf{W}^1,\cdots,\mathbf{W}^L$, and an output-layer weight matrix $\mathbf{V}$, each dimension of the output should satisfy $\hat{x}_i=p(x_i\mid x_{1:i-1})$. Without the mask matrices, the forward pass of the network is:

$$\begin{aligned}\mathbf{h}^0&=\mathbf{x}\\ \mathbf{h}^l&=\mathrm{activation}^l(\mathbf{W}^l\mathbf{h}^{l-1}+\mathbf{b}^l)\\ \hat{\mathbf{x}}&=\sigma(\mathbf{V}\mathbf{h}^L+\mathbf{c})\end{aligned}$$

To zero out some connections between layers, each weight matrix is simply multiplied element-wise by a binary mask matrix. Each hidden node is assigned a random degree (connectivity integer) between $1$ and $D-1$; the value assigned to the $k$-th unit in the $l$-th layer is denoted $m_k^l$. The binary mask matrices are determined by comparing the degrees of nodes in adjacent layers element by element:

$$\begin{aligned}\mathbf{h}^l&=\mathrm{activation}^l((\mathbf{W}^l\odot\mathbf{M}^{\mathbf{W}^l})\mathbf{h}^{l-1}+\mathbf{b}^l)\\ \hat{\mathbf{x}}&=\sigma((\mathbf{V}\odot\mathbf{M}^{\mathbf{V}})\mathbf{h}^L+\mathbf{c})\\ M_{k^{\prime},k}^{\mathbf{W}^l}&=\mathbf{1}_{m_{k^{\prime}}^l\ge m_k^{l-1}}=\left\{\begin{array}{ll}1,&\mathrm{if}\ m_{k^{\prime}}^l\ge m_k^{l-1}\\ 0,&\mathrm{otherwise}\end{array}\right.\\ M_{d,k}^{\mathbf{V}}&=\mathbf{1}_{d>m_k^L}=\left\{\begin{array}{ll}1,&\mathrm{if}\ d>m_k^L\\ 0,&\mathrm{otherwise}\end{array}\right.\end{aligned}$$

A unit in the current layer can only connect to units in the previous layer whose assigned number is equal to or smaller than its own, and this type of dependency propagates through the network to the output layer. Once the numbers are assigned to all units and layers, the ordering of the input dimensions is fixed, and the conditional probabilities are produced relative to that ordering. To make sure every hidden unit is connected to the input and output layers through some path, $m_k^l$ is sampled to be equal to or greater than the smallest degree in the previous layer, $\min_{k^{\prime}}m_{k^{\prime}}^{l-1}$. A small example of the mask construction is sketched below.
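A small illustration of the mask construction (a sketch with arbitrary layer sizes, following the comparison rule above):

```python
import numpy as np

rng = np.random.default_rng(0)
D, hidden = 4, 6

# Degrees: inputs get 1..D in order; hidden units get random integers in [1, D-1];
# outputs reuse the input ordering 1..D.
m_in = np.arange(1, D + 1)
m_hid = rng.integers(low=1, high=D, size=hidden)         # values in {1, ..., D-1}
m_out = np.arange(1, D + 1)

# Hidden mask: connect hidden unit k' to input k iff m_hid[k'] >= m_in[k].
M_W = (m_hid[:, None] >= m_in[None, :]).astype(float)    # shape (hidden, D)
# Output mask: connect output d to hidden unit k iff m_out[d] > m_hid[k] (strict).
M_V = (m_out[:, None] > m_hid[None, :]).astype(float)    # shape (D, hidden)

# Output d therefore only depends on inputs 1..d-1, as required for p(x_d | x_{1:d-1}):
# entry (d, k) of the path-count matrix below is nonzero only for k < d.
print(M_V @ M_W)
```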

6.2 WaveNet

WaveNet consists of a stack of causal convolutions, a convolution operation designed to respect ordering: the prediction at a given timestamp can only consume data observed in the past and does not depend on the future. A causal convolution in WaveNet simply shifts the output by a number of timestamps into the future, so that the output is aligned with the last input element. One disadvantage of convolutional layers is that the receptive field is very limited: the output can hardly depend on inputs hundreds or thousands of time steps in the past, which may be a key requirement for modeling long sequences. WaveNet therefore uses dilated convolutions, where the kernel is applied to a subset of samples spread evenly over a much larger receptive field of the input. WaveNet uses a gated activation unit as its non-linearity, since it was found to work better than $\mathrm{ReLU}$ for modeling one-dimensional audio data, and a residual connection is applied after the gated activation. The formula is

$$\mathbf{z}=\tanh(\mathbf{W}_{f,k}\otimes\mathbf{x})\odot\sigma(\mathbf{W}_{g,k}\otimes\mathbf{x})$$

where $\mathbf{W}_{f,k}$ and $\mathbf{W}_{g,k}$ are the convolution filter and gate weight matrices of the $k$-th layer, respectively, and both are learnable.
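A sketch of one WaveNet-style gated residual block (illustrative only; the hyperparameters are arbitrary, and causality is enforced by left-padding the input before each convolution):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedResidualBlock(nn.Module):
    def __init__(self, channels, kernel_size=2, dilation=1):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation           # left-pad so no future samples are used
        self.filter_conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)
        self.gate_conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)

    def forward(self, x):                                 # x: (batch, channels, time)
        h = F.pad(x, (self.pad, 0))                       # causal: pad only on the left
        z = torch.tanh(self.filter_conv(h)) * torch.sigmoid(self.gate_conv(h))
        return x + z                                      # residual connection after the gate

x = torch.randn(1, 16, 100)
block = GatedResidualBlock(16, kernel_size=2, dilation=4)
print(block(x).shape)                                     # (1, 16, 100), aligned with the input
```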

6.3 MAF

Masked Autoregressive Flow (MAF) is a type of normalizing flow whose transformation layers are built as autoregressive neural networks. MAF is very similar to the Inverse Autoregressive Flow (IAF) described next. Given two random variables $\mathbf{z}\sim\pi(\mathbf{z})$ and $\mathbf{x}\sim p(\mathbf{x})$, where the probability density $\pi(\mathbf{z})$ is known, MAF aims to learn $p(\mathbf{x})$. MAF generates each $x_i$ conditioned on the past dimensions $\mathbf{x}_{1:i-1}$. Precisely, the conditional probability is an affine transformation of $\mathbf{z}$, where the scale and shift terms are functions of the observed part of $\mathbf{x}$. To generate data, a new $\mathbf{x}$ is produced by

$$x_i\sim p(x_i\mid\mathbf{x}_{1:i-1})=z_i\odot\sigma_i(\mathbf{x}_{1:i-1})+\mu_i(\mathbf{x}_{1:i-1})$$

Given $\mathbf{x}$, the density estimate is

$$p(\mathbf{x})=\prod_{i=1}^D p(x_i\mid\mathbf{x}_{1:i-1})$$

A drawback of this construction is that the generation process is sequential and therefore slow by design, while density estimation only needs a single pass through an architecture such as MADE. The inverse of the transformation function is straightforward, and the Jacobian determinant is also easy to compute.
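A compact sketch of the two MAF directions (the masked linear conditioner below is a stand-in for a full MADE network; strictly lower-triangular weights already give the required autoregressive dependency):

```python
import torch

D = 3
torch.manual_seed(0)
# Strictly lower-triangular weights: row i only sees x_{1:i-1}, so the map is autoregressive.
mask = torch.tril(torch.ones(D, D), diagonal=-1)
W_mu = torch.randn(D, D) * mask
W_alpha = torch.randn(D, D) * mask * 0.1

def conditioner(x):
    return x @ W_mu.T, x @ W_alpha.T      # mu_i(x_{1:i-1}), log sigma_i(x_{1:i-1})

def maf_log_prob(x):
    # Density estimation is parallel: z_i = (x_i - mu_i(x_{1:i-1})) / sigma_i(x_{1:i-1}).
    mu, log_sigma = conditioner(x)
    z = (x - mu) * torch.exp(-log_sigma)
    base = torch.distributions.Normal(0.0, 1.0)
    return base.log_prob(z).sum(-1) - log_sigma.sum(-1)

def maf_sample(n):
    # Sampling is sequential: x_i needs x_{1:i-1} to already be generated.
    z = torch.randn(n, D)
    x = torch.zeros(n, D)
    for i in range(D):
        mu, log_sigma = conditioner(x)
        x[:, i] = z[:, i] * torch.exp(log_sigma[:, i]) + mu[:, i]
    return x

print(maf_log_prob(maf_sample(5)))
```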

6.4 IAF

Similar to MAF, the Inverse Autoregressive Flow (IAF) also models the conditional probability of the target variable autoregressively, but with the flow reversed, which makes sampling very efficient. The affine transformation in MAF can be rewritten as

$$z_i=\frac{x_i-\mu_i(\mathbf{x}_{1:i-1})}{\sigma_i(\mathbf{x}_{1:i-1})}=-\frac{\mu_i(\mathbf{x}_{1:i-1})}{\sigma_i(\mathbf{x}_{1:i-1})}+x_i\odot\frac{1}{\sigma_i(\mathbf{x}_{1:i-1})}$$

If we define
$$\begin{aligned}&\tilde{\mathbf{x}}=\mathbf{z},\quad\tilde{p}(\cdot)=\pi(\cdot),\quad\tilde{\mathbf{x}}\sim\tilde{p}(\tilde{\mathbf{x}})\\&\tilde{\mathbf{z}}=\mathbf{x},\quad\tilde{\pi}(\cdot)=p(\cdot),\quad\tilde{\mathbf{z}}\sim\tilde{\pi}(\tilde{\mathbf{z}})\\&\tilde{\mu}_i(\tilde{\mathbf{z}}_{1:i-1})=\tilde{\mu}_i(\mathbf{x}_{1:i-1})=-\frac{\mu_i(\mathbf{x}_{1:i-1})}{\sigma_i(\mathbf{x}_{1:i-1})}\\&\tilde{\sigma}_i(\tilde{\mathbf{z}}_{1:i-1})=\tilde{\sigma}_i(\mathbf{x}_{1:i-1})=\frac{1}{\sigma_i(\mathbf{x}_{1:i-1})}\end{aligned}$$

then we have

$$\tilde{x}_i\sim p(\tilde{x}_i\mid\tilde{\mathbf{z}}_{1:i})=\tilde{z}_i\odot\tilde{\sigma}_i(\tilde{\mathbf{z}}_{1:i-1})+\tilde{\mu}_i(\tilde{\mathbf{z}}_{1:i-1}),\quad\mathrm{where}\ \tilde{\mathbf{z}}\sim\tilde{\pi}(\tilde{\mathbf{z}})$$

IAF aims to estimate the probability density function of $\tilde{\mathbf{x}}$ given the known $\tilde{\pi}(\tilde{\mathbf{z}})$. The inverse flow is also an autoregressive affine transformation, identical in form to MAF, but the scale and shift terms are autoregressive functions of the observed variables of the known distribution $\tilde{\pi}(\tilde{\mathbf{z}})$.

The computations of the individual elements $\tilde{x}_i$ do not depend on each other, so sampling is easy to parallelize. Density estimation of a known $\tilde{\mathbf{x}}$, however, is not efficient, because the values $\tilde{z}_i$ must be recovered sequentially, i.e. $\tilde{z}_i=(\tilde{x}_i-\tilde{\mu}_i(\tilde{\mathbf{z}}_{1:i-1}))/\tilde{\sigma}_i(\tilde{\mathbf{z}}_{1:i-1})$, requiring $D$ sequential steps in total.
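For contrast, the same masked conditioner idea applied in the IAF direction (again only a sketch): the scale and shift now depend on the noise $\tilde{\mathbf{z}}$, so sampling is a single parallel pass while density estimation becomes sequential.

```python
import torch

D = 3
mask = torch.tril(torch.ones(D, D), diagonal=-1)      # strictly lower-triangular dependency
W_mu = torch.randn(D, D) * mask
W_alpha = torch.randn(D, D) * mask * 0.1

def conditioner(z):
    return z @ W_mu.T, z @ W_alpha.T                   # mu_i(z_{1:i-1}), log sigma_i(z_{1:i-1})

def iaf_sample(n):
    # Sampling is parallel: scale and shift depend only on the known noise z, not on x.
    z = torch.randn(n, D)
    mu, log_sigma = conditioner(z)
    return z * torch.exp(log_sigma) + mu

def iaf_log_prob(x):
    # Density estimation is sequential: z_i must be recovered one dimension at a time.
    z = torch.zeros_like(x)
    for i in range(D):
        mu, log_sigma = conditioner(z)
        z[:, i] = (x[:, i] - mu[:, i]) * torch.exp(-log_sigma[:, i])
    # log_sigma from the final pass is valid: each entry only depends on already-recovered dims.
    base = torch.distributions.Normal(0.0, 1.0)
    return base.log_prob(z).sum(-1) - log_sigma.sum(-1)

x = iaf_sample(5)
print(iaf_log_prob(x))
```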


Copyright notice: this article was written by [Ghost road 2022]. Please include a link to the original when reposting: https://yzsam.com/2022/179/202206280139089128.html