Normalizing flows for MC integration

N. Deutschmann

We look at the work of the ETH graphics/Disney lab presented in "Neural importance sampling" by Müller et al. [1808.03856], which was noted by a Fermilab group (Gao, Hoeche, Krause, Isaacson) as being relevant for our field. They have been developing a code for the last six months (code repository) and seem to have had some success applying it to unweighting in Sherpa (see C. Krause's slides).

This is an attempt at implementing the approach in TensorFlow.

Neural Network-powered importance sampling

Let us first remind ourselves of the basics of MC integration: we want to compute an integral $$I=\int_0^1 dx f(x)$$ which is the expectation value of $f$ applied to a uniform random variable $$I=\int_0^1 dx f(x) = \underset{X\sim U(0,1)}{\mathbb{E}}\hspace{-0.5em}\left(f \left(X\right) \right).$$ We can therefore create an estimator for this integral: $$\hat{I}_X^{(n)} = \frac{1}{n}\sum_i f(X_i) \text{ for which } \mathbb{E}\left(\hat{I}_X^{(n)} \right)= I \text{ and } \sigma \left(\hat{I}_X^{(n)}\right) = \frac{\sigma \left(f \left(X\right)\right)}{\sqrt{n}}$$ The variance of the function controls the prefactor of the variance of the estimator: the estimator for a flat function converges faster as $n$ grows than the estimator for a wildly varying one.
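To make this concrete, here is a minimal sketch of the plain estimator in NumPy (a standalone illustration, not part of the library used later; the flat and peaked test functions are just illustrative choices):

import numpy as np

rng = np.random.default_rng(0)

def mc_estimate(f, n):
    """Crude MC estimate of the integral of f over [0,1] and its statistical uncertainty."""
    x = rng.random(n)                  # X ~ U(0,1)
    fx = f(x)
    return fx.mean(), fx.std() / np.sqrt(n)

# A flat integrand has zero variance; a sharply peaked one converges much more slowly.
flat = lambda x: np.ones_like(x)
peaked = lambda x: np.exp(-(x - 0.5)**2 / 0.01**2)

print(mc_estimate(flat, 10_000))       # (1.0, 0.0): exact already after a single point
print(mc_estimate(peaked, 10_000))     # noisy estimate of ~0.0177 with a sizeable uncertainty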

Importance sampling is a technique that allows us to evaluate this integral faster by reducing the variance of the integrand. There are two equivalent ways of looking at it:

1. Importance sampling as an integration space reparametrization

The one I prefer is the one that is most straightforward to implement: if we perform a change of variables $x = h(y)$ with a monotonic function $h$ that maps the unit interval to itself, we obtain the following integral $$I = \int_0^1 dy \frac{dh}{dy}(y)f \left(h \left(y\right)\right) = \underset{Y\sim U(0,1)}{\mathbb{E}}\hspace{-0.3em}\left(\frac{dh}{dy}(Y)f \left(h \left(Y\right)\right)\right),$$ for which we can again formulate an estimator averaging $n$ copies of $Y$, $$\hat{I}_Y^{(n)} = \frac{1}{n}\sum_i \frac{dh}{dy}(Y_i)f\left(h\left(Y_i\right)\right) \text{ for which } \mathbb{E}\left(\hat{I}_Y^{(n)} \right)= I \text{ and } \sigma \left(\hat{I}_Y^{(n)}\right) = \frac{\sigma \left(\frac{dh}{dy}(Y)f \left(h \left(Y\right)\right)\right)}{\sqrt{n}}.$$ In particular, if $\frac{dh}{dy}(y)f \left(h \left(y\right)\right) =\text{cst}$, a single sample point gives infinite precision since the variance of the integrand is zero.
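As a toy one-dimensional illustration (my own example, not taken from the paper), take $f(x) = 3x^2$, whose integral over $[0,1]$ is 1. The change of variables $h(y) = y^{1/3}$ makes the reweighted integrand exactly constant, so its variance vanishes:

import numpy as np

rng = np.random.default_rng(0)
f = lambda x: 3 * x**2

n = 10_000
y = rng.random(n)

# Naive estimator: average f over uniform points.
naive = f(y)

# Reparametrized estimator with h(y) = y^(1/3): dh/dy(y) * f(h(y)) = 1 for every y.
h = lambda y: y**(1 / 3)
dh = lambda y: (1 / 3) * y**(-2 / 3)
reweighted = dh(y) * f(h(y))

print(naive.mean(), naive.std() / np.sqrt(n))            # ~1.0 with sigma ~ 0.009
print(reweighted.mean(), reweighted.std() / np.sqrt(n))  # 1.0 with sigma ~ 0 (up to rounding)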

The problem of importance sampling can therefore be formulated in the generalized multidimensional case as finding a change of variables $\vec{x} = \vec{h}\left(\vec{y}\right)$ such that the integrand $\left|J\left(\vec{h}\right)\right|f\circ \vec{h}$ has minimal variance, where $J(h)$ is the Jacobian of the coordinate transformation.

This is a type of problem that can be solved by a neural network approach: NNs are universal, nearly-everywhere-differentiable approximators for smooth functions $f:\mathbb{R}^n\to\mathbb{R}^m$, so we can try to learn a function $h$ that satisfies our requirements. The NN itself will provide a map $\vec{x} = \vec{h} \left(\vec{y},\theta\right)$ with some free parameters $\theta$ that we will attempt to optimize.

To this end we will define a loss function $$L=\sigma \left(\left.\frac{dh}{dy}(Y,\theta)f \left(h \left(Y,\theta\right)\right)\right|Y\sim U(0,1)\right)$$ which we can estimate by sampling random points $y_i$ from $Y$ $$\hat{L}\left(\left\{y_i\right\}\right) = \hat{\sigma} \left(\left.\frac{dh}{dy}(Y,\theta)f \left(h \left(Y,\theta\right)\right)\right|Y\sim U\left(\left\{y_i\right\}\right)\right) $$ and we can update the parameters $\theta$ by descending the gradient of this function in $\theta$ directions.

There is one possible issue when computing the gradient of this estimator: by minimizing the variance over a set with fixed $y_i$, we encourage the mapping $h \left(y_i,\theta\right)$ to generate points $x_i$ that are closer to each other, which would reduce the variance in an undesired fashion. Indeed, if the PDF for $x = h(y)$ is close to a delta function (all points $y$ are mapped to the same $x$), there is little variance but we have a biased estimator.

An alternative is to define a loss as $$L'=\sigma \left(\left.\frac{dh}{dy}(Y,\theta)f \left(h \left(Y,\hat{\theta}\right)\right)\right|Y\sim U(0,1)\right)$$ where $\hat{\theta}$ has the same numerical value as $\theta$ but is treated as a constant, so that the gradient is taken with respect to $\theta$ only.
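In TensorFlow, the frozen copy $\hat{\theta}$ can be realized by cutting the gradient through the mapped points with tf.stop_gradient. Below is a minimal sketch of this idea; h_theta and jac_theta are hypothetical placeholders standing for the trainable map and its Jacobian, not the interface of the code used later:

import tensorflow as tf

def variance_loss_detached(f, h_theta, jac_theta, n_points):
    """Sketch of L': the gradient flows through the weight dh/dy(y, theta),
    but not through the points x = h(y, theta-hat) at which f is evaluated."""
    y = tf.random.uniform((n_points, 1))
    x = tf.stop_gradient(h_theta(y))   # same numerical values, but no gradient w.r.t. theta
    integrand = jac_theta(y) * f(x)    # dh/dy(y, theta) * f(h(y, theta-hat))
    return tf.math.reduce_variance(integrand)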

2. Importance sampling as a change in sampling density

While closely connected, this picture is slightly different for the purpose of ML-based approaches as it suggests a slightly different objective function to be optimized. In this approach, we wish to take an everywhere-positive function $q$ normalized to 1 and rewrite our integral as $$I = \int_0^1 dz q(z) \frac{f(z)}{q(z)}$$ where we relabelled $x$ as $z$ (without a change of variables) to associate it to a different random variable $Z$. Indeed, interpreting this integral as a Riemann sum with measure $dz q(z)$ we can say $$I = \underset{Z\sim q}{\mathbb{E}}\left(\frac{f(Z)}{q(Z)}\right)$$ allowing us to build the estimator: $$\begin{align} \hat{I}_q^{(n)} &= \frac{1}{n} \sum_i \frac{f(Z_i)}{q(Z_i)}\\ \mathbb{E} \left(\hat{I}_q^{(n)}\right)&= I\\ \sigma \left(\hat{I}_q^{(n)}\right) &= \frac{\sigma \left(f\left(Z\right)/q\left(Z\right)\right) }{\sqrt{n}} \end{align}$$

In this case, we can formulate our importance sampling problem as finding a random variable $Z$ with PDF $q$ such that the variance of $f\left(Z\right)/q\left(Z\right)$ is minimized. The link to the previous approach should be obvious to the experienced eye: if we have $z = h(y)$, then $1/q(z) = J(h)(y)$. Furthermore, the most straightforward way to obtain a random variable $Z\sim q$ is to find $h$ such that $J(h) = 1/q$ and define $Z = h(Y)$ for $Y\sim U(0,1)$. What our NN should now provide is the PDF $q(z,\theta)$ with trainable parameters $\theta$.

Here, multiple choices for the loss function are possible. As before, we can simply minimize the variance of our integrand:

$$L = \sigma \left(f\left(Z\right)/q\left(Z,\theta\right) | Z\sim q \right) , $$

for which we can build some idealized estimator from a fixed sample of $z_i$ as

$$ \hat{L}\left(\left\{z_i\right\}\right) = \sigma \left(f\left(Z\right)/q\left(Z,\theta\right) | Z \sim U \left(\left\{z_i\right\}\right)\right).$$

As we already said, the only straightforward way to actually generate a sample drawn from $q$ is to have our NN learn the (invertible) transform $h$ and have $Z=h(Y)$ with a uniform $Y$. Then we would have $q(z) = \frac{dh^{-1}}{dz}(z,\theta)$. A realization of the loss estimator above would therefore be exactly the same as in the other approach

$$\hat{L}\left(\left\{y_i\right\}\right) = \hat{\sigma} \left(\left.\frac{dh}{dy}(Y,\theta)f \left(h \left(Y,\theta\right)\right)\right|Y\sim U\left(\left\{y_i\right\}\right)\right).$$

However, while the numerical value of this loss would exactly match our idealized distribution from above, its gradient does not correspond to what we would expect as some of its terms correspond to the effect of moving the $z_i$ around with fixed $y_i$, while the idealized definition implies a fixed sample $z_i$.

A better-aligned choice would therefore be to first sample a set of $y_i$, then obtain $z_i = h(y_i,\hat{\theta})$, where again we use the numerical value $\theta = \hat{\theta}$ but $d \hat{\theta}/ d \theta = 0$. We then evaluate the idealized version of the loss on our fixed sample of $z_i$:

$$\hat{L}''(z_i) = \sigma \left(f\left(Z\right) /q(Z) | Z \sim U \left(\left\{z_i\right\}\right)\right),$$

where we evaluate $q(z_i)$ either by knowing the Jacobian of the inverse transform $h^{-1}$ as a function of $z_i$ or by using $$q(z_i,\theta) = \left(\frac{dh}{dy} \left(h^{-1}(z_i,\theta),\theta\right)\right)^{-1}.$$

Note that this is the same numerical value, but a different gradient from

$$ \left(\frac{dh}{dy} \left(y_i,\theta\right)\right)^{-1}.$$

We will try to explore how these different choices affect performance and training.

Normalizing flow: an invertible, easily differentiable NN

A normalizing flow is a combination of invertible mappings $[0,1]^n\to [0,1]^n$ with an efficiently calculable Jacobian. The main type of mapping we consider is a so-called coupling cell, which defines a change of variables that has a block-wise triangular Jacobian matrix, with diagonal blocks consisting of diagonal matrices, making the determinant calculation very efficient.

A coupling cell relies on separating our variable vector $\vec{y}_i \in [0,1]^n$ into two disjoint subvectors $y_i^A$ and $y_i^B$, where $A$ and $B$ define collections of indices and we will use $a$ and $b$ as specific indices in each set. The change of variables acts on $y_i^A$ by simply passing it along unchanged ($x_i^A = y_i^A$) while the other set is changed. To this end we define a set of positive functions $c_b(y_i^b,\alpha_b)$ (for each direction $b$ in $B$, there is a function $c_b$), depending on some free parameters $\alpha_b$ and with easily computable primitives $C_b$. We define a neural network $T$ which maps the $y_i^A$ onto the whole set of parameters $\alpha_B$. Our change of variables is then $x_i^b = C_b(y_i^b,\alpha_b = T_b(y_i^A))$. Put into words, we are trying to learn the probability density of $y^b$ given $y^A$, which we assume to be independent of the other components of $y^B$: ideally we therefore have $$c_b(y^b) = \int \prod_{b'\in B\setminus\{b\}} dy^{b'}\, P\left(y^b \,\middle|\, y^A, y^{B\setminus\{b\}}\right).$$

We can summarize the whole operation using the following graph

(Figure: schematic of a single coupling cell: $y^A$ is passed through unchanged and also fed to the network $T$, whose output parametrizes the map $C$ applied to $y^B$.)

Let us consider what the Jacobian matrix of this transformation looks like:

$$J_m=\left(\begin{array}{ccc|ccc} 1_{\phantom{b}} & & & &\phantom{\huge{0}} \\ & \ddots & & \phantom{\huge{0}}&\huge{0} &\phantom{\huge{0}} \\ & & 1_{\phantom{b}} & & \phantom{\huge{0}}\\\hline &\phantom{\huge{0}} & & c_{b_1} & & \\ &\Large{\frac{dC_B}{dy^A}} & & & \ddots & \\ &\phantom{{0}} & & & & c_{b_n} \\ \end{array}\right)$$

where the first block of indices refers to $A$ and the second to $B$. The triangularity of the matrix saves us from having to compute non-trivial derivatives with respect to the $y^A$, and in particular derivatives of the output of the neural network $T$, which would be a costly task. Indeed, the Jacobian determinant is just

$$ J = \det J_m =\prod_{b\in B} c_b\left(y_i^b,\alpha_b\left(x^A\right)\right) $$

which depends on the numerical values $\alpha_b$ of the neural network output but requires knowing only the derivatives of $C$, to which we of course have easy analytic access by design. This means that it is very efficient to calculate the Jacobian of the transformation as we move through the coupling cell. In practice, therefore, the input is a phase-space point $\vec{y}_i$ and an "upstream Jacobian", i.e. the Jacobian of the transformations done upstream of the coupling cell at hand. The upstream Jacobian is multiplied by the Jacobian of the coupling cell and passed downstream as the next cell's upstream Jacobian, as shown below

(Figure: the upstream Jacobian entering a coupling cell is multiplied by the cell's own Jacobian and passed downstream to the next cell.)

In general, we would like to affect all variables; to this end we chain several such coupling cells together and choose a different pair of sets $A$ and $B$ for each. While the changes of variables defined by a single coupling cell are of course far from covering all possible changes, enough chained cells acting on different variable sets can be expected to work nicely; the bookkeeping is sketched below.
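The chaining and Jacobian accumulation can be sketched as follows; the cells are assumed to expose a transform method returning the mapped points and the cell's Jacobian factor (a hypothetical interface for illustration, not that of the library used below):

import tensorflow as tf

def flow_forward(cells, y):
    """Push a batch of points through a chain of coupling cells (and permutations),
    accumulating the Jacobian of the full transformation along the way."""
    x = y
    upstream_jac = tf.ones(tf.shape(y)[:1])     # Jacobian of the identity, before any cell
    for cell in cells:
        x, cell_jac = cell.transform(x)         # each cell knows its own (cheap) Jacobian
        upstream_jac = upstream_jac * cell_jac  # becomes the next cell's upstream Jacobian
    return x, upstream_jac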

The choice of coupling function

The coupling functions $C_b$, or equivalently their derivatives $c_b$, are one of the main ways we can make variations on the idea of normalizing flows. A typical choice is to use the primitive of a piecewise-simple function (the simple function then being the PDF in direction $b$ for fixed $y^A$), and the simplest such choice is to have $c_b$ be a piecewise-constant function, i.e. we approximate the PDF with a step function and the CDF with a piecewise-linear function. The output of the neural network $T$ is then, for each direction $b$, the set of step heights, which are also the slopes of the piecewise-linear CDF.

(Figure: a piecewise-constant PDF and the corresponding piecewise-linear CDF.)
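To make the piecewise-linear case concrete, here is a minimal sketch of a single such coupling cell in two dimensions, written from scratch for illustration (the PieceWiseLinear class used below handles the general case and should not be assumed to work this way internally):

import tensorflow as tf

n_bins = 10

# A tiny stand-in for the network T: maps y^A (shape (N, 1)) to unnormalized bin weights (shape (N, n_bins)).
T = tf.keras.Sequential([
    tf.keras.layers.Dense(20, activation="relu"),
    tf.keras.layers.Dense(n_bins),
])

def pw_linear_coupling(y):
    """One coupling cell acting on points y of shape (N, 2): y^A = y[:, 0] is passed through,
    y^B = y[:, 1] is mapped through a piecewise-linear CDF whose slopes come from T(y^A)."""
    yA, yB = y[:, :1], y[:, 1]
    w = tf.nn.softmax(T(yA), axis=-1)                   # bin probabilities (step heights), summing to 1
    k = tf.clip_by_value(tf.cast(tf.floor(yB * n_bins), tf.int32), 0, n_bins - 1)
    one_hot = tf.one_hot(k, n_bins)
    w_k = tf.reduce_sum(w * one_hot, axis=-1)           # height of the bin containing y^B
    left = tf.reduce_sum(tf.cumsum(w, axis=-1, exclusive=True) * one_hot, axis=-1)  # CDF at the bin's left edge
    xB = left + w_k * (yB * n_bins - tf.cast(k, tf.float32))   # piecewise-linear CDF
    jac = w_k * n_bins                                  # c_b(y^B): slope of the CDF in that bin
    return tf.stack([y[:, 0], xB], axis=-1), jac

# Usage: map a batch of uniform points and get the per-point Jacobian factor.
x, jac = pw_linear_coupling(tf.random.uniform((5, 2)))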

Other choices are of course possible, as long as we can integrate/differentiate and invert the function easily. Among the choices discussed in the literature, polynomial and rational coupling functions are suggested as a good alternative to having variable bin sizes, allowing much better analytic control over the coupling cell operations. For the time being, however, we will stick to the step/piecewise-linear case as it is easily tractable.

Using the implementation

At the moment, the code is hosted in my repo tf-toolbox, which implements tf_toolbox.normalizing_flows. Let us see how it works on the user side of things.

In [3]:
import tensorflow as tf
import tensorflow.keras as keras
import numpy as np
from tf_toolbox.normalizing_flows import *
import matplotlib.pyplot as plt

Learning a Gaussian

We will try to integrate a Gaussian in two dimensions, centered at the middle of our unit square.

In [2]:
# We define our Gaussian to allow for tensors of shape (N,d)
# where the first dimension labels the point and the second the dimension.
# We also wrap it with @tf.function so that tensorflow can include it
# in a graph. This is not necessary per se, but it allows function evaluation to be done on GPU when available

@tf.function
def gaussian(x):
    return tf.exp( -tf.reduce_sum((x-0.5)**2,axis=-1)/0.3**2 )
In [12]:
# Let us define the most basic normalizing flow: a single coupling layer acting on the input (x_1,x_2).
# There is only one possibility up to permutation: A = {1} and B={2}

# We define convenient parameters

n_flow = 2      # number of dimensions
n_pass = 1      # number of dimensions passed through without change
n_bins = 10     # number of bins for the PW-PDF/CDF
layer_size = 20 # number of neurons per layer in the neural network T
n_layers = 5    # number of layers

# We define our NormalizingFlow object by giving it a list of cells, here only one
NF = NormalizingFlow(n_flow,
    [
        # PW-linear coupling on variable 1
        # The architecture of T is a dense neural net given by a sequence of layers.
        # Here: n_layers identical layers of size layer_size
        # reg is the regularization rate (L2 on the weights of T)
        PieceWiseLinear(n_flow,n_pass,n_bins,[layer_size]*n_layers,reg=0)
    ])
In [13]:
# We then define an optimizer object, which is a TF object that applies a gradient descent algorithm (here Adamax)
optim = keras.optimizers.Adamax(lr=1e-3)

# We then call our Normalizing Flow's training function (in one of its versions)
# specifying how many iterations (epochs) we would like and how many points are
# used at each iteration to estimate the variance of the integrand, and of course
# the function we would like to optimize on.
# This yields the history of the losses (logs of variances) over training.
losses = NF.train_variance_forward(gaussian,n_epochs=250,n_batch=100000,optimizer=optim)

In [14]:
# Finally we plot the losses, the standard deviation of the integrand, and a histogram of the distribution of points
fig = plt.figure(figsize=(12, 4))
a1=fig.add_subplot(131)
plt.plot(losses)
a1.title.set_text('Loss')
a2=fig.add_subplot(132)
plt.plot(np.sqrt(np.exp(losses)))
a2.title.set_text('Standard Deviation')
X,fX=NF.generate_data_batches(gaussian,n_batch=10000)
a3=fig.add_subplot(133)
plt.hist2d(X[0][:,0],X[0][:,1],bins=10)
a3.set_aspect(aspect=1.)
a3.title.set_text('Point histogram (PDF)')
a3.tick_params(
    axis='both',       # changes apply to both axes
    which='both',      # both major and minor ticks are affected
    bottom=False,      # ticks along the bottom edge are off
    left=False,        # ticks along the left edge are off
    labelbottom=False,
    labelleft=False)
plt.plot()
Out[14]:
[]

Debriefing: What happened here? The transformation learned to distribute one of the two variables in a Gaussian fashion, but not the other one. This is of course to be expected here because our normalizing flow learns the conditional PDF for one variable given the other one. In our Gaussian example, the variables $y^1$ and $y^2$ are completely independent from one another, so that the normalized PDF for $y^2$ given a value of $y^1$ is always the same Gaussian distribution.

Of course, we can get around this by iterating the process on the new distribution, distorting it in the other direction. Let us define a new normalizing flow with two coupling cells.

In [25]:
n_flow = 2      # number of dimensions
n_pass = 1      # number of dimensions passed through without change
n_bins = 10     # number of bins for the PW-PDF/CDF
layer_size = 20 # number of neurons per layer in the neural network T
n_layers = 5    # number of layers

# We define our NormalizingFlow object by giving it a list of cells
NF = NormalizingFlow(2,
    [
        #PW-linear coupling on variable 1
        PieceWiseLinear(n_flow,n_pass,n_bins,[layer_size]*n_layers,reg=1e-3), 
        # Apply a cyclic permutation on the variables with stride 1 (i.e. exchange 1 and 2)
        RollLayer(1),
        #PW-linear coupling on the new variable 1, i.e. variable 2
        PieceWiseLinear(n_flow,n_pass,n_bins,[layer_size]*n_layers,reg=1e-3), 
        # Exchange 1 and 2 again to restore the order
        RollLayer(1)
    ])
In [26]:
# Let us train and plot immediately
optim = keras.optimizers.Adamax(lr=1e-3)
losses = NF.train_variance_forward(gaussian,n_epochs=250,n_steps=10,n_batch=100000,optimizer=optim)


fig = plt.figure(figsize=(12, 4))
a1=fig.add_subplot(131)
plt.plot(losses)
a1.title.set_text('Loss')
a2=fig.add_subplot(132)
plt.plot(np.sqrt(np.exp(losses)))
a2.title.set_text('Standard Deviation')
X,fX=NF.generate_data_batches(gaussian,n_batch=10000)
a3=fig.add_subplot(133)
plt.hist2d(X[0][:,0],X[0][:,1],bins=10)
a3.set_aspect(aspect=1.)
a3.title.set_text('Point histogram (PDF)')
a3.tick_params(
    axis='both',       # changes apply to both axes
    which='both',      # both major and minor ticks are affected
    bottom=False,      # ticks along the bottom edge are off
    left=False,        # ticks along the left edge are off
    labelbottom=False,
    labelleft=False)
plt.plot()
Out[26]:
[]

Debriefing:

  1. the learned distribution looks quite Gaussian
  2. the standard deviation obtained is significantly better than before: we gained nearly a factor of 3 in integration time

A fancier example: the slashed circle function

Let us still work in two dimensions to allow easy visualization of the results. Here we would like to learn a distribution that is the quintessential example of what the VEGAS algorithm cannot handle. Let us first define the function and plot it.

In [18]:
@tf.function
def slashed_circle(x):
    return tf.math.minimum(1.,tf.exp( - (tf.norm(x-0.5,axis=-1)-0.3)**2/0.1**2 )+tf.exp(- (x[:,0]-x[:,1])**2/0.1**2))
In [19]:
# Plotting magic
xs = np.linspace(0,1,100)
ys = np.linspace(1,0,100) # in images the y axis is inverted
Xs,Ys = np.meshgrid(xs,ys)
zs=slashed_circle(np.array(list(zip(Xs.reshape(100*100),Ys.reshape(100*100)))).astype(np.float32)).numpy().reshape(100,100)
plt.imshow(zs)
plt.show()

This is a hard function even for a smart user of VEGAS because VEGAS performs changes of variables one dimension at a time. A smart user can perform changes of variables, such as going to polar coordinates to adapt the radial distance to the circle shape, but optimizing for both the circle and the line at the same time is practically impossible. This is typically handled using multi-channelling (doing multiple integrations optimized for different features), but such an approach requires knowing a lot about the function you would like to integrate, which is not always possible.

Let us therefore see how our learning approach handles this situation.

In [28]:
n_flow = 2
n_pass = 1
n_bins = 25
layer_size = 60
n_layers=5

NF = NormalizingFlow(2,
    [

        PieceWiseLinear(n_flow,n_pass,n_bins,[layer_size]*n_layers,reg=1e-5), 
        RollLayer(1), 
        PieceWiseLinear(n_flow,n_pass,n_bins,[layer_size]*n_layers,reg=1e-5), 
        RollLayer(1), 

    ])

optim = keras.optimizers.Adamax(lr=1e-3)
losses = NF.train_variance_forward(slashed_circle,n_epochs=700,n_batch=500000,optimizer=optim)

fig = plt.figure(figsize=(12, 4))
a1=fig.add_subplot(131)
plt.plot(losses)
a1.title.set_text('Loss')
a2=fig.add_subplot(132)
plt.plot(np.sqrt(np.exp(losses)))
a2.title.set_text('Standard Deviation')
X,fX=NF.generate_data_batches(slashed_circle,n_batch=100000)
a3=fig.add_subplot(133)
plt.hist2d(X[0][:,0],X[0][:,1],bins=25)
a3.set_aspect(aspect=1.)
a3.title.set_text('Point histogram (PDF)')
a3.tick_params(
    axis='both',       # changes apply to both axes
    which='both',      # both major and minor ticks are affected
    bottom=False,      # ticks along the bottom edge are off
    left=False,        # ticks along the left edge are off
    labelbottom=False,
    labelleft=False)
plt.plot()
Out[28]:
[]