diff --git a/code/vae_notebook_pytorch_solutions.ipynb b/code/vae_notebook_pytorch_solutions.ipynb
new file mode 100644
index 0000000..2d5fabd
--- /dev/null
+++ b/code/vae_notebook_pytorch_solutions.ipynb
@@ -0,0 +1,837 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# How to implement a Variational Autoencoder (VAE)\n",
+    "\n",
+    "A variational autoencoder observes data, infers a latent code for it and tries to reconstruct the data from that latent code. In contrast to regular autoencoders, the code of the VAE is **random**. That means that when presented with the same input, the VAE will produce a slightly different code each time. This makes its decoding process more robust, since it has to deal with noisy code.\n",
+    "\n",
+    "Another way of looking at a VAE is as a training procedure for a probablistic model. The model is \n",
+    "$$p(x) = \\int p(z)p(x|z) dz$$\n",
+    "where $z$ is the latent code and $x$ is the data. During training we need to infer a posterior over $z$. In the case of a VAE this is done by neural network.\n",
+    "\n",
+    "Assuming that the theory of VAEs has already been presented, we now dive straight into implementing them. If you need more background on VAEs, have a look at our [tutorial slides](https://github.com/philschulz/VITutorial/tree/master/modules) and the references therein."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# The Framework\n",
+    "\n",
+    "For the purpose of this tutorial we are going to use [pytorch](https://pytorch.org) which is a scalable deep learning library for python. We are going to import and abbreviate it as \"mx\". We will use pytorch to define a computation graph. This is done using the [torch.nn library](https://pytorch.org/docs/stable/nn.html). When building the VAE, all the methods that you use should be prefixed with `torch.nn`."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import logging\n",
+    "import pathlib\n",
+    "\n",
+    "from typing import List, Tuple, Optional, Iterable\n",
+    "from abc import ABC\n",
+    "\n",
+    "from collections import OrderedDict\n",
+    "import itertools\n",
+    "import math\n",
+    "import numpy as np\n",
+    "\n",
+    "import torch\n",
+    "import torch.utils.data\n",
+    "from torch.autograd import Variable\n",
+    "\n",
+    "from matplotlib import cm, pyplot as plt"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Next, we specify a couple of constants that will help us to load the data."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "DEFAULT_LEARNING_RATE = 0.0003\n",
+    "\n",
+    "TRAIN_SET = 'train'\n",
+    "VALID_SET = 'valid'\n",
+    "TEST_SET = 'test'\n",
+    "data_names = [TRAIN_SET, VALID_SET, TEST_SET]\n",
+    "test_set = [TEST_SET]\n",
+    "data_dir = pathlib.Path.cwd() / \"binary_mnist\"\n",
+    "USE_CUDA = False # Set to True to run on your gpu\n",
+    "\n",
+    "if USE_CUDA:\n",
+    "    device = torch.device(\"cuda:0\")\n",
+    "else:\n",
+    "    device = torch.device(\"cpu\")\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Finally, we set up basic logging facilities to print intermediate output."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def plot_images(a, columns):\n",
+    "    width = height = int(math.sqrt(a.shape[1]))\n",
+    "    n = a.shape[0]\n",
+    "    rows = math.ceil(n / columns)\n",
+    "    cellsize = min(15 / columns, 5)\n",
+    "    if rows == 1 and columns > n:\n",
+    "        if n == 1:\n",
+    "            plot, axes = plt.subplots(rows, sharex=\"col\", sharey=\"row\", figsize=(cellsize * columns, cellsize * rows))\n",
+    "        else:\n",
+    "            plot, axes = plt.subplots(rows, n, sharex=\"col\", sharey=\"row\", figsize=(cellsize * columns, cellsize * rows))\n",
+    "    else:\n",
+    "        plot, axes = plt.subplots(rows, columns, sharex=\"col\", sharey=\"row\", figsize=(cellsize * columns, cellsize * rows))\n",
+    "    for r in range(rows):\n",
+    "        for c in range(columns):\n",
+    "            if (r * columns + c) >= n:\n",
+    "                break\n",
+    "            if rows == 1:\n",
+    "                if n == 1:\n",
+    "                    axes.imshow(np.reshape(a[(r * columns) + c,:], (width, height)), cmap=cm.Greys)\n",
+    "                else:\n",
+    "                    axes[c].imshow(np.reshape(a[(r * columns) + c,:], (width, height)), cmap=cm.Greys)\n",
+    "            else:\n",
+    "                axes[r][c].imshow(np.reshape(a[(r * columns) + c,:], (width, height)), cmap=cm.Greys)\n",
+    "    plt.show()\n",
+    "\n",
+    "logging.basicConfig(level=logging.DEBUG, format=\"%(asctime)s [%(levelname)s -- %(name)s]: %(message)s\", datefmt=\"%H:%M:%S\")\n",
+    "logger = logging.getLogger(\"notebook\")\n",
+    "mpl_logger = logging.getLogger(\"matplotlib\")\n",
+    "mpl_logger.setLevel(logging.ERROR)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# The Data\n",
+    "\n",
+    "Throughout the tutorial we will use the binarised MNIST data set consisting of images of handwritten digits (0-9). \n",
+    "The binarisation was done by sampling each pixel from a Bernoulli distribution with the pixel's original intensity being the Bernoulli parameter (see [this paper](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.141.1680&rep=rep1&type=pdf) for details).Each pixel has been mapped to either 0 or 1, meaning that pixels are either fully on or off. We use this data set because it allows us to use a rather simple product of Bernoullis as a likelihood. We download the data into a folder called \"binary_mnist\". This may take some time."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "data_dir.mkdir(parents=True, exist_ok=True)\n",
+    "for data_set in data_names:\n",
+    "    file_name = \"binary_mnist.{}\".format(data_set)\n",
+    "    goal = data_dir / file_name\n",
+    "    if goal.exists():\n",
+    "        logger.info(\"Data file {} exists\".format(file_name))\n",
+    "    else:\n",
+    "        logger.info(\"Downloading {}\".format(file_name))\n",
+    "        link = \"http://www.cs.toronto.edu/~larocheh/public/datasets/binarized_mnist/binarized_mnist_{}.amat\".format(\n",
+    "            data_set)\n",
+    "        urllib.request.urlretrieve(link, goal)\n",
+    "        logger.info(\"Finished\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Alright, now we have the data on disk. We will load the training and test data later. Right now we just load the validation set to see what the data looks like. We'll grab 5 random numbers and visualise them (you can run this box several times to get 5 different digits each time). That the digits look a bit rough results from their pixels being binarised."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "file_name = data_dir / f\"binary_mnist.{VALID_SET}\"\n",
+    "logger.info(f\"Reading {file_name} into memory\")\n",
+    "valid_set = np.genfromtxt(file_name)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "n = 5\n",
+    "random_indices = np.random.randint(valid_set.shape[0], size=n)\n",
+    "random_pictures = valid_set[random_indices, :]\n",
+    "plot_images(random_pictures, 5)\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "After having acquainted ourselves with the data we can now proceed to build our VAE."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Diagonal Gaussian VAE\n",
+    "\n",
+    "The most basic VAE model is one where we assume that the latent variable is multiviariate Gaussian. We fix the prior to be standard normal. During inference, we use a multivariate Gaussian variational distribution with diagonal covariance matrix. This means that we are only modelling variance but not covariance (in fact, a k-dimensional Guassian with diagonal covariance has the same density as a product of k independent univariate Gaussians). Geometrically, this variational distribution can only account for axis-aligned elliptical densities. It is thus rather limited in its modelling capabilities. Still, because it uses a neural network under the hood, it is very expressive. \n",
+    "\n",
+    "In this tutorial, we will model the mist binarised digit data set. Each image is encoded as a 784-dimensional vector. We will model each of these vectors as a product of 784 Bernoullis (of course, there are better models but we want to keep it simple). Our likelihood is thus a product of independent Bernoullis. The resulting model is formally specified as \n",
+    "\n",
+    "\\begin{align}z \\sim \\mathcal{N}(0,I) && x_i|z \\sim Bernoulli(\\text{NN}_{\\theta}(z))~~~ i \\in \\{1,2,\\ldots, 784\\} \\ .\\end{align}\n",
+    "\n",
+    "The variational approximation is given by $$q(z|x) = \\mathcal{N}(\\text{NN}_{\\lambda}(x), \\text{NN}_{\\lambda}(x)).$$\n",
+    "\n",
+    "Notice that both the Bernoulli likelihood and the Gaussian variational distribution use NNs to compute their parameters. The parameters of the NNs, however, are different ($\\theta$ and $\\lambda$, respectively)."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Implementation\n",
+    "\n",
+    "We will spread our implementation across 3 classes. This design choice is motivated by the desire to make our models as modular as possible. This will later allow us to mix and match different likelihoods and variational distributions.\n",
+    "\n",
+    "* **Generator**: This class defines our likelihood. Given a latent value, it produces a data sample or assigns a density to an existing data point.\n",
+    "* **InferenceNetwork**: This neural network computes the parameters of the variational approximation from a data point.\n",
+    "* **VAE**: This is the variational autoencoder. It combines a Generator and an InferenceNetwork and trains them jointly. Once trained, it can generate random data points or try to reproduce data presented to it.\n",
+    "\n",
+    "Below we have specified these classes abstractly. Make sure you understand what each method is supposed to be doing."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "class Generator(torch.nn.Sequential):\n",
+    "    \"\"\"\n",
+    "    Generator network.\n",
+    "\n",
+    "    :param data_dims: Dimensionality of the generated data.\n",
+    "    :param layer_sizes: Size of each hidden layer in the network.\n",
+    "    :param act_type: The activation after each hidden layer.\n",
+    "    \"\"\"\n",
+    "\n",
+    "    def __init__(self, input_dims: int, data_dims: int, layer_sizes: List[int], act_type: str) -> None:\n",
+    "        super(Generator, self).__init__(OrderedDict([]))\n",
+    "        self.input_dims = input_dims\n",
+    "        self.data_dims = data_dims\n",
+    "        self.layer_sizes = layer_sizes\n",
+    "        self.act_type = act_type\n",
+    "\n",
+    "        \n",
+    "class InferenceNetwork(torch.nn.Module):\n",
+    "    \"\"\"\n",
+    "    A network to infer distributions over latent states.\n",
+    "\n",
+    "    :param latent_variable_size: The dimensionality of the latent variable.\n",
+    "    :param layer_sizes: Size of each hidden layer in the network.\n",
+    "    :param act_type: The activation after each hidden layer.\n",
+    "    \"\"\"\n",
+    "\n",
+    "    def __init__(self, latent_variable_size: int, layer_sizes: List[int], act_type: torch.nn.Module) -> None:\n",
+    "        super(InferenceNetwork, self).__init__()\n",
+    "        self.latent_variable_size = latent_variable_size\n",
+    "        self.layer_sizes = layer_sizes\n",
+    "        self.act_type = act_type\n",
+    "\n",
+    "    \n",
+    "class VAE(torch.nn.Module):\n",
+    "    \"\"\"\n",
+    "    A variational autoencoding model (Kingma and Welling, 2014).\n",
+    "\n",
+    "    :param generator: A generator network that specifies the likelihood of the model.\n",
+    "    :param inference_net: An inference network that specifies the distribution over latent values.\n",
+    "    \"\"\"\n",
+    "\n",
+    "    def __init__(self, generator: Generator, inference_net: InferenceNetwork) -> None:\n",
+    "        super(VAE, self).__init__()\n",
+    "        self.generator = generator\n",
+    "        self.inference_net = inference_net\n",
+    "\n",
+    "    def forward(self, data: Variable) -> Variable:\n",
+    "        raise NotImplementedError()\n",
+    "        \n",
+    "    def generate_reconstructions(self, data: Variable, n: int) -> Variable:\n",
+    "        \"\"\"\n",
+    "        Generate a number of reconstructions of input data points.\n",
+    "\n",
+    "        :param data: The input data.\n",
+    "        :param n: Number of reconstructions per data point.\n",
+    "        :return: The reconstructed data.\n",
+    "        \"\"\"\n",
+    "        raise NotImplementedError()\n",
+    "\n",
+    "    def phantasize(self, n: int) -> Variable:\n",
+    "        \"\"\"\n",
+    "        Generate data by randomly sampling from the prior.\n",
+    "\n",
+    "        :param n: Number of sampled data points.\n",
+    "        :return: Randomly generated data points.\n",
+    "        \"\"\"\n",
+    "        raise NotImplementedError()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Exercise 1\n",
+    "Let us start by implementing the generator. This is pretty much a standard neural network. The main point of this exercise is to get you comfortable with pytorch. Complete all the TODOs below. Before starting, check pytorch's [fully connected layer](https://pytorch.org/docs/stable/nn.html#torch.nn.Linear) and [activation functions](https://pytorch.org/docs/stable/nn.html#non-linear-activations-weighted-sum-nonlinearity) (and [more activation functions](https://pytorch.org/docs/stable/nn.html#non-linear-activations-other))."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "class ProductOfBernoullisGenerator(Generator):\n",
+    "    \"\"\"\n",
+    "    A generator that produces binary vectors whose entries are independent Bernoulli draws.\n",
+    "\n",
+    "    :param data_dims: Dimensionality of the generated data.\n",
+    "    :param layer_sizes: Size of each hidden layer in the network.\n",
+    "    :param act_type: The activation after each hidden layer.\n",
+    "    \"\"\"\n",
+    "\n",
+    "    def __init__(self, *args, **kwargs) -> None:\n",
+    "        super(ProductOfBernoullisGenerator, self).__init__(*args, **kwargs)\n",
+    "        # TODO choose the correct output activation for a Bernoulli variable. This should be a class in\n",
+    "        # torch.nn\n",
+    "        output_activation = torch.nn.Sigmoid\n",
+    "\n",
+    "        # TODO for each layer size in layer_sizes, introduce a layer that performs an affine transformation \n",
+    "        # (implemented by torch.nn.Linear) followed by a non-linearity that is specified by output_activation.\n",
+    "\n",
+    "        for i, size in enumerate([self.input_dims] + self.layer_sizes[:-1]):\n",
+    "            print(i, len(self.layer_sizes))\n",
+    "            self.add_module(f\"gen_fc_{i}\", torch.nn.Linear(size, self.layer_sizes[i]))\n",
+    "            self.add_module(f\"gen_act_{i}\", self.act_type())\n",
+    "        self.add_module(f\"gen_fc_out\", torch.nn.Linear(self.layer_sizes[-1], self.data_dims))\n",
+    "        self.add_module(f\"gen_act_out\", output_activation())\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "We now move on to the inference network. Recall that this network will return the parameters of a diagonal Gaussian. Thus, we need to return two vectors of the same size: a mean and a standard deviation vector. (Formally, the parameters of the Gaussian are the variances. However, from the derivation of the Gaussian reparametrisation we know that we\n",
+    "need the standard deviations to generate a Gaussian random variable $z$ as transformation of a standard Gaussian variable $\\epsilon$.)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Exercise 2.a\n",
+    "To train our VAE we need to compute the KL of the prior distribution from the variational approximation. Assuming that\n",
+    "the prior is standard normal the KL for the a diagonal Gaussian variational approximation is\n",
+    "$$ 0.5 * \\left(\\sum_i - \\log(\\sigma_i^2) - 1 + \\sigma_i^2 + \\mu_i^2 \\right) \\ . $$\n",
+    "We use this KL divergence in the get_kl_loss method below of the GaussianInferenceNetwork. (Losses in mxnet are defined using the [MakeLoss](https://mxnet.incubator.apache.org/api/python/symbol/symbol.html#mxnet.symbol.MakeLoss) symbol.) Please implement the function diagonal_gaussian_kl. Use the [sum](https://mxnet.incubator.apache.org/api/python/symbol/symbol.html#mxnet.symbol.sum) and [log](https://mxnet.incubator.apache.org/api/python/symbol/symbol.html#mxnet.symbol.log) symbols in the process."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def diagonal_gaussian_kl(location: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:\n",
+    "    \"\"\"\n",
+    "    Computes the KL divergence of a standard normal distribution from a diagonal Gaussian.\n",
+    "    \n",
+    "    :param location: The mean of the diagonal Gaussian.\n",
+    "    :param scale: The standard deviations of the diagonal Gaussian.\n",
+    "    :return: The value of the Kl divergence.\n",
+    "    \"\"\"\n",
+    "    #q = torch.distributions.normal.Normal(loc, scale)\n",
+    "    #p = torch.distributions.normal.Normal(torch.zeros_like(loc), torch.ones_like(scale))\n",
+    "    #kl = torch.distributions.kl.kl_divergence(q, p)\n",
+    "    kl = torch.sum(0.5 * ((-1 * scale.pow(2).log()) -1 + scale.pow(2) + location.pow(2)), dim=1)\n",
+    "    return kl\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Exercise 2.b\n",
+    "We can finally implement the Gaussian reparametrisation. Fill in the correct code to perform the sampling in the sample_latent_state method.\n",
+    "\n",
+    "**Hint:** In this exercise you will need to draw a random Gaussian sample (see [here](https://mxnet.incubator.apache.org/api/python/symbol/symbol.html#mxnet.symbol.random_normal)). The operator requires a\n",
+    "shape whose first entry is the batch size. The batch size is not known to you during implementation, however.\n",
+    "You can leave it underspecified by choosing $0$ as a value. When you combine the sampling operator with another\n",
+    "operator mxnet will infer the correct the batch size for you."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "class GaussianInferenceNetwork(InferenceNetwork):\n",
+    "    \"\"\"\n",
+    "    An inference network that predicts the parameters of a diagonal Gaussian and samples from that distribution.\n",
+    "\n",
+    "    :param latent_variable_size: The dimensionality of the latent variable.\n",
+    "    :param layer_sizes: Size of each hidden layer in the network.\n",
+    "    :param act_type: The activation after each hidden layer.\n",
+    "    \"\"\"\n",
+    "\n",
+    "    def __init__(self, data_size, *args) -> None:\n",
+    "        super(GaussianInferenceNetwork, self).__init__(*args)\n",
+    "        self.data_size = data_size\n",
+    "        \n",
+    "        # We choose to share the first layer between the networks that compute the standard deviations\n",
+    "        # and means. This is a fairly standard design choice.\n",
+    "        self.shared = torch.nn.Sequential(OrderedDict({\n",
+    "                \"inf_joint_fc\": torch.nn.Linear(self.data_size, self.layer_sizes[0]),\n",
+    "                \"inf_joint_act\": self.act_type(),\n",
+    "            }))\n",
+    "        # location (live in R^d), for Gaussians this is also the mean of the random variable\n",
+    "        self.location = torch.nn.Sequential(\n",
+    "                OrderedDict(itertools.chain(*[\n",
+    "                    (\n",
+    "                        (f\"inf_mean_fc_{i}\", torch.nn.Linear(self.layer_sizes[i], size)),\n",
+    "                        (f\"inf_mean_act_{i}\", self.act_type()),\n",
+    "                    ) for i, size in enumerate(self.layer_sizes[1:])\n",
+    "                ]))\n",
+    "            )\n",
+    "        self.location.add_module(\"inf_mean_compute\", torch.nn.Linear(self.layer_sizes[-1], self.latent_variable_size))\n",
+    "\n",
+    "        # scale (live in strictly positive space), for Gaussians this is also the std \n",
+    "        self.scale = torch.nn.Sequential(\n",
+    "                OrderedDict(itertools.chain(*[\n",
+    "                    (\n",
+    "                        (f\"inf_variance_fc_{i}\", torch.nn.Linear(self.layer_sizes[i], size)),\n",
+    "                        (f\"inf_variance_act_{i}\", self.act_type()),\n",
+    "                    ) for i, size in enumerate(self.layer_sizes[1:])\n",
+    "                ]))\n",
+    "            )\n",
+    "        self.scale.add_module(\"inf_variance_compute\", torch.nn.Linear(self.layer_sizes[-1], self.latent_variable_size))\n",
+    "\n",
+    "        # TODO: The standard deviation is constrained to be positive for non-degenerate Gaussians. Take the\n",
+    "        # prev_out symbol and transform it into a vector of positive reals.\n",
+    "        self.scale.add_module(\"inf_variance_softrelu\", torch.nn.Softplus())\n",
+    "\n",
+    "    def sample_latent_state(self, loc: Variable, scale: Variable, n: Optional[int] = 1) -> Variable:\n",
+    "        \"\"\"\n",
+    "        Sample a value for the latent Gaussian variable.\n",
+    "\n",
+    "        :param mean: The mean of the Gaussian.\n",
+    "        :param std: The standard deviation of the Gaussian.\n",
+    "        :param n: Number of samples to be produced.\n",
+    "        :return: A Gaussian sample.\n",
+    "        \"\"\"\n",
+    "        eps = Variable(scale.data.new(n, scale.size()[1]).normal_())\n",
+    "        escale = scale.expand(n, -1)\n",
+    "        eloc = loc.expand(n, -1)\n",
+    "\n",
+    "        return (eps * escale) + loc\n",
+    "\n",
+    "    def forward(self, data):\n",
+    "        shared = self.shared(data)\n",
+    "        loc = self.location(shared)\n",
+    "        scale = self.scale(shared)\n",
+    "        \n",
+    "        z = self.sample_latent_state(loc, scale, scale.size()[0])\n",
+    "        \n",
+    "        return z, loc, scale\n",
+    "    "
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Exercise 3\n",
+    "\n",
+    "The only thing that is left to do is to implement VAE training. To train our VAE we will maximise the ELBO:\n",
+    "$$ \\mathbb{E}\\left[ \\log p(x|z) \\right] - \\text{KL}(q(z|x)||p(z)) \\ . $$\n",
+    "By default, MXNet's optimisers minimise losses instead of maximising objectives. We thus turn the ELBO into a loss by taking its negative. We already get the (sampled) cross entropy loss and (analytically computed) KL divergence as return values form calls to the Generator and inference network respectively. Your job is to combine them into one loss. In mxnet this can be done using the [Group symbol](https://mxnet.incubator.apache.org/api/python/symbol/symbol.html#mxnet.symbol.Group). If a list of losses is provided as an argument to this symbol, the optimiser will add up these losses.\n",
+    "\n",
+    "**Note**: We could just as well have added up the values of the cross-entropy and the KL divergence and put the resulting loss into a MakeLoss symbol. The reason why we want to separate these two terms of the loss is that it allows us to output KL divergence. We will use this below when defining the KLMetric. Besides tracking the ELBO, which is our optimisation objective, we can also track to KL divergence to see how far the variational approximation moves away from the prior."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "class GaussianVAE(VAE):\n",
+    "    \"\"\"\n",
+    "    A VAE with Gaussian latent variables. It assumes a standard normal prior on the latent values.\n",
+    "\n",
+    "    :param generator: A generator network that specifies the likelihood of the model.\n",
+    "    :param inference_net: An inference network that specifies the Gaussian over latent values.\n",
+    "    \"\"\"\n",
+    "\n",
+    "    def forward(self, data):\n",
+    "        z, loc, scale = self.inference_net(data)\n",
+    "        y_pred = self.generator(z)\n",
+    "        \n",
+    "        return y_pred, loc, scale\n",
+    "\n",
+    "\n",
+    "    def generate_reconstructions(self, data: Variable, n: int) -> Variable:\n",
+    "        \"\"\"\n",
+    "        Generate a number of reconstructions of input data points.\n",
+    "\n",
+    "        :param data: The input data.\n",
+    "        :param n: Number of reconstructions per data point.\n",
+    "        :return: The reconstructed data.\n",
+    "        \"\"\"\n",
+    "        _, loc, scale = self.inference_net(data)\n",
+    "        latent_state = self.inference_net.sample_latent_state(loc, scale, n)\n",
+    "        return self.generator(latent_state)\n",
+    "\n",
+    "\n",
+    "    def phantasize(self, n):\n",
+    "        \"\"\"\n",
+    "        Generate data by randomly sampling from the prior.\n",
+    "\n",
+    "        :param n: Number of sampled data points.\n",
+    "        :return: Randomly generated data points.\n",
+    "        \"\"\"\n",
+    "        latent_state = Variable(torch.empty(n, self.inference_net.latent_variable_size).normal_())\n",
+    "        if USE_CUDA:\n",
+    "            latent_state = latent_state.cuda()\n",
+    "        return self.generator(latent_state)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Constructing a VAE\n",
+    "\n",
+    "We have now all the code for VAEs in place. Below, we have defined a factory method that makes it easier for you to play with different architectures."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def construct_vae(latent_type: str,\n",
+    "                  likelihood: str,\n",
+    "                  generator_layer_sizes: List[int],\n",
+    "                  infer_layer_sizes: List[int],\n",
+    "                  latent_variable_size: int,\n",
+    "                  data_dims: int,\n",
+    "                  generator_act_type: torch.nn.Tanh,\n",
+    "                  infer_act_type: torch.nn.Tanh) -> VAE:\n",
+    "    \"\"\"\n",
+    "    Construct a variational autoencoder\n",
+    "\n",
+    "    :param latent_type: Distribution of latent variable.\n",
+    "    :param likelihood: Type of likelihood.\n",
+    "    :param generator_layer_sizes: Sizes of generator hidden layers.\n",
+    "    :param infer_layer_size: Sizes of inference network hidden layers.\n",
+    "    :param latent_variable_size: Size of the latent variable.\n",
+    "    :param data_dims: Dimensionality of the data.\n",
+    "    :param generator_act_type: Activation function for generator hidden layers.\n",
+    "    :param infer_act_type: Activation function for inference network hidden layers.\n",
+    "    :return: A variational autoencoder.\n",
+    "    \"\"\"\n",
+    "    if issubclass(likelihood, Generator):\n",
+    "        generator = likelihood(input_dims=latent_variable_size, data_dims=data_dims, layer_sizes=generator_layer_sizes,\n",
+    "                                                 act_type=generator_act_type)\n",
+    "    else:\n",
+    "        raise Exception(\"{} is an invalid likelihood type. It should be a subclass of Generator\".format(likelihood))\n",
+    "\n",
+    "    if issubclass(latent_type, InferenceNetwork):\n",
+    "        inference_net = latent_type(data_dims, latent_variable_size,\n",
+    "                                                 infer_layer_sizes,\n",
+    "                                                 infer_act_type)\n",
+    "    else:\n",
+    "        raise Exception(\"{} is an invalid latent variable type. It should be a subclass of InferenceNetwork\".format(latent_type))\n",
+    "        \n",
+    "    return GaussianVAE(generator=generator, inference_net=inference_net)\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Exercise 4\n",
+    "\n",
+    "Your turn! Construct your own VAE below."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "vae = construct_vae(\n",
+    "    latent_type=GaussianInferenceNetwork,\n",
+    "    likelihood=ProductOfBernoullisGenerator,\n",
+    "    generator_layer_sizes=[200,500],\n",
+    "    infer_layer_sizes=[500,200],\n",
+    "    latent_variable_size=200,\n",
+    "    data_dims=784,\n",
+    "    generator_act_type=torch.nn.Tanh,\n",
+    "    infer_act_type=torch.nn.Tanh\n",
+    ")\n",
+    "if USE_CUDA:\n",
+    "    vae = vae.cuda()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "To check what your VAE looks like we will visualise it."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "print(vae)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Training a VAE\n",
+    "\n",
+    "We are now set to train the VAE. In order to do so we have to define a data iterator and a module in pytorch. The data iterator takes care of batching the data while the module executes the computation graph and updates the model parameters. For the purpose of training the model, we only need the training data which we load from disk."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "\n",
+    "mnist = {}\n",
+    "file_name = data_dir / f\"binary_mnist.{TRAIN_SET}\"\n",
+    "logging.info(f\"Reading {file_name} into memory\")\n",
+    "mnist[TRAIN_SET] = np.genfromtxt(file_name)\n",
+    "\n",
+    "logging.info(\"%s contains %d data points\", file_name, mnist[TRAIN_SET].shape[0])"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Using that data we define the training set loader. At this point we also need to choose a batch_size that will then be used in training. Notice that we randomise the order in which the data points are presented in the training set. Notice that the data and labels are identical. This is because the VAE tries to reconstruct its input data and thus labels and data are the same."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "batch_size = 200\n",
+    "\n",
+    "train_data = torch.Tensor(mnist[TRAIN_SET])\n",
+    "train_label = torch.Tensor(mnist[TRAIN_SET])\n",
+    "if USE_CUDA:\n",
+    "    train_data = train_data.cuda()\n",
+    "    train_label = train_label.cuda()\n",
+    "\n",
+    "dataset = torch.utils.data.TensorDataset(train_data, train_label)\n",
+    "\n",
+    "trainloader = torch.utils.data.DataLoader(dataset, batch_size=batch_size, shuffle=True)\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Training the vae (or any other network defined in mxnet) is most easily done using the [fit method](https://mxnet.incubator.apache.org/api/python/module/module.html#mxnet.module.BaseModule.fit). Here we choose to train our model for 20 epochs. Our optimiser will be adam. Training will take some time (approximately 2-5 minutes, depending on your machine). We are keeping track of the loss (the negative ELBO) to see how the model develops. Notice that this loss\n",
+    "is not the actual negative ELBO but a _doubly stochastic approximation_ to it (see [here](https://projecteuclid.org/download/pdf_1/euclid.aoms/1177729586) and [here](http://proceedings.mlr.press/v32/titsias14.pdf)). The two sources of stochasticity are the mini-batches (the ELBO is defined with respect to the entire data set) and the Gaussian reparametrisation (we approximate the integral over $ z $ through sampling). Both sources of stochasticity leave the approximation unbiased, however. "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "epochs = 30\n",
+    "\n",
+    "bce_func = torch.nn.BCELoss(reduction=\"none\")\n",
+    "optimizer = torch.optim.Adam(vae.parameters(), lr=DEFAULT_LEARNING_RATE)\n",
+    "\n",
+    "for t in range(epochs):\n",
+    "    running_loss = div = 0.0\n",
+    "    for i, data in enumerate(trainloader, 0):\n",
+    "        x,y = data\n",
+    "        y_pred, loc, scale = vae(x)\n",
+    "        bce = bce_func(y_pred, y).sum(dim=1).mean()\n",
+    "\n",
+    "#        # [batch_size]\n",
+    "        running_loss += bce.item()\n",
+    "        kl = diagonal_gaussian_kl(loc, scale).mean()\n",
+    "        div += kl.item()\n",
+    "        loss = bce + kl\n",
+    "    \n",
+    "        elbo = -running_loss - div\n",
+    "        vae.zero_grad()\n",
+    "    \n",
+    "        loss.backward()\n",
+    "        optimizer.step()\n",
+    "    \n",
+    "        # print statistics\n",
+    "        if i % 20 == 19:    # print every 20 mini-batches\n",
+    "            logger.info(\"[%5d, %5d] elbo: %.3f; divergence: %.3f\", t + 1, i + 1, elbo / 20, div / 20)\n",
+    "            running_loss = 0.0\n",
+    "            div = 0.0\n",
+    "    torch.save(vae.state_dict(), f\"vae-{t:04d}.torch\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Simulating Data from the VAE\n",
+    "\n",
+    "Now that we have trained the VAE, we can use it to produce data for us. We produce 10 random digits. The pixel intensities are the predicted Bernoulli parameters."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "num_samples = 10\n",
+    "\n",
+    "samples = vae.phantasize(num_samples)\n",
+    "samples = samples.cpu().detach().numpy()\n",
+    "print(samples.shape)\n",
+    "plot_images(samples, columns=5)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Notice that the above samples were obtained by sampling from the standard normal prior. We can also check the reconstruction abilities of the VAE. To do this we sample a random number from the test set and generate 10 reconstructions of it."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "file_name = data_dir / f\"binary_mnist.{TEST_SET}\"\n",
+    "logging.info(f\"Reading {file_name} into memory\")\n",
+    "mnist[TEST_SET] = np.genfromtxt(file_name)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Here is what the sampled number looks like. (Rerun if you want a different number.)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "random_indices = np.random.randint(mnist[TEST_SET].shape[0], size=1)\n",
+    "random_pictures = mnist[TEST_SET][random_indices, :]\n",
+    "plot_images(random_pictures, 1)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "And now we are going to see how the VAE reconstructs it."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "num_samples = 10\n",
+    "\n",
+    "random_x = torch.Tensor(random_pictures)\n",
+    "if USE_CUDA:\n",
+    "    random_x = random_x.cuda()\n",
+    "y_pred = (vae.generate_reconstructions(random_x, num_samples)\n",
+    "            .cpu()\n",
+    "            .detach()\n",
+    "            .numpy()\n",
+    "         )\n",
+    "plot_images(y_pred, columns=5)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Congrats!\n",
+    "\n",
+    "You have completed your first VAE tutorial. From here on you can modify this code to play with your own VAE ideas. Want to you use different data? Or vary the neural net structure for the generator or inference networks? Or use a different likelihood model? The world of probablistic modeling is yours to explore."
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.7.2"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}