Denoising Diffusion Probabilistic Models (DDPMs) describe a class of directed graphical models that leverage Markov chains analogously to diffusion processes to generate samples from complex data distributions.
They were first introduced in the 2015 paper ``Deep Unsupervised Learning using Nonequilibrium Thermodynamics'' by Sohl-Dickstein et al., where they were used to generate images. However, in the years after, they went largely ignored in favour of the two prominent image generation architectures at the time, Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs), which had been introduced shortly before. As of the writing of this essay (July 2021), these papers have 32343 and 15003 citations, respectively, as opposed to the DDPM paper, which has only 153. Nevertheless, the topic has recently seen a resurgence, after Ho et al. showed that DDPMs could match GANs in terms of Inception score and FID. The flurry of research that followed this paper has seen DDPMs applied to different domains, such as speech, music and audio up-sampling. In addition, they have been generalised to different distributions, like Poisson and Gaussian mixture, and to continuous diffusion processes.
In this essay, we will explore DDPMs with a binomial distribution and their application to MIDI generation. In the next chapter, we will introduce DDPMs in general. Then in chapter 3, we will derive them for Binomial Markov chains and explore how they work. Chapters 4 and 5 describe our experiments with these and the results, respectively. Finally, in chapter 6 a conclusion is provided.
DDPMs generate new data by sampling from a tractable distribution, such as a standard Gaussian, and then repeatedly applying Markov diffusion kernels to let this initial sample drift towards a sample from the data distribution; this is the reverse process. The forward process does exactly the opposite: it starts with a sample from the data distribution, i.e. an image or audio sample, and uses Markov kernels to slowly add noise to it until it becomes indistinguishable from random noise. These processes are demonstrated in Figure \ref{fig1} below. The left-to-right direction describes the data-generating (reverse) process and the right-to-left direction depicts the (forward) process of turning an image into random noise.
Figure 1: DDPM forward and backward process. Taken from Ho et al., 2020
In more mathematical terms, during the forward process the data distribution, denoted by $q(x^0)$, is gradually transformed by repeated application of a Markov diffusion kernel $q(x^t|x^{t-1})$,
$$q(x^t) = \int q(x^t|x^{t-1})\, q(x^{t-1})\, \mathrm{d}x^{t-1},$$
where the integral can be replaced by a sum for discrete distributions, as is the case for binomial DDPMs. This corresponds to the right-to-left direction in Figure 1 above and ends in the tractable distribution $\pi(x^T)$.
The Markov chain in the data generating direction is correspondingly called the reverse process and results in samples from the data distribution. The reverse Markov kernels are parameterised by learnable parameters $\theta_t$ and written $p_{\theta_t}(x^{t-1}|x^t)$. Sampling then proceeds as follows:
Result: Sample from the data distribution
- Sample $x^T \sim \pi(x^T)$
- for $t = T, T-1, \dots, 1$ do
  - Compute $\theta_t$
  - Sample $x^{t-1} \sim p_{\theta_t}(x^{t-1}|x^t)$
- Return $x^0$
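The sampling loop above can be sketched in a few lines. In the sketch below, the `model(x, t)` interface, returning the Bernoulli parameter $p_{\theta_t}(x^{t-1}=1|x^t)$ per entry, is an illustrative assumption rather than the model used in our experiments.

```python
import numpy as np

def sample_ddpm(model, T, shape, rng=None):
    """Ancestral sampling for a binomial DDPM (sketch).

    `model(x, t)` is assumed to return the Bernoulli parameter
    p_theta(x^{t-1} = 1 | x^t) for each entry of x -- an illustrative
    interface, not the essay's actual model.
    """
    rng = np.random.default_rng() if rng is None else rng
    # Start from the tractable prior pi(x^T): independent fair coin flips.
    x = rng.integers(0, 2, size=shape)
    for t in range(T, 0, -1):
        p = model(x, t)                          # p_theta(x^{t-1} = 1 | x^t)
        x = (rng.random(shape) < p).astype(int)  # sample x^{t-1}
    return x  # approximate sample x^0 from the data distribution
```
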
The log-likelihood of DDPMs is given by
$$L = \int \mathrm{d}x^0\, q(x^0)\, \log p(x^0). \qquad [1]$$
This integral is naively intractable, so we must apply some mathematical tricks to obtain an expression with which we can work. The first step is to write
$$p(x^0) = \int \mathrm{d}x^{1:T}\, p(x^{0:T}) = \int \mathrm{d}x^{1:T}\, q(x^{1:T}|x^0)\, \frac{p(x^{0:T})}{q(x^{1:T}|x^0)}.$$
Secondly, we can write these joint distributions in terms of a product of Markov kernels:
$$p(x^{0:T}) = \pi(x^T) \prod_{t=1}^{T} p_{\theta_t}(x^{t-1}|x^t), \qquad q(x^{1:T}|x^0) = \prod_{t=1}^{T} q(x^t|x^{t-1}),$$
where we used the Markov property in the first row. This yields the expression
$$p(x^0) = \int \mathrm{d}x^{1:T}\, q(x^{1:T}|x^0)\, \pi(x^T) \prod_{t=1}^{T} \frac{p_{\theta_t}(x^{t-1}|x^t)}{q(x^t|x^{t-1})}.$$
Note that each of the terms in the product is the ratio between a reverse step and a forward step at time $t$.
We then plug this into Equation [1] and use Jensen's inequality to get an expression with the logarithm inside the integrals. This results in a tractable expression at the cost of introducing an inequality. However, optimising the resulting quantity, the Evidence Lower BOund (ELBO), works well in practice.
Finally, to make the denominators and numerators in the fractions depend on the same state, we use Bayes' rule to write
$$q(x^t|x^{t-1}) = \frac{q(x^{t-1}|x^t, x^0)\, q(x^t|x^0)}{q(x^{t-1}|x^0)}.$$
This allows us to express the ELBO in terms of Kullback–Leibler divergences:
$$L \geq -\sum_{t=2}^{T} \mathbb{E}_{q(x^0, x^t)}\!\left[D_\text{KL}\!\left(q(x^{t-1}|x^t, x^0)\,\|\,p_{\theta_t}(x^{t-1}|x^t)\right)\right] + \text{entropy terms independent of } \theta.$$
Sampling from $q(x^0)$ and a uniformly random time step $t$ then gives a stochastic estimate of this objective, leading to the following training algorithm.
Result: Converged DDPM
While not converged:
- Sample $x^0$ from $q(x^0)$
- Sample $t$ from $\text{Uniform}(1, \dots, T)$
- Compute $\text{Loss} = T \cdot D_\text{KL}(q(x^{t-1}|x^t, x^0)\,\|\,p(x^{t-1}|x^t))$
- Take gradient step on $\text{Loss}$
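For binary states the per-step loss is a KL divergence between two Bernoulli distributions, which has a simple closed form. The sketch below, with hypothetical array arguments `q_post` and `p_model` holding the two Bernoulli parameters per bit, illustrates the loss computation.

```python
import numpy as np

def bernoulli_kl(q, p, eps=1e-12):
    """Elementwise D_KL(Bern(q) || Bern(p)); eps guards against log(0)."""
    q = np.clip(q, eps, 1 - eps)
    p = np.clip(p, eps, 1 - eps)
    return q * np.log(q / p) + (1 - q) * np.log((1 - q) / (1 - p))

def step_loss(q_post, p_model, T):
    """Stochastic estimate of the bound for one sampled time step:
    T * D_KL(q(x^{t-1}|x^t, x^0) || p(x^{t-1}|x^t)), averaged over bits."""
    return T * bernoulli_kl(q_post, p_model).mean()
```
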
We define the binomial diffusion kernel such that each bit is flipped with probability $\beta_t/2$, i.e. $q(x^t = 1|x^{t-1}) = x^{t-1}(1-\beta_t) + \beta_t/2$, which corresponds to the transition matrix
$P_t = \begin{Vmatrix} 1-\beta_t/2 & \beta_t/2\\ \beta_t/2 & 1-\beta_t/2 \end{Vmatrix}.$
In accordance with the conditions from Equations 1.3 and 1.4 of chapter III of ``An Introduction to Stochastic Modeling'' by Taylor and Karlin, all the elements are greater than or equal to zero and all the rows add up to one.
To compute the $t$-step transition matrix $M_t = P_t P_{t-1} \cdots P_1$, first consider the two-step case:
$M_2 = \begin{Vmatrix} 1-\beta_2/2 & \beta_2/2\\ \beta_2/2 & 1-\beta_2/2 \end{Vmatrix}\begin{Vmatrix} 1-\beta_1/2 & \beta_1/2\\ \beta_1/2 & 1-\beta_1/2 \end{Vmatrix} = \begin{Vmatrix} 1-\alpha_2/2 & \alpha_2/2\\ \alpha_2/2 & 1-\alpha_2/2 \end{Vmatrix},$
where $\alpha_2 = \beta_1 + \beta_2 - \beta_1\beta_2 = 1 - (1-\beta_1)(1-\beta_2)$. By induction, the $t$-step matrix takes the same form,
$M_t = \begin{Vmatrix} 1-\alpha_t/2 & \alpha_t/2\\ \alpha_t/2 & 1-\alpha_t/2 \end{Vmatrix},$
with $\alpha_t = 1 - \prod_{i=1}^{t}(1-\beta_i)$,
and hence $q(x^t = 1|x^0) = x^0(1-\alpha_t) + \alpha_t/2$.
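The closure of these matrices under multiplication is easy to verify numerically. The following sketch composes one-step transition matrices and computes $\alpha_t$ from the product formula; the off-diagonal of $M_t$ should then equal $\alpha_t/2$.

```python
import numpy as np

def P(beta):
    """One-step binomial transition matrix: a bit flips with probability beta/2."""
    return np.array([[1 - beta / 2, beta / 2],
                     [beta / 2, 1 - beta / 2]])

def M(betas):
    """t-step transition matrix M_t = P_t ... P_1."""
    out = np.eye(2)
    for b in betas:
        out = P(b) @ out
    return out

betas = [0.1, 0.2, 0.3]
# alpha_t = 1 - prod_i (1 - beta_i); the off-diagonal of M_t is alpha_t / 2.
alpha_t = 1 - np.prod([1 - b for b in betas])
```
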
We now have all the terms to compute the posterior $q(x^{t-1}|x^0, x^t)$ via Bayes' rule. It can be written as
$q(x^{t-1}=1|x^0, x^t) = \begin{pmatrix} 1-x^0\\ x^0 \end{pmatrix}^\top L_t \begin{pmatrix} 1-x^t\\ x^t \end{pmatrix},$
where the entries of the matrix $L_t$ are the posterior probabilities $(L_t)_{x^0 x^t} = q(x^{t-1}=1|x^0, x^t)$ for the four combinations of $x^0, x^t \in \{0, 1\}$, each computed from $P_t$, $M_{t-1}$ and $M_t$ via Bayes' rule.
The vector expression can be simplified further: note that the binomial vectors can be decomposed as
$\begin{pmatrix} 1-x^t\\ x^t \end{pmatrix} = \begin{pmatrix} 1 & -1\\ 0 & 1 \end{pmatrix}\begin{pmatrix} 1\\ x^t \end{pmatrix}$
and therefore
$\tilde{L}^t = \begin{pmatrix} 1 & -1\\ 0 & 1 \end{pmatrix}^\top L^t \begin{pmatrix} 1 & -1\\ 0 & 1 \end{pmatrix}$
encodes the probabilities in the form of Equation [3]. Using the fact that, by the symmetry of the kernels, $L^t_{00} = 1 - L^t_{11}$ and $L^t_{10} = 1 - L^t_{01}$, this becomes
$\tilde{L}^t = \begin{pmatrix} 1 - L^t_{11} & L^t_{11}+L^t_{01}-1\\ L^t_{11} - L^t_{01} & 0 \end{pmatrix}.$
Hence
$$q(x^{t-1}=1|x^0, x^t) = (1 - L^t_{11}) + (L^t_{11}+L^t_{01}-1)\,x^t + (L^t_{11} - L^t_{01})\,x^0,$$
where the coefficients can be read off directly from $\tilde{L}^t$.
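As a sanity check on the posterior, the Bayes'-rule computation can be written out directly for single bits; the helper names below are illustrative, not taken from our implementation.

```python
def q_step(x_new, x_old, beta):
    """One-step kernel q(x^t = x_new | x^{t-1} = x_old); flip probability beta/2."""
    return beta / 2 if x_new != x_old else 1 - beta / 2

def q_marg(x_t, x0, alpha):
    """t-step marginal q(x^t | x^0), with alpha_t = 1 - prod_i(1 - beta_i)."""
    return alpha / 2 if x_t != x0 else 1 - alpha / 2

def posterior(x0, x_t, beta_t, alpha_prev, alpha_t):
    """q(x^{t-1} = 1 | x^0, x^t) by Bayes' rule on single bits."""
    num = q_step(x_t, 1, beta_t) * q_marg(1, x0, alpha_prev)
    return num / q_marg(x_t, x0, alpha_t)
```

For instance, with $\beta_1 = 0.2$ and $\beta_2 = 0.3$ (so $\alpha_1 = 0.2$, $\alpha_2 = 0.44$), the symmetry $L^t_{00} = 1 - L^t_{11}$ can be confirmed numerically.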
It is important to choose the noise schedule $\beta_t$ appropriately. We consider two schedules:
- Constant: $\beta^\text{C}_t = \beta \in [0, 1]$
- Reciprocal: $\beta^\text{R}_t = 1/(T - t + 1)$, as used in Sohl-Dickstein et al.
In the case of a constant schedule, $\alpha^\text{C}_t = 1 - (1-\beta)^t$, whereas for the reciprocal schedule the product telescopes:
$\alpha^\text{R}_t = 1 - \prod_{i = 1}^t\left(1-\frac{1}{T - i + 1}\right) = 1 - \prod_{i = 1}^t\frac{T-i}{T - i + 1} = 1 - \frac{T-t}{T} = \frac{t}{T}.$
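Both schedules are easy to compare numerically; the sketch below computes $\alpha_t$ for each (the values $T = 100$ and $\beta = 0.05$ are illustrative choices, not the settings used in our experiments).

```python
import numpy as np

T = 100
t = np.arange(1, T + 1)

# Constant schedule: beta_t = beta, so alpha_t = 1 - (1 - beta)^t.
beta = 0.05
alpha_const = 1 - (1 - beta) ** t

# Reciprocal schedule: beta_t = 1 / (T - t + 1); the product telescopes to t / T.
beta_recip = 1 / (T - t + 1)
alpha_recip = 1 - np.cumprod(1 - beta_recip)
```
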
To illustrate this, see Figure 2 below, which shows the value of $\alpha_t$ at different time steps for both schedules.
Figure 2: Value of $\alpha_t$ at different time steps
It can be seen that a constant $\beta$ drives $\alpha_t$ towards one geometrically, whereas the reciprocal schedule increases $\alpha_t$ linearly in $t$.
To gain insight into the conditional forward process as determined by the Markov kernels derived above, Figure 3 below shows the expectation value of $x^t$ at different time steps.
Figure 3: Expectation value of $x^t$ at different time steps for the conditional forward process
The blue and orange lines represent the average value of $x^t$ for the two possible starting values of $x^0$; both converge to $1/2$ as the noise accumulates.
MIDI stands for Musical Instrument Digital Interface and is a standardised communications protocol for recorded music. Sounds are encoded by instrument, pitch and velocity, which corresponds to how loudly a tone is to be played. A typical MIDI file consists of several tracks, up to 16 different ones, one for each instrument. Typically one of these is a percussion instrument, for which the 10th channel is reserved, whereas the rest are other instruments, such as piano, guitar, strings, ensemble, brass and sound effects. There are 47 different percussion instruments and 128 others. The tempo indicates the number of quarter notes played per minute, and the length of each quarter note is determined by the resolution parameter.
For our experiments, we removed the percussion track, overlaid the remaining tracks on top of each other and let them be played by a piano. This resulted in a single track per song. Furthermore, we binarised the individual notes so that they only take on the values 0 and 1, to make them suitable for our binomial DDPM. These modifications reduce the sound quality somewhat, but for the purposes of our experiment enough of the intrinsic characteristics is retained to yield valuable results. Below in Figure 4 is an example.
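A minimal sketch of this preprocessing step, assuming the per-track piano rolls have already been extracted from a MIDI file (e.g. with a MIDI library) into an array of shape (n_tracks, 128, n_steps) with the percussion track removed:

```python
import numpy as np

def flatten_and_binarise(piano_rolls):
    """Overlay per-track piano rolls into a single binary track.

    `piano_rolls` is assumed to already hold note velocities with shape
    (n_tracks, 128, n_steps), percussion removed; extracting these from a
    MIDI file would be done beforehand with a MIDI library.
    """
    merged = np.asarray(piano_rolls).sum(axis=0)  # overlay all tracks
    return (merged > 0).astype(np.uint8)          # keep only note on/off
```

Every non-zero velocity becomes a 1, so dynamics are discarded while pitch and timing survive.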
Figure 4: Sample of MIDI sequence of the song ``Livin' on a Prayer"
For our experiments, we used the composing AI dataset, consisting of 77153 songs, ranging from classical music to pop music to folk tunes. These songs were preprocessed using the steps described above, resulting in binary data of 128 note channels and time lengths of several hundred steps, depending on the length of the song. They were then converted and stored as binary tensors, which resulted in a dataset of about 10 GB.
The model that we used was based on a pre-trained GPT-2 model from HuggingFace. Recent research has shown that pre-trained language models transfer well to tasks and modalities from other domains, such as image recognition and algorithmic computations. We added a linear projection layer for the input and output, as well as positional embeddings at each layer. The positional embeddings were calculated as fixed sinusoidal embeddings
and projected down to the hidden size of the model. During training we froze the parameters of the pre-trained model during the initial five epochs, only training the input and output layers, and the projection of the positional embeddings. After this, we unfroze all parameters and continued training.
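A sketch of such embeddings, assuming the standard sinusoidal form (the essay's exact variant may differ); the subsequent projection down to the hidden size would be a learned linear layer:

```python
import numpy as np

def sinusoidal_embedding(positions, dim):
    """Fixed sinusoidal embeddings: sin/cos at geometrically spaced frequencies.

    Assumes the standard transformer form. A learned linear layer would then
    project the result from `dim` down to the model's hidden size.
    """
    positions = np.asarray(positions, dtype=float)[:, None]
    i = np.arange(dim // 2)[None, :]
    angles = positions / (10000.0 ** (2 * i / dim))
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
```
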
As far as we know, there are no algorithmic methods that can quantify the quality of the results in a way that allows comparison with other MIDI sequences, since most methods, such as the Overlapping Area metric, assume an underlying Gaussian distribution. Instead, we describe the results in a more subjective manner, leaving them to be judged by the reader (see the attached files for generated music samples). Below, two samples of generated MIDI sequences are shown.
Figure 5: Two generated MIDI sequences
The density of notes is similar to that of the original MIDI sequences from the training dataset. Furthermore, there appear to be longer and shorter notes, as indicated by bars of different lengths. Most notes lie within the middle of the range, as is common for real music too. In the attached files, the corresponding WAV files of these MIDI sequences can be heard. People to whom we have shown the recordings have described them as 'futuristic'.

