Core Diffusion is a collection of minimal implementations of Diffusion Models that I built while studying core diffusion ideas, all explained in my video.
The plan was to write the simplest possible one-file diffusion training script and then build on it step by step. This lets you study the code and grow your mental model one layer of complexity at a time.
There are the following numbered versions, each building on the previous one:
- Basic CNN: A minimalistic implementation of the core ideas behind DDPM, but with a simple Convolutional Neural Network instead of a UNet and no timestep conditioning.
- Naive Time Conditioning: Extends the previous version with naive timestep injection as a single float from -1 to 1. It's better than nothing, but that's not how it's done in modern models.
- Sinusoidal Embeddings: Upgrades the timestep conditioning to the proper sinusoidal encodings first proposed in the Transformer paper (both conditioning approaches are sketched below the list).
- Residual UNet: Replaces the simple Convolutional Neural Network with a UNet containing residual connections. A significant architectural step up.
- Self-Attention: Upgrades the UNet with Self-Attention layers. This version is conceptually closest to the one used in the original 'DDPM' paper.
- Cosine Schedule: Implements the improved noise schedule proposed in the 'Improved DDPM' paper (sketched below). Note: the loss is higher here, and that is expected. The cosine schedule makes the task harder for the network but leads to better sample quality.
- Zero Initialization: Controls the initial variance of the output, which the residual connections would otherwise skew. One way to do this is to scale down the residual connections, as mentioned in the 'Diffusion Models Beat GANs on Image Synthesis' paper, but their codebase does it by initializing the last layer to zeros, so that's what I did here as well (sketched below).
- Fixed Upsampling: Fixes checkerboard pattern artifacts, explained here, by upsampling through interpolation instead of transposed convolutions (sketched below).
- Smooth Weights: Stabilizes training with an exponential moving average of the model weights (sketched below). This lowers the apparent output quality early on, but only on the surface, since the training weights remain the same. In the long term it improves output quality, and we liiiiiike long term ;)
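For the curious, here are minimal sketches of a few of the ideas above. They are simplified illustrations, not code copied from the scripts, so the names and dimensions (like `max_t` or the embedding size) are just examples. First, the difference between naive and sinusoidal timestep conditioning:

```python
import math
import torch

def naive_time(t, max_t):
    # Naive conditioning: squash the integer timestep into a single float in [-1, 1].
    # The network only ever sees one scalar per image, which is a weak signal.
    return 2.0 * t.float() / max_t - 1.0

def sinusoidal_embedding(t, dim):
    # Sinusoidal encoding from the Transformer paper: each timestep becomes a vector
    # of sines and cosines at geometrically spaced frequencies, which the network
    # can resolve far more easily than a single scalar.
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half).float() / half)
    args = t.float()[:, None] * freqs[None, :]                    # (batch, half)
    return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)  # (batch, dim)

t = torch.randint(0, 1000, (8,))          # a batch of random timesteps
print(naive_time(t, 1000).shape)          # torch.Size([8])
print(sinusoidal_embedding(t, 64).shape)  # torch.Size([8, 64])
```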
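The cosine noise schedule from the 'Improved DDPM' paper can be computed roughly like this (the 0.008 offset and the 0.999 clipping value are the ones used in the paper):

```python
import math
import torch

def cosine_betas(timesteps, s=0.008):
    # alpha_bar follows a squared cosine, so the signal is destroyed more gently
    # at the start and end of the diffusion process than with a linear schedule.
    steps = torch.arange(timesteps + 1, dtype=torch.float64)
    alpha_bar = torch.cos(((steps / timesteps) + s) / (1 + s) * math.pi / 2) ** 2
    alpha_bar = alpha_bar / alpha_bar[0]
    betas = 1 - alpha_bar[1:] / alpha_bar[:-1]
    return betas.clamp(max=0.999).float()  # clip betas near t = T, as in the paper

betas = cosine_betas(1000)
print(betas[0].item(), betas[-1].item())  # tiny beta at t = 0, capped beta at t = T
```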
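Zero initialization is essentially a one-liner applied to the model's final layer; it makes the untrained network output zeros, which keeps the initial output variance in check despite all the residual connections:

```python
import torch.nn as nn

def zero_init(layer):
    # Start the final layer at zero so the untrained model predicts zeros,
    # following the trick used in the Guided Diffusion codebase.
    nn.init.zeros_(layer.weight)
    if layer.bias is not None:
        nn.init.zeros_(layer.bias)
    return layer

final_conv = zero_init(nn.Conv2d(64, 1, kernel_size=3, padding=1))
```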
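The 'fixed upsampling' idea swaps transposed convolutions (whose overlapping kernels cause the checkerboard pattern) for interpolation followed by a regular convolution, along these lines:

```python
import torch.nn as nn

# Instead of: nn.ConvTranspose2d(64, 64, kernel_size=2, stride=2)
upsample_block = nn.Sequential(
    nn.Upsample(scale_factor=2, mode="nearest"),  # plain interpolation, no checkerboard
    nn.Conv2d(64, 64, kernel_size=3, padding=1),  # then a normal convolution
)
```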
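And smooth weights are just an exponential moving average kept next to the training weights: the optimizer keeps updating the original model, while you sample from the EMA copy (the decay value here is only an example):

```python
import copy
import torch

@torch.no_grad()
def ema_update(ema_model, model, decay=0.999):
    # Nudge each EMA parameter a tiny step towards the current training parameter.
    # The EMA copy averages out the noise of individual gradient updates.
    for ema_p, p in zip(ema_model.parameters(), model.parameters()):
        ema_p.mul_(decay).add_(p, alpha=1.0 - decay)

# Usage sketch:
# ema_model = copy.deepcopy(model)  # create the shadow copy once, before training
# ema_update(ema_model, model)      # call after every optimizer step
```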
There is potential for more versions described below.
Ran with `python 9-smoothWeights.py -o -s -n finalMNIST`, so it's all just default settings. It took around 2h to train on my laptop RTX 3070.
The run is slightly outdated, but it still fully solves MNIST, so you can't get much better than that.
Ran with `python 9-smoothWeights.py -o -s -e 400 -n finalCIFAR10 -d cifar10`, so it's a slightly longer run, but otherwise with default settings. This run took 10h to train on the same laptop RTX 3070, but you get sensible images even after 2.5h (100 epochs); it then keeps slowly improving, and these 400 epochs aren't even its limit. I'd run it for 1000 epochs if I had more time.
To be serious, the parameters should be tuned for this much more complex dataset.
IMPORTANT: In this run, until late in the training some of the generated images come out fully black or fully white; nothing is broken. I'm not sure whether it's a quirk of these models or whether there are improvements that would fix it (other than hard-clipping the outputs at every step). This behavior is shown below in the output after 50 epochs. Just so you guys know and don't message me asking if something is broken, hehe. It eventually fixes itself, but I have no idea where it comes from.
Each script is self-contained and currently supports:
- Random Generation: Scripts generate random images from the dataset (there are no prompts here).
- Datasets: MNIST (default, grayscale 28x28) and CIFAR10 (RGB 32x32), both downloaded automatically.
- Clear Logging: Prints the run name, dataset, and image shape at startup to help differentiate each run.
- Progress Tracking: Real-time training loss updates and a `tqdm` progress bar in the console.
- Visual Sampling: Optional sampling of outputs after every epoch to visualize the model's progress.
- Checkpoint Management: Optional saving and loading of checkpoints to resume training.
- Flexible Configuration: Modify parameters like learning rate, batch size, and timestep embedding size directly via the console.
For full details on configuration, please check the argument parser at the bottom of every Python file.
Note on Checkpoints: By default we only keep the newest checkpoint, not all of them, since having 100 checkpoints of 200MB each doesn't sound like a fantastic default behavior. Modify the script if needed - it only requires deleting the checkpoint-deleting lines. The world is yours to shape.
Note on Performance: This is an educational repo; it doesn't aim for speed or multi-GPU training, even though I did my best not to waste resources. I myself work on a mid-range gaming laptop, so that is also my target compute.
1. Install `requirements.txt`. If you're using pip, it's typically done with:
```bash
pip install -r requirements.txt
```
2. Run any Python file from the console. By default the code will download the MNIST dataset to train on. You don't need any parameters, but you can provide them to save generated images and checkpoints. For example:
```bash
python 9-smoothWeights.py --runName runName --epochs 100 --saveOutput --saveCheckpoints
```
or shorter:
```bash
python 9-smoothWeights.py -n runName -e 100 -o -s
```
Note on complex datasets: CIFAR10 and more complex datasets require longer training, deeper nets, even more smoothed weights, and other tweaks, so don't rely on the default settings for everything, hehe. The default settings are for the default dataset.
Warning on odd image shapes: The code should work on many datasets, but it will break if your images have nonstandard shapes that don't divide cleanly by 4 (we need two divisions by 2). To make it work you'd have to ensure that the downsampling and upsampling architectures are perfectly aligned.
Checkpoints will be automatically saved in the `checkpoints` folder, so all you need to do is provide the `--checkpoint` flag, followed by the checkpoint name to load and resume from. For example:
```bash
python 9-smoothWeights.py -n runName -e 100 -o -s -ch runName_level_50.pth
```
I'm not going all the way on implementing tiny, approachable versions of the most modern advancements, since I'm not working with diffusion on a daily basis (I want to specialize in RL); I'm just here to study and have fun. However, if I get super bored, or if you'd like to write your own improved versions, here are some suggested directions:
- Implement faster sampling according to the DDIM paper, which enables sampling in only a fraction of the steps used during training (the core update step is sketched right after this list).
- Follow the Guided Diffusion paper and adopt its architectural tricks (the zero-initialized last layers are already done; I just haven't read the whole paper yet).
- Add a Variational Autoencoder to compress and decompress images, allowing our diffusion model to operate on compressed latents rather than full-sized images.
- Add labels and prompt encodings to direct our diffusion instead of it sampling randomly from the dataset.
- Reach Stable Diffusion? With the previous suggestions done, the model should be very close to being a minimalistic tiny version of Stable Diffusion, which started the image generation revolution back in 2022.
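If you want a starting point for the DDIM direction, the core of it is a deterministic update step (eta = 0) that lets you jump between distant timesteps. A rough sketch, assuming the `alpha_bar` values are scalar tensors taken from whatever noise schedule the script uses:

```python
import torch

@torch.no_grad()
def ddim_step(x_t, eps_pred, alpha_bar_t, alpha_bar_prev):
    # Estimate the clean image from the predicted noise, then move it directly to
    # the (possibly much earlier) previous timestep instead of stepping one by one.
    x0_pred = (x_t - torch.sqrt(1 - alpha_bar_t) * eps_pred) / torch.sqrt(alpha_bar_t)
    return torch.sqrt(alpha_bar_prev) * x0_pred + torch.sqrt(1 - alpha_bar_prev) * eps_pred
```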
I'm very grateful for:
- The minDiffusion repo - this served as my starting point: a single-file diffusion script that I could use as a reference. I love simple and clean implementations that are approachable to study ^^


