This project implements a Real-ESRGAN (Realistic Enhanced Super-Resolution Generative Adversarial Network) model for the Blind SISR (Single Image Super-Resolution) task. The primary goal is to upscale low-resolution (LR) images by a given factor (2x, 4x, 8x) to produce super-resolution (SR) images with high fidelity and perceptual quality.
This implementation is based on the paper Real-ESRGAN: Training Real-World Blind Super-Resolution with Pure Synthetic Data.
The following images compare the standard bicubic interpolation with the output of the Real-ESRGAN model.
This project is based on my ESRGAN implementation. The following key features represent the main upgrades implemented to transition to the Real-ESRGAN project and improve performance on real-world data:
- Implements a High-Order Degradation Model that applies a complex sequence of degradations (blur, resize, noise, JPEG) twice to synthesize realistic training data on the fly, replacing simple bicubic downsampling (a rough sketch of this pipeline follows this list).
- Incorporates sinc filters into the data generation process to simulate and remove common ringing and overshoot artifacts found in real-world images.
- Replaces the standard VGG-style discriminator with a U-Net Discriminator with Spectral Normalization, which stabilizes training and provides pixel-level feedback for better local detail refinement.
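The sketch below illustrates one such second-order degradation pass. It is a minimal, hedged example: the helper names, parameter ranges, and use of Pillow are illustrative assumptions, sinc filtering is omitted, and it is not the project's actual pipeline.

```python
import io
import random

import numpy as np
from PIL import Image, ImageFilter


def degrade_once(img: Image.Image) -> Image.Image:
    # Blur with a random Gaussian kernel.
    img = img.filter(ImageFilter.GaussianBlur(radius=random.uniform(0.2, 3.0)))

    # Randomly rescale (the final LR size is restored by the caller).
    scale = random.uniform(0.5, 1.2)
    w, h = img.size
    img = img.resize((max(1, int(w * scale)), max(1, int(h * scale))), Image.BICUBIC)

    # Additive Gaussian noise.
    arr = np.asarray(img).astype(np.float32)
    arr += np.random.normal(0.0, random.uniform(1.0, 10.0), arr.shape)
    img = Image.fromarray(np.clip(arr, 0, 255).astype(np.uint8))

    # JPEG compression with a random quality factor.
    buf = io.BytesIO()
    img.save(buf, format="JPEG", quality=random.randint(30, 95))
    buf.seek(0)
    return Image.open(buf).convert("RGB")


def high_order_degradation(hr: Image.Image, scale: int = 4) -> Image.Image:
    lr = degrade_once(degrade_once(hr))  # the whole degradation sequence is applied twice
    w, h = hr.size
    return lr.resize((w // scale, h // scale), Image.BICUBIC)  # bring to the final LR size
```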
As a Generator, this project uses a pretrained ESRGAN model with the same architecture:
```
Input (LR Image)
|
v
+-Input-Conv-Block-----------------------+
| Conv2D (9x9 kernel) (3 -> 64 channels) |
+----------------------------------------+
|
+---------------------------+
| |
v |
+-----+-23x-Residual-in-Residual-Blocks---------+ |
| |-3x-Residual-Dense-Blocks----------------+ |
+-----+ Conv2D (3x3 kernel) (64 -> 32 channels) | |
(Skip connections)| | LeakyReLU | | (Skip connection)
+-----+ Conv2D (3x3 kernel) (96 -> 32 channels) | |
| | LeakyReLU | |
+-----+ Conv2D (3x3 kernel) (128 -> 32 channels)| |
| | LeakyReLU | |
+-----+ Conv2D (3x3 kernel) (160 -> 32 channels)| |
| | LeakyReLU | |
+-----+ Conv2D (3x3 kernel) (192 -> 64 channels)| |
| | * RESIDUAL_SCALING_VALUE + X | |
| +-----------------------------------------+ |
| | * RESIDUAL_SCALING_VALUE + X | |
+-----+-----------------------------------------+ |
| |
v |
+-Middle-Conv-Block-----------------------+ |
| Conv2D (3x3 kernel) (64 -> 64 channels) | |
+-----------------------------------------+ |
| |
+---------------------------+
|
v
+-2x-Sub-pixel-Conv-Blocks-----------------+
| Conv2D (3x3 kernel) (64 -> 256 channels) |
| PixelShuffle (h, w, 256 -> 2h, 2w, 64) |
| PReLU |
+------------------------------------------+
|
v
+-Final-Conv-Block-----------------------+
| Conv2D (9x9 kernel) (64 -> 3 channels) |
| Tanh |
+----------------------------------------+
|
v
Output (SR Image)
```
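For orientation, here is a minimal PyTorch sketch of one Residual Dense Block from the diagram above (channel growth of 32, residual scaling). The class name, defaults, and the `RESIDUAL_SCALING_VALUE` shown are illustrative assumptions and may differ from `models.py`.

```python
import torch
from torch import nn

RESIDUAL_SCALING_VALUE = 0.2  # assumed value; the project keeps its own setting in config.py


class ResidualDenseBlock(nn.Module):
    def __init__(self, channels: int = 64, growth: int = 32) -> None:
        super().__init__()
        self.conv1 = nn.Conv2d(channels, growth, 3, padding=1)                 # 64 -> 32
        self.conv2 = nn.Conv2d(channels + growth, growth, 3, padding=1)        # 96 -> 32
        self.conv3 = nn.Conv2d(channels + 2 * growth, growth, 3, padding=1)    # 128 -> 32
        self.conv4 = nn.Conv2d(channels + 3 * growth, growth, 3, padding=1)    # 160 -> 32
        self.conv5 = nn.Conv2d(channels + 4 * growth, channels, 3, padding=1)  # 192 -> 64
        self.lrelu = nn.LeakyReLU(0.2, inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Each conv sees the concatenation of the input and all previous feature maps.
        c1 = self.lrelu(self.conv1(x))
        c2 = self.lrelu(self.conv2(torch.cat([x, c1], dim=1)))
        c3 = self.lrelu(self.conv3(torch.cat([x, c1, c2], dim=1)))
        c4 = self.lrelu(self.conv4(torch.cat([x, c1, c2, c3], dim=1)))
        c5 = self.conv5(torch.cat([x, c1, c2, c3, c4], dim=1))
        return c5 * RESIDUAL_SCALING_VALUE + x  # local residual with scaling
```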
As a Discriminator, this project uses a U-Net architecture with Spectral Normalization that performs pixel-wise assessment of images to provide local feedback for better texture refinement and training stability.
Note: The model outputs logits, which are passed to the BCEWithLogitsLoss loss function (with a built-in Sigmoid), so no separate Sigmoid layer is needed.
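A hedged sketch of both ideas, spectral normalization around a discriminator convolution and raw logits fed straight into `BCEWithLogitsLoss`; this shows only the pattern, not the full U-Net from `models.py`.

```python
import torch
from torch import nn
from torch.nn.utils import spectral_norm

# Spectral normalization wrapped around a convolution, as used throughout the U-Net discriminator.
sn_conv = spectral_norm(nn.Conv2d(3, 1, kernel_size=3, padding=1))

criterion = nn.BCEWithLogitsLoss()   # applies the Sigmoid internally

images = torch.rand(4, 3, 128, 128)  # a batch of (real or generated) images
logits = sn_conv(images)             # per-pixel logits, no Sigmoid layer at the end
target = torch.ones_like(logits)     # 1 = "real" label for every pixel
loss = criterion(logits, target)
```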
The model is trained on the DF2K_OST (DIV2K + Flickr2K + OST) dataset. The data_processing.py script dynamically creates LR images from HR images using bicubic downsampling and applies random crops and augmentations (flips, rotations).
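As an illustration of that on-the-fly preparation, here is a minimal torchvision-based sketch; the crop size, scale, and function name are illustrative assumptions, not the actual `SRDataset` code.

```python
import random

import torchvision.transforms.functional as TF
from PIL import Image
from torchvision import transforms


def make_pair(hr: Image.Image, crop_size: int = 192, scale: int = 4):
    hr = transforms.RandomCrop(crop_size)(hr)                     # random HR crop
    if random.random() < 0.5:
        hr = TF.hflip(hr)                                         # random horizontal flip
    hr = TF.rotate(hr, angle=random.choice([0, 90, 180, 270]))    # random 90-degree rotation
    lr = hr.resize((crop_size // scale, crop_size // scale), Image.BICUBIC)  # bicubic LR
    return hr, lr
```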
The DIV2K_valid dataset is used for validation.
The test.py script is configured to evaluate the trained model on standard benchmark datasets: Set5, Set14, BSDS100, and Urban100.
```
.
├── checkpoints/ # Stores model weights (.safetensors) and training states
├── images/ # Directory for inference inputs, outputs, and training plots
├── config.py # Configures the application logger, hyperparameters and file paths
├── data_processing.py # Defines the SRDataset class and image transformations
├── inference.py # Script to run the model on a single image
├── models.py # Generator, Discriminator and TruncatedVGG19 model architectures definition
├── test.py # Script for evaluating the model on benchmark datasets
├── train.py # Script for training the model
└── utils.py            # Utility functions (metrics, checkpoints, plotting)
```
All hyperparameters, paths, and training settings can be configured in the config.py file.
Explanation of some settings:
- `INITIALIZE_WITH_ESRGAN_CHECKPOINT`: Set to `True` to use pre-trained ESRGAN weights (for `pretrain.py`).
- `LOAD_REAL_ESRNET_CHECKPOINT`: Set to `True` to resume training from the last Real-ESRNET checkpoint (for `pretrain.py`).
- `LOAD_BEST_REAL_ESRNET_CHECKPOINT`: Set to `True` to resume training from the best Real-ESRNET checkpoint (for `pretrain.py`).
- `INITIALIZE_WITH_REAL_ESRNET_CHECKPOINT`: Set to `True` to use pre-trained Real-ESRNET weights (for `train.py`).
- `LOAD_REAL_ESRGAN_CHECKPOINT`: Set to `True` to resume training from the last Real-ESRGAN checkpoint (for `train.py`).
- `LOAD_BEST_REAL_ESRGAN_CHECKPOINT`: Set to `True` to resume training from the best Real-ESRGAN checkpoint (for `train.py`).
- `TRAIN_DATASET_PATH`: Path to the train data. Can be a directory of images or a `.txt` file listing image paths.
- `VAL_DATASET_PATH`: Path to the validation data. Can be a directory of images or a `.txt` file listing image paths.
- `TEST_DATASETS_PATHS`: List of paths to the test data. Each path can be a directory of images or a `.txt` file listing image paths.
- `DEV_MOVE`: Set to `True` to use a 10% subset of the train data for quick testing.
Note: `INITIALIZE_WITH_REAL_ESRNET_CHECKPOINT` is mutually exclusive with `LOAD_REAL_ESRGAN_CHECKPOINT` and `LOAD_BEST_REAL_ESRGAN_CHECKPOINT`: if the first is `True`, the other two should be `False`, and vice versa. If the first flag and one of the other two are both set to `True`, the model weights will be overwritten by the latter.
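For example, a typical `config.py` setup for the Real-ESRGAN fine-tuning stage might look like this; the values are illustrative, not the repository defaults.

```python
# Stage 2 (fine-tune Real-ESRGAN starting from the Real-ESRNET weights).
INITIALIZE_WITH_REAL_ESRNET_CHECKPOINT = True   # start the GAN stage from Real-ESRNET weights
LOAD_REAL_ESRGAN_CHECKPOINT = False             # must stay False while the flag above is True
LOAD_BEST_REAL_ESRGAN_CHECKPOINT = False        # must stay False while the flag above is True

TRAIN_DATASET_PATH = "data/DF2K_OST"            # directory of images or a .txt file list
VAL_DATASET_PATH = "data/DIV2K_valid"
TEST_DATASETS_PATHS = ["data/Set5", "data/Set14", "data/BSDS100", "data/Urban100"]

DEV_MOVE = False                                # True = use a 10% subset for quick testing
```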
- Clone the repository:

  ```
  git clone https://github.com/ash1ra/Real-ESRGAN.git
  cd Real-ESRGAN
  ```

- Create `.venv` and install dependencies:

  ```
  uv sync
  ```

- Activate a virtual environment:

  ```
  # On Windows
  .venv\Scripts\activate

  # On Unix or MacOS
  source .venv/bin/activate
  ```
- Download the DIV2K datasets (`Train Data (HR images)` and `Validation Data (HR images)`).
- Download the Flickr2K dataset.
- Download the OST datasets (`OutdoorSceneTest300/OST300_img.zip` and `OutdoorSceneTrain_v2`).
- Download the standard benchmark datasets (Set5, Set14, BSDS100, Urban100).
- Create the training dataset from DIV2K, Flickr2K, and OST (both the test and train sets).
- Organize your data directory as expected by the scripts:
  ```
  data/
  ├── DF2K_OST/
  │   ├── 1.jpg
  │   └── ...
  ├── DIV2K_valid/
  │   ├── 1.jpg
  │   └── ...
  ├── Set5/
  │   ├── baboon.png
  │   └── ...
  ├── Set14/
  │   └── ...
  ...
  ```

  or

  ```
  data/
  ├── DF2K_OST.txt
  ├── DIV2K_valid.txt
  ├── Set5.txt
  ├── Set14.txt
  ...
  ```
- Update the paths (`TRAIN_DATASET_PATH`, `VAL_DATASET_PATH`, `TEST_DATASETS_PATHS`) in `config.py` to match your data structure.
- Adjust parameters in `config.py` as needed.
- Run the training script:

  ```
  python pretrain.py
  ```

- Training progress will be logged to the console and to a file in the `logs/` directory.
- Checkpoints will be saved in `checkpoints/`. A plot of the training metrics will be saved in `images/` upon completion.
- Adjust parameters in `config.py` as needed.
- Run the training script:

  ```
  python train.py
  ```

- Training progress will be logged to the console and to a file in the `logs/` directory.
- Checkpoints will be saved in `checkpoints/`. A plot of the training metrics will be saved in `images/` upon completion.
To evaluate the model's performance on the test datasets:
- Ensure the `BEST_ESRGAN_CHECKPOINT_DIR_PATH` in `config.py` points to your trained model (e.g., `checkpoints/esrgan_best`).
- Run the test script:

  ```
  python test.py
  ```
- The script will print the average PSNR and SSIM for each dataset.
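For reference, the kind of evaluation performed here (Y-channel metrics with a border shave, as noted in the results below) can be sketched as follows; this is an assumption-laden illustration, not the exact code in `test.py` or `utils.py`.

```python
import numpy as np


def rgb_to_y(img: np.ndarray) -> np.ndarray:
    """ITU-R BT.601 luma from an RGB image with values in [0, 255]."""
    r, g, b = img[..., 0], img[..., 1], img[..., 2]
    return 16.0 + (65.481 * r + 128.553 * g + 24.966 * b) / 255.0


def psnr_y(sr: np.ndarray, hr: np.ndarray, shave: int = 4) -> float:
    # Convert to the Y channel and shave `shave` pixels from every border before comparing.
    sr_y = rgb_to_y(sr.astype(np.float64))[shave:-shave, shave:-shave]
    hr_y = rgb_to_y(hr.astype(np.float64))[shave:-shave, shave:-shave]
    mse = np.mean((sr_y - hr_y) ** 2)
    return float(10.0 * np.log10(255.0 ** 2 / mse))
```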
To upscale a single image:
- Place your image in the `images/` folder (or update the path).
- In `config.py`, set `INFERENCE_INPUT_IMG_PATH` to your image, `INFERENCE_OUTPUT_IMG_PATH` to the desired location of the output image, `INFERENCE_COMPARISON_IMG_PATH` to the desired location of the comparison image (optional), and `BEST_REAL_ESRGAN_CHECKPOINT_DIR_PATH` to your trained model.
- Run the script:

  ```
  python inference.py
  ```

- The upscaled image (`sr_img_*.png`) and a comparison image (`comparison_img_*.png`) will be saved in the `images/` directory.
The training process is divided into two distinct stages, as recommended by the Real-ESRGAN paper. Both stages were trained on an NVIDIA RTX 4060 Ti (8 GB) with an effective batch size of 48 (with `GRADIENT_ACCUMULATION_STEPS = 12`, the actual per-step batch is 48 / 12 = 4).
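A minimal sketch of that gradient-accumulation scheme; the tiny model, dummy data, and learning rate below are stand-ins, not the project's actual training loop.

```python
import torch
from torch import nn

GRADIENT_ACCUMULATION_STEPS = 12
MINI_BATCH_SIZE = 4  # effective batch = 4 * 12 = 48

model = nn.Conv2d(3, 3, 3, padding=1)  # stand-in for the generator
criterion = nn.L1Loss()
optimizer = torch.optim.Adam(model.parameters(), lr=2e-4)
loader = [(torch.rand(MINI_BATCH_SIZE, 3, 48, 48), torch.rand(MINI_BATCH_SIZE, 3, 48, 48))
          for _ in range(GRADIENT_ACCUMULATION_STEPS)]  # dummy (LR, HR) mini-batches

optimizer.zero_grad()
for step, (lr_imgs, hr_imgs) in enumerate(loader):
    loss = criterion(model(lr_imgs), hr_imgs)
    (loss / GRADIENT_ACCUMULATION_STEPS).backward()  # scale so accumulated gradients average
    if (step + 1) % GRADIENT_ACCUMULATION_STEPS == 0:
        optimizer.step()                             # one update per 12 mini-batches of 4
        optimizer.zero_grad()
```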
The first stage involved training the Real-ESRNET generator (using L1 Loss) for 250 epochs. This stage took nearly 37 hours. The final model was selected based on the epoch with the highest validation PSNR.
The pre-trained weights from Stage 1 were used to initialize the generator for Real-ESRGAN fine-tuning. This model was then trained for 107 epochs using the full Real-ESRGAN loss (Perceptual, RaGAN, and L1) with learning rates of 1e-4 and 1e-5 for the Generator and the Discriminator, respectively. This stage took nearly 41 hours. The final model was selected based on the epoch with the lowest validation loss.
Note: The checkpoint from epoch 81 was used for inference.
Note 2: Keep in mind that I was not able to train the model for longer: even with different learning rate settings the gradients eventually started to explode, so I decided to stop training and move on to other projects and architectures.
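For orientation, here is a minimal sketch of how the Stage 2 generator objective can be assembled from the three terms mentioned above. The weights are assumptions, the adversarial term is written in its plain (non-relativistic) form for brevity, and the helper signature is hypothetical rather than the one in `train.py`.

```python
import torch
from torch import nn

# Illustrative weights; the actual values live in config.py.
L1_WEIGHT, PERCEPTUAL_WEIGHT, ADVERSARIAL_WEIGHT = 1.0, 1.0, 0.1

l1_criterion = nn.L1Loss()
adv_criterion = nn.BCEWithLogitsLoss()


def generator_loss(sr, hr, sr_vgg_feats, hr_vgg_feats, d_logits_on_sr):
    l1 = l1_criterion(sr, hr)                               # pixel-wise L1
    perceptual = l1_criterion(sr_vgg_feats, hr_vgg_feats)   # distance in VGG feature space
    adversarial = adv_criterion(                            # push D's per-pixel logits towards "real"
        d_logits_on_sr, torch.ones_like(d_logits_on_sr)
    )
    return L1_WEIGHT * l1 + PERCEPTUAL_WEIGHT * perceptual + ADVERSARIAL_WEIGHT * adversarial
```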
The final model (real_esrgan_best) was evaluated on standard benchmark datasets. Metrics are calculated on the Y-channel after shaving 4px (the scaling factor) from the border.
PSNR (dB) / SSIM results for Real-ESRGAN (this project):

| Dataset | PSNR (dB) | SSIM |
|---|---|---|
| Set5 | 26.91 | 0.8283 |
| Set14 | 24.16 | 0.7139 |
| BSDS100 | 23.42 | 0.6585 |
| Urban100 | 21.70 | 0.7189 |
Note: It is crucial to remember that for perceptual models like Real-ESRGAN, traditional metrics (PSNR and SSIM) are not the primary measure of success. As highlighted in the original research, distortion (PSNR) and perceptual quality (human-perceived realism) are fundamentally at odds with each other. A model trained only for PSNR will score higher on these metrics but will produce overly smooth images. The final Real-ESRGAN model intentionally achieves lower PSNR/SSIM scores to produce sharp, realistic textures that look far more convincing to the human eye.
The following images compare standard bicubic interpolation with the output of the Real-ESRGAN model. I tried to use a variety of images (anime images, photos, etc.) so that the difference in results is clearly visible.
This implementation is based on the paper Real-ESRGAN: Training Real-World Blind Super-Resolution with Pure Synthetic Data.
```
@misc{wang2021realesrgantrainingrealworldblind,
title={Real-ESRGAN: Training Real-World Blind Super-Resolution with Pure Synthetic Data},
author={Xintao Wang and Liangbin Xie and Chao Dong and Ying Shan},
year={2021},
eprint={2107.10833},
archivePrefix={arXiv},
primaryClass={eess.IV},
url={https://arxiv.org/abs/2107.10833},
}
```

DIV2K dataset citation:

```
@InProceedings{Timofte_2018_CVPR_Workshops,
author = {Timofte, Radu and Gu, Shuhang and Wu, Jiqing and Van Gool, Luc and Zhang, Lei and Yang, Ming-Hsuan and Haris, Muhammad and others},
title = {NTIRE 2018 Challenge on Single Image Super-Resolution: Methods and Results},
booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
month = {June},
year = {2018}
}
```

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.