# Stream-DiffVSR

Diffusion-based video super-resolution that runs on your Mac.
Stream-DiffVSR upscales low-resolution video frames using a diffusion model optimized for Apple Silicon. It runs on MPS (Metal Performance Shaders) with attention slicing and VAE chunking to fit within unified memory. Feed it 360p frames, get back 720p or higher. No NVIDIA GPU required.
[Why This Exists](#why-this-exists) | [Before vs After](#before-vs-after) | [Install](#install) | [Quick Start](#quick-start) | [How It Works](#how-it-works) | [Features](#features) | [Performance](#performance) | [Contributing](#contributing) | [License](#license)
## Why This Exists

The original Stream-DiffVSR paper introduced a strong approach to video super-resolution using auto-regressive diffusion. But the code only ran on NVIDIA GPUs with CUDA. Every hardcoded device reference, every xFormers dependency, every CUDA-specific memory call made it impossible to run on a Mac.
This fork strips all of that out. It replaces CUDA-only code with device-agnostic PyTorch, adds MPS support through attention slicing and VAE slicing, and cuts 50+ NVIDIA-specific dependencies. The result: a video super-resolution pipeline that runs natively on any Apple Silicon Mac.
## Before vs After

| Input (Low Resolution) | Output (Super-Resolved) |
|---|---|
| 360p / 540p source frame | 720p / 1080p upscaled frame |
| Blurry, compression artifacts visible | Sharp edges, recovered detail |
| Blocky textures in motion | Temporally coherent across frames |
The model upscales video frames by 4x while maintaining temporal consistency between frames. Unlike single-image upscalers, Stream-DiffVSR uses information from previous frames to produce stable, flicker-free results.
## Install

```bash
git clone https://github.com/199-biotechnologies/stream-diffvsr.git
cd stream-diffvsr
python3 -m venv venv
source venv/bin/activate
pip install -r requirements-mac.txt
```

Requirements: macOS 12.3+, Python 3.9+, Apple Silicon (M1/M2/M3/M4), 16GB+ RAM recommended.
Alternatively, with conda:

```bash
git clone https://github.com/199-biotechnologies/stream-diffvsr.git
cd stream-diffvsr
conda env create -f requirements.yml
conda activate stream-diffvsr
```

## Quick Start

```bash
python inference.py \
    --model_id 'Jamichsu/Stream-DiffVSR' \
    --in_path './input/' \
    --out_path './output/' \
    --num_inference_steps 4
```

The script auto-detects MPS on Mac, CUDA on NVIDIA, or falls back to CPU. Pretrained weights download automatically from Hugging Face.
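Roughly, that device selection looks like the sketch below (illustrative only; `pick_device` is a hypothetical helper name, not necessarily how inference.py is written):

```python
import torch

def pick_device(requested=None):
    """Resolve the compute device: honor an explicit --device flag, else auto-detect."""
    if requested:                            # e.g. "mps", "cuda", "cpu"
        return torch.device(requested)
    if torch.backends.mps.is_available():   # Apple Silicon via Metal
        return torch.device("mps")
    if torch.cuda.is_available():           # NVIDIA GPU
        return torch.device("cuda")
    return torch.device("cpu")              # last-resort fallback
```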
To force a specific device:

```bash
python inference.py --device mps --in_path ./input/ --out_path ./output/
python inference.py --device cpu --in_path ./input/ --out_path ./output/
python inference.py --device cuda --in_path ./input/ --out_path ./output/
```

Organize frames as sequential PNGs inside subdirectories:
```
input/
  video1/
    frame_0001.png
    frame_0002.png
    ...
  video2/
    frame_0001.png
    ...
```
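If you are scripting around this layout, walking it is straightforward; a minimal sketch (the `list_frame_sequences` helper is hypothetical, not part of the repo):

```python
from pathlib import Path

def list_frame_sequences(in_path):
    """Yield (clip name, sorted PNG frame paths) for each video subdirectory."""
    for clip_dir in sorted(Path(in_path).iterdir()):
        if clip_dir.is_dir():
            frames = sorted(clip_dir.glob("*.png"))  # frame_0001.png, frame_0002.png, ...
            if frames:
                yield clip_dir.name, frames
```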
## How It Works

Stream-DiffVSR operates as a causal diffusion model: it only looks at past frames, never future ones. That makes it suitable for streaming and real-time scenarios where you cannot buffer ahead.
The pipeline has three core components:
- 4-step distilled denoiser -- A diffusion model distilled down to 4 denoising steps (from the typical 20-50), cutting inference time dramatically.
- Auto-regressive Temporal Guidance (ARTG) -- Injects motion-aligned cues from previous frames during latent denoising. This is what keeps the output temporally stable.
- Temporal Processor Module (TPM) -- A lightweight decoder add-on that enhances fine detail while preserving coherence across frames.
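Put together, per-frame processing follows the shape of this sketch (all module names are hypothetical stand-ins, stubbed here only to make the causal control flow concrete):

```python
import torch

# Hypothetical stubs standing in for the real modules; shapes are arbitrary.
encode_lr  = lambda frame: torch.randn(1, 4, 64, 64)   # VAE-encode the LR frame to a latent
denoiser   = lambda z, step, guide: z                  # 4-step distilled denoiser (stub)
artg       = lambda prev, cur: cur if prev is None else 0.5 * (prev + cur)  # ARTG (stub)
tpm_decode = lambda z: z                               # TPM-enhanced decoder (stub)

frames = [torch.zeros(3, 360, 640) for _ in range(3)]  # dummy 360p input frames

prev_latent = None
for lr_frame in frames:                        # causal: past frames only, never future
    latent = encode_lr(lr_frame)
    for step in range(4):                      # the 4 distilled denoising steps
        guidance = artg(prev_latent, latent)   # motion-aligned cues from the previous frame
        latent = denoiser(latent, step, guidance)
    sr_frame = tpm_decode(latent)              # enhanced detail, coherent across frames
    prev_latent = latent                       # carry state forward to the next frame
```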
On Mac, the pipeline uses attention slicing (reduces memory ~40%) and VAE slicing (processes the decoder in chunks) to fit within unified memory constraints.
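In Hugging Face Diffusers, both optimizations are one-line calls. A minimal sketch, assuming the published weights load through the generic `DiffusionPipeline` interface and that the pipeline class exposes the standard slicing methods (the repo's actual loading code may differ):

```python
import torch
from diffusers import DiffusionPipeline

# Assumption: "Jamichsu/Stream-DiffVSR" loads via the generic pipeline loader.
pipe = DiffusionPipeline.from_pretrained(
    "Jamichsu/Stream-DiffVSR", torch_dtype=torch.float16
)
pipe = pipe.to("mps")

pipe.enable_attention_slicing()  # compute attention in slices rather than all at once
pipe.enable_vae_slicing()        # decode the VAE output in chunks to cap peak memory
```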
## Features

| Feature | Detail |
|---|---|
| Device auto-detection | MPS on Mac, CUDA on NVIDIA, CPU fallback |
| Attention slicing | 40% memory reduction on unified memory |
| VAE slicing | Processes decoder in chunks for large frames |
| 4-step inference | Distilled from 50 steps with minimal quality loss |
| Temporal coherence | No flicker between consecutive frames |
| Causal architecture | No future-frame dependency, works for streaming |
| TensorRT support | Optional acceleration on NVIDIA GPUs |
| Auto model download | Weights fetched from Hugging Face on first run |
## Performance

| Device | Chip | RAM | 720p Frame Time | Notes |
|---|---|---|---|---|
| Mac | M1 | 8GB | ~4-6s | May need lower resolution |
| Mac | M1 Pro/Max | 16-32GB | ~2-4s | Good for 720p |
| Mac | M2/M3 Pro/Max | 32-64GB | ~1-3s | Comfortable |
| Mac | M3 Ultra | 64-192GB | ~0.8-1.5s | Best Mac performance |
| NVIDIA | RTX 4090 | 24GB | 0.328s | Original benchmark |
| NVIDIA | RTX 3080 | 10GB | ~0.5-0.8s | With xFormers |
## Troubleshooting

Out of memory: set `PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0` before running inference, close other apps, or process at 540p instead of 720p.

Unsupported operations: set `PYTORCH_ENABLE_MPS_FALLBACK=1` to let PyTorch fall back to CPU for any MPS-unsupported ops.
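Both variables can also be set from Python, as long as that happens before torch is first imported (the MPS backend reads them at initialization):

```python
import os

# Must run before the first `import torch`; the MPS backend reads these at init.
os.environ["PYTORCH_MPS_HIGH_WATERMARK_RATIO"] = "0.0"  # 0.0 disables the allocation cap
os.environ["PYTORCH_ENABLE_MPS_FALLBACK"] = "1"         # CPU fallback for unsupported ops

import torch  # imported after setting the env vars, on purpose
```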
## Citation

If you use this work, cite the original paper:

```bibtex
@article{shiu2025stream,
  title={Stream-DiffVSR: Low-Latency Streamable Video Super-Resolution via Auto-Regressive Diffusion},
  author={Shiu, Hau-Shiang and Lin, Chin-Yang and Wang, Zhixiang and Hsiao, Chi-Wei and Yu, Po-Fan and Chen, Yu-Chih and Liu, Yu-Lun},
  journal={arXiv preprint arXiv:2512.23709},
  year={2025}
}
```

## Contributing

Contributions are welcome. See CONTRIBUTING.md for guidelines.
## License

Apache 2.0. See LICENSE for the full text.
## Credits

Original implementation by jamichss. Built on Hugging Face Diffusers, StreamDiffusion, StableVSR, and TAESD.

Built by Boris Djordjevic at 199 Biotechnologies | Paperfoot AI