
Stream-DiffVSR

Diffusion-based video super-resolution that runs on your Mac.


Stream-DiffVSR upscales low-resolution video frames using a diffusion model optimized for Apple Silicon. It runs on MPS (Metal Performance Shaders) with attention slicing and VAE chunking to fit within unified memory. Feed it 360p frames, get back 720p or higher. No NVIDIA GPU required.

Why This Exists | Before vs After | Install | Quick Start | How It Works | Features | Performance | Contributing | License


Why This Exists

The original Stream-DiffVSR paper introduced a strong approach to video super-resolution using auto-regressive diffusion. But the code only ran on NVIDIA GPUs with CUDA. Every hardcoded device reference, every xFormers dependency, every CUDA-specific memory call made it impossible to run on a Mac.

This fork strips all of that out. It replaces CUDA-only code with device-agnostic PyTorch, adds MPS support through attention slicing and VAE slicing, and cuts 50+ NVIDIA-specific dependencies. The result: a video super-resolution pipeline that runs natively on any Apple Silicon Mac.

Before vs After

| Input (Low Resolution) | Output (Super-Resolved) |
| --- | --- |
| 360p / 540p source frame | 720p / 1080p upscaled frame |
| Blurry, compression artifacts visible | Sharp edges, recovered detail |
| Blocky textures in motion | Temporally coherent across frames |

The model upscales video frames by 4x while maintaining temporal consistency between frames. Unlike single-image upscalers, Stream-DiffVSR uses information from previous frames to produce stable, flicker-free results.

Install

Apple Silicon (Recommended)

git clone https://github.com/199-biotechnologies/stream-diffvsr.git
cd stream-diffvsr

python3 -m venv venv
source venv/bin/activate

pip install -r requirements-mac.txt

Requirements: macOS 12.3+, Python 3.9+, Apple Silicon (M1/M2/M3/M4), 16GB+ RAM recommended.

Linux / NVIDIA GPU

git clone https://github.com/199-biotechnologies/stream-diffvsr.git
cd stream-diffvsr

conda env create -f requirements.yml
conda activate stream-diffvsr

Quick Start

python inference.py \
    --model_id 'Jamichsu/Stream-DiffVSR' \
    --in_path './input/' \
    --out_path './output/' \
    --num_inference_steps 4

The script auto-detects MPS on Mac, CUDA on NVIDIA, or falls back to CPU. Pretrained weights download automatically from Hugging Face.
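The detection order follows the standard PyTorch pattern. A minimal sketch of the idea (the function name `pick_device` is illustrative, not the script's actual API):

```python
from typing import Optional

import torch

def pick_device(preferred: Optional[str] = None) -> str:
    """Return the best available torch device string."""
    if preferred is not None:
        return preferred                      # honor an explicit --device flag
    if torch.backends.mps.is_available():
        return "mps"                          # Apple Silicon GPU via Metal
    if torch.cuda.is_available():
        return "cuda"                         # NVIDIA GPU
    return "cpu"                              # universal fallback

device = torch.device(pick_device())
```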

Force a specific device

python inference.py --device mps --in_path ./input/ --out_path ./output/
python inference.py --device cpu --in_path ./input/ --out_path ./output/
python inference.py --device cuda --in_path ./input/ --out_path ./output/

Input format

Organize frames as sequential PNGs inside subdirectories:

input/
  video1/
    frame_0001.png
    frame_0002.png
    ...
  video2/
    frame_0001.png
    ...

How It Works

Stream-DiffVSR operates as a causal diffusion model. It only looks at past frames, never future ones. This makes it suitable for streaming and real-time scenarios where you cannot buffer ahead.

The pipeline has three core components:

  1. 4-step distilled denoiser -- A diffusion model distilled down to 4 denoising steps (from the typical 20-50), cutting inference time dramatically.
  2. Auto-regressive Temporal Guidance (ARTG) -- Injects motion-aligned cues from previous frames during latent denoising. This is what keeps the output temporally stable.
  3. Temporal Processor Module (TPM) -- A lightweight decoder add-on that enhances fine detail while preserving coherence across frames.
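Put together, the auto-regressive loop can be sketched in a few lines of Python. The `denoise` and `align` callables stand in for the distilled denoiser and the motion-alignment step; they are placeholders, not the repository's API:

```python
def upscale_stream(frames, denoise, align):
    """Causal super-resolution: each output depends only on past frames."""
    outputs = []
    prev = None
    for frame in frames:
        # ARTG: guide denoising with a motion-aligned cue from the previous output.
        guidance = align(prev, frame) if prev is not None else None
        sr = denoise(frame, guidance)   # 4-step distilled diffusion
        outputs.append(sr)
        prev = sr                       # becomes the guidance source for the next frame
    return outputs
```

Because the loop never reads ahead, it can emit each frame as soon as it arrives, which is what makes the streaming use case possible.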

On Mac, the pipeline uses attention slicing (reduces memory ~40%) and VAE slicing (processes the decoder in chunks) to fit within unified memory constraints.
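Attention slicing trades a little speed for memory by materializing the attention scores for only a chunk of queries at a time instead of the full matrix. A minimal NumPy illustration of the idea (not the pipeline's actual kernels):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sliced_attention(q, k, v, slice_size=2):
    """Compute softmax(q @ k.T / sqrt(d)) @ v in query chunks to cap peak memory."""
    out = np.empty_like(q)
    for start in range(0, q.shape[0], slice_size):
        chunk = q[start:start + slice_size]
        # Only a slice_size x n score matrix exists at any moment.
        scores = softmax(chunk @ k.T / np.sqrt(q.shape[1]))
        out[start:start + slice_size] = scores @ v
    return out
```

The result is bitwise-identical to unsliced attention; only peak memory changes. VAE slicing applies the same chunking idea to the decoder.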

Features

| What | Detail |
| --- | --- |
| Device auto-detection | MPS on Mac, CUDA on NVIDIA, CPU fallback |
| Attention slicing | 40% memory reduction on unified memory |
| VAE slicing | Processes decoder in chunks for large frames |
| 4-step inference | Distilled from 50 steps with minimal quality loss |
| Temporal coherence | No flicker between consecutive frames |
| Causal architecture | No future-frame dependency, works for streaming |
| TensorRT support | Optional acceleration on NVIDIA GPUs |
| Auto model download | Weights fetched from Hugging Face on first run |

Performance

| Device | Chip | RAM | 720p Frame Time | Notes |
| --- | --- | --- | --- | --- |
| Mac | M1 | 8GB | ~4-6s | May need lower resolution |
| Mac | M1 Pro/Max | 16-32GB | ~2-4s | Good for 720p |
| Mac | M2/M3 Pro/Max | 32-64GB | ~1-3s | Comfortable |
| Mac | M3 Ultra | 64-192GB | ~0.8-1.5s | Best Mac performance |
| NVIDIA | RTX 4090 | 24GB | 0.328s | Original benchmark |
| NVIDIA | RTX 3080 | 10GB | ~0.5-0.8s | With xFormers |

Troubleshooting MPS

Out of memory: Set PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0 before running inference, close other apps, or process at 540p instead of 720p.

Unsupported operations: Set PYTORCH_ENABLE_MPS_FALLBACK=1 to let PyTorch fall back to CPU for any MPS-unsupported ops.
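Both workarounds are environment variables, so they can be exported once per shell session before launching inference:

```shell
# Lift the MPS allocator's high-watermark limit. Use with care: this lets
# the process consume all available unified memory.
export PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0

# Run any op MPS does not implement on the CPU instead of erroring out.
export PYTORCH_ENABLE_MPS_FALLBACK=1

# Then launch as usual:
#   python inference.py --device mps --in_path ./input/ --out_path ./output/
```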

Citation

If you use this work, cite the original paper:

@article{shiu2025stream,
  title={Stream-DiffVSR: Low-Latency Streamable Video Super-Resolution via Auto-Regressive Diffusion},
  author={Shiu, Hau-Shiang and Lin, Chin-Yang and Wang, Zhixiang and Hsiao, Chi-Wei and Yu, Po-Fan and Chen, Yu-Chih and Liu, Yu-Lun},
  journal={arXiv preprint arXiv:2512.23709},
  year={2025}
}

Contributing

Contributions are welcome. See CONTRIBUTING.md for guidelines.

License

Apache 2.0. See LICENSE for the full text.

Original implementation by jamichss. Built on HuggingFace Diffusers, StreamDiffusion, StableVSR, and TAESD.