
Replace MoviePy with ffmpeg for 10-100x performance improvement #15

Open

cyberb wants to merge 60 commits into motattack:master from cyberb:feature/ffmpeg-performance

Conversation

cyberb commented Mar 24, 2026

Summary

  • Parallel downloads: 4 concurrent downloads instead of sequential, with 1MB chunk size (was 8KB)
  • ffmpeg instead of MoviePy: Eliminates the full video re-encode that made 3-hour videos take 14+ hours to process. Uses the ffmpeg concat demuxer with -c copy for near-instant concatenation (see the sketch after this list)
  • Removed moviepy/numpy dependencies: Only requires httpx and tqdm as Python deps. ffmpeg is the only new system requirement
  • Proper gap handling: Black segments and silence generated with ffmpeg (ultrafast, stillimage tune) instead of in-memory numpy arrays
  • Audio-only track merging: Uses ffmpeg amix filter with proper delay offsets
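
For reference, a minimal sketch of the concat-demuxer call described above (function name and paths are illustrative, not the exact code in processor.py):

```python
import subprocess
from pathlib import Path

def concat_copy(segments: list[Path], output: Path) -> None:
    """Concatenate already-normalized segments without re-encoding."""
    manifest = output.with_suffix(".txt")
    # The concat demuxer reads a newline-separated manifest of input files.
    manifest.write_text("".join(f"file '{seg.as_posix()}'\n" for seg in segments))
    subprocess.run(
        ["ffmpeg", "-y", "-f", "concat", "-safe", "0", "-i", str(manifest),
         "-c", "copy", str(output)],  # stream copy: no re-encode
        check=True,
    )
```

Stream copy only works because segments are first normalized to a common resolution; mismatched streams would corrupt the concatenated output.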

What changed

| File | Change |
| --- | --- |
| downloader.py | Added download_chunks_parallel() using ThreadPoolExecutor; chunk size 8KB → 1MB |
| processor.py | Full rewrite: MoviePy → ffmpeg subprocess calls (ffprobe, concat demuxer, filter_complex) |
| webinar.py | Updated to use the new parallel download + ffmpeg pipeline |
| requirements.txt | Removed moviepy |
| setup.py | Removed moviepy dep; bumped version to 2.0.0 |
| Dockerfile | Added ffmpeg package installation |

Performance

Tested on a real 3-hour webinar (139 chunks, 106 video + 33 audio-only segments):

| Metric | Old (MoviePy) | New (ffmpeg) |
| --- | --- | --- |
| Total time | 14+ hours (reported in #8) | ~70 minutes |
| Video concat | Full re-encode | -c copy (stream copy) |
| Peak RAM | ~1GB | ~300MB |

For recordings without audio-only tracks, the improvement is even larger since the audio mixing step (the slowest remaining part) is skipped entirely.

Requirements

  • ffmpeg and ffprobe must be installed (apt install ffmpeg / brew install ffmpeg)
  • The tool checks for ffmpeg at startup and gives a clear error if missing

Fixes #8

Test plan

  • Download a webinar with multiple video segments — verified output plays correctly
  • Test with recordings that have 33 audio-only tracks — audio merged successfully
  • Test with recordings that have gaps between segments — black segments generated correctly
  • Verify Docker build works with new Dockerfile

cyberb added 23 commits March 24, 2026 10:40
- Parallel downloads (4 concurrent) instead of sequential
- Increase download chunk size from 8KB to 1MB
- Replace MoviePy video processing with direct ffmpeg calls
- Use ffmpeg concat demuxer with -c copy (no re-encoding)
- Normalize segments to common resolution for reliable concat
- Handle gaps with lightweight ffmpeg-generated black segments
- Merge audio-only tracks using ffmpeg filter_complex
- Remove moviepy and numpy dependencies
- Add ffmpeg to Dockerfile
- Bump version to 2.0.0

Fixes motattack#8
Some MTS Link recordings store only audio in the direct mp4 files,
while the HLS delivery endpoint has both video and audio streams.
Check the HLS playlist for a video track and download via ffmpeg
when detected.
The old _merge_audio_tracks passed all audio files (up to 63+) as
simultaneous inputs to a single ffmpeg amix command, which required
ffmpeg to hold all delayed audio streams in memory for the full
recording duration — causing OOM kills on long recordings.

Now audio tracks are pre-delayed individually, then mixed via tree
reduction in batches of 8. Also routes all subprocess calls through
_run_ffmpeg which logs stderr on failure instead of silently
swallowing it.
Recordings with multiple simultaneous feeds (webcam + screen share)
have segments with overlapping timestamps. The old code laid them out
sequentially, turning a 3hr recording into 10+ hours of concat video.
This also caused the audio merge WAVs to be padded to 10hrs each,
requiring ~400GB of disk.

Added _deduplicate_overlapping() which keeps only the longest segment
per time window (186 -> 7 segments in a real test case). Also pass
total_duration from the API to _merge_audio_tracks so WAVs are padded
to the correct recording length, not the (potentially inflated) concat
file duration.
The previous approach materialized each audio track as a full-duration
WAV (~1.8GB each for a 3hr recording). With 63 tracks that's ~113GB,
filling the disk and crashing with "No space left on device".

Now audio tracks are mixed in batches directly with adelay inside the
ffmpeg filter graph, outputting compressed m4a (~15MB each). No
intermediate WAVs are created. Batch results are tree-reduced and
intermediates are deleted immediately after each round.
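
A sketch of how one batch's command could be assembled, per the description above; function and label names are hypothetical, and adelay offsets are in milliseconds:

```python
def batch_mix_cmd(tracks: list[str], offsets_ms: list[int], out_path: str) -> list[str]:
    """Build one ffmpeg command that delays and mixes a batch of audio tracks."""
    cmd = ["ffmpeg", "-y"]
    for track in tracks:
        cmd += ["-i", track]
    # adelay shifts each input to its place on the recording timeline
    # (the value is repeated to delay both stereo channels).
    parts = [f"[{i}:a]adelay={ms}|{ms}[d{i}]" for i, ms in enumerate(offsets_ms)]
    labels = "".join(f"[d{i}]" for i in range(len(tracks)))
    parts.append(f"{labels}amix=inputs={len(tracks)}[mix]")
    cmd += ["-filter_complex", ";".join(parts), "-map", "[mix]",
            "-c:a", "aac", out_path]  # small compressed intermediate, no WAV
    return cmd
```

Batch outputs are then mixed pairwise in further rounds (tree reduction) until a single track remains.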
- Add _validate_downloaded_file() to check files with ffprobe after download
- Re-download corrupt files (missing moov atom) up to 2 retries
- Validate existing cached files on disk, re-download if corrupt
- Add _is_valid_media() in processor to skip corrupt files during classification
- Audio batch mixing catches errors and skips failed batches instead of crashing
- If all audio batches fail, output video without audio overlay
Extract presentation.update events from the MTS API to get slide
images and their timestamps. Download pre-rendered slide JPGs and
composite them with the webcam video in a 1280x720 layout:
- Left 960px: presentation slide
- Top-right 320x180: webcam
- Slides are pre-encoded as 1fps video segments and concatenated
  into a single track, then overlaid with the webcam in one pass.

Recordings without presentations are unaffected (existing behavior).
Some recordings have tiny thumbnail-sized video segments (192x108)
as the first file. The old code used the first segment's resolution
for all normalization, resulting in a blurry output. Now scans all
segments and picks the largest, with a 640x360 floor.
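
A minimal sketch of the selection rule described above (the real code may differ):

```python
import json
import subprocess

def probe_resolution(path: str) -> tuple[int, int]:
    """Read the first video stream's dimensions via ffprobe."""
    out = subprocess.run(
        ["ffprobe", "-v", "error", "-select_streams", "v:0",
         "-show_entries", "stream=width,height", "-of", "json", path],
        capture_output=True, text=True, check=True,
    ).stdout
    stream = json.loads(out)["streams"][0]
    return stream["width"], stream["height"]

def pick_target_resolution(segments: list[str]) -> tuple[int, int]:
    # Scan every segment and keep the largest area, not just the first file.
    w, h = max((probe_resolution(s) for s in segments),
               key=lambda wh: wh[0] * wh[1], default=(640, 360))
    return max(w, 640), max(h, 360)  # enforce the 640x360 floor
```
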
When multiple webcams overlap at the same timestamp, the old code
kept the longest segment (often a random participant). Now tracks
conference ID from the API and prefers the user with the most total
segments across the recording — typically the presenter/instructor.

Falls back to longest segment when conf_id is unavailable.
The -loop 1 -framerate 1 -t approach could produce millions of frames
for long-duration slides (e.g., last slide staying up for 3 hours),
causing ffmpeg to spin for hours and write gigabytes. Now uses
-frames:v to strictly cap frame count to match duration at 1fps.
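
For example, a slide that stays up for three hours at 1 fps would be rendered with a hard frame cap rather than an open-ended duration (argument list illustrative):

```python
duration_s = 3 * 60 * 60  # slide stays visible for three hours
cmd = ["ffmpeg", "-y", "-loop", "1", "-framerate", "1", "-i", "slide.jpg",
       "-frames:v", str(duration_s),  # exactly one frame per second at 1 fps
       "-c:v", "libx264", "-tune", "stillimage", "slide_segment.mp4"]
```
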
Detects h264_nvenc at startup and uses it for all encoding steps
if available. Falls back to libx264 CPU encoding if no GPU.
Massively reduces CPU load and encoding time on systems with
NVIDIA GPUs, while keeping the CPU cool.
The overlay step was CPU-bound (97°C). Now uses hwupload_cuda,
scale_cuda, and overlay_cuda to do the compositing entirely on GPU.
Falls back to CPU filters if CUDA overlay is not available.
Two changes:

1. Swap inputs in slide compositing so webcam (25fps) drives the
   output frame clock instead of the slide track (1fps). Fixes
   choppy webcam playback in presentation videos.

2. For recordings without presentation slides that have multiple
   concurrent webcams (ПЗ sessions, i.e. practical classes), composite all active webcams
   into a grid layout using xstack instead of discarding all but
   one. Grid size adapts to the number of concurrent webcams
   (2x1, 2x2, 3x3 etc). Audio from all participants is mixed.
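
The xstack layout mentioned in point 2 is driven by explicit pixel offsets; a 2x2 sketch with illustrative cell size and labels:

```python
w, h = 640, 360  # per-cell size; must be even (see the grid fix below)
layout = f"0_0|{w}_0|0_{h}|{w}_{h}"  # top-left, top-right, bottom-left, bottom-right
filter_graph = (
    "".join(f"[{i}:v]scale={w}:{h}[c{i}];" for i in range(4))
    + "[c0][c1][c2][c3]" + f"xstack=inputs=4:layout={layout}[grid]"
)
```
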
…ipeline

Webcam inputs may lack audio tracks, causing ffmpeg to fail with
'Stream specifier :a matches no streams'. Since _merge_audio_tracks
handles all audio separately, the grid step should output video only.
The old scoring picked the conference with the most segments, which
favored participants toggling their cameras (many short segments) over
the presenter (few long segments). Also had a window-shrinking bug
where replacing a long segment with a short higher-ranked one let
subsequent segments leak through.

New approach: identify the main conference by total recorded duration,
keep its segments, and fill gaps from other conferences.
Extracts ADMIN role from userlist events, maps to conference IDs via
conference.add events, and passes is_admin flag through the download
pipeline. Dedup now prefers ADMIN conferences (the presenter), falling
back to total duration when no admin is found.

Also fixes download_chunks_parallel to preserve the is_admin flag.
…ebcam layout

Dedup gap-fill: clamp "other" conference segments to actual gap boundaries
instead of using raw file duration, preventing timeline overflow.

Compile: skip segments starting before current_time (safety net for overlaps),
and truncate segments via -t so they can't overflow into the next segment.

Slide composite: scale webcam proportionally to 320px wide (was fixed 320x180),
so portrait webcams render at a usable size instead of being squished.
Grid fix: cell dimensions from integer division could be odd, causing
ffmpeg's scale filter to round up and produce dimensions larger than
the pad target ("Padded dimensions cannot be smaller than input").
Now forces even dimensions and uses min() to cap scale output.

Audio fix: amix divides volume by number of inputs at each stage.
After 3 levels of mixing (batch->reduce->overlay), audio was
attenuated to near-silence (-91 dB). Added volume=N compensation
after each amix to restore original loudness.
normalize=0 already prevents amix from dividing by N, so the
volume=N multiplier was over-amplifying (~x112 across 3 pipeline
stages), turning noise from silent tracks into interference.

Also filter out silent audio-only segments (<-80 dB) before mixing
so they don't waste processing time or add noise floor.
-80 dB was filtering out participant microphone audio that
sits around -80 to -60 dB. Only -91 dB is true digital silence.
With normalize=0 and no volume=N, mixing silent segments with
real audio just gives real audio. The filter was incorrectly
dropping participant microphone tracks. Removing it simplifies
the pipeline and ensures all audio-only segments are included.
When slides + multiple webcams are present, analyzes audio levels
per participant to detect who is talking. Switches the right-side
webcam to show the active speaker, defaulting to presenter when
nobody else talks. Uses 2s analysis windows with 4s minimum hold
to prevent flickering.
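
A sketch of the hold logic: the 2 s window and 4 s minimum hold come from the description above, everything else (names, input format) is assumed:

```python
WINDOW = 2.0    # seconds per audio-analysis window
MIN_HOLD = 4.0  # minimum seconds before the shown speaker may change again

def speaker_schedule(levels_per_window: list[dict], presenter: str) -> list[tuple]:
    """levels_per_window: one {participant_id: mean_dB} dict per 2 s window."""
    current, last_switch, schedule = presenter, 0.0, []
    for i, levels in enumerate(levels_per_window):
        t = i * WINDOW
        loudest = max(levels, key=levels.get) if levels else presenter
        if loudest != current and t - last_switch >= MIN_HOLD:
            current, last_switch = loudest, t
        schedule.append((t, current))  # who the right-side webcam shows at t
    return schedule
```
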
cyberb added 6 commits April 5, 2026 15:53
NVENC + complex overlay filter on 3+ hour videos consumes ~7GB,
triggering OOM killer. libx264 uses ~300MB for the same operation.
All other encoding steps still use NVENC.
The 720p cap + fast preset still OOM-kills on 3.5h recordings with
many segments (e.g. 1197678196: 125 chunks, 23 participants, 34 segments).
Split compositing into 30-min chunks so ffmpeg never holds the full
video in memory. This allows using NVENC again (faster) and restores
720p resolution cap. Each chunk is composited independently then
concatenated with stream copy.
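
Schematically (composite_chunk and concat_copy are hypothetical helpers standing in for the real ffmpeg invocations):

```python
CHUNK_S = 30 * 60  # composite at most 30 minutes per ffmpeg invocation

def composite_in_chunks(total_duration: float, composite_chunk, concat_copy) -> str:
    parts, t = [], 0.0
    while t < total_duration:
        end = min(t + CHUNK_S, total_duration)
        parts.append(composite_chunk(t, end))  # one bounded-memory ffmpeg run
        t = end
    return concat_copy(parts)  # join finished chunks with -c copy
```
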
_get_video_encoder_fast() set _NVENC_AVAILABLE directly, bypassing
_detect_gpu(). This left _CUDA_OVERLAY_AVAILABLE as None, so
compositing always used CPU overlay even with CUDA support available.
Each participant's audio segments are analyzed by independent ffmpeg
calls. Running 4 in parallel instead of sequentially speeds up
speaker detection ~4x on multi-core systems.
Speaker switching can produce segments whose combined duration exceeds
the original recording. Cap the concat at total_duration from the API
to ensure the output matches the expected length.
yokidjo (Contributor) commented Apr 9, 2026

@cyberb Thank you for the excellent work

@motattack you can accept the pull request once testing in Docker is finished

Grid segments could start late (e.g. 1762s) but concat placed them
at 0s, causing video/audio desync. Now inserts leading and internal
gaps in the manifest so the video timeline matches total_duration
exactly. Validated: planned duration equals API duration with zero
timeline gaps.
cyberb (Author) commented Apr 9, 2026

Sorry, it started as a simple improvement, then I tried it on various videos and some had no sound, some had no video. Then it was fixing, fixing, fixing. It now downloads the videos I needed, but testing was a pain: I had to wait for hours, check the videos, fix, and repeat.
So now I am not sure if you want to take this PR or not. Claude Code was helping me :(

yokidjo (Contributor) commented Apr 9, 2026

@cyberb

Security note: command injection via filenames

Since we're passing arguments to ffmpeg via subprocess, we should ensure that filenames containing special characters (;, |, $(), ` ) are not interpreted as shell commands.

✅ The current implementation looks safe — all calls use subprocess.run(cmd: List[str]) without shell=True, so arguments remain as literal strings passed to ffmpeg, not parsed by the shell.
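
For illustration (not code from the diff), even a hostile filename stays inert in the argument-list form:

```python
import subprocess

# The list form passes the filename to ffprobe verbatim; no shell ever parses it.
subprocess.run(["ffprobe", "-v", "error", "evil;$(rm -rf ~).mp4"], check=False)
```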

Just double-checking: no os.system() or shell=True sneaked into any of the 46 commits, right? 😄

yokidjo (Contributor) commented Apr 9, 2026

@cyberb

Architecture & I/O review:

· Responsibility split: processor.py is quite heavy (~800 loc). Consider splitting audio / grid / manifest logic into separate modules in future refactoring.
· I/O: efficient — no large in-memory arrays, intermediate files cleaned up, parallel downloads.
· Pipeline design: normalize→concat→audio is correct for HLS streams with variable resolutions. Incremental concat to a single growing file would force re-encoding and lose -c copy benefits. Current approach is optimal for this use case.

Grid composite only keeps audio from input 0, losing all other
webcam voices. Now extracts audio from all video files with audio
streams for mixing in _merge_audio_tracks, same as dedup path.
yokidjo (Contributor) commented Apr 10, 2026

@cyberb

Architecture suggestion for future refactoring

First of all, thank you so much for the incredible amount of work you've put into this PR. The performance gains (14+ hours → 70 minutes) are mind-blowing, and the attention to edge cases — from OOM handling to speaker detection, from GPU overlay to audio tree reduction — shows an amazing level of dedication and skill. Seriously impressive stuff.

One observation: processor.py has grown to ~1900 lines and now handles GPU detection, audio mixing, grid composition, slide compositing, and overlap strategies all in one place. The conditional logic for choosing between grid, dedup, and simple concat is already there.

💡 For a future iteration, consider applying the Strategy pattern:

  • Define a base ProcessingStrategy class with analyze() and execute() methods
  • Implement concrete strategies: SimpleConcatStrategy, GridCompositionStrategy, SlideCompositionStrategy
  • Let a thin VideoProcessor context select the strategy based on has_overlaps and slide_events

This would make the code much easier to test (unit tests per strategy), extend with new scenarios, and maintain without touching the monolith.
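
A minimal sketch of that structure, using only the class and attribute names from this comment; method bodies are placeholders:

```python
from abc import ABC, abstractmethod

class ProcessingStrategy(ABC):
    @abstractmethod
    def analyze(self, recording) -> None: ...
    @abstractmethod
    def execute(self, recording) -> str: ...

class SimpleConcatStrategy(ProcessingStrategy):
    def analyze(self, recording) -> None: ...
    def execute(self, recording) -> str: return "concat with -c copy"

class GridCompositionStrategy(ProcessingStrategy):
    def analyze(self, recording) -> None: ...
    def execute(self, recording) -> str: return "xstack grid"

class SlideCompositionStrategy(ProcessingStrategy):
    def analyze(self, recording) -> None: ...
    def execute(self, recording) -> str: return "slide overlay"

class VideoProcessor:
    """Thin context: selects a strategy from recording attributes."""
    def process(self, recording) -> str:
        if getattr(recording, "slide_events", None):
            strategy: ProcessingStrategy = SlideCompositionStrategy()
        elif getattr(recording, "has_overlaps", False):
            strategy = GridCompositionStrategy()
        else:
            strategy = SimpleConcatStrategy()
        strategy.analyze(recording)
        return strategy.execute(recording)
```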

Not a blocker for this PR at all — just a thought for when you're ready to split this beast apart! Thanks again for this massive contribution.

cyberb (Author) commented Apr 10, 2026

Could you hold off a bit? I am still finding some sound issues in grid mode. I have 30 links to download, but it is getting better now. Give me a day or two and I will let you know.

yokidjo mentioned this pull request Apr 10, 2026
cyberb added 2 commits April 10, 2026 07:52
Split monolithic processor.py (1915 lines) into 6 focused classes:
- FFmpegRunner: ffmpeg execution, GPU detection, encoder selection
- MediaProber: file probing, duration, streams, audio levels
- GridCompositor: multi-webcam grid layout
- SlideCompositor: presentation slide overlay
- AudioMerger: batched audio mixing with tree-reduce
- SegmentBuilder: normalize, gaps, dedup, admin detection

VideoProcessor composes all classes via constructor injection.
No static methods, no underscore prefixes on public methods.
processor.py is now a thin orchestrator with backward-compatible
module-level functions.

33 tests across 5 test files, all passing.
Added requirements-dev.txt with pytest.
yokidjo (Contributor) commented Apr 10, 2026

@cyberb
Testing strategy suggestions (non-blocking)

Awesome work on the refactoring and adding 33 tests! The class decomposition with dependency injection is perfect for testing.

Two suggestions for future iterations:

  1. Integration tests with Testcontainers

If the current tests mock FFmpegRunner and MediaProber (which makes sense for fast unit tests), consider adding a few integration tests with Testcontainers to verify that complex ffmpeg filter graphs work with real binaries.

Benefits:

  • Catch regressions when ffmpeg CLI behavior changes between versions
  • Verify that amix tree-reduce, xstack grid, and CUDA overlay detection work end-to-end
  • Run identically in CI and locally

Trade-off: slower — can be marked @pytest.mark.integration and run separately.

  2. Code coverage reporting

Adding pytest-cov would help track which parts of the pipeline are well-tested vs. untested. Example setup:

pytest --cov=mtslinker --cov-report=term --cov-report=html

This gives visibility into coverage for audio mixing, grid composition, slide overlay, and edge case handling.

Both are just ideas for the roadmap — not blockers for this PR. Thanks again for the massive effort on this!

yokidjo (Contributor) commented Apr 10, 2026

GitHub Actions integration suggestion

I see you added requirements-dev.txt — great first step toward CI. Here's a complete setup you could add in a future PR if you want automated testing on every push.

Create .github/workflows/ci.yml with:

name: CI

on:
  push:
    branches: [ main ]
  pull_request:
    branches: [ main ]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.10"
          
      - name: Install ffmpeg
        run: sudo apt-get update && sudo apt-get install -y ffmpeg
        
      - name: Install dependencies
        run: |
          pip install --upgrade pip
          pip install -r requirements.txt
          pip install -r requirements-dev.txt
          
      - name: Run tests with coverage
        run: pytest --cov=mtslinker --cov-report=xml --cov-report=term
        
      - name: Upload coverage to Codecov
        uses: codecov/codecov-action@v3
        with:
          files: ./coverage.xml

Notes:

  • ubuntu-latest has Docker pre-installed, so Testcontainers would work out of the box if added later
  • For Codecov, you'll need to add CODECOV_TOKEN to repository secrets (optional — public repos can use tokenless upload)
  • Integration tests with @pytest.mark.integration can run in the same job since ffmpeg is already installed via apt

Not a blocker — just a template for when you're ready to automate the test suite.

Grid composite already includes audio from input 0 (sorted so
audio-bearing webcam is first). Extracting the same webcam audio
again for _merge_audio_tracks caused the voice to play twice.
Audio-only tracks already cover other participants.
yokidjo (Contributor) commented Apr 10, 2026

@cyberb Let's wrap this up — amazing work!

I think we should cap this PR at the current functionality and move any remaining edge cases to follow-up PRs or issues. This will let us merge the massive performance improvements now rather than chasing the last 1% of fringe scenarios indefinitely.

Here's a summary of what's been accomplished:

Performance

  • 14+ hours → 70 minutes for 3-hour webinars
  • RAM usage: ~1GB → ~300MB
  • Parallel downloads (4 threads, 1MB chunks)
  • -c copy for video concat (no re-encoding)

Architecture

  • Split monolithic processor.py (1915 lines) into 6 focused classes:
    • FFmpegRunner: ffmpeg execution, GPU detection, encoder selection
    • MediaProber: file probing, duration, streams, audio levels
    • GridCompositor: multi-webcam grid layout
    • SlideCompositor: presentation slide overlay
    • AudioMerger: batched audio mixing with tree-reduce
    • SegmentBuilder: normalize, gaps, dedup, admin detection
  • Two-phase pipeline: analyze → manifest.json → execute
  • Dependency injection throughout (no static methods, testable)

Edge cases handled

  • Overlapping segments (grid composition + dedup strategies)
  • Variable resolutions (auto-detects max, 640x360 floor)
  • ADMIN detection for presenter prioritization
  • GPU acceleration (NVENC + CUDA overlay with CPU fallback)
  • OOM prevention (30-min chunking, tree-reduce audio mixing, no giant WAVs)
  • Corrupt file recovery (ffprobe validation, up to 2 retries)
  • Gaps and black segment generation
  • Audio-only tracks and webcams without audio
  • Speaker detection and switching (with hysteresis)
  • Presentation slide compositing (1280x720 layout)

Testing

  • 33 tests across 5 test files, all passing
  • requirements-dev.txt with pytest
  • Class structure ready for mocking and Testcontainers in the future

Security

  • All ffmpeg calls use subprocess.run with List[str] (no shell=True)
  • Safe against command injection from malicious filenames

What's left (can be follow-up issues/PRs)

  • Docker validation (build image, test on short webinar with slides, verify no ffmpeg errors)
  • Remaining audio edge cases from your 30-link test batch
  • GitHub Actions CI workflow
  • Code coverage reporting (pytest-cov)
  • Testcontainers integration tests

This PR is already a massive win. Let's merge it and iterate on the rest in smaller, focused PRs. 🚀

Grid input 0 already has audio — but other webcams' audio was lost.
Now extracts audio from all webcam files EXCEPT the one used as
input 0 in each grid segment. This captures all voices without echo.

Added test_audio_pipeline.py with integration tests:
- Grid audio no echo (same source not duplicated)
- Grid takes audio from first input
- Audio merge preserves timing with adelay
- Segment duration matches plan
- Black segments have silent audio stream
yokidjo (Contributor) commented Apr 10, 2026

@cyberb
One question before this gets too deep into edge-case handling:

Have you looked at the actual MTS Link player JavaScript code in the browser?

I'm wondering if we could simplify (or even eliminate) most of the complex reconstruction logic by understanding how the official client does it. The browser player must have a source of truth for:

  • Segment timeline (what plays when)
  • Overlapping sources (webcam + screen share simultaneously)
  • Layout instructions (who goes where in the grid)
  • Audio mixing priorities

If we can find where the player builds its internal playlist, we could replicate that logic instead of heuristically fixing:

  • Gap detection and black frame insertion
  • Overlap deduplication
  • Manual adelay offset calculation
  • Grid xstack assembly
  • Admin/speaker detection


cyberb added 2 commits April 10, 2026 22:37
Grid input 0 already has audio — other webcams need extraction.
Tracks which path is input 0 per grid segment and excludes only those.
Added test_audio_pipeline.py: echo, timing, duration, silence tests.
Replace all guesswork (overlap detection, dedup, speaker switching) with
StreamTimeline that builds playback windows from API mediasession events —
matching exactly what the MTS-Link web player does.

- Add StreamTimeline class with dataclasses (MediaSession, TimeWindow,
  GridSource, AudioTrack, DownloadChunk, SlideEvent)
- GridCompositor now mixes all audio streams inline via amix
- Remove dedup strategy, overlap heuristics, webcam audio extraction
- Remove dead code: deduplicate, extract_admin_conf_ids, is_valid,
  is_silent, analyze_audio_levels, legacy compat wrappers
- Fix .gitignore (was too broad, ignored tests/)
- 49 tests passing
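
A rough sketch of two of the new dataclasses; any field beyond what the commit message implies is a guess:

```python
from dataclasses import dataclass, field

@dataclass
class MediaSession:
    conference_id: str
    start: float               # seconds from recording start
    end: float
    screensharing: bool = False

@dataclass
class TimeWindow:
    start: float
    end: float
    sources: list = field(default_factory=list)  # streams active in this window
```
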
yokidjo (Contributor) left a comment

@cyberb StreamTimeline approach looks great. Code is cleaner, logic matches the actual player, tests pass.

@motattack LGTM. Ready for merge.

cyberb added 6 commits April 12, 2026 17:51
Audio-only streams were downloaded as raw binary from the storage URL,
which returns valid MP4 containers with silent audio (-91 dB). The real
audio lives in the HLS playlist variants. Now tries HLS first for all
streams (video and audio-only), falling back to direct download only if
HLS is unavailable.
The variable was removed in the mediasession rewrite (9a8a657) but the
logging line still referenced it. Strategy is now always 'timeline'.
When multiple streams are active:
- Screenshare → main area, admin → PIP overlay
- Admin (no screenshare) → main area, participant → PIP overlay
- No admin/screenshare → fall back to grid

Also fix grid xstack to always output exact target resolution,
preventing concat corruption from mismatched segment sizes.
Split GridCompositor into smaller classes:
- GridLayout: xstack grid compositing
- PresenterLayout: main + PIP overlay compositing
- GridCompositor: backward-compatible facade delegating to both
- _build_audio_filter: shared audio mixing helper
- _even: shared utility

Add 12 new tests covering presenter layout (main-only, PIP, extra
audio, resolution consistency), audio filter builder, grid resolution
matching, and facade backward compat. 61 tests passing.
The final amix step assumed the video always has an audio stream and
that it matches the mixed audio's 44100 Hz sample rate. HLS sources
come in at 48000 Hz, causing "Invalid argument" in the filter graph.

Now checks for audio presence and resamples before mixing.
The old heuristic (has_video && !has_audio = screenshare) was wrong —
it matched webcams with muted mics. The API provides explicit
stream.screensharing data on mediasession.add events. Now uses that
to correctly identify screen share streams for presenter layout.

Development

Successfully merging this pull request may close these issues.

Ускорение (Speedup)

2 participants