Replace MoviePy with ffmpeg for 10-100x performance improvement #15
cyberb wants to merge 60 commits into motattack:master
Conversation
- Parallel downloads (4 concurrent) instead of sequential
- Increase download chunk size from 8KB to 1MB
- Replace MoviePy video processing with direct ffmpeg calls
- Use ffmpeg concat demuxer with -c copy (no re-encoding)
- Normalize segments to common resolution for reliable concat
- Handle gaps with lightweight ffmpeg-generated black segments
- Merge audio-only tracks using ffmpeg filter_complex
- Remove moviepy and numpy dependencies
- Add ffmpeg to Dockerfile
- Bump version to 2.0.0

Fixes motattack#8
Some MTS Link recordings store only audio in the direct mp4 files, while the HLS delivery endpoint has both video and audio streams. Check the HLS playlist for a video track and download via ffmpeg when detected.
The old _merge_audio_tracks passed all audio files (up to 63+) as simultaneous inputs to a single ffmpeg amix command, which required ffmpeg to hold all delayed audio streams in memory for the full recording duration — causing OOM kills on long recordings. Now audio tracks are pre-delayed individually, then mixed via tree reduction in batches of 8. Also routes all subprocess calls through _run_ffmpeg which logs stderr on failure instead of silently swallowing it.
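The tree reduction described above can be sketched as a pure planning step. This is a minimal illustration, not the PR's actual code; `plan_mix_rounds` and the intermediate file names are hypothetical, and each batch would correspond to one bounded-memory ffmpeg `amix` invocation:

```python
from typing import List

BATCH_SIZE = 8  # tracks mixed per ffmpeg call, as described above

def plan_mix_rounds(tracks: List[str]) -> List[List[List[str]]]:
    """Group tracks into batches of BATCH_SIZE, then repeat on the batch
    outputs until a single mixed file remains (tree reduction)."""
    rounds = []
    current = list(tracks)
    level = 0
    while len(current) > 1:
        batches = [current[i:i + BATCH_SIZE]
                   for i in range(0, len(current), BATCH_SIZE)]
        rounds.append(batches)
        # Each batch produces one intermediate output for the next round.
        current = [f"mix_l{level}_b{j}.m4a" for j in range(len(batches))]
        level += 1
    return rounds

# 63 tracks -> round 1: 8 batches -> round 2: 1 batch of 8 -> done
rounds = plan_mix_rounds([f"track_{i}.m4a" for i in range(63)])
```

With 63 tracks this yields two rounds, so ffmpeg never holds more than 8 delayed streams at once.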
Recordings with multiple simultaneous feeds (webcam + screen share) have segments with overlapping timestamps. The old code laid them out sequentially, turning a 3hr recording into 10+ hours of concat video. This also caused the audio merge WAVs to be padded to 10hrs each, requiring ~400GB of disk. Added _deduplicate_overlapping() which keeps only the longest segment per time window (186 -> 7 segments in a real test case). Also pass total_duration from the API to _merge_audio_tracks so WAVs are padded to the correct recording length, not the (potentially inflated) concat file duration.
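A minimal sketch of the "keep the longest segment per overlapping window" idea, under the assumption that segments are (start, duration) pairs; the greedy longest-first selection here is an illustration, not necessarily the exact `_deduplicate_overlapping()` implementation:

```python
def deduplicate_overlapping(segments):
    """segments: list of (start, duration) pairs. Among overlapping
    segments, keep only the longest; return the survivors sorted by start."""
    kept = []
    # Consider longest segments first so they win their time window.
    for start, dur in sorted(segments, key=lambda s: -s[1]):
        overlaps = any(not (start + dur <= k_start or k_start + k_dur <= start)
                       for k_start, k_dur in kept)
        if not overlaps:
            kept.append((start, dur))
    return sorted(kept)

# A webcam (0-100s) and a shorter overlapping feed (10-60s): only one survives.
result = deduplicate_overlapping([(0, 100), (10, 50), (120, 30)])
```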
The previous approach materialized each audio track as a full-duration WAV (~1.8GB each for a 3hr recording). With 63 tracks that's ~113GB, filling the disk and crashing with "No space left on device". Now audio tracks are mixed in batches directly with adelay inside the ffmpeg filter graph, outputting compressed m4a (~15MB each). No intermediate WAVs are created. Batch results are tree-reduced and intermediates are deleted immediately after each round.
- Add _validate_downloaded_file() to check files with ffprobe after download
- Re-download corrupt files (missing moov atom), up to 2 retries
- Validate existing cached files on disk; re-download if corrupt
- Add _is_valid_media() in the processor to skip corrupt files during classification
- Audio batch mixing catches errors and skips failed batches instead of crashing
- If all audio batches fail, output video without audio overlay
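An ffprobe-based validity check along these lines is easy to sketch. The helper names are hypothetical (the PR's own are `_validate_downloaded_file()` / `_is_valid_media()`); the point is that a file missing its moov atom makes ffprobe exit non-zero:

```python
import subprocess
from typing import List

def probe_command(path: str) -> List[str]:
    """Build the ffprobe argv as a list, so filenames are passed literally."""
    return ["ffprobe", "-v", "error",
            "-show_entries", "format=duration",
            "-of", "default=noprint_wrappers=1", path]

def is_valid_media(path: str) -> bool:
    """True if ffprobe can parse the container; a truncated download
    (e.g. missing moov atom) returns a non-zero exit code here."""
    try:
        result = subprocess.run(probe_command(path),
                                capture_output=True, timeout=30)
        return result.returncode == 0
    except (subprocess.TimeoutExpired, FileNotFoundError):
        return False
```

The download retry loop would call `is_valid_media()` after each attempt and re-fetch on failure, up to the retry cap.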
Extract presentation.update events from the MTS API to get slide images and their timestamps. Download pre-rendered slide JPGs and composite them with the webcam video in a 1280x720 layout:
- Left 960px: presentation slide
- Top-right 320x180: webcam
- Slides are pre-encoded as 1fps video segments and concatenated into a single track, then overlaid with the webcam in one pass.

Recordings without presentations are unaffected (existing behavior).
Some recordings have tiny thumbnail-sized video segments (192x108) as the first file. The old code used the first segment's resolution for all normalization, resulting in a blurry output. Now scans all segments and picks the largest, with a 640x360 floor.
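The resolution-picking rule above fits in a few lines. A sketch, with a hypothetical helper name, assuming each segment's probed (width, height) is available:

```python
def pick_target_resolution(resolutions):
    """resolutions: list of (width, height) per segment. Pick the largest
    by pixel area, but never go below the 640x360 floor, so a thumbnail
    first segment cannot drag the whole output down."""
    best_w, best_h = max(resolutions, key=lambda wh: wh[0] * wh[1])
    return (max(best_w, 640), max(best_h, 360))

# A 192x108 thumbnail no longer dictates the output size.
target = pick_target_resolution([(192, 108), (1280, 720)])
```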
When multiple webcams overlap at the same timestamp, the old code kept the longest segment (often a random participant). Now tracks conference ID from the API and prefers the user with the most total segments across the recording — typically the presenter/instructor. Falls back to longest segment when conf_id is unavailable.
The -loop 1 -framerate 1 -t approach could produce millions of frames for long-duration slides (e.g., last slide staying up for 3 hours), causing ffmpeg to spin for hours and write gigabytes. Now uses -frames:v to strictly cap frame count to match duration at 1fps.
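A sketch of the capped slide-encoding command. The function name is hypothetical; the key change is that `-frames:v` bounds the frame count exactly (one frame per second at 1fps), so a slide shown for 3 hours emits 10800 frames and stops:

```python
def slide_segment_command(image: str, duration_s: float, out_path: str):
    """Encode a still slide as a 1 fps video segment, capping the frame
    count explicitly so ffmpeg cannot spin past the intended duration."""
    frames = max(1, round(duration_s))  # at 1 fps: one frame per second
    return ["ffmpeg", "-y",
            "-loop", "1", "-framerate", "1", "-i", image,
            "-frames:v", str(frames),
            out_path]

cmd = slide_segment_command("slide_042.jpg", 3 * 3600, "slide_042.mp4")
```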
Detects h264_nvenc at startup and uses it for all encoding steps if available. Falls back to libx264 CPU encoding if no GPU. Massively reduces CPU load and encoding time on systems with NVIDIA GPUs, while keeping the CPU cool.
The overlay step was CPU-bound (97°C). Now uses hwupload_cuda, scale_cuda, and overlay_cuda to do the compositing entirely on GPU. Falls back to CPU filters if CUDA overlay is not available.
Two changes:
1. Swap inputs in slide compositing so the webcam (25fps) drives the output frame clock instead of the slide track (1fps). Fixes choppy webcam playback in presentation videos.
2. For recordings without presentation slides that have multiple concurrent webcams (ПЗ, i.e. practical-session recordings), composite all active webcams into a grid layout using xstack instead of discarding all but one. Grid size adapts to the number of concurrent webcams (2x1, 2x2, 3x3, etc.). Audio from all participants is mixed.
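The adaptive grid sizing can be sketched as a near-square layout chosen from the webcam count. A minimal illustration (hypothetical helper, not necessarily the PR's exact rule), matching the 2x1 / 2x2 / 3x3 progression mentioned above:

```python
import math

def grid_dims(n: int):
    """Smallest near-square grid that fits n webcams:
    2 -> 2x1, 3-4 -> 2x2, 5-6 -> 3x2, 7-9 -> 3x3, ..."""
    cols = math.ceil(math.sqrt(n))
    rows = math.ceil(n / cols)
    return cols, rows

# These (cols, rows) pairs would drive the xstack layout string.
dims = [grid_dims(n) for n in (2, 4, 9)]
```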
Webcam inputs may lack audio tracks, causing ffmpeg to fail with 'Stream specifier :a matches no streams'. Since _merge_audio_tracks handles all audio separately, the grid step should output video only.
The old scoring picked the conference with the most segments, which favored participants toggling their cameras (many short segments) over the presenter (few long segments). Also had a window-shrinking bug where replacing a long segment with a short higher-ranked one let subsequent segments leak through. New approach: identify the main conference by total recorded duration, keep its segments, and fill gaps from other conferences.
Extracts ADMIN role from userlist events, maps to conference IDs via conference.add events, and passes is_admin flag through the download pipeline. Dedup now prefers ADMIN conferences (the presenter), falling back to total duration when no admin is found. Also fixes download_chunks_parallel to preserve the is_admin flag.
Dedup gap-fill: clamp "other" conference segments to actual gap boundaries instead of using raw file duration, preventing timeline overflow. Compile: skip segments starting before current_time (safety net for overlaps), and truncate segments via -t so they can't overflow into the next segment. Slide composite: scale the webcam proportionally to 320px wide (was fixed 320x180), so portrait webcams render at a usable size instead of being squished.
Grid fix: cell dimensions from integer division could be odd, causing ffmpeg's scale filter to round up and produce dimensions larger than the pad target ("Padded dimensions cannot be smaller than input"). Now forces even dimensions and uses min() to cap scale output.

Audio fix: amix divides volume by the number of inputs at each stage. After 3 levels of mixing (batch → reduce → overlay), audio was attenuated to near-silence (-91 dB). Added volume=N compensation after each amix to restore original loudness.
normalize=0 already prevents amix from dividing by N, so the volume=N multiplier was over-amplifying (~x112 across 3 pipeline stages), turning noise from silent tracks into interference. Also filter out silent audio-only segments (<-80 dB) before mixing so they don't waste processing time or add noise floor.
-80 dB was filtering out participant microphone audio that sits around -80 to -60 dB. Only -91 dB is true digital silence.
With normalize=0 and no volume=N, mixing silent segments with real audio just gives real audio. The filter was incorrectly dropping participant microphone tracks. Removing it simplifies the pipeline and ensures all audio-only segments are included.
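The resulting mix step boils down to one `amix` with `normalize=0` and no compensation. A sketch of the filtergraph string this implies (the output label `[aout]` is an illustrative choice):

```python
def amix_filter(n: int) -> str:
    """Build an amix filtergraph for n inputs. normalize=0 keeps each
    input at unity gain, so no volume=N stage is needed afterwards and
    silent tracks simply contribute nothing."""
    inputs = "".join(f"[{i}:a]" for i in range(n))
    return f"{inputs}amix=inputs={n}:normalize=0[aout]"

# Passed to ffmpeg via -filter_complex for a 3-track batch.
graph = amix_filter(3)
```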
When slides + multiple webcams are present, analyzes audio levels per participant to detect who is talking. Switches the right-side webcam to show the active speaker, defaulting to presenter when nobody else talks. Uses 2s analysis windows with 4s minimum hold to prevent flickering.
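The window-plus-hold logic can be modeled without touching ffmpeg at all. A sketch under stated assumptions: per-window loudness dicts come from prior audio analysis, the function name is hypothetical, and the 2s/4s constants match the description above:

```python
def plan_speaker_track(levels, window_s=2.0, hold_s=4.0, default="presenter"):
    """levels: one dict per analysis window mapping speaker -> mean dBFS.
    Returns the speaker shown in each window, holding every choice for at
    least hold_s seconds so the PIP does not flicker between speakers."""
    chosen = []
    current, since_switch = default, hold_s  # allow an immediate first switch
    for win in levels:
        loudest = max(win, key=win.get) if win else default
        if loudest != current and since_switch >= hold_s:
            current, since_switch = loudest, 0.0
        chosen.append(current)
        since_switch += window_s
    return chosen
```

A speaker who is loudest for a single 2s window cannot displace the current speaker, because the 4s hold has not elapsed yet.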
NVENC + complex overlay filter on 3+ hour videos consumes ~7GB, triggering OOM killer. libx264 uses ~300MB for the same operation. All other encoding steps still use NVENC.
The 720p cap + fast preset still OOM-kills on 3.5h recordings with many segments (e.g. 1197678196: 125 chunks, 23 participants, 34 segments).
Split compositing into 30-min chunks so ffmpeg never holds the full video in memory. This allows using NVENC again (faster) and restores 720p resolution cap. Each chunk is composited independently then concatenated with stream copy.
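Planning those 30-minute chunks is a simple split of the timeline. A sketch (hypothetical helper; each returned window would become one independent ffmpeg compositing run, then all chunks are concatenated with `-c copy`):

```python
def chunk_windows(total_s: float, chunk_s: float = 1800.0):
    """Split [0, total_s) into (start, duration) windows of at most
    chunk_s seconds, so no single ffmpeg run holds the full video."""
    windows, t = [], 0.0
    while t < total_s:
        windows.append((t, min(chunk_s, total_s - t)))
        t += chunk_s
    return windows

# A 3.5h recording becomes 7 half-hour compositing jobs.
wins = chunk_windows(3.5 * 3600)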
_get_video_encoder_fast() set _NVENC_AVAILABLE directly, bypassing _detect_gpu(). This left _CUDA_OVERLAY_AVAILABLE as None, so compositing always used CPU overlay even with CUDA support available.
Each participant's audio segments are analyzed by independent ffmpeg calls. Running 4 in parallel instead of sequentially speeds up speaker detection ~4x on multi-core systems.
Speaker switching can produce segments whose combined duration exceeds the original recording. Cap the concat at total_duration from the API to ensure the output matches the expected length.
@cyberb Thank you for the excellent work! @motattack, you can accept the pull request after finishing testing the work in Docker.
Grid segments could start late (e.g. 1762s) but concat placed them at 0s, causing video/audio desync. Now inserts leading and internal gaps in the manifest so the video timeline matches total_duration exactly. Validated: planned duration equals API duration with zero timeline gaps.
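The gap-insertion logic can be sketched as a small planner over (start, duration) segments. Names are hypothetical; each "gap" entry would become a black/silent filler in the concat manifest:

```python
def plan_timeline(segments, total_duration):
    """segments: sorted (start, duration) pairs. Return ('gap'|'seg', duration)
    entries whose durations sum exactly to total_duration, inserting leading,
    internal, and trailing gaps so concat keeps video and audio in sync."""
    plan, t = [], 0.0
    for start, dur in segments:
        if start > t:
            plan.append(("gap", start - t))  # leading or internal gap
            t = start
        plan.append(("seg", dur))
        t += dur
    if t < total_duration:
        plan.append(("gap", total_duration - t))  # trailing gap
    return plan

# A grid segment starting at 1762s gets a 1762s leading gap, not placed at 0s.
plan = plan_timeline([(1762, 100)], 2000)
```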
Sorry, it started as a simple improvement, then I tried it on various videos: some videos had no sound, some no video. Then it was fixing, fixing, fixing. It now downloads the videos I needed, but testing was a pain: I had to wait for hours, check the videos, fix, and repeat.
Security note: command injection via filenames. Since we're passing arguments to ffmpeg via subprocess, we should ensure that filenames containing special characters (`;`, `|`, `$()`, backticks) are not interpreted as shell commands. ✅ The current implementation looks safe — all calls use subprocess.run(cmd: List[str]) without shell=True, so arguments remain as literal strings passed to ffmpeg, not parsed by the shell. Just double-checking: no os.system() or shell=True sneaked into any of the 46 commits, right? 😄
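The property being checked can be demonstrated in a few lines. This sketch uses the Python interpreter as a stand-in child process (so it runs anywhere); with list-form argv and no `shell=True`, a metacharacter-laden filename arrives at the child verbatim:

```python
import subprocess
import sys

# A filename full of shell metacharacters. Because the command is a list
# and shell=True is never used, no shell ever parses this string.
tricky = "clip;$(rm -rf /);`id`.mp4"
out = subprocess.run(
    [sys.executable, "-c", "import sys; print(sys.argv[1])", tricky],
    capture_output=True, text=True,
).stdout.strip()
assert out == tricky  # delivered literally; nothing was executed
```

The same holds when the first element is `ffmpeg` or `ffprobe`: the filename is just argv data.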
Architecture & I/O review:
- Responsibility split: processor.py is quite heavy (~800 LOC). Consider splitting audio / grid / manifest logic into separate modules in a future refactoring.
Grid composite only keeps audio from input 0, losing all other webcam voices. Now extracts audio from all video files with audio streams for mixing in _merge_audio_tracks, same as dedup path.
Architecture suggestion for future refactoring

First of all, thank you so much for the incredible amount of work you've put into this PR. The performance gains (14+ hours → 70 minutes) are mind-blowing, and the attention to edge cases — from OOM handling to speaker detection, from GPU overlay to audio tree reduction — shows an amazing level of dedication and skill. Seriously impressive stuff.

One observation: 💡 for a future iteration, consider applying the Strategy pattern. This would make the code much easier to test (unit tests per strategy), extend with new scenarios, and maintain without touching the monolith. Not a blocker for this PR at all — just a thought for when you're ready to split this beast apart! Thanks again for this massive contribution.
Could you hold off a bit? I am still finding some sound issues in grid mode. I have 30 links to download, but it is getting better now. A day or two and I will let you know.
Split monolithic processor.py (1915 lines) into 6 focused classes: - FFmpegRunner: ffmpeg execution, GPU detection, encoder selection - MediaProber: file probing, duration, streams, audio levels - GridCompositor: multi-webcam grid layout - SlideCompositor: presentation slide overlay - AudioMerger: batched audio mixing with tree-reduce - SegmentBuilder: normalize, gaps, dedup, admin detection VideoProcessor composes all classes via constructor injection. No static methods, no underscore prefixes on public methods. processor.py is now a thin orchestrator with backward-compatible module-level functions. 33 tests across 5 test files, all passing. Added requirements-dev.txt with pytest.
@cyberb Awesome work on the refactoring and adding 33 tests! The class decomposition with dependency injection is perfect for testing. Two suggestions for future iterations:
If the current tests mock FFmpegRunner and MediaProber (which makes sense for fast unit tests), consider adding a few integration tests with Testcontainers to verify that complex ffmpeg filter graphs work with real binaries. Benefits:
Trade-off: slower — can be marked @pytest.mark.integration and run separately.
Adding pytest-cov would help track which parts of the pipeline are well-tested vs. untested. Example setup: `pytest --cov=mtslinker --cov-report=term --cov-report=html`. This gives visibility into coverage for audio mixing, grid composition, slide overlay, and edge case handling. Both are just ideas for the roadmap — not blockers for this PR. Thanks again for the massive effort on this!
GitHub Actions integration suggestion

I see you added requirements-dev.txt — great first step toward CI. Here's a complete setup you could add in a future PR if you want automated testing on every push. Create `.github/workflows/ci.yml` with:

```yaml
name: CI

on:
  push:
    branches: [ main ]
  pull_request:
    branches: [ main ]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.10"
      - name: Install ffmpeg
        run: sudo apt-get update && sudo apt-get install -y ffmpeg
      - name: Install dependencies
        run: |
          pip install --upgrade pip
          pip install -r requirements.txt
          pip install -r requirements-dev.txt
      - name: Run tests with coverage
        run: pytest --cov=mtslinker --cov-report=xml --cov-report=term
      - name: Upload coverage to Codecov
        uses: codecov/codecov-action@v3
        with:
          files: ./coverage.xml
```

Not a blocker — just a template for when you're ready to automate the test suite.
Grid composite already includes audio from input 0 (sorted so audio-bearing webcam is first). Extracting the same webcam audio again for _merge_audio_tracks caused the voice to play twice. Audio-only tracks already cover other participants.
@cyberb Let's wrap this up — amazing work! I think we should cap this PR at the current functionality and move any remaining edge cases to follow-up PRs or issues. This will let us merge the massive performance improvements now rather than chasing the last 1% of fringe scenarios indefinitely. Here's a summary of what's been accomplished:

Performance
Architecture
Edge cases handled
Testing
Security
What's left (can be follow-up issues/PRs)
This PR is already a massive win. Let's merge it and iterate on the rest in smaller, focused PRs. 🚀
Grid input 0 already has audio — but other webcams' audio was lost. Now extracts audio from all webcam files EXCEPT the one used as input 0 in each grid segment. This captures all voices without echo.

Added test_audio_pipeline.py with integration tests:
- Grid audio no echo (same source not duplicated)
- Grid takes audio from first input
- Audio merge preserves timing with adelay
- Segment duration matches plan
- Black segments have silent audio stream
@cyberb Have you looked at the actual MTS Link player JavaScript code in the browser? I'm wondering if we could simplify (or even eliminate) most of the complex reconstruction logic by understanding how the official client does it. The browser player must have a source of truth for:
If we can find where the player builds its internal playlist, we could replicate that logic instead of heuristically fixing:
Grid input 0 already has audio — other webcams need extraction. Tracks which path is input 0 per grid segment and excludes only those. Added test_audio_pipeline.py: echo, timing, duration, silence tests.
Replace all guesswork (overlap detection, dedup, speaker switching) with StreamTimeline that builds playback windows from API mediasession events — matching exactly what the MTS-Link web player does.
- Add StreamTimeline class with dataclasses (MediaSession, TimeWindow, GridSource, AudioTrack, DownloadChunk, SlideEvent)
- GridCompositor now mixes all audio streams inline via amix
- Remove dedup strategy, overlap heuristics, webcam audio extraction
- Remove dead code: deduplicate, extract_admin_conf_ids, is_valid, is_silent, analyze_audio_levels, legacy compat wrappers
- Fix .gitignore (was too broad, ignored tests/)
- 49 tests passing
yokidjo left a comment:
@cyberb StreamTimeline approach looks great. Code is cleaner, logic matches the actual player, tests pass.
@motattack LGTM. Ready for merge.
Audio-only streams were downloaded as raw binary from the storage URL, which returns valid MP4 containers with silent audio (-91 dB). The real audio lives in the HLS playlist variants. Now tries HLS first for all streams (video and audio-only), falling back to direct download only if HLS is unavailable.
The variable was removed in the mediasession rewrite (9a8a657) but the logging line still referenced it. Strategy is now always 'timeline'.
When multiple streams are active:
- Screenshare → main area, admin → PIP overlay
- Admin (no screenshare) → main area, participant → PIP overlay
- No admin/screenshare → fall back to grid

Also fix grid xstack to always output the exact target resolution, preventing concat corruption from mismatched segment sizes.
Split GridCompositor into smaller classes:
- GridLayout: xstack grid compositing
- PresenterLayout: main + PIP overlay compositing
- GridCompositor: backward-compatible facade delegating to both
- _build_audio_filter: shared audio mixing helper
- _even: shared utility

Add 12 new tests covering presenter layout (main-only, PIP, extra audio, resolution consistency), audio filter builder, grid resolution matching, and facade backward compat. 61 tests passing.
The final amix step assumed the video always has an audio stream and that it matches the mixed audio's 44100 Hz sample rate. HLS sources come in at 48000 Hz, causing "Invalid argument" in the filter graph. Now checks for audio presence and resamples before mixing.
The old heuristic (has_video && !has_audio = screenshare) was wrong — it matched webcams with muted mics. The API provides explicit stream.screensharing data on mediasession.add events. Now uses that to correctly identify screen share streams for presenter layout.
Summary

- Replace MoviePy video processing with direct ffmpeg calls, using `-c copy` for near-instant concatenation
- `httpx` and `tqdm` as Python deps; ffmpeg is the only new system requirement
- Audio-only tracks merged via the `amix` filter with proper delay offsets

What changed

- `downloader.py`: added `download_chunks_parallel()` using ThreadPoolExecutor; chunk size 8KB → 1MB
- `processor.py`, `webinar.py`: MoviePy-based processing replaced with direct ffmpeg calls
- `requirements.txt`: removed `moviepy`
- `setup.py`: removed the `moviepy` dep, bumped version to 2.0.0
- `Dockerfile`: added `ffmpeg` package installation

Performance

Tested on a real 3-hour webinar (139 chunks, 106 video + 33 audio-only segments); concatenation runs with `-c copy` (stream copy). For recordings without audio-only tracks, the improvement is even larger since the audio mixing step (the slowest remaining part) is skipped entirely.

Requirements

- `ffmpeg` and `ffprobe` must be installed (`apt install ffmpeg` / `brew install ffmpeg`)

Fixes #8

Test plan