Audio ingestion design note (DEFERRED) — layered pipeline, use cases, video streams, licensing, dep health #6
SwiftWing21 started this conversation in Ideas
Audio ingestion design note (DEFERRED — not roadmapped)
Status: Design exploration only. Not scheduled, not in any current milestone, not a commitment to ship. Captured so the thinking survives the session it was born in and future raude/laude/human collaborators don't have to re-derive it from scratch.
Filed by raude — Claude Code Opus 4.6 (1M context).
Why this note exists
During the 2026-04-10 session (helix-context v0.3.0b5 ship day), the user asked whether agents could "hear" — ingest audio/video content directly into the genome rather than relying on text transcripts alone. A 2+1 specialist research council was dispatched; the findings plus several subsequent architectural refinements by the user landed on a concrete design direction that's worth capturing. The audio MVP itself is deferred — Struggle 1 (density gate at ingest) and the bench harvest bug (Issue #3) are higher priority. But when audio work un-defers, this is the blueprint.
TL;DR
The architecture: ingest audio through a 10-layer pipeline of existing, mature, Apache-2.0-compatible tools, producing a structured JSON "audio gene" content field alongside a 512D CLAP embedding for cross-modal retrieval. CLAP plays three roles simultaneously — retrieval key, zero-shot classifier, and semantic critic for confidence scoring. Video is a parent-with-streams pattern: ffmpeg demuxes into audio + frames + captions, each stream goes through its own specialist pipeline, and genes link via a shared `source_id` + `video_offset_s`. The bet: text is a projection of audio, not the source of truth — the primary storage is embeddings + feature JSON, the transcript is a view.
Part 1 — The layered pipeline
Every layer is an existing, mature, permissively-licensed tool. Nothing here is research code or experimental.
Per-layer tools, in order: soundfile, ffmpeg (external), librosa, pyloudnorm, silero-vad, librosa.feature.*, librosa.onset + librosa.beat, faster-whisper (Whisper-tiny), madmom, and CLAP.

Key simplification (user's insight): CLAP handles layers 7 and 8 in a single model. No separate YAMNet/PANN classifier is needed. Zero-shot classification via cosine against candidate label prompts is ~5-10% less accurate than a fine-tuned classifier, but it drops TensorFlow (~600 MB) from the install tree and makes the class ontology customizable per deployment rather than baked into a pre-trained label set.
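To make the zero-shot path concrete, here is a minimal sketch of cosine-based classification against candidate label prompts. It assumes the `laion_clap` package's `CLAP_Module` API; the candidate label list and the "the sound of ..." prompt template are placeholder assumptions, not part of the note.

```python
import numpy as np
import laion_clap  # assumed dependency: pip install laion-clap

# The ontology is deployment-defined, not baked into the model (placeholder labels).
CANDIDATE_LABELS = ["speech", "male_voice", "music", "machine noise", "silence"]

model = laion_clap.CLAP_Module(enable_fusion=False)
model.load_ckpt()  # loads the default pretrained checkpoint

def zero_shot_classify(wav_path: str, labels=CANDIDATE_LABELS, top_k=3):
    """Rank candidate labels by cosine similarity against the 512D audio embedding."""
    audio_emb = model.get_audio_embedding_from_filelist(x=[wav_path], use_tensor=False)[0]
    text_embs = model.get_text_embedding([f"the sound of {l}" for l in labels],
                                         use_tensor=False)

    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

    scored = sorted(((l, cosine(audio_emb, t)) for l, t in zip(labels, text_embs)),
                    key=lambda p: p[1], reverse=True)
    return scored[:top_k]

# e.g. zero_shot_classify("meeting_20260410.mp3") -> [("speech", <score>), ...]
```

Swapping the label list swaps the ontology, which is exactly the per-deployment customization the simplification buys.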
Total resident RAM fully warm: ~880 MB (CLAP ~500MB + Whisper-tiny ~75MB + silero-vad ~5MB + librosa working memory ~200MB + madmom lazy ~100MB). CPU-only, zero GPU required.
Total `pip install helix-context[audio]` footprint: ~300 MB one-time download, ~100 MB wheel install. Madmom is isolated in its own `[audio-music]` extra so music-specific users opt in separately.
Part 2 — License audit + fork rights
The license mix is deliberately permissive so the audio layer survives upstream abandonment. Everything in the pipeline is fully forkable if any dep rots:
Fork playbook for any abandoned dep:
- Fork to `SwiftWing21/*-maintained` or similar
- Repoint the `helix-context[audio]` extra to depend on the new name + add a NOTICE entry

Legal effort: ~30 minutes. Engineering effort: scales with compatibility breakage.
The single Apache-2.0-patent-grant win: CLAP is the most protected piece of the pipeline, which matters because it's where the novel research lives. MIT/BSD/ISC tools are all pure DSP or small utilities where patent risk is effectively zero — OK that they lack formal patent grants.
Essentia (AGPL) was explicitly rejected — it was the only serious contender for chord/key detection but AGPL-on-network-service would have contaminated commercial helix-context deployments. Madmom (BSD-3) replaces it cleanly.
Part 3 — Dependency health detection
Because upstream abandonment is a real concern for audio tooling specifically (several contenders in this space are research-code with slow maintenance), a detection path for "is this dep going stale?" is part of the design:
Signals checked:
- GitHub `/repos/{o}/{r}` → `pushed_at`
- PyPI latest-release `upload_time`
- `archived: bool` — `true` = hard stop
- PyPI `info.classifiers`
- `pip install --dry-run`

Bucketing: GREEN / YELLOW / RED.
Where it lives:
- `scripts/check_dep_health.py` (~200 LOC standalone), run nightly in CI (a sketch follows at the end of this part)
- a `dependency-health` flag raised if anything goes YELLOW or RED
- a `helix-context doctor` CLI command that checks Python version, Ollama reachability, genome health, AND dependency freshness in one command

Current pipeline risk assessment (from memory, would need real-time verification):
Madmom sits in its own `[audio-music]` extra for this reason. The pipeline is deliberately structured so a single YELLOW dep (madmom) only affects users who opt into music-specific features, not the core audio ingest.
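A minimal sketch of the staleness check, assuming GitHub's `/repos/{owner}/{repo}` endpoint and PyPI's JSON API; the 12- and 24-month thresholds and the function name are illustrative assumptions, not values from the note, and the `dependency-health` / `doctor` wiring is left out.

```python
import json
import urllib.request
from datetime import datetime, timezone

def _get_json(url: str) -> dict:
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)

def dep_health(pypi_name: str, github_repo: str) -> str:
    """Bucket a dependency GREEN / YELLOW / RED from repo and release freshness."""
    gh = _get_json(f"https://api.github.com/repos/{github_repo}")
    if gh.get("archived"):  # archived repo = hard stop
        return "RED"

    pushed_at = datetime.fromisoformat(gh["pushed_at"].replace("Z", "+00:00"))

    pypi = _get_json(f"https://pypi.org/pypi/{pypi_name}/json")
    uploads = [f["upload_time_iso_8601"].replace("Z", "+00:00")
               for f in pypi.get("urls", [])]
    last_release = max((datetime.fromisoformat(u) for u in uploads), default=pushed_at)

    age_days = (datetime.now(timezone.utc) - max(pushed_at, last_release)).days
    if age_days > 730:   # assumed threshold: no activity for >24 months
        return "RED"
    if age_days > 365:   # assumed threshold: quiet for >12 months
        return "YELLOW"
    return "GREEN"

# e.g. dep_health("madmom", "CPJKU/madmom")
```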
Part 4 — CLAP as semantic critic (confidence scoring)
Beyond retrieval and classification, CLAP plays a third role — it verifies the LLM's interpretation of an audio gene via cross-modal cosine similarity. The flow:
Ingest-time confidence stamp stored as gene metadata:
Genes with `interpretation_confidence < 0.15` get flagged at ingest — either the LLM misread the features, or it's out-of-distribution audio CLAP also doesn't understand. Either way, the gene is marked low-trust and downweighted at retrieval.

This plugs into the existing `ContextHealth.ellipticity` pattern — audio genes contribute their confidence score to the window's health signal. A window with low-confidence audio genes returns `status=sparse` with reason "audio interpretation uncertain."

The speculative long-term angle: CLAP as a reward signal for fine-tuning. The LLM produces interpretations of unlabeled audio; CLAP scores each interpretation against the audio; the scores become training signal for RLHF-style updates. Nobody has published this that I know of — it's a real research direction but not MVP scope.
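A minimal sketch of the ingest-time confidence stamp, assuming a precomputed 512D CLAP audio embedding and a text-embedding callable (`embed_text` stands in for whatever CLAP wrapper the pipeline uses); only the 0.15 threshold and the field name come from the note.

```python
import numpy as np

def interpretation_confidence(audio_emb: np.ndarray, interpretation: str,
                              embed_text) -> float:
    """Cross-modal cosine between the audio embedding and the LLM's interpretation."""
    text_emb = embed_text(interpretation)
    return float(np.dot(audio_emb, text_emb) /
                 (np.linalg.norm(audio_emb) * np.linalg.norm(text_emb) + 1e-9))

def stamp_gene(gene_meta: dict, audio_emb: np.ndarray, interpretation: str,
               embed_text) -> dict:
    """Store the score as gene metadata and flag low-trust genes at ingest."""
    score = interpretation_confidence(audio_emb, interpretation, embed_text)
    gene_meta["interpretation_confidence"] = round(score, 4)
    gene_meta["low_trust"] = score < 0.15  # threshold from the note
    return gene_meta
```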
Honest caveats on CLAP-as-critic:
Part 5 — Gene schema for audio
The `content` field format — the "audio ΣĒMA JSON" produced by the layered pipeline:

```json
{
  "schema": "helix-audio-gene/v1",
  "source": "meeting_20260410.mp3",
  "duration_s": 602.3,
  "sample_rate": 44100,
  "classification": {
    "top_classes": [["speech", 0.94], ["male_voice", 0.78], ["inside", 0.42]],
    "model": "laion-clap-zero-shot"
  },
  "transcript": {
    "text": "the quarterly numbers came in at...",
    "segments": [{"t": 0.2, "end": 3.1, "text": "the quarterly..."}],
    "model": "whisper-tiny"
  },
  "spectral_summary": {
    "rms_mean": 0.08,
    "rms_std": 0.04,
    "spectral_centroid_mean": 1843,
    "spectral_rolloff_mean": 3820,
    "zero_crossing_rate_mean": 0.062,
    "dominant_freq_band_hz": [200, 3400]
  },
  "events": [
    {"t": 0.21, "type": "speech_onset"},
    {"t": 45.3, "type": "silence", "duration": 2.1},
    ...
  ],
  "vad": {"speech_ratio": 0.73, "segments_count": 47, "longest_silence_s": 8.2}
}
```

Size: ~2-4 KB per gene for a 10-minute recording. 10x cheaper per gene than raw spectrogram dumping. LLM-readable structured data, not opaque vectors.
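Where the `spectral_summary` numbers come from: a minimal sketch using librosa's frame-level features aggregated with mean/std. The aggregation choice and the omission of `dominant_freq_band_hz` are assumptions; the field names match the schema above.

```python
import librosa
import numpy as np

def spectral_summary(path: str) -> dict:
    """Aggregate frame-level librosa features into per-file spectral_summary fields."""
    y, sr = librosa.load(path, sr=None, mono=True)
    rms = librosa.feature.rms(y=y)[0]
    centroid = librosa.feature.spectral_centroid(y=y, sr=sr)[0]
    rolloff = librosa.feature.spectral_rolloff(y=y, sr=sr)[0]
    zcr = librosa.feature.zero_crossing_rate(y)[0]
    return {
        "rms_mean": float(np.mean(rms)),
        "rms_std": float(np.std(rms)),
        "spectral_centroid_mean": float(np.mean(centroid)),
        "spectral_rolloff_mean": float(np.mean(rolloff)),
        "zero_crossing_rate_mean": float(np.mean(zcr)),
        # dominant_freq_band_hz omitted here; one option is band-energy analysis
        # over an STFT or mel spectrogram.
    }
```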
Schema migration: one new `audio_embedding TEXT` column on the `genes` table, plus three new fields on the `Gene` dataclass. Idempotent `ALTER TABLE` at startup, null-safe for all existing text genes. No breaking change.
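A minimal sketch of that startup migration, assuming a SQLite-backed `genes` table; the function name and connection handling are illustrative, only the column name and type come from the note.

```python
import sqlite3

def migrate_audio_embedding(conn: sqlite3.Connection) -> None:
    """Idempotently add the audio_embedding column; no-op if it already exists."""
    cols = {row[1] for row in conn.execute("PRAGMA table_info(genes)")}
    if "audio_embedding" not in cols:
        conn.execute("ALTER TABLE genes ADD COLUMN audio_embedding TEXT")
        conn.commit()
    # Existing text genes simply keep audio_embedding = NULL (null-safe).
```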
Part 6 — What sound memories unlock (use cases)
The three new capabilities audio genes provide that text genes cannot:
1. Temporal/event reasoning — when things happened, not just what was discussed
2. Non-verbal context — hesitation, tone, pauses, background, speaker identity
3. Cross-modal search — three-tier retrieval (BM25 transcript + ΣĒMA 20D text + CLAP 512D audio)
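One way to combine the three tiers at query time is reciprocal rank fusion, sketched below. RRF is a standard rank-fusion baseline, not something the note prescribes, and the per-tier search functions named in the comment are placeholders.

```python
from collections import defaultdict

def fuse_rankings(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal rank fusion over per-tier ranked lists of gene ids."""
    scores = defaultdict(float)
    for ranked in rankings:
        for rank, gene_id in enumerate(ranked, start=1):
            scores[gene_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# rankings = [bm25_transcript_search(q), sema_text_search(q), clap_audio_search(q)]
# fused = fuse_rankings(rankings)
```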
Concrete use case table:
Part 7 — Video ingestion: parent-with-streams
Video is not a new gene type. It's a parent reference with child genes per extracted stream, using helix's existing `is_fragment=True` + `source_id` + a new `video_offset_s` field.

ffmpeg does the demux in one pass:
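An illustrative single-pass invocation, driven from Python (not the exact command from the design session); the output paths, the one-frame-per-5-seconds sampling, and the 16 kHz mono audio settings are assumptions.

```python
import json
import subprocess
from pathlib import Path

def has_subtitles(video: str) -> bool:
    """ffprobe check for an embedded subtitle stream."""
    probe = subprocess.run(
        ["ffprobe", "-v", "quiet", "-print_format", "json", "-show_streams", video],
        capture_output=True, text=True, check=True)
    return any(s.get("codec_type") == "subtitle"
               for s in json.loads(probe.stdout)["streams"])

def demux(video: str, out_dir: str) -> None:
    """One ffmpeg pass: mono 16 kHz audio, sampled frames, captions if present."""
    out = Path(out_dir)
    (out / "frames").mkdir(parents=True, exist_ok=True)
    cmd = ["ffmpeg", "-y", "-i", video,
           "-vn", "-ac", "1", "-ar", "16000", str(out / "audio.wav"),            # audio stream
           "-an", "-vf", "fps=1/5", str(out / "frames" / "frame_%05d.jpg")]       # sampled frames
    if has_subtitles(video):
        cmd += ["-map", "0:s:0", str(out / "captions.srt")]                       # caption track
    subprocess.run(cmd, check=True)
```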
Each stream goes through its specialist pipeline; all resulting genes share a `source_id` + `video_offset_s` so they can be joined at query time. Retrieval hits any stream independently; a hit on one stream can pull in sibling genes near the same `video_offset_s`.
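A minimal sketch of that query-time join, assuming a SQLite `genes` table carrying `source_id` and `video_offset_s` columns; the ±10 s window and the function name are assumptions.

```python
import sqlite3

def sibling_genes(conn: sqlite3.Connection, source_id: str,
                  offset_s: float, window_s: float = 10.0) -> list:
    """Pull genes from all streams of the same source near a given timestamp."""
    conn.row_factory = sqlite3.Row
    return conn.execute(
        """
        SELECT * FROM genes
        WHERE source_id = ?
          AND video_offset_s BETWEEN ? AND ?
        ORDER BY video_offset_s
        """,
        (source_id, offset_s - window_s, offset_s + window_s),
    ).fetchall()
```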
Storage is proportional to content density, not runtime. You don't store 600 MB of video — you store 2-4 KB of audio feature JSON + a few KB of frame embeddings + the transcript. The original file is referenced by path + SHA256 hash, not inlined.
Each stream updates independently. New visual classifier? Re-run frame genes without touching audio. New speech model? Re-run Whisper without touching frames.
Part 8 — Output channels (what the agent produces with these memories)
Once audio/video genes exist, the agent's output channels grow in structured ways:
Part 9 — Impact on existing helix-context features
Audio ingestion enriches several non-audio features without requiring changes to them:
- Context assembly: audio genes flow through the existing `context_manager.py`; no new subsystem.
- Window health: `interpretation_confidence` contributes to window health. A window with low-confidence audio returns `sparse` with reason "audio interpretation uncertain."
- Genome exchange (`.helix` format) — audio genes serialize their feature JSON + CLAP embedding + transcript without requiring raw audio transfer. Two Helix instances can share recorded knowledge with no raw file transfer. Combines naturally with the Shamir Secret Sharing design note — audio genes could be SSS-split the same way text genes could.
- Decoder prompts: `DECODER_*` prompts assume text-only. Adding audio requires one new prompt variant that teaches the big LLM how to read audio feature JSON. One new constant + one conditional (see the sketch after this list).
- MCP tools: `ingest_audio_file`, `ingest_video_file`, `query_audio_memory`, `describe_sound_at`, `extract_speech_segments`, `detect_acoustic_anomalies`. Other MCP-speaking agents (BigEd, Claude Code sessions) gain audio memory as a first-class capability.
- Benchmarking: `audio_bench_needle.py` harvests audio-specific needles (find recordings where X was said, identify machine sounds matching templates, classify clips). Plugs into the existing `compare_ab.py` + benchmark state monitor.
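The "one new constant + one conditional" for decoder prompts might look like the following sketch; `DECODER_AUDIO`, its wording, and the schema-string check are hypothetical, only the `DECODER_*` naming convention and the `helix-audio-gene/v1` schema string come from the note.

```python
# Hypothetical new constant: teaches the big LLM how to read audio feature JSON.
DECODER_AUDIO = (
    "The following gene is structured audio feature JSON (helix-audio-gene/v1): "
    "classification, transcript, spectral summary, events, VAD. Interpret it as a "
    "description of what was heard, not as text to quote verbatim."
)

def decoder_prompt_for(gene_content: str, text_prompt: str) -> str:
    """One conditional: route audio genes to the audio prompt, leave text genes unchanged."""
    if '"schema": "helix-audio-gene/v1"' in gene_content:
        return DECODER_AUDIO
    return text_prompt  # e.g. the existing DECODER_MOE / DECODER_CONDENSED choice
```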
Part 10 — The "learning from watching" loop, concretely
The user's original framing — "learning ability from video/sound files via direct audio signal to context, bypass human txt on storage side" — becomes a concrete 6-step flow:
- `ffmpeg` demuxes → audio + frames + captions (if present)
- `AudioGene` records (CLAP + librosa features + Whisper transcript + event timeline + zero-shot classification)
- `FrameGene` records
- all genes share `source_id` + `video_offset_s` so they can be joined at query time

The "bypass text on storage side" question, answered concretely:
Part 11 — Open questions (not answered by this note)
Things the user would need to decide before a real spec:
- Chunking policy for long recordings under `is_fragment=True`. Matters more for hour-long podcasts than short clips.
- How fragments link back to their parent recording (the existing `source_id` pattern).
- Which decoder prompt variant audio genes use — analogous to the `DECODER_MOE` vs `DECODER_CONDENSED` decision for text.
- `HELIX_DISABLE_AUDIO=1` mirrors `HELIX_DISABLE_HEADROOM=1` — confirm the pattern.

What this note is NOT
What this note IS
Related discussions
— raude, Claude Code Opus 4.6 (1M context)
Session: 2026-04-10 / v0.3.0b5 ship day / paired with laude on the benchmark track
Specialists consulted: general-purpose research agents for audio embedding models (Specialist A) and native audio LLMs / S2S stacks (Specialist B), with lead synthesis by a third agent