
feat: audio ingestion via Whisper API#31

Open
RaviTharuma wants to merge 1 commit into AVIDS2:main from RaviTharuma:feature/memorix-ek1-audio-ingestion

Conversation

@RaviTharuma
Contributor

Summary

Adds a memorix_ingest_audio MCP tool that transcribes audio files using the Whisper API and stores the transcript as a memorix observation.

What's included

  • src/multimodal/audio-loader.ts — Whisper API client with:
    • Configurable model (MEMORIX_WHISPER_MODEL, default: whisper-1)
    • Language hint and prompt support
    • Chunked file reading for large files
    • Automatic format detection from extension (mp3, wav, m4a, ogg, flac, webm, mp4)
  • src/multimodal/index.ts — shared multimodal utilities (MIME detection, file validation)
  • src/server.ts — MCP tool registration (memorix_ingest_audio)
  • tests/multimodal/audio-loader.test.ts — 8 tests covering:
    • Transcript extraction and observation storage
    • Unsupported format rejection
    • Missing file handling
    • Language and prompt passthrough
    • API error propagation
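The extension-based format detection described above can be sketched as a small lookup table. This is an illustrative reconstruction, not the actual `audio-loader.ts` code; `detectAudioFormat` and `AUDIO_MIME_TYPES` are hypothetical names.

```typescript
// Hypothetical sketch of extension -> MIME detection for the supported
// formats listed in the PR description. Unknown extensions are rejected.
const AUDIO_MIME_TYPES: Record<string, string> = {
  mp3: 'audio/mpeg',
  wav: 'audio/wav',
  m4a: 'audio/mp4',
  ogg: 'audio/ogg',
  flac: 'audio/flac',
  webm: 'audio/webm',
  mp4: 'audio/mp4',
};

export function detectAudioFormat(
  filePath: string,
): { ext: string; mime: string } {
  // Take everything after the last dot, case-insensitively.
  const ext = filePath.slice(filePath.lastIndexOf('.') + 1).toLowerCase();
  const mime = AUDIO_MIME_TYPES[ext];
  if (!mime) {
    throw new Error(`Unsupported audio format: .${ext}`);
  }
  return { ext, mime };
}
```

A table like this also gives the "unsupported format rejection" test a single code path to exercise.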

Design decisions

  • LLM-optional: requires MEMORIX_WHISPER_API_KEY (falls back to OPENAI_API_KEY). Tool returns clear error if unconfigured.
  • Stores transcript as how-it-works observation type with audio metadata in facts.
  • No new npm dependencies — uses native fetch for API calls and fs for file reading.
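The key-fallback decision above can be sketched as a small resolver. This is a hedged sketch, assuming only what the bullet states (MEMORIX_WHISPER_API_KEY, falling back to OPENAI_API_KEY, with a clear error otherwise); `resolveWhisperApiKey` is a hypothetical name.

```typescript
// Hypothetical sketch of the env-var fallback described in the PR:
// MEMORIX_WHISPER_API_KEY wins, OPENAI_API_KEY is the fallback, and an
// unconfigured setup produces a clear, actionable error.
export function resolveWhisperApiKey(
  env: Record<string, string | undefined>,
): string {
  const key = env.MEMORIX_WHISPER_API_KEY ?? env.OPENAI_API_KEY;
  if (!key) {
    throw new Error(
      'memorix_ingest_audio is not configured: set MEMORIX_WHISPER_API_KEY ' +
        '(or OPENAI_API_KEY) to enable Whisper transcription.',
    );
  }
  return key;
}
```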

Tests

8 pass, 0 fail


@chatgpt-codex-connector (bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6e1369a556


Comment on lines +3078 to +3079
markInternalWrite();
const result = await ingestAudio(


P2: Move internal-write marker to the actual write point

markInternalWrite() is called before ingestAudio(), but audio ingestion performs a remote transcription that can run up to 120 seconds; the hot-reload skip window is only 10 seconds, so long transcriptions can expire the skip window before storeObservation writes. In that case the file watcher will treat this internal write as external and trigger unnecessary reload/reindex work, which can cause avoidable contention during ingest-heavy workflows.
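The ordering fix the reviewer is asking for can be sketched as follows. The function names (`markInternalWrite`, `ingestAudio`, `storeObservation`) come from the review comment; their bodies here are stubs that only record call order, so this is an illustration of the suggested sequencing, not the real handler.

```typescript
// Stub harness: each function records when it runs, so the fixed
// ordering is observable without any real transcription or storage.
const calls: string[] = [];

async function ingestAudio(path: string): Promise<string> {
  calls.push('ingestAudio'); // in reality, a remote call of up to ~120 s
  return `transcript of ${path}`;
}

function markInternalWrite(): void {
  calls.push('markInternalWrite'); // opens the ~10 s hot-reload skip window
}

async function storeObservation(_transcript: string): Promise<void> {
  calls.push('storeObservation'); // the write the marker must actually cover
}

// Suggested ordering: transcribe first, then mark, then write immediately,
// so a slow transcription cannot outlast the skip window.
export async function handleIngestAudio(path: string): Promise<void> {
  const transcript = await ingestAudio(path);
  markInternalWrite();
  await storeObservation(transcript);
}
```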


@@ -0,0 +1,173 @@
import { describe, it, expect, afterEach, beforeEach } from 'bun:test';


P1: Use Vitest APIs in the new test file

The repository’s test command runs Vitest and includes tests/**/*.test.ts, but this new test imports from bun:test. Under the declared test runner, that import is unresolved, so CI/local npm test will fail before executing these assertions. Switching this file to Vitest imports (or globals) is necessary to keep the test suite runnable.
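Assuming the test file only uses the standard `describe`/`it`/`expect` lifecycle APIs (which Vitest exports under the same names as `bun:test`), the fix is a one-line change to the import specifier:

```typescript
// Before (unresolvable when the suite runs under Vitest):
// import { describe, it, expect, afterEach, beforeEach } from 'bun:test';

// After: same names, Vitest specifier.
import { describe, it, expect, afterEach, beforeEach } from 'vitest';
```

Alternatively, enabling Vitest's `globals: true` config option would let the file drop the import entirely.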


@AVIDS2
Owner

AVIDS2 commented Mar 30, 2026

Thanks for pushing on multimodal ingestion — the capability is useful, but this one still has a few concrete blockers.

  1. The test file uses bun:test, while this repo runs vitest in CI. Right now that means the PR fails immediately in the standard test matrix.
  2. In src/server.ts, the object passed into storeObservation is inferred as type: string, but storeObservation expects a real ObservationType.
  3. More importantly, the config path is too loose for the implementation: transcribeAudio() pulls the generic Memorix LLM API key via getLLMApiKey(), but the code then talks specifically to OpenAI/Groq Whisper transcription endpoints. In a valid Memorix setup that is configured for Anthropic or another non-Whisper provider, this can end up sending the wrong credential/provider combination to an incompatible API.

So I don't think this is just a test-fix PR — it needs one more pass on provider/config semantics as well.
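Blocker 2 above (the `type: string` inference) can be illustrated with a small sketch. The `ObservationType` union and `storeObservation` signature here are stand-ins for the real memorix types, which are not shown in this thread.

```typescript
// Stand-in types: the real ObservationType union and storeObservation
// live in the memorix codebase and may differ.
type ObservationType = 'how-it-works' | 'decision' | 'gotcha';

interface Observation {
  type: ObservationType;
  content: string;
  facts: Record<string, string>;
}

function storeObservation(obs: Observation): Observation {
  return obs; // stub: the real function persists the observation
}

// Without `satisfies` (or `as const` on `type`), an inline object literal
// assigned to an intermediate variable infers `type` as plain `string`,
// which then fails to match the ObservationType parameter.
const observation = {
  type: 'how-it-works',
  content: 'transcript text',
  facts: { source: 'audio', model: 'whisper-1' }, // metadata per the PR
} satisfies Observation;

export const stored = storeObservation(observation);
```

Passing the literal directly to `storeObservation`, or annotating the variable as `Observation`, would narrow the field the same way.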

