Because we farm out processing of audio and video to the transcription API, audio and video files are excluded from Tika metadata extraction, so no metadata is ever shown for these file types. This is a bit of a problem for people building up an archive of recordings.
I was wondering whether the metadata extraction could be performed by Giant, notwithstanding the fact that the transcription and translation work is performed externally.
And of course because I'm an idiot I asked Copilot to write an RFC, which is below. Is it actually true under all circumstances that "Giant already has the source media"? If for example someone does a capture URL or "Send to Giant", doesn't stuff go to the media download service, then to the transcription service, then to the resolved placeholders in Giant? At that point haven't we moved beyond the Tika stage?
RFC: Add Metadata Extraction for Video/Audio Files
Summary
When video and audio files are ingested into Giant, no file metadata is shown to users in the viewer sidebar. This RFC proposes adding metadata extraction for media files using Apache Tika, consistent with how all other document types are handled.
Problem
The DocumentBodyExtractor is responsible for extracting file metadata via Apache Tika and writing it to Elasticsearch. It has a hardcoded allowlist of MIME types that excludes all audio/* and video/* types. When a file is ingested, MimeTypeMapper.getExtractorsFor() matches it against registered extractors — for audio/video, only the transcription extractor matches.
The transcription extractors (TranscriptionExtractor / ExternalTranscriptionExtractor) call index.addDocumentTranscription(), which writes only transcript, vttTranscript, and transcriptExtracted fields. In contrast, DocumentBodyExtractor calls index.addDocumentDetails(), which writes extracted=true, metadata.extractedMetadata (raw Tika key-value pairs), and metadata.enrichedMetadata (title, author, dates, etc.). This method is never called for audio/video files.
The frontend (DocumentMetadata.js) renders the "File Metadata" section identically for all blob types with no MIME-type filtering. It shows an empty section because the data was never written to Elasticsearch.
Current flow for documents (e.g. PDF)
- File uploaded
- `MimeTypeMapper` assigns `DocumentBodyExtractor`
- Worker runs `DocumentBodyExtractor`; Tika extracts text + metadata
- `MetadataEnrichment.enrich()` produces structured fields
- `index.addDocumentDetails()` writes `extracted=true`, `metadata`, `enrichedMetadata` to ES
- Frontend displays metadata in sidebar ✅
Current flow for audio/video
- File uploaded
- `MimeTypeMapper` assigns `TranscriptionExtractor` only
- Worker runs `TranscriptionExtractor` (or `ExternalTranscriptionExtractor`)
- Transcription service returns transcript text
- `index.addDocumentTranscription()` writes transcript fields only
- No metadata in ES; frontend sidebar is empty ❌
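The routing gap described above can be sketched in a few lines of Scala. This is a hypothetical model, not Giant's actual code: the `Extractor` and `MimeTypeMapper` shapes follow the RFC's description, and the MIME lists are illustrative subsets.

```scala
// Hypothetical model of MimeTypeMapper dispatch (names follow the RFC's
// description; MIME lists are illustrative subsets, not Giant's real code).
trait Extractor { def name: String; def mimeTypes: Set[String] }

object DocumentBodyExtractor extends Extractor {
  val name = "DocumentBodyExtractor"
  // Hardcoded allowlist: excludes all audio/* and video/* types
  val mimeTypes = Set("application/pdf", "text/plain", "application/msword")
}

object TranscriptionExtractor extends Extractor {
  val name = "TranscriptionExtractor"
  val mimeTypes = Set("audio/mpeg", "audio/wav", "video/mp4")
}

class MimeTypeMapper(extractors: Seq[Extractor]) {
  // Every extractor whose allowlist contains the MIME type gets a task
  def getExtractorsFor(mimeType: String): Seq[Extractor] =
    extractors.filter(_.mimeTypes.contains(mimeType))
}

val mapper = new MimeTypeMapper(Seq(DocumentBodyExtractor, TranscriptionExtractor))
// For an MP3, only TranscriptionExtractor matches, so addDocumentDetails()
// (and therefore all file metadata) is never written for it.
```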
Proposal
Extract metadata from audio/video files within Giant using Apache Tika, which already supports media metadata (MP3 ID3 tags, MP4 atoms, WAV headers, FLAC, OGG, etc.) via parsers in Giant's existing dependency chain.
Why Giant-side rather than the external transcription service?
- Tika already supports it: media metadata parsers are already available in Giant's Tika dependency
- Giant already has the source media in object storage, uploaded before extraction begins
- The full metadata pipeline exists: Tika → `MetadataEnrichment.enrich()` → `addDocumentDetails()` → Elasticsearch → frontend. It just isn't wired up for media types
- Separation of concerns: the external transcription service's job is speech-to-text. Adding metadata extraction there would create a new cross-service contract and couple unrelated concerns
- Consistency: all other document types have metadata extracted by Giant's internal workers
Approach: new MediaMetadataExtractor
Create a new extractor dedicated to audio/video metadata. This runs independently of and in parallel with the transcription extractor — metadata extraction is fast while transcription is slow.
The alternative — adding audio/video MIME types to the existing DocumentBodyExtractor — is simpler but semantically muddies "document body extraction" with media files that have no text body.
Implementation
1. Create MediaMetadataExtractor
New file: backend/app/extraction/MediaMetadataExtractor.scala
- Accepts the same audio/video MIME types as the transcription extractors (listed in `ExternalTranscriptionExtractor.mimeTypes`)
- Uses Tika to parse the input stream and extract metadata
- Calls `index.addDocumentDetails()` with raw metadata, enriched metadata, and `text = None` (media files have no text body)
- Runs at higher priority than the transcription extractor so it completes quickly
Both this extractor and the transcription extractor would be assigned to the same MIME types via MimeTypeMapper, and both would run as independent extraction tasks.
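A minimal sketch of what the new extractor could look like. The `Extractor` trait, `Index` interface, and priority scheme are stand-ins for Giant's real abstractions (assumed names, not taken from the codebase), and the enrichment step is deliberately simplified:

```scala
// Hypothetical sketch of the proposed extractor. Extractor, Index, and the
// priority scheme are stand-ins for Giant's real abstractions.
trait Extractor {
  def mimeTypes: Set[String]
  def priority: Int
}

// Stand-in for the index.addDocumentDetails() call described in the RFC.
trait Index {
  def addDocumentDetails(
    blobId: String,
    text: Option[String],
    extractedMetadata: Map[String, Seq[String]],
    enrichedMetadata: Map[String, String]
  ): Unit
}

class MediaMetadataExtractor(index: Index) extends Extractor {
  // Mirror the transcription extractors' MIME allowlist (illustrative subset)
  override val mimeTypes: Set[String] = Set(
    "audio/mpeg", "audio/wav", "audio/aac",
    "video/mp4", "video/quicktime", "video/x-msvideo"
  )

  // Runs ahead of/alongside transcription; metadata extraction is cheap
  override val priority: Int = 10

  // rawMetadata stands in for the key/value pairs Tika would produce
  def extract(blobId: String, rawMetadata: Map[String, Seq[String]]): Unit =
    index.addDocumentDetails(
      blobId,
      text = None, // media files have no text body
      extractedMetadata = rawMetadata,
      enrichedMetadata = enrich(rawMetadata)
    )

  // Deliberately simplified enrichment: map Tika keys onto structured fields
  private def enrich(raw: Map[String, Seq[String]]): Map[String, String] =
    Seq(
      "title"  -> raw.get("dc:title"),
      "author" -> raw.get("dc:creator")
    ).collect { case (k, Some(v +: _)) => k -> v }.toMap
}
```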
2. Extend MetadataEnrichment (optional, can iterate later)
Some existing enrichment keys already overlap with Tika's audio/video output:
| Enriched field | Existing keys that work for media |
|---|---|
| `title` | `dc:title` (works for MP3 ID3 title, MP4 title) |
| `author` | `dc:creator` (works for MP3 artist) |
| `createdAt` | `dcterms:created`, `meta:creation-date` (works for MP4 creation date) |
Audio/video-specific Tika keys that would appear in raw metadata without further work:
- `xmpDM:duration`: media duration
- `xmpDM:audioSampleRate`: sample rate
- `xmpDM:audioChannelType`: mono/stereo
- `xmpDM:audioCompressor`: audio codec
- `xmpDM:videoFrameRate`: video frame rate
- `channels`, `samplerate`, `bits`
For a first pass, these would be visible in the "View Raw Metadata" toggle. Structured enriched fields for media-specific properties (e.g. duration) could be added in a follow-up.
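As a sketch of what the follow-up enrichment might look like, here is a hypothetical `enrichMedia` helper that maps the `xmpDM:*` keys above onto flat enriched fields. The flat-map output shape is an assumption (the real `EnrichedMetadata` is a case class), and the unit of `xmpDM:duration` should be verified per Tika parser before relying on the formatting:

```scala
// Hypothetical enrichment helper for media-specific Tika keys. The flat Map
// output is an assumption; Giant's real EnrichedMetadata is a case class.
def enrichMedia(raw: Map[String, String]): Map[String, String] = {
  // xmpDM:duration is assumed to be in seconds here; the unit can vary by
  // Tika parser, so verify per format before relying on this.
  def secondsToClock(s: String): Option[String] =
    s.toDoubleOption.map(secs => f"${(secs / 60).toInt}%d:${(secs % 60).toInt}%02d")

  Seq(
    "duration"   -> raw.get("xmpDM:duration").flatMap(secondsToClock),
    "sampleRate" -> raw.get("xmpDM:audioSampleRate").map(_ + " Hz"),
    "channels"   -> raw.get("xmpDM:audioChannelType"),
    "codec"      -> raw.get("xmpDM:audioCompressor"),
    "frameRate"  -> raw.get("xmpDM:videoFrameRate").map(_ + " fps")
  ).collect { case (k, Some(v)) => k -> v }.toMap // drop keys Tika didn't emit
}
```

Keys absent from the raw metadata simply drop out, so the same helper works for audio-only and video files.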
3. Register the extractor
Add the new extractor to the extractors list in AppComponents.scala so it gets registered with MimeTypeMapper.
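Registration itself should be a one-line change. A toy sketch of the wiring, with all names assumed:

```scala
// Toy sketch of the AppComponents wiring (all names assumed): registering the
// new extractor is just one more entry in the list handed to MimeTypeMapper.
trait Extractor
class DocumentBodyExtractor extends Extractor
class TranscriptionExtractor extends Extractor
class MediaMetadataExtractor extends Extractor // the new extractor

val extractors: List[Extractor] = List(
  new DocumentBodyExtractor,
  new TranscriptionExtractor,
  new MediaMetadataExtractor // registered for the same audio/video MIME types
)
```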
4. Frontend changes
Likely none required for initial implementation. DocumentMetadata.js already renders both enriched and raw metadata for all blob types. If we later add media-specific enriched fields (e.g. duration), we'd update the display formatting there.
Files to modify
| File | Change |
|---|---|
| `backend/app/extraction/MediaMetadataExtractor.scala` | New file: extractor using Tika for audio/video metadata |
| `backend/app/extraction/MetadataEnrichment.scala` | Optional: add media-specific enrichment keys |
| `backend/app/AppComponents.scala` | Register `MediaMetadataExtractor` in the extractors list |
| `backend/test/extraction/MediaMetadataExtractorTest.scala` | New file: unit tests |
Verification
- Ingest test audio (MP3, WAV, AAC) and video (MP4, MOV, AVI) files
- Confirm Elasticsearch documents have populated `metadata.extractedMetadata` and `metadata.enrichedMetadata`
- Confirm the frontend viewer sidebar displays metadata
- Run existing extraction test suite to check for regressions
- Verify transcription still runs independently and completes normally
Open questions
- Should we add media-specific fields to `EnrichedMetadata`? Fields like `duration` or `codec` don't map to the existing case class. We could add them now, or ship with raw metadata only and iterate based on user feedback.
- Safety check for large files? `DocumentBodyExtractor` has a safety check for oversized `text/plain` files. Media files will produce empty text bodies so this shouldn't be an issue, but we should confirm Tika handles large media files without excessive memory use (it only reads headers/atoms, not the full stream).