
Extract and display metadata for video and audio files #604

@hoyla

Description

Because we farm out processing of audio and video to the transcription API, audio and video files are excluded from the Tika metadata extraction, so no metadata is ever shown for these file types. This is a bit of a problem for people building up an archive of recordings.

I was wondering whether the metadata extraction could be performed by Giant, notwithstanding the fact that the transcription and translation work is performed externally.

And of course, because I'm an idiot, I asked Copilot to write an RFC, which is below. Is it actually true under all circumstances that "Giant already has the source media"? If, for example, someone does a capture URL or "Send to Giant", doesn't the content go to the media download service, then to the transcription service, and then to the resolved placeholders in Giant? At that point, haven't we moved beyond the Tika stage?

RFC: Add Metadata Extraction for Video/Audio Files

Summary

When video and audio files are ingested into Giant, no file metadata is shown to users in the viewer sidebar. This RFC proposes adding metadata extraction for media files using Apache Tika, consistent with how all other document types are handled.

Problem

The DocumentBodyExtractor is responsible for extracting file metadata via Apache Tika and writing it to Elasticsearch. It has a hardcoded allowlist of MIME types that excludes all audio/* and video/* types. When a file is ingested, MimeTypeMapper.getExtractorsFor() matches it against registered extractors — for audio/video, only the transcription extractor matches.

The transcription extractors (TranscriptionExtractor / ExternalTranscriptionExtractor) call index.addDocumentTranscription(), which writes only transcript, vttTranscript, and transcriptExtracted fields. In contrast, DocumentBodyExtractor calls index.addDocumentDetails(), which writes extracted=true, metadata.extractedMetadata (raw Tika key-value pairs), and metadata.enrichedMetadata (title, author, dates, etc.). This method is never called for audio/video files.

The frontend (DocumentMetadata.js) renders the "File Metadata" section identically for all blob types with no MIME-type filtering. It shows an empty section because the data was never written to Elasticsearch.

Current flow for documents (e.g. PDF)

  1. File uploaded
  2. MimeTypeMapper assigns DocumentBodyExtractor
  3. Worker runs DocumentBodyExtractor
  4. Tika extracts text + metadata
  5. MetadataEnrichment.enrich() produces structured fields
  6. index.addDocumentDetails() writes extracted=true, metadata, enrichedMetadata to ES
  7. Frontend displays metadata in sidebar ✅

Current flow for audio/video

  1. File uploaded
  2. MimeTypeMapper assigns TranscriptionExtractor only
  3. Worker runs TranscriptionExtractor (or ExternalTranscriptionExtractor)
  4. Transcription service returns transcript text
  5. index.addDocumentTranscription() writes transcript fields only
  6. No metadata in ES — frontend sidebar is empty ❌

Proposal

Extract metadata from audio/video files within Giant using Apache Tika, which already supports media metadata (MP3 ID3 tags, MP4 atoms, WAV headers, FLAC, OGG, etc.) via parsers in Giant's existing dependency chain.

Why Giant-side rather than the external transcription service?

  1. Tika already supports it — media metadata parsers are already available in Giant's Tika dependency
  2. Giant already has the source media in object storage, uploaded before extraction begins
  3. The full metadata pipeline exists — Tika → MetadataEnrichment.enrich() → addDocumentDetails() → Elasticsearch → frontend. It just isn't wired up for media types
  4. Separation of concerns — the external transcription service's job is speech-to-text. Adding metadata extraction would create a new cross-service contract and couple unrelated concerns
  5. Consistency — all other document types have metadata extracted by Giant's internal workers

Approach: new MediaMetadataExtractor

Create a new extractor dedicated to audio/video metadata. This runs independently of and in parallel with the transcription extractor — metadata extraction is fast while transcription is slow.

The alternative — adding audio/video MIME types to the existing DocumentBodyExtractor — is simpler but semantically muddies "document body extraction" with media files that have no text body.

Implementation

1. Create MediaMetadataExtractor

New file: backend/app/extraction/MediaMetadataExtractor.scala

  • Accepts the same audio/video MIME types as the transcription extractors (listed in ExternalTranscriptionExtractor.mimeTypes)
  • Uses Tika to parse the input stream and extract metadata
  • Calls index.addDocumentDetails() with raw metadata, enriched metadata, and text = None (media files have no text body)
  • Scheduled at higher priority than the transcription extractor, so metadata appears in the sidebar quickly while the much slower transcription job is still running

Both this extractor and the transcription extractor would be assigned to the same MIME types via MimeTypeMapper, and both would run as independent extraction tasks.
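As a sketch of the Tika call at the core of the new extractor (the surrounding extractor trait and index wiring are Giant-specific and omitted here), something like the following would pull the raw key-value metadata out of a media stream. It assumes the standard tika-parsers module is on the classpath so the MP3/MP4/WAV parsers are available:

```scala
import java.io.InputStream

import org.apache.tika.metadata.Metadata
import org.apache.tika.parser.AutoDetectParser
import org.apache.tika.sax.BodyContentHandler

object MediaMetadataSketch {
  // Parse the stream with Tika and return the raw metadata as a key-value map.
  // Media parsers (ID3 tags, MP4 atoms, WAV headers) come from tika-parsers.
  def extractMediaMetadata(stream: InputStream): Map[String, String] = {
    val parser   = new AutoDetectParser()
    val metadata = new Metadata()
    // Media files have no text body, so the handler output is discarded;
    // -1 disables the write limit to avoid spurious write-limit exceptions.
    val handler  = new BodyContentHandler(-1)
    parser.parse(stream, handler, metadata)
    metadata.names().map(name => name -> metadata.get(name)).toMap
  }
}
```

The resulting map is what would be passed through MetadataEnrichment.enrich() and on to index.addDocumentDetails() with text = None, mirroring the DocumentBodyExtractor path.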

2. Extend MetadataEnrichment (optional, can iterate later)

Some existing enrichment keys already overlap with Tika's audio/video output:

  • title ← dc:title (works for MP3 ID3 title, MP4 title)
  • author ← dc:creator (works for MP3 artist)
  • createdAt ← dcterms:created, meta:creation-date (works for MP4 creation date)

Audio/video-specific Tika keys that would appear in raw metadata without further work:

  • xmpDM:duration — media duration
  • xmpDM:audioSampleRate — sample rate
  • xmpDM:audioChannelType — mono/stereo
  • xmpDM:audioCompressor — audio codec
  • xmpDM:videoFrameRate — video frame rate
  • channels, samplerate, bits

For a first pass, these would be visible in the "View Raw Metadata" toggle. Structured enriched fields for media-specific properties (e.g. duration) could be added in a follow-up.
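That follow-up could map the xmpDM keys into structured fields along these lines. The sketch is hypothetical — MediaEnrichment is not an existing Giant type — and note that the units of xmpDM:duration vary by Tika parser, so a real implementation would need to normalise per container format:

```scala
import scala.util.Try

// Hypothetical structured fields for media; not an existing Giant case class.
case class MediaEnrichment(
  duration: Option[Double],   // units vary by Tika parser; normalise before display
  sampleRateHz: Option[Int],
  audioCodec: Option[String]
)

object MediaEnrichmentSketch {
  // Pick media-specific values out of the raw Tika key-value map,
  // tolerating missing or non-numeric values.
  def enrichMedia(raw: Map[String, String]): MediaEnrichment =
    MediaEnrichment(
      duration     = raw.get("xmpDM:duration").flatMap(s => Try(s.toDouble).toOption),
      sampleRateHz = raw.get("xmpDM:audioSampleRate").flatMap(s => Try(s.toInt).toOption),
      audioCodec   = raw.get("xmpDM:audioCompressor")
    )
}
```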

3. Register the extractor

Add the new extractor to the extractors list in AppComponents.scala so it gets registered with MimeTypeMapper.
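Registration would then be a one-line addition to that list. The names and constructor arguments below are guesses at the existing wiring in AppComponents.scala, for illustration only:

```scala
// Hypothetical wiring sketch; the actual values and arguments will differ.
val extractors = List(
  documentBodyExtractor,
  transcriptionExtractor,
  externalTranscriptionExtractor,
  new MediaMetadataExtractor(index) // new: same audio/video MIME types as transcription
)
```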

4. Frontend changes

Likely none required for initial implementation. DocumentMetadata.js already renders both enriched and raw metadata for all blob types. If we later add media-specific enriched fields (e.g. duration), we'd update the display formatting there.

Files to modify

  • backend/app/extraction/MediaMetadataExtractor.scala — new file: extractor using Tika for audio/video metadata
  • backend/app/extraction/MetadataEnrichment.scala — optional: add media-specific enrichment keys
  • backend/app/AppComponents.scala — register MediaMetadataExtractor in the extractors list
  • backend/test/extraction/MediaMetadataExtractorTest.scala — new file: unit tests

Verification

  1. Ingest test audio (MP3, WAV, AAC) and video (MP4, MOV, AVI) files
  2. Confirm Elasticsearch documents have populated metadata.extractedMetadata and metadata.enrichedMetadata
  3. Confirm the frontend viewer sidebar displays metadata
  4. Run existing extraction test suite to check for regressions
  5. Verify transcription still runs independently and completes normally

Open questions

  1. Should we add media-specific fields to EnrichedMetadata? Fields like duration or codec don't map to the existing case class. We could add them now, or ship with raw metadata only and iterate based on user feedback.
  2. Safety check for large files? DocumentBodyExtractor has a safety check for oversized text/plain files. Media files will produce empty text bodies so this shouldn't be an issue, but we should confirm Tika handles large media files without excessive memory use (it only reads headers/atoms, not the full stream).
