Because we farm out processing of audio and video to the transcription API, audio and video files are excluded from Tika metadata extraction, so no metadata is ever shown for these file types. This is a bit of a problem for people building up an archive of recordings.
I was wondering whether the metadata extraction could be performed by Giant, notwithstanding the fact that the transcription and translation work is performed externally.
And of course because I'm an idiot I asked Copilot to write an RFC, which is below. Is it actually true under all circumstances that "Giant already has the source media"? If for example someone does a capture URL or "Send to Giant", doesn't stuff go to the media download service, then to the transcription service, then to the resolved placeholders in Giant? At that point haven't we moved beyond the Tika stage?
RFC: Add Metadata Extraction for Video/Audio Files
Summary
When video and audio files are ingested into Giant, no file metadata is shown to users in the viewer sidebar. This RFC proposes adding metadata extraction for media files using Apache Tika, consistent with how all other document types are handled.
Problem
The DocumentBodyExtractor is responsible for extracting file metadata via Apache Tika and writing it to Elasticsearch. It has a hardcoded allowlist of MIME types that excludes all audio/* and video/* types. When a file is ingested, MimeTypeMapper.getExtractorsFor() matches it against registered extractors — for audio/video, only the transcription extractor matches.
The transcription extractors (TranscriptionExtractor / ExternalTranscriptionExtractor) call index.addDocumentTranscription(), which writes only transcript, vttTranscript, and transcriptExtracted fields. In contrast, DocumentBodyExtractor calls index.addDocumentDetails(), which writes extracted=true, metadata.extractedMetadata (raw Tika key-value pairs), and metadata.enrichedMetadata (title, author, dates, etc.). This method is never called for audio/video files.
The frontend (DocumentMetadata.js) renders the "File Metadata" section identically for all blob types with no MIME-type filtering. It shows an empty section because the data was never written to Elasticsearch.
Current flow for documents (e.g. PDF)
- File uploaded
- `MimeTypeMapper` assigns `DocumentBodyExtractor`
- Worker runs `DocumentBodyExtractor`; Tika extracts text + metadata
- `MetadataEnrichment.enrich()` produces structured fields
- `index.addDocumentDetails()` writes `extracted=true`, `metadata`, `enrichedMetadata` to ES
- Frontend displays metadata in sidebar ✅
Current flow for audio/video
- File uploaded
- `MimeTypeMapper` assigns `TranscriptionExtractor` only
- Worker runs `TranscriptionExtractor` (or `ExternalTranscriptionExtractor`)
- Transcription service returns transcript text
- `index.addDocumentTranscription()` writes transcript fields only
- No metadata in ES; frontend sidebar is empty ❌
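The routing gap described above can be sketched in a few lines of Scala. This is a hypothetical model, not Giant's actual code: the `Extractor` and `MimeTypeMapper` shapes follow the RFC's description, and the MIME lists are illustrative subsets.

```scala
// Hypothetical model of MimeTypeMapper dispatch (names follow the RFC's
// description; MIME lists are illustrative subsets, not Giant's real code).
trait Extractor { def name: String; def mimeTypes: Set[String] }

object DocumentBodyExtractor extends Extractor {
  val name = "DocumentBodyExtractor"
  // Hardcoded allowlist: excludes all audio/* and video/* types
  val mimeTypes = Set("application/pdf", "text/plain", "application/msword")
}

object TranscriptionExtractor extends Extractor {
  val name = "TranscriptionExtractor"
  val mimeTypes = Set("audio/mpeg", "audio/wav", "video/mp4")
}

class MimeTypeMapper(extractors: Seq[Extractor]) {
  // Every extractor whose allowlist contains the MIME type gets a task
  def getExtractorsFor(mimeType: String): Seq[Extractor] =
    extractors.filter(_.mimeTypes.contains(mimeType))
}

val mapper = new MimeTypeMapper(Seq(DocumentBodyExtractor, TranscriptionExtractor))
// For an MP3, only TranscriptionExtractor matches, so addDocumentDetails()
// (and therefore all file metadata) is never written for it.
```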
Proposal
Extract metadata from audio/video files within Giant using Apache Tika, which already supports media metadata (MP3 ID3 tags, MP4 atoms, WAV headers, FLAC, OGG, etc.) via parsers in Giant's existing dependency chain.
Why Giant-side rather than the external transcription service?
- Tika already supports it: media metadata parsers are already available in Giant's Tika dependency
- Giant already has the source media in object storage, uploaded before extraction begins
- The full metadata pipeline exists: Tika → `MetadataEnrichment.enrich()` → `addDocumentDetails()` → Elasticsearch → frontend. It just isn't wired up for media types
- Separation of concerns: the external transcription service's job is speech-to-text. Adding metadata extraction there would create a new cross-service contract and couple unrelated concerns
- Consistency: all other document types have metadata extracted by Giant's internal workers
Approach: new MediaMetadataExtractor
Create a new extractor dedicated to audio/video metadata. This runs independently of and in parallel with the transcription extractor — metadata extraction is fast while transcription is slow.
The alternative — adding audio/video MIME types to the existing DocumentBodyExtractor — is simpler but semantically muddies "document body extraction" with media files that have no text body.
Implementation
1. Create MediaMetadataExtractor
New file: backend/app/extraction/MediaMetadataExtractor.scala
- Accepts the same audio/video MIME types as the transcription extractors (listed in `ExternalTranscriptionExtractor.mimeTypes`)
- Uses Tika to parse the input stream and extract metadata
- Calls `index.addDocumentDetails()` with raw metadata, enriched metadata, and `text = None` (media files have no text body)
- Runs at higher priority than the transcription extractor so it completes quickly
Both this extractor and the transcription extractor would be assigned to the same MIME types via MimeTypeMapper, and both would run as independent extraction tasks.
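A minimal sketch of what the new extractor could look like. The `Extractor` trait, `Index` interface, and priority scheme are stand-ins for Giant's real abstractions (assumed names, not taken from the codebase), and the enrichment step is deliberately simplified:

```scala
// Hypothetical sketch of the proposed extractor. Extractor, Index, and the
// priority scheme are stand-ins for Giant's real abstractions.
trait Extractor {
  def mimeTypes: Set[String]
  def priority: Int
}

// Stand-in for the index.addDocumentDetails() call described in the RFC.
trait Index {
  def addDocumentDetails(
    blobId: String,
    text: Option[String],
    extractedMetadata: Map[String, Seq[String]],
    enrichedMetadata: Map[String, String]
  ): Unit
}

class MediaMetadataExtractor(index: Index) extends Extractor {
  // Mirror the transcription extractors' MIME allowlist (illustrative subset)
  override val mimeTypes: Set[String] = Set(
    "audio/mpeg", "audio/wav", "audio/aac",
    "video/mp4", "video/quicktime", "video/x-msvideo"
  )

  // Runs ahead of/alongside transcription; metadata extraction is cheap
  override val priority: Int = 10

  // rawMetadata stands in for the key/value pairs Tika would produce
  def extract(blobId: String, rawMetadata: Map[String, Seq[String]]): Unit =
    index.addDocumentDetails(
      blobId,
      text = None, // media files have no text body
      extractedMetadata = rawMetadata,
      enrichedMetadata = enrich(rawMetadata)
    )

  // Deliberately simplified enrichment: map Tika keys onto structured fields
  private def enrich(raw: Map[String, Seq[String]]): Map[String, String] =
    Seq(
      "title"  -> raw.get("dc:title"),
      "author" -> raw.get("dc:creator")
    ).collect { case (k, Some(v +: _)) => k -> v }.toMap
}
```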
2. Extend MetadataEnrichment (optional, can iterate later)
Some existing enrichment keys already overlap with Tika's audio/video output:
| Enriched field | Existing keys that work for media |
|---|---|
| `title` | `dc:title` (works for MP3 ID3 title, MP4 title) |
| `author` | `dc:creator` (works for MP3 artist) |
| `createdAt` | `dcterms:created`, `meta:creation-date` (works for MP4 creation date) |
Audio/video-specific Tika keys that would appear in raw metadata without further work:
- `xmpDM:duration`: media duration
- `xmpDM:audioSampleRate`: sample rate
- `xmpDM:audioChannelType`: mono/stereo
- `xmpDM:audioCompressor`: audio codec
- `xmpDM:videoFrameRate`: video frame rate
- `channels`, `samplerate`, `bits`
For a first pass, these would be visible in the "View Raw Metadata" toggle. Structured enriched fields for media-specific properties (e.g. duration) could be added in a follow-up.
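As a sketch of what the follow-up enrichment might look like, here is a hypothetical `enrichMedia` helper that maps the `xmpDM:*` keys above onto flat enriched fields. The flat-map output shape is an assumption (the real `EnrichedMetadata` is a case class), and the unit of `xmpDM:duration` should be verified per Tika parser before relying on the formatting:

```scala
// Hypothetical enrichment helper for media-specific Tika keys. The flat Map
// output is an assumption; Giant's real EnrichedMetadata is a case class.
def enrichMedia(raw: Map[String, String]): Map[String, String] = {
  // xmpDM:duration is assumed to be in seconds here; the unit can vary by
  // Tika parser, so verify per format before relying on this.
  def secondsToClock(s: String): Option[String] =
    s.toDoubleOption.map(secs => f"${(secs / 60).toInt}%d:${(secs % 60).toInt}%02d")

  Seq(
    "duration"   -> raw.get("xmpDM:duration").flatMap(secondsToClock),
    "sampleRate" -> raw.get("xmpDM:audioSampleRate").map(_ + " Hz"),
    "channels"   -> raw.get("xmpDM:audioChannelType"),
    "codec"      -> raw.get("xmpDM:audioCompressor"),
    "frameRate"  -> raw.get("xmpDM:videoFrameRate").map(_ + " fps")
  ).collect { case (k, Some(v)) => k -> v }.toMap // drop keys Tika didn't emit
}
```

Keys absent from the raw metadata simply drop out, so the same helper works for audio-only and video files.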
3. Register the extractor
Add the new extractor to the extractors list in AppComponents.scala so it gets registered with MimeTypeMapper.
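Registration itself should be a one-line change. A toy sketch of the wiring, with all names assumed:

```scala
// Toy sketch of the AppComponents wiring (all names assumed): registering the
// new extractor is just one more entry in the list handed to MimeTypeMapper.
trait Extractor
class DocumentBodyExtractor extends Extractor
class TranscriptionExtractor extends Extractor
class MediaMetadataExtractor extends Extractor // the new extractor

val extractors: List[Extractor] = List(
  new DocumentBodyExtractor,
  new TranscriptionExtractor,
  new MediaMetadataExtractor // registered for the same audio/video MIME types
)
```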
4. Frontend changes
Likely none required for initial implementation. DocumentMetadata.js already renders both enriched and raw metadata for all blob types. If we later add media-specific enriched fields (e.g. duration), we'd update the display formatting there.
Files to modify
| File | Change |
|---|---|
| `backend/app/extraction/MediaMetadataExtractor.scala` | New file: extractor using Tika for audio/video metadata |
| `backend/app/extraction/MetadataEnrichment.scala` | Optional: add media-specific enrichment keys |
| `backend/app/AppComponents.scala` | Register `MediaMetadataExtractor` in the extractors list |
| `backend/test/extraction/MediaMetadataExtractorTest.scala` | New file: unit tests |
Verification
- Ingest test audio (MP3, WAV, AAC) and video (MP4, MOV, AVI) files
- Confirm Elasticsearch documents have populated `metadata.extractedMetadata` and `metadata.enrichedMetadata`
- Confirm the frontend viewer sidebar displays metadata
- Run existing extraction test suite to check for regressions
- Verify transcription still runs independently and completes normally
Open questions
- Should we add media-specific fields to `EnrichedMetadata`? Fields like `duration` or `codec` don't map to the existing case class. We could add them now, or ship with raw metadata only and iterate based on user feedback.
- Safety check for large files? `DocumentBodyExtractor` has a safety check for oversized `text/plain` files. Media files will produce empty text bodies so this shouldn't be an issue, but we should confirm Tika handles large media files without excessive memory use (it only reads headers/atoms, not the full stream).