This repository is a transport index dataset for Retreivr.
It stores mappings from canonical MusicBrainz recording MBIDs to known-good transport identifiers.
Canonical mapping model:
recording_mbid -> transport sources
Examples of transport identifiers:
- YouTube video IDs
- SoundCloud track IDs (future)
- Other supported transport IDs (future)
MusicBrainz remains the authoritative source of metadata. This repository does not replicate MusicBrainz entity metadata.
Current dataset namespace:
youtube/recording/<prefix>/<recording_mbid>.json
Where:
- prefix is the first two characters of recording_mbid
- filename stem equals recording_mbid
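As a rough illustration of the layout above, the path for a record can be derived from its MBID alone. This is a minimal sketch, not the repository's own tooling: the helper name `shard_path` is hypothetical, and it assumes MBIDs are stored in lowercase, hyphenated UUID form. The MBID in the usage line is made up.

```python
import uuid
from pathlib import PurePosixPath


def shard_path(recording_mbid: str) -> PurePosixPath:
    """Return the dataset path for a recording MBID (hypothetical helper)."""
    # Assumption: the dataset uses the lowercase, hyphenated UUID form.
    # uuid.UUID raises ValueError if the input is not a well-formed UUID.
    mbid = str(uuid.UUID(recording_mbid))
    # The shard directory is the first two characters of the MBID.
    prefix = mbid[:2]
    # The filename stem equals the MBID itself.
    return PurePosixPath("youtube", "recording", prefix, f"{mbid}.json")


# Usage with a made-up MBID:
print(shard_path("0a1b2c3d-0000-4000-8000-000000000000"))
```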
Each record contains:
- recording_mbid
- schema_version
- updated_at
- sources[] with transport candidate identifiers and verification fields
See schema/schema.json for the strict record contract.
This repository must not contain:
- scraped metadata dumps
- platform search result dumps
- thumbnails
- ranking heuristics
- MusicBrainz entity metadata copies
- media files or download URLs
Validation in .github/workflows/validate.yml enforces:
- JSON parse validity for dataset files
- JSON Schema compliance
- shard-path and filename/MBID consistency
- duplicate MBID prevention in namespace
- duplicate video_id prevention within a recording file
- preview of derived dataset stats during CI
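Two of the checks listed above (shard-path/filename consistency and duplicate video_id detection) could be approximated locally before opening a PR. The sketch below is not the CI implementation: `check_record` is a hypothetical helper, and it assumes each entry in sources[] carries its identifier under a `video_id` key. schema/schema.json remains the authoritative contract.

```python
from pathlib import PurePosixPath


def check_record(path: str, record: dict) -> list[str]:
    """Return a list of problems found; an empty list means these checks passed."""
    problems = []
    p = PurePosixPath(path)
    mbid = record.get("recording_mbid", "")

    # The filename stem must equal the MBID stored in the record.
    if p.stem != mbid:
        problems.append(f"filename stem {p.stem!r} != recording_mbid {mbid!r}")

    # The shard directory must be the first two characters of the MBID.
    if p.parent.name != mbid[:2]:
        problems.append(f"shard {p.parent.name!r} != prefix {mbid[:2]!r}")

    # No video_id may appear twice within one recording file.
    # Assumption: sources[] entries expose the identifier as "video_id".
    seen = set()
    for source in record.get("sources", []):
        video_id = source.get("video_id")
        if video_id is None:
            continue
        if video_id in seen:
            problems.append(f"duplicate video_id {video_id!r}")
        seen.add(video_id)
    return problems
```

Returning a list of problems rather than raising on the first one lets a publisher see every failure for a file in one pass.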
Derived stats are maintained automatically on main by .github/workflows/update_stats.yml.
Publish PRs are validated against the dataset contract itself; stats/dataset.json is regenerated after merges instead of blocking automated publisher PRs.
Trusted PR automation in .github/workflows/trusted_pr_automerge.yml enables auto-merge for same-repo pull requests opened by publishers listed in .github/trusted_publishers.txt, once required checks pass.
Additional publish policy lives in .github/publish_policy.json, including the minimum source confidence floor enforced by CI.
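To illustrate the idea of a confidence floor, the sketch below filters a record's sources against a policy value. The key names are assumptions made for illustration only: `min_source_confidence` (in the policy) and `confidence` (in each source) may not match the real field names in .github/publish_policy.json or schema/schema.json, so check those files before relying on this.

```python
import json


def sources_below_floor(record: dict, policy: dict) -> list[dict]:
    """Return the sources whose confidence is under the policy floor."""
    # Hypothetical key name; the real policy file may use a different one.
    floor = policy["min_source_confidence"]
    return [
        source
        for source in record.get("sources", [])
        # Hypothetical field name; a missing value is treated as 0.0 here.
        if source.get("confidence", 0.0) < floor
    ]


# Usage with made-up values:
policy = json.loads('{"min_source_confidence": 0.8}')
record = {"sources": [{"video_id": "abc", "confidence": 0.95},
                      {"video_id": "xyz", "confidence": 0.5}]}
rejected = sources_below_floor(record, policy)
```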
Trusted publisher status controls who can use the fully automated same-repo PR and auto-merge flow.
If you want to become a trusted publisher:
- Open a GitHub Issue in this repository.
- Title it Trusted Publisher Request: <your-github-username>.
- Include:
  - your GitHub username
  - how you run Retreivr
  - whether you are publishing from a personal server, shared instance, or test node
  - links to any prior good publish PRs if you have them
  - anything relevant about how you validate your node outputs
- Wait for maintainers to review and, if approved, add your GitHub username to .github/trusted_publishers.txt.
Until then:
- you can still run Retreivr and generate publish proposals
- maintainers may still review PRs manually
- auto-merge is reserved for approved trusted publishers
The dataset accelerates transport resolution for Retreivr clients while keeping output deterministic, lightweight, and Git-native.