An interactive HTML corpus browser for the MITRA-parallel dataset, covering sentence-aligned parallel texts in Sanskrit, Tibetan, and Chinese.
- Foldable navigation tree: language pair → collection → category → text
- All 6 alignment directions: SA↔BO, SA↔ZH, BO↔ZH
- Inverted views: each TSV file is accessible from both directions
- Sentence-pair display with language-appropriate fonts
- Search across text names
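The inverted views can be pictured as a column swap over one and the same file. The following is a minimal sketch in Python, assuming each TSV row holds two tab-separated fields (source sentence, then target sentence). The real column layout of the MITRA-parallel files and the site's own JavaScript may differ, and `read_pairs` and `invert_pairs` are hypothetical names used only for illustration.

```python
import csv
import io
from typing import Iterable, List, Tuple

Pair = Tuple[str, str]


def read_pairs(tsv_text: str) -> List[Pair]:
    """Parse two-column TSV text into (source, target) sentence pairs.

    Assumption: two tab-separated columns per row. Rows with fewer than
    two fields (including blank lines) are skipped.
    """
    reader = csv.reader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    return [(row[0], row[1]) for row in reader if len(row) >= 2]


def invert_pairs(pairs: Iterable[Pair]) -> List[Pair]:
    """Swap source and target so the same file reads in the other direction."""
    return [(target, source) for source, target in pairs]
```

Under this assumption a single SA-BO file would serve both the SA→BO and the BO→SA view without storing the data twice.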
Run the build script from this directory:
python3 build.py
This reads metadata from metadata/ and TSV filenames from ../mitra-parallel/mitra-parallel/tsv/, and generates docs/nav_data.js. It requires the mitra-parallel repo to be checked out as a sibling:
code/
  mitra-parallel/          ← https://github.com/dharmamitra/mitra-parallel
  mitra-parallel-website/  ← this repo
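To make the build step more concrete, here is a rough sketch of how a navigation index could be assembled from TSV filenames and metadata. It is not the contents of build.py. It assumes the metadata can be reduced to a mapping from file stem to display name (the actual schema of the `*_files.json` files is not described here) and that the output is a single JavaScript assignment. The function names and the `NAV_DATA` variable are hypothetical.

```python
import json
from pathlib import Path
from typing import Dict, List


def collect_tsv_names(tsv_dir: Path) -> List[str]:
    """Return the sorted names of all .tsv files directly inside tsv_dir."""
    return sorted(p.name for p in tsv_dir.glob("*.tsv"))


def build_nav_entries(
    tsv_names: List[str], display_names: Dict[str, str]
) -> List[dict]:
    """Attach a display name to each file, falling back to the file name.

    Assumption: display_names maps a file stem (name without ".tsv") to a
    human-readable title.
    """
    entries = []
    for name in tsv_names:
        stem = name[: -len(".tsv")]
        entries.append(
            {
                "file": name,
                "title": display_names.get(stem, name),
                "matched": stem in display_names,
            }
        )
    return entries


def write_nav_js(entries: List[dict], out_path: Path) -> None:
    """Write the entries as a JavaScript assignment (variable name is hypothetical)."""
    payload = json.dumps(entries, ensure_ascii=False)
    out_path.write_text(f"const NAV_DATA = {payload};\n", encoding="utf-8")
```

Keeping a `matched` flag per entry is one way a builder could report how many files received a display name from the metadata.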
Create a symlink so the local server can serve TSV files:
ln -s /path/to/mitra-parallel/mitra-parallel/tsv docs/tsv
Then start a local HTTP server (the page must be served via HTTP, not opened as a file):
cd docs && python3 -m http.server 8765
# Open http://localhost:8765/
The site auto-detects localhost and uses ./tsv/ as the data URL.
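Before starting the server it can be worth confirming that the docs/tsv link actually resolves, since a dangling symlink only shows up later as failed requests in the browser. The check below is a small optional sketch and not part of this repo. `count_served_tsv` is a hypothetical helper name.

```python
from pathlib import Path


def count_served_tsv(docs_dir: Path) -> int:
    """Count .tsv files reachable through docs_dir/tsv.

    Raises FileNotFoundError when docs_dir/tsv does not resolve to a
    directory, which is how a missing or dangling symlink appears.
    """
    tsv_dir = docs_dir / "tsv"
    # is_dir() follows symlinks and returns False if the target is missing.
    if not tsv_dir.is_dir():
        raise FileNotFoundError(f"{tsv_dir} is missing or is a broken symlink")
    return sum(1 for _ in tsv_dir.glob("*.tsv"))
```

For example, `count_served_tsv(Path("docs"))` run from the repo root returns the number of TSV files the local server would be able to serve.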
- Push this repo to GitHub
- In repo Settings → Pages, set source to Deploy from branch, branch main, folder /docs
- The site will auto-use https://raw.githubusercontent.com/dharmamitra/mitra-parallel/main/mitra-parallel/tsv/ as the TSV source
You can also override the TSV URL at runtime via the yellow bar at the top of the page.
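The choice of TSV source described here boils down to a small decision rule. The sketch below expresses it in Python for illustration only; the site itself implements this in JavaScript inside index.html. The precedence (a runtime override wins over auto-detection), the set of hostnames treated as local, and the name `resolve_tsv_base` are all assumptions.

```python
from typing import Optional

LOCAL_TSV_BASE = "./tsv/"
REMOTE_TSV_BASE = (
    "https://raw.githubusercontent.com/dharmamitra/mitra-parallel/"
    "main/mitra-parallel/tsv/"
)
# Assumption: which hostnames count as a local development server.
LOCAL_HOSTS = {"localhost", "127.0.0.1"}


def resolve_tsv_base(hostname: str, override: Optional[str] = None) -> str:
    """Pick the base URL for TSV files: override first, then local, then remote."""
    if override and override.strip():
        base = override.strip()
        # Normalize so file names can be appended directly.
        return base if base.endswith("/") else base + "/"
    if hostname in LOCAL_HOSTS:
        return LOCAL_TSV_BASE
    return REMOTE_TSV_BASE
```

With this rule, `resolve_tsv_base("localhost")` yields `./tsv/`, while any other hostname without an override falls back to the raw GitHub URL.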
docs/
  index.html    — main website (self-contained HTML/CSS/JS)
  nav_data.js   — generated navigation index (~4MB, ~22k entries)
  tsv/          — symlink or copy of mitra-parallel/mitra-parallel/tsv/
metadata/
  SA_files.json, SA_category-names.json, SA_collection-names.json
  BO_files.json, BO_category-names.json, BO_collection-names.json
  ZH_files.json, ZH_category-names.json, ZH_collection-names.json
build.py        — navigation builder
The alignment data comes from the MITRA-parallel corpus:
- 12,481 TSV files covering SA-BO, SA-ZH, and BO-ZH pairs
- ~90% of files matched to metadata with display names
- Released under CC BY-SA 4.0
Citation: Sebastian Nehrdich & Kurt Keutzer, MITRA: A Large-Scale Parallel Corpus and Multilingual Pretrained Language Model for Machine Translation and Semantic Retrieval, 2026.