Dharmamitra Parallel Aligned Text Corpus — Website

An interactive HTML corpus browser for the MITRA-parallel dataset, covering sentence-aligned parallel texts in Sanskrit, Tibetan, and Chinese.

Features

Foldable navigation tree: language pair → collection → category → text
All 6 alignment directions across the 3 language pairs: SA↔BO, SA↔ZH, BO↔ZH
Inverted views: each TSV file is accessible from both directions
Sentence-pair display with language-appropriate fonts
Search across text names
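
The inverted views can be pictured as a column swap at load time. The following is a minimal Python sketch, assuming each TSV row holds one source sentence and one target sentence separated by a tab. The column layout of the MITRA-parallel files is not specified in this README, and read_pairs is a hypothetical helper, not part of this repo.

```python
import csv
import io


def read_pairs(tsv_text, inverted=False):
    """Parse tab-separated sentence pairs; swap the columns for the inverted view.

    Assumption: two columns per row (source, target). Rows with fewer than
    two columns, including blank lines, are skipped.
    """
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    pairs = []
    for row in reader:
        if len(row) < 2:
            continue
        src, tgt = row[0], row[1]
        pairs.append((tgt, src) if inverted else (src, tgt))
    return pairs


# Toy data for illustration only, not taken from the corpus.
sample = "dharma\tchos\nbuddha\tsangs rgyas\n"
forward = read_pairs(sample)
backward = read_pairs(sample, inverted=True)
```

With this approach a single file can back both directions of a language pair, so no second copy of the data is needed.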

Setup

1. Build the navigation data

Run the build script from this directory:

python3 build.py

The script reads metadata from metadata/ and TSV filenames from ../mitra-parallel/mitra-parallel/tsv/, then generates docs/nav_data.js. It requires the mitra-parallel repo to be checked out as a sibling directory:

code/
  mitra-parallel/          ← https://github.com/dharmamitra/mitra-parallel
  mitra-parallel-website/  ← this repo
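
Before running the build, the expected layout can be verified with a short check. This is a sketch, not code from build.py: find_tsv_dir is a hypothetical helper, and only the relative path ../mitra-parallel/mitra-parallel/tsv/ is taken from the description above.

```python
import tempfile
from pathlib import Path


def find_tsv_dir(repo_dir):
    """Return the sibling checkout's TSV directory, or raise if the layout is wrong."""
    tsv_dir = Path(repo_dir).resolve().parent / "mitra-parallel" / "mitra-parallel" / "tsv"
    if not tsv_dir.is_dir():
        raise FileNotFoundError(f"expected TSV directory at {tsv_dir}")
    return tsv_dir


# Demonstration in a throwaway directory that mimics the layout shown above.
with tempfile.TemporaryDirectory() as tmp:
    code = Path(tmp).resolve() / "code"
    (code / "mitra-parallel" / "mitra-parallel" / "tsv").mkdir(parents=True)
    site = code / "mitra-parallel-website"
    site.mkdir()

    found = find_tsv_dir(site)
    layout_ok = found == code / "mitra-parallel" / "mitra-parallel" / "tsv"

    try:
        find_tsv_dir(code)  # wrong starting point: no sibling checkout next to code/
        missing_detected = False
    except FileNotFoundError:
        missing_detected = True
```

Failing early with the expected path in the message makes a misplaced checkout easier to diagnose than a build that silently produces an empty index.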

2. Run locally

Create a symlink so the local server can serve TSV files:

ln -s /path/to/mitra-parallel/mitra-parallel/tsv docs/tsv

Then start a local HTTP server (the page must be served via HTTP, not opened as a file):

cd docs && python3 -m http.server 8765
# Open http://localhost:8765/

The site auto-detects localhost and uses ./tsv/ as the data URL.

3. Deploy to GitHub Pages

Push this repo to GitHub
In the repo's Settings → Pages, set the source to "Deploy from a branch", then select branch main and folder /docs
Once deployed, the site automatically uses https://raw.githubusercontent.com/dharmamitra/mitra-parallel/main/mitra-parallel/tsv/ as the TSV source

You can also override the TSV URL at runtime via the yellow bar at the top of the page.
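
The data-URL selection described in steps 2 and 3 can be summarized in a few lines. The site itself does this in browser JavaScript inside index.html; the Python below only illustrates the logic. Which hostnames count as local, the precedence of the override, and the trailing-slash normalization are all assumptions, and tsv_base_url is a hypothetical name.

```python
REMOTE_TSV_URL = (
    "https://raw.githubusercontent.com/dharmamitra/mitra-parallel/main/mitra-parallel/tsv/"
)
LOCAL_TSV_URL = "./tsv/"
LOCAL_HOSTS = {"localhost", "127.0.0.1"}  # assumption: hosts treated as local


def tsv_base_url(hostname, override=None):
    """Pick the base URL for TSV files.

    A runtime override wins (assumed precedence); otherwise local hosts use
    the symlinked ./tsv/ directory and everything else uses the raw GitHub URL.
    """
    if override:
        # Assumption: normalize to a trailing slash so filenames can be appended.
        return override if override.endswith("/") else override + "/"
    return LOCAL_TSV_URL if hostname in LOCAL_HOSTS else REMOTE_TSV_URL


local_url = tsv_base_url("localhost")
deployed_url = tsv_base_url("example.github.io")  # placeholder hostname
custom_url = tsv_base_url("localhost", override="https://example.org/data")
```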

Files

docs/
  index.html     — main website (self-contained HTML/CSS/JS)
  nav_data.js    — generated navigation index (~4 MB, ~22k entries)
  tsv/           — symlink or copy of mitra-parallel/mitra-parallel/tsv/
metadata/
  SA_files.json, SA_category-names.json, SA_collection-names.json
  BO_files.json, BO_category-names.json, BO_collection-names.json
  ZH_files.json, ZH_category-names.json, ZH_collection-names.json
build.py         — navigation builder

Data

The alignment data comes from the MITRA-parallel corpus:

12,481 TSV files covering SA-BO, SA-ZH, and BO-ZH pairs
~90% of files are matched to metadata entries that supply display names
Released under CC BY-SA 4.0

Citation: Sebastian Nehrdich & Kurt Keutzer, MITRA: A Large-Scale Parallel Corpus and Multilingual Pretrained Language Model for Machine Translation and Semantic Retrieval, 2026.
