
dharmamitra/mitra-parallel-html


Dharmamitra Parallel Aligned Text Corpus — Website

An interactive HTML corpus browser for the MITRA-parallel dataset, covering sentence-aligned parallel texts in Sanskrit, Tibetan, and Chinese.

Features

  • Foldable navigation tree: language pair → collection → category → text
  • All 6 alignment directions across the three language pairs: SA↔BO, SA↔ZH, BO↔ZH
  • Inverted views: each TSV file is accessible from both directions
  • Sentence-pair display with language-appropriate fonts
  • Search across text names
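
The inverted views listed above amount to reading the same TSV file with its columns swapped. A minimal sketch of that idea follows, assuming each TSV line holds one sentence pair as two tab-separated columns with no header row (the actual file layout is not specified here). The function name `read_pairs` is hypothetical and is not taken from the site's code.

```python
def read_pairs(tsv_text, inverted=False):
    """Parse TSV text into (left, right) sentence pairs.

    Assumption: two tab-separated columns per line, no header row.
    With inverted=True the columns are swapped, so one file can be
    shown from either direction.
    """
    pairs = []
    for line in tsv_text.splitlines():
        cols = line.split("\t")
        if len(cols) < 2:
            continue  # skip blank or malformed lines
        left, right = cols[0], cols[1]
        pairs.append((right, left) if inverted else (left, right))
    return pairs


if __name__ == "__main__":
    sample = "dharma\tchos\nbuddha\tsangs rgyas\n"
    for pair in read_pairs(sample, inverted=True):
        print(pair)
```

Swapping at read time means no second copy of the data needs to be stored for the reverse direction.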

Setup

1. Build the navigation data

Run the build script from this directory:

python3 build.py

This reads metadata from metadata/ and TSV filenames from ../mitra-parallel/mitra-parallel/tsv/, then generates docs/nav_data.js. It requires the mitra-parallel repo to be checked out as a sibling directory:

code/
  mitra-parallel/          ← https://github.com/dharmamitra/mitra-parallel
  mitra-parallel-website/  ← this repo
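
Before running the build, it can help to confirm that the sibling checkout is where the script expects it. The sketch below is not part of build.py. The helper name `find_tsv_dir` is hypothetical, and the only thing it assumes is the relative path ../mitra-parallel/mitra-parallel/tsv/ described above.

```python
from pathlib import Path


def find_tsv_dir(repo_dir):
    """Return the sibling corpus TSV directory, or None if it is missing.

    repo_dir is the directory of this repo; the corpus is expected at
    ../mitra-parallel/mitra-parallel/tsv/ relative to it.
    """
    parent = Path(repo_dir).resolve().parent
    tsv_dir = parent / "mitra-parallel" / "mitra-parallel" / "tsv"
    return tsv_dir if tsv_dir.is_dir() else None


if __name__ == "__main__":
    found = find_tsv_dir(".")
    if found is None:
        print("TSV directory not found; check out mitra-parallel as a sibling.")
    else:
        print(f"Found TSV directory: {found}")
```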

2. Run locally

Create a symlink so the local server can serve TSV files:

ln -s /path/to/mitra-parallel/mitra-parallel/tsv docs/tsv

Then start a local HTTP server (the page must be served via HTTP, not opened as a file):

cd docs && python3 -m http.server 8765
# Open http://localhost:8765/

The site auto-detects localhost and uses ./tsv/ as the data URL.

3. Deploy to GitHub Pages

  1. Push this repo to GitHub
  2. In repo Settings → Pages, set source to Deploy from branch, branch main, folder /docs
  3. The site will automatically use https://raw.githubusercontent.com/dharmamitra/mitra-parallel/main/mitra-parallel/tsv/ as the TSV source

You can also override the TSV URL at runtime via the yellow bar at the top of the page.
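
The choice of TSV source described in steps 2 and 3 can be summarised as a small decision function. The real check runs in the page's JavaScript in index.html; the Python sketch below only mirrors the described behaviour. The function name `tsv_base_url`, the `override` parameter, and the exact-match test on the hostname "localhost" are assumptions made for illustration.

```python
REMOTE_TSV_URL = (
    "https://raw.githubusercontent.com/dharmamitra/"
    "mitra-parallel/main/mitra-parallel/tsv/"
)


def tsv_base_url(hostname, override=None):
    """Pick the base URL for TSV files.

    Order of precedence (assumed): a runtime override, then the local
    ./tsv/ directory when served from localhost, then the raw GitHub URL.
    """
    if override:
        # Normalise so file names can be appended directly.
        return override if override.endswith("/") else override + "/"
    if hostname == "localhost":
        return "./tsv/"
    return REMOTE_TSV_URL


if __name__ == "__main__":
    print(tsv_base_url("localhost"))
    print(tsv_base_url("dharmamitra.github.io"))
```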

Files

docs/
  index.html     — main website (self-contained HTML/CSS/JS)
  nav_data.js    — generated navigation index (~4MB, ~22k entries)
  tsv/           — symlink or copy of mitra-parallel/mitra-parallel/tsv/
metadata/
  SA_files.json, SA_category-names.json, SA_collection-names.json
  BO_files.json, BO_category-names.json, BO_collection-names.json
  ZH_files.json, ZH_category-names.json, ZH_collection-names.json
build.py         — navigation builder

Data

The alignment data comes from the MITRA-parallel corpus:

  • 12,481 TSV files covering SA-BO, SA-ZH, and BO-ZH pairs
  • ~90% of files matched to metadata with display names
  • Released under CC BY-SA 4.0

Citation: Sebastian Nehrdich & Kurt Keutzer, MITRA: A Large-Scale Parallel Corpus and Multilingual Pretrained Language Model for Machine Translation and Semantic Retrieval, 2026.
