This repository provides pipeline scripts and raw/intermediate files for building the SynCED-EnDe dataset.
The clean, benchmark-ready release (train silver + eval gold) is on Hugging Face.
This repo is for reproducibility and transparency.
scripts/                                  # Core dataset creation pipeline
    inject.py                             # inject controlled errors
    reinject.py                           # retry injection for missing cases
    skim.py                               # preview dataset (sanity check)
    block_evalrows.py                     # filter out scraped evaluation rows from train set
    final.py                              # assemble final dataset TSVs
    data_check.py                         # sanity checks (counts, distributions, label balance)
    data_scrape.py                        # scrape GOV.UK + Stack Exchange
data/
    train/
        synced_ende_train_silver.tsv      # final train split
        raw/                              # raw/intermediate training data
            ced_final_injected_multi_class.tsv
            error_injected_rows_with_correctMT.tsv
            rows_for_error_injection.tsv
            sources_2024_2025.tsv
    eval/
        synced_ende_eval_gold.tsv         # final eval split
        eval_judged_quantified_annotated.tsv
        raw/                              # raw/intermediate eval data
            ced_final_injected_multi_class.tsv
            error_injected_rows_with_correctMT.tsv
            rows_for_error_injection.tsv
            sources_2024_2025.tsv
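The kind of label-balance check that data_check.py is described as performing can be sketched as follows. This is a minimal illustration, not the repository's implementation: the column name `label` and the label values `ERR` / `NOT` are assumptions, and the real TSV headers should be checked before reuse.

```python
import csv
import io
from collections import Counter


def label_counts(tsv_text: str, label_column: str) -> Counter:
    """Count how often each value occurs in `label_column` of a TSV string."""
    # QUOTE_NONE keeps quotation marks inside sentences as literal characters.
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    if reader.fieldnames is None or label_column not in reader.fieldnames:
        raise KeyError(f"column {label_column!r} not found in header")
    return Counter(row[label_column] for row in reader)


# Tiny in-memory sample; the header and label values are hypothetical.
sample = "src\tmt\tlabel\nHello\tHallo\tNOT\nYes\tNein\tERR\nThanks\tDanke\tNOT\n"

counts = label_counts(sample, "label")
total = sum(counts.values())
for label, n in sorted(counts.items()):
    # Print label, absolute count, and share of all rows.
    print(f"{label}\t{n}\t{n / total:.2%}")
```

To run it on a real split, read the file as UTF-8 text and pass its contents together with the actual label column name.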
Run the scripts in order:
- Scrape data
  python scripts/data_scrape.py
- Inject errors (first pass)
  python scripts/inject.py
- Re-inject (fix empty rows, optional second pass)
  python scripts/reinject.py
- Preview dataset (skim head rows)
  python scripts/skim.py
- Block evaluation rows (avoid leakage)
  python scripts/block_evalrows.py
- Judge & quantify translations (LLM-based)
  python scripts/judge_quantify.py
- Assemble final dataset
  python scripts/final.py
This produces:
- synced_ende_train_silver.tsv
- synced_ende_eval_gold.tsv
- judged_quantified_annotated.tsv
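The run order above could also be wrapped in a small driver so that a failing step stops the pipeline. The sketch below is not part of the repository; it assumes each script runs without arguments from the repository root, as the commands above suggest, and the `--run` flag is a hypothetical switch of this driver only.

```python
import subprocess
import sys

# Run order as listed in this README; paths are relative to the repo root.
PIPELINE = [
    "scripts/data_scrape.py",
    "scripts/inject.py",
    "scripts/reinject.py",   # optional second pass
    "scripts/skim.py",
    "scripts/block_evalrows.py",
    "scripts/judge_quantify.py",
    "scripts/final.py",
]


def build_commands(scripts, python=sys.executable):
    """Return one argv list per script, preserving the given order."""
    return [[python, script] for script in scripts]


def run_pipeline(scripts=PIPELINE, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in build_commands(scripts):
        print("$", " ".join(cmd))
        if not dry_run:
            # check=True raises CalledProcessError on a non-zero exit code,
            # so later steps never run on top of a failed one.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Dry run by default; pass --run (hypothetical flag) to execute the steps.
    run_pipeline(dry_run="--run" not in sys.argv)
```

The dry-run default is deliberate: the scraping and LLM-based judging steps are likely slow or costly, so printing the plan first is the safer behavior.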
Available in data/:
- ced_final_injected_multi_class.tsv
- error_injected_rows_with_correctMT.tsv
- rows_for_error_injection.tsv
- sources_2024_2025.tsv
These raw and intermediate files are not for benchmarking; use them only to reproduce pipeline steps.
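As an illustration of reproducing one pipeline step, the leakage-blocking stage (block_evalrows.py) might look roughly like the sketch below. This is an assumption-laden example rather than the repository's code: the key column `src`, the helper names, and the normalization rule (lowercasing and collapsing whitespace) are all hypothetical.

```python
import csv
import io


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants still match."""
    return " ".join(text.lower().split())


def read_tsv(text: str):
    """Parse a TSV string with a header row into a list of dicts."""
    return list(
        csv.DictReader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    )


def block_eval_rows(train_rows, eval_rows, key="src"):
    """Drop every train row whose normalized `key` value also occurs in eval."""
    eval_keys = {normalize(row[key]) for row in eval_rows}
    return [row for row in train_rows if normalize(row[key]) not in eval_keys]


# Hypothetical in-memory data; real files would be read from data/.
train = read_tsv("src\tmt\nGood morning\tGuten Morgen\nThank you\tDanke\n")
evalset = read_tsv("src\tmt\nthank  you\tDanke\n")

kept = block_eval_rows(train, evalset)
print([row["src"] for row in kept])
```

Matching on a normalized source sentence is only one possible choice; matching on source URL or on source and translation together would be stricter or looser depending on how the rows were scraped.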
This dataset and accompanying scripts are licensed under:
Creative Commons Attribution 4.0 (CC BY 4.0)
Attribution required.
@article{chopra2025syncedende,
title={SynCED-EnDe 2025: A Synthetic and Curated English - German Dataset for Critical Error Detection in Machine Translation},
author={Chopra, M. and others},
journal={Hugging Face Papers},
year={2025},
url={https://huggingface.co/papers/2510.05144}
}