SynCED-EnDe — Pipeline, Raw Data & Scripts (GitHub Companion)

This repository provides pipeline scripts and raw/intermediate files for building the SynCED-EnDe dataset.

The clean, benchmark-ready release (train silver + eval gold) is on Hugging Face.
This repo is for reproducibility and transparency.

📂 Repository Structure

scripts/                    # Core dataset creation pipeline
  inject.py                 # inject controlled errors
  reinject.py               # retry injection for missing cases
  skim.py                   # preview dataset (sanity check)
  block_evalrows.py         # filter out scraped evaluation rows from train set 
  final.py                  # assemble final dataset TSVs
  data_check.py             # sanity checks (counts, distributions, label balance)
  data_scrape.py            # scrape GOV.UK + Stack Exchange

data/
  train/
    synced_ende_train_silver.tsv     # final train split
    raw/                             # raw/intermediate training data
      ced_final_injected_multi_class.tsv
      error_injected_rows_with_correctMT.tsv
      rows_for_error_injection.tsv
      sources_2024_2025.tsv

  eval/
    synced_ende_eval_gold.tsv        # final eval split
    eval_judged_quantified_annotated.tsv
    raw/                             # raw/intermediate eval data
      ced_final_injected_multi_class.tsv
      error_injected_rows_with_correctMT.tsv
      rows_for_error_injection.tsv
      sources_2024_2025.tsv

🛠️ Pipeline Usage

Run the scripts in order:

1. Scrape data
```
python scripts/data_scrape.py
```
2. Inject errors (first pass)
```
python scripts/inject.py
```
3. Re-inject (optional second pass to fix empty rows)
```
python scripts/reinject.py
```
4. Preview dataset (skim head rows)
```
python scripts/skim.py
```
5. Block evaluation rows (avoid leakage)
```
python scripts/block_evalrows.py
```
6. Judge & quantify translations (LLM-based)
```
python scripts/judge_quantify.py
```
7. Assemble final dataset
```
python scripts/final.py
```
This produces:
- synced_ende_train_silver.tsv
- synced_ende_eval_gold.tsv
- eval_judged_quantified_annotated.tsv
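
The steps above can also be chained from a single driver. The following is a minimal sketch, not part of the repository: it assumes the scripts take no command-line arguments (none are documented here) and that they are run from the repository root. The names `PIPELINE_STEPS`, `build_commands`, and `run_pipeline` are hypothetical. Note that `judge_quantify.py` appears in the usage steps but not in the structure listing, so check that it exists before running.

```python
import subprocess
import sys
from pathlib import Path

# Script order as listed in this README. Assumption: no script needs arguments.
PIPELINE_STEPS = [
    "data_scrape.py",
    "inject.py",
    "reinject.py",      # optional second pass
    "skim.py",
    "block_evalrows.py",
    "judge_quantify.py",
    "final.py",
]


def build_commands(scripts_dir="scripts", python=sys.executable):
    """Return one argv list per pipeline step, in order."""
    return [[python, str(Path(scripts_dir) / name)] for name in PIPELINE_STEPS]


def run_pipeline(scripts_dir="scripts", dry_run=True):
    """Print each command; execute it unless dry_run is set."""
    for cmd in build_commands(scripts_dir):
        print("->", " ".join(cmd))
        if not dry_run:
            # check=True stops the pipeline at the first failing step.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Dry run by default; pass --run to actually execute the scripts.
    run_pipeline(dry_run="--run" not in sys.argv)
```

Defaulting to a dry run is a deliberate choice here: the scrape and LLM-judging steps may be slow or costly, so the sketch only prints the commands unless `--run` is given.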

📊 Raw / Intermediate Data

Available in data/train/raw/ and data/eval/raw/:

ced_final_injected_multi_class.tsv
error_injected_rows_with_correctMT.tsv
rows_for_error_injection.tsv
sources_2024_2025.tsv

⚠️ Note: These files may contain earlier/unverified labels.
They are not for benchmarking — use only for reproducing pipeline steps.
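
To check an intermediate file before reusing it in a pipeline step, a quick summary of its columns and row count can help. This is a minimal sketch using only the standard library. The column schema of these TSVs is not documented here, so `label_column` is a hypothetical parameter you would set to whatever label column the file actually has. The helper name `summarize_tsv` is also an assumption.

```python
import csv
from collections import Counter


def summarize_tsv(path, label_column=None):
    """Return column names, row count, and (optionally) label counts for a TSV."""
    with open(path, newline="", encoding="utf-8") as handle:
        # QUOTE_NONE treats quote characters as plain text, which is common
        # for TSV exports; adjust if the files use quoting (assumption).
        reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        columns = list(reader.fieldnames or [])
        row_count = 0
        labels = Counter()
        for row in reader:
            row_count += 1
            if label_column is not None:
                # A missing column is counted under the key None.
                labels[row.get(label_column)] += 1
    return {"columns": columns, "rows": row_count, "labels": dict(labels)}
```

For example, `summarize_tsv("data/train/raw/rows_for_error_injection.tsv")` would report the header and number of rows without assuming any particular schema.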

📜 License

This dataset and accompanying scripts are licensed under:
Creative Commons Attribution 4.0 International (CC BY 4.0)
Attribution required.

📖 Citation

@article{chopra2025syncedende,
  title={SynCED-EnDe 2025: A Synthetic and Curated English - German Dataset for Critical Error Detection in Machine Translation},
  author={Chopra, M. and others},
  journal={Hugging Face Papers},
  year={2025},
  url={https://huggingface.co/papers/2510.05144}
}
