BIRDTurk: Adaptation of the BIRD Text-to-SQL Dataset to Turkish

Overview

BIRDTurk is a Turkish adaptation of the BIRD (BIg Bench for LaRge-scale Database Grounded Text-to-SQL Evaluation) Text-to-SQL benchmark.
The goal of this project is to evaluate and analyze Text-to-SQL model performance in Turkish, a morphologically rich and low-resource language, while preserving the full structural and logical complexity of the original BIRD dataset.

This repository documents the arXiv version of the paper, which has been accepted to EACL 2026 (SIGTURK).
The final camera-ready version will be published in the official conference proceedings.

Dataset available on Hugging Face: Hugging Face Dataset Link

Paper (arXiv): https://arxiv.org/abs/2602.03633
Conference: EACL 2026 – SIGTURK

Note: All related code will be added to this repository soon.

Motivation

Most state-of-the-art Text-to-SQL models:

Are trained and evaluated primarily on English datasets
Show strong performance in English
Degrade significantly in morphologically rich languages such as Turkish

BIRDTurk is designed to:

Quantify performance degradation caused purely by language shift
Identify systematic cross-lingual failure modes
Provide a reproducible Turkish benchmark for Text-to-SQL research

Dataset Description

Original BIRD Dataset

12,751 natural language questions
95 real-world databases
37 different domains
Complex SQL queries (joins, nested queries, aggregations)
Large-scale (~33.4 GB)

BIRDTurk Dataset

One-to-one Turkish translation of BIRD questions
Identical SQL queries (no modification)
Database schemas and execution logic fully preserved
Translation validated for semantic and execution equivalence

Design principle: Only the natural language layer changes. SQL and databases remain untouched.
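
The design principle can be pictured with a small sketch. The `Example` class, its field names, and the sample item below are illustrative assumptions, not the released dataset schema or an actual BIRD item; the point is only that the Turkish item differs from the English one in the question text and nothing else.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Example:
    """One benchmark item. Field names are hypothetical, not the released schema."""
    db_id: str
    question: str
    sql: str


def to_turkish(original: Example, question_tr: str) -> Example:
    # Only the natural-language question changes; db_id and SQL are copied verbatim.
    return Example(db_id=original.db_id, question=question_tr, sql=original.sql)


def same_target(en: Example, tr: Example) -> bool:
    # A translated pair must point at the same database and the same gold SQL.
    return en.db_id == tr.db_id and en.sql == tr.sql


# Made-up item for illustration (not taken from BIRD).
en = Example("shop", "How many customers are there?", "SELECT COUNT(*) FROM customers")
tr = to_turkish(en, "Kaç müşteri var?")
print(same_target(en, tr))
```

A check like `same_target` is one simple way to confirm that a translated split has not drifted from the original SQL.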

Translation and Validation

Controlled human-in-the-loop translation process
Focus on:
- Semantic fidelity
- Schema alignment
- Query executability
Statistical sampling, with samples scored by human evaluators
Estimated translation accuracy of 98.15% at a 95% confidence level

This keeps translation noise low, so that measured differences mainly reflect language effects rather than translation errors.

Experimental Setup

Evaluations are conducted under three paradigms:

1. Inference-Based Prompting

Zero-shot / few-shot prompting
No task-specific training
Measures raw multilingual generalization
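
A minimal sketch of how zero-shot and few-shot prompts might be assembled for this setting. The template wording, the `build_prompt` helper, and the toy schema are assumptions for illustration; the prompts used in the paper are not given in this README.

```python
from typing import Sequence


def build_prompt(schema: str, question: str,
                 shots: Sequence[tuple[str, str]] = ()) -> str:
    """Assemble a Text-to-SQL prompt; zero-shot when `shots` is empty."""
    parts = ["Translate the question into a SQLite query.", "Schema:", schema.strip()]
    for shot_question, shot_sql in shots:
        # Each demonstration is a (question, gold SQL) pair.
        parts += [f"Question: {shot_question}", f"SQL: {shot_sql}"]
    parts += [f"Question: {question}", "SQL:"]
    return "\n".join(parts)


# Toy schema and questions, made up for the example.
schema = "CREATE TABLE customers (id INTEGER, city TEXT);"
zero_shot = build_prompt(schema, "Kaç müşteri var?")
few_shot = build_prompt(
    schema,
    "Kaç müşteri var?",
    shots=[("İstanbul'da kaç müşteri var?",
            "SELECT COUNT(*) FROM customers WHERE city = 'İstanbul'")],
)
print(few_shot)
```

Note that the schema stays in English while the question is Turkish, which mirrors the benchmark's setup.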

2. Agentic Multi-Stage Reasoning

Explicit reasoning decomposition
Tool-augmented or step-based inference
More robust to linguistic variation
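
One way to picture step-based inference is a loop that reads the schema, drafts SQL, executes it, and feeds any database error back into the next draft. The sketch below is an assumption about what such a pipeline could look like, not the agentic system evaluated in the paper; `stub_generator` stands in for a model call and the `Generator` signature is hypothetical.

```python
import sqlite3
from typing import Callable, Optional

# Hypothetical generator signature: (question, schema, feedback) -> SQL string.
Generator = Callable[[str, str, Optional[str]], str]


def schema_text(conn: sqlite3.Connection) -> str:
    # Stage 1: read table definitions from the database itself.
    rows = conn.execute("SELECT sql FROM sqlite_master WHERE type = 'table'").fetchall()
    return "\n".join(row[0] for row in rows if row[0])


def solve(conn: sqlite3.Connection, question: str,
          generate: Generator, max_rounds: int = 3) -> tuple[str, list]:
    schema = schema_text(conn)
    feedback: Optional[str] = None
    for _ in range(max_rounds):
        sql = generate(question, schema, feedback)      # Stage 2: draft SQL
        try:
            return sql, conn.execute(sql).fetchall()    # Stage 3: execute
        except sqlite3.Error as exc:
            feedback = f"{sql} failed: {exc}"           # Stage 4: repair signal
    raise RuntimeError("no executable query within the round limit")


def stub_generator(question: str, schema: str, feedback: Optional[str]) -> str:
    # Stand-in for a model: wrong table name first, corrected once feedback arrives.
    if feedback:
        return "SELECT COUNT(*) FROM customers"
    return "SELECT COUNT(*) FROM musteriler"


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, city TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Ankara"), (2, "Izmir")])
sql, rows = solve(conn, "Kaç müşteri var?", stub_generator)
print(sql, rows)
```

The stub's first draft uses a Turkish table name that does not exist in the English schema, which is the kind of grounding error an execution step can surface.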

3. Supervised Fine-Tuning

Fine-tuning on labeled Text-to-SQL data
Tested with multilingual and instruction-tuned models
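
For supervised fine-tuning, labeled items are typically serialized into prompt and completion pairs. The sketch below shows one possible JSONL layout; the `prompt` and `completion` keys and the helper names are assumptions, since the README does not describe the training format.

```python
import json
from typing import Iterable


def to_sft_record(schema: str, question: str, sql: str) -> dict:
    # Hypothetical record layout: the model sees the prompt and learns the completion.
    prompt = f"Schema:\n{schema}\nQuestion: {question}\nSQL:"
    return {"prompt": prompt, "completion": " " + sql}


def to_jsonl(records: Iterable[dict]) -> str:
    # ensure_ascii=False keeps Turkish characters readable instead of \u escapes.
    return "\n".join(json.dumps(record, ensure_ascii=False) for record in records)


# Toy item, made up for the example.
record = to_sft_record(
    "CREATE TABLE customers (id INTEGER, city TEXT);",
    "Kaç müşteri var?",
    "SELECT COUNT(*) FROM customers",
)
line = to_jsonl([record])
print(line)
```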

Key Findings

Systematic Performance Drop

All models perform worse in Turkish than in English
Degradation is consistent across architectures
Confirms language representation as a primary bottleneck
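
A performance drop of this kind is usually measured by executing predicted and gold SQL against the same database and comparing results. The sketch below is a simplified, assumed version of such an execution-based score (row sets compared, order ignored); the toy database and predictions are made up and do not reproduce any reported number.

```python
import sqlite3
from typing import Iterable


def execution_match(conn: sqlite3.Connection, predicted_sql: str, gold_sql: str) -> bool:
    """True when both queries return the same set of rows (order ignored)."""
    try:
        predicted = set(conn.execute(predicted_sql).fetchall())
    except sqlite3.Error:
        return False  # a query that fails to run counts as wrong
    return predicted == set(conn.execute(gold_sql).fetchall())


def execution_accuracy(conn: sqlite3.Connection,
                       pairs: Iterable[tuple[str, str]]) -> float:
    pairs = list(pairs)
    if not pairs:
        raise ValueError("no (predicted, gold) pairs given")
    return sum(execution_match(conn, p, g) for p, g in pairs) / len(pairs)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, city TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Ankara"), (2, "Izmir")])

gold = ["SELECT COUNT(*) FROM customers", "SELECT city FROM customers WHERE id = 1"]
pred_en = list(gold)                                          # hypothetical English-run output
pred_tr = [gold[0], "SELECT city FROM customers WHERE id = 2"]  # hypothetical Turkish-run output

acc_en = execution_accuracy(conn, zip(pred_en, gold))
acc_tr = execution_accuracy(conn, zip(pred_tr, gold))
print(acc_en, acc_tr, acc_en - acc_tr)
```

Because the gold SQL and databases are identical across the two languages, the difference between the two scores isolates the effect of the question language.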

Agentic Reasoning Is More Robust

Multi-stage reasoning reduces language sensitivity
Improves schema grounding and intent resolution
Still not language-agnostic, but clearly more robust than single-shot prompting

Fine-Tuning Helps — With Limits

Multilingual pretraining alone is insufficient
Instruction-tuned models benefit more
Morphology and tokenization remain unresolved challenges

What Differentiates BIRDTurk

| Aspect | Prior Benchmarks | BIRDTurk |
| --- | --- | --- |
| Target Language | English | Turkish |
| Dataset Scale | Small–Medium | Large-scale (BIRD-level) |
| SQL Complexity | Often simplified | Full real-world SQL |
| Cross-Lingual Control | Weak | Strictly controlled |
| Validation | Limited | Statistically verified |
| Research Focus | Accuracy only | Language-induced effects |

BIRDTurk is a controlled cross-lingual experiment, not a simple translation.

Why Turkish Exposes Model Weaknesses

Agglutinative morphology
Flexible word order
Implicit arguments
Indirect schema references
Subword tokenization mismatch

These properties reveal reasoning and grounding failures masked in English benchmarks.
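
To make the morphology point concrete, the sketch below links Turkish question tokens to English schema names using a tiny hand-written lexicon. The lexicon, the helper names, and the example question are assumptions for illustration. Exact matching misses both nouns because they carry plural and case suffixes, while a crude prefix match recovers them.

```python
# Hypothetical lemma-to-schema lexicon; a real system would need morphological analysis.
LEXICON = {"müşteri": "customers", "sipariş": "orders"}


def exact_links(tokens: list[str]) -> dict[str, str]:
    # Link a token only when it equals a lexicon lemma.
    return {token: LEXICON[token] for token in tokens if token in LEXICON}


def prefix_links(tokens: list[str]) -> dict[str, str]:
    # Link a token when it starts with a lexicon lemma (suffixes are ignored).
    links = {}
    for token in tokens:
        for lemma, table in LEXICON.items():
            if token.startswith(lemma):
                links[token] = table
    return links


# "List the customers' orders": müşteri-ler-in sipariş-ler-i-ni listele
tokens = "müşterilerin siparişlerini listele".split()
print(exact_links(tokens))
print(prefix_links(tokens))
```

Prefix matching is only a stopgap: Turkish stems can change under suffixation (for example kitap becomes kitabı), so a prefix test will still miss some forms.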

Use Cases

Multilingual Text-to-SQL evaluation
Cross-lingual reasoning analysis
Agentic vs single-shot model comparison
Turkish NL-to-SQL system development
Methodological template for other low-resource languages

Limitations

Focused solely on Turkish
Schemas remain English (intentionally)
Diagnostic benchmark, not a solution

Conclusion

BIRDTurk is the first large-scale, execution-faithful Turkish Text-to-SQL benchmark and has been accepted to EACL 2026 SIGTURK.
It demonstrates that strong English Text-to-SQL performance does not translate to multilingual robustness and establishes language as a first-order challenge in structured reasoning.

Citation

@inproceedings{birdturk2026,
  title={BIRDTurk: Adaptation of the BIRD Text-to-SQL Dataset to Turkish},
  author={Aktas, Burak and Baytekin, Mehmet Can and Kose, Suha Kagan and Ilbilgi, Omer and Yilmaz, Elif Ozge and Toraman, Cagri and Gorur, Bilge Kaan},
  booktitle={Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (EACL)},
  note={Accepted to SIGTURK. arXiv:2602.03633},
  year={2026}
}
