Supplementary materials for the ECIR 2026 short paper:
"Principled Context Engineering for RAG: Statistical Guarantees via Conformal Prediction"
- prompts/ - Prompt templates for relevance labeling and LLM scoring
- examples/ - Example inputs and outputs
Binary relevance judgments are generated using Llama 3.3-70B-Instruct with the following prompt:
```
Does this snippet contain or support the following fact?
Fact: {fact}
Snippet: {snippet}
Answer YES if the snippet contains this information, NO otherwise.
Think step by step:
1. What is the fact claiming?
2. Does the snippet mention this information?
3. Is the information in the snippet consistent with the fact?
Format: YES/NO
Answer:
```
See prompts/relevance_labeling.txt for full details.
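A minimal sketch of how the template above can be filled and the model's completion mapped to a binary label. The helper names and the YES/NO parsing rule are illustrative assumptions; the exact parsing used in the paper is in prompts/relevance_labeling.txt.

```python
def fill_prompt(template: str, fact: str, snippet: str) -> str:
    """Substitute the {fact} and {snippet} placeholders in the labeling prompt."""
    return template.format(fact=fact, snippet=snippet)

def parse_label(completion: str) -> int:
    """Map the model's YES/NO completion to a binary relevance label r(q, s).
    Assumption: the answer starts with YES or NO, as the prompt requests."""
    return 1 if completion.strip().upper().startswith("YES") else 0
```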
| Component | Model | Details |
|---|---|---|
| Language Model | Llama 3.3-70B-Instruct | Relevance labeling, LLM scoring |
| Language Model | GPT-4o | LLM scoring (Conformal-LLM) |
| Embedding Model | Qwen3-Embedding-8B | Query/snippet embeddings (Conformal-Embedding) |
Embedding Scoring Function:
A_emb(q,s) = 1 - cos(emb(q), emb(s))
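This nonconformity score can be sketched in a few lines of plain Python (both embedding vectors are assumed nonzero):

```python
import math

def a_emb(q_emb, s_emb):
    """A_emb(q, s) = 1 - cosine similarity of query and snippet embeddings.
    Assumes both vectors are nonzero."""
    dot = sum(q * s for q, s in zip(q_emb, s_emb))
    norm_q = math.sqrt(sum(x * x for x in q_emb))
    norm_s = math.sqrt(sum(x * x for x in s_emb))
    return 1.0 - dot / (norm_q * norm_s)
```

Identical directions give a score of 0 (most relevant); orthogonal vectors give 1.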
LLM Scoring Function:
A_LLM(q,s) = 1 - rating
where rating ∈ [0,1] is the LLM-provided relevance score. The same prompt was used for both GPT-4o and Llama 3.3-70B-Instruct. See prompts/llm_scoring.txt for the full prompt.
Calibration Threshold:
τ̂_α = Quantile_{1-α}({A(q,s) : r(q,s) = 1})
Coverage Guarantee:
P(s ∈ K_q | r(q,s) = 1) ≥ 1 - α
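The calibration and filtering steps can be sketched as follows. The finite-sample rank ⌈(n+1)(1-α)⌉ is the standard split-conformal choice and an assumption here (the formula above states the (1-α) empirical quantile); function names are illustrative.

```python
import math

def calibrate_threshold(cal_scores, alpha):
    """tau_hat_alpha: empirical (1 - alpha) quantile of nonconformity scores
    A(q, s) over relevant calibration pairs (r(q, s) = 1).
    Assumption: finite-sample rank ceil((n + 1) * (1 - alpha))."""
    scores = sorted(cal_scores)
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    return float("inf") if k > n else scores[k - 1]

def conformal_filter(snippet_scores, tau):
    """K_q: indices of snippets whose nonconformity A(q, s) <= tau."""
    return [i for i, a in enumerate(snippet_scores) if a <= tau]

tau = calibrate_threshold([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7], alpha=0.25)
kept = conformal_filter([0.05, 0.65, 0.5], tau)
```

With seven calibration scores and α = 0.25, the threshold is the 6th-smallest score (0.6), so snippets scoring 0.05 and 0.5 are kept and 0.65 is pruned.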
| Parameter | Value |
|---|---|
| Chunk size | 500 characters |
| Overlap | 100 characters |
| Boundary handling | Preserve sentence boundaries |
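A hypothetical sketch of the chunking settings in the table above: 500-character windows with 100-character overlap, snapping each chunk end back to the last sentence boundary when one exists. The boundary heuristic (splitting on ". ") is an assumption, not the paper's exact implementation.

```python
def chunk_text(text, size=500, overlap=100):
    """Sliding-window chunking with overlap, preferring sentence boundaries.
    Illustrative sketch; the actual boundary handling may differ."""
    chunks, start = [], 0
    while start < len(text):
        end = min(start + size, len(text))
        window = text[start:end]
        if end < len(text):
            cut = window.rfind(". ")  # snap back to the last full sentence
            if cut != -1:
                end = start + cut + 1
                window = text[start:end]
        chunks.append(window.strip())
        if end >= len(text):
            break
        start = max(end - overlap, start + 1)  # step forward, keep overlap
    return chunks
```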
| Dataset | Calibration | Test | Split Strategy |
|---|---|---|---|
| NeuCLIR | 1,440 snippets | 740 snippets | Disjoint query topics |
| RAGTIME | 1,710 snippets | 560 snippets | Disjoint query topics |
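The "disjoint query topics" strategy can be sketched as splitting at the topic level, so no topic contributes snippets to both calibration and test. The 70/30 fraction, seed, and pair layout are illustrative assumptions.

```python
import random

def split_by_topic(pairs, cal_frac=0.7, seed=0):
    """Split (topic_id, snippet) pairs so calibration and test share no topics.
    cal_frac and seed are illustrative, not the paper's settings."""
    topics = sorted({t for t, _ in pairs})
    rng = random.Random(seed)
    rng.shuffle(topics)
    n_cal = int(len(topics) * cal_frac)
    cal_topics = set(topics[:n_cal])
    cal = [p for p in pairs if p[0] in cal_topics]
    tst = [p for p in pairs if p[0] not in cal_topics]
    return cal, tst
```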
| Method | α | F1 | ConRed (context reduction, %) |
|---|---|---|---|
| Conformal-Embedding | 0.05 | 0.720† | 22.2 |
| Conformal-Embedding | 0.10 | 0.700† | 35.0 |
| Conformal-Embedding | 0.20 | 0.680 | 52.8 |
| Conformal-LLM | 0.05 | 0.710† | 46.5 |
| Conformal-LLM | 0.10 | 0.700† | 58.0 |
| Conformal-LLM | 0.20 | 0.680 | 57.8 |
| Unfiltered Baseline | - | 0.690 | 0 |
† indicates significant improvement over unfiltered baseline (p<0.05).
Key findings:
- Conformal methods achieve target coverage guarantees
- 2-3× context reduction while maintaining factual accuracy
- F1 improves under strict filtering (α ≤ 0.10)
```bibtex
@misc{chakraborty2025principledcontextengineeringrag,
  title={Principled Context Engineering for RAG: Statistical Guarantees via Conformal Prediction},
  author={Debashish Chakraborty and Eugene Yang and Daniel Khashabi and Dawn Lawrie and Kevin Duh},
  year={2025},
  eprint={2511.17908},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2511.17908},
}
```

MIT License - see LICENSE for details.