Kexian Tang*, Jiani Wang*, Shaowen Wang, Kaifeng Lyu
Institute for Interdisciplinary Information Sciences, Tsinghua University
* Equal contribution.
Contact: {tangkx25,wangjn23}@mails.tsinghua.edu.cn
SPA (Scaling Prompt-engineered Augmentation) is a simple but tough-to-beat baseline. It uses a small set of carefully designed prompts to generate large-scale synthetic data for knowledge injection.
We evaluate SPA on three representative benchmarks: SQuAD (Wikipedia-based QA), QuALITY (long-document comprehension), and MultiHop-RAG (multi-hop reasoning). Through systematic comparisons, we show that despite its simplicity, SPA outperforms several strong and more complex baselines.
Our results suggest that, for knowledge injection, careful prompt design combined with straightforward large-scale augmentation can be surprisingly effective. We hope SPA can serve as a strong baseline for future studies in this area.
- [2026-03-24] Our paper is released.
SPA operates in three steps:

1. **Prompt Engineering** — Drawing on insights from cognitive science and educational psychology, we design a set of 7 prompt templates based on effective human learning strategies, covering three levels:
   - Concept Learning: Key Concepts, Mind Map
   - Critical Thinking: Implications, QA with Critical Thinking (QA-ct)
   - Generative Learning: Case Studies, Discussions, Teacher-style
2. **Scaling** — Repeatedly prompt an LLM to rewrite the source content using these templates, progressively growing the output into a large-scale synthetic corpus.
3. **Training** — The target model is trained on the synthetic corpus via continued pretraining, under the same experimental settings as prior work.
See the code for the full prompt templates.
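The three steps above can be sketched as a simple loop. This is a minimal sketch with hypothetical helper names, and the template wording below is paraphrased for illustration; the actual 7 prompt templates live in the repository:

```python
# Minimal sketch of SPA's augmentation loop. Template names follow the
# seven strategies listed above, but the wording is paraphrased, not the
# repository's actual prompts. `generate` is any text-completion callable
# (e.g. a thin wrapper around an OpenAI chat call) -- an assumed interface.

PROMPT_TEMPLATES = {
    "key_concepts": "List and explain the key concepts in this text:\n{doc}",
    "mind_map": "Organize this text as a mind map in outline form:\n{doc}",
    "implications": "Discuss the broader implications of this text:\n{doc}",
    "qa_ct": "Write QA pairs that require critical thinking about this text:\n{doc}",
    "case_studies": "Write a case study grounded in this text:\n{doc}",
    "discussions": "Write a discussion between two readers of this text:\n{doc}",
    "teacher_style": "Explain this text the way a teacher would to a class:\n{doc}",
}

def augment_corpus(docs, generate, rounds=1):
    """Apply every template to every document `rounds` times, returning
    the synthetic corpus (rounds * len(docs) * 7 rewrites)."""
    synthetic = []
    for _ in range(rounds):
        for doc in docs:
            for template in PROMPT_TEMPLATES.values():
                synthetic.append(generate(template.format(doc=doc)))
    return synthetic
```

Each pass adds 7 rewrites per source document, so the synthetic corpus scales linearly with `rounds`.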
SPA consistently improves with scale and achieves the highest accuracy at moderate-to-large token budgets across benchmarks.
*(Figures: accuracy scaling curves on SQuAD and QuALITY.)*
See the paper for full results.
```bash
conda create -n spa python=3.12
conda activate spa
pip install -r requirements.txt
```

Create a `.env` file in the project root and add your OpenAI API key (used for synthetic data generation):

```bash
OPENAI_API_KEY=your_api_key_here
```

Choose the benchmark you want to run and execute the corresponding script:
SQuAD (Wikipedia-based QA):

```bash
bash scripts/make_squad_data.sh
```

QuALITY (Long-document comprehension):

```bash
bash scripts/make_quality_data.sh
```

Notes:
- If you use GPT-OSS-120B to generate QuALITY synthetic data, please upgrade vLLM to 0.10.2.
- Upgrading vLLM may also upgrade these packages automatically: `openai==2.26.0`, `torch==2.8.0`, `transformers==4.57.6`. This is fine, and we also use this setup in this step. Please ignore other dependency errors.
- It is recommended to clone a fresh environment before running this workflow. We only use this environment for QuALITY data generation; all other workflows follow the versions in `requirements.txt`.
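Since this workflow depends on specific pinned versions, a quick sanity check can save a failed run. Below is a minimal sketch; the repository does not ship this helper, and `EXPECTED` simply mirrors the versions listed in the notes above:

```python
# Sanity-check installed package versions against the pins mentioned in
# the notes above (hypothetical helper; not part of the SPA repository).
from importlib import metadata

EXPECTED = {
    "vllm": "0.10.2",
    "openai": "2.26.0",
    "torch": "2.8.0",
    "transformers": "4.57.6",
}

def mismatched(expected, get_version=metadata.version):
    """Return {package: (expected, installed)} for every package whose
    installed version differs; missing packages map to (expected, None)."""
    bad = {}
    for pkg, want in expected.items():
        try:
            have = get_version(pkg)
        except metadata.PackageNotFoundError:
            have = None
        if have != want:
            bad[pkg] = (want, have)
    return bad

if __name__ == "__main__":
    for pkg, (want, have) in mismatched(EXPECTED).items():
        print(f"{pkg}: expected {want}, found {have}")
```

The `get_version` parameter exists only to make the helper easy to test; in normal use the default (`importlib.metadata.version`) reads the active environment.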
MultiHop-RAG (Multi-hop reasoning):

```bash
bash scripts/make_mhrag_data.sh
```

After generation, tokenize the synthetic corpus to prepare it for training:

```bash
bash scripts/tokenize.sh
```

Run continued pretraining on the tokenized synthetic corpus:
```bash
bash scripts/train.sh
```

If you find this work useful, please cite:

```bibtex
@article{tang2026spa,
  title={SPA: A Simple but Tough-to-Beat Baseline for Knowledge Injection},
  author={Tang, Kexian and Wang, Jiani and Wang, Shaowen and Lyu, Kaifeng},
  journal={arXiv preprint arXiv:2603.22213},
  year={2026},
  url={https://arxiv.org/abs/2603.22213}
}
```
