Kexian Tang*, Jiani Wang*, Shaowen Wang, Kaifeng Lyu
Institute for Interdisciplinary Information Sciences, Tsinghua University
* Equal contribution.
Contact: {tangkx25,wangjn23}@mails.tsinghua.edu.cn
SPA (Scaling Prompt-engineered Augmentation) is a simple but tough-to-beat baseline. It uses a small set of carefully designed prompts to generate large-scale synthetic data for knowledge injection.
We evaluate SPA on three representative benchmarks: SQuAD (Wikipedia-based QA), QuALITY (long-document comprehension), and MultiHop-RAG (multi-hop reasoning). Through systematic comparisons, we show that despite its simplicity, SPA outperforms several strong and more complex baselines.
Our results suggest that, for knowledge injection, careful prompt design combined with straightforward large-scale augmentation can be surprisingly effective. We hope SPA can serve as a strong baseline for future studies in this area.
- [2026-03-24] Our paper is released.
SPA operates in three steps:

1. **Prompt Engineering** — Drawing on insights from cognitive science and educational psychology, we design a set of 7 prompt templates based on effective human learning strategies, covering three levels:
   - Concept Learning: Key Concepts, Mind Map
   - Critical Thinking: Implications, QA with Critical Thinking (QA-ct)
   - Generative Learning: Case Studies, Discussions, Teacher-style
2. **Scaling** — Repeatedly prompt an LLM to rewrite the source content using these templates, progressively growing the output into a large-scale synthetic corpus.
3. **Training** — The target model is trained on the synthetic corpus via continued pretraining, under the same experimental settings as prior work.
See the code for the full prompt templates.
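The three steps above can be sketched as a simple loop. This is a minimal sketch with hypothetical helper names, and the template wording below is paraphrased for illustration; the actual 7 prompt templates live in the repository:

```python
# Minimal sketch of SPA's augmentation loop. Template names follow the
# seven strategies listed above, but the wording is paraphrased, not the
# repository's actual prompts. `generate` is any text-completion callable
# (e.g. a thin wrapper around an OpenAI chat call) -- an assumed interface.

PROMPT_TEMPLATES = {
    "key_concepts": "List and explain the key concepts in this text:\n{doc}",
    "mind_map": "Organize this text as a mind map in outline form:\n{doc}",
    "implications": "Discuss the broader implications of this text:\n{doc}",
    "qa_ct": "Write QA pairs that require critical thinking about this text:\n{doc}",
    "case_studies": "Write a case study grounded in this text:\n{doc}",
    "discussions": "Write a discussion between two readers of this text:\n{doc}",
    "teacher_style": "Explain this text the way a teacher would to a class:\n{doc}",
}

def augment_corpus(docs, generate, rounds=1):
    """Apply every template to every document `rounds` times, returning
    the synthetic corpus (rounds * len(docs) * 7 rewrites)."""
    synthetic = []
    for _ in range(rounds):
        for doc in docs:
            for template in PROMPT_TEMPLATES.values():
                synthetic.append(generate(template.format(doc=doc)))
    return synthetic
```

Each pass adds 7 rewrites per source document, so the synthetic corpus scales linearly with `rounds`.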
SPA consistently improves with scale and achieves the highest accuracy at moderate-to-large token budgets across benchmarks.
*(Figures: accuracy scaling curves on SQuAD and QuALITY.)*
See the paper for full results.
```bash
conda create -n spa python=3.12
conda activate spa
pip install -r requirements.txt
```

Create a `.env` file in the project root and add your OpenAI API key (used for synthetic data generation):

```bash
OPENAI_API_KEY=your_api_key_here
```

Choose the benchmark you want to run and execute the corresponding script:
SQuAD (Wikipedia-based QA):

```bash
bash scripts/make_squad_data.sh
```

QuALITY (Long-document comprehension):

```bash
bash scripts/make_quality_data.sh
```

Notes:
- If you use GPT-OSS-120B to generate QuALITY synthetic data, please upgrade vLLM to 0.10.2.
- Upgrading vLLM may also upgrade these packages automatically: `openai==2.26.0`, `torch==2.8.0`, `transformers==4.57.6`. This is fine, and we also use this setup in this step. Please ignore other dependency errors.
- It is recommended to clone a fresh environment before running this workflow. We only use this environment for QuALITY data generation; all other workflows follow the versions in `requirements.txt`.
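Since this workflow depends on specific pinned versions, a quick sanity check can save a failed run. Below is a minimal sketch; the repository does not ship this helper, and `EXPECTED` simply mirrors the versions listed in the notes above:

```python
# Sanity-check installed package versions against the pins mentioned in
# the notes above (hypothetical helper; not part of the SPA repository).
from importlib import metadata

EXPECTED = {
    "vllm": "0.10.2",
    "openai": "2.26.0",
    "torch": "2.8.0",
    "transformers": "4.57.6",
}

def mismatched(expected, get_version=metadata.version):
    """Return {package: (expected, installed)} for every package whose
    installed version differs; missing packages map to (expected, None)."""
    bad = {}
    for pkg, want in expected.items():
        try:
            have = get_version(pkg)
        except metadata.PackageNotFoundError:
            have = None
        if have != want:
            bad[pkg] = (want, have)
    return bad

if __name__ == "__main__":
    for pkg, (want, have) in mismatched(EXPECTED).items():
        print(f"{pkg}: expected {want}, found {have}")
```

The `get_version` parameter exists only to make the helper easy to test; in normal use the default (`importlib.metadata.version`) reads the active environment.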
MultiHop-RAG (Multi-hop reasoning):

```bash
bash scripts/make_mhrag_data.sh
```

After generation, tokenize the synthetic corpus to prepare it for training:

```bash
bash scripts/tokenize.sh
```

Run continued pretraining on the tokenized synthetic corpus:
```bash
bash scripts/train.sh
```

If you find this work useful, please cite:

```bibtex
@article{tang2026spa,
  title={SPA: A Simple but Tough-to-Beat Baseline for Knowledge Injection},
  author={Tang, Kexian and Wang, Jiani and Wang, Shaowen and Lyu, Kaifeng},
  journal={arXiv preprint arXiv:2603.22213},
  year={2026},
  url={https://arxiv.org/abs/2603.22213}
}
```
