Kaiwen Zheng, Kai Zhou, Jinwu Hu, Te Gu, Mingkai Peng, Fei Liu
South China University of Technology, Pazhou Laboratory
Training-Free Test-Time Contrastive Learning (TF-TTCL) is a training-free framework for improving frozen or API-accessed LLMs during inference. It learns from test-time experience by contrasting better and worse reasoning trajectories, distilling them into reusable rules, and retrieving those rules for later questions.
Accepted to Findings of ACL 2026.
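At a high level, one test-time step samples several reasoning trajectories, contrasts the better ones against the worse ones, and distills the difference into rules that are retrieved for later questions. The sketch below is purely illustrative; none of these names are the repository's actual API, and the real system delegates sampling, judging, and distillation to LLM calls:

```python
"""Conceptual sketch of one TF-TTCL step; not the repository's actual API."""
from dataclasses import dataclass, field


@dataclass
class RuleStore:
    positive: list = field(default_factory=list)
    negative: list = field(default_factory=list)

    def retrieve(self, question: str, k: int = 3):
        # Placeholder retrieval: most recent k rules of each polarity.
        # The real system ranks stored rules by embedding similarity.
        return self.positive[-k:], self.negative[-k:]


def tf_ttcl_step(question, sample, judge, distill, store, n=4):
    """Sample n trajectories, contrast better vs. worse, distill rules, answer."""
    pos, neg = store.retrieve(question)
    trajectories = [sample(question, pos, neg) for _ in range(n)]
    better, worse = judge(trajectories)
    store.positive.extend(distill(better, "positive"))
    store.negative.extend(distill(worse, "negative"))
    return better[0]  # answer from the best trajectory


if __name__ == "__main__":
    store = RuleStore()
    answer = tf_ttcl_step(
        "What is 2 + 2?",
        sample=lambda q, pos, neg: "2 + 2 = 4",
        judge=lambda ts: (ts, []),  # toy judge: treat everything as "better"
        distill=lambda ts, pol: [f"[{pol}] show each arithmetic step"] if ts else [],
        store=store,
    )
    print(answer, store.positive)
```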
core/
├── main.py # Experiment runner
├── config/ # Config templates
└── source/ # Actors, retrieval, selection, summary, prompts
data/
├── MATH/ # Processed math datasets
└── OPEN/ # Processed open-domain datasets
evaluate/ # Offline evaluation scripts
LICENSE
requirements.txt
We recommend using uv and Python 3.12 for lightning-fast environment setup.
git clone https://github.com/KevinSCUTer/TF-TTCL.git
cd TF-TTCL
uv venv -p 3.12
source .venv/bin/activate
uv pip install -r requirements.txt

This repository includes processed datasets for:
- Math: gsm8k, math500, aime24, minerva
- Open-domain QA: agriculture, geography, medicine, wealth
Current processed files live at:
data/MATH/{dataset_name}.json
data/OPEN/{dataset_name}.json
Reference/raw dataset files are kept under data/MATH/reference/.
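To peek at a processed file, a few lines of Python suffice. The field names in this sketch are an assumption about the processed schema; inspect your own copy and adjust:

```python
# Peek at a processed dataset file. The exact record fields are an
# assumption; print the keys of the first record to see the real schema.
import json

with open("data/MATH/gsm8k.json", encoding="utf-8") as f:
    examples = json.load(f)

print(f"{len(examples)} examples")
first = examples[0] if isinstance(examples, list) else examples
print(sorted(first))  # list the actual field names
```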
TF-TTCL expects:
- an OpenAI-compatible chat completion endpoint
- an embedding endpoint when rag.enabled: true
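Before a long run, it can help to sanity-check both endpoints with the official openai Python client. The URLs, key, and model names below are the placeholders from the config; substitute your own values:

```python
# Minimal connectivity check for the two endpoints TF-TTCL expects.
# All URLs, keys, and model names here are placeholders from the config.
from openai import OpenAI

chat = OpenAI(base_url="your-openai-compatible-url/v1", api_key="your_api_key")
resp = chat.chat.completions.create(
    model="your_model_name",
    messages=[{"role": "user", "content": "ping"}],
    max_tokens=8,
)
print(resp.choices[0].message.content)

emb = OpenAI(base_url="your-embedding-url/v1", api_key="your_embedding_api_key")
vec = emb.embeddings.create(model="Qwen/Qwen3-Embedding-0.6B", input=["ping"])
print(len(vec.data[0].embedding))  # embedding dimensionality
```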
Start from the config template:
cp core/config/config.yaml core/config/local.yaml

Then edit core/config/local.yaml.
Note: mode and domain allow only five valid combinations:
- mode "close" with domain "math"
- mode "open" with domain "agriculture", "medicine", "geography", or "wealth"
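If you generate configs programmatically, a small guard against invalid combinations might look like this (illustrative only, not part of the repo):

```python
# Illustrative guard for the five valid mode/domain combinations.
VALID = {
    "close": {"math"},
    "open": {"agriculture", "medicine", "geography", "wealth"},
}

def check_combo(mode: str, domain: str) -> None:
    if domain not in VALID.get(mode, set()):
        raise ValueError(f"invalid combination: mode={mode!r}, domain={domain!r}")

check_combo("close", "math")  # ok; check_combo("close", "wealth") would raise
```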
experiment:
data_path: "/path/to/TF-TTCL/data/MATH/gsm8k.json"
limit: 10
mode: "close"
domain: "math"
prompt_file_path:
teacher: "/path/to/TF-TTCL/core/source/prompts/close/role/math/teacher/gsm.md"
ta: "/path/to/TF-TTCL/core/source/prompts/close/role/math/ta/simple_rephrase.md"
student: "/path/to/TF-TTCL/core/source/prompts/close/role/math/student/gsm.md"
positive_batch: "/path/to/TF-TTCL/core/source/prompts/close/rule/positive/batch_extract.md"
negative_batch: "/path/to/TF-TTCL/core/source/prompts/close/rule/negative/batch_extract.md"
llm:
api_url: "your-openai-compatible-url/v1"
api_key: "your_api_key"
model_name: "your_model_name"
max_tokens: 8192 # For AIME24, set to 16384 or larger to prevent truncation. For other datasets, 8192 is sufficient.
max_context_tokens: 32768
rag:
enabled: true
embedding_api_url: "your-embedding-url/v1"
embedding_api_key: "your_embedding_api_key"
embedding_model: "Qwen/Qwen3-Embedding-0.6B"
rules:
max_pos_rules: 3
max_neg_rules: 3
student:
  number: 4

Run an experiment with:
python core/main.py --config core/config/local.yaml

Useful overrides:
python core/main.py --config core/config/local.yaml --limit 50
python core/main.py --config core/config/local.yaml --override rules.max_pos_rules=5 rules.max_neg_rules=5
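Overrides use dotted paths into the YAML config. A minimal sketch of how such assignments map onto a nested dict, assuming PyYAML is installed (this mirrors the behavior conceptually and is not the repo's actual parser):

```python
# Sketch of how dotted --override keys map onto a nested config dict.
import yaml

def apply_override(cfg: dict, assignment: str) -> None:
    key, _, raw = assignment.partition("=")
    node = cfg
    *parents, leaf = key.split(".")
    for part in parents:
        node = node.setdefault(part, {})
    node[leaf] = yaml.safe_load(raw)  # "5" -> 5, "true" -> True, etc.

cfg = {"rules": {"max_pos_rules": 3}}
apply_override(cfg, "rules.max_pos_rules=5")
apply_override(cfg, "rules.max_neg_rules=5")
print(cfg)  # {'rules': {'max_pos_rules': 5, 'max_neg_rules': 5}}
```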
For quick customization without editing the config file, use the override script:
bash core/run_override.sh

This is equivalent to:
python core/main.py --config absolute/path/to/your_config.yaml \
--override experiment.output_dir=exp_res/your_experiment_name \
student.number=4 \
batch.size=1 \
rules.max_pos_rules=30 \
  rules.max_neg_rules=30

Results are written to:
core/exp_res/<dataset>_<mode>_<yyyymmdd>_<hhmmss>/
Typical outputs include:
- experiment.log
- output.jsonl
- rules.json
- summary.json
- question_variants.jsonl (when debug.save_variants: true)
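A finished run can be skimmed with a few lines of Python. The directory name below is a placeholder following the pattern above, and the record structure of output.jsonl is an assumption; adjust to your own files:

```python
# Skim a finished run; the run directory name is a placeholder.
import json
from pathlib import Path

run = Path("core/exp_res/gsm8k_close_20250101_120000")  # your run directory

with (run / "output.jsonl").open(encoding="utf-8") as f:
    records = [json.loads(line) for line in f]
print(f"{len(records)} records")

rules = json.loads((run / "rules.json").read_text(encoding="utf-8"))
print(str(rules)[:200])  # peek at the distilled rules
```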
Offline evaluation scripts are in evaluate/.
Note: the evaluation scripts are lightweight research utilities. Before running them, update the input paths inside the scripts so they point to your generated results and local dataset/reference files.
# Set these to absolute paths to use the ground-truth labels for the GSM8K, MATH-500, Minerva, and AIME24 datasets.
GSM8K_PARQUET_PATH = "path/to/data/MATH/reference/gsm8k/main/test-00000-of-00001.parquet"
MATH500_JSONL_PATH = "path/to/data/MATH/reference/MATH-500/test.jsonl"
MINERVA_JSONL_PATH = "path/to/data/MATH/reference/minerva/test.jsonl"
AIME_JSONL_PATH = "path/to/data/MATH/reference/aime24/aime.jsonl"

Then run:
python evaluate/accuracy/eval_accuracy.py
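For orientation, accuracy evaluation boils down to comparing normalized predictions against references. A bare-bones sketch follows; the field names are assumptions about output.jsonl, and the actual script does more careful answer normalization:

```python
# Bare-bones exact-match accuracy sketch. Field names ("prediction",
# "answer") are assumptions about output.jsonl; adjust to your files.
import json

def normalize(ans: str) -> str:
    return ans.strip().rstrip(".").lower()

with open("core/exp_res/your_run/output.jsonl", encoding="utf-8") as f:
    records = [json.loads(line) for line in f]

hits = sum(normalize(r["prediction"]) == normalize(r["answer"]) for r in records)
print(f"accuracy: {hits / len(records):.3f}")
```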
Create a new conda environment and install the extra evaluation dependencies if needed:
pip install rouge_score nltk bert_score git+https://github.com/google-research/bleurt.git
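As a quick smoke test that the dependencies installed correctly, rouge_score can be exercised directly through its public API:

```python
# Smoke test for the similarity dependencies using rouge_score.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
scores = scorer.score("the cat sat on the mat", "a cat was sitting on the mat")
print(scores["rougeL"].fmeasure)
```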
Then run:
python evaluate/similarity/eval_similarity.py

We thank the authors of TLM and Verbalized Sampling for their open-source code.
If you find our work interesting and useful, please consider giving our repo a 🌟 and citing our paper.
@inproceedings{tfttcl,
  title={Training-Free Test-time Contrastive Learning for Large Language Models},
  author={Zheng, Kaiwen and Zhou, Kai and Hu, Jinwu and Gu, Te and Peng, Mingkai and Liu, Fei},
  booktitle={Findings of the Association for Computational Linguistics: ACL 2026},
  year={2026}
}