TL;DR: I trained Qwen 2.5 (both the 0.5B and 3B variants) to use a moderately complex calculator tool through multi-turn reinforcement learning, achieving up to a 62-percentage-point absolute increase in evaluation accuracy. The final models are available on Hugging Face.
Why train an LLM to use a calculator?
Because despite their reasoning capabilities, LLMs struggle with arithmetic precision: even basic math can lead to errors.
- High-Level Workflow Summary
- Training Results
- Training & Rollout Details
- Reward Design
- Environment (Tools & State Management)
- Evaluation Suite
- Hardware & Cost Efficiency
- Dataset Details
- Models Available on Hugging Face
- Setup
- Acknowledgements
- Crafted and iterated on a tailored system prompt
- Implemented custom environment logic (tools & state handling)
- Built a dedicated evaluation dataset
- Conducted pre-RL baseline evaluations on Qwen 2.5 (0.5B & 3B)
- Designed and refined a hybrid reward function using both LLM-as-a-judge and code verification
- Generated training data
- Developed training pipeline leveraging the verifiers framework
- Iteratively improved reward signal
- Executed successful RL fine-tuning runs
Below is a visualization of the reward improvement over time for the two model sizes:

- 0.5B: starts at ~0.2 reward and ends above 0.8 reward
- 3B: begins at ~0.7 reward and peaks near perfect reward (~1.0)
This project leveraged Group Relative Policy Optimization (GRPO), an algorithm that replaces a learned value function with group-level comparisons: several responses are sampled for each prompt, and the model learns from their relative advantages within the group. This makes it well suited to structured reasoning tasks with verifiable rewards. A short sketch of the group-relative advantage follows, then the rollout and training settings, with an illustrative configuration sketch after that list.
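As a quick illustration of the core idea (not the repo's actual code), the group-relative advantage normalises each rollout's reward against the mean and standard deviation of its own group:

```python
# Illustrative sketch of GRPO's group-relative advantage: each sampled response
# is scored relative to the other rollouts for the same prompt, so no separate
# value network is needed.
import numpy as np

def group_relative_advantages(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """rewards has shape (num_prompts, group_size); returns normalised advantages."""
    mean = rewards.mean(axis=1, keepdims=True)
    std = rewards.std(axis=1, keepdims=True)
    return (rewards - mean) / (std + eps)

# Example: one prompt with 8 rollouts scored by the hybrid reward function.
rewards = np.array([[0.0, 1.0, 1.0, 0.0, 0.5, 1.0, 0.0, 1.0]])
print(group_relative_advantages(rewards).round(2))
```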
- 8 samples per training prompt, each generated with a temperature of 0.9 to encourage diversity while maintaining coherence.
- All rollouts were performed using a vLLM server hosted locally during training.
- Algorithm: GRPO (Group Relative Policy Optimization)
- Batch Size: 8 generations/device (total batch scaled by GPU count)
- Learning Rate: 1e-6
- Training Epochs: 1 full pass through the dataset
- Policy Updates: 2 GRPO iterations per training step
- Reward Model: Hybrid signal combining LLM-as-a-judge and code-based verification
- Gradient Clipping: Max norm = 0.1
- Precision: bfloat16 enabled for efficiency
- Logging & Monitoring: WandB + detailed completion logging
- Generation Constraints:
- Max prompt length: 1024 tokens
- Max completion length: 500 tokens
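For reference, the settings above map roughly onto a trainer config like the one below. This is a hedged sketch using trl-style GRPOConfig field names (I'm assuming the verifiers trainer accepts a trl-compatible config; the exact field names used in the repo's src/train.py may differ between versions):

```python
# Hedged sketch only: maps the hyperparameters above onto trl-style GRPOConfig fields.
# Field availability varies across trl versions; check src/train.py for the real config.
from trl import GRPOConfig

training_args = GRPOConfig(
    output_dir="outputs/qwen2.5-calculator-grpo",  # illustrative path
    learning_rate=1e-6,
    num_train_epochs=1,                # one full pass through the dataset
    per_device_train_batch_size=8,     # 8 generations per device
    num_generations=8,                 # rollouts sampled per training prompt
    temperature=0.9,                   # rollout sampling temperature
    num_iterations=2,                  # GRPO policy updates per training step
    max_prompt_length=1024,
    max_completion_length=500,
    max_grad_norm=0.1,                 # gradient clipping
    bf16=True,
    report_to="wandb",
)
```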
To provide meaningful supervision during RL, rewards were computed using two complementary methods (a hedged code sketch follows this list):

- LLM-as-a-judge:
  - Used Claude-3.5-Haiku as an external judge
  - Provided a carefully crafted scoring rubric
  - Encouraged nuanced feedback beyond simple pass/fail judgments
- Code-based verification:
  - Each sample had a known correct result
  - Regex matching was used to extract final answers
  - Verified correctness via programmatic comparison
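Here is a hedged sketch of both signals. The rubric wording, the extraction regex, the 0-to-1 score scale, and the model alias are assumptions for illustration; the repo's claude.py and reward code may differ:

```python
# Hedged sketch of the two reward signals; rubric text, regexes, and scoring
# scale are illustrative, not the repo's exact implementation.
import re
from anthropic import Anthropic

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment


def llm_judge_reward(question: str, completion: str, rubric: str) -> float:
    """Ask Claude-3.5-Haiku to score a completion against a rubric, returning 0.0-1.0."""
    msg = client.messages.create(
        model="claude-3-5-haiku-latest",  # assumed alias; pin a dated snapshot in practice
        max_tokens=16,
        messages=[{
            "role": "user",
            "content": (
                f"{rubric}\n\nQuestion: {question}\n\nResponse: {completion}\n\n"
                "Reply with a single score between 0 and 1."
            ),
        }],
    )
    try:
        return float(msg.content[0].text.strip())
    except ValueError:
        return 0.0


def code_verification_reward(completion: str, expected: float, tol: float = 1e-4) -> float:
    """Regex-extract the last number in the completion and compare it to the known answer."""
    numbers = re.findall(r"[-+]?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?", completion.replace(",", ""))
    if not numbers:
        return 0.0
    value = float(numbers[-1])
    return 1.0 if abs(value - expected) <= tol * max(1.0, abs(expected)) else 0.0
```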
To call the calculator, the agent was asked to emit a YAML payload wrapped in XML-style <calculator> tags. The YAML structure is recursive, so an operand can itself be another operation. For example, the prompt "What's the sum of 987 times 654, and 987 divided by the total of 321 and 11?" should induce this model output:
```
<calculator>
operation: add
operands:
  - operation: multiply
    operands:
      - 987
      - 654
  - operation: divide
    operands:
      - 987
      - operation: add
        operands:
          - 321
          - 11
</calculator>
```
This output is parsed and evaluated in Python, and the environment's response is then shown to the model as:

```
<output>
645500.9728915663
</output>
```
The model is then asked to formulate a final response to the user.
There was no need to maintain any state, as this was a simple tool-calling environment. A minimal sketch of the parse-and-evaluate step is shown below.
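For illustration, a minimal version of that parse-and-evaluate step could look like the sketch below (function names and error handling are simplified; the real environment code lives in the repo):

```python
# Minimal, illustrative parse-and-evaluate step for the <calculator> tool call.
# The real environment also handles malformed YAML, unknown operations, etc.
import re
from functools import reduce

import yaml  # PyYAML

OPS = {
    "add": lambda a, b: a + b,
    "subtract": lambda a, b: a - b,
    "multiply": lambda a, b: a * b,
    "divide": lambda a, b: a / b,
}

def evaluate(node):
    """Recursively evaluate a node: either a bare number or an {operation, operands} mapping."""
    if isinstance(node, (int, float)):
        return float(node)
    values = [evaluate(child) for child in node["operands"]]
    return reduce(OPS[node["operation"]], values)

def run_calculator(completion: str) -> str:
    """Extract the YAML payload from the <calculator> tags, evaluate it, and wrap the result."""
    payload = re.search(r"<calculator>(.*?)</calculator>", completion, re.DOTALL).group(1)
    result = evaluate(yaml.safe_load(payload))
    return f"<output>\n{result}\n</output>"
```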
The evaluation suite is a custom benchmark built using the agentic_environments library. The eval dataset is here, and the inference code is here.
| Model Size | Pre-RL Accuracy | Post-RL Accuracy | Increase (pts) |
|---|---|---|---|
| 0.5B | 0.6% | 34% | +33.4 pts |
| 3B | 27% | 89% | +62 pts |
Prompted by this Reddit comment, and out of curiosity, I ran the evals against the newer Qwen3 models, which were released about seven months after Qwen 2.5.
Looking at the data, we can see that Qwen3 naturally achieves strong performance without RL.
We could therefore expect an RL-trained 1.7B Qwen3 to outperform an RL-trained 3B Qwen2.5, a model almost double its size!
| Model Family | Model Size | Accuracy | Notes |
|---|---|---|---|
| Qwen2.5 (Base) | 0.5B | 0.6% | Pre-RL baseline |
| Qwen2.5 (RL-trained) | 0.5B | 34.2% | +5,300% relative improvement |
| Qwen2.5 (Base) | 3B | 27.8% | Pre-RL baseline |
| Qwen2.5 (RL-trained) | 3B | 89.9% | +223% relative improvement |
| Qwen3 | 0.6B | 7.6% | 12× better than Qwen2.5 0.5B base |
| Qwen3 | 1.7B | 42.4% | Better than RL-trained Qwen2.5 0.5B |
| Qwen3 | 4B | 82.3% | Nearly matches RL-trained Qwen2.5 3B |
RTX 6000 Ada run:

- Inference: 2x RTX 6000 Ada
- Training: 6x RTX 6000 Ada
- Total Time: 2h 56m 9s
- GPU Cost: $18.04 (~£13.47)

A100 run:

- Inference: 2x A100 (80GB)
- Training: 2x A100 (80GB)
- Total Time: 3h 6m 51s
- GPU Cost: $23.51 (~£17.55)
Both the eval dataset & the training dataset were synthetically generated using the process below. Each row follows this format:

```csv
question,expression,answer
"What is 4829 multiplied by 736?",4829*736,3554144.0
```

Firstly, Gemini-2.5-Pro was prompted with relevant context and instructions to generate a very diverse dataset. It was instructed to diversify in the following ways:
- The way a question was asked (e.g. "Find the product", "What do you get when", etc.)
- The complexity of the question
Its guidelines were:
- Operations must be addition, subtraction, division and multiplication.
- Questions should be at least very hard (ideally extremely hard) for an average human to work out in their head without a calculator.
Gemini-2.5-Pro was asked to generate only the "question" and "expression" columns, to avoid relying on an LLM to produce the correct answer. Instead, the "answer" column was generated by a simple Python script that accepted the CSV as input and ran eval() on each expression (a sketch is shown below).
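That post-processing step is small enough to sketch in full. File names below are placeholders, and eval() is only safe here because the expressions are trusted, synthetically generated arithmetic:

```python
# Illustrative answer-generation step: read the Gemini-generated CSV and compute
# the "answer" column by evaluating each expression. File names are placeholders.
import csv

with open("dataset_raw.csv", newline="") as src, open("dataset.csv", "w", newline="") as dst:
    reader = csv.DictReader(src)  # columns: question, expression
    writer = csv.DictWriter(dst, fieldnames=["question", "expression", "answer"])
    writer.writeheader()
    for row in reader:
        row["answer"] = eval(row["expression"])  # e.g. "4829*736" -> 3554144
        writer.writerow(row)
```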
The eval dataset was made specifically to cover all the various scenarios covered in the training dataset.
Both trained models are now publicly accessible:
As a temporary solution while verifiers is not on PyPI, clone verifiers at the same level as calculator_agent_rl:

```
.
|-- calculator_agent_rl/
|-- verifiers/
```
This is required in order to access the verifiers lib.
Use the devcontainer and Dockerfile for development. If using VS Code, this should pop up automatically.
- Rent a GPU from somewhere like RunPod and connect via SSH
- Install uv: `curl -LsSf https://astral.sh/uv/install.sh | sh`
- Open the workspace dir and clone the repo into `/workspace`: `git clone https://{githubaccess_token}@github.com/AiTuning-Ltd/{repo}.git`
- Clone verifiers into `/workspace`: `git clone https://github.com/willccbb/verifiers.git`
- Follow the "when deployed" steps below
The example commands below run Qwen 2.5 3B on 8 GPUs (4 for inference, 4 for training).
- Run `uv sync` after the verifiers repo has been cloned, as mentioned above
- Run `uv add flash-attn --no-build-isolation`
- Ensure a .env file is set at the root of the project
- Run the vLLM server (example for 4 GPUs):

```bash
cd ../verifiers
CUDA_VISIBLE_DEVICES=0,1,2,3 python verifiers/inference/vllm_serve.py --model "Qwen/Qwen2.5-3B-Instruct" --tensor_parallel_size 4 --max_model_len 8192 --gpu_memory_utilization 0.9 --enable_prefix_caching True
```

- Run train.py using accelerate (example on 4 GPUs):

```bash
CUDA_VISIBLE_DEVICES=4,5,6,7 accelerate launch --num-processes 4 --config-file ../verifiers/configs/zero3.yaml src/train.py
```
If GPUs hang at 100% utilisation during vLLM or training script initialisation:

1. Ensure you are using CUDA version 12.4+
2. Stop the processes
3. In the terminal: `export NCCL_P2P_DISABLE=1` (fix found here)
4. Re-run the script
If you get an error in the verifiers package about "question":

- Go to `data_utils.py` in verifiers and change `"question"` to `'question'` on lines 118 & 132
If the Anthropic API key is not loading onto each GPU for the judge:

- Go to `claude.py` and change `Anthropic()` to `Anthropic(api_key="{api_key}")`
This project makes use of the Verifiers library for reinforcement learning in verifiable environments. If you use this code in your research, please cite:
```bibtex
@article{brown2025verifiers,
  title={Verifiers: Reinforcement Learning with LLMs in Verifiable Environments},
  author={Brown, William},
  year={2025}
}
```