My journey started with training a base model, Qabra, on gen1ou and gen9ou human data. The agent is named Qabra because it reuses the Abra model's architecture.
I trained the models on a Google Cloud VM. The setup likely limited my experiments in the following ways:
- The accelerate command crashed without any error information. I traced the failure back to NCCL and used gemini-cli to help me find a solution (the setup_nccl_environment() functions in train.py and finetune_from_hf.py).
- The training job crashed frequently, so I added wandb resume support to keep a single loss graph across multiple crash-resume cycles.
- Eight NVIDIA A100 40GB GPUs were not enough to train the 200M-parameter model with accelerate. I was not able to fix this.
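The actual setup_nccl_environment() in train.py may differ, but the kind of workaround involved looks roughly like this minimal sketch (the specific variable choices here are assumptions, not the confirmed fix):

```python
import os

def setup_nccl_environment() -> None:
    """Hypothetical sketch of an NCCL workaround; the real
    setup_nccl_environment() in train.py may set different values."""
    # Surface NCCL warnings instead of crashing silently.
    os.environ.setdefault("NCCL_DEBUG", "WARN")
    # Fail fast on collective errors rather than deadlocking.
    os.environ.setdefault("TORCH_NCCL_ASYNC_ERROR_HANDLING", "1")
    # Cloud VMs often lack InfiniBand and working peer-to-peer paths.
    os.environ.setdefault("NCCL_IB_DISABLE", "1")
    os.environ.setdefault("NCCL_P2P_DISABLE", "1")

# Call before accelerate / torch.distributed initializes any process group.
setup_nccl_environment()
```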
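The crash-resume pattern hinges on reusing one wandb run id across restarts. A minimal sketch of that idea (the file name and helper are hypothetical; `wandb.init(id=..., resume="allow")` is wandb's documented resume API):

```python
import os
import uuid

RUN_ID_FILE = "wandb_run_id.txt"  # hypothetical path for persisting the id

def load_or_create_run_id(path: str = RUN_ID_FILE) -> str:
    """Persist a single wandb run id across crash-resume cycles so every
    restart logs to the same run, keeping one continuous loss graph."""
    if os.path.exists(path):
        with open(path) as f:
            return f.read().strip()
    run_id = uuid.uuid4().hex[:8]
    with open(path, "w") as f:
        f.write(run_id)
    return run_id

# Usage in the training script (wandb's resume API):
# import wandb
# wandb.init(project="qabra", id=load_or_create_run_id(), resume="allow")
```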
After I got Qabra, I ran many evaluations to understand its strengths and weaknesses.
The key finding: when Qabra plays against SyntheticRLV2 on the gen1ou competitive team set, Qabra loses badly (10% win rate?). But when they play on the replay team set, Qabra wins. (Run evaluate_against_all.sh to reproduce.)
Potential reasons:
- Qabra was trained on more human data (including gen9ou), so it understands the metagame better than SyntheticRLV2.
- SyntheticRLV2's win rate improved significantly with added synthetic data, but its parameters may have drifted away from the real-world distribution.
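The head-to-head numbers above reduce to a simple aggregation over battle outcomes, which evaluate_against_all.sh presumably computes; a minimal stand-alone sketch (the function and input format are hypothetical, not the script's actual implementation):

```python
def win_rate(winners: list[str], agent: str) -> float:
    """Fraction of battles won by `agent`, given a list of winner names
    (one entry per battle)."""
    if not winners:
        return 0.0
    return sum(1 for w in winners if w == agent) / len(winners)

# e.g. 3 wins out of 4 battles -> 0.75
rate = win_rate(["Qabra", "SyntheticRLV2", "Qabra", "Qabra"], "Qabra")
```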
To understand how data composition affects the agent's policy, I ran ablation experiments (the MedSynRL series) to evaluate the impact of synthetic data on the pretrained model. These experiments use the same model size, the same total data size, and the same compute budget, but differ in their mix of synthetic and human data. The conclusion is that simply increasing the amount of synthetic data does not necessarily improve performance.
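A fixed-budget mix like the one the MedSynRL ablations vary can be sketched as follows (the helper name and the 200K budget are illustrative assumptions, not the actual experiment configs):

```python
def build_mix(total_examples: int, synthetic_frac: float) -> dict:
    """Split a fixed data budget between synthetic and human battles, so
    ablations differ only in composition, never in total size."""
    assert 0.0 <= synthetic_frac <= 1.0
    n_syn = round(total_examples * synthetic_frac)
    return {"synthetic": n_syn, "human": total_examples - n_syn}

# Three MedSynRL-style ablation points over the same hypothetical budget.
mixes = [build_mix(200_000, f) for f in (0.0, 0.5, 0.9)]
```

Holding the total fixed is what lets the ablation attribute any performance change to the mix itself rather than to extra data or compute.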
After the ablation experiments, I looked for a path to train more powerful models by mimicking how humans learn. My goal is a general model that masters the game, not just a single team set. The overall curriculum: find a coach for my agents, generate a small dataset, and iterate on that dataset with increasing team-set variation.
Curriculum 1. Qabra played against heuristic agents, producing 65K battles on the competitive team set. I trained Qabraft1 (4 epochs) on this small dataset.
Curriculum 2. I used the Abra model as a coach for Qabra, producing the chain Qabraft1 -> Qabraft2 -> Qabraft4 -> Qabraft6. Each iteration uses 100~150K examples and only 3 epochs of training. Qabraft3 and Qabraft5 are ablation experiments to evaluate the impact of this self-iteration pattern versus simply increasing data size and compute.
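The Curriculum 2 loop can be sketched as a coach-generates, student-finetunes cycle. Everything below is a hypothetical stand-in for the real pipeline (play_battles() and finetune() are placeholders, and the checkpoint naming is illustrative):

```python
def play_battles(student: str, coach: str, n: int) -> list:
    """Placeholder for generating ~n coach-vs-student battle records."""
    return [f"battle:{student}-vs-{coach}"] * 3  # tiny stand-in dataset

def finetune(student: str, dataset: list, epochs: int) -> str:
    """Placeholder for a finetuning run; returns the next checkpoint name."""
    return student + "+1"

def self_iterate(student: str, coach: str, rounds: int = 3,
                 battles_per_round: int = 120_000, epochs: int = 3) -> str:
    """Curriculum 2 pattern: the coach (Abra) helps generate a small
    dataset, the student (Qabra) is finetuned on it, and the result
    becomes the student for the next round."""
    for _ in range(rounds):
        dataset = play_battles(student, coach, n=battles_per_round)
        student = finetune(student, dataset, epochs=epochs)
    return student
```

Each round keeps the dataset small (100~150K examples, 3 epochs), so improvement comes from the iteration pattern rather than from a growing data or compute budget.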
Curriculum 3. This is not finished; I would like to continue the curriculum by learning from foul-play.