Draft

96 commits
a417266
Trains and Evals
Kinvert Jan 13, 2026
49af2d4
Reward Changes
Kinvert Jan 13, 2026
daaf902
Rendered with spheres or something
Kinvert Jan 13, 2026
332a9ae
Good Claude - Wireframe Planes
Kinvert Jan 13, 2026
0116b97
Physics model: incidence, comments, test suite
Kinvert Jan 13, 2026
b29bf5a
Renamed md Files
Kinvert Jan 13, 2026
95eb2ef
Moved Physics to File
Kinvert Jan 13, 2026
3582d2d
Physics in Own File - Test Flights
Kinvert Jan 14, 2026
1c30c54
Coordinated Turn Tests
Kinvert Jan 14, 2026
1131e83
Simple Optimizations
Kinvert Jan 14, 2026
374871d
Small Perf - Move cosf Out of Loop
Kinvert Jan 14, 2026
8598067
Autopilot Separate File
Kinvert Jan 14, 2026
80bcf31
Vectorized Autopilot
Kinvert Jan 14, 2026
0a1c2e6
Weighted Random Actions
Kinvert Jan 15, 2026
63a7aae
Observation Schemas Swept
Kinvert Jan 15, 2026
04dd016
Rewards Fixed - Sweepable
Kinvert Jan 15, 2026
26709b9
Preparing for Sweeps
Kinvert Jan 15, 2026
a31d1dc
Fix Terminals and Logging
Kinvert Jan 15, 2026
3cc5b58
More Sweep Prep
Kinvert Jan 15, 2026
17f18c1
Fix Reward and Score
Kinvert Jan 15, 2026
d639ee3
Temp Undo Later - Clamp logstd
Kinvert Jan 15, 2026
2606e20
Apply Sweep df1 84 u5i33hej
Kinvert Jan 16, 2026
bc72836
New Obs Schemas - New Sweep Prep
Kinvert Jan 16, 2026
fe7e26a
Roll Penalty - Elevator Might Be Inverted
Kinvert Jan 16, 2026
652ab7a
Fix Elevator Problems
Kinvert Jan 17, 2026
30fa9fe
Fix Obs 5 Schema and Adjust Penalties
Kinvert Jan 17, 2026
ab222bf
Increase Batch Size for Speed
Kinvert Jan 17, 2026
7fd88f1
Next Sweep Improvements - Likes to Aileron Roll too Much
Kinvert Jan 17, 2026
9dca5c6
Reduce Prints
Kinvert Jan 17, 2026
b68d1b2
Simplify Penalties and Rewards
Kinvert Jan 18, 2026
03d1ebc
Try to Avoid NAN
Kinvert Jan 18, 2026
7a15539
Trying to Stop NANs
Kinvert Jan 18, 2026
2c3073f
Debug Prints
Kinvert Jan 18, 2026
be1e31c
Fix Mean Outside Bounds
Kinvert Jan 18, 2026
f6c821d
Still Trying to Fix Blowups
Kinvert Jan 18, 2026
3f0f8b4
Revert Some Ini Values
Kinvert Jan 18, 2026
6c61df6
Restore Much of Ini to 9dca5c6
Kinvert Jan 18, 2026
faf6eb6
Reduce Learning Rate Again
Kinvert Jan 18, 2026
4e640ee
Trying to Fix Curriculum - Agent Trains Poorly
Kinvert Jan 18, 2026
f302224
Aim Annealing - Removed Some Penalties
Kinvert Jan 19, 2026
f000fb8
Added More Debugging
Kinvert Jan 19, 2026
7a75d2b
Some Fixes - SPS Gains - New Sweep Soon
Kinvert Jan 19, 2026
92aa6c5
Fixed Rewards That Turn Negative
Kinvert Jan 19, 2026
fd1941f
Reduce Negative G Penalties
Kinvert Jan 19, 2026
d8a8475
Revert to df5 (f3022) + SPS gains, Ready for df7
Kinvert Jan 19, 2026
4c3ebd3
Clamp for nans - df7 2.0
Kinvert Jan 19, 2026
bfa061f
This Potentially Helps with Curriculum
Kinvert Jan 20, 2026
214338e
3M SPS Prep for df8 Sweep
Kinvert Jan 20, 2026
f2af35e
df9 Sweep Prep - Sweeping Stages
Kinvert Jan 20, 2026
060bbfb
Safer Sweeps - Obs Clamps - Coeff Ranges
Kinvert Jan 20, 2026
153bd08
Add sweep persistence and override injection for Protein
Kinvert Jan 21, 2026
8c7260b
df10 Sweep Prep - Simplified Rewards, New Obs Scheme
Kinvert Jan 21, 2026
4b72007
Observation Scheme Tests
Kinvert Jan 22, 2026
784856b
Rudder Damping - Obs HUD - Test Updates
Kinvert Jan 22, 2026
b0f22a3
Code Cleanup
Kinvert Jan 22, 2026
6859683
Reduce Sweep Params - Rudder Drag - Restructure and Add Tests
Kinvert Jan 22, 2026
84d8241
Logs Update
Kinvert Jan 22, 2026
1c191cf
New Physics Mode Scaffolding
Kinvert Jan 22, 2026
ed82326
Fix Hyper Override Edge Cases
Kinvert Jan 23, 2026
6a0a295
More Realistic Physics - WIP
Kinvert Jan 23, 2026
a88a6d7
Fix Autopilot Enum Bug
Kinvert Jan 23, 2026
8eb6963
Simplify Rewards etc
Kinvert Jan 23, 2026
ee9849d
Seems Ready for Real Physics Training Sweeps
Kinvert Jan 23, 2026
9cd25c4
Removing Old Physics - Keeping New Physics
Kinvert Jan 23, 2026
3b470c5
Replace Obs Schemes for New Physics - 3D Render
Kinvert Jan 24, 2026
c5cf1de
Update Tests for Obs
Kinvert Jan 24, 2026
77b93cc
Hopefully Fixed Difficulty Stages
Kinvert Jan 24, 2026
3022164
Fix Log Bug
Kinvert Jan 24, 2026
c5862b4
Sweep Warmup for df15
Kinvert Jan 24, 2026
69461fb
Intermediate Stages
Kinvert Jan 25, 2026
c705d6b
Added User Control
Kinvert Jan 25, 2026
d24e17a
More Stages for Smoother Learning
Kinvert Jan 25, 2026
e94db07
More Intermediate Stages
Kinvert Jan 26, 2026
a55c7a9
Stages Now Structs
Kinvert Jan 26, 2026
46832e4
Fix Minor Bug Perf Measure
Kinvert Jan 26, 2026
4b8a90f
Working State
Kinvert Jan 26, 2026
b0078fe
Timer Obs - Stage 8 max_steps
Kinvert Jan 27, 2026
07d2ce4
Edge Case Hypers
Kinvert Jan 27, 2026
ddf76d5
Consolidated Spawn Funcs - Potential Stage Advance Change
Kinvert Jan 27, 2026
7dea028
Finalize Drops Stage for Accurate Ultimate Sweeping
Kinvert Jan 27, 2026
fb1dbf8
Mastery Working
Kinvert Jan 27, 2026
1bca284
Quality of Life - Stage Improvements - Randomization
Kinvert Jan 27, 2026
9616c44
Ready for df18 Sweep
Kinvert Jan 27, 2026
861ac41
AutoAce - Render Improvements - Now Trains All Stages
Kinvert Jan 28, 2026
ab6f480
Joystick Control - Cleaning Up
Kinvert Jan 28, 2026
96d25d5
Cleanup
Kinvert Jan 28, 2026
ac32dbb
More Cleanup
Kinvert Jan 28, 2026
909dec6
Removing More md Files
Kinvert Jan 28, 2026
543524a
Remove Test From PR
Kinvert Jan 28, 2026
e04c7d5
Cleaning Up md Files
Kinvert Jan 28, 2026
d8a5e16
Remove More md Files
Kinvert Jan 28, 2026
2a798ac
Remove Dogfight Test
Kinvert Jan 28, 2026
bb26079
Getting Ready for Self Play Attempt
Kinvert Jan 28, 2026
202b67c
Dogfight Self Play
Kinvert Jan 29, 2026
9476ad7
Dogfight Self Play Seems to Work - Descending Rolling Scissors
Kinvert Jan 30, 2026
d469684
Trying to Smooth Flight Controls - Reduce Oscillations
Kinvert Feb 2, 2026
1 change: 1 addition & 0 deletions .gitignore
@@ -162,3 +162,4 @@ pufferlib/ocean/impulse_wars/*-release/
pufferlib/ocean/impulse_wars/debug-*/
pufferlib/ocean/impulse_wars/release-*/
pufferlib/ocean/impulse_wars/benchmark/
+pufferlib/ocean/dogfight/dogfight_test
191 changes: 191 additions & 0 deletions pufferlib/checkpoint_queue.py
@@ -0,0 +1,191 @@
"""Checkpoint Queue for Self-Play Training.

Manages a queue of policy checkpoints where the opponent is always N checkpoints
behind the learner. This creates a stable skill gap and natural curriculum.

Training Flow:
Stages 0-9: Autopilot opponent (curriculum)
Stage 10: Save checkpoint A (milestone)
Stages 10-19: Continue curriculum with autopilot
Stage 20: Save checkpoint B (milestone), START SELF-PLAY vs A
Dominate: Save checkpoint C, upgrade opponent to B
Dominate: Save checkpoint D, upgrade opponent to C
...and so on (opponent always `lag` checkpoints behind)

Lag Semantics:
lag=1 means "2nd newest" (skip 1 checkpoint):
- Queue: [A, B, C] with lag=1 -> opponent uses B (index -2)
- Queue: [A, B, C, D] with lag=1 -> opponent uses C (index -2)
"""
import os
from dataclasses import dataclass
from typing import List, Optional
import torch


@dataclass
class QueueEntry:
    """A checkpoint in the queue."""
    path: str     # Checkpoint file path
    step: int     # Global step when saved
    stage: float  # Curriculum stage when saved
    tag: str      # "stage10", "stage20", or "selfplay_N"

    def is_milestone(self) -> bool:
        """Return True if this is a milestone checkpoint (stage10/stage20)."""
        return self.tag in ("stage10", "stage20")


class CheckpointQueue:
    """Manages the checkpoint queue for self-play training.

    Checkpoints are saved when the learner dominates the opponent (exceeds
    perf_threshold). The opponent is loaded from an older checkpoint in the
    queue, determined by the lag parameter.

    Milestone checkpoints (stage10, stage20) are never pruned.
    """

    def __init__(self, save_dir: str, max_checkpoints: int = 20):
        """Initialize checkpoint queue.

        Args:
            save_dir: Directory to store checkpoint files
            max_checkpoints: Total checkpoints to keep; two slots are reserved
                for milestones, which are never pruned
        """
        self.save_dir = save_dir
        self.max_checkpoints = max_checkpoints
        self.checkpoints: List[QueueEntry] = []

        # Create save directory if needed
        os.makedirs(save_dir, exist_ok=True)

        print(f'[CHECKPOINT-QUEUE] Initialized: save_dir={save_dir}, max={max_checkpoints}')

    def save(self, policy, step: int, stage: float, tag: str) -> str:
        """Save checkpoint and add to queue.

        Args:
            policy: PyTorch policy module to save
            step: Current global step
            stage: Current curriculum stage
            tag: Checkpoint tag ("stage10", "stage20", or "selfplay_N")

        Returns:
            Path to saved checkpoint file
        """
        # Generate filename
        filename = f"checkpoint_{tag}_step{step}.pt"
        path = os.path.join(self.save_dir, filename)

        # Save checkpoint
        torch.save({
            'policy_state_dict': policy.state_dict(),
            'step': step,
            'stage': stage,
            'tag': tag,
        }, path)

        # Add to queue
        entry = QueueEntry(path=path, step=step, stage=stage, tag=tag)
        self.checkpoints.append(entry)

        print(f'[CHECKPOINT-QUEUE] Saved {tag} at step {step}: {path}')

        # Prune old checkpoints if needed
        self._prune_old_checkpoints()

        return path

    def get_opponent(self, lag: int = 1) -> Optional[str]:
        """Get checkpoint path for opponent.

        Args:
            lag: How many positions behind the latest (1=2nd newest, index -2)

        Returns:
            Path to opponent checkpoint, or None if queue too small
        """
        if len(self.checkpoints) < lag + 1:
            return None

        # lag=1 means index -2 (2nd newest)
        index = -(lag + 1)
        return self.checkpoints[index].path

    def get_opponent_entry(self, lag: int = 1) -> Optional[QueueEntry]:
        """Get full QueueEntry for opponent.

        Args:
            lag: How many positions behind the latest (1=2nd newest, index -2)

        Returns:
            QueueEntry for opponent, or None if queue too small
        """
        if len(self.checkpoints) < lag + 1:
            return None

        index = -(lag + 1)
        return self.checkpoints[index]

    def should_upgrade(self, current_opponent_path: Optional[str], lag: int) -> Optional[str]:
        """Check if the opponent should be upgraded to a newer checkpoint.

        Args:
            current_opponent_path: Path to current opponent checkpoint
            lag: Desired lag positions behind latest

        Returns:
            New opponent path if an upgrade is needed, None otherwise
        """
        new_path = self.get_opponent(lag)

        if new_path is None:
            return None

        if new_path != current_opponent_path:
            return new_path

        return None

    def _prune_old_checkpoints(self):
        """Remove the oldest selfplay checkpoints, keeping milestones forever."""
        # Count selfplay checkpoints (not milestones)
        selfplay_checkpoints = [c for c in self.checkpoints if not c.is_milestone()]

        # Reserve 2 slots for milestones (stage10, stage20)
        max_selfplay = self.max_checkpoints - 2

        while len(selfplay_checkpoints) > max_selfplay:
            # Find oldest selfplay checkpoint
            oldest = selfplay_checkpoints.pop(0)

            # Remove file
            if os.path.exists(oldest.path):
                try:
                    os.remove(oldest.path)
                    print(f'[CHECKPOINT-QUEUE] Pruned old checkpoint: {oldest.path}')
                except OSError as e:
                    print(f'[CHECKPOINT-QUEUE] Warning: Could not remove {oldest.path}: {e}')

            # Remove from main list
            self.checkpoints.remove(oldest)

    def __len__(self) -> int:
        """Return number of checkpoints in queue."""
        return len(self.checkpoints)

    def __repr__(self) -> str:
        """Return string representation of queue."""
        tags = [c.tag for c in self.checkpoints]
        return f"CheckpointQueue({tags})"

    def get_queue_state(self) -> dict:
        """Get serializable state of the queue for logging/debugging."""
        return {
            'num_checkpoints': len(self.checkpoints),
            'tags': [c.tag for c in self.checkpoints],
            'steps': [c.step for c in self.checkpoints],
            'milestones': [c.tag for c in self.checkpoints if c.is_milestone()],
        }
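Since the queue's call pattern is easiest to see end to end, here is a minimal usage sketch. Only the CheckpointQueue API comes from the file above; the toy Linear policy, the step counts, and the pretend "dominate three rounds" loop are hypothetical stand-ins for the real trainer.

import torch
from pufferlib.checkpoint_queue import CheckpointQueue

policy = torch.nn.Linear(8, 4)  # toy stand-in for the real policy network
queue = CheckpointQueue("experiments/selfplay_queue", max_checkpoints=20)

# Milestone saves (never pruned) seed the queue before self-play starts
queue.save(policy, step=1_000_000, stage=10.0, tag="stage10")
queue.save(policy, step=2_000_000, stage=20.0, tag="stage20")

opponent_path = queue.get_opponent(lag=1)  # 2nd newest -> the stage10 checkpoint

for n in range(3):  # pretend the learner dominates three rounds in a row
    queue.save(policy, step=2_000_000 + (n + 1) * 500_000, stage=20.0, tag=f"selfplay_{n}")
    new_path = queue.should_upgrade(opponent_path, lag=1)
    if new_path is not None:
        state = torch.load(new_path)  # dict written by CheckpointQueue.save()
        # opponent_policy.load_state_dict(state['policy_state_dict']) on the real net
        opponent_path = new_path

print(queue)                    # CheckpointQueue(['stage10', 'stage20', ...])
print(queue.get_queue_state())  # tags / steps / milestones, handy for logging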
67 changes: 26 additions & 41 deletions pufferlib/config/default.ini
@@ -18,7 +18,7 @@ seed = 42
[rnn]

[train]
name = pufferai
project = ablations

seed = 42
@@ -28,40 +28,40 @@ device = cuda
optimizer = muon
anneal_lr = True
precision = float32
-total_timesteps = 10_000_000
-learning_rate = 0.015
-gamma = 0.995
-gae_lambda = 0.90
-update_epochs = 1
-clip_coef = 0.2
-vf_coef = 2.0
-vf_clip_coef = 0.2
-max_grad_norm = 1.5
-ent_coef = 0.001
-adam_beta1 = 0.95
-adam_beta2 = 0.999
-adam_eps = 1e-12
+total_timesteps = 400_000_000
+learning_rate = 0.0003812
+gamma = 0.9903
+gae_lambda = 0.9934
+update_epochs = 4
+clip_coef = 0.2576
+vf_coef = 4.034
+vf_clip_coef = 4.663
+max_grad_norm = 1.501
+ent_coef = 0.008355
+adam_beta1 = 0.8453
+adam_beta2 = 1
+adam_eps = 2.72e-05

data_dir = experiments
checkpoint_interval = 200
batch_size = auto
-minibatch_size = 8192
+minibatch_size = 32768

# Accumulate gradients above this size
-max_minibatch_size = 32768
+max_minibatch_size = 65536
bptt_horizon = 64
compile = False
compile_mode = max-autotune-no-cudagraphs
compile_fullgraph = True

-vtrace_rho_clip = 1.0
-vtrace_c_clip = 1.0
+vtrace_rho_clip = 2.91
+vtrace_c_clip = 3.085

-prio_alpha = 0.8
-prio_beta0 = 0.2
+prio_alpha = 0.9724
+prio_beta0 = 0.6139
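For reference on the two vtrace_* knobs: assuming they map to the standard importance-weight clips $\bar\rho$ (vtrace_rho_clip) and $\bar c$ (vtrace_c_clip) from the IMPALA V-trace targets, the clipped quantities appear in

$$v_s = V(x_s) + \sum_{t=s}^{s+n-1} \gamma^{t-s} \Big( \prod_{i=s}^{t-1} c_i \Big) \rho_t \big( r_t + \gamma V(x_{t+1}) - V(x_t) \big)$$

with $\rho_t = \min\!\big(\bar\rho,\ \pi(a_t|x_t)/\mu(a_t|x_t)\big)$ and $c_i = \min\!\big(\bar c,\ \pi(a_i|x_i)/\mu(a_i|x_i)\big)$. By the same naming convention, prio_alpha and prio_beta0 read as the prioritized-replay exponent $\alpha$ and the initial importance-sampling exponent $\beta_0$.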

[sweep]
method = Protein
metric = score
goal = maximize
downsample = 5
@@ -75,26 +75,11 @@ prune_pareto = True
#mean = 8
#scale = auto

-# TODO: Elim from base
-[sweep.train.total_timesteps]
-distribution = log_normal
-min = 3e7
-max = 1e10
-mean = 2e8
-scale = time
-
-[sweep.train.bptt_horizon]
-distribution = uniform_pow2
-min = 16
-max = 64
-mean = 64
-scale = auto

[sweep.train.minibatch_size]
distribution = uniform_pow2
-min = 8192
+min = 32768
max = 65536
-mean = 32768
+mean = 65536
scale = auto

[sweep.train.learning_rate]
@@ -115,7 +100,7 @@ scale = auto
distribution = logit_normal
min = 0.8
mean = 0.98
-max = 0.9999
+max = 0.995
scale = auto

[sweep.train.gae_lambda]
@@ -192,8 +177,8 @@ scale = auto

[sweep.train.adam_eps]
distribution = log_normal
-min = 1e-14
-mean = 1e-8
+min = 1e-8
+mean = 1e-6
max = 1e-4
scale = auto
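A note on the distribution names used across these [sweep.train.*] blocks: the sketch below shows the sampling behavior the names suggest. This is an assumption from the names alone; Protein's actual sampler also uses the mean and scale fields, which the sketch mostly ignores.

import math
import random

def sample_uniform_pow2(lo: int, hi: int) -> int:
    # Uniform over powers of two in [lo, hi], e.g. minibatch_size in {32768, 65536}
    return 2 ** random.randint(int(math.log2(lo)), int(math.log2(hi)))

def sample_log_normal(lo: float, hi: float, mean: float, scale: float = 1.0) -> float:
    # Normal in log space around `mean`, clipped to [lo, hi]; `scale` is a guess,
    # since `scale = auto` in the ini leaves it to the sweeper
    x = math.exp(random.gauss(math.log(mean), scale))
    return min(max(x, lo), hi)

print(sample_uniform_pow2(32768, 65536))    # minibatch_size candidate
print(sample_log_normal(1e-8, 1e-4, 1e-6))  # adam_eps candidate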
