A systematic framework for training deep reinforcement learning agents to master the game of Ludo through progressive difficulty levels. This project implements a 6-level curriculum that builds from basic movement up to full 4-player competitive gameplay with preference-based reward learning, achieving a 66% win rate in the final level.
Trained DQN agent playing 4-player Ludo in real-time
This project explores the application of deep reinforcement learning to Ludo, a complex stochastic multi-agent board game. Rather than jumping directly to the full game complexity, we employ a curriculum learning approach that incrementally introduces game mechanics:
- Level 1: Single token, no opponent interaction (basic movement)
- Level 2: Single token with opponent interactions (captures)
- Level 3: Multiple tokens per player (token selection strategy)
- Level 4: Full stochastic dice mechanics
- Level 5: 4-player multi-agent competition
- Level 6: T-REX with learned rewards from preferences
This structured approach enables the agent to learn fundamental skills before tackling the full game's strategic depth.
For a detailed technical walkthrough of the implementation, including egocentric feature engineering, Potential-Based Reward Shaping (PBRS), and Hybrid T-REX, see the full article:
How to Train AI to Play Multi-Agent Ludo: A Guide to T-REX and Reward Shaping
| Level | Challenge | Win Rate | Training Episodes |
|---|---|---|---|
| 1 | Basic Movement | 95% | 2,500 |
| 2 | Opponent Interaction | 90% | 5,000 |
| 3 | Multi-Token Strategy | 78% | 7,500 |
| 4 | Stochastic Dynamics | 67% | 10,000 |
| 5 | Multi-Agent Chaos | 61% | 15,000 |
| 6 | T-REX (Learned Rewards) | 66% | 35,000 |
The agents demonstrate strong performance across all levels, with the Level 6 agent winning 2.6x as often as the random baseline (25% win rate) thanks to preference-based learning.
- Curriculum-Based Training: 6 progressive difficulty levels with clear success metrics
- Multiple Architectures: SimpleDQN and T-REX (preference-based learning)
- Potential-Based Reward Shaping (PBRS): Theory-grounded reward engineering that preserves optimal policies
- Preference Learning: T-REX implementation for learning rewards from trajectory rankings
- Comprehensive Evaluation: Detailed metrics tracking win rates, captures, game lengths, and learning dynamics
- Visual Gameplay: Real-time CV2 visualization of trained agents playing
- Modular Architecture: Clean separation between environments, agents, and training logic
- Reproducibility: Seed management and hyperparameter tracking for all experiments
- Python 3.8+
- pip or conda
- Clone the repository:
git clone <repository-url>
cd RLagentLudo
- Create and activate virtual environment:
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
- Install dependencies:
pip install -r requirements.txt
Train an agent on a specific level:
# Level 1: Basic movement
python experiments/level1_train.py --episodes 2500 --eval_freq 500
# Level 5: Full game (4 players, 2 tokens each)
python experiments/level5_train.py --episodes 15000 --eval_freq 1000
# Level 6: T-REX with learned rewards
python experiments/level6_train_policy.py --episodes 35000
Watch a trained agent play with graphical visualization:
# Demo Level 5 agent with CV2 window (interactive)
python experiments/demo_visual.py --level 5 --episodes 3
# Generate animated GIF of gameplay (for README/presentations)
python experiments/generate_demo_gif.py --episodes 1 --max_steps 250 --fps 10
This will create assets/demo_gameplay.gif showing the agent playing.
Evaluate a trained model:
# Test Level 5 agent
python experiments/test_level5.py --checkpoint checkpoints/level5/best_model.pth --num_eval 400
Run comprehensive evaluation and generate visualizations:
# Evaluate all levels (1-6)
python experiments/evaluate_all_models.py
# Generate visualization plots
python experiments/visualize_results.py --results results/evaluations/all_models_evaluation_*.json
RLagentLudo/
├── experiments/ # Training and testing scripts
│ ├── level1_train.py # Level 1: Basic movement
│ ├── level2_train.py # Level 2: With captures
│ ├── level3_train.py # Level 3: Multi-token
│ ├── level4_train.py # Level 4: Stochastic
│ ├── level5_train.py # Level 5: Multi-agent
│ ├── level6_train_policy.py # Level 6: T-REX
│ ├── demo_visual.py # Visual demo with CV2
│ ├── evaluate_all_models.py # Comprehensive evaluation
│ └── visualize_results.py # Generate plots
├── src/rl_agent_ludo/
│ ├── agents/ # Agent implementations
│ │ ├── simple_dqn.py # DQN with experience replay
│ │ ├── trex_agent.py # T-REX with learned rewards
│ │ └── baseline_agents.py # Random, Greedy agents
│ ├── environment/ # Environment wrappers
│ │ ├── level1_simple.py # Level 1 environment
│ │ ├── level2_interaction.py # Level 2 environment
│ │ ├── level3_multitoken.py # Level 3 environment
│ │ ├── level4_stochastic.py # Level 4 environment
│ │ ├── level5_multiagent.py # Level 5 environment
│ │ └── unifiedLudoEnv.py # Production env with PBRS
│ ├── preference_learning/ # T-REX components
│ │ ├── trajectory_collector.py # Collect demonstrations
│ │ ├── trajectory_ranker.py # Rank trajectories
│ │ └── reward_network.py # Learn reward function
├── results/ # Evaluation results and plots
│ └── visualizations/ # Generated PNG plots
└── requirements.txt
The primary agent uses a DQN architecture with:
- Experience Replay Buffer: Stores transitions and breaks correlation between consecutive samples
- Target Network: Separate network for stable Q-value targets
- Epsilon-Greedy Exploration: Decays from 1.0 to 0.05
- Gradient Clipping: Prevents exploding gradients
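For illustration, a minimal sketch of the epsilon-greedy action selection described above. The 1.0 → 0.05 range comes from this README; the multiplicative per-episode decay schedule and constant names are assumptions, not necessarily what simple_dqn.py uses:

```python
import random

EPS_START, EPS_END = 1.0, 0.05  # exploration rate decays from 1.0 to 0.05
EPS_DECAY = 0.999               # assumed per-episode multiplicative decay

def select_action(q_values, epsilon: float, num_actions: int) -> int:
    """Epsilon-greedy: random action with probability epsilon, otherwise the greedy one."""
    if random.random() < epsilon:
        return random.randrange(num_actions)
    return int(q_values.argmax())

epsilon = EPS_START
# after each episode: epsilon = max(EPS_END, epsilon * EPS_DECAY)
```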
Network Architecture:
- Input: State vector (4D to 16D depending on level)
- Hidden layers: two fully connected layers of 128 units each (ReLU activation)
- Output: Q-values for each action
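A minimal PyTorch sketch of this network; the class name and defaults are illustrative, and the actual implementation in src/rl_agent_ludo/agents/simple_dqn.py may differ:

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Sketch of the Q-network: state vector in, one Q-value per discrete action out."""

    def __init__(self, state_dim: int, num_actions: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),  # first 128-unit hidden layer
            nn.Linear(hidden, hidden), nn.ReLU(),     # second 128-unit hidden layer
            nn.Linear(hidden, num_actions),           # one Q-value per action
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)

# e.g. Level 5: 16D egocentric state, actions = {move token 0, move token 1, pass}
q_net = QNetwork(state_dim=16, num_actions=3)
target_net = QNetwork(state_dim=16, num_actions=3)
target_net.load_state_dict(q_net.state_dict())  # separate target network for stable Q-targets
```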
The T-REX agent learns from trajectory preferences:
Innovation: Instead of hand-crafted rewards, learns reward function from ranked demonstrations
Pipeline:
- Collect Trajectories: Run agents and record full game sequences
- Rank Trajectories: Create preference pairs (better vs. worse)
- Learn Reward Network: Train neural network to predict rewards
- Train Policy: Use learned rewards to train DQN
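The core of the "Learn Reward Network" step is the standard T-REX ranking objective: the reward network should assign a higher summed return to the preferred trajectory. A minimal sketch, with function and argument names that are illustrative rather than the exact API of preference_learning/reward_network.py:

```python
import torch
import torch.nn.functional as F

def preference_loss(reward_net, better_states: torch.Tensor, worse_states: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry loss over one preference pair of trajectories.

    better_states / worse_states: (T, state_dim) state sequences, where the first
    trajectory was ranked higher than the second.
    """
    return_better = reward_net(better_states).sum()  # predicted return of the preferred trajectory
    return_worse = reward_net(worse_states).sum()    # predicted return of the other trajectory
    logits = torch.stack([return_better, return_worse]).unsqueeze(0)  # shape (1, 2)
    # Cross-entropy with target class 0, i.e. maximize P(better is preferred over worse).
    return F.cross_entropy(logits, torch.zeros(1, dtype=torch.long))
```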
Hybrid Reward Mode:
total_reward = env_reward + 0.3 × 10.0 × learned_reward
This combines sparse environment signals with dense learned feedback for best performance.
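In code, the hybrid reward is a one-line blend. The constant names below are illustrative; only the 0.3 and 10.0 factors come from the formula above:

```python
BLEND_WEIGHT = 0.3   # how strongly the learned signal is mixed in
REWARD_SCALE = 10.0  # rescales the reward network's output to the environment's reward magnitude

def hybrid_reward(env_reward: float, learned_reward: float) -> float:
    """Level 6 hybrid reward: sparse environment reward plus scaled learned reward."""
    return env_reward + BLEND_WEIGHT * REWARD_SCALE * learned_reward
```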
The project uses Potential-Based Reward Shaping (PBRS) to guide learning while preserving optimal policies:
- Win/Loss: +100 (win), -100 (loss)
- Progress Shaping: Distance-based potential function
- Capture Rewards: +30 (capture), -30 (captured)
- Goal Completion: +50 per token
PBRS guarantees that the shaped reward function has the same optimal policy as the original sparse reward, while significantly accelerating learning.
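As an illustration of the shaping term (Ng et al., 1999), here is a minimal sketch assuming a simple progress-based potential; the actual potential used in unifiedLudoEnv.py may be defined differently:

```python
GAMMA = 0.99  # assumed discount factor; must match the agent's gamma for the invariance guarantee

def potential(progress: float) -> float:
    """Potential Φ(s): fraction of the track already covered by the agent's tokens, in [0, 1]."""
    return progress

def shaping_term(progress_before: float, progress_after: float) -> float:
    """Potential-based shaping F(s, s') = γ·Φ(s') − Φ(s), added on top of the sparse reward."""
    return GAMMA * potential(progress_after) - potential(progress_before)

# Forward moves yield a small positive bonus and setbacks a negative one,
# without changing which policy is optimal.
```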
- Goal: Learn to move a single token from start to goal
- State: 4D (1 token position, goal flag, distance, progress)
- Actions: Move token or pass
- Challenge: Basic sequential decision-making
- Goal: Learn to capture opponents and avoid being captured
- State: 8D (player + opponent token states)
- Challenge: Adversarial interaction, risk assessment
- Goal: Manage 2 tokens simultaneously, strategic token selection
- State: 14D (2 tokens × 2 players)
- Actions: Move token 0, token 1, or pass
- Challenge: Resource allocation, multi-objective optimization
- Goal: Handle full dice mechanics (1-6 outcomes)
- State: 16D
- Challenge: Partial observability, long-term planning under uncertainty
- Goal: Compete against 3 random opponents simultaneously
- State: 16D (egocentric view)
- Challenge: Full game complexity, emergent multi-agent dynamics
- Goal: Improve over Level 5 using learned rewards
- Innovation: Preference-based reward learning
- Result: 66% win rate (vs Level 5's 61%)
Each level tracks:
- Win Rate: Primary success metric
- Average Reward: Cumulative episode reward
- Game Length: Steps per episode
- Capture Statistics: Captures made vs. received
- Epsilon: Exploration rate (decays from 1.0 to 0.05)
- Replay Buffer Size: Experience collected
Evaluations run with 500 test games per level against random opponents.
- COMPONENTS_DOCUMENTATION.md - Detailed documentation of all agents, environments, and reward networks
- LEVEL6_EVALUATION_GUIDE.md - Guide for evaluating Level 6 (T-REX)
- docs/ - Additional architecture and methodology docs
This project builds upon established research in:
- Curriculum Learning: Progressive task difficulty for skill acquisition
- Reward Shaping: Potential-based reward shaping (Ng et al., 1999)
- Deep RL: DQN architectures (Mnih et al., 2015)
- Preference Learning: T-REX for learning from demonstrations (Brown et al., 2019)
- Multi-Agent RL: Competitive gameplay and emergent strategies
- Ng, A. Y., Harada, D., & Russell, S. (1999). Policy invariance under reward transformations: Theory and application to reward shaping. ICML. (PBRS theory)
- Brown, D., et al. (2019). Extrapolating Beyond Suboptimal Demonstrations via Inverse Reinforcement Learning from Observations. ICML. (T-REX)
- Mnih, V., et al. (2015). Human-level control through deep reinforcement learning. Nature. (DQN)
Potential extensions and improvements:
- Self-Play Training: Train against past versions of the agent
- Multi-Agent Learning: Simultaneous training of all players
- Policy Gradient Methods: PPO, A3C for continuous improvement
- Opponent Modeling: Explicit modeling of opponent strategies
- Human Evaluation: Testing against human players
See LICENSE file for details.
If you use this code in your research, please cite:
@software{rl_agent_ludo_curriculum,
title = {Reinforcement Learning for Ludo: A Curriculum-Based Approach},
author = {Balegar, Hitesh},
year = {2025},
url = {https://github.com/yourusername/RLagentLudo},
note = {Deep RL with progressive curriculum and preference learning for multi-agent board games}
}
This project builds upon:
- LudoPy: Python implementation of Ludo game mechanics
- DeepMind: DQN and Deep RL architectures
- OpenAI: Reinforcement learning best practices and methodologies
- Existing research on RL applications to board games and curriculum learning




