Skip to content

prabakaranc98/multiagentRL

Repository files navigation

Internal Other-Play for Overcooked V2

Zero-shot coordination learning through internal diversity and GRPO (Group Relative Policy Optimization).

Overview

This repository implements the Internal Other-Play (IOP) framework for training cooperative multi-agent systems without population-based training. A single agent learns to coordinate by simulating diverse "internal personalities" through hidden state perturbations.

Key Features

  • SoftMoE-GRU Architecture: Specialized experts for different coordination strategies (gathering, cooking, recipe reasoning, partner modeling)
  • Gumbel-Softmax Actions: Differentiable discrete action sampling for gradient-based learning
  • GRPO Training: Ranking-based policy optimization learning from top-performing cohorts
  • Overcooked V2 Optimized: Handles 30-channel observations, stochastic recipes, and POMDP environments

Architecture

Input (4×5×30 obs)
    ↓
CNN Encoder (128-dim)
    ↓
SoftMoE (4 experts, 256-dim)
    ↓
GRU (256-dim) ← Hidden state perturbation for IOP
    ↓
Gumbel Action Head (6 actions)

Expert Specialization

  • Expert 0: Ingredient gathering
  • Expert 1: Cooking coordination
  • Expert 2: Recipe reasoning (V2 stochastic recipes)
  • Expert 3: Partner inference and spatial coordination

Installation

# Create environment
uv venv
source .venv/bin/activate

# Install dependencies
uv pip install -r requirements.txt

Quick Start

Basic Training

python train_iop.py \
    --experiment_name iop_baseline \
    --num_iterations 5000 \
    --num_cohorts 8 \
    --top_k 4 \
    --use_expert_perturbation

Configuration Options

Environment:

  • --layout: Overcooked V2 layout (default: cramped_room)
  • --max_steps: Max steps per episode (default: 400)

Architecture:

  • --encoder_dim: Encoder output dimension (default: 128)
  • --moe_num_experts: Number of MoE experts (default: 4)
  • --gru_hidden_dim: GRU hidden state dimension (default: 256)
  • --action_temperature: Gumbel-Softmax temperature (default: 1.0)

GRPO Hyperparameters:

  • --learning_rate: Learning rate (default: 3e-4)
  • --num_cohorts: Number of IOP cohorts K (default: 8)
  • --top_k: Number of top cohorts for learning (default: 4)
  • --clip_ratio: PPO clip ratio (default: 0.2)
  • --kl_coef: KL divergence coefficient (default: 0.1)
  • --entropy_coef: Entropy bonus coefficient (default: 0.01)

IOP Settings:

  • --use_expert_perturbation: Use MoE-aligned perturbations (recommended)
  • --no_expert_perturbation: Use only Gaussian perturbations

Project Structure

multiagentRL/
├── src/
│   ├── models/
│   │   ├── soft_moe.py              # SoftMoE layer implementation
│   │   ├── gumbel_action_head.py    # Gumbel-Softmax action head
│   │   └── overcooked_v2_agent.py   # Complete agent model
│   ├── training/
│   │   ├── iop_rollout.py           # IOP cohort generation
│   │   └── grpo_trainer.py          # GRPO training loop
│   └── utils/
│       ├── mock_overcooked_v2.py    # Mock environment (for testing)
│       └── logger.py                # Experiment logging
├── train_iop.py                     # Main training script
├── requirements.txt                 # Dependencies
└── README.md                        # This file

How IOP Works

1. Cohort Generation (Internal Diversity)

For each training iteration, generate K=8 cohorts with perturbed hidden states:

# Cohort k gets:
# - Gaussian noise: σ ∈ [0.05, 0.25]
# - Expert bias: Prefer specific expert(s)

Cohort 0: σ=0.05, Expert 0 (Gathering)
Cohort 1: σ=0.10, Expert 1 (Cooking)
Cohort 2: σ=0.15, Expert 2 (Recipe)
Cohort 3: σ=0.20, Expert 3 (Partner)
Cohort 4: σ=0.10, Experts 0+1 (Gather+Cook)
...

2. GRPO Ranking

Rank cohorts by total episode reward and select top K=4:

advantages = (rewards - mean(rewards)) / std(rewards)
top_cohorts = argsort(advantages)[:4]

3. Policy Update

Update using PPO-style objective on top cohorts:

L = -min(ratio * A, clip(ratio, 0.8, 1.2) * A)
    + β_kl * KL(π_new || π_old)
    - β_ent * H(π)

Expected Results

Training Curves (Mock Environment)

After 5000 iterations on CrampedRoom:

  • Mean Soups: 45-50 per episode
  • Reward Std: Stabilizes around 12-15 (healthy diversity)
  • Expert Specialization: >65% activation in task-relevant contexts
  • Sample Efficiency: Convergence in <5M environment steps

Diversity Metrics

  • Reward Diversity: σ ≈ 12-25 across cohorts
  • Action Entropy: ~1.6 (diverse action distribution)
  • Routing Entropy: ~1.3 (balanced expert usage)

Monitoring Training

Logs and plots are saved to logs/{experiment_name}/:

logs/iop_baseline/
├── metrics.jsonl          # Per-iteration metrics
├── summary.json           # Final summary
├── config.json            # Experiment configuration
└── plots/
    ├── training_curves.png
    └── expert_usage.png

Checkpoints saved to checkpoints/{experiment_name}/:

checkpoints/iop_baseline/
├── checkpoint_500.keras
├── checkpoint_1000.keras
...

Running Tests

Test individual components:

# Test SoftMoE layer
python src/models/soft_moe.py

# Test Gumbel action head
python src/models/gumbel_action_head.py

# Test complete agent
python src/models/overcooked_v2_agent.py

# Test IOP rollout
python src/training/iop_rollout.py

# Test GRPO trainer
python src/training/grpo_trainer.py

# Test logger
python src/utils/logger.py

Key Implementation Details

Overcooked V2 Observations (30 Channels)

Channel Layout:
  [0-5]:   Terrain (walls, counters, delivery, pot, dish)
  [6-10]:  Tomatoes (counter, pot, hand, soup, plate)
  [11-15]: Onions (counter, pot, hand, soup, plate)
  [16-20]: Plates (counter, hand, soup, delivered)
  [21-24]: Agent positions & orientations
  [25-27]: Recipe state (tomato/onion/dish)
  [28-29]: Stochastic indicators (NEW in V2!)

Perturbation Strategy

# Combined perturbation
hidden_perturbed = (
    hidden_base +
    Gaussian(0, σ) +                    # Random diversity
    ExpertBias(expert_config) * 0.3     # Strategic diversity
)

Action Masking (V2 Critical!)

# V2 provides action masks
mask = state.action_mask  # (2, 6)

# Apply in forward pass
logits = model(obs)
masked_logits = where(mask > 0, logits, -1e9)

Troubleshooting

Out of Memory

Reduce cohort size or episode length:

python train_iop.py --num_cohorts 4 --episode_steps 50

Poor Coordination

Increase expert perturbation strength or cohort diversity:

python train_iop.py --use_expert_perturbation --moe_temperature 0.3

Low Expert Specialization

Decrease MoE temperature (more focused routing):

python train_iop.py --moe_temperature 0.3

Citation

If you use this code, please cite:

@misc{iop_overcooked_v2_2025,
  title={Internal Other-Play for Zero-Shot Coordination in Overcooked V2},
  author={Prabakaran, Kavin, Priyanka},
  year={2025}
}

Future Work

  • Integrate real JaxMARL Overcooked V2 environment
  • Multi-layout training and zero-shot transfer evaluation
  • Recipe adaptation experiments (2-item vs 4-item soups)
  • Ablation studies (K=4/8/16 cohorts, with/without expert bias)
  • Comparison with population-based baselines (FCP, TrajeDi)
  • Human-AI coordination experiments

License

MIT License

About

MARL

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages