Internal Other-Play for Overcooked V2

Zero-shot coordination learning through internal diversity and GRPO (Group Relative Policy Optimization).

Overview

This repository implements the Internal Other-Play (IOP) framework for training cooperative multi-agent systems without population-based training. A single agent learns to coordinate by simulating diverse "internal personalities" through hidden state perturbations.

Key Features

SoftMoE-GRU Architecture: Specialized experts for different coordination strategies (gathering, cooking, recipe reasoning, partner modeling)
Gumbel-Softmax Actions: Differentiable discrete action sampling for gradient-based learning
GRPO Training: Ranking-based policy optimization learning from top-performing cohorts
Overcooked V2 Optimized: Handles 30-channel observations, stochastic recipes, and POMDP environments

Architecture

Input (4×5×30 obs)
    ↓
CNN Encoder (128-dim)
    ↓
SoftMoE (4 experts, 256-dim)
    ↓
GRU (256-dim) ← Hidden state perturbation for IOP
    ↓
Gumbel Action Head (6 actions)

Expert Specialization

Expert 0: Ingredient gathering
Expert 1: Cooking coordination
Expert 2: Recipe reasoning (V2 stochastic recipes)
Expert 3: Partner inference and spatial coordination

Installation

# Create environment
uv venv
source .venv/bin/activate

# Install dependencies
uv pip install -r requirements.txt

Quick Start

Basic Training

python train_iop.py \
    --experiment_name iop_baseline \
    --num_iterations 5000 \
    --num_cohorts 8 \
    --top_k 4 \
    --use_expert_perturbation

Configuration Options

Environment:

--layout: Overcooked V2 layout (default: cramped_room)
--max_steps: Max steps per episode (default: 400)

Architecture:

--encoder_dim: Encoder output dimension (default: 128)
--moe_num_experts: Number of MoE experts (default: 4)
--gru_hidden_dim: GRU hidden state dimension (default: 256)
--action_temperature: Gumbel-Softmax temperature (default: 1.0)

GRPO Hyperparameters:

--learning_rate: Learning rate (default: 3e-4)
--num_cohorts: Number of IOP cohorts K (default: 8)
--top_k: Number of top cohorts for learning (default: 4)
--clip_ratio: PPO clip ratio (default: 0.2)
--kl_coef: KL divergence coefficient (default: 0.1)
--entropy_coef: Entropy bonus coefficient (default: 0.01)

IOP Settings:

--use_expert_perturbation: Use MoE-aligned perturbations (recommended)
--no_expert_perturbation: Use only Gaussian perturbations

Project Structure

multiagentRL/
├── src/
│   ├── models/
│   │   ├── soft_moe.py              # SoftMoE layer implementation
│   │   ├── gumbel_action_head.py    # Gumbel-Softmax action head
│   │   └── overcooked_v2_agent.py   # Complete agent model
│   ├── training/
│   │   ├── iop_rollout.py           # IOP cohort generation
│   │   └── grpo_trainer.py          # GRPO training loop
│   └── utils/
│       ├── mock_overcooked_v2.py    # Mock environment (for testing)
│       └── logger.py                # Experiment logging
├── train_iop.py                     # Main training script
├── requirements.txt                 # Dependencies
└── README.md                        # This file

How IOP Works

1. Cohort Generation (Internal Diversity)

For each training iteration, generate K=8 cohorts with perturbed hidden states:

# Cohort k gets:
# - Gaussian noise: σ ∈ [0.05, 0.25]
# - Expert bias: Prefer specific expert(s)

Cohort 0: σ=0.05, Expert 0 (Gathering)
Cohort 1: σ=0.10, Expert 1 (Cooking)
Cohort 2: σ=0.15, Expert 2 (Recipe)
Cohort 3: σ=0.20, Expert 3 (Partner)
Cohort 4: σ=0.10, Experts 0+1 (Gather+Cook)
...

2. GRPO Ranking

Rank cohorts by total episode reward and select top K=4:

advantages = (rewards - mean(rewards)) / std(rewards)
top_cohorts = argsort(advantages)[:4]

3. Policy Update

Update using PPO-style objective on top cohorts:

L = -min(ratio * A, clip(ratio, 0.8, 1.2) * A)
    + β_kl * KL(π_new || π_old)
    - β_ent * H(π)

Expected Results

Training Curves (Mock Environment)

After 5000 iterations on CrampedRoom:

Mean Soups: 45-50 per episode
Reward Std: Stabilizes around 12-15 (healthy diversity)
Expert Specialization: >65% activation in task-relevant contexts
Sample Efficiency: Convergence in <5M environment steps

Diversity Metrics

Reward Diversity: σ ≈ 12-25 across cohorts
Action Entropy: ~1.6 (diverse action distribution)
Routing Entropy: ~1.3 (balanced expert usage)

Monitoring Training

Logs and plots are saved to logs/{experiment_name}/:

logs/iop_baseline/
├── metrics.jsonl          # Per-iteration metrics
├── summary.json           # Final summary
├── config.json            # Experiment configuration
└── plots/
    ├── training_curves.png
    └── expert_usage.png

Checkpoints saved to checkpoints/{experiment_name}/:

checkpoints/iop_baseline/
├── checkpoint_500.keras
├── checkpoint_1000.keras
...

Running Tests

Test individual components:

# Test SoftMoE layer
python src/models/soft_moe.py

# Test Gumbel action head
python src/models/gumbel_action_head.py

# Test complete agent
python src/models/overcooked_v2_agent.py

# Test IOP rollout
python src/training/iop_rollout.py

# Test GRPO trainer
python src/training/grpo_trainer.py

# Test logger
python src/utils/logger.py

Key Implementation Details

Overcooked V2 Observations (30 Channels)

Channel Layout:
  [0-5]:   Terrain (walls, counters, delivery, pot, dish)
  [6-10]:  Tomatoes (counter, pot, hand, soup, plate)
  [11-15]: Onions (counter, pot, hand, soup, plate)
  [16-20]: Plates (counter, hand, soup, delivered)
  [21-24]: Agent positions & orientations
  [25-27]: Recipe state (tomato/onion/dish)
  [28-29]: Stochastic indicators (NEW in V2!)

Perturbation Strategy

# Combined perturbation
hidden_perturbed = (
    hidden_base +
    Gaussian(0, σ) +                    # Random diversity
    ExpertBias(expert_config) * 0.3     # Strategic diversity
)

Action Masking (V2 Critical!)

# V2 provides action masks
mask = state.action_mask  # (2, 6)

# Apply in forward pass
logits = model(obs)
masked_logits = where(mask > 0, logits, -1e9)

Troubleshooting

Out of Memory

Reduce cohort size or episode length:

python train_iop.py --num_cohorts 4 --episode_steps 50

Poor Coordination

Increase expert perturbation strength or cohort diversity:

python train_iop.py --use_expert_perturbation --moe_temperature 0.3

Low Expert Specialization

Decrease MoE temperature (more focused routing):

python train_iop.py --moe_temperature 0.3

Citation

If you use this code, please cite:

@misc{iop_overcooked_v2_2025,
  title={Internal Other-Play for Zero-Shot Coordination in Overcooked V2},
  author={Prabakaran, Kavin, Priyanka},
  year={2025}
}

Future Work

Integrate real JaxMARL Overcooked V2 environment
Multi-layout training and zero-shot transfer evaluation
Recipe adaptation experiments (2-item vs 4-item soups)
Ablation studies (K=4/8/16 cohorts, with/without expert bias)
Comparison with population-based baselines (FCP, TrajeDi)
Human-AI coordination experiments

License

MIT License

Name		Name	Last commit message	Last commit date
Latest commit History 4 Commits
src		src
.gitattributes		.gitattributes
.gitignore		.gitignore
IMPLEMENTATION_SUMMARY.md		IMPLEMENTATION_SUMMARY.md
LICENSE		LICENSE
README.md		README.md
requirements.txt		requirements.txt
setup_keras_jax.py		setup_keras_jax.py
test_setup.py		test_setup.py
train_iop.py		train_iop.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Internal Other-Play for Overcooked V2

Overview

Key Features

Architecture

Expert Specialization

Installation

Quick Start

Basic Training

Configuration Options

Project Structure

How IOP Works

1. Cohort Generation (Internal Diversity)

2. GRPO Ranking

3. Policy Update

Expected Results

Training Curves (Mock Environment)

Diversity Metrics

Monitoring Training

Running Tests

Key Implementation Details

Overcooked V2 Observations (30 Channels)

Perturbation Strategy

Action Masking (V2 Critical!)

Troubleshooting

Out of Memory

Poor Coordination

Low Expert Specialization

Citation

Future Work

License

About

Uh oh!

Releases

Packages

Languages

License

prabakaranc98/multiagentRL

Folders and files

Latest commit

History

Repository files navigation

Internal Other-Play for Overcooked V2

Overview

Key Features

Architecture

Expert Specialization

Installation

Quick Start

Basic Training

Configuration Options

Project Structure

How IOP Works

1. Cohort Generation (Internal Diversity)

2. GRPO Ranking

3. Policy Update

Expected Results

Training Curves (Mock Environment)

Diversity Metrics

Monitoring Training

Running Tests

Key Implementation Details

Overcooked V2 Observations (30 Channels)

Perturbation Strategy

Action Masking (V2 Critical!)

Troubleshooting

Out of Memory

Poor Coordination

Low Expert Specialization

Citation

Future Work

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages