Official implementation of AstroReason-Bench: Evaluating Unified Agentic Planning across Heterogeneous Space Planning Problems.
AstroReason-Bench is a comprehensive benchmark for evaluating agentic planning in astronautics mission design and planning. It integrates multiple scheduling regimes under a unified agent-oriented interface with strict physical constraints.
Five distinct planning challenges, each enforcing physical constraints on orbital mechanics, power budgets, data storage, and slew kinematics:
- SatNet - Deep Space Network resource allocation
- Revisit Optimization - Minimize time gaps for continuous target monitoring
- Regional Coverage - Maximize area coverage using strip-imaging satellites
- Stereo Imaging - Schedule synchronized observation pairs for 3D reconstruction
- Latency Optimization - Manage LEO constellation for integrated sensing and communications
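For instance, the revisit objective reduces to minimizing the longest interval during which a target goes unobserved. A simplified version of that metric is sketched below; the benchmark's actual scoring may weight or normalize gaps differently:

```python
def max_revisit_gap(times: list[float], horizon: tuple[float, float]) -> float:
    """Largest interval (s) a target goes unobserved within the horizon.

    `times` are observation epochs in seconds. This is an illustrative
    metric, not the benchmark's exact scoring function.
    """
    start, end = horizon
    epochs = sorted(t for t in times if start <= t <= end)
    if not epochs:
        # Never observed: the whole horizon is one gap
        return end - start
    bounds = [start, *epochs, end]
    return max(b - a for a, b in zip(bounds, bounds[1:]))
```

A revisit plan is better when this value is smaller across all monitored targets.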
This benchmark suite is under active development in the dev branch. The current implementation in the main branch is a work-in-progress snapshot that we are continuously improving:
- Backend Transition: The benchmark currently relies on the astrox web API for orbital computations. We plan to migrate to local computation using established libraries for better reliability.
- Interface Exploration: Evaluating whether predefined MCP tools and Python APIs are optimal, or whether agents should interact with computational libraries directly through code.
- Benchmark Expansion: Actively designing better organizational structures for benchmarks and expanding to cover more diverse space missions.
- Baseline Performance: Current baselines are initial implementations for verification purposes. We plan to add more carefully tuned baseline algorithms for each problem in the future.
- Python 3.12+
- Claude Code (required - agentic LLM interface)
- uv (required - manages environments and builds sandboxes)
- bubblewrap (optional, enables filesystem isolation):
```bash
# Debian/Ubuntu
sudo apt install bubblewrap
# Arch Linux
sudo pacman -S bubblewrap
# Fedora
sudo dnf install bubblewrap
```
```bash
# Clone the repository with submodules
git clone --recurse-submodules https://github.com/your-org/astro-reason.git
cd astro-reason

# If you already cloned without submodules, initialize them:
# git submodule update --init --recursive

# Create virtual environment and install dependencies
uv sync --all-groups

# Activate the environment (required for all subsequent commands)
source .venv/bin/activate              # bash/zsh
# or: source .venv/bin/activate.fish   # fish

# Build sandbox environments (required before running benchmarks)
bash src/benchmark/build_sandbox.sh
bash src/satnet_agent/build_sandbox.sh
```

Note: The build scripts use `uv pip install --python` to install dependencies with shebangs pointing to `.venv/bin/python3`. Always activate the virtual environment before building or running benchmarks.
```bash
export ANTHROPIC_API_KEY="..."   # Claude
export DEEPSEEK_API_KEY="..."    # DeepSeek
export DASHSCOPE_API_KEY="..."   # Qwen
```

Evaluate agentic LLM systems on benchmarks:
```bash
# Single case evaluation
python src/benchmark/run_benchmark.py \
  --benchmark revisit-optimization \
  --case case_0001 \
  --model anthropic::claude-sonnet-4-5-20250929

# All cases in a benchmark
python src/benchmark/run_benchmark.py \
  --benchmark stereo-imaging \
  --all \
  --model anthropic::claude-sonnet-4-5-20250929

# Interactive mode (for close inspection and observation)
python src/benchmark/run_benchmark.py \
  --benchmark regional-coverage \
  --case case_0001 \
  --model anthropic::claude-sonnet-4-5-20250929 \
  --interactive

# Filesystem isolation and resource limits
python src/benchmark/run_benchmark.py \
  --benchmark latency-optimization \
  --case case_0001 \
  --bwrap \
  --cpu-quota 800% \
  --memory-limit 16G \
  --model deepseek::deepseek-chat
```

Available benchmarks: `revisit-optimization`, `stereo-imaging`, `latency-optimization`, `regional-coverage`
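The `--model` flag takes identifiers of the form `provider::model`, as in the examples above. A minimal parser for this convention might look like the following; the runner's actual parsing logic may differ:

```python
def parse_model_id(model_id: str) -> tuple[str, str]:
    """Split a 'provider::model' identifier into (provider, model).

    Illustrative helper only; the provider names mirror the README
    examples and are not an exhaustive list.
    """
    provider, sep, model = model_id.partition("::")
    if not sep or not provider or not model:
        raise ValueError(f"expected 'provider::model', got {model_id!r}")
    return provider, model
```

For example, `parse_model_id("deepseek::deepseek-chat")` yields `("deepseek", "deepseek-chat")`.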
SatNet uses a separate runner:
```bash
# Run SatNet Week 40
python src/satnet_agent/run_benchmark.py \
  --week 40 \
  --model anthropic::claude-sonnet-4-5-20250929

# Run all weeks
python src/satnet_agent/run_benchmark.py \
  --all \
  --model anthropic::claude-sonnet-4-5-20250929

# Interactive mode
python src/satnet_agent/run_benchmark.py \
  --week 40 \
  --model anthropic::claude-sonnet-4-5-20250929 \
  --interactive

# Filesystem isolation and resource limits
python src/satnet_agent/run_benchmark.py \
  --week 40 \
  --model anthropic::claude-sonnet-4-5-20250929 \
  --bwrap \
  --memory-limit 16G \
  --cpu-quota 800%
```

Available weeks: 10, 20, 30, 40, 50
Run the test suite to verify installation and environment setup:
```bash
# Run all tests
pytest

# Run specific test file
pytest tests/test_mcp_server.py

# Run with verbose output
pytest -v

# Run specific benchmark tests
pytest tests/test_scenario_satnet.py
```

```bash
# Run all benchmarks with Claude Sonnet 4.5
for benchmark in revisit-optimization stereo-imaging latency-optimization regional-coverage; do
  python src/benchmark/run_benchmark.py \
    --benchmark $benchmark \
    --bwrap --memory-limit 16G --cpu-quota 800% \
    --all \
    --model anthropic::claude-sonnet-4-5-20250929 \
    --timeout 7200
done

# Run SatNet weeks
python src/satnet_agent/run_benchmark.py \
  --bwrap --memory-limit 16G --cpu-quota 800% \
  --all \
  --model anthropic::claude-sonnet-4-5-20250929 \
  --timeout 7200
```

Each benchmark case includes:
```
src/dataset/<benchmark>/cases/<case_id>/
├── mission_brief.md     # Natural language task description
├── manifest.json        # Case metadata and configuration
├── requirements.yaml    # Mission-specific requirements
├── satellites.yaml      # Satellite constellation definition
├── stations.yaml        # Ground station locations
├── targets.yaml         # Observation targets
└── initial_plan.json    # Empty/template plan
```
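To give a feel for consuming this layout, here is a minimal loader for the machine-readable parts of a case. The manifest schema shown in the demo is hypothetical; only the file names come from the layout above:

```python
import json
import tempfile
from pathlib import Path

def load_case(case_dir: Path) -> dict:
    """Load a benchmark case's manifest and mission brief.

    Reads manifest.json and mission_brief.md per the case layout above.
    The manifest's internal schema is an assumption for illustration.
    """
    case = {"id": case_dir.name}
    case["manifest"] = json.loads((case_dir / "manifest.json").read_text())
    brief = case_dir / "mission_brief.md"
    if brief.exists():
        case["brief"] = brief.read_text()
    return case

# Demo with a synthetic case directory (contents are made up)
with tempfile.TemporaryDirectory() as tmp:
    case_dir = Path(tmp) / "case_0001"
    case_dir.mkdir()
    (case_dir / "manifest.json").write_text(
        json.dumps({"benchmark": "revisit-optimization"})
    )
    (case_dir / "mission_brief.md").write_text("# Mission\nMinimize revisit gaps.")
    case = load_case(case_dir)
```

The YAML files (`requirements.yaml`, `satellites.yaml`, etc.) can be loaded the same way with a YAML parser.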
Four-layer design:
- Physics Layer - SGP4 propagation, slew kinematics, resource modeling (stateless)
- Scenario Layer - State management, action registry, persistence (stateful)
- Interface Layer - MCP tools + Python API
- Cognitive Layer - LLM agent (ReAct loop via Claude Code)
Agents use MCP tools for exploration and Python scripts for bulk optimization.
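As an illustration of the kind of stateless computation the Physics Layer performs, here is a minimal trapezoidal slew-time model. The rate and acceleration defaults are illustrative, not the benchmark's actual spacecraft constants:

```python
import math

def slew_time(angle_deg: float, max_rate: float = 1.0, accel: float = 0.5) -> float:
    """Time (s) to slew through `angle_deg` with a trapezoidal rate profile.

    max_rate in deg/s, accel in deg/s^2. Parameter values are placeholders
    for illustration; the benchmark's kinematic model may differ.
    """
    angle = abs(angle_deg)
    # Angle covered ramping up to max_rate and back down: 2 * v^2 / (2a)
    ramp_angle = max_rate ** 2 / accel
    if angle <= ramp_angle:
        # Triangular profile: max_rate is never reached
        return 2.0 * math.sqrt(angle / accel)
    # Trapezoidal profile: ramp up, cruise at max_rate, ramp down
    cruise = (angle - ramp_angle) / max_rate
    return 2.0 * max_rate / accel + cruise
```

A scheduler can use such a function to check whether two consecutive observations leave enough time for the required attitude change.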
If you use AstroReason-Bench in your research, please cite:
```bibtex
@article{wang2026astroreason,
  title={AstroReason-Bench: Evaluating Unified Agentic Planning across Heterogeneous Space Planning Problems},
  author={Weiyi Wang and Xinchi Chen and Jingjing Gong and Xuanjing Huang and Xipeng Qiu},
  year={2026},
  eprint={2601.11354},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2601.11354},
}
```

This benchmark integrates the SatNet scheduling problem:
```bibtex
@inproceedings{goh2021satnet,
  title={SatNet: A benchmark for satellite scheduling optimization},
  author={Goh, Edwin and Venkataram, Hamsa Shwetha and Balaji, Bharathan and Wilson, Brian D and Johnston, Mark D},
  booktitle={AAAI-22 Workshop on Machine Learning for Operations Research (ML4OR)},
  year={2021}
}
```

Benchmark datasets are derived from the following sources:
- TLE orbital data: CelesTrak
- City locations: World cities database (CC BY 4.0)
- Ground stations: Ground Station Dataset (MIT License)
Note: Satellite parameters other than orbital elements (e.g., power budgets, data storage, slew rates) are fictional or represent typical values for benchmark purposes.
This project is licensed under the MIT License - see the LICENSE file for details.