**Status**: In Review |
**Created**: October 20, 2025 |
**Amended**: November 12, 2025 |
**Authors**: @Darktex, @pankit-eng, @jspisak, @zkwentz |
**RFC ID**: 001 |

## Amendment History

**November 12, 2025**: Added the two-interface model (MCP for agents, HTTP for operations), clarified the simulation layer, and added sections on event queues, state management, and "The Time Problem".

## Summary
This document defines what we call an "Environment", what its responsibilities are, and how we expect our customers to use our environments in their systems.

[…]

These are the key abstractions that we expect. Note that this project implements only the "Environment" abstraction in our sense; you can map to other "agent" or "environment" abstractions by writing adapters to and from OpenEnvs.

Key assumptions:
1. The Environment bundles everything needed for agent interaction: tools (MCP servers), sandboxing, code execution, reward computation, tasks/datasets, and evals. This packaging makes environments self-contained and reusable.
2. We hold the state of everything **external** to the agent in the Environment. For example, if your agent defines `a = 4` with an action and wants to read `a` some time in the future, the environment will persist the interpreter state and remember variable assignments (a minimal sketch follows this list).
3. We expect a _thin_ Agent abstraction around your model that holds the state of everything pertaining to your model, such as the conversation history, the tokenizer, etc.

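To make assumption 2 concrete, here is a minimal, hypothetical sketch (the `InterpreterState` name and shape are illustrative, not the OpenEnv API) of an environment holding a persistent interpreter namespace across agent actions:

```python
class InterpreterState:
    """Hypothetical holder of external state: a namespace that persists."""

    def __init__(self) -> None:
        self.namespace: dict = {}      # survives across agent actions

    def execute(self, code: str) -> None:
        exec(code, self.namespace)     # a real environment would sandbox this

state = InterpreterState()
state.execute("a = 4")                 # action 1 defines `a`
state.execute("b = a + 1")             # a later action still sees `a`
assert state.namespace["b"] == 5
```
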
```mermaid
flowchart TB
    subgraph outer["OUTER SYSTEM (RL Training Infrastructure)"]
        agent["Agent (Thin Wrapper)

        - Model/Policy
        - Tokenizer
        - Conversation History"]

        env["Environment (Docker Container)

        - MCP Servers
        - Sandbox
        - Code Execution
        - Reward Pipeline
        - External State
        - Task/Dataset Loader
        - Evals (aggregated)"]

        orchestration["RL Orchestration (Training Loop)

        - reset, step, get_state
        - Simulation control
        - Metrics and monitoring"]

        agent <-->|"MCP
        (Tool Calls)"| env
        orchestration -->|"HTTP
        (Orchestration)"| env
    end

    classDef agentBox fill:#e1f5ff,stroke:#333,stroke-width:2px
    classDef envBox fill:#fff4e1,stroke:#333,stroke-width:2px
    classDef orchBox fill:#f0f0f0,stroke:#333,stroke-width:2px

    class agent agentBox
    class env envBox
    class orchestration orchBox
```

**Key Interfaces:**
- **MCP (Agent ↔ Environment)**: agent-environment tool interaction (training AND production)
- **HTTP (Orchestration ↔ Environment)**: simulation control and operations (training AND production)

**Critical insight**: The Agent uses **MCP exclusively** to interact with the Environment. The HTTP interface is for orchestration (simulation control in training, operations in production), never for agent actions.

## Two Interfaces, Two Purposes

A central observation shapes OpenEnv's architecture: **environments expose two distinct interfaces**, serving fundamentally different purposes.

**1. MCP (Agent Interface)**
- Agent ↔ Environment tool interaction
- Present in training AND production
- Operations: tool calls (`search()`, `execute_sql()`, etc.)
- **This is the ONLY interface agents use** (see RFC 005)

**2. HTTP (Service/Operations Interface)**
- RL Orchestration ↔ Environment control
- Present in training AND production (with different purposes)
- Operations:
  - Training: `reset()`, `step()`, `get_state()` (simulation control)
  - Production: health checks, metrics, logs (operations)
- **Agents NEVER access this directly**

**Key principle**: MCP for agent actions, HTTP for orchestration. See RFC 002 for a detailed specification of how these interfaces work in practice, including graceful degradation from training to production.

**Special note**: Simulation control methods (`.reset()`, `.step()`) are **never** exposed as MCP tools. This ensures agents never learn that they can reset reality, which is critical for safe production deployment.

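To make the separation concrete, here is a minimal, hypothetical sketch (the class and attribute names below are illustrative, not the OpenEnv specification) of keeping the agent-facing tool surface disjoint from the orchestration surface:

```python
class DatabaseEnv:
    # Agent-facing operations: registered as MCP tools.
    MCP_TOOLS = ("search", "execute_sql")

    # Orchestration-only operations: served over HTTP, invisible to the agent.
    HTTP_ROUTES = ("reset", "step", "get_state", "health")

    def search(self, query: str) -> list:
        """MCP tool: read-only lookup against environment state."""
        raise NotImplementedError

    def reset(self) -> None:
        """HTTP-only: restore the initial snapshot (training loop only)."""
        raise NotImplementedError

# The invariant this RFC insists on: simulation control never leaks into
# the agent's tool surface.
assert not set(DatabaseEnv.MCP_TOOLS) & set(DatabaseEnv.HTTP_ROUTES)
```
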
## The Time Problem: Simulation vs. Production

A temporal asymmetry shapes our entire design:

**Simulation Time (Training/Eval)**:
- Time advances only when we say so (via `.step()`)
- The agent can "think" for arbitrary real-world time while the simulation is paused
- Environment state is frozen until the agent acts
- State can be reset to the initial snapshot indefinitely
- Code execution blocks execute atomically from the environment's perspective

**Real Time (Production)**:
- Time flows continuously
- Events arrive on their own schedule (people get hired *now*, not when the agent is ready)
- The agent must react with bounded latency
- There is no reset (it's the real world); deleting records is a one-way door
- There are no "turns" in the traditional sense, only a continuous stream of events

**Key insight**: You can simulate production (via event queues), but you can't "productionize" simulation (you can't pause reality).

This temporal duality drives the need for two distinct interfaces:
- **Simulation control** (HTTP): reset, step, reward computation (training/eval only)
- **Agent-environment interaction** (MCP): tool calls (training AND production)

**See RFC 006** for how we simulate production performance characteristics (latency, reliability) during training to minimize the training-production delta.
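
A toy sketch of the asymmetry (names here are hypothetical): in simulation, the clock is just a piece of environment state that only `.step()` advances, so wall-clock deliberation time is invisible to the world. In production there is no such object; real time keeps moving whether or not the agent acts.

```python
class SimClock:
    """Hypothetical simulation clock: time advances only on step()."""

    def __init__(self) -> None:
        self.now = 0.0        # simulated seconds

    def step(self, dt: float = 1.0) -> float:
        self.now += dt        # time moves only when orchestration says so
        return self.now

clock = SimClock()
# The agent may deliberate for minutes of wall-clock time here;
# the simulated world stays frozen until .step() is called.
assert clock.now == 0.0
clock.step()
assert clock.now == 1.0
```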

## Event Queues: A First-Class Abstraction

Environments fall into two categories:

1. **Static environments**: state changes only when the agent acts (chess, coding puzzles)
2. **Dynamic environments**: state changes independently of the agent (a database with external events, customer service)

We make the event queue a **first-class abstraction**:
- **Empty queue** = static environment
- **Populated queue** = dynamic environment with external events

```python
from typing import List

class Environment:
    def __init__(
        self,
        mode: str,                    # "sim" or "prod"
        mcp_servers: List[MCPServerConfig],
        event_queue: EventQueue,      # empty for static, populated for dynamic
        # ... remaining configuration elided ...
    ):
        self.event_queue = event_queue
        self.mode = mode
```

The event queue delivers external events (e.g., "new employee hired", "API request received") that change the environment state independently of agent actions. This enables realistic simulation of production scenarios where the world doesn't wait for the agent.
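
`EventQueue` is referenced above but not specified in this section; below is a minimal, hypothetical sketch of such a queue, where events carry a simulated-time stamp and orchestration drains whatever is due each time it advances the clock:

```python
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class Event:
    due: float                            # simulated time at which the event fires
    payload: dict = field(compare=False)  # excluded from ordering comparisons

class EventQueue:
    def __init__(self, events: list = ()) -> None:
        self._heap = list(events)         # an empty heap means a static environment
        heapq.heapify(self._heap)

    def due_events(self, now: float) -> list:
        """Pop and return every event whose timestamp has been reached."""
        fired = []
        while self._heap and self._heap[0].due <= now:
            fired.append(heapq.heappop(self._heap))
        return fired

queue = EventQueue([Event(2.0, {"type": "employee_hired", "name": "Ada"})])
assert queue.due_events(1.0) == []        # nothing fires before its timestamp
assert queue.due_events(2.0)[0].payload["type"] == "employee_hired"
```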

## State Management: Why It's Separate

**State** is a distinct concept from both **tools** and **data**:

1. **Not part of the dataset**: While datasets contain tasks, the initial state snapshot (e.g., database contents) is separate. Many different tasks can operate on the same state snapshot!

2. **Not part of MCP tools**: Tools query and mutate state, but state itself isn't defined by MCP. MCP only deals with the interface to state.

3. **Simulation-specific reset capability**: During training, we need the ability to reset state to its original snapshot. **Crucially**, the agent absolutely cannot trigger this reset; it is exclusively for the training loop via `.reset()` (HTTP). If the agent could reset state, it would learn that every error is recoverable, creating a huge training-production delta.

**Example**: Database maintenance environment
- Initial state: SQLite database with employee records
- The agent calls `execute_sql("DELETE FROM employees")` → receives a penalty in its reward
- The training loop calls `env.reset()` → database restored to the initial snapshot
- The agent learns not to delete records (because it can't undo the damage)

In production, there is no reset. The agent must live with the consequences of its actions.
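
A minimal sketch of the snapshot/reset mechanics above (the `DatabaseState` class and its layout are assumptions for illustration, not the OpenEnv API): a pristine in-memory snapshot is kept aside, and `reset()` copies it over the live database.

```python
import sqlite3

class DatabaseState:
    def __init__(self, seed_sql: str) -> None:
        # Pristine snapshot, never mutated after seeding.
        self.snapshot = sqlite3.connect(":memory:", isolation_level=None)
        self.snapshot.executescript(seed_sql)
        # Live database the agent's tools operate on.
        self.live = sqlite3.connect(":memory:", isolation_level=None)
        self.reset()

    def reset(self) -> None:
        """Training-loop only (HTTP .reset()); never exposed as an MCP tool."""
        self.snapshot.backup(self.live)   # overwrite live DB with the snapshot

state = DatabaseState("CREATE TABLE employees(name); "
                      "INSERT INTO employees VALUES ('Ada');")
state.live.execute("DELETE FROM employees")   # the agent's destructive action
assert state.live.execute("SELECT COUNT(*) FROM employees").fetchone()[0] == 0
state.reset()                                 # orchestration restores the snapshot
assert state.live.execute("SELECT COUNT(*) FROM employees").fetchone()[0] == 1
```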

## Python Interfaces

Below are the core Python interfaces that define the contract between agents and environments.

[…]

3. **PyTorch DataLoader compatibility**: `TaskDataset` follows the PyTorch `IterableDataset` interface (it implements `__iter__`), making it seamlessly compatible with PyTorch's `DataLoader` for streaming data, multiprocess loading, etc. This is ideal for sequential data access and large datasets (see the sketch after this list).

4. **Flexibility**: Environments can support both traditional tool calling (where each tool call is a separate action) and CodeAct (where an action contains code that may call multiple tools). See RFC 005 for details on the unified action interface, RFC 003 for traditional MCP integration, and RFC 004 for CodeAct.

5. **State ownership**: The Environment owns all external state (file system, interpreter state, tool outputs). The Agent owns internal state (conversation history, model hidden states, etc.).
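
To illustrate point 3, a minimal sketch: `ToyTaskDataset` below is a stand-in for the real `TaskDataset`, showing that any `IterableDataset` plugs directly into PyTorch's `DataLoader`.

```python
from torch.utils.data import DataLoader, IterableDataset

class ToyTaskDataset(IterableDataset):
    """Stand-in for TaskDataset: any IterableDataset works with DataLoader."""

    def __init__(self, tasks):
        self.tasks = tasks

    def __iter__(self):
        yield from self.tasks

loader = DataLoader(
    ToyTaskDataset([{"prompt": f"task {i}"} for i in range(8)]),
    batch_size=4,              # default collate groups matching dict fields
)
for batch in loader:
    print(batch["prompt"])     # ['task 0', 'task 1', 'task 2', 'task 3'], ...
```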