Commit cf69ef3

Comprehensive RFC revision: MCP-based architecture and modular design
This commit consolidates the evolution of the OpenEnv RFC structure, incorporating extensive feedback and iterative refinements:

- Refactored RFC 000 with clear project phases and architectural principles
- Enhanced RFC 001 with two-interface model (CodeAct + ToolCall) and simulation clarity
- Restructured RFC 002 with dual-mode patterns and clearer positioning
- Completely revised RFC 003 with MCP primer, progressive disclosure, and clearer interface separation
- Removed superseded RFCs (004-007 and FEEDBACK_DECISIONS.md) after porting content
- Added comprehensive diagrams (Mermaid) for MCP architecture visualization
- Archived previous RFC versions for historical reference
- Established MCP as the universal interface for environment interactions

The final state represents a clean, cohesive RFC suite focused on the MCP-based architecture with clear separation between CodeAct and ToolCall interfaces.
1 parent e7d6952 commit cf69ef3

9 files changed

+2773
-2295
lines changed

rfcs/000-project-phases.md

Lines changed: 30 additions & 4 deletions
Original file line number | Diff line number | Diff line change
@@ -1,10 +1,31 @@
1-
# RFC: OpenEnv layering
1+
# RFC: Design Principles and Broad Roadmap
22

33
**Status**: In Review
44
**Created**: 10/17/2025
5+
**Amended**: November 12, 2025
56
**Authors**: @Darktex, @pankit-eng, @jspisak, @zkwentz
67
**RFC ID:** 000
78

9+
## Amendment History
10+
11+
**November 12, 2025**: Added design principles, target audience, and updated roadmap to reference RFCs 005-007.
12+
13+
## Design Principles
14+
15+
These principles guide every decision in OpenEnv:
16+
17+
1. **Minimize deltas across lifecycle**: Training → Evals → Production should use identical tool interfaces
18+
2. **Minimize human-agent divergence**: Tools that work for humans should work for agents
19+
3. **Be hands-on**: Provide ready-to-use implementations, not just specs
20+
4. **Design for LLMs**: Optimize for context efficiency, in-distribution behavior, and token costs
21+
22+
## Target Audience
23+
24+
- **Environment builders**: Get reach across projects without custom adapters
25+
- **Model builders**: Access massive inventory of environments and tools
26+
- **Researchers**: Reproducible setups with versioned tools, rewards, and evals
27+
- **Infrastructure engineers**: Clear contracts enabling backend optimization
28+
829
## Summary
930
Before presenting concrete proposals, this RFC describes how we will approach this problem space, which problems we want to solve with this project, and how we plan to prioritize and solve them systematically.
1031

@@ -27,9 +48,14 @@ We will group development from now till version 1.0 into three phases.
2748
In the **first phase** of this project, we will focus **exclusively** on the narrowest definition of environments, without worrying about rewards or evals. Instead, the focus in this phase (and in the RFCs you find in this directory) is going to be on:
2849
1. Establishing a convention on what is an environment and where we draw the "environment" box (RFC 001).
2950
2. Landing the basics of _sandboxing_, _versioning_, _binary distribution_, _dependency management_ (RFC 002).
30-
3. Nailing our tools support through MCP (Model Context Protocol) integration for both remote and local tools (RFC 003).
31-
4. Defining a unified action interface for all environment types (RFC 004).
32-
5. Exploring RPC communication patterns beyond HTTP for long-running sessions (particularly for interpreted languages like Python, Bash, Ruby, etc.). Coming in an upcoming RFC.
51+
3. Nailing our tools support through MCP (Model Context Protocol) integration:
52+
- RFC 003: Traditional tool calling (ListToolsAction, CallToolAction)
53+
- RFC 004: CodeAct support (agents write executable code)
54+
- RFC 005: MCP as the universal interface (policy and rationale)
55+
4. Establishing tool registry and distribution patterns via Hugging Face Hub (upcoming RFC).
56+
5. Enabling production performance simulation to minimize training-production delta (RFC 006).
57+
6. Exploring MCP protocol interception for observability (RFC 007).
58+
7. Exploring RPC communication patterns beyond HTTP for long-running sessions (particularly for interpreted languages like Python, Bash, Ruby, etc.). Coming in an upcoming RFC.
3359

3460
We will conclude this phase with version 0.3.
3561

rfcs/001-abstractions.md

Lines changed: 144 additions & 29 deletions
Original file line number | Diff line number | Diff line change
@@ -2,9 +2,14 @@
22

33
**Status**: In Review
44
**Created**: 10/20/2025
5+
**Amended**: November 12, 2025
56
**Authors**: @Darktex, @pankit-eng, @jspisak, @zkwentz
67
**RFC ID:** 001
78

9+
## Amendment History
10+
11+
**November 12, 2025**: Added two-interface model (MCP for agents, HTTP for operations), simulation layer clarity, event queues, state management, and "The Time Problem" section.
12+
813
## Summary
914
This document defines what we call an "Environment", what its responsibilities are, and how we expect our customers to use our environments in their systems.
1015

@@ -65,40 +70,150 @@ This is the contract that we are proposing. We feel it strikes a good balance be
6570
These are the key abstractions that we expect. Note that in this project we only implement the "Environment" abstraction under our meaning. You can map to other "agents" or "environment" abstractions by writing adapters to and from OpenEnvs.
6671

6772
Key assumptions:
68-
1. We separate tasks from environments. While it is a good idea to package up a dataset with an environment and evals, we expect this wrapping to be done *outside* the env box. This allows for the reuse of environments across tasks.
73+
1. The Environment bundles everything needed for agent interaction: tools (MCP servers), sandboxing, code execution, reward computation, tasks/datasets, and evals. This packaging makes environments self-contained and reusable.
6974
2. We hold the state of everything **external** to the agent in the Environment. For example, if your agent defines `a = 4` with an action and wants to read `a` some time in the future, the environment will persist the interpreter state and remember variable assignments.
7075
3. We expect a _thin_ Agent abstraction around your model that holds the state of everything pertaining to your model, such as conversation history, tokenizer etc.
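
As a rough illustration of assumption 2, the sketch below keeps interpreter state on the environment side so that a variable set by one action is still readable by a later one. The class name `InterpreterState` and its `run`/`read` methods are hypothetical and not part of the RFC's interfaces; a real environment would run this inside its sandbox rather than with a bare `exec`.

```python
from typing import Any, Dict


class InterpreterState:
    """Hypothetical holder for interpreter state owned by the environment."""

    def __init__(self) -> None:
        # One namespace shared by every code action in the episode.
        self._namespace: Dict[str, Any] = {}

    def run(self, code: str) -> None:
        # Execute the agent's code against the persistent namespace.
        exec(code, self._namespace)

    def read(self, name: str) -> Any:
        return self._namespace[name]


state = InterpreterState()
state.run("a = 4")      # first action defines a variable
state.run("b = a + 1")  # a later action can still see it
print(state.read("a"), state.read("b"))
```

The agent itself holds none of this; it only sends code and receives observations.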
7176

77+
```mermaid
flowchart TB
    subgraph outer["OUTER SYSTEM (RL Training Infrastructure)"]
        agent["Agent (Thin Wrapper)

        - Model/Policy
        - Tokenizer
        - Conversation History"]

        env["Environment (Docker Container)

        - MCP Servers
        - Sandbox
        - Code Execution
        - Reward Pipeline
        - External State
        - Task/Dataset Loader
        - Evals (aggregated)"]

        orchestration["RL Orchestration (Training Loop)

        - reset, step, get_state
        - Simulation control
        - Metrics and monitoring"]

        agent <-->|"MCP
        (Tool Calls)"| env
        orchestration -->|"HTTP
        (Orchestration)"| env
    end

    classDef agentBox fill:#e1f5ff,stroke:#333,stroke-width:2px
    classDef envBox fill:#fff4e1,stroke:#333,stroke-width:2px
    classDef orchBox fill:#f0f0f0,stroke:#333,stroke-width:2px

    class agent agentBox
    class env envBox
    class orchestration orchBox
```
73-
┌──────────────────────────────────────────────────────────────────────────┐
74-
│ OUTER SYSTEM │
75-
│ │
76-
│ ┌──────────────────┐ ┌───────────────────────────┐ │
77-
│ │ Dataset/Task │ │ Agent │ │
78-
│ │ Loader │───────────────────>│ (thin wrapper) │ │
79-
│ │ │ Provides task │ │ │
80-
│ └──────────────────┘ │ • Model/Policy │ │
81-
│ │ • Tokenizer │ │
82-
│ ┌──────────────────┐ │ • Conversation History │ │
83-
│ │ Evals │ └───────┬───────────────────┘ │
84-
│ │ (data-dependent, │ │ ^ │
85-
│ │ aggregated) │ │ Action │ │
86-
│ └──────────────────┘ │ │Observation │
87-
│ v │ │
88-
│ ┌─────────────────┴───────────┐│
89-
│ │ Environment ││
90-
│ │ ││
91-
│ │ • Tools (MCP) ││
92-
│ │ • Sandbox (Docker) ││
93-
│ │ • Code Execution ││
94-
│ │ • Reward Pipeline ││
95-
│ │ • External State ││
96-
│ │ (e.g., interpreter vars) ││
97-
│ └─────────────────────────────┘│
98-
│ │
99-
└──────────────────────────────────────────────────────────────────────────┘
116+
117+
**Key Interfaces:**
118+
- **MCP (Agent ↔ Environment)**: Agent-environment tool interaction (training AND production)
119+
- **HTTP (Orchestration ↔ Environment)**: Simulation control + operations (training AND production)
120+
121+
122+
**Critical insight**: The Agent uses **MCP exclusively** to interact with the Environment. The HTTP interface is for orchestration (simulation control in training, operations in production), never for agent actions.
123+
124+
## Two Interfaces, Two Purposes
125+
126+
A critical insight shapes OpenEnv's architecture: **environments expose two distinct interfaces** serving fundamentally different purposes.
127+
128+
**1. MCP (Agent Interface)**
129+
- Agent ↔ Environment tool interaction
130+
- Present in training AND production
131+
- Operations: Tool calls (`search()`, `execute_sql()`, etc.)
132+
- **This is the ONLY interface agents use** (see RFC 005)
133+
134+
**2. HTTP (Service/Operations Interface)**
135+
- RL Orchestration ↔ Environment control
136+
- Present in training AND production (different purposes)
137+
- Operations:
138+
- Training: `reset()`, `step()`, `get_state()` (simulation control)
139+
- Production: Health checks, metrics, logs (operations)
140+
- **Agents NEVER access this directly**
141+
142+
**Key principle**: MCP for agent actions, HTTP for orchestration. See RFC 002 for detailed specification of how these interfaces work in practice, including graceful degradation from training to production.
143+
144+
**Special note**: Simulation control methods (`.reset()`, `.step()`) are **never** exposed as MCP tools. This ensures agents never learn they can reset reality—critical for safe production deployment.
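
A minimal sketch of this separation, assuming a hypothetical `SketchEnvironment` class: the names `list_mcp_tools`, `call_mcp_tool`, and the body of the `search` tool are illustrative and do not come from the RFC, and no real MCP or HTTP transport is involved. The point is only that the agent-facing registry contains tools, while `reset()` and `step()` live on a separate surface that the orchestrator calls.

```python
from typing import Any, Callable, Dict, List


class SketchEnvironment:
    """Hypothetical environment that keeps the two interfaces apart."""

    def __init__(self) -> None:
        self._step_count = 0
        # Agent-facing registry: only tools live here, never reset/step.
        self._tools: Dict[str, Callable[..., Any]] = {"search": self._search}

    # --- Agent interface (would be served over MCP) ---
    def list_mcp_tools(self) -> List[str]:
        return sorted(self._tools)

    def call_mcp_tool(self, name: str, **kwargs: Any) -> Any:
        if name not in self._tools:
            raise KeyError(f"unknown tool: {name}")
        return self._tools[name](**kwargs)

    # --- Orchestration interface (would be served over HTTP) ---
    def reset(self) -> None:
        self._step_count = 0

    def step(self) -> int:
        self._step_count += 1
        return self._step_count

    def _search(self, query: str) -> str:
        return f"results for {query!r}"


env = SketchEnvironment()
print(env.list_mcp_tools())  # the agent sees tools only
print(env.call_mcp_tool("search", query="openenv"))
```

Asking for `reset` through the tool interface fails with `KeyError`, which mirrors the rule that simulation control is never reachable from the agent side.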
145+
146+
## The Time Problem: Simulation vs Production
147+
148+
A critical insight that shapes our entire design:
149+
150+
**Simulation Time (Training/Eval)**:
151+
- Time only advances when we say so (via `.step()`)
152+
- Agent can "think" for arbitrary real-world time - simulation is paused
153+
- Environment state is frozen until agent acts
154+
- Can reset to initial state infinitely
155+
- Code execution blocks execute atomically from the environment's perspective
156+
157+
**Real Time (Production)**:
158+
- Time flows continuously
159+
- Events arrive on their own schedule (people get hired *now*, not when agent is ready)
160+
- Agent must react with bounded latency
161+
- Cannot reset (it's the real world). Deleting records is a one-way door.
162+
- No "turns" in the traditional sense - continuous stream of events
163+
164+
**Key insight**: You can simulate production (via event queues), but you can't "productionize" simulation (can't pause reality).
165+
166+
This temporal duality drives the need for two distinct interfaces:
167+
- **Simulation control** (HTTP): Reset, step, reward computation (training/eval only)
168+
- **Agent-environment interaction** (MCP): Tool calls (training AND production)
169+
170+
**See RFC 006** for how we simulate production performance characteristics (latency, reliability) during training to minimize the training-production delta.
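
To make the notion of simulation time concrete, here is a small sketch under the assumption of a hypothetical `SimClock` helper (not an OpenEnv class). Simulated time moves only when `step()` is called, so real-world time spent "thinking" leaves it untouched.

```python
import time


class SimClock:
    """Hypothetical simulated clock: time moves only when step() is called."""

    def __init__(self, tick_seconds: float = 1.0) -> None:
        self.now = 0.0
        self.tick_seconds = tick_seconds

    def step(self) -> float:
        # The training loop, not the wall clock, decides when time advances.
        self.now += self.tick_seconds
        return self.now

    def reset(self) -> None:
        self.now = 0.0


clock = SimClock(tick_seconds=5.0)
before_thinking = clock.now
time.sleep(0.01)                 # the agent "thinks" in real time
after_thinking = clock.now       # simulated time has not moved
clock.step()                     # the orchestrator advances the simulation
```

A production deployment has no equivalent of this object: the wall clock is the only clock, and it cannot be paused or reset.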
171+
172+
## Event Queues: First-Class Abstraction
173+
174+
Environments fall into two categories:
175+
176+
1. **Static environments**: State only changes when agent acts (chess, coding puzzles)
177+
2. **Dynamic environments**: State changes independently (database with external events, customer service)
178+
179+
We make the event queue a **first-class abstraction**:
180+
- **Empty queue** = static environment
181+
- **Populated queue** = dynamic environment with external events
182+
183+
```python
class Environment:
    def __init__(
        self,
        mode: str,  # "sim" or "prod"
        mcp_servers: List[MCPServerConfig],
        event_queue: EventQueue,  # Empty for static, populated for dynamic
        ...
    ):
        self.event_queue = event_queue
        self.mode = mode
```
101196

197+
The event queue delivers external events (e.g., "new employee hired", "API request received") that change the environment state independently of agent actions. This enables realistic simulation of production scenarios where the world doesn't wait for the agent.
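
The RFC does not spell out the `EventQueue` API at this point, so the following is only a sketch of one possible shape. The `Event` dataclass and the `is_static` and `pop_due` methods are assumptions made for illustration: events are ordered by a simulated due time, and an empty queue stands for a static environment.

```python
import heapq
from dataclasses import dataclass, field
from typing import Any, List, Optional


@dataclass(order=True)
class Event:
    due_at: float                        # simulated time at which the event fires
    payload: Any = field(compare=False)  # excluded from ordering


class EventQueue:
    """Hypothetical event queue; an empty queue models a static environment."""

    def __init__(self, events: Optional[List[Event]] = None) -> None:
        self._heap: List[Event] = list(events or [])
        heapq.heapify(self._heap)

    def is_static(self) -> bool:
        return not self._heap

    def pop_due(self, now: float) -> List[Event]:
        """Return, in time order, every event scheduled at or before `now`."""
        due = []
        while self._heap and self._heap[0].due_at <= now:
            due.append(heapq.heappop(self._heap))
        return due


static_queue = EventQueue()
dynamic_queue = EventQueue([
    Event(2.0, "API request received"),
    Event(1.0, "new employee hired"),
])
fired = [event.payload for event in dynamic_queue.pop_due(now=1.5)]
print(static_queue.is_static(), fired)
```

In simulation, the environment could drain due events on each `.step()`; in production the same events would simply arrive on their own schedule.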
198+
199+
## State Management: Why It's Separate
200+
201+
**State** is a distinct concept from both **tools** and **data**:
202+
203+
1. **Not part of the dataset**: While datasets contain tasks, the initial state snapshot (e.g., database contents) is separate. You can have many different tasks operate on the same state snapshot!
204+
205+
2. **Not part of MCP tools**: Tools query and mutate state, but state itself isn't defined by MCP. MCP only deals with the interface to state.
206+
207+
3. **Simulation-specific reset capability**: During training, we need the ability to reset state to its original snapshot. **Crucially**, the agent absolutely cannot trigger this reset—it's exclusively for the training loop via `.reset()` (HTTP). If the agent could reset state, it would learn that every error is recoverable, creating a huge training-production delta.
208+
209+
**Example**: Database maintenance environment
210+
- Initial state: SQLite database with employee records
211+
- Agent calls `execute_sql("DELETE FROM employees")` → receives penalty in reward
212+
- Training loop calls `env.reset()` → database restored to initial snapshot
213+
- Agent learns not to delete records (because it can't undo the damage)
214+
215+
In production, there is no reset. The agent must live with the consequences of its actions.
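
The database example can be sketched with Python's standard `sqlite3` module. The class name `DatabaseEnv`, the snapshot contents, and the reward rule are assumptions for illustration only; the RFC's actual reward pipeline is not specified here. `execute_sql` stands in for the agent-facing tool, while `reset` belongs to the training loop.

```python
import sqlite3

# Assumed initial snapshot; the real contents would come from the environment.
SNAPSHOT_SQL = """
CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT);
INSERT INTO employees (name) VALUES ('Ada'), ('Grace');
"""
INITIAL_ROWS = 2


class DatabaseEnv:
    """Hypothetical environment whose state is an in-memory SQLite database."""

    def __init__(self) -> None:
        self._conn = None
        self.reset()

    def reset(self) -> None:
        # Training loop only: rebuild the database from the snapshot.
        if self._conn is not None:
            self._conn.close()
        self._conn = sqlite3.connect(":memory:")
        self._conn.executescript(SNAPSHOT_SQL)

    def execute_sql(self, query: str) -> list:
        # Agent tool: queries and mutates state, but cannot restore it.
        rows = self._conn.execute(query).fetchall()
        self._conn.commit()
        return rows

    def row_count(self) -> int:
        return self._conn.execute("SELECT COUNT(*) FROM employees").fetchone()[0]

    def reward(self) -> float:
        # Illustrative rule: penalize any loss of employee records.
        return 0.0 if self.row_count() >= INITIAL_ROWS else -1.0


env = DatabaseEnv()
env.execute_sql("DELETE FROM employees")
reward_after_delete = env.reward()   # the damage is visible to the reward
env.reset()                          # only the training loop can undo it
rows_after_reset = env.row_count()
```

Because the snapshot lives outside both the task data and the tool definitions, many tasks can share it, and the agent has no code path that reaches `reset`.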
216+
102217
## Python Interfaces
103218

104219
Below are the core Python interfaces that define the contract between agents and environments.
@@ -444,7 +559,7 @@ for batch_of_tasks in dataloader:
444559

445560
3. **PyTorch DataLoader compatibility**: `TaskDataset` follows the PyTorch `IterableDataset` interface (implements `__iter__`), making it seamlessly compatible with PyTorch's `DataLoader` for streaming data, multiprocess loading, etc. This is ideal for sequential data access and large datasets.
446561

447-
4. **Flexibility**: Environments can support both traditional tool calling (where each tool call is a separate action) and CodeAct (where an action contains code that may call multiple tools). See RFC 004 for details on unified action interface and RFC 003 for MCP integration.
562+
4. **Flexibility**: Environments can support both traditional tool calling (where each tool call is a separate action) and CodeAct (where an action contains code that may call multiple tools). See RFC 005 for details on the unified action interface, RFC 003 for traditional MCP integration, and RFC 004 for CodeAct.
448563

449564
5. **State ownership**: The Environment owns all external state (file system, interpreter state, tool outputs). The Agent owns internal state (conversation history, model hidden states, etc.).
450565
