A complete OpenEnv-compliant environment for training AI agents to manage a business email inbox. The agent learns to triage, label, reply, route, and escalate emails — a task humans perform every working day.
Email triage is one of the most universal productivity bottlenecks in knowledge work. A skilled human triager must:
- Identify urgent issues (production outages, angry customers, security incidents)
- Ignore noise (newsletters, spam)
- Craft appropriate replies
- Route messages to the right team
- Escalate issues that require human decision-making
This environment lets an AI agent learn these skills from structured reward signals.
The agent receives an inbox of realistic business emails and must process them through a sequence of actions. Three tasks of increasing difficulty are provided, each scored on a deterministic rubric (0.0 – 1.0).
| Field | Type | Description |
|---|---|---|
| `action_type` | enum | `label` / `reply` / `archive` / `escalate` / `route` / `noop` |
| `email_id` | string | ID of the target email |
| `label` | enum | `urgent` / `high` / `normal` / `low` / `spam` |
| `reply_text` | string | Draft reply (minimum 20 characters for reward) |
| `route_to` | string | Department: `devops` / `legal` / `engineering` / `management` |
| `reason` | string | Optional explanation |
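For illustration, a few action payloads matching the schema above (the email IDs are placeholders):

```python
# Illustrative action payloads; field names follow the schema above,
# and the email IDs are placeholders.
label_action = {
    "action_type": "label",
    "email_id": "t1_e1",
    "label": "urgent",
}

reply_action = {
    "action_type": "reply",
    "email_id": "t1_e2",
    # Replies shorter than 20 characters earn no reward.
    "reply_text": "Thanks for flagging this; we are investigating now.",
    "reason": "Customer expects an acknowledgement",
}

route_action = {
    "action_type": "route",
    "email_id": "t1_e3",
    "route_to": "devops",
}
```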
| Field | Type | Description |
|---|---|---|
| `task_id` | string | Current task identifier |
| `step` | int | Steps taken so far |
| `emails` | list[Email] | Full inbox with sender, subject, body, timestamp |
| `inbox_size` | int | Number of emails |
| `done` | bool | Episode completion flag |
| `message` | string | Human-readable status |
| `metadata` | dict | `max_steps`, `actions_taken` |
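As a sketch of what a client sees, here is a hand-built observation with the fields above and a toy scan over the inbox (the email objects, their `id` field, and all values are assumptions for illustration):

```python
# A hand-built observation shaped like the schema above (illustrative values).
obs = {
    "task_id": "task1",
    "step": 0,
    "emails": [
        {"id": "t1_e1", "sender": "ops@example.com",
         "subject": "Production outage in eu-west", "body": "...",
         "timestamp": "2024-01-01T09:00:00Z"},
        {"id": "t1_e2", "sender": "news@example.com",
         "subject": "Weekly newsletter", "body": "...",
         "timestamp": "2024-01-01T09:05:00Z"},
    ],
    "inbox_size": 2,
    "done": False,
    "message": "Episode started",
    "metadata": {"max_steps": 10, "actions_taken": 0},
}

# Toy heuristic: flag emails whose subject mentions an outage.
urgent_ids = [e["id"] for e in obs["emails"] if "outage" in e["subject"].lower()]
print(urgent_ids)  # ['t1_e1']
```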
Rewards are provided at every step (intermediate signals) and a final grade is computed at episode end.
| Action | Reward |
|---|---|
| Correct urgent label | +0.05 |
| Correct non-urgent label | +0.02 |
| Wrong label | −0.02 |
| Good reply (needed) | +0.04 |
| Unnecessary reply | −0.03 |
| Correct route | +0.05 |
| Correct escalation | +0.05 |
| Unnecessary escalation | −0.03 |
| noop | −0.02 |
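The schedule above can be read as a simple lookup; summing it over an episode gives the intermediate return, which is separate from the final rubric grade. A sketch (the event names are made up for illustration):

```python
# Per-step reward schedule from the table above (event names are illustrative).
STEP_REWARDS = {
    "correct_urgent_label": 0.05,
    "correct_nonurgent_label": 0.02,
    "wrong_label": -0.02,
    "good_reply": 0.04,
    "unnecessary_reply": -0.03,
    "correct_route": 0.05,
    "correct_escalation": 0.05,
    "unnecessary_escalation": -0.03,
    "noop": -0.02,
}

# Example episode: two correct urgent labels, one good reply, one noop.
events = ["correct_urgent_label", "correct_urgent_label", "good_reply", "noop"]
total = sum(STEP_REWARDS[e] for e in events)
print(round(total, 2))  # 0.12
```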
Objective: Label 5 emails with the correct priority level.
Scoring: Weighted label accuracy. Urgent emails count 2×.
Expected score for a good agent: 0.75 – 1.0
Max steps: 10
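The 2× urgent weighting might be computed along these lines (a sketch only; the actual rubric lives in `env/graders.py`):

```python
def weighted_label_accuracy(results, urgent_weight=2.0):
    """results: list of (is_urgent, labeled_correctly) pairs."""
    total = sum(urgent_weight if urgent else 1.0 for urgent, _ in results)
    earned = sum(urgent_weight if urgent else 1.0 for urgent, ok in results if ok)
    return earned / total

# 5 emails: 1 urgent (labeled correctly) and 4 normal (3 correct, 1 wrong).
score = weighted_label_accuracy(
    [(True, True), (False, True), (False, True), (False, True), (False, False)]
)
print(round(score, 3))  # 0.833  (5 weighted points out of 6)
```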
Objective: Label 10 emails and draft replies to emails that require a response.
Scoring: 0.6 × label_accuracy + 0.4 × reply_coverage − 0.05 × unnecessary_replies
Expected score for a good agent: 0.55 – 0.80
Max steps: 25
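Plugging example numbers into the task 2 formula (the inputs are made up):

```python
label_accuracy = 0.8        # e.g. 8 of 10 labels correct
reply_coverage = 0.75       # e.g. 3 of 4 reply-worthy emails answered
unnecessary_replies = 1

score = 0.6 * label_accuracy + 0.4 * reply_coverage - 0.05 * unnecessary_replies
print(round(score, 2))  # 0.73
```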
Objective: Process 15 emails — label, reply, route to the correct department, and escalate compliance/critical issues.
Scoring: 0.40 × label + 0.35 × routing + 0.25 × escalation + reply_bonus
Critical emails (regulatory notices, partner outages) carry 3× label weight.
Expected score for a good agent: 0.40 – 0.70
Max steps: 40
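And the task 3 composite with example component scores (all inputs are made up, and the magnitude of `reply_bonus` is an assumption):

```python
label_score = 0.70       # weighted label accuracy (critical emails count 3x)
routing_score = 0.60
escalation_score = 0.50
reply_bonus = 0.05       # assumed small additive bonus for good replies

score = 0.40 * label_score + 0.35 * routing_score + 0.25 * escalation_score + reply_bonus
print(round(score, 3))  # 0.665
```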
Baseline scores, measured with `meta-llama/Llama-3.3-70B-Instruct` via the Hugging Face Inference API:
| Task | Baseline Score |
|---|---|
| task1 | ~0.70 |
| task2 | ~0.52 |
| task3 | ~0.41 |
| Average | ~0.54 |
```bash
git clone https://huggingface.co/spaces/<your-username>/email-triage-openenv
cd email-triage-openenv
pip install -r requirements.txt
uvicorn main:app --host 0.0.0.0 --port 7860 --reload
```

The API will be available at http://localhost:7860.
```bash
docker build -t email-triage-openenv .
docker run -p 7860:7860 \
  -e API_BASE_URL=https://router.huggingface.co/v1 \
  -e MODEL_NAME=meta-llama/Llama-3.3-70B-Instruct \
  -e HF_TOKEN=hf_your_token \
  email-triage-openenv
```

```bash
export API_BASE_URL=https://router.huggingface.co/v1
export MODEL_NAME=meta-llama/Llama-3.3-70B-Instruct
export HF_TOKEN=hf_your_token_here
python inference.py
```

```python
import requests

BASE = "http://localhost:7860"

# Start task 1
obs = requests.post(f"{BASE}/reset", json={"task_id": "task1"}).json()

# Take an action
result = requests.post(f"{BASE}/step", json={
    "action": {
        "action_type": "label",
        "email_id": "t1_e1",
        "label": "urgent"
    }
}).json()
print(result["reward"])

# Get final score
grade = requests.post(f"{BASE}/grade").json()
print(grade["final_score"])
```

| Method | Endpoint | Description |
|---|---|---|
| GET | `/health` | Health check (returns 200) |
| GET | `/` | Environment info |
| POST | `/reset` | Start a new episode: `{"task_id": "task1"}` |
| POST | `/step` | Take an action: `{"action": {...}}` |
| GET | `/state` | Current environment state |
| GET | `/tasks` | List all tasks with metadata |
| POST | `/grade` | Run the grader, get the final score |
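Putting the endpoints together, a minimal episode driver might look like the following sketch. The trivial "label everything normal" policy is illustrative, each email is assumed to carry an `id` field, and the injectable `post` argument is an assumption added here so the loop can be exercised without a live server:

```python
def run_episode(task_id, base="http://localhost:7860", post=None):
    """Drive one episode: reset, label every email, then grade.

    `post` is injectable so the loop can be tested without a server;
    by default it wraps requests.post against the local API.
    """
    if post is None:
        import requests  # default transport; assumes the server is running
        post = lambda path, payload: requests.post(f"{base}{path}", json=payload).json()

    obs = post("/reset", {"task_id": task_id})
    total_reward = 0.0
    for email in obs["emails"]:
        # Trivial policy for illustration: label everything "normal".
        result = post("/step", {"action": {
            "action_type": "label",
            "email_id": email["id"],
            "label": "normal",
        }})
        total_reward += result.get("reward", 0.0)
        if result.get("done"):
            break
    grade = post("/grade", {})
    return total_reward, grade["final_score"]
```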
| Variable | Description | Default |
|---|---|---|
| `API_BASE_URL` | LLM API endpoint | `https://router.huggingface.co/v1` |
| `MODEL_NAME` | Model identifier | `meta-llama/Llama-3.3-70B-Instruct` |
| `HF_TOKEN` | Hugging Face API key | (none) |
```
email-triage-openenv/
├── main.py              # FastAPI server
├── inference.py         # Baseline agent (OpenAI client)
├── openenv.yaml         # OpenEnv spec metadata
├── requirements.txt
├── Dockerfile
├── README.md
└── env/
    ├── __init__.py
    ├── email_env.py     # EmailTriageEnv class (step/reset/state)
    ├── models.py        # Pydantic models (Observation, Action, Reward)
    ├── graders.py       # Task graders (deterministic scoring)
    └── data.py          # Email dataset (15 realistic emails)
```
MIT