ship 0.5.0: bug-fix sweep and foundational agent capabilities

### Description

Master tracking issue for the **0.5.0** release. Scope is deliberately narrow: the security and correctness fixes from the recent code review, plus the foundational LLM capabilities that transform AgentLoom from a "YAML workflow orchestrator with LLM calls" into a framework that can express real agents.

Current version: **0.4.0** → Target: **0.5.0**.

The version jump is justified by the scope:

- **Security**: closing the router RCE and sandbox bypass surface (#104, #105).
- **Correctness**: gateway resilience, record/replay race, DAG skip propagation, provider adapter normalization, state lock bypass, subworkflow observability (#106 through #111). With the two security fixes, that makes eight bug-fix issues in total.
- **Cost correctness**: reasoning tokens (o1/o3/Claude thinking) are not counted in cost accounting today; fixing the accounting changes reported costs.
- **Foundational LLM primitives**: tool calling, structured output, embeddings, conversation history, Agent primitive — none of which exist today, and all of which are prerequisites for any modern agent workflow.
- **Observability schema**: align with OTel GenAI semantic conventions, centralize span/metric names, and expose custom dimensions needed by downstream trace consumers.
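
To make the cost-correctness point concrete, here is a minimal Python sketch of how counting reasoning tokens can change a reported cost. Everything in it is hypothetical: the `Usage` fields, the `cost_usd` helper, and the prices are made up for illustration and are not AgentLoom's API or any provider's real pricing. It assumes reasoning tokens are reported as a separate count and billed at the output rate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Usage:
    """Hypothetical per-call token usage; field names are illustrative."""
    prompt_tokens: int
    completion_tokens: int
    reasoning_tokens: int = 0  # assumed to be reported separately


def cost_usd(usage: Usage, input_per_mtok: float, output_per_mtok: float,
             include_reasoning: bool = True) -> float:
    """Return the call cost in USD given per-million-token prices."""
    output_tokens = usage.completion_tokens
    if include_reasoning:
        # Assumption: reasoning tokens are billed at the output rate.
        output_tokens += usage.reasoning_tokens
    return (usage.prompt_tokens * input_per_mtok
            + output_tokens * output_per_mtok) / 1_000_000


# Made-up numbers, purely for illustration.
usage = Usage(prompt_tokens=1_000, completion_tokens=500, reasoning_tokens=4_000)
before = cost_usd(usage, input_per_mtok=2.0, output_per_mtok=8.0,
                  include_reasoning=False)
after = cost_usd(usage, input_per_mtok=2.0, output_per_mtok=8.0)
print(f"without reasoning tokens: ${before:.4f}")
print(f"with reasoning tokens:    ${after:.4f}")
```

With these invented inputs the reported cost rises once reasoning tokens are included, which is why #127 is treated as a cost-correctness bug rather than a feature.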

Work deferred to 0.6.x or later is listed under "What is deliberately not in 0.5.0" below.

### How to use this issue

- Each phase is a coherent body of work with shared dependencies.
- Issues within a phase are parallelizable unless explicitly noted.
- GitHub renders the task lists below as tracked tasks: checkboxes update automatically when each linked issue closes, and the parent shows aggregate progress.

---

## Phase 0 — Bug-fix sweep and cost correctness

The bug-fix backlog from the security and correctness review, plus the reasoning-tokens cost bug. Everything downstream depends on a clean baseline.

- [x] #104 fix router expression sandbox to prevent arbitrary code execution
- [x] #105 harden tool sandbox against command, path, and URL-scheme bypasses
- [x] #106 fix gateway resilience: CB/RL ordering, stream cancellation, retry jitter, rate-limiter edge cases
- [x] #107 fix record/replay: concurrent write race, streaming capture, hash key coverage
- [x] #108 fix DAG skip propagation, parallel cancellation, and budget overshoot
- [x] #109 normalize provider adapters: kwargs, 429, streaming usage, pricing, connection cleanup
- [x] #110 fix state manager lock bypass, template rendering gaps, and approval gate flow
- [x] #111 wire subworkflow observability, checkpoint serialization, observer surface, metrics cardinality, webhook blocking
- [x] #127 add reasoning tokens tracking (o1, o3, Claude extended thinking) _(cost-correctness bug)_

**Parallelization:** all 9 simultaneously. Suggested split — security duo (#104, #105), providers/resilience trio (#106, #107, #109), core trio (#108, #110, #111), correctness solo (#127).

---

## Phase 1 — Internal hygiene refactor

Foundation work for the LLM-primitive phase — collaborator protocols and shared retry primitives so subsequent issues (#116 tool calling, #120 Agent) have stable seams to plug into.

- [x] #112 add typing protocols, retryable status codes, and shared retry primitives

**Sequencing:** single PR. Scope is the typing protocols, heapq topo sort, retry-primitive sharing, and `retryable_status_codes`. The collaborator-class split (`LayerRunner` / `RouterActivation` / `StepRetryLoop` / `BudgetEnforcer`) is deferred — the current `engine.py` size is acceptable until #116 / #120 prove a refactor is needed.

**Parallelization with Phase 0:** can start once #108 lands — the typing work uses the iterative DFS rewrite as its baseline.
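
As a rough illustration of what a shared retry primitive keyed on `retryable_status_codes` could look like, here is a minimal Python sketch. It is not the design from #112: the status-code set, the `HasStatus` protocol, `backoff_delay`, and `call_with_retry` are hypothetical names chosen for this example, and full-jitter exponential backoff is just one common strategy.

```python
import random
import time
from dataclasses import dataclass
from typing import Callable, Optional, Protocol

# Hypothetical default set; the real `retryable_status_codes` may differ.
RETRYABLE_STATUS_CODES = frozenset({429, 500, 502, 503, 504})


class HasStatus(Protocol):
    """Structural type: anything exposing an integer `status_code`."""
    status_code: int


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 8.0,
                  rng: Optional[random.Random] = None) -> float:
    """Full jitter: uniform in [0, min(cap, base * 2**attempt)]."""
    rng = rng or random.Random()
    return rng.uniform(0.0, min(cap, base * (2 ** attempt)))


def call_with_retry(send: Callable[[], HasStatus], max_attempts: int = 3,
                    sleep: Callable[[float], None] = time.sleep) -> HasStatus:
    """Call `send` up to `max_attempts` times while the status is retryable."""
    response = send()
    for attempt in range(max_attempts - 1):
        if response.status_code not in RETRYABLE_STATUS_CODES:
            break
        sleep(backoff_delay(attempt))
        response = send()
    return response


@dataclass
class FakeResponse:
    """Stand-in response used only for this example."""
    status_code: int


# Two retryable failures followed by a success; sleeping is stubbed out.
responses = iter([FakeResponse(503), FakeResponse(429), FakeResponse(200)])
result = call_with_retry(lambda: next(responses), max_attempts=3,
                         sleep=lambda _: None)
print(result.status_code)  # 200
```

Typing `send` against a `Protocol` rather than a concrete response class is the kind of seam the phase description refers to: any provider adapter whose response exposes `status_code` can share the same loop.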

---

## Phase 2 — Observability foundation

Stabilize the observability contract before adding features that emit data. Every feature landed after this phase will emit with the correct schema from day one — no second migration later. This phase also unblocks downstream trace consumers that need custom dimensions like version or run tags.

- [x] #125 expand OTel span schema, align with GenAI semantic conventions, centralize names _(adds run_id propagation, prompt metadata, tool-call details, provider-level child spans)_
- [x] #77 add experiment metadata logging per execution _(custom workflow labels propagate to all spans and metrics)_
- [x] #59 attach quality scores to OTel trace spans

**Parallelization:** #125 lands the `schema.py` module first; #77 and #59 follow once the namespace exists.

---

## Phase 3 — LLM core primitives

The three capabilities AgentLoom currently lacks structurally. Foundation for every agentic workflow.

- [ ] #116 add native tool/function calling with streaming and parallel-call support
- [ ] #117 add structured output (JSON schema, response_format) across providers
- [ ] #118 add embeddings API across providers

**Parallelization:** 3 independent PRs in parallel. #116 is the largest (likely 4 sub-PRs: one per provider plus the unified surface).

---

## Phase 4 — Composition primitives

Step types and conversation primitive. Most are independent of Phase 3; #119 strictly depends on #116.

- [ ] #38 add bounded retry loops with feedback _(loop step — required by the agent loop pattern)_
- [ ] #39 add evaluator step type for output quality assessment
- [ ] #43 add explicit fan-out / fan-in and map operations
- [ ] #119 add conversation history primitive with token-budget trimming and multi-agent threads _(depends on #116)_

**Parallelization:** #38, #39, #43 in parallel from day one. #119 starts when #116 closes.
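
For readers unfamiliar with token-budget trimming (the core of #119), the following Python sketch shows one possible policy: keep system messages, then keep the newest remaining messages that fit. The `Message` type, `count_tokens`, and `trim_history` are hypothetical, the word-count tokenizer is a crude stand-in for a real one, and the actual primitive may use a different policy entirely.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Message:
    role: str
    content: str


def count_tokens(message: Message) -> int:
    """Crude stand-in for a real tokenizer: counts whitespace-separated words."""
    return len(message.content.split())


def trim_history(messages: List[Message], budget: int) -> List[Message]:
    """Keep system messages, then the newest other messages that fit in `budget`."""
    system = [m for m in messages if m.role == "system"]
    remaining = budget - sum(count_tokens(m) for m in system)
    kept: List[Message] = []
    for message in reversed([m for m in messages if m.role != "system"]):
        cost = count_tokens(message)
        if cost > remaining:
            break  # stop at the first message that does not fit, so no gaps appear
        kept.append(message)
        remaining -= cost
    # Restore chronological order; system messages are placed first.
    return system + kept[::-1]


history = [
    Message("system", "You are terse."),
    Message("user", "What is the capital of France?"),
    Message("assistant", "Paris."),
    Message("user", "And of Italy?"),
]
trimmed = trim_history(history, budget=8)
print([m.role for m in trimmed])  # ['system', 'assistant', 'user']
```

Stopping at the first message that does not fit keeps the retained window contiguous, at the cost of sometimes leaving budget unused.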

---

## Phase 5 — Agent + typed state + templating

The capability layer. #120 is the deliverable that makes AgentLoom express real agents end-to-end: system prompt plus tools plus memory plus a bounded decision loop plus typed output.

- [ ] #128 add Pydantic-typed state schemas with validation on reads and writes _(depends on #117)_
- [ ] #129 add safe conditional, loop, and filter constructs to template engine _(pairs with #128, shares AST validator with #104)_
- [ ] #120 add Agent high-level primitive (system prompt + tools + memory + loop policy) _(closes Phase 5; depends on #116, #117, #119, #38, #39)_

**Parallelization:** #128 and #129 in parallel; #120 last.

---

## Phase 6 — Testing infrastructure

Primitives that make workflows testable at scale: deterministic failure replay, stress-test chaos injection, cross-provider benchmarking, and a deterministic inter-agent replay contract.

- [ ] #123 add MockProvider fault modes for deterministic failure replay _(depends on #106, #107; complements #62)_
- [ ] #62 add chaos/fault injection testing mode _(complements #123)_
- [ ] #124 add benchmark mode to compare workflow across N provider configs _(depends on #118 for output similarity)_
- [ ] #135 add deterministic inter-agent message record/replay with JSONL spec and verifier CLI _(depends on #107)_

**Parallelization:** #62 and #124 immediately. #123 after Phase 0 (#106, #107). #135 after #107.

---

## Cross-phase dependency map

```
Phase 0 (parallel sweep)
  |
  +--> Phase 1 (#112)
  |
  +--> Phase 2 (#125 -> #77, #59)
  |      |
  |      +--> Phase 3 (#116, #117, #118 parallel)
  |             |
  |             +--> #119 (#116)
  |             +--> #128 (#117)
  |             +--> #124 (#118)
  |
  +--> Phase 4 (parallel: #38, #39, #43)
  |
  +--> Phase 5 (#128, #129 parallel; #120 closes)
  |
  +--> Phase 6 (#62, #124 anytime; #123 and #135 after Phase 0)
```
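
The issue-level "depends on" notes from the phase lists can also be checked mechanically. The sketch below uses Python's standard `graphlib` to group issues into waves that can run in parallel. It encodes only the explicit issue-to-issue dependencies stated above, not the phase-level ordering, and the `waves` helper is a name invented for this example.

```python
from graphlib import TopologicalSorter

# Issue -> issues it depends on, copied from the notes in the phase lists.
DEPENDS_ON = {
    "#77": {"#125"},
    "#59": {"#125"},
    "#119": {"#116"},
    "#128": {"#117"},
    "#120": {"#116", "#117", "#119", "#38", "#39"},
    "#123": {"#106", "#107"},
    "#124": {"#118"},
    "#135": {"#107"},
}


def waves(graph):
    """Group nodes into waves; each node's dependencies sit in earlier waves."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    result = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        result.append(ready)
        sorter.done(*ready)
    return result


for index, wave in enumerate(waves(DEPENDS_ON)):
    print(index, wave)
```

Under these edges #120 lands alone in the final wave, which matches its role as the issue that closes Phase 5.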

## What is deliberately not in 0.5.0

The deferred items below are still tracked, but they do not block the 0.5.0 release:

**Provider catalog (defer to 0.6.0):**

- #16 DeepSeek
- #46 Azure OpenAI
- #47 AWS Bedrock
- #48 OpenAI-compatible endpoint
- #67 MCP

**Production scale beyond cost correctness (defer to 0.6.x):**

- #126 prompt caching
- #131 distributed budget
- #69 Redis rate limiting
- #115 Redis/S3 checkpointers
- #130 hedging/coalescing
- #51 response cache
- #50 strategy-based selection
- #49 multi-key round-robin
- #12 pre-flight estimation
- #132 soft alerts + semantic retry
- #56 Grafana dashboards
- #57 AlertManager rules
- #65 agentloom serve

**Security hardening for regulated industries (defer to 0.6.x or 0.7.x):**

- #52 prompt injection hooks
- #53 access control
- #54 secrets management
- #55 audit logging

**DX polish (defer to 0.6.x+):**

- #64 interactive debugger
- #45 workflow versioning
- #66 lint
- #60 agentloom test CLI
- #63 snapshot golden testing
- #44 workflow composition

**Step types and primitives not on the 0.5.0 critical path:**

- #121 transform/embed/moderate/delay
- #122 A2A protocol
- #74 logprob in router
- #13 router priority
- #33 subworkflow inheritance
- #10 selective state merge

## What 0.5.0 unlocks

After 0.5.0 ships:

- **AgentLoom is production-deployable**: all known critical bugs closed, cost reporting correct.
- **Agent workflows are expressible natively**: tool calling, structured output, conversation state, and the Agent primitive cover the modern agent pattern end-to-end.
- **Comparative benchmarks are possible**: `agentloom bench` enables side-by-side evaluation across provider configs with unified reporting.
- **Observability is contract-stable**: schemas centralized in `schema.py`, `gen_ai.*` aligned, custom dimensions propagating. External consumers can build trace-parsing code that survives future releases without breakage.

The 34 deferred issues listed above remain valuable but are not on the critical path for those four outcomes.

## Issue inventory (27 total)

| Phase | Count | Issues |
|---|---|---|
| 0 — Bug fix sweep | 9 | #104, #105, #106, #107, #108, #109, #110, #111, #127 |
| 1 — Internal hygiene | 1 | #112 |
| 2 — Observability | 3 | #125, #77, #59 |
| 3 — LLM core | 3 | #116, #117, #118 |
| 4 — Composition | 4 | #38, #39, #43, #119 |
| 5 — Agent + typing | 3 | #128, #129, #120 |
| 6 — Testing infra | 4 | #123, #62, #124, #135 |

**Total in 0.5.0: 27 issues** (8 critical bugs + 1 cost bug + 1 refactor + 17 features).
**Deferred to 0.6.x+: 34 issues.**
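
The 0.5.0 totals can be re-derived from the table with a few lines of Python, which is handy when phase membership changes. This is a throwaway check, not part of any AgentLoom tooling; the `PHASES` mapping simply mirrors the table rows.

```python
# Issue numbers per phase, copied from the inventory table.
PHASES = {
    0: [104, 105, 106, 107, 108, 109, 110, 111, 127],
    1: [112],
    2: [125, 77, 59],
    3: [116, 117, 118],
    4: [38, 39, 43, 119],
    5: [128, 129, 120],
    6: [123, 62, 124, 135],
}

all_issues = [n for issues in PHASES.values() for n in issues]
# Guard against an issue being listed under two phases.
assert len(all_issues) == len(set(all_issues)), "an issue appears in two phases"

total = len(all_issues)
bug_fixes = len(PHASES[0])  # 8 critical bugs plus the #127 cost bug
refactors = len(PHASES[1])
features = sum(len(PHASES[p]) for p in (2, 3, 4, 5, 6))
print(total, bug_fixes, refactors, features)  # 27 9 1 17
```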

## Notes

- This is a **release-tracking issue**, not an implementation issue. Each child issue retains its own discussion, scope, and PR linkage.
- Progress is visible at the top of this issue once any child closes.
- Adjust phase membership freely as scope clarifies — this is a living plan, not a contract.
- If a deferred issue turns out to be blocking a consumer after all, promote it into the appropriate phase; if a 0.5.0 issue turns out to be smaller than expected, ship it early.
