forked from yarongmu-google/MLSys
Status: Open
Labels: enhancement (New feature or request)
Description
P3: Tensor retention between subgraphs
Problem
With multiple subgraphs (from P1), boundary tensors are written to DRAM then re-read. Retaining tensors in fast memory across subgraph boundaries eliminates this round-trip cost (`2 * tensor_size / bandwidth`).
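To make the round-trip cost concrete, here is a minimal sketch of the formula above with hypothetical numbers (the 64 MiB tensor size and 1 TiB/s bandwidth are illustrative, not values from the project):

```python
# Hypothetical numbers: a 64 MiB boundary tensor over 1 TiB/s DRAM bandwidth.
tensor_size = 64 * 2**20   # bytes
bandwidth = 2**40          # bytes/sec (1 TiB/s)

# Round-trip cost: one write to DRAM plus one read back.
round_trip = 2 * tensor_size / bandwidth
print(f"{round_trip * 1e6:.1f} us saved per retained boundary tensor")
```

Retaining the tensor in fast memory avoids this entire round trip, at the price of occupying fast-memory capacity for the lifetime of the boundary.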
Current State
Retention logic exists (`optimizer/retention.rs`) but is never triggered because the mega-fusion strategy produces only 1 subgraph with 0 boundaries.
Acceptance Criteria
- After fusion decisions create multiple subgraphs, retention pass identifies candidates
- Retention is only applied when residual fast-memory capacity can hold the retained tensor at its full size
- Net latency improvement is validated: the DRAM round-trip saved exceeds the capacity cost of keeping the tensor resident
- Track A (Rust) and Track B (Python) both updated
- Verified against Example 3C pattern (4,638.4 latency)
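The criteria above can be sketched as a greedy candidate-selection pass. This is an assumption-laden illustration, not the actual `optimizer/retention.rs` or Track B code: `BoundaryTensor`, `plan_retention`, the `capacity_cost` field, and the bandwidth constant are all hypothetical names for this sketch.

```python
from dataclasses import dataclass

@dataclass
class BoundaryTensor:
    name: str
    size: int             # bytes
    capacity_cost: float  # latency penalty of reserving fast-memory space (hypothetical)

BANDWIDTH = 2**40  # bytes/sec; illustrative DRAM bandwidth, not a project constant

def plan_retention(tensors, residual_capacity):
    """Greedily retain boundary tensors while residual capacity allows
    and the DRAM round-trip saved exceeds the capacity cost."""
    retained = []
    # Consider the most profitable candidates first.
    for t in sorted(tensors,
                    key=lambda t: 2 * t.size / BANDWIDTH - t.capacity_cost,
                    reverse=True):
        saving = 2 * t.size / BANDWIDTH  # write + read avoided
        if t.size <= residual_capacity and saving > t.capacity_cost:
            retained.append(t.name)
            residual_capacity -= t.size  # tensor retained at full size
    return retained

# Usage: two candidates; after "a" is retained, "b" no longer fits.
ts = [BoundaryTensor("a", 48 * 2**20, 1e-5),
      BoundaryTensor("b", 32 * 2**20, 1e-5)]
print(plan_retention(ts, 64 * 2**20))  # → ['a']
```

The greedy ordering is one possible design choice; an exact pass could instead solve the capacity/saving trade-off as a small knapsack over the boundary tensors.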
Dependencies
Depends on #16 (cost-based fusion) — only relevant with multiple subgraphs.