perf: recomputation support for diamond graph patterns #19

@melroyanthony

Description

P5: Recomputation for diamond graphs

Problem

When a tensor is consumed by two downstream branches (a diamond/skip-connection pattern), spilling it to DRAM costs one write plus one read per consumer, each transfer taking `size / bandwidth`. Recomputing the tensor (including its producer op in multiple subgraphs) may be cheaper if the compute cost is low relative to the DRAM cost.
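As a minimal sketch of the trade-off, assuming a simple bandwidth/compute cost model (function and parameter names here are hypothetical, not the repository's actual API):

```python
def spill_cost(tensor_size, bandwidth, num_consumers):
    # One DRAM write of the tensor, plus one DRAM read per consumer.
    return (1 + num_consumers) * tensor_size / bandwidth

def recompute_cost(producer_cost, num_consumers):
    # Run the producer op once per consuming subgraph;
    # the tensor stays ephemeral (never touches DRAM).
    return producer_cost * num_consumers

# Illustrative numbers only: a 1000-unit tensor, bandwidth 1,
# producer costing 400 per run, two consumers.
spill = spill_cost(1000, 1, 2)        # 3 transfers -> 3000
recompute = recompute_cost(400, 2)    # 2 runs      -> 800
```

With a cheap producer, recomputation wins; as `producer_cost` approaches the DRAM transfer cost, spilling becomes preferable again.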

Reference

PROBLEM.md Example 3B: recomputation achieves 6,276.8 latency vs 11,468.8 for spilling (45% improvement).

Proposed Approach

  1. Identify diamond patterns: tensor T produced by op A, consumed by ops B and C in different subgraphs
  2. Compare: spill cost (write T + read T twice) vs recompute cost (run A twice, T stays ephemeral)
  3. If recompute is cheaper, include op A in both subgraphs
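Step 1's pattern detection could be sketched as follows; the graph representation (plain dicts mapping tensors to producer/consumer ops, and ops to subgraph ids) is an assumption for illustration, not the repository's actual data structures:

```python
def find_diamonds(producer, consumers, subgraph_of):
    """Find tensors whose consumers live in different subgraphs.

    producer:    tensor -> op that produces it        (hypothetical layout)
    consumers:   tensor -> list of ops that read it
    subgraph_of: op -> subgraph id
    """
    diamonds = []
    for t, ops in consumers.items():
        # A diamond candidate: at least two consumers, spanning
        # at least two distinct subgraphs.
        if len(ops) >= 2 and len({subgraph_of[op] for op in ops}) >= 2:
            diamonds.append((producer[t], t, ops))
    return diamonds
```

Each returned `(A, T, [B, C])` triple then feeds the step 2 cost comparison; if recomputation wins, op A is emitted into both consuming subgraphs (step 3).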

Acceptance Criteria

  • Diamond pattern detection in DAG analysis
  • Cost comparison: spill vs recompute
  • Ops may appear in multiple subgraphs (already allowed in E2E validator)
  • Track A (Rust) and Track B (Python) both updated
  • Validated against Example 3B (6,276.8 latency target)

Dependencies

Depends on #16 (cost-based fusion): recomputation only applies once fusion has produced multiple subgraphs with boundaries between them.

Metadata

Labels: enhancement (New feature or request)
Assignees: none
Milestone: none