perf: cost-based fusion instead of greedy merge-everything

## P1: Cost-based fusion decisions

### Problem
The current fusion strategy greedily merges ALL adjacent ops into a single mega-subgraph. This forces a single shared granularity (w, h, k) on all ops, which is suboptimal when ops have different optimal tile sizes (e.g., 4096x4096 MatMul followed by 128x128 Pointwise).

### Proposed Approach
Before merging two subgraphs, compare:
- \`latency(fused, best_shared_granularity)\`
- \`latency(A, best_gran_A) + latency(B, best_gran_B) + DRAM_boundary_cost\`

Only fuse if it actually reduces total latency.

### Expected Impact
2-5x improvement on heterogeneous graphs (benchmarks with mixed MatMul/Pointwise and varying tensor sizes).

### Acceptance Criteria
- [ ] Fusion decision compares fused vs split latency (not just memory feasibility)
- [ ] Each candidate merge evaluated with optimal granularity for both fused and split cases
- [ ] DRAM boundary transfer cost correctly estimated for split case
- [ ] Track A (Rust) and Track B (Python) both updated
- [ ] All tests pass; no regression on any benchmark

### Dependencies
Depends on #15 (k > 1 fix) — cost comparison requires correct latency calculation with proper k values.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf: cost-based fusion instead of greedy merge-everything #16

P1: Cost-based fusion decisions

Problem

Proposed Approach

Expected Impact

Acceptance Criteria

Dependencies

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

perf: cost-based fusion instead of greedy merge-everything #16

Description

P1: Cost-based fusion decisions

Problem

Proposed Approach

Expected Impact

Acceptance Criteria

Dependencies

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions