perf: cost-based fusion instead of greedy merge-everything #16

@melroyanthony

Description

P1: Cost-based fusion decisions

Problem

The current fusion strategy greedily merges ALL adjacent ops into a single mega-subgraph. This forces a single shared granularity (w, h, k) on all ops, which is suboptimal when ops have different optimal tile sizes (e.g., 4096x4096 MatMul followed by 128x128 Pointwise).
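To make the granularity-mismatch concrete, here is a toy model (all numbers invented, not from the repo's cost model) where latency grows as an op's tile size diverges from its sweet spot. Any single shared tile size pays a penalty relative to letting each op run at its own optimum:

```python
# Toy illustration only: op_latency, the work constants, and the tile-size
# candidates are all made up to show the shape of the problem, not taken
# from the actual cost model.

def op_latency(work, tile, best_tile):
    """Latency grows linearly with how far `tile` is from the op's optimum."""
    ratio = max(tile, best_tile) / min(tile, best_tile)
    return work * ratio

# Each op at its own optimal tile size:
separate = op_latency(100.0, 4096, 4096) + op_latency(10.0, 128, 128)

# One shared tile size forced on both ops (best over a few candidates):
shared = min(
    op_latency(100.0, t, 4096) + op_latency(10.0, t, 128)
    for t in (128, 512, 4096)
)
```

In this toy model `shared` is several times larger than `separate`, which is exactly the regime where fusing everything into one subgraph loses.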

Proposed Approach

Before merging two subgraphs, compare:

  • `latency(fused, best_shared_granularity)`
  • `latency(A, best_gran_A) + latency(B, best_gran_B) + DRAM_boundary_cost`

Only fuse if it actually reduces total latency.
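The decision rule above could be sketched roughly like this (names are illustrative; `latency`, `best_granularity`, and `dram_boundary_cost` stand in for the existing cost model, which this issue does not define):

```python
# Hypothetical sketch of the cost-based fusion decision. The three callables
# are placeholders for the real cost model; subgraphs are stand-ins too
# (here, merging is plain concatenation).

def should_fuse(a, b, latency, best_granularity, dram_boundary_cost):
    """Fuse subgraphs `a` and `b` only if it lowers total latency."""
    fused = a + b  # stand-in for the real subgraph-merge operation
    fused_cost = latency(fused, best_granularity(fused))
    split_cost = (latency(a, best_granularity(a))
                  + latency(b, best_granularity(b))
                  + dram_boundary_cost(a, b))
    return fused_cost < split_cost
```

The key point is that both branches are evaluated at their own best granularity, so the comparison is optimum-vs-optimum rather than comparing the fused graph at some shared granularity against an arbitrary split.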

Expected Impact

An estimated 2-5x latency improvement on heterogeneous graphs (benchmarks mixing MatMul/Pointwise ops with varying tensor sizes).

Acceptance Criteria

  • Fusion decision compares fused vs split latency (not just memory feasibility)
  • Each candidate merge evaluated with optimal granularity for both fused and split cases
  • DRAM boundary transfer cost correctly estimated for split case
  • Track A (Rust) and Track B (Python) both updated
  • All tests pass; no regression on any benchmark
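For the third criterion, the split-case boundary cost is essentially the round trip of the intermediate tensor through DRAM: subgraph A writes it out and subgraph B reads it back. A minimal estimator might look like the following; the bandwidth figure and element size are placeholder assumptions, not values from the repo:

```python
# Illustrative DRAM boundary cost estimate. bytes_per_element and the
# bandwidth default are assumptions for the sketch (fp16 tensors, ~1 TB/s),
# not constants from the actual cost model.

def dram_boundary_cost_s(num_elements, bytes_per_element=2,
                         dram_bw_bytes_per_s=1.0e12):
    """Seconds spent writing the boundary tensor to DRAM and reading it back."""
    tensor_bytes = num_elements * bytes_per_element
    return 2 * tensor_bytes / dram_bw_bytes_per_s  # one write + one read
```

Whatever form the real estimator takes, it should count both the write and the read, since a fused schedule avoids both.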

Dependencies

Depends on #15 (k > 1 fix) — cost comparison requires correct latency calculation with proper k values.

Metadata

Labels: enhancement (New feature or request)