Skip to content

perf: fix granularity search to try k > 1 for MatMul subgraphs #15

@melroyanthony

Description

@melroyanthony

P0: Fix k=1 pathology in granularity search

Problem

The granularity search currently selects k=1 for mega-fused subgraphs containing MatMul ops. This creates K_full k-steps per spatial tile, each loading tiny strips (h1 + 1w elements) and doing minimal compute (base_cost/K_full). The result is pathologically memory-bound execution.

Example (Benchmark 1): K_full=512, k=1 → 512 k-steps per tile, each loading 544 elements. With k=512, only 1 k-step loading 278,528 elements — but total memory traffic is far lower because strips aren't reloaded 512 times.

Expected Impact

10-100x improvement on MatMul-heavy benchmarks (1, 9, 13).

Acceptance Criteria

  • Granularity search tries k values from K_full down to 1 (powers of 2)
  • For each (w, h, k) candidate, computes actual total latency (sum of per-step roofline)
  • Selects the (w, h, k) that minimizes total subgraph latency within OOM constraint
  • Track A (Rust) and Track B (Python) both updated
  • All 15 Rust tests still pass
  • Benchmark latencies improve vs current values

Current vs Expected

Benchmark Current Latency k used Expected after fix
1 27,443 1 Significantly lower
9 110,100 1 Significantly lower
13 191,693 1 Significantly lower

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or request

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions