forked from yarongmu-google/MLSys
-
Notifications
You must be signed in to change notification settings - Fork 0
Closed
Labels
enhancementNew feature or requestNew feature or request
Description
P0: Fix k=1 pathology in granularity search
Problem
The granularity search currently selects k=1 for mega-fused subgraphs containing MatMul ops. This creates K_full k-steps per spatial tile, each loading tiny strips (h1 + 1w elements) and doing minimal compute (base_cost/K_full). The result is pathologically memory-bound execution.
Example (Benchmark 1): K_full=512, k=1 → 512 k-steps per tile, each loading 544 elements. With k=512, only 1 k-step loading 278,528 elements — but total memory traffic is far lower because strips aren't reloaded 512 times.
Expected Impact
10-100x improvement on MatMul-heavy benchmarks (1, 9, 13).
Acceptance Criteria
- Granularity search tries k values from K_full down to 1 (powers of 2)
- For each (w, h, k) candidate, computes actual total latency (sum of per-step roofline)
- Selects the (w, h, k) that minimizes total subgraph latency within OOM constraint
- Track A (Rust) and Track B (Python) both updated
- All 15 Rust tests still pass
- Benchmark latencies improve vs current values
Current vs Expected
| Benchmark | Current Latency | k used | Expected after fix |
|---|---|---|---|
| 1 | 27,443 | 1 | Significantly lower |
| 9 | 110,100 | 1 | Significantly lower |
| 13 | 191,693 | 1 | Significantly lower |
Reactions are currently unavailable
Metadata
Metadata
Assignees
Labels
enhancementNew feature or requestNew feature or request