Merge init-harness into main #41

Merged
debashishc merged 17 commits into main from init-harness
Feb 8, 2026
Conversation

@debashishc
Contributor

Sync main with the current init-harness baseline.

debashishc and others added 17 commits January 13, 2026 12:07
* fix: correct pytest -k examples in documentation

* feature: add ncu metric extraction script and update DEVELOPMENT to reflect the new changes

* style: fix ruff formatting in ncu_extract.py
Summary:
- remove reduce_sum variant plumbing (opinionated harness)
- add reduce_sum kernel placeholder (sum only)
- add reduce op kernel/ref/ops + tests + benchmarks
- update benchmark runner + suites
- update minimal docs

Tests:
- uv run pytest -q tests/test_reduce_sum.py tests/test_reduce.py

Refs #23
Refs #29
Refs #20
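As a rough illustration of the row-wise constraint the reduce_sum work enforces, here is a minimal reference sketch. The function name and the use of NumPy are hypothetical; the actual kernel/ref code lives in the repo's `forge_cute_py` ops and is exercised by `tests/test_reduce_sum.py`.

```python
import numpy as np

def reduce_sum_ref(x: np.ndarray, dim: int = -1) -> np.ndarray:
    """Hypothetical reference for a row-wise sum reduction.

    Mirrors the harness's constraint of reducing only over the last
    dimension (dim=-1, or dim=1 for a 2-D input).
    """
    if dim not in (-1, x.ndim - 1):
        raise ValueError("only row-wise (last-dim) reduction is supported")
    # Accumulate in float32 to bound error for low-precision inputs.
    return x.astype(np.float32).sum(axis=-1)

x = np.arange(6, dtype=np.float16).reshape(2, 3)
print(reduce_sum_ref(x))  # rows [0,1,2] -> 3.0, [3,4,5] -> 12.0
```

A correctness grid would then sweep shapes and dtypes against this reference.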
Summary:
- enforce row-wise reduce_sum (dim=-1/1)
- add correctness grid for reduce_sum
- add short/long reduce_sum benchmark suites

Tests:
- uv run pytest -q tests/test_reduce_sum.py

Depends on #35
Refs #29
- Modal remote benchmark execution on B200 GPUs
- Strict GPU matching, cost warnings, and kernelbot-inspired patterns
- bench/runner.py for serialization compatibility
- Updated docs with Modal usage and cost warnings
## Summary

- Adds `bench/modal_bench.py` for running benchmarks remotely on Modal
GPUs
- Uses strict GPU matching (`!` suffix) to prevent auto-upgrades (e.g.,
H100 → H200)
- Adds runtime and documentation warnings about Modal GPU costs
- Supports 12 GPU types: `any`, `b200`, `h200`, `h100`, `a100`,
`a100-40gb`, `a100-80gb`, `l40s`, `a10`, `a10g`, `l4`, `t4`

## Warning

> **This script incurs Modal GPU costs.** Review `bench/modal_bench.py`
and verify timeout/GPU settings before running. Start with `--suite
smoke` to validate your setup. You are responsible for any credits
consumed.

## Changes

- `bench/modal_bench.py`: Modal benchmark runner with strict GPU
matching and cost warnings
- `DEVELOPMENT.md`: Document supported GPUs, strict matching, and cost
warning
- `CHANGELOG.md`: Add entry for strict GPU matching
- `README.md`: Modal setup instructions (in previous commits)

## Test plan

- [ ] Run `modal run bench/modal_bench.py --suite smoke --gpu t4` to
verify warning appears
- [ ] Verify strict GPU matching prevents upgrades
- [ ] Confirm benchmark results are saved correctly with `--out`

Closes #28

🤖 Generated with [Claude Code](https://claude.com/claude-code)
- Add _print_table() for human-readable benchmark output
- Add reduce op to smoke suite
- Add reduce_short and reduce_long shape-sweep suites
- Fix _estimate_bytes to handle reduce op
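For context, byte estimation for a reduce op typically counts one full read of the input plus one write of the reduced output. A hypothetical sketch (the real `_estimate_bytes` lives in the benchmark runner and may model traffic differently):

```python
def estimate_reduce_bytes(m: int, n: int, dtype_bytes: int = 2) -> int:
    """Rough memory-traffic model for a row-wise reduce on an (m, n) input.

    Reads m*n input elements and writes m output elements. Ignores
    caching effects, so this is a lower bound used for GB/s estimates.
    """
    return (m * n + m) * dtype_bytes

# float16 (2 bytes per element), 1024 x 4096 input
print(estimate_reduce_bytes(1024, 4096))  # 8390656
```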

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
(cherry picked from commit 9eb0b84)
## Summary
- add `FORGE_SOFTMAX_IMPL` mode selection to `softmax_online` (`auto`,
`ref`, `kernel`)
- in `kernel` mode, fail fast with clear errors when kernel
module/entrypoints are missing
- keep `auto` mode contributor-friendly by falling back to the reference
implementation
- add tests for impl-mode behavior and invalid mode handling
- add `--impl` to `bench/benchmark_online_softmax.py` and remove the
hard N divisibility assertion
- make `bench/run.py` skip softmax cases cleanly when strict kernel mode
is unavailable
- document softmax impl mode usage in `DEVELOPMENT.md`
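The mode-selection behavior described above might look roughly like this. Function name and structure are hypothetical; the actual logic is in `forge_cute_py/ops/softmax_online.py`.

```python
import os

VALID_IMPLS = ("auto", "ref", "kernel")

def resolve_softmax_impl(kernel_available: bool) -> str:
    """Pick the softmax implementation from FORGE_SOFTMAX_IMPL.

    - "ref": always use the reference implementation
    - "kernel": require the kernel; fail fast if it is missing
    - "auto" (default): prefer the kernel, fall back to the reference
    """
    mode = os.environ.get("FORGE_SOFTMAX_IMPL", "auto").lower()
    if mode not in VALID_IMPLS:
        raise ValueError(
            f"invalid FORGE_SOFTMAX_IMPL={mode!r}; expected one of {VALID_IMPLS}"
        )
    if mode == "ref":
        return "ref"
    if mode == "kernel":
        if not kernel_available:
            raise RuntimeError(
                "FORGE_SOFTMAX_IMPL=kernel but kernel module/entrypoints are missing"
            )
        return "kernel"
    # auto: contributor-friendly fallback to the reference implementation
    return "kernel" if kernel_available else "ref"
```

Under this shape, `bench/run.py` can catch the strict-mode `RuntimeError` and skip softmax cases cleanly.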

## Validation
- `uv run ruff check forge_cute_py/ops/softmax_online.py
tests/test_softmax_online.py bench/benchmark_online_softmax.py
bench/run.py`
- `uv run ruff format forge_cute_py/ops/softmax_online.py
tests/test_softmax_online.py bench/benchmark_online_softmax.py
bench/run.py`
- `uv run pytest tests/test_softmax_online.py -q`
- `uv run pytest -q`
- `uv run python bench/benchmark_online_softmax.py --m-sizes 64
--n-sizes 256 --dtypes float16 --warmup 2 --iterations 5 --impl auto`
- `uv run python bench/benchmark_online_softmax.py --m-sizes 64
--n-sizes 256 --dtypes float16 --warmup 2 --iterations 5 --impl kernel`
- `FORGE_SOFTMAX_IMPL=kernel uv run python bench/run.py --suite smoke
--op softmax_online`

## Notes
- includes cherry-picked benchmark script commit authored by `jonah
<jsamost@gmail.com>` (`a9d4983`)
- follow-up docs-only patch will be opened separately
## Summary
- fix stale docs references to removed `CLAUDE.md`
- correct public API wording in `DEVELOPMENT.md`
(`forge_cute_py.ops.<op>()`)
- update README benchmark quick reference with softmax benchmark
commands and impl-mode notes
- clarify current release state in `CONTRIBUTING.md` (`v0.1.0-rc1`
exists; final `v0.1.0` pending)
- add changelog bullets for softmax impl-mode selection and benchmark
CLI updates

## Scope
- docs-only patch (`README.md`, `DEVELOPMENT.md`, `ROADMAP.md`,
`CHANGELOG.md`, `CONTRIBUTING.md`)
- no runtime code changes

## Notes
- follow-up to #39
@debashishc debashishc merged commit 57c2b5f into main Feb 8, 2026
2 checks passed
@debashishc debashishc deleted the init-harness branch February 8, 2026 12:18