Merge init-harness into main #41

Merged
debashishc merged 17 commits into main from init-harness
Feb 8, 2026
Conversation

@debashishc
Contributor

Sync main with the current init-harness baseline.

debashishc and others added 17 commits January 13, 2026 12:07
* fix: correct pytest -k examples in documentation

* feature: add ncu metric extraction script and update DEVELOPMENT to reflect the new changes

* style: fix ruff formatting in ncu_extract.py
Summary:
- remove reduce_sum variant plumbing (opinionated harness)
- add reduce_sum kernel placeholder (sum only)
- add reduce op kernel/ref/ops + tests + benchmarks
- update benchmark runner + suites
- update minimal docs

Tests:
- uv run pytest -q tests/test_reduce_sum.py tests/test_reduce.py

Refs #23
Refs #29
Refs #20
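As a rough illustration of the row-wise constraint the reduce_sum work enforces, here is a minimal reference sketch. The function name and the use of NumPy are hypothetical; the actual kernel/ref code lives in the repo's `forge_cute_py` ops and is exercised by `tests/test_reduce_sum.py`.

```python
import numpy as np

def reduce_sum_ref(x: np.ndarray, dim: int = -1) -> np.ndarray:
    """Hypothetical reference for a row-wise sum reduction.

    Mirrors the harness's constraint of reducing only over the last
    dimension (dim=-1, or dim=1 for a 2-D input).
    """
    if dim not in (-1, x.ndim - 1):
        raise ValueError("only row-wise (last-dim) reduction is supported")
    # Accumulate in float32 to bound error for low-precision inputs.
    return x.astype(np.float32).sum(axis=-1)

x = np.arange(6, dtype=np.float16).reshape(2, 3)
print(reduce_sum_ref(x))  # rows [0,1,2] -> 3.0, [3,4,5] -> 12.0
```

A correctness grid would then sweep shapes and dtypes against this reference.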
Summary:
- enforce row-wise reduce_sum (dim=-1/1)
- add correctness grid for reduce_sum
- add short/long reduce_sum benchmark suites

Tests:
- uv run pytest -q tests/test_reduce_sum.py

Depends on #35
Refs #29
- Modal remote benchmark execution on B200 GPUs
- Strict GPU matching, cost warnings, and kernelbot-inspired patterns
- bench/runner.py for serialization compatibility
- Updated docs with Modal usage and cost warnings
## Summary

- Adds `bench/modal_bench.py` for running benchmarks remotely on Modal
GPUs
- Uses strict GPU matching (`!` suffix) to prevent auto-upgrades (e.g.,
H100 → H200)
- Adds runtime and documentation warnings about Modal GPU costs
- Supports 12 GPU types: `any`, `b200`, `h200`, `h100`, `a100`,
`a100-40gb`, `a100-80gb`, `l40s`, `a10`, `a10g`, `l4`, `t4`

## Warning

> **This script incurs Modal GPU costs.** Review `bench/modal_bench.py`
and verify timeout/GPU settings before running. Start with `--suite
smoke` to validate your setup. You are responsible for any credits
consumed.

## Changes

- `bench/modal_bench.py`: Modal benchmark runner with strict GPU
matching and cost warnings
- `DEVELOPMENT.md`: Document supported GPUs, strict matching, and cost
warning
- `CHANGELOG.md`: Add entry for strict GPU matching
- `README.md`: Modal setup instructions (in previous commits)

## Test plan

- [ ] Run `modal run bench/modal_bench.py --suite smoke --gpu t4` to
verify warning appears
- [ ] Verify strict GPU matching prevents upgrades
- [ ] Confirm benchmark results are saved correctly with `--out`

Closes #28

🤖 Generated with [Claude Code](https://claude.com/claude-code)
- Add _print_table() for human-readable benchmark output
- Add reduce op to smoke suite
- Add reduce_short and reduce_long shape-sweep suites
- Fix _estimate_bytes to handle reduce op
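For context, byte estimation for a reduce op typically counts one full read of the input plus one write of the reduced output. A hypothetical sketch (the real `_estimate_bytes` lives in the benchmark runner and may model traffic differently):

```python
def estimate_reduce_bytes(m: int, n: int, dtype_bytes: int = 2) -> int:
    """Rough memory-traffic model for a row-wise reduce on an (m, n) input.

    Reads m*n input elements and writes m output elements. Ignores
    caching effects, so this is a lower bound used for GB/s estimates.
    """
    return (m * n + m) * dtype_bytes

# float16 (2 bytes per element), 1024 x 4096 input
print(estimate_reduce_bytes(1024, 4096))  # 8390656
```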

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
(cherry picked from commit 9eb0b84)
## Summary
- add `FORGE_SOFTMAX_IMPL` mode selection to `softmax_online` (`auto`,
`ref`, `kernel`)
- in `kernel` mode, fail fast with clear errors when kernel
module/entrypoints are missing
- keep `auto` mode contributor-friendly by falling back to the reference
implementation
- add tests for impl-mode behavior and invalid mode handling
- add `--impl` to `bench/benchmark_online_softmax.py` and remove the
hard N divisibility assertion
- make `bench/run.py` skip softmax cases cleanly when strict kernel mode
is unavailable
- document softmax impl mode usage in `DEVELOPMENT.md`
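The mode-selection behavior described above might look roughly like this. Function name and structure are hypothetical; the actual logic is in `forge_cute_py/ops/softmax_online.py`.

```python
import os

VALID_IMPLS = ("auto", "ref", "kernel")

def resolve_softmax_impl(kernel_available: bool) -> str:
    """Pick the softmax implementation from FORGE_SOFTMAX_IMPL.

    - "ref": always use the reference implementation
    - "kernel": require the kernel; fail fast if it is missing
    - "auto" (default): prefer the kernel, fall back to the reference
    """
    mode = os.environ.get("FORGE_SOFTMAX_IMPL", "auto").lower()
    if mode not in VALID_IMPLS:
        raise ValueError(
            f"invalid FORGE_SOFTMAX_IMPL={mode!r}; expected one of {VALID_IMPLS}"
        )
    if mode == "ref":
        return "ref"
    if mode == "kernel":
        if not kernel_available:
            raise RuntimeError(
                "FORGE_SOFTMAX_IMPL=kernel but kernel module/entrypoints are missing"
            )
        return "kernel"
    # auto: contributor-friendly fallback to the reference implementation
    return "kernel" if kernel_available else "ref"
```

Under this shape, `bench/run.py` can catch the strict-mode `RuntimeError` and skip softmax cases cleanly.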

## Validation
- `uv run ruff check forge_cute_py/ops/softmax_online.py
tests/test_softmax_online.py bench/benchmark_online_softmax.py
bench/run.py`
- `uv run ruff format forge_cute_py/ops/softmax_online.py
tests/test_softmax_online.py bench/benchmark_online_softmax.py
bench/run.py`
- `uv run pytest tests/test_softmax_online.py -q`
- `uv run pytest -q`
- `uv run python bench/benchmark_online_softmax.py --m-sizes 64
--n-sizes 256 --dtypes float16 --warmup 2 --iterations 5 --impl auto`
- `uv run python bench/benchmark_online_softmax.py --m-sizes 64
--n-sizes 256 --dtypes float16 --warmup 2 --iterations 5 --impl kernel`
- `FORGE_SOFTMAX_IMPL=kernel uv run python bench/run.py --suite smoke
--op softmax_online`

## Notes
- includes cherry-picked benchmark script commit authored by `jonah
<jsamost@gmail.com>` (`a9d4983`)
- follow-up docs-only patch will be opened separately
## Summary
- fix stale docs references to removed `CLAUDE.md`
- correct public API wording in `DEVELOPMENT.md`
(`forge_cute_py.ops.<op>()`)
- update README benchmark quick reference with softmax benchmark
commands and impl-mode notes
- clarify current release state in `CONTRIBUTING.md` (`v0.1.0-rc1`
exists; final `v0.1.0` pending)
- add changelog bullets for softmax impl-mode selection and benchmark
CLI updates

## Scope
- docs-only patch (`README.md`, `DEVELOPMENT.md`, `ROADMAP.md`,
`CHANGELOG.md`, `CONTRIBUTING.md`)
- no runtime code changes

## Notes
- follow-up to #39
@debashishc debashishc merged commit 57c2b5f into main Feb 8, 2026
2 checks passed
@debashishc debashishc deleted the init-harness branch February 8, 2026 12:18