docs(v0.2.3): add TF32 design doc and demo script #47

Merged
m96-chan merged 1 commit into main from feature/v0.2.3-docs on Dec 14, 2025

Conversation

@m96-chan
Owner

Summary

  • Add docs/tf32_tensorcore_design.md - comprehensive TF32 TensorCore design documentation
  • Add examples/demo_v023.py - demo script showcasing v0.2.3 features

Documentation Contents

docs/tf32_tensorcore_design.md

  • TF32 precision explanation
  • Kernel architecture (BM=128, BN=128, BK=16)
  • PTX mma.sync fragment mappings (empirically verified)
  • cp.async double-buffering pipeline
  • Optimization techniques (A fragment hoisting)
  • API usage examples
  • Performance results
  • Future work with CUTLASS reference
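
The "TF32 precision explanation" item can be illustrated with a small NumPy sketch. TF32 keeps the sign bit and 8 exponent bits of FP32 but only 10 mantissa bits, so the low 13 mantissa bits of an FP32 value do not contribute. The helper name `to_tf32` is hypothetical, and plain truncation is an assumption made for illustration; hardware conversion instructions may round instead, and this is not taken from the design doc itself.

```python
import numpy as np

def to_tf32(x):
    """Emulate TF32 by zeroing the low 13 mantissa bits of float32 values.

    Assumption: truncation (not rounding) is used purely for illustration.
    """
    x = np.ascontiguousarray(x, dtype=np.float32)
    bits = x.view(np.uint32)
    # Keep the sign (1 bit), exponent (8 bits), and top 10 mantissa bits.
    return (bits & np.uint32(0xFFFFE000)).view(np.float32)

# 2**-10 is the smallest mantissa step TF32 can represent next to 1.0;
# 2**-11 falls into the discarded bits.
values = np.array([1.0, 1.0 + 2.0**-10, 1.0 + 2.0**-11], dtype=np.float32)
rounded = to_tf32(values)

print(rounded)
```

The second value survives unchanged while the third collapses to 1.0, which is the precision loss that the correctness tolerance in the test plan has to absorb.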

examples/demo_v023.py

  • Device capabilities API demo
  • TF32 vs FP32 matmul comparison
  • Correctness validation
  • Performance benchmarks
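
For the performance benchmarks, throughput is conventionally derived from the operation count of a matrix multiply: an (M x K) by (K x N) product costs 2·M·N·K floating-point operations. The sketch below assumes the demo uses this convention (the PR does not say so explicitly); the function names are hypothetical.

```python
def matmul_flops(m, n, k):
    """Floating-point operations in an (m x k) @ (k x n) product:
    one multiply and one add per inner-product term."""
    return 2 * m * n * k

def tflops(m, n, k, seconds):
    """Throughput in TFLOPS for one matmul that took `seconds`."""
    return matmul_flops(m, n, k) / seconds / 1e12

def implied_seconds(n, reported_tflops):
    """Time per square n x n matmul implied by a reported TFLOPS figure."""
    return matmul_flops(n, n, n) / (reported_tflops * 1e12)

# The 4096x4096 TF32 figure reported in this PR is 21.42 TFLOPS.
t_4096 = implied_seconds(4096, 21.42)
print(f"implied time per 4096x4096 matmul: {t_4096 * 1e3:.2f} ms")
```

Under that convention, the reported 4096x4096 TF32 figure corresponds to roughly 6.4 ms per multiplication.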

Demo Results (RTX 3090 Ti)

Matrix Size     FP32 (TFLOPS)   TF32 (TFLOPS)   Speedup
2048x2048        8.02           10.66           1.33x
4096x4096       13.10           21.42           1.63x
8192x8192       17.10           28.44           1.66x
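
The speedup column is the ratio of TF32 to FP32 throughput. A quick sketch for re-deriving it from the table is shown below; because the table values are already rounded to two decimals, the recomputed ratio can differ from the reported one in the last digit.

```python
# (FP32 TFLOPS, TF32 TFLOPS, reported speedup), copied from the table above.
results = {
    "2048x2048": (8.02, 10.66, 1.33),
    "4096x4096": (13.10, 21.42, 1.63),
    "8192x8192": (17.10, 28.44, 1.66),
}

def speedup(fp32_tflops, tf32_tflops):
    """Speedup of TF32 over FP32 as a throughput ratio."""
    return tf32_tflops / fp32_tflops

recomputed = {
    size: speedup(fp32, tf32) for size, (fp32, tf32, _) in results.items()
}

# Largest gap between the recomputed and the reported speedup.
max_gap = max(abs(recomputed[size] - results[size][2]) for size in results)

for size, ratio in recomputed.items():
    print(f"{size}: recomputed {ratio:.4f}x, reported {results[size][2]:.2f}x")
```

All three rows agree with the reported speedups to within 0.01.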

Test plan

  • Demo script runs successfully
  • Correctness validation passes (TF32 error < 10%)
  • Documentation is accurate and complete
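
The correctness criterion ("TF32 error < 10%") does not state which error metric is used. As a minimal sketch, assuming a maximum absolute deviation normalized by the largest reference magnitude, the check could look like the following. The helper names are hypothetical, and the TF32 result is emulated on the CPU by truncating FP32 inputs to 10 mantissa bits, so this does not exercise the actual kernel.

```python
import numpy as np

def relative_error(result, reference):
    """Largest absolute deviation, normalized by the largest reference magnitude."""
    reference = np.asarray(reference, dtype=np.float64)
    diff = np.abs(np.asarray(result, dtype=np.float64) - reference)
    return float(diff.max() / np.abs(reference).max())

def truncate_to_tf32(x):
    """CPU stand-in for TF32 inputs: zero the low 13 mantissa bits (assumption)."""
    bits = np.ascontiguousarray(x, dtype=np.float32).view(np.uint32)
    return (bits & np.uint32(0xFFFFE000)).view(np.float32)

rng = np.random.default_rng(0)
a = rng.standard_normal((64, 64)).astype(np.float32)
b = rng.standard_normal((64, 64)).astype(np.float32)

# Float64 product serves as the reference result.
reference = a.astype(np.float64) @ b.astype(np.float64)
# Emulated TF32 path: reduced-precision inputs, float32 accumulation.
emulated = truncate_to_tf32(a) @ truncate_to_tf32(b)

err = relative_error(emulated, reference)
passed = err < 0.10  # threshold taken from the test plan
print(f"relative error: {err:.2e}, passed: {passed}")
```

With this metric the emulated error sits far below the 10% threshold, which suggests the threshold is a loose sanity bound, not a tight accuracy target.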

Partially addresses the requirements of #41

🤖 Generated with Claude Code

- Add docs/tf32_tensorcore_design.md with:
  - PTX mma.sync fragment mapping (empirically verified)
  - cp.async double-buffering pipeline
  - Kernel architecture and tiling parameters
  - Performance results and API usage
  - CUTLASS reference for future optimization

- Add examples/demo_v023.py demonstrating:
  - gp.matmul(a, b, use_tf32=True)
  - gp.get_device_capabilities()
  - Performance benchmark (FP32 vs TF32)
  - Correctness validation

Demo results (RTX 3090 Ti):
- 2048x2048: 10.66 TFLOPS (1.33x speedup)
- 4096x4096: 21.42 TFLOPS (1.63x speedup)
- 8192x8192: 28.44 TFLOPS (1.66x speedup)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@m96-chan merged commit 018e1f6 into main on Dec 14, 2025
26 checks passed
@m96-chan deleted the feature/v0.2.3-docs branch on December 26, 2025 09:38