# feat: add step-scoped H2D .to() timing #87
Open
ppraneth wants to merge 2 commits into traceopt-ai:main
## What this PR does
Closes #82.
Adds instrumentation for host-to-device transfers so TraceML can measure how long `tensor.to(cuda_device)` takes during training steps. This is a meaningful signal in data-heavy pipelines where the GPU can sit idle waiting for DMA to finish.

## How it works
Automatic mode patches `torch.Tensor.to()` once at `traceml.init()` time. The patch is gated by a thread-local flag that is only raised inside `trace_step()`, so model initialization, checkpoint loading, and any other setup transfers are completely ignored. Only CUDA-targeted calls are timed -- dtype-only casts and CPU-to-CPU copies pass through with zero overhead.
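A minimal sketch of the gating idea, not the PR's exact code: `h2d_auto_timer` and `patch_h2d` are names from this PR, while `_targets_cuda`, `_timed_to`, and the `record_h2d_sample` sink are illustrative stand-ins, and wall-clock timing stands in here for the CUDA-event path shown further down.

```python
import threading
import time
from contextlib import contextmanager

import torch

_tls = threading.local()         # per-thread "are we inside a step?" flag
_original_to = torch.Tensor.to   # saved once so the patch stays reversible


def record_h2d_sample(seconds: float) -> None:
    """Illustrative sink; the real code feeds the StepTimeSampler pipeline."""
    print(f"h2d transfer: {seconds * 1e3:.3f} ms")


def _targets_cuda(args, kwargs) -> bool:
    """True only when this .to() call moves data to a CUDA device."""
    for cand in (*args, kwargs.get("device")):
        if isinstance(cand, torch.device) and cand.type == "cuda":
            return True
        if isinstance(cand, str) and cand.startswith("cuda"):
            return True
    return False


def _timed_to(self, *args, **kwargs):
    # Fast path: outside trace_step(), or a dtype-only / CPU-to-CPU call.
    if not getattr(_tls, "in_step", False) or not _targets_cuda(args, kwargs):
        return _original_to(self, *args, **kwargs)
    start = time.perf_counter()
    result = _original_to(self, *args, **kwargs)
    record_h2d_sample(time.perf_counter() - start)
    return result


@contextmanager
def h2d_auto_timer():
    """Raise the thread-local flag for the duration of one training step."""
    _tls.in_step = True
    try:
        yield
    finally:
        _tls.in_step = False


def patch_h2d():
    """Install the patch once, e.g. from traceml.init()."""
    torch.Tensor.to = _timed_to
```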
Manual and selective modes get an explicit `traceml.wrap_h2d(x)` wrapper. It returns a thin proxy around the tensor or batch object so the next `.to(device)` call is timed. No context manager, no broader API change -- just wrap and go:
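(A hypothetical usage sketch: `wrap_h2d`, `init`, and `patch_h2d` are the API named in this PR, while the toy model and the use of `trace_step()` as a context manager are assumptions.)

```python
import torch
import traceml

traceml.init(patch_h2d=False)        # manual mode: skip the global patch
model = torch.nn.Linear(8, 1).to("cuda")
batch = torch.randn(4, 8)            # still on the CPU

with traceml.trace_step():           # assumed usable as a context manager
    batch = traceml.wrap_h2d(batch).to("cuda")   # this .to() is timed
    loss = model(batch).sum()
    loss.backward()
```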
Both paths emit CUDA events (same approach as forward/backward timing) and feed into the existing `StepTimeSampler` pipeline, so the measurement lands in the `step_time_samples` SQLite table automatically alongside all other training signals.
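For reference, a sketch of event-based timing with the CPU fallback mentioned under Testing below. It uses only standard `torch.cuda.Event` calls; the `time_h2d` helper name is made up for this example.

```python
import time

import torch


def time_h2d(transfer):
    """Time a transfer callable: CUDA events on GPU, wall-clock otherwise."""
    if torch.cuda.is_available():
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        out = transfer()
        end.record()
        end.synchronize()                  # wait until the copy has finished
        return out, start.elapsed_time(end)            # milliseconds
    t0 = time.perf_counter()
    out = transfer()
    return out, (time.perf_counter() - t0) * 1e3       # milliseconds


x = torch.randn(1 << 20)
x_out, ms = time_h2d(lambda: x.to("cuda") if torch.cuda.is_available() else x.clone())
print(f"transfer took {ms:.3f} ms")
```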
## Files changed

- `src/traceml/utils/patches/h2d_auto_timer_patch.py` -- new patch module with TLS gating, CUDA target detection, and the `h2d_auto_timer` context manager
- `src/traceml/instrumentation.py` -- `h2d_auto_timer` nested inside `trace_step` alongside `forward_auto_timer` and `backward_auto_timer`
- `src/traceml/initialization.py` -- `patch_h2d` field added to `TraceMLInitConfig`, wired into `init()` and `start()`
- `src/traceml/wrappers.py` -- `_WrappedH2D` proxy class and `wrap_h2d()` function with duplicate instrumentation guard
- `src/traceml/api.py` -- `wrap_h2d` exposed at the public API layer; `init()` and `start()` accept `patch_h2d`
- `src/traceml/__init__.py` -- `wrap_h2d` added to top-level exports
- `tests/test_h2d_timing.py` -- 30 tests covering TLS gating, CUDA target detection, step scoping, manual wrapper, duplicate guards, and init config

## What is not in this PR
No rendering or UI changes. The `_event_bucket` summary function does not yet map `h2d_time` to a display bucket -- that is a follow-up consistent with what the issue specifies.

## Testing
All 30 tests pass without a GPU. Timing falls back to CPU wall-clock when CUDA is unavailable, so the full test suite runs in any environment.
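A hypothetical test in the spirit of `tests/test_h2d_timing.py`, showing why no GPU is needed: the `traceml` calls follow the API described above, and `count_h2d_samples` is an invented observation stub, not part of the PR.

```python
import torch
import traceml


def count_h2d_samples() -> int:
    """Invented helper for this sketch; a real test would read from
    traceml's sampler or the step_time_samples table."""
    return 0


def test_setup_transfers_are_not_timed():
    """Outside trace_step(), the patched .to() must be a pure pass-through."""
    traceml.init(patch_h2d=True)
    before = count_h2d_samples()
    _ = torch.randn(8).to(torch.float64)   # setup-style, CPU, dtype-only call
    assert count_h2d_samples() == before
```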