# feat: add step-scoped H2D .to() timing #87
Open
ppraneth wants to merge 2 commits into traceopt-ai:main
## What this PR does
Closes #82.
Adds instrumentation for host-to-device transfers so TraceML can measure how long `tensor.to(cuda_device)` takes during training steps. This is a meaningful signal in data-heavy pipelines where the GPU can sit idle waiting for DMA to finish.

## How it works
Automatic mode patches `torch.Tensor.to()` once at `traceml.init()` time. The patch is gated by a thread-local flag that is only raised inside `trace_step()`, so model initialization, checkpoint loading, and any other setup transfers are completely ignored. Only CUDA-targeted calls are timed -- dtype-only casts and CPU-to-CPU copies pass through with zero overhead.
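A minimal sketch of the gating idea, not the PR's exact code: `h2d_auto_timer` and `patch_h2d` are names from this PR, while `_targets_cuda`, `_timed_to`, and the `record_h2d_sample` sink are illustrative stand-ins, and wall-clock timing stands in here for the CUDA-event path shown further down.

```python
import threading
import time
from contextlib import contextmanager

import torch

_tls = threading.local()         # per-thread "are we inside a step?" flag
_original_to = torch.Tensor.to   # saved once so the patch stays reversible


def record_h2d_sample(seconds: float) -> None:
    """Illustrative sink; the real code feeds the StepTimeSampler pipeline."""
    print(f"h2d transfer: {seconds * 1e3:.3f} ms")


def _targets_cuda(args, kwargs) -> bool:
    """True only when this .to() call moves data to a CUDA device."""
    for cand in (*args, kwargs.get("device")):
        if isinstance(cand, torch.device) and cand.type == "cuda":
            return True
        if isinstance(cand, str) and cand.startswith("cuda"):
            return True
    return False


def _timed_to(self, *args, **kwargs):
    # Fast path: outside trace_step(), or a dtype-only / CPU-to-CPU call.
    if not getattr(_tls, "in_step", False) or not _targets_cuda(args, kwargs):
        return _original_to(self, *args, **kwargs)
    start = time.perf_counter()
    result = _original_to(self, *args, **kwargs)
    record_h2d_sample(time.perf_counter() - start)
    return result


@contextmanager
def h2d_auto_timer():
    """Raise the thread-local flag for the duration of one training step."""
    _tls.in_step = True
    try:
        yield
    finally:
        _tls.in_step = False


def patch_h2d():
    """Install the patch once, e.g. from traceml.init()."""
    torch.Tensor.to = _timed_to
```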
Manual and selective modes get an explicit `traceml.wrap_h2d(x)` wrapper. It returns a thin proxy around the tensor or batch object so the next `.to(device)` call is timed. No context manager, no broader API change -- just wrap and go:
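(A hypothetical usage sketch: `wrap_h2d`, `init`, and `patch_h2d` are the API named in this PR, while the toy model and the use of `trace_step()` as a context manager are assumptions.)

```python
import torch
import traceml

traceml.init(patch_h2d=False)        # manual mode: skip the global patch
model = torch.nn.Linear(8, 1).to("cuda")
batch = torch.randn(4, 8)            # still on the CPU

with traceml.trace_step():           # assumed usable as a context manager
    batch = traceml.wrap_h2d(batch).to("cuda")   # this .to() is timed
    loss = model(batch).sum()
    loss.backward()
```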
Both paths emit CUDA events (same approach as forward/backward timing) and feed into the existing `StepTimeSampler` pipeline, so the measurement lands in the `step_time_samples` SQLite table automatically alongside all other training signals.
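For reference, a sketch of event-based timing with the CPU fallback mentioned under Testing below. It uses only standard `torch.cuda.Event` calls; the `time_h2d` helper name is made up for this example.

```python
import time

import torch


def time_h2d(transfer):
    """Time a transfer callable: CUDA events on GPU, wall-clock otherwise."""
    if torch.cuda.is_available():
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        out = transfer()
        end.record()
        end.synchronize()                  # wait until the copy has finished
        return out, start.elapsed_time(end)            # milliseconds
    t0 = time.perf_counter()
    out = transfer()
    return out, (time.perf_counter() - t0) * 1e3       # milliseconds


x = torch.randn(1 << 20)
x_out, ms = time_h2d(lambda: x.to("cuda") if torch.cuda.is_available() else x.clone())
print(f"transfer took {ms:.3f} ms")
```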
## Files changed

- `src/traceml/utils/patches/h2d_auto_timer_patch.py` -- new patch module with TLS gating, CUDA target detection, and the `h2d_auto_timer` context manager
- `src/traceml/instrumentation.py` -- `h2d_auto_timer` nested inside `trace_step` alongside `forward_auto_timer` and `backward_auto_timer`
- `src/traceml/initialization.py` -- `patch_h2d` field added to `TraceMLInitConfig`, wired into `init()` and `start()`
- `src/traceml/wrappers.py` -- `_WrappedH2D` proxy class and `wrap_h2d()` function with duplicate instrumentation guard
- `src/traceml/api.py` -- `wrap_h2d` exposed at the public API layer; `init()` and `start()` accept `patch_h2d`
- `src/traceml/__init__.py` -- `wrap_h2d` added to top-level exports
- `tests/test_h2d_timing.py` -- 30 tests covering TLS gating, CUDA target detection, step scoping, manual wrapper, duplicate guards, and init config

## What is not in this PR
No rendering or UI changes. The `_event_bucket` summary function does not yet map `h2d_time` to a display bucket -- that is a follow-up consistent with what the issue specifies.

## Testing
All 30 tests pass without a GPU. Timing falls back to CPU wall-clock when CUDA is unavailable, so the full test suite runs in any environment.
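A hypothetical test in the spirit of `tests/test_h2d_timing.py`, showing why no GPU is needed: the `traceml` calls follow the API described above, and `count_h2d_samples` is an invented observation stub, not part of the PR.

```python
import torch
import traceml


def count_h2d_samples() -> int:
    """Invented helper for this sketch; a real test would read from
    traceml's sampler or the step_time_samples table."""
    return 0


def test_setup_transfers_are_not_timed():
    """Outside trace_step(), the patched .to() must be a pure pass-through."""
    traceml.init(patch_h2d=True)
    before = count_h2d_samples()
    _ = torch.randn(8).to(torch.float64)   # setup-style, CPU, dtype-only call
    assert count_h2d_samples() == before
```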