
feat: add step-scoped H2D .to() timing #87

Open
ppraneth wants to merge 2 commits into traceopt-ai:main from ppraneth:h2d

Conversation

@ppraneth
Collaborator

What this PR does

Closes #82.

Adds instrumentation for host-to-device transfers so TraceML can measure how long tensor.to(cuda_device) takes during training steps. This is a meaningful signal in data-heavy pipelines where the GPU can sit idle waiting for DMA to finish.

How it works

Automatic mode patches torch.Tensor.to() once at traceml.init() time. The patch is gated by a thread-local flag that is only raised inside trace_step(), so model initialization, checkpoint loading, and any other setup transfers are completely ignored. Only CUDA-targeted calls are timed -- dtype-only casts and CPU-to-CPU copies pass through with zero overhead.

Manual and selective modes get an explicit traceml.wrap_h2d(x) wrapper. It returns a thin proxy around the tensor or batch object so the next .to(device) call is timed. No context manager, no broader API change -- just wrap and go:

x = traceml.wrap_h2d(x)
x = x.to(device)
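The proxy idea can be sketched roughly like this (hypothetical class, not the actual `_WrappedH2D` code; wall-clock timing stands in for CUDA events):

```python
import time

class H2DProxy:
    """One-shot wrapper: times the next .to() call, then hands back the
    underlying object so subsequent .to() calls are untimed."""

    def __init__(self, obj, record):
        self._obj = obj
        self._record = record

    def to(self, *args, **kwargs):
        start = time.perf_counter()
        out = self._obj.to(*args, **kwargs)
        self._record(time.perf_counter() - start)
        return out  # plain tensor/batch, not another proxy

    def __getattr__(self, name):
        # every other attribute access falls through to the wrapped object
        return getattr(self._obj, name)
```

Because `to()` returns the unwrapped result, the instrumentation naturally applies exactly once per `wrap_h2d` call, which is what makes the duplicate-instrumentation guard straightforward.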

Both paths emit CUDA events (same approach as forward/backward timing) and feed into the existing StepTimeSampler pipeline, so the measurement lands in the step_time_samples SQLite table automatically alongside all other training signals.
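The event-based measurement with CPU fallback could look roughly like this (a sketch of the described approach, not the shipped helper; the function name is invented):

```python
import time

try:
    import torch
    _HAS_CUDA = torch.cuda.is_available()
except ImportError:  # environments without torch use the wall-clock path
    _HAS_CUDA = False

def time_transfer(fn):
    """Run fn() and return (result, seconds). Uses CUDA events when a GPU
    is present so async H2D copies are measured on-device; otherwise
    falls back to CPU wall clock, which is why the tests need no GPU."""
    if _HAS_CUDA:
        start_ev = torch.cuda.Event(enable_timing=True)
        end_ev = torch.cuda.Event(enable_timing=True)
        start_ev.record()
        out = fn()
        end_ev.record()
        torch.cuda.synchronize()                      # events must complete
        return out, start_ev.elapsed_time(end_ev) / 1000.0  # ms -> s
    start = time.perf_counter()
    out = fn()
    return out, time.perf_counter() - start
```

Synchronizing before reading `elapsed_time` matters: `.to(device)` on pinned memory can return before the DMA finishes, and events capture the true on-stream duration.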

Files changed

  • src/traceml/utils/patches/h2d_auto_timer_patch.py -- new patch module with TLS gating, CUDA target detection, and the h2d_auto_timer context manager
  • src/traceml/instrumentation.py -- h2d_auto_timer nested inside trace_step alongside forward_auto_timer and backward_auto_timer
  • src/traceml/initialization.py -- patch_h2d field added to TraceMLInitConfig, wired into init() and start()
  • src/traceml/wrappers.py -- _WrappedH2D proxy class and wrap_h2d() function with duplicate instrumentation guard
  • src/traceml/api.py -- wrap_h2d exposed at the public API layer; init() and start() accept patch_h2d
  • src/traceml/__init__.py -- wrap_h2d added to top-level exports
  • tests/test_h2d_timing.py -- 30 tests covering TLS gating, CUDA target detection, step scoping, manual wrapper, duplicate guards, and init config

What is not in this PR

No rendering or UI changes. The _event_bucket summary function does not yet map h2d_time to a display bucket -- that is left as a follow-up, consistent with the scope the issue specifies.

Testing

All 30 tests pass without a GPU. Timing falls back to CPU wall-clock when CUDA is unavailable, so the full test suite runs in any environment.

python -m pytest tests/test_h2d_timing.py -v
# 30 passed in 1.81s

@abhinavsriva abhinavsriva requested a review from Pendu April 24, 2026 12:19
Pendu added a commit to Pendu/traceml that referenced this pull request Apr 25, 2026


Development

Successfully merging this pull request may close these issues.

Add step-scoped H2D .to() timing
