Feat. Distributed Data Parallel for Trainer by Varun-sai-500 · Pull Request #91 · yoxu515/aot-benchmark

Varun-sai-500 · 2026-02-26T18:14:58Z

Fixes Issue: #94

Refactored the training pipeline to be fully device-agnostic (CPU/CUDA) and torchrun-compatible, introducing a unified CLI entrypoint and improving distributed training robustness, reproducibility, and maintainability.

                                        **** WORK UNDER PROGRESS  ****

Key Improvements
Device-agnostic Trainer
Reworked training manager to operate on torch.device instead of hardcoded CUDA paths
→ Eliminates device-specific failures and enables seamless CPU fallback for debugging and portability
Robust Distributed Training (DDP)
Added native support for torchrun (RANK, WORLD_SIZE, LOCAL_RANK)
Fixed metric aggregation using true world size
→ Ensures correctness and scalability across multi-GPU / multi-node setups
Unified Training Entrypoint (tools/train.py)
Centralized runtime setup, argument parsing, seeding, and config overrides
→ Simplifies experimentation, CI integration, and reproducible launches
Safe AMP Handling
Enabled mixed precision only when CUDA is available
→ Prevents runtime errors and improves training stability across environments
Device-aware Data Pipeline
Conditional pin_memory, prefetch_factor, and batch sizing based on runtime device
→ Optimizes data throughput without breaking CPU execution
Checkpointing API Refactor
Replaced GPU-index-based loading with explicit torch.device handling
→ Improves clarity, correctness, and cross-device compatibility
Cleaner Runtime Behavior
Removed unconditional torch.cuda.* calls
Added guarded device setup and distributed cleanup
→ Reduces silent bugs and improves reliability in mixed environments

Motivation:

This refactor removes implicit CUDA assumptions and replaces them with explicit, device-aware abstractions, making the training stack:

Portable → runs on CPU, single GPU, or distributed setups without code changes
Correct → fixes subtle DDP bugs (e.g., incorrect metric averaging)
Reproducible → centralized seeding and deterministic-ish behavior per rank
Production-ready → compatible with standard tooling (torchrun, CI pipelines, SLURM)
Maintainable → cleaner interfaces (torch.device) and reduced technical debt

z-x-yang

Thanks for the CPU-fallback refactor — good direction. I can't approve this yet because there are still blocking issues:

CPU path still breaks in eval.py / train.py

The CPU branch passes rank=None into Evaluator/Trainer.
But manager code still assumes GPU-only semantics (cfg.* + rank, torch.cuda.set_device(...), many .cuda(...) usages).
This will fail before evaluation/training starts on CPU.

Single-GPU device-id regression

Single-device launch now calls main_worker(args.gpu_id, ...) / main_worker(args.start_gpu, ...).
Manager code also adds offsets (cfg.TEST_GPU_ID + rank, cfg.DIST_START_GPU + rank).
For non-zero GPU ids this can double-offset and select the wrong device.

CUDA-only timing/memory calls are still unguarded in evaluator

torch.cuda.Event, torch.cuda.synchronize, torch.cuda.max_memory_allocated still run in code paths that need CPU compatibility.

Please fix these before merge. Suggested direction:

Keep rank as an integer (e.g., 0) on CPU;
Guard all CUDA-only APIs with torch.cuda.is_available();
Keep device selection in one place to avoid double offsets.

Varun-sai-500 · 2026-03-17T13:22:04Z

@z-x-yang Yes I agree with that, the main issues are the old gpu_id techniques, whcih are now replaced by distributed data parallels technique which is now supported in pytorch, please check #94 , thanks!

z-x-yang · 2026-03-24T02:35:53Z

I re-checked this branch too. It is still not mergeable yet. The CPU path still does not work end-to-end:\n\n- eval.py / train.py still pass rank=None into Evaluator / Trainer on CPU, while manager code still assumes integer GPU/rank semantics.\n- CUDA-only calls in the manager path are still not consistently guarded, so the flow still breaks before a real CPU run.\n\nPlease keep the rank/device contract consistent first, then guard the CUDA-only APIs end-to-end before the next review.

Varun-sai-500 · 2026-03-24T05:17:57Z

I haven't changed it yet actually, I will work on this

Varun-sai-500 · 2026-03-24T16:19:28Z

@z-x-yang Minimal fix for trainer.py

z-x-yang

Thanks — this is moving in the right direction, especially the trainer-side device cleanup. I still can't approve/merge this PR as-is because two blockers remain:

The PR still mixes CPU support with repo-default/local-config changes:

configs/default.py changes TRAIN_GPUS from 4 to 1
configs/default.py changes DIR_DATA from ../VOS02/datasets to datasets
there is also README/.gitignore noise mixed in
These should be split out or dropped from the CPU-fallback PR.

CPU support is still incomplete across the eval path:

networks/managers/evaluator.py still assumes CUDA in multiple places (torch.cuda.set_device, .cuda(...), torch.cuda.Event, torch.cuda.synchronize, memory accounting)
So the broader no-GPU train/eval/infer story is not yet complete.

Suggested next step: keep this PR scoped to CPU support only, revert the default/config changes, and either (a) finish evaluator CPU fallback here or (b) split evaluator support into a separate follow-up PR with a clear scope.

Varun-sai-500 · 2026-03-29T18:26:16Z

@z-x-yang I tested locally, can you also test it and tell me if it's working?

Varun-sai-500 · 2026-03-30T14:11:04Z

@z-x-yang The changes are pretty big, but it's coming from a person who tested it, the promised cpu fallback is working, and says "started training", after that there are errors, which are out of scope of this PR, and it has dealing with some shapes, and comes under research part, guidance needed on that one though

z-x-yang · 2026-03-30T14:16:13Z

I re-checked the latest branch. I'm still not merging this PR yet — there are still hard blockers in the training path.

Non-zero start_gpu / offset launches are still broken.
- tools/train.py builds the device from rank (cuda:{rank}) in main_worker (lines 18-19),
- but Trainer still computes the actual GPU id as rank + cfg.DIST_START_GPU (see networks/managers/trainer.py, lines 33, 46, 59).
- That means for --start_gpu != 0 (and offset multi-GPU launches), the selected CUDA device and the model/device placement can diverge.
Checkpoint / pretrained loading regressed on CUDA.
- trainer.py now passes self.device into load_network(...) / load_network_and_optimizer(...) (for example lines 181-218 and 249-260),
- but utils/checkpoint.py still expects a GPU index / None and builds map_location via torch.device(f"cuda:{gpu}") (lines 7-15 and 94-96).
- So on CUDA this becomes the wrong contract (e.g. cuda:cuda:0), which breaks the resume / pretrain path.
The PR still includes unrelated noise.
- README / .gitignore edits and the DAVIS README rename are not part of the trainer CPU-fallback fix. Please drop them from this PR.

If you want the next pass to be mergeable, I'd suggest: keep device selection in exactly one place, keep the checkpoint helper interface consistent (either pass GPU ids everywhere or update the helper to accept torch.device cleanly), and trim this PR back to the training-only CPU fallback scope.

Varun-sai-500 · 2026-03-30T14:28:03Z

@z-x-yang well but it runs well on my cpu device right? I haven't faced issues like that, but well this huge code refactor is actually tough in a single pass

z-x-yang · 2026-03-30T23:20:20Z

A local CPU-only smoke run is helpful, but that is not the merge bar for this PR.

For me to merge #91, it still needs to satisfy the repo's main training-path constraints, not just "starts training on one CPU machine":

No device/rank regression for the existing CUDA path.
- Right now tools/train.py always calls main_worker(0, ...) for the single-GPU path, while Trainer still computes the actual CUDA id as rank + cfg.DIST_START_GPU.
- That means the contract is still split across two places, and offset / non-zero GPU launches are still fragile.
No checkpoint/pretrain contract regression.
- Trainer now passes self.device into load_network(...) / load_network_and_optimizer(...),
- but utils/checkpoint.py is still written around a gpu argument and derives the device again internally.
- That interface mismatch is still not something I want to merge into the main training path.
Keep the PR scoped to the training fix only.
- README / .gitignore / datasets/DAVIS/readme.md edits should not be bundled into this trainer CPU-fallback PR.

So the current answer is still not mergeable yet.

If you want the next pass to be reviewable quickly, the cleanest path is:

keep device selection in one place,
keep the checkpoint helper contract consistent end-to-end,
and trim this PR back to the training-only fix with unrelated files dropped.

If that is easier as a fresh smaller PR, that is fine too.

Varun-sai-500 · 2026-04-07T10:25:22Z

@z-x-yang Thanks for your valuable suggestions, I have fixed the damn checkpoint issue, and also centralized device at one place now.

z-x-yang · 2026-04-08T13:48:12Z

I re-checked the latest branch.

Some earlier blockers do look improved now:

the trainer-side device selection is more centralized;
the unrelated config / README noise is gone;
the trainer/checkpoint contract is more consistent inside the training path.

But I’m still not merging #91 yet because there is still a hard blocker:

utils/checkpoint.py now changes the helper contract repo-wide, but the remaining call sites were not updated.
- utils/checkpoint.py now requires a torch.device (get_device(...) raises unless the argument is a torch.device; see lines 7-10, and load_network* at lines 12-14 / 93-95).
- networks/managers/evaluator.py still calls load_network(..., self.gpu) with an integer GPU id (lines 71-72, 81-82, 91-93).
- tools/demo.py still calls load_network(..., gpu_id if device.type == 'cuda' else -1) (line 125).

So in the current form, this PR would regress the existing eval/demo checkpoint-loading path even if the trainer path is improved.

For the next pass, please do one of these cleanly:

either keep the checkpoint helpers backward-compatible with the old GPU-id / None contract,
or update all remaining repo callers to pass torch.device consistently in the same PR.

After that, please smoke at least:

the trainer checkpoint load path, and
one eval/demo checkpoint load path.

Until that is fixed, this PR is still not mergeable yet.

Varun-sai-500 · 2026-04-14T18:48:44Z

@z-x-yang I don't have a gpu, couldn't test I am sorry, but is this good now?

z-x-yang requested changes Feb 26, 2026

View reviewed changes

Varun-sai-500 closed this Mar 1, 2026

Varun-sai-500 deleted the cpu branch March 1, 2026 13:51

Varun-sai-500 restored the cpu branch March 14, 2026 18:04

Varun-sai-500 reopened this Mar 14, 2026

Varun-sai-500 marked this pull request as draft March 22, 2026 17:39

Varun-sai-500 closed this Mar 24, 2026

Varun-sai-500 force-pushed the cpu branch from b4f4c5a to f83dc00 Compare March 24, 2026 06:06

Varun-sai-500 reopened this Mar 24, 2026

Varun-sai-500 marked this pull request as ready for review March 24, 2026 06:10

Varun-sai-500 requested a review from z-x-yang March 27, 2026 16:13

z-x-yang requested changes Mar 28, 2026

View reviewed changes

Varun-sai-500 requested a review from z-x-yang March 29, 2026 18:19

Varun-sai-500 closed this Apr 6, 2026

Varun-sai-500 force-pushed the cpu branch from 7a28340 to b74ca88 Compare April 6, 2026 14:56

Refactored trainer.py

2720d42

Varun-sai-500 reopened this Apr 6, 2026

Varun-sai-500 added 4 commits April 6, 2026 20:30

Added train.py

ecb8844

Merge branch 'yoxu515:main' into cpu

d465505

fixed bug in train.py

ff0016a

Fixed long standing checkpoint issue

10cf875

Varun-sai-500 added 2 commits April 7, 2026 15:56

fixed return type bug

c881b18

fixed another smaller bug

065d486

Varun-sai-500 changed the title ~~refactoring code for cpu_fallback~~ CPU Fallback for Trainer Apr 7, 2026

Varun-sai-500 changed the title ~~CPU Fallback for Trainer~~ CPU fallback for Trainer Apr 7, 2026

Varun-sai-500 mentioned this pull request Apr 7, 2026

Removing old Training methods like gpu_id and centralize device management #94

Open

Made trainer.py DDP compatible, removing legacy code

706bb31

Varun-sai-500 added 4 commits April 15, 2026 10:38

Updated train.py

195a168

Updated trainer code

473bab2

updated

bf3fdd6

updated the generate_permute_matrix function

75ed28d

Varun-sai-500 changed the title ~~CPU fallback for Trainer~~ Feat. Distributed Data Parallel for Trainer Apr 15, 2026

Varun-sai-500 marked this pull request as draft April 21, 2026 18:01

Conversation

Varun-sai-500 commented Feb 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

z-x-yang left a comment

Choose a reason for hiding this comment

Uh oh!

Varun-sai-500 commented Mar 17, 2026

Uh oh!

z-x-yang commented Mar 24, 2026

Uh oh!

Varun-sai-500 commented Mar 24, 2026

Uh oh!

Varun-sai-500 commented Mar 24, 2026

Uh oh!

z-x-yang left a comment

Choose a reason for hiding this comment

Uh oh!

Varun-sai-500 commented Mar 29, 2026

Uh oh!

Varun-sai-500 commented Mar 30, 2026

Uh oh!

z-x-yang commented Mar 30, 2026

Uh oh!

Varun-sai-500 commented Mar 30, 2026

Uh oh!

z-x-yang commented Mar 30, 2026

Uh oh!

Varun-sai-500 commented Apr 7, 2026

Uh oh!

z-x-yang commented Apr 8, 2026

Uh oh!

Varun-sai-500 commented Apr 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Varun-sai-500 commented Feb 26, 2026 •

edited

Loading

Varun-sai-500 commented Apr 14, 2026 •

edited

Loading