Feat. Distributed Data Parallel for Trainer #91

Draft

Varun-sai-500 wants to merge 12 commits into yoxu515:main from Varun-sai-500:cpu

Conversation

@Varun-sai-500 (Contributor) commented Feb 26, 2026

Fixes Issue: #94

Refactored the training pipeline to be fully device-agnostic (CPU/CUDA) and torchrun-compatible, introducing a unified CLI entrypoint and improving distributed training robustness, reproducibility, and maintainability. A minimal sketch of the intended device-agnostic setup follows the list of key improvements below.

**WORK IN PROGRESS**

Key Improvements
Device-agnostic Trainer
Reworked training manager to operate on torch.device instead of hardcoded CUDA paths
→ Eliminates device-specific failures and enables seamless CPU fallback for debugging and portability
Robust Distributed Training (DDP)
Added native support for torchrun (RANK, WORLD_SIZE, LOCAL_RANK)
Fixed metric aggregation using true world size
→ Ensures correctness and scalability across multi-GPU / multi-node setups
Unified Training Entrypoint (tools/train.py)
Centralized runtime setup, argument parsing, seeding, and config overrides
→ Simplifies experimentation, CI integration, and reproducible launches
Safe AMP Handling
Enabled mixed precision only when CUDA is available
→ Prevents runtime errors and improves training stability across environments
Device-aware Data Pipeline
Conditional pin_memory, prefetch_factor, and batch sizing based on runtime device
→ Optimizes data throughput without breaking CPU execution
Checkpointing API Refactor
Replaced GPU-index-based loading with explicit torch.device handling
→ Improves clarity, correctness, and cross-device compatibility
Cleaner Runtime Behavior
Removed unconditional torch.cuda.* calls
Added guarded device setup and distributed cleanup
→ Reduces silent bugs and improves reliability in mixed environments
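
A minimal sketch of the intended device-agnostic setup, assuming a torchrun launch that sets RANK / WORLD_SIZE / LOCAL_RANK (the function and variable names here are illustrative, not the actual helpers added by this PR):

```python
import os
import torch
import torch.distributed as dist

def setup_runtime():
    # Read the torchrun environment; fall back to a single-process run.
    rank = int(os.environ.get("RANK", 0))
    world_size = int(os.environ.get("WORLD_SIZE", 1))
    local_rank = int(os.environ.get("LOCAL_RANK", 0))

    # Pick the device once, then pass torch.device objects everywhere.
    if torch.cuda.is_available():
        device = torch.device(f"cuda:{local_rank}")
        torch.cuda.set_device(device)
        backend = "nccl"
    else:
        device = torch.device("cpu")
        backend = "gloo"

    # Only initialize the process group when more than one process is launched.
    if world_size > 1:
        dist.init_process_group(backend=backend, rank=rank, world_size=world_size)

    # AMP only when CUDA is available; pin_memory only helps CUDA host-to-device copies.
    use_amp = device.type == "cuda"
    loader_kwargs = {"pin_memory": device.type == "cuda"}

    return device, rank, world_size, use_amp, loader_kwargs
```

The same world_size value can then be reused for metric aggregation, so averages are divided by the true number of processes.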

Motivation:

This refactor removes implicit CUDA assumptions and replaces them with explicit, device-aware abstractions, making the training stack:

Portable → runs on CPU, single GPU, or distributed setups without code changes
Correct → fixes subtle DDP bugs (e.g., incorrect metric averaging)
Reproducible → centralized seeding and deterministic-ish behavior per rank
Production-ready → compatible with standard tooling (torchrun, CI pipelines, SLURM)
Maintainable → cleaner interfaces (torch.device) and reduced technical debt

@z-x-yang (Collaborator) left a comment

Thanks for the CPU-fallback refactor — good direction. I can't approve this yet because there are still blocking issues:

  1. CPU path still breaks in eval.py / train.py
  • The CPU branch passes rank=None into Evaluator/Trainer.
  • But manager code still assumes GPU-only semantics (cfg.* + rank, torch.cuda.set_device(...), many .cuda(...) usages).
  • This will fail before evaluation/training starts on CPU.
  2. Single-GPU device-id regression
  • Single-device launch now calls main_worker(args.gpu_id, ...) / main_worker(args.start_gpu, ...).
  • Manager code also adds offsets (cfg.TEST_GPU_ID + rank, cfg.DIST_START_GPU + rank).
  • For non-zero GPU ids this can double-offset and select the wrong device.
  3. CUDA-only timing/memory calls are still unguarded in evaluator
  • torch.cuda.Event, torch.cuda.synchronize, torch.cuda.max_memory_allocated still run in code paths that need CPU compatibility.

Please fix these before merge. Suggested direction (a rough sketch follows the list below):

  • Keep rank as an integer (e.g., 0) on CPU;
  • Guard all CUDA-only APIs with torch.cuda.is_available();
  • Keep device selection in one place to avoid double offsets.
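
A rough sketch of that direction (hypothetical helper name, not code from this branch) could look like:

```python
import torch

def resolve_rank_and_device(rank=None, start_gpu=0):
    # Keep rank an integer even on CPU so manager code never sees None.
    rank = 0 if rank is None else rank
    # Select the device in exactly one place and apply the start-GPU offset only here.
    if torch.cuda.is_available():
        device = torch.device(f"cuda:{start_gpu + rank}")
        torch.cuda.set_device(device)
    else:
        device = torch.device("cpu")
    return rank, device
```

Evaluator / Trainer would then receive an integer rank plus a torch.device, and any remaining torch.cuda.* call can check device.type == "cuda" first.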

@Varun-sai-500 Varun-sai-500 deleted the cpu branch March 1, 2026 13:51
@Varun-sai-500 Varun-sai-500 restored the cpu branch March 14, 2026 18:04
@Varun-sai-500 Varun-sai-500 reopened this Mar 14, 2026
@Varun-sai-500 (Contributor, Author)

@z-x-yang Yes, I agree with that. The main issues are the old gpu_id techniques, which are now replaced by the distributed data parallel technique supported in PyTorch. Please check #94, thanks!

@Varun-sai-500 Varun-sai-500 marked this pull request as draft March 22, 2026 17:39
@z-x-yang (Collaborator)

I re-checked this branch too. It is still not mergeable yet. The CPU path still does not work end-to-end:

- eval.py / train.py still pass rank=None into Evaluator / Trainer on CPU, while manager code still assumes integer GPU/rank semantics.
- CUDA-only calls in the manager path are still not consistently guarded, so the flow still breaks before a real CPU run.

Please keep the rank/device contract consistent first, then guard the CUDA-only APIs end-to-end before the next review.

@Varun-sai-500 (Contributor, Author)

I haven't changed it yet, actually. I will work on this.

@Varun-sai-500 Varun-sai-500 reopened this Mar 24, 2026
@Varun-sai-500 Varun-sai-500 marked this pull request as ready for review March 24, 2026 06:10
@Varun-sai-500 (Contributor, Author)

@z-x-yang Minimal fix for trainer.py

@Varun-sai-500 Varun-sai-500 requested a review from z-x-yang March 27, 2026 16:13
@z-x-yang (Collaborator) left a comment

Thanks — this is moving in the right direction, especially the trainer-side device cleanup. I still can't approve/merge this PR as-is because two blockers remain:

  1. The PR still mixes CPU support with repo-default/local-config changes:
  • configs/default.py changes TRAIN_GPUS from 4 to 1
  • configs/default.py changes DIR_DATA from ../VOS02/datasets to datasets
  • there is also README/.gitignore noise mixed in
    These should be split out or dropped from the CPU-fallback PR.
  2. CPU support is still incomplete across the eval path:
  • networks/managers/evaluator.py still assumes CUDA in multiple places (torch.cuda.set_device, .cuda(...), torch.cuda.Event, torch.cuda.synchronize, memory accounting)
    So the broader no-GPU train/eval/infer story is not yet complete.

Suggested next step: keep this PR scoped to CPU support only, revert the default/config changes, and either (a) finish evaluator CPU fallback here (a rough sketch of guarding the CUDA-only timing calls follows below) or (b) split evaluator support into a separate follow-up PR with a clear scope.
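
For point 2, the evaluator's timing and memory accounting could be guarded roughly like this (a hypothetical timed_forward helper, not the actual evaluator code):

```python
import time
import torch

def timed_forward(model, inputs, device):
    # Use CUDA events and memory stats only when running on a CUDA device.
    if device.type == "cuda":
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        output = model(inputs)
        end.record()
        torch.cuda.synchronize()
        elapsed_ms = start.elapsed_time(end)
        peak_mem = torch.cuda.max_memory_allocated(device)
    else:
        # CPU fallback: wall-clock timing, no CUDA memory accounting.
        t0 = time.perf_counter()
        output = model(inputs)
        elapsed_ms = (time.perf_counter() - t0) * 1000.0
        peak_mem = 0
    return output, elapsed_ms, peak_mem
```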

@Varun-sai-500 Varun-sai-500 requested a review from z-x-yang March 29, 2026 18:19
@Varun-sai-500 (Contributor, Author)

@z-x-yang I tested it locally; can you also test it and tell me if it's working?

@Varun-sai-500 (Contributor, Author)

@z-x-yang The changes are pretty big, but they come from someone who has tested them: the promised CPU fallback is working and the run reports "started training". After that there are errors, but those are out of scope for this PR; they involve some shapes and fall under the research part. Guidance is needed on that one, though.

@z-x-yang (Collaborator)

I re-checked the latest branch. I'm still not merging this PR yet — there are still hard blockers in the training path.

  1. Non-zero start_gpu / offset launches are still broken.

    • tools/train.py builds the device from rank (cuda:{rank}) in main_worker (lines 18-19),
    • but Trainer still computes the actual GPU id as rank + cfg.DIST_START_GPU (see networks/managers/trainer.py, lines 33, 46, 59).
    • That means for --start_gpu != 0 (and offset multi-GPU launches), the selected CUDA device and the model/device placement can diverge.
  2. Checkpoint / pretrained loading regressed on CUDA.

    • trainer.py now passes self.device into load_network(...) / load_network_and_optimizer(...) (for example lines 181-218 and 249-260),
    • but utils/checkpoint.py still expects a GPU index / None and builds map_location via torch.device(f"cuda:{gpu}") (lines 7-15 and 94-96).
    • So on CUDA this becomes the wrong contract (e.g. cuda:cuda:0), which breaks the resume / pretrain path.
  3. The PR still includes unrelated noise.

    • README / .gitignore edits and the DAVIS README rename are not part of the trainer CPU-fallback fix. Please drop them from this PR.

If you want the next pass to be mergeable, I'd suggest: keep device selection in exactly one place, keep the checkpoint helper interface consistent (either pass GPU ids everywhere or update the helper to accept torch.device cleanly; a rough sketch of the latter follows below), and trim this PR back to the training-only CPU fallback scope.
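
If the helper is updated to accept torch.device cleanly, a minimal sketch (hypothetical name, not the actual utils/checkpoint.py function) would use the device directly as map_location instead of rebuilding it from a GPU index:

```python
import torch

def load_checkpoint_to_device(model, checkpoint_path, device):
    # The caller passes a torch.device, so it can be used as map_location as-is,
    # avoiding the "cuda:cuda:0" double-prefix that breaks the resume/pretrain path.
    assert isinstance(device, torch.device)
    state = torch.load(checkpoint_path, map_location=device)
    model.load_state_dict(state)
    return model.to(device)
```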

@Varun-sai-500 (Contributor, Author)

@z-x-yang Well, it runs fine on my CPU device, right? I haven't faced issues like that, but this huge code refactor is genuinely tough to get right in a single pass.

@z-x-yang (Collaborator)

A local CPU-only smoke run is helpful, but that is not the merge bar for this PR.

For me to merge #91, it still needs to satisfy the repo's main training-path constraints, not just "starts training on one CPU machine":

  1. No device/rank regression for the existing CUDA path.

    • Right now tools/train.py always calls main_worker(0, ...) for the single-GPU path, while Trainer still computes the actual CUDA id as rank + cfg.DIST_START_GPU.
    • That means the contract is still split across two places, and offset / non-zero GPU launches are still fragile.
  2. No checkpoint/pretrain contract regression.

    • Trainer now passes self.device into load_network(...) / load_network_and_optimizer(...),
    • but utils/checkpoint.py is still written around a gpu argument and derives the device again internally.
    • That interface mismatch is still not something I want to merge into the main training path.
  3. Keep the PR scoped to the training fix only.

    • README / .gitignore / datasets/DAVIS/readme.md edits should not be bundled into this trainer CPU-fallback PR.

So the current answer is still not mergeable yet.

If you want the next pass to be reviewable quickly, the cleanest path is:

  • keep device selection in one place,
  • keep the checkpoint helper contract consistent end-to-end,
  • and trim this PR back to the training-only fix with unrelated files dropped.

If that is easier as a fresh smaller PR, that is fine too.

@Varun-sai-500 Varun-sai-500 reopened this Apr 6, 2026
@Varun-sai-500 (Contributor, Author)

@z-x-yang Thanks for your valuable suggestions. I have fixed the damn checkpoint issue and also centralized device selection in one place now.

@Varun-sai-500 Varun-sai-500 changed the title refactoring code for cpu_fallback CPU Fallback for Trainer Apr 7, 2026
@Varun-sai-500 Varun-sai-500 changed the title CPU Fallback for Trainer CPU fallback for Trainer Apr 7, 2026
@z-x-yang (Collaborator) commented Apr 8, 2026

I re-checked the latest branch.

Some earlier blockers do look improved now:

  • the trainer-side device selection is more centralized;
  • the unrelated config / README noise is gone;
  • the trainer/checkpoint contract is more consistent inside the training path.

But I’m still not merging #91 yet because there is still a hard blocker:

  1. utils/checkpoint.py now changes the helper contract repo-wide, but the remaining call sites were not updated.
    • utils/checkpoint.py now requires a torch.device (get_device(...) raises unless the argument is a torch.device; see lines 7-10, and load_network* at lines 12-14 / 93-95).
    • networks/managers/evaluator.py still calls load_network(..., self.gpu) with an integer GPU id (lines 71-72, 81-82, 91-93).
    • tools/demo.py still calls load_network(..., gpu_id if device.type == 'cuda' else -1) (line 125).

So in the current form, this PR would regress the existing eval/demo checkpoint-loading path even if the trainer path is improved.

For the next pass, please do one of these cleanly (a rough sketch of the first option is at the end of this comment):

  • either keep the checkpoint helpers backward-compatible with the old GPU-id / None contract,
  • or update all remaining repo callers to pass torch.device consistently in the same PR.

After that, please smoke at least:

  • the trainer checkpoint load path, and
  • one eval/demo checkpoint load path.

Until that is fixed, this PR is still not mergeable yet.
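
The first option (keeping the helpers backward-compatible) could be as small as a normalizer at the top of the checkpoint helpers, sketched here with a hypothetical name:

```python
import torch

def to_map_location(gpu_or_device):
    # Accept the old contract (GPU index, -1, or None) as well as a torch.device.
    if isinstance(gpu_or_device, torch.device):
        return gpu_or_device
    if gpu_or_device is None or gpu_or_device < 0 or not torch.cuda.is_available():
        return torch.device("cpu")
    return torch.device(f"cuda:{gpu_or_device}")
```

Something like this would let evaluator.py keep calling load_network(..., self.gpu) and tools/demo.py keep passing an integer id, while the trainer passes self.device.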

@Varun-sai-500 (Contributor, Author) commented Apr 14, 2026

@z-x-yang I don't have a GPU, so I couldn't test it, sorry. But is this good now?

@Varun-sai-500 Varun-sai-500 changed the title CPU fallback for Trainer Feat. Distributed Data Parallel for Trainer Apr 15, 2026
@Varun-sai-500 Varun-sai-500 marked this pull request as draft April 21, 2026 18:01