Feat: Dataset Unification Pipeline #96

Draft

Varun-sai-500 wants to merge 35 commits into yoxu515:main from Varun-sai-500:separate_pretraining

Conversation

Contributor

Varun-sai-500 commented Apr 7, 2026

This PR introduces a structural overhaul of the training pipeline, separating concerns that were previously tightly coupled and blocking extensibility.

**WORK IN PROGRESS**

The original pipeline:

- Hard-coupled static pretraining into trainer.py
- Effectively supported only a single pretraining configuration
- Had fragile dataset assumptions (non-standard formats → silent failures)
- Made it painful to extend, reproduce, or port

This PR fixes that at the root level, not with patches.

🔧 Core Changes

  1. Trainer Refactor (Critical Path)
     - Decoupled static pretraining from trainer.py
     - Removed the StaticTrain dependency from the core training loop
     - Cleaned up dataset-preparation logic → now modular and extensible

👉 Result: trainer.py is now strictly responsible for video training, not overloaded with pretraining concerns.
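To make the separation concrete, here is a minimal sketch of the post-refactor dataset preparation, in spirit only: `VideoDataset` and `prepare_video_datasets` are illustrative stand-ins, not the repo's actual classes.

```python
from torch.utils.data import ConcatDataset, Dataset

class VideoDataset(Dataset):
    """Illustrative stand-in for video dataset classes
    such as DAVIS2017_Train / YOUTUBEVOS_Train."""
    def __init__(self, name):
        self.name = name
        self.samples = []  # populated by the real loaders

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        return self.samples[idx]

def prepare_video_datasets(names):
    """After the refactor, trainer.py assembles only video datasets;
    the StaticTrain / image-pretraining branch no longer lives here."""
    return ConcatDataset([VideoDataset(n) for n in names])

train_data = prepare_video_datasets(['davis2017', 'youtubevos'])
```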

  2. Dedicated Pretraining Pipeline
     - Introduced: networks/managers/pre_trainer.py
     - Static image pretraining is now:
       - Standalone
       - Reproducible
       - Hardware-agnostic (CPU/GPU)

👉 Matches design patterns used in large-scale OSS (separation of stages, not flags inside one script).
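To illustrate the separation-of-stages pattern, a standalone pretraining entry point might look roughly like this; the CLI flags and wiring below are assumptions for the sketch, not the actual pre_trainer.py interface.

```python
import argparse
import torch

def main():
    parser = argparse.ArgumentParser(description='Standalone static-image pretraining stage')
    parser.add_argument('--data', default='datasets/static', help='unified static dataset root')
    parser.add_argument('--epochs', type=int, default=1)
    parser.add_argument('--seed', type=int, default=0, help='fixed seed for reproducibility')
    args = parser.parse_args()

    torch.manual_seed(args.seed)  # reproducible runs
    # Hardware-agnostic: use the GPU when present, otherwise the CPU.
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    print(f'Pretraining on {device}, data root: {args.data}, epochs: {args.epochs}')
    # ... build the model and static-image dataset here, then run the loop ...

if __name__ == '__main__':
    main()
```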

  3. Dataset Unification (Big Fix)
     - Added: tools/unify_datasets.py
     - Converts heterogeneous datasets (COCO, MSRA10K, ECSSD, VOC, etc.) → a single canonical format (datasets/static/)
     - What it solves:
       - Broken assumptions across datasets
       - Inconsistent annotations/layouts
       - Previous inability to use multiple datasets reliably

👉 Now users can run:

`python tools/unify_datasets.py --sources --output datasets/static`
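For intuition, the conversion boils down to copying each source's image/mask pairs into one canonical per-dataset layout plus an index file. The sketch below is illustrative only: the JPEGImages/Annotations layout, the naming scheme, and the `unify` helper are assumptions; the real mapping rules live in tools/unify_datasets.py.

```python
import os
import shutil

def unify(pairs, dst_root, name):
    """Copy one source dataset's (image, mask) pairs into the canonical layout."""
    dst = os.path.join(dst_root, name)
    os.makedirs(os.path.join(dst, 'JPEGImages'), exist_ok=True)
    os.makedirs(os.path.join(dst, 'Annotations'), exist_ok=True)
    names = []
    for i, (img, mask) in enumerate(pairs):
        stem = f'{name}_{i:06d}'
        shutil.copy(img, os.path.join(dst, 'JPEGImages', stem + '.jpg'))
        shutil.copy(mask, os.path.join(dst, 'Annotations', stem + '.png'))
        names.append(stem)
    # Index file listing every sample in this unified dataset.
    with open(os.path.join(dst, 'train.txt'), 'w') as f:
        f.write('\n'.join(names) + '\n')
```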
  4. Pretrained Model Support (Finally Usable)
     - Added 5 pretraining checkpoints in the README
     - Standardized the loading path
     - Removed the dependency on "that one working model"

👉 Result: users can actually run pretraining without debugging the repo for hours.
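The standardized loading path reduces, in spirit, to something like the sketch below; the 'state_dict' wrapping convention and the `load_pretrained` name are assumptions, not the PR's exact code.

```python
import torch

def load_pretrained(model, ckpt_path, device='cpu'):
    """Load a pretraining checkpoint, tolerating raw and wrapped state dicts."""
    ckpt = torch.load(ckpt_path, map_location=device)
    state = ckpt.get('state_dict', ckpt)  # unwrap if the checkpoint is wrapped
    missing, unexpected = model.load_state_dict(state, strict=False)
    if missing or unexpected:
        print(f'Loaded with {len(missing)} missing / {len(unexpected)} unexpected keys')
    return model
```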
  5. Training Pipeline Improvements
     - Updated tools/train.py and train_eval.sh
     - Integrated:
       - An optional pretraining stage
       - Default checkpoint handling
       - A graceful fallback if weights are missing
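The graceful fallback can be summarized by a guard like this (again, the function name and call site are illustrative, not the PR's exact code):

```python
import os
import warnings
import torch

def maybe_load_pretrain(model, path):
    """Load pretraining weights when present; otherwise warn and train from scratch."""
    if path and os.path.isfile(path):
        state = torch.load(path, map_location='cpu')
        model.load_state_dict(state.get('state_dict', state), strict=False)
    else:
        warnings.warn(f'Pretrained weights not found at {path!r}; '
                      'starting from randomly initialized weights.')
    return model
```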

Varun-sai-500 changed the title from "Separate pretraining" to "Decoupled Pretraining + Dataset Unification Pipeline" on Apr 7, 2026
@Varun-sai-500
Contributor Author

@z-x-yang Please ignore the trailing whitespace inside train_datasets; removing it isn't as easy as I thought.

@Varun-sai-500
Contributor Author

@z-x-yang It's a big refactor, but I think it's worth it.

Varun-sai-500 changed the title from "Decoupled Pretraining + Dataset Unification Pipeline" to "Feat. Decoupled Pretraining + Dataset Unification Pipeline" on Apr 12, 2026
Varun-sai-500 changed the title from "Feat. Decoupled Pretraining + Dataset Unification Pipeline" to "Feat: Decoupled Pretraining + Dataset Unification Pipeline" on Apr 12, 2026
Varun-sai-500 marked this pull request as draft on April 15, 2026, 15:14
@Varun-sai-500
Contributor Author

@z-x-yang Can you review this when you get a chance? Don't run it yet since it's incomplete; just check whether the approach looks good.

Varun-sai-500 changed the title from "Feat: Decoupled Pretraining + Dataset Unification Pipeline" to "Feat: Dataset Unification Pipeline" on Apr 25, 2026
@z-x-yang
Collaborator

Thanks for working on this — the dataset unification direction is useful, but I don’t think this PR is merge-ready yet.

A few hard blockers:

  1. dataloaders/train_datasets.py currently removes DAVIS2017_Train, YOUTUBEVOS_Train, and TEST, while networks/managers/trainer.py still imports and uses them. This breaks the main training path at import / dataset preparation time.

  2. StaticTrain now requires dataset_name, but the existing trainer call still uses the old signature:
    StaticTrain(cfg.DIR_STATIC, cfg.DATA_RANDOMCROP, ...).
    So the static pretraining path is not wired up yet either.

  3. _merge_sample was removed, but StaticTrain.__getitem__ still calls it in the dynamic merge branch. That path will fail at runtime.

  4. tools/unify_datasets.py does not create the dst/name/ directory before writing train.txt; a minimal image/mask smoke test fails with FileNotFoundError (a one-line fix is sketched after this list).

  5. The README command points to unify_dataset.py, but the added file is tools/unify_datasets.py, and the static pretraining section currently has duplicated / malformed markdown.
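For (4), here is a minimal sketch of the fix, assuming the tool writes one train.txt per source dataset (the function and variable names are illustrative):

```python
import os

def write_index(dst_root, name, sample_names):
    """Create dst_root/name/ before writing train.txt."""
    dst_dir = os.path.join(dst_root, name)
    os.makedirs(dst_dir, exist_ok=True)  # without this, open() raises FileNotFoundError
    with open(os.path.join(dst_dir, 'train.txt'), 'w') as f:
        f.write('\n'.join(sample_names) + '\n')
```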

I’d suggest narrowing this PR first: keep the existing video-training dataset classes intact, preserve the current trainer API, make the dataset-unification tool standalone and smoke-tested, then update the README after the command actually works. After that I can review the design again.

@Varun-sai-500
Contributor Author

Oh well @z-x-yang, this PR wasn't ready yet; I mistakenly included it in the email. It's still a draft PR.
