Feat: Dataset Unification Pipeline#96
Conversation
|
@z-x-yang Please ignore that trailing whitespace inside train_datasets, I don't know but removing it isn't as easy as I thought. |
|
@z-x-yang Big Refactor, but I think it's worth it |
|
@z-x-yang Can you review it if possible? Don't run yet, as it's incomplete, just check it if this is good |
|
Thanks for working on this — the dataset unification direction is useful, but I don’t think this PR is merge-ready yet. A few hard blockers:
I’d suggest narrowing this PR first: keep the existing video-training dataset classes intact, preserve the current trainer API, make the dataset-unification tool standalone and smoke-tested, then update the README after the command actually works. After that I can review the design again. |
|
oh well @z-x-yang this PR wasn't ready yet, I mistakenly added this too in the email, it's still a draft PR |
This PR introduces a structural overhaul of the training pipeline, separating concerns that were previously tightly coupled and blocking extensibility.
The original pipeline:
Hard-coupled static pretraining into trainer.py
Effectively supported only a single pretraining configuration
Had fragile dataset assumptions (non-standard formats → silent failures)
Made it painful to extend / reproduce / port
This PR fixes that at the root level, not with patches.
🔧 Core Changes
Decoupled static pretraining from trainer.py
Removed StaticTrain dependency from the core training loop
Cleaned dataset preparation logic → now modular and extensible
👉 Result:
trainer.py is now strictly responsible for video training, not overloaded with pretraining concerns.
Introduced: networks/managers/pre_trainer.py
Static image pretraining is now:
Standalone
Reproducible
Hardware-agnostic (CPU/GPU)
👉 Matches design patterns used in large-scale OSS (separation of stages, not flags inside one script)
Added: tools/unify_datasets.py
Converts heterogeneous datasets (COCO, MSRA10K, ECSSD, VOC, etc.) → single canonical format (datasets/static/)
What it solves:
Broken assumptions across datasets
Inconsistent annotations/layouts
Previous inability to use multiple datasets reliably
👉 Now users can:
python tools/unify_datasets.py --sources --output datasets/static
4. Pretrained Model Support (Finally Usable)
Added 5 pretraining checkpoints in README
Standardized loading path
Removed dependency on “that one working model”
👉 Result:
Users can actually run pretraining without debugging the repo for hours
5. Training Pipeline Improvements
Updated tools/train.py and train_eval.sh
Integrated:
Optional pretraining stage
Default checkpoint handling
Graceful fallback if weights are missing