
Refactor: Split COCO dataset into detection and segmentation datasets #280

Merged
cregouby merged 6 commits into mlverse:main from Chandraveersingh1717:split-coco-dataset
Feb 4, 2026

@Chandraveersingh1717
Contributor

Summary

Split coco_detection_dataset() into two specialized datasets for object detection and instance segmentation tasks.

Problem

  • The current implementation keeps a 500MB+ annotation object in memory for the entire lifetime of the dataset
  • The documentation is confusing because it mixes detection and segmentation use cases
  • The two tasks never run together (they use different model architectures)
  • The cache is poorly organized, with large files that are hard to identify

Solution

coco_detection_dataset() - Object Detection Only

  • Returns: boxes, labels, area, iscrowd
  • Memory: ~250MB (50% reduction)
  • Use: Faster R-CNN, YOLO, SSD

coco_segmentation_dataset() - Instance Segmentation (NEW)

  • Returns: boxes, labels, area, iscrowd, segmentation, masks
  • Memory: ~250MB
  • Use: Mask R-CNN, DeepLab

Cache Organization: Files are now stored in a /coco subdirectory for easier identification
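
To make the memory argument above concrete, here is a minimal base R sketch of why a detection-only dataset can be lighter. It is an illustration under stated assumptions, not the code from this PR: the field names follow the public COCO annotation format, while `make_annotation()`, the record count, and all values are hypothetical. The actual saving depends on the real annotation file and on how the package stores it.

```r
# Hypothetical COCO-style annotation records. Field names follow the COCO
# annotation format; the values are made up for illustration only.
make_annotation <- function(id) {
  list(
    id = id,
    image_id = 1L,
    category_id = 18L,
    bbox = c(10, 20, 50, 80),        # x, y, width, height
    area = 4000,
    iscrowd = 0L,
    segmentation = list(runif(200))  # polygon vertices, the bulky part
  )
}
annotations <- lapply(1:1000, make_annotation)

# Detection only needs the box-level fields, so the polygons can be dropped.
detection_fields <- c("id", "image_id", "category_id", "bbox", "area", "iscrowd")
detection_annotations <- lapply(annotations, function(a) a[detection_fields])

# Compare the in-memory footprint of the two representations.
full_size <- as.numeric(object.size(annotations))
slim_size <- as.numeric(object.size(detection_annotations))
cat(sprintf("full: %.0f KB, detection-only: %.0f KB\n",
            full_size / 1024, slim_size / 1024))
```

In this toy setup the polygon vectors dominate the footprint, which is the same effect the PR description attributes to the full annotation object.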

Breaking Change

Segmentation users must migrate:

# Before
coco_detection_dataset(..., target_transform = target_transform_coco_masks)

# After
coco_segmentation_dataset(..., target_transform = target_transform_coco_masks)

…ory reduction and better UX (Breaking: segmentation users migrate to coco_segmentation_dataset)
@cregouby
Collaborator

Hello @Chandraveersingh1717,

Thanks a lot for this contribution! It actually fixes #171! (And sorry for the delay in reviewing it, I missed the notification.)
Please give me a few days to review it.

@cregouby cregouby self-requested a review January 31, 2026 17:25
Collaborator

@cregouby cregouby left a comment

Hello @Chandraveersingh1717

praise: Thanks a lot for this contribution.
todo (missing): Would you add your name to the contributor list in the DESCRIPTION file?
todo (documentation): The package documentation is not up to date with your roxygen2 changes. Please run devtools::document() and resubmit the Rd files.
todo: R CMD CHECK fails and should be fixed.
suggestion: You can see the R CMD CHECK failures in the CI/CD pipeline logs (the "Checks" tab on GitHub), or reproduce them by running devtools::check() locally.
todo: Please remove all unexpected files from the package's top-level folder, as per:

❯ checking top-level files ... NOTE
Non-standard files/directories found at top level:
‘test_coco_changes.R’ ‘tiny_meta.json’ ‘tiny_meta.ndjson’

todo: Please mention in the NEWS whether you used an LLM, if any.
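
The cleanup requested above can be rehearsed with a small base R sketch. The helper `stray_top_level_files()` and the `demo_pkg` directory are hypothetical and exist only for illustration; the file names are the ones listed in the NOTE. The sketch runs against a throwaway directory rather than a real package, and the two devtools calls the reviewer mentions are left as comments because they need the actual package source.

```r
# Hypothetical helper: report which of the files named in the NOTE still
# exist at the top level of a package directory.
stray_top_level_files <- function(pkg_dir,
                                  candidates = c("test_coco_changes.R",
                                                 "tiny_meta.json",
                                                 "tiny_meta.ndjson")) {
  candidates[file.exists(file.path(pkg_dir, candidates))]
}

# Throwaway directory standing in for the package root.
pkg_dir <- file.path(tempdir(), "demo_pkg")
dir.create(pkg_dir, showWarnings = FALSE)
file.create(file.path(pkg_dir, c("DESCRIPTION", "tiny_meta.json")))

# Find the stray files, delete them, then confirm nothing is left.
stray <- stray_top_level_files(pkg_dir)
file.remove(file.path(pkg_dir, stray))
remaining <- stray_top_level_files(pkg_dir)

# In the real package root, the other review items map to:
# devtools::document()  # regenerate Rd files from roxygen2 comments
# devtools::check()     # run R CMD check locally
```

Once `remaining` is empty and `devtools::check()` no longer reports the top-level files NOTE, the regenerated Rd files can be committed together with the cleanup.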


## New datasets

* Added `vggface2_dataset()` for loading the VGGFace2 dataset (@DerrickUnleashed, #238).

todo: Please do not remove existing news entries.

@cregouby cregouby merged commit b93ee12 into mlverse:main Feb 4, 2026
@cregouby
Collaborator

cregouby commented Feb 4, 2026

Thanks a lot!


2 participants