Open
1 of 1 issue completed
Labels: Dataverse Project (issues related to Dataverse Project software), FY26 Roadmap (this issue is on the FY26 Dataverse Project Roadmap)
Description
Overview
- "Croissant is a high-level format for machine learning datasets that brings together four rich layers."
- This issue tracks our collaboration with the Kaggle team on Croissant.
Mission
- Make datasets easier to find and work with for Machine Learning, at scale and by diverse stakeholders (e.g. AI engineers, AI ethicists, compliance managers, interested public)
Vision
- Croissant is the most convenient and widely used machine-readable format for ML-ready datasets.
Issues
Issues we will probably work on
- Croissant support 🥐 dataverse#10341
- Announce that Croissant is ready for production use. DONE.
- Add Croissant exporter to Harvard Dataverse (and demo) dataverse.harvard.edu#294
- Add Croissant to Signposting "describedby" output dataverse#10542
- Add Schema.org or Croissant metadata to header of Dataset view page dataverse-frontend#350
- Adding Dataverse to Croissant 🥐 Online Health mlcommons/croissant#530
- Reindex datasets in Harvard Dataverse so that the "License" search facet is meaningful. Should be done as part of upgrading to 6.3:
- Include a facet for the Search API for datasets that only have files that are truly open (no custom terms, no guestbooks).
- Let Kaggle know how many datasets and bytes to expect when copying CC0 datasets from Harvard Dataverse (see notes from 2024-07-18 meeting and Slack)
- Let Kaggle know the best way to see when datasets have changed
- Commit data from Dataverse to Kaggle via CroissantML with a button, as an explicit user action. Is this part of a larger story about pushing data to other systems, such as data lakes?
Issues we've opened or are keeping an eye on
Depending on the outcome of these issues, we may enhance our Croissant implementation to cover additional use cases.
- enable the Croissant exporter by default (move code to main repo) dataverse#11254
- 1.0 as a string should be a valid version for a dataset mlcommons/croissant#609
- clarify that citeAs can be used for dataset citations mlcommons/croissant#638
- clarify where to put file paths (e.g ml-25m/ratings.csv) mlcommons/croissant#639
- summary statistics (mean, max, min, etc.) mlcommons/croissant#640
- contentUrl for each format of a file (original proprietary vs archival) mlcommons/croissant#641
- add flag to validator to ignore certain warnings mlcommons/croissant#643
- guidance on large Croissant files, especially in <head> mlcommons/croissant#646
- health: scrapydweb fails to launch, seems to require newer version mlcommons/croissant#647
- Invalid object type for field "distribution" mlcommons/croissant#725
Related