Add datatrove tag to synthetic dataset cards #473
Merged
JoelNiklaus merged 1 commit into huggingface:main on Mar 16, 2026
Every dataset card generated by DataTrove will now include the `datatrove` tag, making these datasets discoverable on the Hub. Made-with: Cursor
Problem
Datasets generated with DataTrove have no way to be discovered or filtered by the `datatrove` tag on the Hugging Face Hub.
Solution
Add `"datatrove"` as a permanent tag in the dataset card generator's tag set, alongside the existing `"synthetic"` tag. Every new synthetic dataset card will now include `- datatrove` in its YAML frontmatter tags.
Testing
Existing tests pass; none assert on specific tag values in the rendered card.
Made with Cursor
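For illustration, the effect of the change on the card's frontmatter can be sketched as below. This is a minimal, self-contained sketch and not DataTrove's actual code: `BASE_TAGS`, `build_tags`, and `render_frontmatter` are hypothetical names, and the real generator may build and order its tags differently.

```python
# Hypothetical sketch; these names are illustrative, not DataTrove's API.
BASE_TAGS = ("synthetic", "datatrove")  # tags assumed to be on every generated card


def build_tags(extra_tags=()):
    """Return the permanent tags followed by any extra tags, without duplicates."""
    tags = list(BASE_TAGS)
    for tag in extra_tags:
        if tag not in tags:
            tags.append(tag)
    return tags


def render_frontmatter(extra_tags=()):
    """Render a minimal YAML frontmatter block that contains only the tags list."""
    lines = ["---", "tags:"]
    lines += [f"- {tag}" for tag in build_tags(extra_tags)]
    lines.append("---")
    return "\n".join(lines)


if __name__ == "__main__":
    # Prints a frontmatter block whose tags list includes "- datatrove".
    print(render_frontmatter(["math"]))
```

A check such as `assert "- datatrove" in render_frontmatter().splitlines()` is the kind of assertion on specific tag values that, per the Testing note, the existing suite does not make.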
Note
Low Risk
Low risk: adds a single constant tag to the dataset card YAML frontmatter, with no changes to inference, uploads, or data processing logic.
Overview
Synthetic dataset cards generated by `InferenceDatasetCardGenerator` now always include the `datatrove` tag (in addition to `synthetic`) in the README YAML frontmatter, improving discoverability/filtering on the Hugging Face Hub.
Written by Cursor Bugbot for commit 5fd2d3f.