Integrate curate into CLI #86

Draft
marcromeyn wants to merge 1 commit into main from romeyn/cli-data-curate

@marcromeyn
Contributor

No description provided.

Signed-off-by: Marc Romeyn <marcromeyn@gmail.com>
Contributor

@ayushdg left a comment

Thanks a lot for adding CLI support. I have a few minor questions and comments. The main change needed is the extra installed from Curator.

Comment on lines +6 to +7
identify: true
remove: true
Contributor

Small note: `identify` is a GPU job, while `remove` is CPU-only. We can run both on a GPU node.

container: nvcr.io/nvidia/nemo:25.02

# Operation flags
identify: true
Contributor

Same as above

Comment on lines +6 to +7
classify: true
ensemble: true
Contributor

`classify` is a GPU job; `ensemble` is CPU-only.
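
The GPU/CPU notes in these review comments could be captured as a small lookup. This is only a sketch: `STAGE_NEEDS_GPU` and `needs_gpu_node` are hypothetical names, not part of this PR or of NeMo Curator, and the mapping simply restates the comments (`identify` and `classify` on GPU, `remove` and `ensemble` on CPU). Treating unknown stages as CPU-only is an assumption made for illustration.

```python
# Hypothetical helper, not part of this PR or of NeMo Curator.
# The mapping restates the review comments: identify/classify are GPU jobs,
# remove/ensemble are CPU-only.
STAGE_NEEDS_GPU = {
    "identify": True,
    "remove": False,
    "classify": True,
    "ensemble": False,
}


def needs_gpu_node(flags: dict[str, bool]) -> bool:
    """Return True if any enabled stage in `flags` is a GPU job.

    Stages missing from STAGE_NEEDS_GPU are assumed to be CPU-only.
    """
    return any(
        enabled and STAGE_NEEDS_GPU.get(stage, False)
        for stage, enabled in flags.items()
    )


if __name__ == "__main__":
    # Operation flags as they appear in the config snippets under review.
    print(needs_gpu_node({"identify": True, "remove": True}))
    print(needs_gpu_node({"classify": True, "ensemble": True}))
```

Under this sketch, a config that enables a GPU stage alongside a CPU-only stage resolves to a GPU node, which matches the suggestion to run both stages there.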

# [tool.runspec]
# schema = "1"
# name = "data/curate/nemotron-cc/exact-dedup"
# image = "nvcr.io/nvidia/nemo:25.02"
Contributor

In the future, we can point to the Curator container.

#
# [tool.runspec.resources]
# nodes = 1
# gpus_per_node = 0
Contributor

Not sure what the example usually is, but this will need GPUs.

Comment thread pyproject.toml
gcs = ["gcsfs>=2024.0.0"]
sentencepiece = ["sentencepiece>=0.2.0"]
xenna = ["cosmos-xenna"]
curator = ["nemo-curator[text-cpu]"]
Contributor

`nemo-curator[text-cuda12]` should cover all the steps merged in so far.
