
Conversation

@KeitaW (Contributor) commented Jun 13, 2024

Issue #, if available:

Description of changes:

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@perifaws (Contributor) commented:

Add digits for the directory number?


* AWS optimized [llm-foundry](https://github.com/mosaicml/llm-foundry/tree/main) container image.
* Slurm scripts for the [c4 dataset](https://huggingface.co/datasets/c4) preparation and multi-node distributed training.
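The data-preparation step described above could look something like the sketch below: a Slurm batch script that invokes llm-foundry's `scripts/data_prep/convert_dataset_hf.py` to tokenize a subset of c4. The container image name, paths, and node counts are illustrative assumptions; the converter flags follow the llm-foundry README but may differ across versions.

```shell
#!/bin/bash
#SBATCH --job-name=c4-prep
#SBATCH --nodes=1
#SBATCH --exclusive
#SBATCH --output=logs/%x_%j.out

# Assumed paths: adjust to your cluster and the container image built for this example.
: "${IMAGE:=llm-foundry.sqsh}"
: "${DATA_ROOT:=/fsx/my-copy-c4}"

# Tokenize the small c4 splits into MDS shards (flags per the llm-foundry README).
srun --container-image "${IMAGE}" \
  python scripts/data_prep/convert_dataset_hf.py \
    --dataset c4 --data_subset en \
    --out_root "${DATA_ROOT}" \
    --splits train_small val_small \
    --concat_tokens 2048 \
    --tokenizer EleutherAI/gpt-neox-20b \
    --eos_text '<|endoftext|>'
```

Run with `sbatch prepare-c4.sbatch`; the output shards land under `${DATA_ROOT}` for the training stage.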


Add a prerequisites section? Do you need a 2-node Trn1 ParallelCluster? What would I need to change if I am doing this on SageMaker HyperPod (SMHP)?


Once preprocessing is done, you will run a training job in the next stage.
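For the training stage that follows preprocessing, a minimal sketch is shown below using llm-foundry's Composer entry point. The YAML name and dataset overrides follow the llm-foundry README; the Slurm wrapper, node count, and data path are assumptions for a multi-node run and would need adapting to the actual scripts in this PR.

```shell
#!/bin/bash
#SBATCH --job-name=llm-foundry-train
#SBATCH --nodes=2
#SBATCH --exclusive
#SBATCH --output=logs/%x_%j.out

# Assumed paths: match these to the data-prep stage and your container image.
: "${IMAGE:=llm-foundry.sqsh}"
: "${DATA_ROOT:=/fsx/my-copy-c4}"

# Launch training on the pre-tokenized c4 splits (per the llm-foundry README).
srun --container-image "${IMAGE}" \
  composer scripts/train/train.py yamls/pretrain/mpt-125m.yaml \
    data_local="${DATA_ROOT}" \
    train_loader.dataset.split=train_small \
    eval_loader.dataset.split=val_small
```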


Do we not need to compile the model? Can we add a few sentences on why not? The Llama3 example on Trn1 has a section on compiling the model; maybe add something similar here?


Are we using this file?


Remove the notebook?

@awsankur (Contributor) left a comment:

Thanks for the great work, Keita. Added a few comments.


3 participants