imported os and added ckpt scripts #41

Open
dgourab-aws wants to merge 4 commits into accuracy_workstream_trn from ckpt_restore_dataloader_fix

Conversation

@dgourab-aws
Collaborator

No description provided.

@dgourab-aws
Collaborator Author

Tested it locally using: pytest seed_test.py

Collaborator

@HahTK HahTK left a comment


Most of the stuff we need is missing.

  1. Most features are missing (setting the seed, parameterizing fuji.py, all the features from GPU).
  2. We only have GPU training scripts in the TRN repo? Where is the TRN script?
  3. Likely completely untested. Did we run a TRN job?

Comment thread run_trainer.sh

# export JAX_PLATFORMS=cpu

#Perf Tuning Guideline here : https://github.com/NVIDIA/JAX-Toolbox/blob/main/rosetta/docs/PGLE.md
Collaborator


why are GPU flags in TRN runs?

Collaborator Author


This is just the checkpointing script, I did not do any cleanup here.

Comment thread run_trainer.sh
###export NCCL_DEBUG_SUBSYS=COLL

#HAH quick fix
export XLA_FLAGS="--xla_dump_hlo_as_text --xla_dump_to=${HLO_DUMP_PATH} --xla_dump_hlo_pass_re='.*' --xla_dump_hlo_as_proto --xla_gpu_enable_latency_hiding_scheduler=true --xla_gpu_enable_while_loop_double_buffering=true --xla_gpu_enable_pipelined_all_gather=true --xla_gpu_enable_pipelined_reduce_scatter=true --xla_gpu_enable_pipelined_all_reduce=true --xla_gpu_multi_streamed_windowed_einsum=true --xla_gpu_enable_custom_fusions=true" # --xla_gpu_enable_address_computation_fusion=true"
Collaborator


why are GPU flags in TRN runs?

Collaborator Author


This is just the checkpointing script.
@HahTK I need a walkthrough of this script to actually clean it up. I haven't used it to launch TRN jobs; I was using Apoorv's launch script.

Comment thread run_trainer.sh
echo "ERROR : ${TEST_SETUP} for ${N_EXPECTED_NODES} was launched with ${num_nodes}"
exit 1
fi
MESH_SELECTOR="gpu-${num_nodes}node-baseline"
Collaborator


Again, everything here is GPU.

Comment thread run_trainer.sh
@@ -0,0 +1,150 @@
#!/usr/bin/env bash
Collaborator


This needs to be fixed. This is a GPU script being used to run TRN. We need to use the TRN script and just add checkpoint resume to it.

import importlib


class SeedTest(test_utils.TestCase):
Collaborator


where do we actually set the seed?

Collaborator Author


The seed has to be set as an environment variable from any launch script, for example:
export DATA_SEED=42
The launch script has not been added to this PR.
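For context, a minimal sketch of how training code could pick that variable up. The helper name `get_data_seed` and the fallback default are illustrative assumptions, not code from this PR:

```python
import os


def get_data_seed(default: int = 0) -> int:
    """Return the data seed exported by the launch script (export DATA_SEED=42).

    Falls back to `default` when DATA_SEED is unset; the fallback is an
    assumption for illustration, not behavior defined in this PR.
    """
    return int(os.environ.get("DATA_SEED", default))


if __name__ == "__main__":
    os.environ["DATA_SEED"] = "42"
    print(get_data_seed())  # 42
```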

@HahTK
Collaborator

HahTK commented Dec 27, 2024

Also the branch was created from the wrong commit id. It should have been
33ec152

but it seems to be branched from this instead
c20387c

@dgourab-aws
Collaborator Author

Also the branch was created from the wrong commit id. It should have been 33ec152

but it seems to be branched from this instead c20387c

The GPU branch was created from 33ec152, the TRN branch was to be created from the AXLearn upstream branch.
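As an aside, a branch's fork point can be checked mechanically with `git merge-base --is-ancestor`. The sketch below builds a scratch repo purely so it is self-contained; in the real repo you would run only the final command with the actual commit id (e.g. 33ec152) and branch name:

```shell
set -e
# Scratch repo for illustration only; substitute the real repo, branch, and
# commit id (e.g. 33ec152) in practice.
tmp=$(mktemp -d) && cd "$tmp" && git init -q
git -c user.name=demo -c user.email=demo@example.com commit -q --allow-empty -m "base"
base=$(git rev-parse --short HEAD)
git checkout -q -b ckpt_restore_dataloader_fix
git -c user.name=demo -c user.email=demo@example.com commit -q --allow-empty -m "ckpt work"
# Exits 0 only if $base is an ancestor of the branch head, i.e. the branch
# was cut from (or after) that commit.
git merge-base --is-ancestor "$base" ckpt_restore_dataloader_fix && echo "branch contains $base"
```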
