Simulator Hangs When Using STG-Generated Chakra Traces with Astra-sim + NS3 #18

@Leonard226

I am using symbolic_tensor_graph (STG) to generate synthetic Chakra traces in the .et format, with the following command:

python3 /opt/symbolic_tensor_graph/main.py \
	--output_dir output \
	--output_name workload.%d.et \
	--comm_group_file comm_group.json \
	--chakra_schema_version v0.0.4 \
	--dp 8 \
	--pp 1 \
	--tp 1 \
	--sp 1 \
	--weight_sharded 0 \
	--batch 32 \
	--din 2048 \
	--dout 2048 \
	--dmodel 2048 \
	--dff 8192 \
	--seq 1024 \
	--head 24 \
	--num_stacks 16

This successfully generates the following files:

comm_group.json  workload.1.et	workload.3.et  workload.5.et  workload.7.et
workload.0.et	 workload.2.et	workload.4.et  workload.6.et
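
A quick way to confirm that one trace exists per rank is sketched below. It assumes STG writes one file per NPU and that the NPU count equals the product of the --dp, --pp, --tp and --sp degrees (an assumption, although it matches the 8 files listed above). `missing_ranks` is a hypothetical helper, not part of STG.

```python
import re
from math import prod
from pathlib import Path


def missing_ranks(output_dir, degrees, pattern=r"workload\.(\d+)\.et"):
    """Return the sorted rank ids that have no trace file in output_dir."""
    # Assumption: one trace per NPU, NPU count = product of parallel degrees.
    expected = prod(degrees.values())
    found = set()
    for path in Path(output_dir).iterdir():
        match = re.fullmatch(pattern, path.name)
        if match:
            found.add(int(match.group(1)))
    return sorted(set(range(expected)) - found)
```

For the command above, `missing_ranks("output", {"dp": 8, "pp": 1, "tp": 1, "sp": 1})` should return an empty list if all eight traces were written.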

Now, I want to feed those traces into ASTRA-sim + NS3, passing the path to the workload traces (workload.%d.et) and to the communicator group file comm_group.json, as shown here:

WORKLOAD=opt/synthetic_traces/ml/output/workload
SYSTEM=system.json
NETWORK=config.txt
MEMORY=remote_memory.json
LOGICAL_TOPOLOGY=logical_topology.json
COMM_GROUP=/opt/synthetic_traces/ml/output/comm_group.json

${NS3_BIN} \
	--workload-configuration=${WORKLOAD} \
	--system-configuration=${SYSTEM} \
	--network-configuration=${NETWORK} \
	--logical-topology-configuration=${LOGICAL_TOPOLOGY} \
	--remote-memory-configuration=${MEMORY} \
	--comm-group-configuration=${COMM_GROUP}
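
Before launching, the paths can be checked with a small pre-flight sketch. It assumes the simulator builds each trace path as `<prefix>.<rank>.et`, which is inferred from the file names above and not confirmed here. `check_workload_prefix` is a hypothetical helper. Note that WORKLOAD above has no leading `/` while COMM_GROUP does, so the two resolve against different base directories unless the simulator is started from `/`.

```python
from pathlib import Path


def check_workload_prefix(prefix, num_npus, cwd=None):
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    base = Path(prefix)
    if not base.is_absolute():
        # A relative prefix is resolved against the launch directory.
        base = Path(cwd or Path.cwd()) / base
        problems.append(f"prefix is relative; resolves to {base}")
    for rank in range(num_npus):
        # Assumption: trace files are named <prefix>.<rank>.et
        trace = Path(f"{base}.{rank}.et")
        if not trace.is_file():
            problems.append(f"missing trace: {trace}")
    return problems
```

Calling `check_workload_prefix("opt/synthetic_traces/ml/output/workload", 8)` from the launch directory would list every trace that cannot be found at the resolved location.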

However, when I start the simulation, the simulator gets stuck here:

ASTRA-sim + NS3
There are 8 npus: 8,
[2025-07-08 11:34:43.558] [system::topology::RingTopology] [info] ring of node 0, id: 0 dimension: local total nodes in ring: 8 index in ring: 0 offset: 1 total nodes in ring: 8
[2025-07-08 11:34:43.558] [system::topology::RingTopology] [info] ring of node 0, id: 0 dimension: local total nodes in ring: 8 index in ring: 0 offset: 1 total nodes in ring: 8
[2025-07-08 11:34:43.558] [system::topology::RingTopology] [info] ring of node 0, id: 0 dimension: local total nodes in ring: 8 index in ring: 0 offset: 1 total nodes in ring: 8
[2025-07-08 11:34:43.558] [system::topology::RingTopology] [info] ring of node 0, id: 0 dimension: local total nodes in ring: 8 index in ring: 0 offset: 1 total nodes in ring: 8
QP is enabled 
maxRtt=161 maxBdp=9016
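
To tell a hang apart from a very slow run, the launch can be wrapped in a timeout that keeps the last lines printed. This is a generic sketch using only the Python standard library. `run_with_timeout` is a hypothetical helper, and the command list passed to it would be the NS3 invocation shown earlier.

```python
import subprocess


def run_with_timeout(cmd, timeout_s, tail=5):
    """Run cmd; return (finished, last `tail` lines of combined stdout/stderr)."""
    try:
        done = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            text=True,
            timeout=timeout_s,
        )
        output, finished = done.stdout, True
    except subprocess.TimeoutExpired as exc:
        # The child is killed on timeout; partial output may be bytes or None.
        output = exc.output or ""
        if isinstance(output, bytes):
            output = output.decode(errors="replace")
        finished = False
    return finished, output.splitlines()[-tail:]
```

If `finished` is False, the returned lines show roughly where the run stalled. Output written through a pipe may be buffered by the child process, so the true stall point can be slightly later than the last captured line.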

I am confident that the issue is not the topology, system, or memory configuration.
Additionally, I found this slide stating that this is a known issue that is currently being addressed. However, PR#167 has since been closed. Has a solution been found?

Image

Source: https://github.com/astra-sim/symbolic_tensor_graph/blob/main/docs/stg_demo_241006.pptx
