I am using symbolic_tensor_graph to generate synthetic chakra traces in the .et format. I am using the following command:
python3 /opt/symbolic_tensor_graph/main.py \
--output_dir output \
--output_name workload.%d.et \
--comm_group_file comm_group.json \
--chakra_schema_version v0.0.4 \
--dp 8 \
--pp 1 \
--tp 1 \
--sp 1 \
--weight_sharded 0 \
--batch 32 \
--din 2048 \
--dout 2048 \
--dmodel 2048 \
--dff 8192 \
--seq 1024 \
--head 24 \
--num_stacks 16
And this successfully generates the following files:
comm_group.json workload.1.et workload.3.et workload.5.et workload.7.et
workload.0.et workload.2.et workload.4.et workload.6.et
Now I want to feed those traces into astra-sim + ns3, passing the workload path prefix (for the workload.%d.et files) and the communicator group file comm_group.json, as you can see here:
WORKLOAD=opt/synthetic_traces/ml/output/workload
SYSTEM=system.json
NETWORK=config.txt
MEMORY=remote_memory.json
LOGICAL_TOPOLOGY=logical_topology.json
COMM_GROUP=/opt/synthetic_traces/ml/output/comm_group.json
${NS3_BIN} \
--workload-configuration=${WORKLOAD} \
--system-configuration=${SYSTEM} \
--network-configuration=${NETWORK} \
--logical-topology-configuration=${LOGICAL_TOPOLOGY} \
--remote-memory-configuration=${MEMORY} \
--comm-group-configuration=${COMM_GROUP}
However, when I start the simulation, the simulator gets stuck here:
ASTRA-sim + NS3
There are 8 npus: 8,
[2025-07-08 11:34:43.558] [system::topology::RingTopology] [info] ring of node 0, id: 0 dimension: local total nodes in ring: 8 index in ring: 0 offset: 1 total nodes in ring: 8
[2025-07-08 11:34:43.558] [system::topology::RingTopology] [info] ring of node 0, id: 0 dimension: local total nodes in ring: 8 index in ring: 0 offset: 1 total nodes in ring: 8
[2025-07-08 11:34:43.558] [system::topology::RingTopology] [info] ring of node 0, id: 0 dimension: local total nodes in ring: 8 index in ring: 0 offset: 1 total nodes in ring: 8
[2025-07-08 11:34:43.558] [system::topology::RingTopology] [info] ring of node 0, id: 0 dimension: local total nodes in ring: 8 index in ring: 0 offset: 1 total nodes in ring: 8
QP is enabled
maxRtt=161 maxBdp=9016
I am confident that the issue is not the topology, system or memory configuration.
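To rule out the trace inputs as well, a minimal sanity check along the following lines could be run before starting the simulator. This is only a sketch under stated assumptions: that the simulator resolves per-rank traces as `<prefix>.<rank>.et` (which matches the file names generated above), that the number of ranks is the product of the dp, pp, tp and sp degrees, and that comm_group.json maps a group id to a list of NPU ids. The helper names are hypothetical and not part of astra-sim or symbolic_tensor_graph.

```python
import json
import os


def expected_ranks(dp, pp, tp, sp):
    # Assumption: one trace file per rank, with the rank count equal to
    # the product of the parallelism degrees passed to the generator.
    return dp * pp * tp * sp


def missing_traces(prefix, num_npus):
    """Return the trace paths that do not exist for ranks 0..num_npus-1."""
    # Assumption: traces are looked up as "<prefix>.<rank>.et".
    paths = [f"{prefix}.{rank}.et" for rank in range(num_npus)]
    return [p for p in paths if not os.path.isfile(p)]


def bad_comm_group_ids(comm_group_path, num_npus):
    """Return NPU ids in the comm group file that fall outside 0..num_npus-1."""
    with open(comm_group_path) as f:
        groups = json.load(f)
    # Assumption: the file maps a group id to a list of integer NPU ids.
    out_of_range = {
        npu
        for members in groups.values()
        for npu in members
        if not 0 <= npu < num_npus
    }
    return sorted(out_of_range)


# Example usage with the values from this report:
#   n = expected_ranks(dp=8, pp=1, tp=1, sp=1)
#   print(missing_traces("opt/synthetic_traces/ml/output/workload", n))
#   print(bad_comm_group_ids("/opt/synthetic_traces/ml/output/comm_group.json", n))
```

Passing the WORKLOAD value exactly as it is given to the simulator means the check would also flag a relative versus absolute path mismatch, since every trace would then be reported as missing.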
Additionally, I found this slide stating that this is a known issue that is currently being addressed. However, PR #167 has since been closed, so I am wondering whether a solution has been found.
Source: https://github.com/astra-sim/symbolic_tensor_graph/blob/main/docs/stg_demo_241006.pptx