The current Symbolic Tensor Graph (STG) framework is extremely valuable for synthetic modeling of distributed LLM training workloads, supporting rich parallelization strategies (DP/TP/PP/SP) and exporting Chakra execution traces for ASTRA‑sim exploration.
However, LLM inference is increasingly a dominant workload in production systems and presents fundamentally different characteristics from training:
- Autoregressive token-by-token execution
- Heavy emphasis on KV-cache memory traffic and placement
- Absence of a backward pass and optimizer steps
- Latency-critical optimization (end-to-end and per-token) rather than throughput-oriented
- Emerging parallelism strategies (context parallelism, decode batching, speculative decoding)
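To make the first three points concrete, here is a minimal sketch (plain Python, not an STG API; all parameter names and sizes are illustrative assumptions) contrasting the one-shot prefill phase with autoregressive decode, where each generated token grows the KV cache by a fixed increment and there is no backward pass:

```python
def kv_cache_bytes(seq_len, n_layers=32, n_heads=32, head_dim=128, dtype_bytes=2):
    # K and V tensors per layer, each of shape [seq_len, n_heads, head_dim].
    # Model shape is a hypothetical ~7B-class config, fp16 (2 bytes/element).
    return 2 * n_layers * seq_len * n_heads * head_dim * dtype_bytes

def simulate_inference(prompt_len, gen_tokens):
    # Prefill: a single pass over the whole prompt populates the cache.
    cache_len = prompt_len
    trace = [("prefill", prompt_len, kv_cache_bytes(cache_len))]
    # Decode: one token per step; each step reads the full cache and
    # appends one entry -- no backward pass, no optimizer step.
    for _ in range(gen_tokens):
        cache_len += 1
        trace.append(("decode", 1, kv_cache_bytes(cache_len)))
    return trace

for phase, tokens, cache in simulate_inference(prompt_len=512, gen_tokens=4):
    print(f"{phase}: +{tokens} tokens, KV cache = {cache / 2**20:.1f} MiB")
```

Even this toy model shows the asymmetry a trace generator would need to capture: prefill is one large compute-bound operator, while decode is a long tail of small, memory-bound steps whose KV-cache footprint grows linearly with generated tokens.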
Extending STG to natively support LLM inference trace generation would make it highly impactful for deployment‑time system design, hardware exploration, and inference‑stack research, complementing the current training‑focused workflow.