From 8b51dda0c8a771effffd70ea6b9d539bd3483550 Mon Sep 17 00:00:00 2001 From: Jinsun Yoo Date: Tue, 8 Jul 2025 06:22:24 +0000 Subject: [PATCH 1/2] Fix typo in build script --- _scripts/build_current_doc.sh | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_scripts/build_current_doc.sh b/_scripts/build_current_doc.sh index e2954d6..fcf8192 100755 --- a/_scripts/build_current_doc.sh +++ b/_scripts/build_current_doc.sh @@ -7,6 +7,6 @@ rm -rf ./_build # Compile the latest version sphinx-build \ -b html \ - -j $(nprocs) \ + -j $(nproc) \ -d ./_build/doctree \ . ./_build/html From 3cdd9a9a2117a9713056b7530ff16e887701f2de Mon Sep 17 00:00:00 2001 From: Jinsun Yoo Date: Thu, 24 Jul 2025 18:04:33 +0000 Subject: [PATCH 2/2] Update Explanation on Custom Collectives --- getting-started/argument-system-config.md | 4 ++ system-layer/collective-implementation.md | 33 ++++++-------- system-layer/sysinput.md | 52 +++++++++++++---------- 3 files changed, 46 insertions(+), 43 deletions(-) diff --git a/getting-started/argument-system-config.md b/getting-started/argument-system-config.md index 74cedd3..29a22a6 100644 --- a/getting-started/argument-system-config.md +++ b/getting-started/argument-system-config.md @@ -29,6 +29,10 @@ $ ${ASTRA_SIM}/examples/network_analytical/system.json - **preferred-dataset-splits**: (int) - The number of chunks we divide each collective into. +:::{note} +For a detailed discussion on collective-related inputs, refer to the [System Layer](../system-layer/sysinput.md) section. +::: + - **all-reduce-implementation**: (Dimension0Collective_Dimension1Collective_xxx_DimensionNCollective) - Here we can create a multiphase collective all-reduce algorithm and directly specify the collective algorithm type for each logical dimension. The available options (algorithms) are: ring, direct, doubleBinaryTree, oneRing, oneDirect.
- For example, "ring_doubleBinaryTree" means we create a logical topology with 2 dimensions and we perform the ring algorithm on the first dimension, followed by the double binary tree algorithm on the second dimension, for the all-reduce pattern. Hence the number of physical dimensions should be equal to the number of logical dimensions. The only exceptions are oneRing/oneDirect, where, no matter how many physical dimensions we have, we create one big logical ring/direct (AllToAll) topology where all NPUs are connected and perform a one-phase ring/direct algorithm. diff --git a/system-layer/collective-implementation.md b/system-layer/collective-implementation.md index 55ff097..33dc7be 100644 --- a/system-layer/collective-implementation.md +++ b/system-layer/collective-implementation.md @@ -1,36 +1,29 @@ -## Collective Implementation -![Collective Implementation](/_static/images/coll_implementation.svg) +## Collective Algorithms: Native or Custom Implementation +![Collective Algorithm Implementation](/_static/images/coll_implementation.svg) -As discussed before, the simulator takes in a collective communication (e.g., AllReduce, AllGather, etc.) and breaks it down into send and receive messages. These send and receive messages are simulated by the network backend. +As discussed before, the simulator takes in a collective communication (e.g., AllReduce, AllGather, etc.) and breaks it down into send and receive messages. These send and receive messages are simulated by the network backend. -There are two ways the simulator breaks a collective into send and receive messages. The prominent method so far was for the simulator to implement a predefined set of commonly used algorithms (e.g. Ring, DoubleBinary, HalvingDoubling, etc). This 'Native' implementation logic resides within the simulator codebase and allows users to quickly explore a predefined set of algorithms. +There are two ways the simulator breaks a collective into send and receive messages.
+- **Native Collective**: The simulator implements a predefined set of commonly used algorithms (e.g., Ring, DoubleBinary, HalvingDoubling, etc.). This implementation logic resides within the simulator codebase and allows users to quickly explore a predefined set of algorithms. +- **Custom Collective**: Users provide their own definition of the collective algorithm (i.e., the sequence of events that happen in the algorithm). ASTRA-sim's System layer exposes a *collective API* through which users can define *custom, arbitrary* collective algorithms. -Since August 2024, ASTRA-sim supports a new way of collective algorithm representation. The system layer exposes a collective API, through which it can receive definitions of _arbitrary_ collective algorithms. +Both methods are implementations of the `CollectivePhase::Algorithm` object, which is the unit of scheduling in the System layer (refer to the previous scheduling page). Please refer to the code in [CollectivePhase.hh](https://github.com/astra-sim/astra-sim/blob/master/astra-sim/system/CollectivePhase.hh) for a definition of the `Algorithm` object, and to [native_collectives] and [custom_collectives] for implementations. Because they implement the `CollectivePhase::Algorithm` object, both methods are executed through the stream-based scheduler described in ::collective-scheduler. -Both methods are implementations of the `CollectivePhase::Algorithm` object, which is the unit of scheduling in the System layer (refer to the previous scheduling page). Please refer to the code in [https://github.com/astra-sim/astra-sim/blob/master/astra-sim/system/CollectivePhase.hh](https://github.com/astra-sim/astra-sim/blob/master/astra-sim/system/CollectivePhase.hh). +### Native Collective Implementation +This part is still under construction.
For now, please refer to the code in the [native_collectives] directory, especially [Ring.cc](https://github.com/astra-sim/astra-sim/blob/master/astra-sim/system/collective/Ring.cc), as a starting point. -### ASTRA-Sim Native Implementation -This part is still under construction. For now, please refer to the code in [https://github.com/astra-sim/astra-sim/blob/master/astra-sim/system/collective/Ring.cc](https://github.com/astra-sim/astra-sim/blob/master/astra-sim/system/collective/Ring.cc) -### Chakra Based Arbitrary Definition Through Collective API - -An inherent limitation of the above native method is that to simulate a new collective algorithm, one would have to implement the whole collective in ASTRA_sim native code. +### Custom Collective Implementation +An inherent limitation of the above native method is that, to simulate a new collective algorithm, one would have to implement the whole collective in ASTRA-sim native code. With an increasing amount of work on non-regular collectives, such as TACOS (topology-aware collectives) [1] and MSCCLang (expressively written, DSL-based collectives) [2], the need to quickly simulate and iterate over a wide variety of arbitrary collective algorithms becomes ever more important.
+Therefore, we expose a new `collective API` to accept the definition of any collective algorithm, not limited to the predefined set (Ring, etc.). For the representation, we use the Chakra ET schema as a separate graph: the collective algorithm is represented as a graph of COMM_SEND and COMM_RECV nodes. That is, instead of the system layer itself breaking collectives down into send and receive messages, it simply follows the breakdown already represented in the Chakra graph. Since ASTRA-sim already uses Chakra ET to represent the workload, using Chakra ET to additionally define collective algorithms provides a simple, uniform way to navigate through both graphs.
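As an illustration of what such a collective graph encodes, here is a minimal, self-contained Python sketch of the per-rank COMM_SEND/COMM_RECV schedule for a ring all-reduce (reduce-scatter followed by all-gather). This is illustrative pseudocode only: the function and event tuples are hypothetical, not the actual Chakra protobuf schema or any ASTRA-sim API.

```python
# Hypothetical sketch of the per-rank event schedule that a Chakra ET
# collective graph encodes for an N-rank ring all-reduce. The COMM_SEND /
# COMM_RECV labels mirror the Chakra node types, but this is plain Python,
# not the actual Chakra protobuf schema.
def ring_allreduce_schedule(num_ranks):
    """Return {rank: [(op, peer, chunk_index), ...]} in issue order."""
    schedule = {rank: [] for rank in range(num_ranks)}
    # Phase 1, reduce-scatter: each rank forwards a partially reduced
    # chunk to its right neighbor for N-1 steps.
    for step in range(num_ranks - 1):
        for rank in range(num_ranks):
            schedule[rank].append(
                ("COMM_SEND", (rank + 1) % num_ranks, (rank - step) % num_ranks))
            schedule[rank].append(
                ("COMM_RECV", (rank - 1) % num_ranks, (rank - step - 1) % num_ranks))
    # Phase 2, all-gather: the fully reduced chunks circulate once more.
    for step in range(num_ranks - 1):
        for rank in range(num_ranks):
            schedule[rank].append(
                ("COMM_SEND", (rank + 1) % num_ranks, (rank - step + 1) % num_ranks))
            schedule[rank].append(
                ("COMM_RECV", (rank - 1) % num_ranks, (rank - step) % num_ranks))
    return schedule

sched = ring_allreduce_schedule(4)
# Each rank issues 2 * (N - 1) send/recv pairs in total for N = 4.
assert all(len(events) == 12 for events in sched.values())
```

Each `(op, peer, chunk)` tuple would correspond to one COMM_SEND or COMM_RECV node in the graph, with per-rank ordering expressed as dependency edges between consecutive nodes.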
**Reference** [1] Won, William, et al. "TACOS: Topology-Aware Collective Algorithm Synthesizer for Distributed Machine Learning.", In Proceedings of IEEE MICRO 2024 (to appear), https://arxiv.org/abs/2304.05301. [2] Meghan Cowan, et al. "MSCCLang: Microsoft Collective Communication Language.", In Proceedings of ASPLOS 2023, https://doi.org/10.1145/3575693.3575724 - - - - - - diff --git a/system-layer/sysinput.md b/system-layer/sysinput.md index 900a9e6..bfa3582 100644 --- a/system-layer/sysinput.md +++ b/system-layer/sysinput.md @@ -1,10 +1,10 @@ -# Input Files for Collective API -This section describes specific pointers to the collective API within the system layer input. +# Input Files for Collectives +This section describes specific pointers to the collective API within the system layer input. For a general description of the other items in the system layer input, please refer to the [Argument ${SYSTEM_CONFIG}](https://astra-sim.github.io/astra-sim-docs/getting-started/argument-system-config.html) page. -For a detailed explanation of the behavior of the system layer, please refer to the previous sections. +For a detailed explanation of the behavior of the system layer, please refer to the previous sections. -## ASTRA-sim Native -Let's briefly take note of an example of the system layer input using the astra-sim native implementation: +## Native Collective Implementation +Let's briefly take note of an example of the system layer input using the native collective implementation: ``` ... @@ -15,51 +15,57 @@ Let's briefly take note of an example of the system layer input using the astra- ... ``` -Note the values of the `all-*-implementation` items. These entries point to how the simulator will decompose the given collective into send and receive messages. Two entries in the list mean the simulator will break down the All Gather across two dimensions - the first dimension uses the ring algorithm and the second dimension uses the double binary algorithm. 
Note that repeated entries such as `["ring", "ring"]` are also possible. How the physical nodes are broken down into each dimension is defined by the network backend. For now, the native implementation requires that the dimensions for collective algorithsm are same across all collectives. The above example, where AllReduce is a 1D collective but AllGather is a 2D collective is simply an illustrative example. +Note the values of the `all-*-implementation` items. These entries point to how the simulator will decompose the given collective into send and receive messages. Two entries in the list mean the simulator will break down the AllGather across two dimensions - the first dimension uses the ring algorithm and the second dimension uses the double binary tree algorithm. Note that repeated entries such as `["ring", "ring"]` are also possible. How the physical nodes are broken down into each dimension is defined by the network backend. For now, the native implementation requires that the dimensions for collective algorithms are the same across all collectives. The above example, where AllReduce is a 1D collective but AllGather is a 2D collective, is simply illustrative. -The mapping between each string value and the corresponding simulator code can be found in the [generate_collective_impl_from_input](https://github.com/astra-sim/astra-sim/blob/92fc71a71752f4e38d92c7d03a44829114d70143/astra-sim/system/Sys.cc#L468) function. +The mapping between each string value and the corresponding simulator code can be found in the [generate_collective_impl_from_input](https://github.com/astra-sim/astra-sim/blob/92fc71a71752f4e38d92c7d03a44829114d70143/astra-sim/system/Sys.cc#L468) function. ## Collective API -The below is an example of the system input using the Collective API: +Below is an example of the system input using the Collective API: ``` ...
"active-chunks-per-dimension": 1, - "all-reduce-implementation-chakra": ["/app/hoti2024/demo5/inputs/custom_ring"], + "all-reduce-implementation-custom": ["examples/custom_collective/allreduce_ring"], ... ``` -Note some differences: -First, we use the key `all-*-implemenation-chakra` instead of `all-*-implemenation`. Note how the Chakra ET files refereed here is different from the workload file passed to the workload layer. The value in each item is the absolute path to the Chakra ET files, excluding the last `{rank}.et` string (this is similar to the Workload layer input). Also, even if there are many 'dimensions', the list only accepts one value. This is because the notion of cross-dimension communication is already included in the Chakra ET. +Note some differences: +- First, we use the key `all-*-implementation-custom` instead of `all-*-implementation`. +- Second, for the value, we point to a set of Chakra ET files instead of the name of a generic algorithm. Note that the Chakra ET files referred to here are different from the workload file passed to the workload layer. The path can be absolute or relative to the working directory. The filepath should exclude the trailing `{rank}.et` string (this is similar to the Workload layer input). Also, even if there are many 'dimensions', the list only accepts one value. This is because the notion of cross-dimension communication is already included in the Chakra ET. + +One thing to note is that the focus of this collective API is to represent custom collectives; the Chakra ET simply happens to be the format used to represent the collective. ### Generating Chakra ET Representation from Collective Tools -We now talk about obtaining the Chakra ET files to define the collective algorithm. +We now talk about generating the Chakra ET files to define the custom collective. Using a generic Collective API allows us to generate the collective representation from several tools.
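As a small illustration of the `{prefix}.{rank}.et` naming convention described above, the configured value expands to one Chakra ET file per rank. The helper below is hypothetical, not ASTRA-sim code:

```python
# Hypothetical helper illustrating the `{prefix}.{rank}.et` convention:
# the system input holds only the filename prefix, and one .et file per
# rank is expected on disk next to it.
def per_rank_et_files(prefix, num_ranks):
    return [f"{prefix}.{rank}.et" for rank in range(num_ranks)]

files = per_rank_et_files("examples/custom_collective/allreduce_ring", 4)
# files[0] is "examples/custom_collective/allreduce_ring.0.et"
```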
#### MSCCLang The MSCCLang Domain Specific Language (DSL) allows users to easily and expressively write arbitrary, custom collective algorithms. -For a more detailed explanation into the MSCCLang DSL, please refer to their paper at [1]. -First, we lower the MSCCLang DSL program into an Intermediate Representation (IR) called MSCCL-IR, which is in XML format. -``` -git clone git@github.com:jinsun-yoo/msccl-tools.git -cd msccl-tools/examples/mscclang +For a more detailed explanation of the MSCCLang DSL, please refer to their paper at [1]. +First, we lower the MSCCLang DSL program into an Intermediate Representation (IR) called MSCCL-IR, which is in XML format. + +```bash +git clone git@github.com:astra-sim/collectiveapi.git --recurse-submodules +cd collectiveapi/msccl-tools/examples/mscclang python3 allreduce_a100_ring.py ${NUM_GPUS} 1 1 > allreduce_ring.xml ``` -Then, we convert this into a Chakra ET that ASTRA-sim's collective API can understand. -``` -git clone git@github.com:astra-sim/collectiveapi.git -cd chakra_converter +Then, we convert this into a Chakra ET that ASTRA-sim's collective API can understand. +```bash +cd ../chakra_converter python3 et_converter.py \ - --input_type msccl \ --input_filename ${FILEPATH}/allreduce_ring.xml \ --output_filename allreduce_ring_mscclang \ - --coll_size 1048576 + --coll_size 1048576 \ + --collective allreduce ls allreduce_ring_mscclang* allreduce_ring_mscclang.0.et allreduce_ring_mscclang.1.et allreduce_ring_mscclang.2.et allreduce_ring_mscclang.3.et ... ``` + +Then, in the `system-configuration` input, set the key `all-reduce-implementation-custom` with this filename prefix as the value. + Note how we have to provide the communication size to the converter. This is a current limitation where we have to hardcode the collective size into the algorithm, and ignore the collective size of the Chakra ET in the workload layer. We will release an update shortly to fix this limitation. #### TACOS