25 commits
0171bc2
repo: created a new branch for review
kaustav-goswami Apr 9, 2024
f401564
ext,config,mem: updated the code for review
kaustav-goswami Apr 11, 2024
688c221
ext,doc: added a documentation and sst-script
kaustav-goswami Apr 11, 2024
e27e690
configs,ext: final touchups for the composable project
kaustav-goswami Apr 22, 2024
2b05d8c
configs: hotfix to include fatal
kaustav-goswami Apr 22, 2024
32bd5d1
doc: updated the documentation
kaustav-goswami Apr 30, 2024
7071c0a
configs: upstreamed experiment scripts (partial)
kaustav-goswami May 3, 2024
223f796
doc: updated the README wtih traffic generator instructions
kaustav-goswami May 3, 2024
c6dc87a
exp: starting to prepare for the experiments
kaustav-goswami May 8, 2024
f589eeb
configs: fixed the system; added an npb script
kaustav-goswami May 10, 2024
16bf9e9
fixed the issues in the main scripts
mbabaie May 10, 2024
e431eb0
temp: uploading to kg to fix ext mem
kaustav-goswami May 11, 2024
c4d8cd6
exp: starting to prepare for the experiments
kaustav-goswami May 8, 2024
695af04
configs: fixed the system; added an npb script
kaustav-goswami May 10, 2024
b4ec774
stats,ext: added new stats in the external memory
kaustav-goswami May 14, 2024
9bec144
shrank the scripts and puut everything in the same dir: disaggdisaggr…
mbabaie May 15, 2024
b62eb9c
merging with latest changes from the repo
mbabaie May 15, 2024
07d32ad
all the shrunk scripts for running stream and npb, with checkpoint or…
mbabaie May 20, 2024
908cf77
ext,mem: Fixed incorrect bandwidth calculation and added a new stat t…
kaustav-goswami May 22, 2024
5f35af5
fixed the remote memory ranges for np apps and added md5sum for resou…
mbabaie May 24, 2024
6618566
Merge branch 'kg/composable-memory-2' of https://github.com/darchr/ge…
mbabaie May 24, 2024
d301960
latest version of scripts for experiments
mbabaie May 28, 2024
2031cb7
x86,ext: added a composable memory board to the repo
kaustav-goswami Jun 29, 2024
fd515c6
config,ext: added a arm shared memory board
kaustav-goswami Aug 28, 2024
c99fc7f
arch-x86,stdlib: updated the x86 main board with madt
kaustav-goswami Sep 18, 2024
177 changes: 177 additions & 0 deletions README-DM.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,177 @@
# Composable Memory Simulation Platform

This document describes how to use the composable memory simulation platform
in gem5, SST, and gem5 + SST setups.
The platform can be used in gem5 to fast-forward a full-system simulation and
then in SST to simulate a multi-node system.

The code is mainly confined to the `disaggregated_memory` directory.
The directory is divided into four subdirectories, similar to the structure of
gem5's standard library:

- `boards`: The disaggregated memory boards are inherited from the stdlib's
  boards. Users can pass two memory ranges. The first one models the local
  memory and the second one models a remote memory. The remote memory may
  or may not be in gem5, as these boards can be used directly with SST. These
  ranges are exposed as NUMA and zNUMA nodes to the operating system.
  Currently the following boards are supported:
    - `ArmComposableMemoryBoard` implemented in `arm_main_board.py`
    - `RiscvComposableMemoryBoard` implemented in `riscv_main_board.py`
- `memories`: This directory contains `ExternalRemoteMemory`, inherited from
  `ExternalMemory`. Users can use both gem5 and SST to model this remote memory.
- `cachehierarchies`: gem5's stdlib cache hierarchies were modified to handle
  more than one outgoing connection from the LLC. Currently the following
  cache hierarchies are supported:
    - `ClassicPrivateL1PrivateL2DMCache`: A 2-level private classic cache
      hierarchy.
    - `ClassicPrivateL1PrivateL2SharedL3DMCache`: A 3-level classic cache
      hierarchy that has a shared LLC.
    - *Note*: Ruby caches only work with the `RiscvComposableMemoryBoard`.
- `configs`: Top-level gem5 scripts that can be used to take checkpoints or run
  SST simulations.

Instructions on how to use this platform can be found in the following
sections.

## Workflow

In short, we use this setup to fast-forward simulations using gem5 to reach the
region of interest (ROI) and take a checkpoint. We then end the simulation and
start it again in SST while loading the checkpoint.

SST does not allow untimed memory accesses at runtime, as different gem5 nodes
might be residing on different processes. Therefore, we split the simulation
into two phases. The following diagram shows the workflow of the platform.

```
G  t0 : starting simulation in gem5 (atomic/kvm)
E  |
M  |     t1 : simulation reached the start of ROI
5  |_____|____________________________________________________________ time ->
         |                                                 |
S        t2 : we start the simulation in SST (timing)      |
S                                                          |
T                                      end of simulation : t3
```
The first phase runs entirely in gem5, between times t0 and t1. The objective
here is to reach the ROI as quickly as possible and take a checkpoint.

The second phase starts by loading the checkpoint back into the system, but
using an SST-side script. The system remains identical except for the External
Memory, which now sends requests to and receives responses from SST's memory.

This can be scaled to N different gem5 nodes. Checkpoints need to be taken for
each of these nodes in their respective first phases.

See the paper link here for a better visualization.

## Taking Checkpoints

The following is an example of the first phase. We start the simulation
entirely in gem5. Assume that this is our first gem5 system (instance-id is 0).
This system has 2 GiB of local memory. Another block of 32 GiB memory is mapped
to this system as remote memory.

```sh
# --cpu-type=kvm: use a KVM CPU to skip the OS boot; the host must support KVM.
# --instance=0: set the instance id; it is appended to the checkpoint file name.
# --local-memory-size=2GiB: the local memory should be small to moderate.
# --is-composable=False: only gem5 is used to take the checkpoint.
# --remote-memory-addr-range: 4 GiB to 6 GiB is mapped to a shared memory pool.
# --memory-alloc-policy=remote: remote memory latency is added in the SST-side script.
# --take-ckpt=True: this instance takes a checkpoint.
build/ARM/gem5.opt --outdir=ckpt_instance_0 disaggregated_memory/configs/arm-main.py \
    --cpu-type=kvm \
    --instance=0 \
    --local-memory-size=2GiB \
    --is-composable=False \
    --remote-memory-addr-range=4294967296,6442450944 \
    --memory-alloc-policy=remote \
    --take-ckpt=True
```

If we are modelling multiple systems, all sharing the same memory resource in
SST, we need to repeat this step for the next system. This can be done by:

```sh
# --cpu-type=kvm: use a KVM CPU to skip the OS boot; the host must support KVM.
# --instance=1: set the instance id; it is appended to the checkpoint file name.
# --local-memory-size=2GiB: the local memory should be small to moderate.
# --is-composable=False: only gem5 is used to take the checkpoint.
# --remote-memory-addr-range: 6 GiB to 8 GiB is mapped to a shared memory pool.
# --memory-alloc-policy=remote: remote memory latency is added in the SST-side script.
# --take-ckpt=True: this instance takes a checkpoint.
build/ARM/gem5.opt --outdir=ckpt_instance_1 disaggregated_memory/configs/arm-main.py \
    --cpu-type=kvm \
    --instance=1 \
    --local-memory-size=2GiB \
    --is-composable=False \
    --remote-memory-addr-range=6442450944,8589934592 \
    --memory-alloc-policy=remote \
    --take-ckpt=True
```
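
The two commands above give each instance its own 2 GiB window of the shared
pool, starting at 4 GiB. As a minimal sketch, the value passed to
`--remote-memory-addr-range` could be computed per instance as shown below. The
helper name `remote_range`, the 4 GiB base, and the fixed 2 GiB window are
assumptions taken from the example commands; they are not part of the
platform's scripts.

```python
GIB = 1 << 30  # bytes in one GiB


def remote_range(instance: int, base_gib: int = 4, size_gib: int = 2) -> str:
    """Return "start,end" in bytes for one instance (hypothetical helper).

    Assumes every instance gets a window of size_gib GiB and that the
    windows are laid out back to back starting at base_gib GiB.
    """
    if instance < 0:
        raise ValueError("instance id must be non-negative")
    start = (base_gib + instance * size_gib) * GIB
    end = start + size_gib * GIB
    return f"{start},{end}"


# Print the two flags that differ between the per-instance commands.
for i in range(2):
    print(f"--instance={i} --remote-memory-addr-range={remote_range(i)}")
```

Keeping the windows back to back means no two instances overlap, which is what
the hand-written ranges in the two commands already do.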

Note that stats.txt in the m5out directory will be reset. However, we are not
concerned about stats at this point, as we are not using a timing CPU and have
not yet reached the ROI.

This marks the end of phase 1.

## Restoring Checkpoints

The restoring of checkpoints marks the beginning of phase 2. The simulation now
needs to be initiated in SST. The SST-side script can be found in
`ext/sst/sst/arm_composable_memory.py`. Most of the required parameters need to
be set in the script directly.

```python
...
# XXX marks parameters that need to or can be changed.
disaggregated_memory_latency = "xxns" # add latency to memory requests going to SST.
...
is_composable = True # since this is now being simulated in SST
...
cpu_type = ["o3"]
...
gem5_run_script = "../../disaggregated_memory/configs/arm-main.py"

# node_memory_slice and remote_memory_slice need to be consistent with the
# numbers used in phase 1.
...
# make sure that the --ckpt-file is correctly set in the cmd list.
```

All the outputs will be stored in the directories `m5out_0`, `m5out_1`, and so
on, one per gem5 node. If you are simulating just one node, then you can start
the simulation without MPI. This can be done by:
```sh
bin/sst --add-lib-path=./ sst/arm_composable_memory.py
```
If there is more than one gem5 system to simulate, then use the command below.
The number after -np should be the number of gem5 nodes plus 1.
```sh
mpirun -np 3 -- bin/sst --add-lib-path=./ sst/arm_composable_memory.py
```
*Note*: Make sure that the checkpoint paths are correctly set when restoring
multiple systems. The instance id is appended at the end of the --ckpt-file
name.
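
The launch rule above (no MPI for one node, otherwise one rank per gem5 node
plus one extra) can be sketched as a small command builder. The function name
`sst_command` is hypothetical; the binary path, the `--add-lib-path` flag and
the script path are copied from the commands above.

```python
from typing import List


def sst_command(num_nodes: int,
                script: str = "sst/arm_composable_memory.py") -> List[str]:
    """Build the SST launch command for num_nodes gem5 nodes (a sketch)."""
    if num_nodes < 1:
        raise ValueError("at least one gem5 node is required")
    sst_cmd = ["bin/sst", "--add-lib-path=./", script]
    if num_nodes == 1:
        # A single node can run without MPI.
        return sst_cmd
    # Multiple nodes: the rank count is the number of gem5 nodes plus 1.
    return ["mpirun", "-np", str(num_nodes + 1), "--"] + sst_cmd


print(" ".join(sst_command(1)))
print(" ".join(sst_command(2)))
```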

Also, for SST-side statistics, set the following path correctly:
```py
sst.setStatisticOutput("sst.statOutputTXT",
                       {"filepath" : "arm-main-board.txt"})
```

## Sample Example with Traffic Generators

There is a simple example in `disaggregated_memory/configs` that sets up a
system with SST's memory as the main memory. The goal is to allow gem5's
traffic generators to generate traffic for SST. There is no checkpointing
involved in this setup.

The simulation needs to be started at the SST-side using the SST script in
`ext/sst/sst/example_traffic_gen.py`. This can be done by:

```sh
# Assuming that gem5 and SST are already built!

cd ext/sst
mpirun -np 2 -- bin/sst --add-lib-path=./ sst/example_traffic_gen.py -- --nodes=1 --link-latency=1ps
```

The above command simulates one gem5 node with SST as the main memory (0x0 to
0x80000000; hardcoded in the script). The link latency between gem5 and SST is
1ps. This can be varied.

Note that the default values for the number of nodes and the link latency in
this script are 1 and 1 ps respectively.
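
As a rough illustration of the options and defaults described above, the
sketch below parses `--nodes` and `--link-latency` with `argparse` and checks
the size of the hardcoded address range. It is an assumption about how such
options could be handled, not a copy of `example_traffic_gen.py`; the function
name `parse_traffic_gen_args` is hypothetical.

```python
import argparse


def parse_traffic_gen_args(argv):
    """Parse the two options with the defaults stated above (a sketch)."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--nodes", type=int, default=1,
                        help="number of gem5 nodes")
    parser.add_argument("--link-latency", type=str, default="1ps",
                        help="link latency between gem5 and SST")
    return parser.parse_args(argv)


args = parse_traffic_gen_args(["--nodes=1", "--link-latency=1ps"])

# The hardcoded range 0x0 to 0x80000000 covers 2 GiB of SST memory.
mem_size_gib = 0x80000000 // (1 << 30)
print(args.nodes, args.link_latency, mem_size_gib)
```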

Empty file.
192 changes: 192 additions & 0 deletions disaggregated_memory/SST/exp_arm_npb.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,192 @@
# Copyright (c) 2023-24 The Regents of the University of California
# All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions are
# met: redistributions of source code must retain the above copyright
# notice, this list of conditions and the following disclaimer;
# redistributions in binary form must reproduce the above copyright
# notice, this list of conditions and the following disclaimer in the
# documentation and/or other materials provided with the distribution;
# neither the name of the copyright holders nor the names of its
# contributors may be used to endorse or promote products derived from
# this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
# "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
# LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
# A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
# OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
# SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
# LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
# DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
# THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

# This SST configuration file can be used with the Composable script in gem5.
# For multi-node simulation, make sure to set the instance id correctly.

import sst
from sst import UnitAlgebra
import sys
import os
sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__file__), '..')))
from configs.common import npb_benchmarks
import argparse


parser = argparse.ArgumentParser()

parser.add_argument(
    "--ckpts-dir",
    type=str,
    required=True,
    help="The path to the directory containing the checkpoints for all the "
    "nodes in the system. Each checkpoint directory must be named ckpt_i, "
    "where i is the instance number of the node. The output directory of "
    "this run is also created inside this directory.",
)
parser.add_argument(
    "--memory-allocation-policy",
    type=str,
    required=True,
    help="The memory allocation policy. This script sizes the SST memory "
    "for 'all-local' and 'numa-local-preferred'.",
)
args = parser.parse_args()

def connect_components(link_name: str,
                       low_port_name: str, low_port_idx: int,
                       high_port_name: str, high_port_idx: int,
                       port = False, direct_link = False, latency = False):
    # Connect the low-side component to the high-side component with a link.
    link = sst.Link(link_name)
    low_port = "low_network_" + str(low_port_idx)
    if port:
        low_port = "port"
    high_port = "high_network_" + str(high_port_idx)
    if direct_link:
        high_port = "direct_link"
    if not latency:
        link.connect(
            (low_port_name, low_port, cache_link_latency),
            (high_port_name, high_port, cache_link_latency)
        )
    else:
        # TODO: Figure out if the added latency is correct!
        link.connect(
            (low_port_name, low_port, cache_link_latency),
            (high_port_name, high_port, disaggregated_memory_latency)
        )

gem5_run_script = "/home/babaie/projects/disaggregated-cxl/6/gem5/disaggregated_memory/configs/exp-npb-restore.py"
disaggregated_memory_latency = "750ns"
cache_link_latency = "1ps"
cpu_clock_rate = "4GHz"
stat_output_directory = f"{args.ckpts_dir}/SST_m5outs_NPB_all_short_test/{args.memory_allocation_policy}"


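if args.memory_allocation_policy == "all-local":
    sst_memory_size = str(2 + 85 + 9) + "GiB"
elif args.memory_allocation_policy == "numa-local-preferred":
    sst_memory_size = str(2 + 8 + 152) + "GiB"
else:
    # Fail early instead of hitting a NameError on sst_memory_size below.
    raise ValueError(f"unsupported policy: {args.memory_allocation_policy}")
addr_range_end = UnitAlgebra(sst_memory_size).getRoundedValue()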

# There is one cache bus connecting all gem5 ports to the remote memory.
mem_bus = sst.Component("membus", "memHierarchy.Bus")
mem_bus.addParams( { "bus_frequency" : cpu_clock_rate } )

# Set memctrl params
memctrl = sst.Component("memory", "memHierarchy.MemController")
memctrl.setRank(0, 0)

# `addr_range_end` must be kept consistent with `sst_memory_size`
memctrl.addParams({
    "debug" : "0",
    "clock" : "1.2GHz",
    "request_width" : "64",
    "addr_range_end" : addr_range_end,
})
# We need a DDR4-like memory device.
memory = memctrl.setSubComponent( "backend", "memHierarchy.timingDRAM")
memory.addParams({
    "id" : 0,
    "addrMapper" : "memHierarchy.simpleAddrMapper",
    "addrMapper.interleave_size" : "64B",
    "addrMapper.row_size" : "1KiB",
    "clock" : "1.2GHz",
    "mem_size" : sst_memory_size,
    "channels" : 4,
    "channel.numRanks" : 2,
    "channel.rank.numBanks" : 16,
    "channel.rank.bank.TRP" : 14,
    "printconfig" : 1,
})

# Add all the Gem5 nodes to this list.
gem5_nodes = []
memory_ports = []

# Create each of these nodes and connect it to an SST memory cache
npb_benchmarks_test = ["bt", "cg", "ep", "ft", "mg", "sp", "ua"]
for node, benchmark in enumerate(npb_benchmarks_test):
    cmd = [
        "-re",
        f"--outdir={stat_output_directory}/D/{benchmark}",
        f"{gem5_run_script}",
        f"--benchmark {benchmark}",
        "--size D",
        f"--memory-allocation-policy {args.memory_allocation_policy}",
        f"--ckpts-dir {args.ckpts_dir}",
    ]
    ports = {
        "remote_memory_port" : "board.remote_memory.outgoing_request_bridge"
    }
    port_list = []
    for port in ports:
        port_list.append(port)
    cpu_params = {
        "frequency" : cpu_clock_rate,
        "cmd" : " ".join(cmd),
        # "debug_flags" : "Checkpoint,MemoryAccess",
        "ports" : " ".join(port_list)
    }
    # Each gem5 node has to be simulated separately.
    gem5_nodes.append(
        sst.Component("gem5_node_{}".format(node), "gem5.gem5Component")
    )
    gem5_nodes[node].addParams(cpu_params)
    gem5_nodes[node].setRank(node, 0)

    memory_ports.append(
        gem5_nodes[node].setSubComponent(
            "remote_memory_port", "gem5.gem5Bridge", 0
        )
    )
    memory_ports[node].addParams({
        "response_receiver_name" : ports["remote_memory_port"]
    })

    # We don't need directory controllers in this example case. The start
    # and end ranges do not really matter, as the OS does this management
    # in this case.
    # TODO: Figure out if we need to add the link latency here?
    connect_components(f"node_{node}_mem_port_2_mem_bus",
                       memory_ports[node], 0,
                       mem_bus, node,
                       port = True, latency = True)

# All system nodes are set up. Now create an SST memory. Keep it simplemem to
# avoid extra simulation time. There is only one memory node on SST's side.
# This will be updated in the future to use a number of sst_memory_nodes.

connect_components("membus_2_memory",
                   mem_bus, 0,
                   memctrl, 0,
                   direct_link = True)

# enable Statistics
stat_params = { "rate" : "0ns" }
sst.setStatisticLoadLevel(10)
sst.setStatisticOutput("sst.statOutputTXT",
                       {"filepath" : f"{stat_output_directory}/sstOuts/node.txt"})
sst.enableAllStatisticsForAllComponents()