
ONNXim changes #28

Open

waelcoding03 wants to merge 2 commits into PSAL-POSTECH:master from waelcoding03:master

Conversation

@waelcoding03

Hello Team,

I’ve reached out to Mr. Wonhyuk Yang regarding extending the ONNXim simulator to enable more detailed tracking of tensor movement into and out of memory at runtime.

My changes to Simulator.cc, Model.cc, and Common.h are still a work in progress. I’ve added comments in the code explaining the logic and would greatly appreciate the opportunity to discuss it further with the team.

Note that I’ve also modified the ResNet18 model for testing purposes, but the main focus of this pull request is the simulator extension, not the model itself.

Thank you for your time and feedback.

@HamHyungkyu
Collaborator

Hi @waelcoding03,
Thank you for your interest in ONNXim.
I’ve reviewed your code changes and comments; however, I’m not entirely sure what feature you’re trying to add to ONNXim.
Could you please explain it in more detail?

@YWHyuk YWHyuk self-assigned this Oct 24, 2025
@YWHyuk YWHyuk added the enhancement New feature or request label Oct 24, 2025
@waelcoding03
Author

waelcoding03 commented Oct 24, 2025

Yes, of course, and thank you very much for your time reviewing and assisting me; I truly appreciate it.

I am working on adding a methodology to track tensor movements in and out of memory at runtime, i.e., in sync with the DRAM push and pop operations during the simulation cycles. The goal is to log the tensor’s name, size, and ID whenever a push to or pop from DRAM is invoked, in synchronization with the simulation cycle. This will allow us to monitor tensor activity at runtime.

In terms of implementation, I haven’t yet finalized the logic to achieve this goal, but I’ve been experimenting with an approach. I created a method called tensor_track(uint32 _id) in Model.cc, which leverages two existing methods in the same class:

get_tensor(uint32 _id) — retrieves a pointer to the full tensor given its ID.

print_tensor() — logs the tensor’s information to the console.

However, the challenge lies in passing the tensor ID correctly to get_tensor(uint32 _id) within Simulator.cc, since the tensor_track(uint32 _id) call would need to be placed inside the simulation cycle loop where the DRAM push and pop operations occur:

```cpp
for (int mem_id = 0; mem_id < _n_memories; mem_id++) {
  // ICNT to memory
  if (!_icnt->is_empty(_n_cores + mem_id) &&
      !_dram->is_full(mem_id, _icnt->top(_n_cores + mem_id))) {
    _dram->push(mem_id, _icnt->top(_n_cores + mem_id));

    // ----> tensor_track call would be added here
    // if (!_models.empty()) {
    //   _models.front()->tensor_track(/* how to get the tensor id? */);
    // }

    _icnt->pop(_n_cores + mem_id);
    _nr_to_mem++;
  }
  // Pop response to ICNT from DRAM
  if (!_dram->is_empty(mem_id) &&
      !_icnt->is_full(_n_cores + mem_id, _dram->top(mem_id))) {
    _icnt->push(_n_cores + mem_id, get_dest_node(_dram->top(mem_id)),
                _dram->top(mem_id));
    _dram->pop(mem_id);

    // ----> tensor_track call would be added here
    // if (!_models.empty()) {
    //   _models.front()->tensor_track(/* how to get the tensor id? */);
    // }

    _nr_from_mem++;
  }
}
```

If anything is still unclear, I would be more than happy to explain further. Thanks again for your time and effort; it is tremendously appreciated.

@YWHyuk
Collaborator

YWHyuk commented Oct 27, 2025

You’re interested in identifying which tensor a given address originates from at the moment it’s pushed to DRAM, right? There are generally two ways to implement this.

The first approach is to include the necessary information (e.g., tensor metadata) directly inside the mem_access structure. In this way, you can log the metadata right at the point where the DRAM push or pop operation occurs. However, I’m a bit concerned that this approach might introduce performance overhead during simulation.

The second approach is to log the allocation table and refer to it whenever an address needs to be mapped back to its corresponding tensor. This allocation table would record which memory regions belong to which tensors — for example:

Tensor A : 0x000100 ~ 0x001000  
Tensor B : 0x001000 ~ 0x002800  
Tensor C : 0x002800 ~ 0x004000

I think the second approach is easier to implement — it doesn’t have to be done on the fly.
You can do this as a post-processing step after the entire simulation finishes.
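To illustrate the second approach, here is a minimal lookup sketch. The `AllocTable` type and `lookup_tensor` helper are hypothetical names, not existing ONNXim code; in practice the table would be dumped from the simulator's tensor allocator.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <utility>

// Hypothetical allocation table: start address -> (end address, tensor name).
using AllocTable = std::map<uint64_t, std::pair<uint64_t, std::string>>;

// Map a DRAM address back to the tensor whose region contains it.
std::string lookup_tensor(const AllocTable& table, uint64_t addr) {
  auto it = table.upper_bound(addr);  // first region starting strictly after addr
  if (it == table.begin()) return "unknown";
  --it;                               // region starting at or before addr
  return addr < it->second.first ? it->second.second : "unknown";
}
```

With the three example regions above, an address such as 0x000500 would resolve to Tensor A, since it falls inside Tensor A's [start, end) range.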

@waelcoding03
Author


[2025-10-28 21:52:40.985] [info] TensorID   Name                 StartAddr          EndAddr            Size        
[2025-10-28 21:52:40.985] [info] 60         reorder_token_29     0x0000000001fab100 - 0x0000000001fc3900 100352      
[2025-10-28 21:52:40.986] [info] 58         reorder_token_27     0x0000000001f92800 - 0x0000000001fab000 100352      
[2025-10-28 21:52:40.986] [info] 56         reorder_token_25     0x0000000001f79f00 - 0x0000000001f92700 100352      
[2025-10-28 21:52:40.986] [info] 54         reorder_token_21     0x0000000001f61600 - 0x0000000001f79e00 100352      
[2025-10-28 21:52:40.986] [info] 52         reorder_token_23     0x0000000001f48d00 - 0x0000000001f61500 100352      
[2025-10-28 21:52:40.986] [info] 50         reorder_token_19     0x0000000001f17c00 - 0x0000000001f48c00 200704      
[2025-10-28 21:52:40.986] [info] 48         reorder_token_17     0x0000000001ee6b00 - 0x0000000001f17b00 200704      
[2025-10-28 21:52:40.986] [info] 46         reorder_token_15     0x0000000001eb5a00 - 0x0000000001ee6a00 200704      
[2025-10-28 21:52:40.986] [info] 44         reorder_token_11     0x0000000001e84900 - 0x0000000001eb5900 200704      
[2025-10-28 21:52:40.986] [info] 42         reorder_token_13     0x0000000001e53800 - 0x0000000001e84800 200704      
[2025-10-28 21:52:40.986] [info] 40         reorder_token_9      0x0000000001df1700 - 0x0000000001e53700 401408      
[2025-10-28 21:52:40.986] [info] 38         reorder_token_7      0x0000000001d8f600 - 0x0000000001df1600 401408      
[2025-10-28 21:52:40.986] [info] 36         reorder_token_5      0x0000000001d2d500 - 0x0000000001d8f500 401408      
[2025-10-28 21:52:40.986] [info] 34         reorder_token_3      0x0000000001ccb400 - 0x0000000001d2d400 401408      
[2025-10-28 21:52:40.986] [info] 14         reorder_token_10     0x0000000000b37400 - 0x0000000000b3b400 16384       
[2025-10-28 21:52:40.986] [info] 72         reorder_token_40     0x0000000002001300 - 0x0000000002001700 1024        
[2025-10-28 21:52:40.986] [info] 13         reorder_token_16     0x0000000000aef300 - 0x0000000000b37300 294912      
[2025-10-28 21:52:40.986] [info] 12         reorder_token_24     0x00000000009cf200 - 0x0000000000aef200 1179648     
[2025-10-28 21:52:40.986] [info] 70         reorder_token_39     0x0000000001ff4e00 - 0x0000000002001200 50176       
[2025-10-28 21:52:40.986] [info] 11         onnx::Conv_209       0x00000000009cf000 - 0x00000000009cf100 256         
[2025-10-28 21:52:40.986] [info] 10         reorder_token_36     0x000000000054ef00 - 0x00000000009cef00 4718592     
[2025-10-28 21:52:40.986] [info] 68         reorder_token_37     0x0000000001fe8900 - 0x0000000001ff4d00 50176       
[2025-10-28 21:52:40.986] [info] 9          reorder_token_12     0x000000000052ae00 - 0x000000000054ee00 147456      
[2025-10-28 21:52:40.986] [info] 8          reorder_token_26     0x000000000040ad00 - 0x000000000052ad00 1179648     
[2025-10-28 21:52:40.986] [info] 66         reorder_token_35     0x0000000001fdc400 - 0x0000000001fe8800 50176       
[2025-10-28 21:52:40.986] [info] 7          reorder_token_32     0x00000000001cac00 - 0x000000000040ac00 2359296     
[2025-10-28 21:52:40.986] [info] 6          reorder_token_30     0x000000000018ab00 - 0x00000000001cab00 262144      
[2025-10-28 21:52:40.986] [info] 64         reorder_token_31     0x0000000001fcff00 - 0x0000000001fdc300 50176       
[2025-10-28 21:52:40.986] [info] 5          onnx::Conv_194       0x000000000018aa00 - 0x000000000018aa80 128         
[2025-10-28 21:52:40.986] [info] 4          reorder_token_22     0x00000000000fa900 - 0x000000000018a900 589824      
[2025-10-28 21:52:40.986] [info] 62         reorder_token_33     0x0000000001fc3a00 - 0x0000000001fcfe00 50176       
[2025-10-28 21:52:40.986] [info] 3          fc.bias              0x00000000000fa100 - 0x00000000000fa8d0 2000        
[2025-10-28 21:52:40.986] [info] 32         reorder_token_1      0x0000000001c69300 - 0x0000000001ccb300 401408      
[2025-10-28 21:52:40.986] [info] 2          fc.weight            0x0000000000000000 - 0x00000000000fa000 1024000     
[2025-10-28 21:52:40.986] [info] 74         /avgpool/GlobalAveragePool_output_0 0x0000000002001800 - 0x0000000002001c00 1024        
[2025-10-28 21:52:40.986] [info] 15         reorder_token_20     0x0000000000b3b500 - 0x0000000000b4b500 65536       
[2025-10-28 21:52:40.986] [info] 16         reorder_token_18     0x0000000000b4b600 - 0x0000000000b93600 294912      
[2025-10-28 21:52:40.986] [info] 76         /Flatten_output_0    0x0000000002001d00 - 0x0000000002002100 1024        
[2025-10-28 21:52:40.986] [info] 17         onnx::Conv_224       0x0000000000b93700 - 0x0000000000b93900 512         
[2025-10-28 21:52:40.986] [info] 18         reorder_token_38     0x0000000000b93a00 - 0x0000000001013a00 4718592     
[2025-10-28 21:52:40.986] [info] 78         output               0x0000000002002200 - 0x00000000020029d0 2000        
[2025-10-28 21:52:40.986] [info] 19         reorder_token_14     0x0000000001013b00 - 0x000000000105bb00 294912      
[2025-10-28 21:52:40.986] [info] 20         reorder_token_8      0x000000000105bc00 - 0x000000000106dc00 73728       
[2025-10-28 21:52:40.986] [info] 21         reorder_token_6      0x000000000106dd00 - 0x000000000107fd00 73728       
[2025-10-28 21:52:40.986] [info] 22         reorder_token_4      0x000000000107fe00 - 0x0000000001091e00 73728       
[2025-10-28 21:52:40.986] [info] 23         onnx::Conv_239       0x0000000001091f00 - 0x0000000001092300 1024        
[2025-10-28 21:52:40.986] [info] 24         reorder_token_28     0x0000000001092400 - 0x00000000011b2400 1179648     
[2025-10-28 21:52:40.986] [info] 25         reorder_token_2      0x00000000011b2500 - 0x00000000011c4500 73728       
[2025-10-28 21:52:40.986] [info] 26         reorder_token_34     0x00000000011c4600 - 0x0000000001644600 4718592     
[2025-10-28 21:52:40.986] [info] 27         reorder              0x0000000001644700 - 0x0000000001649080 18816       
[2025-10-28 21:52:40.986] [info] 28         input                0x0000000001649100 - 0x0000000001ae1100 4816896     
[2025-10-28 21:52:40.986] [info] 30         reorder_token_0      0x0000000001ae1200 - 0x0000000001c69200 1605632   

I obtained this output in my simulation. I went with your second approach, since we should consider performance overhead. However, how does this help with memory tracking at runtime?

My goal is to sketch a timeline showing when each tensor is scheduled at runtime.

Thank you a lot

@YWHyuk
Collaborator

YWHyuk commented Oct 29, 2025

Now we add the logging logic in the section you mentioned to trace DRAM packet addresses. (If an environment variable triggers the DRAM trace logging, it won’t impact performance in other cases.)

Then, by combining the allocation table information with the address trace file, we can implement a separate post-processing logic for analysis.
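A minimal sketch of the environment-variable-gated trace logging suggested above. `ONNXIM_DRAM_TRACE` is an assumed variable name, not an existing ONNXim option, and the hook points would be the DRAM push/pop sites in Simulator.cc.

```cpp
#include <cstdint>
#include <cstdio>
#include <cstdlib>

// Tracing is enabled only when the (assumed) environment variable is set.
bool dram_trace_enabled() {
  return std::getenv("ONNXIM_DRAM_TRACE") != nullptr;
}

// Emit one trace record per DRAM push/pop; when tracing is off, the cost is
// a single getenv check (the flag could be cached if even that matters).
void log_dram_access(uint64_t cycle, int mem_id, uint64_t addr, bool is_push) {
  if (!dram_trace_enabled()) return;
  std::printf("%llu %d %s 0x%llx\n", (unsigned long long)cycle, mem_id,
              is_push ? "PUSH" : "POP", (unsigned long long)addr);
}
```

The resulting address trace can then be joined against the allocation table offline, after the simulation ends.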

@waelcoding03
Author

I am not sure I follow the idea of the DRAM packet trace and its relation to runtime tracking, since this table is produced after the simulation finishes, not while inference is progressing, and our main goal is to track the tensors at runtime.

I will investigate this method further, but for now I am also trying to add a tensor ID member to the MemoryAccess structure, so the ID is carried explicitly by the core's load and store requests issued by MOVIN/MOVOUT instructions.
That way I can retrieve the full tensor and print it through the get_tensor(tensor_id) and print_tensor() methods already present in the simulator; these methods don't need new implementation.
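A minimal sketch of this tagging idea, under stated assumptions: the struct below is a simplified stand-in for ONNXim's MemoryAccess (the real one in Common.h carries more fields), and `make_tagged_access` is a hypothetical helper, not existing code.

```cpp
#include <cstdint>

// Simplified stand-in for ONNXim's MemoryAccess. The proposed tensor_id
// member travels with each DRAM request so the simulator loop can resolve
// the request back to a tensor.
struct MemoryAccess {
  uint64_t dram_address;
  uint32_t size;
  bool write;
  uint32_t tensor_id = 0;  // proposed addition; 0 means untagged
};

// Core-side helper: tag a load/store request (from a MOVIN/MOVOUT) with the
// tensor it belongs to.
MemoryAccess make_tagged_access(uint64_t addr, uint32_t size, bool write,
                                uint32_t tensor_id) {
  return MemoryAccess{addr, size, write, tensor_id};
}
```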

I will share results as soon as I have significant updates. As always, thank you.

@waelcoding03
Author

```cpp
/* TODO: Implement this */
void GlobalAvgPool::initialize_tiles(MappingTable& mapping_table) {
  spdlog::trace("initialize_tile {}", _name);
  std::vector<uint32_t> output_shape = get_output(0)->get_dims();

  _tiles.push_back(std::make_unique<Tile>(Tile{.status = Tile::Status::INITIALIZED,
                                               .optype = "GlobalAvgPool",
                                               .layer_id = _id,
                                               .batch = 1,
                                               .Q = 0,
                                               .P = 0,
                                               .C = 1,
                                               .skip = true}));
  initialize_instructions(_tiles.back().get(), Mapping{});
}

void GlobalAvgPool::initialize_instructions(Tile* tile, Mapping mapping) {
  std::vector<uint32_t> output_shape = get_output(0)->get_dims();
  std::vector<uint32_t> input_shape = get_input(0)->get_dims();

  uint32_t h_kernel = _kernel_shape[0];
  uint32_t w_kernel = _kernel_shape[1];
  uint32_t kernel_size = h_kernel * w_kernel;
  uint32_t compare_size_in_vector =
      _config.vector_process_bit / (_config.precision * 8);

  uint32_t N = tile->batch;
  uint32_t C = tile->C;
  uint32_t tout_q_offset = tile->Q * _strides[0];
  uint32_t tout_p_offset = tile->P * _strides[1];

  uint32_t total_compare = 0;
  uint32_t tmp = kernel_size;

  while (tmp > compare_size_in_vector) {
    int quotient = tmp / compare_size_in_vector;
    int remainder = tmp % compare_size_in_vector;

    total_compare += quotient;
    tmp = quotient + remainder;
  }
  total_compare += 1;

  std::set<addr_type> input_set;

  for (int q_offset = 0; q_offset < h_kernel; q_offset++) {
    for (int p_offset = 0; p_offset < w_kernel; p_offset++) {
      input_set.insert(
          make_activation_address(N, q_offset, p_offset, C, input_shape));
    }
  }

  std::string input_id = fmt::format("INPUT-{}-{}-{}-{}-{}", tile->layer_id,
                                     N, tout_q_offset, tout_p_offset, C);

  tile->instructions.push_back(
      Instruction{.opcode = Opcode::MOVIN,
                  .id = input_id,
                  .addrs = std::vector<addr_type>(input_set.begin(),
                                                  input_set.end())});

  std::set<addr_type> output_set;
  std::string output_id = fmt::format("OUT-{}-{}-{}-{}-{}", tile->layer_id,
                                      N, tout_q_offset, tout_p_offset, C);

  output_set.insert(make_activation_address(N, tout_q_offset, tout_p_offset,
                                            C, output_shape));

  for (int i = 0; i < total_compare; i++)
    tile->instructions.push_back(
        Instruction{.opcode = Opcode::COMP,
                    .tile_size = compare_size_in_vector,
                    .id = fmt::format("COMP-{}-{}-{}-{}-{}", tile->layer_id, N,
                                      tout_q_offset, tout_p_offset, C),
                    .dependent_ids = std::vector<std::string>{input_id},
                    .dest_id = output_id});

  tile->instructions.push_back(
      Instruction{.opcode = Opcode::MOVOUT,
                  .id = output_id,
                  .addrs = std::vector<addr_type>(output_set.begin(),
                                                  output_set.end())});
}
```

I found this in GlobalAvgPool.cc in the operations folder; these functions were commented out, so I uncommented them. In Common.h, the Instruction struct is as follows:
```cpp
typedef struct {
  Opcode opcode;
  cycle_type start_cycle;
  cycle_type finish_cycle;
  std::string id;
  std::vector<std::string> dependent_ids;
  std::string dest_id;
  addr_type dest_addr;
  uint32_t size;  // Used for SRAM allocation. Multiple of _config.dram_req_size
  uint32_t compute_size;
  std::vector<addr_type> src_addrs;
  int spad_id;
  int accum_spad_id;
  uint32_t operand_id = 0;
  addr_type base_addr;

  uint32_t tile_m;
  uint32_t tile_k;
  uint32_t tile_n;

  bool src_from_accum = false;
  bool zero_init = false;
  bool last_inst = false;
  Tile* my_tile;
  std::string to_string();
} Instruction;
```

Could you please tell me how these relate? If I can get the tensor IDs into the instructions, I can initialize MemoryAccess with them, and then in the simulator I can retrieve each tensor and print its ID as the simulation is going.
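A hypothetical sketch of the Instruction-to-MemoryAccess handoff being asked about. The struct and function names below are illustrative stand-ins, not ONNXim code; they only loosely mirror the fields in Common.h, with tensor_id as the proposed new member on both sides.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Illustrative instruction carrying the proposed tensor_id, which would be
// set in initialize_instructions alongside the .id string.
struct TaggedInstruction {
  std::string id;
  std::vector<uint64_t> addrs;
  uint32_t tensor_id = 0;
};

// Illustrative per-request record, standing in for MemoryAccess.
struct TaggedAccess {
  uint64_t dram_address;
  uint32_t tensor_id;
};

// When the core expands a MOVIN/MOVOUT instruction into DRAM requests,
// copy the instruction's tensor_id onto every generated access.
std::vector<TaggedAccess> expand_to_accesses(const TaggedInstruction& inst) {
  std::vector<TaggedAccess> out;
  out.reserve(inst.addrs.size());
  for (uint64_t addr : inst.addrs)
    out.push_back(TaggedAccess{addr, inst.tensor_id});
  return out;
}
```

Under this scheme the simulator loop never needs to look the ID up; it arrives with each request.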

Thank you

@waelcoding03
Author

Hello,

Hope you're doing well.

I would like to kindly ask: in the operations folder, attention.cc controls all the Instruction instances, so could passing a uint32_t tensor_id to the Instruction instances with MOVIN and MOVOUT opcodes be done sufficiently there alone?

If the above succeeds, we can pass the tensor_id member we added along the chain tile -> instruction -> MemoryAccess.

Then the MemoryAccess instances in the core's load/store handlers would enable tracking at runtime, since these instances carry the tensor_id, and we can track it at each simulation cycle when we loop over the memories and generate these instances.

Thank you for your assistance; it is highly appreciated.
