Commit 72ee6d1
Author: Matt Sinclair
mem-ruby: Update GPU VIPER TCC protocol to resolve deadlock
In the GPU VIPER TCC, programs with mixes of atomics and data accesses to the same address, in the same kernel, can experience deadlock when large applications (e.g., Pannotia's graph analytics algorithms) are run on very small GPUs (e.g., the default 4 CU GPU configuration). In this situation, deadlocks occur due to resource stalls interacting with the current implementation's handling of races between atomic accesses. The specific order of events causing this deadlock is:

1. The TCC is waiting on an atomic to return from the directory.
2. In the meantime, the TCC receives another atomic to the same address. When this happens, it increments the count of atomics pending to this address in the TBE (numAtomics = 2) and writes the second atomic through to the directory.
3. When the first atomic returns from the directory, the TCC decrements the numAtomics counter. But numAtomics was at 2 because of step 2, so the TCC does not deallocate the TBE entry and instead calls Event:AtomicNotDone.
4. Another request (an LD) to the same address arrives. The LD does z_stall since the second atomic is pending, so the LD retries every cycle until the deadlock counter times out (or until the second atomic comes back).
5. The second atomic returns to the TCC. However, because so many LDs are pending in the cache, all doing z_stalls and retrying every cycle, there are many resource stalls. So, when the second atomic returns, it is forced to retry its operation multiple times, and each retry decrements the atomicDoneCnt flag (which was added in 7246f70 to catch a race between atomics arriving at and leaving the TCC). As a result, atomicDoneCnt becomes negative.
6. Since the atomicDoneCnt flag determines when Event:AtomicDone happens, and since the resource stalls drove it negative, the atomic never completes. This means the pending LD can never access the line, because it is stuck waiting for the atomic to complete.
7. Eventually the deadlock threshold is reached.

To fix this issue, this commit changes the VIPER TCC protocol from using z_stall to the stall_and_wait buffer method that the Directory level of the SLICC already uses. This prevents resource stalls from dominating the TCC level by putting pending requests for a given address into a per-address stall buffer; these requests are then woken up when the pending request returns.

As part of this change, this commit also makes two small changes to the Directory-level protocol (MOESI_AMD_BASE-dir):

1. Updated the names of the wakeup actions to match the TCC wakeup actions, to avoid confusion.
2. Changed transition(B, UnblockWriteThrough, U) to check all stall buffers, as some requests were being placed later in the stall buffer than the point being checked. This mirrors the changes in 187c44f to other Directory transitions, which resolved races between GPU and DMA requests, but for transitions prior workloads did not stress.

Change-Id: I60ac9830a87c125e9ac49515a7fc7731a65723c2
Reviewed-on: https://gem5-review.googlesource.com/c/public/gem5/+/51367
Reviewed-by: Jason Lowe-Power <power.jg@gmail.com>
Reviewed-by: Matthew Poremba <matthew.poremba@amd.com>
Maintainer: Jason Lowe-Power <power.jg@gmail.com>
Tested-by: kokoro <noreply+kokoro@google.com>
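The essence of the fix, condensed from the hunks below into a minimal SLICC sketch (only a representative transition from each side of the change is shown; see the diffs for the full set of states and events):

  // Before: the built-in z_stall leaves each stalled request at the head of
  // its queue, retrying every cycle and consuming resources on each retry.
  transition(WI, {RdBlk, WrVicBlk, Atomic, WrVicBlkBack}) {
    z_stall;
  }

  // After: park stalled requests in a per-address stall buffer...
  action(st_stallAndWaitRequest, "st", desc="Stall and wait on the address") {
    stall_and_wait(coreRequestNetwork_in, address);
  }

  transition(WI, {RdBlk, WrVicBlk, Atomic, WrVicBlkBack}) {
    st_stallAndWaitRequest;
  }

  // ...and wake them explicitly on the completion path, so a returning
  // atomic no longer races against a wall of per-cycle retries.
  action(wada_wakeUpAllDependentsAddr, "wada", desc="Wake up any requests waiting for this address") {
    wakeUpAllBuffers(address);
  }

  transition(A, AtomicDone, I) {TagArrayRead, TagArrayWrite} {
    dt_deallocateTBE;
    wada_wakeUpAllDependentsAddr;
    ptr_popTriggerQueue;
  }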
Parent: 2b46872

2 files changed: +36, -15 lines

src/mem/ruby/protocol/GPU_VIPER-TCC.sm (+27, -6)

@@ -126,6 +126,7 @@ machine(MachineType:TCC, "TCC Cache")
   void unset_tbe();
   void wakeUpAllBuffers();
   void wakeUpBuffers(Addr a);
+  void wakeUpAllBuffers(Addr a);

   MachineID mapAddressToMachine(Addr addr, MachineType mtype);

@@ -569,6 +570,14 @@ machine(MachineType:TCC, "TCC Cache")
     probeNetwork_in.dequeue(clockEdge());
   }

+  action(st_stallAndWaitRequest, "st", desc="Stall and wait on the address") {
+    stall_and_wait(coreRequestNetwork_in, address);
+  }
+
+  action(wada_wakeUpAllDependentsAddr, "wada", desc="Wake up any requests waiting for this address") {
+    wakeUpAllBuffers(address);
+  }
+
   action(z_stall, "z", desc="stall") {
     // built-in
   }
@@ -606,13 +615,22 @@ machine(MachineType:TCC, "TCC Cache")
   // they can cause a resource stall deadlock!

   transition(WI, {RdBlk, WrVicBlk, Atomic, WrVicBlkBack}) { //TagArrayRead} {
-    z_stall;
+    // by putting the stalled requests in a buffer, we reduce resource contention
+    // since they won't try again every cycle and will instead only try again once
+    // woken up
+    st_stallAndWaitRequest;
   }
   transition(A, {RdBlk, WrVicBlk, WrVicBlkBack}) { //TagArrayRead} {
-    z_stall;
+    // by putting the stalled requests in a buffer, we reduce resource contention
+    // since they won't try again every cycle and will instead only try again once
+    // woken up
+    st_stallAndWaitRequest;
   }
   transition(IV, {WrVicBlk, Atomic, WrVicBlkBack}) { //TagArrayRead} {
-    z_stall;
+    // by putting the stalled requests in a buffer, we reduce resource contention
+    // since they won't try again every cycle and will instead only try again once
+    // woken up
+    st_stallAndWaitRequest;
   }
   transition({M, V}, RdBlk) {TagArrayRead, DataArrayRead} {
     p_profileHit;
@@ -660,9 +678,10 @@ transition(I, Atomic, A) {TagArrayRead} {

   transition(A, Atomic) {
     p_profileMiss;
-    at_atomicThrough;
-    ina_incrementNumAtomics;
-    p_popRequestQueue;
+    // by putting the stalled requests in a buffer, we reduce resource contention
+    // since they won't try again every cycle and will instead only try again once
+    // woken up
+    st_stallAndWaitRequest;
   }

   transition({M, W}, Atomic, WI) {TagArrayRead} {
@@ -750,6 +769,7 @@ transition(I, Atomic, A) {TagArrayRead} {
     wcb_writeCacheBlock;
     sdr_sendDataResponse;
     pr_popResponseQueue;
+    wada_wakeUpAllDependentsAddr;
     dt_deallocateTBE;
   }

@@ -762,6 +782,7 @@ transition(I, Atomic, A) {TagArrayRead} {

   transition(A, AtomicDone, I) {TagArrayRead, TagArrayWrite} {
     dt_deallocateTBE;
+    wada_wakeUpAllDependentsAddr;
     ptr_popTriggerQueue;
   }

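A note on the pattern (a general stall_and_wait invariant in Ruby, not something introduced by this commit): any state whose transitions call st_stallAndWaitRequest must invoke a wakeup action on every path that leaves that state. A parked request is only re-examined after wakeUpBuffers()/wakeUpAllBuffers() is called for its address, so a missed wakeup would turn a stall into a permanent hang. A hypothetical counterexample in SLICC:

  // BROKEN (illustrative only): requests parked while in state A would
  // sleep forever, because this exit path deallocates the TBE without
  // ever waking the per-address stall buffer.
  transition(A, AtomicDone, I) {
    dt_deallocateTBE;
    ptr_popTriggerQueue;   // missing: wada_wakeUpAllDependentsAddr
  }

This is why the hunks above add wada_wakeUpAllDependentsAddr to both completion paths (the write-back response and the AtomicDone trigger) alongside dt_deallocateTBE.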
src/mem/ruby/protocol/MOESI_AMD_Base-dir.sm (+9, -9)

@@ -1092,15 +1092,15 @@ machine(MachineType:Directory, "AMD Baseline protocol")
     stall_and_wait(dmaRequestQueue_in, address);
   }

-  action(wa_wakeUpDependents, "wa", desc="Wake up any requests waiting for this address") {
+  action(wad_wakeUpDependents, "wad", desc="Wake up any requests waiting for this address") {
     wakeUpBuffers(address);
   }

-  action(wa_wakeUpAllDependents, "waa", desc="Wake up any requests waiting for this region") {
+  action(wa_wakeUpAllDependents, "wa", desc="Wake up any requests waiting for this region") {
     wakeUpAllBuffers();
   }

-  action(wa_wakeUpAllDependentsAddr, "waaa", desc="Wake up any requests waiting for this address") {
+  action(wada_wakeUpAllDependentsAddr, "wada", desc="Wake up any requests waiting for this address") {
     wakeUpAllBuffers(address);
   }

@@ -1206,7 +1206,7 @@ machine(MachineType:Directory, "AMD Baseline protocol")
     d_writeDataToMemory;
     al_allocateL3Block;
     pr_profileL3HitMiss; //Must come after al_allocateL3Block and before dt_deallocateTBE
-    wa_wakeUpDependents;
+    wad_wakeUpDependents;
     dt_deallocateTBE;
     pr_popResponseQueue;
   }
@@ -1232,12 +1232,12 @@ machine(MachineType:Directory, "AMD Baseline protocol")
   }

   transition({B}, CoreUnblock, U) {
-    wa_wakeUpAllDependentsAddr;
+    wada_wakeUpAllDependentsAddr;
     pu_popUnblockQueue;
   }

   transition(B, UnblockWriteThrough, U) {
-    wa_wakeUpDependents;
+    wada_wakeUpAllDependentsAddr;
     pt_popTriggerQueue;
   }

@@ -1280,7 +1280,7 @@ machine(MachineType:Directory, "AMD Baseline protocol")
   transition(BDR_M, MemData, U) {
     mt_writeMemDataToTBE;
     dd_sendResponseDmaData;
-    wa_wakeUpAllDependentsAddr;
+    wada_wakeUpAllDependentsAddr;
     dt_deallocateTBE;
     pm_popMemQueue;
   }
@@ -1365,7 +1365,7 @@ machine(MachineType:Directory, "AMD Baseline protocol")
   transition(BDW_P, ProbeAcksComplete, U) {
     // Check for pending requests from the core we put to sleep while waiting
     // for a response
-    wa_wakeUpAllDependentsAddr;
+    wada_wakeUpAllDependentsAddr;
     dt_deallocateTBE;
     pt_popTriggerQueue;
   }
@@ -1374,7 +1374,7 @@ machine(MachineType:Directory, "AMD Baseline protocol")
     dd_sendResponseDmaData;
     // Check for pending requests from the core we put to sleep while waiting
     // for a response
-    wa_wakeUpAllDependentsAddr;
+    wada_wakeUpAllDependentsAddr;
     dt_deallocateTBE;
     pt_popTriggerQueue;
   }
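Why transition(B, UnblockWriteThrough, U) switches from the per-buffer wakeup to the address-wide one: the renamed actions map onto different SLICC built-ins, sketched below from the declarations visible in the GPU_VIPER-TCC.sm hunk above (the exact wakeup-order semantics live in gem5's AbstractController and are an assumption here, not something this commit shows):

  void wakeUpBuffers(Addr a);     // wakes a subset of the stall buffers
                                  // holding requests for address a,
                                  // dependent on in-port ordering
  void wakeUpAllBuffers(Addr a);  // wakes every stall buffer holding
                                  // requests for address a
  void wakeUpAllBuffers();        // wakes everything, regardless of address

Per the commit message, some requests were parked later in the stall buffer than wakeUpBuffers(address) checked, so this path now uses the address-wide variant, mirroring 187c44f.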
