Issue
When an ArgoDSM page is cached by a remote node and the page is in a private state (the caching node is the single writer, or there is no writer and the caching node is a sharer), subsequent atomic writes to the page are not detected on that node by regular reads, even after self-invalidation.
The question is whether this is a bug (I believe so), or whether the semantics of the ArgoDSM atomic functions simply do not define such behavior in the first place. This is related to #20, but is further exposed by Ioannis' work on memory allocation policies, which fails some of the atomic tests when the allocated data is owned by a node other than 0.
Reproduction
The following test exposes the bug on 2 or more nodes with the naive allocation policy. To trigger the second case (no writer, caching node is a sharer), simply substitute *counter = 0; with volatile int temp = *counter;. The important point is that counter is allocated on node 0, and that exactly one other node performs a read or write to it.
#include "argo/argo.hpp"
#include <cstdio>
int main(){
	argo::init(1*1024*1024);
	argo::data_distribution::global_ptr<int> counter(argo::conew_<int>());
	if(argo::node_id()==1) {
		*counter = 0; // Node 0 owns the data, node 1 becomes single writer
	}
	argo::barrier(); // Barrier makes every node aware of the initialization
	// Atomically increment counter on each node
	for(int i=0; i<10; i++){
		argo::backend::atomic::fetch_add(counter, 1);
	}
	argo::barrier(); // Make sure every node has completed execution
	if(*counter == argo::number_of_nodes()*10) {
		printf("Node %d successful (counter: %d).\n", argo::node_id(), *counter);
	} else {
		printf("Node %d failed (counter: %d).\n", argo::node_id(), *counter);
	}
	argo::finalize();
}
Detail
This issue is caused by the following optimization in argodsm/src/backend/mpi/swdsm.cpp (lines 984 to 994 at 5f9b572):
if(
	// node is single writer
	(globalSharers[classidx+1]==id)
	||
	// No writer and assert that the node is a sharer
	((globalSharers[classidx+1]==0) && ((globalSharers[classidx]&id)==id))
){
	MPI_Win_unlock(workrank, sharerWindow);
	touchedcache[i] = 1;
	/*nothing - we keep the pages, SD is done in flushWB*/
}
The reason is that ArgoDSM atomics do not alter the Pyxis directory (globalSharers) state, so cached remote pages in the single writer or no writer, shared state are not invalidated upon self-invalidation, causing the node to miss updates until the state of the page changes.
Solution?
The fact that cached private pages are not downgraded to shared on atomic writes means that it is never completely safe to mix atomic writes with regular reads and writes. I believe the correct solution would be to write atomic changes to the cache and to update the local (and, when needed as a result, remote) Pyxis directories to the correct state.