Atomic store operations are not detected on cached private pages #52

@lundgren87

Issue

When an ArgoDSM page is cached by a remote node and the page is in private state (either the caching node is the single writer, or there is no writer and the caching node is a sharer), subsequent atomic writes to the page are not observed by regular reads on the caching node, even after self-invalidation.

The question is whether this is a bug (I believe so), or whether the semantics of the ArgoDSM atomic functions simply do not define such behavior in the first place. This is related to #20, but is further exposed by Ioannis' work on memory allocation policies, under which some of the atomic tests fail when the allocated data is owned by a node other than 0.

Reproduction

The following test exposes the bug on 2 or more nodes with the naive allocation policy. To reproduce the second case (no writer, caching node is sharer), substitute *counter = 0; with volatile int temp = *counter;. The important point is that counter is allocated on node 0, and that exactly one other node performs a read or write to it.

#include <cstdio>
#include "argo/argo.hpp"

int main(){
	argo::init(1*1024*1024);
	argo::data_distribution::global_ptr<int> counter(argo::conew_<int>());
	
	if(argo::node_id()==1) {
		*counter = 0; // Node 0 owns the data, node 1 becomes single writer
	}
	argo::barrier(); // Barrier makes every node aware of the initialization
	
	// Atomically increment counter on each node
	for(int i=0; i<10; i++){
		argo::backend::atomic::fetch_add(counter, 1);
	}

	argo::barrier(); // Make sure every node has completed execution
	if(*counter == argo::number_of_nodes()*10) {
		printf("Node %d successful (counter: %d).\n", argo::node_id(), *counter);
	}else{
		printf("Node %d failed (counter: %d).\n", argo::node_id(), *counter);
	}
	
	argo::finalize();
}

Detail

This issue is courtesy of the following optimization:

if(
	// node is single writer
	(globalSharers[classidx+1]==id)
	||
	// No writer and assert that the node is a sharer
	((globalSharers[classidx+1]==0) && ((globalSharers[classidx]&id)==id))
){
	MPI_Win_unlock(workrank, sharerWindow);
	touchedcache[i] =1;
	/*nothing - we keep the pages, SD is done in flushWB*/
}

The reason is that ArgoDSM atomics do not alter the state of the Pyxis directory (globalSharers). Cached remote pages in the single-writer state, or in the no-writer state with the node as a sharer, are therefore not invalidated upon self-invalidation, and the node misses updates until the state of the page changes.

Solution?

The fact that cached private pages are not downgraded to shared on atomic writes means that it is never completely safe to mix atomic writes and regular reads/writes. I believe the correct solution would be to write atomic changes to the cache and to update the local Pyxis directory (and, where that makes it necessary, the remote ones) to the correct state.
