HPFS sync reliability: block requests sent to random peer cause hash mismatch when nodes have divergent state #412
Description
Summary
When a new node joins a running cluster where index.js (or any large state file)
has been modified after the initial deployment, HPFS sync fails repeatedly with:
```
Hpfs cont sync: Skipping mismatched block response from [xxxxxxxx] for block_id:0 (len:390541) of /state/index.js
```
The new node never syncs successfully and cannot join consensus.
Root Cause
In hpfs_sync.cpp, process_candidate_responses() correctly requests the file
hashmap from a peer, then generates block requests based on the received block hashes.
However, in request_state_from_peer(), those block requests are sent to a random
peer via send_message_to_random_peer():

```cpp
p2p::send_message_to_random_peer(fbuf, target_pubkey);
// todo: send to a node that hold the expected hash to improve
// reliability of retrieving hpfs state.
```

If the random peer has a different version of the file than the peer that provided
the hashmap, the received block data won't match the expected hash — causing
validate_file_block_hash() to reject it and log the mismatch.
This is the existing TODO in the codebase. We hit this in practice when:
- A 3-node cluster is running with a modified index.js
- A new 4th node joins and requests /state/index.js
- Node A provides the hashmap (based on its version of index.js)
- The block request goes to Node B (random), which has a slightly different version
- Block hash mismatch → sync fails indefinitely
Reproduction
- Create a 3-node cluster
- Modify index.js on all 3 running nodes
- Acquire a 4th node and deploy the contract bundle to it
- Watch the 4th node's hp.log: it will repeatedly log Skipping mismatched block response for /state/index.js and never join consensus
Suggested Fix
Add a source_peer field to sync_item so block requests are directed to the
same peer that provided the hashmap, rather than a random peer.
1. Add source_peer to sync_item in hpfs_sync.hpp

```cpp
struct sync_item
{
    SYNC_ITEM_TYPE type = SYNC_ITEM_TYPE::DIR;
    std::string vpath;
    int32_t block_id = -1;
    util::h32 expected_hash;
    bool high_priority = false;
    std::string source_peer; // Preferred peer for block requests (empty = random)
    uint32_t waiting_time = 0;
    // ...
};
```

2. Add send_message_to_peer to p2p.hpp and p2p.cpp
```cpp
// p2p.hpp
void send_message_to_peer(const flatbuffers::FlatBufferBuilder &fbuf,
                          const std::string &preferred_pubkey,
                          std::string &target_pubkey);

// p2p.cpp
void send_message_to_peer(const flatbuffers::FlatBufferBuilder &fbuf,
                          const std::string &preferred_pubkey,
                          std::string &target_pubkey)
{
    std::scoped_lock<std::mutex> lock(ctx.peer_connections_mutex);
    if (!preferred_pubkey.empty())
    {
        const auto it = ctx.peer_connections.find(preferred_pubkey);
        if (it != ctx.peer_connections.end())
        {
            it->second->send(msg::fbuf::builder_to_string_view(fbuf));
            target_pubkey = it->second->uniqueid;
            return;
        }
        LOG_DEBUG << "Preferred peer " << preferred_pubkey.substr(2, 8)
                  << " not found. Falling back to random peer.";
    }

    // Fall back to a random peer.
    const size_t connected_peers = ctx.peer_connections.size();
    if (connected_peers == 0)
    {
        LOG_DEBUG << "No peers to send.";
        return;
    }
    auto it = ctx.peer_connections.begin();
    std::advance(it, rand() % connected_peers);
    it->second->send(msg::fbuf::builder_to_string_view(fbuf));
    target_pubkey = it->second->uniqueid;
}
```

3. Update request_state_from_peer in hpfs_sync.cpp
```cpp
void hpfs_sync::request_state_from_peer(const std::string &path, const bool is_file,
                                        const int32_t block_id,
                                        const util::h32 expected_hash,
                                        std::string &target_pubkey,
                                        const std::string &preferred_peer) // Default "" goes in the hpfs_sync.hpp declaration.
{
    // ... existing code ...

    // Use the preferred peer if specified, otherwise a random one.
    if (!preferred_peer.empty())
        p2p::send_message_to_peer(fbuf, preferred_peer, target_pubkey);
    else
        p2p::send_message_to_random_peer(fbuf, target_pubkey);
}
```

4. Update submit_request to pass source_peer
```cpp
request_state_from_peer(request.vpath, is_file, request.block_id,
                        request.expected_hash, target_pubkey,
                        request.source_peer); // Pass the preferred peer.
```

5. Update handle_file_hashmap_response to tag block requests with the source peer
```cpp
int hpfs_sync::handle_file_hashmap_response(std::string_view vpath,
                                            const mode_t file_mode,
                                            const util::h32 *hashes,
                                            const size_t hash_count,
                                            const std::set<uint32_t> &responded_block_ids,
                                            const uint64_t file_length,
                                            const std::string &from_peer) // NEW
{
    // ... existing code ...
    for (int32_t block_id = 0; block_id <= max_block_id; block_id++)
    {
        sync_item item{SYNC_ITEM_TYPE::BLOCK, std::string(vpath), block_id, hashes[block_id]};
        item.source_peer = from_peer; // Tag with the hashmap source peer.
        pending_requests.emplace(item);
    }
}
```

6. Pass response.first (full pubkey) to handle_file_hashmap_response
```cpp
// In process_candidate_responses():
handle_file_hashmap_response(vpath, file_resp.file_mode(), block_hashes,
                             block_hash_count, responded_block_ids,
                             file_resp.file_length(),
                             response.first); // Pass the full sender pubkey.
```

Impact
This change addresses the existing TODO comment and improves HPFS sync reliability
in scenarios where nodes have divergent state — particularly when contract files are
updated on a running cluster. In the stable case (all nodes identical) behaviour is
unchanged since the preferred peer will always respond correctly.
Testing
Tested on Evernode mainnet with a 5-node cluster running HotPocket 0.6.4. The
Skipping mismatched block response error occurs reproducibly when adding a new
node after modifying index.js on the running cluster.