
HPFS sync reliability: block requests sent to random peer cause hash mismatch when nodes have divergent state #412

@rippleitinnz

Description

Summary

When a new node joins a running cluster where index.js (or any large state file)
has been modified after the initial deployment, HPFS sync fails repeatedly with:

Hpfs cont sync: Skipping mismatched block response from [xxxxxxxx] for block_id:0 
(len:390541) of /state/index.js

The new node never syncs successfully and cannot join consensus.

Root Cause

In hpfs_sync.cpp, process_candidate_responses() correctly requests the file
hashmap from a peer, then generates block requests based on the received block hashes.
However, in request_state_from_peer(), those block requests are sent to a random
peer via send_message_to_random_peer():

p2p::send_message_to_random_peer(fbuf, target_pubkey); 
// todo: send to a node that hold the expected hash to improve 
// reliability of retrieving hpfs state.

If the random peer has a different version of the file than the peer that provided
the hashmap, the received block data won't match the expected hash — causing
validate_file_block_hash() to reject it and log the mismatch.

This is the existing TODO in the codebase. We hit this in practice when:

  1. A 3-node cluster is running with a modified index.js
  2. A new 4th node joins and requests /state/index.js
  3. Node A provides the hashmap (based on its version of index.js)
  4. The block request goes to Node B (random) which has a slightly different version
  5. Block hash mismatch → sync fails indefinitely
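The failure mechanism can be sketched in isolation: a mismatch is inevitable whenever the hashmap provider and the block provider hold different file contents. Below is a minimal, self-contained illustration (std::hash stands in for the real HPFS block digest, and the peer names and contents are hypothetical, not from the codebase):

```cpp
#include <cstddef>
#include <functional>
#include <map>
#include <string>

// Stand-in for the real block hash (HPFS uses fixed-size digests;
// std::hash is only for illustration).
static std::size_t block_hash(const std::string &block)
{
    return std::hash<std::string>{}(block);
}

// Hypothetical peers holding divergent copies of block 0 of /state/index.js.
static const std::map<std::string, std::string> peer_block0 = {
    {"node_a", "console.log('v2');"},  // provided the hashmap
    {"node_b", "console.log('v1');"}}; // received the block request

// The new node records the expected hash from node_a's hashmap...
static std::size_t expected_hash()
{
    return block_hash(peer_block0.at("node_a"));
}

// ...then checks whichever block response arrives, as
// validate_file_block_hash() does.
static bool validate_block(const std::string &from_peer)
{
    return block_hash(peer_block0.at(from_peer)) == expected_hash();
}
```

With divergent contents, validating node_b's response fails and the block is skipped; directing the request back to node_a, the hashmap's source, makes it pass.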

Reproduction

  1. Create a 3-node cluster
  2. Modify index.js on all 3 running nodes
  3. Acquire a 4th node and deploy the contract bundle to it
  4. Watch the 4th node's hp.log — it will repeatedly log Skipping mismatched block response for /state/index.js and never join consensus

Suggested Fix

Add a source_peer field to sync_item so block requests are directed to the
same peer that provided the hashmap, rather than a random peer.

1. Add source_peer to sync_item in hpfs_sync.hpp

struct sync_item
{
    SYNC_ITEM_TYPE type = SYNC_ITEM_TYPE::DIR;
    std::string vpath;
    int32_t block_id = -1;
    util::h32 expected_hash;
    bool high_priority = false;
    std::string source_peer; // Preferred peer for block requests (empty = random)
    uint32_t waiting_time = 0;
    // ...
};

2. Add send_message_to_peer to p2p.hpp and p2p.cpp

// p2p.hpp
void send_message_to_peer(const flatbuffers::FlatBufferBuilder &fbuf, 
                          const std::string &preferred_pubkey, 
                          std::string &target_pubkey);

// p2p.cpp
void send_message_to_peer(const flatbuffers::FlatBufferBuilder &fbuf, 
                          const std::string &preferred_pubkey, 
                          std::string &target_pubkey)
{
    std::scoped_lock<std::mutex> lock(ctx.peer_connections_mutex);

    if (!preferred_pubkey.empty())
    {
        const auto it = ctx.peer_connections.find(preferred_pubkey);
        if (it != ctx.peer_connections.end())
        {
            it->second->send(msg::fbuf::builder_to_string_view(fbuf));
            target_pubkey = it->second->uniqueid;
            return;
        }
        LOG_DEBUG << "Preferred peer " << preferred_pubkey.substr(2, 8) 
                  << " not found. Falling back to random peer.";
    }

    // Fall back to random peer.
    const size_t connected_peers = ctx.peer_connections.size();
    if (connected_peers == 0)
    {
        LOG_DEBUG << "No peers to send.";
        return;
    }

    auto it = ctx.peer_connections.begin();
    std::advance(it, rand() % connected_peers);
    it->second->send(msg::fbuf::builder_to_string_view(fbuf));
    target_pubkey = it->second->uniqueid;
}
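The preferred-with-fallback policy above can be unit-tested in isolation. A minimal sketch of just the selection logic, with a plain std::map standing in for ctx.peer_connections (names here are hypothetical, not the real p2p API):

```cpp
#include <cstdlib>
#include <iterator>
#include <map>
#include <string>

// Returns the pubkey of the peer a message would go to: the preferred peer
// when connected, otherwise a uniformly random connected peer, or "" when
// no peers are connected. Mirrors the fallback in send_message_to_peer().
static std::string pick_peer(const std::map<std::string, int> &connections,
                             const std::string &preferred)
{
    if (!preferred.empty())
    {
        const auto it = connections.find(preferred);
        if (it != connections.end())
            return it->first;
        // Preferred peer not connected: fall through to random selection.
    }

    if (connections.empty())
        return "";

    auto it = connections.begin();
    std::advance(it, std::rand() % connections.size());
    return it->first;
}
```

Because the preferred branch returns early, the stable case (preferred peer connected) is deterministic, and the random path only runs on fallback, matching the behaviour described in the Impact section.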

3. Update request_state_from_peer in hpfs_sync.cpp

void hpfs_sync::request_state_from_peer(const std::string &path, const bool is_file, 
                                        const int32_t block_id,
                                        const util::h32 expected_hash, 
                                        std::string &target_pubkey,
                                        const std::string &preferred_peer) // default "" goes on the declaration in hpfs_sync.hpp
{
    // ... existing code ...
    
    // Use preferred peer if specified, otherwise random.
    if (!preferred_peer.empty())
        p2p::send_message_to_peer(fbuf, preferred_peer, target_pubkey);
    else
        p2p::send_message_to_random_peer(fbuf, target_pubkey);
}

4. Update submit_request to pass source_peer

request_state_from_peer(request.vpath, is_file, request.block_id, 
                        request.expected_hash, target_pubkey,
                        request.source_peer); // Pass preferred peer

5. Update handle_file_hashmap_response to tag block requests with source peer

int hpfs_sync::handle_file_hashmap_response(std::string_view vpath, 
                                            const mode_t file_mode,
                                            const util::h32 *hashes, 
                                            const size_t hash_count,
                                            const std::set<uint32_t> &responded_block_ids,
                                            const uint64_t file_length,
                                            const std::string &from_peer) // NEW
{
    // ... existing code ...
    for (int32_t block_id = 0; block_id <= max_block_id; block_id++)
    {
        sync_item item{SYNC_ITEM_TYPE::BLOCK, std::string(vpath), block_id, hashes[block_id]};
        item.source_peer = from_peer; // Tag with hashmap source peer
        pending_requests.emplace(item);
    }
}
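To sanity-check step 5, the tagging can be exercised with a stripped-down sync_item: every block request generated from one hashmap response should carry the pubkey of the peer that supplied it. The types below are simplified stand-ins, not the real hpfs_sync API:

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Stripped-down stand-in for sync_item in hpfs_sync.hpp.
struct mini_sync_item
{
    std::string vpath;
    int32_t block_id = -1;
    std::size_t expected_hash = 0;
    std::string source_peer; // Preferred peer for the block request.
};

// Builds one block request per hash, tagged with the hashmap's source peer,
// as handle_file_hashmap_response() would after the suggested change.
static std::vector<mini_sync_item> build_block_requests(
    const std::string &vpath,
    const std::vector<std::size_t> &hashes,
    const std::string &from_peer)
{
    std::vector<mini_sync_item> items;
    for (int32_t block_id = 0; block_id < (int32_t)hashes.size(); block_id++)
        items.push_back({vpath, block_id, hashes[(std::size_t)block_id], from_peer});
    return items;
}
```

The invariant worth asserting is that all items from one response share the same source_peer, so their block requests can never be answered by a peer with a different file version.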

6. Pass response.first (full pubkey) to handle_file_hashmap_response

// In process_candidate_responses():
handle_file_hashmap_response(vpath, file_resp.file_mode(), block_hashes, 
                             block_hash_count, responded_block_ids, 
                             file_resp.file_length(),
                             response.first); // Pass full sender pubkey

Impact

This change addresses the existing TODO comment and improves HPFS sync reliability
in scenarios where nodes have divergent state — particularly when contract files are
updated on a running cluster. In the stable case (all nodes identical) behaviour is
unchanged since the preferred peer will always respond correctly.

Testing

Tested on Evernode mainnet with a 5-node cluster running HotPocket 0.6.4. The
Skipping mismatched block response error occurs reproducibly when adding a new
node after modifying index.js on the running cluster.
