
Conversation

Jim8y commented Oct 24, 2025

Summary

  • stash all pre-startup Peer messages until ChannelsConfig arrives
  • replay the stashed Tcp.Connected after config is applied
  • add unit test covering Tcp.Connected before configuration

Testing

  • dotnet test tests/Neo.UnitTests/Neo.UnitTests.csproj

Jim8y linked an issue Oct 24, 2025 that may be closed by this pull request

Jim8y commented Oct 24, 2025

fixing conflicts

Jim8y commented Oct 24, 2025

Bug Context

  • Incoming Tcp.Connected arrives before ChannelsConfig → Peer processes
    it immediately, Config is still null, and the original code aborts with
    Sender.Tell(Tcp.Abort.Instance) or throws a NullReferenceException. That
    race is common when Akka schedules socket completion before configuration
    fan-out, so healthy handshakes die on startup (see the sketch below).
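
A minimal sketch of the pre-fix receive path that loses the race, assuming Akka.NET's Tcp messages; the names mirror Peer.cs, but the class and method bodies are illustrative, not the actual source:

```csharp
using Akka.Actor;
using Akka.IO;
using Neo.Network.P2P;

// Simplified pre-fix shape: Tcp.Connected can be processed while Config
// is still null, so a healthy inbound handshake dies on startup.
public abstract class PeerSketch : UntypedActor
{
    protected ChannelsConfig Config { get; private set; }

    protected override void OnReceive(object message)
    {
        switch (message)
        {
            case ChannelsConfig config:
                Config = config;                     // configuration only becomes available here
                break;
            case Tcp.Connected connected:
                if (Config is null)
                {
                    // Startup race: the socket completed before the
                    // configuration fan-out reached this actor.
                    Sender.Tell(Tcp.Abort.Instance);
                    return;
                }
                OnTcpConnected(connected);
                break;
        }
    }

    protected abstract void OnTcpConnected(Tcp.Connected connected);
}
```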

Fix Mechanics

  • src/Neo/Network/P2P/Peer.cs:30 now makes Peer implement IWithUnboundedStash;
    the actor can buffer startup-sensitive messages.
  • src/Neo/Network/P2P/Peer.cs:188 stashes any message that depends on
    configuration (Timer, Peers, Connect, Tcp.*) until the ChannelsConfig
    message arrives. OnStart persists the config and Stash.UnstashAll() replays
    the saved traffic (see the sketch after this list).
  • When the previously premature Tcp.Connected is replayed (with Config
    populated), it passes the remote-address guard and completes the
    registration flow. We still abort sockets that lack RemoteAddress, matching
    the guard already on the dev branch.
  • tests/Neo.UnitTests/Network/P2P/UT_LocalNode.cs:52 validates the scenario:
    send Tcp.Connected first, confirm no immediate Tcp.Register, then send
    ChannelsConfig and assert the registration happens.
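
A minimal sketch of the stashing flow, assuming Akka.NET's IWithUnboundedStash; names mirror Peer.cs, but the dispatch is simplified, not the actual diff:

```csharp
using Akka.Actor;
using Akka.IO;
using Neo.Network.P2P;

// Simplified post-fix shape: everything that depends on configuration is
// stashed until ChannelsConfig arrives, then replayed in order.
public abstract class PeerSketch : UntypedActor, IWithUnboundedStash
{
    public IStash Stash { get; set; }                // injected by Akka
    protected ChannelsConfig Config { get; private set; }

    protected override void OnReceive(object message)
    {
        if (Config is null && message is not ChannelsConfig)
        {
            Stash.Stash();                           // buffer Timer/Peers/Connect/Tcp.*
            return;
        }

        switch (message)
        {
            case ChannelsConfig config:
                Config = config;                     // OnStart persists the config
                Stash.UnstashAll();                  // replay the saved traffic
                break;
            case Tcp.Connected connected:
                OnTcpConnected(connected);           // Config is now populated
                break;
        }
    }

    protected abstract void OnTcpConnected(Tcp.Connected connected);
}
```

UnstashAll() prepends the buffered messages to the mailbox, so the replayed Tcp.Connected is handled exactly as if it had arrived after configuration.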

shargon commented Oct 24, 2025

Does this close #4247?

Jim8y commented Oct 24, 2025

Does this close #4247?

Yes, trying to fix the bug for good.


Copilot AI left a comment


Pull Request Overview

This PR modifies the P2P Peer actor to defer message processing until configuration is received. The key change is implementing message stashing to handle the race condition where peer-to-peer messages (like TCP connections) arrive before the ChannelsConfig message, ensuring proper initialization order.

  • Implements IWithUnboundedStash interface in the Peer class to enable message stashing
  • Adds stashing logic to buffer pre-startup messages until ChannelsConfig arrives
  • Adds unit test verifying Tcp.Connected messages are properly deferred and processed after configuration

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

  • src/Neo/Network/P2P/Peer.cs: implements the stashing mechanism that
    defers message processing until ChannelsConfig is received
  • tests/Neo.UnitTests/Network/P2P/UT_LocalNode.cs: adds a test case
    verifying TCP connection messages are stashed before configuration


Wi1l-B0t previously approved these changes Oct 25, 2025
vncoelho previously approved these changes Oct 27, 2025
Jim8y dismissed stale reviews from vncoelho and Wi1l-B0t via f184d50 October 27, 2025 14:00
Jim8y requested a review from shargon October 27, 2025 14:00
superboyiii commented Oct 28, 2025

The test result is not good. I quickly started a privatenet with run-localnet-nodes.sh using 7 consensus nodes. They can never reach consensus and always time out... but the original dev branch always reaches consensus smoothly. Something is not correct here.
I tried many times; still no consensus after 4 minutes.

[Info][17:09:23.021] OnStart
[Info][17:09:23.036] Initialize: height=1 view=0 index=2 role=Backup
[Info][17:09:23.038] Sending RecoveryRequest: height=1 view=0 nc=0 nf=0
[Info][17:09:53.060] Sending ChangeView: height=1 view=0 nv=1 nc=0 nf=0 reason=Timeout
[Info][17:09:55.077] OnChangeViewReceived: height=1 view=0 index=4 nv=1 reason=Timeout
[Info][17:09:57.217] OnChangeViewReceived: height=1 view=0 index=5 nv=1 reason=Timeout
[Info][17:09:59.330] OnChangeViewReceived: height=1 view=0 index=0 nv=1 reason=Timeout
[Info][17:10:53.076] Sending ChangeView: height=1 view=0 nv=1 nc=0 nf=0 reason=Timeout
[Info][17:11:53.086] Sending ChangeView: height=1 view=0 nv=1 nc=0 nf=0 reason=Timeout
[Info][17:12:53.108] Sending ChangeView: height=1 view=0 nv=1 nc=0 nf=0 reason=Timeout

@superboyiii

Strange, I manually set up 7 consensus nodes and they all work well.


superboyiii commented Oct 28, 2025

OK, I found the issue. It's the config settings. The shell sets "MinDesiredConnections" to 3, which makes it hard to ask for more connections. But I'm not sure why this PR makes it worse, since the shell still worked on the last commit. I set it to 10, and then both could work.
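
For reference, this knob lives under the P2P section of each node's config.json; a sketch with the raised value (the surrounding fields and values are illustrative, only MinDesiredConnections is the point here):

```json
{
  "ApplicationConfiguration": {
    "P2P": {
      "Port": 10333,
      "MinDesiredConnections": 10,
      "MaxConnections": 40
    }
  }
}
```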

@vncoelho

Maybe for this 7-node consensus it would be better to set it to 5-6.

shargon commented Oct 28, 2025

@Jim8y could you dig into this?

OK, I found the issue. It's the config settings. The shell sets "MinDesiredConnections" to 3, which makes it hard to ask for more connections. But I'm not sure why this PR makes it worse, since the shell still worked on the last commit. I set it to 10, and then both could work.

@vncoelho

Do not merge until #4254 is fixed; I am not able to test.

@superboyiii

Maybe for this 7-node consensus it would be better to set it to 5-6.

Yes, I think so. But I'm still confused why it worked previously but now needs the config modified.



Development

Successfully merging this pull request may close these issues.

[Bug] Random error during start
