@lucksus
Collaborator

@lucksus lucksus commented Jan 21, 2026

This adds two new tests in which two agents initiate a connection, reproducing two scenarios that could lead to blocked messages:

  1. The pre-flight handler is artificially slowed down to test a potential race condition in which messages are blocked before the pre-flight adds the peer to the peer store.
  2. The pre-flight is sent before the agent has joined locally.

The first test passed before the changes, invalidating that hypothesis.
The second test failed initially but passes with the changes in this PR.

Problem

Messages were being incorrectly blocked at the beginning of sessions between agents. This was traced to a race condition where a connection could be established and preflight exchanged before a local agent joins the space.

Race condition in Holochain around space joining

Time  →

T0: space() called
T1: Space created, bootstrap discovery starts
T2: Connection established with remote peer → PREFLIGHT SENT (empty agent list!)
T3: local_agent_join() called
T4: Agent added to peer store
T5: bootstrap.put() called, preflight cache updated

Race condition window: T1-T4

If a connection is established between T1 and T4, the preflight will have an empty agent list.

Consequence in Kitsune

When a connection is established before local_agent_join() is called:

  1. The outgoing preflight contains an empty agent list (no local agent yet)
  2. The remote peer receives the empty preflight and inserts nothing into their peer store
  3. The remote peer has no access decision for the sender's URL
  4. When messages arrive, the remote peer defaults to blocking (no access decision = blocked)
  5. Even after the local agent joins, the remote peer never gets updated agent info because the connection is already established
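The chain of consequences above can be reproduced in a small model. This is only a sketch under simplifying assumptions: `RemotePeer`, `on_preflight`, `accepts_message_from`, and `message_accepted` are hypothetical names, and a single boolean per URL stands in for the real access decision logic in Kitsune2.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified stand-in for the remote peer's access state.
#[derive(Default)]
struct RemotePeer {
    /// Sender URL -> whether messages from that URL are allowed.
    access: HashMap<String, bool>,
}

impl RemotePeer {
    /// Process a preflight. An empty agent list records no decision at all.
    fn on_preflight(&mut self, sender_url: &str, agents: &[&str]) {
        if !agents.is_empty() {
            self.access.insert(sender_url.to_string(), true);
        }
    }

    /// No recorded access decision means the message is blocked.
    fn accepts_message_from(&self, sender_url: &str) -> bool {
        self.access.get(sender_url).copied().unwrap_or(false)
    }
}

/// Returns whether a message sent after the local join is accepted.
fn message_accepted(join_before_connect: bool) -> bool {
    let mut remote = RemotePeer::default();
    let url = "ws://alice.example"; // illustrative URL only
    let mut local_agents: Vec<&str> = Vec::new();

    if join_before_connect {
        local_agents.push("alice");
    }
    // The preflight is exchanged exactly once, at connection time.
    remote.on_preflight(url, &local_agents);
    if !join_before_connect {
        // The join happens too late: the remote is never told about it.
        local_agents.push("alice");
    }
    remote.accepts_message_from(url)
}
```

In this model `message_accepted(true)` is accepted while `message_accepted(false)` stays blocked, which mirrors steps 1 to 5.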

This matches the production issue described in PR #417.

The race in Holochain could be narrowed, but I regard this fix in Kitsune2 as the more robust solution to the problem.

Investigation Findings

Transport Behavior

Both tx5 and iroh transports exhibit this behavior because:

  • Preflight is only exchanged once during connection establishment
  • If the connection is established before a local agent joins, the preflight will be empty
  • There's no mechanism to update remote peers when local agent info changes

What Doesn't Cause the Issue

  • Slow preflight processing: The transports queue messages until preflight completes, so slow preflight handlers don't cause blocking
  • Race between preflight and access decision: The MemPeerStore listener is synchronous, so access decisions are computed before insert() returns

Confirmed with added test in this changeset.
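To illustrate why a synchronous listener rules out the second hypothesis, here is a minimal sketch. It does not use the real MemPeerStore API: `PeerStore`, `register_listener`, and `decision_ready_after_insert` are hypothetical names, and the listener simply records "allowed" for every inserted URL.

```rust
use std::cell::RefCell;
use std::collections::HashMap;
use std::rc::Rc;

/// Hypothetical listener type, called synchronously on every insert.
type Listener = Box<dyn Fn(&str)>;

/// Toy peer store that notifies listeners before `insert` returns.
struct PeerStore {
    agents: Vec<String>,
    listeners: Vec<Listener>,
}

impl PeerStore {
    fn new() -> Self {
        Self { agents: Vec::new(), listeners: Vec::new() }
    }

    fn register_listener(&mut self, listener: Listener) {
        self.listeners.push(listener);
    }

    /// Stores the agent, then runs every listener inline on the caller's thread.
    fn insert(&mut self, agent_url: &str) {
        self.agents.push(agent_url.to_string());
        for listener in &self.listeners {
            listener(agent_url);
        }
    }
}

/// Returns true if the access decision exists as soon as `insert` returns.
fn decision_ready_after_insert() -> bool {
    let decisions: Rc<RefCell<HashMap<String, bool>>> =
        Rc::new(RefCell::new(HashMap::new()));
    let sink = Rc::clone(&decisions);

    let mut store = PeerStore::new();
    store.register_listener(Box::new(move |url| {
        // Assumed decision logic: allow every inserted agent.
        sink.borrow_mut().insert(url.to_string(), true);
    }));

    store.insert("ws://alice.example");
    let ready = decisions.borrow().contains_key("ws://alice.example");
    ready && store.agents.len() == 1
}
```

Because the listener runs inside `insert`, no message handled after `insert` returns can observe a missing decision for that agent.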

Solution

Added a new method resend_preflight_to_connected_peers() to the Transport trait that is called when a local agent joins. This ensures remote peers receive updated agent information even if the connection was established before the agent joined.
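The effect of the resend can be sketched with a toy model. Only the method name `resend_preflight_to_connected_peers` comes from this PR; the signature here is hypothetical (synchronous, returning the number of peers contacted), whereas the real method lives on the Transport trait and is awaited. `RemoteView`, `ToyTransport`, and `peer_accepts` are illustrative names.

```rust
use std::collections::HashMap;

/// Hypothetical model of what one remote peer knows about us.
#[derive(Default)]
struct RemoteView {
    /// Agents the remote learned from our most recent preflight.
    known_agents: Vec<String>,
}

impl RemoteView {
    fn on_preflight(&mut self, agents: &[String]) {
        self.known_agents = agents.to_vec();
    }

    /// Assumed default: block unless at least one agent is known.
    fn accepts_messages(&self) -> bool {
        !self.known_agents.is_empty()
    }
}

/// Toy transport holding local agents and per-peer remote state.
struct ToyTransport {
    local_agents: Vec<String>,
    connected: HashMap<String, RemoteView>,
}

impl ToyTransport {
    fn new() -> Self {
        Self { local_agents: Vec::new(), connected: HashMap::new() }
    }

    /// Connecting sends one preflight with whatever agents exist right now.
    fn connect(&mut self, peer_url: &str) {
        let mut view = RemoteView::default();
        view.on_preflight(&self.local_agents);
        self.connected.insert(peer_url.to_string(), view);
    }

    /// Regenerates the preflight and sends it to every connected peer.
    fn resend_preflight_to_connected_peers(&mut self) -> usize {
        for view in self.connected.values_mut() {
            view.on_preflight(&self.local_agents);
        }
        self.connected.len()
    }

    /// Joining a local agent triggers the resend.
    fn local_agent_join(&mut self, agent: &str) {
        self.local_agents.push(agent.to_string());
        self.resend_preflight_to_connected_peers();
    }

    fn peer_accepts(&self, peer_url: &str) -> bool {
        self.connected
            .get(peer_url)
            .map(|view| view.accepts_messages())
            .unwrap_or(false)
    }
}
```

Connecting before `local_agent_join` leaves the peer blocking; once the join runs, the resend updates the peer and subsequent messages are accepted.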

Notes

  • The first message sent before local_agent_join() may still be blocked (unavoidable since there's no agent info yet)
  • Messages sent after local_agent_join() will not be blocked because preflight is resent
  • This fix works for both tx5 and iroh transports

Summary by CodeRabbit

  • New Features

    • Peers now receive updated agent preflight info automatically when local agent state changes, improving network consistency and reducing access-evaluation mismatches.
  • Tests

    • Added tests simulating slow preflight and delayed local joins to ensure messages are not improperly blocked and preflight resends restore correct behavior.

@coderabbitai

coderabbitai bot commented Jan 21, 2026

Walkthrough

Adds transport support to regenerate and resend preflight messages to connected peers when local agent info changes, exposes a helper to generate per-peer preflight, updates DefaultTransport to hold a handler reference, integrates resend on local agent join, and adds tests simulating slow preflight race conditions.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Transport layer implementation: crates/api/src/transport.rs | Adds TxImpHnd::generate_preflight_for_peer() to produce encoded preflight bytes for a specific peer; adds Transport::resend_preflight_to_connected_peers() trait method; adds handler: DynTxHandler to DefaultTransport and updates create(); implements resend logic with per-peer gather/encode/send and warning logs. |
| CoreSpace lifecycle integration: crates/core/src/factories/core_space.rs | Captures a weak transport reference and, after queuing new AgentInfo on local_agent_join, upgrades and calls resend_preflight_to_connected_peers().await, logging failures while retaining existing broadcast behavior. |
| Test scaffolding and race-condition testing: crates/kitsune2/tests/blocks.rs | Adds SlowPreflightTxHandler, TestPeerWithSlowPreflight, TestPeerDelayedJoin, factories, and tests (messages_should_not_be_blocked_during_slow_preflight, messages_blocked_when_preflight_sent_before_local_agent_joins) to exercise preflight vs. regular-message race conditions and delayed local-agent join scenarios. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~30 minutes

Suggested reviewers

  • matthme
  • ThetaSinner
  • jost-s

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly and specifically describes the main change: resending preflight when local agent joins to address message blocking. |

@github-actions

github-actions bot commented Jan 21, 2026

The following will be added to the changelog


[0.4.0-dev.3] - 2026-01-21

Bug Fixes

  • Use weak ref to avoid keeping transport alive
  • Resend pre-flight to all connected peers to re-initiate access decision logic

Testing

  • Join space and initiate connection before local agent joins
  • Ensures no race condition with peer store insert

@cloudflare-workers-and-pages

cloudflare-workers-and-pages bot commented Jan 21, 2026

Deploying kitsune2 with Cloudflare Pages

Latest commit: 2722783
Status: ✅  Deploy successful!
Preview URL: https://fdc945e2.kitsune2.pages.dev
Branch Preview URL: https://fix-blocking-due-to-prefligh.kitsune2.pages.dev

@cocogitto-bot

cocogitto-bot bot commented Jan 21, 2026

✔️ 2906ee8...2722783 - Conventional commits check succeeded.

@lucksus lucksus requested review from ThetaSinner and jost-s January 21, 2026 22:39
Member

@ThetaSinner ThetaSinner left a comment

I haven't read the code but looking at the title, that logic doesn't sound right. The intention of the pre-flight is to decide whether or not to establish a connection. Putting peer information into the pre-flight was a Holochain optimization and not supposed to be part of the K2 logic. We certainly can't rely on every host to do that and it's not really reasonable to re-check whether the connection should have been established.

When operating with a single space:

  • The first local agent won't be discoverable until they have published their peer info, so it's impossible that they aren't joined to the space.
  • Before a local agent has joined, it's possible to fetch peer info from the bootstrap server and start contacting other peers - I think that was part of this investigation, whether K2 will do that in any cases.

With multiple spaces:

  • The same logic applies for the first space.
  • For subsequent spaces, there's no preflight on join where a connection is already in place. There is a design for a "hello" module service which is part of the "access" module implementation. That's the solution to that problem but currently not implemented.

@lucksus
Collaborator Author

lucksus commented Jan 22, 2026

@ThetaSinner, in #417 you wrote:

Two things I mentioned the other day that are worth looking at but let me write them up rather than just saying them briefly.

  1. Is it possible that network messages are being sent before the local agent joins the space? I don't think that's something we've protected against explicitly. Gossip won't initiate before an agent is available but maybe something else can? For example, an app sending signals or get requests before the local agent joins the newly created space. If the preflight gets sent without a local agent info then the connection has no way to recover until bootstrap happens to fix the issue. Gossip would also transfer agent infos but obviously we can't gossip if we're locked out.

The second test in here shows that this is possible. I've analyzed the Holochain code for this and it can definitely happen. There are several ways to go about it. Changing the default like I did in #417 would be one. I've also looked into fixing this race in Holochain, but I'm not sure that would be possible without architectural changes (moving the peer store into K2?).

We certainly can't rely on every host to do that and it's not really reasonable to re-check whether the connection should have been established.

I'm not sure I get what you're saying here.

In short: because of a race in Holochain, the pre-flight can be sent without any agent info, since adding the local agent has not finished yet. That becomes a problem because the default for connections without agents is to block, and the connection stays blocked until another pre-flight comes in. What this PR does is resend the pre-flight after the local agent has been added successfully. We could also close and re-initiate the connection, but that seems more problematic.

Or was the simple default change in #417 the right way of fixing this after all?

@mattyg
Member

mattyg commented Jan 22, 2026

I was confused by the race condition description in the PR, but I think I tracked down a case where it would arise:

  • Alice joins space, joins local agent, publishes bootstrap info
  • Bob queries bootstrap server, gets Alice's info
    ... (anything can happen here, including successfully connecting to each other)
  • Alice kills and restarts the app, joins space
  • Bob sends a message to Alice
  • Alice receives message because her space has started, but she has not finished joining her local agent yet
  • This triggers Alice to send her preflight, which is empty.

In this case I think Bob does already have an access decision for Alice, but isn't checking it. Instead he is checking for an access decision for a blank peer URL.

I think a more direct solution would be to not send empty preflights, and instead close the connection. The receiving end could also validate the size of preflight messages and close the connection if they are invalid.
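
A rough sketch of that alternative, assuming a hypothetical newline-joined encoding and hypothetical names (`PreflightAction`, `outgoing_preflight`, `incoming_preflight_is_valid`); none of this is existing Kitsune2 API, and the size check is reduced to "non-empty" for illustration:

```rust
/// Hypothetical outcome of preparing a preflight for a new connection.
#[derive(Debug, PartialEq)]
enum PreflightAction {
    /// Send these encoded bytes to the remote peer.
    Send(Vec<u8>),
    /// Refuse to send an empty preflight and drop the connection instead.
    CloseConnection,
}

/// Sender side: never emit a preflight without local agents.
fn outgoing_preflight(local_agents: &[&str]) -> PreflightAction {
    if local_agents.is_empty() {
        return PreflightAction::CloseConnection;
    }
    // Assumed encoding for illustration only: agent ids joined by newlines.
    PreflightAction::Send(local_agents.join("\n").into_bytes())
}

/// Receiver side: treat an empty preflight as invalid so the caller can
/// close the connection rather than record nothing and block silently.
fn incoming_preflight_is_valid(bytes: &[u8]) -> bool {
    !bytes.is_empty()
}
```

With this shape, a connection opened before the local agent has joined is closed and can be retried later, instead of staying open with no access decision on the remote side.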
