Fix: Resend preflight when local agent joins to prevent message blocking #441

lucksus · 2026-01-21T20:45:49Z

This adds two new tests in which two agents initiate a connection, reproducing scenarios that could lead to blocked messages:

The pre-flight handler is artificially slowed down, to test a potential race condition in which messages are blocked before pre-flight adds the peer to the peer store
A pre-flight is sent before the agent has joined locally

The first test passed before the changes, invalidating that hypothesis.
The second test failed initially but passes with the changes in this PR.

Problem

Messages were being incorrectly blocked at the beginning of sessions between agents. This was traced to a race condition where a connection could be established and preflight exchanged before a local agent joins the space.

Race-Condition in Holochain around space joining

Time  →

T0: space() called
T1: Space created, bootstrap discovery starts
T2: Connection established with remote peer → PREFLIGHT SENT (empty agent list!)
T3: local_agent_join() called
T4: Agent added to peer store
T5: bootstrap.put() called, preflight cache updated

Race condition window: T1-T4

If a connection is established between T1 and T4, the preflight will have an empty agent list.
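
A minimal sketch of the timeline above, assuming the preflight is a snapshot of the local agent list taken when the connection is established. `Space`, `AgentInfo`, `build_preflight` and `replay_timeline` are hypothetical names for illustration only, not Kitsune2 or Holochain types.

```rust
// Hypothetical model of the timeline above; these are NOT Kitsune2's real types.
#[derive(Debug, Clone, PartialEq)]
pub struct AgentInfo {
    pub agent_id: String,
    pub url: String,
}

#[derive(Default)]
pub struct Space {
    local_agents: Vec<AgentInfo>,
}

impl Space {
    /// T3/T4: the local agent joins and becomes visible to later preflights.
    pub fn local_agent_join(&mut self, agent: AgentInfo) {
        self.local_agents.push(agent);
    }

    /// Snapshot of the local agents at the moment a connection is established.
    pub fn build_preflight(&self) -> Vec<AgentInfo> {
        self.local_agents.clone()
    }
}

/// Replays the timeline and returns (preflight built at T2, preflight built after T4).
pub fn replay_timeline() -> (Vec<AgentInfo>, Vec<AgentInfo>) {
    let mut space = Space::default(); // T0/T1: space created
    let early = space.build_preflight(); // T2: connection established too early
    space.local_agent_join(AgentInfo {
        agent_id: "alice".into(),
        url: "ws://alice.example".into(),
    }); // T3/T4: agent joins
    let late = space.build_preflight();
    (early, late)
}
```

In this model the snapshot taken at T2 is empty, while one taken after T4 contains the agent.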

Consequence in Kitsune

When a connection is established before local_agent_join() is called:

The outgoing preflight contains an empty agent list (no local agent yet)
The remote peer receives the empty preflight and inserts nothing into their peer store
The remote peer has no access decision for the sender's URL
When messages arrive, the remote peer defaults to blocking (no access decision = blocked)
Even after the local agent joins, the remote peer never gets updated agent info because the connection is already established
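
The steps above can be sketched as a lookup that defaults to blocking. This is a simplified model under stated assumptions: `RemotePeer`, `Access`, `on_preflight` and `check` are hypothetical, and every agent URL in a preflight is treated as granted, which ignores any real block-list logic.

```rust
use std::collections::HashMap;

// Hypothetical model of the receiving side; not the real peer store or access module.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Access {
    Granted,
    Blocked,
}

#[derive(Default)]
pub struct RemotePeer {
    /// Access decisions keyed by the sender's URL.
    decisions: HashMap<String, Access>,
}

impl RemotePeer {
    /// Record a decision for every agent URL carried by a preflight.
    /// Assumption: all received agents are allowed.
    pub fn on_preflight(&mut self, agent_urls: &[&str]) {
        for url in agent_urls {
            self.decisions.insert((*url).to_string(), Access::Granted);
        }
    }

    /// No recorded decision means the sender is treated as blocked.
    pub fn check(&self, sender_url: &str) -> Access {
        self.decisions
            .get(sender_url)
            .copied()
            .unwrap_or(Access::Blocked)
    }
}
```

An empty preflight inserts nothing, so `check` keeps returning `Access::Blocked` for that sender until a later preflight carries its agent info.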

This matches the production issue described in PR #417.

The race could also be addressed in Holochain, but I regard this fix in Kitsune2 as the more robust solution to the problem.

Investigation Findings

Transport Behavior

Both tx5 and iroh transports exhibit this behavior because:

Preflight is only exchanged once during connection establishment
If the connection is established before a local agent joins, the preflight will be empty
There's no mechanism to update remote peers when local agent info changes

What Doesn't Cause the Issue

Slow preflight processing: The transports queue messages until preflight completes, so slow preflight handlers don't cause blocking
Race between preflight and access decision: The MemPeerStore listener is synchronous, so access decisions are computed before insert() returns

This is confirmed by the test added in this changeset.
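
To illustrate why a slow preflight handler does not by itself cause blocking, here is a rough sketch of a connection that holds messages back until preflight completes. `Connection` and its methods are hypothetical and only model the queueing behavior described above, not the tx5 or iroh implementations.

```rust
use std::collections::VecDeque;

// Hypothetical connection model; the real transports are asynchronous.
pub struct Connection {
    preflight_done: bool,
    pending: VecDeque<Vec<u8>>,
    delivered: Vec<Vec<u8>>,
}

impl Connection {
    pub fn new() -> Self {
        Self {
            preflight_done: false,
            pending: VecDeque::new(),
            delivered: Vec::new(),
        }
    }

    /// Messages arriving before preflight completes are queued, not dropped.
    pub fn on_message(&mut self, msg: Vec<u8>) {
        if self.preflight_done {
            self.delivered.push(msg);
        } else {
            self.pending.push_back(msg);
        }
    }

    /// Once preflight completes, flush the queue in arrival order.
    pub fn on_preflight_complete(&mut self) {
        self.preflight_done = true;
        while let Some(msg) = self.pending.pop_front() {
            self.delivered.push(msg);
        }
    }

    pub fn delivered(&self) -> &[Vec<u8>] {
        &self.delivered
    }
}
```

However long the preflight handler takes, queued messages are only handed on after it finishes, so the access decision already exists when they are checked.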

Solution

Added a new method resend_preflight_to_connected_peers() to the Transport trait that is called when a local agent joins. This ensures remote peers receive updated agent information even if the connection was established before the agent joined.
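
A simplified, synchronous sketch of the idea follows. Only the method name `resend_preflight_to_connected_peers` comes from this PR; the trait shape, the `usize` return value, `MockTransport` and its fields are assumptions made for illustration, and the real method is async.

```rust
use std::collections::HashMap;

// Hypothetical, synchronous stand-in for the Transport trait.
pub trait Transport {
    /// Re-send the current preflight to every connected peer.
    /// Returns the number of peers it was sent to (an assumption of this sketch).
    fn resend_preflight_to_connected_peers(&mut self) -> usize;
}

#[derive(Default)]
pub struct MockTransport {
    local_agents: Vec<String>,
    /// Peer URL -> the agent list that peer last received from us.
    connected: HashMap<String, Vec<String>>,
}

impl MockTransport {
    /// Connecting sends a preflight with whatever agents exist right now.
    pub fn connect(&mut self, peer_url: &str) {
        self.connected
            .insert(peer_url.to_string(), self.local_agents.clone());
    }

    /// Joining a local agent triggers a resend to all connected peers.
    pub fn local_agent_join(&mut self, agent: &str) -> usize {
        self.local_agents.push(agent.to_string());
        self.resend_preflight_to_connected_peers()
    }

    pub fn seen_by(&self, peer_url: &str) -> Option<&Vec<String>> {
        self.connected.get(peer_url)
    }
}

impl Transport for MockTransport {
    fn resend_preflight_to_connected_peers(&mut self) -> usize {
        let snapshot = self.local_agents.clone();
        for seen in self.connected.values_mut() {
            *seen = snapshot.clone();
        }
        self.connected.len()
    }
}
```

A peer that connected before the join first sees an empty agent list and sees the joined agent after the resend.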

Notes

The first message sent before local_agent_join() may still be blocked (unavoidable since there's no agent info yet)
Messages sent after local_agent_join() will not be blocked because preflight is resent
This fix works for both tx5 and iroh transports

Related Issues

Addresses blocking issues discussed in [BUG] Kitsune2 blocking peers that it shouldn't #424 and Fix: Default to allowing peers when access decision unavailable #417

Summary by CodeRabbit

New Features
- Peers now receive updated agent preflight info automatically when local agent state changes, improving network consistency and reducing access-evaluation mismatches.
Tests
- Added tests simulating slow preflight and delayed local joins to ensure messages are not improperly blocked and preflight resends restore correct behavior.

coderabbitai · 2026-01-21T20:46:10Z

Walkthrough

Adds transport support to regenerate and resend preflight messages to connected peers when local agent info changes, exposes a helper to generate per-peer preflight, updates DefaultTransport to hold a handler reference, integrates resend on local agent join, and adds tests simulating slow preflight race conditions.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Transport layer implementation `crates/api/src/transport.rs` | Adds `TxImpHnd::generate_preflight_for_peer()` to produce encoded preflight bytes for a specific peer; adds `Transport::resend_preflight_to_connected_peers()` trait method; adds `handler: DynTxHandler` to `DefaultTransport` and updates `create()`; implements resend logic with per-peer gather/encode/send and warning logs. |
| CoreSpace lifecycle integration `crates/core/src/factories/core_space.rs` | Captures a weak transport reference and, after queuing new AgentInfo on local_agent_join, upgrades and calls `resend_preflight_to_connected_peers().await`, logging failures while retaining existing broadcast behavior. |
| Test scaffolding and race-condition testing `crates/kitsune2/tests/blocks.rs` | Adds `SlowPreflightTxHandler`, `TestPeerWithSlowPreflight`, `TestPeerDelayedJoin`, factories, and tests (`messages_should_not_be_blocked_during_slow_preflight`, `messages_blocked_when_preflight_sent_before_local_agent_joins`) to exercise preflight vs. regular-message race conditions and delayed local-agent join scenarios. |
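
The weak-reference pattern mentioned for `core_space.rs` might look roughly like the sketch below. `Transport`, `Space`, the `resends` counter and `on_local_agent_join` are hypothetical stand-ins and the code is synchronous; only the `std::sync::Weak` upgrade pattern itself is the point.

```rust
use std::sync::{Arc, Mutex, Weak};

// Hypothetical stand-ins; not the real CoreSpace or DefaultTransport.
pub struct Transport {
    /// Counts resends so the effect is observable in this sketch.
    pub resends: Mutex<u32>,
}

pub struct Space {
    /// Weak reference: the space does not keep the transport alive.
    transport: Weak<Transport>,
}

impl Space {
    pub fn new(transport: &Arc<Transport>) -> Self {
        Self {
            transport: Arc::downgrade(transport),
        }
    }

    /// Upgrade the weak reference and resend if the transport still exists.
    /// Returns true if the resend ran.
    pub fn on_local_agent_join(&self) -> bool {
        match self.transport.upgrade() {
            Some(transport) => {
                *transport.resends.lock().unwrap() += 1;
                true
            }
            // Transport was already dropped; there is nothing to resend to.
            None => false,
        }
    }
}
```

Holding a `Weak` means the space never extends the transport's lifetime: once the last `Arc` is dropped, `upgrade()` returns `None` and the join path skips the resend.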

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~30 minutes

Possibly related PRs

Respect blocks when sending a message #340: Modifies DefaultTransport internals and create/clone logic — overlaps with the added handler field and creation changes.
refactor: make preflight handlers async #326: Refactors preflight gather/validate codepaths that generate_preflight_for_peer() invokes.
feat: check if agents are blocked when receiving a message #320: Changes transport and agent-state handling related to preflight and connection lifecycle used by the resend integration.

Suggested reviewers

matthme
ThetaSinner
jost-s

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title clearly and specifically describes the main change: resending preflight when local agent joins to address message blocking.

github-actions · 2026-01-21T20:46:45Z

The following will be added to the changelog

[0.4.0-dev.3] - 2026-01-21

Bug Fixes

Use weak ref to avoid keeping transport alive
Resend pre-flight to all connected peers to re-initiate access decision logic

Testing

Join space and initiate connection before local agent joins
Ensures no race condition with peer store insert

cloudflare-workers-and-pages · 2026-01-21T21:46:55Z

Deploying kitsune2 with Cloudflare Pages

Latest commit:	`2722783`
Status:	✅ Deploy successful!
Preview URL:	https://fdc945e2.kitsune2.pages.dev
Branch Preview URL:	https://fix-blocking-due-to-prefligh.kitsune2.pages.dev

cocogitto-bot · 2026-01-21T21:46:59Z

✔️ 2906ee8...2722783 - Conventional commits check succeeded.

ThetaSinner

I haven't read the code but looking at the title, that logic doesn't sound right. The intention of the pre-flight is to decide whether or not to establish a connection. Putting peer information into the pre-flight was a Holochain optimization and not supposed to be part of the K2 logic. We certainly can't rely on every host to do that and it's not really reasonable to re-check whether the connection should have been established.

When operating with a single space:

The first local agent won't be discoverable until they have published their peer info, so it's impossible that they aren't joined to the space.
Before a local agent has joined, it's possible to fetch peer info from the bootstrap server and start contacting other peers - I think that was part of this investigation, whether K2 will do that in any cases.

With multiple spaces:

The same logic applies for the first space.
For subsequent spaces, there's no preflight on join where a connection is already in place. There is a design for a "hello" module service which is part of the "access" module implementation. That's the solution to that problem but currently not implemented.

lucksus · 2026-01-22T15:33:13Z

@ThetaSinner, in #417 you wrote:

Two things I mentioned the other day that are worth looking at but let me write them up rather than just saying them briefly.

Is it possible that network messages are being sent before the local agent joins the space? I don't think that's something we've protected against explicitly. Gossip won't initiate before an agent is available but maybe something else can? For example, an app sending signals or get requests before the local agent joins the newly created space. If the preflight gets sent without a local agent info then the connection has no way to recover until bootstrap happens to fix the issue. Gossip would also transfer agent infos but obviously we can't gossip if we're locked out.

The second test in this PR shows that this is possible. I've analyzed the Holochain code for this and it can definitely happen. There are several ways to address it. Changing the default, as I did in #417, would be one. I've also looked into fixing this race in Holochain, but I'm not sure that would be possible without architectural changes (moving the peer store into K2?).

We certainly can't rely on every host to do that and it's not really reasonable to re-check whether the connection should have been established.

I'm not sure I get what you're saying here.

In short: because of a race in Holochain, the pre-flight can be sent without any agent info when adding the local agent has not finished yet. This becomes a problem because the default for connections without agents is to block, and the connection stays blocked until another pre-flight comes in. This PR re-sends the pre-flight after the agent has been added locally. We could also close and re-initiate the connection, but that seems more problematic.

Or was the simple default change in #417 the right way of fixing this after all?

crates/kitsune2/tests/blocks.rs

mattyg · 2026-01-22T23:41:25Z

I was confused by the race condition description in the PR, but I think I tracked down a case where it would arise:

Alice joins space, joins local agent, publishes bootstrap info
Bob queries bootstrap server, gets Alice's info
... (anything can happen here, including successfully connecting to each other)
Alice kills and restarts the app, joins space
Bob sends a message to Alice
Alice receives message because her space has started, but she has not finished joining her local agent yet
This triggers Alice to send her preflight, which is empty.

In this case I think Bob does already have an access decision for Alice but isn't checking it. Instead, he is checking for an access decision for a blank peer URL.

I think a more direct solution would be to not send empty preflights and to close the connection instead. The receiving end could maybe also validate the size of preflight messages and close the connection if they are invalid.
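
A rough sketch of what such a check could look like, usable on both the sending and the receiving side. Everything here is hypothetical: `validate_preflight`, `PreflightError` and the `MAX_PREFLIGHT_BYTES` limit are illustrative and not part of Kitsune2, and the real preflight encoding is not modeled.

```rust
// Hypothetical validation helper; names and the size limit are assumptions.
#[derive(Debug, PartialEq)]
pub enum PreflightError {
    /// No agent infos at all: the caller should close the connection.
    Empty,
    /// Payload exceeds the (assumed) limit.
    TooLarge { len: usize, max: usize },
}

/// Illustrative upper bound on the total size of encoded agent infos.
pub const MAX_PREFLIGHT_BYTES: usize = 16 * 1024;

/// Returns the total payload size when the preflight is acceptable.
pub fn validate_preflight(encoded_agents: &[Vec<u8>]) -> Result<usize, PreflightError> {
    if encoded_agents.is_empty() {
        return Err(PreflightError::Empty);
    }
    let len: usize = encoded_agents.iter().map(Vec::len).sum();
    if len > MAX_PREFLIGHT_BYTES {
        return Err(PreflightError::TooLarge {
            len,
            max: MAX_PREFLIGHT_BYTES,
        });
    }
    Ok(len)
}
```

On `Err`, the caller would close the connection rather than send or accept the preflight.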

lucksus added 3 commits January 21, 2026 21:07

test: ensures no race condition with peer store insert

2906ee8

test: join space and initiate connection before local agent joins

bcf00d8

fix: resend pre-flight to all connected peers to re-initiate access decision logic

890645d

lucksus mentioned this pull request Jan 21, 2026

Fix: Default to allowing peers when access decision unavailable #417

Closed

lucksus mentioned this pull request Jan 21, 2026

[SPIKE] Review peer-blocking code, identify and fix edge-cases #439

Open

fix: use weak ref to avoid keeping transport alive

2722783

lucksus requested review from ThetaSinner and jost-s January 21, 2026 22:39

ThetaSinner requested changes Jan 22, 2026

mattyg reviewed Jan 22, 2026

mattyg reviewed Jan 22, 2026

lucksus mentioned this pull request Jan 27, 2026

fix: blocking race condition by preventing sends before local agent joins #451

Open
