fix(retry): fix sporadic "no request forwarding" errors when fetching discovery members through a VIP endpoint#17

Merged
Handfish merged 3 commits into Handfish:main from perfectra1n:fix/retry-and-remember-last-working-cp-node on Jan 21, 2026

@perfectra1n perfectra1n commented Jan 18, 2026

Fixes sporadic "no request forwarding" errors when fetching discovery members through a VIP endpoint.

When the TUI queries discovery members via the VIP, the request may be routed to a control plane node that can't forward it. The resulting error caused discovery_members to be cleared, so the TUI fell back to an empty node list and showed empty or stale cluster data.

This PR retries with fallback nodes: it first tries the VIP (2 retries), then queries each control plane node directly. I also made it preserve cached data instead of clearing discovery_members on error. Finally, there is a tertiary fallback that uses the versions data as the node source if discovery AND etcd both fail.
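
The retry order described above can be sketched roughly as follows. This is a minimal, synchronous illustration and not the code from this PR: `fetch_with_fallback`, its parameters, and the closure-based `fetch` are hypothetical names chosen for the example, and whether "2 retries" means two or three total VIP attempts is left to the caller via `vip_attempts`.

```rust
/// Try the VIP up to `vip_attempts` times, then query each fallback node once.
/// Returns the first successful result, or every error collected along the way.
/// (Hypothetical helper for illustration; the PR's real function is async.)
fn fetch_with_fallback<T, E>(
    vip: &str,
    fallback_nodes: &[&str],
    vip_attempts: usize,
    mut fetch: impl FnMut(&str) -> Result<T, E>,
) -> Result<T, Vec<E>> {
    let mut errors = Vec::new();

    // Phase 1: go through the VIP, which may land on a node that cannot forward.
    for _ in 0..vip_attempts {
        match fetch(vip) {
            Ok(value) => return Ok(value),
            Err(err) => errors.push(err),
        }
    }

    // Phase 2: bypass the VIP and ask each control plane node directly.
    for &node in fallback_nodes {
        match fetch(node) {
            Ok(value) => return Ok(value),
            Err(err) => errors.push(err),
        }
    }

    Err(errors)
}
```

Returning all collected errors, rather than only the last one, is a design choice of this sketch so that a caller can log which targets failed.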

| File | Change |
| --- | --- |
| `talos-rs/src/talosctl.rs` | Add `get_discovery_members_for_node_async()` and update retry function to accept fallback node IPs |
| `talos-rs/src/lib.rs` | Export updated function |
| `talos-pilot-tui/src/components/cluster.rs` | Pass etcd member IPs as fallback, add tertiary fallback using versions, preserve data on error |
| `talos-pilot-tui/src/components/lifecycle.rs` | Pass version node IPs as fallback, preserve data on error |
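
To make "preserve data on error" and the tertiary fallback concrete, here is a small sketch under stated assumptions. `NodeCache`, `apply`, and `pick_node_source` are hypothetical names for this example; only the `discovery_members` field name comes from the description above, and the actual component state in the TUI may look quite different.

```rust
/// Hypothetical holder for the node list a component displays.
#[derive(Debug)]
struct NodeCache {
    discovery_members: Vec<String>,
}

impl NodeCache {
    /// Overwrite the cached members only on success.
    /// On error the previous data is kept and the error is handed back to the caller.
    fn apply(&mut self, fetched: Result<Vec<String>, String>) -> Option<String> {
        match fetched {
            Ok(members) => {
                self.discovery_members = members;
                None
            }
            Err(err) => Some(err),
        }
    }
}

/// Pick the first non-empty node source: discovery, then etcd, then versions.
/// If all three are empty, the (empty) versions slice is returned.
fn pick_node_source<'a>(
    discovery: &'a [String],
    etcd: &'a [String],
    versions: &'a [String],
) -> &'a [String] {
    if !discovery.is_empty() {
        discovery
    } else if !etcd.is_empty() {
        etcd
    } else {
        versions
    }
}
```

With this shape, a failed refresh leaves the last known member list on screen, and the versions data is only consulted when both other sources come back empty.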

Let me know if you have any questions or any suggestions :)

@perfectra1n
Contributor Author

I ran into this issue with my Talos config set up like this:

context: mycluster
contexts:
    mycluster:
        endpoints:
            - 192.168.9.11
            - 192.168.9.12
            - 192.168.9.13
            - 192.168.9.21
            - 192.168.9.22
            - 192.168.9.23
        nodes:
            - 192.168.9.11
            - 192.168.9.12
            - 192.168.9.13
            - 192.168.9.21
            - 192.168.9.22
            - 192.168.9.23

With this setup, it would just randomly fail at startup.

- Replace unwrap() with unwrap_or_else to prevent panic
- Shuffle fallback nodes to distribute load
- Add unit tests for discovery member parsing
- Fix formatting

Small improvements to fix by: perfectra1n <jonfuller2012@gmail.com>
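
The first two items in the commit message above can be illustrated with a dependency-free sketch. This is not the committed code: `clock_seed` and `shuffle_nodes` are hypothetical helpers, and the xorshift generator is swapped in only to keep the example self-contained (a real project would more likely use a shuffle from a crate such as `rand`).

```rust
use std::time::{SystemTime, UNIX_EPOCH};

/// Derive a seed from the clock without panicking.
/// `duration_since` returns an error if the clock is before the Unix epoch;
/// `unwrap_or_else` turns that error into a usable value instead of a panic.
fn clock_seed() -> u64 {
    SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .map(|elapsed| elapsed.as_nanos() as u64)
        .unwrap_or_else(|err| err.duration().as_nanos() as u64)
}

/// Fisher-Yates shuffle driven by a small xorshift64 generator, so that
/// successive runs do not always hit the same fallback node first.
fn shuffle_nodes(nodes: &mut [String], seed: u64) {
    let mut state = seed | 1; // xorshift state must never be zero
    for i in (1..nodes.len()).rev() {
        state ^= state << 13;
        state ^= state >> 7;
        state ^= state << 17;
        let j = (state % (i as u64 + 1)) as usize;
        nodes.swap(i, j);
    }
}
```

Calling `shuffle_nodes(&mut fallback_ips, clock_seed())` before the direct per-node queries would spread first attempts across the control plane nodes.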
@Handfish force-pushed the fix/retry-and-remember-last-working-cp-node branch from 0be4b8e to 5df74ac on January 21, 2026 03:11
@Handfish merged commit e4158d3 into Handfish:main on Jan 21, 2026
12 checks passed