Skip to content

fix: relay reconnect loop, asyncio loop safety, and /ping phone_connected#3

Open
dzp5103 wants to merge 1 commit intomainfrom
fix/relay-bugs
Open

fix: relay reconnect loop, asyncio loop safety, and /ping phone_connected#3
dzp5103 wants to merge 1 commit intomainfrom
fix/relay-bugs

Conversation

@dzp5103
Copy link
Copy Markdown

@dzp5103 dzp5103 commented Apr 2, 2026

Summary

Fixes 4 bugs in the Android bridge relay system.

Bug #1 — RelayClient: MAX_RETRIES=5 hard cap removed

Old code set shouldReconnect=false after 5 failed attempts. User had to manually tap Connect after any network blip.

Fix: Removed retry counter. Loop runs forever while shouldReconnect=true. Backoff caps at 30s.

Bug #4 — RelayClient: race condition on isConnected

Fixed 3s sleep replaced with 200ms polling loop (up to 8s). Exits immediately when onOpen fires.

Bug #2 — android_relay.py: asyncio Future on wrong loop

asyncio.Lock/Event/Future created in calling thread. Fixed: created inside _run_loop() and using asyncio.get_running_loop().create_future().

Bug #3 — /ping missing phone_connected field

Kotlin /ping now returns phone_connected=true so _check_requirements() correctly detects live connection.

…ield

Bug #1 + #4 (RelayClient.kt):
- Remove MAX_RETRIES=5 hard cap — reconnect loops forever until user taps
  Disconnect (shouldReconnect=false). Phone network blips no longer require
  manual reconnect.
- Replace blind 3-second post-connect sleep with a polling wait (up to 8s)
  that exits immediately when onOpen fires. Eliminates the race where
  isConnected was checked before onOpen had a chance to set it.
- Exponential backoff: 1s → 2s → 4s → … capped at 30s.

Bug #2 (android_relay.py _handle_http):
- Replace state.loop.create_future() (called from another thread context)
  with asyncio.get_running_loop().create_future() which correctly binds
  the Future to the event loop the coroutine is already running in.
  Fixes 'Future attached to a different loop' errors on Python 3.10+.
- asyncio.Lock / asyncio.Event objects moved from __init__ (calling thread)
  into _run_loop() so they are created inside the correct event loop.

Bug #3 (RelayClient.kt /ping response):
- /ping now returns phone_connected=true so android_tool.py
  _check_requirements() correctly detects a live connection instead of
  always falling back to False.
Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request improves the connection reliability of the RelayClient by removing the hard retry cap, implementing a more robust connection timeout mechanism, and exposing a phone_connected status in the /ping endpoint. Additionally, it fixes critical asyncio concurrency issues in the Python relay tool by ensuring locks and futures are correctly bound to the active event loop. I have provided feedback regarding the exponential backoff logic, which currently resets on every failure, and suggested using monotonic time for timeout calculations to avoid issues with system clock adjustments.

Comment on lines 158 to +161
reconnectJob?.cancel()
reconnectJob = scope?.launch {
var backoff = 1000L
var retries = 0
while (shouldReconnect && !isConnected && retries < MAX_RETRIES) {
retries++
Log.i(TAG, "Reconnecting in ${backoff}ms... (attempt $retries/$MAX_RETRIES)")
notifyStatus(false, "Reconnecting in ${backoff / 1000}s... (attempt $retries/$MAX_RETRIES)")
var attempt = 0
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The exponential backoff logic is currently reset on every failure. This happens because scheduleReconnect() is called from the onFailure and onClosed callbacks, and it unconditionally cancels the existing reconnectJob before starting a new one. Since backoff and attempt are local variables within the coroutine, they are reset to their initial values (1000L and 0) every time a connection fails, resulting in a constant 1-second retry loop.

To fix this, you should check if the reconnection loop is already active before attempting to start a new one. This allows the existing loop to continue to its next iteration, correctly doubling the backoff and incrementing the attempt counter.

Suggested change
reconnectJob?.cancel()
reconnectJob = scope?.launch {
var backoff = 1000L
var retries = 0
while (shouldReconnect && !isConnected && retries < MAX_RETRIES) {
retries++
Log.i(TAG, "Reconnecting in ${backoff}ms... (attempt $retries/$MAX_RETRIES)")
notifyStatus(false, "Reconnecting in ${backoff / 1000}s... (attempt $retries/$MAX_RETRIES)")
var attempt = 0
if (reconnectJob?.isActive == true) return
reconnectJob = scope?.launch {
var backoff = 1000L
var attempt = 0

Comment on lines +182 to 185
val deadline = System.currentTimeMillis() + CONNECT_TIMEOUT_MS
while (!isConnected && System.currentTimeMillis() < deadline && shouldReconnect) {
delay(200)
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Using System.currentTimeMillis() for calculating deadlines is unreliable because it is not monotonic and can change if the system clock is adjusted (e.g., via NTP or manual settings). For Android applications, it is a best practice to use android.os.SystemClock.elapsedRealtime() for measuring durations and timeouts.

Suggested change
val deadline = System.currentTimeMillis() + CONNECT_TIMEOUT_MS
while (!isConnected && System.currentTimeMillis() < deadline && shouldReconnect) {
delay(200)
}
val deadline = android.os.SystemClock.elapsedRealtime() + CONNECT_TIMEOUT_MS
while (!isConnected && android.os.SystemClock.elapsedRealtime() < deadline && shouldReconnect) {
delay(200)
}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant