fix: relay reconnect loop, asyncio loop safety, and /ping phone_connected#3
fix: relay reconnect loop, asyncio loop safety, and /ping phone_connected#3
Conversation
…ield Bug #1 + #4 (RelayClient.kt): - Remove MAX_RETRIES=5 hard cap — reconnect loops forever until user taps Disconnect (shouldReconnect=false). Phone network blips no longer require manual reconnect. - Replace blind 3-second post-connect sleep with a polling wait (up to 8s) that exits immediately when onOpen fires. Eliminates the race where isConnected was checked before onOpen had a chance to set it. - Exponential backoff: 1s → 2s → 4s → … capped at 30s. Bug #2 (android_relay.py _handle_http): - Replace state.loop.create_future() (called from another thread context) with asyncio.get_running_loop().create_future() which correctly binds the Future to the event loop the coroutine is already running in. Fixes 'Future attached to a different loop' errors on Python 3.10+. - asyncio.Lock / asyncio.Event objects moved from __init__ (calling thread) into _run_loop() so they are created inside the correct event loop. Bug #3 (RelayClient.kt /ping response): - /ping now returns phone_connected=true so android_tool.py _check_requirements() correctly detects a live connection instead of always falling back to False.
There was a problem hiding this comment.
Code Review
This pull request improves the connection reliability of the RelayClient by removing the hard retry cap, implementing a more robust connection timeout mechanism, and exposing a phone_connected status in the /ping endpoint. Additionally, it fixes critical asyncio concurrency issues in the Python relay tool by ensuring locks and futures are correctly bound to the active event loop. I have provided feedback regarding the exponential backoff logic, which currently resets on every failure, and suggested using monotonic time for timeout calculations to avoid issues with system clock adjustments.
| reconnectJob?.cancel() | ||
| reconnectJob = scope?.launch { | ||
| var backoff = 1000L | ||
| var retries = 0 | ||
| while (shouldReconnect && !isConnected && retries < MAX_RETRIES) { | ||
| retries++ | ||
| Log.i(TAG, "Reconnecting in ${backoff}ms... (attempt $retries/$MAX_RETRIES)") | ||
| notifyStatus(false, "Reconnecting in ${backoff / 1000}s... (attempt $retries/$MAX_RETRIES)") | ||
| var attempt = 0 |
There was a problem hiding this comment.
The exponential backoff logic is currently reset on every failure. This happens because scheduleReconnect() is called from the onFailure and onClosed callbacks, and it unconditionally cancels the existing reconnectJob before starting a new one. Since backoff and attempt are local variables within the coroutine, they are reset to their initial values (1000L and 0) every time a connection fails, resulting in a constant 1-second retry loop.
To fix this, you should check if the reconnection loop is already active before attempting to start a new one. This allows the existing loop to continue to its next iteration, correctly doubling the backoff and incrementing the attempt counter.
| reconnectJob?.cancel() | |
| reconnectJob = scope?.launch { | |
| var backoff = 1000L | |
| var retries = 0 | |
| while (shouldReconnect && !isConnected && retries < MAX_RETRIES) { | |
| retries++ | |
| Log.i(TAG, "Reconnecting in ${backoff}ms... (attempt $retries/$MAX_RETRIES)") | |
| notifyStatus(false, "Reconnecting in ${backoff / 1000}s... (attempt $retries/$MAX_RETRIES)") | |
| var attempt = 0 | |
| if (reconnectJob?.isActive == true) return | |
| reconnectJob = scope?.launch { | |
| var backoff = 1000L | |
| var attempt = 0 |
| val deadline = System.currentTimeMillis() + CONNECT_TIMEOUT_MS | ||
| while (!isConnected && System.currentTimeMillis() < deadline && shouldReconnect) { | ||
| delay(200) | ||
| } |
There was a problem hiding this comment.
Using System.currentTimeMillis() for calculating deadlines is unreliable because it is not monotonic and can change if the system clock is adjusted (e.g., via NTP or manual settings). For Android applications, it is a best practice to use android.os.SystemClock.elapsedRealtime() for measuring durations and timeouts.
| val deadline = System.currentTimeMillis() + CONNECT_TIMEOUT_MS | |
| while (!isConnected && System.currentTimeMillis() < deadline && shouldReconnect) { | |
| delay(200) | |
| } | |
| val deadline = android.os.SystemClock.elapsedRealtime() + CONNECT_TIMEOUT_MS | |
| while (!isConnected && android.os.SystemClock.elapsedRealtime() < deadline && shouldReconnect) { | |
| delay(200) | |
| } |
Summary
Fixes 4 bugs in the Android bridge relay system.
Bug #1 — RelayClient: MAX_RETRIES=5 hard cap removed
Old code set
shouldReconnect=falseafter 5 failed attempts. User had to manually tap Connect after any network blip.Fix: Removed retry counter. Loop runs forever while
shouldReconnect=true. Backoff caps at 30s.Bug #4 — RelayClient: race condition on isConnected
Fixed 3s sleep replaced with 200ms polling loop (up to 8s). Exits immediately when
onOpenfires.Bug #2 — android_relay.py: asyncio Future on wrong loop
asyncio.Lock/Event/Futurecreated in calling thread. Fixed: created inside_run_loop()and usingasyncio.get_running_loop().create_future().Bug #3 — /ping missing phone_connected field
Kotlin
/pingnow returnsphone_connected=trueso_check_requirements()correctly detects live connection.