Fix LwIP TCP stability: PCB leak, keepalive, and critical sections#4
Fix LwIP TCP stability: PCB leak, keepalive, and critical sections#4flottokarotto merged 4 commits intomasterfrom
Conversation
HAL_ETH_RxAllocateCallback and HAL_ETH_RxLinkCallback are called from ETH IRQ context and access LwIP memory pools (LWIP_MEMPOOL_ALLOC) and pbuf reference counts concurrently with the mainloop (ethernetif_input, pbuf_free_custom). Without synchronization this causes heap corruption. - Enable SYS_LIGHTWEIGHT_PROT=1 so LwIP wraps all pool/heap operations with sys_arch_protect/sys_arch_unprotect critical sections - Implement sys_arch_protect/unprotect using PRIMASK (cpsid i / msr) so ISRs are disabled for the duration of each LwIP pool operation - Add sys_prot_t typedef (uint32_t) to cc.h as required by LwIP - Mark RxAllocStatus volatile: written from ISR, read from mainloop
|
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
There was a problem hiding this comment.
Pull request overview
This PR fixes a critical race condition in the LwIP networking stack implementation for the NUCLEO-H753ZI board. The issue occurs when Ethernet ISR callbacks (HAL_ETH_RxAllocateCallback, HAL_ETH_RxLinkCallback) running in interrupt context access LwIP memory pools concurrently with mainloop code, potentially causing heap corruption.
Changes:
- Enabled LwIP's lightweight protection mechanism (
SYS_LIGHTWEIGHT_PROT=1) to guard memory pool operations against concurrent ISR/mainloop access - Implemented PRIMASK-based critical sections that save and restore interrupt state, properly handling nested critical sections
- Added
volatilequalifier toRxAllocStatusto prevent compiler from caching its value across ISR writes
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
lwip_port/arch/lwipopts.h |
Enable SYS_LIGHTWEIGHT_PROT (0 → 1) with clarifying comment about ISR/mainloop protection |
lwip_port/sys_arch.c |
Implement sys_arch_protect/sys_arch_unprotect using PRIMASK register manipulation with inline assembly |
lwip_port/arch/cc.h |
Add sys_prot_t typedef (uint32_t) required by LwIP for storing protection state |
lwip_port/ethernetif.c |
Mark RxAllocStatus as volatile to ensure visibility across ISR and mainloop contexts |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Two bugs caused "status error 6" (Invalid_Request) after every 2nd reset: 1. KSEM_Client.Connected was not synced with the underlying TCP state. When TCP drops (e.g. KSEM sends RST/FIN for a stale connection left over from the previous reset), LwIP sets Current_State=Disconnected but KSEM_Client.Connected stayed True. Read_Power then forwarded the call to Modbus_TCP_Read_Registers which checked Is_Connected=False and returned Invalid_Request. Fixed by checking TCP_Is_Connected in Read_Power and clearing Connected if the TCP layer is down. 2. Only 1 connect attempt was made at startup. After a reset without TCP FIN, KSEM holds the old ESTABLISHED connection; its reply to the new SYN causes LwIP to abort the first attempt. After that exchange KSEM clears the stale connection and a second attempt would succeed. Increased retries from 1 to 5. Also added reconnect logic in the main loop so a connection drop during normal operation (e.g. KSEM reboot) is recovered automatically after 5 seconds instead of running forever with "KSEM read failed".
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| #define LWIP_NETIF_API 0 | ||
|
|
||
| #define SYS_LIGHTWEIGHT_PROT 0 /* No protection needed (no OS) */ | ||
| #define SYS_LIGHTWEIGHT_PROT 1 /* Guard ISR vs. mainloop (HAL ETH callbacks run in IRQ context) */ |
There was a problem hiding this comment.
This comment states HAL ETH callbacks run in IRQ context, but this port currently uses HAL_ETH_Start() + polling via HAL_ETH_ReadData() (called from the main loop). Please reword to avoid documenting an incorrect concurrency model (e.g., note it’s defensive for potential future interrupt-mode use).
| #define SYS_LIGHTWEIGHT_PROT 1 /* Guard ISR vs. mainloop (HAL ETH callbacks run in IRQ context) */ | |
| #define SYS_LIGHTWEIGHT_PROT 1 /* Enable lightweight protection; defensive for possible ISR vs. main loop use (e.g. future IRQ-based ETH callbacks) */ |
| * Provides: | ||
| * - sys_arch_protect/unprotect -- PRIMASK-based critical sections | ||
| * (SYS_LIGHTWEIGHT_PROT=1) to guard LwIP memory pools against | ||
| * concurrent access from ISR (HAL ETH callbacks) and mainloop. |
There was a problem hiding this comment.
The header comment claims concurrency between ISR (HAL ETH callbacks) and mainloop, but ethernetif.c currently calls HAL_ETH_ReadData from the polling path (no HAL_ETH_IRQHandler usage). To avoid misleading future maintainers, consider clarifying that sys_arch_protect/unprotect are added as defensive hardening for possible future IRQ-driven ETH.
| * concurrent access from ISR (HAL ETH callbacks) and mainloop. | |
| * possible concurrent access from ISR (HAL ETH callbacks) and | |
| * mainloop. In the current polling-based ethernetif.c they serve | |
| * primarily as defensive hardening for any future IRQ-driven ETH. |
| -- Sync Connected flag with underlying TCP state. | ||
| -- If TCP dropped (e.g. remote RST after a reset), reflect that here | ||
| -- so callers can detect the disconnect and trigger a reconnect. | ||
| if Connected and then not HAL_Stubs.TCP_Is_Connected then | ||
| Connected := False; | ||
| end if; |
There was a problem hiding this comment.
The reconnect logic in main.adb relies on KSEM_Client.Is_Connected, but the new TCP-state sync only runs inside Read_Power. If the TCP connection drops and Read_Power isn’t called (or is called much later), Is_Connected can still return True and block reconnect. Consider moving/splitting this sync into Is_Connected (or a shared helper) so the public connection status always reflects HAL_Stubs.TCP_Is_Connected.
…ions The TCP client would hang after a few minutes due to: - Missing tcp_close when remote closes (P=null in recv callback), leaving PCBs stuck in CLOSE_WAIT and exhausting the 2-PCB pool - No tcp_poll callback for lwIP periodic connection maintenance - No TCP keepalive to detect silently dead connections Fix: properly close PCB on remote disconnect, register poll callback, enable LWIP_TCP_KEEPALIVE (10s idle, 5s interval, 3 probes), and centralize PCB cleanup to prevent resource leaks in all code paths.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.
Comments suppressed due to low confidence (1)
examples/embedded/nucleo_h753_ksem_test/src/main.adb:151
- After 5 failed connect attempts at startup the code enters an infinite loop (polling Ethernet only). This prevents the later “automatic reconnect in main loop” logic from ever running if the KSEM is temporarily unavailable at boot. If the intent is to keep retrying until the device comes up, consider falling through into the main loop with
Connected=False(or keeping a non-halting retry loop) instead of halting permanently.
if Retry >= 5 then
UART_Console.Put ("ERROR: KSEM connect failed after 5 attempts (");
UART_Console.Put_Int (Integer_32 (Status'Pos (Result)));
UART_Console.Put_Line (")");
declare
function Get_TX_Count return Unsigned_32
with Import, Convention => C,
External_Name => "ethernetif_get_tx_count";
function Get_RX_Count return Unsigned_32
with Import, Convention => C,
External_Name => "ethernetif_get_rx_count";
begin
UART_Console.Put ("TX: ");
UART_Console.Put_Int (Integer_32 (Get_TX_Count));
UART_Console.Put (" RX: ");
UART_Console.Put_Int (Integer_32 (Get_RX_Count));
UART_Console.Put_Line ("");
end;
HAL_Stubs.Set_LED (HAL_Stubs.LED_Red, True);
loop
HAL_Stubs.Ethernet_Poll;
end loop;
end if;
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| -- Get data length | ||
| Len := Pbuf_Get_Tot_Len (P); | ||
| if Len > Unsigned_16 (Max_RX_Size) then | ||
| Len := Unsigned_16 (Max_RX_Size); | ||
| end if; | ||
|
|
||
| -- Copy data to buffer | ||
| Copied := Pbuf_Copy_Partial (P, RX_Buffer'Address, Len, 0); | ||
| RX_Len := Natural (Copied); | ||
| RX_Ready := True; | ||
|
|
||
| -- Acknowledge received data | ||
| TCP_Recved (PCB, Len); | ||
| TCP_Recved (PCB, Pbuf_Get_Tot_Len (P)); | ||
|
|
There was a problem hiding this comment.
In Recv_Callback, when an incoming pbuf is larger than Max_RX_Size you cap Len for the copy, but still call TCP_Recved with the full Pbuf_Get_Tot_Len (P). This acknowledges bytes that were not actually processed/copied, which can silently truncate frames and desynchronize the Modbus/TCP stream. Consider treating oversize frames as a protocol error (e.g., close/abort the connection) or only acknowledging what you actually consume (and adjust buffer/logic accordingly).
| Current_State := Disconnected; | ||
| Result := Buffer_Too_Small; |
There was a problem hiding this comment.
When TCP_New returns null, the code reports Result := Buffer_Too_Small. That status is used elsewhere to mean a caller-provided buffer or TCP send buffer is too small, but here it indicates PCB allocation/resource exhaustion (e.g., MEMP_NUM_TCP_PCB). Please return a status that better reflects the failure mode (or introduce a dedicated status) so callers can distinguish allocation failures from buffer sizing issues.
| Current_State := Disconnected; | |
| Result := Buffer_Too_Small; | |
| -- PCB allocation / resource exhaustion (e.g., MEMP_NUM_TCP_PCB limit) | |
| Current_State := Disconnected; | |
| Result := Invalid_Request; |
|
|
||
| if not Connected then | ||
| Result := Invalid_Request; | ||
| Result := Timeout; |
There was a problem hiding this comment.
Read_Power now returns Timeout when Connected is false. This is inconsistent with other call sites in this example (e.g., HAL_Stubs.Modbus_TCP_Read_* returning Invalid_Request when the TCP layer isn’t connected) and makes it harder to differentiate “not connected / caller error” from a real Modbus/TCP timeout. Consider keeping Invalid_Request here (or adding a dedicated Disconnected status) and letting the reconnect logic rely on Is_Connected rather than overloading Timeout.
| Result := Timeout; | |
| Result := Invalid_Request; |
- Fix misleading ISR comments in lwipopts.h and sys_arch.c: clarify that critical sections are defensive for future IRQ-based ETH - Move TCP state sync from Read_Power into Is_Connected so reconnect logic always sees actual connection status - Abort connection on oversize frames instead of silently truncating and desynchronizing the Modbus/TCP stream - Return Invalid_Request (not Buffer_Too_Small) on PCB pool exhaustion - Use Invalid_Request consistently when not connected
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 7 out of 7 changed files in this pull request and generated no new comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Summary
Fix TCP PCB leak causing hang after minutes (
tcp_client.adb):When the KSEM closed the connection (recv callback with P=null),
tcp_closewas never called — the PCB stayed in CLOSE_WAIT forever.With
MEMP_NUM_TCP_PCB=2, after 1-2 reconnects all PCBs wereexhausted and
tcp_newreturned null → program hung.Fix: properly close PCB on remote disconnect, centralize all cleanup
into
Cleanup_PCB(deregister callbacks, then close/abort).Enable TCP keepalive (
lwipopts.h):LWIP_TCP_KEEPALIVE=1with 10s idle / 5s interval / 3 probes.lwIP now detects silently dead connections (cable unplug, KSEM crash
without RST) and triggers
Err_Callbackautomatically.Register tcp_poll callback (
tcp_client.adb):Allows lwIP to do periodic connection maintenance (every 2s).
Fix stale TCP connection after reset (
ksem_client.adb,main.adb):Sync
KSEM_Client.Connectedwith TCP layer state; add 5 connectretries at startup and automatic reconnect in main loop.
Defensive hardening: LwIP critical sections (
lwipopts.h,cc.h,sys_arch.c):SYS_LIGHTWEIGHT_PROT=1with PRIMASK-based critical sections.volatile RxAllocStatusin ETH driver.Changes
src/tcp_client.adbtcp_closeon remote disconnect,Cleanup_PCBhelper,tcp_pollcallback, null-checks on PCBlwip_port/arch/lwipopts.hLWIP_TCP_KEEPALIVE=1(10s/5s/3),SYS_LIGHTWEIGHT_PROT=1src/ksem_client.adbConnectedwith TCP layer inRead_Powersrc/main.adblwip_port/arch/cc.hsys_prot_ttypedeflwip_port/sys_arch.csys_arch_protect/unprotectlwip_port/ethernetif.cvolatile RxAllocStatusTest plan
build-embeddedjob passes (arm-elf build)🤖 Generated with Claude Code