Skip to content

Fix LwIP TCP stability: PCB leak, keepalive, and critical sections#4

Merged
flottokarotto merged 4 commits intomasterfrom
fix/lwip-race-condition
Feb 18, 2026
Merged

Fix LwIP TCP stability: PCB leak, keepalive, and critical sections#4
flottokarotto merged 4 commits intomasterfrom
fix/lwip-race-condition

Conversation

@flottokarotto
Copy link
Owner

@flottokarotto flottokarotto commented Feb 18, 2026

Summary

  • Fix TCP PCB leak causing hang after minutes (tcp_client.adb):
    When the KSEM closed the connection (recv callback with P=null),
    tcp_close was never called — the PCB stayed in CLOSE_WAIT forever.
    With MEMP_NUM_TCP_PCB=2, after 1-2 reconnects all PCBs were
    exhausted and tcp_new returned null → program hung.
    Fix: properly close PCB on remote disconnect, centralize all cleanup
    into Cleanup_PCB (deregister callbacks, then close/abort).

  • Enable TCP keepalive (lwipopts.h):
    LWIP_TCP_KEEPALIVE=1 with 10s idle / 5s interval / 3 probes.
    lwIP now detects silently dead connections (cable unplug, KSEM crash
    without RST) and triggers Err_Callback automatically.

  • Register tcp_poll callback (tcp_client.adb):
    Allows lwIP to do periodic connection maintenance (every 2s).

  • Fix stale TCP connection after reset (ksem_client.adb, main.adb):
    Sync KSEM_Client.Connected with TCP layer state; add 5 connect
    retries at startup and automatic reconnect in main loop.

  • Defensive hardening: LwIP critical sections (lwipopts.h, cc.h, sys_arch.c):
    SYS_LIGHTWEIGHT_PROT=1 with PRIMASK-based critical sections.
    volatile RxAllocStatus in ETH driver.

Changes

File Change
src/tcp_client.adb Fix PCB leak: tcp_close on remote disconnect, Cleanup_PCB helper, tcp_poll callback, null-checks on PCB
lwip_port/arch/lwipopts.h LWIP_TCP_KEEPALIVE=1 (10s/5s/3), SYS_LIGHTWEIGHT_PROT=1
src/ksem_client.adb Sync Connected with TCP layer in Read_Power
src/main.adb 5 connect retries at startup; automatic reconnect in main loop
lwip_port/arch/cc.h Add sys_prot_t typedef
lwip_port/sys_arch.c PRIMASK-based sys_arch_protect/unprotect
lwip_port/ethernetif.c volatile RxAllocStatus

Test plan

  • Flash to NUCLEO-H753ZI, run for 10+ minutes — no hang
  • Reset multiple times — connect succeeds on every reset
  • Pull Ethernet cable while running — reconnect within 10-15 s after re-plug
  • CI: build-embedded job passes (arm-elf build)

🤖 Generated with Claude Code

HAL_ETH_RxAllocateCallback and HAL_ETH_RxLinkCallback are called from
ETH IRQ context and access LwIP memory pools (LWIP_MEMPOOL_ALLOC) and
pbuf reference counts concurrently with the mainloop (ethernetif_input,
pbuf_free_custom). Without synchronization this causes heap corruption.

- Enable SYS_LIGHTWEIGHT_PROT=1 so LwIP wraps all pool/heap operations
  with sys_arch_protect/sys_arch_unprotect critical sections
- Implement sys_arch_protect/unprotect using PRIMASK (cpsid i / msr)
  so ISRs are disabled for the duration of each LwIP pool operation
- Add sys_prot_t typedef (uint32_t) to cc.h as required by LwIP
- Mark RxAllocStatus volatile: written from ISR, read from mainloop
Copilot AI review requested due to automatic review settings February 18, 2026 19:10
@codecov-commenter
Copy link

⚠️ Please install the 'codecov app svg image' to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes a critical race condition in the LwIP networking stack implementation for the NUCLEO-H753ZI board. The issue occurs when Ethernet ISR callbacks (HAL_ETH_RxAllocateCallback, HAL_ETH_RxLinkCallback) running in interrupt context access LwIP memory pools concurrently with mainloop code, potentially causing heap corruption.

Changes:

  • Enabled LwIP's lightweight protection mechanism (SYS_LIGHTWEIGHT_PROT=1) to guard memory pool operations against concurrent ISR/mainloop access
  • Implemented PRIMASK-based critical sections that save and restore interrupt state, properly handling nested critical sections
  • Added volatile qualifier to RxAllocStatus to prevent compiler from caching its value across ISR writes

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.

File Description
lwip_port/arch/lwipopts.h Enable SYS_LIGHTWEIGHT_PROT (0 → 1) with clarifying comment about ISR/mainloop protection
lwip_port/sys_arch.c Implement sys_arch_protect/sys_arch_unprotect using PRIMASK register manipulation with inline assembly
lwip_port/arch/cc.h Add sys_prot_t typedef (uint32_t) required by LwIP for storing protection state
lwip_port/ethernetif.c Mark RxAllocStatus as volatile to ensure visibility across ISR and mainloop contexts

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@flottokarotto flottokarotto changed the title Fix LwIP race condition: ISR vs. mainloop memory pool access Defensive hardening: LwIP critical sections and volatile for ETH driver Feb 18, 2026
Two bugs caused "status error 6" (Invalid_Request) after every 2nd reset:

1. KSEM_Client.Connected was not synced with the underlying TCP state.
   When TCP drops (e.g. KSEM sends RST/FIN for a stale connection left
   over from the previous reset), LwIP sets Current_State=Disconnected
   but KSEM_Client.Connected stayed True.  Read_Power then forwarded the
   call to Modbus_TCP_Read_Registers which checked Is_Connected=False
   and returned Invalid_Request.  Fixed by checking TCP_Is_Connected in
   Read_Power and clearing Connected if the TCP layer is down.

2. Only 1 connect attempt was made at startup.  After a reset without
   TCP FIN, KSEM holds the old ESTABLISHED connection; its reply to the
   new SYN causes LwIP to abort the first attempt.  After that exchange
   KSEM clears the stale connection and a second attempt would succeed.
   Increased retries from 1 to 5.

Also added reconnect logic in the main loop so a connection drop during
normal operation (e.g. KSEM reboot) is recovered automatically after
5 seconds instead of running forever with "KSEM read failed".
@flottokarotto flottokarotto changed the title Defensive hardening: LwIP critical sections and volatile for ETH driver Fix KSEM TCP reconnect and LwIP critical sections Feb 18, 2026
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

#define LWIP_NETIF_API 0

#define SYS_LIGHTWEIGHT_PROT 0 /* No protection needed (no OS) */
#define SYS_LIGHTWEIGHT_PROT 1 /* Guard ISR vs. mainloop (HAL ETH callbacks run in IRQ context) */
Copy link

Copilot AI Feb 18, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment states HAL ETH callbacks run in IRQ context, but this port currently uses HAL_ETH_Start() + polling via HAL_ETH_ReadData() (called from the main loop). Please reword to avoid documenting an incorrect concurrency model (e.g., note it’s defensive for potential future interrupt-mode use).

Suggested change
#define SYS_LIGHTWEIGHT_PROT 1 /* Guard ISR vs. mainloop (HAL ETH callbacks run in IRQ context) */
#define SYS_LIGHTWEIGHT_PROT 1 /* Enable lightweight protection; defensive for possible ISR vs. main loop use (e.g. future IRQ-based ETH callbacks) */

Copilot uses AI. Check for mistakes.
* Provides:
* - sys_arch_protect/unprotect -- PRIMASK-based critical sections
* (SYS_LIGHTWEIGHT_PROT=1) to guard LwIP memory pools against
* concurrent access from ISR (HAL ETH callbacks) and mainloop.
Copy link

Copilot AI Feb 18, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The header comment claims concurrency between ISR (HAL ETH callbacks) and mainloop, but ethernetif.c currently calls HAL_ETH_ReadData from the polling path (no HAL_ETH_IRQHandler usage). To avoid misleading future maintainers, consider clarifying that sys_arch_protect/unprotect are added as defensive hardening for possible future IRQ-driven ETH.

Suggested change
* concurrent access from ISR (HAL ETH callbacks) and mainloop.
* possible concurrent access from ISR (HAL ETH callbacks) and
* mainloop. In the current polling-based ethernetif.c they serve
* primarily as defensive hardening for any future IRQ-driven ETH.

Copilot uses AI. Check for mistakes.
Comment on lines 99 to 104
-- Sync Connected flag with underlying TCP state.
-- If TCP dropped (e.g. remote RST after a reset), reflect that here
-- so callers can detect the disconnect and trigger a reconnect.
if Connected and then not HAL_Stubs.TCP_Is_Connected then
Connected := False;
end if;
Copy link

Copilot AI Feb 18, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reconnect logic in main.adb relies on KSEM_Client.Is_Connected, but the new TCP-state sync only runs inside Read_Power. If the TCP connection drops and Read_Power isn’t called (or is called much later), Is_Connected can still return True and block reconnect. Consider moving/splitting this sync into Is_Connected (or a shared helper) so the public connection status always reflects HAL_Stubs.TCP_Is_Connected.

Copilot uses AI. Check for mistakes.
…ions

The TCP client would hang after a few minutes due to:
- Missing tcp_close when remote closes (P=null in recv callback),
  leaving PCBs stuck in CLOSE_WAIT and exhausting the 2-PCB pool
- No tcp_poll callback for lwIP periodic connection maintenance
- No TCP keepalive to detect silently dead connections

Fix: properly close PCB on remote disconnect, register poll callback,
enable LWIP_TCP_KEEPALIVE (10s idle, 5s interval, 3 probes), and
centralize PCB cleanup to prevent resource leaks in all code paths.
@flottokarotto flottokarotto changed the title Fix KSEM TCP reconnect and LwIP critical sections Fix LwIP TCP stability: PCB leak, keepalive, and critical sections Feb 18, 2026
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.

Comments suppressed due to low confidence (1)

examples/embedded/nucleo_h753_ksem_test/src/main.adb:151

  • After 5 failed connect attempts at startup the code enters an infinite loop (polling Ethernet only). This prevents the later “automatic reconnect in main loop” logic from ever running if the KSEM is temporarily unavailable at boot. If the intent is to keep retrying until the device comes up, consider falling through into the main loop with Connected=False (or keeping a non-halting retry loop) instead of halting permanently.
         if Retry >= 5 then
            UART_Console.Put ("ERROR: KSEM connect failed after 5 attempts (");
            UART_Console.Put_Int (Integer_32 (Status'Pos (Result)));
            UART_Console.Put_Line (")");
            declare
               function Get_TX_Count return Unsigned_32
                 with Import, Convention => C,
                      External_Name => "ethernetif_get_tx_count";
               function Get_RX_Count return Unsigned_32
                 with Import, Convention => C,
                      External_Name => "ethernetif_get_rx_count";
            begin
               UART_Console.Put ("TX: ");
               UART_Console.Put_Int (Integer_32 (Get_TX_Count));
               UART_Console.Put ("  RX: ");
               UART_Console.Put_Int (Integer_32 (Get_RX_Count));
               UART_Console.Put_Line ("");
            end;
            HAL_Stubs.Set_LED (HAL_Stubs.LED_Red, True);
            loop
               HAL_Stubs.Ethernet_Poll;
            end loop;
         end if;

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines 92 to 105
-- Get data length
Len := Pbuf_Get_Tot_Len (P);
if Len > Unsigned_16 (Max_RX_Size) then
Len := Unsigned_16 (Max_RX_Size);
end if;

-- Copy data to buffer
Copied := Pbuf_Copy_Partial (P, RX_Buffer'Address, Len, 0);
RX_Len := Natural (Copied);
RX_Ready := True;

-- Acknowledge received data
TCP_Recved (PCB, Len);
TCP_Recved (PCB, Pbuf_Get_Tot_Len (P));

Copy link

Copilot AI Feb 18, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In Recv_Callback, when an incoming pbuf is larger than Max_RX_Size you cap Len for the copy, but still call TCP_Recved with the full Pbuf_Get_Tot_Len (P). This acknowledges bytes that were not actually processed/copied, which can silently truncate frames and desynchronize the Modbus/TCP stream. Consider treating oversize frames as a protocol error (e.g., close/abort the connection) or only acknowledging what you actually consume (and adjust buffer/logic accordingly).

Copilot uses AI. Check for mistakes.
Comment on lines 226 to 227
Current_State := Disconnected;
Result := Buffer_Too_Small;
Copy link

Copilot AI Feb 18, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When TCP_New returns null, the code reports Result := Buffer_Too_Small. That status is used elsewhere to mean a caller-provided buffer or TCP send buffer is too small, but here it indicates PCB allocation/resource exhaustion (e.g., MEMP_NUM_TCP_PCB). Please return a status that better reflects the failure mode (or introduce a dedicated status) so callers can distinguish allocation failures from buffer sizing issues.

Suggested change
Current_State := Disconnected;
Result := Buffer_Too_Small;
-- PCB allocation / resource exhaustion (e.g., MEMP_NUM_TCP_PCB limit)
Current_State := Disconnected;
Result := Invalid_Request;

Copilot uses AI. Check for mistakes.

if not Connected then
Result := Invalid_Request;
Result := Timeout;
Copy link

Copilot AI Feb 18, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Read_Power now returns Timeout when Connected is false. This is inconsistent with other call sites in this example (e.g., HAL_Stubs.Modbus_TCP_Read_* returning Invalid_Request when the TCP layer isn’t connected) and makes it harder to differentiate “not connected / caller error” from a real Modbus/TCP timeout. Consider keeping Invalid_Request here (or adding a dedicated Disconnected status) and letting the reconnect logic rely on Is_Connected rather than overloading Timeout.

Suggested change
Result := Timeout;
Result := Invalid_Request;

Copilot uses AI. Check for mistakes.
- Fix misleading ISR comments in lwipopts.h and sys_arch.c: clarify
  that critical sections are defensive for future IRQ-based ETH
- Move TCP state sync from Read_Power into Is_Connected so reconnect
  logic always sees actual connection status
- Abort connection on oversize frames instead of silently truncating
  and desynchronizing the Modbus/TCP stream
- Return Invalid_Request (not Buffer_Too_Small) on PCB pool exhaustion
- Use Invalid_Request consistently when not connected
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated no new comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@flottokarotto flottokarotto merged commit 8f03519 into master Feb 18, 2026
10 checks passed
@flottokarotto flottokarotto deleted the fix/lwip-race-condition branch February 18, 2026 21:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants