Fix LwIP TCP stability: PCB leak, keepalive, and critical sections by flottokarotto · Pull Request #4 · flottokarotto/AdaModbus

flottokarotto · 2026-02-18T19:10:12Z

Summary

Fix TCP PCB leak causing hang after minutes (tcp_client.adb):
When the KSEM closed the connection (recv callback with P=null),
tcp_close was never called — the PCB stayed in CLOSE_WAIT forever.
With MEMP_NUM_TCP_PCB=2, after 1-2 reconnects all PCBs were
exhausted and tcp_new returned null → program hung.
Fix: properly close PCB on remote disconnect, centralize all cleanup
into Cleanup_PCB (deregister callbacks, then close/abort).
Enable TCP keepalive (lwipopts.h):
LWIP_TCP_KEEPALIVE=1 with 10s idle / 5s interval / 3 probes.
lwIP now detects silently dead connections (cable unplug, KSEM crash
without RST) and triggers Err_Callback automatically.
Register tcp_poll callback (tcp_client.adb):
Allows lwIP to do periodic connection maintenance (every 2s).
Fix stale TCP connection after reset (ksem_client.adb, main.adb):
Sync KSEM_Client.Connected with TCP layer state; add 5 connect
retries at startup and automatic reconnect in main loop.
Defensive hardening: LwIP critical sections (lwipopts.h, cc.h, sys_arch.c):
SYS_LIGHTWEIGHT_PROT=1 with PRIMASK-based critical sections.
volatile RxAllocStatus in ETH driver.

Changes

File	Change
`src/tcp_client.adb`	Fix PCB leak: `tcp_close` on remote disconnect, `Cleanup_PCB` helper, `tcp_poll` callback, null-checks on PCB
`lwip_port/arch/lwipopts.h`	`LWIP_TCP_KEEPALIVE=1` (10s/5s/3), `SYS_LIGHTWEIGHT_PROT=1`
`src/ksem_client.adb`	Sync `Connected` with TCP layer in `Read_Power`
`src/main.adb`	5 connect retries at startup; automatic reconnect in main loop
`lwip_port/arch/cc.h`	Add `sys_prot_t` typedef
`lwip_port/sys_arch.c`	PRIMASK-based `sys_arch_protect/unprotect`
`lwip_port/ethernetif.c`	`volatile RxAllocStatus`

Test plan

Flash to NUCLEO-H753ZI, run for 10+ minutes — no hang
Reset multiple times — connect succeeds on every reset
Pull Ethernet cable while running — reconnect within 10-15 s after re-plug
CI: build-embedded job passes (arm-elf build)

🤖 Generated with Claude Code

HAL_ETH_RxAllocateCallback and HAL_ETH_RxLinkCallback are called from ETH IRQ context and access LwIP memory pools (LWIP_MEMPOOL_ALLOC) and pbuf reference counts concurrently with the mainloop (ethernetif_input, pbuf_free_custom). Without synchronization this causes heap corruption. - Enable SYS_LIGHTWEIGHT_PROT=1 so LwIP wraps all pool/heap operations with sys_arch_protect/sys_arch_unprotect critical sections - Implement sys_arch_protect/unprotect using PRIMASK (cpsid i / msr) so ISRs are disabled for the duration of each LwIP pool operation - Add sys_prot_t typedef (uint32_t) to cc.h as required by LwIP - Mark RxAllocStatus volatile: written from ISR, read from mainloop

codecov-commenter · 2026-02-18T19:11:58Z

⚠️ Please install the to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

Copilot

Pull request overview

This PR fixes a critical race condition in the LwIP networking stack implementation for the NUCLEO-H753ZI board. The issue occurs when Ethernet ISR callbacks (HAL_ETH_RxAllocateCallback, HAL_ETH_RxLinkCallback) running in interrupt context access LwIP memory pools concurrently with mainloop code, potentially causing heap corruption.

Changes:

Enabled LwIP's lightweight protection mechanism (SYS_LIGHTWEIGHT_PROT=1) to guard memory pool operations against concurrent ISR/mainloop access
Implemented PRIMASK-based critical sections that save and restore interrupt state, properly handling nested critical sections
Added volatile qualifier to RxAllocStatus to prevent compiler from caching its value across ISR writes

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.

File	Description
`lwip_port/arch/lwipopts.h`	Enable `SYS_LIGHTWEIGHT_PROT` (0 → 1) with clarifying comment about ISR/mainloop protection
`lwip_port/sys_arch.c`	Implement `sys_arch_protect`/`sys_arch_unprotect` using PRIMASK register manipulation with inline assembly
`lwip_port/arch/cc.h`	Add `sys_prot_t` typedef (`uint32_t`) required by LwIP for storing protection state
`lwip_port/ethernetif.c`	Mark `RxAllocStatus` as `volatile` to ensure visibility across ISR and mainloop contexts

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Two bugs caused "status error 6" (Invalid_Request) after every 2nd reset: 1. KSEM_Client.Connected was not synced with the underlying TCP state. When TCP drops (e.g. KSEM sends RST/FIN for a stale connection left over from the previous reset), LwIP sets Current_State=Disconnected but KSEM_Client.Connected stayed True. Read_Power then forwarded the call to Modbus_TCP_Read_Registers which checked Is_Connected=False and returned Invalid_Request. Fixed by checking TCP_Is_Connected in Read_Power and clearing Connected if the TCP layer is down. 2. Only 1 connect attempt was made at startup. After a reset without TCP FIN, KSEM holds the old ESTABLISHED connection; its reply to the new SYN causes LwIP to abort the first attempt. After that exchange KSEM clears the stale connection and a second attempt would succeed. Increased retries from 1 to 5. Also added reconnect logic in the main loop so a connection drop during normal operation (e.g. KSEM reboot) is recovered automatically after 5 seconds instead of running forever with "KSEM read failed".

Copilot

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-02-18T19:55:07Z

examples/embedded/nucleo_h753_ksem_test/lwip_port/arch/lwipopts.h

 #define LWIP_NETIF_API                  0

-#define SYS_LIGHTWEIGHT_PROT            0       /* No protection needed (no OS) */
+#define SYS_LIGHTWEIGHT_PROT            1       /* Guard ISR vs. mainloop (HAL ETH callbacks run in IRQ context) */


This comment states HAL ETH callbacks run in IRQ context, but this port currently uses HAL_ETH_Start() + polling via HAL_ETH_ReadData() (called from the main loop). Please reword to avoid documenting an incorrect concurrency model (e.g., note it’s defensive for potential future interrupt-mode use).

Suggested change

#define SYS_LIGHTWEIGHT_PROT 1 /* Guard ISR vs. mainloop (HAL ETH callbacks run in IRQ context) */

#define SYS_LIGHTWEIGHT_PROT 1 /* Enable lightweight protection; defensive for possible ISR vs. main loop use (e.g. future IRQ-based ETH callbacks) */

Copilot · 2026-02-18T19:55:08Z

examples/embedded/nucleo_h753_ksem_test/lwip_port/sys_arch.c

 * Provides:
+ *   - sys_arch_protect/unprotect -- PRIMASK-based critical sections
+ *     (SYS_LIGHTWEIGHT_PROT=1) to guard LwIP memory pools against
+ *     concurrent access from ISR (HAL ETH callbacks) and mainloop.


The header comment claims concurrency between ISR (HAL ETH callbacks) and mainloop, but ethernetif.c currently calls HAL_ETH_ReadData from the polling path (no HAL_ETH_IRQHandler usage). To avoid misleading future maintainers, consider clarifying that sys_arch_protect/unprotect are added as defensive hardening for possible future IRQ-driven ETH.

Suggested change

* concurrent access from ISR (HAL ETH callbacks) and mainloop.

* possible concurrent access from ISR (HAL ETH callbacks) and

* mainloop. In the current polling-based ethernetif.c they serve

* primarily as defensive hardening for any future IRQ-driven ETH.

Copilot · 2026-02-18T19:55:08Z

examples/embedded/nucleo_h753_ksem_test/src/ksem_client.adb

+      --  Sync Connected flag with underlying TCP state.
+      --  If TCP dropped (e.g. remote RST after a reset), reflect that here
+      --  so callers can detect the disconnect and trigger a reconnect.
+      if Connected and then not HAL_Stubs.TCP_Is_Connected then
+         Connected := False;
+      end if;


The reconnect logic in main.adb relies on KSEM_Client.Is_Connected, but the new TCP-state sync only runs inside Read_Power. If the TCP connection drops and Read_Power isn’t called (or is called much later), Is_Connected can still return True and block reconnect. Consider moving/splitting this sync into Is_Connected (or a shared helper) so the public connection status always reflects HAL_Stubs.TCP_Is_Connected.

…ions The TCP client would hang after a few minutes due to: - Missing tcp_close when remote closes (P=null in recv callback), leaving PCBs stuck in CLOSE_WAIT and exhausting the 2-PCB pool - No tcp_poll callback for lwIP periodic connection maintenance - No TCP keepalive to detect silently dead connections Fix: properly close PCB on remote disconnect, register poll callback, enable LWIP_TCP_KEEPALIVE (10s idle, 5s interval, 3 probes), and centralize PCB cleanup to prevent resource leaks in all code paths.

Copilot

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.

Comments suppressed due to low confidence (1)

examples/embedded/nucleo_h753_ksem_test/src/main.adb:151

After 5 failed connect attempts at startup the code enters an infinite loop (polling Ethernet only). This prevents the later “automatic reconnect in main loop” logic from ever running if the KSEM is temporarily unavailable at boot. If the intent is to keep retrying until the device comes up, consider falling through into the main loop with Connected=False (or keeping a non-halting retry loop) instead of halting permanently.

         if Retry >= 5 then
            UART_Console.Put ("ERROR: KSEM connect failed after 5 attempts (");
            UART_Console.Put_Int (Integer_32 (Status'Pos (Result)));
            UART_Console.Put_Line (")");
            declare
               function Get_TX_Count return Unsigned_32
                 with Import, Convention => C,
                      External_Name => "ethernetif_get_tx_count";
               function Get_RX_Count return Unsigned_32
                 with Import, Convention => C,
                      External_Name => "ethernetif_get_rx_count";
            begin
               UART_Console.Put ("TX: ");
               UART_Console.Put_Int (Integer_32 (Get_TX_Count));
               UART_Console.Put ("  RX: ");
               UART_Console.Put_Int (Integer_32 (Get_RX_Count));
               UART_Console.Put_Line ("");
            end;
            HAL_Stubs.Set_LED (HAL_Stubs.LED_Red, True);
            loop
               HAL_Stubs.Ethernet_Poll;
            end loop;
         end if;

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-02-18T20:15:21Z

examples/embedded/nucleo_h753_ksem_test/src/tcp_client.adb

      --  Get data length
      Len := Pbuf_Get_Tot_Len (P);
      if Len > Unsigned_16 (Max_RX_Size) then
         Len := Unsigned_16 (Max_RX_Size);
      end if;

      --  Copy data to buffer
      Copied := Pbuf_Copy_Partial (P, RX_Buffer'Address, Len, 0);
      RX_Len := Natural (Copied);
      RX_Ready := True;

      --  Acknowledge received data
-      TCP_Recved (PCB, Len);
+      TCP_Recved (PCB, Pbuf_Get_Tot_Len (P));



In Recv_Callback, when an incoming pbuf is larger than Max_RX_Size you cap Len for the copy, but still call TCP_Recved with the full Pbuf_Get_Tot_Len (P). This acknowledges bytes that were not actually processed/copied, which can silently truncate frames and desynchronize the Modbus/TCP stream. Consider treating oversize frames as a protocol error (e.g., close/abort the connection) or only acknowledging what you actually consume (and adjust buffer/logic accordingly).

Copilot · 2026-02-18T20:15:22Z

examples/embedded/nucleo_h753_ksem_test/src/tcp_client.adb

+         Current_State := Disconnected;
         Result := Buffer_Too_Small;


When TCP_New returns null, the code reports Result := Buffer_Too_Small. That status is used elsewhere to mean a caller-provided buffer or TCP send buffer is too small, but here it indicates PCB allocation/resource exhaustion (e.g., MEMP_NUM_TCP_PCB). Please return a status that better reflects the failure mode (or introduce a dedicated status) so callers can distinguish allocation failures from buffer sizing issues.

Suggested change

Current_State := Disconnected;

Result := Buffer_Too_Small;

-- PCB allocation / resource exhaustion (e.g., MEMP_NUM_TCP_PCB limit)

Current_State := Disconnected;

Result := Invalid_Request;

Copilot · 2026-02-18T20:15:22Z

examples/embedded/nucleo_h753_ksem_test/src/ksem_client.adb

+
      if not Connected then
-         Result := Invalid_Request;
+         Result := Timeout;


Read_Power now returns Timeout when Connected is false. This is inconsistent with other call sites in this example (e.g., HAL_Stubs.Modbus_TCP_Read_* returning Invalid_Request when the TCP layer isn’t connected) and makes it harder to differentiate “not connected / caller error” from a real Modbus/TCP timeout. Consider keeping Invalid_Request here (or adding a dedicated Disconnected status) and letting the reconnect logic rely on Is_Connected rather than overloading Timeout.

Suggested change

Result := Timeout;

Result := Invalid_Request;

- Fix misleading ISR comments in lwipopts.h and sys_arch.c: clarify that critical sections are defensive for future IRQ-based ETH - Move TCP state sync from Read_Power into Is_Connected so reconnect logic always sees actual connection status - Abort connection on oversize frames instead of silently truncating and desynchronizing the Modbus/TCP stream - Return Invalid_Request (not Buffer_Too_Small) on PCB pool exhaustion - Use Invalid_Request consistently when not connected

Copilot

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated no new comments.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot AI review requested due to automatic review settings February 18, 2026 19:10

Copilot started reviewing on behalf of flottokarotto February 18, 2026 19:10 View session

Copilot AI reviewed Feb 18, 2026

View reviewed changes

flottokarotto changed the title ~~Fix LwIP race condition: ISR vs. mainloop memory pool access~~ Defensive hardening: LwIP critical sections and volatile for ETH driver Feb 18, 2026

flottokarotto changed the title ~~Defensive hardening: LwIP critical sections and volatile for ETH driver~~ Fix KSEM TCP reconnect and LwIP critical sections Feb 18, 2026

flottokarotto requested a review from Copilot February 18, 2026 19:48

Copilot started reviewing on behalf of flottokarotto February 18, 2026 19:48 View session

Copilot AI reviewed Feb 18, 2026

View reviewed changes

flottokarotto changed the title ~~Fix KSEM TCP reconnect and LwIP critical sections~~ Fix LwIP TCP stability: PCB leak, keepalive, and critical sections Feb 18, 2026

flottokarotto requested a review from Copilot February 18, 2026 20:10

Copilot started reviewing on behalf of flottokarotto February 18, 2026 20:10 View session

Copilot AI reviewed Feb 18, 2026

View reviewed changes

flottokarotto requested a review from Copilot February 18, 2026 20:26

Copilot started reviewing on behalf of flottokarotto February 18, 2026 20:27 View session

Copilot AI reviewed Feb 18, 2026

View reviewed changes

flottokarotto merged commit 8f03519 into master Feb 18, 2026
10 checks passed

flottokarotto deleted the fix/lwip-race-condition branch February 18, 2026 21:40

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix LwIP TCP stability: PCB leak, keepalive, and critical sections#4

Fix LwIP TCP stability: PCB leak, keepalive, and critical sections#4
flottokarotto merged 4 commits intomasterfrom
fix/lwip-race-condition

flottokarotto commented Feb 18, 2026 •

edited

Loading

Uh oh!

codecov-commenter commented Feb 18, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Feb 18, 2026

Uh oh!

Copilot AI Feb 18, 2026

Uh oh!

Copilot AI Feb 18, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Feb 18, 2026

Uh oh!

Copilot AI Feb 18, 2026

Uh oh!

Copilot AI Feb 18, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

	#define SYS_LIGHTWEIGHT_PROT 1 /* Guard ISR vs. mainloop (HAL ETH callbacks run in IRQ context) */
	#define SYS_LIGHTWEIGHT_PROT 1 /* Enable lightweight protection; defensive for possible ISR vs. main loop use (e.g. future IRQ-based ETH callbacks) */

- *     concurrent access from ISR (HAL ETH callbacks) and mainloop.
+ *     possible concurrent access from ISR (HAL ETH callbacks) and
+ *     mainloop. In the current polling-based ethernetif.c they serve
+ *     primarily as defensive hardening for any future IRQ-driven ETH.

Conversation

flottokarotto commented Feb 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Changes

Test plan

Uh oh!

codecov-commenter commented Feb 18, 2026

Codecov Report

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Copilot AI Feb 18, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Feb 18, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Feb 18, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Copilot AI Feb 18, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Feb 18, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Feb 18, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

flottokarotto commented Feb 18, 2026 •

edited

Loading