@rustyrussell rustyrussell commented Dec 10, 2025

This is based on #8767 (which was causing one of the "flakes"!).

This is a draft because the final commit disables all flaky-test handling so I can catch the flakes (I'm also looping it on my laptop) and root-cause the others. We may still mark some tests flaky if they are genuinely unreliable, but we need to make that determination!

Changelog-None

@rustyrussell rustyrussell requested a review from cdecker December 10, 2025 01:46
@rustyrussell rustyrussell force-pushed the guilt/flaky-fixes branch 3 times, most recently from 50a04af to d3eada8 on December 11, 2025 04:21
…MACHINE.

We used to just run these without valgrind, but we already run them in
CI (which sets SLOW_MACHINE) without valgrind, so this just doubles
up.
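
A minimal sketch of the guard this implies; the environment-variable names and the decorator form are assumptions, not a copy of the actual change. The idea: when running under valgrind on a slow machine, skip entirely rather than re-running without valgrind, since CI (which sets SLOW_MACHINE) already covers the non-valgrind case.

```python
import os
import pytest

# Assumption: SLOW_MACHINE/VALGRIND environment variables as used in the test suite.
SLOW_MACHINE = os.getenv('SLOW_MACHINE') == '1'
VALGRIND = os.getenv('VALGRIND') == '1'

@pytest.mark.skipif(SLOW_MACHINE and VALGRIND,
                    reason="CI already runs this without valgrind")
def test_something_slow():
    ...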

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
This is covering up real bugs; we need to fix CI instead.

From https://pypi.org/project/pytest-rerunfailures/#re-run-all-failures:

```
To re-run all test failures, use the --reruns command line option with the maximum number of times you’d like the tests to run:

$ pytest --reruns 5
```

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
This shows up as a flake in test_route_by_old_scid:


```
        # Now restart l2, make sure it remembers the original!
        l2.restart()
>       l2.rpc.connect(l1.info['id'], 'localhost', l1.port)

tests/test_splicing.py:554: 
...
>           raise RpcError(method, payload, resp['error'])
E           pyln.client.lightning.RpcError: RPC call failed: method: connect, payload: {'id': '0266e4598d1d3c415f572a8488830b60f7e744ed9235eb0b1ba93283b315c03518', 'host': 'localhost', 'port': 33837}, error: {'code': 400, 'message': 'Unable to connect, no address known for peer'}
```

This is because it's already (auto)connecting, and that fails.  The
failure is reported before we've added the new address (once we add
the new address, we connect fine, but it's too late!):

```
lightningd-2 2025-12-08T02:39:18.241Z DEBUG   gossipd: REPLY WIRE_GOSSIPD_NEW_BLOCKHEIGHT_REPLY with 0 fds
lightningd-2 2025-12-08T02:39:18.320Z DEBUG   0266e4598d1d3c415f572a8488830b60f7e744ed9235eb0b1ba93283b315c03518-connectd: Initializing important peer with 0 addresses
lightningd-2 2025-12-08T02:39:18.320Z DEBUG   0266e4598d1d3c415f572a8488830b60f7e744ed9235eb0b1ba93283b315c03518-connectd: Failed connected out: Unable to connect, no address known for peer
lightningd-2 2025-12-08T02:39:18.344Z DEBUG   0266e4598d1d3c415f572a8488830b60f7e744ed9235eb0b1ba93283b315c03518-connectd: Will try reconnect in 1 seconds
lightningd-2 2025-12-08T02:39:18.344Z DEBUG   035d2b1192dfba134e10e540875d366ebc8bc353d5aa766b80c090b39c3a5d885d-connectd: Initializing important peer with 1 addresses
lightningd-2 2025-12-08T02:39:18.344Z DEBUG   035d2b1192dfba134e10e540875d366ebc8bc353d5aa766b80c090b39c3a5d885d-connectd: Connected out, starting crypto
lightningd-2 2025-12-08T02:39:18.344Z DEBUG   0266e4598d1d3c415f572a8488830b60f7e744ed9235eb0b1ba93283b315c03518-connectd: Adding 1 addresses to important peer
lightningd-2 2025-12-08T02:39:18.345Z DEBUG   0266e4598d1d3c415f572a8488830b60f7e744ed9235eb0b1ba93283b315c03518-connectd: Connected out, starting crypto
{'run_id': 256236335046680576, 'github_repository': 'ElementsProject/lightning', 'github_sha': '555f1ac461d151064aad6fc83b94a0290e2e9d5d', 'github_ref': 'refs/pull/8767/merge', 'github_ref_name': 'HEAD', 'github_run_id': 20013689862, 'github_head_ref': 'fixup-backfill-bug', 'github_run_number': 14774, 'github_base_ref': 'master', 'github_run_attempt': '1', 'testname': 'test_route_by_old_scid', 'start_time': 1765161493, 'end_time': 1765161558, 'outcome': 'fail'}
lightningd-2 2025-12-08T02:39:18.421Z DEBUG   022d223620a359a47ff7f7ac447c85c46c923da53389221a0054c11c1e3ca31d59-hsmd: Got WIRE_HSMD_ECDH_REQ
lightningd-2 2025-12-08T02:39:18.421Z DEBUG   hsmd: Client: Received message 1 from client
lightningd-2 2025-12-08T02:39:18.453Z DEBUG   022d223620a359a47ff7f7ac447c85c46c923da53389221a0054c11c1e3ca31d59-hsmd: Got WIRE_HSMD_ECDH_REQ
lightningd-2 2025-12-08T02:39:18.453Z DEBUG   hsmd: Client: Received message 1 from client
--------------------------- Captured stdout teardown ---------------------------
```
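
A minimal sketch of one way a test can avoid this race (an assumption about the approach, not necessarily the fix in this PR): wait for connectd to re-learn the peer's address after the restart before issuing the explicit connect.

```python
# Sketch only: the wait_for_log pattern below is the line visible in the
# quoted log; waiting for it means the explicit connect can't race the
# empty-address autoconnect attempt.
l2.restart()
l2.daemon.wait_for_log(r'connectd: Adding 1 addresses to important peer')
l2.rpc.connect(l1.info['id'], 'localhost', l1.port)
```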
We can still get a warning:

```
lightningd-1 2025-12-10T01:11:07.232Z DEBUG   022d223620a359a47ff7f7ac447c85c46c923da53389221a0054c11c1e3ca31d59-connectd: Received WIRE_WARNING: WARNING: channel_announcement: no unspent txout 109x1x1
```

This has nothing to do with l1 talking about the original channel
(which would be 103x1x): it's because l2's gossipd (being the node
which does the splice) immediately forgets the pre-splice id.  If l1
sends some gossip, it will get a warning message.
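
A hedged sketch of how a test can tolerate this, assuming pyln-testing's allow_warning node option (the actual fix may differ):

```python
# Sketch: create the node tolerating warnings, since l2's gossipd forgets
# the pre-splice scid immediately and l1 may legitimately be warned about it.
l1 = node_factory.get_node(allow_warning=True)
```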

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Since this was written, we now test whether the remote side would get into this situation and stop
it from happening, so the test no longer works.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
1. It was flaky, probably because it didn't wait for the remote update_channel.
2. Rusty applied a fix in 5f664da, not clear if it worked.
3. Christian disabled it altogether in 23ce9a9.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
```
2025-12-10T02:51:06.2435409Z [gw1] [ 77%] ERROR tests/test_plugin.py::test_commando
...lightningd-1: had BROKEN messages
...
2025-12-10T03:00:26.0440311Z lightningd-1 2025-12-10T02:51:01.548Z UNUSUAL jsonrpc#69: That's weird: Request checkrune took 5961 milliseconds
```

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
signpsbt could be the one which takes a long time, so allow any
psbt event.
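
Purely as an illustration of the relaxed expectation (the actual pattern used by the test is not shown here and is an assumption), the match could accept any psbt-flavoured method in the slow-request line rather than signpsbt specifically:

```python
import re

# Illustrative pattern only: accept any *psbt* method, not just signpsbt.
slow_psbt = re.compile(r'Request \w*psbt\w* took \d+ milliseconds')
assert slow_psbt.search("That's weird: Request signpsbt took 5961 milliseconds")
```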

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
…essages.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We restart the node: if the coin_movements.py plugin hasn't processed the
notification yet, the accounted balance will be incorrect:

```
>       assert account_balance(l2, chanid_1) == 100001001
E       AssertionError: assert 150_001_001msat == 100_001_001
E        +  where 150001001msat = account_balance(<fixtures.LightningNode object at 0x7f0634e1eb00>, '39ac52c818c5304cf0664940ff236c4e3f8f4ceb8993cb1491347142d61b62bc')
```
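
A sketch of the kind of guard this implies (the log pattern is hypothetical and the actual synchronization in the fix may differ): make sure the plugin has recorded the movement before restarting.

```python
# Illustrative only: wait until coin_movements.py has seen the notification,
# then restart and check the settled balance from the failure above.
l2.daemon.wait_for_log(r'plugin-coin_movements.py')   # hypothetical log pattern
l2.restart()
assert account_balance(l2, chanid_1) == 100001001
```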

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We didn't log when anchor transactions had short signatures,
so did_short_sig() never matches and this test's feerate assertion fails:

```
            total_feerate_perkw = total_fees / total_weight * 1000
>           check_feerate([l3, l2], total_feerate_perkw, feerate)

tests/test_closing.py:4063: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

nodes = [<fixtures.LightningNode object at 0x7f0fb322bb20>, <fixtures.LightningNode object at 0x7f0fb1b5ead0>]
actual_feerate = 14006.105538595726, expected_feerate = 14000

    def check_feerate(nodes, actual_feerate, expected_feerate):
        # Feerate can't be lower.
        assert actual_feerate > expected_feerate - 2
        if actual_feerate >= expected_feerate + 2:
            if any([did_short_sig(n) for n in nodes]):
                return
        # Use assert as it shows the actual values on failure
>       assert actual_feerate < expected_feerate + 2
E       AssertionError
```
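
For context, a hypothetical sketch of the helper involved (the log text is an assumption): a shorter-than-estimated DER signature makes the anchor spend slightly lighter than the weight estimate, so the realized feerate can legitimately sit a little above the target.

```python
def did_short_sig(node):
    # The exact message is an assumption: the commit adds a log line when an
    # anchor transaction's signature came out shorter than the estimate.
    return node.daemon.is_in_log('short signature')
```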

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We don't want it to think that it can use both pre-splice and post-splice channels!

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
This means that we won't complain to peers which gossip about our
channels, but it does mean that our channel graph (like other nodes on
the network) will show two channels, not one, for the duration.

For this reason, we need askrene to omit local dying channels.
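
A hedged illustration of the observable effect (parameter and field names are recalled from askrene's getroutes and listchannels output, and may not be exact): the graph may briefly show both scids, but routes must never use the dying one.

```python
# Illustrative check only: both channels appear in the graph for a while...
scids = [c['short_channel_id'] for c in l1.rpc.listchannels()['channels']]
assert old_scid in scids and new_scid in scids

# ...but askrene must not route through the local dying channel.
routes = l1.rpc.getroutes(source=l1.info['id'], destination=l3.info['id'],
                          amount_msat=10000, layers=[],
                          maxfee_msat=1000, final_cltv=18)
for route in routes['routes']:
    assert all(not hop['short_channel_id_dir'].startswith(old_scid + '/')
               for hop in route['path'])
```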

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We removed the functionality, but only disabled the tests.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
```
>       ret = l1.rpc.xpay(invstring=inv, maxfee=maxfee)

tests/test_xpay.py:585: 
...
>           raise RpcError(method, payload, resp['error'])
E           pyln.client.lightning.RpcError: RPC call failed: method: xpay, payload: {'invstring': 'lnbcrt1m1p5n5wdzsp5qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqpp52mu6842a26hs40xxgzscflm4smk5yjctqgf7hhrwhx7dh2492vzsdp22pshj6twvusxummyv5sr2wfqwa5hg6pqd4shsen9v5cqpj9qyyqqqj9kvjvrzy0ygdsfsskjtss0xrkrt7lq8jyrgzvtvdw5y9xqy0v25ddxd787c9ym36udm876lyhevznj8j9qzk0r7x03xm0akvq6ltwcq7vm7tk', 'maxfee': 57966}, error: {'code': 209, 'message': "Failed after 4 attempts. We got temporary_channel_failure for 59x81x28707/1, assuming it can't carry 49498813msat. Then routing for remaining 49498813msat failed: linear_routes: timed out after deadline"}
...
lightningd-1 2025-12-11T03:25:41.972Z DEBUG   plugin-cln-askrene: notify msg debug: get_routes failed after 15572 ms
```

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Under Postgres, this actually takes more than 2 seconds, so w2
really has timed out already:

```
time.sleep(2)  # total 2
        assert not w1.done()
>       assert not w2.done()
E       assert not True
E        +  where True = done()
E        +    where done = <Future at 0x7fe14e54fee0 state=finished raised RpcError>.done

tests/test_invoices.py:420: AssertionError
```

So space the timeouts out further, and sleep one second short of the
deadline; the .result() call (which blocks) will catch up if we were extremely slow.
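
A sketch of the timing pattern being described; the numbers are illustrative, not the test's actual values.

```python
# Illustrative values only: the nearer deadline is ~5s away, so sleeping 4s
# leaves slack even on a slow Postgres backend; the blocking .result() call
# then absorbs any remaining delay instead of a razor-thin .done() check.
time.sleep(4)
assert not w1.done()
assert not w2.done()
with pytest.raises(RpcError):
    w2.result(TIMEOUT)   # blocks until the RPC actually finishes
```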

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
3974806 added this:
    CI: Try not running group 2/10 UBSAN in parallel.
    
    It's being killed with signal 143, which means docker isn't happy; too much memory consumption?

But since we're now at 12 groups, that probably doesn't apply (it
might not have even before: in the two years since that commit, so
many things have been added).  And it caused this shard to take over
2 hours and time out.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
It's not reliable:

```
        # We should have deferred hook update at least once!
>       l2.daemon.wait_for_log("UNUSUAL plugin-dep_b.py: Deferring registration of hook htlc_accepted until it's not in use.")

tests/test_plugin.py:2646:
...
                        if self.is_in_log(r):
                            print("({} was previously in logs!)".format(r))
>                   raise TimeoutError('Unable to find "{}" in logs.'.format(exs))
E                   TimeoutError: Unable to find "[re.compile("UNUSUAL plugin-dep_b.py: Deferring registration of hook htlc_accepted until it's not in use.")]" in logs.
```

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Do this by setting notleak when we do the detach!

```
**BROKEN** lightningd: MEMLEAK: 0x60f0000bbb38
**BROKEN** lightningd:   label=ccan/ccan/io/io.c:92:struct io_conn
**BROKEN** lightningd:   alloc:
**BROKEN** lightningd:     /home/runner/work/lightning/lightning/ccan/ccan/tal/tal.c:488 (tal_alloc_)
**BROKEN** lightningd:     /home/runner/work/lightning/lightning/ccan/ccan/io/io.c:92 (io_new_conn_)
**BROKEN** lightningd:     /home/runner/work/lightning/lightning/lightningd/subd.c:781 (new_subd)
**BROKEN** lightningd:     /home/runner/work/lightning/lightning/lightningd/subd.c:835 (new_channel_subd_)
**BROKEN** lightningd:     /home/runner/work/lightning/lightning/lightningd/channel_control.c:1715 (peer_start_channeld)
**BROKEN** lightningd:     /home/runner/work/lightning/lightning/lightningd/peer_control.c:1390 (connect_activate_subd)
**BROKEN** lightningd:     /home/runner/work/lightning/lightning/lightningd/peer_control.c:1516 (peer_connected_hook_final)
**BROKEN** lightningd:     /home/runner/work/lightning/lightning/lightningd/plugin_hook.c:243 (hook_done)
**BROKEN** lightningd:     /home/runner/work/lightning/lightning/lightningd/plugin_hook.c:343 (plugin_hook_call_next)
**BROKEN** lightningd:     /home/runner/work/lightning/lightning/lightningd/plugin_hook.c:299 (plugin_hook_callback)
**BROKEN** lightningd:     /home/runner/work/lightning/lightning/lightningd/plugin.c:701 (plugin_response_handle)
**BROKEN** lightningd:     /home/runner/work/lightning/lightning/lightningd/plugin.c:790 (plugin_read_json)
**BROKEN** lightningd:     /home/runner/work/lightning/lightning/ccan/ccan/io/io.c:60 (next_plan)
**BROKEN** lightningd:     /home/runner/work/lightning/lightning/ccan/ccan/io/io.c:422 (do_plan)
**BROKEN** lightningd:     /home/runner/work/lightning/lightning/ccan/ccan/io/io.c:439 (io_ready)
**BROKEN** lightningd:     /home/runner/work/lightning/lightning/ccan/ccan/io/poll.c:470 (io_loop)
**BROKEN** lightningd:     /home/runner/work/lightning/lightning/lightningd/io_loop_with_timers.c:22 (io_loop_with_timers)
**BROKEN** lightningd:     /home/runner/work/lightning/lightning/lightningd/lightningd.c:1492 (main)
**BROKEN** lightningd:     ../sysdeps/nptl/libc_start_call_main.h:58 (__libc_start_call_main)
**BROKEN** lightningd:     ../csu/libc-start.c:392 (__libc_start_main_impl)
**BROKEN** lightningd:   parents:
**BROKEN** lightningd:     lightningd/lightningd.c:108:struct lightningd
```

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>