rustyrussell
Fixes: #8542

@rustyrussell rustyrussell added this to the v25.12 milestone Sep 23, 2025
@rustyrussell rustyrussell force-pushed the guilt/gossip-map-more-robust branch from 97f804d to 1990e37 Compare September 23, 2025 02:35
grubles commented Sep 24, 2025

Still crashing for me:

2025-09-24T13:51:22.703Z **BROKEN** connectd: Bad checksum on gossmap record @9850670/9851114 should be 3379961343 (01009411e26cd56d68aabc285ee1c8ee43d59be6f939b0ce353d80213918680a7438356b9c5ea6bb001a6bb37a4dea93776f4abc8cd371525b4d1605a74b89d7cb1bfc8865ddf22288c7ea08b9d98b34155b4aed159eb81732957e6bf79b996752bf2a9995aaead1d65e7889e826ea0ba42f7746c176fe12f2fe6c04af1a74b4f0a262d20efd57133eb32693c789eb3f09caf4f4c6ecd2f734b3b36e751ffcc2748c58feabce4173c4ce6098a2c5397aabf1be5442cb67b5030be11ebd8b9841838dae127fe30000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000): waiting                                                                                          2025-09-24T13:51:22.703Z **BROKEN** connectd: Bad checksum on gossmap record @9850670/9851136 should be 3379961343 (01009411e26cd56d68aabc285ee1c8ee43d59be6f939b0ce353d80213918680a7438356b9c5ea6bb001a6bb37a4dea93776f4abc8cd371525b4d1605a74b89d7cb1bfc8865ddf22288c7ea08b9d98b34155b4aed159eb81732957e6bf79b996752bf2a9995aaead1d65e7889e826ea0ba42f7746c176fe12f2fe6c04af1a74b4f0a262d20efd57133eb32693c789eb3f09caf4f4c6ecd2f734b3b36e751ffcc2748c58feabce4173c4ce6098a2c5397aabf1be5442cb67b5030be11ebd8b9841838dae127fe3000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000): waiting                                                                                          
0x623de0e30611 send_backtrace                                                                                           
        common/daemon.c:33                                                                                              
0x623de0e3b0d2 status_failed                                                                                            
        common/status.c:206                                                                                             
0x623de0e2869e gossmap_manage_get_gossmap                                                                               
        gossipd/gossmap_manage.c:1460
0x623de0e29955 gossmap_manage_handle_get_txout_reply                                                                    
        gossipd/gossmap_manage.c:753                                                                                    
0x623de0e2572b recv_req                                                                                                 
        gossipd/gossipd.c:574
0x623de0e30945 handle_read                                                                                              
        common/daemon_conn.c:35
0x623de0eca8d7 next_plan                                                                                                
        ccan/ccan/io/io.c:60                                                                                            
0x623de0ecada8 do_plan                                                                                                  
        ccan/ccan/io/io.c:422                                                                                           
0x623de0ecae65 io_ready                                                                                                 
        ccan/ccan/io/io.c:439                                                                                           
0x623de0ecc7d7 io_loop                                                                                                  
        ccan/ccan/io/poll.c:455                                                                                         
0x623de0e261cd main
        gossipd/gossipd.c:663                                                                                           
0x7b19c33621c9 __libc_start_call_main                                                                                   
        ../sysdeps/nptl/libc_start_call_main.h:58
0x7b19c336228a __libc_start_main_impl
        ../csu/libc-start.c:360
0x623de0e22ed4 ???
        _start+0x24:0
0xffffffffffffffff ???
        ???:0
2025-09-24T13:51:23.141Z **BROKEN** gossipd: Bad checksum on gossmap record @9850670/9851136 should be 3379961343 (01009
411e26cd56d68aabc285ee1c8ee43d59be6f939b0ce353d80213918680a7438356b9c5ea6bb001a6bb37a4dea93776f4abc8cd371525b4d1605a74b8
9d7cb1bfc8865ddf22288c7ea08b9d98b34155b4aed159eb81732957e6bf79b996752bf2a9995aaead1d65e7889e826ea0ba42f7746c176fe12f2fe6
c04af1a74b4f0a262d20efd57133eb32693c789eb3f09caf4f4c6ecd2f734b3b36e751ffcc2748c58feabce4173c4ce6098a2c5397aabf1be5442cb6
7b5030be11ebd8b9841838dae127fe300000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
0000000000000000000): waiting
2025-09-24T13:51:23.141Z **BROKEN** gossipd: Gossmap failed to process entire gossip_store, disabling mmap: at 9850670 o
f 9851136 remaining_mmap=00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000 re
maining_fd=200001b0c9761dff0000000001009411e26cd56d68aabc285ee1c8ee43d59be6f939b0ce353d80213918680a7438356b9c5ea6bb001a6
bb37a4dea93776f4abc8cd371525b4d1605a74b89d7cb1bfc8865ddf22288c7ea08b9d98b34155b4aed159eb81732957e6bf79b996752bf2a9995aae
ad1d65e7889e826ea0ba42f7746c176fe12f2fe6c04af1a74b4f0a262d20efd57133eb32693c789eb3f09caf4f4c6ecd2f734b3b36e751ffcc2748c5
8feabce4173c4ce6098a2c5397aabf1be5442cb67b5030be11ebd8b9841838dae127fe30000000000000000000000000000000000000000000000000
000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
000000000000000000000000000000000000000000000000000000000002000000a218b9d93000000001005000000000000c060
2025-09-24T13:51:23.141Z **BROKEN** gossipd: Gossmap map_used 9850670 of 9851136 with 9851136 written (version v25.09-70
-g1990e37)
2025-09-24T13:51:23.141Z **BROKEN** gossipd: backtrace: common/daemon.c:41 (send_backtrace) 0x623de0e3065e
2025-09-24T13:51:23.141Z **BROKEN** gossipd: backtrace: common/status.c:206 (status_failed) 0x623de0e3b0d2
2025-09-24T13:51:23.141Z **BROKEN** gossipd: backtrace: gossipd/gossmap_manage.c:1460 (gossmap_manage_get_gossmap) 0x623
de0e2869e
2025-09-24T13:51:23.141Z **BROKEN** gossipd: backtrace: gossipd/gossmap_manage.c:753 (gossmap_manage_handle_get_txout_re
ply) 0x623de0e29955
2025-09-24T13:51:23.141Z **BROKEN** gossipd: backtrace: gossipd/gossipd.c:574 (recv_req) 0x623de0e2572b
2025-09-24T13:51:23.141Z **BROKEN** gossipd: backtrace: common/daemon_conn.c:35 (handle_read) 0x623de0e30945
2025-09-24T13:51:23.141Z **BROKEN** gossipd: backtrace: ccan/ccan/io/io.c:60 (next_plan) 0x623de0eca8d7
2025-09-24T13:51:23.141Z **BROKEN** gossipd: backtrace: ccan/ccan/io/io.c:422 (do_plan) 0x623de0ecada8
2025-09-24T13:51:23.141Z **BROKEN** gossipd: backtrace: ccan/ccan/io/io.c:439 (io_ready) 0x623de0ecae65
2025-09-24T13:51:23.141Z **BROKEN** gossipd: backtrace: ccan/ccan/io/poll.c:455 (io_loop) 0x623de0ecc7d7
2025-09-24T13:51:23.141Z **BROKEN** gossipd: backtrace: gossipd/gossipd.c:663 (main) 0x623de0e261cd
2025-09-24T13:51:23.141Z **BROKEN** gossipd: backtrace: ../sysdeps/nptl/libc_start_call_main.h:58 (__libc_start_call_mai
n) 0x7b19c33621c9
2025-09-24T13:51:23.141Z **BROKEN** gossipd: backtrace: ../csu/libc-start.c:360 (__libc_start_main_impl) 0x7b19c336228a
2025-09-24T13:51:23.141Z **BROKEN** gossipd: backtrace: (null):0 ((null)) 0x623de0e22ed4
2025-09-24T13:51:23.141Z **BROKEN** gossipd: backtrace: (null):0 ((null)) 0xffffffffffffffff
2025-09-24T13:51:23.141Z **BROKEN** gossipd: STATUS_FAIL_INTERNAL_ERROR: Gossmap map_used 9850670 of 9851136 with 985113
6 written
lightningd: gossipd failed (exit status 242), exiting.
Lost connection to the RPC socket.

@madelinevibes madelinevibes added the 25.09.1 Point release for 25.09 label Sep 25, 2025
@rustyrussell rustyrussell force-pushed the guilt/gossip-map-more-robust branch from 1990e37 to 58aabf0 Compare September 25, 2025 05:11
grubles commented Sep 27, 2025

No crash with the latest commits. I've tried wiping gossip_store and re-syncing a few times to be sure.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
This can happen with other subdaemons too, on ZFS on Linux:

```
2025-09-24T13:51:22.703Z **BROKEN** connectd: Bad checksum on gossmap record @9850670/9851114 should be 3379961343 (01009411e26cd56d68aabc285ee1c8ee43d59be6f939b0ce353d80213918680a7438356b9c5ea6bb001a6bb37a4dea93776f4abc8cd371525b4d1605a74b89d7cb1bfc8865ddf22288c7ea08b9d98b34155b4aed159eb81732957e6bf79b996752bf2a9995aaead1d65e7889e826ea0ba42f7746c176fe12f2fe6c04af1a74b4f0a262d20efd57133eb32693c789eb3f09caf4f4c6ecd2f734b3b36e751ffcc2748c58feabce4173c4ce6098a2c5397aabf1be5442cb67b5030be11ebd8b9841838dae127fe30000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
```

Reported-by: @grubles
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We might not have read the final entry.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
It only gets called for diagnostics when something goes wrong (and we
were going to exit anyway), and it's only useful with mmap (which we
now disable on error), but it shouldn't crash:

```
**BROKEN** gossipd: Truncated gossmap record @7991501/7991523 (len 0): waiting
**BROKEN** gossipd: FATAL SIGNAL 6 (version v25.09)                                            
**BROKEN** gossipd: backtrace: common/daemon.c:41 (send_backtrace) 0x6506817cc529
**BROKEN** gossipd: backtrace: common/daemon.c:78 (crashdump) 0x6506817cc578
**BROKEN** gossipd: backtrace: ./signal/../sysdeps/unix/sysv/linux/x86_64/libc_sigaction.c:0 ((null)) 0x75e8267a032f
**BROKEN** gossipd: backtrace: ./nptl/pthread_kill.c:44 (__pthread_kill_implementation) 0x75e8267f9b2c
**BROKEN** gossipd: backtrace: ./nptl/pthread_kill.c:78 (__pthread_kill_internal) 0x75e8267f9b2c
**BROKEN** gossipd: backtrace: ./nptl/pthread_kill.c:89 (__GI___pthread_kill) 0x75e8267f9b2c
**BROKEN** gossipd: backtrace: ../sysdeps/posix/raise.c:26 (__GI_raise) 0x75e8267a027d
**BROKEN** gossipd: backtrace: ./stdlib/abort.c:79 (__GI_abort) 0x75e8267838fe
**BROKEN** gossipd: backtrace: ./assert/assert.c:96 (__assert_fail_base) 0x75e82678381a
**BROKEN** gossipd: backtrace: ./assert/assert.c:105 (__assert_fail) 0x75e826796516
**BROKEN** gossipd: backtrace: common/gossmap.c:111 (map_copy) 0x6506817cea77
**BROKEN** gossipd: backtrace: common/gossmap.c:1870 (gossmap_fetch_tail) 0x6506817d1f93
**BROKEN** gossipd: backtrace: gossipd/gossmap_manage.c:1442 (gossmap_manage_get_gossmap) 0x6506817c45fb
**BROKEN** gossipd: backtrace: gossipd/gossmap_manage.c:753 (gossmap_manage_handle_get_txout_reply) 0x6506817c5850
**BROKEN** gossipd: backtrace: gossipd/gossipd.c:574 (recv_req) 0x6506817c172b
```

Reported-by: @grubles
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
This should detect partial writes more robustly, since we make a
separate pwrite() call to update this flag after the record is written.

Previously we were playing a bit loose with synchronization assumptions,
which seemed to work on Linux ext4, but not so well elsewhere.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
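
The two-step write described in the commit message above can be sketched roughly as follows. This is an illustration only, not the gossipd code: the header layout, the `COMPLETED_BIT` value, the helper names, and the use of `zlib.crc32` as a stand-in checksum are all assumptions made for the example.

```python
import os
import struct
import tempfile
import zlib

# Assumed layout: be16 flags, be16 length, be32 checksum, be32 timestamp.
HEADER = struct.Struct(">HHII")
COMPLETED_BIT = 0x2000  # hypothetical value for the "completed" flag


def write_record(fd, offset, msg, timestamp=0, finish=True):
    """Write one record at offset; return the offset of the next record."""
    # zlib.crc32 is a stand-in for whatever checksum the real store uses.
    crc = zlib.crc32(struct.pack(">I", timestamp) + msg)
    # Step 1: header (flag clear) plus body.
    os.pwrite(fd, HEADER.pack(0, len(msg), crc, timestamp) + msg, offset)
    if finish:
        # Step 2: a separate pwrite() sets the flag once the body is in place.
        os.pwrite(fd, struct.pack(">H", COMPLETED_BIT), offset)
    return offset + HEADER.size + len(msg)


def record_is_complete(fd, offset):
    """True only if a full header is present and its completed flag is set."""
    header = os.pread(fd, HEADER.size, offset)
    if len(header) < HEADER.size:
        return False
    flags = HEADER.unpack(header)[0]
    return bool(flags & COMPLETED_BIT)


with tempfile.TemporaryFile() as store:
    fd = store.fileno()
    second = write_record(fd, 0, b"first message")
    # Simulate a write that never reached step 2.
    write_record(fd, second, b"interrupted", finish=False)
    first_ok = record_is_complete(fd, 0)
    second_ok = record_is_complete(fd, second)
```

A reader that checks the flag can treat `second_ok == False` as "not written yet" rather than as corruption, which is the point of setting the flag in a separate call.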
@rustyrussell rustyrussell force-pushed the guilt/gossip-map-more-robust branch 2 times, most recently from 2bcd99f to cf8feea Compare September 29, 2025 03:34
It was still using private channel announcements, which were removed
in v13.
@rustyrussell rustyrussell force-pushed the guilt/gossip-map-more-robust branch 2 times, most recently from 54cc00f to 9441853 Compare September 30, 2025 06:59
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
…D_BIT set.

Mostly this meant running them, then running devtools/convert-gossmap and replacing the code.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
…E_COMPLETED_BIT set.

Simply ran them through devtools/convert-gossmap, though for gossip_store-part2 it
had to be appended to gossip_store-part1, converted, then cut off again.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
…et a read issue.

This is a last resort, but what else are we supposed to do when we wrote
something and it didn't appear?

In particular, ZFS doesn't just "fix itself":

```
remaining_fd=200001b0c9761dff0000000001009411e26cd56d68aabc285ee1c8ee43d59be6f939b0ce353d80213918680a7438356b9c5ea6bb001a6
bb37a4dea93776f4abc8cd371525b4d1605a74b89d7cb1bfc8865ddf22288c7ea08b9d98b34155b4aed159eb81732957e6bf79b996752bf2a9995aae
ad1d65e7889e826ea0ba42f7746c176fe12f2fe6c04af1a74b4f0a262d20efd57133eb32693c789eb3f09caf4f4c6ecd2f734b3b36e751ffcc2748c5
8feabce4173c4ce6098a2c5397aabf1be5442cb67b5030be11ebd8b9841838dae127fe30000000000000000000000000000000000000000000000000
000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
000000000000000000000000000000000000000000000000000000000002000000a218b9d93000000001005000000000000c060
```

Note the record appended on the end *after all the zeroes*.
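
To make the dump above easier to read, here is a small sketch that decodes the bytes at each end of `remaining_fd`. It assumes a 12-byte big-endian record header (flags, length, checksum, timestamp) and that the trailing 22 bytes form one more record; both are assumptions for illustration, and `parse_header` is a hypothetical helper.

```python
import struct

# Assumed layout: be16 flags, be16 length, be32 checksum, be32 timestamp.
HEADER = struct.Struct(">HHII")


def parse_header(hex_dump):
    """Decode the first 12 bytes of a hex dump as a record header."""
    flags, length, crc, timestamp = HEADER.unpack(bytes.fromhex(hex_dump[:24]))
    return {"flags": flags, "length": length, "crc": crc, "timestamp": timestamp}


# First 12 bytes of remaining_fd in the dump above.
head = parse_header("200001b0c9761dff00000000")

# Final 22 bytes of the same dump: what follows the run of zeroes.
tail_hex = "2000000a218b9d93000000001005000000000000c060"
tail = parse_header(tail_hex)
tail_payload = bytes.fromhex(tail_hex[24:])
```

Under that layout, `head["crc"]` is 3379961343, the same value as the "should be 3379961343" in the Bad checksum messages earlier in this thread, and `tail["length"]` equals the number of payload bytes that follow the trailing header.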

Changelog-Changed: gossipd: add gossip_store recovery for filesystems which do not synchronize read and write (e.g. ZFS on Linux), by disabling mmap reads and rewriting the last records.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
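
As a rough illustration of the "rewriting the last records" idea in the changelog entry above: the sketch below re-reads recently written data through the file descriptor and rewrites anything that does not read back intact. The function name, the list of recent writes, and the simulated failure are all hypothetical; the actual gossipd recovery logic may differ.

```python
import os
import tempfile


def rewrite_missing_tail(fd, recent_writes):
    """Re-check recent (offset, data) pairs via the fd and rewrite any
    that do not read back intact. Returns how many were rewritten."""
    rewritten = 0
    for offset, data in recent_writes:
        if os.pread(fd, len(data), offset) != data:
            os.pwrite(fd, data, offset)
            rewritten += 1
    return rewritten


with tempfile.TemporaryFile() as store:
    fd = store.fileno()
    recent = [(0, b"record-one"), (10, b"record-two")]
    for offset, data in recent:
        os.pwrite(fd, data, offset)
    # Simulate the symptom in the dump above: zeroes where a record should be.
    os.pwrite(fd, bytes(10), 10)
    fixed = rewrite_missing_tail(fd, recent)
    contents = os.pread(fd, 20, 0)
```

The sketch uses `pread()` rather than an mmap view for the check, in line with the changelog's note that mmap reads are disabled once a problem is detected.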
gossipd now uses pwrite(), which is more broadly supported.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
@rustyrussell rustyrussell force-pushed the guilt/gossip-map-more-robust branch from 9441853 to 1db1b92 Compare October 1, 2025 01:23
@rustyrussell rustyrussell merged commit 6af7fc6 into ElementsProject:master Oct 1, 2025
35 of 39 checks passed
Successfully merging this pull request may close these issues.

crash when syncing gossip from scratch with v25.09