Closing too many channels simultaneously can kill the node #2702

Closed
DerEwige opened this issue Jun 27, 2023 · 5 comments

@DerEwige (Contributor)

My local.htlc_infos table has grown to over 100 million entries.
The average channel has around 2 million entries in the table.
Deleting the entries of one channel from the table can take up to 20 minutes.

This is usually not a problem.
The delete just runs in the DB, and during that time the node only reacts a bit more slowly.

Yesterday I decided to clean out some channels and closed 6 of them in short succession.
All transactions from the 6 closes were mined within 2 minutes, and shortly after that my node entered a restart loop.

This error kept appearing:
2023-06-26 22:35:59,593 ERROR - eclair is shutting down because of a fatal database error java.sql.SQLTransientConnectionException: HikariPool-1 - Connection is not available, request timed out after 30002ms.
The node would shut down, restart, run for 2 minutes, then shut down again, over and over.

The problem was the following:

I had eclair.db.postgres.pool.size=4.
When the node started up, it took a moment to realize that the 6 channel closing transactions had already been mined, and then it started the deletes.
Those deletes would run for several minutes and block all 4 connections to the DB.
This would then lead to the shutdown of the node.
Because the node was shut down, the delete transactions were never committed, and the cycle would start over on the next start of the node.

Increasing eclair.db.postgres.pool.size to 8 allowed the node to use 6 connections for the deletes, while the remaining 2 connections were enough to handle the rest of the DB load.
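
For reference, this is roughly how that workaround could look in eclair.conf (HOCON syntax assumed; the value 8 is just the number that happened to work here, not a recommended default):

```hocon
eclair.db.postgres.pool {
  // 6 connections for the concurrent channel deletes,
  // 2 left over for the regular DB load (workaround described above)
  size = 8
}
```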

How to reproduce:
- Grow your local.htlc_infos table to the point where deleting one channel takes more than 5 minutes.
- Close twice as many channels as your eclair.db.postgres.pool.size.
- Shut down your node and wait for all closing transactions to be mined (this makes sure the deletes all start at once).
- Start your node (it will then fill up the DB connections with deletes and run into a DB timeout).

Version:
version=0.9.0 commit=623f7e4 with postgres DB

@thomash-acinq changed the title from "Closing to many channels simultaneously can kill the node" to "Closing too many channels simultaneously can kill the node" on Jun 27, 2023
@pm47 (Member) commented Jun 27, 2023

Try increasing eclair.db.postgres.pool.connection-timeout; it defaults to 30s.
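
For illustration, a minimal sketch of that override in eclair.conf (the 10-minute value is an arbitrary example, not a recommendation from this thread):

```hocon
eclair.db.postgres.pool {
  // let requests wait longer for a free connection while the channel
  // deletes hold the pool, before HikariCP times out (default: 30 seconds)
  connection-timeout = 10 minutes
}
```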

For the record, attempting 100 million payments that probably won't be fulfilled is not only nonsensical, it is a class of DoS attack on the network that protocol developers, including us, actively work towards preventing.

@rorp (Contributor) commented Jun 27, 2023

I believe that's a duplicate of #2610

@rorp (Contributor) commented Jun 27, 2023

> For the record, attempting 100 million payments that probably won't be fulfilled is not only nonsensical, it is a class of DoS attack on the network that protocol developers, including us, actively work towards preventing.

The number doesn't seem that nonsensical if you rebalance channels a lot.

@DerEwige (Contributor, Author)

Yes, I rebalance a lot, and that is how my node reached these numbers.
But a big zero-fee routing node would have similar numbers.

I believe a simple fix could be to make the function removeChannel() synchronized, so that only one channel gets removed at a time.
There might be other SQL statements that run for a long time, but so far that is the only one that has given me problems.
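
As a rough sketch of that suggestion (not eclair's actual code), channel removals could be funneled through a single-threaded executor so that at most one long-running delete occupies a DB connection at a time; `removeChannel` below is a stand-in for the real DB call:

```scala
import java.util.concurrent.Executors
import scala.concurrent.{ExecutionContext, Future}

object SerializedChannelCleanup {
  // One dedicated thread: deletes queue up here instead of grabbing
  // several pool connections at once.
  private val singleThreadEc: ExecutionContext =
    ExecutionContext.fromExecutor(Executors.newSingleThreadExecutor())

  // Wrap the (hypothetical) removeChannel call so invocations run one
  // after another, regardless of how many closes confirm at once.
  def removeChannelSerialized(channelId: String)(removeChannel: String => Unit): Future[Unit] =
    Future(removeChannel(channelId))(singleThreadEc)
}
```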

t-bast added a commit that referenced this issue Jul 4, 2023
When a channel is closed, we can forget the data from historical HTLCs
sent and received through that channel (which is otherwise required to
punish cheating attempts by our peer).

We previously synchronously removed that data from the DB when the closing
transaction confirmed. However, this could create performance issues as
the `htlc_infos` table can be very large for busy nodes and many concurrent
writes may be happening at the same time.

We don't need to get rid of this data immediately: we only want to remove
it to avoid degrading the performance of active channels that read and
write to the `htlc_infos` table. We now mark channels as closed in a
dedicated table, and run a background actor that deletes batches of
obsolete htlc data at regular intervals. This ensures that the table is
eventually cleaned up, without impacting the performance of active
channels.

Fixes #2610 and #2702
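
A hedged sketch of the approach described above, assuming Akka typed actors (this is an illustration, not eclair's actual implementation); `deleteBatch` is a hypothetical callback that removes up to `batchSize` obsolete rows for closed channels and returns how many it deleted:

```scala
import akka.actor.typed.Behavior
import akka.actor.typed.scaladsl.Behaviors
import scala.concurrent.duration.DurationInt

object HtlcInfoCleaner {
  sealed trait Command
  private case object Tick extends Command

  def apply(deleteBatch: Int => Int, batchSize: Int = 50000): Behavior[Command] =
    Behaviors.setup { context =>
      Behaviors.withTimers { timers =>
        // wake up at regular intervals instead of deleting everything
        // synchronously when the closing transaction confirms
        timers.startTimerAtFixedRate(Tick, 5.minutes)
        Behaviors.receiveMessage { case Tick =>
          // delete a bounded batch so no single query holds a DB
          // connection for minutes
          val deleted = deleteBatch(batchSize)
          if (deleted > 0) context.log.info(s"removed $deleted obsolete htlc_infos rows")
          Behaviors.same
        }
      }
    }
}
```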

t-bast added further commits with the same message that referenced this issue on Jul 4, Sep 21, Sep 28, Oct 4, Nov 10, and Nov 23, 2023, and on Jan 18, 2024.
t-bast added a commit that referenced this issue Feb 14, 2024
When a channel is closed, we can forget the data from historical HTLCs
sent and received through that channel (which is otherwise required to
punish cheating attempts by our peer).

We previously synchronously removed that data from the DB when the closing
transaction confirmed. However, this could create performance issues as
the `htlc_infos` table can be very large for busy nodes and many concurrent
writes may be happening at the same time.

We don't need to get rid of this data immediately: we only want to remove
it to avoid degrading the performance of active channels that read and
write to the `htlc_infos` table. We now mark channels as closed in a
dedicated table, and run a background actor that deletes batches of
obsolete htlc data at regular intervals. This ensures that the table is
eventually cleaned up, without impacting the performance of active
channels.

When a splice transaction confirms, all the revoked commitment transactions
that only applied to the previous funding transaction cannot be published
anymore, because the previous funding output has already been spent.

We can thus forget all the historical HTLCs that were included in those
commitments, because we will never need to generate the corresponding
penalty transactions.

This ensures that the growth of our DB is bounded, and will shrink every
time a splice transaction is confirmed.

Fixes #2610, #2702 and #2740
@t-bast (Member) commented Feb 14, 2024

Fixed by #2705

@t-bast closed this as completed on Feb 14, 2024