
OOM reaping: Tx spam using large (~100kb) transactions results in excessive memory usage #77

Open
chainum opened this issue Oct 1, 2019 · 1 comment



chainum commented Oct 1, 2019

Describe the bug
Tx spam using large transactions (~100kb) results in excessive node memory usage.

This could happen because of:

  • Unbounded/non-pooled Go goroutines eventually spinning out of control and crashing the nodes, and/or
  • Excessive in-memory allocation of certain arrays/slices/queues (the pendingTransactions queue is especially interesting).

This issue has already been reported to @harmony-ek in the P-OPS Telegram channel, but after a discussion with @AndyBoWu on Discord today I was asked to open an issue for it.

There's already a related open issue for the unbounded/non-pooled goroutines here: harmony-one/harmony#1645

It's also very possible that the slices/arrays that track pending transactions, cx receipts etc. are part of the problem. Some of them (especially pendingTransactions) store all of the tx data in memory, so if people routinely spam ~100kb transactions the node process's memory consumption grows very quickly.

Assuming a 20k tx pool limit for pending transactions (I've routinely seen Pangaea node operators with 15-17k pending transactions in their queues), some nodes might end up storing gigabytes of data in memory for that queue alone: 15k pending transactions (including all of the embedded base64 tx data) could theoretically amount to ~1.5gb of in-memory data, given all transactions are ~100kb.

Add the other slices/arrays/queues to the mix, coupled with the unbounded goroutines, and it's probably no surprise that even the explorer m5.large instances (with 7.5gb of available memory) experience OOM reaping.

After restarting the harmony process on an m5.large explorer node, the process typically only stays alive for 15-20 minutes before getting OOM-reaped by the OS. It usually caps out at 7.1-7.2gb of memory before the OS reaps it and a trace log is written, e.g: https://gist.github.com/SebastianJ/ad569b1ce48742b2a06117d6c273fa3a

(The trace log seems to indicate that unbounded goroutines are a major issue.)

To Reproduce
Steps to reproduce the behavior:

  1. Spam the network with transactions with a lot of base64 embedded data (90-100kb), e.g: https://gist.githubusercontent.com/SebastianJ/50c1405109d64651e13958d82eae112c/raw/fbbbdf598dd1bfd533f4d944f10f0176f71cb8c2/HugeTxExample (just put it in a loop or something to make sure it constantly spams the network)
  2. Let the network start processing these transactions
  3. Wait for nodes to start getting OOM-reaped (explorer nodes seem to get OOM-reaped much earlier than regular nodes with less memory - guessing this is because explorer nodes perform more processing- and memory-intensive tasks)

Expected behavior
The network should be able to cope with a massive amount of transactions, both in terms of the number of transactions and their size in bytes.

Environment (please complete the following information):
Explorer nodes:

  • OS: Linux 4.14.77-81.59.amzn2.x86_64 #1 SMP Mon Nov 12 21:32:48 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
  • Instance type: m5.large (7.5gb available memory)
  • Harmony binary version: Harmony (C) 2019. harmony, version v4696-pangaea-20190924.0-0-ge3030c50 (ec2-user@ 2019-09-24T23:30:06+0000)

Additional info
Just as an experiment, I upgraded all explorer nodes to use systemd units to start the harmony binary so that they auto-restart after getting OOM-reaped.

So far that has only worked for the shard 1 explorer. None of the other explorer nodes manage to sync or display blocks properly on the explorer Web UI - they seem to get stuck in a perpetual state of trying to sync and then losing the sync status when they get OOM-reaped. Shard 1 somehow manages to get past this state.

There's also a related issue regarding large transactions and the Web UI here: harmony-one/harmony#1676


chainum commented Oct 11, 2019

Potentially fixed by harmony-one/harmony#1710

Need to perform a stress test on a network built using that patch to verify.
