Protobuf: buffer underflow #729
Comments
What is on the other side of the privval connection? You're the first to report an error like this in many years of TMKMS existing, so it seems like the other side is the likely culprit. |
It's a testnet (Sei Network). |
Hi, I can confirm that we have the same problem, afaik "buffer underflow" indicates an attempt to read a message that is larger than the amount of data available in the buffer (on tmkms side) |
It indicates the data is misframed. It seems very unlikely that this is a TMKMS bug; more likely something is wrong with the specific app you are trying to run. |
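For readers unfamiliar with the term, the sketch below illustrates how a length-prefixed frame produces an underflow when fewer bytes are available than the prefix announces. It is a minimal illustration in Python, not TMKMS code: the names `BufferUnderflow`, `encode_varint`, `read_varint`, and `read_frame` are hypothetical, and the framing (a varint length prefix followed by the payload) is an assumption made for the example.

```python
from typing import Tuple


class BufferUnderflow(Exception):
    """Raised when a decoder needs more bytes than the buffer holds."""


def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # continuation bit: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def read_varint(buf: bytes, pos: int = 0) -> Tuple[int, int]:
    """Decode a varint starting at pos; return (value, position after it)."""
    result = 0
    shift = 0
    while True:
        if pos >= len(buf):
            raise BufferUnderflow("buffer ended inside a varint")
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def read_frame(buf: bytes) -> bytes:
    """Return the payload of one length-prefixed frame, or raise on underflow."""
    length, pos = read_varint(buf)
    available = len(buf) - pos
    if available < length:
        raise BufferUnderflow(
            f"frame announces {length} bytes but only {available} are available"
        )
    return buf[pos:pos + length]


payload = b"x" * 300
framed = encode_varint(len(payload)) + payload

assert read_frame(framed) == payload  # the complete frame decodes cleanly
try:
    read_frame(framed[:100])          # the same frame, cut short
except BufferUnderflow as err:
    print("underflow:", err)
```

Both explanations given above fit this picture: the receiver either holds too few bytes for a correctly framed message, or it is reading a length from the wrong position in the stream.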
I'm not a Horcrux user, but it works fine in this chain too. |
I need a much better description of the problem than this, and instructions on how to reproduce it. Ideally it would be nice to have a dump of the incoming Protobuf which TMKMS can't parse. What chain are you running? How did you install it? How are you running it? Which message is causing the problem? Can you link to the Protobuf Schema for it? |
Expanding a bit on this issue, we, too, faced this issue on Sei testnet ( From what I understand, Sei has its own fork of tendermint - https://github.com/sei-protocol/sei-tendermint. So I believe the schema is https://github.com/sei-protocol/sei-tendermint/tree/main/proto, but I'm not 100% sure. As for us, we were running tmkms as usual, as a separate process, and pointing |
I'm curious if they're expecting to use gRPC for the We don't support that yet: #73. We've been waiting for it to be added to upstream Tendermint, since it was removed in v0.35. What version of Horcrux is it working with? |
My idea is to prepare you an environment with everything configured (chain/validator/tmkms) and give you access. |
That would be very helpful, thank you |
I'm getting that server ready for you. |
I will attempt to contact you via Discord |
Changed permissions... Could you try again with Discord ( |
Request sent |
Hey @tony-iqlusion, any update on this issue? Let me know how I can help with it, thanks. |
Note: regarding the previous tests and SSH access, the devnet is now suspended, so we have to use a testnet or mainnet chain. |
I haven't had time to look, sorry. Note that there's a Hub upgrade today and a Stride upgrade tomorrow so I don't have much time this week either. What would be very, very helpful is if someone could log the message that TMKMS can't parse for inspection. |
Should we increase the debug level? Thanks |
You can try adding a Out of curiosity, do you have |
Thanks, I will try later in testnet
Yes |
Here: https://pastebin.com/4wDMbwP3 ERROR tmkms::client: [atlantic-2@tcp://127.0.0.1:26658] protocol error: malformed message packet: failed to decode Protobuf message: buffer underflow |
Thanks, I believe this should be the message which is causing the issue encoded in hex: ba2b2ab72b0aa82b08201091a6cc0d20ffffffffffffffffff012a480a20a33e4740e0b974cf1e010e6794ff620487e52ad63718d77e46ba3a62e1adc406122408011220cf19b022e389f5cd701b17b800caf505171bb98ae967349da72513b2138c1820320c08f1fbf8a60610a7bab0920242220a20bb985439aebaf357fad861849528d82e763031abd79fcc87c3a305bc1485388442220a2039825ada697b2d6959d75fa711636c7627c450c87ed8fe1aeaba4f20e28d435042220a20d9c22318e69463191372f92f66e5bdbf859c9b0e52020753785df5f53ea4bacb42220a20fc7ae857c36f4ea2f56285281e283ccdd600f4934183574bf5fd0ebb4c278cf942220a20ab49231edaa3fec0921d42b85cf90ba61d5251144980fa16f62a842a201755ad42220a20a7a17a651bfc02228490723a752a96e9f8ac3d64d261377726e2cdb0ccc646e342220a203178335d0ebe7e0ce2905df5b23aa05b9295e09caf80ba45472d2f61ccc9fb2f42220a20887cddac11b90e7a71888fefec183600dd79b94e9c8c43fb24d66038970eb91242220a202bfaf4b02410e8ba736e0443d63543e855cb4fb1eb99522a66d764802392700e42220a209a3e9332f29019d03097bcf290f377939b6bf8ee6a9d1952920cb9433515ca9542220a20dd148712e5a43ca35a0ed3560d6e753598da4e2ff71bd16d9dd553eddc4adcde42220a207e1df69adb7f32259154e375e20e98cf0a1c95b9f78ef1349df3ab539ab615be42220a2070a1e82287b3775ab287f7fe4e803c9bce2b03cbb4d0dba83b86104a114cb6e542220a207783770f1b6f8bdb347a8a8f113ecc67842783b4afbebb608aced8a51bb29bce42220a2008f7c95794aec249d2eba33161199c28574bc77f13accff28e5d70c23c3ebbd642220a20b53a7d779ebd7d0778aa9bb66e4565fe80903cfac23268ec016f2cf00818be9042220a20fa2f7fe3daa69c0cdd44058a44734bb6d44fce3e0c4db585e77a5190869137a142220a20f26cd81b78fa41bd46df409302c976f3451b3f0065a0e1e3e8dcb9947fd7d07c42220a20d002698eebbfd05a20492eaae6ed769068fac6cf0f3a55b6869ae8cacf5b92bf42220a20c1a848f44dfc893511262c49dd48072eb50472d49398c44b27044a028c76113b42220a2089c0affd2d4c2151cc654eebec6f0d29302ed611664965616e534e3b854eb43f42220a2084df4847e6d5998ac091ff0d6905400c3c7264855277e88cf06a184b81d2676b42220a2010e5cb10974a839fdd1584f1efc256d20fbf9d710c40ac4b968e23cb4927125b42220a20b12308f12fac8ecdc13b
84feba49bf617ae0fb084b8d5fa8e8c975ee7f38c7954a0052ad200890a6cc0d1a480a203dee68aaa35b98433eba6c0416ec13403e50e3a2afb6ed1beaca515bbd7a98fc I'll try to take a look and see if I can figure out what it's supposed to be |
If I plug that string into: https://protobuf-decoder.netlify.app/ ...it doesn't appear to have a valid field number, which would need to match one of these field types (1 - 8): ...but the decoder says that if that is a protobuf, its field number is 695, which would be a very unusual field number for a proto (they start at 1 and generally don't go much higher than 30). So I have no idea what that message is supposed to be, or how/why Horcrux would support it, because it doesn't seem to be encoded correctly as a proto as far as I can tell. It's possible Horcrux simply ignores messages it doesn't understand, whereas TMKMS considers it an error. |
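The field number 695 can be reproduced by hand from the first bytes of the dump, and the same bytes also admit a second reading. The sketch below decodes `ba2b2ab72b` (the first five bytes of the hex above) both ways. Treating the leading varint as a length prefix rather than a field tag is an assumption, not something established at this point in the thread; `read_varint` is a hypothetical helper written for this example.

```python
from typing import Tuple


def read_varint(buf: bytes, pos: int = 0) -> Tuple[int, int]:
    """Decode a base-128 varint at pos; return (value, position after it)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]  # raises IndexError if the buffer ends mid-varint
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


head = bytes.fromhex("ba2b2ab72b")

# Reading 1: the first varint is a field tag (field number << 3 | wire type).
first, pos = read_varint(head)
print("as a tag: field", first >> 3, "wire type", first & 0x7)

# Reading 2 (assumption): the first varint is a length prefix,
# and the real field tag and its length come right after it.
tag, pos = read_varint(head, pos)
inner_len, pos = read_varint(head, pos)
print("as a length prefix:", first, "bytes")
print("next tag: field", tag >> 3, "wire type", tag & 0x7, "length", inner_len)
```

Under the second reading, the outer value 5562 equals the inner length 5559 plus the 3 bytes of tag and length that precede it, and field 5 with wire type 2 (length-delimited) falls inside the 1 to 8 range mentioned above. That would point to a well-formed but large message rather than a corrupt one, though this is only an inference from the first few bytes.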
Does anyone happen to know what that message might be, or know where to ask about it? |
I don't really know either. If you could ignore this message and push the code to a new branch, I could compile tmkms on a machine and test it on the testnet. |
I found the issue, and our (Chorus One) testnet node now runs with a patched / hacked tmkms version; everything seems to be fine.
TL;DR
The issue is that the serialized block proposal ( resulting in the message always being cut, and thus malformed.
Proof of concept fix
I simply bumped the buffer size to:
Proper fix
Now I wonder whether this limit is actually correct in this context: because this is not strictly a p2p message (between peers in the tendermint network), but rather a
Now the question is how to fix it: shall I make the buffer length configurable or hardcode a higher value, or does anyone have a better idea? Another option could be simply using dynamic allocation via
@tony-iqlusion could you share your thoughts here? |
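The difference between the two approaches discussed here (a single read into a fixed-size buffer versus allocating based on the announced length) can be sketched as follows. This is an illustration only and not the TMKMS or tendermint-rs code: `FIXED_BUF_LEN` is an arbitrary value picked for the example (the real limit is not stated here), all function names are hypothetical, and a varint length prefix is assumed.

```python
import io

FIXED_BUF_LEN = 1024  # hypothetical size, chosen only for this illustration


def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


def read_varint_from(stream) -> int:
    """Read a varint length prefix one byte at a time from a stream."""
    result = 0
    shift = 0
    while True:
        chunk = stream.read(1)
        if not chunk:
            raise EOFError("stream ended inside the length prefix")
        result |= (chunk[0] & 0x7F) << shift
        if not chunk[0] & 0x80:
            return result
        shift += 7


def read_exact(stream, n: int) -> bytes:
    """Keep reading until exactly n bytes have arrived."""
    data = bytearray()
    while len(data) < n:
        chunk = stream.read(n - len(data))
        if not chunk:
            raise EOFError("stream ended before the full frame arrived")
        data += chunk
    return bytes(data)


def read_frame_fixed(stream) -> bytes:
    """One read into a fixed-size buffer: bytes past FIXED_BUF_LEN are cut off."""
    return stream.read(FIXED_BUF_LEN)


def read_frame_dynamic(stream) -> bytes:
    """Read the length prefix first, then exactly that many payload bytes."""
    return read_exact(stream, read_varint_from(stream))


payload = bytes(5559)  # stand-in for a large serialized proposal
framed = encode_varint(len(payload)) + payload

truncated = read_frame_fixed(io.BytesIO(framed))
complete = read_frame_dynamic(io.BytesIO(framed))
print(len(framed), len(truncated), len(complete))
```

A dynamic read like this would normally still enforce some upper bound on the announced length, so that a misbehaving peer cannot force an arbitrarily large allocation; where that bound should sit is exactly the open question raised above.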
@mkaczanowski sounds like a bug/issue in the |
issue created: informalsystems/tendermint-rs#1356 |
funny enough this value comes from tmkms: (git blame) |
@tony-iqlusion I've just tried again with this tag and I still get this error. Has anyone been able to test with SoftSign? |
Getting back to this issue. Today I discovered that
more specifically,
then it manages to connect to validator for several blocks, and then fails again |
Thanks for the feedback, I suspected I wasn't the only one with this problem. How do you use a Unix socket with tmkms? |
We use it for local connections only, so tmkms has to run on the same host (n.b. there are cons to this approach, as you may guess). So for the cosmos chain:
# config.toml
[priv-validator]
laddr = "unix://path/to/somewhere/kms.sock"
and in the tmkms config you set the same address:
[[validator]]
addr = "unix://path/to/somewhere/kms.sock"
(It's pretty late for me, so I hope I copy-pasted the correct thing.) Anyway, I will look into it shortly (on Wed or Thu) and try to make it work.
UPD: IIRC, the cosmos chain will create the socket, so tmkms will try to connect to it. I would also recommend cleaning up (i.e. removing) the socket on every (re)start. |
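The socket cleanup recommended above could be scripted roughly as follows, for example from a wrapper that runs before the node starts. This is a minimal sketch under assumptions: `remove_stale_socket` is a hypothetical helper, not part of tmkms or any chain binary, and the path is the placeholder from the config above.

```python
import os
import stat


def remove_stale_socket(path: str) -> bool:
    """Delete path if it is a leftover Unix socket; return True if removed."""
    try:
        mode = os.lstat(path).st_mode
    except FileNotFoundError:
        return False  # nothing to clean up
    if not stat.S_ISSOCK(mode):
        # Refuse to delete regular files or directories by accident.
        raise ValueError(f"{path} exists but is not a socket")
    os.unlink(path)
    return True


if __name__ == "__main__":
    # Placeholder path taken from the config snippet in this thread.
    removed = remove_stale_socket("path/to/somewhere/kms.sock")
    print("removed stale socket" if removed else "no socket to remove")
```

Checking the file type before unlinking is a deliberate choice here: a typo in the path should fail loudly rather than delete an unrelated file.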
I can reopen this issue, however really issues should be filed against |
Do you have any news on this? Let me know if I can help with anything. Unfortunately we run tmkms on a remote machine, so we can't use a Unix socket connection. Is there a short-term workaround? Thanks |
@IbrarMakaveli I filed informalsystems/tendermint-rs#1392 to request upstream help debugging this problem. What would be extremely helpful here is if someone could add reproduction instructions to that issue, especially if the issue is reproducible directly via the |
Can someone attempt to reproduce this on a fresh install, which should use |
Rolled out Using tcp connection within the same host |
@qezz can you confirm that |
let me check |
In the build log it says
|
Thanks, I reopened this issue: informalsystems/tendermint-rs#1392 (comment) |
We're running into the same issue very consistently with Initia's testnet when we turn on the oracle, which I believe adds a significant amount of data to the TMKMS requests.
Has anyone attempted to build with @zarkone's upstream PR? |
@datanexus-vincent did you try to build on top of: I think that shall fix your issue (as it did fix it for SEI). Though we haven't checked the Initia yet. @tony-iqlusion we'd appreciate your PR review :) |
@mkaczanowski I did once I realized it wasn't an upstream PR but a PR for this repo, and it worked! Thanks for the effort to get that working. |
We are seeing this issue on dYdX mainnet currently - I suspect it is related to the number of prices included in the vote extension, but I'm unsure. It manifests as the |
@tombeynon can you try this branch and see if it fixes your problem: #903 |
@tony-iqlusion Just switched over, I'll let you know tomorrow morning if it resolved it. Thanks! |
@tony-iqlusion no issues at all since switching to that branch, definitely seems to have solved the problem |
@tombeynon similarly to Sei, dydx has forked cometbft with custom proto declarations. It's not exactly the same as in case of Sei, but, according to data from you it should be something very similar. Sei situation described in this PR: sei-protocol/sei-tendermint#240 |
@zarkone Okay great, I'll communicate this with the dYdX team. That PR is very well detailed, thank you! |
@tombeynon thanks! Might make sense to confirm that it's the same and find out which .proto msg exactly triggers the underflow error. |
Chains with:
github.com/tendermint/tendermint@v0.37.0-dev
~350ms
Every ~40 signatures (softsign, due to the short block times) the connection goes down with this error:
protocol error: malformed message packet: failed to decode Protobuf message: buffer underflow
Let me know if you need any more details.
Thanks @tony-iqlusion