
ECONNRESET: aborted when pushing large multi-container builds #2768

Open
timwedde opened this issue Jun 14, 2024 · 8 comments
@timwedde

Expected Behavior

Pushing arbitrarily-sized multi-container builds to Balena builders works fine and creates a new image successfully.

Actual Behavior

When pushing large multi-container docker-compose projects to the Balena builders, the push operation fails in about 90% of cases with the error message below:

ECONNRESET: aborted

Error: aborted
    at TLSSocket.socketCloseListener (node:_http_client:462:19)
    at TLSSocket.emit (node:events:532:35)
    at TLSSocket.emit (node:domain:488:12)
    at node:net:338:12
    at TCP.done (node:_tls_wrap:659:7)

The behavior is not consistent:

  • There seems to be no correlation with specific builders, as far as I can tell
  • The build works slightly more consistently on my M2 MacBook Air than on my M1 Mac Studio, but not by much. The setups are almost the same, except the former uses Node 22 and the latter Node 20.
  • The aborted builds will often end up in the 'Releases' tab on Balena and will sometimes actually complete successfully. However, there is no way to know this without manually checking that tab every once in a while.

The command used to build is very simple:

balena push myFleet --release-tag description "debug" --draft

Here is one of the builds that failed, on the machine that has a slightly higher success rate:

❯ balena push myFleet --release-tag description "debug" --draft --debug
----------------------------------------------------------------------
[Warn] Node.js version "22.2.0" does not satisfy requirement "^20.6.0"
[Warn] This may cause unexpected behavior.
----------------------------------------------------------------------
[debug] new argv=[/opt/homebrew/Cellar/node/22.2.0/bin/node,/opt/homebrew/bin/balena,push,jetson-test,--release-tag,description,lpm debug,--draft] length=8
[debug] Deprecation check: 6.81196 days since last npm registry query for next major version release date.
[debug] Will not query the registry again until at least 7 days have passed.
[Debug]   Using build source directory: . 
(node:28123) [DEP0040] DeprecationWarning: The `punycode` module is deprecated. Please use a userland alternative instead.
(Use `node --trace-deprecation ...` to show where the warning was created)
[Debug]   Pushing to cloud for fleet: myFleet
[debug] Event tracking error: Timeout awaiting 'response' for 0ms
| Packaging the project source...[Debug]   Tarring all non-ignored files...
[Debug]   docker-compose.yml file found at "/Users/user/Documents/Work/project"
/ Packaging the project source...[Debug]   Tarring complete in 353 ms
[debug] Connecting to builder at https://builder.balena-cloud.com/v3/build?slug=gh_timwedde%2Fjetson-test&dockerfilePath=&emulated=false&nocache=false&headless=false&isdraft=true
\ Uploading source package to https://builder.balena-cloud.com[debug] received HTTP 200 OK
[debug] handling message: {"type":"metadata","resource":"buildLogId","value":"3047436"}
[debug] handling message: {"message":"\u001b[36m[Info]\u001b[39m         Starting build for myFleet, user gh_timwedde"}
[Info]         Starting build for myFleet, user gh_timwedde
[debug] handling message: {"message":"\u001b[36m[Info]\u001b[39m         Dashboard link: https://dashboard.balena-cloud.com/apps/ID/devices"}
[Info]         Dashboard link: https://dashboard.balena-cloud.com/apps/ID/devices
ECONNRESET: aborted

Error: aborted
    at TLSSocket.socketCloseListener (node:_http_client:462:19)
    at TLSSocket.emit (node:events:532:35)
    at TLSSocket.emit (node:domain:488:12)
    at node:net:338:12
    at TCP.done (node:_tls_wrap:659:7)

For further help or support, visit:
https://www.balena.io/docs/reference/balena-cli/#support-faq-and-troubleshooting


[debug] Timeout reporting error to sentry.io

Steps to Reproduce the Problem

Hard to say, I don't know if this is generally reproducible. This seems to occur with larger multi-container builds though.
My particular project is massive (in terms of final Docker image sizes, at least), ending up at about 40-50GB. This is bad and I'm aware of it, but since I'm building for a Jetson and need multiple distinct containers that make use of GPU acceleration, I have to ship the entire driver stack several times, which bloats the image sizes by a lot.

I'm assuming I'm getting kicked off the builders because of cache or image sizes, but the error message is not clear about this, nor could I find any documented hard limits, so I'm a bit confused as to the source of the issue.
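
As a side note, one generic Docker pattern that can reduce this kind of duplication (sketched below with hypothetical directory and image names; this is plain Docker layering, not a balena-specific feature, and I haven't verified how balena's builders cache it) is to put the driver stack into a single base image that every service's Dockerfile starts `FROM`:

```yaml
# Hypothetical docker-compose.yml layout. Each service's Dockerfile
# begins with `FROM gpu-base:local`, so the heavy driver layers exist
# once in the layer cache and registry instead of once per service.
services:
  inference:
    build: ./inference   # Dockerfile: FROM gpu-base:local, then the app
  camera:
    build: ./camera      # Dockerfile: FROM gpu-base:local, then the app
```

Whether the shared layers actually help with the builder-side limits here is unknown; this only addresses the duplicated driver stack, not the connection resets.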

Specifications

  • balena CLI version: 18.2.4
  • Cloud backend: balenaCloud?
  • Operating system version: macOS 14.5
  • 32/64 bit OS and processor: 64-bit OS, ARM processor
  • Install method: Executable installer
@joshuaxdmb

Getting the same issue here on Apple M2 Pro. balena push suddenly stopped working a few weeks ago. I've been using balena build and balena deploy ever since.

@timwedde
Author

Intermediary status update: This is still an issue for me.

@timwedde
Author

timwedde commented Jul 3, 2024

Intermediary status update: This is still an issue for me.
It's pretty bad right now: about 80% of my push attempts fail, so I waste a lot of time re-running the command until it eventually succeeds.
Alas, building and pushing locally is also prohibitive: the containers are large, the local method seemingly does no delta pushes, and my internet is somewhat slow.
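
In the meantime, re-running the push by hand can at least be automated. A minimal sketch of a retry wrapper (the `retry_push` function is hypothetical, not part of the balena CLI; fleet name and tag below are placeholders):

```shell
#!/usr/bin/env sh
# Hypothetical retry wrapper: re-runs a command until it exits 0 or the
# attempt limit is reached. This does not fix the underlying ECONNRESET,
# it only removes the manual re-running.
retry_push() {
  max=$1; shift
  attempt=1
  while [ "$attempt" -le "$max" ]; do
    if "$@"; then
      return 0            # command succeeded
    fi
    echo "attempt $attempt/$max failed, retrying..." >&2
    attempt=$((attempt + 1))
    sleep 1               # brief pause between attempts
  done
  return 1                # all attempts failed
}

# Usage (placeholders):
# retry_push 5 balena push myFleet --release-tag description "debug" --draft
```

Note that on this particular failure mode a retry may create several 'phantom' builds server-side, so it is a workaround at best.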

@timwedde
Author

timwedde commented Jul 5, 2024

A new bit of information emerges: when the balena CLI tells me that the build aborted due to a connection drop/reset, the build still ends up on balenaCloud. It seemingly keeps running, but since I've lost the connection to the builder, I'm unable to see any logs. Release tags are also not applied (presumably because tagging happens after the build completes), so it's a weird half-state of kind of working, but not really. It would be nice if this behavior were consistent, seeing as pushing is one of the fundamental capabilities of the platform. I have not yet been able to test whether a build that completes in this manner is actually capable of running on a device.

Edit: The 'phantom build' seems to be stuck in the 'Running' state forever and never finishes, so I guess that's not really useful.

@timwedde
Author

timwedde commented Jul 8, 2024

Intermediary status update: This is still an issue for me.
Is anybody actually triaging these? The repo has been filled with automatic dependency-bump PRs for months, with little to no human activity in the mix. The same goes for many of the issues here: most have no replies at all, and when they do, it's often other users rather than anybody from Balena.
Is there a better way to reach a human about these issues? Currently it feels a bit like the dead internet theory, just scoped to Balena's GitHub organization.

I suspect this is an issue with the builder system, so making a PR to fix this behavior is next to impossible. If I can find some time, I'll try to dig through the source code myself, though I expect it'll be challenging without any help, and if the problem is in the build system itself, then we're kind of powerless here.

@otaviojacobi
Contributor

Hello @timwedde, I am sorry you are having issues with our builders, and yes, we have other people reporting similar problems on the forums.

We currently use the builders to build our own Docker images, which are fairly large, and we have never faced this issue (our images get no priority; we use the same build system as you do). So I don't think this is directly an issue with image sizes, but rather something specific in a few docker-compose setups that causes the intermittency.

I also just finished running a script that did 100 pushes of different images (with different sizes), and I could not reproduce the problem. Is there any way you could share an example docker-compose file plus resources with which you can reproduce the behaviour?

@timwedde
Author

timwedde commented Jul 9, 2024

Yup, I'll work on creating a reproducible example that I'm able to share, and I'll post here again once I have something! Thanks for responding, much appreciated :)

@timwedde
Author

Sorry for the long absence; things got rather busy at work for a bit, so I didn't have time to work on an MWE for this.
I have to say, though, that recent pushes have been 'flaky' in a good way: the builds go through more often than they used to, which is already nice.
We're also now migrating away from Balena, so unfortunately this has been pushed down the list of priorities. I'll write again if I can reproduce this reliably, but at the current point in time, while it's definitely not fixed, it works well enough to survive the migration, at least.
