Retry getting a free port on macos #418

Open
wants to merge 1 commit into base: master

@reimai reimai commented Mar 6, 2025

Hi, we're having trouble with occasionally unreliable macOS sockets in our bazel monorepo, protoc-bridge version 0.9.8. We have about 3k scala proto targets, and about once a week a socket problem causes a build to get stuck with:

INFO: From ProtoScalaPBRule some_path/my_file_proto_scala_scalapb.srcjar:
Socket conflict on port 50356, you're gonna get an error <-- this line is added by me with a patched version of protoc-bridge
ERROR: /Users/reimai/some_path/BUILD:116:14: scala //some_path:my_file_proto failed: (Exit 1): scalac failed: error executing Scalac command (from target //some_path:my_file_proto) bazel-out/darwin_arm64-opt-exec-ST-a828a81199fe/bin/external/io_bazel_rules_scala/src/java/io/bazel/rulesscala/scalac/scalac '--jvm_flag=-Xss8m' '--jvm_flag=-Djava.security.manager=allow' ... (remaining 1 argument skipped)
Must have input files from either source jars or local files.
java.lang.RuntimeException: Must have input files from either source jars or local files.
	at io.bazel.rulesscala.scalac.ScalacWorker.work(ScalacWorker.java:66)
	at io.bazel.rulesscala.worker.Worker.persistentWorkerMain(Worker.java:86)
	at io.bazel.rulesscala.worker.Worker.workerMain(Worker.java:39)
	at io.bazel.rulesscala.scalac.ScalacWorker.main(ScalacWorker.java:33)
INFO: Elapsed time: 503.350s, Critical Path: 67.27s
INFO: 46560 processes: 26563 internal, 12419 darwin-sandbox, 7578 worker.
ERROR: Build did NOT complete successfully

It's non-retryable, though deleting an empty bazel-out/darwin_arm64-fastbuild/bin/some_path/my_file_proto_scala_scalapb.srcjar and rebuilding w/o remote cache helps.

This bug is also causing other problems, such as:

--scala_out: protoc-gen-scala: Plugin output is unparseable: \000\000\030\004\000\000\000\000\000\000\004\000@\000\000\000\005\000@\000\000\000\006\000\000@\000\376\003\000\000\000\001\000\000\004\010\000\000\000\000\000\000?\000\001

and

my_file.proto: is a proto3 file that contains optional fields, but code generator protoc-gen-scala hasn't been updated to support optional fields in proto3. Please ask the owner of this code generator to support proto3 optional.

but these could be fixed by a retry.

I've added a hack that gets a free socket more reliably by simply retrying new ServerSocket(0). The socket availability check is copied from the socket stress test. This hack passes 100k iterations of the stress test (actually, even 3 retries would be sufficient) and helps our team with proto-on-mac problems.
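The retry idea described above can be sketched roughly as follows. This is a minimal illustration, not the code in this pull request: the object name `PortRetry`, the method `openWithRetry`, and the `hasConflict` callback are hypothetical, and the callback stands in for whatever availability check is used (for example, the lsof-based one).

```scala
import java.io.IOException
import java.net.ServerSocket
import scala.annotation.tailrec

object PortRetry {
  /** Opens a ServerSocket on an OS-assigned port and retries while
    * `hasConflict` (a hypothetical hook, e.g. an lsof-based check)
    * reports that the port is also held by another process.
    */
  def openWithRetry(maxAttempts: Int)(hasConflict: Int => Boolean): ServerSocket = {
    @tailrec
    def loop(attempt: Int): ServerSocket = {
      val socket = new ServerSocket(0) // port 0 asks the OS to pick a free port
      if (!hasConflict(socket.getLocalPort)) socket
      else {
        socket.close() // release the contested port before trying again
        if (attempt >= maxAttempts)
          throw new IOException(s"No conflict-free port after $maxAttempts attempts")
        loop(attempt + 1)
      }
    }
    loop(1)
  }
}
```

A caller would pass its own check, for instance `PortRetry.openWithRetry(3)(port => myConflictCheck(port))`, where `myConflictCheck` is again an assumed name.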

Would you accept it as a temporary solution? There has not been much action on the mac problem for half a year, and the problem is still visible to end users, although it's not as frequent as it was with named pipes.

@thesamet
Contributor

thesamet commented Mar 6, 2025

Thanks for looking into this. Trying to understand the failure/conflict detection: when new ServerSocket(0) fails to allocate a port what happens? Isn't there an exception we can retry on? Are there alternatives to invoking lsof - for example, maybe we can do a test connection to the socket and send something unique to it like the pid?
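The test-connection idea could look something like the sketch below. It is only an illustration under stated assumptions: `SelfProbe` and `reachesSelf` are hypothetical names, a random UUID is used as the unique payload in place of the pid, and whether this probe actually catches the macOS conflict discussed in this thread is untested.

```scala
import java.io.{DataInputStream, DataOutputStream, IOException}
import java.net.{InetAddress, InetSocketAddress, ServerSocket, Socket}
import java.util.UUID

object SelfProbe {
  /** Returns true if a client connecting to `server`'s port on the loopback
    * address reaches `server` itself, verified with a one-off random token.
    * Returns false on a timeout, an I/O error, or a token mismatch.
    */
  def reachesSelf(server: ServerSocket, timeoutMs: Int = 500): Boolean = {
    val token = UUID.randomUUID().toString
    val previousTimeout = server.getSoTimeout
    val client = new Socket()
    try {
      // The kernel completes the connection into the listen backlog,
      // so connecting before accept() works on a single thread.
      client.connect(
        new InetSocketAddress(InetAddress.getLoopbackAddress, server.getLocalPort),
        timeoutMs)
      val out = new DataOutputStream(client.getOutputStream)
      out.writeUTF(token)
      out.flush()

      server.setSoTimeout(timeoutMs) // accept() now throws SocketTimeoutException on timeout
      val accepted = server.accept()
      try {
        accepted.setSoTimeout(timeoutMs)
        new DataInputStream(accepted.getInputStream).readUTF() == token
      } finally accepted.close()
    } catch {
      case _: IOException => false // includes SocketTimeoutException
    } finally {
      client.close()
      server.setSoTimeout(previousTimeout) // restore the caller's timeout
    }
  }
}
```

If another process receives the connection instead, `accept()` on our socket should time out and the probe returns false, at the cost of up to `timeoutMs` per conflicting port.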

@reimai
Author

reimai commented Mar 6, 2025

No exception, it just fails (or not) later. Actually, sometimes I detect a port conflict via lsof and get a correct build afterwards. But other times it causes an empty srcjar (no classes, just a manifest) or unparseable plugin output. It might depend on the other process using the same port.

I've tried checking the port by running netcat (the nc command), but it could return other processes' ports too.

@thesamet
Contributor

thesamet commented Mar 6, 2025

I'd prefer a solution that tests the socket locally without lsof or other external processes. Whichever solution we pick, performance is important too, so it would be useful to benchmark and compare through the stress tests.

@reimai
Author

reimai commented Mar 7, 2025

I've tried to detect the conflict without calling lsof, but no luck.
As an example, here I can open a new ServerSocket on port 63342, which is already used by idea:

scala> new java.net.ServerSocket(63342).getChannel()
val res1: java.nio.channels.ServerSocketChannel = null
> lsof -i :63342
COMMAND   PID   USER   FD   TYPE             DEVICE SIZE/OFF NODE NAME
idea    42473 reimai   28u  IPv6 0x6ffb33580b2cf7a2      0t0  TCP localhost:63342 (LISTEN)
java    57922 reimai   18u  IPv6 0x92f8736a1234fe65      0t0  TCP *:63342 (LISTEN)

An attempt to open it one more time from another scala repl fails:

scala> new java.net.ServerSocket(63342).getChannel()
java.net.BindException: Address already in use
  at java.base/sun.nio.ch.Net.bind0(Native Method)
  at java.base/sun.nio.ch.Net.bind(Net.java:565)
  at java.base/sun.nio.ch.Net.bind(Net.java:554)
  at java.base/sun.nio.ch.NioSocketImpl.bind(NioSocketImpl.java:636)
  at java.base/java.net.ServerSocket.bind(ServerSocket.java:389)
  at java.base/java.net.ServerSocket.<init>(ServerSocket.java:276)
  at java.base/java.net.ServerSocket.<init>(ServerSocket.java:170)
  ... 30 elided

So it looks like idea opened its socket for non-exclusive use here.
It's not the only example; other programs open such sockets too.

I know retrying with lsof is a rather ugly hack, but for now I can't find anything better.

@thesamet
Contributor

thesamet commented Mar 7, 2025

In the first code snippet it shows that getChannel returned null - can't that be used for failure detection?
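For context on this question: the JDK documentation for `ServerSocket.getChannel` states that a server socket has a channel if, and only if, the channel was created via `ServerSocketChannel.open`. A socket built with `new ServerSocket(port)` would therefore be expected to return null whether or not the port is contested. The short sketch below illustrates the difference; `ChannelCheck` and its method names are hypothetical.

```scala
import java.net.{InetSocketAddress, ServerSocket}
import java.nio.channels.ServerSocketChannel

object ChannelCheck {
  /** Channel of a socket created directly through the ServerSocket constructor. */
  def channelOfPlainSocket(): Option[ServerSocketChannel] = {
    val socket = new ServerSocket(0)
    try Option(socket.getChannel) // Option(null) becomes None
    finally socket.close()
  }

  /** Channel of a socket obtained from a ServerSocketChannel. */
  def channelOfNioSocket(): Option[ServerSocketChannel] = {
    val channel = ServerSocketChannel.open()
    try {
      channel.bind(new InetSocketAddress(0)) // port 0: OS-assigned port
      Option(channel.socket().getChannel)
    } finally channel.close()
  }
}
```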
