Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use osx.13.arm64.open queue for osx testing #112647

Merged
merged 3 commits into from
Feb 27, 2025
Merged

Conversation

MichalStrehovsky
Copy link
Member

The osx 12 queue only has a couple machines servicing it. Looks like osx 12 is already out of support by Apple so dnceng scaled it down?

This queue is the main reason why runtime-nativeaot-outerloop almost never succeeds despite my weeks long effort to get it green. E.g (https://helix.dot.net/api/jobs/226c27df-153e-4f75-9003-5dd5f5a355a2/workitems/System.Diagnostics.Tools.Tests?api-version=2019-06-17):

  "Queued": "2025-02-17T23:02:28.573+00:00",
  "Started": "2025-02-18T05:34:09.794+00:00",
  "Finished": "2025-02-18T05:34:12.732+00:00",
  "Delay": "06:31:41.2210000",
  "Duration": "00:00:02.9380000",

Six and a half hour wait time so we can run a three second job.

Per dotnet/dnceng#3008 (comment) the recommendation is to move to osx 14 queues but I'm trying with osx 13 since I assume our preference is as old as possible.

Cc @dotnet/runtime-infrastructure

@Copilot Copilot bot review requested due to automatic review settings February 18, 2025 08:31
@dotnet-issue-labeler dotnet-issue-labeler bot added the needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners label Feb 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

@MichalStrehovsky
Copy link
Member Author

/azp run runtime-nativeaot-outerloop

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@ivanpovazan
Copy link
Member

Should we bump the x64 variant as well?

@MichalStrehovsky
Copy link
Member Author

Should we bump the x64 variant as well?

I've not seen problems with that queue so my gut says "don't touch what's not broken". I don't know what schedule we follow in upgrading these. I just know that we had problems with this queue since december when the number of machines dropped from 30 to 10 (dotnet/dnceng#4734) and I don't see anyone doing anything about this so I'm just trying to do something.

@ivanpovazan
Copy link
Member

Ok, thanks.
We are experiencing timeouts on osx.1200.amd64 as well, for iOS platforms testing, so was wondering wether we should bump everything to osx.13 instead of gradually moving to the newer queue.

@am11
Copy link
Member

am11 commented Feb 18, 2025

x64 variant

Perhaps we should promote osx-arm64 to how much we are currently caring for (the declining) osx-x64 #108570 (Apple stopped producing x64 macs in 2023: second paragraph).

@@ -105,7 +105,7 @@ jobs:
# OSX arm64
- ${{ if eq(parameters.platform, 'osx_arm64') }}:
- ${{ if eq(variables['System.TeamProject'], 'public') }}:
- OSX.1200.ARM64.Open
- osx.13.arm64.open
- ${{ if eq(variables['System.TeamProject'], 'internal') }}:
- OSX.1200.ARM64
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bump this one too?

Suggested change
- OSX.1200.ARM64
- osx.13.arm64

@MichalStrehovsky
Copy link
Member Author

/azp run runtime-libraries-coreclr outerloop

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@MichalStrehovsky
Copy link
Member Author

System.Security.Cryptography.OpenSsl.Tests is running into SIGABRT on native AOT; that doesn't look very promising. I fired off libraries outerloop since by default we don't seem to test anything on arm64 macs and no other leg ran there.

@filipnavara
Copy link
Member

filipnavara commented Feb 18, 2025

System.Security.Cryptography.OpenSsl.Tests is running into SIGABRT on native AOT; that doesn't look very promising.

That's not entirely unexpected. macOS doesn't ship OpenSSL by default and the test prints No usable version of libssl was found which is followed by abort.

@am11
Copy link
Member

am11 commented Feb 18, 2025

It should be running this script:

- script: $(Build.SourcesDirectory)/eng/common/native/install-dependencies.sh ${{ parameters.osGroup }}
displayName: Install Build Dependencies

@akoeplinger
Copy link
Member

@am11 that is for the build machine, it doesn't apply to Helix

IIRC openssl was manually installed on these machines by dnceng?

@am11
Copy link
Member

am11 commented Feb 18, 2025

Yup, I meant test machine can use the same script to install runtime dependencies. We are using it in multiple places in addition to build machines.

@steveisok
Copy link
Member

@am11 that is for the build machine, it doesn't apply to Helix

IIRC openssl was manually installed on these machines by dnceng?

I pinged @ilyas1974 and he'll have it looked into.

@steveisok
Copy link
Member

Any reason why we can't bump to osx.15.arm64 and osx.15.arm64.open? We appear to have even more capacity there.

@MichalStrehovsky
Copy link
Member Author

Any reason why we can't bump to osx.15.arm64 and osx.15.arm64.open? We appear to have even more capacity there.

I assumed we have a preference for oldest possible macOS and then cross fingers about Apple's backwards compatibility. But I don't know our testing strategy for mac.

@EgorBo
Copy link
Member

EgorBo commented Feb 25, 2025

Hoping to see this merged soon. NativeAOT osx pipelines often timeout

@steveisok
Copy link
Member

@MichalStrehovsky @agocke, any chance we can skip the mac openssl tests for now and move this forward?

@MichalStrehovsky
Copy link
Member Author

@MichalStrehovsky @agocke, any chance we can skip the mac openssl tests for now and move this forward?

I don't know who is authoritative on tests that require openssl and whether it would be okay to have a test hole on arm64 mac temporarily - @dotnet/area-system-security @dotnet/ncl? I'm definitely not authoritative on this.

I assume we'll not be able to get openssl installed on the helix machines anytime soon and that's why you're asking?

@agocke
Copy link
Member

agocke commented Feb 27, 2025

We have no official support for OpenSSL on Mac so I think it’s fine to skip

@wfurt
Copy link
Member

wfurt commented Feb 27, 2025

I think it would be ok to skip openssl - especially if only on AOT. The Cryptography.OpenSsl only provides extra primitives AFAIK but I don't think it is widely used. Aside from that Quic/H3 also use OpenSSL but that is also another marginal use case IMHO. I think we should get OpenSSL from brew as usually check if everything works without AOT.

@MichalStrehovsky
Copy link
Member Author

I think it would be ok to skip openssl - especially if only on AOT.

The failure is not specific to AOT. This is also failing in runtime-libraries-coreclr-outerloop: https://dev.azure.com/dnceng-public/public/_build/results?buildId=955227&view=logs&jobId=436e3ba5-8535-542b-4a8e-476703290d56&j=a6a69bc3-4502-542e-928d-b29ce8b6245c&t=8f2f111f-0f49-524b-bd46-acb87857e7e1

We'd need to disable this on ARM64 macos everywhere (and this gives us a very good reason not to do #112647 (comment) so that we don't lose all macos coverage of this).

@MichalStrehovsky
Copy link
Member Author

/azp run runtime-libraries-coreclr outerloop

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@wfurt
Copy link
Member

wfurt commented Feb 27, 2025

It seems like new macOS does no load libraries from /usr/locacal/lib anymore (similar to nodejs/node#32740) so the brew package does not work any more.
We can perhaps make the tests conditional @bartonjs?

I tried to play with DYLD_LIBRARY_PATH but it seems like SIP strips that as well for security reasons.
Maybe @rolfbjarne has some recommendations how to fix it via entitlements. But it seems like realistically our users will face the very same problem.

@rolfbjarne
Copy link
Member

I tried to play with DYLD_LIBRARY_PATH but it seems like SIP strips that as well for security reasons.
Maybe @rolfbjarne has some recommendations how to fix it via entitlements. But it seems like realistically our users will face the very same problem.

Where is openssl supposed to be loaded from?

I have openssl installed:

$ brew info openssl
==> openssl@3: stable 3.4.1 (bottled)
Cryptography and SSL/TLS Toolkit
https://openssl-library.org
Installed
[...]

but nothing in /usr/local/lib:

$ ls -la /usr/local/lib
ls: /usr/local/lib: No such file or directory

As a sidenote, and without knowing any of the code involved, it seems to me that an improvement would be to make dotnet throw a managed exception instead of aborting if openssl can't be loaded. That would at the very least lead to better diagnostic results when running unit tests, and unit tests can be disabled/ignored individually if that's desired.

@akoeplinger
Copy link
Member

I just tried and it works on my arm64 mac if I explicitly add the symlink:

ln -s /opt/homebrew/lib/libssl.3.dylib /usr/local/lib

@wfurt
Copy link
Member

wfurt commented Feb 27, 2025

This is done on CM machines. There used to be brew command or it is done during onboarding

dotnet-dev@dci-macm2-build-046 ~ % ls -al /usr/local/lib
total 0
drwxr-xr-x  6 root  wheel  192 May 17  2023 .
drwxr-xr-x  6 root  wheel  192 Dec 17 15:35 ..
drwxr-xr-x  3 root  wheel   96 Apr  6  2021 dtrace
lrwxr-xr-x  1 root  wheel   63 May 17  2023 libcrypto.1.1.dylib -> /opt/homebrew/Cellar/openssl@1.1/1.1.1t/lib/libcrypto.1.1.dylib
lrwxr-xr-x  1 root  wheel   60 May 17  2023 libssl.1.1.dylib -> /opt/homebrew/Cellar/openssl@1.1/1.1.1t/lib/libssl.1.1.dylib
drwxr-xr-x  3 root  wheel   96 May 17  2023 node_modules

I have OpenSSL 3 on my system and it fails same way as CI

furt$ ls -al /usr/local/lib/libssl.*
lrwxr-xr-x@ 1 furt  staff  44 Nov  2 23:35 /usr/local/lib/libssl.3.dylib -> ../Cellar/openssl@3/3.4.0/lib/libssl.3.dylib
lrwxr-xr-x@ 1 furt  staff  38 Nov  2 23:35 /usr/local/lib/libssl.a -> ../Cellar/openssl@3/3.4.0/lib/libssl.a
lrwxr-xr-x@ 1 furt  staff  42 Nov  2 23:35 /usr/local/lib/libssl.dylib -> ../Cellar/openssl@3/3.4.0/lib/libssl.dylib

so I'm wondering iff this can still depend on SIP state or other OS configuration...???

@rolfbjarne
Copy link
Member

so I'm wondering iff this can still depend on SIP state or other OS configuration...???

It might depend on whether the dotnet binary is signed/notarized or not.

@akoeplinger did you execute using a locally built dotnet binary, or one from an official package?

@akoeplinger
Copy link
Member

akoeplinger commented Feb 27, 2025

I'm using official 9.0.100 sdk. I noticed that it works with running dotnet bin/Debug/app.dll but when I use PublishAot then it doesn't work. I need to explicitly add DYLIB_LIBRARY_PATH=/usr/local/lib and then it works with the NativeAOT one too. Signing with the entitlements from https://github.com/dotnet/runtime/blob/main/eng/native/entitlements.plist didn't change that.

Note: I'm not running the tests but a simple console app that uses var rsa = new RSAOpenSsl().

@am11
Copy link
Member

am11 commented Feb 27, 2025

As a sidenote, and without knowing any of the code involved, it seems to me that an improvement would be to make dotnet throw a managed exception instead of aborting if openssl can't be loaded. That would at the very least lead to better diagnostic results when running unit tests, and unit tests can be disabled/ignored individually if that's desired.

AFAIK, once kernel flags something as "security risk" it doesn't give a change for application to handle any signal, not even the debugger, hence the unpleasant sigabrt.

I have openssl installed:

I have the same, and I think helix should use the same setup. Having some one-off configuration on helix machine is not only weird but it defeats the purpose of "real world testing".

@filipnavara
Copy link
Member

hence the unpleasant sigabrt.

In this case it's an abort() call in our native code, not security triggered by OS.

@rolfbjarne
Copy link
Member

As a sidenote, and without knowing any of the code involved, it seems to me that an improvement would be to make dotnet throw a managed exception instead of aborting if openssl can't be loaded. That would at the very least lead to better diagnostic results when running unit tests, and unit tests can be disabled/ignored individually if that's desired.

AFAIK, once kernel flags something as "security risk" it doesn't give a change for application to handle any signal, not even the debugger, hence the unpleasant sigabrt.

Should be just a matter of returning an error condition here instead of calling +abort:

void InitializeOpenSSLShim(void)
{
if (!OpenLibrary())
{
fprintf(stderr, "No usable version of libssl was found\n");
abort();
}

@am11
Copy link
Member

am11 commented Feb 27, 2025

The logs do not show No usable version of libssl was found. Unless xunit is hiding it, it maybe aborting from dlopen call by the kernel.

@filipnavara
Copy link
Member

The logs do not show No usable version of libssl was found

image

@am11
Copy link
Member

am11 commented Feb 27, 2025

Ah, I'm blind never mind (had case sensitive search on) 🥲

@rolfbjarne
Copy link
Member

I'm using official 9.0.100 sdk. I noticed that it works with running dotnet bin/Debug/app.dll but when I use PublishAot then it doesn't work. I need to explicitly add DYLIB_LIBRARY_PATH=/usr/local/lib and then it works with the NativeAOT one too. Signing with the entitlements from main/eng/native/entitlements.plist didn't change that.

Note: I'm not running the tests but a simple console app that uses var rsa = new RSAOpenSsl().

So the problem seems to be twofold:

1: macoS doesn't look for libraries in /usr/local/lib anymore
2. DYLD_LIBRARY_PATH doesn't always work

The fix is to add an rpath to the native executable adding /usr/local/lib as a directory to look for dylibs.

This can be done either at build time (if the build actually creates a binary (as opposed to copying in a prebuilt binary), for instance with NativeAOT), or after the build by using the install_name_tool command line tool (this requires re-signing the executable afterwards).

Here's an example for both dotnet build and with NativeAOT: https://gist.github.com/rolfbjarne/8464957bad0bff762162e10322e1e214

@steveisok steveisok merged commit dfa71ee into main Feb 27, 2025
123 of 131 checks passed
@lewing lewing deleted the MichalStrehovsky-patch-6 branch February 27, 2025 23:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants