
Conversation

@mikemhenry (Contributor)

Checklist

  • Added a news entry

  • Developer Certificate of Origin

codecov bot commented Feb 28, 2025

Codecov Report

❌ Patch coverage is 10.00000% with 18 lines in your changes missing coverage. Please review.
✅ Project coverage is 92.51%. Comparing base (192b582) to head (8a8b718).
⚠️ Report is 239 commits behind head on main.

Files with missing lines Patch % Lines
openfe/tests/protocols/conftest.py 11.11% 16 Missing ⚠️
...enfe/tests/protocols/openmm_ahfe/test_ahfe_slow.py 0.00% 1 Missing ⚠️
...tests/protocols/openmm_rfe/test_hybrid_top_slow.py 0.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1170      +/-   ##
==========================================
- Coverage   94.66%   92.51%   -2.16%     
==========================================
  Files         143      143              
  Lines       10994    11012      +18     
==========================================
- Hits        10408    10188     -220     
- Misses        586      824     +238     
Flag Coverage Δ
fast-tests 92.51% <10.00%> (?)
slow-tests ?

Flags with carried forward coverage won't be shown.

☔ View full report in Codecov by Sentry.

@mikemhenry (Contributor Author) commented Feb 28, 2025

"NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running."

Good to know! Re-running now

@mikemhenry (Contributor Author)

The "large" runner worked but timed out after 12 hours (which we can raise to up to 1 week) -- I will try non-integration tests since AFAIK that is what @IAlibay is trying to run -- just the slow tests.

@IAlibay (Member) commented Mar 5, 2025

The "large" runner worked but timed out after 12 hours (which we can raise to up to 1 week) -- I will try non-integration tests since AFAIK that is what @IAlibay is trying to run -- just the slow tests.

Yeah, running the "integration" tests is probably overkill without a GPU.

@mikemhenry (Contributor Author)

large:

============================= slowest 10 durations =============================
2655.53s call     openfe/tests/protocols/test_openmm_equil_rfe_protocols.py::test_dry_run_complex_alchemwater_totcharge[benzoic_to_benzene_mapping-0-1-False-11-1-3]
2496.48s call     openfe/tests/protocols/test_openmm_equil_rfe_protocols.py::test_dry_run_complex_alchemwater_totcharge[benzoic_to_benzene_mapping-0-0-True-14-1-3]
2480.21s call     openfe/tests/protocols/test_openmm_equil_rfe_protocols.py::test_dry_run_complex_alchemwater_totcharge[benzene_to_aniline_mapping-0-1-False-11-4-1]
2453.59s call     openfe/tests/protocols/test_openmm_equil_rfe_protocols.py::test_dry_run_complex_alchemwater_totcharge[benzene_to_benzoic_mapping-0--1-False-11-3-1]
2337.46s call     openfe/tests/protocols/test_openmm_equil_rfe_protocols.py::test_dry_run_complex_alchemwater_totcharge[benzene_to_benzoic_mapping-0-0-True-14-3-1]
2298.25s call     openfe/tests/protocols/test_openmm_equil_rfe_protocols.py::test_dry_run_complex_alchemwater_totcharge[aniline_to_benzene_mapping-0-0-True-14-1-4]
2239.40s call     openfe/tests/protocols/test_openmm_equil_rfe_protocols.py::test_dry_run_complex[sams]
2214.30s call     openfe/tests/protocols/test_openmm_equil_rfe_protocols.py::test_dry_run_complex_alchemwater_totcharge[aniline_to_benzene_mapping-0--1-False-11-1-4]
2173.35s call     openfe/tests/protocols/test_openmm_equil_rfe_protocols.py::test_dry_run_complex[repex]
2111.31s call     openfe/tests/protocols/test_openmm_equil_rfe_protocols.py::test_dry_run_complex[independent]
=========================== short test summary info ============================
FAILED openfe/tests/utils/test_system_probe.py::test_probe_system_smoke_test - subprocess.CalledProcessError: Command '['nvidia-smi', '--query-gpu=gpu_uuid,gpu_name,compute_mode,pstate,temperature.gpu,utilization.memory,memory.total,driver_version,', '--format=csv']' returned non-zero exit status 9.
FAILED openfe/tests/protocols/test_openmm_rfe_slow.py::test_openmm_run_engine[CUDA] - openmm.OpenMMException: Error initializing CUDA: CUDA_ERROR_NO_DEVICE (100) at /home/conda/feedstock_root/build_artifacts/openmm_1726255919104/work/platforms/cuda/src/CudaContext.cpp:91
= 2 failed, 912 passed, 31 skipped, 2 xfailed, 3 xpassed, 1913 warnings, 3 rerun in 24749.25s (6:52:29) =

@mikemhenry (Contributor Author)

xlarge:

============================= slowest 10 durations =============================
2509.67s call     openfe/tests/protocols/test_openmm_equil_rfe_protocols.py::test_dry_run_complex[repex]
2237.81s call     openfe/tests/protocols/test_openmm_equil_rfe_protocols.py::test_dry_run_complex[sams]
2151.15s call     openfe/tests/protocols/test_openmm_equil_rfe_protocols.py::test_dry_run_complex[independent]
1884.45s call     openfe/tests/protocols/test_openmm_equil_rfe_protocols.py::test_dry_run_complex_alchemwater_totcharge[aniline_to_benzene_mapping-0-0-True-14-1-4]
1808.82s call     openfe/tests/protocols/test_openmm_equil_rfe_protocols.py::test_dry_run_complex_alchemwater_totcharge[benzene_to_aniline_mapping-0-1-False-11-4-1]
1451.05s call     openfe/tests/protocols/test_openmm_equil_rfe_protocols.py::test_dry_run_complex_alchemwater_totcharge[benzene_to_benzoic_mapping-0-0-True-14-3-1]
1449.02s call     openfe/tests/protocols/test_openmm_equil_rfe_protocols.py::test_dry_run_complex_alchemwater_totcharge[aniline_to_benzene_mapping-0--1-False-11-1-4]
1399.31s call     openfe/tests/protocols/test_openmm_equil_rfe_protocols.py::test_dry_many_molecules_solvent
1388.60s call     openfe/tests/protocols/test_openmm_equil_rfe_protocols.py::test_dry_run_complex_alchemwater_totcharge[benzoic_to_benzene_mapping-0-0-True-14-1-3]
1313.94s call     openfe/tests/protocols/test_openmm_equil_rfe_protocols.py::test_dry_run_complex_alchemwater_totcharge[benzene_to_benzoic_mapping-0--1-False-11-3-1]
=========================== short test summary info ============================
FAILED openfe/tests/utils/test_system_probe.py::test_probe_system_smoke_test - subprocess.CalledProcessError: Command '['nvidia-smi', '--query-gpu=gpu_uuid,gpu_name,compute_mode,pstate,temperature.gpu,utilization.memory,memory.total,driver_version,', '--format=csv']' returned non-zero exit status 9.
FAILED openfe/tests/protocols/test_openmm_rfe_slow.py::test_openmm_run_engine[CUDA] - openmm.OpenMMException: Error initializing CUDA: CUDA_ERROR_NO_DEVICE (100) at /home/conda/feedstock_root/build_artifacts/openmm_1726255919104/work/platforms/cuda/src/CudaContext.cpp:91
= 2 failed, 912 passed, 31 skipped, 2 xfailed, 3 xpassed, 1978 warnings, 3 rerun in 11132.77s (3:05:32) =

@mikemhenry (Contributor Author)

Better than a 2x improvement.
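
(That figure follows directly from the two wall-clock totals reported above; a quick check, using nothing beyond the reported numbers:)

```python
# Speedup from the wall-clock totals reported above
# ("large": 24749.25 s, "xlarge": 11132.77 s).
large_s = 24749.25   # 6:52:29
xlarge_s = 11132.77  # 3:05:32
print(f"speedup: {large_s / xlarge_s:.2f}x")  # ~2.22x
```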

@mikemhenry (Contributor Author)

Last check: going to see if the Intel flavor is any faster.

@IAlibay (Member) commented Mar 13, 2025

@mikemhenry what flags are you using for these CPU runners? --runslow or --integration too? 3h seems way too long for just the slow tests.

@mikemhenry (Contributor Author)

integration as well -- I wanted to get some benchmarking data on the integration tests without a GPU

@mikemhenry (Contributor Author) commented Mar 13, 2025

I actually turned off integration tests back in 98cec71

@mikemhenry (Contributor Author)

But you're right, that is kinda slow for just the slow tests.

@mikemhenry (Contributor Author)

Now the runners are running out of disk space when installing the env; I need to check whether new deps are making the env bigger or something else is going on. I can also increase the EBS volume size.

@mikemhenry (Contributor Author)

Sweet, getting:
FAILED openfe/tests/protocols/test_openmm_rfe_slow.py::test_openmm_run_engine[CUDA] - openmm.OpenMMException: Error initializing CUDA: CUDA_ERROR_NO_DEVICE (100) at /home/conda/feedstock_root/build_artifacts/openmm_1726255919104/work/platforms/cuda/src/CudaContext.cpp:91
But we expect that to fail. I am not sure why we are running this test, since we only have OFE_SLOW_TESTS: "true" with no integration tests turned on, and the test carries the @pytest.mark.integration mark.
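
(For context, env-var gating like this is usually wired up in conftest.py by skipping marked tests when the variable isn't set; a minimal sketch of that pattern, with the skip logic assumed rather than quoted from openfe's actual conftest:)

```python
# conftest.py -- sketch of env-var gating for marked tests. The variable names
# match the workflow (OFE_SLOW_TESTS / OFE_INTEGRATION_TESTS); the skip logic
# is an illustrative assumption, not openfe's real implementation.
import os

import pytest


def _enabled(var: str) -> bool:
    return os.environ.get(var, "false").lower() == "true"


def pytest_collection_modifyitems(config, items):
    skip_slow = pytest.mark.skip(reason="needs OFE_SLOW_TESTS=true")
    skip_integration = pytest.mark.skip(reason="needs OFE_INTEGRATION_TESTS=true")
    for item in items:
        if "slow" in item.keywords and not _enabled("OFE_SLOW_TESTS"):
            item.add_marker(skip_slow)
        if "integration" in item.keywords and not _enabled("OFE_INTEGRATION_TESTS"):
            item.add_marker(skip_integration)
```

If a test carrying @pytest.mark.integration still runs with only OFE_SLOW_TESTS set, a deselection step like the one above (or whatever openfe actually uses) would be the place to look.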

@mikemhenry (Contributor Author)

Timing info, btw:
= 5 failed, 936 passed, 28 skipped, 2 xfailed, 3 xpassed, 2010 warnings, 3 rerun in 11167.03s (3:06:07) =

@mikemhenry (Contributor Author)

@IAlibay how do you invoke the tests?
$ CUDA_VISIBLE_DEVICES="" pytest -n 2 -vv --durations=10 --runslow openfecli/tests/ openfe/tests/
This is taking more than minutes on my laptop.
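
(For anyone reproducing this: CUDA_VISIBLE_DEVICES="" hides the GPU from OpenMM, and -n 2 runs two pytest-xdist workers, so timings scale with core count. The --runslow flag is typically a custom pytest option; a minimal sketch of that common pattern, assumed rather than copied from openfe's conftest:)

```python
# conftest.py -- the usual pattern behind a --runslow flag (illustrative sketch).
import pytest


def pytest_addoption(parser):
    parser.addoption(
        "--runslow", action="store_true", default=False,
        help="also run tests marked @pytest.mark.slow",
    )


def pytest_collection_modifyitems(config, items):
    if config.getoption("--runslow"):
        return  # slow tests explicitly requested; run everything collected
    skip_slow = pytest.mark.skip(reason="use --runslow to run this test")
    for item in items:
        if "slow" in item.keywords:
            item.add_marker(skip_slow)
```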

@IAlibay (Member) commented May 30, 2025

@IAlibay how do you invoke the tests?
$ CUDA_VISIBLE_DEVICES="" pytest -n 2 -vv --durations=10 --runslow openfecli/tests/ openfe/tests/
This is taking more than minutes on my laptop.

Testing right now with CUDA_VISIBLE_DEVICES set.

@IAlibay (Member) commented May 30, 2025

@mikemhenry it runs in 35 mins for me

@mikemhenry (Contributor Author)

While we wait for that, looking at the slowest runs:

 ============================= slowest 10 durations =============================
2383.65s call     openfe/tests/protocols/openmm_rfe/test_hybrid_top_protocol.py::test_dry_run_complex_alchemwater_totcharge[benzene_to_aniline_mapping-0-1-False-11-4-1]
1811.27s call     openfe/tests/protocols/openmm_rfe/test_hybrid_top_protocol.py::test_dry_run_complex_alchemwater_totcharge[aniline_to_benzene_mapping-0--1-False-11-1-4]
1800.93s call     openfe/tests/protocols/openmm_rfe/test_hybrid_top_protocol.py::test_dry_run_complex_alchemwater_totcharge[benzene_to_benzoic_mapping-0-0-True-14-3-1]
1755.66s call     openfe/tests/protocols/openmm_rfe/test_hybrid_top_protocol.py::test_dry_run_complex_alchemwater_totcharge[benzoic_to_benzene_mapping-0-0-True-14-1-3]
1676.54s call     openfe/tests/protocols/openmm_rfe/test_hybrid_top_protocol.py::test_dry_run_complex_alchemwater_totcharge[benzene_to_benzoic_mapping-0--1-False-11-3-1]
1659.19s call     openfe/tests/protocols/openmm_rfe/test_hybrid_top_protocol.py::test_dry_run_complex_alchemwater_totcharge[aniline_to_benzene_mapping-0-0-True-14-1-4]
1619.34s call     openfe/tests/protocols/openmm_rfe/test_hybrid_top_protocol.py::test_dry_run_complex_alchemwater_totcharge[benzoic_to_benzene_mapping-0-1-False-11-1-3]
1465.10s call     openfe/tests/protocols/openmm_rfe/test_hybrid_top_protocol.py::test_dry_many_molecules_solvent
1314.54s call     openfe/tests/protocols/openmm_rfe/test_hybrid_top_protocol.py::test_dry_run_complex[repex]
1313.68s call     openfe/tests/protocols/openmm_rfe/test_hybrid_top_protocol.py::test_dry_run_complex[independent]

Should we move some of these tests to integration tests? I was thinking we could print all the test durations and see what it looks like to figure out which ones we should move. We could also mark them as needing a GPU or something. This reminds me of #1133
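
(One hypothetical way to do the "needing a GPU" marking -- the requires_gpu name and the example test are made up here for illustration, and the nvidia-smi check simply mirrors the failure mode in the logs above:)

```python
# Hypothetical GPU gate for the slowest protocol tests. "requires_gpu" is an
# illustrative name, not an existing openfe marker; the check mirrors the
# nvidia-smi failure seen on the CPU-only runners above.
import shutil
import subprocess

import pytest


def _gpu_available() -> bool:
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return False
    return subprocess.run([exe], capture_output=True).returncode == 0


requires_gpu = pytest.mark.skipif(not _gpu_available(), reason="no usable GPU found")


@requires_gpu
def test_dry_run_complex_on_gpu():
    ...
```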

@IAlibay (Member) commented May 30, 2025

While we wait for that, looking at the slowest runs: [...]

Should we move some of these tests to integration tests? [...] We could also mark them as needing a GPU or something. This reminds me of #1133

@mikemhenry a lot of these are significantly faster with #1131 - we were meant to use the output of this PR to test out that PR.

@IAlibay (Member) commented May 30, 2025

While we wait for that, looking at the slowest runs:

@mikemhenry see above - 35 mins locally.

@mikemhenry (Contributor Author)

Okay, so it sounds like we should merge this one in. @IAlibay @atravitz, can I get a review? And which AWS instance do we want to use?

@mikemhenry (Contributor Author)

While we wait for that, looking at the slowest runs:

@mikemhenry see above - 35 mins locally.

Was this with -n 2? We can profile more to figure it out, but maybe we merge it in, then use it to test #1131 and see what it looks like?

@IAlibay (Member) left a review comment


I'm going to block because it's Friday and I'm rather unsure as to what should be happening here.

@mikemhenry I think I just don't understand what we're trying to achieve here. The plan was for this to be a "fast" way to run slow tests. As it stands, this AWS CPU runner is slower than running the package-install tests (which also run the slow tests?) by nearly 3x. Unless I'm missing something, this seems kinda not worth it?

@github-actions

No API break detected ✅

@mikemhenry (Contributor Author)

I was using the wrong instance family for this. The T-series have "burst" CPUs but are not meant for sustained use. If we use a c7i.xlarge, it finishes in 1.32 hours (about twice as fast as a GitHub runner) and costs $0.1785 an hour, for a total cost of about $0.24.
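
(The total follows directly from the quoted runtime and on-demand rate:)

```python
# Cost check for the c7i.xlarge run quoted above.
runtime_h = 1.32          # reported wall-clock time in hours
rate_usd_per_h = 0.1785   # reported on-demand price
print(f"total: ${runtime_h * rate_usd_per_h:.2f}")  # -> total: $0.24
```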

@mikemhenry mentioned this pull request on Jun 5, 2025
@mikemhenry (Contributor Author)

@IAlibay are we happy with this now?

@mikemhenry (Contributor Author)

Offline we discussed that we will merge this in and then test Irfan's PR that speeds up CI, then figure out whether we want to optimize this further with a large instance.

@github-actions

No API break detected ✅


  OFE_SLOW_TESTS: "true"
  DUECREDIT_ENABLE: 'yes'
- OFE_INTEGRATION_TESTS: FALSE
+ OFE_INTEGRATION_TESTS: TRUE
@mikemhenry (Contributor Author) commented on the diff above:

Oh I think this should be false until we figure out the GPU stuff?

@github-actions

No API break detected ✅

@mikemhenry (Contributor Author)

https://github.com/OpenFreeEnergy/openfe/actions/runs/15592958046

testing it here before we merge it in

@mikemhenry merged commit 81cfa46 into main on Jun 12, 2025 (9 of 11 checks passed)
@github-actions

No API break detected ✅

@mikemhenry deleted the feat/test_larger_cpu_runner branch on June 12, 2025 at 17:40
