Have all drivers' poll method ignore out of memory errors #9178

Open
wants to merge 1 commit into
base: main
from

Conversation

jonathan-eq
Contributor

@jonathan-eq jonathan-eq commented Nov 12, 2024

Issue
Resolves #8976

Approach
This commit makes every driver's `driver.poll` method ignore an OSError with the message `Cannot allocate memory` raised by the poll subprocess call. This applies to all drivers except local, which does not poll.
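
As a rough illustration of the approach (not the code in this PR), ignoring only the out-of-memory case could look like the sketch below. It assumes the poll command is spawned with `asyncio.create_subprocess_exec`, as the linked issue title suggests. `run_poll_command` and its `spawn` parameter are hypothetical names, and the sketch matches on `errno.ENOMEM` rather than on the message text, which is a design assumption.

```python
import asyncio
import errno
import logging

logger = logging.getLogger(__name__)


async def run_poll_command(cmd, spawn=asyncio.create_subprocess_exec):
    """Run one poll command and return its stdout.

    Returns None when the subprocess could not be spawned because the
    system was out of memory. `run_poll_command` and `spawn` are
    hypothetical names used for illustration only.
    """
    try:
        process = await spawn(
            *cmd,
            stdout=asyncio.subprocess.PIPE,
            stderr=asyncio.subprocess.PIPE,
        )
    except OSError as err:
        # ENOMEM is the errno behind "Cannot allocate memory".
        if err.errno == errno.ENOMEM:
            logger.warning("Skipping this poll of %s: %s", cmd[0], err)
            return None
        # Any other OSError is still treated as a real failure.
        raise
    stdout, _ = await process.communicate()
    return stdout
```

A caller would treat `None` as "no new information this round" and try again on the next poll interval.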

  • PR title captures the intent of the changes, and is fitting for release notes.
  • Added appropriate release note label
  • Commit history is consistent and clean, in line with the contribution guidelines.
  • Make sure unit tests pass locally after every commit (git rebase -i main --exec 'pytest tests/ert/unit_tests -n logical -m "not integration_test"')

When applicable

  • When there are user facing changes: Updated documentation
  • New behavior or changes to existing untested code: Ensured that unit tests are added (See Ground Rules).
  • Large PR: Prepare changes in small commits for more convenient review
  • Bug fix: Add regression test for the bug
  • Bug fix: Create Backport PR to latest release

@jonathan-eq jonathan-eq added the release-notes:bug-fix Automatically categorise as bug fix in release notes label Nov 12, 2024
@jonathan-eq jonathan-eq force-pushed the fix-bugs branch 3 times, most recently from fa25a08 to c38e770 Compare November 12, 2024 09:43
@jonathan-eq jonathan-eq changed the title from "Have driver polling methods ignore OS error from" to "Have driver polling methods ignore memory allocation OS error from" on Nov 12, 2024
@codecov-commenter

Codecov Report

Attention: Patch coverage is 80.00000% with 3 lines in your changes missing coverage. Please review.

Project coverage is 90.74%. Comparing base (a7559fd) to head (c38e770).
Report is 3 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/ert/scheduler/lsf_driver.py | 80.00% | 1 Missing ⚠️ |
| src/ert/scheduler/openpbs_driver.py | 80.00% | 1 Missing ⚠️ |
| src/ert/scheduler/slurm_driver.py | 80.00% | 1 Missing ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #9178      +/-   ##
==========================================
- Coverage   90.75%   90.74%   -0.02%     
==========================================
  Files         352      352              
  Lines       21934    21952      +18     
==========================================
+ Hits        19906    19920      +14     
- Misses       2028     2032       +4     

| Flag | Coverage Δ |
|---|---|
| cli-tests | 39.20% <0.00%> (-0.03%) ⬇️ |
| gui-tests | 71.72% <0.00%> (-0.03%) ⬇️ |
| performance-tests | 49.32% <0.00%> (-0.06%) ⬇️ |
| unit-tests | 79.66% <80.00%> (+<0.01%) ⬆️ |


@xjules
Contributor

xjules commented Nov 12, 2024

  1. What about the local_driver?
  2. The commit message needs to be updated, as it does not describe anything. The same goes for the PR description.

This commit makes every driver's `driver.poll` method ignore an
OSError with the message `Cannot allocate memory` raised by the poll
subprocess call. This applies to all drivers except local, which
does not poll.
@jonathan-eq jonathan-eq changed the title from "Have driver polling methods ignore memory allocation OS error from" to "Have all drivers' poll method ignore OSError" on Nov 13, 2024
@eivindjahren eivindjahren changed the title from "Have all drivers' poll method ignore OSError" to "Have all drivers' poll method ignore out of memory errors" on Nov 15, 2024
@eivindjahren
Contributor

eivindjahren commented Nov 15, 2024

This would have to be tested under fairly realistic conditions before we merge. The danger is that you cascade into a worse problem by ignoring a "Cannot allocate memory" error. There might not be enough memory left to perform any subsequent operations, so ignoring it may make things worse.

There are ways of working around such issues, e.g. by sleeping at the right time, but we have to 1) be sure that the problem can be replicated, 2) see that this behavior is better, and 3) tweak the behavior to make it as reliable as possible.
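
One possible reading of "sleeping at the right time" is to pause and retry the spawn when it fails with an out-of-memory error, instead of silently dropping it. The sketch below is only an illustration under that assumption: `spawn_with_backoff`, `retries`, and `delay` are hypothetical names, and the doubling of the delay is an arbitrary choice, not something proposed in this PR.

```python
import asyncio
import errno


async def spawn_with_backoff(spawn, retries=3, delay=1.0):
    """Await `spawn()`, retrying after a pause when it raises ENOMEM.

    `spawn` is any zero-argument coroutine function, for example a
    wrapper around asyncio.create_subprocess_exec. All names here are
    hypothetical and for illustration only.
    """
    for attempt in range(retries + 1):
        try:
            return await spawn()
        except OSError as err:
            # Give up on unrelated errors, or once retries are used up.
            if err.errno != errno.ENOMEM or attempt == retries:
                raise
            # Wait longer after each failed attempt (delay, 2*delay, ...).
            await asyncio.sleep(delay * 2**attempt)
```

Whether retrying actually helps depends on memory being freed in the meantime, which is exactly what would need to be verified under realistic conditions.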

You can look into using cgroups to restrict the amount of memory available for the ert process in order to reproduce the behavior: https://unix.stackexchange.com/questions/44985/limit-memory-usage-for-a-single-linux-process
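
For a quick experiment without setting up cgroups, a process can also be started under an address-space cap from Python. Note that this swaps in a different mechanism: the sketch uses `resource.setrlimit` with `RLIMIT_AS` (POSIX only), which limits virtual address space and may not behave the same as a cgroup memory limit. `run_with_memory_cap` is a hypothetical helper name.

```python
import resource
import subprocess
import sys


def run_with_memory_cap(cmd, max_bytes):
    """Run `cmd` with its address space capped at roughly `max_bytes`.

    Hypothetical helper for illustration. Uses RLIMIT_AS, not cgroups.
    """

    def apply_limit():
        # Runs in the child between fork and exec (POSIX only).
        _, hard = resource.getrlimit(resource.RLIMIT_AS)
        # Never ask for more than an existing hard limit allows.
        if hard == resource.RLIM_INFINITY:
            soft = max_bytes
        else:
            soft = min(max_bytes, hard)
        resource.setrlimit(resource.RLIMIT_AS, (soft, hard))

    return subprocess.run(cmd, preexec_fn=apply_limit, capture_output=True)


if __name__ == "__main__":
    # Try to allocate 4 GiB inside a child capped at 1 GiB.
    result = run_with_memory_cap(
        [sys.executable, "-c", "x = bytearray(4 * 1024**3)"],
        max_bytes=1024**3,
    )
    print(result.returncode, result.stderr.decode())
```

`preexec_fn` is documented as unsafe in multi-threaded programs, so this is better suited to a small reproduction script than to production code.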

Also, I think further discussion and investigation of the issue is needed: #8976 (comment).

Labels
release-notes:bug-fix Automatically categorise as bug fix in release notes
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Unhandled oserrors in scheduler from create_subprocess_exec
4 participants