Inconsistent out of memory issue on node with rhods notebooks #159

Open
vbedida79 opened this issue Oct 31, 2023 · 2 comments
Labels
bug (Something isn't working), gpu (Intel GPU)

Comments

@vbedida79
Contributor

vbedida79 commented Oct 31, 2023

Summary

On OCP 4.13, using RHODS (Red Hat OpenShift Data Science) with OpenVINO notebooks, the notebook kernel restarts inconsistently and the node logs out-of-memory messages.

Details

OCP 4.13 cluster with an Intel Data Center GPU Flex 170; the notebook has memory requests and limits set to 56GB.
When using RHODS with the OpenVINO notebook image, specifically while executing the Stable Diffusion notebook, the Python notebook kernel restarts inconsistently. dmesg on the node shows:

[    0.019134] Early memory node ranges
[    0.023751] PM: hibernation: Registered nosave memory: [mem 0x00000000-0x00000fff]
[    0.023753] PM: hibernation: Registered nosave memory: [mem 0x0009d000-0x000fffff]
[    0.023755] PM: hibernation: Registered nosave memory: [mem 0x59039000-0x59039fff]
[    0.023757] PM: hibernation: Registered nosave memory: [mem 0x590fb000-0x590fbfff]
[    0.023758] PM: hibernation: Registered nosave memory: [mem 0x5ee4e000-0x5ee4efff]
[    0.023760] PM: hibernation: Registered nosave memory: [mem 0x5ee85000-0x5ee85fff]
[    0.023760] PM: hibernation: Registered nosave memory: [mem 0x5ee86000-0x5ee86fff]
[    0.023762] PM: hibernation: Registered nosave memory: [mem 0x5eebd000-0x5eebdfff]
[    0.023764] PM: hibernation: Registered nosave memory: [mem 0x5ef0b000-0x5efecfff]
[    0.023765] PM: hibernation: Registered nosave memory: [mem 0x66d71000-0x6866dfff]
[    0.023766] PM: hibernation: Registered nosave memory: [mem 0x6866e000-0x69897fff]
[    0.023766] PM: hibernation: Registered nosave memory: [mem 0x69898000-0x69dfdfff]
[    0.023768] PM: hibernation: Registered nosave memory: [mem 0x6f800000-0x8fffffff]
[    0.023769] PM: hibernation: Registered nosave memory: [mem 0x90000000-0xfdffffff]
[    0.023769] PM: hibernation: Registered nosave memory: [mem 0xfe000000-0xfe010fff]
[    0.023770] PM: hibernation: Registered nosave memory: [mem 0xfe011000-0xfed1ffff]
[    0.023770] PM: hibernation: Registered nosave memory: [mem 0xfed20000-0xfed44fff]
[    0.023771] PM: hibernation: Registered nosave memory: [mem 0xfed45000-0xffffffff]
[    0.237871] Freeing SMP alternatives memory: 36K
[    3.572274] Non-volatile memory driver v1.3
[    3.653525] Freeing initrd memory: 89312K
[    4.228204] Freeing unused decrypted memory: 2036K
[    4.232827] Freeing unused kernel image (initmem) memory: 2788K
[    4.247331] Freeing unused kernel image (text/rodata gap) memory: 2040K
[    4.251702] Freeing unused kernel image (rodata/data gap) memory: 60K
[   11.014980] i2c i2c-0: 16/32 memory slots populated (from DMI)
[   11.014982] i2c i2c-0: Systems with more than 4 memory slots not supported yet, not instantiating SPD
[   12.964055] EDAC i10nm: No hbm memory
[ 1357.676966] i915 0000:33:00.0: [drm] Local memory IO size: 0x000000037a800000
[ 1357.676968] i915 0000:33:00.0: [drm] Local memory available: 0x000000037a800000
[407440.611017]  out_of_memory+0xed/0x2e0
[407440.611029]  mem_cgroup_out_of_memory+0x13a/0x150
[407440.611116] memory: usage 58720252kB, limit 58720256kB, failcnt 23
[407440.611117] memory+swap: usage 58720252kB, limit 58720256kB, failcnt 17987903
[407440.611133] Tasks state (memory values in pages):
[407440.612317] Memory cgroup out of memory: Killed process 1535268 (python3.8) total-vm:1209308992kB, anon-rss:41459744kB, file-rss:466276kB, shmem-rss:4kB, UID:1000750000 pgtables:151104kB oom_score_adj:778
[408339.735618]  out_of_memory+0xed/0x2e0
[408339.735629]  mem_cgroup_out_of_memory+0x13a/0x150
[408339.735686] memory: usage 58720256kB, limit 58720256kB, failcnt 23
[408339.735687] memory+swap: usage 58720256kB, limit 58720256kB, failcnt 21385997
[408339.735703] Tasks state (memory values in pages):
[408339.736085] Memory cgroup out of memory: Killed process 2725201 (python3.8) total-vm:132980172kB, anon-rss:41961444kB, file-rss:304524kB, shmem-rss:4kB, UID:1000750000 pgtables:90372kB oom_score_adj:778
[457794.119151]  out_of_memory+0xed/0x2e0
[457794.119162]  mem_cgroup_out_of_memory+0x13a/0x150
[457794.119215] memory: usage 58720256kB, limit 58720256kB, failcnt 23
[457794.119217] memory+swap: usage 58720256kB, limit 58720256kB, failcnt 24769451
[457794.119234] Tasks state (memory values in pages):
[457794.119591] Memory cgroup out of memory: Killed process 2740651 (python3.8) total-vm:132968056kB, anon-rss:41960760kB, file-rss:305636kB, shmem-rss:4kB, UID:1000750000 pgtables:90380kB oom_score_adj:778
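To make the kill lines above easier to compare, here is a minimal sketch that extracts the cgroup usage, the limit, and the killed process's resident memory from dmesg text and converts them to GiB. The function name `summarize_oom_kills` and the dictionary keys are illustrative, not part of any existing tool. The kernel's "kB" in these lines is KiB, so the reported limit of 58720256kB is exactly 56 GiB, while the killed python3.8 process holds roughly 39.5 GiB of anonymous memory. The excerpt does not show what accounts for the rest of the cgroup's usage (for example other processes in the pod, page cache, or kernel memory), so that remains an open question.

```python
import re

KIB_PER_GIB = 1024 * 1024  # the kernel's "kB" in OOM reports means KiB

# Matches "memory: usage ...kB, limit ...kB" but not the "memory+swap:" line.
USAGE_RE = re.compile(r"\bmemory: usage (\d+)kB, limit (\d+)kB")
# Matches the "Memory cgroup out of memory: Killed process ..." line.
KILL_RE = re.compile(
    r"Killed process (\d+) \(([^)]+)\) total-vm:(\d+)kB, "
    r"anon-rss:(\d+)kB, file-rss:(\d+)kB"
)


def kib_to_gib(kib):
    """Convert a KiB count to GiB."""
    return kib / KIB_PER_GIB


def summarize_oom_kills(dmesg_text):
    """Return one dict per OOM kill, pairing it with the preceding usage line."""
    events = []
    usage = limit = None
    for line in dmesg_text.splitlines():
        match = USAGE_RE.search(line)
        if match:
            usage, limit = int(match.group(1)), int(match.group(2))
            continue
        match = KILL_RE.search(line)
        if match:
            events.append({
                "pid": int(match.group(1)),
                "comm": match.group(2),
                "usage_gib": kib_to_gib(usage) if usage is not None else None,
                "limit_gib": kib_to_gib(limit) if limit is not None else None,
                "anon_rss_gib": kib_to_gib(int(match.group(4))),
                "file_rss_gib": kib_to_gib(int(match.group(5))),
            })
            usage = limit = None  # do not reuse values for a later kill
    return events


# Lines copied from the first OOM event in the log above.
SAMPLE = """\
[407440.611116] memory: usage 58720252kB, limit 58720256kB, failcnt 23
[407440.611117] memory+swap: usage 58720252kB, limit 58720256kB, failcnt 17987903
[407440.612317] Memory cgroup out of memory: Killed process 1535268 (python3.8) total-vm:1209308992kB, anon-rss:41459744kB, file-rss:466276kB, shmem-rss:4kB, UID:1000750000 pgtables:151104kB oom_score_adj:778
"""

for event in summarize_oom_kills(SAMPLE):
    print(
        f"pid {event['pid']} ({event['comm']}): "
        f"anon-rss {event['anon_rss_gib']:.2f} GiB, "
        f"cgroup limit {event['limit_gib']:.2f} GiB"
    )
```

Feeding it the full `dmesg` output from the node would list all three kills side by side.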

Todo/Solutions

Need to confirm the root cause: whether it is caused by CPU, GPU, or memory issues on the node itself.
Also execute other OpenVINO notebooks and verify the issue.
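One possible way to narrow this down is to log the notebook container's own cgroup memory usage between notebook cells and see which step approaches the limit. The sketch below assumes it runs inside the notebook pod and that the cgroup filesystem is mounted at `/sys/fs/cgroup`; it checks the standard cgroup v2 file names first and then the cgroup v1 ones. The helper name `read_cgroup_memory` is hypothetical.

```python
from pathlib import Path

# (usage file, limit file) relative to the cgroup mount point.
CANDIDATES = [
    ("memory.current", "memory.max"),  # cgroup v2
    ("memory/memory.usage_in_bytes", "memory/memory.limit_in_bytes"),  # cgroup v1
]


def read_cgroup_memory(root="/sys/fs/cgroup"):
    """Return (usage_bytes, limit_bytes) for the current cgroup.

    limit_bytes is None when cgroup v2 reports "max" (no limit).
    Returns None when neither file layout is found under root.
    """
    base = Path(root)
    for usage_name, limit_name in CANDIDATES:
        usage_path = base / usage_name
        limit_path = base / limit_name
        if usage_path.is_file() and limit_path.is_file():
            usage = int(usage_path.read_text().strip())
            raw_limit = limit_path.read_text().strip()
            limit = None if raw_limit == "max" else int(raw_limit)
            return usage, limit
    return None
```

Calling `read_cgroup_memory()` after each heavy cell (model load, compile, inference) and printing the result would show whether usage climbs steadily or jumps at one step. It only reports host memory charged to the cgroup, so it cannot by itself tell CPU-side allocations from GPU-driver-related ones.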

@hershpa added the bug and gpu labels on Nov 2, 2023
@uMartinXu
Contributor

@vbedida79 can we reproduce the issue on 1.2.0 and the upcoming 1.2.1 release? If not, I think we can close the issue.

@vbedida79 vbedida79 mentioned this issue Mar 8, 2024
9 tasks
@vbedida79
Contributor Author

@vbedida79 can we reproduce the issue on 1.2.0 and the upcoming 1.2.1 release? If not, I think we can close the issue.

Yes, tried it on 1.2.1; this is recurring in some of the OpenVINO notebooks with Flex and Max series GPUs.
