Hetzner: calc engine gets asleep and outputs nothing approx. in 10% cases #11

Open
blokhin opened this issue Dec 14, 2020 · 5 comments
Labels
bug Something isn't working

Member

blokhin commented Dec 14, 2020

Manifested as:

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
 1872 root      20   0  269344  20472  13680 S   7.0   0.1 174:17.89 Pcrystal
 1874 root      20   0  269344  20172  13380 S   6.7   0.1 173:49.32 Pcrystal
 1875 root      20   0  269344  20128  13336 S   6.7   0.1 174:53.19 Pcrystal
 1876 root      20   0  269344  20388  13600 S   6.7   0.1 173:32.07 Pcrystal
 1886 root      20   0  269344  20204  13412 S   6.3   0.1 174:08.48 Pcrystal
 1877 root      20   0  269344  20132  13340 S   6.0   0.1 174:18.18 Pcrystal
 1881 root      20   0  269344  20404  13612 S   5.7   0.1 175:58.19 Pcrystal

and

root@node-dwsxhftb:~# cat /data/20201212_194555_dury/OUTPUT
[node-dwsxhftb][[36298,1],6][btl_tcp.c:559:mca_btl_tcp_recv_blocking] recv(16) failed: Connection reset by peer (104)
[node-dwsxhftb][[36298,1],7][btl_tcp.c:559:mca_btl_tcp_recv_blocking] recv(16) failed: Connection reset by peer (104)
[node-dwsxhftb][[36298,1],7][btl_tcp.c:559:mca_btl_tcp_recv_blocking] recv(16) failed: Connection reset by peer (104)
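
The error lines above follow a regular pattern, so a stuck run could be flagged automatically by scanning OUTPUT. A minimal sketch, assuming the line format is exactly as pasted here; the function name `count_resets_by_rank` is hypothetical and not part of any existing tool:

```python
import re
from collections import Counter

# Matches the Open MPI TCP BTL error lines as they appear in OUTPUT above.
# Assumption: the format is stable, i.e. [host][[jobid,N],rank][btl_tcp.c:LINE:func] ...
BTL_RESET = re.compile(
    r"^\[(?P<host>[^\]]+)\]\[\[(?P<job>\d+),\d+\],(?P<rank>\d+)\]"
    r"\[btl_tcp\.c:\d+:mca_btl_tcp_recv_blocking\] "
    r"recv\(\d+\) failed: Connection reset by peer \(104\)"
)

def count_resets_by_rank(text):
    """Return a Counter mapping MPI rank -> number of 'connection reset' lines."""
    counts = Counter()
    for line in text.splitlines():
        match = BTL_RESET.match(line.strip())
        if match:
            counts[int(match.group("rank"))] += 1
    return counts
```

A non-empty result for a task that is still reported as running would be one hint that the engine has fallen into this state.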

blokhin added the bug label Dec 14, 2020
Member Author

blokhin commented Aug 27, 2022

Continues to occur on roughly every 20th calculation on Hetzner:

[node-kcjlndpb][[7780,1],5][btl_tcp.c:559:mca_btl_tcp_recv_blocking] recv(16) failed: Connection reset by peer (104)
[node-kcjlndpb][[7780,1],5][btl_tcp.c:559:mca_btl_tcp_recv_blocking] recv(16) failed: Connection reset by peer (104)
[node-kcjlndpb][[7780,1],0][btl_tcp.c:559:mca_btl_tcp_recv_blocking] recv(16) failed: Connection reset by peer (104)

Member Author

blokhin commented Dec 14, 2022

And again now:

root@aiida9:~# yastatus
319   RUNNING
root@aiida9:~# yastatus -v
..................................................ID319 aiida-4727 at root@65.108.215.129:hetzner:data/tasks/20221202_194808_319
[node-gusrkxef][[30587,1],2][btl_tcp.c:559:mca_btl_tcp_recv_blocking] recv(18) failed: Connection reset by peer (104)
[node-gusrkxef][[30587,1],2][btl_tcp.c:559:mca_btl_tcp_recv_blocking] recv(18) failed: Connection reset by peer (104)
[node-gusrkxef][[30587,1],2][btl_tcp.c:559:mca_btl_tcp_recv_blocking] recv(18) failed: Connection reset by peer (104)
[node-gusrkxef][[30587,1],2][btl_tcp.c:559:mca_btl_tcp_recv_blocking] recv(18) failed: Connection reset by peer (104)
[node-gusrkxef][[30587,1],6][btl_tcp.c:559:mca_btl_tcp_recv_blocking] recv(17) failed: Connection reset by peer (104)
[node-gusrkxef][[30587,1],2][btl_tcp.c:559:mca_btl_tcp_recv_blocking] recv(19) failed: Connection reset by peer (104)
[node-gusrkxef][[30587,1],6][btl_tcp.c:559:mca_btl_tcp_recv_blocking] recv(17) failed: Connection reset by peer (104)
[node-gusrkxef][[30587,1],2][btl_tcp.c:559:mca_btl_tcp_recv_blocking] recv(19) failed: Connection reset by peer (104)
[node-gusrkxef][[30587,1],2][btl_tcp.c:559:mca_btl_tcp_recv_blocking] recv(19) failed: Connection reset by peer (104)
[node-gusrkxef][[30587,1],2][btl_tcp.c:559:mca_btl_tcp_recv_blocking] recv(19) failed: Connection reset by peer (104)
[node-gusrkxef][[30587,1],2][btl_tcp.c:559:mca_btl_tcp_recv_blocking] recv(19) failed: Connection reset by peer (104)
[node-gusrkxef][[30587,1],2][btl_tcp.c:559:mca_btl_tcp_recv_blocking] recv(19) failed: Connection reset by peer (104)
[node-gusrkxef][[30587,1],2][btl_tcp.c:559:mca_btl_tcp_recv_blocking] recv(19) failed: Connection reset by peer (104)
[node-gusrkxef][[30587,1],7][btl_tcp.c:559:mca_btl_tcp_recv_blocking] recv(16) failed: Connection reset by peer (104)
[node-gusrkxef][[30587,1],6][btl_tcp.c:559:mca_btl_tcp_recv_blocking] recv(17) failed: Connection reset by peer (104)

So it's just eating resources without producing any useful output 😢 😢 😢
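
One way to stop paying for such a run could be a watchdog that treats a task as stalled once its OUTPUT file has not been written to for a while. A minimal sketch, assuming a healthy calculation appends to OUTPUT regularly (not verified here); `looks_stalled` and the threshold are hypothetical:

```python
import os
import time

def looks_stalled(output_path, max_idle_seconds, now=None):
    """Return True if output_path was last modified more than max_idle_seconds ago.

    `now` can be passed explicitly (seconds since the epoch) to make the check testable.
    """
    now = time.time() if now is None else now
    idle_seconds = now - os.path.getmtime(output_path)
    return idle_seconds > max_idle_seconds

# Example (path taken from the first comment; the one-hour threshold is an assumption):
# if looks_stalled("/data/20201212_194555_dury/OUTPUT", 3600):
#     print("OUTPUT has been idle for over an hour, the run may be stuck")
```

The idle threshold would need tuning per workload, since a long step that legitimately writes nothing would otherwise be reported as stalled.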

Member Author

blokhin commented Dec 14, 2022

$ top
Tasks: 130 total,   1 running, 129 sleeping,   0 stopped,   0 zombie
%Cpu(s):  1.6 us,  1.6 sy,  0.0 ni, 96.8 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
MiB Mem :  31413.5 total,  29556.7 free,    246.1 used,   1610.8 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.  30519.9 avail Mem 

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                                                                                                                  
10669 root      20   0  267228  20420  13640 S  13.3   0.1   1148:21 Pcrystal                                                                                                                 
10672 root      20   0  267228  20464  13684 S  13.3   0.1   1158:01 Pcrystal                                                                                                                 
10675 root      20   0  267228  22340  13520 S  13.3   0.1   1148:37 Pcrystal                                                                                                                 
10667 root      20   0  267228  20284  13504 S   6.7   0.1   1170:54 Pcrystal                                                                                                                 
10668 root      20   0  267228  22044  13228 S   6.7   0.1   1156:48 Pcrystal                                                                                                                 
10670 root      20   0  267228  20356  13576 S   6.7   0.1   1150:24 Pcrystal                                                                                                                 
    1 root      20   0  170568  10444   7956 S   0.0   0.0   0:43.95 systemd                                                                                                                  
    2 root      20   0       0      0      0 S   0.0   0.0   0:00.21 kthreadd                                                                                                                 
    3 root       0 -20       0      0      0 I   0.0   0.0   0:00.00 rcu_gp                                                                                                                   
    4 root       0 -20       0      0      0 I   0.0   0.0   0:00.00 rcu_par_gp                                                                                                               
    6 root       0 -20       0      0      0 I   0.0   0.0   0:00.00 kworker/0:0H-events_highpri                                                                                              
    8 root       0 -20       0      0      0 I   0.0   0.0   0:25.24 kworker/0:1H-events_highpri                                                                                              
    9 root       0 -20       0      0      0 I   0.0   0.0   0:00.00 mm_percpu_wq                                                                                                             
   10 root      20   0       0      0      0 S   0.0   0.0   0:01.86 ksoftirqd/0                                                                                                              
   11 root      20   0       0      0      0 I   0.0   0.0   0:31.81 rcu_sched                                                                                                                
   12 root      20   0       0      0      0 I   0.0   0.0   0:00.00 rcu_bh                                                                                                                   
   13 root      rt   0       0      0      0 S   0.0   0.0   0:01.55 migration/0                                                                                                              
   15 root      20   0       0      0      0 S   0.0   0.0   0:00.00 cpuhp/0                                                                                                                  
   16 root      20   0       0      0      0 S   0.0   0.0   0:00.00 cpuhp/1                                                                                                                  
   17 root      rt   0       0      0      0 S   0.0   0.0   0:01.72 migration/1                                                                                                              
   18 root      20   0       0      0      0 S   0.0   0.0   0:01.92 ksoftirqd/1                                                                                                              
   19 root      20   0       0      0      0 I   0.0   0.0   0:03.73 kworker/1:0-mm_percpu_wq                                                                                                 
   20 root       0 -20       0      0      0 I   0.0   0.0   0:00.00 kworker/1:0H-kblockd                                                                                                     
   21 root      20   0       0      0      0 S   0.0   0.0   0:00.00 cpuhp/2                                                                                                                  
   22 root      rt   0       0      0      0 S   0.0   0.0   0:01.76 migration/2                                                                                                              
   23 root      20   0       0      0      0 S   0.0   0.0   0:02.78 ksoftirqd/2                                                                                                              
   25 root       0 -20       0      0      0 I   0.0   0.0   0:00.00 kworker/2:0H-events_highpri                                                                                              
   26 root      20   0       0      0      0 S   0.0   0.0   0:00.00 cpuhp/3                                                                                                                  
   27 root      rt   0       0      0      0 S   0.0   0.0   0:01.74 migration/3                                                                                                              
   28 root      20   0       0      0      0 S   0.0   0.0   0:01.81 ksoftirqd/3                                                                                                              
   29 root      20   0       0      0      0 I   0.0   0.0   0:03.60 kworker/3:0-mm_percpu_wq                                                                                                 
   30 root       0 -20       0      0      0 I   0.0   0.0   0:00.00 kworker/3:0H-events_highpri                                                                                              
   31 root      20   0       0      0      0 S   0.0   0.0   0:00.00 cpuhp/4                                                                                                                  
   32 root      rt   0       0      0      0 S   0.0   0.0   0:01.76 migration/4                                                                                                              
   33 root      20   0       0      0      0 S   0.0   0.0   0:01.84 ksoftirqd/4                                                                                                              
   35 root       0 -20       0      0      0 I   0.0   0.0   0:00.00 kworker/4:0H-events_highpri                                                                                              
   36 root      20   0       0      0      0 S   0.0   0.0   0:00.00 cpuhp/5                                                                                                                  
   37 root      rt   0       0      0      0 S   0.0   0.0   0:01.94 migration/5                                                                                                              
   38 root      20   0       0      0      0 S   0.0   0.0   0:01.66 ksoftirqd/5                                                                                                              
   39 root      20   0       0      0      0 I   0.0   0.0   9:32.20 kworker/5:0-mm_percpu_wq

Member Author

blokhin commented Mar 8, 2024

This is a very bad issue that causes severe financial losses; it should be addressed ASAP 🚒

Contributor

akvatol commented Jul 26, 2024

I'm having a similar issue. I got an error while the Pcrystal process was still running.

..................................................ID37 aiida-825 at root@...with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
[node-cnsxffah:02513] PMIX ERROR: UNREACHABLE in file ../../../src/server/pmix_server.c at line 2795
[node-cnsxffah:02513] PMIX ERROR: UNREACHABLE in file ../../../src/server/pmix_server.c at line 2795
[node-cnsxffah:02513] PMIX ERROR: UNREACHABLE in file ../../../src/server/pmix_server.c at line 2795
[node-cnsxffah:02513] PMIX ERROR: UNREACHABLE in file ../../../src/server/pmix_server.c at line 2795
[node-cnsxffah:02513] PMIX ERROR: UNREACHABLE in file ../../../src/server/pmix_server.c at line 2795
[node-cnsxffah:02513] PMIX ERROR: UNREACHABLE in file ../../../src/server/pmix_server.c at line 2795

This may be caused by using the wrong Open MPI version. The developers recommend openmpi-2.1.* here and here.
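
To check which Open MPI series a node actually runs, the first line of `mpirun --version` can be parsed. A minimal sketch, assuming the usual `mpirun (Open MPI) X.Y.Z` banner; both function names are hypothetical:

```python
import re

def parse_open_mpi_version(version_text):
    """Extract (major, minor, patch) from `mpirun --version` output, or None."""
    match = re.search(r"\(Open MPI\)\s+(\d+)\.(\d+)(?:\.(\d+))?", version_text)
    if not match:
        return None
    # A missing patch component is treated as 0.
    return tuple(int(part) if part is not None else 0 for part in match.groups())

def is_recommended_series(version, major=2, minor=1):
    """True if the parsed version belongs to the given series (2.1.* by default)."""
    return version is not None and version[:2] == (major, minor)

# Example on a node (requires mpirun on PATH):
# import subprocess
# banner = subprocess.run(["mpirun", "--version"], capture_output=True, text=True).stdout
# print(is_recommended_series(parse_open_mpi_version(banner)))
```

This only reports the installed series; whether 2.1.* actually resolves the hang is not confirmed in this thread.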
