Multiple processor crash #10

Open
cianciosa opened this issue Jan 21, 2021 · 5 comments
@cianciosa
Collaborator

Joachim Geiger has reported a crash when running with multiple processors. The input files in Cases.zip show the behavior. input.crashes uses an extended number of modes and crashes with a heap-overflow error when run with more than a single processor. input.works is the same case with a reduced number of modes. This case does not exhibit the behavior. The crash was reported using the ifort compiler; however, I was able to reproduce it by turning on the address-sanitizer flag.

% mpirun -n 4 xvmec input.crashes_3    
 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
  VMEC OUTPUT FILES ALREADY EXIST: OVERWRITING THEM ...
  SEQ =    1 TIME SLICE  0.0000E+00
  PROCESSING INPUT.crashes_3
  THIS IS PARVMEC (PARALLEL VMEC), VERSION 9.0
  Lambda: Full Radial Mesh. L-Force: hybrid full/half.

  COMPUTER: cianciosaimac   OS: Darwin   RELEASE: 19.6.0  DATE = Jan 21,2021  TIME = 12:52:34

  NS =    8 NO. FOURIER MODES =  185 FTOLV =  1.000E-06 NITER =  20000
  PROCESSOR COUNT - RADIAL:    4
 INITIAL JACOBIAN CHANGED SIGN!
 TRYING TO IMPROVE INITIAL MAGNETIC AXIS GUESS
  ---- Improved AXIS Guess ----
      RAXIS_CC =    5.5423259209884730       0.30747882334706500        3.6107777297953697E-002   2.1925887832076173E-002 -0.17127515915757005       0.33995876393572677        2.7194580396712614E-002   8.7619938032124662E-003   2.1641584886036458E-002  -3.0060375964156970E-002   4.0919407891436034E-003   7.2283631622133112E-003  -4.8096045954452264E-003   3.2132317238919464E-003   1.3366337123433408E-003  -5.0218208257885189E-003  -1.0805539441867496E-003   3.8372284158438586E-004   1.2322391511445112E-003   8.2564184559682900E-004   9.0462982158830627E-003
      ZAXIS_CS =   -0.0000000000000000      -0.40364620347171476       -2.6212416249487239E-002   2.5845975128812093E-002  0.15344591155188636      -0.27210128536906603       -2.4819582171628708E-002  -7.6814873421304332E-003  -2.2282872186040290E-002   1.9170323502591072E-002  -1.1569841914854002E-002  -6.1298139436995875E-004  -2.6220827681052326E-003  -5.6155647985143900E-003  -3.0101401187663541E-003  -8.9905949402988867E-003  -4.8346291121438923E-003  -5.7954765825185117E-003   8.0075797167838414E-003  -3.0281697953424324E-003  -3.8957154711619243E-003
  -----------------------------
=================================================================
=================================================================
=================================================================
=================================================================
==55382==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x6180000077c0 at pc 0x000102552fed bp 0x7ffeed759dc0 sp 0x7ffeed759db8
==55380==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x6180000077c0 at pc 0x00010ec8dfed bp 0x7ffee101edc0 sp 0x7ffee101edb8
READ of size 8 at 0x6180000077c0 thread T0
READ of size 8 at 0x6180000077c0 thread T0
==55383==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x6180000077c0 at pc 0x00010c003fed bp 0x7ffee3ca8dc0 sp 0x7ffee3ca8db8
READ of size 8 at 0x6180000077c0 thread T0
==55381==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x6180000077c0 at pc 0x00010d992fed bp 0x7ffee2319dc0 sp 0x7ffee2319db8
READ of size 8 at 0x6180000077c0 thread T0
    #0 0x10ec8dfec in __blocktridiagonalsolver_bst_MOD_initialize_bst blocktridiagonalsolver_bst.f90:2005
    #1 0x10f09af06 in runvmec_ runvmec.f:329
    #2 0x10ebdf804 in MAIN__ vmec.f:333
    #3 0x10ebe1818 in main vmec.f:2
    #0 0x10c003fec in __blocktridiagonalsolver_bst_MOD_initialize_bst blocktridiagonalsolver_bst.f90:2005
    #1 0x10c410f06 in runvmec_ runvmec.f:329
    #2 0x10bf55804 in MAIN__ vmec.f:333
    #3 0x10bf57818 in main vmec.f:2
    #4 0x7fff6fbc6cc8 in start (libdyld.dylib:x86_64+0x1acc8)

0x6180000077c0 is located 0 bytes to the right of 832-byte region [0x618000007480,0x6180000077c0)
allocated by thread T0 here:
    #4 0x7fff6fbc6cc8 in start (libdyld.dylib:x86_64+0x1acc8)

0x6180000077c0 is located 0 bytes to the right of 832-byte region [0x618000007480,0x6180000077c0)
allocated by thread T0 here:
    #0 0x10d992fec in __blocktridiagonalsolver_bst_MOD_initialize_bst blocktridiagonalsolver_bst.f90:2005
    #1 0x10dd9ff06 in runvmec_ runvmec.f:329
    #2 0x10d8e4804 in MAIN__ vmec.f:333
    #3 0x10d8e6818 in main vmec.f:2
    #0 0x113a341ad in wrap_malloc (libasan.5.dylib:x86_64+0x6c1ad)
    #1 0x10ec8d21b in __blocktridiagonalsolver_bst_MOD_initialize_bst blocktridiagonalsolver_bst.f90:2002
    #2 0x10f09af06 in runvmec_ runvmec.f:329
    #3 0x10ebdf804 in MAIN__ vmec.f:333
    #4 0x10ebe1818 in main vmec.f:2
    #5 0x7fff6fbc6cc8 in start (libdyld.dylib:x86_64+0x1acc8)

    #4 0x7fff6fbc6cc8 in start (libdyld.dylib:x86_64+0x1acc8)

0x6180000077c0 is located 0 bytes to the right of 832-byte region [0x618000007480,0x6180000077c0)
SUMMARY: AddressSanitizer: heap-buffer-overflow blocktridiagonalsolver_bst.f90:2005 in __blocktridiagonalsolver_bst_MOD_initialize_bst
allocated by thread T0 here:
Shadow bytes around the buggy address:
  0x1c3000000ea0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x1c3000000eb0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x1c3000000ec0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x1c3000000ed0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x1c3000000ee0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
=>0x1c3000000ef0: 00 00 00 00 00 00 00 00[fa]fa fa fa fa fa fa fa
  0x1c3000000f00: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x1c3000000f10: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x1c3000000f20: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x1c3000000f30: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x1c3000000f40: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Shadow byte legend (one shadow byte represents 8 application bytes):
  Addressable:           00
  Partially addressable: 01 02 03 04 05 06 07 
  Heap left redzone:       fa
  Freed heap region:       fd
  Stack left redzone:      f1
  Stack mid redzone:       f2
  Stack right redzone:     f3
  Stack after return:      f5
  Stack use after scope:   f8
  Global redzone:          f9
  Global init order:       f6
  Poisoned by user:        f7
  Container overflow:      fc
  Array cookie:            ac
  Intra object redzone:    bb
  ASan internal:           fe
  Left alloca redzone:     ca
  Right alloca redzone:    cb
==55380==ABORTING

Program received signal SIGABRT: Process abort signal.

Backtrace for this error:
#0  0x1136dc72c
#1  0x1136dbad3
#2  0x7fff6fdbf5fc
    #0 0x11128d1ad in wrap_malloc (libasan.5.dylib:x86_64+0x6c1ad)
    #1 0x10c00321b in __blocktridiagonalsolver_bst_MOD_initialize_bst blocktridiagonalsolver_bst.f90:2002
    #2 0x10c410f06 in runvmec_ runvmec.f:329
    #3 0x10bf55804 in MAIN__ vmec.f:333
    #4 0x10bf57818 in main vmec.f:2
    #5 0x7fff6fbc6cc8 in start (libdyld.dylib:x86_64+0x1acc8)

SUMMARY: AddressSanitizer: heap-buffer-overflow blocktridiagonalsolver_bst.f90:2005 in __blocktridiagonalsolver_bst_MOD_initialize_bst
Shadow bytes around the buggy address:
  0x1c3000000ea0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x1c3000000eb0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x1c3000000ec0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x1c3000000ed0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x1c3000000ee0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
=>0x1c3000000ef0: 00 00 00 00 00 00 00 00[fa]fa fa fa fa fa fa fa
  0x1c3000000f00: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x1c3000000f10: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x1c3000000f20: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x1c3000000f30: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x1c3000000f40: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Shadow byte legend (one shadow byte represents 8 application bytes):
  Addressable:           00
  Partially addressable: 01 02 03 04 05 06 07 
  Heap left redzone:       fa
  Freed heap region:       fd
  Stack left redzone:      f1
  Stack mid redzone:       f2
  Stack right redzone:     f3
  Stack after return:      f5
  Stack use after scope:   f8
  Global redzone:          f9
  Global init order:       f6
  Poisoned by user:        f7
  Container overflow:      fc
  Array cookie:            ac
  Intra object redzone:    bb
  ASan internal:           fe
  Left alloca redzone:     ca
  Right alloca redzone:    cb
==55383==ABORTING

Program received signal SIGABRT: Process abort signal.

Backtrace for this error:
#0  0x10d8f072c
#1  0x10d8efad3
#2  0x7fff6fdbf5fc
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 55380 on node cianciosaimac exited on signal 6 (Abort trap: 6).
--------------------------------------------------------------------------
@cianciosa cianciosa added the bug Something isn't working label Jan 21, 2021
@cianciosa cianciosa self-assigned this Jan 21, 2021
@cianciosa
Collaborator Author

Looking at some debugging information for the line

IF(.NOT.ALLOCATED(orig(globrowoff)%L)) ALLOCATE( orig(globrowoff)%L(M,1) )

orig is allocated with lower and upper bounds of 1 and 2, but globrowoff is trying to access the 3rd index.
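
A minimal, self-contained sketch of how the out-of-range index could be caught before the ALLOCATE is reached. This is not the code in blocktridiagonalsolver_bst.f90: the type name row_t, the value of M, and the loop over globrowoff are assumptions made up for illustration. Only the names orig, globrowoff, L, and M, and the bounds 1 and 2, come from the thread.

```fortran
program bounds_guard_sketch
  implicit none

  ! Hypothetical stand-in for the element type of orig.
  type :: row_t
     real, allocatable :: L(:,:)
  end type row_t

  type(row_t), allocatable :: orig(:)
  integer :: globrowoff, M

  M = 4                  ! assumed block size, for illustration only
  allocate(orig(1:2))    ! bounds 1 and 2, as observed in the debugger

  do globrowoff = 1, 3   ! 3 reproduces the out-of-range access
     ! Check the index against the allocated bounds before touching orig.
     if (globrowoff < lbound(orig, 1) .or. globrowoff > ubound(orig, 1)) then
        print '(a,i0,a,i0,a,i0)', 'globrowoff = ', globrowoff, &
             ' is outside orig bounds ', lbound(orig, 1), ':', ubound(orig, 1)
        cycle
     end if
     if (.not. allocated(orig(globrowoff)%L)) allocate(orig(globrowoff)%L(M,1))
  end do
end program bounds_guard_sketch
```

A check like this only reports the bad index. It does not explain why globrowoff ends up outside the allocated range.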

@cianciosa
Collaborator Author

Crash is happening because something is changing the size of startglobrow and endglobrow.

@cianciosa
Collaborator Author

cianciosa commented Jan 21, 2021

It looks like the crash happens because of the following sequence.

  1. Initialize_bst called with correct sizes
  2. eqsolve called
  3. evolve called
  4. jacobian changes sign
  5. evolve returns with bad jacobian flag
  6. reset and retry eqsolve
  7. jacobian changes sign again
  8. evolve returns with bad jacobian flag
  9. exit from eqsolve
  10. next grid size is attempted
  11. Initialize_bst called with incorrect sizes

@cianciosa
Collaborator Author

This line causes the loop to return to tag 50:

IF (ier_flag .eq. bad_jacobian_flag .and. jacob_off .eq. 0) THEN

@cianciosa
Collaborator Author

On the second attempt of the multigrid, the loop counter starts at index zero since jacob_off is one.
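
The retry behavior described in this thread can be modeled with a toy program. It assumes the multigrid loop's lower bound is computed as a starting index minus jacob_off, which is consistent with the comment above but is not shown in this thread. The names first_grid and last_grid, the flag value, and the stand-in for the solver are hypothetical. Only ier_flag, bad_jacobian_flag, jacob_off, and tag 50 come from the thread.

```fortran
program retry_loop_sketch
  implicit none

  integer, parameter :: bad_jacobian_flag = 1   ! hypothetical value
  integer, parameter :: first_grid = 1          ! hypothetical start index
  integer, parameter :: last_grid = 2           ! hypothetical grid count
  integer :: ier_flag, jacob_off, igrid

  ier_flag = 0
  jacob_off = 0

50 continue
  ! Assumed form of the loop bounds: the start index shifts down by jacob_off.
  do igrid = first_grid - jacob_off, last_grid
     print '(a,i0,a,i0)', 'jacob_off = ', jacob_off, ', igrid = ', igrid

     ! Stand-in for the solver: the first attempt fails with a bad Jacobian.
     ier_flag = 0
     if (jacob_off == 0) then
        ier_flag = bad_jacobian_flag
        exit
     end if
  end do

  ! Condition quoted in the thread: retry once, with jacob_off set to one.
  if (ier_flag == bad_jacobian_flag .and. jacob_off == 0) then
     jacob_off = 1
     go to 50
  end if
end program retry_loop_sketch
```

In this model the first pass visits igrid = 1 only, and the retry restarts the loop at igrid = 0. Any sizes derived from the grid index on that extra pass would differ from the ones used on the first call.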
