Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bbi sometimes errors when submitting with --overwrite while model is running #322

Open
seth127 opened this issue May 31, 2024 · 1 comment

Comments

@seth127
Copy link
Collaborator

seth127 commented May 31, 2024

I'm not fully clear what's happening here, but I noticed a strange behavior where submitting a model with --overwrite while that same model is currently running while cause bbi to error out sometimes. Some examples are below, but this seems to be unreliable/flaky, because sometimes I was able to reproduce it and other times it just seemed to work as expected.

I'm not sure if there's anything we can really do here, at least on the bbi side, but I wanted to document so we can look into a little deeper at some point.

As a side comment: this may be an argument for making submit_models() check the bbr::submit_models(..., .overwrite) arg first and deleting directories up front. There is some discussion related to this in comments on bbr#691.

truncated error output from bbr

This is where I originally saw, trying to re-run models that had failed or were stuck from a bbr bootstrap (working with unreleased bbr 1.10.0.8005).

boot_mods <- get_boot_models(boot_run)
unfinished <- which(!get_model_status(boot_run)$finished)
submit_models(
  boot_mods[unfinished], 
  .overwrite = TRUE, 
  .config_path = file.path(MODEL_DIR, "bbi.yaml") # needed for manually submitting sub-models
)
Submitting 71 models with 1 unique configurations.
Error in `map()` at bbr/R/submit-models.R:147:3:In index: 1.
Caused by error in `check_status_code()`:
! 

`bbi nonmem run sge models/pk/106-boot/032.ctl models/pk/106-boot/036.ctl ...

It went on longer than this, but got cut off eventually. From here, I ran the printed bbi command directly in the terminal and posted the full output below.

full error output from bbi run in terminal
$ qstat -f
queuename                      qtype resv/used/tot. load_avg arch          states
---------------------------------------------------------------------------------
all.q@ip-10-128-0-166.ec2.inte BIP   0/3/4          1.92     lx-amd64      
    703 0.55500 Run_117    sethg        r     05/31/2024 13:40:09     1        
    716 0.55500 Run_167    sethg        r     05/31/2024 13:40:09     1        
    731 0.55500 Run_189    sethg        r     05/31/2024 13:40:09     1        
---------------------------------------------------------------------------------
all.q@ip-10-128-1-204.ec2.inte BIP   0/3/4          2.14     lx-amd64      
    711 0.55500 Run_157    sethg        r     05/31/2024 13:40:09     1        
    725 0.55500 Run_183    sethg        r     05/31/2024 13:40:09     1        
    739 0.55500 Run_197    sethg        r     05/31/2024 13:40:09     1        
---------------------------------------------------------------------------------
all.q@ip-10-128-10-162.ec2.int BIP   0/4/4          1.77     lx-amd64      
    699 0.55500 Run_113    sethg        r     05/31/2024 13:40:09     1        
    713 0.55500 Run_162    sethg        r     05/31/2024 13:40:09     1        
    727 0.55500 Run_185    sethg        r     05/31/2024 13:40:09     1        
    741 0.55500 Run_199    sethg        r     05/31/2024 13:40:09     1        
---------------------------------------------------------------------------------
all.q@ip-10-128-10-199.ec2.int BIP   0/4/4          1.42     lx-amd64      
    691 0.55500 Run_105    sethg        r     05/31/2024 13:40:09     1        
    702 0.55500 Run_116    sethg        r     05/31/2024 13:40:09     1        
    717 0.55500 Run_169    sethg        r     05/31/2024 13:40:09     1        
    730 0.55500 Run_188    sethg        r     05/31/2024 13:40:09     1        
---------------------------------------------------------------------------------
all.q@ip-10-128-23-53.ec2.inte BIP   0/1/4          3.97     lx-amd64      
    688 0.55500 Run_102    sethg        r     05/31/2024 13:39:54     1        
---------------------------------------------------------------------------------
all.q@ip-10-128-26-0.ec2.inter BIP   0/4/4          1.63     lx-amd64      
    697 0.55500 Run_111    sethg        r     05/31/2024 13:40:09     1        
    710 0.55500 Run_156    sethg        r     05/31/2024 13:40:09     1        
    724 0.55500 Run_182    sethg        r     05/31/2024 13:40:09     1        
    738 0.55500 Run_196    sethg        r     05/31/2024 13:40:09     1        
---------------------------------------------------------------------------------
all.q@ip-10-128-40-102.ec2.int BIP   0/4/4          1.79     lx-amd64      
    700 0.55500 Run_114    sethg        r     05/31/2024 13:40:09     1        
    714 0.55500 Run_163    sethg        r     05/31/2024 13:40:09     1        
    728 0.55500 Run_186    sethg        r     05/31/2024 13:40:09     1        
    742 0.55500 Run_200    sethg        r     05/31/2024 13:40:09     1        
---------------------------------------------------------------------------------
all.q@ip-10-128-42-172.ec2.int BIP   0/4/4          1.46     lx-amd64      
    692 0.55500 Run_106    sethg        r     05/31/2024 13:40:09     1        
    706 0.55500 Run_134    sethg        r     05/31/2024 13:40:09     1        
    719 0.55500 Run_174    sethg        r     05/31/2024 13:40:09     1        
    734 0.55500 Run_192    sethg        r     05/31/2024 13:40:09     1        
---------------------------------------------------------------------------------
all.q@ip-10-128-43-190.ec2.int BIP   0/4/4          1.50     lx-amd64      
    694 0.55500 Run_108    sethg        r     05/31/2024 13:40:09     1        
    707 0.55500 Run_151    sethg        r     05/31/2024 13:40:09     1        
    721 0.55500 Run_176    sethg        r     05/31/2024 13:40:09     1        
    735 0.55500 Run_194    sethg        r     05/31/2024 13:40:09     1        
---------------------------------------------------------------------------------
all.q@ip-10-128-45-165.ec2.int BIP   0/4/4          1.53     lx-amd64      
    695 0.55500 Run_109    sethg        r     05/31/2024 13:40:09     1        
    708 0.55500 Run_152    sethg        r     05/31/2024 13:40:09     1        
    722 0.55500 Run_180    sethg        r     05/31/2024 13:40:09     1        
    736 0.55500 Run_193    sethg        r     05/31/2024 13:40:09     1        
---------------------------------------------------------------------------------
all.q@ip-10-128-59-142.ec2.int BIP   0/4/4          1.58     lx-amd64      
    696 0.55500 Run_110    sethg        r     05/31/2024 13:40:09     1        
    709 0.55500 Run_155    sethg        r     05/31/2024 13:40:09     1        
    723 0.55500 Run_178    sethg        r     05/31/2024 13:40:09     1        
    737 0.55500 Run_195    sethg        r     05/31/2024 13:40:09     1        
---------------------------------------------------------------------------------
all.q@ip-10-128-61-40.ec2.inte BIP   0/4/4          1.41     lx-amd64      
    690 0.55500 Run_104    sethg        r     05/31/2024 13:40:09     1        
    701 0.55500 Run_115    sethg        r     05/31/2024 13:40:09     1        
    715 0.55500 Run_164    sethg        r     05/31/2024 13:40:09     1        
    729 0.55500 Run_187    sethg        r     05/31/2024 13:40:09     1        
---------------------------------------------------------------------------------
all.q@ip-10-128-61-64.ec2.inte BIP   0/4/4          1.46     lx-amd64      
    693 0.55500 Run_107    sethg        r     05/31/2024 13:40:09     1        
    705 0.55500 Run_121    sethg        r     05/31/2024 13:40:09     1        
    720 0.55500 Run_177    sethg        r     05/31/2024 13:40:09     1        
    733 0.55500 Run_191    sethg        r     05/31/2024 13:40:09     1        
---------------------------------------------------------------------------------
all.q@ip-10-128-63-188.ec2.int BIP   0/3/4          1.95     lx-amd64      
    704 0.55500 Run_118    sethg        r     05/31/2024 13:40:09     1        
    718 0.55500 Run_173    sethg        r     05/31/2024 13:40:09     1        
    732 0.55500 Run_190    sethg        r     05/31/2024 13:40:09     1        
---------------------------------------------------------------------------------
all.q@ip-10-128-7-197.ec2.inte BIP   0/4/4          1.73     lx-amd64      
    698 0.55500 Run_112    sethg        r     05/31/2024 13:40:09     1        
    712 0.55500 Run_160    sethg        r     05/31/2024 13:40:09     1        
    726 0.55500 Run_184    sethg        r     05/31/2024 13:40:09     1        
    740 0.55500 Run_198    sethg        r     05/31/2024 13:40:09     1        
$ /data/apps/bbi nonmem run sge models/pk/106-boot/032.ctl models/pk/106-boot/036.ctl models/pk/106-boot/058.ctl models/pk/106-boot/067.ctl models/pk/106-boot/071.ctl models/pk/106-boot/086.ctl models/pk/106-boot/087.ctl models/pk/106-boot/088.ctl models/pk/106-boot/090.ctl models/pk/106-boot/091.ctl models/pk/106-boot/092.ctl models/pk/106-boot/093.ctl models/pk/106-boot/098.ctl models/pk/106-boot/099.ctl models/pk/106-boot/100.ctl models/pk/106-boot/101.ctl models/pk/106-boot/102.ctl models/pk/106-boot/103.ctl models/pk/106-boot/104.ctl models/pk/106-boot/105.ctl models/pk/106-boot/106.ctl models/pk/106-boot/107.ctl models/pk/106-boot/108.ctl models/pk/106-boot/109.ctl models/pk/106-boot/110.ctl models/pk/106-boot/111.ctl models/pk/106-boot/112.ctl models/pk/106-boot/113.ctl models/pk/106-boot/114.ctl models/pk/106-boot/115.ctl models/pk/106-boot/116.ctl models/pk/106-boot/117.ctl models/pk/106-boot/118.ctl models/pk/106-boot/121.ctl models/pk/106-boot/134.ctl models/pk/106-boot/151.ctl models/pk/106-boot/152.ctl models/pk/106-boot/155.ctl models/pk/106-boot/156.ctl models/pk/106-boot/157.ctl models/pk/106-boot/160.ctl models/pk/106-boot/162.ctl models/pk/106-boot/163.ctl models/pk/106-boot/164.ctl models/pk/106-boot/167.ctl models/pk/106-boot/169.ctl models/pk/106-boot/173.ctl models/pk/106-boot/174.ctl models/pk/106-boot/176.ctl models/pk/106-boot/177.ctl models/pk/106-boot/178.ctl models/pk/106-boot/180.ctl models/pk/106-boot/182.ctl models/pk/106-boot/183.ctl models/pk/106-boot/184.ctl models/pk/106-boot/185.ctl models/pk/106-boot/186.ctl models/pk/106-boot/187.ctl models/pk/106-boot/188.ctl models/pk/106-boot/189.ctl models/pk/106-boot/190.ctl models/pk/106-boot/191.ctl models/pk/106-boot/192.ctl models/pk/106-boot/193.ctl models/pk/106-boot/194.ctl models/pk/106-boot/195.ctl models/pk/106-boot/196.ctl models/pk/106-boot/197.ctl models/pk/106-boot/198.ctl models/pk/106-boot/199.ctl models/pk/106-boot/200.ctl --overwrite --config=models/pk/bbi.yaml
INFO[0000] Successfully loaded specified configuration from models/pk/bbi.yaml 
INFO[0000] Beginning Local Path                         
INFO[0000] A total of 71 models have been located for work 
INFO[0000] [032] Beginning SGE work phase               
INFO[0000] [036] Beginning SGE work phase               
INFO[0000] [058] Beginning SGE work phase               
INFO[0000] [067] Beginning SGE work phase               
INFO[0000] [071] Beginning SGE work phase               
INFO[0000] [086] Beginning SGE work phase               
INFO[0000] [087] Beginning SGE work phase               
INFO[0000] [088] Beginning SGE work phase               
INFO[0000] [090] Beginning SGE work phase               
INFO[0000] [091] Beginning SGE work phase               
INFO[0000] [092] Beginning SGE work phase               
INFO[0000] [093] Beginning SGE work phase               
INFO[0000] [098] Beginning SGE work phase               
INFO[0000] [099] Beginning SGE work phase               
INFO[0000] [100] Beginning SGE work phase               
INFO[0000] [101] Beginning SGE work phase               
INFO[0000] [103] Beginning SGE work phase               
INFO[0000] [104] Beginning SGE work phase               
INFO[0000] [102] Beginning SGE work phase               
INFO[0000] [105] Beginning SGE work phase               
INFO[0000] [106] Beginning SGE work phase               
INFO[0000] [108] Beginning SGE work phase               
INFO[0000] [107] Beginning SGE work phase               
INFO[0000] [110] Beginning SGE work phase               
INFO[0000] [109] Beginning SGE work phase               
INFO[0000] [114] Beginning SGE work phase               
INFO[0000] [115] Beginning SGE work phase               
INFO[0000] [116] Beginning SGE work phase               
INFO[0001] [117] Beginning SGE work phase               
INFO[0001] [121] Beginning SGE work phase               
INFO[0001] [134] Beginning SGE work phase               
INFO[0001] [118] Beginning SGE work phase               
INFO[0001] [152] Beginning SGE work phase               
INFO[0002] [164] Beginning SGE work phase               
INFO[0002] [162] Beginning SGE work phase               
INFO[0002] [169] Beginning SGE work phase               
INFO[0002] [167] Beginning SGE work phase               
INFO[0002] [174] Beginning SGE work phase               
INFO[0002] [173] Beginning SGE work phase               
INFO[0002] [180] Beginning SGE work phase               
INFO[0002] [177] Beginning SGE work phase               
INFO[0002] [182] Beginning SGE work phase               
INFO[0003] [183] Beginning SGE work phase               
INFO[0003] [184] Beginning SGE work phase               
INFO[0003] [186] Beginning SGE work phase               
INFO[0003] [187] Beginning SGE work phase               
INFO[0003] [188] Beginning SGE work phase               
INFO[0003] [189] Beginning SGE work phase               
INFO[0003] [192] Beginning SGE work phase               
INFO[0003] [191] Beginning SGE work phase               
INFO[0003] [193] Beginning SGE work phase               
INFO[0003] [195] Beginning SGE work phase               
INFO[0003] [194] Beginning SGE work phase               
INFO[0003] [196] Beginning SGE work phase               
INFO[0003] [197] Beginning SGE work phase               
INFO[0004] [198] Beginning SGE work phase               
INFO[0004] [200] Beginning SGE work phase               
INFO[0004] [199] Beginning SGE work phase               
INFO[0004] Waiting for any post execution hooks to finish 
ERRO[0004] 13 errors were experienced during the run    
ERRO[0004] Errors were experienced while running model 111. Details are unlinkat models/pk/106-boot/111/temp_dir: directory not empty 
ERRO[0004] Errors were experienced while running model 112. Details are unlinkat models/pk/106-boot/112/temp_dir: directory not empty 
ERRO[0004] Errors were experienced while running model 113. Details are unlinkat models/pk/106-boot/113/temp_dir: directory not empty 
ERRO[0004] Errors were experienced while running model 151. Details are unlinkat models/pk/106-boot/151/temp_dir: directory not empty 
ERRO[0004] Errors were experienced while running model 155. Details are unlinkat models/pk/106-boot/155/temp_dir: directory not empty 
ERRO[0004] Errors were experienced while running model 156. Details are unlinkat models/pk/106-boot/156/temp_dir: directory not empty 
ERRO[0004] Errors were experienced while running model 157. Details are unlinkat models/pk/106-boot/157/temp_dir: directory not empty 
ERRO[0004] Errors were experienced while running model 160. Details are unlinkat models/pk/106-boot/160/temp_dir: directory not empty 
ERRO[0004] Errors were experienced while running model 163. Details are unlinkat models/pk/106-boot/163/temp_dir: directory not empty 
ERRO[0004] Errors were experienced while running model 176. Details are unlinkat models/pk/106-boot/176/temp_dir: directory not empty 
ERRO[0004] Errors were experienced while running model 178. Details are unlinkat models/pk/106-boot/178/temp_dir: directory not empty 
ERRO[0004] Errors were experienced while running model 185. Details are unlinkat models/pk/106-boot/185/temp_dir: directory not empty 
ERRO[0004] Errors were experienced while running model 190. Details are unlinkat models/pk/106-boot/190/temp_dir: directory not empty 
71 models completed in 4.211071473s         

trying (and failing) to reproduce reliably with a single model
$ qstat -f
queuename                      qtype resv/used/tot. load_avg arch          states
---------------------------------------------------------------------------------
all.q@ip-10-128-58-134.ec2.int BIP   0/1/4          1.23     lx-amd64      
    801 0.55500 Run_032    sethg        r     05/31/2024 13:43:39     1        
---------------------------------------------------------------------------------

sethg@ip-10-128-40-150:$ bbi nonmem run sge /data/home/sethg/models/pk/106-boot/032.ctl  --overwrite
INFO[0000] Successfully loaded specified configuration from /data/home/sethg/models/pk/bbi.yaml 
INFO[0000] Beginning Local Path                         
INFO[0000] A total of 1 models have been located for work 
INFO[0000] [032] Beginning SGE work phase               
INFO[0000] Waiting for any post execution hooks to finish 
1 models completed in 10.411396ms           
@kyleam
Copy link
Collaborator

kyleam commented May 31, 2024

Details are unlinkat models/pk/106-boot/151/temp_dir: directory not empty

I think that's consistent with something like this:

  • process 1 is running and writing to */temp_dir/

  • process 2 starts to delete */temp_dir/ by a) first removing its contents and b) then removing the empty directory

  • process 1 writes something else to under */temp_dir/ before b fires, leading to the error

In this case, process 2 is the run sge call that's removing the output directory in the Prepare step.


It's probably tricky to reliably time things to trigger any given error, but here's one example of an error that can be triggered by the script below (that uses local, but same core problem):

INFO[0000] [1] Beginning local work phase
ERRO[0004] [1] Exit code was 110, details were exit status 110
ERRO[0004] [1] output details were: Starting NMTRAN

 WARNINGS AND ERRORS (IF ANY) FOR PROBLEM    1

 (WARNING  2) NM-TRAN INFERS THAT THE DATA ARE POPULATION.

 LIM VALUES MAXLIM ASSESSED BY NMTRAN: 1,2,3,4,5,6,7,8,10,11,13,15,16

Note: Analytical 2nd Derivatives are constructed in FSUBS but are never used.
      You may insert $ABBR DERIV2=NO after the first $PROB to save FSUBS construction and compilation time

cat: trash.tmp: No such file or directory
Recompiling certain components
cp: cannot create regular file 'temp_dir/': Not a directory
cp: cannot stat '/opt/NONMEM/nm75/pr/temp_dir/PRED.o': No such file or directory
cp: cannot stat '/opt/NONMEM/nm75/pr/temp_dir/PREDI.o': No such file or directory
cp: cannot stat '/opt/NONMEM/nm75/pr/temp_dir/CHECK.o': No such file or directory
cp: cannot stat '/opt/NONMEM/nm75/pr/temp_dir/SADVAN.o': No such file or directory
cp: cannot stat '/opt/NONMEM/nm75/pr/temp_dir/ADVAN2.o': No such file or directory
cp: cannot stat '/opt/NONMEM/nm75/pr/temp_dir/SSS0.o': No such file or directory
cp: cannot stat '/opt/NONMEM/nm75/pr/temp_dir/TRANS2.o': No such file or directory
cp: cannot stat '/opt/NONMEM/nm75/pr/temp_dir/INFN.o': No such file or directory
Building NONMEM Executable
gfortran: error: temp_dir/SSS0.o: No such file or directory
gfortran: error: temp_dir/INFN.o: No such file or directory
No nonmem execution.
INFO[0004] Waiting for any post execution hooks to finish
ERRO[0004] 1 errors were experienced during the run
ERRO[0004] Errors were experienced while running model 1.mod. Details are Running the programmatic shell script caused an error
script
#!/bin/sh

set -eu

tdir=$(mktemp -d "${TMPDIR:-/tmp}"/bbi-322-XXXXXXX)
cd "$tdir"

url_base=https://raw.githubusercontent.com/metrumresearchgroup/bbi/v3.3.0/integration/testdata/acop
curl -fSsL "$url_base/acop.mod" >1.mod
curl -fSsLO "$url_base/acop.csv"

bbi init --dir /opt/NONMEM
bbi nonmem run local --nm_version nm75 1.mod >run1.out 2>run1.err &
bbi nonmem run local --nm_version nm75 --overwrite 1.mod

I'm not sure if there's anything we can really do here, at least on the bbi side, but I wanted to document so we can look into a little deeper at some point.

If you have two bbi processes running and racing against each other, I don't think there's anything to do to avoid all the possible errors.

I'd say the main thing to do is to prevent that in the first place by having bbi write some sort of lock file to the output directory that indicates that it is running and then have it remove the lock after it completes (related to last point here). Any subsequent bbi invocation would refuse to work on a directory with a lock file, giving a clear error. I don't think you're hoping for with --overwrite, but in my opinion aborting if there's another bbi process already working with an output directory is reasonable behavior.

One risk is when bbi dies unexpectedly without a chance to clean that up. Then you have a stale lock file around that would cause a future bbi to refuse to work. bbi could mention in the error message that, if you know another bbi process isn't running, you can manually remove the lock file. (This is similar to what Git does with its index lock.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants