Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Lost jobs when using singularity/s3, high runner cpu usage #328

Open
sb10 opened this issue Jul 2, 2020 · 0 comments
Open

Lost jobs when using singularity/s3, high runner cpu usage #328

sb10 opened this issue Jul 2, 2020 · 0 comments
Labels

Comments

@sb10
Copy link
Member

sb10 commented Jul 2, 2020

User reported an issue with singularity/s3 jobs that would keep going in to lost state, though they would always recover.
I was not able to replicate this, but investigation is below and brought up some other cpu-usage issues.

User has stats like
complete=0 running=72 ready=88 dependent=0 lost=25 delayed=0 buried=0
(with single core jobs and flavors, so actually 72 servers, and he has supposedly managed 500 instances, though did have open file limit errors in his logs)

But the lost jobs always recover. He can ssh to a worker instance with a lost job, and
WR_MANAGERHOST=192.168.0.12 WR_MANAGERPORT=56669 time /tmp/wr status -o s
works just fine, returning in ~0.02s on repeated calls.

He also had the problem running lots of jobs on a single 60 core s2.4xlarge instance, where he had this state:
[nextflow] get_read_groups : complete=0 running=1 ready=1 dependent=0 lost=15 delayed=0 buried=0
[nextflow] irods_iget : complete=0 running=14 ready=1 dependent=0 lost=0 delayed=0 buried=0
all above : complete=0 running=15 ready=2 dependent=0 lost=15 delayed=0 buried=0

The iget jobs are normal, while the get_read_groups jobs are running singularity. It seems like he never gets lost iget jobs, so maybe this is somehow an issue specific to singularity?

His top showed some wr processes using 100% cpu on the worker instance.

For reference, other user's usage has wr's typically using ~30% CPU, no noticable memory. Maybe CPU usage increases when a job is outputting stdout/err continuously?
Other user also say CPU goes up to 100% when there are IO problems like lustre going wonky. In my own basic testing with sleep, runners use like 0.1% CPU.

Can I replicate and fix?

He was using Singularity 2.5.2. Also had the same problem with latest 3.3.0-rc.3.

install_singularity.sh:

sudo apt-get update
sudo apt-get install -y singularity-container stress

[this actually gives me v2.4.2-dist]

wr cloud deploy --script install_singularity.sh -f s2.4xlarge -m 1

singularity run docker://sendu/usecpu:v1
Can't modify anonymous hash ({}) in local at -e line 1, near "} ="
syntax error at -e line 2, near "{  ="
  (Might be a runaway multi-line [] string starting on line 1)
Execution of -e aborted due to compilation errors.

Just try running something that uses CPU, based on what my usecpu docker does:

/tmp/use_cpu.pl:

#!/usr/bin/env perl
sub use_cpu {
  eval {
    local $SIG{ALRM} = sub { die qq[alarm\n] };
    alarm 300;
    $x;
    while(1) {
      $x = $x * 2
    }
    alarm 0;
  };
}
use_cpu();
chmod +x /tmp/use_cpu.pl

perl -e 'for(1..1000){print "echo $_ && /tmp/use_cpu.pl\n"}' | wr add -r 0 -i use_cpu -o 2
top -u ubuntu

[perls using 100% cpu each, top usuaully using more cpu than the only wr process that fits on the screen at 0.7%, never saw any lost jobs before completion]

Try again with the runners on worker nodes:

wr cloud teardown
wr cloud deploy --script install_singularity.sh -m 3 --max_local_cores 0 --debug

perl -e 'for(1..1000){print "echo $_ && stress --cpu 1 --io 1 --vm 1 --vm-hang 60 --hdd 1 --timeout 300\n"}' | wr add -r 0 -i stress -o 2 --cloud_flavor s2.4xlarge

Same deal, no lost jobs, wr not visible in top. (stress is using about 30% cpu with these settings; it uses 100% with just the --cpu option, and wr is also not visible)

Try again, with the job spitting out tons of stdout:

perl -e "for(1..1000){print qq[echo \$_ && perl -e 'eval { local \\\$SIG{ALRM} = sub { die qq{alarm\\\n} }; alarm 300; \\\$x;  while(1) { \\\$x = \\\$x * 2; print \\\$x, qq{\\\n}; } alarm 0; };'\n]}" | wr add -r 0 -i perlprint -o 2 --cloud_flavor s2.4xlarge

Now top is filled with wr using 50-60% cpu, with perl using ~45%. No lost jobs though.

So we confirm it is stdout processing causing high cpu usage. Maybe the simplest fix would be to just stop processing std once received more than 10MB of it? Not ideal, because then we miss the actual error at the end.

Want to try again with getting something in singularity to work...
(and use smaller worker flavors since not enough hosts)

wr cloud deploy --script install_singularity.sh -m 3 --max_local_cores 0 --debug

cd /lustre/scratch117/sciops/team117/npg/sb10/wr/lostjobs
cp /lustre/scratch115/projects/singularity_gn5/samtools-1.6.img .
s3cmd put samtools-1.6.img s3://sb10/images/samtools-1.6.img
dev

perl -e 'for(1..1000){print "echo $_ && singularity exec images/samtools-1.6.img samtools view samtools_testing/NA20903.cram\n"}' | wr add -r 0 -i sing -o 2 --cloud_flavor s2.3xlarge --mounts 'ur:npg-repository,ur:sb10' --env 'REF_PATH=cram_cache/%2s/%2s/%s'

(just read a 7GB cram file)

[ssh to executor and]
top -u ubuntu

30 jobs hit run state, but top on the exectuing host only shows 1 samtools process at 98% cpu, and 1 wr process at ~60% cpu, and a load of ~0% wr jobs.
After about 2.5m, all the other samtools processes appeared. Maybe slow s3 reads of the image?
0% memory usage on everything, and all the samtools take more cpu than the wrs, averaging 40-60% and 30-40% respectively.

s3 was potentially overloaded by others during this time, but it took a long time to run. Never saw any lost jobs. 1 eventually completed after 30 mins!

Try again using his image, and writing out the view to local disk, in case that makes a difference. Also try with the output folder as an uncached write S3 mount.

He tells me he has lost jobs while he has wr runner processes with nothing running underneath them, so probably the S3 upload phase then. Confirmed by looking at top. Definitely have to test long S3 uploads.

perl -e 'for(1..1000){print "echo $_ && singularity exec samtools-1.6.img samtools view -b -o $_.bam NA20903.cram\n"}' | wr add -r 0 -i singw -o 2 --cloud_flavor s2.3xlarge --mounts 'ur:npg-repository,ur:sb10/images,ur:sb10/samtools_testing,uw:sb10/samtools_testing_output' --env 'REF_PATH=cram_cache/%2s/%2s/%s'

runners starting, but not actually running anything:

Aug 14 15:43:46 wr-dev-sb10-dc4262cb-4b77-458b-9a47-1d41423b3ba7 wrrunner[3356]: t=2019-08-14T15:43:46+0000 lvl=warn msg="command [echo 56 && singularity exec samtools-1.6.img samtools view -b -o 56.bam NA20903.cram] started running, but I killed it due to a jobqueue server error: jobqueue jstart(16dd4c3c8ab4f5e4735ae6d64742c364): bad job (not in queue or correct sub-queue)"

Then after some time, they did start a few, but then mostly just wr at 0.3% or 0 cpu, with no samtools in sight?? Maybe just initial oddity and slowness in downloading 30 of the image at once. Try again, caching the image.

wr cloud deploy --script install_singularity.sh -m 3 --max_local_cores 0 --debug
perl -e 'for(1..1000){print "echo $_ && singularity exec images/samtools-1.6.img samtools view -b -o outputs/$_.bam inputs/NA20903.cram\n"}' | wr add -r 0 -i singw -o 2 --cloud_flavor s2.3xlarge -j '[{"Mount":"cram_cache","Targets":[{"Path":"npg-repository/cram_cache","CacheDir":"/tmp/s3cache"}]},{"Mount":"images","Targets":[{"Path":"sb10/images","CacheDir":"/tmp/s3cache"}]},{"Mount":"inputs","Targets":[{"Path":"sb10/samtools_testing"}]},{"Mount":"outputs","Targets":[{"Path":"sb10/samtools_testing_output","Write":true}]}]' --env 'REF_PATH=cram_cache/%2s/%2s/%s'

More weird behaviour. On the first server brought up, only started 1 job, then on the second started 30. On both only 1 samtools process. Then number of running started to drop and fluctuate and some started to go delayed with no errors.

Odd lines in logs:

t=2019-08-14T15:59:08+0000 lvl=dbug msg="schedule changed number needed" scheduler=openstack cmd="/tmp/wr runner -s '1024:60:1:0:93541bc5fb09080985f079a80e29ebcd' --deployment development --server '192.168.0.8:46407' --domain localhost -r 1 -m 0 --debug" before=1000 needs=1001 caller=local.go:294

(with the need rising on subsequent calls)

didn't finish going through logs, but another total mess! Does the limit fix that forces the trigger screw this up?

Try and see what's going on if add just 1 job:

perl -e 'for(1..1){print "echo $_ && singularity exec images/samtools-1.6.img samtools view -b -o outputs/$_.bam inputs/NA20903.cram\n"}' | wr add -r 0 -i singw -o 2 --cloud_flavor s2.3xlarge -j '[{"Mount":"cram_cache","Targets":[{"Path":"npg-repository/cram_cache","CacheDir":"/tmp/s3cache"}]},{"Mount":"images","Targets":[{"Path":"sb10/images","CacheDir":"/tmp/s3cache"}]},{"Mount":"inputs","Targets":[{"Path":"sb10/samtools_testing"}]},{"Mount":"outputs","Targets":[{"Path":"sb10/samtools_testing_output","Write":true}]}]' --env 'REF_PATH=cram_cache/%2s/%2s/%s'

samtools started running after a few seconds, 94+% cpu, wr at under 4%.
samtools still running 50m later, gave up and killed it.

This isn't actually the same thing as what nextflow ends up doing anyway, correct and try again with 5 jobs at once:

perl -e 'for(1..5){print "echo $_ && singularity exec images/samtools-1.6.img samtools view -b -o $_.bam inputs/NA20903.cram && mv $_.bam outputs/\n"}' | wr add -r 0 -i singw -o 2 --cloud_flavor s2.3xlarge -j '[{"Mount":"cram_cache","Targets":[{"Path":"npg-repository/cram_cache","CacheDir":"/tmp/s3cache"}]},{"Mount":"images","Targets":[{"Path":"sb10/images","CacheDir":"/tmp/s3cache"}]},{"Mount":"inputs","Targets":[{"Path":"sb10/samtools_testing"}]},{"Mount":"outputs","Targets":[{"Path":"sb10/samtools_testing_output","Write":true}]}]' --env 'REF_PATH=cram_cache/%2s/%2s/%s'

Started all 5 within seconds at 99+% cpu., wr using about 1% each (even with all the samtools simultaneously peaking to 100%).

Still running at ~100% after 26m when I had to go to a meeting.
Ended up completing, didn't see what happened during upload.

Try again, with a fresh server (no cache):

perl -e 'for(6..10){print "echo $_ && singularity exec images/samtools-1.6.img samtools view -b -o $_.bam inputs/NA20903.cram && mv $_.bam outputs/\n"}' | wr add -r 0 -i singw -o 2 --cloud_flavor s2.3xlarge -j '[{"Mount":"cram_cache","Targets":[{"Path":"npg-repository/cram_cache","CacheDir":"/tmp/s3cache"}]},{"Mount":"images","Targets":[{"Path":"sb10/images","CacheDir":"/tmp/s3cache"}]},{"Mount":"inputs","Targets":[{"Path":"sb10/samtools_testing"}]},{"Mount":"outputs","Targets":[{"Path":"sb10/samtools_testing_output","Write":true}]}]' --env 'REF_PATH=cram_cache/%2s/%2s/%s'

A few seconds for the first samtools to start, then the other 4 started a few seconds after that; the cache seems to be working just fine.

After about 50mins drops to 5-30% cpu usage, wr at 4-8%. Occasional peaks to samtools using 100%, and then after 54mins, samtools ended and wr started using ~25% cpu during the mv. mv itself using 10-20% or so. All completed fine without any lost jobs.

Try again with the full 1000 after a teardown and redeploy.

s3cmd rm s3://sb10/samtools_testing_output/*.bam
perl -e 'for(1..1000){print "echo $_ && singularity exec images/samtools-1.6.img samtools view -b -o $_.bam inputs/NA20903.cram && mv $_.bam outputs/\n"}' | wr add -r 0 -i singw -o 2 --cloud_flavor s2.3xlarge -j '[{"Mount":"cram_cache","Targets":[{"Path":"npg-repository/cram_cache","CacheDir":"/tmp/s3cache"}]},{"Mount":"images","Targets":[{"Path":"sb10/images","CacheDir":"/tmp/s3cache"}]},{"Mount":"inputs","Targets":[{"Path":"sb10/samtools_testing"}]},{"Mount":"outputs","Targets":[{"Path":"sb10/samtools_testing_output","Write":true}]}]' --env 'REF_PATH=cram_cache/%2s/%2s/%s'

Hit the same issue as before with jobs restarting and going delayed with no errors.

Got up to 60 running on 2 servers, then stuff like:

t=2019-08-15T13:06:33+0000 lvl=dbug msg="schedule changed number needed" scheduler=openstack cmd="/tmp/wr runner -s '1024:60:1:0:93541bc5fb09080985f079a80e29ebcd' --de
ployment development --server '192.168.0.7:46407' --domain localhost -r 1 -m 0 --debug" before=1004 needs=1008 caller=local.go:294
t=2019-08-15T13:06:33+0000 lvl=dbug msg="processQueue starting" scheduler=openstack reason=schedule caller=local.go:489
t=2019-08-15T13:06:33+0000 lvl=dbug msg="processQueue binpacking" scheduler=openstack needs=1008 current=60 cmd="/tmp/wr runner -s '1024:60:1:0:93541bc5fb09080985f079a
80e29ebcd' --deployment development --server '192.168.0.7:46407' --domain localhost -r 1 -m 0 --debug" caller=local.go:538
t=2019-08-15T13:06:33+0000 lvl=dbug msg="processQueue canCount" scheduler=openstack can=0 running=60 should=948 caller=local.go:561
t=2019-08-15T13:06:33+0000 lvl=dbug msg="processQueue ending" scheduler=openstack caller=local.go:517

Then, having reserved echo 2 a long time ago:
t=2019-08-15T13:07:44+0000 lvl=eror msg="Server handle client request error" err="jobqueue jstart(8a302056062230af329b3fbbd647ddcf): bad job (not in queue or correct sub-queue)" caller=server.go:680
t=2019-08-15T13:07:44+0000 lvl=dbug msg="reserved job" cmd="echo 2 && singularity exec images/samtools-1.6.img samtools view -b -o 2.bam inputs/NA20903.cram && mv 2.bam outputs/" schedGrp=1024:60:1:0:93541bc5fb09080985f079a80e29ebcd caller=serverCLI.go:231
[...]
t=2019-08-15T13:08:24+0000 lvl=eror msg="Server handle client request error" err="jobqueue jstart(968457dc169fc04eff252fcacf0d2447): you must Reserve() a Job before passing it to other methods" caller=server.go:680

grepping either of those is the same pattern of bad job, must Reserve, bad job, must Reserve, with a few mins inbetween each. grepping "echo 2 " shows it was reserved multiple times.

grep wrrunner /var/log/syslog on a worker instance shows:

Aug 15 13:05:03 wr-dev-sb10-133037e8-9cb0-4033-afd7-5c3af380e497 wrrunner[2607]: t=2019-08-15T13:05:03+0000 lvl=info msg="will start executing [echo 2 && singularity exec images/samtools-1.6.img samtools view -b -o 2.bam inputs/NA20903.cram && mv 2.bam outputs/]"
[...]
Aug 15 13:07:44 wr-dev-sb10-133037e8-9cb0-4033-afd7-5c3af380e497 wrrunner[2609]: t=2019-08-15T13:07:44+0000 lvl=warn msg="command [echo 4 && singularity exec images/samtools-1.6.img samtools view -b -o 4.bam inputs/NA20903.cram && mv 4.bam outputs/] started running, but I killed it due to a jobqueue server error: jobqueue jstart(8a302056062230af329b3fbbd647ddcf): bad job (not in queue or correct sub-queue)
Aug 15 13:07:44 wr-dev-sb10-133037e8-9cb0-4033-afd7-5c3af380e497 wrrunner[2609]: t=2019-08-15T13:07:44+0000 lvl=info msg="will start executing [echo 2 && singularity exec images/samtools-1.6.img samtools view -b -o 2.bam inputs/NA20903.cram && mv 2.bam outputs/]"
[...]
Aug 15 13:07:49 wr-dev-sb10-133037e8-9cb0-4033-afd7-5c3af380e497 wrrunner[2603]: t=2019-08-15T13:07:49+0000 lvl=warn msg="command [echo 5 && singularity exec images/samtools-1.6.img samtools view -b -o 5.bam inputs/NA20903.cram && mv 5.bam outputs/] started running, but I killed it due to a jobqueue server error: jobqueue jstart(10b32d91578594a41a2fcd51c2967a57): bad job (not in queue or correct sub-queue) (and unmounting the job failed: Unmount failure(s): 2 errors occurred:\n\t* /bin/fusermount: failed to unmount /tmp/wr_cwd/1/0/b/32d91578594a41a2fcd51c2967a57869141835/cwd/images: Device or resource busy\n (code exit status 1)\n\n\t* /bin/fusermount: failed to unmount /tmp/wr_cwd/1/0/b/32d91578594a41a2fcd51c2967a57869141835/cwd/inputs: Device or resource busy\n (code exit status 1)\n\n\n)"
[...]
Aug 15 13:09:05 wr-dev-sb10-133037e8-9cb0-4033-afd7-5c3af380e497 wrrunner[2607]: t=2019-08-15T13:09:05+0000 lvl=warn msg="command [echo 2 && singularity exec images/samtools-1.6.img samtools view -b -o 2.bam inputs/NA20903.cram && mv 2.bam outputs/] started running, but I killed it due to a jobqueue server error: jobqueue jstart(47999f6c7f04284c32f24f509cfd9ed3): bad job (not in queue or correct sub-queue)"
[...]
Aug 15 13:10:32 wr-dev-sb10-133037e8-9cb0-4033-afd7-5c3af380e497 wrrunner[2609]: t=2019-08-15T13:10:32+0000 lvl=warn msg="command [echo 2 && singularity exec images/samtools-1.6.img samtools view -b -o 2.bam inputs/NA20903.cram && mv 2.bam outputs/] started running, but I killed it due to a jobqueue server error: jobqueue jstart(47999f6c7f04284c32f24f509cfd9ed3): you must Reserve() a Job before passing it to other methods"

So a runner reserves it, I guess there are problems mounting, it never jstarts before getting released to delay queue, then another runner reserves it, then the first one gets around to trying the jstart but it fails due to release, then the second one tries to jstart but I guess the client reserve info got removed from the job on going to delay, so it is regarded as unreserved?

Probably fixable by making sure we start the touch cycle before trying to mount.

After fixing that, got 60 running. They start off in an odd state in the web interface with no buttons or coloring because there's no start time.

After some time, got 1 samtools going, but the others did not start straight away, so they're all having problems with their mount? A little while later all the samtools processes started to accumulate though. By 5+mins, had about 15 samtools/30 going. By 6+mins, had all 30 going. manager logs look fine.

Started to complete after 1h15m, no lost jobs.

Try again with G's version of singularity. install_singularity.sh:

sudo apt-get update
sudo apt-get install -y make gcc python libarchive-dev
VERSION=2.5.2
wget https://github.com/singularityware/singularity/releases/download/$VERSION/singularity-$VERSION.tar.gz
tar xvf singularity-$VERSION.tar.gz
cd singularity-$VERSION
./configure --prefix=/usr/local
make
sudo make install

(giving singularity 2.5.2, can change to VERSION=3.3.0-rc.3 to try the other
version he tested with, or 3.3.0 for latest version)

wr cloud deploy --script install_singularity.sh -m 3 --max_local_cores 0 --debug

perl -e 'for(1..1000){print "echo $_ && singularity exec images/samtools-1.6.img samtools view -b -o $_.bam inputs/NA20903.cram && mv $_.bam outputs/\n"}' | wr add -r 0 -i singw -o 2 --cloud_flavor s2.3xlarge -j '[{"Mount":"cram_cache","Targets":[{"Path":"npg-repository/cram_cache","CacheDir":"/tmp/s3cache"}]},{"Mount":"images","Targets":[{"Path":"sb10/images","CacheDir":"/tmp/s3cache"}]},{"Mount":"inputs","Targets":[{"Path":"sb10/samtools_testing"}]},{"Mount":"outputs","Targets":[{"Path":"sb10/samtools_testing_output","Write":true}]}]' --env 'REF_PATH=cram_cache/%2s/%2s/%s'

In case there are problems reading the memory usage file, add a timeout on that call and make sure nothing else can block us from sending out the touch.

@sb10 sb10 added the bug label Jul 2, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

1 participant