Runners failing due to "failed to create new OS thread" #330

Open
sb10 opened this issue Jul 2, 2020 · 1 comment
@sb10
Member

sb10 commented Jul 2, 2020

rm -fr ~/.wr_development out && /tmp/wr manager start -s local --deployment development --debug -f 2> out

[from second shell]
perl -e 'for(1..2000){print "echo $_\n"}' | /tmp/wr add -i 30114-qc_genotype --cpus 0 -m 20M -o 2 -r 0 --cwd /tmp --cwd_matters --deployment development && sleep 15 && echo "getting status...\n" && /tmp/wr status -i 30114 -z -o c --deployment development; grep -c "completed job" ~/.wr_development/log; grep -c "failed to" ~/.wr_development/log

It bombs out with:

runtime: failed to create new OS thread (have 1306 already; errno=11)
runtime: may need to increase max user processes (ulimit -u)
fatal error: newosproc

ulimit -u
1828079

This isn't something I can really fix, so I set 0-core jobs to run at most double the core count at once.

With this limit in place I couldn't replicate any problems. Reverting back to unlimited clients:
git checkout 27d9335
I also couldn't replicate the problem on an m2.3xlarge after 10 attempts.
Likewise on an s2.3xlarge and an o2.3xlarge (which needs more time to work).
Reconfirmed on an m1.3xlarge, but this has 54 cores and all the others have 30, so I needed to try with an m1.2xlarge (26 cores), or an m2/s2/o2.4xlarge with 60 cores.
It happened on an m2.4xlarge. It did not happen with the m1.2xlarge.

Even when it works, the fork issues in the runners mean that things get delayed and everything takes a lot longer to complete than on a 30 CPU machine. So limiting the number of runners is a legitimate thing to do.

Out of curiosity though, is it a Go runtime bug or an Ubuntu bug? Trying with other operating systems:

wr cloud deploy -f m2.4xlarge -m 1 -o cirros-0.3.5-x86_64-disk.img -u cirros
EROR[08-21|14:58:06] failed to launch a server in openstack: cloud server never became ready to use
[can be ssh'd to, nothing in /tmp]
wr cloud deploy -f m2.4xlarge -m 1 -o CentOS-7-2019-01-28 -u centos
runtime: failed to create new OS thread (have 363 already; errno=11)
wr cloud deploy -f m2.4xlarge -m 1 -o debian-9.8.2-20190303 -u debian
The status command failed with:
runtime: failed to create new OS thread (have 9 already; errno=11)
!!! And then worked on the second attempt.
Another status command failed with:
runtime: failed to create new OS thread (have 2 already; errno=11)
[...]
-bash: fork: retry: Resource temporarily unavailable
Did get the manager to fail with:
runtime: failed to create new OS thread (have 990 already; errno=11)

Worth looking at this again to see if there's any way to deal with these issues.

@sb10 sb10 added the bug label Jul 2, 2020
@sb10
Member Author

sb10 commented Jul 2, 2020

These tests pretty reliably reproduce the scheduler locking up and the manager dropping dead with no logged errors, despite adding the guard around the use of the reserve channel.

wr cloud deploy -f m1.3xlarge -m 1
[ssh there]
/tmp/wr manager stop --deployment development
make && scp -i /nfs/users/nfs_s/sb10/.wr_development/cloud_resources.openstack.key /nfs/users/nfs_s/sb10/go/bin/wr ubuntu@172.27.80.140:/tmp/wr

rm -fr ~/.wr_development && /tmp/wr manager start -s local --deployment development --debug && perl -e 'for(1..2000){print "echo $_\n"}' | /tmp/wr add -i 30114-qc_genotype --cpus 0 -m 20M -o 2 -r 0 --cwd /tmp --cwd_matters --deployment development && sleep 63 && echo "getting status...\n" && /tmp/wr status -i 30114 -z -o c --deployment development

Trying again after raising ulimit: that seemed to help a bit, but I still get errors and manager deaths.

syslog shows no problems until maybe everything has completed; then a bunch of runners start and exit due to the queue being empty, and then there's a whole bunch of "receive time out" and "jobqueue Connect(): could not reach the server" messages, starting at 13:30:37.
