
periodic failures when creating a new cluster #60

Open
bmabey opened this issue Nov 2, 2017 · 3 comments

@bmabey

bmabey commented Nov 2, 2017

I have run into this a few times:

$ dask create foo cluster.yml
....
replicationcontroller "jupyter-notebook" created
replicationcontroller "dask-scheduler" created
replicationcontroller "dask-worker" created
INFO: Waiting for kubernetes... (^C to stop)
INFO: Services are up
INFO: Services are up
The connection to the server x.x.x.x was refused - did you specify the right host or port?
CRITICAL: Traceback (most recent call last):
  File "/Users/bmabey/anaconda/envs/drugdiscovery/lib/python3.6/site-packages/dask_kubernetes-0.0.1-py3.6.egg/dask_kubernetes/cli/main.py", line 26, in start

Is there anything I can do to have the cluster continue to be set up after this? Eventually dask info foo returned information, but I was unable to connect to any of the services.

@martindurant
Member

Do you have any idea what is actually going on?
We can certainly put more try/excepts around trying to connect to the cluster, but I'm not sure how that will help when we don't understand the cause.

Could you possibly debug where in the code the exception is happening?
Any idea how the message "Services are up" can have appeared twice?

@bmabey
Author

bmabey commented Nov 2, 2017

The double INFO: Services are up was a copy/paste error.

In general, once the info command comes back after one of these failures, I am able to connect. The one time I couldn't connect, I had been messing with the pods, so that is probably what broke it.

So I think what needs to happen is for this subprocess call to be retried with some back-off and an eventual timeout:

subprocess.CalledProcessError: Command 'kubectl --output=json --context gke_foo_us-east1-b_cluster get services' returned non-zero exit status 1.
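
Something along these lines, just as a sketch -- the helper name, attempt count, and delays below are all illustrative, not existing dask-kubernetes API:

import subprocess
import time

def check_output_with_retries(cmd, attempts=5, base_delay=1.0):
    # Hypothetical helper: retry a kubectl call with exponential
    # back-off, giving up (and re-raising) after `attempts` tries.
    for attempt in range(attempts):
        try:
            return subprocess.check_output(cmd)
        except subprocess.CalledProcessError:
            if attempt == attempts - 1:
                raise  # eventual timeout: surface the original error
            time.sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...

# e.g. check_output_with_retries(
#     ['kubectl', '--output=json', '--context',
#      'gke_foo_us-east1-b_cluster', 'get', 'services'])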

@martindurant
Member

Probably it would be reasonable to put

try:
    ...
except subprocess.CalledProcessError:
    continue

around the calls to get_pods and services_in_context within wait_until_ready (but not in the functions themselves - if they are called directly, they should raise, I think).
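
Roughly this shape, assuming wait_until_ready polls in a loop and that get_pods / services_in_context raise subprocess.CalledProcessError when kubectl fails (the interval and timeout numbers are made up, and the real function in dask_kubernetes/cli/main.py may look different):

import subprocess
import time

def wait_until_ready(checks, poll_interval=2.0, timeout=300):
    # Sketch only: `checks` would be callables such as get_pods and
    # services_in_context, each raising CalledProcessError on failure.
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            for check in checks:
                check()
        except subprocess.CalledProcessError:
            time.sleep(poll_interval)  # kubectl not reachable yet; retry
            continue
        return  # every check succeeded
    raise TimeoutError('cluster did not become ready in time')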

Would you like to contribute this?
