
periodic failures when creating a new cluster #60

Open
bmabey opened this issue Nov 2, 2017 · 3 comments

@bmabey

bmabey commented Nov 2, 2017

I have run into this a few times:

$ dask create foo cluster.yml
....
replicationcontroller "jupyter-notebook" created
replicationcontroller "dask-scheduler" created
replicationcontroller "dask-worker" created
INFO: Waiting for kubernetes... (^C to stop)
INFO: Services are up
INFO: Services are up
The connection to the server x.x.x.x was refused - did you specify the right host or port?
CRITICAL: Traceback (most recent call last):
  File "/Users/bmabey/anaconda/envs/drugdiscovery/lib/python3.6/site-packages/dask_kubernetes-0.0.1-py3.6.egg/dask_kubernetes/cli/main.py", line 26, in start

Is there anything I can do to have the cluster continue to be set up after this? Eventually dask info foo returned information, but I was unable to connect to any of the services.

@martindurant
Member

Do you have any idea what is actually going on?
We can certainly put more try/excepts around trying to connect to the cluster, but I'm not sure how that will help when we don't understand the cause.

Could you possibly debug where in the code the exception is happening?
Any idea how the message "Services are up" can have appeared twice?

@bmabey
Author

bmabey commented Nov 2, 2017

The double INFO: Services are up was a copy/paste error.

In general, once the info command comes back after one of these failures, I am able to connect. The one time I couldn't connect, I had been messing with the pods, so that is probably what broke it.

So I think what needs to happen is for this subprocess call to be retried with some back-off and an eventual timeout:

subprocess.CalledProcessError: Command 'kubectl --output=json --context gke_foo_us-east1-b_cluster get services' returned non-zero exit status 1.
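
Something along these lines, just as a sketch -- the helper name, attempt count, and delays below are all illustrative, not existing dask-kubernetes API:

import subprocess
import time

def check_output_with_retries(cmd, attempts=5, base_delay=1.0):
    # Hypothetical helper: retry a kubectl call with exponential
    # back-off, giving up (and re-raising) after `attempts` tries.
    for attempt in range(attempts):
        try:
            return subprocess.check_output(cmd)
        except subprocess.CalledProcessError:
            if attempt == attempts - 1:
                raise  # eventual timeout: surface the original error
            time.sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...

# e.g. check_output_with_retries(
#     ['kubectl', '--output=json', '--context',
#      'gke_foo_us-east1-b_cluster', 'get', 'services'])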

@martindurant
Member

Probably it would be reasonable to put

try:
    ...
except subprocess.CalledProcessError:
    continue

around the calls to get_pods and services_in_context within wait_until_ready (but not in the functions themselves - if they are called directly, they should raise, I think).
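
Roughly this shape, assuming wait_until_ready polls in a loop and that get_pods / services_in_context raise subprocess.CalledProcessError when kubectl fails (the interval and timeout numbers are made up, and the real function in dask_kubernetes/cli/main.py may look different):

import subprocess
import time

def wait_until_ready(checks, poll_interval=2.0, timeout=300):
    # Sketch only: `checks` would be callables such as get_pods and
    # services_in_context, each raising CalledProcessError on failure.
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            for check in checks:
                check()
        except subprocess.CalledProcessError:
            time.sleep(poll_interval)  # kubectl not reachable yet; retry
            continue
        return  # every check succeeded
    raise TimeoutError('cluster did not become ready in time')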

Would you like to contribute this?
