This repository has been archived by the owner on Feb 16, 2022. It is now read-only.

Lost internet connection during "drain gpu nodes"; unable to restart job #32

Open
snowch opened this issue Feb 9, 2022 · 2 comments

snowch (Member) commented Feb 9, 2022

Ansible output:

```
TASK [drain gpu nodes] *********************************************************
failed: [localhost] (item={'changed': True, 'stdout': 'ip-10-1-0-176.eu-west-1.compute.internal', 'stderr': '', 'rc': 0, 'cmd': 'kubectl get nodes -o json | jq -r \'.items[] | select( .status.addresses[].address == "10.1.0.176") | .metadata.name\'', 'start': '2022-02-09 10:03:24.206134', 'end': '2022-02-09 10:03:25.481530', 'delta': '0:00:01.275396', 'msg': '', 'invocation': {'module_args': {'_raw_params': 'kubectl get nodes -o json | jq -r \'.items[] | select( .status.addresses[].address == "10.1.0.176") | .metadata.name\'', '_uses_shell': True, 'warn': False, 'stdin_add_newline': True, 'strip_empty_ends': True, 'argv': None, 'chdir': None, 'executable': None, 'creates': None, 'removes': None, 'stdin': None}}, 'stdout_lines': ['ip-10-1-0-176.eu-west-1.compute.internal'], 'stderr_lines': [], 'failed': False, 'item': '10.1.0.176', 'ansible_loop_var': 'item'}) => {"ansible_loop_var": "item", "changed": true, "cmd": "kubectl drain --ignore-daemonsets \"ip-10-1-0-176.eu-west-1.compute.internal\"", "delta": "0:00:01.718773", "end": "2022-02-09 10:03:27.463665", "item": {"ansible_loop_var": "item", "changed": true, "cmd": "kubectl get nodes -o json | jq -r '.items[] | select( .status.addresses[].address == \"10.1.0.176\") | .metadata.name'", "delta": "0:00:01.275396", "end": "2022-02-09 10:03:25.481530", "failed": false, "invocation": {"module_args": {"_raw_params": "kubectl get nodes -o json | jq -r '.items[] | select( .status.addresses[].address == \"10.1.0.176\") | .metadata.name'", "_uses_shell": true, "argv": null, "chdir": null, "creates": null, "executable": null, "removes": null, "stdin": null, "stdin_add_newline": true, "strip_empty_ends": true, "warn": false}}, "item": "10.1.0.176", "msg": "", "rc": 0, "start": "2022-02-09 10:03:24.206134", "stderr": "", "stderr_lines": [], "stdout": "ip-10-1-0-176.eu-west-1.compute.internal", "stdout_lines": ["ip-10-1-0-176.eu-west-1.compute.internal"]}, "msg": "non-zero return code", "rc": 1, "start": "2022-02-09 10:03:25.744892", "stderr": "error: unable to drain node \"ip-10-1-0-176.eu-west-1.compute.internal\", aborting command...\n\nThere are pending nodes to be drained:\n ip-10-1-0-176.eu-west-1.compute.internal\nerror: cannot delete Pods with local storage (use --delete-emptydir-data to override): istio-system/grafana-784c89f4cf-rk6g4", "stderr_lines": ["error: unable to drain node \"ip-10-1-0-176.eu-west-1.compute.internal\", aborting command...", "", "There are pending nodes to be drained:", " ip-10-1-0-176.eu-west-1.compute.internal", "error: cannot delete Pods with local storage (use --delete-emptydir-data to override): istio-system/grafana-784c89f4cf-rk6g4"], "stdout": "node/ip-10-1-0-176.eu-west-1.compute.internal already cordoned", "stdout_lines": ["node/ip-10-1-0-176.eu-west-1.compute.internal already cordoned"]}
```

I'm wondering if it's possible to handle this failure so the job can be re-run?
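For reference, the stderr in the log above points at the underlying cause: the node is already cordoned, and the drain then aborts because the grafana pod uses emptyDir local storage. A sketch of clearing the node manually, using the override flag the error message itself names (the node name is taken from the log; whether deleting the emptyDir contents is acceptable depends on your cluster):

```shell
# Node is already cordoned from the interrupted run, so only eviction remains.
# --delete-emptydir-data is the override suggested by the kubectl error message;
# it discards pod-local emptyDir data (here, istio-system/grafana), so use with care.
kubectl drain "ip-10-1-0-176.eu-west-1.compute.internal" \
  --ignore-daemonsets \
  --delete-emptydir-data
```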

erdincka (Collaborator) commented Feb 9, 2022

Can we just change the fail condition? I guess this is a subsequent run, so it should be safe to skip the failure if the node has already been cordoned:

```yaml
      register: return
      failed_when:
        - return.rc == 1
        - '"already cordoned" not in return.stdout'
```
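For context, a sketch of how that condition might sit in the full drain task (the task name and shell command are taken from the log above; the loop source `node_names.results` is a hypothetical register from the earlier node-lookup task, not from this repo):

```yaml
- name: drain gpu nodes
  shell: kubectl drain --ignore-daemonsets "{{ item.stdout }}"
  loop: "{{ node_names.results }}"  # hypothetical register from the node-lookup task
  register: return
  failed_when:
    - return.rc == 1
    - '"already cordoned" not in return.stdout'
```

Note that both conditions must hold for the task to fail, so an rc of 1 is tolerated only when kubectl reports the node was already cordoned; any other drain error still fails the play.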

snowch (Member, Author) commented Feb 9, 2022

Looks good!
