This repository has been archived by the owner on Feb 16, 2022. It is now read-only.

Lost internet connection during "drain gpu nodes"; unable to restart job #32

Open
snowch opened this issue Feb 9, 2022 · 2 comments

snowch (Member) commented Feb 9, 2022

Ansible output:

```
TASK [drain gpu nodes] *********************************************************
failed: [localhost] (item={'changed': True, 'stdout': 'ip-10-1-0-176.eu-west-1.compute.internal', 'stderr': '', 'rc': 0, 'cmd': 'kubectl get nodes -o json | jq -r \'.items[] | select( .status.addresses[].address == "10.1.0.176") | .metadata.name\'', 'start': '2022-02-09 10:03:24.206134', 'end': '2022-02-09 10:03:25.481530', 'delta': '0:00:01.275396', 'msg': '', 'invocation': {'module_args': {'_raw_params': 'kubectl get nodes -o json | jq -r \'.items[] | select( .status.addresses[].address == "10.1.0.176") | .metadata.name\'', '_uses_shell': True, 'warn': False, 'stdin_add_newline': True, 'strip_empty_ends': True, 'argv': None, 'chdir': None, 'executable': None, 'creates': None, 'removes': None, 'stdin': None}}, 'stdout_lines': ['ip-10-1-0-176.eu-west-1.compute.internal'], 'stderr_lines': [], 'failed': False, 'item': '10.1.0.176', 'ansible_loop_var': 'item'}) => {"ansible_loop_var": "item", "changed": true, "cmd": "kubectl drain --ignore-daemonsets \"ip-10-1-0-176.eu-west-1.compute.internal\"", "delta": "0:00:01.718773", "end": "2022-02-09 10:03:27.463665", "item": {"ansible_loop_var": "item", "changed": true, "cmd": "kubectl get nodes -o json | jq -r '.items[] | select( .status.addresses[].address == \"10.1.0.176\") | .metadata.name'", "delta": "0:00:01.275396", "end": "2022-02-09 10:03:25.481530", "failed": false, "invocation": {"module_args": {"_raw_params": "kubectl get nodes -o json | jq -r '.items[] | select( .status.addresses[].address == \"10.1.0.176\") | .metadata.name'", "_uses_shell": true, "argv": null, "chdir": null, "creates": null, "executable": null, "removes": null, "stdin": null, "stdin_add_newline": true, "strip_empty_ends": true, "warn": false}}, "item": "10.1.0.176", "msg": "", "rc": 0, "start": "2022-02-09 10:03:24.206134", "stderr": "", "stderr_lines": [], "stdout": "ip-10-1-0-176.eu-west-1.compute.internal", "stdout_lines": ["ip-10-1-0-176.eu-west-1.compute.internal"]}, "msg": "non-zero return code", "rc": 1, "start": "2022-02-09 10:03:25.744892", "stderr": "error: unable to drain node \"ip-10-1-0-176.eu-west-1.compute.internal\", aborting command...\n\nThere are pending nodes to be drained:\n ip-10-1-0-176.eu-west-1.compute.internal\nerror: cannot delete Pods with local storage (use --delete-emptydir-data to override): istio-system/grafana-784c89f4cf-rk6g4", "stderr_lines": ["error: unable to drain node \"ip-10-1-0-176.eu-west-1.compute.internal\", aborting command...", "", "There are pending nodes to be drained:", " ip-10-1-0-176.eu-west-1.compute.internal", "error: cannot delete Pods with local storage (use --delete-emptydir-data to override): istio-system/grafana-784c89f4cf-rk6g4"], "stdout": "node/ip-10-1-0-176.eu-west-1.compute.internal already cordoned", "stdout_lines": ["node/ip-10-1-0-176.eu-west-1.compute.internal already cordoned"]}
```

I'm wondering if it's possible to handle this failure so the job can be re-run?
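For reference, the stderr in the log above points at the underlying cause: the node is already cordoned, and the drain then aborts because the grafana pod uses emptyDir local storage. A sketch of clearing the node manually, using the override flag the error message itself names (the node name is taken from the log; whether deleting the emptyDir contents is acceptable depends on your cluster):

```shell
# Node is already cordoned from the interrupted run, so only eviction remains.
# --delete-emptydir-data is the override suggested by the kubectl error message;
# it discards pod-local emptyDir data (here, istio-system/grafana), so use with care.
kubectl drain "ip-10-1-0-176.eu-west-1.compute.internal" \
  --ignore-daemonsets \
  --delete-emptydir-data
```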

erdincka (Collaborator) commented Feb 9, 2022

Can we just change the fail condition? I guess this is a subsequent run, so it should be safe to skip the failure if the node has already been cordoned:

```yaml
      register: return
      failed_when:
        - return.rc == 1
        - '"already cordoned" not in return.stdout'
```
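For context, a sketch of how that condition might sit in the full drain task (the task name and shell command are taken from the log above; the loop source `node_names.results` is a hypothetical register from the earlier node-lookup task, not from this repo):

```yaml
- name: drain gpu nodes
  shell: kubectl drain --ignore-daemonsets "{{ item.stdout }}"
  loop: "{{ node_names.results }}"  # hypothetical register from the node-lookup task
  register: return
  failed_when:
    - return.rc == 1
    - '"already cordoned" not in return.stdout'
```

Note that both conditions must hold for the task to fail, so an rc of 1 is tolerated only when kubectl reports the node was already cordoned; any other drain error still fails the play.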

snowch (Member, Author) commented Feb 9, 2022

Looks good!
