This repository has been archived by the owner on Oct 12, 2023. It is now read-only.
Another instance of batch pool creation unreliability, I'm afraid. Over the past 48 hours I've been unable to start any low-priority nodes using a docker image which has been fine the previous 5 days.
I'm trying to boot 8 x E64s_v3 low-priority nodes. Seven will start successfully and sit idle; one will always have the status "unusable". Every. Single. Time. I must have attempted to boot 50+ pools over the past 48 hours. I have also tried booting a few dedicated nodes (only 2 or 3) and hit the same issue. While booting I have also kept an eye on the node status graphs to see whether nodes are being pre-empted during the boot process, which could prevent a node from booting successfully, but I've seen nothing out of the ordinary that would suggest this is the problem. I have also tried creating same-size and smaller pools using different VM classes (F64s_v2, D64s_v3), with the same result. Note that I am using resource files during pool creation.
Because the node is unusable, there are no files/logs for me to view, so I can't troubleshoot the issue. If I use Batch Explorer to look at what's going on, I can locate the unusable node, but on clicking it I just get: "Node is currently 'unusable', there are no files to view now". I cannot reboot the node either: a red popup warning (top right of Batch Explorer) says "Reboot failed".
As I say, everything was working fine, and now it isn't. Nothing has changed on my end in terms of pool configuration or the docker image (arcalis/nichemapr) that I'd been using without issue until 2 days ago.
Thanks,
Simon
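Even when Batch Explorer shows no files, an unusable node usually carries an `errors` collection that can be read through the Batch REST API's list-nodes operation with an OData filter. Below is a minimal, stdlib-only sketch of building that request and summarizing any recorded node errors; the account name, pool id, and api-version are placeholders, not values from this thread:

```python
import urllib.parse

def list_unusable_nodes_url(batch_url, pool_id, api_version):
    """Build the Batch REST URL that lists only 'unusable' nodes in a pool."""
    query = urllib.parse.urlencode({
        "api-version": api_version,
        "$filter": "state eq 'unusable'",  # OData filter on node state
    })
    return f"{batch_url}/pools/{pool_id}/nodes?{query}"

def summarize_node_errors(nodes):
    """Given the parsed node list (the 'value' array), return (node_id, error_code) pairs."""
    out = []
    for node in nodes:
        for err in node.get("errors", []):
            out.append((node["id"], err.get("code")))
    return out

# Hypothetical example values:
url = list_unusable_nodes_url(
    "https://myaccount.westeurope.batch.azure.com", "mypool", "2019-08-01.10.0")
```

An authenticated GET against that URL (shared-key or AAD token in the `Authorization` header) returns a JSON `value` array of nodes; `summarize_node_errors` then surfaces any ComputeNodeError codes the service recorded, which is often the only diagnostic an unusable node exposes.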
I'll check whether there have been any changes on the service side; there could have been a new deployment.
If you are on Batch Explorer, you can upload the Batch node agent logs to your Azure storage container.
The node agent logs will contain useful information about the VM and its status with the Batch service.
Pool > Node > Upload Batch logs to Storage:
(Screenshot: the "Upload Batch logs" option for a node in Batch Explorer.)
If you can share the node agent logs through email (razurebatch@microsoft.com), that would be a great help for diagnostics on our side.
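The same upload that Batch Explorer performs is exposed by the Batch REST API as the compute node `uploadbatchservicelogs` operation. Here is a hedged, stdlib-only sketch of assembling that request; the account, pool, node, container SAS URL, and api-version are all placeholder assumptions:

```python
import json
from datetime import datetime, timezone

def upload_logs_request(batch_url, pool_id, node_id,
                        container_sas_url, start_time, api_version):
    """Return (url, json_body) for the node uploadbatchservicelogs operation."""
    url = (f"{batch_url}/pools/{pool_id}/nodes/{node_id}"
           f"/uploadbatchservicelogs?api-version={api_version}")
    body = json.dumps({
        "containerUrl": container_sas_url,  # writable SAS URL for a blob container
        "startTime": start_time.strftime("%Y-%m-%dT%H:%M:%SZ"),
    })
    return url, body

# Hypothetical example values:
url, body = upload_logs_request(
    "https://myaccount.westeurope.batch.azure.com",
    "mypool", "tvm-123",
    "https://mystorage.blob.core.windows.net/logs?sv=...",
    datetime(2019, 8, 1, tzinfo=timezone.utc),
    "2019-08-01.10.0",
)
```

POSTing that body with an authenticated request asks the node agent itself to push its logs into the given storage container, which works even when the file-browsing view in Batch Explorer shows nothing.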
Can I get the region, pool name, and time of occurrence?
Thanks!
Brian