What happens:
When starting wr manager in cloud mode with an image that is not available (wrong name/id, wrong permissions on the image, etc.), the manager ends up in a strange state when it tries to bring up worker instances. Our understanding after reading the logs is that wr tries to create an instance using the image provided, and we think it increases its internal counter of reserved cpus according to the flavour info. Since the image is not available, wr cannot create the instance, which the logs report as

err="no OS image with prefix [<image id>] was found"

This is expected, but internally wr fails to update the number of reserved cpus. The loop repeats, increasing the number of reserved cpus with every failed worker, until the reserved count gets close to the available quota. This locks the wr manager in a state from which it will not recover, reported in the logs as

msg="lack of cores quota" schedulertype=openstack callvalue=b23809cf1a67d5f1 remaining=25 max=2915 used=460 reserved=2430

We think that even if the image became available again (after, for example, a name change or permissions update), the manager would stay stuck in the state where it thinks it has reserved all the cpus available for the quota.
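To make the sequence we suspect concrete, here is a minimal, hypothetical Go sketch (reservedCores, spawnServer and the flavour handling are illustrative names of ours, not wr's actual code) of a reservation that is taken before the spawn attempt and never given back when the image lookup fails:

```go
// Hypothetical illustration of the suspected leak; these names are ours,
// not wr's actual implementation.
package main

import (
	"errors"
	"fmt"
)

// reservedCores stands in for the scheduler's internal count of cores
// promised to pending worker instances.
var reservedCores int

// spawnServer stands in for the cloud call; with a bad image id it always
// fails, mirroring err="no OS image with prefix [...] was found".
func spawnServer(imageID string) error {
	return errors.New("no OS image with prefix [" + imageID + "] was found")
}

// trySpawn reserves cores for the flavour before attempting the spawn, but
// never releases them when the spawn fails.
func trySpawn(flavourCores int, imageID string) error {
	reservedCores += flavourCores
	if err := spawnServer(imageID); err != nil {
		return err // the reservation is leaked here on every failed worker
	}
	return nil
}

func main() {
	for i := 0; i < 5; i++ {
		_ = trySpawn(8, "bad-image-id")
	}
	fmt.Println("reserved cores after 5 failed spawns:", reservedCores) // 40
}
```

Running this leaves reservedCores at 40 even though no instance was ever created, which matches what we see in the quota log line.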
What we would expect to happen:
wr manager reports the problem with the image as it does now, but the internal counter for reserved cpus is updated after every event where a worker fails to be created because of problems with the image. The manager could then either keep trying to create instances or pause. Currently the only way to recover is to stop and restart the manager. An error message in the logs explaining that this can become an infinite loop unless corrected might also be a good idea.
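Under the same hypothetical names as the sketch above (again, ours, not wr's actual code), the behaviour we would expect is that the reservation is handed back whenever the spawn fails, so repeated image failures cannot exhaust the internal quota accounting:

```go
// Hypothetical sketch of the expected behaviour, reusing the same
// illustrative names; not wr's actual code.
package main

import (
	"errors"
	"fmt"
)

var reservedCores int

func spawnServer(imageID string) error {
	return errors.New("no OS image with prefix [" + imageID + "] was found")
}

// trySpawn still reserves cores up front, but hands them back if the spawn
// fails for any reason (missing image, bad permissions, etc.).
func trySpawn(flavourCores int, imageID string) (err error) {
	reservedCores += flavourCores
	defer func() {
		if err != nil {
			reservedCores -= flavourCores // release the reservation on failure
		}
	}()
	return spawnServer(imageID)
}

func main() {
	for i := 0; i < 5; i++ {
		_ = trySpawn(8, "bad-image-id")
	}
	fmt.Println("reserved cores after 5 failed spawns:", reservedCores) // 0
}
```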
Hi Jaime, sorry to hear about this. I'll look into it.
Can I confirm that your usage involves specifying particular image ids (and this is a case of specifying a bad one), and that there are not actually any instances using up the quota after these failed attempts?
Do I understand correctly that this isn't an issue that is completely breaking your ability to work, because you can restart with the correct image id?
We are using ids (not names) to prevent clashes. It happens with both the id of a hidden image and the id of an image which does not exist. We didn't check using a name prefix, but the result may be the same.
There is definitely free quota. In the example I mentioned (remaining=25 max=2915 used=460 reserved=2430), used=460 is the number of cpus in use by running instances. The rest of the quota is free, and there is hardware available to cover at least 1k cpus of it.
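(Assuming remaining is computed as max - used - reserved, those numbers are consistent with the reservation leak rather than with real usage: 2915 - 460 - 2430 = 25.)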
Versions affected: 0.32.0 and 0.32.1 (at least).
With input from @dkj and @kjsanger.