wr manager in cloud mode fails to update stats of reserved cpus when OS image is not available #453

Open
jmtcsngr opened this issue Mar 23, 2023 · 2 comments

@jmtcsngr

jmtcsngr commented Mar 23, 2023

Versions affected: 0.32.0 and 0.32.1 (at least)

What happens:
When starting wr manager in cloud mode with an image that is not available (wrong name/id, wrong permissions on the image, etc.), wr manager ends up in a strange state when it tries to bring up worker instances. Our understanding after reading the logs is that wr tries to create an instance using the image provided, and we think it increases its internal counter of reserved cpus according to the flavour info. Since the image is not available, wr cannot create the instance; in the logs this is reported as err="no OS image with prefix [<image id>] was found". That part is expected, but internally wr fails to update the number of reserved cpus. The loop repeats, increasing the number of reserved cpus on every failed worker, until the number of reserved cpus gets close to the available quota. This locks the wr manager in a state from which it will not recover, reported in the logs as msg="lack of cores quota" schedulertype=openstack callvalue=b23809cf1a67d5f1 remaining=25 max=2915 used=460 reserved=2430. We think that even if the image became available again (after, for example, a name change or a permissions update), the manager would remain stuck in the state where it thinks it has reserved all the cpus available in the quota.
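
For illustration only, here is a minimal Go sketch of the accounting pattern we suspect (all names are hypothetical, not wr's actual code): cores are reserved before the instance is created, so if creation fails the reservation has to be released again; without that release, every failed spawn leaks reserved cores until the quota appears exhausted.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// quotaTracker is a hypothetical stand-in for wr's internal reserved-cores
// accounting; the real type and field names in wr differ.
type quotaTracker struct {
	mu       sync.Mutex
	max      int
	used     int
	reserved int
}

// spawn simulates creating an instance from an OS image that cannot be found.
func spawn() error {
	return errors.New(`no OS image with prefix [<image id>] was found`)
}

// spawnWithReservation reserves cores up front and, crucially, releases the
// reservation again if spawning fails; the suspected bug is a missing
// release like this in the error path.
func (q *quotaTracker) spawnWithReservation(cores int) error {
	q.mu.Lock()
	if q.max-q.used-q.reserved < cores {
		q.mu.Unlock()
		return errors.New("lack of cores quota")
	}
	q.reserved += cores
	q.mu.Unlock()

	if err := spawn(); err != nil {
		// Without this decrement, every failed spawn leaks reserved cores.
		q.mu.Lock()
		q.reserved -= cores
		q.mu.Unlock()
		return err
	}
	return nil
}

func main() {
	q := &quotaTracker{max: 2915, used: 460}
	for i := 0; i < 5; i++ {
		if err := q.spawnWithReservation(30); err != nil {
			fmt.Printf("spawn failed: %v (reserved now %d)\n", err, q.reserved)
		}
	}
}
```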

What we would expect to happen:
wr manager reports the problem with the image as it does now, but the internal counter of reserved cpus is updated after every event where a worker fails to be created because of problems with the image; the manager then either keeps trying to create workers or pauses. Currently the only way to recover is to stop and start the manager. A log error message explaining that this can become an infinite loop unless corrected would probably also be a good idea.

with input from @dkj and @kjsanger

@sb10
Member

sb10 commented Mar 24, 2023

Hi Jaime, sorry to hear about this. I'll look into it.

Can I confirm that your usage involves specifying specific image ids (and this is a case of specifying a bad one), and that there are not actually any instances using up the quota after these failed attempts?

Do I understand correctly that this isn't an issue that is completely breaking your ability to work, because you can restart with the correct image id?

@sb10 sb10 self-assigned this Mar 24, 2023
@sb10 sb10 added the bug label Mar 24, 2023
@jmtcsngr
Author

Hi Sendu,

We are using ids (not names) to prevent clashes. It happens with both the id of a hidden image and the id of an image which does not exist. We didn't check using a name prefix, but it may give the same result.

There is definitely free quota. From the example I mentioned:
remaining=25 max=2915 used=460 reserved=2430
used=460 is the number of cpus in use by running instances. The rest of the quota is free, and there is hardware available to cover at least 1k cpus of that quota.
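
(In those numbers, remaining = max - used - reserved = 2915 - 460 - 2430 = 25, so almost all of the unused quota is being held by the leaked reservations rather than by running instances.)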

We have a workaround, so it's not stopping us.
