wr manager in cloud mode fails to update stats of reserved cpus when OS image is not available #453

Open
jmtcsngr opened this issue Mar 23, 2023 · 2 comments

@jmtcsngr

jmtcsngr commented Mar 23, 2023

Versions affected: 0.32.0 and 0.32.1 (at least)

What happens:
When starting wr manager in cloud mode with an image that is not available (wrong name/id, wrong permissions on the image, etc.), wr manager ends up in a strange state when it tries to bring up worker instances. Our understanding after reading the logs is that wr tries to create an instance using the image provided, and we think it increases its internal counter of reserved cpus according to the flavour info. Since the image is not available, wr cannot create the instance; in the logs this is reported as err="no OS image with prefix [<image id>] was found". That part is expected, but internally wr fails to update the number of reserved cpus. The loop repeats, increasing the number of reserved cpus on every failed worker, until the number of reserved cpus gets close to the available quota. This locks the wr manager in a state from which it will not recover, reported in the logs as msg="lack of cores quota" schedulertype=openstack callvalue=b23809cf1a67d5f1 remaining=25 max=2915 used=460 reserved=2430. We think that even if the image became available again (after, for example, a name change or a permissions update), the manager would remain stuck in the state where it thinks it has reserved all the cpus available in the quota.
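
For illustration only, here is a minimal Go sketch of the accounting pattern we suspect (all names are hypothetical, not wr's actual code): cores are reserved before the instance is created, so if creation fails the reservation has to be released again; without that release, every failed spawn leaks reserved cores until the quota appears exhausted.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// quotaTracker is a hypothetical stand-in for wr's internal reserved-cores
// accounting; the real type and field names in wr differ.
type quotaTracker struct {
	mu       sync.Mutex
	max      int
	used     int
	reserved int
}

// spawn simulates creating an instance from an OS image that cannot be found.
func spawn() error {
	return errors.New(`no OS image with prefix [<image id>] was found`)
}

// spawnWithReservation reserves cores up front and, crucially, releases the
// reservation again if spawning fails; the suspected bug is a missing
// release like this in the error path.
func (q *quotaTracker) spawnWithReservation(cores int) error {
	q.mu.Lock()
	if q.max-q.used-q.reserved < cores {
		q.mu.Unlock()
		return errors.New("lack of cores quota")
	}
	q.reserved += cores
	q.mu.Unlock()

	if err := spawn(); err != nil {
		// Without this decrement, every failed spawn leaks reserved cores.
		q.mu.Lock()
		q.reserved -= cores
		q.mu.Unlock()
		return err
	}
	return nil
}

func main() {
	q := &quotaTracker{max: 2915, used: 460}
	for i := 0; i < 5; i++ {
		if err := q.spawnWithReservation(30); err != nil {
			fmt.Printf("spawn failed: %v (reserved now %d)\n", err, q.reserved)
		}
	}
}
```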

What we would expect to happen:
wr manager reports the problem with the image as it does now, but the internal counter of reserved cpus is updated after every event where a worker fails to be created because of problems with the image; the manager then either keeps trying to create workers or pauses. Currently the only way to recover is to stop and start the manager. A log error message explaining that this can become an infinite loop unless corrected would probably also be a good idea.

with input from @dkj and @kjsanger

@sb10
Member

sb10 commented Mar 24, 2023

Hi Jaime, sorry to hear about this. I'll look into it.

Can I confirm that your usage involves specifying specific image ids (and this is a case of specifying a bad one), and that there are not actually any instances using up the quota after these failed attempts?

Do I understand correctly that this isn't an issue that is completely breaking your ability to work, because you can restart with the correct image id?

@sb10 sb10 self-assigned this Mar 24, 2023
@sb10 sb10 added the bug label Mar 24, 2023
@jmtcsngr
Author

Hi Sendu,

We are using ids (not names) to prevent clashes. It happens with both the id of a hidden image and the id of an image which does not exist. We didn't check using a name prefix, but it may give the same result.

There is definitely free quota. From the example I mentioned:
remaining=25 max=2915 used=460 reserved=2430
used=460 is the number of cpus in use by running instances. The rest of the quota is free, and there is hardware available to cover at least 1k cpus of that quota.
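
(In those numbers, remaining = max - used - reserved = 2915 - 460 - 2430 = 25, so almost all of the unused quota is being held by the leaked reservations rather than by running instances.)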

We have a workaround, so it's not stopping us.
