incorrect calculation of free storage after client restart #6172
Comments
Hi @ygersie, There are two mechanisms at play here. The first is the initial fingerprinting that a Nomad client conducts on startup. As you noted, many of these are static fingerprints; included in these are total memory, total CPU, and available disk, in order to decide the quantity of resources available for scheduling. The available resources can be adjusted using the
CPU and memory fingerprinting use total capacity (less any reserved amount); these numbers are constantly in flux, so the intention is to be able to use the
The other piece at play here is the resource allocation that happens at scheduling. The resources listed in the
Hey @cgbaker, thanks for the quick reply here. I know how the resource allocation works however what I'm stating here is that there's only 3 * 150 GiB allocated on the worker mentioned in the example and still Nomad shows that we ran out of resources. The reason I suspect the stale fingerprint is at play here is because changing the disk utilization using a dummy image doesn't influence the scheduling of the job until I restart the Nomad agent. |
Maybe to clarify a little bit better. Why can I not schedule a job asking for |
You cannot schedule a job for Consider this: if you tried to schedule a job using Nomad does treat disk differently from CPU/Memory; for disk, the amount available for scheduling is the free space during fingerprinting at startup. For CPU/Mem, the amount available for scheduling is the system total, less anything in the

There is an argument to be made that disk should be treated the same as CPU/Mem: the full disk is available for scheduling, and that users are responsible for using the

There is also an argument to be made that disk fingerprinting should not be static, but dynamic. This is a reasonable proposal; however, it is a significant change with many consequences. This is because the fingerprinting would need to account for the fact that allocated tasks may be using some-but-not-all of the storage that was allocated to them. Fingerprinting would therefore need to be allocation-aware. Also, this type of fingerprinting is very disruptive, because it continuously updates the node information (which invokes the scheduler for all allocations on the nodes, potentially causing existing allocations to be moved to other nodes).

I've brought this up with the team and we're discussing whether this should be changed.
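The difference between the two fingerprinting styles described above can be sketched as a toy model. This is not Nomad's code: the function names and all numbers are hypothetical, chosen only to show that the CPU/memory figure survives a client restart unchanged while the disk figure shrinks by whatever was written in the meantime.

```python
def schedulable_cpu_or_memory(total: float, reserved: float) -> float:
    """CPU/memory: system total minus the operator-reserved amount."""
    return total - reserved


def schedulable_disk(free_at_fingerprint: float) -> float:
    """Disk: whatever was free when the client fingerprinted at startup."""
    return free_at_fingerprint


# Hypothetical node: 16000 MHz of CPU with 1000 MHz reserved,
# and 900 GiB of free disk when the client first starts.
cpu_before = schedulable_cpu_or_memory(total=16000, reserved=1000)
disk_before = schedulable_disk(free_at_fingerprint=900)

# Allocations then write 300 GiB and the client is restarted.
# The CPU inputs are unchanged; the disk fingerprint now sees 600 GiB free.
cpu_after = schedulable_cpu_or_memory(total=16000, reserved=1000)
disk_after = schedulable_disk(free_at_fingerprint=600)

print(f"CPU schedulable:  {cpu_before} MHz before, {cpu_after} MHz after restart")
print(f"Disk schedulable: {disk_before} GiB before, {disk_after} GiB after restart")
```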
I'm still not entirely following. I get everything you're saying w.r.t. how disk is different to determine in terms of scheduling, but if you take into account the size of the storage occupied by Nomad allocations you have massive resource loss. The reason Nomad fingerprints the storage available and not the storage total (I assume) is that you cannot, from a Nomad perspective, determine what other applications on the system are utilizing disk space. If Nomad were the only thing responsible for resource management you wouldn't have this issue in the first place; you could just use the total capacity. Just like with CPU and Memory, it can't be Nomad's task to see if external sources are causing over-utilization. So why not subtract the allocated amount from the total, along with a sane default "reserved" percentage for any resources not managed by Nomad?

This fingerprinting now leads to the situation where I can only allocate (allocate, not even actually use) 440GiB out of 985GB. Surely this is not something we want and should be considered a bug. The goal of a scheduler is to efficiently schedule resources, which is definitely not the case now.

Anyway, I surely appreciate that you're following up on this issue!
That is one approach, and it's probably necessary if we use the disk total size as the schedulable amount.
I want to make sure we're on the same page... the output posted above indicates that your Nomad cluster has already allocated Can you please explain where you think the "massive resource loss" is occurring? |
Ok, I know this is kind of confusing, I'm trying to explain the issue the best I can 😄 So in the situation shown in my shell output above, I cannot schedule a job with a resource ask of only 150GiB. What I think is happening:

ygersie@worker059:~$ df -kP /var/lib/nomad
Filesystem 1024-blocks Used Available Capacity Mounted on
/dev/vda1 1032089344 381532420 608584208 39% /

This is perfectly fine the moment you start Nomad for the first time and nothing is occupying disk space yet. Now everything is working like it should and we schedule resources on this worker (in above example 3 allocations each asking for 150GiB for a total of 450GiB). Time passes and these allocations start utilizing the disk, in my example a total of around 365GB is actually used. At this point I restart the Nomad agent and the fingerprint using

As I've already allocated a total of 440GiB in running allocations Nomad states: oh shit, I only have 580GB total available now, and I've already allocated 440GiB so sorry, your resource ask for an additional 150GiB is denied due to running out of resources, which is ridiculous of course as there should at least be another 460GiB available.

That's why I'm stating: because you're counting the disk usage of running allocations in the grand total of "this is what is available on this node", you basically have "massive resource loss" if at any point you restart the agent while allocations have been filling up the disk. I hope this explains the issue better. If not, it may be easier to set up a call?
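The arithmetic in the comment above can be checked with a short sketch. It assumes, as the comment describes, that the client treats the `Available` figure from `df` at restart as the node's disk capacity and then subtracts the running allocations from it; the variable names are illustrative and not taken from Nomad.

```python
KIB_PER_GIB = 1024 * 1024

# "Available" column of the df -kP output above, in 1024-byte blocks.
df_available_kib = 608_584_208
# Disk already claimed by running allocations, as quoted in the thread.
allocated_gib = 440
# The new job's ephemeral disk ask.
requested_gib = 150

# What a restart-time fingerprint of free space would report.
fingerprinted_gib = df_available_kib / KIB_PER_GIB
# Subtracting allocations from that figure counts their on-disk usage twice:
# once because it lowered the free space, once as allocated capacity.
remaining_gib = fingerprinted_gib - allocated_gib
fits = requested_gib <= remaining_gib

print(f"fingerprinted: {fingerprinted_gib:.1f} GiB")
print(f"remaining after allocations: {remaining_gib:.1f} GiB")
print(f"150 GiB request fits: {fits}")
```

Under these assumptions the remaining capacity comes out just under the 150 GiB ask, which matches the rejection described in the comment.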
fyi: for now I'm shipping a patched version which fingerprints the total available disk space and I'll use the |
Yes, that's right. I noticed the bug where the storage that is actually used by allocations is not considered by the storage fingerprinter. I wasn't sure whether that was the problem you were reporting or whether it was something else. Thanks for bringing this up and for being patient. 😄

As to your patch, it's one approach. It has the added benefit of treating storage the same as CPU and memory. Under that approach, because the amount fingerprinted is not dependent on the used storage, as long as the reserved amount is set appropriately and no tasks exceed their allocated storage, everything should be fine.

It's possible that a tighter bound (which doesn't require manually setting the reserved stanza) could be generated by instead having the storage fingerprinter consider the storage used in the allocation directories; that would be a bigger change to the code.
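A rough sketch of the two approaches compared above, using made-up numbers. Neither function is Nomad's implementation; the names, and the idea of passing the allocation-directory usage in as a plain number, are assumptions for illustration only.

```python
def capacity_total_less_reserved(total_gib: float, reserved_gib: float) -> float:
    """Patch-style approach: fingerprint the filesystem total and subtract
    an operator-chosen reserved amount, as is done for CPU and memory."""
    return total_gib - reserved_gib


def capacity_free_plus_alloc_usage(free_gib: float, alloc_dirs_used_gib: float) -> float:
    """Allocation-aware approach: take the free space and add back what the
    allocation directories themselves occupy, so it is not counted twice."""
    return free_gib + alloc_dirs_used_gib


# Hypothetical node: 1000 GiB filesystem, 350 GiB written by allocations,
# 50 GiB used by everything else, so 600 GiB is free at restart.
total_gib, alloc_dirs_used_gib, other_used_gib = 1000, 350, 50
free_gib = total_gib - alloc_dirs_used_gib - other_used_gib

# The operator guesses conservatively and reserves 100 GiB.
by_reservation = capacity_total_less_reserved(total_gib, reserved_gib=100)
by_alloc_usage = capacity_free_plus_alloc_usage(free_gib, alloc_dirs_used_gib)

print(f"total less reserved:        {by_reservation} GiB")
print(f"free plus allocation usage: {by_alloc_usage} GiB")
```

In this toy case the second figure is tighter because it reflects the 50 GiB actually used outside Nomad rather than a hand-set reservation, at the cost of the fingerprinter needing to know about allocation directories.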
You are most welcome and thanks for the follow up. As I see it there’s a couple of options:
I would opt for the simple solution (option #3) as it would also be similar to memory and cpu, but I do not have the context you guys have. |
We're running Nomad 0.10.5 and bumped against this issue recently after we moved to standardized workload "sizing" on our Nomad clusters to make it easier to do capacity planning. After we restarted Nomad clients to pick up a Reading through this issue was enlightening. We routinely restart the Nomad client in order to pick up Fingerprinting using the filesystem total and reserving a percentage of that total to come up with a "disk available" number would fit well with our Nomad setup. As it is, we've had to remove the ephemeral disk attributes from task groups and have been relying on other, less immediate, means to control user job disk utilization. |
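The "filesystem total less a reserved percentage" idea from the comment above might look roughly like the following. This is only a sketch of the proposal, not an existing Nomad option: the function name and the `reserved_fraction` parameter are hypothetical, and the path and the 10% figure in the usage line are arbitrary.

```python
import shutil


def disk_available_for_scheduling(path: str, reserved_fraction: float) -> int:
    """Return the bytes to advertise for scheduling: the filesystem total
    for `path`, reduced by a fixed reserved fraction. Current free space is
    deliberately ignored, so the result does not change across restarts."""
    if not 0.0 <= reserved_fraction < 1.0:
        raise ValueError("reserved_fraction must be in [0.0, 1.0)")
    total_bytes = shutil.disk_usage(path).total
    return int(total_bytes * (1.0 - reserved_fraction))


# Example: reserve 10% of whatever filesystem holds the current directory.
schedulable_bytes = disk_available_for_scheduling(".", reserved_fraction=0.10)
print(f"schedulable: {schedulable_bytes / 1024**3:.1f} GiB")
```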
Oh man this is confusing! I just bought a bunch of drives unnecessarily because I mistakenly believed I didn't have enough ephemeral disk capacity as a result of this bug. This is my new ritual for working around the bug when adding a meta key to a client:

#!/bin/sh
nomad node drain -enable -force -self
service nomad stop
umount -R /mnt/nomad_alloc
mount /mnt/nomad_alloc
rm -rf /mnt/nomad_alloc/*
service nomad start
sleep 10
nomad node eligibility -self -enable
Is this still a bug in the latest or has the underlying code changed so much that this is no longer an issue? |
Same on 1.2.6, tested today.
Before update and restart: Allocated Resources, Allocation Resource Utilization, Host Resource Utilization
After: Allocated Resources, Allocation Resource Utilization, Host Resource Utilization
See also #14871, which we closed as a duplicate of this one. Note we don't have a fix in-progress though. One idea we've had some discussions around is to have the client mount a loopback filesystem that we could discard with the allocation. That would let us make firm determinations on disk quotas in the process. |
@tgross Would it be possible as a short-term workaround to implement a field I stumbled across this issue as I have a cluster with three client agents where I mix workloads with both high and low disk space demands. Those clients are dedicated Nomad workers, so I know exactly how much disk space should be allocatable and how much should be reserved for system tasks. At the moment I have started decreasing the size in the ephemeral_disk stanza to keep Nomad scheduling allocations. I have plenty of disk space left, but Nomad does not want to allocate it as soon as some of those bigger allocations aren't garbage collected before a restart.
That seems like a reasonable approach. |
I wonder if this issue was fixed? |
Version
I've tested and reproduced on 2 older versions of Nomad (0.5.6 and 0.7.1) and this seems to be still an issue in 0.9.4 as well.
Issue
Nomad can not allocate resources under the false presumption that we ran out of disk resources.
Reproduction steps
Start a Nomad agent. Compare the Allocated Resources section of `nomad node-status -self` with the available space reported by `df -h`. Now try to schedule a job with an ephemeral disk size similar to what is available to Nomad. This should succeed. Now fill up the disk using some random data: `dd if=/dev/zero of=/root/test.img bs=1M count=4096`. Try a `nomad plan` again, this should still succeed. Now restart the Nomad agent and try a `nomad plan` again, it should now fail stating it ran out of disk resources.
The behaviour w.r.t. disk resources is different than for example CPU and Memory:
A real-world example which clearly shows we should have plenty of disk resources available on this node, but due to the deduction (free (566 GiB) - allocated (440 GiB)) we ran out:
ygersie@worker059:~$ df -h /
Filesystem Size Used Avail Use% Mounted on
/dev/vda1 985G 355G 591G 38% /
And here is an example job which asks to schedule Redis on worker059:
Impact
We cannot schedule any more jobs even though we have plenty of disk resources available. This is a pretty significant bug, as I do not know a way to work around it except lowering the resource ask dramatically. For CPU and Memory we can override what Nomad thinks it has available, but not for Disk.