-
Notifications
You must be signed in to change notification settings - Fork 40
Enforcing CPU and memory limits #1455
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
@ribalba Love your opinion on this |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
4 files reviewed, 3 comments
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
4 files reviewed, 4 comments
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
5 files reviewed, 5 comments
* main: (fix): setfacl must be sudo Removing ACLs for GitHub Codespace to not inherit them into the mounted filesystem of the docker container via /tmp/repo (fix): Removing set -x again as it writes to stderr Update cron schedule for website tester workflow (fix): Escapestring must be assigned in start of file Show logged in user in frontend (#1454)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
5 files reviewed, 1 comment
davidkopp
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
out of interest I looked through the implementation. In general it looks good to me 😊 but I think there is a bug in the code.
| # apply cpuset but keep one core for GMT and metric providers free | ||
| # This cannot be configured via user as no knowledge of machine shall be required | ||
| docker_run_string.append('--cpuset-cpus') | ||
| docker_run_string.append(','.join(map(str, range(1,SYSTEM_ASSIGNABLE_CPU_COUNT+1)))) # range inclusive as we do not assign to 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there is bug. Based on my understanding, the +1 is wrong here and it should be
docker_run_string.append(','.join(map(str, range(1,SYSTEM_ASSIGNABLE_CPU_COUNT))))
If there are in total 4 cores, the variable SYSTEM_ASSIGNABLE_CPU_COUNT has the value 3. --cpuset-cpus is expected to be set to 1-3, but with the current implementation it is set to 1-4.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you take into account the maybe unexpected range function behaviour in python? The right boundary is exclusive, not inclusive.
>>> list(range(1,4))
[1, 2, 3]
Please follow up if I misunderstood the bug report
|
The code looks ok from scanning over it. The bigger problem I am seeing is that we are introducing another hurdle for people to measure with the GMT. Or warning on another thing. When benchmarking a tool I don't know how much memory it will take and to be honest I don't really care. Where I care is if the server goes down because of OOM and here I would want something in place stopping the GMT process making the server unresponsive. So I would make this feature a config var that we can enable on the servers but no warn on the desktop. I can see the argument that people should always specify the constraints, and I agree totally personally, but I don't see it happening in real life. |
|
Setting proper CPU and memory requests/reservations & limits for deployments e.g. in Kubernetes environments is important, especially from an environmental point of view to avoid wasting resources. So I think it makes sense to be able to set proper configurations also in a benchmarking tool like GMT that is usually used before the deployment. Of course, the configuration values between the GMT environment and a production environment may differ. But still, I think it could be valueable to think about resource constraints already during a benchmarking (idealistic thinking from my side). This topic is not relevant for all GMT users. So I see the point from Didi that it may be "another hurdle for people to measure with the GMT". Arne already mentioned the idea of having a "limits/utilization tab". If the warnings are only displayed there, users that are not interested in it won't see the warnings and don't get distracted. |
|
I would also argue that this behaviour is already enabled in development and visible for the user. I hear that you feel that a warning is too much. I will propose a new version where the Containers and their configurations / limits are displayed more clearly. Currently the info is backfilled into the usage scenario tab where i belive it will just drown. |
This PR introduces auto-set CPU and memory limits to prevent the host system from either running into OOM or getting timer issues with the Metric Providers.
CPU
Currently the GMT puts all metric providers on Core 0.
This PR uses cpuset directives of docker to keep core 0 free.
This also introduces the minium system requirement of at least having two cores available.
Memory
The host running GMT could easily OOM if the containers request too much memory.
Before memory limits where optional and had no limits. With this PR GMT will now auto-assign memory limits to the extent that 1 GB is always kept unavailable to the containers.
This shall save the host from getting OOM
This PR is subject to change. Discussion points:
Greptile Overview
Greptile Summary
This PR introduces automatic CPU and memory limits for Docker containers to prevent OOM and timing issues. The implementation reserves 1GB for GMT overhead, assigns remaining memory fairly to containers, and keeps CPU core 0 free for metric providers using
cpuset.Key changes:
_populate_cpu_and_memory_limits()function that runs before container startupdeploy.resources.limits.memorytomem_limitfor consistent handling--memory-swapequal to--memoryutils.docker_memory_to_bytes()functionIssues found:
optimization_providers/resources/utilization.py:20only checksdeploy.resources.limits.memory, but this is deleted and moved tomem_limitduring processing. This will cause false warnings about missing memory limits.Architecture notes: