
Conversation

@ArneTR (Member) commented Nov 28, 2025

This PR introduces auto-set CPU and memory limits to prevent the host system from running into OOM or from getting timer issues with the Metric Providers.

CPU

Currently GMT pins all metric providers to core 0.
This PR uses Docker's cpuset directive to keep core 0 free for them.

This also introduces a minimum system requirement of at least two available cores.
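
Roughly, the effect can be sketched like this (illustrative only, not the PR's actual code; the os.cpu_count() lookup and the variable names are assumptions):

import os

# Sketch: core 0 stays reserved for GMT and the metric providers, all
# remaining cores are handed to the containers via --cpuset-cpus.
total_cores = os.cpu_count() or 1
if total_cores < 2:
    raise RuntimeError('GMT requires at least two CPU cores')

cpuset = ','.join(map(str, range(1, total_cores)))  # '1,2,3' on a 4-core host
docker_run_args = ['--cpuset-cpus', cpuset]
print(docker_run_args)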

Memory

The host running GMT could easily OOM if the containers request too much memory.

Previously, memory limits were optional and unset by default. With this PR, GMT auto-assigns memory limits so that 1 GB is always kept unavailable to the containers.
This should keep the host from running into OOM.
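
A minimal sketch of that auto-assignment, assuming psutil for the memory lookup and hypothetical container names (not the PR's actual code):

import psutil

GMT_RESERVED_BYTES = 1 * 1024**3  # 1 GB kept unavailable to the containers

# Split everything above the reserve equally among all containers that did
# not specify their own limit. The PR also sets --memory-swap equal to
# --memory so the containers cannot spill over into swap.
assignable = psutil.virtual_memory().total - GMT_RESERVED_BYTES
containers_without_limit = ['db', 'backend', 'frontend']  # hypothetical names
per_container = assignable // len(containers_without_limit)

for name in containers_without_limit:
    print(name, '--memory', per_container, '--memory-swap', per_container)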

This PR is subject to change. Discussion points:

  • Is 1 GB a good reserve? Typically GMT with the Python runtime takes ~800 MB to run, which would leave about 200 MB for the OS. Is that enough ...?
  • The applied configuration is very opaque. Where is the best place to display the info? Re-paste it into the usage scenario? Show it somewhere else? Maybe a limits/utilization tab?

Greptile Overview

Greptile Summary

This PR introduces automatic CPU and memory limits for Docker containers to prevent OOM and timing issues. The implementation reserves 1GB for GMT overhead, assigns remaining memory fairly to containers, and keeps CPU core 0 free for metric providers using cpuset.

Key changes:

  • Adds _populate_cpu_and_memory_limits() function that runs before container startup
  • Auto-assigns memory limits by dividing available memory (total - 1GB) equally among containers without user-specified limits
  • Moves memory limits from deploy.resources.limits.memory to mem_limit for consistent handling
  • Configures cpuset to exclude core 0 from container assignment
  • Disables swap completely by setting --memory-swap equal to --memory
  • Updates system checks to require minimum 2 CPU cores and 2GB free memory
  • Refactors memory parsing into shared utils.docker_memory_to_bytes() function
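
For illustration, a helper like utils.docker_memory_to_bytes() could look roughly like this (a sketch under assumptions, not the code from the PR):

def docker_memory_to_bytes(value):
    # Parse Docker-style memory strings such as '512m', '2g' or '1024k'
    # into bytes; bare integers are treated as bytes already.
    units = {'b': 1, 'k': 1024, 'm': 1024**2, 'g': 1024**3}
    value = str(value).strip().lower()
    if value and value[-1] in units:
        return int(float(value[:-1]) * units[value[-1]])
    return int(value)

print(docker_memory_to_bytes('512m'))  # 536870912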

Issues found:

  • Critical: The utilization reporter at optimization_providers/resources/utilization.py:20 only checks deploy.resources.limits.memory, but this is deleted and moved to mem_limit during processing. This will cause false warnings about missing memory limits.
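
A hedged sketch of how the reporter could account for both locations (the service dict structure and helper name here are assumed, not taken from the PR):

def has_memory_limit(service):
    # Accept either the new mem_limit key or the original compose location
    # under deploy.resources.limits.memory.
    if service.get('mem_limit'):
        return True
    limits = service.get('deploy', {}).get('resources', {}).get('limits', {})
    return bool(limits.get('memory'))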

Architecture notes:

  • The same CPU count validation is performed twice (lines 626 and 1110), suggesting these could be deduplicated by storing the value as an instance variable
  • The PR correctly handles edge cases like tight memory scenarios with appropriate warnings

@ArneTR (Member Author) commented Nov 28, 2025

@greptileai

@ArneTR (Member Author) commented Nov 28, 2025

@ribalba Would love your opinion on this

@greptile-apps (bot) left a comment

4 files reviewed, 3 comments

@ArneTR (Member Author) commented Nov 28, 2025

@greptileai

@greptile-apps (bot) left a comment

4 files reviewed, 4 comments

@ArneTR (Member Author) commented Nov 28, 2025

@greptileai

@greptile-apps (bot) left a comment

5 files reviewed, 5 comments

* main:
  (fix): setfacl must be sudo
  Removing ACLs for GitHub Codespace to not inherit them into the mounted filesystem of the docker container via /tmp/repo
  (fix): Removing set -x again as it writes to stderr
  Update cron schedule for website tester workflow
  (fix): Escapestring must be assigned in start of file
  Show logged in user in frontend (#1454)
@ArneTR (Member Author) commented Nov 28, 2025

@greptileai

@greptile-apps (bot) left a comment

5 files reviewed, 1 comment

@davidkopp (Contributor) left a comment

Out of interest, I looked through the implementation. In general it looks good to me 😊 but I think there is a bug in the code.

# apply cpuset but keep one core for GMT and metric providers free
# This cannot be configured via user as no knowledge of machine shall be required
docker_run_string.append('--cpuset-cpus')
docker_run_string.append(','.join(map(str, range(1,SYSTEM_ASSIGNABLE_CPU_COUNT+1)))) # range inclusive as we do not assign to 0

I think there is a bug. Based on my understanding, the +1 is wrong here and it should be

docker_run_string.append(','.join(map(str, range(1,SYSTEM_ASSIGNABLE_CPU_COUNT))))

If there are 4 cores in total, the variable SYSTEM_ASSIGNABLE_CPU_COUNT has the value 3. --cpuset-cpus is expected to be set to 1-3, but with the current implementation it is set to 1-4.

@ArneTR (Member Author) commented Dec 3, 2025

Did you take into account the perhaps unexpected behaviour of Python's range function? The right boundary is exclusive, not inclusive.

>>> list(range(1,4))
[1, 2, 3]

Please follow up if I misunderstood the bug report.

@ribalba (Member) commented Dec 2, 2025

The code looks OK from scanning over it. The bigger problem I see is that we are introducing another hurdle for people measuring with GMT, or yet another thing we warn about. When benchmarking a tool I don't know how much memory it will take, and to be honest I don't really care.

Where I do care is when the server goes down because of OOM; there I would want something in place that stops the GMT process from making the server unresponsive. So I would make this feature a config var that we can enable on the servers, but not warn on the desktop.

I can see the argument that people should always specify the constraints, and personally I totally agree, but I don't see it happening in real life.

@davidkopp (Contributor) commented:

Setting proper CPU and memory requests/reservations & limits for deployments, e.g. in Kubernetes environments, is important, especially from an environmental point of view to avoid wasting resources. So I think it makes sense to be able to set proper configurations also in a benchmarking tool like GMT, which is usually used before deployment. Of course, the configuration values between the GMT environment and a production environment may differ. But still, I think it could be valuable to think about resource constraints already during benchmarking (idealistic thinking from my side).

This topic is not relevant for all GMT users. So I see the point from Didi that it may be "another hurdle for people to measure with the GMT". Arne already mentioned the idea of having a "limits/utilization tab". If the warnings are only displayed there, users who are not interested won't see them and won't get distracted.

@ArneTR (Member Author) commented Dec 3, 2025

I would also argue that this behaviour is already enabled in development and visible to the user.

I hear that you feel a warning is too much. I will propose a new version where the containers and their configurations/limits are displayed more clearly.

Currently the info is backfilled into the usage scenario tab, where I believe it will just drown.
