
Conversation

@ArneTR (Member) commented Nov 28, 2025

This PR introduces auto-set CPU and memory limits to prevent the host system from running into OOM or from getting timer issues with the Metric Providers.

CPU

Currently GMT pins all metric providers to core 0.
This PR uses Docker's cpuset directive to keep core 0 free for them.

This also introduces a minimum system requirement of at least two available cores.
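
Roughly, the effect can be sketched like this (illustrative only, not the PR's actual code; the os.cpu_count() lookup and the variable names are assumptions):

import os

# Sketch: core 0 stays reserved for GMT and the metric providers, all
# remaining cores are handed to the containers via --cpuset-cpus.
total_cores = os.cpu_count() or 1
if total_cores < 2:
    raise RuntimeError('GMT requires at least two CPU cores')

cpuset = ','.join(map(str, range(1, total_cores)))  # '1,2,3' on a 4-core host
docker_run_args = ['--cpuset-cpus', cpuset]
print(docker_run_args)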

Memory

The host running GMT could easily OOM if the containers request too much memory.

Previously, memory limits were optional and unset by default. With this PR, GMT auto-assigns memory limits so that 1 GB is always kept unavailable to the containers.
This should keep the host from running into OOM.
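
A minimal sketch of that auto-assignment, assuming psutil for the memory lookup and hypothetical container names (not the PR's actual code):

import psutil

GMT_RESERVED_BYTES = 1 * 1024**3  # 1 GB kept unavailable to the containers

# Split everything above the reserve equally among all containers that did
# not specify their own limit. The PR also sets --memory-swap equal to
# --memory so the containers cannot spill over into swap.
assignable = psutil.virtual_memory().total - GMT_RESERVED_BYTES
containers_without_limit = ['db', 'backend', 'frontend']  # hypothetical names
per_container = assignable // len(containers_without_limit)

for name in containers_without_limit:
    print(name, '--memory', per_container, '--memory-swap', per_container)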

This PR is subject to change. Discussion points:

  • Is 1 GB a good reserve? Typically GMT with the Python runtime takes ~800 MB to run, which would leave about 200 MB for the OS. Is that enough ...?
  • The applied configuration is very opaque. Where is the best place to display the info? Re-paste it into the usage scenario? Show it somewhere else? Maybe a limits/utilization tab?

Greptile Overview

Greptile Summary

This PR introduces automatic CPU and memory limits for Docker containers to prevent OOM and timing issues. The implementation reserves 1GB for GMT overhead, assigns remaining memory fairly to containers, and keeps CPU core 0 free for metric providers using cpuset.

Key changes:

  • Adds _populate_cpu_and_memory_limits() function that runs before container startup
  • Auto-assigns memory limits by dividing available memory (total - 1GB) equally among containers without user-specified limits
  • Moves memory limits from deploy.resources.limits.memory to mem_limit for consistent handling
  • Configures cpuset to exclude core 0 from container assignment
  • Disables swap completely by setting --memory-swap equal to --memory
  • Updates system checks to require minimum 2 CPU cores and 2GB free memory
  • Refactors memory parsing into shared utils.docker_memory_to_bytes() function
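
For illustration, a helper like utils.docker_memory_to_bytes() could look roughly like this (a sketch under assumptions, not the code from the PR):

def docker_memory_to_bytes(value):
    # Parse Docker-style memory strings such as '512m', '2g' or '1024k'
    # into bytes; bare integers are treated as bytes already.
    units = {'b': 1, 'k': 1024, 'm': 1024**2, 'g': 1024**3}
    value = str(value).strip().lower()
    if value and value[-1] in units:
        return int(float(value[:-1]) * units[value[-1]])
    return int(value)

print(docker_memory_to_bytes('512m'))  # 536870912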

Issues found:

  • Critical: The utilization reporter at optimization_providers/resources/utilization.py:20 only checks deploy.resources.limits.memory, but this is deleted and moved to mem_limit during processing. This will cause false warnings about missing memory limits.
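
A hedged sketch of how the reporter could account for both locations (the service dict structure and helper name here are assumed, not taken from the PR):

def has_memory_limit(service):
    # Accept either the new mem_limit key or the original compose location
    # under deploy.resources.limits.memory.
    if service.get('mem_limit'):
        return True
    limits = service.get('deploy', {}).get('resources', {}).get('limits', {})
    return bool(limits.get('memory'))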

Architecture notes:

  • The same CPU count validation is performed twice (lines 626 and 1110), suggesting these could be deduplicated by storing the value as an instance variable
  • The PR correctly handles edge cases like tight memory scenarios with appropriate warnings

@ArneTR (Member Author) commented Nov 28, 2025

@greptileai

@ArneTR (Member Author) commented Nov 28, 2025

@ribalba Would love your opinion on this

@greptile-apps (bot) left a comment

4 files reviewed, 3 comments

@ArneTR (Member Author) commented Nov 28, 2025

@greptileai

@greptile-apps (bot) left a comment

4 files reviewed, 4 comments

@ArneTR (Member Author) commented Nov 28, 2025

@greptileai

@greptile-apps (bot) left a comment

5 files reviewed, 5 comments

* main:
  (fix): setfacl must be sudo
  Removing ACLs for GitHub Codespace to not inherit them into the mounted filesystem of the docker container via /tmp/repo
  (fix): Removing set -x again as it writes to stderr
  Update cron schedule for website tester workflow
  (fix): Escapestring must be assigned in start of file
  Show logged in user in frontend (#1454)
@ArneTR (Member Author) commented Nov 28, 2025

@greptileai

@greptile-apps (bot) left a comment

5 files reviewed, 1 comment

@davidkopp (Contributor) left a comment

Out of interest, I looked through the implementation. In general it looks good to me 😊 but I think there is a bug in the code.

# apply cpuset but keep one core for GMT and metric providers free
# This cannot be configured via user as no knowledge of machine shall be required
docker_run_string.append('--cpuset-cpus')
docker_run_string.append(','.join(map(str, range(1,SYSTEM_ASSIGNABLE_CPU_COUNT+1)))) # range inclusive as we do not assign to 0

I think there is a bug. Based on my understanding, the +1 is wrong here and it should be

docker_run_string.append(','.join(map(str, range(1,SYSTEM_ASSIGNABLE_CPU_COUNT))))

If there are 4 cores in total, the variable SYSTEM_ASSIGNABLE_CPU_COUNT has the value 3. --cpuset-cpus is expected to be set to 1-3, but with the current implementation it is set to 1-4.

@ArneTR (Member Author) commented Dec 3, 2025

Did you take into account the perhaps unexpected behaviour of Python's range function? The right boundary is exclusive, not inclusive.

>>> list(range(1,4))
[1, 2, 3]

Please follow up if I misunderstood the bug report.

@ribalba (Member) commented Dec 2, 2025

The code looks OK from scanning over it. The bigger problem I see is that we are introducing another hurdle for people measuring with GMT, or yet another thing we warn about. When benchmarking a tool I don't know how much memory it will take, and to be honest I don't really care.

Where I do care is when the server goes down because of OOM; there I would want something in place that stops the GMT process from making the server unresponsive. So I would make this feature a config var that we can enable on the servers, but not warn on the desktop.

I can see the argument that people should always specify the constraints, and personally I totally agree, but I don't see it happening in real life.

@davidkopp (Contributor) commented:

Setting proper CPU and memory requests/reservations & limits for deployments, e.g. in Kubernetes environments, is important, especially from an environmental point of view to avoid wasting resources. So I think it makes sense to be able to set proper configurations also in a benchmarking tool like GMT, which is usually used before deployment. Of course, the configuration values between the GMT environment and a production environment may differ. But still, I think it could be valuable to think about resource constraints already during benchmarking (idealistic thinking from my side).

This topic is not relevant for all GMT users. So I see the point from Didi that it may be "another hurdle for people to measure with the GMT". Arne already mentioned the idea of having a "limits/utilization tab". If the warnings are only displayed there, users who are not interested won't see them and won't get distracted.

@ArneTR (Member Author) commented Dec 3, 2025

I would also argue that this behaviour is already enabled in development and visible to the user.

I hear that you feel a warning is too much. I will propose a new version where the containers and their configurations/limits are displayed more clearly.

Currently the info is backfilled into the usage scenario tab, where I believe it will just drown.
