Improving file eviction performance #696

Draft. ulrfa wants to merge 4 commits into master.

@ulrfa ulrfa commented Sep 1, 2023

No description provided.

mostynb and others added 4 commits September 1, 2023 21:40
…erloaded

We have been using a file removal semaphore with weight 5,000 (half of Go's
default 10,000 maximum OS threads, beyond which Go will crash), in an attempt
to avoid crashing when the filesystem/storage layer can't keep up with our
requirements.

This change renames that semaphore to `diskWaitSem` and also uses it for
disk-write operations. When the semaphore cannot be acquired for disk-writes,
we return HTTP 503 (service unavailable) or gRPC RESOURCE_EXHAUSTED error codes
to the client.

Relates to buchgr#638

This commit:

 - Performs evictions from a single background goroutine that receives
   files to be removed via a channel.

 - Throttles the number of concurrent Put requests with a semaphore
   (without rejecting them).

In order to:

 - Avoid crashing on high load.

 - Achieve up to 3 times faster cache eviction.

 - Achieve up to 70% higher write throughput in scenarios with many
   cache evictions.

The cache can grow above max_size when asynchronous file removals do
not keep up with new file writes. This is addressed in the following
part 2 commit. Previous bazel-remote versions masked this issue by
instead running out of operating system threads and crashing.

Change-Id: Ifa2ed6c5a093adbb407750a0d38a4181a07f227f

Introduce a disk_size_limit for the total disk space of:

 - Files currently in the cache.
 - Reserved space for files currently being uploaded.
 - Evicted files not yet removed.

Setting this limit is optional (at least for now).

Reservations for Put requests are rejected when
disk_size_limit is exceeded.

The prometheus gauge bazel_remote_disk_cache_size_bytes now
reports the maximum value over the previous 30 seconds,
so that short spikes are visible when tuning the
disk_size_limit configuration.

There is also a new prometheus gauge
bazel_remote_disk_cache_size_bytes_limit showing the currently
configured limits, to help visualize whether the current size
is getting close to the limit and to help tune
disk_size_limit.

Change-Id: Iaec29af9a2e02796c29f294b993989783d575c4b

Use the access logger instead of the error logger when
requests are rejected due to overload, in order
to avoid an overly verbose error log when many requests
are rejected.

The number of rejected requests can also be monitored via the
status codes in the prometheus metrics
http_request_duration_seconds_count and
grpc_server_handled_total.

Change-Id: I5d5999360b3e49b153fd6f122e2244d4789cf2ff