Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HTTP cache size limit environment variables #11530

Merged
merged 47 commits into from
Nov 13, 2024

Conversation

GregoryTravis
Copy link
Contributor

@GregoryTravis GregoryTravis commented Nov 11, 2024

Also simplified testing methods. Added LRUCacheSettings to hold settings, and mocks for free disk space and current time.

Checklist

Please ensure that the following checklist has been satisfied before submitting the PR:

  • The documentation has been updated, if necessary.
  • Screenshots/screencasts have been attached, if there are any visual changes. For interactive or animated visual changes, a screencast is preferred.
  • All code follows the
    Scala,
    Java,
    TypeScript,
    and
    Rust
    style guides. In case you are using a language not listed above, follow the Rust style guide.
  • Unit tests have been written where possible.
  • If meaningful changes were made to logic or tests affecting Enso Cloud integration in the libraries,
    or the Snowflake database integration, a run of the Extra Tests has been scheduled.
    • If applicable, it is suggested to paste a link to a successful run of the Extra Tests.

@GregoryTravis GregoryTravis marked this pull request as ready for review November 11, 2024 20:59
@GregoryTravis GregoryTravis added the CI: No changelog needed Do not require a changelog entry for this PR. label Nov 11, 2024
Comment on lines 7 to 9
public DiskSpaceGetter() {
super(() -> getRootPath().getUsableSpace());
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Feel free to ignore as it's just personal style preference, but IMHO here it would be more readable if Mockable just had a protected abstract computeValue(); that would be overridden in a sub-class, instead of passing lambdas around.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Comment on lines +94 to +97
## PRIVATE
Returns the path of the project root.
root_path : Text
root_path self = self.root.path
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It feels a bit redundant, since we can just call invokeMethod twice on Java side, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did this because invokeMember didn't work here; I get:

Exception occurred in target VM: Unsupported operation Value.invoke(path, Object...) for '(File /Users/gmt/dev/enso/enso/test/Table_Tests)'(language: Java, type: org.graalvm.polyglot.Value). You can ensure that the operation is supported using Value.canInvoke(String).

Perhaps because it's a builtin Enso type?

root.hasMember("path")
false

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I see. Well in such case such workaround seems justified.

@@ -28,22 +28,46 @@
* deleting entries to make space for new ones. All cache files are set to be deleted automatically
* on JVM exit.
*
* <p>Limits should be set with environment variables: ENSO_LIB_HTTP_CACHE_MAX_FILE_SIZE_MEGS --
* single file size, in megs ENSO_LIB_HTTP_CACHE_MAX_TOTAL_CACHE_LIMIT -- total cache size, in megs
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we use full names instead of abbreviations in the docs? Or at least let's use the standard MB abbreviation. Megs sounds very casual and may not be understandable to everyone IMHO.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Comment on lines 43 to 47
/**
* An upper limit on the total cache size. If the cache size limit specified by the other
* parameters goes over this value, then this value is used.
*/
private static final long MAX_TOTAL_CACHE_SIZE_FREE_SPACE_UPPER_BOUND = 100L * 1024 * 1024 * 1024;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand the FREE_SPACE part of this name

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed.

Comment on lines 268 to 277
public long getMaxTotalCacheSize(long currentlyUsed) {
var totalCacheSize =
switch (settings.getTotalCacheLimit()) {
case TotalCacheLimit.Bytes bytes -> bytes.bytes();
case TotalCacheLimit.Percentage percentage -> {
long usableSpace = diskSpaceGetter.get() + currentlyUsed;
yield (long) (percentage.percentage() * usableSpace);
}
};
return Long.min(MAX_TOTAL_CACHE_SIZE_FREE_SPACE_UPPER_BOUND, totalCacheSize);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few edge cases that I think need to be handled here:

  1. in Bytes mode we completely skip checking the available disk space. What if our cache limit is set to 1000MB but the device has only 100MB of available space? This code currently returns 1000MB as the max total cache size. So the cache logic will happily perform writes for up to 1GB, exceeding the available 100MB. This will likely mean that the first cache write operation that happens when we run out of disk space will probably fail with some IOException: Out of storage space. But we have no logic to handle this. Running out of storage space for a cache should not mean that a HTTP request fails - from user's perspective it has nothing to do with disk storage, so the details of cache handling should be opaque to the user and not cause errors. But instead of catching this kind of error, we should be pro-actively ensuring that our cache writes do not exceed the available disk space.
  2. moreover, I don't think it is good practice for a cache to take up the last bytes available on the disk. As that may cause other applications to stop working properly. This is maybe less important but it feels like to be a good citizen of the OS, an optional cache should not use up the last resources. Thus I was proposing to keep a margin that is never exceeded. E.g. if the available disk space is less than 100MB (or 1GB, up to us), then regardless of total cache size set, we probably should not take up these last few bytes.

(2) is optional, but (1) is I think really needed to make this cache usable - running out of disk space should not be a cause for failing HTTP requests. But since we need to ensure we don't make the cache larger than available disk space, we probably could additionally add a little margin of ~100MB to satisfy (2) as well.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When working on either 1 or 2, we need to take into account the ENSO_LIB_HTTP_CACHE_MAX_FILE_SIZE - as the cache can grow by at most this amount. So we need to ensure that before downloading any file to a cache location, we have at least this much free disk space.

Of course there's always the possibility of edge cases like other applications writing to disk at the same time as us and thus running out of space even if we had all the necessary margins. But IMHO such cases are probably ok to ignore for now. But keeping the margins is really rather easy so IMO it would be worth to do it. For the other cases - we may want to intercept the IOException and wrap it into some more Enso-friendly error, e.g. Illegal_State.Error telling the user that we ran out of disk space when trying to cache the file and that they may try to disable caches for this request or just re-run the node to try again.

Copy link
Contributor Author

@GregoryTravis GregoryTravis Nov 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of a fixed margin, I capped the total allowed size at 90% (MAX_PERCENTAGE) of the free disk space, regardless of whether the limit was specified in MB or as a percentage. I think this provides the protection we need.

For the max file size -- previously, it was clearing enough space for the download if there was a content-length, but not otherwise. Now, if there is no content-length, it clears out space for the largest allowed file. This is likely to be too much, but this case is probably a lot less common.

As for running out of space anyway, because of other applications writing to disk (or for any other reason) -- there is already logic to handle that case. If it encounters an IOException makeRequestAndCache, it removes any partially-downloaded file, and re-issues the request without caching. I don't attach a warning in this case, because this is very likely to be transient. Since it's unlikely to happen repeatedly, I don't think there's any value in bothering the user with it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah sorry, I did not notice the 90% cap. Okay, I think that is good indeed.

Looks great then.

return (long) (maxFileSizeMegs * 1024 * 1024);
}

/** Uses the environment variable if set, otherwise uses a default percentage. */
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment on this function is wrong, it does not do what it says.

Do we need this separate static function here, it feels like we can replace it with just a call to TotalCacheLimit.parse. IMO it adds an unnecessary layer of abstraction - essentially an alias that does not really have a purpose. It'd make sense if it could be overridden or had any other logic - but if all it does is delegate like an alias - I'd rather remove it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Copy link
Member

@radeusgd radeusgd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks great, the new approach to setting parameters for testing is a great improvement.

IMO we should try to address the case of the cache size possibly exceeding all disk space.

Other than that, just some small comments.

Comment on lines 100 to 101
// If we have a content-length, clear up enough space for that not. If not,
// then clear up enough space for the largest allowed file size.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some typo in the comment (not)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// If we have a content-length, clear up enough space for that not. If not,
// then clear up enough space for the largest allowed file size.
// If we have a content-length, clear up enough space for that. If not,
// then clear up enough space for the largest allowed file size.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@@ -4,8 +4,8 @@
import org.enso.base.CurrentEnsoProject;

public class DiskSpaceGetter extends Mockable<Long> {
public DiskSpaceGetter() {
super(() -> getRootPath().getUsableSpace());
public Long computeValue() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd probably add @Override at the top.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@@ -22,6 +22,7 @@

/** Makes HTTP requests with secrets in either header or query string. */
public final class EnsoSecretHelper extends SecretValueResolver {
private static final EnsoHTTPResponseCache cache = new EnsoHTTPResponseCache();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should create an instance the first time getCache is called.

Creating it statically has some problems.

We check environment variables when instantiating LRUCache inside of EnsoHTTPResponseCache constructor.

If I recall correctly, in Native Image builds the static initialization is done at image build time and thus the environment variables read during static initialization would be the variables from the build machine, not the machine on which then we are being run. So deferring the initialization may prepare us better for Native Image.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Additionally I saw throwing exceptions when the env variables are outside of ranges. Again, I don't think it's a good idea to do it inside of static initializer. Tbh I have no idea what will happen when we throw such an exception from a static initializer when we are running the GUI. I would have a guess that the Language Server will fail to boot in such a case. Which is not a good thing for users, especially as we are unlikely to have good reporting for such errors.

I'd move away to a 'lazy' initialization.

And overall I'm wondering if in this case we want throwing an exception. This will make the HTTP requests unusable due to misconfiguration. It feels like logging an error/warning and falling back to some defaults or clamping the values may be good enough - it will inform the user of the misconfiguration but will allow them to continue using the product.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reverting to default instead of throwing.

@GregoryTravis GregoryTravis merged commit fb50a8f into develop Nov 13, 2024
36 checks passed
@GregoryTravis GregoryTravis deleted the wip/gmt/11410-cache-env branch November 13, 2024 18:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CI: No changelog needed Do not require a changelog entry for this PR.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants