Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cache per interpreter #62

Closed
wants to merge 2 commits into from
Closed

cache per interpreter #62

wants to merge 2 commits into from

Conversation

cpcloud
Copy link

@cpcloud cpcloud commented Jul 26, 2021

This PR is one implementation of a solution to address #61.

The problem is that when traversing the standard library, memestra can potentially (and does)
cache a standard library module's code from a potentially different version of Python than the
one that created the cache entry. This hides bugs in the cache implementation for different versions
of Python, because the cache will be hit regardless of the version.

The implementation here creates a directory below the top level cache directory that
is equivalent to sys.executable with any leading path separators or drives (in the case of Windows)
removed.

The effect is that there's a distinct cache per interpreter.

One caveat is that on some systems, an upgrade of the system python, for example /usr/bin/python3,
can result in a similar bug being hidden on such systems. I'm not sure whether
that case (development environments that use the system Python) is common
enough to warrant addressing in this PR.

Closes #61.

@serge-sans-paille
Copy link
Collaborator

Your issue seems to come from a cache entry computed with a given version of beniget/gast/memestra, and then a usage from another version of gast/beniget/memestra. Theoritically, the interpreter version should not change the semantic computed by memstra, so it shouldn't be part of the key computation. One option here would be to use beniget version in the hash computation (and probably memestra's too?)

@cpcloud
Copy link
Author

cpcloud commented Aug 1, 2021

Your issue seems to come from a cache entry computed with a given version of beniget/gast/memestra, and then a usage from another version of gast/beniget/memestra

I don't think so. I ran the same exact code with two different interpreters to reproduce.

Theoritically, the interpreter version should not change the semantic computed by memstra

Yes, but it hides bugs in caching code paths that are different between interpreters such as the ones that are hit when dealing with NamedExprs for example.

One option here would be to use beniget version in the hash computation (and probably memestra's too?)

I don't think that will help, because it's not a library version issue.

@serge-sans-paille
Copy link
Collaborator

Do you mean you have the same library version with both interpreters, but still different cache entries for the same file?

@cpcloud
Copy link
Author

cpcloud commented Aug 1, 2021

No, the cache isn't different, and that's the issue. The cache is the same, regardless of interpreter. This is problematic because of implementation details that differ in memestra due to different Python versions. This hides code paths that are different because of this different versions. The recent bugs I found in gast and beniget were found after removing the cache and running with Python 3.8. If I hadn't deleted the cache the bug wouldn't have triggered.

@serge-sans-paille
Copy link
Collaborator

This is problematic because of implementation details that differ in memestra due to different Python versions

I see nothing that explicitly depends on sys.version* in memestra.

The recent bugs I found in gast and beniget were found after removing the cache and running with Python 3.8

My understanding is that if we have a bug in beniget, we get a cache key and we keep it even once the bug in beniget is fixed. Thus my hint toward adding the beniget version in the hash key computation.

Note that I'm not saying that you're wrong. I just want to understand the exact origin of the caching issue you're meeting.

From an abstract point of view, the cache key is just a hash of the source, and the cache value is (roughly) a list of obsolete functions. This binding should not depend on the interpreter.

@cpcloud
Copy link
Author

cpcloud commented Aug 1, 2021

I see nothing that explicitly depends on sys.version* in memestra.

Indeed, but that isn't necessary for a code path that is Python 3.8 specific to be hit while traversing the import graph.

My understanding is that if we have a bug in beniget, we get a cache key and we keep it even once the bug in beniget is fixed. Thus my hint toward adding the beniget version in the hash key computation.

Hm, I don't think I'm explaining the issue clearly enough, because this would not address the issue I'm raising.

This binding should not depend on the interpreter.

Ok, I'm going to write down the exact failure mode, which will hopefully make it clear why I think the interpreter version should be part of the cache structure.

You need the previous versions of gast and beniget to reproduce the problem.

  1. With an empty cache and a Python 3.7 interpreter, run the test suite. It should pass.
  2. Without removing the cache and a Python 3.8 interpreter, run test suite. This should also pass.
  3. Remove the cache
  4. With a Python 3.8 interpreter, run the test suite. This should fail.

The second step here should always fail, even if the source code in the module is byte-for-byte equal for the different interpreters.

Hopefully this makes it clear that even if the source code being cached is exactly the same, the code path being taken to generate the cache key isn't necessarily the same.

@serge-sans-paille
Copy link
Collaborator

This looks like the hash should use as a salt:

  1. the interpreter version (to capture ast changes)
  2. beniget and gast version (to capture bugs etc)

Unfortunately, gast and beinget don't provide a __version__ field, I'll add that. Can you propose a patch with python version used as part of the hasing (and not using repo like you currently do).

@serge-sans-paille
Copy link
Collaborator

#63 implements the above idea. I've made the necessary change to gast and beniget to advertise their __version__ and will update #63 accordingly

@cpcloud
Copy link
Author

cpcloud commented Sep 5, 2021

Cool. Closing this PR then.

@cpcloud cpcloud closed this Sep 5, 2021
@cpcloud cpcloud deleted the cache-per-interpreter branch September 5, 2021 12:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

memestra tests are sensitive to global mutable state
2 participants