Remote Caching POC #1400

psilospore · 2021-07-08T01:25:33Z

psilospore
Jul 8, 2021

Hey all,

I'm interested in returning to my proof-of-concept PR to add remote caching (#726). However, I wanted to ask some questions and get feedback before I return to it.

Beyond what was mentioned in the PR description and the readme at https://github.com/psilospore/mill-remote-cache-server, and based on the comments in the original PR, I was planning the following:

Cache based on time to compute. If computing a target takes longer than a limit, the target will be cached.
Allow users to set the above limit.
Cache only targets relative to the project root. This would exclude caching coursier artifacts.
Work on adding a configuration to enable remote caching and point to a specific remote caching server.
Open question: Do we want opt-out caching on certain targets? For example, we could add something like T.noRemoteCache, which would be the opposite of the T.remoteCached that @lihaoyi proposed. The reason I may not want opt-in caching is that the time-based limit removes the concern of small tasks taking longer to fetch than to compute, and caching is restricted to targets relative to the project root, so we don't need to pick which targets are worth caching.
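
A minimal sketch of the upload policy in the list above, assuming a time limit in milliseconds and POSIX-style paths. The names `should_upload`, `threshold_ms`, and the example paths are hypothetical illustrations, not part of Mill or of the proof-of-concept PR.

```python
from pathlib import PurePosixPath


def should_upload(dest: str, project_root: str, elapsed_ms: int,
                  threshold_ms: int = 500) -> bool:
    """Decide whether a finished target is worth uploading.

    Hypothetical policy: the target must live under the project root
    (which excludes e.g. coursier artifacts) and must have taken longer
    than the user-configurable limit to compute.
    """
    dest_path = PurePosixPath(dest)
    root = PurePosixPath(project_root)
    inside_project = dest_path == root or root in dest_path.parents
    return inside_project and elapsed_ms > threshold_ms
```

With a rule like this, cheap targets are never uploaded, so per-target opt-in or opt-out markers may not be needed.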

The other main question I have is about this comment: I'm not sure whether it describes a completely different approach, or how I would use it. Would anyone be able to share more on that?

lefou · 2021-07-08T09:18:56Z

lefou
Jul 8, 2021
Maintainer

Hey @psilospore, great news!

I very much like the idea of only caching what's inside Mill's out folder. The out folder itself is an abstraction and is in general only accessed via API (e.g. T.dest) or as the result of another target. So I don't see any need to configure remote caching in the build file via T.{no}remoteCached. If there are reasons to enable or disable caching, they are most likely specific to the executing environment. If that is the case, we could later add some command line options, local configuration, or environment variable support.

Caching based on time: There are two aspects here, upload and download. For upload, you suggested only uploading to the cache if the execution time exceeded a threshold. Makes sense. For download, I could also imagine some concurrent strategy: start the download and the local computation together and use whichever returns first.
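
The concurrent download strategy could be sketched roughly as follows, assuming both the download and the local computation are plain callables. `first_result` is a hypothetical helper, not an existing Mill function; a real implementation would also need to fall back to the other side if the winner fails, and to stop the losing task.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def first_result(download, compute):
    """Run both callables concurrently; return (source, value) of the first to finish.

    Sketch only: if the winner raised, the exception propagates, and the
    losing thread keeps running in the background until it completes.
    """
    pool = ThreadPoolExecutor(max_workers=2)
    futures = {
        pool.submit(download): "remote",
        pool.submit(compute): "local",
    }
    done, _pending = wait(futures, return_when=FIRST_COMPLETED)
    # Do not block on the slower task (cancel_futures needs Python 3.9+).
    pool.shutdown(wait=False, cancel_futures=True)
    winner = next(iter(done))
    return futures[winner], winner.result()
```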

While upgrading zinc support in mill, I recognized that zinc itself has some remote caching features. I haven't explored how this works but my comment pointed out the place where we need to hook in, if we want to use it in mill. I can't tell how this would or should work with your approach, but it is clearly limited to compilation targets.

One thing to note though: Mill's cache is very sensitive to changes in the build setup, e.g. one small change in the build.sc or its imports, and the whole tree becomes invalid. A remote cache for Mill should probably support more than one cached value per coordinate. I have the following scenario in mind:

2 developers on two machines
1 shared remote cache
1 developer changes a small setting in her build.sc, e.g. to test stuff

If we only support one cached value per coordinate, each run from the other developer will invalidate the remote cache, making things worse than without any caching.
Conversely, if we do support multiple values per coordinate and we roll back some local changes, we may still be able to use cached targets (if present). This would allow huge speed improvements for larger teams, where not all developers have the newest working copy on their machine. Switching back and forth between branches or bisecting could also benefit from it.
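
To illustrate the difference, here is a minimal in-memory sketch in which each entry is keyed by the target coordinate plus a hash of its inputs, so entries from different build setups coexist instead of overwriting each other. The class `RemoteCache` and the key layout are assumptions for illustration, not the design of any existing cache server.

```python
from typing import Dict, Optional, Tuple


class RemoteCache:
    """In-memory stand-in for a remote cache with several values per coordinate."""

    def __init__(self) -> None:
        # Key: (target coordinate, hash of all inputs incl. build setup).
        self._entries: Dict[Tuple[str, str], bytes] = {}

    def put(self, coordinate: str, inputs_hash: str, payload: bytes) -> None:
        self._entries[(coordinate, inputs_hash)] = payload

    def get(self, coordinate: str, inputs_hash: str) -> Optional[bytes]:
        # A miss returns None; other developers' entries stay untouched.
        return self._entries.get((coordinate, inputs_hash))
```

In the two-developer scenario above, the developer with the modified build.sc writes under a different inputs hash, so the other developer's entry for the same coordinate remains usable.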

And one last thought. Ideally, the remote cache can also be used locally by a single developer, just to speed up tasks like switching branches or bisecting. To name an existing tool: ccache does exactly that in C/C++ land.

And now one implementation idea: The workhorse for holding hashes and detecting changes in Mill is PathRef. A PathRef always keeps an absolute path, which most probably is not identical on different machines. For remote caching we need comparable paths and hashes, so probably the easiest way would be to add an additional (optional) relative path and hash value to PathRef. If it is present and matches, we can use the file or directory from the remote cache. Files outside out obviously have an absent value. When downloading a cached value, we probably need to recompute the absolute path and hash.
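
A rough sketch of that idea, assuming POSIX-style paths. `PortableRef`, `to_portable`, and `to_absolute` are hypothetical names and do not reflect PathRef's actual fields or API.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath
from typing import Optional


@dataclass(frozen=True)
class PortableRef:
    rel_path: str       # path relative to the out folder
    content_hash: str   # hash that does not depend on the absolute path


def to_portable(abs_path: str, out_root: str, content_hash: str) -> Optional[PortableRef]:
    """Return a machine-independent ref, or None for files outside out."""
    try:
        rel = PurePosixPath(abs_path).relative_to(PurePosixPath(out_root))
    except ValueError:
        return None  # not under out: no relative value, hence not cacheable
    return PortableRef(str(rel), content_hash)


def to_absolute(ref: PortableRef, out_root: str) -> str:
    """Recompute the absolute path on the machine that downloaded the entry."""
    return str(PurePosixPath(out_root) / ref.rel_path)
```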

nafg · 2022-04-08T07:19:38Z

nafg
Apr 8, 2022

Maybe a prerequisite would be to invalidate less when the build changes? I'm not sure if it's sound but the naive assumption would be that only tasks whose code changed (including methods it calls etc.) should be invalidated.

I don't know how you would detect that though.

1 reply

lefou Aug 14, 2023
Maintainer

We now have much finer cache invalidation in Mill 0.11.1+, so the input hash is probably a good enough cache key.

lefou · 2022-04-08T07:30:28Z

lefou
Apr 8, 2022
Maintainer

When relying on PathRef, we should make it more reproducible and stable for directories in terms of file ordering. See:

PathRef for directories is filesystem-dependent and not reproducible #1808
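
One way to make a directory hash independent of filesystem ordering is to visit entries in sorted order and hash relative paths together with file contents. The sketch below shows that general technique; `stable_dir_hash` is a hypothetical name, and this is not the algorithm PathRef actually uses.

```python
import hashlib
import os


def stable_dir_hash(root: str) -> str:
    """Hash a directory tree so the result does not depend on listing order."""
    digest = hashlib.sha256()
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # in-place sort fixes the order os.walk descends in
        for name in sorted(filenames):
            full = os.path.join(dirpath, name)
            # Use the relative path so the hash is the same on every machine.
            rel = os.path.relpath(full, root).replace(os.sep, "/")
            with open(full, "rb") as fh:
                data = fh.read()
            digest.update(rel.encode("utf-8"))
            digest.update(b"\0")
            digest.update(len(data).to_bytes(8, "big"))
            digest.update(data)
    return digest.hexdigest()
```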

psilospore · 2022-04-20T00:07:40Z

psilospore
Apr 20, 2022
Author

Hey, sorry all. I ended up getting really busy, and I'm also doing less Scala these days, so I probably won't be working on this. I would love for someone to pick this up and would be happy to answer any questions about my proof of concept.

sgammon · 2023-08-14T05:20:08Z

sgammon
Aug 14, 2023

hey @psilospore / @lefou,

I found Mill via GitHub's dependency graph support and I'm quite interested in trying it out. I'm the author of Buildless, so of course I'm always looking to add remote caching support where we can. We'd love to support Mill and I'm happy to (try to) help get this PR over the line. Do you guys recall what was needed? Cheers and thanks for any guidance.

lefou · 2023-09-22T12:35:07Z

lefou
Sep 22, 2023
Maintainer

POC Remote Caching #2777
