Promisor objects (partial clone) strategy? #1041

cesfahani · 2023-09-29T03:47:03Z

Hello!

I know partial clones (and promisor packfiles) are listed on the roadmap, which is fantastic. I know I'm not alone in my belief that partial clones will become the successor to LFS, and the only reasonable way to handle large repositories at scale. That said, I wanted to ask if there's been any discussion about designing components of gitoxide in a way that will make implementing partial clones easy, without sacrificing performance when working with a partial clone.

Partial clones in git are... functional. They are awesome at the surface, making clone times orders of magnitude faster than full clones, but as you keep using them for daily work, it soon becomes obvious that they were retrofitted into git in a far less-than-ideal way. This is well documented in this section of the git docs. The vast majority of operations in git have not been "optimized" for partial clones, and the result of this is beyond painful... Commands like git blame or git rebase will end up trying to fetch each promisor object eagerly as it's encountered, which results not only in a round-trip to the remote for each missing object, but also a completely new SSH/HTTPS session. I've developed some custom tooling to help with this by determining which objects would be missing for a specific command to complete, and then fetching them in a single request, but this is clearly far less than ideal.

Has there been any consideration into how this may all work with gitoxide? For partial clones to work well, each operation needs to first resolve the set of promisor object IDs that it needs, then fetch them all in one big request, and only then carry on with the operation. From what I can tell, gitoxide hasn't yet implemented much functionality that would be impacted by this (checkout is one exception I'm aware of). But before I dive into anything, I'd love to know if there are any existing plans/designs around this.
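The resolve, fetch, continue flow described above can be sketched roughly as follows. This is a minimal illustration and not gitoxide's API: `ObjectId`, `LocalOdb`, `Remote`, `fetch_batch`, `missing_objects` and `run_with_prefetch` are all hypothetical names, and the "remote" is an in-memory map that only counts round-trips so the effect of batching is visible.

```rust
use std::collections::{BTreeSet, HashMap};

/// Hypothetical object id; real ids are SHA-1/SHA-256 hashes.
type ObjectId = u32;

/// Hypothetical local object database that may lack promised objects.
#[derive(Default)]
struct LocalOdb {
    objects: HashMap<ObjectId, Vec<u8>>,
}

/// Hypothetical remote that counts round-trips so batching is observable.
struct Remote {
    objects: HashMap<ObjectId, Vec<u8>>,
    round_trips: usize,
}

impl Remote {
    /// Fetch many objects in a single request (one round-trip).
    fn fetch_batch(&mut self, ids: &BTreeSet<ObjectId>) -> Vec<(ObjectId, Vec<u8>)> {
        self.round_trips += 1;
        ids.iter()
            .filter_map(|id| self.objects.get(id).map(|data| (*id, data.clone())))
            .collect()
    }
}

/// Phase 1: work out which of the needed objects are missing locally.
fn missing_objects(odb: &LocalOdb, needed: &[ObjectId]) -> BTreeSet<ObjectId> {
    needed
        .iter()
        .copied()
        .filter(|id| !odb.objects.contains_key(id))
        .collect()
}

/// Phases 2 and 3: fetch everything missing in one request, then run the operation.
fn run_with_prefetch<T>(
    odb: &mut LocalOdb,
    remote: &mut Remote,
    needed: &[ObjectId],
    operation: impl FnOnce(&LocalOdb) -> T,
) -> T {
    let missing = missing_objects(odb, needed);
    if !missing.is_empty() {
        for (id, data) in remote.fetch_batch(&missing) {
            odb.objects.insert(id, data);
        }
    }
    operation(odb)
}
```

The sketch assumes the needed set is known up front. For operations that discover objects while traversing (trees before blobs, for example), the same pattern would presumably have to be applied once per traversal level rather than once overall.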

Appreciate any thoughts anyone may have on this!

Byron · 2023-09-29T07:07:41Z

Maintainer

Thanks so much for bringing this up! Promisor objects have been a phantom to me that showed up in the git codebase, but I never found any actual documentation for it. The link you provided has already proven incredibly valuable.

I know I'm not alone in my belief that partial clones will become the successor to LFS

This is really interesting to hear! I always felt that, at least for big files, packs aren't great especially when repacking, but probably one can 'just' be smart about the way one organizes packs. Big objects could go into their own packs, and once they are beyond a certain size one could leave them alone and start a new one (instead of rewriting them). With this in mind, I also think this could be something, but… I also feel that for big files I want a file system and a protocol to efficiently sync multiple remotes for them.
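As a rough sketch of the pack organization idea above (big objects in their own packs, and packs beyond a certain size left alone), one could imagine a planner like the following. Everything here is hypothetical: `PackPlanner`, `LargePack`, `big_threshold` and `seal_at` are illustrative names, objects are represented only by their sizes, and nothing of this exists in git or gitoxide as far as this thread says.

```rust
/// Hypothetical pack of large objects; `sealed` packs are never rewritten.
#[derive(Debug, Default)]
struct LargePack {
    object_sizes: Vec<u64>,
    total: u64,
    sealed: bool,
}

/// Hypothetical planner: objects below `big_threshold` stay in ordinary packs,
/// larger ones are appended to the currently open large pack.
struct PackPlanner {
    big_threshold: u64,
    seal_at: u64,
    small_objects: Vec<u64>,
    large_packs: Vec<LargePack>,
}

impl PackPlanner {
    fn new(big_threshold: u64, seal_at: u64) -> Self {
        Self {
            big_threshold,
            seal_at,
            small_objects: Vec::new(),
            large_packs: Vec::new(),
        }
    }

    fn add(&mut self, size: u64) {
        if size < self.big_threshold {
            self.small_objects.push(size);
            return;
        }
        // Start a new pack if there is none yet or the last one is sealed.
        if self.large_packs.last().map_or(true, |pack| pack.sealed) {
            self.large_packs.push(LargePack::default());
        }
        let pack = self.large_packs.last_mut().expect("a pack was just ensured");
        pack.object_sizes.push(size);
        pack.total += size;
        // Once beyond the limit, this pack is left alone from now on.
        if pack.total >= self.seal_at {
            pack.sealed = true;
        }
    }
}
```

With `PackPlanner::new(100, 250)`, adding sizes 120 and 150 fills and seals the first large pack, so a later large object opens a new pack instead of forcing a rewrite of the old one.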

Has there been any consideration into how this may all work with gitoxide?

No, they have been too ominous for that. But that changed today and they are on my radar now. Intuitively I say they should be supported, but only in a minimal fashion at first. With that in mind, gitoxide would probably go through the same transformation as git did as they would be added after the fact.

Such an addition should happen ideally when various items of "Future Work" have been defined and implemented. Right now, I see so many issues with promisor packs that I don't think they should be implemented now.
I also see no other way than to adjust algorithms specifically to deal with promisor packs and gather the objects they need in bulk, so I don't think anything is lost by postponing this feature until it's more mature in git itself. It should even be beneficial, as it keeps complexity down right now, which should mean faster progress.

But before I dive into anything, I'd love to know if there are any existing plans/designs around this.

For what I need out of gitoxide myself, I am banking on git-lfs and think this could be much better. That is where my focus will go, but mainly because I don't feel good about using the git ODB as large object storage, as then everything has to go through ZLIB.

But if everything goes according to plan, gitoxide will be the fastest git implementation (merely because it can be fearlessly parallel), and that would mean more interest from those who have huge repositories, for which the promisor system was created in the first place.

Those parties will be welcome to contribute such a feature, of course.

PS: I was contemplating 'minimal support', which would mean handling the promisor system so that it does not fail, even if it would be slow and inefficient. But I discarded that, as 'minimal' is likely unusably slow, while the feature isn't needed by the majority of the user base in the first place.

6 replies

Byron Sep 29, 2023
Maintainer

From what I know, DEFLATE/INFLATE is what's used to read pack entries post entry-header, but nothing says this compression algorithm couldn't be changed in Pack V3 for instance.
An optimization I can imagine is to set the compression level for huge files to 0, which should store them verbatim, which in theory should nearly be as good as the compression not being there (but I doubt there is no overhead).
Streaming large pack entries could be done, and if the implementation would know that this is a 'pack of large files', it could even omit memory mapping and speed up the reads even more.
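To make the "compression level 0" idea a bit more concrete, here is a small hand-rolled encoder that wraps data in a zlib stream consisting only of "stored" (uncompressed) DEFLATE blocks, following RFC 1950 and RFC 1951. It is an illustration of the format, not what git, zlib or gitoxide emit byte-for-byte; `zlib_store` and `adler32` are names made up for this sketch. It shows where the remaining overhead comes from: a 2-byte header, 5 bytes of framing per block of at most 65,535 bytes, and a 4-byte Adler-32 checksum that still has to be computed over the whole payload.

```rust
/// Adler-32 checksum as defined by the zlib format (RFC 1950).
fn adler32(data: &[u8]) -> u32 {
    const MOD: u32 = 65_521;
    let (mut a, mut b) = (1u32, 0u32);
    for &byte in data {
        a = (a + u32::from(byte)) % MOD;
        b = (b + a) % MOD;
    }
    (b << 16) | a
}

/// Wrap `data` in a zlib stream made only of stored (uncompressed) DEFLATE blocks.
fn zlib_store(data: &[u8]) -> Vec<u8> {
    const MAX_BLOCK: usize = 65_535; // LEN is a 16-bit field
    let mut out: Vec<u8> = vec![0x78, 0x01]; // zlib header: deflate, 32 KiB window
    let mut chunks = data.chunks(MAX_BLOCK).peekable();
    if chunks.peek().is_none() {
        // Empty input still needs one final, empty stored block.
        out.extend_from_slice(&[0x01, 0x00, 0x00, 0xff, 0xff]);
    }
    while let Some(chunk) = chunks.next() {
        let is_last = chunks.peek().is_none();
        let len = chunk.len() as u16;
        out.push(u8::from(is_last)); // BFINAL bit, BTYPE = 00 (stored)
        out.extend_from_slice(&len.to_le_bytes());
        out.extend_from_slice(&(!len).to_le_bytes()); // NLEN: one's complement of LEN
        out.extend_from_slice(chunk); // payload, byte-for-byte
    }
    out.extend_from_slice(&adler32(data).to_be_bytes());
    out
}
```

For a 70,000-byte input this yields two stored blocks and 16 bytes of overhead in total, so the size cost is negligible; what remains is the per-block framing a reader has to walk and the checksum pass over the data.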

The question is just if packs then have any advantage over loose files (in the 'a file in the file system sense'). One could be tempted to think that loose objects would be a solution, and maybe they even are if one implements them in a streaming fashion.
But I keep thinking that just using a file system should always be faster for incompressible files, and it's 'just' the synchronization that then needs solving.

I think I will find out, because I really need a git-lfs server and client implementation 😅. git-lfs is like its own object store with its own mechanism for transfer and synchronization between multiple remotes (and I am thinking server-side backups or failovers).

Maybe one inherent problem is that I didn't implement gix gc or gix repack, so I only have a very limited understanding of what's possible when optimizing packs, and miss how they could indeed be the future of git-lfs. It would be nice to have 'the one' object store, for sure; using git for object transfer as well has many perks. But maybe I am just rambling by now 😁.

cesfahani Sep 29, 2023
Author

Thank you for the detailed reply, @Byron!

I also feel that for big files I want a file system and a protocol to efficiently sync multiple remotes for them.

The folks over at GitLab have some interesting ideas about this. If/when that lands, the server will get to decide which objects go to which object stores, and can shuffle things around transparently to the client. In addition, the client still has control over deciding the criteria for which objects to fetch up-front, and which to lazily fetch only as needed.
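On the client side, those criteria are expressed today as partial-clone filter specs such as `blob:none` and `blob:limit=<n>[kmg]`. A minimal sketch of how a client might interpret a subset of them is shown below; `BlobFilter`, `parse_blob_filter` and `fetch_upfront` are hypothetical names for illustration, only lowercase unit suffixes are handled, and other spec forms (such as `tree:<depth>`) are simply rejected here.

```rust
/// A subset of git's partial-clone filter specs, for illustration only.
#[derive(Debug, PartialEq)]
enum BlobFilter {
    /// `blob:none`: omit all blobs up-front.
    NoBlobs,
    /// `blob:limit=<n>[kmg]`: omit blobs of at least this many bytes.
    Limit(u64),
}

/// Parse `blob:none` or `blob:limit=<n>[kmg]`; anything else yields `None`.
fn parse_blob_filter(spec: &str) -> Option<BlobFilter> {
    if spec == "blob:none" {
        return Some(BlobFilter::NoBlobs);
    }
    let value = spec.strip_prefix("blob:limit=")?;
    // The suffixes name units of KiB, MiB and GiB.
    let (digits, multiplier) = match value.chars().last()? {
        'k' => (&value[..value.len() - 1], 1u64 << 10),
        'm' => (&value[..value.len() - 1], 1u64 << 20),
        'g' => (&value[..value.len() - 1], 1u64 << 30),
        _ => (value, 1u64),
    };
    digits
        .parse::<u64>()
        .ok()?
        .checked_mul(multiplier)
        .map(BlobFilter::Limit)
}

impl BlobFilter {
    /// Whether a blob of `size` bytes is fetched up-front rather than lazily.
    fn fetch_upfront(&self, size: u64) -> bool {
        match self {
            BlobFilter::NoBlobs => false,
            BlobFilter::Limit(limit) => size < *limit,
        }
    }
}
```

For example, with `blob:limit=10k` a blob of 10,239 bytes would be fetched up-front, while one of exactly 10,240 bytes would be left to be fetched lazily.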

Another improvement in a similar spirit made recently to git is the bundle-uri feature, and it has been designed with partial clones in mind. I'm very much looking forward to seeing where that one goes. Git hosting solutions are working on implementing this now, as it should give them significant compute savings. That might also be worth considering for the gitoxide roadmap.

Those parties will be welcome to contribute such a feature, of course.
PS: I was contemplating 'minimal support', which would mean handling the promisor system so that it does not fail, even if it would be slow and inefficient. But I discarded that, as 'minimal' is likely unusably slow, while the feature isn't needed by the majority of the user base in the first place.

Any objection to me looking into this "minimal support" option? I understand it'd need to be implemented in a way that won't impede progress on other gitoxide features. I'd love to whip up a quick proposal and get feedback from you prior to diving in.

NobodyXu Sep 29, 2023

P.S. Here's the official doc for bundle-uri

Byron Sep 30, 2023
Maintainer

Thanks again!

I definitely like bundle URIs more, as they are less invasive, but I also see how they address a different need and how they align with partial clones (by providing bundles that have a specific filter applied).

Any objection to me looking into this "minimal support" option? I understand it'd need to be implemented in a way that won't impede progress on other gitoxide features. I'd love to whip up a quick proposal and get feedback from you prior to diving in.

Please feel free to create a new issue as a tracking ticket, if it's OK with you that I also edit its content at some point. That issue would of course also be used for discussions around the topic, but the digest of those would go into the issue itself.
Thanks for pushing this topic @cesfahani, it's appreciated, as it's good to integrate early.

cesfahani Oct 3, 2023
Author

Created a (very rough) new issue at #1046. I'll be working on some homework to get a better understanding of things so I can fill in more details there, but feel free to edit however you like!

Thanks again @Byron.
