Replies: 3 comments 10 replies
-
It's great to see your feedback, as you have apparently spent quite some time looking through the crates provided here. Since I don't think I fully understood the first two paragraphs, let me skip ahead a little; the blanks can hopefully be filled in as the conversation continues.
Pack headers and loose object headers are indeed different, but the hash must be computed based on the loose object header.
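To make that concrete, here is a minimal std-only sketch (the SHA-1 step itself is elided to avoid dependencies): the object id is computed over the loose-format header `<kind> <size>\0` followed by the raw content, even when the object ultimately lives in a pack.

```rust
// The id git computes covers the loose header plus content, regardless of
// how the object is stored; pack entries use a different header format of
// their own, which does not participate in hashing.
fn loose_header(kind: &str, size: usize) -> Vec<u8> {
    format!("{} {}\0", kind, size).into_bytes()
}

fn main() {
    let payload = b"hello\n";
    let mut to_hash = loose_header("blob", payload.len());
    to_hash.extend_from_slice(payload);
    assert_eq!(to_hash, b"blob 6\0hello\n");
    // SHA-1 over `to_hash` yields ce013625030ba8dba906f756967f9e9ca394464a,
    // the id `git hash-object` prints for a file containing "hello\n".
}
```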
Doing this kind of work would make an eventual move to WASM much easier though, so I am definitely open to seeing work done in that direction.
It's true, there is no server side yet.
-
It's close, but here is where it differs:
With that revelation and all the information added here, there might be a more suitable headline for this discussion, in case you consider a change. Skipping right ahead to the problem at hand:
In the interest of an MVP, I'd create a RAM-disk for all git operations and use git-fetch or gitoxide-fetch to write a bare repository there. I highly recommend going with something like that instead.
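A sketch of that MVP direction, assuming a tmpfs mount at `/mnt/ramdisk` (a hypothetical path) and shelling out to `git` rather than driving gitoxide directly; the command is only constructed here, not executed, so the example stays self-contained:

```rust
use std::process::Command;

// Build (but do not run) a `git clone --bare` invocation targeting a
// destination on an assumed RAM-disk. A bare repository written there
// behaves like any on-disk repository while its files live in memory.
fn clone_bare_cmd(url: &str, dest: &str) -> Command {
    let mut cmd = Command::new("git");
    cmd.args(["clone", "--bare", url, dest]);
    cmd
}

fn main() {
    let cmd = clone_bare_cmd(
        "https://github.com/user/repo.git", // hypothetical upstream
        "/mnt/ramdisk/repo.git",            // path on the assumed tmpfs mount
    );
    assert_eq!(cmd.get_program().to_string_lossy(), "git");
    let args: Vec<String> = cmd
        .get_args()
        .map(|a| a.to_string_lossy().into_owned())
        .collect();
    assert_eq!(
        args,
        vec![
            "clone",
            "--bare",
            "https://github.com/user/repo.git",
            "/mnt/ramdisk/repo.git"
        ]
    );
}
```

Running the constructed command (`cmd.status()`) would then produce a bare repository whose I/O never touches a physical disk, while every tool that expects a real filesystem keeps working.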
-
All my remarks thus far refer to the server side; I was under the assumption that it would at some point fetch/clone repositories from GitHub to act as a proxy with actual data. That data, in the form of git repositories, should be in memory, as I understand it. If that's the case, cloning to disk and letting the OS handle memory mapping/caching is probably the easiest and most scalable solution, but ultimately it doesn't matter as long as there is a real filesystem. Again, I don't recommend trying to generalize.
-
I'm building a mock / fake which has to operate on (and present) git repositories. While in-memory git has basically no support anywhere, I figured it would still be a good idea, as I don't really need resilience or much concurrent access.
Though, like the rest of the ecosystem, the gitoxide project doesn't have in-memory support, the various sub-crates seem like really cool building blocks with some affordances. However, reading the documentation of the sub-crates that look to be of interest, the very low-level use case seems... not unsupported, but not very coherent at this point? I don't really know if that's by design / convenience, or just a matter of having other stuff to deal with and this not being important.
Specifically, the story about parsing and serialisation of objects seems a bit all over the place. With `git_object` I was expecting the possibility of relatively easily parsing from "store blobs", but it looks like the crate only parses to and serializes from the objects themselves, the "store headers" are in `git_pack::loose::object::header`, and the individual stores implement the hashing of the result "by hand" (generally using `git_features::hash::Write` so they can write the "store blob" straight to a zlib compressor). Is that correct? Wouldn't it make sense for the header parsing and serialisation to be in `git_object` instead, such that you could parse / serialize without concerns of `Kind` and get a "store-suitable" representation (modulo possible compression)?
Also, would `git_odb` be "interested" in an in-memory store of git objects, and possibly a version of `Compound` (or maybe `Linked`) which can delegate to arbitrary store implementations? (I'm not sure `Find` is object-safe, though I figure there only needs to be one store which is `Write`, and it would logically be the "head of line" when looking up an object.)
And finally, I've looked around for any support for the server side of the git network protocol(s), and so far that doesn't seem to exist, right? `git-protocol` and `git-transport` seem to mostly / only implement the client side of the communication at this point, unless I missed something?