Config Parse/Read/Write Discussion #45
Replies: 17 comments 5 replies
-
Hi Corbin,
it’s great to meet you and I will do my best to support your work. As I am currently traveling at 300km through China I won’t get to answering the questions right now, but will do in a few hours.
Also I am thinking it might be best to setup a zoom call (or equivalent) to get you started more quickly. As you saw I already laid out the structure, wanting to add the implementation next. Maybe it might be easiest to do that together even so you can chime in and potentially take it from there. That could be a great entry point for some Rust mentoring too, in case you are interested.
Talk more in a bit 😊
…Sent from my iPhone
On Nov 30, 2020, at 1:01 PM, Corbin Crutchley ***@***.***> wrote:
Hi! I'm a developer that's been tasked to implement GitOxide into a WASM stack in an effort to clone a project as fast as possible in the browser.
As part of that work, I'm looking to introduce a lot of new functionality into the project to get clone working. As an initial task, I'm hoping to get the git-config package able to read and write config files. I've got a (very) WIP port of config parsing logic from isomorphic-git over on my fork.
I won't bother asking for a code review right now, I'm brand new to Rust and have a colleague taking a look at the code right now to give me some pointers on how I can improve my work
That said, I am unsure of what direction we want to go in with regards to data organization. The current output is a pretty unmanageable tangled vector of Sections and Entries alike in a single struct
I've noticed that the current git-config file has both Section and Entry types that look to contain all of the relevant properties, so I'll be refactoring my current code to use that data structure.
Outside of that, I was hoping you could give me some kind of direction when it comes to the different mods in the git-config/lib.rs file:
https://github.com/Byron/gitoxide/blob/main/git-config/src/lib.rs#L57-L72
https://github.com/Byron/gitoxide/blob/main/git-config/src/lib.rs#L131-L217
I'm not entirely sure what borrowed or spawned is referring to (which I admit as a point-of-noobiness), and I'm not sure I track what Span is doing here.
I also would love some insight as to what a Token is referring to here:
https://github.com/Byron/gitoxide/blob/main/git-config/src/file.rs#L4-L23
As I'm not sure an instance that would have a Section and only a single Entry
Looking forward to learning more and being able to meaningfully contribute upstream!
—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub, or unsubscribe.
|
Beta Was this translation helpful? Give feedback.
-
Entirely understandable. Honestly, I'm surprised you responded as quickly as you have at all! I was thinking the same thing about setting up a zoom (or similar) call! I would love to work together to add the implementation - I want to make sure the code standards are up-to-snuff with the rest of the package and would absolutely love some mentoring on my Rust journey! If we'd like to schedule a meeting privately, I can be reached at Just as a heads up, our schedule is way different, as I'm located in California, USA, but I'd be happy to modify my schedule and do a (for me) very early morning/late night call in order to make this work. Thanks so much for being so open and welcoming! Looking forward to working together! |
Beta Was this translation helpful? Give feedback.
-
Alright, I thought it might be easiest to add comments in code, a commit should show up in this thread for you to look at. It's definitely not perfect but might already help a little. Truth to be told, I think I have pretty much sketched out something that I believe I could implement when doing zero-copy parsing using rust-dangerous. The crates author is super experienced with parsers and very nice and responsive, bringing the crate to a point where it would be perfect for gitoxide because of it's ease-of-use and its ability to create nice parsing errors out of the box. My vision for the implementation is definitely not the easiest, but it's designed to be en-par with the original git implementation which has a lot of bells and whistles, out of necessity I suppose. For instance, it's non-destructive for the most part even in the presence of comments. Using regex for parsing is a none-starter, which unfortunately rules out a port of isomorphic-git, the reason being that the regex crate bloats any binary and takes a long time to compile which again, would not be competitive enough with the existing git-config implementation in C. Of course I hope that this doesn't scare you, because if I can do it, so can you. For a lack of time I can't propose some times for a Zoom call right now, but would hope you drop me some suggestions (to the email address I use in commits) of times that would work for you (I am UTC+8), any remaining day this week. Thanks a bunch! |
Beta Was this translation helpful? Give feedback.
-
Love the extra comments, they help a lot. I think the zero-copy parsing is certainly the way to go. Non-destructive and more performant are huge reasons for going that route. I am under a bit of a deadline for an initial draft of the cloning work (although I have time afterward to perf optimize), so I'll likely be continuing the development of my regex implementation separately, and working to get the zero-copy parsing implementation mainlined during the same time. I'll be sending out an email right now to try to figure out a time that would work best for a sync. :D |
Beta Was this translation helpful? Give feedback.
-
Hey folks, I actually started writing a For what it's worth, my parser is currently zero-copy and guarantees emitting enough events to identically reconstruct the source git config file, and currently depends only on I'm currently writing a higher level abstraction on top of the parse that allows for semi-efficient insertion and deletion to modify git config files in place. Would the gitoxide team be interested in merging the work? |
Beta Was this translation helpful? Give feedback.
-
Hi @edward-shen, thanks for sharing your work! I am quite baffled by it's completeness and amazed that it integrates with serde. The last time I deep dived into Since the implementation here is merely an API sketch along with a 'sample' of a parser using There reason for It's used to parse bytes directly due to The idea was to provide better error messages than git itself in case of malformed files while making it easy (maybe even with a compile flag) to allow other encodings in values or section names - right now these apparently have to be ascii/latin character sets (my knowledge about this is just based on a few manual tests though). I would be very interested in your thoughts on where the current implelementation/ideas of |
Beta Was this translation helpful? Give feedback.
-
Hey @Byron! Glad to hear that the team is interested. I should first clarify that while the crate is named
I chose nom because the combinatorial parsing strategy was intuitive for me to conceptually understand, and made testing very easy for me. It was very easy for me to quickly debug and unit test, which helpful in getting a parser quickly. That being said, I'm sure we can do some similar strategy for Currently, error messages in my work are very scuffed—at best you'll just get an error message of where the parsing fails, but that's an improvement for when more functionality is done. That being said, since we do know where the parsing fails, giving an meaningful error message would just be as simple as bubbling what sub-parser failed at what input, which we can derive display from easily.
This is a good point that I haven't considered, but I think it's fine to be more restrictive than git in this case. Considering the use case of
For a little more context, my currently architecture is a two layer implementation, with a lower-level parser powering a higher level wrapper. The lower-level parser only handles emitting text events, while the wrapper itself handles reading and modifying that stream of events. This lets us expose both layers in case users want to use the lower level parser instead of the abstraction. This seems the primary difference in the architecture, since the The current "roadmap" for my implementation would be:
And probably after all that is done would I feel confident releasing a 1.0 version. On a side note, is there a reason why you have |
Beta Was this translation helpful? Give feedback.
-
Apologies in advance if my reply seems a bit short or concise, it's really that I am short in time but try not to wait too long with my reply here. It would be great to find a way to benefit from your implementation which looked like it adheres to high quality standards.
From my experience with the Having helpful error messages is a must for the
Here I would be very conservative and assume that all kinds of potentially uncommon encodings are produced by people editing their config files somewhere in the world, so being more restrictive than
That's true, the parser here is considered an implementation detail. Exposing it is nothing I would be too afraid of though as I am OK incrementing major version numbers liberally on low level crates.
That's something I very much look forward to! It's nothing that
Even though it's not 'sketched' yet I also want a combined
Thus far there is only one real 'blocker' which is the usage of
I have seen this being used in a crate I love, Reading my ramblings, here is the summary:
Anything I said should not be understood as discouragement or lack of confidence, as I can only restate that I am excited about the implementation and would love to make it a part of That ended up being longer than anticipated, I hope you can find it useful in a way that helps making this work :). |
Beta Was this translation helpful? Give feedback.
-
No need to rush—I just didn't want us to have duplicated work. Considering the breadth of what you do, it's understandable that you're very short on time, so I definitely appreciate the thought out response.
I definitely agree, and it's definitely a shortfall in my current implementation that can be worked on. I don't mind using a different crate for parsing, but considering how the parser is already completed in
I'm not too strongly for or against using I think I talked myself into supporting
Don't worry, I haven't gotten that impression at all. Just sounds like a maintainer doing due diligence and getting motivations and goals cleared up ;) If my responses look good to you, joining sounds good! How would you like to move forward then? I don't mind working on this in my own repo until I get some minimal viable product ready to merge, or work in a branch in the main repo. Whatever works for you! Edit: Parser now uses |
Beta Was this translation helpful? Give feedback.
-
It's quite incredible how scarce focus and uninterrupted keyboard time have become thees days and writing complex code like…before… is something I can only dream about.
This is the best time to see if @avitex , the author of Personally I think it would end up being a rewrite, especially since without the use of the As using When done right, @avitex has done tremendous work reviewing my first steps towards implementing a parser (see #40 and #44), and I feel a bit guilty to think about adopting
Neat! Converting from
🙌
Since you do all the work really I want to do my very best to not add any hurdles and actually slow your progress or enjoyment developing something that essential to Here is some suggestions and leading questions:
No matter what happens, I firmly believe that this conversation alone is of great value - in the best case |
Beta Was this translation helpful? Give feedback.
-
I took a deeper look at the API sketch and there's an incredible number of similarities between what we have. I think the primary divergent idea would be how the gitconfig editing occurs—the sketch seems to use The primary concern with that approach though would be handling multivars, which were a pain point for me. I'm not sure how the user would be able to control which instance of the key/value pair would be set through that appraoch.
This was actually a problem I ran into, since the parser actually only returned references. The solution for me to use This somewhat leaks implementation detail of higher level abstractions into lower ones, but I think it's better than to have an owned versus copied event enums.
Yes, of course! I've already made serde an optional feature.
Now that I think about it, I haven't really seen any meaningful errors from As a side note, I've been working on getting stuff well tested. After 16 million fuzzer iterations without error, I think I'm semi-confident that my implementation is panic-free, but I'm going to leave it running overnight to make sure. ;) |
Beta Was this translation helpful? Give feedback.
-
Thanks for all your effort - it's more appreciated than has been expressed.
Even though the API might be lacking, unintuitive or incorrect when actually trying to implement it, the idea is that every owned entry is either new or was created from an existing borrowed entry. This allows these updates to know where they belong. Ultimately borrowed sections or entries know their index in the list of parsed tokens, which is immutable. Doing edits by pre-recording them and applying them during serialization certainly is complex and maybe could be simplified by letting In short: should work, but who knows 😅. Please note that No lookup tables or acceleration structures are used to avoid allocations where possible and I take a wait-and-see stance to add them only when there is actual need.
I truly believe this delays the inevitable problem: Either each access to git configuration creates a temporary In the sketch that's the Being able to parse once and keep the buffers is the reason the sketched API looks the way it looks, a lot of effort it creates for itself to avoid allocations and keep things minimal in memory. Anyway, ranting :D, but in short a good test would be to try keeping the buffer within
Yes, 'only' these would be the minimum, and with
Pretty neat! I myself never got around using an actual fuzzer, but resorted to 'stress' tests that verify big repositories. A few issues were found that way, but of course, tells nothing about malformed input. I hope this helps. |
Beta Was this translation helpful? Give feedback.
-
After some more work I've tried benching initialization of my current implementation:
It's a rough estimate, but I'm currently hitting 187MB/s (for a 6.7k file) for the parser. I think this is pretty good in my opinion, but I'm not sure how this fares with other parsers. The Still working on benching mutations for |
Beta Was this translation helpful? Give feedback.
-
My apologies for the late reply - I was about to get started a few times already but got interrupted. Here I go again, possibly providing multiple comments over time.
That would be true, and if there really is a way to apply a bad mutation, a possibility I didn't see, that would clearly be a disadvantage. Being so lazy is probably not necessary by any means either and could be considered a premature optimization. It's certainly easier to make Regarding the bad mutations, I didn't see it because ultimately that interface is string based and every string is valid for every key. Whether that makes sense in the context of the value being written is up to the user of the API. Maybe I am missing something though. |
Beta Was this translation helpful? Give feedback.
-
If the simplified Generally I think that it won't have to be lazy though and if checks fail when setting, that should be evident immediately. Ideally there is only one interface and personally I don't think performance is anything to be optimized there. Even though I make a different impression sometimes :D, generally I avoid premature optimizations unless it's reasonably easy to do. Avoiding allocations for instance by using
Absolutely! It looks like enough information is already collected to support nice error messages, and massaging these into place can be left for improvements beyond the MVP. When it comes to performance, I wouldn't fret too much initially (but having seen the benchmark I guess that worked out fine :D). |
Beta Was this translation helpful? Give feedback.
-
On another note, when looking at the hashmap for sections, I checked to see how it deals with duplicate sections which are valid within git config files. commit f2d5a7b3212a670c5b962676cd59afee4c4ea53e
Author: Sebastian Thiel <redacted>
Date: Tue Mar 2 11:25:46 2021 +0800
Duplicate section seems to break tests in unforseen manner
Section keys are not unique
diff --git a/tests/integration_tests/file_integeration_test.rs b/tests/integration_tests/file_integeration_test.rs
index b2b068c..7b31669 100644
--- a/tests/integration_tests/file_integeration_test.rs
+++ b/tests/integration_tests/file_integeration_test.rs
@@ -15,6 +15,8 @@ fn get_value_for_all_provided_values() -> Result<(), Box<dyn std::error::Error>>
color = brightgreen red \
bold
other = hello world
+ [core]
+ bare = true
"#;
let file = GitConfig::try_from(config)?; It appears the test then fails as the duplicate (maybe) overwrites the existing section. This might also be relevant when writing sections, as these shouldn't be merged to enforce their uniqueness. |
Beta Was this translation helpful? Give feedback.
-
And that would mean about 75MB/s for the GitConfig file. To my mind that's alright as well and probably not slower than the current From what I see the tradeoff in the All in all I am super excited to see your progress and am looking forward to welcoming you here as the second crate maintainer :) - |
Beta Was this translation helpful? Give feedback.
-
Hi! I'm a developer that's been tasked to implement GitOxide into a WASM stack in an effort to clone a project as fast as possible in the browser.
As part of that work, I'm looking to introduce a lot of new functionality into the project to get
clone
working. As an initial task, I'm hoping to get thegit-config
package able to read and write config files. I've got a (very) WIP port of config parsing logic fromisomorphic-git
over on my fork.That said, I am unsure of what direction we want to go in with regards to data organization. The current output is a pretty unmanageable tangled vector of Sections and Entries alike in a single
struct
I've noticed that the current
git-config
file has bothSection
andEntry
types that look to contain all of the relevant properties, so I'll be refactoring my current code to use that data structure.Outside of that, I was hoping you could give me some kind of direction when it comes to the different
mod
s in thegit-config/lib.rs
file:https://github.com/Byron/gitoxide/blob/main/git-config/src/lib.rs#L57-L72
https://github.com/Byron/gitoxide/blob/main/git-config/src/lib.rs#L131-L217
I'm not entirely sure what
borrowed
orspawned
is referring to (which I admit as a point-of-noobiness), and I'm not sure I track whatSpan
is doing here.I also would love some insight as to what a
Token
is referring to here:https://github.com/Byron/gitoxide/blob/main/git-config/src/file.rs#L4-L23
As I'm not sure an instance that would have a
Section
and only a singleEntry
Looking forward to learning more and being able to meaningfully contribute upstream!
Beta Was this translation helpful? Give feedback.
All reactions