Config Parse/Read/Write Discussion #45

crutchcorn · 2020-11-30T05:01:06Z

crutchcorn
Nov 30, 2020

Hi! I'm a developer that's been tasked to implement GitOxide into a WASM stack in an effort to clone a project as fast as possible in the browser.

As part of that work, I'm looking to introduce a lot of new functionality into the project to get clone working. As an initial task, I'm hoping to get the git-config package able to read and write config files. I've got a (very) WIP port of config parsing logic from isomorphic-git over on my fork.

I won't bother asking for a code review right now, I'm brand new to Rust and have a colleague taking a look at the code right now to give me some pointers on how I can improve my work

That said, I am unsure of what direction we want to go in with regards to data organization. The current output is a pretty unmanageable tangled vector of Sections and Entries alike in a single struct

I've noticed that the current git-config file has both Section and Entry types that look to contain all of the relevant properties, so I'll be refactoring my current code to use that data structure.

Outside of that, I was hoping you could give me some kind of direction when it comes to the different mods in the git-config/lib.rs file:

https://github.com/Byron/gitoxide/blob/main/git-config/src/lib.rs#L57-L72

https://github.com/Byron/gitoxide/blob/main/git-config/src/lib.rs#L131-L217

I'm not entirely sure what borrowed or spawned is referring to (which I admit as a point-of-noobiness), and I'm not sure I track what Span is doing here.

I also would love some insight as to what a Token is referring to here:

https://github.com/Byron/gitoxide/blob/main/git-config/src/file.rs#L4-L23

As I'm not sure of an instance that would have a Section and only a single Entry.

Looking forward to learning more and being able to meaningfully contribute upstream!

Byron · 2020-11-30T07:34:44Z

Byron
Nov 30, 2020
Maintainer

Hi Corbin, it’s great to meet you and I will do my best to support your work. As I am currently traveling at 300 km/h through China I won’t get to answering the questions right now, but will do in a few hours. Also I am thinking it might be best to set up a Zoom call (or equivalent) to get you started more quickly. As you saw I already laid out the structure, wanting to add the implementation next. Maybe it would be easiest to do that together, so you can chime in and potentially take it from there. That could be a great entry point for some Rust mentoring too, in case you are interested. Talk more in a bit 😊


0 replies

crutchcorn · 2020-11-30T07:54:05Z

crutchcorn
Nov 30, 2020
Author

Entirely understandable. Honestly, I'm surprised you responded as quickly as you have at all!

I was thinking the same thing about setting up a zoom (or similar) call! I would love to work together to add the implementation - I want to make sure the code standards are up-to-snuff with the rest of the package and would absolutely love some mentoring on my Rust journey!

If we'd like to schedule a meeting privately, I can be reached at {myGitHubUsername} at gmail dot com. Otherwise, if you'd rather, I'm available via DMs on my Twitter

Just as a heads up, our schedule is way different, as I'm located in California, USA, but I'd be happy to modify my schedule and do a (for me) very early morning/late night call in order to make this work.

Thanks so much for being so open and welcoming! Looking forward to working together!

0 replies

Byron · 2020-11-30T16:20:51Z

Byron
Nov 30, 2020
Maintainer

Alright, I thought it might be easiest to add comments in code, a commit should show up in this thread for you to look at. It's definitely not perfect but might already help a little.

Truth be told, I think I have pretty much sketched out something that I believe I could implement when doing zero-copy parsing using rust-dangerous. The crate's author is super experienced with parsers and very nice and responsive, bringing the crate to a point where it would be perfect for gitoxide because of its ease-of-use and its ability to create nice parsing errors out of the box.

My vision for the implementation is definitely not the easiest, but it's designed to be on par with the original git implementation which has a lot of bells and whistles, out of necessity I suppose. For instance, it's non-destructive for the most part even in the presence of comments.

Using regex for parsing is a non-starter, which unfortunately rules out a port of isomorphic-git: the regex crate bloats any binary and takes a long time to compile, which would not be competitive with the existing git-config implementation in C.

Of course I hope that this doesn't scare you, because if I can do it, so can you. For a lack of time I can't propose some times for a Zoom call right now, but would hope you drop me some suggestions (to the email address I use in commits) of times that would work for you (I am UTC+8), any remaining day this week.

Thanks a bunch!

0 replies

crutchcorn · 2020-11-30T21:01:07Z

crutchcorn
Nov 30, 2020
Author

Love the extra comments, they help a lot.

I think the zero-copy parsing is certainly the way to go. Non-destructive and more performant are huge reasons for going that route.

I am under a bit of a deadline for an initial draft of the cloning work (although I have time afterward to perf optimize), so I'll likely be continuing the development of my regex implementation separately, and working to get the zero-copy parsing implementation mainlined during the same time.

I'll be sending out an email right now to try to figure out a time that would work best for a sync. :D

0 replies

edward-shen · 2021-02-20T17:37:04Z

edward-shen
Feb 20, 2021

Hey folks, I actually started writing a git-config crate myself, and was going to try to find a crate name until I stumbled upon the fact that gitoxide had already started one. From the 0.0.0 version tag, I assume this is just a crate name reservation.

For what it's worth, my parser is currently zero-copy and guarantees emitting enough events to identically reconstruct the source git config file, and currently depends only on nom (although I'm sure we can parse out the dependencies if needed).

I'm currently writing a higher level abstraction on top of the parse that allows for semi-efficient insertion and deletion to modify git config files in place.

Would the gitoxide team be interested in merging the work?

0 replies

Byron · 2021-02-21T15:13:52Z

Byron
Feb 21, 2021
Maintainer

Hi @edward-shen, thanks for sharing your work! I am quite baffled by its completeness and amazed that it integrates with serde. The last time I deep dived into serde my brain caught fire.

Since the implementation here is merely an API sketch along with a 'sample' of a parser using dangerous there is definitely interest in merging your work.

The reason for gitoxide choosing dangerous over nom, even though the latter is already used for parsing git objects, is the vastly superior error messages it can produce with little effort (by now it even integrates with nom errors to the extent possible).

It's used to parse bytes directly due to git itself never specifying the use of valid unicode and seems to permit pretty much 'anything' as values at least in theory. Thus I never dared to use str but chose BStr instead. That way there won't be any surprises.
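To illustrate why byte-oriented parsing avoids surprises, here is a minimal sketch using only the standard library. Plain `&[u8]` stands in for `BStr`, and `split_entry` and `trim` are hypothetical helpers made up for this example, not part of gitoxide or the bstr crate.

```rust
/// Hypothetical helper: strip ASCII whitespace from both ends of a byte slice.
fn trim(bytes: &[u8]) -> &[u8] {
    let start = bytes
        .iter()
        .position(|b| !b.is_ascii_whitespace())
        .unwrap_or(bytes.len());
    let end = bytes
        .iter()
        .rposition(|b| !b.is_ascii_whitespace())
        .map_or(start, |i| i + 1);
    &bytes[start..end]
}

/// Hypothetical helper: split a `key = value` line on the first `=`.
/// Working on raw bytes means a value that is not valid UTF-8
/// (for example Latin-1 text) passes through untouched.
fn split_entry(line: &[u8]) -> Option<(&[u8], &[u8])> {
    let eq = line.iter().position(|&b| b == b'=')?;
    Some((trim(&line[..eq]), trim(&line[eq + 1..])))
}
```

A line such as `b"name = Jos\xE9"` is rejected by `std::str::from_utf8`, yet `split_entry` still yields the key and the raw value, and the caller can decide later whether and how to decode it.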

The idea was to provide better error messages than git itself in case of malformed files while making it easy (maybe even with a compile flag) to allow other encodings in values or section names - right now these apparently have to be ascii/latin character sets (my knowledge about this is just based on a few manual tests though).

I would be very interested in your thoughts on where the current implementation/ideas of gitoxide differ from yours. Thanks a lot!

0 replies

edward-shen · 2021-02-21T16:13:09Z

edward-shen
Feb 21, 2021

Hey @Byron! Glad to hear that the team is interested. I should first clarify that while the crate is named serde-git-config, I haven't hooked up the serde portions yet. Serde explicitly doesn't handle parsing, so I've been currently working on a functional parser first before working on serde integration. That being said, I am experienced with serde, so that should be coming soon once the higher level abstractions are done.

The reason for gitoxide choosing dangerous over nom, even though the latter is already used for parsing git objects, is the vastly superior error messages it can produce with little effort (by now it even integrates with nom errors to the extent possible).

I chose nom because the combinatorial parsing strategy was intuitive for me to conceptually understand, and made testing very easy for me. It was very easy for me to quickly debug and unit test, which was helpful in getting a parser quickly. That being said, I'm sure we can do some similar strategy for dangerous as well; I just didn't know about it at the time.

Currently, error messages in my work are very scuffed: at best you'll just get an error message of where the parsing fails, but that's an improvement for when more functionality is done. That being said, since we do know where the parsing fails, giving a meaningful error message would just be as simple as bubbling what sub-parser failed at what input, which we can derive display from easily.

It's used to parse bytes directly due to git itself never specifying the use of valid unicode and seems to permit pretty much 'anything' as values at least in theory. Thus I never dared to use str but chose BStr instead. That way there won't be any surprises.

This is a good point that I haven't considered, but I think it's fine to be more restrictive than git in this case. Considering the use case of git-config, I'd doubt people intentionally would want to put something like binary data in a git-config file, and if there is something like that, my opinion is that it's more likely a mistake than something that the user intended. This might be the largest difference in the goals between the two works (semi-opinionated versus universal).

I would be very interested in your thoughts on where the current implementation/ideas of gitoxide differ from yours

For a little more context, my current architecture is a two layer implementation, with a lower-level parser powering a higher level wrapper. The lower-level parser only handles emitting text events, while the wrapper itself handles reading and modifying that stream of events. This lets us expose both layers in case users want to use the lower level parser instead of the abstraction. This seems the primary difference in the architecture, since the File implementation in gitoxide seems to do both.
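The two-layer split can be sketched roughly as follows. This is a heavily simplified illustration, not the actual crate: `Event`, `parse` and `Config` are hypothetical names, the parser is line-based, works on `&str` for brevity, and drops whitespace, so unlike the parser described above it cannot reconstruct the source file byte for byte. Key lookup is also case-sensitive here, which git's is not.

```rust
/// Lower layer: events that borrow from the input text.
#[derive(Debug, PartialEq)]
enum Event<'a> {
    SectionHeader(&'a str),
    Key(&'a str),
    Value(&'a str),
    Comment(&'a str),
}

/// Hypothetical, simplified parser that only emits events.
fn parse(input: &str) -> Vec<Event<'_>> {
    let mut events = Vec::new();
    for line in input.lines() {
        let line = line.trim();
        if line.is_empty() {
            continue;
        }
        if line.starts_with('#') || line.starts_with(';') {
            events.push(Event::Comment(line));
        } else if let Some(name) = line.strip_prefix('[').and_then(|l| l.strip_suffix(']')) {
            events.push(Event::SectionHeader(name));
        } else if let Some((key, value)) = line.split_once('=') {
            events.push(Event::Key(key.trim()));
            events.push(Event::Value(value.trim()));
        }
    }
    events
}

/// Upper layer: a wrapper that interprets the event stream.
struct Config<'a> {
    events: Vec<Event<'a>>,
}

impl<'a> Config<'a> {
    fn new(input: &'a str) -> Self {
        Config { events: parse(input) }
    }

    /// Return the last value recorded for `key` inside `section`.
    fn get(&self, section: &str, key: &str) -> Option<&'a str> {
        let mut in_section = false;
        let mut found = None;
        let mut iter = self.events.iter();
        while let Some(event) = iter.next() {
            match event {
                Event::SectionHeader(name) => in_section = *name == section,
                Event::Key(k) if in_section && *k == key => {
                    // A key event is always followed by its value event.
                    if let Some(Event::Value(v)) = iter.next() {
                        found = Some(*v);
                    }
                }
                _ => {}
            }
        }
        found
    }
}
```

Callers who only need the raw event stream can use `parse` directly, while `Config::get` shows the kind of convenience the wrapper adds on top.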

The current "roadmap" for my implementation would be:

implement stable set_value for GitConfig
implement serialization for GitConfig
adding sections anywhere in the gitconfig
serde support
error handling
Even higher level abstractions (e.g. struct for global, user-local, and directory-local gitconfigs given some path)

And probably after all that is done would I feel confident releasing a 1.0 version.

On a side note, is there a reason why you have deny(rust_2018_idioms) in your implementation? Are you trying to hit a specific minimum supported rust version?

0 replies

Byron · 2021-02-23T15:47:34Z

Byron
Feb 23, 2021
Maintainer

Apologies in advance if my reply seems a bit short or concise, it's really that I am short in time but try not to wait too long with my reply here. It would be great to find a way to benefit from your implementation which looked like it adheres to high quality standards.

Currently, error messages in my work are very scuffed: at best you'll just get an error message of where the parsing fails, but that's an improvement for when more functionality is done. That being said, since we do know where the parsing fails, giving a meaningful error message would just be as simple as bubbling what sub-parser failed at what input, which we can derive display from easily.

From my experience with the nom error handling I would expect this to not be super trivial, but from what I have seen so far I would put my money on you being able to pull it off anyway! dangerous was a godsend when I discovered it on Reddit, and from that point on I managed to bounce some ideas back and forth with the author, who has been super responsive and helpful, making this encounter rank among the best open source experiences I have had thus far.

Having helpful error messages is a must for the gitoxide implementation, git isn't so bad at this either.

This is a good point that I haven't considered, but I think it's fine to be more restrictive than git in this case.

Here I would be very conservative and assume that all kinds of potentially uncommon encodings are produced by people editing their config files somewhere in the world, so being more restrictive than git makes it likely that some of them wouldn't be able to use gitoxide on their projects. This is something I want to avoid if feasible, and here I think it is.

This lets us expose both layers in case users want to use the lower level parser instead of the abstraction. This seems the primary difference in the architecture, since the File implementation in gitoxide seems to do both.

That's true, the parser here is considered an implementation detail. Exposing it is nothing I would be too afraid of though as I am OK incrementing major version numbers liberally on low level crates.

implement serialization for GitConfig

That's something I very much look forward to! It's nothing that gitoxide needs, but it's just cool to be able to serialize to the gitconfig format nonetheless. Besides, it's probably quite convenient to declare configuration naturally with structs, too.

Even higher level abstractions (e.g. struct for global, user-local, and directory-local gitconfigs given some path)

Even though it's not 'sketched' yet I also want a combined Config kind of type which allows access to the git config hierarchy. So that would be a perfect match, too.

And probably after all that is done would I feel confident releasing a 1.0 version.

Thus far there is only one real 'blocker' which is the usage of str instead of parsing bytes. With BStr the (testing) experience is equivalent to using str in case you would consider changing this. However, I do understand it's a big ask especially since having error handling could still block merging if it doesn't end up a little better than the one git provides already with nom possibly being a risk in that regard.

On a side note, is there a reason why you have deny(rust_2018_idioms) in your implementation? Are you trying to hit a specific minimum supported rust version?

I have seen this being used in a crate I love, smol, and adopted it without hesitation. In practice this meant that lifetimes cannot be omitted anymore in structs that borrow when used in function arguments, making code more explicit and readable.
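As a small illustration of that effect: the `rust_2018_idioms` lint group includes `elided_lifetimes_in_paths`, which rejects hidden lifetime parameters on borrowing types. The `Entry` struct below is a hypothetical stand-in, and the lint is attached to a single function only to keep the sketch self-contained. A crate would normally put `#![deny(rust_2018_idioms)]` at the top of `lib.rs`.

```rust
/// A type that borrows from a parsed buffer (hypothetical stand-in).
struct Entry<'a> {
    name: &'a str,
}

// With the lint denied, writing `entry: &Entry` here would fail to compile:
// the hidden lifetime parameter has to be spelled out, at least as `'_`.
#[deny(rust_2018_idioms)]
fn name_len(entry: &Entry<'_>) -> usize {
    entry.name.len()
}
```

The `'_` costs a few characters, but it makes it visible at every use site that `Entry` holds a borrow.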

Reading my ramblings, here is the summary:

I would love to merge your work, make you a member of the gitoxide team and see you as the maintainer of the git-config crate.
Not being able to make error handling competitive with git is a risk
Forcing input to be UTF-8 compatible seems like the only serious blocker. Probably I should consider removing this constraint in favour of having a great and probably fine implementation much sooner than would otherwise be possible, but I am not quite ready to do that yet.

Anything I said should not be understood as discouragement or lack of confidence, as I can only restate that I am excited about the implementation and would love to make it a part of gitoxide one way or another.

That ended up being longer than anticipated, I hope you can find it useful in a way that helps making this work :).

0 replies

edward-shen · 2021-02-23T17:02:06Z

edward-shen
Feb 23, 2021

No need to rush—I just didn't want us to have duplicated work. Considering the breadth of what you do, it's understandable that you're very short on time, so I definitely appreciate the thought out response.

Having helpful error messages is a must for the gitoxide implementation, git isn't so bad at this either.

I definitely agree, and it's definitely a shortfall in my current implementation that can be worked on. I don't mind using a different crate for parsing, but considering how the parser is already completed in nom, it would be a lot easier to migrate if I can incrementally move things away from nom. As I'm not familiar with dangerous, how easy is it to migrate from a nom parser? Alternatively, we could first try to get nice error messages with just nom if migration is difficult?

Thus far there is only one real 'blocker' which is the usage of str instead of parsing bytes.

I'm not too strongly for or against using BStr, so I don't mind using it. If anything, we can always provide a UTF-8 restrictive entrypoint that just accepts a &str instead but uses a bstr as the underlying implementation. I don't think my parser is strictly dependent on any utf-8 feature (if anything, I think it makes it more difficult), so I don't mind moving towards this.
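A UTF-8-restricted entry point over a byte-based core could look roughly like this. It is a sketch under assumptions: `values_bytes` and `values_str` are hypothetical names, and `&[u8]` stands in for `BStr`.

```rust
/// Hypothetical byte-oriented core: collect the raw value of every
/// `key=value` line, without assuming any particular encoding.
fn values_bytes(input: &[u8]) -> Vec<&[u8]> {
    input
        .split(|&b| b == b'\n')
        .filter_map(|line| {
            let eq = line.iter().position(|&b| b == b'=')?;
            Some(&line[eq + 1..])
        })
        .collect()
}

/// UTF-8-only convenience wrapper around the byte-based core.
/// The input is valid UTF-8 and the core only cuts at ASCII bytes
/// (`\n` and `=`), so every piece is valid UTF-8 as well.
fn values_str(input: &str) -> Vec<&str> {
    values_bytes(input.as_bytes())
        .into_iter()
        .map(|v| std::str::from_utf8(v).expect("cutting at ASCII bytes keeps UTF-8 valid"))
        .collect()
}
```

Only the thin wrapper knows about `str`, so the parser itself stays free of any UTF-8 assumption.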

I think I talked myself into supporting Bstr, haha.

Anything I said should not be understood as discouragement or lack of confidence, as I can only restate that I am excited about the implementation and would love to make it a part of gitoxide one way or another.

Don't worry, I haven't gotten that impression at all. Just sounds like a maintainer doing due diligence and getting motivations and goals cleared up ;)

If my responses look good to you, joining sounds good! How would you like to move forward then? I don't mind working on this in my own repo until I get some minimal viable product ready to merge, or work in a branch in the main repo. Whatever works for you!

Edit: Parser now uses BStr, and the Config now accepts any input that can be converted into a BStr :)

0 replies

Byron · 2021-02-24T15:51:28Z

Byron
Feb 24, 2021
Maintainer

Considering the breadth of what you do, it's understandable that you're very short on time, so I definitely appreciate the thought out response.

It's quite incredible how scarce focus and uninterrupted keyboard time have become these days and writing complex code like…before… is something I can only dream about.

I definitely agree, and it's definitely a shortfall in my current implementation that can be worked on. I don't mind using a different crate for parsing, but considering how the parser is already completed in nom, it would be a lot easier to migrate if I can incrementally move things away from nom. As I'm not familiar with dangerous, how easy is it to migrate from a nom parser? Alternatively, we could first try to get nice error messages with just nom if migration is difficult?

This is the best time to see if @avitex, the author of dangerous, has any experience they could share. I have been in many conversations with them which has cemented the knowledge that it's the right crate to use because it's just so good and moving fast towards getting even better (and towards 1.0).

Personally I think it would end up being a rewrite, especially since without the use of the zc crate the dangerous::Span type would be required for tokens instead of BStr to avoid self-referential structs. To my mind doing that doesn't add much complexity as the API sketch seems to indicate.

As using dangerous over nom effectively hinges on better out-of-the-box error messages (and by now, my unapologetic fandom :D) I would say it's worth trying to get errors to look nice and be helpful with nom first.

When done right, dangerous errors provide a nice human readable parser stack like 'when reading section foo, the fifth entry's value could not be decoded' (not quite there yet, but it's achievable with the information that's present), and getting this with nom is probably hard to the point where the advantages of nom diminish. But…I also don't really know and I am biased. What I am trying to say is that getting there with nom might be hard and before you run away I'd rather cut back on the quality of parsing errors.

@avitex has done tremendous work reviewing my first steps towards implementing a parser (see #40 and #44), and I feel a bit guilty to think about adopting nom if it's good enough even though that's clearly a pragmatic choice.

I think I talked myself into supporting Bstr, haha.

Neat! Converting from BStr to str and dealing with potential failures is very well supported by its API already, so there is no need for an additional API for that.

Parser now uses BStr, and the Config now accepts any input that can be converted into a BStr :)

🙌

If my responses look good to you, joining sounds good! How would you like to move forward then? I don't mind working on this in my own repo until I get some minimal viable product ready to merge, or work in a branch in the main repo. Whatever works for you!

Since you do all the work really I want to do my very best to not add any hurdles and actually slow your progress or enjoyment developing something that essential to gitoxide and the world :).

Here are some suggestions and leading questions:

take a look at the current git-config API sketch and see how it resonates with you or your vision. Maybe it's something to partially adopt or to improve upon.
How would self-referential structs work when &BStr is used in tokens? Is there a way to avoid that entirely?
Is an MVP without Serde support possible? gitoxide wouldn't need it, and for it to be mergeable it would have to go behind a feature toggle as serde wouldn't be a dependency otherwise.
If a draft PR replacing the current git-config implementation is submitted, feedback can be received early by me and possibly @avitex.
Once there is a PR which demonstrates error messages at least on par with the ones of git, it could be merged early as MVP. Then you would become a collaborator to this repository, sole author and co-owner of the git-config crate on crates.io. (Moving gitoxide into an organization is definitely an option eventually, too). From there you could develop on main or submit PRs for feedback, as usual.

No matter what happens, I firmly believe that this conversation alone is of great value - in the best case gitoxide gains a state-of-the-art implementation of the git config API surface, let alone having you join the roster of maintainers to help build something quite unique and, one may hope, of historical relevance 😅.

1 reply

edward-shen Feb 28, 2021

@Byron After some further scrutiny, it looks like the only places where it's possible to accept an arbitrary value (as opposed to a UTF-8 value) are the value portion of key-value pairs and comments. All section headers and key names use isalnum in git's source code, which means that UTF-8 is a strict superset of the allowed values for those cases (or in other words, str is perfectly acceptable there).

As a result, I think the only time where it's necessary to dive into byte strings is for parsing values and comments.
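A sketch of that split, using the rule from the git-config documentation that variable names allow only alphanumeric characters and `-` and must start with an alphabetic character. The function names are hypothetical, and the exact check in git's C source may differ in detail.

```rust
/// Sketch of the documented key-name rule: first byte alphabetic,
/// remaining bytes ASCII alphanumeric or `-`.
fn is_valid_key(name: &[u8]) -> bool {
    match name.split_first() {
        Some((first, rest)) => {
            first.is_ascii_alphabetic()
                && rest.iter().all(|b| b.is_ascii_alphanumeric() || *b == b'-')
        }
        None => false,
    }
}

/// A valid key is pure ASCII, so exposing it as `&str` cannot fail
/// on encoding grounds. Values and comments stay as raw bytes.
fn key_as_str(name: &[u8]) -> Option<&str> {
    if is_valid_key(name) {
        std::str::from_utf8(name).ok()
    } else {
        None
    }
}
```

With this, only values and comments need a byte-string type, while keys can be handed out as plain `&str`.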

edward-shen · 2021-02-24T22:51:04Z

edward-shen
Feb 24, 2021

I took a deeper look at the API sketch and there's an incredible number of similarities between what we have. I think the primary divergent idea would be how the gitconfig editing occurs: the sketch seems to use Edit which applies to a file once all edits are done, while my implementation applies the edits as you make them (not to the file, but to its internal representation). It's certainly an approach that I haven't thought of, but it definitely has benefits over my implementation, since you would be able to debug the list of changes that were going to be applied to a file.

The primary concern with that approach though would be handling multivars, which were a pain point for me. I'm not sure how the user would be able to control which instance of the key/value pair would be set through that approach.

How would self-referential structs work when &BStr is used in tokens? Is there a way to avoid that entirely?

This was actually a problem I ran into, since the parser actually only returned references. The solution for me was to use Cow<'a, BStr> for all event outputs, even if the parser emits borrowed events only. As a result, higher level abstractions (such as GitConfig, or File in your sketch) can insert borrowed or owned values without any problems, thus avoiding the need for self referential structs.

This somewhat leaks implementation detail of higher level abstractions into lower ones, but I think it's better than to have an owned versus copied event enums.
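A minimal sketch of the `Cow` idea, with `Cow<'a, [u8]>` standing in for `Cow<'a, BStr>`. The `Event`, `parse_entry` and `Entries` names are hypothetical and only serve to show borrowed and owned data living in the same event list.

```rust
use std::borrow::Cow;

#[derive(Debug, PartialEq)]
enum Event<'a> {
    Key(Cow<'a, [u8]>),
    Value(Cow<'a, [u8]>),
}

/// The parser only ever emits borrowed events.
fn parse_entry(line: &[u8]) -> Option<(Event<'_>, Event<'_>)> {
    let eq = line.iter().position(|&b| b == b'=')?;
    Some((
        Event::Key(Cow::Borrowed(&line[..eq])),
        Event::Value(Cow::Borrowed(&line[eq + 1..])),
    ))
}

/// The higher-level wrapper may replace borrowed events with owned ones.
struct Entries<'a> {
    events: Vec<Event<'a>>,
}

impl<'a> Entries<'a> {
    /// Overwrite the value following the first matching key with owned data.
    fn set_value(&mut self, key: &[u8], new_value: Vec<u8>) -> bool {
        let pos = self
            .events
            .iter()
            .position(|e| matches!(e, Event::Key(k) if &k[..] == key));
        match pos {
            Some(i) if i + 1 < self.events.len() => {
                self.events[i + 1] = Event::Value(Cow::Owned(new_value));
                true
            }
            _ => false,
        }
    }
}
```

Untouched events keep pointing into the source buffer, and only edited ones allocate.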

Is an MVP without serde possible?

Yes, of course! I've already made serde an optional feature.

Once there is a PR which demonstrates error messages at least on par with the ones of git

Now that I think about it, I haven't really seen any meaningful errors from git-config. Do you mind posting a few examples? The only one I've really gotten while working on this were something akin to fatal: bad config line 21 in file <>, but I assume you meant something else?

As a side note, I've been working on getting stuff well tested. After 16 million fuzzer iterations without error, I think I'm semi-confident that my implementation is panic-free, but I'm going to leave it running overnight to make sure. ;)

0 replies

Byron · 2021-02-25T14:56:46Z

Byron
Feb 25, 2021
Maintainer

Thanks for all your effort - it's more appreciated than has been expressed.

The primary concern with that approach though would be handling multivars, which were a pain point for me. I'm not sure how the user would be able to control which instance of the key/value pair would be set through that approach.

Even though the API might be lacking, unintuitive or incorrect when actually trying to implement it, the idea is that every owned entry is either new or was created from an existing borrowed entry. This allows these updates to know where they belong. Ultimately borrowed sections or entries know their index in the list of parsed tokens, which is immutable.

Doing edits by pre-recording them and applying them during serialization certainly is complex and maybe could be simplified by letting Edits keep a mutable copy of all tokens. Definitely something to consider in an MVP.

In short: should work, but who knows 😅.

Please note that Edit types are required because nothing is ever owned in this implementation. Values are transformed/decoded only on demand, which is where Cows come into play, whereas new values need their own type to be representable.
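The pre-recorded edit idea can be sketched like this. It is an illustration only: the `File` and `Edit` types and `to_string_with` below are hypothetical and much simpler than the actual API sketch, and `&str` is used instead of byte strings for brevity.

```rust
/// Parsed tokens are immutable; each one is a slice of the original buffer.
struct File<'a> {
    tokens: Vec<&'a str>,
}

/// A recorded change: replace the token at `index` with `replacement`.
/// Because it names a token index, it can target one specific instance
/// of a multivar.
struct Edit {
    index: usize,
    replacement: String,
}

impl<'a> File<'a> {
    /// Apply edits only while serializing, leaving `tokens` untouched.
    /// If several edits target the same token, the most recent one wins.
    fn to_string_with(&self, edits: &[Edit]) -> String {
        let mut out = String::new();
        for (i, token) in self.tokens.iter().enumerate() {
            match edits.iter().rev().find(|e| e.index == i) {
                Some(edit) => out.push_str(&edit.replacement),
                None => out.push_str(token),
            }
        }
        out
    }
}
```

Since the edit list is plain data, it can be inspected or logged before anything is written.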

No lookup tables or acceleration structures are used to avoid allocations where possible and I take a wait-and-see stance to add them only when there is actual need.

This was actually a problem I ran into, since the parser actually only returned references. The solution for me was to use Cow<'a, BStr> for all event outputs, even if the parser emits borrowed events only. As a result, higher level abstractions (such as GitConfig, or File in your sketch) can insert borrowed or owned values without any problems, thus avoiding the need for self referential structs.

I truly believe this delays the inevitable problem: Either each access to git configuration creates a temporary GitConfig to keep the values on the stack, avoiding self-referential structures, or something has to keep both, the buffers read from disk as well as the parse result.

In the sketch that's the File abstraction, whereas your version avoids this, pushing the issue further down the line. The rental crate has nice documentation for this as well as a way to workaround it.

Being able to parse once and keep the buffers is the reason the sketched API looks the way it does: it goes to a lot of effort to avoid allocations and keep things minimal in memory. git itself doesn't even do that and seems to allocate everything, like String everywhere :). Maybe it's a bit overdone, but I think doing that gives gitoxide an edge over git, further adding to its relevancy. With some excitement I can say that all git object parsing is done using a similar technique, separating borrowed and owned versions, making parsing nearly free and so fast that it leaves git in the dust. When looking at their implementation you will see a lot of allocations happening to parse objects, and special methods to parse only parts of it to accelerate hot code. In Rust that's not needed, while being safe to use (after all, handing out references/pointers in C is probably difficult to manage).

Anyway, ranting :D, but in short a good test would be to try keeping the buffer within GitConfig.
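One way to run that test without a self-referential struct is to store offsets into the buffer instead of references. The sketch below assumes a hypothetical `OwnedConfig` type and uses `Range<usize>` as the span. It is not how `dangerous::Span` or the API sketch are implemented, and it ignores sections entirely.

```rust
use std::ops::Range;

/// Owns the bytes read from disk together with the parse result.
struct OwnedConfig {
    buffer: Vec<u8>,
    /// (key span, value span) pairs, as offsets into `buffer`.
    entries: Vec<(Range<usize>, Range<usize>)>,
}

impl OwnedConfig {
    /// Parse once, then keep the buffer alongside the spans.
    fn parse(buffer: Vec<u8>) -> Self {
        let mut entries = Vec::new();
        let mut start = 0; // offset of the current line within `buffer`
        for line in buffer.split(|&b| b == b'\n') {
            if let Some(eq) = line.iter().position(|&b| b == b'=') {
                entries.push((start..start + eq, start + eq + 1..start + line.len()));
            }
            start += line.len() + 1; // +1 for the newline that was split off
        }
        OwnedConfig { buffer, entries }
    }

    /// Resolve spans to slices on demand; the last matching entry wins.
    fn value(&self, key: &[u8]) -> Option<&[u8]> {
        self.entries
            .iter()
            .rev()
            .find(|(k, _)| &self.buffer[k.clone()] == key)
            .map(|(_, v)| &self.buffer[v.clone()])
    }
}
```

The spans stay valid however the struct is moved, because they are plain offsets rather than borrows of `buffer`.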

Now that I think about it, I haven't really seen any meaningful errors from git-config. Do you mind posting a few examples? The only one I've really gotten while working on this were something akin to fatal: bad config line 21 in file <>, but I assume you meant something else?

Yes, 'only' these would be the minimum, and with nom something like it shouldn't be a problem either I suppose. Maybe also with nom one can be more specific to point at a specific sub-token for instance, but to me that's entirely optional.

As a side note, I've been working on getting stuff well tested. After 16 million fuzzer iterations without error, I think I'm semi-confident that my implementation is panic-free, but I'm going to leave it running overnight to make sure. ;)

Pretty neat! I myself never got around to using an actual fuzzer, but resorted to 'stress' tests that verify big repositories. A few issues were found that way, but of course that tells nothing about malformed input.

I hope this helps.

1 reply

edward-shen Feb 26, 2021

Ah, I'm starting to understand where you're going with the Edits architecture now. If I were to make a comparison, my design would be more towards single level mutations where yours emphasizes performance with batches of edits (immediate versus lazy, from another perspective). If I'm understanding it correctly, your users might not necessarily know if they make a bad mutation (the most immediate example would perhaps be attempting to write into a key-value a color where it was originally a boolean before) somewhere until they've tried to apply to_write then, right? But it also allows you to decouple the lifetime of the edits from the lifetime of the parsed file itself, which I think is a more flexible implementation if there's complex logic to be done with the edits themself.

My design was definitely oriented towards smaller, simple edits or reading (e.g. maybe set a value or two) and not necessarily for complex operations, so the idea was that pushing it down the line would be sufficient for API use. Perhaps we should have both interfaces? One for simpler interactions, and one for more complex, performant interactions? I definitely see benefits in both, and having multiple interaction paradigms isn't uncommon in Rust: HashMap's get_mut and Entry API would be good examples.
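
For reference, this is roughly what the two HashMap styles look like side by side. The helper names set_existing and append_value are made up for the example; get_mut, entry and or_default are the standard library calls.

```rust
use std::collections::HashMap;

/// Simple path: overwrite a value only if the key already exists.
/// Returns whether anything was changed.
pub fn set_existing(map: &mut HashMap<String, String>, key: &str, value: &str) -> bool {
    match map.get_mut(key) {
        Some(slot) => {
            *slot = value.to_owned();
            true
        }
        None => false,
    }
}

/// Entry API path: insert-or-update in a single lookup, which suits
/// multi-valued keys where a missing key should start an empty list.
pub fn append_value(map: &mut HashMap<String, Vec<String>>, key: &str, value: &str) {
    map.entry(key.to_owned()).or_default().push(value.to_owned());
}
```

The first is the "set a value or two" interface, the second is the one that pays off once the logic around the mutation gets more involved.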

On error messages, I've been working a little bit on this:

#[test]
fn error() {
    // A config name must not start with a digit.
    let input = "[core] a=b\n 4a=3";
    println!("{}", parse_from_str(input).unwrap_err());
    // A config name must not be empty.
    let input = "[core] a=b\n =3";
    println!("{}", parse_from_str(input).unwrap_err());
    // The section header is never closed.
    let input = "[core";
    println!("{}", parse_from_str(input).unwrap_err());
}

Got an unexpected token on line 2 while trying to parse a config name: '4a=3'
Got an unexpected token on line 2 while trying to parse a config name: '=3'
Got an unexpected token on line 1 while trying to parse a section header: '[core'

The goal is to hopefully get a little more info (so we'd get something like Got an unexpected token on line 2 while trying to parse a config name in section 'core': '4a=3'), but I need to wrangle with my implementation a little more to write it in a performant manner.

Is that sufficient?
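
One way the richer message could be assembled, as a sketch only: ParseError, Expected and their fields are hypothetical names and not the error type in my crate. All fields borrow from the input, so building the error does not allocate.

```rust
use std::fmt;

/// Hypothetical marker for what the parser was trying to read.
#[derive(Debug, Clone, Copy)]
pub enum Expected {
    SectionHeader,
    ConfigName,
}

impl fmt::Display for Expected {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        f.write_str(match self {
            Expected::SectionHeader => "a section header",
            Expected::ConfigName => "a config name",
        })
    }
}

/// Hypothetical error type; `section` and `remaining` borrow from the input.
#[derive(Debug)]
pub struct ParseError<'a> {
    pub line: usize,
    pub expected: Expected,
    pub section: Option<&'a str>,
    pub remaining: &'a str,
}

impl fmt::Display for ParseError<'_> {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(
            f,
            "Got an unexpected token on line {} while trying to parse {}",
            self.line, self.expected
        )?;
        // The section is only known once a header has been parsed.
        if let Some(section) = self.section {
            write!(f, " in section '{}'", section)?;
        }
        write!(f, ": '{}'", self.remaining)
    }
}
```

The string is only formatted when someone actually prints the error, so the happy path stays untouched.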

edward-shen · 2021-03-02T00:10:55Z

edward-shen
Mar 2, 2021

After some more work I've tried benching initialization of my current implementation:

GitConfig large config file                                                                            
                        time:   [83.853 us 84.139 us 84.413 us]
Found 6 outliers among 100 measurements (6.00%)
  3 (3.00%) low mild
  1 (1.00%) high mild
  2 (2.00%) high severe

Parser large config file                                                                             
                        time:   [33.611 us 33.792 us 33.972 us]
Found 18 outliers among 100 measurements (18.00%)
  5 (5.00%) low severe
  4 (4.00%) low mild
  5 (5.00%) high mild
  4 (4.00%) high severe

It's a rough estimate, but I'm currently hitting 187MB/s (for a 6.7k file) for the parser. I think this is pretty good, but I'm not sure how this fares with other parsers.

The GitConfig wrapper, despite having multiple additional allocations, still initializes in under a tenth of a millisecond. I'm not sure if this reaches your desired performance (and I've yet to implement the Edits API for comparison), but I think the performance is good enough for a 0.1.0 release.

I'm still working on benching mutations for GitConfig, but considering the implementation is hashmap-based I would expect very high performance in this regard.

0 replies

Byron · 2021-03-02T02:46:40Z

Byron
Mar 2, 2021
Maintainer

My apologies for the late reply - I was about to get started a few times already but got interrupted. Here I go again, possibly providing multiple comments over time.

If I'm understanding it correctly, your users might not necessarily know if they make a bad mutation (the most immediate example would perhaps be attempting to write into a key-value a color where it was originally a boolean before) somewhere until they've tried to apply to_write then, right?

That would be true, and if there really is a way to apply a bad mutation, a possibility I didn't see, that would clearly be a disadvantage. Being so lazy is probably not necessary by any means either and could be considered a premature optimization. It's certainly easier to make Edits independent of the parent File and just copy everything it needs while applying edits immediately.

Regarding the bad mutations, I didn't see it because ultimately that interface is string based and every string is valid for every key. Whether that makes sense in the context of the value being written is up to the user of the API. Maybe I am missing something though.
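
A small sketch of what I mean, with hypothetical names (RawConfig, set, get_bool, as_bool): writing accepts any string, and a mismatch only shows up when a reader asks for a typed value. The boolean spellings below are the textual ones git documents; this is a subset and not a full reimplementation of git's rules.

```rust
use std::collections::HashMap;

/// Interpret a raw config value as a boolean using git's documented
/// textual spellings (case-insensitive). Anything else is an error.
pub fn as_bool(raw: &str) -> Result<bool, String> {
    match raw.to_ascii_lowercase().as_str() {
        "true" | "yes" | "on" | "1" => Ok(true),
        "false" | "no" | "off" | "0" | "" => Ok(false),
        _ => Err(format!("'{}' is not a boolean", raw)),
    }
}

/// Hypothetical string-based store: every string is valid for every key.
#[derive(Default)]
pub struct RawConfig {
    values: HashMap<String, String>,
}

impl RawConfig {
    /// Writing never fails, because no type is attached to a key.
    pub fn set(&mut self, key: &str, value: &str) {
        self.values.insert(key.to_owned(), value.to_owned());
    }

    /// The type check happens on read, at the caller's request.
    pub fn get_bool(&self, key: &str) -> Option<Result<bool, String>> {
        self.values.get(key).map(|raw| as_bool(raw))
    }
}
```

So writing a color into core.bare is accepted, and only get_bool on that key reports the problem.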

0 replies

Byron · 2021-03-02T03:04:53Z

Byron
Mar 2, 2021
Maintainer

My design was definitely oriented towards smaller, simple edits or reading (e.g. maybe set a value or two) and not necessarily for complex operations, so the idea was that pushing it down the line would be sufficient for API use. Perhaps we should have both interfaces?

If the simplified Edits implementation works on a copy of the File and ends up having the same members, then editing would probably best be part of the File itself. If not, Edits or something like it seems like the more elegant and easier to handle way.

Generally I think that it won't have to be lazy though, and if checks fail when setting, that should be evident immediately. Ideally there is only one interface, and personally I don't think performance is anything to be optimized there. Even though I may give a different impression sometimes :D, generally I avoid premature optimizations unless they are reasonably easy to do. Avoiding allocations for instance by using Spans is easy enough, and not building up any caching structures like hash maps seems easier to me as well (at the expense of search times). In Rust so much more seems easily achievable, so I tend to just do it to push the boundary a little and be rewarded with superb performance from the get go. This was the case when implementing pack reading and verification for example, whose core algorithm could already compete with git at the cost of merely being mindful.
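
Roughly what I have in mind with Spans, as a simplified sketch: the Span alias, the File layout and the methods here are assumptions for illustration and not the actual sketch in git-config. Entries are byte ranges into the one buffer the file owns, and lookup is a linear scan instead of a hash map.

```rust
use std::ops::Range;

/// Hypothetical span: a byte range into the buffer owned by `File`.
pub type Span = Range<usize>;

/// Owns the input once; entries only store offsets, never copies.
pub struct File {
    buf: String,
    entries: Vec<(Span, Span)>,
}

impl File {
    /// Record key/value spans for every line containing `=`.
    /// Sections, comments and quoting are ignored in this sketch.
    pub fn parse(buf: String) -> File {
        let mut entries = Vec::new();
        let mut offset = 0;
        for line in buf.split('\n') {
            if let Some(eq) = line.find('=') {
                let key = offset..offset + eq;
                let value = offset + eq + 1..offset + line.len();
                entries.push((key, value));
            }
            // Advance past the line and its newline separator.
            offset += line.len() + 1;
        }
        File { buf, entries }
    }

    /// Linear search from the back so the last assignment wins.
    /// No cache is built, which trades lookup time for simplicity.
    pub fn value(&self, key: &str) -> Option<&str> {
        self.entries
            .iter()
            .rev()
            .find(|(k, _)| self.buf[k.clone()].trim() == key)
            .map(|(_, v)| self.buf[v.clone()].trim())
    }
}
```

The only allocation besides the buffer itself is the vector of spans, and each lookup hands out a slice of the original input.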

The goal is to hopefully get a little more info (so we'd get something like Got an unexpected token on line 2 while trying to parse a config name in section 'core': '4a=3'), but I need to wrangle with my implementation a little more to write it in a performant manner.

Is that sufficient?

Absolutely! It looks like enough information is already collected to support nice error messages, and massaging these into place can be left for improvements beyond the MVP. When it comes to performance, I wouldn't fret too much initially (but having seen the benchmark I guess that worked out fine :D).

0 replies

Byron · 2021-03-02T03:40:33Z

Byron
Mar 2, 2021
Maintainer

On another note, when looking at the hashmap for sections, I checked to see how it deals with duplicate sections which are valid within git config files.

commit f2d5a7b3212a670c5b962676cd59afee4c4ea53e
Author: Sebastian Thiel <redacted>
Date:   Tue Mar 2 11:25:46 2021 +0800

    Duplicate section seems to break tests in unforseen manner
    
    Section keys are not unique

diff --git a/tests/integration_tests/file_integeration_test.rs b/tests/integration_tests/file_integeration_test.rs
index b2b068c..7b31669 100644
--- a/tests/integration_tests/file_integeration_test.rs
+++ b/tests/integration_tests/file_integeration_test.rs
@@ -15,6 +15,8 @@ fn get_value_for_all_provided_values() -> Result<(), Box<dyn std::error::Error>>
             color = brightgreen red \
             bold
             other = hello world
+        [core]
+            bare = true
     "#;
 
     let file = GitConfig::try_from(config)?;

It appears the test then fails as the duplicate (maybe) overwrites the existing section. This might also be relevant when writing sections, as they shouldn't be merged just to enforce uniqueness.

3 replies

edward-shen Mar 2, 2021

That looks like a bug in the section lookup code then — I actually do have a test to ensure that duplicate sections were handled properly: https://git.eddie.sh/edward/git-config/src/branch/master/src/file.rs#L1021-L1067

And right below that had tests with multival behaviors as well.

Byron Mar 2, 2021
Maintainer

That's interesting! Seeing this I'd also think that my modification should have worked as expected.

edward-shen Mar 2, 2021

Ah, I know what's wrong—the lookup only checks the highest section id value (or the last one in unmodified cases) and doesn't check the rest if fetching fails. Correct behavior is to check all sections with the same name in reverse order before failing.

Edit: fix has been pushed and regression test has been added.
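
The corrected lookup order can be sketched like this. Section and lookup are simplified, hypothetical stand-ins for the real types (which key sections by id), and section names are compared case-sensitively here for brevity.

```rust
/// Hypothetical, simplified section: a name plus its entries in file order.
pub struct Section {
    pub name: String,
    pub entries: Vec<(String, String)>,
}

/// Check every section with a matching name, newest first, and only
/// give up once none of them contains the key.
pub fn lookup<'a>(sections: &'a [Section], section: &str, key: &str) -> Option<&'a str> {
    sections
        .iter()
        .rev()
        .filter(|s| s.name == section)
        .find_map(|s| {
            s.entries
                .iter()
                .rev()
                .find(|(k, _)| k.as_str() == key)
                .map(|(_, v)| v.as_str())
        })
}
```

With the config from the failing test, a key that only exists in the first [core] is still found even though a later [core] exists.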

Byron · 2021-03-02T03:49:06Z

Byron
Mar 2, 2021
Maintainer

It's a rough estimate, but I'm currently hitting 187MB/s (for a 6.7k file) for the parser. I think this is pretty good, but I'm not sure how this fares with other parsers.

And that would mean about 75MB/s for the GitConfig file. To my mind that's alright as well and probably not slower than the current git implementation - they allocate even more since they decode all strings unconditionally and allocate all the time (at least so I think, who would want to implement copy-on-write even in C?)
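
The 75MB/s figure is just the parser throughput scaled by the ratio of the two median timings from the benchmark. A tiny sketch, with scaled_throughput being a made-up helper name:

```rust
/// For the same input, throughput is inversely proportional to time,
/// so a known throughput can be rescaled by the ratio of the timings.
pub fn scaled_throughput(base_mb_per_s: f64, base_time_us: f64, other_time_us: f64) -> f64 {
    base_mb_per_s * base_time_us / other_time_us
}
```

Calling scaled_throughput(187.0, 33.792, 84.139) gives roughly 75.1.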

From what I see, the trade-off in the serde-git-config implementation is towards doing more work upfront, which avoids duplicate work later during lookup. The current API sketch here would, for example, trigger parsing values each time before use, and possibly even check if a section name needs decoding each time someone tries to find it. So I think it's a trade-off that fits the typical usage pattern well, where many lookups are made across multiple git-config files.

All in all I am super excited to see your progress and am looking forward to welcoming you here as the second crate maintainer :) - git-commitgraph was similarly contributed.

0 replies
