diff --git a/.vscode/settings.json b/.vscode/settings.json index 37a446bc..7b781304 100644 --- a/.vscode/settings.json +++ b/.vscode/settings.json @@ -12,6 +12,7 @@ "Gjengset", "gothambold", "helveticaneue", + "hexdigit", "Hsing", "jetbrainsmono", "Lowlight", diff --git a/_posts/2023-02-20-guide-to-nom-parsing.md b/_posts/2023-02-20-guide-to-nom-parsing.md index 85bb16a1..9ea35029 100644 --- a/_posts/2023-02-20-guide-to-nom-parsing.md +++ b/_posts/2023-02-20-guide-to-nom-parsing.md @@ -16,12 +16,19 @@ categories: - [Introduction](#introduction) -- [Documentation](#documentation) - [Getting to know nom using a simple example](#getting-to-know-nom-using-a-simple-example) - [Parsing hex color codes](#parsing-hex-color-codes) - [What does this code do, how does it work?](#what-does-this-code-do-how-does-it-work) + - [The Parser trait and IResult](#the-parser-trait-and-iresult) + - [Main parser function that calls all the other parsers](#main-parser-function-that-calls-all-the-other-parsers) + - [The hex segment parser, comprised of nom combinator functions, and IResult](#the-hex-segment-parser-comprised-of-nom-combinator-functions-and-iresult) - [Generalized workflow](#generalized-workflow) + - [Why can't we parse "πŸ”…#2F14DF"?](#why-cant-we-parse-2f14df) + - [What if we wanted to get better error reporting on what is happening?](#what-if-we-wanted-to-get-better-error-reporting-on-what-is-happening) - [Build a Markdown parser](#build-a-markdown-parser) +- [Other examples](#other-examples) +- [Related video](#related-video) +- [References](#references) @@ -43,58 +50,26 @@ This tutorial has 2 examples in it: > object oriented), please take a look at this [paper](https://arxiv.org/pdf/2307.07069.pdf) > by Will Crichton demonstrating Typed Design Patterns with Rust. -## Documentation - - -nom is a huge topic. This tutorial takes a hands on approach to learning nom. However, the resources -listed below are very useful for learning nom. 
Think of them as a reference guide and deep dive into -how the nom library works. - -- Useful: - - Source code examples (fantastic way to learn nom): - - [export-logseq-notes repo](https://github.com/dimfeld/export-logseq-notes/tree/master/src) - - Videos: - - [Intro from the author 7yrs old](https://youtu.be/EXEMm5173SM) - - Nom 7 deep dive videos: - - [Parsing name, age, and preference from natural language input](https://youtu.be/Igajh2Vliog) - - [Parsing number ranges](https://youtu.be/Xm4jrjohDN8) - - [Parsing lines of text](https://youtu.be/6b2ymQWldoE) - - Nom 6 videos (deep dive into how nom combinators themselves are constructed): - - [Deep dive, Part 1](https://youtu.be/zHF6j1LvngA) - - [Deep dive, Part 2](https://youtu.be/9GLFJcSO08Y) - - Tutorials: - - [Build a JSON parser using nom7](https://codeandbitters.com/lets-build-a-parser/) - - [Excellent beginner to advanced](https://github.com/benkay86/nom-tutorial) - - [Write a parser from scratch](https://github.com/rust-bakery/nom/blob/main/doc/making_a_new_parser_from_scratch.md) - - Reference docs: - - [nominomicon](https://tfpk.github.io/nominomicon/introduction.html) - - [What combinator or parser to use?](https://github.com/rust-bakery/nom/blob/main/doc/choosing_a_combinator.md) - - [docs.rs](https://docs.rs/nom/7.1.3/nom/) - - [Upgrading to nom 5.0](https://github.com/rust-bakery/nom/blob/main/doc/upgrading_to_nom_5.md) -- Less useful: - - [README](https://github.com/rust-bakery/nom) - - [nom crate](https://crates.io/crates/nom) - ## Getting to know nom using a simple example -[nom](https://crates.io/crates/nom) is a parser combinator library for Rust. You can write small +[`nom`](https://crates.io/crates/nom) is a parser combinator library for Rust. You can write small functions that parse a specific part of your input, and then combine them to build a parser that -parses the whole input. 
nom is very efficient and fast, it does not allocate memory when parsing if -it doesn't have to, and it makes it very easy for you to do the same. nom uses streaming mode or +parses the whole input. `nom` is very efficient and fast, it does not allocate memory when parsing if +it doesn't have to, and it makes it very easy for you to do the same. `nom` uses streaming mode or complete mode, and in this tutorial & code examples provided we will be using complete mode. -Roughly the way it works is that you tell nom how to parse a bunch of bytes in a way that matches +Roughly the way it works is that you tell `nom` how to parse a bunch of bytes in a way that matches some pattern that is valid for your data. It will try to parse as much as it can from the input, and the rest of the input will be returned to you. -You express the pattern that you're looking for by combining parsers. nom has a whole bunch of these -that come out of the box. And a huge part of learning nom is figuring out what these built in +You express the pattern that you're looking for by combining parsers. `nom` has a whole bunch of these +that come out of the box. And a huge part of learning `nom` is figuring out what these built in parsers are and how to combine them to build a parser that does what you want. Errors are a key part of it being able to apply a variety of different parsers to the same input. If -a parser fails, nom will return an error, and the rest of the input will be returned to you. This +a parser fails, `nom` will return an error, and the rest of the input will be returned to you. This allows you to combine parsers in a way that you can try to parse a bunch of different things, and if one of them fails, you can try the next one. This is very useful when you are trying to parse a bunch of different things, and you don't know which one you are going to get. @@ -102,17 +77,22 @@ bunch of different things, and you don't know which one you are going to get. 
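    To make the "try one parser, and if it fails, try the next one" idea concrete, here is a minimal, std-only sketch. It does not use `nom` at all: `ParseResult`, `hex_seg`, `keyword_none`, and `either` are hypothetical names made up for this illustration, and the error type is simplified to "the input that could not be parsed". In `nom` itself this role is played by the `alt` combinator in `nom::branch`, with richer error types.

    ```rust
    /// Hand-rolled stand-in for a parser result (an assumption, not nom's
    /// `IResult`): on success, the remaining input plus the parsed value;
    /// on failure, the untouched input.
    type ParseResult<'a, T> = Result<(&'a str, T), &'a str>;

    /// Hypothetical parser: matches a 2-digit hex segment such as "2F".
    fn hex_seg(input: &str) -> ParseResult<'_, u8> {
        // `get(..2)` yields `None` if the input is shorter than 2 bytes or
        // if byte 2 is not a char boundary, so emoji input never panics.
        let seg = input.get(..2).ok_or(input)?;
        if !seg.chars().all(|c| c.is_ascii_hexdigit()) {
            return Err(input);
        }
        let value = u8::from_str_radix(seg, 16).map_err(|_| input)?;
        Ok((&input[2..], value))
    }

    /// Hypothetical parser: matches the literal keyword "none", yields 0.
    fn keyword_none(input: &str) -> ParseResult<'_, u8> {
        input.strip_prefix("none").map(|rest| (rest, 0)).ok_or(input)
    }

    /// Try `first`; if it fails, retry `second` from the same starting
    /// point. The failed parser hands back the input it could not consume.
    fn either<'a, T>(
        first: impl Fn(&'a str) -> ParseResult<'a, T>,
        second: impl Fn(&'a str) -> ParseResult<'a, T>,
        input: &'a str,
    ) -> ParseResult<'a, T> {
        first(input).or_else(|unconsumed| second(unconsumed))
    }
    ```

    With these definitions, `either(hex_seg, keyword_none, "2F14DF")` parses the first segment and hands back the rest as `Ok(("14DF", 47))`, `either(hex_seg, keyword_none, "none;")` falls through to the second parser and returns `Ok((";", 0))`, and an input neither parser accepts, such as `"#2F"`, comes back unchanged as `Err("#2F")`.
    
    
    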
### Parsing hex color codes +> You can get the source code for the examples in this +> [repo](https://github.com/nazmulidris/rust-scratch/blob/main/nom). -Let's dive into nom using a simple example of parsing +Let's dive into `nom` using a simple example of parsing [hex color codes](https://developer.mozilla.org/en-US/docs/Web/CSS/color). ```rust -//! This module contains a parser that parses a hex color string into a [Color] struct. +//! This module contains a parser that parses a hex color +//! string into a [Color] struct. //! The hex color string can be in the following format `#RRGGBB`. //! For example, `#FF0000` is red. use std::num::ParseIntError; -use nom::{bytes::complete::*, combinator::*, error::*, sequence::*, IResult, Parser}; +use nom::{ + bytes::complete::*, combinator::*, error::*, sequence::*, IResult, Parser +}; #[derive(Debug, PartialEq)] pub struct Color { @@ -127,18 +107,21 @@ impl Color { } } -/// Helper functions to match and parse hex digits. These are not [Parser] -/// implementations. +/// Helper functions to match and parse hex digits. These are not +/// [Parser] implementations. mod helper_fns { use super::*; - /// This function is used by [map_res] and it returns a [Result], not [IResult]. - pub fn parse_str_to_hex_num(input: &str) -> Result { + /// This function is used by [map_res] and it returns a [Result] + /// not [IResult]. + pub fn parse_str_to_hex_num(input: &str) -> + Result + { u8::from_str_radix(input, 16) } - /// This function is used by [take_while_m_n] and as long as it returns `true` - /// items will be taken from the input. + /// This function is used by [take_while_m_n] and as long as it + /// returns `true` items will be taken from the input. pub fn match_is_hex_digit(c: char) -> bool { c.is_ascii_hexdigit() } @@ -151,14 +134,17 @@ mod helper_fns { } } -/// These are [Parser] implementations that are used by [hex_color_no_alpha]. +/// These are [Parser] implementations that are used by +/// [hex_color_no_alpha]. 
mod intermediate_parsers { use super::*; /// Call this to return function that implements the [Parser] trait. - pub fn gen_hex_seg_parser_fn<'input, E>() -> impl Parser<&'input str, u8, E> + pub fn gen_hex_seg_parser_fn<'input, E>() -> + impl Parser<&'input str, u8, E> where - E: FromExternalError<&'input str, ParseIntError> + ParseError<&'input str>, + E: FromExternalError<&'input str, ParseIntError> + + ParseError<&'input str>, { map_res( take_while_m_n(2, 2, helper_fns::match_is_hex_digit), @@ -179,7 +165,8 @@ fn hex_color_no_alpha(input: &str) -> IResult<&str, Color> { ), ); let (input, _) = tag("#")(input)?; - let (input, (red, green, blue)) = tuple(it)(input)?; // same as `it.parse(input)?` + let (input, (red, green, blue)) = + tuple(it)(input)?; // same as `it.parse(input)?` Ok((input, Color { red, green, blue })) } @@ -211,71 +198,94 @@ mod tests { ### What does this code do, how does it work? - Please note that: -- This string can be parsed: `#2F14DFπŸ”…`. -- However, this string can't `πŸ”…#2F14DF`. +- This string can be parsed: `#2F14DFπŸ”…` βœ…. +- However, this string can't `πŸ”…#2F14DF` πŸ€”. So what is going on in the source code above? -1. The `intermediate_parsers::hex_color_no_alpha()` function is the main function that orchestrates - all the other functions to parse an `input: &str` and turn it into a `(&str, Color)`. - - - The `tag` combinator function is used to match the `#` character. This means that if the input - doesn't start with `#`, the parser will fail (which is why `πŸ”…#2F14DF` fails). It returns the - remaining input after `#`. And the output is `#` which we throw away. - - A `tuple` is created that takes 3 parsers, which all do the same exact thing, but are written - in 3 different ways just to demonstrate how these can be written. - 1. The `helper_fns::parse_hex_seg()` function is added to a tuple. - 2. The higher order function `intermediate_parsers::gen_hex_seg_parser_fn()` is added to the - tuple. - 3. 
Finally, the `map_res` combinator is directly added to the tuple. - - An extension function on this tuple called `parse()` is called w/ the `input` (thus far). This - is used to parse the input hex number. - - It returns the remaining input after the hex number which is why `#2F14DFπŸ”…` returns `πŸ”…` as - the first item in the tuple. - - The second item in the tuple is the parsed color string turned into a `Color` struct. - -2. Let's look at the `helper_fns::parse_hex_seg` (the other 2 ways shown above do the same exact - thing). The signature of this function tells nom that you can call the function w/ `input` - argument and it will return `IResult`. This signature is the pattern that - we will end up using to figure out how to chain combinators together. Here's how the `map_res` - combinator is used by `parse_hex_seg()` to actually do the parsing: - - 1. `take_while_m_n`: This combinator takes a range of characters (`2, 2`) and applies the - function `match_is_hex_digit` to determine whether the `char` is a hex digit (using - `is_ascii_hexdigit()` on the `char`). This is used to match a valid hex digit. It returns a - `&str` slice of the matched characters. Which is then passed to the next combinator. - 2. `parse_str_to_hex_num`: This parser is used on the string slice returned from above. It simply - takes string slice and turns it into a `Result, std::num::ParseIntError>`. The error is - important, since if the string slice is not a valid hex digit, then we want to return this - error. - -3. The key concept in nom is the `Parser` trait which is implemented for any `FnMut` that accepts an - input and returns an `IResult`. - - If you write a simple function w/ the signature - `fn(input: Input) -> IResult` then you are good to go! You just need to - call `parse()` on the `Input` type and this will kick off the parsing. - - Alternatively, you can just call the nom `tuple` function directly via - `nom::sequence::tuple(...)(input)?`. 
Or you can just call the `parse()` method on the tuple - since this is an extension function on tuples provided by nom. - - `IResult` is a very important type alias. It encapsulates 3 key types that are related to - parsing: - 1. The `Input` type is the type of the input that is being parsed. For example, if you are - parsing a string, then the `Input` type is `&str`. - 2. The `Output` type is the type of the output that is returned by the parser. For example, if - you are parsing a string and you want to return a `Color` struct, then the `Output` type is - `Color`. - 3. The `Error` type is the type of the error that is returned by the parser. For example, if - you are parsing a string and you want to return a `nom::Err::Error` error, then the `Error` - type is `nom::Err::Error`. This is very useful when you are developing your parser - combinators and you run into errors and have to debug them. +#### The Parser trait and IResult + + +The key concept in `nom` is the `Parser` trait which is implemented for any `FnMut` that accepts an +input and returns an `IResult`. + +- If you write a simple function w/ the signature + `fn(input: Input) -> IResult` then you are good to go! You just need to + call `parse()` on the `Input` type and this will kick off the parsing. +- Alternatively, you can just call the `nom` `tuple` function directly via + `nom::sequence::tuple(...)(input)?`. Or you can just call the `parse()` method on the tuple + since this is an extension function on tuples provided by `nom`. +- `IResult` is a very important type alias. It encapsulates 3 key types that are related to + parsing: + 1. The `Input` type is the type of the input that is being parsed. For example, if you are + parsing a string, then the `Input` type is `&str`. + 2. The `Output` type is the type of the output that is returned by the parser. For example, if + you are parsing a string and you want to return a `Color` struct, then the `Output` type is + `Color`. + 3. 
The `Error` type is the type of the error that is returned by the parser. For example, if + you are parsing a string and you want to return a `nom::Err::Error` error, then the `Error` + type is `nom::Err::Error`. This is very useful when you are developing your parser + combinators and you run into errors and have to debug them. + 4. Typically we are dealing with complete parsers which are character based. These are reflected + in the functions that we import from `nom`. It is pretty common to see the `'input` lifetime + parameter used in functions that are parsers. This way slices of the input can be returned + from the parser without having to `Clone` or allocate memory. + + Here's an example of this: + ```rust + pub fn parse_hex_seg<'input, E /* thread this generic type down */>( + input: &'input str, + ) -> IResult<&'input str, u8, E> + where + E: ParseError<&'input str> + ContextError<&'input str> + { /* code */ } + ``` + +#### Main parser function that calls all the other parsers + + +The `intermediate_parsers::hex_color_no_alpha()` function is the main function that +orchestrates all the other functions to parse an `input: &str` and turn it into a +`(&str, Color)`. + +- The `tag` combinator function is used to match the `#` character. This means that if the input + doesn't start with `#`, the parser will fail (which is why `πŸ”…#2F14DF` fails). It returns the + remaining input after `#`. And the output is `#` which we throw away. +- A `tuple` is created that takes 3 parsers, which all do the same exact thing, but are written + in 3 different ways just to demonstrate how these can be written. + 1. The `helper_fns::parse_hex_seg()` function is added to a tuple. + 2. The higher order function `intermediate_parsers::gen_hex_seg_parser_fn()` is added to the + tuple. + 3. Finally, the `map_res` combinator is directly added to the tuple. +- An extension function on this tuple called `parse()` is called w/ the `input` (thus far). 
This + is used to parse the input hex number. + - It returns the remaining input after the hex number which is why `#2F14DFπŸ”…` returns `πŸ”…` as + the first item in the tuple. + - The second item in the tuple is the parsed color string turned into a `Color` struct. + +#### The hex segment parser, comprised of nom combinator functions, and IResult + + +Let's look at the `helper_fns::parse_hex_seg` (the other 2 ways shown above do the same exact +thing). The signature of this function tells `nom` that you can call the function w/ `input` +argument and it will return `IResult`. This signature is the pattern that +we will end up using to figure out how to chain combinators together. Here's how the `map_res` +combinator is used by `parse_hex_seg()` to actually do the parsing: + +1. `take_while_m_n`: This combinator takes a range of characters (`2, 2`) and applies the + function `match_is_hex_digit` to determine whether the `char` is a hex digit (using + `is_ascii_hexdigit()` on the `char`). This is used to match a valid hex digit. It returns a + `&str` slice of the matched characters. Which is then passed to the next combinator. +2. `parse_str_to_hex_num`: This parser is used on the string slice returned from above. It simply + takes string slice and turns it into a `Result, std::num::ParseIntError>`. The error is + important, since if the string slice is not a valid hex digit, then we want to return this + error. ### Generalized workflow - After the really complicated walk through above, we could have just written the entire thing concisely like so: @@ -311,6 +321,228 @@ complex parsers. You start w/ the simplest one first, and then build up from the - The `?` operator is used to return the error if there is one. - The `Ok()` is used to return the parsed `Color` struct and the remaining input. +### Why can't we parse "πŸ”…#2F14DF"? 
+ + +The reason we can't parse `πŸ”…#2F14DF` is because the `tag("#")` combinator is used to +match the `#` character at the very start of our input. Remember that the parser will try +to eat the bytes from the start of the input to the end. This means that if the input +doesn't start with `#`, the parser will fail. + +If we have the requirement to parse a hex color code that doesn't start with `#`, then we +can modify the parser to handle this case. Here's one way in which we can do this. + +```rust +/// This is the "main" function that is called by the tests. +fn hex_color_no_alpha( + input: &str, +) -> IResult< + ( + /* start remainder */ &str, + /* end remainder */ &str, + ), + Color, +> { + let mut root_fn = preceded( + /* throw away "#" */ + context("remove #", tag("#")), + /* return color */ + tuple(( + context("first hex seg", helper_fns::parse_hex_seg), + context( + "second hex seg", + intermediate_parsers::gen_hex_seg_parser_fn(), + ), + context( + "third hex seg", + map_res( + take_while_m_n(2, 2, helper_fns::match_is_hex_digit), + helper_fns::parse_str_to_hex_num, + ), + ), + )), + ); + + // Get chars before "#". 
+ let pre_root_fn = take_until::< + /* input after "#" */ &str, + /* start remainder */ &str, + nom::error::VerboseError<&str>, + >("#"); + + if let Ok((input_after_hash, start_remainder)) = pre_root_fn(input) { + if let Ok((end_remainder, (red, green, blue))) = + root_fn(input_after_hash) + { + Ok(( + (start_remainder, end_remainder), + Color::new(red, green, blue), + )) + } else + { + Err(nom::Err::Error(Error::new( + (input_after_hash, ""), + ErrorKind::Fail, + ))) + } + } else { + Err(nom::Err::Error(Error::new((input, ""), ErrorKind::Fail))) + } +} +``` + +And this is what the tests would look like: + +```rust +#[test] +fn parse_valid_color() { + let input = "\n🌜\n#2F14DF\nπŸ”…\n"; + let result = dbg!(hex_color_no_alpha(input)); + let Ok((remainder, color)) = result else { + panic!(); + }; + assert_eq!(remainder, ("\n🌜\n", "\nπŸ”…\n")); + assert_eq!(color, Color::new(47, 20, 223)); +} +``` + +### What if we wanted to get better error reporting on what is happening? + + +We can use the `context` combinator to provide better error reporting. This is a very useful +combinator that you can use to provide better error messages when parsing fails. However, when +using it, we need to: + +1. Be careful of expressing the `nom` error types as generic arguments to the parser + functions, by using the `nom::error::VerboseError` type to get more detailed error + messages which are used by `nom::error::convert_error`. +2. This type needs to be passed as a generic argument to each parser that uses the + `context` combinator. + +Here's an example of this. + +```rust +use nom::{ + bytes::complete::{tag, take_while_m_n}, + combinator::map_res, + error::{context, convert_error}, + sequence::Tuple, + IResult, Parser, +}; + +/// `nom` is used to parse the hex digits from string. Then +/// [u8::from_str_radix] is used to convert the hex string into a +/// number. 
This can't fail, even though in the function signature, +/// that may return a [core::num::ParseIntError], which never +/// happens. Note the use of [nom::error::VerboseError] to get more +/// detailed error messages that are passed to +/// [nom::error::convert_error]. +pub fn parse_hex_seg(input: &str) -> IResult< + &str, + u8, + nom::error::VerboseError<&str> +> { + map_res( + take_while_m_n::<_, &str, nom::error::VerboseError<_>>( + 2, + 2, + |it| { it.is_ascii_hexdigit() }), + |it| u8::from_str_radix(it, 16), + ) + .parse(input) +} + +/// Note the use of [nom::error::VerboseError] to get more detailed +/// error messages that are passed to [nom::error::convert_error]. +pub fn root(input: &str) -> IResult< + &str, + (&str, u8, u8, u8), + nom::error::VerboseError<&str> +> { + let (remainder, (_, red, green, blue)) = ( + context("start of hex color", tag("#")), + context("hex seg 1", parse_hex_seg), + context("hex seg 2", parse_hex_seg), + context("hex seg 3", parse_hex_seg), + ) + .parse(input)?; + + Ok((remainder, ("", red, green, blue))) +} +``` + +This just sets up our code to use `context`, but we still have to format the output of the +error in a human readable way to `stdout`. This is where `convert_error` comes in. Here's +how you can use it. + +```rust +#[test] +fn test_root_1() { + let input = "x#FF0000"; + let result = root(input); + println!("{:?}", result); + assert!(result.is_err()); + + match result { + Err(nom::Err::Error(e)) | Err(nom::Err::Failure(e)) => { + println!( + "Could not parse because ... {}", + convert_error(input, e) + ); + } + _ => { /* do nothing for nom::Err::Incomplete(_) */ } + } +} +``` + +Here's the output of the test. + +```text +Err(Error(VerboseError { errors: [("x#FF0000", Nom(Tag)), ("x#FF0000", Context("start of hex color"))] })) +Could not parse because ... 0: at line 1, in Tag: +x#FF0000 +^ + +1: at line 1, in start of hex color: +x#FF0000 +^ +``` + +Here's another test to see even more detailed error messages. 
+ +```rust +#[test] +fn test_root_2() { + let input = "#FF_000"; + let result = root(input); + println!("{:?}", result); + assert!(result.is_err()); + + match result { + Err(nom::Err::Error(e)) | Err(nom::Err::Failure(e)) => { + println!( + "Could not parse because ... {}", + convert_error(input, e) + ); + } + _ => { /* do nothing for nom::Err::Incomplete(_) */ } + } +} +``` + +Here's the output of this test. + +```text +Err(Error(VerboseError { errors: [("_000", Nom(TakeWhileMN)), ("_000", Context("hex seg 2"))] })) +Could not parse because ... 0: at line 1, in TakeWhileMN: +#FF_000 + ^ + +1: at line 1, in hex seg 2: +#FF_000 + ^ +``` + ## Build a Markdown parser @@ -350,3 +582,49 @@ Here are some entry points into the codebase. that are used to represent the Markdown document model ([`Document`](https://github.com/r3bl-org/r3bl-open-core/blob/main/tui/src/tui/md_parser/types.rs)) and all the other intermediate types (`Fragment`, `Block`, etc) & enums required for parsing. + +## Other examples + + +1. [Simple CSS parser](https://github.com/nazmulidris/rust-scratch/blob/main/nom/src/parser_simple_css.rs). +2. [Simple natural language parser](https://github.com/nazmulidris/rust-scratch/blob/main/nom/src/parse_natural_lang.rs). + +## Related video + + +> You can get the source code for the examples in this +> [repo](https://github.com/nazmulidris/rust-scratch/blob/main/nom). + +TK: add video here + +## References + + +`nom` is a huge topic. This tutorial takes a hands on approach to learning `nom`. However, the resources +listed below are very useful for learning `nom`. Think of them as a reference guide and deep dive into +how the `nom` library works. 
+ +- Useful: + - Source code examples (fantastic way to learn `nom`): + - [export-logseq-notes repo](https://github.com/dimfeld/export-logseq-notes/tree/master/src) + - Videos: + - [Intro from the author 7yrs old](https://youtu.be/EXEMm5173SM) + - `nom` 7 deep dive videos: + - [Parsing name, age, and preference from natural language input](https://youtu.be/Igajh2Vliog) + - [Parsing number ranges](https://youtu.be/Xm4jrjohDN8) + - [Parsing lines of text](https://youtu.be/6b2ymQWldoE) + - `nom` 6 videos (deep dive into how nom combinators themselves are constructed): + - [Deep dive, Part 1](https://youtu.be/zHF6j1LvngA) + - [Deep dive, Part 2](https://youtu.be/9GLFJcSO08Y) + - Tutorials: + - [Build a JSON parser using `nom` 7](https://codeandbitters.com/lets-build-a-parser/) + - [Excellent beginner to advanced](https://github.com/benkay86/nom-tutorial) + - [Write a parser from scratch](https://github.com/rust-bakery/nom/blob/main/doc/making_a_new_parser_from_scratch.md) + - Reference docs: + - [nominomicon](https://tfpk.github.io/nominomicon/introduction.html) + - [What combinator or parser to use?](https://github.com/rust-bakery/nom/blob/main/doc/choosing_a_combinator.md) + - [docs.rs](https://docs.rs/nom/7.1.3/nom/) + - [Upgrading to `nom` 5.0](https://github.com/rust-bakery/nom/blob/main/doc/upgrading_to_nom_5.md) +- Less useful: + - [README](https://github.com/rust-bakery/nom) + - [`nom` crate](https://crates.io/crates/nom) diff --git a/_sass/globals.scss b/_sass/globals.scss index 13138034..dfce2923 100644 --- a/_sass/globals.scss +++ b/_sass/globals.scss @@ -34,7 +34,7 @@ $specialHeadingFontFamily: "Gotham Bold", "Monocraft", "Mabry Pro", "Ndot-55"; $headingFontFamily: "Monocraft", "Mabry Pro", "Gotham Bold", "Spline Sans", "Helvetica Neue", "Google Sans", "Work Sans", Arial, sans-serif; $baseFontFamily: "Iosevka Term Web", "Victor Mono", "Spline Sans Mono", "Fira Mono", "Fira Sans", "JetBrains Mono", "Helvetica Neue", "Lexend Deca", sans-serif; -$baseFontSize: 
12.5pt; +$baseFontSize: 13pt; $baseFontWeight: 400; $baseLineHeight: 1.6; @@ -75,4 +75,4 @@ $onLaptop: 800px; @mixin relative-font-size($ratio) { font-size: $baseFontSize * $ratio; -} +} \ No newline at end of file diff --git a/docs/2023/02/20/guide-to-nom-parsing/index.html b/docs/2023/02/20/guide-to-nom-parsing/index.html index 6f70c589..0c5ba2dd 100644 --- a/docs/2023/02/20/guide-to-nom-parsing/index.html +++ b/docs/2023/02/20/guide-to-nom-parsing/index.html @@ -266,15 +266,25 @@

Guide to parsing with nom @@ -340,70 +350,6 @@

object oriented), please take a look at this paper by Will Crichton demonstrating Typed Design Patterns with Rust.

-

- - - Documentation # - - -

- -

- -

nom is a huge topic. This tutorial takes a hands on approach to learning nom. However, the resources -listed below are very useful for learning nom. Think of them as a reference guide and deep dive into -how the nom library works.

- -

@@ -414,22 +360,22 @@

-

nom is a parser combinator library for Rust. You can write small +

nom is a parser combinator library for Rust. You can write small functions that parse a specific part of your input, and then combine them to build a parser that -parses the whole input. nom is very efficient and fast, it does not allocate memory when parsing if -it doesn’t have to, and it makes it very easy for you to do the same. nom uses streaming mode or +parses the whole input. nom is very efficient and fast, it does not allocate memory when parsing if +it doesn’t have to, and it makes it very easy for you to do the same. nom uses streaming mode or complete mode, and in this tutorial & code examples provided we will be using complete mode.

-

Roughly the way it works is that you tell nom how to parse a bunch of bytes in a way that matches +

Roughly the way it works is that you tell nom how to parse a bunch of bytes in a way that matches some pattern that is valid for your data. It will try to parse as much as it can from the input, and the rest of the input will be returned to you.

-

You express the pattern that you’re looking for by combining parsers. nom has a whole bunch of these -that come out of the box. And a huge part of learning nom is figuring out what these built in +

You express the pattern that you’re looking for by combining parsers. nom has a whole bunch of these +that come out of the box. And a huge part of learning nom is figuring out what these built in parsers are and how to combine them to build a parser that does what you want.

Errors are a key part of it being able to apply a variety of different parsers to the same input. If -a parser fails, nom will return an error, and the rest of the input will be returned to you. This +a parser fails, nom will return an error, and the rest of the input will be returned to you. This allows you to combine parsers in a way that you can try to parse a bunch of different things, and if one of them fails, you can try the next one. This is very useful when you are trying to parse a bunch of different things, and you don’t know which one you are going to get.

@@ -443,15 +389,23 @@

-

Let’s dive into nom using a simple example of parsing +

+

You can get the source code for the examples in this +repo.

+
+ +

Let’s dive into nom using a simple example of parsing hex color codes.

-
//! This module contains a parser that parses a hex color string into a [Color] struct.
+
//! This module contains a parser that parses a hex color
+//! string into a [Color] struct.
 //! The hex color string can be in the following format `#RRGGBB`.
 //! For example, `#FF0000` is red.
 
 use std::num::ParseIntError;
-use nom::{bytes::complete::*, combinator::*, error::*, sequence::*, IResult, Parser};
+use nom::{
+    bytes::complete::*, combinator::*, error::*, sequence::*, IResult, Parser
+};
 
 #[derive(Debug, PartialEq)]
 pub struct Color {
@@ -466,18 +420,21 @@ 

} } -/// Helper functions to match and parse hex digits. These are not [Parser] -/// implementations. +/// Helper functions to match and parse hex digits. These are not +/// [Parser] implementations. mod helper_fns { use super::*; - /// This function is used by [map_res] and it returns a [Result], not [IResult]. - pub fn parse_str_to_hex_num(input: &str) -> Result<u8, std::num::ParseIntError> { + /// This function is used by [map_res] and it returns a [Result] + /// not [IResult]. + pub fn parse_str_to_hex_num(input: &str) -> + Result<u8, std::num::ParseIntError> + { u8::from_str_radix(input, 16) } - /// This function is used by [take_while_m_n] and as long as it returns `true` - /// items will be taken from the input. + /// This function is used by [take_while_m_n] and as long as it + /// returns `true` items will be taken from the input. pub fn match_is_hex_digit(c: char) -> bool { c.is_ascii_hexdigit() } @@ -490,14 +447,17 @@

} } -/// These are [Parser] implementations that are used by [hex_color_no_alpha]. +/// These are [Parser] implementations that are used by +/// [hex_color_no_alpha]. mod intermediate_parsers { use super::*; /// Call this to return function that implements the [Parser] trait. - pub fn gen_hex_seg_parser_fn<'input, E>() -> impl Parser<&'input str, u8, E> + pub fn gen_hex_seg_parser_fn<'input, E>() -> + impl Parser<&'input str, u8, E> where - E: FromExternalError<&'input str, ParseIntError> + ParseError<&'input str>, + E: FromExternalError<&'input str, ParseIntError> + + ParseError<&'input str>, { map_res( take_while_m_n(2, 2, helper_fns::match_is_hex_digit), @@ -518,7 +478,8 @@

), ); let (input, _) = tag("#")(input)?; - let (input, (red, green, blue)) = tuple(it)(input)?; // same as `it.parse(input)?` + let (input, (red, green, blue)) = + tuple(it)(input)?; // same as `it.parse(input)?` Ok((input, Color { red, green, blue })) } @@ -559,84 +520,122 @@

Please note that:

    -
  • This string can be parsed: #2F14DF🔅.
  • -
  • However, this string can’t 🔅#2F14DF.
  • +
  • This string can be parsed: #2F14DF🔅 ✅.
  • +
  • However, this string can’t 🔅#2F14DF 🤔.

So what is going on in the source code above?

+

+ + + The Parser trait and IResult # + + +

+ +

-
    -
  1. -

    The intermediate_parsers::hex_color_no_alpha() function is the main function that orchestrates -all the other functions to parse an input: &str and turn it into a (&str, Color).

    - -
      -
    • The tag combinator function is used to match the # character. This means that if the input -doesn’t start with #, the parser will fail (which is why 🔅#2F14DF fails). It returns the -remaining input after #. And the output is # which we throw away.
    • -
    • A tuple is created that takes 3 parsers, which all do the same exact thing, but are written -in 3 different ways just to demonstrate how these can be written. -
        -
      1. The helper_fns::parse_hex_seg() function is added to a tuple.
      2. -
      3. The higher order function intermediate_parsers::gen_hex_seg_parser_fn() is added to the -tuple.
      4. -
      5. Finally, the map_res combinator is directly added to the tuple.
      6. -
      -
    • -
    • An extension function on this tuple called parse() is called w/ the input (thus far). This -is used to parse the input hex number. -
        -
      • It returns the remaining input after the hex number which is why #2F14DF🔅 returns 🔅 as -the first item in the tuple.
      • -
      • The second item in the tuple is the parsed color string turned into a Color struct.
      • -
      -
    • -
    -
  2. -
  3. -

    Let’s look at the helper_fns::parse_hex_seg (the other 2 ways shown above do the same exact -thing). The signature of this function tells nom that you can call the function w/ input -argument and it will return IResult<Input, Output, Error>. This signature is the pattern that -we will end up using to figure out how to chain combinators together. Here’s how the map_res -combinator is used by parse_hex_seg() to actually do the parsing:

    - -
      -
    1. take_while_m_n: This combinator takes a range of characters (2, 2) and applies the -function match_is_hex_digit to determine whether the char is a hex digit (using -is_ascii_hexdigit() on the char). This is used to match a valid hex digit. It returns a -&str slice of the matched characters. Which is then passed to the next combinator.
    2. -
    3. parse_str_to_hex_num: This parser is used on the string slice returned from above. It simply -takes string slice and turns it into a Result<u8, std::num::ParseIntError>. The error is -important, since if the string slice is not a valid hex digit, then we want to return this -error.
    4. -
    -
  4. -
  5. -

    The key concept in nom is the Parser trait which is implemented for any FnMut that accepts an +

    The key concept in nom is the Parser trait which is implemented for any FnMut that accepts an input and returns an IResult<Input, Output, Error>.

    -
      -
    • If you write a simple function w/ the signature + +
        +
      • If you write a simple function w/ the signature fn(input: Input) -> IResult<Input, Output, Error> then you are good to go! You just need to call parse() on the function, passing in the input, and this will kick off the parsing.
      • -
      • Alternatively, you can just call the nom tuple function directly via +
      • Alternatively, you can just call the nom tuple function directly via nom::sequence::tuple(...)(input)?. Or you can just call the parse() method on the tuple -since this is an extension function on tuples provided by nom.
      • -
      • IResult is a very important type alias. It encapsulates 3 key types that are related to +since this is an extension function on tuples provided by nom.
      • +
      • IResult is a very important type alias. It encapsulates 3 key types that are related to parsing: -
          -
        1. The Input type is the type of the input that is being parsed. For example, if you are +
            +
          1. The Input type is the type of the input that is being parsed. For example, if you are parsing a string, then the Input type is &str.
          2. -
          3. The Output type is the type of the output that is returned by the parser. For example, if +
          4. The Output type is the type of the output that is returned by the parser. For example, if you are parsing a string and you want to return a Color struct, then the Output type is Color.
          5. -
          6. The Error type is the type of the error that is returned by the parser. For example, if +
          7. The Error type is the type of the error that is returned by the parser. For example, if you are parsing a string and you want detailed error reports, then the Error type is nom::error::VerboseError<&str>. If you leave it out, it defaults to nom::error::Error<Input>. When a parser fails, nom wraps this type in nom::Err (Error, Failure, or Incomplete). This is very useful when you are developing your parser combinators and you run into errors and have to debug them.
          8. -
          +
        2. +

          Typically we are dealing with complete parsers which are character based. These are reflected +in the functions that we import from nom. It is pretty common to see the 'input lifetime +parameter used in functions that are parsers. This way slices of the input can be returned +from the parser without having to Clone or allocate memory.

          + +

          Here’s an example of this:

          +
          pub fn parse_hex_seg<'input, E /* thread this generic type down */>(
          +    input: &'input str,
          +) -> IResult<&'input str, u8, E>
          +where
          +    E: ParseError<&'input str> + ContextError<&'input str>
          +{ /* code */ }
          +
        3. +
        +
      • +
      +
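To make the shape of `IResult` concrete, here is a minimal std-only sketch. `MiniResult` and `mini_tag` are hypothetical names made up for this illustration and are not part of nom; nom's real `IResult<I, O, E>` is an alias for `Result<(I, O), nom::Err<E>>`, while this sketch uses a plain `String` as the error to stay self-contained. It shows that a parser hands back the remaining input plus the output, and that both can be slices borrowed from the original input.

```rust
/// Hypothetical stand-in for nom's `IResult`; the error is a plain `String` here.
type MiniResult<'input, O> = Result<(&'input str, O), String>;

/// Matches a literal prefix (similar in spirit to nom's `tag`) and returns
/// `(remainder, matched)`. Both slices borrow from `input`, so nothing is
/// cloned or allocated on the success path.
fn mini_tag<'input>(
    prefix: &str,
    input: &'input str,
) -> MiniResult<'input, &'input str> {
    match input.strip_prefix(prefix) {
        // `strip_prefix` succeeded, so `prefix.len()` is a valid char boundary.
        Some(remainder) => Ok((remainder, &input[..prefix.len()])),
        None => Err(format!("expected {prefix:?} at {input:?}")),
    }
}
```

Calling `mini_tag("#", "#2F14DF")` yields the remainder `"2F14DF"` and the matched `"#"`, in the same remainder-first order that nom uses.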

      + + + Main parser function that calls all the other parsers # + + +

      + +

      + +

      The intermediate_parsers::hex_color_no_alpha() function is the main function that +orchestrates all the other functions to parse an input: &str and turn it into a +(&str, Color).

      + +
        +
      • The tag combinator function is used to match the # character. This means that if the input +doesn’t start with #, the parser will fail (which is why 🔅#2F14DF fails). It returns the +remaining input after #. And the output is # which we throw away.
      • +
      • A tuple is created that takes 3 parsers, which all do the same exact thing, but are written +in 3 different ways just to demonstrate how these can be written. +
          +
        1. The helper_fns::parse_hex_seg() function is added to a tuple.
        2. +
        3. The higher order function intermediate_parsers::gen_hex_seg_parser_fn() is added to the + tuple.
        4. +
        5. Finally, the map_res combinator is directly added to the tuple.
        6. +
        +
      • +
      • An extension function on this tuple called parse() is called w/ the input (thus far). This +is used to parse the input hex number. +
          +
        • It returns the remaining input after the hex number which is why #2F14DF🔅 returns 🔅 as +the first item in the tuple.
        • +
        • The second item in the tuple is the parsed color string turned into a Color struct.
      • +
      +
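The orchestration described above can be sketched without nom at all. This is a simplified illustration under stated assumptions, not the tutorial's actual code: `Parsed`, `hash`, `hex_seg`, and `hex_color` are hypothetical std-only helpers, and errors are plain strings. The point is how each step consumes a prefix and threads the remaining input into the next step.

```rust
#[derive(Debug, PartialEq)]
struct Color {
    red: u8,
    green: u8,
    blue: u8,
}

/// Hypothetical result shape: `(remaining input, parsed value)` or an error message.
type Parsed<'a, T> = Result<(&'a str, T), String>;

/// Consumes a leading `#` and throws it away.
fn hash(input: &str) -> Parsed<'_, ()> {
    match input.strip_prefix('#') {
        Some(rest) => Ok((rest, ())),
        None => Err(format!("expected '#' at {input:?}")),
    }
}

/// Consumes exactly two hex digits and converts them to a `u8`.
fn hex_seg(input: &str) -> Parsed<'_, u8> {
    // `get` returns `None` (no panic) on short input or a non-char boundary.
    let digits = input.get(..2).ok_or("need two characters")?;
    if !digits.chars().all(|c| c.is_ascii_hexdigit()) {
        return Err(format!("not hex digits: {digits:?}"));
    }
    let value = u8::from_str_radix(digits, 16).map_err(|e| e.to_string())?;
    Ok((&input[2..], value))
}

/// Runs the steps in order, shadowing `input` with each remainder.
fn hex_color(input: &str) -> Parsed<'_, Color> {
    let (input, ()) = hash(input)?;
    let (input, red) = hex_seg(input)?;
    let (input, green) = hex_seg(input)?;
    let (input, blue) = hex_seg(input)?;
    Ok((input, Color { red, green, blue }))
}
```

With this sketch, `hex_color("#2F14DF🔅")` returns the remainder `"🔅"` together with the color, and `hex_color("🔅#2F14DF")` fails at the very first step, because the input does not start with `#`.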

      + + + The hex segment parser, comprised of nom combinator functions, and IResult # + + +

      + +

      + +

      Let’s look at the helper_fns::parse_hex_seg (the other 2 ways shown above do the same exact +thing). The signature of this function tells nom that you can call the function w/ input +argument and it will return IResult<Input, Output, Error>. This signature is the pattern that +we will end up using to figure out how to chain combinators together. Here’s how the map_res +combinator is used by parse_hex_seg() to actually do the parsing:

      + +
        +
      1. take_while_m_n: This combinator takes a range of characters (2, 2) and applies the + function match_is_hex_digit to determine whether the char is a hex digit (using + is_ascii_hexdigit() on the char). This is used to match a valid hex digit. It returns a + &str slice of the matched characters. Which is then passed to the next combinator.
      2. +
      3. parse_str_to_hex_num: This function is applied to the string slice returned from above. It simply + takes the string slice and turns it into a Result<u8, std::num::ParseIntError>. The error is + important, since if the string slice is not a valid hex number, then we want to return this + error.
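The two steps above can be approximated in plain Rust. The following is a rough std-only analog, assuming the behavior described here (take between `min` and `max` matching chars, then convert and turn a conversion failure into a parse error). `take_while_between` and `hex_seg` are hypothetical names for this sketch and are not nom's implementation.

```rust
/// Rough analog of `take_while_m_n(min, max, pred)`: takes the longest prefix
/// of at most `max` chars that all satisfy `pred`, and fails if fewer than
/// `min` chars matched. Returns `(remainder, matched)`.
fn take_while_between(
    min: usize,
    max: usize,
    pred: fn(char) -> bool,
    input: &str,
) -> Result<(&str, &str), String> {
    let mut count = 0;
    let mut end = 0; // byte offset just past the last accepted char
    for (idx, c) in input.char_indices() {
        if count == max || !pred(c) {
            break;
        }
        count += 1;
        end = idx + c.len_utf8();
    }
    if count < min {
        return Err(format!("expected at least {min} matching chars at {input:?}"));
    }
    Ok((&input[end..], &input[..end]))
}

/// Rough analog of `map_res(take_while_m_n(2, 2, ..), parse_str_to_hex_num)`.
fn hex_seg(input: &str) -> Result<(&str, u8), String> {
    let (rest, digits) = take_while_between(2, 2, |c| c.is_ascii_hexdigit(), input)?;
    // Like `map_res`: a failed conversion becomes a parse error.
    let value = u8::from_str_radix(digits, 16).map_err(|e| e.to_string())?;
    Ok((rest, value))
}
```

Checking the digits before converting matters: `u8::from_str_radix` on its own accepts a leading `+`, so the predicate is what guarantees that only hex digits reach the conversion.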

      @@ -686,6 +685,235 @@

  6. +

    + + + Why can’t we parse “🔅#2F14DF”? # + + +

    + +

    + +

    The reason we can’t parse 🔅#2F14DF is because the tag("#") combinator is used to +match the # character at the very start of our input. Remember that the parser will try +to eat the bytes from the start of the input to the end. This means that if the input +doesn’t start with #, the parser will fail.

    + +

    If we need to parse input in which the hex color code isn’t at the very start (so the input doesn’t start with #), then we +can modify the parser to handle this case. Here’s one way in which we can do this.

    + +
    /// This is the "main" function that is called by the tests.
    +fn hex_color_no_alpha(
    +    input: &str,
    +) -> IResult<
    +    (
    +        /* start remainder */ &str,
    +        /* end remainder */ &str,
    +    ),
    +    Color,
    +> {
    +    let mut root_fn = preceded(
    +        /* throw away "#" */
    +        context("remove #", tag("#")),
    +        /* return color */
    +        tuple((
    +            context("first hex seg", helper_fns::parse_hex_seg),
    +            context(
    +                "second hex seg",
    +                intermediate_parsers::gen_hex_seg_parser_fn(),
    +            ),
    +            context(
    +                "third hex seg",
    +                map_res(
    +                    take_while_m_n(2, 2, helper_fns::match_is_hex_digit),
    +                    helper_fns::parse_str_to_hex_num,
    +                ),
    +            ),
    +        )),
    +    );
    +
    +    // Get chars before "#".
    +    let pre_root_fn = take_until::<
    +        /* tag type */ &str,
    +        /* input type */ &str,
    +        nom::error::VerboseError<&str>,
    +    >("#");
    +
    +    if let Ok((input_after_hash, start_remainder)) = pre_root_fn(input) {
    +        if let Ok((end_remainder, (red, green, blue))) =
    +            root_fn(input_after_hash)
    +        {
    +            Ok((
    +                (start_remainder, end_remainder),
    +                Color::new(red, green, blue),
    +            ))
    +        } else {
    +            Err(nom::Err::Error(Error::new(
    +                (input_after_hash, ""),
    +                ErrorKind::Fail,
    +            )))
    +        }
    +    } else {
    +        Err(nom::Err::Error(Error::new((input, ""), ErrorKind::Fail)))
    +    }
    +}
    +
    + +

    And this is what the tests would look like:

    + +
    #[test]
    +fn parse_valid_color() {
    +    let input = "\n🌜\n#2F14DF\n🔅\n";
    +    let result = dbg!(hex_color_no_alpha(input));
    +    let Ok((remainder, color)) = result else {
    +        panic!();
    +    };
    +    assert_eq!(remainder, ("\n🌜\n", "\n🔅\n"));
    +    assert_eq!(color, Color::new(47, 20, 223));
    +}
    +
    +
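The "skip everything before `#`" step can also be pictured with the standard library alone. This is a small sketch rather than nom's `take_until`; `split_at_hash` is a hypothetical helper name. It shows that the second slice still begins with `#`, which is why the color parser that follows still has to consume the `#` itself.

```rust
/// Splits `input` into the text before the first `#` and the text from `#`
/// onward. Returns `None` when there is no `#` at all.
fn split_at_hash(input: &str) -> Option<(&str, &str)> {
    // `find` returns a byte index that is always on a char boundary,
    // so `split_at` cannot panic here, even with emoji in the input.
    let idx = input.find('#')?;
    Some(input.split_at(idx))
}
```

For the test input above, the first slice is `"\n🌜\n"` and the second is `"#2F14DF\n🔅\n"`.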

    + + + What if we wanted to get better error reporting on what is happening? # + + +

    + +

    + +

    We can use the context combinator to provide better error reporting. It attaches a label to a parser, and that label +shows up in the error when the parser fails. However, when +using it, we need to:

    + +
      +
    1. Express the nom error type explicitly as a generic argument to the parser +functions. Use the nom::error::VerboseError type to get more detailed error +messages, which nom::error::convert_error then formats.
    2. +
    3. Pass the VerboseError type as a generic argument to each parser that uses the +context combinator.
    4. +
    + +

    Here’s an example of this.

    + +
    use nom::{
    +    bytes::complete::{tag, take_while_m_n},
    +    combinator::map_res,
    +    error::{context, convert_error},
    +    sequence::Tuple,
    +    IResult, Parser,
    +};
    +
    +/// `nom` is used to parse the hex digits from the string. Then
    +/// [u8::from_str_radix] is used to convert the hex string into a
    +/// number. Its signature allows a [core::num::ParseIntError], but
    +/// that never happens here, since only two valid hex digits reach
    +/// it. Note the use of [nom::error::VerboseError] to get more
    +/// detailed error messages that are passed to
    +/// [nom::error::convert_error].
    +pub fn parse_hex_seg(input: &str) -> IResult<
    +    &str,
    +    u8,
    +    nom::error::VerboseError<&str>
    +> {
    +    map_res(
    +        take_while_m_n::<_, &str, nom::error::VerboseError<_>>(
    +            2,
    +            2,
    +            |it| { it.is_ascii_hexdigit() }),
    +        |it| u8::from_str_radix(it, 16),
    +    )
    +    .parse(input)
    +}
    +
    +/// Note the use of [nom::error::VerboseError] to get more detailed
    +/// error messages that are passed to [nom::error::convert_error].
    +pub fn root(input: &str) -> IResult<
    +    &str,
    +    (&str, u8, u8, u8),
    +    nom::error::VerboseError<&str>
    +> {
    +    let (remainder, (_, red, green, blue)) = (
    +        context("start of hex color", tag("#")),
    +        context("hex seg 1", parse_hex_seg),
    +        context("hex seg 2", parse_hex_seg),
    +        context("hex seg 3", parse_hex_seg),
    +    )
    +        .parse(input)?;
    +
    +    Ok((remainder, ("", red, green, blue)))
    +}
    +
    + +

    This just sets up our code to use context, but we still have to format the error in a +human-readable way and print it to stdout. This is where convert_error comes in. Here’s +how you can use it.

    + +
    #[test]
    +fn test_root_1() {
    +    let input = "x#FF0000";
    +    let result = root(input);
    +    println!("{:?}", result);
    +    assert!(result.is_err());
    +
    +    match result {
    +        Err(nom::Err::Error(e)) | Err(nom::Err::Failure(e)) => {
    +            println!(
    +                "Could not parse because ... {}",
    +                convert_error(input, e)
    +            );
    +        }
    +        _ => { /* do nothing for nom::Err::Incomplete(_) */ }
    +    }
    +}
    +
    + +

    Here’s the output of the test.

    + +
    Err(Error(VerboseError { errors: [("x#FF0000", Nom(Tag)), ("x#FF0000", Context("start of hex color"))] }))
    +Could not parse because ... 0: at line 1, in Tag:
    +x#FF0000
    +^
    +
    +1: at line 1, in start of hex color:
    +x#FF0000
    +^
    +
    + +

    Here’s another test to see even more detailed error messages.

    + +
    #[test]
    +fn test_root_2() {
    +    let input = "#FF_000";
    +    let result = root(input);
    +    println!("{:?}", result);
    +    assert!(result.is_err());
    +
    +    match result {
    +        Err(nom::Err::Error(e)) | Err(nom::Err::Failure(e)) => {
    +            println!(
    +                "Could not parse because ... {}",
    +                convert_error(input, e)
    +            );
    +        }
    +        _ => { /* do nothing for nom::Err::Incomplete(_) */ }
    +    }
    +}
    +
    + +

    Here’s the output of this test.

    + +
    Err(Error(VerboseError { errors: [("_000", Nom(TakeWhileMN)), ("_000", Context("hex seg 2"))] }))
    +Could not parse because ... 0: at line 1, in TakeWhileMN:
    +#FF_000
    +   ^
    +
    +1: at line 1, in hex seg 2:
    +#FF_000
    +   ^
    +
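To see why the output lists two entries and where the caret comes from, here is a simplified std-only sketch of the idea. It is not nom's `VerboseError` or `convert_error`: `TraceError`, `with_context`, `hash`, and `render` are hypothetical names, and the rendering assumes a single-line input where each recorded slice is a suffix of the original input.

```rust
/// Simplified take on a verbose error: a stack of
/// `(remaining input at the failure, label)` entries, innermost first.
#[derive(Debug, PartialEq)]
struct TraceError<'a> {
    entries: Vec<(&'a str, String)>,
}

type Traced<'a, T> = Result<(&'a str, T), TraceError<'a>>;

/// Runs `parser`; on failure, pushes `label` on top of what the parser reported.
fn with_context<'a, T>(
    label: &str,
    parser: fn(&'a str) -> Traced<'a, T>,
    input: &'a str,
) -> Traced<'a, T> {
    parser(input).map_err(|mut err| {
        err.entries.push((input, label.to_string()));
        err
    })
}

/// Consumes a leading `#`, reporting a "Tag" entry when it is missing.
fn hash(input: &str) -> Traced<'_, ()> {
    match input.strip_prefix('#') {
        Some(rest) => Ok((rest, ())),
        None => Err(TraceError { entries: vec![(input, "Tag".to_string())] }),
    }
}

/// Renders each entry as the input line plus a caret under the failure column.
fn render(input: &str, err: &TraceError<'_>) -> String {
    let mut out = String::new();
    for (i, (remaining, label)) in err.entries.iter().enumerate() {
        // The failure position is how far the parser got before giving up.
        let offset = input.len() - remaining.len();
        let column = input[..offset].chars().count();
        out.push_str(&format!("{i}: in {label}:\n{input}\n{}^\n", " ".repeat(column)));
    }
    out
}
```

Each failing layer adds one entry, so wrapping `hash` in `with_context("start of hex color", ..)` produces a "Tag" entry followed by a "start of hex color" entry, both pointing at column 0 for the input `x#FF0000`.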

    @@ -741,6 +969,100 @@

    and all the other intermediate types (Fragment, Block, etc) & enums required for parsing.

+

+ + + Other examples # + + +

+ +

+ +
    +
  1. Simple CSS parser.
  2. +
  3. Simple natural language parser.
  4. +
+ + +

+ +
+

You can get the source code for the examples in this +repo.

+
+ +

TK: add video here

+

+ + + References # + + +

+ +

+ +

nom is a huge topic. This tutorial takes a hands-on approach to learning nom. However, the resources +listed below are very useful for learning nom. Think of them as a reference guide and deep dive into +how the nom library works.

+ + diff --git a/docs/authors/nazmulidris/index.html b/docs/authors/nazmulidris/index.html index bb0261b5..e9ac330b 100644 --- a/docs/authors/nazmulidris/index.html +++ b/docs/authors/nazmulidris/index.html @@ -14,11 +14,11 @@ - + +{"@context":"https://schema.org","@type":"BlogPosting","author":{"@type":"Person","name":"Nazmul Idris"},"dateModified":"2024-06-18T14:20:49-05:00","datePublished":"2024-06-18T14:20:49-05:00","description":"Nazmul is a software engineer focused on Rust, TUI, Web, and Android technologies.","headline":"Nazmulidris","mainEntityOfPage":{"@type":"WebPage","@id":"http://developerlife.com/authors/nazmulidris/"},"url":"http://developerlife.com/authors/nazmulidris/"} diff --git a/docs/feed.xml b/docs/feed.xml index 1ad8c268..fbea6756 100644 --- a/docs/feed.xml +++ b/docs/feed.xml @@ -1,4 +1,4 @@ -Jekyll2024-06-10T19:56:05-05:00http://developerlife.com/feed.xmldeveloperlife.comRust, TUI, Android, Web, Desktop, Cloud technologies, and UX engineering and design tutorials.Nazmul IdrisRust error handling with miette2024-06-10T10:00:00-05:002024-06-10T10:00:00-05:00http://developerlife.com/2024/06/10/rust-miette-error-handling

+Jekyll2024-06-18T14:20:49-05:00http://developerlife.com/feed.xmldeveloperlife.comRust, TUI, Android, Web, Desktop, Cloud technologies, and UX engineering and design tutorials.Nazmul IdrisRust error handling with miette2024-06-10T10:00:00-05:002024-06-10T10:00:00-05:00http://developerlife.com/2024/06/10/rust-miette-error-handling

diff --git a/docs/sitemap.xml b/docs/sitemap.xml index dbc0898a..a25eb7bb 100644 --- a/docs/sitemap.xml +++ b/docs/sitemap.xml @@ -2,11 +2,11 @@ http://developerlife.com/authors/nadiaidris/ -2024-06-10T19:56:05-05:00 +2024-06-18T14:20:49-05:00 http://developerlife.com/authors/nazmulidris/ -2024-06-10T19:56:05-05:00 +2024-06-18T14:20:49-05:00 http://developerlife.com/1998/12/01/xml-and-java-tutorial-part-1/