Implement new tokenizer #416
Hi @Enchufa2, the proposed changes are in the
Yes, exactly. Thank you very much.
In general it sounds like a smart strategy to keep it simple and fast. In most cases the new behaviour works just as before. However, there are some exotic cases where corrections are required. Consider the string. Anyway, after adding explicit brackets in the string this was fixed again. Please let me know when/if you plan to make this the definitive new behaviour. It will require more thorough sanitation on my behalf.
I am new to the library and have just started evaluating it for my use case. As such, I have no legacy code that will be impacted, and I will defer to the group. However, given the specific example of "oz/10 gal/1000 ft2", I think embracing the use of parens to disambiguate intent should be expected. For this specific example, piping directly into udunits at the command line shows that the underlying library behaves differently based on paren use, which is what I would expect to happen. It sounds like this may be a breaking change, but trying to guess user intent would likely cascade to a lot of edge cases that don't behave as one would expect and that do not match the underlying library. This assumes that the underlying library is the "source of truth".
Currently, the tokenizer switches to the denominator when a division is found. I don't have a strong opinion here. We could support this interpretation by switching only after a symbol, but not after a number. Changing the interpretation could be an option of the package, and if this is common enough, we could make it the default.
@pepijn-devries With the latest commit, strict tokenization can be enabled via an option, but it's off by default, meaning that numbers are effectively treated like prefixes:

units::as_units("oz/10 gal/1000 ft2")
#> 1 [oz/10/gal/1000/ft^2]
units::as_units("ml/min/1.73m^2")
#> 1 [ml/min/1.73/m^2]

Now the question is whether this non-strict numbers-as-prefixes mode should be propagated to the formatting, and therefore whether we should remove the
I don't have a strong opinion on what the default should be. But I would appreciate it if it can be set as an option. Is this already an option in the R package, or in the underpinning C library? In my application the strict tokenization would yield the highest success rate in unit conversions. However, in my application unit strings are not formatted consistently, and therefore it may not be a typical use case.
With the latest commit, formatting follows the

library(units)
#> udunits database from /usr/share/udunits/udunits2.xml
(u1 <- as_units("oz/10 gal/1000 ft2"))
#> 1 [oz/10gal/1000ft^2]
units(u1)
#> $numerator
#> [1] "oz"
#>
#> $denominator
#> [1] "10" "gal" "1000" "ft" "ft"
#>
#> attr(,"class")
#> [1] "symbolic_units"
(u2 <- as_units("ml/min/1.73m^2"))
#> 1 [ml/min/1.73m^2]
units(u2)
#> $numerator
#> [1] "ml"
#>
#> $denominator
#> [1] "min" "1.73" "m" "m"
#>
#> attr(,"class")
#> [1] "symbolic_units"
units_options(strict_tokenizer = TRUE)
(u1 <- as_units("oz/10 gal/1000 ft2"))
#> 1 [oz*gal*ft^2/10/1000]
units(u1)
#> $numerator
#> [1] "oz" "gal" "ft" "ft"
#>
#> $denominator
#> [1] "10" "1000"
#>
#> attr(,"class")
#> [1] "symbolic_units"
(u2 <- as_units("ml/min/1.73m^2"))
#> 1 [ml*m^2/min/1.73]
units(u2)
#> $numerator
#> [1] "ml" "m" "m"
#>
#> $denominator
#> [1] "min" "1.73"
#>
#> attr(,"class")
#> [1] "symbolic_units"
Thanks, I will use units_options to control its behaviour in my package. Since which version of
Sorry, I wasn't clear: this option is added with this new tokenizer, in this PR. In your initial test, the new tokenizer was too strict with your
@alwinw Adding you to this conversation because I found your package {epocakir} deals with
Ah, ok. But still, if I want to make use of this feature, I should first check which version of
Greater than the one currently on CRAN. :) Given the importance of the change, I will make it v1.0.
I've run a full revdep check and there's only one package failing with this PR ({epocakir} @alwinw, see the report), which is very good news. Conjuring up @billdenney too in case you have some time to test this.
I doubt that I will have time to test it soon. My challenges often come with math done on units. For example, will this handle something like
This works as before. The PR is about unit creation from the tokenization of the string you pass to
Codecov Report: ❌ Patch coverage is

Additional details and impacted files

@@           Coverage Diff           @@
##             main     #416   +/-   ##
=======================================
+ Coverage   91.21%   92.03%   +0.82%
=======================================
  Files          19       20       +1
  Lines        1070     1143      +73
=======================================
+ Hits          976     1052      +76
+ Misses         94       91       -3
I'm happy with the result. I'm planning to merge this by the end of the week. I'll be happy to receive further feedback, or otherwise I'll interpret administrative silence as agreement. ;-)
Thanks for this! I have not been able to test the code yet. However, for the input "ml/min/1.73m^2", the intent is to represent a glomerular filtration rate (ml/min) normalized to a "typical" body surface area (1.73 m^2). So the interpretation "ml/min/1.73/m^2" is unexpected, though I understand why this could be considered mathematically correct. It suffices for my use case if users can supply "ml/min/(1.73m^2)", and it is even more convenient (see elsewhere in this thread) if
Thanks for the heads up, I'll make an update for my package!
Closes #221, closes #383.
I would like to request feedback from everyone involved:
If you have time, it would be great if you could test this, maybe discover issues I didn't consider.
This is a complete rework of the unit construction code. Previously, unit strings were adapted with a bunch of regex so that they could be parsed into R expressions, which then would be evaluated in an environment populated with the individual units. In fact, R expressions are much more general than what's needed here, so I've decided to go in the other direction. Instead, expressions are converted to strings, and a much simpler and faster C++ tokenizer gets the job done, no regex required.
TL;DR, this tokenizer expects multiplications and divisions of numbers and unit names/symbols, with parentheses and integer exponents. Both numbers and symbols are treated as individual tokens, so that things like ml/min/1.73/m^2 work.