
Conversation

@Enchufa2 (Member) commented Sep 28, 2025

Closes #221, closes #383.

I would like to request feedback from everyone involved:

If you have time, it would be great if you could test this, maybe discover issues I didn't consider.

This is a complete rework of the unit construction code. Previously, unit strings were adapted with a bunch of regex so that they could be parsed into R expressions, which then would be evaluated in an environment populated with the individual units. In fact, R expressions are much more general than what's needed here, so I've decided to go in the other direction. Instead, expressions are converted to strings, and a much simpler and faster C++ tokenizer gets the job done, no regex required.

TL;DR, this tokenizer expects multiplications and divisions of numbers and unit names/symbols, with parentheses and integer exponents. Both numbers and symbols are treated as individual tokens, so that things like ml/min/1.73/m^2 work.
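The scan described above can be illustrated with a minimal, self-contained sketch (hypothetical code, not the actual C++ implementation in this PR): a single left-to-right pass that emits numbers, unit symbols, and the operators as individual tokens, with no regular expressions involved.

```cpp
#include <cctype>
#include <string>
#include <vector>

// Hypothetical sketch of the lexing step, NOT the actual units code:
// one left-to-right pass over the unit string, emitting numbers, unit
// symbols, and the operators * / ^ ( ) as individual tokens.
std::vector<std::string> lex(const std::string& s) {
    std::vector<std::string> tokens;
    std::size_t i = 0;
    while (i < s.size()) {
        unsigned char c = s[i];
        if (std::isspace(c)) { ++i; continue; }  // whitespace acts as multiplication
        if (c == '*' || c == '/' || c == '^' || c == '(' || c == ')') {
            tokens.push_back(std::string(1, c)); ++i; continue;
        }
        std::string tok;
        if (std::isdigit(c)) {  // number token, e.g. "1.73"
            while (i < s.size() && (std::isdigit((unsigned char)s[i]) || s[i] == '.'))
                tok += s[i++];
        } else {                // symbol token, e.g. "ml"
            while (i < s.size() && std::isalpha((unsigned char)s[i]))
                tok += s[i++];
            if (tok.empty()) tok += s[i++];  // unknown character: pass through
        }
        tokens.push_back(tok);
    }
    return tokens;
}
```

With this sketch, `lex("ml/min/1.73/m^2")` yields the nine tokens `ml / min / 1.73 / m ^ 2`, so the number and the following symbol remain separate tokens, as described.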

@pepijn-devries

Hi @Enchufa2 ,

The proposed changes are in the tokenizer branch right? I have some specific tests for unit conversions in my package, so I will test how it behaves with these modifications. I will report back here...

@Enchufa2 (Member, Author)

> The proposed changes are in the tokenizer branch right?

Yes, exactly. Thank you very much.

@pepijn-devries

In general it sounds like a smart strategy to keep it simple and fast. In most cases the new behaviour works just as before. However, there are some exotic cases where corrections are required.

Consider the string "oz/10 gal/1000 ft2". In the previous release of units this was translated as "oz/1000ft2/10gal", which is OK. The new implementation returns "oz*gal*ft^2/10/1000", which is absolutely not correct. Could we expect the units package to sanitise the input string to fix this, or should the user do it?

Anyway, after adding explicit brackets to the string, this was fixed again:
units::as_units("oz/(10gal)/(1000ft2)") returns [oz/10/gal/1000/ft^2]

Please let me know when/if you plan to make this the definitive new behaviour. It will require more thorough sanitation on my behalf.

@gitrdm commented Sep 28, 2025

I am new to the library and have just started evaluating it for my use case. As such, I have no legacy code that will be impacted and will defer to the group. However, given the specific example of "oz/10 gal/1000 ft2", I think expecting users to add parentheses to disambiguate intent is reasonable. For this specific example, piping directly into udunits at the command line gives:

bash> udunits2
You have: oz/10 gal/1000 ft2
You want: 
    2.74747095666e-12 m⁶·s⁻²
You have: oz/(10gal)/(1000ft2)
You want: 
    3.18326841080766e-06 s²
You have: 

So the underlying library behaves differently based on paren use, which is what I would expect. It sounds like this may be a breaking change, but trying to guess user intent would likely cascade into a lot of edge cases that don't behave as one would expect and that don't match the underlying library. This assumes that the underlying library is the source of truth.

@Enchufa2 (Member, Author) commented Sep 28, 2025

Currently, the tokenizer switches to the denominator when a / is found, and switches back to the numerator after a token is read, whether number or symbol. This is strictly correct mathematically speaking, and it is what udunits2 does, as @gitrdm points out. Your interpretation, instead, is that a numerical value is attached to the next symbol.

I don't have a strong opinion here. We could support this interpretation by switching only after a symbol, but not after a number. Changing the interpretation could be an option of the package, and if this is common enough, we could make it the default.
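The two interpretations described above can be sketched as follows (hypothetical code, not the actual implementation; parentheses and exponents are omitted for brevity). A denominator flag is set when / is read; in strict mode any subsequent token clears it, while in non-strict mode a number leaves it set, so the number effectively acts like a prefix of the next symbol.

```cpp
#include <cctype>
#include <string>
#include <vector>

// Hypothetical sketch of the numerator/denominator switching logic,
// NOT the actual units implementation. Parentheses and exponents are
// omitted for brevity.
struct Parsed {
    std::vector<std::string> num, den;  // numerator and denominator tokens
};

Parsed tokenize(const std::string& s, bool strict) {
    Parsed out;
    bool denom = false;  // are we currently below the division line?
    std::size_t i = 0;
    while (i < s.size()) {
        unsigned char c = s[i];
        if (std::isspace(c) || c == '*') { ++i; continue; }
        if (c == '/') { denom = true; ++i; continue; }
        std::string tok;
        bool is_number = std::isdigit(c) != 0;
        if (is_number) {
            while (i < s.size() && (std::isdigit((unsigned char)s[i]) || s[i] == '.'))
                tok += s[i++];
        } else {
            while (i < s.size() && std::isalpha((unsigned char)s[i]))
                tok += s[i++];
            if (tok.empty()) tok += s[i++];  // unknown character: pass through
        }
        (denom ? out.den : out.num).push_back(tok);
        // Strict mode: any token switches back to the numerator.
        // Non-strict mode: only a symbol does, so a number after '/'
        // effectively prefixes the next symbol.
        if (strict || !is_number) denom = false;
    }
    return out;
}
```

Under these rules, "oz/10 gal/1000 ft" parses as oz/(10·gal·1000·ft) in non-strict mode and as oz·gal·ft/(10·1000) in strict mode, matching the two behaviours discussed in this thread.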

@Enchufa2 (Member, Author)

@pepijn-devries With the latest commit, strict tokenization can be enabled via an option, but it's off by default, meaning that numbers are effectively treated like prefixes:

units::as_units("oz/10 gal/1000 ft2")
#> 1 [oz/10/gal/1000/ft^2]
units::as_units("ml/min/1.73m^2")
#> 1 [ml/min/1.73/m^2]

Now the question is whether this non-strict numbers-as-prefixes mode should be propagated to the formatting, and therefore whether we should remove the / between number and unit. What do you think?

@pepijn-devries

> @pepijn-devries With the latest commit, strict tokenization can be enabled via an option, but it's off by default, meaning that numbers are effectively treated like prefixes:
>
> units::as_units("oz/10 gal/1000 ft2")
> #> 1 [oz/10/gal/1000/ft^2]
> units::as_units("ml/min/1.73m^2")
> #> 1 [ml/min/1.73/m^2]
>
> Now the question is whether this non-strict numbers-as-prefixes mode should be propagated to the formatting, and therefore whether we should remove the / between number and unit. What do you think?

I don't have a strong opinion on what the default should be, but I would appreciate it if it can be set as an option. Is this already an option in the R package, or in the underlying C library? In my application the strict tokenization would yield the highest success rate in unit conversions. However, in my application unit strings are not formatted consistently, so it may not be a typical use case.

@Enchufa2 (Member, Author)

With the latest commit, formatting follows the strict_tokenizer option too (by default FALSE):

library(units)
#> udunits database from /usr/share/udunits/udunits2.xml

(u1 <- as_units("oz/10 gal/1000 ft2"))
#> 1 [oz/10gal/1000ft^2]
units(u1)
#> $numerator
#> [1] "oz"
#> 
#> $denominator
#> [1] "10"   "gal"  "1000" "ft"   "ft"  
#> 
#> attr(,"class")
#> [1] "symbolic_units"
(u2 <- as_units("ml/min/1.73m^2"))
#> 1 [ml/min/1.73m^2]
units(u2)
#> $numerator
#> [1] "ml"
#> 
#> $denominator
#> [1] "min"  "1.73" "m"    "m"   
#> 
#> attr(,"class")
#> [1] "symbolic_units"

units_options(strict_tokenizer = TRUE)
(u1 <- as_units("oz/10 gal/1000 ft2"))
#> 1 [oz*gal*ft^2/10/1000]
units(u1)
#> $numerator
#> [1] "oz"  "gal" "ft"  "ft" 
#> 
#> $denominator
#> [1] "10"   "1000"
#> 
#> attr(,"class")
#> [1] "symbolic_units"
(u2 <- as_units("ml/min/1.73m^2"))
#> 1 [ml*m^2/min/1.73]
units(u2)
#> $numerator
#> [1] "ml" "m"  "m" 
#> 
#> $denominator
#> [1] "min"  "1.73"
#> 
#> attr(,"class")
#> [1] "symbolic_units"

@pepijn-devries

Thanks, I will use units_options to control this behaviour in my package. From which version of units onward is it possible to specify this option? Then I can check which version the user has installed.

@Enchufa2 (Member, Author) commented Sep 28, 2025

Sorry, I wasn't clear: this option is added with the new tokenizer, in this PR. In your initial test, the new tokenizer was too strict with your oz/10 gal/1000 ft2 example, so now it parses these examples as you expected, but offers an option to make tokenization strict, as initially implemented.

@Enchufa2 (Member, Author)

@alwinw Adding you to this conversation because I found your package {epocakir} deals with ml/min/1.73m^2 measurements.

@pepijn-devries

> Sorry, I wasn't clear: this option is added with this new tokenizer, in this PR. In your initial test, the new tokenizer was too strict with your oz/10 gal/1000 ft2 example, so now it parses these examples as you expected, but has the option to make it more strict as initially implemented.

Ah OK. But still, if I want to make use of this feature, I should first check which version of units the user has installed, or make it a hard requirement in my DESCRIPTION file. Hence my question about the version number in which this feature becomes available.

@Enchufa2 (Member, Author)

Greater than the one currently on CRAN. :) Given the importance of the change, I will make it v1.0.

@Enchufa2 (Member, Author)

I've run a full revdep check and there's only one package failing with this PR ({epocakir} @alwinw, see the report), which is very good news.

Conjuring up @billdenney too in case you have some time to test this.

@billdenney (Contributor)

I doubt that I will have time to test it soon. My challenges often come with math done on units. For example, will this handle something like log(set_units(1, "kg/m^2"))?

@Enchufa2 (Member, Author)

> I doubt that I will have time to test it soon. My challenges often come with math done on units. For example, will this handle something like log(set_units(1, "kg/m^2"))?

This works as before. The PR is about unit creation from the tokenization of the string you pass to set_units. It doesn't rely on regex anymore. If you don't have any specific issues with this, don't worry, it should not change your workflows at all. Thanks!

codecov bot commented Sep 29, 2025

Codecov Report

❌ Patch coverage is 99.24812% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 92.03%. Comparing base (907dda2) to head (455450a).
⚠️ Report is 1 commit behind head on main.

Files with missing lines Patch % Lines
R/make_units.R 97.22% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #416      +/-   ##
==========================================
+ Coverage   91.21%   92.03%   +0.82%     
==========================================
  Files          19       20       +1     
  Lines        1070     1143      +73     
==========================================
+ Hits          976     1052      +76     
+ Misses         94       91       -3     


@Enchufa2 (Member, Author)

I'm happy with the result and I'm planning to merge this by the end of the week. I'll be happy to receive further feedback; otherwise I'll interpret administrative silence as agreement. ;-)

@Enchufa2 Enchufa2 force-pushed the tokenizer branch 4 times, most recently from 249bea9 to aa7811c Compare September 29, 2025 16:58
@bergsmat

> Closes #221, closes #383.
>
> I would like to request feedback from everyone involved:
>
> If you have time, it would be great if you could test this, maybe discover issues I didn't consider.
>
> This is a complete rework of the unit construction code. Previously, unit strings were adapted with a bunch of regex so that they could be parsed into R expressions, which then would be evaluated in an environment populated with the individual units. In fact, R expressions are much more general than what's needed here, so I've decided to go in the other direction. Instead, expressions are converted to strings, and a much simpler and faster C++ tokenizer gets the job done, no regex required.
>
> TL;DR, this tokenizer expects multiplications and divisions of numbers and unit names/symbols, with parentheses and integer exponents. Both numbers and symbols are treated as individual tokens, so that things like ml/min/1.73/m^2 work.

Thanks for this! I have not been able to test the code yet. However, for the input "ml/min/1.73m^2", the intent is to represent a glomerular filtration rate (ml/min) normalized to a "typical" body surface area (1.73 m^2). So the interpretation "ml/min/1.73/m^2" is unexpected, though I understand why it could be considered mathematically correct. It suffices for my use case if users can supply "ml/min/(1.73m^2)", and it is even more convenient (see elsewhere in this thread) if strict_tokenizer can be set to FALSE, without prejudice as to what the default should be. Obviously a default of FALSE is most convenient, but I leave that to your discretion. Again, thanks, and I will try to test soon.

@Enchufa2 Enchufa2 merged commit ce00d33 into main Oct 3, 2025
30 checks passed
@Enchufa2 Enchufa2 deleted the tokenizer branch October 3, 2025 07:57
@alwinw commented Oct 4, 2025

> I've run a full revdep check and there's only one package failing with this PR ({epocakir} @alwinw, see the report), which is very good news.
>
> Conjuring up @billdenney too in case you have some time to test this.

Thanks for the heads-up, I'll update my package!


Successfully merging this pull request may close these issues:

- Wrong parsing of numeric values as part of units
- squared unit behaves differently in numerator and denominator