Implement new tokenizer #416
Hi @Enchufa2, the proposed changes are in the
Yes, exactly. Thank you very much.
In general it sounds like a smart strategy to keep it simple and fast. In most cases the new behaviour works just as before. However, there are some exotic cases where corrections are required. Consider the string. Anyway, after adding explicit brackets in the string this was fixed again. Please let me know when/if you plan to make this the definitive new behaviour. It will require more thorough sanitation on my behalf.
I am new to the library and have just started evaluating it for my use case. As such, I have no legacy code that will be impacted, and I will defer to the group. However, given the specific example of "oz/10 gal/1000 ft2", I think embracing the use of parens to disambiguate intent should be expected. For this specific example, piping directly into udunits at the command line shows that the underlying library behaves differently based on paren use, which is what I would expect to happen. It sounds like this may be a breaking change, but trying to guess user intent would likely cascade to a lot of edge cases that don't behave as one would expect and that do not match the underlying library. This assumes that the underlying library is the "source of truth".
Currently, the tokenizer switches to the denominator when a division is found. I don't have a strong opinion here. We could support this interpretation by switching only after a symbol, but not after a number. Changing the interpretation could be an option of the package, and if this is common enough, we could make it the default.
@pepijn-devries With the latest commit, strict tokenization can be enabled via an option, but it's off by default, meaning that numbers are effectively treated like prefixes:

units::as_units("oz/10 gal/1000 ft2")
#> 1 [oz/10/gal/1000/ft^2]
units::as_units("ml/min/1.73m^2")
#> 1 [ml/min/1.73/m^2]

Now the question is whether this non-strict numbers-as-prefixes mode should be propagated to the formatting, and therefore whether we should remove the
I don't have a strong opinion on what the default should be. But I would appreciate it if it can be set as an option. Is this already an option in the R package, or in the underpinning C library? In my application the strict tokenization would yield the highest success rate in unit conversions. However, in my application unit strings are not formatted consistently, and therefore it may not be a typical use case.
With the latest commit, formatting follows the

library(units)
#> udunits database from /usr/share/udunits/udunits2.xml
(u1 <- as_units("oz/10 gal/1000 ft2"))
#> 1 [oz/10gal/1000ft^2]
units(u1)
#> $numerator
#> [1] "oz"
#>
#> $denominator
#> [1] "10" "gal" "1000" "ft" "ft"
#>
#> attr(,"class")
#> [1] "symbolic_units"
(u2 <- as_units("ml/min/1.73m^2"))
#> 1 [ml/min/1.73m^2]
units(u2)
#> $numerator
#> [1] "ml"
#>
#> $denominator
#> [1] "min" "1.73" "m" "m"
#>
#> attr(,"class")
#> [1] "symbolic_units"
units_options(strict_tokenizer = TRUE)
(u1 <- as_units("oz/10 gal/1000 ft2"))
#> 1 [oz*gal*ft^2/10/1000]
units(u1)
#> $numerator
#> [1] "oz" "gal" "ft" "ft"
#>
#> $denominator
#> [1] "10" "1000"
#>
#> attr(,"class")
#> [1] "symbolic_units"
(u2 <- as_units("ml/min/1.73m^2"))
#> 1 [ml*m^2/min/1.73]
units(u2)
#> $numerator
#> [1] "ml" "m" "m"
#>
#> $denominator
#> [1] "min" "1.73"
#>
#> attr(,"class")
#> [1] "symbolic_units"
Thanks, I will use units_options to control its behaviour in my package. Since which version of
Sorry, I wasn't clear: this option is added with this new tokenizer, in this PR. In your initial test, the new tokenizer was too strict with your
@alwinw Adding you to this conversation because I found your package {epocakir} deals with
Ah, ok. But still, if I want to make use of this feature, I should first check which version of
Greater than the one currently on CRAN. :) Given the importance of the change, I will make it v1.0.
I've run a full revdep check and there's only one package failing with this PR ({epocakir} @alwinw, see the report), which is very good news. Conjuring up @billdenney too in case you have some time to test this.
I doubt that I will have time to test it soon. My challenges often come with math done on units. For example, will this handle something like
This works as before. The PR is about unit creation from the tokenization of the string you pass to
Codecov Report: ❌ Patch coverage is

Additional details and impacted files

@@           Coverage Diff           @@
##             main     #416   +/-   ##
=======================================
+ Coverage   91.21%   92.03%   +0.82%
=======================================
  Files          19       20       +1
  Lines        1070     1143      +73
=======================================
+ Hits          976     1052      +76
+ Misses         94       91       -3
I'm happy with the result. I'm planning to merge this by the end of the week. I'll be happy to receive further feedback, or otherwise I'll interpret administrative silence as agreement. ;-)
Thanks for this! I have not been able to test the code yet. However, for the input "ml/min/1.73m^2", the intent is to represent a glomerular filtration rate (ml/min) normalized to a "typical" body surface area (1.73 m^2). So the interpretation "ml/min/1.73/m^2" is unexpected, though I understand why this could be considered mathematically correct. It suffices for my use case if users can supply "ml/min/(1.73m^2)", and it is even more convenient (see elsewhere in this thread) if
Thanks for the heads up, I'll make an update for my package!
Closes #221, closes #383.
I would like to request feedback from everyone involved:
If you have time, it would be great if you could test this, maybe discover issues I didn't consider.
This is a complete rework of the unit construction code. Previously, unit strings were adapted with a bunch of regex so that they could be parsed into R expressions, which then would be evaluated in an environment populated with the individual units. In fact, R expressions are much more general than what's needed here, so I've decided to go in the other direction. Instead, expressions are converted to strings, and a much simpler and faster C++ tokenizer gets the job done, no regex required.
TL;DR, this tokenizer expects multiplications and divisions of numbers and unit names/symbols, with parentheses and integer exponents. Both numbers and symbols are treated as individual tokens, so that things like ml/min/1.73/m^2 work.