Implement floats #65

ruuda · 2024-08-10T22:21:33Z

This is a work in progress, I’ll update the description later.

TODO: Write something about the implementation (decimal, rational, ...)
TODO: Write something about the types (Float, Num)

Find "json superset" caveats and update them — it’s finally practically true.
Ensure the fuzzer discovers new interesting code paths and that dictionaries are up to date.
Ensure test coverage.
Ensure grammars are up to date, for types as well as syntax.
Ensure documentation is up to date.

Open questions

How should (in)equality between numbers interact with their identity? If we allow 10 < 11.0 to typecheck and produce the result you’d expect, then we should also allow 10 == 10.0, and {10, 10.0}. But what should they produce? Numerically they are equal, but as runtime::Value, they are different values. I see two possible ways forward:

Erode the difference between Int and Float further. Possibly only keep a Num type after all and everything is a float. 10 and 10.0 are equal, {10, 10.0} == {10}. This probably means that we print all numbers that are ints without suffix, which means that processing json is no longer “lossless” in the sense that we'd convert [10.0, 10] into [10, 10]. That sounds like something that will cause problems, hmmm ...
Do not allow Int and Float to mix so implicitly. 10 < 11.0 is a type error. 10 == 10.0 is false for the same reason that 10 == "10" is false. (Which makes sense and is consistent in one way, but may be surprising in another.) {10, 10.0} is a set with two elements.

I’m leaning towards the second one, but on the other hand, am I making my own life difficult? Why have both Int and Float? I think primarily to add additional type-safety in schemas (for things like ports, or counts, non-integers make no sense). But if that’s the argument, then should there also be UInt? UInt16? Fin n? ...
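To make the second option concrete, here is a minimal Rust sketch. The `Value` enum and the `check_comparable` and `distinct` helpers are hypothetical stand-ins invented for illustration; they are not the actual runtime::Value. The float is stored as a mantissa and exponent only so that the enum can derive `Eq` and `Ord`.

```rust
use std::collections::BTreeSet;

/// Hypothetical stand-in for `runtime::Value`, reduced to two variants.
/// The derived ordering is structural; it only exists so the values can
/// be put in a `BTreeSet`, it is not a numeric ordering.
#[derive(Debug, Clone, PartialEq, Eq, PartialOrd, Ord)]
enum Value {
    Int(i64),
    /// A decimal `mantissa * 10^exponent`, so 10.0 is `Float(100, -1)`.
    Float(i64, i16),
}

/// Under option 2, ordering comparisons between Int and Float are rejected.
fn check_comparable(a: &Value, b: &Value) -> Result<(), String> {
    match (a, b) {
        (Value::Int(_), Value::Int(_)) | (Value::Float(..), Value::Float(..)) => Ok(()),
        _ => Err("type error: cannot compare Int and Float".to_string()),
    }
}

/// Collect values into a set; Int and Float variants never collapse.
fn distinct(values: &[Value]) -> BTreeSet<Value> {
    values.iter().cloned().collect()
}
```

With these definitions, `Value::Int(10) == Value::Float(100, -1)` is false for the same reason that an Int never equals a String, and `distinct` of the two yields a set with two elements.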

What does Python do here? It has different runtime representations, but considers the values equal, and for sets the first value is preserved:

>>> {10, 10.0}
{10}
>>> {10.0, 10}
{10.0}

Cue treats 10.0 as not an int, even though it is equal to the integer 10:

$ cue eval -
x: int
x: 10.0
x: conflicting values int and 10.0 (mismatched types int and float):
    -:1:4
    -:2:4

$ cue eval -
10.0 == 10
true

Thinking out loud, a possible solution may be:

There are only two numeric types: Int and Float (or maybe Float should be called Num), and Int is a subtype of the more general one.
There is only one runtime representation, all numbers are implemented as Decimal.
Decimal should preserve some presentation information that does not affect its identity, in particular the number of decimals. This way 10.0 and 10 can both be represented, and preserved exactly, for lossless json processing.
For runtime checks, if we have a Num and expect an Int, we need to do more than just a type check; we should also change the presentation attribute to render it with zero decimals. This feels ad-hoc though. An alternative is to do what Cue does and verify the presentation attribute as part of the type check.
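The idea of a single representation that keeps presentation information out of its identity can be sketched as follows. This is an illustration under assumptions, not the implementation in this branch: the field layout, `canonical`, and `is_int_presentation` are hypothetical names, and the negative exponent doubles as the number of decimals to print.

```rust
use std::fmt;

/// Hypothetical decimal: the value is `mantissa * 10^exponent`.
#[derive(Debug, Clone, Copy)]
struct Decimal {
    mantissa: i64,
    exponent: i16,
}

impl Decimal {
    /// Strip trailing zeros so that 10.0 and 10 map to the same pair.
    fn canonical(self) -> (i64, i32) {
        if self.mantissa == 0 {
            return (0, 0);
        }
        let (mut m, mut e) = (self.mantissa, self.exponent as i32);
        while m % 10 == 0 {
            m /= 10;
            e += 1;
        }
        (m, e)
    }

    /// Cue-style check: does the presentation have zero decimals?
    fn is_int_presentation(self) -> bool {
        self.exponent >= 0
    }
}

/// Identity ignores the presentation: only the canonical form is compared.
impl PartialEq for Decimal {
    fn eq(&self, other: &Decimal) -> bool {
        self.canonical() == other.canonical()
    }
}

impl Eq for Decimal {}

/// Printing uses the stored exponent, so the number of decimals survives.
impl fmt::Display for Decimal {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        let sign = if self.mantissa < 0 { "-" } else { "" };
        let abs = self.mantissa.unsigned_abs();
        if self.exponent >= 0 {
            return write!(f, "{}{}{}", sign, abs, "0".repeat(self.exponent as usize));
        }
        let decimals = (-(self.exponent as i32)) as usize;
        // Left-pad with zeros so there is at least one digit before the point.
        let digits = format!("{:0>width$}", abs, width = decimals + 1);
        let (int_part, frac_part) = digits.split_at(digits.len() - decimals);
        write!(f, "{}{}.{}", sign, int_part, frac_part)
    }
}
```

Here `Decimal { mantissa: 100, exponent: -1 }` and `Decimal { mantissa: 10, exponent: 0 }` compare equal, but they still print as 10.0 and 10 respectively, and only the latter passes the presentation check.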

The one thing that prevents that right now is floats, and the fuzzer discovered it within a few seconds:

    ╭──────╴ Opcode (hex)
    │  ╭───╴ Argument (hex)
    │  │  ╭╴ Operation, argument (decimal)
    26 03 ExprPushInput, 3
          take_str, 3 → "4e2"
    e6 01 ModeJsonSuperset, 1
          EvalJsonSuperset
    --> 4e2

Let's finally add the float type. After some deliberation, I think I want to represent them as decimals in scientific notation internally, unless you do a division at which point they will turn into a rational. I can turn the rational back into a decimal (which may be lossy) in order to format it as a string; the rational itself is an implementation detail, but it does allow us to get certain computations exactly right where a float would not.
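As an illustration of why a rational helps after a division, here is a small sketch. All names (`Rational`, `decimal_to_rational`, `div`, `mul`) are hypothetical, and the overflow handling is simplified: anything that does not fit in an i64 yields `None`, and extreme values such as `i64::MIN` are not handled.

```rust
/// Greatest common divisor, always non-negative.
fn gcd(a: i64, b: i64) -> i64 {
    if b == 0 { a.abs() } else { gcd(b, a % b) }
}

/// Hypothetical exact fraction, kept reduced with a positive denominator.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Rational {
    numer: i64,
    denom: i64,
}

impl Rational {
    fn new(numer: i64, denom: i64) -> Option<Rational> {
        if denom == 0 {
            return None;
        }
        let g = gcd(numer, denom);
        let sign = if denom < 0 { -1 } else { 1 };
        Some(Rational { numer: sign * numer / g, denom: sign * denom / g })
    }
}

/// A decimal `mantissa * 10^exponent` is a rational whose denominator is a
/// power of 10. Returns None if the power of 10 does not fit in an i64.
fn decimal_to_rational(mantissa: i64, exponent: i16) -> Option<Rational> {
    let pow = 10_i64.checked_pow(exponent.unsigned_abs() as u32)?;
    if exponent >= 0 {
        Rational::new(mantissa.checked_mul(pow)?, 1)
    } else {
        Rational::new(mantissa, pow)
    }
}

/// Division is where a decimal stops being representable exactly.
fn div(a: Rational, b: Rational) -> Option<Rational> {
    Rational::new(a.numer.checked_mul(b.denom)?, a.denom.checked_mul(b.numer)?)
}

fn mul(a: Rational, b: Rational) -> Option<Rational> {
    Rational::new(a.numer.checked_mul(b.numer)?, a.denom.checked_mul(b.denom)?)
}
```

In this sketch 1.0 / 3 is stored as exactly 1/3, so multiplying by 3 again gives exactly 1, which a finite decimal expansion of the intermediate result could not guarantee.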

Why the naming discrepancy? My thinking is that I want both Decimal and Rational as internal representation, so Value can be decimal or rational. But what is a good name for "decimal or rational"? I can think of "number", but I want to reserve that for the supertype of int and this one. "Rational" would be technically correct because decimals are also rationals, just with a power of 10 denominator. But then what if I put a sqrt function in there, or an approximation of pi? Sure technically they are approximations, but I think this would _also_ be weird. So let's go with float ... Also, this is what Cue calls them so there is precedent.

The choice I went with is to have a 16-bit exponent, which gives RCL's float/decimal type more range than a regular f64. Now the fuzzer can generate an input with a large exponent, and RCL will happily echo it, and it's technically syntactically valid json, but Serde rejects it with "number out of range" (in the same way that RCL rejects some numbers as overflow). So add an exception for this mismatch.
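The range mismatch can be demonstrated with the standard library alone, without Serde. The sketch below uses a hypothetical helper, `fits_in_f64`, to classify a literal; a fuzzer exception along these lines would skip inputs that a 16-bit exponent can hold but an f64 cannot.

```rust
/// Largest decimal exponent a 16-bit signed exponent can store.
const MAX_DECIMAL_EXP: i32 = i16::MAX as i32;

/// Largest decimal exponent for which an f64 is still finite.
const MAX_F64_EXP: i32 = f64::MAX_10_EXP;

/// Hypothetical helper: true if the literal parses to a finite f64.
/// Rust's f64 parser maps out-of-range literals to infinity, so both a
/// parse error and an infinite result count as "does not fit".
fn fits_in_f64(literal: &str) -> bool {
    literal.parse::<f64>().map(|x| x.is_finite()).unwrap_or(false)
}
```

For example, 1e400 has an exponent far below `MAX_DECIMAL_EXP` but above `MAX_F64_EXP`, so `fits_in_f64("1e400")` is false even though a decimal with an i16 exponent can echo it unchanged.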

This removes one case of incompatibility with Serde. If you write a float literal that is too precise to be represented exactly, we now silently round it rather than treating it as an overflow error. I think this is acceptable: if you care about numbers to 19 significant digits, then RCL is probably not the best tool for what you are doing. The more likely case is that we encounter some arbitrary json that we want to query with "rcl jq", and it happens to have some humongous float in it. Python handles float literals in this way too, so I think it's okay.
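A rough sketch of what accepting over-precise literals can look like is shown below. It is an assumption-laden illustration, not the parser in this branch: `parse_lossy` is a hypothetical name, it handles only digits and a single decimal point, and it truncates the excess digits instead of rounding them, to keep the example short.

```rust
/// Parse an unsigned literal such as "123.456" into (mantissa, exponent),
/// meaning `mantissa * 10^exponent`. Digits that no longer fit in the i64
/// mantissa are dropped instead of causing an overflow error.
fn parse_lossy(text: &str) -> Option<(i64, i32)> {
    let mut mantissa: i64 = 0;
    let mut exponent: i32 = 0;
    let mut seen_point = false;
    let mut saturated = false;
    let mut any_digit = false;

    for ch in text.chars() {
        match ch {
            '.' if !seen_point => seen_point = true,
            '0'..='9' => {
                any_digit = true;
                let digit = i64::from(ch.to_digit(10)?);
                let next = if saturated {
                    None
                } else {
                    mantissa.checked_mul(10).and_then(|m| m.checked_add(digit))
                };
                match next {
                    // The digit fits; a fractional digit lowers the exponent.
                    Some(m) => {
                        mantissa = m;
                        if seen_point {
                            exponent -= 1;
                        }
                    }
                    // The digit does not fit. A dropped integer digit still
                    // scales the value by 10; a dropped fractional digit
                    // only loses precision.
                    None => {
                        saturated = true;
                        if !seen_point {
                            exponent += 1;
                        }
                    }
                }
            }
            _ => return None,
        }
    }

    if any_digit { Some((mantissa, exponent)) } else { None }
}
```

With this policy the literal 9223372036854775807.576460752303423487 parses to the mantissa 9223372036854775807 with exponent 0, losing the fractional digits rather than failing.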

This adds back the exception that was removed by allowing float parsing imprecision, though in a more limited form initially because it only affected exponents. But after running the fuzzer for a bit longer, it also affects large integers, so we are back to the start, overflow is just an intentional incompatibility.

ruuda force-pushed the type-float branch from e0f60ca to 21c23c5 Compare August 10, 2024 22:39

ruuda added 22 commits August 24, 2024 12:44

Extract decimal parsing into function, add test

f0864b9

It doesn't add functionality to deal with exponents or decimal points, it only moves stuff into a function. The test currently has the wrong expectation with an error in there.

Add a parser for decimal floats

53bacf9

Implement Eq for Decimal

8de3113

Rename 'float' module to 'decimal'

6d3b6e7

It contains the Decimal type after all, and right now I don't have anything named "float" either way.

Relax Serde json fuzzer slightly

a4b6f50

RCL rejects 9223372036854775807.576460752303423487 with an overflow error, but I think that is fine, I don't want to lose precision on inputs.

Add a Num supertype above Int and Float

61b9c88

This adds the type and the relations, but it doesn't add all the tests, fuzzer dictionary, documentation, etc. I'll do all of that later as part of this same feature branch.

Implement negation for floats

93efb20

This should get us one step closer to json compatibility, maybe even the final step.

Sort type names alphabetically in matches

cd20842

Now that Rustfmt formatted them tall anyway, it's probably best to keep them sorted.

Add golden tests for negation unop type errors

e5f1538

Implement Decimal::cmp

24a25a2

This is tricky stuff. Better write a lot of fuzz tests for this later.

Update Vega example now that floats are supported

45bcee1

Handle negative numbers in decimal printing

5c6bdac

Ugh, this is such a rabbit hole, and it looks like I am filling it with special cases. I have a feeling it could be way more elegant.

Handle pow10 overflow in Decimal comparison

3389542

This was a to do, the fuzzer found it pretty quickly. The exact case it found that triggered it was this: {5.001,5.001e97,}
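The kind of overflow involved can be sketched as follows. This is an illustrative reconstruction under assumptions, not the actual Decimal::cmp: the struct layout and the names `cmp_decimal` and `cmp_magnitude` are hypothetical. To compare two decimals, the one with the larger exponent is scaled by a power of 10; if that scaling overflows, the scaled magnitude is necessarily the larger one, so no error is needed.

```rust
use std::cmp::Ordering;

/// Hypothetical decimal: the value is `mantissa * 10^exponent`.
#[derive(Debug, Clone, Copy)]
struct Decimal {
    mantissa: i64,
    exponent: i16,
}

/// Compare `ma * 10^ea` with `mb * 10^eb`, for nonzero magnitudes.
fn cmp_magnitude(ma: u64, ea: i16, mb: u64, eb: i16) -> Ordering {
    if ea < eb {
        return cmp_magnitude(mb, eb, ma, ea).reverse();
    }
    // Bring `a` to the exponent of `b` by scaling its mantissa up.
    let diff = (ea as i32 - eb as i32) as u32;
    let scaled = 10_u64.checked_pow(diff).and_then(|p| ma.checked_mul(p));
    match scaled {
        Some(m) => m.cmp(&mb),
        // The scaled mantissa exceeds u64::MAX, so it exceeds `mb` too.
        None => Ordering::Greater,
    }
}

fn cmp_decimal(a: Decimal, b: Decimal) -> Ordering {
    let (sa, sb) = (a.mantissa.signum(), b.mantissa.signum());
    // Different signs, or both zero: the signs alone decide.
    if sa != sb || sa == 0 {
        return sa.cmp(&sb);
    }
    let ord = cmp_magnitude(
        a.mantissa.unsigned_abs(),
        a.exponent,
        b.mantissa.unsigned_abs(),
        b.exponent,
    );
    // For two negative numbers, the larger magnitude is the smaller value.
    if sa < 0 { ord.reverse() } else { ord }
}
```

For the fuzzer input above, 5.001 is `Decimal { mantissa: 5001, exponent: -3 }` and 5.001e97 is `Decimal { mantissa: 5001, exponent: 94 }`; scaling by 10^97 overflows, and the sketch reports the first as the smaller one.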

Add float parsing exception to TOML fuzzer

7892b7c

RCL can handle larger exponents on floats, we have to admit that then.

Add surrogate pair exception to json superset fuzzer

13ca286

Surrogate pairs are not supported by RCL on purpose, so when that can be parsed by Serde but is rejected by RCL, we shouldn't fail the fuzzer on it.

Implement float conversion for Python

ebe8923

ruuda force-pushed the type-float branch from f2fb87b to ebe8923 Compare August 24, 2024 11:54

ruuda added 6 commits August 31, 2024 13:12

Add golden for overflow in Decimal::cmp

0a39375

This overflow was discovered by the fuzz_source fuzzer. The test currently fails.

Fix overflow in Decimal::cmp

fc47b45

See also the parent commit that adds the test. The test now passes.

Make the Decimal tests more readable

4af0532

Add one more test for Decimal::cmp overflow

df4d242

The golden test doesn't hurt, but this is one of the few places where we can do a reasonable unit test as well.

Add a fuzzer to test various Decimal properties

8f56688

This is only the start, but let's verify Decimal::cmp against f64::cmp. It instantly finds an input where they disagree:

    Compare {
        a: NormalF64(
            -0.16406250000007813,
        ),
        b: NormalF64(
            0.0,
        ),
    }

Make Decimal::cmp work for negative numerators

113b2f7

It turns out that the case that the fuzzer found was already one that my past self marked as to do.

Extend Decimal fuzzer to cover more cases

19c83a2
