Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix xs:float text number parse/unparse edge cases #1133

Merged
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
Expand Up @@ -17,10 +17,13 @@

package org.apache.daffodil.unparsers.runtime1

import java.lang.{ Float => JFloat }
import java.lang.{ Number => JNumber }
import java.math.{ BigDecimal => JBigDecimal }

import org.apache.daffodil.lib.util.Maybe
import org.apache.daffodil.lib.util.Numbers.isZero
import org.apache.daffodil.runtime1.dpath.NodeInfo
import org.apache.daffodil.runtime1.processors.TextNumberFormatEv
import org.apache.daffodil.runtime1.processors._
import org.apache.daffodil.runtime1.processors.parsers.TextDecimalVirtualPointMixin
Expand Down Expand Up @@ -86,7 +89,22 @@ case class ConvertTextNumberUnparser(
case n: JNumber if zeroRep.isDefined && isZero(n) => zeroRep.get
case _ =>
try {
df.format(scaledValue)
if (
(context.optPrimType.get eq NodeInfo.Float) &&
JFloat.isFinite(scaledValue.asInstanceOf[Float])
) {
// ICU4J has has no format() function that handles a float. Instead, ICU4J converts
// the float to a double, and formats that value. However, formatting a double as a
// string has different precision than formatting a float as a string, resulting in
// extra precision that implies more accuracy than actually exists for floats, and
// can also lead to failures to exactly round. To solve this, we convert the float
// to a String, convert that String to a BigDecimal, and then format that. This is
// more expensive, but ensures we unparse the exact same precision as represented by
// the float.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't get it. Does constructing BigDecimal(string) work if the string uses exponential notation as in "3.4E38" ?

If so why didn't they provide a constructor that takes a float?

I guess it has to work, or your tests would be failing.

We could bypass all of this if the float is an integer because none of this is required if it's an integer. The issue with double conversion only matters if there are real non-zero fraction digits. That said, I think the only way to tell if a float is an integer is to round it to one and see if it is equal.

Nah. Not worth it. Plus I am not sure the interactions with the number pattern "@" character stuff, etc.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does constructing BigDecimal(string) work if the string uses exponential notation as in "3.4E38" ?

Yep, here's the grammar it supports: https://docs.oracle.com/javase/8/docs/api/java/math/BigDecimal.html#BigDecimal-java.lang.String-

If so why didn't they provide a constructor that takes a float?

ICU "natively" supports BigInteger, BigDecimal, long, and double. Any Number that isn't one of those is converted to a double (via Number.doubleValue) and formats it using the double logic:

https://github.com/unicode-org/icu/blob/main/icu4j/main/core/src/main/java/com/ibm/icu/text/NumberFormat.java#L269-L289

This is fine for int/short/byte since they can all be exactly represented by a double. And it means float is supported, by when a float is converted to a double, more decimal places are sometimes needed when converted to a string. This Java and not ICU, but it shows the same issue more clearly:

scala> Float.MaxValue.toString
val res0: String = 3.4028235E38

scala> Float.MaxValue.doubleValue.toString
val res1: String = 3.4028234663852886E38

So ICU's logic of converting to a double is maybe questionable--you can't just convert a float to a double and expect to get the same string. Note that if you then parse that double string as a float you should get the original float back, so the thing in the infoset should come out the same, but the data doesn't round trip.

I couldn't find a way to convert a float to a double in a way that would result in the same string value (without doing what I did), though it does seem like a better algorithm should exist.

We could bypass all of this if the float is an integer...Nah. Not worth it...

Also, if a user is expecting floats, it probably is less likely integers to appear in the data/infoset, so it might be an optimization that isn't hit that often.

Probably the best thing to do is suggest that if you're dealing with text numbers containing decimals, then just always use xs:double--its much less likely to have precision issues and it avoids this hack and should be more performant.

df.format(new JBigDecimal(scaledValue.toString))
} else {
df.format(scaledValue)
}
} catch {
case e: java.lang.ArithmeticException =>
UE(state, "Unable to format number to pattern: %s", e.getMessage())
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -651,13 +651,19 @@ object NodeInfo extends Enum {
trait PrimNumericFloat extends PrimNumeric { self: Numeric.Kind =>
def min: Double
def max: Double
private lazy val minBD = new JBigDecimal(min)
private lazy val maxBD = new JBigDecimal(max)
def minStr: String
def maxStr: String
private lazy val minBD = new JBigDecimal(minStr)
private lazy val maxBD = new JBigDecimal(maxStr)

def isValid(n: java.lang.Number): Boolean = n match {
case bd: JBigDecimal => {
bd.compareTo(minBD) >= 0 && bd.compareTo(maxBD) <= 0
}
case bi: JBigInt => {
val bd = new JBigDecimal(bi)
bd.compareTo(minBD) >= 0 && bd.compareTo(maxBD) <= 0
}
case _ => {
val d = n.doubleValue
(d.isNaN || d.isInfinite) || (d >= min && d <= max)
Expand All @@ -681,6 +687,10 @@ object NodeInfo extends Enum {
protected override def fromNumberNoCheck(n: Number): DataValueFloat = n.floatValue
override val min = -JFloat.MAX_VALUE.doubleValue
override val max = JFloat.MAX_VALUE.doubleValue
// we cannot use min/max.toString for minStr/maxStr because min/max are doubles so
// toString would have a precision different than Float.MaxValue.toString
override val minStr = "-" + JFloat.MAX_VALUE.toString
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See comment below about minStr

override val maxStr = JFloat.MAX_VALUE.toString
override val width: MaybeInt = MaybeInt(32)
}

Expand All @@ -698,6 +708,8 @@ object NodeInfo extends Enum {
protected override def fromNumberNoCheck(n: Number): DataValueDouble = n.doubleValue
override val min = -JDouble.MAX_VALUE
override val max = JDouble.MAX_VALUE
override val minStr = "-" + JDouble.MAX_VALUE.toString
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason these aren't just min.toString and max.toString? The fact that they are not suggests to me something funny is going on that requires us not to call toString on the min because it's negated or something.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, the min/max vals are Doubles, and Double.toString results in more decimal places in the string. When we create the BigDecimal for range comparison, we can't have those extra bits of precision or it changes the valid range for a Float. For example,

scala> new java.math.BigDecimal(Float.MaxValue.doubleValue.toString)
val res0: java.math.BigDecimal = 3.4028234663852886E+38

That value is much smaller than the actual Float max of 3.4028235E38--we must call toString on the Float, e.g.

scala> new java.math.BigDecimal(Float.MaxValue.toString)
val res0: java.math.BigDecimal = 3.4028235E+38

override val maxStr = JDouble.MAX_VALUE.toString
override val width: MaybeInt = MaybeInt(64)
}

Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -140,8 +140,7 @@
<xs:element name="floatTextFail" type="xs:float"
dfdl:lengthKind="implicit" />
<xs:element name="floatTextDelim" type="xs:float"
dfdl:representation="text" dfdl:lengthKind="delimited"
dfdl:terminator="%SP;"/>
dfdl:representation="text" dfdl:lengthKind="delimited" />

<xs:element name="nonNegIntBin" type="xs:nonNegativeInteger"
dfdl:representation="binary" dfdl:lengthKind="explicit" dfdl:length="{ 16 }" />
Expand Down Expand Up @@ -3722,10 +3721,14 @@
</tdml:infoset>
</tdml:parserTestCase>

<tdml:parserTestCase name="float_text_delim" root="floatTextDelim"
<tdml:parserTestCase name="float_text_max" root="floatTextDelim"
model="SimpleTypes-Embedded.dfdl.xsd" description="Section 5 Simple type-float - DFDL-5-008R">

<tdml:document><![CDATA[3.4028235E38]]></tdml:document>
<!--
We would really prefer to use "3.4028235E38" as the document data (wich would parse exactly the same),
but textNumberPattern is "#,##0.###" so this unparses with all 39 digits plus commas instead of using
an exponent. In order to round trip correctly, we set the document to the more verbose form
-->
<tdml:document><![CDATA[340,282,350,000,000,000,000,000,000,000,000,000,000]]></tdml:document>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had to stare at this a while and go look at GeneralFormatBase to decide this is ok.

Insert a comment that this is 39 digits long because the textNumberPattern is "#,###.###" which doesn't allow for use of an exponent, so it has to output all the integer digits.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do. This was needed to roundtrip--it parses correctly since it's lax parsing and supports exponents, but since the textNumberPattern is what you say, it unparses without exponents.

<tdml:infoset>
<tdml:dfdlInfoset>
<floatTextDelim>3.4028235E38</floatTextDelim>
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -6333,8 +6333,7 @@
-->

<tdml:parserTestCase name="xPathFunc_round_hte_02" root="round-hte"
model="XPathFunctions" description="Section 23.5.2 - Standard XPath Functions - round-half-to-even - DFDL-23-098R"
roundTrip="twoPass">
model="XPathFunctions" description="Section 23.5.2 - Standard XPath Functions - round-half-to-even - DFDL-23-098R">

<tdml:document>
<tdml:documentPart type="text">3.455,2</tdml:documentPart>
Expand All @@ -6360,8 +6359,7 @@
-->

<tdml:parserTestCase name="xPathFunc_round_hte_03" root="round-hte"
model="XPathFunctions" description="Section 23.5.2 - Standard XPath Functions - round-half-to-even - DFDL-23-098R"
roundTrip="twoPass">
model="XPathFunctions" description="Section 23.5.2 - Standard XPath Functions - round-half-to-even - DFDL-23-098R">

<tdml:document>
<tdml:documentPart type="text">3.00865,4</tdml:documentPart>
Expand All @@ -6386,8 +6384,7 @@
-->

<tdml:parserTestCase name="xPathFunc_round_hte_04" root="round-hte"
model="XPathFunctions" description="Section 23.5.2 - Standard XPath Functions - round-half-to-even - DFDL-23-098R"
roundTrip="twoPass">
model="XPathFunctions" description="Section 23.5.2 - Standard XPath Functions - round-half-to-even - DFDL-23-098R">

<tdml:document>
<tdml:documentPart type="text">3.00065,4</tdml:documentPart>
Expand All @@ -6412,8 +6409,7 @@
-->

<tdml:parserTestCase name="xPathFunc_round_hte_05" root="round-hte"
model="XPathFunctions" description="Section 23.5.2 - Standard XPath Functions - round-half-to-even - DFDL-23-098R"
roundTrip="twoPass">
model="XPathFunctions" description="Section 23.5.2 - Standard XPath Functions - round-half-to-even - DFDL-23-098R">

<tdml:document>
<tdml:documentPart type="text">3.000065,5</tdml:documentPart>
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -826,9 +826,7 @@ class TestSimpleTypes {
@Test def test_float_text(): Unit = { runner.runOneTest("float_text") }
@Test def test_float_text2(): Unit = { runner.runOneTest("float_text2") }
@Test def test_float_text3(): Unit = { runner.runOneTest("float_text3") }

// DAFFODIL-2867
// @Test def test_float_text_delim(): Unit = { runner.runOneTest("float_text_delim") }
@Test def test_float_text_max(): Unit = { runner.runOneTest("float_text_max") }

@Test def test_float_text_fail(): Unit = { runner.runOneTest("float_text_fail") }
@Test def test_characterDuringValidFloat(): Unit = {
Expand Down