Fix Bug with valueLength being overwritten after Trim #1338

Merged
merged 1 commit into from
Feb 4, 2025

Conversation

olabusayoT
Contributor

@olabusayoT olabusayoT commented Oct 15, 2024

  • currently, after trimming the value of the element, we set the valueLength and then overwrite it after returning from the parse that does the trimming. This results in the wrong value for valueLength. This fixes it by only setting it for a non-choice complexType; simpleTypes are handled elsewhere
  • we also incorrectly use valueLength for prefixed length calculations when we ought to be using content length per the spec
  • we also do not ensure valueLength isn't getting overwritten, so we add asserts to setAbsStartPos0bInBits and setAbsEndPos0bInBits to verify that
  • fix bug where padding is being added around prefixed length element (DAFFODIL-2943) by changing CaptureLengthRegion to wrap around contentlengthStart and padding
  • fix bug where we were missing return after PE for Out of Range Binary Integers (DAFFODIL-2942)
  • fix bug where we were using the main element's qname instead of the prefixed element qname in the Unparse Error message
  • refactor Prefixed parsers to use state's bitLimit to get the prefix length (PrefixedLengthParserMixin2) since the specifiedLengthPrefixedParser will take care of parsing the prefix length
  • refactored Prefixed unparsers to not try to unparse prefix length since that is taken care of by SpecifiedLengthPrefixedUnparser
  • refactored prefixed parsers and unparsers to remove unused prefixed length parser related members
  • add tests

DAFFODIL-2658
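
For illustration only, here is a minimal sketch of the "set once, then assert" idea behind the new checks. The class `LengthStateSketch` and the accessor `lengthInBits` are hypothetical stand-ins, not Daffodil's actual length state; only the two setter names come from the description above.

```scala
// Hypothetical stand-in for a captured-length state; not Daffodil's real class.
final class LengthStateSketch {
  private var startBits: Option[Long] = None
  private var endBits: Option[Long] = None

  // Each position may be set exactly once; a second call signals an overwrite bug.
  def setAbsStartPos0bInBits(pos: Long): Unit = {
    require(startBits.isEmpty, s"start position already set: $startBits")
    startBits = Some(pos)
  }

  def setAbsEndPos0bInBits(pos: Long): Unit = {
    require(endBits.isEmpty, s"end position already set: $endBits")
    endBits = Some(pos)
  }

  // Defined only once both ends have been captured.
  def lengthInBits: Option[Long] =
    for (s <- startBits; e <- endBits) yield e - s
}
```

With this shape, a later parser that tries to re-capture the length after trimming fails loudly instead of silently replacing the trimmed value length.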

Contributor

@jadams-tresys jadams-tresys left a comment

+1

(isSimpleType && (impliedRepresentation == Representation.Text || lengthKind == LengthKind.Delimited)) ||
val capturedByValueParsers =
(isSimpleType && (
primType == PrimType.String || lengthKind == LengthKind.Delimited)) ||
Member

I'm wondering if this change is correct?

For example, say we have this element:

<xs:element name="foo" type="xs:int" dfdl:representation="text" dfdl:trimKind="padChar" dfdl:lengthKind="explicit" dfdl:length="10" ... />

So a fixed length text integer with padding. In this case I think what Daffodil does is create a String parser to parse the fixed length string and remove padding, and then create another parser to convert that string to an actual integer.

So in that case, even though the primType is not String, I think the String parser will still be used to capture the value length after padding is removed. So I think impliedRepresentation == Text is still needed?
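
A small sketch of the two-step behavior described above may help show why the string step is where the value length is known. This is not Daffodil code: `PaddedTextIntSketch` and its `parse` method are hypothetical, lengths are counted in characters, and padding is assumed to be trimmed from both sides.

```scala
object PaddedTextIntSketch {
  // Returns (contentLength, valueLength, value), with lengths in characters.
  def parse(field: String, padChar: Char): (Int, Int, Int) = {
    // Step 1: the "string" step takes the fixed-length field and removes padding.
    val trimmed = field.dropWhile(_ == padChar).reverse.dropWhile(_ == padChar).reverse
    // Step 2: a separate conversion step turns the trimmed text into an integer.
    (field.length, trimmed.length, trimmed.toInt)
  }
}
```

For a 10-character field holding `1234` padded with spaces, the content length is 10 while the value length is 4, and only the trimming step can observe the latter.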

@olabusayoT olabusayoT force-pushed the daf-2658-paddingNotRemoved branch 3 times, most recently from 1a0b7cc to 78050c8 Compare November 1, 2024 04:36
Member

@stevedlawrence stevedlawrence left a comment

Looks reasonable and the right approach for correctly implementing prefixed length. Using things like valueLength was clearly wrong. Just a few questions.

new SpecifiedLengthExplicit(this, body, bitsMultiplier)
if (isSimpleType && primType == PrimType.HexBinary) {
// hexBinary has some checks that need to be done that SpecifiedLengthExplicit
// gets in the way of
Member

I think instead of this comment, we can say HexBinary has its own HexBinarySpecifiedLength parser that handles calculating the length, so we do not need the SpecifiedLengthExplicit parser?

In fact, do we need to exclude a number of other primitive types that do their own explicit length handling? Looking at the current code base, I think maybe only simple types that are strings and complex types use the SpecifiedLengthExplicit parser? I think all other primitives implement their own specified length handling?

So maybe this wants to be

if (isComplexType || primType == PrimType.String) {
  SpecifiedLengthExplicit(...)
} else {
  // non-string simple types have their own custom parsers/unparsers for handling explicit lengths
  body
}

In fact, I wonder if we eventually want to refactor all of this to completely get rid of all the custom explicit/implicit length parsers? We would just have various SpecifiedLength parsers that set a bit limit (based on a pattern, a prefix length, evaluating a length expression, etc.) and then a single parser that just reads all bits up until the current bit limit. Separation of concerns kind of thing. It would get rid of this condition and all these BinaryIntegerKnownLength/RuntimeLength/PrefixLength/etc parsers. There would be just a single BinaryNumberParser, and it would get the length from the bitLimit.

Maybe that generality would take a performance hit? I'm also not exactly sure how that would work with unparsing. The SpecifiedLengthUnparser would need to somehow pass the calculated length to the child unparser; I guess it could still use bitLimit since that is a thing in UState?
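
To make the separation-of-concerns idea concrete, here is a rough sketch under heavy assumptions: `State`, `withLength`, and `readToLimit` are hypothetical names, bits are modeled as a `Vector[Boolean]`, and numbers are read as unsigned big-endian. It is not how Daffodil's parsers are structured today.

```scala
object BitLimitSketch {
  // Toy parse state: the data, the current bit position, and the current bit limit.
  final case class State(bits: Vector[Boolean], pos: Int, limit: Int)

  // The single generic reader: consumes every bit from pos up to limit.
  def readToLimit(s: State): (BigInt, State) = {
    val value = s.bits.slice(s.pos, s.limit).foldLeft(BigInt(0)) { (acc, bit) =>
      (acc << 1) + (if (bit) 1 else 0)
    }
    (value, s.copy(pos = s.limit))
  }

  // A "specified length" wrapper: narrows the limit, runs the child,
  // then restores the enclosing limit.
  def withLength[A](lenInBits: Int)(child: State => (A, State))(s: State): (A, State) = {
    val (result, after) = child(s.copy(limit = s.pos + lenInBits))
    (result, after.copy(limit = s.limit))
  }
}
```

The wrapper alone decides how long the field is; the reader never needs to know whether that length came from a pattern, a prefix, or an expression.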

Contributor Author

@mbeckerle, any thoughts on refactoring the code to have various specified length parsers as described above, and any idea how that would work with unparsing?

Contributor

I would not refactor this now. This code has gotten kind of undisciplined and been that way for a long time now.

Our performance last we checked was in the ballpark of 2x what a programmer would write by hand for simple binary data.

If we start reorganizing this sort of thing we need to rerun those tests. We're too close to a release for that.

Fixing these bugs does seem worth it however.

body
} else {
new SpecifiedLengthExplicit(this, body, bitsMultiplier)
}
case LengthKind.Explicit => {
Assert.invariant(!knownEncodingIsFixedWidth)
Assert.invariant(lengthUnits eq LengthUnits.Characters)
Member

Below this we have cases for implicit lengths. Do we need to do anything special for non-string simple types? I think those primitives have custom parsers that handle the implicit length logic and don't need a SpecifiedLengthImplicit grammar? My concern is we could be adding that grammar and it would do something like set a bit limit, but the child parser that actually parses the thing would just use its own calculation and wouldn't need the bit limit, so we are just wasting effort.

Contributor

Suggest creating a cleanup ticket for this. It's a performance (and maintainability) improvement but not specifically about this PR's primary goal, is it?

Member

I'm not sure this is a cleanup of existing issues. I think this change could possibly lead to adding duplicate parsers and might slow things down. So it's not that we already have duplicates, but this change could lead to new duplicates.

Member

Also, I believe this change just fixes issues with prefixed lengths, which I think have a number of other outstanding issues. So even if we fix this one, I'm not sure prefixedLength is going to be safe to use. Feels like it's less critical to get into this release and, due to possible regressions, worth pushing to the next release.

e.lengthUnits,
e.prefixedLengthAdjustmentInUnits
)
override lazy val parser = new BCDIntegerPrefixedLengthParser(e.elementRuntimeData)
Member

I wonder if we want to rename these to something like BCDIntegerBitLimitLengthParser and BCDIntegerMinimumLengthUnparser, making it clear that they aren't really doing anything specifically with prefixed length, and better describing how they actually parse/unparse things?

And when we implement things like lengthKind endOfParent, we could probably just use the same BitLimitParser, for example.

* This mixin doesn't require parsing the prefix length element and just uses
* the state's bitLimit and position to get the bitLength instead
*/
trait PrefixedLengthParserMixin2 {
Member

Need a different name for this. Numbers are not descriptive enough. Maybe we rename these to something like PrefixLengthFromParserMixin and PrefixLengthFromBitLimitMixin? Something to make it clearer how they differ without having to read the code.

case Prefixed => true
case _ => false
}
}
Member

This can just be lengthKind eq LengthKind.Prefixed. Don't really need a match/case if we are just going to return true/false.
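
As a self-contained illustration of that simplification, the sketch below uses a hypothetical stand-in `LengthKind` (a sealed trait with case objects), not Daffodil's actual definition, and shows the two forms side by side.

```scala
// Hypothetical stand-in for the real LengthKind enumeration.
sealed trait LengthKind
object LengthKind {
  case object Explicit extends LengthKind
  case object Prefixed extends LengthKind
  case object Delimited extends LengthKind
}

object IsPrefixedSketch {
  // Verbose form: a match that only ever yields true or false.
  def isPrefixedViaMatch(lk: LengthKind): Boolean = lk match {
    case LengthKind.Prefixed => true
    case _ => false
  }

  // Equivalent form: reference equality against the singleton object.
  def isPrefixed(lk: LengthKind): Boolean = lk eq LengthKind.Prefixed
}
```

Because each kind is a singleton object, `eq` gives the same answer as the match for every input.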

import LengthKind._
lengthKind match {
case Delimited =>
true // don't test for hasDelimiters because it might not be our delimiter, but a surrounding group's separator, or it's terminator, etc.
case Pattern => true
case Prefixed => true
case Prefixed => isPrefixed
Member

I would suggest just returning true here, isPrefixed doesn't do anything except check lengthKind == Prefixed, which is exactly what this match case does.

Member

@stevedlawrence stevedlawrence left a comment

It's still not clear to me that we are correctly adding/not adding SpecifiedLength* parsers. Do we have any documentation or any easy way to tell that we are doing things correctly, and only adding those parsers when they are actually needed?

@@ -289,7 +289,8 @@ object DaffodilCCodeGenerator
case g: ElementParseAndUnspecifiedLength =>
elementParseAndUnspecifiedLengthGenerateCode(g, cgState)
case g: ElementUnused => noop(g)
case g: HexBinaryLengthPrefixed => hexBinaryLengthPrefixedGenerateCode(g.e, cgState)
case g: HexBinaryEndOfBitLimit if g.e.isPrefixed =>
Member

Hmm, maybe my suggestion was wrong and the primitive still wants to be something like HexBinaryLengthPrefixed so that the grammar is obvious and things like code generators/etc can use the obvious grammar names? The parsers generated don't necessarily have to match the names.

Contributor Author

Hmm...do you mean the suggestion to remove HexBinaryLengthPrefixed or the suggestion to rename all PrefixedLength parsers/unparser to BitLimitLength/MinimumLength?

Member

Sorry that wasn't clear. I'm suggesting that we should keep the old grammar name HexBinaryLengthPrefixed so this change isn't needed. This way the grammar is clearer for things like the code generator that examine it. But it's still fine to call the parsers BitLimitLength/MinimumLength etc from that grammar.

if isSimpleType && impliedRepresentation == Representation.Binary =>
if isSimpleType &&
impliedRepresentation == Representation.Binary &&
primType != PrimType.HexBinary =>
new SpecifiedLengthImplicit(this, body, implicitBinaryLengthInBits)
Member

This condition feels like it doesn't exclude enough things. For example, this matches all binary types except hex binary. But, for example, integers with length kind implicit use something like BinaryIntegerKnownLength primitive/parser, which doesn't rely on SpecifiedLengthImplicit parser. It doesn't necessarily hurt, but it's just unnecessary work and might slow things down, since the BinaryIntegerKnownLengthParser is going to do that work.

I wonder if it's just string types that really make use of specified length stuff?

Contributor Author

So it looks like nillable binary types need SpecifiedLengthImplicit; they are the only tests failing when I comment out that chunk of code.

Member

I think that means we only need the SpecifiedLengthImplicit when we are trying to parse the nillable part of a number. For example, the way things currently work is we have something like a SimpleNilOrValueParser and that has two child parsers: one is the nil parser (which sounds like it requires SpecifiedLengthImplicit) and the other is the binary parser (which probably does not need SpecifiedLengthImplicit).

I wonder if captureLengthRegion needs a new parameter (e.g. forNilContent: Boolean) which is passed into specifiedLength. This way specifiedLength can do different things depending on whether it's trying to represent nil content or non-nil content.

Contributor

I suggest parking this cleanup as a separate ticket. Unnecessary parsers are not optimized out in many cases I bet, and this is just one such case where specified length stuff is unnecessary but not removed.

@olabusayoT olabusayoT force-pushed the daf-2658-paddingNotRemoved branch from ccdc433 to 8d30149 Compare January 9, 2025 20:56
Contributor

@mbeckerle mbeckerle left a comment

+1 Code looks good. Questions on how to organize this code better long term remain, but that requires more than a PR review.


@olabusayoT
Contributor Author

Created DAFFODIL-2967 to ensure we're properly adding/not adding the schemas

@mbeckerle, removing the piece of code you highlighted as being better off removed caused test failures, iirc!

@olabusayoT olabusayoT force-pushed the daf-2658-paddingNotRemoved branch 3 times, most recently from 66de0e8 to 6362dde Compare February 3, 2025 21:08
Member

@stevedlawrence stevedlawrence left a comment

+1, just some minor comments about questions and a little cleanup

isComplexType ||
lengthKind == LengthKind.Prefixed ||
isSimpleType && primType == PrimType.HexBinary && lengthKind == LengthKind.Pattern
) {
Member

It might be worth setting this condition to a val to make it more self-documenting. Even with that, it might also be worth adding a comment clarifying why we do this. Maybe something like:

// there are essentially two categories of processors that read/write data input/output
// stream: those that calculate lengths themselves and those that expect another
// processor to calculate the length and set the bit limit which this processor will use as
// the length. The following determines if this element requires another processor to
// calculate and set the bit limit, and if so adds the appropriate grammar to do that
val bodyRequiresSpecifiedLengthBitLimit = isSimpleType && impliedRepresentation == Representation.Text || ...

if (!bodyRequiresSpecifiedLengthBitLimit) body
else lengthKind match {
  // all the cases that add various specified length grammars that body depends on
}

body
val lk = lengthKind
lk match {
case LengthKind.Delimited => body
Member

lengthKind=Delimited just uses body. Thoughts on adding a || lengthKind == LengthKind.Delimited to the condition above, making it more clear that all delimited processors do the length calculations themselves and don't need a specified length processor?

notYetImplemented("lengthKind='endOfParent' for complex type")
case LengthKind.EndOfParent =>
notYetImplemented("lengthKind='endOfParent' for simple type")
case _ => {
Member

This default case isn't covered, which is probably a good thing. It probably means the condition at the top is catching all cases that really don't need a specified length parser and we aren't just accidentally falling into this case. I would suggest we either remove this case or, if Scala complains about a non-exhaustive match, change it to an Assert.impossible with codecov disabled.

Contributor Author

Maybe there's a bug with code coverage, hexBinary_bits_be_msbf definitely hits the default case

if isSimpleType && impliedRepresentation == Representation.Binary =>
new SpecifiedLengthImplicit(this, body, implicitBinaryLengthInBits)
case LengthKind.Implicit if isComplexType =>
body // for complex types, implicit means "roll up from the bottom"
Member

Should we change the condition at the top to isComplexType && lengthKind != LengthKind.Implicit and remove this case? I'm just wondering if there's value in making it so none of these cases have body as the result, if that makes it clearer exactly when we need a specified length parser?

new SpecifiedLengthImplicitCharacters(this, body, this.maxLength.longValue)

case LengthKind.Implicit if isSimpleType && primType == PrimType.HexBinary =>
new SpecifiedLengthImplicit(this, body, this.maxLength.longValue * bitsMultiplier)
Member

This case has no coverage, which I think makes sense since only prefixed length hexBinary uses a specified length parser. All other hexBinary parsers calculate their own lengths. Should we remove this case to avoid dead code and possible confusion with how hexBinary is implemented?

Contributor Author

Hmm... I'm surprised it has no coverage; several tests pass that condition of isSimpleType and impliedRep == Text and isHexBinary and LK implicit!

ex: maxHexBinaryError, hexBinary_Implicit_03b etc

These tests fail when we remove that case

@@ -65,7 +65,8 @@ trait ElementBaseRuntime1Mixin { self: ElementBase =>
// no reason (unless it is referenced in a contentLength expression).
val mightHaveSuspensions = (maybeFixedLengthInBits.isDefined && couldHaveSuspensions)

isReferenced || mightHaveSuspensions
// we want to capture contentlength when LK = prefixed
Member

I think we should explain why we need to capture content length. Maybe something like

Some prefixed length unparsers are unable to calculate the prefixed length of the field. Instead, they unparse the field to a buffer and the captured content length of the buffer is used. For this reason, prefixed length elements must capture content length for unparse.
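
A toy version of that buffer-then-measure approach, purely as a sketch: `PrefixedUnparseSketch` and `unparsePrefixed` are hypothetical names, the prefix is assumed to be a single unsigned byte counting content bytes, and the content is assumed to be UTF-8 text.

```scala
import java.io.ByteArrayOutputStream
import java.nio.charset.StandardCharsets

object PrefixedUnparseSketch {
  def unparsePrefixed(value: String): Array[Byte] = {
    // Unparse the field into a buffer first, since its length is not known up front.
    val buffer = new ByteArrayOutputStream()
    buffer.write(value.getBytes(StandardCharsets.UTF_8))

    // The captured content length of the buffer becomes the prefix value.
    val contentLengthInBytes = buffer.size()
    require(contentLengthInBytes <= 255, "content too long for a 1-byte prefix")

    val out = new ByteArrayOutputStream()
    out.write(contentLengthInBytes) // write(int) emits the low 8 bits
    buffer.writeTo(out)
    out.toByteArray
  }
}
```

The prefix can only be written after the content has been measured, which is why the content length must be captured for these elements during unparse.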

@@ -200,7 +124,7 @@ trait CalculatedPrefixedLengthUnparserMixin {
UnparseError(
One(state.schemaFileLocation),
One(state.currentLocation),
s"The calculated value of ${elem.namedQName} ($adjustedLenInUnits) failed check due to ${check.errMsg}"
s"The calculated value of ${plElem.namedQName} ($adjustedLenInUnits) failed check due to ${check.errMsg}"
Member

Is switching to plElem correct? Although the check is done against plElem, plElem is really an ephemeral quasi-element that doesn't exist in the real infoset, so it might be confusing to mention it in an error message. Maybe this instead wants to be something like "The calculated prefix length of ${elem.namedQName} ..."?


/**
* This mixin doesn't require parsing the prefix length element and just uses
* the state's bitLimit and position to get the bitLength instead
Member

This comment is a little confusing out of context, since it isn't really clear why prefix length elements are mentioned since this doesn't really seem to be prefixed length related. This could also in theory be used on many of the parsers that rely on bit limit for length, which isn't specific to prefixed length. Maybe instead say something like

Some parsers do not calculate their own length, but instead expect another parser to set the bit limit and then they use that bit limit as the length. An example of this is prefix length parsers. This trait can be used by those parsers to determine the length based on the bitLimit and position.
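
One possible shape for such a trait, as a sketch only: `ParseStateSketch`, `BitLengthFromBitLimitSketch`, and `bitLengthFrom` are hypothetical, and the real mixin works against Daffodil's parse state rather than this toy case class.

```scala
// Toy parse state: current 0-based bit position and an optional 0-based bit limit.
final case class ParseStateSketch(bitPos0b: Long, bitLimit0b: Option[Long])

trait BitLengthFromBitLimitSketch {
  // The length is whatever remains between the current position and the limit
  // that an enclosing parser has already set; undefined if no limit is set.
  def bitLengthFrom(state: ParseStateSketch): Option[Long] =
    state.bitLimit0b.map(limit => limit - state.bitPos0b)
}

// Any parser that relies on the bit limit could mix the trait in.
object SomeBitLimitParserSketch extends BitLengthFromBitLimitSketch
```

Nothing in the trait mentions prefixes, which matches the point that it could serve any parser whose length is set by an enclosing bit limit.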

- currently after trimming the value of the element, we set the valueLength, and then overwrite it after returning from the parse that does the trimming. This results in the wrong value for value length. This fixes it by checking if the valueLength has already been set, and only setting it in SpecifiedLengthParserBase.parse if it hasn't
- add asserts to setAbsStartPos0bInBits and setAbsEndPos0bInBits to ensure they're not being overwritten
- do not capture the value lengths of choices/complexTypes in specified length parser base
- fix bug where padding is being added around prefixed length element (DAFFODIL-2943) by changing CaptureLengthRegion to wrap around contentlengthStart and padding
- fix bug where we use the valuelength to calculate the prefix length, according to the spec it should be the content length
- fix bug where we were missing return after PE for Out of Range Binary Integers (DAFFODIL-2942)
- refactor Prefixed parsers to use state's bitLimit to get the prefix length (BitLengthFromBitLimitMixin) since the specifiedLengthPrefixedParser will take care of parsing the prefix length
- refactored Prefixed unparsers to not try to unparse prefix length since that is taken care of by SpecifiedLengthPrefixedUnparser
- refactored prefixed parsers and unparsers to remove unused prefixed length parser related members
- rename custom prefixedlength parsers to *BitLengthParsers to more accurately reflect what they're doing
- rename custom prefixedlength unparsers to MinimumLengthUnparsers to more accurately reflect what they're doing
- add tests

DAFFODIL-2658
@olabusayoT olabusayoT force-pushed the daf-2658-paddingNotRemoved branch from 419793f to 018570f Compare February 4, 2025 21:02
@mbeckerle
Contributor

mbeckerle commented Feb 4, 2025 via email

@olabusayoT olabusayoT merged commit ba465e3 into apache:main Feb 4, 2025
12 checks passed
4 participants