Fix Bug with valueLength being overwritten after Trim #1338
+1
daffodil-test/src/test/resources/org/apache/daffodil/section12/lengthKind/PrefixedTests.tdml
.../src/main/scala/org/apache/daffodil/runtime1/processors/parsers/SpecifiedLengthParsers.scala
.../src/main/scala/org/apache/daffodil/runtime1/processors/parsers/SpecifiedLengthParsers.scala
daffodil-core/src/main/scala/org/apache/daffodil/core/runtime1/ElementBaseRuntime1Mixin.scala
(isSimpleType && (impliedRepresentation == Representation.Text || lengthKind == LengthKind.Delimited)) ||
val capturedByValueParsers =
(isSimpleType && (
primType == PrimType.String || lengthKind == LengthKind.Delimited)) ||
I'm wondering if this change is correct?
For example, say we have this element:
<xs:element name="foo" type="xs:int" dfdl:representation="text" dfdl:trimKind="padChar" dfdl:lengthKind="explicit" dfdl:length="10" ... />
So a fixed length text integer with padding. In this case I think what Daffodil does is create a String parser to parse the fixed length string and remove padding, and then create another parser to convert that string to an actual integer.
So in that case, even though the primType is not String, I think the String parser will still be used to capture the value length after padding is removed. So I think impliedRepresentation == Text is still needed?
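To make the concern concrete, here is a minimal sketch comparing the two conditions on the `foo` element above. The types `Elem`, `PrimType`, `Representation` and `LengthKind` below are simplified stand-ins written for illustration, not Daffodil's real classes, and the `isSimpleType` check is left out because both conditions share it.

```scala
object CaptureConditionSketch {
  // Hypothetical stand-ins for Daffodil's PrimType, Representation and LengthKind.
  sealed trait PrimType
  case object StringType extends PrimType
  case object IntType extends PrimType

  sealed trait Representation
  case object Text extends Representation
  case object Binary extends Representation

  sealed trait LengthKind
  case object Explicit extends LengthKind
  case object Delimited extends LengthKind

  final case class Elem(primType: PrimType, rep: Representation, lengthKind: LengthKind)

  // Condition before the change: any text representation counts.
  def capturedByRepresentation(e: Elem): Boolean =
    e.rep == Text || e.lengthKind == Delimited

  // Condition proposed in the diff: only xs:string counts.
  def capturedByPrimType(e: Elem): Boolean =
    e.primType == StringType || e.lengthKind == Delimited

  // The "foo" element from the comment: xs:int, text, explicit length, padded.
  val paddedTextInt: Elem = Elem(IntType, Text, Explicit)
}
```

Under these assumptions the padded text integer satisfies the old condition but not the new one, which is the case the comment is asking about.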
1a0b7cc to 78050c8
Looks reasonable and the right approach for correctly implementing prefixed length. Using things like valueLength was clearly wrong. Just a few questions.
...l-core/src/main/scala/org/apache/daffodil/core/grammar/primitives/PrimitivesLengthKind.scala
new SpecifiedLengthExplicit(this, body, bitsMultiplier)
if (isSimpleType && primType == PrimType.HexBinary) {
// hexBinary has some checks that need to be done that SpecifiedLengthExplicit
// gets in the way of
I think instead of this comment, we can say HexBinary has its own HexBinarySpecifiedLength parser that handles calculating the length, so we do not need the SpecifiedLengthExplicit parser?
In fact, do we need to exclude a number of other primitive types that do their own explicit length handling? Looking at the current code base, I think maybe only simple types that are strings and complex types use the SpecifiedLengthExplicit parser? I think all other primitives implement their own specified length handling?
So maybe this wants to be:
if (isComplexType || primType == PrimType.String) {
SpecifiedLengthExplicit(...)
} else {
// non-string simple types have their own custom parsers/unparsers for handling explicit lengths
body
}
In fact, I wonder if we eventually want to refactor all of this to completely get rid of all the custom explicit/implicit length parsers? We would just have various SpecifiedLength parsers that set a bit limit (based on a pattern, a prefix length, evaluating a length expression, etc.) and then a single parser that just reads all bits up until that current bit limit. Separation of concerns kind of thing. It would get rid of this condition and all these BinaryIntegerKnownLength/RuntimeLength/PrefixLength/etc. parsers. There's just a single BinaryNumberParser, and it just gets the length from the bitLimit.
Maybe that generality would take a performance hit? I'm also not exactly sure how that would work with unparsing--the SpecifiedLengthUnparser would need to somehow pass the calculated length to the child unparser. I guess it could still use bitLimit since that is a thing in UState?
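A rough sketch of that separation of concerns, for the parse side only. Everything here (`PState`, `Parser`, `SpecifiedLength`, `BinaryNumberParser`) is a hypothetical toy written for this illustration and does not reflect Daffodil's actual runtime classes or signatures.

```scala
object BitLimitSketch {
  // Hypothetical, minimal parse state: a bit vector, a position and an optional limit.
  final class PState(val bits: Vector[Boolean]) {
    var bitPos: Long = 0L
    var bitLimit: Option[Long] = None
  }

  trait Parser { def parse(s: PState): Unit }

  // Reads every bit up to the current limit as an unsigned big-endian number.
  final class BinaryNumberParser(sink: BigInt => Unit) extends Parser {
    def parse(s: PState): Unit = {
      val end = s.bitLimit.getOrElse(s.bits.length.toLong)
      var value = BigInt(0)
      while (s.bitPos < end) {
        value = (value << 1) + (if (s.bits(s.bitPos.toInt)) 1 else 0)
        s.bitPos += 1
      }
      sink(value)
    }
  }

  // Only decides how long the child is; it never looks at the data itself.
  final class SpecifiedLength(lengthInBits: PState => Long, child: Parser) extends Parser {
    def parse(s: PState): Unit = {
      val saved = s.bitLimit
      s.bitLimit = Some(s.bitPos + lengthInBits(s))
      try child.parse(s)
      finally s.bitLimit = saved // restore the enclosing limit
    }
  }
}
```

The length function could come from a prefix, a pattern, or an expression; the child parser is the same in every case. Whether this generality costs performance is exactly the open question raised above, and the sketch says nothing about unparsing.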
@mbeckerle, any thoughts on refactoring the code to have various specified length parsers as described above, and any idea on how that would work with unparsing?
I would not refactor this now. This code has gotten kind of undisciplined and been that way for a long time now.
Our performance last we checked was in the ballpark of 2x what a programmer would write by hand for simple binary data.
If we start reorganizing this sort of thing we need to rerun those tests. We're too close to a release for that.
Fixing these bugs does seem worth it however.
body
} else {
new SpecifiedLengthExplicit(this, body, bitsMultiplier)
}
case LengthKind.Explicit => {
Assert.invariant(!knownEncodingIsFixedWidth)
Assert.invariant(lengthUnits eq LengthUnits.Characters)
Below this we have cases for implicit lengths. Do we need to do anything special for non-string simple types? I think those primitives have custom parsers that handle the implicit length logic and don't need a SpecifiedLengthImplicit grammar? My concern is that we could be adding that grammar, and it would do something like set a bit limit, but the child parser that actually parses the thing would just use its own calculation and wouldn't need the bit limit, so we are just wasting effort.
Suggest creating a cleanup issue ticket for this. It's a performance (and maintainability) improvement, but not specifically about this PR's primary goal, is it?
I'm not sure this is a cleanup of existing issues. I think this change could possibly lead to adding duplicate parsers and might slow things down. So it's not that we already have duplicates, but this change could lead to new duplicates.
Also, I believe this bug fix just addresses issues with prefixed lengths, which I think has a number of other outstanding issues. So even if we fix this one, I'm not sure prefixedLength is going to be safe to use. Feels like it's less critical to get into this release, and due to possible regressions is worth pushing to the next release.
e.lengthUnits,
e.prefixedLengthAdjustmentInUnits
)
override lazy val parser = new BCDIntegerPrefixedLengthParser(e.elementRuntimeData)
I wonder if we want to rename these to something like BCDIntegerBitLimitLengthParser and BCDIntegerMinimumLengthUnparser, making it clear that they aren't really doing anything specifically with prefixed length, and better describing how they actually parse/unparse things?
And when we implement things like lengthKind endOfParent, we could probably just use the same BitLimitParser, for example.
* This mixin doesn't require parsing the prefix length element and just uses
* the state's bitLimit and position to get the bitLength instead
*/
trait PrefixedLengthParserMixin2 {
Need a different name for this. Numbers are not descriptive enough. Maybe we rename these to something like PrefixLengthFromParserMixin and PrefixLengthFromBitLimitMixin? Something to make it more clear how they differ without having to read the code.
case Prefixed => true
case _ => false
}
}
This can just be lengthKind eq LengthKind.Prefixed. Don't really need a match/case if we are just going to return true/false.
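For reference, a small self-contained sketch of the two forms side by side. The `LengthKind` hierarchy here is a simplified stand-in for Daffodil's enum, assumed only for this illustration.

```scala
object IsPrefixedSketch {
  // Hypothetical, trimmed-down LengthKind with singleton case objects.
  sealed trait LengthKind
  object LengthKind {
    case object Explicit extends LengthKind
    case object Prefixed extends LengthKind
    case object Delimited extends LengthKind
  }

  // As written in the diff: a match that only yields true or false.
  def isPrefixedViaMatch(lk: LengthKind): Boolean = lk match {
    case LengthKind.Prefixed => true
    case _ => false
  }

  // Suggested form: reference equality, valid because each case object is a singleton.
  def isPrefixedViaEq(lk: LengthKind): Boolean = lk eq LengthKind.Prefixed
}
```

Both functions return the same result for every value of this stand-in type; the second just states the intent directly.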
import LengthKind._
lengthKind match {
case Delimited =>
true // don't test for hasDelimiters because it might not be our delimiter, but a surrounding group's separator, or it's terminator, etc.
case Pattern => true
case Prefixed => true
case Prefixed => isPrefixed
I would suggest just returning true here, isPrefixed doesn't do anything except check lengthKind == Prefixed, which is exactly what this match case does.
It's still not clear to me that we are correctly adding/not adding SpecifiedLength* parsers. Do we have any documentation or any easy way to tell that we are doing things correctly, and only adding those parsers when they are actually needed?
@@ -289,7 +289,8 @@ object DaffodilCCodeGenerator
case g: ElementParseAndUnspecifiedLength =>
elementParseAndUnspecifiedLengthGenerateCode(g, cgState)
case g: ElementUnused => noop(g)
case g: HexBinaryLengthPrefixed => hexBinaryLengthPrefixedGenerateCode(g.e, cgState)
case g: HexBinaryEndOfBitLimit if g.e.isPrefixed =>
Hmm, maybe my suggestion was wrong and the primitive still wants to be something like HexBinaryLengthPrefixed so that the grammar is obvious and things like code generators, etc. can use the obvious grammar names? The parsers generated don't necessarily have to match the names.
Hmm... do you mean the suggestion to remove HexBinaryLengthPrefixed or the suggestion to rename all PrefixedLength parsers/unparsers to BitLimitLength/MinimumLength?
Sorry that wasn't clear. I'm suggesting that we should keep the old grammar name HexBinaryLengthPrefixed so this change isn't needed. This way the grammar is clearer for things like the code generator that examine it. But it's still fine to call the parsers BitLimitLength/MinimumLength etc. from that grammar.
daffodil-core/src/main/scala/org/apache/daffodil/core/grammar/ElementBaseGrammarMixin.scala
if isSimpleType && impliedRepresentation == Representation.Binary =>
if isSimpleType &&
impliedRepresentation == Representation.Binary &&
primType != PrimType.HexBinary =>
new SpecifiedLengthImplicit(this, body, implicitBinaryLengthInBits)
This condition feels like it doesn't exclude enough things. For example, this matches all binary types except hex binary. But integers with length kind implicit use something like the BinaryIntegerKnownLength primitive/parser, which doesn't rely on the SpecifiedLengthImplicit parser. It doesn't necessarily hurt, but it's just unnecessary work and might slow things down, since the BinaryIntegerKnownLengthParser is going to do that work.
I'm wondering if it's just string types that really make use of specified length stuff?
So it looks like nillable binary types need SpecifiedLengthImplicit; they are the only tests failing when I comment out that chunk of code.
I think that means we only need the SpecifiedLengthImplicit when we are trying to parse the nillable part of a number. For example, the way things currently work is we have something like a SimpleNilOrValueParser, and that has two child parsers: one is the nil parser (which sounds like it requires SpecifiedLengthImplicit) and the other is the binary parser (which probably does not need SpecifiedLengthImplicit).
I wonder if captureLengthRegion needs a new parameter (e.g. forNilContent: Boolean) which is passed into specifiedLength. This way specifiedLength can do different things depending on whether it's trying to represent nil content or non-nil content.
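One way the proposed flag could look, as a sketch only. The `Gram`, `Body` and `SpecifiedLengthImplicit` types and the `specifiedLength` signature below are invented for illustration and are not the existing Daffodil grammar API; the behavior encodes the hypothesis from the comment (nil content needs the wrapper, value content does not) for an implicit-length binary number.

```scala
object NilContentSketch {
  // Hypothetical grammar nodes, standing in for Daffodil's Gram hierarchy.
  sealed trait Gram
  final case class Body(name: String) extends Gram
  final case class SpecifiedLengthImplicit(child: Gram, lengthInBits: Long) extends Gram

  // Hypothetical variant of specifiedLength taking the proposed forNilContent flag.
  def specifiedLength(body: Gram, implicitLengthInBits: Long, forNilContent: Boolean): Gram =
    if (forNilContent) SpecifiedLengthImplicit(body, implicitLengthInBits) // nil parser relies on the bit limit
    else body // value parser computes its own length
}
```

With such a flag, the two children of a nil-or-value parser could be built from the same element with different length handling.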
I suggest parking this cleanup as a separate ticket. Unnecessary parsers are not optimized out in many cases I bet, and this is just one such case where specified length stuff is unnecessary but not removed.
ccdc433 to 8d30149
+1 Code looks good. Questions on how to organize this code better long term remain, but that requires more than a PR review.
if isSimpleType && impliedRepresentation == Representation.Binary =>
if isSimpleType &&
impliedRepresentation == Representation.Binary &&
primType != PrimType.HexBinary =>
new SpecifiedLengthImplicit(this, body, implicitBinaryLengthInBits)
I suggest parking this cleanup as a separate ticket. Unnecessary parsers are not optimized out in many cases I bet, and this is just one such case where specified length stuff is unnecessary but not removed.
Created DAFFODIL-2967 to ensure we're properly adding/not adding the schemas. @mbeckerle, regarding the piece of code you highlighted as being better off removed: removing that bit caused test failures, iirc!
66de0e8 to 6362dde
+1, just some minor comments about questions and a little cleanup
isComplexType ||
lengthKind == LengthKind.Prefixed ||
isSimpleType && primType == PrimType.HexBinary && lengthKind == LengthKind.Pattern
) {
It might be worth setting this condition to a val to make it more self-documenting. Even with that, it might also be worth adding a comment clarifying why we do this. Maybe something like:
// there are essentially two categories of processors that read/write data input/output
// stream: those that calculate lengths themselves and those that expect another
// processor to calculate the length and set the bit limit which this processor will use as
// the length. The following determines if this element requires another processor to
// calculate and set the bit limit, and if so adds the appropriate grammar to do that
val bodyRequiresSpecifiedLengthBitLimit = isSimpleType && impliedRepresentation == Representation.Text || ...
if (!bodyRequiresSpecifiedLengthBitLimit) body
else lengthKind match {
// all the cases that add various specified length grammars that body depends on
}
body
val lk = lengthKind
lk match {
case LengthKind.Delimited => body
lengthKind=Delimited just uses body--thoughts on adding a || lengthKind == LengthKind.Delimited to the condition above, making it more clear that all delimited processors do the length calculations themselves and don't need a specified length processor?
notYetImplemented("lengthKind='endOfParent' for complex type")
case LengthKind.EndOfParent =>
notYetImplemented("lengthKind='endOfParent' for simple type")
case _ => {
This default case isn't covered, which is probably a good thing--it probably means the condition at the top is catching all cases that really don't need a specified length parser and we aren't just accidentally falling into this case. I would suggest we either remove this case or, if Scala complains about a non-exhaustive match, change it to an Assert.impossible with codecov disabled.
Maybe there's a bug with code coverage, hexBinary_bits_be_msbf definitely hits the default case
if isSimpleType && impliedRepresentation == Representation.Binary =>
new SpecifiedLengthImplicit(this, body, implicitBinaryLengthInBits)
case LengthKind.Implicit if isComplexType =>
body // for complex types, implicit means "roll up from the bottom"
Should we change the condition at the top to isComplexType && lengthKind != LengthKind.Implicit and remove this case? I'm just wondering if there's value in making it so none of these cases have body as the result, if that makes it more clear exactly when we need a specified length parser?
new SpecifiedLengthImplicitCharacters(this, body, this.maxLength.longValue)

case LengthKind.Implicit if isSimpleType && primType == PrimType.HexBinary =>
new SpecifiedLengthImplicit(this, body, this.maxLength.longValue * bitsMultiplier)
This case has no coverage, which I think makes sense since only prefixed length hexBinary uses a specified length parser. All other hexBinary parsers calculate their own lengths. Should we remove this case to avoid dead code and possible confusion with how hexBinary is implemented?
Hmm... I'm surprised it has no coverage; several tests pass that condition of isSimpleType and impliedRep == Text and isHexBinary and LK implicit!
For example: maxHexBinaryError, hexBinary_Implicit_03b, etc.
These tests fail when we remove that case.
...l-core/src/main/scala/org/apache/daffodil/core/grammar/primitives/PrimitivesLengthKind.scala
@@ -65,7 +65,8 @@ trait ElementBaseRuntime1Mixin { self: ElementBase =>
// no reason (unless it is referenced in a contentLength expression).
val mightHaveSuspensions = (maybeFixedLengthInBits.isDefined && couldHaveSuspensions)

isReferenced || mightHaveSuspensions
// we want to capture contentlength when LK = prefixed
I think we should explain why we need to capture content length. Maybe something like
Some prefixed length unparsers are unable to calculate the prefixed length of the field. Instead, they unparse the field to a buffer and the captured content length of the buffer is used. For this reason, prefixed length elements must capture content length for unparse.
@@ -200,7 +124,7 @@ trait CalculatedPrefixedLengthUnparserMixin {
UnparseError(
One(state.schemaFileLocation),
One(state.currentLocation),
s"The calculated value of ${elem.namedQName} ($adjustedLenInUnits) failed check due to ${check.errMsg}"
s"The calculated value of ${plElem.namedQName} ($adjustedLenInUnits) failed check due to ${check.errMsg}"
Is switching to plElem correct? Although the check is done against plElem, plElem is really an ephemeral quasi-element that doesn't exist in the real infoset, so it might be confusing to mention it in an error message. Maybe this instead wants to be something like "The calculated prefix length of ${elem.namedQName} ..."?
/**
* This mixin doesn't require parsing the prefix length element and just uses
* the state's bitLimit and position to get the bitLength instead
This comment is a little confusing out of context, since it isn't really clear why prefix length elements are mentioned when this doesn't really seem to be prefixed length related. This could also in theory be used on many of the parsers that rely on bit limit for length, which isn't specific to prefixed length. Maybe instead say something like:
Some parsers do not calculate their own length, but instead expect another parser to set the bit limit and then they use that bit limit as the length. An example of this is prefix length parsers. This trait can be used by those parsers to determine the length based on the bitLimit and position.
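The behavior that suggested comment describes can be sketched as below. The `PState` case class, its field names and the `bitLength` method are assumptions made for this illustration; the real trait and state in Daffodil may differ.

```scala
object BitLengthFromBitLimitSketch {
  // Hypothetical parse state; only the two fields the trait needs.
  final case class PState(bitPos0b: Long, bitLimit0b: Option[Long])

  trait BitLengthFromBitLimit {
    // The length is whatever remains between the current position and the
    // limit that an enclosing specified-length parser has already set.
    // Yields None when no limit has been set.
    def bitLength(s: PState): Option[Long] =
      s.bitLimit0b.map(limit => limit - s.bitPos0b)
  }

  // Any parser that expects its length to be set from outside could mix this in.
  object HexBinaryParser extends BitLengthFromBitLimit
}
```

Nothing in the trait mentions prefixed length, which is the point of the suggested wording: the prefix is consumed by whichever parser sets the limit.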
- currently after trimming the value of the element, we set the valueLength, and then overwrite it after returning from the parse that does the trimming. This results in the wrong value for value length. This fixes it by checking if the valueLength has already been set, and only setting it in SpecifiedLengthParserBase.parse if it hasn't
- add asserts to setAbsStartPos0bInBits and setAbsEndPos0bInBits to ensure they're not being overwritten
- do not capture the value lengths of choices/complexTypes in specified length parser base
- fix bug where padding is being added around prefixed length element (DAFFODIL-2943) by changing CaptureLengthRegion to wrap around contentlengthStart and padding
- fix bug where we use the valuelength to calculate the prefix length, according to the spec it should be the content length
- fix bug where we were missing return after PE for Out of Range Binary Integers (DAFFODIL-2942)
- refactor Prefixed parsers to use state's bitLimit to get the prefix length (BitLengthFromBitLimitMixin) since the specifiedLengthPrefixedParser will take care of parsing the prefix length
- refactored Prefixed unparsers to not try to unparse prefix length since that is taken care of by SpecifiedLengthPrefixedUnparser
- refactored prefixed parsers and unparsers to remove unused prefixed length parser related members
- rename custom prefixedlength parsers to *BitLengthParsers to more accurately reflect what they're doing
- rename custom prefixedlength unparsers to MinimumLengthUnparsers to more accurately reflect what they're doing
- add tests

DAFFODIL-2658
419793f to 018570f
ChatGPT thinks that yes there are issues with CodeCov and Scala code:
https://chatgpt.com/share/67a28042-0944-800f-b2e5-95c1856c91d6
On Tue, Feb 4, 2025 at 1:55 PM olabusayoT wrote:
In daffodil-core/src/main/scala/org/apache/daffodil/core/grammar/ElementBaseGrammarMixin.scala:
> + }
+ case LengthKind.Implicit if isSimpleType && primType == PrimType.String =>
+ new SpecifiedLengthImplicitCharacters(this, body, this.maxLength.longValue)
+
+ case LengthKind.Implicit if isSimpleType && primType == PrimType.HexBinary =>
+ new SpecifiedLengthImplicit(this, body, this.maxLength.longValue * bitsMultiplier)
+ case LengthKind.Implicit
+ if isSimpleType && impliedRepresentation == Representation.Binary =>
+ new SpecifiedLengthImplicit(this, body, implicitBinaryLengthInBits)
+ case LengthKind.Implicit if isComplexType =>
+ body // for complex types, implicit means "roll up from the bottom"
+ case LengthKind.EndOfParent if isComplexType =>
+ notYetImplemented("lengthKind='endOfParent' for complex type")
+ case LengthKind.EndOfParent =>
+ notYetImplemented("lengthKind='endOfParent' for simple type")
+ case _ => {
Maybe there's a bug with code coverage, hexBinary_bits_be_msbf definitely hits the default case
DAFFODIL-2658