Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve performance of padding removal when parsing #1134

Merged

Conversation

stevedlawrence
Copy link
Member

The current algorithm to remove right padding of left justified strings first reverses the String, removes leading pad characters using dropWhile, and then reverses the result. The two reverses are linear in the length of the String, and requires allocating multiple String instances and copying characters from one to the other. And this is done regardless of how many, if any, pad chars exist in the String. This logic is very clear, but is fairly inefficient, enough to show up while profiling.

To improve performance, this rewrites the algorithm to scan through the String in reverse to find the index of the last pad character and then uses the substring() function to create a new String with those pad characters removed. This is now linear in the number of pad characters in a String instead of the full length of the string. Additionally, the use of substring() avoids character copies, since it just allocates a new String using the same underlying String value but with different indices.

I have not looked into detail how scala implements dropWhile() for Strings (skimming the code, it looks like it will allocate a new String and copy characters), but for consistency and maximum performance, this also updates the algorithm that removes left padding of right justified strings to use similar logic as the new right padding algorithm. By using substring() we should avoid possible copies.

In one test with lots of left justified strings, many of which are padded, this saw about a 15% improvement in parse times (excluding infoset creating using the null infoset outputter), and padding removal no longer shows up while profiling.

DAFFODIL-2868

Copy link
Contributor

@mbeckerle mbeckerle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Surprising the codecov gaps. It's possible that removeRightPadding can't be called unless there is a parsing pad character. I.e., it's actually an impossible code path.

In that case I would change it from an if to an Assert.invariant(parsingPadChar.nonEmpty).

Copy link
Contributor

@tuxji tuxji left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Well, I searched removeRightPadding within the entire file, and the last function is the public entry point: def trimByJustification(str: String): String. The codecov gaps are probably due to parsingPadChar always getting set by default in Daffodil code.

@stevedlawrence
Copy link
Member Author

parsingPadChar comes from here:

https://github.com/apache/daffodil/blob/main/daffodil-core/src/main/scala/org/apache/daffodil/core/grammar/primitives/Padded.scala#L48-L52

If textTrimKind is "none", then parsingPadChar is None and the justification is forced to None, so we'll never call remove*Padding and hit the isEmpty true block.

If textTrimKind is not "none", then we require a pad character, so whatever remove*Padding function is called (based on the justification), we will have a parsingPadChar defined and hit the isEmpty false block.

So it is reasonable to assert here--it's not possible for parsingPadChar to be empty if a remove*Padding function is called. I'll make the change.

The current algorithm to remove right padding of left justified strings
first reverses the String, removes leading pad characters using
dropWhile, and then reverses the result. The two reverses are linear in
the length of the String, and requires allocating multiple String
instances and copying characters from one to the other. And this is done
regardless of how many, if any, pad chars exist in the String. This
logic is very clear, but is fairly inefficient, enough to show up while
profiling.

To improve performance, this rewrites the algorithm to scan through the
String in reverse to find the index of the last pad character and then
uses the substring() function to create a new String with those pad
characters removed. This is now linear in the number of pad characters
in a String instead of the full length of the string. Additionally, the
use of substring() avoids character copies, since it just allocates a
new String using the same underlying String value but with different
indices.

I have not looked into detail how scala implements dropWhile() for
Strings (skimming the code, it looks like it will allocate a new String
and copy characters), but for consistency and maximum performance, this
also updates the algorithm that removes left padding of right justified
strings to use similar logic as the new right padding algorithm. By
using substring() we should avoid possible copies.

In one test with lots of left justified strings, many of which are
padded, this saw about a 15% improvement in parse times (excluding
infoset creating using the null infoset outputter), and padding removal
no longer shows up while profiling.

DAFFODIL-2868
@stevedlawrence stevedlawrence force-pushed the daffodil-2868-performance-padding branch from 10cdb9e to 4f28f2b Compare January 9, 2024 12:40
@stevedlawrence stevedlawrence merged commit f4996f0 into apache:main Jan 9, 2024
10 checks passed
@stevedlawrence stevedlawrence deleted the daffodil-2868-performance-padding branch January 9, 2024 14:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants