Improve performance of padding removal when parsing #1134
Merged
The current algorithm to remove right padding of left justified strings first reverses the String, removes leading pad characters using dropWhile, and then reverses the result. Both reverses are linear in the length of the String and require allocating multiple String instances and copying characters from one to the other, and this happens regardless of how many pad characters, if any, exist in the String. This logic is very clear but fairly inefficient, enough to show up while profiling.
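For reference, the reverse/dropWhile/reverse approach described above can be sketched roughly as follows. The function name and signature are hypothetical, chosen for illustration only; this is not the actual Daffodil code.

```scala
// Hypothetical helper illustrating the old approach; not the real Daffodil source.
// Each reverse walks the whole String and allocates a new one, even when
// the input contains no pad characters at all.
def removeRightPaddingOld(str: String, padChar: Char): String =
  str.reverse.dropWhile(_ == padChar).reverse
```

Note that the cost here depends on the full length of the String, not on how much padding it carries.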
To improve performance, this rewrites the algorithm to scan the String in reverse to find the index where the trailing pad characters begin, and then uses the substring() function to create a new String without those pad characters. The scan is now linear in the number of pad characters in a String instead of the full length of the String. Additionally, substring() does at most one copy, of only the characters that are kept, and on current JVMs it returns the original String as-is when there is nothing to remove.
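A minimal sketch of the reverse-scan idea, assuming a single pad character and a hypothetical function name (the actual change in this PR may be structured differently):

```scala
// Hypothetical helper illustrating the reverse scan; not the real Daffodil source.
def removeRightPadding(str: String, padChar: Char): String = {
  // `end` is the exclusive end index of the content to keep.
  var end = str.length
  // Walk backwards only across the trailing run of pad characters.
  while (end > 0 && str.charAt(end - 1) == padChar) end -= 1
  str.substring(0, end)
}
```

The loop stops at the first non-pad character from the right, so pad characters in the middle of the String are left alone.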
I have not looked in detail at how Scala implements dropWhile() for Strings (skimming the code, it looks like it allocates a new String and copies characters), but for consistency and maximum performance, this also updates the algorithm that removes left padding of right justified strings to use logic similar to the new right padding algorithm. Using substring() should avoid any extra copies.
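The left padding case mirrors the right padding one, scanning forward instead of backward. Again, this is only a sketch with a hypothetical name and signature, not the code from this PR:

```scala
// Hypothetical helper illustrating the forward scan; not the real Daffodil source.
def removeLeftPadding(str: String, padChar: Char): String = {
  val len = str.length
  // `start` is the index of the first character to keep.
  var start = 0
  // Walk forwards only across the leading run of pad characters.
  while (start < len && str.charAt(start) == padChar) start += 1
  str.substring(start)
}
```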
In one test with lots of left justified strings, many of which are padded, this gave about a 15% improvement in parse times (excluding infoset creation by using the null infoset outputter), and padding removal no longer shows up while profiling.
DAFFODIL-2868