Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add char[] versions for NumberInput parseFloat, parseDouble, parseBigInteger #1229

Closed
devinrsmith opened this issue Feb 29, 2024 · 20 comments
Closed
Labels
2.18 Issues planned at earliest for 2.18

Comments

@devinrsmith
Copy link

I've got a high performance streaming use case where I'm handling some of the lower-level details. For example, translating a VALUE_STRING into a double. I'm relying on NumberInput with the appropriate useFastParser config, and would like to avoid extraneous String allocation when possible. The BigDecimal case is currently the most complete, as it works with JsonParser#hasTextCharacters:

static BigDecimal parseStringAsBigDecimal(JsonParser parser) throws IOException {
    final boolean useFastParser = parser.isEnabled(StreamReadFeature.USE_FAST_BIG_NUMBER_PARSER);
    return parser.hasTextCharacters()
            ? NumberInput.parseBigDecimal(parser.getTextCharacters(), parser.getTextOffset(), parser.getTextLength(), useFastParser)
            : NumberInput.parseBigDecimal(parser.getText(), useFastParser);
}

The other similar cases aren't able to take advantage since the underlying char[] parsing methods aren't exposed:

static float parseStringAsFloat(JsonParser parser) throws IOException {
    // TODO: improve when parser.hasTextCharacters()
    return NumberInput.parseFloat(parser.getText(), parser.isEnabled(StreamReadFeature.USE_FAST_DOUBLE_PARSER));
}

static double parseStringAsDouble(JsonParser parser) throws IOException {
    // TODO: improve when parser.hasTextCharacters()
    return NumberInput.parseDouble(parser.getText(), parser.isEnabled(StreamReadFeature.USE_FAST_DOUBLE_PARSER));
}

static BigInteger parseStringAsBigInteger(JsonParser parser) throws IOException {
    // TODO: improve when parser.hasTextCharacters()
    return NumberInput.parseBigInteger(parser.getText(), parser.isEnabled(StreamReadFeature.USE_FAST_BIG_NUMBER_PARSER));
}

It would be great to add char[] versions NumberInput#parseFloat, NumberInput#parseDouble, and NumberInput#parseBigInteger.

@cowtowncoder
Copy link
Member

cowtowncoder commented Feb 29, 2024

Might make sense, esp. if performance benchmark shows meaningful improvement from avoided allocations (or rather, gc time). Challenges:

  1. JDK implementation only takes String (which is probably why there's no char[] taking alternative), so might end up being no-win for default case
  2. Number of permutations wrt fast/JDK parser (number of methods we have with overloads)

Then again, if and when we are moving to default to FastDoubleParser, it could be a win. Would def want to know something about improvement tho.

Anyway, +1 for experimentation.

@devinrsmith
Copy link
Author

Yeah. At least in the case of int/long, the JDK has CharSequence parsing support:

static int parseStringAsInt(JsonParser parser) throws IOException {
    if (parser.hasTextCharacters()) {
        final int len = parser.getTextLength();
        final CharSequence cs = CharBuffer.wrap(parser.getTextCharacters(), parser.getTextOffset(), len);
        return Integer.parseInt(cs, 0, len, 10);
    } else {
        return Integer.parseInt(parser.getText());
    }
}

A bit surprising the JDK is missing similar methods for parseFloat / parseDouble.

Would a PR adding these new methods to NumberInput be welcome, or would you need to see performance numbers first? I'm not planning to have anything in the library call them; although, I suspect there may be opportunities to improve JsonParser#getValueAsDouble and related in these regards.

@cowtowncoder
Copy link
Member

cowtowncoder commented Feb 29, 2024

int/long is sort of less relevant as we implement optimized decoding ourselves and do not rely on JDK (for cases that matter -- parsing int/long from String is not on critical path).
And wrapping char[] in CharSequence is unlikely to have much if any performance benefits: might be slower than constructing Strings.

As to adding methods... I think performance numbers would be good. Can be stand-along micro-benchmark, no need to be integrated via parser (altho obv integrated one would be the ultimate thing).
It is too easy to try to add speculative improvements that have no real effect.

The reason I am slightly skeptical is that the real heavy part for double and float is decimal-to-binary decoding of actual numeric value; it's much more expensive than int/long f.ex.

@cowtowncoder
Copy link
Member

Oh btw, forgot to say -- given that you do have high-performance streaming use case, this is probably quite relevant. So just to make sure: I am not against improvements like this. I just want to emphasize verifying improvements; in this case showing for your actual use case (replica thereof) would be absolutely awesome.

I actually have -- interestingly enough -- need for fast float parsing for Vector DB use case. So this is something I could quite likely benefit from as well, now that I think about it :)

@pjfanning
Copy link
Member

You can use https://github.com/wrandelshofer/FastDoubleParser directly - It has support for Char Array, Char Sequence and String.

We can update NumberInput too but I'd prefer if that was done with perf tests that prove that it improves Jackson performance.

I'd prefer not to delay the 2.17 release so any Jackson changes in this area are likely not to be released until 2.18 or beyond.

@cowtowncoder
Copy link
Member

One thing I may have missed -- thank you @pjfanning for sort of hinting at -- NumberInput is primarily meant to be used by Jackson itself. It is not meant as utility for code outside Jackson.
So improvements need to be something that JsonParser implementations use.

And yeah, definitely none of changes would go in 2.17 but 2.18 or later.

@devinrsmith
Copy link
Author

I understand there is nuance in assuming "public" means "public API". Looking through some related comments though, rallyhealth/weePickle#117 (comment) @pjfanning seems to suggest that NumberInput is part of the "public API"?

I'm very familiar with https://github.com/wrandelshofer/FastDoubleParser (we use it as part of https://github.com/deephaven/deephaven-csv), it's great :D. Good point though, I can use it directly in the meantime.

@pjfanning
Copy link
Member

I understand there is nuance in assuming "public" means "public API". Looking through some related comments though, rallyhealth/weePickle#117 (comment) @pjfanning seems to suggest that NumberInput is part of the "public API"?

I'm very familiar with https://github.com/wrandelshofer/FastDoubleParser (we use it as part of https://github.com/deephaven/deephaven-csv), it's great :D. Good point though, I can use it directly in the meantime.

The methods that we have in NumberInput - we can't stop you using them. We are not going to add new methods to NumberInput unless they are useful for the internal Jackson code.

@cowtowncoder
Copy link
Member

cowtowncoder commented Feb 29, 2024

Exactly, what @pjfanning said: primary user is Jackson itself, and anything that benefits it is potentially good idea. Making it more useful for other use cases is not a priority and could be considered counter-productive (due to added code to maintain, or to limit possibility of changes to API, implementation).

So, I don't consider NumberInput to be part of Public API.

And even more importantly: embedded/shaded FastDoubleParser is definitely not to be used by anyone directly (but acceptable via NumberInput I guess).

@devinrsmith
Copy link
Author

devinrsmith commented Feb 29, 2024

Thanks for the additional context.

Ultimately, I think I'll be making a case for additional methods on JsonParser, something like:

/**
 * Assumes {@link JsonToken#VALUE_STRING}, parses the string as a <b>float</b>.
 */
public float getStringAsFloat() throws IOException;

// ...
public double getStringAsDouble() throws IOException;

// ...
public BigInteger getStringAsBigInteger() throws IOException;

// ...
public BigDecimal getStringAsBigDecimal() throws IOException;

I would assume that getValueAsFloat/getValueAsDouble ID_STRING cases would delegate to respective impl.

Edit:
Probably getStringAsInt and getStringAsLong as well.

@cowtowncoder
Copy link
Member

There are already getValueAsXxx() for int, long, double, fwtw. So naming should be followed.
Other than that I am quite ambivalent on whether supporting such use is good or not (from bigger picture perspective). But can definitely be discussed as part of proposal.

@devinrsmith
Copy link
Author

I've got some of my own quantitative numbers motivating this. I'm essentially seeing an uplift from 14mm doubles / second -> 20mm doubles / second when using ch.randelshofer.fastdoubleparser.JavaDoubleParser from getText to getTextCharacters, and an order of magnitude less GC (count, total reclaimed, total GC time). This is parsing through an array of a billion strings that are Math.random() doubles. Java 21, -Xmx1g -XX:CompileCommand=inline,java/lang/String.charAt (of course, this is on my machine with some overhead from our engine, and only representative of this particularly setup, etc):

getText

getTextCharacters

@devinrsmith
Copy link
Author

(Sorry if you saw my previous comment which I deleted, it was incorrect.)

I think there's an opportunity to also improve the performance of JsonParser#getDoubleValue, which seems like it always goes through String instead of char[]? Testing out an array of doubles (as opposed to an array of double strings), the performance matches up with my slower getText results from above.

Essentially, I'm in a position where the JsonParser API lets me parse double strings faster than doubles, which is an odd position to be in. (Unless I'm mistaken and there is a way for me to get char[] from VALUE_NUMBER_FLOAT contexts.)

@cowtowncoder
Copy link
Member

@devinrsmith I am 100% in favor of ensuring that getDoubleValue() on JsonToken.VALUE_NUMBER_FLOAT is optimal. So if and when we are missing method in NumberInput to support that, I'd be in favor of adding a method or methods as necessary.

(for 2.18 branch once we get there)

@cowtowncoder
Copy link
Member

cowtowncoder commented Mar 1, 2024

Ok. This is awkward: I was assuming getDoubleValue() and getFloatValue() would invoke efficient methods in FastDoubleParser (assuming fast processing enabled). But due to some refactoring in 2.16 or 2.15, this is no longer the case: I think deferred processing was applied too widely and as a result a String is constructed proactively and used from thereon. I don't think that is necessary: fast method should be called from ParserBase._parseSlowFloat() via TextBuffer _textBuffer methods contentsAsFloat() and contentsAsDouble().
(and looks like TextBuffer.contentsAsFloat() needs to be rewritten to actually use buffered content, if any).

This should be fixed, but timeline is such that it's too late for 2.17 -- after rc1 I think there's too much risk for edge case handling to go awry (deferred number handling was added for a reason and although I hope unit test coverage is sufficient I am not confident).
So I'll mark this as 2.18 thing to look into.

@cowtowncoder cowtowncoder added the 2.18 Issues planned at earliest for 2.18 label Mar 1, 2024
@cowtowncoder
Copy link
Member

With #1230 now merged, we could tackle this issue.

But looking at ParserBase, logic is rather muddy: deferral of decoding increase has essentially lead ALL FP access to go via intermediate String _numberString which is now accurate but sub-optimal.

Conceptually, there shouldn't be a problem with more eager access: if and when JsonParser.getDoubleValue() is called, for example, value can now be materialized as double -- we do have other accessors that do not force evaluation, specifically JsonParser.getNumberValueDeferred(), used for buffering.
But the trick is untangle methods the way they are done currently...

@cowtowncoder
Copy link
Member

Re-created as #1284 to address performance issue; closing this one since I do not think new methods are needed, just better implementations of existing ones.

@cowtowncoder
Copy link
Member

@devinrsmith Now that #1284 is complete in 2.18, getDoubleValue() in 2.18.0-SNAPSHOT should avoid String allocation: and my initial testing does support this, there is a nice boost.
I also uncovered a bug in #1230 when testing and not first seeing speed up.

So... if it was possible for you to test 2.18.0-SNAPSHOT (from Sonatype snapshot repo or build locally), that'd be great.

@devinrsmith
Copy link
Author

@cowtowncoder Looks great, thanks! Quick test reading 100mm doubles out of an array, it looks like 2.18.0-SNAPSHOT parses doubles without extraneous allocation; went from ~6GiB to <100MiB.
Screenshot from 2024-06-26 19-18-28

@cowtowncoder
Copy link
Member

@devinrsmith Excellent! Thank you for super fast follow-up.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
2.18 Issues planned at earliest for 2.18
Projects
None yet
Development

No branches or pull requests

3 participants