Improve performance of number parsing #11228

radeusgd · 2024-10-01T16:14:00Z

Currently the NumberParser relies on a really complicated Regex that is not as efficient as we want.

We want to modify it to use a custom solution that should be more efficient.

TODO:

What are the goals exactly?
Do we have benchmarks that measure the performance, so that we can compare before and after? For example, parsing a column of Text values to Integer, or CSV reading?

jdunkerley · 2024-11-06T21:18:25Z

New NumberParser is in #11499
This ticket should add Java unit tests to it and expose the functionality in Standard.Base so we can parse with the same capabilities there as well.

https://github.com/jdunkerley/scratch-code/blob/main/scratch-java/src/test/java/uk/co/jdunkerley/scratch/parser/DoublesTest.java
https://github.com/jdunkerley/scratch-code/blob/main/scratch-java/src/test/java/uk/co/jdunkerley/scratch/parser/IntegersTest.java
https://github.com/jdunkerley/scratch-code/blob/main/scratch-java/src/test/java/uk/co/jdunkerley/scratch/parser/SeparatorsTest.java

jdunkerley · 2024-11-11T17:57:51Z

Better integration of problems with the new Parser:

Propagate the message from NumberParseFailure in some way.
Get friendlier error reporting: it would be great to save somewhere (perhaps in a SeparatorParseResult) which character was encountered that caused the 'invalid separators' failure.
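
A rough sketch of what carrying the offending character could look like. Everything here is hypothetical: the `SeparatorReportSketch` class, its `Result` record and the `check` method are illustration only and are not the actual SeparatorParseResult or the parser from #11499. The scan is also deliberately simplistic; it only accepts ASCII digits plus the two separators passed in.

```java
public class SeparatorReportSketch {
  /** Hypothetical result type: offender is null and position is -1 when the text is acceptable. */
  public record Result(boolean valid, int position, Character offender) {
    /** Builds a message that names the character instead of a generic failure. */
    public String message() {
      return valid ? "ok" : "Invalid separator '" + offender + "' at position " + position;
    }
  }

  /** Returns the first character that is neither an ASCII digit nor one of the allowed separators. */
  public static Result check(String text, char thousands, char decimal) {
    for (int i = 0; i < text.length(); i++) {
      char c = text.charAt(i);
      boolean digit = c >= '0' && c <= '9';
      if (!digit && c != thousands && c != decimal) {
        return new Result(false, i, c);
      }
    }
    return new Result(true, -1, null);
  }
}
```

With something of this shape, the message from the failure could be propagated as-is, e.g. `SeparatorReportSketch.check("1 000;5", ' ', '.').message()`.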

jdunkerley · 2024-11-12T13:29:44Z

Exponential Notation:
1E+3
1E ==> 1 with symbol E
1,000,000.456E6
Rule was (0 <= x < 10)
1,000,000.456E6 ==> 1,000,000.456 and stop at E
23E6 ==> 23000000
In exponent mode, allow a single decimal point separator and a wider range than before (i.e. 0 <= x < 1000).
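
The examples above can be sketched roughly as follows. This is one reading of the rules, not the merged implementation: `ExponentSketch`, `Split`, `split` and `value` are hypothetical names, and it assumes that x refers to the part before the E, that only an uppercase E is recognised, and that the exponent must run to the end of the text. When the text does not qualify for exponent mode, the split stops at the E and leaves the remainder untouched.

```java
import java.math.BigDecimal;

public class ExponentSketch {
  /** exponent is null when the text is not read in exponent mode; rest is what was left unparsed. */
  public record Split(String mantissa, Integer exponent, String rest) {}

  /** Assumed rule: ASCII digits with at most one '.', and an integer part below 1000. */
  static boolean isExponentMantissa(String s) {
    boolean seenPoint = false;
    int digits = 0;
    int intPart = 0;
    for (int i = 0; i < s.length(); i++) {
      char c = s.charAt(i);
      if (c >= '0' && c <= '9') {
        digits++;
        if (!seenPoint) {
          intPart = intPart * 10 + (c - '0');
          if (intPart >= 1000) return false; // outside 0 <= x < 1000
        }
      } else if (c == '.' && !seenPoint) {
        seenPoint = true;
      } else {
        return false; // thousand separators, signs or a second '.' are rejected
      }
    }
    return digits > 0;
  }

  /** Parses an optionally signed integer exponent, or returns null if the tail is not one. */
  static Integer parseExponent(String tail) {
    try {
      return Integer.parseInt(tail);
    } catch (NumberFormatException e) {
      return null;
    }
  }

  public static Split split(String text) {
    int e = text.indexOf('E');
    if (e < 0) return new Split(text, null, "");
    String mantissa = text.substring(0, e);
    Integer exponent =
        isExponentMantissa(mantissa) ? parseExponent(text.substring(e + 1)) : null;
    return exponent == null
        ? new Split(mantissa, null, text.substring(e)) // stop at E
        : new Split(mantissa, exponent, "");
  }

  /** Numeric value of a split that is in exponent mode. */
  public static BigDecimal value(Split s) {
    if (s.exponent() == null) throw new IllegalArgumentException("not in exponent mode");
    return new BigDecimal(s.mantissa()).scaleByPowerOfTen(s.exponent());
  }
}
```

Under these assumptions `split("1E")` keeps `E` as the remainder, `split("1,000,000.456E6")` stops at the E because of the thousand separators, and `value(split("23E6"))` compares equal to 23000000.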

The merged version will be just enough to get it working again; it is not deeply tested.

jdunkerley · 2024-11-12T13:30:10Z

Add a benchmark for parsing a single column and parsing 300 columns.

jdunkerley · 2024-11-13T11:07:43Z

ToDos:

Redundant separator spacing check.

radeusgd self-assigned this Oct 1, 2024

github-project-automation bot added this to Issues Board Oct 1, 2024

github-project-automation bot moved this to ❓New in Issues Board Oct 1, 2024

radeusgd added the -libs and --low-performance labels Oct 1, 2024

jdunkerley mentioned this issue Nov 11, 2024

New NumberParser for Table parsing #11499

Merged

5 tasks
