
Improve gsv performance #11

Merged
merged 23 commits into bcpeinhardt:main from perf-optimisation on Oct 29, 2024

Conversation

giacomocavalieri (Contributor) commented Oct 19, 2024

This PR implements an optimised version of the csv parser used by gsv.
Things I still need to do:

  • Document where the implementation is and isn't compliant with the RFC
  • Decide what to do with parser errors
  • Replace the to_lists implementation with this more performant one
  • Write a bunch of tests
  • See how to make to_dicts more performant
  • See how to make from_lists and from_dicts more performant; I'm sure there's something to gain there as well
  • Since this would be a breaking change, we might as well change the API to use different names like parse (and to_string) and parse_as_dicts (and dicts_to_string)?
  • I don't think I'll do this but it would be cool to set up some kind of performance regression testing since we care about that here

giacomocavalieri force-pushed the perf-optimisation branch 3 times, most recently from 545b7e2 to ede9bcf on October 19, 2024 14:19
lpil (Contributor) commented Oct 25, 2024

Since this would be a breaking change, we could as well change the api to use different names like parse and parse_as_dicts?

What's the breaking change?

giacomocavalieri (Contributor, Author) commented Oct 25, 2024

Since this would be a breaking change, we could as well change the api to use different names like parse and parse_as_dicts?

What's the breaking change?

A couple of things changed:

  • the parser no longer trims fields, so that it's compliant with the RFC
  • the error type has changed: instead of a string explaining the error, it's now a custom type with a variant for each possible error and a byte index pointing at the error's position in the csv (see the sketch below)
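
For illustration, such an error type could look something like the sketch below; the variant names are hypothetical, not necessarily the ones gsv ends up using:

// Illustrative sketch only: a structured error with one variant per failure
// mode, each carrying the byte index where the problem was found.
pub type ParseError {
  // A double quote appeared in the middle of an unquoted field.
  UnescapedQuote(position: Int)
  // The input ended while still inside a quoted field.
  UnclosedEscapedField(start: Int)
}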

bcpeinhardt (Owner) commented

Since three of the four remaining tasks in the checklist are performance optimization, I wonder if we can reduce the scope outlined for this PR to get it merged sooner.
Regarding the names, I'm not tied to them, but is there a particular reason you want to change them?

@giacomocavalieri and I already talked separately about how grateful I am he's putting in this work as I'm quite busy these days, but once again Jak you're the best ❤️

giacomocavalieri (Contributor, Author) commented

Regarding the names, I'm not tied to them but is there a particular reason you want to change them?

parse sounds like a good fit for a function that takes a string and turns it into some structured data, and loads of other functions in both the stdlib and other packages follow the same convention: int.parse/int.to_string, float.parse/float.to_string, json.parse/json.to_string, jot.parse, tom.parse

Since here we're working with both lists of lists and dictionaries, we could have a parse for the most common one and parse_to_x for the other (parse_to_lists, parse_to_dicts)?
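
For illustration, hypothetical signatures for that naming scheme could look like this (the return types simply mirror the existing to_lists/to_dicts; nothing here is final):

import gleam/dict.{type Dict}

// Hypothetical signatures only, not an agreed-upon API.
pub fn parse(csv: String) -> Result(List(List(String)), ParseError) {
  todo
}

pub fn parse_to_dicts(csv: String) -> Result(List(Dict(String, String)), ParseError) {
  todo
}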

giacomocavalieri marked this pull request as ready for review on October 27, 2024 12:59
giacomocavalieri (Contributor, Author) commented Oct 27, 2024

Ok this is ready for review. A recap of what I did:

  • The parsing functions now return a structured error instead of a plain string
  • The parser no longer trims fields nor discards whitespace only fields
  • I've updated various bits of documentation to add examples and document how this implementation differs from the original csv RFC
  • I've slightly reworked and formatted the README's code example
  • I've made the parsing more efficient (~6.5 times faster):
    • It's now implemented as a single-pass parser instead of first tokenising the string
    • It works through the string byte-by-byte instead of first splitting it into graphemes
    • Instead of building new strings for each field it takes slices of the original one, so it's not only more efficient but also way less memory-hungry
  • I've also made the from_lists function more efficient: before, it went over the lists multiple times by chaining list.maps; now it's a recursive function that traverses the data once, building the output as it goes (a sketch of the idea follows)
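
A minimal sketch of that single-traversal shape, assuming simplified quoting rules (this is illustrative, not gsv's actual implementation):

import gleam/list
import gleam/string

// Illustrative only: one recursive pass over the rows, building the output
// string as it goes instead of chaining several list.map passes.
fn rows_to_csv(rows: List(List(String)), acc: String) -> String {
  case rows {
    [] -> acc
    [row, ..rest] -> {
      let line = string.join(list.map(row, escape_field), ",")
      rows_to_csv(rest, acc <> line <> "\r\n")
    }
  }
}

// Quote a field only when it contains a separator or a double quote,
// doubling any quotes inside it.
fn escape_field(field: String) -> String {
  case string.contains(field, ",") || string.contains(field, "\"") {
    True -> "\"" <> string.replace(field, "\"", "\"\"") <> "\""
    False -> field
  }
}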

And what I didn't do and will leave out of this PR:

  • I haven't renamed the functions; I'll leave that up to you if you want that
  • I haven't looked all that much into making to_dicts more efficient
  • I haven't set up any performance regression test

bcpeinhardt (Owner) commented

This is fantastic work. I'm really glad you did this, as Gleam needs a production-ready CSV parser.
I'm making notes on this PR that include some things to consider implementing, but I'm not requesting that you necessarily be the one to do said work, and I'm extremely grateful for all the work you've already done!

I haven't set up any performance regression test

Can you tell me how you're testing performance now? Maybe I can set this up :)

I don't think we should attempt to be RFC compliant to be honest, specifically because of this section:

Within the header and each record, there may be one or more
fields, separated by commas. Each line should contain the same
number of fields throughout the file. Spaces are considered part
of a field and should not be ignored. The last field in the
record must not be followed by a comma.

The only errors I see in the ParseError type have to do with quotes, which makes me think we're treating missing fields as not a big deal (technically there are situations where a row might need ,,,,, at the end to produce the correct number of empty fields to complete the row). Some tests outlining how the to_lists parsing behaves with rows of varying lengths might be a good idea, even though we kind of get it from the dicts-with-missing-headers tests.
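
For illustration, such a test might look like the sketch below; the expected value is a guess at the permissive behaviour (keeping ragged rows as-is), not gsv's confirmed output:

import gleeunit/should
import gsv

// Hypothetical test: a permissive parser would keep rows of differing
// lengths untouched rather than erroring or padding them.
pub fn varying_row_lengths_test() {
  "a,b,c\n1,2\n"
  |> gsv.to_lists
  |> should.equal(Ok([["a", "b", "c"], ["1", "2"]]))
}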

I think trying to be overly compliant with stuff like this will make working with "real-world" csv hard (it already happened to @lpil on stream with liberapay's csv output, which is why I removed it as an error case and instead opted to be permissive).

I actually think deliberately changing the names would be a good idea. Even though on a breaking change we can technically change how functions operate, I'd prefer to force folks to read the changelog and realize they need to start trimming fields themselves.
parse and parse_dicts work for me, as I don't think we'll ever end up wanting a custom Csv type.
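
For callers that relied on the old behaviour, that migration could be as small as this sketch (illustrative only):

import gleam/list
import gleam/string

// Hypothetical helper: restore the old trimming behaviour by trimming every
// field after parsing.
fn trim_all(rows: List(List(String))) -> List(List(String)) {
  list.map(rows, list.map(_, string.trim))
}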

giacomocavalieri (Contributor, Author) commented

Can you tell me how you're testing performance now? Maybe I can set this up :)

It's really crude: I switch between branches, run this snippet of code, and look at the output, using a 10_000-line csv file:

import gleam/io
import gleamy/bench
import gsv
import simplifile

pub fn main() {
  // Load a sample csv with 10_000 rows.
  let assert Ok(csv) =
    simplifile.read("/Users/giacomocavalieri/Desktop/customers-10000.csv")

  // Parse it once up front so only from_lists is being measured.
  let assert Ok(rows) = gsv.to_lists(csv)

  // Benchmark from_lists for 10 seconds after a 3 second warmup.
  bench.run(
    [bench.Input("10_000 rows", rows)],
    [bench.Function("gsv.from_lists", gsv.from_lists(_, ",", gsv.Windows))],
    [bench.Duration(10_000), bench.Warmup(3000)],
  )
  |> bench.table([bench.Min, bench.Mean, bench.Max, bench.P(95)])
  |> io.println
}

I don't think we should attempt to be RFC compliant to be honest, specifically because of this section

You're right! The parser totally disregards that section of the RFC, so I've changed the doc to document this difference too

Some tests outlining how the to_lists parsing behaves with rows of varying lengths might be a good idea

I've added a test!

bcpeinhardt (Owner) commented

I'm going to go ahead and merge this :) I'm gonna hold off on publishing just a little while until I have time to explore the performance aspect and write up a little migration guide (probably a paragraph, but still important), but it should hopefully be published this weekend at the latest.

bcpeinhardt merged commit 3cfa5b7 into bcpeinhardt:main on Oct 29, 2024
1 check passed
giacomocavalieri deleted the perf-optimisation branch on October 29, 2024 07:40