
Improve gsv performance #11

Merged
merged 23 commits into bcpeinhardt:main from perf-optimisation on Oct 29, 2024

Conversation

giacomocavalieri (Contributor) commented Oct 19, 2024

This PR implements an optimised version of the csv parser used by gsv.
Things I still need to do:

  • Document where the implementation is and isn't compliant with the RFC
  • Decide what to do with parser errors
  • Replace the to_lists implementation with this more performant one
  • Write a bunch of tests
  • See how to make to_dicts more performant
  • See how to make from_lists and from_dicts more performant; I'm sure there's something to gain there as well
  • Since this would be a breaking change, we might as well change the API to use different names like parse (and to_string) and parse_as_dicts (and dicts_to_string)?
  • I don't think I'll do this but it would be cool to set up some kind of performance regression testing since we care about that here

giacomocavalieri force-pushed the perf-optimisation branch 3 times, most recently from 545b7e2 to ede9bcf on October 19, 2024 14:19
lpil (Contributor) commented Oct 25, 2024

Since this would be a breaking change, we could as well change the api to use different names like parse and parse_as_dicts?

What's the breaking change?

giacomocavalieri (Contributor, Author) commented Oct 25, 2024

Since this would be a breaking change, we could as well change the api to use different names like parse and parse_as_dicts?

What's the breaking change?

A couple of things changed:

  • the parser no longer trims fields, so that it's compliant with the RFC
  • the error type has changed: instead of a string explaining the error, it's now a custom type with a variant for each possible error and a byte index pointing at the error's position in the csv (see the sketch below)
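
For illustration, such an error type could look something like the sketch below; the variant names are hypothetical, not necessarily the ones gsv ends up using:

// Illustrative sketch only: a structured error with one variant per failure
// mode, each carrying the byte index where the problem was found.
pub type ParseError {
  // A double quote appeared in the middle of an unquoted field.
  UnescapedQuote(position: Int)
  // The input ended while still inside a quoted field.
  UnclosedEscapedField(start: Int)
}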

bcpeinhardt (Owner) commented

Since three of the four remaining tasks in the checklist are performance optimization, I wonder if we can reduce the scope outlined for this PR to get it merged sooner.
Regarding the names, I'm not tied to them, but is there a particular reason you want to change them?

@giacomocavalieri and I already talked separately about how grateful I am he's putting in this work as I'm quite busy these days, but once again Jak you're the best ❤️

giacomocavalieri (Contributor, Author) commented

Regarding the names, I'm not tied to them but is there a particular reason you want to change them?

parse sounds like a good fit for a function that takes a string and turns it into some structured data, and loads of other functions in both the stdlib and other packages follow the same convention: int.parse/int.to_string, float.parse/float.to_string, json.parse/json.to_string, jot.parse, tom.parse

Since here we're working with both lists of lists and dictionaries, we could have a parse for the most common one and parse_to_x for the other (parse_to_lists, parse_to_dicts)?
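
For illustration, hypothetical signatures for that naming scheme could look like this (the return types simply mirror the existing to_lists/to_dicts; nothing here is final):

import gleam/dict.{type Dict}

// Hypothetical signatures only, not an agreed-upon API.
pub fn parse(csv: String) -> Result(List(List(String)), ParseError) {
  todo
}

pub fn parse_to_dicts(csv: String) -> Result(List(Dict(String, String)), ParseError) {
  todo
}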

giacomocavalieri marked this pull request as ready for review on October 27, 2024 12:59
giacomocavalieri (Contributor, Author) commented Oct 27, 2024

Ok this is ready for review. A recap of what I did:

  • The parsing functions now return a structured error instead of a plain string
  • The parser no longer trims fields nor discards whitespace only fields
  • I've updated various bits of documentation to add examples and document how this implementation differs from the original csv RFC
  • I've slightly reworked and formatted the README's code example
  • I've made the parsing more efficient (~6.5 times faster):
    • It's now implemented as a single-pass parser instead of first tokenising the string
    • It works through the string byte-by-byte instead of first splitting it into graphemes
    • Instead of building new strings for each field it takes slices of the original one, so it's not only more efficient but also way less memory-hungry
  • I've also made the from_lists function more efficient: before, it went over the lists multiple times by chaining list.maps; now it's a recursive function that traverses the data once, building the output as it goes (a sketch of the idea follows)
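
A minimal sketch of that single-traversal shape, assuming simplified quoting rules (this is illustrative, not gsv's actual implementation):

import gleam/list
import gleam/string

// Illustrative only: one recursive pass over the rows, building the output
// string as it goes instead of chaining several list.map passes.
fn rows_to_csv(rows: List(List(String)), acc: String) -> String {
  case rows {
    [] -> acc
    [row, ..rest] -> {
      let line = string.join(list.map(row, escape_field), ",")
      rows_to_csv(rest, acc <> line <> "\r\n")
    }
  }
}

// Quote a field only when it contains a separator or a double quote,
// doubling any quotes inside it.
fn escape_field(field: String) -> String {
  case string.contains(field, ",") || string.contains(field, "\"") {
    True -> "\"" <> string.replace(field, "\"", "\"\"") <> "\""
    False -> field
  }
}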

And what I didn't do and will leave out of this PR:

  • I haven't renamed the functions; I'll leave that up to you if you want that
  • I haven't looked all that much into making to_dicts more efficient
  • I haven't set up any performance regression test

bcpeinhardt (Owner) commented

This is fantastic work. I'm really glad you did this, as Gleam needs a production-ready CSV parser.
I'm making notes on this PR that include some things to consider implementing, but I'm not requesting that you necessarily be the one to do said work, and I'm extremely grateful for all the work you've already done!

I haven't set up any performance regression test

Can you tell me how you're testing performance now? Maybe I can set this up :)

I don't think we should attempt to be RFC compliant to be honest, specifically because of this section:

Within the header and each record, there may be one or more
fields, separated by commas. Each line should contain the same
number of fields throughout the file. Spaces are considered part
of a field and should not be ignored. The last field in the
record must not be followed by a comma.

The only errors I see in the ParseError type have to do with quotes, which makes me think we're treating missing fields as not a big deal (technically there are situations where a row might need ,,,,, at the end to produce the correct number of empty fields to complete the row). Some tests outlining how the to_lists parsing behaves with rows of varying lengths might be a good idea, even though we kind of get it from the dicts-with-missing-headers tests.
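
For illustration, such a test might look like the sketch below; the expected value is a guess at the permissive behaviour (keeping ragged rows as-is), not gsv's confirmed output:

import gleeunit/should
import gsv

// Hypothetical test: a permissive parser would keep rows of differing
// lengths untouched rather than erroring or padding them.
pub fn varying_row_lengths_test() {
  "a,b,c\n1,2\n"
  |> gsv.to_lists
  |> should.equal(Ok([["a", "b", "c"], ["1", "2"]]))
}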

I think trying to be overly compliant with stuff like this will make working with "real-world" csv hard (it already happened to @lpil on stream with liberapay's csv output, which is why I removed it as an error case and instead opted to be permissive).

I actually think deliberately changing the names would be a good idea. Even though on a breaking change we can technically change how functions operate, I'd prefer to force folks to read the changelog and realize they need to start trimming fields themselves.
parse and parse_dicts work for me, as I don't think we'll ever end up wanting a custom Csv type.
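
For callers that relied on the old behaviour, that migration could be as small as this sketch (illustrative only):

import gleam/list
import gleam/string

// Hypothetical helper: restore the old trimming behaviour by trimming every
// field after parsing.
fn trim_all(rows: List(List(String))) -> List(List(String)) {
  list.map(rows, list.map(_, string.trim))
}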

giacomocavalieri (Contributor, Author) commented

Can you tell me how you're testing performance now? Maybe I can set this up :)

It's really crude: I switch between branches, run this snippet of code, and look at the output, using a 10_000-line csv file:

import gleam/io
import gleamy/bench
import gsv
import simplifile

pub fn main() {
  // Load a sample csv with 10_000 rows.
  let assert Ok(csv) =
    simplifile.read("/Users/giacomocavalieri/Desktop/customers-10000.csv")

  // Parse it once up front so only from_lists is being measured.
  let assert Ok(rows) = gsv.to_lists(csv)

  // Benchmark from_lists for 10 seconds after a 3 second warmup.
  bench.run(
    [bench.Input("10_000 rows", rows)],
    [bench.Function("gsv.from_lists", gsv.from_lists(_, ",", gsv.Windows))],
    [bench.Duration(10_000), bench.Warmup(3000)],
  )
  |> bench.table([bench.Min, bench.Mean, bench.Max, bench.P(95)])
  |> io.println
}

I don't think we should attempt to be RFC compliant to be honest, specifically because of this section

You're right! The parser totally disregards that section of the RFC, so I've changed the doc to document this difference too

Some tests outlining how the to_lists parsing behaves with rows of varying lengths might be a good idea

I've added a test!

bcpeinhardt (Owner) commented

I'm going to go ahead and merge this :) I'm gonna hold off on publishing just a little while until I have time to explore the performance aspect and write up a little migration guide (probably a paragraph, but still important), but it should hopefully be published this weekend at the latest.

bcpeinhardt merged commit 3cfa5b7 into bcpeinhardt:main on Oct 29, 2024
1 check passed
giacomocavalieri deleted the perf-optimisation branch on October 29, 2024 07:40