Improve gsv performance #11
Conversation
Force-pushed from 545b7e2 to ede9bcf
What's the breaking change?
A couple of things changed:
Since three of the four remaining tasks in the checklist are performance optimisations, I wonder if we can reduce the scope outlined for this PR to get it merged sooner. @giacomocavalieri and I have already talked separately about how grateful I am that he's putting in this work while I'm quite busy these days, but once again Jak you're the best ❤️
Since we're working with both lists of lists and dictionaries here, we could have a …
Force-pushed from c7791de to 55ef9ee
Ok this is ready for review. A recap of what I did:
And what I didn't do and will leave out of this PR:
This is fantastic work, I'm really glad you did this, as Gleam needs a production-ready CSV parser.
Can you tell me how you're testing performance now? Maybe I can set this up :) I don't think we should attempt to be RFC-compliant, to be honest, specifically because of this section:

The only errors I see in the ParseError type have to do with quotes, which makes me think we're treating missing fields as not a big deal (technically there are situations a row might need …). I think trying to be overly compliant with stuff like this will make working with "real-world" CSV hard (it already happened to @lpil on stream with liberapay's CSV output, which is why I removed it as an error case and instead opted to be permissive).

I actually think deliberately changing the names would be a good idea. Even though on a breaking change we can technically change how functions operate, I'd prefer to force folks to read the changelog and realise they need to start trimming fields.
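To illustrate the permissive design described above, here is a hypothetical sketch of parsing a CSV with a ragged (short) row: the input data and the exact shape of the `to_lists` result are assumptions for illustration, not taken from this PR.

```gleam
import gleam/io
import gsv

pub fn main() {
  // The second row has fewer fields than the header; a permissive
  // parser keeps it rather than returning a missing-field error.
  let csv = "name,age,city\nJak,30"

  case gsv.to_lists(csv) {
    Ok(rows) -> io.debug(rows)
    // With the permissive approach, the remaining error cases are
    // the quote-related ones in ParseError.
    Error(_) -> io.println("quote-related parse error")
  }
}
```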
It's really crude: I switch between branches, run this snippet of code, and look at the output with a 10,000-line CSV file:

```gleam
import gleam/io
import gleamy/bench
import gsv
import simplifile

pub fn main() {
  let assert Ok(csv) =
    simplifile.read("/Users/giacomocavalieri/Desktop/customers-10000.csv")
  let assert Ok(rows) = gsv.to_lists(csv)

  bench.run(
    [bench.Input("10_000 rows", rows)],
    [bench.Function("gsv.from_lists", gsv.from_lists(_, ",", gsv.Windows))],
    [bench.Duration(10_000), bench.Warmup(3000)],
  )
  |> bench.table([bench.Min, bench.Mean, bench.Max, bench.P(95)])
  |> io.println
}
```
You're right! The parser totally disregards that section of the RFC, so I've updated the docs to note this difference too.
I've added a test!
I'm going to go ahead and merge this :) I'm gonna hold off on publishing just a little while, until I have time to explore the performance aspect and write up a little migration guide (probably a paragraph, but still important), but hopefully it will be published this weekend at the latest.
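For anyone migrating, the caller-side trimming mentioned earlier in the thread (since the parser no longer trims fields itself) could look something like this; the helper name and the surrounding usage are assumptions for illustration.

```gleam
import gleam/list
import gleam/string

// Hypothetical helper: trim whitespace from every field of every row,
// now that the parser itself is no longer responsible for trimming.
pub fn trim_fields(rows: List(List(String))) -> List(List(String)) {
  list.map(rows, fn(row) { list.map(row, string.trim) })
}
```

Callers would pipe the parsed rows through this helper, e.g. `gsv.to_lists(csv) |> result.map(trim_fields)`, keeping the trimming decision explicit in application code.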
This PR implements an optimised version of the CSV parser used by gsv.
Things I still need to do:

- [ ] replace the `to_lists` implementation with this more performant one
- [ ] make `to_dicts` more performant
- [ ] make `from_lists` and `from_dicts` more performant, I'm sure there's something to gain there as well
- [ ] rename the functions to `parse` (and `to_string`) and `parse_as_dicts` (and `dicts_to_string`)?