Performance testing of different partition options #17
Labels
data: Issues related to creating or updating data, usually on source.coop
get_buildings: Issues related to the get_buildings operations
help wanted: Extra attention is needed
In making the `get-buildings` command I went through a couple of iterations of trying out different formatting, and definitely realized that more row groups than gpq makes by default is better. With the latest scripts I have a way to set the max number of rows per file and also the number of row groups. But I have no idea whether things could be a lot faster if we increased or decreased the row group size, and/or increased or decreased the number of files. The defaults I used were a max of 10 million rows per file and 20,000 rows per group. It'd be great to try out some variations on that. Ideally we'd also experiment with the tradeoff between 'legibility for download' (e.g. partition by country, then admin level 1, like the Google buildings data does) and 'balance of spatial size' (e.g. use the quadkey max-size algorithm entirely, instead of country then quadkey, so we'd have far fewer files overall, but each file would be meaningless to users, who would need to use the tool to download).

The performance I was getting was 20-30 seconds to download a small number of buildings, but that was from just a handful of tests.
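
To make the two knobs concrete, here is a minimal sketch of how a given rows-per-file and rows-per-group setting translates into file and row group counts. The `partition_layout` helper and the `Layout` type are hypothetical (they are not part of the existing scripts), and the sketch assumes files are filled sequentially up to the max before a new one starts, which may not match how the real partitioning balances rows.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Layout:
    files: int                     # number of output files
    row_groups_per_full_file: int  # row groups in a file that hits the max
    total_row_groups: int          # row groups across all files


def partition_layout(total_rows: int,
                     max_rows_per_file: int = 10_000_000,
                     rows_per_group: int = 20_000) -> Layout:
    """Estimate file and row group counts for one partition.

    Hypothetical helper; assumes rows fill each file up to
    max_rows_per_file before spilling into the next file.
    """
    if total_rows < 0 or max_rows_per_file <= 0 or rows_per_group <= 0:
        raise ValueError("row counts must be positive")
    full_files, remainder = divmod(total_rows, max_rows_per_file)
    groups_per_full_file = math.ceil(max_rows_per_file / rows_per_group)
    # The last, partially filled file contributes its own row groups.
    remainder_groups = math.ceil(remainder / rows_per_group)
    return Layout(
        files=full_files + (1 if remainder else 0),
        row_groups_per_full_file=groups_per_full_file,
        total_row_groups=full_files * groups_per_full_file + remainder_groups,
    )


if __name__ == "__main__":
    # Compare a few candidate row group sizes for a 25 million row partition.
    for group_size in (5_000, 20_000, 100_000):
        print(group_size, partition_layout(25_000_000, rows_per_group=group_size))
```

With the defaults above, a full 10 million row file holds 500 row groups, so changing the row group size by a factor of a few changes the amount of footer metadata a reader has to scan by the same factor.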
Ideally we'd have a 'benchmark' command that runs the download against 20-30 locations globally, measures the performance for each of them, and reports the results, so we can easily compare how tweaks to the data affect performance.
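
A rough sketch of what such a benchmark harness could look like, using only the standard library. Everything here is hypothetical: `run_benchmark`, `summarize`, the `Location` and `Timing` types, and the `fetch` callable (which stands in for whatever actually downloads buildings for a location) are not existing parts of the tool, and the coordinates are approximate and illustrative only.

```python
import statistics
import time
from typing import Callable, Dict, Iterable, List, NamedTuple


class Location(NamedTuple):
    name: str
    lon: float
    lat: float


class Timing(NamedTuple):
    name: str
    seconds: float
    ok: bool


def run_benchmark(locations: Iterable[Location],
                  fetch: Callable[[Location], object],
                  clock: Callable[[], float] = time.perf_counter) -> List[Timing]:
    """Time one fetch per location; a failed fetch is recorded, not raised."""
    results = []
    for loc in locations:
        start = clock()
        try:
            fetch(loc)  # assumption: downloads buildings around this location
            ok = True
        except Exception:
            ok = False
        results.append(Timing(loc.name, clock() - start, ok))
    return results


def summarize(results: List[Timing]) -> Dict[str, float]:
    """Report min / median / max over the successful runs."""
    times = [r.seconds for r in results if r.ok]
    summary = {"count": len(times), "failed": len(results) - len(times)}
    if times:
        summary.update(min=min(times),
                       median=statistics.median(times),
                       max=max(times))
    return summary


if __name__ == "__main__":
    # Illustrative locations; a real run would list 20-30 of them.
    sample = [Location("Nairobi", 36.82, -1.29),
              Location("Jakarta", 106.85, -6.21)]
    # Stub fetch so the sketch runs without network access.
    timings = run_benchmark(sample, fetch=lambda loc: time.sleep(0.01))
    for t in timings:
        print(f"{t.name}: {t.seconds:.3f}s ok={t.ok}")
    print(summarize(timings))
```

Passing `fetch` and `clock` in as parameters keeps the harness independent of any particular data layout, so the same location list can be replayed against each partitioning variant and the summaries compared directly.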