-
Notifications
You must be signed in to change notification settings - Fork 7
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Sorting vectors #66
Comments
Usually, what we sort data in memory and then write it to disk with e.g. |
You would need a beefy machine to sort the OSM planet. There are 7_992_520_838 nodes. Just sorting a 64bit index would be ~60GB of RAM. It would be cool if that wasn't needed, because I can run your tool otherwise on my laptop. What is the challenge behind making vectors backed by memmap mutable? |
There are several problems with memory mapping mutable vectors: That's why our general plan was to rather extend ExternalVector with a dedicated sorting method that is optimized for external memory, and can only be called when finalizing the vector (e.g. super-scalar-sample-sort, see https://github.com/lorenzhs/ssssort ), and prevents being used in case the struct is not self-contained. |
The reason I ask is that I'm experimenting with creating tile indices using a Hilbert Curve. I am converting Node LatLons to their corresponding Hilbert number, then sorting Nodes by that. For ways, I do the same thing, with the Hilbert number derived from their Point on Surface. osmflat-rs seems like a compelling fit for a back-end format, but an easy way of memmap sorting is needed. |
Here's a general idea behind sorting vectors. Mutable accesors could be built into the storage code in flatdata to allow this sort of thing and maybe provide some guardrails... It would be super cool if you could annotate the schema to make the sorting populate throughout all references from the sorted vector. I haven't yet tried sorting the nodes themselves yet. |
Building sorting into flatdata itself is definitely a feature we are looking forward to. Automatically updating any references is most likely not going to happen, since such a thing usually requires having a large in-memory index that remaps references, and also reordering data references with The nodes in osmflat itself are not sortable directly, though (as they have Regarding the usability of osmflat as a backend format: While it can word for simple use cases I think it is much better suited as a input format to build your backend format: Most backends will need some of:
In my opinion osmflat is rather suited as an input format to build your own backend format out of: It is much easier to consume than PBF, requires less RAM (due to references being already resolved). We were mostly designing it as a data-exchange format replacing PBF. |
Extending ExternalVector with a sort is a pretty interesting idea. Not sure if there is a parent issue on this. Have you thought about radix sort? |
Radix sort is usually not the best choice if you need to sort data that does not fit into RAM: You would need one pass over the data for each 4/8/16 bits (whatever your word size is). E.g. assuming 16 bits words, and 32 byte structures you would need up to 16 passes of reading/writing the data. Some of the later passes might fit into memory, but it mostly depends on the data. Compare that e.g. with (assuming that your data size is less than
|
Makes sense. Basically, the sort needs to be localized first to reduce paging. |
Right now, it looks like we can only build a
Vec
by appending. It would be nice to be able to sort our vectors after they have been built. For example, I may want to compute a hilbert index for the nodes and sort them by that. That way, I can access the data with good locality.Is there a straightforward way to sort?
The text was updated successfully, but these errors were encountered: