Conversation
Implement a SPATIAL flag for use in creating a spatial field.
Encodes triangles with the bounding box in the first four words, enabling efficient spatial pruning during tree traversal without reconstructing the full triangle. The remaining words contain an additional vertex and packed reconstruction metadata, allowing exact triangle recovery when needed.
The `triangulate` function takes a polygon with floating-point lat/lon coordinates, converts to integer coordinates with millimeter precision (using 2^32 scaling), performs constrained Delaunay triangulation, and encodes the resulting triangles with boundary edge information for block kd-tree spatial indexing. It handles polygons with holes correctly, preserving which triangle edges lie on the original polygon boundaries versus internal tessellation edges.
Implemented byte-wise histogram selection to find median values without comparisons, enabling efficient partitioning of spatial data during block kd-tree construction. Processes values through multiple passes, building histograms for each byte position after a common prefix, avoiding the need to sort or compare elements directly.
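The selection described above can be sketched as a byte-wise radix select. This is an illustrative reconstruction, not the PR's code; the name `radix_select` and the `u32` element type are assumptions.

```rust
// Find the k-th smallest u32 by byte-wise histogram passes from the
// most significant byte down, never comparing elements pairwise.
fn radix_select(values: &[u32], mut k: usize) -> u32 {
    let mut prefix: u32 = 0; // high-order bytes fixed so far
    for byte_ix in (0..4usize).rev() {
        let shift = byte_ix * 8;
        let mut histogram = [0usize; 256];
        for &v in values {
            // Only count values that match the prefix chosen so far.
            if byte_ix == 3 || (v >> (shift + 8)) == (prefix >> (shift + 8)) {
                histogram[((v >> shift) & 0xff) as usize] += 1;
            }
        }
        // Walk the histogram to find which bucket holds the k-th value;
        // k becomes the rank within that bucket for the next pass.
        let mut bucket = 0usize;
        loop {
            if k < histogram[bucket] {
                break;
            }
            k -= histogram[bucket];
            bucket += 1;
        }
        prefix |= (bucket as u32) << shift;
    }
    prefix
}

fn main() {
    // k = 2 selects the median of five values.
    assert_eq!(radix_select(&[5, 1, 9, 3, 7], 2), 5);
}
```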
Implemented a `Surveyor` that evaluates the bounding boxes of a set of triangles and determines the dimension with the maximum spread, along with the shared prefix of the values in that dimension.
Implements dimension-major bit-packing with zigzag encoding for signed i32 deltas, enabling compression of spatially-clustered triangles from 32-bit coordinates down to 4-19 bits per delta depending on spatial extent.
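Zigzag encoding maps small-magnitude signed deltas to small unsigned values so they pack into few bits. A minimal sketch (function names are mine, not the PR's):

```rust
// Zigzag: 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, ... so that deltas of
// small magnitude become small unsigned values.
fn zigzag_encode(v: i32) -> u32 {
    ((v << 1) ^ (v >> 31)) as u32
}

fn zigzag_decode(v: u32) -> i32 {
    ((v >> 1) as i32) ^ -((v & 1) as i32)
}

// Bit width for a block of deltas: the width of the largest
// zigzag-encoded value decides how many bits each delta occupies.
fn bits_required(deltas: &[i32]) -> u32 {
    deltas
        .iter()
        .map(|&d| 32 - zigzag_encode(d).leading_zeros())
        .max()
        .unwrap_or(0)
}

fn main() {
    assert_eq!(zigzag_decode(zigzag_encode(-123_456)), -123_456);
    // Deltas in [-3, 3] zigzag into [0, 6]: 3 bits each.
    assert_eq!(bits_required(&[-3, 0, 3]), 3);
}
```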
Implement an immutable bulk-loaded spatial index using recursive median partitioning on bounding box dimensions. Each leaf stores up to 512 triangles with delta-compressed coordinates and doc IDs. The tree provides three query types (intersects, within, contains) that use exact integer arithmetic for geometric predicates and accumulate results in bit sets for efficient deduplication across leaves. The serialized format stores compressed leaf pages followed by the tree structure (leaf and branch nodes), enabling zero-copy access through memory-mapped segments without upfront decompression.
Lossless compression for floating-point lat/lon coordinates using XOR delta encoding on IEEE 754 bit patterns with variable-length integer encoding. Designed for per-polygon random access in the document store, where each polygon compresses independently without requiring sequential decompression.
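An illustrative sketch of the scheme as described (the PR's actual format and function names may differ): XOR each coordinate's IEEE 754 bit pattern with the previous one, then write the result as LEB128 varints. Nearby coordinates share high-order bits, so the XOR is a small integer that encodes in few bytes, and the scheme is lossless.

```rust
fn write_varint(mut v: u64, out: &mut Vec<u8>) {
    loop {
        let byte = (v & 0x7f) as u8;
        v >>= 7;
        if v == 0 {
            out.push(byte);
            break;
        }
        out.push(byte | 0x80);
    }
}

fn read_varint(data: &[u8], pos: &mut usize) -> u64 {
    let mut v = 0u64;
    let mut shift = 0;
    loop {
        let byte = data[*pos];
        *pos += 1;
        v |= ((byte & 0x7f) as u64) << shift;
        if byte & 0x80 == 0 {
            return v;
        }
        shift += 7;
    }
}

// XOR-delta the raw f64 bit patterns; the first value XORs against 0.
fn compress(coords: &[f64]) -> Vec<u8> {
    let mut out = Vec::new();
    let mut prev = 0u64;
    for &c in coords {
        let bits = c.to_bits();
        write_varint(bits ^ prev, &mut out);
        prev = bits;
    }
    out
}

fn decompress(data: &[u8], count: usize) -> Vec<f64> {
    let mut out = Vec::new();
    let mut pos = 0;
    let mut prev = 0u64;
    for _ in 0..count {
        prev ^= read_varint(data, &mut pos);
        out.push(f64::from_bits(prev));
    }
    out
}

fn main() {
    let ring = [45.56, 45.561, 45.562, 45.56];
    let packed = compress(&ring);
    assert_eq!(decompress(&packed, ring.len()), ring.to_vec());
    assert!(packed.len() < ring.len() * 8); // nearby values compress
}
```

Because each polygon starts its XOR chain from zero, every polygon decodes independently, which is what permits per-polygon random access in the document store.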
The triangulation function in `triangle.rs` is now called `delaunay_to_triangles` and it accepts the output of a Delaunay triangulation from `i_triangle` and not a GeoRust multi-polygon. The translation of user polygons to `i_triangle` polygons and subsequent triangulation will take place outside of `triangle.rs`.
Implemented a geometry document field with a minimal `Geometry` enum. Now able to add that Geometry from GeoJSON parsed from a JSON document. Geometry is triangulated if it is a polygon, otherwise it is correctly encoded as a degenerate triangle if it is a point or a line string. Write accumulated triangles to a block kd-tree on commit. Serialize the original `f64` polygon for retrieval from search. Created a query method for intersection. Query against the memory mapped block kd-tree. Return hits and original `f64` polygon. Implemented a merge of one or more block kd-trees from one or more segments during merge. Updated the block kd-tree to write to a Tantivy `WritePtr` instead of more generic Rust I/O.
Ended up using `select_nth_unstable_by_key` from the Rust standard library instead.
Read node structures using `from_le_bytes` instead of casting memory. After an inspection of columnar storage, it appears that this is the standard practice in Rust and in the Tantivy code base. Left the structure alignment for now in case it tends to align with cache boundaries.
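For illustration, reading a node header with `from_le_bytes` looks like this; the `LeafNode` layout and field names here are hypothetical, not the PR's actual structure.

```rust
// Hypothetical fixed-layout node read field by field with
// from_le_bytes rather than by casting the mapped bytes.
struct LeafNode {
    min_x: i32,
    min_y: i32,
    triangle_offset: u64,
}

fn read_leaf(bytes: &[u8]) -> LeafNode {
    LeafNode {
        min_x: i32::from_le_bytes(bytes[0..4].try_into().unwrap()),
        min_y: i32::from_le_bytes(bytes[4..8].try_into().unwrap()),
        triangle_offset: u64::from_le_bytes(bytes[8..16].try_into().unwrap()),
    }
}

fn main() {
    let mut bytes = Vec::new();
    bytes.extend_from_slice(&(-5i32).to_le_bytes());
    bytes.extend_from_slice(&7i32.to_le_bytes());
    bytes.extend_from_slice(&1024u64.to_le_bytes());
    let node = read_leaf(&bytes);
    assert_eq!(node.min_x, -5);
    assert_eq!(node.triangle_offset, 1024);
}
```

Unlike a pointer cast, this is well-defined regardless of the slice's alignment and fixes the on-disk byte order explicitly.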
|
@flatheadmill This is a super interesting PR. A backlog of contributions has accumulated, so it might take a bit of time to get it reviewed. To make things smoother, would you be ok with a quick call to walk me through the code? |
|
Of course. Doesn't have to be quick for my sake. Check your email. |
Addressed all `todo!()` markers created when adding the `Spatial` field type and `Geometry` value type to existing code paths:
- Dynamic field handling: `Geometry` is not supported in dynamic JSON fields; return `unimplemented!()` consistently with other complex types.
- Fast field writer: panic if geometry is routed incorrectly (internal error).
- `OwnedValue` serialization: implement `Geometry`-to-GeoJSON serialization and reference-to-owned conversion.
- Field type: return `None` from `get_index_record_option()` since spatial fields use BKD trees, not the inverted index.
- Space usage tracking: add the spatial field to `SegmentSpaceUsage` with proper integration through `SegmentReader`.
- Spatial query explain: implement `explain()` following the pattern of other binary/constant-score queries.

Fixed a `MultiPolygon` deserialization bug: count total points across all rings, not the number of rings. Added clippy expects for legitimate `too_many_arguments` cases in geometric predicates.
|
For all the talk of k dimensions, the end result of writing the tree is a run-of-the-mill binary tree whose search functions are very easy to understand. In the end, we have a binary tree of inner nodes that have left and right pointers to child nodes and a bounding box. During a search the bounding box allows us to eliminate sub-trees that do not intersect. The binary tree terminates in leaf nodes that have a bounding box and a pointer to where triangles are stored. If the path you follow intersects your sought bounding box, you scan the triangles referenced by the leaf node. Triangles? Yes, the polygons you add to the index are tessellated into triangles. The `i_triangle` crate used for tessellation in this pull request has an animation of triangulation in its GitHub README. The animation is all you need to understand what we're trying to accomplish. We are tokenizing a polygon, turning it into triangles. We know we've matched a document if we match a triangle in the polygon. Triangles are described with three points and are therefore easy to store. It was not necessary to study tessellation for this implementation; it was enough to choose an implementation that met the requirements: it works on integers and tracks boundary edges. With the polygon broken down into triangles, we can store each triangle along with the doc ID of the document it came from. In this way, this pull request implements an orthogonal range query to search for 2-dimensional shapes within axis-aligned partitions, i.e. bounding-box search. To grasp the kd-concept quickly, it is enough to see a simple tree and the space it partitions. Otherwise, you could read a Medium article. You will come away with an understanding of how to partition two-dimensional space. You will know how to search for a point in a two-dimensional kd-tree. It's a matter of descending the tree and choosing a path based on the depth of the tree, which determines the dimension to compare. There are plenty of little examples of this 2-dimensional kd-tree of points in every programming language.
You will be left wondering how to search for a triangle in two-dimensional space. Nowhere will you find a description of how to do this. A search for answers is going to trip over the fact that the word "dimension" is overloaded. Our minimal kd-tree stores points, zero-dimensional spatial data, in a tree that it describes as 2-dimensional. When we use the term "dimension" to describe a kd-tree we are describing the number of partitioning dimensions of the tree, not the spatial dimension of the indexed forms. What finally clicked for me was a single "teach a man to fish" answer to a StackOverflow question. A 4-dimensional point can represent a bounding box where the partitioning dimensions are the min and max of x and y. I wrote a program to create a kd-tree with 4 dimensions and was able to search for a box with a box "using the range (a0,+inf) x (-inf, a1) x (b0, +inf) x (-inf, b1)". Meanwhile, I had been stepping through the Lucene search, where I'd see the bounding boxes in the nodes during search, and knew that Lucene was depending on the properties of a kd-tree to build the tree, but not to search the tree. It did not search the tree using a 4-dimensional point; it used bounding boxes instead. It would build the bounding boxes as it partitioned the tree. Another muddle on the way to understanding is the concept of a "block" kd-tree. I never looked at the paper again once I began stepping through the Lucene code. There I saw the bounding box in the kd-tree node used for search and affirmed my hyper-point-as-bounding-box understanding by seeing that Lucene was indeed indexing the 2-dimensional (spatial) triangle bounding box using a 4-dimensional (partitioning) tree. I saw, too, that a single tree was built by bulk-partitioning.
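The hyper-point trick reduces to a plain predicate. A sketch (names mine): treating a stored box as the 4-d point of its corner coordinates, the four one-sided ranges quoted above are exactly the four comparisons of a box-intersection test.

```rust
// A stored bounding box viewed as a 4-d point (min_x, min_y, max_x,
// max_y). Each comparison below is one one-sided range constraint on
// one partitioning dimension.
#[derive(Clone, Copy)]
struct BBox {
    min_x: i32,
    min_y: i32,
    max_x: i32,
    max_y: i32,
}

// True when `b` intersects the query box `q`.
fn intersects(b: &BBox, q: &BBox) -> bool {
    b.max_x >= q.min_x && b.min_x <= q.max_x && b.max_y >= q.min_y && b.min_y <= q.max_y
}

fn main() {
    let q = BBox { min_x: 0, min_y: 0, max_x: 10, max_y: 10 };
    let overlapping = BBox { min_x: 5, min_y: 5, max_x: 15, max_y: 15 };
    let disjoint = BBox { min_x: 11, min_y: 0, max_x: 20, max_y: 5 };
    assert!(intersects(&overlapping, &q));
    assert!(!intersects(&disjoint, &q));
}
```

During traversal the same predicate applies to the bounding box stored in each inner node, which is how sub-trees are eliminated without visiting their triangles.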
I never referenced the original paper myself, but here I am standing on the shoulders of giants, reading the theory in practice, so I'm not the one to ask how the paper influenced the design of Lucene; I can only talk about how Lucene influenced the design of this pull request. However, since we are not creating an LSM-like structure nor using the serialization strategy presented in the paper, I'm going to rename it. That's enough red herrings. The tree is a 4-dimensional kd-tree built with a recursive partitioning of an array of triangles. The 4 dimensions represent the four positions of the two points in a bounding box. We partition our array of triangles on the kd-tree dimension with the widest spread, the greatest distance between the min and max value. We calculate a bounding box to encapsulate the triangles in the array and then call our partition function on the left and right partitions. We create an inner node that has a bounding box and a left and right child reference. When an array is less than a certain size, that's a leaf. P.S. — As you can see, the implementation in this pull request does not implement a generic kd-tree like Lucene. We do not keep the partition dimension in the node; we throw it away and use just the bounding box. Lucene uses its kd-tree for range queries, but I believe Tantivy has "fast fields" for that use case. It has some code to do kNN with kd-trees, but the primary index it uses for kNN is a Hierarchical Navigable Small World index. I'd speculate that the kd-tree kNN was an offering that was superseded by HNSW. |
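The recursive bulk load described above can be sketched as follows; the types, field layout, and leaf size here are illustrative assumptions, not the PR's code.

```rust
// Illustrative recursive bulk load: compute the enclosing box, stop at
// leaf size, otherwise split at the median of the widest of the four
// box dimensions and recurse on the two halves.
#[derive(Clone, Copy)]
struct Tri {
    bbox: [i32; 4], // [min_y, min_x, max_y, max_x]
}

enum Node {
    Leaf { bbox: [i32; 4], count: usize },
    Inner { bbox: [i32; 4], left: Box<Node>, right: Box<Node> },
}

const LEAF_SIZE: usize = 4; // 512 in the actual tree

// Assumes a non-empty slice.
fn build(tris: &mut [Tri]) -> Node {
    // Enclosing box: min over the min dimensions, max over the max.
    let mut bbox = tris[0].bbox;
    for t in tris.iter() {
        bbox[0] = bbox[0].min(t.bbox[0]);
        bbox[1] = bbox[1].min(t.bbox[1]);
        bbox[2] = bbox[2].max(t.bbox[2]);
        bbox[3] = bbox[3].max(t.bbox[3]);
    }
    if tris.len() <= LEAF_SIZE {
        return Node::Leaf { bbox, count: tris.len() };
    }
    // Partition on the kd dimension with the widest spread of values.
    let split_dim = (0..4usize)
        .max_by_key(|&d| {
            let lo = tris.iter().map(|t| t.bbox[d]).min().unwrap();
            let hi = tris.iter().map(|t| t.bbox[d]).max().unwrap();
            hi as i64 - lo as i64
        })
        .unwrap();
    // Median split; neither half needs to be sorted.
    let mid = tris.len() / 2;
    tris.select_nth_unstable_by_key(mid, |t| t.bbox[split_dim]);
    let (left, right) = tris.split_at_mut(mid);
    Node::Inner {
        bbox,
        left: Box::new(build(left)),
        right: Box::new(build(right)),
    }
}

fn main() {
    let mut tris: Vec<Tri> = (0..10).map(|i| Tri { bbox: [i, i, i + 1, i + 1] }).collect();
    if let Node::Inner { bbox, .. } = build(&mut tris) {
        assert_eq!(bbox, [0, 0, 10, 10]);
    }
}
```

Note that the split dimension is used only during construction and then discarded, matching the P.S. above: search needs only the per-node bounding box.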
This reverts commit 19eab16. Restore radix select in order to implement a merge solution that will not require a temporary file.
Existing code behaved as if the result of `select_nth_unstable_by_key` was either a sorted array or the product of an algorithm that gathered the partition values together, as in the Dutch national flag problem. The existing code was written knowing that the former isn't true and that the latter isn't advertised. Knowing, but not remembering. Quite the oversight.
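What the standard library actually guarantees is narrower than either assumption; a demo (the helper name is mine):

```rust
// After the call, the element at `mid` is in its final sorted position,
// everything before it compares <= and everything after compares >= --
// but neither side is sorted, and equal keys are not gathered next to
// the pivot as they would be in a Dutch-national-flag partition.
fn median_partition(v: &mut [i32]) -> i32 {
    let mid = v.len() / 2;
    v.select_nth_unstable_by_key(mid, |&x| x);
    v[mid]
}

fn main() {
    let mut v = [7, 1, 5, 3, 5, 9, 2];
    let pivot = median_partition(&mut v);
    assert_eq!(pivot, 5); // sorted order would be [1, 2, 3, 5, 5, 7, 9]
    let mid = v.len() / 2;
    assert!(v[..mid].iter().all(|&x| x <= pivot));
    assert!(v[mid + 1..].iter().all(|&x| x >= pivot));
}
```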
|
@flatheadmill Hello. I had time to take a deeper look at your PR. For a large PR like this, I usually modify the code as I look around. I will put my comments inline; you can also just pick the commits from #2755 when it makes sense to you. |
| }; |
| // Get spatial writer and rebuild block kd-tree. |
| spatial_serializer.serialize_field(field, triangles)?; |
| } |
FYI I imported your PR in #2755 and started reviewing.
The tests do not pass in your PR, mainly due to:
| } |
| } |
| spatial_serializer.close()?; |
src/index/index_meta.rs
| SegmentComponent::FastFields => ".fast".to_string(), |
| SegmentComponent::FieldNorms => ".fieldnorm".to_string(), |
| SegmentComponent::Delete => format!(".{}.del", self.delete_opstamp().unwrap_or(0)), |
| SegmentComponent::Spatial => ".spatial".to_string(), |
General thing...
I think we should avoid creating this file if it is not used.
It might seem benign, but tantivy creates those files on every single commit. Ideally we would fix that at the directory level, but I never managed to do it.
TBH it should be true for other components in tantivy (especially field norms).
I think I managed to get that in my #2755 .
src/indexer/merger.rs
| for (field, mut temp_file) in temp_files { |
| // Flush and sync triangles. |
| temp_file.flush()?; |
| temp_file.as_file_mut().sync_all()?; |
fsyncing is not useful for this usage.
src/indexer/merger.rs
| } |
| for (segment_ord, reader) in self.readers.iter().enumerate() { |
| for (field, temp_file) in &mut temp_files { |
| let spatial_readers = reader.spatial_fields(); |
we need the temp_file to be buffered.
src/query/spatial_query.rs
| } |
| impl SpatialWeight { |
| fn new(field: Field, bounds: [(i32, i32); 2], query_type: SpatialQueryType) -> Self { |
I think (?) the integer bbox you end up using assumes that the two points given here are the min corner and the max corner.
(because of the intersect code)
It might be worth enforcing that somewhere to avoid bugs.
Ideally I would prefer to have a IBox struct type that forces that invariant upon construction.
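The suggested invariant-enforcing type might look like this (a sketch; I use `i32` to match the quantized coordinates used elsewhere, where the comment wrote `u32`):

```rust
// Sketch of the suggested IBox type: construction normalizes the two
// corners so the min <= max invariant holds everywhere downstream.
#[derive(Clone, Copy, Debug, PartialEq)]
struct IPoint {
    x: i32,
    y: i32,
}

#[derive(Clone, Copy, Debug, PartialEq)]
struct IBox {
    min_corner: IPoint,
    max_corner: IPoint,
}

impl IBox {
    fn new(a: IPoint, b: IPoint) -> Self {
        IBox {
            min_corner: IPoint { x: a.x.min(b.x), y: a.y.min(b.y) },
            max_corner: IPoint { x: a.x.max(b.x), y: a.y.max(b.y) },
        }
    }
}

fn main() {
    // Corners may arrive in any order; the constructor sorts them.
    let b = IBox::new(IPoint { x: 5, y: -2 }, IPoint { x: 1, y: 3 });
    assert_eq!(b.min_corner, IPoint { x: 1, y: -2 });
    assert_eq!(b.max_corner, IPoint { x: 5, y: 3 });
}
```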
src/spatial/triangle.rs
| } |
| } |
| // change orientation if CW |
| if orient2d( |
robust is imported just for orient2d.
I think we should remove the dep and use an adhoc implementation.
orient_2d is also needlessly complex. (We just want the sign of the determinant)
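The ad-hoc version suggested here can even be exact if it runs on the quantized `i32` coordinates, widening to `i128` so the determinant cannot overflow. A sketch, not the PR's code:

```rust
// Sign of the cross-product determinant for points A, B, C:
// > 0 counter-clockwise, < 0 clockwise, 0 collinear. On i32 inputs
// widened to i128 the arithmetic is exact, so no robust-arithmetic
// dependency is needed.
fn orient2d(ax: i32, ay: i32, bx: i32, by: i32, cx: i32, cy: i32) -> i32 {
    let det = (bx as i128 - ax as i128) * (cy as i128 - ay as i128)
        - (by as i128 - ay as i128) * (cx as i128 - ax as i128);
    det.signum() as i32
}

fn main() {
    assert_eq!(orient2d(0, 0, 1, 0, 0, 1), 1); // counter-clockwise
    assert_eq!(orient2d(0, 0, 0, 1, 1, 0), -1); // clockwise
    assert_eq!(orient2d(0, 0, 1, 1, 2, 2), 0); // collinear
}
```

Shewchuk's adaptive-precision version matters for `f64` inputs; once coordinates are integers, the plain determinant suffices.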
src/spatial/triangle.rs
| Coord { y: cy, x: cx }, |
| ) < 0.0 |
| { |
| let temp_x = bx; |
Suggested change:
| let temp_x = bx; |
| // We fix the orientation by swapping B and C. |
| let temp_x = bx; |
src/spatial/triangle.rs
| let temp_y = by; |
| let temp_boundary = ab; |
| // ax and ay do not change, ab becomes bc |
| ab = bc; |
I believe this is a bug:
Suggested change:
| ab = bc; |
| ab = ac; |
src/spatial/triangle.rs
| /// indicating which edges are polygon boundaries. Returns a triangle struct with the bounding |
| /// box in the first four words as `[min_y, min_x, max_y, max_x]`. When decoded, the vertex |
| /// order may differ from the original input to `new()` due to normalized rotation. |
| pub fn new(doc_id: u32, triangle: [i32; 6], boundaries: [bool; 3]) -> Self { |
In several places in the code, we rely on implicit ordering to have lon, lat or y, x.
The code would benefit from introducing
`GeoPoint { x: f64, y: f64 }`
and `IPoint { x: u32, y: u32 }`, and `IBox { min_corner: IPoint, max_corner: IPoint }` or `IBox { min_x, min_y, max_x, max_y }`.
This is zero cost in rust.
Removed use of robust.
Spatial fields returned false from `is_indexed()`, so the segment writer skipped them before reaching `add_geometry`. Moved spatial field handling above the `is_indexed` gate since spatial fields have their own write path and do not participate in postings.

The spatial serializer called `CompositeWrite::for_field` twice for the edge index: once to create the `EdgeWriter` and again to flush. The second call tripped a duplicate field address assertion. Hoisted the `for_field` call so the same reference serves both uses.

`RegionCoverer::get_covering_internal` sent initial candidates directly to the priority queue without expanding their children. The main loop then drained an empty children vec and produced nothing. Changed to route initial candidates through `add_candidate`, which expands children before queuing. This matches the S2 C++ reference.
The forward walk counted entries to reach a target geometry position but did not skip doc_id footers that appear between sets. When geometry_id 3 followed a 3-member set, the walk stopped at the footer after the third entry instead of advancing past it. The footer bytes were then read as a vertex byte length, producing a slice overflow. After the walk loop exits, consume any doc_id footers before treating the current position as the target entry.
Only set has_interior when the covering cell fully contains the index cell. Use range_min comparison in the Subdivided walk for correct overlap detection across cell levels.
The CellIndex assigns edges to cells using conservative bounding box tests, so multiple geometries can share a cell even when they do not overlap spatially. When an interior covering cell contains such a shared index cell, the shortcut now confirms the candidate's first vertex is inside the query polygon before accepting.
Same pattern as the serializer fix: CompositeWrite::for_field was called twice for the edge index, once to create the EdgeWriter and again to flush. Hoisted the for_field call so the same reference serves both. Removed a leftover dbg statement.
The merger was writing edges in cell-walk encounter order, but the GeometryMap assigns IDs during the interleave stage before the split and sibling stages reorder cells. The edge positions in the output did not match the geometry IDs in the merged cell index. Write edges in sequential new_id order instead of encounter order. Assign set member IDs starting from the head so they are consecutive regardless of which member is encountered first during the cell walk.
Introduces $intersects and $contains as query parser syntax, composable
with text queries via AND/OR. The field name is required and must refer
to a spatial field. Coordinates are comma-separated lon/lat pairs with
implicit ring closure.
geometry:$intersects(-99.49 45.56, -99.45 45.56, -99.45 45.59, -99.49 45.59)
name:Hosmer AND geometry:$intersects(-100.0 45.0, -99.0 45.0, -99.0 46.0, -100.0 46.0)
New SpatialPredicateKind enum in the grammar crate, Spatial variant in
UserInputLeaf and LogicalLiteral, parser rule wired into the strict
literal alt, and lowering to SpatialQuery in convert_literal_to_query.
Points enter the cell index as degenerate edges (v0 == v1) with dimension 0. The edge reader duplicates single-vertex entries on decode so the edge_id rule holds without caller changes. Line strings exposed false positives in the intersects query -- brute_force_contains treated open paths as polygons. The fix encodes a closed bit in the high bit of the edge entry len field and moves the doc_id from a trailing footer to inline on the head entry. The intersects query skips the reverse containment test for open geometries. The has_crossing guard drops from n < 3 to n < 2 to allow 2-vertex line strings.
…reverse containment test. GeoJSON RFC 7946 specifies hole rings as clockwise. S2 expects all rings counterclockwise, enclosing at most half the sphere. A clockwise hole ring interpreted as an S2 loop encloses nearly the entire sphere, causing compute_origin_inside to return true. The InteriorTracker then marks the geometry as containing every cell on the face. The fix reverses hole rings in the writer before computing origin_inside. The reverse containment test in the intersects query now uses indexed containment through the segment's cell index instead of brute_force_contains. A ContainmentIndex trait abstracts cell lookup and edge resolution over both the in-memory CellIndex (forward direction) and the CellIndexReader plus EdgeReader (reverse direction). The generic contains_point function finds the cell containing the test point, starts from contains_center, and counts crossings of only the clipped edges. This eliminates the flattened vertex bugs in brute_force_contains (degenerate A-to-A edges from ring closure, spurious cross-ring edges from multi-ring flattening) and removes origin_inside from the query path.
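The orientation check behind this fix can be illustrated with a planar shoelace sign test. This is a simplification for intuition: the real code operates on S2 loops on the sphere, and the function names here are mine.

```rust
// Signed area of a ring via the shoelace formula: positive for
// counter-clockwise rings, negative for clockwise ones. The closing
// edge is implied by wrapping from the last vertex back to the first.
fn signed_area(ring: &[(f64, f64)]) -> f64 {
    let mut sum = 0.0;
    for i in 0..ring.len() {
        let (x0, y0) = ring[i];
        let (x1, y1) = ring[(i + 1) % ring.len()];
        sum += x0 * y1 - x1 * y0;
    }
    sum / 2.0
}

// RFC 7946 hole rings arrive clockwise; reverse them so every ring is
// counter-clockwise before it is interpreted as an S2 loop.
fn normalize_ring(ring: &mut Vec<(f64, f64)>) {
    if signed_area(ring) < 0.0 {
        ring.reverse();
    }
}

fn main() {
    // A clockwise unit square, as a GeoJSON hole ring would be wound.
    let mut ring = vec![(0.0, 0.0), (0.0, 1.0), (1.0, 1.0), (1.0, 0.0)];
    assert!(signed_area(&ring) < 0.0);
    normalize_ring(&mut ring);
    assert!(signed_area(&ring) > 0.0);
}
```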
Mechanical port of GetDistance, UpdateMinDistance, IsDistanceLess, UpdateMinInteriorDistance, and GetUpdateMinDistanceMaxError. These compute the minimum distance from a point to a great circle edge on the unit sphere using chord length, needed for the distance query verification step.
DistanceQuery builds an S2Cap from a center point and radius, covers it with RegionCoverer, and verifies candidates by computing the minimum point-to-edge distance using the s2edge_distances port. For polygons, containment is checked first via indexed containment through the segment's cell index -- if the center is inside the polygon, the distance is zero. The query parser accepts $within(50mi, lon lat) and $between(50mi, 100mi, lon lat) with unit suffixes mi, km, m, ft, rad.
Geometry enters as Geometry<Plane>, gets projected onto the surface via project::<Sphere>(), and gets smashed into stored format via to_geometry_set. One function, one format, for both query and write. * Surface trait with Sphere and Plane, project. * GeometrySet and EdgeSet as the smashed representation. * to_geometry_set smashes projected geometry, computes contains_hilbert_start, flattens rings with closure vertices. * build_from_sets on IndexBuilder seeds the InteriorTracker from precomputed flags instead of running brute_force_contains at build time. * Queries accept GeometrySet directly. QueryEdgeProvider wraps it. * Edge reader returns EdgeSet via get_edge_set so callers never compute member indices. * Edge writer takes GeometrySet.
* get_distance_to_point -- minimum distance from cell to point, zero if the point is inside. Feature-based computation in the cell's face-local UVW frame with Voronoi region branching. * get_boundary_distance -- minimum distance from cell boundary to point. * get_distance_to_edge -- minimum distance from cell to edge, zero if the edge intersects the cell or an endpoint is inside.
Geometry-to-geometry distance query over the cell hierarchy. The query geometry gets its own CellIndex for bilateral pruning -- only query edges in cells that overlap the stored cell are checked. The priority queue is keyed by cell-to-query-geometry lower bounds using S2Cell::get_distance_to_edge, and leaf verification uses update_edge_pair_min_distance. * closest_edge_query.rs: branch-and-bound traversal with three modes from one algorithm -- kNN (dynamic threshold), within-D (fixed threshold), boolean (exit on first witness). * $knn(K, lon lat) syntax in the query grammar. * SpatialPredicate::Knn wired through SpatialQuery and SpatialWeight.
The covering-based DistanceQuery is no longer called from the query path. Within, between, and kNN all go through the branch-and-bound traversal. Added range distance mode for $between with inner and outer thresholds.
The executor is a Query whose Weight construction evaluates a plan tree before the scorer pipeline runs. Plan nodes are either tantivy queries evaluated to per-segment bitsets, or spatial operations whose filter inputs are other plan nodes. The tree evaluates inside-out so nested operations resolve before their parents. * Added SpatialExecutor implementing Query. Weight construction calls a recursive evaluate function over the plan tree, producing per-segment bitsets or scored result sets. Scorers iterate the precomputed results. * Added PlanNode enum with Query, Intersects, Knn, and Join variants. Knn and Join are stubbed. Intersects is wired to search_segment_filtered. * Added EdgeReader::doc_id_for to resolve geometry_id to doc_id without decoding vertices or populating the cache. Reads the skip list and entry headers only. * Added IntersectsQuery::search_segment_filtered which checks a terms bitset per candidate shape before verification. Shapes whose doc_id is not in the bitset are skipped before crossing tests. * Added Clone to EdgeSet and GeometrySet.
…cutor Within node. * Added ClosestEdgeQuery::search_segment_filtered. Checks a terms bitset per candidate shape in process_edges before edge-pair distance computation. The filter check uses doc_id_for to resolve geometry_id to doc_id without decoding vertices. * Added DistanceQuery::search_segment_filtered. Same bitset check pattern in collect_candidates. * Added PlanNode::Within to the executor. Takes a GeometrySet and radius. Uses ClosestEdgeQuery::within for geometry-to-geometry distance, not the point-based DistanceQuery. The executor uses branch-and-bound for all distance operations.
The join evaluates inner and outer child nodes to per-segment bitsets, then for each segment walks the cell index to find outer geometries by checking doc_id_for against the outer bitset. For each outer geometry, builds a ClosestEdgeQuery with any_within and probes all segments filtered by the inner bitset. First positive across any segment sets the bit. The result is per-segment bitsets of outer docs that passed the spatial join predicate.
Left over from the triangle days.
* Added ContainsQuery::search_segment_filtered with terms bitset check per candidate shape, same pattern as IntersectsQuery and DistanceQuery. * The Join evaluate arm dispatches on SpatialRelation: Near uses ClosestEdgeQuery::any_within, Between uses any_between, Intersects uses IntersectsQuery, Contains uses ContainsQuery. All four probe with search_segment_filtered against the inner bitset. * Added ClosestEdgeQuery::any_between for boolean range distance with early exit.
Point-to-geometry distance replaced by closest_edge_query which handles any geometry type through edge-pair distance with branch-and-bound.
Runs ClosestEdgeQuery::knn per segment with an optional filter bitset, collects results tagged by segment ID, sorts by distance, truncates to global top K, and redistributes to per-segment SegmentResult::Scored vecs for the replay scorer. Fixed a double-yield bug in ReplayScorer where the Scored variant emitted the first doc twice. Removed the started flag and simplified advance to always increment the index.
Spatial predicates now accept $query(...) where coordinates would normally go. The inner query text is parsed recursively by the query parser and lowered to a SpatialExecutor with PlanNode::Join. The join absorbs all four spatial relations: $within, $between, $intersects, $contains. * Added query_arg parser in query_grammar.rs to extract balanced $query(...) content. Each spatial predicate parser tries $query() before falling back to coordinate parsing. * Added inner_query field to UserInputLeaf::Spatial and LogicalLiteral::Spatial. * The lowering in query_parser.rs detects inner_query and builds a SpatialExecutor with PlanNode::Join. The inner query is parsed recursively via the same parser. * Threaded query_parser reference through convert_to_query and convert_literal_to_query for the recursive parse.
When the query parser lowers a boolean AND that contains a spatial join, the non-join children become the join's outer PlanNode instead of composing as independent boolean sub-queries. The executor probes only documents matching the sibling predicates, not all documents in the index. Added SpatialExecutor::set_outer to replace the outer PlanNode after construction.
Each geometry entry now starts with a flags byte (u8) followed by len (u32). The flags byte carries closed, contains_origin, has_holes, and is_head in individual bits. The old len and set u32 pair with bit-packed flags and back pointers is gone. Head entries are flags + len + doc_id + data (9 byte header). Member entries are flags + len + data (5 byte header). Sets are delimited by the is_head flag, not by a member count. The forward walk notes the doc_id from each head it passes. doc_id_for walks forward from the skip list, noting doc_ids from heads. If the target is reached without a head, it retries from the previous skip list entry. No back pointers, no backward motion through the data.
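The entry layout can be sketched as follows; the bit positions within the flags byte are illustrative, not necessarily the serialized format's actual assignments.

```rust
// Illustrative flags byte: closed, contains_origin, has_holes, and
// is_head each occupy one bit.
const CLOSED: u8 = 1 << 0;
const CONTAINS_ORIGIN: u8 = 1 << 1;
const HAS_HOLES: u8 = 1 << 2;
const IS_HEAD: u8 = 1 << 3;

// Head entries: flags + len + doc_id + data (9-byte header). Member
// entries would omit doc_id (5-byte header) and clear IS_HEAD.
fn encode_head(flags: u8, len: u32, doc_id: u32, data: &[u8]) -> Vec<u8> {
    let mut out = vec![flags | IS_HEAD];
    out.extend_from_slice(&len.to_le_bytes());
    out.extend_from_slice(&doc_id.to_le_bytes());
    out.extend_from_slice(data);
    out
}

fn main() {
    let _ = (CONTAINS_ORIGIN, HAS_HOLES); // remaining flag bits
    let entry = encode_head(CLOSED, 3, 42, &[1, 2, 3]);
    assert_eq!(entry.len(), 9 + 3);
    assert_ne!(entry[0] & IS_HEAD, 0);
    assert_eq!(u32::from_le_bytes(entry[5..9].try_into().unwrap()), 42);
}
```

A forward walk then needs only to test `IS_HEAD` on each entry to know where a set begins and where to pick up the current doc_id.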
This pull request adds spatial indexing to Tantivy using block kd-trees. The implementation tessellates polygons into triangles and indexes the triangles in the block kd-tree.
Search supports three query types: "intersects" finds documents whose geometries overlap a query rectangle, "within" finds documents whose geometries fall entirely inside the query bounds, and "contains" finds documents whose geometries completely enclose the query region. I've attempted to follow Tantivy's established patterns. Spatial fields use a new `FieldType::Spatial` variant. I used `PreTokenizedString` as a go-by.

The user-provided `f64` geographic coordinates are serialized using XOR compression that exploits spatial locality. When consecutive vertices are close together, their bit representations share high-order bits, and XOR reveals this redundancy as zeros that varint encoding can compress. The compression falls back to uncompressed when data is incompressible. Looked at alternatives; `bitshuffle` + `zstd` might be nice for large polygons, but small polygons would be smaller than a `zstd` block. The XOR is simple and a good place to start.

The block kd-tree implementation uses bulk-loaded immutable construction with recursive median partitioning, creating a somewhat balanced block kd-tree with 512 triangles per leaf. The tree stores triangles. Polygons are tessellated into triangles prior to indexing.
`f64` dimensions are converted to `i32` prior to tessellation and the block kd-tree stores the dimensions as `i32`. Triangles are stored in an encoded format that places a bounding box for the triangle in the first four words, followed by the spare x/y, boundary flags, and finally the doc_id. The boundary flags indicate whether a side of the triangle shares a boundary with the tessellated polygon and are used for "within" and "contains" testing.

The tree construction uses Rust's standard library `select_nth_unstable_by_key` for median partitioning, operating directly on `&mut [Triangle]` slices. Tree construction is a simple recursive descent based around this standard library function. The recursive descent receives slices of progressively smaller sub-ranges until reaching leaf size. This approach works identically whether the triangles come from a `Vec` during initial indexing or from a memory-mapped file during merge.

Merge is implemented by writing these `Triangle` structures into a temporary file, memory mapping the temporary file, and then performing an unsafe cast of the memory to `&mut [Triangle]`. Because the file is memory mapped, the `select_nth_unstable_by_key` tree construction can rely on the operating system to manage the memory when merging large segments.

If you are zealously opposed to unsafe code then `&mut [Triangle]` and `select_nth_unstable_by_key` have to be rewritten, but I don't see how it is a problem. The `Vec<Triangle>` is Rust-safe. The temporary file is entirely a transient step in the block kd-tree merge. Ought to be able to add a compile switch for endianness. It is not a persistent file; it is temporary only for the merge. It does not have to have the same endianness on different architectures.

This implementation is certainly Lucene inspired. The triangle encoding is a 1:1 port, with the exception of replacing orient with the robust `orient2d` implementation of Shewchuk's 2d orient. The rest is more inspiration than port. Porting from Lucene 1:1 is not sensible: Lucene needs to operate on mapped memory via Java arrays, and it goes to great lengths to avoid the creation of POJOs to keep the garbage collector out of the picture. The Rust implementation can read more like Rust.
`f64` to `i32` is Lucene inspired, roughly centimeter precision. The bulk-loaded partition construction is Lucene inspired, but the in-place sorting and reuse of the standard library is a feature of Rust. Delta encoding for `i32`/`u32` is Lucene inspired. Currently using barycentric point-in-triangle instead of Lucene's orientation tests.

Dependencies added are `robust` for `orient2d` and `i_triangle` for its integer-based Delaunay triangulation that tracks boundary edges.

I have avoided making anything "pluggable" and bonded tightly to Tantivy and the chosen 3rd-party libraries for simplicity and easy reading.
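For illustration, the degrees-to-`i32` quantization can be sketched as a fixed scaling of the degree range onto the `i32` range; the exact constants and function names in the PR may differ. One step is 360 / 2^32 ≈ 8.4e-8 degrees, just under a centimeter of longitude at the equator, which is where the "roughly centimeter precision" figure comes from.

```rust
// Illustrative degrees-to-i32 quantization: map [-180, 180) onto the
// full i32 range. One quantization step is 360 / 2^32 degrees.
const SCALE: f64 = (1u64 << 32) as f64 / 360.0;

fn encode_lon(lon: f64) -> i32 {
    (lon * SCALE).floor() as i32
}

fn decode_lon(encoded: i32) -> f64 {
    encoded as f64 / SCALE
}

fn main() {
    let e = encode_lon(-99.49);
    // Round-trip error is bounded by one quantization step.
    assert!((decode_lon(e) + 99.49).abs() < 1e-6);
}
```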
In the above you'll see a discussion of how `select_nth_unstable_by_key` is used with a `&mut [Triangle]` cast from a memory-mapped file; please consider it.

Please consider the minimalist `Geometry` enum. Rather than create yet another object hierarchy of `Point`, `Polygon`, `BoundingBox`, et al., I used underlying representations that can be easily handed off to GeoRust or similar. I didn't want to make GeoRust or GeoJSON parsing a Tantivy dependency. The block kd-tree is intended for bounding box and point queries, not for the full range of queries you would find in PostGIS, like polygon in polygon. Such a test could be performed while iterating the search results.

Note that the block kd-tree is not going to work with geometries that cross the antimeridian, but this is a known issue, and data sets will probably already split a polygon that crosses the antimeridian into a multi-polygon with a polygon for each side of the antimeridian. This is even in the GeoJSON spec.
Note that there is a Spatial field and a Geometry type stored in that field. Would it be better to have a Geographic or Geo field since the implementation is geo specific?
As I write this, "contains" and "within" are implemented in the block kd-tree but not exposed through the interfaces. I will implement "disjoint." There are `todo!()`s to implement in the `Field::Spatial(_)` match arms. There are some tree balancing improvements that I'd like to add. I'd like to put a lot of Geofabrik data through it and see how well it performs. However, it is ready for consideration by the Tantivy maintainers.

Run `cargo run --example geo_json` for a minimal round-trip through the index.