Conversation
Implement a SPATIAL flag for use in creating a spatial field.
Encodes triangles with the bounding box in the first four words, enabling efficient spatial pruning during tree traversal without reconstructing the full triangle. The remaining words contain an additional vertex and packed reconstruction metadata, allowing exact triangle recovery when needed.
The `triangulate` function takes a polygon with floating-point lat/lon coordinates, converts to integer coordinates with millimeter precision (using 2^32 scaling), performs constrained Delaunay triangulation, and encodes the resulting triangles with boundary edge information for block kd-tree spatial indexing. It handles polygons with holes correctly, preserving which triangle edges lie on the original polygon boundaries versus internal tessellation edges.
Implemented byte-wise histogram selection to find median values without comparisons, enabling efficient partitioning of spatial data during block kd-tree construction. Processes values through multiple passes, building histograms for each byte position after a common prefix, avoiding the need to sort or compare elements directly.
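The selection described above can be sketched as a byte-wise radix select. This is an illustrative reconstruction, not the PR's code; the name `radix_select` and the `u32` element type are assumptions.

```rust
// Find the k-th smallest u32 by byte-wise histogram passes from the
// most significant byte down, never comparing elements pairwise.
fn radix_select(values: &[u32], mut k: usize) -> u32 {
    let mut prefix: u32 = 0; // high-order bytes fixed so far
    for byte_ix in (0..4usize).rev() {
        let shift = byte_ix * 8;
        let mut histogram = [0usize; 256];
        for &v in values {
            // Only count values that match the prefix chosen so far.
            if byte_ix == 3 || (v >> (shift + 8)) == (prefix >> (shift + 8)) {
                histogram[((v >> shift) & 0xff) as usize] += 1;
            }
        }
        // Walk the histogram to find which bucket holds the k-th value;
        // k becomes the rank within that bucket for the next pass.
        let mut bucket = 0usize;
        loop {
            if k < histogram[bucket] {
                break;
            }
            k -= histogram[bucket];
            bucket += 1;
        }
        prefix |= (bucket as u32) << shift;
    }
    prefix
}

fn main() {
    // k = 2 selects the median of five values.
    assert_eq!(radix_select(&[5, 1, 9, 3, 7], 2), 5);
}
```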
Implemented a `Surveyor` that evaluates the bounding boxes of a set of triangles and determines the dimension with the maximum spread, along with the shared prefix of the values in that dimension.
Implements dimension-major bit-packing with zigzag encoding for signed i32 deltas, enabling compression of spatially-clustered triangles from 32-bit coordinates down to 4-19 bits per delta depending on spatial extent.
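Zigzag encoding maps small-magnitude signed deltas to small unsigned values so they pack into few bits. A minimal sketch (function names are mine, not the PR's):

```rust
// Zigzag: 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, ... so that deltas of
// small magnitude become small unsigned values.
fn zigzag_encode(v: i32) -> u32 {
    ((v << 1) ^ (v >> 31)) as u32
}

fn zigzag_decode(v: u32) -> i32 {
    ((v >> 1) as i32) ^ -((v & 1) as i32)
}

// Bit width for a block of deltas: the width of the largest
// zigzag-encoded value decides how many bits each delta occupies.
fn bits_required(deltas: &[i32]) -> u32 {
    deltas
        .iter()
        .map(|&d| 32 - zigzag_encode(d).leading_zeros())
        .max()
        .unwrap_or(0)
}

fn main() {
    assert_eq!(zigzag_decode(zigzag_encode(-123_456)), -123_456);
    // Deltas in [-3, 3] zigzag into [0, 6]: 3 bits each.
    assert_eq!(bits_required(&[-3, 0, 3]), 3);
}
```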
Implement an immutable bulk-loaded spatial index using recursive median partitioning on bounding box dimensions. Each leaf stores up to 512 triangles with delta-compressed coordinates and doc IDs. The tree provides three query types (intersects, within, contains) that use exact integer arithmetic for geometric predicates and accumulate results in bit sets for efficient deduplication across leaves. The serialized format stores compressed leaf pages followed by the tree structure (leaf and branch nodes), enabling zero-copy access through memory-mapped segments without upfront decompression.
Lossless compression for floating-point lat/lon coordinates using XOR delta encoding on IEEE 754 bit patterns with variable-length integer encoding. Designed for per-polygon random access in the document store, where each polygon compresses independently without requiring sequential decompression.
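An illustrative sketch of the scheme as described (the PR's actual format and function names may differ): XOR each coordinate's IEEE 754 bit pattern with the previous one, then write the result as LEB128 varints. Nearby coordinates share high-order bits, so the XOR is a small integer that encodes in few bytes, and the scheme is lossless.

```rust
fn write_varint(mut v: u64, out: &mut Vec<u8>) {
    loop {
        let byte = (v & 0x7f) as u8;
        v >>= 7;
        if v == 0 {
            out.push(byte);
            break;
        }
        out.push(byte | 0x80);
    }
}

fn read_varint(data: &[u8], pos: &mut usize) -> u64 {
    let mut v = 0u64;
    let mut shift = 0;
    loop {
        let byte = data[*pos];
        *pos += 1;
        v |= ((byte & 0x7f) as u64) << shift;
        if byte & 0x80 == 0 {
            return v;
        }
        shift += 7;
    }
}

// XOR-delta the raw f64 bit patterns; the first value XORs against 0.
fn compress(coords: &[f64]) -> Vec<u8> {
    let mut out = Vec::new();
    let mut prev = 0u64;
    for &c in coords {
        let bits = c.to_bits();
        write_varint(bits ^ prev, &mut out);
        prev = bits;
    }
    out
}

fn decompress(data: &[u8], count: usize) -> Vec<f64> {
    let mut out = Vec::new();
    let mut pos = 0;
    let mut prev = 0u64;
    for _ in 0..count {
        prev ^= read_varint(data, &mut pos);
        out.push(f64::from_bits(prev));
    }
    out
}

fn main() {
    let ring = [45.56, 45.561, 45.562, 45.56];
    let packed = compress(&ring);
    assert_eq!(decompress(&packed, ring.len()), ring.to_vec());
    assert!(packed.len() < ring.len() * 8); // nearby values compress
}
```

Because each polygon starts its XOR chain from zero, every polygon decodes independently, which is what permits per-polygon random access in the document store.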
The triangulation function in `triangle.rs` is now called `delaunay_to_triangles` and it accepts the output of a Delaunay triangulation from `i_triangle` and not a GeoRust multi-polygon. The translation of user polygons to `i_triangle` polygons and subsequent triangulation will take place outside of `triangle.rs`.
Implemented a geometry document field with a minimal `Geometry` enum. Now able to add that Geometry from GeoJSON parsed from a JSON document. Geometry is triangulated if it is a polygon, otherwise it is correctly encoded as a degenerate triangle if it is a point or a line string. Write accumulated triangles to a block kd-tree on commit. Serialize the original `f64` polygon for retrieval from search. Created a query method for intersection. Query against the memory mapped block kd-tree. Return hits and original `f64` polygon. Implemented a merge of one or more block kd-trees from one or more segments during merge. Updated the block kd-tree to write to a Tantivy `WritePtr` instead of more generic Rust I/O.
Ended up using `select_nth_unstable_by_key` from the Rust standard library instead.
Read node structures using `from_le_bytes` instead of casting memory. After an inspection of columnar storage, it appears that this is the standard practice in Rust and in the Tantivy code base. Left the structure alignment for now in case it tends to align with cache boundaries.
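For illustration, reading a node header with `from_le_bytes` looks like this; the `LeafNode` layout and field names here are hypothetical, not the PR's actual structure.

```rust
// Hypothetical fixed-layout node read field by field with
// from_le_bytes rather than by casting the mapped bytes.
struct LeafNode {
    min_x: i32,
    min_y: i32,
    triangle_offset: u64,
}

fn read_leaf(bytes: &[u8]) -> LeafNode {
    LeafNode {
        min_x: i32::from_le_bytes(bytes[0..4].try_into().unwrap()),
        min_y: i32::from_le_bytes(bytes[4..8].try_into().unwrap()),
        triangle_offset: u64::from_le_bytes(bytes[8..16].try_into().unwrap()),
    }
}

fn main() {
    let mut bytes = Vec::new();
    bytes.extend_from_slice(&(-5i32).to_le_bytes());
    bytes.extend_from_slice(&7i32.to_le_bytes());
    bytes.extend_from_slice(&1024u64.to_le_bytes());
    let node = read_leaf(&bytes);
    assert_eq!(node.min_x, -5);
    assert_eq!(node.triangle_offset, 1024);
}
```

Unlike a pointer cast, this is well-defined regardless of the slice's alignment and fixes the on-disk byte order explicitly.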
|
@flatheadmill This is a super interesting PR. A backlog of contributions has accumulated, so it might take a bit of time to get it reviewed. To make things smoother, would you be ok with a quick call to walk me through the code? |
|
Of course. Doesn't have to be quick for my sake. Check your email. |
Addressed all `todo!()` markers created when adding the `Spatial` field type and `Geometry` value type to existing code paths:
- Dynamic field handling: `Geometry` is not supported in dynamic JSON fields; return `unimplemented!()` consistently with other complex types.
- Fast field writer: panic if geometry is routed incorrectly (internal error).
- `OwnedValue` serialization: implement `Geometry`-to-GeoJSON serialization and reference-to-owned conversion.
- Field type: return `None` from `get_index_record_option()` since spatial fields use BKD trees, not the inverted index.
- Space usage tracking: add the spatial field to `SegmentSpaceUsage` with proper integration through `SegmentReader`.
- Spatial query explain: implement `explain()` following the pattern of other binary/constant-score queries.

Fixed a `MultiPolygon` deserialization bug: count total points across all rings, not the number of rings. Added clippy expects for legitimate `too_many_arguments` cases in geometric predicates.
|
For all the talk of k dimensions, the end result of writing the tree is a run-of-the-mill binary tree whose search functions are very easy to understand. In the end, we have a binary tree of inner nodes that have left and right pointers to child nodes and a bounding box. During a search the bounding box allows us to eliminate sub-trees that do not intersect. The binary tree terminates in leaf nodes that have a bounding box and a pointer to where triangles are stored. If the path you follow intersects your sought bounding box, you scan the triangles referenced by the leaf node. Triangles? Yes, the polygons you add to the index are tessellated into triangles. The `i_triangle` crate used for tessellation in this pull request has an animation of triangulation in its GitHub README. The animation is all you need to understand what we're trying to accomplish. We are tokenizing a polygon, turning it into triangles. We know we've matched a document if we match a triangle in the polygon. Triangles are described with three points and are therefore easy to store. It was not necessary to study tessellation for this implementation; it was enough to choose an implementation that met the requirements: it works on integers and tracks boundary edges. With the polygon broken down into triangles, we can store each triangle along with the doc ID of the document it came from. In this way, this pull request implements an orthogonal range query to search for 2-dimensional shapes within axis-aligned partitions, i.e. bounding-box search. To grasp the kd-concept quickly, it is enough to see a simple tree and the space it partitions. Otherwise, you could read a Medium article. You will come away with an understanding of how to partition two-dimensional space. You will know how to search for a point in a two-dimensional kd-tree. It's a matter of descending the tree and choosing a path based on the depth of the tree, which determines the dimension to compare. There are plenty of little examples of this 2-dimensional kd-tree of points in every programming language.
You will be left wondering how to search for a triangle in two-dimensional space. Nowhere will you find a description of how to do this. A search for answers is going to trip over the fact that the word "dimension" is overloaded. Our minimal kd-tree stores points, zero-dimensional spatial data, in a tree that it describes as 2-dimensional. When we use the term "dimension" to describe a kd-tree we are describing the number of partitioning dimensions of the tree, not the spatial dimension of the indexed forms. What finally clicked for me was a single "teach a man to fish" answer to a StackOverflow question. A 4-dimensional point can represent a bounding box where the partitioning dimensions are the min and max of x and y. I wrote a program to create a kd-tree with 4 dimensions and was able to search for a box with a box "using the range (a0,+inf) x (-inf, a1) x (b0, +inf) x (-inf, b1)". Meanwhile, I had been stepping through the Lucene search, where I'd see the bounding boxes in the nodes during search, and knew that Lucene was depending on the properties of a kd-tree to build the tree, but not to search the tree. It did not search the tree using a 4-dimensional point; it used bounding boxes instead. It would build the bounding boxes as it partitioned the tree. Another muddle on the way to understanding is the concept of a "block" kd-tree. I never looked at the paper again once I began stepping through the Lucene code. There I saw the bounding box in the kd-tree node used for search and affirmed my hyper-point-as-bounding-box understanding by seeing that Lucene was indeed indexing the 2-dimensional (spatial) triangle bounding box using a 4-dimensional (partitioning) tree. I saw, too, that a single tree was built by bulk-partitioning.
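The hyper-point trick reduces to a plain predicate. A sketch (names mine): treating a stored box as the 4-d point of its corner coordinates, the four one-sided ranges quoted above are exactly the four comparisons of a box-intersection test.

```rust
// A stored bounding box viewed as a 4-d point (min_x, min_y, max_x,
// max_y). Each comparison below is one one-sided range constraint on
// one partitioning dimension.
#[derive(Clone, Copy)]
struct BBox {
    min_x: i32,
    min_y: i32,
    max_x: i32,
    max_y: i32,
}

// True when `b` intersects the query box `q`.
fn intersects(b: &BBox, q: &BBox) -> bool {
    b.max_x >= q.min_x && b.min_x <= q.max_x && b.max_y >= q.min_y && b.min_y <= q.max_y
}

fn main() {
    let q = BBox { min_x: 0, min_y: 0, max_x: 10, max_y: 10 };
    let overlapping = BBox { min_x: 5, min_y: 5, max_x: 15, max_y: 15 };
    let disjoint = BBox { min_x: 11, min_y: 0, max_x: 20, max_y: 5 };
    assert!(intersects(&overlapping, &q));
    assert!(!intersects(&disjoint, &q));
}
```

During traversal the same predicate applies to the bounding box stored in each inner node, which is how sub-trees are eliminated without visiting their triangles.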
I never referenced the original paper myself, but here I am standing on the shoulders of giants, reading the theory in practice, so I'm not the one to ask how the paper influenced the design of Lucene; I can only talk about how Lucene influenced the design of this pull request. However, since we are not creating an LSM-like structure nor using the serialization strategy presented in the paper, I'm going to rename it. That's enough red herrings. The tree is a 4-dimensional kd-tree built with a recursive partitioning of an array of triangles. The 4 dimensions represent the four positions of the two points in a bounding box. We partition our array of triangles on the kd-tree dimension with the widest spread, the greatest distance between the min and max value. We calculate a bounding box to encapsulate the triangles in the array and then call our partition function on the left and right partitions. We create an inner node that has a bounding box and a left and right child reference. When an array is less than a certain size, that's a leaf. P.S. — As you can see, the implementation in this pull request does not implement a generic kd-tree like Lucene. We do not keep the partition dimension in the node; we throw it away and use just the bounding box. Lucene uses its kd-tree for range queries, but I believe Tantivy has "fast fields" for that use case. It has some code to do kNN with kd-trees, but the primary index it uses for kNN is a Hierarchical Navigable Small World index. I'd speculate that the kd-tree kNN was an offering that was superseded by HNSW. |
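The recursive bulk load described above can be sketched as follows; the types, field layout, and leaf size here are illustrative assumptions, not the PR's code.

```rust
// Illustrative recursive bulk load: compute the enclosing box, stop at
// leaf size, otherwise split at the median of the widest of the four
// box dimensions and recurse on the two halves.
#[derive(Clone, Copy)]
struct Tri {
    bbox: [i32; 4], // [min_y, min_x, max_y, max_x]
}

enum Node {
    Leaf { bbox: [i32; 4], count: usize },
    Inner { bbox: [i32; 4], left: Box<Node>, right: Box<Node> },
}

const LEAF_SIZE: usize = 4; // 512 in the actual tree

// Assumes a non-empty slice.
fn build(tris: &mut [Tri]) -> Node {
    // Enclosing box: min over the min dimensions, max over the max.
    let mut bbox = tris[0].bbox;
    for t in tris.iter() {
        bbox[0] = bbox[0].min(t.bbox[0]);
        bbox[1] = bbox[1].min(t.bbox[1]);
        bbox[2] = bbox[2].max(t.bbox[2]);
        bbox[3] = bbox[3].max(t.bbox[3]);
    }
    if tris.len() <= LEAF_SIZE {
        return Node::Leaf { bbox, count: tris.len() };
    }
    // Partition on the kd dimension with the widest spread of values.
    let split_dim = (0..4usize)
        .max_by_key(|&d| {
            let lo = tris.iter().map(|t| t.bbox[d]).min().unwrap();
            let hi = tris.iter().map(|t| t.bbox[d]).max().unwrap();
            hi as i64 - lo as i64
        })
        .unwrap();
    // Median split; neither half needs to be sorted.
    let mid = tris.len() / 2;
    tris.select_nth_unstable_by_key(mid, |t| t.bbox[split_dim]);
    let (left, right) = tris.split_at_mut(mid);
    Node::Inner {
        bbox,
        left: Box::new(build(left)),
        right: Box::new(build(right)),
    }
}

fn main() {
    let mut tris: Vec<Tri> = (0..10).map(|i| Tri { bbox: [i, i, i + 1, i + 1] }).collect();
    if let Node::Inner { bbox, .. } = build(&mut tris) {
        assert_eq!(bbox, [0, 0, 10, 10]);
    }
}
```

Note that the split dimension is used only during construction and then discarded, matching the P.S. above: search needs only the per-node bounding box.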
This reverts commit 19eab16. Restore radix select in order to implement a merge solution that will not require a temporary file.
Existing code behaved as if the result of `select_nth_unstable_by_key` was either a sorted array or the product of an algorithm that gathered the partition values together, as in the Dutch national flag problem. The existing code was written knowing that the former isn't true and that the latter isn't advertised. Knowing, but not remembering. Quite the oversight.
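What the standard library actually guarantees is narrower than either assumption; a demo (the helper name is mine):

```rust
// After the call, the element at `mid` is in its final sorted position,
// everything before it compares <= and everything after compares >= --
// but neither side is sorted, and equal keys are not gathered next to
// the pivot as they would be in a Dutch-national-flag partition.
fn median_partition(v: &mut [i32]) -> i32 {
    let mid = v.len() / 2;
    v.select_nth_unstable_by_key(mid, |&x| x);
    v[mid]
}

fn main() {
    let mut v = [7, 1, 5, 3, 5, 9, 2];
    let pivot = median_partition(&mut v);
    assert_eq!(pivot, 5); // sorted order would be [1, 2, 3, 5, 5, 7, 9]
    let mid = v.len() / 2;
    assert!(v[..mid].iter().all(|&x| x <= pivot));
    assert!(v[mid + 1..].iter().all(|&x| x >= pivot));
}
```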
|
@flatheadmill Hello. I had time to take a deeper look at your PR. For a large PR like this, I usually modify the code as I look around. I will put my comments inline; you can also just pick the commits from #2755 when it makes sense to you. |
| }; |
| // Get spatial writer and rebuild block kd-tree. |
| spatial_serializer.serialize_field(field, triangles)?; |
| } |
FYI I imported your PR in #2755 and started reviewing.
The tests do not pass in your PR, mainly due to:
| } |
| } |
| spatial_serializer.close()?; |
src/index/index_meta.rs
| SegmentComponent::FastFields => ".fast".to_string(), |
| SegmentComponent::FieldNorms => ".fieldnorm".to_string(), |
| SegmentComponent::Delete => format!(".{}.del", self.delete_opstamp().unwrap_or(0)), |
| SegmentComponent::Spatial => ".spatial".to_string(), |
General thing...
I think we should avoid creating this file if it is not used.
It might seem benign, but tantivy creates those files on every single commit. Ideally we would fix that at the directory level, but I never managed to do it.
TBH it should be true for other components in tantivy (especially field norms).
I think I managed to get that in my #2755 .
src/indexer/merger.rs
| for (field, mut temp_file) in temp_files { |
| // Flush and sync triangles. |
| temp_file.flush()?; |
| temp_file.as_file_mut().sync_all()?; |
fsyncing is not useful for this usage.
src/indexer/merger.rs
| } |
| for (segment_ord, reader) in self.readers.iter().enumerate() { |
| for (field, temp_file) in &mut temp_files { |
| let spatial_readers = reader.spatial_fields(); |
we need the temp_file to be buffered.
src/query/spatial_query.rs
| } |
| impl SpatialWeight { |
| fn new(field: Field, bounds: [(i32, i32); 2], query_type: SpatialQueryType) -> Self { |
I think (?) the integer bbox you end up using assumes that the two points given here are the min corner and the max corner.
(because of the intersect code)
It might be worth enforcing that somewhere to avoid bugs.
Ideally I would prefer to have a IBox struct type that forces that invariant upon construction.
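The suggested invariant-enforcing type might look like this (a sketch; I use `i32` to match the quantized coordinates used elsewhere, where the comment wrote `u32`):

```rust
// Sketch of the suggested IBox type: construction normalizes the two
// corners so the min <= max invariant holds everywhere downstream.
#[derive(Clone, Copy, Debug, PartialEq)]
struct IPoint {
    x: i32,
    y: i32,
}

#[derive(Clone, Copy, Debug, PartialEq)]
struct IBox {
    min_corner: IPoint,
    max_corner: IPoint,
}

impl IBox {
    fn new(a: IPoint, b: IPoint) -> Self {
        IBox {
            min_corner: IPoint { x: a.x.min(b.x), y: a.y.min(b.y) },
            max_corner: IPoint { x: a.x.max(b.x), y: a.y.max(b.y) },
        }
    }
}

fn main() {
    // Corners may arrive in any order; the constructor sorts them.
    let b = IBox::new(IPoint { x: 5, y: -2 }, IPoint { x: 1, y: 3 });
    assert_eq!(b.min_corner, IPoint { x: 1, y: -2 });
    assert_eq!(b.max_corner, IPoint { x: 5, y: 3 });
}
```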
src/spatial/triangle.rs
| } |
| } |
| // change orientation if CW |
| if orient2d( |
robust is imported just for orient2d.
I think we should remove the dep and use an adhoc implementation.
orient_2d is also needlessly complex. (We just want the sign of the determinant)
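The ad-hoc version suggested here can even be exact if it runs on the quantized `i32` coordinates, widening to `i128` so the determinant cannot overflow. A sketch, not the PR's code:

```rust
// Sign of the cross-product determinant for points A, B, C:
// > 0 counter-clockwise, < 0 clockwise, 0 collinear. On i32 inputs
// widened to i128 the arithmetic is exact, so no robust-arithmetic
// dependency is needed.
fn orient2d(ax: i32, ay: i32, bx: i32, by: i32, cx: i32, cy: i32) -> i32 {
    let det = (bx as i128 - ax as i128) * (cy as i128 - ay as i128)
        - (by as i128 - ay as i128) * (cx as i128 - ax as i128);
    det.signum() as i32
}

fn main() {
    assert_eq!(orient2d(0, 0, 1, 0, 0, 1), 1); // counter-clockwise
    assert_eq!(orient2d(0, 0, 0, 1, 1, 0), -1); // clockwise
    assert_eq!(orient2d(0, 0, 1, 1, 2, 2), 0); // collinear
}
```

Shewchuk's adaptive-precision version matters for `f64` inputs; once coordinates are integers, the plain determinant suffices.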
src/spatial/triangle.rs
| Coord { y: cy, x: cx }, |
| ) < 0.0 |
| { |
| let temp_x = bx; |
Suggested change:
| let temp_x = bx; |
| // We fix the orientation by swapping B and C. |
| let temp_x = bx; |
src/spatial/triangle.rs
| let temp_y = by; |
| let temp_boundary = ab; |
| // ax and ay do not change, ab becomes bc |
| ab = bc; |
I believe this is a bug:
Suggested change:
| ab = bc; |
| ab = ac; |
src/spatial/triangle.rs
| /// indicating which edges are polygon boundaries. Returns a triangle struct with the bounding |
| /// box in the first four words as `[min_y, min_x, max_y, max_x]`. When decoded, the vertex |
| /// order may differ from the original input to `new()` due to normalized rotation. |
| pub fn new(doc_id: u32, triangle: [i32; 6], boundaries: [bool; 3]) -> Self { |
In several places in the code, we rely on implicit ordering to have lon, lat or y, x.
The code would benefit from introducing
`GeoPoint { x: f64, y: f64 }`
and `IPoint { x: u32, y: u32 }`, and `IBox { min_corner: IPoint, max_corner: IPoint }` or `IBox { min_x, min_y, max_x, max_y }`.
This is zero cost in rust.
Removed use of robust.
Spatial fields returned false from `is_indexed()`, so the segment writer skipped them before reaching `add_geometry`. Moved spatial field handling above the `is_indexed` gate since spatial fields have their own write path and do not participate in postings.

The spatial serializer called `CompositeWrite::for_field` twice for the edge index: once to create the `EdgeWriter` and again to flush. The second call tripped a duplicate field address assertion. Hoisted the `for_field` call so the same reference serves both uses.

`RegionCoverer::get_covering_internal` sent initial candidates directly to the priority queue without expanding their children. The main loop then drained an empty children vec and produced nothing. Changed to route initial candidates through `add_candidate`, which expands children before queuing. This matches the S2 C++ reference.
The forward walk counted entries to reach a target geometry position but did not skip doc_id footers that appear between sets. When geometry_id 3 followed a 3-member set, the walk stopped at the footer after the third entry instead of advancing past it. The footer bytes were then read as a vertex byte length, producing a slice overflow. After the walk loop exits, consume any doc_id footers before treating the current position as the target entry.
Only set has_interior when the covering cell fully contains the index cell. Use range_min comparison in the Subdivided walk for correct overlap detection across cell levels.
The CellIndex assigns edges to cells using conservative bounding box tests, so multiple geometries can share a cell even when they do not overlap spatially. When an interior covering cell contains such a shared index cell, the shortcut now confirms the candidate's first vertex is inside the query polygon before accepting.
Same pattern as the serializer fix: CompositeWrite::for_field was called twice for the edge index, once to create the EdgeWriter and again to flush. Hoisted the for_field call so the same reference serves both. Removed a leftover dbg statement.
The merger was writing edges in cell-walk encounter order, but the GeometryMap assigns IDs during the interleave stage before the split and sibling stages reorder cells. The edge positions in the output did not match the geometry IDs in the merged cell index. Write edges in sequential new_id order instead of encounter order. Assign set member IDs starting from the head so they are consecutive regardless of which member is encountered first during the cell walk.
Introduces $intersects and $contains as query parser syntax, composable
with text queries via AND/OR. The field name is required and must refer
to a spatial field. Coordinates are comma-separated lon/lat pairs with
implicit ring closure.
geometry:$intersects(-99.49 45.56, -99.45 45.56, -99.45 45.59, -99.49 45.59)
name:Hosmer AND geometry:$intersects(-100.0 45.0, -99.0 45.0, -99.0 46.0, -100.0 46.0)
New SpatialPredicateKind enum in the grammar crate, Spatial variant in
UserInputLeaf and LogicalLiteral, parser rule wired into the strict
literal alt, and lowering to SpatialQuery in convert_literal_to_query.
Points enter the cell index as degenerate edges (v0 == v1) with dimension 0. The edge reader duplicates single-vertex entries on decode so the edge_id rule holds without caller changes. Line strings exposed false positives in the intersects query -- brute_force_contains treated open paths as polygons. The fix encodes a closed bit in the high bit of the edge entry len field and moves the doc_id from a trailing footer to inline on the head entry. The intersects query skips the reverse containment test for open geometries. The has_crossing guard drops from n < 3 to n < 2 to allow 2-vertex line strings.
…reverse containment test. GeoJSON RFC 7946 specifies hole rings as clockwise. S2 expects all rings counterclockwise, enclosing at most half the sphere. A clockwise hole ring interpreted as an S2 loop encloses nearly the entire sphere, causing compute_origin_inside to return true. The InteriorTracker then marks the geometry as containing every cell on the face. The fix reverses hole rings in the writer before computing origin_inside. The reverse containment test in the intersects query now uses indexed containment through the segment's cell index instead of brute_force_contains. A ContainmentIndex trait abstracts cell lookup and edge resolution over both the in-memory CellIndex (forward direction) and the CellIndexReader plus EdgeReader (reverse direction). The generic contains_point function finds the cell containing the test point, starts from contains_center, and counts crossings of only the clipped edges. This eliminates the flattened vertex bugs in brute_force_contains (degenerate A-to-A edges from ring closure, spurious cross-ring edges from multi-ring flattening) and removes origin_inside from the query path.
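The orientation check behind this fix can be illustrated with a planar shoelace sign test. This is a simplification for intuition: the real code operates on S2 loops on the sphere, and the function names here are mine.

```rust
// Signed area of a ring via the shoelace formula: positive for
// counter-clockwise rings, negative for clockwise ones. The closing
// edge is implied by wrapping from the last vertex back to the first.
fn signed_area(ring: &[(f64, f64)]) -> f64 {
    let mut sum = 0.0;
    for i in 0..ring.len() {
        let (x0, y0) = ring[i];
        let (x1, y1) = ring[(i + 1) % ring.len()];
        sum += x0 * y1 - x1 * y0;
    }
    sum / 2.0
}

// RFC 7946 hole rings arrive clockwise; reverse them so every ring is
// counter-clockwise before it is interpreted as an S2 loop.
fn normalize_ring(ring: &mut Vec<(f64, f64)>) {
    if signed_area(ring) < 0.0 {
        ring.reverse();
    }
}

fn main() {
    // A clockwise unit square, as a GeoJSON hole ring would be wound.
    let mut ring = vec![(0.0, 0.0), (0.0, 1.0), (1.0, 1.0), (1.0, 0.0)];
    assert!(signed_area(&ring) < 0.0);
    normalize_ring(&mut ring);
    assert!(signed_area(&ring) > 0.0);
}
```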
Mechanical port of GetDistance, UpdateMinDistance, IsDistanceLess, UpdateMinInteriorDistance, and GetUpdateMinDistanceMaxError. These compute the minimum distance from a point to a great circle edge on the unit sphere using chord length, needed for the distance query verification step.
DistanceQuery builds an S2Cap from a center point and radius, covers it with RegionCoverer, and verifies candidates by computing the minimum point-to-edge distance using the s2edge_distances port. For polygons, containment is checked first via indexed containment through the segment's cell index -- if the center is inside the polygon, the distance is zero. The query parser accepts $within(50mi, lon lat) and $between(50mi, 100mi, lon lat) with unit suffixes mi, km, m, ft, rad.
Geometry enters as Geometry<Plane>, gets projected onto the surface via project::<Sphere>(), and gets smashed into stored format via to_geometry_set. One function, one format, for both query and write. * Surface trait with Sphere and Plane, project. * GeometrySet and EdgeSet as the smashed representation. * to_geometry_set smashes projected geometry, computes contains_hilbert_start, flattens rings with closure vertices. * build_from_sets on IndexBuilder seeds the InteriorTracker from precomputed flags instead of running brute_force_contains at build time. * Queries accept GeometrySet directly. QueryEdgeProvider wraps it. * Edge reader returns EdgeSet via get_edge_set so callers never compute member indices. * Edge writer takes GeometrySet.
* get_distance_to_point -- minimum distance from cell to point, zero if the point is inside. Feature-based computation in the cell's face-local UVW frame with Voronoi region branching. * get_boundary_distance -- minimum distance from cell boundary to point. * get_distance_to_edge -- minimum distance from cell to edge, zero if the edge intersects the cell or an endpoint is inside.
Geometry-to-geometry distance query over the cell hierarchy. The query geometry gets its own CellIndex for bilateral pruning -- only query edges in cells that overlap the stored cell are checked. The priority queue is keyed by cell-to-query-geometry lower bounds using S2Cell::get_distance_to_edge, and leaf verification uses update_edge_pair_min_distance. * closest_edge_query.rs: branch-and-bound traversal with three modes from one algorithm -- kNN (dynamic threshold), within-D (fixed threshold), boolean (exit on first witness). * $knn(K, lon lat) syntax in the query grammar. * SpatialPredicate::Knn wired through SpatialQuery and SpatialWeight.
The covering-based DistanceQuery is no longer called from the query path. Within, between, and kNN all go through the branch-and-bound traversal. Added range distance mode for $between with inner and outer thresholds.
The executor is a Query whose Weight construction evaluates a plan tree before the scorer pipeline runs. Plan nodes are either tantivy queries evaluated to per-segment bitsets, or spatial operations whose filter inputs are other plan nodes. The tree evaluates inside-out so nested operations resolve before their parents. * Added SpatialExecutor implementing Query. Weight construction calls a recursive evaluate function over the plan tree, producing per-segment bitsets or scored result sets. Scorers iterate the precomputed results. * Added PlanNode enum with Query, Intersects, Knn, and Join variants. Knn and Join are stubbed. Intersects is wired to search_segment_filtered. * Added EdgeReader::doc_id_for to resolve geometry_id to doc_id without decoding vertices or populating the cache. Reads the skip list and entry headers only. * Added IntersectsQuery::search_segment_filtered which checks a terms bitset per candidate shape before verification. Shapes whose doc_id is not in the bitset are skipped before crossing tests. * Added Clone to EdgeSet and GeometrySet.
…cutor Within node. * Added ClosestEdgeQuery::search_segment_filtered. Checks a terms bitset per candidate shape in process_edges before edge-pair distance computation. The filter check uses doc_id_for to resolve geometry_id to doc_id without decoding vertices. * Added DistanceQuery::search_segment_filtered. Same bitset check pattern in collect_candidates. * Added PlanNode::Within to the executor. Takes a GeometrySet and radius. Uses ClosestEdgeQuery::within for geometry-to-geometry distance, not the point-based DistanceQuery. The executor uses branch-and-bound for all distance operations.
The join evaluates inner and outer child nodes to per-segment bitsets, then for each segment walks the cell index to find outer geometries by checking doc_id_for against the outer bitset. For each outer geometry, builds a ClosestEdgeQuery with any_within and probes all segments filtered by the inner bitset. First positive across any segment sets the bit. The result is per-segment bitsets of outer docs that passed the spatial join predicate.
Left over from the triangle days.
* Added ContainsQuery::search_segment_filtered with terms bitset check per candidate shape, same pattern as IntersectsQuery and DistanceQuery. * The Join evaluate arm dispatches on SpatialRelation: Near uses ClosestEdgeQuery::any_within, Between uses any_between, Intersects uses IntersectsQuery, Contains uses ContainsQuery. All four probe with search_segment_filtered against the inner bitset. * Added ClosestEdgeQuery::any_between for boolean range distance with early exit.
Point-to-geometry distance replaced by closest_edge_query which handles any geometry type through edge-pair distance with branch-and-bound.
Runs ClosestEdgeQuery::knn per segment with an optional filter bitset, collects results tagged by segment ID, sorts by distance, truncates to global top K, and redistributes to per-segment SegmentResult::Scored vecs for the replay scorer. Fixed a double-yield bug in ReplayScorer where the Scored variant emitted the first doc twice. Removed the started flag and simplified advance to always increment the index.
Spatial predicates now accept $query(...) where coordinates would normally go. The inner query text is parsed recursively by the query parser and lowered to a SpatialExecutor with PlanNode::Join. The join absorbs all four spatial relations: $within, $between, $intersects, $contains. * Added query_arg parser in query_grammar.rs to extract balanced $query(...) content. Each spatial predicate parser tries $query() before falling back to coordinate parsing. * Added inner_query field to UserInputLeaf::Spatial and LogicalLiteral::Spatial. * The lowering in query_parser.rs detects inner_query and builds a SpatialExecutor with PlanNode::Join. The inner query is parsed recursively via the same parser. * Threaded query_parser reference through convert_to_query and convert_literal_to_query for the recursive parse.
When the query parser lowers a boolean AND that contains a spatial join, the non-join children become the join's outer PlanNode instead of composing as independent boolean sub-queries. The executor probes only documents matching the sibling predicates, not all documents in the index. Added SpatialExecutor::set_outer to replace the outer PlanNode after construction.
Each geometry entry now starts with a flags byte (u8) followed by len (u32). The flags byte carries closed, contains_origin, has_holes, and is_head in individual bits. The old len and set u32 pair with bit-packed flags and back pointers is gone. Head entries are flags + len + doc_id + data (9 byte header). Member entries are flags + len + data (5 byte header). Sets are delimited by the is_head flag, not by a member count. The forward walk notes the doc_id from each head it passes. doc_id_for walks forward from the skip list, noting doc_ids from heads. If the target is reached without a head, it retries from the previous skip list entry. No back pointers, no backward motion through the data.
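The entry layout can be sketched as follows; the bit positions within the flags byte are illustrative, not necessarily the serialized format's actual assignments.

```rust
// Illustrative flags byte: closed, contains_origin, has_holes, and
// is_head each occupy one bit.
const CLOSED: u8 = 1 << 0;
const CONTAINS_ORIGIN: u8 = 1 << 1;
const HAS_HOLES: u8 = 1 << 2;
const IS_HEAD: u8 = 1 << 3;

// Head entries: flags + len + doc_id + data (9-byte header). Member
// entries would omit doc_id (5-byte header) and clear IS_HEAD.
fn encode_head(flags: u8, len: u32, doc_id: u32, data: &[u8]) -> Vec<u8> {
    let mut out = vec![flags | IS_HEAD];
    out.extend_from_slice(&len.to_le_bytes());
    out.extend_from_slice(&doc_id.to_le_bytes());
    out.extend_from_slice(data);
    out
}

fn main() {
    let _ = (CONTAINS_ORIGIN, HAS_HOLES); // remaining flag bits
    let entry = encode_head(CLOSED, 3, 42, &[1, 2, 3]);
    assert_eq!(entry.len(), 9 + 3);
    assert_ne!(entry[0] & IS_HEAD, 0);
    assert_eq!(u32::from_le_bytes(entry[5..9].try_into().unwrap()), 42);
}
```

A forward walk then needs only to test `IS_HEAD` on each entry to know where a set begins and where to pick up the current doc_id.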
This pull request adds spatial indexing to Tantivy using block kd-trees. The implementation tessellates polygons into triangles and indexes the triangles in the block kd-tree.
Search supports three query types: "intersects" finds documents whose geometries overlap a query rectangle, "within" finds documents whose geometries fall entirely inside the query bounds, and "contains" finds documents whose geometries completely enclose the query region. I've attempted to follow Tantivy's established patterns. Spatial fields use a new `FieldType::Spatial` variant. I used `PreTokenizedString` as a go-by.

The user-provided `f64` geographic coordinates are serialized using XOR compression that exploits spatial locality. When consecutive vertices are close together, their bit representations share high-order bits, and XOR reveals this redundancy as zeros that varint encoding can compress. The compression falls back to uncompressed when data is incompressible. Looked at alternatives; `bitshuffle` + `zstd` might be nice for large polygons, but small polygons would be smaller than a `zstd` block. The XOR is simple and a good place to start.

The block kd-tree implementation uses bulk-loaded immutable construction with recursive median partitioning, creating a somewhat balanced block kd-tree with 512 triangles per leaf. The tree stores triangles. Polygons are tessellated into triangles prior to indexing.
`f64` dimensions are converted to `i32` prior to tessellation and the block kd-tree stores the dimensions as `i32`. Triangles are stored in an encoded format that places a bounding box for the triangle in the first four words, followed by the spare x/y, boundary flags, and finally the doc_id. The boundary flags indicate whether a side of the triangle shares a boundary with the tessellated polygon and are used for "within" and "contains" testing.

The tree construction uses Rust's standard library `select_nth_unstable_by_key` for median partitioning, operating directly on `&mut [Triangle]` slices. Tree construction is a simple recursive descent based around this standard library function. The recursive descent receives slices of progressively smaller sub-ranges until reaching leaf size. This approach works identically whether the triangles come from a `Vec` during initial indexing or from a memory-mapped file during merge.

Merge is implemented by writing these `Triangle` structures into a temporary file, memory mapping the temporary file, and then performing an unsafe cast of the memory to `&mut [Triangle]`. Because the file is memory mapped, the `select_nth_unstable_by_key` tree construction can rely on the operating system to manage the memory when merging large segments.

If you are zealously opposed to unsafe code then `&mut [Triangle]` and `select_nth_unstable_by_key` have to be rewritten, but I don't see how it is a problem. The `Vec<Triangle>` is Rust-safe. The temporary file is entirely a transient step in the block kd-tree merge. Ought to be able to add a compile switch for endianness. It is not a persistent file; it is temporary only for the merge. It does not have to have the same endianness on different architectures.

This implementation is certainly Lucene inspired. The triangle encoding is a 1:1 port, with the exception of replacing orient with the robust `orient2d` implementation of Shewchuk's 2d orient. The rest is more inspiration than port. Porting from Lucene 1:1 is not sensible: Lucene needs to operate on mapped memory via Java arrays, and it goes to great lengths to avoid the creation of POJOs to keep the garbage collector out of the picture. The Rust implementation can read more like Rust.
`f64` to `i32` is Lucene inspired, roughly centimeter precision. The bulk-loaded partition construction is Lucene inspired, but the in-place sorting and reuse of the standard library is a feature of Rust. Delta encoding for `i32`/`u32` is Lucene inspired. Currently using barycentric point-in-triangle instead of Lucene's orientation tests.

Dependencies added are `robust` for `orient2d` and `i_triangle` for its integer-based Delaunay triangulation that tracks boundary edges.

I have avoided making anything "pluggable" and bonded tightly to Tantivy and the chosen 3rd-party libraries for simplicity and easy reading.
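For illustration, the degrees-to-`i32` quantization can be sketched as a fixed scaling of the degree range onto the `i32` range; the exact constants and function names in the PR may differ. One step is 360 / 2^32 ≈ 8.4e-8 degrees, just under a centimeter of longitude at the equator, which is where the "roughly centimeter precision" figure comes from.

```rust
// Illustrative degrees-to-i32 quantization: map [-180, 180) onto the
// full i32 range. One quantization step is 360 / 2^32 degrees.
const SCALE: f64 = (1u64 << 32) as f64 / 360.0;

fn encode_lon(lon: f64) -> i32 {
    (lon * SCALE).floor() as i32
}

fn decode_lon(encoded: i32) -> f64 {
    encoded as f64 / SCALE
}

fn main() {
    let e = encode_lon(-99.49);
    // Round-trip error is bounded by one quantization step.
    assert!((decode_lon(e) + 99.49).abs() < 1e-6);
}
```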
In the above you'll see a discussion of how `select_nth_unstable_by_key` is used with a `&mut [Triangle]` cast from a memory-mapped file; please consider it.

Please consider the minimalist `Geometry` enum. Rather than create yet another object hierarchy of `Point`, `Polygon`, `BoundingBox`, et al., I used underlying representations that can be easily handed off to GeoRust or similar. I didn't want to make GeoRust or GeoJSON parsing a Tantivy dependency. The block kd-tree is intended for bounding box and point queries, not for the full range of queries you would find in PostGIS, like polygon in polygon. Such a test could be performed while iterating the search results.

Note that the block kd-tree is not going to work with geometries that cross the antimeridian, but this is a known issue, and data sets will probably already split a polygon that crosses the antimeridian into a multi-polygon with a polygon for each side of the antimeridian. This is even in the GeoJSON spec.
Note that there is a Spatial field and a Geometry type stored in that field. Would it be better to have a Geographic or Geo field since the implementation is geo specific?
As I write this, "contains" and "within" are implemented in the block kd-tree but not exposed through the interfaces. I will implement "disjoint." There are `todo!()`s to implement in the `Field::Spatial(_)` match arms. There are some tree balancing improvements that I'd like to add. I'd like to put a lot of Geofabrik data through it and see how well it performs. However, it is ready for consideration by the Tantivy maintainers.

Run `cargo run --example geo_json` for a minimal round-trip through the index.