
Geo-search #2729

Open

flatheadmill wants to merge 68 commits into quickwit-oss:main from flatheadmill:geo
Conversation

@flatheadmill


This pull request adds spatial indexing to Tantivy using block kd-trees. The implementation tessellates polygons into triangles and indexes the triangles in the block kd-tree.

Search supports three query types: "intersects" finds documents whose geometries overlap a query rectangle, "within" finds documents whose geometries fall entirely inside the query bounds, and "contains" finds documents whose geometries completely enclose the query region. I've attempted to follow Tantivy's established patterns. Spatial fields use a new FieldType::Spatial variant. I used PreTokenizedString as a go-by.

The user-provided f64 geographic coordinates are serialized using XOR compression that exploits spatial locality. When consecutive vertices are close together, their bit representations share high-order bits, and XOR reveals this redundancy as zeros that varint encoding can compress. The compression falls back to uncompressed storage when the data is incompressible. I looked at alternatives: bitshuffle+zstd might be nice for large polygons, but small polygons would be smaller than a zstd block. XOR is simple and a good place to start.
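
As a concrete illustration of the scheme just described, here is a minimal sketch (my own illustrative code, not the PR's actual serializer) of XOR-delta compression over f64 bit patterns with LEB128-style varints:

```rust
// Consecutive f64 values are XORed as raw IEEE 754 bit patterns, so
// nearby coordinates leave only low-order bits set, and the XOR value
// is small when viewed as a u64. A varint then stores it compactly.

fn varint_encode(mut v: u64, out: &mut Vec<u8>) {
    // Little-endian base-128: 7 payload bits per byte, high bit = "more".
    while v >= 0x80 {
        out.push((v as u8 & 0x7f) | 0x80);
        v >>= 7;
    }
    out.push(v as u8);
}

fn varint_decode(buf: &[u8], pos: &mut usize) -> u64 {
    let mut v = 0u64;
    let mut shift = 0;
    loop {
        let byte = buf[*pos];
        *pos += 1;
        v |= u64::from(byte & 0x7f) << shift;
        if byte & 0x80 == 0 {
            return v;
        }
        shift += 7;
    }
}

fn xor_compress(values: &[f64]) -> Vec<u8> {
    let mut out = Vec::new();
    let mut prev = 0u64;
    for &value in values {
        let bits = value.to_bits();
        // Close values share high-order bits, so the XOR is numerically
        // small and the varint stays short.
        varint_encode(bits ^ prev, &mut out);
        prev = bits;
    }
    out
}

fn xor_decompress(buf: &[u8], count: usize) -> Vec<f64> {
    let mut out = Vec::with_capacity(count);
    let mut pos = 0;
    let mut prev = 0u64;
    for _ in 0..count {
        let bits = varint_decode(buf, &mut pos) ^ prev;
        out.push(f64::from_bits(bits));
        prev = bits;
    }
    out
}
```

Repeated vertices XOR to zero and cost a single byte; the fallback to raw storage described above would kick in when the varints exceed eight bytes per value.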

The block kd-tree implementation uses bulk-loaded immutable construction with recursive median partitioning, creating a somewhat balanced block kd-tree with 512 triangles per leaf. The tree stores triangles. Polygons are tessellated into triangles prior to indexing. f64 dimensions are converted to i32 prior to tessellation, and the block kd-tree stores the dimensions as i32. Triangles are stored in an encoded format that places a bounding box for the triangle in the first four words, followed by the spare x/y, boundary flags, and finally the doc_id. The boundary flags indicate whether a side of the triangle shares a boundary with the tessellated polygon and are used for "within" and "contains" testing.
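
A rough picture of the encoded layout; the field names and exact packing here are my own invention, and only the bbox-first ordering is taken from the description above:

```rust
// Illustrative decoding of the word layout: the first four i32 words are
// the triangle's bounding box, so traversal can prune on the box without
// reconstructing the vertices.
#[derive(Debug, PartialEq)]
struct EncodedTriangle {
    bbox: [i32; 4], // [min_y, min_x, max_y, max_x]
    spare_x: i32,   // vertex coordinate not recoverable from the bbox alone
    spare_y: i32,
    boundary: [bool; 3], // does edge ab / bc / ca lie on the polygon boundary?
    doc_id: u32,
}

fn bbox_of(triangle: [i32; 6]) -> [i32; 4] {
    // Vertices as (y, x) pairs: [ay, ax, by, bx, cy, cx].
    let ys = [triangle[0], triangle[2], triangle[4]];
    let xs = [triangle[1], triangle[3], triangle[5]];
    [
        *ys.iter().min().unwrap(),
        *xs.iter().min().unwrap(),
        *ys.iter().max().unwrap(),
        *xs.iter().max().unwrap(),
    ]
}
```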

The tree construction uses Rust's standard library select_nth_unstable_by_key for median partitioning, operating directly on &mut [Triangle] slices. Tree construction is a simple recursive descent based around this standard library function. The recursive descent receives slices of progressively smaller sub-ranges until reaching leaf size. This approach works identically whether the triangles come from a Vec during initial indexing or from a memory-mapped file during merge.
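
The recursive descent can be sketched as follows. This is illustrative code, not the PR's: the triangle is stripped down to its bbox minima, and the split dimension simply alternates by depth rather than being chosen by spread.

```rust
const LEAF_SIZE: usize = 4; // the PR uses 512

#[derive(Clone, Copy)]
struct Tri {
    min_x: i32,
    min_y: i32,
}

// Recursively median-partition the slice in place, recording leaf sizes.
fn build(tris: &mut [Tri], depth: usize, leaves: &mut Vec<usize>) {
    if tris.len() <= LEAF_SIZE {
        leaves.push(tris.len());
        return;
    }
    let mid = tris.len() / 2;
    // Median partition without a full sort: everything left of `mid`
    // keys <= the median element, everything right keys >= it.
    if depth % 2 == 0 {
        tris.select_nth_unstable_by_key(mid, |t| t.min_x);
    } else {
        tris.select_nth_unstable_by_key(mid, |t| t.min_y);
    }
    let (left, right) = tris.split_at_mut(mid);
    build(left, depth + 1, leaves);
    build(right, depth + 1, leaves);
}
```

Because select_nth_unstable_by_key partitions in place, the same &mut [Tri] works whether it borrows from a Vec or from a memory-mapped file.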

Merge is implemented by writing these Triangle structures into a temporary file, memory-mapping the temporary file, and then performing an unsafe cast of the memory to &mut [Triangle]. Because the file is memory-mapped, the select_nth_unstable_by_key tree construction can rely on the operating system to manage the memory when merging large segments.

If you are zealously opposed to unsafe code, then the &mut [Triangle] and select_nth_unstable_by_key approach has to be rewritten, but I don't see how it is a problem. The Vec<Triangle> is Rust-safe. The temporary file is entirely a transient step in the block kd-tree merge: it is not a persistent file, it exists only for the duration of the merge, and it does not have to have the same endianness on different architectures. A compile switch for endianness could be added if it ever mattered.

This implementation is certainly Lucene-inspired. The triangle encoding is a 1:1 port, with the exception of replacing orient with the robust crate's orient2d implementation of Shewchuk's 2D orientation predicate. The rest is more inspiration than port. Porting from Lucene 1:1 is not sensible: Lucene needs to operate on mapped memory via Java arrays, and it goes to great lengths to avoid the creation of POJOs to keep the garbage collector out of the picture. The Rust implementation can read more like Rust.

The f64-to-i32 conversion is Lucene-inspired, roughly centimeter precision. The bulk-loaded partition construction is Lucene-inspired, but the in-place sorting and reuse of the standard library is a feature of Rust. Delta encoding for i32/u32 is Lucene-inspired. Currently using barycentric point-in-triangle tests instead of Lucene's orientation tests.

Dependencies added are robust (for orient2d) and i_triangle (for its integer-based Delaunay triangulation that tracks boundary edges).

I have avoided making anything "pluggable" and have bonded tightly to Tantivy and the chosen third-party libraries for simplicity and easy reading.


Above you'll find a discussion of the use of select_nth_unstable_by_key with a &mut [Triangle] cast from a memory-mapped file; please consider it.

Please consider the minimalist Geometry enum. Rather than create yet another object hierarchy of Point, Polygon, BoundingBox, et al., I used underlying representations that can be easily handed off to GeoRust or similar. I didn't want to make GeoRust or GeoJSON parsing a Tantivy dependency. The block kd-tree is intended for bounding box and point queries, not for the full range of queries you would find in PostGIS, like polygon-in-polygon. Such a test could be performed while iterating the search results.
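
For the sake of discussion, a minimal enum in this spirit might look like the following; the variant shapes are my guess, not the PR's definition:

```rust
// Coordinates stay as plain nested Vecs of lon/lat pairs, so handing
// them to a GeoRust-style crate is a cheap conversion rather than a
// Tantivy dependency.
#[derive(Debug, Clone)]
enum Geometry {
    Point([f64; 2]),
    LineString(Vec<[f64; 2]>),
    // Outer ring first, then hole rings.
    Polygon(Vec<Vec<[f64; 2]>>),
    MultiPolygon(Vec<Vec<Vec<[f64; 2]>>>),
}

// Total vertex count across all rings of all polygons.
fn vertex_count(geometry: &Geometry) -> usize {
    match geometry {
        Geometry::Point(_) => 1,
        Geometry::LineString(points) => points.len(),
        Geometry::Polygon(rings) => rings.iter().map(Vec::len).sum(),
        Geometry::MultiPolygon(polygons) => polygons
            .iter()
            .flat_map(|rings| rings.iter())
            .map(Vec::len)
            .sum(),
    }
}
```

Rings stay as raw Vecs, so converting to a geo-style crate's types is mechanical.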

Note that the block kd-tree is not going to work with geometries that cross the antimeridian, but this is a known issue, and data sets will probably already split a polygon that crosses the antimeridian into a multi-polygon with a polygon on each side of the antimeridian. This is even in the GeoJSON spec.

Note that there is a Spatial field and a Geometry type stored in that field. Would it be better to have a Geographic or Geo field since the implementation is geo specific?

As I write this, "contains" and "within" are implemented in the block kd-tree but not exposed through the interfaces. I will implement "disjoint." There are todo!()s to implement in the Field::Spatial(_) match arms. There are some tree balancing improvements that I'd like to add. I'd like to put a lot of Geofabrik data through it and see how well it performs. However, it is ready for consideration by the Tantivy maintainers.


Run `cargo run --example geo_json` for a minimal round-trip through the index.

Implement a SPATIAL flag for use in creating a spatial field.

Encodes triangles with the bounding box in the first four words,
enabling efficient spatial pruning during tree traversal without
reconstructing the full triangle. The remaining words contain an
additional vertex and packed reconstruction metadata, allowing exact
triangle recovery when needed.

The `triangulate` function takes a polygon with floating-point lat/lon
coordinates, converts to integer coordinates with millimeter precision
(using 2^32 scaling), performs constrained Delaunay triangulation, and
encodes the resulting triangles with boundary edge information for block
kd-tree spatial indexing.

It handles polygons with holes correctly, preserving which triangle
edges lie on the original polygon boundaries versus internal
tessellation edges.

Implemented byte-wise histogram selection to find median values without
comparisons, enabling efficient partitioning of spatial data during
block kd-tree construction. Processes values through multiple passes,
building histograms for each byte position after a common prefix,
avoiding the need to sort or compare elements directly.
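
The byte-wise histogram selection can be sketched as follows. This is a toy version of my own that always scans all four byte positions from the top, whereas the commit above skips a shared prefix first; the function and names are illustrative, not the PR's actual code.

```rust
// Find the k-th smallest u32 without comparing elements: each pass
// histograms one byte position and narrows to the bucket containing
// rank k, accumulating the selected bytes into a prefix.
fn radix_select(values: &[u32], mut k: usize) -> u32 {
    let mut candidates: Vec<u32> = values.to_vec();
    let mut prefix: u32 = 0;
    for byte_pos in (0..4).rev() {
        let shift = byte_pos * 8;
        let mut histogram = [0usize; 256];
        for &v in &candidates {
            histogram[((v >> shift) & 0xff) as usize] += 1;
        }
        // Find the bucket holding the k-th element; k becomes the rank
        // within that bucket for the next pass.
        let mut bucket = 0usize;
        for (b, &count) in histogram.iter().enumerate() {
            if k < count {
                bucket = b;
                break;
            }
            k -= count;
        }
        prefix |= (bucket as u32) << shift;
        // Keep only values in the chosen bucket for the next byte pass.
        candidates.retain(|&v| ((v >> shift) & 0xff) as usize == bucket);
    }
    prefix
}
```
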

Implemented a `Surveyor` that evaluates the bounding boxes of a set of
triangles and determines the dimension with the maximum spread and the
shared prefix for the values of that dimension.

Implements dimension-major bit-packing with zigzag encoding for signed i32
deltas, enabling compression of spatially-clustered triangles from 32-bit
coordinates down to 4-19 bits per delta depending on spatial extent.
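
The zigzag mapping mentioned above is small enough to show in full; this is the standard encoding (as used in Protocol Buffers), which I assume matches the commit's intent:

```rust
// Interleave signed values so small magnitudes of either sign map to
// small unsigned values (0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, ...),
// keeping delta bit widths low for bit-packing.
fn zigzag_encode(v: i32) -> u32 {
    ((v << 1) ^ (v >> 31)) as u32
}

fn zigzag_decode(v: u32) -> i32 {
    ((v >> 1) as i32) ^ -((v & 1) as i32)
}
```
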

Implement an immutable bulk-loaded spatial index using recursive median
partitioning on bounding box dimensions. Each leaf stores up to 512
triangles with delta-compressed coordinates and doc IDs. The tree
provides three query types (intersects, within, contains) that use exact
integer arithmetic for geometric predicates and accumulate results in
bit sets for efficient deduplication across leaves.

The serialized format stores compressed leaf pages followed by the tree
structure (leaf and branch nodes), enabling zero-copy access through
memory-mapped segments without upfront decompression.

Lossless compression for floating-point lat/lon coordinates using XOR
delta encoding on IEEE 754 bit patterns with variable-length integer
encoding. Designed for per-polygon random access in the document store,
where each polygon compresses independently without requiring sequential
decompression.

The triangulation function in `triangle.rs` is now called
`delaunay_to_triangles` and it accepts the output of a Delaunay
triangulation from `i_triangle` and not a GeoRust multi-polygon. The
translation of user polygons to `i_triangle` polygons and subsequent
triangulation will take place outside of `triangle.rs`.

Implemented a geometry document field with a minimal `Geometry` enum.
Now able to add that Geometry from GeoJSON parsed from a JSON document.
Geometry is triangulated if it is a polygon, otherwise it is correctly
encoded as a degenerate triangle if it is a point or a line string.

Write accumulated triangles to a block kd-tree on commit.

Serialize the original `f64` polygon for retrieval from search.

Created a query method for intersection. Query against the memory mapped
block kd-tree. Return hits and original `f64` polygon.

Implemented a merge of one or more block kd-trees from one or more
segments during merge.

Updated the block kd-tree to write to a Tantivy `WritePtr` instead of
more generic Rust I/O.

Ended up using `select_nth_unstable_by_key` from the Rust standard
library instead.

Read node structures using `from_le_bytes` instead of casting memory.
After an inspection of columnar storage, it appears that this is the
standard practice in Rust and in the Tantivy code base. Left the
structure alignment for now in case it tends to align with cache
boundaries.
@flatheadmill flatheadmill marked this pull request as draft November 4, 2025 08:10
@fulmicoton
Collaborator

@flatheadmill This is a super interesting PR. A backlog of contributions has accumulated, so it might take a bit of time to get it reviewed. To make things smoother, would you be ok with a quick call to walk me through the code?

@flatheadmill
Author

Of course. Doesn't have to be quick for my sake. Check your email.

Addressed all `todo!()` markers created when adding `Spatial` field type
and `Geometry` value type to existing code paths:

- Dynamic field handling: `Geometry` not supported in dynamic JSON
  fields, return `unimplemented!()` consistently with other complex
  types.
- Fast field writer: Panic if geometry routed incorrectly (internal
  error.)
- `OwnedValue` serialization: Implement `Geometry` to GeoJSON
  serialization and reference-to-owned conversion.
- Field type: Return `None` for `get_index_record_option()` since
  spatial fields use BKD trees, not inverted index.
- Space usage tracking: Add spatial field to `SegmentSpaceUsage` with
  proper integration through `SegmentReader`.
- Spatial query explain: Implement `explain()` method following pattern of
  other binary/constant-score queries.

Fixed `MultiPolygon` deserialization bug: count total points across all
rings, not number of rings.

Added clippy expects for legitimate too_many_arguments cases in geometric
predicates.
@flatheadmill
Author

For all the talk of k-dimensions, the end result of writing the tree is a run-of-the-mill binary tree whose search functions are very easy to understand. In the end, we have a binary tree of inner nodes that have a left and right pointer to child nodes and a bounding box. During a search, the bounding box allows us to eliminate sub-trees that do not intersect. The binary tree terminates in leaf nodes that have a bounding box and a pointer to where triangles are stored. If the path you follow intersects your sought bounding box, you scan the triangles referenced by the leaf node.
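
A toy version of that search, with illustrative node and box layouts of my own (boxes as [min_x, min_y, max_x, max_y] in i32), not the PR's actual structures:

```rust
enum Node {
    Branch {
        bbox: [i32; 4],
        left: Box<Node>,
        right: Box<Node>,
    },
    Leaf {
        bbox: [i32; 4],
        // (triangle bounding box, doc_id); real leaves decode triangles.
        tris: Vec<([i32; 4], u32)>,
    },
}

fn overlaps(a: &[i32; 4], b: &[i32; 4]) -> bool {
    a[0] <= b[2] && b[0] <= a[2] && a[1] <= b[3] && b[1] <= a[3]
}

fn search(node: &Node, query: &[i32; 4], hits: &mut Vec<u32>) {
    match node {
        Node::Branch { bbox, left, right } => {
            // Prune whole subtrees whose bounding box misses the query.
            if overlaps(bbox, query) {
                search(left, query, hits);
                search(right, query, hits);
            }
        }
        Node::Leaf { bbox, tris } => {
            if overlaps(bbox, query) {
                for (tri_bbox, doc_id) in tris {
                    if overlaps(tri_bbox, query) {
                        hits.push(*doc_id);
                    }
                }
            }
        }
    }
}
```
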


Triangles? Yes, the polygons you add to the index are tessellated into triangles. The i_triangle crate used for tessellation in this pull request has an animation of triangulation in its GitHub README. The animation is all you need to understand what we're trying to accomplish. We are tokenizing a polygon, turning it into triangles. We know we've matched a document if we match a triangle in the polygon. Triangles are described with three points and are therefore easy to store. It was not necessary to study tessellation for this implementation; it was enough to choose an implementation that met the requirements: it works on integers and tracks boundary edges.

With the polygon broken down into triangles, we can store a triangle and the DocId associated with the polygon. We also store the boundary edges that the triangle shares with the tessellated polygon. We can test if a polygon is within a bounding box by testing if a boundary edge crosses the bounding box. If it does, the polygon is marked as excluded; if it doesn't, it is included. Take the difference to get the set of documents within the bounding box. For intersection, all we need to know is whether the bounding box overlaps the triangle in any way.
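
The inclusion/exclusion bookkeeping can be sketched as a set difference. This assumes the leaf scan has already decided, per matched triangle, whether one of its polygon-boundary edges crosses the query box; the names are illustrative, not the PR's.

```rust
use std::collections::BTreeSet;

struct TriangleHit {
    doc_id: u32,
    // True when a boundary edge of this triangle crosses the query box.
    boundary_crosses_query: bool,
}

// "Within" = docs with any matched triangle, minus docs where a polygon
// boundary edge crosses the query box.
fn within_docs(hits: &[TriangleHit]) -> BTreeSet<u32> {
    let mut matched = BTreeSet::new();
    let mut excluded = BTreeSet::new();
    for hit in hits {
        matched.insert(hit.doc_id);
        if hit.boundary_crosses_query {
            excluded.insert(hit.doc_id);
        }
    }
    matched.difference(&excluded).copied().collect()
}
```
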

In this way, this pull request implements an orthogonal range query to search for 2-dimensional shapes within axis-aligned partitions, i.e. bounding-box search.


To grasp the kd-concept quickly, it is enough to see a simple tree and the space it partitions. Otherwise, you could read a Medium article. You will come away with an understanding of how to partition two-dimensional space. You will know how to search for a point in a two-dimensional kd-tree. It's a matter of descending the tree and choosing a path based on the depth of the tree. There are plenty of little examples of this 2-dimensional kd-tree of points in every programming language.

You will be left wondering how to search for a triangle in two-dimensional space. Nowhere will you find a description of how to do this. A search for answers is going to trip over the fact that the word "dimension" is overloaded. Our minimal kd-tree stores zero-dimensional spatial data, points, in a tree that it describes as 2-dimensional. When we use the term "dimension" to describe a kd-tree, we are describing the number of partitioning dimensions of the tree, not the spatial dimension of the indexed forms.

What I found, what finally clicked, was a single "teach a man to fish" answer to a StackOverflow question. A 4-dimensional point can represent a bounding box, where the partitioning dimensions are the min and max of x and y. I wrote a program to create a kd-tree with 4 dimensions and was able to search for a box with a box "using the range (a0,+inf) x (-inf, a1) x (b0, +inf) x (-inf, b1)".

Meanwhile, I had been stepping through the Lucene search where I'd see the bounding boxes in the nodes during search and knew that Lucene was depending on the properties of a kd-tree to build the tree, but not to search the tree. It did not search the tree using a 4-dimensional point, it used bounding boxes instead. It would build the bounding boxes as it partitioned the tree.


Another muddle on the way to understanding is the concept of a "block" kd-tree. Seeing BKDReader.java and BKDWriter.java in Lucene implied that Lucene had implemented a BKD-tree as defined in Bkd-Tree: A Dynamic Scalable kd-Tree, and so I assumed that whatever was in the paper was in Lucene. The paper describes an optimization based on creating LSM-like sets of progressively larger kd-trees. Kd-trees are difficult to update, which is what the paper addresses, but Lucene and Tantivy already address this problem with segments. There's only one kd-tree per spatial field per segment, and that tree is rewritten when segments merge.

I never looked at the paper again once I began stepping through the Lucene code. There I saw the bounding box in the kd-tree node used for search and confirmed my hyper-point-as-bounding-box understanding by seeing that Lucene was indeed indexing the 2-dimensional (spatial) triangle bounding box using a 4-dimensional (partitioning) tree. I saw, too, that a single tree was built by bulk partitioning. I never referenced the original paper myself, but here I am standing on the shoulders of giants, reading the theory in practice, so I'm not the one to ask how the paper influenced the design of Lucene; I can only talk about how Lucene influenced the design of this pull request.

However, since we are not creating an LSM-like structure nor using the serialization strategy presented in the paper, I'm going to rename bkd.rs to kd.rs to put an end to the red herring for future readers. The term "block" in the paper references an I/O optimization to the serialized tree structure that is employed by neither this pull request nor Lucene.


That's enough red herrings. The tree is a 4-dimensional kd-tree built with a recursive partitioning of an array of triangles. The 4-dimensions represent the four positions of the two points in a bounding box. We partition our array of triangles on the kd-tree dimension with the widest spread, the greatest distance between the min and max value. We calculate a bounding box to encapsulate the triangles in the array and then call our partition function on the left and right partitions. We create an inner node that has a bounding box and a left and right child reference. When an array is less than a certain size, that's a leaf.

If you look at the write_leaf_pages function, you'll see the partitioning. As of this commit, that is all the k-dimensional code; the rest is bounding-box based. In fact, kd-tree is probably a misnomer. We're building a binary bounding box tree (a name I just now made up) using k-dimensional partitioning.


P.S. — As you can see, the implementation in this pull request does not implement a generic kd-tree like Lucene. We do not keep the partition dimension in the node, we throw it away and use just the bounding box. Lucene uses its kd-tree for range queries, but I believe Tantivy has "fast fields" for that use case. It has some code to do kNN with kd-trees, but the primary index it uses for kNN is a Hierarchical Navigable Small World index. I'd speculate that the kd-tree kNN was an offering that was superseded by HNSW.

This reverts commit 19eab16.

Restore radix select in order to implement a merge solution that will
not require a temporary file.
Existing code behaved as if the result of `select_nth_unstable_by_key`
was either a sorted array or the product of an algorithm that gathered
partition values as in the Dutch national flag problem. The existing
code was written knowing that the former isn't true and the latter isn't
advertised. Knowing, but not remembering. Quite the oversight.
@fulmicoton fulmicoton marked this pull request as ready for review December 1, 2025 15:36
@fulmicoton fulmicoton mentioned this pull request Dec 1, 2025
@fulmicoton
Copy link
Collaborator

@flatheadmill Hello. I had time to have a deeper look at your PR.

For large PR like this, I usually modify the code as I look around.
You can check that work there:
https://github.com/quickwit-oss/tantivy/pull/2755/files

I will put my comments inline, you can also just pick the commits from #2755 when it makes sense to you.

};
// Get spatial writer and rebuild block kd-tree.
spatial_serializer.serialize_field(field, triangles)?;
}

FYI I imported your PR in #2755 and started reviewing.

The tests do not pass in your PR, mainly due to:

Suggested change
}
}
spatial_serializer.close()?;

SegmentComponent::FastFields => ".fast".to_string(),
SegmentComponent::FieldNorms => ".fieldnorm".to_string(),
SegmentComponent::Delete => format!(".{}.del", self.delete_opstamp().unwrap_or(0)),
SegmentComponent::Spatial => ".spatial".to_string(),

General thing...

I think we should avoid creating this file if it is not used.
It might seem benign, but tantivy creates those files on every single commit. Ideally we would fix that at the directory level, but I never managed to do it.

TBH it should be true for other components in tantivy (especially field norms).

I think I managed to get that in my #2755 .

for (field, mut temp_file) in temp_files {
// Flush and sync triangles.
temp_file.flush()?;
temp_file.as_file_mut().sync_all()?;

fsyncing is not useful for this usage.

}
for (segment_ord, reader) in self.readers.iter().enumerate() {
for (field, temp_file) in &mut temp_files {
let spatial_readers = reader.spatial_fields();

we need the temp_file to be buffered.

}

impl SpatialWeight {
fn new(field: Field, bounds: [(i32, i32); 2], query_type: SpatialQueryType) -> Self {

I think (?) the integer bbox you end up using assumes that the two points given here are the min corner and the max corner (because of the intersect code).

It might be worth enforcing that somewhere to avoid bugs.

Ideally I would prefer to have a IBox struct type that forces that invariant upon construction.

}
}
// change orientation if CW
if orient2d(

robust is imported just for orient2d.

I think we should remove the dep and use an ad hoc implementation.

orient_2d is also needlessly complex. (We just want the sign of the determinant)

Coord { y: cy, x: cx },
) < 0.0
{
let temp_x = bx;

Suggested change
let temp_x = bx;
// We fix the orientation by swapping B and C.
let temp_x = bx;

let temp_y = by;
let temp_boundary = ab;
// ax and ay do not change, ab becomes bc
ab = bc;

I believe this is a bug:

Suggested change
ab = bc;
ab = ac;

/// indicating which edges are polygon boundaries. Returns a triangle struct with the bounding
/// box in the first four words as `[min_y, min_x, max_y, max_x]`. When decoded, the vertex
/// order may differ from the original input to `new()` due to normalized rotation.
pub fn new(doc_id: u32, triangle: [i32; 6], boundaries: [bool; 3]) -> Self {

In several place in the code, we rely on implicit ordering to have lon, lat or y, x.

The code would benefit from introducing

`GeoPoint { x: f64, y: f64 }` and `IPoint { x: u32, y: u32 }`, and `IBox { min_corner: IPoint, max_corner: IPoint }` or `IBox { min_x, min_y, max_x, max_y }`.

This is zero cost in Rust.

Spatial fields returned false from is_indexed(), so the segment writer
skipped them before reaching add_geometry. Moved spatial field handling
above the is_indexed gate since spatial fields have their own write path
and do not participate in postings.

The spatial serializer called CompositeWrite::for_field twice for the
edge index -- once to create the EdgeWriter and again to flush. The
second call tripped a duplicate field address assertion. Hoisted the
for_field call so the same reference serves both uses.

RegionCoverer::get_covering_internal sent initial candidates directly to
the priority queue without expanding their children. The main loop then
drained an empty children vec and produced nothing. Changed to route
initial candidates through add_candidate, which expands children before
queuing. This matches the S2 C++ reference.
The forward walk counted entries to reach a target geometry position but
did not skip doc_id footers that appear between sets. When geometry_id 3
followed a 3-member set, the walk stopped at the footer after the third
entry instead of advancing past it. The footer bytes were then read as a
vertex byte length, producing a slice overflow.

After the walk loop exits, consume any doc_id footers before treating the
current position as the target entry.
Only set has_interior when the covering cell fully contains the index
cell. Use range_min comparison in the Subdivided walk for correct
overlap detection across cell levels.
The CellIndex assigns edges to cells using conservative bounding box
tests, so multiple geometries can share a cell even when they do not
overlap spatially. When an interior covering cell contains such a shared
index cell, the shortcut now confirms the candidate's first vertex is
inside the query polygon before accepting.
Same pattern as the serializer fix: CompositeWrite::for_field was called
twice for the edge index, once to create the EdgeWriter and again to
flush. Hoisted the for_field call so the same reference serves both.
Removed a leftover dbg statement.
The merger was writing edges in cell-walk encounter order, but the
GeometryMap assigns IDs during the interleave stage before the split
and sibling stages reorder cells. The edge positions in the output
did not match the geometry IDs in the merged cell index.

Write edges in sequential new_id order instead of encounter order.
Assign set member IDs starting from the head so they are consecutive
regardless of which member is encountered first during the cell walk.
Introduces $intersects and $contains as query parser syntax, composable
with text queries via AND/OR. The field name is required and must refer
to a spatial field. Coordinates are comma-separated lon/lat pairs with
implicit ring closure.

    geometry:$intersects(-99.49 45.56, -99.45 45.56, -99.45 45.59, -99.49 45.59)
    name:Hosmer AND geometry:$intersects(-100.0 45.0, -99.0 45.0, -99.0 46.0, -100.0 46.0)

New SpatialPredicateKind enum in the grammar crate, Spatial variant in
UserInputLeaf and LogicalLiteral, parser rule wired into the strict
literal alt, and lowering to SpatialQuery in convert_literal_to_query.
Points enter the cell index as degenerate edges (v0 == v1) with
dimension 0. The edge reader duplicates single-vertex entries on
decode so the edge_id rule holds without caller changes.

Line strings exposed false positives in the intersects query --
brute_force_contains treated open paths as polygons. The fix
encodes a closed bit in the high bit of the edge entry len field
and moves the doc_id from a trailing footer to inline on the head
entry. The intersects query skips the reverse containment test for
open geometries. The has_crossing guard drops from n < 3 to n < 2
to allow 2-vertex line strings.
…reverse containment test.

GeoJSON RFC 7946 specifies hole rings as clockwise. S2 expects all rings
counterclockwise, enclosing at most half the sphere. A clockwise hole
ring interpreted as an S2 loop encloses nearly the entire sphere, causing
compute_origin_inside to return true. The InteriorTracker then marks the
geometry as containing every cell on the face. The fix reverses hole
rings in the writer before computing origin_inside.

The reverse containment test in the intersects query now uses indexed
containment through the segment's cell index instead of brute_force_contains.
A ContainmentIndex trait abstracts cell lookup and edge resolution over both
the in-memory CellIndex (forward direction) and the CellIndexReader plus
EdgeReader (reverse direction). The generic contains_point function finds the
cell containing the test point, starts from contains_center, and counts
crossings of only the clipped edges. This eliminates the flattened vertex
bugs in brute_force_contains (degenerate A-to-A edges from ring closure,
spurious cross-ring edges from multi-ring flattening) and removes
origin_inside from the query path.
Mechanical port of GetDistance, UpdateMinDistance, IsDistanceLess,
UpdateMinInteriorDistance, and GetUpdateMinDistanceMaxError. These
compute the minimum distance from a point to a great circle edge on
the unit sphere using chord length, needed for the distance query
verification step.
DistanceQuery builds an S2Cap from a center point and radius, covers
it with RegionCoverer, and verifies candidates by computing the minimum
point-to-edge distance using the s2edge_distances port. For polygons,
containment is checked first via indexed containment through the
segment's cell index -- if the center is inside the polygon, the
distance is zero.

The query parser accepts $within(50mi, lon lat) and
$between(50mi, 100mi, lon lat) with unit suffixes mi, km, m, ft, rad.
Geometry enters as Geometry<Plane>, gets projected onto the surface via
project::<Sphere>(), and gets smashed into stored format via
to_geometry_set. One function, one format, for both query and write.

* Surface trait with Sphere and Plane, project.
* GeometrySet and EdgeSet as the smashed representation.
* to_geometry_set smashes projected geometry, computes
  contains_hilbert_start, flattens rings with closure vertices.
* build_from_sets on IndexBuilder seeds the InteriorTracker from
  precomputed flags instead of running brute_force_contains at build
  time.
* Queries accept GeometrySet directly. QueryEdgeProvider wraps it.
* Edge reader returns EdgeSet via get_edge_set so callers never
  compute member indices.
* Edge writer takes GeometrySet.
* get_distance_to_point -- minimum distance from cell to point, zero
  if the point is inside. Feature-based computation in the cell's
  face-local UVW frame with Voronoi region branching.
* get_boundary_distance -- minimum distance from cell boundary to point.
* get_distance_to_edge -- minimum distance from cell to edge, zero if
  the edge intersects the cell or an endpoint is inside.
Geometry-to-geometry distance query over the cell hierarchy. The query
geometry gets its own CellIndex for bilateral pruning -- only query
edges in cells that overlap the stored cell are checked. The priority
queue is keyed by cell-to-query-geometry lower bounds using
S2Cell::get_distance_to_edge, and leaf verification uses
update_edge_pair_min_distance.

* closest_edge_query.rs: branch-and-bound traversal with three modes
  from one algorithm -- kNN (dynamic threshold), within-D (fixed
  threshold), boolean (exit on first witness).
* $knn(K, lon lat) syntax in the query grammar.
* SpatialPredicate::Knn wired through SpatialQuery and SpatialWeight.
The covering-based DistanceQuery is no longer called from the query
path. Within, between, and kNN all go through the branch-and-bound
traversal. Added range distance mode for $between with inner and
outer thresholds.
The executor is a Query whose Weight construction evaluates a plan tree before the
scorer pipeline runs. Plan nodes are either tantivy queries evaluated to per-segment
bitsets, or spatial operations whose filter inputs are other plan nodes. The tree
evaluates inside-out so nested operations resolve before their parents.

* Added SpatialExecutor implementing Query. Weight construction calls a recursive
  evaluate function over the plan tree, producing per-segment bitsets or scored
  result sets. Scorers iterate the precomputed results.
* Added PlanNode enum with Query, Intersects, Knn, and Join variants. Knn and Join
  are stubbed. Intersects is wired to search_segment_filtered.
* Added EdgeReader::doc_id_for to resolve geometry_id to doc_id without decoding
  vertices or populating the cache. Reads the skip list and entry headers only.
* Added IntersectsQuery::search_segment_filtered which checks a terms bitset per
  candidate shape before verification. Shapes whose doc_id is not in the bitset
  are skipped before crossing tests.
* Added Clone to EdgeSet and GeometrySet.
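The inside-out plan evaluation might look like this in miniature. The variant set is reduced to two, and a `Vec<bool>` stands in for tantivy's per-segment bitsets; the fixed `geometric` vector is a stand-in for `search_segment_filtered`.

```rust
enum PlanNode {
    // A leaf: a pre-evaluated per-segment bitset standing in for a
    // tantivy query.
    Query(Vec<bool>),
    // A spatial operation whose filter input is another plan node.
    Intersects { filter: Box<PlanNode> },
}

fn evaluate(node: &PlanNode) -> Vec<bool> {
    match node {
        PlanNode::Query(bits) => bits.clone(),
        PlanNode::Intersects { filter } => {
            // Children resolve first (inside-out), then the spatial op
            // consumes their bitset as a candidate filter.
            let filter_bits = evaluate(filter);
            // Stand-in geometric result: docs 0 and 2 pass the crossing
            // tests; intersect that with the filter bitset.
            let geometric = vec![true, false, true, false];
            geometric
                .iter()
                .zip(filter_bits.iter())
                .map(|(g, f)| *g && *f)
                .collect()
        }
    }
}

fn main() {
    let plan = PlanNode::Intersects {
        filter: Box::new(PlanNode::Query(vec![true, true, false, false])),
    };
    // Only doc 0 passes both the filter and the geometric test.
    assert_eq!(evaluate(&plan), vec![true, false, false, false]);
}
```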
…cutor Within node.

* Added ClosestEdgeQuery::search_segment_filtered. Checks a terms bitset per
  candidate shape in process_edges before edge-pair distance computation. The
  filter check uses doc_id_for to resolve geometry_id to doc_id without decoding
  vertices.
* Added DistanceQuery::search_segment_filtered. Same bitset check pattern in
  collect_candidates.
* Added PlanNode::Within to the executor. Takes a GeometrySet and radius. Uses
  ClosestEdgeQuery::within for geometry-to-geometry distance, not the point-based
  DistanceQuery. The executor uses branch-and-bound for all distance operations.
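The filter-before-verify pattern shared by the `search_segment_filtered` variants can be sketched as follows. The stand-in `doc_id_for` and `expensive_verify` are illustrative; the real reader walks the skip list and entry headers, and verification decodes vertices and runs crossing or distance tests.

```rust
// Stand-in: identity mapping; the real doc_id_for reads headers only.
fn doc_id_for(geometry_id: usize) -> u32 {
    geometry_id as u32
}

// Stand-in for decoding vertices and running the geometric tests.
fn expensive_verify(geometry_id: usize) -> bool {
    geometry_id % 2 == 0
}

fn search_filtered(candidates: &[usize], filter: &[bool]) -> Vec<u32> {
    let mut hits = Vec::new();
    for &gid in candidates {
        let doc = doc_id_for(gid);
        // Cheap bitset check first: shapes outside the filter are skipped
        // before any geometric work is done.
        if !filter[doc as usize] {
            continue;
        }
        if expensive_verify(gid) {
            hits.push(doc);
        }
    }
    hits
}

fn main() {
    let filter = vec![true, false, true, true];
    // Doc 1 is filtered out before verification; doc 3 fails verification.
    assert_eq!(search_filtered(&[0, 1, 2, 3], &filter), vec![0, 2]);
}
```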
The join evaluates inner and outer child nodes to per-segment bitsets, then for
each segment walks the cell index to find outer geometries by checking doc_id_for
against the outer bitset. For each outer geometry, builds a ClosestEdgeQuery with
any_within and probes all segments filtered by the inner bitset. First positive
across any segment sets the bit. The result is per-segment bitsets of outer docs
that passed the spatial join predicate.
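A miniature of the join loop, with 1-D points standing in for geometries and a distance threshold standing in for `ClosestEdgeQuery::any_within`; function names here are illustrative.

```rust
// Boolean probe: does any inner point in the inner bitset lie within
// `radius` of the outer point? First witness short-circuits via any().
fn any_within(outer: f64, inners: &[f64], inner_bits: &[bool], radius: f64) -> bool {
    inners
        .iter()
        .enumerate()
        .any(|(i, &p)| inner_bits[i] && (outer - p).abs() <= radius)
}

fn join(
    outers: &[f64],
    outer_bits: &[bool],
    inners: &[f64],
    inner_bits: &[bool],
    radius: f64,
) -> Vec<bool> {
    outers
        .iter()
        .enumerate()
        .map(|(i, &o)| {
            // Only outer docs in the outer bitset are probed; the first
            // positive witness across the inner set sets the bit.
            outer_bits[i] && any_within(o, inners, inner_bits, radius)
        })
        .collect()
}

fn main() {
    let outers = [0.0, 10.0, 20.0];
    let outer_bits = [true, true, false];
    let inners = [1.0, 50.0];
    let inner_bits = [true, true];
    // Outer 0 has a witness at 1.0; outer 1 has none; outer 2 is filtered.
    assert_eq!(
        join(&outers, &outer_bits, &inners, &inner_bits, 2.0),
        vec![true, false, false]
    );
}
```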
Removed code left over from the triangle days.
* Added ContainsQuery::search_segment_filtered with terms bitset check per
  candidate shape, same pattern as IntersectsQuery and DistanceQuery.
* The Join evaluate arm dispatches on SpatialRelation: Near uses
  ClosestEdgeQuery::any_within, Between uses any_between, Intersects uses
  IntersectsQuery, Contains uses ContainsQuery. All four probe with
  search_segment_filtered against the inner bitset.
* Added ClosestEdgeQuery::any_between for boolean range distance with early exit.
The point-to-geometry distance query is replaced by closest_edge_query, which
handles any geometry type through edge-pair distance with branch-and-bound.
Runs ClosestEdgeQuery::knn per segment with an optional filter bitset, collects
results tagged by segment ID, sorts by distance, truncates to global top K, and
redistributes to per-segment SegmentResult::Scored vecs for the replay scorer.

Fixed a double-yield bug in ReplayScorer where the Scored variant emitted the
first doc twice. Removed the started flag and simplified advance to always
increment the index.
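The merge-and-redistribute step described above can be sketched like this. `merge_topk` is a hypothetical name; hits are (doc_id, distance) pairs with integer distances for simplicity, and segment ordinals stand in for segment IDs.

```rust
fn merge_topk(per_segment: Vec<Vec<(u32, u64)>>, k: usize) -> Vec<Vec<(u32, u64)>> {
    let num_segments = per_segment.len();
    // Tag each (doc_id, distance) hit with its segment ordinal.
    let mut all: Vec<(usize, u32, u64)> = per_segment
        .into_iter()
        .enumerate()
        .flat_map(|(seg, hits)| hits.into_iter().map(move |(doc, d)| (seg, doc, d)))
        .collect();
    // Global sort by distance, then truncate to the global top K.
    all.sort_by_key(|&(_, _, d)| d);
    all.truncate(k);
    // Redistribute the survivors to per-segment vecs for the replay scorer.
    let mut out = vec![Vec::new(); num_segments];
    for (seg, doc, d) in all {
        out[seg].push((doc, d));
    }
    out
}

fn main() {
    let seg0 = vec![(1, 5), (2, 30)];
    let seg1 = vec![(7, 10), (9, 1)];
    // Global top 3 by distance: (9,1), (1,5), (7,10); (2,30) is dropped.
    let merged = merge_topk(vec![seg0, seg1], 3);
    assert_eq!(merged[0], vec![(1, 5)]);
    assert_eq!(merged[1], vec![(9, 1), (7, 10)]);
}
```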
Spatial predicates now accept $query(...) where coordinates would normally go.
The inner query text is parsed recursively by the query parser and lowered to a
SpatialExecutor with PlanNode::Join. The join absorbs all four spatial relations:
$within, $between, $intersects, $contains.

* Added query_arg parser in query_grammar.rs to extract balanced $query(...)
  content. Each spatial predicate parser tries $query() before falling back to
  coordinate parsing.
* Added inner_query field to UserInputLeaf::Spatial and LogicalLiteral::Spatial.
* The lowering in query_parser.rs detects inner_query and builds a SpatialExecutor
  with PlanNode::Join. The inner query is parsed recursively via the same parser.
* Threaded query_parser reference through convert_to_query and
  convert_literal_to_query for the recursive parse.
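Extracting balanced $query(...) content amounts to a depth-counting scan over parentheses. A sketch under that assumption (`extract_query_arg` is an illustrative name; the real parser lives in query_grammar.rs and this ignores quoting and escapes):

```rust
// Return the text between the parentheses of a leading "$query(...)",
// matching nested parentheses by depth count.
fn extract_query_arg(input: &str) -> Option<&str> {
    let rest = input.strip_prefix("$query(")?;
    let mut depth = 1usize;
    for (i, c) in rest.char_indices() {
        match c {
            '(' => depth += 1,
            ')' => {
                depth -= 1;
                if depth == 0 {
                    return Some(&rest[..i]);
                }
            }
            _ => {}
        }
    }
    None // unbalanced input: no closing parenthesis at depth zero
}

fn main() {
    assert_eq!(
        extract_query_arg("$query(title:(rust AND geo))"),
        Some("title:(rust AND geo)")
    );
    assert_eq!(extract_query_arg("$query(unbalanced"), None);
}
```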
When the query parser lowers a boolean AND that contains a spatial join, the
non-join children become the join's outer PlanNode instead of composing as
independent boolean sub-queries. The executor probes only documents matching
the sibling predicates, not all documents in the index.

Added SpatialExecutor::set_outer to replace the outer PlanNode after construction.
Each geometry entry now starts with a flags byte (u8) followed by len (u32).
The flags byte carries closed, contains_origin, has_holes, and is_head in
individual bits. The old len and set u32 pair with bit-packed flags and back
pointers is gone.

Head entries are flags + len + doc_id + data (9 byte header). Member entries
are flags + len + data (5 byte header). Sets are delimited by the is_head
flag, not by a member count. The forward walk notes the doc_id from each
head it passes.

doc_id_for walks forward from the skip list, noting doc_ids from heads. If
the target is reached without a head, it retries from the previous skip list
entry. No back pointers, no backward motion through the data.
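The header layout described above can be sketched as an encode/decode pair. The bit positions and constant names are illustrative, not the PR's actual assignment; multi-byte fields are little-endian here by assumption.

```rust
// Illustrative flag bits for the flags byte.
const CLOSED: u8 = 1 << 0;
const CONTAINS_ORIGIN: u8 = 1 << 1;
const HAS_HOLES: u8 = 1 << 2;
const IS_HEAD: u8 = 1 << 3;

// Head entries: flags + len + doc_id (9 bytes). Members: flags + len (5).
fn encode_header(flags: u8, len: u32, doc_id: Option<u32>) -> Vec<u8> {
    let mut out = vec![flags];
    out.extend_from_slice(&len.to_le_bytes());
    if flags & IS_HEAD != 0 {
        out.extend_from_slice(&doc_id.expect("head entry needs doc_id").to_le_bytes());
    }
    out
}

// Returns (flags, len, doc_id, header_size). Sets are delimited by the
// IS_HEAD flag, so a forward walk notes doc_ids as it passes heads.
fn decode_header(bytes: &[u8]) -> (u8, u32, Option<u32>, usize) {
    let flags = bytes[0];
    let len = u32::from_le_bytes(bytes[1..5].try_into().unwrap());
    if flags & IS_HEAD != 0 {
        let doc_id = u32::from_le_bytes(bytes[5..9].try_into().unwrap());
        (flags, len, Some(doc_id), 9)
    } else {
        (flags, len, None, 5)
    }
}

fn main() {
    let head = encode_header(IS_HEAD | CLOSED, 128, Some(42));
    assert_eq!(head.len(), 9);
    assert_eq!(decode_header(&head), (IS_HEAD | CLOSED, 128, Some(42), 9));

    let member = encode_header(HAS_HOLES | CONTAINS_ORIGIN, 16, None);
    assert_eq!(member.len(), 5);
    assert_eq!(decode_header(&member), (HAS_HOLES | CONTAINS_ORIGIN, 16, None, 5));
}
```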