Skip to content

Nearest neighbor joins for arbitrary geometries #1713

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 55 commits into from
Mar 7, 2025
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
55 commits
Select commit Hold shift + click to select a range
c7a1ea8
untested changes
Jonathan24680 Dec 17, 2024
a266201
continued development
Jonathan24680 Dec 18, 2024
2cc8504
added all areas to the dataset and some tests
Jonathan24680 Dec 23, 2024
d784513
make file_buffer_size in the test index be a parameter
Jonathan24680 Dec 28, 2024
f6e1647
bugfix area dataset
Jonathan24680 Dec 28, 2024
1a4ff8f
backup commit
Jonathan24680 Dec 30, 2024
e3a2644
area test with points and areas working
Jonathan24680 Dec 31, 2024
f38cb1d
moved some stuff from dev space into the project
Jonathan24680 Jan 2, 2025
4d71c4e
new tests
Jonathan24680 Jan 4, 2025
38c9585
first test works now for the areas as well
Jonathan24680 Jan 11, 2025
ddea6d3
all tests are now executed and passed
Jonathan24680 Jan 12, 2025
1902d44
clang format
Jonathan24680 Jan 12, 2025
31e7054
merge master
Jonathan24680 Jan 12, 2025
cba4324
backup with removed filebuffersize compiling but not testing
Jonathan24680 Jan 17, 2025
d5c934c
make buffer size of test index variable
Jonathan24680 Jan 17, 2025
c1dc963
clean up code
Jonathan24680 Jan 17, 2025
c5284ef
sonarqube
Jonathan24680 Jan 20, 2025
fc1a1fe
correct bug for function without inline in header file
Jonathan24680 Jan 21, 2025
ec75e89
arbitrary geometry types and areas dont need to be approximated as po…
Jonathan24680 Jan 25, 2025
4fd698c
PR feedback
Jonathan24680 Jan 31, 2025
173e646
PR feedback
Jonathan24680 Jan 31, 2025
7dd8156
backup
Jonathan24680 Feb 1, 2025
1c60bf9
PR feedback
Jonathan24680 Feb 1, 2025
3b8ea1a
spelling mistake
Jonathan24680 Feb 1, 2025
a121ef2
Sonarqube issues
Jonathan24680 Feb 4, 2025
1168f32
PR changes
Jonathan24680 Feb 4, 2025
ca7f08e
PR changes
Jonathan24680 Feb 4, 2025
9d2956e
PR changes
Jonathan24680 Feb 4, 2025
14032bd
PR changes
Jonathan24680 Feb 4, 2025
29171b5
Sonarqube issues
Jonathan24680 Feb 4, 2025
457f8d4
format check
Jonathan24680 Feb 4, 2025
bf886c4
backup
Jonathan24680 Feb 7, 2025
10bdf7b
strange bug
Jonathan24680 Feb 7, 2025
5f12f22
solved strange bug
Jonathan24680 Feb 7, 2025
978b512
PR feedback
Jonathan24680 Feb 7, 2025
4132701
Sonarqube issue
Jonathan24680 Feb 7, 2025
d711586
Merge branch 'master' into SpatialJoinArea
Jonathan24680 Feb 7, 2025
f3c5b01
consistent naming of QueryBox and BoundingBox
Jonathan24680 Feb 8, 2025
a46fccd
add test for distance between GeoPoints and Areas
Jonathan24680 Feb 8, 2025
2cc0fef
improve true area distance calculation
Jonathan24680 Feb 8, 2025
96a903a
backup
Jonathan24680 Feb 10, 2025
47ba454
backup
Jonathan24680 Feb 10, 2025
61c6b98
PR changes
Jonathan24680 Feb 10, 2025
c0bf282
clang format
Jonathan24680 Feb 10, 2025
988ba8d
Sonarqube issus and bugfix
Jonathan24680 Feb 12, 2025
e42d9d0
format check
Jonathan24680 Feb 12, 2025
5b88f47
Merge branch 'master' into SpatialJoinArea
Jonathan24680 Feb 14, 2025
5a99080
Sonarqube issue remove unused operator
Jonathan24680 Feb 14, 2025
a7f4695
PR feedback
Jonathan24680 Feb 15, 2025
e5121f6
suggested parentheses around and condition
Jonathan24680 Feb 15, 2025
c272d19
cancellation handle and missing allocator
Jonathan24680 Mar 6, 2025
c56c863
bugfix
Jonathan24680 Mar 6, 2025
0c35b50
forgot to add SpatialJoin.h and bugfix in constructor
Jonathan24680 Mar 6, 2025
6666773
bugfix
Jonathan24680 Mar 6, 2025
6d62020
bugfix
Jonathan24680 Mar 6, 2025
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
8 changes: 8 additions & 0 deletions src/engine/SpatialJoin.h
Original file line number Diff line number Diff line change
Expand Up @@ -174,6 +174,14 @@ class SpatialJoin : public Operation {
return childRight_;
}

PreparedSpatialJoinParams onlyForTestingGetPrepareJoin() const {
return prepareJoin();
}

void checkCancellationWrapperForSpatialJoinAlgorithms() const {
checkCancellation();
}

private:
// helper function to generate a variable to column map from `childRight_`
// that only contains the columns selected by `config_.payloadVariables_`
Expand Down
282 changes: 221 additions & 61 deletions src/engine/SpatialJoinAlgorithms.cpp

Large diffs are not rendered by default.

186 changes: 150 additions & 36 deletions src/engine/SpatialJoinAlgorithms.h
Original file line number Diff line number Diff line change
Expand Up @@ -10,20 +10,72 @@
#include <boost/geometry/geometries/box.hpp>
#include <boost/geometry/geometries/point.hpp>
#include <memory>
#include <variant>

#include "engine/Result.h"
#include "engine/SpatialJoin.h"
#include "util/GeoSparqlHelpers.h"

namespace BoostGeometryNamespace {
namespace bg = boost::geometry;
namespace bgi = boost::geometry::index;

using Point = bg::model::point<double, 2, bg::cs::cartesian>;
using Box = bg::model::box<Point>;
using Value = std::pair<Point, size_t>;
using Polygon = boost::geometry::model::polygon<
boost::geometry::model::d2::point_xy<double>>;
using Linestring = bg::model::linestring<Point>;
using MultiPoint = bg::model::multi_point<Point>;
using MultiLinestring = bg::model::multi_linestring<Linestring>;
using MultiPolygon = bg::model::multi_polygon<Polygon>;
using AnyGeometry = boost::variant<Point, Linestring, Polygon, MultiPoint,
MultiLinestring, MultiPolygon>;
using Segment = boost::geometry::model::segment<Point>;

// this struct is used to get the bounding box of an arbitrary geometry type.
struct BoundingBoxVisitor : public boost::static_visitor<Box> {
template <typename Geometry>
Box operator()(const Geometry& geometry) const {
Box box;
boost::geometry::envelope(geometry, box);
return box;
}
};

// this struct is used to calculate the distance between two arbitrary
// geometries. It calculates the two closest points (in euclidean geometry),
// transforms the two closest points, to a GeoPoint and then calculates the
// distance of the two points on the earth. As the closest points are calculated
// using euclidean geometry, this is only an approximation. On the sphere two
// other points might be closer.
struct ClosestPointVisitor : public boost::static_visitor<double> {
template <typename Geometry1, typename Geometry2>
double operator()(const Geometry1& geo1, const Geometry2& geo2) const {
Segment seg;
bg::closest_points(geo1, geo2, seg);
GeoPoint closestPoint1(bg::get<0, 1>(seg), bg::get<0, 0>(seg));
GeoPoint closestPoint2(bg::get<1, 1>(seg), bg::get<1, 0>(seg));
return ad_utility::detail::wktDistImpl(closestPoint1, closestPoint2);
}
};

struct RtreeEntry {
size_t row_;
std::optional<size_t> geometryIndex_;
std::optional<GeoPoint> geoPoint_;
std::optional<Box> boundingBox_;
};

using Value = std::pair<Box, RtreeEntry>;

} // namespace BoostGeometryNamespace

class SpatialJoinAlgorithms {
using Point = BoostGeometryNamespace::Point;
using Box = BoostGeometryNamespace::Box;
using AnyGeometry = BoostGeometryNamespace::AnyGeometry;
using RtreeEntry = BoostGeometryNamespace::RtreeEntry;

public:
// initialize the Algorithm with the needed parameters
SpatialJoinAlgorithms(QueryExecutionContext* qec,
Expand All @@ -34,16 +86,57 @@ class SpatialJoinAlgorithms {
Result S2geometryAlgorithm();
Result BoundingBoxAlgorithm();

std::vector<BoostGeometryNamespace::Box>
OnlyForTestingWrapperComputeBoundingBox(
const BoostGeometryNamespace::Point& startPoint) const {
return computeBoundingBox(startPoint);
// This function computes the bounding box(es) which represent all points,
// which are in reach of the starting point with a distance of at most
// 'maxDistanceInMeters'. In theory there is always only one bounding box, but
// when mapping the spherical surface on a cartesian plane there are borders.
// So when the "single true" bounding box crosses the left or right (+/-180
// longitude line) or the poles (+/- 90 latitude, which on the cartesian
// mapping is the top and bottom edge of the rectangular mapping) then the
// single box gets split into multiple boxes (i.e. one on the left and one on
// the right, which when seen on the sphere look like a single box, but on the
// map and in the internal representation it looks like two/more boxes). The
// additionalDist gets added on the max distance to compensate for areas being
// bigger than points. AdditionalDist must be the max distance from the
// midpoint of the bounding box of the area to any point inside the area.
// The function getMaxDistFromMidpointToAnyPointInsideTheBox() can be used to
// calculate it.
std::vector<Box> computeQueryBox(const Point& startPoint,
double additionalDist = 0) const;

// This function returns true, iff the given point is contained in any of the
// bounding boxes
bool isContainedInBoundingBoxes(const std::vector<Box>& boundingBox,
Point point) const;

// calculates the midpoint of the given Box
Point calculateMidpointOfBox(const Box& box) const;

void setUseMidpointForAreas_(bool useMidpointForAreas) {
useMidpointForAreas_ = useMidpointForAreas;
}

bool OnlyForTestingWrapperContainedInBoundingBoxes(
const std::vector<BoostGeometryNamespace::Box>& boundingBox,
const BoostGeometryNamespace::Point& point) const {
return isContainedInBoundingBoxes(boundingBox, point);
// Helper function, which computes the distance of two geometries, where each
// geometry has already been parsed and is available as an RtreeEntry
Id computeDist(RtreeEntry& geo1, RtreeEntry& geo2);

// this function calculates the maximum distance from the midpoint of the box
// to any other point, which is contained in the box. If the midpoint has
// already been calculated, because it is needed in other places as well, it
// can be given to the function, otherwise the function calculates the
// midpoint itself
double getMaxDistFromMidpointToAnyPointInsideTheBox(
const Box& box, std::optional<Point> midpoint = std::nullopt) const;

// this function gets the string which represents the area from the idtable.
std::optional<size_t> getAnyGeometry(const IdTable* idtable, size_t row,
size_t col);

// wrapper to access non const private function for testing
std::optional<RtreeEntry> onlyForTestingGetRtreeEntry(const IdTable* idTable,
const size_t row,
const ColumnIndex col) {
return getRtreeEntry(idTable, row, col);
}

private:
Expand All @@ -52,11 +145,9 @@ class SpatialJoinAlgorithms {
std::optional<GeoPoint> getPoint(const IdTable* restable, size_t row,
ColumnIndex col) const;

// Helper function, which computes the distance of two points, where each
// point comes from a different result table
Id computeDist(const IdTable* resLeft, const IdTable* resRight,
size_t rowLeft, size_t rowRight, ColumnIndex leftPointCol,
ColumnIndex rightPointCol) const;
// returns everything between the first two quotes. If the string does not
// contain two quotes, the string is returned as a whole
std::string_view betweenQuotes(std::string_view extractFrom) const;

// Helper function, which adds a row, which belongs to the result to the
// result table. As inputs it uses a row of the left and a row of the right
Expand All @@ -65,19 +156,6 @@ class SpatialJoinAlgorithms {
const IdTable* resultRight, size_t rowLeft,
size_t rowRight, Id distance) const;

// This function computes the bounding box(es) which represent all points,
// which are in reach of the starting point with a distance of at most
// 'maxDistanceInMeters'. In theory there is always only one bounding box, but
// when mapping the spherical surface on a cartesian plane there are borders.
// So when the "single true" bounding box crosses the left or right (+/-180
// longitude line) or the poles (+/- 90 latitude, which on the cartesian
// mapping is the top and bottom edge of the rectangular mapping) then the
// single box gets split into multiple boxes (i.e. one on the left and one on
// the right, which when seen on the sphere look like a single box, but on the
// map and in the internal representation it looks like two/more boxes)
std::vector<BoostGeometryNamespace::Box> computeBoundingBox(
const BoostGeometryNamespace::Point& startPoint) const;

// This helper function calculates the bounding boxes based on a box, where
// definitely no match can occur. This means every element in the anti
// bounding box is guaranteed to be more than 'maxDistanceInMeters' away from
Expand All @@ -86,20 +164,46 @@ class SpatialJoinAlgorithms {
// gets used, when the usual procedure, would just result in taking a big
// bounding box, which covers the whole planet (so for extremely large max
// distances)
std::vector<BoostGeometryNamespace::Box> computeBoundingBoxForLargeDistances(
const BoostGeometryNamespace::Point& startPoint) const;

// This function returns true, iff the given point is contained in any of the
// bounding boxes
bool isContainedInBoundingBoxes(
const std::vector<BoostGeometryNamespace::Box>& boundingBox,
BoostGeometryNamespace::Point point) const;
std::vector<Box> computeQueryBoxForLargeDistances(
const Point& startPoint) const;

// this helper function approximates a conversion of the distance between two
// objects from degrees to meters. Here we assume, that the conversion from
// degrees to meters is constant, which is however only true for the latitude
// values. For the longitude values this is not true. Therefore a value which
// works very good for almost all longitudes and latitudes has been chosen.
// Only for the poles, the conversion will be way to large (for the longitude
// difference). Note, that this function is expensive and should only be
// called when needed
double computeDist(const size_t geometryIndex1,
const size_t geometryIndex2) const;

// this helper function takes an idtable, a row and a column. It then tries
// to parse a geometry or a geoPoint of that cell in the idtable. If it
// succeeds, it returns an rtree entry of that geometry/geopoint
std::optional<RtreeEntry> getRtreeEntry(const IdTable* idTable,
const size_t row,
const ColumnIndex col);

// this helper function converts a GeoPoint into a boost geometry Point
size_t convertGeoPointToPoint(GeoPoint point);

// this helper function calculates the query box. The query box, is the box,
// which contains the area, where all possible candidates of the max distance
// query must be contained in. It returns a vector, because if the box crosses
// the poles or the -180/180 longitude line, then it is disjoint in the
// cartesian coordinates. The boxes themselves are disjoint to each other.
std::vector<Box> getQueryBox(const std::optional<RtreeEntry>& entry) const;

QueryExecutionContext* qec_;
PreparedSpatialJoinParams params_;
SpatialJoinConfiguration config_;
std::optional<SpatialJoin*> spatialJoin_;

// if the distance calculation should be approximated, by the midpoint of
// the area
bool useMidpointForAreas_ = true;

// circumference in meters at the equator (max) and the pole (min) (as the
// earth is not exactly a sphere the circumference is different. Note that
// the values are given in meters)
Expand All @@ -111,8 +215,18 @@ class SpatialJoinAlgorithms {
static constexpr double radius_ = 6'378'000;

// convert coordinates to the usual ranges (-180 to 180 and -90 to 90)
void convertToNormalCoordinates(BoostGeometryNamespace::Point& point) const;
void convertToNormalCoordinates(Point& point) const;

// return whether one of the poles is being touched
std::array<bool, 2> isAPoleTouched(const double& latitude) const;

// number of times the parsing of a geometry failed. For now this is only used
// to print the warning once, but it could also be used to print how many
// geometries failed. It is mutable to let parsing function which are const
// still modify the the nr of failed parsings.
size_t numFailedParsedGeometries_ = 0;

// this vector stores the geometries, which have already been parsed
std::vector<AnyGeometry, ad_utility::AllocatorWithLimit<AnyGeometry>>
geometries_;
};
Loading
Loading