# Tutorial: Query, cross-match, play!
The spark3D library contains a number of methods and tools to manipulate 3D RDDs. Currently, you can already play with *window query*, *KNN* and *cross-match between data sets*.
## Envelope query
An envelope query takes as input an `RDD[Shape3D]` and an envelope, and returns all objects in the RDD intersecting the envelope (both contained in and crossing the envelope):
```python
from pyspark3d import get_spark_session
from pyspark3d import load_user_conf
from pyspark3d.geometryObjects import ShellEnvelope
from pyspark3d.spatial3DRDD import SphereRDD
from pyspark3d.spatialOperator import windowQuery
# Load the user configuration, and initialise the spark session.
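# (Elided in this excerpt: the configuration loading and the data read;
#  it is assumed to yield a SphereRDD named `rdd`, queried below.)
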
# Load the envelope (Sphere at the origin, and radius 0.5)
sh = ShellEnvelope(0.0, 0.0, 0.0, False, 0.0, 0.5)
# Perform the query
matchRDD = windowQuery(rdd.rawRDD(), sh)
print("{}/{} objects found in the envelope".format(
    len(matchRDD.collect()), rdd.rawRDD().count()))
# 1435/20000 objects found in the envelope
```
Note that the input objects and the envelope can be any `Shape3D`: points, shells (including spheres), or boxes.
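
For instance, here is a minimal sketch of the same query using a box instead of a sphere. This is hedged: `BoxEnvelope` is assumed to expose the Scala constructor order `(xmin, xmax, ymin, ymax, zmin, zmax)`, and `rdd` is the `SphereRDD` loaded above.

```python
from pyspark3d.geometryObjects import BoxEnvelope
from pyspark3d.spatialOperator import windowQuery

# Hypothetical: constructor order assumed from the Scala API.
box = BoxEnvelope(0.0, 0.5, 0.0, 0.5, 0.0, 0.5)

# Same window query as before, but with a box envelope.
matchRDD = windowQuery(rdd.rawRDD(), box)
```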
## Cross-match between data sets
A cross-match takes as input two data sets, and returns the objects matching based on the distance between centers, or on the pixel index of the objects. Note that performing a cross-match between a data set of N elements and another of M elements is a priori an NxM operation, so it can be very costly! Let's load two `Point3D` data sets:
```python
from pyspark3d import get_spark_session
from pyspark3d import load_user_conf
from pyspark3d.spatial3DRDD import Point3DRDD
# Load the user configuration, and initialise the spark session.
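# (Elided in this excerpt: the read of the two Point3DRDDs used below,
#  referred to as `rddA` and `rddB`.)
```
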
By default, the two sets are partitioned randomly (in the sense that points which are spatially close are probably not in the same partition).
In order to decrease the cost of performing the cross-match, you need to partition the two data sets the same way. By doing so, you will cross-match only points belonging to the same partition. For a large number of partitions, you will significantly decrease the cost:
```python
# npart is the desired number of partitions.
# The default is the number of partitions of rdd.rawRDD().
npart = 100
# For the spatial partitioning, you can currently choose
# between LINEARONIONGRID, or OCTREE (see GridType.scala).
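# (Elided in this excerpt: the partitioning calls; they are assumed to
#  yield the re-partitioned sets `rddA_part` and `rddB_part` used below.)
```
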
We advise caching the re-partitioned sets, to speed up future calls by not performing the re-partitioning again.
However, keep in mind that while a large `npart` decreases the cost of performing the cross-match, it increases the partitioning cost, as more partitions imply more data shuffle between partitions. There is no magic number for `npart` which applies in general; you'll need to set it according to the needs of your problem. My only advice would be: re-partitioning is typically done once, while queries can be run many times...
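
A minimal sketch of this caching, assuming a `spatialPartitioning` method mirroring the Scala API (the method name and argument order are not confirmed here; only `cache()` is standard Spark):

```python
# Hypothetical: `spatialPartitioning` is assumed to mirror the Scala API;
# the grid types LINEARONIONGRID and OCTREE come from GridType.scala.
rddA_part = rddA.spatialPartitioning("LINEARONIONGRID", npart).cache()
rddB_part = rddB.spatialPartitioning("LINEARONIONGRID", npart).cache()

# The first action triggers the partitioning; subsequent queries reuse
# the cached, already-partitioned data.
```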
### What does a cross-match return?
In spark3D, the cross-match between two sets A and B can return:
* (1) Elements of (A, B) matching (returnType="AB")
91
+
* (2) Elements of A matching B (returnType="A")
92
+
* (3) Elements of B matching A (returnType="B")
93
+
94
+
Which one should you choose? That depends on what you need:
(1) gives you all matching pairs, but can be slow.
(2) & (3) give you only the matching elements from one side, but are faster.
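
As an illustration, here is a hedged sketch of the three modes (the argument order of `CrossMatchCenter` is an assumption; the `returnType` values come from the list above, and `rddA_part`/`rddB_part` are the re-partitioned sets):

```python
from pyspark3d.spatialOperator import CrossMatchCenter

# Assumed argument order: (rddA, rddB, epsilon, returnType).
epsilon = 0.04
matchAB = CrossMatchCenter(rddA_part, rddB_part, epsilon, "AB")  # pairs (A, B)
matchA = CrossMatchCenter(rddA_part, rddB_part, epsilon, "A")    # elements of A only
matchB = CrossMatchCenter(rddA_part, rddB_part, epsilon, "B")    # elements of B only
```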
### What is the criterion for the cross-match?
Currently, we have implemented two methods to perform a cross-match:
* Based on center distance (a and b match if norm(a - b) < epsilon).
103
+
* Based on the center angular separation (Healpix index) inside a shell (a and b match if their Healpix index is the same). Note that this strategy can be used only in combination with the `LINEARONIONGRID` partitioning, which produces 3D shells along the radial axis and projects the data onto 2D shells (where Healpix can be used!).
104
+
105
+
Here is an example which returns only elements from A with a counterpart in B, using the center distance:
```python
from pyspark3d.spatialOperator import CrossMatchCenter
# Distance threshold for the match
epsilon = 0.04
# Keeping only elements from A with counterpart in B
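# (Elided in this excerpt: the cross-match calls; they are assumed to
#  yield `matchRDDB_healpix` and the re-partitioned `rddB_part` counted
#  below.)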
print("{}/{} elements in A match with elements of B!".format(
    matchRDDB_healpix.count(), rddB_part.count()))
# 15/1000 elements in A match with elements of B!
```
In addition, you can choose to return only the Healpix indices for which points match (returnType="healpix"). It is even faster than returning objects.
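
A hedged sketch of this mode (the operator name `CrossMatchHealpixIndex` and its `(rddA, rddB, nside, returnType)` signature are assumptions based on the Scala side; `nside` sets the Healpix resolution):

```python
from pyspark3d.spatialOperator import CrossMatchHealpixIndex

# Hypothetical: name and argument order assumed from the Scala API.
nside = 512
matching_pix = CrossMatchHealpixIndex(rddA_part, rddB_part, nside, "healpix")
```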
## Neighbour search
### Simple KNN
`KNN` finds the K nearest neighbours of a query object within an `rdd`.
The naive implementation searches through all the objects in the
RDD to get the KNN. The nearness of the objects is decided on the
basis of the distance between their centers.
Note that `queryObject` and the elements of `rdd` must have the same type
(either both `Point3D`, both `ShellEnvelope`, or both `BoxEnvelope`).
```python
from pyspark3d import get_spark_session
from pyspark3d import load_user_conf
from pyspark3d.geometryObjects import Point3D
from pyspark3d.spatial3DRDD import Point3DRDD
from pyspark3d.spatialOperator import KNN
# Load the user configuration, and initialise the spark session.
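# (The rest of this example is truncated in this excerpt.)
```

To round off the example, here is a minimal hedged sketch of the query itself. The `KNN` argument order is an assumption mirroring the Scala `SpatialQuery.KNN`, and `rdd` is assumed to be the `Point3DRDD` loaded in the elided lines:

```python
# Query point at the origin (cartesian coordinates).
pt = Point3D(0.0, 0.0, 0.0, False)

# Find the 5 nearest neighbours of `pt` (argument order assumed:
# rdd, query object, K, unique).
K = 5
match = KNN(rdd.rawRDD(), pt, K, True)
```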