
Commit 32f946a

Update website with pyspark3d queries
1 parent 87f120b commit 32f946a

File tree

3 files changed: +303 / -10 lines

docs/04_query_python.md

Lines changed: 263 additions & 0 deletions
@@ -7,4 +7,267 @@ date: 2018-06-18 22:31:13 +0200
# Tutorial: Query, cross-match, play!

The spark3D library contains a number of methods and tools to manipulate 3D RDDs. Currently, you can already play with *window query*, *KNN* and *cross-match between data sets*.

## Envelope query

An envelope query takes as input an `RDD[Shape3D]` and an envelope, and returns all objects in the RDD intersecting the envelope (i.e. contained in or crossing the envelope):

```python
from pyspark3d import get_spark_session
from pyspark3d import load_user_conf

from pyspark3d.geometryObjects import ShellEnvelope
from pyspark3d.spatial3DRDD import SphereRDD
from pyspark3d.spatialOperator import windowQuery

# Load the user configuration, and initialise the spark session.
dic = load_user_conf()
spark = get_spark_session(dicconf=dic)

# Load the data
fn = "src/test/resources/cartesian_spheres.fits"
rdd = SphereRDD(spark, fn, "x,y,z,radius", False, "fits", {"hdu": "1"})

# Define the envelope (sphere at the origin, with radius 0.5)
sh = ShellEnvelope(0.0, 0.0, 0.0, False, 0.0, 0.5)

# Perform the query
matchRDD = windowQuery(rdd.rawRDD(), sh)
print("{}/{} objects found in the envelope".format(
    len(matchRDD.collect()), rdd.rawRDD().count()))
# 1435/20000 objects found in the envelope
```

Note that the input objects and the envelope can be any of the `Shape3D` types: points, shells (incl. spheres), boxes.
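
For instance, the same window query runs unchanged on a data set of points rather than spheres. Below is a minimal sketch reusing the session and the envelope defined above; the catalogue path and column names are purely illustrative:

```python
from pyspark3d.spatial3DRDD import Point3DRDD

# Hypothetical point catalogue with cartesian coordinates (x, y, z)
fn_pts = "src/test/resources/cartesian_points.fits"
rdd_pts = Point3DRDD(spark, fn_pts, "x,y,z", False, "fits", {"hdu": "1"})

# Same envelope, same operator: only the element type of the RDD changed.
matchRDD_pts = windowQuery(rdd_pts.rawRDD(), sh)
print("{} points found in the envelope".format(len(matchRDD_pts.collect())))
```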

## Cross-match between data sets

A cross-match takes two data sets as input and returns the objects matching, based either on the distance between object centers or on the pixel index of the objects. Note that performing a cross-match between a data set of N elements and another of M elements is a priori an NxM operation, so it can be very costly! Let's load two `Point3D` data sets:

```python
from pyspark3d import get_spark_session
from pyspark3d import load_user_conf

from pyspark3d.spatial3DRDD import Point3DRDD

# Load the user configuration, and initialise the spark session.
dic = load_user_conf()
spark = get_spark_session(dicconf=dic)

# Load the raw data (spherical coordinates)
fn = "src/test/resources/astro_obs_{}_light.fits"
rddA = Point3DRDD(
    spark, fn.format("A"), "Z_COSMO,RA,DEC", True, "fits", {"hdu": "1"})
rddB = Point3DRDD(
    spark, fn.format("B"), "Z_COSMO,RA,DEC", True, "fits", {"hdu": "1"})
```

By default, the two sets are partitioned randomly (in the sense that spatially close points most likely do not end up in the same partition).
To decrease the cost of the cross-match, you need to partition the two data sets the same way. By doing so, you will only cross-match points belonging to the same partition. For a large number of partitions, this significantly decreases the cost:

```python
# npart is the wanted number of partitions.
# Default is the partition number of rdd.rawRDD().
npart = 100

# For the spatial partitioning, you can currently choose
# between LINEARONIONGRID and OCTREE (see GridType.scala).
rddA_part = rddA.spatialPartitioningPython("LINEARONIONGRID", npart)

# Repartition B as A
rddB_part = rddB.spatialPartitioningPython(
    rddA_part.partitioner().get()).cache()
```

We advise caching the re-partitioned sets, to speed up future calls by not performing the re-partitioning again.
However, keep in mind that while a large `npart` decreases the cost of performing the cross-match, it increases the partitioning cost, as more partitions implies more data shuffle. There is no magic number for `npart` which applies in general, and you will need to set it according to the needs of your problem. My only advice would be: re-partitioning is typically done once, while queries can be run many times...

### What does a cross-match return?

In spark3D, the cross-match between two sets A and B can return:

* (1) elements of (A, B) matching (returnType="AB"),
* (2) elements of A matching B (returnType="A"),
* (3) elements of B matching A (returnType="B").

Which one should you choose? It depends on what you need:
(1) gives you all matching pairs, but can be slow.
(2) and (3) give you only the matching elements on one side, but are faster. The three call variants are sketched right below.

### What is the criterion for the cross-match?

Currently, two methods are implemented to perform a cross-match:

* based on the center distance (a and b match if norm(a - b) < epsilon);
* based on the angular separation of the centers (Healpix index) inside a shell (a and b match if their Healpix indices are the same). Note that this strategy can only be used in combination with the `LINEARONIONGRID` partitioning, which produces 3D shells along the radial axis and projects the data onto 2D shells (where Healpix can be used!). The distance criterion is illustrated right below.
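
For intuition only, here is what the center-distance criterion amounts to for two points given by their cartesian coordinates (a plain-Python sketch, not the spark3D implementation):

```python
from math import sqrt

def centers_match(a, b, epsilon):
    """Return True if the euclidean distance between the two centers is below epsilon."""
    return sqrt(sum((xa - xb) ** 2 for xa, xb in zip(a, b))) < epsilon

# Two nearby points and a 0.04 threshold
print(centers_match([0.1, 0.2, 0.3], [0.12, 0.21, 0.29], 0.04))
# True
```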

Here is an example which returns only the elements from A having a counterpart in B, using the center distance:

```python
from pyspark3d.spatialOperator import CrossMatchCenter

# Distance threshold for the match
epsilon = 0.04

# Keeping only elements from A with a counterpart in B
matchRDDB = CrossMatchCenter(rddA_part, rddB_part, epsilon, "A")

print("{}/{} elements in A match with elements of B!".format(
    matchRDDB.count(), rddB_part.count()))
# 19/1000 elements in A match with elements of B!
```

and the same using the Healpix indices:

```python
from pyspark3d.spatialOperator import CrossMatchHealpixIndex

# Shell resolution for the Healpix indexing
nside = 512

# Keeping only elements from A with a counterpart in B
matchRDDB_healpix = CrossMatchHealpixIndex(rddA_part, rddB_part, nside, "A")
print("{}/{} elements in A match with elements of B!".format(
    matchRDDB_healpix.count(), rddB_part.count()))
# 15/1000 elements in A match with elements of B!
```

In addition, you can choose to return only the Healpix indices for which points match (returnType="healpix"). This is even faster than returning the objects themselves; see the sketch below.
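
A minimal sketch of that variant, assuming the same re-partitioned RDDs and `nside` as above (only the last argument changes):

```python
from pyspark3d.spatialOperator import CrossMatchHealpixIndex

# Return only the Healpix indices at which both A and B have points
match_pixels = CrossMatchHealpixIndex(rddA_part, rddB_part, nside, "healpix")
print("{} Healpix pixels host matching points".format(match_pixels.count()))
```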

## Neighbour search

### Simple KNN

Finds the K nearest neighbours of a query object within an `rdd`.
The naive implementation here searches through all the objects in the
RDD to get the KNN. The nearness of the objects is decided on the
basis of the distance between their centers.
Note that `queryObject` and the elements of `rdd` must have the same type
(either both Point3D, or both ShellEnvelope, or both BoxEnvelope).

```python
from math import sqrt

from pyspark3d import get_spark_session
from pyspark3d import load_user_conf
from pyspark3d import load_from_jvm

from pyspark3d.geometryObjects import Point3D
from pyspark3d.spatial3DRDD import Point3DRDD
from pyspark3d.spatialOperator import KNN

# Load the user configuration, and initialise the spark session.
dic = load_user_conf()
spark = get_spark_session(dicconf=dic)

# Load the data (spherical coordinates)
fn = "src/test/resources/astro_obs.fits"
rdd = Point3DRDD(spark, fn, "Z_COSMO,RA,DEC", True, "fits", {"hdu": "1"})

# Define your query point
pt = Point3D(0.0, 0.0, 0.0, True)

# Perform the query: look for the 5 closest neighbours of `pt`.
# Note that the last argument controls whether we want to eliminate duplicates.
match = KNN(rdd.rawRDD(), pt, 5, True)

# `match` is a list of Point3D. Take the coordinates and convert them
# into cartesian coordinates for later use:
mod = "com.astrolabsoftware.spark3d.utils.Utils.sphericalToCartesian"
converter = load_from_jvm(mod)
match_coord = [converter(m).getCoordinatePython() for m in match]

# Print the distance to the query point (origin)
def normit(l: list) -> float:
    """
    Return the distance to the origin of a point.

    Parameters
    ----------
    l : list of float
        The three cartesian coordinates of a point in space.

    Returns
    ----------
    float
        The euclidean norm between the point and the origin.
    """
    return sqrt(sum([e**2 for e in l]))

# Pretty print
print([str(normit(l) * 1e6) + "e-6" for l in match_coord])
# [72, 73, 150, 166, 206]
```

### More efficient KNN

A more efficient implementation of the KNN query above.
First we seek the partitions to which the query object belongs, and we
look for the KNN only in those partitions. If the limit K
is not yet satisfied, we keep looking similarly in the neighbours of the
containing partitions.

Note 1: the elements of `rdd` and `queryObject` can have different types
among Shape3D (Point3D, ShellEnvelope or BoxEnvelope), unlike for KNN above.

Note 2: in the Python version, KNNEfficient only works on a repartitioned RDD.
```python
216+
from pyspark3d import get_spark_session
217+
from pyspark3d import load_user_conf
218+
219+
from pyspark3d.geometryObjects import Point3D
220+
from pyspark3d.spatial3DRDD import Point3DRDD
221+
from pyspark3d.spatialOperator import KNN
222+
223+
# Load the user configuration, and initialise the spark session.
224+
dic = load_user_conf()
225+
spark = get_spark_session(dicconf=dic)
226+
227+
# Load the data (spherical coordinates)
228+
fn = "src/test/resources/astro_obs.fits"
229+
rdd = Point3DRDD(spark, fn, "Z_COSMO,RA,DEC", True, "fits", {"hdu": "1"}
230+
)
231+
232+
# Repartition the RDD
233+
rdd_part = rdd.spatialPartitioningPython("LINEARONIONGRID", 5)
234+
235+
# Define your query point
236+
pt = Point3D(0.0, 0.0, 0.0, True)
237+
238+
# Perform the query: look for the 5 closest neighbours from `pt`.
239+
# Automatically discards duplicates
240+
match = KNNEfficient(rdd_part, pt, 5)
241+
242+
# `match` is a list of Point3D. Take the coordinates and convert them
243+
# into cartesian coordinates for later use:
244+
mod = "com.astrolabsoftware.spark3d.utils.Utils.sphericalToCartesian"
245+
converter = load_from_jvm(mod)
246+
match_coord = [converter(m).getCoordinatePython() for m in match]
247+
248+
# Print the distance to the query point (origin)
249+
def normit(l: list) -> float:
250+
"""
251+
Return the distance to the origin for each point of the list
252+
253+
Parameters
254+
----------
255+
l : list of list
256+
List of (list of 3 float). Each main list element represents
257+
the cartesian coordinates of a point in space.
258+
259+
Returns
260+
----------
261+
ld : list of float
262+
list containing (euclidean) norm btw each point and the origin.
263+
"""
264+
return sqrt(sum([e**2 for e in l]))
265+
266+
# Pretty print
267+
print([str(normit(l) * 1e6) + "e-6" for l in match_coord])
268+
# [72, 73, 150, 166, 206]
269+
```

## Benchmarks

TBD

docs/04_query_scala.md

Lines changed: 40 additions & 8 deletions
@@ -39,11 +39,8 @@ val center = new Point3D(0.9, 0.0, 0.0, spherical)
val radius = 0.1
val envelope = new ShellEnvelope(center, radius)

-// Instantiate a RangeQuery[data, envelope] object
-val rq = new RangeQuery[Point3D, ShellEnvelope]
-
-// Return the match
-val queryResult = rq.windowQuery(objects, envelope)
+// Perform the match
+val queryResult = RangeQuery.windowQuery(objects, envelope)
```

Note that the input objects and the envelope can be anything among the `Shape3D`: points, shells (incl. sphere), boxes.
@@ -158,7 +155,15 @@ For more details on the cross-match, see the following [notebook](https://github

## Neighbour search

-Brute force KNN:
+### Simple KNN
+
+Finds the K nearest neighbours of a query object within an `rdd`.
+The naive implementation here searches through all the objects in the
+RDD to get the KNN. The nearness of the objects is decided on the
+basis of the distance between their centers.
+Note that `queryObject` and the elements of `rdd` must have the same type
+(either both Point3D, or both ShellEnvelope, or both BoxEnvelope).

```scala
// Load the data
@@ -168,10 +173,37 @@ val pRDD = new Point3DRDD(spark, fn, columns, isSpherical, "csv", options)
val queryObject = new Point3D(0.0, 0.0, 0.0, false)

// Find the `nNeighbours` closest neighbours
-val knn = SpatialQuery.KNN(queryObject, pRDD.rawRDD, nNeighbours)
+// Note that the last argument controls whether we want to eliminate duplicates.
+val knn = SpatialQuery.KNN(pRDD.rawRDD, queryObject, nNeighbours, unique)
```

-To come: partitioning + indexing!
+### More efficient KNN
+
+A more efficient implementation of the KNN query above.
+First we seek the partitions to which the query object belongs, and we
+look for the KNN only in those partitions. If the limit K
+is not yet satisfied, we keep looking similarly in the neighbours of the
+containing partitions.
+
+Note 1: the elements of `rdd` and `queryObject` can have different types
+among Shape3D (Point3D, ShellEnvelope or BoxEnvelope), unlike for KNN above.
+
+Note 2: KNNEfficient only works on a repartitioned RDD.
+
+```scala
+// Load the data
+val pRDD = new Point3DRDD(spark, fn, columns, isSpherical, "csv", options)
+
+// Repartition the data
+val pRDD_part = pRDD.spatialPartitioning(GridType.LINEARONIONGRID, 100)
+
+// Centre object for the query
+val queryObject = new Point3D(0.0, 0.0, 0.0, false)
+
+// Find the `nNeighbours` closest neighbours
+// Duplicates are automatically discarded
+val knn = SpatialQuery.KNNEfficient(pRDD_part, queryObject, nNeighbours)
+```

## Benchmarks

pyspark3d/spatialOperator.py

Lines changed: 0 additions & 2 deletions
@@ -297,7 +297,6 @@ def CrossMatchCenter(
----------
>>> from pyspark3d import get_spark_session
>>> from pyspark3d import load_user_conf
->>> from pyspark3d.geometryObjects import Point3D
>>> from pyspark3d_conf import path_to_conf
>>> from pyspark3d.spatial3DRDD import Point3DRDD
@@ -394,7 +393,6 @@ def CrossMatchHealpixIndex(
----------
>>> from pyspark3d import get_spark_session
>>> from pyspark3d import load_user_conf
->>> from pyspark3d.geometryObjects import Point3D
>>> from pyspark3d_conf import path_to_conf
>>> from pyspark3d.spatial3DRDD import Point3DRDD
