
Commit f584007

Merge pull request #89 from astrolabsoftware/serde
Update on pyspark3d: spatial partitioning
2 parents f4e5ad9 + 582bd3f commit f584007

25 files changed: +690 −36 lines

docs/03_partitioning_python.md

Lines changed: 69 additions & 0 deletions
@@ -15,4 +15,73 @@ Unfortunately, re-partitioning the space involves potentially large shuffle betw

## Available partitioning and partitioners

There are currently two partitioning strategies implemented in the library (a minimal sketch of each idea follows the list):

- **Onion Partitioning:** See [here](https://github.com/astrolabsoftware/spark3D/issues/11) for a description. This is mostly intended for processing astrophysical data, as it partitions the space into 3D shells along the radial axis, with the possibility of projecting into 2D shells (and then partitioning the shells using Healpix).
- **Octree:** An octree extends a quadtree by using three orthogonal splitting planes to subdivide a tile into eight children. Like quadtrees, octrees allow variations such as non-uniform subdivision, tight bounding volumes, and overlapping children.
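
To fix ideas, here is a minimal sketch of both ideas in plain Python. The function names, the linear shell spacing, and the child-indexing convention are assumptions for illustration only, not the library's implementation:

```python
import numpy as np

def onion_shell_index(r, rmax, nshells):
    # Assign a radial distance r to one of nshells shells, linearly
    # spaced in radius (LINEARONIONGRID-style; the spacing is an assumption)
    return min(int(r / rmax * nshells), nshells - 1)

def octree_child_index(point, center):
    # Octant index (0-7) of a 3D point relative to a node center: each of
    # the three orthogonal splitting planes contributes one bit, giving
    # the eight children of an octree node
    return (int(point[0] >= center[0])
            + 2 * int(point[1] >= center[1])
            + 4 * int(point[2] >= center[2]))

# Example: 5 shells over radii in [0, 1), and one octant lookup
radii = np.random.rand(10)
print([onion_shell_index(r, 1.0, 5) for r in radii])
print(octree_child_index([0.3, -0.2, 0.7], [0.0, 0.0, 0.0]))
```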

### Onion Partitioning

In the following example, we load `Point3D` data and re-partition it with the onion partitioning:

```python
from pyspark3d import load_user_conf, get_spark_session
from pyspark3d.spatial3DRDD import Point3DRDD

# Load user config and the Spark session
dic = load_user_conf()
spark = get_spark_session(dicconf=dic)

# Load the data
fn = "src/test/resources/astro_obs.fits"
p3drdd = Point3DRDD(spark, fn, "Z_COSMO,RA,DEC", True, "fits", {"hdu": "1"})

# nPart is the wanted number of partitions.
# Default is rdd.rawRDD() partition number.
npart = 5
gridtype = "LINEARONIONGRID"
rdd_part = p3drdd.spatialPartitioningPython(gridtype, npart)
```

| Raw data set | Re-partitioned data set
|:---------:|:---------:
| ![raw]({{ "/assets/images/onion_nopart_python.png" | absolute_url }}) | ![repartitioning]({{ "/assets/images/onion_part_python.png" | absolute_url }})

Color code indicates the partitions (all objects with the same color belong to the same partition).

### Octree Partitioning

In the following example, we load `ShellEnvelope` data (spheres) and re-partition it with the octree partitioning:

```python
from pyspark3d import load_user_conf, get_spark_session
from pyspark3d.spatial3DRDD import SphereRDD

# Load user config and the Spark session
dic = load_user_conf()
spark = get_spark_session(dicconf=dic)

# Load the data
fn = "src/test/resources/cartesian_spheres.fits"
srdd = SphereRDD(spark, fn, "x,y,z,radius", False, "fits", {"hdu": "1"})

# nPart is the wanted number of partitions (floored to a power of 8).
# Default is rdd.rawRDD() partition number.
npart = 10
gridtype = "OCTREE"
rdd_part = srdd.spatialPartitioningPython(gridtype, npart)
```
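
Note the flooring mentioned in the comment above: asking for `npart = 10` yields 8 partitions. A quick sketch, assuming the rule is the largest power of 8 not exceeding `nPart`:

```python
npart = 10

# Largest power of 8 not exceeding npart (assumed flooring rule)
effective = 1
while effective * 8 <= npart:
    effective *= 8

print(effective)  # 8 when npart = 10
```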

We also advise caching the re-partitioned sets, to speed up future calls by not performing the re-partitioning again (see the sketch below). If you are short on memory, unpersist the rawRDD first, before caching the re-partitioned RDD.
However, keep in mind that while a large `nPart` decreases the cost of performing future queries (cross-match, KNN, ...), it increases the partitioning cost, as more partitions imply more data shuffle between partitions. There is no magic number for `nPart` that applies in general; you will need to set it according to the needs of your problem. My only advice would be: re-partitioning is typically done once, queries can be multiple...
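
Continuing the octree example above, a minimal caching sketch (`cache` and `unpersist` are the standard Spark methods; their availability on these wrapped RDDs is an assumption):

```python
# Cache the re-partitioned RDD; the first action materialises the cache
rdd_part.cache()
count = rdd_part.count()

# If memory is tight, drop the raw RDD first (assumption: the underlying
# JVM RDD exposes unpersist through py4j)
# srdd.rawRDD().unpersist()
```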

| Raw data set | Re-partitioned data set
|:---------:|:---------:
| ![raw]({{ "/assets/images/octree_nopart_python.png" | absolute_url }}) | ![repartitioning]({{ "/assets/images/octree_part_python.png" | absolute_url }})

The size of the markers is proportional to the radius. Color code indicates the partitions (all objects with the same color belong to the same partition).

## Current benchmark

TBD

docs/03_partitioning_scala.md

Lines changed: 4 additions & 0 deletions
@@ -55,6 +55,8 @@ val pointRDD_partitioned = pointRDD.spatialPartitioning(GridType.LINEARONIONGRID
|:---------:|:---------:
| ![raw]({{ "/assets/images/myOnionFigRaw.png" | absolute_url }}) | ![repartitioning]({{ "/assets/images/myOnionFig.png" | absolute_url }})

Color code indicates the partitions (all objects with the same color belong to the same partition).

### Octree Partitioning

In the following example, we load `ShellEnvelope` data (spheres), and we re-partition it with the octree partitioning

@@ -94,6 +96,8 @@ However keep in mind that if a large `nPart` decreases the cost of performing fu
|:---------:|:---------:
| ![raw]({{ "/assets/images/rawData_noOctree.png" | absolute_url }}) | ![repartitioning]({{ "/assets/images/rawData_withOctree.png" | absolute_url }})

Color code indicates the partitions (all objects with the same color belong to the same partition).

## Current benchmark

TBD
[Four binary image files added: 295 KB, 263 KB, 213 KB, 204 KB (previews not shown)]
Lines changed: 140 additions & 0 deletions
@@ -0,0 +1,140 @@
# Copyright 2018 Julien Peloton
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from pyspark.sql import SparkSession

import numpy as np

from pyspark3d import set_spark_log_level
from pyspark3d import load_user_conf
from pyspark3d import get_spark_session
from pyspark3d.spatial3DRDD import SphereRDD

import argparse

def addargs(parser):
    """ Add command line arguments for pyspark3d_part """

    ## Arguments
    parser.add_argument(
        '-inputpath', dest='inputpath',
        required=True,
        help='Path to a FITS file')

    ## Arguments
    parser.add_argument(
        '-hdu', dest='hdu',
        required=True,
        help='HDU index to load.')

    ## Arguments
    parser.add_argument(
        '-part', dest='part',
        default=None,
        help='Type of partitioning')

    ## Arguments
    parser.add_argument(
        '-npart', dest='npart',
        default=10,
        type=int,
        help='Number of partitions')

    ## Arguments
    parser.add_argument(
        '--plot', dest='plot',
        action="store_true",
        help='Plot the re-partitioned data set')


if __name__ == "__main__":
    """
    Re-partition an RDD with the OCTREE partitioning of pyspark3d
    """
    parser = argparse.ArgumentParser(
        description="""
        Re-partition an RDD with the OCTREE partitioning of pyspark3d
        """)
    addargs(parser)
    args = parser.parse_args(None)

    # Load user conf and Spark session
    dic = load_user_conf()
    spark = get_spark_session(dicconf=dic)

    # Set logs to be quiet
    set_spark_log_level()

    # Load raw data
    fn = args.inputpath
    rdd = SphereRDD(
        spark, fn, "x,y,z,radius", False, "fits", {"hdu": args.hdu})

    # Perform the re-partitioning
    npart = args.npart
    gridtype = args.part

    if gridtype is not None:
        rdd_part = rdd.spatialPartitioningPython(gridtype, npart)
    else:
        rdd_part = rdd.rawRDD().toJavaRDD().repartition(npart)

    if not args.plot:
        count = rdd_part.count()
        print("{} elements".format(count))
    else:
        # Plot the result.
        # Collect the data on the driver -- just for visualisation purposes.
        # Do not do that with a full data set or you will destroy your driver!
        import pylab as pl
        from mpl_toolkits.mplot3d import Axes3D

        fig = pl.figure()
        ax = Axes3D(fig)

        # Convert the data for the plot:
        # List[all partitions] of List[all spheres per partition]
        data_glom = rdd_part.glom().collect()

        # Take only a few points (400 per partition) to speed things up.
        # For each sphere (el), take the center, grab its coordinates and
        # make them a Python list (it is a JavaList by default).
        data_all = [
            np.array(
                [list(el.center().getCoordinatePython())
                 for el in part[0:400]]).T
            for part in data_glom]

        # Collect the radii
        radius = [
            np.array(
                [el.outerRadius() for el in part[0:400]])
            for part in data_glom]

        # Plot partition-by-partition, with marker size proportional
        # to the sphere radius
        for i in range(len(data_all)):
            s = radius[i] * 3000
            ax.scatter(data_all[i][0], data_all[i][1], data_all[i][2], s=s)

        ax.set_xlabel("X")
        ax.set_ylabel("Y")
        ax.set_zlabel("Z")

        # Save the result on disk
        if gridtype is not None:
            pl.savefig("octree_part_python.png")
        else:
            pl.savefig("octree_nopart_python.png")
        pl.show()
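
A possible invocation of this script (its file name is not shown in this diff, hence the placeholder, and running through `spark-submit` is an assumption): `spark-submit <octree_script.py> -inputpath src/test/resources/cartesian_spheres.fits -hdu 1 -part OCTREE -npart 8 --plot`. Omitting `-part` falls back to a plain Spark `repartition`, which is what produces the `octree_nopart_python.png` baseline figure.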
Lines changed: 136 additions & 0 deletions
@@ -0,0 +1,136 @@
# Copyright 2018 Julien Peloton
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from pyspark.sql import SparkSession

import numpy as np

from pyspark3d import set_spark_log_level
from pyspark3d import load_user_conf
from pyspark3d import get_spark_session
from pyspark3d import load_from_jvm
from pyspark3d.spatial3DRDD import Point3DRDD

import argparse

def addargs(parser):
    """ Add command line arguments for pyspark3d_part """

    ## Arguments
    parser.add_argument(
        '-inputpath', dest='inputpath',
        required=True,
        help='Path to a FITS file')

    ## Arguments
    parser.add_argument(
        '-hdu', dest='hdu',
        required=True,
        help='HDU index to load.')

    ## Arguments
    parser.add_argument(
        '-part', dest='part',
        default=None,
        help='Type of partitioning')

    ## Arguments
    parser.add_argument(
        '-npart', dest='npart',
        default=10,
        type=int,
        help='Number of partitions')

    ## Arguments
    parser.add_argument(
        '--plot', dest='plot',
        action="store_true",
        help='Plot the re-partitioned data set')


if __name__ == "__main__":
    """
    Re-partition an RDD with the ONION partitioning of pyspark3d
    """
    parser = argparse.ArgumentParser(
        description="""
        Re-partition an RDD with the ONION partitioning of pyspark3d
        """)
    addargs(parser)
    args = parser.parse_args(None)

    # Load user conf and Spark session
    dic = load_user_conf()
    spark = get_spark_session(dicconf=dic)

    # Set logs to be quiet
    set_spark_log_level()

    # Load raw data
    fn = args.inputpath
    rdd = Point3DRDD(
        spark, fn, "Z_COSMO,RA,DEC", True, "fits", {"hdu": args.hdu})

    # Perform the re-partitioning
    npart = args.npart
    gridtype = args.part

    if gridtype is not None:
        rdd_part = rdd.spatialPartitioningPython(gridtype, npart)
    else:
        rdd_part = rdd.rawRDD().toJavaRDD().repartition(npart)

    if not args.plot:
        count = rdd_part.count()
        print("{} elements".format(count))
    else:
        # Plot the result.
        # Collect the data on the driver -- just for visualisation purposes.
        # Do not do that with a full data set or you will destroy your driver!
        import pylab as pl
        from mpl_toolkits.mplot3d import Axes3D

        fig = pl.figure()
        ax = Axes3D(fig)

        # Converter from the spherical to the cartesian coordinate system;
        # it takes a Point3D and returns a Point3D
        mod = "com.astrolabsoftware.spark3d.utils.Utils.sphericalToCartesian"
        converter = load_from_jvm(mod)

        # Convert the data for the plot -- List of List of Point3D
        data_glom = rdd_part.glom().collect()

        # Take only a few points (400 per partition) to speed things up.
        # For each Point3D (el), grab the coordinates, convert them from
        # the spherical to the cartesian coordinate system (for the plot)
        # and make them a Python list (it is a JavaList by default).
        data_all = [
            np.array(
                [list(converter(el).getCoordinatePython())
                 for el in part[0:400]]).T
            for part in data_glom]

        # Plot partition-by-partition
        for i in range(len(data_all)):
            ax.scatter(data_all[i][0], data_all[i][1], data_all[i][2])

        ax.set_xlabel("X")
        ax.set_ylabel("Y")
        ax.set_zlabel("Z")

        # Save the result on disk
        if gridtype is not None:
            pl.savefig("onion_part_python.png")
        else:
            pl.savefig("onion_nopart_python.png")
        pl.show()
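
As for the octree script, a possible invocation (file name placeholder, `spark-submit` assumed): `spark-submit <onion_script.py> -inputpath src/test/resources/astro_obs.fits -hdu 1 -part LINEARONIONGRID -npart 5 --plot`. Without `-part`, the raw partitioning is plotted and saved as `onion_nopart_python.png`.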

pic/pyspark3d_lib_0.2.1.png (172 KB, binary file added)

pic/spark3d_lib_0.2.1.png (188 KB, binary file added)

pyspark3d/__init__.py

Lines changed: 4 additions & 1 deletion
@@ -17,6 +17,8 @@
 from typing import Any, List, Dict

+import sys
+
 from version import __version__
 from pyspark3d_conf import extra_jars, extra_packages, log_level
@@ -288,4 +290,5 @@ def set_spark_log_level(log_level_manual=None):
     np.set_printoptions(legacy="1.13")

     # Run the test suite
-    doctest.testmod()
+    failure_count, test_count = doctest.testmod()
+    sys.exit(failure_count)
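
Propagating the doctest failure count through `sys.exit` makes the test run exit non-zero whenever a doctest fails, so a calling test harness can detect failures instead of always seeing a zero exit code.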
