Geospatial processing pipeline for large-scale datasets. Built on PySpark + Sedona.
Suitable for spatial big data analyses, service accessibility modeling, and grid-based enrichment.
- Python >= 3.12
- Java
- Environment variables: `JAVA_HOME`, plus `HADOOP_HOME` on Windows
- Download and install Java if you haven't already.
- Set `JAVA_HOME` to the installation directory, i.e. the one containing `bin`, `lib`, `legal`, etc. On Windows it is usually something like `C:\Program Files\Java\jre-1.8`.
- Add `%JAVA_HOME%\bin` to the system `PATH`.
- Download `winutils.exe` and `hadoop.dll`.
- Place `winutils.exe` in a directory such as `C:\Hadoop\bin`.
- Place `hadoop.dll` in `C:\Windows\System32`.
- Set `HADOOP_HOME = C:\Hadoop` and add `%HADOOP_HOME%\bin` to the system `PATH`.
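Before moving on, it can help to confirm the variables are actually visible. A minimal sanity check using only the Python standard library (nothing geoenricher-specific):

```python
import os
import shutil

# Print the environment variables required above; "<not set>" flags a missing one.
for var in ("JAVA_HOME", "HADOOP_HOME"):
    print(f"{var} = {os.environ.get(var, '<not set>')}")

# `shutil.which` returns None if `java` is not reachable via PATH.
print("java on PATH:", shutil.which("java"))
```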
- For Linux/macOS:

  ```bash
  python -m venv .venv && \
  source .venv/bin/activate && \
  python -m pip install --upgrade pip
  ```

- For Windows:

  ```powershell
  python -m venv .venv
  .\.venv\Scripts\activate
  python -m pip install --upgrade pip
  ```
Make sure you have at least 1.3 GB of free disk space, then run:

```bash
pip install geoenricher
```
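To confirm the install landed in the active environment before running the CLI, a quick import check (plain Python; nothing assumed beyond the package name):

```python
# If this raises ImportError, the package did not install into the active environment.
import geoenricher
print("geoenricher imported OK")
```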
Run in the terminal:

```bash
geoenricher
```

This sets up the data directory and creates a sample notebook with the necessary imports and a usage guide.
- The first load takes a while, since it reads the datasets from disk, cleans them, and applies some essential transformations.
- It is a one-time operation: `parquet_all()` saves all datasets to disk, preserving any transformations applied.
- On subsequent runs, you can load them directly with `load_from_parquets()`.
```
data/
├── uploaded_dir_1/
├── uploaded_dir_2/
├── ...
├── pickle_parquets/
│   ├── archive/
│   ├── dfs_list/
│   └── others
└── z_maps/
```
Place each dataset in its own directory under `data/`; the other directories are maintained automatically.
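For example, to lay out the per-dataset directories used in the quick-start below (a sketch using only the standard library; the directory names come from the `datasets` example that follows):

```python
from pathlib import Path

data_dir = Path("./data")
# One subdirectory per dataset, matching the `datasets` mapping below.
for name in ("DGURBA", "grids_2021_3035", "hospitals_3035"):
    (data_dir / name).mkdir(parents=True, exist_ok=True)
```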
```python
from geoenricher import Enricher

# Provide the data directory
data_dir = "./data"

# `datasets` maps each name to a (path, file_format) tuple:
# datasets = {"<dataset_name>": ("<path>", "<file_format>"), ...}
datasets = {
    "dg_urban": (f"{data_dir}/DGURBA", "shapefile"),
    "pop_grids": (f"{data_dir}/grids_2021_3035", "geoparquet"),
    "hospitals": (f"{data_dir}/hospitals_3035", "geopackage"),
    # ...
}
```
```python
obj = Enricher(crs="EPSG:3035")
obj.setup_cluster(
    data_dir=data_dir,
    which="sedona",   # or "wherobots"
    ex_mem=26,        # executor memory (GB)
    dr_mem=24,        # driver memory (GB)
    log_level="ERROR",
)
obj.load(datasets, silent=True)

# Optionally, run `fix_geometries()` to fix invalid geometries, if any.
# To skip the check for some dataframes, pass their names in `skip`:
obj.fix_geometries(
    skip=["pop_grids", "pop_grids_new"],
)

# Save the dataframes in memory to disk for quick access in subsequent runs.
# Default directory: "{data_dir}/pickle_parquets/dfs_list"; you may change it
# by passing `parquet_dir` (relative to `data_dir`).
obj.parquet_all()
```
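For instance, to write the parquets somewhere other than the default (the target path here is hypothetical; `parquet_dir` is resolved relative to `data_dir`, per the note above):

```python
# Hypothetical alternative location under the data directory.
obj.parquet_all(parquet_dir="pickle_parquets/dfs_list_v2")
```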
```python
obj = Enricher(crs="EPSG:3035")
obj.setup_cluster(
    data_dir=data_dir,
    which="sedona",
    ex_mem=26,
    dr_mem=24,
    log_level="INFO",
)

# Loads all the datasets from the default directory:
# "{data_dir}/pickle_parquets/dfs_list".
# To load from a different directory, pass its path in `parquet_dir`
# (relative to `data_dir`).
obj.load_from_parquets()
```
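A quick way to verify what was loaded, assuming `dfs_list` is a name-to-DataFrame mapping as the plotting example below suggests:

```python
# List each loaded dataset with its row count.
# Note: count() triggers a full scan, so skip it for very large datasets.
for name, df in obj.dfs_list.items():
    print(name, df.count())
```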
```python
from pyspark.sql import functions as F

# Spatial join example: enrich the population grids with country names,
# then keep only the Italian cells.
enriched_df = obj.enrich_sjoin(
    df1="pop_grids",
    df2="countries",
    enr_cols=["CNTR_NAME"],
).filter(F.col("CNTR_ID") == "IT")
```
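The result supports standard PySpark operations, assuming `enrich_sjoin` returns a regular Spark dataframe as the `.filter` chaining above suggests. For example, re-running the join without the country filter and counting grid cells per country:

```python
# Count enriched grid cells per country, largest first (plain PySpark aggregation).
per_country = (
    obj.enrich_sjoin(df1="pop_grids", df2="countries", enr_cols=["CNTR_NAME"])
    .groupBy("CNTR_NAME")
    .count()
    .orderBy(F.col("count").desc())
)
per_country.show(5)
```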
Pass a list of either the names of loaded datasets or Spark dataframes already in memory (the two can be mixed).

```python
obj.plot_this(
    df=[  # str | SparkDataFrame | list[str | SparkDataFrame]
        obj.dfs_list["<dataset_name>"].filter(F.col("CNTR_ID") == "IT"),
        # obj.dfs_list["dg_urban"].filter(F.col("CNTR_CODE") == "IT"),
        # temp.filter(F.col("CNTR_ID") == "IT"),
    ],
    new_map=True,     # if False, adds the dataset to the existing map (or creates a new one if none exists)
    save_html=False,  # if True, saves the Kepler map to an HTML file in `{data_dir}/z_maps`
)

# You can display the map in another cell by running `obj.map`.
```
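As a follow-up, the same call with `save_html=True` writes the map out for sharing (using a dataset name from the quick-start; parameters as documented above):

```python
# Writes a standalone Kepler HTML map under `{data_dir}/z_maps`.
obj.plot_this(df=["pop_grids"], new_map=True, save_html=True)
```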
- Invalid geometries: run `obj.fix_geometries()`.
- Memory issues: increase `ex_mem` and/or `dr_mem`.
- Slow joins: check `obj.inspect_partitions()`, and run `obj.force_repartition()` to override the default partitioning with the number of available cores. See the sketch below for a lower-level check.
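For a lower-level look than `inspect_partitions()`, plain PySpark exposes partition counts directly (a sketch, assuming `dfs_list` maps names to Spark dataframes as above):

```python
# Number of partitions per loaded dataframe; a single huge partition
# or heavy skew is a common cause of slow joins.
for name, df in obj.dfs_list.items():
    print(name, df.rdd.getNumPartitions())
```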