Unified tooling for generating geospatial embeddings from coordinates and for building large land-only embedding datasets with a consistent output format.
This repository has two main entry points:
- scripts/get_embeddings.py: small point-query CLI for geoclip and satclip
- scripts/generate_dataset.py: land-only dataset generator with support for 5 embedding products
The dataset generator currently supports:
- geoclip
- satclip
- copernicus_embed
- tessera
- google_satellite_embedding
All dataset outputs use the same coordinate conventions and save format, even though the underlying models and products use different native coordinate orders and storage layouts.
scripts/generate_dataset.py:
- samples candidate coordinates with Fibonacci sphere sampling
- randomizes the Fibonacci phase per run, so repeated runs do not reuse the exact same locations
- filters to land using Natural Earth polygons
- queries one or more encoders
- drops invalid rows for the selected encoder set
- writes .pt or .csv datasets
- writes location plots and ICA-to-RGB embedding plots
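The first two steps above can be sketched as follows. This is a minimal illustration, not the repository's implementation: the function name `fibonacci_sphere_latlon` is hypothetical, and treating the random phase as a longitude offset is an assumption about how the lattice is randomized.

```python
import math
import random


def fibonacci_sphere_latlon(n_points, phase=None, rng=None):
    """Return n_points (lat, lon) pairs in degrees, spread roughly evenly on a sphere.

    `phase` is a longitude offset in radians. When it is omitted, a random one
    is drawn so that two runs do not return the same locations (an assumed way
    to randomize the lattice, not necessarily what the repo does).
    """
    if rng is None:
        rng = random.Random()
    if phase is None:
        phase = rng.uniform(0.0, 2.0 * math.pi)
    golden_angle = math.pi * (3.0 - math.sqrt(5.0))
    points = []
    for i in range(n_points):
        # z is spaced uniformly in (-1, 1), which gives equal-area latitude bands.
        z = 1.0 - (2.0 * i + 1.0) / n_points
        lat = math.degrees(math.asin(z))
        # Advance longitude by the golden angle, shift by the phase, wrap to [-180, 180).
        lon = math.degrees((i * golden_angle + phase) % (2.0 * math.pi)) - 180.0
        points.append((lat, lon))
    return points


if __name__ == "__main__":
    for lat, lon in fibonacci_sphere_latlon(5, rng=random.Random(0)):
        print(f"{lat:8.3f} {lon:9.3f}")
```

The candidates come back in the repo's (latitude, longitude) order. Land filtering and encoder queries would follow and are not shown here.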
For temporal products, the generator can emit YYYY.pt files, one per year.
Inside this repo, the standard coordinate input format is always:
(latitude, longitude)
Saved dataset files include both explicit coordinate layouts:
- coordinates and coordinates_latlon: (lat, lon)
- coordinates_lonlat: (lon, lat)
- separate latitude and longitude tensors/vectors
For .pt outputs, the metadata block records these conventions explicitly.
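As a rough illustration of how these fields relate, the sketch below derives all of them from a list of (lat, lon) pairs. It uses plain Python lists rather than tensors, and `build_coordinate_fields` is a hypothetical helper, not a function from this repository.

```python
def build_coordinate_fields(points_latlon):
    """Build the coordinate fields described above from (lat, lon) pairs."""
    latlon = [(lat, lon) for lat, lon in points_latlon]
    lonlat = [(lon, lat) for lat, lon in points_latlon]
    return {
        "latitude": [lat for lat, _ in points_latlon],
        "longitude": [lon for _, lon in points_latlon],
        "coordinates": latlon,  # same layout as coordinates_latlon
        "coordinates_latlon": latlon,
        "coordinates_lonlat": lonlat,
    }


if __name__ == "__main__":
    fields = build_coordinate_fields([(40.7128, -74.0060), (34.0522, -118.2437)])
    print(fields["coordinates_latlon"][0])
    print(fields["coordinates_lonlat"][0])
```

Keeping both layouts side by side lets downstream code pick the order its library expects without guessing.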
- Python 3.10+
- pip
- optional CUDA GPU for faster geoclip/satclip
git clone https://github.com/crp94/geospatial_embeddings_wrapper.git
cd geospatial_embeddings_wrapper
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
satclip/ is vendored in this repository already. You do not need to clone it separately.
scripts/get_embeddings.py currently supports:
- geoclip
- satclip
Example:
python scripts/get_embeddings.py \
--lat 40.7128 34.0522 \
--lon -74.0060 -118.2437 \
--encoders geoclip satclip \
--output embeddings.npz
scripts/generate_dataset.py supports:
- geoclip
- satclip
- copernicus_embed
- tessera
- google_satellite_embedding
Notes:
- geoclip and satclip are model-backed encoders
- copernicus_embed is TorchGeo-backed and auto-downloads its raster
- tessera can run through geotessera without a local root
- google_satellite_embedding can run against the public AEF annual index without a local root
- land-only behavior is enforced by this generator layer, not by every source dataset
Generate a static 100k land-only dataset with the two model-backed encoders:
python scripts/generate_dataset.py \
--n_points 100000 \
--encoders geoclip satclip \
--device cuda \
--output_path outputs/example_static
Generate a 2024 land-only dataset for the 3 temporal/raster products:
python scripts/generate_dataset.py \
--n_points 100000 \
--encoders copernicus_embed tessera google_satellite_embedding \
--years 2024 \
--output_path outputs/example_2024
Generate the full 5-product set as separate runs:
python scripts/generate_dataset.py --n_points 500000 --encoders geoclip --device cuda --output_path outputs/land_only_500k/geoclip_land_500k
python scripts/generate_dataset.py --n_points 500000 --encoders satclip --device cuda --output_path outputs/land_only_500k/satclip_land_500k
python scripts/generate_dataset.py --n_points 500000 --encoders copernicus_embed --output_path outputs/land_only_500k/copernicus_land_500k
python scripts/generate_dataset.py --n_points 500000 --encoders tessera --years 2024 --output_path outputs/land_only_500k/tessera_land_500k
python scripts/generate_dataset.py --n_points 500000 --encoders google_satellite_embedding --years 2024 --output_path outputs/land_only_500k/google_satellite_embedding_land_500k
Generate CSV output without plots:
python scripts/generate_dataset.py \
--n_points 50000 \
--encoders geoclip \
--output_format csv \
--no_plot \
--output_path outputs/geoclip_csv
Each saved dataset contains:
- metadata
- latitude
- longitude
- coordinates
- coordinates_latlon
- coordinates_lonlat
- one *_embeddings tensor per encoder
Example keys:
[
"metadata",
"latitude",
"longitude",
"coordinates",
"coordinates_latlon",
"coordinates_lonlat",
"geoclip_embeddings",
]
The metadata block includes:
- selected encoders
- year
- number of points
- coordinate order declarations
- encoder-specific metadata such as embedding dimension and available years
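A loaded dataset can be sanity-checked against the key list above with a few lines of Python. The sketch below assumes the dataset is a dict-like object, for example one returned by `torch.load` on a .pt file (an assumption about how the files are read; only the key names come from this README). `check_dataset` is a hypothetical helper.

```python
def check_dataset(dataset, encoders):
    """Return the number of rows, or raise ValueError if the layout looks wrong."""
    required = [
        "metadata",
        "latitude",
        "longitude",
        "coordinates",
        "coordinates_latlon",
        "coordinates_lonlat",
    ]
    required += [f"{name}_embeddings" for name in encoders]
    missing = [key for key in required if key not in dataset]
    if missing:
        raise ValueError(f"missing keys: {missing}")
    n_rows = len(dataset["latitude"])
    # Every per-point field should have one entry per coordinate.
    for key in required[1:]:
        if len(dataset[key]) != n_rows:
            raise ValueError(f"{key} has {len(dataset[key])} rows, expected {n_rows}")
    return n_rows


if __name__ == "__main__":
    demo = {
        "metadata": {"encoders": ["geoclip"]},
        "latitude": [40.7128],
        "longitude": [-74.0060],
        "coordinates": [(40.7128, -74.0060)],
        "coordinates_latlon": [(40.7128, -74.0060)],
        "coordinates_lonlat": [(-74.0060, 40.7128)],
        "geoclip_embeddings": [[0.0, 0.0, 0.0]],
    }
    print(check_dataset(demo, ["geoclip"]))
```

The check only uses `len` and `in`, so it should work for lists and for tensors whose first dimension indexes points.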
When plotting is enabled, the generator writes:
- *_locations.png: sampled land coordinates
- *_<encoder>_ica.png: embeddings projected to RGB with ICA
The ICA fit is done on a capped subsample and transformed in batches, so large outputs remain tractable.
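The fit-on-a-subsample, transform-in-batches pattern can be sketched without any particular ICA library. In the example below, all names and default sizes are hypothetical, and `fit_centering` is a deliberately trivial stand-in for the ICA fit (it only subtracts column means); in practice an ICA implementation such as scikit-learn's `FastICA` would presumably take its place.

```python
import random


def fit_on_subsample_transform_in_batches(rows, fit, max_fit_rows=10000,
                                          batch_size=4096, seed=0):
    """Fit on at most max_fit_rows random rows, then transform all rows in batches.

    `fit` takes a list of rows and returns a callable that maps a list of rows
    to a list of transformed rows.
    """
    rng = random.Random(seed)
    if len(rows) > max_fit_rows:
        fit_rows = rng.sample(rows, max_fit_rows)  # capped subsample for the fit
    else:
        fit_rows = list(rows)
    transform = fit(fit_rows)
    out = []
    # Transform in fixed-size slices so only one batch is processed at a time.
    for start in range(0, len(rows), batch_size):
        out.extend(transform(rows[start:start + batch_size]))
    return out


def fit_centering(fit_rows):
    """Stand-in for an ICA fit: learns column means and returns a centering transform."""
    n_cols = len(fit_rows[0])
    means = [sum(row[c] for row in fit_rows) / len(fit_rows) for c in range(n_cols)]
    return lambda batch: [[v - m for v, m in zip(row, means)] for row in batch]


if __name__ == "__main__":
    data = [[float(i), 2.0 * i] for i in range(100)]
    result = fit_on_subsample_transform_in_batches(data, fit_centering,
                                                   max_fit_rows=10, batch_size=32)
    print(len(result))
```

The cost of the fit is bounded by the cap, while the transform cost grows linearly with the number of points and never needs more than one batch in flight.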
The following products are temporal in this repo:
- tessera
- google_satellite_embedding
copernicus_embed is treated as a fixed annual product with reference year 2021.
If you pass --years, the generator creates one file per requested year:
python scripts/generate_dataset.py \
--n_points 100000 \
--encoders tessera google_satellite_embedding \
--years 2023 2024 \
--output_path outputs/temporal_pair
This produces:
- outputs/temporal_pair_2023.pt
- outputs/temporal_pair_2024.pt
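The naming pattern in this example, where the year is appended to the output path before the extension, can be reproduced with a small helper when locating files afterwards. `yearly_output_paths` is a hypothetical function inferred from the two file names above, not part of the generator's API.

```python
def yearly_output_paths(output_path, years, extension="pt"):
    """Map each year to '<output_path>_<year>.<extension>'."""
    return {year: f"{output_path}_{year}.{extension}" for year in years}


if __name__ == "__main__":
    for year, path in yearly_output_paths("outputs/temporal_pair", [2023, 2024]).items():
        print(year, path)
```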
The shared encoder contract is defined in wrappers/embedding_encoder.py.
The main implementation split is:
- wrappers/geoclip_encoder.py
- wrappers/satclip_encoder.py
- wrappers/torchgeo_encoders.py
- wrappers/registry.py
Canonical encoder names and aliases are centralized in wrappers/registry.py.
Run the test suite with:
./.venv/bin/python -m unittest discover -s tests -v
Quick syntax check:
python3 -m py_compile scripts/generate_dataset.py wrappers/torchgeo_encoders.py tests/test_generate_dataset.py
- Running the 5 encoders separately with the same --n_points does not produce the same coordinates, because the Fibonacci sampler is now randomized per run.
- If you need the exact same coordinates across multiple encoder outputs, add a fixed coordinate export/reuse workflow instead of relying on repeated sampling.
- geoclip and satclip are much faster than the large raster-backed products.
- tessera and google_satellite_embedding may need substantial network and disk activity on first use.
- google_satellite_embedding uses the public AEF annual index and remote GeoTIFF access.
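One possible shape for the coordinate export/reuse workflow mentioned above is a plain CSV round trip. The helper names and the `latitude`/`longitude` column headers below are assumptions, and this README does not document a generator option for reading coordinates from a file, so wiring the loaded points into an encoder run is left out.

```python
import csv


def save_coordinates(path, points_latlon):
    """Write (lat, lon) pairs to a CSV file with a header row."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["latitude", "longitude"])
        writer.writerows(points_latlon)


def load_coordinates(path):
    """Read (lat, lon) pairs back in the repo's standard (latitude, longitude) order."""
    with open(path, newline="") as f:
        return [(float(row["latitude"]), float(row["longitude"]))
                for row in csv.DictReader(f)]


if __name__ == "__main__":
    points = [(40.7128, -74.0060), (34.0522, -118.2437)]
    save_coordinates("coordinates.csv", points)
    print(load_coordinates("coordinates.csv") == points)
```

Saving the sampled points once and feeding the same list to each encoder would keep rows aligned across outputs, which repeated randomized sampling cannot guarantee.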
Licensing is not uniform across the supported products.
- geoclip package: MIT
- satclip: MIT
- tessera embeddings in TorchGeo: CC0-1.0
- copernicus_embed: CC-BY-4.0
- google_satellite_embedding: CC-BY-4.0
For the CC-BY products, attribution is required.
geospatial_embeddings_wrapper/
├── images/
├── outputs/
├── satclip/
├── scripts/
│ ├── generate_dataset.py
│ └── get_embeddings.py
├── tests/
├── wrappers/
│ ├── embedding_encoder.py
│ ├── geoclip_encoder.py
│ ├── satclip_encoder.py
│ ├── torchgeo_encoders.py
│ └── registry.py
├── README.md
└── requirements.txt
