Merged
31 commits
dd89d96
reformat chip_label
forrestfwilliams Oct 10, 2025
a19881b
cleanup
forrestfwilliams Oct 10, 2025
58a6a1c
finish labels
forrestfwilliams Oct 13, 2025
44426b7
get s2 working
forrestfwilliams Oct 13, 2025
a2c8439
fix hls
forrestfwilliams Oct 13, 2025
8c2f275
update s1rtc formatting
forrestfwilliams Oct 14, 2025
536bc07
scratch -> image
forrestfwilliams Oct 14, 2025
600e874
update changelog
forrestfwilliams Oct 14, 2025
198e3b7
fix ruff
forrestfwilliams Oct 14, 2025
4d74cb9
update readme
forrestfwilliams Oct 14, 2025
1ac9f43
update viewer
forrestfwilliams Oct 14, 2025
ff78341
fix rtc band ordering
forrestfwilliams Oct 14, 2025
978265e
update readme
forrestfwilliams Oct 14, 2025
ae7f1a9
remove unused file, update test
forrestfwilliams Oct 15, 2025
6a9ab1d
output -> chip
forrestfwilliams Oct 15, 2025
0a24d34
improve pass of rtc bands
forrestfwilliams Oct 15, 2025
3561bfb
further improvements
forrestfwilliams Oct 15, 2025
0a3f538
Update src/satchip/chip_view.py
forrestfwilliams Oct 15, 2025
87320a5
Update src/satchip/chip_data.py
forrestfwilliams Oct 15, 2025
887c2fa
solidify integration test
forrestfwilliams Oct 15, 2025
5b3147d
change dir name
forrestfwilliams Oct 15, 2025
44f07e3
fix ruff
forrestfwilliams Oct 15, 2025
10b8326
Update src/satchip/chip_data.py
forrestfwilliams Oct 15, 2025
40bf1ea
update gitignore
forrestfwilliams Oct 15, 2025
6931ec8
edits from review
forrestfwilliams Oct 15, 2025
30bf7eb
Update image_set variable names and change to use TypedDict
williamh890 Oct 15, 2025
b3b09cf
Remove duplicat chip_paths array
williamh890 Oct 15, 2025
87849ea
Merge pull request #55 from ASFHyP3/wds-update
forrestfwilliams Oct 16, 2025
63ef4e2
update integration
forrestfwilliams Oct 16, 2025
05207d6
fix duplicate bug
forrestfwilliams Oct 16, 2025
f264201
Merge pull request #54 from ASFHyP3/webdataset
forrestfwilliams Oct 16, 2025
1 change: 1 addition & 0 deletions .gitignore
Expand Up @@ -244,3 +244,4 @@ tags
# Data
*.zarr.zip
*tif*
integration_test/
5 changes: 5 additions & 0 deletions CHANGELOG.md
Expand Up @@ -6,6 +6,11 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [PEP 440](https://www.python.org/dev/peps/pep-0440/)
and uses [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## [0.5.0]

### Changed
* Format and layout of chips to more closely match the TerraMesh dataset.

## [0.4.0]

### Changed
28 changes: 14 additions & 14 deletions README.md
Expand Up @@ -6,38 +6,38 @@ A package for satellite image AI data prep. This package "chips" data labels and
`SatChip` relies on a two-step process: chip your label training data inputs, then create corresponding chips for different remote sensing data sources.

### Step 1: Chip labels
The `chiplabel` CLI tool takes a GDAL-compatible image, a collection date, and an optional output directory as input using the following format:
The `chiplabel` CLI tool takes a GDAL-compatible image, a collection date, and an optional chip directory as input using the following format:

```bash
chiplabel PATH/TO/LABELS.tif DATE(UTC FORMAT) --outdir OUTPUT_DIR
chiplabel PATH/TO/LABELS.tif DATE(UTC FORMAT) --chipdir CHIP_DIR
```
For example:
```bash
chiplabel LA_damage_20250113_v0.tif 2024-01-01T01:01:01 --outdir chips
chiplabel LA_damage_20250113_v0.tif 2024-01-01T01:01:01 --chipdir chips
```
This will produce an output zipped Zarr store label dataset with the name `{LABELS}.zarr.zip` in the specified output directory (`--outdir`). This file will be the input to the remote sensing data chipping step.
This will produce a zipped Zarr store label dataset named `{LABEL}_{SAMPLE}.zarr.zip` (see the [Tiling Schema](#tiling_schema) section for details on the `SAMPLE` name) in the `LABEL` directory within the specified chip directory (`--chipdir`). This file will be the input to the remote sensing data chipping step.

For more information on usage see `chiplabel --help`

### Step 2: Chip remote sensing data
The `chipdata` CLI tool takes a label zipped Zarr store, a dataset name, a date range and a set of optional parameters using the following format:
The `chipdata` CLI tool takes a path to a directory containing chip labels, a dataset name, a date range and a set of optional parameters using the following format:
```bash
chipdata PATH/TO/LABELS.zarr.zip DATASET Ymd-Ymd \
chipdata PATH/TO/LABEL DATASET Ymd-Ymd \
--maxcloudpct MAX_CLOUD_PCT --strategy STRATEGY \
--outdir OUTPUT_DIR --scratchdir SCRATCH_DIR
--chipdir CHIP_DIR --imagedir IMAGE_DIR
```
For example:
```bash
chipdata LA_damage_20250113_v0.zarr.zip S2L2A 20250112-20250212 --maxcloudpct 20 --outdir chips --scratchdir images
chipdata LABEL S2L2A 20250112-20250212 --maxcloudpct 20 --chipdir CHIP_DIR --imagedir IMAGES
```
Similarly to step 1, this will produce an output zipped Zarr store that contains chipped data for your chosen dataset with the name `{LABELS}_{DATASET}.zarr.zip`. The arguments are as follows:
- `PATH/TO/LABELS.zarr.zip`: the path to your training lables.
Similarly to step 1, this will produce an output zipped Zarr store that contains chipped data for your chosen dataset with the name `{LABEL}_{SAMPLE}_{DATASET}.zarr.zip`. The arguments are as follows:
- `PATH/TO/LABEL`: the path to your training labels
- `DATASET`: The satellite imagery dataset you would like to create labels for. See the list below for all current options.
- `Ymd-Ymd`: The date range to select imagery from. For example, `20250112-20250212` selects imagery between January 12 and February 12, 2025.
- `MAX_CLOUD_PCT`: For optical data, this optional parameter lets you set the maximum amount of cloud coverage allowed in a chip. Values between 0 and 100 are allowed. Cloud coverage is calculated on a per-chip basis. The default is 100 i.e., no limit.
- `STRATEGY`: Lets you select which data inside your date range will be used to create chips. Specifying `BEST` (the default) will create a chip for the image closest to the beginning of your date range that has at least 95% spatial coverage. Specifying `ALL` will create chips for all images within your date range that have at least 95% spatial coverage.
- `OUTPUT_DIR`: Specifies the directory where the image chips will be saved. If not specified, this defaults to your current directory.
- `SCRATCH_DIR`: Specifies the directory where the full-size satellite images will be downloaded to. If this argument is not provided, the images will be stored in a scratch directory that will be deleted when the `chipdata` call finishes.
- `CHIP_DIR`: Specifies the directory where the image chips will be saved. If not specified, this defaults to your current directory.
- `IMAGE_DIR`: Specifies the directory where the full-size satellite images will be downloaded to. If this argument is not provided, the images will be stored in the `IMAGES` directory within `CHIP_DIR`.
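The `Ymd-Ymd` and `MAX_CLOUD_PCT` arguments described above can be validated before any data is requested. The following is a minimal sketch of that validation, not the package's own code: the helper names `parse_date_range` and `check_max_cloud_pct` are hypothetical, and raising `ValueError` is an assumption made for illustration (the `main` function in this PR uses `assert` statements instead).

```python
from datetime import datetime
from typing import Tuple


def parse_date_range(date_range: str) -> Tuple[datetime, datetime]:
    """Parse a 'Ymd-Ymd' string such as '20250112-20250212' (hypothetical helper)."""
    parts = date_range.split('-')
    if len(parts) != 2:
        raise ValueError(f'Expected a Ymd-Ymd date range, got {date_range!r}')
    start, end = (datetime.strptime(part, '%Y%m%d') for part in parts)
    # Mirror the check in main(): the start date must come before the end date
    if start >= end:
        raise ValueError('start date must be before end date')
    return start, end


def check_max_cloud_pct(value: int) -> int:
    """Ensure the cloud cover limit is a percentage between 0 and 100 (hypothetical helper)."""
    if not 0 <= value <= 100:
        raise ValueError('maxcloudpct must be between 0 and 100')
    return value


start, end = parse_date_range('20250112-20250212')
print(start.date(), end.date())
```

Raising an exception rather than asserting keeps the checks active when Python runs with optimizations enabled, which is one reason a caller might prefer this form.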

Currently supported datasets include:
- `S2L2A`: Sentinel-2 L2A data sourced from the [Sentinel-2 AWS Open Data Archive](https://registry.opendata.aws/sentinel-2/)
Expand Down Expand Up @@ -65,9 +65,9 @@ For instance, the bottom-left subgrid of MajorTOM tile `434U_876L` is named `434
## Viewing Chips
Assessing chips after their creation can be challenging due to the large number of small images created. To address this issue, SatChip includes a `chipview` CLI tool that uses Matplotlib to quickly visualize the data included within the created zipped Zarr stores:
```bash
chipview PATH/TO/CHIPS.zarr.zip BAND --idx IDX
chipview PATH/TO/CHIP.zarr.zip --band BAND
```
Where `PATH/TO/CHIPS.zarr.zip` is the path to the chip file (labels or image data), `BAND` is the name of the band you would like to view, and `IDX` is an optional integer index of which dataset you would like to initially view.
Where `PATH/TO/CHIP.zarr.zip` is the path to the chip file (labels or image data), and `BAND` is the optional name of the band you would like to view. If no band is specified, an OPERA-style RGB decomposition will be shown for RTC data, and an RGB composite will be shown for optical data.

## License
`SatChip` is licensed under the BSD-3-Clause open source license. See the LICENSE file for more details.
2 changes: 2 additions & 0 deletions pyproject.toml
Expand Up @@ -63,6 +63,8 @@ chipview = "satchip.chip_view:main"
[tool.pytest.ini_options]
testpaths = ["tests"]
script_launch_mode = "subprocess"
addopts = '-ra -q -m "not integration"'
markers = ["integration"]

[tool.setuptools]
include-package-data = true
22 changes: 0 additions & 22 deletions scripts/open_chips.py

This file was deleted.

121 changes: 64 additions & 57 deletions src/satchip/chip_data.py
@@ -1,18 +1,18 @@
import argparse
from collections import Counter
from datetime import datetime
from pathlib import Path
from tempfile import TemporaryDirectory

import numpy as np
import xarray as xr
from shapely.geometry import box
from tqdm import tqdm

import satchip
from satchip import utils
from satchip.chip_hls import get_hls_data
from satchip.chip_sentinel1rtc import get_rtc_paths_for_chips, get_s1rtc_chip_data
from satchip.chip_sentinel2 import get_s2l2a_data
from satchip.terra_mind_grid import TerraMindGrid
from satchip.terra_mind_grid import TerraMindChip, TerraMindGrid


def fill_missing_times(data_chip: xr.DataArray, times: np.ndarray) -> xr.DataArray:
Expand All @@ -31,65 +31,80 @@ def fill_missing_times(data_chip: xr.DataArray, times: np.ndarray) -> xr.DataArr
return xr.concat([data_chip, missing_data], dim='time').sortby('time')


def get_chip(label_path: Path) -> TerraMindChip:
    label_dataset = utils.load_chip(label_path)
    buffered = box(*label_dataset.bounds).buffer(0.1).bounds
    grid = TerraMindGrid([buffered[1], buffered[3]], [buffered[0], buffered[2]])  # type: ignore
    label_chip_name = label_dataset.sample.item()
    chip = [c for c in grid.terra_mind_chips if c.name == label_chip_name]
    assert len(chip) == 1, f'No TerraMind chip found for label {label_chip_name}'
    return chip[0]
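
The buffering step in `get_chip` can be pictured with plain arithmetic. The sketch below is an illustration only: it assumes the label bounds are ordered `(minx, miny, maxx, maxy)`, which is the order `shapely.geometry.box` expects, and it replaces the shapely `buffer` call with direct addition and subtraction, which gives the same bounding box for a rectangle. The helper name `buffered_lat_lon_ranges` and the sample bounds are hypothetical.

```python
from typing import List, Sequence, Tuple


def buffered_lat_lon_ranges(
    bounds: Sequence[float], margin: float = 0.1
) -> Tuple[List[float], List[float]]:
    """Expand (minx, miny, maxx, maxy) bounds and split them into lat/lon ranges.

    Hypothetical helper: the return order (latitude range first, then longitude
    range) mirrors the argument order used in the TerraMindGrid call above.
    """
    minx, miny, maxx, maxy = bounds
    lat_range = [miny - margin, maxy + margin]
    lon_range = [minx - margin, maxx + margin]
    return lat_range, lon_range


# Made-up bounds for illustration
lat_range, lon_range = buffered_lat_lon_ranges((-118.5, 34.0, -118.0, 34.5))
print(lat_range, lon_range)
```

The margin makes the search grid slightly larger than the label footprint, so the chip named in the label file is still found when its footprint sits on the edge of the bounds.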


def chip_data(
label_path: Path,
chip: TerraMindChip,
platform: str,
opts: utils.ChipDataOpts,
image_dir: Path,
) -> xr.Dataset:
    if platform == 'S1RTC':
        rtc_paths = opts['local_hyp3_paths'][chip.name]
        chip_dataset = get_s1rtc_chip_data(chip, rtc_paths)
    elif platform == 'S2L2A':
        chip_dataset = get_s2l2a_data(chip, image_dir, opts=opts)
    elif platform == 'HLS':
        chip_dataset = get_hls_data(chip, image_dir, opts=opts)
    else:
        raise Exception(f'Unknown platform {platform}')

    return chip_dataset


def create_chips(
label_paths: list[Path],
platform: str,
date_start: datetime,
date_end: datetime,
strategy: str,
max_cloud_pct: int,
output_dir: Path,
scratch_dir: Path,
) -> xr.Dataset:
labels = utils.load_chip(label_path)
date = labels.time.data[0].astype('M8[ms]').astype(datetime)
bounds = labels.attrs['bounds']

grid = TerraMindGrid([bounds[1] - 1, bounds[3] + 1], [bounds[0] - 1, bounds[2] + 1]) # type: ignore
terra_mind_chips = [c for c in grid.terra_mind_chips if c.name in list(labels.sample.data)]
chip_dir: Path,
image_dir: Path,
) -> list[Path]:
platform_dir = chip_dir / platform
platform_dir.mkdir(parents=True, exist_ok=True)

opts: utils.ChipDataOpts = {'strategy': strategy, 'date_start': date_start, 'date_end': date_end}
if platform in ['S2L2A', 'HLS']:
opts['max_cloud_pct'] = max_cloud_pct

chips = [get_chip(p) for p in label_paths]
chip_names = [c.name for c in chips]
if len(chip_names) != len(set(chip_names)):
duplicates = [name for name, count in Counter(chip_names).items() if count > 1]
msg = f'Duplicate sample locations not supported. Duplicate chips: {", ".join(duplicates)}'
raise NotImplementedError(msg)
chip_paths = [
platform_dir / (x.with_suffix('').with_suffix('').name + f'_{platform}.zarr.zip') for x in label_paths
]
if platform == 'S1RTC':
rtc_paths_for_chips = get_rtc_paths_for_chips(terra_mind_chips, bounds, scratch_dir, opts)

data_chips = []
for chip in tqdm(terra_mind_chips):
if platform == 'S1RTC':
rtc_paths = rtc_paths_for_chips[chip.name]
chip_data = get_s1rtc_chip_data(chip, rtc_paths, scratch_dir, opts=opts)
elif platform == 'S2L2A':
chip_data = get_s2l2a_data(chip, scratch_dir, opts=opts)
elif platform == 'HLS':
chip_data = get_hls_data(chip, scratch_dir, opts=opts)
else:
raise Exception(f'Unknown platform {platform}')

data_chips.append(chip_data)

times = np.unique(np.concatenate([dc.time.data for dc in data_chips]))
for i, data_chip in enumerate(data_chips):
if len(data_chip.time) < len(times):
data_chips[i] = fill_missing_times(data_chip, times)
attrs = {'date_created': date.isoformat(), 'satchip_version': satchip.__version__, 'bounds': labels.attrs['bounds']}
dataset = xr.Dataset(attrs=attrs)
dataset['data'] = xr.combine_by_coords(data_chips, join='override')
output_path = output_dir / (label_path.with_suffix('').with_suffix('').name + f'_{platform}.zarr.zip')
utils.save_chip(dataset, output_path)
return labels
rtc_paths_for_chips = get_rtc_paths_for_chips(chips, image_dir, opts)
opts['local_hyp3_paths'] = rtc_paths_for_chips

for chip, chip_path in tqdm(zip(chips, chip_paths), desc='Chipping labels'):
dataset = chip_data(chip, platform, opts, image_dir)
utils.save_chip(dataset, chip_path)
return chip_paths
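
Two pieces of bookkeeping in `create_chips` are easy to check in isolation: detecting duplicate chip names with `Counter`, and deriving the output file name by stripping both the `.zip` and `.zarr` suffixes. The sketch below shows both under stated assumptions; the helper names `find_duplicates` and `data_chip_path`, and the sample file names, are hypothetical and not part of the package.

```python
from collections import Counter
from pathlib import Path
from typing import List


def find_duplicates(names: List[str]) -> List[str]:
    """Return each name that appears more than once, in first-seen order."""
    return [name for name, count in Counter(names).items() if count > 1]


def data_chip_path(label_path: Path, platform_dir: Path, platform: str) -> Path:
    """Build the data chip path for a label file (hypothetical helper)."""
    # 'name.zarr.zip' carries two suffixes, so with_suffix('') is applied twice
    stem = label_path.with_suffix('').with_suffix('').name
    return platform_dir / f'{stem}_{platform}.zarr.zip'


# Made-up names for illustration
print(find_duplicates(['a', 'b', 'a', 'c', 'c']))
print(data_chip_path(Path('LABEL/labels_sampleA.zarr.zip'), Path('chips/S2L2A'), 'S2L2A'))
```

Checking for duplicates before any download starts means the `NotImplementedError` is raised early, rather than after imagery has already been fetched.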


def main() -> None:
parser = argparse.ArgumentParser(description='Chip a label image')
parser.add_argument('labelpath', type=Path, help='Path to the label image')
parser.add_argument('labelpath', type=Path, help='Path to the label directory')
parser.add_argument('platform', choices=['S2L2A', 'S1RTC', 'HLS'], type=str, help='Dataset to create chips for')
parser.add_argument('daterange', type=str, help='Inclusive date range to search for data in the format Ymd-Ymd')
parser.add_argument('--maxcloudpct', default=100, type=int, help='Maximum percent cloud cover for a data chip')
parser.add_argument('--outdir', default='.', type=Path, help='Output directory for the chips')
parser.add_argument('--chipdir', default='.', type=Path, help='Output directory for the chips')
parser.add_argument(
'--scratchdir', default=None, type=Path, help='Output directory for scratch files if you want to keep them'
'--imagedir', default=None, type=Path, help='Output directory for image files. Defaults to chipdir/IMAGES'
)
parser.add_argument(
'--strategy',
Expand All @@ -103,23 +118,15 @@ def main() -> None:
assert 0 <= args.maxcloudpct <= 100, 'maxcloudpct must be between 0 and 100'
date_start, date_end = [datetime.strptime(d, '%Y%m%d') for d in args.daterange.split('-')]
assert date_start < date_end, 'start date must be before end date'
label_paths = list(args.labelpath.glob('*.zarr.zip'))
assert len(label_paths) > 0, f'No label files found in {args.labelpath}'

params = (
args.labelpath,
args.platform,
date_start,
date_end,
args.strategy,
args.maxcloudpct,
args.outdir,
)
if args.imagedir is None:
args.imagedir = args.chipdir / 'IMAGES'

if args.scratchdir is not None:
chip_data(*params, args.scratchdir)
else:
with TemporaryDirectory() as tmp_dir:
scratch_dir = Path(tmp_dir)
chip_data(*params, scratch_dir)
create_chips(
label_paths, args.platform, date_start, date_end, args.strategy, args.maxcloudpct, args.chipdir, args.imagedir
)
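
The default for `--imagedir` shown above replaces the earlier temporary scratch directory, so downloaded images now persist between runs. A minimal sketch of that fallback, assuming a hypothetical helper named `resolve_image_dir`:

```python
from pathlib import Path
from typing import Optional


def resolve_image_dir(chip_dir: Path, image_dir: Optional[Path] = None) -> Path:
    """Return the image directory, falling back to chip_dir / 'IMAGES' (hypothetical helper)."""
    return chip_dir / 'IMAGES' if image_dir is None else image_dir


print(resolve_image_dir(Path('chips')))
print(resolve_image_dir(Path('chips'), Path('images')))
```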


if __name__ == '__main__':