Adapt model to load 512x512 images from s3 bucket #85
Conversation
Increase the chip image size from 256 to 512 pixels, and the patch size from 32 to 64 pixels. Updated the unit test and an assert statement, and fixed a typo.
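Quick arithmetic check (not from the PR, just to make the effect of the change explicit): the patch grid, and hence the number of patches per chip, is unchanged.

```python
# Both settings tile a chip into an 8x8 grid, i.e. 64 patches per chip,
# so only the per-patch pixel footprint changes (32x32 -> 64x64).
for image_size, patch_size in [(256, 32), (512, 64)]:
    grid = image_size // patch_size  # 8 in both cases
    print(f"{image_size=} {patch_size=} -> {grid * grid} patches")
```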
New feature to allow passing in a URL to an s3 bucket, and loading the GeoTIFF data from there directly. Added a unit test that checks that this works to list a GeoTIFF file from s3://copernicus-dem-30m/. Also improved the docstring and type hint of the setup() function's 'stage' parameter.
Need to do this so that the data loading is distributed to the workers, otherwise each worker is doing duplicated work. Also set num_workers to 1 in test_geotiffdatapipemodule to get a consistent result.
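A minimal sketch of where such a sharding filter sits in a torchdata pipeline; the surrounding steps are illustrative, not the PR's exact code:

```python
import torchdata

# List GeoTIFF paths, then shard *before* the expensive GeoTIFF -> torch.Tensor
# decoding step, so each DataLoader worker processes a distinct subset of paths
# instead of duplicating work.
dp_paths = torchdata.datapipes.iter.FileLister(root="data/", masks="*.tif")
dp_paths = dp_paths.sharding_filter()  # split the paths across num_workers
# dp_tensors = dp_paths.map(fn=read_geotiff_to_tensor)  # hypothetical decode step
```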
Ensure that *.ckpt files in sub-folders are ignored too.
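For reference, one pattern that ignores checkpoints at any depth (a sketch of the idea, not necessarily the repo's exact .gitignore line):

```
**/*.ckpt
```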
Prevents messages like: "You are using a CUDA device ('NVIDIA A10G') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance."
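The corresponding one-liner, typically called once near the top of the training script (standard PyTorch API):

```python
import torch

# Trade a little float32 matmul precision for Tensor Core throughput.
torch.set_float32_matmul_precision("medium")
```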
Just casually documenting in the main README.md on how one can directly generate embeddings from GeoTIFF files stored in an s3 bucket instead of locally.
```python
# Step 1 - Get list of GeoTIFF filepaths from s3 bucket or data/ folder
if self.data_path.startswith("s3://"):
    dp = torchdata.datapipes.iter.IterableWrapper(iterable=[self.data_path])
    self.dp_paths = dp.list_files_by_s3(masks="*.tif")
else:  # if self.data_path is a local data path
    self.dp_paths = torchdata.datapipes.iter.FileLister(
        root=self.data_path, masks="*.tif", recursive=True
    )
```
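As a usage sketch, the s3 branch above can be exercised on its own against the public bucket used in the unit test; this assumes a torchdata build with the S3 plugin enabled:

```python
import torchdata

dp = torchdata.datapipes.iter.IterableWrapper(iterable=["s3://copernicus-dem-30m/"])
dp_paths = dp.list_files_by_s3(masks="*.tif")
print(next(iter(dp_paths)))  # URL of the first *.tif object found in the bucket
```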
Note that loading data from s3 is still a bit slower than a local data folder, but it can be useful for cases like one-off inference/prediction when we don't want to store large amounts of data locally.
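One mitigation (added in a follow-up commit, see the commit list below) is setting GDAL's COG-friendly configuration options before any files are opened:

```python
import os

# Reduce round-trips when GDAL reads Cloud-Optimized GeoTIFFs over s3/HTTP.
# See https://gdal.org/user/configoptions.html
os.environ["GDAL_DISABLE_READDIR_ON_OPEN"] = "EMPTY_DIR"
os.environ["GDAL_HTTP_MERGE_CONSECUTIVE_RANGES"] = "YES"
```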
Gonna merge this PR directly, as it consists of mostly minor tweaks (which I've accumulated over the past few weeks). Some of the options (e.g. patch_size) can be changed later, but thought it'd be good to have a datapipe/model that works with the 512x512 images soon-ish.
* ✨ Allow ClayDataModule to get GeoTIFF data from an s3 bucket. Allow passing in a URL to an s3 bucket, and loading the GeoTIFF data from there directly. Uses the same torchdata-based code for the s3 pathway as commit f288eb8 in #85.
* 🚚 Rename datacube's path key to source_url. Using the same 'source_url' key in the returned datacube dictionary for both ClayDataModule and GeoTIFFDataPipeModule.
* 🚑 Use try-except to get absolute chip_path or fallback to str. The getattr doesn't actually work properly, since we need to call chip_path.absolute() with brackets. Using a good ol' try-except statement instead, with the fallback being just the plain chip_path str (for s3 URLs); see the sketch after this list.
* ✨ Implement predict_dataloader for ClayDataModule. Similar to the train/val dataloaders, but shuffling and pin_memory are both disabled.
* ✅ Add parametrized test for checking ClayDataModule. Ensures that outputs of both ClayDataModule and GeoTIFFDataPipeModule are the same-ish. Needed to make the split_ratio in ClayDataModule configurable, and check sorted list outputs instead of unsorted outputs for determinism. Fixed some hardcoded tensor shapes/dtypes, and dictionary keys too. Removed the nan_to_num casting of the image pixels in ClayDataModule so that int16 dtype inputs are accepted.
* 📝 Edit docstrings in test_datamodule.py to be more generic. Not just testing one, but two different LightningDataModules now!
* 🔧 Add GDAL environment variables that might help with s3 loading. Setting GDAL_DISABLE_READDIR_ON_OPEN=EMPTY_DIR and GDAL_HTTP_MERGE_CONSECUTIVE_RANGES=YES is supposed to improve GDAL performance when reading Cloud-Optimized GeoTIFFs. See https://gdal.org/user/configoptions.html.
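A minimal sketch of the try-except fallback described in the 🚑 commit above; the helper name is made up for illustration:

```python
from pathlib import Path
from typing import Union

def to_source_url(chip_path: Union[Path, str]) -> str:
    """Return the absolute path of a local chip, or the plain str for s3 URLs."""
    try:
        # pathlib.Path.absolute() must be *called* with parentheses, which is
        # why a getattr with a default doesn't work here.
        return str(chip_path.absolute())
    except AttributeError:
        # Plain strings (e.g. "s3://bucket/key.tif") have no .absolute() method.
        return str(chip_path)
```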
* 🔧 Increase image_size from 256 to 512, patch_size from 32 to 64. Increase the chip image size from 256 to 512 pixels, and the patch size from 32 to 64 pixels. Updated the unit test and an assert statement, and fixed a typo.
* 👽 Get YYYY-MM-DD from GeoTIFF tag instead of filename. Obtaining the YYYY-MM-DD date from the GeoTIFF's tag metadata, instead of parsing it from the filename, thanks to the change at 426aa06/#72. See the sketch after this list.
* ✨ Allow GeoTIFFDataModule to get GeoTIFF data from an s3 bucket. New feature to allow passing in a URL to an s3 bucket, and loading the GeoTIFF data from there directly. Added a unit test that checks that this works to list a GeoTIFF file from s3://copernicus-dem-30m/. Also improved the docstring and type hint of the setup() function's 'stage' parameter.
* 🐛 Add sharding filter before loading GeoTIFF data to torch.Tensor. Need to do this so that the data loading is distributed to the workers, otherwise each worker is doing duplicated work. Also set num_workers to 1 in test_geotiffdatapipemodule to get a consistent result.
* 🙈 Gitignore checkpoints in nested folders. Ensure that *.ckpt files in sub-folders are ignored too.
* ⚡ Set float32 matmul precision to medium. Prevents messages like: "You are using a CUDA device ('NVIDIA A10G') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance."
* 📝 Mention in main README.md that data_path can be an s3 bucket. Just casually documenting in the main README.md on how one can directly generate embeddings from GeoTIFF files stored in an s3 bucket instead of locally.
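A hedged sketch of the tag-based date lookup from the 👽 commit, using rasterio; the filename and the "date" tag key are assumptions, not confirmed by the PR:

```python
import rasterio

with rasterio.open("chip.tif") as dataset:
    # Read the acquisition date from the GeoTIFF's metadata tags instead of
    # parsing it out of the filename. "date" is a hypothetical tag key.
    date: str = dataset.tags().get("date", "")  # e.g. "2023-07-01"
```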
```python
)
# Step 1 - Get list of GeoTIFF filepaths from s3 bucket or data/ folder
if self.data_path.startswith("s3://"):
    dp = torchdata.datapipes.iter.IterableWrapper(iterable=[self.data_path])
```
Noting that datapipes are deprecated, is this a long-term solution?
Modify the ViT MAE model to accept input images of size 512x512 pixels after #78. Also making a few small enhancements to the datapipe.
TODO: