DM-43712: Add configurable buffer for template and refcat preload #204
base: main
Conversation
This should have been together with 674a4b1. Unlike the old DECam test dataset, the new LSSTComCamSim test dataset does not have crosstalk, so use another dataset type for the test. Also fix a time stamp missed in the previous commit.
Many thanks for the well-organized changes! However, I'm a bit worried that they're eroding the (already shaky) coherence of `MiddlewareInterface`, specifically with the `region` state variable and the splitting of queries across multiple `_filter_datasets` calls. I tried to offer some specific suggestions, but I'd like to take another look at the final result.
@@ -272,6 +272,8 @@ class MiddlewareInterface:
    Information about which pipelines to run on ``visit``'s raws.
skymap: `str`
    Name of the skymap in the central repo for querying templates.
padding: `int`
    Number of arcseconds to pad the refcat and template region in preloading.
I'm a bit uncomfortable having this be part of the `MiddlewareInterface` API, which is already bloated with too many fiddly details. I'm looking into refactoring this class into something more object-oriented, but in the meantime it might make more sense to read `PRELOAD_PADDING` in `middleware_interface.py`.
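The suggestion above might look something like the following sketch, assuming the service configures itself through environment variables; the variable name `PRELOAD_PADDING` is the reviewer's suggestion, and the default value is an illustrative assumption:

```python
import os

# Hedged sketch: read the padding as a module-level constant in
# middleware_interface.py instead of threading it through the
# MiddlewareInterface constructor. The default of 30 arcseconds
# is an illustrative assumption, not the project's actual value.
PRELOAD_PADDING = int(os.environ.get("PRELOAD_PADDING", "30"))  # arcseconds
```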
Pre-existing issue with `skymap`, but this should have an extra space before `:`.
for dataset_type in graph.inputs_of("retrieveTemplate"):
    if dataset_type.endswith("Coadd"):
        template_types.add(dataset_type)
# For cases where the pipelines do not contain "retrieveTemplate"
I'd really like to avoid hardcoding assumptions about the pipeline (I think `parameters:apdb_config` is the only necessary evil). This is infringing on DM-40245 a little, but could you use `PipelineGraph.iter_overall_inputs` instead? If not, could you add a comment that this is temporary until DM-40245?
# For cases where the pipelines do not contain "retrieveTemplate"
except KeyError:
    pass
return list(template_types)
Why does it need to be a sequence? I would expect any collection to work, and a set seems the natural choice of the "standard" collection types. Does order matter?
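A set-returning version of the helper under discussion could be sketched like this (function and argument names are illustrative, not the PR's actual identifiers):

```python
def find_template_types(input_dataset_types):
    """Collect the coadd dataset types among a pipeline's inputs.

    Returns a set rather than a list: order does not matter for
    preloading, and membership tests are the common operation.
    """
    return {name for name in input_dataset_types if name.endswith("Coadd")}
```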
center = wcs.pixelToSky(detector.getCenter(lsst.afw.cameraGeom.PIXELS))
corners = wcs.pixelToSky(detector.getCorners(lsst.afw.cameraGeom.PIXELS))
padded = [c.offset(center.bearingTo(c), self.padding) for c in corners]
self.region = lsst.sphgeom.ConvexPolygon.convexHull([c.getVector() for c in padded])
I really don't like the thought of `self.region` being undefined before preload, and then possibly existing afterward; that requires a lot of redundant state checks throughout the code. If we only use it during preload, do we actually need to make it a member of `self`, or can we just pass it around between functions?
If it does need to be part of the object's state, I'd recommend the following changes:
- Move the computation to `__init__` time instead of `prep_butler` time
- Move the actual assignment to `self.region` out of `_compute_region` (i.e., make the method have no side effects)
- Guarantee that `self.region` is `None` (instead of undefined) if and only if `_skip_spatial_preload`
    known_datasets = set()
except MissingDatasetTypeError as e:
    _log.debug("Pre-export query with args '%s' failed with %s", formatted_args, e)
    known_datasets = set()
"# If dataset type never registered locally, then any such datasets are missing."?
# Let exceptions from src_repo query raise: if it fails, that invalidates
# this operation.
# "expanded" dimension records are ignored for DataCoordinate equality
# comparison, so we only need them on src_datasets.
with lsst.utils.timer.time_this(_log, msg=f"_filter_datasets({formatted_args}) (source datasets)",
                                level=logging.DEBUG):
    src_datasets = set(src_repo.registry.queryDatasets(*args, **kwargs).expanded())
# An empty query result is okay here: don't log an error, and let the
# downstream method decide what to do with empty results.
I don't understand this comment. `_MissingDatasetError` is raised on new line 1794.
More broadly, I'm worried that `_filter_datasets` has stopped being a useful abstraction, if you're having to duplicate iteration code outside the function. Something to look into in DM-46178?
.github/workflows/build-service.yml
@@ -34,7 +34,7 @@ env:
# relatively stable Pipelines containers that are needed to avoid issues with
# the "latest" version; they would remain in this list until "latest" becomes
# usable for all building and testing.
-BASE_TAG_LIST: '["latest"]'
+BASE_TAG_LIST: '["w_latest"]'
Is this for access to the new query API? I would think it makes more sense to update `latest`.
Indeed, I added it when `latest` was too old at the time (more than a month ago), but it's not needed right now. Thanks for the reminder that I should keep in mind the option of updating `latest` in this case.
indexer = HtmIndexer(depth=7)
-shard_ids, _ = indexer.getShardIds(center, radius+self.padding)
+shard_ids, _ = indexer.getShardIds(center, radius)
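The `radius+self.padding` term above enlarges the refcat cone search by the preload padding. A minimal sketch of that idea with plain floats, assuming the radius is in degrees and the padding in arcseconds (the real code uses the stack's angle types, and these names are illustrative):

```python
def padded_search_radius(radius_deg, padding_arcsec):
    """Enlarge a cone-search radius (degrees) by a padding given in arcseconds."""
    # 1 degree = 3600 arcseconds.
    return radius_deg + padding_arcsec / 3600.0
```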
With the new query system, can this be replaced with a query directly by region?
Not with w_2024_37, which is when the API first became public and when I started working on this, but yes with w_2024_38 or newer. So we can switch to a region query too. Thanks for motivating me to test again.
Are we bothered that the pipelines using refcats might not be using the butler to get the shards?
I'm not sure I understand. You mean a task might have an old-style loader and it might demand specific shards that weren't in the region?
> You mean a task might have an old-style loader and it might demand specific shards that weren't in the region?

Honestly, I don't know and haven't looked enough to understand how the loader decides what shards to use. I thought it was via `HtmIndexer`, like the current PP code. Also, for the few examples I tried manually, the two methods (`HtmIndexer` versus butler query) give identical results. I don't know whether they always do.
I would expect all modern tasks to get them from the butler; I think that was part of the Gen2->Gen3 transition.
The optimal value likely depends on the instrument and the skymap choice.
Instead of determining the instrument's wcsFlipX here, it is more robust to use its formatter, which knows the camera orientation from its obs package.
Force-pushed from f8b6ae7 to 6839a4b
This changes the APIs of _export_skymap_and_templates and _export_refcats to take a lsst.sphgeom.Region directly. Note that the centroid of the region is not the same as the detector center, but that should not matter. Because htm7 can be too coarse compared to the patch size, using htm7 indices to search for templates may lead to preloading more patches than necessary and wasting time. This use of htm7 to find overlapping templates is also about to be deprecated and replaced by the arbitrary spatial region query in Butler; the usage will be replaced when switching to the new butler query system.
This is a preparation step before switching to the new butler query system. In the new query system, query_datasets takes one dataset type at a time, while butler.registry.queryDatasets in the old system can take a list of dataset types. So, we change to query (via _filter_datasets) one type of calibs/refcats/templates at a time. Currently we may preload more types of calibs/refcats/templates than the pipelines really need. It's possible that some types are not preloaded, but that's okay. For now we allow _MissingDatasetError.
The new Butler query system supports spatial-constraint queries via lsst.sphgeom.Region directly. With this change, we use it in the template and refcat searches. This needs stack w_2024_38 or newer. make_export.py uses _filter_datasets, so it needs to adjust to the new underlying API too.
Some unit tests were temporarily marked expectedFailure in 674a4b1. Now that we have switched to the new query system, make them work again. The test repo was put together using middleware tools, which intrinsically use the butler repo's visit-detector regions with padding from the defineVisits config. That padding config is not the same as the preload region padding in prompt processing, which explains the patch differences in template selection.
Force-pushed from 6839a4b to 5653980