Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DM-43712: Add configurable buffer for template and refcat preload #204

Open
wants to merge 12 commits into
base: main
Choose a base branch
from

Conversation

hsinfang
Copy link
Collaborator

No description provided.

This should have been together with 674a4b1

Unlike the old DECam test dataset, the new test LSSTComCamSim dataset
does not have crosstalk. So, use another dataset type to test.

Also fix a time stamp missed in the previous commit.
Copy link
Member

@kfindeisen kfindeisen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Many thanks for the well-organized changes! However, I'm a bit worried that they're eroding the (already shaky) coherence of MiddlewareInterface, specifically with the region state variable and the splitting of queries across multiple _filter_datasets calls. I tried to offer some specific suggestions, but I'd like to take another look at the final result.

@@ -272,6 +272,8 @@ class MiddlewareInterface:
Information about which pipelines to run on ``visit``'s raws.
skymap: `str`
Name of the skymap in the central repo for querying templates.
padding: `int`
Number of arcseconds to pad the refcat and template region in preloading.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a bit uncomfortable having this be part of the MiddlewareInterface API, which is already bloated with too many fiddly details. I'm looking into refactoring this class into something more object-oriented, but in the meantime it might make more sense to read PRELOAD_PADDING in middleware_interface.py.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pre-existing issue with skymap, but this should have an extra space before :.

Comment on lines +1683 to +1652
for dataset_type in graph.inputs_of("retrieveTemplate"):
if dataset_type.endswith("Coadd"):
template_types.add(dataset_type)
# For cases where the pipelines do not contain "retrieveTemplate"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd really like to avoid hardcoding assumptions about the pipeline (I think parameters:apdb_config is the only necessary evil). This is infringing on DM-40245 a little, but could you use PipelineGraph.iter_overall_inputs instead? If not, could you add a comment that this is temporary until DM-40245?

# For cases where the pipelines do not contain "retrieveTemplate"
except KeyError:
pass
return list(template_types)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does it need to be a sequence? I would expect any collection to work, and a set seems the natural choice of the "standard" collection types. Does order matter?

tests/test_middleware_interface.py Outdated Show resolved Hide resolved
center = wcs.pixelToSky(detector.getCenter(lsst.afw.cameraGeom.PIXELS))
corners = wcs.pixelToSky(detector.getCorners(lsst.afw.cameraGeom.PIXELS))
padded = [c.offset(center.bearingTo(c), self.padding) for c in corners]
self.region = lsst.sphgeom.ConvexPolygon.convexHull([c.getVector() for c in padded])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really don't like the thought of self.region being undefined before preload, and then possibly existing afterward; that requires a lot of redundant state checks throughout the code. If we only use it during preload, do we actually need to make it a member of self, or can we just pass it around between functions?

If it does need to be part of the object's state, I'd recommend the following changes:

  • Move the computation to __init__ time instead of prep_butler time
  • Move the actual assignment to self.region out of _compute_region (i.e., make the method have no side effects)
  • Guarantee that self.region is None (instead of undefined) if and only if _skip_spatial_preload

etc/export_comCamSim.yaml Show resolved Hide resolved
known_datasets = set()
except MissingDatasetTypeError as e:
_log.debug("Pre-export query with args '%s' failed with %s", formatted_args, e)
known_datasets = set()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"# If dataset type never registered locally, then any such datasets are missing."?


# Let exceptions from src_repo query raise: if it fails, that invalidates
# this operation.
# "expanded" dimension records are ignored for DataCoordinate equality
# comparison, so we only need them on src_datasets.
with lsst.utils.timer.time_this(_log, msg=f"_filter_datasets({formatted_args}) (source datasets)",
level=logging.DEBUG):
src_datasets = set(src_repo.registry.queryDatasets(*args, **kwargs).expanded())
# Ok with empty query result here, not log an error, and let the downstream
# method decide what to do with empty results.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand this comment. _MissingDatasetError is raised on new line 1794.

More broadly, I'm worried that _filter_datasets has stopped being a useful abstraction, if you're having to duplicate iteration code outside the function. Something to look into in DM-46178?

@@ -34,7 +34,7 @@ env:
# relatively stable Pipelines containers that are needed to avoid issues with
# the "latest" version; they would remain in this list until "latest" becomes
# usable for all building and testing.
BASE_TAG_LIST: '["latest"]'
BASE_TAG_LIST: '["w_latest"]'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this for access to the new query API? I would think it makes more sense to update latest.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed I added it when the latest was too old at the time (>a month ago), but it's not needed right now.

Thanks for the reminder that I should keep in mind of the option of updating latest in this case.

indexer = HtmIndexer(depth=7)
shard_ids, _ = indexer.getShardIds(center, radius+self.padding)
shard_ids, _ = indexer.getShardIds(center, radius)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the new query system, can this be replaced with a query directly by region?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No with w_2024_37 when the API first became public and when I started working on this, but yes with w_2024_38 or newer now.

So, we can switch to use region query too. Thanks for motivating me to test again.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we bothered that the pipelines using refcats might not be using butler in getting the shards?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I understand. You mean a task might have an old-style loader and it might demand specific shards that weren't in the region?

Copy link
Collaborator Author

@hsinfang hsinfang Oct 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You mean a task might have an old-style loader and it might demand specific shards that weren't in the region?

Honestly I don't know and have not looked enough to understand how the loader decides what shards to use. I thought it was via HtmIndexer like the current PP code. Also, for the few examples I tried manually, the two methods (HtmIndexer versus butler query) give identical results. I don't know if they always do.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would expect all modern tasks would get them from the Butler, I think that was part of the Gen2->3 transition.

The optimal value likely depends on the instrument and the
skymap choice.
Instead of determining the instrument's wcsFlipX here, it is more
robust to use its formatter, which knows its camera orientation
from its obs package.
@hsinfang hsinfang force-pushed the tickets/DM-43712 branch 2 times, most recently from f8b6ae7 to 6839a4b Compare October 4, 2024 17:04
This changes the APIs of _export_skymap_and_templates and
_export_refcats to take a lsst.sphgeom.Region directly.

Note that the centroid of the region is not the same of the
detector center, but it should not matter.

Because htm7 can be too coarse compared to the patch size, using
htm7 indices to search for templates may lead to preloading more
patches than necessary and wasting time. This feature of using
htm7 to search for overlapping templates is also about to be
deprecated and replaced by the arbitrary spatial region query in
Butler. The usage will be replaced when switching to the new
butler query system.
This is a preparation step before switching to the new butler query
system.

In the new query system, query_datasets takes one datasetType at a
time, while butler.registry.queryDatasets can take a list of dataset
types in the old system. So, we change to query (via _filter_datasets)
one type of calibs/refcats/templates at a time.

Currently we can preload more types of calibs/refcats/templates
than the actual pipelines really need. It's possible that some types
are not preloaded but it's okay. For now we allow _MissingDatasetError.
The new Butler query systems supports spatial-constraint query
via lsst.sphgeom.Region directly. With this change, we use it
in template and refcat search. This needs stack w_2024_38 or newer.

make_export.py uses _filter_datasets so it needs to adjust to the
new underlying API too.
Some unit tests were temporarily marked expectedFailure in 674a4b1.

Now that we switch to the new query system, make them work again.

The test repo was put together using middleware tools, which
intrinsically uses butler repo's visit-detector regions with its
padding from defineVisits config.

That padding config is not the same as the preload region padding
in prompt processing.

This explains the patch differences in template selection.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants