feat: `EntitySet` and `EntitySetIterator` by RobertJacobsonCDC · Pull Request #786 · CDCgov/ixa

RobertJacobsonCDC · 2026-02-19T18:30:51Z

What this PR does

This PR introduces a new public set abstraction, EntitySet, and wires it into ContextEntitiesExt through a new method:

ContextEntitiesExt::query<E, Q>(&self, query: Q) -> EntitySet<E>

The short version is: queries can now return a composable set object, not just an iterator. That gives us a cleaner API surface for set operations and lets us keep query execution lazy.

In this PR, EntitySet is used primarily as the representation of query results, but the type itself is intentionally more general than queries. It is a reusable abstraction for representing and composing sets of entities more generally.

New public API

`EntitySet`

EntitySet is a set-expression wrapper over entity IDs. In this PR, query APIs return it, but it is not query-specific. It represents a lazy set expression made from:

- "source" leaves (population, indexed, property-derived, etc.), whose type is internal and not visible to client code
- set operations (union, intersection, difference):

let alive_people = context.query((Alive(true), ));
let seniors = context.query((AgeGroup::Senior, ));
let alive_seniors = alive_people.intersection(seniors);

You can:

- call contains without materializing a full result vector (generally very efficient)
- iterate it with into_iter()
- compose sets using set algebra
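To make the "lazy set expression" idea concrete, here is a minimal sketch of an expression tree whose membership test walks the tree instead of building a result. It is not ixa's implementation: `SetExpr`, its variants, and the use of plain `u32` IDs are all hypothetical stand-ins chosen for illustration.

```rust
use std::collections::HashSet;

/// Toy stand-in for a lazy set expression over entity IDs (plain `u32`s here).
/// Hypothetical: this is not ixa's `EntitySet`, only the same idea in miniature.
enum SetExpr {
    /// A leaf backed by concrete data, e.g. one bucket of an index.
    Source(HashSet<u32>),
    Union(Box<SetExpr>, Box<SetExpr>),
    Intersection(Box<SetExpr>, Box<SetExpr>),
    Difference(Box<SetExpr>, Box<SetExpr>),
}

impl SetExpr {
    fn source(ids: impl IntoIterator<Item = u32>) -> Self {
        SetExpr::Source(ids.into_iter().collect())
    }

    // Each set operation consumes both operands and only builds a tree node.
    fn union(self, other: SetExpr) -> Self {
        SetExpr::Union(Box::new(self), Box::new(other))
    }

    fn intersection(self, other: SetExpr) -> Self {
        SetExpr::Intersection(Box::new(self), Box::new(other))
    }

    fn difference(self, other: SetExpr) -> Self {
        SetExpr::Difference(Box::new(self), Box::new(other))
    }

    /// Membership is answered by walking the tree; no result vector is built.
    fn contains(&self, id: u32) -> bool {
        match self {
            SetExpr::Source(ids) => ids.contains(&id),
            SetExpr::Union(a, b) => a.contains(id) || b.contains(id),
            SetExpr::Intersection(a, b) => a.contains(id) && b.contains(id),
            SetExpr::Difference(a, b) => a.contains(id) && !b.contains(id),
        }
    }
}
```

In this sketch the cost of `contains` grows with the number of nodes in the expression, not with the size of the result, which is the property the bullet about avoiding materialization refers to.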

`ContextEntitiesExt::query`

query returns an EntitySet directly:

let people = context.query((Age(42), Vaccinated(true)));

query_result_iterator remains available:

// Equivalent to
//   let people_iter = context.query((Age(42), Vaccinated(true))).into_iter();
let people_iter = context.query_result_iterator((Age(42), Vaccinated(true)));

(There is a micro-optimization to construct an EntitySetIterator directly instead of creating an EntitySet first in the case of query_result_iterator, which squeezes out a couple more percentage points of performance improvement in the tightest loops.)

Internal machinery

Internally, query evaluation is now split into two concepts:

EntitySet
- owns the set expression tree
- applies simplifications/reordering during construction
- provides lazy contains and lazy iteration entrypoint
EntitySetIterator
- an existing type, almost completely rewritten in this PR
- executes the expression tree lazily
- has dedicated execution variants for source, intersection, union, difference
- performance micro-optimizations: specialized IntersectionSources path for source-only intersections, direct construction path to avoid constructing EntitySet first...

These are built on:

SourceSet / SourceSetIterator
- Private leaf / building-block types
- Provide a uniform interface over several internal low-level data-source representations used by query execution

This structure lets us keep API-level behavior straightforward while still being explicit about hot paths internally.
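As an illustration of what "reordering during construction" and a source-only intersection path can look like, the sketch below sorts the sources by size once and then iterates the smallest one, filtering against the rest. This is an assumption about one plausible strategy, not a description of the code in this PR; `IntersectionOfSources` and its methods are hypothetical names, and the real sources are borrowed views, not owned `BTreeSet`s.

```rust
use std::collections::BTreeSet;

/// Hypothetical source-only intersection; not ixa's `IntersectionSources`.
struct IntersectionOfSources {
    /// Sorted by length at construction, smallest first.
    sources: Vec<BTreeSet<u32>>,
}

impl IntersectionOfSources {
    /// Reorders the operands once, up front, so iteration can start
    /// from the smallest source.
    fn new(mut sources: Vec<BTreeSet<u32>>) -> Self {
        sources.sort_by_key(|s| s.len());
        Self { sources }
    }

    /// Lazily yields the IDs present in every source.
    /// With zero sources this sketch yields nothing.
    fn iter(&self) -> impl Iterator<Item = u32> + '_ {
        let (first, rest) = match self.sources.split_first() {
            Some((first, rest)) => (Some(first), rest),
            None => (None, &self.sources[..]),
        };
        first
            .into_iter()
            // Drive iteration from the smallest source only.
            .flat_map(|s| s.iter().copied())
            // Keep an ID only if every other source also contains it.
            .filter(move |id| rest.iter().all(|s| s.contains(id)))
    }
}
```

Driving from the smallest source bounds the number of membership checks by the size of that source, which is why the ordering step is worth doing at construction time in this toy version.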

Why this abstraction

Better API

Returning EntitySet from query makes set semantics first-class. It is easier to reason about and easier to compose than forcing everything through immediate iteration. It also gives us a shared abstraction we can reuse outside query execution, instead of inventing ad hoc set wrappers for each subsystem.

It also provides a basis for implementing queries with OR, NOT, and predicate conditions such as (Age > 60).

Clearer separation of concerns

Query represents what the query is. EntitySet represents what the result is. EntitySetIterator represents how to execute it. That split makes future optimization work less tangled. Representation and computation logic is moved out of ContextEntitiesExt and Query impls.

Performance

Careful benchmarking and optimization work resulted in better performance over the existing implementation ~~across the board~~ almost across the board. Skip the CI benchmarks and look at the local results I posted.

Future optimization opportunities like compilation of membership checking (and, by extension, iteration) to a BDD now have a clear path forward if and when we are ready for them.

Internals are set up for more targeted optimization without changing user-facing query syntax.

Limitations

Operations on `EntitySet` consume `self`

Set operations consume self, and EntitySet doesn't provide an iter method; it only implements IntoIterator (which also consumes self). The reason is that SourceSets derived from indexes have type Ref<IndexSet<_>>, which implements neither Copy nor Clone. Once we construct and return the EntitySet, we no longer have access to the RefCell that gave us the Ref instance, so we can't get another reference to the index set.

For the same reason, EntitySet and EntitySetIterator do not implement Clone.

The good news is, now EntitySets are reasonably cheap to create. If you want more than one copy of the same set, just call context.query multiple times.

This is a difficult limitation to overcome without either major rearchitecture or unsafe. If we need to lift this restriction, the best path forward is probably unsafe. (The problem boils down to this: when we need to mutate an index for a derived property, we also need to give the derived property an immutable reference to context in order for it to compute its value. So we need both mutable and immutable access simultaneously.)
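The following sketch shows the shape of the constraint using only the standard library: a set type that stores a `Ref` guard cannot derive `Clone`, so a second copy has to come from asking the owner again. `Index`, `IndexedSet`, and `query` are hypothetical names, and a `BTreeSet<u32>` stands in for the real index set.

```rust
use std::cell::{Ref, RefCell};
use std::collections::BTreeSet;

/// Hypothetical owner of an index, standing in for the data held by `context`.
struct Index {
    members: RefCell<BTreeSet<u32>>,
}

/// A set that holds a live `Ref` guard. `Ref` does not implement the `Clone`
/// trait, so `#[derive(Clone)]` on this struct would not compile.
struct IndexedSet<'a> {
    guard: Ref<'a, BTreeSet<u32>>,
}

impl<'a> IndexedSet<'a> {
    fn contains(&self, id: u32) -> bool {
        self.guard.contains(&id)
    }

    /// Consumes the set; the shared borrow ends when the guard is dropped.
    fn to_owned_vec(self) -> Vec<u32> {
        self.guard.iter().copied().collect()
    }
}

impl Index {
    /// Each call takes a fresh shared borrow, so calling it twice is how
    /// client code gets two independent copies of the same set.
    fn query(&self) -> IndexedSet<'_> {
        IndexedSet { guard: self.members.borrow() }
    }
}
```

In this toy version, two sets from two `query` calls can coexist because both borrows are shared, while a mutable borrow of the index fails until every outstanding set has been dropped.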

`EntitySet` / `EntitySetIterator` holds an immutable reference to `context`, is tied to `context`'s lifetime

The underlying SourceSets are immutable views into the internal data structures owned by context. The lifetime 'a of &'a context is a generic parameter of SourceSet<'a, E: Entity> and thus of EntitySet<'a, E: Entity> / EntitySetIterator<'a, E: Entity>. The compiler statically enforces Rust lifetime and aliasing rules. The context cannot be mutated while an EntitySet / EntitySetIterator exists.

Client code can compute the set of entity IDs up front (or use some other pattern) if it needs access to the result set while mutating context. We provide the EntitySet::to_owned_vec method to conveniently compute a Vec<EntityId<E>>.
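The "materialize first, then mutate" pattern can be sketched with a toy context. Everything here is hypothetical (`Context`, `query_age_at_least`, `set_age`, and plain `usize` IDs); it only demonstrates how collecting into an owned `Vec` ends the shared borrow before mutation begins.

```rust
/// Toy context (hypothetical, not ixa's `Context`): entity `i` has age `ages[i]`.
struct Context {
    ages: Vec<u8>,
}

impl Context {
    /// Lazy view that borrows `self`; the context cannot be mutated while
    /// the returned iterator is alive.
    fn query_age_at_least(&self, min: u8) -> impl Iterator<Item = usize> + '_ {
        self.ages
            .iter()
            .enumerate()
            .filter(move |(_, age)| **age >= min)
            .map(|(id, _)| id)
    }

    fn set_age(&mut self, id: usize, age: u8) {
        self.ages[id] = age;
    }
}

/// Ages every entity that is at least 60 by one year and returns how many
/// entities were updated.
fn birthday_for_seniors(context: &mut Context) -> usize {
    // Materialize the IDs first so the shared borrow ends here.
    let ids: Vec<usize> = context.query_age_at_least(60).collect();
    // Now `context` can be borrowed mutably.
    for &id in &ids {
        let next = context.ages[id].saturating_add(1);
        context.set_age(id, next);
    }
    ids.len()
}
```

Calling `set_age` inside a loop over the lazy iterator itself would be rejected by the borrow checker, which is the compile-time enforcement described above.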

Issues and Questions

Q: What should we do with with_query_result?

Right now with_query_result gives client code direct access to an immutable reference to an IndexSet. This method made sense at the time we implemented it, but we will likely have different kinds of indexes in the future which will have different internal container types. What's more, with_query_result is necessarily eager in the unindexed case, realizing the full result set. We do emit a warning in the unindexed case, but it's awkward that a method intended to give an optimized fast path also exposes a new slow path.

The with_query_result method still has a use case: it gives client code a scope for the lifetime of the result set—a lifetime during which context cannot be mutated. If we want to keep it, it makes sense to me to expose the result set as an EntitySet. Membership checking, iteration, and other operations should be virtually identical between IndexSet and EntitySet both in terms of performance and convenience. As far as I can tell, the only advantage of IndexSet over EntitySet is that IndexSet has a len method. But we could give EntitySet a method like EntitySet::try_len() that returns an Option<usize> that is None if EntitySet doesn't trivially know its length (without needing to compute anything). Client code can always do entity_set.into_iter().count(), which is fast if EntitySet::try_len() would return Some, but which consumes entity_set.
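A rough sketch of how the proposed try_len could behave is below. It is only an illustration of the idea under assumed types: `ToySet` is hypothetical, a source is modeled as a duplicate-free `Vec<u32>`, and only intersection is included.

```rust
/// Hypothetical miniature of a set expression, used to illustrate `try_len`.
enum ToySet {
    /// A leaf whose size is known without any computation
    /// (assumed to contain no duplicates).
    Source(Vec<u32>),
    Intersection(Box<ToySet>, Box<ToySet>),
}

impl ToySet {
    /// `Some(n)` only when the length is trivially known, otherwise `None`.
    fn try_len(&self) -> Option<usize> {
        match self {
            ToySet::Source(ids) => Some(ids.len()),
            ToySet::Intersection(..) => None,
        }
    }

    fn contains(&self, id: u32) -> bool {
        match self {
            ToySet::Source(ids) => ids.contains(&id),
            ToySet::Intersection(a, b) => a.contains(id) && b.contains(id),
        }
    }

    /// Consumes the set and computes its members.
    fn into_ids(self) -> Vec<u32> {
        match self {
            ToySet::Source(ids) => ids,
            ToySet::Intersection(a, b) => (*a)
                .into_ids()
                .into_iter()
                .filter(|id| b.contains(*id))
                .collect(),
        }
    }

    /// Counting consumes the set; it takes the fast path when `try_len`
    /// already knows the answer and falls back to computing the members.
    fn count(self) -> usize {
        match self.try_len() {
            Some(n) => n,
            None => self.into_ids().len(),
        }
    }
}
```

Returning `Option<usize>` keeps the cheap case explicit: callers that only want a length when it is free can check for `Some`, and callers that need an exact count can still pay for it knowingly.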

RobertJacobsonCDC · 2026-02-19T20:40:32Z

Benchmark Comparisons

The benchmark comparisons from CI are all over the place. Below is the data from my local run. I ran it a couple of times, and the results were remarkably consistent, even where the percent change is small. As usual, one should still be skeptical of small differences.

The example-births-deaths benchmark is the most interesting to me. It suggests we do not have good benchmark coverage—there's something happening relevant to performance that isn't being measured by any existing benchmark.

Run the comparison on your own machine and compare the results. The suite takes ~10 minutes to run on my dev machine. This is long enough to make thermal throttling a problem, so keep an eye on that when you run it.

Regressions

Group	Bench	Change	CI Lower	CI Upper
large_dataset	bench_filter_unindexed_entity	1.831%	1.136%	2.704%
large_dataset	bench_query_population_multi_unindexed_entities	4.247%	3.948%	4.562%
algorithm_benches	algorithm_sampling_multiple_l_reservoir	3.453%	3.005%	3.878%
algorithm_benches	algorithm_sampling_multiple_known_length	7.050%	6.391%	7.877%
sampling	sampling_single_l_reservoir_entities	4.062%	3.847%	4.257%
examples	example-births-deaths	14.580%	14.048%	15.332%
indexing	query_people_single_indexed_property_entities	5.998%	5.859%	6.139%
indexing	query_people_count_multiple_individually_indexed_properties_enti	7.964%	6.837%	9.195%
indexing	query_people_count_single_indexed_property_entities	2.995%	2.686%	3.356%
indexing	query_people_multiple_individually_indexed_properties_entities	5.801%	5.584%	5.988%
indexing	with_query_results_single_indexed_property_entities	2.369%	2.053%	2.708%
indexing	query_people_indexed_multi-property_entities	1.777%	1.371%	2.191%
counts	multi_property_unindexed_entities	4.368%	3.950%	4.831%

Improvements

Group	Bench	Change	CI Lower	CI Upper
sample_entity_single_property_indexed	10000	-7.387%	-7.762%	-7.022%
sample_entity_single_property_indexed	100000	-7.824%	-8.360%	-7.283%
sample_entity_single_property_indexed	1000	-6.785%	-7.144%	-6.356%
sample_entity_whole_population	10000	-45.231%	-45.404%	-45.066%
sample_entity_whole_population	100000	-45.052%	-45.331%	-44.658%
sample_entity_whole_population	1000	-46.777%	-47.507%	-46.191%
sampling	sampling_multiple_l_reservoir_entities	-1.919%	-2.053%	-1.751%
examples	example-basic-infection	-12.803%	-13.941%	-11.916%
sample_entity_single_property_unindexed	10000	-1.429%	-1.812%	-1.091%
sample_entity_single_property_unindexed	100000	-2.414%	-2.770%	-2.042%
indexing	with_query_results_multiple_individually_indexed_properties_enti	-97.154%	-97.160%	-97.149%
indexing	with_query_results_indexed_multi-property_entities	-2.565%	-2.854%	-2.303%
counts	single_property_unindexed_entities	-3.611%	-4.306%	-2.897%
counts	reindex_after_adding_more_entities	-1.774%	-2.005%	-1.545%

Unchanged

Group	Bench	Change	CI Lower	CI Upper
large_dataset	bench_query_population_property_entities	0.135%	-0.230%	0.567%
large_dataset	bench_query_population_indexed_property_entities	-0.380%	-0.708%	-0.011%
large_dataset	bench_filter_indexed_entity	1.289%	0.508%	2.091%
large_dataset	bench_match_entity	0.290%	-0.034%	0.662%
large_dataset	bench_query_population_multi_indexed_entities	-0.376%	-0.702%	-0.072%
large_dataset	bench_query_population_derived_property_entities	0.587%	0.113%	1.013%
algorithm_benches	algorithm_sampling_single_known_length	-0.025%	-0.284%	0.261%
algorithm_benches	algorithm_sampling_single_rand_reservoir	-0.337%	-0.741%	0.024%
algorithm_benches	algorithm_sampling_single_l_reservoir	0.319%	-0.036%	0.700%
sampling	sampling_single_unindexed_entities	1.497%	0.939%	2.042%
sampling	sampling_multiple_known_length_entities	-0.318%	-0.598%	-0.048%
sampling	sampling_multiple_unindexed_entities	0.795%	0.279%	1.434%
sampling	sampling_single_known_length_entities	0.920%	0.481%	1.463%
sample_entity_single_property_unindexed	1000	0.081%	-0.275%	0.381%
sample_entity_multi_property_indexed	10000	0.215%	-0.221%	0.589%
sample_entity_multi_property_indexed	100000	1.124%	0.660%	1.587%
sample_entity_multi_property_indexed	1000	0.666%	0.190%	1.106%
indexing	query_people_count_indexed_multi-property_entities	0.710%	0.094%	1.515%
counts	multi_property_indexed_entities	-0.548%	-0.828%	-0.240%
counts	index_after_adding_entities	-1.345%	-2.300%	-0.703%
counts	single_property_indexed_entities	0.537%	-0.070%	1.424%

RobertJacobsonCDC · 2026-02-20T03:08:03Z

New benchmarks after a little bit of optimization. The improvement to example-births-deaths is somewhat artificial. I implemented a new fused sample-count algorithm. Our script to generate the nice table doesn't work if you add new benchmarks, so we don't have the same format.

Smaller (faster) time bolded. Relative speedup = main ÷ dev.

> 1 → dev faster
< 1 → dev slower

Algorithm Benchmarks

benchmark	main	dev	relative speedup (main ÷ dev)
algorithm_sampling_single_known_length	5.9815 ns	5.9881 ns	1.00×
algorithm_sampling_single_l_reservoir	492.36 ns	380.26 ns	1.29×
algorithm_sampling_single_rand_reservoir	155.74 µs	153.21 µs	1.02×
algorithm_sampling_multiple_known_length	1.2897 µs	1.4203 µs	0.91×
algorithm_sampling_multiple_l_reservoir	17.821 µs	18.503 µs	0.96×

Counts Benchmarks

benchmark	main	dev	relative speedup (main ÷ dev)
single_property_unindexed	4.6566 µs	5.0497 µs	0.92×
single_property_indexed	61.653 ns	60.109 ns	1.03×
multi_property_unindexed	5.4680 µs	5.7498 µs	0.95×
multi_property_indexed	177.03 ns	161.29 ns	1.10×
index_after_adding	619.15 µs	606.57 µs	1.02×
reindex_after_adding_more	350.32 µs	350.43 µs	1.00×

Examples

benchmark	main	dev	relative speedup (main ÷ dev)
example-basic-infection	2.8767 ms	2.6938 ms	1.07×
example-births-deaths	27.990 ms	16.095 ms	1.74×

Indexing Benchmarks

benchmark	main	dev	relative speedup (main ÷ dev)
with_query_results_single_indexed_property	59.330 µs	60.534 µs	0.98×
with_query_results_multiple_individually_indexed_properties	6.6940 ms	184.67 µs	≈36.2×
with_query_results_indexed_multi-property	155.73 µs	153.74 µs	1.01×
query_people_count_single_indexed_property	56.177 µs	58.957 µs	0.95×
query_people_count_multiple_individually_indexed_properties	4.7671 ms	4.9924 ms	0.96×
query_people_count_indexed_multi-property	149.72 µs	147.73 µs	1.01×
query_people_single_indexed_property	20.583 ms	21.561 ms	0.95×
query_people_multiple_individually_indexed_properties	6.2824 ms	6.4651 ms	0.97×
query_people_indexed_multi-property	428.40 µs	428.43 µs	1.00×

Large Dataset Benchmarks

benchmark	main	dev	relative speedup (main ÷ dev)
bench_query_population_property	4.6398 µs	4.6357 µs	1.00×
bench_query_population_indexed_property	60.952 ns	60.778 ns	1.00×
bench_query_population_derived_property	77.670 µs	76.679 µs	1.01×
bench_query_population_multi_unindexed	5.8717 µs	6.0354 µs	0.97×
bench_query_population_multi_indexed	161.93 ns	161.47 ns	1.00×
bench_match_entity	5.0671 ns	5.0528 ns	1.00×
bench_filter_indexed_entity	545.17 ns	548.45 ns	0.99×
bench_filter_unindexed_entity	6.3485 µs	6.3732 µs	1.00×

sample_entity Benchmarks

benchmark	main	dev	relative speedup (main ÷ dev)
whole_population / 1 000	12.469 ns	6.7922 ns	1.84×
whole_population / 10 000	12.416 ns	6.8152 ns	1.82×
whole_population / 100 000	12.414 ns	6.8696 ns	1.81×
single_property_indexed / 1 000	88.215 ns	89.028 ns	0.99×
single_property_indexed / 10 000	87.693 ns	84.226 ns	1.04×
single_property_indexed / 100 000	88.046 ns	89.953 ns	0.98×
multi_property_indexed / 1 000	183.19 ns	181.24 ns	1.01×
multi_property_indexed / 10 000	176.59 ns	182.42 ns	0.97×
multi_property_indexed / 100 000	177.44 ns	177.55 ns	1.00×
single_property_unindexed / 1 000	903.30 ns	722.11 ns	1.25×
single_property_unindexed / 10 000	9.2574 µs	6.8811 µs	1.35×
single_property_unindexed / 100 000	106.28 µs	97.742 µs	1.09×

Sampling Benchmarks

benchmark	main	dev	relative speedup (main ÷ dev)
sampling_single_known_length	85.444 µs	85.095 µs	1.00×
sampling_single_l_reservoir	5.7734 ms	6.3588 ms	0.91×
sampling_multiple_known_length	730.34 µs	725.66 µs	1.01×
sampling_multiple_l_reservoir	6.7672 ms	7.3110 ms	0.93×
sampling_single_unindexed	105.40 ms	105.60 ms	1.00×
sampling_multiple_unindexed	113.09 ms	118.07 ms	0.96×

New Benchmarks in `RobertJacobsonCDC_746_entity_set`

benchmark	main	dev
counts/concrete_plus_derived_unindexed	—	4.7510 µs
sampling/count_and_sampling_single_known_length	—	107.79 µs
sampling/sampling_single_unindexed_concrete_plus_derived	—	204.70 ms
sampling/count_and_sampling_single_unindexed_concrete_plus_derived	—	213.88 ms

Observations

Major dev wins:
- with_query_results_multiple_individually_indexed_properties (~36×)
- example-births-deaths (~1.74×)
- Whole population sampling (~1.8×)
- Unindexed single-property sampling (~1.25–1.35× at 1 000 and 10 000 entities, ~1.09× at 100 000)
Minor regressions in dev:
- algorithm_sampling_multiple_known_length
- Some indexing count queries (~3–5%)
- sampling_single_l_reservoir (~9% slower)

Most other changes fall within ±2%.
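For anyone re-deriving the headline numbers, the relative speedup column is just main ÷ dev after putting both times in the same unit. The small helper below reproduces two rows from the tables above. The `percent_change` function uses the conventional (dev − main) ÷ main definition, which is an assumption: the source does not state how the Change column in the earlier comparison was computed.

```rust
/// Relative speedup as defined for the tables: main time divided by dev time.
/// Values above 1.0 mean dev is faster. Both arguments must share a unit.
fn relative_speedup(main: f64, dev: f64) -> f64 {
    main / dev
}

/// Percent change in time going from main to dev (negative means dev is
/// faster). Assumed definition: (dev - main) / main * 100.
fn percent_change(main: f64, dev: f64) -> f64 {
    (dev - main) / main * 100.0
}

/// Recomputes two rows from the tables, returning (births_deaths, with_query_results).
fn headline_speedups() -> (f64, f64) {
    // example-births-deaths: 27.990 ms on main, 16.095 ms on dev.
    let births_deaths = relative_speedup(27.990, 16.095);
    // with_query_results_multiple_individually_indexed_properties:
    // 6.6940 ms on main is 6694.0 µs; dev is 184.67 µs.
    let with_query_results = relative_speedup(6694.0, 184.67);
    (births_deaths, with_query_results)
}
```

Note the unit conversion in the second row: the main time is reported in milliseconds and the dev time in microseconds, so dividing the raw numbers without converting would give a meaningless ratio.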

feat: EntitySet and EntitySetIterator

e7d1b07

RobertJacobsonCDC linked an issue Feb 19, 2026 that may be closed by this pull request

EntitySet and EntitySetIterator #746

Open


github-actions bot added a commit that referenced this pull request Feb 19, 2026

Update bench-history.json (PR #786 @ e7d1b07)

4583790

RobertJacobsonCDC marked this pull request as ready for review February 20, 2026 17:54

RobertJacobsonCDC wants to merge 1 commit into main from RobertJacobsonCDC_746_entity_set
