Authors: Tyler Hou, Shadaj Laddad, Joe Hellerstein
Equality saturation is a powerful technique for program optimization. Contextual equality saturation extends this to support rewrite rules that are conditioned on where a term appears in an expression. Existing work has brought contextual reasoning to egg; in this paper, we share our ongoing work to extend this to relational equality saturation in egglog. We summarize the existing approaches to contextual equality saturation, outline its main applications, and identify key challenges in combining this approach with relational models.
In recent years there as been a surge of interest in using equality saturation to optimize programs [10]. Equality saturation is an optimization technique which repeatedly applies rewrite rules to an program. These rewrite rules add equalities between (sub)terms of the expression to a data structure called an e-graph. The e-graph represents a possibly-infinite set programs equivalent to the original program. Finally, an optimized program can be extracted from the e-graph.
However, existing equality saturation tools like egg
[11] and
egglog
[12] do not natively support contextual equalities.
Contextual equalities are equalities between terms that are valid in
some subgraph, but may not be valid in general. For example, consider
the ternary then
branch of
the ternary, the condition
Reasoning under contexts is necessary in many program optimizers to achieve better performance. For example, by applying context-aware optimizations to circuit design, an RTL optimization tool reduced circuit area by 41% and delay by 33% [1]. In addition, reasoning about contextual properties like sort order is critical for dataflow optimization and relational query optimization [4; 8].
There have been various approaches to support contextual reasoning in
equality saturation frameworks. The authors of the RTL optimization tool
encoded contextual reasoning in egg
by adding ASSUME(x, c)
e-nodes
to the e-graph, where x
is the expression to be optimized, and c
is
a set of expressions which represent known constraints [1].
Auxiliary rewrite rules "push down" ASSUME
e-nodes into the expression
tree, adding additional constraints when appropriate. Finally, after
ASSUME
e-nodes have been pushed down sufficiently far, domain-specific
rules rewrite them into more efficient terms using the constraints.
However, the main limitation of this approach is that these "extra"
ASSUME
e-nodes significantly expand the size of the e-graph, harming
performance. In their tool, optimization of a floating point subtactor
took 22 minutes, with the majority of time spent in e-graph expansion
[1].
An alternative to adding explicit ASSUME
e-nodes into the e-graph is
to annotate e-classes with the contexts that they appear in via a
top-down analysis. Then, a contextual rewrite rule can "copy" the
contextual subgraph, replacing subterms that can be contextually
rewritten with their equivalents [2]. However, this approach
can also substantially expand the e-graph: to avoid false equalities,
such a rewrite must make independent copies of all
contextually-rewritten e-classes and their ancestors, up to and
including the "source" of the context. When multiple contexts are
nested, this can again lead to a combinatorial explosion of the e-graph
size. Finally, copying the e-graph also makes it more expensive to apply
context insensitive rewrites -- such rewrites may have to match all
copied subgraphs since they are no longer related to the original
subgraph.
A promising approach for efficiently supporting contextual equalities
are colored e-graphs [9]. Instead of adding new e-nodes
to the e-graph, colored e-graphs support multiple equivalence relations:
a base equivalence relation represents equalities that apply in every
context, and context-sensitive, colored equivalences are layered on
top of the base relation. These layered relations can be thought of
"shallow" copies of the base relation. This approach has two advantages:
first, it saves memory because context-sensitive rewrites typically add
relatively few equivalences on top of the base equivalence. Second, when
additional base equivalences are found, the context-sensitive
equivalence relations can be efficiently updated. However, one
limitation of the colored e-graph work is that it targets the egg
library, which implements non-relational e-matching.
To better understand the space of applications that can benefit from contextual equality saturation, we explore three case studies from a range of optimization domains.
Database engines work by translating user-provided queries (often in SQL) down
to a query plan, then repeatedly applying rewrite rules to find more efficient
plans. For example, consider the SQL query in Figure 1, which finds all nodes
reachable in a path of exactly three steps from a source node with
[id]{.smallcaps}

However, such a plan may still not be optimal; after selection, the
inputs to the bottom join may be relatively small. If relations are
small and fit in memory, then it may be more efficient to hash join
them. Thus, we would like to replace the merge join


It is not difficult to fix this though: we can add a sort enforcer
[5] underneath the top merge join that sorts its left input on
the
More generally, in database query optimization, query (sub-)plans must often satisfy certain physical properties, because their outputs are fed to consumers who rely on those physical properties [5]. Physical properties include sort order, partitioning, and data location. In general, valid rewrites must preserve physical properties. However, in certain contexts, like underneath a sort enforcer, additional rewrites that "break" physical properties become valid, because those physical properties will be re-established by the enforcer.
As shown in the introduction, contextual equality saturation can also be
used to discover rewrites for conditional statements. The ternary
But not only can contextual equality saturation reason under
conditionals, in some circumstances, it can remove conditional
statements entirely. Consider another ternary
-
Under the then branch of the ternary, we have the additional equivalence
$a > b \equiv \text{true}$ . Hence, the e-class inside the then branch contains the terms${a > b, \text{true}}$ . -
Under the then branch of the ternary, we have the additional equivalence
$\neg{}(a > b) \equiv \text{true}$ . Another (general) rewrite rule could rewrite$\neg{} (a > b) \equiv b \leq a$ , so, by transitivity,$b \leq a \equiv \text{true}$ . The e-class inside the else branch would contain the terms${\neg (a > b), \text{true}, b \leq a}$ . -
An "intersecting" rule says if the bodies of branches is equivalent to some term
$t$ , then the entire ternary is equivalent to$t$ . That is, because both the e-classes for the then branch and the else branch contain the term , then the entire ternary is equivalent to :$\big((a > b);?;(a > b) : (a \leq b)\big) \equiv \text{true}$ . (This rule is corresponds to the "disjunction elimination" or "proof by cases" inference rule in propositional logic [9].)
We note that step 3 "converts" contextual equivalences within the branches of the ternary into context-insensitive equivalences by taking the "intersection" of two equivalence relations. In order to support such reasoning, a contextual equality saturation framework needs to let a user write rules that can refer to and manipulate (contextual) equivalence relations.
Finally, we show an example where contextual reasoning can simplify the
body of a lambda application. This type of contextual reasoning is
similar to conditional simplification, except has one additional
complication. Consider the lambda application
Here is the complication: we want to simplify the entire lambda
application to
The rule that we want is: if the lambda body is equal to a term
However, this type of rule is hard to implement in existing equality
saturation frameworks because existing equality saturations frameworks
like egg
and egglog
canonicalize the e-graph---the body of the
lambda application is not a set of terms, but instead an e-class.
Furthermore, there is no way to "cleave apart" an e-class to recover the
terms that do not contain
In this section we present ongoing work on a set-theoretic perspective on contextual equality saturation. We aim to describe contextual equality saturation in an implementation-independent manner and to suggest how a relational contextual equality saturation framework might practically support contexts. We assume familiarity with lattices and equivalence relations; definitions for lattices and equivalence relations can be found in the Definitions. We begin with a definition of quotient sets, and we relate them to e-classes and a hierarchy of equivalence relations.
Definition 4.1.
Let
Example 4.2.
Let find
operation in egglog
[12]).
We note that egglog
because
terms within an equivalence class in
Theorem 4.3.
Let
Intuitively, the meet of two equivalence relations
Definition 4.4.
Let
Definition 4.4 suggests what structure contexts should
have to be suitable for labels in contextual equality saturation---they
should have the property that as one learns more information, more, not
fewer, equalities become available. We associate with the base context,
labeled
In 3.1, our context lattice contained two elements:
Similarly, we (secretly) used the lattice structure of equivalence
relations in 3.2. Under each branch of the conditional, we
transitioned up the lattice: in the then branch, we transitioned from
Lemma 4.5. Let
- The quotient map (which sends elements to their equivalence classes)
$\pi_2 : R \to R/\sim_2$ "factors through" the quotient map$\pi_1 : R \to R/\sim_1$ : there is a map$q: R/\sim_1 \to R/\sim_2$ such that$\pi_2 = q \circ \pi_1$ See Figure 3.

{:start="2"}
2. The quotient set
- The coarser quotient set is smaller than the finer set:
$\lvert{}A/\sim_2\rvert{} \leq \lvert{}A/\sim_1\rvert{}$ .
In Lemma 4.5, one can interpret
As observed in both egglog
[12] and colored e-graphs
[9], it can be exponentially faster to e-match on a
canonicalized e-graph versus an uncanonicalized e-graph. The existence
of $q$ suggests that contextual equality saturation frameworks can trade
off space for time: instead of storing a canonicalized copy of the
e-graph for every equivalence relation, databases can store e-graphs for
a lower-bound of contexts. When e-matching is requested on some
context that is not explicitly materialized, frameworks have two
options: either they can e-match directly on the stored e-graphs,
incurring the cost of a join on the equivalence relation; or they can
apply
Finally, we note that item (3) in Lemma 4.5 implies that the canonicalized
database is not monotonic! As more equivalences are added, the size of the
canonicalized database can shrink. Relational logic systems like Datalog rely
on monotonicity to be efficient and correct. Thus, a relational equality
saturation framework that supports contexts needs to syntactically prevent
users from observing the non-monotonicity of the database. (egglog
already
achieves this for a single equivalence relation.)
We have provided an overview of existing approaches to contextual equality saturation, and shown three problems that contextual equality saturation may be able to (partially) solve. Our early work points towards a set-theoretic model for contextual equality saturation that lends itself to natural system optimizations from the database literature.
In the future, we plan to further develop our set-theoretic model and
extend it to the relational setting. In addition, we aim to develop a
syntax that lets users easily express context-sensitive rewrites in a
relational model and manipulate equivalences across contexts, but
prevents them from observing non-monotonicity (which is crucial for
performance and correctness). Finally, we aim to explore whether
existing Datalog systems like egglog
and Soufflé can be adapted to
support a hierarchy of equivalence relations.
This work is supported in part by National Science Foundation CISE Expeditions Award CCF-1730628, IIS-1955488, IIS-2027575, GRFP Award DGE-2146752, DOE award DE-SC0016260, ARO award W911NF2110339, and ONR award N00014-21-1-2724, and by gifts from Astronomer, Google, IBM, Intel, Lacework, Microsoft, Mohamed Bin Zayed University of Artificial Intelligence, Nexla, Samsung SDS, Uber, and VMware.
Definition A.1.
A meet semilattice is an algebraic structure
Definition A.2.
Dually, a join semilattice is an algebraic structure
Definition A.3.
A lattice is an algebraic structure
Example A.4.
Let
-
Both
$\min$ and$\max$ are idempotent, associative, and commutative. -
The meet of
$a$ and$b$ is$\min(a, b)$ , and if$\min(a, b) = a$ then$a \leq b$ (under the normal ordering). -
The join of
$a$ and$b$ is$\max(a, b)$ , and if$\max(a, b) = a$ then$a \geq b$ . -
$\min(a, \max(a, b)) = a = \max(a, \min(a, b))$ .
Definition A.5.
A binary relation on a set
Lemma A.6.
Let
Proof. Without loss of generality, we prove the lemma for
◻
[1] Samuel Coward, George A. Constantinides, and Theo Drane. 2023. Automating Constraint-Aware Datapath Optimization using E-Graphs. arXiv:2303.01839 [cs.AR]
[2] Alexandre Drewery. 2022. Automatic Equivalence Verification for Translation Validation. Master’s thesis. ENS Rennes. https://perso.eleves.ens-rennes.fr/people/alexandre.drewery/internship_report.pdf
[3] David S. Dummit and Richard M. Foote. 2004. Abstract Algebra. John Wiley & Sons, Hoboken, New Jersey.
[4] Goetz Graefe. 1995. The Cascades Framework for Query Optimization. IEEE Data Eng. Bull. 18, 3 (1995), 19–29. http://sites.computer.org/debull/95SEP-CD.pdf
[5] G. Graefe and W.J. McKenna. 1993. The Volcano optimizer generator: extensibility and efficient search. In Proceedings of IEEE 9th International Conference on Data Engineering. 209–218. https://doi.org/10.1109/ICDE.1993.344061
[6] George Grätzer. 2009. Lattice Theory: First Concepts and Distributive Lattices. Dover Publications, Mineola, New York.
[7] Jim Gray, Surajit Chaudhuri, Adam Bosworth, Andrew Layman, Don Reichart, Murali Venkatrao, Frank Pellow, and Hamid Pirahesh. 2007. Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Totals. arXiv:cs/0701155 [cs.DB]
[8] Shadaj Laddad, Conor Power, Tyler Hou, Alvin Cheung, and Joseph M. Hellerstein. 2023. Optimizing Stateful Dataflow with Local Rewrites. arXiv:2306.10585 [cs.PL]
[9] Eytan Singher and Shachar Itzhaky. 2023. Colored E-Graph: Equality Reasoning with Conditions. arXiv:2305.19203 [cs.PL]
[10] Ross Tate, Michael Stepp, Zachary Tatlock, and Sorin Lerner. 2009. Equality saturation: a new approach to optimization. In Proceedings of the 36th Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (Savannah, GA, USA) (POPL ’09). Association for Computing Machinery, New York, NY, USA, 264–276. https://doi.org/10.1145/1480881.1480915
[11] Max Willsey, Chandrakana Nandi, Yisu Remy Wang, Oliver Flatt, Zachary Tatlock, and Pavel Panchekha. 2021. Egg: Fast and Extensible Equality Saturation. Proc. ACM Program. Lang. 5, POPL, Article 23 (January 2021), 29 pages. https://doi.org/10.1145/3434304
[12] Yihong Zhang, Yisu Remy Wang, Oliver Flatt, David Cao, Philip Zucker, Eli Rosenthal, Zachary Tatlock, and Max Willsey. 2023. Better Together: Unifying Datalog and Equality Saturation. Proc. ACM Program. Lang. 7, PLDI, Article 125 (June 2023), 25 pages. https://doi.org/10.1145/3591239
Footnotes
-
We note that existing equality saturations can implement beta-reduction (as well as entire interpreters for a lambda calculus) by using e-graph analyses to keep track of free and bound variables in an expression. ↩