The Theoretical Aspect of Equality Saturation (Part I)

The phase-ordering problem is a long standing problem in program optimization: Consider program \((a\times 2)/2\). A program optimizer may readily optimize the subexpression \(a\times 2\) into \(a\ll 1\) and get \((a\ll 1)/2\), at which stage it cannot optimize the program further. Or it can also realize that the multiplication and division operators can be simply canceled out, yielding \(a\). This indicates that applying one optimization may disable another optimization, but naively exploring all optimization orders can lead to a combinatorial explosion.

E-graphs were first introduced for program verification by Greg Nelson in his thesis in the 80s and are nowadays used in most modern SMT solvers. At a high level, an e-graph is a data structure that compactly represents a program space and equivalences of programs in this space. In 2009, Tate et al. observed that applying term rewriting over e-graphs can make program optimization explore multiple optimization paths simultaneously, thus mitigating the phase-ordering problem. This idea is revitalized in the 2021 egg paper paper by Willsey et al. Since then, more and more people have adopted equality saturation in their projects, both in academia and in industry. At the time of this blog post, there are at least 40 non-trivial projects that use equality saturation.

Most of the ongoing exploration focuses on the practical side of equality saturation. Some of the hot topics among these explorations include how to represent higher-order programs (e.g., Schleich et al., Koehler et al., Gonzalez et al) and how to represent stateful programs (e.g., eggcc and aegraphs). Others focus on applying equality saturation to domains like hardware designs and database systems. On the other hand, the theoretical side of equality saturation has not been well explored. There are many important unanswered questions I believe answering these questions will give us new insights into equality saturation.

I have been working on the equality saturation (EqSat) theory project with Dan and Remy for a while. In this journey, we have found many interesting connections to techniques from other communities. These connections imply opportunities both to learn from and to give new insights to these communities. In this blog post, I want to give an overview of our journey so far. I will try to keep this post concise and instead give pointers to relevant pieces.

In the egg paper, e-graphs are defined as a tuple of the union-find data structure, a mapping from e-classes to e-nodes, and a hash-consing data structure for memoization. While easily understandable, this definition is arguably verbose. For example, operations over e-graphs need to be defined imperatively as manipulations of e-graph’s member data structures. But e-graphs can be defined very concisely in tree automata terminology. Pollock and Haan observed that an e-graph is a deterministic finite tree automaton (DFTA), a generalization of a finite state automaton from strings to trees. To be more precise- an e-graph is a DFTA with no inaccessible states. Moreover, two programs are equivalent in an e-graph if they are accepted by the same e-class/state. This definition immediately tells us something about e-graphs: E-graphs are as expressive as tree automata, meaning the set of terms represented by an e-graph has to be regular. Moreover, the Myhill-Nerode theorem for trees implies that the equivalence relation induced by an e-graph is always finitely generated (i.e., is the smallest equivalence relation induced by a finite list of identities between terms) and that it is possible to minimize an e-graph by merging indistinguishable e-classes. Wang et al. used this minimization idea to speed up ILP-based e-graph extraction.

On the other hand, an e-graph can be viewed as a database. Under this view, an e-graph is just a database where the first n-1 columns form a key. We took this approach in our relational e-matching paper, where we sped up a core query operation over e-graphs and proved its bound, and more recently the egglog paper, where we propose an entirely new abstraction of e-graphs based on databases. The new abstraction allows database techniques to be applied to e-graphs: Database query planning and execution techniques make equality saturation blazingly fast, while the semi-naive evaluation of Datalog programs effectively makes equality saturation incremental, Finally, the classic congruence closure algorithm of Downey et al. used in e-graphs essentially gives an \(O(n\log n)\) algorithm for repairing arbitrary functional dependencies.

Equality saturation does fixpoint reasoning over e-graphs. It is no surprise that similar ideas are also explored in tree automata literature and the databases literature. In tree automata literature, a similar technique called tree automata completion is proposed by Genet. Unlike e-graphs, which are constrained to only track equivalences between terms, tree automata track a directed rewritability relation between terms via \(\epsilon\)-transitions. In other words, while an e-graph tells you whether two programs are equivalent, tree automata completion tells you whether one program can be transformed into another program, which is strictly more powerful. One application of tree automata completion is program verification: We can consider the semantics of a program as a set of rewrite rules. Applying tree automata completion over an e-graph allows us to represent (potentially an over-approximation of) all possible program states as a tree automaton. If every term accepted by this tree automaton satisfies certain safety invariants, we can assert this program is safe. The fine difference between equality saturation and tree automata completion can sometimes cause unexpected behavior, which I talked about in my EGRAPHS 2023 talk briefly.

Similarly, the databases community has long studied a family of algorithms called the chase, which can reason about “embedded dependencies” (this is what database people call rewrite rules) in databases. The chase literature is very deep and theoretical, and the chase has been used in many data management tasks, such as data integration, semantic query optimization, query containment checking, and so on. Our egglog paper implies that it is possible to view equality saturation as a form of the chase. Indeed, it is a very special form, as equality saturation always produces a core e-graph. In our ongoing work, we found that theorems about the chase have direct implications for equality saturation, and results about equality saturation also give simple proofs to problems concerning the chase.

The following table summarizes the connection between e-graphs, tree automata, and databases.

E-graphs Tree Automata Databases
E-nodes Transitions Tuples
E-classes States Marked nulls
Congruence closure Determinism Functional dependencies
Equality saturation TA completion Chase

In conclusion, we have gained a glimpse into the theoretical side of equality saturation and its connections with tree automata and databases. As we delve deeper into the EqSat theory project, it is evident that exploring these connections will not only enrich our understanding of EqSat but also contribute valuable insights to related fields. In Part II of this blog post, we will explore some of the concrete open problems of equality saturation. Stay tuned and see you next time!