What is the phase ordering problem and can equality saturation help?

This is the first part of a two-part blog post. In this post, I’ll set up the problem we’re trying to solve, and a future post will go into more detail about how our solution works.

Compiler optimizations are typically implemented as a sequence of passes, each designed to transform the program in a particular way. For example, constant folding evaluates constant expressions at compile time and replaces them with their values (e.g., i = (1 + 2 + 3 + 4) * 5 would be replaced by i = 50); dead code elimination removes code that doesn’t affect the program’s result; and function inlining replaces a function call with the body of the called function. Passes can also compute facts, or analyses, about the program in order to enable other passes. Rather than transforming the program, these passes tag parts of the program with analysis results, which later passes can leverage. For example, a liveness analysis is required in order to perform dead code elimination.
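To make this concrete, here’s a toy constant-folding pass (in Rust, since that’s the language eggcc is written in). The Expr type and const_fold function are illustrative inventions, not real compiler code:

```rust
// A toy expression type; a real compiler would fold constants over a full IR.
#[derive(Debug)]
enum Expr {
    Num(i64),
    Add(Box<Expr>, Box<Expr>),
    Mul(Box<Expr>, Box<Expr>),
}

// Recursively evaluate any subexpression whose operands are both constants.
fn const_fold(e: Expr) -> Expr {
    match e {
        Expr::Add(a, b) => match (const_fold(*a), const_fold(*b)) {
            (Expr::Num(x), Expr::Num(y)) => Expr::Num(x + y),
            (a, b) => Expr::Add(Box::new(a), Box::new(b)),
        },
        Expr::Mul(a, b) => match (const_fold(*a), const_fold(*b)) {
            (Expr::Num(x), Expr::Num(y)) => Expr::Num(x * y),
            (a, b) => Expr::Mul(Box::new(a), Box::new(b)),
        },
        leaf => leaf,
    }
}

fn main() {
    // i = (1 + 2 + 3 + 4) * 5
    let e = Expr::Mul(
        Box::new(Expr::Add(
            Box::new(Expr::Add(
                Box::new(Expr::Add(Box::new(Expr::Num(1)), Box::new(Expr::Num(2)))),
                Box::new(Expr::Num(3)),
            )),
            Box::new(Expr::Num(4)),
        )),
        Box::new(Expr::Num(5)),
    );
    println!("{:?}", const_fold(e)); // Num(50)
}
```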

In the pass pipeline architecture, there is a fixed, predetermined sequence of passes. Each pass may transform the program, and the transformed program is then used as the input to the next pass. Consider, for example, the following expression: (a * 2) / 2. If we first apply the rule (x * y) / z --> x * (y / z) to reassociate the terms, then we can apply the rules x / x --> 1 and x * 1 --> x to find that the entire term simplifies to just a. However, if we instead applied the rule x * 2 --> x << 1 first, we would be left with (a << 1) / 2, which has no further opportunities for simplification. Does that mean x * 2 --> x << 1 is not a useful rewrite? Not in general! It was unhelpful for this particular input, but there are many other inputs where this rule would be crucial for optimization.
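Here’s a small sketch of that order dependence. Both “passes” below rewrite the tree destructively (the Expr type and both functions are illustrations, not eggcc code), so running strength reduction first destroys the pattern the simplifier needs:

```rust
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Var(String),
    Num(i64),
    Mul(Box<Expr>, Box<Expr>),
    Div(Box<Expr>, Box<Expr>),
    Shl(Box<Expr>, Box<Expr>), // left shift
}
use Expr::*;

// (x * y) / z --> x * (y / z), then x / x --> 1 and x * 1 --> x,
// handled in one step for the y == z case.
fn simplify(e: Expr) -> Expr {
    match e {
        Div(num, den) => match (*num, *den) {
            (Mul(x, y), z) if *y == z => *x,
            (n, d) => Div(Box::new(simplify(n)), Box::new(simplify(d))),
        },
        Mul(a, b) => Mul(Box::new(simplify(*a)), Box::new(simplify(*b))),
        other => other,
    }
}

// x * 2 --> x << 1
fn strength_reduce(e: Expr) -> Expr {
    match e {
        Mul(x, two) if *two == Num(2) => {
            Shl(Box::new(strength_reduce(*x)), Box::new(Num(1)))
        }
        Mul(a, b) => Mul(Box::new(strength_reduce(*a)), Box::new(strength_reduce(*b))),
        Div(a, b) => Div(Box::new(strength_reduce(*a)), Box::new(strength_reduce(*b))),
        other => other,
    }
}

fn main() {
    // (a * 2) / 2
    let e = Div(
        Box::new(Mul(Box::new(Var("a".into())), Box::new(Num(2)))),
        Box::new(Num(2)),
    );

    // Simplify first: the whole term collapses to `a`.
    println!("{:?}", simplify(e.clone())); // Var("a")

    // Strength-reduce first: (a << 1) / 2, and simplify can no longer fire.
    println!("{:?}", simplify(strength_reduce(e)));
}
```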

Then the question is: How should the passes be ordered so that they find the right opportunities for optimization for each input program?

This is known as the Phase Ordering Problem. Because each pass can destructively change the program, running one pass can either enable or prevent later passes from applying, so it is not possible to find one fixed ordering that perfectly optimizes every possible input program.

Consider, for example, the following two programs, which we’ll call P and Q:

P:
x = input()
res = abs(x) * 2
if x > 0 {
  res = res + abs(x) - x
  res = foo(x, res)
} else {
  res = res + abs(x) + x
  res = bar(x, res)
}

Q:
x = input()
res = abs(x) * 2
if x > 0 {
  res = res + abs(x)
  res = foo(x, res)
} else {
  res = res + abs(x)
  res = bar(x, res)
}

And the correctly optimized programs, P’ and Q’ (these are what we want the compiler to produce):

P':
x = input()
res = abs(x) * 2
if x > 0 {
  res = foo(x, res)
} else {
  res = bar(x, res)
}

Q':
x = input()
res = abs(x) * 3
if x > 0 {
  res = foo(x, res)
} else {
  res = bar(x, res)
}

Now consider the following two compiler passes: interval-analysis-based peepholes (A) and conditional invariant code motion (B). Optimization A can determine that x is positive in the true branch of the conditional and non-positive in the else branch, which enables the rewrites abs(x) --> x in the true branch and abs(x) --> -x in the else branch. Optimization B can determine that abs(x) is computed in both branches and move it outside the conditional. Unfortunately, neither ordering of A and B correctly optimizes both P and Q. If we run A and then B, we fail to optimize Q, because the common subexpression abs(x) is lost after applying the peephole rewrites. If we instead run B and then A, we fail to optimize P, because abs(x) is hoisted out of the branches, making it impossible to do flow-sensitive simplification in each branch.

Okay, so what should we do? What if we just try running the optimizations in both orders and see which one yields better results? That might work in this example, but it’s probably not going to work in practice. Real compilers will have dozens or even hundreds of passes that work together to optimize and lower the input program. For example, here’s a list of some of the passes in LLVM:

[Image: a partial list of LLVM’s optimization passes]

So clearly, trying all possible orders isn’t going to be feasible.

… Or is it?

Equality saturation is a technique built around a data structure called an e-graph. Starting from an input term or program P, it applies rewrite rules non-destructively, in effect storing many equivalent versions of P simultaneously. Unlike in traditional term rewriting systems, previous versions of P are never discarded, so the order in which rules are applied matters much less. An equality-saturation-based compiler can therefore mitigate the phase ordering problem described above.
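As an illustration (using egg, a Rust equality saturation library, rather than eggcc itself), here are all four rules from the earlier expression example running together. Even though mul-shift fires, the original (* a 2) is never lost, so extraction still recovers a:

```rust
use egg::{rewrite as rw, *};

fn main() {
    let rules: &[Rewrite<SymbolLang, ()>] = &[
        rw!("reassoc";   "(/ (* ?x ?y) ?z)" => "(* ?x (/ ?y ?z))"),
        rw!("div-self";  "(/ ?x ?x)" => "1"), // as in the prose; assumes ?x != 0
        rw!("mul-one";   "(* ?x 1)" => "?x"),
        rw!("mul-shift"; "(* ?x 2)" => "(<< ?x 1)"),
    ];

    // (a * 2) / 2, written as an s-expression.
    let start: RecExpr<SymbolLang> = "(/ (* a 2) 2)".parse().unwrap();

    // Apply rules non-destructively until saturation (or a resource limit).
    let runner = Runner::default().with_expr(&start).run(rules);

    // Extraction picks the cheapest equivalent term; AstSize just counts nodes.
    let extractor = Extractor::new(&runner.egraph, AstSize);
    let (cost, best) = extractor.find_best(runner.roots[0]);
    println!("{} => {} (cost {})", start, best, cost); // => a (cost 1)
}
```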

[Side note: You’ll notice I say the order of rule application doesn’t matter as much, not that it doesn’t matter at all. Rule scheduling in equality saturation engines is still a problem, largely because of resource limits: if some rules fire often and grow the e-graph quickly, you need to be careful about how often you let them run, because they can blow through the engine’s resource limits and prevent other rules from getting a chance to find useful optimizations. While we have developed some useful heuristics for writing good schedules (for example: be careful with associativity and commutativity; rules where every term on the right-hand side also appears on the left-hand side are safe, because they do not grow the e-graph), formalizing principles of rule scheduling is future work.]
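For a sense of what those resource limits look like in practice, here’s a sketch using egg’s Runner: the iteration and node limits bound the search, and the default scheduler backs off rules that match too often so that noisy rules don’t starve the rest:

```rust
use egg::{rewrite as rw, *};

fn main() {
    // Associativity and commutativity: classic e-graph growers.
    let rules: &[Rewrite<SymbolLang, ()>] = &[
        rw!("commute-mul"; "(* ?x ?y)" => "(* ?y ?x)"),
        rw!("assoc-mul";   "(* (* ?x ?y) ?z)" => "(* ?x (* ?y ?z))"),
    ];
    let start: RecExpr<SymbolLang> = "(* a (* b (* c d)))".parse().unwrap();

    let runner = Runner::default()
        .with_expr(&start)
        .with_iter_limit(8)      // at most 8 rounds of rule application
        .with_node_limit(10_000) // stop if the e-graph grows past this
        .run(rules);

    // Reports whether we saturated or hit one of the limits above.
    println!("stopped: {:?}", runner.stop_reason);
}
```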

Recent work has shown that equality saturation can be useful in a wide variety of domains (Szalinski, Herbie, Diospyros, and Enumo, just to name a few). However, most of these applications share one important similarity: they apply equality saturation to pure expressions (no effects like printing, memory access, etc.).

Equality saturation for imperative programs presents a couple of key challenges: context and linearity.

Context: E-graphs typically maximize sharing of subexpressions, which is important for keeping the e-graph size manageable, but it makes it impossible to distinguish between identical fragments that occur at different program points. This matters for imperative programs because it is often possible to apply a context-specific or flow-sensitive optimization in one place but not another. With typical eq-sat approaches, it would not be possible to optimize the term in only one place, because a single e-class in the e-graph represents both fragments. For example, consider the following program:

z = 0
y = x * 2
if x == 0 {
  z = x * 2
}
return y + z

We may want to rewrite x * 2 to 0 inside the if expression, leveraging the contextual information that the predicate x == 0 was true. However, in an e-graph without context nodes, there would be a single e-class that represents both the value of y and the value of z. As such, it would be unsound to mark that e-class equivalent to 0, since y may not be 0.
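We can see this sharing directly in an ordinary e-graph (again using egg for illustration, not eggcc’s IR): adding both occurrences of x * 2 yields the same e-class, so there is no way to equate just one of them with 0:

```rust
use egg::*;

fn main() {
    let mut egraph: EGraph<SymbolLang, ()> = EGraph::new(());

    // Hash-consing gives both occurrences of (* x 2) the same e-class.
    let y = egraph.add_expr(&"(* x 2)".parse().unwrap()); // y's right-hand side
    let z = egraph.add_expr(&"(* x 2)".parse().unwrap()); // the one inside the if
    assert_eq!(egraph.find(y), egraph.find(z));

    // Unioning this shared e-class with 0 would also claim y == 0 -- unsound.
    // Context nodes give the two occurrences separate e-classes instead.
    println!("shared e-class: {:?}", egraph.find(y));
}
```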

Linearity: Extracting an optimized program from an e-graph consists of picking the best e-node from every e-class according to a preset cost model. However, if the program has effects, extraction must be very careful to use state linearly. That is, you cannot extract an e-node from one e-class that relies on one ordering of effectful operations and an e-node from a different e-class that relies on a different ordering. Previous work on extracting effectful programs (see Peggy) uses a solver during extraction to satisfy these linearity constraints.

We have been working on a prototype eqsat-based compiler, eggcc, that addresses these challenges. We introduce a new intermediate representation (IR) with context nodes that encode context syntactically, allowing the e-graph to distinguish between identical fragments appearing at different program points and enabling flow-sensitive optimizations. We also introduce a new two-step extraction algorithm that lets us extract a program that uses memory linearly, so long as our optimization rules adhere to a simple invariant we call weak linearity. Part 2 will describe how context nodes and weak linearity work in more detail.
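As a closing illustration of what a “preset cost model” can look like, here’s a toy cost function in egg that prefers shifts over multiplies and divides. eggcc’s actual cost model, and its linearity-aware two-step extraction, are more involved and are the subject of Part 2:

```rust
use egg::*;

// A toy cost model: shifts are cheaper than multiplies and divides.
struct OpCost;

impl CostFunction<SymbolLang> for OpCost {
    type Cost = usize;
    fn cost<C>(&mut self, enode: &SymbolLang, mut costs: C) -> usize
    where
        C: FnMut(Id) -> usize,
    {
        let op_cost = match enode.op.as_str() {
            "/"  => 8,
            "*"  => 4,
            "<<" => 1,
            _    => 1, // variables and constants
        };
        // A node's cost is its own cost plus the costs of its children.
        enode.fold(op_cost, |sum, id| sum + costs(id))
    }
}

fn main() {
    let mut egraph: EGraph<SymbolLang, ()> = EGraph::new(());
    let root = egraph.add_expr(&"(* a 2)".parse().unwrap());
    let shifted = egraph.add_expr(&"(<< a 1)".parse().unwrap());
    egraph.union(root, shifted); // assert the two terms are equivalent
    egraph.rebuild();

    let extractor = Extractor::new(&egraph, OpCost);
    let (cost, best) = extractor.find_best(root);
    println!("best: {} (cost {})", best, cost); // picks (<< a 1), cost 3
}
```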