Exploring In-Synthesis Verification of Hardware

Getting hardware right is crucial – without correct hardware, there’s no way to guarantee that software will work as intended. To raise the stakes even higher, hardware designers really only have one shot to get it right – once the hardware’s made, you can’t really patch it! To address this importance, there are many tools which use formal methods to give strong guarantees that hardware is correct. In practice, this often entails writing equivalence proofs showing that hardware behaves the same as some high-level model. In this post, we’ll take a look at the challenges I ran into while making Gator, a project I worked on which explored how to make these proofs easier!

Post-Synthesis Verification

As stated above, correct hardware is really important. To have stronger guarantees that hardware is correct, engineers often use external tools to formally verify that their hardware is correct. But what does it actually look like to formally verify the correctness of hardware? I’ll give a high-level example that roughly mirrors what I did during my time as a verification intern at Intel.

To start, we need some hardware. Let’s say we have some accelerator that computes the sum of four numbers, which are a, b, c, and d. In my experience, this is implemented in low-level Verilog. Once the hardware team has a design for this accelerator, we’ll also need a high-level model of what the accelerator should do. This can be written in any language, so let’s choose Python. A high-level model for this accelerator might look like:

# Returns the sum of four numbers at time `t`.
def model(a, b, c, d, t):
    return a + b + c + d

Great! Now we have a hardware design and a high-level model. The next step is to write a proof that the hardware design and the model are functionally equivalent. This proof is a formal verification of the hardware design. There are many different ways to write this proof, but here’s roughly what I wrote when I was at Intel:

forall a, b, c, d, t > 0:
    assert(sim_accelerator(a, b, c, d, t) == model(a, b, c, d, t))

This proof is a simple assertion that the output of the hardware design is equal to the output of the high-level model. If the assertion passes, then we can be confident that the hardware design is correct.

Notice the funky forall I tacked on at the beginning. This represents the fact that these proofs need to make sure that the hardware design is correct for all possible inputs and all possible times. The time aspect is particularly important, because it’s different from software verification.

From here, we can send this proof to tools like VossII which use techniques like symbolic trajectory evaluation to verify the proof. This style of verification is called “post-synthesis verification” because it happens after the hardware design process.

So, what’s the difficulty in doing this? The reality is that creating these models and writing these proofs is hard – far harder than the simple example above! Working with these tools not only requires intricate knowledge of hardware design, but also a deep understanding of formal methods. This is a high barrier to entry if you want a guarantee that your hardware is correct!

Gator: Exploring In-Synthesis Verification

How do we it easier to have correct hardware? One way is to move the verification process earlier in the design process. In both software and hardware, this has been the motivation behind creating compilers that are correct by construction. That is, the compiler itself is verified using formal methods to ensure that the output of the compiler is correct. In a hardware context, this means that the output of the compiler (i.e., the hardware design) is functionally equivalent to the high-level model, or the source code fed into the compiler.

This idea isn’t new – verified compilation has been around for a while. But, mainstream FPGA tools aren’t formally verified, and surveys exist which show that FPGA tools are buggy!

At UW, I was very lucky to work on FPGA compilation tool called Lakeroad which touches on this problem. I’ll defer to our paper for the full details, but one of the benefits of using Lakeroad compared to other tools is that it provides equivalence guarantees for the first few cycles. That is, there’s an implicit bounded model check each time you compile your hardware design.

But what if we could go further? Can we make it so that we have a guarantee that the hardware design is correct for all cycles? This is the question that I explored for my master’s thesis.

Before we can chat about the challenges of this problem, let’s first understand a bit more about the tools I used to explore this problem. Gator is built on top of Rosette, a solver-aided programming language designed for the type of problems Gator is trying to solve (in addition to many others!). I’ll be eliding a lot of details, but the basic idea is that Gator (and Lakeroad) use Rosette’s synthesize function, which for our purposes is a black box that essentially means “find me a program that satisfies these constraints.”

So, how do we use synthesize in the context of hardware compilation? In Lakeroad, the call to synthesize relied on an interpreter for a specialized IR we were using. The overall constraint we needed is that the hardware must be equivalent to the high-level model for the first k cycles. In effect, the call that we made to synthesize looked something like this:


def interp(model, t):
    # Interpret the hardware design at time `t`.

constraints = [
    assert(interp(hardware, 0) == interp(model, 0)),
    assert(interp(hardware, 1) == interp(model, 1)),
    ...
    assert(interp(hardware, k - 1) == interp(model, k - 1))
]


synthesize(constraints)

The key idea here is embedded in the constraints we’re passing to synthesize. synthesize must find a program that satisfies all of these constraints, which means that the low level hardware design it’s synthesizing must be equivalent to the high level model for the first k cycles.

Gator is about extending this idea to all cycles. In effect, it would be nice if we could say, for some arbitrary t >= 0, that interpreting the hardware design and the high level model are equivalent. How do we do this? In a proof assistant, or on paper, this is an inductive proof. That is, after proving the base case (t = 0), we assume that the equivalence holds for some arbitrary t = n, and then show that it holds for t = n + 1. With our current machinery, we might use Rosette’s assume function to do this as follows:

constraints = [
    # Our base case.
    assert(interp(hardware, 0) == interp(model, 0)),
    # Our inductive step.
    assume(interp(hardware, n) == interp(model, n)),
    assert(interp(hardware, n + 1) == interp(model, n + 1))
]

synthesize(constraints)

This is where we run into some challenges. Let’s talk about them!

Challenge 1: Loops!

The first challenge in this approach is that hardware designs often have loops. Registers, for example, are hardware components that store values across cycles. A register might be wired into a loop. One particularly simple example would be a register wired into itself. When trying to reason about interpreting the hardware design, we might have code like this:

def interp_register(register, t):
    if t == 0:
        return 0
    return interp_register(register, t - 1)

This type of code is typically not a problem when given a concrete value for t, because the recursion will eventually bottom out at t = 0. However, in the constraints we’re passing above, the inductive step requires us to reason about arbitrary (i.e., symbolic) values of t. In effect, what this means when we actually run the code above is that we infinitely recurse and never hit the base case. This is a big problem.

One way of solving this problem is to create a “cache” of sorts which matches specific symbolic values of t to the values we’re assuming the hardware will have. In other words, if we wanted to encode that at time t the value of a register is x, instead of writing:

assume(interp_register(register, t) == x)

we might modify our interpreter to take a cache:

cache = {
    # Notice that both the hardware and the model
    # are equal to `x` at time `t`!
    (hardware_reg, t): x
    (model_reg, t): x
}

def interp(prog, t, cache):
    # Compute the value of the model at time `t`.
    if (prog , t) in cache:
        return cache[(prog, t)]
    ...

This mechanism allows us to manually encode inductive hypotheses about the hardware design. In effect, we’re manually encoding the inductive proof that we would write on paper.

Challenge 2: Inductive Hypotheses

The solution above works in my head, but in practice it’s a bit more complicated. I see two main problems with having a cache:

  1. The cache is hard to create. When using a verified compiler, the user shouldn’t have to manually specify inductive hypotheses. Specifying these hypotheses is a non-trivial task, and is similar to a lot of the verification work Gator is trying to automate!
  2. Getting the cache right is hard. In practice, many hardware designs are pipelined, meaning that it’s not always enough to specify the value of a single instruction at time t; we might need to specify the value of multiple depending on the hardware design.

We’re reaching the limit of what I explored at my time at UW, but one possible solution to this problem is to encode hardware designs using uninterpreted functions.

The idea is: for each hardware component, create an uninterpreted function from nat -> bitvector, where nat is the time at which we’re interpreting the hardware design. From here, the hope is that by encoding that interp(node, t) == x, Rosette will be able to “push” the inductive hypotheses through the design, without manually specifying the cache.

I’ve just scratched the surface of the challenges that Gator faces. Come September, I’ll be starting my PhD at UC San Diego, where I probably won’t be working on Gator. But I’d still love to chat about it! If you’re interested, feel free to reach out to me at a7cheung@ucsd.edu. Ideally, we’ll meet somewhere in La Jolla and chat about it over tacos.