Designing high-performance zkVMs

This article is a deep-dive into proof system design for zkVMs, split into two parts. In Part 1, we give a high-level overview of the proof system that underlies RISC Zero’s zkVM, and what’s on our horizon for improving zkVM performance. In Part 2, we’ll take a closer look at each layer of the proof system, touching on design considerations with respect to innovations such as folding schemes, JOLT, Binius, and Circle STARKs.

Part 1: A Bird’s Eye View on zkVM Design

One year ago, the industry was still uncertain about the utility of a RISC-V zkVM. RISC Zero’s zkVM was the only option that would let you prove programs written in a high-level language like Rust, and builders were skeptical about giving up on circuit languages like Circom and Noir. At that time, most of the ZK industry was focused on the problem of hand-writing custom circuits for the purpose of building zkEVMs.

RISC Zero’s early hypothesis was that enabling builders to use mature languages like Rust – rather than hand-writing circuits – would make a zero-to-one difference in the viability of getting verifiable software applications to market. Our release of Zeth was our first major demonstration of this value proposition. Zeth was the first fully Ethereum-compatible zkEVM, and its simplicity made the value proposition of zkVMs clear. While other teams were fundraising hundreds of millions for zkEVM development, RISC Zero casually released the first fully Ethereum-compatible zkEVM, and it was built in roughly 2 weeks by 3 engineers. Since then, we’ve seen teams like Taiko massively accelerate their go-to-market time by aborting their zkEVM circuit development projects in favor of Zeth.

Ultimately, the key unlock for rapid development of zkEVMs was to run an existing EVM implementation inside a zkVM. Put simply, Zeth is made possible by running reth inside the zkVM. Similarly, we unlocked Bonsai Pay by using an OIDC crate in the zkVM, and unlocked zkDoom by running a Doom crate in the guest. Rather than requiring every project to build primitives from scratch, builders can simply import the relevant Rust crates and move on. This enables builders to get their project built quickly and reduces the maintenance burden to close-to-zero. To keep Zeth in line with the latest updates to Ethereum, we simply update to the latest version of reth. As further testament to the robustness of the approach used in Zeth, the Ethereum Foundation recently announced a major initiative to formally verify zkEVMs via this EVM-inside-zkVM design.

Today, there is growing alignment on the idea that zkVMs are the right way to build and deploy verifiable software applications, and growing alignment that RISC-V is the right foundation for building zkVMs. We’re thrilled to see that our hypothesis about RISC-V zkVMs has been validated by the industry-at-large, and we’ve been excited to see a number of interesting new projects emerging over the past few months.

We’re constantly being asked about our take on these newer zkVMs & newer proof systems, and whether we’re planning to adopt these techniques. When it comes to proof systems and the associated techniques, the answer generally is the same – we’re keeping an eye on all of the major projects and research in the ecosystem in order to inform our designs and our priorities. We generally see easier and more impactful performance improvements available at the engineering level than we do at the proof system design level, and we’re prioritizing accordingly.

Low Hanging Fruit for Proof System Improvements

Now that we’ve reached maturity in our API design with the release of zkVM 1.0, and deployed verifiers on a number of chains, and built a highly-efficient proving service, our primary objective for improving the zkVM is to minimize the cost of proof generation, particularly for the use cases we’re seeing most demand for – including proving for EVM rollups, OP rollups, Eigenlayer partial withdrawals, and aggregation of proofs from various proof systems. We’ve already seen a 10x cost improvement over the first half of this year, using the old-fashioned technique of Amdahl’s Law – i.e., identifying the bottleneck and addressing it. We are proud to offer the cheapest zkVM on the market, and we’re looking forward to shipping further cost reductions in the coming months.

Looking forward, the next major cost improvements we’re targeting come from integration of application-specific accelerator circuits (aka precompiles/gadgets/chiplets). By combining a RISC-V zkVM with support for integrating application-specific accelerator circuits, we enable the fastest possible development time without sacrificing prover efficiency. The recommended development process for application developers is familiar from standard software development practice: build a prototype, and then measure the performance. Observe the bottlenecks, engineer solutions to those bottlenecks, and repeat. To make this process easier, we’ve written an Optimization Guide for builders.

Once we’ve identified a significant bottleneck, we can integrate an application-specific accelerator circuit into the zkVM in order to address that bottleneck. We’re in the process of releasing an RSA accelerator, which will be followed by additional circuits for Keccak, ECDSA, pairings, and BLS12-381. These operations constitute the bottlenecks for most of the proving demand we’re seeing today. We’ll continue to identify and address bottlenecks as we go, and we have some exciting announcements coming soon with respect to enabling users to build their own accelerator circuits.

After pushing these accelerator circuits, we’re excited to release V2 of our RISC-V circuit. Compared to V1, we’re expecting a 2x reduction in the size of the circuit, due to a more efficient memory argument, better compiler technology, and various minor design improvements.

Proof System Architecture for zkVMs

The pace of technological advancement grows with the number of participants in the space, and it’s been exciting to see so many innovations in terms of zkVM design and proof system design. At this stage, RISC Zero, Succinct, and a16z each have their own implementation of RISC-V based zkVMs. Meanwhile, a number of other projects are targeting other high level ISAs such as WASM and MIPS, and others still (STARKware, Valida) are targeting custom instruction sets. And a layer beneath the zkVM design, we’ve seen a number of innovations in proof system design in the past 12 months that are similarly exciting.

Let’s outline the design for RISC Zero’s proof system, which I’ve described at zkSummit 10 and in our dev docs. This design not only underlies the RISC Zero zkVM, but also the zkEVMs that have been built by Polygon and zkSync, as well as SP1. The following diagram shows the “basic” idea, which can be divided into four parts: execution, RISC-V proving, aggregation proving, and STARK-to-SNARK proving.

Execution. The program is executed, generating a number of “segments.”

RISC-V Proving. Each segment is proven using a FRI-based prover.

Aggregation Proving. All the segment proofs are aggregated into a single proof, using a FRI-based prover.

STARK-to-SNARK Proving. In order to output a proof that’s small enough to verify on-chain, the last step in the proving system relies on an elliptic-curve based prover.

Today, we use FRI because it is extremely efficient for horizontally scalable recursive proving, and we use Groth16 because it’s very cheap to verify on-chain. We’re always on the lookout for improvements to this high-level design and its implementation, both via minor tweaks and major changes. When considering alternatives, there are a number of aspects of the system to consider, including:

Proof size

Proving cost/time

Hardware requirements for prover

Verifier cost/time

Recursive proving cost/time (i.e., running the verification algorithm inside the prover)

Parallelizability

Decentralizability

Security

Auditability

Privacy

Name Dropping: The 0STARK Protocol

To make it a bit easier to talk about our proof system, we are introducing the name 0STARK to refer to the IOP protocol described in our cryptography documentation and implemented in our risc0-zkp crate. The 0STARK protocol is very similar in structure to the proof systems described in plonky2 and in the eSTARK paper – the arithmetization scheme is AIR-based, and the commitment scheme is based on FRI. 0STARK was designed for highly efficient recursion on GPUs, and to our knowledge, it is the most efficient option for massively scalable proof generation on clusters of GPUs.
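To make “AIR-based” a bit more concrete, here is a minimal sketch of an algebraic intermediate representation for a toy Fibonacci computation over the Baby Bear field. It is not the 0STARK constraint system: the trace layout, the function names, and the row-by-row check are assumptions made purely for illustration. A real prover enforces the same kind of constraints as polynomial identities over committed trace polynomials rather than by walking the rows.

```python
# Toy AIR: a one-column execution trace with one transition constraint.
# Baby Bear prime: 2^31 - 2^27 + 1.
P = 2013265921


def fibonacci_trace(n, a0=1, a1=1):
    """Build an n-row trace where each row is the sum of the previous two (mod P)."""
    trace = [a0 % P, a1 % P]
    while len(trace) < n:
        trace.append((trace[-1] + trace[-2]) % P)
    return trace


def transition_constraint(cur, nxt, nxt2):
    """Polynomial constraint that must evaluate to 0 on every window of 3 rows."""
    return (nxt2 - nxt - cur) % P


def check_air(trace, a0=1, a1=1):
    """Check the boundary constraints and the transition constraint on all rows."""
    boundary_ok = trace[0] == a0 % P and trace[1] == a1 % P
    transitions_ok = all(
        transition_constraint(trace[i], trace[i + 1], trace[i + 2]) == 0
        for i in range(len(trace) - 2)
    )
    return boundary_ok and transitions_ok


trace = fibonacci_trace(16)
print(check_air(trace))      # honest trace satisfies every constraint

tampered = list(trace)
tampered[7] = (tampered[7] + 1) % P
print(check_air(tampered))   # a single wrong cell violates a transition
```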

Now that we have the name 0STARK in our vernacular, we can give a succinct description of the proving system that underlies our zkVM:

We use continuations to split large computations into segments.

We use 0STARK to prove the segments, producing 1 STARK per segment.

We use 0STARK to aggregate these STARKs.

Finally, we use Groth16 to output a proof which can be posted & verified on-chain.
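The four steps above can be sketched as a small pipeline. Everything here is a placeholder: the function names are hypothetical, the "proofs" are plain tuples, and the pairwise (binary-tree) aggregation order is an assumption, since the text only says that segment proofs are aggregated into a single proof. The point is the shape of the data flow, not the cryptography.

```python
def split_into_segments(total_cycles, segment_cycles):
    """Continuations: cut a long execution into bounded-size segments."""
    segments = []
    remaining = total_cycles
    while remaining > 0:
        size = min(segment_cycles, remaining)
        segments.append(size)
        remaining -= size
    return segments


def prove_segment(index, cycles):
    """Placeholder for a per-segment STARK (one proof per segment)."""
    return ("segment_proof", index, cycles)


def join(left, right):
    """Placeholder for a recursion step that verifies two proofs inside a third."""
    return ("join", left, right)


def aggregate(proofs):
    """Pairwise aggregation; returns the final proof and the number of rounds."""
    rounds = 0
    while len(proofs) > 1:
        paired = [join(proofs[i], proofs[i + 1]) for i in range(0, len(proofs) - 1, 2)]
        if len(proofs) % 2 == 1:
            paired.append(proofs[-1])  # odd proof out waits for the next round
        proofs = paired
        rounds += 1
    return proofs[0], rounds


def wrap_for_chain(stark_proof):
    """Placeholder for the final STARK-to-SNARK step."""
    return ("snark", stark_proof)


segments = split_into_segments(total_cycles=5_000_000, segment_cycles=1_048_576)
segment_proofs = [prove_segment(i, c) for i, c in enumerate(segments)]
root, rounds = aggregate(segment_proofs)
final = wrap_for_chain(root)
print(len(segments), rounds, final[0])
```

Under this assumed tree layout, every `join` within a round is independent of the others, which is what makes the aggregation layer easy to spread across many machines.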

That concludes Part 1. In the next part, we’ll dive deeper into RISC Zero’s proving architecture, and share how we’re thinking about various new techniques, including folding schemes, JOLT, Binius, and Circle STARKs.

Part 2: Balancing Tradeoffs in zkVM Design

So far, we’ve outlined the high-level design of the proof system that underlies the RISC Zero zkVM. Put simply, the system consists of 4 layers:

Execution

RISC-V Proving

Aggregation Proving

STARK-to-SNARK Proving

In terms of iterating on the zkVM, our focus is on maximizing efficiency. Let’s now dive into some discussion of how we’re thinking about the various options on the table. Spoiler: as I mentioned above, we see bigger performance gains with less effort by focusing more on engineering-level problems than proof system design.

What about Folding Schemes?

At the time of my talk at zkSummit, the major alternative to our design was based on folding schemes. The Lurk zkVM was gaining traction, and the Nexus zkVM was getting started – both of which were targeting a zkVM design based on folding schemes. When we evaluated the option of folding-based zkVMs, our conclusion was that a folding-based zkVM design would not be conducive to efficient parallelization or decentralization.

There has been some encouraging work to make folding more parallelizable, including Paranova and Cyclefold, but our conclusion remains that FRI-based systems offer a more efficient mechanism for building the large-scale distributed proving architecture we are building toward. I discuss this briefly in the last few minutes of my talk at zkSummit, and elaborate on the details further in this Napkin Math Analysis. So, we’re unlikely to pivot to a folding-based zkVM in the near term. On the other hand, some of the other recent work looks more promising for our purposes.

As we look toward the future of zkVM proving, the overall structure of Execution, Segment Proving, Aggregation Proving, STARK-to-SNARK Proving seems pretty stable. The major question is how to best handle each of these layers of the proving stack.

Before we dig into each of these layers, let’s give some high-level perspective on the efficiency of each layer. As you can see in the table below, it’s not easy to articulate a simple answer to the question of, “What’s the bottleneck?” The answer depends on whether you’re focused on minimizing time or minimizing cost, and whether you’re interested in proving small computations or proving large computations.

A surprising observation here is that the biggest pain point for users is the last step of the process – STARK-to-SNARK proving. For small proofs such as signature verifications, the STARK-to-SNARK proving takes a whopping 5x longer than the rest of the proving system combined. On CUDA today, we’re looking at ~15 seconds for STARK-to-SNARK vs. ~2.5 seconds to output a STARK proof.
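As a quick piece of napkin math, Amdahl’s Law (mentioned earlier) can be applied to the two rounded figures quoted above. With these rounded numbers the ratio works out to about 6x, in the same ballpark as the 5x mentioned. The 5x speedup of the STARK-to-SNARK step used below is a hypothetical input (15 seconds down to 3), not a measured result.

```python
import math


def amdahl_speedup(fraction, factor):
    """Overall speedup when `fraction` of the runtime becomes `factor` times faster."""
    return 1.0 / ((1.0 - fraction) + fraction / factor)


STARK_SECONDS = 2.5    # from the text: time to output a STARK proof on CUDA
SNARK_SECONDS = 15.0   # from the text: STARK-to-SNARK time on CUDA

total = STARK_SECONDS + SNARK_SECONDS
snark_share = SNARK_SECONDS / total

# Even an infinitely fast STARK prover barely moves end-to-end latency...
best_case_stark_only = amdahl_speedup(STARK_SECONDS / total, math.inf)

# ...while a (hypothetical) 5x faster STARK-to-SNARK step roughly triples throughput.
snark_5x = amdahl_speedup(snark_share, 5.0)

print(f"STARK-to-SNARK share of latency: {snark_share:.0%}")
print(f"Speedup if only the STARK side improves: {best_case_stark_only:.2f}x")
print(f"Speedup if STARK-to-SNARK gets 5x faster: {snark_5x:.2f}x")
```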

Given these challenges at the STARK-to-SNARK layer, let’s discuss that topic in more detail, and then we’ll turn our attention toward Execution, RISC-V Proving, and Aggregation Proving.

STARK-to-SNARK Proving

While the earlier parts of the proving system are optimized for efficient proof generation, the final stage is optimized for small proofs that can be efficiently verified on-chain.

And when it comes to minimizing proof size, elliptic-curve based SNARKs (Groth16, PLONK, FFLONK, etc.) are the clear winners. Elliptic-curve SNARKs have proofs on the order of hundreds of bytes, while hash-based SNARKs (e.g., FRI-based systems) have proofs on the order of hundreds of kilobytes.

As we mentioned above, this final layer turns out to be a major bottleneck for our zkVM and others. We’ve managed to bring our Groth16 proving time down to ~15 seconds on Bonsai, but even with those improvements, this is still the time bottleneck in many applications.

What we want is a functional, reliable implementation that:

Doesn’t require a circuit-specific trusted setup

Is cheap to verify on-chain

Actually works

Doesn’t take forever

Groth16 is the clear leader when it comes to prover efficiency, but it requires a circuit-specific trusted setup which we’d prefer to avoid. On paper, there are a number of good options here, including PLONK, FFLONK, and Pianist. In practice, this is another context where engineering turns out to be harder than theory. We went with Groth16 largely because the available tooling was more functional when we were ready to go to production.

After playing with the various options available today, we still haven’t found a totally satisfying approach: our current analysis suggests that we can either use a system that requires a circuit-specific trusted setup, or one that takes several minutes to generate proofs. For now, we’re opting for the fast option with the most mature tooling (Groth16). We expect to be able to bring our Groth16 proving time down to just a few seconds, by moving from rapidsnark to gnark and by tweaking the proof system parameters to shrink the proof size in the final stage of Aggregation Proving.

Execution

In May 2023, we introduced a new feature called continuations, making RISC Zero the first zkVM capable of generating proofs for arbitrarily complex computations. The basic idea is that large computations are split into a number of smaller segments, and each segment is proven separately.

To showcase the power of continuations, we released Zeth – the first Type 1 zkEVM. Before continuations, we couldn’t even prove a single Ethereum transaction. With continuations, we were suddenly able to prove entire Ethereum blocks.

Today, continuations have become a standard design pattern for zkVMs, but there’s some variance from project to project in how to handle memory consistency checks that span multiple segments. For example, suppose you write the value `5` to some memory location in the first segment, and then read from that location in the second segment. The proof system has to guarantee that the read returns `5`, even though the two segments are proven separately.

zkVMs typically enforce memory read/write operations using a permutation argument. The Prover commits to the “main trace,” then generates Fiat-Shamir randomness, and then commits to an “auxiliary trace.” In order to implement a zkVM with continuations, we need to choose between running one permutation argument per segment or running a single permutation argument that spans many segments. In other words, do we generate Fiat-Shamir randomness on a per-segment basis, or do we generate all the segments and then compute the Fiat-Shamir randomness?
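To illustrate the idea, here is a minimal sketch of a grand-product permutation argument for memory checking within a single segment. It is not RISC Zero’s memory argument: the tuple layout, the SHA-256 based Fiat-Shamir stand-in, the convention that uninitialized memory reads as 0, and all function names are assumptions. Real systems also draw the challenges from an extension field to get adequate soundness, which this sketch skips.

```python
import hashlib

P = 2013265921  # Baby Bear prime: 2^31 - 2^27 + 1


def fiat_shamir(committed, label):
    """Stand-in for Fiat-Shamir: derive a challenge by hashing the committed traces."""
    data = label.encode() + repr(committed).encode()
    return int.from_bytes(hashlib.sha256(data).digest(), "big") % P


def fingerprint(op, beta):
    """Compress one memory operation (addr, cycle, value, is_write) to a field element."""
    addr, cycle, value, is_write = op
    return (addr + beta * cycle + beta**2 * value + beta**3 * is_write) % P


def grand_product(ops, alpha, beta):
    """Product of (alpha - fingerprint); equal for any two orderings of the same ops."""
    acc = 1
    for op in ops:
        acc = acc * ((alpha - fingerprint(op, beta)) % P) % P
    return acc


def reads_consistent(sorted_ops):
    """In (addr, cycle) order, every read must return the most recent write."""
    last = {}
    for addr, _cycle, value, is_write in sorted_ops:
        if is_write:
            last[addr] = value
        elif last.get(addr, 0) != value:  # assumption: fresh memory reads as 0
            return False
    return True


def check_memory(main_trace, sorted_trace):
    # Challenges are drawn only after both orderings are fixed ("committed").
    alpha = fiat_shamir((main_trace, sorted_trace), "alpha")
    beta = fiat_shamir((main_trace, sorted_trace), "beta")
    same_multiset = grand_product(main_trace, alpha, beta) == grand_product(
        sorted_trace, alpha, beta
    )
    return same_multiset and reads_consistent(sorted_trace)


# Operations in execution order: write 5 to address 8, write 7 to 4, read 8.
main = [(8, 0, 5, 1), (4, 1, 7, 1), (8, 2, 5, 0)]
print(check_memory(main, sorted(main)))

# A read that returns the wrong value is caught by the ordering check.
bad = [(8, 0, 5, 1), (4, 1, 7, 1), (8, 2, 6, 0)]
print(check_memory(bad, sorted(bad)))
```

If a prover instead supplied a sorted trace that is not a permutation of the main trace, the two products would disagree except with small probability over the challenges, which is why the challenges must be generated after the traces are committed.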

We opt for running one permutation argument per segment. This approach introduces a bit of overhead due to paging costs, but it saves us from a number of headaches in the context of horizontally scaling our proving infrastructure. Notably, this approach allows us to begin proving each segment as soon as it is constructed. On Bonsai, we begin proving segments before we finish running the executor, which reduces time-to-prove and increases system utilization. Using a permutation argument that spans many segments means that you can’t begin proving until the last segment has been constructed.
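The scheduling consequence of that choice can be sketched with a toy event log. The executor, the event names, and the segment sizes are all hypothetical; the sketch only contrasts when proving is allowed to start under the two designs described above.

```python
def run_executor(total_cycles, segment_cycles):
    """Hypothetical executor: yields (index, cycles) as soon as a segment is complete."""
    start, index = 0, 0
    while start < total_cycles:
        cycles = min(segment_cycles, total_cycles - start)
        yield index, cycles
        start += cycles
        index += 1


def per_segment_schedule(total_cycles, segment_cycles):
    """Per-segment randomness: each segment can be proven the moment it exists."""
    events = []
    for index, _cycles in run_executor(total_cycles, segment_cycles):
        events.append(("segment_ready", index))
        events.append(("proving_started", index))
    return events


def global_schedule(total_cycles, segment_cycles):
    """Cross-segment randomness: every segment must exist before proving starts."""
    segments = list(run_executor(total_cycles, segment_cycles))
    events = [("segment_ready", index) for index, _ in segments]
    events += [("proving_started", index) for index, _ in segments]
    return events


print(per_segment_schedule(10, 4))
print(global_schedule(10, 4))
```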

RISC-V Proving & Aggregation Proving

For both RISC-V Proving & Aggregation Proving, our zkVM uses polynomial constraints defined over the Baby Bear field, paired with a FRI-based commitment scheme. Looking at this part of the proving system, there are a lot of shiny new options, including JOLT, Circle STARKs, Binius, and STIR.
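For readers less familiar with FRI, here is a minimal sketch of a single FRI folding round over Baby Bear. It is a simplification and not the risc0-zkp implementation: it works on polynomial coefficients directly, uses a fixed challenge instead of Fiat-Shamir randomness, and omits the Merkle commitments and query phase. The function names are assumptions. It needs Python 3.8+ for `pow(x, -1, P)`.

```python
P = 2013265921  # Baby Bear prime: 2^31 - 2^27 + 1


def evaluate(coeffs, x):
    """Horner evaluation of a polynomial given by its coefficients (low degree first)."""
    acc = 0
    for c in reversed(coeffs):
        acc = (acc * x + c) % P
    return acc


def fri_fold(coeffs, beta):
    """Split f(x) = f_even(x^2) + x * f_odd(x^2), return f_even + beta * f_odd.

    Expects an even number of coefficients; the result has half as many.
    """
    return [(coeffs[i] + beta * coeffs[i + 1]) % P for i in range(0, len(coeffs), 2)]


def fold_at_point(f_x, f_neg_x, x, beta):
    """What a verifier recomputes from the two queried values f(x) and f(-x)."""
    inv_2 = pow(2, -1, P)
    inv_x = pow(x, -1, P)
    even = (f_x + f_neg_x) * inv_2 % P
    odd = (f_x - f_neg_x) * inv_2 % P * inv_x % P
    return (even + beta * odd) % P


coeffs = [3, 1, 4, 1, 5, 9, 2, 6]   # a degree-7 polynomial
beta = 7                            # fixed here; a real prover derives it via Fiat-Shamir
x = 12345

folded = fri_fold(coeffs, beta)
lhs = evaluate(folded, x * x % P)
rhs = fold_at_point(evaluate(coeffs, x), evaluate(coeffs, (-x) % P), x, beta)
print(len(coeffs), "->", len(folded), "coefficients; consistent:", lhs == rhs)
```

Each round halves the degree bound, so the number of rounds grows only logarithmically with the trace length.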

We’ve looked into all of these options, but in terms of balancing our engineering priorities, it doesn’t make sense for us to adopt a new proof system unless we see a massive upside. At this stage, we see bigger upsides by integrating application-specific accelerator circuits, continuing to optimize our GPU kernels, and shipping V2 of our RISC-V circuit than we do from proof system re-design.

To give some more context on the design space, let’s take a closer look at a few of the newer techniques that are available.

JOLT

JOLT uses lookup tables in order to minimize the number of hand-written polynomial constraints.

This offers a major benefit in terms of reducing code complexity and improving auditability, and the use of Sumcheck/GKR is always an attractive option in terms of Prover efficiency. Unfortunately, the underlying proof system – LASSO – results in higher verifier complexity (and worse recursion) than FRI.
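The lookup-centric idea can be illustrated with a toy example: instead of writing polynomial constraints that describe a 32-bit XOR, split the operands into small chunks and read each chunk’s result from a precomputed subtable. The 8-bit chunk size, the table layout, and the function name are assumptions for illustration and are not taken from the JOLT codebase; a real system would additionally prove that each lookup really hits the table.

```python
CHUNK_BITS = 8
CHUNK_MASK = (1 << CHUNK_BITS) - 1

# Subtable holding XOR for every pair of 8-bit chunks, indexed by (a << 8) | b.
XOR_SUBTABLE = [a ^ b for a in range(1 << CHUNK_BITS) for b in range(1 << CHUNK_BITS)]


def xor32_via_lookups(x, y):
    """Evaluate a 32-bit XOR using four lookups into the small subtable."""
    result = 0
    for i in range(32 // CHUNK_BITS):
        a = (x >> (CHUNK_BITS * i)) & CHUNK_MASK
        b = (y >> (CHUNK_BITS * i)) & CHUNK_MASK
        result |= XOR_SUBTABLE[(a << CHUNK_BITS) | b] << (CHUNK_BITS * i)
    return result


x, y = 0xDEADBEEF, 0x12345678
print(hex(xor32_via_lookups(x, y)), xor32_via_lookups(x, y) == x ^ y)
```

A full table for a 32-bit binary operation would have 2^64 entries, so decomposing into small subtables is what makes the approach practical in this sketch.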

JOLT is in the process of moving from LASSO to Binius, which makes this lookup-centric design much more compelling. We’ll definitely be keeping an eye on the progress here!

To learn more about JOLT, check out my notes or the original release announcement.

Circle STARKs

Whereas our zkVM is built using a 31-bit field called Baby Bear, STARKware and Plonky3 are working on a transition to use a different 31-bit field called Mersenne31, using a technique called Circle STARKs. According to the authors, this approach offers a 30% improvement over Baby Bear, in the context of CPU proving.

But our system is designed and optimized for GPU proving, and we don’t expect this 30% improvement to translate to the GPU context we’re optimizing for – not to mention the fact that we’d likely need to build an entirely new set of GPU kernels in order to find out.

Let’s step back and take a broader look at the context behind Baby Bear and Mersenne31. One of the major motifs in zkVM design over the past few years has been to move from large fields to smaller fields. STARKware’s Stone prover is built over a 251-bit field. Using such large fields drives what the Binius authors call embedding overhead. Despite the fact that most of the values being used in zkVMs are bits or bytes, users are forced to pay for the cost of an entire field element.

Mir Protocol, which eventually became plonky2, reduced the significance of embedding overhead when they introduced the technique of small-field STARKs, which operated over a 64-bit field known as “Goldilocks” or “Oxfoi.” RISC Zero took this a step further, building our zkVM over a 31-bit field, called “Baby Bear.” Circle STARKs follows suit with the choice to use a 31-bit field, but targets a different 31-bit prime than we chose.
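A rough way to see the trend is to compute how many field bits are spent per bit of useful data. This is only napkin math under a simplifying assumption (one value per field element, sizes taken from the bit widths quoted above); actual prover cost depends on much more than storage.

```python
# Field sizes in bits, as quoted in the text.
FIELD_BITS = {
    "251-bit field (Stone)": 251,
    "64-bit field (Goldilocks)": 64,
    "31-bit field (Baby Bear)": 31,
}


def embedding_overhead(field_bits, value_bits):
    """Field bits paid per bit of actual data when one value fills one element."""
    return field_bits / value_bits


for name, bits in FIELD_BITS.items():
    print(
        f"{name}: {embedding_overhead(bits, 1):.1f}x for a single bit, "
        f"{embedding_overhead(bits, 8):.1f}x for a byte"
    )
```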

M31 is defined by the particularly elegant prime 2^31 - 1. To my knowledge, Daniel Lubarov first hinted at the use of M31 at ETH Denver 2023, when introducing Plonky3. It’s a particularly nice field in terms of efficient multiplication, but it’s not nice with respect to NTTs.
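Both halves of that trade-off can be checked with a few lines of arithmetic. Reduction modulo a Mersenne prime needs only shifts, masks, and adds, which is what makes multiplication cheap. Standard NTTs, on the other hand, need a large power-of-two subgroup in the multiplicative group, i.e. many factors of 2 in p - 1, and M31 has just one. The helper names below are assumptions; the primes are the standard definitions of M31, Baby Bear, and Goldilocks.

```python
M31 = 2**31 - 1
BABY_BEAR = 2**31 - 2**27 + 1
GOLDILOCKS = 2**64 - 2**32 + 1


def m31_reduce(x):
    """Reduce 0 <= x < 2^62 modulo 2^31 - 1 using only shifts, masks, and adds."""
    x = (x & M31) + (x >> 31)   # uses 2^31 = 1 (mod M31); result is below 2^32
    x = (x & M31) + (x >> 31)   # second pass brings it to at most 2^31
    return x - M31 if x >= M31 else x


def two_adicity(n):
    """Largest k such that 2^k divides n."""
    k = 0
    while n % 2 == 0:
        n //= 2
        k += 1
    return k


a, b = 1_234_567_890, 2_000_000_011
print(m31_reduce(a * b) == (a * b) % M31)

for name, p in [("M31", M31), ("Baby Bear", BABY_BEAR), ("Goldilocks", GOLDILOCKS)]:
    print(f"{name}: p - 1 is divisible by 2^{two_adicity(p - 1)}")
```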

Later in 2023, Polygon researchers addressed the challenge of NTTs over M31 with their work in Reed-Solomon Codes over the Circle Group, which paved the way for Circle STARKs. To learn more about Circle STARKs, check out these introductions to the topic by Vitalik, STARKware, and Lambdaworks.

Binius

Binius takes the idea of small fields to the limit, targeting Towers of Binary Fields. Binary fields are extremely hardware-efficient, and Binius makes use of a packing technique that removes the “embedding overhead” that I described above. As with JOLT, our first impression at the release of Binius was along the lines of, “This looks great for RISC-V proving, but not so much for aggregation.”

In fact, this same sentiment held true for systems like Ligero, Orion, and Brakedown – the proving time looks great, but the verifier complexity is larger than FRI, which results in poor recursion performance.

More recently, the team behind Binius has introduced a variant of FRI that works natively with Towers of Binary Fields, based on insights and techniques from BaseFold (among others). With this addition, Binius offers a very attractive option – removing the embedding overhead from the Segment Proving layer without causing a blow-up in costs at the Aggregation Proving layer.

At this stage, the Binius team is working on their recursion layer, while JOLT is working on integrating Binius into their RISC-V based proving architecture. Still too early for us to want to adopt, but this is another project we’re definitely keeping an eye on!

To learn more about Binius, check out these introductions from Vitalik and Lambdaworks, or Jim Posen’s talk at zkSummit 11.

Betting on the Future

RISC Zero is in a very interesting position with respect to all of this work. In the near term, we expect to be more efficient than either Circle STARKs or Binius due to the maturity of our overall system – our system operates on GPUs rather than CPUs, and we’ve had almost 2 years to dial in the cost-savings in the context of running proving clusters.

But ignoring the code maturity and looking just at the underlying proof system, Circle STARKs and Binius both look promising. In the long term, we’d love to see our tooling be as modular as possible – we’d love to let users choose between proof systems when running the zkVM. But in practice, everything we build and everything we support has an engineering burden and a maintenance burden.

Currently, we see STARKware and Polygon each spending substantial resources building their own implementations of Circle STARKs, while Irreducible is building out the aggregation proving (i.e., recursion) for Binius.

It’s still too early to tell whether M31 or Baby Bear or Towers of Binary Fields are “the best option.” In reality, the question is largely context-dependent. So for now, rather than re-inventing our foundations – which would both require us to substantially slow down our engineering progress and make a huge bet about whether to go with Circle STARKs or Binius – we’re going to stick with Baby Bear in order to continue shipping obvious iterative improvements to our system, while expanding our ability to support real use cases for real users.

The temptation to switch to the new shiny proof system is constantly present. But we still haven’t gotten around to implementing some very low-hanging fruit such as grinding or LogUp, because Amdahl’s Law has demanded our attention elsewhere. Similarly, STIR looks like a clear improvement over FRI, but we’re focusing our attention elsewhere because recursion isn’t our bottleneck, and FRI is only a small portion of the costs for segment proving. When we get to a point where we see a better ROI from proof system re-design than we do from engineering work, we’ll re-design the proof system.

Until then, our general plan is pretty simple: just keep shipping 🚢.

Thanks for reading!

Hopefully, this article has been helpful in contextualizing what’s happening in the zkVM design space. After a long while as the only RISC-V zkVM in the ecosystem, it’s exciting to have some brilliant competition entering the space. We’re still just beginning to enter the era when verifiable computation is affordable enough to be justifiable for most use cases. We’re excited to see those costs dropping due to improvements at all levels, and we’re looking forward to growing together.

💡

Questions, comments, corrections, or suggestions?

Find us on Discord.
