A Function Inliner for Wasmtime and Cranelift

Function inlining is one of the most important compiler optimizations, not because of its direct effects, but because of the follow-up optimizations it unlocks. It may reveal, for example, that an otherwise-unknown function parameter value is bound to a constant argument, which makes a conditional branch unconditional, which in turn exposes that the function will always return the same value. Inlining is the catalyst of modern compiler optimization.

Note: This is cross-posted from my personal blog.

Wasmtime is a WebAssembly runtime that focuses on safety and fast Wasm execution. But despite that focus on speed, Wasmtime has historically chosen not to perform inlining in its optimizing compiler backend, Cranelift. There were two reasons for this surprising decision: first, Cranelift is a per-function compiler designed such that Wasmtime can compile all of a Wasm module’s functions in parallel. Inlining is inter-procedural and requires synchronization between function compilations; that synchronization reduces parallelism. Second, Wasm modules are generally produced by an optimizing toolchain, like LLVM, that already did all the beneficial inlining. Any calls remaining in the module will not benefit from inlining — perhaps they are on slow paths marked [[unlikely]] or the callee is annotated with #[inline(never)]. But WebAssembly’s component model changes this calculus.

With the component model, developers can compose multiple Wasm modules — each produced by different toolchains — into a single program. Those toolchains only had a local view of the call graph, limited to their own module, and they couldn’t see cross-module or fused adapter function definitions. None of them, therefore, had an opportunity to inline calls to such functions. Only the Wasm runtime’s compiler, which has the final, complete call graph and function definitions in hand, has that opportunity.

Therefore we implemented function inlining in Wasmtime and Cranelift. Its initial implementation landed in Wasmtime version 36; however, it remains off by default and is still baking. You can test it out via the -C inlining=y command-line flag or the wasmtime::Config::compiler_inlining method. The rest of this article describes function inlining in more detail, digs into the guts of our implementation and the rationale for its design choices, and finally looks at some early performance results.
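If you are embedding Wasmtime as a library rather than using the CLI, the same switch lives on the configuration builder. Here is a minimal sketch; the exact shape of compiler_inlining is assumed here to match Wasmtime's other boolean Config toggles:

use wasmtime::{Config, Engine, Result};

fn main() -> Result<()> {
    let mut config = Config::new();
    // Turn on the (currently off-by-default) Cranelift inlining pass.
    config.compiler_inlining(true);
    let engine = Engine::new(&config)?;
    // ... compile and run components with `engine` as usual ...
    let _ = engine;
    Ok(())
}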

Function Inlining

Function inlining is a compiler optimization where a call to a function f is replaced by a copy of f’s body. This removes function call overheads (spilling caller-save registers, setting up the call frame, etc…) which can be beneficial on its own. But inlining’s main benefits are indirect: it enables subsequent optimization of f’s body in the context of the call site. That context is important — a parameter’s previously unknown value might be bound to a constant argument and exposing that to the optimizer might cascade into a large code clean up.

Consider the following example, where function g calls function f:

fn f(x: u32) -> bool {
    return x < u32::MAX / 2;
}

fn g() -> u32 {
    let a = 42;
    if f(a) {
        return a;
    } else {
        return 0;
    }
}

After inlining the call to f, function g looks something like this:

fn g() -> u32 {
    let a = 42;

    let x = a;
    let f_result = x < u32::MAX / 2;

    if f_result {
        return a;
    } else {
        return 0;
    }
}

Now the whole subexpression that defines f_result only depends on constant values, so the optimizer can replace that subexpression with its known value:

fn g() -> u32 {
    let a = 42;

    let f_result = true;
    if f_result {
        return a;
    } else {
        return 0;
    }
}

This reveals that the if-else conditional will, in fact, unconditionally transfer control to the consequent, and g can be simplified into the following:

fn g() -> u32 {
    let a = 42;
    return a;
}

In isolation, inlining f was a marginal transformation. When considered holistically, however, it unlocked a plethora of subsequent simplifications that ultimately led to g returning a constant value rather than computing anything at run-time.

Implementation

Cranelift’s unit of compilation is a single function, which Wasmtime leverages to compile each function in a Wasm module in parallel, speeding up compile times on multi-core systems. But inlining a function at a particular call site requires that function’s definition, which implies parallelism-hurting synchronization or some other compromise, like additional read-only copies of function bodies. So this was the first goal of our implementation: to preserve as much parallelism as possible.

Additionally, although Cranelift is primarily developed for Wasmtime by Wasmtime’s developers, it is independent from Wasmtime. It is a reusable library and is reused, for example, by the Rust project as an alternative backend for rustc. But a large part of inlining, in practice, is the heuristics for deciding when inlining a call is likely beneficial, and those heuristics can be domain specific. Wasmtime generally wants to leave most calls out-of-line, inlining only cross-module calls, while rustc wants something much more aggressive to boil away its Iterator combinators and the like. So our second implementation goal was to separate how we inline a function call from the decision of whether to inline that call.

These goals led us to a layered design where Cranelift has an optional inlining pass, but the Cranelift embedder (e.g. Wasmtime) must provide a callback to it. The inlining pass invokes the callback for each call site, and the callback returns a command: either “leave the call as-is” or “here is a function body, replace the call with it”. Cranelift is responsible for the inlining transformation and the embedder is responsible for deciding whether to inline a function call and, if so, getting that function’s body (along with whatever synchronization that requires).
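To make that layering concrete, here is a hypothetical sketch of what such a callback interface could look like; the names and types are illustrative, not Cranelift's actual API:

// The embedder's answer for a single call site.
enum InlineCommand<'a> {
    // Leave the call as an out-of-line call.
    KeepCall,
    // Replace the call with a copy of this function body.
    Inline(&'a FunctionBody),
}

// Stand-in for Cranelift's IR-level function representation.
struct FunctionBody;

// Opaque identifier for the callee at a call site.
#[derive(Clone, Copy)]
struct FuncRef(u32);

// Implemented by the embedder (e.g. Wasmtime); the inlining pass would call
// it once per call site it visits.
trait InlineOracle {
    fn inline_call(&self, callee: FuncRef) -> InlineCommand<'_>;
}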

The mechanics of the inlining transformation — wiring arguments to parameters, renaming values, and copying instructions and basic blocks into the caller — are, well, mechanical. Cranelift makes extensive use of arenas for various entities in its IR, and we begin by appending the callee’s arenas to the caller’s arenas, renaming entity references from the callee’s arena indices to their new indices in the caller’s arenas as we do so. Next we copy the callee’s block layout into the caller and replace the original call instruction with a jump to the caller’s inlined version of the callee’s entry block. Cranelift uses block parameters, rather than phi nodes, so the call arguments simply become jump arguments. Finally, we translate each instruction from the callee into the caller. This is done via a pre-order traversal to ensure that we process value definitions before value uses, simplifying instruction operand rewriting. The changes to Wasmtime’s compilation orchestration are more interesting.

The following pseudocode describes Wasmtime’s compilation orchestration before Cranelift gained an inlining pass and also when inlining is disabled:

// Compile each function in parallel.
let objects = parallel map for func in wasm.functions {
    compile(func)
};

// Combine the functions into one region of executable memory, resolving
// relocations by mapping function references to PC-relative offsets.
return link(objects)

The naive way to update that process to use Cranelift’s inlining pass might look something like this:

// Optionally perform some pre-inlining optimizations in parallel.
parallel for func in wasm.functions {
    pre_optimize(func);
}

// Do inlining sequentially.
for func in wasm.functions {
    func.inline(|f| if should_inline(f) {
        Some(wasm.functions[f])
    } else {
        None
    })
}

// And then proceed as before.
let objects = parallel map for func in wasm.functions {
    compile(func)
};
return link(objects)

Inlining is performed sequentially, rather than in parallel, which is a bummer. But if we tried to make that loop parallel by logically running each function’s inlining pass in its own thread, then a callee function we are inlining might or might not have had its transitive function calls inlined already depending on the whims of the scheduler. That leads to non-deterministic output, and our compilation must be deterministic, so it’s a non-starter.1 But whether a function has already had transitive inlining done or not leads to another problem.

With this naive approach, we are either limited to one layer of inlining or else potentially duplicating inlining effort, repeatedly inlining e into f each time we inline f into g, h, and i. This is because f may come before or after g in our wasm.functions list. We would prefer it if f already contained e and was already optimized accordingly, so that every caller of f didn’t have to redo that same work when inlining calls to f.

This suggests we should topologically sort our functions based on their call graph, so that we inline in a bottom-up manner, from leaf functions (those that do not call any others) towards root functions (those that are not called by any others, typically main and other top-level exported functions). Given a topological sort, we know that whenever we are inlining f into g either (a) f has already had its own inlining done or (b) f and g participate in a cycle. Case (a) is ideal: we aren’t repeating any work because it’s already been done. Case (b), when we find cycles, means that f and g are mutually recursive. We cannot fully inline recursive calls in general (just as you cannot fully unroll a loop in general) so we will simply avoid inlining these calls.2 So topological sort avoids repeating work, but our inlining phase is still sequential.

At the heart of our proposed topological sort is a call graph traversal that visits callees before callers. To parallelize inlining, you could imagine that, while traversing the call graph, we track how many still-uninlined callees each caller function has. Then we batch all functions whose associated counts are currently zero (i.e. they aren’t waiting on anything else to be inlined first) into a layer and process them in parallel. Next, we decrement each of their callers’ counts and collect the next layer of ready-to-go functions, continuing until all functions have been processed.

let call_graph = CallGraph::new(wasm.functions);

let counts = { f: call_graph.num_callees_of(f) for f in wasm.functions };

let layer = [ f for f in wasm.functions if counts[f] == 0 ];
while layer is not empty {
    parallel for func in layer {
        func.inline(...);
    }

    let next_layer = [];
    for func in layer {
        for caller in call_graph.callers_of(func) {
            counts[caller] -= 1;
            if counts[caller] == 0 {
                next_layer.push(caller)
            }
        }
    }
    layer = next_layer;
}

This algorithm will leverage available parallelism, and it avoids repeating work via the same dependency-based scheduling that topological sorting did, but it has a flaw. It will not terminate when it encounters recursion cycles in the call graph. If function f calls function g which also calls f, for example, then it will not schedule either of them into a layer because they are both waiting for the other to be processed first. One way we can avoid this problem is by avoiding cycles.

If you partition a graph’s nodes into disjoint sets, where each set is a maximal group of nodes that are all reachable from one another, you get that graph’s strongly-connected components (SCCs). If a node does not participate in a cycle, then it will be in its own singleton SCC. The members of a cycle, on the other hand, will all be grouped into the same SCC, since those nodes are all reachable from each other.

In the following example, the dotted boxes designate the graph’s SCCs:

Ignoring edges between nodes within the same SCC, and only considering edges across SCCs, gives us the graph’s condensation. The condensation is always acyclic, because the original graph’s cycles are “hidden” within the SCCs.

Here is the condensation of the previous example:

We can adapt our parallel-inlining algorithm to operate on strongly-connected components, and now it will correctly terminate because we’ve removed all cycles. First, we find the call graph’s SCCs and create the reverse (or transpose) condensation, where an edge a→b is flipped to b→a. We do this because we will query this graph to find the callers of a given function f, not the functions that f calls. I am not aware of an existing name for the reverse condensation, so, at Chris Fallin’s brilliant suggestion, I have decided to call it an evaporation. From there, the algorithm largely remains as it was before, although we keep track of counts and layers by SCC rather than by function.

let call_graph = CallGraph::new(wasm.functions);
let components = StronglyConnectedComponents::new(call_graph);
let evaporation = Evaporation::new(components);

let counts = { c: evaporation.num_callees_of(c) for c in components };

let layer = [ c for c in components if counts[c] == 0 ];
while layer is not empty {
    parallel for func in scc in layer {
        func.inline(...);
    }

    let next_layer = [];
    for scc in layer {
        for caller_scc in evaporation.callers_of(scc) {
            counts[caller_scc] -= 1;
            if counts[caller_scc] == 0 {
                next_layer.push(caller_scc);
            }
        }
    }
    layer = next_layer;
}

This is the algorithm we use in Wasmtime, modulo minor tweaks here and there to engineer some data structures and combine some loops. After parallel inlining, the rest of the compiler pipeline continues in parallel for each function, yielding unlinked machine code. Finally, we link all that together and resolve relocations, same as we did previously.
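For a concrete feel of the shape of this scheduling, here is a rough, self-contained sketch in ordinary Rust using the petgraph and rayon crates (both are assumptions of this sketch; Wasmtime's real implementation uses its own purpose-built data structures and Cranelift's inlining pass). It also assumes call-graph edges point from caller to callee and that inline_one performs one function's inlining:

use petgraph::algo::condensation;
use petgraph::graph::{DiGraph, NodeIndex};
use petgraph::Direction;
use rayon::prelude::*;

type FuncId = u32;

fn parallel_inline(call_graph: DiGraph<FuncId, ()>, inline_one: impl Fn(FuncId) + Sync) {
    // Collapse cycles: each node of the condensation is an SCC (a Vec of the
    // original function ids), and the condensation itself is acyclic.
    let cond = condensation(call_graph, true);

    // An SCC's "callees" are its successors in the condensation (edges point
    // caller -> callee); it is ready once all of them have been processed.
    let mut remaining: Vec<usize> = cond
        .node_indices()
        .map(|scc| cond.neighbors_directed(scc, Direction::Outgoing).count())
        .collect();

    // The first layer: SCCs that are not waiting on anything.
    let mut layer: Vec<NodeIndex> = cond
        .node_indices()
        .filter(|&scc| remaining[scc.index()] == 0)
        .collect();

    while !layer.is_empty() {
        // Every function in this layer only calls already-processed functions
        // (or members of its own SCC), so the whole layer can run in parallel.
        layer
            .par_iter()
            .flat_map(|&scc| cond[scc].par_iter().copied())
            .for_each(|func| inline_one(func));

        // Callers of this layer may now be ready; they form the next layer.
        let mut next = Vec::new();
        for &scc in &layer {
            for caller in cond.neighbors_directed(scc, Direction::Incoming) {
                remaining[caller.index()] -= 1;
                if remaining[caller.index()] == 0 {
                    next.push(caller);
                }
            }
        }
        layer = next;
    }
}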

Heuristics are the only implementation detail left to discuss, but there isn’t much to say that hasn’t already been said. Wasmtime prefers not to inline calls within the same Wasm module, while cross-module calls are a strong hint that we should consider inlining. Beyond that, our heuristics are extremely naive at the moment, and only consider the code sizes of the caller and callee functions. There is a lot of room for improvement here, and we intend to make those improvements on-demand as people start playing with the inliner. For example, there are many things we don’t consider in our heuristics today, but possibly should:

  • Hints from WebAssembly’s compilation-hints proposal
  • The number of edges to a callee function in the call graph
  • Whether any of a call’s arguments are constants
  • Whether the call is inside a loop or a block marked as “cold”
  • Etc…
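Returning to what the heuristic does today: a hedged sketch of the kind of size-based, cross-module-only decision described above might look like the following. The thresholds and the notion of function size here are illustrative assumptions, not Wasmtime's actual values:

// Rough summary of a call site, as an inlining heuristic might see it.
struct CallSite {
    cross_module: bool,
    caller_size: usize, // e.g. an approximate instruction count
    callee_size: usize,
}

// Illustrative cutoffs; Wasmtime's real thresholds differ.
const SMALL_CALLEE: usize = 50;
const MAX_CALLER: usize = 2_000;

fn should_inline(site: &CallSite) -> bool {
    site.cross_module && site.callee_size <= SMALL_CALLEE && site.caller_size <= MAX_CALLER
}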

Some Initial Results

The speed-up you get (or don’t get) from enabling inlining is going to vary from program to program. Here are a couple of synthetic benchmarks.

First, let’s investigate the simplest case possible, a cross-module call of an empty function in a loop:

(component
  ;; Define one module, exporting an empty function `f`.
  (core module $M
    (func (export "f")
      nop
    )
  )

  ;; Define another module, importing `f`, and exporting a function
  ;; that calls `f` in a loop.
  (core module $N
    (import "m" "f" (func $f))
    (func (export "g") (param $counter i32)
      (loop $loop
        ;; When counter is zero, return.
        (if (i32.eq (local.get $counter) (i32.const 0))
          (then (return)))
        ;; Do our cross-module call.
        (call $f)
        ;; Decrement the counter and continue to the next iteration
        ;; of the loop.
        (local.set $counter (i32.sub (local.get $counter)
                                     (i32.const 1)))
        (br $loop))
    )
  )

  ;; Instantiate and link our modules.
  (core instance $m (instantiate $M))
  (core instance $n (instantiate $N (with "m" (instance $m))))

  ;; Lift and export the looping function.
  (func (export "g") (param "n" u32)
    (canon lift (core func $n "g"))
  )
)

We can inspect the machine code that this compiles down to via the wasmtime compile and wasmtime objdump commands. Let’s focus only on the looping function. Without inlining, we see a loop around a call, as we would expect:

00000020 wasm[1]::function[1]:
        ;; Function prologue.
        20: pushq   %rbp
        21: movq    %rsp, %rbp

        ;; Check for stack overflow.
        24: movq    8(%rdi), %r10
        28: movq    0x10(%r10), %r10
        2c: addq    $0x30, %r10
        30: cmpq    %rsp, %r10
        33: ja      0x89

        ;; Allocate this function's stack frame, save callee-save
        ;; registers, and shuffle some registers.
        39: subq    $0x20, %rsp
        3d: movq    %rbx, (%rsp)
        41: movq    %r14, 8(%rsp)
        46: movq    %r15, 0x10(%rsp)
        4b: movq    0x40(%rdi), %rbx
        4f: movq    %rdi, %r15
        52: movq    %rdx, %r14

        ;; Begin loop.
        ;;
        ;; Test our counter for zero and break out if so.
        55: testl   %r14d, %r14d
        58: je      0x72
        ;; Do our cross-module call.
        5e: movq    %r15, %rsi
        61: movq    %rbx, %rdi
        64: callq   0
        ;; Decrement our counter.
        69: subl    $1, %r14d
        ;; Continue to the next iteration of the loop.
        6d: jmp     0x55

        ;; Function epilogue: restore callee-save registers and
        ;; deallocate this function's stack frame.
        72: movq    (%rsp), %rbx
        76: movq    8(%rsp), %r14
        7b: movq    0x10(%rsp), %r15
        80: addq    $0x20, %rsp
        84: movq    %rbp, %rsp
        87: popq    %rbp
        88: retq

        ;; Out-of-line traps.
        89: ud2
            ╰─╼ trap: StackOverflow

When we enable inlining, then M::f gets inlined into N::g. Despite N::g becoming a leaf function, we will still push %rbp and all that in the prologue and pop it in the epilogue, because Wasmtime always enables frame pointers. But because it no longer needs to shuffle values into ABI argument registers or allocate any stack space, it doesn’t need to do any explicit stack checks, and nearly all the rest of the code also goes away. All that is left is a loop decrementing a counter to zero:3

00000020 wasm[1]::function[1]:
        ;; Function prologue.
        20: pushq   %rbp
        21: movq    %rsp, %rbp

        ;; Loop.
        24: testl   %edx, %edx
        26: je      0x34
        2c: subl    $1, %edx
        2f: jmp     0x24

        ;; Function epilogue.
        34: movq    %rbp, %rsp
        37: popq    %rbp
        38: retq

With this simplest of examples, we can just count the difference in number of instructions in each loop body:

  • 12 without inlining (7 in N::g and 5 in M::f which are 2 to push the frame pointer, 2 to pop it, and 1 to return)
  • 4 with inlining

But we might as well verify that the inlined version really is faster via some quick-and-dirty benchmarking with hyperfine. This won’t measure only Wasm execution time, it also measures spawning a whole Wasmtime process, loading code from disk, etc…, but it will work for our purposes if we crank up the number of iterations:

$ hyperfine \
    "wasmtime run --allow-precompiled -Cinlining=n --invoke 'g(100000000)' no-inline.cwasm" \
    "wasmtime run --allow-precompiled -Cinlining=y --invoke 'g(100000000)' yes-inline.cwasm"

Benchmark 1: wasmtime run --allow-precompiled -Cinlining=n --invoke 'g(100000000)' no-inline.cwasm
  Time (mean ± σ):     138.2 ms ±   9.6 ms    [User: 132.7 ms, System: 6.7 ms]
  Range (min … max):   128.7 ms … 167.7 ms    19 runs

Benchmark 2: wasmtime run --allow-precompiled -Cinlining=y --invoke 'g(100000000)' yes-inline.cwasm
  Time (mean ± σ):      37.5 ms ±   1.1 ms    [User: 33.0 ms, System: 5.8 ms]
  Range (min … max):    35.7 ms …  40.8 ms    77 runs

Summary
  'wasmtime run --allow-precompiled -Cinlining=y --invoke 'g(100000000)' yes-inline.cwasm' ran
    3.69 ± 0.28 times faster than 'wasmtime run --allow-precompiled -Cinlining=n --invoke 'g(100000000)' no-inline.cwasm'

Okay so if we measure Wasm doing almost nothing but empty function calls and then we measure again after removing function call overhead, we get a big speed up — it would be disappointing if we didn’t! But maybe we can benchmark something a tiny bit more realistic.

A program that we commonly reach for when benchmarking is a small wrapper around the pulldown-cmark markdown library that parses the CommonMark specification (which is itself written in markdown) and renders that to HTML. This is Real World™ code operating on Real World™ inputs that matches Real World™ use cases people have for Wasm. That is, good benchmarking is incredibly difficult, but this program is nonetheless a pretty good candidate for inclusion in our corpus. There’s just one hiccup: in order for our inliner to activate normally, we need a program using components and making cross-module calls, and this program doesn’t do that. But we don’t have a good corpus of such benchmarks yet because this kind of component composition is still relatively new, so let’s keep using our pulldown-cmark program but measure our inliner’s effects via a more circuitous route.

Wasmtime has tunables to enable the inlining of intra-module calls4 and rustc and LLVM have tunables for disabling inlining5. Therefore we can roughly estimate the speed ups our inliner might unlock on a similar, but extensively componentized and cross-module calling, program by:

  • Disabling inlining when compiling the Rust source code to Wasm

  • Compiling the resulting Wasm binary to native code with Wasmtime twice: once with inlining disabled, and once with intra-module call inlining enabled

  • Comparing those two different compilations’ execution speeds

Running this experiment with Sightglass, our internal benchmarking infrastructure and tooling, yields the following results:

execution :: instructions-retired :: pulldown-cmark.wasm

  Δ = 7329995.35 ± 2.47 (confidence = 99%)

  with-inlining is 1.26x to 1.26x faster than without-inlining!

  [35729153 35729164.72 35729173] without-inlining
  [28399156 28399169.37 28399179] with-inlining

Conclusion

Wasmtime and Cranelift now have a function inliner! Test it out via the -C inlining=y command-line flag or via the wasmtime::Config::compiler_inlining method. Let us know if you run into any bugs or whether you see any speed-ups when running Wasm components containing multiple core modules.

Thanks to Chris Fallin and Graydon Hoare for reading early drafts of this piece and providing valuable feedback. Any errors that remain are my own.

  1. Deterministic compilation gives a number of benefits: testing is easier, debugging is easier, builds can be byte-for-byte reproducible, it is well-behaved in the face of incremental compilation and fine-grained caching, etc… 

  2. For what it is worth, this still allows collapsing chains of mutually-recursive calls (a calls b calls c calls a) into a single, self-recursive call (abc calls abc). Our actual implementation does not do this in practice, preferring additional parallelism instead, but it could in theory. 

  3. Cranelift cannot currently remove loops without side effects, and generally doesn’t mess with control-flow at all in its mid-end. We’ve had various discussions about how we might best fit control-flow-y optimizations into Cranelift’s mid-end architecture over the years, but it also isn’t something that we’ve seen would be very beneficial for actual, Real World™ Wasm programs, given that (a) LLVM has already done much of this kind of thing when producing the Wasm, and (b) we do some branch-folding when lowering from our mid-level IR to our machine-specific IR. Maybe we will revisit this sometime in the future if it crops up more often after inlining. 

  4. -C cranelift-wasmtime-inlining-intra-module=yes 

  5. -Cllvm-args=--inline-threshold=0, -Cllvm-args=--inlinehint-threshold=0, and -Zinline-mir=no 

Nick Fitzgerald

Exceptions in Cranelift and Wasmtime

This is a blog post outlining the odyssey I recently took to implement the Wasm exception-handling proposal in Wasmtime, the open-source WebAssembly engine for which I’m a core team member/maintainer, and its Cranelift compiler backend.

Note: this is a cross-post with my personal blog; this post is also available here.

When first discussing this work, I made an off-the-cuff estimate in the Wasmtime biweekly project meeting that it would be “maybe two weeks on the compiler side and a week in Wasmtime”. Reader, I need to make a confession now: I was wrong and it was not a three-week task. This work spanned from late March to August of this year (roughly half-time, to be fair; I wear many hats). Let that be a lesson!1

In this post we’ll first cover what exceptions are and why some languages want them (and what other languages do instead) – in particular what the big deal is about (so-called) “zero-cost” exception handling. Then we’ll see how Wasm has specified a bytecode-level foundation that serves as a least-common denominator but also has some unique properties. We’ll then take a roundtrip through what it means for a compiler to support exceptions – the control-flow implications, how one reifies the communication with the unwinder, how all this intersects with the ABI, etc. – before finally looking at how Wasmtime puts it all together (and is careful to avoid performance pitfalls and stay true to the intended performance of the spec).

Why Exceptions?

Many readers will already be familiar with exceptions as they are present in languages as widely varied as Python, Java, JavaScript, C++, Lisp, OCaml, and many more. But let’s briefly review so we can (i) be precise what we mean by an exception, and (ii) discuss why exceptions are so popular.

Exception handling is a mechanism for nonlocal flow control. In particular, most flow-control constructs are intraprocedural (send control to other code in the current function) and lexical (target a location that can be known statically). For example, if statements and loops both work this way: they stay within the local function, and we know exactly where they will transfer control. In contrast, exceptions are (or can be) interprocedural (can transfer control to some point in some other function) and dynamic (target a location that depends on runtime state).2

To unpack that a bit: an exception is thrown when we want to signal an error or some other condition that requires “unwinding” the current computation, i.e., backing out of the current context; and it is caught by a “handler” that is interested in the particular kind of exception and is currently “active” (waiting to catch that exception). That handler can be in the current function, or in any function that has called it. Thus, an exception throw and catch can result in an abnormal, early return from a function.

One can understand the need for this mechanism by considering how programs can handle errors. In some languages, such as Rust, it is common to see function signatures of the form fn foo(...) -> Result<T, E>. The Result type indicates that foo normally returns a value of type T, but may produce an error of type E instead. The key to making this ergonomic is providing some way to “short-circuit” execution if an error is returned, propagating that error upward: that is, Rust’s ? operator, for example, which turns into essentially “if there was an error, return that error from this function”.3 This is quite conceptually nice in many ways: why should error handling be different than any other data flow in the program? Let’s describe the type of results to include the possibility of errors; and let’s use normal control flow to handle them. So we can write code like

fn f() -> Result<u32, Error> {
  if bad {
    return Err(Error::new(...));
  }
  Ok(0)
}

fn g() -> Result<u32, Error> {
  // The `?` propagates any error to our caller, returning early.
  let result = f()?;
  Ok(result + 1)
}

and we don’t have to do anything special in g to propagate errors from f further, other than use the ? operator.

But there is a cost to this: it means that every error-producing function has a larger return type, which might have ABI implications (another return register at least, if not a stack-allocated representation of the Result and the corresponding loads/stores to memory), and also, there is at least one conditional branch after every call to such a function that checks if we need to handle the error. The dynamic efficiency of the “happy path” (with no thrown exceptions) is thus impacted. Ideally, we skip any cost unless an error actually occurs (and then perhaps we accept slightly more cost in that case, as tradeoffs often go).

It turns out that this is possible with the help of the language runtime. Consider what happens if we omit the Result return types and error checks at each return. We will need to reach the code that handles the error in some other way. Perhaps we can jump directly to this code somehow?

The key idea of “zero-cost exception handling” is to get the compiler to build side-tables to tell us where this code – known as a “handler” – is. We can walk the callstack, visiting our caller and its caller and onward, until we find a function that would be interested in the error condition we are raising. This logic is implemented with the help of these side-tables and some code in the language runtime called the “unwinder” (because it “unwinds” the stack). If no errors are raised, then none of this logic is executed at runtime. And we no longer have our explicit checks for error returns in the “happy path” where no errors occur. This is why this style of error handling is commonly called “zero-cost”: more precisely, it is zero-cost when no errors occur, but the unwinding in case of error can still be expensive.

This is the status quo for exception-handling implementations in most production languages: for example, in the C++ world, exception handling is commonly implemented via the Itanium C++ ABI4, which defines a comprehensive set of tables emitted by the compiler and a complex dance between the system unwinding library and compiler-generated code to find and transfer control to handlers. Handler tables and stack unwinders are common in interpreted and just-in-time (JIT)-compiled language implementations, too: for example, SpiderMonkey has try notes on its bytecode (so named for “try blocks”) and a HandleException function that walks stack frames to find a handler.
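As a toy model of the side-table idea (not the Itanium format, and not Wasmtime's actual tables), the lookup an unwinder performs might be shaped roughly like this:

use std::collections::HashMap;

// One handler that is active across a particular callsite.
struct Handler {
    tag: u32,          // which kind of exception it catches
    handler_pc: usize, // where to resume in the catching function
}

// Compiler-emitted table: return address of a callsite -> active handlers.
struct UnwindTable {
    by_return_addr: HashMap<usize, Vec<Handler>>,
}

impl UnwindTable {
    // The unwinder consults this per frame while walking the stack; the
    // happy path (no throw) never executes any of it.
    fn find_handler(&self, return_addr: usize, tag: u32) -> Option<usize> {
        self.by_return_addr
            .get(&return_addr)?
            .iter()
            .find(|h| h.tag == tag)
            .map(|h| h.handler_pc)
    }
}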

The Wasm Exception-Handling Spec

The WebAssembly specification now (since version 3.0) has exception handling. This proposal was a long time in the making by various folks in the standards, toolchain and browser worlds, and the CG (standards group) has now merged it into the spec and included it in the recently-released “Wasm 3.0” milestone. If you’re already familiar with the proposal, you can skip over this section to the Cranelift- and Wasmtime-specific bits below.

First: let’s discuss why Wasm needs an extension to the bytecode definition to support exceptions. As we described above, the key idea of zero-cost exception handling is that an unwinder visits stack frames and looks for handlers, transferring control directly to the first handler it finds, outside the normal function return path. Because the call stack is protected, or not directly readable or writable from Wasm code (part of Wasm’s control-flow integrity aspect), an unwinder that works this way necessarily must be a privileged part of the Wasm runtime itself. We can’t implement it in “userspace” because there is no way for Wasm bytecode to transfer control directly back to a distant caller, aside from a chain of returns. This missing functionality is what the extension to the specification adds.

The implementation comes down to only three opcodes (!), and some new types in the bytecode-level type system. (In other words – given the length of this post – it’s deceptively simple.) These opcodes are:

  • try_table, which wraps an inner body, and specifies handlers to be active during that body. For example:

    (block $b1    ;; defines a label for a forward edge to the end of this block
      (block $b2  ;; likewise, another label
        (try_table
          (catch $tag1 $b1) ;; exceptions with tag `$tag1` will be caught by code at $b1
          (catch_all $b2)   ;; all other exceptions will be caught by code at $b2
    
          body...)))
    

    In this example, if an exception is thrown from within the code in body, and it matches one of the specified tags (more below!), control will transfer to the location defined by the end of the given block. (This is the same as other control-flow transfers in Wasm: for example, a branch br $b1 also jumps to the end of $b1.)

    This construct is the single all-purpose “catch” mechanism, and is powerful enough to directly translate typical try/catch blocks in most programming languages with exceptions.

  • throw: an instruction to directly throw a new exception. It carries the tag for the exception, like: throw $tag1.

  • throw_ref, used to rethrow an exception that has already been caught and is held by reference (more below!).

And that’s it! We implement those three opcodes and we are “done”.

Payloads

That’s not the whole story, of course. Ordinarily a source language will offer the ability to carry some data as part of an exception: that is, the error condition is not just one of a static set of kinds of errors, but contains some fields as well. (E.g.: not just “file not found”, but “file not found: $PATH”.)

One could build this on top of a bytecode-level exception-throw mechanism that only had throw/catch with static tags, with the help of some global state, but that would be cumbersome; instead, the Wasm specification offers payloads on each exception. For full generality, this payload can actually take the form of a list of values; i.e., it is a full product type (struct type).

We alluded to “tags” above but didn’t describe them in detail. These tags are key to the payload definition: each tag is effectively a type definition that specifies its list of payload value types as well. (Technically, in the Wasm AST, a tag definition names a function type with only parameters, no returns, which is a nice way of reusing an existing entity/concept.) Now we show how they are defined with a sample module:

(module
 ;; Define a "tag", which serves to define the specific kind of exception
 ;; and specify its payload values.
 (tag $t (param i32 i64))

 (func $f (param i32 i64)
       ;; Throw an exception, to be caught by whatever handler is "closest"
       ;; dynamically.
       (throw $t (local.get 0) (local.get 1)))

 (func $g (result i32 i64)
       (block $b (result i32 i64)
              ;; Run a body below, with the given handlers (catch-clauses)
              ;; in-scope to catch any matching exceptions.
              ;;
              ;; Here, if an exception with tag `$t` is thrown within the body,
              ;; control is transferred to the end of block `$b` (as if we had
              ;; branched to it), with the payload values for that exception
              ;; pushed to the operand stack.
              (try_table (catch $t $b)
                         (call $f (i32.const 1) (i64.const 2)))
              (i32.const 3)
              (i64.const 4))))

Here we’ve defined one tag (the Wasm text format lets us attach a name $t, but in the binary format it is only identified by its index, 0), with two payload values. We can throw an exception with this tag given values of these types (as in function $f) and we can catch it if we specify a catch destination as the end of a block meant to return exactly those types as well. Here, if function $g is invoked, the exception payload values 1 and 2 will be thrown with the exception, which will be caught by the try_table; the results of $g will be 1 and 2. (The values 3 and 4 are present to allow the Wasm module to validate, i.e. have correct types, but they are dynamically unreachable because of the throw in $f and will not be returned.)

This is an instance where Wasm, being a bytecode, can afford to generalize a bit relative to real-metal ISAs and offer conveniences to the Wasm producer (i.e., toolchain generating Wasm modules). In this sense, it is a little more like a compiler IR. In contrast, most other exception-throw ABIs have a fixed definition of payload, e.g., one or two machine register-sized values. In practice some producers might choose a small fixed signature for all exception tags anyway, but there is no reason to impose such an artificial limit if there is a compiler and runtime behind the Wasm in any case.

Unwind, Cleanup, and Destructors

So far, we’ve seen how Wasm’s primitives can allow for basic exception throws and catches, but what about languages with scoped resources, e.g. C++ with its destructors? If one writes something like

struct Scoped {
    Scoped() {}
    ~Scoped() { cleanup(); }
};

void f() {
    Scoped s();
    throw my_exception();
}

then the throw should transfer control out of f and upward to whatever handler matches, but the destructor of s still needs to run and call cleanup. This is not quite a “catch” because we don’t want to terminate the search: we aren’t actually handling the error condition.

The usual approach to compile such a program is to “catch and rethrow”. That is, the program is lowered to something like

try {
    throw ...
} catch_any(e) {
    cleanup();
    rethrow e;
}

where catch_any catches any exception propagating past this point on the stack, and rethrow re-throws the same exception.

Wasm’s exception primitives provide exactly the pieces we need for this: a catch_all_ref clause, which catches all exceptions and boxes the caught exception as a reference; and a throw_ref instruction, which re-throws a previously-caught exception.5

In actuality there is a two-by-two matrix of “catch” options: we can catch a specific tag or catch_all; and we can catch and immediately unpack the exception into its payload values (as we saw above), or we can catch it as a reference. So we have catch, catch_ref, catch_all, and catch_all_ref.6

Dynamic Identity and Compositionality

There is one final detail to the Wasm proposal, and in fact it’s the part that I find the most interesting and unique. Given the above introduction, and any familiarity with exception systems in other language semantics and/or runtime systems, one might expect that the “tags” identifying kinds of exceptions and matching throws with particular catch handlers would be static labels. In other words, if I throw an exception with tag $tA, then the first handler for $tA anywhere up the stack, from any module, should catch it.

However, one of Wasm’s most significant properties as a bytecode is its emphasis on isolation. It has a distinction between static modules and dynamic instances of those modules, and modules have no “static members”: every entity (e.g., memory, table, or global variable) defined by a module is replicated per instance of that module. This creates a clean separation between instances and means that, for example, one can freely reuse a common module (say, some kind of low-level glue or helper module) with separate instances in many places without them somehow communicating or interfering with each other.

Consider what happens if we have an instance A that invokes some other (dynamically provided) function reference which ultimately invokes a callback in A. Say that the instance throws an exception from within its callback in order to unwind all the way to its outer stack frames, across the intermediate functions in some other Wasm instance(s):

                A.f   ---------call--------->   B.g   --------call--------->    A.callback
                 ^                                                                  v
               catch $t                                                           throw $t
                 |                                                                  |
                 `----------------------------<-------------------------------------'

The instance A expects that the exception that it throws from its callback function to f is a local concern to that instance only, and that B cannot interfere. After all, if the exception tag is defined inside A, and Wasm preserves modularity, then B should not be able to name that tag to catch exceptions by that tag, even if it also uses exception handling internally. The two modules should not interact: that is the meaning of modularity, and it permits us to reason about each instance’s behavior locally, with the effects of “the rest of the world” confined to imports and exports.

Unfortunately, if one designed a straightforward “static” tag-matching scheme, this might not be the case if B were an instance of the same module as A: in that case, if B also used a tag $t internally and registered handlers for that tag, it could interfere with the desired throw/catch behavior, and violate modularity.

So the Wasm exception handling standard specifies that tags have dynamic instances as well, just as memories, tables and globals do. (Put in programming-language theory terms, tags are generative.) Each instance of a module creates its own dynamic identities for the statically-defined tags in those modules, and uses those dynamic identities to tag exceptions and find handlers. This means that no matter what instance B is, above, if instance A does not export its tag $t for B to import, there is no way for B to catch the thrown exception explicitly (it can still catch all exceptions, and it may do so and rethrow to perform some cleanup). Local modular reasoning is restored.

Once we have tags as dynamic entities, just like Wasm memories, we can take the same approach that we do for the other entities to allow them to be imported and exported. Thus, visibility of exception payloads and ability for modules to catch certain exceptions is completely controlled by the instantiation graph and the import/export linking, just as for all other Wasm storage.

This is surprising (or at least was to me)! It creates some pretty unique implementation challenges in the unwinder – in essence, it means that we need to know about instance identity for each stack frame, not just static code location and handler list.
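One way to picture the generativity of tags (a toy model, not Wasmtime’s actual representation): each instantiation mints a fresh dynamic identity for every tag the module defines, and throw/catch matching compares those dynamic identities rather than static tag indices:

use std::sync::atomic::{AtomicU64, Ordering};

static NEXT_TAG_IDENTITY: AtomicU64 = AtomicU64::new(0);

#[derive(Clone, Copy, PartialEq, Eq)]
struct DynamicTag(u64);

// Run once per instantiation: every statically-defined tag gets a fresh
// dynamic identity, so two instances of the same module get distinct tags.
fn instantiate_tags(num_defined_tags: usize) -> Vec<DynamicTag> {
    (0..num_defined_tags)
        .map(|_| DynamicTag(NEXT_TAG_IDENTITY.fetch_add(1, Ordering::Relaxed)))
        .collect()
}

// A handler catches a thrown exception only if the dynamic identities match,
// which can only happen if the tag was shared via import/export.
fn handler_matches(thrown: DynamicTag, handler: DynamicTag) -> bool {
    thrown == handler
}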

Compiling Exceptions in Cranelift

Before we implement the primitives for exception handling in Wasmtime, we need to support exceptions in our underlying compiler backend, Cranelift.

Why should this be a compiler concern? What is special about exceptions that makes them different from, say, new Wasm instructions that implement additional mathematical operators (when we already have many arithmetic operators in the IR), or Wasm memories (when we already have loads/stores in the IR)?

In brief, the complexities come in three flavors: new kinds of control flow, fundamentally different than ordinary branches or calls in that they are “externally actuated” (by the unwinder); a new facet of the ABI (that we get to define!) that governs how the unwinder interacts with compiled code; and interactions between the “scoped” nature of handlers and inlining in particular. We’ll talk about each below.

Note that much of this discussion started with an RFC for Wasmtime/Cranelift, which had been posted way back in August of 2024 by Daniel Hillerstrom with help from my colleague Nick Fitzgerald, and was discussed then; many of the choices within were subsequently refined as I discovered interesting nuances during implementation and we talked them through.

Control Flow

There are a few ways to think about exception handlers from the point of view of compiler IR (intermediate representation). First, let’s recognize that exception handling (i) is a form of control flow, and (ii) has all the same implications for various compiler stages that other kinds of control flow do. For example, the register allocator has to consider how to get registers into the right state whenever control moves from one basic block to the next (“edge moves”); exception catches are a new kind of edge, and so the regalloc needs to be aware of that, too.

One could see every call or other opcode that could throw as having regular control-flow edges to every possible handler that could match. I’ll call this the “regular edges” approach. The upside is that it’s pretty simple to retrofit: one “only” needs to add new kinds of control-flow opcodes that have out-edges, but that’s already a kind of thing that IRs have. The disadvantage is that, in functions with a lot of possible throwing opcodes and/or handlers, the overhead can get quite high. And control-flow graph overhead is a bad kind of overhead: many analyses’ runtimes are heavily dependent on the edge and node (basic block) counts, sometimes superlinearly.

The other major option is to build a kind of implicit new control flow into the IR’s semantics. For example, one could lower the source-language semantics of a “try block” down to regions in the IR, with one set of handlers attached. This is clearly more efficient than adding out-edges from (say) every callsite within the try-block to every handler in scope. On the other hand, it’s hard to overstate how invasive this change would be. This means that every traversal over IR, analyzing dataflow or reachability or any other property, has to consider these new implicit edges anyway. In a large established compiler like Cranelift, we can lean on Rust’s type system for a lot of different kinds of refactors, but changing a fundamental invariant goes beyond that: we would likely have a long tail of issues stemming from such a change, and it would permanently increase the cognitive overhead of making new changes to the compiler. In general we want to trend toward a smaller, simpler core and compositional rather than entangled complexity.

Thus, the choice is clear: in Cranelift we opted to introduce one new instruction, try_call, that calls a function and catches (some) exceptions. In other words, there are now two possible kinds of return paths: a normal return or (possibly one of many) exceptional return(s). The handled exceptions and block targets are enumerated in an exception table. Because there are control-flow edges stemming from this opcode, it is a block terminator, like a conditional branch. It looks something like (in Cranelift’s IR, CLIF):

function %f0(i32) -> i32, f32, f64 {
    sig0 = (i32) -> f32 tail
    fn0 = %g(i32) -> f32 tail

    block0(v1: i32):
        v2 = f64const 0x1.0
        ;; exception-catching callsite
        try_call fn0(v1), sig0, block1(ret0, v2), [ tag0: block2(exn0), default: block3(exn0) ]

    ;; normal return path
    block1(v3: f32, v4: f64):
        v5 = iconst.i32 1
        return v5, v3, v4

    ;; exception handler for tag0
    block2(v6: i64):
        v7 = ireduce.i32 v6
        v8 = iadd_imm.i32 v7, 1
        v9 = f32const 0x0.0        
        return v8, v9, v2

    ;; exception handler for all other exceptions
    block3(v10: i64):
        v11 = ireduce.i32 v10
        v12 = f32const 0x0.0
        v13 = f64const 0x0.0
        return v11, v12, v13
}

There are a few aspects to note here. First, why are we only concerned with calls? What about other sources of exceptions? This is an important invariant in the IR: exception throws are only externally sourced. In other words, if an exception has been thrown, if we go deep enough into the callstack, we will find that that throw was implemented by calling out into the runtime. The IR itself has no other opcodes that throw! This turns out to be sufficient: (i) we only need to build what Wasmtime needs, here, and (ii) we can implement Wasm’s throw opcodes as “libcalls”, or calls into the Wasmtime runtime. So, within Cranelift-compiled code, exception throws always happen at callsites. We can thus get away with adding only one opcode, try_call, and attach handler information directly to that opcode.
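As a purely hypothetical sketch (not Wasmtime's real entry point or signature), the libcall that a Wasm throw lowers to might look like this from the runtime's side; the important property is that, from Cranelift's perspective, it is just another call:

// Called from Cranelift-compiled code; it never returns normally, because
// the unwinder transfers control straight to a handler (or traps if none).
extern "C" fn throw_libcall(tag: u64, payload_ptr: *const u64, payload_len: usize) -> ! {
    // 1. Package the tag and payload into a runtime-owned exception object.
    // 2. Walk the stack, consulting each frame's handler table.
    // 3. Restore SP/FP for the catching frame and jump to its handler, with
    //    the payload placed in the designated payload registers.
    let _ = (tag, payload_ptr, payload_len);
    unreachable!("illustrative sketch only")
}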

The next characteristic of note is that handlers are ordinary basic blocks. This may not seem remarkable unless one has seen other compiler IRs, such as LLVM’s, where exception handlers are definitely special: they start with “landing pad” instructions, and cannot be branched to as ordinary basic blocks. That might look something like:

function %f() {
    block0:
        ;; Callsite defining a return value `v0`, with normal
        ;; return path to `block1` and exception handler `block2`.
        v0 = try_call ..., block1, [ tag0: block2 ]
        
    block1:
        ;; Normal return; use returned value.
        return v0
        
    block2 exn_handler: ;; Specially-marked block!
        ;; Exception handler payload value.
        v1 = exception_landing_pad
        ...
}

This bifurcation of kinds of blocks (normal and exception handler) is undesirable from our point of view: just as exceptional edges add a new cross-cutting concern that every analysis and transform needs to consider, so would new kinds of blocks with restrictions. It was an explicit design goal (and we have tests that show this!) that the same block can be both an ordinary block and a handler block – not because that would be common, necessarily (handlers usually do very different things than normal code paths), but because it’s one less weird quirk of the IR.

But then if handlers are normal blocks, the data flow question becomes very interesting. An exception-catching call, unlike every other opcode in our IR, has conditionally-defined values: that is, its normal function return value(s) are available only if the callee returns normally, and the exception payload value(s), which are passed in from the unwinder and carry information about the caught exception, are available only if the callee throws an exception that we catch. How can we ensure that these values are represented such that they can only be used in valid ways? We can’t make them all regular SSA definitions of the opcode: that would mean that all successors (regular return and exceptional) get to use them, as in:

function %f() {
    block0:
        ;; Callsite defining a return value `v0`, with normal return path
        ;; to `block1` and exception handler `block2`.
        v0 = try_call ..., block1, [ tag0: block2 ]
      
    block1:
        ;; Use `v0` legally: it is defined on normal return.
        return v0
      
    block2:
        ;; Oops! We use `v0` here, but the normal return value is undefined
        ;; when an exception is caught and control reaches this handler block.
        return v0
}

This is the reason that a compiler may choose to make handler blocks special: by bifurcating the universe of blocks, one ensures that normal-return and exceptional-return values are used only where appropriate. Some compiler IRs reify exceptional return payloads via “landing pad” instructions that must start handler blocks, just as phis start regular blocks (in phi- rather than blockparam-based SSA). But, again, this bifurcation is undesirable.

Our insight here, after a lot of discussion, was to put the definitions where they belong: on the edges. That is, regular returns are only defined once we know we’re following the regular-return edge, and likewise for exception payloads. But we don’t want to have special instructions that must be in the successor blocks: that’s a weird distributed invariant and, again, likely to lead to bugs when transforming IR. Instead, we leverage the fact that we use blockparam-based SSA and we widen the domain of allowable block-call arguments.

Whereas previously one might end a block like brif v1, block2(v2, v3), block3(v4, v5), i.e. with blockparams assigned values in the chosen successor via a list of value-uses in the branch, we now allow (i) SSA values, (ii) a special “normal return value” sentinel, or (iii) a special “exceptional return value” sentinel. The latter two are indexed because there can be more than one of each. So one can write a block-call in a try_call as block2(ret0, v1, ret1), which passes the two return values of the call and a normal SSA value; or block3(exn0, exn1), which passes just the two exception payload values. We do have a new well-formedness check on the IR that ensures that (i) normal returns are used only in the normal-return blockcall, and exception payloads are used only in the handler-table blockcalls; (ii) normal returns’ indices are bounded by the signature; and (iii) exception payloads’ indices are bounded by the ABI’s number of exception payload values; but all of these checks are local to the instruction, not distributed across blocks. That’s nice, and conforms with the way that all of our other instructions work, too. (Block-call argument types are then checked against block-parameter types in the successor block, but that happens the same as for any branch.) So we have, repeating from above, a callsite like

    block1:
        try_call fn0(v1), block2(ret0), [ tag0: block3(exn0, exn1) ]

with all of the desired properties: only one kind of block, explicit control flow, and SSA values defined only where they are legal to use.

All of this may seem somewhat obvious in hindsight, but as attested by the above GitHub discussions and Cranelift weekly meeting minutes, it was far from clear when we started how to design all of this to maximize simplicity and generality and minimize quirks and footguns. I’m pretty happy with our final design: it feels like a natural extension of our core blockparam-SSA control flow graph, and I managed to put it into the compiler without too much trouble at all (well, a few PRs and associated fixes to Cranelift and regalloc2 functionality and testing; and I’m sure I’ve missed a few).

Data Flow and ABI

So we have defined an IR that can express exception handlers – what about the interaction between this function body and the unwinder? We will need to define a different kind of semantics to nail down that interface: in essence, it is a property of the ABI (Application Binary Interface).

As mentioned above, exception-handling ABIs already exist for native code, such as compiled C++. While we are certainly willing to draw inspiration from native ABIs and align with them as much as makes sense, in Wasmtime we already define our own ABI7, and so we are not necessarily constrained by existing standards.

In particular, there is a very good reason we would prefer not to: to unwind to a particular exception handler, register state must be restored as specified in the ABI, and the standard Itanium ABI requires the usual callee-saved (“non-volatile”) registers on the target ISA to be restored. But this requires (i) having the register state at time of throw, and (ii) processing unwind metadata at each stack frame as we walk up the stack, reading out values of saved registers from stack frames. The latter is already supported with a generic “unwind pseudoinstruction” framework I built four years ago, but would still add complexity to our unwinder, and this complexity would be load-bearing for correctness; and the former is extremely difficult with Wasmtime’s normal runtime-entry trampolines. So we instead choose to have a simpler exception ABI: all try_calls, that is, callsites with handlers, clobber all registers. This means that the compiler’s ordinary register-allocation behavior will save all live values to the stack and restore them on either a normal or exceptional return. We only have to restore the stack (stack pointer and frame pointer registers) and redirect the program counter (PC) to a handler.

The other aspect of the ABI that matters to the exception-throw unwinder is the exception payload. The native Itanium ABI specifies two registers on most platforms (e.g.: rax and rdx on x86-64, or x0 and x1 on aarch64) to carry runtime-defined payload; so for simplicity, we adopt the same convention.

That’s all well and good; now how do we implement try_call with the appropriate register-allocator behavior to conform to this? We already have fairly complex ABI handling (machine-independent and five different architecture implementations) in Cranelift, but it follows a general pattern: we generate a single instruction at the register-allocator level, and emit uses and defs with fixed-register constraints. That is, we tell regalloc that parameters must be in certain registers (e.g., rdi, rsi, rdx, rcx, r8, r9 on x86-64 System-V calling-convention platforms, or x0 up to x7 on aarch64 platforms) and let it handle any necessary moves. So in the simplest case, a call might look like (on aarch64), with register-allocator uses/defs and constraints annotated:

bl (call) v0 [def, fixed(x0)], v1 [use, fixed(x0)], v2 [use, fixed(x1)]

It is not always this simple, however: calls are not actually always a single instruction, and this turned out to be quite problematic for exception-handling support. In particular, when values are returned in memory, as the ABI specifies they must be when there are more return values than registers, we add (or added, prior to this work!) load instructions after the call to load the extra results from their locations on the stack. So a callsite might generate instructions like

bl v0 [def, fixed(x0)], ..., v7 [def, fixed(x7)] # first eight return values
ldr v8, [sp]     # ninth return value
ldr v9, [sp, #8] # tenth return value

and so on. This is problematic simply because we said that the try_call was a terminator; and it is at the IR level, but no longer at the regalloc level, and regalloc expects correctly-formed control-flow graphs as well. So I had to do a refactor to merge these return-value loads into a single regalloc-level pseudoinstruction, and in turn this cascaded into a few regalloc fixes (allowing more than 256 operands and more aggressively splitting live-ranges to allow worst-case allocation, plus a fix to the live range-splitting fix and a fuzzing improvement).

There is one final question that might arise when considering the interaction of exception handling and register allocation in Cranelift-compiled code. In Cranelift, we have an invariant that the register allocator is allowed to insert moves between any two instructions – register-to-register, or loads or stores to/from spill-slots in the stack frame, or moves between different spill-slots – and indeed it does this whenever there is more state than fits in registers. It also needs to insert edge moves “between” blocks, because when jumping to another spot in the code, we might need the register values in a differently-assigned configuration. When we have an unwinder that jumps to a different spot in the code to invoke a handler, we need to ensure that all the proper moves have executed so the state is as expected.

The answer here turns out to be a careful argument that we don’t need to do anything at all. (That’s the best kind of solution to a problem, but only if one is correct!) The crux of the argument has to do with critical edges. A critical edge is one from a block with multiple successors to one with multiple predecessors: for example, in the graph

   A    D
  / \  /
 B   C

where A can jump to B or C, and D can also jump to C, then A-to-C is a critical edge. The problem with critical edges is that there is nowhere to put code that has to run on the transition from A to C (it can’t go in A, because we may go to B or C; and it can’t go in C, because we may have come from A or D). So the register allocator prohibits them, and we “split” them when generating code by inserting empty blocks (e below) on them:

   A    D
  / \   |
 |   e  |
 |   \ /
 B    C

The key insight is that a try_call always has more than one successor as long as it has a handler (because it must always have a normal return-path successor too)8; and in this case, because we split critical edges, the immediate successor block on the exception-catch path has only one predecessor. So the register allocator can always put its moves that have to run on catching an exception in the successor (handler) block rather than the predecessor block. Our rule for where to put edge moves prefers the successor (block “after” the edge) unless it has multiple in-edges, so this was already the case. The only thing we have to be careful about is to record the address of the inserted edge block, if any (e above), rather than the IR-level handler block (C above), in the handler table.

And that’s pretty much it, as far as register allocation is concerned!

We’ve now covered the basics of Cranelift’s exception support. At this point, having landed the compiler half but not the Wasmtime half, I context-switched away for a bit, and in the meantime, bjorn3 picked this support up right away as a means to add panic-unwinding support to rustc_codegen_cranelift, the Cranelift-based Rust compiler backend. With a few small changes they contributed, and a followup edge-case fix and a refactor, panic-unwinding support in rustc_codegen_cranelift was working. That was very good intermediate validation that what I had built was usable and relatively solid.

Exceptions in Wasmtime

We have a compiler that supports exceptions; we understand Wasm exception semantics; let’s build support into Wasmtime! How hard could it be?

Challenge 1: Garbage Collection Interactions

I started by sketching out the codegen for each of the three opcodes (try_table, throw, and throw_ref). My mental model at the very beginning of this work, having read but not fully internalized the Wasm exception-handling proposal, was that I would be able to implement a “basic” throw/catch first, and then somehow build the exnref objects later. And I had figured I could build exnrefs in a (in hindsight) somewhat hacky way, by aggregating values together in a kind of tuple and creating a table of such tuples indexed by exnrefs, just as Wasmtime does for externrefs.

This understanding quickly gave way to a deeper one when I realized a few things:

  • Exception objects (exnrefs) can carry references to other GC objects (that is, GC types can be part of the payload signature of an exception), and GC objects can store exnrefs in fields. Hence, exnrefs need to be traced, and can participate in GC cycles; this either implies an additional collector on top of our GC collector (ugh) or means that exception objects need to be on the GC heap when GC is enabled.

  • We’ll need a host API to introspect and build exception objects, and we already have nice host APIs for GC objects.

There was a question in an extensively-discussed PR whether we could build a cheap “subset” implementation that doesn’t mandate the existence of a GC heap for storing exception objects. This would be great in theory for guests that use exceptions for C-level setjmp/longjmp but no other GC features. However, it’s a little tricky for a few reasons. First, this would require the subset to exclude throw_ref (so we don’t have to invent another kind of exception object storage). But it’s not great to subset the spec – and throw_ref is not just for GC guest languages, but also for rethrows. Second, more generally, this is additional maintenance and testing surface that we’d rather not have for now. Instead we expect that we can make GC cheap enough, and its growth heuristic smart enough that a “frequent setjmp/longjmp” stress-test of exceptions (for example) should live within a very small (e.g., few-kilobyte) GC heap, essentially approximating the purpose-built storage. My colleague Nick Fitzgerald (who built and is driving improvements to Wasmtime’s GC support) wrote up a nice issue describing the tradeoffs and ideas we have.

All of that said, we’ll only build one exception object implementation – great! – but it will have to be a new kind of GC object. This spawned a large PR to build out exception objects first, prior to actual support for throwing and catching them, with host APIs to allocate them and inspect their fields. In essence, they are structs with immutable fields and with a less-exposed type lattice and no subtyping.
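
To make this a bit more concrete, here is a purely conceptual sketch in plain Rust (stand-in types only; not Wasmtime’s actual GC heap layout or host API) of what such an exception object carries: a tag plus payload fields that are fixed at allocation time and can only be read afterwards.

// Conceptual sketch only: stand-in types, not Wasmtime's real representation.
type TagId = u32;
type Val = u64;

struct ExceptionObject {
    tag: TagId,         // which exception tag this object was created with
    fields: Box<[Val]>, // immutable payload: set once at allocation, read-only after
}

impl ExceptionObject {
    fn new(tag: TagId, fields: &[Val]) -> Self {
        ExceptionObject { tag, fields: fields.into() }
    }

    fn tag(&self) -> TagId { self.tag }
    fn field(&self, i: usize) -> Val { self.fields[i] }
}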

Challenge 2: Generative Tags and Dynamic Identity

So there I was, implementing the throw instruction’s libcall (runtime implementation), and finally getting to the heart of the matter: the unwinder itself, which walks stack frames to find a matching exception handler. This is the final bit of functionality that ties it all together. We’re almost there!

But wait: check out that spec language. We load the “tag address” from the store in step 9: we allocate the exception instance {tag z.tags[x], fields val^n}. What is this tags array on the store (z) in the runtime semantics? Tags have dynamic identity, not static identity! (This is the part where I learned about the thing I described above.)

This was a problem, because I had defined exception tables to associate handlers with tags that were identified by integer (u32) – like most other entities in Cranelift IR, I had figured this would be sufficient to let Wasmtime define indices (say: index of the tag in the module), and then we could compare static tag IDs.

Perhaps this is no problem: the static index defines the entity ID in the module (defined or imported tag), and we can compare that and the instance ID to see if a handler is a match. But how do we get the instance ID from the stack frame?

It turns out that Wasmtime didn’t have a way, because nothing had needed that yet. (This deficiency had been noticed before when implementing Wasm coredumps, but there hadn’t been enough reason or motivation to fix it then.) So I filed an issue with a few ideas. We could add a new field in every frame storing the instance pointer – and in fact this is a simple version of what at least one other production Wasm implementation, in the SpiderMonkey web engine, does (though as described in that [SMDOC] comment, it only stores instance pointers on transitions between frames of different instances; this is enough for the unwinder when walking linearly up the stack). But that would add overhead to every Wasm function (or with SpiderMonkey’s approach, require adding trampolines between instances, which would be a large change for Wasmtime), and exception handling is still used somewhat rarely in practice. Ideally we’d have a “pay-as-you-go” scheme with as little extra complexity as possible.

Instead, I came up with an idea to add “dynamic context” items to exception handler lists. The idea is that we inject an SSA value into the list and it is stored in a stack location that is given in the handler table metadata, so the stack-walker can find it. To Cranelift, this is some arbitrary opaque value; Wasmtime will use it to store the raw instance pointer (vmctx) for use by the unwinder.

This filled out the design to a more general state nicely: it is symmetric with exception payload, in the sense that the compiled code can communicate context or state to the unwinder as it reads the frames, and the unwinder in turn can communicate data to the compiled code when it unwinds.

It turns out – though I didn’t intend this at all at the time – that this also nicely solves the inlining problem. In brief, we want all of our IR to be “local”, not treating the function boundary specially; this way, IR can be composed by the inliner without anything breaking. Storing some “current instance” state for the whole function will, of course, break when we inline a function from one module (hence instance) into another!

Instead, we can give a nice operational semantics to handler tables with dynamic-context items: the unwinder should read left-to-right, updating its “current dynamic context” at each dynamic-context item, and checking for a tag match at tag-handler items. Then the inliner can compose exception tables: when a try_call callsite inlines a function body as its callee, and that body itself has any other callsites, we attach a handler table that simply concatenates the exception table items.
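
As a rough illustration of those two pieces (left-to-right handler matching with dynamic context, and table composition during inlining), here is a sketch in plain Rust. The item and handler types are hypothetical stand-ins, not Cranelift’s actual data structures.

// Illustrative sketch only; Cranelift's real exception tables and the Wasmtime
// unwinder use their own representations.
type DynTag = usize; // a tag's dynamic identity, resolved through the instance

#[derive(Clone)]
enum ExceptionItem {
    DynamicContext(u64),            // opaque context value (Wasmtime: the vmctx pointer)
    Tag { tag: u32, handler: u32 }, // module-level tag index and handler block
    Default(u32),                   // catch-all handler block
}

// The unwinder reads a handler table left to right, updating its "current
// dynamic context" at each context item and checking for a match at each
// tag-handler item.
fn find_handler(
    items: &[ExceptionItem],
    thrown: DynTag,
    resolve: impl Fn(u64, u32) -> DynTag, // (context, tag index) -> dynamic tag
) -> Option<u32> {
    let mut ctx = 0u64;
    for item in items {
        match item {
            ExceptionItem::DynamicContext(v) => ctx = *v,
            ExceptionItem::Tag { tag, handler } if resolve(ctx, *tag) == thrown => {
                return Some(*handler);
            }
            ExceptionItem::Default(handler) => return Some(*handler),
            _ => {}
        }
    }
    None
}

// Inlining composes tables by concatenation: a callsite inside the inlined
// body gets the callee's items first (so its own handlers are tried first),
// followed by the caller's items, each group led by its own dynamic-context item.
fn compose(callee: &[ExceptionItem], caller: &[ExceptionItem]) -> Vec<ExceptionItem> {
    callee.iter().cloned().chain(caller.iter().cloned()).collect()
}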

It’s important, here, to point out another surprising fact about Wasm semantics: we cannot do certain optimizations to resolve handlers statically or optimize the handler list, or at least not naively, without global program analysis to understand where tags come from. For example, if we see a handler for tag 0 then one for tag 1, and we see a throw for tag 1 directly inside the try_table’s body, we cannot necessarily resolve it: tag 0 and tag 1 could be the same tag!

Wait, how can that be? Well, consider tag imports:

(module
  (import "test" "e0" (tag $e0))
  (import "test" "e1" (tag $e1))

  (func ...
        (try_table
                   (catch $e0 $b0)
                   (catch $e1 $b1)
                   (throw $e1)
                   (unreachable))))

We could instantiate this module with the same dynamic tag instance for both imports, in which case the first handler (to block $b0) matches; or with separate tags, in which case block $b1 matches. The only way to win the optimization game is not to play – we have to preserve the original handler list. Fortunately, that makes the compiler’s job easier. We transcribe the try_table’s handlers directly to Cranelift exception-handler tables, and those directly to metadata in the compiled module, read in exactly that order by the unwinder’s handler-matching logic.

Challenge 3: Rooting

Since exception objects are GC-managed objects, we have to ensure that they are properly rooted: that is, any handles to these objects outside of references inside other GC objects need to be known to the GC so the objects remain alive (and so the references are updated in the case of a moving GC).

Within a Wasm-to-Wasm exception throw scenario, this is fairly easy: the references are rooted in the compiled code on either side of the control-flow transfer, and the reference only briefly passes through the unwinder. As long as we are careful to handle it with the appropriate types, all will work fine.

Passing exceptions across the host/Wasm boundary is another matter, though. We support the full matrix of {host, Wasm} x {host, Wasm} exception catch/throw pairs: that is, exceptions can be thrown from native host code called by Wasm (via a Wasm import), and exceptions can be thrown out of Wasm code and returned as a kind of error to the host code that invoked the Wasm. This works by boxing the exception inside an anyhow::Error so we use Rust-style value-based error propagation (via Result and the ? operator) in host code.

What happens when we have a value inside the Error that holds an exception object in the Wasmtime Store? How does Wasmtime know this is rooted?

The answer in Wasmtime prior to recent work was to use one of two kinds of external rooting wrappers: Rooted and ManuallyRooted. Both wrappers hold an index into a table contained inside the Store, and that table contains the actual GC reference. This allows the GC to easily see the roots and update them.

The difference lies in the lifetime disciplines: ManuallyRooted requires, as the name implies, manual unrooting; it has no Drop implementation, and so easily creates leaks. Rooted, on the other hand, had a LIFO (last-in first-out) discipline based on a Scope, an RAII type created by the embedder (user) of Wasmtime. Rooted GC references that escape that dynamic scope are unrooted, and will cause an error (panic) at runtime if used. Neither of those behaviors is ideal for a value type – an exception – that is meant to escape scopes via ?-propagation.

The design that we landed on, instead, takes a different and much simpler approach: the Store has a single, explicit root slot for the “pending exception”, and host code can set this and then return a sentinel value (wasmtime::ThrownException) in the Result’s error type (boxed up into an anyhow::Error). This easily allows propagation to work as expected, with no unbounded leaks (there is only one pending exception that is rooted) and no unrooted propagating exceptions (because no actual GC reference propagates, only the sentinel).

As a side-quest, while thinking through this rooting dilemma, I also realized that it should be possible to create an “owned” rooted reference that behaves more like a conventional owned Rust value (e.g. Box); hence OwnedRooted was born to replace ManuallyRooted. This type works without requiring access to the Store to unroot when dropped; the key idea is to hold a refcount to a separate tiny allocation that is used as a “drop flag”, and then have the store periodically scan these drop-flags and lazily remove roots, with a thresholding algorithm to give that scanning amortized linear-time behavior.9
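
The drop-flag idea is easy to sketch in isolation. The following is a simplified illustration of the pattern (not Wasmtime’s actual OwnedRooted implementation): the handle holds a shared flag that its Drop sets without needing the Store, and the root table lazily sweeps dead entries once enough accumulate, growing its threshold so the sweeping work stays amortized linear.

use std::sync::Arc;
use std::sync::atomic::{AtomicBool, Ordering};

// One entry per rooted object: the GC reference (a stand-in u64 here) plus a
// shared "drop flag" that the handle's Drop sets when the handle goes away.
struct RootEntry {
    gc_ref: u64,
    dropped: Arc<AtomicBool>,
}

struct RootTable {
    entries: Vec<RootEntry>,
    sweep_threshold: usize,
}

struct OwnedHandle {
    dropped: Arc<AtomicBool>,
}

impl Drop for OwnedHandle {
    fn drop(&mut self) {
        // No access to the store is needed; we only mark the flag.
        self.dropped.store(true, Ordering::Relaxed);
    }
}

impl RootTable {
    fn root(&mut self, gc_ref: u64) -> OwnedHandle {
        let dropped = Arc::new(AtomicBool::new(false));
        self.entries.push(RootEntry { gc_ref, dropped: dropped.clone() });
        self.maybe_sweep();
        OwnedHandle { dropped }
    }

    // Lazily remove entries whose handles were dropped. Growing the threshold
    // in proportion to the live count keeps total sweep work amortized linear.
    fn maybe_sweep(&mut self) {
        if self.entries.len() >= self.sweep_threshold {
            self.entries.retain(|e| !e.dropped.load(Ordering::Relaxed));
            self.sweep_threshold = (self.entries.len() * 2).max(8);
        }
    }
}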

Now that we have this, in theory, we could pass an OwnedRooted<ExnRef> directly in the Error type to propagate exceptions through host code; but the store-rooted approach is simple enough, has a marginal performance advantage (no separate allocation), and so I don’t see a strong need to change the API at the moment.

Life of an Exception: Quick Walkthrough

Now that we’ve discussed all the design choices, let’s walk through the life of an exception throw/catch, from start to finish. Let’s assume a Wasm-to-Wasm throw/catch for simplicity here.

  • First, the Wasm program is executing within a try_table, which results in an exception-handler catch block being created for each handler case listed in the try_table instruction. The create_catch_block function generates code that invokes translate_exn_unbox, which reads out all of the fields from the exception object and pushes them onto the Wasm operand stack in the handler path. Each handler block is registered in the HandlerState, which tracks the current lexical stack of handlers (and hands out checkpoints so that when we pop out of a Wasm block-type operator, we can pop the handlers off the state as well). These handlers are provided as an iterator which is passed to the translate_call method and eventually ends up creating an exception table on a try_call instruction. This try_call will invoke whatever Wasm code is about to throw the exception.
  • Then, the Wasm program reaches a throw opcode, which is translated via FuncEnvironment::translate_exn_throw to a three-operation sequence that fetches the current instance ID (via a libcall into the runtime), allocates a new exception object with that instance ID and a fixed tag number and fills in its slots with the given values popped from the Wasm operand stack, and delegates to throw_ref.
  • The throw_ref opcode implementation then invokes the throw_ref libcall.
  • This libcall is deceptively simple: its implementation sets the pending exception on the store, and returns a sentinel that signals a pending exception. That’s it!
  • This works because the glue code for all libcalls processes errors (via the HostResult trait implementations) and eventually reaches this case which sees a pending exception sentinel and invokes compute_handler. Now we’re getting to the heart of the exception-throw implementation.
  • compute_handler walks the stack with Handler::find, which itself is based on visit_frames, which does about what one would expect for code with a frame-pointer chain: it walks the singly-linked list of frames (a rough sketch of this walk appears after this list). At each frame, the closure that compute_handler gave to Handler::find looks up the program counter in that frame (which will be a return address, i.e., the instruction after the call that created the next lower frame) using lookup_module_by_pc to find a Module, which itself has an ExceptionTable (a parser for serialized metadata produced during compilation from Cranelift metadata) that knows how to look up a PC within a module. This will produce an Iterator over handlers which we test in order to see if any match. (The groups of exception-handler table items that come out of Cranelift are post-processed here to generate the tables that the above routines search.)
  • If we find a handler, that is, if the dynamic tag instance is the same or we reach a catch-all handler, then we have an exception handler! We return the PC and SP to restore here, computing SP via an FP-to-SP offset (i.e., the size of the frame), which is fixed and included in the exception tables when we construct them.
  • That action then becomes an UnwindState::UnwindToWasm here.
  • This UnwindToWasm state then triggers this case in the unwind libcall, which is invoked whenever any libcall returns an error code; that eventually calls the no-return function resume_to_exception_handler, which is a little function written in inline assembly that does exactly what it says on the tin. These three instructions set rsp and rbp to their new values, and jump to the new rip (PC). The same stub exists for each of our four native-compilation architectures (x86-64 above, aarch64, riscv64, and s390x10). That transfers control to the catch-block created above, and the Wasm continues running, unboxing the exception payload and running the handler!
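
For reference, the overall shape of that frame-pointer walk is roughly the following sketch (illustrative only, not Wasmtime’s actual visit_frames; it assumes the common frame layout where the slot at the frame pointer holds the caller’s frame pointer and the next word up holds the return address).

// Illustrative sketch: walk a frame-pointer chain, handing each frame's return
// PC and frame pointer to `visit` until it reports that a handler was found.
unsafe fn walk_frames(mut fp: *const usize, mut visit: impl FnMut(usize, usize) -> bool) {
    while !fp.is_null() {
        // SAFETY: assumes `fp` points at a valid frame with the standard layout.
        let return_pc = unsafe { *fp.add(1) }; // PC to look up in the exception tables
        if visit(return_pc, fp as usize) {
            return; // a matching handler was found; stop walking
        }
        fp = unsafe { *fp } as *const usize; // follow the chain to the next older frame
    }
}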

Conclusion

So we have Wasm exception handling now! For all of the interesting design questions we had to work through, the end was pretty anticlimactic. I landed the final PR, and after a follow-up cleanup PR (1) and some fuzzbug fixes (1 2 3 4 5 6 7) having mostly to do with null-pointer handling and other edge cases in the type system, plus one interaction with tail-calls (and a separate/pre-existing s390x ABI bug that it uncovered), it has been basically stable. We pretty quickly got a few user reports: here it was reported as working for a Lua interpreter using setjmp/longjmp inside Wasm based on exceptions, and here it enabled Kotlin-on-Wasm to run and pass a large testsuite. Not bad!

All told, this took 37 PRs with a diff-stat of +16264 -4004 (16KLoC total) – certainly not the “small-to-medium-sized” project I had initially optimistically expected, but I’m happy we were able to build it out and get it to a stable state relatively easily. It was a rewarding journey in a different way than a lot of my past work (mostly on the Cranelift side) – where many of my past projects have been really very open-ended design or even research questions, here we had the high-level shape already and all of the work was in designing high-quality details and working out all the interesting interactions with the rest of the system. I’m happy with how clean the IR design turned out in particular, and I don’t think it would have done so without the really excellent continual discussion with the rest of the Cranelift and Wasmtime contributors (thanks to Nick Fitzgerald and Alex Crichton in particular here).

As an aside: I am happy to see how, aside from use-cases for Wasm exception handling, the exception support in Cranelift itself has been useful too. As mentioned above, cg_clif picked it up almost as soon as it was ready; but then, as an unexpected and pleasant surprise, Alex subsequently rewrote Wasmtime’s trap unwinding to use Cranelift exception handlers in our entry trampolines rather than a setjmp/longjmp, as the latter have longstanding semantic questions/issues in Rust. This took one more intrinsic, which I implemented after discussing with Alex how best to expose exception handler addresses to custom unwind logic without the full exception unwinder, but was otherwise a pretty direct application of try_call and our exception ABI. General building blocks prove generally useful, it seems!


Thanks to Alex Crichton and Nick Fitzgerald for providing feedback on a draft of this post!

  1. To explain myself a bit, I underestimated the interactions of exception handling with garbage collection (GC); I hadn’t realized yet that exnrefs were a full first-class value and would need to be supported including in the host API. Also, it turns out that exceptions can cross the host/guest boundary, and goodness knows that gets really fun really fast. I was only off by a factor of two on the compiler side at least! 

  2. From an implementation perspective, the dynamic, interprocedural nature of exceptions is what makes them far more interesting, and involved, than classical control flow such as conditionals, loops, or calls! This is why we need a mechanism that involves runtime data structures, “stack walks”, and lookup tables, rather than simply generating a jump to the right place: the target of an exception-throw can only be computed at runtime, and we need a convention to transfer control with “payload” to that location. 

  3. For those so inclined, this is a monad, and e.g. Haskell implements the ability to have “result or error” types that return from a sequence early via Either, explicitly describing the concept as such. The ? operator serves as the “bind” of the monad: it connects an error-producing computation with a use of the non-error value, returning the error directly if one is given instead. 

  4. So named for the Intel Itanium (IA-64), an instruction-set architecture that happened to be the first ISA where this scheme was implemented for C++, and is now essentially dead (before its time! woefully misunderstood!) but for that legacy… 

  5. It’s worth briefly noting here that the Wasm exception handling proposal went through a somewhat twisty journey, with an earlier variant (now called “legacy exception handling”) that shipped in some browsers but was never standardized handling rethrows in a different way. In particular, that proposal did not offer first-class exception object references that could be rethrown; instead, it had an explicit rethrow instruction. I wasn’t around for the early debates about this design, but in my opinion, providing first-class exception object references that can be plumbed around via ordinary dataflow is far nicer. It also permits a simpler implementation, as long as one literally implements the semantics by always allocating an exception object.11 

  6. To be precise, because it may be a little surprising: catch_ref pushes both the payload values and the exception reference onto the operand stack at the handler destination. In essence, the rule is: tag-specific variants always unpack the payloads; and also, _ref variants always push the exception reference. 

  7. In particular, we have defined our own ABI in Wasmtime to allow universal tail calls between any two signatures to work, as required by the Wasm tail-calling opcodes. This ABI, called “tail”, is based on the standard System V calling convention but differs in that the callee cleans up any stack arguments. 

  8. It’s not compiler hacking without excessive trouble from edge-cases, of course, so we had one interesting bug from the empty handler-list case which means we have to force edge-splitting anyway for all try_calls for this subtle reason. 

  9. Of course, while doing this, I managed to create CVE-2025-61670 in the C/C++ API by a combination of (i) a simple typo in the C FFI bindings (as vs. from, which is important when transferring ownership!) and (ii) not realizing that the C++ wrapper does not properly maintain single ownership. We didn’t have ASAN tests, so I didn’t see this upfront; Alex discovered the issue while updating the Python bindings (which quickly found the leak) and managed the CVE. Sorry and thanks! 

  10. It turns out that even three lines of assembly are hard to get right: the s390x variant had a bug where we got the register constraints wrong (GPR 0 is special on s390x, and a branch-to-register can only take GPR 1–15; we needed a different constraint to represent that) and had a miscompilation as a result. Thanks to our resident s390x compiler hacker Ulrich Weigand for tracking this down. 

  11. Of course, always boxing exceptions is not the only way to implement the proposal. It should be possible to “unbox” exceptions and skip the allocation, carrying payloads directly through some other engine state, if they are not caught as references. We haven’t implemented this optimization in Wasmtime and we expect the allocation performance for small exception objects to be adequate for most use-cases. 

]]>
Chris Fallin
Wasmtime 35 Brings AArch64 Support in Winch2025-08-14T00:00:00+00:002025-08-14T00:00:00+00:00https://bytecodealliance.org/articles/winch-aarch64-supportWasmtime is a fast, secure, standards compliant and lightweight WebAssembly (Wasm) runtime.

As of Wasmtime 35, Winch supports AArch64 for Core Wasm proposals, along with additional Wasm proposals like the Component Model and Custom Page Sizes.

Embedders can configure Wasmtime to use either Cranelift or Winch as the Wasm compiler depending on the use-case: Cranelift is an optimizing compiler aiming to generate fast code. Winch is a ‘baseline’ compiler, aiming for fast compilation and low-latency startup.

This blog post will cover the main changes needed to accommodate support for AArch64 in Winch.

Quick Tour of Winch’s Architecture

To achieve its low-latency goal, Winch focuses on converting Wasm code to assembly code for the target Instruction Set Architecture (ISA) as quickly as possible. Unlike Cranelift, Winch’s architecture intentionally avoids using an intermediate representation or complex register allocation algorithms in its compilation process. For this reason, baseline compilers are also referred to as single-pass compilers.

Winch’s architecture can be largely divided into two parts which can be classified as ISA-agnostic and ISA-specific.

Winch's Architecture

Adding support for AArch64 to Winch involved adding a new implementation of the MacroAssembler trait, which is ultimately in charge of emitting AArch64 assembly. Winch’s ISA-agnostic components remained unchanged and are shared with the existing x86_64 implementation.

Winch’s code generation context implements wasmparser’s VisitOperator trait, which requires defining handlers for each Wasm opcode:

fn visit_i32_const(&mut self, value: i32) -> Self::Output {
  // Code generation starts here.
}

When an opcode handler is invoked, the Code Generation Context prepares all the necessary values and registers, followed by the machine code emission of the sequence of instructions to represent the Wasm instruction in the target ISA.

Last but not least, the register allocator algorithm uses a simple round-robin approach over the available ISA registers. When a requested register is unavailable, all values that are live at the current program point are saved to memory (known as value spilling), thereby freeing the requested register for immediate use.
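
As a rough sketch of that policy (hypothetical types, not Winch’s actual allocator, which is integrated with its code-generation context and value stack), round-robin allocation with all-live-value spilling looks something like this:

// Illustrative sketch only. Assumes at least one allocatable register and that
// `spill_all_live` frees every register by saving live values to the stack.
struct RegAlloc {
    regs: Vec<u8>,   // all allocatable registers, visited in round-robin order
    free: Vec<bool>, // whether regs[i] is currently free
    next: usize,     // where the round-robin search resumes
}

impl RegAlloc {
    fn request(&mut self, mut spill_all_live: impl FnMut(&mut Vec<bool>)) -> u8 {
        loop {
            for _ in 0..self.regs.len() {
                let i = self.next % self.regs.len();
                self.next += 1;
                if self.free[i] {
                    self.free[i] = false;
                    return self.regs[i];
                }
            }
            // No register is free: spill all values live at this program point
            // to memory, which marks their registers as free again.
            spill_all_live(&mut self.free);
        }
    }

    fn release(&mut self, reg: u8) {
        if let Some(i) = self.regs.iter().position(|&r| r == reg) {
            self.free[i] = true;
        }
    }
}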

Emitting AArch64 Assembly

Shadow Stack Pointer (SSP)

AArch64 defines very specific restrictions with regards to the usage of the stack pointer register (SP). Concretely, SP must be 16-byte aligned whenever it is used to address stack memory. Given that Winch’s register allocation algorithm requires value spilling at arbitrary program points, it can be challenging to maintain such alignment.

AArch64’s requirement is that SP must be 16-byte aligned whenever it is used to address stack memory; it may be unaligned when not used for stack addressing. The architecture also doesn’t prevent using other registers for stack memory addressing, nor does it require those registers to be 16-byte aligned. To avoid opting for less efficient approaches like overallocating memory to ensure alignment each time a value is saved, Winch’s architecture employs a shadow stack pointer approach.

Winch’s shadow stack pointer approach defines x28 as the base register for stack memory addressing, enabling:

  • 8-byte stack slots for live value spilling.
  • 8-byte aligned stack memory loads.

Signal handlers

Wasmtime can be configured to leverage signals-based traps to detect exceptional situations in Wasm programs, e.g., an out-of-bounds memory access. Traps are synchronous exceptions, and when they are raised, they are caught and handled by code defined in Wasmtime’s runtime. These handlers are Rust functions compiled to the target ISA, following the native calling convention, which implies that whenever there is a transition from Winch-generated code to a signal handler, SP must be 16-byte aligned. Note that even though Wasmtime can be configured to avoid signals-based traps, Winch does not support such an option yet.

Given that traps can happen at arbitrary program points, Winch’s approach to ensure 16-byte alignment for SP is two-fold:

  • Emit a series of instructions that will correctly align SP before each potentially-trapping Wasm instruction. Note that this could result in overallocation of stack memory if SP is not 16-byte aligned.
  • Exclusively use SSP as the canonical stack pointer value, copying the value of SSP to SP after each allocation/deallocation. This maintains the SP >= SSP invariant, which ensures that SP always reflects an overapproximation of the consumed stack space, and allows the generated code to save an extra move instruction if overallocation due to alignment happens, as described in the previous point.

It’s worth noting that the approach mentioned above doesn’t take into account asynchronous exceptions, also known as interrupts. Further testing and development are needed in order to ensure that Winch-generated code for AArch64 can correctly handle interrupts, e.g., SIGALRM.

Immediate Value Handling

To minimize register pressure and reduce the need for spilling values, Winch’s instruction selection prioritizes emitting instructions that support immediate operands whenever possible, such as mov x0, #imm. However, due to the fixed-width instruction encoding in AArch64 (which always uses 32-bit instructions), encoding large immediate values directly within a single instruction can sometimes be impossible. In such cases, the immediate is first loaded into an auxiliary register—often a “scratch” or temporary register—and then used in subsequent instructions that require register operands.
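
The selection logic is essentially a branch on whether the constant fits the instruction’s immediate field. Here is a hedged sketch (a hypothetical helper emitting assembly text for illustration; Winch’s real MacroAssembler emits machine code directly):

// Illustrative sketch only: pick the immediate form when the constant fits
// AArch64's 12-bit add immediate, otherwise materialize it into a scratch
// register first (a real emitter may need a movz/movk sequence) and use the
// register-register form.
fn emit_add_imm(dst: u8, src: u8, imm: u64, out: &mut Vec<String>) {
    let scratch = 16; // e.g. x16, a register reserved as scratch
    if imm < (1 << 12) {
        out.push(format!("add x{dst}, x{src}, #{imm}"));
    } else {
        out.push(format!("mov x{scratch}, #{imm}"));
        out.push(format!("add x{dst}, x{src}, x{scratch}"));
    }
}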

Scratch registers offer the advantage that they are not tracked by the register allocator, reducing the possibility of register allocator induced spills. However, they should be used sparingly and only for short-lived operations.

AArch64’s fixed 32-bit instruction encoding imposes stricter limits on the size of immediate values that can be encoded directly, unlike other ISAs supported by Winch, such as x86_64, which support variable-length instructions and can encode larger immediates more easily.

Before supporting AArch64, Winch’s ISA-agnostic component assumed a single scratch register per ISA. While this worked well for x86_64, where most instructions can encode a broad range of immediates directly, it proved problematic for AArch64: specifically, for instruction sequences in which the scratch register was already in use and an instruction’s immediate could not be encoded directly.

Consider the following snippet from Winch’s ISA-agnostic code for computing a Wasm table element address:

// 1. Load index into the scratch register.
masm.mov(scratch.writable(), index.into(), bound_size)?; 
// 2. Multiply with an immediate element size.
masm.mul(
	scratch.writable(),
	scratch.inner(),
	RegImm::i32(table_data.element_size.bytes() as i32),
	table_data.element_size,
)?;
masm.load_ptr(
	masm.address_at_reg(base, table_data.offset)?,
	writable!(base),
)?;
masm.mov(writable!(tmp), base.into(), ptr_size)?;
masm.add(writable!(base), base, scratch.inner().into(), ptr_size)

In step 1, the code clobbers the designated scratch register. More critically, if the immediate passed to Masm::mul cannot be encoded directly in the AArch64 mul instruction, the Masm::mul implementation will load the immediate into a register—clobbering the scratch register again—and emit a register-based multiplication instruction.

One way to address this limitation is to avoid using a scratch register for the index altogether and instead request a register from the register allocator. This approach, however, increases register pressure and potentially raises memory traffic, particularly in architectures like x86_64.

Winch’s preferred solution is to introduce an explicit scratch register allocator that provides a small pool of scratch registers (e.g., x16 and x17 in AArch64). By managing scratch registers explicitly, Winch can safely allocate and use them without risking accidental clobbering, especially when generating code for architectures with stricter immediate encoding constraints.
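
A minimal sketch of such a scratch-register pool (illustrative only; Winch’s actual implementation differs) hands out registers from a small fixed set and fails loudly when the pool is exhausted, so overlapping uses can’t silently clobber each other:

// Illustrative sketch only: a tiny pool of scratch registers (e.g. x16/x17 on
// AArch64) with explicit acquire/release.
struct ScratchPool {
    available: Vec<u8>,
}

impl ScratchPool {
    fn new() -> Self {
        ScratchPool { available: vec![16, 17] } // x16 and x17
    }

    fn acquire(&mut self) -> u8 {
        self.available
            .pop()
            .expect("scratch pool exhausted: longer-lived values belong in allocatable registers")
    }

    fn release(&mut self, reg: u8) {
        self.available.push(reg);
    }
}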

What’s Next

Though it wasn’t a radical change, the completeness of AArch64 support in Winch marks a new stage for the compiler’s architecture, laying a more robust and solid foundation for future ISA additions.

Contributions are welcome! If you’re interested in contributing, you can:

That’s a wrap

Thanks to everyone who contributed to the completeness of the AArch64 backend! Thanks also to Nick Fitzgerald and Chris Fallin for their feedback on early drafts of this article.

]]>
Saúl Cabrera
Running WebAssembly (Wasm) Components From the Command Line2025-05-21T00:00:00+00:002025-05-21T00:00:00+00:00https://bytecodealliance.org/articles/invoking-component-functions-in-wasmtime-cliWasmtime’s 33.0.0 release supports invoking Wasm component exports directly from the command line with the new --invoke flag. This article walks through building a Wasm component in Rust and using wasmtime run --invoke to execute specific functions (enabling powerful workflows for scripting, testing, and integrating Wasm into modern development pipelines).

The Evolution of Wasmtime’s CLI

Wasmtime’s run subcommand has traditionally supported running Wasm modules as well as invoking a module’s exported functions. With the evolution of the Wasm Component Model, however, this article focuses on a newer capability: creating a component that exports a function and then demonstrating how to invoke that component’s exported function.

By the end of this article, you’ll be ready to create Wasm components and orchestrate their exported component functions to improve your workflow’s efficiency and promote reuse. Potential examples include:

  • Shell Scripting: Embed Wasm logic directly into Bash or Python scripts for seamless automation.
  • CI/CD Pipelines: Validate components in GitHub Actions, GitLab CI, or other automation tools without embedding them in host applications.
  • Cross-Language Testing: Quickly verify that interfaces match across implementations in Rust, JavaScript, and Python.
  • Debugging: Inspect exported functions during development with ease.
  • Microservices: Chain components in serverless workflows, such as compress → encrypt → upload, leveraging Wasm’s modularity.

Tooling & Dependencies

If you want to follow along, please install:

You can check versions using the following commands:

$ rustc --version
$ cargo --version
$ cargo component --version
$ wasmtime --version

We must explicitly add the wasm32-wasip2 target. This ensures that our component adheres to WASI’s system interface for non-browser environments (e.g., file system access, sockets, random, etc.):

$ rustup target add wasm32-wasip2

Creating a New Wasm Component With Rust

Let’s start by creating a new Wasm library that we will later convert to a Wasm component using cargo component and the wasm32-wasip2 target:

$ cargo component new --lib wasm_answer
$ cd wasm_answer

If you open the Cargo.toml file, you will notice that the cargo component command has automatically added some essential configurations.

These include the wit-bindgen-rt dependency (with the ["bitflags"] feature) under [dependencies], and the crate-type = ["cdylib"] setting under the [lib] section.

Your Cargo.toml should now include these entries (as shown in the example below):

[package]
name = "wasm_answer"
version = "0.1.0"
edition = "2024"

[dependencies]
wit-bindgen-rt = { version = "0.41.0", features = ["bitflags"] }

[lib]
crate-type = ["cdylib"]

[package.metadata.component]
package = "component:wasm-answer"

[package.metadata.component.dependencies]

The directory structure of the wasm_answer example is automatically scaffolded out for us by cargo component:

$ tree wasm_answer

wasm_answer
├── Cargo.lock
├── Cargo.toml
├── src
│   ├── bindings.rs
│   └── lib.rs
└── wit
    └── world.wit

WIT

If we open the wit/world.wit file that cargo component created for us, we can see that cargo component generates a minimal world.wit that exports a raw function:

package component:wasm-answer;

/// An example world for the component to target.
world example {
    export hello-world: func() -> string;
}

We can simply adjust the export line (as shown below):

package component:wasm-answer;

/// An example world for the component to target.
world example {
    export get-answer: func() -> u32;
}

But, instead, let’s use an interface to export our function!

While the above approach works, the recommended best practice is to wrap related functions inside an interface, which you then export from your world. This is more modular, extensible, and aligns with how the Wasm Interface Type (WIT) format is used in multi-function or real-world components. Let’s update the wit/world.wit file as follows:

package component:wasm-answer;

interface answer {
    get-answer: func() -> u32;
}

world example {
    export answer;
}

Next, we update our src/lib.rs file accordingly, by pasting in the following Rust code:

#[allow(warnings)]
mod bindings;

use bindings::exports::component::wasm_answer::answer::Guest;

struct Component;

impl Guest for Component {
    fn get_answer() -> u32 {
        42
    }
}

bindings::export!(Component with_types_in bindings);

Now, let’s create the Wasm component with our exported get_answer() function:

$ cargo component build --target wasm32-wasip2

Our newly generated .wasm file now lives at the following location:

$ file target/wasm32-wasip2/debug/wasm_answer.wasm
target/wasm32-wasip2/debug/wasm_answer.wasm: WebAssembly (wasm) binary module version 0x1000d

We can also use the --release option which optimises builds for production:

$ cargo component build --target wasm32-wasip2 --release

If we check the sizes of the debug and release builds, we see 2.1M and 16K, respectively.

Debug:

$ du -mh target/wasm32-wasip2/debug/wasm_answer.wasm
2.1M	target/wasm32-wasip2/debug/wasm_answer.wasm

Release:

$ du -mh target/wasm32-wasip2/release/wasm_answer.wasm
16K	target/wasm32-wasip2/release/wasm_answer.wasm

How Invoke Works: A Practical Example

The wasmtime run command can take one positional argument and just run a .wasm or .wat file:

$ wasmtime run foo.wasm
$ wasmtime run foo.wat

Invoke: Wasm Modules

In the case of a Wasm module that exports a raw function directly, the run command accepts an optional --invoke argument, which is the name of an exported raw function (of the module) to run:

$ wasmtime run --invoke initialize foo.wasm

Invoke: Wasm Components

In the case of a Wasm component that uses typed interfaces (defined in WIT, in concert with the Component Model), the run command now also accepts the optional --invoke argument for calling an exported function of a component.

However, calling an exported function of a component uses WAVE (a human-oriented text encoding of Wasm Component Model values). For example:

$ wasmtime run --invoke 'initialize()' foo.wasm

You will notice the different syntax of initialize versus 'initialize()' when referring to a module versus a component, respectively.

Back to our get-answer() example:

$ wasmtime run --invoke 'get-answer()' target/wasm32-wasip2/debug/wasm_answer.wasm
42

You will notice that the above get-answer() function call does not pass in any arguments. Let’s discuss how to represent the arguments passed into function calls in a structured way (using WAVE).

Wasm Value Encoding (WAVE)

Transferring and invoking complex argument data via the command line is challenging, especially with Wasm components that use diverse value types. To simplify this, Wasm Value Encoding (WAVE) was introduced, offering a concise way to represent structured values directly in the CLI.

WAVE provides a standard way to encode function calls and/or results. WAVE is a human-oriented text encoding of Wasm Component Model values, designed to be consistent with the WIT IDL format.

Below are a few additional pointers for constructing your wasmtime run --invoke commands using WAVE.

Quotes

As shown above, the component’s exported function name and mandatory parentheses are contained in one set of single quotes, i.e., 'get-answer()':

$ wasmtime run --invoke 'get-answer()' target/wasm32-wasip2/release/wasm_answer.wasm

The result from our correctly typed command above is as follows:

42

Parentheses

Parentheses after the exported function’s name are mandatory. The presence of the parentheses () signifies function invocation, as opposed to the function name just being referenced. If your function takes a string argument, ensure that you enclose your string in double quotes (inside the parentheses). For example:

$ wasmtime run --invoke 'initialize("hello")' foo.wasm

If your exported function takes more than one argument, ensure that each argument is separated using a single comma , as shown below:

$ wasmtime run --invoke 'initialize("Pi", 3.14)' foo.wasm
$ wasmtime run --invoke 'add(1, 2)' foo.wasm

Recap: Wasm Modules versus Wasm Components

Let’s wrap this article up with a recap to crystallize your knowledge.

Earlier Wasmtime Run Support for Modules

If we are not using the Component Model and just creating a module, we use a simple command like wasmtime run foo.wasm (without WAVE syntax). This approach typically applies to modules, which export a _start function, or reactor modules, which can optionally export the wasi:cli/run interface—standardized to enable consistent execution semantics.

Example of running a Wasm module that exports a raw function directly:

$ wasmtime run --invoke initialize foo.wasm

Wasmtime Run Support for Components

As Wasm evolves with the Component Model, developers gain fine-grained control over component execution and composition. Components using WIT can now be run with wasmtime run, using the optional --invoke argument to call exported functions (with WAVE).

Example of running a Wasm component that exports a function:

$ wasmtime run --invoke 'add(1, 2)' foo.wasm

For more information, visit the cli-options section of the Wasmtime documentation.

Benefits and Usefulness

The addition of support for the run --invoke feature for components allows users to specify and execute exported functions from a Wasm component. This enables greater flexibility for testing, debugging, and integration. We now have the ability to execute arbitrary exported functions directly from the command line; this opens up a world of possibilities for integrating Wasm into modern development pipelines.

This evolution from monolithic Wasm modules to composable, CLI-friendly components exemplifies the versatility and power of Wasm in real-world scenarios.

]]>
Tim McCallum
Wasmtime Becomes the First Bytecode Alliance Core Project2025-04-30T00:00:00+00:002025-04-30T00:00:00+00:00https://bytecodealliance.org/articles/wasmtime-core-projectThe Bytecode Alliance is very happy to announce a significant milestone for both Wasmtime and the Bytecode Alliance: Wasmtime has officially been promoted to become the BA’s first Core Project. As someone deeply involved in Wasmtime and the proposal process, I’m incredibly excited to share this news and what it signifies.

Defining Core Projects

Within the Bytecode Alliance, we’ve established two tiers for the projects under our umbrella: Hosted and Core. While all projects in the BA, Hosted and Core alike, are required to drive forward and align with our mission and operational principles, Core Projects represent the flagships of the Alliance.

This distinction isn’t merely symbolic. Core Projects are held to even more rigorous standards concerning governance maturity, security practices, community health, and strategic alignment with the BA’s goals. You can find the detailed criteria in our Core and Hosted Project Requirements. In return for meeting these heightened expectations, Core Projects gain direct representation on the Bytecode Alliance Technical Steering Committee (TSC), playing a crucial role in guiding the technical evolution of the Alliance. Establishing this tier, and having Wasmtime be the first project to meet its requirements, is a vital step in maturing the BA’s governance structure.

Wasmtime: A Natural Fit as the Inaugural Core Project

Wasmtime is a fast, scalable, highly secure, and embeddable WebAssembly runtime in wide use across many different environments.

From its inception, Wasmtime was designed to embody the core tenets of the Bytecode Alliance. Its focus on providing a fast, secure, and standards-compliant WebAssembly runtime aligns directly with the BA’s mission to create state-of-the-art foundations emphasizing security, efficiency, and modularity.

Wasmtime has been instrumental in turning the Component Model vision of fine-grained sandboxing and capabilities-based security – what we initially called “nanoprocesses” – into a practical reality. It has consistently served as a proving ground for cutting-edge standards work, particularly the Component Model and WASI, driving innovation while maintaining strict standards compliance. Our commitment to robust security practices, including extensive fuzzing and a rigorous security response process, is non-negotiable.

The journey to Core Project status involved formally documenting how Wasmtime meets these stringent requirements. You can find this documentation in our proposal for Core Project status, which provides evidence for the Wasmtime project’s mature governance, security posture, CI/CD processes, community health, and widespread production adoption. Based on this evidence and the TSC’s strong recommendation, the Board of Directors unanimously agreed that Wasmtime not only fulfills the criteria but is strategically vital to the Alliance’s success, making it the ideal candidate to become the first Core Project.

Re-Joining the TSC

After the Core Project promotion, the Wasmtime core team has appointed me to represent the project on the TSC, so I re-joined the TSC in this new role.

More Information

You can find more information about Wasmtime in a number of places:

And you can join the conversation in the Bytecode Alliance community’s chat platform, which has a dedicated channel for Wasmtime.

]]>
Till Schneidereit
Wasmtime LTS Releases2025-04-22T00:00:00+00:002025-04-22T00:00:00+00:00https://bytecodealliance.org/articles/wasmtime-ltsWasmtime is a lightweight WebAssembly runtime built for speed, security, and standards-compliance. Wasmtime now supports long-term-support (LTS) releases that are maintained with security fixes for 2 years after their initial release.

The Wasmtime project releases a new version once a month with new features, bug fixes, and performance improvements. Previously, though, these releases were only supported for 2 months, meaning that embedders needed to follow the Wasmtime project pretty closely to receive security updates. This rate of change can be too fast for users, so Wasmtime now supports LTS releases.

Every 12th version of Wasmtime will now be considered an LTS release and will receive security fixes for 2 years, or 24 months. This means that users can now update Wasmtime once a year instead of once a month and be guaranteed that they will always receive security updates. Wasmtime’s 24.0.0 release has been retroactively classified as an LTS release and will be supported until August 20, 2026.1

  1. Wasmtime’s upcoming 36.0.0 release on August 20, 2025 will be supported until August 20, 2027, meaning that users will have one year starting in August to upgrade from 24.0.0 to 36.0.0.

You can view a table of Wasmtime’s releases in the documentation book which has information on all currently supported releases, upcoming releases, and information about previously supported releases. The high-level summary of Wasmtime’s LTS release channel is:

  • LTS releases receive patch updates for 2 years after their initial release.
  • Patch releases are guaranteed to preserve API compatibility.
  • Patch releases strive to maintain tooling compatibility (e.g. the Rust version required to compile Wasmtime) from the time of release. Depending on EOL dates from components such as GitHub Actions images, however, this may need minor updates.
  • Patch releases are guaranteed to be issued for any security bug found in historical releases of Wasmtime.
  • Patch releases may be issued to fix non-security-related bugs as they are discovered. The Wasmtime project will rely on contributions to provide backports for these fixes.
  • Patch releases will not be issued for new features to Wasmtime, even if a contribution is made to backport a new feature.

If you’re a current user of Wasmtime and would like to use an LTS release then it’s recommended to either downgrade to the 24.0.0 version or wait for this August to upgrade to the 36.0.0 version. Wasmtime 34.0.0, to be released June 20, 2025, will be supported up until the release of Wasmtime 36.0.0 on August 20, 2025.

]]>
Alex Crichton
WAMR 2024: A Year in Review2025-02-19T00:00:00+00:002025-02-19T00:00:00+00:00https://bytecodealliance.org/articles/wamr-2024-summaryIn 2024, the WAMR community saw many thrilling advancements, including the development of new features, increased industrial use, and an improved experience for developers. Passionate developers and industry professionals have come together to enhance and expand WAMR in ways we couldn’t have imagined. From exciting new tools to a growing community, there’s a lot to be proud of. Let’s take a closer look at the key highlights of WAMR 2024, showcasing the community’s efforts, new features, and the establishment of the Embedded Special Interest Group (ESIG).

Community Contributions: A Year of Growth

The WAMR community has shown incredible dedication and enthusiasm throughout 2024. Here are some impressive numbers that highlight the community’s contributions:

  • 707 New PRs: The community has been actively involved in enhancing WAMR, with 707 new PRs submitted this year.
  • 292 New Issues: Developers have identified and reported 292 new issues, helping to improve the stability and performance of WAMR.
  • 861 New Stars on GitHub: The project gained 861 new stars, reflecting its growing popularity and recognition.
  • 236 Active Participants: With 236 active participants, the community has been vibrant and engaged, driving WAMR forward with their collective efforts.

Breaking down the contributions further:

  • Intel and Others: Nearly half of the PRs (43.85%) were created by Intel, while the remaining 56.15% were created by the community, including independent contributors and customers integrating WAMR into their products.
  • Community Contributions: The major driving force within the community is company contributors, who provided approximately 85% of the PRs among those created by non-Intel contributors.

The top three non-Intel organized contributors have made significant impacts:

  • Midokura: Contributed 33.33% of organized PRs and helped review 35.01% of PRs.
  • Amazon: Contributed 14.33% of organized PRs and helped review 19.90% of PRs.
  • Xiaomi: Contributed 12.40% of organized PRs and helped review 15.11% of PRs.

These contributions have been instrumental in driving WAMR forward, and we extend our heartfelt thanks to everyone involved.

New Features in WAMR 2024

Several exciting new features have been added to WAMR in 2024, aimed at enhancing the development experience and expanding the capabilities of WAMR. Here are some of the key features:

Development Tools: Simplifying Wasm Development

One of the most exciting additions to WAMR in 2024 is the introduction of new development tools aimed at simplifying Wasm development. These tools include:

  • Linux perf for Wasm Functions: This tool allows developers to profile Wasm functions directly, providing insights into performance bottlenecks.
  • AOT Debugging: Ahead-of-time (AOT) debugging support has been added, making it easier to debug Wasm applications.
  • Call Stack Dumps: Enhanced call stack dumps provide detailed information about the execution flow, aiding in troubleshooting and optimization.

Before these tools, developing a Wasm application or plugin using a host language was a complex task. Mapping Wasm functions back to the source code written in the host language required deep knowledge and was often cumbersome. Debugging information from the runtime and the host language felt like two foreign languages trying to communicate without a translator. These new development tools act as that much-needed translator, bridging the gap and making Wasm development more accessible and efficient.

Shared Heap: Efficient Memory Sharing

Another significant feature introduced in 2024 is the shared heap. This feature addresses the challenge of sharing memory between the host and Wasm. Traditionally, copying data at the host-Wasm border was inefficient, and existing solutions like externref lacked flexibility and toolchain support.

The shared heap approach uses a pre-allocated region of linear memory as a “swap” area. Both the embedded system and Wasm can store and access shared objects here without the need for copying. However, this feature comes with its own set of challenges. Unlike memory.grow(), the new memory region isn’t controlled by Wasm, and Wasm may not even be aware of it. This requires runtime APIs to map the embedder-provided memory area into linear memory, making it a runtime-level solution rather than a Wasm opcode.

It’s important to note that the shared heap is an experimental feature, and the intent is to work towards a standardized approach within the WebAssembly Community Group (CG). This will help set expectations for early adopters and ensure alignment with the broader Wasm ecosystem. As the feature evolves, feedback from the community will be crucial in shaping its development and eventual standardization.

Newly Implemented Features

Several features have been finalized in 2024, further enhancing WAMR’s capabilities:

  • GC: Garbage collection features for the interpreter, LLVM-JIT, and AOT have been finalized.
  • Legacy Exception Handling: Legacy exception handling for the interpreter has been added.
  • WASI-NN: Support for WASI-NN with OpenVINO and llama.cpp backends has been introduced.
  • WASI Preview1 Support: Ongoing support for WASI on ESP-IDF and Zephyr.
  • Memory64: Table64 support for the interpreter and AOT has been finalized.

These new features and improvements are designed to make WAMR more powerful and easier to use, catering to the needs of developers and industry professionals alike.

Active engagement in Embedded Special Interest Group (ESIG)

In the embedded industry, the perspective on Wasm differs slightly from the cloud-centric view that the current Wasm Community Group (CG) often focuses on. To address these unique requirements, the Embedded Special Interest Group (ESIG) was established in 2024. This group aims to discover solutions that prioritize performance, footprint, and stability, tailored specifically for embedded devices.

The ESIG has already achieved several accomplishments this year, thanks to the shared understanding and collaboration with customers. By focusing on the unique needs of the embedded industry, ESIG is paving the way for more specialized and efficient Wasm solutions.

Industrial adoption

The adoption of WAMR in the industry has been remarkable, with several key players integrating WAMR into their systems to leverage its performance and flexibility. Here are some notable examples:

Alibaba’s Microservice Engine (MSE) has adopted WAMR as a Wasm runtime to execute Wasm plugins in Higress, their gateway. This integration has resulted in an impressive ~50% performance improvement, showcasing the efficiency and robustness of WAMR in real-world applications.

WAMR has also been integrated into Runwasi as one of the Wasm runtimes to execute Wasm in containerd. This integration allows for seamless execution of Wasm modules within containerized environments, providing a versatile and efficient solution for running Wasm applications.

For more information on industrial adoptions and other use cases, please refer to this link.

These examples highlight the growing trust and reliance on WAMR in various industrial applications, demonstrating its capability to deliver significant performance enhancements and operational efficiencies.

Conclusion

2024 has been a transformative year for WAMR, marked by significant community contributions, innovative features, and the establishment of the ESIG. As we look ahead, we are excited about the continued growth and evolution of WAMR, driven by the passion and dedication of our community. We invite you to join us on this journey, explore the new features, and contribute to the future of WebAssembly Micro Runtime.

Thank you for being a part of the WAMR community. Here’s to an even more exciting 2025!

Liang He
Bytecode Alliance Election Results

Published January 14, 2025 (https://bytecodealliance.org/articles/election-results)

Each December the Bytecode Alliance conducts elections to fill important roles on our governing Board and Technical Steering Committee (TSC). I’m pleased to announce the results of our just-held December 2024 election, in which our Recognized Contributors (RCs) selected three Elected Delegates to the TSC and one At-Large Director to represent them on the Alliance Board.

TSC Elected Delegates

The Bytecode Alliance Technical Steering Committee acts as the top-level governing body for projects and Special Interest Groups hosted by the Alliance, ensuring they further the Alliance’s mission and are conducted in accordance with our values and principles. The TSC also oversees the Bytecode Alliance Recognized Contributor program to encourage and engage individual contributors as participants in Alliance projects and groups. As defined in its charter the TSC is composed of representatives from each Alliance Core Project and individuals selected by Recognized Contributors.

Our new TSC Elected Delegates (and their GitHub IDs, as we know each other in our RC community) are:

  • Andrew Brown (@abrown)
  • Bailey Hayes (@ricochet)
  • Oscar Spencer (@ospencer)

They will each serve a two-year term on the TSC.

At-Large Director

Our RCs are also represented by two At-Large Directors they select to serve on our Board (as described in our organization bylaws), with overlapping two-year terms staggered to start each January. In this most recent election, the Recognized Contributors chose Bailey Hayes (@ricochet) as At-Large Director.

Congratulations!

I look forward to working with each of our electees, and am happy to introduce them here as part of bringing them onboard in their new roles. You’ll find our full Board and TSC listed on the About page of our website.

Thank you to all our Recognized Contributors for taking part in the election process and in general for their ongoing support of Alliance projects and communities. I’d also like to thank our outgoing leadership for their outstanding work - Nick Fitzgerald (@fitzgen) as TSC Chair and Elected Delegate, and Till Schneidereit (@tschneidereit) as Elected Delegate and At-Large Director.

David Bryant
Wasmtime 28.0: Optimizing at compile-time, Booleans in the Cranelift DSL, and more

Published January 3, 2025 (https://bytecodealliance.org/articles/wasmtime-28.0)

Wasmtime is a lightweight WebAssembly runtime built for speed, security, and standards-compliance. December’s v28.0 release brings enhancements including a new option for optimization at compile-time, a new first-class type for Cranelift’s domain-specific language (DSL), and more.

  • The Instruction Selection Lowering Expressions (ISLE) DSL for Cranelift now includes a first-class Boolean type, giving users more tools for expressing parts of the Cranelift compiler backend.
  • The addition of a new single-pass register allocator lets users choose between optimizing for compile-time performance or for the runtime performance of the generated code.
  • The public-facing documentation within Wasmtime on memory settings has been reworked and clarified to make usage and configuration clearer.

What’s new in Wasmtime v28.0

Wasmtime v28.0 includes a variety of enhancements and fixes. The release notes are available here.

Added

  • The ISLE DSL used for Cranelift now has a first-class bool type. #9593
  • Cranelift now supports a new single-pass register allocator designed for compile-time performance (unlike the current default which is optimized for runtime-of-generated-code performance). #9611
  • The wasmtime crate now natively supports the wasm-wave crate and its encoding of component value types. #8872
  • A Module can now be created from an already-open file. #9571
  • A new default-enabled crate feature, signals-based-traps, has been added to the wasmtime crate. When disabled, runtime signal handling is not required by the host. This is intended to help with future efforts to port Wasmtime to more platforms. #9614
  • Linear memories may now be backed by malloc under certain conditions, for example when guard pages are disabled. #9614 #9634
  • Wasmtime’s async feature no longer requires std. #9689
  • The buffer and budget capacity of OutgoingBody in wasmtime-wasi-http are now configurable. #9670

Changed

  • Wasmtime’s external and internal distinction of “static” and “dynamic” memories has been refactored and reworded. All options are preserved but exported under different names with improved documentation about how they all interact with one another (and everything should be easier to understand). #9545
  • Each Store<T> now caches a single fiber stack in async mode to avoid allocating/deallocating if the store is used multiple times. #9604
  • Linear memories now have a 32MiB guard region at the end instead of a 2GiB guard region by default. #9606
  • Wasmtime will no longer validate dependencies between WebAssembly features, instead delegating this work to wasmparser’s validator. #9623
  • Cranelift’s isle-in-source-tree feature has been re-worked as an environment variable. #9633
  • Wasmtime’s minimum supported Rust version is now 1.81. #9692
  • Synthetic types in DWARF are now more efficiently represented. #9700
  • Debug builtins on Windows are now exported correctly. #9706
  • Documentation on Config now clarifies that the defaults of some options may differ depending on the selected target, compiler, and supported features. #9705
  • Wasmtime’s error-related types now all unconditionally implement the Error trait, even in #[no_std] mode. #9702

Fixed

  • Field type matching for subtyping with wasm GC has been fixed. #9724
  • Native unwind info generated for s390x has been fixed in the face of tail calls. #9725

Get involved

Thanks to Karl Meakin, Chris Fallin, Pat Hickey, Alex Crichton, Xinzhao Xu, SingleAccretion, Nick Fitzgerald, and Ulrich Weigand for their work on this release.

Want to get involved with Wasmtime? Join the community on our Zulip chat and read the Wasmtime contributors’ guide for more information.

Eric Gregory
Making WebAssembly and Wasmtime More Portable

Published December 17, 2024 (https://bytecodealliance.org/articles/wasmtime-portability)

Portability is among the first properties promoted on WebAssembly’s official homepage:

WebAssembly (abbreviated Wasm) is a binary instruction format for a stack-based virtual machine. Wasm is designed as a portable compilation target for programming languages, enabling deployment on the web for client and server applications.

This portability has led many people to claim that it is a “universal bytecode” — an instruction set that can run on any computer, abstracting away the underlying native architecture and operating system. In practice, however, there remain places you cannot take standard WebAssembly, for example certain memory-constrained embedded devices. Runtimes have been forced to choose between deviating from the standard with ad-hoc language modifications or else avoiding these platforms. This article details in-progress standards proposals to lift these extant language limitations; enumerates recent engineering efforts to greatly expand Wasmtime’s platform support; and, finally, shares some ways that you can get involved and help us further improve Wasm’s portability.

WebAssembly has a lot going for it. It has a formal specification that is developed in an open, collaborative standards process by browser, runtime, hardware, and language toolchain vendors, among others. It’s sandboxed, so a Wasm program cannot access any resource you don’t explicitly give it access to, leading to the development of standard Wasm APIs leveraging capability-based security. It is designed such that, after compilation to native code, it can be executed at near-native speeds. And, even if there is room for improvement, it is portable across many systems, running in Web browsers and on servers, workstations, phones, and more. These qualities are worth commending, preserving, and making available in even more places.

What does Wasm need in order to run on a given platform? A Wasm runtime that supports that platform. In this article, we’ll focus on the runtime we’re building: Wasmtime.

Wasmtime is a lightweight, standalone WebAssembly runtime developed openly within the Bytecode Alliance. Wasmtime is fast. It can, for example, spawn new Wasm instances in just 5 microseconds. We, the Wasmtime developers, labor to ensure that Wasmtime is correct and secure, leveraging ubiquitous fuzzing and formal verification, because Wasm’s theoretical security properties are only as strong as the runtime’s actual implementation. We are committed to open standards and actively participate in Wasm standardization; Wasmtime does not and will never implement ad-hoc, non-standard Wasm extensions.1 We believe that bringing Wasmtime, its principles, and its strengths to more platforms is a worthwhile endeavor.

So what must Wasmtime, or any other Wasm runtime, have in order to run Wasm on a given platform? There are two fundamental operations that, no matter how they are implemented, a Wasm runtime requires:

  1. A method for allocating a Wasm program’s state, such as its linear memories
  2. A method to execute Wasm instructions

A Wasm runtime’s portability is determined by how few assumptions it makes about its underlying platform in its implementation of those operations. Does it assume an operating system that provides the mmap syscall or a CPU that supports virtual memory? Does it support just a small, fixed set of instruction sets, such as x86_64 and aarch64, or a wide, extensible set of ISAs? And, as previously mentioned, no matter which implementation choices are made, assumptions baked into the Wasm language specification itself can also limit a runtime’s portability.

Removing Runtime Assumptions

Wasmtime’s runtime previously made unnecessary assumptions, artificially constraining its portability, and we’ve spent the last year or so removing them one by one. Wasmtime is now a no_std crate with minimal platform assumptions. It doesn’t require that the underlying platform provide mmap in order to allocate Wasm memories like it previously did; in fact, it no longer even depends upon an underlying operating system at all. As of today, Wasmtime’s only mandatory platform requirement is a global memory allocator (i.e. malloc).

Wasmtime previously assumed that it could always use guard pages to catch out-of-bounds memory accesses, constraining its portability to platforms with virtual memory. Wasmtime can now be configured to rely only on explicit checks to catch out-of-bounds accesses, and Wasmtime no longer assumes the presence of virtual memory.

Wasmtime previously assumed that it could always detect division-by-zero by installing a signal handler. It would translate Wasm division instructions into unguarded, native div instructions and catch the corresponding signals that the operating system translated from divide-by-zero exceptions. This constrained Wasmtime’s portability to only operating systems with signals and instruction sets that raise exceptions on division by zero. Wasmtime can now be configured to emit explicit tests for zero divisors, removing the assumption that divide-by-zero signals are always available.
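To make the two strategies concrete, here is an illustrative sketch, in plain Rust rather than generated machine code, of what the explicit-check lowering of a Wasm unsigned division amounts to:

/// Illustrative only: the semantics of Wasm's `i32.div_u` under the
/// explicit-check strategy. Real generated code branches to Wasmtime's trap
/// machinery rather than returning a Result.
fn i32_div_u_checked(numerator: u32, divisor: u32) -> Result<u32, &'static str> {
    if divisor == 0 {
        // With signals-based traps, this check is omitted and the hardware
        // divide-by-zero exception (delivered as a signal) produces the trap.
        return Err("wasm trap: integer divide by zero");
    }
    Ok(numerator / divisor)
}

fn main() {
    assert_eq!(i32_div_u_checked(10, 2), Ok(5));
    assert!(i32_div_u_checked(1, 0).is_err());
}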

Configure Wasmtime to avoid depending upon virtual memory and signals by building without the signals-based-traps cargo feature and with Config::signals_based_traps(false). More information about configuring minimal Wasmtime builds, as well as integrating with custom operating systems, can be found in the Wasmtime guide.
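For illustration, here is a minimal sketch of that configuration. The Cargo.toml feature list shown in the comment is an assumption; consult the Wasmtime guide for the authoritative set of features needed for a fully minimal build.

use wasmtime::{Config, Engine};

// In Cargo.toml, a minimal build would additionally disable default features
// so that `signals-based-traps` (among other things) is not compiled in, e.g.:
//
//     wasmtime = { version = "28", default-features = false, features = ["runtime"] }
//
// (The exact feature list above is an assumption; see the Wasmtime guide.)

fn main() -> wasmtime::Result<()> {
    let mut config = Config::new();
    // Emit explicit bounds and division checks instead of relying on
    // virtual-memory guard pages and OS signal handlers for traps.
    config.signals_based_traps(false);
    let _engine = Engine::new(&config)?;
    Ok(())
}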

This effort was spearheaded by Alex Crichton, with contributions from Chris Fallin.

Lifting Spec Constraints

The WebAssembly language specification imposes a fairly well-known portability constraint on standards-compliant implementations: Wasm memories are composed of pages, and Wasm pages have a fixed size of 64KiB. Therefore, a Wasm memory’s size is always a multiple of 64KiB, and the smallest non-zero memory size is 64KiB. But there exist embedded devices with less than 64KiB of memory available for Wasm, but where developers nonetheless want to run Wasm. I have been championing a new proposal in the WebAssembly standardization group to address this mismatch.

The custom-page-sizes proposal allows a Wasm module to specify a memory’s page size, in bytes, in the memory’s static definition. This gives Wasm modules finer-grained control over their resource consumption: with a one-byte page size, for example, a Wasm memory can be sized to exactly the embedded device’s capacity, even when less than 64KiB are available.

I implemented support for the custom-page-sizes proposal in Wasmtime. You can experiment with it via the --wasm=custom-page-sizes flag on the command line or via the Config::wasm_custom_page_sizes method in the library. Since then, three other Wasm engines have added support for the proposal as well.
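Here is a small sketch of what using the proposal looks like from the Wasmtime library. The (pagesize 1) annotation and the exact limits shown are illustrative assumptions based on the proposal's text format, which may still evolve; the Config method is the one named above.

use wasmtime::{Config, Engine, Instance, Module, Store};

fn main() -> wasmtime::Result<()> {
    // Opt in to the (still experimental) custom-page-sizes proposal.
    let mut config = Config::new();
    config.wasm_custom_page_sizes(true);
    let engine = Engine::new(&config)?;

    // A memory declared with a one-byte page size: 64 pages is just 64 bytes,
    // far below the usual 64KiB minimum. The `(pagesize 1)` annotation is the
    // proposal's text-format syntax (an assumption here; it may still change).
    let wat = r#"
        (module
          (memory (export "mem") 64 64 (pagesize 1)))
    "#;
    let module = Module::new(&engine, wat)?;

    let mut store = Store::new(&engine, ());
    let instance = Instance::new(&mut store, &module, &[])?;
    let mem = instance.get_memory(&mut store, "mem").unwrap();
    assert_eq!(mem.data_size(&store), 64); // 64 one-byte pages
    Ok(())
}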

The proposal is on the standards track and is currently in phase 2 of the standardization process. I intend to shepherd it to phase 3 in 2025. The phase 3 entry requirements (spec tests and an implementation) are already satisfied multiple times over today.

Compilers Without Backends

We’ve discussed allocating Wasm memories portably and removing assumptions from the runtime and language specification; now we turn our attention to portably executing Wasm instructions. Wasmtime previously had two available approaches to Wasm execution:

  1. Cranelift, an optimizing compiler backend that is comparable to the optimizing tier in a just-in-time system, such as V8’s TurboFan or SpiderMonkey’s Ion.
  2. Winch, a single-pass, “baseline” compiler that gives you fast, low-latency compilation but generates worse machine code, leading to slower Wasm execution throughput.

Both options compile Wasm down to native instructions, which has two portability consequences. First, when loaded into memory, the compiled Wasm’s machine code must be executable, and non-portable assumptions around the presence of mmap, memory permissions, and virtual memory on the underlying platform creep in again. Second, compiling to native code as an execution strategy requires a compiler backend for the target platform’s architecture. We cannot translate Wasm instructions into native instructions without a compiler backend that knows how to emit those native instructions. Cranelift has backends for aarch64, riscv64, s390x, and x86_64. Winch has an aarch64 backend and an x86_64 backend. If you wanted to execute Wasm on a different architecture, say armv7 or riscv32, you had to first author a whole compiler backend for that architecture, which is not a quick-and-easy task for established Wasmtime and Cranelift hackers, let alone new contributors. This was a huge roadblock to Wasmtime’s portability.

The typical way to add portable execution is with an interpreter written in a portable manner, so we started investigating that design space for Wasmtime. With a portable interpreter, you can execute Wasm on any platform you can compile the interpreter for. In Wasmtime’s case, because it is written in Rust, a portable interpreter would expand Wasmtime’s portability to all of the many platforms that rustc supports.

We want to maximize the interpreter’s execution throughput — how fast it can run Wasm.2 If people are running the interpreter due to the absence of a compiler backend for their architecture, then the usual method of tuning Wasmtime for fast Wasm execution (using Cranelift as the execution strategy) is unavailable. Beyond optimizing the interpreter’s core loop and opcode dispatch, the best way to speed up an interpreter is to execute fewer instructions, doing relatively more work per instruction. This pushes us towards translating Wasm into a custom, internal bytecode format. The internal bytecode format can be register-based, rather than stack-based like Wasm, which generally requires fewer instructions to encode the same program. With an internal bytecode we also have the freedom to define “super-instructions” or “macro-ops” — single instructions that do the work of multiple smaller instructions all at once — whenever we determine it would be beneficial. The Wasm-to-internal-bytecode translation step gives us a place to optimize the resulting bytecode before we begin executing it. In addition to coalescing multiple operations into macro-ops, we have the opportunity to do things like deduplicate subexpressions and eliminate redundant moves. At this point we realized that the translation step was sounding more and more like a proper optimizing compiler. We already maintain an optimizing compiler that performs exactly these sorts of optimizations; we just need to teach it to emit the interpreter’s internal bytecode rather than native code.
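As a loose illustration of the register-based, macro-op idea (this is a toy sketch, not Pulley's actual instruction set or encoding):

// Illustrative only: a register-based bytecode with a "macro-op" that does the
// work of two simpler instructions in a single interpreter dispatch.
#[derive(Clone, Copy)]
enum Op {
    /// regs[dst] = regs[lhs] + regs[rhs]
    Add { dst: usize, lhs: usize, rhs: usize },
    /// regs[dst] = regs[src] << amt
    Shl { dst: usize, src: usize, amt: u32 },
    /// Macro-op: regs[dst] = (regs[lhs] + regs[rhs]) << amt, one dispatch
    /// instead of two.
    AddShl { dst: usize, lhs: usize, rhs: usize, amt: u32 },
    /// Stop and return regs[src].
    Ret { src: usize },
}

fn interpret(code: &[Op], regs: &mut [u64]) -> u64 {
    for op in code {
        match *op {
            Op::Add { dst, lhs, rhs } => regs[dst] = regs[lhs].wrapping_add(regs[rhs]),
            Op::Shl { dst, src, amt } => regs[dst] = regs[src] << amt,
            Op::AddShl { dst, lhs, rhs, amt } => {
                regs[dst] = regs[lhs].wrapping_add(regs[rhs]) << amt
            }
            Op::Ret { src } => return regs[src],
        }
    }
    0
}

fn main() {
    let mut regs = [0u64; 4];
    regs[1] = 2;
    regs[2] = 3;
    // (2 + 3) << 4, computed by a single macro-op dispatch plus a return.
    let code = [Op::AddShl { dst: 0, lhs: 1, rhs: 2, amt: 4 }, Op::Ret { src: 0 }];
    assert_eq!(interpret(&code, &mut regs), 80);
}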

The Pulley interpreter is the culmination of this line of thinking. When Wasmtime is using Pulley, it translates Wasm to Cranelift’s intermediate representation, CLIF; then Cranelift runs its mid-end optimizations on the CLIF, such as constant propagation, GVN, and LICM; next, Cranelift lowers the CLIF to Pulley bytecode, coalescing multiple CLIF instructions into single Pulley macro-ops, eliminating dead code, and (re)allocating (virtual) registers to reduce moves; and finally, Wasmtime interprets the resulting optimized bytecode.

       ┌──────┐
       │ Wasm │
       └──────┘
          │
          │
Wasm-to-CLIF translation
          │
          ▼
       ┌──────┐
       │ CLIF │
       └──────┘
          │
          │
 mid-end optimizations
          │
          ▼
       ┌──────┐
       │ CLIF │
       └──────┘
          │
          │
       lowering
          │
          ▼
 ┌─────────────────┐
 │ Pulley bytecode │
 └─────────────────┘

Just like Wasm-to-native-code compilation, Wasm-to-Pulley-bytecode compilation can be performed offline and ahead of time. Bytecode compilation need not be on the critical path and, given an already-bytecode-compiled Wasm module, Pulley execution can leverage the same 5-microsecond instantiation that native compilation strategies enjoy.

Initial Pulley support has landed in Wasmtime, but it is still a work in progress and at times incomplete. We have not spent time optimizing Pulley, its interpreter loop, or its selection of macro-ops yet, so it is expected that its performance today is not as good as it should be. You can experiment with Pulley by enabling the pulley cargo feature and passing the --target pulley32 or --target pulley64 command line flag (depending on whether you are on a 32- or 64-bit machine, respectively) or by calling config.target("pulley32") or config.target("pulley64") when using Wasmtime as a library. Note that you must use the (default) Cranelift compilation strategy with Pulley; Winch doesn’t support emitting Pulley bytecode at this time.
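A rough sketch of the library-embedding workflow described above, assuming the pulley cargo feature is enabled and a 64-bit host:

use wasmtime::{Config, Engine, Module};

fn main() -> wasmtime::Result<()> {
    // Requires the `pulley` cargo feature on the `wasmtime` crate.
    let mut config = Config::new();
    // Compile to 64-bit Pulley bytecode instead of native machine code.
    config.target("pulley64")?;
    let engine = Engine::new(&config)?;

    // Ahead-of-time compile a trivial module down to Pulley bytecode...
    let module = Module::new(&engine, r#"(module (func (export "nop")))"#)?;
    // ...and serialize it, so that bytecode compilation stays off the critical
    // path, just like native ahead-of-time compilation.
    let precompiled = module.serialize()?;
    println!("compiled {} bytes of Pulley artifact", precompiled.len());
    Ok(())
}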

The architecture and pipeline for Pulley emerged from discussions between myself and Alex Crichton. The initial Pulley interpreter and Cranelift backend to emit Pulley bytecode were both developed by me. Alex integrated the interpreter into Wasmtime’s runtime and has since been filling in its full breadth of Wasm support.

Write Once, Run Anywhere?

We’ve been focusing on Wasmtime’s portability and the ability to run any Wasm code on as many platforms as possible. The infamous “write once, run anywhere” (WORA) ambition aims even higher: to run the exact same code on all platforms, without changing its source or recompiling it.

At a high level, an application requires certain core capabilities. Not all code needs to or should run on platforms that lack its required capabilities: running an IRC chat client on a device that isn’t connected to the internet doesn’t generally make sense because an IRC chat client requires network access. WORA across two given platforms is a worthy goal only when the platforms both provide the application’s required capabilities (at a high level, regardless of whether they happen to use incompatible syscalls or different mechanisms to expose the capabilities).

The WebAssembly Component Model makes explicit the capability dependencies of a Wasm component and introduces the concept of a world to formalize an environment’s available capabilities. With components and worlds, we can precisely answer the question of whether WORA makes sense across two given platforms. Along with the standard worlds and interfaces defined by WASI, we already have all the tools we need to make WORA a reality for Wasm where it makes sense.3

Come Help Us!

Do you believe in our vision and want to contribute to our portability efforts? The work here isn’t done and there are many opportunities to get involved!

  • Try building a minimal Wasmtime for your niche platform, kick the tires, and share your feedback with us.

  • Help us get Pulley passing all of the .wast spec tests! Making a failing test start passing is usually pretty straightforward and just involves adding a missing instruction or two. This is a great way to start contributing to Wasmtime and Cranelift.

  • Once Pulley is complete, or at least mostly complete, we can start analyzing and improving its performance. We can run our Sightglass benchmarks like spidermonkey.wasm under Pulley to determine what can be improved. We can inspect the generated bytecode, identify which pairs of opcodes are often found one after the other, and create new macro-ops. There is a lot of fun, meaty performance engineering work available here for folks who enjoy making number go up.

  • Support for running Wasm binaries that use custom page sizes is complete in Wasmtime, but toolchain support for generating Wasm binaries with custom page sizes is still largely missing. Adding support for the custom-page-sizes proposal to wasm-ld is what is needed most. It’s expected that this implementation should be relatively straightforward and that exposing a __wasm_page_size symbol can be modelled after the existing __tls_size symbol.

  • At the time of writing, a minimal dynamic library that runs pre-compiled Wasm modules is a 315KiB binary on x86_64. A minimal build of Wasmtime’s whole C API as a dynamic library is 698KiB. These numbers aren’t terrible, but we also haven’t put any effort into optimizing Wasmtime for code size yet, so we expect there to be a fair amount of potential code size wins and low-hanging fruit available. We suspect error strings are a major code size offender, and revamping wasmtime::Error to optionally (based on compile-time features) contain just error codes, instead of full strings, is one idea we have. Analyzing code size with cargo bloat would also be fantastic.

We also publish high-level contributor documentation in the Wasmtime guide.

Thanks

Big thanks to everyone who has contributed to the recent portability effort and to Wasmtime over the years. Thanks also to Alex Crichton and Till Schneidereit for reviewing early drafts of this article.

  1. If an engine chooses not to abide by the constraints imposed by the WebAssembly language specification, then it is not implementing WebAssembly. It is instead implementing a language that is similar to but subtly different from WebAssembly. This leads to interoperability hazards, de facto standards, and ecosystem splits. We saw this during the early days of the Web, when Websites used non-standard, Internet Explorer-specific APIs. This led to broken Websites for people using other browsers, and eventually forced other browsers to reverse engineer the non-standard APIs. The Web is still stuck with the resulting baggage and tech debt today. We must prevent this from happening to WebAssembly. Therefore we refuse the temptation to deviate from the WebAssembly specification. Instead, when we identify language-level constraints, we engage with the standards process to create solutions that the whole ecosystem can rely on. 

  2. To build the very fastest interpreter possible, you probably want to write assembly by hand, but that directly conflicts with our primary goal of portability so it is unacceptable. We want to maximize interpreter speed to the degree we can, but we cannot prioritize it over portability. 

  3. The component model also gives us tools to break Wasm applications down into their constituent parts, and share those parts across different applications. Even when WORA doesn’t make sense for a full application, it might make sense for some subset of its business logic that happens to require fewer capabilities than the full application. For example, we may want to share the logic for maintaining the set of active IRC users between both the server and the client. 

Nick Fitzgerald