Note: This is cross-posted from my personal blog.
Wasmtime is a WebAssembly runtime that focuses on safety and fast Wasm
execution. But despite that focus on speed, Wasmtime has historically chosen not
to perform inlining in its optimizing compiler backend, Cranelift. There were
two reasons for this surprising decision: first, Cranelift is a per-function
compiler designed such that Wasmtime can compile all of a Wasm module’s
functions in parallel. Inlining is inter-procedural and requires synchronization
between function compilations; that synchronization reduces parallelism. Second,
Wasm modules are generally produced by an optimizing toolchain, like LLVM, that
already did all the beneficial inlining. Any calls remaining in the module will
not benefit from inlining — perhaps they are on slow paths marked
[[unlikely]] or the callee is annotated with #[inline(never)]. But
WebAssembly’s component model changes this calculus.
With the component model, developers can compose multiple Wasm modules — each produced by different toolchains — into a single program. Those toolchains only had a local view of the call graph, limited to their own module, and they couldn’t see cross-module or fused adapter function definitions. None of them, therefore, had an opportunity to inline calls to such functions. Only the Wasm runtime’s compiler, which has the final, complete call graph and function definitions in hand, has that opportunity.
Therefore we implemented function inlining in Wasmtime and Cranelift. Its
initial implementation landed in Wasmtime version 36; however, it remains
off by default and is still baking. You can test it out via the -C inlining=y
command-line flag or the
wasmtime::Config::compiler_inlining method. The rest of
this article describes function inlining in more detail, digs into the guts of
our implementation and rationale for its design choices, and finally looks at
some early performance results.
Function inlining is a compiler optimization where a call to a function f is
replaced by a copy of f’s body. This removes function call overheads (spilling
caller-save registers, setting up the call frame, etc…) which can be
beneficial on its own. But inlining’s main benefits are indirect: it enables
subsequent optimization of f’s body in the context of the call site. That
context is important — a parameter’s previously unknown value might be
bound to a constant argument and exposing that to the optimizer might cascade
into a large code clean up.
Consider the following example, where function g calls function f:
fn f(x: u32) -> bool {
return x < u32::MAX / 2;
}
fn g() -> u32 {
let a = 42;
if f(a) {
return a;
} else {
return 0;
}
}
After inlining the call to f, function g looks something like this:
fn g() -> u32 {
let a = 42;
let x = a;
let f_result = x < u32::MAX / 2;
if f_result {
return a;
} else {
return 0;
}
}
Now the whole subexpression that defines f_result only depends on constant
values, so the optimizer can replace that subexpression with its known value:
fn g() -> u32 {
let a = 42;
let f_result = true;
if f_result {
return a;
} else {
return 0;
}
}
This reveals that the if-else conditional will, in fact, unconditionally
transfer control to the consequent, and g can be simplified into the
following:
fn g() -> u32 {
let a = 42;
return a;
}
In isolation, inlining f was a marginal transformation. When considered
holistically, however, it unlocked a plethora of subsequent simplifications that
ultimately led to g returning a constant value rather than computing anything
at run-time.
Cranelift’s unit of compilation is a single function, which Wasmtime leverages to compile each function in a Wasm module in parallel, speeding up compile times on multi-core systems. But inlining a function at a particular call site requires that function’s definition, which implies parallelism-hurting synchronization or some other compromise, like additional read-only copies of function bodies. So this was the first goal of our implementation: to preserve as much parallelism as possible.
Additionally, although Cranelift is primarily developed for Wasmtime by
Wasmtime’s developers, it is independent from Wasmtime. It is a reusable library
and is reused, for example, by the Rust project as an alternative backend for
rustc. But a large part of inlining, in practice, is the heuristics
for deciding when inlining a call is likely beneficial, and those heuristics can
be domain specific. Wasmtime generally wants to leave most calls out-of-line,
inlining only cross-module calls, while rustc wants something much more
aggressive to boil away its Iterator combinators and the like. So our second
implementation goal was to separate how we inline a function call from the
decision of whether to inline that call.
These goals led us to a layered design where Cranelift has an optional inlining pass, but the Cranelift embedder (e.g. Wasmtime) must provide a callback to it. The inlining pass invokes the callback for each call site, and the callback returns a command of either “leave the call as-is” or “here is a function body, replace the call with it”. Cranelift is responsible for the inlining transformation and the embedder is responsible for deciding whether to inline a function call and, if so, getting that function’s body (along with whatever synchronization that requires).
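To make the shape of that layering concrete, here is a rough Rust sketch; the names below are illustrative stand-ins, not Cranelift's actual API.

// Illustrative only: made-up types standing in for Cranelift's real ones.
#[derive(Clone, Copy)]
struct FuncId(u32);
struct FunctionBody(); // stand-in for a callee's IR

/// What the embedder's callback tells the inlining pass to do at a call site.
enum InlineCommand {
    /// Leave the call out-of-line.
    KeepCall,
    /// Replace the call with a copy of this function body.
    Inline(FunctionBody),
}

/// The embedder (e.g. Wasmtime) owns the heuristics and whatever lookup or
/// synchronization is needed to produce the callee's body; Cranelift owns the
/// mechanical splicing of that body into the caller.
fn embedder_callback(callee: FuncId, is_cross_module_call: bool) -> InlineCommand {
    if is_cross_module_call && should_inline(callee) {
        InlineCommand::Inline(load_body(callee))
    } else {
        InlineCommand::KeepCall
    }
}

// Hypothetical helpers standing in for Wasmtime's heuristics and function store.
fn should_inline(_callee: FuncId) -> bool { true }
fn load_body(_callee: FuncId) -> FunctionBody { FunctionBody() }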
The mechanics of the inlining transformation — wiring arguments to
parameters, renaming values, and copying instructions and basic blocks into the
caller — are, well, mechanical. Cranelift makes extensive use of arenas
for various entities in its IR,
and we begin by appending the callee’s arenas to the caller’s arenas, renaming
entity references from the callee’s arena indices to their new indices in the
caller’s arenas as we do so. Next we copy the callee’s block layout into the
caller and replace the original call instruction with a jump to the caller’s
inlined version of the callee’s entry block. Cranelift uses block parameters,
rather than phi nodes, so the call arguments simply become jump
arguments. Finally, we translate each instruction from the callee into the
caller. This is done via a pre-order traversal to ensure that we process value
definitions before value uses, simplifying instruction operand rewriting. The
changes to Wasmtime’s compilation orchestration are more interesting.
The following pseudocode describes Wasmtime’s compilation orchestration before Cranelift gained an inlining pass and also when inlining is disabled:
// Compile each function in parallel.
let objects = parallel map for func in wasm.functions {
compile(func)
};
// Combine the functions into one region of executable memory, resolving
// relocations by mapping function references to PC-relative offsets.
return link(objects)
The naive way to update that process to use Cranelift’s inlining pass might look something like this:
// Optionally perform some pre-inlining optimizations in parallel.
parallel for func in wasm.functions {
pre_optimize(func);
}
// Do inlining sequentially.
for func in wasm.functions {
func.inline(|f| if should_inline(f) {
Some(wasm.functions[f])
} else {
None
})
}
// And then proceed as before.
let objects = parallel map for func in wasm.functions {
compile(func)
};
return link(objects)
Inlining is performed sequentially, rather than in parallel, which is a bummer. But if we tried to make that loop parallel by logically running each function’s inlining pass in its own thread, then a callee function we are inlining might or might not have had its transitive function calls inlined already depending on the whims of the scheduler. That leads to non-deterministic output, and our compilation must be deterministic, so it’s a non-starter.1 But whether a function has already had transitive inlining done or not leads to another problem.
With this naive approach, we are either limited to one layer of inlining or else
potentially duplicating inlining effort, repeatedly inlining e into f each
time we inline f into g, h, and i. This is because f may come before
or after g in our wasm.functions list. We would prefer it if f already
contained e and was already optimized accordingly, so that every caller of f
didn’t have to redo that same work when inlining calls to f.
This suggests we should topologically sort our functions based on their call
graph, so that we inline in a bottom-up manner, from leaf functions (those that
do not call any others) towards root functions (those that are not called by any
others, typically main and other top-level exported functions). Given a
topological sort, we know that whenever we are inlining f into g either (a)
f has already had its own inlining done or (b) f and g participate in a
cycle. Case (a) is ideal: we aren’t repeating any work because it’s already been
done. Case (b), when we find cycles, means that f and g are mutually
recursive. We cannot fully inline recursive calls in general (just as you cannot
fully unroll a loop in general) so we will simply avoid inlining these
calls.2 So topological sort avoids repeating work, but our inlining
phase is still sequential.
At the heart of our proposed topological sort is a call graph traversal that visits callees before callers. To parallelize inlining, you could imagine that, while traversing the call graph, we track how many still-uninlined callees each caller function has. Then we batch all functions whose associated counts are currently zero (i.e. they aren’t waiting on anything else to be inlined first) into a layer and process them in parallel. Next, we decrement each of their callers’ counts and collect the next layer of ready-to-go functions, continuing until all functions have been processed.
let call_graph = CallGraph::new(wasm.functions);
let counts = { f: call_graph.num_callees_of(f) for f in wasm.functions };
let layer = [ f for f in wasm.functions if counts[f] == 0 ];
while layer is not empty {
parallel for func in layer {
func.inline(...);
}
let next_layer = [];
for func in layer {
for caller in call_graph.callers_of(func) {
counts[caller] -= 1;
if counts[caller] == 0 {
next_layer.push(caller)
}
}
}
layer = next_layer;
}
This algorithm will leverage available parallelism, and it avoids repeating work
via the same dependency-based scheduling that topological sorting did, but it
has a flaw. It will not terminate when it encounters recursion cycles in the
call graph. If function f calls function g which also calls f, for
example, then it will not schedule either of them into a layer because they are
both waiting for the other to be processed first. One way we can avoid this
problem is by avoiding cycles.
If you partition a graph’s nodes into maximal disjoint sets, such that every node in a set is reachable from every other node in that set, you get that graph’s strongly-connected components (SCCs). If a node does not participate in a cycle, then it will be in its own singleton SCC. The members of a cycle, on the other hand, will all be grouped into the same SCC, since those nodes are all reachable from each other.
In the following example, the dotted boxes designate the graph’s SCCs:
Ignoring edges between nodes within the same SCC, and only considering edges across SCCs, gives us the graph’s condensation. The condensation is always acyclic, because the original graph’s cycles are “hidden” within the SCCs.
Here is the condensation of the previous example:
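As a concrete illustration of SCCs, the condensation, and the reverse (“evaporation”) queries, here is a small Rust sketch using the petgraph crate; Wasmtime's real implementation uses its own purpose-built data structures.

use petgraph::algo::condensation;
use petgraph::graph::DiGraph;
use petgraph::Direction;

fn main() {
    // A toy call graph: an edge a -> b means "function a calls function b".
    // Functions 1 and 2 call each other, so they share an SCC.
    let call_graph = DiGraph::<(), ()>::from_edges(&[
        (0, 1), (1, 2), (2, 1), (0, 3), (3, 2),
    ]);

    // Collapse every SCC into a single node; passing `true` drops edges inside
    // an SCC, so the resulting condensation is acyclic.
    let condensed = condensation(call_graph, true);

    for scc in condensed.node_indices() {
        // Walking *incoming* edges of the condensation answers "which SCCs
        // call into this one?", i.e. the queries we make of the evaporation.
        let calling_sccs = condensed
            .neighbors_directed(scc, Direction::Incoming)
            .count();
        println!(
            "SCC {:?}: {} function(s), called from {} other SCC(s)",
            scc,
            condensed[scc].len(),
            calling_sccs
        );
    }
}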
We can adapt our parallel-inlining algorithm to operate on strongly-connected
components, and now it will correctly terminate because we’ve removed all
cycles. First, we find the call graph’s SCCs and create the reverse (or
transpose) condensation, where an edge a→b is flipped to b→a. We do this
because we will query this graph to find the callers of a given function f,
not the functions that f calls. I am not aware of an existing name for the
reverse condensation, so, at Chris Fallin’s brilliant suggestion, I have decided
to call it an evaporation. From there, the algorithm largely remains as it was
before, although we keep track of counts and layers by SCC rather than by
function.
let call_graph = CallGraph::new(wasm.functions);
let components = StronglyConnectedComponents::new(call_graph);
let evaporation = Evaporation::new(components);
let counts = { c: evaporation.num_callees_of(c) for c in components };
let layer = [ c for c in components if counts[c] == 0 ];
while layer is not empty {
parallel for func in scc in layer {
func.inline(...);
}
let next_layer = [];
for scc in layer {
for caller_scc in evaporation.callers_of(scc) {
counts[caller_scc] -= 1;
if counts[caller_scc] == 0 {
next_layer.push(caller_scc);
}
}
}
layer = next_layer;
}
This is the algorithm we use in Wasmtime, modulo minor tweaks here and there to engineer some data structures and combine some loops. After parallel inlining, the rest of the compiler pipeline continues in parallel for each function, yielding unlinked machine code. Finally, we link all that together and resolve relocations, same as we did previously.
Heuristics are the only implementation detail left to discuss, but there isn’t much to say that hasn’t already been said. Wasmtime prefers not to inline calls within the same Wasm module, while cross-module calls are a strong hint that we should consider inlining. Beyond that, our heuristics are extremely naive at the moment, and only consider the code sizes of the caller and callee functions. There is a lot of room for improvement here, and we intend to make those improvements on demand as people start playing with the inliner. For example, there are many things we don’t consider in our heuristics today, but possibly should.
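To make that concrete, the decision currently amounts to something like the following size-based check; the thresholds and names here are made up for illustration and are not Wasmtime's actual values.

/// A sketch of a naive, size-based inlining heuristic (illustrative only).
fn should_inline(caller_size: usize, callee_size: usize, is_cross_module: bool) -> bool {
    // By default, leave same-module calls alone: the producing toolchain has
    // presumably already made its own inlining decisions about them.
    if !is_cross_module {
        return false;
    }
    // Made-up budgets: only inline small callees, and don't let the caller
    // grow without bound.
    const MAX_CALLEE_SIZE: usize = 200;
    const MAX_CALLER_SIZE: usize = 5_000;
    callee_size <= MAX_CALLEE_SIZE && caller_size + callee_size <= MAX_CALLER_SIZE
}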
The speed up you get (or don’t get) from enabling inlining is going to vary from program to program. Here are a couple of synthetic benchmarks.
First, let’s investigate the simplest case possible, a cross-module call of an empty function in a loop:
(component
;; Define one module, exporting an empty function `f`.
(core module $M
(func (export "f")
nop
)
)
;; Define another module, importing `f`, and exporting a function
;; that calls `f` in a loop.
(core module $N
(import "m" "f" (func $f))
(func (export "g") (param $counter i32)
(loop $loop
;; When counter is zero, return.
(if (i32.eq (local.get $counter) (i32.const 0))
(then (return)))
;; Do our cross-module call.
(call $f)
;; Decrement the counter and continue to the next iteration
;; of the loop.
(local.set $counter (i32.sub (local.get $counter)
(i32.const 1)))
(br $loop))
)
)
;; Instantiate and link our modules.
(core instance $m (instantiate $M))
(core instance $n (instantiate $N (with "m" (instance $m))))
;; Lift and export the looping function.
(func (export "g") (param "n" u32)
(canon lift (core func $n "g"))
)
)
We can inspect the machine code that this compiles down to via the wasmtime
compile and wasmtime objdump commands. Let’s focus only on the looping
function. Without inlining, we see a loop around a call, as we would expect:
00000020 wasm[1]::function[1]:
;; Function prologue.
20: pushq %rbp
21: movq %rsp, %rbp
;; Check for stack overflow.
24: movq 8(%rdi), %r10
28: movq 0x10(%r10), %r10
2c: addq $0x30, %r10
30: cmpq %rsp, %r10
33: ja 0x89
;; Allocate this function's stack frame, save callee-save
;; registers, and shuffle some registers.
39: subq $0x20, %rsp
3d: movq %rbx, (%rsp)
41: movq %r14, 8(%rsp)
46: movq %r15, 0x10(%rsp)
4b: movq 0x40(%rdi), %rbx
4f: movq %rdi, %r15
52: movq %rdx, %r14
;; Begin loop.
;;
;; Test our counter for zero and break out if so.
55: testl %r14d, %r14d
58: je 0x72
;; Do our cross-module call.
5e: movq %r15, %rsi
61: movq %rbx, %rdi
64: callq 0
;; Decrement our counter.
69: subl $1, %r14d
;; Continue to the next iteration of the loop.
6d: jmp 0x55
;; Function epilogue: restore callee-save registers and
;; deallocate this function's stack frame.
72: movq (%rsp), %rbx
76: movq 8(%rsp), %r14
7b: movq 0x10(%rsp), %r15
80: addq $0x20, %rsp
84: movq %rbp, %rsp
87: popq %rbp
88: retq
;; Out-of-line traps.
89: ud2
╰─╼ trap: StackOverflow
When we enable inlining, then M::f gets inlined into N::g. Despite N::g
becoming a leaf function, we will still push %rbp and all that in the prologue
and pop it in the epilogue, because Wasmtime always enables frame pointers. But
because it no longer needs to shuffle values into ABI argument registers or
allocate any stack space, it doesn’t need to do any explicit stack checks, and
nearly all the rest of the code also goes away. All that is left is a loop
decrementing a counter to zero:3
00000020 wasm[1]::function[1]:
;; Function prologue.
20: pushq %rbp
21: movq %rsp, %rbp
;; Loop.
24: testl %edx, %edx
26: je 0x34
2c: subl $1, %edx
2f: jmp 0x24
;; Function epilogue.
34: movq %rbp, %rsp
37: popq %rbp
38: retq
With this simplest of examples, we can just count the difference in the number of instructions executed per loop iteration: 12 without inlining (7 in N::g and 5 in M::f, which are 2 to push the frame pointer, 2 to pop it, and 1 to return) versus only 4 with inlining.
But we might as well verify that the inlined version really is faster via some
quick-and-dirty benchmarking with hyperfine. This won’t measure only Wasm
execution time, it also measures spawning a whole Wasmtime process, loading code
from disk, etc…, but it will work for our purposes if we crank up the number
of iterations:
$ hyperfine \
"wasmtime run --allow-precompiled -Cinlining=n --invoke 'g(100000000)' no-inline.cwasm" \
"wasmtime run --allow-precompiled -Cinlining=y --invoke 'g(100000000)' yes-inline.cwasm"
Benchmark 1: wasmtime run --allow-precompiled -Cinlining=n --invoke 'g(100000000)' no-inline.cwasm
Time (mean ± σ): 138.2 ms ± 9.6 ms [User: 132.7 ms, System: 6.7 ms]
Range (min … max): 128.7 ms … 167.7 ms 19 runs
Benchmark 2: wasmtime run --allow-precompiled -Cinlining=y --invoke 'g(100000000)' yes-inline.cwasm
Time (mean ± σ): 37.5 ms ± 1.1 ms [User: 33.0 ms, System: 5.8 ms]
Range (min … max): 35.7 ms … 40.8 ms 77 runs
Summary
'wasmtime run --allow-precompiled -Cinlining=y --invoke 'g(100000000)' yes-inline.cwasm' ran
3.69 ± 0.28 times faster than 'wasmtime run --allow-precompiled -Cinlining=n --invoke 'g(100000000)' no-inline.cwasm'
Okay so if we measure Wasm doing almost nothing but empty function calls and then we measure again after removing function call overhead, we get a big speed up — it would be disappointing if we didn’t! But maybe we can benchmark something a tiny bit more realistic.
A program that we commonly reach for when benchmarking is a small wrapper
around the pulldown-cmark markdown library that parses the CommonMark
specification (which is itself written in markdown) and renders that to
HTML. This is Real World™ code operating on Real World™ inputs that matches Real
World™ use cases people have for Wasm. That said, good benchmarking is incredibly
difficult, but this program is nonetheless a pretty good candidate for inclusion
in our corpus. There’s just one hiccup: in order for our inliner to activate
normally, we need a program using components and making cross-module calls, and
this program doesn’t do that. But we don’t have a good corpus of such benchmarks
yet because this kind of component composition is still relatively new, so let’s
keep using our pulldown-cmark program but measure our inliner’s effects via a
more circuitous route.
Wasmtime has tunables to enable the inlining of intra-module
calls4 and rustc and LLVM have tunables for disabling
inlining5. Therefore we can roughly estimate the speed-ups our inliner might unlock on a similar but extensively componentized, cross-module-calling program by:
Disabling inlining when compiling the Rust source code to Wasm
Compiling the resulting Wasm binary to native code with Wasmtime twice: once with inlining disabled, and once with intra-module call inlining enabled
Comparing those two different compilations’ execution speeds
Running this experiment with Sightglass, our internal benchmarking infrastructure and tooling, yields the following results:
execution :: instructions-retired :: pulldown-cmark.wasm
Δ = 7329995.35 ± 2.47 (confidence = 99%)
with-inlining is 1.26x to 1.26x faster than without-inlining!
[35729153 35729164.72 35729173] without-inlining
[28399156 28399169.37 28399179] with-inlining
Wasmtime and Cranelift now have a function inliner! Test it out via the -C
inlining=y command-line flag or via the
wasmtime::Config::compiler_inlining method. Let us know if
you run into any bugs or whether you see any speed-ups when running Wasm
components containing multiple core modules.
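In an embedding, enabling it looks roughly like the following sketch (this assumes the compiler_inlining method named above simply takes a bool, and uses the usual anyhow-flavored Engine constructor):

use wasmtime::{Config, Engine};

fn main() -> anyhow::Result<()> {
    let mut config = Config::new();
    // Opt in to the still-off-by-default inlining pass.
    config.compiler_inlining(true);
    let engine = Engine::new(&config)?;
    // ... compile and run components with `engine` as usual ...
    let _ = engine;
    Ok(())
}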
Thanks to Chris Fallin and Graydon Hoare for reading early drafts of this piece and providing valuable feedback. Any errors that remain are my own.
Deterministic compilation gives a number of benefits: testing is easier, debugging is easier, builds can be byte-for-byte reproducible, it is well-behaved in the face of incremental compilation and fine-grained caching, etc… ↩
For what it is worth, this still allows collapsing chains of
mutually-recursive calls (a calls b calls c calls a) into a single,
self-recursive call (abc calls abc). Our actual implementation does not
do this in practice, preferring additional parallelism instead, but it could
in theory. ↩
Cranelift cannot currently remove loops without side effects, and generally doesn’t mess with control-flow at all in its mid-end. We’ve had various discussions about how we might best fit control-flow-y optimizations into Cranelift’s mid-end architecture over the years, but it also isn’t something that we’ve seen would be very beneficial for actual, Real World™ Wasm programs, given that (a) LLVM has already done much of this kind of thing when producing the Wasm, and (b) we do some branch-folding when lowering from our mid-level IR to our machine-specific IR. Maybe we will revisit this sometime in the future if it crops up more often after inlining. ↩
-C cranelift-wasmtime-inlining-intra-module=yes ↩
-Cllvm-args=--inline-threshold=0,
-Cllvm-args=--inlinehint-threshold=0, and -Zinline-mir=no ↩
Note: this is a cross-post with my personal blog; this post is also available here.
When first discussing this work, I made an off-the-cuff estimate in the Wasmtime biweekly project meeting that it would be “maybe two weeks on the compiler side and a week in Wasmtime”. Reader, I need to make a confession now: I was wrong and it was not a three-week task. This work spanned from late March to August of this year (roughly half-time, to be fair; I wear many hats). Let that be a lesson!1
In this post we’ll first cover what exceptions are and why some languages want them (and what other languages do instead) – in particular what the big deal is about (so-called) “zero-cost” exception handling. Then we’ll see how Wasm has specified a bytecode-level foundation that serves as a least-common denominator but also has some unique properties. We’ll then take a roundtrip through what it means for a compiler to support exceptions – the control-flow implications, how one reifies the communication with the unwinder, how all this intersects with the ABI, etc. – before finally looking at how Wasmtime puts it all together (and is careful to avoid performance pitfalls and stay true to the intended performance of the spec).
Many readers will already be familiar with exceptions as they are present in languages as widely varied as Python, Java, JavaScript, C++, Lisp, OCaml, and many more. But let’s briefly review so we can (i) be precise what we mean by an exception, and (ii) discuss why exceptions are so popular.
Exception
handling
is a mechanism for nonlocal flow control. In particular, most
flow-control constructs are intraprocedural (send control to other
code in the current function) and lexical (target a location that
can be known statically). For example, if statements and loops
both work this way: they stay within the local function, and we know
exactly where they will transfer control. In contrast, exceptions are
(or can be) interprocedural (can transfer control to some point in
some other function) and dynamic (target a location that depends on
runtime state).2
To unpack that a bit: an exception is thrown when we want to signal an error or some other condition that requires “unwinding” the current computation, i.e., backing out of the current context; and it is caught by a “handler” that is interested in the particular kind of exception and is currently “active” (waiting to catch that exception). That handler can be in the current function, or in any function that has called it. Thus, an exception throw and catch can result in an abnormal, early return from a function.
One can understand the need for this mechanism by considering how
programs can handle errors. In some languages, such as Rust, it is
common to see function signatures of the form fn foo(...) ->
Result<T, E>. The
Result type
indicates that foo normally returns a value of type T, but may
produce an error of type E instead. The key to making this ergonomic
is providing some way to “short-circuit” execution if an error is
returned, propagating that error upward: that is, Rust’s ? operator,
for example, which turns into essentially “if there was an error,
return that error from this function”.3 This is quite conceptually
nice in many ways: why should error handling be different than any
other data flow in the program? Let’s describe the type of results to
include the possibility of errors; and let’s use normal control flow
to handle them. So we can write code like
fn f() -> Result<u32, Error> {
if bad {
return Err(Error::new(...));
}
Ok(0)
}
fn g() -> Result<u32, Error> {
// The `?` propagates any error to our caller, returning early.
let result = f()?;
Ok(result + 1)
}
and we don’t have to do anything special in g to propagate errors
from f further, other than use the ? operator.
But there is a cost to this: it means that every error-producing
function has a larger return type, which might have ABI implications
(another return register at least, if not a stack-allocated
representation of the Result and the corresponding loads/stores to
memory), and also, there is at least one conditional branch after
every call to such a function that checks if we need to handle the
error. The dynamic efficiency of the “happy path” (with no thrown
exceptions) is thus impacted. Ideally, we skip any cost unless an
error actually occurs (and then perhaps we accept slightly more cost
in that case, as tradeoffs often go).
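To make that cost concrete, here is roughly what the ? in g above expands to; this is a sketch, and the real desugaring goes through the Try trait and applies From::from to the error value.

// Stubs so the sketch stands alone; the real `f` is the one shown earlier.
struct Error;

fn f() -> Result<u32, Error> {
    Ok(0)
}

fn g() -> Result<u32, Error> {
    let result = match f() {
        // The happy path still has to test the discriminant here...
        Ok(v) => v,
        // ...and carry this early-return path for the error case.
        Err(e) => return Err(e),
    };
    Ok(result + 1)
}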
It turns out that this is possible with the help of the language
runtime. Consider what happens if we omit the Result return types
and error checks at each return. We will need to reach the code that
handles the error in some other way. Perhaps we can jump directly to
this code somehow?
The key idea of “zero-cost exception handling” is to get the compiler to build side-tables to tell us where this code – known as a “handler” – is. We can walk the callstack, visiting our caller and its caller and onward, until we find a function that would be interested in the error condition we are raising. This logic is implemented with the help of these side-tables and some code in the language runtime called the “unwinder” (because it “unwinds” the stack). If no errors are raised, then none of this logic is executed at runtime. And we no longer have our explicit checks for error returns in the “happy path” where no errors occur. This is why this style of error handling is commonly called “zero-cost”: more precisely, it is zero-cost when no errors occur, but the unwinding in case of error can still be expensive.
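As a toy model of that side-table idea (nothing like a real Itanium or Wasmtime unwinder, just the shape of it): the compiler records which handler, if any, covers each call site, and the runtime consults that table only when something is actually thrown.

/// One compiler-emitted side-table entry: a range of call-site program
/// counters and the handler that covers them (e.g. one try block).
struct HandlerEntry {
    covers: std::ops::Range<usize>,
    handler_pc: usize,
}

/// A toy stack frame: just the program counter of the call site in this frame.
struct Frame {
    pc: usize,
}

/// Walk frames from innermost to outermost and report the first handler whose
/// range covers the frame's call site. On the happy path this is never called
/// at all, which is the whole point of the "zero-cost" design.
fn find_handler(stack: &[Frame], table: &[HandlerEntry]) -> Option<(usize, usize)> {
    // `stack[0]` is the outermost frame, so iterate in reverse.
    for (depth, frame) in stack.iter().enumerate().rev() {
        if let Some(entry) = table.iter().find(|e| e.covers.contains(&frame.pc)) {
            return Some((depth, entry.handler_pc));
        }
    }
    None // uncaught: the runtime would report an error or abort
}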
This is the status quo for exception-handling implementations in most production languages: for example, in the C++ world, exception handling is commonly implemented via the Itanium C++ ABI4, which defines a comprehensive set of tables emitted by the compiler and a complex dance between the system unwinding library and compiler-generated code to find and transfer control to handlers. Handler tables and stack unwinders are common in interpreted and just-in-time (JIT)-compiled language implementations, too: for example, SpiderMonkey has try notes on its bytecode (so named for “try blocks”) and a HandleException function that walks stack frames to find a handler.
The WebAssembly specification now (since version 3.0) has exception handling. This proposal was a long time in the making by various folks in the standards, toolchain and browser worlds, and the CG (standards group) has now merged it into the spec and included it in the recently-released “Wasm 3.0” milestone. If you’re already familiar with the proposal, you can skip over this section to the Cranelift- and Wasmtime-specific bits below.
First: let’s discuss why Wasm needs an extension to the bytecode definition to support exceptions. As we described above, the key idea of zero-cost exception handling is that an unwinder visits stack frames and looks for handlers, transferring control directly to the first handler it finds, outside the normal function return path. Because the call stack is protected, or not directly readable or writable from Wasm code (part of Wasm’s control-flow integrity aspect), an unwinder that works this way necessarily must be a privileged part of the Wasm runtime itself. We can’t implement it in “userspace” because there is no way for Wasm bytecode to transfer control directly back to a distant caller, aside from a chain of returns. This missing functionality is what the extension to the specification adds.
The implementation comes down to only three opcodes (!), and some new types in the bytecode-level type system. (In other words – given the length of this post – it’s deceptively simple.) These opcodes are:
try_table, which wraps an inner body, and specifies handlers to
be active during that body. For example:
(block $b1 ;; defines a label for a forward edge to the end of this block
(block $b2 ;; likewise, another label
(try_table
(catch $tag1 $b1) ;; exceptions with tag `$tag1` will be caught by code at $b1
(catch_all $b2) ;; all other exceptions will be caught by code at $b2
body...)))
In this example, if an exception is thrown from within the code in
body, and it matches one of the specified tags (more below!),
control will transfer to the location defined by the end of the
given block. (This is the same as other control-flow transfers in
Wasm: for example, a branch br $b1 also jumps to the end of
$b1.)
This construct is the single all-purpose “catch” mechanism, and is
powerful enough to directly translate typical try/catch blocks
in most programming languages with exceptions.
throw: an instruction to directly throw a new exception. It
carries the tag for the exception, like: throw $tag1.
throw_ref, used to rethrow an exception that has already been
caught and is held by reference (more below!).
And that’s it! We implement those three opcodes and we are “done”.
That’s not the whole story, of course. Ordinarily a source language will offer the ability to carry some data as part of an exception: that is, the error condition is not just one of a static set of kinds of errors, but contains some fields as well. (E.g.: not just “file not found”, but “file not found: $PATH”.)
One could build this on top of a bytecode-level exception-throw mechanism that only had throw/catch with static tags, with the help of some global state, but that would be cumbersome; instead, the Wasm specification offers payloads on each exception. For full generality, this payload can actually take the form of a list of values; i.e., it is a full product type (struct type).
We alluded to “tags” above but didn’t describe them in detail. These tags are key to the payload definition: each tag is effectively a type definition that specifies its list of payload value types as well. (Technically, in the Wasm AST, a tag definition names a function type with only parameters, no returns, which is a nice way of reusing an existing entity/concept.) Now we show how they are defined with a sample module:
(module
;; Define a "tag", which serves to define the specific kind of exception
;; and specify its payload values.
(tag $t (param i32 i64))
(func $f (param i32 i64)
;; Throw an exception, to be caught by whatever handler is "closest"
;; dynamically.
(throw $t (local.get 0) (local.get 1)))
(func $g (result i32 i64)
(block $b (result i32 i64)
;; Run a body below, with the given handlers (catch-clauses)
;; in-scope to catch any matching exceptions.
;;
;; Here, if an exception with tag `$t` is thrown within the body,
;; control is transferred to the end of block `$b` (as if we had
;; branched to it), with the payload values for that exception
;; pushed to the operand stack.
(try_table (catch $t $b)
(call $f (i32.const 1) (i64.const 2)))
(i32.const 3)
(i64.const 4))))
Here we’ve defined one tag (the Wasm text format lets us attach a name
$t, but in the binary format it is only identified by its index, 0),
with two payload values. We can throw an exception with this tag given
values of these types (as in function $f) and we can catch it if we
specify a catch destination as the end of a block meant to return
exactly those types as well. Here, if function $g is invoked, the
exception payload values 1 and 2 will be thrown with the
exception, which will be caught by the try_table; the results of
$g will be 1 and 2. (The values 3 and 4 are present to allow
the Wasm module to validate, i.e. have correct types, but they are
dynamically unreachable because of the throw in $f and will not be
returned.)
This is an instance where Wasm, being a bytecode, can afford to generalize a bit relative to real-metal ISAs and offer conveniences to the Wasm producer (i.e., toolchain generating Wasm modules). In this sense, it is a little more like a compiler IR. In contrast, most other exception-throw ABIs have a fixed definition of payload, e.g., one or two machine register-sized values. In practice some producers might choose a small fixed signature for all exception tags anyway, but there is no reason to impose such an artificial limit if there is a compiler and runtime behind the Wasm in any case.
So far, we’ve seen how Wasm’s primitives can allow for basic exception throws and catches, but what about languages with scoped resources, e.g. C++ with its destructors? If one writes something like
struct Scoped {
Scoped() {}
~Scoped() { cleanup(); }
};
void f() {
Scoped s;
throw my_exception();
}
then the throw should transfer control out of f and upward to
whatever handler matches, but the destructor of s still needs to run
and call cleanup. This is not quite a “catch” because we don’t want
to terminate the search: we aren’t actually handling the error
condition.
The usual approach to compile such a program is to “catch and rethrow”. That is, the program is lowered to something like
try {
throw ...
} catch_any(e) {
cleanup();
rethrow e;
}
where catch_any catches any exception propagating past this point
on the stack, and rethrow re-throws the same exception.
Wasm’s exception primitives provide exactly the pieces we need for
this: a catch_all_ref clause, which catches all exceptions and
boxes the caught exception as a reference; and a throw_ref
instruction, which re-throws a previously-caught exception.5
In actuality there is a two-by-two matrix of “catch” options: we can
catch a specific tag or catch_all; and we can catch and
immediately unpack the exception into its payload values (as we saw
above), or we can catch it as a reference. So we have catch,
catch_ref, catch_all, and catch_all_ref.6
There is one final detail to the Wasm proposal, and in fact it’s the
part that I find the most interesting and unique. Given the above
introduction, and any familiarity with exception systems in other
language semantics and/or runtime systems, one might expect that the
“tags” identifying kinds of exceptions and matching throws with
particular catch handlers would be static labels. In other words, if I
throw an exception with tag $tA, then the first handler for $tA
anywhere up the stack, from any module, should catch it.
However, one of Wasm’s most significant properties as a bytecode is its emphasis on isolation. It has a distinction between static modules and dynamic instances of those modules, and modules have no “static members”: every entity (e.g., memory, table, or global variable) defined by a module is replicated per instance of that module. This creates a clean separation between instances and means that, for example, one can freely reuse a common module (say, some kind of low-level glue or helper module) with separate instances in many places without them somehow communicating or interfering with each other.
Consider what happens if we have an instance A that invokes some other (dynamically provided) function reference which ultimately invokes a callback in A. Say that the instance throws an exception from within its callback in order to unwind all the way to its outer stack frames, across the intermediate functions in some other Wasm instance(s):
A.f ---------call---------> B.g --------call---------> A.callback
^ v
catch $t throw $t
| |
`----------------------------<-------------------------------------'
The instance A expects that the exception that it throws from its
callback function to f is a local concern to that instance only,
and that B cannot interfere. After all, if the exception tag is
defined inside A, and Wasm preserves modularity, then B should not be
able to name that tag to catch exceptions by that tag, even if it also
uses exception handling internally. The two modules should not
interact: that is the meaning of modularity, and it permits us to
reason about each instance’s behavior locally, with the effects of
“the rest of the world” confined to imports and exports.
Unfortunately, if one designed a straightforward “static” tag-matching
scheme, this might not be the case if B were an instance of the same
module as A: in that case, if B also used a tag $t internally and
registered handlers for that tag, it could interfere with the desired
throw/catch behavior, and violate modularity.
So the Wasm exception handling standard specifies that tags have
dynamic instances as well, just as memories, tables and globals
do. (Put in programming-language theory terms, tags are generative.)
Each instance of a module creates its own dynamic identities for the
statically-defined tags in those modules, and uses those dynamic
identities to tag exceptions and find handlers. This means that no
matter what instance B is, above, if instance A does not export its
tag $t for B to import, there is no way for B to catch the thrown
exception explicitly (it can still catch all exceptions, and it may
do so and rethrow to perform some cleanup). Local modular reasoning is
restored.
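A toy illustration of that generativity in Rust (purely illustrative; Wasmtime's actual representation of tags is different): every instantiation mints fresh identities for the module's statically-declared tags.

use std::sync::atomic::{AtomicU64, Ordering};

static NEXT_TAG_ID: AtomicU64 = AtomicU64::new(0);

/// The dynamic identity a tag receives at instantiation time.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
struct InstanceTag(u64);

struct Instance {
    /// One fresh dynamic tag per tag statically declared by the module.
    tags: Vec<InstanceTag>,
}

fn instantiate(num_static_tags: u32) -> Instance {
    Instance {
        // Generativity: every instantiation mints brand-new identities.
        tags: (0..num_static_tags)
            .map(|_| InstanceTag(NEXT_TAG_ID.fetch_add(1, Ordering::Relaxed)))
            .collect(),
    }
}

fn main() {
    // Two instances of the "same module", each declaring one tag.
    let a = instantiate(1);
    let b = instantiate(1);
    // The static tag index (0) is the same, but the dynamic identities differ,
    // so a handler in `b` for its own tag can never catch `a`'s exceptions
    // unless `a` explicitly exports the tag and `b` imports it.
    assert_ne!(a.tags[0], b.tags[0]);
}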
Once we have tags as dynamic entities, just like Wasm memories, we can take the same approach that we do for the other entities to allow them to be imported and exported. Thus, visibility of exception payloads and ability for modules to catch certain exceptions is completely controlled by the instantiation graph and the import/export linking, just as for all other Wasm storage.
This is surprising (or at least was to me)! It creates some pretty unique implementation challenges in the unwinder – in essence, it means that we need to know about instance identity for each stack frame, not just static code location and handler list.
Before we implement the primitives for exception handling in Wasmtime, we need to support exceptions in our underlying compiler backend, Cranelift.
Why should this be a compiler concern? What is special about exceptions that makes them different from, say, new Wasm instructions that implement additional mathematical operators (when we already have many arithmetic operators in the IR), or Wasm memories (when we already have loads/stores in the IR)?
In brief, the complexities come in three flavors: new kinds of control flow, fundamentally different than ordinary branches or calls in that they are “externally actuated” (by the unwinder); a new facet of the ABI (that we get to define!) that governs how the unwinder interacts with compiled code; and interactions between the “scoped” nature of handlers and inlining in particular. We’ll talk about each below.
Note that much of this discussion started with an RFC for Wasmtime/Cranelift, which had been posted way back in August of 2024 by Daniel Hillerstrom with help from my colleague Nick Fitzgerald, and was discussed then; many of the choices within were subsequently refined as I discovered interesting nuances during implementation and we talked them through.
There are a few ways to think about exception handlers from the point of view of compiler IR (intermediate representation). First, let’s recognize that exception handling (i) is a form of control flow, and (ii) has all the same implications for various compiler stages that other kinds of control flow do. For example, the register allocator has to consider how to get registers into the right state whenever control moves from one basic block to the next (“edge moves”); exception catches are a new kind of edge, and so the regalloc needs to be aware of that, too.
One could see every call or other opcode that could throw as having regular control-flow edges to every possible handler that could match. I’ll call this the “regular edges” approach. The upside is that it’s pretty simple to retrofit: one “only” needs to add new kinds of control-flow opcodes that have out-edges, but that’s already a kind of thing that IRs have. The disadvantage is that, in functions with a lot of possible throwing opcodes and/or handlers, the overhead can get quite high. And control-flow graph overhead is a bad kind of overhead: many analyses’ runtimes are heavily dependent on the edge and node (basic block) counts, sometimes superlinearly.
The other major option is to build a kind of implicit new control flow into the IR’s semantics. For example, one could lower the source-language semantics of a “try block” down to regions in the IR, with one set of handlers attached. This is clearly more efficient than adding out-edges from (say) every callsite within the try-block to every handler in scope. On the other hand, it’s hard to overstate how invasive this change would be. This means that every traversal over IR, analyzing dataflow or reachability or any other property, has to consider these new implicit edges anyway. In a large established compiler like Cranelift, we can lean on Rust’s type system for a lot of different kinds of refactors, but changing a fundamental invariant goes beyond that: we would likely have a long tail of issues stemming from such a change, and it would permanently increase the cognitive overhead of making new changes to the compiler. In general we want to trend toward a smaller, simpler core and compositional rather than entangled complexity.
Thus, the choice is clear: in Cranelift we opted to introduce one new
instruction, try_call, that calls a function and catches (some)
exceptions. In other words, there are now two possible kinds of
return paths: a normal return or (possibly one of many) exceptional
return(s). The handled exceptions and block targets are enumerated in
an exception table. Because there are control-flow edges stemming
from this opcode, it is a block terminator, like a conditional
branch. It looks something like (in Cranelift’s IR, CLIF):
function %f0(i32) -> i32, f32, f64 {
sig0 = (i32) -> f32 tail
fn0 = %g(i32) -> f32 tail
block0(v1: i32):
v2 = f64const 0x1.0
;; exception-catching callsite
try_call fn0(v1), sig0, block1(ret0, v2), [ tag0: block2(exn0), default: block3(exn0) ]
;; normal return path
block1(v3: f32, v4: f64):
v5 = iconst.i32 1
return v5, v3, v4
;; exception handler for tag0
block2(v6: i64):
v7 = ireduce.i32 v6
v8 = iadd_imm.i32 v7, 1
v9 = f32const 0x0.0
return v8, v9, v2
;; exception handler for all other exceptions
block3(v10: i64):
v11 = ireduce.i32 v10
v12 = f32const 0x0.0
v13 = f64const 0x0.0
return v11, v12, v13
}
There are a few aspects to note here. First, why are we only concerned
with calls? What about other sources of exceptions? This is an
important invariant in the IR: exception throws are only externally
sourced. In other words, if an exception has been thrown, if we go
deep enough into the callstack, we will find that that throw was
implemented by calling out into the runtime. The IR itself has no
other opcodes that throw! This turns out to be sufficient: (i) we only
need to build what Wasmtime needs, here, and (ii) we can implement
Wasm’s throw opcodes as “libcalls”, or calls into the Wasmtime
runtime. So, within Cranelift-compiled code, exception throws always
happen at callsites. We can thus get away with adding only one opcode,
try_call, and attach handler information directly to that opcode.
The next characteristic of note is that handlers are ordinary basic blocks. This may not seem remarkable unless one has seen other compiler IRs, such as LLVM’s, where exception handlers are definitely special: they start with “landing pad” instructions, and cannot be branched to as ordinary basic blocks. That might look something like:
function %f() {
block0:
;; Callsite defining a return value `v0`, with normal
;; return path to `block1` and exception handler `block2`.
v0 = try_call ..., block1, [ tag0: block2 ]
block1:
;; Normal return; use returned value.
return v0
block2 exn_handler: ;; Specially-marked block!
;; Exception handler payload value.
v1 = exception_landing_pad
...
}
This bifurcation of kinds of blocks (normal and exception handler) is undesirable from our point of view: just as exceptional edges add a new cross-cutting concern that every analysis and transform needs to consider, so would new kinds of blocks with restrictions. It was an explicit design goal (and we have tests that show it!) that the same block can be both an ordinary block and a handler block – not because that would be common, necessarily (handlers usually do very different things than normal code paths), but because it’s one less weird quirk of the IR.
But then if handlers are normal blocks, the data flow question becomes very interesting. An exception-catching call, unlike every other opcode in our IR, has conditionally-defined values: that is, its normal function return value(s) are available only if the callee returns normally, and the exception payload value(s), which are passed in from the unwinder and carry information about the caught exception, are available only if the callee throws an exception that we catch. How can we ensure that these values are represented such that they can only be used in valid ways? We can’t make them all regular SSA definitions of the opcode: that would mean that all successors (regular return and exceptional) get to use them, as in:
function %f() {
block0:
;; Callsite defining a return value `v0`, with normal return path
;; to `block1` and exception handler `block2`.
v0 = try_call ..., block1, [ tag0: block2 ]
block1:
;; Use `v0` legally: it is defined on normal return.
return v0
block2:
;; Oops! We use `v0` here, but the normal return value is undefined
;; when an exception is caught and control reaches this handler block.
return v0
}
This is the reason that a compiler may choose to make handler blocks special: by bifurcating the universe of blocks, one ensures that normal-return and exceptional-return values are used only where appropriate. Some compiler IRs reify exceptional return payloads via “landing pad” instructions that must start handler blocks, just as phis start regular blocks (in phi- rather than blockparam-based SSA). But, again, this bifurcation is undesirable.
Our insight here, after a lot of discussion, was to put the definitions where they belong: on the edges. That is, regular returns are only defined once we know we’re following the regular-return edge, and likewise for exception payloads. But we don’t want to have special instructions that must be in the successor blocks: that’s a weird distributed invariant and, again, likely to lead to bugs when transforming IR. Instead, we leverage the fact that we use blockparam-based SSA and we widen the domain of allowable block-call arguments.
Whereas previously one might end a block like brif v1, block2(v2,
v3), block3(v4, v5), i.e. with blockparams assigned values in the
chosen successor via a list of value-uses in the branch, we now allow
(i) SSA values, (ii) a special “normal return value” sentinel, or
(iii) a special “exceptional return value” sentinel. The latter two
are indexed because there can be more than one of each. So one can
write a block-call in a try_call as block2(ret0, v1, ret1), which
passes the two return values of the call and a normal SSA value; or
block3(exn0, exn1), which passes just the two exception payload
values. We do have a new well-formedness check on the IR that ensures
that (i) normal returns are used only in the normal-return blockcall,
and exception payloads are used only in the handler-table blockcalls;
(ii) normal returns’ indices are bounded by the signature; and (iii)
exception payloads’ indices are bounded by the ABI’s number of
exception payload values; but all of these checks are local to the
instruction, not distributed across blocks. That’s nice, and conforms
with the way that all of our other instructions work, too. (Block-call
argument types are then checked against block-parameter types in the
successor block, but that happens the same as for any branch.) So we
have, repeating from above, a callsite like
block1:
try_call fn0(v1), block2(ret0), [ tag0: block3(exn0, exn1) ]
with all of the desired properties: only one kind of block, explicit control flow, and SSA values defined only where they are legal to use.
All of this may seem somewhat obvious in hindsight, but as attested by the above GitHub discussions and Cranelift weekly meeting minutes, it was far from clear when we started how to design all of this to maximize simplicity and generality and minimize quirks and footguns. I’m pretty happy with our final design: it feels like a natural extension of our core blockparam-SSA control flow graph, and I managed to put it into the compiler without too much trouble at all (well, a few PRs and associated fixes to Cranelift and regalloc2 functionality and testing; and I’m sure I’ve missed a few).
So we have defined an IR that can express exception handlers – what about the interaction between this function body and the unwinder? We will need to define a different kind of semantics to nail down that interface: in essence, it is a property of the ABI (Application Binary Interface).
As mentioned above, existing exception-handling ABIs exist for native code, such as compiled C++. While we are certainly willing to draw inspiration from native ABIs and align with them as much as makes sense, in Wasmtime we already define our own ABI7, and so we are not necessarily constrained by existing standards.
In particular, there is a very good reason we would prefer not to: to
unwind to a particular exception handler, register state must be
restored as specified in the ABI, and the standard Itanium ABI
requires the usual callee-saved (“non-volatile”) registers on the
target ISA to be restored. But this requires (i) having the register
state at time of throw, and (ii) processing unwind metadata at each
stack frame as we walk up the stack, reading out values of saved
registers from stack frames. The latter is already
supported
with a generic “unwind pseudoinstruction” framework I built four years
ago, but would still add complexity to our unwinder, and this
complexity would be load-bearing for correctness; and the former is
extremely difficult with Wasmtime’s normal runtime-entry
trampolines. So we instead choose to have a simpler exception ABI: all
try_calls, that is, callsites with handlers, clobber all
registers. This means that the compiler’s ordinary register-allocation
behavior will save all live values to the stack and restore them on
either a normal or exceptional return. We only have to restore the
stack (stack pointer and frame pointer registers) and redirect the
program counter (PC) to a handler.
The other aspect of the ABI that matters to the exception-throw
unwinder is exceptional payload. The native Itanium ABI specifies two
registers on most platforms (e.g.: rax and rdx on x86-64, or x0
and x1 on aarch64) to carry runtime-defined payload; so for
simplicity, we adopt the same convention.
That’s all well and good; now how do we implement try_call with the
appropriate register-allocator behavior to conform to this? We already
have fairly complex ABI handling
(machine-independent
and
five
different
architecture
implementations)
in Cranelift, but it follows a general pattern: we generate a single
instruction at the register-allocator level, and emit uses and defs
with fixed-register constraints. That is, we tell regalloc that
parameters must be in certain registers (e.g., rdi, rsi, rcx,
rdx, r8, r9 on x86-64 System-V calling-convention platforms, or
x0 up to x7 on aarch64 platforms) and let it handle any necessary
moves. So in the simplest case, a call might look like (on aarch64),
with register-allocator uses/defs and constraints annotated:
bl (call) v0 [def, fixed(x0)], v1 [use, fixed(x0)], v2 [use, fixed(x1)]
It is not always this simple, however: calls are not actually always a single instruction, and this turned out to be quite problematic for exception-handling support. In particular, when values are returned in memory, as the ABI specifies they must be when there are more return values than registers, we add (or added, prior to this work!) load instructions after the call to load the extra results from their locations on the stack. So a callsite might generate instructions like
bl v0 [def, fixed(x0)], ..., v7 [def, fixed(x7)] # first eight return values
ldr v8, [sp] # ninth return value
ldr v9, [sp, #8] # tenth return value
and so on. This is problematic simply because we said that the
try_call was a terminator; and it is at the IR level, but no longer
at the regalloc level, and regalloc expects correctly-formed
control-flow graphs as well. So I had to do a
refactor to
merge these return-value loads into a single regalloc-level
pseudoinstruction, and in turn this cascaded into a few regalloc fixes
(allowing more than 256
operands and
more aggressively splitting live-ranges to allow worst-case
allocation,
plus a fix to the live range-splitting
fix and a
fuzzing
improvement).
There is one final question that might arise when considering the interaction of exception handling and register allocation in Cranelift-compiled code. In Cranelift, we have an invariant that the register allocator is allowed to insert moves between any two instructions – register-to-register, or loads or stores to/from spill-slots in the stack frame, or moves between different spill-slots – and indeed it does this whenever there is more state than fits in registers. It also needs to insert edge moves “between” blocks, because when jumping to another spot in the code, we might need the register values in a differently-assigned configuration. When we have an unwinder that jumps to a different spot in the code to invoke a handler, we need to ensure that all the proper moves have executed so the state is as expected.
The answer here turns out to be a careful argument that we don’t need to do anything at all. (That’s the best kind of solution to a problem, but only if one is correct!) The crux of the argument has to do with critical edges. A critical edge is one from a block with multiple successors to one with multiple predecessors: for example, in the graph
A D
/ \ /
B C
where A can jump to B or C, and D can also jump to C, then A-to-C is a
critical edge. The problem with critical edges is that there is
nowhere to put code that has to run on the transition from A to C (it
can’t go in A, because we may go to B or C; and it can’t go in C,
because we may have come from A or D). So the register allocator
prohibits them, and we “split” them when generating code by inserting
empty blocks (e below) on them:
A D
/ \ |
| e |
| \ /
B C
The key insight is that a try_call always has more than one
successor as long as it has a handler (because it must always have a
normal return-path successor too)8; and in this case, because we
split critical edges, the immediate successor block on the
exception-catch path has only one predecessor. So the register
allocator can always put its moves that have to run on catching an
exception in the successor (handler) block rather than the predecessor
block. Our rule for where to put edge moves prefers the successor
(block “after” the edge) unless it has multiple in-edges, so this was
already the case. The only thing we have to be careful about is to
record the address of the inserted edge block, if any (e above),
rather than the IR-level handler block (C above), in the handler
table.
And that’s pretty much it, as far as register allocation is concerned!
We’ve now covered the basics of Cranelift’s exception support. At this
point, having landed the compiler half but not the Wasmtime half, I
context-switched away for a bit, and in the meantime, bjorn3 picked
this support up right away as a means to add panic-unwinding support
to
rustc_codegen_cranelift,
the Cranelift-based Rust compiler backend. With a few small
changes they
contributed, and a followup edge-case
fix and a
refactor,
panic-unwinding support in rustc_codegen_cranelift was working. That
was very good intermediate validation that what I had built was usable
and relatively solid.
We have a compiler that supports exceptions; we understand Wasm exception semantics; let’s build support into Wasmtime! How hard could it be?
I started by sketching out the codegen for each of the three opcodes
(try_table, throw, and throw_ref). My mental model at the very
beginning of this work, having read but not fully internalized the
Wasm exception-handling proposal, was that I would be able to
implement a “basic” throw/catch first, and then somehow build the
exnref objects later. And I had figured I could build exnrefs in a
(in hindsight) somewhat hacky way, by aggregating values together in a
kind of tuple and creating a table of such tuples indexed by exnrefs,
just as Wasmtime does for externrefs.
This understanding quickly gave way to a deeper one when I realized a few things:
Exception objects (exnrefs) can carry references to other GC objects (that is, GC types can be part of the payload signature of an exception), and GC objects can store exnrefs in fields. Hence, exnrefs need to be traced, and can participate in GC cycles; this either implies an additional collector on top of our GC collector (ugh) or means that exception objects need to be on the GC heap when GC is enabled.
We’ll need a host API to introspect and build exception objects, and we already have nice host APIs for GC objects.
There was a question in an extensively-discussed
PR whether
we could build a cheap “subset” implementation that doesn’t mandate
the existence of a GC heap for storing exception objects. This would
be great in theory for guests that use exceptions for C-level
setjmp/longjmp but no other GC features. However, it’s a little
tricky for a few reasons. First, this would require the subset to
exclude throw_ref (so we don’t have to invent another kind of
exception object storage). But it’s not great to subset the spec –
and throw_ref is not just for GC guest languages, but also for
rethrows. Second, more generally, this is additional maintenance and
testing surface that we’d rather not have for now. Instead we expect
that we can make GC cheap enough, and its growth heuristic smart
enough that a “frequent setjmp/longjmp” stress-test of exceptions (for
example) should live within a very small (e.g., few-kilobyte) GC heap,
essentially approximating the purpose-built storage. My colleague Nick
Fitzgerald (who built and is driving improvements to Wasmtime’s GC
support) wrote up a nice
issue
describing the tradeoffs and ideas we have.
All of that said, we’ll only build one exception object implementation – great! – but it will have to be a new kind of GC object. This spawned a large PR to build out exception objects first, prior to actual support for throwing and catching them, with host APIs to allocate them and inspect their fields. In essence, they are structs with immutable fields and with a less-exposed type lattice and no subtyping.
So there I was, implementing the throw instruction’s libcall
(runtime implementation), and finally getting to the heart of the
matter: the unwinder itself, which walks stack frames to find a
matching exception handler. This is the final bit of functionality
that ties it all together. We’re almost there!
But wait: check out that spec
language.
We load the “tag address” from the store in step 9: we allocate the
exception instance {tag z.tags[x], fields val^n}. What is this
tags array on the store (z) in the runtime semantics? Tags have
dynamic identity, not static identity! (This is the part where I
learned about the thing I described
above.)
This was a problem, because I had defined exception tables to
associate handlers with tags that were identified by integer (u32)
– like most other entities in Cranelift IR, I had figured this would
be sufficient to let Wasmtime define indices (say: index of the tag in
the module), and then we could compare static tag IDs.
Perhaps this is no problem: the static index defines the entity ID in the module (defined or imported tag), and we can compare that and the instance ID to see if a handler is a match. But how do we get the instance ID from the stack frame?
It turns out that Wasmtime didn’t have a way, because nothing had
needed that yet. (This deficiency had been noticed before when
implementing Wasm coredumps, but there hadn’t been enough reason or
motivation to fix it then.) So I filed an
issue with
a few ideas. We could add a new field in every frame storing the
instance pointer – and in fact this is a simple version of what at
least one other production Wasm implementation, in the SpiderMonkey
web engine,
does
(though as described in that [SMDOC] comment, it only stores
instance pointers on transitions between frames of different
instances; this is enough for the unwinder when walking linearly up
the stack). But that would add overhead to every Wasm function (or
with SpiderMonkey’s approach, require adding trampolines between
instances, which would be a large change for Wasmtime), and exception
handling is still used somewhat rarely in practice. Ideally we’d have
a “pay-as-you-go” scheme with as little extra complexity as possible.
Instead, I came up with an idea to add “dynamic context” items to
exception handler
lists. The
idea is that we inject an SSA value into the list and it is stored in
a stack location that is given in the handler table metadata, so the
stack-walker can find it. To Cranelift, this is some arbitrary opaque
value; Wasmtime will use it to store the raw instance pointer
(vmctx) for use by the unwinder.
This filled out the design to a more general state nicely: it is symmetric with exception payload, in the sense that the compiled code can communicate context or state to the unwinder as it reads the frames, and the unwinder in turn can communicate data to the compiled code when it unwinds.
It turns out – though I didn’t intend this at all at the time – that this also nicely solves the inlining problem. In brief, we want all of our IR to be “local”, not treating the function boundary specially; this way, IR can be composed by the inliner without anything breaking. Storing some “current instance” state for the whole function will, of course, break when we inline a function from one module (hence instance) into another!
Instead, we can give a nice operational semantics to handler tables
with dynamic-context items: the unwinder should read left-to-right,
updating its “current dynamic context” at each dynamic-context item,
and checking for a tag match at tag-handler items. Then the inliner
can compose exception tables: when a try_call callsite inlines a
function body as its callee, and that body itself has any other
callsites, we attach a handler table that simply concatenates the
exception table items.
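As a rough illustration of those operational semantics (invented types, not Cranelift's actual handler-table representation), the unwinder's left-to-right scan and the inliner's concatenation might look like this sketch:

// Invented, simplified representation of an exception-handler table.
#[derive(Clone)]
enum HandlerItem {
    // Update the "current dynamic context" (for Wasmtime: the instance's vmctx),
    // read from a known stack slot at unwind time.
    DynamicContext(usize),
    // If the thrown tag matches `tag` under the current context, unwind to `handler_pc`.
    Tag { tag: u32, handler_pc: usize },
}

// The unwinder's matching logic: a left-to-right scan, tracking the current context.
// This mirrors the simplified "compare the tag index and the instance" description above.
fn find_handler(items: &[HandlerItem], thrown_ctx: usize, thrown_tag: u32) -> Option<usize> {
    let mut current_ctx = None;
    for item in items {
        match item {
            HandlerItem::DynamicContext(ctx) => current_ctx = Some(*ctx),
            HandlerItem::Tag { tag, handler_pc } => {
                if current_ctx == Some(thrown_ctx) && *tag == thrown_tag {
                    return Some(*handler_pc);
                }
            }
        }
    }
    None
}

// Inlining can compose tables by concatenation; in this sketch the inlined callee's
// (innermost) items come first, followed by the caller's items for the original callsite.
fn compose_for_inlining(callee: &[HandlerItem], caller: &[HandlerItem]) -> Vec<HandlerItem> {
    callee.iter().cloned().chain(caller.iter().cloned()).collect()
}

fn main() {
    let table = vec![
        HandlerItem::DynamicContext(0xA),
        HandlerItem::Tag { tag: 0, handler_pc: 0x1000 },
        HandlerItem::Tag { tag: 1, handler_pc: 0x2000 },
    ];
    assert_eq!(find_handler(&table, 0xA, 1), Some(0x2000));
    assert_eq!(find_handler(&table, 0xB, 1), None); // different instance: no match
    let _combined = compose_for_inlining(&table, &table);
}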
It’s important, here, to point out another surprising fact about Wasm
semantics: we cannot do certain optimizations to resolve handlers
statically or optimize the handler list, or at least not naively,
without global program analysis to understand where tags come
from. For example, if we see a handler for tag 0 then one for tag 1,
and we see a throw for tag 1 directly inside the try_table’s body, we
cannot necessarily resolve it: tag 0 and tag 1 could be the same tag!
Wait, how can that be? Well, consider tag imports:
(module
  (import "test" "e0" (tag $e0))
  (import "test" "e1" (tag $e1))
  (func ...
    (try_table
      (catch $e0 $b0)
      (catch $e1 $b1)
      (throw $e1)
      (unreachable))))
We could instantiate this module giving the same dynamic tag instance
twice, for both imports, in which case the first handler (to block $b0)
matches; or we could give it separate tags, in which case block $b1 matches. The only way
to win the optimization game is not to play – we have to preserve the
original handler list. Fortunately, that makes the compiler’s job
easier. We transcribe the try_table’s handlers directly to Cranelift
exception-handler tables, and those directly to metadata in the
compiled module, read in exactly that order by the unwinder’s
handler-matching logic.
Since exception objects are GC-managed objects, we have to ensure that they are properly rooted: that is, any handles to these objects outside of references inside other GC objects need to be known to the GC so the objects remain alive (and so the references are updated in the case of a moving GC).
Within a Wasm-to-Wasm exception throw scenario, this is fairly easy: the references are rooted in the compiled code on either side of the control-flow transfer, and the reference only briefly passes through the unwinder. As long as we are careful to handle it with the appropriate types, all will work fine.
Passing exceptions across the host/Wasm boundary is another matter,
though. We support the full matrix of {host, Wasm} x {host, Wasm}
exception catch/throw pairs: that is, exceptions can be thrown from
native host code called by Wasm (via a Wasm import), and exceptions
can be thrown out of Wasm code and returned as a kind of error to the
host code that invoked the Wasm. This works by boxing the exception
inside an anyhow::Error so we use Rust-style value-based error
propagation (via Result and the ? operator) in host code.
What happens when we have a value inside the Error that holds an
exception object in the Wasmtime Store? How does Wasmtime know this
is rooted?
The answer in Wasmtime prior to recent work was to use one of two
kinds of external rooting wrappers: Rooted and
ManuallyRooted. Both wrappers hold an index into a table contained
inside the Store, and that table contains the actual GC
reference. This allows the GC to easily see the roots and update them.
The difference lies in the lifetime disciplines: ManuallyRooted
requires, as the name implies, manual unrooting; it has no Drop
implementation, and so easily creates leaks. Rooted, on the other
hand, has a LIFO (last-in first-out) discipline based on a Scope, an
RAII type created by the embedder (user) of Wasmtime. Rooted GC
references that escape that dynamic scope are unrooted, and will cause
an error (panic) at runtime if used. Neither of those behaviors is
ideal for a value type – an exception – that is meant to escape
scopes via ?-propagation.
The design that we landed on, instead, takes a different and much
simpler approach: the Store has a single, explicit root slot for the
“pending exception”, and host code can set this and then return a
sentinel value (wasmtime::ThrownException) in the Result’s error
type (boxed up into an anyhow::Error). This easily allows
propagation to work as expected, with no unbounded leaks (there is
only one pending exception that is rooted) and no unrooted propagating
exceptions (because no actual GC reference propagates, only the
sentinel).
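Here is a minimal, generic sketch of that "single pending-exception slot plus sentinel error" pattern. The Store, ThrownException, and host_import types below are simplified stand-ins invented for illustration (the ThrownException name mirrors the wasmtime::ThrownException sentinel mentioned above, but this is not the real Wasmtime API, whose types and signatures differ):

use anyhow::{anyhow, Result};

#[derive(Debug)]
struct ThrownException; // zero-sized sentinel carried in the error, not a GC reference

impl std::fmt::Display for ThrownException {
    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
        write!(f, "a guest exception is pending in the store")
    }
}
impl std::error::Error for ThrownException {}

struct Store {
    // In Wasmtime this would be a rooted GC reference to the exception object;
    // here it is just an opaque id standing in for that reference.
    pending_exception: Option<u64>,
}

fn host_import(store: &mut Store, fail: bool) -> Result<u32> {
    if fail {
        // The host decides to throw: root the exception in the single store slot...
        store.pending_exception = Some(0xdead_beef);
        // ...and propagate only the sentinel through ordinary `?`-style error flow.
        return Err(anyhow!(ThrownException));
    }
    Ok(42)
}

fn main() -> Result<()> {
    let mut store = Store { pending_exception: None };
    match host_import(&mut store, true) {
        Err(e) if e.is::<ThrownException>() => {
            // The caller (in Wasmtime: the trampoline/libcall machinery) notices the
            // sentinel and consults the store's single pending-exception slot.
            assert!(store.pending_exception.is_some());
        }
        other => {
            other?;
        }
    }
    Ok(())
}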
As a side-quest, while thinking through this rooting dilemma, I also
realized
that it should be possible to create an “owned” rooted reference
that behaves more like a conventional owned Rust value (e.g. Box);
hence OwnedRooted was born to replace
ManuallyRooted.
This type works without requiring access to the Store to unroot when
dropped; the key idea is to hold a refcount to a separate tiny
allocation that is used as a “drop flag”, and then have the store
periodically scan these drop-flags and lazily remove roots, with a
thresholding algorithm to give that scanning amortized linear-time
behavior.9
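A conceptual sketch of that drop-flag idea (not Wasmtime's actual OwnedRooted implementation; the types and the threshold heuristic are invented for illustration): the handle holds a refcounted flag, dropping the handle flips the flag without touching the store, and the root table lazily sweeps flagged entries.

use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

// An "owned" root: dropping it does not require access to the store; it only flips a flag.
struct OwnedRoot {
    #[allow(dead_code)]
    index: usize,             // which slot in the root table this handle refers to
    dropped: Arc<AtomicBool>, // shared drop flag, also held by the root table
}

impl Drop for OwnedRoot {
    fn drop(&mut self) {
        self.dropped.store(true, Ordering::Release);
    }
}

#[derive(Default)]
struct RootTable {
    // (GC-reference stand-in, drop flag); `None` marks a lazily-unrooted slot.
    slots: Vec<Option<(u64, Arc<AtomicBool>)>>,
    live_at_last_sweep: usize,
}

impl RootTable {
    fn root(&mut self, gc_ref: u64) -> OwnedRoot {
        // Amortization: only sweep once the table has grown well past the live
        // count observed at the last sweep, keeping total sweep work linear.
        if self.slots.len() > 2 * self.live_at_last_sweep.max(8) {
            self.sweep();
        }
        let dropped = Arc::new(AtomicBool::new(false));
        self.slots.push(Some((gc_ref, dropped.clone())));
        OwnedRoot { index: self.slots.len() - 1, dropped }
    }

    fn sweep(&mut self) {
        for slot in &mut self.slots {
            if matches!(slot, Some((_, flag)) if flag.load(Ordering::Acquire)) {
                *slot = None; // lazily unroot
            }
        }
        self.live_at_last_sweep = self.slots.iter().filter(|s| s.is_some()).count();
    }
}

fn main() {
    let mut table = RootTable::default();
    let handle = table.root(0x1234);
    drop(handle);  // no store access needed at drop time
    table.sweep(); // the store sweeps on its own schedule
    assert!(table.slots.iter().all(|s| s.is_none()));
}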
Now that we have this, in theory, we could pass an
OwnedRooted<ExnRef> directly in the Error type to propagate
exceptions through host code; but the store-rooted approach is simple
enough, has a marginal performance advantage (no separate allocation),
and so I don’t see a strong need to change the API at the moment.
Now that we’ve discussed all the design choices, let’s walk through the life of an exception throw/catch, from start to finish. Let’s assume a Wasm-to-Wasm throw/catch for simplicity here.
1. It starts with the try_table, which results in an exception-handler catch block being created for each handler case listed in the try_table instruction. The create_catch_block function generates code that invokes translate_exn_unbox, which reads out all of the fields from the exception object and pushes them onto the Wasm operand stack in the handler path. This handler block is registered in the HandlerState, which tracks the current lexical stack of handlers (and hands out checkpoints so that when we pop out of a Wasm block-type operator, we can pop the handlers off the state as well). These handlers are provided as an iterator which is passed to the translate_call method and eventually ends up creating an exception table on a try_call instruction. This try_call will invoke whatever Wasm code is about to throw the exception.
2. Next comes the throw opcode, which is translated via FuncEnvironment::translate_exn_throw to a three-operation sequence that fetches the current instance ID (via a libcall into the runtime), allocates a new exception object with that instance ID and a fixed tag number and fills in its slots with the given values popped from the Wasm operand stack, and delegates to throw_ref.
3. The throw_ref opcode implementation then invokes the throw_ref libcall.
4. The libcall’s error return propagates outward (via the HostResult trait implementations) and eventually reaches this case, which sees a pending exception sentinel and invokes compute_handler. Now we’re getting to the heart of the exception-throw implementation.
5. compute_handler walks the stack with Handler::find, which itself is based on visit_frames, which does about what one would expect for code with a frame-pointer chain: it walks the singly-linked list of frames. At each frame, the closure that compute_handler gave to Handler::find looks up the program counter in that frame (which will be a return address, i.e., the instruction after the call that created the next lower frame) using lookup_module_by_pc to find a Module, which itself has an ExceptionTable (a parser for serialized metadata produced during compilation from Cranelift metadata) that knows how to look up a PC within a module. This will produce an Iterator over handlers which we test in order to see if any match. (The groups of exception-handler table items that come out of Cranelift are post-processed here to generate the tables that the above routines search.)
6. When a matching handler is found, it is recorded as an UnwindState::UnwindToWasm here.
7. The UnwindToWasm state then triggers this case in the unwind libcall, which is invoked whenever any libcall returns an error code; that eventually calls the no-return function resume_to_exception_handler, which is a little function written in inline assembly that does exactly what it says on the tin. These three instructions set rsp and rbp to their new values, and jump to the new rip (PC). The same stub exists for each of our four native-compilation architectures (x86-64 above, aarch64, riscv64, and s390x10). That transfers control to the catch-block created above, and the Wasm continues running, unboxing the exception payload and running the handler!

So we have Wasm exception handling now! For all of the interesting design questions we had to work through, the end was pretty anticlimactic. I landed the final PR, and after a follow-up cleanup PR (1) and some fuzzbug fixes (1 2 3 4 5 6 7) having mostly to do with null-pointer handling and other edge cases in the type system, plus one interaction with tail-calls (and a separate/pre-existing s390x ABI bug that it uncovered), it has been basically stable. We pretty quickly got a few user reports: here it was reported as working for a Lua interpreter using setjmp/longjmp inside Wasm based on exceptions, and here it enabled Kotlin-on-Wasm to run and pass a large testsuite. Not bad!
All told, this took 37 PRs with a diff-stat of +16264 -4004 (16KLoC
total) – certainly not the “small-to-medium-sized” project I had
initially optimistically expected, but I’m happy we were able to build
it out and get it to a stable state relatively easily. It was a
rewarding journey in a different way than a lot of my past work
(mostly on the Cranelift side) – where many of my past projects have
been really very open-ended design or even research questions, here we
had the high-level shape already and all of the work was in designing
high-quality details and working out all the interesting interactions
with the rest of the system. I’m happy with how clean the IR design
turned out in particular, and I don’t think it would have done so
without the really excellent continual discussion with the rest of the
Cranelift and Wasmtime contributors (thanks to Nick Fitzgerald and
Alex Crichton in particular here).
As an aside: I am happy to see how, aside from use-cases for Wasm
exception handling, the exception support in Cranelift itself has been
useful too. As mentioned above, cg_clif picked it up almost as soon
as it was ready; but then, as an unexpected and pleasant surprise,
Alex subsequently rewrote Wasmtime’s trap
unwinding to
use Cranelift exception handlers in our entry trampolines rather than
a setjmp/longjmp, as the latter have longstanding semantic
questions/issues in Rust. This took one more
intrinsic,
which I implemented after discussing with Alex how best to expose
exception handler addresses to custom unwind logic without the full
exception unwinder, but was otherwise a pretty direct application of
try_call and our exception ABI. General building blocks prove
generally useful, it seems!
Thanks to Alex Crichton and Nick Fitzgerald for providing feedback on a draft of this post!
To explain myself a bit, I underestimated the interactions of
exception handling with garbage collection (GC); I hadn’t
realized yet that exnrefs were a full first-class value and
would need to be supported including in the host API. Also, it
turns out that exceptions can cross the host/guest boundary, and
goodness knows that gets really fun really fast. I was only
off by a factor of two on the compiler side at least! ↩
From an implementation perspective, the dynamic, interprocedural nature of exceptions is what makes them far more interesting, and involved, than classical control flow such as conditionals, loops, or calls! This is why we need a mechanism that involves runtime data structures, “stack walks”, and lookup tables, rather than simply generating a jump to the right place: the target of an exception-throw can only be computed at runtime, and we need a convention to transfer control with “payload” to that location. ↩
For those so inclined, this is a
monad,
and e.g. Haskell
implements the ability to have “result or error” types that
return from a sequence early via
Either,
explicitly describing the concept as such. The ? operator
serves as the “bind” of the monad: it connects an
error-producing computation with a use of the non-error value,
returning the error directly if one is given instead. ↩
So named for the Intel Itanium (IA-64), an instruction-set architecture that happened to be the first ISA where this scheme was implemented for C++, and is now essentially dead (before its time! woefully misunderstood!) but for that legacy… ↩
It’s worth briefly noting here that the Wasm exception handling
proposal went through a somewhat twisty journey, with an earlier
variant (now called “legacy exception handling”) that shipped in
some browsers but was never standardized handling rethrows in a
different way. In particular, that proposal did not offer
first-class exception object references that could be rethrown;
instead, it had an explicit rethrow instruction. I wasn’t
around for the early debates about this design, but in my
opinion, providing first-class exception object references that
can be plumbed around via ordinary dataflow is far nicer. It
also permits a simpler implementation, as long as one literally
implements the semantics by always allocating an exception
object.11 ↩
To be precise, because it may be a little surprising:
catch_ref pushes both the payload values and the exception
reference onto the operand stack at the handler destination. In
essence, the rule is: tag-specific variants always unpack the
payloads; and also, _ref variants always push the exception
reference. ↩
In particular, we have defined our own ABI in Wasmtime to allow
universal tail calls between any two signatures to work, as
required by the Wasm tail-calling opcodes. This ABI, called
“tail”, is based on the standard System V calling convention
but differs in that the callee cleans up any stack arguments. ↩
It’s not compiler hacking without excessive trouble from
edge-cases, of course, so we had one interesting
bug
from the empty handler-list case which means we have to force
edge-splitting anyway for all try_calls for this subtle
reason. ↩
Of course, while doing this, I managed to create
CVE-2025-61670
in the C/C++ API by a combination of (i) a simple typo in the C
FFI bindings (as vs. from, which is important when
transferring ownership!) and (ii) not realizing that the C++
wrapper does not properly maintain single ownership. We didn’t
have ASAN tests, so I didn’t see this upfront; Alex discovered
the issue while updating the Python bindings (which quickly
found the leak) and managed the CVE. Sorry and thanks! ↩
It turns out that even three lines of assembly are hard to get right: the s390x variant had a bug where we got the register constraints wrong (GPR 0 is special on s390x, and a branch-to-register can only take GPR 1–15; we needed a different constraint to represent that) and had a miscompilation as a result. Thanks to our resident s390x compiler hacker Ulrich Weigand for tracking this down. ↩
Of course, always boxing exceptions is not the only way to implement the proposal. It should be possible to “unbox” exceptions and skip the allocation, carrying payloads directly through some other engine state, if they are not caught as references. We haven’t implemented this optimization in Wasmtime and we expect the allocation performance for small exception objects to be adequate for most use-cases. ↩
As of Wasmtime 35, Winch supports AArch64 for Core Wasm proposals, along with additional Wasm proposals like the Component Model and Custom Page Sizes.
Embedders can configure Wasmtime to use either Cranelift or Winch as the Wasm compiler depending on the use-case: Cranelift is an optimizing compiler aiming to generate fast code. Winch is a ‘baseline’ compiler, aiming for fast compilation and low-latency startup.
This blog post will cover the main changes needed to accommodate support for AArch64 in Winch.
To achieve its low-latency goal, Winch focuses on converting Wasm code to assembly code for the target Instruction Set Architecture (ISA) as quickly as possible. Unlike Cranelift, Winch’s architecture intentionally avoids using an intermediate representation or complex register allocation algorithms in its compilation process. For this reason, baseline compilers are also referred to as single-pass compilers.
Winch’s architecture can be largely divided into two parts which can be classified as ISA-agnostic and ISA-specific.

Adding support for AArch64 to Winch involved adding a new
implementation of the MacroAssembler trait, which is ultimately in
charge of emitting AArch64 assembly. Winch’s ISA-agnostic components
remained unchanged and are shared with the existing x86_64
implementation.
Winch’s code generation context implements
wasmparser’s
VisitOperator
trait, which requires defining handlers for each Wasm opcode:
fn visit_i32_const(&mut self, value: i32) -> Self::Output {
    // Code generation starts here.
}
When an opcode handler is invoked, the Code Generation Context prepares all the necessary values and registers, followed by the machine code emission of the sequence of instructions to represent the Wasm instruction in the target ISA.
Last but not least, the register allocator algorithm uses a simple round robin approach over the available ISA registers. When a requested register is unavailable, all the current live values at the current program point are saved to memory (known as value spilling), thereby freeing the requested register for immediate use.
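As a rough sketch of that strategy (simplified, with invented types; this is not Winch's actual allocator): registers are handed out in order, and when none are free, every live value is spilled so the requested register becomes available.

// A simplified model of single-pass register allocation with wholesale spilling,
// in the spirit of (but not identical to) the approach described above.
#[derive(Clone, Copy, Debug)]
enum Value {
    Reg(u8),      // value currently lives in a register
    Stack(usize), // value was spilled to a stack slot
}

struct Allocator {
    free: Vec<u8>,           // registers currently available
    value_stack: Vec<Value>, // the Wasm operand stack, tracked at compile time
    next_slot: usize,
}

impl Allocator {
    fn new(regs: &[u8]) -> Self {
        Allocator { free: regs.to_vec(), value_stack: Vec::new(), next_slot: 0 }
    }

    // Request any register; if none is free, spill all live register values first.
    // (Panics if there is nothing to spill; real code would handle that case.)
    fn any_reg(&mut self) -> u8 {
        if self.free.is_empty() {
            self.spill_all();
        }
        self.free.remove(0) // round-robin: take the least recently freed register
    }

    fn push_in_reg(&mut self) {
        let r = self.any_reg();
        self.value_stack.push(Value::Reg(r));
    }

    // Spill: move every live register value to a stack slot and free its register.
    fn spill_all(&mut self) {
        for v in &mut self.value_stack {
            if let Value::Reg(r) = *v {
                // (a real allocator emits a store of `r` to its stack slot here)
                self.free.push(r);
                *v = Value::Stack(self.next_slot);
                self.next_slot += 1;
            }
        }
    }
}

fn main() {
    let mut alloc = Allocator::new(&[0, 1, 2, 3]); // pretend we have four GPRs
    for _ in 0..6 {
        alloc.push_in_reg(); // the fifth push forces a spill of everything live
    }
    println!("{:?}", alloc.value_stack);
}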
AArch64 defines very specific restrictions with regards to the usage of the stack pointer register (SP). Concretely, SP must be 16-byte aligned whenever it is used to address stack memory. Given that Winch’s register allocation algorithm requires value spilling at arbitrary program points, it can be challenging to maintain such alignment.
AArch64’s SP requirement states that SP must be 16-byte aligned when addressing stack memory; however, SP can be unaligned while it is not being used to address stack memory. The requirement doesn’t prevent using other registers for stack memory addressing, nor does it state that these other registers must be 16-byte aligned. To avoid opting for less efficient approaches, like overallocating memory to ensure alignment each time a value is saved, Winch’s architecture employs a shadow stack pointer approach.
Winch’s shadow stack pointer approach defines x28 as the base register
for stack memory addressing, enabling:
Wasmtime can be configured to leverage signals-based traps to detect exceptional situations in Wasm programs, e.g., an out-of-bounds memory access. Traps are synchronous exceptions, and when they are raised, they are caught and handled by code defined in Wasmtime’s runtime. These handlers are Rust functions compiled to the target ISA, following the native calling convention, which implies that whenever there is a transition from Winch-generated code to a signal handler, SP must be 16-byte aligned. Note that even though Wasmtime can be configured to avoid signals-based traps, Winch does not support that option yet.
Given that traps can happen at arbitrary program points, Winch’s approach to ensure 16-byte alignment for SP is two-fold:
It’s worth noting that the approach mentioned above doesn’t take into
account asynchronous exceptions, also known as interrupts. Further
testing and development are needed to ensure that Winch
generated code for AArch64 can correctly handle interrupts e.g.,
SIGALRM.
To minimize register pressure and reduce the need for spilling values,
Winch’s instruction selection prioritizes emitting instructions that
support immediate operands whenever possible, such as mov x0,
#imm. However, due to the fixed-width instruction encoding in AArch64
(which always uses 32-bit instructions), encoding large immediate
values directly within a single instruction can sometimes be
impossible. In such cases, the immediate is first loaded into an
auxiliary register—often a “scratch” or temporary register—and then
used in subsequent instructions that require register operands.
Scratch registers offer the advantage that they are not tracked by the register allocator, reducing the possibility of register allocator induced spills. However, they should be used sparingly and only for short-lived operations.
AArch64’s fixed 32-bit instruction encoding imposes stricter limits on the size of immediate values that can be encoded directly, unlike other ISAs supported by Winch, such as x86_64, which support variable-length instructions and can encode larger immediates more easily.
Before supporting AArch64, Winch’s ISA-agnostic component assumed a single scratch register per ISA. While this worked well for x86_64, where most instructions can encode a broad range of immediates directly, it proved problematic for AArch64. Specifically, it broke down in instruction sequences involving immediates where the scratch register had already been acquired.
Consider the following snippet from Winch’s ISA-agnostic code for computing a Wasm table element address:
// 1. Load index into the scratch register.
masm.mov(scratch.writable(), index.into(), bound_size)?;
// 2. Multiply with an immediate element size.
masm.mul(
    scratch.writable(),
    scratch.inner(),
    RegImm::i32(table_data.element_size.bytes() as i32),
    table_data.element_size,
)?;
masm.load_ptr(
    masm.address_at_reg(base, table_data.offset)?,
    writable!(base),
)?;
masm.mov(writable!(tmp), base.into(), ptr_size)?;
masm.add(writable!(base), base, scratch.inner().into(), ptr_size)
In step 1, the code clobbers the designated scratch register. More
critically, if the immediate passed to Masm::mul cannot be encoded
directly in the AArch64 mul instruction, the Masm::mul implementation
will load the immediate into a register—clobbering the scratch
register again—and emit a register-based multiplication instruction.
One way to address this limitation is to avoid using a scratch register for the index altogether and instead request a register from the register allocator. This approach, however, increases register pressure and potentially raises memory traffic, particularly in architectures like x86_64.
Winch’s preferred solution is to introduce an explicit scratch register allocator that provides a small pool of scratch registers (e.g., x16 and x17 in AArch64). By managing scratch registers explicitly, Winch can safely allocate and use them without risking accidental clobbering, especially when generating code for architectures with stricter immediate encoding constraints.
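A sketch of what an explicit scratch-register pool can look like (invented types; Winch's real implementation differs): scratch registers are acquired and released explicitly, so nested code generation cannot silently clobber one that is already in use.

// An explicit pool of scratch registers (e.g., x16/x17 on AArch64, as mentioned above).
// Attempting to acquire more scratches than exist indicates a code-generator bug and
// panics loudly, rather than silently clobbering a register that is already in use.
struct ScratchPool {
    available: Vec<u8>,
}

// A checked-out scratch register; it must be handed back explicitly. (An RAII guard
// with interior mutability would also work; this keeps the sketch short.)
#[must_use]
struct Scratch(u8);

impl ScratchPool {
    fn new(regs: &[u8]) -> Self {
        ScratchPool { available: regs.to_vec() }
    }

    fn acquire(&mut self) -> Scratch {
        Scratch(
            self.available
                .pop()
                .expect("scratch pool exhausted: nested code clobbered a scratch"),
        )
    }

    fn release(&mut self, s: Scratch) {
        self.available.push(s.0);
    }
}

fn main() {
    // Hypothetical AArch64 scratch registers x16 and x17.
    let mut pool = ScratchPool::new(&[16, 17]);
    let index_tmp = pool.acquire(); // holds the table index
    let imm_tmp = pool.acquire();   // separately materializes a large immediate
    println!("using x{} and x{}", index_tmp.0, imm_tmp.0);
    pool.release(imm_tmp);
    pool.release(index_tmp);
}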
Though it wasn’t a radical change, the completeness of AArch64 in Winch marks a new stage for the compiler’s architecture, laying a more robust and solid foundation for future ISA additions.
Contributions are welcome! If you’re interested in contributing, you can:
Thanks to everyone who contributed to the completeness of the AArch64 backend! Thanks also to Nick Fitzgerald and Chris Fallin for their feedback on early drafts of this article.
This article walks through building a Wasm component in Rust and using wasmtime run --invoke to execute specific functions (enabling powerful workflows for scripting, testing, and integrating Wasm into modern development pipelines).
Wasmtime’s run subcommand has traditionally supported running Wasm modules as well as invoking a module’s exported functions. However, with the evolution of the Wasm Component Model, this article focuses on a newer capability: creating a component that exports a function and then demonstrating how to invoke that component’s exported function.
By the end of this article, you’ll be ready to create Wasm components and orchestrate their exported component functions to improve your workflow’s efficiency and promote reuse. Potential examples include:
If you want to follow along, please install:
cargo (if already installed, please make sure you are on the latest version), cargo component (if already installed, please make sure you are on the latest version), and the wasmtime CLI (or use a precompiled binary). If already installed, ensure you are using v33.0.0 or newer. You can check versions using the following commands:
$ rustc --version
$ cargo --version
$ cargo component --version
$ wasmtime --version
We must explicitly add the wasm32-wasip2 target. This ensures that our component adheres to WASI’s system interface for non-browser environments (e.g., file system access, sockets, random etc.):
$ rustup target add wasm32-wasip2
Let’s start by creating a new Wasm library that we will later convert to a Wasm component using cargo component and the wasm32-wasip2 target:
$ cargo component new --lib wasm_answer
$ cd wasm_answer
If you open the Cargo.toml file, you will notice that the cargo component command has automatically added some essential configurations.
These include the wit-bindgen-rt dependency (with the ["bitflags"] feature) under [dependencies], and the crate-type = ["cdylib"] setting under the [lib] section.
Your Cargo.toml should now include these entries (as shown in the example below):
[package]
name = "wasm_answer"
version = "0.1.0"
edition = "2024"
[dependencies]
wit-bindgen-rt = { version = "0.41.0", features = ["bitflags"] }
[lib]
crate-type = ["cdylib"]
[package.metadata.component]
package = "component:wasm-answer"
[package.metadata.component.dependencies]
The directory structure of the wasm_answer example is automatically scaffolded out for us by cargo component:
$ tree wasm_answer
wasm_answer
├── Cargo.lock
├── Cargo.toml
├── src
│ ├── bindings.rs
│ └── lib.rs
└── wit
└── world.wit
If we open the wit/world.wit file that cargo component created for us, we can see that cargo component generates a minimal world.wit that exports a raw function:
package component:wasm-answer;
/// An example world for the component to target.
world example {
  export hello-world: func() -> string;
}
We can simply adjust the export line (as shown below):
package component:wasm-answer;
/// An example world for the component to target.
world example {
  export get-answer: func() -> u32;
}
But, instead, let’s use an interface to export our function!
While the above approach works, the recommended best practice is to wrap related functions inside an interface, which you then export from your world. This is more modular, extensible, and aligns with how the Wasm Interface Type (WIT) format is used in multi-function or real-world components. Let’s update the wit/world.wit file as follows:
package component:wasm-answer;
interface answer {
  get-answer: func() -> u32;
}
world example {
  export answer;
}
Next, we update our src/lib.rs file accordingly, by pasting in the following Rust code:
#[allow(warnings)]
mod bindings;
use bindings::exports::component::wasm_answer::answer::Guest;
struct Component;
impl Guest for Component {
    fn get_answer() -> u32 {
        42
    }
}
bindings::export!(Component with_types_in bindings);
Now, let’s create the Wasm component with our exported get_answer() function:
$ cargo component build --target wasm32-wasip2
Our newly generated .wasm file now lives at the following location:
$ file target/wasm32-wasip2/debug/wasm_answer.wasm
target/wasm32-wasip2/debug/wasm_answer.wasm: WebAssembly (wasm) binary module version 0x1000d
We can also use the --release option which optimises builds for production:
$ cargo component build --target wasm32-wasip2 --release
If we check the sizes of the debug and release builds, we see 2.1M and 16K, respectively.
Debug:
$ du -mh target/wasm32-wasip2/debug/wasm_answer.wasm
2.1M target/wasm32-wasip2/debug/wasm_answer.wasm
Release:
$ du -mh target/wasm32-wasip2/release/wasm_answer.wasm
16K target/wasm32-wasip2/release/wasm_answer.wasm
The wasmtime run command can take one positional argument and just run a .wasm or .wat file:
$ wasmtime run foo.wasm
$ wasmtime run foo.wat
In the case of a Wasm module that exports a raw function directly, the run command accepts an optional --invoke argument, which is the name of an exported raw function (of the module) to run:
$ wasmtime run --invoke initialize foo.wasm
In the case of a Wasm component that uses typed interfaces (defined in WIT, in concert with the Component Model), the run command now also accepts the optional --invoke argument for calling an exported function of a component.
However, calling an exported function of a component uses WAVE (a human-oriented text encoding of Wasm Component Model values). For example:
$ wasmtime run --invoke 'initialize()' foo.wasm
You will notice the different syntax of
initialize versus 'initialize()' when referring to a module versus a component, respectively.
Back to our get-answer() example:
$ wasmtime run --invoke 'get-answer()' target/wasm32-wasip2/debug/wasm_answer.wasm
42
You will notice that the above get-answer() function call does not pass in any arguments. Let’s discuss how to represent the arguments passed into function calls in a structured way (using WAVE).
Transferring and invoking complex argument data via the command line is challenging, especially with Wasm components that use diverse value types. To simplify this, Wasm Value Encoding (WAVE) was introduced; offering a concise way to represent structured values directly in the CLI.
WAVE provides a standard way to encode function calls and/or results. WAVE is a human-oriented text encoding of Wasm Component Model values; designed to be consistent with the WIT IDL format.
Below are a few additional pointers for constructing your wasmtime run --invoke commands using WAVE.
As shown above, the component’s exported function name and mandatory parentheses are contained in one set of single quotes, i.e., 'get-answer()':
$ wasmtime run --invoke 'get-answer()' target/wasm32-wasip2/release/wasm_answer.wasm
The result from our correctly typed command above is as follows:
42
Parentheses after the exported function’s name are mandatory. The presence of the parentheses () signifies function invocation, as opposed to the function name just being referenced. If your function takes a string argument, ensure that you enclose the string in double quotes (inside the parentheses). For example:
$ wasmtime run --invoke 'initialize("hello")' foo.wasm
If your exported function takes more than one argument, ensure that each argument is separated by a single comma, as shown below:
$ wasmtime run --invoke 'initialize("Pi", 3.14)' foo.wasm
$ wasmtime run --invoke 'add(1, 2)' foo.wasm
Let’s wrap this article up with a recap to crystallize your knowledge.
If we are not using the Component Model and just creating a module, we use a simple command like wasmtime run foo.wasm (without WAVE syntax). This approach typically applies to modules, which export a _start function, or reactor modules, which can optionally export the wasi:cli/run interface—standardized to enable consistent execution semantics.
Example of running a Wasm module that exports a raw function directly:
$ wasmtime run --invoke initialize foo.wasm
As Wasm evolves with the Component Model, developers gain fine-grained control over component execution and composition. Components using WIT can now be run with wasmtime run, using the optional --invoke argument to call exported functions (with WAVE).
Example of running a Wasm component that exports a function:
$ wasmtime run --invoke 'add(1, 2)' foo.wasm
For more information, visit the cli-options section of the Wasmtime documentation.
The addition of support for the run --invoke feature for components allows users to specify and execute exported functions from a Wasm component. This enables greater flexibility for testing, debugging, and integration. We now have the ability to execute arbitrary exported functions directly from the command line; this feature opens up a world of possibilities for integrating Wasm into modern development pipelines.
This evolution from monolithic Wasm modules to composable, CLI-friendly components exemplifies the versatility and power of Wasm in real-world scenarios.
Within the Bytecode Alliance, we’ve established two tiers for the projects under our umbrella: Hosted and Core. While all projects in the BA, Hosted and Core alike, are required to drive forward and align with our mission and operational principles, Core Projects represent the flagships of the Alliance.
This distinction isn’t merely symbolic. Core Projects are held to even more rigorous standards concerning governance maturity, security practices, community health, and strategic alignment with the BA’s goals. You can find the detailed criteria in our Core and Hosted Project Requirements. In return for meeting these heightened expectations, Core Projects gain direct representation on the Bytecode Alliance Technical Steering Committee (TSC), playing a crucial role in guiding the technical evolution of the Alliance. Establishing this tier, and having Wasmtime be the first project to meet its requirements, is a vital step in maturing the BA’s governance structure.
Wasmtime is a fast, scalable, highly secure, and embeddable WebAssembly runtime in wide use across many different environments.
From its inception, Wasmtime was designed to embody the core tenets of the Bytecode Alliance. Its focus on providing a fast, secure, and standards-compliant WebAssembly runtime aligns directly with the BA’s mission to create state-of-the-art foundations emphasizing security, efficiency, and modularity.
Wasmtime has been instrumental in turning the Component Model vision of fine-grained sandboxing and capabilities-based security – what we initially called “nanoprocesses” – into a practical reality. It has consistently served as a proving ground for cutting-edge standards work, particularly the Component Model and WASI, driving innovation while maintaining strict standards compliance. Our commitment to robust security practices, including extensive fuzzing and a rigorous security response process, is non-negotiable.
The journey to Core Project status involved formally documenting how Wasmtime meets these stringent requirements. You can find this documentation in our proposal for Core Project status, which provides evidence for the Wasmtime project’s mature governance, security posture, CI/CD processes, community health, and widespread production adoption. Based on this evidence and the TSC’s strong recommendation, the Board of Directors unanimously agreed that Wasmtime not only fulfills the criteria but is strategically vital to the Alliance’s success, making it the ideal candidate to become the first Core Project.
After the Core Project promotion, the Wasmtime core team has appointed me to represent the project on the TSC, so I re-joined the TSC in this new role.
You can find more information about Wasmtime in a number of places:
And you can join the conversation in the Bytecode Alliance community’s chat platform, which has a dedicated channel for Wasmtime.
The Wasmtime project releases a new version once a month with new features, bug fixes, and performance improvements. Previously, though, these releases were only supported for 2 months, meaning that embedders needed to follow the Wasmtime project pretty closely to receive security updates. This rate of change can be too fast for users, so Wasmtime now supports LTS releases.
Every 12th version of Wasmtime will now be considered an LTS release and will receive security fixes for 2 years, or 24 months. This means that users can now update Wasmtime once-a-year instead of once-a-month and be guaranteed that they will always receive security updates. Wasmtime’s 24.0.0 release has been retroactively classified as a LTS release and will be supported until August 20,
You can view a table of Wasmtime’s releases in the documentation book which has information on all currently supported releases, upcoming releases, and information about previously supported releases. The high-level summary of Wasmtime’s LTS release channel is:
If you’re a current user of Wasmtime and would like to use an LTS release then it’s recommended to either downgrade to the 24.0.0 version or wait for this August to upgrade to the 36.0.0 version. Wasmtime 34.0.0, to be released June 20, 2025, will be supported up until the release of Wasmtime 36.0.0 on August 20, 2025.
The WAMR community has shown incredible dedication and enthusiasm throughout 2024. Here are some impressive numbers that highlight the community’s contributions:
Breaking down the contributions further:
The top three non-Intel organized contributors have made significant impacts:
These contributions have been instrumental in driving WAMR forward, and we extend our heartfelt thanks to everyone involved.
Several exciting new features have been added to WAMR in 2024, aimed at enhancing the development experience and expanding the capabilities of WAMR. Here are some of the key features:
One of the most exciting additions to WAMR in 2024 is the introduction of new development tools aimed at simplifying Wasm development. These tools include:
Before these tools, developing a Wasm application or plugin using a host language was a complex task. Mapping Wasm functions back to the source code written in the host language required deep knowledge and was often cumbersome. Debugging information from the runtime and the host language felt like two foreign languages trying to communicate without a translator. These new development tools act as that much-needed translator, bridging the gap and making Wasm development more accessible and efficient.
Another significant feature introduced in 2024 is the shared heap. This feature addresses the challenge of sharing memory between the host and Wasm. Traditionally, copying data at the host-Wasm border was inefficient, and existing solutions like externref lacked flexibility and toolchain support.
The shared heap approach uses a pre-allocated region of linear memory as a “swap” area. Both the embedded system and Wasm can store and access shared objects here without the need for copying. However, this feature comes with its own set of challenges. Unlike memory.grow(), the new memory region isn’t controlled by Wasm, and the Wasm module may not even be aware of it. This requires runtime APIs to map the embedded-provided memory area into linear memory, making it a runtime-level solution rather than a Wasm opcode.
It’s important to note that the shared heap is an experimental feature, and the intent is to work towards a standardized approach within the WebAssembly Community Group (CG). This will help set expectations for early adopters and ensure alignment with the broader Wasm ecosystem. As the feature evolves, feedback from the community will be crucial in shaping its development and eventual standardization.
Several features have been finalized in 2024, further enhancing WAMR’s capabilities:
These new features and improvements are designed to make WAMR more powerful and easier to use, catering to the needs of developers and industry professionals alike.
In the embedding industry, the perspective on Wasm differs slightly from the cloud-centric view that the current Wasm Community Group (CG) often focuses on. To address these unique requirements, the Embedded Special Interest Group (ESIG) was established in 2024. This group aims to discover solutions that prioritize performance, footprint and stability, tailored specifically for embedding devices.
The ESIG has already achieved several accomplishments this year, thanks to the shared understanding and collaboration with customers. By focusing on the unique needs of the embedding industry, ESIG is paving the way for more specialized and efficient Wasm solutions.
The adoption of WAMR in the industry has been remarkable, with several key players integrating WAMR into their systems to leverage its performance and flexibility. Here are some notable examples:
Alibaba’s Microservice Engine (MSE) has adopted WAMR as a Wasm runtime to execute Wasm plugins in their gateways Higress. This integration has resulted in an impressive ~50% performance improvement, showcasing the efficiency and robustness of WAMR in real-world applications.
WAMR has also been integrated into Runwasi as one of the Wasm runtimes to execute Wasm in containerd. This integration allows for seamless execution of Wasm modules within containerized environments, providing a versatile and efficient solution for running Wasm applications.
For more information on industrial adoptions and other use cases, please refer to this link.
These examples highlight the growing trust and reliance on WAMR in various industrial applications, demonstrating its capability to deliver significant performance enhancements and operational efficiencies.
2024 has been a transformative year for WAMR, marked by significant community contributions, innovative features, and the establishment of the ESIG. As we look ahead, we are excited about the continued growth and evolution of WAMR, driven by the passion and dedication of our community. We invite you to join us on this journey, explore the new features, and contribute to the future of WebAssembly Micro Runtime.
Thank you for being a part of the WAMR community. Here’s to an even more exciting 2025!
The Bytecode Alliance Technical Steering Committee acts as the top-level governing body for projects and Special Interest Groups hosted by the Alliance, ensuring they further the Alliance’s mission and are conducted in accordance with our values and principles. The TSC also oversees the Bytecode Alliance Recognized Contributor program to encourage and engage individual contributors as participants in Alliance projects and groups. As defined in its charter, the TSC is composed of representatives from each Alliance Core Project and individuals selected by Recognized Contributors.
Our new TSC Elected Delegates (and their GitHub IDs, as we know each other in our RC community) are:
They will each serve a two-year term on the TSC.
Our RCs are also represented by two At-Large Directors they select to serve on our Board (as described in our organization bylaws), with overlapping two-year terms staggered to start each January. In this most recent election, the Recognized Contributors chose Bailey Hayes (@ricochet) as At-Large Director.
I look forward to working with each of our electees, and am happy to introduce them here as part of bringing them onboard in their new roles. You’ll find our full Board and TSC listed on the About page of our website.
Thank you to all our Recognized Contributors for taking part in the election process and in general for their ongoing support of Alliance projects and communities. I’d also like to thank our outgoing leadership for their outstanding work - Nick Fitzgerald (@fitzgen) as TSC Chair and Elected Delegate, and Till Schneidereit (@tschneidereit) as Elected Delegate and At-Large Director.
Wasmtime v28.0 includes a variety of enhancements and fixes. The release notes are available here.
- bool type. #9593
- wasmtime crate now natively supports the wasm-wave crate and its encoding of component value types. #8872
- Module can now be created from an already-open file. #9571
- signals-based-traps, has been added to the wasmtime crate. When disabled, runtime signal handling is not required by the host. This is intended to help with future effort to port Wasmtime to more platforms. #9614
- malloc in certain conditions when guard pages are disabled, for example. #9614 #9634
- async feature no longer requires std. #9689
- OutgoingBody in wasmtime-wasi-http are now configurable. #9670
- Store<T> now caches a single fiber stack in async mode to avoid allocating/deallocating if the store is used multiple times. #9604
- wasmparser’s validator. #9623
- isle-in-source-tree feature has been re-worked as an environment variable. #9633
- Config now clarifies that defaults of some options may differ depending on the selected target or compiler depending on features supported. #9705
- Error trait, even in #[no_std] mode. #9702

Thanks to Karl Meakin, Chris Fallin, Pat Hickey, Alex Crichton, Xinzhao Xu, SingleAccretion, Nick Fitzgerald, and Ulrich Weigand for their work on this release.
Want to get involved with Wasmtime? Join the community on our Zulip chat and read the Wasmtime contributors’ guide for more information.
WebAssembly (abbreviated Wasm) is a binary instruction format for a stack-based virtual machine. Wasm is designed as a portable compilation target for programming languages, enabling deployment on the web for client and server applications.
This portability has led many people to claim that it is a “universal bytecode” — an instruction set that can run on any computer, abstracting away the underlying native architecture and operating system. In practice, however, there remain places you cannot take standard WebAssembly, for example certain memory-constrained embedded devices. Runtimes have been forced to choose between deviating from the standard with ad-hoc language modifications or else avoiding these platforms. This article details in-progress standards proposals to lift these extant language limitations; enumerates recent engineering efforts to greatly expand Wasmtime’s platform support; and, finally, shares some ways that you can get involved and help us further improve Wasm’s portability.
WebAssembly has a lot going for it. It has a formal specification that is developed in an open, collaborative standards process by browser, runtime, hardware, and language toolchain vendors, among others. It’s sandboxed, so a Wasm program cannot access any resource you don’t explicitly give it access to, leading to the development of standard Wasm APIs leveraging capability-based security. It is designed such that, after compilation to native code, it can be executed at near-native speeds. And, even if there is room for improvement, it is portable across many systems, running in Web browsers and on servers, workstations, phones, and more. These qualities are worth commending, preserving, and making available in even more places.
What does Wasm need in order to run on a given platform? A Wasm runtime that supports that platform. In this article, we’ll focus on the runtime we’re building: Wasmtime.
Wasmtime is a lightweight, standalone WebAssembly runtime developed openly within the Bytecode Alliance. Wasmtime is fast. It can, for example, spawn new Wasm instances in just 5 microseconds. We, the Wasmtime developers, labor to ensure that Wasmtime is correct and secure, leveraging ubiquitous fuzzing and formal verification, because Wasm’s theoretical security properties are only as strong as the runtime’s actual implementation. We are committed to open standards and actively participate in Wasm standardization; Wasmtime does not and will never implement ad-hoc, non-standard Wasm extensions.1 We believe that bringing Wasmtime, its principles, and its strengths to more platforms is a worthwhile endeavor.
So what must Wasmtime, or any other Wasm runtime, have in order to run Wasm on a given platform? There are two fundamental operations that, no matter how they are implemented, a Wasm runtime requires: allocating the linear memories that Wasm programs use, and executing the Wasm program’s instructions.
A Wasm runtime’s portability is determined by how few assumptions it makes about
its underlying platform in its implementation of those operations. Does it
assume an operating system that provides the mmap syscall or a CPU that
supports virtual memory? Does it support just a small, fixed set of instruction
sets, such as x86_64 and aarch64, or a wide, extensible set of ISAs? And, as
previously mentioned, no matter which implementation choices are made,
assumptions baked into the Wasm language specification itself can also limit a
runtime’s portability.
Wasmtime’s runtime previously made unnecessary assumptions, artificially
constraining its portability, and we’ve spent the last year or so removing them
one by one. Wasmtime is now a no_std crate with minimal platform
assumptions. It doesn’t require that the underlying platform provide mmap in
order to allocate Wasm memories like it previously did; in fact, it no longer
even depends upon an underlying operating system at all. As of today, Wasmtime’s
only mandatory platform requirement is a global memory allocator
(i.e. malloc).
Wasmtime previously assumed that it could always use guard pages to catch out-of-bounds memory accesses, constraining its portability to platforms with virtual memory. Wasmtime can now be configured to rely only on explicit checks to catch out-of-bounds accesses, and Wasmtime no longer assumes the presence of virtual memory.
Wasmtime previously assumed that it could always detect division-by-zero by
installing a signal handler. It would translate Wasm division instructions into
unguarded, native div instructions and catch the corresponding signals that
the operating system translated from divide-by-zero exceptions. This constrained
Wasmtime’s portability to only operating systems with signals and instruction
sets that raise exceptions on division by zero. Wasmtime can now be configured
to emit explicit tests for zero divisors, removing the assumption that
divide-by-zero signals are always available.
Configure Wasmtime to avoid depending upon virtual memory and signals by
building without the signals-based-traps cargo feature and with
Config::signals_based_traps(false). More information about
configuring minimal Wasmtime builds, as well as integrating with custom
operating systems, can be found in the Wasmtime guide.
This effort was spearheaded by Alex Crichton, with contributions from Chris Fallin.
The WebAssembly language specification imposes a fairly well-known portability constraint on standards-compliant implementations: Wasm memories are composed of pages, and Wasm pages have a fixed size of 64KiB. Therefore, a Wasm memory’s size is always a multiple of 64KiB, and the smallest non-zero memory size is 64KiB. But there exist embedded devices with less than 64KiB of memory available for Wasm, but where developers nonetheless want to run Wasm. I have been championing a new proposal in the WebAssembly standardization group to address this mismatch.
The custom-page-sizes proposal allows a Wasm module to specify a memory’s page size, in bytes, in the memory’s static definition. This gives Wasm modules finer-grained control over their resource consumption: with a one-byte page size, for example, a Wasm memory can be sized to exactly the embedded device’s capacity, even when less than 64KiB are available.
I implemented support for the custom-page-sizes proposal in Wasmtime. You can
experiment with it via the --wasm=custom-page-sizes flag on the command line
or via the Config::wasm_custom_page_sizes
method
in the library. Since then, three other Wasm engines have added support for
the proposal as well.
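As a minimal sketch, assuming the standard wasmtime crate API, opting in from the library and loading a module that declares a small page size looks something like this:
use wasmtime::{Config, Engine, Module, Result};

fn load_small_memory_module(wasm: &[u8]) -> Result<Module> {
    let mut config = Config::new();
    // Opt in to the custom-page-sizes proposal so that modules may declare
    // memories whose page size is smaller than the default 64KiB.
    config.wasm_custom_page_sizes(true);
    let engine = Engine::new(&config)?;
    Module::new(&engine, wasm)
}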
The proposal is on the standards track and is currently in phase 2 of the standardization process. I intend to shepherd it to phase 3 in 2025. The phase 3 entry requirements (spec tests and an implementation) are already satisfied multiple times over today.
We’ve discussed allocating Wasm memories portably and removing assumptions from the runtime and language specification; now we turn our attention to portably executing Wasm instructions. Wasmtime previously had two available approaches to Wasm execution: compilation with Cranelift, its optimizing compiler, or compilation with Winch, its baseline compiler.
Both options compile Wasm down to native instructions, which has two portability
consequences. First, when loaded into memory, the compiled Wasm’s machine code
must be executable, and non-portable assumptions around the presence of mmap,
memory permissions, and virtual memory on the underlying platform creep in
again. Second, compiling to native code as an execution strategy requires a
compiler backend for the target platform’s architecture. We cannot translate
Wasm instructions into native instructions without a compiler backend that knows
how to emit those native instructions. Cranelift has backends for aarch64,
riscv64, s390x, and x86_64. Winch has an aarch64 backend and an x86_64
backend. If you wanted to execute Wasm on a different architecture, say armv7
or riscv32, you had to first author a whole compiler backend for that
architecture, which is not a quick-and-easy task for established Wasmtime and
Cranelift hackers, let alone new contributors. This was a huge roadblock to
Wasmtime’s portability.
The typical way to add portable execution is with an interpreter written in a
portable manner, so we started investigating that design space for Wasmtime.
With a portable interpreter, you can execute Wasm on any platform for which you
can compile the interpreter itself. In Wasmtime’s case, because it is written in Rust,
a portable interpreter would expand Wasmtime’s portability to all of the many
platforms that rustc supports.
We want to maximize the interpreter’s execution throughput — how fast it can run Wasm.2 If people are running the interpreter due to the absence of a compiler backend for their architecture, then the usual method of tuning Wasmtime for fast Wasm execution (using Cranelift as the execution strategy) is unavailable. Beyond optimizing the interpreter’s core loop and opcode dispatch, the best way to speed up an interpreter is to execute fewer instructions, doing relatively more work per instruction.
This pushes us towards translating Wasm into a custom, internal bytecode format. The internal bytecode format can be register-based, rather than stack-based like Wasm, which generally requires fewer instructions to encode the same program. With an internal bytecode we also have the freedom to define “super-instructions” or “macro-ops” — single instructions that do the work of multiple smaller instructions all at once — whenever we determine it would be beneficial.
The Wasm-to-internal-bytecode translation step also gives us a place to optimize the resulting bytecode before we begin executing it. In addition to coalescing multiple operations into macro-ops, we have the opportunity to do things like deduplicate subexpressions and eliminate redundant moves. At this point we realized that the translation step was sounding more and more like a proper optimizing compiler, and we already maintain an optimizing compiler that performs exactly these sorts of optimizations; we just need to teach it to emit the interpreter’s internal bytecode rather than native code.
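To make those ideas concrete, here is a purely illustrative sketch, in Rust, of a register-based bytecode with a single macro-op and a simple dispatch loop; it is not Pulley’s actual instruction set or encoding:
// Register-based: each operand names a virtual register directly, so the
// stack-machine shuffle (local.get x; local.get y; i32.add; local.set z)
// collapses into a single instruction.
#[derive(Clone, Copy)]
enum Op {
    Add { dst: u8, a: u8, b: u8 },
    // A macro-op fusing a common sequence (add, then branch if the result
    // is zero) into one unit of dispatch work.
    AddThenBranchIfZero { dst: u8, a: u8, b: u8, target: usize },
    Return { src: u8 },
}

fn run(ops: &[Op], regs: &mut [i32]) -> i32 {
    let mut pc = 0;
    loop {
        match ops[pc] {
            Op::Add { dst, a, b } => {
                regs[dst as usize] = regs[a as usize].wrapping_add(regs[b as usize]);
                pc += 1;
            }
            Op::AddThenBranchIfZero { dst, a, b, target } => {
                let sum = regs[a as usize].wrapping_add(regs[b as usize]);
                regs[dst as usize] = sum;
                pc = if sum == 0 { target } else { pc + 1 };
            }
            Op::Return { src } => return regs[src as usize],
        }
    }
}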
The Pulley interpreter is the culmination of this line of thinking. When Wasmtime is using Pulley, it translates Wasm to Cranelift’s intermediate representation, CLIF; then Cranelift runs its mid-end optimizations on the CLIF, such as constant propagation, GVN, and LICM; next, Cranelift lowers the CLIF to Pulley bytecode, coalescing multiple CLIF instructions into single Pulley macro-ops, eliminating dead code, and (re)allocating (virtual) registers to reduce moves; and finally, Wasmtime interprets the resulting optimized bytecode.
┌──────┐
│ Wasm │
└──────┘
│
│
Wasm-to-CLIF translation
│
▼
┌──────┐
│ CLIF │
└──────┘
│
│
mid-end optimizations
│
▼
┌──────┐
│ CLIF │
└──────┘
│
│
lowering
│
▼
┌─────────────────┐
│ Pulley bytecode │
└─────────────────┘
Just like Wasm-to-native-code compilation, Wasm-to-Pulley-bytecode compilation can be performed offline and ahead of time. Bytecode compilation need not be on the critical path and, given an already-bytecode-compiled Wasm module, Pulley execution can leverage the same 5-microsecond instantiation that native compilation strategies enjoy.
Initial Pulley support has landed in Wasmtime, but it is still a work in
progress and at times incomplete. We have not yet spent time optimizing Pulley,
its interpreter loop, or its selection of macro-ops, so its performance today is
not as good as it should be. You can experiment with Pulley by enabling the
pulley cargo feature and passing the --target pulley32 or --target pulley64
command-line flag (depending on whether you are on a 32- or 64-bit machine,
respectively) or by calling config.target("pulley32") or
config.target("pulley64") when using Wasmtime as a
library. Note that you must use the (default) Cranelift compilation strategy
with Pulley; Winch doesn’t support emitting Pulley bytecode at this time.
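For example, here is a minimal sketch of selecting Pulley from the library, using the target strings mentioned above and assuming a build with the pulley cargo feature enabled:
use wasmtime::{Config, Engine, Result};

fn pulley_engine() -> Result<Engine> {
    let mut config = Config::new();
    // Compile to Pulley bytecode instead of native code; use "pulley32" on
    // 32-bit hosts.
    config.target("pulley64")?;
    Engine::new(&config)
}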
The architecture and pipeline for Pulley emerged from discussions between myself and Alex Crichton. The initial Pulley interpreter and Cranelift backend to emit Pulley bytecode were both developed by me. Alex integrated the interpreter into Wasmtime’s runtime and has since been filling in its full breadth of Wasm support.
We’ve been focusing on Wasmtime’s portability and the ability to run any Wasm code on as many platforms as possible. The infamous “write once, run anywhere” (WORA) ambition aims even higher: to run the exact same code on all platforms, without changing its source or recompiling it.
At a high level, an application requires certain core capabilities. Not all code needs to, or should, run on platforms that lack its required capabilities: running an IRC chat client on a device that isn’t connected to the internet doesn’t generally make sense because an IRC chat client requires network access. WORA across two given platforms is a worthy goal only when both platforms provide the application’s required capabilities (at a high level, regardless of whether they happen to use incompatible syscalls or different mechanisms to expose those capabilities).
The WebAssembly Component Model makes explicit the capability dependencies of a Wasm component and introduces the concept of a world to formalize an environment’s available capabilities. With components and worlds, we can precisely answer the question of whether WORA makes sense across two given platforms. Along with the standard worlds and interfaces defined by WASI, we already have all the tools we need to make WORA a reality for Wasm where it makes sense.3
Do you believe in our vision and want to contribute to our portability efforts? The work here isn’t done and there are many opportunities to get involved!
Try building a minimal Wasmtime for your niche platform, kick the tires, and share your feedback with us.
Help us get Pulley passing all of the .wast spec
tests! Making a
failing test start passing is usually pretty straightforward and just involves
adding a missing instruction or two. This is a great way to start contributing
to Wasmtime and Cranelift.
Once Pulley is complete, or at least mostly complete, we can start analyzing
and improving its performance. We can run our Sightglass
benchmarks like
spidermonkey.wasm under Pulley to determine what can be improved. We can
inspect the generated bytecode, identify which pairs of opcodes are often
found one after the other, and create new macro-ops. There is a lot of fun,
meaty performance engineering work available here for folks who enjoy making
number go up.
Support for running Wasm binaries that use custom page sizes is complete in
Wasmtime, but toolchain support for generating Wasm binaries with custom page
sizes is still largely missing. Adding support for the custom-page-sizes
proposal to
wasm-ld
is what is needed most. It’s expected that this implementation should be
relatively straightforward and that exposing a __wasm_page_size symbol can
be modelled after the existing __tls_size
symbol.
At the time of writing, a minimal dynamic library that runs pre-compiled Wasm
modules is a 315KiB binary on x86_64. A minimal build of Wasmtime’s whole C
API as a dynamic library is 698KiB. These numbers aren’t terrible, but we also
haven’t put any effort into optimizing Wasmtime for code size yet, so we
expect there to be a fair amount of potential code size wins and low-hanging
fruit available. We suspect error strings are a major code size offender, and
revamping wasmtime::Error to optionally (based on compile-time features)
contain just error codes, instead of full strings, is one idea we
have. Analyzing code size with cargo
bloat would also be fantastic.
We also publish high-level contributor documentation in the Wasmtime guide.
Big thanks to everyone who has contributed to the recent portability effort and to Wasmtime over the years. Thanks also to Alex Crichton and Till Schneidereit for reviewing early drafts of this article.
If an engine chooses not to abide by the constraints imposed by the WebAssembly language specification, then it is not implementing WebAssembly. It is instead implementing a language that is similar to but subtly different from WebAssembly. This leads to interoperability hazards, de facto standards, and ecosystem splits. We saw this during the early days of the Web, when websites used non-standard, Internet Explorer-specific APIs. This led to broken websites for people using other browsers, and eventually forced other browsers to reverse engineer the non-standard APIs. The Web is still stuck with the resulting baggage and tech debt today. We must prevent this from happening to WebAssembly. Therefore we refuse the temptation to deviate from the WebAssembly specification. Instead, when we identify language-level constraints, we engage with the standards process to create solutions that the whole ecosystem can rely on. ↩
To build the very fastest interpreter possible, you probably want to write assembly by hand, but that directly conflicts with our primary goal of portability so it is unacceptable. We want to maximize interpreter speed to the degree we can, but we cannot prioritize it over portability. ↩
The component model also gives us tools to break Wasm applications down into their constituent parts, and share those parts across different applications. Even when WORA doesn’t make sense for a full application, it might make sense for some subset of its business logic that happens to require fewer capabilities than the full application. For example, we may want to share the logic for maintaining the set of active IRC users between both the server and the client. ↩