wasmtime / issue #10150 Cranelift: define and use monolit... · git-wasmtime

Stream: git-wasmtime

Topic: wasmtime / issue #10150 Cranelift: define and use monolit...

Wasmtime GitHub notifications bot (Jan 29 2025 at 20:08):

Today in the Wasm-to-CLIF translator we have a detailed set of cases by which we break down Wasm-level loads and stores into statically or dynamically bounds-checked operations. For example, a fully general dynamic bounds check becomes something like:

Compute the Wasm index value (i32) plus offset given on the load;

Compute that Wasm address plus the access size;

Compare that end-of-access address to a loaded memory size/bound;

Trap if out-of-bounds;

Add the loaded memory base to the index plus offset;

Load from that computed host address.

With variations possible depending on exactly what guard regions exist, whether Spectre mitigations are enabled (cmoves-of-null-pointers rather than branches for traps), whether the address is constant and known in-bounds, whether the access is 1 byte or multiple, and all the rest.

The theory behind "exploding" the Wasm-level operation into constituent parts in CLIF is twofold. First, it allows the compiler backends to deal with a simpler "core IR" that knows only about host loads and stores -- there is no need to have a concept of separate bounds-checked address spaces baked into every backend, and all the backends benefit from one central encoding of all the special cases and optimizations. Second, it was thought that this might allow some parts to be optimized: e.g., GVN when accesses have parts of the address in common (only offsets differ), or GVN and LICM when some parts are loop-invariant (e.g., base pointers on immovable memories or limits on non-growable memories), or constant folding, or maybe other things. The less we have to teach the optimizer about "bounds check" as a concept, the better.

However, there are several more recent realizations we have had. First, the promised optimization is only partly realizable; for example, the ability for memories to grow means that limit fields must be reloaded frequently or every time. And the common deployment mode that results in dynamic bounds-checking typically has no guard region at all ("no virtual memory" scenarios on embedded systems, or high-density scenarios with many memories where frequent virtual memory changes on guards during memory growth might become a bottleneck). Because of this, some of the commonality is lost: e.g., in common configurations we probably won't be able to factor out the compare-and-trap/cmov from Wasm loads/stores with different offsets on the same base.

Second, as @alexcrichton noted in today's Cranelift meeting, the Pulley interpreter-bytecode Cranelift backend wants to target new macro-ops that implement the Wasm bounds-check sequence. When there is interpreter overhead for every instruction, there is pressure away from "exploded" views of operation steps and back toward ready-made larger ops. Currently we would have to pattern-match all the various kinds of code we can generate to recover these macro-ops.

Third, as I described recently in a talk about proof-carrying code, verifying the tiny pieces that make up a dynamic bounds-check sequence is extremely challenging: getting an analysis to understand all the variations is brittle and complex, and leads to surprising and un-intuitive quadratic verification behavior. A verifier ideally wants to see a bounds-check as a single operation with a verified implementation rather than a graph of operators that lose the intent.

For all these reasons, I think we should consider adding a bounds_check operator to CLIF. Its semantics might be something like: boundscheck32_or_null.i64 host_base, offset, limit, imm_offset, size, where host_base (i64 on 64-bit-pointer systems) is a host address of the start of a bounds-checked memory, offset is an i32 offset into the memory, limit is an i32 memory size, imm_offset is a 16-bit constant offset to add to the returned address (u16), and size is the size of the access (u8). (These operand sizes were picked to allow the whole InstructionData to fit in 16 bytes: three Values and 24 bits of payload. If the enum tag becomes a u16 then we can make imm_offset a u8 and still capture most cases without a separate iadd` probably.)

The semantics would be: if offset + imm_offset + size <= limit, then return host_base + offset + imm_offset, otherwise return 0. We could define a variant ..._or_trap that traps instead on out-of-bounds.

The important considerations in my view are:

We should ensure that we can either encode optimizations that bounds_check.rs does with the new opcode, or else decide these optimizations aren't desired anymore. (Hopefully the former unless we have good data otherwise.) For example, if we do have a guard region, we could use the bounds-check opcode to compute a base without the specific load offset, then put the load offset on the machine load instruction. The Pulley backend should be able to combine these still and the verifier should be able to reason about valid ranges too.

We should ensure that we aren't removing important GVN-ability. Notably the above still externalizes the loads of base and limit; so whatever we do to allow those to be readonly, LICM them out of loops, etc., still applies. Also, this opcode would be either pure (Spectre/null variant) or idempotent (trapping variant) so to the extent that we can factor out offsets as above, it should be combinable.

Curious what others think -- any requirements this is missing? Any other important optimizations we would lose out on?

Wasmtime GitHub notifications bot (Jan 29 2025 at 20:09):

cfallin edited issue #10150:

Today in the Wasm-to-CLIF translator we have a detailed set of cases by which we break down Wasm-level loads and stores into statically or dynamically bounds-checked operations. For example, a fully general dynamic bounds check becomes something like:

Compute the Wasm index value (i32) plus offset given on the load;

Compute that Wasm address plus the access size;

Compare that end-of-access address to a loaded memory size/bound;

Trap if out-of-bounds;

Add the loaded memory base to the index plus offset;

Load from that computed host address.

With variations possible depending on exactly what guard regions exist, whether Spectre mitigations are enabled (cmoves-of-null-pointers rather than branches for traps), whether the address is constant and known in-bounds, whether the access is 1 byte or multiple, and all the rest.

The theory behind "exploding" the Wasm-level operation into constituent parts in CLIF is twofold. First, it allows the compiler backends to deal with a simpler "core IR" that knows only about host loads and stores -- there is no need to have a concept of separate bounds-checked address spaces baked into every backend, and all the backends benefit from one central encoding of all the special cases and optimizations. Second, it was thought that this might allow some parts to be optimized: e.g., GVN when accesses have parts of the address in common (only offsets differ), or GVN and LICM when some parts are loop-invariant (e.g., base pointers on immovable memories or limits on non-growable memories), or constant folding, or maybe other things. The less we have to teach the optimizer about "bounds check" as a concept, the better.

However, there are several more recent realizations we have had. First, the promised optimization is only partly realizable; for example, the ability for memories to grow means that limit fields must be reloaded frequently or every time. And the common deployment mode that results in dynamic bounds-checking typically has no guard region at all ("no virtual memory" scenarios on embedded systems, or high-density scenarios with many memories where frequent virtual memory changes on guards during memory growth might become a bottleneck). Because of this, some of the commonality is lost: e.g., in common configurations we probably won't be able to factor out the compare-and-trap/cmov from Wasm loads/stores with different offsets on the same base.

Second, as @alexcrichton noted in today's Cranelift meeting, the Pulley interpreter-bytecode Cranelift backend wants to target new macro-ops that implement the Wasm bounds-check sequence. When there is interpreter overhead for every instruction, there is pressure away from "exploded" views of operation steps and back toward ready-made larger ops. Currently we would have to pattern-match all the various kinds of code we can generate to recover these macro-ops.

Third, as I described recently in a talk about proof-carrying code, verifying the tiny pieces that make up a dynamic bounds-check sequence is extremely challenging: getting an analysis to understand all the variations is brittle and complex, and leads to surprising and un-intuitive quadratic verification behavior. A verifier ideally wants to see a bounds-check as a single operation with a verified implementation rather than a graph of operators that lose the intent.

For all these reasons, I think we should consider adding a bounds_check operator to CLIF. Its semantics might be something like: boundscheck32_or_null.i64 host_base, offset, limit, imm_offset, size, where host_base (i64 on 64-bit-pointer systems) is a host address of the start of a bounds-checked memory, offset is an i32 offset into the memory, limit is an i32 memory size, imm_offset is a 16-bit constant offset to add to the returned address (u16), and size is the size of the access (u8). (These operand sizes were picked to allow the whole InstructionData to fit in 16 bytes: three Values and 24 bits of payload. If the enum tag becomes a u16 then we can make imm_offset a u8 and still capture most cases without a separate iadd probably.)

The semantics would be: if offset + imm_offset + size <= limit, then return host_base + offset + imm_offset, otherwise return 0. We could define a variant ..._or_trap that traps instead on out-of-bounds.

The important considerations in my view are:

We should ensure that we can either encode optimizations that bounds_check.rs does with the new opcode, or else decide these optimizations aren't desired anymore. (Hopefully the former unless we have good data otherwise.) For example, if we do have a guard region, we could use the bounds-check opcode to compute a base without the specific load offset, then put the load offset on the machine load instruction. The Pulley backend should be able to combine these still and the verifier should be able to reason about valid ranges too.

We should ensure that we aren't removing important GVN-ability. Notably the above still externalizes the loads of base and limit; so whatever we do to allow those to be readonly, LICM them out of loops, etc., still applies. Also, this opcode would be either pure (Spectre/null variant) or idempotent (trapping variant) so to the extent that we can factor out offsets as above, it should be combinable.

Curious what others think -- any requirements this is missing? Any other important optimizations we would lose out on?

Wasmtime GitHub notifications bot (Jan 29 2025 at 20:10):

cfallin edited issue #10150:

Today in the Wasm-to-CLIF translator we have a detailed set of cases by which we break down Wasm-level loads and stores into statically or dynamically bounds-checked operations. For example, a fully general dynamic bounds check becomes something like:

Compute the Wasm index value (i32) plus offset given on the load;

Compute that Wasm address plus the access size;

Compare that end-of-access address to a loaded memory size/bound;

Trap if out-of-bounds;

Add the loaded memory base to the index plus offset;

Load from that computed host address.

With variations possible depending on exactly what guard regions exist, whether Spectre mitigations are enabled (cmoves-of-null-pointers rather than branches for traps), whether the address is constant and known in-bounds, whether the access is 1 byte or multiple, and all the rest.

The theory behind "exploding" the Wasm-level operation into constituent parts in CLIF is twofold. First, it allows the compiler backends to deal with a simpler "core IR" that knows only about host loads and stores -- there is no need to have a concept of separate bounds-checked address spaces baked into every backend, and all the backends benefit from one central encoding of all the special cases and optimizations. Second, it was thought that this might allow some parts to be optimized: e.g., GVN when accesses have parts of the address in common (only offsets differ), or GVN and LICM when some parts are loop-invariant (e.g., base pointers on immovable memories or limits on non-growable memories), or constant folding, or maybe other things. The less we have to teach the optimizer about "bounds check" as a concept, the better.

However, there are several more recent realizations we have had. First, the promised optimization is only partly realizable; for example, the ability for memories to grow means that limit fields must be reloaded frequently or every time. And the common deployment mode that results in dynamic bounds-checking typically has no guard region at all ("no virtual memory" scenarios on embedded systems, or high-density scenarios with many memories where frequent virtual memory changes on guards during memory growth might become a bottleneck). Because of this, some of the commonality is lost: e.g., in common configurations we probably won't be able to factor out the compare-and-trap/cmov from Wasm loads/stores with different offsets on the same base.

Second, as @alexcrichton noted in today's Cranelift meeting, the Pulley interpreter-bytecode Cranelift backend wants to target new macro-ops that implement the Wasm bounds-check sequence. When there is interpreter overhead for every instruction, there is pressure away from "exploded" views of operation steps and back toward ready-made larger ops. Currently we would have to pattern-match all the various kinds of code we can generate to recover these macro-ops.

Third, as I described recently in a talk about proof-carrying code, verifying the tiny pieces that make up a dynamic bounds-check sequence is extremely challenging: getting an analysis to understand all the variations is brittle and complex, and leads to surprising and un-intuitive quadratic verification behavior. A verifier ideally wants to see a bounds-check as a single operation with a verified implementation rather than a graph of operators that lose the intent.

For all these reasons, I think we should consider adding a bounds-check operator to CLIF. Its semantics might be something like: boundscheck32_or_null.i64 host_base, offset, limit, imm_offset, size, where host_base (i64 on 64-bit-pointer systems) is a host address of the start of a bounds-checked memory, offset is an i32 offset into the memory, limit is an i32 memory size, imm_offset is a 16-bit constant offset to add to the returned address (u16), and size is the size of the access (u8). (These operand sizes were picked to allow the whole InstructionData to fit in 16 bytes: three Values and 24 bits of payload. If the enum tag becomes a u16 then we can make imm_offset a u8 and still capture most cases without a separate iadd probably.)

The semantics would be: if offset + imm_offset + size <= limit, then return host_base + offset + imm_offset, otherwise return 0. We could define a variant ..._or_trap that traps instead on out-of-bounds.

The important considerations in my view are:

We should ensure that we can either encode optimizations that bounds_check.rs does with the new opcode, or else decide these optimizations aren't desired anymore. (Hopefully the former unless we have good data otherwise.) For example, if we do have a guard region, we could use the bounds-check opcode to compute a base without the specific load offset, then put the load offset on the machine load instruction. The Pulley backend should be able to combine these still and the verifier should be able to reason about valid ranges too.

We should ensure that we aren't removing important GVN-ability. Notably the above still externalizes the loads of base and limit; so whatever we do to allow those to be readonly, LICM them out of loops, etc., still applies. Also, this opcode would be either pure (Spectre/null variant) or idempotent (trapping variant) so to the extent that we can factor out offsets as above, it should be combinable.

Curious what others think -- any requirements this is missing? Any other important optimizations we would lose out on?

Wasmtime GitHub notifications bot (Jan 30 2025 at 00:06):

alexcrichton commented on issue #10150:

For reference https://github.com/bytecodealliance/wasmtime/pull/10154 is the culmination of the work I was doing for folding pulley bounds checks/loads into one op. For me the ISLE isn't tiny but it's also not unmanagable, personally.

Wasmtime GitHub notifications bot (Feb 26 2025 at 21:55):

fitzgen commented on issue #10150:

However, there are several more recent realizations we have had. First, the promised optimization is only partly realizable; for example, the ability for memories to grow means that limit fields must be reloaded frequently or every time.

I would expect that alias analysis would make it so that two wasm loads/stores that do not cross a function call would indeed only load the heap limit once. Are you saying that is not the case in practice?

Wasmtime GitHub notifications bot (Mar 04 2025 at 22:54):

cfallin commented on issue #10150:

I would expect that alias analysis would make it so that two wasm loads/stores that do not cross a function call would indeed only load the heap limit once. Are you saying that is not the case in practice?

Ah, I may have read too much into some examples I had been looking at -- indeed it seems that limits are reused (alias analysis merges loads) between function calls.

Last updated: Apr 18 2025 at 10:03 UTC