Stream: git-wasmtime

Topic: wasmtime / issue #8225 PCC: compile module with knowledge...


view this post on Zulip Wasmtime GitHub notifications bot (Mar 22 2024 at 21:45):

cfallin opened issue #8225:

A fuzzbug recently came in (https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=67568) that has a load with a very large (~3GiB) offset. With our bounds-checking code and default setting of a 2GiB guard region, we handle this today with a dynamic bounds-check. PCC currently cannot handle dynamic bounds-checks. So if we want to (i) deploy PCC in our default configuration, and (ii) correctly handle all legal Wasm inputs, we need to find a way to check such instructions.

It occurs to me that if our guard region is large enough, any load/store with an offset larger than the guard region will also have a minimum address that is larger than the biggest possible memory in most "reasonable" production configurations. If we can guarantee that, and compile with that assumption, we can statically compile a i32.load offset=9bazillion instruction into a trap (with a HeapOutOfBounds trap code so it is semantically unchanged to observers). This would allow PCC to check the result as well.

I'd like to file the issue first to get opinions: would it be reasonable to have a maximum_memory_size configuration option that bounds all memories, such that it is an error to grow larger than that, and then use this during compilation to handle the above case? And then set it so that it is at (or ever so slightly smaller than, by the maximum load/store width) the default guard-region size, so we get the nice property that our default configuration never generates dynamic bounds checks?

view this post on Zulip Wasmtime GitHub notifications bot (Mar 22 2024 at 21:50):

cfallin commented on issue #8225:

Here's a minimal tests/disas/ testcase for reference:

<details>
<summary>compile test</summary>

;;! target = "x86_64"
;;! test = "compile"
;;! flags = [ "-Cpcc=y" ]

(module
  (func (result i32)
    i32.const 0
    i32.load offset=3876003872)
  (memory (;0;) 8164 8167))

</details>

view this post on Zulip Wasmtime GitHub notifications bot (Mar 24 2024 at 17:19):

alexcrichton commented on issue #8225:

It occurs to me that if our guard region is large enough, any load/store with an offset larger than the guard region will also have a minimum address that is larger than the biggest possible memory in most "reasonable" production configurations

This isn't quite true in that wasm memory loads create an effective 33-bit address meaning that with a 4G guard region we statically know everything will hit a guard page and no dynamic bounds checks regardless of the offset are needed.

Notably you can't use an offset larger than the memory size, that's a validation error (IIRC).

we can statically compile a i32.load offset=9bazillion instruction into a trap (with a HeapOutOfBounds trap code so it is semantically unchanged to observers). This would allow PCC to check the result as well.

I'll note that I believe this already happens.

so we get the nice property that our default configuration never generates dynamic bounds checks?

I think that makes sense yeah. The pooling allocator actually already has this configuration but that's purely runtime configuration, it's not compile-time configuration, and this would be a compile-time-config option (that then must be matched at runtime as well)

view this post on Zulip Wasmtime GitHub notifications bot (Mar 24 2024 at 20:23):

cfallin commented on issue #8225:

4G guard region

Ah, the default is a 2GiB guard region (post, and separately 2GiB pre); that's what I'm assuming we need to find a workable PCC configuration for, in order not to impact virtual memory overhead.

I'll note that I believe this already happens.

@alexcrichton please see the test case above, where it doesn't, for a default wasmtime config (2GiB guard region) -- is there some other setting we need to enforce to get this property? In particular I note this case in the seven bounds-check cases, which triggers for static memories where the guard is otherwise not large enough. There is the case here that statically becomes a trap, but that's only if the offset and size exceed the memory size bound, and that bound needs to be 4GiB to elide bounds checks otherwise; so a 3GiB offset doesn't fit. That's precisely the gap I want to reify here: a separate "actual maximum size" configuration that is distinct from "virtual memory size".

view this post on Zulip Wasmtime GitHub notifications bot (Mar 24 2024 at 20:34):

cfallin commented on issue #8225:

Notably you can't use an offset larger than the memory size, that's a validation error (IIRC).

This also seems not to be true, per the test case above: its max memory size is 8167 pages (a little under 512MiB) yet it has an offset of ~3GiB. Regardless, we aren't rejecting modules at compile time that have a max memory size somehow incompatible with our memory config, so we can't rely on that here.

view this post on Zulip Wasmtime GitHub notifications bot (Mar 25 2024 at 14:54):

alexcrichton commented on issue #8225:

Ah yes sorry forgot to mention Wasmtime's default is 2G. I can confirm that with -O static-memory-guard-size=$((1<<32)) the above test case works as-is today. I can also confirm that if the above offset is changed to $((1<<32)) - 4 then it still works, but offsets larger than that actually work with the default 2G region.

I think I was confusing a different optimization we have where when access size + offset is >4G we compile a static trap, but you're right that the maximum size of memory isn't taken into account. That's something we should add! Not that it would "truly" fix this test case though since most memories don't have an upper bound in the real world.

Also sorry for "larger than memory size" I should have clarified I meant the indexing type. The index type here is a 32-bit integer so the offset 1<<32 is rejected by wasm validation. My point being that for 32-bit linear memories we statically know all linear memory offsets are less than 1<<32, so we are guaranteed that everything is a 33-bit address and no more.

Thinking a bit more on this, I think we could perhaps do two different things:

view this post on Zulip Wasmtime GitHub notifications bot (Mar 26 2024 at 17:10):

jameysharp commented on issue #8225:

I'm trying to follow along and I have a lot of questions.

First off, how do I tell which heap settings Wasmtime actually picked?

With default flags, it's producing this CLIF for your test case:

<details><summary>Unoptimized CLIF for the above disas test case</summary>

;; function u0:0(i64 vmctx, i64) -> i32 fast {
;;     gv0 = vmctx
;;     gv1 = load.i64 notrap aligned readonly gv0+8
;;     gv2 = load.i64 notrap aligned gv1
;;     gv3 = vmctx
;;     gv4 = load.i64 notrap aligned readonly checked gv3+80
;;     stack_limit = gv2
;;
;;                                 block0(v0: i64, v1: i64):
;; @0020                               v3 = iconst.i32 0
;; @0022                               v4 = uextend.i64 v3  ; v3 = 0
;; @0022                               v5 = iconst.i64 0x18f8_dfdc
;; @0022                               v6 = icmp ugt v4, v5  ; v5 = 0x18f8_dfdc
;; @0022                               v7 = global_value.i64 gv4
;; @0022                               v8 = iadd v7, v4
;; @0022                               v9 = iconst.i64 0xe707_2020
;; @0022                               v10 = iadd v8, v9  ; v9 = 0xe707_2020
;; @0022                               v11 = iconst.i64 0
;; @0022                               v12 = select_spectre_guard v6, v11, v10  ; v11 = 0
;; @0022                               v13 = load.i32 little heap v12
;; @0029                               jump block1(v13)
;;
;;                                 block1(v2: i32):
;; @0029                               return v2
;; }

</details>

It's checking the index (a constant 0) against 4GB minus the offset and access width. So is this case "3. General case for static memories" with bound=4GB? And this means we're allocating 4GB of virtual memory and leaving the out-of-bounds parts unmapped?

Wasmtime won't generate even unoptimized CLIF for this with -Cpcc=y so I guess the problem is there was some instruction in the unoptimized version that we can't generate a fact for; what was it?

After optimization, the bounds-check constant-folds away:

<details><summary>Optimized CLIF</summary>

;; @0022                               v7 = load.i64 notrap aligned readonly checked v0+80
;; @0022                               v9 = iconst.i64 0xe707_2020
;; @0022                               v10 = iadd v7, v9  ; v9 = 0xe707_2020
;; @0022                               v13 = load.i32 little heap v10

</details>

This seems like it should be easy to verify: it's an add of a constant <4GB to a pointer which I gather should be valid up to 4GB. Perhaps it would be a better test case if the index were an i32 parameter, since I assume your concern is for PCC's ability to verify the general case.

If I add the -Ostatic-memory-maximum-size=535232512 flag, then the same example compiles to an unconditional trap. (That number is 8167 pages, the maximum bound of the memory's limits.) So is that the flag that you wanted, Chris?

view this post on Zulip Wasmtime GitHub notifications bot (Mar 26 2024 at 17:53):

alexcrichton commented on issue #8225:

Ah I think now that heap settings moved out of CLIF you'd need to consult tunables and cross-reference that with flags passed to Wasmtime. (where tunables mostly match Config names but not always)

By default static memories have a max size of 4G (exactly 1<<32) and a 2G guard region (1<<31).

And this means we're allocating 4GB of virtual memory and leaving the out-of-bounds parts unmapped?

Mostly correct yeah. The code generator assumes that 4G (the static memory bound) + 2G (guard region) is allocated and the only mapped pages are those actually in. use by the heap.

I also sort of forget exactly what the static_memory_bound variable is. It selects dynamic/static but I think you're also seeing an interaction that I forgot about (and I suspect at this point you could probably explain it to me as well!)

view this post on Zulip Wasmtime GitHub notifications bot (Mar 26 2024 at 17:59):

cfallin commented on issue #8225:

@jameysharp ah, sure, it was a trivial testcase that was not optimization-proof, but replace the i32.const 0 with a local.get if you like -- the interesting bit is that with a large-enough load offset, we still get a dynamic bounds check, even with a nominally static memory. (Sorry for the confusion!) The bit that is not PCC-able is exactly the dynamic bounds check -- the symbolic reasoning to match up the compare to the select breaks when opts merge more than one access together. The point of the testcase is not exactly literally to trigger a PCC failure but to show that we can still get dynamic bounds checks here.

view this post on Zulip Wasmtime GitHub notifications bot (Mar 26 2024 at 19:44):

jameysharp commented on issue #8225:

I also sort of forget exactly what the static_memory_bound variable is. It selects dynamic/static but I think you're also seeing an interaction that I forgot about (and I suspect at this point you could probably explain it to me as well!)

Reading enough of the implementation to explain this may help me understand what Chris is asking for, so here goes.

Tunables::static_memory_bound is both:

If at compile time we know that a static memory style was selected, and the constant offset plus access width given in the memory access is greater than the static memory bound in bytes, then it doesn't matter what index we add to that offset: The result will always be greater than the current size of the memory—which must be no bigger than the static memory bound—and so the access should always trap.

Chris, what you want is 4GB of guard pages and a static memory bound of reasonable size. For example if an embedder doesn't allow more than 128MB of linear memory, the static memory bound should be set to 128MB, or 2048 pages. The static memory bound will turn overly-large offsets into traps, and for smaller offsets, the 4GB of guard pages will ensure that any i32 index doesn't need a bounds-check.

We could also just trap when the offset is too large for any memory which is declared with a maximum limit, whether it's static or dynamic. If the static memory bound is used as an implied maximum limit when the module doesn't declare one, then that would remove the need for the current "case 1" for static memories. However that doesn't change that we need 4GB of guard pages to avoid bounds-checks when the offset is smaller.

The bit that is not PCC-able is exactly the dynamic bounds check -- the symbolic reasoning to match up the compare to the select breaks when opts merge more than one access together.

I'd still like to understand this better. This error is being reported before any optimization pass runs, so optimizations merging accesses isn't the direct cause, right? Is it that you need a new kind of fact which you then can't define a merge operation for? Which specific instruction do you need this fact for? Is it the select_spectre_guard which could produce either a known-valid pointer or a known-inaccessible one?

view this post on Zulip Wasmtime GitHub notifications bot (Mar 26 2024 at 20:00):

cfallin commented on issue #8225:

I'd still like to understand this better. This error is being reported before any optimization pass runs, so optimizations merging accesses isn't the direct cause, right? Is it that you need a new kind of fact which you then can't define a merge operation for? Which specific instruction do you need this fact for? Is it the select_spectre_guard which could produce either a known-valid pointer or a known-inaccessible one?

Yes, but again, we're a few steps removed from the actual issue in this test case. The point of the test case is to show that dynamic bounds checks still occur; it says nothing about PCC, is not designed to be tested with PCC, has never even heard of PCC, etc etc.

The underlying cause for all of this is that, yes, when optimizations run, multiple dynamic checks get merged together (the same base address is used for multiple accesses because of GVN, and the facts merge), and when symbolic facts merge, we either need to merge symbolic worlds (unscalable) or a consistent merging strategy; and separately, sometimes optimizations around constant addresses are smart and there's no stable thing with an identity to hang the comparison reasoning off if, because constants get absorbed directly into instructions or folded away.

I haven't shown any test case that demonstrates that, because it's deep in fuzzbugs and I really don't have the bandwidth to write it up properly (sorry). You're of course welcome to dig into it; the pcc_memory integration test will trigger the issue if you enable bounds-checking configurations. But it's a deep research question and IMHO the right course is to save it for later (and I will not be able to help further right now).

The point of all this is that we think we can do PCC if we only have implicit (virtual memory)-based bounds checking. So I pushed a recursion level and instantiated a subproblem: let's create a mode where codegen never does explicit bounds checks. Then we don't have to worry about the above. That's this specific issue, and the test case above demonstrates that: we still get dynamic bounds checks even in the default configuration (and even under opts if you replace my silly const-0 with an arg input). Does that make sense?

view this post on Zulip Wasmtime GitHub notifications bot (Mar 26 2024 at 20:11):

jameysharp commented on issue #8225:

Okay, that makes sense.

Does my suggested configuration sound reasonable to you? I don't think we need any new configuration options to do what you want. Just set -Ostatic-memory-maximum-size to e.g. 128MB and -Ostatic-memory-guard-size to 4GB. Then I believe you'll never get bounds checks.

view this post on Zulip Wasmtime GitHub notifications bot (Mar 26 2024 at 20:17):

cfallin commented on issue #8225:

I think so actually -- thanks for digging into this! If I understand right, one way to see it is that we've been "holding it wrong" wrt the two knobs and how they divide the total space: if the embedder has a reasonable per-instance limit significantly below 4GiB then most of the address space should be within the guard, not the possibly-available static memory region.

I think that's sufficient to close the issue actually; when I pick PCC back up I'll plan to fuzz under such a configuration.

view this post on Zulip Wasmtime GitHub notifications bot (Mar 26 2024 at 20:17):

cfallin closed issue #8225:

A fuzzbug recently came in (https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=67568) that has a load with a very large (~3GiB) offset. With our bounds-checking code and default setting of a 2GiB guard region, we handle this today with a dynamic bounds-check. PCC currently cannot handle dynamic bounds-checks. So if we want to (i) deploy PCC in our default configuration, and (ii) correctly handle all legal Wasm inputs, we need to find a way to check such instructions.

It occurs to me that if our guard region is large enough, any load/store with an offset larger than the guard region will also have a minimum address that is larger than the biggest possible memory in most "reasonable" production configurations. If we can guarantee that, and compile with that assumption, we can statically compile a i32.load offset=9bazillion instruction into a trap (with a HeapOutOfBounds trap code so it is semantically unchanged to observers). This would allow PCC to check the result as well.

I'd like to file the issue first to get opinions: would it be reasonable to have a maximum_memory_size configuration option that bounds all memories, such that it is an error to grow larger than that, and then use this during compilation to handle the above case? And then set it so that it is at (or ever so slightly smaller than, by the maximum load/store width) the default guard-region size, so we get the nice property that our default configuration never generates dynamic bounds checks?


Last updated: Nov 22 2024 at 16:03 UTC