Stream: git-wasmtime

Topic: wasmtime / issue #3251 Spec-interpreter fuzzing: modify t...


view this post on Zulip Wasmtime GitHub notifications bot (Aug 26 2021 at 18:03):

cfallin opened issue #3251:

In #3186 we found that the WebAssembly spec interpreter may not be suitable for high-throughput fuzzing because of certain performance characteristics. While we were able to locally patch one source of quadratic runtime, there are likely others, and the patch was not really a good fit to upstream. We are looking into other peer implementations to fuzz against for general programs (aside from existing differential_wasmi fuzzing).

However, there is another way we could use the reference interpreter: we could validate single instructions' semantics more closely by generating short test cases that just execute an instruction (or a straight-line sequence with no control flow or calls) and return. This would be very valuable for instructions with more complex semantics, such as many of the SIMD instructions.

This is sort of a breadth-vs-depth situation: the "depth" comes from longer-running programs and tests interactions between instructions, and a good oracle there is a more optimized implementation; while the "breadth" comes from exhaustive many-inputs testing of very short sequences and tests mainly that we got the instruction semantics right. The combination seems like it should give pretty good coverage.

cc @abrown @alexcrichton

view this post on Zulip Wasmtime GitHub notifications bot (Sep 03 2021 at 19:10):

alexcrichton labeled issue #3251:

In #3186 we found that the WebAssembly spec interpreter may not be suitable for high-throughput fuzzing because of certain performance characteristics. While we were able to locally patch one source of quadratic runtime, there are likely others, and the patch was not really a good fit to upstream. We are looking into other peer implementations to fuzz against for general programs (aside from existing differential_wasmi fuzzing).

However, there is another way we could use the reference interpreter: we could validate single instructions' semantics more closely by generating short test cases that just execute an instruction (or a straight-line sequence with no control flow or calls) and return. This would be very valuable for instructions with more complex semantics, such as many of the SIMD instructions.

This is sort of a breadth-vs-depth situation: the "depth" comes from longer-running programs and tests interactions between instructions, and a good oracle there is a more optimized implementation; while the "breadth" comes from exhaustive many-inputs testing of very short sequences and tests mainly that we got the instruction semantics right. The combination seems like it should give pretty good coverage.

cc @abrown @alexcrichton

view this post on Zulip Wasmtime GitHub notifications bot (Nov 19 2021 at 16:21):

alexcrichton labeled issue #3251:

In #3186 we found that the WebAssembly spec interpreter may not be suitable for high-throughput fuzzing because of certain performance characteristics. While we were able to locally patch one source of quadratic runtime, there are likely others, and the patch was not really a good fit to upstream. We are looking into other peer implementations to fuzz against for general programs (aside from existing differential_wasmi fuzzing).

However, there is another way we could use the reference interpreter: we could validate single instructions' semantics more closely by generating short test cases that just execute an instruction (or a straight-line sequence with no control flow or calls) and return. This would be very valuable for instructions with more complex semantics, such as many of the SIMD instructions.

This is sort of a breadth-vs-depth situation: the "depth" comes from longer-running programs and tests interactions between instructions, and a good oracle there is a more optimized implementation; while the "breadth" comes from exhaustive many-inputs testing of very short sequences and tests mainly that we got the instruction semantics right. The combination seems like it should give pretty good coverage.

cc @abrown @alexcrichton

view this post on Zulip Wasmtime GitHub notifications bot (Nov 19 2021 at 16:21):

alexcrichton unlabeled issue #3251:

In #3186 we found that the WebAssembly spec interpreter may not be suitable for high-throughput fuzzing because of certain performance characteristics. While we were able to locally patch one source of quadratic runtime, there are likely others, and the patch was not really a good fit to upstream. We are looking into other peer implementations to fuzz against for general programs (aside from existing differential_wasmi fuzzing).

However, there is another way we could use the reference interpreter: we could validate single instructions' semantics more closely by generating short test cases that just execute an instruction (or a straight-line sequence with no control flow or calls) and return. This would be very valuable for instructions with more complex semantics, such as many of the SIMD instructions.

This is sort of a breadth-vs-depth situation: the "depth" comes from longer-running programs and tests interactions between instructions, and a good oracle there is a more optimized implementation; while the "breadth" comes from exhaustive many-inputs testing of very short sequences and tests mainly that we got the instruction semantics right. The combination seems like it should give pretty good coverage.

cc @abrown @alexcrichton

view this post on Zulip Wasmtime GitHub notifications bot (Nov 19 2021 at 16:23):

github-actions[bot] commented on issue #3251:

Subscribe to Label Action

cc @fitzgen

<details>
This issue or pull request has been labeled: "fuzzing"

Thus the following users have been cc'd because of the following labels:

To subscribe or unsubscribe from this label, edit the <code>.github/subscribe-to-label.json</code> configuration file.

Learn more.
</details>

view this post on Zulip Wasmtime GitHub notifications bot (Nov 19 2021 at 18:04):

fitzgen commented on issue #3251:

My intuition is that combining multiple instructions, even if it is just two or three, would give us much more bang for our buck than testing single instructions.

view this post on Zulip Wasmtime GitHub notifications bot (Nov 19 2021 at 18:39):

cfallin commented on issue #3251:

Ah, old issue, paging back in context (I'm not sure why the labeler decided to tag you just now on a discussion from August!)...

Yup, agreed, that's the "or a straight-line sequence with no control flow or calls" above :-)

Adding some more thought just now: IMHO we should do it only up to the max depth of our instruction combining patterns. The point of this suggestion was specifically to target the fuzz throughput in a different way, at the input-value space rather than the input-program space. As we test straightline sequences of n instructions we have O(|wasm opcodes|^n) possibilities, and cut our time spent (on exploring argument values) per individual test program exponentially.

In other words, if my goal is to say that all cases for (say) SIMD instruction X have been covered, then separately fuzzing vector_add(X(x, y, z), [1.0, 1.0, 1.0, 1.0]) and vector_add(X(x, y, z), [2.0, 2.0, 2.0, 2.0]) doesn't really add any coverage. Testing vector_add(X, ...) and vector_mul(X, ...) might, if we have combining patterns that differ for those. But combining instructions may also mask some of the value space, for any operation that isn't bijective (e.g. extract one lane, or multiply by zero, or min/max).

So, said another way, what I think would be useful is a unit-test-inspired approach where we take each individual lowering and then throw a long test-vector of argument values at it. (In other words the Arbitrary type is a (short inst sequence, Vec<ArgValues>).) That gets us "each individual lowering is good" coverage in a way that might hide when we're testing n (for n larger) instructions at a time.

cc @abrown , any interest in looking at this as a way to get the spec interpreter in active fuzzing use?

view this post on Zulip Wasmtime GitHub notifications bot (Nov 23 2021 at 23:49):

abrown commented on issue #3251:

I would eventually like to look into this more but if anyone else has the bandwidth feel free to jump in first (read: I'm not sure when I will get to this).

view this post on Zulip Wasmtime GitHub notifications bot (Jan 05 2022 at 23:58):

abrown commented on issue #3251:

@cfallin, in a separate project I did something using proptest that reminded me of this. The basic idea, adapted to Wasmtime, would be to:

I wasn't too concerned with multiple-instruction sequences and other things mentioned above but if we have a place to put this proptest-guided fuzz test then I would be happy to contribute the code.

view this post on Zulip Wasmtime GitHub notifications bot (Aug 23 2022 at 20:28):

abrown commented on issue #3251:

I think this is closed by #4409 which is now in use by the new differential fuzz target (#4515).

view this post on Zulip Wasmtime GitHub notifications bot (Aug 23 2022 at 20:28):

abrown closed issue #3251:

In #3186 we found that the WebAssembly spec interpreter may not be suitable for high-throughput fuzzing because of certain performance characteristics. While we were able to locally patch one source of quadratic runtime, there are likely others, and the patch was not really a good fit to upstream. We are looking into other peer implementations to fuzz against for general programs (aside from existing differential_wasmi fuzzing).

However, there is another way we could use the reference interpreter: we could validate single instructions' semantics more closely by generating short test cases that just execute an instruction (or a straight-line sequence with no control flow or calls) and return. This would be very valuable for instructions with more complex semantics, such as many of the SIMD instructions.

This is sort of a breadth-vs-depth situation: the "depth" comes from longer-running programs and tests interactions between instructions, and a good oracle there is a more optimized implementation; while the "breadth" comes from exhaustive many-inputs testing of very short sequences and tests mainly that we got the instruction semantics right. The combination seems like it should give pretty good coverage.

cc @abrown @alexcrichton


Last updated: Dec 23 2024 at 12:05 UTC