Stream: git-wasmtime

Topic: wasmtime / issue #4883 Code generated by `wasmtime` doesn...


Wasmtime GitHub notifications bot (Sep 08 2022 at 07:54):

koute opened issue #4883:

The problem

Currently wasmtime/cranelift (unlike e.g. LLVM, which doesn't have this problem AFAIK) doesn't cache-align the loops it generates, which can lead to huge performance regressions if a hot loop accidentally ends up spanning multiple cache lines.

Background

Recently, while updating from wasmtime 0.38 to 0.40, we saw a peculiar performance regression. One of our benchmarks took almost 2x the time to run, and many others took around 45% more time. A huge regression. Ultimately it turned out to be unrelated to the 0.38 -> 0.40 upgrade. We tracked the problem down to memset within the WASM (we're currently not using the bulk memory ops extension) suddenly taking a lot more time to run for no apparent reason. Depending on the exact address at which wasmtime placed the generated code for memset (which is essentially random, although consistent for the same code with the same flags in the same environment), the benchmarks were either slow or fast, and it all boiled down to whether the hot loop of the memset spanned multiple cache lines or not.

You can find a detailed analysis of the problem in this comment and this comment of mine.
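
To illustrate the condition described in the issue, here is a minimal Rust sketch (not taken from wasmtime or Cranelift) that counts how many cache lines a block of machine code touches. The 64-byte line size, the function name `cache_lines_spanned`, and the example addresses in the note below are assumptions for illustration only.

```rust
/// Assumed cache line size in bytes. 64 is common on x86-64,
/// but this is an illustrative assumption, not a universal value.
const CACHE_LINE: u64 = 64;

/// Returns how many cache lines the byte range [start, start + len) touches.
/// Hypothetical helper for illustration; not part of any wasmtime API.
fn cache_lines_spanned(start: u64, len: u64) -> u64 {
    if len == 0 {
        return 0;
    }
    let first_line = start / CACHE_LINE;
    let last_line = (start + len - 1) / CACHE_LINE;
    last_line - first_line + 1
}
```

Under these assumptions, a hypothetical 24-byte loop placed at address 0x1000 stays within one cache line, while the same loop placed at 0x1030 straddles two.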

Wasmtime GitHub notifications bot (Sep 08 2022 at 16:26):

cfallin commented on issue #4883:

Thanks for tracking this down, @koute! Yes, I agree that aligning loop headers to cache-line boundaries makes sense. This should probably be a compile-time option that applies when opts are enabled (debug code is going to be substantially more bloated for other reasons, so we don't want to inflate it further, and it is going to be slow anyway).
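
The alignment suggested here comes down to padding the current code offset up to the next cache-line boundary before a loop header is emitted. A minimal sketch of that arithmetic, assuming a power-of-two alignment; `padding_to_align` is a hypothetical helper, not a Cranelift API:

```rust
/// Returns the number of padding bytes needed so that `offset + padding`
/// is a multiple of `align`. Hypothetical helper for illustration only.
fn padding_to_align(offset: u32, align: u32) -> u32 {
    // Alignments used for code are powers of two (e.g. 16, 32, 64).
    assert!(align.is_power_of_two());
    // The outer `% align` makes an already-aligned offset need 0 bytes
    // instead of a full `align` bytes.
    (align - (offset % align)) % align
}
```

For example, with an assumed 64-byte cache line, a loop header that would otherwise start at offset 48 needs 16 bytes of padding, and one already at offset 64 needs none.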

Wasmtime GitHub notifications bot (Sep 08 2022 at 17:21):

cfallin commented on issue #4883:

This might be a good starter issue for someone to tackle. The main steps I see this taking are:

If no one else wants to take it, I can do this at some point but I thought I would put this out there first!

Wasmtime GitHub notifications bot (Sep 08 2022 at 17:21):

cfallin labeled issue #4883.

Wasmtime GitHub notifications bot (Sep 08 2022 at 17:21):

cfallin labeled issue #4883.

Wasmtime GitHub notifications bot (Sep 12 2022 at 09:47):

akirilov-arm labeled issue #4883.

Wasmtime GitHub notifications bot (Sep 12 2022 at 09:47):

akirilov-arm labeled issue #4883.

Wasmtime GitHub notifications bot (Sep 12 2022 at 09:47):

akirilov-arm labeled issue #4883.


Last updated: Jan 24 2025 at 00:11 UTC