jeffcharles opened issue #13355:
Feature
Would it be possible to have Wasmtime use a faster implementation for hashmaps and hashsets?
Benefit
When benchmarking Wasmtime 42.0.2 against Wasmtime 41.0.4 for some of our workloads, I noticed a ~14% regression in wall-clock time and estimated cycles. When I update crates/environ/src/collections.rs to use hashbrown instead of the standard library hashmap and hashset, by undoing part of #12509, that regression drops to ~7%. Our use case for Wasmtime is in a latency-sensitive environment.
Implementation
Naively, would it be possible/advisable to use hashbrown or fxhash as the hashmap and hashset implementations when std is enabled, until collections that are able to handle OOMs are in place? They perform faster than the standard library hashmaps and hashsets, and I don't think we need the cryptographic security in the standard library's implementations.
Alternatives
Maybe a configurable hash implementation? We're willing to tolerate OOMs hypothetically causing a process abort if it reduces the amount of the performance regression we're seeing.
alexcrichton commented on issue #13355:
Would you be able to share and/or assemble some workloads that regressed? Changing hash maps and algorithms is totally on the table and reasonable to do, but depending on where the slowdown is coming from we might be able to remove the hash maps entirely and/or get some other larger win.
bjorn3 commented on issue #13355:
FWIW libstd's hashmap already uses hashbrown internally. The performance difference is almost certainly caused by libstd using a HashDoS-resistant hasher by default. You can override the hasher for libstd's hashmap too.
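As a rough illustration of that last point: the standard library's HashMap takes its hasher as a third type parameter, so a faster hasher without DoS resistance can be plugged in while keeping the same map type. The sketch below uses a hand-written FNV-1a hasher only to stay dependency-free; the names Fnv1a, FastMap, and build_map are hypothetical, and nothing here is taken from Wasmtime's code.

```rust
use std::collections::HashMap;
use std::hash::{BuildHasherDefault, Hasher};

/// Hypothetical minimal FNV-1a hasher (64-bit), for illustration only.
/// It is fast for short keys but offers no HashDoS resistance.
#[derive(Clone, Copy)]
struct Fnv1a(u64);

impl Default for Fnv1a {
    fn default() -> Self {
        // FNV-1a 64-bit offset basis.
        Fnv1a(0xcbf29ce484222325)
    }
}

impl Hasher for Fnv1a {
    fn write(&mut self, bytes: &[u8]) {
        for &b in bytes {
            self.0 ^= u64::from(b);
            // FNV 64-bit prime.
            self.0 = self.0.wrapping_mul(0x100000001b3);
        }
    }

    fn finish(&self) -> u64 {
        self.0
    }
}

/// Same `std` map type, different hasher.
type FastMap<K, V> = HashMap<K, V, BuildHasherDefault<Fnv1a>>;

/// Hypothetical helper that fills a map keyed by import-like names.
fn build_map() -> FastMap<String, u32> {
    // `HashMap::new()` is only available for the default hasher,
    // so maps with a custom hasher are created via `default()`.
    let mut map = FastMap::default();
    map.insert("memory".to_string(), 1);
    map.insert("table".to_string(), 2);
    map
}
```

Whether this is appropriate depends on whether the keys can be influenced by untrusted input, which is the trade-off discussed later in the thread.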
jeffcharles commented on issue #13355:
> Would you be able to share and/or assemble some workloads that regressed?
Take a look at https://github.com/jeffcharles/wasmtime-42-perf-analysis. The repo includes a ./run-callgrind.sh script to run a callgrind benchmark in an x86 OCI container. The main branch uses Wasmtime 41, the wasmtime-42 branch uses Wasmtime 42, and the wasmtime-42-hashbrown branch uses a fork of Wasmtime 42 that uses hashbrown instead.
The results when comparing main (Wasmtime 41) and wasmtime-42:
  Instructions:      157275|136774 (+14.9890%) [+1.14989x]
  L1 Hits:           205114|181013 (+13.3145%) [+1.13315x]
  LL Hits:           2600|2913 (-10.7449%) [-1.12038x]
  RAM Hits:          2238|2304 (-2.86458%) [-1.02949x]
  Total read+write:  209952|186230 (+12.7380%) [+1.12738x]
  Estimated Cycles:  296444|276218 (+7.32248%) [+1.07322x]
The results when comparing main (Wasmtime 41) and wasmtime-42-hashbrown:
  Instructions:      128474|135948 (-5.49769%) [-1.05818x]
  L1 Hits:           170470|179969 (-5.27813%) [-1.05572x]
  LL Hits:           2509|2836 (-11.5303%) [-1.13033x]
  RAM Hits:          2272|2279 (-0.30715%) [-1.00308x]
  Total read+write:  175251|185084 (-5.31272%) [-1.05611x]
  Estimated Cycles:  262535|273914 (-4.15422%) [-1.04334x]
The code being benchmarked is in raw_run_module.
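For readers checking the figures, the percentages and bracketed factors appear to be derived from each pair of counts read as new|old (the branch under test on the left, the main baseline on the right). That reading is an inference from the numbers themselves, not something stated in the thread, and the function names below are hypothetical.

```rust
/// Relative change of `new` versus `old`, in percent.
fn percent_change(new: u64, old: u64) -> f64 {
    (new as f64 - old as f64) / old as f64 * 100.0
}

/// Signed factor as it appears in brackets in the results:
/// `new / old` when the count went up, `-(old / new)` when it went down.
fn signed_factor(new: u64, old: u64) -> f64 {
    if new >= old {
        new as f64 / old as f64
    } else {
        -(old as f64 / new as f64)
    }
}
```

For example, the instruction counts 157275 and 136774 give (157275 - 136774) / 136774, which is roughly 14.989%, matching the first row.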
alexcrichton commented on issue #13355:
Thanks! Without going too deep down the docker/callgrind hole, I lightly edited it to just be raw criterion and I'm showing a 4% regression in wall time from Wasmtime 41 to Wasmtime 42. A samply-based profile looks like this.
From this it looks like the main hash map related location is the Linker, and that's what would in theory need to change. There are a few things about this worth pointing out:
- We generally consider the Linker a create-once primitive where it's not designed for clone/insertion to be on the hot path, as it is here. The natural implementation of this use case sort of requires it though due to this leveraging runtime linking of one instance to another. This is something I'd consider a bit of a gap in Wasmtime's embedder API where you're unable to leverage InstancePre, for example, without further changes. The absolute ideal performance here will come about if you're able to link these modules together statically and have that compiled by Wasmtime. That way you'd be able to avoid hash maps entirely on the hot path.
- Using a faster hash algorithm here is a bit tricky because the Linker is sometimes user-controlled and sometimes host-controlled. It's neither obvious that a DoS-resistant hash is needed nor that it's specifically not needed. One theoretical option here would be that we could add a type parameter to Linker to allow embeddings to control this, and you'd be able to configure it in this use case to something faster.
So, on one hand, yes, I think we could either just switch Linker or expose a type parameter to use a non-DoS-resistant hash. I haven't tested locally the perf impact of that, however. On the other hand, though, if you're interested in the fastest possible execution time it'll be side-stepping this entirely. The "easiest" option would be to use walrus or something similar to combine the two core wasm modules here into one. You'd resolve imports of one to another and the final module would only have the resulting imports. This would be a relatively invasive change, however, and I understand if you don't have appetite for such a change.
Nevertheless I wanted to at least write this all down. I wasn't able to repro a 7% or a 14% regression, but that could just be a difference in hardware perhaps.
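To make the type-parameter idea concrete, here is a minimal sketch of a linker-like table whose hasher is chosen by the embedder and defaults to the standard DoS-resistant one. NameTable and FixedKeyState are hypothetical names; this is not Wasmtime's Linker and does not reflect how such a parameter would actually be exposed.

```rust
use std::collections::hash_map::{DefaultHasher, RandomState};
use std::collections::HashMap;
use std::hash::{BuildHasher, BuildHasherDefault};

/// Hypothetical stand-in for a linker-like name table; not Wasmtime's `Linker`.
/// `S` defaults to the randomly seeded, DoS-resistant `RandomState`.
struct NameTable<V, S = RandomState> {
    items: HashMap<String, V, S>,
}

impl<V, S: BuildHasher + Default> NameTable<V, S> {
    fn new() -> Self {
        NameTable { items: HashMap::default() }
    }

    /// Registers `value` under the combined `module::name` key.
    fn define(&mut self, module: &str, name: &str, value: V) {
        self.items.insert(format!("{}::{}", module, name), value);
    }

    /// Looks up a previously defined item.
    fn get(&self, module: &str, name: &str) -> Option<&V> {
        self.items.get(&format!("{}::{}", module, name))
    }
}

/// A hasher state from `std` that is not randomly seeded, used here only as a
/// placeholder for whatever faster hasher an embedder might choose.
type FixedKeyState = BuildHasherDefault<DefaultHasher>;
```

An embedder that trusts all names would write NameTable<u32, FixedKeyState>, while NameTable<u32> keeps the default behavior, so existing callers would be unaffected by the extra parameter.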
jeffcharles commented on issue #13355:
Thanks for writing that up! Having an API like InstancePre except stateless (that is, it would define hashmap entries with stubbed values for the imports that would get replaced at instantiation time) would likely help without having to give up dynamic linking. But I can understand a reluctance to add something like this just for us. And point taken on us having an option to statically link with an additional ahead-of-time transformation of the guest Wasm.
And yes, I did notice a difference in the performance change between x86 and AArch64. I tried to use callgrind on x86 to ensure some degree of consistency with the numbers.
jeffcharles commented on issue #13355:
Thinking about this a little more, we require some approach that can enable dynamic linking between fresh instances in the hot path. We have extremely aggressive upper limits on the size of final Wasm modules to minimize memory use and minimize latency fetching them so statically linking them to dynamic libraries ahead-of-time is not feasible.
alexcrichton commented on issue #13355:
Perhaps the fastest option for you in the meantime would be to invoke Instance::new directly? You could precompute ahead of time what exports to extract and pass in to various places (e.g. via introspection, similar to what Linker does). That still won't be the fastest path since it'll re-type-check everything on all instantiations, but it'll avoid needing to clone a Linker and/or manage items within it, bypassing hash maps entirely. Would that be feasible for you?
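The precomputation described here can be sketched without any Wasmtime types: resolve names to positions once, then build the ordered import list by index on every instantiation. Provider and ImportPlan are hypothetical stand-ins, and the u32 values are placeholders; in a real embedding they would be exports taken from the first instance, and the resulting list would be what gets passed to Instance::new.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for an instantiated provider module: export values
/// stored in a fixed order, with their names kept alongside.
struct Provider {
    export_names: Vec<&'static str>,
    export_values: Vec<u32>,
}

/// Precomputed once per (consumer, provider) pair: for each import of the
/// consumer, in import order, the position of the matching provider export.
struct ImportPlan {
    export_slots: Vec<usize>,
}

impl ImportPlan {
    /// Cold path: all name lookups (and the only hash map) happen here.
    /// Returns `None` if any import has no matching export.
    fn new(consumer_imports: &[&str], provider_export_names: &[&str]) -> Option<Self> {
        let by_name: HashMap<&str, usize> = provider_export_names
            .iter()
            .enumerate()
            .map(|(slot, name)| (*name, slot))
            .collect();
        let export_slots = consumer_imports
            .iter()
            .map(|name| by_name.get(name).copied())
            .collect::<Option<Vec<_>>>()?;
        Some(ImportPlan { export_slots })
    }

    /// Hot path: builds the ordered import list by index, with no hashing.
    fn resolve(&self, provider: &Provider) -> Vec<u32> {
        self.export_slots
            .iter()
            .map(|&slot| provider.export_values[slot])
            .collect()
    }
}
```

The plan is only valid as long as the provider's export order is stable across instances, which holds when every fresh instance comes from the same compiled module.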
jeffcharles commented on issue #13355:
Thank you for the suggestion! I can explore that and see if that works for us.
Last updated: Jun 01 2026 at 09:49 UTC