zhuxiujia edited Issue #2644:
How do I enable JIT compilation in the code under Example?
Hi, I'm trying to use the code in the example to run a JIT-compiled function, but the performance is very slow.
- toml
default = ["jitdump", "wasmtime/wat", "wasmtime/parallel-compilation", "experimental_x64"]
- wat
(module
  (func $sum_f (param $x i32) (param $y i32) (result i32)
    local.get $x
    local.get $y
    i32.add)
  (export "run" (func $sum_f)))
- example/hello.rs
println!("Instantiating module...");
let instance = Instance::new(&store, &module, &[])?;

// Next we poke around a bit to extract the `run` function from the module.
println!("Extracting export...");
let run = instance
    .get_func("run")
    .ok_or(anyhow::format_err!("failed to find `run` function export"))?
    .get2::<i32, i32, i32>()?;

// Time one million calls into the exported function.
let now = std::time::Instant::now();
let total = 1000000;
for _ in 0..total {
    run(1, 1)?;
}
let time = now.elapsed();
println!(
    "use Time: {:?} ,each:{} ns/op",
    &time,
    time.as_nanos() / (total as u128)
);
- cargo run result (this is very slow even though I'm using --release; it should be about 1 ns/op)
cargo run --release --example hello
//use Time: 852.04292ms ,each:852 ns/op
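For comparison, the "1 ns/op" expectation in the report matches what a plain native call would cost. The following is a minimal sketch of such a baseline, not code from the issue: `sum_native`, `time_calls`, and `ns_per_op` are hypothetical names, and `std::hint::black_box` is used on the assumption that the compiler would otherwise optimize the loop away.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Native stand-in for the wasm `run` export: wrapping i32 addition,
/// which is what `i32.add` computes.
fn sum_native(x: i32, y: i32) -> i32 {
    x.wrapping_add(y)
}

/// Calls `f` `iterations` times and returns the total elapsed time.
fn time_calls<F: FnMut() -> i32>(iterations: u32, mut f: F) -> Duration {
    let start = Instant::now();
    for _ in 0..iterations {
        // black_box keeps the result "used" so the call is not removed.
        black_box(f());
    }
    start.elapsed()
}

/// Average cost per call in nanoseconds.
fn ns_per_op(total: Duration, iterations: u32) -> f64 {
    total.as_nanos() as f64 / f64::from(iterations)
}

fn main() {
    let iterations = 1_000_000;
    let elapsed = time_calls(iterations, || sum_native(black_box(1), black_box(1)));
    println!(
        "native baseline: {:?} total, {:.2} ns/op",
        elapsed,
        ns_per_op(elapsed, iterations)
    );
}
```

Running this next to the wasm benchmark on the same machine gives a rough sense of how much of the 852 ns/op is call overhead rather than the addition itself.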
zhuxiujia edited Issue #2644:
Runtime invocation overhead 800ns/op
alexcrichton commented on Issue #2644:
Thanks for the report! Can you clarify what platform you're using?
Entry/exit into wasm isn't entirely trivial because we need to set up infrastructure to catch traps and such. Locally on x86_64 macOS I also get ~700ns overhead, but some time profiling shows that ~80% of that time is spent in setjmp, which is how we implement traps in WebAssembly (using longjmp back to the start). I posted https://github.com/bytecodealliance/wasmtime/pull/2645 which helps there, but there's possibly other low-hanging fruit here too.
In any case it'd be good to see what platform you're running on!
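As a rough worked example of the profile described above, the approximate figures (~700 ns per call, ~80% of it in setjmp) can be split into the two parts. This is only arithmetic on the numbers quoted in the comment; `split_overhead` is a hypothetical helper.

```rust
/// Splits a measured per-call overhead into the part attributed to one
/// hotspot and the remainder, given the hotspot's share in percent.
fn split_overhead(total_ns: u64, hotspot_percent: u64) -> (u64, u64) {
    let hotspot_ns = total_ns * hotspot_percent / 100;
    (hotspot_ns, total_ns - hotspot_ns)
}

fn main() {
    // Approximate figures from the comment: ~700 ns per call, ~80% in setjmp.
    let (setjmp_ns, rest_ns) = split_overhead(700, 80);
    println!("setjmp: ~{} ns, everything else: ~{} ns", setjmp_ns, rest_ns);
}
```

With those inputs, roughly 560 ns of each call would be setjmp and roughly 140 ns everything else.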
zhuxiujia commented on Issue #2644:
Hi:
Locally on x86_64 macOS.
I tried to use WASM to implement an interpreter crate (for example: '1+1' = 2, "'1'+'1'" = "11"), so WASM and the host call each other frequently.
Frequent transitions into and out of WASM take a long time.
zhuxiujia edited a comment on Issue #2644:
Hi:
Locally on x86_64 macOS.
But it's fast on Windows 10.
I tried to use WASM to implement an interpreter crate (for example: '1+1' = 2, "'1'+'1'" = "11"), so WASM and the host call each other frequently.
Frequent transitions into and out of WASM take a long time.
alexcrichton commented on Issue #2644:
Oh great! Then we're running on the same platform :)
Is the 55ns overhead I recorded in #2645 still too large for your use case?
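To put the two per-call figures side by side, here is a small sketch that projects the total time for the one million calls used in the original benchmark. It assumes the 852 ns/op and 55 ns/op figures are comparable, which the thread does not confirm since they were measured by different people; `total_time` is a hypothetical helper.

```rust
use std::time::Duration;

/// Total time spent on call overhead for `calls` invocations
/// at a fixed per-call cost in nanoseconds.
fn total_time(per_call_ns: u64, calls: u64) -> Duration {
    Duration::from_nanos(per_call_ns * calls)
}

fn main() {
    let calls = 1_000_000;
    let before = total_time(852, calls); // figure reported in the issue
    let after = total_time(55, calls); // figure quoted for #2645
    println!("before: {:?}, after: {:?}", before, after);
    println!(
        "ratio: {:.1}x",
        before.as_secs_f64() / after.as_secs_f64()
    );
}
```

Under that assumption the benchmark loop would drop from about 852 ms to about 55 ms.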
zhuxiujia commented on Issue #2644:
Maybe that's why. Anyway, it's on my MacBook.
zhuxiujia edited a comment on Issue #2644:
Maybe that's why. Anyway, it's on my MacBook.
The same issue arose with the Wasmer crate.
zhuxiujia commented on Issue #2644:
Could this have something to do with Cranelift?
alexcrichton commented on Issue #2644:
Sorry, but to clarify: can you benchmark with #2645 applied? Is wasmtime with that patch fast enough for your use case, or is it still too slow?
Also, are you saying that Windows is fast locally for you? If so, what is the overhead you're seeing on Windows?
As for other sources of overhead, the main source seems to be accessing thread locals at this point (after #2645). I don't think Cranelift needs to be improved in any regard here.
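For readers unfamiliar with the term, the sketch below shows what a thread-local access on a call boundary can look like in Rust. It is an illustration only and not wasmtime's actual code: `CALL_DEPTH`, `with_call_depth`, and `current_depth` are hypothetical, standing in for whatever per-thread state a runtime updates when entering and leaving a call.

```rust
use std::cell::Cell;

thread_local! {
    // Hypothetical per-thread state touched on every entry and exit.
    static CALL_DEPTH: Cell<u32> = Cell::new(0);
}

/// Runs `f` with the per-thread depth counter incremented, then restores it.
/// Each call performs two thread-local accesses in addition to `f` itself.
fn with_call_depth<R>(f: impl FnOnce() -> R) -> R {
    CALL_DEPTH.with(|d| d.set(d.get() + 1));
    let result = f();
    CALL_DEPTH.with(|d| d.set(d.get() - 1));
    result
}

/// Reads the current depth for this thread.
fn current_depth() -> u32 {
    CALL_DEPTH.with(|d| d.get())
}

fn main() {
    let inner = with_call_depth(|| current_depth());
    println!("depth inside call: {}, after call: {}", inner, current_depth());
}
```

When the called function is as small as a single addition, bookkeeping of this kind can account for a noticeable share of the per-call cost.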
alexcrichton commented on Issue #2644:
I believe the original issue has been fixed so I'm going to close this.
alexcrichton closed Issue #2644:
Runtime invocation overhead 800ns/op
Last updated: Jan 24 2025 at 00:11 UTC