wasmtime / issue #3733 Implement lazy funcref table and a... · git-wasmtime

Sure; here's a benchmarking run using your updated instantiation benchmark. I include baselines for default and pooling, but my local changes don't affect default (the mmap allocator) -- in pooling.rs, I remove the call to decommit_instance_pages and pass false to prezeroed on initialize_vmcontext instead, and that's it.

================================================================================
With madvise() to flash-zero Instance/VMContext:
================================================================================

sequential/default/empty.wat
                        time:   [1.1723 us 1.1752 us 1.1778 us]
sequential/pooling/empty.wat
                        time:   [2.4567 us 2.4608 us 2.4652 us]

parallel/default/empty.wat: with 1 background thread
                        time:   [1.1596 us 1.1628 us 1.1666 us]
parallel/default/empty.wat: with 16 background threads
                        time:   [20.918 us 21.125 us 21.307 us]
parallel/pooling/empty.wat: with 1 background thread
                        time:   [2.4203 us 2.4224 us 2.4249 us]
parallel/pooling/empty.wat: with 16 background threads
                        time:   [25.422 us 25.860 us 26.295 us]

sequential/default/small_memory.wat
                        time:   [5.5100 us 5.5230 us 5.5361 us]
sequential/pooling/small_memory.wat
                        time:   [3.0336 us 3.0356 us 3.0376 us]

parallel/default/small_memory.wat: with 1 background thread
                        time:   [5.3376 us 5.3488 us 5.3614 us]
parallel/default/small_memory.wat: with 16 background threads
                        time:   [95.562 us 96.023 us 96.485 us]
parallel/pooling/small_memory.wat: with 1 background thread
                        time:   [3.0222 us 3.0270 us 3.0316 us]
parallel/pooling/small_memory.wat: with 16 background threads
                        time:   [172.62 us 175.12 us 177.54 us]

sequential/default/data_segments.wat
                        time:   [6.3065 us 6.3094 us 6.3120 us]
sequential/pooling/data_segments.wat
                        time:   [2.8083 us 2.8116 us 2.8147 us]

parallel/default/data_segments.wat: with 1 background thread
                        time:   [6.0155 us 6.0290 us 6.0407 us]
parallel/default/data_segments.wat: with 16 background threads
                        time:   [141.99 us 142.58 us 143.27 us]
parallel/pooling/data_segments.wat: with 1 background thread
                        time:   [2.7353 us 2.7533 us 2.7687 us]
parallel/pooling/data_segments.wat: with 16 background threads
                        time:   [40.435 us 41.035 us 41.644 us]

sequential/default/wasi.wasm
                        time:   [6.6997 us 6.7118 us 6.7264 us]
sequential/pooling/wasi.wasm
                        time:   [3.7074 us 3.7099 us 3.7129 us]

parallel/default/wasi.wasm: with 1 background thread
                        time:   [6.7795 us 6.7964 us 6.8131 us]
parallel/default/wasi.wasm: with 16 background threads
                        time:   [154.73 us 155.36 us 156.03 us]
parallel/pooling/wasi.wasm: with 1 background thread
                        time:   [3.8161 us 3.8195 us 3.8231 us]
parallel/pooling/wasi.wasm: with 16 background threads
                        time:   [60.806 us 61.850 us 62.927 us]

sequential/default/spidermonkey.wasm
                        time:   [15.974 us 15.983 us 15.993 us]
sequential/pooling/spidermonkey.wasm
                        time:   [6.0185 us 6.0215 us 6.0248 us]

parallel/default/spidermonkey.wasm: with 1 background thread
                        time:   [16.189 us 16.201 us 16.215 us]
parallel/default/spidermonkey.wasm: with 16 background threads
                        time:   [165.91 us 167.16 us 168.51 us]
parallel/pooling/spidermonkey.wasm: with 1 background thread
                        time:   [5.9293 us 5.9348 us 5.9403 us]
parallel/pooling/spidermonkey.wasm: with 16 background threads
                        time:   [55.862 us 57.049 us 58.373 us]

================================================================================
With explicit memset:
  ("default"-policy was not changed, so excluding)
================================================================================

sequential/pooling/empty.wat
                        time:   [1.2088 us 1.2111 us 1.2136 us]
                        change: [-51.445% -51.270% -51.113%] (p = 0.00 < 0.05)
                        Performance has improved.

parallel/pooling/empty.wat: with 1 background thread
                        time:   [1.1838 us 1.1853 us 1.1869 us]
                        change: [-50.952% -50.778% -50.613%] (p = 0.00 < 0.05)
                        Performance has improved.
parallel/pooling/empty.wat: with 16 background threads
                        time:   [21.178 us 21.353 us 21.515 us]
                        change: [-18.533% -17.405% -16.183%] (p = 0.00 < 0.05)
                        Performance has improved.

sequential/pooling/small_memory.wat
                        time:   [1.7699 us 1.7719 us 1.7740 us]
                        change: [-41.479% -41.392% -41.293%] (p = 0.00 < 0.05)
                        Performance has improved.

parallel/pooling/small_memory.wat: with 1 background thread
                        time:   [1.7915 us 1.7930 us 1.7945 us]
                        change: [-40.593% -40.487% -40.383%] (p = 0.00 < 0.05)
                        Performance has improved.
parallel/pooling/small_memory.wat: with 16 background threads
                        time:   [65.378 us 66.775 us 68.080 us]
                        change: [-62.639% -61.536% -60.394%] (p = 0.00 < 0.05)
                        Performance has improved.

sequential/pooling/data_segments.wat
                        time:   [1.5276 us 1.5288 us 1.5301 us]
                        change: [-45.622% -45.551% -45.475%] (p = 0.00 < 0.05)
                        Performance has improved.

parallel/pooling/data_segments.wat: with 1 background thread
                        time:   [1.5664 us 1.5716 us 1.5776 us]
                        change: [-42.192% -41.806% -41.407%] (p = 0.00 < 0.05)
                        Performance has improved.
parallel/pooling/data_segments.wat: with 16 background threads
                        time:   [30.378 us 30.951 us 31.554 us]
                        change: [-24.975% -23.479% -21.963%] (p = 0.00 < 0.05)
                        Performance has improved.

sequential/pooling/wasi.wasm
                        time:   [2.0274 us 2.0401 us 2.0519 us]
                        change: [-46.488% -46.206% -45.904%] (p = 0.00 < 0.05)
                        Performance has improved.

parallel/pooling/wasi.wasm: with 1 background thread
                        time:   [1.9543 us 1.9559 us 1.9579 us]
                        change: [-48.881% -48.790% -48.700%] (p = 0.00 < 0.05)
                        Performance has improved.
parallel/pooling/wasi.wasm: with 16 background threads
                        time:   [49.629 us 50.532 us 51.431 us]
                        change: [-21.128% -19.627% -18.124%] (p = 0.00 < 0.05)
                        Performance has improved.

sequential/pooling/spidermonkey.wasm
                        time:   [12.469 us 12.556 us 12.635 us]
                        change: [+108.02% +109.03% +110.10%] (p = 0.00 < 0.05)
                        Performance has regressed.

parallel/pooling/spidermonkey.wasm: with 1 background thread
                        time:   [12.523 us 12.562 us 12.595 us]
                        change: [+109.29% +110.20% +111.03%] (p = 0.00 < 0.05)
                        Performance has regressed.
parallel/pooling/spidermonkey.wasm: with 16 background threads
                        time:   [254.13 us 277.81 us 304.77 us]
                        change: [+410.73% +449.63% +496.28%] (p = 0.00 < 0.05)
                        Performance has regressed.

So, switching to an explicit memset, we see performance improvements in all cases except the large SpiderMonkey module (31k functions in my local build), where using explicit memset is 5.49x slower (+449%).

Either way, it's clear to me that we'll feel some pain on the low end (madvise) or high end (memset), so the ultimate design probably has to be to incorporate this flash-clearing into an madvise we're already doing (single-madvise idea). I can implement that as soon as we confirm that we want to remove uffd. I guess the question is just which we settle on in the meantime :-)

threads	memset	madvise
1	691ns	234ns
4	752ns	7995ns
8	766ns	15991ns

threads	madvise	memset
1	16us	8us
2	113us	15us
3	174us	40us
4	234us	44us

threads	madvise	memset
1	136ns	470ns
2	1.16µs	582ns
4	3.79µs	625ns
8	6µs	646ns

threads	madvise	memset
1	363ns	10.635µs
2	1.724µs	11.438µs
4	3.857µs	11.694µs
8	5.554µs	12.308µs

threads	madvise	memset
1	136ns	470ns
2	1.16µs	582ns
4	3.79µs	625ns
8	6µs	646ns

Stream: git-wasmtime

Topic: wasmtime / issue #3733 Implement lazy funcref table and a...

Wasmtime GitHub notifications bot (Feb 03 2022 at 07:58):

Wasmtime GitHub notifications bot (Feb 03 2022 at 17:45):

Wasmtime GitHub notifications bot (Feb 04 2022 at 02:56):

Wasmtime GitHub notifications bot (Feb 05 2022 at 01:34):

Wasmtime GitHub notifications bot (Feb 05 2022 at 02:49):

Wasmtime GitHub notifications bot (Feb 05 2022 at 03:57):

Wasmtime GitHub notifications bot (Feb 07 2022 at 19:41):

Wasmtime GitHub notifications bot (Feb 07 2022 at 21:47):

Wasmtime GitHub notifications bot (Feb 07 2022 at 23:32):

Wasmtime GitHub notifications bot (Feb 07 2022 at 23:54):

Wasmtime GitHub notifications bot (Feb 08 2022 at 01:04):

Wasmtime GitHub notifications bot (Feb 08 2022 at 04:04):

Wasmtime GitHub notifications bot (Feb 08 2022 at 04:48):

64KiB zeroing:

1 MiB zeroing:

Wasmtime GitHub notifications bot (Feb 08 2022 at 07:52):

64KiB zeroing:

1 MiB zeroing:

Wasmtime GitHub notifications bot (Feb 08 2022 at 15:34):

Wasmtime GitHub notifications bot (Feb 08 2022 at 16:59):

Wasmtime GitHub notifications bot (Feb 09 2022 at 01:13):

Wasmtime GitHub notifications bot (Feb 09 2022 at 05:59):

Wasmtime GitHub notifications bot (Feb 09 2022 at 21:56):