Stream: wasmtime

Topic: constant loading in a loop


view this post on Zulip Dmitry Sinyavin (Aug 15 2022 at 16:34):

I'm struggling to understand the compiler behavior when generating a loop and I definitely need some help here.

Consider the following simple example:

  (func (;0;) (type 0)
    i32.const -1
    i32.const 1
    i32.store
    i32.const -2
    i32.const 2
    i32.store
  )

It generates rather predictable code on x64 (I'm using Intel notation):

mov    esi,DWORD PTR [rip+0x2e]        # 38 <_wasm_function_0+0x38>
mov    r11,QWORD PTR [rdi+0x48]
mov    edi,0x1
mov    DWORD PTR [r11+rsi*1+0x0],edi
mov    esi,DWORD PTR [rip+0x12]        # 30 <_wasm_function_0+0x30>
mov    edi,0x2
mov    DWORD PTR [r11+rsi*1+0x0],edi

It's a separate question why bother putting negative constants to a constant pool but, apart from that, logic is pretty straightforward: get an offset, get a value, put the value in its place, then do the same thing once more for the other pair of offset/value.

Now I'm putting the same code in a recurring loop:

  (func (;0;) (type 0)
    loop
      i32.const -1
      i32.const 1
      i32.store
      i32.const -2
      i32.const 2
      i32.store
      br 0
    end
  )

The code generated is very similar but there are some differences:

mov    r11d,DWORD PTR [rip+0x2d]        # 38 <_wasm_function_0+0x38>
mov    rsi,QWORD PTR [rdi+0x48]
mov    edi,DWORD PTR [rip+0x1b]        # 30 <_wasm_function_0+0x30>
mov    eax,0x1
mov    DWORD PTR [rsi+r11*1+0x0],eax
mov    eax,0x2
mov    DWORD PTR [rsi+rdi*1+0x0],eax
jmp    15 <_wasm_function_0+0x15>

Now it is more like "get ALL the offsets first, then get values one by one and put them on their places". If I grow that const/const/store chain to be long enough to run out of free registers, offsets are spilled to the stack. That is, in the beginning of loop it takes all the offsets from constant pool and loads them into stack frame and then gets them one by one from stack and puts values into memory. Given a chain of 99 const/const/stores it reserves a stack frame of 800 bytes. And it only happens in a loop body, if a loop is not there, it requires no stack at all no matter how long the chain is.

Any clarification on what's going on here would be appreciated.

view this post on Zulip Mossaka (Joe) (Aug 15 2022 at 16:36):

oops, sorry I replied to the wrong thread.

view this post on Zulip bjorn3 (Aug 15 2022 at 16:40):

Now it is more like "get ALL the offsets first, then get values one by one and put them on their places".

That is probably LICM moving the address calculation out of the loop.

If I grow that const/const/store chain to be long enough to run out of free registers, offsets are spilled to the stack.

I think regalloc doesn't implement rematerialization for the address computation. @Chris Fallin is that right?

view this post on Zulip Chris Fallin (Aug 15 2022 at 16:44):

yep, I've noticed this inefficiency too; I've addressed it as part of the mid-end optimizer I'm working on now, which does GVN + LICM + remat in a unified way

view this post on Zulip bjorn3 (Aug 15 2022 at 16:45):

For reference https://github.com/bytecodealliance/wasmtime/pull/4249

This is a work-in-progress, and meant to sketch the direction I've been thinking in for a mid-end framework. A proper BA RFC will come soon. This PR builds a phase in the optimization pipeline that...

view this post on Zulip Dmitry Sinyavin (Aug 15 2022 at 16:53):

Thanks @bjorn3 and @Chris Fallin for the clarification, it makes perfects sense!


Last updated: Nov 22 2024 at 16:03 UTC