I'm struggling to understand the compiler behavior when generating a loop and I definitely need some help here.
Consider the following simple example:
(func (;0;) (type 0)
i32.const -1
i32.const 1
i32.store
i32.const -2
i32.const 2
i32.store
)
It generates rather predictable code on x64 (I'm using Intel notation):
mov esi,DWORD PTR [rip+0x2e] # 38 <_wasm_function_0+0x38>
mov r11,QWORD PTR [rdi+0x48]
mov edi,0x1
mov DWORD PTR [r11+rsi*1+0x0],edi
mov esi,DWORD PTR [rip+0x12] # 30 <_wasm_function_0+0x30>
mov edi,0x2
mov DWORD PTR [r11+rsi*1+0x0],edi
It's a separate question why bother putting negative constants to a constant pool but, apart from that, logic is pretty straightforward: get an offset, get a value, put the value in its place, then do the same thing once more for the other pair of offset/value.
Now I'm putting the same code in a recurring loop:
(func (;0;) (type 0)
loop
i32.const -1
i32.const 1
i32.store
i32.const -2
i32.const 2
i32.store
br 0
end
)
The code generated is very similar but there are some differences:
mov r11d,DWORD PTR [rip+0x2d] # 38 <_wasm_function_0+0x38>
mov rsi,QWORD PTR [rdi+0x48]
mov edi,DWORD PTR [rip+0x1b] # 30 <_wasm_function_0+0x30>
mov eax,0x1
mov DWORD PTR [rsi+r11*1+0x0],eax
mov eax,0x2
mov DWORD PTR [rsi+rdi*1+0x0],eax
jmp 15 <_wasm_function_0+0x15>
Now it is more like "get ALL the offsets first, then get values one by one and put them on their places". If I grow that const/const/store
chain to be long enough to run out of free registers, offsets are spilled to the stack. That is, in the beginning of loop it takes all the offsets from constant pool and loads them into stack frame and then gets them one by one from stack and puts values into memory. Given a chain of 99 const/const/store
s it reserves a stack frame of 800 bytes. And it only happens in a loop body, if a loop is not there, it requires no stack at all no matter how long the chain is.
Any clarification on what's going on here would be appreciated.
oops, sorry I replied to the wrong thread.
Now it is more like "get ALL the offsets first, then get values one by one and put them on their places".
That is probably LICM moving the address calculation out of the loop.
If I grow that const/const/store chain to be long enough to run out of free registers, offsets are spilled to the stack.
I think regalloc doesn't implement rematerialization for the address computation. @Chris Fallin is that right?
yep, I've noticed this inefficiency too; I've addressed it as part of the mid-end optimizer I'm working on now, which does GVN + LICM + remat in a unified way
For reference https://github.com/bytecodealliance/wasmtime/pull/4249
Thanks @bjorn3 and @Chris Fallin for the clarification, it makes perfects sense!
Last updated: Jan 24 2025 at 00:11 UTC