wasmtime / issue #4080 Improve x64 address lowering: cons... · git-wasmtime


Wasmtime GitHub notifications bot (Apr 28 2022 at 04:02):

github-actions[bot] commented on issue #4080:

Subscribe to Label Action

cc @cfallin, @fitzgen

<details>
This issue or pull request has been labeled: "cranelift", "cranelift:area:aarch64", "cranelift:area:machinst", "cranelift:area:x64", "isle"

Thus the following users have been cc'd because of the following labels:

cfallin: isle

fitzgen: isle

To subscribe or unsubscribe from this label, edit the <code>.github/subscribe-to-label.json</code> configuration file.

</details>

Wasmtime GitHub notifications bot (Apr 28 2022 at 05:11):

iximeow commented on issue #4080:

focusing more on the spidermonkey hot bit that you described, the mov/add with 32-bit wrapping behavior would be faithfully implemented by `leal 3(%rsi), %edx`, i think that saves both the instruction and ~3 bytes if you can get the lea as 8d 72 03.

Wasmtime GitHub notifications bot (Apr 28 2022 at 05:43):

cfallin commented on issue #4080:

Funny you say that @iximeow, I was just playing with LEA! A quick hack to use it for every add, and pattern-match adds of the form x+y+const, surprisingly is showing no performance difference (fewer µops maybe offset by the need to hit a more contended execution unit for addr-gen?).

The hottest block in SM

    0xC06B527:  movq %r9,%rax
    0xC06B52A:  addl %esi,%eax
    0xC06B52C:  movq %rdi,%r11
    0xC06B52F:  addl %esi,%r11d
    0xC06B532:  movzbq 0(%r13,%r11),%rdx
    0xC06B538:  movb %dl,0(%r13,%rax)
    0xC06B53D:  movq %rax,%rdx
    0xC06B540:  addl $1, %edx
    0xC06B543:  movq %r11,%r12
    0xC06B546:  addl $1, %r12d
    0xC06B54A:  movzbq 0(%r13,%r12),%r12
    0xC06B550:  movb %r12b,0(%r13,%rdx)
    0xC06B555:  addl $2, %eax
    0xC06B558:  addl $2, %r11d
    0xC06B55C:  movzbq 0(%r13,%r11),%rdx
    0xC06B562:  movb %dl,0(%r13,%rax)
    0xC06B567:  addl $3, %esi
    0xC06B56A:  addl $-3, %r15d
    0xC06B56E:  movq %rbx,%rax
    0xC06B571:  addl %r15d,%eax
    0xC06B574:  cmpl $2, %eax
    0xC06B577:  ja-32 0xC06B527

in an all-LEA world becomes

    0xC06D064:  leal 0(%rsi,%rdx), %r11d
    0xC06D06A:  leal 0(%rax,%rdx), %edi
    0xC06D06F:  movzbq 0(%r13,%rdi),%r12
    0xC06D075:  movb %r12b,0(%r13,%r11)
    0xC06D07A:  leal 1(%r11), %r12d
    0xC06D07F:  leal 1(%rdi), %r14d
    0xC06D084:  movzbq 0(%r13,%r14),%r14
    0xC06D08A:  movb %r14b,0(%r13,%r12)
    0xC06D08F:  leal 2(%r11), %r11d
    0xC06D094:  leal 2(%rdi), %edi
    0xC06D098:  movzbq 0(%r13,%rdi),%rdi
    0xC06D09E:  movb %dil,0(%r13,%r11)
    0xC06D0A3:  leal 3(%rdx), %edi
    0xC06D0A7:  leal -3(%r9), %r11d
    0xC06D0AC:  leal -3(%rbx,%r9), %r9d
    0xC06D0B2:  cmpl $2, %r9d
    0xC06D0B6:  jbe-32 0xC06D0C7

which is at least more _aesthetically pleasing_, if nothing else (and the register allocator is going to be happier with fewer constraints from the non-destructive sources!), but... I'll keep playing.

(Incidentally, since you're an encoding guru: the 32-bit form of LEA needs the 0x67 addr-size override prefix in long mode; is that much of a penalty from a performance perspective?)
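A minimal sketch of the "pattern-match adds of the form x+y+const" idea, assuming a toy expression tree rather than Cranelift's actual IR or ISLE rules. The names `Expr`, `Amode`, and `match_amode` are hypothetical and exist only for illustration. The matcher walks a tree of adds and tries to fold it into one base+index+displacement operand, which is the shape a single `leal -3(%rbx,%r9), %r9d` computes in the listing above.

```rust
/// Toy expression tree: registers (by hardware number), constants, and adds.
#[derive(Debug, Clone, PartialEq)]
pub enum Expr {
    Reg(u8),
    Const(i32),
    Add(Box<Expr>, Box<Expr>),
}

/// Hypothetical addressing mode: optional base, optional index, displacement.
#[derive(Debug, Default, PartialEq)]
pub struct Amode {
    pub base: Option<u8>,
    pub index: Option<u8>,
    pub disp: i32,
}

/// Try to fold a tree of adds into one base+index+disp operand.
/// Returns None if the tree needs more than two registers or the
/// accumulated displacement overflows an i32.
pub fn match_amode(e: &Expr) -> Option<Amode> {
    let mut am = Amode::default();
    collect(e, &mut am)?;
    Some(am)
}

fn collect(e: &Expr, am: &mut Amode) -> Option<()> {
    match e {
        // Constants accumulate into the displacement.
        Expr::Const(c) => {
            am.disp = am.disp.checked_add(*c)?;
            Some(())
        }
        // The first register becomes the base, the second the index.
        Expr::Reg(r) => {
            if am.base.is_none() {
                am.base = Some(*r);
                Some(())
            } else if am.index.is_none() {
                am.index = Some(*r);
                Some(())
            } else {
                None // a third register does not fit in one LEA
            }
        }
        // Adds are flattened recursively.
        Expr::Add(a, b) => {
            collect(a, am)?;
            collect(b, am)
        }
    }
}
```

With `rbx` as register 3 and `r9` as register 9, the tree `(rbx + r9) + (-3)` folds to base 3, index 9, displacement -3. A real lowering would also have to handle scaled indices and decide when folding is profitable, which this sketch ignores.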

Wasmtime GitHub notifications bot (Apr 28 2022 at 05:59):

cfallin commented on issue #4080:

Interesting tidbit: comparing before/after running SpiderMonkey with the LEA change, I see about a 3% reduction in instruction count, but about the same reduction in IPC. I suspect that move elision at the renamer may be making the increased moves of the two-operand traditional add world not matter much. Also, spidermonkey.cwasm in LEA-world is... 0.03% smaller. Weird, I would have expected more...
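The arithmetic behind that observation can be sketched as follows, assuming the usual relation cycles = instructions / IPC and taking both "about 3%" figures at face value. The function name `relative_cycles` is hypothetical.

```rust
/// Ratio of new cycle count to old cycle count, given the ratio of
/// instruction counts and the ratio of IPC (new / old in both cases).
/// Follows from cycles = instructions / IPC.
pub fn relative_cycles(instr_ratio: f64, ipc_ratio: f64) -> f64 {
    instr_ratio / ipc_ratio
}
```

If instructions drop to roughly 0.97 of the original and IPC also drops to roughly 0.97, `relative_cycles(0.97, 0.97)` comes out at 1.0, which is consistent with the earlier report of no measurable performance difference.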

Wasmtime GitHub notifications bot (Apr 28 2022 at 06:04):

iximeow commented on issue #4080:

ah! yeah i think contending for address generation would be a real issue there? on the most recent intel parts it should be less of an issue, but i think generally the add-to-lea translation is probably preferable if it lets you collapse an add or two plus a mov, rather than as a default.

and 0x67, as far as i understand, is "never use if at all possible" territory. but for implementing a wrapping 32-bit add with LEA, do you need that? you can use 64-bit input registers and rely on truncation of the destination to get the right value.
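The truncation argument can be checked with a small model. This is a sketch, not an emulator: `add32` stands in for `addl`, and `lea32_from_64` stands in for an LEA that computes its address at 64 bits from full 64-bit registers and writes only the low 32 bits of the result. Both names are hypothetical.

```rust
/// Models `addl`: a wrapping 32-bit add.
pub fn add32(a: u32, b: u32) -> u32 {
    a.wrapping_add(b)
}

/// Models `leal disp(%base,%index), %dst32` with 64-bit address registers:
/// the sum is formed at 64 bits (wrapping), then truncated to 32 bits.
pub fn lea32_from_64(base: u64, index: u64, disp: i32) -> u32 {
    // The displacement is sign-extended to 64 bits before being added.
    let addr = base
        .wrapping_add(index)
        .wrapping_add(disp as i64 as u64);
    // Truncation keeps only the low 32 bits; upper input bits cannot
    // influence them because carries only propagate upward.
    addr as u32
}
```

Whatever garbage sits in the upper halves of the input registers, the low 32 bits of the 64-bit sum equal the wrapping 32-bit sum of the low halves, so no address-size override is needed to get `addl` semantics for the value. One difference the model does not capture is that LEA leaves the flags untouched while `add` sets them.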

Wasmtime GitHub notifications bot (Apr 28 2022 at 06:21):

cfallin commented on issue #4080:

> ah! yeah i think contending for address generation would be a real issue there? on the most recent intel parts it should be less of an issue, but i think generally the add-to-lea translation is probably preferable if it lets you collapse an add or two plus a mov, rather than as a default.

Yeah, I'll play with some more specific pattern-matching here :-) It's tricky in general to switch fluidly between them because we represent adds in an SSA 3-reg form with a register constraint (reuse src1 for dest) and the regalloc inserts the move, rather than the lowering pattern; so we'd need a way to tell ra2 that "if you whack this instruction in the right way, it transmutes into a slower form but without a constraint". Then work out when to pull that lever vs split a liverange. I'm really really curious how LLVM solves this now (it'll turn a `return arg0 + arg1` function into a single `lea rax, [rdi+rsi]` but use `add` otherwise so there has to be some regalloc integration somehow!).
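One way to picture the "transmute" lever is a choice made once register assignments are known. The sketch below is purely illustrative and assumes a hypothetical post-allocation hook; it is not how regalloc2, Cranelift, or LLVM implement this, and the names `Emit` and `emit_add` are made up. Registers are plain hardware numbers.

```rust
/// Hypothetical machine-instruction choices for `dst = src1 + src2`.
#[derive(Debug, PartialEq)]
pub enum Emit {
    /// Two-operand form: `dst += src`. Requires dst to already hold one input.
    Add { dst: u8, src: u8 },
    /// Three-operand form: `dst = base + index`. No reuse constraint,
    /// but it does not set flags and may use a different execution unit.
    Lea { dst: u8, base: u8, index: u8 },
}

/// Pick a form given the allocated registers for a 3-register SSA add.
pub fn emit_add(dst: u8, src1: u8, src2: u8) -> Emit {
    if dst == src1 {
        // The reuse constraint is satisfied for free.
        Emit::Add { dst, src: src2 }
    } else if dst == src2 {
        // Add is commutative, so the other input can be the reused one.
        Emit::Add { dst, src: src1 }
    } else {
        // Otherwise a mov + add would be needed; a single LEA avoids the mov.
        Emit::Lea { dst, base: src1, index: src2 }
    }
}
```

For the `arg0 + arg1` case with the result in `rax` (0) and inputs in `rdi` (7) and `rsi` (6), `emit_add(0, 7, 6)` picks the LEA form, while `emit_add(7, 7, 6)` stays a plain add. The hard part described above, letting the allocator weigh this against splitting a liverange, is outside what this sketch covers.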

> and 0x67, as far as i understand, is "never use if at all possible" territory. but for implementing a wrapping 32-bit add with LEA, do you need that? you can use 64-bit input registers and rely on truncation of the destination to get the right value.

You're right, I was misled by nasm here... writing `lea eax, [ebx+ecx]` in long mode gets a 0x67 prefix but `lea eax, [rbx+rcx]` does the exact same thing. D'oh.

Last updated: Apr 18 2025 at 18:04 UTC