github-actions[bot] commented on issue #4080:
Subscribe to Label Action
cc @cfallin, @fitzgen
<details>
This issue or pull request has been labeled: "cranelift", "cranelift:area:aarch64", "cranelift:area:machinst", "cranelift:area:x64", "isle"
Thus the following users have been cc'd because of the following labels:
- cfallin: isle
- fitzgen: isle
To subscribe or unsubscribe from this label, edit the <code>.github/subscribe-to-label.json</code> configuration file.
</details>
iximeow commented on issue #4080:
focusing more on the spidermonkey hot bit that you described, the mov/add with 32-bit wrapping behavior would be faithfully implemented by `leal 3(%rsi), %edx`, i think that saves both the instruction and ~3 bytes if you can get the `lea` as `8d 72 03`..
cfallin commented on issue #4080:
Funny you say that @iximeow, I was just playing with LEA! A quick hack to use it for every add, and pattern-match adds of the form x+y+const, surprisingly is showing no performance difference (fewer µops maybe offset by the need to hit a more contended execution unit for addr-gen?).
The hottest block in SM
    0xC06B527: movq %r9,%rax
    0xC06B52A: addl %esi,%eax
    0xC06B52C: movq %rdi,%r11
    0xC06B52F: addl %esi,%r11d
    0xC06B532: movzbq 0(%r13,%r11),%rdx
    0xC06B538: movb %dl,0(%r13,%rax)
    0xC06B53D: movq %rax,%rdx
    0xC06B540: addl $1, %edx
    0xC06B543: movq %r11,%r12
    0xC06B546: addl $1, %r12d
    0xC06B54A: movzbq 0(%r13,%r12),%r12
    0xC06B550: movb %r12b,0(%r13,%rdx)
    0xC06B555: addl $2, %eax
    0xC06B558: addl $2, %r11d
    0xC06B55C: movzbq 0(%r13,%r11),%rdx
    0xC06B562: movb %dl,0(%r13,%rax)
    0xC06B567: addl $3, %esi
    0xC06B56A: addl $-3, %r15d
    0xC06B56E: movq %rbx,%rax
    0xC06B571: addl %r15d,%eax
    0xC06B574: cmpl $2, %eax
    0xC06B577: ja-32 0xC06B527
in an all-LEA world becomes
    0xC06D064: leal 0(%rsi,%rdx), %r11d
    0xC06D06A: leal 0(%rax,%rdx), %edi
    0xC06D06F: movzbq 0(%r13,%rdi),%r12
    0xC06D075: movb %r12b,0(%r13,%r11)
    0xC06D07A: leal 1(%r11), %r12d
    0xC06D07F: leal 1(%rdi), %r14d
    0xC06D084: movzbq 0(%r13,%r14),%r14
    0xC06D08A: movb %r14b,0(%r13,%r12)
    0xC06D08F: leal 2(%r11), %r11d
    0xC06D094: leal 2(%rdi), %edi
    0xC06D098: movzbq 0(%r13,%rdi),%rdi
    0xC06D09E: movb %dil,0(%r13,%r11)
    0xC06D0A3: leal 3(%rdx), %edi
    0xC06D0A7: leal -3(%r9), %r11d
    0xC06D0AC: leal -3(%rbx,%r9), %r9d
    0xC06D0B2: cmpl $2, %r9d
    0xC06D0B6: jbe-32 0xC06D0C7
which is at least more _aesthetically pleasing_, if nothing else (and the register allocator is going to be happier with fewer constraints from the non-destructive sources!), but... I'll keep playing.
(Incidentally, since you're an encoding guru: the 32-bit form of LEA needs the 0x67 addr-size override prefix in long mode; is that much of a penalty from a performance perspective?)
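The "pattern-match adds of the form x+y+const" idea can be sketched on a toy expression tree. This is only an illustration: `Expr`, `Lowered`, and `lower_add` are hypothetical names invented here, not Cranelift or ISLE types, and registers are plain hardware numbers (rbx = 3, r9 = 9). The sketch also ignores real encoding restrictions such as rsp not being usable as an index register.

```rust
/// Hypothetical mini-IR for illustration only (not Cranelift's IR).
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Reg(u8),
    Const(i32),
    Add(Box<Expr>, Box<Expr>),
}

/// Hypothetical lowering result.
#[derive(Debug, PartialEq)]
enum Lowered {
    /// Corresponds to `lea disp(base, index), dst`.
    Lea { base: u8, index: Option<u8>, disp: i32 },
    /// Any shape this sketch does not recognize.
    Other,
}

/// Fold an add tree into a single LEA when it has one of the shapes
/// `(x + y) + const`, `x + const`, or `x + y`.
fn lower_add(e: &Expr) -> Lowered {
    use Expr::*;
    match e {
        Add(lhs, rhs) => match (&**lhs, &**rhs) {
            // (x + y) + const  =>  lea const(x, y)
            (Add(a, b), Const(c)) => match (&**a, &**b) {
                (Reg(x), Reg(y)) => Lowered::Lea { base: *x, index: Some(*y), disp: *c },
                _ => Lowered::Other,
            },
            // x + const  =>  lea const(x)
            (Reg(x), Const(c)) => Lowered::Lea { base: *x, index: None, disp: *c },
            // x + y  =>  lea 0(x, y)
            (Reg(x), Reg(y)) => Lowered::Lea { base: *x, index: Some(*y), disp: 0 },
            _ => Lowered::Other,
        },
        _ => Lowered::Other,
    }
}
```

Feeding it `(rbx + r9) + (-3)` yields the same shape as the `leal -3(%rbx,%r9), %r9d` instruction in the all-LEA listing above.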
cfallin commented on issue #4080:
Interesting tidbit: comparing before/after running SpiderMonkey with the LEA change, I see about a 3% reduction in instruction count, but about the same reduction in IPC. I suspect that move elision at the renamer may be making the increased moves of the two-operand traditional `add` world not matter much. Also, spidermonkey.cwasm in LEA-world is... 0.03% smaller. Weird, I would have expected more...
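Why matching drops in instruction count and IPC add up to "no performance difference" follows from cycles = instructions / IPC. A small sketch of that arithmetic, with made-up baseline numbers (the `1.0e9` instructions and IPC of `2.0` are placeholders, not measurements from this thread):

```rust
/// Cycles implied by an instruction count and an IPC figure.
fn cycles(instructions: f64, ipc: f64) -> f64 {
    instructions / ipc
}

/// Ratio of new to old cycle count when the instruction count and the IPC
/// are each scaled by a factor (0.97 models a 3% reduction).
fn cycle_ratio(instr_scale: f64, ipc_scale: f64) -> f64 {
    // Placeholder baseline; the ratio does not depend on these values.
    let (base_instr, base_ipc) = (1.0e9, 2.0);
    cycles(base_instr * instr_scale, base_ipc * ipc_scale) / cycles(base_instr, base_ipc)
}
```

With both scales at 0.97 the ratio comes out at 1 up to floating-point rounding, i.e. the same number of cycles; only if IPC had held steady would the 3% instruction saving show up as a speedup.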
iximeow commented on issue #4080:
ah! yeah i think contending for address generation would be a real issue there? on the most recent intel parts it should be less of an issue, but i think generally the add-to-lea translation is probably preferable if it lets you collapse an add or two plus a mov, rather than as a default.
and 0x67, as far as i understand, is "never use if at all possible" territory. but for implementing a wrapping 32-bit add with LEA, do you need that? you can use 64-bit input registers and rely on truncation of the destination to get the right value.
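The truncation argument can be checked with a small model. The function names below are invented for this sketch; the only assumption is that LEA with 64-bit address registers and a 32-bit destination computes the sum at 64 bits and keeps the low 32 bits.

```rust
/// Model of `lea disp(%base64, %index64), %dst32`: the address is computed
/// at 64-bit width, then truncated to the 32-bit destination.
fn lea32_from_64bit_regs(base: u64, index: u64, disp: i32) -> u32 {
    // The displacement is sign-extended to 64 bits before being added.
    let addr = base.wrapping_add(index).wrapping_add(disp as i64 as u64);
    addr as u32
}

/// Reference semantics: wrapping 32-bit adds on the low halves only.
fn add32_wrapping(a: u32, b: u32, disp: i32) -> u32 {
    a.wrapping_add(b).wrapping_add(disp as u32)
}
```

Because addition modulo 2^64 followed by truncation agrees with addition modulo 2^32 on the low halves, whatever sits in the upper 32 bits of the inputs cannot affect the result.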
cfallin commented on issue #4080:
> ah! yeah i think contending for address generation would be a real issue there? on the most recent intel parts it should be less of an issue, but i think generally the add-to-lea translation is probably preferable if it lets you collapse an add or two plus a mov, rather than as a default.
Yeah, I'll play with some more specific pattern-matching here :-) It's tricky in general to switch fluidly between them because we represent adds in an SSA 3-reg form with a register constraint (reuse src1 for dest) and the regalloc inserts the move, rather than the lowering pattern; so we'd need a way to tell ra2 that "if you whack this instruction in the right way, it transmutes into a slower form but without a constraint". Then work out when to pull that lever vs split a liverange. I'm really really curious how LLVM solves this now (it'll turn a `return arg0 + arg1` function into a single `lea rax, [rdi+rsi]` but use `add` otherwise so there has to be some regalloc integration somehow!).
> and 0x67, as far as i understand, is "never use if at all possible" territory. but for implementing a wrapping 32-bit add with LEA, do you need that? you can use 64-bit input registers and rely on truncation of the destination to get the right value.
You're right, I was misled by nasm here... writing `lea eax, [ebx+ecx]` in long mode gets a 0x67 prefix but `lea eax, [rbx+rcx]` does the exact same thing. D'oh.
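To make the prefix difference concrete, here is a minimal encoder sketch for the restricted form `lea r32, [base + index]` in 64-bit mode. `encode_lea32` is a hypothetical helper written for this note, covering only the low eight registers with no scale or displacement; it produces one valid encoding, which is not necessarily byte-for-byte what a given assembler emits.

```rust
/// Encode `lea dst32, [base + index]` for 64-bit mode (sketch only).
/// Register numbers: rax=0, rcx=1, rdx=2, rbx=3, rsp=4, rbp=5, rsi=6, rdi=7.
/// `addr32` selects 32-bit address registers (ebx, ecx, ...), which needs
/// the 0x67 address-size override prefix.
fn encode_lea32(dst: u8, base: u8, index: u8, addr32: bool) -> Option<Vec<u8>> {
    // Unsupported in this sketch: r8-r15 (need REX), rsp as index (not
    // encodable), and rbp as base with mod=00 (means "disp32, no base").
    if dst > 7 || base > 7 || index > 7 || index == 4 || base == 5 {
        return None;
    }
    let mut bytes = Vec::new();
    if addr32 {
        bytes.push(0x67); // address-size override prefix
    }
    bytes.push(0x8D); // LEA opcode
    bytes.push((dst << 3) | 0b100); // ModRM: mod=00, reg=dst, rm=100 (SIB follows)
    bytes.push((index << 3) | base); // SIB: scale=1, index, base
    Some(bytes)
}
```

Under these assumptions `lea eax, [rbx+rcx]` comes out as `8d 04 0b` and the 32-bit-address variant as `67 8d 04 0b`: the same instruction plus one extra prefix byte.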
Last updated: Jan 24 2025 at 00:11 UTC