cfallin commented on Issue #2504:
For context, I should also note: this (
I128
support) is one of the few pieces still missing from the new x64 backend. The Wasm frontend does not use it, butrustc_codegen_clif
does (cc @bjorn3). I expect we would need a few more ops before we can try this with cg_clif, though.
julian-seward1 commented on Issue #2504:
@cfallin As a general comment, I see the dilemma. I haven't yet read the patch, but wanted to ask: part of the motivation for going this route was that we would then be able to remove the complexity, the semantic murkyness, and the potential for generating poor code, caused by having to explicitly represent the carry flag in CLIF. If we revert to expanding out doubleword arithmetic at the CLIF level, is there some other way we can avoid those disadvantages?
Also .. does it matter that, doing this per-target (per this patch) potentially makes it possible to use target-specific sequences, which wouldn't be possible if it were done at the CLIF level?
julian-seward1 commented on Issue #2504:
doing this per-target (per this patch) potentially makes it possible to use target-specific sequences
For example, doubleword comparison on x64 is a bit elaborate, but for aarch64 it's just
cmp ; ccmp
.
bjorn3 commented on Issue #2504:
I just tested this with cg_clif. As far as I can see this works fine, but there are still missing pieces both related and unrelated to 128bit int support which mean that I can't completely test this yet.
support necessary in any case:
isplit
/iconcat
icmp.i128
tls_value
AbiPurpose::StructArg
support necessary to not require modifications of cg_clif:
load
/store
ofi128
rotl.i128
/rotr.i128
bitrev.i8
julian-seward1 commented on Issue #2504:
One thing that does concern me is
struct ValueRegs
, in particular the potential inefficiency induced by it. It's obviously fully general, in the sense that it can hold 4 unrelatedReg
values, but it does require passing around a 128-bit thing where before we were just passing around a 32-bit thing (Reg
). And, given that it requires indexing, I reckon it ain't gonna wind up in a vector register, hence will always be in memory.I had wondered if such generality is really necessary. Given that -- IIUC -- all the result virtual register fields are allocated before isel starts, would it be adequate to have an representation which specifies a base virtual register, plus zero, one or two subsequently-numbered vregs of the same class? Then that would fit in 34 bits (hence, a
uint64_t
) and could be passed around in a (host) register just asReg
is now.Even if that won't work, does
ValueRegs
really need to be template-typed? Are there any other types of things we need to store in there?
bjorn3 edited a comment on Issue #2504:
I just tested this with cg_clif. As far as I can see this works fine, but there are still missing pieces both related and unrelated to 128bit int support which mean that I can't completely test this yet.
support necessary in any case:
isplit
/iconcat
(https://github.com/cfallin/wasmtime/pull/14)icmp.i128
tls_value
AbiPurpose::StructArg
support necessary to not require modifications of cg_clif:
load
/store
ofi128
rotl.i128
/rotr.i128
bitrev.i8
bjorn3 edited a comment on Issue #2504:
I just tested this with cg_clif. As far as I can see this works fine, but there are still missing pieces both related and unrelated to 128bit int support which mean that I can't completely test this yet.
support necessary in any case:
isplit
/iconcat
(https://github.com/cfallin/wasmtime/pull/14)icmp.i128
(https://github.com/cfallin/wasmtime/pull/15)tls_value
AbiPurpose::StructArg
support necessary to not require modifications of cg_clif:
load
/store
ofi128
rotl.i128
/rotr.i128
bitrev.i8
brz.i128
/brnz.i128
bjorn3 edited a comment on Issue #2504:
I just tested this with cg_clif. As far as I can see this works fine, but there are still missing pieces both related and unrelated to 128bit int support which mean that I can't completely test this yet.
support necessary in any case:
isplit
/iconcat
(https://github.com/cfallin/wasmtime/pull/14)icmp.i128
(https://github.com/cfallin/wasmtime/pull/15)tls_value
AbiPurpose::StructArg
call
withi128
argumentssupport necessary to not require modifications of cg_clif:
load
/store
ofi128
rotl.i128
/rotr.i128
bitrev.i8
brz.i128
/brnz.i128
cfallin commented on Issue #2504:
So I'm starting to warm up to this approach relative to falling back on existing legalizations; it's really not too much more work to get to the end-point and the legalization is pretty hairy too. I'm working on the additional ops now; hopefully ready soon!
To a few design-question points:
One thing that does concern me is
struct ValueRegs
, in particular the potential inefficiency induced by it. It's obviously fully general, in the sense that it can hold 4 unrelatedReg
values, but it does require passing around a 128-bit thing where before we were just passing around a 32-bit thing (Reg
). And, given that it requires indexing, I reckon it ain't gonna wind up in a vector register, hence will always be in memory.I played with this a bit with Compiler Explorer (link) -- it seems it's passed by reference as an arg, but returned in two regs (
rdx:rax
). So, yeah, not as efficient as I had hoped, but at least it's not malloc'ing... though see below for alternative idea.I had wondered if such generality is really necessary. Given that -- IIUC -- all the result virtual register fields are allocated before isel starts, would it be adequate to have an representation which specifies a base virtual register, plus zero, one or two subsequently-numbered vregs of the same class? Then that would fit in 34 bits (hence, a
uint64_t
) and could be passed around in a (host) register just asReg
is now.Tempting idea! This doesn't handle the real-register case though (needed at least for ABI specs). I suppose we could special-case that but then we bifurcate some other APIs (gen load/spill, move, ...).
I'm thinking that we might actually be able to optimize in another (much simpler) way -- on 64-bit machines and with our largest types being 128-bit, we can get away with the 1 / 2-reg cases. Perhaps we could define a static constant for size and, with some config magic, select
2
unless arm32 (or later, x86-32, etc.) are compiled in. Then we're just passing around 64 bits, which with padding is probably free relative to our current 32-bit size in most cases.Even if that won't work, does
ValueRegs
really need to be template-typed? Are there any other types of things we need to store in there?Yes, I did this to try to keep to the spirit of
Reg
,WritableReg
,VirtualReg
,RealReg
. At least for the former two, I tried at first doingWritable<ValueRegs>
instead, butWritable
is defined inregalloc
and requires implementation of a private trait, i.e., it's closed to new uses. IMHO it's somewhat nice forregs()
to return (an array of) the correctly-typed thing in many cases...
julian-seward1 commented on Issue #2504:
@cfallin Great work, btw! I should have said that earlier.
cfallin commented on Issue #2504:
Just added implementations for (I think) all of the relevant
i128
ops. Will make a cleanup pass and address some of the comments above later. @bjorn3, I didn't get to the TLS support orStructArg
support yet, but this might be close enough to hack something together? If not, I'll return to this in a bit :-)
cfallin commented on Issue #2504:
@bjorn3 I just added what I think might be a sufficient implementation of ELF TLS support (basically just copying your recipe in the old x86 backend), and also permitted
StructReturn
-purpose args to compile. I'm not sure if any more needs to be done for the latter (does cg_clif rely on some sort of legalization that lowers into a struct-return pointer?).Let me know if this is sufficient to get cg_clif working! If so, I'll clean this up a bit and split out some pieces to actually land it.
cfallin edited a comment on Issue #2504:
@bjorn3 I just added what I think might be a sufficient implementation of ELF TLS support (basically just copying your recipe in the old x86 backend), and also permitted
StructReturn
(edit: andStructArgument
)-purpose args to compile. I'm not sure if any more needs to be done for the latter (does cg_clif rely on some sort of legalization that lowers into a struct-return pointer?).Let me know if this is sufficient to get cg_clif working! If so, I'll clean this up a bit and split out some pieces to actually land it.
cfallin edited a comment on Issue #2504:
@bjorn3 I just added what I think might be a sufficient implementation of ELF TLS support (basically just copying your recipe in the old x86 backend), and also permitted
StructReturn
(edit: andStructArgument
)-purpose args to compile. I'm not sure if any more needs to be done for the latter (does cg_clif rely on some sort of legalization that lowers into a struct-arg/return pointer?).Let me know if this is sufficient to get cg_clif working! If so, I'll clean this up a bit and split out some pieces to actually land it.
bjorn3 commented on Issue #2504:
Let me know if this is sufficient to get cg_clif working!
Almost. The standard library compiles with the above TLS fix, but std_example fails at several points:
saturating_sub::<i128>()
(select.i128
?)- u128 -> float cast (incorrect return value)
__m128i
->[u8; 16]
transmute (raw_bitcast
betweeni64x2
andi128
?)
bjorn3 edited a comment on Issue #2504:
Let me know if this is sufficient to get cg_clif working!
Almost. The standard library compiles with the above TLS fix, but std_example fails at several points:
saturating_sub::<i128>()
(select.i128
?)- u128 -> float cast (incorrect return value)
- <strike>
__m128i
->[u8; 16]
transmute (raw_bitcast
betweeni64x2
andi128
?)</strike> (edit: it fixed itself somehow)
bjorn3 commented on Issue #2504:
I got a SIGSEGV on simple-raytracer. I reduced it to:
struct Vec3 { x: f64, y: f64, z: f64, } impl Vec3 { fn neg(self) -> Vec3 { Vec3 { x: -self.x, y: -self.y, z: -self.z, } } } fn main() { Vec3 { x: 0.0, y: 0.0, z: 0.0, }.neg(); }
(rr) cont Continuing. Program received signal SIGSEGV, Segmentation fault. 0x0000560560f17c2a in main::Vec3::neg () at src/bin/main.rs:10 10 x: -self.x, (rr) disassemble/r Dump of assembler code for function main::Vec3::neg: 0x0000560560f17c17 <+0>: 55 push %rbp 0x0000560560f17c18 <+1>: 48 89 e5 mov %rsp,%rbp 0x0000560560f17c1b <+4>: 48 b8 00 00 00 00 00 00 00 80 movabs $0x8000000000000000,%rax 0x0000560560f17c25 <+14>: 66 48 0f 6e c0 movq %rax,%xmm0 => 0x0000560560f17c2a <+19>: 66 0f 57 06 xorpd (%rsi),%xmm0 0x0000560560f17c2e <+23>: 48 b8 00 00 00 00 00 00 00 80 movabs $0x8000000000000000,%rax 0x0000560560f17c38 <+33>: 66 48 0f 6e c8 movq %rax,%xmm1 0x0000560560f17c3d <+38>: 66 0f 57 4e 08 xorpd 0x8(%rsi),%xmm1 0x0000560560f17c42 <+43>: 48 b8 00 00 00 00 00 00 00 80 movabs $0x8000000000000000,%rax 0x0000560560f17c4c <+53>: 66 48 0f 6e d0 movq %rax,%xmm2 0x0000560560f17c51 <+58>: 66 0f 57 56 10 xorpd 0x10(%rsi),%xmm2 0x0000560560f17c56 <+63>: f2 0f 11 07 movsd %xmm0,(%rdi) 0x0000560560f17c5a <+67>: f2 0f 11 4f 08 movsd %xmm1,0x8(%rdi) 0x0000560560f17c5f <+72>: f2 0f 11 57 10 movsd %xmm2,0x10(%rdi) 0x0000560560f17c64 <+77>: 48 89 ec mov %rbp,%rsp 0x0000560560f17c67 <+80>: 5d pop %rbp 0x0000560560f17c68 <+81>: c3 retq End of assembler dump. (rr) info reg rsi rsi 0x7ffd802b5678 140726753777272 (rr) p *(0x7ffd802b5678 as *const usize) $1 = 0
bjorn3 edited a comment on Issue #2504:
I got a SIGSEGV on simple-raytracer. I reduced it to:
struct Vec3 { x: f64, y: f64, z: f64, } impl Vec3 { fn neg(self) -> Vec3 { Vec3 { x: -self.x, y: -self.y, z: -self.z, } } } fn main() { Vec3 { x: 0.0, y: 0.0, z: 0.0, }.neg(); }
(rr) cont Continuing. Program received signal SIGSEGV, Segmentation fault. 0x0000560560f17c2a in main::Vec3::neg () at src/bin/main.rs:10 10 x: -self.x, (rr) disassemble/r Dump of assembler code for function main::Vec3::neg: 0x0000560560f17c17 <+0>: 55 push %rbp 0x0000560560f17c18 <+1>: 48 89 e5 mov %rsp,%rbp 0x0000560560f17c1b <+4>: 48 b8 00 00 00 00 00 00 00 80 movabs $0x8000000000000000,%rax 0x0000560560f17c25 <+14>: 66 48 0f 6e c0 movq %rax,%xmm0 => 0x0000560560f17c2a <+19>: 66 0f 57 06 xorpd (%rsi),%xmm0 0x0000560560f17c2e <+23>: 48 b8 00 00 00 00 00 00 00 80 movabs $0x8000000000000000,%rax 0x0000560560f17c38 <+33>: 66 48 0f 6e c8 movq %rax,%xmm1 0x0000560560f17c3d <+38>: 66 0f 57 4e 08 xorpd 0x8(%rsi),%xmm1 0x0000560560f17c42 <+43>: 48 b8 00 00 00 00 00 00 00 80 movabs $0x8000000000000000,%rax 0x0000560560f17c4c <+53>: 66 48 0f 6e d0 movq %rax,%xmm2 0x0000560560f17c51 <+58>: 66 0f 57 56 10 xorpd 0x10(%rsi),%xmm2 0x0000560560f17c56 <+63>: f2 0f 11 07 movsd %xmm0,(%rdi) 0x0000560560f17c5a <+67>: f2 0f 11 4f 08 movsd %xmm1,0x8(%rdi) 0x0000560560f17c5f <+72>: f2 0f 11 57 10 movsd %xmm2,0x10(%rdi) 0x0000560560f17c64 <+77>: 48 89 ec mov %rbp,%rsp 0x0000560560f17c67 <+80>: 5d pop %rbp 0x0000560560f17c68 <+81>: c3 retq End of assembler dump. (rr) info reg rsi rsi 0x7ffd802b5678 140726753777272 (rr) p *(0x7ffd802b5678 as *const usize) $1 = 0
Edit: This bug is unrelated to this PR. I opened #2507.
cfallin commented on Issue #2504:
@bjorn3 I just pushed a number of fixes/updates; this I think addresses everything above except for the
StructReturn
fix and whatever is causing the u128 to/from float conversion error. (I'm testing with a localrustc_codegen_cranelift
checkout; I see the same error that you're presumably seeing on std_example.rs line 79, converting u128->f32).I'm not sure if the latter has to do with the former; it seems that the u128->f32 conversion is a call into libgcc so I suspect it's an ABI issue rather than an i128 arithmetic issue. (Thank goodness for that; I started to work out the open-coded sequence for each of the u128 to/from f32/f64 cases, signed/unsigned, and quickly ran away...)
I won't have more time to look at this for a bit (I'm mostly away until after the new year) but if you have any further debugging insights or want to try to implement the
StructReturn
fix (I think we just need to preprend a StructReturn return value if an arg exists, and wire the value through) please feel free!(Also it mostly goes without saying but since this PR has grown quite a bit beyond just i128 arithmetic, I plan to split it up into proper pieces for separate review once we confirm that cg_clif works.)
bjorn3 commented on Issue #2504:
I'm not sure if the latter has to do with the former; it seems that the u128->f32 conversion is a call into libgcc so I suspect it's an ABI issue rather than an i128 arithmetic issue.
On my system it calls into rust's compiler-builtins.
or want to try to implement the StructReturn fix (I think we just need to preprend a StructReturn return value if an arg exists, and wire the value through)
That and use the correct register. It isn't very important though as the old backend didn't implement it either.
bjorn3 commented on Issue #2504:
Minimal repro:
fn convert(i: u128) -> f32 { if i == 0 { // this branch is taken return 0.0; } extract_sign(i); // except when this call is removed 100.0 } fn main() { assert_eq!(convert(100u128), 100.0); } fn extract_sign(i: u128) {}
Assembly of
convert
:00000000000f4af5 <_ZN11std_example7convert17ha5c22c0f04088fadE>: f4af5: 55 push %rbp f4af6: 48 89 e5 mov %rsp,%rbp f4af9: 48 83 ff 00 cmp $0x0,%rdi f4afd: 40 0f 94 c0 sete %al f4b01: 48 83 fe 00 cmp $0x0,%rsi f4b05: 40 0f 94 c1 sete %cl f4b09: 48 21 c1 and %rax,%rcx f4b0c: 0f 84 08 00 00 00 je f4b1a <_ZN11std_example7convert17ha5c22c0f04088fadE+0x25> f4b12: 0f 57 c0 xorps %xmm0,%xmm0 f4b15: 48 89 ec mov %rbp,%rsp f4b18: 5d pop %rbp f4b19: c3 retq f4b1a: e8 a1 01 00 00 callq f4cc0 <_ZN11std_example12extract_sign17h8b332672bd416c0aE> f4b1f: be 00 00 c8 42 mov $0x42c80000,%esi f4b24: 66 0f 6e c6 movd %esi,%xmm0 f4b28: 48 89 ec mov %rbp,%rsp f4b2b: 5d pop %rbp f4b2c: c3 retq
cfallin commented on Issue #2504:
@bjorn3 thanks for that repro -- the disassembly made it pretty clear what was wrong: brz/brnz on an i128 used an incorrect width of and/or instruction. Pushed a fix just now.
Locally,
./test.sh
in rustc_codegen_cranelift using this branch now seems to pass all the tests; it dies later during benchmarking but I'm not sure what's wrong:Benchmark #2: ../build/cargo.sh build Error: Command terminated with non-zero exit code. Use the '-i'/'--ignore-failure' option if you want to ignore this. Alternatively, use the '--show-output' option to debug what went wrong.
(Also, in case it's relevant, I set
JIT_SUPPORTED=0
inscripts/config.sh
as that seemed to get me past an issue related to the TLS relocation not being handled in cranelift-jit. Known issue?)In any case, I'm happy that this seems to allow cg_clif to pass the tests in
examples/
:-)
bjorn3 commented on Issue #2504:
Locally, ./test.sh in rustc_codegen_cranelift using this branch now seems to pass all the tests
:tada:
it dies later during benchmarking but I'm not sure what's wrong:
Can you add
--show-output
to the hyperfine invocation inscripts/tests.sh
? I can't test it myself until tomorrow. You could also cd tosimple-raytracer
and then run../build/cargo.sh build
directly to avoid having to build the sysroot again.
cfallin commented on Issue #2504:
I tried building simple-raytracer, and the
cg_clif
invocation to compile thetiff
crate is dying with a segfault:Compiling tiff v0.2.1 error: could not compile `tiff` Caused by: process didn't exit successfully: `/home/cfallin/work/rustc_codegen_cranelift/build/bin/cg_clif --crate-name tiff [...]` (signal: 11, SIGSEGV: invalid memory reference)
I tracked this back to a nonsense pointer passed into
slice::copy_from_slice
and tracked it backward with rr for a while; but couldn't reach the origin (it's a somewhat manual process tracing through disassemblies and setting memory watchpoints). Valgrind shows that the pointer itself is undefined data. So we're missing some sort of dataflow linkage (a store? ABI semantics related to wide values?) but I've hit my quota for today. Curious to see if you can discover more!(I implemented
StructReturn
slightly more properly but this didn't seem to be the problem; wasn't likely, as you noted, but wanted to be sure.)
bjorn3 commented on Issue #2504:
This is a proc macro that has an ABI incompatibility with the cg_llvm comliled rustc due to
StructArgument
not yet being implemented.
cfallin commented on Issue #2504:
@bjorn3 I went ahead and implemented
StructArgument
as well. With that done,simple-raytracer
builds, and its binary runs to completion. (The following is the first result computed via rustc -> cg_clif -> x64-new-backend, aside from compilations themselves: result.png!)There do seem to be a few last test failures later in the
test.sh
run (in the stdlib I guess?):1139 passed; 16 failed
, mostly somei8
/i16
ops and some FP string formatting issues. The former doesn't surprise me too much as narrow-value support is still somewhat young relative to the 32/64-bit paths that are well-hammered by Wasm. In any case, if you (or anyone else) feel like taking a look at those last issues, please do feel free!I'll clean up this mess of a draft branch when I'm actually "working" again on Jan 4 :-)
cfallin commented on Issue #2504:
Last two fixes pushed:
DIV
on 8-bit values places remainders inAH
, notDL
(the weird and wonderful depths of x86 never cease to amuse!), and ourselect
impl wasn't recognizing that booleans are integers and need integer moves.All stdlib tests pass now! I really will close my laptop now and read some books until I clean this up in the new year :-)
bjorn3 commented on Issue #2504:
Yay! I hope you will have a great New Year!
julian-seward1 commented on Issue #2504:
@cfallin
DIV on 8-bit values places remainders in AH, not DL
That's surely a case of the general redundant-REX (0x40) rule, or in this case the lack of 0x40? Does the remainder go in DL if there's an 0x40 prefix?
julian-seward1 edited a comment on Issue #2504:
@cfallin
DIV on 8-bit values places remainders in AH, not DL
That's surely a case of the general redundant-REX (0x40) rule, or in this case the lack of 0x40? Does the remainder go in DL if there's an 0x40 prefix?
cfallin commented on Issue #2504:
@cfallin
DIV on 8-bit values places remainders in AH, not DL
That's surely a case of the general redundant-REX (0x40) rule, or in this case the lack of 0x40? Does the remainder go in DL if there's an 0x40 prefix?
Sadly no, a REX prefix doesn't change anything -- it's just the way the 8-bit variant is defined (reference). A likely remnant of the 8086 days when AL and AH were separately-useful registers for byte-sized values...
cfallin edited a comment on Issue #2504:
@cfallin
DIV on 8-bit values places remainders in AH, not DL
That's surely a case of the general redundant-REX (0x40) rule, or in this case the lack of 0x40? Does the remainder go in DL if there's an 0x40 prefix?
Sadly no, a REX prefix doesn't change anything -- it's just the way the 8-bit variant is defined (reference). A likely remnant of the 8086 days when AL and AH were separately-useful registers for byte-sized values...
(Edit: here's another reference showing more -- a REX byte will change whether the divisor arg refers to low byte of one of the 16 x86-64 registers, or a low/high byte of one of the 8 original registers; but remainder always goes into AH.)
cfallin commented on Issue #2504:
Did some rebase/squash surgery on this branch -- I will split it into I think 8 PRs:
- Multi-reg value framework
- I128 ops for x64 backend
- TLS support in x64 backend
- StructArg/StructRet ABI support in x64/aarch64
- Fix to #2508 (REX prefix fixes for 8-bit instructions)
- Fix to #2507 (Avoid unaligned SSE loads)
- urem/srem bugfix for 8-bit DIV
- select.b1 bugfix
Several of the above are dependent on the multi-reg refactor but the rest should be reviewable in parallel.
cfallin commented on Issue #2504:
The pieces were a bit more interrelated than I had hoped so this became four PRs: #2538 (framework), #2539 (i128 ops), #2540 (TLS), #2541 (struct args). 2539 depends on 2538; and 2540/2541 depend on 2538 and 2539.
cfallin commented on Issue #2504:
Closing this draft PR now that the real PRs resulting from it are under review.
Last updated: Jan 24 2025 at 00:11 UTC