Artentus opened issue #5572:
Hello,
I am currently trying to write a Cranelift backend that targets a 32 bit architecture but am encountering some issues.
Since all the existing backends are 64 bit am I not sure whether this is due to missing support or my lack of understanding.For example,
codegen::machinst::valueregs::ValueRegs
states in its descriptionwe cap the capacity of this at four (when any 32-bit target is enabled) or two (otherwise)
, however the capacity is hardcoded to 2.Also, inside the
codegen::machinst::isle::isle_lower_prelude_methods
macro, the functiontemp_writable_reg
assumes only a single register is returned. However, this function is called inIsleContext<'_, '_, MInst, XYZBackend>::imm
, which takes 64 bit values, so on a 32 bit target two registers are required.In general, the
Context
trait generated by ISLE requires 64 bit quantities in most places. How is one expected to implement these for a 32 bit target? Do you invent non-existing instructions that operate on 64 bit and then lower them into equivalent real instructions?
jameysharp commented on issue #5572:
Cool, it's nice to hear from people working on more Cranelift backends! I'm curious what architecture you're targeting; glancing at your GitHub, I'm guessing it's your A32 RISC ISA? I'd also love to hear some about your goal: Do you want to run Wasmtime on this target, or compile Rust to it with cg-clif, or something else?
As you noticed, we don't currently have backends implemented for any 32-bit targets, so I'm not particularly surprised that you're running into situations where we didn't quite think through the implications for a 32-bit target.
There are existing cases where backends use what we refer to as "pseudo-instructions", which are what I think you're describing as "non-existing instructions". You can find a bunch of examples with
git grep -E 'pseudo-?inst'
; we haven't been super consistent about spelling apparently. Maybe there are cases where that would help with 64-bit assumptions, but off-hand my guess is there will be few if any of those cases.For the most part I think you'll just need to do for 64-bit types what the existing backends do for 128-bit types: split the value across two registers and figure out how to combine register pairs as needed. For some relatively simple examples, look at how various backends implement
(lower (has_type $I128 (iadd x y)))
.The only case where you should need four registers in
ValueRegs
is if you're going to support 128-bit integers with 32-bit registers, and we're very inconsistent about I128 support across the existing backends anyway. So I'd just skip that for now. Once you have a proof of concept working we can figure out how to handle issues likeValueRegs
being too small. It's a little tricky because I'm guessing making it bigger will have a measurable impact on compile time, and also because we'll need to do some API design work.The only
fn imm
I see that takes a 64-bit value and callstemp_writable_reg
is in the riscv64 backend. You don't have to implement that. It was apparently convenient for that backend but you can do something different if you need to.I hope this helps give you a starting point. We're happy to answer questions if we can!
Artentus commented on issue #5572:
Hey, thanks for that detailed answer.
Yes you are right, I'm doing this for academic purposes atm. The ISA I'm targeting is similar to RV32I in a lot of ways so if you don't wanna read through it just pretend its RV32I. This is also the reason why I mostly used the RiscV backend as a template.
My goal is to compile Rust. I messed around with LLVM but writing a backend for that turned out to be really difficult. And since I know my way around Rust a lot better than C++ I decided to give Cranelift a go. Since this is mostly about the learning experience I don't really care that the generated code is not quite on par with LLVM.
Alright so I will ignore i128s for now. I guess as long as I don't use them in the code I want to compile they should never be emitted right? Same with floats probably.
I'm gonna read through the other backedns ISLE code and see how it's being done there. I'm still finding my way around ISLE.
bjorn3 commented on issue #5572:
If you want the rust standard library to compile without patches you need both i128 and floats. It is possible to patch them out though with a bit of work.
fitzgen commented on issue #5572:
I’d suggest looking at the aarch64 and x64 backends when looking for inspiration.
Artentus commented on issue #5572:
Ok, I got a very simple function to correctly compile now.
Before I continue I wanna ask: are there any tests I should modify/add specifically for the backend?
The other backends contain very little testing directly so I guess it happens somewhere else.
fitzgen commented on issue #5572:
Nice! Most testing is done via file tests (‘cranelift/filetests/filetests/isa/*’), fuzzing, and wasmtime exercising the codegen.
Artentus commented on issue #5572:
Ah thanks, that's what I was looking for.
Artentus commented on issue #5572:
I have implemented all of the arithmetic and bitwise operations now, except for division and remainder.
The architecture doesn't have division or remainder instructions so they have to be implemented algorithmically.
However emitting the algorithm inline does not seem like a wise choice, instead I want to generate a call to some function (I believe in LLVM the function in question would be calleddivsi3
).But since none of the other backends do this I don't know how to go about it. I can manage to emit a call but where does the function come from, how do I define it?
Ideally the function would be defined in target-specific code to hand-optimize it.
cfallin commented on issue #5572:
@Artentus we have a mechanism for this with
LibCall
s; we do have a few such calls in some of the instruction lowerings.
Artentus commented on issue #5572:
I saw that in the memcpy lowering but couldn't figure out how to actually define new ones.
cfallin commented on issue #5572:
The enum is defined here. If you grep for one of the existing ones, e.g.
FloorF32
, you'll see the places in the code where it's used and how the relocation can be handled to link to an actual function (e.g. in Wasmtime here to this function).
cfallin commented on issue #5572:
(Forgot to finish: then you can follow that example to define a new one; also in general, when adding a new option to an enum, the compile errors will usually lead you to where you need to make changes.)
Artentus commented on issue #5572:
Thanks!
However I assume this only works within the runtime correct?
Does that mean if I want to compile Rust using the rustc cranelift backend I have to supply these through the linker?
cfallin commented on issue #5572:
Yes, you would need to somehow provide implementations of the libcalls that the relocation resolution can find. cc @bjorn3 for the best way to do that...
afonso360 commented on issue #5572:
:wave: Hey, See #4460 for an example on how to define a new LibCall. Mostly once you provide the correct name in the module crate, the linker can usually figure it out automatically.
afonso360 edited a comment on issue #5572:
:wave: Hey, See #4460 for an example on how to define a new LibCall. Mostly once you provide the correct name in the module crate, the linker can usually figure it out automatically.
That PR also does some setup groundwork to emit libcall's from isle that you shouldn't need to repeat.
Artentus commented on issue #5572:
Yes I mostly copied the code from x64
emit_vm_call
and it's correctly generating these calls now.
My main problem is how to get linkable object files that contain these functions, since there is no existing compilers for this target.
bjorn3 commented on issue #5572:
You could add it to the compiler-builtins rust crate I think. This crate is included in everything rustc compiles.
Artentus commented on issue #5572:
I am currently working on lowering for loads and stores, and there are these two ir instructions I'm unsure about:
uload32
andsload32
.
On a 32 bit architecture these are both functionally identical to justload.i32
but I'm wondering if the compiler will still try to emit these.I tried experimenting and defined this test function:
function %uload32(i32) -> i32 { block0(v0: i32): v1 = uload32.i32 v0+42 return v1 }
But this yields the following error:
inst1 (return v1): arg 0 (v1) has type i64, must match function signature of i32
So do I just completely ignore these?
bjorn3 commented on issue #5572:
uload and iload generally return a 64bit value by zero/sign extending a 32bit loaded value.
bjorn3 edited a comment on issue #5572:
uload and sload generally return a 64bit value by zero/sign extending a 32bit loaded value.
Artentus commented on issue #5572:
Well,
uload8
,sload8
,uload16
andsload16
do work fine with 32 bit. This code compiles perfectly fine:function %uload8(i32) -> i32 { block0(v0: i32): v1 = uload8.i32 v0+42 return v1 }
bjorn3 commented on issue #5572:
Looks like
uload*
andsload*
have a constraint that their output size is bigger than what they load (up to 64bit). So for exampleuload32
will always return ani64
anduload16
ani32
ori64
. https://github.com/bytecodealliance/wasmtime/blob/69cd0a6b1a78c9c793b549256d76c879b48cb3db/cranelift/codegen/meta/src/shared/instructions.rs#L814
Artentus commented on issue #5572:
Ah, I guess that makes sense. So I should probably implement all of them for both 32 bit _and_ 64 bit right?
bjorn3 commented on issue #5572:
uload8 and sload8 should he implemented for i16, i32 and i64 (i16 can be skipped for wasm I think, uload16 and sload16 should be implemented for i32 and i64. Uload32 and uload64 should be implemented for i64.
bjorn3 edited a comment on issue #5572:
uload8 and sload8 should he implemented for i16, i32 and i64 (i16 can be skipped for wasm I think, uload16 and sload16 should be implemented for i32 and i64. Uload32 and sload32 should be implemented for i64.
Artentus commented on issue #5572:
If you want the rust standard library to compile without patches you need both i128 and floats.
Unfortunately it seems like this is true even for the core library. I tried to compile it but am gettings loads of errors related to invalid SSA types (f32 and i128).
So how do I go about patching that stuff out?
bjorn3 commented on issue #5572:
I know there are usages in
core::time
andcore::num
. Try commenting them out and see what breaks. Then comment everything out that depends on those impls and repeat. You might take some inspiration from the patch file https://github.com/bjorn3/rustc_codegen_cranelift/commit/425b527dc31c4d88ed2099bcc91af96cebf8a463 introduced to remove i128 usage from libcore. It doesn't apply to the latest version anymore, but it may be useful as guide.
Artentus commented on issue #5572:
Thank you very much, I actually managed to have it compile the (modified) core library without errors.
However I must say it is very awkward having to inject a custom core library.
I think I will actually do the work and add libcalls for all the float instructions. It's tedious but not difficult.128 bit integers on the other hand do pose a problem, I'm not sure what to do about that.
bjorn3 commented on issue #5572:
For rustc_codegen_cranelift the patches directory contains various kinds of patches. Those with
-sysroot-
are applied to the standard library. You can commit your changes, dogit format-patch HEAD~..HEAD
(or if you have multiple commits use the appropriate amount of~
) and then move the generated patch to the patches directory and rename it to include-sysroot-
. After that running./y.rs prepare
again will apply the patch to the standard library sources that./y.rs build
uses. If you need to only build libcore and not the rest of the standard library, you can editbuild_sysroot/Cargo.toml
to remove the dependencies on the rest. (you will likely need to keep compiler_builtins) Note that you will have to do./y.rs prepare
again after editing it as it is copied elsewhere. (I plan to make this automatic, but haven't gotten to it yet.)
Artentus commented on issue #5572:
I think adding support for 128 bit integers in the backend is the better solution than patching the core library.
Modifying Cranelift to support 4 register values wasn't actually all that difficult.
Artentus commented on issue #5572:
I'm successfully compiling code now, including the entire Rust core and alloc libraries.
However after inspecting the generated code I am seeing some very weird things being generated.
To be clear, I never expected Cranelift to generate code that is optimized to the same degree as LLVM, it clearly states so after all.
But I did not expect it to do such hilariously bad things.I compiled this very basic hello world main function:
mod io { pub fn serial_write(s: &str) { // ... } } fn main() { io::serial_write("Hello World!"); }
And this is the generated assembly (manually commented):
;; allocate stack memory sub sp, sp, 32 ;; load address of "Hello World!" string (0 because pre-relocation) ldui t7, 0 or a0, t7, 0 ;; store address of "Hello World!" string at sp+0 add a1, sp, 0 st [a1], a0 ;; store length of "Hello World!" string (12 bytes) at sp+4 add a1, sp, 4 ldi a2, 12 st [a1], a2 ;; load address from sp+0 add a2, sp, 0 ld a2, [a2] ;; load length from sp+4 add a3, sp, 4 ld a3, [a3] ;; store address at sp+16 add a4, sp, 16 st [a4], a2 ;; store length at sp+20 add a4, sp, 20 st [a4], a3 ;; load address from sp+16 add a4, sp, 16 ld a0, [a4] ;; load length from sp+20 add a4, sp, 20 ld a1, [a4] ;; call io::serial_write (0 because pre-relocation) ldui t7, 0 or a4, t7, 0 jl ra, a4 ;; free stack memory add sp, sp, 32
As you can see what should be loading two simple arguments into
a0
anda1
registers is actually round tripping two times through memory for absolutely no reason.
Also it is not making use of offsetting memory instructions at all, instead emitting an add+load/store.
cfallin commented on issue #5572:
@Artentus I wonder if it would make sense to move some of this discussion to a Zulip thread? We're interested in helping you support your ISA, but normally we like to use GitHub issues more for targeted single issues.
In any case, it's hard to say much without actually playing with your new backend. Are you enabling optimizations? How did you implement the ABI? What does the CLIF look like? The redundant ops could be coming from any one of several layers.
bjorn3 commented on issue #5572:
I'm successfully compiling code now, including the entire Rust core and alloc libraries.
:tada:
However after inspecting the generated code I am seeing some very weird things being generated.
Yeah, rustc generates terrible MIR with a lot of copies in it. I try to remove some of the copies in cg_clif, but many are passed through to Cranelift. Cranelift doesn't have the optimization passes necessary to remove them. If you enable optimizations for cg_clif you this will enable some optimizations working on MIR which may tidy it up a bit.
Artentus commented on issue #5572:
@cfallin got it. I have just never used Zulip in my life so I have to sign up first.
@bjorn3 I'm already compiling the compiler itself in release mode if that's what you mean. Test program was also compiled in release mode.
bjorn3 commented on issue #5572:
I just checked it and this is indeed caused by the MIR containing copies:
fn main() -> () { let mut _0: (); // return place in scope 0 at /app/example.rs:7:15: 7:15 let _1: (); // in scope 0 at /app/example.rs:8:5: 8:37 let mut _2: &str; // in scope 0 at /app/example.rs:8:22: 8:36 let _3: &str; // in scope 0 at /app/example.rs:8:22: 8:36 bb0: { _3 = const "Hello World!"; // scope 0 at /app/example.rs:8:22: 8:36 _2 = _3; // scope 0 at /app/example.rs:8:22: 8:36 _1 = serial_write(move _2) -> bb1; // scope 0 at /app/example.rs:8:5: 8:37 } bb1: { return; // scope 0 at /app/example.rs:9:2: 9:2 } }
The first store is
_3 = const "Hello World!";
, the first load and store is_2 = _3;
and the last load is themove _2
argument. Compiling with-Zmir-opt-level=3
optimizes away the_2 = _3
.This seems to be a case where cg_clif itself should be able to avoid stack allocation, but for some reason it doesn't. I'm currently investigating.
bjorn3 commented on issue #5572:
Found it. This was a regression from https://github.com/bjorn3/rustc_codegen_cranelift/commit/777d4732dc2389752ebaa4fc7256245c4873bc25. I've fixed it in https://github.com/bjorn3/rustc_codegen_cranelift/commit/df04fd6fba2e1b2b85a108e3ec7e56c731456ba9 for a 12% runtime perf win on a benchmark I use. Thanks for pointing this out!
Artentus commented on issue #5572:
That's great news! 12% from such a small change is huge.
This patch did get rid of one of the unnecessary copies.
I believe the second one is because of a transmute from&str
to&[u8]
which Cranelift can't optimize out. But that's fine.I also figured out how to optimize those add+load/store sequences into a single instruction.
Just needed a rule to coverstack_addr
followed byload
orstore
IR instructions.Since everything is working I believe I can now close this. Thanks to everyone who helped me with this, I got way more and way faster responses than I ever expected. The codebase was also very pleasant to work with, especially compared to the mess that is LLVM.
If you're interested I can PR the changes that are required to fully support 32 bit targets in cranelift_codegen.
It's very little code, essentially just extendingValueRegs
to size of 4. Easily feature gated as well so if no 32 bit targets are enabled there won't be any impact.
Artentus closed issue #5572:
Hello,
I am currently trying to write a Cranelift backend that targets a 32 bit architecture but am encountering some issues.
Since all the existing backends are 64 bit am I not sure whether this is due to missing support or my lack of understanding.For example,
codegen::machinst::valueregs::ValueRegs
states in its descriptionwe cap the capacity of this at four (when any 32-bit target is enabled) or two (otherwise)
, however the capacity is hardcoded to 2.Also, inside the
codegen::machinst::isle::isle_lower_prelude_methods
macro, the functiontemp_writable_reg
assumes only a single register is returned. However, this function is called inIsleContext<'_, '_, MInst, XYZBackend>::imm
, which takes 64 bit values, so on a 32 bit target two registers are required.In general, the
Context
trait generated by ISLE requires 64 bit quantities in most places. How is one expected to implement these for a 32 bit target? Do you invent non-existing instructions that operate on 64 bit and then lower them into equivalent real instructions?
cfallin commented on issue #5572:
If you're interested I can PR the changes that are required to fully support 32 bit targets in cranelift_codegen.
It's very little code, essentially just extending ValueRegs to size of 4. Easily feature gated as well so if no 32 bit targets are enabled there won't be any impact.We actually had exactly this code before, when we had a (partial) arm32 backend in-tree! I think it's probably best to wait until we add a new target to bring this back, as we are also considering other ways now of supporting wide values (namely: legalization/narrowing rules in the mid-end, as Cranelift used to do actually) that might be simpler in the end.
I'm happy you got this working though, it's a really neat demo of portability! And if you want to contribute any documentation regarding rough edges or confusing bits you hit, that might also be a good outcome from this :-)
Artentus commented on issue #5572:
A pice of documentation I would have loved to have is a listing of all CLIF instructions, just to know what needs to be implemented instead of trying to compile something and then reading error messages about missing lowering rules.
Some of the instructions also have quirks that weren't immediately obvious to me.If you tell me where to put it I can start working on it.
cfallin commented on issue #5572:
A pice of documentation I would have loved to have is a listing of all CLIF instructions, just to know what needs to be implemented instead of trying to compile something and then reading error messages about missing lowering rules. Some of the instructions also have quirks that weren't immediately obvious to me.
If you tell me where to put it I can start working on it.
The closest we have to that is the InstBuilder trait docs that are generated from the source (that ultimately come from the doc-strings in
cranelift/codegen/meta/shared/src/instructions.rs
). Any updates to those descriptions that would have been helpful are very welcome contributions!We have a cranelift/docs/ directory as well that could use some attention in general; if you come up with any more general thoughts from your experience please do feel free to start a
porting.md
or similar! It's on our roadmap for this year to update docs and a porting-to-new-ISA guide is something we do want to build out, eventually.
Last updated: Dec 23 2024 at 13:07 UTC