Stream: git-wasmtime

Topic: wasmtime / issue #5572 32 bit targets in Cranelift


view this post on Zulip Wasmtime GitHub notifications bot (Jan 13 2023 at 20:58):

Artentus opened issue #5572:

Hello,

I am currently trying to write a Cranelift backend that targets a 32 bit architecture but am encountering some issues.
Since all the existing backends are 64 bit am I not sure whether this is due to missing support or my lack of understanding.

For example, codegen::machinst::valueregs::ValueRegs states in its description we cap the capacity of this at four (when any 32-bit target is enabled) or two (otherwise), however the capacity is hardcoded to 2.

Also, inside the codegen::machinst::isle::isle_lower_prelude_methods macro, the function temp_writable_reg assumes only a single register is returned. However, this function is called in IsleContext<'_, '_, MInst, XYZBackend>::imm, which takes 64 bit values, so on a 32 bit target two registers are required.

In general, the Context trait generated by ISLE requires 64 bit quantities in most places. How is one expected to implement these for a 32 bit target? Do you invent non-existing instructions that operate on 64 bit and then lower them into equivalent real instructions?

view this post on Zulip Wasmtime GitHub notifications bot (Jan 13 2023 at 22:38):

jameysharp commented on issue #5572:

Cool, it's nice to hear from people working on more Cranelift backends! I'm curious what architecture you're targeting; glancing at your GitHub, I'm guessing it's your A32 RISC ISA? I'd also love to hear some about your goal: Do you want to run Wasmtime on this target, or compile Rust to it with cg-clif, or something else?

As you noticed, we don't currently have backends implemented for any 32-bit targets, so I'm not particularly surprised that you're running into situations where we didn't quite think through the implications for a 32-bit target.

There are existing cases where backends use what we refer to as "pseudo-instructions", which are what I think you're describing as "non-existing instructions". You can find a bunch of examples with git grep -E 'pseudo-?inst'; we haven't been super consistent about spelling apparently. Maybe there are cases where that would help with 64-bit assumptions, but off-hand my guess is there will be few if any of those cases.

For the most part I think you'll just need to do for 64-bit types what the existing backends do for 128-bit types: split the value across two registers and figure out how to combine register pairs as needed. For some relatively simple examples, look at how various backends implement (lower (has_type $I128 (iadd x y))).

The only case where you should need four registers in ValueRegs is if you're going to support 128-bit integers with 32-bit registers, and we're very inconsistent about I128 support across the existing backends anyway. So I'd just skip that for now. Once you have a proof of concept working we can figure out how to handle issues like ValueRegs being too small. It's a little tricky because I'm guessing making it bigger will have a measurable impact on compile time, and also because we'll need to do some API design work.

The only fn imm I see that takes a 64-bit value and calls temp_writable_reg is in the riscv64 backend. You don't have to implement that. It was apparently convenient for that backend but you can do something different if you need to.

I hope this helps give you a starting point. We're happy to answer questions if we can!

view this post on Zulip Wasmtime GitHub notifications bot (Jan 13 2023 at 23:38):

Artentus commented on issue #5572:

Hey, thanks for that detailed answer.

Yes you are right, I'm doing this for academic purposes atm. The ISA I'm targeting is similar to RV32I in a lot of ways so if you don't wanna read through it just pretend its RV32I. This is also the reason why I mostly used the RiscV backend as a template.

My goal is to compile Rust. I messed around with LLVM but writing a backend for that turned out to be really difficult. And since I know my way around Rust a lot better than C++ I decided to give Cranelift a go. Since this is mostly about the learning experience I don't really care that the generated code is not quite on par with LLVM.

Alright so I will ignore i128s for now. I guess as long as I don't use them in the code I want to compile they should never be emitted right? Same with floats probably.

I'm gonna read through the other backedns ISLE code and see how it's being done there. I'm still finding my way around ISLE.

view this post on Zulip Wasmtime GitHub notifications bot (Jan 13 2023 at 23:46):

bjorn3 commented on issue #5572:

If you want the rust standard library to compile without patches you need both i128 and floats. It is possible to patch them out though with a bit of work.

view this post on Zulip Wasmtime GitHub notifications bot (Jan 14 2023 at 22:10):

fitzgen commented on issue #5572:

I’d suggest looking at the aarch64 and x64 backends when looking for inspiration.

view this post on Zulip Wasmtime GitHub notifications bot (Jan 15 2023 at 02:06):

Artentus commented on issue #5572:

Ok, I got a very simple function to correctly compile now.
Before I continue I wanna ask: are there any tests I should modify/add specifically for the backend?
The other backends contain very little testing directly so I guess it happens somewhere else.

view this post on Zulip Wasmtime GitHub notifications bot (Jan 15 2023 at 03:39):

fitzgen commented on issue #5572:

Nice! Most testing is done via file tests (‘cranelift/filetests/filetests/isa/*’), fuzzing, and wasmtime exercising the codegen.

view this post on Zulip Wasmtime GitHub notifications bot (Jan 15 2023 at 03:45):

Artentus commented on issue #5572:

Ah thanks, that's what I was looking for.

view this post on Zulip Wasmtime GitHub notifications bot (Jan 19 2023 at 02:57):

Artentus commented on issue #5572:

I have implemented all of the arithmetic and bitwise operations now, except for division and remainder.
The architecture doesn't have division or remainder instructions so they have to be implemented algorithmically.
However emitting the algorithm inline does not seem like a wise choice, instead I want to generate a call to some function (I believe in LLVM the function in question would be called divsi3).

But since none of the other backends do this I don't know how to go about it. I can manage to emit a call but where does the function come from, how do I define it?
Ideally the function would be defined in target-specific code to hand-optimize it.

view this post on Zulip Wasmtime GitHub notifications bot (Jan 19 2023 at 02:59):

cfallin commented on issue #5572:

@Artentus we have a mechanism for this with LibCalls; we do have a few such calls in some of the instruction lowerings.

view this post on Zulip Wasmtime GitHub notifications bot (Jan 19 2023 at 03:03):

Artentus commented on issue #5572:

I saw that in the memcpy lowering but couldn't figure out how to actually define new ones.

view this post on Zulip Wasmtime GitHub notifications bot (Jan 19 2023 at 03:08):

cfallin commented on issue #5572:

The enum is defined here. If you grep for one of the existing ones, e.g. FloorF32, you'll see the places in the code where it's used and how the relocation can be handled to link to an actual function (e.g. in Wasmtime here to this function).

view this post on Zulip Wasmtime GitHub notifications bot (Jan 19 2023 at 03:09):

cfallin commented on issue #5572:

(Forgot to finish: then you can follow that example to define a new one; also in general, when adding a new option to an enum, the compile errors will usually lead you to where you need to make changes.)

view this post on Zulip Wasmtime GitHub notifications bot (Jan 19 2023 at 03:17):

Artentus commented on issue #5572:

Thanks!

However I assume this only works within the runtime correct?
Does that mean if I want to compile Rust using the rustc cranelift backend I have to supply these through the linker?

view this post on Zulip Wasmtime GitHub notifications bot (Jan 19 2023 at 05:02):

cfallin commented on issue #5572:

Yes, you would need to somehow provide implementations of the libcalls that the relocation resolution can find. cc @bjorn3 for the best way to do that...

view this post on Zulip Wasmtime GitHub notifications bot (Jan 19 2023 at 09:01):

afonso360 commented on issue #5572:

:wave: Hey, See #4460 for an example on how to define a new LibCall. Mostly once you provide the correct name in the module crate, the linker can usually figure it out automatically.

view this post on Zulip Wasmtime GitHub notifications bot (Jan 19 2023 at 09:03):

afonso360 edited a comment on issue #5572:

:wave: Hey, See #4460 for an example on how to define a new LibCall. Mostly once you provide the correct name in the module crate, the linker can usually figure it out automatically.

That PR also does some setup groundwork to emit libcall's from isle that you shouldn't need to repeat.

view this post on Zulip Wasmtime GitHub notifications bot (Jan 19 2023 at 17:34):

Artentus commented on issue #5572:

Yes I mostly copied the code from x64 emit_vm_call and it's correctly generating these calls now.
My main problem is how to get linkable object files that contain these functions, since there is no existing compilers for this target.

view this post on Zulip Wasmtime GitHub notifications bot (Jan 19 2023 at 18:32):

bjorn3 commented on issue #5572:

You could add it to the compiler-builtins rust crate I think. This crate is included in everything rustc compiles.

view this post on Zulip Wasmtime GitHub notifications bot (Jan 21 2023 at 06:31):

Artentus commented on issue #5572:

I am currently working on lowering for loads and stores, and there are these two ir instructions I'm unsure about: uload32 and sload32.
On a 32 bit architecture these are both functionally identical to just load.i32 but I'm wondering if the compiler will still try to emit these.

I tried experimenting and defined this test function:

function %uload32(i32) -> i32 {
block0(v0: i32):
  v1 = uload32.i32 v0+42
  return v1
}

But this yields the following error: inst1 (return v1): arg 0 (v1) has type i64, must match function signature of i32

So do I just completely ignore these?

view this post on Zulip Wasmtime GitHub notifications bot (Jan 21 2023 at 09:09):

bjorn3 commented on issue #5572:

uload and iload generally return a 64bit value by zero/sign extending a 32bit loaded value.

view this post on Zulip Wasmtime GitHub notifications bot (Jan 21 2023 at 09:09):

bjorn3 edited a comment on issue #5572:

uload and sload generally return a 64bit value by zero/sign extending a 32bit loaded value.

view this post on Zulip Wasmtime GitHub notifications bot (Jan 21 2023 at 09:18):

Artentus commented on issue #5572:

Well, uload8, sload8, uload16 and sload16 do work fine with 32 bit. This code compiles perfectly fine:

function %uload8(i32) -> i32 {
block0(v0: i32):
  v1 = uload8.i32 v0+42
  return v1
}

view this post on Zulip Wasmtime GitHub notifications bot (Jan 21 2023 at 15:04):

bjorn3 commented on issue #5572:

Looks like uload* and sload* have a constraint that their output size is bigger than what they load (up to 64bit). So for example uload32 will always return an i64 and uload16 an i32 or i64. https://github.com/bytecodealliance/wasmtime/blob/69cd0a6b1a78c9c793b549256d76c879b48cb3db/cranelift/codegen/meta/src/shared/instructions.rs#L814

view this post on Zulip Wasmtime GitHub notifications bot (Jan 21 2023 at 23:29):

Artentus commented on issue #5572:

Ah, I guess that makes sense. So I should probably implement all of them for both 32 bit _and_ 64 bit right?

view this post on Zulip Wasmtime GitHub notifications bot (Jan 22 2023 at 07:50):

bjorn3 commented on issue #5572:

uload8 and sload8 should he implemented for i16, i32 and i64 (i16 can be skipped for wasm I think, uload16 and sload16 should be implemented for i32 and i64. Uload32 and uload64 should be implemented for i64.

view this post on Zulip Wasmtime GitHub notifications bot (Jan 22 2023 at 07:50):

bjorn3 edited a comment on issue #5572:

uload8 and sload8 should he implemented for i16, i32 and i64 (i16 can be skipped for wasm I think, uload16 and sload16 should be implemented for i32 and i64. Uload32 and sload32 should be implemented for i64.

view this post on Zulip Wasmtime GitHub notifications bot (Jan 27 2023 at 17:11):

Artentus commented on issue #5572:

If you want the rust standard library to compile without patches you need both i128 and floats.

Unfortunately it seems like this is true even for the core library. I tried to compile it but am gettings loads of errors related to invalid SSA types (f32 and i128).

So how do I go about patching that stuff out?

view this post on Zulip Wasmtime GitHub notifications bot (Jan 27 2023 at 17:23):

bjorn3 commented on issue #5572:

I know there are usages in core::time and core::num. Try commenting them out and see what breaks. Then comment everything out that depends on those impls and repeat. You might take some inspiration from the patch file https://github.com/bjorn3/rustc_codegen_cranelift/commit/425b527dc31c4d88ed2099bcc91af96cebf8a463 introduced to remove i128 usage from libcore. It doesn't apply to the latest version anymore, but it may be useful as guide.

view this post on Zulip Wasmtime GitHub notifications bot (Jan 28 2023 at 21:17):

Artentus commented on issue #5572:

Thank you very much, I actually managed to have it compile the (modified) core library without errors.

However I must say it is very awkward having to inject a custom core library.
I think I will actually do the work and add libcalls for all the float instructions. It's tedious but not difficult.

128 bit integers on the other hand do pose a problem, I'm not sure what to do about that.

view this post on Zulip Wasmtime GitHub notifications bot (Jan 29 2023 at 16:29):

bjorn3 commented on issue #5572:

For rustc_codegen_cranelift the patches directory contains various kinds of patches. Those with -sysroot- are applied to the standard library. You can commit your changes, do git format-patch HEAD~..HEAD (or if you have multiple commits use the appropriate amount of ~) and then move the generated patch to the patches directory and rename it to include -sysroot-. After that running ./y.rs prepare again will apply the patch to the standard library sources that ./y.rs build uses. If you need to only build libcore and not the rest of the standard library, you can edit build_sysroot/Cargo.toml to remove the dependencies on the rest. (you will likely need to keep compiler_builtins) Note that you will have to do ./y.rs prepare again after editing it as it is copied elsewhere. (I plan to make this automatic, but haven't gotten to it yet.)

view this post on Zulip Wasmtime GitHub notifications bot (Jan 29 2023 at 17:00):

Artentus commented on issue #5572:

I think adding support for 128 bit integers in the backend is the better solution than patching the core library.
Modifying Cranelift to support 4 register values wasn't actually all that difficult.

view this post on Zulip Wasmtime GitHub notifications bot (Jan 31 2023 at 20:20):

Artentus commented on issue #5572:

I'm successfully compiling code now, including the entire Rust core and alloc libraries.

However after inspecting the generated code I am seeing some very weird things being generated.
To be clear, I never expected Cranelift to generate code that is optimized to the same degree as LLVM, it clearly states so after all.
But I did not expect it to do such hilariously bad things.

I compiled this very basic hello world main function:

mod io {
    pub fn serial_write(s: &str) {
        // ...
    }
}

fn main() {
    io::serial_write("Hello World!");
}

And this is the generated assembly (manually commented):

;; allocate stack memory
sub sp, sp, 32

;; load address of "Hello World!" string (0 because pre-relocation)
ldui t7, 0
or a0, t7, 0

;; store address of "Hello World!" string at sp+0
add a1, sp, 0
st [a1], a0

;; store length of "Hello World!" string (12 bytes) at sp+4
add a1, sp, 4
ldi a2, 12
st [a1], a2

;; load address from sp+0
add a2, sp, 0
ld a2, [a2]

;; load length from sp+4
add a3, sp, 4
ld a3, [a3]

;; store address at sp+16
add a4, sp, 16
st [a4], a2

;; store length at sp+20
add a4, sp, 20
st [a4], a3

;; load address from sp+16
add a4, sp, 16
ld a0, [a4]

;; load length from sp+20
add a4, sp, 20
ld a1, [a4]

;; call io::serial_write (0 because pre-relocation)
ldui t7, 0
or a4, t7, 0
jl ra, a4

;; free stack memory
add sp, sp, 32

As you can see what should be loading two simple arguments into a0 and a1 registers is actually round tripping two times through memory for absolutely no reason.
Also it is not making use of offsetting memory instructions at all, instead emitting an add+load/store.

view this post on Zulip Wasmtime GitHub notifications bot (Jan 31 2023 at 20:29):

cfallin commented on issue #5572:

@Artentus I wonder if it would make sense to move some of this discussion to a Zulip thread? We're interested in helping you support your ISA, but normally we like to use GitHub issues more for targeted single issues.

In any case, it's hard to say much without actually playing with your new backend. Are you enabling optimizations? How did you implement the ABI? What does the CLIF look like? The redundant ops could be coming from any one of several layers.

view this post on Zulip Wasmtime GitHub notifications bot (Jan 31 2023 at 20:30):

bjorn3 commented on issue #5572:

I'm successfully compiling code now, including the entire Rust core and alloc libraries.

:tada:

However after inspecting the generated code I am seeing some very weird things being generated.

Yeah, rustc generates terrible MIR with a lot of copies in it. I try to remove some of the copies in cg_clif, but many are passed through to Cranelift. Cranelift doesn't have the optimization passes necessary to remove them. If you enable optimizations for cg_clif you this will enable some optimizations working on MIR which may tidy it up a bit.

view this post on Zulip Wasmtime GitHub notifications bot (Jan 31 2023 at 20:50):

Artentus commented on issue #5572:

@cfallin got it. I have just never used Zulip in my life so I have to sign up first.

@bjorn3 I'm already compiling the compiler itself in release mode if that's what you mean. Test program was also compiled in release mode.

view this post on Zulip Wasmtime GitHub notifications bot (Feb 01 2023 at 13:37):

bjorn3 commented on issue #5572:

I just checked it and this is indeed caused by the MIR containing copies:

fn main() -> () {
    let mut _0: ();                      // return place in scope 0 at /app/example.rs:7:15: 7:15
    let _1: ();                          // in scope 0 at /app/example.rs:8:5: 8:37
    let mut _2: &str;                    // in scope 0 at /app/example.rs:8:22: 8:36
    let _3: &str;                        // in scope 0 at /app/example.rs:8:22: 8:36

    bb0: {
        _3 = const "Hello World!";       // scope 0 at /app/example.rs:8:22: 8:36
        _2 = _3;                         // scope 0 at /app/example.rs:8:22: 8:36
        _1 = serial_write(move _2) -> bb1; // scope 0 at /app/example.rs:8:5: 8:37
    }

    bb1: {
        return;                          // scope 0 at /app/example.rs:9:2: 9:2
    }
}

The first store is _3 = const "Hello World!";, the first load and store is _2 = _3; and the last load is the move _2 argument. Compiling with -Zmir-opt-level=3 optimizes away the _2 = _3.

This seems to be a case where cg_clif itself should be able to avoid stack allocation, but for some reason it doesn't. I'm currently investigating.

view this post on Zulip Wasmtime GitHub notifications bot (Feb 01 2023 at 13:57):

bjorn3 commented on issue #5572:

Found it. This was a regression from https://github.com/bjorn3/rustc_codegen_cranelift/commit/777d4732dc2389752ebaa4fc7256245c4873bc25. I've fixed it in https://github.com/bjorn3/rustc_codegen_cranelift/commit/df04fd6fba2e1b2b85a108e3ec7e56c731456ba9 for a 12% runtime perf win on a benchmark I use. Thanks for pointing this out!

view this post on Zulip Wasmtime GitHub notifications bot (Feb 02 2023 at 03:53):

Artentus commented on issue #5572:

That's great news! 12% from such a small change is huge.

This patch did get rid of one of the unnecessary copies.
I believe the second one is because of a transmute from &str to &[u8] which Cranelift can't optimize out. But that's fine.

I also figured out how to optimize those add+load/store sequences into a single instruction.
Just needed a rule to cover stack_addr followed by load or store IR instructions.

Since everything is working I believe I can now close this. Thanks to everyone who helped me with this, I got way more and way faster responses than I ever expected. The codebase was also very pleasant to work with, especially compared to the mess that is LLVM.

If you're interested I can PR the changes that are required to fully support 32 bit targets in cranelift_codegen.
It's very little code, essentially just extending ValueRegs to size of 4. Easily feature gated as well so if no 32 bit targets are enabled there won't be any impact.

view this post on Zulip Wasmtime GitHub notifications bot (Feb 02 2023 at 03:53):

Artentus closed issue #5572:

Hello,

I am currently trying to write a Cranelift backend that targets a 32 bit architecture but am encountering some issues.
Since all the existing backends are 64 bit am I not sure whether this is due to missing support or my lack of understanding.

For example, codegen::machinst::valueregs::ValueRegs states in its description we cap the capacity of this at four (when any 32-bit target is enabled) or two (otherwise), however the capacity is hardcoded to 2.

Also, inside the codegen::machinst::isle::isle_lower_prelude_methods macro, the function temp_writable_reg assumes only a single register is returned. However, this function is called in IsleContext<'_, '_, MInst, XYZBackend>::imm, which takes 64 bit values, so on a 32 bit target two registers are required.

In general, the Context trait generated by ISLE requires 64 bit quantities in most places. How is one expected to implement these for a 32 bit target? Do you invent non-existing instructions that operate on 64 bit and then lower them into equivalent real instructions?

view this post on Zulip Wasmtime GitHub notifications bot (Feb 02 2023 at 16:18):

cfallin commented on issue #5572:

If you're interested I can PR the changes that are required to fully support 32 bit targets in cranelift_codegen.
It's very little code, essentially just extending ValueRegs to size of 4. Easily feature gated as well so if no 32 bit targets are enabled there won't be any impact.

We actually had exactly this code before, when we had a (partial) arm32 backend in-tree! I think it's probably best to wait until we add a new target to bring this back, as we are also considering other ways now of supporting wide values (namely: legalization/narrowing rules in the mid-end, as Cranelift used to do actually) that might be simpler in the end.

I'm happy you got this working though, it's a really neat demo of portability! And if you want to contribute any documentation regarding rough edges or confusing bits you hit, that might also be a good outcome from this :-)

view this post on Zulip Wasmtime GitHub notifications bot (Feb 02 2023 at 19:34):

Artentus commented on issue #5572:

A pice of documentation I would have loved to have is a listing of all CLIF instructions, just to know what needs to be implemented instead of trying to compile something and then reading error messages about missing lowering rules.
Some of the instructions also have quirks that weren't immediately obvious to me.

If you tell me where to put it I can start working on it.

view this post on Zulip Wasmtime GitHub notifications bot (Feb 02 2023 at 19:40):

cfallin commented on issue #5572:

A pice of documentation I would have loved to have is a listing of all CLIF instructions, just to know what needs to be implemented instead of trying to compile something and then reading error messages about missing lowering rules. Some of the instructions also have quirks that weren't immediately obvious to me.

If you tell me where to put it I can start working on it.

The closest we have to that is the InstBuilder trait docs that are generated from the source (that ultimately come from the doc-strings in cranelift/codegen/meta/shared/src/instructions.rs). Any updates to those descriptions that would have been helpful are very welcome contributions!

We have a cranelift/docs/ directory as well that could use some attention in general; if you come up with any more general thoughts from your experience please do feel free to start a porting.md or similar! It's on our roadmap for this year to update docs and a porting-to-new-ISA guide is something we do want to build out, eventually.


Last updated: Dec 23 2024 at 13:07 UTC