Pulley registers · wasmtime · Zulip Chat Archive

Stream: wasmtime

Topic: Pulley registers

kmeakin (Aug 03 2024 at 22:19):

Why does the Pulley VM only have 32 registers if each register is encoded in 1 byte? You could have 256 registers without making the encoding any longer.

bjorn3 (Aug 04 2024 at 09:48):

That would make the MachineState much larger without significantly reducing the amount of instructions that need to be executed. And a larger MachineState would take up more space in the L1 cache, reducing the space for actually useful data.
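To put rough numbers on this, here is a sketch that assumes a register file made only of 64-bit integer registers and 64-byte cache lines. `RegFile` and the helper functions are hypothetical. Pulley's actual `MachineState` holds other register classes too, so only the 8x scaling is meaningful.

```rust
use std::mem::size_of;

// Hypothetical, simplified register file: N 64-bit integer registers and
// nothing else. Not Pulley's real MachineState layout.
#[allow(dead_code)]
struct RegFile<const N: usize> {
    x: [u64; N],
}

// Size in bytes of a register file with N registers.
fn reg_file_bytes<const N: usize>() -> usize {
    size_of::<RegFile<N>>()
}

// Number of cache lines the register file spans, assuming 64-byte lines.
fn cache_lines(bytes: usize) -> usize {
    const LINE: usize = 64;
    (bytes + LINE - 1) / LINE
}
```

Under those assumptions, 32 registers take 256 bytes (4 cache lines) and 256 registers take 2 KiB (32 cache lines).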

bjorn3 (Aug 04 2024 at 09:49):

I don't know if the above is the actual reason why 32 registers was chosen though.

Alex Crichton (Aug 04 2024 at 21:58):

I believe 32 was chosen to match aarch64 and riscv64 for now, but AFAIK it hasn't been scientifically chosen. The encoding of opcodes is relatively inefficient right now, and the hope is to eventually encode 3 operands in 15 bits (5 bits per register) via a u16. Each register taking a single byte is mostly just for ease right now.
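A minimal sketch of that packing, assuming dst in bits 0..5, src1 in bits 5..10 and src2 in bits 10..15, with bit 15 unused. The layout, `XReg`, and the function names are hypothetical, not Pulley's actual encoding. Decoding also makes visible the per-operand shift-and-mask cost that comes up in the next messages.

```rust
// A 64-bit integer register index; 32 registers fit in 5 bits.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct XReg(u8);

impl XReg {
    fn new(index: u8) -> Option<XReg> {
        if index < 32 { Some(XReg(index)) } else { None }
    }
}

// Assumed layout: dst in bits 0..5, src1 in bits 5..10, src2 in bits 10..15.
fn pack_binop(dst: XReg, src1: XReg, src2: XReg) -> u16 {
    u16::from(dst.0) | (u16::from(src1.0) << 5) | (u16::from(src2.0) << 10)
}

// Decoding costs one shift and one mask per operand.
fn unpack_binop(bits: u16) -> (XReg, XReg, XReg) {
    let field = |shift: u16| XReg(((bits >> shift) & 0b1_1111) as u8);
    (field(0), field(5), field(10))
}

// Whole instruction: 1 opcode byte + 2 operand bytes (little-endian) = 3 bytes,
// versus 4 bytes with one byte per register.
fn encode_binop(opcode: u8, dst: XReg, src1: XReg, src2: XReg) -> Vec<u8> {
    let mut out = vec![opcode];
    out.extend_from_slice(&pack_binop(dst, src1, src2).to_le_bytes());
    out
}
```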

kmeakin (Aug 04 2024 at 22:04):

That makes sense

kmeakin (Aug 04 2024 at 22:05):

Would add a few extra instructions to extract registers from the instruction stream, but I guess saving 1 byte per instruction makes up for it

kmeakin (Aug 04 2024 at 22:07):

I really like the higher order macro trick for declaring instructions btw. Never seen that before but I'll look for excuses to use it in future
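For readers who haven't seen the trick: the instruction list is written once inside a macro that takes another macro's name as its argument and invokes it with the whole list, so every consumer (opcode enum, disassembler, encoder, ...) is generated from the same source. This is an illustrative sketch only; the macro names, mnemonics and operand syntax are assumptions, not Pulley's actual definitions.

```rust
// Single source of truth: instruction mnemonic, type name, operands.
// Consumers are macros passed in as `$m`.
macro_rules! for_each_op {
    ($m:ident) => {
        $m! {
            xconst8 = XConst8 { dst: u8, imm: i8 };
            xadd32 = XAdd32 { dst: u8, src1: u8, src2: u8 };
            store64_offset8 = Store64Offset8 { ptr: u8, offset: i8, src: u8 };
        }
    };
}

// Consumer 1: an opcode enum, one variant per instruction, in list order.
macro_rules! define_opcodes {
    ($( $name:ident = $Name:ident { $( $field:ident : $ty:ty ),* } ; )*) => {
        #[derive(Clone, Copy, Debug, PartialEq, Eq)]
        #[repr(u8)]
        enum Opcode {
            $( $Name, )*
        }
    };
}
for_each_op!(define_opcodes);

// Consumer 2: the mnemonic a disassembler would print.
macro_rules! define_mnemonics {
    ($( $name:ident = $Name:ident { $( $field:ident : $ty:ty ),* } ; )*) => {
        fn mnemonic(op: Opcode) -> &'static str {
            match op {
                $( Opcode::$Name => stringify!($name), )*
            }
        }
    };
}
for_each_op!(define_mnemonics);

// Consumer 3: operand count, derived from the same field lists.
macro_rules! define_operand_counts {
    ($( $name:ident = $Name:ident { $( $field:ident : $ty:ty ),* } ; )*) => {
        fn operand_count(op: Opcode) -> usize {
            match op {
                $( Opcode::$Name => <[&str]>::len(&[$( stringify!($field) ),*]), )*
            }
        }
    };
}
for_each_op!(define_operand_counts);
```

Adding an instruction means editing only `for_each_op!`; every consumer picks it up automatically, which is the "keeping things in sync" benefit.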

Alex Crichton (Aug 04 2024 at 22:18):

heh not exactly the most readable but it is quite nice for keeping things in sync!

fitzgen (he/him) (Aug 06 2024 at 00:19):

Yeah, it is as Alex says. 32 seemed like "enough" and we can (eventually) shave a byte off of a = b op c-style instructions.

will be really nice to get the rest of pulley landed (cranelift backend and runtime integration) so that we can start tweaking things and determine which is more important: more registers or smaller instructions

fitzgen (he/him) (Aug 06 2024 at 00:19):

working on landing those other parts soon

kmeakin (Aug 07 2024 at 22:52):

fitzgen (he/him) said:

Yeah, it is as Alex says. 32 seemed like "enough" and we can (eventually) shave a byte off of a = b op c-style instructions.

will be really nice to get the rest of pulley landed (cranelift backend and runtime integration) so that we can start tweaking things and determine which is more important: more registers or smaller instructions

You could go even further and have 2-byte encodings for dst = dst op src2 when both registers are in x0-x15 (1 byte for the opcode, 4 bits for each register). IIRC RISC-V has something similar in its 2-byte compressed ISA
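A sketch of that idea under stated assumptions: the compressed form is 1 opcode byte plus one byte holding dst in the low nibble and src2 in the high nibble, with a fallback to a hypothetical 4-byte `dst = src1 op src2` form when either register is x16 or above. The opcode values and function names are made up for illustration.

```rust
// Returns the 2-byte compressed encoding if both registers fit in 4 bits.
fn encode_compressed(opcode: u8, dst: u8, src2: u8) -> Option<[u8; 2]> {
    if dst < 16 && src2 < 16 {
        Some([opcode, (src2 << 4) | dst])
    } else {
        None
    }
}

// Returns (opcode, dst, src2).
fn decode_compressed(bytes: [u8; 2]) -> (u8, u8, u8) {
    (bytes[0], bytes[1] & 0x0f, bytes[1] >> 4)
}

// Prefer the compressed form; otherwise emit the full form with dst repeated
// as the first source operand.
fn encode_two_operand(opcode_c: u8, opcode_full: u8, dst: u8, src2: u8) -> Vec<u8> {
    match encode_compressed(opcode_c, dst, src2) {
        Some(bytes) => bytes.to_vec(),
        None => vec![opcode_full, dst, dst, src2],
    }
}
```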

fitzgen (he/him) (Aug 07 2024 at 22:55):

indeed, I've also thought about that kind of thing as well haha

fitzgen (he/him) (Aug 07 2024 at 22:56):

fyi, I'm taking a look at the binary operands bitpacking PR now, but I think I'd prefer waiting to land it until after the cranelift backend lands, just to minimize churn/rebasing on that larger, fiddly amount of code

kmeakin (Aug 07 2024 at 22:56):

sure. no problem

fitzgen (he/him) (Aug 07 2024 at 22:56):

I'm just writing some filetests right now and then the backend should be ready to be made into a PR

fitzgen (he/him) (Aug 08 2024 at 00:32):

(and here is the PR introducing the pulley backend to cranelift: https://github.com/bytecodealliance/wasmtime/pull/9089)

Cranelift: Add a new backend for emitting Pulley bytecode by fitzgen · Pull Request #9089 · bytecodealliance/wasmtime

This commit adds two new backends for Cranelift that emits 32- and 64-bit Pulley bytecode. The backends are both actually the same, with a common implementation living in cranelift/codegen/src/isa/...

kmeakin (Aug 09 2024 at 17:31):

Hey @fitzgen (he/him) I'm still a bit confused about stack manipulation instructions.
I believe instructions to increment/decrement the SP directly are unnecessary, because the increment/decrement can be done in the push/pop instruction

kmeakin (Aug 09 2024 at 17:33):

I'm looking at the tests from
https://github.com/bytecodealliance/wasmtime/blob/ee57c2b0994e58bdd7cbdaa30e72d1a85a800fee/cranelift/filetests/filetests/isa/pulley32/call.clif
and it seems to me like the adjustment to the SP is always word_size * number_of_regs


kmeakin (Aug 09 2024 at 17:34):

e.g.:

;       11: 0e 23 d0                        xconst8 spilltmp0, -48
;       14: 12 20 20 23                     xadd32 sp, sp, spilltmp0
;       18: 0e 0f 00                        xconst8 x15, 0
;       1b: 2a 20 0f                        store64 sp, x15
;       1e: 2c 20 08 0f                     store64_offset8 sp, 8, x15
;       22: 2c 20 10 0f                     store64_offset8 sp, 16, x15
;       26: 2c 20 18 0f                     store64_offset8 sp, 24, x15
;       2a: 2c 20 20 0f                     store64_offset8 sp, 32, x15
;       2e: 2c 20 28 0f                     store64_offset8 sp, 40, x15

subtracts 48 from SP, then writes 6 words (8 bytes each) to the stack

kmeakin (Aug 09 2024 at 17:34):

but you could just have a push instr that also updated the SP

kmeakin (Aug 09 2024 at 17:35):

so 6 push instrs would each decrement the SP by 8 bytes, and at the end the SP has been decremented by 48 in total
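A toy model of this point, assuming a byte-addressed stack that grows downward and a hypothetical `push64` instruction (not an existing Pulley opcode). One SP adjustment followed by six offset stores leaves the same SP and stack contents as six pushes, provided the pushes go in reverse order.

```rust
// Toy machine: byte-addressed stack growing downward; `sp` indexes `mem`.
// Hypothetical model, not Pulley's interpreter.
struct Machine {
    mem: Vec<u8>,
    sp: usize,
    x: [u64; 32],
}

impl Machine {
    fn new(stack_bytes: usize) -> Machine {
        Machine { mem: vec![0; stack_bytes], sp: stack_bytes, x: [0; 32] }
    }

    // Like `xconst8 spilltmp0, -48; xadd32 sp, sp, spilltmp0`.
    fn add_sp(&mut self, delta: isize) {
        self.sp = self.sp.checked_add_signed(delta).expect("stack pointer out of range");
    }

    // Like `store64_offset8 sp, offset, reg`.
    fn store64_offset(&mut self, offset: usize, reg: usize) {
        let at = self.sp + offset;
        self.mem[at..at + 8].copy_from_slice(&self.x[reg].to_le_bytes());
    }

    // Hypothetical `push64 reg`: decrement SP by 8, then store at the new SP.
    fn push64(&mut self, reg: usize) {
        self.add_sp(-8);
        self.store64_offset(0, reg);
    }
}

// Current codegen shape: one SP adjustment, then N offset stores.
fn adjust_then_store(m: &mut Machine, regs: &[usize]) {
    m.add_sp(-8 * regs.len() as isize);
    for (i, &r) in regs.iter().enumerate() {
        m.store64_offset(8 * i, r);
    }
}

// Push-based shape: N pushes, reversed so regs[0] ends up at the lowest address.
fn push_all(m: &mut Machine, regs: &[usize]) {
    for &r in regs.iter().rev() {
        m.push64(r);
    }
}
```

The number of stores is unchanged (N stores vs. N pushes); what the push form drops is the separate `xconst8`/`xadd32` SP adjustment and the per-store offset operand.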

fitzgen (he/him) (Aug 09 2024 at 17:40):

yeah I guess if you still have to do the moves of each register into the allocated stack space, then it is still N instructions. my b, I hadn't been thinking about each store.

we could I guess add a variable number of registers to be spilled into the allocated stack space, or have a few variations with a fixed number of registers to spill, but those are both starting to get pretty funky

so I think a push instruction could indeed make sense. that said, I think we still want to fold push lr; push fp; fp = sp into a single macro instruction

kmeakin (Aug 09 2024 at 17:41):

fitzgen (he/him) said:

so I think a push instruction could indeed make sense. that said, I think we still want to fold push lr; push fp; fp = sp into a single macro instruction

Yes, I agree a macro instruction to do the prologue/epilogue would be good, but I don't think it would need a size argument

fitzgen (he/him) (Aug 09 2024 at 17:41):

yep

kmeakin (Aug 09 2024 at 17:42):

fitzgen (he/him) said:

we could I guess add a variable number of registers to be spilled into the allocated stack space, or have a few variations with a fixed number of registers to spill, but those are both starting to get pretty funky

Like the old Arm32 instructions that could push/pop a whole list of regs in 1 instruction?

fitzgen (he/him) (Aug 09 2024 at 17:42):

I am not familiar with arm32, but that sounds right

kmeakin (Aug 09 2024 at 17:43):

https://developer.arm.com/documentation/dui0802/b/A32-and-T32-Instructions/PUSH-and-POP

fitzgen (he/him) (Aug 09 2024 at 17:44):

yeah exactly like that. encoding-wise we would do something like <opcode> <length> (<reg>)^length

fitzgen (he/him) (Aug 09 2024 at 17:45):

where (<reg>)^length is length repetitions of <reg>, in case that isn't clear
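A sketch of that `<opcode> <length> (<reg>)^length` layout, assuming one byte each for opcode, length and register. The function names and opcode values are illustrative only.

```rust
// Encode: opcode byte, length byte, then one byte per register.
fn encode_push_list(opcode: u8, regs: &[u8]) -> Vec<u8> {
    assert!(regs.len() <= u8::MAX as usize, "too many registers for a u8 length");
    let mut out = Vec::with_capacity(2 + regs.len());
    out.push(opcode);
    out.push(regs.len() as u8);
    out.extend_from_slice(regs);
    out
}

// Decode: returns (opcode, registers, bytes consumed), or None if truncated.
fn decode_push_list(bytes: &[u8]) -> Option<(u8, &[u8], usize)> {
    let (&opcode, rest) = bytes.split_first()?;
    let (&len, rest) = rest.split_first()?;
    let regs = rest.get(..len as usize)?;
    Some((opcode, regs, 2 + len as usize))
}
```

The instruction length here is `2 + n` bytes, so the decoder has to read the length before it knows where the next instruction starts.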

kmeakin (Aug 09 2024 at 17:45):

They abandoned it in the 32-bit to 64-bit transition because it raised awkward questions like "what happens if an interrupt is received in the middle?", but we should have no such worries

fitzgen (he/him) (Aug 09 2024 at 17:46):

heh, nice

kmeakin (Aug 09 2024 at 17:46):

fitzgen (he/him) said:

yeah exactly like that. encoding-wise we would do something like <opcode> <length> (<reg>)^length

what about a u32 bitmask? Set the nth bit to 1 to push register n

fitzgen (he/him) (Aug 09 2024 at 17:46):

ooo I like that

fitzgen (he/him) (Aug 09 2024 at 17:46):

nice
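A sketch of the u32 bitmask proposal: bit n set means "push xn", so the instruction is always 1 opcode byte plus 4 mask bytes. The set bits are walked with `trailing_zeros` and cleared with `mask & (mask - 1)`. The function names, opcode value, and little-endian mask layout are assumptions for illustration.

```rust
// Build a mask from a register list; order and duplicates don't matter.
fn regs_to_mask(regs: &[u8]) -> u32 {
    regs.iter().fold(0u32, |mask, &r| {
        assert!(r < 32, "register x{r} does not fit in a u32 mask");
        mask | (1u32 << r)
    })
}

// Fixed-size encoding: opcode byte followed by the mask in little-endian order.
fn encode_push_mask(opcode: u8, mask: u32) -> [u8; 5] {
    let m = mask.to_le_bytes();
    [opcode, m[0], m[1], m[2], m[3]]
}

// Registers in ascending order: take the lowest set bit, then clear it.
fn mask_to_regs(mut mask: u32) -> Vec<u8> {
    let mut regs = Vec::new();
    while mask != 0 {
        regs.push(mask.trailing_zeros() as u8);
        mask &= mask - 1;
    }
    regs
}
```

Compared with a length-prefixed list, this is always 5 bytes, but a mask cannot express an order, so the push order has to be fixed by convention (ascending register number in this sketch).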

fitzgen (he/him) (Aug 09 2024 at 17:50):

also, fyi: https://github.com/bytecodealliance/wasmtime/blob/main/cranelift/bitset/src/scalar.rs#L47


kmeakin (Aug 09 2024 at 18:19):

ah nice, i was already trying to figure out the bit twiddling myself

fitzgen (he/him) (Aug 09 2024 at 18:22):

I could foresee us eventually adding unchecked_* variants to that type as well, if the various assert!(..)s end up being too expensive during decoding or whatever

but we can cross that bridge when we get to it, ofc

fitzgen (he/him) (Aug 09 2024 at 18:23):

e.g. unchecked_insert that doesn't assert that the value inserted is in the range of the scalar backing storage
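A hypothetical stand-in for that checked/unchecked split. `RegMask` is not `cranelift_bitset::ScalarBitSet`, and its methods only mirror the `unchecked_insert` idea from this discussion. The checked `insert` always asserts the index is in range, while `unchecked_insert` keeps only a `debug_assert!`, so release builds skip the check.

```rust
#[derive(Clone, Copy, Debug, Default, PartialEq, Eq)]
struct RegMask(u32);

impl RegMask {
    // Checked: panics in every build profile if `i` is out of range.
    // Returns whether the bit was newly set.
    fn insert(&mut self, i: u8) -> bool {
        assert!(i < 32, "bit index {i} out of range for u32 storage");
        self.unchecked_insert(i)
    }

    // Unchecked: the caller must guarantee `i < 32`; only debug builds verify it.
    // Returns whether the bit was newly set.
    fn unchecked_insert(&mut self, i: u8) -> bool {
        debug_assert!(i < 32);
        let bit = 1u32 << i;
        let newly_set = (self.0 & bit) == 0;
        self.0 |= bit;
        newly_set
    }

    fn contains(&self, i: u8) -> bool {
        i < 32 && (self.0 & (1u32 << i)) != 0
    }
}
```

In a release build without overflow checks, an out-of-range `i` in `unchecked_insert` would not panic: Rust masks the shift amount, so the wrong bit gets set silently. That is why the range guarantee moves to the caller.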

Last updated: Oct 23 2024 at 20:03 UTC