Stream: wasmtime

Topic: Pulley registers


view this post on Zulip kmeakin (Aug 03 2024 at 22:19):

Why does the pulley VM only have 32 registers if each register is encoded by 1 byte? You could have 256 registers without making the encoding any longer

view this post on Zulip bjorn3 (Aug 04 2024 at 09:48):

That would make the MachineState much larger without significantly reducing the amount of instructions that need to be executed. And a larger MachineState would take up more space in the L1 cache, reducing the space for actually useful data.

view this post on Zulip bjorn3 (Aug 04 2024 at 09:49):

I don't know if the above is the actual reason why 32 registers was chosen though.

view this post on Zulip Alex Crichton (Aug 04 2024 at 21:58):

I believe 32 was chosen to match aarch64 and riscv64 for now, but AFAIK it hasn't been scientifically chosen. The encoding of opcodes is relatively inefficient right now and the hope is to encode 3 operands in 15 bits via a u16 in the future. Each register taking a single byte is mostly just for ease right now
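
(Editor's sketch, not Pulley's actual encoding.) With 32 registers each index needs 5 bits, so three operands fit in the low 15 bits of a u16. The function names and the field order below are assumptions made up for illustration:

```rust
/// Bits needed to name one of 32 registers.
const REG_BITS: u32 = 5;
/// Mask selecting a single 5-bit register field.
const REG_MASK: u16 = (1 << REG_BITS) - 1;

/// Pack three register indices (each in 0..32) into the low 15 bits of a u16.
/// Field order (dst lowest) is an arbitrary choice for this sketch.
fn pack_operands(dst: u8, src1: u8, src2: u8) -> u16 {
    assert!(dst < 32 && src1 < 32 && src2 < 32, "register index out of range");
    (dst as u16) | ((src1 as u16) << REG_BITS) | ((src2 as u16) << (2 * REG_BITS))
}

/// Inverse of `pack_operands`: the shifts and masks here are the "few extra
/// instructions" a decoder pays for the smaller encoding.
fn unpack_operands(bits: u16) -> (u8, u8, u8) {
    let dst = (bits & REG_MASK) as u8;
    let src1 = ((bits >> REG_BITS) & REG_MASK) as u8;
    let src2 = ((bits >> (2 * REG_BITS)) & REG_MASK) as u8;
    (dst, src1, src2)
}
```

Compared with one byte per register, this takes 2 bytes instead of 3 for the operands of a three-register instruction.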

view this post on Zulip kmeakin (Aug 04 2024 at 22:04):

That makes sense

view this post on Zulip kmeakin (Aug 04 2024 at 22:05):

Would add a few extra instructions to extract registers from instruction stream but I guess saving 1 byte per instruction makes up for it

view this post on Zulip kmeakin (Aug 04 2024 at 22:07):

I really like the higher order macro trick for declaring instructions btw. Never seen that before but I'll look for excuses to use it in future
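
(Editor's sketch of the general "higher-order macro" pattern, not the Pulley source.) One macro owns the instruction list and takes the name of another macro to apply to it, so every consumer is generated from the same list. All names here (`for_each_toy_op`, `Opcode`, and so on) are hypothetical:

```rust
// The single source of truth: a list of ops, handed to whichever macro
// the caller names.
macro_rules! for_each_toy_op {
    ($mac:ident) => {
        $mac! {
            Add = 0;
            Sub = 1;
            Mul = 2;
        }
    };
}

// Consumer 1: turn the list into an enum.
macro_rules! define_opcode_enum {
    ($($name:ident = $value:literal;)*) => {
        #[derive(Debug, Clone, Copy, PartialEq, Eq)]
        enum Opcode { $($name = $value,)* }
    };
}

// Consumer 2: turn the same list into name and decode functions.
macro_rules! define_opcode_helpers {
    ($($name:ident = $value:literal;)*) => {
        fn opcode_name(op: Opcode) -> &'static str {
            match op { $(Opcode::$name => stringify!($name),)* }
        }
        fn decode_opcode(byte: u8) -> Option<Opcode> {
            match byte { $($value => Some(Opcode::$name),)* _ => None }
        }
    };
}

for_each_toy_op!(define_opcode_enum);
for_each_toy_op!(define_opcode_helpers);
```

Adding an op to `for_each_toy_op` updates the enum, the names, and the decoder together, which is the "keeping things in sync" benefit.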

view this post on Zulip Alex Crichton (Aug 04 2024 at 22:18):

heh not exactly the most readable but it is quite nice for keeping things in sync!

view this post on Zulip fitzgen (he/him) (Aug 06 2024 at 00:19):

Yeah, it is as Alex says. 32 seemed like "enough" and we can (eventually) shave a byte off of a = b op c-style instructions.

will be really nice to get the rest of pulley landed (cranelift backend and runtime integration) so that we can start tweaking things and determine which is more important: more registers or smaller instructions

view this post on Zulip fitzgen (he/him) (Aug 06 2024 at 00:19):

working on landing those other parts soon

view this post on Zulip kmeakin (Aug 07 2024 at 22:52):

fitzgen (he/him) said:

Yeah, it is as Alex says. 32 seemed like "enough" and we can (eventually) shave a byte off of a = b op c-style instructions.

will be really nice to get the rest of pulley landed (cranelift backend and runtime integration) so that we can start tweaking things and determine which is more important: more registers or smaller instructions

You could go even further and have 2-byte encodings for dst = op dst src2 for registers x0-x15 (1 byte for opcode, 4 bits for each register). IIRC RISC-V has something similar for their 2 byte compressed ISA
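
(Editor's sketch of the 2-byte idea, under assumptions.) One byte holds the opcode and the second byte holds two 4-bit register indices, so only x0-x15 are addressable. The function names and nibble order are invented for illustration:

```rust
/// Encode `dst = op dst src2` in two bytes: opcode, then dst in the high
/// nibble and src2 in the low nibble. Only registers 0..16 fit.
fn encode_compressed(opcode: u8, dst: u8, src2: u8) -> [u8; 2] {
    assert!(dst < 16 && src2 < 16, "only x0-x15 fit in 4 bits");
    [opcode, (dst << 4) | src2]
}

/// Decode the two bytes back into (opcode, dst, src2).
fn decode_compressed(bytes: [u8; 2]) -> (u8, u8, u8) {
    let [opcode, regs] = bytes;
    (opcode, regs >> 4, regs & 0x0f)
}
```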

view this post on Zulip fitzgen (he/him) (Aug 07 2024 at 22:55):

indeed, I've also thought about that kind of thing as well haha

view this post on Zulip fitzgen (he/him) (Aug 07 2024 at 22:56):

fyi, I'm taking a look at the binary operands bitpacking PR now, but I think I'd prefer waiting to land it until after the cranelift backend lands, just to minimize churn/rebasing on that larger, fiddly amount of code

view this post on Zulip kmeakin (Aug 07 2024 at 22:56):

sure. no problem

view this post on Zulip fitzgen (he/him) (Aug 07 2024 at 22:56):

I'm just writing some filetests right now and then the backend should be ready to be made into a PR

view this post on Zulip fitzgen (he/him) (Aug 08 2024 at 00:32):

(and here is the PR introducing the pulley backend to cranelift: https://github.com/bytecodealliance/wasmtime/pull/9089)

This commit adds two new backends for Cranelift that emits 32- and 64-bit Pulley bytecode. The backends are both actually the same, with a common implementation living in cranelift/codegen/src/isa/...

view this post on Zulip kmeakin (Aug 09 2024 at 17:31):

Hey @fitzgen (he/him) I'm still a bit confused about stack manipulation instructions.
I believe instructions to increment/decrement the SP directly are unnecessary, because the increment/decrement can be done in the push/pop instruction

view this post on Zulip kmeakin (Aug 09 2024 at 17:33):

I'm looking at the tests from
https://github.com/bytecodealliance/wasmtime/blob/ee57c2b0994e58bdd7cbdaa30e72d1a85a800fee/cranelift/filetests/filetests/isa/pulley32/call.clif
and it seems to me like the adjustment to the SP is always word_size * number_of_regs

view this post on Zulip kmeakin (Aug 09 2024 at 17:34):

eg:

;       11: 0e 23 d0                        xconst8 spilltmp0, -48
;       14: 12 20 20 23                     xadd32 sp, sp, spilltmp0
;       18: 0e 0f 00                        xconst8 x15, 0
;       1b: 2a 20 0f                        store64 sp, x15
;       1e: 2c 20 08 0f                     store64_offset8 sp, 8, x15
;       22: 2c 20 10 0f                     store64_offset8 sp, 16, x15
;       26: 2c 20 18 0f                     store64_offset8 sp, 24, x15
;       2a: 2c 20 20 0f                     store64_offset8 sp, 32, x15
;       2e: 2c 20 28 0f                     store64_offset8 sp, 40, x15

subtracts 48 from SP, then writes 6 registers to the stack

view this post on Zulip kmeakin (Aug 09 2024 at 17:34):

but you could just have a push instr that also updated the SP

view this post on Zulip kmeakin (Aug 09 2024 at 17:35):

so 6 push instrs would decrement the SP by 8 bytes each, and at the end the result is the SP is decremented by 48
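
(Editor's sketch of a push that folds in the SP adjustment.) This toy VM is hypothetical and only models a downward-growing byte stack; it is not Pulley's `MachineState`:

```rust
/// A toy VM with a downward-growing stack; `sp` is an index into `stack`.
struct ToyVm {
    sp: usize,
    stack: Vec<u8>,
}

impl ToyVm {
    fn new(stack_size: usize) -> Self {
        ToyVm { sp: stack_size, stack: vec![0; stack_size] }
    }

    /// Decrement SP by 8, then store the value at the new SP.
    fn push64(&mut self, value: u64) {
        self.sp -= 8;
        self.stack[self.sp..self.sp + 8].copy_from_slice(&value.to_le_bytes());
    }

    /// Load the value at SP, then increment SP by 8.
    fn pop64(&mut self) -> u64 {
        let mut buf = [0u8; 8];
        buf.copy_from_slice(&self.stack[self.sp..self.sp + 8]);
        self.sp += 8;
        u64::from_le_bytes(buf)
    }
}
```

Six `push64` calls leave `sp` 48 bytes lower, matching the explicit `xconst8`/`xadd32` adjustment in the listing above, without a separate SP instruction.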

view this post on Zulip fitzgen (he/him) (Aug 09 2024 at 17:40):

yeah I guess if you still have to do the moves of each register into the allocated stack space, then it is still N instructions. my b, I hadn't been thinking about each store.

we could I guess add a variable number of registers to be spilled into the allocated stack space, or have a few variations with a fixed number of registers to spill, but those are both starting to get pretty funky

so I think a push instruction could indeed make sense. that said, I think we still want to fold push lr; push fp; fp = sp into a single macro instruction

view this post on Zulip kmeakin (Aug 09 2024 at 17:41):

fitzgen (he/him) said:

so I think a push instruction could indeed make sense. that said, I think we still want to fold push lr; push fp; fp = sp into a single macro instruction

Yes, I agree a macro instruction to do the prologue/epilogue would be good, but I don't think it would need a size argument

view this post on Zulip fitzgen (he/him) (Aug 09 2024 at 17:41):

yep

view this post on Zulip kmeakin (Aug 09 2024 at 17:42):

fitzgen (he/him) said:

we could I guess add a variable number of registers to be spilled into the allocated stack space, or have a few variations with a fixed number of registers to spill, but those are both starting to get pretty funky

Like the old Arm32 instructions that could push/pop a whole list of regs in 1 instruction?

view this post on Zulip fitzgen (he/him) (Aug 09 2024 at 17:42):

I am not familiar with arm32, but that sounds right

view this post on Zulip kmeakin (Aug 09 2024 at 17:43):

https://developer.arm.com/documentation/dui0802/b/A32-and-T32-Instructions/PUSH-and-POP

view this post on Zulip fitzgen (he/him) (Aug 09 2024 at 17:44):

yeah exactly like that. encoding-wise we would do something like <opcode> <length> (<reg>)^length

view this post on Zulip fitzgen (he/him) (Aug 09 2024 at 17:45):

where (<reg>)^length is length repetitions of <reg>, in case that isn't clear
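
(Editor's sketch of the length-prefixed layout described above.) The opcode value `0xF0` and the function names are hypothetical; only the `<opcode> <length> (<reg>)^length` shape comes from the discussion:

```rust
/// Hypothetical opcode byte for a "push register list" instruction.
const PUSH_LIST_OPCODE: u8 = 0xF0;

/// Encode as: opcode byte, length byte, then one byte per register.
fn encode_push_list(regs: &[u8]) -> Vec<u8> {
    assert!(regs.len() <= 255, "length must fit in one byte");
    let mut out = vec![PUSH_LIST_OPCODE, regs.len() as u8];
    out.extend_from_slice(regs);
    out
}

/// Decode the register list, returning None on a wrong opcode or a
/// truncated instruction.
fn decode_push_list(bytes: &[u8]) -> Option<&[u8]> {
    let (&opcode, rest) = bytes.split_first()?;
    if opcode != PUSH_LIST_OPCODE {
        return None;
    }
    let (&len, rest) = rest.split_first()?;
    rest.get(..len as usize)
}
```

The encoded size is 2 + length bytes, so it grows with the number of registers pushed.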

view this post on Zulip kmeakin (Aug 09 2024 at 17:45):

They abandoned it in the 32->64bit transition because it raised awkward questions like "what happens if an interrupt is received in the middle?" but we should have no such worries

view this post on Zulip fitzgen (he/him) (Aug 09 2024 at 17:46):

heh, nice

view this post on Zulip kmeakin (Aug 09 2024 at 17:46):

fitzgen (he/him) said:

yeah exactly like that. encoding-wise we would do something like <opcode> <length> (<reg>)^length

what about a u32 bitmask? Set the nth bit to 1 to push register n
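
(Editor's sketch of the bitmask variant.) Bit n set means "push register n", so any subset of 32 registers fits in a fixed 4 bytes. The helper names are made up, and this uses plain integer ops rather than the cranelift bitset type:

```rust
/// Build a mask with bit `r` set for every register index `r` in `regs`.
fn regs_to_mask(regs: &[u8]) -> u32 {
    let mut mask = 0u32;
    for &r in regs {
        assert!(r < 32, "register index out of range");
        mask |= 1 << r;
    }
    mask
}

/// List the register indices whose bits are set, lowest first.
fn mask_to_regs(mut mask: u32) -> Vec<u8> {
    let mut regs = Vec::new();
    while mask != 0 {
        // Index of the lowest set bit.
        regs.push(mask.trailing_zeros() as u8);
        // Clear the lowest set bit.
        mask &= mask - 1;
    }
    regs
}
```

One trade-off worth noting: the mask fixes the push order by register number, whereas an explicit register list can express any order.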

view this post on Zulip fitzgen (he/him) (Aug 09 2024 at 17:46):

ooo I like that

view this post on Zulip fitzgen (he/him) (Aug 09 2024 at 17:46):

nice

view this post on Zulip fitzgen (he/him) (Aug 09 2024 at 17:50):

also, fyi: https://github.com/bytecodealliance/wasmtime/blob/main/cranelift/bitset/src/scalar.rs#L47

view this post on Zulip kmeakin (Aug 09 2024 at 18:19):

ah nice, i was already trying to figure out the bit twiddling myself

view this post on Zulip fitzgen (he/him) (Aug 09 2024 at 18:22):

I could foresee us eventually adding unchecked_* variants to that type as well, if the various assert!(..)s end up being too expensive during decoding or whatever

but we can cross that bridge when we get to it, ofc

view this post on Zulip fitzgen (he/him) (Aug 09 2024 at 18:23):

eg unchecked_insert that doesn't assert that the value inserted is in the range of the scalar backing storage
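
(Editor's sketch of the checked vs. unchecked split.) `ToyBitSet` is a hypothetical stand-in, not the cranelift bitset API; it only illustrates where the range assertion would be skipped:

```rust
/// A toy 32-element bit set backed by a single u32.
#[derive(Default, Debug, Clone, Copy, PartialEq, Eq)]
struct ToyBitSet(u32);

impl ToyBitSet {
    const CAPACITY: u8 = 32;

    /// Checked insert: panics if `i` is outside the backing storage.
    /// Returns true if the bit was newly set.
    fn insert(&mut self, i: u8) -> bool {
        assert!(i < Self::CAPACITY, "index out of range");
        self.unchecked_insert(i)
    }

    /// Unchecked insert: no range assertion. The caller must guarantee
    /// `i < 32`; `wrapping_shl` masks the shift amount, so an out-of-range
    /// index silently sets the wrong bit instead of panicking.
    fn unchecked_insert(&mut self, i: u8) -> bool {
        let bit = 1u32.wrapping_shl(u32::from(i));
        let is_new = (self.0 & bit) == 0;
        self.0 |= bit;
        is_new
    }

    fn contains(&self, i: u8) -> bool {
        i < Self::CAPACITY && (self.0 & (1 << i)) != 0
    }
}
```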


Last updated: Oct 23 2024 at 20:03 UTC