Why does the pulley VM only have 32 registers if each register is encoded by 1 byte? You could have 256 registers without making the encoding any longer
That would make the MachineState
much larger without significantly reducing the amount of instructions that need to be executed. And a larger MachineState
would take up more space in the L1 cache, reducing the space for actually useful data.
I don't know if the above is the actual reason why 32 registers was chosen though.
I believe 32 was chosen to match aarch64 and riscv64 for now, but AFAIK it hasn't been scientifically chosen. The encoding of opcoes is relatively inefficient right now and the hope is to encode 3 operands in 15 bits via a u16 in the future. Each register taking a single byte is mostly just for ease right now
That makes sense
Would add a few extra instructions to extract registers from instruction stream but I guess saving 1 byte per instruction makes up for it
I really like the higher order macro trick for declaring instructions btw. Never seen that before but I'll look for excuses to use it in future
heh not exactly the most readable but it is quite nice for keeping things in sync!
Yeah, it is as Alex says. 32 seemed like "enough" and we can (eventually) shave a byte off of a = b op c
-style instructions.
will be really nice to get the rest of pulley landed (cranelift backend and runtime integration) so that we can start tweaking things and determine which is more important: more registers or smaller instructions
working on landing those other parts soon
fitzgen (he/him) said:
Yeah, it is as Alex says. 32 seemed like "enough" and we can (eventually) shave a byte off of
a = b op c
-style instructions.will be really nice to get the rest of pulley landed (cranelift backend and runtime integration) so that we can start tweaking things and determine which is more important: more registers or smaller instructions
You could go even further and have 2-byte encodings for dst = op dst src2
for registers x0-x15
(1 byte for opcode, 4 bits for each register). IIRC RISC-V has something similar for their 2 byte compressed ISA
indeed, I've also thought about that kind of thing as well haha
fyi, I'm taking a look at the binary operands bitpacking PR now, but I think I'd prefer waiting to land it until after the cranelift backend lands, just so minimize churn/rebasing on that larger, fiddly amount of code
sure. no problem
I'm just writing some filetests right now and then the backend should be ready to be made into a PR
(and here is the PR introducing the pulley backend to cranelift: https://github.com/bytecodealliance/wasmtime/pull/9089)
Hey @fitzgen (he/him) I'm still a bit confused about stack manipulation instructions.
I believe instructions to increment/decrement the SP directly are unecessary, because the increment/decrement can be done in the push/pop instruction
I'm looking at the tests from
https://github.com/bytecodealliance/wasmtime/blob/ee57c2b0994e58bdd7cbdaa30e72d1a85a800fee/cranelift/filetests/filetests/isa/pulley32/call.clif
and it seems to me like the adjustment to the SP is always word_size * number_of_regs
eg:
; 11: 0e 23 d0 xconst8 spilltmp0, -48
; 14: 12 20 20 23 xadd32 sp, sp, spilltmp0
; 18: 0e 0f 00 xconst8 x15, 0
; 1b: 2a 20 0f store64 sp, x15
; 1e: 2c 20 08 0f store64_offset8 sp, 8, x15
; 22: 2c 20 10 0f store64_offset8 sp, 16, x15
; 26: 2c 20 18 0f store64_offset8 sp, 24, x15
; 2a: 2c 20 20 0f store64_offset8 sp, 32, x15
; 2e: 2c 20 28 0f store64_offset8 sp, 40, x15
subtracts 48 from SP, then writes 6 registers to the stack
but you could just have a push
instr that also updated the SP
so 6 push instrs would decrement the SP by 8 bytes each, and at the end the result is the SP is decremented by 48
yeah I guess if you still have to do the moves of each register into the allocated stack space, then it is still N instructions. my b, I hadn't been thinking about each store.
we could I guess add a variable number of registers to be spilled into the allocated stack space, or have a few variations with a fixed numbers of registers to spill, but those are both starting to get pretty funky
so I think a push
instruction could indeed make sense. that said, I think we still want to fold push lr; push fp; fp = sp
into a single macro instruction
so I think a
push
instruction could indeed make sense. that said, I think we still want to foldpush lr; push fp; fp = sp
into a single macro instruction
Yes I agree a macro instruction to do the prologue/epilogue would be good but I dont think it would need a size argument
yep
we could I guess add a variable number of registers to be spilled into the allocated stack space, or have a few variations with a fixed numbers of registers to spill, but those are both starting to get pretty funky
Like the old Arm32 instructions that could push/pop a whole list of regs in 1 instruction?
I am not familiar with arm32, but that sounds right
https://developer.arm.com/documentation/dui0802/b/A32-and-T32-Instructions/PUSH-and-POP
yeah exactly like that. encoding-wise we would do something like <opcode> <length> (<reg>)^length
where (<reg>)^length
is length
repetitions of <reg>
, in case that isn't clear
They abandoned it in the 32->64bit transition because it raised awkward questions like "what happens if an interrupt is recieved in the middle?" but we should have no such worries
heh, nice
fitzgen (he/him) said:
yeah exactly like that. encoding-wise we would do something like
<opcode> <length> (<reg>)^length
what about a u32
bitmask? Set the nth bit to 1 to push register n
ooo I like that
nice
also, fyi: https://github.com/bytecodealliance/wasmtime/blob/main/cranelift/bitset/src/scalar.rs#L47
ah nice, i was already trying to figure out the bit twiddling myself
I could foresee us eventually adding unchecked_*
variants to that type as well, if the various assert!(..)
s end up being too expensive during decoding or whatever
but we can cross that bridge when we get to it, ofc
eg unchecked_insert
that doesn't assert that the value inserted is in the range of the scalar backing storage
Last updated: Oct 23 2024 at 20:03 UTC