alexcrichton opened PR #9775 from alexcrichton:pulley-loads-and-stores
to bytecodealliance:main
:
This commit gets the
address.wast
spec test working by filling out more load/store infrastructure in Pulley. In doing so I've done a refactoring of the existing load/store methods. Changes here are:
- All load/stores are
*_offset32
now instead of optionally having no offset, a 8-bit offset, or a 64-bit offset.- All x-register loads/stores are prefixed with
x
now.- All loads/stores have "le" for little-endian in their name.
- Loads/stores are refactored to have 8 and 16-bit variants.
- Sign-extending loads now either extend to 32 or 64 depending on the opcode.
- Float loads/stores are added.
- Big-endian is handled with explicit big-endian loads/stores instead
bswap
to handle this transparently in the backend (e.g. for stores not ISLE-generated) and to handle floats.<!--
Please make sure you include the following information:
If this work has been discussed elsewhere, please include a link to that
conversation. If it was discussed in an issue, just mention "issue #...".Explain why this change is needed. If the details are in an issue already,
this can be brief.Our development process is documented in the Wasmtime book:
https://docs.wasmtime.dev/contributing-development-process.htmlPlease ensure all communication follows the code of conduct:
https://github.com/bytecodealliance/wasmtime/blob/main/CODE_OF_CONDUCT.md
-->
alexcrichton requested fitzgen for a review on PR #9775.
alexcrichton requested wasmtime-compiler-reviewers for a review on PR #9775.
alexcrichton requested wasmtime-default-reviewers for a review on PR #9775.
alexcrichton requested wasmtime-core-reviewers for a review on PR #9775.
alexcrichton submitted PR review.
alexcrichton created PR review comment:
Curious on your thoughts on this in particular @fitzgen.
I'll note that we don't currently do this well, e.g.
xconst8
always writes the full 64-bits andbr_if
has no variant that only reads 32-bits. I think we'll want to change that over time though.
alexcrichton commented on PR #9775:
At a high level the
spec_testsuite/*.wast
still can't be executed, but that's the next PR after this to get that executing on CI (and a large number of tests already pass)
alexcrichton updated PR #9775.
alexcrichton requested wasmtime-fuzz-reviewers for a review on PR #9775.
alexcrichton submitted PR review.
alexcrichton created PR review comment:
Also curious on your thoughts here @fitzgen of the deletion of this fuzzer. Personally I feel like it's a bit of a pain to maintain while not adding much more over the wasm-smith fuzzing we have already.
alexcrichton updated PR #9775.
alexcrichton submitted PR review.
alexcrichton created PR review comment:
Also double-check me on this, is this right?
fitzgen submitted PR review.
fitzgen created PR review comment:
I think we could remove the parenthetical here. They must be fast to decode in software (perhaps worth adding that qualifier) at runtime and various bit-packing schemes will either be fast enough or not. I don't think it necessarily makes sense to say we won't do any bit-packing at all, since (a) we already do a little, and (b) we need to balance the decoding with instruction density and cache usage.
fitzgen created PR review comment:
We are (rightfully) burning a lot of opcode space for loads and stores, but I wonder if we can make an exception for the "only write the bottom bits" and always extend to the full register width? This would mean we wouldn't need both
xload8_s32_offset32
andxload8_s64_offset32
and would effectively have only the latter. This would roughly cut in half the number of load/store instructions we have but wouldn't constrain the cranelift backend at all. The cost would be a couple extra extends on 32-bit platforms, which I don't think is too bad a trade off in this particular case.And FWIW, I still think the default position should be to only write the low bits of registers when doing 32-bit adds/ands/xors/etc, we are just running into all the edge cases now :)
fitzgen created PR review comment:
For
xconst*
I think filling the whole register is what we want regardless. They are a little different from, say, bitwise arithmetic since they aren't really an operation themselves and are more about getting data in place to perform actual operations.Although IIRC we are still zero-extending rather than sign-extending in
const*
. The latter is going to save us code size, since otherwise all negative immediates will need to be loaded in registers withxconst64
s. But maybe we did fix this already at one point...
fitzgen created PR review comment:
This actually _increases_ the number of opcodes, but _decreases_ the number of instructions in most programs. That is, with macro ops we are doing _more work per instruction_. And the reason we want to do more work per instruction is because one of the biggest costs of interpretation compared to compilation is the interpreter loop's per-instruction overhead.
fitzgen created PR review comment:
Maybe also add a note about signed vs unsigned naming conventions
fitzgen created PR review comment:
nitpick: Use the new helper here and elsewhere
fitzgen submitted PR review.
fitzgen created PR review comment:
I don't think so, since (unlike addition) subtraction lets high bits (which should be invisible) move down to the low bits (and become visible), right?
alexcrichton submitted PR review.
alexcrichton created PR review comment:
Good point, I'll add notes about the balance. I was mostly thinking here of things like RISC-V instructions where the registers are spread throughout the instruction in various locations to fit around encodings of other instructions. That's the "super fancy" part but I also don't think we're at risk of getting there any time soon.
fitzgen submitted PR review.
fitzgen created PR review comment:
actually I think this works correctly
fitzgen submitted PR review.
fitzgen created PR review comment:
let's merge this and get further in spec tests, and eventually differential fuzzing, which will definitely catch any issues here
alexcrichton submitted PR review.
alexcrichton created PR review comment:
This I feel is a bit entangled with the question below of "do we want
xload8_s32_offset32
". I'm a bit worried aboutxconst8
always filling the entire register mostly from the perspective of 32-bit platforms. If most constants in a program are actually 32-bits then we're in theory spending a lot of instructions to fill the upper half of the register that's never read. Cache-wise that's probably not too problematic but it's possibly problematic from an icache perspective of the implementation ofxconst8
itself.I definitely agree though that for 64-bit platforms it doesn't matter at all.
For now I'm going to go ahead with this PR as-is (with
xload8_s32_offset32
for example) and these docs since we don't have a great answer one way or another and I think we're both on the fence. I'm going to add some notes to https://github.com/bytecodealliance/wasmtime/issues/9747 to come back to reevaluate though.
alexcrichton updated PR #9775.
alexcrichton has enabled auto merge for PR #9775.
alexcrichton updated PR #9775.
alexcrichton has enabled auto merge for PR #9775.
alexcrichton merged PR #9775.
Last updated: Jan 24 2025 at 00:11 UTC