cfallin opened Issue #1549:
Currently, to avoid issues with (i) the PC-relative addressing range of
LDR
, and (ii) the fact that the constant pool needs to be relocatable, the aarch64 backend emits constants inline with code and branches around them. This is obviously suboptimal for code-density and branchiness reasons.To handle the first issue, we should use the
ConstantPool
to collect constants, as the old backend does, and then modify our emission logic to defer constant emission to the end unless we're about to go out of range of pending constant references, in which case we emit a "constant island".To handle the second issue, we need to emit constant-pool relocations.
It's still unclear how we can handle the intersection of these two, i.e., when constant references become out-of-range because the client relocates the constant pool further away. One known use case of the relocatable constant pool is for SpiderMonkey to insert its own epilogue into Wasm functions; there, at least, we can bound how much further away the constant pool will be, so perhaps we can just have a slightly-conservative limit for when we must emit a constant island.
cfallin labeled Issue #1549:
Currently, to avoid issues with (i) the PC-relative addressing range of
LDR
, and (ii) the fact that the constant pool needs to be relocatable, the aarch64 backend emits constants inline with code and branches around them. This is obviously suboptimal for code-density and branchiness reasons.To handle the first issue, we should use the
ConstantPool
to collect constants, as the old backend does, and then modify our emission logic to defer constant emission to the end unless we're about to go out of range of pending constant references, in which case we emit a "constant island".To handle the second issue, we need to emit constant-pool relocations.
It's still unclear how we can handle the intersection of these two, i.e., when constant references become out-of-range because the client relocates the constant pool further away. One known use case of the relocatable constant pool is for SpiderMonkey to insert its own epilogue into Wasm functions; there, at least, we can bound how much further away the constant pool will be, so perhaps we can just have a slightly-conservative limit for when we must emit a constant island.
cfallin labeled Issue #1549:
Currently, to avoid issues with (i) the PC-relative addressing range of
LDR
, and (ii) the fact that the constant pool needs to be relocatable, the aarch64 backend emits constants inline with code and branches around them. This is obviously suboptimal for code-density and branchiness reasons.To handle the first issue, we should use the
ConstantPool
to collect constants, as the old backend does, and then modify our emission logic to defer constant emission to the end unless we're about to go out of range of pending constant references, in which case we emit a "constant island".To handle the second issue, we need to emit constant-pool relocations.
It's still unclear how we can handle the intersection of these two, i.e., when constant references become out-of-range because the client relocates the constant pool further away. One known use case of the relocatable constant pool is for SpiderMonkey to insert its own epilogue into Wasm functions; there, at least, we can bound how much further away the constant pool will be, so perhaps we can just have a slightly-conservative limit for when we must emit a constant island.
abrown commented on Issue #1549:
For reference, there are other things to think about with constant pools; see #1385.
cfallin labeled Issue #1549:
Currently, to avoid issues with (i) the PC-relative addressing range of
LDR
, and (ii) the fact that the constant pool needs to be relocatable, the aarch64 backend emits constants inline with code and branches around them. This is obviously suboptimal for code-density and branchiness reasons.To handle the first issue, we should use the
ConstantPool
to collect constants, as the old backend does, and then modify our emission logic to defer constant emission to the end unless we're about to go out of range of pending constant references, in which case we emit a "constant island".To handle the second issue, we need to emit constant-pool relocations.
It's still unclear how we can handle the intersection of these two, i.e., when constant references become out-of-range because the client relocates the constant pool further away. One known use case of the relocatable constant pool is for SpiderMonkey to insert its own epilogue into Wasm functions; there, at least, we can bound how much further away the constant pool will be, so perhaps we can just have a slightly-conservative limit for when we must emit a constant island.
akirilov-arm commented on Issue #1549:
There is a set of constants that might be particularly problematic - those generated for
Inst::LoadExtName
. The complication is that we need to emit relocations that modify the value in the constant pool itself. @cfallin, has a scheme similar to a Global Offset Table (GOT) been considered?There is also another problem that is quite similar - lack of deferred emission of instructions, in particular traps. Consider integer division; currently the AArch64 backend generates:
div rd, rn, rm cbnz rm, #8 udf cmn rm, #1 ccmp rn, #1, #nzcv, eq b.vc #8 udf
The natural expectation is that most of the time the
UDF
instructions are not going to be executed, so this sequence has the same deficiencies as inline constants, i.e. lower code density and more complcated control flow. Furthermore, it contains forward conditional branches, which may very well be predicted as not taken - exactly the opposite of the common case.I think that both issues (the one in the previous paragraph and constant pools) could be solved simultaneously.
BTW the limited range of literal loads could be extended by emitting a combination of
ADRP
+LDR
instead of just a literal load, which gives a range of approximately 8 GB (4 GB in either direction).
akirilov-arm edited a comment on Issue #1549:
There is a set of constants that might be particularly problematic - those generated for
Inst::LoadExtName
. The complication is that we need to emit relocations that modify the value in the constant pool itself. @cfallin, has a scheme similar to a Global Offset Table (GOT) been considered?There is also another problem that is quite similar - lack of deferred emission of instructions, in particular traps. Consider integer division; currently the AArch64 backend generates:
div rd, rn, rm cbnz rm, #8 udf cmn rm, #1 ccmp rn, #1, #nzcv, eq b.vc #8 udf
The natural expectation is that most of the time the
UDF
instructions are not executed, so this sequence has the same deficiencies as inline constants, i.e. lower code density and more complicated control flow. Furthermore, it contains forward conditional branches, which may very well be predicted as not taken - exactly the opposite of the common case.I think that both issues (the one in the previous paragraph and constant pools) could be solved simultaneously.
BTW the limited range of literal loads could be extended by emitting a combination of
ADRP
+LDR
instead of just a literal load, which gives a range of approximately 8 GB (4 GB in either direction).
cfallin commented on Issue #1549:
There is a set of constants that might be particularly problematic - those generated for Inst::LoadExtName. The complication is that we need to emit relocations that modify the value in the constant pool itself. @cfallin, has a scheme similar to a Global Offset Table (GOT) been considered?
@akirilov-arm no, we haven't considered anything like a GOT, though at some point we do need to fill out support for relocation types like this. Certainly the early/bringup work for the aarch64 backend was the JIT use-case, where GOTs are not really used (or at least, e.g. SpiderMonkey has a table of wasm functions and IIRC globals but it generates custom code for accesses); so we've gotten away with the relatively simple set of relocation types so far.
Longer-term, I think we ought to just emit the proper relocations for e.g. GOT references; we'll have to wire through the relocation-type support to the relevant backends, including
cranelift-object
(if not already there) and wasmtime's relocation handling; and we'll have to have a notion of "memory model" which is sort of a superset of our currentRelocDistance
, indicating when a symbol is external and the compilation mode calls for a GOT.deferred emission of instructions, in particular traps.
This is an interesting one: my first instinct is that it might be simpler to handle this case directly during lowering, by keeping a list of deferred trap-paths (combination of
MachLabel
andTrapCode
) and emitting and binding to labels a sequence of trap instructions at the tail of the function. There are some complications there for SpiderMonkey (it expects a fallthrough return and stitches its own epilogue on after Cranelift returns) but otherwise I think it would work. If we need a more general mechanism, we could defer emission via a mechanism on theMachBuffer
, but I'm hesitant to add that complexity unless we really do need it. Open to other ideas here, of course :-)
bjorn3 commented on Issue #1549:
This is done, right?
cfallin commented on Issue #1549:
We actually currently synthesize up to 64-bit values inline with
MOVZ
/MOVK
/MOVN
instructions, and use this for f32/f64 as well (integer constant then move to float reg). 128-bit constants are included inline and then branched around. @abrown added the constant-pool functionality but we haven't made use of this in the aarch64 backend yet. It is still worthwhile, I think, at least to evaluate; let's leave it open to track.
akirilov-arm commented on Issue #1549:
Some 64-bit floating-point constants are also loaded from an inline literal (these cases may be reduced further - refer to the comments). Note that based on my microbenchmark data in #2296, it definitely makes sense to consider using constant pools.
cfallin commented on Issue #1549:
@akirilov-arm: yes, sorry, that update somehow slipped off my radar -- thanks very much for the detailed evaluation! I agree that based on that update, we should probably at least replace the inline-constant cases with constant pool usages; the branch-around seems to have a significant impact.
akirilov-arm commented on Issue #1549:
@cfallin This is a side question, but do we have a mechanism to share a constant between several users instead of rematerializing it each time? This is particularly relevant to the bitmask extraction operations, in which case the constants are always the same. Of course, the increased register pressure is a trade-off to consider.
Also, some of the proposals to the Wasm SIMD specification (see WebAssembly/simd#395 for example) are made with such a capability in mind.
cfallin commented on Issue #1549:
@akirilov-arm we can share within a single function at least: @abrown's work in #2328 added deduplication of
VCodeConstant
s and the plumbing to use theMachBuffer
's constant-island support to emit them. (We also dedup statically, i.e. at the codegen crate level: the "well-known constant" option takes a&'static [u8]
to avoid carrying duplicates of constant data by value in the VCode.)We can't dedup across function bodies; that might be nice to have, but it would require a bit more plumbing due to our separate function compilation (which enables other things like parallelization). In principle we could have an "rodata object" that we aggregate from all
MachCompileResult
s after-the-fact, I suppose. Do you think we need that level of deduplication?
abrown commented on Issue #1549:
I think https://github.com/bytecodealliance/wasmtime/pull/2468 may be somewhat applicable to this thread; I wanted an ergonomic way to add constants during lowering (
LowerCtx::emit_constant
) and that PR is a few steps down that path. I abandoned it since it is not strictly necessary (more of a "nice to have") but I can finish it if we figure out some of this stuff out. E.g., it would be sort of nice to have a global rodata.
akirilov-arm commented on Issue #1549:
@cfallin
Do you think we need that level of deduplication?
No, I don't think so. As for the existing deduplication - if I understand correctly, it avoids keeping several copies of the same data inside the pool, but each use of the data still requires a literal load. I was thinking more in the direction of keeping the constant data live in a register and avoiding memory operations; this approach might be relevant even for more complicated integer constants, say those that need more than 2 instructions to materialize (arbitrary limit - I don't have hard data to justify it).
cfallin commented on Issue #1549:
Ah, OK. In theory, at least for constant duplication at the CLIF level, this may be covered by CSE. Constants that arise during lowering will not be CSE'd, so either post-lowering opts or constant rematerialization at the regalloc level could help. We might get to either or both of these someday. A constant pool with dedup is still better than literal loads from separate constant locations if we're going to do multiple literal loads (perhaps because opts are disabled or we haven't implemented remat yet); so these are complimentary I think.
Last updated: Nov 22 2024 at 16:03 UTC