MarinPostma opened PR #10076 from MarinPostma:addr-compute-refactor to bytecodealliance:main:
This PR factors out heap address computation from the `CodeGen` struct, so that it can be passed as a callback to the masm. The reason for this refactor is that it avoids an unconditional spill on x64 for atomic operations that have specific requirements on some registers. The spill happens because the address for those instructions sits under the operands on the value stack, forcing us to pop the operands to registers, compute the address, and push the operands back on the stack. This dance made it almost certain that the required registers were already allocated when needed, thus forcing a spill. By factoring out address computation, we can pass it as a callback to the macro assembler, which can pop arguments in their natural order, reserving specific registers as necessary.
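For illustration, here is a minimal, self-contained sketch of the callback shape described above; all names in it (`Reg`, `Context`, `Masm`, `compute_addr`) are illustrative stand-ins, not the PR's actual `CodeGen`/`MacroAssembler` signatures:

```rust
#[derive(Debug, Clone, Copy)]
struct Reg(u8);

struct Context {
    // Value stack: registers holding the wasm operands, top of stack at the end.
    stack: Vec<Reg>,
}

impl Context {
    fn pop_to_reg(&mut self) -> Reg {
        self.stack.pop().expect("missing value on the stack")
    }
}

struct Masm;

impl Masm {
    // Instead of receiving a precomputed address, the masm receives a callback.
    // It can pop `replacement` and `expected` in their natural order (reserving
    // any fixed registers it needs first) and only then ask for the heap
    // address, so nothing has to be popped, pushed back, and spilled around the
    // address computation.
    fn atomic_cas<F>(&mut self, cx: &mut Context, compute_addr: F)
    where
        F: FnOnce(&mut Masm, &mut Context) -> Reg,
    {
        let replacement = cx.pop_to_reg();
        let expected = cx.pop_to_reg();
        let addr = compute_addr(self, cx);
        println!("cmpxchg [{addr:?}], expected {expected:?}, replacement {replacement:?}");
    }
}

fn main() {
    // Stack layout for a CAS: [index, expected, replacement], top at the right.
    let mut cx = Context {
        stack: vec![Reg(2), Reg(1), Reg(0)],
    };
    let mut masm = Masm;
    masm.atomic_cas(&mut cx, |_masm, cx| {
        // The callback owns the address computation; in the real code
        // generator, bounds and alignment checks would happen here.
        cx.pop_to_reg()
    });
}
```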
Running a few benchmarks with sightglass (not all of the test suite is supported yet), there's no performance regression in doing so:
```
compilation :: cycles :: benchmarks/spidermonkey/benchmark.wasm

  Δ = 19296541.11 ± 5094859.49 (confidence = 99%)

  refacto.so is 1.01x to 1.01x faster than main.so!

  [2361419777 2391240753.01 2441310853] main.so
  [2340893254 2371944211.90 2421142219] refacto.so

instantiation :: cycles :: benchmarks/spidermonkey/benchmark.wasm

  No difference in performance.

  [371434 422720.96 560161] main.so
  [341377 417196.75 494199] refacto.so

execution :: cycles :: benchmarks/spidermonkey/benchmark.wasm

  No difference in performance.

  [905704442 928015775.03 981552142] main.so
  [905318861 932491387.89 998574294] refacto.so

-------------------------------------------------------------------------

instantiation :: cycles :: benchmarks/bz2/benchmark.wasm

  No difference in performance.

  [103974 128183.43 181890] main.so
  [105307 131811.77 177160] refacto.so

compilation :: cycles :: benchmarks/bz2/benchmark.wasm

  No difference in performance.

  [21460612 25083573.56 51889605] main.so
  [21764794 25410240.69 53043037] refacto.so

execution :: cycles :: benchmarks/bz2/benchmark.wasm

  No difference in performance.

  [108341768 110171123.02 119094004] main.so
  [107632268 109609358.98 114981054] refacto.so

-------------------------------------------------------------------------

execution :: cycles :: benchmarks/regex/benchmark.wasm

  Δ = 2168975.90 ± 1429157.42 (confidence = 99%)

  refacto.so is 1.00x to 1.02x faster than main.so!

  [230279362 234821937.90 252109774] main.so
  [229225690 232652962.00 243118560] refacto.so

instantiation :: cycles :: benchmarks/regex/benchmark.wasm

  No difference in performance.

  [310116 341486.65 583080] main.so
  [298291 337779.62 385882] refacto.so

compilation :: cycles :: benchmarks/regex/benchmark.wasm

  No difference in performance.

  [227528308 234328619.97 266022424] main.so
  [227071476 232270717.37 262020199] refacto.so
```
Those benches do not indicate an improvement in performance, but that's probably because they don't rely on atomic operations. Looking at the generated asm in the `disas` tests, we can see that a good number of instructions are shaved off: they correspond to the spills.
github-actions[bot] commented on PR #10076:
Subscribe to Label Action
cc @saulecabrera
<details>
This issue or pull request has been labeled: "winch"

Thus the following users have been cc'd because of the following labels:
- saulecabrera: winch
To subscribe or unsubscribe from this label, edit the <code>.github/subscribe-to-label.json</code> configuration file.
Learn more.
</details>
saulecabrera commented on PR #10076:
Thanks for this and thanks for providing benchmarks.
I'm personally not entirely convinced that the improvements in terms of code generation are worth the complexity of this change, for multiple reasons, but to name a few that come to mind:
- The provided benchmarks are probably not representative enough to warrant some of the complexity in the `CodeGen` module. Even though in the code generation output we can see that some instructions are improved, and that there's no change in compilation time, I think it'd be beneficial to see a more real-world benchmark, one that actually uses the instructions for which this refactor is introduced. I'm aware that there are probably none currently, which makes me think that it might be worth waiting until we have some real-world benchmarks that we can use to make an informed decision: it'd be beneficial to compare concrete % improvements for runtime performance and similar for compilation performance.
- I fear that instruction-focused optimizations, like in this case, might start compromising the compiler's simplicity design principle; as a rule of thumb, I think it's worth considering new architectural patterns that can be adopted throughout the compiler, but we should probably take a step back when considering changes that have the potential to improve certain instructions at the cost of higher complexity.
That said though, I really appreciate you taking the time to explore this; I decided to explore a slightly different alternative. My goal with this exploration was to achieve a similar result, but with potentially fewer changes required in the `CodeGen` module. The high-level idea:
- Refactor `emit_compute_heap_addr` to take in an `Index`; this lifts the responsibility of having to pop the index, and gives more flexibility to the caller to decide when the heap address calculation should take place.
- Introduce an `atomic_cas_clobbers` method at the MacroAssembler, which will inform the caller which registers are expected to be clobbered by a particular instruction.
- When preparing the operands for `atomic_cas`, exclude any clobbers, via `CodeGenContext::without` (see the sketch after this list).
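To make the role of `CodeGenContext::without` concrete, here is a minimal, self-contained sketch of what a `without`-style combinator might look like; the `Reg` and `Context` types and the exact signature are assumptions for illustration, not winch's actual API:

```rust
use std::collections::HashSet;

#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
struct Reg(u8);

struct Context {
    // Registers currently available to the allocator.
    free: HashSet<Reg>,
}

impl Context {
    // Hand out any currently available register.
    fn any_gpr(&mut self) -> Reg {
        let r = *self.free.iter().next().expect("no free registers");
        self.free.remove(&r);
        r
    }

    // Run `f` while `clobbers` are excluded from allocation, so values
    // popped inside `f` can never land in a register the upcoming
    // instruction implicitly clobbers.
    fn without<T, F>(&mut self, clobbers: &[Reg], f: F) -> T
    where
        F: FnOnce(&mut Context) -> T,
    {
        let reserved: Vec<Reg> = clobbers
            .iter()
            .copied()
            .filter(|r| self.free.remove(r))
            .collect();
        let result = f(self);
        self.free.extend(reserved);
        result
    }
}

fn main() {
    let rax = Reg(0);
    let mut cx = Context {
        free: HashSet::from([Reg(0), Reg(1), Reg(2)]),
    };
    // On x64, `cmpxchg` implicitly uses rax, so exclude it while the
    // operands are popped to registers.
    let operand = cx.without(&[rax], |cx| cx.any_gpr());
    assert_ne!(operand, rax);
    println!("popped operand into {operand:?} without touching rax");
}
```

The effect is that a register the instruction implicitly clobbers (rax for `cmpxchg` on x64, for example) is never handed out while the operands are being popped, which is what removes the forced spill.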
Refer to the details section below for an (incomplete) diff of this idea:
<details>
```diff
diff --git a/winch/codegen/src/codegen/mod.rs b/winch/codegen/src/codegen/mod.rs
index f2eb2f976..6648bc5c9 100644
--- a/winch/codegen/src/codegen/mod.rs
+++ b/winch/codegen/src/codegen/mod.rs
@@ -647,6 +647,7 @@ where
         &mut self,
         memarg: &MemArg,
         access_size: OperandSize,
+        index: Index,
     ) -> Result<Option<Reg>> {
         let ptr_size: OperandSize = self.env.ptr_type().try_into()?;
         let enable_spectre_mitigation = self.env.heap_access_spectre_mitigation();
@@ -656,7 +657,6 @@ where
         let memory_index = MemoryIndex::from_u32(memarg.memory);
         let heap = self.env.resolve_heap(memory_index);
-        let index = Index::from_typed_reg(self.context.pop_to_reg(self.masm, None)?);
         let offset = bounds::ensure_index_and_offset(
             self.masm,
             index,
@@ -680,6 +680,7 @@ where
             self.emit_fuel_increment()?;
             self.masm.trap(TrapCode::HEAP_OUT_OF_BOUNDS)?;
             self.context.reachable = false;
+            self.context.free_reg(index.as_typed_reg().reg);
             None
         } else if !can_elide_bounds_check {
             // Account for the general case for bounds-checked memories. The
@@ -840,38 +841,30 @@ where
     }
 
     /// Emit checks to ensure that the address at `memarg` is correctly aligned for `size`.
-    fn emit_check_align(&mut self, memarg: &MemArg, size: OperandSize) -> Result<()> {
-        if size.bytes() > 1 {
-            // Peek addr from top of the stack by popping and pushing.
-            let addr = *self
-                .context
-                .stack
-                .peek()
-                .ok_or_else(|| CodeGenError::missing_values_in_stack())?;
-            let tmp = self.context.any_gpr(self.masm)?;
-            self.context.move_val_to_reg(&addr, tmp, self.masm)?;
+    fn emit_check_align(&mut self, memarg: &MemArg, size: OperandSize, index: Index) -> Result<()> {
+        if size.bytes() < 2 {
+            return Ok(());
+        }
 
-            if memarg.offset != 0 {
-                self.masm.add(
-                    writable!(tmp),
-                    tmp,
-                    RegImm::Imm(Imm::I64(memarg.offset)),
-                    size,
-                )?;
-            }
+        let tmp = self.context.any_gpr(self.masm)?;
+        let index_reg = index.as_typed_reg().reg;
 
-            self.masm.and(
+        self.masm
+            .mov(writable!(tmp), RegImm::reg(index_reg), size)?;
+
+        if memarg.offset != 0 {
+            self.masm.add(
                 writable!(tmp),
                 tmp,
-                RegImm::Imm(Imm::I32(size.bytes() - 1)),
+                RegImm::Imm(Imm::I64(memarg.offset)),
                 size,
             )?;
-
-            self.masm.cmp(tmp, RegImm::Imm(Imm::i64(0)), size)?;
-            self.masm.trapif(IntCmpKind::Ne, TRAP_HEAP_MISALIGNED)?;
-            self.context.free_reg(tmp);
         }
 
+        self.masm.cmp(tmp, RegImm::Imm(Imm::i64(0)), size)?;
+        self.masm.trapif(IntCmpKind::Ne, TRAP_HEAP_MISALIGNED)?;
+        self.context.free_reg(tmp);
+
         Ok(())
     }
 
@@ -879,9 +872,10 @@ where
         &mut self,
         memarg: &MemArg,
         access_size: OperandSize,
+        index: Index,
     ) -> Result<Option<Reg>> {
-        self.emit_check_align(memarg, access_size)?;
-        self.emit_compute_heap_address(memarg, access_size)
+        self.emit_check_align(memarg, access_size, index)?;
+        self.emit_compute_heap_address(memarg, access_size, index)
     }
 
     /// Emit a WebAssembly load.
@@ -891,13 +885,16 @@ where
         target_type: WasmValType,
         kind: LoadKind,
         op_kind: MemOpKind,
+        index: Index,
     ) -> Result<()> {
         let maybe_addr = match op_kind {
-            MemOpKind::Atomic => {
-                self.emit_compute_heap_address_align_checked(&arg, kind.derive_operand_size())?
-            }
+            MemOpKind::Atomic => self.emit_compute_heap_address_align_checked(
+                &arg,
+                kind.derive_operand_size(),
+                index,
+            )?,
             MemOpKind::Normal => {
-                self.emit_compute_heap_address(&arg, kind.derive_operand_size())?
+                self.emit_compute_heap_address(&arg, kind.derive_operand_size(), index)?
             }
         };
@@ -926,12 +923,13 @@ where
         arg: &MemArg,
         size: OperandSize,
         op_kind: MemOpKind,
+        index: Index,
     ) -> Result<()> {
         let src = self.context.pop_to_reg(self.masm, None)?;
 
         let maybe_addr = match op_kind {
-            MemOpKind::Atomic => self.emit_compute_heap_address_align_checked(&arg, size)?,
-            MemOpKind::Normal => self.emit_compute_heap_address(&arg, size)?,
+            MemOpKind::Atomic => self.emit_compute_heap_address_align_checked(&arg, size, index)?,
+            MemOpKind::Normal => self.emit_compute_heap_address(&arg, size, index)?,
         };
 
         if let Some(addr) = maybe_addr {
@@ -1391,27 +1389,35 @@ where
         size: OperandSize,
         extend: Option<Extend<Zero>>,
     ) -> Result<()> {
-        // Emission for this instruction is a bit trickier. The address for the CAS is the 3rd from
-        // the top of the stack, and we must emit instruction to compute the actual address with
-        // `emit_compute_heap_address_align_checked`, while we still have access to self. However,
-        // some ISAs have requirements with regard to the registers used for some arguments, so we
-        // need to pass the context to the masm. To solve this issue, we pop the two first
-        // arguments from the stack, compute the address, push back the arguments, and hand over
-        // the control to masm. The implementer of `atomic_cas` can expect to find `expected` and
-        // `replacement` at the top the context's stack.
+        let clobbers = M::atomic_cas_clobbers();
 
-        // pop the args
-        let replacement = self.context.pop_to_reg(self.masm, None)?;
-        let expected = self.context.pop_to_reg(self.masm, None)?;
-
-        if let Some(addr) = self.emit_compute_heap_address_align_checked(arg, size)? {
-            // push back the args
-            self.context.stack.push(expected.into());
-            self.context.stack.push(replacement.into());
+        let (replacement, expected, index) =
+            self.context
+                .without::<Result<(TypedReg, TypedReg, TypedReg)>, _, _>(
+                    &clobbers,
+                    self.masm,
+                    |context, masm| {
+                        Ok((
+                            context.pop_to_reg(masm, None)?,
+                            context.pop_to_reg(masm, None)?,
+                            context.pop_to_reg(masm, None)?,
+                        ))
+                    },
+                )??;
 
+        if let Some(addr) =
+            self.emit_compute_heap_address_align_checked(arg, size, Index::from_typed_reg(index))?
+        {
             let src = self.masm.address_at_reg(addr, 0)?;
 
-            self.masm
-                .atomic_cas(&mut self.context, src, size, UNTRUSTED_FLAGS, extend)?;
```
[message truncated]
MarinPostma closed without merge PR #10076.
MarinPostma commented on PR #10076:
Your implementation is indeed more elegant than what I came up with. But generally I agree with you. This would only be beneficial for atomic operations, which by themselves are accepted to be somewhat costly. I don't think the cost of the added complexity outweighs the benefits. When I started this, I hoped that it would benefit more instructions, but that was a bad intuition.
I'll close this one, but thanks for your feedback! :)