MarinPostma edited PR #9889.
saulecabrera commented on PR #9889:
I can help reviewing this one.
saulecabrera requested saulecabrera for a review on PR #9889.
saulecabrera submitted PR review:
Looking good. Left some comments inline.
As a general comment, the
emit_*
naming convention is not standard across the assembler implementations (as noted and fixed in https://github.com/bytecodealliance/wasmtime/pull/9918), so I´d suggest dropping theemit_
prefix in the assembler methods introduced in this PR.
saulecabrera created PR review comment:
Similar comment here, I think it should be possible to omit the
OperandSize
prefix in both match arms.
saulecabrera created PR review comment:
Nit: I believe that here and in the next match arm below you can omit the
OperandSize::
prefix given that you're already importing it above (e.g.,use OperandSize::*
)
saulecabrera created PR review comment:
/// Load the max value for an integer of size out_size, as a floating-point /// of size `in_size`, into register `rd`.
saulecabrera created PR review comment:
I´d suggest avoiding passing a placeholder register here. The temp register that is passed to
unsigned_truncate
is an allocatable register, therefore there's very little chance of it getting unintentionally clobbered. Instead having a single implementation for both cases, I´d suggest breaking down the implementation in the assembler into two methods, one for each case (signed/unsigned), then in each implementation you can decide to use the scratch register, immediately before emission, without the risk of unintentionally clobbering it. This is similar to how the x64 assembler implementation works.
saulecabrera created PR review comment:
If we're writing to
tmp_reg
, I think this should be defined in the signature, similar todst
:tmp_reg: Writable<Reg>
saulecabrera created PR review comment:
Here we don´t need to duplicate this method,
fcmp
already exists.
MarinPostma submitted PR review.
MarinPostma created PR review comment:
To make sure I understand, similarly to how x86 implementation, you would like to have two methods,
fpu_to_sint
andfpu_to_uint
in asm, and also, those methods should be passed a temporary register inmasm
, rather than using the provided one in the case ofunsigned_truncate
?I have a few questions/remarks:
- what is the purpose of the temp reg in the
unsigned
variant? Why must this registry be allocated rather than used as a temp register? Is the caller expecting to find something in there? It is not explicitly passed as a writable register, but the x64 implementation implicitly converts it to one.- While splitting the
convert_
methods for x64 makes sense to me since they correspond to a single instruction, we must emit a bunch of checks that are aware of the sign in the case of aarch64. Do you suggest duplicating all the checks in signed and unsigned variants? Or do we have two methods in ASM that call the current implementation with thesigned
argument set accordingly?- I'm sorry if that sounds very ignorant, but why do we have to worry about clobbering a tmp register anyway? Can't it, by definition, be used for any purpose, and one should not have too strong an assumption about what it may contain, especially when handing it over to a callee?
- I'm surely missing something, but it seems to me that splitting the
convert
method into a signed and unsigned variant, and the use of tmp regs are orthogonal things, so I want to make sure I understand what you want correctly :)
MarinPostma commented on PR #9889:
@saulecabrera I addressed most of the review comments but left a few questions
MarinPostma updated PR #9889.
MarinPostma updated PR #9889.
saulecabrera submitted PR review.
saulecabrera created PR review comment:
The purpose of the allocatable temporary register is to provide an extra
register to aid in the emission of the instruction sequence needed for the
unsigned case; in x64, 4 registers are required for the instruction sequence:
1 for the destination and 3 temporary registers. Since in the case of x64 the
instruction sequence is completely handled by Cranelift, there's no risk of
unintentionally clobbering any of the registers passed to it, the unintentional
clobbering risk is particularly important in the case of scratch registers in Winch: even
though they can be used for any purpose, one important detail about them is that
they are not tracked by Winch's regalloc therefore they must be used sparingly
and with extreme caution: a misuse could result in unintentional clobbering and
therefore subtle bugs. Ideally, the live range of the scratch registers should be
as short as possible to avoid potential bugs.In this particular case the reason why I am suggestion improving the usage of
the scratch register is to minimize the risk of unintentional clobbering of such
registers: once a scratch register is passed as a parameter, its live range is
extended and the risk of misuse is greater. A concrete example in this case of
a potential risk of unintentional clobbering is passing the scratch register to
fpu_to_int
and then using the float scratch register in any of the methods
needed for the instruction sequence needed to fulfill the truncate instructions
(this is not the case today, but by passing the float scratch register around as
if it were any other register, it becomes harder to reason and audit in the
future).There are some alternatives that I've thought about to improve the
handling/usage of the scratch registers to address some of the challenges
exposed in the previous paragraph. We could discuss those in a separate
issue/PR (e.g., relying on the type system to identify/audit scratch register usage and/or
providing exclusive access to the scratch registers, similar to what we have for
the allocatable registers). Care needs to be taken here to ensure that none of
the solutions explored affect compilation performance.Taking a deeper look at your changes for the unsigned case, only 3 registers are
needed here compared to the x64 implementation which demands 4. This means that
you can effectively ignore the temporary register defined in the Masm
unsigned_truncate
signature and only rely on the float scratch register.Taking this into account, my concrete suggestion to move forward here is to:
- Redefine the
unsigned_truncate
signature to be:fn unsigned_truncate( &mut self, context: &mut CodeGenContext<Emission>, src_size: OperandSize, dst_size: OperandSize, kind: TruncKind ) -> Result<()>;
This allows maximum flexibility to encode the ISA differences and more
importantly reduces register pressure in cases where only 3 registers are
needed, reducing unnecessary register pressure where possible is important to
minimize memory traffic.This is similar to how we handle the
div
instruction, given that the
requirements vary between ISAs.
- Redefine
fpu_to_int
to be:fn fpu_to_int( &mut self, dst: Writable<Reg>, src: Reg, src_size: OperandSize, dst_size: OperandSize, kind: TruncKind, signed: bool ) { ... };
Avoiding passing the scratch registers as params and shortening their live
range as much as possible by using them explicitly in each of the helper
functions that you have to fulfill the instruction sequence.
I believe that with this approach, my concerns will be addressed and there won't
be a need to split the signed/unsigned implementations (initially I was confused
with theTODO
comment in the code and had assumed that the implementation was
making use of a scratch register that wasn't actually needed, although that's
not the case).
saulecabrera edited PR review comment.
MarinPostma submitted PR review.
MarinPostma created PR review comment:
Thanks a lot for the detailed writeup! This makes things a lot clearer. I'll make the changes :+1:
MarinPostma updated PR #9889.
MarinPostma commented on PR #9889:
@saulecabrera I have made the requested changes
saulecabrera submitted PR review:
Looks great! Thanks for all your patience iterating on this. Left two minor comments and then we should be able to land this one.
saulecabrera created PR review comment:
Similar here, you can use the appropriate error constructor.
saulecabrera created PR review comment:
Here you can use
CodeGenError::unexpected_operand_size()
instead. https://github.com/bytecodealliance/wasmtime/blob/main/winch/codegen/src/codegen/error.rs#L139
MarinPostma updated PR #9889.
MarinPostma commented on PR #9889:
It should be good now @saulecabrera
saulecabrera submitted PR review:
Thanks!
saulecabrera merged PR #9889.
Last updated: Jan 24 2025 at 00:11 UTC