Hi all - I'm working on a university project to make a JIT compiler for OCaml bytecode in Rust. I'd like to use cranelift for the backend generation[1] but I need to make it work with the OCaml runtime's garbage collector. The precise stack maps and reference type support looks promising but as an R64 is just an opaque value there's not much I can do with it. My initial research has led me to consider forking cranelift for the project[2]. I'd appreciate if any of you could have a quick glance at my idea for the change I'd make and tell me if it might work or would be incredibly difficult:
It's my understanding from looking at the docs that an R32/R64 is just an opaque reference besides some WASM-specific primitives to make and check for null R64s (please correct me if I'm wrong!).
Would I possibly have any success adding instructions to cast from a R64 to an I64 and back again? The way I'd hope it'd work is any live R64 values are spilled if they're in registers and put on the stackmap at safepoints but mutating the I64 derived from the R64 doesn't cause the mutated I64 to be spilled accidentally. However in cases where it's just doing a pointer lookup with an offset using the I64 we got from the R64 they'd ideally use the same register.
In the input IR no I64 pointer derived from an R64 would ever be live[3] - but would I have to be worried about any optimisations reordering things to break this invariant?
Would this possibly work? Is there anything I'd have to be careful about/modify to do this beyond the minimum to add a new type of instruction?
[1] I've already got a JIT working using https://github.com/CensoredUsername/dynasm-rs but it keeps exactly to the same semantics as the existing bytecode interpreter (that is really inefficient stack machine). I'm now working on a slower compiler that I can replace closures with once I've noticed they're hot. I've written something which keeps track of the stack and basic blocks and converts it into SSA - I'd be looking to use cranelift to turn this SSA into x86_64 assembly but I'd need to integrate what I do with the GC.
Despite not being designed for the type of GC OCaml uses (precise generational mark-and-sweep rather than delayed reference counting) I think besides not being able to access and create raw reference values the existing support in cranelift is a perfect fit - when the GC is called I'd just have to walk up the stack and lookup stackmaps from a hashtable keyed by the return address.
[2] It's a write-once nobody-will-use contrived university project so taking the time to work on a more general solution to my problem that could be upstreamed is not worth it for me now - but I'd definitely be happy to share anything I found or consider working on something more general once I have more time this summer
[3] The OCaml runtime is single-threaded and garbage collection will only ever be triggered in specific places (like allocations once the "minor heap" is full and it's no longer just a pointer bump). The way I'm going to do codegen means it's pretty easy to keep this invariant - any use of the actual values of the R64 will be local between statepoints to do pointer lookups.
Would I possibly have any success adding instructions to cast from a R64 to an I64 and back again? The way I'd hope it'd work is any live R64 values are spilled if they're in registers and put on the stackmap at safepoints but mutating the I64 derived from the R64 doesn't cause the mutated I64 to be spilled accidentally. However in cases where it's just doing a pointer lookup with an offset using the I64 we got from the R64 they'd ideally use the same register.
If the r64
s aren't used again after casting to i64
, then they won't be in the stack maps because the regalloc won't know that they are still "live" via the derived i64
, meaning that the GC won't see these references, will mistakenly reclaim them as garbage, and you'll get use after frees. This is a pretty serious footgun with the approach.
I think it would be better to add support for loading/storing through r64
/r32
with the existing load/store instructions (I looked into this once / started doing it once, but I don't remember if I ever actually finished it...)
If the
r64
s aren't used again after casting toi64
, then they won't be in the stack maps because the regalloc won't know that they are still "live" via the derivedi64
, meaning that the GC won't see these references, will mistakenly reclaim them as garbage, and you'll get use after frees. This is a pretty serious footgun with the approach.
This is impossible the way I'd be using it - there wouldn't be any live i64
s derived from the r64
at places the GC could trigger (asuming this doesn't get reordered by an optimisation). It would be things sort of like this exmaple
r64
to i64
i64
i64
back to r64
where the value in 2 is only ever used in 3.
If the original r64
isn't ever used again and isn't referenced by any of the other roots then it is indeed garbage and would be collected - but that'd be fine because the i64
wouldn't be used either. There's never going to be an reference-derived i64 which is alive between multiple GC calls.
I think it would be better to add support for loading/storing through
r64
/r32
with the existing load/store instructions (I looked into this once / started doing it once, but I don't remember if I ever actually finished it...)
This is where OCaml makes things fun - it uses a uniform data representation[1] where everything is either a 63 bit integer with an LSB of 1 or it's a pointer (taking advantage of pointer alignment). So GC-managed values would still need to be cast to pointers in any case. For example, for sum types/Rust enums it uses an integer for the nullary constructors and effectively a tagged tuple for the constructors with data. Lots of functions I'd be compiling would branch based on seeing whether the bit 0 is 0 or 1.
So if I'm going to use cranelift I'm forced to somehow interpret GC-managed values as raw integers - unless there's some other approach I'm not considering?
I guess my question is, is there any subtle way this could break things (asuming the above invariant stops the problem of live i64
s after the r64
is dead)?
[1] It's basically the opposite approach to Rust/C++ templates - rather than generate lots of different versions of the code make everything look the same and generate one. The data representations you're forced to use are as slow as you'd expect - a 4 byte f32 takes up 16 bytes in total and requires boxing/unboxing every time you do everything with it.
Might using raw_bitcast
work? It looks like the x64 backend (the only one I care about) lowers it to a move:
Opcode::RawBitcast => {
// A raw_bitcast is just a mechanism for correcting the type of V128 values (see
// https://github.com/bytecodealliance/wasmtime/issues/1147). As such, this IR
// instruction should emit no machine code but a move is necessary to give the register
// allocator a definition for the output virtual register.
let src = put_input_in_reg(ctx, inputs[0]);
let dst = get_output_reg(ctx, outputs[0]).only_reg().unwrap();
let ty = ty.unwrap();
ctx.emit(Inst::gen_move(dst, src, ty));
}
// then
fn gen_move(dst_reg: Writable<Reg>, src_reg: Reg, ty: Type) -> Inst {
let rc_dst = dst_reg.to_reg().get_class();
let rc_src = src_reg.get_class();
// If this isn't true, we have gone way off the rails.
debug_assert!(rc_dst == rc_src);
match rc_dst {
RegClass::I64 => Inst::mov_r_r(OperandSize::Size64, src_reg, dst_reg),
// ...
@Will Robson the one potential issue I see in this scheme is: there is a window of unrooted vulnerability between the cast-to-i64 and deref if this is the last use of the ref (so the cast back to r64 to "keep it alive" is actually dead). The combination of instruction lowering and regalloc live analysis is smart enough to see through that pattern -- the move for the i64-to-r64 cast won't be lowered (its result is unused) and the regalloc will see the live range of the r64 ends at the r64-to-i64
This is fine if there are no safepoints between r64-to-i64 cast and load, of course
This is fine if there are no safepoints between r64-to-i64 cast and load, of course
Luckily I have this invariant with the format of the OCaml bytecode I'm translating from. It's what I was trying to get at when I said "there wouldn't be any live i64s derived from the r64 at places the GC could trigger". It won't necessarily be true at every safepoint (because one gets created every function call) but it will be at every safepoint the GC could trigger.
The way I'd be using the cast to i64 would be to inline the actual operations I'd need to do with GC values but these temporary i64s would always have a lifetime that fits in between any two GC operations (if that makes sense?). It's not so much there won't be any alive i64
s without a corresponding live r64
- but instead there will never be any alive r64
-derived i64
s at a meaningful safepoint.
It's not something true in general without being careful about keeping the invariant in the CLIF and would likely break with aggressive LLVM-style optimizations but for this I think it would work.
Thank you both for your help! I think I can make this work without having to fork cranelift.
I'm posting this here rather than the #cranelift-new-backend stream to keep the context
This approach worked for a while but for one of my functions caused an assertion to fail deep insideregalloc
. I've only spent a few hours trying to wrap my head around the regalloc but this is where I've traced it to. The assertion that fails is the second from this block at the start of alloc_spill_slots
:
let is_ref = vlr_env[vlrix].is_ref;
for cand_vlrix in vlrEquivClasses.equiv_class_elems_iter(vlrix) {
// "None of the VLRs in this equivalence class have an allocated spill slot."
// This should be true because we allocate spill slots for all of the members of an
// eclass at once.
assert!(vlr_slot_env[cand_vlrix].is_none());
// "All of the VLRs in this eclass have the same ref-ness as this VLR."
// Why this is true is a bit subtle. The equivalence classes are computed by
// `do_coalescing_analysis`, fundamentally by looking at all the move instructions
// and computing the transitive closure induced by them. The ref-ness annotations
// on each VLR are computed in `do_reftypes_analysis`, and they are also computed
// as a transitive closure on the same move instructions. Hence the results should
// be identical.
//
// With all that said, note that these equivalence classes are *not* guaranteed to
// be internally non-overlapping. This is explained in the big block comment at the
// top of bt_coalescing_analysis.rs.
assert!(vlr_env[cand_vlrix].is_ref == is_ref);
}
The vcode passed in to regalloc contains:
Inst 5: movl $1, %v56Jl
Inst 6: movq %v56J, %v7J
%v56J
isn't used at all after this but %v7J
is used later. The original IR had:
v7 = iconst.i64 1
v8 = raw_bitcast.r64 v7
So %v7J
is marked as a reftyped vreg but %v56J
(which for some reason v8 gets put in depsite the original) isn't. Their corresponding virtualranges also behave the same way.
analysis_reftypes
doesn't mark $v56J as containing a reftyped value as it's doing a DFS starting at all reftyped regs. But bt_coalescing_analysis
puts them in an equivalence class because it sees the move.
So for this particular case the reasoning behind the assertion is wrong. The assumption is a -mov-> b & b reffy ==> a reffy
but that's not true in this case. I'd imagine it works completely fine for wasm where there's no casting between reftype values an dintegers. I've only just ran into it on this one example after thousands of compiled functions don't hit this assertion despite doing moves from cranelift I64
s to R64
s all the time so hitting a code path where this assertion fails is somewhat rare.
Commenting out the assertion doesn't appear to break anything in the generated assembly for this case - but it might be an important invariant that just hasn't happened to trigger any problems?
Should the coalescing analysis consider reffy-ness
when deciding to put things in the same class? Does this invariant actually matter for the spillslot allocator?
Ah, so this is the check that a vreg is always used either as a "reffy" value or never
Are you reusing a vreg for a GC ref and then a non-GC value?
Err actually sorry I'm thinking in non-SSA
Hmm. A test-case would be useful to see; but also I should note I'm currently reworking the regalloc so if this is a real bug in regalloc.rs then it may or may not be worth looking into
This is the full logs including the IR that caused it & VCode: https://gist.github.com/wrbs/d4fde11deae93dcb1f3f89cebe7c1670
I haven't had any luck reproducing it with anything smaller so far
but I also haven't tried very hard to do that
OK, if you can get a minimized example, let us know (or at least point out which reg is both a GC ref and not, according to the regalloc)
Thanks for your help! I'll try to get something reproducible as an easily executable test case tomorrow and see see how much of it I can remove and still hit the issue.
(or at least point out which reg is both a GC ref and not, according to the regalloc)
I'm not sure if this is it? It's more that two ranges one of which is of ref type get put in an equivalence class because they contain the same value. What I don't know is why this assertion only fails for this particular case not any of the other times. This is a smaller case showing the same pattern (see Inst pairs 3,4/5,6/7,8):
VCode_ShowWithRRU {{
Entry block: 0
Block 0:
(original IR block: block0)
(successor: Block 1)
(instruction range: 0 .. 16)
Inst 0: movq %rdi, %v0J
Inst 1: movq %rsi, %v1J
Inst 2: load_ext_name u1:0+0, %v2J
Inst 3: movl $7, %v17Jl
Inst 4: movq %v17J, %v4J
Inst 5: movl $5, %v16Jl
Inst 6: movq %v16J, %v6J
Inst 7: movl $3, %v15Jl
Inst 8: movq %v15J, %v8J
Inst 9: movq 8(%v0J), %v9J
Inst 10: movq %v1J, %v11J
Inst 11: addq $-16, %v11J
Inst 12: movq %v8J, -16(%v1J)
Inst 13: movq %v6J, -8(%v1J)
Inst 14: movq %v4J, 0(%v1J)
Inst 15: jmp label1
Block 1:
(original IR block: block1)
(instruction range: 16 .. 23)
Inst 16: movq %v11J, 152(%v2J)
Inst 17: movq %v9J, %v12J
Inst 18: movl $3, %v14Jl
Inst 19: movq %v14J, %v13J
Inst 20: movq %v12J, %rax
Inst 21: movq %v13J, %rdx
Inst 22: ret
}}
but it compiles fine
Ah! I just realized that perhaps this is a consequence of move coalescing. "Reffyness" propagates across moves (that's the fixpoint loop where the assert fired) so the bitcasts you're using to turn R64s into I64s are going to cause this
Somehow we need to mark those as not moves but opaque operators
(in other words I can imagine the fix but it's slightly nontrivial)
Last updated: Jan 24 2025 at 00:11 UTC