Similar to #cranelift > Chaos mode bachelor thesis, I am currently thinking about writing a bachelor thesis utilizing Cranelift and its JIT capabilities. Unlike that topic, though, I am not interested in extending or contributing to Cranelift directly, but rather in determining its real-world usefulness in a new context. Emulators for consoles after a certain generation utilize JITs heavily, and I haven't yet seen anyone else use CLIR for this.
I think it could be a good fit because of its much shorter compile times, and the optimisations do not matter that much since we mostly translate machine instructions fairly directly. Now, I am still not quite sure what exactly would be in scope for a bachelor thesis. While writing a full emulator (GBA, NDS and later) is possible, the question of academic value remains -- interpreter vs. JIT is already well established. Comparisons with LLVM IR have already been done, and it's known that LLVM's compile times make it unfit for emulation, so that topic does not really work either.
So if anyone has any ideas, I'd be grateful to hear them.
gotta run right now, but you might get a kick out of this: https://old.reddit.com/r/rust/comments/xe3bx1/minecraft_running_on_a_redstone_cpugpu/iog1o2t/?context=99999
It is definitely an interesting question in my mind (for an undergrad thesis) to explore compilation-time tradeoffs -- most of the JIT-based emulators I'm aware of (qemu, Valgrind) do fairly straightforward "baseline-level" compilation (not much optimization)
in order to get benefit from Cranelift's optimization you'd probably want a function-scope compiler, not a trace-based or basic-block-based JIT; that's sort of an interesting challenge when translating from machine code (look up "control flow recovery"; at least, for variable-length ISAs like x86)
the relevant questions include: what would you be trying to optimize (i.e. why do you think that applying a more strongly optimizing compiler would help) -- given that the input is machine code, is it that the original machine code is not optimized, or the translation introduces overheads, or ...
and then also the impedance mismatches that invariably arise -- different CPUs have slightly different status flags, ways of handling floating-point corner cases, etc -- and how to optimize them
Cranelift does give you a "mid-end" in which you can do rewrites, and it's not too hard to extend it with new opcodes/instructions, so it might be a good substrate for exploration of the semantic-mismatch questions in particular
Chris Fallin said:
It is definitely an interesting question in my mind (for an undergrad thesis) to explore compilation-time tradeoffs -- most of the JIT-based emulators I'm aware of (qemu, Valgrind) do fairly straightforward "baseline-level" compilation (not much optimization)
I am not really thinking of qemu or Valgrind necessarily; that's more virtualization. I mean more the bigger console emulators like yuzu, citra or similar, which often JIT ARM to x86. They are also relatively basic, but often include their own IR and do minor optimisations at the block level -- usually basic blocks until they hit an indirect branch.
Chris Fallin said:
the relevant questions include: what would you be trying to optimize (i.e. why do you think that applying a more strongly optimizing compiler would help) -- given that the input is machine code, is it that the original machine code is not optimized, or the translation introduces overheads, or ...
I would have to research a bit more but while the translation is relatively straightforward, some stuff is still possible. Though I can't remember rn, I will report back on that in a bit.
https://github.com/merryhime/dynarmic is the one I am thinking of that uses its own IR
Cool, looking forward to hearing more! FWIW, in case you haven't hit on it yet -- a lot of prior work in this vein goes under the name "dynamic binary translation". One of the original systems was Dynamo at HP in the 90s, which lives on as DynamoRIO (I think they did PA-RISC-to-PA-RISC dynamic recompilation with opts). qemu and Valgrind are absolutely relevant, in that they both do trace-based JIT'ing as well, and you can learn a lot from the way they handle machine flags, how they organize and look up JIT'd code fragments, etc.
I've heard that term before actually but nice to know the context and origin of it!
Kevin K. said:
Chris Fallin said:
the relevant questions include: what would you be trying to optimize (i.e. why do you think that applying a more strongly optimizing compiler would help) -- given that the input is machine code, is it that the original machine code is not optimized, or the translation introduces overheads, or ...
I would have to research a bit more but while the translation is relatively straightforward, some stuff is still possible. Though I can't remember rn, I will report back on that in a bit.
Oh, and I think I misread this message the first time -- those are potential questions to be asked and answered during research and comparisons, and your suggestions. But yeah, that is kind of my point: finding the questions that have enough academic value and scope. A more strongly optimizing compiler, for example, can be overkill for many reasons. As you said, the input is already machine code, and translation with strong optimizers will take longer and introduce stuttering. Optimizations like DCE and constant propagation are quite common, compared to whatever LLVM has deep in its repertoire.
In any case, I would like to stick with a basic-block-based JIT for now, since that is much less complex and I don't think I'd be able to fit the other approaches into the time scope of a thesis. Most emulators that I know of rely mostly on that as well.
I am still partial to comparing LLVM IR and CLIR with respect to https://github.com/bytecodealliance/wasmtime/blob/main/cranelift/docs/compare-llvm.md and seeing how well those differences translate in emulation
As someone who's worked on binary translation for ~10 years, I find that compiler IRs like LLVM or Cranelift are not really suitable for implementing emulators. The main issue is that these IRs are designed for compiling things that look like C functions: you have a stack and the ability to call nested functions on that stack, function calls use standard calling conventions, etc.
The current compiler that I am working on for a binary translator has specific features directly as opcodes in the IR, which allows them to be optimized by passes. For example:
It's probably possible to modify Cranelift to support these, but I decided to write my compiler entirely from scratch. It was heavily inspired by Cranelift's design, though, and in fact also uses regalloc2 for register allocation.
Last updated: Dec 23 2024 at 13:07 UTC