JIT and emulation as a thesis topic · cranelift

Stream: cranelift

Topic: JIT and emulation as a thesis topic

Kevin K. (Sep 01 2023 at 22:39):

Similar to #cranelift > Chaos mode bachelor thesis, I am currently thinking about writing a bachelor thesis utilizing Cranelift and its JIT capabilities. Unlike the mentioned topic, I am not interested in extending or contributing to Cranelift directly but rather in determining its real-life usefulness in a new context. Emulators after a certain generation of consoles utilize JITs heavily and I haven't yet seen anyone else use CLIR for this.

I think it could be a good fit because compilation times are much shorter, and the optimisations do not matter that much as we mostly translate machine instructions pretty directly. Now, I am still not quite sure what exactly would be in scope for a bachelor thesis. While writing a full emulator (GBA, NDS and later) is possible, the question of academic value remains -- Interpreter vs JIT is already well established. Comparisons with LLVM IR have already been done and it's known that LLVM compile times make it unfit for emulation, so that topic does not really work either.

So if anyone has any ideas, I'd be grateful to hear them.

fitzgen (he/him) (Sep 01 2023 at 22:46):

gotta run right now, but you might get a kick out of this: https://old.reddit.com/r/rust/comments/xe3bx1/minecraft_running_on_a_redstone_cpugpu/iog1o2t/?context=99999

Minecraft running on a redstone CPU/GPU implemented in Minecraft, running on a custom Minecraft server (written in Rust) capable of performing redstone calculations 10,000x faster than vanilla Minecraft


Chris Fallin (Sep 01 2023 at 23:17):

It is definitely an interesting question in my mind (for an undergrad thesis) to explore compilation-time tradeoffs -- most of the JIT-based emulators I'm aware of (qemu, Valgrind) do fairly straightforward "baseline-level" compilation (not much optimization)

Chris Fallin (Sep 01 2023 at 23:18):

in order to get benefit from Cranelift's optimization you'd probably want a function-scope compiler, not a trace-based or basic-block-based JIT; that's sort of an interesting challenge when translating from machine code (look up "control flow recovery"; at least, for variable-length ISAs like x86)
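A minimal sketch of what "control flow recovery" can look like, assuming a toy fixed-width ISA: starting from a known entry point, follow direct branches with a worklist and record where each basic block begins. The `Inst` enum and `recover_leaders` are hypothetical names for illustration, not part of Cranelift or any emulator. Indirect jumps are simply left unresolved here, which is exactly where real function-scope translation gets hard.

```rust
use std::collections::BTreeSet;

/// Toy fixed-width guest instructions (hypothetical, not a real ISA).
#[derive(Clone, Copy, Debug)]
enum Inst {
    /// Any instruction that does not affect control flow.
    Other,
    /// Unconditional direct jump to an instruction index.
    Jump(usize),
    /// Conditional direct branch: either falls through or goes to the target.
    BranchIf(usize),
    /// Jump through a register; the target is unknown statically.
    JumpIndirect,
    /// Return from the guest function.
    Return,
}

/// Collect the start index of every basic block reachable from `entry`
/// through direct control flow only.
fn recover_leaders(code: &[Inst], entry: usize) -> BTreeSet<usize> {
    let mut leaders = BTreeSet::new();
    let mut worklist = vec![entry];
    while let Some(start) = worklist.pop() {
        // Skip out-of-range targets and block starts we already visited.
        if start >= code.len() || !leaders.insert(start) {
            continue;
        }
        let mut pc = start;
        while pc < code.len() {
            match code[pc] {
                Inst::Other => pc += 1,
                Inst::Jump(target) => {
                    worklist.push(target);
                    break;
                }
                Inst::BranchIf(target) => {
                    worklist.push(target);
                    worklist.push(pc + 1); // fall-through successor
                    break;
                }
                // Unresolved: a real system needs a runtime fallback here.
                Inst::JumpIndirect | Inst::Return => break,
            }
        }
    }
    leaders
}

fn example_code() -> Vec<Inst> {
    vec![
        Inst::Other,       // 0
        Inst::BranchIf(4), // 1
        Inst::Other,       // 2
        Inst::Jump(0),     // 3
        Inst::Other,       // 4
        Inst::Return,      // 5
        Inst::Other,       // 6: never reached from entry 0
    ]
}

fn main() {
    let leaders = recover_leaders(&example_code(), 0);
    println!("block leaders: {:?}", leaders);
}
```

On a variable-length ISA the same walk is less reliable, because a wrong guess about where an instruction starts changes how every following byte decodes.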

Chris Fallin (Sep 01 2023 at 23:19):

the relevant questions include: what would you be trying to optimize (i.e. why do you think that applying a more strongly optimizing compiler would help) -- given that the input is machine code, is it that the original machine code is not optimized, or the translation introduces overheads, or ...

Chris Fallin (Sep 01 2023 at 23:20):

and then also the impedance mismatches that invariably arise -- different CPUs have slightly different status flags, ways of handling floating-point corner cases, etc -- and how to optimize them
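One concrete instance of the flag mismatch: after a subtraction, ARM sets its carry flag when no borrow occurred, while x86 sets CF when a borrow did occur. A common way to keep flag emulation cheap is to record the operands of the last flag-setting operation and compute individual flags only when something reads them. The sketch below assumes 32-bit ARM-style NZCV flags for add and subtract only; `LazyFlags` and its methods are hypothetical names.

```rust
/// Kind of the last flag-setting operation (hypothetical toy subset).
#[derive(Clone, Copy, Debug, PartialEq)]
enum LastOp {
    Add,
    Sub,
}

/// Lazily evaluated guest flags: store the inputs of the last
/// flag-setting instruction instead of the flags themselves.
#[derive(Clone, Copy, Debug)]
struct LazyFlags {
    op: LastOp,
    lhs: u32,
    rhs: u32,
    result: u32,
}

impl LazyFlags {
    fn after_add(lhs: u32, rhs: u32) -> Self {
        LazyFlags { op: LastOp::Add, lhs, rhs, result: lhs.wrapping_add(rhs) }
    }

    fn after_sub(lhs: u32, rhs: u32) -> Self {
        LazyFlags { op: LastOp::Sub, lhs, rhs, result: lhs.wrapping_sub(rhs) }
    }

    /// Negative: top bit of the result.
    fn n(&self) -> bool {
        (self.result as i32) < 0
    }

    /// Zero: result is zero.
    fn z(&self) -> bool {
        self.result == 0
    }

    /// ARM-style carry: unsigned overflow for add, "no borrow" for sub.
    fn c(&self) -> bool {
        match self.op {
            LastOp::Add => self.lhs.checked_add(self.rhs).is_none(),
            LastOp::Sub => self.lhs >= self.rhs,
        }
    }

    /// Signed overflow.
    fn v(&self) -> bool {
        match self.op {
            LastOp::Add => (self.lhs as i32).checked_add(self.rhs as i32).is_none(),
            LastOp::Sub => (self.lhs as i32).checked_sub(self.rhs as i32).is_none(),
        }
    }

    /// What an x86 host would report in CF for the same operation:
    /// identical for add, inverted for sub (CF means "borrow" there).
    fn x86_cf(&self) -> bool {
        match self.op {
            LastOp::Add => self.c(),
            LastOp::Sub => !self.c(),
        }
    }
}

fn main() {
    let f = LazyFlags::after_sub(1, 2);
    println!(
        "N={} Z={} C={} V={} x86 CF={}",
        f.n(), f.z(), f.c(), f.v(), f.x86_cf()
    );
}
```

If no later instruction reads a flag before the next flag-setting operation overwrites the record, none of these methods run at all, which is the point of the lazy scheme.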

Chris Fallin (Sep 01 2023 at 23:21):

Cranelift does give you a "mid-end" in which you can do rewrites, and it's not too hard to extend it with new opcodes/instructions, so it might be a good substrate for exploration of the semantic-mismatch questions in particular

Kevin K. (Sep 01 2023 at 23:23):

Chris Fallin said:

It is definitely an interesting question in my mind (for an undergrad thesis) to explore compilation-time tradeoffs -- most of the JIT-based emulators I'm aware of (qemu, Valgrind) do fairly straightforward "baseline-level" compilation (not much optimization)

I am not really thinking of qemu or Valgrind necessarily, that's more virtualization. I mean more so bigger console emulators like yuzu, citra or similar that often utilize ARM to x86 JITing. They are also relatively basic but often include their own IR and do minor optimisations on the block level. Usually they work on basic blocks until they hit indirect branches.

Kevin K. (Sep 01 2023 at 23:25):

Chris Fallin said:

the relevant questions include: what would you be trying to optimize (i.e. why do you think that applying a more strongly optimizing compiler would help) -- given that the input is machine code, is it that the original machine code is not optimized, or the translation introduces overheads, or ...

I would have to research a bit more but while the translation is relatively straightforward, some stuff is still possible. Though I can't remember rn, I will report back on that in a bit.

Kevin K. (Sep 01 2023 at 23:25):

https://github.com/merryhime/dynarmic is the one I am thinking of that uses its own IR

GitHub - merryhime/dynarmic: An ARM dynamic recompiler.


Chris Fallin (Sep 01 2023 at 23:31):

Cool, looking forward to hearing more! FWIW, in case you haven't hit on it yet -- a lot of prior work in this vein goes under the name "dynamic binary translation"; one of the original systems was Dynamo at HP in the 90s, which lives on as DynamoRIO (I think they did PA-RISC-to-PA-RISC dynamic recompilation with opts); qemu and Valgrind are absolutely relevant, in that they both do trace-based JIT'ing as well and you can learn a lot from the way they handle machine flags, how they organize and look up JIT'd code fragments, etc.

Kevin K. (Sep 01 2023 at 23:47):

I've heard that term before actually but nice to know the context and origin of it!

Kevin K. (Sep 01 2023 at 23:51):

Kevin K. said:

Chris Fallin said:

the relevant questions include: what would you be trying to optimize (i.e. why do you think that applying a more strongly optimizing compiler would help) -- given that the input is machine code, is it that the original machine code is not optimized, or the translation introduces overheads, or ...

I would have to research a bit more but while the translation is relatively straightforward, some stuff is still possible. Though I can't remember rn, I will report back on that in a bit.

Oh, and I think I misread this message the first time: those are potential questions to be asked and answered during research and comparisons, and they are your suggestions. But yea, that is kind of my point -- finding those questions that have enough academic value and scope. A more strongly optimizing compiler, for example, can be overkill for many reasons. As you said, the input is already machine code, and translation with strong optimizers will take longer and introduce stuttering. Optimizations like DCE and constant propagation are quite common compared to whatever LLVM has deep in its repertoire.
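To make the two optimizations named above concrete, here is a minimal sketch of constant propagation followed by dead code elimination over one straight-line block. It assumes a tiny SSA-like IR in which every virtual register is defined once; the `Inst` enum and the function names are hypothetical and do not correspond to CLIR or to any emulator's IR.

```rust
use std::collections::{HashMap, HashSet};

/// Toy block-level IR; `dst`, `a`, `b`, `src` are virtual register numbers.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Inst {
    Const { dst: usize, value: u32 },
    /// Read a guest register into a virtual register.
    LoadReg { dst: usize, guest_reg: u8 },
    Add { dst: usize, a: usize, b: usize },
    /// Write a virtual register back to a guest register (side effect).
    StoreReg { guest_reg: u8, src: usize },
}

/// Forward pass: replace an `Add` of two known constants with a `Const`.
fn fold_constants(block: &[Inst]) -> Vec<Inst> {
    let mut known: HashMap<usize, u32> = HashMap::new();
    let mut out = Vec::with_capacity(block.len());
    for inst in block {
        match *inst {
            Inst::Const { dst, value } => {
                known.insert(dst, value);
                out.push(*inst);
            }
            Inst::Add { dst, a, b } => match (known.get(&a), known.get(&b)) {
                (Some(&x), Some(&y)) => {
                    let value = x.wrapping_add(y);
                    known.insert(dst, value);
                    out.push(Inst::Const { dst, value });
                }
                _ => out.push(*inst),
            },
            Inst::LoadReg { .. } | Inst::StoreReg { .. } => out.push(*inst),
        }
    }
    out
}

/// Backward pass: keep side effects and anything they transitively use.
fn eliminate_dead_code(block: &[Inst]) -> Vec<Inst> {
    let mut live: HashSet<usize> = HashSet::new();
    let mut kept = Vec::new();
    for inst in block.iter().rev() {
        match *inst {
            Inst::StoreReg { src, .. } => {
                live.insert(src);
                kept.push(*inst);
            }
            Inst::Add { dst, a, b } => {
                if live.contains(&dst) {
                    live.insert(a);
                    live.insert(b);
                    kept.push(*inst);
                }
            }
            Inst::Const { dst, .. } | Inst::LoadReg { dst, .. } => {
                if live.contains(&dst) {
                    kept.push(*inst);
                }
            }
        }
    }
    kept.reverse();
    kept
}

fn optimize(block: &[Inst]) -> Vec<Inst> {
    eliminate_dead_code(&fold_constants(block))
}

fn example_block() -> Vec<Inst> {
    vec![
        Inst::Const { dst: 0, value: 4 },
        Inst::Const { dst: 1, value: 8 },
        Inst::Add { dst: 2, a: 0, b: 1 },
        Inst::LoadReg { dst: 3, guest_reg: 1 },
        Inst::Add { dst: 4, a: 3, b: 3 }, // result never used
        Inst::StoreReg { guest_reg: 0, src: 2 },
    ]
}

fn main() {
    for inst in optimize(&example_block()) {
        println!("{:?}", inst);
    }
}
```

Both passes are linear in the block length, which is why they stay affordable even when translation happens while the guest is running.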

Kevin K. (Sep 02 2023 at 00:02):

In any case, I would like to stick with a basic-block-based JIT for now, since that is much less complex and I don't think I'd be able to fit the other approaches into the time scope of a thesis. Most emulators that I know rely mostly on that as well.
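As a rough illustration of the basic-block-based approach, the sketch below finds the extent of the block that starts at a given guest position by scanning forward until the first branch. The `GuestInst` enum is a hypothetical toy ISA, and ending the block at every branch is an assumption; some emulators keep going past direct branches and stop only at indirect ones.

```rust
use std::ops::Range;

/// Toy guest instructions (hypothetical, not a real ISA).
#[derive(Clone, Copy, Debug, PartialEq)]
enum GuestInst {
    AddImm { reg: u8, imm: u32 },
    /// Direct branch to an instruction index.
    Jump { target: usize },
    /// Indirect branch through a register.
    JumpReg { reg: u8 },
}

impl GuestInst {
    /// Any branch terminates the current block in this sketch.
    fn ends_block(&self) -> bool {
        matches!(self, GuestInst::Jump { .. } | GuestInst::JumpReg { .. })
    }
}

/// Half-open range of instruction indices forming the block at `start`.
/// The terminating branch, if any, is included in the block.
fn scan_block(code: &[GuestInst], start: usize) -> Range<usize> {
    let mut end = start;
    while end < code.len() {
        let inst = code[end];
        end += 1;
        if inst.ends_block() {
            break;
        }
    }
    start..end
}

fn example_code() -> Vec<GuestInst> {
    vec![
        GuestInst::AddImm { reg: 0, imm: 1 }, // 0
        GuestInst::AddImm { reg: 1, imm: 2 }, // 1
        GuestInst::Jump { target: 0 },        // 2
        GuestInst::AddImm { reg: 2, imm: 3 }, // 3
        GuestInst::JumpReg { reg: 2 },        // 4
    ]
}

fn main() {
    let code = example_code();
    let mut pc = 0;
    // Walk the code linearly, printing each block the scanner finds.
    while pc < code.len() {
        let block = scan_block(&code, pc);
        println!("block {:?}: {:?}", block, &code[block.clone()]);
        pc = block.end;
    }
}
```

A JIT built this way translates one such range on demand, the first time execution reaches its start address.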

Kevin K. (Sep 04 2023 at 20:41):

I am still partial to comparing LLVM IR and CLIR with respect to https://github.com/bytecodealliance/wasmtime/blob/main/cranelift/docs/compare-llvm.md and seeing how well those differences translate in emulation

Amanieu (Sep 04 2023 at 23:10):

As someone who has worked on binary translation for ~10 years, I find that compiler IRs like LLVM or Cranelift are not really suitable for implementing emulators. The main issue is that these IRs are designed for compiling things that look like C functions: you have a stack and the ability to call nested functions on that stack, function calls use standard calling conventions, etc.

The current compiler that I am working on for a binary translator has specific features directly as opcodes in the IR, which allows them to be optimized by passes. For example:

- "Functions" don't exist at this level. Translated code blocks effectively always tail call to the next translated block.
- Support for multiple entry points, which is needed if you are doing call/return optimizations: a return target in the original code is represented as a secondary entry point in the translated code block. You can't treat it like a normal call because you don't have a stack on which to keep data across calls.
- A custom calling convention is used to improve performance. Almost all registers carry a significant value when transferring control between translated code blocks.
- All loads and stores can fault, which requires generating metadata for use by the trap handler to recover the original register state. This register state is passed to the original trap handler which expects to see a native register state.
- Loads from memory regions that are known to be read-only can be optimized to constants.
- Condition flag emulation is extremely performance-sensitive, so you need optimizations that convert "evaluate some condition" to a native compare&branch operation.
- Control transfers between blocks require doing a runtime lookup to find the translated code block for a given source address. This can be optimized if the source address is known to be a constant, which avoids the runtime hash table lookup.

It's probably possible to modify Cranelift to support these, but I decided to write my compiler entirely from scratch. It was heavily inspired by the design of Cranelift though, and in fact also uses regalloc2 for register allocation.
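The point about control transfers between blocks can be sketched as a code cache keyed by guest address, where a block whose exit target is a constant remembers its successor after the first lookup. This is only an illustration under assumed names (`CodeCache`, `Exit`, `next_block`), not the design of the compiler described above; a real translator would patch the jump in generated machine code and would have to undo such links when blocks are invalidated.

```rust
use std::collections::HashMap;

type BlockId = usize;

/// How a translated block leaves (hypothetical model).
#[derive(Clone, Copy, Debug)]
enum Exit {
    /// Guest target is a constant; `linked` caches the resolved successor.
    Direct { target: u32, linked: Option<BlockId> },
    /// Guest target is only known at run time.
    Indirect,
}

#[derive(Debug)]
struct TranslatedBlock {
    guest_pc: u32,
    exit: Exit,
}

struct CodeCache {
    blocks: Vec<TranslatedBlock>,
    by_guest_pc: HashMap<u32, BlockId>,
    /// Number of hash-table lookups performed, for illustration only.
    lookups: usize,
}

impl CodeCache {
    fn new() -> Self {
        CodeCache { blocks: Vec::new(), by_guest_pc: HashMap::new(), lookups: 0 }
    }

    fn insert(&mut self, guest_pc: u32, exit: Exit) -> BlockId {
        let id = self.blocks.len();
        self.blocks.push(TranslatedBlock { guest_pc, exit });
        self.by_guest_pc.insert(guest_pc, id);
        id
    }

    fn lookup(&mut self, guest_pc: u32) -> Option<BlockId> {
        self.lookups += 1;
        self.by_guest_pc.get(&guest_pc).copied()
    }

    /// Find the block to run after `from`. `runtime_target` is used only
    /// for indirect exits; direct exits are resolved once and then linked.
    fn next_block(&mut self, from: BlockId, runtime_target: u32) -> Option<BlockId> {
        let target = match self.blocks[from].exit {
            Exit::Direct { linked: Some(id), .. } => return Some(id),
            Exit::Direct { target, linked: None } => target,
            Exit::Indirect => runtime_target,
        };
        let found = self.lookup(target)?;
        if let Exit::Direct { linked, .. } = &mut self.blocks[from].exit {
            *linked = Some(found);
        }
        Some(found)
    }
}

fn main() {
    let mut cache = CodeCache::new();
    let a = cache.insert(0x1000, Exit::Direct { target: 0x2000, linked: None });
    let b = cache.insert(0x2000, Exit::Indirect);
    for _ in 0..3 {
        let next = cache.next_block(a, 0).expect("successor is translated");
        println!("0x{:x} -> 0x{:x}", cache.blocks[a].guest_pc, cache.blocks[next].guest_pc);
    }
    let back = cache.next_block(b, 0x1000);
    println!("indirect exit resolved to {:?}, {} lookups total", back, cache.lookups);
}
```

In this model the direct exit costs one hash lookup in total, while the indirect exit pays for one on every transfer.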

Last updated: Apr 08 2025 at 03:15 UTC