Hi there, I am doing some research into the possibility of lifting x86 semantics to Cranelift IR. I see that there is a lack of intrinsics; however, I do see that the IR has x86-specific instructions (x86_pmaddubsw, x86_pmulhrsw, etc.).
My question is, how difficult would it be to implement a new opcode/IR operation such as "CPUID", "RDTSC", or reads/writes to control registers? Is there a post already explaining the process of creating new IR operations? I assume I'll need to make a new rule for lowering the IR to the native instruction...
Also, I already decided against McSema, QEMU IR, LLVM IR, and angr IR. Thanks for your time :)
@ZwSwapCert could you say a bit more about the overall goal -- do you want to lift x86 to CLIF, maybe do some transforms, then recompile to x86 only (dynamic binary instrumentation-style)? Or do you want to maybe recompile to other targets too (qemu-style)?
In any case, yes, one can add new opcodes for sure; if the goal is to take x86 semantics from x86 source to x86-only target that may be the lowest-friction approach for you, while if the goal is to work on any target then it may be better to work with existing opcodes. It might be interesting to look at how qemu's tcg or Valgrind's VEX model processor state: something like CPUID is lowered into loads from a "processor context" struct that your runtime provides. Anyway, happy to brainstorm more depending on use-case :-)
Hi there! My team's goal is to focus only on x86 for now.
The reason for this project is: we are building a binary deobfuscation/obfuscation framework. We would like to have a semantic representation of x86 in an IR format which we can then run simplification (or obfuscation passes). We really, really like Cranelift IR and its egraph pass system. To us this is perfect because we are going to do mixed boolean-arithmetic obfuscation and the pass system is really nice for this. Additionally, in our recursive descent disassembler we could lift to Cranelift IR; then, when we can't understand where control flow is going (say, a jmp reg), we can use the lifted IR to compute the possible destinations (useful for deobfuscating virtual-machine-based obfuscation like VMProtect and Themida).
We have spent months researching existing projects and decided that nothing public fits our needs. The reason most public projects don't fit is that they attempt to do too much (retargetable recompilation). We are just focused on x86 --> IR --> x86.
Also, as a side note: we want to avoid LLVM at all costs because its IR is too abstract... for example, handling calls to external code in McSema (the bridge between IR and native) is a very big mess. There is a group called rev.ng that is lifting QEMU TCG to LLVM IR. They are attempting to make a retargetable recompiler. This is a huge task; as a team we decided we should just focus on x86.
Here's a list of some of the things we plan to do:
Now, there are some dilemmas my team and I are thinking about.
Intrinsics (CPUID, RDTSC, READMSR/WRITEMSR, reads/writes to cr3/cr4/cr8, xgetbv, etc.). My team is willing to contribute by creating the lowering code required for each of these IR instructions.
Function parameters... how do we preserve these when lifting and lowering? For example, we could assume rcx, rdx, r8, and r9 are free to use after every function call, but how do we handle stack parameters? Tough question.
CPU flags! Other IRs get around this by creating a variable for rflags and updating the individual bits in the flag register after each instruction. This gets extremely messy, though, because each instruction creates a bunch of flag computations. Maybe my team can extend Cranelift to have some sort of "rflags" IR representation. Not sure.
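To make the "bunch of flags computation" concrete, here is a minimal sketch in plain Rust (not CLIF) of what a lifter has to model for a single 64-bit `add`. The helper name `add64_with_flags` is hypothetical; the bit positions are the standard RFLAGS positions of the status flags.

```rust
/// Bit positions of the arithmetic status flags inside RFLAGS.
const CF: u64 = 1 << 0;
const PF: u64 = 1 << 2;
const AF: u64 = 1 << 4;
const ZF: u64 = 1 << 6;
const SF: u64 = 1 << 7;
const OF: u64 = 1 << 11;

/// Hypothetical helper: result and status flags of a 64-bit `add a, b`.
fn add64_with_flags(a: u64, b: u64) -> (u64, u64) {
    let (res, carry) = a.overflowing_add(b);
    let mut flags = 0u64;
    if carry {
        flags |= CF;
    }
    // PF is set when the low byte of the result has an even number of set bits.
    if (res as u8).count_ones() % 2 == 0 {
        flags |= PF;
    }
    // AF is the carry out of bit 3 into bit 4.
    if ((a ^ b ^ res) & 0x10) != 0 {
        flags |= AF;
    }
    if res == 0 {
        flags |= ZF;
    }
    if (res >> 63) != 0 {
        flags |= SF;
    }
    // OF: both operands have the same sign and the result's sign differs.
    if ((a ^ res) & (b ^ res)) >> 63 != 0 {
        flags |= OF;
    }
    (res, flags)
}
```

Six flag updates for one instruction is why lifted code balloons unless a later pass can prove that most of the flags are dead before the next instruction overwrites them.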
Cranelift used to have instructions which produced or consumed flags, but we removed them in https://github.com/bytecodealliance/wasmtime/pull/5406
@ZwSwapCert sounds really interesting!
We would like to have a semantic representation of x86 in an IR format which we can then run simplification (or obfuscation passes).
I think Cranelift could be made to work for this purpose, partially, but one high-level distinction I'd want to make is that the IR is not "total" in the sense that it can represent all possible code. You brought up ABI details and I think this is a good concrete example: Cranelift handles the ABI for you, and the semantics of the IR are that of functions with args and returns. There's no concept of "whatever was in rdi on entry" because CLIF doesn't have a representation of rdi, or the machine stack at the word level either, for that matter.
The way that I've seen ISA-to-IR-to-ISA work in the context of qemu and Valgrind is that a level of indirection is inserted: the translated code doesn't refer to first-class values that are lowered back to machine registers, but rather refers to a "CPU context" and does loads and stores of fields on that CPU context that represent registers. At that level, you can fully represent any possible x86 instruction semantics, because loads and stores to memory offsets are something you can do in CLIF. Likewise for flags -- this is "just another field".
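As a rough illustration of that indirection, here is a sketch in plain Rust rather than CLIF. The `CpuContext` layout and the function name are assumptions made up for this example; a real lifter would emit the equivalent loads and stores as IR instructions, and would update all status flags rather than only CF and ZF.

```rust
/// Hypothetical CPU context: one field per piece of guest state.
#[repr(C)]
#[derive(Debug, Default, Clone, PartialEq)]
struct CpuContext {
    /// Indexed by x86 register number: rax, rcx, rdx, rbx, rsp, rbp, rsi, rdi, r8..r15.
    gpr: [u64; 16],
    rflags: u64,
    rip: u64,
}

const RAX: usize = 0;
const RBX: usize = 3;
const CF: u64 = 1 << 0;
const ZF: u64 = 1 << 6;

/// What lifted code for `add rax, rbx` does at this abstraction level:
/// loads from the context, ordinary arithmetic, stores back to the context.
/// Simplification: only CF and ZF are updated here.
fn lifted_add_rax_rbx(ctx: &mut CpuContext) {
    let a = ctx.gpr[RAX]; // load guest rax
    let b = ctx.gpr[RBX]; // load guest rbx
    let (res, carry) = a.overflowing_add(b);
    ctx.gpr[RAX] = res; // store guest rax

    let mut flags = ctx.rflags & !(CF | ZF);
    if carry {
        flags |= CF;
    }
    if res == 0 {
        flags |= ZF;
    }
    ctx.rflags = flags; // the flags register is "just another field"
}
```

Nothing here refers to a host register, which is why this form can express any guest instruction, at the cost of output that does not resemble the original code.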
So actually to be totally honest I think that trying to directly represent x86 in CLIF in a way that it can be roundtripped, for arbitrary code, is going to run into too many impedance mismatches to be a reasonable approach: Cranelift manages the ABI, does register allocation, has its own ideas about flags, etc., such that one can't really write CLIF to produce exactly the original x86 for any x86 code. For that one would really want an IR that is explicitly a list of x86 instructions...
My only concern with using a CPU context is the amount of bloated, unreadable code that will be generated. It's something I've done before when devirtualizing VMProtect binaries; the result is a mess and nothing near the original x86. McSema uses this concept of a CPU context and emulates x86 semantics with LLVM IR operations (including flags). However, if you compile this code and look at it in a disassembler, it's a mess.
There exists private tooling which can do this lifting, optimizing, and recompiling, and the output is nearly identical to the original code (deobfuscation/removal of VMProtect, for example). I'm assuming these private tools do not use a 'cpu context' structure in IR. This private tooling is also being used for obfuscation. My team wants to create something like it, if not better, without having to create an entire compiler framework. The private tools I've seen do not use LLVM IR; I think they either made their own entire compiler, IR, and optimization pass system, or are using Cranelift already.
As for the ABI, I think we are not too concerned with register usage until it is lowered back to x86. I think I read either your or someone else's blog post on how you use specific registers for instructions (like div, shr, etc.). We could do this same thing for cpuid, rdtsc, etc. As for function parameters... Maybe during the lift (from x86 --> IR) we could do some analysis on the decoded x86 instructions to determine the number of parameters a function call uses? For example, if we see rcx, rdx, r8, and r9 written to prior to a CALL instruction, we can assume 4 register parameters are used. If there are stores to RSP+0x20 and RSP+0x28, then we know there are 2 stack params. We only care to support the Windows ABI(s) for now.
So again, we want to avoid a CPU context at all costs; we want to be able to lift and recompile x86 and have near-original instructions. I know it's possible and I've seen it with my own eyes.
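The parameter-counting idea could be sketched roughly as below. The `Inst` and `Reg` types are hypothetical stand-ins for a real decoder's output, and the heuristic is deliberately naive: it ignores XMM arguments, argument registers that were set up in an earlier block, and registers written for unrelated reasons.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Reg {
    Rax,
    Rcx,
    Rdx,
    R8,
    R9,
}

/// Hypothetical, heavily simplified decoded instruction.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Inst {
    /// Any instruction whose destination is the given register.
    WriteReg(Reg),
    /// Any store to [rsp + offset].
    StoreRsp(u32),
    Call,
    Other,
}

/// Estimates (register args, stack args) for the CALL at `call_idx`,
/// assuming the Windows x64 convention: rcx, rdx, r8, r9, then stack
/// slots starting at [rsp + 0x20] (just above the 32-byte shadow space).
fn estimate_args(block: &[Inst], call_idx: usize) -> (usize, usize) {
    const ARG_REGS: [Reg; 4] = [Reg::Rcx, Reg::Rdx, Reg::R8, Reg::R9];
    let mut reg_written = [false; 4];
    let mut slot_written = [false; 8]; // [rsp+0x20], [rsp+0x28], ...

    // Walk backwards from the call until the previous call or the block start.
    for inst in block[..call_idx].iter().rev() {
        match *inst {
            Inst::Call => break,
            Inst::WriteReg(r) => {
                if let Some(i) = ARG_REGS.iter().position(|&a| a == r) {
                    reg_written[i] = true;
                }
            }
            Inst::StoreRsp(off) if off >= 0x20 && off % 8 == 0 => {
                let slot = ((off - 0x20) / 8) as usize;
                if slot < slot_written.len() {
                    slot_written[slot] = true;
                }
            }
            _ => {}
        }
    }

    // Arguments are positional, so only count an unbroken prefix.
    let reg_args = reg_written.iter().take_while(|&&w| w).count();
    let stack_args = if reg_args == 4 {
        slot_written.iter().take_while(|&&w| w).count()
    } else {
        0
    };
    (reg_args, stack_args)
}
```

For the example in the message (rcx, rdx, r8, r9 written, plus stores to RSP+0x20 and RSP+0x28) this returns four register arguments and two stack arguments.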
bjorn3 said:
Cranelift used to have instructions which produced or consumed flags, but we removed them in https://github.com/bytecodealliance/wasmtime/pull/5406
Very interesting, I'll take a look at this commit/branch to see what was once there. Maybe as a team we can add this back or keep a separate branch for it.
I really like these CPU flags; maybe I will base my project off Cranelift from when it still had them! :) ^^
An issue we might run into, however, is sequences like
pushfq
and [rsp], set_some_bit
popfq
Lifting this to IR might be complex... We could do some analysis on the x86 instructions prior to see if this sets a specific flag and then translate that to an IR instruction which sets a flag.
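A rough sketch of that pre-lift analysis, in plain Rust: the `Inst` enum and the `FlagsUpdate` operation are hypothetical, and the matcher only recognizes the exact three-instruction window. It ignores that `popfq` cannot change every RFLAGS bit and that the immediate of a real `and`/`or` is sign-extended from 32 bits.

```rust
/// Hypothetical, heavily simplified decoded instruction.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Inst {
    Pushfq,
    /// and qword ptr [rsp], imm
    AndRspImm(u64),
    /// or qword ptr [rsp], imm
    OrRspImm(u64),
    Popfq,
    Other,
}

/// Hypothetical IR-level operation: rflags = (rflags & and_mask) | or_mask.
#[derive(Debug, Clone, Copy, PartialEq)]
struct FlagsUpdate {
    and_mask: u64,
    or_mask: u64,
}

/// Recognizes `pushfq; and/or [rsp], imm; popfq` at the start of `window`
/// and summarizes it as one flags update with no stack traffic.
fn match_flags_update(window: &[Inst]) -> Option<FlagsUpdate> {
    match window {
        [Inst::Pushfq, Inst::AndRspImm(m), Inst::Popfq, ..] => Some(FlagsUpdate {
            and_mask: *m,
            or_mask: 0,
        }),
        [Inst::Pushfq, Inst::OrRspImm(m), Inst::Popfq, ..] => Some(FlagsUpdate {
            and_mask: u64::MAX,
            or_mask: *m,
        }),
        _ => None,
    }
}

/// Reference semantics of the summarized operation.
fn apply(update: FlagsUpdate, rflags: u64) -> u64 {
    (rflags & update.and_mask) | update.or_mask
}
```

Note that an `and` can only clear flag bits; setting a bit this way needs an `or`, which is why the sketch handles both forms.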
So again, we want to avoid a CPU context at all costs; we want to be able to lift and recompile x86 and have near-original instructions. I know it's possible and I've seen it with my own eyes.
That's fair, I guess I'm just noting that I strongly suspect you'll run into a bunch of impedance mismatches of the form like your last example. CLIF doesn't have access to notions like the processor stack; fundamentally there is just no encoding for a "push" unless you virtualize state into a CPU context and a separate stack. Cranelift and the user code you're translating can't both own the stack. (E.g., what do you do with code that stores the stack pointer and restores it? How would you translate a user-thread context-switch routine?) Register constraints will run into serious problems as well because you'll have, e.g., a known value in each of the 16 GPRs in the original code, but you can't constrain values into all 16 on a call instruction. Constraints also can't be put in CLIF -- again, it's a different abstraction level.
Imagine it like the following problem: "generate C code that, when compiled by gcc, produces exactly these x86 instructions". For an arbitrary x86 sequence, it can't be done. CLIF's abstraction level is closer to C (it has functions, it has arbitrary values that are regalloc'd, it manages the stack for you, it has two-way branches, it has no indirect jump, only br_table) than to the underlying machine.
I think you could push this to a point where you pattern-match some forms of "well-behaved" x86 back up to CLIF, just as you could write a decompiler from x86 to C; but in full generality, code with arbitrary jumps and register and stack manipulations simply can't be expressed in a way that will compile back to the original. I wish you all the best if you want to try though!
Thanks for your time :)
I think if we choose to use Cranelift we will rework lots of it to help us solve these issues... I'm really trying to avoid creating my own entire compiler framework because it's a hell of a lot of work.
For what it's worth, it sounds like what you need is a 1-to-1 mapping from x86 to an IR -- that shouldn't be too complex, insofar at least as it doesn't require instruction selection / lowering or regalloc or ABI code, which are three of the gnarliest areas of Cranelift
And it may be possible to reuse parts of cranelift for this ir like cranelift_codegen::egraph (not currently exported from cranelift_codegen) for egraph optimizations and cranelift-isle for pattern matching in the uplifting, lowering and optimization passes.
We will continue to research Cranelift. I think we could rework some specific parts of it to achieve our goal. Maybe we just fully commit, lift all semantics and accept that it's messy... then use rewrite passes to simplify known patterns (like flag usage) into their CLIF equivalents. Additionally, use passes to uncover function parameters and remove lifted x86 prologues/epilogues. Essentially lift --> reduce.
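As a toy illustration of the "lift, then reduce" idea, here is a bottom-up rewrite over a tiny expression tree in plain Rust. It is a sketch only: the `Expr` type is made up for the example, this is an ordinary tree rewrite rather than Cranelift's egraph or an ISLE rule, and the single rule shown is the well-known mixed boolean-arithmetic identity (x ^ y) + 2 * (x & y) = x + y, which holds modulo 2^64.

```rust
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Var(&'static str),
    Const(u64),
    Add(Box<Expr>, Box<Expr>),
    Mul(Box<Expr>, Box<Expr>),
    And(Box<Expr>, Box<Expr>),
    Xor(Box<Expr>, Box<Expr>),
}

/// Rewrites the children first, then tries the rule at this node.
fn simplify(e: Expr) -> Expr {
    let rec = |b: Box<Expr>| Box::new(simplify(*b));
    let e = match e {
        Expr::Add(a, b) => Expr::Add(rec(a), rec(b)),
        Expr::Mul(a, b) => Expr::Mul(rec(a), rec(b)),
        Expr::And(a, b) => Expr::And(rec(a), rec(b)),
        Expr::Xor(a, b) => Expr::Xor(rec(a), rec(b)),
        leaf => leaf,
    };

    // Rule: (x ^ y) + 2 * (x & y)  ==>  x + y
    // Only this exact operand order is matched; a real pass would also
    // handle the commuted forms.
    if let Expr::Add(l, r) = &e {
        if let (Expr::Xor(x1, y1), Expr::Mul(c, a)) = (&**l, &**r) {
            if let (Expr::Const(2), Expr::And(x2, y2)) = (&**c, &**a) {
                if x1 == x2 && y1 == y2 {
                    return Expr::Add(x1.clone(), y1.clone());
                }
            }
        }
    }
    e
}
```

The same shape of rule, written once per known pattern, is what a reduce phase over messy lifted code would consist of; an egraph-based pass would additionally keep both forms and pick the cheaper one.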
bjorn3 said:
And it may be possible to reuse parts of cranelift for this ir like cranelift_codegen::egraph (not currently exported from cranelift_codegen) for egraph optimizations and cranelift-isle for pattern matching in the uplifting, lowering and optimization passes.
Also, quick question: is ISLE used only for lowering, or also for generating CLIF?
ISLE is used for lowering Cranelift IR to the target-specific IR on which regalloc runs. It is also used for the optimization rules of cranelift_codegen::egraph. ISLE isn't used for producing Cranelift IR. For that you have to use the FunctionBuilder API.
oh ok thanks for the clarification.
After further consulting with my team, we have decided that we really like Cranelift and we are going to use a CPU context structure to represent CPU registers and flags. We think that getting our hands dirty will also further educate us on the framework itself.
Our focus now is using Cranelift IR for its pass system to obfuscate code. Once we learn more about the framework we can attempt to change it.
our plan is this:
declare a function with 1 argument (the cpu context)
create call stubs which push all registers onto the stack then:
mov rcx, rsp
jmp generated_function
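One detail worth pinning down in this plan is the layout of the context that `rcx` ends up pointing at. The sketch below simulates it in plain Rust with only four registers; the push order (`rax`, `rcx`, `rdx`, `rbx`), the `StubFrame` name, and the stand-in body of `generated_function` are all assumptions for illustration. Because the stack grows downward, the struct fields appear in the reverse of the push order.

```rust
/// Hypothetical frame left by a stub that executes
/// `push rax; push rcx; push rdx; push rbx` before `mov rcx, rsp`.
/// The last value pushed sits at the lowest address, so it comes first.
#[repr(C)]
#[derive(Debug, Clone, Copy, PartialEq)]
struct StubFrame {
    rbx: u64,
    rdx: u64,
    rcx: u64,
    rax: u64,
}

/// Simulates the pushes on a downward-growing stack of 8-byte slots and
/// returns the new stack-pointer index. `values` is in push order.
fn simulate_pushes(stack: &mut [u64], mut sp: usize, values: [u64; 4]) -> usize {
    for v in values {
        sp -= 1;
        stack[sp] = v;
    }
    sp
}

/// Reinterprets the four slots at `sp` as the context struct, which is
/// what `mov rcx, rsp` does implicitly for the generated code.
fn view_frame(stack: &mut [u64], sp: usize) -> &mut StubFrame {
    let slots = &mut stack[sp..sp + 4]; // bounds-checked
    // SAFETY: `slots` holds exactly four u64 values and StubFrame is
    // #[repr(C)] with four u64 fields, so size and alignment match.
    unsafe { &mut *(slots.as_mut_ptr() as *mut StubFrame) }
}

/// Stand-in for the generated function: one argument, the CPU context.
/// Here it models a lifted `lea rax, [rcx + rdx]`.
fn generated_function(ctx: &mut StubFrame) {
    ctx.rax = ctx.rcx.wrapping_add(ctx.rdx);
}
```

A real stub would also need to save rflags and the original rsp, and to pop everything back after the generated function returns; none of that is modeled here.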
Last updated: Dec 23 2024 at 13:07 UTC