display disassembly · cranelift · Zulip Chat Archive

Veverak (Oct 15 2021 at 23:28):

Is there a convenient way to see the disassembly of a function generated using cranelift-jit and cranelift-frontend? I can easily see the Cranelift IR using FunctionBuilder::display, but I would like to see what instructions come out after optimization and compilation.

Chris Fallin (Oct 15 2021 at 23:35):

@Veverak there's no API to do this programmatically, but (i) clif-util -D uses capstone to show disassemblies of compilations of either .clif or .wasm inputs, and (ii) if you set your log level to 'trace' (RUST_LOG=trace) with a binary that has log output set up (wasmtime does, for example), you'll see a bunch of info fly by, including final VCode for functions

Chris Fallin (Oct 15 2021 at 23:36):

also, if you're building your own JIT on top of Cranelift then it might be worthwhile to hook up capstone to show disassemblies, imho, so that you can get just that without all the other debug spew

Chris Fallin (Oct 15 2021 at 23:37):

(happy to take suggestions on ways we could improve the API here to provide something better!)

Veverak (Oct 15 2021 at 23:39):

Chris Fallin (Oct 15 2021 at 23:43):

@Veverak it's a utility binary that is built with Cranelift; cargo build --release -p cranelift-tools should give you target/release/clif-util

Chris Fallin (Oct 15 2021 at 23:44):

if you have a .clif file, you can run clif-util compile --target x86_64 -D file.clif (or s/x86_64/aarch64/ if desired)

Veverak (Oct 15 2021 at 23:48):

Using a powerful disassembler like Capstone seems a bit odd. I suppose Cranelift already has internal state that could be used to pretty-print the generated instructions with little effort without having to use reverse-engineering techniques.

Chris Fallin (Oct 15 2021 at 23:57):

Currently the pretty-printing for that isn't exposed in the external API but I'm happy to look at a PR that does so!

Chris Fallin (Oct 15 2021 at 23:59):

the reason I didn't mention it as the first option is that it is not exactly 1:1 with the final machine code; for example there is late editing that happens in MachBuffer to resolve control-flow and simplify branches. The only state that fully represents the final result is the machine code itself, so starting from that is truly the best option if you want the precise disassembly
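To make the "late editing" point concrete, here is a toy sketch in Rust. It is not Cranelift's MachBuffer: the `ToyBuffer` type, its methods, and the byte encoding (0xE9 followed by a one-byte label id) are all made up for illustration. It only shows how a jump to the immediately following label can disappear from the emitted bytes even though a listing produced before emission still contains it.

```rust
/// Toy stand-in for a machine-code buffer. This is NOT Cranelift's MachBuffer;
/// the encoding (0xE9 + one-byte label id = "jump to label") is invented.
#[derive(Default)]
struct ToyBuffer {
    bytes: Vec<u8>,
    /// Start offset and target of the last emitted jump, kept only while
    /// that jump is still the last thing in the buffer.
    last_jump: Option<(usize, u8)>,
}

impl ToyBuffer {
    fn emit_op(&mut self, opcode: u8) {
        self.bytes.push(opcode);
        self.last_jump = None;
    }

    fn emit_jump(&mut self, label: u8) {
        self.last_jump = Some((self.bytes.len(), label));
        self.bytes.extend_from_slice(&[0xE9, label]);
    }

    /// Bind `label` to the current offset. If the buffer ends with a jump to
    /// this same label, that jump is a no-op and is cut from the bytes.
    /// Only this simplest case is handled.
    fn bind_label(&mut self, label: u8) -> usize {
        if let Some((start, target)) = self.last_jump {
            if target == label {
                self.bytes.truncate(start);
            }
        }
        self.last_jump = None;
        self.bytes.len()
    }
}

/// Emits two blocks and returns (pre-emission listing, final bytes).
fn emit_demo() -> (Vec<String>, Vec<u8>) {
    let mut buf = ToyBuffer::default();
    // What a pretty-printer running before emission would show.
    let mut listing = Vec::new();

    listing.push("block0: op 0x01".to_string());
    buf.emit_op(0x01);
    listing.push("block0: jump label1".to_string());
    buf.emit_jump(1);

    buf.bind_label(1);
    listing.push("block1: op 0x02".to_string());
    buf.emit_op(0x02);

    (listing, buf.bytes)
}

fn main() {
    let (listing, bytes) = emit_demo();
    println!("pre-emission listing: {listing:#?}");
    println!("final bytes: {bytes:02x?}");
}
```

The listing has three entries, but only two opcodes survive in the final bytes, which is the kind of mismatch described above.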

Veverak (Oct 16 2021 at 00:03):

I guess it doesn't matter to me whether it exactly represents the final machine code. I'd just like to get an idea of what optimizations are applied. Is there a way I can set up log output in my own JIT?

Chris Fallin (Oct 16 2021 at 00:05):

Yep, if you want to log all the code that is generated via the VCode pretty-printer, this log::trace!() call is printing the output that you want; so, probably the best way to expose that is to either pass in a Cranelift option to emit at a higher log-level (so you don't have to see all the trace-level output, which is extremely verbose), or add an API to return a String given the final compiled function

wasmtime/compile.rs at 3ba9e5865a8171d1b4547bcabe525666d030c18b · bytecodealliance/wasmtime

Chris Fallin (Oct 16 2021 at 00:07):

... and with that I am disappearing for now but I'm happy to review a PR to do one of the above if you decide to go that way!

Veverak (Oct 16 2021 at 01:18):

I guess the part about simplifying branches is important after all. The VCode contains a lot of blocks that only contain one instruction that jumps to the next block. It would be nice to have some API to get the final optimized code as structured data rather than a blob, and with a source map, which would make it possible to build a tool like godbolt.org. I'm still getting used to Cranelift, so don't expect a PR from me yet.

Chris Fallin (Oct 16 2021 at 05:48):

I agree that that would be a very nice facility to have! Unfortunately providing a "structured data" view of the final assembly is a bit beyond the design of the compiler backend: the final code emission intentionally does not build a data structure in memory that represents the (post-branch-simplification) code before emitting machine code, because it's not necessary and not doing so reduces the cost of emission.

This is why I recommended hooking up Capstone above if one wants the final disassembly -- to come back to what you said earlier:

it's exactly because we don't have this internal state that we're faster than otherwise at emitting code, but the tradeoff is that one has to pay more cost to build that internal state post-hoc if one wants it.

In theory one could use the debug location info we emit to make a Godbolt-like UI with correspondences to original source lines; that actually sounds like a really useful tool, if someone were to build it. Happy to help answer questions and/or work out ways to expose additional information as needed if you're hoping to do so!
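As a rough illustration of the lookup such a tool would need, here is a minimal Rust sketch. It assumes the debug location info has already been flattened into a table of (code offset, source line) pairs sorted by offset. The `LocEntry` type and `line_for_offset` function are hypothetical, and Cranelift's actual debug-location format is not reproduced here.

```rust
/// Hypothetical record: machine code starting at `offset` came from `line`.
#[derive(Debug, Clone, Copy, PartialEq)]
struct LocEntry {
    offset: u32,
    line: u32,
}

/// Returns the source line covering `pc`, given entries sorted by `offset`.
/// Returns `None` when `pc` lies before the first entry.
fn line_for_offset(table: &[LocEntry], pc: u32) -> Option<u32> {
    // Index of the first entry whose offset is greater than `pc`.
    let idx = table.partition_point(|e| e.offset <= pc);
    // The entry just before it is the last one starting at or before `pc`.
    idx.checked_sub(1).map(|i| table[i].line)
}

fn main() {
    let table = [
        LocEntry { offset: 0, line: 10 },
        LocEntry { offset: 4, line: 11 },
        LocEntry { offset: 9, line: 13 },
    ];
    for pc in [0, 5, 9, 20] {
        println!("offset {pc:>2} -> line {:?}", line_for_offset(&table, pc));
    }
}
```

A UI could run this lookup for the address of each disassembled instruction to colour it by source line.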

Veverak (Oct 16 2021 at 07:46):

It's interesting to hear that Cranelift has been made faster by not maintaining enough detail about the generated code to allow pretty-printing it. I obviously don't know all the details, but I suspect that in any case, there may be a clever way around this. Maybe instead of making the main state more descriptive, a layer of annotations could be kept beside it, and this layer could be turned off when not needed. I guess this is already how debug location info works, and that its format could be adjusted to allow listing the instructions at each byte offset without having to disassemble them. What's the API to get the debug location info?
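The annotation-layer idea can be sketched as a side table that is only filled in when requested. This is a toy in Rust, not a proposal for Cranelift's actual code: the `Emitter` type and its fields are hypothetical, and the two encodings used in `main` (0x90 and 0xC3, x86 `nop` and `ret`) are just sample bytes.

```rust
/// Toy emitter with an optional annotation layer; all names are hypothetical.
struct Emitter {
    bytes: Vec<u8>,
    /// `Some` only when the caller asked for annotations. Each entry is
    /// (offset of the instruction's first byte, its text).
    notes: Option<Vec<(usize, String)>>,
}

impl Emitter {
    fn new(annotate: bool) -> Self {
        Emitter {
            bytes: Vec::new(),
            notes: annotate.then(Vec::new),
        }
    }

    /// Appends `encoding`; records `text` only if annotations are enabled.
    fn emit(&mut self, encoding: &[u8], text: &str) {
        if let Some(notes) = &mut self.notes {
            notes.push((self.bytes.len(), text.to_string()));
        }
        self.bytes.extend_from_slice(encoding);
    }
}

fn main() {
    let mut e = Emitter::new(true);
    e.emit(&[0x90], "nop");
    e.emit(&[0xC3], "ret");
    for (offset, text) in e.notes.as_deref().unwrap_or(&[]) {
        println!("{offset:04x}: {text}");
    }
}
```

With `Emitter::new(false)` the same calls produce the same bytes and skip the string allocations entirely.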

Chris Fallin (Oct 16 2021 at 18:09):

I'll note that such a project would need some careful design in the way that it interacts with MachBuffer (which is the code that mutates branches and removes redundant ones during binary emission), and would need some careful consideration in the way that it keeps in sync with the emitted binary code. I'd be happy to talk more design details with anyone who's interested in taking this up...

Some history might be useful context too. The original intent of MachInst was to be more-or-less exactly the assembler-input data structure that you're requesting: one MachInst for one machine instruction, with passes before final emission to get all the instructions into emittable shape. (Then the VCode pretty-printed output was exactly the final assembly.) In fact in the original aarch64 backend I was testing the VCode-emitted bytes by comparing to the gas assembly of the VCode pretty-printing.

So, both efficiency and correctness concerns led us to this design; in other words, there are good reasons why we don't build intermediate data structures that exactly represent the final instructions before we emit them. (I'll note that "emit the bytes directly while traversing something" is not an uncommon design choice in JITs in general, too, for speed reasons.)

The last thing I would say is regarding software-maintenance overhead: if we guarantee we can pretty-print in a way that exactly corresponds to machine code, that's another thing we have to test, and get right whenever we add instructions or instruction sequences. Not the end of the world, but a cost nonetheless. Sometimes saying "just use Capstone if you need that" is actually the right decision from an overall cost perspective -- it significantly simplifies the overall design (the compiler just outputs bytes; pretty-printing is a separate bytes-to-text pipeline). "Speed of pretty-printing" was not a top-level goal when we designed the current backend; that probably helps illuminate some of the tradeoffs we have made :-)

Anyway, that's more-or-less my complete braindump on the topic; anyone who wants to suggest an improved design is very welcome to do so and I'd be happy to point to the relevant bits of code that would be involved in implementation!

Veverak (Oct 19 2021 at 20:08):

It's an interesting potential improvement to keep in mind for the future, but for now, as you say, “just use Capstone if you need that”. When I try to do that, however, I run into the same problem of not being able to access internal state. JITModule has internal state recording the size of each function, but the public methods only allow getting the pointer to each function, not the size. Maybe there needs to be another form of get_finalized_function which, like get_finalized_data, includes the size of the data in the return value?
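One possible workaround, sketched in Rust under the assumption that the embedder can learn each function's size at the time it defines the function (how to get that number out of Cranelift is not shown here): keep a side table of pointer and size, and rebuild a byte slice from it when a disassembler needs one. The `CodeSizes` type and its methods are hypothetical and not part of cranelift-jit.

```rust
use std::collections::HashMap;

/// Hypothetical side table kept by the embedder; not part of cranelift-jit.
#[derive(Default)]
struct CodeSizes {
    by_name: HashMap<String, (*const u8, usize)>,
}

impl CodeSizes {
    /// Remember where a finalized function lives and how long it is.
    fn record(&mut self, name: &str, ptr: *const u8, size: usize) {
        self.by_name.insert(name.to_string(), (ptr, size));
    }

    /// Returns the function's machine code as a byte slice.
    ///
    /// # Safety
    /// The recorded pointer must still point at `size` readable bytes.
    unsafe fn code(&self, name: &str) -> Option<&[u8]> {
        let &(ptr, size) = self.by_name.get(name)?;
        Some(unsafe { std::slice::from_raw_parts(ptr, size) })
    }
}

// Stand-in for JIT-compiled code: x86 `nop; ret`.
static CODE: [u8; 2] = [0x90, 0xC3];

fn main() {
    let mut sizes = CodeSizes::default();
    sizes.record("f", CODE.as_ptr(), CODE.len());

    // The resulting slice is what a disassembler would take as input.
    let bytes = unsafe { sizes.code("f") }.expect("function was recorded");
    println!("f: {bytes:02x?}");
}
```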

Chris Fallin (Oct 19 2021 at 21:40):