Hello!
I was wondering, would compilation to PTX ever be supported? PTX is basically assembly for GPU kernels on NVIDIA GPUs. It's just like any other ISA. It would be very helpful for me, and perhaps for many people, if we could compile Rust to PTX. Targeting nvptx64-nvidia-cuda using LLVM currently doesn't really work because of LLVM dylib limitations, and it makes broken PTX a lot of the time. Not sure how much work it would be because I'm not quite sure how Cranelift's codegen works. But I expect it may be as much work as a "smaller" ISA like ARM.
Greetings @Riccardo D'Ambrosio ! I'm not aware of anyone with immediate plans to add a PTX (or GPU in general) backend to Cranelift, but actually the Embark Studios folks had a similar desire a while ago and we talked about whether adding a GPU backend to Cranelift would make sense, which eventually led to the conclusion that it probably is more straightforward to develop a rustc backend directly. They've since built rust-gpu, which generates SPIR-V. @Benjamin Bouvier or @Johan Andersson would be the right folks to talk to if you want to know more!
Yeah rust-gpu is great, but CUDA usually yields considerably higher (30%+) performance for physics simulations and things like that, and has things that Vulkan just doesn't have, like graphs, special memory handling, etc., so using rust-gpu over CUDA is not really acceptable for me :(
for the small examples I've tried right now, rustc seems to make correct PTX. I'll explore this further, since last time I tried this it made broken PTX
The issue is that I have to use --emit=asm
which is kind of a hack, because Windows has some issues linking with a custom PTX linker
Anyways, Cranelift supporting PTX would still be very helpful, since targeting the GPU for things like JIT compilers (sounds crazy ikr) is very helpful
Gotcha. Well, we're always looking for more contributors, so if you have interest in developing a PTX backend, we'd be happy to talk further! We're spread pretty thin otherwise so such a backend is unlikely to be developed without someone specifically driving it forward. But I could point you (or anyone else) toward the right places to start.
yeah if i find that targeting ptx with rustc is completely broken i might look at it
I'm not sure how hard the PTX ISA is. I presume a lot of special handling would need to be done because it has special things like textures, surfaces, etc. I'm not sure how Cranelift handles ISA-specific features
We more or less provide a standard load/store abstract machine at the IR level, similar to e.g. LLVM IR. There's precedent for adding intrinsics that are platform-specific but we try to minimize that
My initial gut reaction to that (with the caveat that I am not very familiar with the GPU world in general) is that we probably wouldn't turn into a compiler that understands, e.g., textures or surfaces. But insofar as general IR wants to be compiled down to GPGPU kernels, that seems reasonable to me
hmm i see, im not sure how llvm's ptx backend handles it
although, if the extent of the platform-specific quirks that one needs is just that there are special memories (scratchpad, texture memory?, etc) then we actually do have a notion of separate "heaps" that can be accessed (this comes from the multi-Wasm-heap world)
https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#texture-sampler-and-surface-types
OK, interesting. So the first step would be working out how to represent these primitives in CLIF
yeah
there also seems to be some weirdness with "hiding the ABI", and there is the question of what to do about regalloc (iirc GPUs have "thousands-ish" of registers, but does ptx present a virtual-register machine?)
yes ptx has virtual registers
https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#parameterized-variable-names
ptx is actually pretty human-friendly compared to other ISAs id say
OK, gotcha, so we would need a way to bypass regalloc as well (that makes our life easier in some ways but also special-cases a bunch of stuff in the backend)
need to also write an assembler for it, im not sure how much work that is, and i actually need that right now for jit compiling some math lol
Yeah, each of our backends relies on an assembler library of sorts -- the MachInst implementation and the emit code centered around the MachBuffer
so the steps to write a backend are (i) get that assembler infrastructure written, (ii) define the lowering rules (basically a big match), (iii) define some other machine-specific traits like ABI implementations
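To make the "big match" idea concrete, here is a minimal sketch of lowering a toy IR to PTX-style text. It is not Cranelift code: the `Inst` enum and the `lower`, `lower_all`, and `declare_regs` functions are hypothetical names invented for illustration. It assumes every toy instruction defines one fresh virtual register, which fits PTX's virtual-register model discussed above, so no register allocation is involved.

```rust
/// Toy IR (hypothetical, not CLIF): instruction `i` defines virtual register `%r{i}`.
#[derive(Debug, Clone, Copy)]
enum Inst {
    Const(i32),
    Add(usize, usize),
    Mul(usize, usize),
}

/// Declare `n` virtual registers using PTX's parameterized variable names.
fn declare_regs(n: usize) -> String {
    format!(".reg .s32 %r<{n}>;")
}

/// The lowering rule: one match arm per IR opcode.
/// `dst` is the index of the virtual register this instruction defines.
fn lower(dst: usize, inst: Inst) -> String {
    match inst {
        Inst::Const(v) => format!("mov.s32 %r{dst}, {v};"),
        Inst::Add(a, b) => format!("add.s32 %r{dst}, %r{a}, %r{b};"),
        Inst::Mul(a, b) => format!("mul.lo.s32 %r{dst}, %r{a}, %r{b};"),
    }
}

/// Lower a straight-line sequence, numbering registers by instruction index.
fn lower_all(insts: &[Inst]) -> Vec<String> {
    insts
        .iter()
        .enumerate()
        .map(|(i, &inst)| lower(i, inst))
        .collect()
}

fn main() {
    // (2 + 3) squared, as four toy instructions.
    let prog = [
        Inst::Const(2),
        Inst::Const(3),
        Inst::Add(0, 1),
        Inst::Mul(2, 2),
    ];
    println!("{}", declare_regs(prog.len()));
    for line in lower_all(&prog) {
        println!("{line}");
    }
}
```

A real backend would also need the assembler layer and ABI pieces from steps (i) and (iii); this only illustrates the shape of step (ii).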
yeah i looked at the x64 impl... gigantic
why is that much larger than say, arm?
hmm, I just linecounted; 20087 lines in isa/x64 vs. 21156 in isa/aarch64
possibly you're seeing in aarch64 we split the lowering into two files (lower.rs, lower_inst.rs)? anyway both are pretty comparable in complexity
i see
would estimate a few months of full-time work to bring up a new backend, maybe 6 months to a year to have a well-tuned one
so, "new backend for platform X?" is not a small ask, but as said above, happy to discuss further if you're interested in driving this!
Yeah for sure, its kind of a last resort for me, so kind of unlikely, but maybe in the future ill explore it, not sure
thanks anyways! :smile:
for sure!
quick q while im here, is it possible to make cranelift call an extern c function i have in my rust code when using it as a jit compiler?
It should be, yes; this is how "hostcalls" in e.g. wasmtime work
is there a specific function in the docs that does that?
I actually spend more time inside the compiler than at its boundary so I can't remember the API off the top of my head :-) but in the IR, it's an ExternalName, and then we give that back to you in a relocation so you can patch the address in before executing
probably grep for ExternalName in crates/* in the wasmtime repo will get you started
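As a rough picture of what "patch the address in before executing" means, here is a minimal sketch using only the standard library. The `Reloc` struct, the `apply_relocs` function, and the string-keyed symbol table are hypothetical stand-ins, not Cranelift's actual relocation types, and the sketch assumes the simplest case: an 8-byte absolute address written in little-endian order at a known offset.

```rust
use std::collections::HashMap;

/// Hypothetical relocation record: "write the address of `name` at `offset`".
struct Reloc {
    offset: usize,
    name: &'static str,
}

/// Patch each relocation site with the 8-byte little-endian address of its symbol.
/// Fails if a symbol is unknown or the patch would run past the end of the code.
fn apply_relocs(
    code: &mut [u8],
    relocs: &[Reloc],
    symbols: &HashMap<&str, u64>,
) -> Result<(), String> {
    for r in relocs {
        let addr = *symbols
            .get(r.name)
            .ok_or_else(|| format!("unresolved symbol: {}", r.name))?;
        let end = r
            .offset
            .checked_add(8)
            .filter(|&e| e <= code.len())
            .ok_or_else(|| format!("relocation for {} is out of bounds", r.name))?;
        code[r.offset..end].copy_from_slice(&addr.to_le_bytes());
    }
    Ok(())
}

/// A host function the generated code wants to call.
extern "C" fn host_double(x: i64) -> i64 {
    x * 2
}

fn main() {
    // x86-64 bytes: `movabs rax, imm64` (48 B8 + 8 placeholder bytes),
    // `call rax` (FF D0), `ret` (C3). The code is only patched here, not executed.
    let mut code: [u8; 13] = [0x48, 0xB8, 0, 0, 0, 0, 0, 0, 0, 0, 0xFF, 0xD0, 0xC3];
    let relocs = [Reloc { offset: 2, name: "host_double" }];

    let mut symbols = HashMap::new();
    symbols.insert("host_double", host_double as *const () as usize as u64);

    apply_relocs(&mut code, &relocs, &symbols).expect("all symbols should resolve");
    println!("patched code: {code:02x?}");
}
```

In practice the relocation kinds, offsets, and names come from the compiler's output rather than being written by hand.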
haha, thanks
i suppose cranelift can't call standard rust functions because of its unspecified calling convention right? which isnt a big issue since i can just use a "trampoline" function
cc @bjorn3 on that (cranelift backend for rustc, so they are the expert) but yes, I think for a general hostcall sort of setup, using extern "C" functions will be a lot simpler and more reliable
If you are using cranelift-jit and declare a function with module.declare_function, cranelift-jit will fall back to looking up the symbol in the host executable if it isn't defined using module.define_function. You can also use .symbol() on the JITBuilder to add custom fallbacks that are looked up before the host executable symbols.
You will indeed have to use extern "C" trampolines. You should try to avoid passing non-primitive types, as part of the ABI calculation for aggregates (structs, unions, ...) has to be done by the user of Cranelift.
@Riccardo D'Ambrosio
I see, thanks! :smile:
is there a way to write a universal trampoline? i would suppose not because we have no variadic generics :/
most of the functions i want to call take vek vectors or matrices :/
but theyre repr C so they should be dead simple
You should probably just pass pointers instead of passing them by value. That is much simpler to implement and avoids a move to the argument part of the stack.
the issue is im reusing this math code for my main physics engine library, so it needs to be idiomatic
Only the trampolines need to take pointers, the library code can take it by value.
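A minimal sketch of that split, assuming a hypothetical `Vec3` type standing in for a repr(C) vek vector (the names `Vec3`, `cross`, and `cross_trampoline` are invented for illustration): the library function stays idiomatic and by-value, while the extern "C" trampoline exposes only pointers, so no aggregate ABI rules have to be computed on the JIT side.

```rust
/// Hypothetical stand-in for a `repr(C)` math vector.
#[repr(C)]
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Vec3 {
    pub x: f32,
    pub y: f32,
    pub z: f32,
}

/// Idiomatic library code: takes and returns values.
pub fn cross(a: Vec3, b: Vec3) -> Vec3 {
    Vec3 {
        x: a.y * b.z - a.z * b.y,
        y: a.z * b.x - a.x * b.z,
        z: a.x * b.y - a.y * b.x,
    }
}

/// Trampoline for JIT-compiled callers: only pointers cross the ABI boundary.
///
/// # Safety
/// `a` and `b` must point to valid, aligned `Vec3` values, and `out` must be
/// valid and aligned for writing one `Vec3`.
pub unsafe extern "C" fn cross_trampoline(a: *const Vec3, b: *const Vec3, out: *mut Vec3) {
    unsafe { out.write(cross(a.read(), b.read())) }
}

fn main() {
    // A JIT would call through a plain C function pointer like this one.
    let f: unsafe extern "C" fn(*const Vec3, *const Vec3, *mut Vec3) = cross_trampoline;

    let a = Vec3 { x: 1.0, y: 0.0, z: 0.0 };
    let b = Vec3 { x: 0.0, y: 1.0, z: 0.0 };
    let mut out = Vec3 { x: 0.0, y: 0.0, z: 0.0 };

    unsafe { f(&a, &b, &mut out) };
    println!("{out:?}");
}
```

From the JIT's point of view the trampoline signature is just three pointer-sized arguments and no return value, which is straightforward to describe in a function signature.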
I see
One possibility would of course be to compile the whole thing for a PTX target and then load and run it on the GPU using cuModuleLoadData (CUDA Driver API).
However, I think it would be a good move to compile directly to the GPU's native assembly and thus avoid an intermediate step.
The disadvantage is that these native GPU ISAs differ from one GPU generation to the next (for example NVIDIA Maxwell, Kepler, ...).
I think this could achieve much higher performance than leaving the final compilation to a proprietary compiler.
Tuning directly for the GPU would take a lot of reverse-engineering work, though, since NVIDIA publishes only very limited documentation on its ISAs.
Last updated: Dec 23 2024 at 13:07 UTC