Hello!
I was wondering, would compilation to PTX ever be supported? PTX is basically assembly, but for GPU kernels on NVIDIA GPUs; it's just like any other ISA. It would be very helpful for me, and perhaps for many people, if we could compile Rust to PTX. Targeting nvptx64-nvidia-cuda using LLVM currently doesn't really work because of LLVM dylib limitations, and it produces broken PTX a lot of the time. Not sure how much work it would be, because I'm not quite sure how Cranelift's codegen works, but I expect it may be as much work as a "smaller" ISA like ARM.
Greetings @Riccardo D'Ambrosio! I'm not aware of anyone with immediate plans to add a PTX (or GPU in general) backend to Cranelift. The Embark Studios folks actually had a similar desire a while ago, and we talked about whether adding a GPU backend to Cranelift would make sense, which eventually led to the conclusion that it is probably more straightforward to develop a rustc backend directly. They've since built rust-gpu, which generates SPIR-V. @Benjamin Bouvier or @Johan Andersson would be the right folks to talk to if you want to know more!
Yeah, rust-gpu is great, but CUDA usually yields considerably higher (30%+) performance for physics simulations and things like that, and has things that Vulkan just doesn't have, like graphs, special memory handling, etc., so using rust-gpu over CUDA is not really acceptable for me :(
For the small examples I've tried right now, rustc seems to make correct PTX; I'll explore this further, since last time I tried this it made broken PTX
The issue is that I have to use `--emit=asm`, which is kind of a hack, because Windows has some issues linking with a custom PTX linker
Anyway, Cranelift supporting PTX would still be very helpful, since targeting the GPU from things like JIT compilers (sounds crazy, I know) is really useful
Gotcha. Well, we're always looking for more contributors, so if you have interest in developing a PTX backend, we'd be happy to talk further! We're spread pretty thin otherwise so such a backend is unlikely to be developed without someone specifically driving it forward. But I could point you (or anyone else) toward the right places to start.
Yeah, if I find that targeting PTX with rustc is completely broken, I might look at it
I'm not sure how hard the PTX ISA is; I presume a lot of special handling would be needed because it has special things like textures, surfaces, etc. I'm not sure how Cranelift handles ISA-specific features
We more or less provide a standard load/store abstract machine at the IR level, similar to, e.g., LLVM IR. There's precedent for adding platform-specific intrinsics, but we try to minimize that
My initial gut reaction to that (with the caveat that I am not very familiar with the GPU world in general) is that we probably wouldn't turn into a compiler that understands, e.g., textures or surfaces. But insofar as general IR wants to be compiled down to GPGPU kernels, that seems reasonable to me
Hmm, I see; I'm not sure how LLVM's PTX backend handles it
although, if the extent of the platform-specific quirks that one needs is just that there are special memories (scratchpad, texture memory?, etc) then we actually do have a notion of separate "heaps" that can be accessed (this comes from the multi-Wasm-heap world)
https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#texture-sampler-and-surface-types
OK, interesting. So the first step would be working out how to represent these primitives in CLIF
yeah
there also seems to be some weirdness around "hiding the ABI", and there is the question of what to do about regalloc (IIRC GPUs have "thousands-ish" of registers, but does PTX present a virtual-register machine?)
Yes, PTX has virtual registers
https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#parameterized-variable-names
PTX is actually pretty human-friendly compared to other ISAs, I'd say
OK, gotcha, so we would need a way to bypass regalloc as well (that makes our life easier in some ways but also special-cases a bunch of stuff in the backend)
I'd also need to write an assembler for it; I'm not sure how much work that is, and I actually need that right now for JIT-compiling some math lol
Yeah, each of our backends relies on an assembler library of sorts -- the `MachInst` implementation and the emit code centered around the `MachBuffer`
so the steps to write a backend are (i) get that assembler infrastructure written, (ii) define the lowering rules (basically a big `match`), (iii) define some other machine-specific traits like ABI implementations
Yeah, I looked at the x64 impl... gigantic
Why is that so much larger than, say, ARM?
Hmm, I just line-counted: 20087 lines in isa/x64 vs. 21156 in isa/aarch64
Possibly you're seeing that in aarch64 we split the lowering into two files (lower.rs, lower_inst.rs)? Anyway, both are pretty comparable in complexity
i see
I'd estimate a few months of full-time work to bring up a new backend, and maybe six months to a year to have a well-tuned one
So "new backend for platform X?" is not a small ask, but as said above, happy to discuss further if you're interested in driving this!
Yeah, for sure. It's kind of a last resort for me, so kind of unlikely, but maybe in the future I'll explore it; not sure
Thanks anyway! :smile:
for sure!
Quick question while I'm here: is it possible to make Cranelift call an `extern "C"` function I have in my Rust code when using it as a JIT compiler?
It should be, yes; this is how "hostcalls" in e.g. wasmtime work
is there a specific function in the docs that does that?
I actually spend more time inside the compiler than at its boundary, so I can't remember the API off the top of my head :-) but in the IR, it's an `ExternalName`, and then we give that back to you in a relocation so you can patch in the address before executing
Probably grepping for `ExternalName` in `crates/*` in the wasmtime repo will get you started
haha, thanks
I suppose Cranelift can't call standard Rust functions because of their unspecified calling convention, right? Which isn't a big issue, since I can just use a "trampoline" function
cc @bjorn3 on that (Cranelift backend for rustc, so they are the expert), but yes, I think for a general hostcall sort of setup, using `extern "C"` functions will be a lot simpler and more reliable
If you are using cranelift-jit and declare a function with `module.declare_function`, cranelift-jit will fall back to looking up the symbol in the host executable if it isn't defined using `module.define_function`. You can also use `.symbol()` on the `JITBuilder` to add custom fallbacks that are looked up before the host executable's symbols.
You will indeed have to use `extern "C"` trampolines. You should try to avoid passing non-primitive types, as the ABI calculation for aggregates (structs, unions, ...) has to be done by the user of Cranelift.
@Riccardo D'Ambrosio
I see, thanks! :smile:
Is there a way to write a universal trampoline? I would suppose not, because we have no variadic generics :/
Most of the functions I want to call take vek vectors or matrices :/ but they're repr(C), so they should be dead simple
You should probably just pass pointers instead of passing them by value. That is much simpler to implement and avoids a move to the argument area of the stack.
The issue is I'm reusing this math code for my main physics engine library, so it needs to be idiomatic
Only the trampolines need to take pointers, the library code can take it by value.
I see
One possibility would of course be to compile the whole thing for a PTX target and then execute it on the GPU using cuModuleLoadData (CUDA Driver API).
However, I think it would be a good move to compile directly to the GPU's native assembly and thus avoid an intermediate stage.
The disadvantage is that these native GPU ISAs differ from one graphics card generation to the next (for example NVIDIA Maxwell, Kepler, ...).
I think one could achieve much higher performance this way than by leaving it to a proprietary compiler.
But tuning directly for the GPU would be a lot of reverse-engineering work, since NVIDIA publishes only very limited documentation on its native ISAs.
Last updated: Nov 22 2024 at 16:03 UTC