Stream: general

Topic: target ptx


view this post on Zulip Riccardo D'Ambrosio (Jun 07 2021 at 04:46):

Hello!

I was wondering, would compilation to PTX ever be supported? PTX is basically assembly but for GPU kernels on NVIDIA GPUs; it's just like any other ISA. It would be very helpful for me, and perhaps for many people, if we could compile Rust to PTX. Targeting nvptx64-nvidia-cuda using LLVM currently doesn't really work because of LLVM dylib limitations, and it produces broken PTX a lot of the time. I'm not sure how much work it would be because I'm not quite sure how Cranelift's codegen works, but I expect it may be about as much work as a "smaller" ISA like ARM.

view this post on Zulip Chris Fallin (Jun 07 2021 at 05:56):

Greetings @Riccardo D'Ambrosio! I'm not aware of anyone with immediate plans to add a PTX (or GPU in general) backend to Cranelift, but the Embark Studios folks had a similar desire a while ago, and we talked about whether adding a GPU backend to Cranelift would make sense; that eventually led to the conclusion that it's probably more straightforward to develop a rustc backend directly. They've since built rust-gpu, which generates SPIR-V. @Benjamin Bouvier or @Johan Andersson would be the right folks to talk to if you want to know more!

🐉 Making Rust a first-class language and ecosystem for GPU code 🚧 - EmbarkStudios/rust-gpu

view this post on Zulip Riccardo D'Ambrosio (Jun 07 2021 at 05:58):

Yeah, rust-gpu is great, but CUDA usually yields considerably higher (30%+) performance for physics simulations and things like that, and has things that Vulkan just doesn't have, like graphs, special memory handling, etc., so using rust-gpu over CUDA is not really acceptable for me :(

view this post on Zulip Riccardo D'Ambrosio (Jun 07 2021 at 05:59):

For the small examples I've tried right now, rustc seems to emit correct PTX. I'll explore this further, since last time I tried this it produced broken PTX.

view this post on Zulip Riccardo D'Ambrosio (Jun 07 2021 at 05:59):

The issue is that I have to use --emit=asm (which is kind of a hack) because Windows has some issues linking with a custom PTX linker.

view this post on Zulip Riccardo D'Ambrosio (Jun 07 2021 at 05:59):

Anyways, Cranelift supporting PTX would still be very helpful, since targeting the GPU from things like JIT compilers (sounds crazy, ikr) is useful.

view this post on Zulip Chris Fallin (Jun 07 2021 at 06:01):

Gotcha. Well, we're always looking for more contributors, so if you have interest in developing a PTX backend, we'd be happy to talk further! We're spread pretty thin otherwise so such a backend is unlikely to be developed without someone specifically driving it forward. But I could point you (or anyone else) toward the right places to start.

view this post on Zulip Riccardo D'Ambrosio (Jun 07 2021 at 06:01):

Yeah, if I find that targeting PTX with rustc is completely broken, I might look at it.

view this post on Zulip Riccardo D'Ambrosio (Jun 07 2021 at 06:03):

I'm not sure how hard the PTX ISA is; I presume a lot of special handling would be needed because it has special things like textures, surfaces, etc. I'm not sure how Cranelift handles ISA-specific features.

view this post on Zulip Chris Fallin (Jun 07 2021 at 06:04):

We more or less provide a standard load/store abstract machine at the IR level, similar to e.g. LLVM IR. There's precedent for adding intrinsics that are platform-specific but we try to minimize that

view this post on Zulip Chris Fallin (Jun 07 2021 at 06:05):

My initial gut reaction to that (with the caveat that I am not very familiar with the GPU world in general) is that we probably wouldn't turn Cranelift into a compiler that understands, e.g., textures or surfaces. But insofar as general IR wants to be compiled down to GPGPU kernels, that seems reasonable to me

view this post on Zulip Riccardo D'Ambrosio (Jun 07 2021 at 06:06):

Hmm, I see. I'm not sure how LLVM's PTX backend handles it.

view this post on Zulip Chris Fallin (Jun 07 2021 at 06:08):

although, if the extent of the platform-specific quirks that one needs is just that there are special memories (scratchpad, texture memory?, etc) then we actually do have a notion of separate "heaps" that can be accessed (this comes from the multi-Wasm-heap world)

view this post on Zulip Riccardo D'Ambrosio (Jun 07 2021 at 06:08):

https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#texture-sampler-and-surface-types

view this post on Zulip Chris Fallin (Jun 07 2021 at 06:10):

OK, interesting. So the first step would be working out how to represent these primitives in CLIF

view this post on Zulip Riccardo D'Ambrosio (Jun 07 2021 at 06:10):

yeah

view this post on Zulip Chris Fallin (Jun 07 2021 at 06:11):

There also seems to be some weirdness with "hiding the ABI", and there is the question of what to do about regalloc (iirc GPUs have "thousands-ish" of registers, but does PTX present a virtual-register machine?)

view this post on Zulip Riccardo D'Ambrosio (Jun 07 2021 at 06:11):

Yes, PTX has virtual registers.

view this post on Zulip Riccardo D'Ambrosio (Jun 07 2021 at 06:12):

https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#parameterized-variable-names

view this post on Zulip Riccardo D'Ambrosio (Jun 07 2021 at 06:12):

PTX is actually pretty human-friendly compared to other ISAs, I'd say.

view this post on Zulip Chris Fallin (Jun 07 2021 at 06:12):

OK, gotcha, so we would need a way to bypass regalloc as well (that makes our life easier in some ways but also special-cases a bunch of stuff in the backend)

view this post on Zulip Riccardo D'Ambrosio (Jun 07 2021 at 06:13):

We'd also need to write an assembler for it; I'm not sure how much work that is, and I actually need that right now for JIT-compiling some math lol

view this post on Zulip Chris Fallin (Jun 07 2021 at 06:13):

Yeah, each of our backends relies on an assembler library of sorts -- the MachInst implementation and the emit code centered around the MachBuffer

view this post on Zulip Chris Fallin (Jun 07 2021 at 06:14):

so the steps to write a backend are (i) get that assembler infrastructure written, (ii) define the lowering rules (basically a big match), (iii) define some other machine-specific traits like ABI implementations
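
(Editor's note: to give a feel for step (ii), here is a deliberately toy, self-contained Rust sketch of the "big match" shape of lowering. The enums and instruction names are invented for illustration and are not Cranelift's actual types; the real thing goes through the MachInst/lowering machinery Chris mentions above.)

```rust
// Toy illustration only: invented IR and PTX-ish instruction enums, not
// Cranelift's real data structures. It just shows that lowering is, at its
// core, one big match from IR opcodes to machine instructions.
#[derive(Debug)]
enum IrInst {
    Iadd { dst: u32, a: u32, b: u32 },
    Load { dst: u32, addr: u32 },
}

#[derive(Debug)]
enum PtxInst {
    AddS32 { d: u32, a: u32, b: u32 },
    LdGlobalU32 { d: u32, addr: u32 },
}

fn lower(inst: &IrInst) -> PtxInst {
    match inst {
        IrInst::Iadd { dst, a, b } => PtxInst::AddS32 { d: *dst, a: *a, b: *b },
        IrInst::Load { dst, addr } => PtxInst::LdGlobalU32 { d: *dst, addr: *addr },
    }
}

fn main() {
    let ir = IrInst::Iadd { dst: 0, a: 1, b: 2 };
    println!("{:?}", lower(&ir));
}
```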

view this post on Zulip Riccardo D'Ambrosio (Jun 07 2021 at 06:14):

Yeah, I looked at the x64 impl... gigantic.

view this post on Zulip Riccardo D'Ambrosio (Jun 07 2021 at 06:14):

Why is it that much larger than, say, ARM?

view this post on Zulip Chris Fallin (Jun 07 2021 at 06:15):

hmm, I just linecounted; 20087 lines in isa/x64 vs. 21156 in isa/aarch64

view this post on Zulip Chris Fallin (Jun 07 2021 at 06:16):

possibly you're seeing in aarch64 we split the lowering into two files (lower.rs, lower_inst.rs)? anyway both are pretty comparable in complexity

view this post on Zulip Riccardo D'Ambrosio (Jun 07 2021 at 06:16):

I see

view this post on Zulip Chris Fallin (Jun 07 2021 at 06:17):

I'd estimate a few months of full-time work to bring up a new backend, maybe six months to a year to have a well-tuned one

view this post on Zulip Chris Fallin (Jun 07 2021 at 06:17):

so, "new backend for platform X?" is not a small ask, but as said above, happy to discuss further if you're interested in driving this!

view this post on Zulip Riccardo D'Ambrosio (Jun 07 2021 at 06:17):

Yeah, for sure. It's kind of a last resort for me, so it's kind of unlikely, but maybe I'll explore it in the future. Not sure.

view this post on Zulip Riccardo D'Ambrosio (Jun 07 2021 at 06:17):

thanks anyways! :smile:

view this post on Zulip Chris Fallin (Jun 07 2021 at 06:18):

for sure!

view this post on Zulip Riccardo D'Ambrosio (Jun 07 2021 at 06:18):

Quick question while I'm here: is it possible to make Cranelift call an extern "C" function I have in my Rust code when using it as a JIT compiler?

view this post on Zulip Chris Fallin (Jun 07 2021 at 06:18):

It should be, yes; this is how "hostcalls" in e.g. wasmtime work

view this post on Zulip Riccardo D'Ambrosio (Jun 07 2021 at 06:19):

Is there a specific function in the docs that does that?

view this post on Zulip Chris Fallin (Jun 07 2021 at 06:20):

I actually spend more time inside the compiler than at its boundary so I can't remember the API off the top of my head :-) but in the IR, it's an ExternalName, and then we give that back to you in a relocation so you can patch the address in before executing

view this post on Zulip Chris Fallin (Jun 07 2021 at 06:20):

Probably grepping for ExternalName in crates/* in the wasmtime repo will get you started.
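
(Editor's note: as a concrete starting point, here is a minimal sketch of the ExternalName mechanism on the cranelift-codegen side, using the circa-2021 API; ExternalName::user changed shape in later releases. The (namespace, index) values are arbitrary placeholders chosen for illustration.)

```rust
// Minimal sketch, circa-2021 cranelift-codegen API: declare a reference to an
// external function inside a CLIF Function. The ExternalName is opaque to the
// compiler; after compilation it comes back to the embedder in a relocation,
// and the embedder patches in the real address before running the code.
use cranelift_codegen::ir::{types, AbiParam, ExtFuncData, ExternalName, Function, Signature};
use cranelift_codegen::isa::CallConv;

fn main() {
    let mut func = Function::new();

    // Signature of the external (host) function we intend to call.
    let mut sig = Signature::new(CallConv::SystemV);
    sig.params.push(AbiParam::new(types::I64));
    sig.returns.push(AbiParam::new(types::I64));
    let sigref = func.import_signature(sig);

    // (namespace, index) is whatever numbering scheme the embedder chooses;
    // 0 and 42 are arbitrary here.
    let name = ExternalName::user(0, 42);
    let funcref = func.import_function(ExtFuncData {
        name,
        signature: sigref,
        colocated: false,
    });
    println!("imported external function as {}", funcref);
}
```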

view this post on Zulip Riccardo D'Ambrosio (Jun 07 2021 at 06:20):

haha, thanks

view this post on Zulip Riccardo D'Ambrosio (Jun 07 2021 at 06:21):

I suppose Cranelift can't call standard Rust functions because of Rust's unspecified calling convention, right? Which isn't a big issue, since I can just use a "trampoline" function.

view this post on Zulip Chris Fallin (Jun 07 2021 at 06:22):

cc @bjorn3 on that (they work on the Cranelift backend for rustc, so they're the expert), but yes, I think for a general hostcall sort of setup, using extern "C" functions will be a lot simpler and more reliable

view this post on Zulip bjorn3 (Jun 07 2021 at 07:51):

If you are using cranelift-jit and declare a function with module.declare_function, cranelift-jit will fall back to looking up the symbol in the host executable if it isn't defined using module.define_function. You can also use .symbol() on the JITBuilder to add custom fallbacks that are looked up before the host executable's symbols.
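
(Editor's note: to make that concrete, here is a minimal sketch of registering a host function with cranelift-jit, using the API roughly as of 2021; in newer releases JITBuilder::new returns a Result, among other differences. The function and symbol names are made up for illustration.)

```rust
// Minimal sketch: register a host extern "C" function with the JIT so that
// generated code can call it by name via Linkage::Import.
use cranelift_codegen::ir::{types, AbiParam};
use cranelift_jit::{JITBuilder, JITModule};
use cranelift_module::{default_libcall_names, Linkage, Module};

// The host function we want JIT'd code to call; extern "C" pins the ABI.
extern "C" fn host_add(a: f32, b: f32) -> f32 {
    a + b
}

fn main() {
    let mut builder = JITBuilder::new(default_libcall_names());

    // Custom symbol: looked up before falling back to the host executable.
    let host_add_ptr: extern "C" fn(f32, f32) -> f32 = host_add;
    builder.symbol("host_add", host_add_ptr as *const u8);

    let mut module = JITModule::new(builder);

    // Declare (but never define) the function; cranelift-jit resolves it
    // through the symbol registered above when the code is finalized.
    let mut sig = module.make_signature();
    sig.params.push(AbiParam::new(types::F32));
    sig.params.push(AbiParam::new(types::F32));
    sig.returns.push(AbiParam::new(types::F32));
    let _host_add_id = module
        .declare_function("host_add", Linkage::Import, &sig)
        .unwrap();

    // ...from here, build your own functions with cranelift-frontend and call
    // host_add via module.declare_func_in_func(_host_add_id, ...)...
}
```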

view this post on Zulip bjorn3 (Jun 07 2021 at 07:53):

You will indeed have to use extern "C" trampolines. You should try to avoid passing non-primitive types, as the ABI calculation for aggregates (structs, unions, ...) has to be done by the user of Cranelift.

view this post on Zulip bjorn3 (Jun 07 2021 at 07:53):

@Riccardo D'Ambrosio

view this post on Zulip Riccardo D'Ambrosio (Jun 07 2021 at 07:58):

I see, thanks! :smile:

view this post on Zulip Riccardo D'Ambrosio (Jun 07 2021 at 07:58):

Is there a way to write a universal trampoline? I would suppose not, because we have no variadic generics :/

view this post on Zulip Riccardo D'Ambrosio (Jun 07 2021 at 07:59):

Most of the functions I want to call take vek vectors or matrices :/

view this post on Zulip Riccardo D'Ambrosio (Jun 07 2021 at 07:59):

But they're repr(C), so they should be dead simple.

view this post on Zulip bjorn3 (Jun 07 2021 at 08:01):

You should probably just pass pointers instead of passing them by value. That is much simpler to implement and avoids a move to the argument part of the stack.

view this post on Zulip Riccardo D'Ambrosio (Jun 07 2021 at 08:01):

The issue is that I'm reusing this math code for my main physics engine library, so it needs to be idiomatic.

view this post on Zulip bjorn3 (Jun 07 2021 at 08:03):

Only the trampolines need to take pointers; the library code can take arguments by value.
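
(Editor's note: here is a sketch of that split. Vec3 stands in for a #[repr(C)] vek-style vector type, and cross_trampoline is a hypothetical name; only the trampoline is exposed to the JIT.)

```rust
// The JIT'd caller only ever sees raw pointers, so it never needs to know the
// aggregate ABI; the library keeps its idiomatic by-value signature.
#[repr(C)]
#[derive(Clone, Copy)]
pub struct Vec3 {
    pub x: f32,
    pub y: f32,
    pub z: f32,
}

// Idiomatic library function: takes and returns values.
pub fn cross(a: Vec3, b: Vec3) -> Vec3 {
    Vec3 {
        x: a.y * b.z - a.z * b.y,
        y: a.z * b.x - a.x * b.z,
        z: a.x * b.y - a.y * b.x,
    }
}

// extern "C" trampoline exposed to the JIT: only pointers cross the boundary.
pub extern "C" fn cross_trampoline(a: *const Vec3, b: *const Vec3, out: *mut Vec3) {
    // Safety: the JIT'd caller must pass valid, properly aligned pointers.
    unsafe {
        *out = cross(*a, *b);
    }
}
```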

view this post on Zulip Riccardo D'Ambrosio (Jun 07 2021 at 08:03):

I see

view this post on Zulip Robin Lindner (Jul 26 2022 at 17:05):

One possibility would of course be to compile the whole thing for a PTX target and then execute it on the GPU using cuModuleLoadData (CUDA Driver API).

However, I think it would be a good move to compile directly to the GPU's native assembly and thus avoid that intermediate step.

The disadvantage is that these GPU ISAs differ from one GPU generation to the next (for example NVIDIA Maxwell, Kepler, ...).
On the other hand, I think one could achieve much higher performance this way than by leaving that step to a proprietary compiler.

Tuning code directly for the GPU would be a lot of reverse-engineering work, since NVIDIA publishes only very limited documentation on its ISAs.

