fitzgen opened PR #11825 from fitzgen:new-compile-time-builtins to bytecodealliance:main:
This commit adds the extremely unsafe `wasmtime::CodeBuilder::expose_unsafe_intrinsics` method. When enabled, the Wasm being compiled is given access to special imports that correspond to direct, unchecked and unsandboxed, native load and store operations. These intrinsics are intended to be used for implementing fast, inline-able versions of WASI interfaces that are special-cased to a particular host embedding, for example.

Compile-time builtins, as originally described in the RFC, are basically made up of three parts:
- A function inliner
- Unsafe intrinsics
- Component composition to encapsulate the usage of unsafe intrinsics in a safe interface
Part (1) has been implemented in Wasmtime and Cranelift for a little while now (see `wasmtime::Config::compiler_inlining`). This commit is part (2). After this commit lands, part (3) can be done with `wac` and `wasm-compose`, although follow-up work is required to make the developer experience nicer and more integrated into Wasmtime so that the APIs can look like those proposed in the RFC.
I still have a few doc comments and examples to fill out, but I thought it would be worth opening this PR up so that folks can start taking a look now, especially as I am taking Friday off and have a super-packed day tomorrow, so I probably won't have time to cross all the Ts and dot all the Is before next week.
One thing that no one brought up during the RFC, but which started bugging me during this implementation, is whether we can expose tools for compile-time builtin authors to do Spectre mitigations. Basically, expose an intrinsic that lowers to `spectre_select_guard` or something? Seems possible, but I haven't explored the design space too much yet. Also seems like it is _probably_ something we can do in an additive fashion, without needing to figure everything out before landing any intrinsics. Interested in folks' thoughts!
fitzgen requested cfallin for a review on PR #11825.
fitzgen requested wasmtime-compiler-reviewers for a review on PR #11825.
fitzgen requested wasmtime-core-reviewers for a review on PR #11825.
fitzgen requested alexcrichton for a review on PR #11825.
github-actions[bot] commented on PR #11825:
Subscribe to Label Action
cc @saulecabrera
<details>
This issue or pull request has been labeled: "wasmtime:api", "winch"

Thus the following users have been cc'd because of the following labels:
- saulecabrera: winch
To subscribe or unsubscribe from this label, edit the <code>.github/subscribe-to-label.json</code> configuration file.
Learn more.
</details>
rvolosatovs created PR review comment:
What do you think about including `uN-native-{add,sub}(u64, uN) -> uN`, which would return the previous value and wrap around on overflow? (Potentially in a follow-up.)
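(As an editorial illustration, not code from the PR: a minimal Rust sketch of the fetch-add-with-wrapping semantics this comment seems to propose. The function name and shape are hypothetical.)

```rust
/// Hypothetical semantics for the proposed `u64-native-add` intrinsic:
/// add `v` to the value at `slot`, wrapping on overflow, and return the
/// previous value (fetch-add style).
fn native_add_u64(slot: &mut u64, v: u64) -> u64 {
    let prev = *slot;
    *slot = prev.wrapping_add(v);
    prev
}

fn main() {
    let mut x = u64::MAX;
    // Wraps around on overflow and yields the old value:
    let prev = native_add_u64(&mut x, 2);
    assert_eq!(prev, u64::MAX);
    assert_eq!(x, 1);
}
```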
rvolosatovs submitted PR review.
fitzgen created PR review comment:
Can you clarify what the intended semantics are and why additional intrinsics are necessary and regular Wasm arithmetic is insufficient? Is this for doing the underlying architecture's pointer-sized arithmetic?
fitzgen submitted PR review.
rvolosatovs submitted PR review.
rvolosatovs created PR review comment:
I would imagine that e.g. a single `u64-add` would be more efficient than a `load`, followed by an `add`, followed by a `store`. Am I wrong in assuming that's the case?
It would also be a little easier for embedders to use.
rvolosatovs edited PR review comment.
fitzgen submitted PR review.
fitzgen created PR review comment:
Cranelift can optimize load+add+store to fuse the operations into a single instruction on architectures like x86-64 where such an instruction is available.
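(For illustration, not from the PR: the load + add + store pattern written out explicitly through a raw pointer. Optimizing backends targeting architectures with memory-destination read-modify-write instructions, such as x86-64, can typically emit this as a single `add` to memory.)

```rust
/// An explicit load + add + store through a raw pointer. A backend with
/// memory-destination RMW instructions can typically fuse all three
/// operations into one instruction.
unsafe fn add_in_place(p: *mut u64, v: u64) {
    // SAFETY: the caller guarantees `p` is valid for reads and writes.
    unsafe {
        let old = p.read();           // load
        p.write(old.wrapping_add(v)); // add + store
    }
}

fn main() {
    let mut slot = 40u64;
    unsafe { add_in_place(&mut slot, 2) };
    assert_eq!(slot, 42);
}
```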
fitzgen updated PR #11825.
fitzgen updated PR #11825.
rvolosatovs submitted PR review.
rvolosatovs created PR review comment:
Oh, nice, then that solves that, yeah!
fitzgen updated PR #11825.
fitzgen updated PR #11825.
alexcrichton submitted PR review.
alexcrichton created PR review comment:
One thing that I feel has worked out well elsewhere is using macros to define the signature, which guarantees everything stays in sync. Would it be possible to avoid manually creating the function type here through parameters, and instead procedurally derive the type from a single source of truth?
alexcrichton created PR review comment:
This is a duplicate of `TrampolineCompiler::abi_store_results`, so I was wondering if it would be possible to make a `TrampolineCompiler` here in this function and use that method? Similarly for `abi_load_params` for replacing `init` above.
alexcrichton created PR review comment:
I believe this'll need to truncate `pointer` on 32-bit platforms.
alexcrichton created PR review comment:
Similar to loads, this'll want to truncate the pointer for 32-bit targets
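(An editorial sketch, not Wasmtime's actual lowering: the intrinsics pass addresses as `u64`, so on a 32-bit host only the low 32 bits are meaningful and the value must be explicitly truncated before use.)

```rust
/// Illustrative sketch: narrow a guest-provided `u64` address to the
/// host's pointer width. On 64-bit hosts this is lossless; on 32-bit
/// hosts it drops the high 32 bits, analogous to a Cranelift `ireduce`
/// from i64 to i32.
fn truncate_to_host_pointer(addr: u64) -> usize {
    addr as usize
}

fn main() {
    // An address with bits set above the low 32:
    let addr: u64 = 0x1_0000_0004;
    // A 32-bit target keeps only the low 32 bits:
    assert_eq!(addr as u32, 0x4);
    // On a 64-bit host, the conversion is lossless:
    assert_eq!(truncate_to_host_pointer(addr), 0x1_0000_0004);
}
```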
alexcrichton created PR review comment:
If it's expected that the list of intrinsics is going to grow over time, should this perhaps be `PrimaryMap<SomethingIndex, (ModuleInternedTypeIndex, UnsafeIntrinsic)>`? Basically ignoring unused intrinsics.
alexcrichton created PR review comment:
Mind updating the comment at the top of this file too?
alexcrichton created PR review comment:
I believe this should be implementable by plumbing to the underlying Cranelift compiler?
alexcrichton created PR review comment:
Just flagging the various TODO here to get resolved before merging
alexcrichton created PR review comment:
This feels to me like it should be "iterate over what the component needs and compile those" rather than iterating over all intrinsics?
alexcrichton created PR review comment:
Technically I don't believe this is correct, and also technically the tests in this PR violate this by having a different host function return the u64 "pointer" which gets read/mutated. I think this'll want to be reworded, or perhaps even dropped entirely? Whether or not a modification/read of memory is safe is more-or-less up to Miri in a sense so we could somewhat defer to that.
alexcrichton created PR review comment:
Do you see a viable path to eventually omitting this? For example, if we were to implement a DCE pass for functions, and no intrinsics were actually imported anywhere and were inlined everywhere, then all of these should get emptied out. "Just DCE" wouldn't be sufficient because of loops like this, however; we'd also have to prune the list of intrinsics after function optimization, in theory.
alexcrichton created PR review comment:
From a Rust soundness perspective this is not sound. One reason is that, for example:
```rust
fn main() {
    let mut data = Box::new(32);
    let ptr = &mut *data as *mut i32;

    // e.g. through wasm...
    unsafe { *ptr += 10; }

    // e.g. through a host call...
    *data += 11;

    // e.g. through wasm again...
    unsafe { *ptr += 12; }
}
```

Running this through Miri shows that the third modification here (adding 12) is unsound. The reason is that the original pointer is "invalidated" once the original data is used through a different location.
This is also technically not sound because it's mutating through a `*const T` pointer which was originally derived from `&T`, which does not allow mutation. Basically, the `*mut ()` is going to need to be originally derived from `*mut T`.

One fix for this is to have `store.data()` and `data_mut` go through this pointer rather than `self.inner.data`. Another possible fix is to do some pre/post logic around wasm entry/exit (some permutation, I don't know exactly what) where we do some provenance juggling along the lines of this, which compiles to a no-op but has meaning to the compiler.
alexcrichton submitted PR review.
alexcrichton created PR review comment:
Or, better yet, another possible fix is the "provenance juggling" approach modifying the `data` and `data_mut` methods, but in such a way that it compiles down to the same thing that happens today.
alexcrichton commented on PR #11825:
Oh, also, I'd recommend using `prtest:full` on this PR, as this seems at high risk of passing on x64 and failing elsewhere.
alexcrichton submitted PR review.
alexcrichton created PR review comment:
For example, this is Miri-safe and additionally has the expected generated assembly:

```rust
pub struct Foo {
    a: Box<i32>,
    raw_a: *mut i32,
}

#[unsafe(no_mangle)]
pub extern "C" fn mutate_raw(foo: &mut Foo) {
    unsafe {
        *foo.raw_a += 1;
    }
}

#[unsafe(no_mangle)]
pub extern "C" fn mutate_safe(foo: &mut Foo) {
    *get_a(foo) += 1;
}

#[unsafe(no_mangle)]
pub extern "C" fn get_a(foo: &mut Foo) -> &mut i32 {
    unsafe {
        let addr: *mut i32 = &raw mut *foo.a;
        &mut *foo.raw_a.with_addr(addr.addr())
    }
}

fn main() {
    let mut a = Foo {
        a: Box::new(200),
        raw_a: std::ptr::null_mut(),
    };
    a.raw_a = &mut *a.a as *mut i32;

    println!("first: raw");
    mutate_raw(&mut a);
    println!("second: safe");
    mutate_safe(&mut a);
    println!("third: raw");
    mutate_raw(&mut a);
}
```
fitzgen submitted PR review.
fitzgen created PR review comment:
Creating a whole `TrampolineCompiler` proved difficult due to it being fairly tied to component trampolines, but I did factor out the `abi_{store_results,load_params}` functions so that they are reusable from this code.
fitzgen created PR review comment:
There is no comment showing the pseudocode definition of `VMStoreContext`, just `VMContext`.
fitzgen submitted PR review.
fitzgen submitted PR review.
fitzgen created PR review comment:
Probably because we can define `VMStoreContext` as a regular Rust struct, since it doesn't have dynamically-sized array fields.
fitzgen submitted PR review.
fitzgen created PR review comment:
My thinking has been that we can cross that bridge if we get to it.
Right now, there are very few intrinsics, and I'm not concerned about the size of little arrays like this. However, it is definitely intended that a component that doesn't use these intrinsics doesn't have them compiled into its text section and doesn't have any additional space reserved for their `VMFuncRef`s and whatnot in its `vmctx` layout.
fitzgen submitted PR review.
fitzgen created PR review comment:
That would require a phase separation between regular function compilation and unsafe intrinsic compilation (so that we can determine which intrinsics are actually used via looking at CLIF external function imports after Wasm-to-CLIF translation but before inlining). This requires additional special-casing that we've been trying to remove from our compilation orchestration, so I'd rather not. I think, given how few intrinsics there currently are, that it is fine to get all of them if you expose them to a component.
In the future, I'd like to start doing gc-sections/DCE in our linking step, and rely on that to remove dead functions instead of introducing phases.
fitzgen submitted PR review.
fitzgen created PR review comment:
> Do you see a viable path to eventually omitting this?
When you say "this" what exactly are you referring to?
Already it should be the case that when unsafe intrinsics are not exposed to a component, this loop performs zero iterations and there should[^0] be zero space reserved for intrinsics' `VMFuncRef`s in the `vmctx`.

[^0]: I think there may be a bug where the space is reserved unconditionally right now, looking at CI. But that is definitely unintentional.
> For example, if we were to implement a DCE pass for functions, and no intrinsics were actually imported anywhere and were inlined everywhere, then all of these should get emptied out. "Just DCE" wouldn't be sufficient because of loops like this, however; we'd also have to prune the list of intrinsics after function optimization, in theory.
Yes, we would need to update the `env_component.unsafe_intrinsics` field after doing gc-sections/DCE during linking, same as we would need to do the moral equivalent for the `VMFuncRef`s of defined Wasm functions that are imported/exported within a component but are ultimately never called and are dead code.
fitzgen updated PR #11825.
fitzgen updated PR #11825.
fitzgen updated PR #11825.
fitzgen submitted PR review.
fitzgen created PR review comment:
fitzgen edited PR review comment.
fitzgen updated PR #11825.
fitzgen updated PR #11825.
fitzgen requested alexcrichton for a review on PR #11825.
fitzgen commented on PR #11825:
@alexcrichton I think this should be ready for another review pass
alexcrichton created PR review comment:
For the CI failures, I believe it's due to the fact that there's lingering access of the store data that doesn't go through these helpers. Could the `data` field be renamed to perhaps `data_without_provenance` or something like that, with a comment to use these accessors?

Also, these `unsafe` blocks will definitely warrant a comment explaining what's going on, as it's otherwise pretty nontrivial why they're set up the way they are.
alexcrichton submitted PR review:
Thanks for slogging through all the CI bits and handling the Miri bits; it's looking good!
To expand a bit on some of the unresolved comments from the previous review: it feels a bit weird that there are different ways of managing the list of intrinsics for a component. The `VMComponentContext` either has 0 or all of them, compilation either compiles 0 or all of them, `info::Component` tracks a full list of intrinsics but has `None` for unneeded intrinsics, `VMComponentContext` initialization "nulls out" all intrinsics, and instantiation only fills in used intrinsics. To me this feels like a random mish-mash of different strategies to manage everything. I get your point about crossing the bridge when we get there, but I also feel like this PR is moving us to a state where it's pretty inconsistent how the intrinsics are handled. Some contexts are "all or nothing" and some contexts are "only used intrinsics".

Ideally I'd prefer a system where intrinsics were compacted/compiled on demand, as opposed to ever doing an "all or nothing" approach. My read of this is that this is basically a function of the initial analysis phase of a component and how clever it is. I would naively expect that fitting into the `GlobalInitializer` infrastructure would make the implementation "just fall out" by adding a new `FooIndex` type of some kind. Basically all the hash maps and helpers and such are already there, so I would naively expect the implementation to not be all that much work.

I'm perpetually worried about the quantity of work we defer, given that the rate of burning down this deferred work is often much smaller than the rate of deferring, but this is a topic reasonable folks can disagree on. In that sense I'll lay out my concerns here, but I'll also leave it up to you whether to merge or not before addressing them. If this merges as-is, though, mind opening an issue about these future improvements?
alexcrichton created PR review comment:
For "this" I mean this entire loop in the context where unsafe intrinsics are used. This loop exists for the vanishingly rare case that intrinsics are used but are also turned into `funcref` values one way or another, but in practice they'll basically never get used.

In some sense I'm not saying much here; it's pretty clear that DCE won't make this loop go away, but a wasmtime-aware DCE pass which updated the `unsafe_intrinsics` list would. So all I'm really saying is that I think we should strive to make this loop go away in most situations where unsafe intrinsics are used, but that'll require fancy DCE.
alexcrichton created PR review comment:
To avoid the `Option` here, could this use `NonNull::dangling()` as an initial constructor? That'll still segfault if erroneously accessed, but otherwise avoids the pesky `unwrap`s.
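(A minimal editorial sketch of the suggestion, with hypothetical names: a pointer field initialized with `NonNull::dangling()` instead of `Option<NonNull<T>>`. The dangling placeholder is non-null and well-aligned but must never be dereferenced before being overwritten; an erroneous access will almost certainly fault, while every legitimate read of the field avoids an `unwrap`.)

```rust
use std::ptr::NonNull;

/// Hypothetical container that holds a pointer filled in later.
struct Slot {
    /// Starts out dangling; must be overwritten before any dereference.
    ptr: NonNull<u64>,
}

impl Slot {
    fn new() -> Slot {
        Slot { ptr: NonNull::dangling() }
    }
}

fn main() {
    let s = Slot::new();
    // The placeholder is non-null (so no `Option` niche is needed)...
    assert!(!s.ptr.as_ptr().is_null());
    // ...and aligned for the pointee type.
    assert_eq!(s.ptr.as_ptr() as usize % std::mem::align_of::<u64>(), 0);
}
```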
fitzgen commented on PR #11825:
> I'm perpetually worried about the quantity of work we defer given that the rate of burning down this deferred work is often much smaller than the rate of deferring, but this is a topic reasonable folks can disagree on.
I hear you. I think it is important that we balance incrementalism and doing things the Right Way. Personally, I think the way we strike that balance is not by accepting that things will be in a fairly suboptimal state "temporarily" (which, as you note, is often not temporary) in service of shipping something a little sooner, but instead by dropping functionality and features that we don't have time to implement well. This way, we might not have everything we ideally want, but what we do have is rock solid. So when we look at some work being left for "follow-up PRs", we should ask whether what is landing now is rock solid, and whether the follow-ups are "just" optimizations/features/functionality.[^0]
[^0]: This is all assuming we are talking about the implementation of new features. Obviously improving existing suboptimal code, even if there are still more follow ups to be done before it is fully optimal, is worth landing right away.
But of course all these generalizations are very high-level and vague. We can still come to reasonable disagreements on what is considered "rock solid" or which bits of functionality even can be left out while preserving what remains.
To make things in this PR a little more concrete when viewed through the above lens, I have been accepting the compromise that we will compile, link, and instantiate `VMFuncRef`s for all intrinsics when the flag to expose unsafe intrinsics is enabled. I think the Right Way to fix that, so that only the intrinsics that are actually used are compiled+linked+instantiated, is to do gc-sections/DCE during linking, but also that doing all that is something that can be delayed (perhaps for a very long time!) and things will be Fine in the meantime, because what is implemented should be rock solid (albeit lacking the DCE optimizations we would have in an ideal world).

Ignoring for a second the implementation details in this PR around how intrinsics are tracked through translation and the like, do you agree that the above compromise is an acceptable cut to make?
Because I do agree somewhat with the following:
> It feels a bit weird that there's different ways of managing the list of intrinsics for a component. The `VMComponentContext` either has 0 or all of them, compilation either compiles 0 or all of them, `info::Component` tracks a full list of intrinsics but has `None` for unneeded intrinsics, `VMComponentContext` initialization "nulls out" all intrinsics, and instantiation only fills in used intrinsics. To me this feels like a random mish-mash of different strategies to manage everything.

How the intrinsics are tracked through compilation and in the `wasmtime-environ` metadata structures is a little messy, and I'd like to improve it. But I don't want to expand the scope of those improvements to only compiling+linking+instantiating the intrinsics that are actually used. Doing that correctly (IMHO via DCE/gc-sections in linking) is too much for me to bite off immediately, and (unless I am missing something) the other ways of doing it require adding new special-cased code paths (which we have generally been trying to remove) and so would only bring us to local maxima, which doesn't seem worth spending the effort on.
And thanks as always for the thorough review! Will dig in some more momentarily.
fitzgen updated PR #11825.
alexcrichton closed without merge PR #11825.
alexcrichton commented on PR #11825:
High-level definitely agree with everything you say, and I also want to strive for a balance of perfection and pragmatism. I'm happy to defer things to a follow-up PR to make progress effectively whenever, and the moment I pause is when the conclusion is to open an issue. Filing an issue means that it'll likely be years, if ever, before something changes. So in that sense "let's file an issue" is equated to me as "let's just ignore this" which is partly where I come from. I basically feel that there's no balance to filing an issue as it effectively means that someone else, who's likely to be less suited to the task, will have to take care of it.
For this PR specifically I agree the DCE/VMFuncRef bits should not happen here. There's definitely no need to entangle all that and it's a huge project for not a whole lot of benefit right now. What I was mostly referring to was the compilation of the unsafe intrinsics where it's a mish-mash of everything vs the subset used. That to me feels like the perfect candidate for if an issue is filed it'll just be forgotten and never addressed. It feels frequent that small improvements like this are rarely justified to spend time on which means they just never get time spent on them.
I agree with your rock-solid quality though, and to that end I have no concerns about this PR. I see no bugs in this PR and it's just a bit sub-optimal in a few areas. Given that I'm fine to see this merged. My personal preference would be to have a follow-up adjusting the compilation side of things to change the list-of-all-intrinsics to list-of-only-the-used-intrinsics, but I'm ok settling for an issue too.
alexcrichton reopened PR #11825.
alexcrichton commented on PR #11825:
That was not the button I wanted...
fitzgen updated PR #11825.
fitzgen commented on PR #11825:
@alexcrichton the latest commit (9ca326b) cleans this up a little bit, and also happens to give us demand-based compilation of intrinsics, since it turns out we know which ones were `canon lower`ed by the time we are compiling functions. Mind taking another look? Does this help alleviate some of your concerns?
fitzgen updated PR #11825.
alexcrichton commented on PR #11825:
Agreed! I think everything at least now works in O(intrinsics_used), and data-representation-wise I'd still prefer `PrimaryMap<SomethingIndex, ...>`, but that's fine to defer to a future refactoring if necessary.
fitzgen merged PR #11825.
Last updated: Dec 06 2025 at 06:05 UTC