I'm working on a sandbox environment for user defined functions in a data processing engine. It is a perfect use case for wasm, and even better it doesn't need interface types. I have a super simple need for the types on the wasm methods. All I need to do is be able to pass in a borrowed byte array into the method and it can return an owned byte array. This data processing engine expects to use a specific data serialization method inside WASM functions. So I'm starting out with a really simple wasm module that takes a byte buffer and returns a number. Once I have that working I'll expand out and actually parse the data with the serialization library. Unfortunately I'm hitting a few issues on the way.
So I've tried two different signatures for this function:
#[no_mangle]
pub unsafe extern "C" fn udf(input: *const u8) -> u32 {
1
}
and
#[wasm_bindgen]
pub fn udf(input_buffer: &[u8]) -> u32 {
0
}
These produce a function that accepts either one or two i32s, respectively. The problem is I want to invoke this using an ExternRef and I'm unsure how to bridge the gap. If I just call it directly it will complain about "expected externref found i32", which makes sense: the parameter is either a pointer or a pointer and length. I found that wasm-bindgen has a --reference-types flag. I'm not great at reading decompiled wasm yet so I don't know if it is giving me what I need. The way I'm invoking it is:
RUSTFLAGS="-C target-feature=+reference-types" cargo build --target wasm32-unknown-unknown; rm -rf bound; wasm-bindgen --no-typescript --reference-types --target no-modules target/wasm32-unknown-unknown/debug/adder.wasm --out-dir bound
The problem I'm having with the produced wasm file is that it tries to import __wbindgen_init_externref_table which my runner can't find.
So my runner code looks like:
let mut cfg = Config::new();
cfg.wasm_reference_types(true);
let engine = Engine::new(&cfg).unwrap();
let store = Store::new(&engine);
let module = Module::from_file(store.engine(), "../path_to_wasm/adder.wasm").unwrap();
let instance = Instance::new(&store, &module, &[]).unwrap();
let add = instance.get_typed_func::<Option<ExternRef>, u32>("udf").unwrap();
let ext_ref = ExternRef::new([0,1]);
println!("Returned {}", add.call(Some(ext_ref)).unwrap());
Any help or ideas here would be much appreciated. Even if you just point me in the general direction of a unit test that would do me a world of good. Thank you so much!
wasm-bindgen only targets the Web and other JS host environments, so unfortunately you can't use it in non-Web environments like Wasmtime.
if you really just need a single exported function, I'd suggest writing raw FFI:
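fn udf(slice: &[u8]) -> u32 {
    // ...
}
mod raw_ffi {
    #[no_mangle]
    pub unsafe extern "C" fn udf(ptr: *const u8, len: usize) -> u32 {
        // Safety: the caller must pass a pointer to `len` readable bytes in guest memory.
        let slice = std::slice::from_raw_parts(ptr, len);
        super::udf(slice)
    }
}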
and then in the runner code, do
let udf = instance.get_typed_func::<(u32, u32), u32>("udf")?;
Beautiful. That looks perfect. I'll try that tonight
One thing that stands out to me is how will I pass a pointer to code on the host into the sandboxed runtime? I was under the impression that it needed to be an ExternRef to do that. For let udf = instance.get_typed_func::<(u32, u32), u32>("udf")?; it would be a pointer to an address that already exists in the wasm runtime, correct?
externref is completely opaque to the guest, it can't be inspected or modified, so it's perfect for giving the guest a "handle" to something the host keeps track of. as such, it can't be used to pass data to the guest to use. you're right in that using an i32 pair you'll need to copy the bytes into the guest's memory to pass it a pointer to. i believe wasm-bindgen will export a "malloc"-like function that can be used to allocate memory in the guest in a way that's safe for hosts to use; that will give back such a pointer and then you'd use the Wasmtime Memory object to write to that location.
One alternative that might be worth considering is to use a "well-known" file path with WASI and let the guest read the data from the file; WASI would handle the guest memory for you.
Yeah I got a suggestion elsewhere* to implement that malloc-like behavior, which makes sense and is doable. I was hoping externref would be a reference to memory outside the sandbox and wouldn't require a copy
Yeah, unfortunately externref is more for the "i accept something from the host which I'll give back to the host and has no semantic meaning to me, the guest" use cases
But it sounds like I might be misunderstanding the intention and abilities of externref. I'll need to reread the proposal
a classic example would be WASI's use case: give the guest a "descriptor" that can't be forged so that we can guarantee if the guest passes the descriptor back it came from the host. right now WASI uses i32 and the guest can give the host "bad descriptors" in the WASI functions. but eventually WASI will be defined in terms of externref
This makes a ton of sense and I think the parts that don't make sense will click in time
So should I think of externref like a file handle in a traditional OS? It is an identifier for some resource controlled by the host OS, or in the WASM case controlled by the runtime. It isn't so much a reference to memory but to some type of functionality?
exactly, but with the added benefit it can't be forged by the guest since the guest can't create an externref
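To make the "bad descriptor" point concrete, here is a minimal host-side sketch in plain Rust (no Wasmtime involved). The `HandleTable` type and its methods are hypothetical names made up for illustration. With integer descriptors the host has to validate every value the guest hands back, because the guest can invent any i32 it likes; with externref the guest could not construct the forged value in the first place.

```rust
use std::collections::HashMap;

/// Hypothetical host-side table mapping integer descriptors to resources
/// that the host owns (here just a name, for illustration).
pub struct HandleTable {
    next: u32,
    resources: HashMap<u32, String>,
}

impl HandleTable {
    pub fn new() -> Self {
        HandleTable { next: 0, resources: HashMap::new() }
    }

    /// Register a resource and return the descriptor handed to the guest.
    pub fn insert(&mut self, resource: String) -> u32 {
        let fd = self.next;
        self.next += 1;
        self.resources.insert(fd, resource);
        fd
    }

    /// Look up a descriptor coming back from the guest. Because it is a plain
    /// integer, the guest may have made it up, so the lookup can fail and the
    /// host must handle that case at run time.
    pub fn get(&self, fd: u32) -> Option<&String> {
        self.resources.get(&fd)
    }
}
```

A descriptor returned by `insert` resolves with `get`, while any other integer yields `None`. That `None` branch is exactly the check an unforgeable handle would make unnecessary.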
Awesome. Thank you so much
So while I've got your ear I've got a question on best practices on the malloc/free helpers. This should be super quick and simple.
So the goal is to have users write a wasm function that holds the user-defined function that manipulates the data. Let's say the function looks like
pub fn udf(ptr: *const u8, len: usize) -> *const u8
It will accept a ptr and len and return a pointer (the pointer is to an Apache Arrow encoded buffer, so it will include the length)
So I will need to have this malloc/free functionality provided somewhere. I don't want the users implementing this UDF to have to add it every time. Is it best practice to just import the user provided UDF wasm module and also include a helper module that the system provides with the malloc/free functionality?
And once again a MASSIVE thank you for your help on this. You've already cleared this up a ton. I'm just new to system level wasm work and learning a ton
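One way to picture a single returned pointer that carries its own length is a length-prefixed buffer. The sketch below is a hypothetical convention for illustration only, not Arrow's actual wire format: the function names and the 4-byte little-endian prefix are assumptions. The guest leaks the buffer so it stays alive in linear memory, and the reader (standing in for the host) recovers the length from the prefix.

```rust
/// Pack `payload` behind a 4-byte little-endian length prefix and leak it,
/// returning a pointer to the first byte of the prefix.
pub fn into_prefixed_ptr(payload: &[u8]) -> *const u8 {
    let mut buf = Vec::with_capacity(4 + payload.len());
    buf.extend_from_slice(&(payload.len() as u32).to_le_bytes());
    buf.extend_from_slice(payload);
    // A boxed slice has no spare capacity, so its full size can be
    // recomputed later from the prefix alone.
    Box::into_raw(buf.into_boxed_slice()) as *mut u8 as *const u8
}

/// Read the 4-byte prefix without assuming any alignment.
unsafe fn prefix_len(ptr: *const u8) -> usize {
    let mut len_bytes = [0u8; 4];
    unsafe { std::ptr::copy_nonoverlapping(ptr, len_bytes.as_mut_ptr(), 4) };
    u32::from_le_bytes(len_bytes) as usize
}

/// Copy the payload out, the way a host would after reading the prefix.
///
/// Safety: `ptr` must come from `into_prefixed_ptr` and not have been freed.
pub unsafe fn read_prefixed(ptr: *const u8) -> Vec<u8> {
    unsafe {
        let len = prefix_len(ptr);
        std::slice::from_raw_parts(ptr.add(4), len).to_vec()
    }
}

/// Release a buffer produced by `into_prefixed_ptr`.
///
/// Safety: `ptr` must come from `into_prefixed_ptr` and be freed only once.
pub unsafe fn free_prefixed(ptr: *const u8) {
    unsafe {
        let total = 4 + prefix_len(ptr);
        let raw = std::ptr::slice_from_raw_parts_mut(ptr as *mut u8, total);
        drop(Box::from_raw(raw));
    }
}
```

In a real setup the host would read the prefix and payload through the runtime's memory API rather than by dereferencing the pointer, and would then call an exported free function so the guest can reclaim the buffer.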
For this I'd probably implement a crate that exports the memory-management functions for the host which your users can depend upon in Cargo.toml (assuming you're not using wasm-bindgen)
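A rough sketch of what such a helper crate could contain, assuming Rust guests. The names `udf_alloc` and `udf_dealloc` are hypothetical, and the toy `udf` just sums the input bytes. In a real guest crate these functions would also carry the `#[no_mangle]` attribute, as in the snippets earlier in the thread, so that they show up as wasm exports; it is left off here so the sketch also builds as an ordinary native program.

```rust
use std::alloc::{alloc, dealloc, Layout};

/// Allocate `len` bytes inside the guest and return the pointer to the host.
/// Returns null for a zero-length request or if the allocation fails.
pub extern "C" fn udf_alloc(len: usize) -> *mut u8 {
    if len == 0 {
        return std::ptr::null_mut();
    }
    match Layout::array::<u8>(len) {
        Ok(layout) => unsafe { alloc(layout) },
        Err(_) => std::ptr::null_mut(),
    }
}

/// Free a buffer previously returned by `udf_alloc` with the same `len`.
///
/// Safety: `ptr` must come from `udf_alloc(len)` and be freed only once.
pub unsafe extern "C" fn udf_dealloc(ptr: *mut u8, len: usize) {
    if ptr.is_null() || len == 0 {
        return;
    }
    if let Ok(layout) = Layout::array::<u8>(len) {
        unsafe { dealloc(ptr, layout) };
    }
}

/// Toy UDF: sums the input bytes.
///
/// Safety: `ptr` must point to `len` readable, initialized bytes.
pub unsafe extern "C" fn udf(ptr: *const u8, len: usize) -> u32 {
    let input = unsafe { std::slice::from_raw_parts(ptr, len) };
    input.iter().map(|&b| u32::from(b)).sum()
}
```

The intended host sequence is: call `udf_alloc`, copy the input bytes into guest memory at the returned offset, call `udf` with that pointer and length, then call `udf_dealloc`. For guests written in other languages, the same pair of exports would have to be supplied by a small per-language SDK.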
once interface types are a reality and the tooling catches up, this should all get muuuch nicer :)
Yup! I'm so excited for interface types. I've got a project that's been on the backburner for a while now that's itching waiting for interface types
I got excited when a friend brought this one up because, with the use of Arrow, the types needed at the entry and exit points are dead simple. It's always an Arrow encoded byte array
So the goal for this long term is to not be rust dependent. So I don't want to lean on any cargo features if possible
The goal is for users of ballista[1] to be able to define UDFs in their data pipeline in the language they are most comfortable with. It will then ship the WASM out to the executor nodes and allow you to use them as part of the query. So I could say something like SELECT levenshtein_distance(name1, name2) from foo where levenshtein_distance is implemented with WASM. The dream is that users could use Rust, C#, Lua or whatever they are used to. I'll probably need to make a pretty minor SDK for each supported language but so far it looks like they will be minor lifts for each.
Last updated: Jan 24 2025 at 00:11 UTC