Stream: wasmtime

Topic: help with externref to a byte buffer


view this post on Zulip Stuart Small (Mar 21 2021 at 21:36):

I'm working on a sandbox environment for user defined functions in a data processing engine. It is a perfect use case for wasm, and even better it doesn't need interface types. I have a super simple need for the types on the wasm methods. All I need to do is be able to pass in a borrowed byte array into the method and it can return an owned byte array. This data processing engine expects to use a specific data serialization method inside WASM functions. So I'm starting out with a really simple wasm module that takes a byte buffer and returns a number. Once I have that working I'll expand out and actually parse the data with the serialization library. Unfortunately I'm hitting a few issues on the way.

So I've tried two different signatures for this function:

#[no_mangle]
pub unsafe extern "C" fn udf(input: *const u8) -> u32 {
     1
}

and

#[wasm_bindgen]
pub fn udf(input_buffer:  &[u8]) -> u32 {
    0
}

These produce a function that accepts either 1 or 2 i32s, respectively. The problem is I want to invoke this using an ExternRef and I'm unsure how to bridge the gap. If I just call it directly it will complain about "expected externref found i32" which makes sense. It is either a pointer or a pointer and length. I found that wasm-bindgen has a --reference-types flag. I'm not great at reading decompiled wasm yet so I don't know if it is giving me what I need. The way I'm invoking it is:

RUSTFLAGS="-C target-feature=+reference-types" cargo build --target wasm32-unknown-unknown; rm -rf bound; wasm-bindgen --no-typescript --reference-types --target no-modules target/wasm32-unknown-unknown/debug/adder.wasm --out-dir bound

The problem I'm having with the produced wasm file is that it tries to import __wbindgen_init_externref_table which my runner can't find.

So my runner code looks like:

    let mut cfg = Config::new();
    cfg.wasm_reference_types(true);
    let engine = Engine::new(&cfg).unwrap();
    let store = Store::new(&engine);
    let module = Module::from_file(store.engine(), "../path_to_wasm/adder.wasm").unwrap();
    let instance = Instance::new(&store, &module, &[]).unwrap();
    let add = instance.get_typed_func::<Option<ExternRef>, u32>("udf").unwrap();
    let ext_ref = ExternRef::new([0,1]);
    println!("Returned {}", add.call(Some(ext_ref)).unwrap());

Any help or ideas here would be much appreciated. Even if you just point me in the general direction of a unit test that would do me a world of good. Thank you so much!

view this post on Zulip fitzgen (he/him) (Mar 22 2021 at 18:04):

wasm-bindgen only targets the Web and other JS host environments, so unfortunately you can't use it in non-Web environments like Wasmtime.

view this post on Zulip fitzgen (he/him) (Mar 22 2021 at 18:09):

if you really just need a single exported function, I'd suggest writing raw FFI:

fn udf(slice: &[u8]) -> u32 {
     // ...
}

mod raw_ffi {
    #[no_mangle]
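    pub unsafe extern "C" fn udf(ptr: *const u8, len: usize) -> u32 {
        // Safety: the caller must pass a pointer to `len` readable bytes in guest memory.
        let slice = std::slice::from_raw_parts(ptr, len);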
        super::udf(slice)
    }
}

and then in the runner code, do

let udf = instance.get_typed_func::<(u32, u32), u32>("udf")?;

view this post on Zulip Stuart Small (Mar 22 2021 at 18:31):

Beautiful. That looks perfect. I'll try that tonight

view this post on Zulip Stuart Small (Mar 22 2021 at 18:34):

One thing that stands out to me: how will I pass a pointer to code on the host into the sandboxed runtime? I was under the impression that it needed to be an ExternRef to do that. For let udf = instance.get_typed_func::<(u32, u32), u32>("udf")?; it would be a pointer to an address that already exists in the wasm runtime, correct?

view this post on Zulip Peter Huene (Mar 22 2021 at 18:58):

externref is completely opaque to the guest, it can't be inspected or modified, so it's perfect for giving the guest a "handle" to something the host keeps track of. as such, it can't be used to pass data to the guest to use. you're right in that using an i32 pair you'll need to copy the bytes into the guest's memory to pass it a pointer to. i believe wasm-bindgen will export a "malloc"-like function that can be used to allocate memory in the guest in a way that's safe for hosts to use; that will give back such a pointer and then you'd use the Wasmtime Memory object to write to that location.
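
A minimal sketch of what such guest-side allocation exports could look like when written by hand in Rust rather than generated by wasm-bindgen. The names `guest_alloc` and `guest_dealloc` are hypothetical, and the byte-summing `udf` is only a stand-in for real work. The host would call `guest_alloc`, copy the input bytes to the returned address in the guest's linear memory, call `udf` with that address and length, and then call `guest_dealloc`.

```rust
use std::alloc::{alloc as raw_alloc, dealloc as raw_dealloc, Layout};

/// Hypothetical export: reserve `len` bytes inside the guest and return the address.
/// Returns null for a zero-length request or if the layout is invalid.
#[no_mangle]
pub extern "C" fn guest_alloc(len: usize) -> *mut u8 {
    if len == 0 {
        return std::ptr::null_mut();
    }
    match Layout::array::<u8>(len) {
        // Safety: the layout has a non-zero size, as `alloc` requires.
        Ok(layout) => unsafe { raw_alloc(layout) },
        Err(_) => std::ptr::null_mut(),
    }
}

/// Hypothetical export: release a buffer obtained from `guest_alloc`.
/// `len` must be the same value that was passed to `guest_alloc`.
#[no_mangle]
pub unsafe extern "C" fn guest_dealloc(ptr: *mut u8, len: usize) {
    if ptr.is_null() || len == 0 {
        return;
    }
    if let Ok(layout) = Layout::array::<u8>(len) {
        // Safety: the caller promises `ptr` came from `guest_alloc(len)`.
        unsafe { raw_dealloc(ptr, layout) };
    }
}

/// Stand-in UDF: sums the bytes the host copied into guest memory.
#[no_mangle]
pub unsafe extern "C" fn udf(ptr: *const u8, len: usize) -> u32 {
    if ptr.is_null() || len == 0 {
        return 0;
    }
    // Safety: the caller promises `ptr` points to `len` readable bytes.
    let slice = unsafe { std::slice::from_raw_parts(ptr, len) };
    slice.iter().map(|&b| u32::from(b)).sum()
}
```

Passing the length back to `guest_dealloc` keeps the sketch free of any bookkeeping header; a real helper might track sizes itself so the host only needs the pointer.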

view this post on Zulip Peter Huene (Mar 22 2021 at 19:02):

One alternative that might be worth considering is to use a "well-known" file path with WASI and let the guest read the data from the file; WASI would handle the guest memory for you.

view this post on Zulip Stuart Small (Mar 22 2021 at 19:03):

Yeah, I got a suggestion elsewhere to implement that malloc-like behavior. Which makes sense and is doable. I was hoping externref would be a reference to memory outside the sandbox and wouldn't require a copy

view this post on Zulip Peter Huene (Mar 22 2021 at 19:04):

Yeah, unfortunately externref is more for the "i accept something from the host which I'll give back to the host and has no semantic meaning to me, the guest" use cases

view this post on Zulip Stuart Small (Mar 22 2021 at 19:04):

But it sounds like I might be misunderstanding the intention and abilities of externref. I'll need to reread the proposal

view this post on Zulip Peter Huene (Mar 22 2021 at 19:05):

a classic example would be WASI's use case: give the guest a "descriptor" that can't be forged so that we can guarantee if the guest passes the descriptor back it came from the host. right now WASI uses i32 and the guest can give the host "bad descriptors" in the WASI functions.

view this post on Zulip Peter Huene (Mar 22 2021 at 19:06):

but eventually WASI will be defined in terms of externref

view this post on Zulip Stuart Small (Mar 22 2021 at 19:08):

This makes a ton of sense and I think the parts that don't make sense will click in time

view this post on Zulip Stuart Small (Mar 22 2021 at 19:09):

So should I think of externref like a file handle in a traditional OS? It is an identifier for some resource controlled by the host OS, and in the WASM case controlled by the runtime. It isn't so much a reference to memory but some type of functionality?

view this post on Zulip Peter Huene (Mar 22 2021 at 19:10):

exactly, but with the added benefit it can't be forged by the guest since the guest can't create an externref
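
To make the forgery point concrete, here is a rough sketch of the integer-descriptor approach that the discussion contrasts with externref. `HandleTable` is a hypothetical host-side type, not part of Wasmtime or WASI. Because the guest only holds a plain integer, the host has to look up every value it gets back, and nothing stops the guest from guessing an integer it was never given.

```rust
use std::collections::HashMap;

/// Hypothetical host-side table mapping integer descriptors to host resources.
pub struct HandleTable<T> {
    next: u32,
    entries: HashMap<u32, T>,
}

impl<T> HandleTable<T> {
    pub fn new() -> Self {
        HandleTable {
            next: 0,
            entries: HashMap::new(),
        }
    }

    /// Register a resource and return the integer the guest will see.
    pub fn insert(&mut self, resource: T) -> u32 {
        let id = self.next;
        self.next += 1;
        self.entries.insert(id, resource);
        id
    }

    /// Validate a descriptor coming back from the guest.
    /// Unknown integers are rejected, but a guessed integer that happens
    /// to be live is indistinguishable from one the host handed out.
    pub fn get(&self, id: u32) -> Option<&T> {
        self.entries.get(&id)
    }

    /// Drop a resource; later uses of the same integer are rejected.
    pub fn remove(&mut self, id: u32) -> Option<T> {
        self.entries.remove(&id)
    }
}
```

For example, after `let a = table.insert("first")` and `table.insert("second")`, a guest that was only ever given `a` can still pass `a + 1` and reach the second entry. An externref avoids this because the guest has no way to construct one.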

view this post on Zulip Stuart Small (Mar 22 2021 at 19:10):

Awesome. Thank you so much

view this post on Zulip Stuart Small (Mar 22 2021 at 19:15):

So while I've got your ear I've got a question on best practices on the malloc/free helpers. This should be super quick and simple.

So the goal is to have users write a wasm function that holds the user defined function that manipulates the data. Let's say the function looks like

pub fn udf(ptr: *const u8, len: usize) ->  *const u8

It will accept a ptr and len and return a pointer (the pointer is to an Apache Arrow encoded buffer, so it will include the length)
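
One way to picture a single returned pointer that carries its own length is a length-prefixed buffer. The sketch below assumes a 4-byte little-endian prefix purely for illustration; it is not Arrow's actual framing. `write_prefixed` and `read_prefixed` are hypothetical helpers, and a plain byte slice stands in for the guest's linear memory.

```rust
/// Guest side (hypothetical): prepend a 4-byte little-endian length to the payload.
pub fn write_prefixed(payload: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(4 + payload.len());
    out.extend_from_slice(&(payload.len() as u32).to_le_bytes());
    out.extend_from_slice(payload);
    out
}

/// Host side (hypothetical): given the offset the guest returned, read the
/// length prefix and then the payload, refusing anything out of bounds.
pub fn read_prefixed(memory: &[u8], ptr: usize) -> Option<&[u8]> {
    let header_end = ptr.checked_add(4)?;
    let header = memory.get(ptr..header_end)?;
    let len = u32::from_le_bytes([header[0], header[1], header[2], header[3]]) as usize;
    let body_end = header_end.checked_add(len)?;
    memory.get(header_end..body_end)
}
```

The bounds checks matter on the host side because the returned pointer comes from untrusted guest code and may point anywhere.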

So I will need to have this malloc/free functionality provided somewhere. I don't want the users implementing this UDF to have to add it every time. Is it best practice to just import the user provided UDF wasm module and also include a helper module that the system provides with the malloc/free functionality?

view this post on Zulip Stuart Small (Mar 22 2021 at 19:17):

And once again a MASSIVE thank you for your help on this. You've already cleared this up a ton. I'm just new to system level wasm work and learning a ton

view this post on Zulip Peter Huene (Mar 22 2021 at 19:22):

For this I'd probably implement a crate that exports the memory-management functions for the host which your users can depend upon in Cargo.toml (assuming you're not using wasm-bindgen)

view this post on Zulip Peter Huene (Mar 22 2021 at 19:22):

once interface types are a reality and the tooling catches up, this should all get muuuch nicer :)

view this post on Zulip Stuart Small (Mar 22 2021 at 19:23):

Yup! I'm so excited for interface types. I've got a project that's been on the backburner for a while now that's itching waiting for interface types

view this post on Zulip Stuart Small (Mar 22 2021 at 19:24):

I got excited when a friend brought this one up because, with the use of Arrow, the types needed at the entry and exit points are dead simple. It's always an Arrow encoded byte array

view this post on Zulip Stuart Small (Mar 22 2021 at 19:24):

So the goal for this long term is to not be rust dependent. So I don't want to lean on any cargo features if possible

view this post on Zulip Stuart Small (Mar 22 2021 at 19:30):

The goal is for users of ballista[1] to be able to define UDFs in their data pipeline in the language they are most comfortable with. It will then ship the WASM out to the executor nodes and allow you to use them as part of the query. So I could say something like SELECT levenshtein_distance(name1, name2) from foo where levenshtein_distance is implemented with WASM. The dream is that users could use Rust, C#, Lua or whatever they are used to. I'll probably need to make a pretty minor SDK for each supported language but so far it looks like they will be minor lifts for each.

  1. https://github.com/ballista-compute/ballista
Distributed compute platform implemented in Rust, and powered by Apache Arrow. - ballista-compute/ballista

Last updated: Nov 22 2024 at 16:03 UTC