Stream: general

Topic: WebAssembly Strings


view this post on Zulip Deleted (Oct 10 2020 at 17:50):

Hey folks, I'd like to have your opinions on https://github.com/AssemblyScript/universal-strings and what your thoughts are about the foregoing discussion in https://github.com/WebAssembly/gc/issues/145. In particular I'd like to understand the potential strategic conflicts leading to tough discussions like these over something that might well be a generally superior solution benefiting the entire ecosystem of (managed) languages and developers? Feel free to PM me if there's non-public intel you'd like to share. Really want to understand the big picture better.

Document scoped to discussion of Universal Strings in WebAssembly - AssemblyScript/universal-strings
As of the MVP document, strings can be expressed as either an (array i8) or (array i16) per a language's string encoding, but with only one character at a time being accessible with array.get a...

view this post on Zulip bjorn3 (Oct 11 2020 at 08:09):

  1. It doesn't support UTF-8, which Rust mandates for strings. While the proposal suggests sanitizing the string at the boundary, this would require an unnecessary validation step when both sides are guaranteed to produce a UTF-8 string.

    "Avoid alloc+copy->garbage at the boundary in between two Wasm GC-enabled languages and/or JavaScript"

  2. While this is the case for GC-enabled languages, for non-GC-enabled languages it will require two copies instead of a single one.
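
To make the validation cost in point 1 concrete, here is a minimal sketch in Python (not taken from the proposal). The helper name `is_well_formed_utf8` is hypothetical; it only illustrates that checking well-formedness is a full pass over the bytes, which is redundant when both sides already guarantee UTF-8.

```python
def is_well_formed_utf8(buf: bytes) -> bool:
    """Return True if buf is well-formed UTF-8 (requires scanning every byte)."""
    try:
        buf.decode("utf-8")  # strict error handling is Python's default
    except UnicodeDecodeError:
        return False
    return True


# A valid UTF-8 payload passes the check.
print(is_well_formed_utf8("héllo".encode("utf-8")))
# A stray 0xFF byte can never appear in UTF-8.
print(is_well_formed_utf8(b"\xff"))
# A UTF-8-style encoding of the lone surrogate U+D800 is also rejected.
print(is_well_formed_utf8(b"\xed\xa0\x80"))
```

The last case is the one that matters for strings coming from a 16-bit representation that tolerates unpaired surrogates: such a value has no well-formed UTF-8 form, so something at the boundary has to detect or replace it.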

view this post on Zulip Deleted (Oct 11 2020 at 16:23):

Thanks! There is indeed a validation step necessary in this case, with https://github.com/AssemblyScript/universal-strings/pull/2 proposing an idea to avoid redundant validation steps or validating implicitly. Also note that even with any alternative there will be a validation step somewhere iff the binding specified either UTF-8 or UTF-16, i.e. enforcing well-formedness, because one has to guarantee the invariant somehow. Also open to further ideas! Regarding 2. I don't quite see where there are two copies, and how it differs from let's say UTF-16 to UTF-16. Can you elaborate where you are seeing the issue, and how it is less efficient than interface types for example?


view this post on Zulip bjorn3 (Oct 11 2020 at 16:27):

There are two copies as you first have to copy the string from linear memory to a stringref and then back to linear memory in the other module.

view this post on Zulip Deleted (Oct 11 2020 at 16:44):

I see now, interesting. Very good point, agree that this should be avoided! Have ideas already :)

view this post on Zulip Deleted (Oct 11 2020 at 16:49):

One implicit mechanism I can imagine is if there is a string.new at one side of the boundary, and a string.lower immediately at the other, the engine can copy from the source to the target. Anything more formalized than that achieving the same will do as well.
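
The difference between the two-copy path and a direct copy can be sketched with a toy model in Python. Everything here is an assumption for illustration: `LinearMemory`, `pass_via_stringref`, and `pass_fused` are hypothetical names, and the immutable `bytes` object merely stands in for a host-managed string created by string.new and consumed by string.lower.

```python
class LinearMemory:
    """Toy stand-in for one module's linear memory (not a real Wasm API)."""

    def __init__(self, size: int) -> None:
        self.data = bytearray(size)


def pass_via_stringref(src, src_off, length, dst, dst_off) -> int:
    """Route the bytes through an intermediate host string; returns copies made."""
    # Copy 1 (models string.new): sender's linear memory -> host-managed string.
    stringref = bytes(memoryview(src.data)[src_off:src_off + length])
    # Copy 2 (models string.lower): host-managed string -> receiver's linear memory.
    dst.data[dst_off:dst_off + length] = stringref
    return 2


def pass_fused(src, src_off, length, dst, dst_off) -> int:
    """Copy straight from sender to receiver; returns copies made."""
    # Single copy: sender's linear memory -> receiver's linear memory.
    dst.data[dst_off:dst_off + length] = memoryview(src.data)[src_off:src_off + length]
    return 1


sender = LinearMemory(16)
sender.data[0:5] = b"hello"
receiver_a = LinearMemory(16)
receiver_b = LinearMemory(16)

copies_a = pass_via_stringref(sender, 0, 5, receiver_a, 4)
copies_b = pass_fused(sender, 0, 5, receiver_b, 4)
print(copies_a, copies_b, receiver_a.data == receiver_b.data)
```

Both paths leave the receiver in the same state; the fused one just skips the intermediate object, which is only possible when whoever performs the copy can see both ends of the call.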

view this post on Zulip Deleted (Oct 11 2020 at 17:10):

Interesting observation there is that a module does not even have to fully support GC for this to work, which is the common case in systems languages using linear memory exclusively. May just legalize string.new and string.lower at the boundary independently of GC.

view this post on Zulip Deleted (Oct 11 2020 at 18:06):

Here you go: https://github.com/AssemblyScript/universal-strings#integration-with-linear-memory-based-languages


view this post on Zulip Dan Gohman (Oct 11 2020 at 19:15):

@dcode Modules are typically compiled separately. When the compiler sees a string.new being passed to an import, it doesn't know whether the export will do a string.lower. It'd have to emit code to create a GC object to pass, because that might be what the export needs.

view this post on Zulip Deleted (Oct 11 2020 at 19:31):

Good point, yeah. Perhaps the engine may create both entry points upon compilation, and use the optimized one where it sees fit?

view this post on Zulip Dan Gohman (Oct 11 2020 at 20:05):

In order to let the linker pick which version to use at link time, while avoiding duplicating the entire function, the compiler would presumably split the code which makes the call into a separate function.

view this post on Zulip Dan Gohman (Oct 11 2020 at 20:05):

Suppose we call these split-out functions the "adapter functions"

view this post on Zulip Deleted (Oct 11 2020 at 20:14):

Heh, nice, fair point, just that these are taken care of by the engine, i.e. one does not have to author, ship, publish or install adapter functions (per environment), and there is zero size overhead in modules.

view this post on Zulip Dan Gohman (Oct 12 2020 at 16:01):

Except that tools auto-generate these so developers don't author them manually, and there's no extra work to "ship, publish, or install", and they're not per-environment.

view this post on Zulip Dan Gohman (Oct 12 2020 at 16:04):

The code size part does get to an interesting design question -- should wasm define a fixed set of supported string formats, or should it let source languages define their own formats?

view this post on Zulip Deleted (Oct 12 2020 at 16:06):

Can you elaborate what the expected process of adding IT to and later using it with a module is? For instance, will it require creating multiple binaries depending on what other modules or hosts a module integrates with? Or just the adapters that wrap a module? How does it behave when for example a dependency is switched out with a compatible one written in another language?

view this post on Zulip Dan Gohman (Oct 12 2020 at 16:07):

This is the "fusion" part of the IT proposal. You produce a module with adapters that translate between your concrete types and the abstract IT types, and the host / other module has adapters that translate from the abstract IT types to its concrete types

view this post on Zulip Dan Gohman (Oct 12 2020 at 16:08):

These two halves are fused at link time to produce the complete adapter function. So you only ship the code for your half.
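
As a rough illustration of the fusion idea described here, the two halves can be modelled as plain functions that are composed at link time. This is a sketch under stated assumptions, not the Interface Types design itself: the abstract string type is modelled as a Python `str`, and `lift_utf8`, `lower_utf16`, `lower_utf8`, and `fuse` are hypothetical names.

```python
from typing import Callable


def lift_utf8(payload: bytes) -> str:
    """Exporter's half: concrete UTF-8 bytes -> abstract string."""
    return payload.decode("utf-8")


def lower_utf16(value: str) -> bytes:
    """One importer's half: abstract string -> concrete UTF-16LE bytes."""
    return value.encode("utf-16-le")


def lower_utf8(value: str) -> bytes:
    """Another importer's half: abstract string -> concrete UTF-8 bytes."""
    return value.encode("utf-8")


def fuse(lift: Callable[[bytes], str],
         lower: Callable[[str], bytes]) -> Callable[[bytes], bytes]:
    """Model of link-time fusion: combine two halves into one adapter."""
    def adapter(payload: bytes) -> bytes:
        return lower(lift(payload))
    return adapter


# The exporter ships only lift_utf8; each importer ships only its own lowering.
to_utf16_module = fuse(lift_utf8, lower_utf16)
to_utf8_module = fuse(lift_utf8, lower_utf8)

print(to_utf16_module("hi".encode("utf-8")))
print(to_utf8_module("hi".encode("utf-8")))
```

Swapping the importer for one written in another language only changes which lowering half gets fused in; the exporter's half stays untouched. A real engine could additionally specialize the fused adapter, for example into a straight copy when both halves use the same encoding, which this toy composition does not attempt.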

view this post on Zulip Deleted (Oct 12 2020 at 16:10):

I see, thanks! Yeah, only shipping the code for your half is crucial there.

view this post on Zulip Till Schneidereit (Oct 14 2020 at 11:07):

yeah, that part is one of the most important aspects of ITs: building on the whole-system view the runtime has, we can have a system where content modules don't have to agree on a serialization format as a least common denominator, as RPC mechanisms usually have to


Last updated: Dec 23 2024 at 12:05 UTC