Just looking at: https://github.com/WebAssembly/component-model/blob/main/design/mvp/canonical-abi/definitions.py#L405-L406
case Own() : return lift_own(cx, load_int(cx.opts, ptr, 4), t)
case Borrow() : return lift_borrow(cx, load_int(cx.opts, ptr, 4), t)
Looks like it should be:
case Own() : return lift_own(cx, load_int(cx, ptr, 4), t)
case Borrow() : return lift_borrow(cx, load_int(cx, ptr, 4), t)
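A minimal sketch of why the first argument matters, assuming load_int reaches linear memory through cx.opts.memory. The CanonicalOptions and CallContext classes below are simplified, hypothetical stand-ins, not the real definitions.py types:

```python
from dataclasses import dataclass

@dataclass
class CanonicalOptions:      # hypothetical stand-in for the real options type
    memory: bytearray

@dataclass
class CallContext:           # hypothetical stand-in for the real call context
    opts: CanonicalOptions

def load_int(cx, ptr, nbytes, signed=False):
    # Assumed shape: the helper expects a context and reads cx.opts.memory.
    return int.from_bytes(cx.opts.memory[ptr:ptr + nbytes], 'little', signed=signed)

cx = CallContext(CanonicalOptions(bytearray([7, 0, 0, 0])))

# Passing the context works and yields the 32-bit handle index.
handle_index = load_int(cx, 0, 4)

# Passing cx.opts (as in the reported lines) fails, because the options
# object has no .opts attribute of its own.
try:
    load_int(cx.opts, 0, 4)
    bug_raises = False
except AttributeError:
    bug_raises = True
```

Under these assumptions the mistake would only surface at runtime, when a value of type own or borrow is actually loaded from memory.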
Indeed, that looks like a bug. Would you mind filing an issue in the component-model repo?
I will do. While I have your attention: should the despecialize function convert a string to a list<char>?
(Based on the docs here: https://github.com/WebAssembly/component-model/blob/main/design/mvp/Explainer.md#specialized-value-types)
Specialized types are not (necessarily) despecialized in the canonical ABI. For strings in particular, multiple Unicode encodings are supported.
That despecialize function implements the Canonical ABI's definition of despecialization; in the Canonical ABI, list<char> is represented like list<u32>, while string is represented like list<u8> or list<u16>, where the u8s or u16s are Unicode code units. Or the latin1+utf16 representation.
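To make the difference concrete, here is a small sketch using only Python's built-in codecs. It shows the element values each representation would carry for the same text; it illustrates the encodings only and is not the Canonical ABI's actual memory layout code:

```python
# "h", "e with acute", and a character outside the Basic Multilingual Plane
s = "h\u00e9\U0001D11E"

# list<char>: one 32-bit Unicode scalar value per element
as_list_char = [ord(c) for c in s]

# string with the utf8 encoding: 8-bit code units
utf8_units = list(s.encode('utf-8'))

# string with the utf16 encoding: 16-bit code units (little-endian)
raw = s.encode('utf-16-le')
utf16_units = [int.from_bytes(raw[i:i + 2], 'little') for i in range(0, len(raw), 2)]

print([hex(v) for v in as_list_char])
print([hex(v) for v in utf8_units])
print([hex(v) for v in utf16_units])
```

The three lists have different lengths for the same three characters, which is why a string cannot simply be treated as a list<char>.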
There's a mention of this here, although we should document this subtlety more clearly.
Conceptually the cabi isn't really despecializing strings: that list<u8> is still constrained to be valid utf-8.
Speaking of Unicode - having the various conversions done inside the ABI didn't really sit well with me, as folks tend to have different preferences as to which implementation to use (certainly in the C world). I would have preferred if they were treated in a similar fashion to the realloc function and left up to the consumer.
Unicode strings are ubiquitous. If the component model didn't have this functionality it would have been reinvented everywhere.
The ABI needs to be aware of encodings so that the host can automatically convert between them. For example, if you have a component that expects UTF-16 composed with another component that expects UTF-8, it's up to the host to convert them. Even if you were to leave it up to the consumer to do the conversion, the consumer would at least need to know what encoding they were converting from.
Also worth noting is that it doesn't need to do any "interesting" conversions, like normalization, case conversion, anything that needs to be aware of locales, non-Unicode encodings, or anything requiring codepoint tables. It's just translating Unicode scalar values from one encoding to another, which needs a lot less code than, say, realloc.
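As a rough illustration of how little code this kind of transcoding needs, here is a hand-written UTF-16 to UTF-8 converter that uses no tables and no external library. The function name utf16_to_utf8 is hypothetical and this is only a sketch, not the reference implementation. Note that the well-formedness check (rejecting unpaired surrogates) happens in the same pass as the conversion:

```python
def utf16_to_utf8(units):
    """Transcode a sequence of UTF-16 code units to UTF-8 bytes,
    rejecting unpaired surrogates in the same pass."""
    out = bytearray()
    i = 0
    while i < len(units):
        u = units[i]
        i += 1
        if 0xD800 <= u <= 0xDBFF:
            # High surrogate: must be followed by a low surrogate.
            if i >= len(units) or not (0xDC00 <= units[i] <= 0xDFFF):
                raise ValueError("unpaired high surrogate")
            cp = 0x10000 + ((u - 0xD800) << 10) + (units[i] - 0xDC00)
            i += 1
        elif 0xDC00 <= u <= 0xDFFF:
            raise ValueError("unpaired low surrogate")
        else:
            cp = u
        # Encode the scalar value as 1 to 4 UTF-8 bytes.
        if cp < 0x80:
            out.append(cp)
        elif cp < 0x800:
            out += bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
        elif cp < 0x10000:
            out += bytes([0xE0 | (cp >> 12),
                          0x80 | ((cp >> 6) & 0x3F),
                          0x80 | (cp & 0x3F)])
        else:
            out += bytes([0xF0 | (cp >> 18),
                          0x80 | ((cp >> 12) & 0x3F),
                          0x80 | ((cp >> 6) & 0x3F),
                          0x80 | (cp & 0x3F)])
    return bytes(out)
```

A C or C++ version would follow the same structure with fixed-width integer types, so a dependency on a library such as ICU should not be needed for this step.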
@Dan Gohman That's a fair point - I am trying to create a C++ ABI implementation and didn't want to have a dependency on ICU!
@Lann Martin I wasn't suggesting removing the Unicode support, just relocating the "encode" function to be a part of CallContext.opts
Besides the "you'd have to be able to convert to any other representation" point Joel made, another thing that's different to realloc is that Unicode handling is part of the guarantees the component model gives: if you receive a string, you're guaranteed that it's well-formed. If the conversion happened in-content, it couldn't be combined with a validation pass, so it'd be strictly more expensive.
Last updated: Nov 22 2024 at 16:03 UTC