Possible Canonical ABI issue · wit-bindgen

case Own()          : return lift_own(cx, load_int(cx.opts, ptr, 4), t)
case Borrow()       : return lift_borrow(cx, load_int(cx.opts, ptr, 4), t)

case Own()          : return lift_own(cx, load_int(cx, ptr, 4), t)
case Borrow()       : return lift_borrow(cx, load_int(cx, ptr, 4), t)
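
A minimal sketch of why the argument matters, assuming (the thread does not show it) that `load_int` reaches linear memory through `cx.opts.memory`. The `CanonicalOptions` and `CallContext` classes below are simplified, hypothetical stand-ins rather than the spec's actual definitions.

```python
from dataclasses import dataclass

@dataclass
class CanonicalOptions:
    # Hypothetical stand-in: only the linear memory is modelled here.
    memory: bytearray

@dataclass
class CallContext:
    # Hypothetical stand-in for the context object passed around as `cx`.
    opts: CanonicalOptions

def load_int(cx, ptr, nbytes, signed=False):
    # Assumed shape: reads little-endian bytes via cx.opts.memory.
    return int.from_bytes(cx.opts.memory[ptr:ptr + nbytes], 'little', signed=signed)

cx = CallContext(CanonicalOptions(bytearray([0x2A, 0, 0, 0])))

# Passing the context itself works and yields the 32-bit handle index.
handle = load_int(cx, 0, 4)

# Passing cx.opts fails under this assumption: the options object has no `.opts`.
try:
    load_int(cx.opts, 0, 4)
    failed = False
except AttributeError:
    failed = True
```

Under that assumption, the first pair of lines would raise as soon as an `own` or `borrow` handle is loaded from memory, while the second pair reads the handle index as intended.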

Dan Gohman (Mar 13 2024 at 20:55):

Indeed, that looks like a bug. Would you mind filing an issue in the component-model repo?

Gordon Smith (Mar 14 2024 at 10:00):

Lann Martin (Mar 14 2024 at 12:26):

Specialized types are not (necessarily) despecialized in the canonical ABI. For strings in particular multiple unicode encodings are supported.

Dan Gohman (Mar 14 2024 at 12:26):

That despecialize function implements the Canonical ABI's definition of despecialization; in the Canonical ABI, list<char> is represented like list<u32>, while string is represented like list<u8> or list<u16> where the u8s or u16s are Unicode code units. Or the latin1+utf16 representation.
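
To make the difference concrete, here is a small sketch using only Python's standard library. The sample string and variable names are illustrative; the sketch shows code units only and does not reproduce the Canonical ABI's actual memory layout.

```python
import struct

s = "h\u00e9\U0001F600"  # 'h', 'é', and one character outside the BMP

# list<char>-like view: one 32-bit Unicode scalar value per character.
scalars = [ord(c) for c in s]

# string as UTF-8: a sequence of 8-bit code units.
utf8_units = list(s.encode('utf-8'))

# string as UTF-16: a sequence of 16-bit code units (decoded as little-endian).
raw16 = s.encode('utf-16-le')
utf16_units = list(struct.unpack('<%dH' % (len(raw16) // 2), raw16))

print(len(scalars), len(utf8_units), len(utf16_units))
```

The same three characters occupy a different number of elements in each view, which is why a string cannot simply be treated as a list<char>.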

Dan Gohman (Mar 14 2024 at 12:26):

There's a mention of this here, although we should document this subtlety more clearly.

Lann Martin (Mar 14 2024 at 13:01):

Conceptually the cabi isn't really despecializing strings: that list<u8> is still constrained to be valid utf-8

Gordon Smith (Mar 14 2024 at 13:16):

Speaking of Unicode - having the various conversions done inside the ABI didn't really sit well with me, as folks tend to have different preferences as to what implementation to use (certainly in the C world) - I would have preferred if they were treated in a similar fashion to the realloc function and left up to the consumer?

Lann Martin (Mar 14 2024 at 13:37):

Unicode strings are ubiquitous. If the component model didn't have this functionality it would have been reinvented everywhere.

Joel Dice (Mar 14 2024 at 13:37):

The ABI needs to be aware of encodings so that the host can automatically convert between them. For example, if you have a component that expects UTF-16 composed with another component that expects UTF-8, it's up to the host to convert them. Even if you were to leave it up to the consumer to do the conversion, the consumer would at least need to know what encoding they were converting from.

Dan Gohman (Mar 14 2024 at 15:26):

Also worth noting is that it doesn't need to do any "interesting" conversions, like normalization, case conversion, anything that needs to be aware of locales, non-Unicode encodings, or anything requiring codepoint tables. It's just translating Unicode scalar values from one encoding to another, which needs a lot less code than, say, realloc.
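
As a rough illustration of how little is involved, the following sketch transcodes UTF-16 code units to UTF-8 with plain integer arithmetic and no lookup tables. The function name `utf16_to_utf8` is hypothetical and this is not the spec's implementation; it assumes every input element is already in the range 0..0xFFFF.

```python
def utf16_to_utf8(units):
    """Transcode a list of 16-bit UTF-16 code units into UTF-8 bytes.

    Raises ValueError on unpaired surrogates, so malformed input is
    rejected in the same pass that performs the conversion.
    """
    out = bytearray()
    i = 0
    while i < len(units):
        u = units[i]
        i += 1
        if 0xD800 <= u <= 0xDBFF:
            # High surrogate: must be followed by a low surrogate.
            if i >= len(units) or not (0xDC00 <= units[i] <= 0xDFFF):
                raise ValueError("unpaired high surrogate")
            cp = 0x10000 + ((u - 0xD800) << 10) + (units[i] - 0xDC00)
            i += 1
        elif 0xDC00 <= u <= 0xDFFF:
            raise ValueError("unpaired low surrogate")
        else:
            cp = u
        # Encode the scalar value as 1 to 4 UTF-8 bytes.
        if cp < 0x80:
            out.append(cp)
        elif cp < 0x800:
            out += bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
        elif cp < 0x10000:
            out += bytes([0xE0 | (cp >> 12),
                          0x80 | ((cp >> 6) & 0x3F),
                          0x80 | (cp & 0x3F)])
        else:
            out += bytes([0xF0 | (cp >> 18),
                          0x80 | ((cp >> 12) & 0x3F),
                          0x80 | ((cp >> 6) & 0x3F),
                          0x80 | (cp & 0x3F)])
    return bytes(out)
```

For example, `utf16_to_utf8([0x68, 0xE9, 0xD83D, 0xDE00])` produces the same bytes as `"h\u00e9\U0001F600".encode('utf-8')`.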

Gordon Smith (Mar 14 2024 at 16:01):

@Dan Gohman That's a fair point - I am trying to create a C++ ABI implementation and didn't want to have a dependency on ICU!
@Lann Martin I wasn't suggesting removing the Unicode support, just relocating the "encode" function to be a part of CallContext.opts

Till Schneidereit (Mar 14 2024 at 21:02):

besides the "you'd have to be able to convert to any other representation" point Joel made, another thing that's different to realloc is that unicode handling is part of the guarantees the component model gives: if you receive a string, you're guaranteed that it's well-formed. If the conversion happened in-content, it couldn't be combined with a validation pass, so it'd be strictly more expensive

Gordon Smith (Mar 13 2024 at 18:42):