Stream: git-wasmtime

Topic: wasmtime / issue #8956 The implementation of `fd_filestat...


view this post on Zulip Wasmtime GitHub notifications bot (Jul 14 2024 at 20:24):

whitequark opened issue #8956:

The implementation uses the value of metadataHash as st_ino:

https://github.com/bytecodealliance/wasmtime/blob/225d20efe387bc3f4052ec3115125380f9c59774/crates/wasi-preview1-component-adapter/src/lib.rs#L950-L951

This is unsound. To quote the documentation for metadataHash:

Return a hash of the metadata associated with a filesystem object referred to by a descriptor.

This returns a hash of the last-modification timestamp and file size, and may also include the inode number, device number, birth timestamp, and other metadata fields that may change when the file is modified or replaced. It may also include a secret value chosen by the implementation and not otherwise exposed.

Implementations are encourated to provide the following properties:

If the file is not modified or replaced, the computed hash value should usually not change.
If the object is modified or replaced, the computed hash value should usually change.
The inputs to the hash should not be easily computable from the computed hash.
However, none of these is required.

Applications will commonly use st1.st_ino == st2.st_ino in order to determine if two files are the same (for example, Clang does it). My implementation of metadataHash, which always returns 0 and is compliant with the definition above, caused the combination of Clang and the wasip1 component adapter, to always treat all read files as the same file (which in practice meant the first #include caused an infinite loop).

view this post on Zulip Wasmtime GitHub notifications bot (Jul 14 2024 at 20:24):

whitequark edited issue #8956:

The implementation uses the value of metadataHash as st_ino:

https://github.com/bytecodealliance/wasmtime/blob/225d20efe387bc3f4052ec3115125380f9c59774/crates/wasi-preview1-component-adapter/src/lib.rs#L950-L951

This is unsound. To quote the documentation for metadataHash:

Return a hash of the metadata associated with a filesystem object referred to by a descriptor.

This returns a hash of the last-modification timestamp and file size, and may also include the inode number, device number, birth timestamp, and other metadata fields that may change when the file is modified or replaced. It may also include a secret value chosen by the implementation and not otherwise exposed.

Implementations are encourated to provide the following properties:

However, none of these is required.

Applications will commonly use st1.st_ino == st2.st_ino in order to determine if two files are the same (for example, Clang does it). My implementation of metadataHash, which always returns 0 and is compliant with the definition above, caused the combination of Clang and the wasip1 component adapter, to always treat all read files as the same file (which in practice meant the first #include caused an infinite loop).

view this post on Zulip Wasmtime GitHub notifications bot (Jul 14 2024 at 20:28):

whitequark commented on issue #8956:

It's unclear to me how to provide the desired semantics (st1.st_ino == st2.st_ino iff isSameObject(f1, f2)), or, in fact, any reasonable semantics that doesn't make private assumptions about the exact value of metadataHash (i.e. any sound semantics). It's pretty late though so I may be missing something obvious.

view this post on Zulip Wasmtime GitHub notifications bot (Jul 14 2024 at 20:39):

whitequark commented on issue #8956:

On reflection, I think the intended behavior here is to maintain a mapping from metadataHash to "potential files" (i.e. both files corresponding to file descriptors, and files that could be opened from a directory descriptor) that have previously been stat'd or opened, and to assign each "potential file" a fresh inode value.

view this post on Zulip Wasmtime GitHub notifications bot (Jul 14 2024 at 20:46):

whitequark edited a comment on issue #8956:

On reflection, I think the intended behavior here is to maintain a mapping from metadataHash to "potential files" (i.e. both files corresponding to file descriptors, and files that could be opened from a directory descriptor) that have previously been stat'd or opened, and to assign each "potential file" a fresh inode value.

This would still be problematic, in a subtler way: nothing in the documentation for metadataHash implies that always returning 0 may lead to undesirable performance implications down the line. Normally, we treat quadratic behavior arising in hash tables as a security issue in cases like this! So clearly either this implementation also doesn't work, or the contract for metadataHash must be amended.

view this post on Zulip Wasmtime GitHub notifications bot (Jul 14 2024 at 21:12):

whitequark commented on issue #8956:

On more reflection, since metadataHash changes if the file is modified, it can't be used to index into such a table, so we're back to fd_filestat_get can't be sound.

How was this supposed to be implemented?

view this post on Zulip Wasmtime GitHub notifications bot (Jul 15 2024 at 16:07):

alexcrichton commented on issue #8956:

I can link https://github.com/WebAssembly/wasi-filesystem/pull/81 as more possible background on this, but I don't have any extra info beyond that.

view this post on Zulip Wasmtime GitHub notifications bot (Jul 15 2024 at 17:44):

pchickey commented on issue #8956:

Like Alex, the only understanding I have of this topic is from PR 81 on the spec. It is quite possible that both the spec and our implementation are questionable, and that you are the first person to encounter problems with it in the wild. I don't have the expertise or time to devote to really digging into what the best path forward here is right now. Dan is out on leave right now, so I'm standing in as a wasi-filesystem champion in his absence, and I will defer to your judgement for changes to the spec and/or implementation.

view this post on Zulip Wasmtime GitHub notifications bot (Jul 15 2024 at 17:51):

whitequark commented on issue #8956:

Thank you. I've been feeling not the best lately, and this has directly impacted my ability to reason on really intricate topics like this one. I'll keep this issue on my radar and try to work out a way forward that works for everyone involved. I do not yet have any firm conclusions and I'll need a deeper understanding of the topic and more consultation with others before I'll have them.

Something that I noticed seems to have been an omission is the question of polyfilling the APIs. In one direction it's straightforward: wasip2 can be done on top of wasip1 easily. But it seems like nobody has drilled in detail into how to emulate wasip1 on top of wasip2, and given the state of the ecosystem (LLVM only directly knowing how to emit wasip1 binaries with wasip2 requiring a post-processing step) I think that's actually the more important direction?

@pchickey Do you think there is possibility or appetite for a sort of "sidecar" specification that builds on top of wasip2 in order to provide a sound polyfill for fd_filestat_get where if it's not available then st_dev/st_ino end up having zeroes in them? Or should I not consider that as a possible direction?

view this post on Zulip Wasmtime GitHub notifications bot (Jul 15 2024 at 20:40):

alexcrichton commented on issue #8956:

Oh that's actually the most important direction for us, too, implementing p1 on top of p2. The crate you linked in the OP is the current canonical source for implementing p1 on top of p2 within wasm. There's a host-side implementation at https://github.com/bytecodealliance/wasmtime/blob/main/crates/wasi/src/preview1.rs as well where the p1 APIs are all defined primarily in terms of their p2 counterparts.

Modifying the adapter (wasi_snapshot_preview1.wasm) is quite difficult, however, as it's written in "Rust" but that's in quotes because you can't do a lot you can do in normal Rust (e.g. use a hash map). Allocations are very carefully managed and such, so doing something fancy with std_{dev,ino} when metadata_hash is zero is probably going to be difficult.

view this post on Zulip Wasmtime GitHub notifications bot (Jul 15 2024 at 20:46):

pchickey commented on issue #8956:

Agreed with Alex: implementing p1 on top of p2 is an important direction. There are some places we are choosing to do that with low fidelity (the whole rights system in p1 is a mess and we want to leave it behind as much as possible, but it ends up mattering for corner cases where wasi-libc ended up depending on it) but I think this case matters a whole lot more.

As for the incompleteness of the wasip2 support in wasi-libc, so that LLVM could emit a purely wasip2 binary, thats something thats "just" gated on someone who can work on wasi-libc having the time to do so, which is why it has stalled the last few months. We do consider that important to get done at some point, Alex has been filling in some related missing pieces recently, mostly focusing on the parts the Rust toolchain will need.

view this post on Zulip Wasmtime GitHub notifications bot (Jul 15 2024 at 23:44):

whitequark commented on issue #8956:

Allocations are very carefully managed and such, so doing something fancy with std_{dev,ino} when metadata_hash is zero is probably going to be difficult.

In this case, given how p1 and p2 are defined, there must be a spec change to maintain soundness.

view this post on Zulip Wasmtime GitHub notifications bot (Jul 15 2024 at 23:45):

whitequark edited a comment on issue #8956:

Allocations are very carefully managed and such, so doing something fancy with std_{dev,ino} when metadata_hash is zero is probably going to be difficult.

In this case, given how p1 and p2 are defined, there must be a spec change to maintain soundness. For example there could be an additional wit interface that returns an arbitrary 128-bit unique identifier for a descriptor or a descriptor based path that is then stuffed into st_dev/st_ino.

view this post on Zulip Wasmtime GitHub notifications bot (Jul 16 2024 at 12:06):

whitequark commented on issue #8956:

I think a backstop solution that prevents the most egregious misbehavior (e.g. Clang considering every input file as the same) is to use a monotonically increasing global counter for filling in st_dev/st_ino. I can't think of any application where this would fail as badly, though of course this is still not a reasonable behavior for stat to exhibit. It's possible though that the reason I can't think of them is because my experience is limited still.

view this post on Zulip Wasmtime GitHub notifications bot (Jul 16 2024 at 12:06):

whitequark edited a comment on issue #8956:

I think a backstop solution that prevents the most egregious consequences (e.g. Clang considering every input file as the same) is to use a monotonically increasing global counter for filling in st_dev/st_ino. I can't think of any application where this would fail as badly, though of course this is still not a reasonable behavior for stat to exhibit. It's possible though that the reason I can't think of them is because my experience is limited still.

view this post on Zulip Wasmtime GitHub notifications bot (Jul 16 2024 at 14:39):

alexcrichton commented on issue #8956:

Could you perhaps hash the filename in the meantime? And pretend that each filename is its own unique dev/ino?

view this post on Zulip Wasmtime GitHub notifications bot (Jul 16 2024 at 14:44):

whitequark commented on issue #8956:

What about fstat?

view this post on Zulip Wasmtime GitHub notifications bot (Jul 16 2024 at 14:45):

whitequark edited a comment on issue #8956:

What about fstat?

I guess st_dev would be the fd and st_ino would be the hash of a filename, if any, opened off that fd.

view this post on Zulip Wasmtime GitHub notifications bot (Jul 16 2024 at 14:47):

alexcrichton commented on issue #8956:

Ah sorry I'm assuming that the structure you're maintaining on the host corresponding to the file would contain the hashed filename or the full filename or something like that. If all you have is the fd and nothing else I don't know what would be done.

view this post on Zulip Wasmtime GitHub notifications bot (Jul 16 2024 at 14:52):

whitequark commented on issue #8956:

Oh there is a misunderstanding. What I was considering is the ability to solve this issue in the preview1 component adapter for an arbitrary host. In my particular host I have basically added fake but consistent inodes to everything and my code works fine now; it's not a blocker for me personally, I'm raising this because it is an ecosystem health issue.

view this post on Zulip Wasmtime GitHub notifications bot (Jul 16 2024 at 14:52):

whitequark edited a comment on issue #8956:

Oh there is a misunderstanding. What I was considering is the ability to solve this issue in the preview1 component adapter for an arbitrary host. In my particular host I have basically added fake but consistent inodes to everything, I return them as the metadata hash lower (in accordance with what the adapter currently does) and my code works fine now; it's not a blocker for me personally, I'm raising this because it is an ecosystem health issue.

view this post on Zulip Wasmtime GitHub notifications bot (Jul 16 2024 at 14:57):

alexcrichton commented on issue #8956:

Ah makes sense. I would personally come to the same conclusion as you then that there's no way to actually "fix" things without changing the standard and/or docs there. Given the weak guarantees of metadata-hash and the strong guarantees of st_{dev,ino} I don't know how the adapter can work.

One possibility would be to add something which indicates whether metadata-hash is "strong", e.g. is suitable as-is for st_{dev,ino}. In such a situation the adapter could set dev/ino based on metadata-hash. If the "strong" bit is false then the hash of the filename plus whatever metadata-hash is could perhaps be used for dev/ino. That's just a guess though and is still sort of situation solving this. In any case this is probably best left for discussion on wasi-filesystem itself at this point.

view this post on Zulip Wasmtime GitHub notifications bot (Jul 16 2024 at 15:00):

whitequark commented on issue #8956:

I agree, which one of us should kick off that discussion?

view this post on Zulip Wasmtime GitHub notifications bot (Jul 16 2024 at 17:40):

alexcrichton commented on issue #8956:

If you wouldn't mind I'd defer to you as I fear I wouldn't have enough context on this

view this post on Zulip Wasmtime GitHub notifications bot (Jul 16 2024 at 18:01):

cfallin commented on issue #8956:

Potentially bad idea, but I'm curious if it spawns other thinking: what if the inode were synthesized from an actual hash of file contents?

Basically, imagine the filesystem like a content-addressed store (e.g. git). The inode is then just the hash that addresses some content.

This certainly satisfies the "check the inode and timestamp first to see if it's the same file contents" use-case; it's no worse than what one would have to do manually to verify same file contents in the absence of an inode/timestamp, i.e., actually read the data.

It's also conceptually "minimal" in the sense that it doesn't require any properties or guarantees from the p2 API.

The main downside is that it's slow: it does not satisfy what the user expects to be an O(1)-ish metadata check. This is especially an issue since the inode number must be provided as the result of a stat operation (i.e., is not requested separately). Maybe that alone is enough to throw out this idea. But maybe not?

view this post on Zulip Wasmtime GitHub notifications bot (Jul 16 2024 at 18:03):

bjorn3 commented on issue #8956:

what if the inode were synthesized from an actual hash of file contents?

That could confuse tar into thinking two files are hardlinks to each other and thus turn them into actual hard links when unpacking.

view this post on Zulip Wasmtime GitHub notifications bot (Jul 16 2024 at 20:16):

whitequark commented on issue #8956:

See https://github.com/WebAssembly/wasi-filesystem/issues/153.


Last updated: Nov 22 2024 at 17:03 UTC