whitequark opened issue #8956:
The implementation uses the value of
metadataHash
asst_ino
:This is unsound. To quote the documentation for
metadataHash
:Return a hash of the metadata associated with a filesystem object referred to by a descriptor.
This returns a hash of the last-modification timestamp and file size, and may also include the inode number, device number, birth timestamp, and other metadata fields that may change when the file is modified or replaced. It may also include a secret value chosen by the implementation and not otherwise exposed.
Implementations are encourated to provide the following properties:
If the file is not modified or replaced, the computed hash value should usually not change.
If the object is modified or replaced, the computed hash value should usually change.
The inputs to the hash should not be easily computable from the computed hash.
However, none of these is required.Applications will commonly use
st1.st_ino == st2.st_ino
in order to determine if two files are the same (for example, Clang does it). My implementation ofmetadataHash
, which always returns 0 and is compliant with the definition above, caused the combination of Clang and the wasip1 component adapter, to always treat all read files as the same file (which in practice meant the first#include
caused an infinite loop).
whitequark edited issue #8956:
The implementation uses the value of
metadataHash
asst_ino
:This is unsound. To quote the documentation for
metadataHash
:Return a hash of the metadata associated with a filesystem object referred to by a descriptor.
This returns a hash of the last-modification timestamp and file size, and may also include the inode number, device number, birth timestamp, and other metadata fields that may change when the file is modified or replaced. It may also include a secret value chosen by the implementation and not otherwise exposed.
Implementations are encourated to provide the following properties:
- If the file is not modified or replaced, the computed hash value should usually not change.
- If the object is modified or replaced, the computed hash value should usually change.
- The inputs to the hash should not be easily computable from the computed hash.
However, none of these is required.
Applications will commonly use
st1.st_ino == st2.st_ino
in order to determine if two files are the same (for example, Clang does it). My implementation ofmetadataHash
, which always returns 0 and is compliant with the definition above, caused the combination of Clang and the wasip1 component adapter, to always treat all read files as the same file (which in practice meant the first#include
caused an infinite loop).
whitequark commented on issue #8956:
It's unclear to me how to provide the desired semantics (
st1.st_ino == st2.st_ino
iffisSameObject(f1, f2)
), or, in fact, any reasonable semantics that doesn't make private assumptions about the exact value ofmetadataHash
(i.e. any sound semantics). It's pretty late though so I may be missing something obvious.
whitequark commented on issue #8956:
On reflection, I think the intended behavior here is to maintain a mapping from
metadataHash
to "potential files" (i.e. both files corresponding to file descriptors, and files that could be opened from a directory descriptor) that have previously been stat'd or opened, and to assign each "potential file" a fresh inode value.
whitequark edited a comment on issue #8956:
On reflection, I think the intended behavior here is to maintain a mapping from
metadataHash
to "potential files" (i.e. both files corresponding to file descriptors, and files that could be opened from a directory descriptor) that have previously been stat'd or opened, and to assign each "potential file" a fresh inode value.This would still be problematic, in a subtler way: nothing in the documentation for
metadataHash
implies that always returning 0 may lead to undesirable performance implications down the line. Normally, we treat quadratic behavior arising in hash tables as a security issue in cases like this! So clearly either this implementation also doesn't work, or the contract formetadataHash
must be amended.
whitequark commented on issue #8956:
On more reflection, since
metadataHash
changes if the file is modified, it can't be used to index into such a table, so we're back tofd_filestat_get
can't be sound.How was this supposed to be implemented?
alexcrichton commented on issue #8956:
I can link https://github.com/WebAssembly/wasi-filesystem/pull/81 as more possible background on this, but I don't have any extra info beyond that.
pchickey commented on issue #8956:
Like Alex, the only understanding I have of this topic is from PR 81 on the spec. It is quite possible that both the spec and our implementation are questionable, and that you are the first person to encounter problems with it in the wild. I don't have the expertise or time to devote to really digging into what the best path forward here is right now. Dan is out on leave right now, so I'm standing in as a wasi-filesystem champion in his absence, and I will defer to your judgement for changes to the spec and/or implementation.
whitequark commented on issue #8956:
Thank you. I've been feeling not the best lately, and this has directly impacted my ability to reason on really intricate topics like this one. I'll keep this issue on my radar and try to work out a way forward that works for everyone involved. I do not yet have any firm conclusions and I'll need a deeper understanding of the topic and more consultation with others before I'll have them.
Something that I noticed seems to have been an omission is the question of polyfilling the APIs. In one direction it's straightforward: wasip2 can be done on top of wasip1 easily. But it seems like nobody has drilled in detail into how to emulate wasip1 on top of wasip2, and given the state of the ecosystem (LLVM only directly knowing how to emit wasip1 binaries with wasip2 requiring a post-processing step) I think that's actually the more important direction?
@pchickey Do you think there is possibility or appetite for a sort of "sidecar" specification that builds on top of wasip2 in order to provide a sound polyfill for
fd_filestat_get
where if it's not available thenst_dev
/st_ino
end up having zeroes in them? Or should I not consider that as a possible direction?
alexcrichton commented on issue #8956:
Oh that's actually the most important direction for us, too, implementing p1 on top of p2. The crate you linked in the OP is the current canonical source for implementing p1 on top of p2 within wasm. There's a host-side implementation at https://github.com/bytecodealliance/wasmtime/blob/main/crates/wasi/src/preview1.rs as well where the p1 APIs are all defined primarily in terms of their p2 counterparts.
Modifying the adapter (
wasi_snapshot_preview1.wasm
) is quite difficult, however, as it's written in "Rust" but that's in quotes because you can't do a lot you can do in normal Rust (e.g. use a hash map). Allocations are very carefully managed and such, so doing something fancy withstd_{dev,ino}
whenmetadata_hash
is zero is probably going to be difficult.
pchickey commented on issue #8956:
Agreed with Alex: implementing p1 on top of p2 is an important direction. There are some places we are choosing to do that with low fidelity (the whole rights system in p1 is a mess and we want to leave it behind as much as possible, but it ends up mattering for corner cases where wasi-libc ended up depending on it) but I think this case matters a whole lot more.
As for the incompleteness of the wasip2 support in wasi-libc, so that LLVM could emit a purely wasip2 binary, thats something thats "just" gated on someone who can work on wasi-libc having the time to do so, which is why it has stalled the last few months. We do consider that important to get done at some point, Alex has been filling in some related missing pieces recently, mostly focusing on the parts the Rust toolchain will need.
whitequark commented on issue #8956:
Allocations are very carefully managed and such, so doing something fancy with
std_{dev,ino}
whenmetadata_hash
is zero is probably going to be difficult.In this case, given how p1 and p2 are defined, there must be a spec change to maintain soundness.
whitequark edited a comment on issue #8956:
Allocations are very carefully managed and such, so doing something fancy with
std_{dev,ino}
whenmetadata_hash
is zero is probably going to be difficult.In this case, given how p1 and p2 are defined, there must be a spec change to maintain soundness. For example there could be an additional wit interface that returns an arbitrary 128-bit unique identifier for a descriptor or a descriptor based path that is then stuffed into
st_dev
/st_ino
.
whitequark commented on issue #8956:
I think a backstop solution that prevents the most egregious misbehavior (e.g. Clang considering every input file as the same) is to use a monotonically increasing global counter for filling in
st_dev
/st_ino
. I can't think of any application where this would fail as badly, though of course this is still not a reasonable behavior forstat
to exhibit. It's possible though that the reason I can't think of them is because my experience is limited still.
whitequark edited a comment on issue #8956:
I think a backstop solution that prevents the most egregious consequences (e.g. Clang considering every input file as the same) is to use a monotonically increasing global counter for filling in
st_dev
/st_ino
. I can't think of any application where this would fail as badly, though of course this is still not a reasonable behavior forstat
to exhibit. It's possible though that the reason I can't think of them is because my experience is limited still.
alexcrichton commented on issue #8956:
Could you perhaps hash the filename in the meantime? And pretend that each filename is its own unique dev/ino?
whitequark commented on issue #8956:
What about
fstat
?
whitequark edited a comment on issue #8956:
What about
fstat
?I guess
st_dev
would be the fd andst_ino
would be the hash of a filename, if any, opened off that fd.
alexcrichton commented on issue #8956:
Ah sorry I'm assuming that the structure you're maintaining on the host corresponding to the file would contain the hashed filename or the full filename or something like that. If all you have is the fd and nothing else I don't know what would be done.
whitequark commented on issue #8956:
Oh there is a misunderstanding. What I was considering is the ability to solve this issue in the preview1 component adapter for an arbitrary host. In my particular host I have basically added fake but consistent inodes to everything and my code works fine now; it's not a blocker for me personally, I'm raising this because it is an ecosystem health issue.
whitequark edited a comment on issue #8956:
Oh there is a misunderstanding. What I was considering is the ability to solve this issue in the preview1 component adapter for an arbitrary host. In my particular host I have basically added fake but consistent inodes to everything, I return them as the metadata hash lower (in accordance with what the adapter currently does) and my code works fine now; it's not a blocker for me personally, I'm raising this because it is an ecosystem health issue.
alexcrichton commented on issue #8956:
Ah makes sense. I would personally come to the same conclusion as you then that there's no way to actually "fix" things without changing the standard and/or docs there. Given the weak guarantees of
metadata-hash
and the strong guarantees ofst_{dev,ino}
I don't know how the adapter can work.One possibility would be to add something which indicates whether
metadata-hash
is "strong", e.g. is suitable as-is forst_{dev,ino}
. In such a situation the adapter could set dev/ino based onmetadata-hash
. If the "strong" bit is false then the hash of the filename plus whatever metadata-hash is could perhaps be used for dev/ino. That's just a guess though and is still sort of situation solving this. In any case this is probably best left for discussion on wasi-filesystem itself at this point.
whitequark commented on issue #8956:
I agree, which one of us should kick off that discussion?
alexcrichton commented on issue #8956:
If you wouldn't mind I'd defer to you as I fear I wouldn't have enough context on this
cfallin commented on issue #8956:
Potentially bad idea, but I'm curious if it spawns other thinking: what if the inode were synthesized from an actual hash of file contents?
Basically, imagine the filesystem like a content-addressed store (e.g. git). The inode is then just the hash that addresses some content.
This certainly satisfies the "check the inode and timestamp first to see if it's the same file contents" use-case; it's no worse than what one would have to do manually to verify same file contents in the absence of an inode/timestamp, i.e., actually read the data.
It's also conceptually "minimal" in the sense that it doesn't require any properties or guarantees from the p2 API.
The main downside is that it's slow: it does not satisfy what the user expects to be an O(1)-ish metadata check. This is especially an issue since the inode number must be provided as the result of a stat operation (i.e., is not requested separately). Maybe that alone is enough to throw out this idea. But maybe not?
bjorn3 commented on issue #8956:
what if the inode were synthesized from an actual hash of file contents?
That could confuse tar into thinking two files are hardlinks to each other and thus turn them into actual hard links when unpacking.
whitequark commented on issue #8956:
See https://github.com/WebAssembly/wasi-filesystem/issues/153.
Last updated: Dec 23 2024 at 12:05 UTC