vigoo opened issue #11701:
I would like to report a very weird issue that I ran into while debugging this through Golem (which uses wasmtime under the hood). Through the investigation I realized that the issue can be reproduced purely with wasmtime, even with the latest published version.
However, the reproducer is a bit fragile:
- The original issue I was debugging: for a particular directory structure created within a Rust guest, calling `std::fs::remove_dir_all` on it ended up in an infinite loop, where the Rust standard library seems to be continuously creating a read-dir iterator and calling `unlink_at` and `stat_at` on the same files.
- Slightly modifying the code, though (even when I just copy-pasted it into another example component and removed some unused functions!), turns it into the `remove_dir_all` call failing with `Directory not empty (os error 55)`, which is also unexpected, but different.
- Even compiling to debug vs release seems to affect which of the above two outcomes happens.
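For context, a condensed sketch of the shape of the guest code, inferred from the printed output under Steps to Reproduce below (the attached crate is the actual reproducer; given how fragile the repro is, this minimal form may well not trigger the bug on its own):

```rust
use std::fs;

// Hypothetical condensed reproducer: build the small nested tree shown
// in the output below, then ask libstd to delete it recursively.
pub fn reproducer() -> Result<(), String> {
    fs::create_dir_all("/tmp/py/modules/0/mytest/__pycache__").map_err(|e| e.to_string())?;
    for f in [
        "/tmp/py/modules/0/mytest/__init__.py",
        "/tmp/py/modules/0/mytest/__pycache__/mymodule.rustpython-01.pyc",
        "/tmp/py/modules/0/mytest/__pycache__/__init__.rustpython-01.pyc",
        "/tmp/py/modules/0/mytest/__pycache__/mymodule.py",
    ] {
        fs::write(f, b"").map_err(|e| e.to_string())?;
    }
    // On the affected setups this either fails with "Directory not empty
    // (os error 55)" (debug) or never returns (release).
    fs::remove_dir_all("/tmp/py/modules/0").map_err(|e| e.to_string())
}
```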
Test Case
I'm attaching a cargo-component crate that reproduces both of the above cases for me with rustc 1.89 and cargo-component 0.21.1.
Steps to Reproduce
Reproducing the "error 55" case with debug build:
- compile to debug: `cargo component build`
- create a temp directory on the host: `mkdir tmp`
- run with `wasmtime --invoke 'reproducer()' --dir 'tmp::/' target/wasm32-wasip1/debug/file_service.wasm`

Output:
```
Trying to create directory /tmp/py/modules/0/mytest/__pycache__
Finished creating directory /tmp/py/modules/0/mytest/__pycache__
Ok(())
Creating files
Ok(())
Ok(())
Ok(())
Ok(())
Removing all
print_tree "/tmp/py/modules/0"
  mytest
print_tree "/tmp/py/modules/0/mytest"
  __init__.py
  __pycache__
print_tree "/tmp/py/modules/0/mytest/__pycache__"
  mymodule.rustpython-01.pyc
  __init__.rustpython-01.pyc
  mymodule.py
Err("Directory not empty (os error 55)")
()
```

Reproducing the infinite loop with a release build:
- compile to release: `cargo component build --release`
- create a temp directory on the host: `mkdir tmp`
- run with `wasmtime --invoke 'reproducer()' --dir 'tmp::/' target/wasm32-wasip1/release/file_service.wasm`

Output:
```
Trying to create directory /tmp/py/modules/0/mytest/__pycache__
Finished creating directory /tmp/py/modules/0/mytest/__pycache__
Ok(())
Creating files
Ok(())
Ok(())
Ok(())
Ok(())
Removing all
print_tree "/tmp/py/modules/0"
  mytest
print_tree "/tmp/py/modules/0/mytest"
  __init__.py
  __pycache__
print_tree "/tmp/py/modules/0/mytest/__pycache__"
  mymodule.rustpython-01.pyc
  __init__.rustpython-01.pyc
  mymodule.py
```

and it hangs here.
Note that even removing things like prints from the code can make it fail rather than hang, so I'm not sure how stable this reproducer is on other machines.
Also, the attached code contains many other functions which were originally used in different tests - I left them in because removing them made the "hanging case" irreproducible for me. I can attach the actual two WASMs if it helps.
Expected Results
The directory structure is deleted and the guest returns without error.
Versions and Environment
Wasmtime version: tried with 33.0.0 (what we use internally) and the latest published (36.0.2)
Operating system: Darwin Kernel Version 24.6.0
Architecture: arm64
vigoo added the bug label to Issue #11701.
vigoo commented on issue #11701:
file-server-debug.wasm.zip
file-server-release.wasm.zip

The WASMs for the above two cases, reproducing the two different bad behaviors for me.
alexcrichton commented on issue #11701:
Ok, this is kind of a wild bug. My understanding at this point is that the true bug lies here for the wasip1 target and here for the wasip2 target. I cannot yet explain the difference between `--debug` and `--release`, nor can I explain why this appears to be platform-specific. Some various learnings otherwise:
- Using your source, or prebuilt modules, I cannot reproduce this on Linux. I can, however, reproduce both with the source and with the modules on macOS (debug errors, release loops).
- Using this program it passes on native but fails on the wasip1/wasip2 targets.
- This, however, I can explain. The lines linked above in Wasmtime are to "blame", of sorts, for this. Effectively what's happening is that every time `fd_readdir` is called it actually tries to read the whole directory, e.g. we don't keep the stream open or anything like that. When this is coupled with `d_next`, or the "cookie", what happens is: (a) a directory is read and the last entry doesn't fit in the input buffer, (b) the guest deletes some files, (c) the guest requests more files from the directory with the previous `d_next`, and (d) wasmtime re-reads the directory, this time with fewer files, and skips everything because that's what the "cookie" says. Effectively the "cookie" is only valid for a single snapshot of the directory in time and does not take filesystem modification into account (see the simulation sketch after this list).
- Somehow the original components are affected by this `fd_readdir` behavior. I've patched the Rust standard library to insert a `collect::<Vec<_>>()` on this line, and when I compile the sources shared here with that compiler I can't reproduce the bug on macOS. (I could reproduce with a pre-patch libstd, however.)
- I feel like I can explain the "debug mode returns an error" in the original example now: it's because wasmtime re-reads the whole directory and skips an entry, meaning that the directory truly isn't empty when it's removed.
- For the infinite loop behavior my guess is that there's a bug in the WASIp1-to-WASIp2 adapter. Unfortunately that's notoriously hard to debug so I'm still staring at code.
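To make the (a)-(d) failure mode above concrete, here is a small self-contained simulation (plain Rust, not wasmtime code) of a host that re-reads the directory on every call and treats the cookie as a plain index:

```rust
// Simulates a host whose fd_readdir re-reads the directory on every call
// and interprets the cookie as "number of entries already returned".
fn readdir_snapshot(dir: &[&str], cookie: usize, buf_entries: usize) -> (Vec<String>, usize) {
    let batch: Vec<String> = dir
        .iter()
        .skip(cookie)      // "resume" by skipping `cookie` entries
        .take(buf_entries) // the buffer only fits this many entries
        .map(|s| s.to_string())
        .collect();
    let next_cookie = cookie + batch.len();
    (batch, next_cookie)
}

fn main() {
    let mut dir = vec!["a.py", "b.py", "c.py", "d.py"];

    // (a) first call: the buffer only fits two entries.
    let (batch, cookie) = readdir_snapshot(&dir, 0, 2);
    println!("saw {batch:?}"); // ["a.py", "b.py"]

    // (b) the guest deletes the files it has seen so far.
    dir.retain(|f| !batch.contains(&f.to_string()));

    // (c) the guest resumes with the old cookie, and (d) the host
    // re-reads the now-smaller directory and skips `cookie` entries,
    // so "c.py" and "d.py" are never returned -- and never deleted.
    let (batch, _) = readdir_snapshot(&dir, cookie, 2);
    println!("saw {batch:?}"); // [] -- looks empty, but two files remain
}
```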
In the meantime though, what to do about this? Unfortunately I think we're in a bit of a problematic situation. I'm not sure how to do some sort of host-side change to fix this with the WASIp1 `fd_readdir` API, and that's unfortunately what's required to get this fixed here, since the Rust standard library goes through WASIp1 for reading directories, which is implemented through the WASIp1-to-WASIp2 adapter. This means we've got the two lines linked at the start of this to fix in Wasmtime (one in the native implementation, one in the adapter). The difficulty is that `fd_readdir`, as specified, would effectively require buffering the entire directory's contents within the WASIp1-to-WASIp2 adapter, which we basically can't do since dynamic allocations aren't possible there. Without this buffering behavior I don't believe we can implement what's necessary for `fd_readdir` here, which is to read the directory at most once, from a continuous stream of entries.

Well, that's at least as far as I've gotten so far on this.
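For reference, the guest-facing shape of the API in question, sketched in Rust (field names per the wasi_snapshot_preview1 spec; this is an illustration, not wasmtime's code):

```rust
// The WASIp1 `dirent` header as laid out in the guest's buffer
// (wasi_snapshot_preview1); the entry's name bytes follow immediately
// after this header.
#[repr(C)]
struct Dirent {
    d_next: u64,   // cookie to pass to fd_readdir to resume *after* this entry
    d_ino: u64,    // inode number of this entry
    d_namlen: u32, // length of the name that follows the header
    d_type: u8,    // filetype: regular file, directory, symlink, ...
}

// fd_readdir(fd, buf, buf_len, cookie) packs as many (header, name)
// pairs into `buf` as fit; a truncated final entry tells the guest to
// call again, passing the last complete entry's `d_next` as `cookie`.
// Nothing in the API ties that cookie to any host-side state, which is
// the crux of the problem described above.
```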
alexcrichton commented on issue #11701:
Ok, well, further staring found https://github.com/bytecodealliance/wasmtime/pull/11702, which is the cause of the release-mode-vs-debug-mode difference. With that I'm confident now that the only issue is the broken implementations of `fd_readdir` in this repo.
alexcrichton commented on issue #11701:
cc @vados-cosmonic and @sunfishcode - I'm curious, as co-champions of wasi-filesystem, to get your take on this. My question is about WASIp1, so if y'all would rather not care about it, feel free to ignore this. Specifically the `fd_readdir` function -- how should an implementation deal with the fact that between invocations of `fd_readdir` mutations to the directory might be made? I can personally think of two somewhat-viable paths forward:

- Declare the function broken in some "official" location and go update callers which might mutate to read the whole directory before mutating. For example, motivate a change to the Rust standard library to read the entire directory before mutating it by deleting files in `remove_dir_all` (see the sketch after this comment).
- Implementations would always read at least two entries from the underlying host directory. The `d_next` of the previous one points to the `d_ino` of the next entry. That way, if an entry is deleted, we can in theory still at least resume at the last file read in the directory. Well, ok, now as I type this out, this doesn't handle the case where both files are deleted.

Hm, ok, different question: as co-champions of wasi-filesystem how do y'all feel about declaring this API as dead-and-broken? I realize WASIp1 is sort of already in that state, but it would be useful to have this on-record somewhere, if only in an issue or something like that.
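The first path amounts to a snapshot-then-mutate discipline in callers. A minimal sketch of a `remove_dir_all` written that way (illustrative only; not the libstd implementation, and it glosses over the symlink handling the real one needs):

```rust
use std::fs;
use std::io;
use std::path::Path;

// Snapshot-then-mutate deletion: read each directory to completion
// before deleting anything in it, so no readdir stream is ever resumed
// across a modification.
fn remove_dir_all_snapshot(path: &Path) -> io::Result<()> {
    // Drain the whole directory listing up front.
    let entries = fs::read_dir(path)?.collect::<io::Result<Vec<_>>>()?;
    for entry in entries {
        if entry.file_type()?.is_dir() {
            remove_dir_all_snapshot(&entry.path())?;
        } else {
            fs::remove_file(entry.path())?;
        }
    }
    fs::remove_dir(path)
}
```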
alexcrichton added the wasi:impl label to Issue #11701.
alexcrichton commented on issue #11701:
@vigoo in the meantime, if you're interested in getting this fixed in the near-term, I think the quickest fix will be "don't use `std::fs::remove_dir_all`", if that's possible. If that's in the bowels of some other crate you're using, however, the next-quickest fix would be to propose a change to Rust's libstd, but that's a pretty big step down in terms of "quickest".
bjorn3 commented on issue #11701:
fd_readdircan be implemented directly in terms ofgetdentswhen not using the wasip1 to wasip2 shim, right? Thed_offfield forgetdentsis directly equivalent tod_nextforfd_readdir. Maybe wasip2 could add acookiefield populated by thed_offfield ofgetdentstodirectory-entryand then the wasip1 to wasip2 shim can use thiscookiefield to seek to the right entry indirectory-entry-streamrather than using an integer index within the directory asd_next?
bjorn3 edited a comment on issue #11701:
fd_readdircan be implemented directly in terms ofgetdentswhen not using the wasip1 to wasip2 shim, right? Thed_offfield forgetdentsis directly equivalent tod_nextforfd_readdir. Maybe wasip2 could add acookiefield populated by thed_offfield ofgetdentstodirectory-entryand then the wasip1 to wasip2 shim can use thiscookiefield to seek to the right entry indirectory-entry-streamrather than using an integer index within the directory asd_next? Or maybe the entry name could be hashed and used as cookie as temporary workaround. Would probably cause issue for hash collisions though.
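For concreteness, a Linux-only sketch of the `d_off` cookie being discussed here, using raw `getdents64` via the `libc` crate (illustrative; error handling and buffer-alignment niceties are elided):

```rust
use std::os::unix::io::AsRawFd;

// List a directory via getdents64 and print each entry's d_off: the
// kernel-provided cookie that lseek(fd, d_off, SEEK_SET) can later use
// to resume the directory stream, even as entries come and go.
fn main() -> std::io::Result<()> {
    let dir = std::fs::File::open(".")?;
    let fd = dir.as_raw_fd();
    let mut buf = [0u8; 4096];
    loop {
        let n = unsafe {
            libc::syscall(libc::SYS_getdents64, fd, buf.as_mut_ptr(), buf.len())
        };
        if n <= 0 {
            break; // 0 = end of directory; negative = error (elided here)
        }
        let mut pos = 0usize;
        while pos < n as usize {
            // linux_dirent64 layout: u64 d_ino, i64 d_off, u16 d_reclen,
            // u8 d_type, then a NUL-terminated d_name.
            let rec = &buf[pos..];
            let d_off = i64::from_ne_bytes(rec[8..16].try_into().unwrap());
            let reclen = u16::from_ne_bytes(rec[16..18].try_into().unwrap()) as usize;
            let name: Vec<u8> = rec[19..reclen].iter().copied().take_while(|&b| b != 0).collect();
            println!("d_off={d_off:20} {}", String::from_utf8_lossy(&name));
            pos += reclen;
        }
    }
    Ok(())
}
```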
vigoo commented on issue #11701:
> @vigoo in the meantime, if you're interested in getting this fixed in the near-term, I think the quickest fix will be "don't use `std::fs::remove_dir_all`", if that's possible. If that's in the bowels of some other crate you're using, however, the next-quickest fix would be to propose a change to Rust's libstd, but that's a pretty big step down in terms of "quickest"

Thanks for looking into it so quickly! I did that as a workaround already (not using `std::fs::remove_dir_all` in the piece of code that triggered my investigation), although of course I cannot guarantee our users will never use it. Most importantly, I just wanted to let you know about the issue.
alexcrichton commented on issue #11701:
@bjorn3 it looks like `getdents` is Linux-specific, which is already one major blocker, but another is that once you've read `d_off` there's no guarantee the file there isn't deleted. If it's deleted, for example, then the iterator would be truncated without visiting anything else, since the next seek wouldn't find anything.

Effectively the WASIp1 `fd_readdir` differs from `getdents` in that it's not stateful: you pass in a `cookie`, and at least my read of it is that you can arbitrarily seek around when reading a directory. With that lack of state, however, we can't maintain a single object on the host whose stream we're following. I don't know how we can take an arbitrary `cookie` and seek the actual stream in the face of modifications between calls to `fd_readdir`. WASIp{2,3} are much easier here since they return a stream/iterator and nothing else -- no seeking allowed.
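A signature-level sketch of that contrast (illustrative Rust; these are not the actual wasmtime or WIT definitions):

```rust
// Illustrative signatures only -- not the real wasmtime or WIT definitions.
struct Errno(u16);

// WASIp1: stateless per call. The guest supplies an arbitrary `cookie`,
// so the host must be able to reposition inside a directory it has not
// necessarily kept open since the previous call.
fn fd_readdir_p1(_fd: u32, _buf: &mut [u8], _cookie: u64) -> Result<usize, Errno> {
    unimplemented!("must materialize a directory position from `cookie` alone")
}

// WASIp2/p3 shape: the host hands back an opaque stream resource that
// can only be advanced, never seeked, so one host-side iterator suffices.
struct DirectoryEntryStream; // stand-in for the wasi-filesystem resource
fn read_directory_p2(_fd: u32) -> Result<DirectoryEntryStream, Errno> {
    unimplemented!()
}
```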
vados-cosmonic commented on issue #11701:
A bit late but with regards to this note:
Hm ok different question: as co-champions of wasi-filesystem how do y'all feel about declaring this API as dead-and-broken? I realize WASIp1 is sort of already in that state but it would be useful to have this on-record somewhere if only in an issue or something like that.
This certainly seems like the right first step -- at the very least, this is a big enough footgun that it should be documented somewhere.
Looking at the P1 interface, I can't see a way to solve this that others haven't mentioned already here. One thing I was thinking about: would it be possible to use some bits of the cookie to store some state?
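One hypothetical shape for that idea, purely as a sketch (nothing in the p1 spec defines such an encoding): reserve some cookie bits for a directory "generation" stamp, so a stale cookie can at least be detected rather than silently honored.

```rust
// Pack a 16-bit generation stamp and a 48-bit position into one cookie.
// If the host bumps the generation whenever it observes the directory
// changed, a resumed call can detect a stale cookie instead of silently
// skipping entries.
fn pack_cookie(generation: u16, pos: u64) -> u64 {
    assert!(pos < 1 << 48, "position must fit in 48 bits");
    (u64::from(generation) << 48) | pos
}

fn unpack_cookie(cookie: u64) -> (u16, u64) {
    ((cookie >> 48) as u16, cookie & ((1u64 << 48) - 1))
}

fn main() {
    let c = pack_cookie(3, 42);
    assert_eq!(unpack_cookie(c), (3, 42));
    println!("cookie = {c:#x}");
}
```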