vigoo opened issue #11701:
I would like to report a very weird issue that I ran into while debugging this through Golem (which uses wasmtime under the hood). Through the investigation I realized that the issue can be reproduced purely with wasmtime, even with the latest published version.
However, the reproducer is a bit fragile:
- The original issue I was debugging: for a particular directory structure created within a Rust guest, calling `std::fs::remove_dir_all` on it ended up in an infinite loop, where the Rust standard library seems to be continuously creating a read-dir iterator and calling `unlink_at` and `stat_at` on the same files.
- Slightly modifying the code, though (even when I just copy-pasted it into another example component and removed some unused functions!), turns it into the `remove_dir_all` call failing with `Directory not empty (os error 55)`, which is also unexpected, but different.
- Even compiling to debug vs release seems to affect which of the above two outcomes happens.
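For context, a condensed sketch of the shape of the guest code, inferred from the printed output under Steps to Reproduce below (the attached crate is the actual reproducer; given how fragile the repro is, this minimal form may well not trigger the bug on its own):

```rust
use std::fs;

// Hypothetical condensed reproducer: build the small nested tree shown
// in the output below, then ask libstd to delete it recursively.
pub fn reproducer() -> Result<(), String> {
    fs::create_dir_all("/tmp/py/modules/0/mytest/__pycache__").map_err(|e| e.to_string())?;
    for f in [
        "/tmp/py/modules/0/mytest/__init__.py",
        "/tmp/py/modules/0/mytest/__pycache__/mymodule.rustpython-01.pyc",
        "/tmp/py/modules/0/mytest/__pycache__/__init__.rustpython-01.pyc",
        "/tmp/py/modules/0/mytest/__pycache__/mymodule.py",
    ] {
        fs::write(f, b"").map_err(|e| e.to_string())?;
    }
    // On the affected setups this either fails with "Directory not empty
    // (os error 55)" (debug) or never returns (release).
    fs::remove_dir_all("/tmp/py/modules/0").map_err(|e| e.to_string())
}
```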
Test Case
I'm attaching a cargo-component crate that reproduces both of the above cases for me with rustc 1.89 and cargo-component 0.21.1.
Steps to Reproduce
Reproducing the "error 55" case with debug build:
- compile to debug: `cargo component build`
- create a temp directory on the host: `mkdir tmp`
- run with `wasmtime --invoke 'reproducer()' --dir 'tmp::/' target/wasm32-wasip1/debug/file_service.wasm`

Output:
```
Trying to create directory /tmp/py/modules/0/mytest/__pycache__
Finished creating directory /tmp/py/modules/0/mytest/__pycache__
Ok(())
Creating files
Ok(())
Ok(())
Ok(())
Ok(())
Removing all
print_tree "/tmp/py/modules/0"
  mytest
print_tree "/tmp/py/modules/0/mytest"
  __init__.py
  __pycache__
print_tree "/tmp/py/modules/0/mytest/__pycache__"
  mymodule.rustpython-01.pyc
  __init__.rustpython-01.pyc
  mymodule.py
Err("Directory not empty (os error 55)")
()
```

Reproducing the infinite loop with a release build:
- compile to release: `cargo component build --release`
- create a temp directory on the host: `mkdir tmp`
- run with `wasmtime --invoke 'reproducer()' --dir 'tmp::/' target/wasm32-wasip1/release/file_service.wasm`

Output:
```
Trying to create directory /tmp/py/modules/0/mytest/__pycache__
Finished creating directory /tmp/py/modules/0/mytest/__pycache__
Ok(())
Creating files
Ok(())
Ok(())
Ok(())
Ok(())
Removing all
print_tree "/tmp/py/modules/0"
  mytest
print_tree "/tmp/py/modules/0/mytest"
  __init__.py
  __pycache__
print_tree "/tmp/py/modules/0/mytest/__pycache__"
  mymodule.rustpython-01.pyc
  __init__.rustpython-01.pyc
  mymodule.py
```

and it hangs here.
Note that even removing things like prints from the code can make it fail rather than hang, so I'm not sure how stable this reproducer is on other machines.
Also, the attached code contains many other functions which were originally used in different tests - I left them in because removing them made the "hanging case" irreproducible for me. I can attach the actual two WASMs if it helps.
Expected Results
The directory structure is deleted and the guest returns without error.
Versions and Environment
Wasmtime version: tried with 33.0.0 (what we use internally) and the latest published (36.0.2)
Operating system: Darwin Kernel Version 24.6.0
Architecture: arm64
vigoo added the bug label to Issue #11701.
vigoo commented on issue #11701:
file-server-debug.wasm.zip
file-server-release.wasm.zip

The WASMs for the above two cases, reproducing the two different bad behaviors for me.
alexcrichton commented on issue #11701:
Ok, this is kind of a wild bug. My understanding at this point is that the true bug lies here for the wasip1 target and here for the wasip2 target. I cannot yet explain the difference between `--debug` and `--release`, nor can I explain why this appears to be platform-specific. Some various learnings otherwise:
- Using your source, or prebuilt modules, I cannot reproduce this on Linux. I can, however, reproduce both with the source and with the modules on macOS (debug errors, release loops).
- Using this program it passes on native but fails on the wasip1/wasip2 targets.
- This, however, I can explain. The lines linked above in Wasmtime are to "blame", of sorts, for this. Effectively what's happening is that every time `fd_readdir` is called it actually tries to read the whole directory, e.g. we don't keep the stream open or anything like that. When this is coupled with `d_next`, or the "cookie", what happens is: (a) a directory is read and the last entry doesn't fit in the input buffer, (b) the guest deletes some files, (c) the guest requests more files from the directory with the previous `d_next`, and (d) wasmtime re-reads the directory, this time with fewer files, and skips everything because that's what the "cookie" says. Effectively the "cookie" is only valid for a single snapshot of the directory in time and does not take filesystem modification into account (see the simulation sketch after this list).
- Somehow the original components are affected by this `fd_readdir` behavior. I've patched the Rust standard library to insert a `collect::<Vec<_>>()` on this line, and when I compile the sources shared here with that compiler I can't reproduce the bug on macOS. (I could reproduce with a pre-patch libstd, however.)
- I feel like I can explain the "debug mode returns an error" in the original example now: it's because wasmtime re-reads the whole directory and skips an entry, meaning that the directory truly isn't empty when it's removed.
- For the infinite loop behavior my guess is that there's a bug in the WASIp1-to-WASIp2 adapter. Unfortunately that's notoriously hard to debug so I'm still staring at code.
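To make the (a)-(d) failure mode above concrete, here is a small self-contained simulation (plain Rust, not wasmtime code) of a host that re-reads the directory on every call and treats the cookie as a plain index:

```rust
// Simulates a host whose fd_readdir re-reads the directory on every call
// and interprets the cookie as "number of entries already returned".
fn readdir_snapshot(dir: &[&str], cookie: usize, buf_entries: usize) -> (Vec<String>, usize) {
    let batch: Vec<String> = dir
        .iter()
        .skip(cookie)      // "resume" by skipping `cookie` entries
        .take(buf_entries) // the buffer only fits this many entries
        .map(|s| s.to_string())
        .collect();
    let next_cookie = cookie + batch.len();
    (batch, next_cookie)
}

fn main() {
    let mut dir = vec!["a.py", "b.py", "c.py", "d.py"];

    // (a) first call: the buffer only fits two entries.
    let (batch, cookie) = readdir_snapshot(&dir, 0, 2);
    println!("saw {batch:?}"); // ["a.py", "b.py"]

    // (b) the guest deletes the files it has seen so far.
    dir.retain(|f| !batch.contains(&f.to_string()));

    // (c) the guest resumes with the old cookie, and (d) the host
    // re-reads the now-smaller directory and skips `cookie` entries,
    // so "c.py" and "d.py" are never returned -- and never deleted.
    let (batch, _) = readdir_snapshot(&dir, cookie, 2);
    println!("saw {batch:?}"); // [] -- looks empty, but two files remain
}
```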
In the meantime though, what to do about this? Unfortunately I think we're in a bit of a problematic situation. I'm not sure how to do some sort of host-side change to fix this with the WASIp1 `fd_readdir` API, and that's unfortunately what's required to get this fixed here, since the Rust standard library goes through WASIp1 for reading directories, which is implemented through the WASIp1-to-WASIp2 adapter. This means we've got the two lines linked at the start of this to fix in Wasmtime (one in the native implementation, one in the adapter). The difficulty is that `fd_readdir`, as specified, would effectively require buffering the entire directory's contents within the WASIp1-to-WASIp2 adapter, which we basically can't do since dynamic allocations aren't possible there. Without this buffering behavior I don't believe we can implement what's necessary for `fd_readdir` here, which is to read the directory at most once, from a continuous stream of entries.

Well, that's at least as far as I've gotten so far on this.
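For reference, the guest-facing shape of the API in question, sketched in Rust (field names per the wasi_snapshot_preview1 spec; this is an illustration, not wasmtime's code):

```rust
// The WASIp1 `dirent` header as laid out in the guest's buffer
// (wasi_snapshot_preview1); the entry's name bytes follow immediately
// after this header.
#[repr(C)]
struct Dirent {
    d_next: u64,   // cookie to pass to fd_readdir to resume *after* this entry
    d_ino: u64,    // inode number of this entry
    d_namlen: u32, // length of the name that follows the header
    d_type: u8,    // filetype: regular file, directory, symlink, ...
}

// fd_readdir(fd, buf, buf_len, cookie) packs as many (header, name)
// pairs into `buf` as fit; a truncated final entry tells the guest to
// call again, passing the last complete entry's `d_next` as `cookie`.
// Nothing in the API ties that cookie to any host-side state, which is
// the crux of the problem described above.
```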
alexcrichton commented on issue #11701:
Ok, well, further staring found https://github.com/bytecodealliance/wasmtime/pull/11702, which is the cause of the release-mode-vs-debug-mode difference. With that I'm confident now that the only issue is the broken implementations of `fd_readdir` in this repo.
alexcrichton commented on issue #11701:
cc @vados-cosmonic and @sunfishcode - I'm curious, as co-champions of wasi-filesystem, to get your take on this. My question is about WASIp1, so if y'all would rather not care about it, feel free to ignore this. Specifically the `fd_readdir` function -- how should an implementation deal with the fact that between invocations of `fd_readdir` mutations to the directory might be made? I can personally think of two somewhat-viable paths forward:

- Declare the function broken in some "official" location and go update callers which might mutate to read the whole directory before mutating. For example, motivate a change to the Rust standard library to read the entire directory before mutating it by deleting files in `remove_dir_all` (see the sketch after this comment).
- Implementations would always read at least two entries from the underlying host directory. The `d_next` of the previous one points to the `d_ino` of the next entry. That way, if an entry is deleted, we can in theory still at least resume at the last file read in the directory. Well, ok, now as I type this out, this doesn't handle the case where both files are deleted.

Hm, ok, different question: as co-champions of wasi-filesystem how do y'all feel about declaring this API as dead-and-broken? I realize WASIp1 is sort of already in that state, but it would be useful to have this on-record somewhere, if only in an issue or something like that.
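The first path amounts to a snapshot-then-mutate discipline in callers. A minimal sketch of a `remove_dir_all` written that way (illustrative only; not the libstd implementation, and it glosses over the symlink handling the real one needs):

```rust
use std::fs;
use std::io;
use std::path::Path;

// Snapshot-then-mutate deletion: read each directory to completion
// before deleting anything in it, so no readdir stream is ever resumed
// across a modification.
fn remove_dir_all_snapshot(path: &Path) -> io::Result<()> {
    // Drain the whole directory listing up front.
    let entries = fs::read_dir(path)?.collect::<io::Result<Vec<_>>>()?;
    for entry in entries {
        if entry.file_type()?.is_dir() {
            remove_dir_all_snapshot(&entry.path())?;
        } else {
            fs::remove_file(entry.path())?;
        }
    }
    fs::remove_dir(path)
}
```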
alexcrichton added the wasi:impl label to Issue #11701.
alexcrichton commented on issue #11701:
@vigoo in the meantime, if you're interested in getting this fixed in the near-term, I think the quickest fix will be "don't use `std::fs::remove_dir_all`", if that's possible. If that's in the bowels of some other crate you're using, however, the next-quickest fix would be to propose a change to Rust's libstd, but that's a pretty big step down in terms of "quickest".
bjorn3 commented on issue #11701:
fd_readdircan be implemented directly in terms ofgetdentswhen not using the wasip1 to wasip2 shim, right? Thed_offfield forgetdentsis directly equivalent tod_nextforfd_readdir. Maybe wasip2 could add acookiefield populated by thed_offfield ofgetdentstodirectory-entryand then the wasip1 to wasip2 shim can use thiscookiefield to seek to the right entry indirectory-entry-streamrather than using an integer index within the directory asd_next?
bjorn3 edited a comment on issue #11701:
fd_readdircan be implemented directly in terms ofgetdentswhen not using the wasip1 to wasip2 shim, right? Thed_offfield forgetdentsis directly equivalent tod_nextforfd_readdir. Maybe wasip2 could add acookiefield populated by thed_offfield ofgetdentstodirectory-entryand then the wasip1 to wasip2 shim can use thiscookiefield to seek to the right entry indirectory-entry-streamrather than using an integer index within the directory asd_next? Or maybe the entry name could be hashed and used as cookie as temporary workaround. Would probably cause issue for hash collisions though.
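For concreteness, a Linux-only sketch of the `d_off` cookie being discussed here, using raw `getdents64` via the `libc` crate (illustrative; error handling and buffer-alignment niceties are elided):

```rust
use std::os::unix::io::AsRawFd;

// List a directory via getdents64 and print each entry's d_off: the
// kernel-provided cookie that lseek(fd, d_off, SEEK_SET) can later use
// to resume the directory stream, even as entries come and go.
fn main() -> std::io::Result<()> {
    let dir = std::fs::File::open(".")?;
    let fd = dir.as_raw_fd();
    let mut buf = [0u8; 4096];
    loop {
        let n = unsafe {
            libc::syscall(libc::SYS_getdents64, fd, buf.as_mut_ptr(), buf.len())
        };
        if n <= 0 {
            break; // 0 = end of directory; negative = error (elided here)
        }
        let mut pos = 0usize;
        while pos < n as usize {
            // linux_dirent64 layout: u64 d_ino, i64 d_off, u16 d_reclen,
            // u8 d_type, then a NUL-terminated d_name.
            let rec = &buf[pos..];
            let d_off = i64::from_ne_bytes(rec[8..16].try_into().unwrap());
            let reclen = u16::from_ne_bytes(rec[16..18].try_into().unwrap()) as usize;
            let name: Vec<u8> = rec[19..reclen].iter().copied().take_while(|&b| b != 0).collect();
            println!("d_off={d_off:20} {}", String::from_utf8_lossy(&name));
            pos += reclen;
        }
    }
    Ok(())
}
```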
vigoo commented on issue #11701:
> @vigoo in the meantime, if you're interested in getting this fixed in the near-term, I think the quickest fix will be "don't use `std::fs::remove_dir_all`", if that's possible. If that's in the bowels of some other crate you're using, however, the next-quickest fix would be to propose a change to Rust's libstd, but that's a pretty big step down in terms of "quickest"

Thanks for looking into it so quickly! I did that as a workaround already (not using `std::fs::remove_dir_all` in the piece of code that triggered my investigation), although of course I cannot guarantee our users will never use it. Most importantly, I just wanted to let you know about the issue.
alexcrichton commented on issue #11701:
@bjorn3 it looks like `getdents` is Linux-specific, which is already one major blocker, but another is that once you've read `d_off` there's no guarantee the file there isn't deleted. If it's deleted, for example, then the iterator would be truncated without visiting anything else, since the next seek wouldn't find anything.

Effectively the WASIp1 `fd_readdir` differs from `getdents` in that it's not stateful: you pass in a `cookie`, and at least my read of it is that you can arbitrarily seek around when reading a directory. With that lack of state, however, we can't maintain a single object on the host whose stream we're following. I don't know how we can take an arbitrary `cookie` and seek the actual stream in the face of modifications between calls to `fd_readdir`. WASIp{2,3} are much easier here since they return a stream/iterator and nothing else -- no seeking allowed.
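A signature-level sketch of that contrast (illustrative Rust; these are not the actual wasmtime or WIT definitions):

```rust
// Illustrative signatures only -- not the real wasmtime or WIT definitions.
struct Errno(u16);

// WASIp1: stateless per call. The guest supplies an arbitrary `cookie`,
// so the host must be able to reposition inside a directory it has not
// necessarily kept open since the previous call.
fn fd_readdir_p1(_fd: u32, _buf: &mut [u8], _cookie: u64) -> Result<usize, Errno> {
    unimplemented!("must materialize a directory position from `cookie` alone")
}

// WASIp2/p3 shape: the host hands back an opaque stream resource that
// can only be advanced, never seeked, so one host-side iterator suffices.
struct DirectoryEntryStream; // stand-in for the wasi-filesystem resource
fn read_directory_p2(_fd: u32) -> Result<DirectoryEntryStream, Errno> {
    unimplemented!()
}
```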
vados-cosmonic commented on issue #11701:
A bit late but with regards to this note:
Hm ok different question: as co-champions of wasi-filesystem how do y'all feel about declaring this API as dead-and-broken? I realize WASIp1 is sort of already in that state but it would be useful to have this on-record somewhere if only in an issue or something like that.
This certainly seems like the right first step -- at the very least, this is a big enough footgun that it should be documented somewhere.
Looking at the P1 interface, I can't see a way to solve this that others haven't mentioned already here. One thing I was thinking about: would it be possible to use some bits of the cookie to store some state?
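One hypothetical shape for that idea, purely as a sketch (nothing in the p1 spec defines such an encoding): reserve some cookie bits for a directory "generation" stamp, so a stale cookie can at least be detected rather than silently honored.

```rust
// Pack a 16-bit generation stamp and a 48-bit position into one cookie.
// If the host bumps the generation whenever it observes the directory
// changed, a resumed call can detect a stale cookie instead of silently
// skipping entries.
fn pack_cookie(generation: u16, pos: u64) -> u64 {
    assert!(pos < 1 << 48, "position must fit in 48 bits");
    (u64::from(generation) << 48) | pos
}

fn unpack_cookie(cookie: u64) -> (u16, u64) {
    ((cookie >> 48) as u16, cookie & ((1u64 << 48) - 1))
}

fn main() {
    let c = pack_cookie(3, 42);
    assert_eq!(unpack_cookie(c), (3, 42));
    println!("cookie = {c:#x}");
}
```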