Determinism/Reproducibility · general

Stream: general

Topic: Determinism/Reproducibility

indolering (Oct 23 2020 at 22:53):

So Dan and I were discussing deterministic/reproducible behavior in WASI.

indolering (Oct 23 2020 at 22:54):

One thing that came up is sort order, which is a cause of non-determinism in a lot of languages.

Dan Gohman (Oct 23 2020 at 22:57):

I don't know of any prior art in this space, so all I have right now are ideas off the top of my head

indolering (Oct 23 2020 at 22:57):

I had suggested earlier some sort of way to control deterministic/reproducible behavior, such as defining the sorting of files based on name (as opposed to timestamps). This would be slow, especially without the help of the underlying filesystem/b-tree structure.

Dan Gohman (Oct 23 2020 at 22:57):

One option would be to have the implementation maintain its own index, that it'd update every time it creates or renames a file.

indolering (Oct 23 2020 at 22:58):

But have that be configurable, so you can turn that on but it will be slow and not all runtimes will support it.

Dan Gohman (Oct 23 2020 at 22:58):

Then, fd_readdir etc. would iterate over that index, rather than over the host directory.

Dan Gohman (Oct 23 2020 at 22:58):

Right, slow, and it'd get out of date if other processes can access the dir.

Dan Gohman (Oct 23 2020 at 22:59):

So yeah, we'd want it to be optional

indolering (Oct 23 2020 at 23:02):

There are FUSE filesystems, Android used wrapfs to enable case-folded file lookup.

Dan Gohman (Oct 23 2020 at 23:02):

I wonder if the problem of other processes could be solved, or at least mitigated, by having the implementation compare the timestamp of the index to the timestamp of the directory. If the directory is more recently modified, re-scan the directory and regenerate the index.
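
A minimal sketch of the index-plus-timestamp idea, assuming a runtime-side helper (the class name `SortedDirIndex` and its methods are hypothetical, not part of WASI). It re-scans whenever the directory's modification time differs from the one recorded at the last scan. Timestamp granularity on the host means this mitigates rather than eliminates the race with other processes.

```python
import os


class SortedDirIndex:
    """Hypothetical name-sorted index over one host directory."""

    def __init__(self, path):
        self.path = path
        self._names = []
        self._scanned_mtime_ns = None  # directory mtime seen at the last scan

    def _rescan(self):
        # Record the mtime before listing, so a change that lands during the
        # scan still makes the index look stale on the next call.
        self._scanned_mtime_ns = os.stat(self.path).st_mtime_ns
        self._names = sorted(os.listdir(self.path))

    def readdir(self):
        # Regenerate the index if the directory changed since the last scan.
        if self._scanned_mtime_ns != os.stat(self.path).st_mtime_ns:
            self._rescan()
        return list(self._names)
```

A runtime with exclusive access could instead update `_names` directly on every create or rename and skip the `stat` call.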

indolering (Oct 23 2020 at 23:02):

And disorderfs, which can actually list directories in a specified sort order.

Dan Gohman (Oct 23 2020 at 23:03):

Ah, FUSE etc. is a good idea too. If we can make the host FS do what we need, that'd avoid the need for an external index

indolering (Oct 23 2020 at 23:03):

In my research on case-folded lookups on case-sensitive file systems, all of the solutions were racy.

Dan Gohman (Oct 23 2020 at 23:05):

Maybe that's the first question then. Are there any non-racy solutions?

indolering (Oct 23 2020 at 23:06):

I don't know.

Dan Gohman (Oct 23 2020 at 23:06):

Or perhaps we just need to say, don't enable this option if anyone can access the FS concurrently?

indolering (Oct 23 2020 at 23:07):

I mean, SQLite can be as fast as filesystem access. So a virtual filesystem could work.

Dan Gohman (Oct 23 2020 at 23:09):

If that meets your needs, it sounds like a reasonable thing to do to me

indolering (Oct 23 2020 at 23:09):

BeFS allowed arbitrary metadata, but no one seems eager to emulate that FS.

indolering (Oct 23 2020 at 23:09):

OS X, Windows, etc all just maintain a file-based search index.

indolering (Oct 23 2020 at 23:10):

But again, I think it would be possible to define a well specified standard but allow non-deterministic access that relies on the underlying FS.

Dan Gohman (Oct 23 2020 at 23:12):

Could you describe your goal here in more detail?

indolering (Oct 23 2020 at 23:12):

If you can assume an exclusive lock on the directory, then you should be able to enable case-folded lookups on that directory in Linux. This is currently only working on EXT4 and F2FS but those efforts are very much intended to spread to the other Linux filesystems.

indolering (Oct 23 2020 at 23:13):

The filesystem is a global shared state, it helps if you can nail down its behavior as finely as possible.

indolering (Oct 23 2020 at 23:14):

My other big concern from earlier is preventing a standard which adopts case-sensitivity by default, as it's very hard to undo that choice later.

Dan Gohman (Oct 23 2020 at 23:14):

Does Linux have a way to lock directories?

indolering (Oct 23 2020 at 23:14):

As a usability engineer, we are often brought in waaaaay too late to make any changes.

indolering (Oct 23 2020 at 23:15):

I mean, through permissions. But ~/.wine essentially assumes exclusive access and enables case-insensitivity on that directory.

Dan Gohman (Oct 23 2020 at 23:17):

If you have an implementation in mind, could you write up some pseudo-code for how, eg. creating a file, and listing the contents of a directory, would work?

indolering (Oct 23 2020 at 23:19):

I don't, the simplest would be to create a sort order based on unicode codepoints. But that is a basic sort.
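
As a rough illustration of the "basic sort" mentioned here: Python compares `str` values code point by code point, so a plain `sorted()` already gives code point order. The helper name `codepoint_sorted` is made up for this sketch. Sorting the UTF-8 encoded bytes gives the same order, since UTF-8 preserves code point ordering.

```python
def codepoint_sorted(names):
    """Sort filenames by Unicode code point (no locale, no case folding)."""
    return sorted(names)


names = ["b.txt", "a.txt", "B.txt", "\u00e9.txt", "z.txt"]
by_codepoint = codepoint_sorted(names)
# Uppercase sorts before lowercase, and non-ASCII letters sort after "z".
by_utf8_bytes = sorted(names, key=lambda n: n.encode("utf-8"))
```

This is deterministic but not what a user would call alphabetical, which is why it is only a baseline.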

indolering (Oct 23 2020 at 23:20):

AFAICT, EXT4, BTRFS, and F2FS all maintain a hashed filename index. So it won't be sorted according to anything close to Unicode sort order, but still sorted.

Dan Gohman (Oct 23 2020 at 23:20):

So picking a specific ordering is one thing, yes, but I'm still trying to get a picture for what problem you're looking to solve, or which specific context you're looking to solve it in.

indolering (Oct 23 2020 at 23:22):

The specific problem I'm looking to solve is serving up files from the FS in an order based on the filename, as opposed to insertion order in the b-tree.

Dan Gohman (Oct 23 2020 at 23:23):

Cool, and are you hoping to enable this for general-purpose use, or only within virtual filesystems and perhaps private-directory filesystems where we have exclusive access?

indolering (Oct 23 2020 at 23:23):

Disorderfs is actually used by the reproducible build community to debug builds where the FS ordering of files creeps into build scripts.

indolering (Oct 23 2020 at 23:23):

The last two.

indolering (Oct 23 2020 at 23:24):

We could do general purpose, but I would just stub that code out and document how you would like it done.

indolering (Oct 23 2020 at 23:24):

Listing specific directories based on filename in a deterministic fashion is totally doable.

indolering (Oct 23 2020 at 23:25):

Emulating case-insensitivity on a case-sensitive filesystem can work well enough in 80% of cases.

indolering (Oct 23 2020 at 23:27):

But it's always going to be slower, racy, and blow up in your face if you do something dumb like "/foo/BAR/readme.txt" and "/foo/bar/readme.txt"

Dan Gohman (Oct 23 2020 at 23:28):

I imagine for virtual-fs and private-directory cases we can case-fold the host files, so it'd work

indolering (Oct 23 2020 at 23:28):

Perhaps we should discuss filename, encoding, and case-sensitivity and to what degree the API should enforce its opinions on everyone?

Dan Gohman (Oct 23 2020 at 23:28):

Are you venturing into the general-purpose side of things now?

indolering (Oct 23 2020 at 23:28):

Yeah.

Dan Gohman (Oct 23 2020 at 23:30):

My working assumption is that case-insensitivity is just "whether and how it's done is nondeterministic"

Dan Gohman (Oct 23 2020 at 23:30):

in the general-purpose case

Dan Gohman (Oct 23 2020 at 23:30):

and we just let whatever the host FS does shine through

indolering (Oct 23 2020 at 23:31):

I don't think we can impose our invariants on the underlying filesystem without things breaking.

indolering (Oct 23 2020 at 23:31):

So agree there.

indolering (Oct 23 2020 at 23:31):

But I strongly disagree that we should fail if (as was suggested) the case differs from what was looked up.

Dan Gohman (Oct 23 2020 at 23:32):

Ah, interesting.

indolering (Oct 23 2020 at 23:32):

You are basically making the API opinionated in favor of case-sensitivity, which is not what end-users/consumers want.

indolering (Oct 23 2020 at 23:33):

And it's a nightmare to try and fix that choice later.

indolering (Oct 23 2020 at 23:33):

Have you seen sandboxfs?

indolering (Oct 23 2020 at 23:33):

Actually, let's shelf that for a minute.

indolering (Oct 23 2020 at 23:33):

shelve*

Dan Gohman (Oct 23 2020 at 23:36):

That "check to see if the case differs" is meant to avoid programs which accidentally depend on running on case-insensitive filesystems

Dan Gohman (Oct 23 2020 at 23:37):

Your point seems to be "case insensitive filesystems are The Path Forward", so we should embrace them and not risk locking ourselves out of an all-case-insensitive future

indolering (Oct 23 2020 at 23:38):

primarily-case-insensitive future.

Dan Gohman (Oct 23 2020 at 23:38):

I hadn't thought of it like that

indolering (Oct 23 2020 at 23:38):

But yeah, mostly.

indolering (Oct 23 2020 at 23:38):

I mean, people have scripts that work on OS X but fail on Linux because case suddenly matters.

Dan Gohman (Oct 23 2020 at 23:40):

What we might do, is have that check, but don't make it part of the WASI spec. Just make it a debugging feature that engines can optionally provide.

indolering (Oct 23 2020 at 23:40):

What confuses me is that you are proposing normalizing all names to UTF-8, which is similar to case-folding in that the string used to look up a file in a UCS-2 language won't match the string returned as the filename.

Dan Gohman (Oct 23 2020 at 23:41):

assuming we do something like ARF strings or modified UTF8-C8, the difference is that the translation is reversible

Dan Gohman (Oct 23 2020 at 23:41):

case-folding is lossy
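
To make the reversible-versus-lossy distinction concrete, here is a small sketch. It uses Python's `surrogateescape` error handler as a stand-in for a reversible byte-to-string translation; this is not the ARF or UTF8-C8 encoding discussed above, only an analogous mechanism that happens to be available in the standard library. The helper names are hypothetical.

```python
def host_bytes_to_text(raw: bytes) -> str:
    # Undecodable bytes are smuggled through as lone surrogates.
    return raw.decode("utf-8", errors="surrogateescape")


def text_to_host_bytes(text: str) -> bytes:
    # The inverse mapping restores the original bytes exactly.
    return text.encode("utf-8", errors="surrogateescape")


raw = b"caf\xe9.txt"  # Latin-1 bytes, not well-formed UTF-8
round_trip = text_to_host_bytes(host_bytes_to_text(raw))  # same bytes as raw

# Case folding, by contrast, maps several distinct names onto one string,
# so the original spelling cannot be recovered from the folded form.
folded = {n.casefold() for n in ("README.txt", "Readme.txt", "readme.txt")}
```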

indolering (Oct 23 2020 at 23:41):

That's still pushing people to use case, which shouldn't matter.

indolering (Oct 23 2020 at 23:42):

For sure, but it will still require manual refactoring.

Dan Gohman (Oct 23 2020 at 23:42):

The theory with ARF strings is that ~~non-Unicode~~ill-formed filenames are so rare that many programs wouldn't need to bother

indolering (Oct 23 2020 at 23:43):

Also, if you are going to convert to UTF-8, are you also going to normalize to NFC?

indolering (Oct 23 2020 at 23:43):

Agreed.

Dan Gohman (Oct 23 2020 at 23:43):

No, I expect NFC/NFD is just "whatever the host does"

indolering (Oct 23 2020 at 23:45):

Hrm.

indolering (Oct 23 2020 at 23:46):

This is a tad above my paygrade: my understanding is that if you are going to mess with encoding, then you should probably do NFC too. <- might be wrong.

Dan Gohman (Oct 23 2020 at 23:46):

arbitrary codepoint sequence -> NFC or NFD is also lossy

indolering (Oct 23 2020 at 23:47):

Basically, you have to do NFC or NFD if you want deterministic sort order.
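
A short sketch of why normalization matters for sort order: the same visible name can be stored precomposed or decomposed, and the two forms land in different places under a raw code point sort. The helper `nfc_sorted` is hypothetical; it sorts on the NFC form and breaks ties on the raw name so the result does not depend on input order.

```python
import unicodedata

composed = "\u00e9.txt"     # "é" as a single code point
decomposed = "e\u0301.txt"  # "e" followed by a combining acute accent


def nfc_sorted(names):
    """Sort by NFC-normalized name, with the raw name as a tie-breaker."""
    return sorted(names, key=lambda n: (unicodedata.normalize("NFC", n), n))


raw_a = sorted(["f.txt", composed])    # "f.txt" comes first
raw_b = sorted(["f.txt", decomposed])  # the decomposed name comes first
nfc_a = nfc_sorted(["f.txt", composed])
nfc_b = nfc_sorted(["f.txt", decomposed])  # now "f.txt" comes first in both
```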

indolering (Oct 23 2020 at 23:48):

NFD is faster, but doesn't match what most keyboards and applications do, so it makes it hard to find the file in the filesystem. But since you don't care about re-encoding the file name....

Dan Gohman (Oct 23 2020 at 23:49):

in the general-purpose case, we can't prevent other processes from creating non-NFC or non-NFD names on hosts which don't enforce those

indolering (Oct 23 2020 at 23:49):

Linus hates this; it was seen as a mistake for HFS+ to do this, and the view is that all filesystems should preserve encoding/case in the filename returned but store an NFD/NFC normalized filename in the metadata.

indolering (Oct 23 2020 at 23:50):

Okay, I actually agree we should go that way.

indolering (Oct 23 2020 at 23:51):

Alright, so I will post my concerns about blowing up when there is a case mismatch in that filename thread.

indolering (Oct 23 2020 at 23:51):

And post info about performance in the Gist I made.

indolering (Oct 23 2020 at 23:51):

Uhh, last thing I wanted to discuss: using FUSE/encryption to enforce access control and determinism.

indolering (Oct 23 2020 at 23:52):

Bazel created sandboxfs to speed up access-control permissions in builds.

indolering (Oct 23 2020 at 23:53):

I believe Apple uses encryption to enforce access control too, IIRC it gets as finely grained as append only writes.

indolering (Oct 23 2020 at 23:54):

That would also help with determinism.

indolering (Oct 23 2020 at 23:54):

Do you know of any literature that defines the limits of what encryption based ocap can do?

Dan Gohman (Oct 23 2020 at 23:56):

(to address your comment above, normalizing to NFC is worth considering, but in the general-purpose case we similarly run into O(2^N) situations where you have to check for all possible variations of non-normalized filenames created by other processes)

Dan Gohman (Oct 23 2020 at 23:56):

At the level we're operating at right now, all encryption is "the next level down"

Dan Gohman (Oct 23 2020 at 23:57):

Bits on disk are encrypted by the host's filesystem code, but that's all transparent outside of the kernel

Dan Gohman (Oct 23 2020 at 23:59):

userspace never sees the encrypted bits, and the only security function the encryption serves is to foil attackers that can access the underlying storage media directly, which userspace can't do under normal circumstances

Dan Gohman (Oct 24 2020 at 00:01):

capability-based security is about controlling what things you can ask for. The basic idea is that you're given handles, which are program values that you can pass around as arguments and return values and so on, that represent resources you have access to, instead of naming things with strings

Dan Gohman (Oct 24 2020 at 00:03):

open("foo", O_RDONLY) plucks the name foo from thin air and requests access to it. This implies a global namespace in which the request can be resolved, which might have ACLs to govern access, but ACLs are awkward to manage in a bunch of ways.
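
The contrast can be sketched on a POSIX host with Python's `dir_fd` support, which maps to `openat`. The function `open_in` is a hypothetical helper. Note the assumption: plain `openat` does not by itself stop a name like `../secret` from escaping the directory, so a real sandbox has to add its own checks on top of this.

```python
import os


def open_in(dir_fd, name):
    """Open `name` read-only relative to a directory handle we were given.

    The caller never names the directory by string; it can only reach
    files through the handle it already holds.
    """
    return os.open(name, os.O_RDONLY, dir_fd=dir_fd)
```

Usage would look like `dir_fd = os.open(some_dir, os.O_RDONLY)` performed once by whoever grants access, followed by `open_in(dir_fd, "foo")` in the code that receives the handle.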

indolering (Oct 24 2020 at 00:05):

Yeah, I get that. But I believe the original E language had some crypto capabilities built into it.

Dan Gohman (Oct 24 2020 at 00:06):

There is some research into using handles across networks.

indolering (Oct 24 2020 at 00:06):

And there are some researchers who would like to reduce the security of a lot of systems down to cryptographic primitives.

indolering (Oct 24 2020 at 00:06):

But out of scope, clearly.

Dan Gohman (Oct 24 2020 at 00:09):

The key property of handles is that you can't "forge" them, or obtain them without being explicitly given them. Within a single host, that's straightforward, but if you want to pass a handle to another host on the network, and they might pass it on to someone else, how do you ensure that no one can forge such a handle?

Dan Gohman (Oct 24 2020 at 00:12):

I don't have links handy, but there has been research along those lines

indolering (Oct 24 2020 at 00:13):

Yeah.

indolering (Oct 24 2020 at 00:14):

The only comprehensive survey of capabilities I've seen are by the CHERI people, and they have formal models tying their capabilities to memory tagged hardware. I'll go ask them about alternative modes of enforcement.

indolering (Oct 24 2020 at 00:14):

Okay, thanks dan!

Dan Gohman (Oct 24 2020 at 00:15):

yw!

indolering (Oct 24 2020 at 00:30):

Oh, Dan, it's not O(N^filename) runtime.

indolering (Oct 24 2020 at 00:30):

It's O(log files)

indolering (Oct 24 2020 at 00:31):

As you just get a list of all files in a directory and convert their names to the canonical casefolded name.

indolering (Oct 24 2020 at 00:31):

And do a compare.

indolering (Oct 24 2020 at 00:34):

You would only get exponential behavior if someone created a lot of directories with lots of cases, such as /foo/bar/baz/..., /Foo/bar/baz/..., /FOo/bar/baz.

indolering (Oct 24 2020 at 00:36):

Even then, if your goal is to create a lint, you could just error when there are no exact case-sensitive matches but multiple case insensitive matches.
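
The lookup-and-lint rule described in the last few messages could be sketched like this. The function takes the directory listing as a plain list so it does not depend on how the host filesystem treats case; `resolve_casefolded` and `AmbiguousCase` are hypothetical names, and `str.casefold()` stands in for whatever canonical folding a spec would pick.

```python
class AmbiguousCase(Exception):
    """No exact match, but several entries match case-insensitively."""


def resolve_casefolded(entries, name):
    """Return the entry matching `name`, or None if nothing matches."""
    if name in entries:
        return name  # an exact case-sensitive match always wins
    wanted = name.casefold()
    matches = sorted(e for e in entries if e.casefold() == wanted)
    if len(matches) > 1:
        raise AmbiguousCase(f"{name!r} could be any of {matches}")
    return matches[0] if matches else None
```

Each lookup is a single linear pass over one directory listing, which is the cost being argued for above.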

Jubilee (Oct 28 2020 at 04:16):

I am mildly perplexed by "primarily case insensitive future" because Mac and Windows both introduced the ability to make their file systems case sensitive, and I don't think it's any less troublesome going from case insensitive to case sensitive here.

Dan Gohman (Oct 28 2020 at 21:54):

@indolering If I need to create N files (suppose I'm unpacking an archive, compiling lots of source files to object files, etc.), and each time I create a file I have to scan the directory to see if there's a file with a case-folding-equivalent name, it goes O(N^2) in the number of files I'm creating
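
A back-of-the-envelope sketch of that cost, counting name comparisons only. Both functions are hypothetical models rather than real filesystem code. The second one assumes the runtime has exclusive access to the directory, so that an in-memory set of case-folded names stays accurate.

```python
def creates_with_rescan(n):
    """Comparisons when every create first scans the whole directory."""
    existing, comparisons = [], 0
    for i in range(n):
        comparisons += len(existing)  # new name checked against every entry
        existing.append(f"file{i}.o")
    return comparisons  # 0 + 1 + ... + (n - 1) = n * (n - 1) / 2


def creates_with_folded_set(n):
    """Lookups when a set of case-folded names is kept in memory."""
    seen, lookups = set(), 0
    for i in range(n):
        key = f"file{i}.o".casefold()
        lookups += 1  # one hash lookup per create
        if key in seen:
            raise FileExistsError(key)
        seen.add(key)
    return lookups
```

For 1000 files the first model does 499500 comparisons and the second does 1000 lookups.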

Dan Gohman (Oct 28 2020 at 22:19):

@Jubilee My working assumption is that WASI will end up saying that host filesystem directories can be either case-sensitive or insensitive, and we just expose that to applications as-is.

Dan Gohman (Oct 28 2020 at 22:19):

Applications will just need to avoid depending on either case sensitivity or case insensitivity if they want to be portable.

Dan Gohman (Oct 28 2020 at 22:22):

There may be some debugging facilities we can add to help applications catch mistakes, and I think the observation above is, we shouldn't make those features mandatory, because if filesystems do end up converging on case-insensitive, we don't want to be stuck with those debugging features forever.

Jubilee (Oct 30 2020 at 22:11):

Mmm, that's reasonable I guess?! I will concede I myself genuinely don't see a reason that file systems should be case insensitive from a data perspective, as it makes it possible to trust an address that is not bit-equal to another address is in fact not the same address (modulo some canonicalization), and user-space can support case-insensitive comparisons where it makes sense just fine. But here I am extending more of a "would-write-an-FS perspective" than a "would-like-to-solve-platform-compatibility-issues" perspective, and so venturing a bit further afield from The Point. Aside from "everyone should clearly adopt my perspective and then there would be no more compatibility issues" which I recognize is in actuality a non-starter. :^)

indolering (Nov 01 2020 at 22:21):

@Jubilee 99% of users are on an OS with a filesystem that is case-insensitive by default.

indolering (Nov 01 2020 at 22:31):

@Dan Gohman I think there is a miscommunication WRT operating modes and error handling.

Jacob Lifshay (Nov 01 2020 at 22:32):

I heard somewhere that, among developers, it's much more evenly split between Linux, Windows, and macOS (about 1/3 for each). Nearly all Linux OSes are case-sensitive by default.

indolering (Nov 01 2020 at 22:36):

@Jacob Lifshay True, lots of server-side stuff is going to run on case-sensitive filesystems by default. That being said, at least on Stack Overflow, Linux devs are out-numbered 3:1.

indolering (Nov 01 2020 at 22:36):

https://insights.stackoverflow.com/survey/2020#technology-developers-primary-operating-systems

indolering (Nov 01 2020 at 22:37):

I also think that for all of the wailing and gnashing of teeth, distros will eventually switch to case-insensitive by default. At least for home directories.

indolering (Nov 01 2020 at 22:38):

That being said, I believe that virtually every filesystem allows for setting this behavior on a per-directory basis.

indolering (Nov 01 2020 at 22:41):

My main concern was that the tickets I read were suggesting fail fast on case-insensitive filesystems. That's a mistake.

indolering (Nov 01 2020 at 22:45):

@Jubilee The web is case-insensitive: a domain name is basically a pointer to a server and paths are basically case-insensitive lookups on a filesystem.

indolering (Nov 01 2020 at 22:47):

From a security and reproducibility perspective, you want to minimize the impedance mismatch between the client and the host.

indolering (Nov 01 2020 at 22:51):

From a data-perspective, we definitely want valid UTF-8 right?

indolering (Nov 01 2020 at 22:52):

@Jubilee Sorry, I'll stop hammering you. I'm writing another ticket right now and my thought streams are crossing :P

indolering (Nov 01 2020 at 22:53):

[image.png: Ghostbusters picture]

Jubilee (Nov 01 2020 at 23:01):

@indolering That is not true anymore, and has not been for a long time.
iOS and Android OS use primarily case sensitive FS while supporting operations on case-insensitive FS for back-compat and with many applications exposing case-insensitive functionality.
i.e. my preferred scheme.
As far as the web goes, not all servers implement those accesses as case insensitive, and of those which do, they commonly involve a redirect to a canonical version. It took... 3 tries? to find a case-sensitive access that fails without a redirect.

Jubilee (Nov 01 2020 at 23:02):

( the first hardly counts, since I was just doublechecking against my memory of domain names being case insensitive. )

Jacob Lifshay (Nov 01 2020 at 23:07):

indolering said:

The web is case-insensitive: a domain name is basically a pointer to a server and paths are basically case-insensitive lookups on a filesystem.

DNS is case insensitive (but only for ASCII I think -- IIRC punycode doesn't do any case folding before encoding). Most servers run Linux with case-sensitive filesystems, so I'd expect the other parts of a url to be usually case sensitive.

Jubilee (Nov 01 2020 at 23:16):

From a security perspective, the fact that I can write the link https://googIe.com and it does not go to the same place as https://google.com is a source of endless phishing tech, so no, you may consider me suitably skeptical that even exposing case insensitivity to a user serves their security that much.

And from a data perspective an address can be raw bytes for all I care (and still must, because not all OS enforce UTF-8 path validity!). File systems should cast their eye to living a life longer than an encoding scheme. It's, again, userspace's job to make it intelligible in my opinion. Sometimes an impedance mismatch is simply why we have software.

Dan Gohman (Nov 02 2020 at 00:36):

@Jubilee FWIW, WASI-filesystem is expected to use UTF-8 paths. The "filenames are just bytes" strategy was practical in its day, but with UTF-8 the practicalities line up very differently.

indolering (Nov 02 2020 at 00:36):

I stand corrected: path handling is case preserving and server/filesystem dependent, but Linux of course defaults to case sensitive matching. I guess I was in DNS land for too long!

DNS labels are case-insensitive, even i18n ones (punycode case-folds everything).
Android sdcard access has historically provided case-insensitive lookups, but case-insensitive FS semantics are now enforced using F2FS. The stock file manager on my Pixel defaults to case-insensitive behavior, even for internal storage.
Google Drive is case sensitive.
iOS is case sensitive, but both iCloud and APFS on OS X default to case insensitive semantics. I do not have an iOS device to test whether the filesystem manager attempts to enforce case-insensitive behavior (as I suspect my Files app is doing).

indolering (Nov 02 2020 at 00:48):

I mean, from the perspective of usability (carrying out end-user intent) and enforcing a single namespace in the filesystem, you want the lowest common denominator normalization (NFKD casefold) so that an attacker can't do something like store a ligature ℀ which some server or client could NFKD into a filename path.
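
The ligature concern can be shown directly with the standard `unicodedata` module. The helper below is a simplified sketch (NFKD followed by `str.casefold()`), not the full Unicode caseless-matching procedure, and its name and the rejection rule are assumptions made for illustration.

```python
import unicodedata


def nfkd_casefold(name):
    """Simplified compatibility fold: NFKD, then case folding."""
    return unicodedata.normalize("NFKD", name).casefold()


def is_safe_component(name):
    """Reject names whose folded form is not a single ordinary path component."""
    folded = nfkd_casefold(name)
    return "/" not in folded and folded not in ("", ".", "..")


account_of = "\u2100"      # ℀ decomposes to "a/c", which contains a separator
two_dot_leader = "\u2025"  # ‥ decomposes to "..", the parent-directory name
```

Checking the folded form rather than the raw string is the point: both names look harmless before normalization.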

indolering (Nov 02 2020 at 00:50):

And the only way to be sure you don't accidentally allow two unicode strings that normalize into some other unicode string is at the filesystem level.

indolering (Nov 02 2020 at 00:51):

But yeah, I also don't want some firewall allowlist filter to be bypassed because filename lookups are case/normalization insensitive.

Jubilee (Nov 02 2020 at 01:29):

Oh yes, I expect WASI to use UTF-8 because that is more sensible for WASI's purposes than letting someone decide the next UTF-16 is a good idea.

Jubilee (Nov 02 2020 at 01:46):

But that already involves hitting a translation layer between existing systems and WASI, from my perspective, and at that point where the translation layer exists is subject to some negotiation (since it is, as it were, already a negotiation). But it's slightly more audacious to reformat a user's hard drive than to replace the user's kernel, which is what drives my sentiments regarding where such canonicalization "belongs".

Dan Gohman (Nov 02 2020 at 02:19):

I like how you put it -- a "negotiation" describes it well.

indolering (Nov 02 2020 at 02:49):

Agreed! Determinism can only be enforced if WASI can assume exclusive control at the FS level (akin to WINE setting case-insensitive behavior on ~/.wine/).

indolering (Nov 02 2020 at 02:57):

And a runtime can choose a fast/sloppy implementation that relies on whatever the filesystem does, which should work with 95% of real-world filenames. If you want a truly deterministic filesystem, then you probably want to do something with FUSE or VFS (I suspect some Unicode compression/transliteration scheme could fit 99% of real-world filenames into the Posix portable filename subset, which is available on virtually every platform).
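
For reference, a check against the POSIX portable filename character set (letters, digits, period, underscore, hyphen) is small. The function name is hypothetical, and the sketch ignores length limits; the leading-hyphen rule follows the POSIX advice that a portable filename should not start with a hyphen.

```python
import re

_PORTABLE = re.compile(r"[A-Za-z0-9._-]+")


def is_portable_filename(name):
    """True if `name` uses only the POSIX portable filename characters."""
    if name in (".", "..") or name.startswith("-"):
        return False
    return _PORTABLE.fullmatch(name) is not None
```

Any transliteration scheme of the kind suggested above would need its output to pass a check like this one.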

indolering (Nov 03 2020 at 22:00):

Just checked an iPod touch and the filesystem UI won't let me create two folders that differ in case. So for users iOS is case-insensitive. As iCloud enforces case-insensitivity, I would be surprised if Apple doesn't switch iOS to case-insensitive behavior as well.

indolering (Nov 03 2020 at 22:26):

But maybe the API should error out by default if the filename isn't an exact match, just as long as WASI can adapt to the underlying filestore in an ergonomic fashion.

indolering (Nov 03 2020 at 22:58):

If we have exclusive access to a directory (akin to Android, iOS, Flatpak, etc) then I would default to the lowest-common-denominator, so that everything "just works" regardless of the case or normalization conventions of the FS. But I need to do a review of all the issues with IDNA, Stringprep, and the i18n filesystem RFCs first.

indolering (Nov 04 2020 at 20:12):

Does anyone have experience with/thoughts on PRECIS? It's the IETF followup to the Stringprep algorithm.

Dan Gohman (Nov 05 2020 at 00:11):

Do you know if there are any filesystems which do case-sensitive lookups, but still prevent creating files that differ only in case?

indolering (Nov 05 2020 at 01:49):

Not off the top of my head. However, if a file has a non-normalized name, Linux falls back to using the bytestring as an opaque identifier.

indolering (Nov 05 2020 at 01:51):

FWIW, I am planning on documenting this behavior (with functional testing) in a git repo at some point.

indolering (Nov 07 2020 at 00:45):

I need to nail down the behavior of how filenames are handled across platforms and I'm at the point where I need to start testing. I don't know if this work will be upstreamed, but is there an infrastructure preference WRT functional testing and virtual machines?

Last updated: Apr 06 2025 at 23:03 UTC