Stream: wasmtime

Topic: deterministic stack usage


view this post on Zulip Sergei Pepyakin (Jun 07 2021 at 17:12):

I am working on a project that requires absolute determinism of the executed code. One of the parts that has non-deterministic behavior is the wasm stack. This non-determinism can be exhibited even by the same engine running under different conditions, e.g. x86-64 vs aarch64. In order to make it deterministic we want to run an instrumentation pass that would eradicate all the non-determinism.

The idea is to for each function we would compute its logical size. The logical size would assume the dumbest implementation ever. That is, each value costs some of this logical units. It doesn't matter whether the value comes from the value stack or a local/arg. Moveover, we assume that args are not shared between the frames but rather copied.

Once we have these costs for each function, we would go over all call sites and instrument each call site with a simple sequence:

Then we would configure the engine, i.e. wasmtime, to allocate the wasm native stack so big so that the logical stack limiter will be always hit first.

However, this scheme relies on the assumption that cranelift will generate a stack frame which is always smaller than the overestimated logical size.

I suspect that this assumption can be invalidated by:

The regalloc is totally unclear to me. I know that regalloc would need to allocate a spill area. What is not clear to me, if the worst case size for this area is capped by the number of locals?

@Chris Fallin , or anybody else familiar with the regalloc, is it actually possible for the regalloc to go rogue and potentially allocate the spill area bigger than the number of locals, say, multiplied by the biggest value wasm supports (16 bytes)?

view this post on Zulip Chris Fallin (Jun 07 2021 at 17:15):

@Sergei Pepyakin interesting ideas! Regarding regalloc spill-area size: in general it's hard to make tight guarantees about this, simply because of the complexity of the algorithms involved. (Even ensuring termination is subtle when e.g. splitting or backtracking.) The first aspect that's important is the number of live ranges, rather than locals: internally, the code is represented in SSA form, and locals no longer exist, only individual SSA values at new definition sites. There is an upper bound on the number of live values determined by the number of locals plus the maximum wasm operand stack size, but conservatively, the only bound we can confidently state is that there will be (at most) one spillslot per live range. Tighter bounds than that rely on reasoning about spillslot coalescing, which is complex

view this post on Zulip Chris Fallin (Jun 07 2021 at 17:17):

This actually feels like sort of an open-ended research question: what bounds can one prove about a realistic register allocator? Such proofs are notoriously hard when termination conditions are complex (i.e., when one has more than just structural recursion and simple iteration)

view this post on Zulip Chris Fallin (Jun 07 2021 at 17:17):

I might encourage approaching this problem from the opposite side: decide a stack-depth bound to deterministically enforce, then modify the engine so that it will dynamically (re-)allocate the stack up until this logical limit is exceeded

view this post on Zulip Chris Fallin (Jun 07 2021 at 17:18):

that trades off potential OOM at runtime for determinism of stack-overflow

view this post on Zulip Sergei Pepyakin (Jun 08 2021 at 12:47):

This is summary of my understanding:

  1. spillslot is an allocation of memory on the machine stack.
  2. conservatively we can state that there is at most one spillslot per live range
  3. we can conservatively assume that each variable lives from start to finish of the function.
  4. Therefore the number live ranges at all points is the number of variables, and thus we get the number of spillslots.
  5. if we assume a maximum size of a spillslot to be a constant C, we can get the maximum spill area size by multiplying C by the number of variables.

It seems that knowing this we will be able to compute a very conservative upper bound for the stack usage for a given logical stack size (described in the OP).

There is a catch though: the above uses the term variable referring to an SSA variable from Cranelift IR. They are created from stack operands and locals. As you mentioned it may be the maximum depth we are interested in. But besides that, translation of wasm into clif also produces temp variables. I assume we can find the absolute maximum of variables created by the translation or if it is unbounded impose application specific limits on wasm. Then knowing the number of instruction we can get the hard cap of the values produced by lowering wasm to clif.

As if it wasn't enough, there are some variables created by optimizations. If we had loop unrolling it may multiply all the variables found in the the loop.

Is my understanding roughly right?

Thanks for the idea of approaching this problem from the opposite side. Why we wanted to go with the instrumentation approach is because we would like to have some freedom in choosing an engine. That is a preference at this moment, not a strong requirement, so we may as well consider altering wasmtime for these purposes.

view this post on Zulip Chris Fallin (Jun 08 2021 at 19:57):

Hi @Sergei Pepyakin -- that's very close! The conservative upper bounds for live ranges are actually:

  1. We should assume that there is a new SSA value for every Wasm value definition ( == operand stack push), and for every value that is live across a control-flow-merge point (phi node), and for every Wasm local modification (local.set or local.tee opcode) or merge point thereof. A very conservative bound on that is (len(wasm opcodes) * (worst-case number of values pushed by wasm opcode) + (wasm local modifications)) * (number of basic blocks), though that is a very very loose bound because only a very sparse subset of values will be live at any given program point in a realistic program.

  2. Given that bound on the number of SSA values, the bound on the number of live ranges is, at most, one live range per SSA value per basic block (because a value can flow across discontiguous blocks and live-ranges are just contiguous ranges of instructions).

  3. Then, if the spillslot coalescer doesn't work, we have up to one spillslot (or N spillslots depending on types -- concretely, 2 spillslots for float/vector-typed values) per live range.

So that's O(len(program) ^ 3), effectively

view this post on Zulip Chris Fallin (Jun 08 2021 at 19:58):

Given that very loose bound, and the difficulty in making it tighter, I'd go as far as saying that the only practical way to do this would be to follow the logical limit exclusively and expand the physical stack to match, but possibly there are better ideas I've missed :-)

view this post on Zulip Sergei Pepyakin (Jun 09 2021 at 17:28):

This is immensely useful. Thank you for help!


Last updated: Dec 23 2024 at 13:07 UTC