sparker-arm opened PR #4200 from dynamic-types to main:
Introduce a new concept in the IR that allows a producer to create
dynamic vector types. An IR function can now contain global value(s)
that represent a dynamic scaling factor, for a given fixed-width
vector type. A dynamic type is then created by 'multiplying' the
corresponding global value with a fixed-width type. These new types
can be used just like the existing types and the type system has a
set of hard-coded dynamic types, such as I32X4XN, which the user
defined types map onto. The dynamic types are also used explicitly
to create dynamic stack slots, which have no set size like their
existing counterparts. New IR instructions are added to access these
new stack entities.

Currently, during codegen, the dynamic scaling factor has to be
lowered to a constant so the dynamic slots do eventually have a
compile-time known size, as do spill slots.

The current lowering for aarch64 just targets Neon, using a dynamic
scale of 1.

<!--
Please ensure that the following steps are all taken care of before submitting
the PR.

- [ ] This has been discussed in issue #..., or if not, please tell us why
  here.
- [ ] A short description of what this does, why it is needed; if the
  description becomes long, the matter should probably be discussed in an issue
  first.
- [ ] This PR contains test cases, if meaningful.
- [ ] A reviewer from the core maintainer team has been assigned for this PR.
  If you don't know who could review this, please indicate so. The list of
  suggested reviewers on the right can help you.

Please ensure all communication adheres to the code of conduct.
-->
sparker-arm requested cfallin for a review on PR #4200.
sparker-arm requested abrown for a review on PR #4200.
cfallin submitted PR review.
cfallin submitted PR review.
cfallin created PR review comment:
Can you say more about the "minimum lane count" concept -- why it exists, and cases where one might provide different minimum lane counts? And why only {1/2, 1, 2} as factors?
In particular if this comes from a machine restriction, I'm somewhat concerned that we would be baking a specific set of details into the type-system framework forever. I'd rather have a somewhat higher-level notion of dynamic types in CLIF, e.g. `i32xN`; and then when we evaluate the `dyn_scale_...` global value, we can either make whatever guarantees we need for `N` if the value is at the discretion of the ISA backend, or for values loaded from the user, we can place additional restrictions as needed (e.g., a backend provides the embedder a "minimum multiple for dynamic vector size" and the embedder uses this).
cfallin created PR review comment:
This is slightly unclear to me: does this mean that it takes the size of the target's register width and divides by `vector_type`? I.e., that it has a concrete value determined by the compiler configuration / target microarchitecture? Or is it passed in / determined in some other way?

Or I guess my confusion comes from: I would expect `DynScale` as a name to imply that this is some generic sort of scale parameter, while this seems to say that it is a specific scale value determined by some algorithm ("what will fit in ..."). So maybe a better name would help: `DynScaleTargetReg`? And in the future we may have some other `DynScale...` options based on loads of parameters from vmctx, or a special vector-length register, or a `DynScaleConstant`, or ...
sparker-arm submitted PR review.
sparker-arm created PR review comment:
> does this mean that it takes the size of the target's register width and divides by `vector_type`?
Yes. So for a given dynamic type, the backend will scale the fixed part of the type by `DynScale`. The name is likely misleading.

I feel we shouldn't need to prescribe different global values for constant scale or truly dynamic ones via a register, as it's not immediately obvious, to me, why this would be helpful at the IR level. I feel we should be able to encapsulate the semantics we want with a single construct and allow a backend to lower it how it wishes... but...
It would almost certainly make it far easier to implement the machinst ABI and, as that is a particular point of pain, it would be good to make that as simple as possible :)
For this first implementation we could rename it to `DynScaleConstant`, which would much better describe the current limitations.
sparker-arm created PR review comment:
The minimum lane count comes from the fact that we build a dynamic vector type from a fixed base vector type. The fixed base type provides the minimum lane count.
I added the halving/doubling to try to mimic the functionality of the existing SIMD types, and the ones commonly used, but I don't think it's needed and possibly doesn't make sense. But the main reason was for encoding space...
My problem here was that Cranelift has a very large range of SIMD types, which made it hard, for me, to add another group of types with a reasonable range. What I wanted to do was just to use a single bit, along with the existing SIMD types, to express a dynamic vector type, but I struggled and failed there... So I'm very much happy to hear suggestions; I would really prefer not to use this 2-bit system.
sparker-arm submitted PR review.
cfallin submitted PR review.
cfallin created PR review comment:
What if we define dynamic vector types not on top of a base fixed-vector type, but on top of the lane type alone, and then make the dynamic factor (the symbolic `N`) provide a bit width, rather than a multiplier? This is sort of starting from what I was getting at with `i32xN` above (rather than `i32x4xN`) but then shifting the meaning of `N` a bit so that we can have the same `N` (symbolically at least; in practice this is all resolved at compile time) for every lane type.

Maybe we need a slightly different notation to make this more clear: perhaps (please suggest other punctuation, anyone!) `i32/N` (`i32`s over `N` bits). Then we could have:

```
N = dyn_width_constant 1024
dt0 = i32/N
dt1 = i64/N
v1 := load.dt0 ...
v2 = bitcast.dt1 v1
store.dt2 v2, ...
```

I like this because it removes any notion of "minimum lane count" or some other factor (why `4` in `i32x4xN`, other than the fact that historically we started with 128-bit vectors and we're expanding from there?). We just have the important bits: the lane type (`i32`), and the machine width we're scaling to (`N`). Thoughts?
sparker-arm submitted PR review.
sparker-arm created PR review comment:
I'm not sure I follow... for one, with `dyn_width_constant` taking an immediate value, shouldn't we really just use fixed sizes at the IR level? This doesn't appear to enable anything dynamic.

The other problem is that I think we should aim to enable writing clif in dynamic terms without losing too much of the functionality of fixed widths -- otherwise people have to make a hard choice of whether to use it or not. I think with your suggestion, we now only have types that use the full width of the register and so we wouldn't be able to support widen and narrow operations?
cfallin submitted PR review.
cfallin created PR review comment:
> I'm not sure I follow... for one, with `dyn_width_constant` taking an immediate value, shouldn't we really just use fixed sizes at the IR level? This doesn't appear to enable anything dynamic.

Ah, so I didn't mean for the `dyn_width_constant` to be the interesting part of that example; perhaps it results from a pass that legalizes to a particular microarchitecture with a known width. The point was to have the global value that defines machine width, as we've had so far; maybe `dyn_width_vl` for a vector-length register or somesuch...

The basic idea I was going for was to have a parameter for the machine width, and say "vector of lane type T up to width `N`". Whereas the `i64x2xN` sort of syntax says "an `N`-times multiple of a base 128-bit vector machine". Just holding a different set of two of the three variables in "machine width = lane width * lanes" constant, I guess.

Said another way, the existing vector types `i8x16`, `i16x8`, `i32x4`, `i64x2` can be written as `i8/128`, `i16/128`, `i32/128`, `i64/128`; and then this is making that width configurable, rather than the multiplier over a 128-bit base width configurable.

That said...

> The other problem is that I think we should aim to enable writing clif in dynamic terms without losing too much of the functionality of fixed widths -- otherwise people have to make a hard choice of whether to use it or not.

I also see the advantages of the `i64x2xN` approach with respect to the existing SIMD types: we're basically "slapping a multiplier on it" and everything else remains the same. The downside is the encoding and from-first-principles comprehensibility of the types.

All else being equal, I think either could work, and an easier migration path is important; and you've done the work to prototype the latter; so `i64x2xN` is totally workable. I won't push the above ideas too hard against it :-)

> I think with your suggestion, we now only have types that use the full width of the register and so we wouldn't be able to support widen and narrow operations?

Ah so here it actually gets interesting: what happens to the {1/2, 1, 2} multipliers if we do two narrow or widen ops in a row?
sparker-arm created PR review comment:
> Ah so here it actually gets interesting: what happens to the {1/2, 1, 2} multipliers if we do two narrow or widen ops in a row?

Everything breaks :)

Although it would also break for fixed vectors after a certain amount of doubling too. We're never going to be able to support everything with a small, fixed, number of types, but I would really like dynamic types to have the same limitations as their fixed-width counterparts. I will look at the encoding again, but I think this is tough with how many vector types already exist. I take it that moving to a u16 is out of the question...?

I'll also prototype support for widen and narrow, because now I'm wondering how that will actually work... I assume we shouldn't support an implicit result type, unless that type has already been created for the function?
sparker-arm submitted PR review.
sparker-arm created PR review comment:
So, I've currently got this encoding scheme:

```
// 0x00-0x0d: Lane types
// 0x0e-0x0f: Reference types
// 0x10-0x7f: Vector types
// 0x80-0xef: Dynamic vector types (excluding refs, I128 and all bools except B1)
// 0xf0-0xfe: Special types
// 0xff: Void
```

And this is pretty ugly... it requires:
- Everything to be shuffled around (maybe unnecessarily...)
- Both fixed and dynamic vectors to be limited to 128 lanes, instead of 256.
- Dynamic vectors can only use 3 bits for the `lane_type`, which has meant I've re-encoded lane types so that the ones I want to support use the bottom 3 bits.

So, this poses two questions with our existing types:
- Do we have a strong reason to support so many bool types?
- Do we have a strong reason to support a max of 256 lanes? @bjorn3, you raised concerns about limiting lanes to 128 so I'd like to hear your opinion too.
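For concreteness, the proposed byte ranges can be decoded with a small hypothetical classifier. The ranges below are taken verbatim from the comment above; this is an illustration only, not Cranelift's actual `Type` representation:

```rust
// Decode a type byte into its class, per the *proposed* (and later
// reconsidered) u8 encoding ranges from this discussion.
fn classify(byte: u8) -> &'static str {
    match byte {
        0x00..=0x0d => "lane type",
        0x0e..=0x0f => "reference type",
        0x10..=0x7f => "fixed vector type",
        0x80..=0xef => "dynamic vector type",
        0xf0..=0xfe => "special type",
        0xff => "void",
    }
}

fn main() {
    // The ranges partition the whole u8 space, so the match is exhaustive.
    println!("{}", classify(0x20));
    println!("{}", classify(0x90));
}
```

Note how little room this layout leaves: dynamic vectors occupy 0x80-0xef, which is why the lane type is squeezed into 3 bits.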
sparker-arm submitted PR review.
cfallin submitted PR review.
cfallin created PR review comment:
I think it could be worthwhile to run some quick experiments with `Type` as a `u16` instead. It may pessimize a tiny bit where we have a `Vec<Type>` rather than a `Type` as a field (in the latter case we probably have some slack with padding) but if it's in the noise or say 0.1% or something, it seems like it would make this work substantially easier...

> I'll also prototype support for widen and narrow, because now I'm wondering how that will actually work... I assume we shouldn't support an implicit result type, unless that type has already been created for the function?

Yes, actually, now that I think about it more, this seems right (and like a small difference from existing fixed SIMD types): one will need to explicitly declare the types on both sides. I'm curious to see how that will work out wrt our instruction constraint system; I don't have enough details paged in right now to say for sure if there will be any problems...
cfallin submitted PR review.
cfallin created PR review comment:
> where we have a `Vec<Type>`

As a followup after an astute question by @jameysharp, it turns out this is likely not an issue. I had for some reason thought that we kept types in a struct-of-arrays sort of way (`SecondaryMap<Value, Type>` for types of every value in the `DataFlowGraph`) but we actually just stuff the `Type` into the `ValueData` here, so we almost certainly have the padding to absorb a `u16` instead. (Thanks Jamey!)
sparker-arm submitted PR review.
sparker-arm created PR review comment:
Sounds good to me!
sparker-arm updated PR #4200 from dynamic-types to main.
bjorn3 created PR review comment:
Could you please keep the original name? This is an unnecessary breaking change that would break the nightly cranelift tests of cg_clif.
bjorn3 submitted PR review.
sparker-arm created PR review comment:
I have no strong opinion, but opposing feedback that I've previously received from @fitzgen (I can't remember in which area) was that we should clearly differentiate between fixed and dynamic. If that was just around type naming, then it is a non-issue and I'm happy to revert. It's already been a problem for me while rebasing!
sparker-arm submitted PR review.
sparker-arm created PR review comment:
@cfallin What should I be running, or looking at, to get the metrics that you're interested in?
sparker-arm submitted PR review.
cfallin submitted PR review.
cfallin created PR review comment:
Any sort of profiling (even just a `perf stat ...`) on a compilation of a large module (`benchmarks-next/spidermonkey/benchmark.wasm` from the Sightglass repo?) should be enough to see if making `Type` a `u16` has any impact, I think. Though given the above I expect it shouldn't and would be surprised if we see anything more than a blip...
cfallin created PR review comment:
I'd prefer we have the explicit name as well; @bjorn3 each Cranelift upgrade is a semver break currently and our APIs are still considered unstable. We shouldn't needlessly break compatibility, but getting things right and finding clearer names and less error-prone APIs is a higher priority right now.
cfallin submitted PR review.
bjorn3 created PR review comment:
I guess so @cfallin.
bjorn3 submitted PR review.
sparker-arm created PR review comment:
Taking the example that you suggested, compilation is 0.7% slower when using u16. This is taking the mean over three runs. I'm using a laptop... so not the best benchmarking setup, but hopefully this sounds like a low enough impact?
sparker-arm submitted PR review.
cfallin created PR review comment:
Hmm, that's actually a bit higher than I was expecting. Wins from significant efforts can sometimes be 5-10% (and e.g. RA2 with ~6 months of work was 10-20% compile time reduction), so throwing away 0.7% here should be avoided if we can find a way around it at all.
I suspect it might be this `Vec<Type>` in the `VCode`. Could you run another experiment where you keep `Type` at `u8` but just inflate that to `u16` (maybe make it a `Vec<(Type, u8)>` and the minimal hacks to stuff the extra `u8` in)?

If it turns out to be that `Vec`, we could keep it a `Vec<u8>`, reserve one `u8` value for `>= 0xff`, and keep a `FxHashMap<VReg, Type>` sparse map for the "out of bounds" types, if that makes sense.

Otherwise, would you be willing to hunt down the perf delta a bit and see if it's coming from somewhere else?
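The dense-vector-plus-sparse-map fallback sketched here could look roughly like the following. This is a minimal illustration, not Cranelift's code: it uses std's `HashMap` in place of `FxHashMap`, plain `usize`/`u16` in place of `VReg`/`Type`, and the struct and method names are invented:

```rust
use std::collections::HashMap;

/// One byte per vreg in the common case; 0xff escapes to a sparse side map.
const ESCAPE: u8 = 0xff;

struct VRegTypes {
    dense: Vec<u8>,              // common case: type code fits in a byte
    sparse: HashMap<usize, u16>, // overflow: vreg index -> full 16-bit type
}

impl VRegTypes {
    fn new() -> Self {
        Self { dense: Vec::new(), sparse: HashMap::new() }
    }

    fn push(&mut self, ty: u16) {
        let idx = self.dense.len();
        if ty < ESCAPE as u16 {
            // Fits in a byte: store inline, no heap side-table entry.
            self.dense.push(ty as u8);
        } else {
            // Out-of-range type: store the escape marker and the real
            // value in the sparse map.
            self.dense.push(ESCAPE);
            self.sparse.insert(idx, ty);
        }
    }

    fn get(&self, idx: usize) -> u16 {
        match self.dense[idx] {
            ESCAPE => self.sparse[&idx],
            b => b as u16,
        }
    }
}

fn main() {
    let mut types = VRegTypes::new();
    types.push(0x12);   // stored inline in the dense Vec<u8>
    types.push(0x1234); // escaped into the sparse map
    assert_eq!(types.get(0), 0x12);
    assert_eq!(types.get(1), 0x1234);
}
```

The appeal of the design is that the dense table stays one byte per entry, so the cache footprint is unchanged as long as the rare dynamic types are the only ones needing the escape path.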
cfallin submitted PR review.
sparker-arm submitted PR review.
sparker-arm created PR review comment:
Okay, I will have a look. FWIW, I've also just tested on a more suitable benchmarking machine, and I now see a 0.3% slowdown over 5 runs.
sparker-arm created PR review comment:
> Could you run another experiment where you keep `Type` at `u8` but just inflate that to `u16` (maybe make it a `Vec<(Type, u8)>`)

This actually gave a small speedup instead!

> we actually just stuff the `Type` into the `ValueData` here so we almost certainly have the padding to absorb a `u16` instead

Unfortunately, this doesn't appear to be the case; at least, this appears to be where the regression is coming from. perf showed `DataFlowGraph::value_def` taking a longer time, so I converted `ty` into `(Type, u8)` in the `ValueData` enum and saw a 1.4% slowdown. Setting `repr(align(8))` fixed this regression, but not for the original case of `Type(u16)`...

However, using `num: usize` for `ValueData::Inst` and `ValueData::Param` reduces the regression to 0.19% when using `u16`.
sparker-arm submitted PR review.
cfallin created PR review comment:
Ah, I think I missed that the enum discriminant in `ValueData` is an implicit `u8`, so right now it fits in a `u64` but making `Type` a `u16` inflates that (probably at least to 12 bytes if not 16?).

I have some thoughts on bitpacking and giving `Type` 14 bits instead; I'll do a PR and do some measurements, and get this in as a separate refactor if it works to pave the way for your work :-)
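The size inflation under discussion is easy to reproduce with mock enums. These are illustrative stand-ins shaped like `ValueData`, not the real Cranelift definitions, and the exact sizes depend on rustc's (unspecified) layout algorithm:

```rust
use std::mem::size_of;

// With a u8 type field: implicit u8 tag + u8 + u16 + u32 = 8 bytes,
// which packs exactly into a u64.
#[allow(dead_code)]
enum ValueDataU8 {
    Inst { ty: u8, num: u16, inst: u32 },
    Param { ty: u8, num: u16, block: u32 },
}

// With a u16 type field: tag + u16 + u16 + u32 = 9 bytes of payload,
// which alignment pads out past 8 bytes (to 12 on current rustc).
#[allow(dead_code)]
enum ValueDataU16 {
    Inst { ty: u16, num: u16, inst: u32 },
    Param { ty: u16, num: u16, block: u32 },
}

fn main() {
    println!("u8 Type:  {} bytes", size_of::<ValueDataU8>());
    println!("u16 Type: {} bytes", size_of::<ValueDataU16>());
}
```

This is why bitpacking `Type` into 14 bits alongside the discriminant is attractive: it keeps the whole enum within the original 8 bytes.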
cfallin submitted PR review.
cfallin edited PR review comment.
abrown created PR review comment:
Maybe I missed this at some point, but why do we need new stack manipulation instructions? Wouldn't the old instructions have the attached dynamic type which would be enough to determine where the dynamic slots exist?
abrown submitted PR review.
cfallin submitted PR review.
cfallin created PR review comment:
The issue is the operand typing -- these take a `DynamicStackSlot` rather than a `StackSlot`, and that distinction in turn is needed for the ABI code. We could have made each `StackSlot` be polymorphic over statically-sized or dynamically-sized type, though that adds a bit of complication. Given that we do the layout differently for each, and lower each slightly differently, I think it's reasonable to keep them separated, but I don't feel too strongly about that.
cfallin submitted PR review.
cfallin created PR review comment:
@sparker-arm I took a look at this just now in #4269. The impact on compiling SpiderMonkey.wasm was actually 4.5%, which is quite serious and so we definitely need to do something here -- the bitpacking in that PR gets it down to 1.0% but I'm hoping we can do better (ideally I wouldn't want to pay more than 0.1% for this, given that compile-time wins are generally hard to come by).
sparker-arm submitted PR review.
sparker-arm created PR review comment:
So, I re-ran my experiments, looking at reported time instead of raw cycles, and I see u16 causing a 1.1% or 1.3% slowdown, depending on whether I use the mean or minimum values. (I don't understand why cycles and time are so different...) And I'm a bit concerned that your results are significantly different from mine. What machine are you running on? And are you running sightglass in a way to mimic a 'real-life' scenario, with the commands `--iterations-per-process 10 --processes 2`?

To add to my confusion, when I benchmark your 14-bit patch I confirm your results, with a 0.78% slowdown...
cfallin submitted PR review.
cfallin created PR review comment:
Results for me were on a desktop Ryzen 3900X 12-core system. The iteration count and process count were a tradeoff to get some results and not wait forever, but all else equal, larger counts are better (it lets Sightglass get a tighter confidence interval). If run with just two engines rather than three it'll do a statistical-significance test too.
Compilation input can make a big difference in slowdowns observed (e.g. if functions get big enough for IR data structures to exceed L1), are you testing with SpiderMonkey.wasm or something else?
sparker-arm created PR review comment:
Yeah, I'm using spidermonkey.wasm and I'm on a big and shiny Ampere Altra. Maybe this has something to do with x64 vs aarch64, because I assume that the size of L1 caches varies very little between CPUs these days. Considering the benchmark takes < 10 seconds, is the tradeoff really necessary...?
sparker-arm submitted PR review.
sparker-arm updated PR #4200 from dynamic-types to main.
cfallin submitted PR review.
cfallin created PR review comment:
followup issue or assert? (I'm not sure I fully understand the TODO)
cfallin created PR review comment:
constant rather than magic `0x100` here?
cfallin created PR review comment:
The `value_reg` and `put_in_reg` should be unnecessary now thanks to implicit conversions, if you want to remove them!
cfallin submitted PR review.
cfallin created PR review comment:
I'd prefer for these new lowerings to be in ISLE, if that's not too much trouble...
cfallin created PR review comment:
Can we file a followup issue to track what to do about dynamic vector types as args?
(Am I assuming right that usually they won't be args, but rather the args themselves will be pointers to the data we use with dynamically-sized vectors?)
cfallin created PR review comment:
Commented-out code removed or fixed?
cfallin created PR review comment:
commented-out code
cfallin created PR review comment:
I may be missing it but where does the global value (vector scale) actually come in here to determine the size of the dynamically-sized type?
cfallin created PR review comment:
Noting the FIXME here
cfallin created PR review comment:
remove?
cfallin created PR review comment:
Commented-out code here should be removed.
cfallin created PR review comment:
commented-out code
cfallin created PR review comment:
Yes, I think `DynScaleTargetWidth` or something might be best here. As to why different global values for the different cases -- if I'm understanding the question right, the answer is that the details are really ABI-visible (e.g., is there a hidden register that controls how much data we read at once that must be set up? is it part of a vmcontext in a runtime?) so they should be surfaced in the IR. And having it clearly here also lets us define CLIF semantics for it, which is useful for interpreters and for formal verification.

Anyway, with the above rename I think this is settled for now!
sparker-arm submitted PR review.
sparker-arm created PR review comment:
AFAIK, yes. It's either in a scalable register (data or predicate) or we pass a pointer.
sparker-arm created PR review comment:
Yes, this is a problem, as they are currently disjoint. I've currently renamed this to `dynamic_vector_bytes` to differentiate it from the scaling factor. The only way to tie them together, that I can think of, is to pass the value from `TargetIsa` when legalizing the `GlobalValue`, which I'll try out.
sparker-arm submitted PR review.
sparker-arm updated PR #4200 from dynamic-types to main.

sparker-arm updated PR #4200 from dynamic-types to main.

sparker-arm updated PR #4200 from dynamic-types to main.
cfallin submitted PR review.
cfallin created PR review comment:
OK, I think this is fine to start, but let's add a TODO and a tracking issue for truly dynamic stack frame layout based on some runtime value. At least, if I understand correctly, one option all along has been for runtime-determined size (from a vmcontext configuration field or some other input), which is what motivated us to take the GV-based approach; so we should build that eventually. I'm happy to see it done incrementally though in this case.
cfallin submitted PR review.
sparker-arm updated PR #4200 from dynamic-types to main.
cfallin merged PR #4200.
Last updated: Dec 23 2024 at 12:05 UTC