TimonPost commented on issue #6804 (edited):
I used your branch to compile a large module of ~115MB, 3 times.
x86_64: 11.5228035s, 11.0294104s, 11.2740543s
aarch64: 12.8739ms, 17.9411ms, 14.3515ms
Where on main it was:
x86_64: 13.9465882s, 11.2770526s, 11.0808334s
aarch64: 157.5406149s, 144.1816011s, 144.1495896s
Seems like a very big improvement!
alexcrichton commented on issue #6804:
@TimonPost
x86_64: 11.5228035s, 11.0294104s, 11.2740543s
aarch64: 12.8739ms, 17.9411ms, 14.3515ms
Thanks for testing this! Would you be able to share a performance profile of the x86_64 compilation? Having a 100x slowdown for x86_64 is pretty surprising and may mean that there's lurking quadratic behavior there too worth looking into.
cfallin commented on issue #6804:
Given the other numbers, I suspect maybe the units were meant to be seconds rather than milliseconds on that line?
cfallin commented on issue #6804:
The quadratic-ness isn't necessarily inherent: as noted in https://github.com/bytecodealliance/wasmtime/issues/6798#issuecomment-1665539868, by representing fixup_records with a BinaryHeap it's possible to restore local reasoning about branches without quadratic behavior. That solution isn't viable, however, with the branch optimizations happening in the buffer.
Yeah, unfortunately the branch opts have to happen in the buffer -- that's important for compile-time performance and lifting them elsewhere would be a large complexity shift (on the order of months of work) in the remainder of the backend because of the lower-level branch forms. (As an example, @elliottt worked to remove the last vestiges of EBBs and make branches truly two-target in CLIF as well, and that was something like a few weeks of delicate full-time work with a few followup issues.) We've been there before, we learned!
That said, I don't know if I buy that the two issues are coupled to this degree. MachBuffer does both things -- branch folding and label fixups -- and the former has an invariant that any branches in the "most recent branches contiguous with end of buffer" list will have fixups -- but the invariant needs nothing (is vacuously true) if the recent-branches list is empty, and it is always empty after emitting an island. So I think one could do whatever one wants with a binary-heap or whatnot, leveraging that.
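For concreteness, here is a minimal sketch of that binary-heap idea (the deadline-as-u32 representation and the offsets are invented for illustration; this is not MachBuffer's actual code): keep pending fixups ordered by the last offset from which each branch can still reach its target, so that an island only promotes the fixups that are actually about to go out of range.

use std::cmp::Reverse;
use std::collections::BinaryHeap;

fn main() {
    // Pending branch fixups, reduced here to just their "deadline": the
    // last code offset from which the branch can still reach its target
    // without a veneer. `Reverse` turns the max-heap into a min-heap, so
    // the most urgent fixup is always on top.
    let mut fixups: BinaryHeap<Reverse<u32>> = BinaryHeap::new();
    for deadline in [4_096u32, 1_024, 1_000_000] {
        fixups.push(Reverse(deadline));
    }

    // When emitting an island, promote only the fixups whose deadline
    // falls before the next guaranteed island opportunity; everything
    // else stays pending instead of being flushed wholesale.
    let next_island_at: u32 = 8_192;
    while let Some(&Reverse(deadline)) = fixups.peek() {
        if deadline > next_island_at {
            break; // still reachable past the next island; keep pending
        }
        fixups.pop();
        println!("emit veneer for fixup with deadline {deadline}");
    }
    println!("{} fixups remain pending", fixups.len());
}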
However, there's also a complexity tradeoff here: I worry very much about bugs in this code (see the series of issues three summers ago that resulted in me writing the correctness proof in the comments) and I'd rather keep it using uniformly simple data structures where we can. "Nonlocal reasoning" sounds undesirable but that's blurring a bit what's going on here: basically we now have islands that cause every forward-crossing edge to go through a veneer. There are plenty of other ways that changes in some code can push some other code toward a different decision (regalloc, block ordering, side-effects blocking hoisting, ...) and we don't really worry about those nonlocal effects, or at least we accept them as normal.
Anyway, I'll update on Monday based on your comments -- thanks very much for the review!
TimonPost commented on issue #6804:
Sorry, I made an error: my script failed for aarch64 with 'target not supported' because I used the wrong feature flag to compile wasmtime. I re-ran the script:
use std::collections::hash_map::RandomState;
use std::collections::HashMap;
use std::process::Command;
use std::time::Instant;

fn main() {
    const SANDBOX_PATH: &'static str = "src/sandbox.wasm";
    let mut times: HashMap<String, std::time::Duration, RandomState> = HashMap::default();
    for arch in &["x86_64", "aarch64"] {
        let start = Instant::now();
        // Invoke `wasmtime compile` for this target and time it.
        let mut command = Command::new("wasmtime");
        command
            .arg("compile")
            .arg(format!("--target={arch}"))
            .arg(format!("-o=output-{arch}.wasm"))
            .arg(SANDBOX_PATH);
        println!("Start compiling on {}", arch);
        let child = command.spawn().unwrap();
        let output = child.wait_with_output().unwrap();
        times.insert(format!("sandbox-{}", arch), start.elapsed());
        println!(
            "[{:?}]: {arch} ({})",
            start.elapsed(),
            String::from_utf8(output.stdout).unwrap()
        );
    }
}
Running three times:
x86_64: 11.6531752s, 11.5904301s, 11.5270225s
aarch64: 12.006055s, 11.5779827s, 11.9106966s
Seems more plausible :)
Though indeed I wouldn't expect a compile time of milliseconds, so that was probably a bug. I can test on a real MacBook in 24 hours and see how the branch behaves.
alexcrichton commented on issue #6804:
@cfallin I don't think it's worth belaboring the point too much, but I think it's important for the code to carry more rationale than "otherwise it's quadratic". I personally find the resulting generated code pretty surprising: the purpose of veneers/promotion is to incrementally buy more distance, and that's not what happens here, because everything is "flushed" all at once whether or not we're about to go out of range of a branch. At least to me it's surprising that, given the otherwise meticulous attention to detail throughout the file, this seemingly-large detail is glossed over so quickly.
And again, to clarify: I'm not advocating for removing branch optimizations from the mach buffer here and now. I realize that would be a deep refactoring requiring quite some work, and that the current design is the result of historical work and balancing efforts. I still believe, though, that this probably isn't the best place for these optimizations: it's extremely late in the pipeline, and we're already unable to perform optimizations such as constant-branch-folding, with knock-on effects from that. For example, I'd imagine that if constant-branch-folding were implemented, many of the optimizations the mach buffer does would fall out of it naturally, and the mach buffer might not need to be as clever as it is today.
I mostly want to clarify because it sounds like you're advocating that this must stay as-is and that there's no viable alternative to the mach buffer's optimizations today; I'm not personally convinced of that myself. In any case, this is all orthogonal to the compilation-time performance gains and is probably best left for a future issue/discussion rather than here.
@TimonPost thanks for double-checking, and those new numbers make sense! It's true that 12 seconds still feels a bit high (although 115MB is a big module too). If you'd like, it might be worth digging into the performance profile there. Locally, in my example program, most of the time was spent in b-trees in register allocation, which I don't know how to improve. If that's the case for your module too, it's probably at the limit of the speed it's going to reach for now.
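One way to dig in, as a sketch (not from the thread): reproduce the compile in-process via the wasmtime crate's public API, so that a sampling profiler such as perf can attribute time to Cranelift internals. The module path mirrors the script above; the target triple and the anyhow dependency are assumptions.

use std::time::Instant;

fn main() -> anyhow::Result<()> {
    // Compile the module in-process so a sampling profiler (perf,
    // Instruments, ...) can attribute the time to Cranelift internals.
    let wasm = std::fs::read("src/sandbox.wasm")?; // path from the script above
    let mut config = wasmtime::Config::new();
    config.target("x86_64-unknown-linux-gnu")?; // assumed host triple

    let engine = wasmtime::Engine::new(&config)?;
    let start = Instant::now();
    let _machine_code = engine.precompile_module(&wasm)?;
    println!("compiled in {:?}", start.elapsed());
    Ok(())
}

Running this under perf record should show whether register allocation's b-trees dominate in this module too.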
TimonPost commented on issue #6804 (edited):
I might do some more profiling to see where most of the time is spent now. Also, compiling the workspace with the release profile plus CARGO_PROFILE_RELEASE_OPT_LEVEL=z and CARGO_INCREMENTAL=1 helps us optimize for binary size: ~59MB instead of the 115MB. So after this PR there are no big issues left.
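For reference, a sketch of driving such a size-optimized build from a script (the cargo invocation and the wasm32-wasi target are assumptions; the environment variables are the ones named above):

use std::process::Command;

fn main() {
    // CARGO_PROFILE_RELEASE_OPT_LEVEL=z is the environment-variable
    // equivalent of `opt-level = "z"` in [profile.release]; together
    // with CARGO_INCREMENTAL=1 this trades runtime speed for a much
    // smaller .wasm binary.
    let status = Command::new("cargo")
        .args(["build", "--release", "--target=wasm32-wasi"]) // target is an assumption
        .env("CARGO_PROFILE_RELEASE_OPT_LEVEL", "z")
        .env("CARGO_INCREMENTAL", "1")
        .status()
        .expect("failed to spawn cargo");
    assert!(status.success(), "cargo build failed");
}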
cfallin commented on issue #6804:
Thanks @alexcrichton -- I filed #6818 to continue the larger discussion and added a more detailed "why" section to the top doc-comment here. I think actually doing deadlines properly might not be so bad, after more thought; will try to tackle this after some other higher-priority things.
alexcrichton commented on issue #6804:
Looks good to me, and thanks for opening an issue! I'll read over that later today :+1: