@Andrew Brown @Alex Crichton -- following up on the discussion of TLB shootdowns and a new Intel ISA feature to do this without IPIs -- for reference, here's the single instruction in AArch64 that does this (from the macOS kernel): https://github.com/apple/darwin-xnu/blob/2ff845c2e033bd0ff64b5b6aa6063a1f8f65aa32/osfmk/arm64/tlb.h#L207
AFAICT, this is standardized Arm, too, not an Apple extension
oh nice, I'm trying to profile locally if I can get good scaling but I'm seeing it level off around 4 cores (ish) again, sort of hard to test though b/c there's no taskset on macos
I know though that on aarch64 linux I don't see great scaling
I seem to remember trying this once on aarch64 and finding good scaling (or at least, not seeing IPIs); I think aarch64 linux uses that instruction too; but maybe it's a more recent ISA level?
on the aarch64 ba server in a perf profile I see __flush_tlb_range at the top and a nop instruction after tlbi vale1is, x1 has some samples
the hottest instruction in that function though is a dsb ish
ah, so that's probably a fence to give synchronous semantics (other threads observe the new mappings as soon as this syscall returns); still expensive on some cores, darn
yeah, it waits (data sync barrier) for every pending operation related to cache, tlb, branch predictors, and that kind of stuff. "ish" means the inner shareable domain, i.e. it applies across the cores within the same processor but not necessarily other devices in the same soc
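For anyone who wants to poke at the scaling question above without a full profile, here is a rough sketch of a microbenchmark, under stated assumptions. It is not what anyone in this thread ran: the names map_unmap_rate, worker, and worker_arg are made up for illustration, and a map/touch/unmap loop on one page is only an assumed stand-in for the real workload. The idea is that unmapping a touched page generally makes the kernel invalidate that translation on other cores too, so the aggregate rate should flatten out if shootdowns are the bottleneck.

```c
// Hypothetical microbenchmark, not taken from the discussion: each thread
// repeatedly maps, touches, and unmaps one anonymous page. Unmapping a touched
// page generally requires the kernel to invalidate that translation on other
// cores that may have cached it.
#define _DEFAULT_SOURCE
#include <pthread.h>
#include <stddef.h>
#include <sys/mman.h>
#include <time.h>
#include <unistd.h>

#define MAX_THREADS 64

struct worker_arg {
    int iters;    // map/unmap cycles this thread should run
    size_t page;  // page size in bytes
    int failed;   // set to 1 if mmap or munmap fails
};

static void *worker(void *p) {
    struct worker_arg *arg = p;
    for (int i = 0; i < arg->iters; i++) {
        char *m = mmap(NULL, arg->page, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANON, -1, 0);
        if (m == MAP_FAILED) { arg->failed = 1; break; }
        *(volatile char *)m = 1;  // touch so a translation is actually installed
        if (munmap(m, arg->page) != 0) { arg->failed = 1; break; }
    }
    return NULL;
}

// Returns total map/unmap cycles per second across nthreads, or -1.0 on error.
double map_unmap_rate(int nthreads, int iters) {
    if (nthreads < 1 || nthreads > MAX_THREADS || iters < 1) return -1.0;

    pthread_t tids[MAX_THREADS];
    struct worker_arg args[MAX_THREADS];
    struct timespec t0, t1;
    size_t page = (size_t)sysconf(_SC_PAGESIZE);

    clock_gettime(CLOCK_MONOTONIC, &t0);
    int started = 0;
    for (; started < nthreads; started++) {
        args[started] = (struct worker_arg){ iters, page, 0 };
        if (pthread_create(&tids[started], NULL, worker, &args[started]) != 0)
            break;
    }
    int failed = (started != nthreads);
    for (int i = 0; i < started; i++) {
        pthread_join(tids[i], NULL);
        failed |= args[i].failed;
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    if (failed) return -1.0;

    double secs = (double)(t1.tv_sec - t0.tv_sec) +
                  (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
    if (secs <= 0.0) return -1.0;
    return (double)nthreads * (double)iters / secs;
}
```

Calling map_unmap_rate(n, iters) from a small main for n = 1, 2, 4, 8 and comparing the returned rates gives a crude scaling curve. Since there's no taskset on macos the threads can't be pinned there, so treat the numbers as indicative only.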
Last updated: Nov 22 2024 at 16:03 UTC