RFR (M) CR 8050147: StoreLoad barrier interferes with stack usages
dave.dice at oracle.com
Tue Jul 22 22:08:29 UTC 2014
On 2014-7-22, at 5:38 PM, Aleksey Shipilev <aleksey.shipilev at oracle.com> wrote:
> (the patch itself is XS, but the explanation is M)
> This is a follow-up to the issue discovered earlier. In tight,
> performance-sensitive loops the StoreLoad barrier in the form of "lock addl
> (%esp+0), 0" interferes with stack users:
> I used the experimental patch:
> ...to juggle different StoreLoad barrier strategies:
> 1) mfence:
> we know it is slow on almost all platforms, keep this as control
Admittedly old data, but related: https://blogs.oracle.com/dave/entry/instruction_selection_for_volatile_fences
Over time, mfence has been freighted with additional semantics.
For 64-bit mode it’d be useful to try the “xchg rThread, rThread->Self” idiom, where the Self field points to the enclosing thread structure in a self-referential fashion. I’ve seen good results from that form.
And of course if we’re willing to kill a register — a register that might be on the verge of becoming dead anyway — then replacing [ST; fence-idiom] with XCHG might be interesting.
> 2) lock addl (%esp), 0:
> current default
> 3) lock addl (%esp-8), 0:
> unties the data dependency against (%esp) users
> 4) lock addl (%esp-CacheLine-8), 0:
> unties the data dependency, and also steps back a cache line
> to untie possible memory contention of (%esp) users and our
> locked instructions
> 5) lock addl (%esp-RedZone-8), 0:
> Not sure how it interacts with our calling convention, but the
> System V ABI tells us there is a "red zone" of 128 bytes below the
> stack pointer which is always preserved by interrupt handlers,
> etc. It seems we can use the red zone for scratch data. Dave Dice
> suggests that idempotent operations in the red zone are benign
> anyway. But in case we have a problem with the red zone, we can
> fall back to this mode.
> Targeted benchmarks show that the issue only manifests in tight
> loops where the users of (%esp) are very close to the StoreLoad barrier.
> By carefully backing off between the StoreLoad barrier and the users of
> (%esp), we were able to quantify where different StoreLoad strategies
> benefit. We use the same targeted benchmark (two prebuilt JARs are also available).
> Running this on different machine architectures yields more or less
> consistent results across the strategies. The links below point to charted
> data in PNG format, but they are also available in SVG, as well in raw
> data form, in the same folder. The graphs show how the throughput of
> volatile write + backoff changes with different backoffs. Lines are
> different StoreLoad strategies. Tiles correspond to the number of
> measurement threads doing the loop.
> * 1x1x2 Intel Atom Z530 seems completely oblivious to the memory barrier
> issue. This seems to be due to the generated code in 32-bit mode, which
> completely pads away the StoreLoad barrier costs -- notice the tens of
> nanoseconds per write:
> * 2x16 AMD Opteron (Abu Dhabi) benefits greatly from offsetting %esp.
> We can also quantify the area where the interference occurs. It spans
> the backoff range [0..10], which is roughly [0..30] instructions
> between the stack user and the StoreLoad barrier:
> * 2x16 AMD Opteron (Interlagos) tests paint the same picture (it is
> remarkable that mfence is consistently behind):
> * 1x4x1 Intel Core (Haswell-DT) benefits from offsetting %esp as well.
> There is an interesting hump at lower backoffs with addl (%esp-8), which
> seems to be a second-order microarchitectural effect. Unfortunately, we
> don't have large Haswells available at the moment to dive into this:
> * 1x2x2 Intel Core (Sandy Bridge), anecdotal evidence from my laptop
> also shows offsetting %esp helps:
> Our large reference workloads running on reference performance servers
> show no statistically significant improvements/regressions with either
> mode. I think this is because a) in large workloads the padding between
> stack users and StoreLoads is beefy; and b) there are not that many
> StoreLoads in performance benchmarks (survivorship bias).
> Selected ForkJoin-rich microbenchmarks show a good improvement for all
> (%esp-offset) modes.
> Having this data, I propose we switch either to "lock addl (%esp-8), 0",
> optimistically thinking there are no second-order effects with sharing
> the cache lines:
> ...or "lock addl (%esp-CL-8), 0", pessimistically padding away from
> stack users:
> Both patches pass full JPRT cycle.
More information about the hotspot-compiler-dev mailing list