[concurrency-interest] RFR: 8065804: JEP 171:Clarifications/corrections for fence intrinsics
stephan.diestelhorst at gmail.com
Tue Nov 25 23:52:42 UTC 2014
David Holmes wrote:
> Stephan Diestelhorst writes:
> > On Tuesday, 25 November 2014, 11:15:36, Hans Boehm wrote:
> > > I'm no hardware architect, but fundamentally it seems to me that
> > >
> > > load x
> > > acquire_fence
> > >
> > > imposes a much more stringent constraint than
> > >
> > > load_acquire x
> > >
> > > Consider the case in which the load from x is an L1 hit, but a
> > > preceding load (from say y) is a long-latency miss. If we enforce
> > > ordering by just waiting for completion of prior operation, the
> > > former has to wait for the load from y to complete; while the
> > > latter doesn't. I find it hard to believe that this doesn't leave
> > > an appreciable amount of performance on the table, at least for
> > > some interesting microarchitectures.
> > I agree, Hans, that this is a reasonable assumption. Load_acquire x
> > does allow roach motel, whereas the acquire fence does not.
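For readers who want the two forms side by side, here is a minimal Java sketch. It assumes the Java 9+ VarHandle-era API (AtomicInteger.getAcquire/getOpaque and VarHandle.acquireFence), which postdates this thread, and the class, field, and method names are made up for illustration. It only shows how each form is written and that both publish the payload correctly; it does not measure the performance difference Hans describes.

```java
import java.lang.invoke.VarHandle;
import java.util.concurrent.atomic.AtomicInteger;

public class AcquireVariants {
    static int payload;                                   // plain field, published via 'ready'
    static final AtomicInteger ready = new AtomicInteger();

    // "load_acquire x": the ordering is attached to this one load only.
    static int readWithLoadAcquire() {
        while (ready.getAcquire() == 0) Thread.onSpinWait();
        return payload;
    }

    // "load x; acquire_fence": an unordered (opaque) load, then a standalone
    // fence that orders ALL earlier loads before everything that follows.
    static int readWithAcquireFence() {
        while (ready.getOpaque() == 0) Thread.onSpinWait();
        VarHandle.acquireFence();
        return payload;
    }

    // Hypothetical driver: one writer thread publishes, the caller reads.
    static int demo(boolean useFence) throws InterruptedException {
        payload = 0;
        ready.set(0);
        Thread writer = new Thread(() -> {
            payload = 42;          // plain store
            ready.setRelease(1);   // release store publishes the payload
        });
        writer.start();
        int seen = useFence ? readWithAcquireFence() : readWithLoadAcquire();
        writer.join();
        return seen;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(demo(false));
        System.out.println(demo(true));
    }
}
```

Both readers return the published value; the difference is only in how much the fence version constrains unrelated earlier loads.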
> > > In addition, for better or worse, fencing requirements on at least
> > > Power are actually driven as much by store atomicity issues, as by
> > > the ordering issues discussed in the cookbook. This was not
> > > understood in 2005, and unfortunately doesn't seem to be amenable to
> > > the kind of straightforward explanation as in Doug's cookbook.
> > Coming from a strongly ordered architecture to a weakly ordered one
> > myself, I also needed some mental adjustment about store (multi-copy)
> > atomicity. I can imagine others will be unaware of this difference,
> > too, even in 2014.
> Sorry I'm missing the connection between fences and multi-copy atomicity.
One example is the classic IRIW (independent reads of independent
writes) test. Assume non-multi-copy-atomic stores, but ordered (say,
through a dependency) loads, in the following example:
Memory: foo = bar = 0
    _T1_          _T2_          _T3_                               _T4_
    st (foo),1    st (bar),1    ld r1,(bar)                        ld r3,(foo)
                                <addr dep / local "fence" here>    <addr dep>
                                ld r2,(foo)                        ld r4,(bar)
You may observe r1 = 1, r2 = 0, r3 = 1, r4 = 0 on non-multi-copy-atomic
machines, that is, T3 sees the store to bar before the store to foo,
while T4 sees them in the opposite order. On TSO boxes, this is not
possible. That means that the memory fence that prevents such behaviour
(DMB on ARM) needs to carry some additional oomph to ensure multi-copy
atomicity, or rather to prevent you from observing its absence (which
amounts to the same thing).
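As a rough illustration, the IRIW test above can be written in Java as follows. This is a sketch under assumptions, not a proper litmus harness: the class and method names are invented, and because Java cannot express an address dependency, the loads and stores use volatile-strength AtomicInteger.get/set instead. Those accesses are sequentially consistent under the Java memory model, so the r1 = 1, r2 = 0, r3 = 1, r4 = 0 outcome must never be counted; on a non-multi-copy-atomic machine that guarantee is exactly what the extra fence strength pays for.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class Iriw {
    // Runs the four-thread IRIW test 'iterations' times and counts how often
    // the outcome r1=1, r2=0, r3=1, r4=0 is observed.
    static int countForbidden(int iterations) throws InterruptedException {
        int forbidden = 0;
        for (int i = 0; i < iterations; i++) {
            AtomicInteger foo = new AtomicInteger();   // Memory: foo = bar = 0
            AtomicInteger bar = new AtomicInteger();
            int[] r = new int[5];                      // r[1]..r[4] mirror r1..r4

            Thread t1 = new Thread(() -> foo.set(1));
            Thread t2 = new Thread(() -> bar.set(1));
            Thread t3 = new Thread(() -> { r[1] = bar.get(); r[2] = foo.get(); });
            Thread t4 = new Thread(() -> { r[3] = foo.get(); r[4] = bar.get(); });

            Thread[] threads = { t1, t2, t3, t4 };
            for (Thread t : threads) t.start();
            for (Thread t : threads) t.join();         // join makes r[] visible here

            if (r[1] == 1 && r[2] == 0 && r[3] == 1 && r[4] == 0) forbidden++;
        }
        return forbidden;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("forbidden outcomes: " + countForbidden(500));
    }
}
```

Starting fresh threads per iteration makes the interesting interleavings rare, so a run that reports zero is not evidence about the hardware; the zero is guaranteed by the language semantics of the accesses chosen here.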