Memory ordering properties of Atomic::r-m-w operations
david.holmes at oracle.com
Sat Nov 5 18:43:52 UTC 2016
Forking new discussion from:
RFR(M): 8154736: enhancement of cmpxchg and copy_to_survivor for ppc64
On 1/11/2016 7:44 PM, Andrew Haley wrote:
> On 31/10/16 21:30, David Holmes wrote:
>> On 31/10/2016 7:32 PM, Andrew Haley wrote:
>>> On 30/10/16 21:26, David Holmes wrote:
>>>> On 31/10/2016 4:36 AM, Andrew Haley wrote:
>>>>> And, while we're on the subject, is memory_order_conservative actually
>>>>> defined anywhere?
>>>> No. It was chosen to represent the current status quo that the Atomic::
>>>> ops should all be (by default) full bi-directional fences.
>>> Does that mean that a CAS is actually stronger than a load acquire
>>> followed by a store release? And that a CAS is a release fence even
>>> when it fails and no store happens?
>> Yes. Yes.
>> // All of the atomic operations that imply a read-modify-write
>> // action guarantee a two-way memory barrier across that
>> // operation. Historically these semantics reflect the strength
>> // of atomic operations that are provided on SPARC/X86. We assume
>> // that strength is necessary unless we can prove that a weaker
>> // form is sufficiently safe.
> Mmmm, but that doesn't say anything about a CAS that fails. But fair
> enough, I accept your interpretation.
Granted, the above was not written with load-linked/store-conditional
style implementations in mind; and since the historical behaviour on SPARC
and x86 is not affected by failure of the CAS, that case isn't called out.
I should fix that.
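For illustration only, the "conservative" semantics can be modelled in C++11 atomics terms (HotSpot itself does not use std::atomic; the function name here is invented): bracketing even a potentially failing CAS with seq_cst fences gives the two-way barrier in both the success and failure cases.

```cpp
#include <atomic>

// Sketch only: models the "conservative" full two-way barrier using
// C++11 fences. The seq_cst fence on each side applies whether or not
// the CAS succeeds, matching the guarantee quoted above.
inline int conservative_cmpxchg(std::atomic<int>& dest, int compare, int exchange) {
    std::atomic_thread_fence(std::memory_order_seq_cst); // prior accesses can't move below
    int expected = compare;
    dest.compare_exchange_strong(expected, exchange,
                                 std::memory_order_relaxed,
                                 std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_seq_cst); // later accesses can't move above
    return expected; // old value, like Atomic::cmpxchg
}
```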
>> But there is some contention as to whether the actual implementations
>> obey this completely.
> Linux/AArch64 uses GCC's __sync_val_compare_and_swap, which is specified
> as a
> "full barrier". That is, no memory operand is moved across the
> operation, either forward or backward. Further, instructions are
> issued as necessary to prevent the processor from speculating loads
> across the operation and from queuing stores after the operation.
> ... which reads the same as the language you quoted above, but looking
> at the assembly code I'm sure that it's really no stronger than a
> seq_cst load followed by a seq_cst store.
Are you saying that a seq_cst load followed by a seq_cst store is weaker
than a full barrier?
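For concreteness, here is a sketch (not HotSpot code; the wrapper names are invented) comparing the __sync builtin quoted above with its newer __atomic replacement, where the ordering is spelled out explicitly. Both builtins are real GCC/Clang interfaces.

```cpp
// Sketch comparing the legacy "full barrier" __sync builtin with the
// newer __atomic form where the ordering is explicit. The builtins are
// real GCC/Clang builtins; the wrapper names are invented.
long cas_sync_style(long* p, long cmp, long xchg) {
    return __sync_val_compare_and_swap(p, cmp, xchg); // documented as a "full barrier"
}

long cas_atomic_style(long* p, long cmp, long xchg) {
    long expected = cmp;
    __atomic_compare_exchange_n(p, &expected, xchg, /*weak=*/false,
                                __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
    return expected; // old value, matching the __sync builtin's return
}
```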
> I guess maybe I could give up fighting this and implement all AArch64
> CAS sequences as
> CAS(seq_cst); full fence
> or, even more extremely,
> full fence; CAS(relaxed); full fence
> but it all seems unreasonably heavyweight.
Indeed. A couple of issues here. If you are thinking in terms of
OrderAccess::fence() then it needs to guarantee visibility as well as
ordering - see this bug I just filed:
So it would be heavier than a "full barrier" that simply combined all
four membar variants. Though of course the actual implementation on a
given architecture may be just as heavyweight. And of course the
Atomic op must guarantee visibility of the successful store (else the
atomicity aspect would not be present).
That aside we do not need two "fences" surrounding the atomic op. For
platforms where the atomic op is a single instruction which combines
load and store then conceptually all we need is:
loadload|storeload; op; storeload|storestore
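Spelled out membar by membar (illustrative pseudocode, where "op" stands for the single instruction combining the load and the store):

```
membar loadload|storeload    // earlier loads and stores complete before op's load
op                           // atomic read-modify-write
membar storeload|storestore  // op's store completes before later loads and stores
```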
Note this is at odds with the commentary in atomic.hpp, which says things
like:
// <fence> add-value-to-dest <membar StoreLoad|StoreStore>
I need to check why we settled on the above formulation - I suspect it
was conservatism. And of course for the cmpxchg it fails to account for
the fact there may not be a store to order with.
For load-linked/store-conditional based operations that would expand to
(assuming a retry loop to handle st-cond failures caused by unrelated
stores):
temp = ld-linked &val
cmp temp, expected
br-ne done               // failed compare: no store is executed
st-cond &val, newVal
done:
which is fine if we actually store, but if we find the wrong value there
is no store for those final barriers to sync with. That then raises the
question: can subsequent loads and stores move into the
ld-linked/st-cond region? The general, context-free answer would be yes,
but the actual details may be architecture-specific and also context-
dependent - i.e. the subsequent loads/stores may be dependent on the CAS
succeeding (or on it failing). So without further knowledge you would
need to use a "full-barrier" after the st-cond.
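That conclusion can be sketched with compare_exchange_weak, which (like a failed st-cond) may fail spuriously on LL/SC machines, plus a full fence issued after the operation so the failure path is covered too. The function name is invented; this is a model, not the HotSpot implementation.

```cpp
#include <atomic>

// Sketch: retry loop modelling ld-linked/st-cond, with a trailing
// seq_cst fence so a two-way barrier is present even when the compare
// fails and no store is performed.
template <typename T>
T cmpxchg_full_barrier_on_failure(std::atomic<T>& dest, T compare, T exchange) {
    T expected = compare;
    while (!dest.compare_exchange_weak(expected, exchange,
                                       std::memory_order_relaxed,
                                       std::memory_order_relaxed)) {
        if (expected != compare)
            break;           // genuine mismatch: behave like a failed CAS
        // expected == compare: spurious st-cond-style failure, so retry
    }
    std::atomic_thread_fence(std::memory_order_seq_cst); // "full-barrier" after the st-cond
    return expected;
}
```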
>>> And that a conservative load is a *store* barrier?
>> Not sure what you mean. Atomic::load is not a r-m-w action so not
>> expected to be a two-way memory barrier.