RFR: 7143664: Clean up OrderAccess implementations and usage

Karen Kinnear karen.kinnear at oracle.com
Mon Feb 23 18:45:38 UTC 2015

On Feb 20, 2015, at 1:24 PM, Erik Österlund wrote:

And truly many thanks - in my mind you have made this much more conceptually clean and safer.
I am good with your checking in the parts I have reviewed.

> Hi Karen,
> On 20/02/15 17:04, Karen Kinnear wrote:
>> Erik,
>> 1. orderAccess.hpp
>> Were you going to add a comment about MSVC assumptions under C++ Volatile Semantics?
>> Is it worth documenting in source somewhere what minimal versions of compilers are assumed?
>> Or which versions of compilers were tested? I think that would help us in future please.
>> Maybe in the platform-specific files?
> I did not intend to add this comment to shared. I thought that it would 
> be enough to say that non-volatile to volatile constraints can not be 
> trusted in general. Instead there is a comment explaining this anomaly 
> in orderAccess_windows_x86.inline.hpp where it is exploited, since it is 
> the only documented exception I know of.
> If we start picking certain examples, it feels like we should rather 
> have either 1) a comprehensive table of what volatile actually means for 
> different compilers with references or question marks where we don't 
> know (and hence need to be conservative assuming the worst), or 2) 
> putting such comments in platform specific files (like MSVC currently).
> Do you agree? In that case, which one do you prefer?
> As for documenting compiler versions assumed, would it be enough to 
> stamp the code with the currently used compilers we used now, even 
> though it might actually have worked with earlier versions? The 
> reasoning is that it's unlikely we will go back to using earlier 
> compilers anyway, so finding out when exactly certain compilers started 
> doing what we want them to could possibly be unnecessary, while still 
> having enough information to allow checking if future compilers have 
> provided improved functionality from this version of OrderAccess?
It would make more sense to document for the individual platforms.
I'm trying to prevent future "what were they thinking" questions by recording
for each platform the specific compiler version tested. Not all versions that
should work, just the ones we tested it with.
>> 2. orderAccess_windows_x86.inline.hpp
>> Could you possibly add to the comment about MSVC, that this assumes all of the Scoped templates explicitly cast the addresses to volatiles?
>> (In case someone changes that some day?)
> Sure! But is this not clear given the comment in 
> orderAccess_windows_x86.inline.hpp where this is exploited next to the 
> scope template specialization:
> // Note that in MSVC, volatile memory accesses are explicitly
> // guaranteed to have acquire release semantics (w.r.t. compiler
> // reordering) and therefore does not even need a compiler barrier
> // for normal acquire release accesses.
> template<> inline void ScopedFence<X_ACQUIRE>::postfix()       { }
> template<> inline void ScopedFence<RELEASE_X>::prefix()        { }
> template<> inline void ScopedFence<RELEASE_X_FENCE>::prefix()  { }
> template<> inline void ScopedFence<RELEASE_X_FENCE>::postfix() { 
> OrderAccess::fence(); }
> If you don't think this is clear enough, I would happily improve the 
> description.
Perhaps I am being extra cautious since the very original set of calls all assumed the passed in arguments were
already volatile, I wanted to save us this problem in 10 years by stating that this assumes that the Scoped templates explicitly
cast the addresses to volatiles  (which is different than the caller passing in volatiles). So to me there were two ways to get there
and I thought it would be valuable to specify this is the assumption. Not a big deal.
>> 3. orderAccess_windows_x86.inline.hpp
>> So I looked up _ReadWriteBarrier and VS2012 and VS2013 say this is deprecated.
>> JDK8 builds on VS2010, so ok. JDK9 appears to be planning to move to VS2013, so we may need to change this.
>> It appears that the recommendation is volatile (with default on x86 of /volatile:ms). Can we use __asm__volatile("" : : : "memory";
>> compiler _barrier for VS2013 or is there an equivalent you know about?
> Unfortunately I currently have no access to windows platforms so I can't 
> experiment around. But I did some research when I looked into it and it 
> seems like _ReadWriteBarrier is all we got for now until we adopt C++11 
> std::atomic that Microsoft in the deprecation message on their website 
> suggests we should use instead.
> It seems to have been deprecated because programmers found it confusing 
> - the name does not suggest it's a compiler-only ordering constraint. 
> But that's fine for our purposes.
> I just assumed that for now we probably don't want to make such a 
> radical move to C++11, right? Since it was deprecation or C++11 I 
> intentionally used _ReadWriteBarrier anyway even though it is 
> deprecated. Is this a problem?
> When we need to switch, I believe we will be forced to use C++11 on 
> windows whether we like it or not. It has std::atomic_signal_fence which 
> should work as a drop-in replacement for _ReadWriteBarrier AFAIK. But 
> then again, we might as well just map all calls to std::atomic if we 
> gotta use C++11 anyway...
Thank you. Looks like an area we will need to follow-up on.
The C++ documentation says atomic_signal_fence just does atomic reordering. 
The Visual Studio 2012 documentation is not clear on whether it also adds hardware fences.

So - go ahead and leave this as is and we will need to switch later.
>> So - do you have plans to do additional changes to the OrderAccess logic? Beyond getting rid of the obsolete entries
>> when all platforms are ok with it?
> My biggest concern was correctness and I felt it really had to be done. 
> I use OrderAccess for my own GC code a lot, and needed to tame it a bit 
> to work, and thought I might as well contribute the fixes too. ;)

Thank you.
> Having looked into the code a bit I know there is still room 
> improvements in places if you would like me to look into it. Some examples:
> <possibleImprovements>
> 1) x86_64 windows - I suspect this can be improved a /lot/. It seems 
> like compilers were broken when the code was written at the very 
> transition point to 64 bit. I believe this can be optimized to use 
> compiler support I expect is in place today.
> 2) x86 on solaris: solaris studio was also crippled in the past with no 
> inline assembly. I believe these optimizations we see on other platforms 
> could now be ported to solaris as well.
Yes. yay!
> 3) _volatile variants of acquire/release/fence/membars: Now we are 
> conservative, taking volatile to non-volatile reordering into account 
> using compiler barriers. If the programmer knows this is not needed 
> since all data published is volatile, maybe it could be good to have 
> variants explicitly stating it won't be necessary.
> 4) Joining more store(); release(); pairs to store_release() since some 
> platforms like ARM can then use stlr instead of full dmb fence which 
> seems like a good thing in general anyway.
> All previous suggestions revolve around performance (and I don't know 
> how much we need it). But could also do other things.
> Testing:
> 5) Testing code: it is possible using some light template 
> metaprogramming to test if the expected specializations of OrderAccess 
> are indeed used, a bit like @Override in Java - asserting that the 
> generalized behaviour is indeed specialized as expected for all types we 
> expect them to be specialized for.
> Further refactoring:
> 6) Forwarding all atomic memory accesses to Atomic:: since it's really a 
> separate concern. Atomic:: should know how to make atomic accesses for 
> types used by OrderAccess, and OrderAccess should in shared code just 
> forward it. Currently Atomic supports all stores but only load of jlong, 
> making assumptions the rest can use volatile only - an assumption I'm 
> not willing to make and spread around in the JVM.
I share your concerns.
> </possibleImprovements>
> The main problem though is that OrderAcces changes are often inherently 
> platform specific, and I have very limited access as an outsider to the 
> platforms.
> This time, Jesper has been very kind to me, running JPRT for me which 
> made these changes possible, and I'm very happy about that. But maybe I 
> should do less platform specific things when I can't run JPRT. I don't 
> want to bother people with compiling code for me too much, especially 
> not if experimenting with what new compiler features have been added 
> over the years. :)
>> Looks like we have a couple of follow-up exercises:
>> 1) check our is_MP usage to ensure we get the compiler_barriers in both paths if we are not using volatile declarations for
>> all compiler ordering
>> 2) try applying these templates to any other platforms to see how well they work for us as maintainers
>> 3) figure out why we use StubRoutines::fence on amd64 Windows but on linux we just use lock; addl 0(sp), MSVC limitation?
> Yes I could volunteer to do 2 (only open variants of course) and 3 if 
> anyone would be kind to run JPRT for me and/or provide me with machines 
> I can try things out on (which I suspect is impossible?).
I wasn't asking you to do this work - I was suggesting work we should do. No worries.
>> I did not review ppc, zero, arm.
> I should mention Zero changes were reviewed by Severin Gehwolf in the 
> zero-dev list. He did not have any problem with my changes.
Glad to hear that. Thank you.
> Thanks for the review Karen! :)
> /Erik

>> thanks!
>> Karen
>> On Jan 22, 2015, at 8:19 PM, Erik Österlund wrote:
>>> Hi all,
>>> == Summary of Changes ==
>>> This changeset fixes issues in OrderAccess on multiple levels from the
>>> memory model semantics to compiler reorderings, to addressing
>>> maintainability/portability issues which (almost) had to be fixed in
>>> order to fix the correctness issues. It is the result of discussions
>>> found in the previous "OrderAccess Refactoring" thread:
>>> http://openjdk.5641.n7.nabble.com/OrderAccess-Refactoring-td212050.html
>>> Bug ID:
>>> https://bugs.openjdk.java.net/browse/JDK-7143664
>>> (updated to reflect these related changes)
>>> Webrev:
>>> http://cr.openjdk.java.net/~dholmes/7143664/webrev/
>>> Before I describe more I would like to give special thanks to David
>>> Holmes for long discussions leading up to the currently proposed
>>> changes. I would also like to thank Jesper Wilhelmsson for helping me
>>> run my changes through JPRT.
>>> == Motivation ==
>>> This change directly fixes a reported OrderAccess bug due to compiler
>>> reorderings which is still a vulnerability on almost all TSO platforms:
>>> https://bugs.openjdk.java.net/browse/JDK-806196
>>> And directly fixes confusions like release_store() != release() store()
>>> due to memory model issues previously described in this bug ID.
>>> At the same time it provides clearer design with separation of concerns
>>> and generalization/specialization, removing a whole bunch of platform
>>> specific code which could be generalized. The platform specific files
>>> now only have a few LoC requirements (~7) to conform to the memory model
>>> by specifying what the stand alone barriers do. Then optionally
>>> optimizations to the general approach are possible if platforms support
>>> it. This also makes it much easier to port to new platforms.
>>> == Memory Model ==
>>> The current definitions of acquire/release semantics are a bit fishy
>>> leading to problems previously described in the bug ID (release_store()
>>> != release() store()) and some other correctness issues. It has
>>> therefore been replaced with a new model. I would like to thank David
>>> Holmes for the long discussions leading up to the newly proposed model.
>>> The new model is formally defined like this:
>>> // T1: access_shared_data
>>> // T1: ]release
>>> // T1: (...)
>>> // T1: store(X)
>>> //
>>> // T2: load(X)
>>> // T2: (...)
>>> // T2: acquire[
>>> // T2: access_shared_data
>>> //
>>> // It is guaranteed that if T2: load(X) synchronizes with (observes the
>>> // value written by) T1: store(X), then the memory accesses before the
>>> // T1: ]release happen before the memory accesses after the T2: acquire[.
>>> The orderAccess.hpp file and bug ID also has a few additional
>>> explanations making it more intuitive to the user when to use
>>> acquire/release and the resemblance to TSO abstract machines. Big thanks
>>> goes to David Holmes for discussing the memory model with me, which
>>> helped a lot in deriving it.
>>> Now it holds that release() store() == release_store(), and the other
>>> correctness issues are fixed as well.
>>> The new model is also closer to C++11 definitions which could give us
>>> more relaxed compiler reordering constraints in the future when compiler
>>> support for C++11 is there and ready.
>>> == Reliance on C++ Volatile Semantics ==
>>> The C++ standard section 1.9 "Program Execution" is very vague about
>>> what the keyword volatile can actually do for us. It gives clear
>>> constraints in terms of volatile-volatile accesses but says little about
>>> nonvolatile-volatile accesses. Yet the current implementation heavily
>>> relies upon volatile to in terms of compiler reordering. But GCC
>>> explicitly declares that volatiles and non-volatiles may reorder freely
>>> ( https://gcc.gnu.org/onlinedocs/gcc/Volatiles.html ). The only compiler
>>> known to explicitly provide the wanted semantics with volatile is MSVC
>>> 2010 for windows (
>>> https://msdn.microsoft.com/en-us/library/12a04hfd(v=vs.100).aspx ).
>>> Compilers not giving explicit guarantees, must be considered unsafe and
>>> revert to conservative behaviour.
>>> This was brought to attention after causing bugs, but was only fixed for
>>> x86 linux. This is a fundamental issue inherent to all TSO platforms
>>> except windows, and has to be fixed on all of them.
>>> Several barriers are unsafe to use because they lack compiler reordering
>>> constraints (e.g. fence and acquire on linux_SPARC). For TSO platforms
>>> they are typically implemented using dummy loads and stores. This seems
>>> to be another old volatile reliance that I fixed. These barriers
>>> sometimes have omitted compiler barriers (which is what we really want).
>>> This seems to be another example on incorrect reliance on the volatile
>>> semantics to help us. Therefore dummy loads/stores have been replaced
>>> with compiler barriers on TSO platforms.
>>> It is also worth noting that compilers like sun studio did previously
>>> not support inline asm syntax. Therefore, barriers were implemented in
>>> .il-files. However, using them does not give explicit compiler
>>> constraints for reordering AFAIK. Therefore, they have been
>>> reimplemented using inline asm with explicit compiler reordering
>>> constraints, as even sun (solaris?) studio now supports this.
>>> The windows variants have added a windows-style _ReadWriteBarrier()
>>> compiler barrier similarly.
>>> == Strange Hardware Reorderings ==
>>> Fixed a weird inconsistency where acquire, loadstore and loadlaod would
>>> use isync instead of lwsync for PPC on linux_zero, but not in any other
>>> PPC platform in the repo. I assumed this is wrong and changed it to
>>> lwsync instead.
>>> == Code Redundancy and Refactoring ==
>>> The OrderAccess code looks like it has been developed over a long period
>>> of time, with small incremental changes. This seems to have led to a lot
>>> of code duplication over time. For example, store_fence variants are not
>>> referenced from anywhere, yet contribute to a lot of the code base and a
>>> lot of awkwardness (such as being the one only exception not using
>>> volatiles for some reason). Moreover, store_fence is not used anywhere
>>> in hotspot because it is not a good fit for either the acquire/release
>>> semantics or the Java volatile semantics, leaving a big question mark on
>>> when it should ever be used. I took the liberty of removing it.
>>> Another redundancy issue is that most of the semantics is exactly the
>>> same for all platforms, yet all that default boilerplate such as how to
>>> make atomic accesses, where acquire/release is supposed to be placed
>>> w.r.t. the memory access, what the different barriers should do etc. is
>>> copied in redundantly for each os_cpu and each type of memory access for
>>> each os_cpu. This makes it extremely painful 1) to understand what
>>> actually defines a certain platform compared to the defaults and 2) to
>>> correct bugs like those discovered here 3) prevent silly mistakes and
>>> bugs, by simply having a lot less code defining the behaviour of
>>> OrderAccess that could go wrong.
>>> A new architecture/design for OrderAccess is proposed, using a
>>> generalization/specialization approach.
>>> A general implementation in /share/ defines how things work and splits
>>> into different concerns: 1) how to make an atomic memory access, 2)
>>> where to but barriers w.r.t. the memory access for things like
>>> load_acquire, release_store and release_store_fence, 3) how these
>>> barriers are defined.
>>> This allows a clear split between what is required for following the
>>> specifications, and optimizations, which become much more readable and
>>> only optimizations need to be reviewed in depth as the defaults can
>>> always be trusted given correct standalone barriers.
>>> The only thing a platform is required to specify, is what an
>>> implementation of acquire(), release() and fence() should do. If they
>>> are implemented properly, everything in OrderAccess is guaranteed to
>>> work according to specification using the generalized code. This makes
>>> it very easy to support new ports. ALL the other code in the os_cpu
>>> files is used /only/ for optimization purposes offered for specific
>>> configurations.
>>> However, it is highly customizable so that specific platform can perform
>>> any desired optimizations. For instance this load_acquire on PPC is
>>> optimized:
>>> template<> inline jbyte  OrderAccess::specialized_load_acquire<jbyte>
>>> (volatile jbyte*  p) { register jbyte t = load(p);
>>> inlasm_acquire_reg(t); return t; }
>>> This overrides the whole load_acquire implementation to do something
>>> custom. Platforms like x86 extensively use this for joined fencing
>>> variants to optimize.
>>> The default implementation of load_acquire() will make an atomic load()
>>> followed by acquire() since the semantics is generalized. The
>>> generalized semantics are defined using inlined postfix/prefix calls
>>> after/before the atomic access, as defined here:
>>> template<> inline void ScopedFenceGeneral<X_ACQUIRE>::postfix()       {
>>> OrderAccess::acquire(); }
>>> template<> inline void ScopedFenceGeneral<RELEASE_X>::prefix()        {
>>> OrderAccess::release(); }
>>> template<> inline void ScopedFenceGeneral<RELEASE_X_FENCE>::prefix()  {
>>> OrderAccess::release(); }
>>> template<> inline void ScopedFenceGeneral<RELEASE_X_FENCE>::postfix() {
>>> OrderAccess::fence();   }
>>> For platforms that simply wish to override what e.g. acquire means for a
>>> joined ordered memory access in general, as different to calling stand
>>> alone acquire(), the semantics can be easily overridden for a platform
>>> such as windows like on windows:
>>> template<> inline void ScopedFence<X_ACQUIRE>::postfix()       { }
>>> template<> inline void ScopedFence<RELEASE_X>::prefix()        { }
>>> template<> inline void ScopedFence<RELEASE_X_FENCE>::prefix()  { }
>>> template<> inline void ScopedFence<RELEASE_X_FENCE>::postfix() {
>>> OrderAccess::fence(); }
>>> In this example, since Windows (now) has a compiler barrier for acquire,
>>> but does not need it for joined accesses since volatile has stronger
>>> guarantees on windows, this is enough to specialize that for joined
>>> memory accesses, no extra protection is needed.
>>> == Backward Compatibility and Transitioning ==
>>> Since the newly proposed code is structured differently to before, a
>>> #define was added for backward compatibility so that external
>>> repositories not adhering to this new architecture do not break.
>>> Furthermore, store_release was declared private and marked as
>>> deprecated. This allows for a smooth transition into the new style of
>>> OrderAccess. When the full transition is made in all known repos, the
>>> #define and store_fence may be safely removed, eventually.
>>> == Documentation ==
>>> The documentation seems old and outdated, describing how it works on
>>> SPARC RMO and IA64, which are nowhere to be found in the repository. It
>>> also describes things about C++ volatiles which cannot be relied upon.
>>> The documentation has been cleaned up to match the current state of the
>>> implementation better, with architectures actually found in the repository.
>>> == Testing ==
>>> JPRT. Big thanks to Jesper Wilhelmsson for helping me test these changes.
>>> Ran some DaCapo benchmarks (I know okay :p) for performance regression
>>> and there was no perceivable difference.
>>> Looking forward to feedback on this, and hope to get some reviews. :)
>>> Thanks,
>>> Erik

More information about the hotspot-dev mailing list