RFR(L) 8153224 Monitor deflation prolong safepoints (CR7/v2.07/10-for-jdk14)

Daniel D. Daugherty daniel.daugherty at oracle.com
Tue Oct 29 20:18:29 UTC 2019

Hi David,

Sorry for the delay in replying to this email. I spent most of last week
migrating my development environment from my dying MBP13 to a new MBP13.
I think I have everything recovered, but only time will tell...

More below...

On 10/23/19 12:00 AM, David Holmes wrote:
> Hi Dan,
> Trimming to the open issues.

I'll try to do the same (but was only able to trim one thing)...

> On 23/10/2019 6:46 am, Daniel D. Daugherty wrote:
>> Okay, time to return to my uses of release_store() and load_acquire().
>> Consider this bit of code:
>> src/hotspot/share/runtime/synchronizer.cpp:
>>      L351: // Take an ObjectMonitor from the start of the specified 
>> list. Also
>>      L352: // decrements the specified counter. Returns NULL if none 
>> are available.
>>      L353: static ObjectMonitor* 
>> take_from_start_of_common(ObjectMonitor* volatile * list_p,
>>      L354:                                                 int 
>> volatile * count_p) {
>>      L355:   ObjectMonitor* next = NULL;
>>      L356:   ObjectMonitor* take = NULL;
>>      L357:   // Mark the list head to guard against A-B-A race:
>>      L358:   if (!mark_list_head(list_p, &take, &next)) {
>>      L359:     return NULL;  // None are available.
>>      L360:   }
>>      L361:   // Switch marked list head to next (which unmarks the 
>> list head, but
>>      L362:   // leaves take marked):
>>      L363:   OrderAccess::release_store(list_p, next);
>>      L364:   Atomic::dec(count_p);
>>      L365:   // Unmark take, but leave the next value for any lagging 
>> list
>>      L366:   // walkers. It will get cleaned up when take is 
>> prepended to
>>      L367:   // the in-use list:
>>      L368:   set_next(take, next);
>>      L369:   return take;
>>      L370: }
>> On L358, we call mark_list_head() and if that call returns true, then
>> the calling thread is the "owner" of the list head. My idea is that
>> once my thread "owns" the list head, I can use release_store() as a
>> smaller hammer than cmpxchg() to sync out the new value ('next') as
>> the new list head.
>> Should I be doing something other than release_store() here?
>> My thinking is that release_store() pairs up with the load_acquire()
>> of any reader thread that's trying to simply walk the list. And that
>> release_store() also pairs up with any _other_ writer thread that's
>> trying to mark the list head using:
>>      mark_list_head() -> mark_next() -> cmpxchg()
>> Here's mark_list_head():
>>      L194: // Mark the next field in the list head ObjectMonitor. If 
>> marking was
>>      L195: // successful, then the mid and the unmarked next field 
>> are returned
>>      L196: // via parameter and true is returned. Otherwise false is 
>> returned.
>>      L197: static bool mark_list_head(ObjectMonitor* volatile * list_p,
>>      L198:                            ObjectMonitor** mid_p, 
>> ObjectMonitor** next_p) {
>>      L199:   while (true) {
>>      L200:     ObjectMonitor* mid = OrderAccess::load_acquire(list_p);
>>      L201:     if (mid == NULL) {
>>      L202:       return false;  // The list is empty so nothing to mark.
>>      L203:     }
>>      L204:     if (mark_next(mid, next_p)) {
>>      L205:       if (OrderAccess::load_acquire(list_p) != mid) {
>>      L206:         // The list head changed so we have to retry.
>>      L207:         set_next(mid, *next_p);  // unmark mid
>>      L208:         continue;
>>      L209:       }
>>      L210:       // We marked next field to guard against races.
>>      L211:       *mid_p = mid;
>>      L212:       return true;
>>      L213:     }
>>      L214:   }
>>      L215: }
>> which does a matching load_acquire() to get the current list head
>> and calls mark_next() to try and mark it. It then calls load_acquire()
>> again to verify that the list head hasn't changed while doing the
>> mark_next().
>> So here's mark_next():
>>   161 // Mark the next field in an ObjectMonitor. If marking was 
>> successful,
>>   162 // then the unmarked next field is returned via parameter and 
>> true is
>>   163 // returned. Otherwise false is returned.
>>   164 static bool mark_next(ObjectMonitor* om, ObjectMonitor** next_p) {
>>   165   // Get current next field without any marking value.
>>   166   ObjectMonitor* next = (ObjectMonitor*)
>>   167 ((intptr_t)OrderAccess::load_acquire(&om->_next_om) & ~0x1);
>>   168   if (Atomic::cmpxchg(mark_om_ptr(next), &om->_next_om, next) 
>> != next) {
>>   169     return false;  // Could not mark the next field or it was 
>> already marked.
>>   170   }
>>   171   *next_p = next;
>>   172   return true;
>>   173 }
>> We use load_acquire() on L166-7 to make sure that we have the latest
>> value in the next_om field. That load_acquire() matches up with the
>> release_store() done by another thread to unmark the next field (see
>> the set_next() function). Or it matches with the cmpxchg() done by
>> another to mark the next field. Yes, we could do a regular load when
>> we know that the field is only changed by cmpxchg(), but it could
>> have been changed by release_store() and my thinking is that we need
>> a matching load_acquire().
>> We have to use cmpxchg() here to try and set the mark on the next field.
>> If successful, then we return the value of the unmarked next field via
>> the 'next_p' param and we return true. However, just because we marked
>> the next field in the ObjectMonitor doesn't mean that we marked the list
>> head. That's what the second load_acquire() on L205 above verifies.
>> Okay, that was pretty gory on the details of my thinking! I think it's
>> pretty clear that I'm doing load_acquire()/release_store() pairs on
>> the list head pointers or the next_om fields to make sure that I match
>> loads and stores of _just_ the list head pointer or the next_om field
>> in question. I'm not trying affect other fields in the ObjectMonitor
>> or at least I don't think I have to do that.
>> Maybe I've misunderstood the load_acquire()/release_store() stuff 
>> (again)
>> and this code is doing way too much! I would be happy to have my 
>> thinking
>> about this load_acquire()/release_store() stuff corrected and the code
>> simplified. Maybe in this sequence:
>>      mark_list_head() -> mark_next() -> cmpxchg()
>> we don't need all the load_acquire() calls because the sequence only 
>> ends
>> successfully with a cmpxchg() and that makes memory happy everywhere 
>> with
>> just simple loads. Dunno. I definitely need more feedback here!
> Okay that was a massive deep dive :) but it is exactly what I was 
> referring to when I said:
> "My main comment on all the list management code pertains to the
> difficulty in seeing which release_store is paired with which
> load_acquire, and why."

Yup. And that comment is exactly why I wrote the "massive deep dive"
reply... so I _think_ I replied to the comment and thus proved the
point that this stuff is difficult all at the same time. :-)

> To understand the use of acquire/release/cmpxchg in detail you really 
> need to see all that code inlined/flattened so that you can see how 
> they actually arrange themselves.


> I'm immediately suspicious of a load_acquire that is just before a 
> cmpxchg on the same field because the cmpxchg will subsume any memory 
> affects of the load-acquire.

Okay, I'm trying to figure out if this comment is meant for L166, L167
and L168 of mark_next():

   165   // Get current next field without any marking value.
   166   ObjectMonitor* next = (ObjectMonitor*)
   167 ((intptr_t)OrderAccess::load_acquire(&om->_next_om) & ~0x1);
   168   if (Atomic::cmpxchg(mark_om_ptr(next), &om->_next_om, next) != 
next) {
   169     return false;  // Could not mark the next field or it was 
already marked.
   170   }
   171   *next_p = next;
   172   return true;

or if this is just a general comment or both... :-)

> But the devil is in the detail.

Agreed and with this project there are so many details. :-(

> If a field can be set by release_store, and there are previous stores 
> related to that which must be visible, then a read of that field can't 
> be via a plain load (but either load_acquire or as part of cmpxchg).

As the comment on L166 says, the goal of L167 and L168 is to get the
current next field without any marking value. Since the next field can be
changed by either cmpxchg() or release_store() and we have no idea which
one made the most recent update, we have to use a load_acquire() to get
the latest value of the next field.

As for the cmpxchg() on L168, the potential update operation has to be
done with cmpxchg() in order to safely and exclusively allow only one
thread to mark the ObjectMonitor's next field (or no threads if the next
field is already marked).

So... I think I have good reasons for a load_acquire() to be immediately
followed by a cmpxchg(), in this case anyway.

Of course, I'm assuming that this is the load_acquire(), cmpxchg() pair
that you are suspicious of. It could be some other pair or it could be
the pairs in general.

> But there are so many fields at play here that it is very difficult to 
> keep track of everything.
> You mention above:
> "My idea is that once my thread "owns" the list head, I can use 
> release_store() as a smaller hammer than cmpxchg() to sync out the new 
> value ('next') as the new list head."
> and I tend to agree that the second cmpxchg is certainly not needed as 
> there cannot be any contended update after you've marked the "next" field.

Not quite as strong of an agreement as I was hoping for, but I'll take 
it. :-)

Is there a different memory update/sync operation I should be using for my
"smaller hammer"?  I don't think so, but I could be missing something...

For example, I don't think I can switch from OrderAccess::release_store()
to Atomic::store() and from OrderAccess::load_acquire() to Atomic::load()
because I'll lose the {release, acquire} matching up component.

I also need to clarify one thing. In my reply to your previous set of
comments, I wrote:

> I think it's
> pretty clear that I'm doing load_acquire()/release_store() pairs on
> the list head pointers or the next_om fields to make sure that I match
> loads and stores of _just_ the list head pointer or the next_om field
> in question. I'm not trying affect other fields in the ObjectMonitor
> or at least I don't think I have to do that.

Now that I've had time to think about it, I have to retract that last
sentence. When doing these lock free list manipulations, we often will
change some other field in the ObjectMonitor as part of that manipulation.

For example, as part of deflation, we will clear the _object field when
we extract the ObjectMonitor from the containing in-use list and before
we prepend it to the global free list. A release_store() is used to update
the next field in the (cur_mid_in_use) ObjectMonitor that refers to the
deflated ObjectMonitor. That release_store() is important for sync'ing
out the clearing of the _object field because once the deflated
ObjectMonitor is unlinked from an in-use list, that field is no longer
updated by GC.

> In essence the marking is implementing a spin lock

I like the "spin lock" analogy.

> so once you have the "lock" you can use normal load/stores to make updates

Yup. I get it, e.g., we mark an ObjectMonitor so that we can deflate it
and part of deflation is making some updates...

> - but that assumes all updates occur under the "lock", that releasing 
> the "lock" also has the right barriers and that any reads sync 
> correctly with the writes. If any of that doesn't hold then 
> release_store and load_acquire will be needed rather than plain stores 
> and loads.

I _think_ that all updates that we make as part of list manipulation are
done after marking the ObjectMonitor's next field and before we unmark
that same next field. So I _think_ the "lock" is properly held...

Of course, it's always possible that I missed something, but that's what
we have code reviewers for right? :-)

> So again you'd need to look at all the flattened code paths to see how 
> this all hangs together.

For my load_acquire() calls I have matched them up with the corresponding
release_store() call. And in some cases the load_acquire() call site matches
up with a call site that does release_store() on one branch and cmpxchg()
on the other. Yes, complicated and mind numbing... Sorry about that...

I've also looked at the release_store() calls and matched them up with a
corresponding load_acquire().

However, I've looked at this code so many different times that I might be
blind to some aspects of it so another pair of eyes looking for missing
matches is always welcome.

>>> src/hotspot/share/runtime/serviceThread.cpp
>>> I have some concerns that the serviceThread must now wakeup 
>>> periodically instead of only on demand. It has a lot of tasks to 
>>> check now and it is not obvious how complex or trivial those checks 
>>> are. We don't currently gather any statistics in this area so we 
>>> have no idea how long the serviceThread typically waits between 
>>> requests. The fact the serviceThread is woken on demand means that 
>>> the desire to be "checked at the same interval" seems exceedingly 
>>> unlikely - the serviceThread would have to waiting for extended 
>>> periods of time for the timed-wait to have any real affect on the 
>>> frequency of the async monitor deflation.
>> Periodically waking up the ServiceThread is done to match the safepoint
>> cleanup period of GuaranteedSafepointInterval when doing safepoint based
>> deflation. Without waking up periodically, we could go a long time 
>> before
>> doing any deflation via the ServiceThread and that would definitely be
>> an observable change in behavior relative to safepoint based deflation.
>> In some of my test runs, I had seen us go for 8-27 seconds without doing
>> any async monitor deflation. Of course, that meant that there was 
>> more to
>> cleanup when we did do it.
> How does that occur? If the ServiceThread is woken when there are 
> monitors to deflate then that implies that if it was not woken then 
> there were none to deflate. Or turning it around, if after waiting for 
> GuaranteedSafepointInterval ms the Service Thread finds there are 
> monitors to deflate why wasn't it directly notified of that?

I think there's a bit of confusion here.

We don't wake up the ServiceThread when there are ObjectMonitors to
deflate. We wake up the ServiceThread to see if it can deflate any
ObjectMonitors. ObjectMonitor idleness or suitability to be deflated is
determined by the ServiceThread. ObjectMonitor idleness is not a property
determined by another part of the system and the ServiceThread is woken
to handle it.

>> I've been thinking about have my own AsyncDeflateIdleMonitorsThread 
>> and move
>> all the async deflation code from ServiceThread there. So many things 
>> are
>> being added to the ServiceThread that I do wonder whether it is 
>> overloaded,
>> but (as you said) there's no way to know that (yet).
> Some stuff just got moved out of the ServiceThread to a new 
> NotificationThread (which is a non-hidden Java thread), so the 
> ServiceThread has a little more capacity (potentially) to handle async 
> monitor deflation. But yes if the ServiceThread becomes a bottleneck 
> it will defeat the purpose of offloading work from the VMThread at 
> safepoints.

The other "benefit" of having a dedicated AsyncDeflateIdleMonitorsThread
is that it would be easier for us to determine how much actual work is
done by the whole ObjectMonitor deflation procedure. We can say XX% of
the VMs execution time went to the AsyncDeflateIdleMonitorsThread.

Also, the work moved out of the ServiceThread is not an "always on" thing
so the help for async deflation is limited at best.

>> BTW, I think GuaranteedSafepointInterval might be "broken" at this point
>> in that we don't guarantee a safepoint at that interval anymore. This 
>> has
>> kind of slipped through the cracks...
> We can look at that separately.

Agreed. Not directly related to the Async Monitor Deflation project
except as the explanation for how I went so long without seeing an
async monitor deflation... Without that cleanup safepoint happening
every GuaranteedSafepointInterval (and without the wait() timeout value),
we don't have ObjectSynchronizer::do_safepoint_work() periodically
waking up the ServiceThread to do an async monitor deflation sweep...

>>> ---
>>> src/hotspot/share/runtime/thread.cpp
>>> No changes to this file.
>>> ---
>>> test/jdk/tools/jlink/multireleasejar/JLinkMultiReleaseJarTest.java
>>> No changes to this file.
>> Correct. I'm unsure how to deal with this type of thing. This project
>> is managed as a series of patches. Earlier patches changed those files,
>> but a later patch undid those changes. So depending on where you are
>> in the patch queue, the logical 'files.list' contents are different.
>> I'm currently using a 'files.list' that reflects every file touched by
>> the current set of patches and that's why those two files are included
>> with no changes.
>> Suggestions?
> When you think you have the final version ready for review then you 
> should flatten out the patches (I don't know the right terminology) so 
> that the final webrev is accurate.

Good point. I've been so focused on different reviewers wanting to come
at this mass of code from different phases of the development process and
that led me to keep all the various version patches intact. I think we're
past the point where we are going to back up to, e.g., v2.05, and then
do a completely different lock free monitor list as a new v2.06.

Okay. When we roll to v2.08, v2.07 and predecessors will get flattened so
that we have an easier review process.

>> Thanks for the partial review! I'm looking forward to resolving the
>> queries about and your next set of review comments.
> I will try to look at the ObjectMonitor changes themselves this 
> afternoon. So many non-trivial reviews/discussions at the moment, plus 
> my own work :)

Thanks for spending so much time and energy on this code.


> Thanks,
> David
> -----
>> Dan
>>> ---
>>>> Dan
>>>> On 10/17/19 5:50 PM, Daniel D. Daugherty wrote:
>>>>> Greetings,
>>>>> The Async Monitor Deflation project is reaching the end game. I 
>>>>> have no
>>>>> changes planned for the project at this time so all that is left 
>>>>> is code
>>>>> review and any changes that results from those reviews.
>>>>> Carsten and Roman! Time for you guys to chime in again on the code 
>>>>> reviews.
>>>>> I have attached the list of fixes from CR6 to CR7 instead of 
>>>>> putting it
>>>>> in the main body of this email.
>>>>> Main bug URL:
>>>>>     JDK-8153224 Monitor deflation prolong safepoints
>>>>>     https://bugs.openjdk.java.net/browse/JDK-8153224
>>>>> The project is currently baselined on jdk-14+19.
>>>>> Here's the full webrev URL for those folks that want to see all of 
>>>>> the
>>>>> current Async Monitor Deflation code in one go (v2.07 full):
>>>>> http://cr.openjdk.java.net/~dcubed/8153224-webrev/10-for-jdk14.v2.07.full 
>>>>> Some folks might want to see just what has changed since the last 
>>>>> review
>>>>> cycle so here's a webrev for that (v2.07 inc):
>>>>> http://cr.openjdk.java.net/~dcubed/8153224-webrev/10-for-jdk14.v2.07.inc/ 
>>>>> The OpenJDK wiki has been updated to match the 
>>>>> CR7/v2.07/10-for-jdk14 changes:
>>>>> https://wiki.openjdk.java.net/display/HotSpot/Async+Monitor+Deflation
>>>>> The jdk-14+18 based v2.07 version of the patch has been thru Mach5 
>>>>> tier[1-8]
>>>>> testing on Oracle's usual set of platforms. It has also been 
>>>>> through my usual
>>>>> set of stress testing on Linux-X64, macOSX and Solaris-X64 with 
>>>>> the addition
>>>>> of Robbin's "MoCrazy 1024" test running in parallel with the other 
>>>>> tests in
>>>>> my lab.
>>>>> The jdk-14+19 based v2.07 version of the patch has been thru Mach5 
>>>>> tier[1-3]
>>>>> test on Oracle's usual set of platforms. Mach5 tier[4-8] are in 
>>>>> process.
>>>>> I did another round of SPECjbb2015 testing in Oracle's Aurora 
>>>>> Performance lab
>>>>> using using their tuned SPECjbb2015 Linux-X64 G1 configs:
>>>>>     - "base" is jdk-14+18
>>>>>     - "v2.07" is the latest version and includes C2 
>>>>> inc_om_ref_count() support
>>>>>       on LP64 X64 and the new HandshakeAfterDeflateIdleMonitors 
>>>>> option
>>>>>     - "off" is with -XX:-AsyncDeflateIdleMonitors specified
>>>>>     - "handshake" is with -XX:+HandshakeAfterDeflateIdleMonitors 
>>>>> specified
>>>>>          hbIR           hbIR
>>>>>     (max attempted)  (settled)  max-jOPS  critical-jOPS runtime
>>>>>     ---------------  ---------  --------  ------------- -------
>>>>>            34282.00   30635.90  28831.30       20969.20 3841.30 base
>>>>>            34282.00   30973.00  29345.80       21025.20 3964.10 v2.07
>>>>>            34282.00   31105.60  29174.30       21074.00 3931.30 
>>>>> v2.07_handshake
>>>>>            34282.00   30789.70  27151.60       19839.10 3850.20 
>>>>> v2.07_off
>>>>>     - The Aurora Perf comparison tool reports:
>>>>>         Comparison              max-jOPS critical-jOPS
>>>>>         ----------------------  -------------------- 
>>>>> --------------------
>>>>>         base vs 2.07            +1.78% (s, p=0.000) +0.27% (ns, 
>>>>> p=0.790)
>>>>>         base vs 2.07_handshake  +1.19% (s, p=0.007) +0.58% (ns, 
>>>>> p=0.536)
>>>>>         base vs 2.07_off        -5.83% (ns, p=0.394) -5.39% (ns, 
>>>>> p=0.347)
>>>>>         (s) - significant  (ns) - not-significant
>>>>>     - For historical comparison, the Aurora Perf comparision tool
>>>>>         reported for v2.06 with a baseline of jdk-13+31:
>>>>>         Comparison              max-jOPS critical-jOPS
>>>>>         ----------------------  -------------------- 
>>>>> --------------------
>>>>>         base vs 2.06            -0.32% (ns, p=0.345) +0.71% (ns, 
>>>>> p=0.646)
>>>>>         base vs 2.06_off        +0.49% (ns, p=0.292) -1.21% (ns, 
>>>>> p=0.481)
>>>>>         (s) - significant  (ns) - not-significant
>>>>> Thanks, in advance, for any questions, comments or suggestions.
>>>>> Dan
>> <trimmed older review invites>

More information about the hotspot-runtime-dev mailing list