RFR 8087198: G1 card refinement: batching, sorting, joining, prefetching

Erik Österlund erik.osterlund at lnu.se
Thu Jun 18 18:43:41 UTC 2015

Hi Thomas,

Den 18/06/15 08:58 skrev Thomas Schatzl <thomas.schatzl at oracle.com>:

>  some notes about the changes: the change does not compile as is, it
>does not compile without precompiled headers, or in (fast/slow)debug
>The change still contains a debug printf (using non-Hotspot printf) - I
>hope you did not do your benchmarking with the change you sent (That
>fails compilation btw with gcc 4.8.2, i.e. the release compiler version,
>on Linux).
>Further, in fastdebug mode it crashes with a fatal error when running
>specjbb2000 (but probably other applications too): the message is
>"should not call operator new[]" somewhere in the
>BufferedRefineCardTableEntryClosure constructor.
>The other patch about the dynamic barrier elision also does not compile
>without precompiled headers (run configure with
>We really appreciate your work, but it would be a lot less work for
>us/me if suggested changes compiled out of the box.
>Not sure why using Xcode (assuming you are using OS X) or gcc (if you
>happened to use linux) did not complain in any of these cases.

I completely understand, and I'm sorry if my patches did not compile for
you. I believe JPRT was designed for this very purpose so that these
things would not happen, and I do unfortunately not have access to it. If
I did, I could always make sure every patch compiled out of the box on all
of your platforms before submitting changes. As it is, I can only see that
it compiled on my platform (OS X + clang v6) and hope it works for others
too. I could compile my prefetching patch without precompiled headers and
have no idea why it won't compile on gcc 4.8.2.

With that being said, I removed the printf and changed that new[] operator
to use the NEW_C_HEAP_ARRAY macros instead to make assertions in fastdebug
happy. Sorry about those.

New webrev:


Again, this works on my platform without prefix headers in release and
fastdebug, and I'm sorry if it does not work on your platform - I cannot
test it.

As far as the barrier elision patch goes, it is not a proposed fix by any
means. It was more of a curiosity and an intentionally /experimental/
proof-of-concept prototype to demonstrate it's possible to do it so that
if somebody is interested, it might be worth sitting down and doing it
properly (without deoptimizing as wildly and breaking a few assertions in
fastdebug that deoptimization shouldn't happen during GC which it totally
now does, and do it in concurrent mark not only system gc). Sorry if I was
not clear enough about this patch being highly experimental.

Anyway, here is a patch to fix the issue of compiling without prefix
header so you should be able to compile and run it in release mode without
precompiled headers:

New webrev:


Again, sorry for the trouble. I'll be more careful next time. Maybe
submitting experimental ideas and patches is not a good idea...


>I will answer to the other questions later.
>  Thomas
>On Tue, 2015-06-16 at 18:19 +0000, Erik Österlund wrote:
>> Hi Thomas,
>> Den 16/06/15 02:08 skrev Thomas Schatzl <thomas.schatzl at oracle.com>:
>> >Hi Erik,
>> >
>> >On Mon, 2015-06-15 at 15:08 +0000, Erik Österlund wrote:
>> >
>> >> 
>> >> Den 15/06/15 04:11 skrev Thomas Schatzl <thomas.schatzl at oracle.com>:
>> >> 
>> >>
>> >> >> Basically we currently process cards one by one even though they
>> >>always
>> >> >> come in groups of at the very least G1UpdateBufferSize. The cards
>> >>may or
>> >> >> may not follow a particularly cache friendly order, depending on
>> >> >>lucky
>> >> >> we are, and need to do things like seek the start of the card
>> >> >>often,
>> >> >> as well as cutting arrays at the card intervals, even though many
>> >>times
>> >> >> the batches of cards could be linearly scanned and follow a
>> >>contiguous
>> >> >> order.
>> >> >
>> >> >The question is, is this true? I remember doing some experiments on
>> >>that
>> >> >years ago, with different applications, and it did not seem to be
>> >> >doing because the cards were not contiguous. Iirc there were almost
>> >>always
>> >> >just very little consecutive runs of cards (within a complete
>> >> >that did
>> >> >not amount to much.
>> >> >So sorting seemed more expensive than the gain due to batching them
>> >> >reference analysis.
>> >> 
>> >> I don¹t know if it is always true that intervals are contiguous. In
>> >> applications they are (yes I checked, for instance h2 of Dacapo has
>> >>almost
>> >> only contiguous cards from humongous arrays that seem to be sweeped
>> >>over)
>> >> which is sweet.
>> >> 
>> >> And when they are not and this is not true and intervals are broken
>>- I
>> >> use prefetching instead. So regardless of the case, caching
>> >> are improved.
>> >> 
>> >> I benchmarked the cost of the sorting and could not find a
>>difference at
>> >> all. I find it very cheap as all the data is local to the closure
>>and L1
>> >> cached. The cost however of missing cache lines when scanning cards
>> >> they could have been contiguous intervals
>> >
>> >I expect that hardware prefetch of the next line(s) will automatically
>> >prefetch them.
>> Are you saying I should, after sorting, skip the explicit prefetch of
>> contiguous intervals, since hardware should implicitly do it anyway?
>> I guess this might be possible. But I think it¹s still better to be
>> explicit about it, as we see other code do as well. The hardware
>> doesn¹t know how long the contiguous interval is to be prefetched, but
>> do.
>> >
>> >> >I remember more recent experiments on the refinement buffers
>> >>by
>> >> >the evacuation (in G1 remembered set updates during GC are not
>> >>immediately
>> >> >reflected in the remembered set, but G1 creates refinement entries
>> >> >instead),
>> >> >and there the situation seemed even worse.
>> >> 
>> >> My patch currently only applies to card refinement by the concurrent
>> >> refinement threads and mutators. I didn¹t touch the code for scanning
>> >>the
>> >> cards during evacuation pauses. But I expect similar stuff can be
>> >> there.
>> >> 
>> >> Also, maybe there could be a flag for reflecting the updates
>> >> so the work doesn¹t have to be done,
>> >
>> >That has been available before, but removed because it was buggy
>> >(JDK-8052172). It could have been fixed, but it was not worth it.
>> I can imagine. I didn¹t think it would be worth it either so I didn¹t
>> bother.
>> >
>> >> for users that care more about
>> >> optimizing performance than for having perfect latencies. Do you
>> >>to
>> >> know how slow the updates to the remembered sets are compared to the
>> >>cost
>> >> of scanning the cards roughly speaking? Curious where the bottlenecks
>> >>are
>> >> so I know where to focus optimizations.
>> >
>> >Remembered set updates are extremely slow in some cases, particularly
>> >the contended case. It's almost always better to let the locking happen
>> >in the concurrent phase :)
>> >
>> >Some details: By now you should know that the current design has three
>> >levels, for every source region we store remembered set entries either
>> >in the sparse, fine or coarse per-region remembered set. First we try
>> >fill up the sparse rset, then fine and finally use coarse.
>> >
>> >Most per-region remembered sets are sparse rsets. The problem is,
>> >to it requires the current implementation to take a lock, which
>> >performance is very bad if contended.
>> >
>> >It is doable to remove it, i.e. replace it with a lock-free
>> >implementation (like making a mechanism as in the proposed fix for
>> >JDK-8047328 lock-free), and you may want to have a try at it.
>> >
>> >However, we are also looking into remembered sets right now as the
>> >current implementation has a few other scalability problems that we
>> >would like to have addressed for the future.
>> Very interesting information, thanks. Replacing lock-based solutions
>> lock-free ones can help many times.
>> However, even with lock-free solutions, they do not completely remedy
>> contention symptoms, they just get rid of blocking (which can be
>> disastrous to be fair). But unless the lock-free data structure gets
>> enough space, which we probably fundamentally cannot afford for sparse
>> entries, all threads will contend for the same data, still impeding
>> scalability to some extent.
>> How about doing things like try_add_reference() instead which does
>> try_lock() instead of blocking locking? So if any lock is contended, we
>> remember the reference and deal with it later instead. The idea is that
>> there is contention, we are probably better off doing something more
>> useful elsewhere where there is no contention than insisting on either
>> blocking or having lock-free contention with other threads trying to
>> modify the same small data. In other words, instead of making contention
>> cheaper, try and avoid contention in the first place.
>> It¹s also possible to batch and distribute the problematic references
>> failed at try_add_reference() due to contention in a smarter way to
>> different worker threads in such a way that there will not be any
>> contention between those worker threads. It is possible to do this
>> having any write-write contention between the workers, but I¹ll spare
>> the details as it could possibly be a longish story.
>> >
>> >> >There is always the advantage of batching them to do less memory
>> >>barriers
>> >> >like CMS does.
>> >> 
>> >> Yeah, I also experimented with some ADS tricks that allowed me to
>> >> the G1 write barrier fences and joining 2 conditional jumps into 1,
>> >>which
>> >> led to further nice speed ups on my machine. But Dave Dice told me
>> >>fences
>> >> are becoming really cheap and ADS doesn¹t scale well on multiple
>> >> machines so I left that out of the patch, assuming my old macbook
>> >> representative of what customers will have in a few years time. :p
>> >
>> >My workstation has 40 threads :)
>> Yeah. YeeaaahhhhhhŠ :p
>> >> 
>> >> Oh, my totally scientific gut feeling tells me that lock bottlenecks
>> >> remembered sets might be a problem in that bug and that perhaps card
>> >> batching could help there too: batch the cards, locally record the
>> >>changes
>> >> that need to be made to each remembered set, then lock them one by
>> >>and
>> >> perform a bunch of updates for each critical section instead of each
>> >> reference one by one. I trust my gut feeling!
>> >
>> >That might be worth a try, but I do not expect a lot of improvements.
>> >you want to give it a spin, the microbenchmark for JDK-8069273
>> >(http://cr.openjdk.java.net/~redestad/8069273/G1HotCardBench.java)
>> >be useful. It should also give good numbers with this change.
>> >
>> >Other than that, see above.
>> Another idea I have that I am pretty sure would solve at least the
>> problems in that micro benchmark is to disable remembered sets
>> when they are not needed and lazily reconstruct them when they are
>> actually needed during concurrent mark cycles. Now every region assumes
>> that it could be selected for the CSet at the next pause and therefore
>> aggressively maintains all incoming edges. Sometimes this is simply
>> impossible and sometimes it is possible in theory yet very unlikely. I¹d
>> like to try removing RSets for those two cases, at the very least when
>> it¹s impossible.
>> First of all, a concurrent mark must happen first to estimate liveness
>> the old regions before they can become targets for the CSet. During this
>> concurrent marking, remembered sets can be reconstructed as needed. So
>> me it doesn¹t make sense to have the remembered set remember anything at
>> all until at least the first concurrent mark after the old region was
>> allocated that will activate it and reconstruct it. If we don¹t need to
>> collect old, there is no point in maintaining its remembered sets. So a
>> flag on each remembered set tells if it¹s active or not. In this
>> it would never be active since there is no need to actually collect old,
>> and hence nothing would be added to any remembered set.
>> And then, it would be useful to similarly track the velocity of
>> degradation of remembered sets and use this to find out if it¹s unlikely
>> that the region will be selected for the CSet. If they contain old data
>> that is permanently alive and never dies, it would be easy to see
>> two concurrent marks that the liveness is too high to be collected and
>> remains unchanged from previous concurrent mark. So it seems unlikely to
>> be of any interest to reclaim it any time soon, and its remembered set
>> be released. If this changes, then the RSet can be reactivated and
>> reconstructed for the next concurrent mark after it becomes eligible.
>> What do you think of this idea? I guess I need to prototype it and try
>> outŠ
>> Anyway, I¹ll see what I can do. I¹m at FCRC right now though so might
>>be a
>> bit unavailable for a while making it hard to prototype this right now!
>> >I will need to think a little about what benchmark to run and how to
>> >test for improvements.
>> Okay, thanks for looking into this and helping me out Thomas!
>> Thanks,
>> /Erik
>> >
>> >Thanks,
>> >  Thomas
>> >
>> >
-------------- next part --------------
A non-text attachment was scrubbed...
Name: smime.p7s
Type: application/pkcs7-signature
Size: 4281 bytes
Desc: not available
URL: <http://mail.openjdk.java.net/pipermail/hotspot-gc-dev/attachments/20150618/a7565b99/smime.p7s>

More information about the hotspot-gc-dev mailing list