RFC: Throughput barriers for G1

Jungwoo Ha jwha at google.com
Thu Nov 10 20:04:17 UTC 2016

> On 11/09/2016 08:13 PM, Jungwoo Ha wrote:
>> Usually single store is faster than load & store on the same cache line.
>> I don't think card(o.x) is preloaded in cache to make if check cheap.
> There are also reasons for limiting the number of writes to the card
> table. For example, see
> https://blogs.oracle.com/dave/entry/false_sharing_induced_by_card.
A store goes to the store buffer, and moving the cache line to the M
(Modified) state happens in the background.
If the mutator never reads the card table, the cache line stays in M until
the GC reads the cards (changing it to S, Shared), which saves cache
coherence traffic.
A load will most likely move the cache line to the S state. That is a
high-latency load if the previous state was M, and it is the mutator that
pays for those long-latency loads.
I am not sure there is any win from UseCondCardMark, at least on x86.
Adding a branch also adds potential branch-prediction overhead.
You could probably use a cmov instruction instead, but that is also not as
cheap as an ordinary mov.
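
To make the two barrier shapes concrete, here is a rough Java model of an
unconditional card mark versus a conditional one (the UseCondCardMark
shape). It is only a sketch: HotSpot emits these barriers as machine code,
the class and method names are made up for illustration, and the card size
and clean/dirty values are what I believe HotSpot uses, so treat them as
assumptions. It counts the loads and stores each variant issues; it does
not model cache states or latency.

```java
// Illustrative model only; names are hypothetical, constants are assumptions.
final class CardMarkSketch {
    static final int CARD_SHIFT = 9;      // assumed: 512-byte cards
    static final byte CLEAN = (byte) -1;  // assumed clean-card value
    static final byte DIRTY = 0;          // assumed dirty-card value

    final byte[] cards;
    long loads;   // card-table loads issued by the barrier
    long stores;  // card-table stores issued by the barrier

    CardMarkSketch(long heapBytes) {
        cards = new byte[(int) (heapBytes >>> CARD_SHIFT) + 1];
        java.util.Arrays.fill(cards, CLEAN);
    }

    // Unconditional mark: one store, no load, no branch.
    void markUnconditional(long fieldAddr) {
        cards[(int) (fieldAddr >>> CARD_SHIFT)] = DIRTY;
        stores++;
    }

    // Conditional mark: load the card, compare, store only if not yet dirty.
    void markConditional(long fieldAddr) {
        int idx = (int) (fieldAddr >>> CARD_SHIFT);
        loads++;
        if (cards[idx] != DIRTY) {
            cards[idx] = DIRTY;
            stores++;
        }
    }

    public static void main(String[] args) {
        CardMarkSketch a = new CardMarkSketch(1 << 20);
        CardMarkSketch b = new CardMarkSketch(1 << 20);
        for (int i = 0; i < 1000; i++) {
            long addr = (i % 4) * 8L; // four fields that share one card
            a.markUnconditional(addr);
            b.markConditional(addr);
        }
        System.out.println("unconditional: loads=" + a.loads + " stores=" + a.stores);
        System.out.println("conditional:   loads=" + b.loads + " stores=" + b.stores);
    }
}
```

The conditional variant trades stores for a load and a branch on every
reference write. Whether that trade pays off depends on the coherence
behaviour described above, which this model does not capture.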

> You can start measuring without implementing the idea, the following
> experiment will show you the cost of the pre-write barrier for your
> workloads:
>    1. Find an application that doesn't need mixed GCs (for example by
>       increasing young gen size and max heap size). Alternatively you
>       can just run any of your applications until OOME.
>    2. Run the above application without generated pre-write barriers
>       (and concurrent mark and refinement turned off). This run
>       becomes your baseline.
>    3. Run the above benchmark with the pre-write barriers, but in the
>       slow leaf call into the VM, discard the old buffer (but create a
>       new one). Turn off concurrent mark and concurrent refinement.
>       This run becomes your target.
>    If you now compare your baseline and your target, you should
>    essentially see the impact of just the pre-write barrier.
> Since the post-write barrier can be made almost identical to CMS (see my
> paragraph above), the overhead of the pre-write barrier would then be the
> barrier overhead for G1.
> Would you guys at Google be willing to help out with running these
> experiments? At the last CMS meeting Jeremy said that Google would be
> willing to help out with G1 improvements IIRC.

Sure, I can do the measurement with the DaCapo benchmark suite. I don't
think we can run this experiment with the production workload.
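
For reference, this is roughly how I picture the target configuration in
step 3 of the quoted experiment: a pre-write barrier that records the old
value in a thread-local buffer and, when the buffer fills up, takes a slow
path that discards it and creates a new one. This is a Java model with
hypothetical names and an arbitrary buffer size, not HotSpot code. The
marking-active flag is hard-wired to true purely so the enqueue path runs;
how the real experiment would arrange that is not specified in the quoted
steps.

```java
// Illustrative model of a pre-write barrier; all names are hypothetical.
final class PreBarrierSketch {
    static final int BUFFER_CAPACITY = 256; // arbitrary size for illustration

    Object[] buffer = new Object[BUFFER_CAPACITY];
    int top = 0;                  // next free slot in the buffer
    boolean markingActive = true; // assumption: forced on for the experiment
    long slowPathCalls = 0;

    // Fast path: record the value that is about to be overwritten.
    void preWrite(Object oldValue) {
        if (!markingActive || oldValue == null) {
            return; // nothing to record
        }
        if (top == buffer.length) {
            slowPath();
        }
        buffer[top++] = oldValue;
    }

    // Slow path as described in step 3: drop the full buffer, create a new one.
    void slowPath() {
        slowPathCalls++;
        buffer = new Object[BUFFER_CAPACITY];
        top = 0;
    }

    public static void main(String[] args) {
        PreBarrierSketch t = new PreBarrierSketch();
        Object old = new Object();
        for (int i = 0; i < 1000; i++) {
            t.preWrite(old); // models overwriting a non-null reference field
        }
        System.out.println("slow path calls: " + t.slowPathCalls);
    }
}
```

Because nothing ever consumes the buffers, the difference between this
run and the baseline should come from the barrier code itself rather than
from marking work.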

More information about the hotspot-gc-dev mailing list