[PATCH] Exploit Empty Regions in Young Gen to Enhance PS Full GC Performance
stefan.johansson at oracle.com
Tue Sep 17 13:52:53 UTC 2019
I will try to find time the coming weeks to do some evaluation and I'll
get back to you if I have any questions or comments.
On 2019-09-16 15:54, Haoyu Li wrote:
> Hi Stefan,
> Thanks for getting back to me! I have ported the optimization to JDK 14
> and the new patch is attached to this mail.
> As to the command lines in our evaluation, we basically run the
> benchmarks with flags including -Xmx<heap_size>
> -XX:ParallelGCThreads=32 -XX:+UseParallelGC -XX:-ScavengeBeforeFullGC
> -Xlog:gc*. We set the maximum heap size for each benchmark to 3x its
> minimum heap size, and the number of GC threads to 32 because our
> machine has 32 physical cores. Full command lines for all benchmarks can
> be found in the attached file /evaluation.sh/.
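As an illustration, the flags above can be assembled into a single invocation along the following lines. This is only a sketch: the benchmark jar (dacapo.jar), the workload name (h2), and the 3g heap size are placeholder assumptions, not the actual values, which are in the attached /evaluation.sh/.

```shell
# Sketch of one benchmark invocation using the flags listed above.
# dacapo.jar, the "h2" workload, and the 3g heap are placeholders.
HEAP_SIZE=3g   # per benchmark: 3x its minimum heap size
GC_FLAGS="-Xmx${HEAP_SIZE} -XX:ParallelGCThreads=32 -XX:+UseParallelGC -XX:-ScavengeBeforeFullGC -Xlog:gc*"
CMD="java ${GC_FLAGS} -jar dacapo.jar h2"
echo "${CMD}"
```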
> I would be more than happy to receive any feedback. Thanks for reviewing this patch!
> Best Regards,
> Haoyu Li,
> Institute of Parallel and Distributed Systems(IPADS),
> School of Software,
> Shanghai Jiao Tong University
> Stefan Johansson <stefan.johansson at oracle.com
> <mailto:stefan.johansson at oracle.com>> wrote on Thu, Sep 12, 2019, 5:34 AM:
> Hi Haoyu,
> I recently came across your patch and I would like to pick up on
> some of the things Kim mentioned in his mails. I especially want to
> evaluate and investigate whether this is a technique we can use to
> improve the other GCs as well. To start that work I want to take the
> patch for a spin in our internal performance testing. The patch
> doesn’t apply cleanly to the latest JDK repository, so if you could
> provide an updated patch that would be very helpful.
> It would also be great if you could share some more information
> around the results presented in the paper. For example, it would be
> good to get the full command lines for the different benchmarks so
> we can run them locally and reproduce the results you’ve seen.
>> 12 mars 2019 kl. 03:21 skrev Haoyu Li <leihouyju at gmail.com
>> <mailto:leihouyju at gmail.com>>:
>> Hi Kim,
>> Thanks for reviewing and testing the patch. If there are any
>> failures or performance degradation relevant to the work, please
>> let me know and I'll be very happy to keep improving it. Also, any
>> suggestions about code improvements are well appreciated.
>> I'm not quite sure whether both G1 and Shenandoah have a similar
>> region dependency issue, since I haven't studied their GC
>> behaviors before. If they do, I'm also willing to propose a more
>> general optimization.
>> As to the memory overhead, I believe it will be low because this
>> patch exploits empty regions in the young space rather than
>> off-heap memory to allocate shadow regions, and also reuses the
>> /_source_region/ field of each /RegionData/ to record the
>> corresponding shadow region index. We only introduce a new
>> integer field /_shadow/ in the RegionData class to indicate the
>> status of a region, a global /GrowableArray _free_shadow/ to store
>> the indices of shadow regions, and a global /Monitor/ to protect
>> the array. This information might help if the memory overhead
>> needs to be evaluated.
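The bookkeeping described above can be sketched with standard-library types. The names (RegionData, _shadow, _source_region, _free_shadow) mirror the patch, but this is only an illustration of the idea: the real patch uses HotSpot's GrowableArray and Monitor rather than std::vector and std::mutex, and its status encoding and copy-back path are more involved.

```cpp
#include <cstddef>
#include <mutex>
#include <vector>

// Per-region metadata, loosely modeled on the patch's RegionData.
// _shadow marks whether the region is being compacted via a shadow
// region; _source_region is reused to record that shadow region's index.
struct RegionData {
  std::size_t _source_region = 0;  // reused: index of the paired shadow region
  int _shadow = 0;                 // 0 = normal, nonzero = shadow status
};

// Pool of empty young-gen regions usable as shadow regions, protected by
// a lock (the patch uses a global Monitor for the same purpose).
class ShadowRegionPool {
 public:
  // Seed the pool with the index of an empty young-gen region.
  void add_free(std::size_t region_index) {
    std::lock_guard<std::mutex> guard(_lock);
    _free_shadow.push_back(region_index);
  }

  // Try to claim a shadow region; returns false when none are free, in
  // which case a GC worker would fall back to the normal path.
  bool try_claim(std::size_t* region_index) {
    std::lock_guard<std::mutex> guard(_lock);
    if (_free_shadow.empty()) return false;
    *region_index = _free_shadow.back();
    _free_shadow.pop_back();
    return true;
  }

  // Return a shadow region once its contents have been copied back.
  void release(std::size_t region_index) { add_free(region_index); }

 private:
  std::vector<std::size_t> _free_shadow;  // indices of free shadow regions
  std::mutex _lock;                       // protects _free_shadow
};
```

Since the pool only stores region indices and each region gains one extra integer field, the per-region memory cost stays small, which matches the low-overhead claim above.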
>> Looking forward to your insight.
>> Best Regards,
>> Haoyu Li,
>> Institute of Parallel and Distributed Systems(IPADS),
>> School of Software,
>> Shanghai Jiao Tong University
>> Kim Barrett <kim.barrett at oracle.com
>> <mailto:kim.barrett at oracle.com>> wrote on Tue, Mar 12, 2019, 6:11 AM:
>> > On Mar 11, 2019, at 1:45 AM, Kim Barrett
>> <kim.barrett at oracle.com <mailto:kim.barrett at oracle.com>> wrote:
>> >> On Jan 24, 2019, at 3:58 AM, Haoyu Li <leihouyju at gmail.com
>> <mailto:leihouyju at gmail.com>> wrote:
>> >> Hi Kim,
>> >> I have ported my patch to OpenJDK 13 according to the
>> instructions in your last mail, and the patch is attached to
>> this mail. The patch does not change much since PSGC is indeed
>> pretty stable.
>> >> Also, I evaluated the correctness and performance of PS full
>> GC with benchmarks from the DaCapo, SPECjvm2008, and JOlden suites
>> on a machine with dual Intel Xeon E5-2618L v3 CPUs (16 physical
>> cores), 64 GB DRAM, and Linux kernel 4.17. The evaluation result,
>> indicating a 1.9x GC throughput improvement on average, is
>> attached, too.
>> >> However, I have no idea how to further test this patch for
>> both correctness and performance. Could I please get some
>> guidance from you or a sponsor?
>> > Sorry I missed that you had sent an updated version of the patch.
>> > I’ve run the full regression suite across Oracle-supported
>> platforms. There are some
>> > failures, but there are almost always some failures in the
>> later tiers right now. I’ll start
>> > looking at them tomorrow to figure out whether any of them
>> are relevant.
>> > I’m also planning to run some of our performance benchmarks.
>> > I’ve lightly skimmed the proposed changes. There might be
>> some code improvements
>> > to be made.
>> > I’m also wondering if this technique applies to other
>> collectors. It seems like both G1 and
>> > Shenandoah full gc’s might have similar issues? If so, a
>> solution that is ParallelGC-specific
>> > is less interesting than one that has broader
>> applicability. Though maybe this optimization
>> > is less important for G1 and Shenandoah, since they actively
>> try to avoid full gc’s.
>> > I’m also not clear on how much additional memory might be
>> temporarily allocated by this
>> > mechanism.
>> I’ve created a CR for this: