EpsilonGC and throughput.

Tue Dec 19 08:14:43 UTC 2017

On 12/18/2017 08:01 PM, Sergey Kuksenko wrote:
> I agree that it makes sense to talk about latency, but, please, don't expect that you will be able
> to achieve high throughput with Epsilon GC. Having zero barriers is not enough for this.
> Just a simple example, I randomly took 9 standard throughput measuring benchmarks and compared
> Epsilon GC vs G1 and ParallelOld.

I assume you have run SPECjvm2008.

Beware of what I call the Catch-22 of (GC) Performance Evaluation: "standard benchmarks" tend to be
developed/tuned with existing GCs in mind. For example, it would be hard to find a "standard
benchmark" that exhibits a large live data set (LDS), experiences large GC pauses, or experiences GC
problems in its steady state (ignoring transient hiccups during warmup).

> - EpsilonGC vs ParallelOld:
>   -- only on 3 benchmarks overall throughput with Epsilon GC was higher than ParallelOld and speedup
> was: 0.2%-0.6%
>   -- on 6 benchmarks, ParallelOld (with barriers and pauses) was faster (faster means throughput!),
> within 1%-10%.
> 
> - EpsilonGC vs G1
>   -- EpsilonGC has shown higher throughput on 4 benchmarks, within 2%-3%
>   -- G1 was faster on 5 benchmarks, within 2%-10%.

Oh! The throughput figures are actually pretty good for a non-compacting collector, and the
performance improvements are in line with what is called out in the JEP as "Last-drop performance
improvements" on special workloads.

As noted above, it makes little sense to run Epsilon for throughput on "standard benchmarks" that do
not suffer from GC issues. It is instructive, however, to run workloads that *do* suffer from them.
For example, try this for a quick turn-around CLI workload that is supposed to do one thing very
quickly:

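import java.util.ArrayList;
import java.util.List;

public class AL {
    // Static field keeps every allocated object reachable for the whole run.
    static List<Object> l;
    public static void main(String... args) throws Throwable {
        l = new ArrayList<>();
        for (int c = 0; c < 100_000_000; c++) {
            l.add(new Object());
        }
        // Consume the list so the program prints a result derived from it.
        System.out.println(l.hashCode());
    }
}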

$ time java -XX:+UseParallelGC AL
-1907572722

real	0m25.063s
user	1m5.700s
sys	0m1.084s

$ time java -XX:+UseG1GC AL
-1907572722

real	0m14.908s
user	0m33.264s
sys	0m0.788s

$ time java -XX:+UnlockExperimentalVMOptions -XX:+UseEpsilonGC AL
-1907572722

real	0m8.995s
user	0m8.784s
sys	0m0.260s

In workloads like these, GC pauses do impact application throughput. With out-of-the-box GC
settings, the difference is not even in single-digit percents: in wall-clock time, ParallelGC takes
about 2.8x as long as Epsilon, and G1 about 1.7x. Of course, you can configure the GC to avoid pauses
in the timespan that is critical for you (e.g. setting -Xms8g -Xmx8g -Xmn7g for the workload above),
and hope you got it right, but one of the points of Epsilon is not to guess about this, and instead
to have the guarantee that GC never happens.
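If you do go the tuning route, one way to check whether you "got it right" is to ask the JVM how
many collections ran during the critical timespan. The following is only a sketch, assuming the
collector in use reports its counts through the standard java.lang.management API; the class name
GcCountCheck, the helper totalGcCount() and the reduced loop size are made up for illustration and
are not part of the original test.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

public class GcCountCheck {
    static List<Object> l;

    // Sum the collection counts of all collectors registered in this JVM.
    // getCollectionCount() returns -1 when the count is undefined, so skip those.
    static long totalGcCount() {
        long total = 0;
        for (GarbageCollectorMXBean bean : ManagementFactory.getGarbageCollectorMXBeans()) {
            long count = bean.getCollectionCount();
            if (count > 0) {
                total += count;
            }
        }
        return total;
    }

    public static void main(String... args) {
        long before = totalGcCount();

        // Critical timespan: same allocation pattern as AL, with a smaller
        // (hypothetical) iteration count so it runs in a default-sized heap.
        l = new ArrayList<>();
        for (int c = 0; c < 1_000_000; c++) {
            l.add(new Object());
        }

        long after = totalGcCount();
        System.out.println("GC cycles during critical timespan: " + (after - before));
    }
}
```

Run it with the same heap flags you intend to use in production; a non-zero count means the sizing
guess did not hold for this run.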

> Compacting GCs have significant advantage over non-GC in terms of throughput (e.g.
> https://shipilev.net/jvm-anatomy-park/11-moving-gc-locality/)
True, and it is called out in the JEP:

"Locality considerations. Non-compacting GC implicitly means it maintains the object graph in its
allocation order. This has impact on spatial locality, and regular applications may experience the
throughput hit if allocations are random or generate lots of sparse garbage. While this may entail
some throughput overhead, this is outside of GC control, and would affect most non-moving GCs.
Locality-aware application coding would be required to mitigate this drawback, if locality proves to
be a problem."

Locality is something that users can control, especially in small, self-contained applications,
and (hopefully) with Valhalla and other language features that help flatten the memory layout.
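To make "locality-aware application coding" concrete, here is a rough sketch, not a rigorous
benchmark. It assumes a single thread allocating objects back-to-back, which typically places them
next to each other in memory; a non-moving GC will not rearrange them afterwards. The class
LocalitySketch, the Node type and the sizes are hypothetical. Traversing in allocation order is the
locality-friendly case; traversing a shuffled copy visits the same objects in a scattered order.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class LocalitySketch {
    static final class Node {
        final int value;
        Node(int value) { this.value = value; }
    }

    // Allocate n nodes one after another, keeping them in allocation order.
    static List<Node> allocate(int n) {
        List<Node> nodes = new ArrayList<>(n);
        for (int i = 0; i < n; i++) {
            nodes.add(new Node(i));
        }
        return nodes;
    }

    // Walk the list and touch every node once.
    static long sum(List<Node> nodes) {
        long s = 0;
        for (Node node : nodes) {
            s += node.value;
        }
        return s;
    }

    public static void main(String... args) {
        List<Node> inOrder = allocate(1_000_000);

        // Same objects, visited in a random order (fixed seed for repeatability).
        List<Node> shuffled = new ArrayList<>(inOrder);
        Collections.shuffle(shuffled, new Random(42));

        long t0 = System.nanoTime();
        long a = sum(inOrder);
        long t1 = System.nanoTime();
        long b = sum(shuffled);
        long t2 = System.nanoTime();

        System.out.println("in-order: sum=" + a + " time(ns)=" + (t1 - t0));
        System.out.println("shuffled: sum=" + b + " time(ns)=" + (t2 - t1));
    }
}
```

A single timed pass like this is noisy; use a proper harness such as JMH before drawing any
conclusions from the numbers.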

Thanks,
-Aleksey
