Does allocation performance vary by collector?

Y. Srinivas Ramakrishna y.s.ramakrishna at oracle.com
Wed Apr 14 08:42:43 PDT 2010


Hi Matt -- if you are really trying to measure pure allocation throughput,
you might want to completely eliminate GC overhead by making sure
your instrumentation collects figures over an interval during which
no GC activity intervenes.
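
One way to check that a measurement interval was free of GC activity is to
compare the JVM's reported collection counts before and after it. The sketch
below uses the standard java.lang.management API; the class name, method names
and the allocation workload are made up for illustration and are not taken
from Matt's application.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcFreeInterval {
    /** Sum of collection counts across all collectors the JVM reports. */
    public static long totalGcCount() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long count = gc.getCollectionCount(); // -1 if undefined for this collector
            if (count > 0) {
                total += count;
            }
        }
        return total;
    }

    /** Runs the task; returns elapsed nanoseconds, or -1 if a collection was counted meanwhile. */
    public static long timeIfGcFree(Runnable task) {
        long gcBefore = totalGcCount();
        long start = System.nanoTime();
        task.run();
        long elapsed = System.nanoTime() - start;
        return totalGcCount() == gcBefore ? elapsed : -1L;
    }

    public static void main(String[] args) {
        final Object[] sink = new Object[1];
        long nanos = timeIfGcFree(() -> {
            for (int i = 0; i < 100_000; i++) {
                sink[0] = new byte[64]; // hypothetical allocation workload
            }
        });
        System.out.println(nanos < 0
                ? "GC intervened; discard this sample"
                : "GC-free sample: " + nanos + " ns");
    }
}
```

Samples that return -1 would simply be discarded. Note that this only sees
collections the MXBeans have counted, so with a concurrent collector a cycle
that is still in progress may not show up.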

I would typically expect NUMA(Parallel) to be better than the rest,
but, just as you stated in your example, certain allocation+use patterns
could degrade its performance. Other than that (and except for G1; see below),
all other configurations (modulo the GC overhead remarks above) should show
similar allocation performance.

G1 uses "heap regions" which might somewhat limit TLAB growth and might
cause slightly lower allocation throughput (but not necessarily; run the
non-G1 collector with +PrintTLABStatistics to get some data on TLAB sizes).
However, you can try to fix that by choosing a larger heap region
size in G1. (G1 cognoscenti on the list can provide more details.)
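
To illustrate why a cap on TLAB size could cost some throughput, here is a toy
model (not HotSpot code) of bump-the-pointer allocation: each allocation just
advances a pointer until the buffer runs out, at which point a slower refill
is needed. The class, the buffer sizes and the object size are all assumptions
chosen for illustration; real TLAB sizing in HotSpot is adaptive and more
involved.

```java
public class TlabRefillModel {
    /** Toy thread-local buffer: allocation is a pointer bump until the buffer is full. */
    static final class Tlab {
        private final long capacity; // buffer size in bytes
        private long top;            // next free offset within the buffer
        long refills;                // slow-path refills taken so far

        Tlab(long capacityBytes) {
            this.capacity = capacityBytes;
        }

        /** Assumes sizeBytes <= capacity; larger objects are outside this model. */
        void allocate(long sizeBytes) {
            if (top + sizeBytes > capacity) {
                refills++; // slow path: retire this buffer and start a fresh one
                top = 0;
            }
            top += sizeBytes; // fast path: bump the pointer
        }
    }

    /** Number of refills needed to allocate objectCount objects of objectBytes each. */
    public static long refillsFor(long tlabBytes, long objectBytes, long objectCount) {
        Tlab tlab = new Tlab(tlabBytes);
        for (long i = 0; i < objectCount; i++) {
            tlab.allocate(objectBytes);
        }
        return tlab.refills;
    }

    public static void main(String[] args) {
        // Hypothetical sizes: one million 64-byte objects into a small and a large buffer.
        System.out.println("256 KB buffer refills: " + refillsFor(256 * 1024, 64, 1_000_000));
        System.out.println("4 MB buffer refills:   " + refillsFor(4 * 1024 * 1024, 64, 1_000_000));
    }
}
```

In this model the smaller buffer takes the slow path roughly sixteen times as
often for the same allocation stream, which is the kind of effect a limit on
TLAB growth might have.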

However, I see that you are interested not just in allocation performance
but in the latency of your operations in general (which is why you were
concerned with GC pause times themselves). In that case, you are right that
CMS or G1 would probably be superior if you had a heap footprint large enough
to incur whole-heap GCs (or at least if enough objects got promoted to the old
gen to require the occasional full GC). Between G1 and CMS,
G1 generally provides much more regular and predictable GC pauses,
but for a truly apples-to-apples comparison you cannot assume that the
optimal heap shape for CMS is the same as that for G1. CMS needs
hand tuning, whereas G1 finds something close to optimal (but might
occasionally need some help) -- thus, paradoxically, unless you have been
careful, the heap shape you set for G1 might be suboptimal, especially if you
merely took the CMS-optimal setting and used it with G1.

If you can arrange to have nothing promoted into the
old gen ever then, provided the lifetimes of objects are not so long
that you would spend effort copying them between survivor spaces, Parallel+NUMA
may be best, modulo the caveats about NUMA-allocator anti-patterns above.

One final remark: rather than looking at averages or other central measures,
I'd suggest looking at latency distribution metrics to quickly get a handle
on what is happening, and how to tune/configure which collector for your needs.
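
As a small illustration of why the distribution matters, the sketch below
computes nearest-rank percentiles over a set of latency samples. The sample
values are invented for the example (nine readings near 400 microseconds plus
one 30 ms outlier) and are not measurements from this thread; the class and
method names are likewise hypothetical.

```java
import java.util.Arrays;

public class LatencyPercentiles {
    /** Nearest-rank percentile of samples sorted ascending; p is in (0, 100]. */
    public static long percentile(long[] sorted, double p) {
        if (sorted.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        int rank = (int) Math.ceil(p * sorted.length / 100.0);
        return sorted[Math.max(rank, 1) - 1];
    }

    public static void main(String[] args) {
        // Hypothetical latencies in microseconds, including one pause-sized outlier.
        long[] micros = {380, 390, 400, 395, 385, 410, 30000, 392, 388, 401};
        long[] sorted = micros.clone();
        Arrays.sort(sorted);

        double mean = Arrays.stream(sorted).average().orElse(0);
        System.out.println("mean = " + mean);
        System.out.println("p50  = " + percentile(sorted, 50));
        System.out.println("p90  = " + percentile(sorted, 90));
        System.out.println("p99  = " + percentile(sorted, 99));
    }
}
```

With data like this the mean is dominated by the single outlier and describes
none of the actual operations, whereas the median and the high percentiles
separate the steady-state cost from the pause-related tail.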

-- ramki

On 04/13/10 10:46, Matt Khan wrote:
> Hi
> 
> I have been revisiting our JVM configuration with the aim of reducing 
> pause times; it would be nice to be consistently down below 3ms all the 
> time. The allocation behaviour of the application in question involves a 
> small amount of static data on startup & then a steady stream of objects 
> that have a relatively short lifespan. There are 2 typical lifetimes of 
> these objects with about 75% while the remainder have a mean of maybe 70s 
> but there is quite a long tail to this, so the typical lifetime is more 
> like <10s. There won't be many such objects alive at once but there are 
> quite a few passing through. The app runs on a 16-core Opteron box running 
> Solaris 10 with 6u18.
> 
> Therefore I've been benching different configurations with a massive eden 
> and relatively tiny tenured & trying different collectors to see how they 
> perform. These params were common to each run
> 
> -Xms3072m 
> -Xmx3072m 
> -Xmn2944m 
> -XX:+DisableExplicitGC 
> -XX:+PrintGCDetails 
> -XX:+PrintGCDateStamps 
> -XX:+PrintGCApplicationStoppedTime
> -XX:+PrintGCApplicationConcurrentTime
> -XX:MaxTenuringThreshold=1 
> -XX:SurvivorRatio=190 
> -XX:TargetSurvivorRatio=90
> 
> I then tried the following
> 
> # Parallel Scavenge 
> -XX:+UseParallelGC 
> -XX:+UseParallelOldGC 
> 
> # Parallel Scavenge with NUMA
> -XX:+UseParallelGC 
> -XX:+UseNUMA 
> -XX:+UseParallelOldGC 
> 
> # Incremental CMS/ParNew
> -XX:+UseConcMarkSweepGC 
> -XX:+CMSIncrementalMode 
> -XX:+CMSIncrementalPacing 
> -XX:+UseParNewGC 
> 
> # G1
> -XX:+UnlockExperimentalVMOptions 
> -XX:+UseG1GC 
> 
> The last two (CMS/G1) were repeated on 6u18 & 6u20b02 for completeness as 
> I see there were assorted fixes to G1 in 6u20b01.
> 
> I measure the time it takes to execute assorted points in my flow & see 
> fairly significant differences in latencies with each collector, for 
> example
> 
> 1) CMS == ~380-400micros 
> 2) Parallel + NUMA == ~400micros
> 3) Parallel == ~450micros
> 4) G1 == ~550micros
> 
> The times above are taken well after the jvm has warmed up (latencies have 
> stabilised, compilation activity is practically non-existent) & there is 
> no significant "other" activity on the server at the time. The differences 
> don't appear to be pause related as the shape of the distribution (around 
> those averages) is the same, it's as if it has settled into quite a 
> different steady state performance. This appears to be repeatable, though 
> given the time it takes to run this sort of benchmark, I admit to having 
> seen it repeated only a few times. I have run previous benchmarks that 
> repeat 20 times (keeping GC constant in this case, as I was testing 
> something else) without seeing variations that big across runs, which makes 
> me suspect the collection algorithm as the culprit.
> 
> So the point of this relatively long setup is to ask whether there are 
> theoretical reasons why the choice of garbage collection algorithm should 
> vary measured latency like this? I had been working on the assumption that 
> eden allocation is a "bump the pointer as you take it from a TLAB" type of 
> event hence generally cheap & doesn't really vary by algorithm.
> 
> fwiw the ParNew/CMS config is still the best one for keeping down pause 
> times though the parallel one was close. The former peaks at intermittent 
> pauses of 20-30ms, the latter at about 40ms. The Parallel + NUMA one 
> curiously involved many fewer pauses such that much less time was spent 
> paused but peaked higher (~120ms), which is unacceptable really. I don't 
> really understand why that is but speculated that it's down to the fact 
> that one of our key domain objects is allocated in a different thread to 
> where it is primarily used. Is this right?
> 
> If there is some other data that I should post to back up some of the 
> above then pls tell me and I'll add the info if I have it (and repeat the 
> test if I don't) 
> 
> Cheers
> Matt
> 
> Matt Khan
> --------------------------------------------------
> GFFX Auto Trading
> Deutsche Bank, London




More information about the hotspot-gc-dev mailing list