Does allocation performance vary by collector?
matt.khan at db.com
Tue Apr 13 10:46:09 PDT 2010
I have been revisiting our JVM configuration with the aim of reducing
pause times; it would be nice to be consistently below 3ms all the
time. The allocation behaviour of the application in question involves a
small amount of static data on startup, followed by a steady stream of
objects with relatively short lifespans. There are two typical lifetimes
for these objects: about 75% fall into the first group, while the
remainder have a mean of maybe 70s, though there is quite a long tail to
this, so the typical lifetime is more like <10s. There won't be many such
objects alive at once, but quite a few pass through. The app runs on a
16-core Opteron box running Solaris 10 with 6u18.
Therefore I've been benchmarking different configurations with a massive
eden and a relatively tiny tenured generation, and trying different
collectors to see how they perform. These params were common to each run:
I then tried the following:
# Parallel Scavenge
# Parallel Scavenge with NUMA
# Incremental CMS/ParNew
# CMS/ParNew
# G1
The last two (CMS/G1) were repeated on 6u18 & 6u20b02 for completeness as
I see there were assorted fixes to G1 in 6u20b01.
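For reference, on a 6u18-era HotSpot the configurations above would be selected with flags roughly like these. This is an illustrative sketch only: the common params and heap sizes from the actual runs aren't shown here, so the `-Xms`/`-Xmx`/`-Xmn` values below are placeholders, not the real ones.

```shell
# Placeholder sizes; a "massive eden, tiny tenured" shape, not the actual values.
COMMON="-Xms4g -Xmx4g -Xmn3g"

java $COMMON -XX:+UseParallelGC ...                               # Parallel Scavenge
java $COMMON -XX:+UseParallelGC -XX:+UseNUMA ...                  # Parallel Scavenge + NUMA
java $COMMON -XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode ...  # Incremental CMS/ParNew
java $COMMON -XX:+UseConcMarkSweepGC -XX:+UseParNewGC ...         # CMS/ParNew
java $COMMON -XX:+UnlockExperimentalVMOptions -XX:+UseG1GC ...    # G1 (experimental in 6u18)
```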
I measure the time it takes to execute assorted points in my flow and see
fairly significant differences in latency with each collector, for example:
1) CMS == ~380-400micros
2) Parallel + NUMA == ~400micros
3) Parallel == ~450micros
4) G1 == ~550micros
The times above are taken well after the JVM has warmed up (latencies have
stabilised, compilation activity is practically non-existent) and there is
no significant "other" activity on the server at the time. The differences
don't appear to be pause-related, as the shape of the distribution (around
those averages) is the same; it's as if each collector has settled into
quite a different steady-state performance. This appears to be repeatable,
though, given the time it takes to run this sort of benchmark, I admit to
only having seen it repeated a few times. I have run previous benchmarks
that repeat 20 times (keeping GC settings constant in that case; I was
testing something else) without seeing variations that big across runs,
which makes me suspect the collection algorithm as the culprit.
So the point of this relatively long setup is to ask: are there
theoretical reasons why the choice of garbage collection algorithm should
vary measured latency like this? I had been working on the assumption that
eden allocation is a "bump the pointer as you take it from a TLAB" type of
event, hence generally cheap, and doesn't really vary by algorithm.
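That assumption can at least be probed with a trivial microbenchmark. The sketch below is hypothetical (not my actual harness, and the class/field names are made up); it keeps each allocation reachable through a sink field so the JIT can't elide it, and reports a rough per-allocation cost to compare across collectors:

```java
// Rough per-allocation cost probe. Hypothetical harness: times a burst of
// small eden allocations after a warm-up pass. The sink field keeps each
// object reachable so the allocation isn't optimised away.
public class AllocBench {
    static Object sink; // prevents dead-code elimination of the allocation

    static double avgNanosPerAlloc(int iterations) {
        // warm-up pass so the timed loop runs after JIT compilation settles
        for (int i = 0; i < iterations; i++) sink = new byte[64];

        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) sink = new byte[64];
        return (double) (System.nanoTime() - start) / iterations;
    }

    public static void main(String[] args) {
        System.out.println("avg ns/alloc = " + avgNanosPerAlloc(10_000_000));
    }
}
```

Run with each collector flag set in turn; the absolute numbers are noisy, but a consistent per-collector offset would support (or refute) the cheap-bump-the-pointer assumption.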
fwiw the ParNew/CMS config is still the best one for keeping pause times
down, though the parallel one was close. The former peaks at intermittent
pauses of 20-30ms, the latter at about 40ms. The Parallel + NUMA run,
curiously, involved many fewer pauses, so much less time was spent paused
overall, but it peaked higher (~120ms), which is really unacceptable. I
don't really understand why that is, but I speculated that it's down to
the fact that one of our key domain objects is allocated on a different
thread from the one where it is primarily used. Is this right?
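The suspect pattern can be sketched as below (illustrative names only, not the actual domain objects). The idea is that with -XX:+UseNUMA, eden is split into per-node regions and a new object is placed near the thread that allocates it, so a consumer running on a different node may pay remote-memory latency on every touch:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Sketch of allocate-on-one-thread, use-on-another: the pattern that can
// interact badly with NUMA-aware eden placement. Under -XX:+UseNUMA the
// byte[] is allocated in the producer thread's local memory node, while
// the consumer (possibly on another node) does all the subsequent reads.
public class CrossThreadAlloc {
    static long run(int messages) throws InterruptedException {
        BlockingQueue<byte[]> queue = new ArrayBlockingQueue<>(1024);

        Thread producer = new Thread(() -> {
            try {
                for (int i = 0; i < messages; i++) {
                    queue.put(new byte[256]); // allocated on the producer's node
                }
            } catch (InterruptedException ignored) { }
        });
        producer.start();

        long bytesSeen = 0;
        for (int i = 0; i < messages; i++) {
            byte[] msg = queue.take(); // touched on the consumer's node
            bytesSeen += msg.length;
        }
        producer.join();
        return bytesSeen;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("consumed bytes = " + run(100_000));
    }
}
```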
If there is some other data that I should post to back up some of the
above then please tell me and I'll add the info if I have it (and repeat
the test if I don't).
GFFX Auto Trading
Deutsche Bank, London