I have been revisiting our JVM configuration with the aim of reducing 
pause times; it would be nice to be consistently below 3ms. The 
allocation behaviour of the application in question involves a small 
amount of static data on startup & then a steady stream of objects that 
have a relatively short lifespan. There are 2 typical lifetimes of these 
objects, with about 75% in the first group, while the remainder have a 
mean of maybe 70s, but there is quite a long tail to this so the typical 
lifetime is more like <10s. There won't be many such objects alive at 
once but there are quite a few passing through. The app runs on a 16 core 
Opteron box running Solaris 10 with 6u18.

Therefore I've been benchmarking different configurations with a massive 
eden and a relatively tiny tenured generation & trying different 
collectors to see how they perform. These params were common to each run


I then tried the following

# Parallel Scavenge 

# Parallel Scavenge with NUMA

# Incremental CMS/ParNew

# G1

The last two (CMS/G1) were repeated on 6u18 & 6u20b02 for completeness as 
I see there were assorted fixes to G1 in 6u20b01.

I measure the time it takes to execute assorted points in my flow & see 
fairly significant differences in latency with each collector, for 
example:

1) CMS == ~380-400micros 
2) Parallel + NUMA == ~400micros
3) Parallel == ~450micros
4) G1 == ~550micros

The times above are taken well after the JVM has warmed up (latencies have 
stabilised, compilation activity is practically non-existent) & there is 
no significant "other" activity on the server at the time. The differences 
don't appear to be pause related, as the shape of the distribution (around 
those averages) is the same; it's as if each run has settled into quite a 
different steady-state performance. This appears to be repeatable though, 
given the time it takes to run this sort of benchmark, I admit to having 
seen it repeated only a few times. I have run previous benchmarks that 
repeat the test 20 times (keeping the GC constant in that case, as I was 
testing something else) without seeing variations that big across runs, 
which makes me suspect the collection algorithm as the culprit.
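
To make "same shape, different average" concrete, here is a minimal sketch 
of how such timings might be summarised, assuming each measurement is a 
System.nanoTime() delta collected into an array. The class LatencySummary 
and its methods are illustrative names only, not the actual benchmark 
harness, and the loop in main is a stand-in workload.

```java
import java.util.Arrays;

// Hypothetical helper for summarising per-operation latencies (nanoseconds).
final class LatencySummary {

    /** Arithmetic mean of the samples. */
    static double mean(long[] samplesNanos) {
        long sum = 0;
        for (long s : samplesNanos) {
            sum += s;
        }
        return (double) sum / samplesNanos.length;
    }

    /** Nearest-rank percentile, for p in (0, 100]. */
    static long percentile(long[] samplesNanos, double p) {
        long[] sorted = samplesNanos.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p / 100.0 * sorted.length);
        return sorted[Math.max(rank, 1) - 1];
    }

    public static void main(String[] args) {
        long[] samples = new long[10000];
        for (int i = 0; i < samples.length; i++) {
            long start = System.nanoTime();
            new StringBuilder().append(i).toString(); // stand-in for the measured step
            samples[i] = System.nanoTime() - start;
        }
        System.out.println("mean ns  = " + mean(samples));
        System.out.println("p50 ns   = " + percentile(samples, 50));
        System.out.println("p99 ns   = " + percentile(samples, 99));
        System.out.println("max ns   = " + percentile(samples, 100));
    }
}
```

If the gap between the median and the high percentiles stays roughly the 
same across collectors while the mean moves, that points at a steady-state 
cost rather than at pauses.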

So the point of this relatively long setup is to ask whether there are 
theoretical reasons why the choice of garbage collection algorithm should 
vary measured latency like this? I had been working on the assumption that 
eden allocation is a "bump the pointer as you take it from a TLAB" type of 
event hence generally cheap & doesn't really vary by algorithm.
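
To spell out that assumption, here is a toy model of what "bump the 
pointer" allocation from a thread-local buffer looks like. It is only a 
sketch of the general technique, not HotSpot's implementation; TlabSketch, 
allocate and used are made-up names, and offsets stand in for addresses.

```java
// Toy model of bump-the-pointer allocation from a thread-local buffer.
// Not HotSpot code: all names here are hypothetical.
final class TlabSketch {
    private final long end; // one past the last usable offset
    private long top;       // next free offset

    TlabSketch(long sizeBytes) {
        this.top = 0;
        this.end = sizeBytes;
    }

    /** Returns the offset of the new object, or -1 if the buffer is exhausted. */
    long allocate(long sizeBytes) {
        long aligned = (sizeBytes + 7) & ~7L; // round up to 8-byte alignment
        if (aligned > end - top) {
            // Slow path: a real VM would fetch a fresh buffer or trigger a collection.
            return -1;
        }
        long result = top;
        top += aligned; // the fast path is a compare and an add
        return result;
    }

    long used() {
        return top;
    }

    public static void main(String[] args) {
        TlabSketch tlab = new TlabSketch(64);
        System.out.println(tlab.allocate(24)); // 0
        System.out.println(tlab.allocate(10)); // 24 (10 is rounded up to 16)
        System.out.println(tlab.allocate(32)); // -1 (only 24 bytes left)
    }
}
```

In this model the fast path is identical whatever collector sits behind 
it; only the slow path differs.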

fwiw the ParNew/CMS config is still the best one for keeping down pause 
times, though the parallel one was close. The former peaks at intermittent 
pauses of 20-30ms, the latter at about 40ms. The Parallel + NUMA one 
curiously involved many fewer pauses, such that much less time was spent 
paused, but it peaked higher (~120ms), which is unacceptable really. I 
don't really understand why that is but speculated that it's down to the 
fact that one of our key domain objects is allocated in a different thread 
to where it is primarily used. Is this right?
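
For comparing "fewer pauses" against "less time paused" across runs, a 
rough sketch using the standard java.lang.management API is below. The 
class GcTotals and the helper meanMillis are illustrative names. Note the 
beans only report cumulative counts and approximate elapsed times per 
collector, so this gives totals and a mean, not the peak pause; peaks 
still have to come from the GC logs.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical helper: prints cumulative GC counts and times for this JVM.
final class GcTotals {

    /** Mean time per collection in ms, or 0 when no collections were reported. */
    static double meanMillis(long count, long totalMillis) {
        return count > 0 ? (double) totalMillis / count : 0.0;
    }

    public static void main(String[] args) {
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long count = gc.getCollectionCount();  // -1 if undefined for this collector
            long millis = gc.getCollectionTime();  // cumulative, approximate, in ms
            System.out.println(gc.getName()
                    + ": collections=" + count
                    + ", totalMs=" + millis
                    + ", meanMs=" + meanMillis(count, millis));
        }
    }
}
```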

If there is some other data that I should post to back up some of the 
above then pls tell me and I'll add the info if I have it (and repeat the 
test if I don't) 


