CMS Promotion Failures
Y. S. Ramakrishna
y.s.ramakrishna at oracle.com
Wed Nov 17 09:21:02 PST 2010
On 11/17/10 06:17, Brian Williams wrote:
> On Nov 15, 2010, at 10:08 PM, Y. Srinivas Ramakrishna wrote:
>>> 1. Is there anything that could explain, beyond application usage, why the promotion failures are getting closer and closer together?
>> I have not seen that behaviour before. The only case I can think of where that would occur is if
>> the heap occupancy is also monotonically increasing, so that the "free space" available keeps
>> getting smaller. But I am grasping at straws here.
> Output from jstat seems to indicate that's not the case here. Unfortunately, we're seeing this on a production server that doesn't have GC logging enabled. We're in the process of trying to get it enabled so we can try to understand this better.
>>> 2. And as a follow-on question: does calling System.gc() leave the heap in a better state than a promotion failure? (This will help us to answer whether we want to push for a server restart or a scheduled GC).
>> If you are not using -XX:+ExplicitGCInvokesConcurrent, then both would leave the heap
>> in an identical state, because they both cause a single-threaded (alas, still) compacting
>> collection of the entire heap. So, yes, scheduling explicit GCs to compact down
>> the heap at an opportune time would definitely be worthwhile, if possible.
>>> 3. Would using fewer parallel GC threads help reduce the fragmentation by having fewer PLABs?
>> Yes, this is usually the case. More PLABs (in the form of cached free lists with the
>> individual GC worker threads) do translate to potentially more fragmentation, although
>> I have generally found that our autonomic per-block inventory control usually
>> keeps such fragmentation in check (unless the threads are "far too many" and the
>> free space "too little").
> We're running on a 32-way x4600 and aren't setting the ParallelGC threads explicitly, so we're probably ending up with 32. We will try to dial it down to see how that helps.
I think you get 5/8*n, so probably closer to 20. With the amount of data that is copied
per scavenge and the size of your old gen, 20 seems reasonable and probably does not need
dialing down (at least at first blush).
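
For reference, here is a rough sketch of the arithmetic behind that estimate. The first method is the plain 5/8*n rule of thumb mentioned above. The second is an assumption on my part: as far as I recall, HotSpot on this kind of hardware uses one worker per CPU up to 8 CPUs and 5/8 of each CPU beyond that, which gives a slightly higher number. The class and method names are made up for illustration and are not part of any JDK API; the value the VM actually picks may differ by release and platform, and it can always be pinned explicitly with -XX:ParallelGCThreads=<n>.

```java
public class GcThreadEstimate {

    // Plain rule of thumb from this thread: 5/8 of the hardware threads.
    static int fiveEighths(int cpus) {
        return (cpus * 5) / 8;
    }

    // Assumed HotSpot-style heuristic (not confirmed for every release):
    // one GC worker per CPU up to 8 CPUs, then 5/8 of each additional CPU.
    static int withThreshold(int cpus) {
        final int switchPoint = 8;
        if (cpus <= switchPoint) {
            return cpus;
        }
        return switchPoint + ((cpus - switchPoint) * 5) / 8;
    }

    public static void main(String[] args) {
        // Use the CPU count given on the command line, else this machine's.
        int cpus = args.length > 0
                ? Integer.parseInt(args[0])
                : Runtime.getRuntime().availableProcessors();
        System.out.println("cpus               = " + cpus);
        System.out.println("5/8 * n            = " + fiveEighths(cpus));
        System.out.println("threshold variant  = " + withThreshold(cpus));
    }
}
```

For a 32-way box, the simple rule gives 20 and the threshold variant gives 23; either way the conclusion above (no need to dial it down at first blush) is unchanged.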
From looking at the snippets you sent, it almost seems like some kind of bug in
CMS allocation because there is plenty of free space (and comparatively not that
much promotion) when the promotion failure occurs (although full gc logs would
be needed before one could be confident of this pronouncement). So it would be
worthwhile to investigate this closely to see why this is happening. I somehow do
not think this is a tuning issue, but something else. Do you have Java support,
and are you able to open a formal ticket with Oracle, so that some formal/dedicated cycles can be
devoted to looking at the issue?
What's the version of JDK you are running?