G1GC Full GCs

Todd Lipcon todd at cloudera.com
Sat Jan 22 00:51:42 UTC 2011

A bit more data. I did the following patch:

@@ -1560,6 +1575,19 @@

+    } else {
+      // no non-young gen collections - if our prediction is high enough,
we would
+      // never collect non-young again, so push it back towards zero so we
give it
+      // another try.
+      double predicted_other_time = predict_non_young_other_time_ms(1);
+      if (predicted_other_time > MaxGCPauseMillis/2.0) {
+        if (G1PolicyVerbose > 0) {
+          gclog_or_tty->print_cr(
+            "Predicted non-young other time %.1f is too large compared to
max pause time. Weighting down.",
+            predicted_other_time);
+        }
+        _non_young_other_cost_per_region_ms_seq->add(0.0);
+      }

and this mostly solved the problem described above. Now I get a full GC
every 45-50 minutes which is way improved from what it was before.

I still seem to be putting off GC of non-young regions too much though. I
did some analysis of the G1 log and made these graphs:


The top graph is a heat map of the number of young (pink color) and
non-young (blue) in each collection.
The middle graph is the post-collection heap usage over time in MB
The bottom graph is a heat map and smoothed line graph of the number of
millis spent per collection. The target in this case is 50ms.

 A few interesting things:
 - not sure what causes the sort of periodic striated pattern in the number
of young generation regions chosen
 - most of the time no old gen regions are selected for collection at all!
Here's a graph of just old regions:
 - When old regions are actually selected for collection the heap usage does
drop, though elapsed time does spike over the guarantee.

So seems like something about the heuristics aren't quite right. Thoughts?


On Fri, Jan 21, 2011 at 11:38 AM, Todd Lipcon <todd at cloudera.com> wrote:

> Hey folks,
> Took some time over the last day or two to follow up on this on the latest
> checkout of JDK7. I added some more instrumentation and my findings so far
> are:
> 1) CMS is definitely hitting a fragmentation problem. Our workload is
> pretty much guaranteed to fragment, and I don't think there's anything CMS
> can do about it - see the following graphs:
> http://people.apache.org/~todd/hbase-fragmentation/
> <http://people.apache.org/~todd/hbase-fragmentation/>2) G1GC is hitting
> full pauses because the "other" pause time predictions end up higher than
> the minimum pause length. I'm seeing the following sequence:
> - A single choose_cset operation for a non_young region takes a long time
> (unclear yet why this is happening, see below)
> - This inflates the predict_non_young_other_time_ms(1) result to a value
> greater than my pause goal
> - From then on, it doesn't collect any more non-young regions (ever!)
> because any region will be considered expensive regardless of the estimated
> rset or collection costs
> - The heap slowly fills up with non-young regions until we reach a full GC
> 3) So the question is why the choose_cset is taking a long time. I added
> getrusage() calls to wrap the choose_cset operation. Here's some output with
> extra logging:
> --> About to choose cset at   725.458
> Adding 1 young regions to the CSet
>   Added 0x0000000000000001 Young Regions to CS.
>     (3596288 KB left in heap.)
>     (picked region; 9.948053ms predicted; 21.164738ms remaining; 2448kb
> marked; 2448kb maxlive; 59-59% liveness)
>     (3593839 KB left in heap.)
> predict_region_elapsed_time_ms: 10.493828ms total, 5.486072ms rs scan
> (14528 cnum), 4.828102 copy time (2333800 bytes), 0.179654 other time
>     (picked region; 10.493828ms predicted; 11.216685ms remaining; 2279kb
> marked; 2279kb maxlive; 55-55% liveness)
> predict_region_elapsed_time_ms: 10.493828ms total, 5.486072ms rs scan
> (14528 cnum), 4.828102 copy time (2333800 bytes), 0.179654 other time
>     (3591560 KB left in heap.)
> predict_region_elapsed_time_ms: 10.346346ms total, 5.119780ms rs scan
> (13558 cnum), 5.046912 copy time (2439568 bytes), 0.179654 other time
> predict_region_elapsed_time_ms: 10.407672ms total, 5.333135ms rs scan
> (14123 cnum), 4.894882 copy time (2366080 bytes), 0.179654 other time
>     (no more marked regions; next region too expensive (adaptive; predicted
> 10.407672ms > remaining 0.722857ms))
> Resource usage of choose_cset:majflt: 0 nswap: 0  nvcsw: 6  nivcsw: 0
> --> About to prepare RS scan at   725.657
> The resource usage line with nvcsw=6 indicates there were 6 voluntary
> context switches while choosing cset. This choose_cset operation took
> 198.9ms all in choosing non-young.
> So, why are there voluntary context switches while choosing cset? This
> isn't swapping -- that should show under majflt, right? My only theories
> are:
> - are any locks acquired in choose_cset?
> - maybe the gc logging itself is blocking on IO to the log file? ie the
> instrumentation itself is interfering with the algorithm?
> Regardless, I think a single length choose_non_young_cset operation
> shouldn't be allowed to push the prediction above the time boundary and
> trigger this issue. Perhaps a simple workaround is that, whenever a
> collection chooses no non_young regions, it should contribute a value of 0
> to the average?
> I'll give this heuristic a try on my build and see if it solves the issue.
> -Todd
> On Tue, Jul 27, 2010 at 3:08 PM, Todd Lipcon <todd at cloudera.com> wrote:
>> Hi all,
>> Back from my vacation and took some time yesterday and today to build a
>> fresh JDK 7 with some additional debug printouts from Peter's patch.
>> What I found was a bit different - the rset scanning estimates are low,
>> but I consistently am seeing "Other time" estimates in the >40ms range.
>> Given my pause time goal of 20ms, these estimates are I think excluding most
>> of the regions from collectability. I haven't been able to dig around yet to
>> figure out where the long estimate for "other time" is coming from - in the
>> collections logged it sometimes shows fairly high "Other" but the "Choose
>> CSet" component is very short. I'll try to add some more debug info to the
>> verbose logging and rerun some tests over the next couple of days.
>> At the moment I'm giving the JRockit VM a try to see how its deterministic
>> GC stacks up against G1 and CMS.
>> Thanks
>> -Todd
>> On Tue, Jul 13, 2010 at 5:15 PM, Peter Schuller <
>> peter.schuller at infidyne.com> wrote:
>>> Ramki/Tony,
>>> > Any chance of setting +PrintHeapAtGC and -XX:+PrintHeapAtGCExtended and
>>> > sending us the log, or part of it (say between two Full GCs)? Be
>>> prepared:
>>> > this will generate piles of output. But it will give us per-region
>>> > information that might shed more light on the cause of the issue....
>>> thanks,
>>> So what I have in terms of data is (see footnotes for urls references in
>>> []):
>>> (a) A patch[1] that prints some additional information about estimated
>>> costs of region eviction, and disables the GC efficiency check that
>>> normally terminates selection of regions. (Note: This is a throw-away
>>> patch for debugging; it's not intended as a suggested change for
>>> inclusion.)
>>> (b) A log[2] showing the output of a test run I did just now, with
>>> both your flags above and my patch enabled (but without disabling the
>>> efficiency check). It shows fallback to full GC when the actual live
>>> set size is 252 MB, and the maximum heap size is 2 GB (in other words,
>>> ~ 12% liveness). An easy way to find the point of full gc is to search
>>> for the string 'full 1'.
>>> (c) A file[3] with the effective VM options during the test.
>>> (d) Instructions for how to run the test to reproduce it (I'll get to
>>> that at the end; it's simplified relative to previously).
>>> (e) Nature of the test.
>>> Discussion:
>>> WIth respect to region information: I originally tried it in response
>>> to your recommendation earlier, but I found I did not see the
>>> information I was after. Perhaps I was just misreading it, but I
>>> mostly just saw either 0% or 100% fullness, and never the actual
>>> liveness estimate as produced by the mark phase. In the log I am
>>> referring to in this E-Mail, you can see that the last printout of
>>> region information just before the live GC fits this pattern; I just
>>> don't see anything that looks like legitimate liveness information
>>> being printed. (I don't have time to dig back into it right now to
>>> double-check what it's printing.)
>>> If you scroll up from the point of the full gc until you find a bunch
>>> of output starting with "predict_region_elapsed_time_ms" you see some
>>> output resulting from the patch, with pretty extreme values such as:
>>> predict_region_elapsed_time_ms: 34.378642ms total, 34.021154ms rs scan
>>> (46542 cnum), 0.040069 copy time (20704 bytes), 0.317419 other time
>>> predict_region_elapsed_time_ms: 45.020866ms total, 44.653222ms rs scan
>>> (61087 cnum), 0.050225 copy time (25952 bytes), 0.317419 other time
>>> predict_region_elapsed_time_ms: 16.250033ms total, 15.887065ms rs scan
>>> (21734 cnum), 0.045550 copy time (23536 bytes), 0.317419 other time
>>> predict_region_elapsed_time_ms: 226.942877ms total, 226.559163ms rs
>>> scan (309940 cnum), 0.066296 copy time (34256 bytes), 0.317419 other
>>> time
>>> predict_region_elapsed_time_ms: 542.344828ms total, 541.954750ms rs
>>> scan (741411 cnum), 0.072659 copy time (37544 bytes), 0.317419 other
>>> time
>>> predict_region_elapsed_time_ms: 668.595597ms total, 668.220877ms rs
>>> scan (914147 cnum), 0.057301 copy time (29608 bytes), 0.317419 other
>>> time
>>> So in the most extreme case in the excerpt above, that's > half a
>>> second of estimate rset scanning time for a single region with 914147
>>> cards to be scanned. While not all are that extreme, lots and lots of
>>> regions are very expensive and almost only due to rset scanning costs.
>>> If you scroll down a bit to the first (and ONLY) partial that happened
>>> after the statistics accumulating from the marking phase, we see more
>>> output resulting form the patch. At the end, we see:
>>>    (picked region; 0.345890ms predicted; 1.317244ms remaining; 15kb
>>> marked; 15kb maxlive; 1-1% liveness)
>>>    (393380 KB left in heap.)
>>>    (picked region; 0.345963ms predicted; 0.971354ms remaining; 15kb
>>> marked; 15kb maxlive; 1-1% liveness)
>>>    (393365 KB left in heap.)
>>>    (picked region; 0.346036ms predicted; 0.625391ms remaining; 15kb
>>> marked; 15kb maxlive; 1-1% liveness)
>>>    (393349 KB left in heap.)
>>>    (no more marked regions; next region too expensive (adaptive;
>>> predicted 0.346036ms > remaining 0.279355ms))
>>> So in other words, it picked a bunch of regions in order of "lowest
>>> hanging fruit". The *least* low hanging fruit picked still had
>>> liveness at 1%; in other words, there's plenty of further regions that
>>> ideally should be collected because they contain almost no garbage
>>> (ignoring the cost of collecting them).
>>> In this case, it stopped picking regions because the next region to be
>>> picked, though cheap, was the straw that broke the camel's neck and we
>>> simply exceeded the alloted time for this particular GC.
>>> However, after this partial completes, it reverts back to doing just
>>> young gc:s. In other words, even though there's *plenty* of regions
>>> with very low liveness, further partials aren't happening.
>>> By applying this part of the patch:
>>>  -           (adaptive_young_list_length() &&
>>>  +           (adaptive_young_list_length() && false && // scodetodo
>>> I artificially force g1 to not fall back to doing young gc:s for
>>> efficiency reasons. When I run with that change, I don't experience
>>> the slow perpetual growth until fallback to full GC. If I remember
>>> correctly though, the rset scanning cost is in fact high, but I don't
>>> have details saved and I'm afraid I don't have time to re-run those
>>> tests right now and compare numbers.
>>> Reproducing it:
>>> I made some changes and the test case should now hopefully be easy to
>>> run assuming you have maven installed. The github project is at:
>>>   http://github.com/scode/httpgctest
>>> There is a README, but the shortest possible instructions to
>>> re-produce the test that I did:
>>>  git clone git://github.com/scode/httpgctest.git
>>>  cd httpgctest.git
>>>  git checkout 20100714_1       # grab from appropriate tag, in case I
>>> change master
>>>  mvn package
>>>  HTTPGCTEST_LOGGC=gc.log ./run.sh
>>> That should start the http server; then run concurrently:
>>>  while [ 1 ] ; do curl 'http://localhost:9191/dropdata?ratio=0.10' ;
>>> curl 'http://localhost:9191/gendata?amount=25000' ; sleep 0.1  ; done
>>> And then just wait and observe.
>>> Nature of the test:
>>> So the test if run as above will essentially reach a steady state of
>>> equilibrium with about 25000 pieces of data in a clojure immutable
>>> map. The result is that a significant amount of new data is being
>>> allocated, but very little writing to old regions is happening. The
>>> garbage generated is very well spread out over the entire heap because
>>> it goes through all objects and drops 10% (the ratio=0.10) for each
>>> iteration, after which it adds 25000 new items.
>>> In other words; not a lot of old gen writing, but lots of writes to
>>> the young gen referencing objects in the old gen.
>>> [1] http://distfiles.scode.org/mlref/g1/g1_region_live_stats_hack.patch
>>> [2]
>>> http://distfiles.scode.org/mlref/gctest/httpgctest-g1-fullgc-20100714/gc-fullfallback.log
>>> [3]
>>> http://distfiles.scode.org/mlref/gctest/httpgctest-g1-fullgc-20100714/vmoptions.txt
>>> --
>>> / Peter Schuller
>> --
>> Todd Lipcon
>> Software Engineer, Cloudera
> --
> Todd Lipcon
> Software Engineer, Cloudera

Todd Lipcon
Software Engineer, Cloudera
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://mail.openjdk.java.net/pipermail/hotspot-gc-dev/attachments/20110121/60510b12/attachment.htm>
-------------- next part --------------
hotspot-gc-use mailing list
hotspot-gc-use at openjdk.java.net

More information about the hotspot-gc-dev mailing list