request for review (S): 7042740: CMS: assert(n> q) failed: Looping at: ... blockOffsetTable.cpp:557
Y. Srinivas Ramakrishna
y.s.ramakrishna at oracle.com
Fri May 20 07:43:08 UTC 2011
What follows is some background on this problem, for those who
may not be familiar with or may have forgotten the details:
The sweeper must occasionally "yield"
during its sweep -- for instance to allow a foreground
scavenge to happen, to allow a direct allocation of a large object,
because of a JNI critical section, or, when sweeping perm gen, for
metadata allocation during class loading. Typically, the sweeper
can then continue exactly from the point at which it had yielded.
If these yields are at a block boundary, the sweeper can restart
there and be sure that it is still at a block boundary. This is
because blocks will not spontaneously coalesce together, rendering
block boundaries non-boundaries.
Furthermore, if the high water mark of the heap is recorded when a concurrent
cycle starts, then by current allocate-live policy, there can't be
unmarked objects above that recorded high water mark during the sweeping phase
of that cycle, because all such objects must have been allocated black.
Thus, the sweeper can terminate its sweep at this previously recorded
high water mark. This can be useful to limit unnecessary sweeping
work when the heap is rapidly expanding.
However, when the high water mark is recorded at the end of the heap,
and the heap ends in a free chunk, then a subsequent expansion
coalesces the previously coterminal chunk with the expansion delta,
so as to reduce fragmentation. Unfortunately, this renders the recorded
high water mark a non-block-boundary, so that it is dangerous for
the sweeper to assume otherwise and try to determine the length of
the "following" block. This is how the problem originally began.
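To make the scenario above concrete, here is a minimal toy model of it.
It is only a sketch: ToyHeap, Block and limit_survives_expansion are
hypothetical names invented for illustration, not HotSpot code, and
sizes are in abstract "words".

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy model (not HotSpot code): the heap is a sequence of contiguous blocks.
struct Block {
    std::size_t start;  // first word of the block
    std::size_t size;   // length in words
    bool is_free;       // true for a free chunk
};

struct ToyHeap {
    std::vector<Block> blocks;
    std::size_t end = 0;  // one past the last word of the heap

    void append(std::size_t size, bool is_free) {
        blocks.push_back({end, size, is_free});
        end += size;
    }

    // Grow the heap by delta words. If the heap ends in a free chunk, the
    // new space is merged into that chunk to reduce fragmentation.
    void expand(std::size_t delta) {
        if (!blocks.empty() && blocks.back().is_free) {
            blocks.back().size += delta;
            end += delta;
        } else {
            append(delta, true);
        }
    }

    bool is_block_start(std::size_t addr) const {
        for (const Block& b : blocks) {
            if (b.start == addr) return true;
        }
        return false;
    }
};

// Records the sweep limit at the end of the heap, expands the heap, and
// reports whether the recorded limit is still a block boundary afterwards.
bool limit_survives_expansion(bool heap_ends_in_free_chunk) {
    ToyHeap heap;
    heap.append(8, false);
    heap.append(4, heap_ends_in_free_chunk);
    const std::size_t sweep_limit = heap.end;  // recorded at cycle start
    heap.expand(16);
    return heap.is_block_start(sweep_limit);
}
```

When the heap ends in an allocated block, the expansion delta becomes a
new block starting exactly at the recorded limit. When it ends in a free
chunk, the chunk simply grows and the recorded limit falls in its interior.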
We recognized this problem in a previous CR (see webrev for the
Zeno-like trail of associated CRs), but that fix was incorrect
because although it avoided stepping past the recorded
high watermark and getting into trouble, it would return
only the prefix ending at the high water mark to the free
lists. Not only would this lose some space in the form of
a one-time leak, but the block offset table entries for the
suffix could now send a walker off into never-never land: a
backward logarithmic jump from the suffix could land at an
arbitrary point in the prefix, which is not a block boundary.
Once this happened, one could run into any manner
of assertion failures or crashes in the debug vm, and crashes
in the product vm. There could be other failure modes
depending on where the "lost space" lay on a card that
might have been dirtied by a later store, as object iteration
stepped into the "no man's land" created by the leaked
space, for example between two bona-fide objects.
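The "no man's land" effect can be illustrated with a toy heap walker.
This is a simplified sketch under stated assumptions: it does not model
the block offset table or its logarithmic back-skips at all, only the
fact that a forward walk by block size has nowhere to go once it lands
in the leaked suffix. The names walk and heap_after_sweep are
hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy heap (not HotSpot code): headers[i] holds the size of the block that
// starts at word i, or 0 if no block starts there.
//
// Walks block by block from word 0. Returns the heap size if the whole heap
// parses, otherwise the first address at which no block header is found.
std::size_t walk(const std::vector<std::size_t>& headers) {
    std::size_t addr = 0;
    while (addr < headers.size()) {
        const std::size_t size = headers[addr];
        if (size == 0) return addr;  // landed on a non-block boundary
        addr += size;
    }
    return addr;
}

// Builds a 28-word heap: a live object in [0,8) followed by a free chunk
// that really spans [8,28), with the sweep limit recorded at word 12.
// If return_whole_block is false, only the prefix [8,12) is recorded as a
// free block and the suffix [12,28) is leaked.
std::vector<std::size_t> heap_after_sweep(bool return_whole_block) {
    std::vector<std::size_t> headers(28, 0);
    headers[0] = 8;
    headers[8] = return_whole_block ? 20 : 4;
    return headers;
}
```

With the whole block returned, the walk reaches the end of the heap. With
only the prefix returned, it stops at the old sweep limit, where nothing
describes what follows.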
The problem has existed forever of course, but became easier
to reproduce because setting BlockOffsetArrayUnallocatedBlock
to false allowed the recorded high water mark (the sweep limit)
to move to the end of the heap during an inflationary phase
in the heap, exposing us to the problem. Previously,
the setting of that boolean often prevented us from going
right up to the end because we would instead stop at the top
of the allocated part, making it much less likely -- but not
impossible -- to hit this bug.
One of the problems with fixing this bug was that the sweep
closure does not remember the last block boundary examined
when it steps to the end of that block, so it did not have
sufficient information to sweep up the full block that
straddled the recorded high water mark as we stepped over it.
The fix was for the closure to do a one-step lookahead when it
handled free or newly garbage blocks to see if it would reach
the high water mark at the next step, and if so determine
the size of the block correctly at that point, so it could
be returned in its entirety, potentially coalesced with a set
of preceding blocks.
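The one-step lookahead can be sketched as follows. This is a toy
rendering of the idea, not the actual sweep closure: Block, Range and
sweep are hypothetical names, and the lookahead flag exists only so the
earlier prefix-only behaviour can be compared with the fixed behaviour.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Toy model (not HotSpot code).
struct Block {
    std::size_t start;  // first word of the block
    std::size_t size;   // length in words
    bool reclaimable;   // free chunk or newly found garbage
};

using Range = std::pair<std::size_t, std::size_t>;  // [start, end)

// Sweeps blocks that start below limit, coalescing each run of adjacent
// reclaimable blocks into one range. When a reclaimable block would reach
// or cross the limit, lookahead == true returns the block in its entirety;
// lookahead == false truncates the run at the limit.
std::vector<Range> sweep(const std::vector<Block>& blocks,
                         std::size_t limit, bool lookahead) {
    std::vector<Range> freed;
    bool in_run = false;
    std::size_t run_start = 0;
    std::size_t run_end = 0;
    for (const Block& b : blocks) {
        if (b.start >= limit) break;  // nothing to sweep at or above the limit
        const std::size_t block_end = b.start + b.size;
        if (b.reclaimable) {
            if (!in_run) {
                in_run = true;
                run_start = b.start;
            }
            // Lookahead: will the next step reach the limit?
            const bool reaches_limit = block_end >= limit;
            run_end = (reaches_limit && !lookahead) ? limit : block_end;
        } else if (in_run) {
            freed.push_back({run_start, run_end});  // flush the finished run
            in_run = false;
        }
    }
    if (in_run) freed.push_back({run_start, run_end});
    return freed;
}
```

For a heap of a live block [0,4), a garbage block [4,8) and a free chunk
[8,28), with the limit at 12, the lookahead variant returns the single
coalesced range [4,28), while the variant without lookahead returns only
[4,12) and leaks the rest.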
While debugging this code, I added a few new asserts that check
various invariants that should hold at various points of the sweep,
elaborated the error messages issued by some existing asserts,
and added some sweeping status messages under the non-product sweep-tracing
flag. Additionally, I const'd some variables in the associated/affected
Testing: gclocker001; jprt; (will test with OpenDS prior to pushing fix)