RFR(s): 8055239: assert(_thread == Thread::current()->osthread()) failed: The PromotionFailedInfo should be thread local.
kim.barrett at oracle.com
Sat Nov 22 20:45:42 UTC 2014
On Nov 21, 2014, at 12:35 AM, Sangheon Kim <sangheon.kim at oracle.com> wrote:
> Please review this change to reset PromotionFailedInfo before use:
> This issue happens only with ParallelRefProcEnabled.
> And this assert happens as the current thread (when second promotion failure happened) is different from saved thread (when first promotion failure happened).
> From ParNewGeneration::collect(), there are two mt processing, for live object processing (by ParNewGenTask) and reference processing (by ParNewRefEnqueueTaskProxy). When first promotion failure happened from ParNewGenTask, thread pointer will be stored by PromotionFailedInfo::register_copy_failure(). And at second failure, assert happens as thread pointer is not guaranteed to be same.
> As the thread counts increase, its frequency become higher.
> Added ParScanThreadStateSet::trace_promotion_failed() to trace and reset promotion failure information instance.
> - jprt
This proposed change doesn't address what I think is the root cause of
the problem, which appears to me to be the semantically questionable
use of a single ParScanThreadStateSet for what are conceptually two
distinct sets of threads - the original generation scanning threads
and the reference processing threads. (The same underlying OS threads
are being used for both, but they are being run with completely
different "tasks", and the ThreadState info is really per-task)
Besides the unexpected reassociation of PromotionFailedInfo objects
(from owning ParScanThreadState objects) with new OS threads, it looks
like there are probably other things that are "messed up" by this
reuse, e.g. timing information and TASKQUEUE_STATS.
Creating a second state set for use by reference processing would
require changing the signature of handle_promotion_failed to deal with
two sets, but that function is private and only called in one place,
so that shouldn't be hard.
The real cost would be in the runtime construction of a second state
set. I'm not sure whether that would be important or not. It is also
only needed when the ParallelRefProcEnabled option is true, since
that's the case where the re-use occurs, which adds a bit of
One option might be to capture the information to be reported in a new
object and reinitialize the ThreadState objects, capture additional
information after the reuse (if needed), and pass both sets of
captured information to handle_promotion_failed.
An entirely different approach to addressing just the proximate
failure would be to eliminate the _thread member from
PromotionFailureInfo and deal with the consequences of that change. I
can't see any utility in reporting the OS thread associated with an
internal worker task when reporting promotion failure statistics. The
relevant context is long gone, after all. And really, the ThreadState
objects exist to avoid race conditions amongst the multiple
threads. [I think what *should* be reported is a combined summary
rather than per-thread promotion failure statistics.]
Removing the _thread member has the added benefit of being a code
removal and simplficiation change. The downside is that it affects a
messaging API; I'm not sure what would be involved in dealing with
that. The needed changes to the (open) jdk look straightforward:
Remove _thread member and associated code from PromotionFailedInfo.
Remove thread field from constructed message.
Remove thread field from PromotionFailed event description.
This is the API change that needs to be appropriately handled.
More information about the hotspot-gc-dev