<html><head></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space; ">Interesting discussion. :-)<div><br></div><div>To me, Ramki's observation of high context switching suggests active locking as a possible culprit. &nbsp;Fwiw, based on your discussion, it looks like you're headed down a path that makes sense.</div><div><br></div><div>charlie...</div><div><br><div><div>On Oct 19, 2012, at 3:40 AM, Srinivas Ramakrishna wrote:</div><br class="Apple-interchange-newline"><blockquote type="cite"><br><br><div class="gmail_quote">On Thu, Oct 18, 2012 at 5:27 PM, Peter B. Kessler <span dir="ltr">&lt;<a href="mailto:Peter.B.Kessler@oracle.com" target="_blank">Peter.B.Kessler@oracle.com</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

When there's no room in the old generation and a worker has filled its PLAB to capacity, but it still has instances to try to promote, does it try to allocate a new PLAB, and fail? &nbsp;That would lead to each of the workers eventually failing to allocate a new PLAB for each promotion attempt. &nbsp;IIRC, PLAB allocation grabs a real lock (since it happens so rarely :-). &nbsp;In the promotion failure case, that lock could get incandescent. &nbsp;Maybe it's gone unnoticed because for modest young generations it doesn't stay hot enough for long enough for people to witness the supernova? &nbsp;Having a young generation the size you do would exacerbate the problem. &nbsp;If you have lots of workers, that would increase the amount of contention, too.<br>

</blockquote><div><br>Yes, that's exactly my thinking too. For the case of CMS, the PLABs are "local free block lists", and allocation from the shared global pool is<br>even worse: it is more heavyweight than an atomic pointer bump, with a lock protecting several layers of checks.<br>

&nbsp;<br></div><blockquote class="gmail_quote" style="margin:0pt 0pt 0pt 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

<br>

PLAB allocation might be a place where you could put a test for having failed promotion, so just return null and let the worker self-loop this instance. &nbsp;That would keep the test off the fast-path (when things are going well).<br>

</blockquote><div><br>Yes, that's a good idea and might well be sufficient, and was also my first thought. However, I also wonder whether just moving the promotion<br>failure test (a volatile read) into the fast path of the copy routine, and immediately failing all subsequent copies after the first failure (and indeed via the<br>

global flag propagating that failure across all the workers immediately) won't just be quicker without having added that much in the fast path. It seems<br>that in that case we may be able to even avoid the self-looping and the subsequent single-threaded fixup. The first thread that fails sets the volatile<br>

global, so any subsequent thread artificially fails all subsequent copies of uncopied objects. Any object reference found pointing to an object in Eden<br>or From space that hasn't yet been copied will call the copy routine which will (artificially) fail and return the original address.<br>

<br>I'll do some experiments and there may lurk devils in the details, but it seems to me that this will work and be much more efficient in the<br>slow case, without making the fast path that much slower.<br>&nbsp;<br></div>
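<div>A minimal sketch of the fast-fail idea described above, written as a toy Java model rather than HotSpot code. The class name <code>FastFailCopy</code>, its fields, and the fake "addresses" are all hypothetical; the counter merely stands in for trips into the locked shared allocator.</div>
<pre>
```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

// Toy model only: every name here is an illustrative assumption.
final class FastFailCopy {
    // Global flag set by the first worker whose promotion fails.
    static final AtomicBoolean promotionFailed = new AtomicBoolean(false);
    // Counts trips into the expensive shared allocator (stand-in for a locked refill).
    static final AtomicInteger slowPathAttempts = new AtomicInteger();
    // Remaining capacity of a pretend old generation, measured in objects.
    static final AtomicInteger oldGenSlots = new AtomicInteger(2);

    // Returns a new "address" on success, or the original address on failure.
    static int copy(int originalAddress) {
        if (promotionFailed.get()) {      // cheap volatile read on the fast path
            return originalAddress;       // artificial failure: the object stays put
        }
        slowPathAttempts.incrementAndGet();
        if (oldGenSlots.getAndDecrement() > 0) {
            return originalAddress + 1_000_000; // pretend the object was promoted
        }
        promotionFailed.set(true);        // first real failure: publish to all workers
        return originalAddress;
    }
}
```
</pre>
<div>In this model only the first failing copy pays for a futile allocation attempt; every later call returns the original address after a single volatile read. Whether the real copy routine can be changed this cheaply is exactly what the experiments would need to show.</div>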

<blockquote class="gmail_quote" style="margin:0pt 0pt 0pt 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

<br>

I'm still guessing.</blockquote><div><br>Your guesses are good, and very helpful, and I think we are on the right track with this one as regards the cause of the slowdown.<br><br>I'll update.<br><br>-- ramki<br>&nbsp;<br>

</div><blockquote class="gmail_quote" style="margin:0pt 0pt 0pt 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div class="im"><br>

<br>

&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; ... peter<br>

<br>

Srinivas Ramakrishna wrote:<br>

</div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="im">

System data show high context switching in the vicinity of the event and point to the futile allocation bottleneck as a possible theory with some legs....<br>

<br>

more later.<br>

-- ramki<br>

<br></div><div><div class="h5">

On Thu, Oct 18, 2012 at 3:47 PM, Srinivas Ramakrishna &lt;<a href="mailto:ysr1729@gmail.com" target="_blank">ysr1729@gmail.com</a>&gt; wrote:<br>


<br>

&nbsp; &nbsp; Thanks Peter... the possibility of paging or related issue of VM<br>

&nbsp; &nbsp; system did occur to me, especially because system time shows up as<br>

&nbsp; &nbsp; somewhat high here. The problem is that this server runs without<br>

&nbsp; &nbsp; swap :-) so the time is going elsewhere.<br>

<br>

&nbsp; &nbsp; The cache miss theory is interesting (but would not show up as<br>

&nbsp; &nbsp; system time), and your back of the envelope calculation gives about<br>

&nbsp; &nbsp; 0.8 us for fetching a cache line (although I am pretty sure the<br>

&nbsp; &nbsp; cache miss predictor would figure out the misses and stream<br>

&nbsp; &nbsp; in the<br>

&nbsp; &nbsp; cache lines since, as you say, we are going in address order). I'd<br>

&nbsp; &nbsp; expect it to be no worse than when we do an "initial mark pause on a<br>

&nbsp; &nbsp; full Eden", give or<br>

&nbsp; &nbsp; take a little, and this is some 30 x worse.<br>

<br>

&nbsp; &nbsp; One possibility I am looking at is the part where we self-loop. I<br>

&nbsp; &nbsp; suspect the ParNew/CMS combination running with multiple worker threads<br>

&nbsp; &nbsp; is hit hard here, if the failure happens very early say -- from what<br>

&nbsp; &nbsp; I saw of that code recently, we don't consult the flag that says we<br>

&nbsp; &nbsp; have already failed<br>

&nbsp; &nbsp; and should just return and self-loop. Rather, we retry allocation<br>

&nbsp; &nbsp; for each subsequent object, fail that and then do the self-loop. The<br>

&nbsp; &nbsp; repeated<br>

&nbsp; &nbsp; failed attempts might be adding up, especially since the access<br>

&nbsp; &nbsp; involves looking at the shared pool. I'll look at how that is done,<br>

&nbsp; &nbsp; and see if we can<br>

&nbsp; &nbsp; do a fast fail after the first failure happens, rather than try and<br>

&nbsp; &nbsp; do the rest of the scavenge, since we'll need to do a fixup anyway.<br>

<br>

&nbsp; &nbsp; thanks for the discussion and i'll update as and when i do some more<br>

&nbsp; &nbsp; investigations. Keep those ideas coming, and I'll submit a bug<br>

&nbsp; &nbsp; report once<br>

&nbsp; &nbsp; i have spent a few more cycles looking at the available data and<br>

&nbsp; &nbsp; ruminating.<br>

<br>

&nbsp; &nbsp; - ramki<br>

<br>

<br>

&nbsp; &nbsp; On Thu, Oct 18, 2012 at 1:20 PM, Peter B. Kessler<br></div></div><div><div class="h5">

&nbsp; &nbsp; &lt;<a href="mailto:Peter.B.Kessler@oracle.com" target="_blank">Peter.B.Kessler@oracle.com</a>&gt; wrote:<br>


<br>

&nbsp; &nbsp; &nbsp; &nbsp; IIRC, promotion failure still has to finish the evacuation<br>

&nbsp; &nbsp; &nbsp; &nbsp; attempt (and some objects may get promoted while the ones that<br>

&nbsp; &nbsp; &nbsp; &nbsp; fail get self-looped). &nbsp;That part is the usual multi-threaded<br>

&nbsp; &nbsp; &nbsp; &nbsp; object graph walk, with failed PLAB allocations thrown in to<br>

&nbsp; &nbsp; &nbsp; &nbsp; slow you down. &nbsp;Then you get to start the pass that deals with<br>

&nbsp; &nbsp; &nbsp; &nbsp; the self-loops, which you say is single-threaded. &nbsp;Undoing the<br>

&nbsp; &nbsp; &nbsp; &nbsp; self-loops is in address order, but it walks by the object<br>

&nbsp; &nbsp; &nbsp; &nbsp; sizes, so probably it mostly misses in the cache. &nbsp;40GB at the<br>

&nbsp; &nbsp; &nbsp; &nbsp; average object size (call them 40 bytes to make the math easy)<br>

&nbsp; &nbsp; &nbsp; &nbsp; is a lot of cache misses. &nbsp;How fast is your memory system?<br>

&nbsp; &nbsp; &nbsp; &nbsp; Probably faster than (10 minutes / (40 GB / 40 bytes)) per cache miss.<br>
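<div>The back-of-the-envelope bound above can be worked through in a few lines. This is only a sketch: the class and method names are made up for illustration, and it assumes decimal gigabytes, a pause of exactly 10 minutes, and one cache miss per object.</div>
<pre>
```java
// Illustrative arithmetic only; the names are hypothetical.
final class CacheMissBudget {
    // Time available per object, in nanoseconds, if the entire pause were
    // spent visiting objects one at a time (one cache miss each).
    static double nanosPerObject(double pauseSeconds, double heapBytes, double avgObjectBytes) {
        double objects = heapBytes / avgObjectBytes; // 40 GB / 40 bytes = 1e9 objects
        return pauseSeconds * 1e9 / objects;         // seconds to nanoseconds, per object
    }
}
```
</pre>
<div>With these inputs, <code>CacheMissBudget.nanosPerObject(600, 40e9, 40)</code> evaluates to 600 ns per object, so a memory system would have to be that slow on every miss for the fixup walk alone to account for the pause. A longer pause or a larger average object size raises the figure proportionally.</div>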

<br>

&nbsp; &nbsp; &nbsp; &nbsp; Is it possible you are paging? &nbsp;Maybe not when things are<br>

&nbsp; &nbsp; &nbsp; &nbsp; running smoothly, but maybe a 10 minute stall on one service<br>

&nbsp; &nbsp; &nbsp; &nbsp; causes things to back up (and grow the heap of) other services<br>

&nbsp; &nbsp; &nbsp; &nbsp; on the same machine? &nbsp;I'm guessing.<br>

<br>

&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; ... peter<br>

<br>

&nbsp; &nbsp; &nbsp; &nbsp; Srinivas Ramakrishna wrote:<br>

<br>

<br>

&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; Has anyone come across extremely long (upwards of 10<br>

&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; minutes) promotion failure unwinding scenarios when using<br>

&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; any of the collectors, but especially with ParNew/CMS?<br>

&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; I recently came across one such occurrence with ParNew/CMS<br>

&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; that, with a 40 GB young gen, took upwards of 10 minutes to<br>

&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; "unwind". I looked through the code and I can see<br>

&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; that the unwinding steps can be a source of slowdown as we<br>

&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; iterate single-threaded (DefNew) through the large Eden to<br>

&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; fix up self-forwarded objects, but that still wouldn't<br>

&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; seem to explain such a large pause, even with a 40 GB young<br>

&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; gen. I am looking through the promotion failure paths to see<br>

&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; what might be the cause of such a large pause,<br>

&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; but if anyone has experienced this kind of scenario before<br>

&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; or has any conjectures or insights, I'd appreciate it.<br>

<br>

&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; thanks!<br>

&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; -- ramki<br>

<br>

<br></div></div>

&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; _______________________________________________<br>

&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; hotspot-gc-use mailing list<br>

&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <a href="mailto:hotspot-gc-use@openjdk.java.net" target="_blank">hotspot-gc-use@openjdk.java.net</a><br>

&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <a href="http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use" target="_blank">http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use</a><br>

<br>

<br>

<br>

</blockquote>

</blockquote></div><br>

</blockquote></div><br></div></body></html>