[Redirecting to net-dev, nio-dev]<br><br>Martin<br><br><div class="gmail_quote">On Tue, Jul 21, 2009 at 12:52, Ariel Weisberg <span dir="ltr"><<a href="mailto:ariel@weisberg.ws">ariel@weisberg.ws</a>></span> wrote:<br>
<blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">
<div><div style="font-family: Arial; font-size: medium;" dir="ltr"><div>Hi all,</div>
<div> </div>
<div>It took a while for us to convince ourselves that this wasn't an application problem. I am attaching a test case that reliably reproduces the dead socket problem on some systems. The flow is essentially the same as the networking code in our messaging system.</div>
<div> </div>
<div>I had the best luck reproducing this on Dell Poweredge 2970s (two socket AMD) running CentOS 5.3. I dual booted two of them with Ubuntu server 9.04 and have not succeeded in reproducing the problem with Ubuntu. I was not able to reproduce the problem on the Dell R610 (2 socket Nehalem) machines running CentOS 5.3 with the test application although the actual app (messaging system) does have this issue on the 610s.</div>
<div> </div>
<div>I am very interested in hearing about what happens when other people run it. I am also interested in confirming that this is a sane use of Selectors, SocketChannels, and SelectionKeys.</div>
<div> </div>
<div>Thanks,</div>
<div>Ariel Weisberg</div><div><div></div><div class="h5">
<div> </div>
<div>On Wed, 15 Jul 2009 14:24 -0700, "Martin Buchholz" <<a href="mailto:martinrb@google.com" target="_blank">martinrb@google.com</a>> wrote:</div>
<blockquote type="cite">In summary,<br>
there are two different bugs at work here,<br>
and neither of them is in LBD.<br>
The hotspot team is working on the LBD deadlock.<br>
(As always) It would be good to have a good test case for<br>
the dead socket problem.<br>
<br>
Martin<br>
<br>
<div class="gmail_quote">On Wed, Jul 15, 2009 at 12:24, Ariel Weisberg <span dir="ltr"><<a href="mailto:ariel@weisberg.ws" target="_blank">ariel@weisberg.ws</a>></span> wrote:<br>
<blockquote style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;" class="gmail_quote">
<div>
<div dir="ltr" style="font-family: Arial; font-size: medium;">
<div>Hi,</div>
<div> </div>
<div>I have found that there are two different failure modes without involving -XX:+UseMembar. There is the LBD deadlock and then there is the dead socket in between two nodes. Either failure can occur with the same code and settings. It appears that the dead socket problem is more common. The LBD failure is also not correlated with any specific LBD (originally saw it with only the LBD for an Initiator's mailbox).</div>
<div> </div>
<div>With -XX:+UseMembar the system is noticeably more reliable and tends to run much longer without failing (although it can still fail immediately). When it does fail it has been due to a dead connection. I have not reproduced a deadlock on an LBD with -XX:+UseMembar.</div>
<div> </div>
<div>I also found that the dead socket issue was reproducible twice on Dell Poweredge 2970s (two socket AMD). It takes an hour or so to reproduce the dead socket problem on the 2970. I have not recreated the LBD issue on them although given how difficult the socket issue is to reproduce it may be that I have not run them long enough. On the AMD machines I did not use -XX:+UseMembar.</div>
<div> </div>
<font color="#888888">
<div>Ariel</div>
</font>
<div>
<div> </div>
<div>
<div> </div>
<div>On Mon, 13 Jul 2009 18:59 -0400, "Ariel Weisberg" <<a href="mailto:ariel@weisberg.ws" target="_blank">ariel@weisberg.ws</a>> wrote:</div>
<blockquote type="cite">
<div style="font-family: Arial; font-size: medium;" dir="ltr">
<div>Hi all.</div>
<div> </div>
<div>Sorry Martin, I missed reading your last email. I am not confident that I will get a small reproducible test case in a reasonable time frame. Reproducing it with the application is easy and I will see what I can do about getting the source available.</div>
<div> </div>
<div>One interesting thing I can tell you is that if I remove the LinkedBlockingDeque from the mailbox of the Initiator the system still deadlocks. The cluster has a TCP mesh topology so any node can deliver messages to any other node. One of the connections goes dead and neither side detects that there is a problem. I added some assertions to the network selection thread to check that all the connections in the cluster are still healthy and that they have the correct interest ops set.</div>
<div> </div>
<div>Here are the things it checks for to make sure each connection is working:</div>
<div>> for (ForeignHost.Port port : foreignHostPorts) {<br>
> assert(port.m_selectionKey.isValid());<br>
> assert(port.m_selectionKey.selector() == m_selector);<br>
> assert(port.m_channel.isOpen());<br>
> assert(((SocketChannel)port.m_channel).isConnected());<br>
> assert(((SocketChannel)port.m_channel).socket().isInputShutdown() == false);<br>
> assert(((SocketChannel)port.m_channel).socket().isOutputShutdown() == false);<br>
> assert(((SocketChannel)port.m_channel).isOpen());<br>
> assert(((SocketChannel)port.m_channel).isRegistered());<br>
> assert(((SocketChannel)port.m_channel).keyFor(m_selector) != null);<br>
> assert(((SocketChannel)port.m_channel).keyFor(m_selector) == port.m_selectionKey);<br>
> if (m_selector.selectedKeys().contains(port.m_selectionKey)) {<br>
> assert((port.m_selectionKey.interestOps() & SelectionKey.OP_READ) != 0);<br>
> assert((port.m_selectionKey.interestOps() & SelectionKey.OP_WRITE) != 0);<br>
> } else {<br>
> if (port.isRunning()) {<br>
> assert(port.m_selectionKey.interestOps() == 0);<br>
> } else {<br>
> port.m_selectionKey.interestOps(SelectionKey.OP_READ | SelectionKey.OP_WRITE);<br>
> assert((port.interestOps() & SelectionKey.OP_READ) != 0);<br>
> assert((port.interestOps() & SelectionKey.OP_WRITE) != 0);<br>
> }<br>
> }<br>
> assert(m_selector.isOpen());<br>
> assert(m_selector.keys().contains(port.m_selectionKey));<br>
> }<br>
OP_READ | OP_WRITE is set as the interest ops every time through, and there is no other code that changes the interest ops during execution. The application will run for a while and then one of the connections will stop being selected on both sides. If I step in with the debugger on either side everything looks correct. The keys have the correct interest ops and the selectors have the keys in their key set.</div>
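<div>To illustrate why a connection that "stops being selected" is suspicious when OP_WRITE is always in the interest set, here is a minimal sketch. It is not the application code; the class and method names are hypothetical. It registers an idle loopback connection for OP_READ | OP_WRITE and reports what the selector says is ready. A healthy socket with free send-buffer space would normally be reported writable on every pass.</div>

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;

// Hypothetical demo class; not part of the application discussed in the thread.
public class InterestOpsDemo {

    /** Returns the ready ops the selector reports for a fresh, idle loopback connection. */
    public static int readyOpsForIdleConnection() throws IOException {
        try (Selector selector = Selector.open();
             ServerSocketChannel server = ServerSocketChannel.open()) {
            // Port 0 asks the OS for any free port on the loopback interface.
            server.bind(new InetSocketAddress(InetAddress.getLoopbackAddress(), 0));
            try (SocketChannel client = SocketChannel.open(server.getLocalAddress());
                 SocketChannel peer = server.accept()) {
                client.configureBlocking(false);
                SelectionKey key = client.register(
                        selector, SelectionKey.OP_READ | SelectionKey.OP_WRITE);
                // Wait up to one second; an idle socket with send-buffer space is writable.
                selector.select(1000);
                return selector.selectedKeys().contains(key) ? key.readyOps() : 0;
            }
        }
    }

    public static void main(String[] args) throws IOException {
        int ready = readyOpsForIdleConnection();
        System.out.println("writable: " + ((ready & SelectionKey.OP_WRITE) != 0));
        System.out.println("readable: " + ((ready & SelectionKey.OP_READ) != 0));
    }
}
```

<div>Because the write interest is level-triggered, a key that never shows up in the selected set while its socket still has buffer space points at the selection machinery rather than at the interest ops.</div>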
<div> </div>
<div>What I suspect is happening is that a bug on one node stops the socket from being selected (for both read and write), and eventually the socket fills up and can't be written to by the other side.</div>
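<div>The second half of that suspicion (a peer that stops reading eventually makes the other side unable to write) can be shown in isolation. The sketch below is only an illustration under stated assumptions: the class name, method name, and the 256 MiB safety cap are hypothetical, and it says nothing about why the peer stopped being selected in the first place. It opens a loopback connection, never reads from the accepted side, and counts how many bytes a non-blocking writer gets through before write() first returns 0.</div>

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;

// Hypothetical demo class; not the attached test case.
public class StalledPeerDemo {

    // Safety cap so the loop terminates even if the OS keeps accepting data.
    private static final long CAP = 256L * 1024 * 1024;

    /**
     * Writes to a loopback connection whose peer never reads and returns the
     * number of bytes accepted before a non-blocking write first returns 0
     * (or before the safety cap is reached).
     */
    public static long bytesAcceptedBeforeStall() throws IOException {
        try (ServerSocketChannel server = ServerSocketChannel.open()) {
            server.bind(new InetSocketAddress(InetAddress.getLoopbackAddress(), 0));
            try (SocketChannel writer = SocketChannel.open(server.getLocalAddress());
                 SocketChannel silentPeer = server.accept()) {
                writer.configureBlocking(false);
                ByteBuffer chunk = ByteBuffer.allocate(64 * 1024);
                long total = 0;
                while (total < CAP) {
                    chunk.clear();               // reuse the same 64 KiB of zeros
                    int n = writer.write(chunk); // non-blocking: may accept 0 bytes
                    if (n == 0) {
                        break;                   // send and receive buffers are full
                    }
                    total += n;
                }
                return total;
            }
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("bytes accepted before stall: " + bytesAcceptedBeforeStall());
    }
}
```

<div>The exact byte count depends on the kernel's socket buffer sizes, so it will differ between machines and operating systems.</div>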
<div> </div>
<div>If I can get my VPN access together tomorrow I will run with -XX:+UseMembar and also try running on some 8-core AMD machines. Otherwise I will have to get to it Wednesday.</div>
<div> </div>
<div>Thanks,</div>
<div> </div>
<div>Ariel Weisberg</div>
<div> </div>
<div> </div>
<div>On Tue, 14 Jul 2009 05:00 +1000, "David Holmes" <<a href="mailto:davidcholmes@aapt.net.au" target="_blank">davidcholmes@aapt.net.au</a>> wrote:</div>
<blockquote type="cite">
<div><span><font size="2" color="#0000ff" face="Courier New">Martin,</font></span></div>
<div> </div>
<div><span><font size="2" color="#0000ff" face="Courier New">I don't think this is due to LBQ/D. This is looking similar to a couple of other ReentrantLock/AQS "lost wakeup" hangs that I've got on the radar. We have a reproducible test case for one issue but it only fails on one kind of system - x4450. I'm on vacation most of this week but will try and get back to this next week.</font></span></div>
<div> </div>
<div><span><font size="2" color="#0000ff" face="Courier New">Ariel: one thing to try, please: see whether -XX:+UseMembar fixes the problem.</font></span></div>
<div> </div>
<div><span><font size="2" color="#0000ff" face="Courier New">Thanks,</font></span></div>
<div><span><font size="2" color="#0000ff" face="Courier New">David Holmes</font></span></div>
<blockquote style="border-left: 2px solid rgb(0, 0, 255); padding-left: 5px; margin-left: 5px;">
<div dir="ltr" align="left"><font size="2" face="Tahoma">-----Original Message-----<br>
<b>From:</b> Martin Buchholz [mailto:<a href="mailto:martinrb@google.com" target="_blank">martinrb@google.com</a>]<br>
<b>Sent:</b> Tuesday, 14 July 2009 8:38 AM<br>
<b>To:</b> Ariel Weisberg<br>
<b>Cc:</b> <a href="mailto:davidcholmes@aapt.net.au" target="_blank">davidcholmes@aapt.net.au</a>; core-libs-dev; <a href="mailto:concurrency-interest@cs.oswego.edu" target="_blank">concurrency-interest@cs.oswego.edu</a><br>
<b>Subject:</b> Re: [concurrency-interest] LinkedBlockingDeque deadlock?<br>
<br>
</font></div>
I did some stack trace eyeballing and did a mini-audit of the <br>
LinkedBlockingDeque code, with a view to finding possible bugs,<br>
and came up empty. Maybe it's a deep bug in hotspot?<br>
<br>
Ariel, it would be good if you could get a reproducible test case soonish,<br>
while someone on the planet has the motivation and familiarity to fix it.<br>
In another month I may disavow all knowledge of j.u.c.*Blocking*<br>
<br>
Martin<br>
<br>
<br>
<div class="gmail_quote">On Wed, Jul 8, 2009 at 15:57, Ariel Weisberg <span dir="ltr"><<a href="mailto:ariel@weisberg.ws" target="_blank">ariel@weisberg.ws</a>></span> wrote:<br>
<blockquote style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;" class="gmail_quote">Hi,<br>
<div><br>
> The poll()ing thread is blocked waiting for the internal lock, but<br>
> there's<br>
> no indication of any thread owning that lock. You're using an OpenJDK 6<br>
> build ... can you try JDK7 ?<br>
</div>
I got a chance to do that today. I downloaded JDK 7 from<br>
<a href="http://www.java.net/download/jdk7/binaries/jdk-7-ea-bin-b63-linux-x64-02_jul_2009.bin" target="_blank">http://www.java.net/download/jdk7/binaries/jdk-7-ea-bin-b63-linux-x64-02_jul_2009.bin</a><br>
and was able to reproduce the problem. I have attached the stack trace<br>
from running the 1.7 version. It is the same situation as before except<br>
there are 9 execution sites running on each host. There are no threads<br>
that are missing or that have been restarted. Foo Network thread<br>
(selector thread) and Network Thread - 0 are waiting on<br>
0x00002aaab43d3b28. I also ran with JDK 7 and 6 and LinkedBlockingQueue<br>
and was not able to recreate the problem using that structure.<br>
<div><br>
> I don't recall anything similar to this, but I don't know what version<br>
> that<br>
> OpenJDK6 build relates to.<br>
</div>
The cluster is running on CentOS 5.3.<br>
>[aweisberg@3f ~]$ rpm -qi java-1.6.0-openjdk-1.6.0.0-0.30.b09.el5<br>
>Name : java-1.6.0-openjdk Relocations: (not relocatable)<br>
>Version : 1.6.0.0 Vendor: CentOS<br>
>Release : 0.30.b09.el5 Build Date: Tue 07 Apr 2009 07:24:52 PM EDT<br>
>Install Date: Thu 11 Jun 2009 03:27:46 PM EDT Build Host: <a href="http://builder10.centos.org" target="_blank">builder10.centos.org</a><br>
>Group : Development/Languages Source RPM: java-1.6.0-openjdk-1.6.0.0-0.30.b09.el5.src.rpm<br>
>Size : 76336266 License: GPLv2 with exceptions<br>
>Signature : DSA/SHA1, Wed 08 Apr 2009 07:55:13 AM EDT, Key ID a8a447dce8562897<br>
>URL : <a href="http://icedtea.classpath.org/" target="_blank">http://icedtea.classpath.org/</a><br>
>Summary : OpenJDK Runtime Environment<br>
>Description :<br>
>The OpenJDK runtime environment.<br>
<div><br>
> Make sure you haven't missed any exceptions occurring in other threads.</div>
There are no threads missing in the application (terminated threads are<br>
not replaced) and there is a try catch pair (prints error and rethrows)<br>
around the run loop of each thread. It is possible that an exception may<br>
have been swallowed up somewhere.<br>
<div><br>
>A small reproducible test case from you would be useful.</div>
I am working on that. I wrote a test case that mimics the application's<br>
use of the LBD, but I have not succeeded in reproducing the problem in<br>
the test case. The app has a single thread (network selector) that polls<br>
the LBD and several threads (ExecutionSites, and network threads that<br>
return results from remote ExecutionSites) that offer results into the<br>
queue. About 120k items will go into/out of the deque each second. In<br>
the actual app the problem is reproducible but inconsistent. If I run on<br>
my dual core laptop I can't reproduce it, and it is less likely to occur<br>
with a small cluster, but with 6 nodes (~560k transactions/sec) the<br>
problem will usually appear. Sometimes the cluster will run for several<br>
minutes without issue and other times it will deadlock immediately.<br>
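<br>
The access pattern described above (several threads offering, one thread calling the non-blocking poll()) can be sketched roughly as follows.<br>
This is not the test case or the application code; the class name, method name, and thread counts are hypothetical, and on a correct JVM it is expected to finish rather than reproduce the hang.<br>

```java
import java.util.concurrent.LinkedBlockingDeque;

// Hypothetical stress sketch of the many-producers, one-poller pattern.
public class DequeStress {

    /** Starts the producers, drains the deque with poll(), and returns the item count received. */
    public static long run(int producers, int itemsPerProducer) throws InterruptedException {
        LinkedBlockingDeque<Integer> deque = new LinkedBlockingDeque<>();
        Thread[] threads = new Thread[producers];
        for (int p = 0; p < producers; p++) {
            threads[p] = new Thread(() -> {
                for (int i = 0; i < itemsPerProducer; i++) {
                    deque.offer(i); // unbounded deque, so offer always succeeds
                }
            }, "producer-" + p);
            threads[p].start();
        }

        long expected = (long) producers * itemsPerProducer;
        long received = 0;
        // Single consumer using the non-blocking poll(), as the selector thread does.
        while (received < expected) {
            if (deque.poll() != null) {
                received++;
            } else {
                Thread.yield(); // nothing available yet; let the producers run
            }
        }
        for (Thread t : threads) {
            t.join();
        }
        return received;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("received: " + run(8, 1_000_000));
    }
}
```

If the consumer loop ever stops making progress while producers are still alive, a thread dump at that point shows which threads are parked on the deque's internal lock.<br>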
<br>
Thanks,<br>
<br>
Ariel<br>
<div><br>
On Wed, 08 Jul 2009 05:14 +1000, "Martin Buchholz"<br>
<<a href="mailto:martinrb@google.com" target="_blank">martinrb@google.com</a>> wrote:<br>
>[+core-libs-dev]<br>
><br>
>Doug Lea and I are (slowly) working on a new version of LinkedBlockingDeque.<br>
>I was not aware of a deadlock but can vaguely imagine how it might happen.<br>
>A small reproducible test case from you would be useful.<br>
><br>
>Unfinished work in progress can be found here:<br>
><a href="http://cr.openjdk.java.net/%7Emartin/webrevs/openjdk7/BlockingQueue/" target="_blank">http://cr.openjdk.java.net/~martin/webrevs/openjdk7/BlockingQueue/</a><br>
><br>
>Martin<br>
</div>
On Wed, 08 Jul 2009 05:14 +1000, "David Holmes"<br>
<div><<a href="mailto:davidcholmes@aapt.net.au" target="_blank">davidcholmes@aapt.net.au</a>> wrote:<br>
></div>
<div>
<div> </div>
<div>> Ariel,<br>
><br>
> The poll()ing thread is blocked waiting for the internal lock, but<br>
> there's<br>
> no indication of any thread owning that lock. You're using an OpenJDK 6<br>
> build ... can you try JDK7 ?<br>
><br>
> I don't recall anything similar to this, but I don't know what version<br>
> that<br>
> OpenJDK6 build relates to.<br>
><br>
> Make sure you haven't missed any exceptions occurring in other threads.<br>
><br>
> David Holmes<br>
><br>
> > -----Original Message-----<br>
> > From: <a href="mailto:concurrency-interest-bounces@cs.oswego.edu" target="_blank">concurrency-interest-bounces@cs.oswego.edu</a><br>
> > [mailto:<a href="mailto:concurrency-interest-bounces@cs.oswego.edu" target="_blank">concurrency-interest-bounces@cs.oswego.edu</a>]On Behalf Of Ariel<br>
> > Weisberg<br>
> > Sent: Wednesday, 8 July 2009 8:31 AM<br>
> > To: <a href="mailto:concurrency-interest@cs.oswego.edu" target="_blank">concurrency-interest@cs.oswego.edu</a><br>
> > Subject: [concurrency-interest] LinkedBlockingDeque deadlock?<br>
> ><br>
> ><br>
> > Hi all,<br>
> ><br>
> > I did a search on LinkedBlockingDeque and didn't find anything similar<br>
> > to what I am seeing. Attached is the stack trace from an application<br>
> > that is deadlocked with three threads waiting for 0x00002aaab3e91080<br>
> > (threads "ExecutionSite: 26", "ExecutionSite:27", and "Network<br>
> > Selector"). The execution sites are attempting to offer results to the<br>
> > deque and the network thread is trying to poll for them using the<br>
> > non-blocking version of poll. I am seeing the network thread never<br>
> > return from poll (straight poll()). Do my eyes deceive me?<br>
> ><br>
> > Thanks,<br>
> ><br>
> > Ariel Weisberg<br>
> ><br>
></div>
</div>
</blockquote></div>
<br>
</blockquote> </blockquote></div>
</blockquote></div>
</div>
</div>
</div>
<br>
_______________________________________________<br>
Concurrency-interest mailing list<br>
<a href="mailto:Concurrency-interest@cs.oswego.edu" target="_blank">Concurrency-interest@cs.oswego.edu</a><br>
<a href="http://cs.oswego.edu/mailman/listinfo/concurrency-interest" target="_blank">http://cs.oswego.edu/mailman/listinfo/concurrency-interest</a><br>
<br>
</blockquote></div>
<br>
</blockquote></div></div></div></div></blockquote></div><br>