[concurrency-interest] LinkedBlockingDeque deadlock?

David Holmes davidcholmes at aapt.net.au
Mon Jul 13 19:00:20 EDT 2009


Martin,

I don't think this is due to LBQ/D. This is looking similar to a couple of
other ReentrantLock/AQS "lost wakeup" hangs that I've got on the radar. We
have a reproducible test case for one issue but it only fails on one kind
of system - x4450. I'm on vacation most of this week but will try and get
back to this next week.

Ariel: one thing to try: please see if -XX:+UseMembar fixes the problem.

Thanks,
David Holmes
  -----Original Message-----
  From: Martin Buchholz [mailto:martinrb at google.com]
  Sent: Tuesday, 14 July 2009 8:38 AM
  To: Ariel Weisberg
  Cc: davidcholmes at aapt.net.au; core-libs-dev; concurrency-interest at cs.oswego.edu
  Subject: Re: [concurrency-interest] LinkedBlockingDeque deadlock?


  I did some stack trace eyeballing and did a mini-audit of the
  LinkedBlockingDeque code, with a view to finding possible bugs,
  and came up empty.  Maybe it's a deep bug in hotspot?

  Ariel, it would be good if you could get a reproducible test case soonish,
  while someone on the planet has the motivation and familiarity to fix it.
  In another month I may disavow all knowledge of j.u.c.*Blocking*

  Martin



  On Wed, Jul 8, 2009 at 15:57, Ariel Weisberg <ariel at weisberg.ws> wrote:

    Hi,


    > The poll()ing thread is blocked waiting for the internal lock, but there's
    > no indication of any thread owning that lock. You're using an OpenJDK 6
    > build ... can you try JDK7?


    I got a chance to do that today. I downloaded JDK 7 from
    http://www.java.net/download/jdk7/binaries/jdk-7-ea-bin-b63-linux-x64-02_jul_2009.bin
    and was able to reproduce the problem. I have attached the stack trace
    from running the 1.7 version. It is the same situation as before except
    there are 9 execution sites running on each host. There are no threads
    that are missing or that have been restarted. Foo Network thread
    (selector thread) and Network Thread - 0 are waiting on
    0x00002aaab43d3b28. I also ran with JDK 7 and 6 and LinkedBlockingQueue
    and was not able to recreate the problem using that structure.
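
    The "waiting on a lock with no visible owner" symptom can also be checked
    from inside the JVM rather than by reading a stack dump. The following is
    only a minimal sketch using the standard java.lang.management API; the
    class name LockOwners and the method describeWaiters are made up for
    illustration and are not part of the application discussed here.

```java
import java.lang.management.LockInfo;
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the application discussed in this thread.
class LockOwners {
    // Returns one line per thread that is currently blocked or parked on a
    // lock, naming the lock and the owner the JVM reports for it (if any).
    static List<String> describeWaiters() {
        List<String> lines = new ArrayList<String>();
        ThreadInfo[] infos =
            ManagementFactory.getThreadMXBean().dumpAllThreads(false, false);
        for (ThreadInfo info : infos) {
            if (info == null) continue;
            LockInfo lock = info.getLockInfo();
            if (lock == null) continue;   // thread is not waiting on anything
            String owner = info.getLockOwnerName();
            lines.add(info.getThreadName() + " waits on " + lock
                + " owned by " + (owner == null ? "<no owner reported>" : owner));
        }
        return lines;
    }

    public static void main(String[] args) {
        for (String line : describeWaiters()) {
            System.out.println(line);
        }
    }
}
```

    A missing owner is not proof of a bug by itself: threads waiting on a
    Condition, for example, are also reported without an owner. Several
    threads queued on the same ReentrantLock sync object with no owner
    reported across repeated samples would match the hang described above.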


    > I don't recall anything similar to this, but I don't know what version that
    > OpenJDK6 build relates to.


    The cluster is running on CentOS 5.3.
    >[aweisberg at 3f ~]$ rpm -qi java-1.6.0-openjdk-1.6.0.0-0.30.b09.el5
    >Name        : java-1.6.0-openjdk           Relocations: (not relocatable)
    >Version     : 1.6.0.0                      Vendor: CentOS
    >Release     : 0.30.b09.el5                 Build Date: Tue 07 Apr 2009 07:24:52 PM EDT
    >Install Date: Thu 11 Jun 2009 03:27:46 PM EDT      Build Host: builder10.centos.org
    >Group       : Development/Languages        Source RPM: java-1.6.0-openjdk-1.6.0.0-0.30.b09.el5.src.rpm
    >Size        : 76336266                     License: GPLv2 with exceptions
    >Signature   : DSA/SHA1, Wed 08 Apr 2009 07:55:13 AM EDT, Key ID a8a447dce8562897
    >URL         : http://icedtea.classpath.org/
    >Summary     : OpenJDK Runtime Environment
    >Description :
    >The OpenJDK runtime environment.


    > Make sure you haven't missed any exceptions occurring in other threads.

    There are no threads missing in the application (terminated threads are
    not replaced) and there is a try/catch block (prints error and rethrows)
    around the run loop of each thread. It is possible that an exception may
    have been swallowed up somewhere.


    >A small reproducible test case from you would be useful.

    I am working on that. I wrote a test case that mimics the application's
    use of the LBD, but I have not succeeded in reproducing the problem in
    the test case. The app has a single thread (network selector) that polls
    the LBD and several threads (ExecutionSites, and network threads that
    return results from remote ExecutionSites) that offer results into the
    queue. About 120k items will go into/out of the deque each second. In
    the actual app the problem is reproducible but inconsistent. If I run on
    my dual core laptop I can't reproduce it, and it is less likely to occur
    with a small cluster, but with 6 nodes (~560k transactions/sec) the
    problem will usually appear. Sometimes the cluster will run for several
    minutes without issue and other times it will deadlock immediately.
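
    The access pattern described above (one thread draining the deque with
    non-blocking poll(), several threads offering into it) can be sketched
    roughly as follows. This is only an illustrative harness, not the actual
    test case or application code: the class name DequeStress, the thread
    counts, the item counts, and the timeout are all assumptions, and a
    small run like this is not expected to reproduce the hang.

```java
import java.util.concurrent.LinkedBlockingDeque;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical stress harness; thread counts and names are illustrative only.
class DequeStress {
    // Starts `producers` offering threads and one thread that drains the deque
    // with non-blocking poll(). Returns how many items were NOT polled within
    // `timeoutSeconds` (0 means everything offered was received).
    static long run(int producers, final int itemsPerProducer, long timeoutSeconds) {
        final LinkedBlockingDeque<Integer> deque = new LinkedBlockingDeque<Integer>();
        final long expected = (long) producers * itemsPerProducer;
        final AtomicLong polled = new AtomicLong();

        Thread poller = new Thread(() -> {
            while (polled.get() < expected) {
                if (deque.poll() != null) {
                    polled.incrementAndGet();
                } else {
                    Thread.yield();   // deque momentarily empty; try again
                }
            }
        }, "Network Selector");
        poller.setDaemon(true);       // do not keep the JVM alive if it hangs
        poller.start();

        for (int p = 0; p < producers; p++) {
            Thread t = new Thread(() -> {
                for (int i = 0; i < itemsPerProducer; i++) {
                    deque.offer(i);   // default capacity is Integer.MAX_VALUE
                }
            }, "ExecutionSite: " + p);
            t.setDaemon(true);
            t.start();
        }

        try {
            // Wait for the poller, but give up after the timeout so that a
            // hang shows up as a non-zero result instead of blocking forever.
            poller.join(TimeUnit.SECONDS.toMillis(timeoutSeconds));
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return expected - polled.get();
    }

    public static void main(String[] args) {
        long missing = run(8, 200000, 60);
        System.out.println(missing == 0
            ? "all items polled"
            : missing + " items missing (possible hang)");
    }
}
```

    A non-zero result after the timeout would be the point at which to take
    a thread dump and compare it with the attached traces.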

    Thanks,

    Ariel


    On Wed, 08 Jul 2009 05:14 +1000, "Martin Buchholz"
    <martinrb at google.com> wrote:
    >[+core-libs-dev]
    >
    >Doug Lea and I are (slowly) working on a new version of LinkedBlockingDeque.
    >I was not aware of a deadlock but can vaguely imagine how it might happen.
    >A small reproducible test case from you would be useful.
    >
    >Unfinished work in progress can be found here:
    >http://cr.openjdk.java.net/~martin/webrevs/openjdk7/BlockingQueue/
    >
    >Martin


    On Wed, 08 Jul 2009 05:14 +1000, "David Holmes"
    <davidcholmes at aapt.net.au> wrote:
    >
    > Ariel,
    >
    > The poll()ing thread is blocked waiting for the internal lock, but there's
    > no indication of any thread owning that lock. You're using an OpenJDK 6
    > build ... can you try JDK7?
    >
    > I don't recall anything similar to this, but I don't know what version that
    > OpenJDK6 build relates to.
    >
    > Make sure you haven't missed any exceptions occurring in other threads.
    >
    > David Holmes
    >
    > > -----Original Message-----
    > > From: concurrency-interest-bounces at cs.oswego.edu
    > > [mailto:concurrency-interest-bounces at cs.oswego.edu] On Behalf Of
    > > Ariel Weisberg
    > > Sent: Wednesday, 8 July 2009 8:31 AM
    > > To: concurrency-interest at cs.oswego.edu
    > > Subject: [concurrency-interest] LinkedBlockingDeque deadlock?
    > >
    > >
    > > Hi all,
    > >
    > > I did a search on LinkedBlockingDeque and didn't find anything similar
    > > to what I am seeing. Attached is the stack trace from an application
    > > that is deadlocked with three threads waiting for 0x00002aaab3e91080
    > > (threads "ExecutionSite: 26", "ExecutionSite:27", and "Network
    > > Selector"). The execution sites are attempting to offer results to the
    > > deque and the network thread is trying to poll for them using the
    > > non-blocking version of poll. I am seeing the network thread never
    > > return from poll (straight poll()). Do my eyes deceive me?
    > >
    > > Thanks,
    > >
    > > Ariel Weisberg
    > >
    >

