[concurrency-interest] strange performance issue with ArrayBlockingQueue on Linux

Alex Otenko oleksandr.otenko at gmail.com
Tue Jul 25 05:04:13 EDT 2017


Are you studying scheduler performance for complex dependency graphs?

For example, when I look at the behaviour on my Linux machine, I see that the J* threads move only a few items at a time before blocking, whereas on the Mac they move an order of magnitude more (and, in my case, the run completes an order of magnitude faster - your mileage may vary, but perhaps that is also linked to how much the threads move before blocking).

Solving this problem is not easy, because an optimal scheduler needs to know that it should not schedule J3 or J1 too soon: it is more profitable to schedule J1 once J2 has freed more space in the queue (and in the meantime use the CPU to run something else), and to schedule J3 once J2 has filled more of the queue - so that J3 moves many items to J4, J4 moves many items to J5, and so on, instead of spending one context switch per item moved. So the cost of a context switch matters less than scheduling the threads *late* enough.
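The effect of moving many items per wakeup can also be approximated from user code with BlockingQueue.drainTo: block once, then take the whole backlog. A toy single-producer sketch (illustrative only; all names are mine, not from the benchmark under discussion):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class BatchedStage {

    // One producer feeds one consumer; the consumer blocks at most once per
    // batch and then drains whatever has accumulated, so the number of
    // park/unpark cycles can be far smaller than the number of items moved.
    static long runPipeline(int n) throws InterruptedException {
        BlockingQueue<Integer> queue = new ArrayBlockingQueue<>(1024);

        Thread producer = new Thread(() -> {
            try {
                for (int i = 0; i < n; i++) queue.put(i);
                queue.put(-1); // poison pill: no more items
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        producer.start();

        List<Integer> batch = new ArrayList<>();
        long sum = 0;
        long wakeups = 0;
        boolean done = false;
        while (!done) {
            batch.clear();
            batch.add(queue.take()); // block once per batch...
            queue.drainTo(batch);    // ...then grab everything already queued
            wakeups++;
            for (int v : batch) {
                if (v == -1) { done = true; break; }
                sum += v;
            }
        }
        producer.join();
        System.out.println("moved " + n + " items in " + wakeups + " wakeups");
        return sum;
    }

    public static void main(String[] args) throws InterruptedException {
        runPipeline(100_000);
    }
}
```

Comparing the printed wakeup count against n gives a rough measure of how much batching the scheduler's timing actually allowed on a given platform.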

Steering scheduling here would be an interesting problem.


Alex

> On 25 Jul 2017, at 01:03, Matthews, Graham <gmatthew at callutheran.edu> wrote:
> 
> I would like to dig further into this issue, as I am studying whether the traditional "threads and queues" approach is really slower than modern approaches (like SEDA), or whether the supposed slowness of the traditional approach is simply a function of the underlying threading implementation.
> 
> The results on OS X, where the traditional and modern approaches perform similarly, are food for thought in this respect.
> 
> Any suggestions as to whom to contact next to understand/improve the Linux performance would be most welcome.
> 
> graham
> 
> PS: I have confirmed that switching to JDK 9 doesn't change anything. If anything, the code is even slower on JDK 9 than on 8.
> 
> ------
> Graham MATTHEWS, Ph.D.
> Assistant Professor
> Department of Computer Science |  California Lutheran University
> 60 W. Olsen Rd.  |  Thousand Oaks, CA 91360
> gmatthew at callutheran.edu  | Office location: D19
> 
> ________________________________________
> From: Concurrency-interest <concurrency-interest-bounces at cs.oswego.edu> on behalf of Nathan and Ila Reynolds <nathanila at gmail.com>
> Sent: Thursday, July 20, 2017 7:45 AM
> To: concurrency-interest at cs.oswego.edu
> Subject: Re: [concurrency-interest] strange performance issue with ArrayBlockingQueue on Linux
> 
>> I've never understood why unpark always comes up as a performance
> bottleneck. What does it do that's so expensive?
> 
> A while ago, I talked with the engineer who has done a lot of
> optimization on LockSupport.unpark().  I can't remember if that implies
> Unsafe.unpark().
> 
> On Windows, LockSupport.unpark() goes into the kernel once to set a
> Windows Event object.  On Linux, LockSupport.unpark() acquires a lock,
> signals a condition variable, and releases the lock.  This *might*
> require 3 trips into the kernel.
> 
> Furthermore, the scheduler might let the unparked thread start running
> and the unparking thread yield (if no logical cores are available).
> Hence, unpark() will take even more time.
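A rough way to put a number on that cost is to ping-pong a permit between two threads with LockSupport directly. A toy microbenchmark sketch (mine, not from this thread; it ignores JIT warm-up and the rare spurious return of park()):

```java
import java.util.concurrent.locks.LockSupport;

public class ParkPingPong {

    // Bounces a permit between two threads `rounds` times and returns the
    // average nanoseconds per unpark/park round trip.
    static long avgNanosPerRoundTrip(int rounds) throws InterruptedException {
        final Thread caller = Thread.currentThread();
        Thread other = new Thread(() -> {
            for (int i = 0; i < rounds; i++) {
                LockSupport.park();         // wait for the caller's unpark
                LockSupport.unpark(caller); // hand control back
            }
        });
        other.start();

        long start = System.nanoTime();
        for (int i = 0; i < rounds; i++) {
            LockSupport.unpark(other);      // the kernel trip(s) happen here
            LockSupport.park();             // sleep until the other side answers
        }
        long elapsed = System.nanoTime() - start;
        other.join();
        return elapsed / rounds;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("~" + avgNanosPerRoundTrip(100_000)
                + " ns per round trip");
    }
}
```

If unpark needs multiple kernel calls on a given platform, the per-round-trip figure should come out noticeably higher there than on one where a single call suffices.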
> 
> -Nathan
> 
> On 7/19/2017 10:50 PM, Brian S O'Neill wrote:
>> You also didn't mention CPU core count. I ran the test on two
>> machines, with Java 8 and with Java 9. With Java 9 I explicitly
>> selected the parallel GC, because the G1 collector in Java 9 is a CPU
>> pig that slows everything down.
>> 
>> Machine 1: Windows 10, Intel Xeon E3-1245 3.3GHz, 8 logical cores
>> Machine 2: Linux 4.10, AMD Ryzen 7 1700 3.0GHz, 16 logical cores
>> 
>> Machine 1, Java 8: 9.2 seconds
>> Machine 1, Java 9: 9.1 seconds
>> Machine 2, Java 8: 16.9 seconds
>> Machine 2, Java 9: 12.1 seconds
>> 
>> At first blush it looks like the problem is Linux, but I don't have a
>> proper apples-to-apples comparison. I noticed that with Java 8, the
>> number of context switches per second was about 2.5 million, and with
>> Java 9 it was about 2.3 million. Either way, this is really high. On
>> Windows, the context switch rate was about 1.9 million per second.
>> 
>> I also see a lot of variation between test runs, and when I run the
>> code in a loop, the performance (on Windows) varies from 9 to 14
>> seconds per run. On Linux, I see 11 to 21 seconds when running in a
>> loop. I also added an explicit System.gc between each test run.
>> 
>> When I run with -Xprof, I see the threads spending most of their time
>> in Unsafe.unpark and Unsafe.park. I've never understood why unpark
>> always comes up as a performance bottleneck. What does it do that's so
>> expensive?
>> 
>> 
>> On 2017-07-19 09:11 PM, Matthews, Graham wrote:
>>> Ah yes, sorry, I should have said we are working in Java 8.
>>> 
>>> graham
>>> 
>>> 
>>> ------
>>> Graham MATTHEWS, Ph.D.
>>> Assistant Professor
>>> Department of Computer Science |  California Lutheran University
>>> 60 W. Olsen Rd.  |  Thousand Oaks, CA 91360
>>> gmatthew at callutheran.edu  | Office location: D19
>>> ------------------------------------------------------------------------
>>> *From:* Martin Buchholz <martinrb at google.com>
>>> *Sent:* Wednesday, July 19, 2017 8:57 PM
>>> *To:* Matthews, Graham
>>> *Cc:* concurrency-interest at cs.oswego.edu
>>> *Subject:* Re: [concurrency-interest] strange performance issue with
>>> ArrayBlockingQueue on Linux
>>> One obvious variable is the version of ABQ you are using.  It was
>>> modified in jdk9, so you should be doing all performance work with the
>>> latest jdk9 binaries, e.g. from http://jdk.java.net/9/
>>> 
>>> On Wed, Jul 19, 2017 at 8:41 PM, Matthews, Graham
>>> <gmatthew at callutheran.edu> wrote:
>>> 
>>>    [apologies in advance if this is the wrong place to discuss this
>>>    issue. also apologies for the long setup -- I promise this is about
>>>    ArrayBlockingQueue, Java, and a performance issue with them]
>>> 
>>> 
>>>    My student and I have been benchmarking various ways to run cross
>>>    data source joins. The basic setup is as follows:
>>> 
>>> 
>>>                J2
>>>               /  \
>>>              R3   J1
>>>                  /  \
>>>                 R2   R1
>>> 
>>>    R1, R2 and R3 are called receive operators as they receive data from
>>>    the databases. J1 and J2 are inner join operators joining values
>>>    from R2 with values from R1 and then the result with values from R3.
>>> 
>>>    One way to run these queries is the classic producers-and-consumers
>>>    approach: each operator is a Java thread, with bounded queues
>>>    (ArrayBlockingQueues) between the operators. The above example
>>>    therefore has 5 threads and 4 ArrayBlockingQueues, and I/O is
>>>    interleaved with computation.
>>> 
>>>    Another way to run this is to first load R3 into J2 in parallel with
>>>    loading R2 into J1 (the loads build hash maps in J1 and J2), and
>>>    then have a thread pool with N workers (N being the number of
>>>    cores), in which each worker takes a value from R1 and joins it "up
>>>    the spine" of the tree -- matching against the hash map in J1 and
>>>    then the one in J2. This approach (the phased approach) therefore
>>>    has an I/O phase followed by a computation phase.
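For illustration, the phased approach might be sketched as below, with the databases replaced by plain integer ranges and the joins by hash-map probes (all names and counts are invented for the sketch, not taken from the actual benchmark):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.stream.LongStream;

public class PhasedSketch {

    // Phase 1 (I/O): build both join-side hash maps in parallel.
    // Phase 2 (compute): workers probe "up the spine" for each R1 value.
    static long run(int count) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(
                Runtime.getRuntime().availableProcessors());
        Future<Map<Long, Long>> j1Build = pool.submit(() -> buildSide(count)); // R2 -> J1
        Future<Map<Long, Long>> j2Build = pool.submit(() -> buildSide(count)); // R3 -> J2
        Map<Long, Long> j1 = j1Build.get();
        Map<Long, Long> j2 = j2Build.get();

        long total = LongStream.range(0, count).parallel()
                .map(v -> {
                    Long m1 = j1.get(v);          // probe J1's hash map
                    if (m1 == null) return 0;
                    Long m2 = j2.get(v);          // then J2's
                    return m2 == null ? 0 : v + m1 + m2;
                })
                .sum();
        pool.shutdown();
        return total;
    }

    // Stand-in for a receive operator: keys and values are just the range.
    static Map<Long, Long> buildSide(int count) {
        Map<Long, Long> m = new HashMap<>();
        for (long i = 0; i < count; i++) m.put(i, i);
        return m;
    }

    public static void main(String[] args) throws Exception {
        System.out.println("total = " + run(100_000));
    }
}
```

No blocking queues appear in phase 2 at all, which is presumably why this shape sidesteps whatever makes the handoff-heavy version slow.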
>>> 
>>> 
>>>    On OS X El Capitan (10.11.6), the queueing approach and the phased
>>>    approach perform roughly equally (more variation in the run times
>>>    for the queueing approach, but performance within 10% of each
>>>    other). On Linux (Ubuntu 16.04 LTS) the queueing approach is almost
>>>    three times slower than the phased approach!
>>> 
>>>    On top of that, the queueing approach runs over twice as fast on OS X
>>>    as it does on Linux, despite the fact that I am running OS X on
>>>    old hardware (mid 2012 MacBook Pro, 3rd Gen Intel Core i5-3210M) and
>>>    Linux on relatively new hardware (5th Gen Intel Core i7-5500U).
>>> 
>>>    We have boiled the problem down to code that doesn't involve
>>>    databases -- just threads and ArrayBlockingQueues. Attached are
>>>    5 Java files which "approximate" the queueing approach by having
>>>    each receive thread simply output numbers (rather than get values
>>>    from a database) to an ArrayBlockingQueue, and each join thread
>>>    simply read its two input queues, add the values together, and
>>>    output the sum to its output queue.
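The shape just described can be compressed into a single-file sketch along the following lines (names and counts are illustrative; this is not the attached code):

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class QueuePipelineSketch {
    static final int CAPACITY = 1024;

    // Receive operator: emits `count` numbers into its output queue.
    static Thread receive(long count, BlockingQueue<Long> out) {
        return new Thread(() -> {
            try {
                for (long i = 0; i < count; i++) out.put(i);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
    }

    // Join operator: takes one value from each input and outputs the sum.
    static Thread join(long count, BlockingQueue<Long> left,
                       BlockingQueue<Long> right, BlockingQueue<Long> out) {
        return new Thread(() -> {
            try {
                for (long i = 0; i < count; i++) out.put(left.take() + right.take());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
    }

    // Wires up R1..R3, J1, J2 and a final sink; returns the sink's total.
    static long run(long count) throws InterruptedException {
        BlockingQueue<Long> q1 = new ArrayBlockingQueue<>(CAPACITY); // R1 -> J1
        BlockingQueue<Long> q2 = new ArrayBlockingQueue<>(CAPACITY); // R2 -> J1
        BlockingQueue<Long> q3 = new ArrayBlockingQueue<>(CAPACITY); // R3 -> J2
        BlockingQueue<Long> q4 = new ArrayBlockingQueue<>(CAPACITY); // J1 -> J2
        BlockingQueue<Long> q5 = new ArrayBlockingQueue<>(CAPACITY); // J2 -> sink

        long[] total = {0};
        Thread sink = new Thread(() -> {
            try {
                for (long i = 0; i < count; i++) total[0] += q5.take();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        Thread[] threads = {
            receive(count, q1), receive(count, q2), receive(count, q3),
            join(count, q1, q2, q4), join(count, q4, q3, q5), sink
        };
        long start = System.nanoTime();
        for (Thread t : threads) t.start();
        for (Thread t : threads) t.join();
        System.out.println((System.nanoTime() - start) / 1_000_000 + " ms");
        return total[0];
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("total = " + run(1_000_000));
    }
}
```

With a small queue capacity, most of the run time goes into handoffs between the six threads, which is exactly where the platform differences under discussion would show up.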
>>> 
>>>    On the same hardware as above, this code runs in 1 minute 30 secs
>>>    on OS X, and 2 minutes 13 secs on Linux!
>>> 
>>>    The code is very simple -- just a bunch of threads and a bunch of
>>>    ArrayBlockingQueues. So I am at a loss to explain the sizable
>>>    performance differences noted above.
>>> 
>>>    Again apologies if this is not the right forum to discuss this
>>> issue.
>>> 
>>>    graham
>>> 
>>>    ------
>>>    Graham MATTHEWS, Ph.D.
>>>    Assistant Professor
>>>    Department of Computer Science |  California Lutheran University
>>>    60 W. Olsen Rd.  |  Thousand Oaks, CA 91360
>>>    gmatthew at callutheran.edu  | Office location: D19
>>> 
>>>    _______________________________________________
>>>    Concurrency-interest mailing list
>>>    Concurrency-interest at cs.oswego.edu
>>>    http://cs.oswego.edu/mailman/listinfo/concurrency-interest
>>> 
>>> 
>>> 
>>> 
> 
> --
> -Nathan
> 


