[concurrency-interest] strange performance issue with ArrayBlockingQueue on Linux

Nathan and Ila Reynolds nathanila at gmail.com
Thu Jul 20 10:45:16 EDT 2017


 > I've never understood why unpark always comes up as a performance 
bottleneck. What does it do that's so expensive?

A while ago, I talked with the engineer who has done a lot of 
optimization work on LockSupport.unpark().  I can't remember whether 
that work also covers Unsafe.unpark().

On Windows, LockSupport.unpark() makes one trip into the kernel to set a 
Windows Event object.  On Linux, LockSupport.unpark() acquires a mutex, 
signals a condition variable, and releases the mutex.  This *might* 
require 3 trips into the kernel.

Furthermore, the scheduler might let the unparked thread start running 
and make the unparking thread yield (if no logical cores are free).  In 
that case, unpark() takes even more time.
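
The cost is easy to see in isolation.  Below is a minimal, hypothetical 
ping-pong sketch (not the test discussed in this thread): two threads do 
nothing but wake each other with unpark() and wait with park(), so every 
round trip exercises the wakeup path described above.  On typical 
hardware a round trip costs on the order of a few microseconds once the 
threads actually block.

    import java.util.concurrent.atomic.AtomicLong;
    import java.util.concurrent.locks.LockSupport;

    // Hypothetical micro-benchmark: two threads hand a "turn" counter back
    // and forth using only park/unpark.  The while-loops re-check the turn,
    // so spurious wakeups and stale permits are harmless.
    public class ParkPingPong {
        static final AtomicLong turn = new AtomicLong();

        public static void main(String[] args) throws InterruptedException {
            final int rounds = 1_000_000;
            final Thread mainThread = Thread.currentThread();

            Thread other = new Thread(() -> {
                for (long i = 0; i < rounds; i++) {
                    while (turn.get() != 2 * i + 1) LockSupport.park(); // wait for our turn
                    turn.set(2 * i + 2);
                    LockSupport.unpark(mainThread);                     // hand the turn back
                }
            });
            other.start();

            long start = System.nanoTime();
            for (long i = 0; i < rounds; i++) {
                turn.set(2 * i + 1);
                LockSupport.unpark(other);                              // the costly wakeup
                while (turn.get() != 2 * i + 2) LockSupport.park();     // wait for the reply
            }
            other.join();
            System.out.printf("%.0f ns per round trip%n",
                    (System.nanoTime() - start) / (double) rounds);
        }
    }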

-Nathan

On 7/19/2017 10:50 PM, Brian S O'Neill wrote:
> You also didn't mention CPU core count. I ran the test on two 
> machines, with Java 8 and with Java 9. With Java 9 I explicitly 
> selected the parallel GC, because the G1 collector in Java 9 is a CPU 
> pig that slows everything down.
>
> Machine 1: Windows 10, Intel Xeon E3-1245 3.3 GHz, 8 logical cores
> Machine 2: Linux 4.10, AMD Ryzen 7 1700 3.0 GHz, 16 logical cores
>
> Machine 1, Java 8: 9.2 seconds
> Machine 1, Java 9: 9.1 seconds
> Machine 2, Java 8: 16.9 seconds
> Machine 2, Java 9: 12.1 seconds
>
> At first blush it looks like the problem is Linux, but I don't have a 
> proper apples-to-apples comparison. I noticed that with Java 8, the 
> number of context switches per second was about 2.5 million, and with 
> Java 9 it was about 2.3 million. Either way, this is really high. On 
> Windows, the context switch rate was about 1.9 million per second.
>
> I also see a lot of variation between test runs, and when I run the 
> code in a loop, the performance (on Windows) varies from 9 to 14 
> seconds per run. On Linux, I see 11 to 21 seconds when running in a 
> loop. I also added an explicit System.gc between each test run.
>
> When I run with -Xprof, I see the threads spending most of their time 
> in Unsafe.unpark and Unsafe.park. I've never understood why unpark 
> always comes up as a performance bottleneck. What does it do that's so 
> expensive?
>
>
> On 2017-07-19 09:11 PM, Matthews, Graham wrote:
>> Ah yes, sorry, I should have said we are working in Java 8.
>>
>> graham
>>
>>
>> ------
>> Graham MATTHEWS, Ph.D.
>> Assistant Professor
>> Department of Computer Science |  California Lutheran University
>> 60 W. Olsen Rd.  |  Thousand Oaks, CA 91360
>> gmatthew at callutheran.edu  | Office location: D19
>> ------------------------------------------------------------------------
>> *From:* Martin Buchholz <martinrb at google.com>
>> *Sent:* Wednesday, July 19, 2017 8:57 PM
>> *To:* Matthews, Graham
>> *Cc:* concurrency-interest at cs.oswego.edu
>> *Subject:* Re: [concurrency-interest] strange performance issue with 
>> ArrayBlockingQueue on Linux
>> One obvious variable is the version of ABQ you are using.  It was 
>> modified in jdk9, and you should be doing all performance work with the 
>> latest jdk9 binaries, e.g. from http://jdk.java.net/9/
>>
>> On Wed, Jul 19, 2017 at 8:41 PM, Matthews, Graham 
>> <gmatthew at callutheran.edu> wrote:
>>
>>     [Apologies in advance if this is the wrong place to discuss this
>>     issue. Also apologies for the long setup -- I promise this is about
>>     ArrayBlockingQueue, Java, and a performance issue with them.]
>>
>>
>>     My student and I have been benchmarking various ways to run cross
>>     data source joins. The basic setup is as follows:
>>
>>
>>              J2
>>             /  \
>>           R3    J1
>>                /  \
>>              R2    R1
>>
>>     R1, R2, and R3 are called receive operators, as they receive data
>>     from the databases. J1 and J2 are inner join operators: J1 joins
>>     values from R2 with values from R1, and J2 joins that result with
>>     values from R3.
>>
>>     One way to run these queries is the classic producers-and-consumers
>>     approach: each operator is a Java thread, and there are bounded
>>     queues (ArrayBlockingQueues) between the operators. The above
>>     example therefore has 5 threads and 4 ArrayBlockingQueues, and I/O
>>     is interleaved with computation.
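>>
>>     To make that concrete, here is a rough sketch of the shape of one
>>     operator thread (made-up names; this is not the attached code).
>>     Every value handed between operators is a blocking put() into and
>>     take() out of an ArrayBlockingQueue, so each handoff can park one
>>     thread and unpark another:
>>
>>         import java.util.concurrent.ArrayBlockingQueue;
>>
>>         // Hypothetical sketch of a two-input join operator thread.
>>         public class JoinStage implements Runnable {
>>             private final ArrayBlockingQueue<Integer> left, right, out;
>>
>>             JoinStage(ArrayBlockingQueue<Integer> left,
>>                       ArrayBlockingQueue<Integer> right,
>>                       ArrayBlockingQueue<Integer> out) {
>>                 this.left = left; this.right = right; this.out = out;
>>             }
>>
>>             @Override public void run() {
>>                 try {
>>                     while (true) {
>>                         int a = left.take();   // blocks (parks) if empty
>>                         int b = right.take();
>>                         out.put(a + b);        // blocks (parks) if full; may
>>                     }                          // unpark a waiting consumer
>>                 } catch (InterruptedException e) {
>>                     Thread.currentThread().interrupt();  // exit on interrupt
>>                 }
>>             }
>>         }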
>>
>>     Another way to run this is to first load R3 into J2 in parallel with
>>     loading R2 into J1 (the loads build hash maps in J1 and J2), and
>>     then have a thread pool with N workers (where N is the number of
>>     cores), where each worker takes a value from R1 and joins it "up the
>>     spine" of the tree -- matching it against the hash map in J1 and
>>     then the one in J2. This approach (the phased approach) therefore
>>     has an I/O phase followed by a computation phase.
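>>
>>     A rough, hypothetical sketch of that second shape (again, made-up
>>     names and sizes, not our actual code).  The hash maps are built up
>>     front, and during the probe phase the only blocking point is the
>>     workers pulling values from R1:
>>
>>         import java.util.Map;
>>         import java.util.concurrent.ArrayBlockingQueue;
>>         import java.util.concurrent.BlockingQueue;
>>         import java.util.concurrent.ConcurrentHashMap;
>>         import java.util.concurrent.ExecutorService;
>>         import java.util.concurrent.Executors;
>>         import java.util.concurrent.TimeUnit;
>>
>>         // Hypothetical sketch of the phased approach.
>>         public class PhasedJoin {
>>             public static void main(String[] args) throws InterruptedException {
>>                 // Phase 1: build the hash maps for J1 and J2 (loaded in
>>                 // parallel in the real code; inlined here for brevity).
>>                 Map<Integer, Integer> j1Map = new ConcurrentHashMap<>();
>>                 Map<Integer, Integer> j2Map = new ConcurrentHashMap<>();
>>                 for (int i = 0; i < 1_000_000; i++) {
>>                     j1Map.put(i, i);
>>                     j2Map.put(i, i);
>>                 }
>>
>>                 // Phase 2: N workers probe "up the spine": J1's map, then J2's.
>>                 int n = Runtime.getRuntime().availableProcessors();
>>                 ExecutorService pool = Executors.newFixedThreadPool(n);
>>                 BlockingQueue<Integer> r1 = new ArrayBlockingQueue<>(1024);
>>                 for (int w = 0; w < n; w++) {
>>                     pool.execute(() -> {
>>                         try {
>>                             while (true) {
>>                                 int v = r1.take();             // only blocking point
>>                                 if (v < 0) break;              // poison pill: done
>>                                 Integer m1 = j1Map.get(v);     // probe J1
>>                                 if (m1 != null) j2Map.get(m1); // then probe J2
>>                             }
>>                         } catch (InterruptedException e) {
>>                             Thread.currentThread().interrupt();
>>                         }
>>                     });
>>                 }
>>                 for (int i = 0; i < 1_000_000; i++) r1.put(i);  // feed R1's values
>>                 for (int w = 0; w < n; w++) r1.put(-1);         // one pill per worker
>>                 pool.shutdown();
>>                 pool.awaitTermination(1, TimeUnit.MINUTES);
>>             }
>>         }
>>
>>     The point of this shape is that the per-value work in phase 2 is a
>>     couple of hash probes, with no queue handoff between operators.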
>>
>>
>>     On OS X El Capitan (10.11.6), the queueing approach and the phased
>>     approach perform roughly equally (more variation in the run times
>>     for the queueing approach, but performance within 10% of each
>>     other). On Linux (Ubuntu 16.04 LTS), the queueing approach is almost
>>     three times slower than the phased approach!
>>
>>     On top of that, the queueing approach runs over twice as fast on OS X
>>     as it does on Linux, despite the fact that I am running OS X on
>>     old hardware (mid-2012 MacBook Pro, 3rd Gen Intel Core i5-3210M) and
>>     Linux on relatively new hardware (6th Gen Intel Core i7-5500U).
>>
>>     We have boiled the problem down to code that doesn't involve
>>     databases -- just threads and ArrayBlockingQueues. Attached are
>>     5 Java files which "approximate" the queueing approach by having
>>     each receive thread simply output numbers (rather than values fetched
>>     from a database) to an ArrayBlockingQueue, and each join thread
>>     simply read its two input queues, add the values together, and
>>     output the sum to its output queue.
>>
>>     On the same hardware described above, this code runs in 1 minute 30
>>     seconds on OS X, and 2 minutes 13 seconds on Linux!
>>
>>     The code is very simple -- just a bunch of threads and a bunch of
>>     ArrayBlockingQueues. So I am at a loss to explain the sizeable
>>     performance differences noted above.
>>
>>     Again, apologies if this is not the right forum to discuss this issue.
>>
>>     graham
>>
>>
>>     ------
>>     Graham MATTHEWS, Ph.D.
>>     Assistant Professor
>>     Department of Computer Science |  California Lutheran University
>>     60 W. Olsen Rd.  |  Thousand Oaks, CA 91360
>>     gmatthew at callutheran.edu  | Office location: D19
>>
>>     _______________________________________________
>>     Concurrency-interest mailing list
>>     Concurrency-interest at cs.oswego.edu
>>     http://cs.oswego.edu/mailman/listinfo/concurrency-interest
>>
>>
>>
>>
>> _______________________________________________
>> Concurrency-interest mailing list
>> Concurrency-interest at cs.oswego.edu
>> http://cs.oswego.edu/mailman/listinfo/concurrency-interest
>>
> _______________________________________________
> Concurrency-interest mailing list
> Concurrency-interest at cs.oswego.edu
> http://cs.oswego.edu/mailman/listinfo/concurrency-interest

-- 
-Nathan


