[concurrency-interest] strange performance issue with ArrayBlockingQueue on Linux

Matthews, Graham gmatthew at callutheran.edu
Fri Jul 21 14:29:27 EDT 2017


The core count on both machines is the same -- both are dual-core with hyper-threading, so 4 logical "cores".

I have repeated the test on the Mac with JDK 9, and get the same performance figures (1 minute 30 secs).

I have also checked the context switching level, and OS X is running the code at about 50-100K context switches per second (so an order of magnitude fewer context switches per second than you report).

We have been unable to get JDK 9 running on our Linux box, but are working on it; I will post the Linux results when we do.

We had previously run this code with profiling and, like you, found the unpark/park issue. We had also noticed that on OS X almost half as much time is spent in park/unpark as on Linux.

------
Graham MATTHEWS, Ph.D.
Assistant Professor
Department of Computer Science |  California Lutheran University
60 W. Olsen Rd.  |  Thousand Oaks, CA 91360
gmatthew at callutheran.edu  | Office location: D19

________________________________________
From: Concurrency-interest <concurrency-interest-bounces at cs.oswego.edu> on behalf of Brian S O'Neill <bronee at gmail.com>
Sent: Wednesday, July 19, 2017 9:50 PM
To: concurrency-interest at cs.oswego.edu
Subject: Re: [concurrency-interest] strange performance issue with ArrayBlockingQueue on Linux

You also didn't mention CPU core count. I ran the test on two machines,
with Java 8 and with Java 9. With Java 9 I explicitly selected the
parallel GC, because the G1 collector in Java 9 is a CPU pig that slows
everything down.

Machine 1: Windows 10, Intel Xeon E3-1245 3.3GHz, 8 logical cores
Machine 2: Linux 4.10, AMD Ryzen 7 1700 3.0GHz, 16 logical cores

Machine 1, Java 8: 9.2 seconds
Machine 1, Java 9: 9.1 seconds
Machine 2, Java 8: 16.9 seconds
Machine 2, Java 9: 12.1 seconds

At first blush it looks like the problem is Linux, but I don't have a
proper apples-to-apples comparison. I noticed that with Java 8, the
number of context switches per second was about 2.5 million, and with
Java 9 it was about 2.3 million. Either way, this is really high. On
Windows, the context switch rate was about 1.9 million per second.

I also see a lot of variation between test runs, and when I run the code
in a loop, the performance (on Windows) varies from 9 to 14 seconds per
run. On Linux, I see 11 to 21 seconds when running in a loop. I also
added an explicit System.gc between each test run.

When I run with -Xprof, I see the threads spending most of their time in
Unsafe.unpark and Unsafe.park. I've never understood why unpark always
comes up as a performance bottleneck. What does it do that's so expensive?
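
A stripped-down handoff test along these lines (illustrative only -- the
class name and counts are made up, and this is not the benchmark itself)
isolates just the park/unpark round trip, without any queue:

    import java.util.concurrent.locks.LockSupport;

    // Two threads hand a token back and forth using only park/unpark.
    public class ParkPingPong {
        static volatile Thread pinger, ponger;
        static volatile boolean turn;   // true = ponger's turn

        public static void main(String[] args) throws Exception {
            final int rounds = 1_000_000;
            pinger = Thread.currentThread();
            ponger = new Thread(() -> {
                for (int i = 0; i < rounds; i++) {
                    while (!turn) LockSupport.park();   // wait for the token
                    turn = false;
                    LockSupport.unpark(pinger);         // hand it back
                }
            });
            ponger.start();
            long start = System.nanoTime();
            for (int i = 0; i < rounds; i++) {
                turn = true;
                LockSupport.unpark(ponger);             // hand the token over
                while (turn) LockSupport.park();        // wait for its return
            }
            ponger.join();
            long elapsed = System.nanoTime() - start;
            System.out.printf("%d round trips in %.2f s%n",
                              rounds, elapsed / 1e9);
        }
    }

Every handoff that actually blocks costs a pair of wakeups, so context
switch rates in the millions per second are not surprising when producers
and consumers keep blocking on the queues.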


On 2017-07-19 09:11 PM, Matthews, Graham wrote:
> Ah yes, sorry, I should have said we are working in Java 8.
>
> graham
>
>
> ------
> Graham MATTHEWS, Ph.D.
> Assistant Professor
> Department of Computer Science |  California Lutheran University
> 60 W. Olsen Rd.  |  Thousand Oaks, CA 91360
> gmatthew at callutheran.edu  | Office location: D19
> ------------------------------------------------------------------------
> *From:* Martin Buchholz <martinrb at google.com>
> *Sent:* Wednesday, July 19, 2017 8:57 PM
> *To:* Matthews, Graham
> *Cc:* concurrency-interest at cs.oswego.edu
> *Subject:* Re: [concurrency-interest] strange performance issue with
> ArrayBlockingQueue on Linux
> One obvious variable is the version of ABQ you are using.  It was
> modified in jdk9 and you should be doing all performance work with
> latest jdk9 binaries e.g. from http://jdk.java.net/9/
>
> On Wed, Jul 19, 2017 at 8:41 PM, Matthews, Graham
> <gmatthew at callutheran.edu <mailto:gmatthew at callutheran.edu>> wrote:
>
>     [Apologies in advance if this is the wrong place to discuss this
>     issue, and apologies for the long setup -- I promise this is about
>     ArrayBlockingQueue, Java, and a performance issue with them.]
>
>
>     My student and I have been benchmarking various ways to run cross
>     data source joins. The basic setup is as follows:
>
>
>                  J2
>                 /  \
>                /    \
>              R3      J1
>                     /  \
>                    /    \
>                  R2      R1
>
>     R1, R2 and R3 are called receive operators, as they receive data from
>     the databases. J1 and J2 are inner join operators: J1 joins values
>     from R2 with values from R1, and J2 joins that result with values
>     from R3.
>
>     One way to run these queries is the classic producers-and-consumers
>     approach: each operator is a Java thread, with bounded queues
>     (ArrayBlockingQueues) between the operators. The above example
>     therefore has 5 threads and 4 ArrayBlockingQueues, and I/O is
>     interleaved with computation.
>
>     Another way to run this is to first load R3 into J2 in parallel with
>     loading R2 into J1 (the loads build hash maps in J1 and J2), and
>     then have a thread pool with N workers (where N is the number of
>     cores), where each worker takes a value from R1 and joins it "up the
>     spine" of the tree -- matching against the hash map in J1 and then
>     the one in J2. This approach (the phased approach) therefore has an
>     I/O phase followed by a computation phase.
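>
>     In outline -- with made-up names, and plain int ranges standing in
>     for the database data, so this is a sketch rather than our actual
>     code -- the phased approach looks something like:
>
>         import java.util.*;
>         import java.util.concurrent.*;
>         import java.util.concurrent.atomic.*;
>
>         public class PhasedSketch {
>             public static void main(String[] args) throws Exception {
>                 final int ROWS = 1_000_000;
>                 // Phase 1 (I/O): build J1's and J2's hash maps from
>                 // R2 and R3, in parallel.
>                 Map<Integer, Integer> j1 = new HashMap<>();
>                 Map<Integer, Integer> j2 = new HashMap<>();
>                 Thread loadR2 = new Thread(() -> {
>                     for (int i = 0; i < ROWS; i++) j1.put(i, i);
>                 });
>                 Thread loadR3 = new Thread(() -> {
>                     for (int i = 0; i < ROWS; i++) j2.put(i, i);
>                 });
>                 loadR2.start(); loadR3.start();
>                 loadR2.join();  loadR3.join();
>
>                 // Phase 2 (compute): one worker per core drives R1's
>                 // values up the spine of the tree.
>                 int workers = Runtime.getRuntime().availableProcessors();
>                 ExecutorService pool = Executors.newFixedThreadPool(workers);
>                 AtomicInteger next = new AtomicInteger();  // stand-in for R1
>                 LongAdder total = new LongAdder();
>                 for (int w = 0; w < workers; w++) {
>                     pool.execute(() -> {
>                         int v;
>                         while ((v = next.getAndIncrement()) < ROWS) {
>                             Integer a = j1.get(v);         // probe J1's map
>                             if (a == null) continue;
>                             Integer b = j2.get(v);         // then J2's
>                             if (b != null) total.add(v + a + b);
>                         }
>                     });
>                 }
>                 pool.shutdown();
>                 pool.awaitTermination(1, TimeUnit.HOURS);
>                 System.out.println(total.sum());
>             }
>         }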
>
>
>     On OS X El Capitan (10.11.6), the queueing approach and the phased
>     approach perform roughly equally (more variation in the run times
>     for the queueing approach, but performance within 10% of each
>     other). On Linux (Ubuntu 16.04 LTS) the queueing approach is almost
>     three times slower than the phased approach!
>
>     On top of that, the queueing approach runs over twice as fast on OS X
>     as it does on Linux, despite the fact that I am running OS X on
>     old hardware (mid-2012 MacBook Pro, 3rd Gen Intel Core i5-3210M) and
>     Linux on relatively new hardware (6th Gen Intel Core i7-5500U).
>
>     We have boiled the problem down to code that doesn't involve
>     databases -- just threads and ArrayBlockingQueues. Attached are
>     5 Java files which "approximate" the queueing approach by having
>     each receive thread simply write numbers (rather than values
>     fetched from a database) to an ArrayBlockingQueue, and each join
>     thread simply read its two input queues, add the values together,
>     and write the sum to its output queue.
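>
>     In outline the attached code is structured roughly like the
>     following (the names, queue capacity and row count here are
>     placeholders, not the actual attachments):
>
>         import java.util.concurrent.*;
>
>         public class QueueSketch {
>             static final int ROWS = 10_000_000, CAP = 1024;
>
>             static Thread receiver(BlockingQueue<Integer> out) {
>                 return new Thread(() -> {
>                     try {
>                         for (int i = 0; i < ROWS; i++) out.put(i);
>                     } catch (InterruptedException e) {
>                         Thread.currentThread().interrupt();
>                     }
>                 });
>             }
>
>             static Thread joiner(BlockingQueue<Integer> left,
>                                  BlockingQueue<Integer> right,
>                                  BlockingQueue<Integer> out) {
>                 return new Thread(() -> {
>                     try {
>                         for (int i = 0; i < ROWS; i++) {
>                             int sum = left.take() + right.take();
>                             if (out != null) out.put(sum); // J2: no output
>                         }
>                     } catch (InterruptedException e) {
>                         Thread.currentThread().interrupt();
>                     }
>                 });
>             }
>
>             public static void main(String[] args) throws Exception {
>                 BlockingQueue<Integer> r1 = new ArrayBlockingQueue<>(CAP);
>                 BlockingQueue<Integer> r2 = new ArrayBlockingQueue<>(CAP);
>                 BlockingQueue<Integer> r3 = new ArrayBlockingQueue<>(CAP);
>                 BlockingQueue<Integer> j1out = new ArrayBlockingQueue<>(CAP);
>                 Thread[] ts = {
>                     receiver(r1), receiver(r2), receiver(r3),
>                     joiner(r2, r1, j1out),   // J1 joins R2 and R1
>                     joiner(r3, j1out, null)  // J2 joins R3 with J1's output
>                 };
>                 long start = System.nanoTime();
>                 for (Thread t : ts) t.start();
>                 for (Thread t : ts) t.join();
>                 System.out.printf("%.1f s%n",
>                                   (System.nanoTime() - start) / 1e9);
>             }
>         }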
>
>     On the same hardware as above, this code runs in 1 minute 30 secs
>     on OS X, and 2 minutes 13 secs on Linux!
>
>     The code is very simple -- just a bunch of threads and a bunch of
>     ArrayBlockingQueues. So I am at a loss to explain the sizeable
>     performance differences noted above.
>
>     Again apologies if this is not the right forum to discuss this issue.
>
>     graham
>
>     ------
>     Graham MATTHEWS, Ph.D.
>     Assistant Professor
>     Department of Computer Science |  California Lutheran University
>     60 W. Olsen Rd.  |  Thousand Oaks, CA 91360
>     gmatthew at callutheran.edu <mailto:gmatthew at callutheran.edu>  | Office
>     location: D19
>
_______________________________________________
Concurrency-interest mailing list
Concurrency-interest at cs.oswego.edu
http://cs.oswego.edu/mailman/listinfo/concurrency-interest

