[concurrency-interest] strange performance issue with ArrayBlockingQueue on Linux
Brian S O'Neill
bronee at gmail.com
Thu Jul 20 00:50:42 EDT 2017
You also didn't mention CPU core count. I ran the test on two machines,
with Java 8 and with Java 9. With Java 9 I explicitly selected the
parallel GC, because the G1 collector in Java 9 is a CPU pig that slows
everything down.
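(Concretely, that just means adding -XX:+UseParallelGC to the java command
line for the Java 9 runs; the exact collector shouldn't matter as long as G1
is out of the picture.)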
Machine 1: Windows 10, Intel Xeon E3-1245 3.3GHz, 8 logical cores
Machine 2: Linux 4.10, AMD Ryzen 7 1700 3.0GHz, 16 logical cores
Machine 1, Java 8: 9.2 seconds
Machine 1, Java 9: 9.1 seconds
Machine 2, Java 8: 16.9 seconds
Machine 2, Java 9: 12.1 seconds
At first blush it looks like the problem is Linux, but I don't have a
proper apples-to-apples comparison. I noticed that with Java 8, the
number of context switches per second was about 2.5 million, and with
Java 9 it was about 2.3 million. Either way, this is really high. On
Windows, the context switch rate was about 1.9 million per second.
I also see a lot of variation between test runs. Running the code in a
loop, the time per run varies from 9 to 14 seconds on Windows and from 11
to 21 seconds on Linux, even with an explicit System.gc between runs.
When I run with -Xprof, I see the threads spending most of their time in
Unsafe.unpark and Unsafe.park. I've never understood why unpark always
comes up as a performance bottleneck. What does it do that's so expensive?
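My working theory -- not verified -- is that parking a thread with no permit
is a kernel-level block (a pthread condvar/futex wait on Linux), and
unparking a blocked thread is a kernel-level wakeup, so every blocking
handoff costs a context switch on each side, which would line up with the
rates above. A stripped-down two-thread handoff with no queue at all
isolates that cost. The sketch below is hypothetical (class and field names
made up), but ArrayBlockingQueue.take/put bottom out in the same
LockSupport.park/unpark calls, via ReentrantLock conditions, whenever a
thread has to block:

    import java.util.concurrent.atomic.AtomicLong;
    import java.util.concurrent.locks.LockSupport;

    // Two threads hand a "turn" back and forth via LockSupport.park/unpark
    // (the public face of Unsafe.park/unpark), measuring one blocking
    // handoff in isolation.
    public class ParkPingPong {
        static final long ROUNDS = 1_000_000;
        static final AtomicLong turn = new AtomicLong(); // even: ping, odd: pong
        static Thread ping, pong;

        static void run(long parity) {
            Thread peer = (parity == 0) ? pong : ping;
            for (long i = 0; i < ROUNDS; i++) {
                long mine = 2 * i + parity;
                while (turn.get() != mine)
                    LockSupport.park();    // blocks in Unsafe.park
                turn.incrementAndGet();    // hand the turn to the peer
                LockSupport.unpark(peer);  // Unsafe.unpark: wakes the peer if blocked
            }
        }

        public static void main(String[] args) throws Exception {
            ping = new Thread(() -> run(0));
            pong = new Thread(() -> run(1));
            long start = System.nanoTime();
            ping.start(); pong.start();
            ping.join();  pong.join();
            long ns = System.nanoTime() - start;
            System.out.printf("%d ns per handoff%n", ns / (2 * ROUNDS));
        }
    }

If the per-handoff cost there is in the same ballpark as the per-element
cost of the queue test, the queues themselves aren't the problem; the
blocking is, and keeping each thread busier per wakeup (larger queues,
batching) would likely help more than tuning the queue.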
On 2017-07-19 09:11 PM, Matthews, Graham wrote:
> Ah yes, sorry, I should have said we are working in Java 8.
>
> graham
>
>
> ------
> Graham MATTHEWS, Ph.D.
> Assistant Professor
> Department of Computer Science | California Lutheran University
> 60 W. Olsen Rd. | Thousand Oaks, CA 91360
> gmatthew at callutheran.edu | Office location: D19
> ------------------------------------------------------------------------
> From: Martin Buchholz <martinrb at google.com>
> Sent: Wednesday, July 19, 2017 8:57 PM
> To: Matthews, Graham
> Cc: concurrency-interest at cs.oswego.edu
> Subject: Re: [concurrency-interest] strange performance issue with
> ArrayBlockingQueue on Linux
> One obvious variable is the version of ABQ you are using. It was
> modified in jdk9, and you should be doing all performance work with the
> latest jdk9 binaries, e.g. from http://jdk.java.net/9/
>
> On Wed, Jul 19, 2017 at 8:41 PM, Matthews, Graham
> <gmatthew at callutheran.edu> wrote:
>
> [apologies in advance if this is the wrong place to discuss this
> issue. Also, apologies for the long setup -- I promise this is about
> ArrayBlockingQueue, Java, and a performance issue with them]
>
>
> My student and I have been benchmarking various ways to run
> cross-data-source joins. The basic setup is as follows:
>
>
>            J2
>           /  \
>         R3    J1
>              /  \
>            R2    R1
>
> R1, R2 and R3 are called receive operators as they receive data from
> the databases. J1 and J2 are inner join operators joining values
> from R2 with values from R1 and then the result with values from R3.
>
> One way to run these queries is the classic producers-and-consumers
> approach: each operator is a Java thread, with bounded queues
> (ArrayBlockingQueues) between the operators, so the example above has
> 5 threads and 4 ArrayBlockingQueues. Here we interleave I/O with
> computation.
>
> Another way to run this is to first load R3 into J2 in parallel with
> loading R2 into J1 (the loads build hash maps in J1 and J2), and
> then have a thread pool with N workers (where N is the number of
> cores), where each worker takes a value from R1 and joins it "up the
> spine" of the tree -- so matches against the hash map in J1 and then
> J2. This approach (the phased approach) therefore has an I/O phase,
> followed by a computation phase.
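>
> For concreteness, a minimal sketch of that phased shape (class and
> variable names made up, numbers standing in for database rows -- this
> is not our real code) might look like:
>
>     import java.util.concurrent.*;
>     import java.util.concurrent.atomic.AtomicLong;
>
>     // Phase 1 loads the build sides of J1 and J2 into hash maps, in
>     // parallel; phase 2 streams "R1" through a fixed pool of workers
>     // that probe J1 and then J2 up the spine.
>     public class PhasedJoin {
>         static final int N = 1_000_000;
>
>         public static void main(String[] args) throws Exception {
>             int cores = Runtime.getRuntime().availableProcessors();
>             ExecutorService pool = Executors.newFixedThreadPool(cores);
>
>             // I/O phase: build J1's map from R2 and J2's map from R3.
>             ConcurrentHashMap<Integer, Integer> j1 = new ConcurrentHashMap<>(N);
>             ConcurrentHashMap<Integer, Integer> j2 = new ConcurrentHashMap<>(N);
>             Future<?> loadJ1 = pool.submit(() -> {
>                 for (int i = 0; i < N; i++) j1.put(i, i);
>             });
>             Future<?> loadJ2 = pool.submit(() -> {
>                 for (int i = 0; i < N; i++) j2.put(i, i);
>             });
>             loadJ1.get();
>             loadJ2.get();
>
>             // Computation phase: each worker takes a slice of R1 and
>             // probes J1, then J2.
>             AtomicLong matches = new AtomicLong();
>             Future<?>[] workers = new Future<?>[cores];
>             for (int w = 0; w < cores; w++) {
>                 final int slice = w;
>                 workers[w] = pool.submit(() -> {
>                     long local = 0;
>                     for (int i = slice; i < N; i += cores) {  // values "from R1"
>                         Integer v1 = j1.get(i);               // probe J1
>                         if (v1 != null && j2.get(v1) != null) // then J2
>                             local++;
>                     }
>                     matches.addAndGet(local);
>                 });
>             }
>             for (Future<?> f : workers) f.get();
>             pool.shutdown();
>             System.out.println("matches: " + matches.get());
>         }
>     }
>
> The real code reads from the databases in phase 1, of course; the point
> is just that after phase 1 there is no blocking between threads.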
>
>
> On OS X El Capitan (10.11.6), the queueing approach and the phased
> approach perform roughly equally (more variation in the run times
> for the queueing approach, but performance within 10% of each
> other). On Linux (Ubuntu 16.04 LTS) the queueing approach is almost
> three times slower than the phased approach!
>
> On top of that, the queueing approach runs over twice as fast on OS X
> as it does on Linux, despite the fact that I am running OS X on
> old hardware (mid 2012 Macbook Pro, 3rd Gen Intel Core i5 3210M) and
> Linux on relatively new hardware (6th Gen Intel Core i7-5500U).
>
> We have boiled the problem down to code that doesn't involve
> databases -- just threads and ArrayBlockingQueues. Attached are
> 5 Java files which "approximate" the queueing approach by having
> each receive thread simply output numbers (rather than get values
> from a database) to an ArrayBlockingQueue, and each Join thread
> simply read its two input queues, add the values together, and
> output the sum to its output queue.
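>
> Roughly, the simplified test has this shape (a sketch with made-up
> names and a made-up queue capacity, not the attached files themselves):
>
>     import java.util.concurrent.ArrayBlockingQueue;
>
>     // Three "receive" threads emit ints into bounded queues; J1 reads
>     // two of them, adds the values, and pushes the sums to J2; J2 reads
>     // its two inputs and adds them. 5 threads, 4 ArrayBlockingQueues.
>     // (In the real test J2 also writes to an output queue; here it just
>     // keeps a running sum.)
>     public class QueuedJoin {
>         static final int N = 10_000_000, CAP = 1024;
>
>         static Thread receive(ArrayBlockingQueue<Integer> out) {
>             return new Thread(() -> {
>                 try { for (int i = 0; i < N; i++) out.put(i); }
>                 catch (InterruptedException e) { Thread.currentThread().interrupt(); }
>             });
>         }
>
>         public static void main(String[] args) throws Exception {
>             ArrayBlockingQueue<Integer> q1 = new ArrayBlockingQueue<>(CAP); // R1 -> J1
>             ArrayBlockingQueue<Integer> q2 = new ArrayBlockingQueue<>(CAP); // R2 -> J1
>             ArrayBlockingQueue<Integer> q3 = new ArrayBlockingQueue<>(CAP); // R3 -> J2
>             ArrayBlockingQueue<Integer> q4 = new ArrayBlockingQueue<>(CAP); // J1 -> J2
>
>             Thread r1 = receive(q1), r2 = receive(q2), r3 = receive(q3);
>             Thread j1 = new Thread(() -> {            // J1: q1 + q2 -> q4
>                 try { for (int i = 0; i < N; i++) q4.put(q1.take() + q2.take()); }
>                 catch (InterruptedException e) { Thread.currentThread().interrupt(); }
>             });
>             long[] sum = new long[1];
>             Thread j2 = new Thread(() -> {            // J2: q3 + q4 -> running sum
>                 try { for (int i = 0; i < N; i++) sum[0] += q3.take() + q4.take(); }
>                 catch (InterruptedException e) { Thread.currentThread().interrupt(); }
>             });
>
>             long start = System.nanoTime();
>             for (Thread t : new Thread[] { r1, r2, r3, j1, j2 }) t.start();
>             for (Thread t : new Thread[] { r1, r2, r3, j1, j2 }) t.join();
>             System.out.printf("sum=%d  %.1f s%n",
>                     sum[0], (System.nanoTime() - start) / 1e9);
>         }
>     }
>
> With a small capacity and almost no work per element, the threads block
> and wake constantly, which is presumably where most of the time goes.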
>
> On the same hardware as above, this code runs in 1 minute 30 secs on OS
> X, and 2 minutes 13 secs on Linux!
>
> The code is very simple -- just a bunch of threads and a bunch of
> ArrayBlockingQueues. So I am at a loss to explain the sizeable
> performance differences noted above.
>
> Again apologies if this is not the right forum to discuss this issue.
>
> graham
>
>
> ------
> Graham MATTHEWS, Ph.D.
> Assistant Professor
> Department of Computer Science | California Lutheran University
> 60 W. Olsen Rd. | Thousand Oaks, CA 91360
> gmatthew at callutheran.edu | Office location: D19
>
>
> _______________________________________________
> Concurrency-interest mailing list
> Concurrency-interest at cs.oswego.edu
> http://cs.oswego.edu/mailman/listinfo/concurrency-interest
>