[concurrency-interest] strange performance issue with ArrayBlockingQueue on Linux

Alex Otenko oleksandr.otenko at gmail.com
Fri Jul 21 03:22:08 EDT 2017


Or, to make the claim more obvious (I reduced the number of items to move). In the output below, "stops" counts how many times a join's take or put would have blocked, and "avg moved" is the average number of items moved between stops:

Ubuntu:
JOIN 1 : END - avg moved = 15.522080166396762 stops: 193273
JOIN 3 : END - avg moved = 11.284180712332459 stops: 265859
JOIN 4 : END - avg moved = 5.261479667895535 stops: 570182
JOIN 6 : END - avg moved = 5.253905401713407 stops: 571004
Num results = 1000000
JOIN 5 : END - avg moved = 9.410645946522266 stops: 318788
JOIN 2 : END - avg moved = 4.980204145797066 stops: 594771
JOIN 7 : END - avg moved = 13.271463266813882 stops: 226049

Mac:
JOIN 2 : END - avg moved = 191.25074912336663 stops: 15685
JOIN 5 : END - avg moved = 194.36389554850084 stops: 15433
JOIN 6 : END - avg moved = 270.7480144404339 stops: 11080
JOIN 7 : END - avg moved = 259.14675168653486 stops: 10821
JOIN 1 : END - avg moved = 231.73393077873897 stops: 12944
JOIN 3 : END - avg moved = 171.97678284797118 stops: 17444
JOIN 4 : END - avg moved = 216.7288490715988 stops: 13841


This is the change to Join.java, counting how many times take or put would block:
    @Override
    public void run() {
        int moved = 0;    // items moved since we last would have blocked
        double avg = 0;   // running average of items moved between stops
        int stops = 0;    // number of times take() or put() would have blocked
        try {
            while (true) {
                Long v = this.lhsInput.poll();
                if (v == null) {
                    /* poll() found the input empty, so take() will block:
                       fold the current run into the running average. */
                    avg = (avg * stops + moved) / (stops + 1);
                    stops++;
                    moved = 0;
                    v = this.lhsInput.take();
                }
                moved++;

                long lhsValue = v;

                v = this.rhsInput.poll();
                if (v == null) {
                    avg = (avg * stops + moved) / (stops + 1);
                    stops++;
                    moved = 0;
                    v = this.rhsInput.take();
                }
                moved++;

                long rhsValue = v;

                /* Check for end-of-stream markers. */
                if (lhsValue == Stream.EOS) {
                    this.eosCount++;
                }
                if (rhsValue == Stream.EOS) {
                    this.eosCount++;
                }

                /* Join values or send end-of-stream marker. */
                if (this.eosCount == 2) {
                    break;
                }
                else {
                    if (!this.offer(lhsValue + rhsValue)) {
                        /* offer() found the output full, so put() will block. */
                        avg = (avg * stops + moved) / (stops + 1);
                        stops++;
                        moved = 0;
                        this.put(lhsValue + rhsValue);
                    }
                    moved++;
                }
            }
            this.put(Stream.EOS);
            System.out.println(this + " : END - avg moved = " + avg + " stops: " + stops);
        }
        catch (InterruptedException ex) {
            ex.printStackTrace(System.err);
        }
    }

(and a corresponding change to Stream.java to expose offer())
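
A minimal sketch of what that Stream.java change might look like, assuming
Stream wraps its output in an ArrayBlockingQueue<Long> (the field name,
capacity, and EOS value here are illustrative, not from the attached
sources):

    import java.util.concurrent.ArrayBlockingQueue;

    public abstract class Stream {
        public static final long EOS = Long.MIN_VALUE;  // assumed end-of-stream marker

        private final ArrayBlockingQueue<Long> output = new ArrayBlockingQueue<>(1024);

        /* Newly exposed non-blocking variant: returns false instead of
           blocking when the output queue is full. */
        protected boolean offer(long v) {
            return this.output.offer(v);
        }

        /* Blocking variant, as before. */
        protected void put(long v) throws InterruptedException {
            this.output.put(v);
        }
    }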

Alex

> On 20 Jul 2017, at 23:05, Alex Otenko <oleksandr.otenko at gmail.com> wrote:
> 
> This particular case is not about the cost of a context switch, and not about the number of context switches per second. This is about the amount of work done per switch.
> 
> 
> Consider J2. When J2 moves data, it unblocks both J1 (whose output queue it drains) and J3 (whose input queue it feeds). If J1 gets on the CPU, the context switch lets J1 move just one data item before it blocks again. If J3 gets on the CPU instead, J2 keeps moving as much data as possible - essentially emptying J1's buffer - so by the time J2 blocks, it has created a lot of work for both J3 (it kept feeding J3's input queue) and J1 (it emptied J1's output queue). When J1 does get on a core again, it can move a lot of items upstream to J2, which in turn works out better if J2 waits until a lot of work is available instead of waking the moment a single item appears in its queue.
> 
> An example change in Join.java (synthetically biasing who gets blocked, and when) dramatically changed the times on my Ubuntu (in VirtualBox):
> 
>     @Override
>     public void run() {
>         long[] arr = new long[1000];  // batch buffer: delay puts so the consumer wakes to a lot of work
>         int pos = 0;
>         try {
>             while (true) {
>                 long lhsValue = this.lhsInput.take();
>                 long rhsValue = this.rhsInput.take();
> 
>                 /* Check for end-of-stream markers. */
>                 if (lhsValue == Stream.EOS) {
>                     this.eosCount++;
>                 }
>                 if (rhsValue == Stream.EOS) {
>                     this.eosCount++;
>                 }
> 
>                 /* Join values or send end-of-stream marker.*/
>                 if (this.eosCount == 2) {
>                     for(int i = 0; i < pos; i++) {
>                       this.put(arr[i]);
>                     }
>                     break;
>                 }
>                 else {
>                     arr[pos++] = lhsValue + rhsValue;
>                     if (pos == arr.length) {  // flush in batches of 1000
>                       for(int i = 0; i < pos; i++) {
>                         this.put(arr[i]);
>                       }
>                       pos = 0;
>                     }
>                 }
>             }
>             this.put(Stream.EOS);
>             System.out.println(this + " : END");
>         }
>         catch (InterruptedException ex) {
>             ex.printStackTrace(System.err);
>         }
>     }
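> 
> (The effect of the batch buffer is that the downstream consumer wakes up to as many as 1000 items at a time instead of one, so each wake-up pays for far more work.)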
> 
> 
> Alex
> 
> 
>> On 20 Jul 2017, at 15:45, Nathan and Ila Reynolds <nathanila at gmail.com> wrote:
>> 
>> > I've never understood why unpark always comes up as a performance bottleneck. What does it do that's so expensive?
>> 
>> A while ago, I talked with the engineer who has done a lot of optimization on LockSupport.unpark().  I can't remember whether that also covers Unsafe.unpark().
>> 
>> On Windows, LockSupport.unpark() goes into the kernel once to set a Windows Event object.  On Linux, LockSupport.unpark() acquires a lock, signals a condition variable, and releases the lock.  This *might* require 3 trips into the kernel.
>> 
>> Furthermore, the scheduler might let the unparked thread start running and make the unparking thread yield (if no logical cores are available).  Hence, unpark() can take even more time.
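>> 
>> A rough way to see this cost from Java is to time blocking hand-offs directly.  Here is a small sketch (my own illustration, not from any of the code in this thread) that ping-pongs through a SynchronousQueue, so every transfer parks one thread and unparks the other:
>> 
>>     import java.util.concurrent.SynchronousQueue;
>> 
>>     public class HandoffBench {
>>         public static void main(String[] args) throws InterruptedException {
>>             final int iters = 1_000_000;
>>             final SynchronousQueue<Integer> q = new SynchronousQueue<>();
>>             Thread consumer = new Thread(() -> {
>>                 try {
>>                     for (int i = 0; i < iters; i++) {
>>                         q.take();            // parks until the producer hands off
>>                     }
>>                 } catch (InterruptedException e) {
>>                     Thread.currentThread().interrupt();
>>                 }
>>             });
>>             consumer.start();
>>             long start = System.nanoTime();
>>             for (int i = 0; i < iters; i++) {
>>                 q.put(i);                    // parks until the consumer is ready
>>             }
>>             consumer.join();
>>             System.out.println("ns per hand-off: " + (System.nanoTime() - start) / iters);
>>         }
>>     }
>> 
>> Running this on Linux and on Windows should give a feel for the per-wakeup cost on each OS.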
>> 
>> -Nathan
>> 
>> On 7/19/2017 10:50 PM, Brian S O'Neill wrote:
>>> You also didn't mention CPU core count. I ran the test on two machines, with Java 8 and with Java 9. With Java 9 I explicitly selected the parallel GC, because the G1 collector in Java 9 is a CPU pig that slows everything down.
>>> 
>>> Machine 1: Windows 10, Intel Xeon E31245 3.3GHz, 8 logical cores
>>> Machine 2: Linux 4.10, AMD Ryzen 7 1700 3.0Ghz, 16 logical cores
>>> 
>>> Machine 1, Java 8: 9.2 seconds
>>> Machine 1, Java 9: 9.1 seconds
>>> Machine 2, Java 8: 16.9 seconds
>>> Machine 2, Java 9: 12.1 seconds
>>> 
>>> At first blush it looks like the problem is Linux, but I don't have a proper apples-to-apples comparison. I noticed that with Java 8, the number of context switches per second was about 2.5 million, and with Java 9 it was about 2.3 million. Either way, this is really high. On Windows, the context switch rate was about 1.9 million per second.
>>> 
>>> I also see a lot of variation between test runs, and when I run the code in a loop, the performance (on Windows) varies from 9 to 14 seconds per run. On Linux, I see 11 to 21 seconds when running in a loop. I also added an explicit System.gc between each test run.
>>> 
>>> When I run with -Xprof, I see the threads spending most of their time in Unsafe.unpark and Unsafe.park. I've never understood why unpark always comes up as a performance bottleneck. What does it do that's so expensive?
>>> 
>>> 
>>> On 2017-07-19 09:11 PM, Matthews, Graham wrote:
>>>> Ah yes, sorry, I should have said we are working in Java 8.
>>>> 
>>>> graham
>>>> 
>>>> 
>>>> ------
>>>> Graham MATTHEWS, Ph.D.
>>>> Assistant Professor
>>>> Department of Computer Science |  California Lutheran University
>>>> 60 W. Olsen Rd.  |  Thousand Oaks, CA 91360
>>>> gmatthew at callutheran.edu  | Office location: D19
>>>> ------------------------------------------------------------------------
>>>> *From:* Martin Buchholz <martinrb at google.com>
>>>> *Sent:* Wednesday, July 19, 2017 8:57 PM
>>>> *To:* Matthews, Graham
>>>> *Cc:* concurrency-interest at cs.oswego.edu
>>>> *Subject:* Re: [concurrency-interest] strange performance issue with ArrayBlockingQueue on Linux
>>>> One obvious variable is the version of ABQ you are using.  It was modified in jdk9, and you should be doing all performance work with the latest jdk9 binaries, e.g. from http://jdk.java.net/9/
>>>> 
>>>> On Wed, Jul 19, 2017 at 8:41 PM, Matthews, Graham <gmatthew at callutheran.edu> wrote:
>>>> 
>>>>    ​[apologies in advance if this is the wrong place to discuss this
>>>>    issue. also apologies for the long setup -- I promise this is about
>>>>    ArrayBlockingQueue, Java, and a performance issue with them]
>>>> 
>>>> 
>>>>    My student and I have been benchmarking various ways to run cross
>>>>    data source joins. The basic setup is as follows:
>>>> 
>>>> 
>>>>              J2
>>>>             /  \
>>>>            /    \
>>>>          R3      J1
>>>>                 /  \
>>>>                /    \
>>>>              R2      R1
>>>> 
>>>>    R1, R2, and R3 are called receive operators, as they receive data from
>>>>    the databases. J1 and J2 are inner join operators: J1 joins values
>>>>    from R2 with values from R1, and J2 joins that result with values from R3.
>>>> 
>>>>    One way to run these queries is the classic producers-and-consumers
>>>>    approach: each operator is a Java thread, with bounded queues
>>>>    (ArrayBlockingQueues) between the operators. The above example
>>>>    therefore has 5 threads and 4 ArrayBlockingQueues, and I/O is
>>>>    interleaved with computation.
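>>>> 
>>>>    A minimal, self-contained sketch of that wiring (names, queue
>>>>    capacity, and the end-of-stream marker are illustrative, not from
>>>>    the attached files; the result is drained through a fifth queue
>>>>    in the main thread here):
>>>> 
>>>>        import java.util.concurrent.ArrayBlockingQueue;
>>>> 
>>>>        public class QueueingSketch {
>>>>            static final long EOS = Long.MIN_VALUE;  // end-of-stream marker
>>>> 
>>>>            interface Body { void run() throws InterruptedException; }
>>>> 
>>>>            static void start(Body b) {
>>>>                new Thread(() -> {
>>>>                    try { b.run(); }
>>>>                    catch (InterruptedException e) { Thread.currentThread().interrupt(); }
>>>>                }).start();
>>>>            }
>>>> 
>>>>            static ArrayBlockingQueue<Long> queue() {
>>>>                return new ArrayBlockingQueue<>(1024);
>>>>            }
>>>> 
>>>>            // A receive operator: stands in for reading from a database.
>>>>            static void receive(ArrayBlockingQueue<Long> out) {
>>>>                start(() -> {
>>>>                    for (long i = 0; i < 1_000_000; i++) out.put(i);
>>>>                    out.put(EOS);
>>>>                });
>>>>            }
>>>> 
>>>>            // A join operator: combines its two inputs pairwise.
>>>>            static void join(ArrayBlockingQueue<Long> lhs,
>>>>                             ArrayBlockingQueue<Long> rhs,
>>>>                             ArrayBlockingQueue<Long> out) {
>>>>                start(() -> {
>>>>                    while (true) {
>>>>                        long l = lhs.take(), r = rhs.take();
>>>>                        if (l == EOS || r == EOS) break;  // inputs are equal length
>>>>                        out.put(l + r);
>>>>                    }
>>>>                    out.put(EOS);
>>>>                });
>>>>            }
>>>> 
>>>>            public static void main(String[] args) throws InterruptedException {
>>>>                ArrayBlockingQueue<Long> r1 = queue(), r2 = queue(), r3 = queue(),
>>>>                                         j1 = queue(), j2 = queue();
>>>>                receive(r1); receive(r2); receive(r3);  // 3 receive threads
>>>>                join(r2, r1, j1);                       // J1
>>>>                join(r3, j1, j2);                       // J2
>>>>                long count = 0;
>>>>                while (j2.take() != EOS) count++;       // drain final results
>>>>                System.out.println("Num results = " + count);
>>>>            }
>>>>        }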
>>>> 
>>>>    Another way to run this is to first load R3 into J2 in parallel with
>>>>    loading R2 into J1 (the loads build hash maps in J1 and J2), and
>>>>    then have a thread pool with N workers (where N is the number of
>>>>    cores), where each worker takes a value from R1 and joins it "up the
>>>>    spine" of the tree -- so matches against the hash map in J1 and then
>>>>    J2. This approach (the phased approach) therefore has an I/O phase,
>>>>    followed by a computation phase.
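>>>> 
>>>>    And a comparable sketch of the phased approach (equally illustrative;
>>>>    the real version builds its hash maps from database rows, reduced
>>>>    here to generated numbers):
>>>> 
>>>>        import java.util.HashMap;
>>>>        import java.util.Map;
>>>>        import java.util.concurrent.*;
>>>>        import java.util.concurrent.atomic.AtomicLong;
>>>> 
>>>>        public class PhasedSketch {
>>>>            static Map<Long, Long> load(int n) {  // stands in for loading R2/R3
>>>>                Map<Long, Long> m = new HashMap<>(2 * n);
>>>>                for (long i = 0; i < n; i++) m.put(i, i);
>>>>                return m;
>>>>            }
>>>> 
>>>>            public static void main(String[] args) throws Exception {
>>>>                final int items = 1_000_000;
>>>>                int n = Runtime.getRuntime().availableProcessors();
>>>>                ExecutorService pool = Executors.newFixedThreadPool(n);
>>>> 
>>>>                // I/O phase: build J1's and J2's hash maps in parallel.
>>>>                Future<Map<Long, Long>> f1 = pool.submit(() -> load(items));
>>>>                Future<Map<Long, Long>> f2 = pool.submit(() -> load(items));
>>>>                final Map<Long, Long> j1 = f1.get(), j2 = f2.get();
>>>> 
>>>>                // Computation phase: N workers drain R1 (here just a
>>>>                // counter) and probe up the spine: J1's map, then J2's.
>>>>                AtomicLong next = new AtomicLong();
>>>>                AtomicLong results = new AtomicLong();
>>>>                for (int w = 0; w < n; w++) {
>>>>                    pool.execute(() -> {
>>>>                        long v;
>>>>                        while ((v = next.getAndIncrement()) < items) {
>>>>                            if (j1.get(v) != null && j2.get(v) != null) {
>>>>                                results.incrementAndGet();
>>>>                            }
>>>>                        }
>>>>                    });
>>>>                }
>>>>                pool.shutdown();
>>>>                pool.awaitTermination(10, TimeUnit.MINUTES);
>>>>                System.out.println("Num results = " + results.get());
>>>>            }
>>>>        }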
>>>> 
>>>> 
>>>>    On OS X El Capitan (10.11.6), the queueing approach and the phased
>>>>    approach perform roughly equally (more variation in the run times
>>>>    for the queueing approach, but performance within 10% of each
>>>>    other). On Linux (Ubuntu 16.04 LTS) the queueing approach is almost
>>>>    three times slower than the phased approach!
>>>> 
>>>>    On top of that the queueing approach runs over twice as fast on OS X
>>>>    as it is does on Linux, despite the fact that I am running OS X on
>>>>    old hardware (mid-2012 MacBook Pro, 3rd Gen Intel Core i5-3210M) and
>>>>    Linux on relatively new hardware (5th Gen Intel Core i7-5500U).
>>>> 
>>>>    We have boiled the problem down to code that doesn't involve
>>>>    databases -- just threads and ArrayBlockingQueues. Attached are
>>>>    5 Java files which "approximate" the queueing approach by having
>>>>    each receive thread simply output numbers (rather than fetch values
>>>>    from a database) to an ArrayBlockingQueue, and each Join thread
>>>>    simply read its two input queues, add the values together, and
>>>>    output the sum to its output queue.
>>>> 
>>>>    On the same hardware as above, this code runs in 1 minute 30 secs on OS
>>>>    X, and 2 minutes 13 secs on Linux!
>>>> 
>>>>    The code is very simple -- just a bunch of threads and a bunch of
>>>>    ArrayBlockingQueues. So I am at a loss to explain the sizeable
>>>>    performance differences noted above.
>>>> 
>>>>    Again apologies if this is not the right forum to discuss this issue.
>>>> 
>>>>    graham
>>>> 
>>>>    ------
>>>>    Graham MATTHEWS, Ph.D.
>>>>    Assistant Professor
>>>>    Department of Computer Science |  California Lutheran University
>>>>    60 W. Olsen Rd.  |  Thousand Oaks, CA 91360
>>>>    gmatthew at callutheran.edu  | Office location: D19
>>>> 
>>>>    _______________________________________________
>>>>    Concurrency-interest mailing list
>>>>    Concurrency-interest at cs.oswego.edu
>>>>    http://cs.oswego.edu/mailman/listinfo/concurrency-interest
>>>> 
>>>> 
>>>> 
>>>> 
>>>> _______________________________________________
>>>> Concurrency-interest mailing list
>>>> Concurrency-interest at cs.oswego.edu
>>>> http://cs.oswego.edu/mailman/listinfo/concurrency-interest
>>>> 
>>> _______________________________________________
>>> Concurrency-interest mailing list
>>> Concurrency-interest at cs.oswego.edu
>>> http://cs.oswego.edu/mailman/listinfo/concurrency-interest
>> 
>> -- 
>> -Nathan
>> 
>> _______________________________________________
>> Concurrency-interest mailing list
>> Concurrency-interest at cs.oswego.edu
>> http://cs.oswego.edu/mailman/listinfo/concurrency-interest
> 
