[concurrency-interest] Blocking vs. non-blocking

Stanimir Simeonoff stanimir at riflexo.com
Sat Jun 14 05:54:56 EDT 2014


>
> Also, apparently, in heavy I/O scenarios, you may have a much better
> system throughput waiting for things to happen in I/O (blocking I/O) vs
> being notified of I/O events (Selector-based I/O):
> http://www.mailinator.com/tymaPaulMultithreaded.pdf. Paper is 6 years old
> and kernel/Java realities might have changed, YMMV, but the difference
> is(was?) impressive. Also, Apache HTTP Client still swears by blocking I/O
> vs non-blocking one in terms of efficiency:
> http://wiki.apache.org/HttpComponents/HttpClient3vsHttpClient4vsHttpCore


Blocking IO is basically a single-threaded poll() plus copying the buffer via
malloc/free (or stack allocation for smaller arrays); selector-based NIO is
epoll() without the copy when direct buffers are used. Windows is worse, as
WaitForMultipleObjects is limited to 64 handles, hence it requires tiered
threads to implement a selector.
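For a concrete picture, the two read paths roughly look like the sketch
below (illustrative only; the class and method names are made up):

    import java.io.InputStream;
    import java.net.Socket;
    import java.nio.ByteBuffer;
    import java.nio.channels.SocketChannel;

    // Rough sketch of the two read paths described above.
    class ReadPaths {
        // Blocking path: the thread parks in the kernel on this single fd and
        // the incoming bytes get copied into the heap array we pass in.
        static int blockingRead(Socket socket, byte[] heapBuf) throws Exception {
            InputStream in = socket.getInputStream();
            return in.read(heapBuf);          // blocks until data arrives (or EOF)
        }

        // Selector/NIO path: with a direct buffer the data is read straight
        // into native memory, skipping the extra Java-side copy.
        static int selectorRead(SocketChannel channel, ByteBuffer directBuf) throws Exception {
            return channel.read(directBuf);   // returns 0 if nothing is ready
        }
    }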

Waking up a selector, especially when done via a naked wakeup(), suffers from
unneeded contention. I am not sure whether that has been filed as a bug/feature
request against the JDK, yet the implementation looks like this:
    public Selector wakeup() {
        synchronized (interruptLock) {
            if (!interruptTriggered) {
                pollWrapper.interrupt();
                interruptTriggered = true;
            }
        }
        return this;
    }
The pollWrapper.interrupt() call performs the real wakeup via a Pipe or a
pair of sockets. The call completes relatively slowly and stalls all
concurrent calls to Selector.wakeup() - which is mostly used for writing.
That calls for a check+CAS guard around Selector.wakeup() to ensure only a
single thread actually carries out the call.
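A minimal sketch of such a guard (not the JDK code; the class and method
names are made up) might look like this:

    import java.nio.channels.Selector;
    import java.util.concurrent.atomic.AtomicBoolean;

    // Sketch only: the first writer to see no wakeup in flight pays for the
    // pipe write; everybody else just finds the flag already set.
    class GuardedWakeup {
        private final Selector selector;
        private final AtomicBoolean wakeupPending = new AtomicBoolean();

        GuardedWakeup(Selector selector) { this.selector = selector; }

        // Called by writer threads after they queue work for the selector loop.
        void wakeup() {
            if (!wakeupPending.get() && wakeupPending.compareAndSet(false, true)) {
                selector.wakeup();   // the expensive Pipe/socket-pair write
            }
        }

        // Called by the selector thread right after select() returns,
        // before it drains its event queue.
        void reset() {
            wakeupPending.set(false);
        }
    }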
Also, Java lacks a good, out-of-the-box MP/SC queue (preferably bounded) for
implementing the event queue (writes/registrations) for the selector loop.
Recently I had a look at netty.io, and its implementation seems to pick up
almost every trick in the book.
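For illustration, such an event queue in front of a selector loop could be
sketched along these lines (ConcurrentLinkedQueue stands in for the missing
MP/SC queue; a bounded MP/SC queue, e.g. from a library like JCTools, would
be preferable in practice):

    import java.io.IOException;
    import java.nio.channels.Selector;
    import java.util.Queue;
    import java.util.concurrent.ConcurrentLinkedQueue;

    // Sketch of an MP/SC-style event queue feeding a selector loop.
    class SelectorLoop implements Runnable {
        private final Selector selector;
        private final Queue<Runnable> events = new ConcurrentLinkedQueue<>();

        SelectorLoop() throws IOException { this.selector = Selector.open(); }

        // Multi-producer side: any thread may queue writes/registrations here.
        void execute(Runnable task) {
            events.offer(task);
            selector.wakeup();   // ideally guarded by the check+CAS shown above
        }

        // Single-consumer side: only this thread touches the selector and its keys.
        @Override
        public void run() {
            while (!Thread.currentThread().isInterrupted()) {
                try {
                    selector.select();
                } catch (IOException e) {
                    break;
                }
                Runnable task;
                while ((task = events.poll()) != null) {
                    task.run();  // registrations/writes happen on the selector thread
                }
                // ... then iterate selector.selectedKeys() and handle ready channels
            }
        }
    }

Funneling all selector mutations through the single loop thread this way
avoids locking the key sets while keeping producers wait-free apart from
the queue offer.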

There is no way for a good NIO implementation to lose to a blocking one.
Yet the biggest pitfall is measuring just the throughput, which is rarely
what matters -- blocking IO doesn't offer any reasonable means to control
latency; which thread (and hence which socket) gets scheduled next depends
entirely on the OS scheduler. The latter matters especially when delivering
real-time information such as market quotes.


Stanimir


On Sat, Jun 14, 2014 at 9:25 AM, Arcadiy Ivanov <arcadiy at ivanov.biz> wrote:

>  On 2014-06-14 00:31, Dennis Sosnoski wrote:
>
>
> I'm actually using direct wait()/notify() rather than a more sophisticated
> way of executing threads in turn, since I'm mostly interested in showing
> people why they should use callback-type event handling vs. blocking waits.
>
> Interestingly enough, it actually depends on what you're doing. ;)
>
> <imho>
> Firstly, while everything you say about thousands of threads being a waste
> of resources is true, there are a few points to consider:
>
>    1. Does your implementation satisfy user demand?
>     2. Would it be cheaper to just get a bigger box/more boxes and stay
>    with simple blocking code or would it be less expensive to (re-?)write the
>    code to be non-blocking and then maintain it?
>
> While I recognize my argument is somewhat tangential and narrower than the
> generic "wait/notify" vs "use callback" question, please consider this:
>
>    1. Generally, only active threads are relevant. If you have 100
>    threads active at any given time, it doesn't really matter
>    context-switching-wise if you have 50k threads total (that and more can
>    be easily accomplished via trivial Linux kernel tuning). Yes, you waste
>    stack, PIDs and FDs, but a 24-CPU/128GB box cost only ~$30k a year ago,
>    and pretty much any amount of development time is more expensive than
>    adding another 128GB to the machine.
>     2. If all you do is burn CPU, there is *no question* that
>    wait/notify is grossly inefficient vs a callback - Aleksey can elaborate at
>    length on what FJP optimizations were done to make sure that threads do not
>    suspend waiting for tasks.
>    If all you do is I/O and burn CPU based on that, the answer *could be*
>    dramatically different: I/O latencies dominate any context-switching
>    overhead, and on most OSes when you perform most I/O there is an interrupt,
>    a security context switch in the kernel, and possibly even a thread
>    suspension and a thread context switch *anyway* on top of that (you may get
>    suspended while the I/O syscall interrupt is handled by a kernel thread
>    pool)!
>    3. Imagine you are processing a vast volume of SSH connections. At
>    certain data volumes your load will be dominated by the time spent on AES
>    encryption/decryption of the SSH traffic, which is a function of
>    plain/ciphertext volume, not the number of threads. You're going to max out
>    your compute at somewhere around 75MB/s/core of AES even with AES-NI, i.e.
>    the number of clients you can reasonably support is, maybe, in the low tens
>    of thousands? If clients produce voluminous traffic, then in the low
>    thousands. Even at 100% efficiency you're limited to those numbers. Does it
>    make sense (time-/cost-/complexity-wise) to try to write a callback-based
>    client that could handle hundreds of thousands or millions of clients *if
>    not* for all that pesky encryption compute requirement you're going to be
>    limited by anyway?
>    Also, apparently, in heavy I/O scenarios, you may have a much better
>    system throughput waiting for things to happen in I/O (blocking I/O) vs
>    being notified of I/O events (Selector-based I/O):
>    http://www.mailinator.com/tymaPaulMultithreaded.pdf. Paper is 6 years
>    old and kernel/Java realities might have changed, YMMV, but the difference
>    is(was?) impressive. Also, Apache HTTP Client still swears by blocking I/O
>    vs non-blocking one in terms of efficiency:
>    http://wiki.apache.org/HttpComponents/HttpClient3vsHttpClient4vsHttpCore
>    4. Callbacks potentially have to maintain, and the threads executing
>    them have to switch, application-defined contexts (e.g. current security
>    principal, current transaction, etc.). How expensive that is depends on
>    the application.
>     5. *Callback hell* is not an urban myth, and neither is architectural
>    entropy. If you have a core of very competent developers who are going to
>    work together on the product in perpetuity, callbacks are a reasonable and
>    very efficient solution. In an enterprise environment, with the number and
>    quality of the people that work on the code and with the architectural
>    preparation and control that time constraints allow and modularity demands,
>    your callback hierarchy may disintegrate rapidly, causing races, deadlocks,
>    etc., forcing a complete rewrite in just a few years, or a complete project
>    failure even before release. Blocking code is orders of magnitude easier to
>    implement, validate and maintain, especially with people who cannot wrap
>    their heads around the meaning of volatile after writing Java for a decade.
>    Losing 20% (straw-man number) efficiency in thread context switching at
>    high tens of thousands of threads is a small price to pay for the code that
>    actually continues to work 10 years after it has been written. And you
>    virtually always can add yet another box to increase your total throughput.
>    6. Curiously, even a fully non-blocking algorithm that uses as many
>    software threads as there are hardware ones with all data being
>    thread-resident and no data sharing occurring can suffer severely from
>    cache residency imbalance and demonstrate poor efficiency:
>    https://blogs.oracle.com/dave/resource/spaa14-dice-UnfairnessResidency-CameraReady.pdf.
>    This is to illustrate that there are monsters in virtually every approach
>    and the end results may be quite surprising.
>
> Again, not saying anything you said is wrong, but there are a few
> considerations other than eliminating context switches and reducing OS
> resource constraints when answering the question "should I block?" There
> are many tools, there are many scenarios, different tools are good for
> different scenarios => blanket recommendations are dangerous. :)
> </imho>
>
> - Arcadiy
>
>    - Dennis
>
>
> On 2014-06-13 21:50, Dennis Sosnoski wrote:
>
> On 06/14/2014 01:31 PM, Vitaly Davidovich wrote:
>
> I'd think the 1M cycle delays to get a thread running again are probably
> due to OS scheduling it on a cpu that is in a deep c-state; there can be
> significant delays as the cpu powers back on.
>
>
> That makes sense, but I'd think it would only be an issue for systems
> under light load.
>
>   - Dennis
>
>  Sent from my phone
> On Jun 13, 2014 9:07 PM, "Dennis Sosnoski" <dms at sosnoski.com> wrote:
>
>> On 06/14/2014 11:57 AM, Doug Lea wrote:
>>
>>> On 06/13/2014 07:35 PM, Dennis Sosnoski wrote:
>>>
>>>> I'm writing an article where I'm discussing both blocking waits and
>>>> non-blocking
>>>> callbacks for handling events. As I see it, there are two main reasons
>>>> for
>>>> preferring non-blocking:
>>>>
>>>> 1. Threads are expensive resources (limited to on the order of 10000
>>>> per JVM),
>>>> and tying one up just waiting for an event completion is a waste of
>>>> this resource
>>>> 2. Thread switching adds substantial overhead to the application
>>>>
>>>> Are there any other good reasons I'm missing?
>>>>
>>>
>>> Also memory locality (core X cache effects).
>>>
>>
>> I thought about that, though couldn't come up with any easy way of
>> demonstrating the effect. I suppose something more memory-intensive would
>> do this - perhaps having a fairly sizable array of values for each thread,
>> and having the thread do some computation with those values each time it's
>> run.
>>
>>
>>>
>>>> ...
>>>> So a big drop in performance going from one thread to two, and again
>>>> from 2 to
>>>> 4, but after than just a slowly increasing trend. That's about 19
>>>> microseconds
>>>> per switch with 4096 threads, about half that time for just 2 threads.
>>>> Do these
>>>> results make sense to others?
>>>>
>>>
>>> Your best case of approximately 20 thousand clock cycles is not an
>>> unexpected result on a single-socket multicore with all cores turned
>>> on (i.e., no power management, fusing, or clock-step effects)
>>> and only a few bouncing cachelines.
>>>
>>> We've seen cases of over 1 million cycles to unblock a thread
>>> in some other cases. (Which can be challenging for us to deal
>>> with in JDK8 Stream.parallel(). I'll post something on this sometime.)
>>> Maybe Aleksey can someday arrange to collect believable
>>> systematic measurements across a few platforms.
>>>
>>
>> The reason for the long delay being cache effects, right? I'll try some
>> experiments with associated data per thread to see if I can demonstrate
>> this on a small scale.
>>
>> Thanks for the insights, Doug.
>>
>>   - Dennis
>>
>>
>
>
>
>
>
>
>
>
> _______________________________________________
> Concurrency-interest mailing list
> Concurrency-interest at cs.oswego.edu
> http://cs.oswego.edu/mailman/listinfo/concurrency-interest
>
>