[concurrency-interest] Blocking vs. non-blocking

Arcadiy Ivanov arcadiy at ivanov.biz
Sat Jun 14 02:25:39 EDT 2014


On 2014-06-14 00:31, Dennis Sosnoski wrote:
>
> I'm actually using direct wait()/notify() rather than a more 
> sophisticated way of executing threads in turn, since I'm mostly 
> interested in showing people why they should use callback-type event 
> handling vs. blocking waits.
Interestingly enough, it actually depends on what you're doing. ;)
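
For concreteness, here is a minimal Java sketch of the two styles under
discussion: a blocking wait()/notify() rendezvous vs. a registered
callback (CompletableFuture is used purely as a callback vehicle here,
and the class names are mine, not anything from Dennis's article):

import java.util.concurrent.CompletableFuture;

public class BlockingVsCallback {

    // Blocking style: the caller's thread parks until the event arrives.
    static class BlockingEvent {
        private Object result;

        synchronized Object await() throws InterruptedException {
            while (result == null) {
                wait();          // thread suspended; stack/PID/FD stay allocated
            }
            return result;
        }

        synchronized void complete(Object value) {
            result = value;
            notifyAll();         // wake the waiter(s)
        }
    }

    public static void main(String[] args) throws Exception {
        // 1. wait()/notify(): one thread is held per outstanding event.
        BlockingEvent event = new BlockingEvent();
        new Thread(() -> event.complete("done (blocking)")).start();
        System.out.println(event.await());

        // 2. Callback style: no thread is held while the event is pending.
        CompletableFuture<String> future = new CompletableFuture<>();
        future.thenAccept(System.out::println);  // registered, returns at once
        future.complete("done (callback)");      // runs the callback
    }
}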

<imho>
Firstly, while everything you say about thousands of threads being a 
waste of resources is true, there are a few points to consider:

 1. Does your implementation satisfy user demand?
 2. Would it be cheaper to just get a bigger box (or more boxes) and
    stay with simple blocking code, or would it be less expensive to
    (re-?)write the code to be non-blocking and then maintain it?

While I recognize my argument is somewhat tangential and narrower than 
the generic "wait/notify" vs "use callback" question, please consider this:

 1. Generally, only active threads are relevant. If you have 100
    threads active at any given time, it doesn't really matter
    context-switching-wise if you have 50k threads total (that and
    more can easily be accommodated via trivial Linux kernel tuning).
    Yes, you waste stack space, PIDs, and FDs, but a 24-CPU/128GB box
    cost only ~$30k a year ago, and pretty much any amount of
    development time is more expensive than adding another 128GB to
    the machine. (A crude sketch for counting how many idle threads a
    box tolerates follows this list.)
 2. If all you do is burn CPU, there is *no question* that wait/notify
    is grossly inefficient compared to a callback (the sketch above
    contrasts the two styles); Aleksey can elaborate at length on the
    FJP optimizations that were made to ensure threads do not suspend
    while waiting for tasks.
    If all you do is I/O, burning CPU based on that, the answer *could
    be* dramatically different: I/O latencies dominate any
    context-switching overhead, and on most OSes most I/O involves an
    interrupt, a security context switch into the kernel, and possibly
    even a thread suspension and a thread context switch *anyway* on
    top of that (you may get suspended while the I/O syscall interrupt
    is handled by a kernel thread pool)!
 3. Imagine you are processing a vast volume of SSH connections. At
    certain data volumes your load will be dominated by AES
    encryption/decryption of the SSH traffic, which is a function of
    plain-/ciphertext volume, not the number of threads. You're going
    to max out your compute at somewhere around 75 MB/s/core of AES
    even with AES-NI; on a 24-core box that's roughly 1.8 GB/s
    aggregate, so at, say, ~100 KB/s per client the number of clients
    you can reasonably support is maybe in the low tens of thousands,
    and if clients produce voluminous traffic, the low thousands. Even
    at 100% efficiency you're limited to those numbers. Does it make
    sense (time-/cost-/complexity-wise) to try to write a
    callback-based client that could handle hundreds of thousands or
    millions of clients *if not* for all that pesky encryption compute
    you're going to be limited by anyway?
    Also, apparently, in heavy I/O scenarios you may get much better
    system throughput waiting for things to happen in I/O (blocking
    I/O) than being notified of I/O events (Selector-based I/O):
    http://www.mailinator.com/tymaPaulMultithreaded.pdf. The paper is
    6 years old and kernel/Java realities may have changed, YMMV, but
    the difference is (was?) impressive. (A minimal
    thread-per-connection sketch follows this list.) Also, Apache
    HttpClient still swears by blocking I/O over non-blocking I/O in
    terms of efficiency:
    http://wiki.apache.org/HttpComponents/HttpClient3vsHttpClient4vsHttpCore
 4. Callbacks potentially have to maintain, and the threads executing
    them have to switch, application-defined contexts (e.g., the
    current security principal, the current transaction, etc.). How
    expensive that is depends on the application. (See the
    context-propagation sketch after this list.)
 5. *Callback hell* is not an urban myth, and neither is architectural
    entropy. If you have a core of very competent developers who are
    going to work together on the product in perpetuity, callbacks are
    a reasonable and very efficient solution. In an enterprise
    environment, with the number and quality of people that actually
    work on the code, and with only as much architectural preparation
    and control as time constraints allow and modularity demands, your
    callback hierarchy may disintegrate rapidly, causing races,
    deadlocks, etc., forcing a complete rewrite within just a few
    years, or complete project failure even before release. Blocking
    code is orders of magnitude easier to implement, validate, and
    maintain, especially with people who cannot wrap their heads
    around the meaning of volatile after writing Java for a decade.
    Losing 20% (a straw-man number) efficiency to thread context
    switching at high tens of thousands of threads is a small price to
    pay for code that actually continues to work 10 years after it was
    written. And you can virtually always add yet another box to
    increase your total throughput.
 6. Curiously, even a fully non-blocking algorithm that uses as many
    software threads as there are hardware ones, with all data
    thread-resident and no data sharing, can suffer severely from
    cache-residency imbalance and demonstrate poor efficiency:
    https://blogs.oracle.com/dave/resource/spaa14-dice-UnfairnessResidency-CameraReady.pdf.
    This is to illustrate that there are monsters in virtually every
    approach, and the end results may be quite surprising.
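
To make point 1 concrete, here is a crude way to count how many idle
threads a box will tolerate (the class name and the 50k target are
arbitrary choices of mine; on an untuned kernel you will likely hit a
limit well before that, which is exactly the tuning point above):

import java.util.concurrent.CountDownLatch;

public class ThreadCeiling {
    public static void main(String[] args) {
        CountDownLatch done = new CountDownLatch(1);
        int count = 0;
        try {
            while (count < 50_000) {
                Thread t = new Thread(() -> {
                    try { done.await(); } catch (InterruptedException e) {}
                });
                t.setDaemon(true);
                t.start();   // each idle thread costs stack + a PID, almost no CPU
                count++;
            }
        } catch (OutOfMemoryError e) {
            // "unable to create new native thread": raise ulimits/pid_max
            // or shrink -Xss before retrying
        }
        System.out.println("parked threads: " + count);
        done.countDown();    // release them all
    }
}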
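
And here is the thread-per-connection blocking-I/O style that the Tyma
paper benchmarks, as a minimal echo server (the port and buffer size
are arbitrary choices of mine):

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.ServerSocket;
import java.net.Socket;

public class ThreadPerConnectionEcho {
    public static void main(String[] args) throws IOException {
        try (ServerSocket server = new ServerSocket(9000)) {
            while (true) {
                Socket socket = server.accept();           // blocks for a client
                new Thread(() -> handle(socket)).start();  // one thread per connection
            }
        }
    }

    static void handle(Socket socket) {
        try (Socket s = socket;
             InputStream in = s.getInputStream();
             OutputStream out = s.getOutputStream()) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {  // blocking read: thread parks in the kernel
                out.write(buf, 0, n);           // straight-line code, easy to reason about
            }
        } catch (IOException e) {
            // connection dropped; this thread simply exits
        }
    }
}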
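
Finally, a sketch of what point 4 means by switching application-defined
contexts: capture a ThreadLocal "principal" when the callback is
registered and re-establish it on whatever pool thread eventually runs
it. The names here (PRINCIPAL, withContext) are made up for
illustration:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ContextPropagation {
    // Stand-in for an application context (security principal, transaction, ...).
    static final ThreadLocal<String> PRINCIPAL = new ThreadLocal<>();

    // Capture the submitter's context and re-establish it around the callback.
    static Runnable withContext(Runnable callback) {
        String captured = PRINCIPAL.get();   // read on the submitting thread
        return () -> {
            String previous = PRINCIPAL.get();
            PRINCIPAL.set(captured);         // switch context on the executing thread
            try {
                callback.run();
            } finally {
                PRINCIPAL.set(previous);     // restore the thread's prior context
            }
        };
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        PRINCIPAL.set("alice");
        pool.execute(withContext(() ->
                System.out.println("principal = " + PRINCIPAL.get()))); // prints "alice"
        pool.shutdown();
    }
}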

Again, I'm not saying anything you said is wrong, but there are a few
considerations other than eliminating context switches and reducing OS
resource constraints when answering the question "should I block?"
There are many tools and many scenarios, and different tools are good
for different scenarios => blanket recommendations are dangerous. :)
</imho>

- Arcadiy
> - Dennis
>
>>
>> On 2014-06-13 21:50, Dennis Sosnoski wrote:
>>> On 06/14/2014 01:31 PM, Vitaly Davidovich wrote:
>>>>
>>>> I'd think the 1M cycle delays to get a thread running again are 
>>>> probably due to OS scheduling it on a cpu that is in a deep 
>>>> c-state; there can be significant delays as the cpu powers back on.
>>>>
>>>
>>> That makes sense, but I'd think it would only be an issue for 
>>> systems under light load.
>>>
>>>   - Dennis
>>>
>>>> Sent from my phone
>>>>
>>>> On Jun 13, 2014 9:07 PM, "Dennis Sosnoski" <dms at sosnoski.com> wrote:
>>>>
>>>>     On 06/14/2014 11:57 AM, Doug Lea wrote:
>>>>
>>>>         On 06/13/2014 07:35 PM, Dennis Sosnoski wrote:
>>>>
>>>>             I'm writing an article where I'm discussing both
>>>>             blocking waits and non-blocking
>>>>             callbacks for handling events. As I see it, there are
>>>>             two main reasons for
>>>>             preferring non-blocking:
>>>>
>>>>             1. Threads are expensive resources (limited to on the
>>>>             order of 10000 per JVM),
>>>>             and tying one up just waiting for an event completion
>>>>             is a waste of this resource
>>>>             2. Thread switching adds substantial overhead to the
>>>>             application
>>>>
>>>>             Are there any other good reasons I'm missing?
>>>>
>>>>
>>>>         Also memory locality (core X cache effects).
>>>>
>>>>
>>>>     I thought about that, though couldn't come up with any easy way
>>>>     of demonstrating the effect. I suppose something more
>>>>     memory-intensive would do this - perhaps having a fairly
>>>>     sizable array of values for each thread, and having the thread
>>>>     do some computation with those values each time it's run.
>>>>
>>>>
>>>>
>>>>             ...
>>>>             So a big drop in performance going from one thread to
>>>>             two, and again from 2 to
>>>>             4, but after that just a slowly increasing trend.
>>>>             That's about 19 microseconds
>>>>             per switch with 4096 threads, about half that time for
>>>>             just 2 threads. Do these
>>>>             results make sense to others?
>>>>
>>>>
>>>>         Your best case of approximately 20 thousand clock cycles is
>>>>         not an
>>>>         unexpected result on a single-socket multicore with all
>>>>         cores turned
>>>>         on (i.e., no power management, fusing, or clock-step effects)
>>>>         and only a few bouncing cachelines.
>>>>
>>>>         We've seen cases of over 1 million cycles to unblock a thread
>>>>         in some other cases. (Which can be challenging for us to deal
>>>>         with in JDK8 Stream.parallel(). I'll post something on this
>>>>         sometime.)
>>>>         Maybe Aleksey can someday arrange to collect believable
>>>>         systematic measurements across a few platforms.
>>>>
>>>>
>>>>     The reason for the long delay being cache effects, right? I'll
>>>>     try some experiments with associated data per thread to see if
>>>>     I can demonstrate this on a small scale.
>>>>
>>>>     Thanks for the insights, Doug.
>>>>
>>>>       - Dennis
>>>>
>>>>
>>>
>>>
>>>
>>
>
