[concurrency-interest] Blocking vs. non-blocking

Ron Pressler ron.pressler at gmail.com
Sun Jun 15 17:48:25 EDT 2014

Well, in the case of IO-heavy code (say, a web server), what often matters
most is the number of requests you can handle per second. This is
determined, according to Little's Law, by your latency and capacity (the
number of requests you can service concurrently). Now, in web servers you
often have little control over the latency -- it depends on your database
or on other microservices -- but the choice of blocking vs. nonblocking can
have a severe impact on capacity.
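
A minimal sketch of the Little's Law arithmetic, for illustration only. The class name, method name, and all numbers are hypothetical and are not measurements from any real server. It rearranges L = lambda * W into lambda = L / W: maximum throughput equals concurrent capacity divided by latency.

```java
public final class LittlesLaw {
    /**
     * Maximum sustainable requests per second for a server that can hold
     * `capacity` requests in flight when each request takes `latencyMillis`.
     * Little's Law: L = lambda * W, so lambda = L / W.
     */
    public static double maxThroughput(int capacity, double latencyMillis) {
        return capacity * 1000.0 / latencyMillis;
    }

    public static void main(String[] args) {
        int capacity = 2000; // hypothetical number of concurrent requests

        // Same capacity, two different (assumed) backend latencies.
        System.out.println("100 ms latency: " + maxThroughput(capacity, 100.0) + " req/s");
        System.out.println("500 ms latency: " + maxThroughput(capacity, 500.0) + " req/s");
    }
}
```

With latency outside your control, the only term left to improve is capacity, which is where the blocking vs. nonblocking choice comes in.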

Recently we did a little experiment with common Java web servers, and found
<http://blog.paralleluniverse.co/2014/05/29/cascading-failures/> that
blocking code is severely susceptible to cascading failure as a result of a
temporary rise in latency (due to lack of headroom capacity). Switching to
nonblocking (well, actually the code stayed the same, but we told the
servers to use fibers instead of threads when serving HTTP requests), we
got an immediate 4x increase in server capacity.
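
To make the headroom point concrete, here is a rough back-of-the-envelope sketch. All names and numbers are assumptions chosen for illustration: an arrival rate of 1000 req/s, a latency spike from 100 ms to 500 ms, and two hypothetical capacities that differ by 4x. It is not the benchmark from the blog post. It computes how many requests must be in flight to keep up, and how fast a backlog grows once capacity is exceeded.

```java
public final class Headroom {
    /** Requests that must be in flight to keep up: L = lambda * W. */
    public static double inFlight(double arrivalsPerSecond, double latencyMillis) {
        return arrivalsPerSecond * latencyMillis / 1000.0;
    }

    /** Requests per second that pile up once arrivals exceed capacity / latency. */
    public static double backlogGrowthPerSecond(double arrivalsPerSecond,
                                                int capacity,
                                                double latencyMillis) {
        double maxThroughput = capacity * 1000.0 / latencyMillis;
        return Math.max(0.0, arrivalsPerSecond - maxThroughput);
    }

    public static void main(String[] args) {
        double rate = 1000.0;      // hypothetical arrival rate, req/s
        double spikeMillis = 500.0; // hypothetical latency during the spike
        int smallCapacity = 200;    // hypothetical thread-limited capacity
        int largeCapacity = 800;    // hypothetical capacity, 4x larger

        System.out.println("In flight needed during spike: " + inFlight(rate, spikeMillis));
        System.out.println("Backlog growth at capacity 200: "
                + backlogGrowthPerSecond(rate, smallCapacity, spikeMillis) + " req/s");
        System.out.println("Backlog growth at capacity 800: "
                + backlogGrowthPerSecond(rate, largeCapacity, spikeMillis) + " req/s");
    }
}
```

Under these assumed numbers the smaller capacity falls behind by 600 req/s for as long as the spike lasts, while the larger one absorbs it with room to spare.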

On Sat, Jun 14, 2014 at 9:25 AM, Arcadiy Ivanov <arcadiy at ivanov.biz> wrote:

>  On 2014-06-14 00:31, Dennis Sosnoski wrote:
> I'm actually using direct wait()/notify() rather than a more sophisticated
> way of executing threads in turn, since I'm mostly interested in showing
> people why they should use callback-type event handling vs. blocking waits.
> Interestingly enough, it actually depends on what you're doing. ;)
> <imho>
> Firstly, while everything you say about thousands of threads being a waste
> of resources is true, there are a few points to consider:
>    1. Does your implementation satisfy user demand?
>    2. Would it be cheaper to just get a bigger box/more boxes and stay
>    with simple blocking code or would it be less expensive to (re-?)write the
>    code to be non-blocking and then maintain it?
> While I recognize my argument is somewhat tangential and narrower than the
> generic "wait/notify" vs "use callback" question, please consider this:
>    1. Generally, only active threads are relevant. If you have 100
>    threads active at any given time it doesn't really matter
>    context-switching-wise if you have 50k threads (that and more can be easily
>    accomplished via trivial Linux kernel tuning) total. Yes you waste stack,
>    PIDs and FDs but 24 CPU/128GB box already cost only ~$30k a year ago and
>    pretty much any amount of development time is more expensive than adding
>    another 128GB to the machine.
>    2. If all you do is burn CPU, there is *no question* that the
>    wait/notify is grossly inefficient vs a callback - Aleksey can elaborate at
>    length what FJP optimizations were done to make sure that threads do not
>    suspend waiting for tasks.
>    If all you do is I/O and burn CPU based on that, the answer *could be*
>    dramatically different: I/O latencies dominate any context switching
>    overhead and on most OS'es when you perform most I/O there is an interrupt,
>    a security context switch in kernel and possibly even a thread suspension
>    and a thread context switch *anyway* in addition to that (you may get
>    suspended with I/O syscall interrupt being handled by kernel thread pool)!
>    3. Imagine you are processing a vast volume of SSH connections. At
>    certain data volumes your load will be dominated by time of AES
>    encryption/decryption of the SSH traffic, which will be a function of
>    plain/ciphertext volume, not the number of threads. You're going to max out
>    your compute at somewhere around 75MB/s/core of AES even with AES-NI, i.e.
>    the number of clients you can reasonably support is, maybe, in low tens of
>    thousands? If clients produce voluminous traffic then in low thousands.
>    Even at 100% efficiency you're limited to those numbers. Does it make sense
>    (time-/cost-/complexity-wise) to try to write a callback-based client that
>    could handle hundreds of thousands or millions of clients *if not* for all
>    that pesky encryption compute requirement you're going to be limited by
>    anyway?
>    Also, apparently, in heavy I/O scenarios, you may have a much better
>    system throughput waiting for things to happen in I/O (blocking I/O) vs
>    being notified of I/O events (Selector-based I/O):
>    http://www.mailinator.com/tymaPaulMultithreaded.pdf. Paper is 6 years
>    old and kernel/Java realities might have changed, YMMV, but the difference
>    is (was?) impressive. Also, Apache HTTP Client still swears by blocking I/O
>    vs non-blocking one in terms of efficiency:
>    http://wiki.apache.org/HttpComponents/HttpClient3vsHttpClient4vsHttpCore
>    4. Callbacks potentially have to maintain application-defined contexts
>    (e.g. current security principal, current transaction, etc.), and the
>    threads executing them have to switch between those contexts. How
>    expensive that is depends on the application.
>    5. *Callback hell* is not an urban myth and neither is architectural
>    entropy. If you have a core of very competent developers that are going to
>    work together on the product in perpetuity, callbacks are a reasonable and
>    a very efficient solution. In an enterprise environment with the number and
>    the quality of the people that work on the code and with architectural
>    preparation and control that time constraints allow and modularity demands,
>    your callback hierarchy may disintegrate rapidly causing races, deadlocks
>    etc., forcing a complete rewrite in just a few years or a complete project
>    failure even before release. Blocking code is orders of magnitude easier to
>    implement, validate and maintain, especially with people who cannot wrap
>    their heads around the meaning of volatile after writing Java for a decade.
>    Losing 20% (straw-man number) efficiency in thread context switching at
>    high tens of thousands of threads is a small price to pay for the code that
>    actually continues to work 10 years after it has been written. And you
>    virtually always can add yet another box to increase your total throughput.
>    6. Curiously, even a fully non-blocking algorithm that uses as many
>    software threads as there are hardware ones with all data being
>    thread-resident and no data sharing occurring can suffer severely from
>    cache residency imbalance and demonstrate poor efficiency:
>    https://blogs.oracle.com/dave/resource/spaa14-dice-UnfairnessResidency-CameraReady.pdf.
>    This is to illustrate that there are monsters in virtually every approach
>    and the end results may be quite surprising.
> Again, not saying anything you said is wrong, but there are a few
> considerations other than eliminating context switches and reducing OS
> resource constraints when answering the question "should I block?" There
> are many tools, there are many scenarios, different tools are good for
> different scenarios => blanket recommendations are dangerous. :)
> </imho>
> - Arcadiy
>    - Dennis
> On 2014-06-13 21:50, Dennis Sosnoski wrote:
> On 06/14/2014 01:31 PM, Vitaly Davidovich wrote:
> I'd think the 1M cycle delays to get a thread running again are probably
> due to OS scheduling it on a cpu that is in a deep c-state; there can be
> significant delays as the cpu powers back on.
> That makes sense, but I'd think it would only be an issue for systems
> under light load.
>   - Dennis
>  Sent from my phone
> On Jun 13, 2014 9:07 PM, "Dennis Sosnoski" <dms at sosnoski.com> wrote:
>> On 06/14/2014 11:57 AM, Doug Lea wrote:
>>> On 06/13/2014 07:35 PM, Dennis Sosnoski wrote:
>>>> I'm writing an article where I'm discussing both blocking waits and
>>>> non-blocking
>>>> callbacks for handling events. As I see it, there are two main reasons
>>>> for
>>>> preferring non-blocking:
>>>> 1. Threads are expensive resources (limited to on the order of 10000
>>>> per JVM),
>>>> and tying one up just waiting for an event completion is a waste of
>>>> this resource
>>>> 2. Thread switching adds substantial overhead to the application
>>>> Are there any other good reasons I'm missing?
>>> Also memory locality (core X cache effects).
>> I thought about that, though couldn't come up with any easy way of
>> demonstrating the effect. I suppose something more memory-intensive would
>> do this - perhaps having a fairly sizable array of values for each thread,
>> and having the thread do some computation with those values each time it's
>> run.
>>>> ...
>>>> So a big drop in performance going from one thread to two, and again
>>>> from 2 to
>>>> 4, but after that just a slowly increasing trend. That's about 19
>>>> microseconds
>>>> per switch with 4096 threads, about half that time for just 2 threads.
>>>> Do these
>>>> results make sense to others?
>>> Your best case of approximately 20 thousand clock cycles is not an
>>> unexpected result on a single-socket multicore with all cores turned
>>> on (i.e., no power management, fusing, or clock-step effects)
>>> and only a few bouncing cachelines.
>>> We've seen cases of over 1 million cycles to unblock a thread
>>> in some other cases. (Which can be challenging for us to deal
>>> with in JDK8 Stream.parallel(). I'll post something on this sometime.)
>>> Maybe Aleksey can someday arrange to collect believable
>>> systematic measurements across a few platforms.
>> The reason for the long delay being cache effects, right? I'll try some
>> experiments with associated data per thread to see if I can demonstrate
>> this on a small scale.
>> Thanks for the insights, Doug.
>>   - Dennis
>> _______________________________________________
>> Concurrency-interest mailing list
>> Concurrency-interest at cs.oswego.edu
>> http://cs.oswego.edu/mailman/listinfo/concurrency-interest