[concurrency-interest] Blocking vs. non-blocking

Dennis Sosnoski dms at sosnoski.com
Sat Jun 14 02:43:43 EDT 2014


Interesting points, Arcadiy, and I agree on at least most of what you 
said. There certainly are times, especially in a single user 
application, when blocking operations are fine. IMHO the biggest coding 
problem with blocking code is the tendency to get into deadlocks (and 
the difficulty of avoiding at least the possibility of deadlocks when 
you start using blocking waits throughout your system), but as long 
as your usage is simple this isn't likely to become a problem. And I 
have also experienced callback hell and know what that feels like (try 
understanding what's going on when debugging code that's using nested 
callbacks 10 levels deep). My preferred solution is actually to use an 
actor-type approach, whether formally with Akka or the like or 
informally by just using message passing as an alternative to either 
blocking or callbacks.
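
As a rough illustration of the informal message-passing approach, here is a minimal sketch in Java. The CounterActor class, its Add and GetTotal message types, and the dedicated worker thread are all hypothetical choices made for the example; an actor library such as Akka schedules actors on a shared pool rather than giving each one its own thread.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.LinkedBlockingQueue;

/** Hypothetical minimal actor: one worker thread draining a private mailbox. */
final class CounterActor {
    /** Message asking the actor to add an amount to its running total. */
    static final class Add {
        final long amount;
        Add(long amount) { this.amount = amount; }
    }

    /** Message asking for the total; the answer arrives through the reply future. */
    static final class GetTotal {
        final CompletableFuture<Long> reply = new CompletableFuture<>();
    }

    private static final Object STOP = new Object(); // sentinel that ends the loop

    private final BlockingQueue<Object> mailbox = new LinkedBlockingQueue<>();
    private final Thread worker = new Thread(this::run, "counter-actor");
    private long total; // touched only by the worker thread, so it needs no lock

    private CounterActor() { }

    /** Creates the actor and starts its worker thread. */
    static CounterActor start() {
        CounterActor actor = new CounterActor();
        actor.worker.start();
        return actor;
    }

    /** Senders never wait for processing; they just enqueue and move on. */
    void send(Object message) {
        mailbox.add(message);
    }

    /** Asks the worker to finish after the messages already queued, then waits for it. */
    void stop() {
        mailbox.add(STOP);
        try {
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    private void run() {
        try {
            while (true) {
                Object message = mailbox.take(); // the worker parks here only when idle
                if (message == STOP) {
                    return;
                } else if (message instanceof Add) {
                    total += ((Add) message).amount;
                } else if (message instanceof GetTotal) {
                    ((GetTotal) message).reply.complete(total);
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

Because the mailbox is FIFO, a GetTotal sent after two Add messages from the same thread sees both additions. The state never needs a lock, since only the worker thread touches it.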

Hmmm. Now that I think about it, I've been telling people that there are 
just two fundamental ways of handling the completions of asynchronous 
events, with blocking waits or with callbacks. I suppose message passing 
could be considered a third way, even though it's kind of a variation of 
callbacks. Are there other ways that differ significantly from these 
two (or three)?
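
To make the two basic styles concrete, here is a small sketch using Java 8's CompletableFuture. The CompletionStyles class and its method names are hypothetical, chosen only for illustration.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical helper contrasting two ways of handling the same completion. */
final class CompletionStyles {
    private CompletionStyles() { }

    /** Style 1, blocking wait: the calling thread is parked until the event completes. */
    static String blockingWait(CompletableFuture<String> event) {
        return "blocked for: " + event.join();
    }

    /**
     * Style 2, callback: returns at once. The handler runs on the thread that
     * completes the event, or immediately on the caller if it is already complete.
     */
    static void callback(CompletableFuture<String> event, AtomicReference<String> sink) {
        event.thenAccept(value -> sink.set("called back with: " + value));
    }
}
```

With the callback style nothing is stored in the sink until some thread calls event.complete(...), and the registering thread stays free in the meantime. With the blocking style the caller holds a thread for the whole wait.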

   - Dennis

On 06/14/2014 06:25 PM, Arcadiy Ivanov wrote:
> On 2014-06-14 00:31, Dennis Sosnoski wrote:
>>
>> I'm actually using direct wait()/notify() rather than a more 
>> sophisticated way of executing threads in turn, since I'm mostly 
>> interested in showing people why they should use callback-type event 
>> handling vs. blocking waits.
> Interestingly enough, it actually depends on what you're doing. ;)
>
> <imho>
> Firstly, while everything you say about thousands of threads being a 
> waste of resources is true, there are a few points to consider:
>
>  1. Does your implementation satisfy user demand?
>  2. Would it be cheaper to just get a bigger box/more boxes and stay
>     with simple blocking code or would it be less expensive to
>     (re-?)write the code to be non-blocking and then maintain it?
>
> While I recognize my argument is somewhat tangential and narrower than 
> the generic "wait/notify" vs "use callback" question, please consider 
> this:
>
>  1. Generally, only active threads are relevant. If you have 100
>     threads active at any given time it doesn't really matter
>     context-switching-wise if you have 50k threads (that and more can
>     be easily accomplished via trivial Linux kernel tuning) total. Yes
>     you waste stack, PIDs and FDs but a 24 CPU/128GB box already cost
>     only ~$30k a year ago and pretty much any amount of development
>     time is more expensive than adding another 128GB to the machine.
>  2. If all you do is burn CPU, there is *no question* that the
>     wait/notify is grossly inefficient vs a callback - Aleksey can
>     elaborate at length what FJP optimizations were done to make sure
>     that threads do not suspend waiting for tasks.
>     If all you do is I/O and burn CPU based on that, the answer *could
>     be* dramatically different: I/O latencies dominate any context
>     switching overhead and on most OS'es when you perform most I/O
>     there is an interrupt, a security context switch in kernel and
>     possibly even a thread suspension and a thread context switch
>     *anyway* in addition to that (you may get suspended with I/O
>     syscall interrupt being handled by kernel thread pool)!
>  3. Imagine you are processing a vast volume of SSH connections. At
>     certain data volumes your load will be dominated by time of AES
>     encryption/decryption of the SSH traffic, which will be a function
>     of plain/ciphertext volume, not the number of threads. You're
>     going to max out your compute at somewhere around 75MB/s/core of
>     AES even with AES-NI, i.e. the number of clients you can
>     reasonably support is, maybe, in low tens of thousands? If clients
>     produce voluminous traffic then in low thousands. Even at 100%
>     efficiency you're limited to those numbers. Does it make sense
>     (time-/cost-/complexity-wise) to try to write a callback-based
>     client that could handle hundreds of thousands or millions of
>     clients *if not* for all that pesky encryption compute requirement
>     you're going to be limited by anyway?
>     Also, apparently, in heavy I/O scenarios, you may have a much
>     better system throughput waiting for things to happen in I/O
>     (blocking I/O) vs being notified of I/O events (Selector-based
>     I/O): http://www.mailinator.com/tymaPaulMultithreaded.pdf. Paper
>     is 6 years old and kernel/Java realities might have changed, YMMV,
>     but the difference is(was?) impressive. Also, Apache HTTP Client
>     still swears by blocking I/O vs non-blocking one in terms of
>     efficiency:
>     http://wiki.apache.org/HttpComponents/HttpClient3vsHttpClient4vsHttpCore
>  4. Callbacks potentially have to maintain, and the threads executing
>     them have to switch, application-defined contexts (e.g. current
>     security principal, current transaction etc.). How expensive that
>     is depends on the application.
>  5. *Callback hell* is not an urban myth and neither is architectural
>     entropy. If you have a core of very competent developers that are
>     going to work together on the product in perpetuity, callbacks are
>     a reasonable and very efficient solution. In an enterprise
>     environment with the number and the quality of the people that
>     work on the code and with architectural preparation and control
>     that time constraints allow and modularity demands, your callback
>     hierarchy may disintegrate rapidly causing races, deadlocks etc
>     forcing a complete rewrite in just a few years or a complete
>     project failure even before release. Blocking code is orders of
>     magnitude easier to implement, validate and maintain, especially
>     with people who cannot wrap their heads around the meaning of
>     volatile after writing Java for a decade. Losing 20% (straw-man
>     number) efficiency in thread context switching at high tens of
>     thousands of threads is a small price to pay for the code that
>     actually continues to work 10 years after it has been written. And
>     you virtually always can add yet another box to increase your
>     total throughput.
>  6. Curiously, even a fully non-blocking algorithm that uses as many
>     software threads as there are hardware ones with all data being
>     thread-resident and no data sharing occurring can suffer severely
>     from cache residency imbalance and demonstrate poor efficiency:
>     https://blogs.oracle.com/dave/resource/spaa14-dice-UnfairnessResidency-CameraReady.pdf.
>     This is to illustrate that there are monsters in virtually every
>     approach and the end results may be quite surprising.
>
> Again, not saying anything you said is wrong, but there are a few 
> considerations other than eliminating context switches and reducing OS 
> resource constraints when answering the question "should I block?" 
> There are many tools, there are many scenarios, different tools are 
> good for different scenarios => blanket recommendations are dangerous. :)
> </imho>
>
> - Arcadiy
>>   - Dennis
>>
>>>
>>> On 2014-06-13 21:50, Dennis Sosnoski wrote:
>>>> On 06/14/2014 01:31 PM, Vitaly Davidovich wrote:
>>>>>
>>>>> I'd think the 1M cycle delays to get a thread running again are 
>>>>> probably due to OS scheduling it on a cpu that is in a deep 
>>>>> c-state; there can be significant delays as the cpu powers back on.
>>>>>
>>>>
>>>> That makes sense, but I'd think it would only be an issue for 
>>>> systems under light load.
>>>>
>>>>   - Dennis
>>>>
>>>>> Sent from my phone
>>>>>
>>>>> On Jun 13, 2014 9:07 PM, "Dennis Sosnoski" <dms at sosnoski.com> wrote:
>>>>>
>>>>>     On 06/14/2014 11:57 AM, Doug Lea wrote:
>>>>>
>>>>>         On 06/13/2014 07:35 PM, Dennis Sosnoski wrote:
>>>>>
>>>>>             I'm writing an article where I'm discussing both
>>>>>             blocking waits and non-blocking
>>>>>             callbacks for handling events. As I see it, there are
>>>>>             two main reasons for
>>>>>             preferring non-blocking:
>>>>>
>>>>>             1. Threads are expensive resources (limited to on the
>>>>>             order of 10000 per JVM),
>>>>>             and tying one up just waiting for an event completion
>>>>>             is a waste of this resource
>>>>>             2. Thread switching adds substantial overhead to the
>>>>>             application
>>>>>
>>>>>             Are there any other good reasons I'm missing?
>>>>>
>>>>>
>>>>>         Also memory locality (core X cache effects).
>>>>>
>>>>>
>>>>>     I thought about that, though couldn't come up with any easy
>>>>>     way of demonstrating the effect. I suppose something more
>>>>>     memory-intensive would do this - perhaps having a fairly
>>>>>     sizable array of values for each thread, and having the thread
>>>>>     do some computation with those values each time it's run.
>>>>>
>>>>>
>>>>>
>>>>>             ...
>>>>>             So a big drop in performance going from one thread to
>>>>>             two, and again from 2 to
>>>>>             4, but after that just a slowly increasing trend.
>>>>>             That's about 19 microseconds
>>>>>             per switch with 4096 threads, about half that time for
>>>>>             just 2 threads. Do these
>>>>>             results make sense to others?
>>>>>
>>>>>
>>>>>         Your best case of approximately 20 thousand clock cycles
>>>>>         is not an
>>>>>         unexpected result on a single-socket multicore with all
>>>>>         cores turned
>>>>>         on (i.e., no power management, fusing, or clock-step effects)
>>>>>         and only a few bouncing cachelines.
>>>>>
>>>>>         We've seen cases of over 1 million cycles to unblock a thread
>>>>>         in some other cases. (Which can be challenging for us to deal
>>>>>         with in JDK8 Stream.parallel(). I'll post something on
>>>>>         this sometime.)
>>>>>         Maybe Aleksey can someday arrange to collect believable
>>>>>         systematic measurements across a few platforms.
>>>>>
>>>>>
>>>>>     The reason for the long delay being cache effects, right? I'll
>>>>>     try some experiments with associated data per thread to see if
>>>>>     I can demonstrate this on a small scale.
>>>>>
>>>>>     Thanks for the insights, Doug.
>>>>>
>>>>>       - Dennis
>>>>>
>>>>>     _______________________________________________
>>>>>     Concurrency-interest mailing list
>>>>>     Concurrency-interest at cs.oswego.edu
>>>>>     http://cs.oswego.edu/mailman/listinfo/concurrency-interest
>>>>>
>>>>
>>>>
>>>>
>>>
>>
>
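
For readers who want to try the wait()/notify() turn-taking discussed in the quoted thread, here is a minimal sketch. It is an illustration under stated assumptions, not the benchmark that produced the numbers quoted above: the TurnTaking class is hypothetical, it uses one shared monitor with notifyAll() (which wakes every waiter on each turn, so it will overstate the cost with many threads), and the elapsed time includes thread start-up.

```java
/** Hypothetical turn-taking test: N threads pass a turn around one shared monitor. */
final class TurnTaking {
    private final Object lock = new Object();
    private final int threads;
    private final long totalTurns;
    private long turn; // guarded by lock: number of turns taken so far

    TurnTaking(int threads, long totalTurns) {
        this.threads = threads;
        this.totalTurns = totalTurns;
    }

    /** Runs every turn and returns the elapsed wall-clock time in nanoseconds. */
    long run() {
        Thread[] workers = new Thread[threads];
        long start = System.nanoTime();
        for (int i = 0; i < threads; i++) {
            final int id = i;
            workers[i] = new Thread(() -> takeTurns(id));
            workers[i].start();
        }
        try {
            for (Thread worker : workers) {
                worker.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // give up waiting but keep the flag
        }
        return System.nanoTime() - start;
    }

    /** Number of turns taken so far. */
    long turnsCompleted() {
        synchronized (lock) {
            return turn;
        }
    }

    private void takeTurns(int id) {
        synchronized (lock) {
            try {
                while (turn < totalTurns) {
                    if (turn % threads == id) {
                        turn++;           // this thread's turn: do the (empty) work
                        lock.notifyAll(); // wake the others so the next one can run
                    } else {
                        lock.wait();      // not our turn: block until notified
                    }
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }
}
```

Dividing the value returned by run() by the number of turns gives a rough per-handoff cost. Giving each thread its own monitor and using notify() would avoid waking every waiter and should come closer to a pure thread-switch measurement.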
