[concurrency-interest] Blocking vs. non-blocking

David Holmes davidcholmes at aapt.net.au
Sat Jun 14 03:02:52 EDT 2014


This is well-trodden ground. See, for example, the numerous papers, articles,
etc. by Doug Schmidt on I/O frameworks and architectures.

http://www.dre.vanderbilt.edu/~schmidt/resume.html#books

In particular the POSA book(s).

David
  -----Original Message-----
  From: concurrency-interest-bounces at cs.oswego.edu
[mailto:concurrency-interest-bounces at cs.oswego.edu] On Behalf Of Dennis
Sosnoski
  Sent: Saturday, 14 June 2014 4:44 PM
  To: Arcadiy Ivanov; Vitaly Davidovich
  Cc: concurrency-interest at cs.oswego.edu
  Subject: Re: [concurrency-interest] Blocking vs. non-blocking


  Interesting points, Arcadiy, and I agree with at least most of what you
said. There certainly are times, especially in a single-user application,
when blocking operations are fine. IMHO the biggest coding problem with
blocking code is the tendency to get into deadlocks (and the difficulty of
avoiding at least the possibility of deadlocks once you start using
blocking waits throughout your system), but as long as your usage is simple
this isn't likely to become a problem. I have also experienced callback
hell and know what that feels like (try understanding what's going on when
debugging code that uses nested callbacks 10 levels deep). My preferred
solution is actually to use an actor-type approach, whether formally with
Akka or the like, or informally by just using message passing as an
alternative to either blocking or callbacks.
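
  To make the deadlock risk concrete, here's a minimal sketch of the classic
lock-ordering deadlock (the class and lock names are made up for
illustration):

    public class DeadlockDemo {
        static final Object lockA = new Object();
        static final Object lockB = new Object();

        static void pause() {
            try { Thread.sleep(100); } catch (InterruptedException e) { }
        }

        public static void main(String[] args) {
            // Thread 1 takes lockA then lockB; thread 2 takes them in the
            // opposite order. Each grabs its first lock, pauses long enough
            // for the other to do the same, then blocks forever waiting for
            // its second lock.
            new Thread(() -> {
                synchronized (lockA) { pause(); synchronized (lockB) { } }
            }).start();
            new Thread(() -> {
                synchronized (lockB) { pause(); synchronized (lockA) { } }
            }).start();
        }
    }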

  Hmmm. Now that I think about it, I've been telling people that there are
just two fundamental ways of handling the completion of asynchronous
events: with blocking waits or with callbacks. I suppose message passing
could be considered a third way, even though it's kind of a variation on
callbacks. Are there other ways that differ significantly from these two
(or three)?
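
  For concreteness, here's a minimal sketch of all three styles applied to a
single asynchronous result (the class and names are illustrative; note that
the message-passing consumer still blocks on its queue, which is why it
reads like a hybrid of the other two):

    import java.util.concurrent.*;

    public class CompletionStyles {
        public static void main(String[] args) throws Exception {
            ExecutorService pool = Executors.newFixedThreadPool(2);

            // 1. Blocking wait: the calling thread parks until the result
            // arrives.
            Future<String> future = pool.submit(() -> "result");
            System.out.println("blocking: " + future.get());

            // 2. Callback: a completion action runs when the result arrives;
            // no thread sits parked waiting for it.
            CompletableFuture.supplyAsync(() -> "result", pool)
                    .thenAccept(r -> System.out.println("callback: " + r));

            // 3. Message passing: the producer posts to a mailbox and a
            // consumer processes messages one at a time.
            BlockingQueue<String> mailbox = new LinkedBlockingQueue<>();
            pool.submit(() -> mailbox.offer("result"));
            System.out.println("message: " + mailbox.take());

            pool.shutdown();
        }
    }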

    - Dennis

  On 06/14/2014 06:25 PM, Arcadiy Ivanov wrote:

    On 2014-06-14 00:31, Dennis Sosnoski wrote:


      I'm actually using direct wait()/notify() rather than a more
sophisticated way of executing threads in turn, since I'm mostly interested
in showing people why they should use callback-type event handling vs.
blocking waits.
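
    (For reference, a minimal sketch of that kind of direct wait()/notify()
turn-taking; the names and structure are illustrative, not the actual
benchmark code:)

      // Two threads alternating via a shared lock and a turn flag.
      public class PingPong {
          static final Object lock = new Object();
          static boolean pingTurn = true;

          public static void main(String[] args) {
              new Thread(() -> run(true, "ping")).start();
              new Thread(() -> run(false, "pong")).start();
          }

          static void run(boolean isPing, String name) {
              for (int i = 0; i < 3; i++) {
                  synchronized (lock) {
                      try {
                          while (pingTurn != isPing) lock.wait(); // park until our turn
                      } catch (InterruptedException e) { return; }
                      System.out.println(name);
                      pingTurn = !isPing;  // hand the turn to the other thread
                      lock.notify();       // and wake it
                  }
              }
          }
      }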

    Interestingly enough, it actually depends on what you're doing. ;)

    <imho>
    Firstly, while everything you say about thousands of threads being a
waste of resources is true, there are a few points to consider:

      1. Does your implementation satisfy user demand?

      2. Would it be cheaper to just get a bigger box (or more boxes) and
stay with simple blocking code, or would it be less expensive to
(re-?)write the code to be non-blocking and then maintain it?
    While I recognize my argument is somewhat tangential and narrower than
the generic "wait/notify" vs "use callback" question, please consider this:

      1. Generally, only active threads are relevant. If you have 100
threads active at any given time, it doesn't really matter
context-switching-wise if you have 50k threads total (that and more can
easily be accomplished via trivial Linux kernel tuning; see the tuning
sketch after this list). Yes, you waste stack, PIDs, and FDs, but a
24-CPU/128GB box cost only ~$30k a year ago, and pretty much any amount of
development time is more expensive than adding another 128GB to the
machine.

      2. If all you do is burn CPU, there is *no question* that wait/notify
is grossly inefficient vs. a callback - Aleksey can elaborate at length on
what FJP optimizations were done to make sure that threads do not suspend
while waiting for tasks.
      If all you do is I/O, and you burn CPU based on that, the answer
*could be* dramatically different: I/O latencies dominate any
context-switching overhead, and on most OSes most I/O involves an
interrupt, a security context switch in the kernel, and possibly even a
thread suspension and a thread context switch *anyway* on top of that (you
may get suspended while the I/O syscall interrupt is handled by a kernel
thread pool)!

      3. Imagine you are processing a vast volume of SSH connections. At
certain data volumes your load will be dominated by the AES
encryption/decryption of the SSH traffic, which is a function of
plain/ciphertext volume, not the number of threads. You're going to max out
your compute at somewhere around 75 MB/s/core of AES even with AES-NI, i.e.
the number of clients you can reasonably support is, maybe, in the low tens
of thousands? If clients produce voluminous traffic, then in the low
thousands. Even at 100% efficiency you're limited to those numbers. Does it
make sense (time-/cost-/complexity-wise) to try to write a callback-based
client that could handle hundreds of thousands or millions of clients *if
not* for all that pesky encryption compute requirement you're going to be
limited by anyway?
      Also, apparently, in heavy I/O scenarios you may get much better
system throughput waiting for things to happen in I/O (blocking I/O) than
being notified of I/O events (Selector-based I/O):
http://www.mailinator.com/tymaPaulMultithreaded.pdf. The paper is six years
old and kernel/Java realities might have changed, YMMV, but the difference
is (was?) impressive. Also, Apache HTTP Client still swears by blocking I/O
over non-blocking I/O in terms of efficiency:
http://wiki.apache.org/HttpComponents/HttpClient3vsHttpClient4vsHttpCore
(a sketch contrasting the two styles follows this list).
      4. Callbacks potentially have to maintain application-defined
contexts (e.g. the current security principal, the current transaction,
etc.), and the threads executing them have to switch those contexts. How
expensive this is depends on the application.

      5. *Callback hell* is not an urban myth, and neither is architectural
entropy. If you have a core of very competent developers who are going to
work together on the product in perpetuity, callbacks are a reasonable and
very efficient solution. In an enterprise environment, with the number and
quality of the people who work on the code, and with the architectural
preparation and control that time constraints allow and modularity demands,
your callback hierarchy may disintegrate rapidly, causing races, deadlocks,
etc., forcing a complete rewrite in just a few years or a complete project
failure even before release. Blocking code is orders of magnitude easier to
implement, validate, and maintain, especially with people who cannot wrap
their heads around the meaning of volatile after writing Java for a decade.
Losing 20% (a straw-man number) efficiency to thread context switching at
high tens of thousands of threads is a small price to pay for code that
actually continues to work 10 years after it was written. And you can
virtually always add yet another box to increase your total throughput.
      6. Curiously, even a fully non-blocking algorithm that uses as many
software threads as there are hardware ones, with all data thread-resident
and no data sharing occurring, can suffer severely from cache-residency
imbalance and demonstrate poor efficiency:
https://blogs.oracle.com/dave/resource/spaa14-dice-UnfairnessResidency-CameraReady.pdf.
This is to illustrate that there are monsters in virtually every approach,
and the end results may be quite surprising.
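
      Regarding the kernel tuning mentioned in item 1, these are the usual
Linux knobs; the values below are illustrative examples only, and defaults
vary by distribution:

        # allow very high thread counts (example values)
        sysctl -w kernel.threads-max=200000   # system-wide thread limit
        sysctl -w kernel.pid_max=200000       # thread IDs come out of the PID space
        sysctl -w vm.max_map_count=400000     # each thread stack is a memory mapping
        ulimit -u 100000                      # per-user process/thread limit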
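
      And to make the two I/O styles in item 3 concrete, here is a minimal
sketch of a blocking read loop vs. a Selector-based loop (illustrative
class and method names, error handling omitted):

        import java.io.InputStream;
        import java.net.Socket;
        import java.nio.ByteBuffer;
        import java.nio.channels.*;
        import java.util.Iterator;

        public class IoStyles {
            // Blocking style: one thread per connection, parked inside read().
            static void blockingLoop(Socket socket) throws Exception {
                byte[] buf = new byte[8192];
                InputStream in = socket.getInputStream();
                int n;
                while ((n = in.read(buf)) != -1) {  // thread sleeps until data arrives
                    process(buf, n);
                }
            }

            // Non-blocking style: one thread multiplexes many connections and
            // is woken only when some channel has data ready.
            static void selectorLoop(Selector selector) throws Exception {
                ByteBuffer buf = ByteBuffer.allocate(8192);
                while (true) {
                    selector.select();  // block until any registered channel is ready
                    Iterator<SelectionKey> it = selector.selectedKeys().iterator();
                    while (it.hasNext()) {
                        SelectionKey key = it.next();
                        it.remove();
                        if (key.isReadable()) {
                            buf.clear();
                            int n = ((SocketChannel) key.channel()).read(buf);
                            if (n > 0) process(buf.array(), n);
                            else if (n == -1) key.cancel();  // connection closed
                        }
                    }
                }
            }

            static void process(byte[] data, int len) { /* application work */ }
        }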

    Again, I'm not saying anything you said is wrong, but there are a few
considerations other than eliminating context switches and reducing OS
resource constraints when answering the question "should I block?" There
are many tools and many scenarios, and different tools are good for
different scenarios => blanket recommendations are dangerous. :)
    </imho>

    - Arcadiy

        - Dennis



        On 2014-06-13 21:50, Dennis Sosnoski wrote:

          On 06/14/2014 01:31 PM, Vitaly Davidovich wrote:

            I'd think the 1M cycle delays to get a thread running again are
probably due to the OS scheduling it on a CPU that is in a deep C-state;
there can be significant delays as the CPU powers back on.


          That makes sense, but I'd think it would only be an issue for
systems under light load.

            - Dennis


            Sent from my phone

            On Jun 13, 2014 9:07 PM, "Dennis Sosnoski" <dms at sosnoski.com>
wrote:

              On 06/14/2014 11:57 AM, Doug Lea wrote:

                On 06/13/2014 07:35 PM, Dennis Sosnoski wrote:

                  I'm writing an article where I'm discussing both blocking
waits and non-blocking
                  callbacks for handling events. As I see it, there are two
main reasons for
                  preferring non-blocking:

                  1. Threads are expensive resources (limited to on the
order of 10000 per JVM),
                  and tying one up just waiting for an event completion is a
waste of this resource
                  2. Thread switching adds substantial overhead to the
application

                  Are there any other good reasons I'm missing?


                Also memory locality (core X cache effects).


              I thought about that, though I couldn't come up with any easy
way of demonstrating the effect. I suppose something more memory-intensive
would do this - perhaps having a fairly sizable array of values for each
thread, and having the thread do some computation with those values each
time it's run.
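
              Perhaps something along these lines (a hypothetical, untested
sketch; the names and sizes are made up):

                import java.util.concurrent.CyclicBarrier;

                // Each thread owns a sizable array and walks all of it every
                // time it runs, so migrating to another core means refilling
                // that core's caches from scratch.
                public class CacheResidency {
                    static final int THREADS = 4;
                    static final int INTS_PER_THREAD = 1 << 20;  // ~4 MB per thread

                    public static void main(String[] args) {
                        CyclicBarrier barrier = new CyclicBarrier(THREADS);
                        for (int t = 0; t < THREADS; t++) {
                            new Thread(() -> {
                                int[] data = new int[INTS_PER_THREAD];
                                long sum = 0;
                                for (int round = 0; round < 1000; round++) {
                                    for (int i = 0; i < data.length; i++) {
                                        sum += data[i];  // touch the whole array
                                    }
                                    try { barrier.await(); }  // forced scheduling point
                                    catch (Exception e) { return; }
                                }
                                System.out.println(sum);
                            }).start();
                        }
                    }
                }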





                  ...
                  So a big drop in performance going from one thread to two,
and again from 2 to 4, but after that just a slowly increasing trend.
That's about 19 microseconds per switch with 4096 threads, and about half
that time for just 2 threads. Do these results make sense to others?


                Your best case of approximately 20 thousand clock cycles is
not an
                unexpected result on a single-socket multicore with all
cores turned
                on (i.e., no power management, fusing, or clock-step
effects)
                and only a few bouncing cachelines.

                We've seen cases of over 1 million cycles to unblock a
thread
                in some other cases. (Which can be challenging for us to
deal
                with in JDK8 Stream.parallel(). I'll post something on this
sometime.)
                Maybe Aleksey can someday arrange to collect believable
                systematic measurements across a few platforms.


              The reason for the long delay being cache effects, right? I'll
try some experiments with associated data per thread to see if I can
demonstrate this on a small scale.

              Thanks for the insights, Doug.

                - Dennis

_______________________________________________
Concurrency-interest mailing list
Concurrency-interest at cs.oswego.edu
http://cs.oswego.edu/mailman/listinfo/concurrency-interest