[concurrency-interest] Blocking vs. non-blocking

Arcadiy Ivanov arcadiy at ivanov.biz
Fri Jun 13 22:32:57 EDT 2014


If memory serves me right, Mr. Shipilev mentioned in one of his 
presentations at the Oracle Spb DC on FJP optimization challenges (in 
Russian, sorry, https://www.youtube.com/watch?v=t0dGLFtRR9c#t=3096) that 
the thread scheduling overhead of "sane OSes" (aka Linux) is approx. 50 us 
on average, while that of a 'certain not-quite-sane OS whose name starts 
with "W"' is much higher.
A loaded Linux kernel can produce latencies in the *tens of seconds* 
(http://www.versalogic.com/downloads/whitepapers/real-time_linux_benchmark.pdf, 
page 13) without RT patches, and in the tens of us with them. YMMV 
dramatically depending on kernel, kernel version, scheduler, 
architecture and load.

That said, an uncontended acquire on AbstractQueuedSynchronizer and 
everything based on it (ReentrantLock, Semaphore, CountDownLatch etc.) is 
a single successful CAS (in the best case it can even be a cached 
volatile read, as in await() on a CountDownLatch whose count is already 
0), i.e. *relatively* inexpensive.
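
To make the uncontended case concrete, here is a minimal sketch using only 
java.util.concurrent. The class and method names (UncontendedFastPath, 
lockWithoutContention, awaitOpenLatch) are made up for illustration. It 
shows only that neither call has to queue or park when nobody else is 
competing; it does not measure the cost.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.locks.ReentrantLock;

public class UncontendedFastPath {

    // Acquire and release a lock that no other thread ever touches.
    static boolean lockWithoutContention() {
        ReentrantLock lock = new ReentrantLock();
        lock.lock();
        try {
            // We hold the lock and no thread is queued behind us.
            return lock.isHeldByCurrentThread() && !lock.hasQueuedThreads();
        } finally {
            lock.unlock();
        }
    }

    // await() on a latch whose count is already 0 returns right away.
    static long awaitOpenLatch() {
        CountDownLatch latch = new CountDownLatch(0);
        try {
            latch.await();
        } catch (InterruptedException e) {
            // Preserve the interrupt status for the caller.
            Thread.currentThread().interrupt();
        }
        return latch.getCount();
    }

    public static void main(String[] args) {
        System.out.println("uncontended lock ok: " + lockWithoutContention());
        System.out.println("latch count after await: " + awaitOpenLatch());
    }
}
```

Once a second thread competes for the same lock, the slow path 
(queueing plus park/unpark) comes into play and the scheduling 
latencies above start to matter.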

When talking about blocking vs. non-blocking I would also take a close 
look at Quasar (https://github.com/puniverse/quasar), in particular for 
the scenario where one thread submits a single task to a pool and then 
suspends, awaiting the result of that task executing on, supposedly, 
another thread. Quasar implements continuations of sorts and solves the 
thread park/unpark problem in that quite narrow case while maintaining 
Thread-like code semantics (i.e. Fiber vs. Thread) by executing the 
scheduled task on the same thread and avoiding park+wait/unpark.
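
For reference, the submit-then-wait scenario looks roughly like this in 
plain JDK 8 terms. This is a sketch only: it does not use Quasar, whose 
API is not shown here, and the names BlockingVsCallback, compute, 
blockingWait and nonBlocking are hypothetical. The first variant may park 
the caller in Future.get(); the second attaches a continuation with 
CompletableFuture, so the follow-up step is not run by a thread blocked 
waiting for the result.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class BlockingVsCallback {

    // Hypothetical stand-in for the single task handed to the pool.
    static int compute() {
        return 6 * 7;
    }

    // Blocking: the caller may park inside get() until the task is done.
    static int blockingWait(ExecutorService pool) {
        Callable<Integer> task = BlockingVsCallback::compute;
        Future<Integer> result = pool.submit(task);
        try {
            return result.get();
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        }
    }

    // Non-blocking: the follow-up step is registered as a callback.
    static CompletableFuture<Integer> nonBlocking(ExecutorService pool) {
        return CompletableFuture
                .supplyAsync(BlockingVsCallback::compute, pool)
                .thenApply(v -> v + 1);
    }

    static int[] runBoth() {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            int blocked = blockingWait(pool);
            // join() is used here only to read the result for the demo.
            int chained = nonBlocking(pool).join();
            return new int[] { blocked, chained };
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        int[] r = runBoth();
        System.out.println("blocking=" + r[0] + " callback=" + r[1]);
    }
}
```

The callback style changes the shape of the code, which is the part 
Quasar tries to avoid by keeping the blocking look while not parking 
an OS thread.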

On 2014-06-13 21:50, Dennis Sosnoski wrote:
> On 06/14/2014 01:31 PM, Vitaly Davidovich wrote:
>>
>> I'd think the 1M cycle delays to get a thread running again are 
>> probably due to OS scheduling it on a cpu that is in a deep c-state; 
>> there can be significant delays as the cpu powers back on.
>>
>
> That makes sense, but I'd think it would only be an issue for systems 
> under light load.
>
>   - Dennis
>
>> On Jun 13, 2014 9:07 PM, "Dennis Sosnoski" <dms at sosnoski.com> wrote:
>>
>>     On 06/14/2014 11:57 AM, Doug Lea wrote:
>>
>>         On 06/13/2014 07:35 PM, Dennis Sosnoski wrote:
>>
>>             I'm writing an article where I'm discussing both blocking
>>             waits and non-blocking
>>             callbacks for handling events. As I see it, there are two
>>             main reasons for
>>             preferring non-blocking:
>>
>>             1. Threads are expensive resources (limited to on the
>>             order of 10000 per JVM),
>>             and tying one up just waiting for an event completion is
>>             a waste of this resource
>>             2. Thread switching adds substantial overhead to the
>>             application
>>
>>             Are there any other good reasons I'm missing?
>>
>>
>>         Also memory locality (core X cache effects).
>>
>>
>>     I thought about that, though couldn't come up with any easy way
>>     of demonstrating the effect. I suppose something more
>>     memory-intensive would do this - perhaps having a fairly sizable
>>     array of values for each thread, and having the thread do some
>>     computation with those values each time it's run.
>>
>>
>>
>>             ...
>>             So a big drop in performance going from one thread to
>>             two, and again from 2 to
>>             4, but after that just a slowly increasing trend. That's
>>             about 19 microseconds
>>             per switch with 4096 threads, about half that time for
>>             just 2 threads. Do these
>>             results make sense to others?
>>
>>
>>         Your best case of approximately 20 thousand clock cycles is
>>         not an
>>         unexpected result on a single-socket multicore with all cores
>>         turned
>>         on (i.e., no power management, fusing, or clock-step effects)
>>         and only a few bouncing cachelines.
>>
>>         We've seen cases of over 1 million cycles to unblock a thread
>>         in some other cases. (Which can be challenging for us to deal
>>         with in JDK8 Stream.parallel(). I'll post something on this
>>         sometime.)
>>         Maybe Aleksey can someday arrange to collect believable
>>         systematic measurements across a few platforms.
>>
>>
>>     The reason for the long delay being cache effects, right? I'll
>>     try some experiments with associated data per thread to see if I
>>     can demonstrate this on a small scale.
>>
>>     Thanks for the insights, Doug.
>>
>>       - Dennis
>>
>>     _______________________________________________
>>     Concurrency-interest mailing list
>>     Concurrency-interest at cs.oswego.edu
>>     http://cs.oswego.edu/mailman/listinfo/concurrency-interest
>>
>
