[concurrency-interest] Blocking vs. non-blocking
dms at sosnoski.com
Sat Jun 14 00:31:21 EDT 2014
On 06/14/2014 02:32 PM, Arcadiy Ivanov wrote:
> If memory serves me right, Mr Shipilev mentioned in one of his
> presentations in Oracle Spb DC re FJP optimization challenges (in
> Russian, sorry, https://www.youtube.com/watch?v=t0dGLFtRR9c#t=3096)
> that thread scheduling overhead of "sane OSes" (aka Linux) is approx
> 50 us on average, while 'certain not-quite-sane OS named starting with
> "W"' is much more than that.
> A loaded Linux kernel can produce latencies in *tens of seconds* (page
> 13) without RT patches, and tens of us with RT ones. YMMV
> dramatically depending on kernel, kernel version, scheduler,
> architecture and load.
Sounds scary, glad my kernel seems reasonable even without RT (stock
OpenSUSE 12.3). I have a recentish W... OS installation on a laptop
drive (I pulled it and replaced it with an SSD, but keep the original
around for when I need to restore my keyboard backlight - you don't want
to know). Maybe I'll give that a try to see how it compares to the Linux
performance on the same system, just for fun.
> That said, uncontended AbstractQueuedSynchronizer and everything based
> on it (ReentrantLock, Semaphore, CountDownLatch, etc.) is a single
> succeeding CAS (in the best-case scenario it could even be a cached
> volatile read, such as in a 0-count CountDownLatch), i.e. *relatively*
> cheap.
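To make that fast path concrete, here's a hypothetical micro-sketch (the loop count is my own choice, and a plain timing loop like this only roughly approximates what a real harness such as JMH would measure):

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.locks.ReentrantLock;

public class UncontendedSync {
    public static void main(String[] args) throws InterruptedException {
        // Uncontended lock/unlock: a CAS on the AQS state field,
        // with no parking and no OS involvement.
        ReentrantLock lock = new ReentrantLock();
        long start = System.nanoTime();
        for (int i = 0; i < 1_000_000; i++) {
            lock.lock();
            lock.unlock();
        }
        long perPair = (System.nanoTime() - start) / 1_000_000;
        System.out.println("uncontended lock+unlock: ~" + perPair + " ns");

        // A latch already at zero: await() returns after reading the
        // count, without blocking at all.
        CountDownLatch done = new CountDownLatch(0);
        done.await(); // returns immediately
        System.out.println("zero-count latch await returned");
    }
}
```

On typical hardware the uncontended pair lands in the tens of nanoseconds, which is the contrast being drawn against ~50 us scheduling costs.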
I'm actually using direct wait()/notify() rather than a more
sophisticated way of executing threads in turn, since I'm mostly
interested in showing people why they should use callback-type event
handling vs. blocking waits.
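A minimal illustration of the two styles, using CompletableFuture rather than raw wait()/notify() (my own example, not the article's code; the blocking side behaves the same way either way):

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

public class WaitVsCallback {
    public static void main(String[] args) throws Exception {
        // An "event" that completes asynchronously after 100 ms.
        CompletableFuture<String> event = CompletableFuture.supplyAsync(() -> {
            try { TimeUnit.MILLISECONDS.sleep(100); }
            catch (InterruptedException e) { Thread.currentThread().interrupt(); }
            return "result";
        });

        // Non-blocking: register a callback; the calling thread stays
        // free to do other work until the event fires.
        event.thenAccept(r -> System.out.println("callback got: " + r));

        // Blocking: this thread is tied up (parked) until completion.
        String r = event.get();
        System.out.println("blocking wait got: " + r);
    }
}
```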
> When talking about blocking vs non-blocking I would also take a close
> look at Quasar (https://github.com/puniverse/quasar) when discussing a
> scenario where one thread suspends after submitting a single task to
> pool and awaiting result of that task executing in the pool on,
> supposedly, other thread. Quasar implements continuations of sorts and
> resolves a problem of thread park/unpark in that quite narrow case
> while maintaining code Thread semantics (i.e. Fiber vs Thread) by
> executing the scheduled task on the same thread and avoiding
Yes, I'd noted Quasar from an earlier discussion on the list. It looks
like it would make a good topic for a future article in the series. :-)
> On 2014-06-13 21:50, Dennis Sosnoski wrote:
>> On 06/14/2014 01:31 PM, Vitaly Davidovich wrote:
>>> I'd think the 1M cycle delays to get a thread running again are
>>> probably due to OS scheduling it on a cpu that is in a deep c-state;
>>> there can be significant delays as the cpu powers back on.
>> That makes sense, but I'd think it would only be an issue for systems
>> under light load.
>> - Dennis
>>> Sent from my phone
>>> On Jun 13, 2014 9:07 PM, "Dennis Sosnoski" <dms at sosnoski.com> wrote:
>>> On 06/14/2014 11:57 AM, Doug Lea wrote:
>>> On 06/13/2014 07:35 PM, Dennis Sosnoski wrote:
>>> I'm writing an article where I'm discussing both blocking waits and
>>> non-blocking callbacks for handling events. As I see it, there are
>>> two main reasons for preferring non-blocking:
>>> 1. Threads are expensive resources (limited to on the order of 10000
>>> per JVM), and tying one up just waiting for an event completion is a
>>> waste of this resource
>>> 2. Thread switching adds substantial overhead to the application
>>> Are there any other good reasons I'm missing?
>>> Also memory locality (core X cache effects).
>>> I thought about that, though I couldn't come up with any easy way
>>> of demonstrating the effect. I suppose something more
>>> memory-intensive would do this - perhaps having a fairly sizable
>>> array of values for each thread, and having the thread do some
>>> computation with those values each time it's run.
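Something along those lines might look like this (a rough sketch; the array size and round count are arbitrary guesses, and a real measurement would need CPU-affinity control to isolate the migration cost):

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Each worker owns a ~1 MB array and walks it on every turn. If the OS
// migrates the thread to another core between rounds, the working set
// must be re-fetched into that core's cache - the locality cost in question.
public class LocalityDemo {
    static final int WORDS = 128 * 1024; // ~1 MB of longs per thread

    public static void main(String[] args) throws Exception {
        int nThreads = 4;
        ExecutorService pool = Executors.newFixedThreadPool(nThreads);
        CountDownLatch done = new CountDownLatch(nThreads);
        for (int t = 0; t < nThreads; t++) {
            pool.execute(() -> {
                long[] data = new long[WORDS]; // thread-local working set
                long sum = 0;
                for (int round = 0; round < 100; round++) {
                    for (int i = 0; i < WORDS; i++) sum += data[i];
                    Thread.yield(); // invite a reschedule between rounds
                }
                System.out.println("worker sum=" + sum);
                done.countDown();
            });
        }
        done.await();
        pool.shutdown();
    }
}
```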
>>> So a big drop in performance going from one thread to two, and again
>>> from 2 to 4, but after that just a slowly increasing trend. That's
>>> about 19 microseconds per switch with 4096 threads, about half that
>>> time for just 2 threads. Do these results make sense to others?
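For context, a handoff of this kind can be timed with a sketch like the following (hypothetical, not the actual benchmark; each round trip includes two thread switches, so dividing by 2 * ROUNDS gives a rough per-switch figure):

```java
// Two threads alternate turns via wait()/notify(); the boolean "turn"
// guards against lost notifications, and is only touched under the lock.
public class SwitchCost {
    static final Object lock = new Object();
    static boolean turn = false;
    static final int ROUNDS = 20_000;

    public static void main(String[] args) throws InterruptedException {
        Thread other = new Thread(() -> {
            synchronized (lock) {
                try {
                    for (int i = 0; i < ROUNDS; i++) {
                        while (!turn) lock.wait(); // wait for our turn
                        turn = false;
                        lock.notify();             // hand back to main
                    }
                } catch (InterruptedException e) { }
            }
        });
        other.start();
        long start = System.nanoTime();
        synchronized (lock) {
            for (int i = 0; i < ROUNDS; i++) {
                turn = true;
                lock.notify();              // hand off to the other thread
                while (turn) lock.wait();   // wait for it to hand back
            }
        }
        long ns = System.nanoTime() - start;
        System.out.println("~" + ns / (2L * ROUNDS) + " ns per switch");
        other.join();
    }
}
```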
>>> Your best case of approximately 20 thousand clock cycles is not an
>>> unexpected result on a single-socket multicore with all cores turned
>>> on (i.e., no power management, fusing, or clock-step effects) and
>>> only a few bouncing cachelines.
>>> We've seen cases of over 1 million cycles to unblock a thread in
>>> some other cases. (Which can be challenging for us to deal with in
>>> JDK8 Stream.parallel(); I'll post something on this.)
>>> Maybe Aleksey can someday arrange to collect believable systematic
>>> measurements across a few platforms.
>>> The reason for the long delay being cache effects, right? I'll
>>> try some experiments with associated data per thread to see if I
>>> can demonstrate this on a small scale.
>>> Thanks for the insights, Doug.
>>> - Dennis
>>> Concurrency-interest mailing list
>>> Concurrency-interest at cs.oswego.edu