[concurrency-interest] Coroutine-like control transfer?
nathan.reynolds at oracle.com
Thu Aug 29 12:09:02 EDT 2013
> If you do full spin, 50% CPU is wasted, the throughput would be half
of the true coroutine solution
The OS thread scheduler will let a thread run for its entire time slice
unless a higher priority thread needs to get on the logical core. Most
of the time though, threads will block before their time slice runs out.
However, when spin waiting, calling yield will ensure that threads can
run that want to do useful work. This is because when a thread calls
yield, it will be put at the end of the run queue. Other threads that
want processor time will be allowed to run for a full time slice. So,
throughput won't suffer.
But what about the cost of context switches?
The cost is nothing as long as the # of runnable threads is ? the # of
logical cores. This is because when the thread calls yield, it is put
at the end of an empty run queue and hence it doesn't context switch.
In fact, even if a context switch cost a lot of time, it wouldn't matter
because all of the runnable threads are running. One caveat is that a
spinning thread is consuming the physical core's compute resources and
so it will slow down the other thread that is running on the same
physical core. On x86, a pause instruction is a good idea since the
thread doesn't consume compute resources... unless you are running in a
If the # of runnable threads is > the # of logical cores, then threads
will context switch. However, the cost of switching the logical core's
state from one thread to another is too small to be of any concern. The
real cost of context switches comes from having to reload the cache. In
the case of spin-yield loops, very little of the cache has to be reload
since the spinning thread probably isn't accessing very many cache lines.
On 8/29/2013 7:39 AM, Zhong Yu wrote:
> On Wed, Aug 28, 2013 at 4:36 PM, Charles Oliver Nutter
> <headius at headius.com> wrote:
>> I've been struggling for years to find the right (i.e. fastest) way to
>> do a coroutine-like control transfer between two threads, and I
>> figured I'd try asking here again...
>> What I need to emulate is a true coroutine, where a thread transfers
>> execution (ip and stack) to a different piece of code, that code runs
>> for a while, and then execution transfers back. There's always exactly
>> one piece of code executing at a given time, since there's literally
>> only one thread bouncing between stacks.
>> On the JVM, where we don't have real coroutines, the best we can do is
>> to emulate this with threads (ignoring bytecode-weaving stack tricks
>> like Kilim for the moment). I do not believe any of the structures in
>> j.u.concurrent currently have this exact pattern in mind.
>> The patterns I've used to emulate this:
>> 1. Explicit park/unpark.
>> The parent thread starts up the coroutine, which immediately parks
>> itself. Parent thread wakes up coroutine by unparking it and giving it
>> an initial value, at which point parent parks itself. Child runs for a
>> bit, then unparks parent, gives it a value, and parks itself.
>> This is logically closest to what I want, but the park/unpark
>> operations are too expensive for fast transfers. This mechanism ended
>> up being the slowest way when measuring raw transfer rate (i.e. very
>> little work being done between transfers)
>> 2. SynchronousQueue
>> Instead of using explicit parking and unparking, the parent pushes a
>> value on child's sync queue, and then waits on its own sync queue.
>> Child signals parent by pushing a value on parent's sync queue and
>> then waits on its own.
>> This was about 3x faster than explicit park/unpark.
>> 3. Exchanger
>> Substitute Exchanger for SynchronousQueue, where the "take" operation
>> just exchanges null and the "put" operation ignores the result. This
>> was the fastest...around 15x faster than explicit park/unpark and 5x
>> faster than SynchronousQueue
>> 4. Just spin
>> Do nothing but spin on a null volatile field waiting for it to become
>> non-null. Another 3x faster than Exchanger.
>> It seems like I'm building a conceptually simple pattern on top of
>> structures not designed for it (or rather, designed for more complex
>> patterns). In SynchronousQueue's case, I have to do two put/take
>> operations for every transfer. With Exchanger, two exchanges. What I
>> really want is something like LockSupport.park/unpark that's more like
>> LockSupport.parkMeAndUnparkHimPassingValue(value), explicitly passing
>> control and a value to the other thread.
> It seems that the more spin there is, the faster it is; or, the sooner
> you do park(), the slower it is. If you do full spin, 50% CPU is
> wasted, the throughput would be half of the true coroutine solution -
> which is probably not too bad.
>> What am I missing? Do I need to hand-roll this?
>> - Charlie
>> Concurrency-interest mailing list
>> Concurrency-interest at cs.oswego.edu
> Concurrency-interest mailing list
> Concurrency-interest at cs.oswego.edu
-------------- next part --------------
An HTML attachment was scrubbed...
More information about the Concurrency-interest