[concurrency-interest] Low-latency pause in JDK

Nathan and Ila Reynolds nathanila at gmail.com
Sat Oct 26 09:52:46 EDT 2019

x86 MWAIT is not available to user-mode code, but we could try to 
simulate it with XACQUIRE and PAUSE.  The thread uses XACQUIRE on the 
cache line and then executes PAUSE.  PAUSE takes on the order of 100 
cycles on recent Intel processors.  Hopefully, a write to the cache line 
marked with XACQUIRE will wake the thread from PAUSE early.
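
To illustrate the PAUSE half of this from Java, here is a minimal 
sketch, assuming Java 9 or later, where Thread.onSpinWait() is a spin 
hint that HotSpot turns into PAUSE on x86.  XACQUIRE cannot be expressed 
from Java, so this shows only the polling loop.  The names 
SpinWaitSketch and spinUntilSet are made up for the example.

```java
import java.util.concurrent.atomic.AtomicBoolean;

final class SpinWaitSketch {
    /**
     * Polls the flag up to maxSpins times, issuing a spin hint between polls.
     * Returns true if the flag was observed set, false if the budget ran out.
     */
    static boolean spinUntilSet(AtomicBoolean flag, int maxSpins) {
        for (int i = 0; i < maxSpins; i++) {
            if (flag.get()) {        // volatile read of the shared flag
                return true;
            }
            Thread.onSpinWait();     // spin hint (Java 9+)
        }
        return flag.get();           // one last look before giving up
    }
}
```

The loop keeps executing instructions, so unlike a true hardware wait it 
can still be interrupted at a safepoint poll.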

The downside of pausing the thread's execution of instructions is that 
the thread cannot respond to stop-the-world events.  This will increase 
the time it takes to stop the world.  If we are talking about 1000s of 
cycles, this might not make much of a difference.  On the other hand, 
with GC pause times lower than 1 ms, 1000s of cycles might be a 
significant portion of that time.

On x86, it takes about 3,000 cycles to enter and return from a 
Thread.yield() call on Windows (on a mid-range laptop processor from 8 
years ago).  Any low-latency pause loop has to take into account that if 
it waits 3,000 cycles, then it would have been better to enter the 
kernel in the first place.  Blocking in the kernel will reduce power 
consumption as well as allow other threads to do useful work.  Thus, 
each call site needs to keep statistics on how long the thread waits.  
If the call site is waiting too long too often, then the thread should 
immediately block in the kernel instead of spinning.  This is not easy 
to get right.
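
A rough sketch of what per-call-site statistics could look like, not a 
tuned implementation.  Everything here is an assumption for 
illustration: the class name AdaptiveCallSite, the window of 16 samples, 
the "half the spins timed out" threshold, and the 50 us timed park, 
which stands in for a proper park/unpark handshake with the releasing 
thread.  It assumes Java 9+ for Thread.onSpinWait().

```java
import java.util.concurrent.locks.LockSupport;
import java.util.function.BooleanSupplier;

final class AdaptiveCallSite {
    private static final int WINDOW = 16;   // assumed sample window

    private final long spinBudgetNanos;     // how long a spin may last
    private boolean spinEnabled = true;
    private int samples, misses, skipped;

    AdaptiveCallSite(long spinBudgetNanos) {
        this.spinBudgetNanos = spinBudgetNanos;
    }

    /** Records one spin outcome; re-evaluates the policy every WINDOW samples. */
    synchronized void record(boolean spinSucceeded) {
        samples++;
        if (!spinSucceeded) misses++;
        if (samples == WINDOW) {
            spinEnabled = misses * 2 < WINDOW; // stop if half or more timed out
            samples = 0;
            misses = 0;
        }
    }

    /** While spinning is off, re-enables it after WINDOW skipped waits. */
    synchronized boolean shouldSpin() {
        if (spinEnabled) return true;
        if (++skipped >= WINDOW) {
            skipped = 0;
            spinEnabled = true;
        }
        return spinEnabled;
    }

    /** Waits until done reports true: spin first if allowed, then block. */
    void await(BooleanSupplier done) {
        if (shouldSpin()) {
            long deadline = System.nanoTime() + spinBudgetNanos;
            while (System.nanoTime() - deadline < 0) {
                if (done.getAsBoolean()) {
                    record(true);
                    return;
                }
                Thread.onSpinWait();
            }
            record(false);                   // spin budget exhausted
        }
        while (!done.getAsBoolean()) {
            // Timed park as a polling fallback; a real waiter would be unparked.
            LockSupport.parkNanos(this, 50_000L);
        }
    }
}
```

The periodic retry in shouldSpin() is one way to avoid a call site being 
stuck in blocking mode forever after a single bad window.  Whether that 
retry interval is sensible depends on the workload.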

Perhaps a better solution is to provide low-level mechanisms in the JDK 
and let people experiment with how long to spin or wait.
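
For experimenting today with what the JDK already exposes, a small 
sketch with the knobs left to the caller.  TunableWait and its three 
parameters are hypothetical names, not an existing JDK API.  It only 
combines Thread.onSpinWait() (Java 9+), Thread.yield() and 
LockSupport.parkNanos().

```java
import java.util.concurrent.locks.LockSupport;
import java.util.function.BooleanSupplier;

final class TunableWait {
    private final long spins;      // failed polls answered with a spin hint
    private final long yields;     // further failed polls answered with yield
    private final long parkNanos;  // park duration used after that

    TunableWait(int spins, int yields, long parkNanos) {
        this.spins = spins;
        this.yields = yields;
        this.parkNanos = parkNanos;
    }

    /**
     * Waits until done reports true, escalating from spinning to yielding
     * to timed parking. Returns the number of unsuccessful polls.
     */
    long await(BooleanSupplier done) {
        long attempts = 0;
        while (!done.getAsBoolean()) {
            if (attempts < spins) {
                Thread.onSpinWait();
            } else if (attempts < spins + yields) {
                Thread.yield();
            } else {
                LockSupport.parkNanos(this, parkNanos);
            }
            attempts++;
        }
        return attempts;
    }
}
```

Usage would be along the lines of 
new TunableWait(100, 10, 20_000L).await(() -> ready.get()), with the 
three numbers being exactly the values one would want to measure rather 
than guess.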


On 10/26/2019 3:21 AM, Andrew Haley via Concurrency-interest wrote:
> On 10/25/19 11:11 AM, Viktor Klang via Concurrency-interest wrote:
>> Is there any jdk-builtin Java8+ method which tries to be clever
>> about low-nanos/micros parking?
>> I'm currently considering LockSupport.parkNanos but want to avoid
>> having the Thread parked when parking + wake-up latency is more
>> likely to be much greater than the requested time.
>> I.e. some combination of onSpinWait + some non-cache-polluting
>> computation + yielding + actual parking. I'd like to avoid having to
>> custom-roll it, hence the question for prior art ;)
> As I understand it, the common wisdom is to wait for about half the
> round-trip time for a system call and then park. It doesn't sound
> terribly hard to write something to do that.
> Please forgive me for digressing, but:
> Arm has a mechanism to do this, WFE. When a core fails to obtain a
> lock it executes a WFE instruction which waits on the cache line
> containing the lock. When that cache line is written to by the core
> releasing the lock it awakens the waiting core.
> I'd like to find some way to expose this in a high-level language but
> it's not at all easy to do.
> I believe that Intel has MWAIT which is similar, but it's a privileged
> instruction so no use to us.
