[concurrency-interest] On A Formal Definition of 'Data-Race'

Nathan Reynolds nathan.reynolds at oracle.com
Tue Apr 16 19:01:02 EDT 2013

On x86, only loads can bypass stores. So the program can make progress 
even though the store hasn't yet been made globally visible.
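
As a concrete illustration (a hypothetical sketch, not from the original post): the classic store-buffering litmus test. Each thread stores to one plain field and then loads the other. Because a load may bypass the buffered store, both threads can read 0 — an outcome impossible under any interleaving of program order. With fresh threads per iteration the racy outcome is rare in practice (thread start/join adds ordering); a harness such as jcstress exercises it far more reliably.

```java
// Store-buffering litmus test (illustrative sketch).  All fields are plain
// (non-volatile), so nothing orders the store before the following load.
public class StoreBuffering {
    static int x, y, r1, r2;
    static final int RUNS = 10_000;

    // Runs the test RUNS times and returns how often both threads read 0 --
    // the outcome that only a load bypassing a store can produce.
    static int countReorderedOutcomes() throws InterruptedException {
        int bothZero = 0;
        for (int i = 0; i < RUNS; i++) {
            x = 0; y = 0;
            Thread t1 = new Thread(() -> { x = 1; r1 = y; });
            Thread t2 = new Thread(() -> { y = 1; r2 = x; });
            t1.start(); t2.start();
            t1.join();  t2.join();
            if (r1 == 0 && r2 == 0) bothZero++;  // the "impossible" result
        }
        return bothZero;
    }

    public static void main(String[] args) throws InterruptedException {
        int n = countReorderedOutcomes();
        System.out.println("r1==0 && r2==0 observed " + n + "/" + RUNS + " times");
    }
}
```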

In an extreme case, the store is not made globally visible for a very 
long time.  The store buffer eventually fills with other stores (loads 
can still complete).  The core then stalls, waiting for the store at the 
front of the buffer to complete.

For a store to complete, it has to be pushed into the core's L1 cache.  
To do this, the cache line has to be fetched from another core's cache 
or from RAM, and then the cache line has to be invalidated in all other 
cores.  Both of these operations can be done with a single message sent 
to all of the cores on the system.

Consider that an L3 cache miss takes 14-38 clocks or 6-66 ns 
(http://www.sisoftware.net/?d=qa&f=ben_mem_latency) on a Sandy Bridge E 
processor.  This means a store can take a relatively long time.

Also, consider that the system could have 8 processor sockets. Some 
processor sockets are not directly connected and must communicate via a 
shared processor socket.  This increases the latency of the messaging 
even further.

Without a memory fence after a non-volatile write, subsequent loads can 
bypass the store.  These loads could "read" the value being stored or 
"read" previously stored values.  This means there is no happens-before 
relationship between the store and the loads; in other words, the loads 
could happen before the store.
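
One way to forbid that bypass without declaring the field volatile is an explicit fence (a sketch, assuming the Java 9+ `VarHandle` API; the class and field names are made up for illustration). On x86 the full fence compiles to a lock-prefixed instruction (or `mfence`) that drains the store buffer before the next load executes:

```java
import java.lang.invoke.VarHandle;

// Illustrative sketch: a full fence between a plain store and a subsequent
// plain load prevents the load from bypassing the store.
public class FencedStore {
    static int x, y;

    static int writerThenReader() {
        x = 1;                  // plain store: may sit in the store buffer
        VarHandle.fullFence();  // drain: the store becomes globally visible...
        return y;               // ...before this load executes
    }

    public static void main(String[] args) {
        System.out.println(writerThenReader()); // prints 0 (y was never written)
    }
}
```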

There is no way to know the timing of the visibility of stopped.  The 
store could happen very quickly (e.g. 4 clocks) if the cache line is in 
the modified or exclusive state in the core's L1 cache, or it could 
happen only after the entire system has removed the cache line from all 
of the cores and has acknowledged the invalidation.
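
For completeness, a sketch of the stopped flag under discussion, declared volatile (class and timing values are my own, for illustration). The volatile store is followed by a fence on x86, so the reader is guaranteed to eventually observe it and the spin loop terminates:

```java
// Illustrative sketch: a volatile stop flag.  The volatile store in main()
// is ordered and made visible to the spinning worker thread.
public class StopFlag {
    static volatile boolean stopped = false;

    public static void main(String[] args) throws InterruptedException {
        Thread worker = new Thread(() -> {
            while (!stopped) {
                Thread.onSpinWait();  // Java 9+: hint to the CPU while spinning
            }
            System.out.println("worker saw stopped == true");
        });
        worker.start();
        Thread.sleep(100);   // let the worker start spinning
        stopped = true;      // volatile store: guaranteed to become visible
        worker.join(5_000);  // terminates promptly thanks to visibility
        System.out.println("worker alive after join: " + worker.isAlive());
    }
}
```

Had stopped been a plain field, the JIT would be free to hoist the read out of the loop, and the worker could spin forever — which is exactly the timing question raised below.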

Nathan Reynolds 
<http://psr.us.oracle.com/wiki/index.php/User:Nathan_Reynolds> | 
Architect | 602.333.9091
Oracle PSR Engineering <http://psr.us.oracle.com/> | Server Technology
On 4/16/2013 3:31 PM, thurstonn wrote:
> Nathan Reynolds-2 wrote
>> All things being equal, reading a volatile and non-volatile field from
>> L1/2/3/4 cache/memory has no impact on performance.  The instructions
>> are exactly the same (on x86).
>> Writing a volatile and non-volatile field to cache/memory has an impact
>> on performance.  Writing to a volatile field requires a memory fence on
>> x86 and many other processors.  This fence is going to take cycles.
> Sure, that's my understanding as well.  I wasn't asking about the 'cost' of
> reading #stopped when declared volatile, as you mentioned there isn't one.
> My question was about the 'timing' of the visibility of #stopped in the
> *non-volatile* case, given cache coherency

