[concurrency-interest] x86 NOOP memory barriers

Dmitry Zaslavsky dmitry.zaslavsky at gmail.com
Tue Aug 6 14:03:56 EDT 2013


I didn't have a chance to run this benchmark, but one very significant point here is that the system is hyper-threaded: it has only 6 'real' cores.
My guess would be that the lock-prefixed instruction generated for the volatile access causes a real stall for the other hyper-thread.
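
To make the two write flavors concrete, a minimal sketch (class and method names are mine; the codegen comments describe what HotSpot typically emits on x86, not a guarantee):

    import java.util.concurrent.atomic.AtomicLong;

    public class StoreFlavors {
        private final AtomicLong counter = new AtomicLong();

        // Volatile store: HotSpot on x86 typically emits the mov followed by
        // a lock-prefixed instruction (lock addl $0, (rsp)) acting as the
        // StoreLoad fence; the lock prefix drains the store buffer and
        // competes with the sibling hyper-thread for shared core resources.
        void volatileWrite(long v) {
            counter.set(v);
        }

        // Ordered ("lazy") store: a plain mov plus compiler-only ordering,
        // no lock prefix, so no stall is imposed on the sibling thread.
        void orderedWrite(long v) {
            counter.lazySet(v);
        }
    }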

Another factor is that the test mixes reads and writes. The description says that AtomicLongArray was used.
The .get method is a volatile get, and I guess it was used even for the lazy version?
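
That is, the increment shape I'm assuming for the "lazy" variant (a sketch, not necessarily what the benchmark does):

    import java.util.concurrent.atomic.AtomicLongArray;

    public class LazyIncrement {
        private final AtomicLongArray counters = new AtomicLongArray(64);

        // The write is an ordered lazySet, but the read that feeds it is
        // still a volatile get -- the load side is identical in both the
        // volatile and lazy variants.
        void increment(int i) {
            counters.lazySet(i, counters.get(i) + 1);
        }
    }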



Sent from mobile device

On Aug 6, 2013, at 12:15 PM, "Boehm, Hans" <hans.boehm at hp.com> wrote:

> It would be nice to understand exactly what the difference in generated code is for the different versions whose performance you plotted in http://psy-lob-saw.blogspot.com/2013/05/using-jmh-to-benchmark-multi-threaded.html .  I’m surprised by the increasing differences in the unshared case at high processor counts.  That suggests you are generating different memory traffic for some reason, perhaps because of optimization artifacts for this particular implementation.  AFAICT, the unshared cases should be embarrassingly parallel; there should be no real thread interaction?  Are you spacing out the elements far enough to deal with prefetching artifacts?
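
On Hans's spacing question: one common way to space the elements out, assuming 64-byte cache lines plus the x86 adjacent-line prefetcher (hence 128 bytes per counter); a sketch only, the benchmark may do this differently:

    import java.util.concurrent.atomic.AtomicLongArray;

    public class SpacedCounters {
        // 16 longs = 128 bytes per slot: one 64-byte line plus the adjacent
        // line the spatial prefetcher may also pull in.
        private static final int SPACING = 16;
        private final AtomicLongArray counters;

        SpacedCounters(int threads) {
            counters = new AtomicLongArray(threads * SPACING);
        }

        // Thread t only ever touches index t * SPACING, so no two threads
        // share (or prefetch into) each other's cache lines.
        void increment(int thread) {
            int i = thread * SPACING;
            counters.lazySet(i, counters.get(i) + 1);
        }
    }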
>  
> I’m also surprised by the lazy vs. volatile differences in the shared case.   It seems to me the time should be completely dominated by coherence misses in either case.  There may be some unexpected odd optimization or lack thereof happening here.  In my limited experience, the impact of memory fences, etc. commonly decreases as scale increases, since those slowdowns are local to each core, and don’t affect the amount of memory traffic.  See for example the microbenchmark measurements in http://www.hpl.hp.com/techreports/2012/HPL-2012-218.html .
>  
> This benchmark is such that I have a hard time guessing what optimizations would be applied in each case, and I would expect that to vary a lot across JVMs.  You’re probably recalculating the addresses of the array indices more in some cases than others.  Can multiple increments even get combined in some cases?
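
To illustrate the last question: for a plain field the JIT may legally coalesce back-to-back increments, while volatile writes must each reach memory in order. A minimal sketch of the distinction:

    public class IncrementCoalescing {
        long plain;
        volatile long vol;

        // The JIT may fold these into a single "plain += 2", or keep the
        // value in a register across calls after inlining.
        void plainTwice() {
            plain++;
            plain++;
        }

        // Each ++ here is a separate volatile load/store pair; the two
        // updates cannot be merged into one store.
        void volatileTwice() {
            vol++;
            vol++;
        }
    }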
>  
> Hans
>  
> From: concurrency-interest-bounces at cs.oswego.edu [mailto:concurrency-interest-bounces at cs.oswego.edu] On Behalf Of Nitsan Wakart
> Sent: Tuesday, August 06, 2013 1:38 AM
> To: concurrency-interest at cs.oswego.edu
> Subject: Re: [concurrency-interest] x86 NOOP memory barriers
>  
> Summarized in this blog post here:
> http://psy-lob-saw.blogspot.com/2013/08/memory-barriers-are-not-free.html
> Please point out any mistakes/omissions/oversight.
> Thanks for the help guys.
>  
> From: Michael Barker <mikeb01 at gmail.com>
> To: Nitsan Wakart <nitsanw at yahoo.com> 
> Cc: Vitaly Davidovich <vitalyd at gmail.com>; "concurrency-interest at cs.oswego.edu" <concurrency-interest at cs.oswego.edu> 
> Sent: Saturday, August 3, 2013 12:33 AM
> Subject: Re: [concurrency-interest] x86 NOOP memory barriers
> 
> > So because a putOrdered is a write to memory it cannot be reordered with
> > other writes, as per "8.2.3.2 Neither Loads Nor Stores Are Reordered with
> > Like Operations".
> 
> Yes, in combination with the compiler reordering restrictions.  In
> HotSpot this is implemented within
> LibraryCallKit::inline_unsafe_ordered_store (library_call.cpp).
> Look for:
> 
> insert_mem_bar(Op_MemBarRelease);
> insert_mem_bar(Op_MemBarCPUOrder);
> 
> Mike.
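
So at the Java level, the release-store that putOrdered/lazySet gives you is enough for safe publication on x86 without any fence instruction. A minimal sketch under that reading (field and class names are mine):

    import java.util.concurrent.atomic.AtomicLong;

    public class ReleasePublish {
        private long payload;                        // plain field
        private final AtomicLong ready = new AtomicLong();

        // The MemBarRelease emitted for the ordered store keeps the payload
        // write before the flag write; on x86 no fence instruction is needed
        // since stores are not reordered with other stores (SDM 8.2.3.2).
        void publish(long v) {
            payload = v;
            ready.lazySet(1);   // backed by Unsafe.putOrderedLong in this JDK era
        }

        // The reader pairs a volatile load of the flag with the plain
        // payload read.
        long read() {
            return ready.get() == 1 ? payload : -1;
        }
    }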
> 
> 
> _______________________________________________
> Concurrency-interest mailing list
> Concurrency-interest at cs.oswego.edu
> http://cs.oswego.edu/mailman/listinfo/concurrency-interest

