[concurrency-interest] x86 NOOP memory barriers

Boehm, Hans hans.boehm at hp.com
Tue Aug 6 12:15:44 EDT 2013

It would be nice to understand exactly what the difference in generated code is between the versions whose performance you plotted in http://psy-lob-saw.blogspot.com/2013/05/using-jmh-to-benchmark-multi-threaded.html. I'm surprised by the increasing differences in the unshared case at high processor counts. That suggests you are generating different memory traffic for some reason, perhaps because of optimization artifacts for this particular implementation. AFAICT, the unshared cases should be embarrassingly parallel; there should be no real thread interaction. Are you spacing out the elements far enough to deal with prefetching artifacts?
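The spacing question refers to keeping each thread's counter slot far enough apart that neither cache-line sharing nor the adjacent-line prefetcher creates traffic between "unshared" slots. A hypothetical sketch (the class and the 16-long stride are my assumptions, not the benchmark's actual layout; 8 longs cover a 64-byte line, 16 also clear a 128-byte prefetch window):

```java
import java.util.concurrent.atomic.AtomicLongArray;

// Hypothetical padding scheme: one live counter per PADDING longs,
// so per-thread slots never share a cache line or prefetch pair.
public class PaddedCounters {
    static final int PADDING = 16; // assumed stride, in longs

    final AtomicLongArray counters;

    PaddedCounters(int threads) {
        // Only indices that are multiples of PADDING are used.
        counters = new AtomicLongArray(threads * PADDING);
    }

    void increment(int threadId) {
        int slot = threadId * PADDING;
        // Single-writer slot, so a read-then-ordered-write is safe here.
        counters.lazySet(slot, counters.get(slot) + 1);
    }

    long get(int threadId) {
        return counters.get(threadId * PADDING);
    }
}
```

With this layout, two threads hammering adjacent thread IDs touch lines 128 bytes apart, so the "unshared" case generates no coherence traffic between them.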

I'm also surprised by the lazy vs. volatile differences in the shared case.   It seems to me the time should be completely dominated by coherence misses in either case.  There may be some unexpected odd optimization or lack thereof happening here.  In my limited experience, the impact of memory fences, etc. commonly decreases as scale increases, since those slowdowns are local to each core, and don't affect the amount of memory traffic.  See for example the microbenchmark measurements in http://www.hpl.hp.com/techreports/2012/HPL-2012-218.html .
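For reference, the two write flavors being compared differ only in the trailing fence on x86; a minimal sketch using AtomicLong (the benchmark may well use Unsafe.putOrderedLong directly, which lazySet delegates to):

```java
import java.util.concurrent.atomic.AtomicLong;

public class WriteFlavors {
    public static void main(String[] args) {
        AtomicLong v = new AtomicLong();
        v.set(1L);     // volatile store: on x86, a mov followed by a
                       // StoreLoad barrier (e.g. lock addl to the stack)
        v.lazySet(2L); // ordered ("lazy") store: a plain mov, no fence
        System.out.println(v.get()); // prints 2
    }
}
```

In the shared case both flavors still produce the same cache-line ping-pong, which is why one would expect coherence misses, not the fence, to dominate.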

This benchmark is such that I have a hard time guessing what optimizations would be applied in each case, and I would expect that to vary a lot across JVMs.  You're probably recalculating the addresses of the array indices more in some cases than others.  Can multiple increments even get combined in some cases?
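The increment-combining question can be made concrete with a hypothetical pair of loops (my example, not the benchmark's code): a JIT is free to collapse plain-field increments into a single store, but each volatile or ordered store must be emitted, so the two cases may execute very different amounts of code.

```java
public class IncrementCombining {
    long plain;
    volatile long vol;

    void plainLoop() {
        // The JIT may legally combine these into: plain += 100;
        for (int i = 0; i < 100; i++) plain++;
    }

    void volatileLoop() {
        // Each volatile write must be performed; no combining allowed.
        for (int i = 0; i < 100; i++) vol++;
    }
}
```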


From: concurrency-interest-bounces at cs.oswego.edu [mailto:concurrency-interest-bounces at cs.oswego.edu] On Behalf Of Nitsan Wakart
Sent: Tuesday, August 06, 2013 1:38 AM
To: concurrency-interest at cs.oswego.edu
Subject: Re: [concurrency-interest] x86 NOOP memory barriers

Summarized in this blog post here:
Please point out any mistakes/omissions/oversights.
Thanks for the help, guys.

From: Michael Barker <mikeb01 at gmail.com>
To: Nitsan Wakart <nitsanw at yahoo.com>
Cc: Vitaly Davidovich <vitalyd at gmail.com>; "concurrency-interest at cs.oswego.edu" <concurrency-interest at cs.oswego.edu>
Sent: Saturday, August 3, 2013 12:33 AM
Subject: Re: [concurrency-interest] x86 NOOP memory barriers

> So because a putOrdered is a write to memory it cannot be reordered with
> other writes, as per " Neither Loads Nor Stores Are Reordered with
> Like Operations".

Yes, in combination with the compiler's reordering restrictions.  In
HotSpot this is implemented in
LibraryCallKit::inline_unsafe_ordered_store (library_call.cpp).
Look for:


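The store-store guarantee being discussed is what makes the single-writer publish pattern work; a hypothetical sketch using AtomicLong.lazySet (which compiles down to the same ordered store as putOrdered):

```java
import java.util.concurrent.atomic.AtomicLong;

// Sketch of the single-writer publish idiom: the plain data store
// cannot be reordered after the ordered store to the sequence, so a
// reader that observes `published == seq` also observes `data`.
public class Publisher {
    long data;                                   // plain field
    final AtomicLong published = new AtomicLong(-1);

    void publish(long value, long seq) {
        data = value;            // plain store
        published.lazySet(seq);  // ordered store: stays after the data store
    }
}
```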

