[concurrency-interest] x86 NOOP memory barriers

Nitsan Wakart nitsanw at yahoo.com
Tue Aug 6 14:11:27 EDT 2013


> It would be nice to understand exactly what the difference in generated code is for the different versions whose performance you plotted in http://psy-lob-saw.blogspot.com/2013/05/using-jmh-to-benchmark-multi-threaded.html
I agree. At the time of writing I did print out the assembly, but considered the post long enough as it was (explaining the assembly would have added a lot of work for me and a lot of reading for the audience). I don't recall the assembly showing anything suspect, but as all the code is included it is easy enough to generate. If I have the time I'll go through the exercise and add the printouts to the repository.

>  I’m surprised by the increasing differences in the unshared case at high processor counts.  That suggests you are generating different memory traffic for some reason, perhaps because of optimization artifacts for this particular implementation.  AFAICT, the unshared cases should be embarrassingly parallel; there should be no real thread interaction?  Are you spacing out the elements far enough to deal with prefetching artifacts?

If you are referring to the break in scalability from 8 threads onwards for the volatile and lazy cases, I completely agree with you. I spaced the elements out by one cache line, but if prefetching were the issue it would have affected the lower thread counts equally. The JMH version I was using at the time was quite early in the life of the tool; it may have had issues that caused poor scaling or interacted badly with this benchmark, which is a nano-benchmark, to use Shipilev's qualification. Alternatively, my code might be at fault. It's short enough and I had others review it before publication, but it can happen.
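For concreteness, spacing counters by whole cache lines could look roughly like the sketch below. This is not the code from the post: the class and method names are made up for illustration, and a 64-byte cache line is an assumption. The linesApart parameter is there because, as far as I know, some hardware prefetchers pull in the neighbouring line as well, so spacing by two lines is a common precaution.

```java
public final class PaddedCounters {
    // Assumption: 64-byte cache lines; a long is 8 bytes.
    static final int CACHE_LINE_BYTES = 64;
    static final int LONGS_PER_LINE = CACHE_LINE_BYTES / Long.BYTES;

    private PaddedCounters() {
    }

    // Index of the slot owned by threadId, with linesApart cache lines
    // between neighbouring slots. One leading stride is skipped so that
    // slot 0 does not sit next to the array header.
    static int indexFor(int threadId, int linesApart) {
        return (threadId + 1) * linesApart * LONGS_PER_LINE;
    }

    // Backing array with one extra stride of padding at each end.
    static long[] allocate(int threads, int linesApart) {
        return new long[(threads + 2) * linesApart * LONGS_PER_LINE];
    }
}
```

With linesApart = 1 this gives the one-line spacing described above; rerunning with linesApart = 2 would be a cheap way to check whether adjacent-line prefetching plays any part.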

>  I’m also surprised by the lazy vs. volatile differences in the shared case.  It seems to me the time should be completely dominated by coherence misses in either case.  There may be some unexpected odd optimization or lack thereof happening here.  In my limited experience, the impact of memory fences, etc. commonly decreases as scale increases, since those slowdowns are local to each core, and don’t affect the amount of memory traffic.  See for example the microbenchmark measurements in http://www.hpl.hp.com/techreports/2012/HPL-2012-218.html .

It is my understanding that lazy set tends to dampen false sharing effects as the value is not immediately 'flushed' and can be modified while in the write queue. The less you 'force' the write, the less you contend on the cache line.
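To make the two write flavours concrete, here is a minimal single-writer sketch using AtomicLongArray. It is an illustration only, not the benchmark code; the class and method names are hypothetical. As far as I know, on x86 HotSpot typically compiles the volatile store to a store followed by a locked instruction (or a fence), while the lazySet store is a plain store with no fence after it.

```java
import java.util.concurrent.atomic.AtomicLongArray;

public final class LazyVsVolatile {
    private LazyVsVolatile() {
    }

    // Single-writer increment using a volatile store: each write must
    // become globally visible before later loads on this thread proceed.
    static long incrementVolatile(AtomicLongArray counters, int index, int times) {
        for (int i = 0; i < times; i++) {
            counters.set(index, counters.get(index) + 1);
        }
        return counters.get(index);
    }

    // Single-writer increment using lazySet: an ordered store that is
    // allowed to linger in the store buffer before it drains.
    static long incrementLazy(AtomicLongArray counters, int index, int times) {
        for (int i = 0; i < times; i++) {
            counters.lazySet(index, counters.get(index) + 1);
        }
        return counters.get(index);
    }
}
```

Both methods assume a single writer per index; with several writers on the same slot the read-modify-write would lose updates in either version.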

> This benchmark is such that I have a hard time guessing what optimizations would be applied in each case, and I would expect that to vary a lot across JVMs.  

I only tested on one JVM. I expect you are right and the results will differ somewhat.

> You’re probably recalculating the addresses of the array indices more in some cases than others.  

The array offset is calculated once per thread context. So once in all cases.
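Roughly, the shape is as in the sketch below. It is a plain-Java stand-in for the per-thread benchmark state, not the actual JMH code; the class name and the stride of 8 longs (one 64-byte line) are assumptions for illustration.

```java
public final class CounterContext {
    // Assumption: 8 longs per 64-byte cache line.
    static final int STRIDE = 8;

    // Computed once when the context is created, never in the hot loop.
    final int index;

    CounterContext(int threadId) {
        this.index = (threadId + 1) * STRIDE;
    }

    // The measured operation only uses the precomputed index.
    long increment(long[] counters) {
        return ++counters[index];
    }
}
```

Since the index is a final field of the context, the measured method does no offset arithmetic of its own beyond the array access.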

> Can multiple increments even get combined in some cases?

I'll have to have another look at the assembly. I don't think it happens; I vaguely recall verifying it in the assembly, but it's been a few months.

Thanks for taking the time to read and make comments,
