[concurrency-interest] help show speed-up on a trivial but manual array-map operation?

Jeff Schultz jws at csse.unimelb.edu.au
Thu Mar 8 22:24:27 EST 2012


On Thu, Mar 08, 2012 at 06:40:58PM -0800, Dan Grossman wrote:
> The attached code (and pasted below for convenience) is a simple
> vector addition.  Kim (cc'ed) is getting slower results than the naive
> code on a 48-processor Linux box.  The attached results are from his
> machine.  I see similar behavior on a 4-processor Linux box with
> openJDK7.  We have more detailed machine specs or Java versions if
> that would help, but I imagine either:

I'd have expected a single int vector-add loop to pretty much
saturate memory bandwidth on most current processors, so more than one
CPU on the same chip won't do much better than a single CPU.
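
A rough way to check this, as a sketch rather than a careful
benchmark: time the plain loop and convert the result to bytes moved
per second, then compare that with the machine's memory bandwidth.
The class and method names below are made up for illustration and are
not the code from the original post.  The figure of 12 bytes per
element (two 4-byte loads, one 4-byte store) is an assumption that
ignores any extra traffic the hardware may generate for the stores.

```java
// Hypothetical helper for illustration; not the code attached to the original post.
final class VectorAddBandwidth {
    /** Plain sequential vector add: out[i] = a[i] + b[i]. */
    static void add(int[] a, int[] b, int[] out) {
        for (int i = 0; i < out.length; i++) {
            out[i] = a[i] + b[i];
        }
    }

    /** Bytes touched by one pass, assuming two 4-byte loads and one 4-byte store per element. */
    static long bytesMoved(int n) {
        return 3L * 4L * n;
    }

    public static void main(String[] args) {
        int n = 1 << 24; // 16M ints per array, chosen to be far larger than typical caches
        int[] a = new int[n];
        int[] b = new int[n];
        int[] out = new int[n];
        java.util.Arrays.fill(a, 1);
        java.util.Arrays.fill(b, 2);

        // Warm-up passes so the JIT has a chance to compile the loop.
        for (int warm = 0; warm < 5; warm++) {
            add(a, b, out);
        }

        long start = System.nanoTime();
        add(a, b, out);
        long nanos = System.nanoTime() - start;

        // Bytes per nanosecond is numerically the same as GB/s (10^9 bytes per second).
        double gbPerSec = bytesMoved(n) / (double) nanos;
        System.out.printf("%.2f GB/s%n", gbPerSec);
    }
}
```

If the number printed for one thread is already close to what the
memory system can deliver, adding threads has little left to win.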

The 48 CPU machine is presumably a four or more socket arrangement
with memory attached directly to each socket.  Unless the OS can do
something very clever about moving parts of the first and second
arrays (allocated in main) between the different memories, it's likely
that they're allocated entirely in one socket's memory.  Even ignoring
the generally slower inter-socket interconnect, that leaves all 48
CPUs sharing the memory bandwidth of a single socket.

To show speedup, you need an operation that costs a lot more cycles
than integer add, while being cache-friendly.  (Naively) searching for
a pattern in an array might work.  Fill the input with mostly 1s and
the occasional 2 and look for patterns of N 1s followed by a 2.  As a
pedagogical bonus, you can show the effect of memory bandwidth
limitations by changing N.  Larger N means more reuse of each cache
line.
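
A minimal sketch of that suggestion, using fork/join.  Everything
named here (PatternCount, makeInput, the CUTOFF value, the density of
2s) is an assumption chosen for illustration, and "a pattern" is read
as "at least N 1s immediately followed by a 2", counted once per
starting index.

```java
import java.util.Random;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Hypothetical illustration of the naive pattern search; all names are assumptions.
final class PatternCount {
    /** Builds an array of mostly 1s, with a 2 at roughly one position in `period`. */
    static int[] makeInput(int size, int period, long seed) {
        Random rnd = new Random(seed);
        int[] in = new int[size];
        for (int i = 0; i < size; i++) {
            in[i] = rnd.nextInt(period) == 0 ? 2 : 1;
        }
        return in;
    }

    /** Naive check: are in[i .. i+n-1] all 1 and is in[i+n] a 2? */
    static boolean matchesAt(int[] in, int i, int n) {
        if (i + n >= in.length) return false; // pattern would run off the end
        for (int k = 0; k < n; k++) {
            if (in[i + k] != 1) return false;
        }
        return in[i + n] == 2;
    }

    /** Sequential count of matching start positions in [lo, hi). */
    static int countRange(int[] in, int lo, int hi, int n) {
        int count = 0;
        for (int i = lo; i < hi; i++) {
            if (matchesAt(in, i, n)) count++;
        }
        return count;
    }

    /** Divide-and-conquer task; reads may cross into a neighbour's range, which is safe. */
    static final class CountTask extends RecursiveTask<Integer> {
        static final int CUTOFF = 10_000; // assumed sequential threshold, not tuned
        final int[] in;
        final int lo, hi, n;

        CountTask(int[] in, int lo, int hi, int n) {
            this.in = in; this.lo = lo; this.hi = hi; this.n = n;
        }

        @Override
        protected Integer compute() {
            if (hi - lo <= CUTOFF) return countRange(in, lo, hi, n);
            int mid = lo + (hi - lo) / 2;
            CountTask left = new CountTask(in, lo, mid, n);
            CountTask right = new CountTask(in, mid, hi, n);
            left.fork();                         // run the left half asynchronously
            return right.compute() + left.join(); // do the right half in this thread
        }
    }

    static int countParallel(int[] in, int n) {
        ForkJoinPool pool = new ForkJoinPool();
        try {
            return pool.invoke(new CountTask(in, 0, in.length, n));
        } finally {
            pool.shutdown();
        }
    }
}
```

Because the input is mostly 1s, the inner loop usually runs all N
iterations before failing on the final comparison, so the work per
element grows with N while the memory traffic stays the same.  Timing
countRange over the whole array against countParallel for a few
values of N should make the bandwidth effect visible.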


    Jeff Schultz

