[concurrency-interest] Array allocation and access on the JVM

Aleksandar Prokopec aleksandar.prokopec at gmail.com
Fri Jan 27 13:06:45 EST 2012


On one thread, the particular benchmark I had went from 1200 ms to 1400 ms with the flag on.
-- 
Aleksandar Prokopec
LAMP, IC, EPFL

Sent from my Android phone with K-9 Mail. Please excuse my brevity.

Vitaly Davidovich <vitalyd at gmail.com> wrote:

I thought UseCondMark just adds a branch to test whether the mark is already dirty instead of writing unconditionally - are you saying the branch would cause noticeable perf degradation?

Vitaly

Sent from my phone
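For reference, the conditional vs. unconditional mark can be sketched in plain Java. This is only an illustration of the shape of the barrier, not HotSpot's actual code (the real barrier is emitted by the JIT in machine code); 512-byte cards and a dirty value of 0 are the usual HotSpot conventions, assumed here:

```java
public class CardMarkSketch {
    static final int CARD_SHIFT = 9;             // 512-byte cards (HotSpot default)
    static final byte DIRTY = 0;
    static final byte CLEAN = (byte) 0xff;

    static byte[] cardTable = new byte[1 << 10]; // covers a 512 KB "heap" slice

    // Unconditional barrier: always stores, so two threads writing objects
    // whose cards share a cache line keep invalidating each other's line.
    static void unconditionalMark(long address) {
        cardTable[(int) (address >>> CARD_SHIFT)] = DIRTY;
    }

    // -XX:+UseCondCardMark: load + branch, store only if the card is clean.
    // The extra branch costs a little on one thread, but avoids write-write
    // false sharing on the card table under contention.
    static void conditionalMark(long address) {
        int card = (int) (address >>> CARD_SHIFT);
        if (cardTable[card] != DIRTY) {
            cardTable[card] = DIRTY;
        }
    }

    public static void main(String[] args) {
        java.util.Arrays.fill(cardTable, CLEAN);
        unconditionalMark(4096);
        conditionalMark(4096);   // card already dirty: no store this time
        System.out.println("card 8 dirty: " + (cardTable[4096 >>> CARD_SHIFT] == DIRTY));
    }
}
```

So the branch itself is cheap; the question is whether avoiding the redundant store pays for it, which depends on how contended the card table is.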

On Jan 27, 2012 11:38 AM, "Nathan Reynolds" <nathan.reynolds at oracle.com> wrote:

Defaulting -XX:+UseCondCardMark would require a lot of testing to create a heuristic to guess when it should default on or off.  Not all workloads hit contention on the card table.  Making the card mark conditional for these workloads would slow them down.  If the heuristic is sub-optimal, then we would be in a worse situation than we are today.

In order for the JVM to detect contention, it would need to profile each access or have a hardware counter to declare contention.  Timing each access and tracking statistics would be expensive.  Hardware counters exist but they are too expensive to access because of kernel round trips.  So, the current solutions are more expensive than the actual problem.

We have asked hardware vendors to supply user-mode access to hardware counters for true and false sharing.  The same hardware counters can be used for this problem.  If we get the user-mode hardware counters, then this might be automated.

Another difficulty would be in making this flag dynamically changeable without incurring a performance penalty.

Nathan Reynolds | Consulting Member of Technical Staff | 602.333.9091
Oracle PSR Engineering | Server Technology


On 1/26/2012 10:36 PM, Hanson Char wrote: 

Is there a plan to have -XX:+UseCondCardMark included as the default config in Java 7?  Or can a VM auto-detect the need for such configuration and self-configure upon startup?


Thanks,

Hanson

On Wed, Jan 25, 2012 at 3:55 PM, Doug Lea <dl at cs.oswego.edu> wrote:

On 01/25/12 14:48, Aleksandar Prokopec wrote:

Hello,

I've recently posted the following question about memory contention on
StackOverflow:



I asked Aleksandar off-list to check the effects of -XX:+UseCondCardMark,
which, as always these days, was the main problem. Really, everyone
should always use this switch on hotspot on multiprocessors.
(or -XX:+UseG1GC, although it is often a little worse on throughput
but better-behaved for large heaps).

As I mentioned in a post from a few weeks ago, see Dave Dice's blog
http://blogs.oracle.com/dave/entry/false_sharing_induced_by_card
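The arithmetic behind the card-table false sharing can be sketched as follows. The 512-byte card size is HotSpot's default, and the 64-byte cache line is typical for x86; both are assumptions here, used for illustration:

```java
public class CardFalseSharing {
    static final int CARD_SIZE = 512;    // heap bytes covered per card-table byte
    static final int CACHE_LINE = 64;    // typical x86 cache-line size

    public static void main(String[] args) {
        // Two workers' arrays of 1024 references (4 bytes each with
        // compressed oops), allocated back to back: 4 KB apart in the heap.
        long arrayA = 0;
        long arrayB = 1024 * 4;

        long cardA = arrayA / CARD_SIZE; // card-table byte index for A
        long cardB = arrayB / CARD_SIZE;

        System.out.println("cards apart: " + (cardB - cardA));
        // Both card bytes land on the same 64-byte cache line of the card
        // table, so unconditional card marks from different threads
        // false-share even though the arrays themselves never do.
        System.out.println("same card-table cache line: "
                + (cardA / CACHE_LINE == cardB / CACHE_LINE));
    }
}
```

In other words, arrays that are comfortably far apart in the heap still map to card-table bytes only a few bytes apart, which is why spreading the arrays out via per-thread allocation (TLABs) helps so much.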


-Doug






http://stackoverflow.com/questions/8942396/array-allocation-and-access-on-the-java-virtual-machine-and-memory-contention

Basically, I have several threads defined like this:

final class Worker extends Thread {
    Foo[] array = new Foo[1024];
    int sz;

    public Worker(int _sz) {
        sz = _sz;
    }

    public void run() {
        //Foo[] arr = new Foo[1024];
        Foo[] arr = array;
        loop(arr);
    }

    public void loop(Foo[] arr) {
        int i = 0;
        int pos = 512;
        Foo v = new Foo();
        while (i < sz) {
            if (i % 2 == 0) {
                arr[pos] = v;
                pos += 1;
            } else {
                pos -= 1;
                v = arr[pos];
            }
            i++;
        }
    }
}

The program takes two parameters, `size` and `par`: `size` is the size of the
workload, and `par` is the number of threads the workload is distributed across.
Each thread repeatedly writes to and reads from array location 512, in
total `size`/`par` times. Each thread uses its own array, so the locations being
written to should be at least 1024*4 bytes away from each other.

I've measured the scaling on a 2x 4-core Intel Xeon server and observed the
following running times for 1, 2, 4 and 8 threads (7 repetitions each):

 >>> All running times: [2149, 2227, 1974, 1948, 1803, 2283, 1878]
 >>> All running times: [1140, 1124, 2022, 1141, 2028, 2004, 2136]
 >>> All running times: [867, 1022, 1457, 1342, 1436, 966, 1531]
 >>> All running times: [915, 864, 1245, 1243, 948, 790, 1007]

Apparently, this is due to memory contention.
What I've observed is that if I comment the line `Foo[] arr = array` and
uncomment the line above it to allocate the array within the thread, I get:

 >>> All running times: [2053, 1966, 2089, 1937, 2046, 1909, 2011]
 >>> All running times: [1048, 1178, 1100, 1194, 1367, 1271, 1207]
 >>> All running times: [578, 508, 589, 571, 617, 643, 645]
 >>> All running times: [330, 299, 300, 322, 331, 324, 575]

My guess is that in the second case the arrays get allocated in thread local
allocation buffers, so the array positions are much further from each other.
Still, even in the first case processors should be writing to separate cache
lines, so why does the first example scale so badly?

Can anyone explain these numbers? Why does this memory contention happen?

Thank you,
Aleksandar Prokopec



------------------------------------------------------------------------------

// The complete runnable program:


import java.util.ArrayList;

class MultiStackJavaExperiment {

    final class Foo {
        int x = 0;
    }

    final class Worker extends Thread {
        Foo[] array = new Foo[1024];
        int sz;

        public Worker(int _sz) {
            sz = _sz;
        }

        public void run() {
            Foo[] arr = new Foo[1024];
            //Foo[] arr = array;
            loop(arr);
        }

        public void loop(Foo[] arr) {
            int i = 0;
            int pos = 512;
            Foo v = new Foo();
            while (i < sz) {
                if (i % 2 == 0) {
                    arr[pos] = v;
                    pos += 1;
                } else {
                    pos -= 1;
                    v = arr[pos];
                }
                i++;
            }
        }
    }

    public static void main(String[] args) {
        (new MultiStackJavaExperiment()).mainMethod(args);
    }

    int size = Integer.parseInt(System.getProperty("size"));
    int par = Integer.parseInt(System.getProperty("par"));

    public void mainMethod(String[] args) {
        int times = 0;
        if (args.length == 0) times = 1;
        else times = Integer.parseInt(args[0]);
        ArrayList<Long> measurements = new ArrayList<Long>();

        for (int i = 0; i < times; i++) {
            long start = System.currentTimeMillis();
            run();
            long end = System.currentTimeMillis();

            long time = (end - start);
            System.out.println(i + ") Running time: " + time + " ms");
            measurements.add(time);
        }

        System.out.println(">>>");
        System.out.println(">>> All running times: " + measurements);
        System.out.println(">>>");
    }

    public void run() {
        int sz = size / par;
        ArrayList<Thread> threads = new ArrayList<Thread>();

        for (int i = 0; i < par; i++) {
            threads.add(new Worker(sz));
            threads.get(i).start();
        }
        for (int i = 0; i < par; i++) {
            try {
                threads.get(i).join();
            } catch (Exception e) {}
        }
    }

}



_______________________________________________
Concurrency-interest mailing list
Concurrency-interest at cs.oswego.edu
http://cs.oswego.edu/mailman/listinfo/concurrency-interest





