[concurrency-interest] Array allocation and access on the JVM

Vitaly Davidovich vitalyd at gmail.com
Fri Jan 27 13:24:30 EST 2012


On the same 2x4 Xeon server you mentioned earlier? Given that you said
there were no GCs, I'd expect the cards to stay dirty since you're writing
to the same location, and the CPU should predict the branch perfectly (the
cost of a predicted branch is just a few cycles).  Interesting though,
thanks for sharing.

Sent from my phone
On Jan 27, 2012 1:07 PM, "Aleksandar Prokopec" <aleksandar.prokopec at gmail.com> wrote:

> On one thread, the particular benchmark I had went from 1200ms to
> 1400ms with the flag on.
> --
> Aleksandar Prokopec
> LAMP, IC, EPFL
>
> Sent from my Android phone with K-9 Mail. Please excuse my brevity.
>
> Vitaly Davidovich <vitalyd at gmail.com> wrote:
>>
>> I thought UseCondCardMark just adds a branch to test whether the mark is
>> already dirty instead of writing unconditionally - are you saying the
>> branch would cause a noticeable perf degradation?
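>>
>> A minimal Java model of the two barrier shapes in question (the real
>> barrier is emitted by the JIT as machine code; the names and sizes here
>> are only illustrative, though 512-byte cards and a dirty value of 0
>> match HotSpot):
>>
>>     static final byte DIRTY = 0;
>>     static byte[] cards = new byte[1 << 21];  // 1 << 21 cards covers a 1 GB heap
>>
>>     // -XX:-UseCondCardMark: dirty the card on every reference store
>>     static void markAlways(long addr) {
>>         cards[(int) (addr >>> 9)] = DIRTY;    // 512-byte cards => shift by 9
>>     }
>>
>>     // -XX:+UseCondCardMark: the extra branch - store only if not already dirty
>>     static void markIfClean(long addr) {
>>         int i = (int) (addr >>> 9);
>>         if (cards[i] != DIRTY) cards[i] = DIRTY;
>>     }
>>
>> On a single thread repeatedly writing the same card, the test always sees
>> a dirty card, so the branch takes the same path every time.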
>>
>> Vitaly
>>
>> Sent from my phone
>> On Jan 27, 2012 11:38 AM, "Nathan Reynolds" <nathan.reynolds at oracle.com>
>> wrote:
>>
>>>  Defaulting -XX:+UseCondCardMark on would require a lot of testing to
>>> create a heuristic for guessing when it should be on or off.  Not all
>>> workloads hit contention on the card table, and enabling conditional
>>> card marking for those workloads would slow them down.  If the heuristic
>>> were sub-optimal, we would be in a worse situation than we are today.
>>>
>>> In order for the JVM to detect contention, it would need to profile each
>>> access or use a hardware counter to detect the contention.  Timing each
>>> access and tracking statistics would be expensive.  Hardware counters
>>> exist, but they are too expensive to access because of the kernel round
>>> trips involved.  So the current solutions cost more than the problem
>>> itself.
>>>
>>> We have asked hardware vendors to supply user-mode access to hardware
>>> counters for true and false sharing.  The same counters could be used
>>> for this problem.  If we get user-mode hardware counters, then this
>>> might be automated.
>>>
>>> Another difficulty would be in making this flag dynamically changeable
>>> *without* incurring a performance penalty.
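>>>
>>> (For what it's worth, HotSpot can already change *manageable* flags at
>>> runtime via the HotSpotDiagnosticMXBean. UseCondCardMark is a product
>>> flag, not a manageable one, so it cannot be flipped this way; the sketch
>>> below only illustrates the mechanism, using a flag that is writeable.)
>>>
>>>     import java.lang.management.ManagementFactory;
>>>     import com.sun.management.HotSpotDiagnosticMXBean;
>>>
>>>     public class FlagFlip {
>>>         public static void main(String[] args) throws Exception {
>>>             HotSpotDiagnosticMXBean hs = ManagementFactory.newPlatformMXBeanProxy(
>>>                 ManagementFactory.getPlatformMBeanServer(),
>>>                 "com.sun.management:type=HotSpotDiagnostic",
>>>                 HotSpotDiagnosticMXBean.class);
>>>             // PrintGCDetails is a manageable flag, so this succeeds at runtime;
>>>             // setVMOption throws IllegalArgumentException for non-writeable flags.
>>>             hs.setVMOption("PrintGCDetails", "true");
>>>             System.out.println(hs.getVMOption("UseCondCardMark"));
>>>         }
>>>     }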
>>>
>>> Nathan Reynolds <http://psr.us.oracle.com/wiki/index.php/User:Nathan_Reynolds> | Consulting Member of Technical Staff | 602.333.9091
>>> Oracle PSR Engineering <http://psr.us.oracle.com/> | Server Technology
>>>
>>> On 1/26/2012 10:36 PM, Hanson Char wrote:
>>>
>>> Is there a plan to have -XX:+UseCondCardMark included as the default
>>> configuration in Java 7?  Or could a VM auto-detect the need for such a
>>> configuration and configure itself at startup?
>>>
>>>  Thanks,
>>> Hanson
>>>
>>> On Wed, Jan 25, 2012 at 3:55 PM, Doug Lea <dl at cs.oswego.edu> wrote:
>>>
>>>> On 01/25/12 14:48, Aleksandar Prokopec wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> I've recently posted the following question about memory contention on
>>>>> StackOverflow:
>>>>>
>>>>
>>>>
>>>>  I asked Aleksandar offlist to check the effects of -XX:+UseCondCardMark,
>>>> which, as is so often the case these days, turned out to be the main
>>>> problem. Really, everyone should always use this switch on HotSpot on
>>>> multiprocessors (or -XX:+UseG1GC, which is often a little worse on
>>>> throughput but better-behaved for large heaps).
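>>>>
>>>> For concreteness, the switch goes on the java command line, e.g. (the
>>>> class name is taken from the program quoted below):
>>>>
>>>>     java -XX:+UseCondCardMark MultiStackJavaExperiment
>>>>
>>>> and -XX:+PrintFlagsFinal will show the flag values a given VM actually
>>>> ran with.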
>>>>
>>>> As I mentioned in a post from a few weeks ago, see Dave Dice's blog:
>>>> http://blogs.oracle.com/dave/entry/false_sharing_induced_by_card
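>>>>
>>>> To make the mechanism concrete, a minimal back-of-the-envelope sketch
>>>> (HotSpot's 512-byte card size is real; the 64-byte cache line is the
>>>> usual x86 value; the constant names are made up for illustration):
>>>>
>>>>     // One card byte covers 512 bytes of heap, and 64 card bytes share
>>>>     // a 64-byte cache line. So reference stores anywhere within a
>>>>     // 64 * 512 = 32 KB stretch of heap can dirty bytes on the same
>>>>     // card-table cache line and bounce it between cores - even though
>>>>     // the threads never touch each other's objects.
>>>>     static final int CARD_SIZE = 512;  // bytes of heap per card
>>>>     static final int LINE_SIZE = 64;   // bytes per cache line (x86)
>>>>     static final int HEAP_SPAN_PER_LINE = CARD_SIZE * LINE_SIZE; // 32 KB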
>>>>
>>>>
>>>> -Doug
>>>>>
>>>>> http://stackoverflow.com/questions/8942396/array-allocation-and-access-on-the-java-virtual-machine-and-memory-contention
>>>>>
>>>>> Basically, I have several threads defined like this:
>>>>>
>>>>> final class Worker extends Thread {
>>>>>     Foo[] array = new Foo[1024];
>>>>>     int sz;
>>>>>
>>>>>     public Worker(int _sz) {
>>>>>         sz = _sz;
>>>>>     }
>>>>>
>>>>>     public void run() {
>>>>>         //Foo[] arr = new Foo[1024];
>>>>>         Foo[] arr = array;
>>>>>         loop(arr);
>>>>>     }
>>>>>
>>>>>     public void loop(Foo[] arr) {
>>>>>         int i = 0;
>>>>>         int pos = 512;
>>>>>         Foo v = new Foo();
>>>>>         while (i < sz) {
>>>>>             if (i % 2 == 0) {
>>>>>                 arr[pos] = v;
>>>>>                 pos += 1;
>>>>>             } else {
>>>>>                 pos -= 1;
>>>>>                 v = arr[pos];
>>>>>             }
>>>>>             i++;
>>>>>         }
>>>>>     }
>>>>> }
>>>>>
>>>>> The program takes 2 parameters - `size` and `par`. `size` is the size
>>>>> of the workload, and `par` is the number of threads to distribute the
>>>>> workload across. Each thread repeatedly writes to and reads from array
>>>>> location 512, in total `size` / `par` times. Each thread uses its own
>>>>> array (so the locations being written to should be at least 1024*4
>>>>> bytes away from each other).
>>>>>
>>>>> I've measured the running times on a 2x 4-core Intel Xeon server and
>>>>> observed the following for 1, 2, 4 and 8 threads (7 repetitions each):
>>>>>
>>>>>  >>> All running times: [2149, 2227, 1974, 1948, 1803, 2283, 1878]
>>>>>  >>> All running times: [1140, 1124, 2022, 1141, 2028, 2004, 2136]
>>>>>  >>> All running times: [867, 1022, 1457, 1342, 1436, 966, 1531]
>>>>>  >>> All running times: [915, 864, 1245, 1243, 948, 790, 1007]
>>>>>
>>>>> Apparently, this is due to memory contention.
>>>>> What I've observed is that if I comment out the line `Foo[] arr = array`
>>>>> and uncomment the line above it, so that the array is allocated within
>>>>> the thread, I get:
>>>>>
>>>>>  >>> All running times: [2053, 1966, 2089, 1937, 2046, 1909, 2011]
>>>>>  >>> All running times: [1048, 1178, 1100, 1194, 1367, 1271, 1207]
>>>>>  >>> All running times: [578, 508, 589, 571, 617, 643, 645]
>>>>>  >>> All running times: [330, 299, 300, 322, 331, 324, 575]
>>>>>
>>>>> My guess is that in the second case the arrays get allocated in
>>>>> thread-local allocation buffers, so the array positions are much
>>>>> further apart from each other.
>>>>> Still, even in the first case the processors should be writing to
>>>>> separate cache lines, so why does the first example scale so badly?
>>>>>
>>>>> Can anyone explain these numbers? Why does this memory contention
>>>>> happen?
>>>>>
>>>>> Thank you,
>>>>> Aleksandar Prokopec
>>>>>
>>>>> ------------------------------------------------------------------------------
>>>>>
>>>>> // The complete runnable program:
>>>>>
>>>>>
>>>>> import java.util.ArrayList;
>>>>>
>>>>> class MultiStackJavaExperiment {
>>>>>
>>>>>     final class Foo {
>>>>>         int x = 0;
>>>>>     }
>>>>>
>>>>>     final class Worker extends Thread {
>>>>>         Foo[] array = new Foo[1024]; // allocated by the thread constructing the Worker
>>>>>         int sz;
>>>>>
>>>>>         public Worker(int _sz) {
>>>>>             sz = _sz;
>>>>>         }
>>>>>
>>>>>         public void run() {
>>>>>             Foo[] arr = new Foo[1024]; // allocated by the worker thread itself
>>>>>             //Foo[] arr = array;      // slow variant: use the shared-field array
>>>>>             loop(arr);
>>>>>         }
>>>>>
>>>>>         public void loop(Foo[] arr) {
>>>>>             int i = 0;
>>>>>             int pos = 512;
>>>>>             Foo v = new Foo();
>>>>>             // alternate a reference write and a read around location 512
>>>>>             while (i < sz) {
>>>>>                 if (i % 2 == 0) {
>>>>>                     arr[pos] = v;
>>>>>                     pos += 1;
>>>>>                 } else {
>>>>>                     pos -= 1;
>>>>>                     v = arr[pos];
>>>>>                 }
>>>>>                 i++;
>>>>>             }
>>>>>         }
>>>>>     }
>>>>>
>>>>>     public static void main(String[] args) {
>>>>>         (new MultiStackJavaExperiment()).mainMethod(args);
>>>>>     }
>>>>>
>>>>>     int size = Integer.parseInt(System.getProperty("size"));
>>>>>     int par = Integer.parseInt(System.getProperty("par"));
>>>>>
>>>>>     public void mainMethod(String[] args) {
>>>>>         int times = 0;
>>>>>         if (args.length == 0) times = 1;
>>>>>         else times = Integer.parseInt(args[0]);
>>>>>         ArrayList<Long> measurements = new ArrayList<Long>();
>>>>>
>>>>>         for (int i = 0; i < times; i++) {
>>>>>             long start = System.currentTimeMillis();
>>>>>             run();
>>>>>             long end = System.currentTimeMillis();
>>>>>
>>>>>             long time = (end - start);
>>>>>             System.out.println(i + ") Running time: " + time + " ms");
>>>>>             measurements.add(time);
>>>>>         }
>>>>>
>>>>>         System.out.println(">>>");
>>>>>         System.out.println(">>> All running times: " + measurements);
>>>>>         System.out.println(">>>");
>>>>>     }
>>>>>
>>>>>     public void run() {
>>>>>         int sz = size / par;
>>>>>         ArrayList<Thread> threads = new ArrayList<Thread>();
>>>>>
>>>>>         for (int i = 0; i < par; i++) {
>>>>>             threads.add(new Worker(sz));
>>>>>             threads.get(i).start();
>>>>>         }
>>>>>         for (int i = 0; i < par; i++) {
>>>>>             try {
>>>>>                 threads.get(i).join();
>>>>>             } catch (Exception e) {}
>>>>>         }
>>>>>     }
>>>>>
>>>>> }
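>>>>>
>>>>> A typical invocation, for reference (the size value is only an example;
>>>>> the program reads `size` and `par` as system properties, and the number
>>>>> of repetitions as its first argument):
>>>>>
>>>>>     javac MultiStackJavaExperiment.java
>>>>>     java -Dsize=50000000 -Dpar=4 MultiStackJavaExperiment 7
>>>>>     java -Dsize=50000000 -Dpar=4 -XX:+UseCondCardMark MultiStackJavaExperiment 7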


More information about the Concurrency-interest mailing list