[concurrency-interest] Cost of Inserting bytecodes around monitorenter and monitorexit

Aaron Greenhouse aarong at cs.cmu.edu
Wed Aug 20 13:21:20 EDT 2008


(Sorry this is a long message.)

Hi everyone.  I have a question/problem that is a bit off topic, but I
need the advice of those who understand synchronization and the JVM
intimately.  I am manipulating Java classfiles (using ASM) as part of a
profiling project.  I was trying to understand the overhead created by
the inserted bytecodes, and I was surprised to see how much overhead I
have when I instrument around monitorenter and monitorexit operations.
Primarily my inserted bytecodes set up method calls to a data collection
class.  I insert a method call before and after monitorenter and after
moniterexit.  In this case, I have subsituted a data collection class
that has empty methods.

I am concerned because on my test case, the unintrumented program runs
in under 4 minutes.  If I instrument only monitorenter and monitorexit
bytecodes, the program takes over 20 minutes to run.  Before you tell me
that it is the inserted method calls that cause the problem, let me say
this: If I instead instrument only method calls which also causes 3
additional methods to be called for each original method call, the  
entire
program runs in around 5 minutes.  So I think there is something very
strange going on related to synchronization.

To be more specific, I have created a simpler test class. This class
doesn't do anything useful.  It is just designed to exercise a bunch of
synchronized blocks.  All the important stuff is in the Shared class.

public class Main {
   public static void main(final String[] args) {
     System.out.println("Main");

     final Shared s = new Shared();
     final Thread threadA = new Thread(new A(s));
     final Thread threadB = new Thread(new B(s));

     final long startTime = System.currentTimeMillis();
     threadA.start();
     threadB.start();

     while (threadA.isAlive() && threadB.isAlive()) {
       try {
         threadA.join();
         threadB.join();
       } catch (final InterruptedException e) {
         // won't happen
       }
     }

     final long endTime = System.currentTimeMillis();
     final long time = endTime - startTime;
     System.out.println("Done in " + time + "ms (" + ((time/1000.0)/ 
60.0) + " minutes)");
   }
}

class A implements Runnable {
   private static final int START = 2500000;
   private Shared s;

   public A(final Shared s) {
     this.s = s;
     s.forB(START);
   }

   public void run() {
     while (true) {
       try {
         final int v = s.fromB();
         if (v <= 0) {
           s.forB(v);
           break;
         }
         s.forB(v+1);
       } catch(final InterruptedException e) {
         // won't happen
       }
     }
   }
}

class B implements Runnable {
   private Shared s;

   public B(final Shared s) {
     this.s = s;
   }

   public void run() {
     while (true) {
       try {
         final int v = s.fromA();
         if (v <= 0) {
           s.forA(v);
           break;
         }
         s.forA(v-2);
       } catch (final InterruptedException e) {
         // won't happen
       }
     }
   }
}

class Shared {
   private boolean hasForB = false;
   private int fromAforB;

   private boolean hasForA = false;
   private int fromBforA;

   public void forB(final int v) {
     forB_impl(this, v);
   }

   private void forB_impl(final Object lock, final int v) {
     synchronized (lock) {
       fromAforB = v;
       hasForB = true;
       lock.notifyAll();
     }
   }

   public void forA(final int v) {
     forA_impl(this, v);
   }

   private void forA_impl(final Object lock, final int v) {
     synchronized (lock) {
       fromBforA = v;
       hasForA = true;
       lock.notifyAll();
     }
   }

   public int fromB() throws InterruptedException {
     return fromB_impl(this);
   }

   private int fromB_impl(final Object lock) throws  
InterruptedException {
     synchronized (lock) {
       while (!hasForA) {
         lock.wait();
       }
       hasForA = false;
       return fromBforA;
     }
   }

   public int fromA() throws InterruptedException {
     return fromA_impl(this);
   }

   private int fromA_impl(final Object lock) throws  
InterruptedException {
     synchronized (lock) {
       while (!hasForB) {
         lock.wait();
       }
       hasForB = false;
       return fromAforB;
     }
   }
}

The Shared class is using an extra layer of delegation because I was
trying to prevent the Java and JIT compilers from taking advantage of
the fact that the lock is known to be the receiver; this is important
below.

When I run this class normally on my Intel Core 2 Duo iMac, under Java
5, it takes about 33 seconds.  When I run with instrumentation added
around each monitorenter and monitorexit bytecode, it runs in about 39
seconds.  I have similar (but slower) results on an older Intel Dell
Laptop running Java 6.

Now, here is where it gets very strange.  I created a version of the
Shared class that approximates very closely the instrumentation that is
being added in the bytecode:

class Shared {
   private boolean hasForB = false;
   private int fromAforB;

   private boolean hasForA = false;
   private int fromBforA;

   public void forB(final int v) {
     forB_impl(this, v);
   }

   private void forB_impl(final Object lock, final int v) {
     try {
       Store.beforeStart(lock, lock == this, lock == Shared.class,  
Shared.class, 1);
       synchronized (lock) {
         Store.afterStart(lock, Shared.class, 1);
         fromAforB = v;
         hasForB = true;
         lock.notifyAll();
       }
     } finally {
       Store.afterEnd(this, Shared.class, 10);
     }
   }

   public void forA(final int v) {
     forA_impl(this, v);
   }

   private void forA_impl(final Object lock, final int v) {
     try {
       Store.beforeStart(lock, lock == this, lock == Shared.class,  
Shared.class, 1);
       synchronized (lock) {
         Store.afterStart(lock, Shared.class, 1);
         fromBforA = v;
         hasForA = true;
         lock.notifyAll();
       }
     } finally {
       Store.afterEnd(this, Shared.class, 10);
     }
   }

   public int fromB() throws InterruptedException {
     return fromB_impl(this);
   }

   private int fromB_impl(final Object lock) throws  
InterruptedException {
     try {
       Store.beforeStart(lock, lock == this, lock == Shared.class,  
Shared.class, 1);
       synchronized (lock) {
         Store.afterStart(lock, Shared.class, 1);
         while (!hasForA) {
           lock.wait();
         }
         hasForA = false;
         return fromBforA;
       }
     } finally {
       Store.afterEnd(this, Shared.class, 10);
     }
   }

   public int fromA() throws InterruptedException {
     return fromA_impl(this);
   }

   private int fromA_impl(final Object lock) throws  
InterruptedException {
     try {
       Store.beforeStart(lock, lock == this, lock == Shared.class,  
Shared.class, 1);
       synchronized (lock) {
         Store.afterStart(lock, Shared.class, 1);
         while (!hasForB) {
           lock.wait();
         }
         hasForB = false;
         return fromAforB;
       }
     } finally {
       Store.afterEnd(this, Shared.class, 10);
     }
   }
}

class Store {
   public static void beforeStart(
       final Object o, final boolean a, final boolean b, Class c, int  
l) {}
   public static void afterStart(final Object o, Class c, int l) {}
   public static void afterEnd(final Object o, Class c, int l) {}
}

This is where I want to avoid taking advantage of the fact that the lock
is the receiver.  Obviously, this isn't going to be exactly the same as
the instrumented bytecode, because the bytecode manipulator doens't know
what operations preceeded the monitorenter or monitorexit operations and
has to do some possibly redundant swap and dup stack manipulations.  But
it's pretty close.  When I run this version, it completes in about 33
seconds.  Actually, sometimes it seems to be slightly faster than the
original, running in 32 seconds.

This doesn't make any sense to me.  Other instrumentation that I insert
uses swap and dup, and is not heavily penalized.  I actually rewrote the
transformation to use local variables to avoid the use of swaps and
dups, but that turned out to be even slower!

To be even more specific again, here is the bytecode for the forB_impl()
method for the different versions.  (All the other methods are similar).

Original Method

   // Method descriptor #29 (Ljava/lang/Object;I)V
   // Stack: 2, Locals: 4
   private void forB_impl(java.lang.Object lock, int v);
      0  aload_1 [lock]
      1  dup
      2  astore_3
      3  monitorenter
      4  aload_0 [this]
      5  iload_2 [v]
      6  putfield test.Shared.fromAforB : int [31]
      9  aload_0 [this]
     10  iconst_1
     11  putfield test.Shared.hasForB : boolean [16]
     14  aload_1 [lock]
     15  invokevirtual java.lang.Object.notifyAll() : void [33]
     18  aload_3
     19  monitorexit
     20  goto 26
     23  aload_3
     24  monitorexit
     25  athrow
     26  return
       Exception Table:
         [pc: 4, pc: 20] -> 23 when : any
         [pc: 23, pc: 25] -> 23 when : any
       Line numbers:
         [pc: 0, line: 15]
         [pc: 4, line: 16]
         [pc: 9, line: 17]
         [pc: 14, line: 18]
         [pc: 18, line: 15]
         [pc: 26, line: 20]
       Local variable table:
         [pc: 0, pc: 27] local: this index: 0 type: test.Shared
         [pc: 0, pc: 27] local: lock index: 1 type: java.lang.Object
         [pc: 0, pc: 27] local: v index: 2 type: int

Pseudo-Instrumented Method

   // Method descriptor #29 (Ljava/lang/Object;I)V
   // Stack: 5, Locals: 5
   private void forB_impl(java.lang.Object lock, int v);
      0  aload_1 [lock]
      1  aload_1 [lock]
      2  aload_0 [this]
      3  if_acmpne 10
      6  iconst_1
      7  goto 11
     10  iconst_0
     11  aload_1 [lock]
     12  ldc <Class testInst.Shared> [1]
     14  if_acmpne 21
     17  iconst_1
     18  goto 22
     21  iconst_0
     22  ldc <Class testInst.Shared> [1]
     24  iconst_1
     25  invokestatic testInst.Store.beforeStart(java.lang.Object,  
boolean, boolean, java.lang.Class, int) : void [31]
     28  aload_1 [lock]
     29  dup
     30  astore_3
     31  monitorenter
     32  aload_1 [lock]
     33  ldc <Class testInst.Shared> [1]
     35  iconst_1
     36  invokestatic testInst.Store.afterStart(java.lang.Object,  
java.lang.Class, int) : void [37]
     39  aload_0 [this]
     40  iload_2 [v]
     41  putfield testInst.Shared.fromAforB : int [41]
     44  aload_0 [this]
     45  iconst_1
     46  putfield testInst.Shared.hasForB : boolean [16]
     49  aload_1 [lock]
     50  invokevirtual java.lang.Object.notifyAll() : void [43]
     53  aload_3
     54  monitorexit
     55  goto 74
     58  aload_3
     59  monitorexit
     60  athrow
     61  astore 4
     63  aload_0 [this]
     64  ldc <Class testInst.Shared> [1]
     66  bipush 10
     68  invokestatic testInst.Store.afterEnd(java.lang.Object,  
java.lang.Class, int) : void [46]
     71  aload 4
     73  athrow
     74  aload_0 [this]
     75  ldc <Class testInst.Shared> [1]
     77  bipush 10
     79  invokestatic testInst.Store.afterEnd(java.lang.Object,  
java.lang.Class, int) : void [46]
     82  return
       Exception Table:
         [pc: 32, pc: 55] -> 58 when : any
         [pc: 58, pc: 60] -> 58 when : any
         [pc: 0, pc: 61] -> 61 when : any
       Line numbers:
         [pc: 0, line: 16]
         [pc: 28, line: 17]
         [pc: 32, line: 18]
         [pc: 39, line: 19]
         [pc: 44, line: 20]
         [pc: 49, line: 21]
         [pc: 53, line: 17]
         [pc: 61, line: 23]
         [pc: 63, line: 24]
         [pc: 71, line: 25]
         [pc: 74, line: 24]
         [pc: 82, line: 26]
       Local variable table:
         [pc: 0, pc: 83] local: this index: 0 type: testInst.Shared
         [pc: 0, pc: 83] local: lock index: 1 type: java.lang.Object
         [pc: 0, pc: 83] local: v index: 2 type: int

Instrumented Method

   // Method descriptor #29 (Ljava/lang/Object;I)V
   // Stack: 9, Locals: 4
   private void forB_impl(java.lang.Object lock, int v);
      0  aload_1 [lock]
      1  dup
      2  astore_3
      3  dup
      4  dup
      5  ldc <Class test.Shared> [1]
      7  dup_x2
      8  swap
      9  dup_x1
     10  dup
     11  aload_0 [this]
     12  if_acmpne 19
     15  iconst_1
     16  goto 20
     19  iconst_0
     20  dup_x2
     21  pop
     22  swap
     23  dup_x1
     24  if_acmpne 31
     27  iconst_1
     28  goto 32
     31  iconst_0
     32  swap
     33  bipush 15
     35  invokestatic  
com.surelogic._flashlight.rewriter.test.EmptyStore.beforeIntrinsicLockAc 
quisition(java.lang.Object, boolean, boolean, java.lang.Class, int) :  
void [67]
     38  monitorenter
     39  bipush 15
     41  invokestatic  
com.surelogic._flashlight.rewriter.test.EmptyStore.afterIntrinsicLockAcq 
uisition(java.lang.Object, java.lang.Class, int) : void [71]
     44  aload_0 [this]
     45  iload_2 [v]
     46  putfield test.Shared.fromAforB : int [31]
     49  aload_0 [this]
     50  iconst_1
     51  putfield test.Shared.hasForB : boolean [16]
     54  aload_1 [lock]
     55  invokevirtual java.lang.Object.notifyAll() : void [33]
     58  aload_3
     59  dup
     60  monitorexit
     61  ldc <Class test.Shared> [1]
     63  bipush 15
     65  invokestatic  
com.surelogic._flashlight.rewriter.test.EmptyStore.afterIntrinsicLockRel 
ease(java.lang.Object, java.lang.Class, int) : void [74]
     68  goto 82
     71  aload_3
     72  dup
     73  monitorexit
     74  ldc <Class test.Shared> [1]
     76  bipush 15
     78  invokestatic  
com.surelogic._flashlight.rewriter.test.EmptyStore.afterIntrinsicLockRel 
ease(java.lang.Object, java.lang.Class, int) : void [74]
     81  athrow
     82  return
       Exception Table:
         [pc: 44, pc: 68] -> 71 when : any
         [pc: 71, pc: 81] -> 71 when : any
       Line numbers:
         [pc: 0, line: 15]
         [pc: 44, line: 16]
         [pc: 49, line: 17]
         [pc: 54, line: 18]
         [pc: 58, line: 15]
         [pc: 82, line: 20]
       Local variable table:
         [pc: 0, pc: 83] local: this index: 0 type: test.Shared
         [pc: 0, pc: 83] local: lock index: 1 type: java.lang.Object
         [pc: 0, pc: 83] local: v index: 2 type: int

In summary, I am completely stumped as to why there is a performance
difference between my bytecode-instrumented class (33 seconds) and my
Java-instrumented version (39 seconds).  Is the JIT doing something that
I'm not aware of?  Am I running into somekind of weird memory effect?
Can someone please shed some light on this situation?

--Aaron



More information about the Concurrency-interest mailing list