[concurrency-interest] JLS 17.7 Non-atomic treatment of double and long : Android

Vitaly Davidovich vitalyd at gmail.com
Tue Apr 30 14:17:00 EDT 2013


OK, well, I've been talking (and I thought others were too) about the non-volatile
cases, where I'd find this hoop jumping to be costly.

Sent from my phone
On Apr 30, 2013 2:07 PM, "Stanimir Simeonoff" <stanimir at riflexo.com> wrote:

>
>
> On Tue, Apr 30, 2013 at 8:51 PM, Vitaly Davidovich <vitalyd at gmail.com>wrote:
>
>> By the way, if you make x non-volatile, what changes? Do just the lock addl
>> fencing instructions go away?
>>
> No, no. It uses standard x86:
>
>   0x00938b7e: add    0x8(%esi),%ecx
>   0x00938b81: adc    0xc(%esi),%ebx     ;*ladd
>                                         ; - t1.TearLong::test at 18 (line 8)
>   0x00938b84: mov    %ecx,0x8(%esi)
>   0x00938b87: mov    %ebx,0xc(%esi)     ;*putfield x
>                                         ; - t1.TearLong::test at 22 (line 9)
>
> As I've said, I know that from experience. (Note that the putfield above is now
> two separate 32-bit mov stores -- exactly the split write.)
>
>
>> Sent from my phone
>> On Apr 30, 2013 1:31 PM, "Stanimir Simeonoff" <stanimir at riflexo.com>
>> wrote:
>>
>>> Here is some proof.
>>>
>>> Stanimir
>>> -----------
>>> package t1;
>>>
>>> public class TearLong {
>>>     private volatile long x;
>>>     long test(){
>>>         for (int i=0;i<20000;i++){
>>>             long n=x;
>>>             n+=System.currentTimeMillis()&0xff;
>>>             x=n;
>>>         }
>>>         return x;
>>>     }
>>>     public static void main(String[] args) {
>>>         System.out.println(new TearLong().test());
>>>         System.out.println(new TearLong().test());
>>>     }
>>> }
>>>
>>> Decoding compiled method 0x00938a08:
>>> Code:
>>> [Disassembling for mach='i386']
>>> [Entry Point]
>>> [Verified Entry Point]
>>> [Constants]
>>>   # {method} 'test' '()J' in 't1/TearLong'
>>>   0x00938b00: int3
>>>   0x00938b01: xchg   %ax,%ax
>>>   0x00938b04: mov    %eax,0xffffd000(%esp)
>>>   0x00938b0b: push   %ebp
>>>   0x00938b0c: sub    $0x18,%esp
>>>   0x00938b12: mov    0x8(%ecx),%ebx
>>>   0x00938b15: mov    0xc(%ecx),%esi
>>>   0x00938b18: mov    %ecx,(%esp)
>>>   0x00938b1b: call   0x6dbeed90         ;   {runtime_call}
>>>   0x00938b20: mov    0x4(%esi),%ebp     ; implicit exception: dispatches
>>> to 0x00938bdd
>>>   0x00938b23: cmp    $0x3b6bd38,%ebp    ;   {oop('t1/TearLong')}
>>>   0x00938b29: jne    0x00938bcb         ;*aload_0
>>>                                         ; - t1.TearLong::test at 5 (line 7)
>>>   0x00938b2f: inc    %ebx               ;*iinc
>>>                                         ; - t1.TearLong::test at 25 (line
>>> 6)
>>>   0x00938b30: movsd  0x8(%esi),%xmm0
>>>   0x00938b35: movd   %xmm0,%ebp
>>>   0x00938b39: psrlq  $0x20,%xmm0
>>>   0x00938b3e: movd   %xmm0,%edi         ;*getfield x
>>>                                         ; - t1.TearLong::test at 6 (line 7)
>>>   0x00938b42: call   0x6dce22f0         ;   {runtime_call}
>>>   0x00938b47: and    $0xff,%eax
>>>   0x00938b4d: and    $0x0,%edx
>>>   0x00938b50: add    %ebp,%eax
>>>   0x00938b52: adc    %edi,%edx
>>>   0x00938b54: cmp    0x8(%esi),%eax
>>>   0x00938b57: movd   %eax,%xmm1
>>>   0x00938b5b: movd   %edx,%xmm0
>>>   0x00938b5f: punpckldq %xmm0,%xmm1
>>>   0x00938b63: movsd  %xmm1,0x8(%esi)
>>>   0x00938b68: lock addl $0x0,(%esp)     ;*putfield x
>>>                                         ; - t1.TearLong::test at 22 (line
>>> 9)
>>>   0x00938b6d: jmp    0x00938b9c
>>>   0x00938b6f: nop                       ;*getfield x
>>>                                         ; - t1.TearLong::test at 6 (line 7)
>>>   0x00938b70: call   0x6dce22f0         ;*putfield x
>>>                                         ; - t1.TearLong::test at 22 (line
>>> 9)
>>>                                         ;   {runtime_call}
>>>   0x00938b75: inc    %ebx               ;*iinc
>>>                                         ; - t1.TearLong::test at 25 (line
>>> 6)
>>>   0x00938b76: and    $0xff,%eax
>>>   0x00938b7c: and    $0x0,%edx
>>>   0x00938b7f: add    %ebp,%eax
>>>   0x00938b81: adc    %edi,%edx
>>>   0x00938b83: cmp    0x8(%esi),%eax
>>>   0x00938b86: movd   %eax,%xmm1
>>>   0x00938b8a: movd   %edx,%xmm0
>>>   0x00938b8e: punpckldq %xmm0,%xmm1
>>>   0x00938b92: movsd  %xmm1,0x8(%esi)
>>>   0x00938b97: lock addl $0x0,(%esp)     ; OopMap{esi=Oop off=156}
>>>                                         ;*if_icmplt
>>>                                         ; - t1.TearLong::test at 32 (line
>>> 6)
>>>   0x00938b9c: test   %edi,0x8c0000      ;*if_icmplt
>>>                                         ; - t1.TearLong::test at 32 (line
>>> 6)
>>>                                         ;   {poll}
>>>   0x00938ba2: movsd  0x8(%esi),%xmm0
>>>   0x00938ba7: movd   %xmm0,%ebp
>>>   0x00938bab: psrlq  $0x20,%xmm0
>>>   0x00938bb0: movd   %xmm0,%edi
>>>   0x00938bb4: cmp    $0x4e20,%ebx
>>>   0x00938bba: jl     0x00938b70         ;*getfield x
>>>                                         ; - t1.TearLong::test at 36 (line
>>> 11)
>>>   0x00938bbc: mov    %ebp,%eax
>>>   0x00938bbe: mov    %edi,%edx
>>>   0x00938bc0: add    $0x18,%esp
>>>   0x00938bc3: pop    %ebp
>>>   0x00938bc4: test   %eax,0x8c0000      ;   {poll_return}
>>>   0x00938bca: ret
>>>   0x00938bcb: mov    $0xffffffad,%ecx
>>>   0x00938bd0: mov    %esi,%ebp
>>>   0x00938bd2: mov    %ebx,0x4(%esp)
>>>   0x00938bd6: nop
>>>   0x00938bd7: call   0x0091c700         ; OopMap{ebp=Oop off=220}
>>>                                         ;*aload_0
>>>                                         ; - t1.TearLong::test at 5 (line 7)
>>>                                         ;   {runtime_call}
>>>   0x00938bdc: int3                      ;*getfield x
>>>                                         ; - t1.TearLong::test at 6 (line 7)
>>>   0x00938bdd: mov    $0xfffffff6,%ecx
>>>   0x00938be2: nop
>>>   0x00938be3: call   0x0091c700         ; OopMap{off=232}
>>>                                         ;*getfield x
>>>                                         ; - t1.TearLong::test at 6 (line 7)
>>>                                         ;   {runtime_call}
>>>   0x00938be8: int3                      ;*getfield x
>>>                                         ; - t1.TearLong::test at 6 (line 7)
>>> ....
>>>
>>>
>>>
>>> On Tue, Apr 30, 2013 at 8:15 PM, Stanimir Simeonoff <
>>> stanimir at riflexo.com> wrote:
>>>
>>>> .
>>>>
>>>>> As for SSE, yeah it's possible, but is that true? JIT skips integer
>>>>> registers for scalar long operations? I find that hard to believe as it
>>>>> would miss out on large register file/renaming opportunities.
>>>>>
>>>> I know that from looking at the assembly. I can still check with the
>>>> current version.
>>>>
>>>> Stanimir
>>>>
>>>>
>>>>
>>>>> Sent from my phone
>>>>> On Apr 30, 2013 12:58 PM, "Nathan Reynolds" <
>>>>> nathan.reynolds at oracle.com> wrote:
>>>>>
>>>>>>  The processor can do whatever it wants in registers without other
>>>>>> threads being able to see intermediate values.  Registers are private to
>>>>>> the hardware thread.  So, we can use multiple instructions to load the
>>>>>> ecx:ebx registers and then execute the cmpxchg8b to do a single write to
>>>>>> globally visible cache.
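>>>>>>
>>>>>> (As a Java-level aside -- a minimal sketch, not code from this thread's test
>>>>>> programs: the way to ask for a single atomic 64-bit write from Java is to make
>>>>>> the field volatile, whose long reads and writes JLS 17.7 requires to be atomic,
>>>>>> or to use java.util.concurrent.atomic.AtomicLong.)
>>>>>>
>>>>>> import java.util.concurrent.atomic.AtomicLong;
>>>>>>
>>>>>> class Holder {
>>>>>>     long plain;           // a 32-bit VM may legally split this into two 32-bit accesses
>>>>>>     volatile long atomic; // reads and writes must be atomic (JLS 17.7)
>>>>>>     final AtomicLong counter = new AtomicLong(); // atomic, plus CAS/add operations
>>>>>> }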
>>>>>>
>>>>>> Nathan Reynolds <http://psr.us.oracle.com/wiki/index.php/User:Nathan_Reynolds> | Architect |
>>>>>> 602.333.9091
>>>>>> Oracle PSR Engineering <http://psr.us.oracle.com/> | Server
>>>>>> Technology
>>>>>>  On 4/30/2013 9:53 AM, Vitaly Davidovich wrote:
>>>>>>
>>>>>> But this requires the src value to be in ecx:ebx, so how would you
>>>>>> load it there without two loads (and possibly observing tearing) in the first
>>>>>> place?
>>>>>>
>>>>>> Sent from my phone
>>>>>> On Apr 30, 2013 12:45 PM, "Nathan Reynolds" <
>>>>>> nathan.reynolds at oracle.com> wrote:
>>>>>>
>>>>>>>  On 32-bit x86, the cmpxchg8b can be used to write a long in 1
>>>>>>> instruction.  This instruction has been "present on most post-80486
>>>>>>> processors" (Wikipedia).  There might be cheaper ways to write a long but
>>>>>>> there is at least 1 way.
>>>>>>>
>>>>>>> Nathan Reynolds <http://psr.us.oracle.com/wiki/index.php/User:Nathan_Reynolds> | Architect |
>>>>>>> 602.333.9091
>>>>>>> Oracle PSR Engineering <http://psr.us.oracle.com/> | Server
>>>>>>> Technology
>>>>>>>  On 4/30/2013 9:37 AM, Vitaly Davidovich wrote:
>>>>>>>
>>>>>>> Curious how x86 would move a long in 1 instruction? There's no
>>>>>>> memory-to-memory mov, so it has to go through a register, and thus needs 2
>>>>>>> registers (and hence a split).  Am I missing something?
>>>>>>>
>>>>>>> Sent from my phone
>>>>>>> On Apr 30, 2013 12:23 PM, "Nathan Reynolds" <
>>>>>>> nathan.reynolds at oracle.com> wrote:
>>>>>>>
>>>>>>>>  You might want to print the assembly using HotSpot (and
>>>>>>>> OpenJDK?).  If the assembly uses 1 instruction to do the write, then no
>>>>>>>> splitting can ever happen (because alignment takes care of cache line
>>>>>>>> splits).  If the assembly uses 2 instructions to do the write, then it is
>>>>>>>> only a matter of timing.
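>>>>>>>>
>>>>>>>>  (For reference -- a typical invocation, assuming the hsdis disassembler
>>>>>>>> plugin is on the JVM's library path:
>>>>>>>>
>>>>>>>>    java -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly t1.TearLong
>>>>>>>>
>>>>>>>> Adding -XX:CompileCommand=print,t1/TearLong.test limits the output to the
>>>>>>>> one method of interest.)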
>>>>>>>>
>>>>>>>> With a single processor system, you are waiting for the thread's
>>>>>>>> quantum to end right after the first instruction but before the second
>>>>>>>> instruction.  This will allow the other thread to see the split write.
>>>>>>>>
>>>>>>>> With a dual processor system, the reader thread simply has to get a
>>>>>>>> copy of the cache line after the first write and before the second write.
>>>>>>>> This is much easier to do.
>>>>>>>>
>>>>>>>> HotSpot will do a lot of optimizations on single processor
>>>>>>>> systems.  For example, it gets rid of the "lock" prefix in front of atomic
>>>>>>>> instructions since the instruction's execution can't be split.  It also
>>>>>>>> doesn't output memory fences.  Both of these give good performance boosts.
>>>>>>>> I wonder if with one processor, OpenJDK is using 2 instructions to do the
>>>>>>>> write whereas with multiple processors it plays it safe and uses 1
>>>>>>>> instruction.
>>>>>>>>
>>>>>>>> Note: If you disable all of the processors but 1 and then start
>>>>>>>> HotSpot, HotSpot will start in single processor mode.  If you then enable
>>>>>>>> those processors while HotSpot is running, a lot of things break and the
>>>>>>>> JVM will crash.  Because single processor systems are rare, the default
>>>>>>>> might be changed to assume multiple processors unless the command line
>>>>>>>> specifies 1 processor.
>>>>>>>>
>>>>>>>> Nathan Reynolds <http://psr.us.oracle.com/wiki/index.php/User:Nathan_Reynolds> | Architect |
>>>>>>>> 602.333.9091
>>>>>>>> Oracle PSR Engineering <http://psr.us.oracle.com/> | Server
>>>>>>>> Technology
>>>>>>>>  On 4/30/2013 8:48 AM, Tim Halloran wrote:
>>>>>>>>
>>>>>>>>  Aleksey, correct -- more trials show what you predicted. Thanks
>>>>>>>> for the nudge.
>>>>>>>>
>>>>>>>>  Mark,
>>>>>>>>
>>>>>>>>  Very helpful. In fact, we are seeing quick failures except for
>>>>>>>> the dual-processor case -- on dual-processor hardware or a VM (VirtualBox)
>>>>>>>> we have yet to get a failure.  The two programs attached are what I'm
>>>>>>>> running.  I stripped out my benchmark framework (so they are easy to run on
>>>>>>>> OpenJDK but not on Android).  The difference is that one uses two threads
>>>>>>>> (one writer, one reader) and the other three (two writers, one reader) -- both
>>>>>>>> seem to produce similar results.
>>>>>>>>
>>>>>>>>  With one processor, on OpenJDK 1.6.0_27, I see the split write almost
>>>>>>>> immediately. With two we can't get a failure; yet we get more failures as the
>>>>>>>> processor count goes up -- but after a few failures, we don't get any more
>>>>>>>> (the program tries to get 10 to happen)... we can't get to 10.
>>>>>>>>
>>>>>>>>  It seems that while this can happen on OpenJDK, it is rarer than
>>>>>>>> on Android, where ten failures take less than a second to happen.
>>>>>>>>
>>>>>>>>  Best, Tim
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Apr 30, 2013 at 11:26 AM, Mark Thornton <
>>>>>>>> mthornton at optrak.com> wrote:
>>>>>>>>
>>>>>>>>>   On 30/04/13 15:36, Tim Halloran wrote:
>>>>>>>>>
>>>>>>>>> On Mon, Apr 29, 2013 at 4:59 PM, Aleksey Shipilev <
>>>>>>>>> aleksey.shipilev at oracle.com> wrote:
>>>>>>>>>
>>>>>>>>>> Yes, that's exactly what I had in mind:
>>>>>>>>>>  a. Declare "long a"
>>>>>>>>>>  b. Ramp up two threads.
>>>>>>>>>>  c. Make thread 1 write 0L and -1L over and over to field $a
>>>>>>>>>>  d. Make thread 2 observe the field a, and count the observed
>>>>>>>>>> values
>>>>>>>>>>  e. ...
>>>>>>>>>>  f. PROFIT!
>>>>>>>>>>
>>>>>>>>>> P.S. It is important to do some action on the value read in thread 2,
>>>>>>>>>> so that it does not get hoisted from the loop, since $a is not supposed
>>>>>>>>>> to be volatile.
>>>>>>>>>>
>>>>>>>>>> -Aleksey.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>  This discussion is getting a bit far afield, I guess, but to get
>>>>>>>>> back on topic: I followed Aleksey's advice and wrote an
>>>>>>>>> implementation that tests this.  I used two separate threads to write 0L
>>>>>>>>> and -1L into the long field "a", but that is the only real change I made. (I
>>>>>>>>> already had some scaffolding code to run things on Android or desktop Java.)
>>>>>>>>>
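>>>>>>>>>  (A minimal sketch along the lines of Aleksey's recipe -- not the actual
>>>>>>>>> attached programs, and the field/class names here are made up -- would look
>>>>>>>>> something like this:)
>>>>>>>>>
>>>>>>>>> public class TearCheck {
>>>>>>>>>     static long a; // deliberately NOT volatile, so JLS 17.7 allows tearing
>>>>>>>>>
>>>>>>>>>     public static void main(String[] args) {
>>>>>>>>>         Thread writer = new Thread() {
>>>>>>>>>             public void run() {
>>>>>>>>>                 for (;;) {   // hammer the field with "all zeros" / "all ones"
>>>>>>>>>                     a = 0L;
>>>>>>>>>                     a = -1L;
>>>>>>>>>                 }
>>>>>>>>>             }
>>>>>>>>>         };
>>>>>>>>>         writer.setDaemon(true);
>>>>>>>>>         writer.start();
>>>>>>>>>
>>>>>>>>>         int torn = 0;
>>>>>>>>>         for (long i = 0; i < 2000000000L && torn < 10; i++) {
>>>>>>>>>             long v = a;                   // plain (non-volatile) read
>>>>>>>>>             if (v != 0L && v != -1L) {    // any other value is a split read/write
>>>>>>>>>                 torn++;
>>>>>>>>>                 System.out.println("torn: " + v + " (0x" + Long.toHexString(v) + ")");
>>>>>>>>>             }
>>>>>>>>>         }
>>>>>>>>>         System.out.println("torn values observed: " + torn);
>>>>>>>>>     }
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>>  (Counting/printing the observed value is the "action" Aleksey mentions, so the
>>>>>>>>> read is not optimized away; since a is not volatile the JIT may still hoist the
>>>>>>>>> read on some VMs, so an absence of failures is not conclusive.)
>>>>>>>>>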
>>>>>>>>>  *Android: splits writes to longs into two parts.*
>>>>>>>>>
>>>>>>>>>  On a Samsung Galaxy II with Android 4.0.4 and a Nexus 4 phone with
>>>>>>>>> Android 4.2.2 I saw non-atomic treatment of long. The value -4294967296
>>>>>>>>> (0xFFFFFFFF00000000) showed up, as well as 4294967295 (0x00000000FFFFFFFF).
>>>>>>>>>
>>>>>>>>>  So it looks like Android does not follow the (albeit optional)
>>>>>>>>> advice in the Java Language Specification about this.
>>>>>>>>>
>>>>>>>>>  *JDK: DOES NOT split writes to longs into two parts (even 32-bit
>>>>>>>>> implementations)*
>>>>>>>>>
>>>>>>>>>  Of course we couldn't get this to happen on any 64-bit JVM, but
>>>>>>>>> we also tried it out under Linux on 32-bit OpenJDK 1.7.0_21 and it does NOT
>>>>>>>>> happen. The 32-bit JVM implementations follow the recommendation of the Java
>>>>>>>>> language specification.
>>>>>>>>>
>>>>>>>>>  An interesting curio. I wonder how many crashes in "working"
>>>>>>>>> Java code moved from desktop Java onto Android will have programmers
>>>>>>>>> losing sleep before they track the cause down to this one.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>  Last time I tried this sort of test, a split write would be
>>>>>>>>> observed in under a second on a true dual processor. However, with only one
>>>>>>>>> processor available, it would typically take around 20 minutes. So you
>>>>>>>>> might have to run a very long test to have any real confidence in the lack
>>>>>>>>> of splitting.
>>>>>>>>>
>>>>>>>>> Mark Thornton
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> _______________________________________________
>>>>>>>>> Concurrency-interest mailing list
>>>>>>>>> Concurrency-interest at cs.oswego.edu
>>>>>>>>> http://cs.oswego.edu/mailman/listinfo/concurrency-interest
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>