[concurrency-interest] JLS 17.7 Non-atomic treatment of double and long : Android

Nathan Reynolds nathan.reynolds at oracle.com
Tue Apr 30 12:58:16 EDT 2013


The processor can do whatever it wants in registers without other 
threads being able to see intermediate values.  Registers are private to 
the hardware thread.  So, we can use multiple instructions to load the 
ecx:ebx registers and then execute the cmpxchg8b to do a single write to 
globally visible cache.
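
At the Java source level, the portable way to rule out the split is to
declare the field volatile (JLS 17.7 makes volatile long and double
accesses atomic) or to use java.util.concurrent.atomic.AtomicLong. A
minimal sketch follows; the class and method names are hypothetical, and
how a particular JVM compiles the store on 32-bit x86 (cmpxchg8b or
something cheaper) is an implementation detail this code does not control.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical holder class; the names are illustrative, not from the thread.
final class AtomicLongWrites {
    // JLS 17.7: reads and writes of a volatile long are always atomic.
    private volatile long volatileValue;

    // AtomicLong also provides atomic 64-bit reads and writes (plus CAS).
    private final AtomicLong atomicValue = new AtomicLong();

    // Stores the same value through both mechanisms.
    void write(long v) {
        volatileValue = v;
        atomicValue.set(v);
    }

    long readVolatile() {
        return volatileValue;
    }

    long readAtomic() {
        return atomicValue.get();
    }

    public static void main(String[] args) {
        AtomicLongWrites holder = new AtomicLongWrites();
        holder.write(-1L);
        System.out.println(Long.toHexString(holder.readVolatile()));
        System.out.println(Long.toHexString(holder.readAtomic()));
    }
}
```

A reader thread calling readVolatile() or readAtomic() sees either the
old or the new 64-bit value, never a mix of the two halves.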

Nathan Reynolds 
<http://psr.us.oracle.com/wiki/index.php/User:Nathan_Reynolds> | 
Architect | 602.333.9091
Oracle PSR Engineering <http://psr.us.oracle.com/> | Server Technology
On 4/30/2013 9:53 AM, Vitaly Davidovich wrote:
>
> But this requires the src value to be in ecx:ebx so how would you load 
> it there without two loads (and possibly observe tearing) in the first 
> place?
>
> On Apr 30, 2013 12:45 PM, "Nathan Reynolds" 
> <nathan.reynolds at oracle.com> wrote:
>
>     On 32-bit x86, the cmpxchg8b can be used to write a long in 1
>     instruction. This instruction has been "present on most post-80486
>     processors" (Wikipedia).  There might be cheaper ways to write a
>     long but there is at least 1 way.
>
>     On 4/30/2013 9:37 AM, Vitaly Davidovich wrote:
>>
>>     Curious how x86 would move a long in 1 instruction? There's no
>>     memory to memory mov so has to go through register, and thus
>>     needs 2 registers (and hence split).  Am I missing something?
>>
>>     On Apr 30, 2013 12:23 PM, "Nathan Reynolds"
>>     <nathan.reynolds at oracle.com> wrote:
>>
>>         You might want to print the assembly using HotSpot (and
>>         OpenJDK?).  If the assembly uses 1 instruction to do the
>>         write, then no splitting can ever happen (because alignment
>>         takes care of cache line splits).  If the assembly uses 2
>>         instructions to do the write, then it is only a matter of timing.
>>
>>         With a single processor system, you are waiting for the
>>         thread's quantum to end right after the first instruction but
>>         before the second instruction.  This will allow the other
>>         thread to see the split write.
>>
>>         With a dual processor system, the reader thread simply has to
>>         get a copy of the cache line after the first write and before
>>         the second write.  This is much easier to do.
>>
>>         HotSpot will do a lot of optimizations on single processor
>>         systems.  For example, it gets rid of the "lock" prefix in
>>         front of atomic instructions since the instruction's
>>         execution can't be split. It also doesn't output memory
>>         fences.  Both of these give good performance boosts.  I
>>         wonder if with one processor, OpenJDK is using 2 instructions
>>         to do the write whereas with multiple processors it plays it
>>         safe and uses 1 instruction.
>>
>>         Note: If you disable all of the processors but 1 and then
>>         start HotSpot, HotSpot will start in single processor mode. 
>>         If you then enable those processors while HotSpot is running,
>>         a lot of things break and the JVM will crash.  Because single
>>         processor systems are rare, the default might be changed to
>>         assume multiple processors unless the command line specifies
>>         1 processor.
>>
>>         On 4/30/2013 8:48 AM, Tim Halloran wrote:
>>>         Aleksey, correct -- more trials show what you predicted.
>>>         Thanks for the nudge.
>>>
>>>         Mark,
>>>
>>>         Very helpful, in fact, we are seeing quick failures except
>>>         for the dual-processor case -- on dual-processor hardware
>>>         or a VM (VirtualBox) we have yet to get a failure.  The two
>>>         programs attached are what I'm running.  I stripped out my
>>>         benchmark framework (so they are easy to run on OpenJDK but
>>>         not on Android).  The difference is that one uses two
>>>         threads (one writer one reader) the other three (two writers
>>>         one reader) -- both seem to produce similar results.
>>>
>>>         With one processor, on OpenJDK 1.6.0_27, I see the split write
>>>         almost immediately. With two we can't get a failure yet. We get
>>>         more failures as the processor count goes up, but after a
>>>         few failures we don't get any more (the program tries to
>>>         get 10 to happen) and we can't get to 10.
>>>
>>>         It seems that while this can happen on OpenJDK it is rarer
>>>         than on Android where ten failures takes less than a second
>>>         to happen.
>>>
>>>         Best, Tim
>>>
>>>
>>>
>>>         On Tue, Apr 30, 2013 at 11:26 AM, Mark Thornton
>>>         <mthornton at optrak.com> wrote:
>>>
>>>             On 30/04/13 15:36, Tim Halloran wrote:
>>>>             On Mon, Apr 29, 2013 at 4:59 PM, Aleksey Shipilev
>>>>             <aleksey.shipilev at oracle.com> wrote:
>>>>
>>>>                 Yes, that's exactly what I had in mind:
>>>>                  a. Declare "long a"
>>>>                  b. Ramp up two threads.
>>>>                  c. Make thread 1 write 0L and -1L over and over to
>>>>                 field $a
>>>>                  d. Make thread 2 observe the field a, and count
>>>>                 the observed values
>>>>                  e. ...
>>>>                  f. PROFIT!
>>>>
>>>>                 P.S. It is important to do some action on the value
>>>>                 read in thread 2, so that the read does not get
>>>>                 hoisted out of the loop, since $a is not supposed
>>>>                 to be volatile.
>>>>
>>>>                 -Aleksey.
>>>>
>>>>
>>>>             This discussion is getting a bit far afield, I guess,
>>>>             but to get back onto the topic: I followed Aleksey's
>>>>             advice and wrote an implementation that tests this.  I
>>>>             used two separate threads to write 0L and -1L into the
>>>>             long field "a", but that is the only real change I made.
>>>>             (I already had some scaffolding code to run things on
>>>>             Android or desktop Java.)
>>>>
>>>>             *Android: splits writes to longs into two parts.*
>>>>
>>>>             On a Samsung Galaxy II with Android 4.0.4 and a Nexus 4
>>>>             phone with Android 4.2.2 I saw non-atomic treatment of
>>>>             long. The value -4294967296 (0xFFFFFFFF00000000) showed
>>>>             up as well as 4294967295 (0x00000000FFFFFFFF).
>>>>
>>>>             So looks like Android does not follow the (albeit
>>>>             optional) advice in the Java language specification
>>>>             about this.
>>>>
>>>>             *JDK: DOES NOT split writes to longs into two parts
>>>>             (even 32-bit implementations)*
>>>>
>>>>             Of course we couldn't get this to happen on any 64-bit
>>>>             JVM, but we also tried it under Linux on 32-bit OpenJDK
>>>>             1.7.0_21, and it does NOT happen there either. The
>>>>             32-bit JVM implementations follow the recommendation of
>>>>             the Java language specification.
>>>>
>>>>             An interesting curio. I wonder how many programmers
>>>>             are going to lose sleep tracking down crashes in
>>>>             "working" Java code moved from desktop Java onto
>>>>             Android because of this one.
>>>>
>>>>
>>>
>>>             Last time I tried this sort of test, a split write would
>>>             be observed in under a second on a true dual processor.
>>>             However, with only one processor available, it would
>>>             typically take around 20 minutes. So you might have to
>>>             run a very long test to have any real confidence in the
>>>             lack of splitting.
>>>
>>>             Mark Thornton
>>>
>>>
>>>             _______________________________________________
>>>             Concurrency-interest mailing list
>>>             Concurrency-interest at cs.oswego.edu
>>>             http://cs.oswego.edu/mailman/listinfo/concurrency-interest
>>
>
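
For reference, the test quoted in this thread (Aleksey's steps a to d,
with Tim's variant of one writer thread per value) can be sketched as
below. This is an illustration, not the attached programs: the class name
TornLongProbe, the isTorn and probe helpers, and the volatile stop flag
are assumptions added so the sketch terminates. Any observed value other
than 0L or -1L, such as 0xFFFFFFFF00000000 or 0x00000000FFFFFFFF, can
only come from a split write.

```java
// Hypothetical sketch of the test described in the thread; names are illustrative.
final class TornLongProbe {
    // Deliberately NOT volatile: JLS 17.7 permits split writes to this field.
    static long a;

    // Volatile stop flag (an assumption of this sketch) so the threads terminate.
    static volatile boolean stop;

    // Only 0L and -1L are ever written, so anything else is a torn value.
    static boolean isTorn(long observed) {
        return observed != 0L && observed != -1L;
    }

    // Runs two writers and one reader for roughly `millis` ms and
    // returns the number of torn values the reader observed.
    static long probe(long millis) throws InterruptedException {
        stop = false;
        final long[] torn = new long[1];

        Thread zeroWriter = new Thread(() -> {
            while (!stop) {
                a = 0L;
            }
        });
        Thread onesWriter = new Thread(() -> {
            while (!stop) {
                a = -1L;
            }
        });
        Thread reader = new Thread(() -> {
            long count = 0;
            while (!stop) {
                long v = a; // acting on v keeps the read inside the loop
                if (isTorn(v)) {
                    count++;
                }
            }
            torn[0] = count; // published to the caller via join()
        });

        zeroWriter.start();
        onesWriter.start();
        reader.start();
        Thread.sleep(millis);
        stop = true;
        zeroWriter.join();
        onesWriter.join();
        reader.join();
        return torn[0];
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("torn reads observed: " + probe(1000));
    }
}
```

As Mark notes above, a zero count from a short run says little; on a
single processor the run may need to be much longer before a split shows up.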
