[concurrency-interest] JLS 17.7 Non-atomic treatment of double and long : Android

Nathan Reynolds nathan.reynolds at oracle.com
Tue Apr 30 14:52:14 EDT 2013


You could use cmpxchg8b for reading as well since it returns the 
original value in the memory location.  Simply set the expected and 
update values to be the same so that the memory location is not changed.
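
A rough Java-level analogue of that read, as a sketch only. It assumes
Java 9 or later, where AtomicLong.compareAndExchange returns the value it
found in the variable. The class and method names (CasRead, readViaCas)
are made up for illustration, and nothing here guarantees that a given
32-bit JVM actually emits cmpxchg8b for it.

```java
import java.util.concurrent.atomic.AtomicLong;

class CasRead {
    // Reads the current value with a compare-and-exchange whose expected
    // and new values are identical, so the stored value never changes.
    static long readViaCas(AtomicLong cell) {
        // 0L is an arbitrary choice; any constant works as long as the
        // expected value and the new value are the same.
        // If the cell holds 0L, 0L is written back (no visible change).
        // Otherwise the exchange fails and the value found is returned.
        return cell.compareAndExchange(0L, 0L);
    }

    public static void main(String[] args) {
        AtomicLong cell = new AtomicLong(0xFFFFFFFF00000000L);
        System.out.println(Long.toHexString(readViaCas(cell))); // ffffffff00000000
        System.out.println(Long.toHexString(cell.get()));       // ffffffff00000000
    }
}
```

In ordinary Java code cell.get() is the normal way to read the value; the
point of the sketch is only that a compare-and-exchange with identical
expected and new values doubles as an atomic 64-bit read.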

Nathan Reynolds 
<http://psr.us.oracle.com/wiki/index.php/User:Nathan_Reynolds> | 
Architect | 602.333.9091
Oracle PSR Engineering <http://psr.us.oracle.com/> | Server Technology
On 4/30/2013 10:05 AM, Vitaly Davidovich wrote:
>
> Right, writes would be atomic but not reads: read one half, another 
> core updates the value, then read the 2nd half from a different value.
>
> As for SSE, yeah it's possible, but is that true? JIT skips integer 
> registers for scalar long operations? I find that hard to believe as 
> it would miss out on large register file/renaming opportunities.
>
> On Apr 30, 2013 12:58 PM, "Nathan Reynolds" 
> <nathan.reynolds at oracle.com> wrote:
>
>     The processor can do whatever it wants in registers without other
>     threads being able to see intermediate values.  Registers are
>     private to the hardware thread.  So, we can use multiple
>     instructions to load the ecx:ebx registers and then execute the
>     cmpxchg8b to do a single write to globally visible cache.
>
>     On 4/30/2013 9:53 AM, Vitaly Davidovich wrote:
>>
>>     But this requires the src value to be in ecx:ebx so how would you
>>     load it there without two loads (and possibly observe tearing) in
>>     the first place?
>>
>>     On Apr 30, 2013 12:45 PM, "Nathan Reynolds"
>>     <nathan.reynolds at oracle.com> wrote:
>>
>>         On 32-bit x86, the cmpxchg8b can be used to write a long in 1
>>         instruction.  This instruction has been "present on most
>>         post-80486 processors" (Wikipedia).  There might be cheaper
>>         ways to write a long but there is at least 1 way.
>>
>>         On 4/30/2013 9:37 AM, Vitaly Davidovich wrote:
>>>
>>>         Curious how x86 would move a long in 1 instruction? There's
>>>         no memory-to-memory mov, so it has to go through a register,
>>>         and thus needs 2 registers (and hence a split).  Am I missing
>>>         something?
>>>
>>>         On Apr 30, 2013 12:23 PM, "Nathan Reynolds"
>>>         <nathan.reynolds at oracle.com> wrote:
>>>
>>>             You might want to print the assembly using HotSpot (and
>>>             OpenJDK?).  If the assembly uses 1 instruction to do
>>>             the write, then no splitting can ever happen (because
>>>             alignment takes care of cache line splits).  If the
>>>             assembly uses 2 instructions to do the write, then it
>>>             is only a matter of timing.
>>>
>>>             With a single processor system, you are waiting for the
>>>             thread's quantum to end right after the first
>>>             instruction but before the second instruction.  This
>>>             will allow the other thread to see the split write.
>>>
>>>             With a dual processor system, the reader thread simply
>>>             has to get a copy of the cache line after the first
>>>             write and before the second write.  This is much easier
>>>             to do.
>>>
>>>             HotSpot will do a lot of optimizations on single
>>>             processor systems.  For example, it gets rid of the
>>>             "lock" prefix in front of atomic instructions since the
>>>             instruction's execution can't be split. It also doesn't
>>>             output memory fences. Both of these give good
>>>             performance boosts.  I wonder if with one processor,
>>>             OpenJDK is using 2 instructions to do the write whereas
>>>             with multiple processors it plays it safe and uses 1
>>>             instruction.
>>>
>>>             Note: If you disable all of the processors but 1 and
>>>             then start HotSpot, HotSpot will start in single
>>>             processor mode.  If you then enable those processors
>>>             while HotSpot is running, a lot of things break and the
>>>             JVM will crash.  Because single processor systems are
>>>             rare, the default might be changed to assume multiple
>>>             processors unless the command line specifies 1 processor.
>>>
>>>             On 4/30/2013 8:48 AM, Tim Halloran wrote:
>>>>             Aleksey, correct -- more trials show what you
>>>>             predicted. Thanks for the nudge.
>>>>
>>>>             Mark,
>>>>
>>>>             Very helpful. In fact, we are seeing quick failures
>>>>             except in the dual-processor case: on dual-processor
>>>>             hardware or a VM (VirtualBox) we have yet to get a
>>>>             failure.  The two programs attached are what I'm
>>>>             running.  I stripped out my benchmark framework (so
>>>>             they are easy to run on OpenJDK but not on Android).
>>>>             The difference is that one uses two threads (one
>>>>             writer, one reader) and the other three (two writers,
>>>>             one reader); both seem to produce similar results.
>>>>
>>>>             With one processor on OpenJDK 1.6.0_27, I see the split
>>>>             write almost immediately. With two we can't get a
>>>>             failure yet. We get more failures as the processor
>>>>             count goes up, but after a few failures we don't get
>>>>             any more (the program tries to get 10 to happen); we
>>>>             can't get to 10.
>>>>
>>>>             It seems that while this can happen on OpenJDK it is
>>>>             rarer than on Android where ten failures takes less
>>>>             than a second to happen.
>>>>
>>>>             Best, Tim
>>>>
>>>>
>>>>
>>>>             On Tue, Apr 30, 2013 at 11:26 AM, Mark Thornton
>>>>             <mthornton at optrak.com> wrote:
>>>>
>>>>                 On 30/04/13 15:36, Tim Halloran wrote:
>>>>>                 On Mon, Apr 29, 2013 at 4:59 PM, Aleksey Shipilev
>>>>>                 <aleksey.shipilev at oracle.com> wrote:
>>>>>
>>>>>                     Yes, that's exactly what I had in mind:
>>>>>                      a. Declare "long a"
>>>>>                      b. Ramp up two threads.
>>>>>                      c. Make thread 1 write 0L and -1L over and
>>>>>                     over to field $a
>>>>>                      d. Make thread 2 observe the field a, and
>>>>>                     count the observed values
>>>>>                      e. ...
>>>>>                      f. PROFIT!
>>>>>
>>>>>                     P.S. It is important to do some action on the
>>>>>                     value read in thread 2, so that the read is not
>>>>>                     hoisted from the loop, since $a is not supposed
>>>>>                     to be volatile.
>>>>>
>>>>>                     -Aleksey.
>>>>>
>>>>>
>>>>>                 This discussion is getting a bit far afield, I
>>>>>                 guess, but to get back onto the topic: I followed
>>>>>                 Aleksey's advice and wrote an implementation that
>>>>>                 tests this.  I used two separate threads to write
>>>>>                 0L and -1L into the long field "a", but that is the
>>>>>                 only real change I made. (I already had some
>>>>>                 scaffolding code to run things on Android or
>>>>>                 desktop Java.)
>>>>>
>>>>>                 *Android: splits writes to longs into two parts.*
>>>>>
>>>>>                 On a Samsung Galaxy II with Android 4.0.4 and a Nexus
>>>>>                 4 phone with Android 4.2.2 I saw non-atomic
>>>>>                 treatment of long. The value -4294967296
>>>>>                 (0xFFFFFFFF00000000) showed up as well as
>>>>>                 4294967295 (0x00000000FFFFFFFF).
>>>>>
>>>>>                 So it looks like Android does not follow the (albeit
>>>>>                 optional) advice in the Java language
>>>>>                 specification about this.
>>>>>
>>>>>                 *JDK: DOES NOT split writes to longs into two
>>>>>                 parts (even 32-bit implementations)*
>>>>>
>>>>>                 Of course we couldn't get this to happen on any
>>>>>                 64-bit JVM, but when we tried it under Linux on
>>>>>                 32-bit OpenJDK 1.7.0_21 it did NOT happen either. The
>>>>>                 32-bit JVM implementations follow the
>>>>>                 recommendation of the Java language specification.
>>>>>
>>>>>                 An interesting curio. I wonder how many programmers
>>>>>                 are going to lose sleep tracking down crashes caused
>>>>>                 by this in "working" Java code moved from desktop
>>>>>                 Java onto Android.
>>>>>
>>>>>
>>>>
>>>>                 Last time I tried this sort of test, a split write
>>>>                 would be observed in under a second on a true dual
>>>>                 processor. However, with only one processor
>>>>                 available, it would typically take around 20
>>>>                 minutes. So you might have to run a very long test
>>>>                 to have any real confidence in the lack of splitting.
>>>>
>>>>                 Mark Thornton
>>>>
>>>>
>>>>                 _______________________________________________
>>>>                 Concurrency-interest mailing list
>>>>                 Concurrency-interest at cs.oswego.edu
>>>>                 http://cs.oswego.edu/mailman/listinfo/concurrency-interest
>>>>
>>>
>>>
>>
>
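
The test Aleksey outlines in the quoted thread (one thread writing 0L and
-1L to a plain long field, another watching for anything else) can be
sketched roughly as below. This is not the program Tim attached; the
class name, the run() helper, and the time-bounded loop are assumptions
made for illustration. A JIT remains free to hoist the plain read or
merge the stores, and a 64-bit JVM is not expected to show tearing at all,
so a count of zero proves little.

```java
class LongTearingTest {
    // Deliberately NOT volatile: JLS 17.7 permits a write to a
    // non-volatile long to be treated as two separate 32-bit writes.
    static long a;

    // The writer only ever stores 0L or -1L, so any other value mixes
    // the halves of two different writes, e.g. 0xFFFFFFFF00000000
    // (high half of -1L, low half of 0L).
    static boolean isTorn(long v) {
        return v != 0L && v != -1L;
    }

    // Runs one writer and one reader for roughly runMillis milliseconds
    // and returns how many torn values the reader observed.
    static long run(long runMillis) {
        Thread writer = new Thread(() -> {
            long next = 0L;
            while (!Thread.currentThread().isInterrupted()) {
                a = next;      // one plain 64-bit store per iteration
                next = ~next;  // alternates between 0L and -1L
            }
        });
        writer.setDaemon(true);
        writer.start();

        long torn = 0;
        long deadline = System.nanoTime() + runMillis * 1_000_000L;
        while (System.nanoTime() < deadline) {
            long v = a;        // plain read; may tear on a 32-bit VM
            if (isTorn(v)) {
                torn++;        // act on the value that was read
            }
        }

        writer.interrupt();
        try {
            writer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return torn;
    }

    public static void main(String[] args) {
        System.out.println("torn reads observed: " + run(2000));
    }
}
```

As Mark points out in the thread, a short run that reports zero torn
reads says little; on a single processor the window is only hit when the
writer is preempted between the two halves of a store.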
