[concurrency-interest] Double Checked Locking in OpenJDK

Ulrich Grepel uli at grepel.de
Sat Aug 18 08:35:18 EDT 2012

Am 17.08.2012 20:08, schrieb Ruslan Cheremin:
> Yes, Ulrich, I have grid-like systems in mind when talking about the
> perspective of weakening hardware coherence.
> But in any case, one does not need to look that far. As I've already
> written, we already have some kind of
> weak-consistent-not-automatically-coherent memory in today's Intel CPUs
> -- in the form of registers and store buffers. This is a small layer atop
> the coherent memory, but this layer is, as far as I know, critical for
> overall performance, since it is important in hiding (well, sometimes
> hiding) still noticeable memory latency. Not only main memory (or,
> say, L3/L2 cache) latency, but also QPI latency, if the accessed memory
> location is owned by another core and needs to be re-owned, for
> example.
There's a big difference between registers and L1 cache - the registers 
are under full control of the compiler. Well, not entirely, considering 
register renaming and all that stuff. But the compiler DOES know that 
when a register is stored into memory, the value CAN be seen by all other 
threads. Immediately. It doesn't actually have to be stored into memory 
right away, not even into L1 cache - it might reside in a store buffer 
for a while - but all other cores will, due to the cache coherence 
mechanisms, learn about it IF they care about it.
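In Java terms, that visibility guarantee is what `volatile` buys you in the double-checked locking idiom the thread subject refers to. A minimal sketch (class and field names are illustrative, not from any particular codebase):

```java
// Double-checked locking with a volatile reference. The volatile write
// ensures the fully constructed Helper is visible to other threads no
// later than the reference itself; without it, another thread could see
// a non-null reference to a partially initialized object.
class Helper {
    int answer() { return 42; }
}

class Holder {
    private volatile Helper helper;

    Helper getHelper() {
        Helper h = helper;          // single volatile read on the fast path
        if (h == null) {
            synchronized (this) {
                h = helper;         // re-check under the lock
                if (h == null) {
                    h = new Helper();
                    helper = h;     // volatile write publishes safely
                }
            }
        }
        return h;
    }
}
```

The local variable `h` keeps the common path down to one volatile read instead of two.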

L1 cache though is more or less completely transparent to the compiler. 
Besides some configuration things (like "do not use L1 at all for these 
memory locations, because they are I/O areas" or similar) that usually 
are under OS control, there's not much a compiler can do to force writes 
from L1 to main memory. So either we will have weak synchronisation, or 
we will have cache coherency, or we will at least have some new assembly 
instructions for flushing a cache line. Since many CPUs from the 8088 
(or at least the 80386) to the Core i7, plus AMD and others, share their 
assembly instruction set, and since all of these have very different L1 
cache designs (from "none" to "write-through" or "copy-back", n-way 
associative with varying "n") and different cache levels (why stop at 
L1?), it is unlikely that any such instructions will surface. The best 
thing we might get would be a huge register file ("huge" as in 1024 or 
more registers), all of them available to the application. But that would 
mean a completely new architecture, which causes other problems, such as 
long-running context switches, plus a tradeoff between transistor count 
and performance: those transistors might be used for something else to 
gain performance, and larger register files will be slower.

> I see no reason why evolution of QPI will be somehow different from
> evolution of memory itself. Leaving away chance for some kind of
> hardware revolution (breakthrough, which would give us cheap and
> ultimate fast memory/QPI), it seems for me like we'll have same
> QPI-wall, as we've already have memory wall.
That wall is already there if you look at AMD's HT instead (see below). 
Also don't forget that main memory today has access times that will not 
allow the RAM chips to be significantly farther away than right next to 
the CPU on the same board. Physics, speed of light and all that stuff. 
So coordinating caches across distributed (as in "several racks in the 
data center") systems will never be as fast as coordinating a couple of 
local caches. Remember that for 1 ns access time you will have to travel 
from the CPU to RAM and back, which in vacuum would allow about 15cm 
distance minus reaction time of the RAM chips themselves. And vacuum is 
not available, we're talking about copper which reduces this to let's 
say 10cm which is what we find on current mainboards.
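The back-of-the-envelope numbers above can be checked in a few lines (the 2/3 c propagation speed for copper traces is an assumed typical figure, not a measured one):

```java
// Round-trip signalling budget: in 1 ns the signal must travel
// CPU -> RAM -> CPU, so the one-way distance is half of c * 1 ns.
public class LightDistance {
    public static void main(String[] args) {
        double c = 299_792_458.0;            // speed of light, m/s
        double roundTripSeconds = 1e-9;      // 1 ns access-time budget

        double oneWayVacuumCm = c * roundTripSeconds / 2 * 100;
        // Signals in PCB traces travel at roughly 2/3 c (assumption).
        double oneWayCopperCm = oneWayVacuumCm * 2.0 / 3.0;

        System.out.printf("one-way in vacuum: %.1f cm%n", oneWayVacuumCm);
        System.out.printf("one-way in copper: %.1f cm%n", oneWayCopperCm);
    }
}
```

This reproduces the ~15cm vacuum figure and the ~10cm the mainboard comparison relies on.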

So for big, fast, massively parallel systems we will never have a single 
memory model with coherent caches, and thus never the ability to live in 
one single-process model, synchronizing threads with something like locks.

> I see no chance for QPI
> being fast, wide, cheap, and scale to hundreds of CPUs same time. So
> we'll still need some kind of weak consistent layer with explicit
> flushing control to hide weakness of memory (and QPI, as part of
> memory engine).

QPI is similar to AMD's HT, which has been around for 8 or 9 years now. 
Originally it was meant to scale well to many CPU sockets, but there are 
not that many systems using 8 Opteron CPUs around. Diminishing returns. 
HT is also used by Cray in their SeaStar architecture, which is HT on 
steroids (up to 32K nodes), but only as a very fast network 
interconnect, not for synchronizing RAM directly.


> What I'm trying to say here: it seems like we will always have strictly
> consistent but rather slow memory (with QPI), and quick but weakly
> consistent memory. The border between them could move -- nowadays servers
> and desktops have a tiny weak-consistent layer, while grids and clusters
> have all their memory "weak consistent" (only explicitly synchronized).
As I said, pure physics is in the way when trying to get a speedy 
grid/cluster system with coherent memory.

> And if my assumptions are not too far from reality, it seems promising
> (or at least interesting) to try to investigate algorithms which
> can exploit inconsistency, instead of trying to fight it with
> fences.
Exactly. Unfortunately, easy parallel programming has been on the wish 
list for ages, without any real advance as in "works on its own and 
can be done by your average coder".

There are always some areas that lend themselves to parallelism, and there 
IS some progress in easing parallel programming, but as this whole 
discussion shows, it is still not easy going.

> Do you know about any work in this direction? For now I see only one
> promising example of exploiting the eventually consistent approach -- the
> sync-less cache for atomically published entities, like primitives
> (except long/double, sure) or immutable objects.
No, I don't, sorry.
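For what it's worth, one well-known instance of the pattern Ruslan describes is the racy single-check idiom, in the style of String.hashCode() in the JDK: a synchronization-free cache that is safe precisely because the cached value is a tear-free primitive and the computation is deterministic. A sketch with illustrative names:

```java
// Synchronization-free cache of an idempotently computed int,
// in the style of String.hashCode(). No volatile, no locks:
// an int write is atomic (no word tearing), and every thread
// that races here computes the same value, so losing the race
// is benign. This would NOT be safe for long/double fields.
class Name {
    private final String value;
    private int hash;               // 0 means "not yet computed"

    Name(String value) { this.value = value; }

    @Override
    public int hashCode() {
        int h = hash;               // racy read: 0 or a cached value
        if (h == 0) {
            h = value.hashCode();
            hash = h;               // racy write: harmless duplicate work
        }
        return h;
    }
}
```

The only cost of the race is that a value hashing to 0 gets recomputed on every call, which is the same tradeoff String itself makes.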


More information about the Concurrency-interest mailing list