[concurrency-interest] Double Checked Locking in OpenJDK
uli at grepel.de
Sat Aug 18 08:35:18 EDT 2012
On 17.08.2012 20:08, Ruslan Cheremin wrote:
> Yes, Ulrich, I have grid-like systems in mind when talking about
> the perspective of weakening hardware coherence.
> But in any case, one does not need to look so far. As I've already
> written, we already have some kind of
> weak-consistent-not-automatically-coherent memory in today's Intel CPUs
> -- in the form of registers and store buffers. This is a small layer atop
> coherent memory, but this layer is, as far as I know, critical for
> overall performance, since it is important in hiding (well, sometimes
> hiding) still noticeable memory latency. Not only main memory (or,
> say, L3/L2 cache) latency, but also QPI latency, if the accessed memory
> location is owned by another core and needs to be re-owned, for
There's a big difference between registers and L1 cache - the registers
are under full control of the compiler. Well, not really, looking at
register renaming and all that stuff. But the compiler DOES know that
when a register is stored into memory, it CAN be seen by all other
threads. Immediately. It doesn't actually have to be stored into memory
immediately, not even into L1 cache; it might reside in some write buffer
for a while, but all other cores will, due to the cache synchronisation
mechanisms, know about it IF they care about it.
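That "IF they care about it" is exactly what the Java memory model surfaces as a volatile read, and the thread's subject, double-checked locking, leans on it. A minimal sketch of the volatile-based DCL idiom (correct under the Java 5+ JMM; class and field names are my own):

```java
// Double-checked locking with a volatile field (safe since the Java 5 JMM).
class Singleton {
    // volatile: the write below publishes the fully constructed object,
    // and readers who "care" pick it up via the volatile read.
    private static volatile Singleton instance;

    static Singleton getInstance() {
        Singleton local = instance;           // single volatile read on the fast path
        if (local == null) {
            synchronized (Singleton.class) {
                local = instance;             // re-check under the lock
                if (local == null) {
                    local = new Singleton();
                    instance = local;         // volatile write: safe publication
                }
            }
        }
        return local;
    }
}
```

Without the volatile, the store could be reordered so that another core sees a non-null reference to a partially constructed object - which is the original DCL bug.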
L1 cache though is more or less completely transparent to the compiler.
Besides some configuration things (like "do not use L1 at all for these
memory locations, because they are I/O areas" or similar) that usually
are under OS control, there's not much a compiler can do to force writes
from L1 to main memory. So either we will have weak synchronisation, or
we will have cache coherency, or we will at least have some new assembly
instructions for flushing a cache line (x86 does have CLFLUSH for special
cases such as I/O buffers, but it is no general synchronisation
mechanism). Since many, many CPUs from the 8088 (or at least the 80386)
to the Core i7, plus AMD plus others, share their assembly instruction
set, and since all of these have very different L1 cache designs (from
"none" to "write through" or "copy-back", n-way associative with varying
"n") and different cache levels (why stop at L1?), it is unlikely that
any such general instructions will surface. The best thing we might get
would be a huge register file ("huge" as in 1024 or more registers), all
of them available to the application. But besides requiring a completely
new architecture, that would cause other problems, such as long-running
context switches and, on top, a tradeoff between transistor count and
performance: transistors might be used for something else to gain
performance, and larger register files will be slower.
> I see no reason why the evolution of QPI will be somehow different from
> the evolution of memory itself. Leaving aside the chance of some kind of
> hardware revolution (a breakthrough which would give us cheap and
> ultimately fast memory/QPI), it seems to me like we'll have the same
> QPI wall as we already have the memory wall.
That wall is already there if you look at AMD's HT instead (see below).
Also don't forget that main memory today has access times that will not
allow the RAM chips to be significantly farther away than right next to
the CPU on the same board. Physics, speed of light and all that stuff.
So coordinating caches across distributed (as in "several racks in the
data center") systems will never be as fast as coordinating a couple of
local caches. Remember that for a 1 ns access time you have to travel
from the CPU to RAM and back, which in vacuum would allow about 15 cm
distance, minus the reaction time of the RAM chips themselves. And vacuum
is not available; we're talking about copper, which reduces this to,
let's say, 10 cm - which is what we find on current mainboards.
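The back-of-envelope numbers above can be checked directly; the roughly 2/3-of-c signal speed in copper traces is an assumption of the sketch, and the class and method names are my own:

```java
// Back-of-envelope check: how far can a signal travel and return
// within a given round-trip time at a given propagation speed?
public class LightLimit {
    /** One-way distance in cm reachable within roundTripSeconds at signalSpeed (m/s). */
    static double maxDistanceCm(double roundTripSeconds, double signalSpeed) {
        return signalSpeed * roundTripSeconds / 2.0 * 100.0;
    }

    public static void main(String[] args) {
        double c = 3.0e8;  // speed of light in vacuum, m/s
        System.out.printf("vacuum: %.1f cm%n", maxDistanceCm(1e-9, c));
        // ~2/3 of c for signals in copper traces is an assumed ballpark figure.
        System.out.printf("copper: %.1f cm%n", maxDistanceCm(1e-9, 0.66 * c));
    }
}
```

This reproduces the ~15 cm vacuum bound and roughly the ~10 cm copper figure from the text.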
So for fast, big massively parallel systems we will never have one
memory model with coherent caches and thus with the ability to live in
one single process model synchronizing threads with something like locks.
> I see no chance of QPI
> being fast, wide, cheap, and scaling to hundreds of CPUs at the same
> time. So we'll still need some kind of weakly consistent layer with
> explicit flushing control to hide the weakness of memory (and QPI, as
> part of the memory engine).
QPI is similar to AMD's HyperTransport (HT), which has been around for 8
or 9 years now. Originally this was meant to scale well to many CPU
sockets, but there are not that many systems using 8 Opteron CPUs around.
Diminishing returns. HT is also used by Cray with their SeaStar
architecture, which is HT on steroids (up to 32K nodes), but only as a
very fast network interconnect, not for synchronizing RAM directly.
> What I'm trying to say here: it seems like we will always have strictly
> consistent but rather slow memory (with QPI), and quick but weakly
> consistent memory. The border between them could move -- nowadays
> servers and desktops have a tiny weak-consistent layer, while grids and
> clusters have all their memory "weak consistent" (only explicitly
> synchronized).
As I said, pure physics is in the way when trying to get a speedy
system with coherent memory.
> And if my assumptions are not too far from reality, it seems promising
> (or at least interesting) to try to investigate algorithms which
> can exploit inconsistency, instead of trying to fight it with
Exactly. Unfortunately, easy parallel programming has been on the wish
list for ages, without any real breakthrough as in "works on its own and
can be done by your average coder".
There are always some areas that lend themselves to parallelism, and
there IS some progress in easing parallel programming, but as this whole
discussion shows, it is still not easy going.
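One concrete instance of that progress in OpenJDK is the fork/join framework added in JDK 7. A minimal sketch of a parallel array sum (the sequential cutoff of 1024 is an arbitrary assumption, and the class name is my own):

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Fork/join parallel sum: split the range until it is small enough,
// then sum sequentially and combine the halves.
class SumTask extends RecursiveTask<Long> {
    private final long[] a;
    private final int lo, hi;

    SumTask(long[] a, int lo, int hi) { this.a = a; this.lo = lo; this.hi = hi; }

    @Override protected Long compute() {
        if (hi - lo <= 1024) {              // sequential cutoff (assumed value)
            long s = 0;
            for (int i = lo; i < hi; i++) s += a[i];
            return s;
        }
        int mid = (lo + hi) >>> 1;
        SumTask left = new SumTask(a, lo, mid);
        left.fork();                        // run the left half asynchronously
        long right = new SumTask(a, mid, hi).compute();
        return left.join() + right;         // combine results
    }
}
```

Usage: `new ForkJoinPool().invoke(new SumTask(a, 0, a.length))`. The point is that the framework handles work-stealing and scheduling; the coder only supplies the split/combine logic.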
> Do you know about any works in this direction? For now I see only one
> promising example for exploiting eventually consistent approach -- the
> sync-less cache for atomically published entities, like primitives
> (except long/double sure) or immutable objects.
No, I don't, sorry.
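For what it's worth, the "sync-less cache for atomically published entities" the question describes sounds like the benign-race idiom used by java.lang.String.hashCode(): a racy int cache that is safe because int writes are atomic and recomputation is idempotent. A sketch (class name is my own):

```java
// Benign-race caching: no synchronization, no volatile. Safe because
// int writes are atomic (unlike long/double) and the value is a pure
// function of immutable state, so a lost or duplicated write only
// causes harmless recomputation - the same trick as String.hashCode().
class CachedHash {
    private final byte[] data;   // immutable after construction
    private int hash;            // racy cache; 0 means "not yet computed"

    CachedHash(byte[] data) { this.data = data.clone(); }

    @Override public int hashCode() {
        int h = hash;            // single racy read: sees 0 or a valid value
        if (h == 0) {            // (data hashing to 0 is recomputed each time,
            for (byte b : data)  //  exactly as in String.hashCode())
                h = 31 * h + b;
            hash = h;            // racy write: never wrong, at worst redundant
        }
        return h;
    }
}
```

Each thread either sees the cached value or computes it independently; no thread can ever observe a wrong hash.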