I think this community is the right place to start a conversation about
NUMA (aren't NUMA nodes to memory what multiprocessors are to processing?
;). I apologize if this is considered off-topic.

We are developing a Java in-memory analytical database (it's called
"ActivePivot") that our customers deploy on ever larger datasets. Some
ActivePivot instances are deployed on java heaps close to 1TB, on NUMA
servers (typically 4 Xeon processors and 4 NUMA nodes). This is becoming a
trend, and we are researching solutions to improve our performance on NUMA.

We understand that, as things stand (up to and including JDK 8), the
support for NUMA in HotSpot is the following:
* The young generation heap layout can be NUMA-aware (partitioned per NUMA
node, objects allocated on the same node as the running thread)
* The old generation heap layout is not optimized for NUMA (at best the old
generation is interleaved among nodes which at least makes memory accesses
somewhat uniform)
* The parallel garbage collector is NUMA-optimized, with the GC threads
focusing on objects in their node.

Yet activating the -XX:+UseNUMA option has almost no impact on the
performance of our in-memory database. This is not surprising: the pattern
for a database is to load the data into memory and then run queries on it.
The data goes to and stays in the old generation, and it is read from there
by queries. Most memory accesses are in the old gen, and most of those are
not local.

I guess there is a reason HotSpot does not yet optimize the old generation
for NUMA. It must be very difficult to do it in the general case, when you
have no idea what thread from what node will read data and interleaving is.
But for an in-memory database this is frustrating, because we know very
well which threads will access which piece of data. In ActivePivot at
least, data structures are partitioned and each partition is assigned its
own thread pool, so the threads that allocated the data in a partition are
also the threads that perform sub-queries on that partition. We are a few
lines of code away from binding thread pools to NUMA nodes, and if the
garbage collector left objects promoted to the old generation on their
original NUMA node, memory accesses would be close to optimal.
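
To make the partition/thread-pool idea concrete, here is a minimal sketch
of the thread side only. It is not ActivePivot code: the class name
PartitionPools, the round-robin partition-to-node mapping and the
nodeBinder hook are all assumptions made for illustration. The JDK has no
standard API for pinning a thread to a NUMA node, so the hook is a
placeholder for whatever native mechanism (JNI/JNA or an affinity library)
one would actually use. The sketch does not change where HotSpot places
promoted objects, which is the part we are asking about.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadFactory;
import java.util.function.IntConsumer;

/** Sketch: one thread pool per data partition, each pool tied to one NUMA node. */
public class PartitionPools implements AutoCloseable {

    private final int nodeCount;
    private final List<ExecutorService> pools = new ArrayList<>();
    private final long[][] data; // data[p] is allocated by partition p's own threads

    /**
     * @param nodeBinder hypothetical hook that pins the calling thread to the
     *                   given NUMA node; the JDK offers no standard API for this.
     */
    public PartitionPools(int partitions, int threadsPerPartition,
                          int nodeCount, IntConsumer nodeBinder) {
        this.nodeCount = nodeCount;
        this.data = new long[partitions][];
        for (int p = 0; p < partitions; p++) {
            final int node = nodeOf(p);
            ThreadFactory factory = task -> {
                Thread t = new Thread(() -> {
                    nodeBinder.accept(node); // bind once, before running any task
                    task.run();
                }, "partition-pool-node-" + node);
                t.setDaemon(true);
                return t;
            };
            pools.add(Executors.newFixedThreadPool(threadsPerPartition, factory));
        }
    }

    /** Assumed mapping: partitions are spread round-robin over the nodes. */
    public int nodeOf(int partition) {
        return partition % nodeCount;
    }

    /** Loads a partition on its own pool, so its threads do the allocation. */
    public void load(int partition, int rows) {
        call(partition, () -> {
            long[] column = new long[rows];
            for (int i = 0; i < rows; i++) {
                column[i] = i;
            }
            data[partition] = column;
            return null;
        });
    }

    /** Runs a sub-query (here a plain sum) on the pool that loaded the data. */
    public long sum(int partition) {
        return call(partition, () -> {
            long total = 0;
            for (long v : data[partition]) {
                total += v;
            }
            return total;
        });
    }

    // Submits a task to the partition's pool and waits for its result.
    private <T> T call(int partition, Callable<T> task) {
        try {
            return pools.get(partition).submit(task).get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        }
    }

    @Override
    public void close() {
        pools.forEach(ExecutorService::shutdown);
    }

    public static void main(String[] args) {
        IntConsumer noOpBinder = node -> { }; // stand-in for a real native binding
        try (PartitionPools pools = new PartitionPools(8, 2, 4, noOpBinder)) {
            for (int p = 0; p < 8; p++) {
                pools.load(p, 1_000);
            }
            for (int p = 0; p < 8; p++) {
                System.out.println("partition " + p + " on node "
                        + pools.nodeOf(p) + " sum=" + pools.sum(p));
            }
        }
    }
}
```

With a real binder plugged in, allocation and sub-queries for a partition
would both run on threads of a single node; whether the data stays on that
node after promotion is still up to the collector.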

We have not been able to do that. That said, I read an inspiring 2005
article by Mustafa M. Tikir and Jeffrey K. Hollingsworth that experimented
with NUMA layouts for the old generation ("NUMA-aware Java heaps for server
applications").
That motivated me to ask the following questions:

* Are there hidden or experimental HotSpot options that allow NUMA-aware
partitioning of the old generation?
* Do you know why there isn't much (visible, generally available) research
on NUMA optimizations for the old gen? Is the Java in-memory database use
case considered a rare one?
* Maybe we should experiment and even contribute new heap layouts to the
OpenJDK project. Can some of you guys comment on the difficulty of that?

Thanks for reading,

