[concurrency-interest] NUMA-Aware Java Heaps for in-memory databases

Stanimir Simeonoff stanimir at riflexo.com
Sat Feb 16 16:47:53 EST 2013


On Fri, Feb 15, 2013 at 5:48 PM, Antoine Chambille <ach at quartetfs.com> wrote:

> You are right, for primitive arrays I may be overestimating the overhead.
> Nevertheless I have no idea where (regarding NUMA layout) *
> UNSAFE.allocateMemory()* allocates the memory for direct byte buffers...
> I should find some time to write a prototype.
>
Unsafe.allocateMemory calls os::malloc, which is a thin wrapper around plain
malloc. Without an explicit numa_* allocation policy the OS should(?) allocate
the pages on the first write touch, i.e. on page fault. At least that should be
the default behavior.
Hence if you bind a few threads to NUMA nodes and use them to allocate and
pre-touch the DirectBuffers you are good to go - the raw data should stay
unmoved.
Actually I thought an in-memory DB and DirectBuffers should be a perfect match.
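A minimal sketch of the pre-touch part (class and method names are made up; it
assumes the calling thread has already been pinned to the target node by some
external means, e.g. numactl or a JNI affinity call, which is not shown):

import java.nio.ByteBuffer;

// Sketch only: assumes the current thread is already bound to the desired NUMA
// node; under the default first-touch policy the pages of the direct buffer
// should then be backed by memory local to that node.
public final class NodeLocalBuffers {

    private static final int PAGE_SIZE = 4096; // assumed page size

    static ByteBuffer allocateAndTouch(int capacity) {
        ByteBuffer buf = ByteBuffer.allocateDirect(capacity);
        for (int pos = 0; pos < capacity; pos += PAGE_SIZE) {
            buf.put(pos, (byte) 0); // the first write faults the page in on this node
        }
        return buf;
    }
}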



> But the serialization price is definitely too steep for objects. Of course
> the results of queries to the database must be serialized before being sent
> to clients, but the size of a result cell set is incommensurate with the size
> of the data scanned to produce it. A typical analytical query on
> ActivePivot scans and aggregates one billion lines to produce 100 cells.
>
If you use bound threads, split the dataset amongst the NUMA nodes via the
DirectBuffers, and split each query accordingly, it should be possible to have
few cross-node loads.

Stanimir



> -Antoine
>
>
> On 15 February 2013 16:01, Stanimir Simeonoff <stanimir at riflexo.com> wrote:
>
>>
>>
>> On Fri, Feb 15, 2013 at 1:18 PM, Antoine Chambille <ach at quartetfs.com> wrote:
>>
>>> Data is stored in columns, to maximize the performance of analytical
>>> queries that commonly scan billions of rows but only for a subset of the
>>> columns. We support a mix of primitive data and object-oriented data (some
>>> columns look like double[], some others look like Object[]).
>>>
>> The point is to replace all primitive arrays (+String) with ByteBuffer (or
>> Double/LongBuffer); instead of double[] you can use a DoubleBuffer in native
>> byte order (the latter is quite important). Also you can use
>> ByteBuffer.asDoubleBuffer() and still have access to the individual bytes.
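I.e. something along these lines (a bare sketch, names made up):

import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.DoubleBuffer;

// A double[]-like column backed by a direct buffer, viewed in native byte
// order so that reads avoid byte swapping.
final class OffHeapDoubleColumn {
    private final DoubleBuffer values;

    OffHeapDoubleColumn(int size) {
        ByteBuffer raw = ByteBuffer.allocateDirect(size * 8) // 8 bytes per double
                                   .order(ByteOrder.nativeOrder()); // native order matters
        this.values = raw.asDoubleBuffer();
    }

    double get(int i)           { return values.get(i); }
    void   set(int i, double v) { values.put(i, v); }
}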
>>
>>
>>> Using direct buffers would open a door to NUMA-Aware memory placement
>>> (provided that the direct allocation itself can be made on the right node).
>>> That's probably more a Pandora's box than a door though ;) Anyway it implies
>>> serializing data into byte arrays, and deserializing at each query. That's
>>> a serious performance penalty for primitive data, and that's absolutely
>>> prohibitive when you do that with plain objects, even Externalizable ones.
>>>
>> Unless the database runs in-process, there would always be a need to
>> serialize Java objects into a network format. I assumed the database is
>> standalone and not embedded.
>>
>> Stanimir
>>
>>
>>
>>>
>>> On 15 February 2013 11:35, Stanimir Simeonoff <stanimir at riflexo.com> wrote:
>>>
>>>> Just out of curiosity: would not DirectBuffers and managing the data
>>>> yourself be both easier and more efficient?
>>>> Technically you can ship the data straight to the sockets (or disks)
>>>> without even copying it.
>>>> I don't know how you store the data itself, but I can only think of
>>>> tuples, i.e. Object[].
>>>>
>>>> Stanimir
>>>>
>>>> On Fri, Feb 15, 2013 at 11:48 AM, Antoine Chambille <ach at quartetfs.com> wrote:
>>>>
>>>>> I think this community is the right place to start a conversation
>>>>> about NUMA (aren't NUMA nodes to memory what multiprocessors are to
>>>>> processing? ;). I apologize if this is considered off-topic.
>>>>>
>>>>>
>>>>> We are developing a Java in-memory analytical database (it's called
>>>>> "ActivePivot") that our customers deploy on ever larger datasets. Some
>>>>> ActivePivot instances are deployed on Java heaps close to 1TB, on NUMA
>>>>> servers (typically 4 Xeon processors and 4 NUMA nodes). This is becoming a
>>>>> trend, and we are researching solutions to improve our performance on NUMA
>>>>> configurations.
>>>>>
>>>>>
>>>>> We understand that in the current state of things (and including JDK8)
>>>>> the support for NUMA in hotspot is the following:
>>>>> * The young generation heap layout can be NUMA-Aware (partitioned per
>>>>> NUMA node, objects allocated on the same node as the running thread)
>>>>> * The old generation heap layout is not optimized for NUMA (at best
>>>>> the old generation is interleaved among nodes which at least makes memory
>>>>> accesses somewhat uniform)
>>>>> * The parallel garbage collector is NUMA optimized, the GC threads
>>>>> focusing on objects in their node.
>>>>>
>>>>>
>>>>> Yet activating the -XX:+UseNUMA option has almost no impact on the
>>>>> performance of our in-memory database. This is not surprising: the pattern
>>>>> for a database is to load the data into memory and then run queries
>>>>> against it. The data goes to and stays in the old generation, and it is
>>>>> read from there by queries. Most memory accesses are in the old gen, and
>>>>> most of those are not local.
>>>>>
>>>>> I guess there is a reason hotspot does not yet optimize the old
>>>>> generation for NUMA. It must be very difficult to do in the general case,
>>>>> when you have no idea which thread from which node will read the data,
>>>>> and interleaving is a reasonable compromise. But for an in-memory database
>>>>> this is frustrating, because we know very well which threads will access
>>>>> which piece of data. At least in ActivePivot data structures are
>>>>> partitioned, and each partition is assigned a thread pool, so the threads
>>>>> that allocated the data in a partition are also the threads that perform
>>>>> sub-queries on that partition. We are a few lines of code away from
>>>>> binding thread pools to NUMA nodes, and if the garbage collector left
>>>>> objects promoted to the old generation on their original NUMA node,
>>>>> memory accesses would be close to optimal.
>>>>>
>>>>> We have not been able to do that. That being said, I read an inspiring
>>>>> 2005 article by Mustafa M. Tikir and Jeffrey K. Hollingsworth that
>>>>> experimented with NUMA layouts for the old generation ("NUMA-aware
>>>>> Java heaps for server applications",
>>>>> http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.92.6587&rep=rep1&type=pdf).
>>>>> That motivated me to ask the following questions:
>>>>>
>>>>>
>>>>> * Are there hidden or experimental hotspot options that allow
>>>>> NUMA-Aware partitioning of the old generation?
>>>>> * Do you know why there isn't much (visible, generally available)
>>>>> research on NUMA optimizations for the old gen? Is the Java in-memory
>>>>> database use case considered a rare one?
>>>>> * Maybe we should experiment with and even contribute new heap layouts to
>>>>> the OpenJDK project. Can some of you guys comment on the difficulty of
>>>>> that?
>>>>>
>>>>>
>>>>> Thanks for reading,
>>>>>
>>>>> --
>>>>> Antoine CHAMBILLE
>>>>> Director Research & Development
>>>>> Quartet FS
>>>>>
>>>>> _______________________________________________
>>>>> Concurrency-interest mailing list
>>>>> Concurrency-interest at cs.oswego.edu
>>>>> http://cs.oswego.edu/mailman/listinfo/concurrency-interest
>>>>>
>>>>>
>>>>
>>>
>>>
>>> --
>>> Antoine CHAMBILLE
>>> Director Research & Development
>>> Quartet FS
>>>
>>
>>
>
>
> --
> Antoine CHAMBILLE
> Director Research & Development
> Quartet FS
>