[concurrency-interest] Concurrent axis steps in a tree-structure

Johannes.Lichtenberger Johannes.Lichtenberger at uni-konstanz.de
Tue Dec 20 14:33:04 EST 2011


Already posted on stackoverflow and the Akka (akka.io) mailinglist.

We are researching a secure tree-based cloud storage system. We have
for instance several Axis-implementations (all XPath-Axis and a
LevelOrderAxis, PostOrderAxis..., NestedAxis, FilterAxis...).
I want to rewrite our axis-implementations sometime in the first
quarter next year to use multithreading. I'm currently not sure if I
would really gain any benefit, but as it most likely involves I/O if
nothing is in the cache (which might be almost everytime during an axis
step) even though the nodes can be get from pages through a simple
offset it will involve I/ O for each node similar to a directory
traversal in the Filesystem. Furthermore I've added visitor support for
our DescendantAxis which is traversing a tree in preorder.
I've copy/pasted the following code from hasNext() (Iterator
implementation):

http://www.heypasteit.com/clip/05IJ

I think it should be pretty self-explanatory, the transaction is more
or less a cursor traversing the tree-structure. Now I'm not sure how I
should divide the work and if it really makes sense. For example on
level 1 split the task for all second nodes. But it might be that the
tree is deep, even though regarding XML it seems most trees are rather
"wide" instead of deep. The idea is if we can speed up the traversal
such that we don't have to implement Path-/Element-/ and Text-Indexes
(as well as incremental updates thereof) as our main goal is to provide
an open source secure framework for tree-structured data which evolves
over time (through snapshots) for the Cloud (research project).
Furthermore besides considering if it really makes sense for very
large trees (1GB on disk and much more) I'm considering using either
Java7 fork/join or Akka actors. What do you think? How should I split
the task, as we currently don't save the descendant-count of each
node (which might also be added, as we provide and update hashes for
the integrity and could as well add/change the descendant count for
each modification to the tree).

It's just a simple idea that one must not implement indexes, but I
don't know (indexes are for sure faster if they can be used, but I
suppose also much more work considering that we optimized everything
for updates/modifications of the tree -- and it would need incremental
Path-/Element-/Value-indexes plus we don't intend to win any
performance benchmarks ;-)). Furthermore it would support our Saxon
integration (to support XPath 2.0, XQuery 1.0 and XSLT 2.0) as Saxon
uses it's own indexes (if at all) and just makes use of our axis and
other method-implementations.

I think it also pretty much now depends on the visitor, which can be
plugged in optionally, if the axis is fast or not. As such I just wanted
to learn the new Fork-/Join-Framework or maybe even Akka, as I'm
interested in learning newer concurrency aspects as for instance the
Fork-/Join-Framework and STM/Actors on the JVM. That said I'm not sure
if the axis are the right thing to use concurrency. Maybe other aspects
of a persistent tree-storage system would benefit much more. I'm
interested in any use cases ;-)

The main benefit I suppose would be on very large trees, where we have
to find specific nodes, that is we really traverse the whole tree.

best regards,
Johannes


More information about the Concurrency-interest mailing list