[Lambda-lib] Iterators and Spliterators

Brian Goetz brian.goetz at oracle.com
Wed Sep 14 23:35:59 EDT 2011


Here's an API strawman for aggregate operations on collections.  Mike just checked in something to the lambda repo on openjdk implementing much of this.  


Aggregate operations for Collections
====================================

Serial vs parallel operations
-----------------------------

We've decided against any sort of "transparent parallelism".  Instead, we will have separate sets of methods for serial and parallel operations.  Ideally (this may or may not be possible), these operations could have a common interface ancestor, though specialized operations for primitives may well get in the way.  Parallel operations on a collection will be accessible by obtaining a "parallel view" of the collection.


Serial operations
-----------------

Many serial operations on collections can be expressed directly as extension methods on Iterable.  We propose to add the following methods to Iterable<T>:

   interface Iterable<T> {
       ...
       Iterable<T> filter(Predicate<T> filter);
       <U> Iterable<U> map(Mapper<T,U> mapper);
       <U> U reduce(U base, Reducer<T,U> reducer);
       void forEach(Block<T> block);
       <U extends Collection<? super T>> U intoCollection(U collection);
   }

The methods that return an Iterable are lazy; the methods that force a value (reduce) or an effect (forEach, intoCollection) are eager.  This fits well with the natural use cases, exemplified by the canonical example:

   int maxSize = collection
                 .filter(pred)
                 .map(#Foo.getSize)
                 .reduce(0, #Integer.max);

In this case, filter() and map() produce Iterables, which are "lazy views" of the collection; reduce() then immediately starts pulling elements from these Iterables, which ultimately pull elements from the underlying collection.

Each of these has obvious defaults based on iterator().
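
For illustration only, here is a minimal sketch of what an iterator()-based default for filter() might look like, written as an ordinary static helper rather than an extension method, and assuming a straw-man Predicate whose single method is boolean eval(T):

   // Illustrative sketch only: a lazy, iterator()-based default for
   // filter(), written as a static helper.  The Predicate shape (a
   // single method boolean eval(T)) is an assumption.
   import java.util.Iterator;
   import java.util.NoSuchElementException;

   interface Predicate<T> { boolean eval(T t); }

   class FilterDefaults {
       static <T> Iterable<T> filter(final Iterable<T> source,
                                     final Predicate<T> pred) {
           // Lazy view: no elements are examined until someone iterates.
           return new Iterable<T>() {
               public Iterator<T> iterator() {
                   final Iterator<T> it = source.iterator();
                   return new Iterator<T>() {
                       private T next;
                       private boolean hasNext = advance();

                       private boolean advance() {
                           while (it.hasNext()) {
                               T t = it.next();
                               if (pred.eval(t)) { next = t; return true; }
                           }
                           return false;
                       }

                       public boolean hasNext() { return hasNext; }
                       public T next() {
                           if (!hasNext) throw new NoSuchElementException();
                           T result = next;
                           hasNext = advance();
                           return result;
                       }
                       public void remove() {
                           throw new UnsupportedOperationException();
                       }
                   };
               }
           };
       }
   }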


Primitive specialization
------------------------

In the example above, the Mapper boxes the size values to Integer, and the reducer unboxes them to do comparisons.  This is undesirable.  For a limited subset of primitives, we should consider fused versions of the above operations, probably for int, long, and double:

   int mapReduce(IntMapper<T> mapper, int base, IntIntReducer reducer);

Then the above example becomes:

   int maxSize = collection
                 .filter(pred)
                 .mapReduce(#Foo.getSize, 0, #Integer.max);
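
As a rough sketch (not part of the proposal), a fused mapReduce default based on iteration might look like the following, assuming straw-man shapes for IntMapper and IntIntReducer; note that no boxing to Integer occurs anywhere in the loop:

   // Illustrative sketch only: the fused int specialization.  The SAM
   // shapes of IntMapper and IntIntReducer are assumptions.
   interface IntMapper<T> { int map(T t); }
   interface IntIntReducer { int reduce(int left, int right); }

   class IntMapReduce {
       // Eager: maps each element to an int and folds it into the
       // accumulator, staying in the primitive domain throughout.
       static <T> int mapReduce(Iterable<T> source, IntMapper<T> mapper,
                                int base, IntIntReducer reducer) {
           int acc = base;
           for (T t : source)
               acc = reducer.reduce(acc, mapper.map(t));
           return acc;
       }
   }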


Parallel operations
-------------------

Parallel operations will be implemented by fork-join.  We introduce an abstraction for data structures that can be decomposed with fork-join, called Spliterable / Spliterator (the parallel analogues of Iterable / Iterator).  This way, any data structure can express how it can be divided without knowledge of what operations will be executed in parallel, and parallel operations can be specified purely in terms of the abstract splitting operation.

Spliterator is the equivalent of Iterator, and represents a one-time opportunity to traverse a collection either in serial or parallel mode.

   interface Spliterator<T> {
       boolean canSplit();
       Spliterator<T> left();
       Spliterator<T> right();
       Iterator<T> iterator();
   }

When you have a Spliterator, you have a choice: if the data structure indicates that splitting is possible, you can either call left() and right() once each, or call iterator() and get a sequential view of the contents.  (If canSplit() returns false, you can only call iterator().)  In this way, a Spliterator represents something that can always be iterated sequentially, and sometimes can be split.

The canSplit() method indicates whether it would be natural to split further.  An array can split down to one element; a closed-chain hash table might split down to a single bucket, and go sequential from there.  However, parallel algorithms might not want to split as far as the data structure will let them. 
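
For illustration only, here is a sketch of an array-backed implementation of the Spliterator interface above, with an arbitrary threshold below which it stops splitting and goes sequential:

   // Illustrative sketch only: an array-backed Spliterator.  The
   // threshold below which it refuses to split is an arbitrary choice
   // for this example.
   import java.util.Arrays;
   import java.util.Iterator;

   class ArraySpliterator<T> implements Spliterator<T> {
       private static final int THRESHOLD = 1024;
       private final T[] array;
       private final int from, to;     // half-open range [from, to)

       ArraySpliterator(T[] array, int from, int to) {
           this.array = array;
           this.from = from;
           this.to = to;
       }

       // Refuse to split once the chunk is small enough to handle serially.
       public boolean canSplit() { return to - from > THRESHOLD; }

       public Spliterator<T> left() {
           return new ArraySpliterator<T>(array, from, (from + to) >>> 1);
       }

       public Spliterator<T> right() {
           return new ArraySpliterator<T>(array, (from + to) >>> 1, to);
       }

       public Iterator<T> iterator() {
           return Arrays.asList(array).subList(from, to).iterator();
       }
   }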

Spliterable is the analogue of Iterable, and offers similar extension methods with similar lazy/eager characteristics to Iterable:

   interface Spliterable<T> {
       Spliterator<T> spliterator();
       Spliterable<T> filter(Predicate<T> filter);
       <U> Spliterable<U> map(Mapper<T,U> mapper);
       <U> U reduce(U base, Reducer<T,U> reducer);
       void forEach(Block<T> block);
       <U extends Collection<? super T>> U intoCollection(U collection);
   }
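
As a rough sketch of how one of the eager operations might sit on top of fork-join, here is forEach() over a Spliterator using the JDK 7 RecursiveAction API; the Block shape (a single method void apply(T)) is an assumption, and creating a pool per call is just to keep the example self-contained:

   // Illustrative sketch only: parallel forEach() over a Spliterator.
   import java.util.Iterator;
   import java.util.concurrent.ForkJoinPool;
   import java.util.concurrent.RecursiveAction;

   interface Block<T> { void apply(T t); }

   class ParallelForEach {
       static <T> void forEach(Spliterator<T> split, Block<T> block) {
           new ForkJoinPool().invoke(new ForEachTask<T>(split, block));
       }

       static class ForEachTask<T> extends RecursiveAction {
           private final Spliterator<T> split;
           private final Block<T> block;

           ForEachTask(Spliterator<T> split, Block<T> block) {
               this.split = split;
               this.block = block;
           }

           protected void compute() {
               if (split.canSplit()) {
                   // Decompose: hand each half to its own subtask.
                   invokeAll(new ForEachTask<T>(split.left(), block),
                             new ForEachTask<T>(split.right(), block));
               } else {
                   // Leaf: fall back to sequential iteration over the chunk.
                   for (Iterator<T> it = split.iterator(); it.hasNext(); )
                       block.apply(it.next());
               }
           }
       }
   }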

Now we just need a way to get a spliterator out of a collection.  We introduce a new method into Collection:

   interface Collection<T> {
       ....
       Spliterable<T> parallel();
   }

Our canonical example in parallel has much in common with the serial version:

   int maxSize = collection
                 .parallel()
                 .filter(pred)
                 .map(#Foo.getSize)
                 .reduce(0, #Integer.max);

(We can make these even more similar by pulling the aggregate operations up into an AggregateOps interface, implemented by both Iterable and Spliterable, provided we don't go too crazy with primitive specializations.)
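
A rough sketch of what such a common ancestor might look like, using the straw-man SAM types from above (the name and exact contents are placeholders):

   // Illustrative sketch only: a possible common ancestor for the shared
   // operations.  Covariant returns would let Iterable and Spliterable
   // narrow filter() and map() to their own types.
   import java.util.Collection;

   interface AggregateOps<T> {
       AggregateOps<T> filter(Predicate<T> filter);
       <U> AggregateOps<U> map(Mapper<T, U> mapper);
       <U> U reduce(U base, Reducer<T, U> reducer);
       void forEach(Block<T> block);
       <U extends Collection<? super T>> U intoCollection(U collection);
   }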

How we implement the default in Collection is an open question; one possibility is to always produce a Spliterable whose spliterators say "no splitting", effectively forcing serial operation.  Another is to copy to an array.  In any case, the existing collection implementations would be retrofitted with better implementations of parallel().
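
For illustration only, the "no splitting" fallback might look something like this; it simply wraps iterator(), so any parallel operation over it degenerates to a sequential one:

   // Illustrative sketch only: a Spliterator that refuses to split, so
   // parallel operations over it run sequentially.
   import java.util.Collection;
   import java.util.Iterator;

   class SequentialSpliterator<T> implements Spliterator<T> {
       private final Collection<T> coll;

       SequentialSpliterator(Collection<T> coll) { this.coll = coll; }

       public boolean canSplit() { return false; }

       public Spliterator<T> left() {
           throw new UnsupportedOperationException("cannot split");
       }

       public Spliterator<T> right() {
           throw new UnsupportedOperationException("cannot split");
       }

       public Iterator<T> iterator() { return coll.iterator(); }
   }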




