[concurrency-interest] A concurrent, linked HashMap impl

Doug Lea dl at cs.oswego.edu
Fri May 29 07:08:15 EDT 2009


Jim Andreou wrote:
> Yes, it's a link chain of the elements hashed _to the same bucket_.
> 
> The default load factor is 0.75. That's the average number of elements
> that each bucket contains.
> 
> It would take a really, really bad hashCode() impl to get 915 elements
> in a single bucket chain.

Right; and the worst case is even worse --
if all hashCodes are the same, then all elements go into
the same bucket of the same segment.
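
For illustration, a hypothetical key class like the following (not taken
from any real codebase) forces every entry into a single bucket chain of
a single segment, so lookups degrade to a linear scan:

    // Pathological key for illustration only: every instance reports the
    // same hashCode, so all entries land in one bucket of one segment.
    final class ConstantHashKey {
        private final String name;
        ConstantHashKey(String name) { this.name = name; }

        @Override public int hashCode() { return 42; }  // identical for all keys

        @Override public boolean equals(Object o) {
            return o instanceof ConstantHashKey
                && ((ConstantHashKey) o).name.equals(name);
        }
    }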

Aside: It might be handy to have special versions
of JDK hash tables that support a "development mode" where
the table aborts if you have too many elements with the same
hashCode. Many people seem to define hashCode() methods that
produce too many duplicates by mistake and never find out.
(We internally apply bit-spreading functions that protect
against nonuniform bit distributions of hashCodes, but cannot
do anything about multiple identical ones.)
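
For a rough idea of what such a bit-spreading step looks like (the exact
function and constants vary by JDK version; this is only an illustrative
sketch, not the JDK source):

    // Illustrative single-word spreader in the spirit of a Wang/Jenkins mix.
    // Two keys with identical hashCodes still spread to identical values,
    // so this only helps with poorly distributed bits, not with duplicates.
    static int spread(int h) {
        h += (h << 15) ^ 0xffffcd7d;
        h ^= (h >>> 10);
        h += (h << 3);
        h ^= (h >>> 6);
        h += (h << 2) + (h << 14);
        return h ^ (h >>> 16);
    }
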
Also, when using identity hash codes (including keys
that do not override Object.hashCode), the number of
hashCode bits can be smaller than you think
(22 in HotSpot 32-bit mode, even fewer in some other JVMs).
So you may start getting a lot of collisions in large
identity-based tables (for example, more than a million entries).
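
One quick (and admittedly crude) way to see this is to count duplicate
identity hash codes directly; the result depends on the JVM and mode:

    import java.util.HashSet;
    import java.util.Set;

    // Counts freshly created objects whose identity hash code collides with
    // an earlier one. With roughly 22 usable bits, a million objects should
    // show a noticeable number of collisions; results vary across JVMs.
    public class IdentityHashCollisions {
        public static void main(String[] args) {
            int n = 1000000;
            Object[] keep = new Object[n];      // keep objects reachable
            Set<Integer> seen = new HashSet<Integer>();
            int collisions = 0;
            for (int i = 0; i < n; i++) {
                keep[i] = new Object();
                if (!seen.add(System.identityHashCode(keep[i])))
                    collisions++;
            }
            System.out.println(collisions + " collisions among " + n + " objects");
        }
    }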

In general though, segmenting does not impact entry distribution
across buckets. A segmented hash table has the same statistical
properties as an unsegmented one. About which see for example
   http://en.wikipedia.org/wiki/Hash_table
   http://en.wikipedia.org/wiki/Hash_(computer_science)

Large segmented resizable tables (as in CHM) do have a space
advantage over unsegmented ones, in that the total
space stays closer to N/loadFactor buckets because each
segment doubles in size independently rather than the whole
table doubling at once. On the other hand, smaller segmented
tables occupy more space than unsegmented ones because of the
fixed per-segment object overhead.
A paper coming up in ECOOP09 ("Making Sense of Large Heaps"
by Nick Mitchell, Edith Schonberg, and Gary Sevitsky) found
a case where an application was making many thousands of
nearly empty CHMs, which of course leads to massive bloat.
(Don't do that!)
It is possible to reduce this fixed initial overhead
by postponing some initialization. This is being done for
CustomConcurrentHashMap, and should be applied to plain
CHM as well. However, in both cases it requires the upcoming
Fences API to work correctly under the memory model, or
Unsafe cheats until then.
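
A rough sketch of the postponed-initialization idea (this is not the
actual CHM code; it uses AtomicReferenceArray only to keep the example
self-contained while still publishing segments safely):

    import java.util.concurrent.atomic.AtomicReferenceArray;

    // Segments start out null and are created on first use; a CAS publishes
    // the newly built segment so other threads never observe it partially
    // constructed.
    final class LazySegments {
        static final class Segment { /* per-segment table, lock, counts */ }

        private final AtomicReferenceArray<Segment> segments;

        LazySegments(int nSegments) {
            segments = new AtomicReferenceArray<Segment>(nSegments); // all null
        }

        Segment segmentFor(int index) {
            Segment seg = segments.get(index);
            if (seg == null) {
                Segment created = new Segment();
                // If another thread installed a segment first, use that one.
                if (!segments.compareAndSet(index, null, created))
                    seg = segments.get(index);
                else
                    seg = created;
            }
            return seg;
        }
    }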


> 
> Btw, I see this comment in the implementation:
>> (The average length [of the bucket chain] is less than two for the default load factor threshold.)
> 
> I'm not sure why it says "less than two" and not "less than one".

Thanks for pointing out that this ought to be clarified.
The context for the comment is moves (rehashing entries
during a resize), where you know you have at least one node.
The conditional probability of having another is nearly the
same as the plain probability of having one, so the total
is less than two.
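
As a back-of-envelope check, assuming bucket occupancies are roughly
Poisson with mean equal to the 0.75 load factor (only an approximation):

    // Conditional mean chain length given a non-empty bucket:
    // lambda / (1 - e^(-lambda)); for lambda = 0.75 this is about 1.42,
    // i.e. less than two.
    public class ChainLengthEstimate {
        public static void main(String[] args) {
            double lambda = 0.75;
            System.out.println(lambda / (1 - Math.exp(-lambda)));  // ~1.42
        }
    }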

-Doug



