Guile Mailing List Archive
Re: hash table subtleties
> Tcl hash tables resize when the number of items = 3 x number of
> buckets. At this point it increases the number of buckets by a factor
> of 4. This prevents resizing too often, and maybe even is more
> reasonable than resizing based on max bucket size because if
> everything's getting tossed into one bucket, why keep doubling the
> number of buckets? This'll just suck up tons of memory for little
> effect. I don't know if 3 & 4 are the best numbers, but at least this
> method won't blow up in the worst case - i.e. - yield a hash table of
> size 2^(number of elements). Also, it's less bookkeeping.
A hash table with n buckets is storing k keys.
The probability of a given key falling into a given bucket
with a perfectly random hashing function is 1/n.
The number of keys in a given bucket therefore follows a binomial
distribution, so the probability of a given bucket containing exactly
m keys is:

    p(m) = C(k,m) * (1/n)^m * (1 - 1/n)^(k-m)

where C(k,m) is the number of ways of choosing m keys out of k.
I define utilisation of the hash table as the number of stored keys
divided by the number of buckets (i.e. the average keys per bucket)
so n * u = k.
Using the Stirling approximation of the factorial to calculate the
number of combinations, I was able to plot cumulative probabilities
for a constant utilisation of 3, from n=10 buckets up to n=55 buckets
(after that the factorials get too big). Over this region the number of
buckets has NO EFFECT on the cumulative probability curve, given that the
utilisation is fixed. I would guess that the shape of the cumulative
probability curve depends only on the utilisation.
I also tried fixing the buckets at 55 and changing the utilisation:
# utilisation   keys in bucket   cumulative probability
  1             0                0.36451
  1             1                0.73578
  1             2                0.929255
  1             3                0.991696
  1             4                1.00663
  1             5                1.00943
  2             0                0.132867
  2             1                0.403524
  2             2                0.688219
  2             3                0.875442
  2             4                0.96755
  2             5                1.00356
  3             0                0.0484314
  3             1                0.196416
  3             2                0.43062
  3             3                0.663072
  3             4                0.836216
  3             5                0.939033
(The few values slightly above 1 are artefacts of the Stirling
approximation.) These figures can be read as saying,
``for a utilisation of 3, 94% of the buckets will contain 5 keys or fewer.''
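If anyone wants to reproduce these figures, here is a minimal sketch in
C. Note that I use lgamma() for the combinations rather than the
Stirling approximation (my choice for the sketch, so the numbers come
out slightly lower and stay below 1, and it avoids the overflow problem
past 55 buckets); the function names are mine.

#include <stdio.h>
#include <math.h>

/* Probability that a bucket holds m or fewer keys, for n buckets and k
   keys with a perfectly random hash: the sum over i <= m of
   C(k,i) (1/n)^i (1 - 1/n)^(k-i).  lgamma() gives log(i!) so the
   combinations do not overflow. */
static double cumulative( int n, int k, int m )
{
    double p = 1.0 / n, sum = 0.0;
    for( int i = 0; i <= m; i++ ) {
        double log_comb = lgamma( k + 1.0 ) - lgamma( i + 1.0 )
                          - lgamma( k - i + 1.0 );
        sum += exp( log_comb + i * log( p ) + ( k - i ) * log( 1.0 - p ) );
    }
    return sum;
}

int main( void )
{
    /* the utilisation = 3, n = 55 rows of the table above (k = u * n) */
    for( int m = 0; m <= 5; m++ )
        printf( "3 %d %g\n", m, cumulative( 55, 3 * 55, m ) );
    return 0;
}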
> In fact, the more I think about it, the less it seems like growing
> based on max bucket size is a good idea. If we start with a table
> size of 4 & double the table size when the max bucket size passes 3,
> and we have the bad luck that everything hashes to the same value,
> then after n insertions we end up with an array of size 2^(n-1)
Given the low utilisation that such a table would have, the probability
of this happening with a hash function that approximates a random
variable is vanishingly small. If it happened, I would be looking for
bugs elsewhere. Much the same as when I hear a loud explosion I don't
think, ``what a surprise, by some amazing fluke a huge number of randomly
moving air molecules all struck my eardrum at exactly the same moment''.
I think that growing based on the maximum bucket size should be OK,
but keeping the utilisation roughly constant is easier to implement
and its behaviour is easier to understand.
> You especially don't want to do this if the user is allowed to supply
> hash functions.
True, the user may offer a decidedly non-random hash function.
It is difficult to provide any guarantee of good performance when such
a thing happens.
> I'd like to also repeat what I mentioned elsewhere - namely if you
> shrink the tables on deletions you can easily get the bad behavior of
> resizing every few operations (on a mix of insertions & deletions)
> causing the hash tables to give O(n) behavior instead of O(1).
> They'll be slower than an alist, let alone a balanced tree.
Put some hysteresis into the grow and shrink. For example, grow when
the utilisation rises above 3 and shrink when it drops below 1. This
means that a resize cannot possibly occur every few operations.
This same problem occurs in automatic gearboxes when you get onto a
hill of just the right slope.
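A rough sketch of what I mean is below. The struct fields, the rehash()
helper, MIN_BUCKETS and the factor-of-2 growth are all just assumptions
for illustration, not anything taken from the current guile code:

#define MIN_BUCKETS        4
#define GROW_UTILISATION   3   /* grow when keys per bucket rises above this  */
#define SHRINK_UTILISATION 1   /* shrink when keys per bucket drops below this */

struct hash_table { unsigned long n_keys, n_buckets; /* buckets, etc. */ };

extern void rehash( struct hash_table *t, unsigned long new_n_buckets );

static void maybe_resize( struct hash_table *t )
{
    if( t->n_keys > GROW_UTILISATION * t->n_buckets )
        rehash( t, t->n_buckets * 2 );          /* utilisation > 3: grow   */
    else if( t->n_buckets > MIN_BUCKETS
             && t->n_keys < SHRINK_UTILISATION * t->n_buckets )
        rehash( t, t->n_buckets / 2 );          /* utilisation < 1: shrink */
}

Because the grow and shrink thresholds are separated, a mix of
insertions and deletions hovering near one of them cannot bounce the
table back and forth on every few operations.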
> As for Gawk, it also grows the vector when number of elements >
> AVG_CHAIN_MAX * vector size (expressed as n_elements/array_size >
> AVG_CHAIN_MAX). For gawk, AVG_CHAIN_MAX = 10 & it grows by about a
> factor of 10 each time.
I'd guess that they were scared of resizes, but since a hash table that
resizes as it grows still gives amortised O(1) insertion, they are not
really improving the situation. The trade is a larger constant factor
spent scanning chains against a smaller one spent resizing.
Unfortunately most of the table accesses will be ``find'' operations,
which suffer the penalty without the benefit (since a ``find'' never
causes a resize).
> * I tried a zillion different hash functions and asked many other
> * people for advice. Many people had their own favorite functions,
> * all different, but no-one had much idea why they were good ones.
> * I chose the one below (multiply by 9 and add new character)
> * because of the following reasons:
> * 1. Multiplying by 10 is perfect for keys that are decimal strings,
> * and multiplying by 9 is just about as good.
> * 2. Times-9 is (shift-left-3) plus (old). This means that each
> * character's bits hang around in the low-order bits of the
> * hash value for ever, plus they spread fairly rapidly up to
> * the high-order bits to fill out the hash value. This seems
> * works well both for decimal and non-decimal strings.
I hold that you can't really know the typical distribution of your keys
in advance. True, the example of decimal strings is a good one for perl
and Tcl, but it really can't be applied to guile, where strings are
mainly used to represent English words.
This leads to the conclusion that what a hash function is really trying
to do is map the key space onto a set of random numbers. The best hash function
is one that gives a uniformly distributed random output for the majority
of input keys. To achieve this, we borrow the ``butterfly effect'' from
chaos theory, which says that a small change in one iteration spreads
more and more with each subsequent iteration, leading to a large effect
on the output.
The common multiply-and-add method depends on repeated multiplication
combined with an implicit modulo caused by the finite word length
(much the same as a linear congruential random number generator). The
multiplication turns a small change into a large change and the modulo
folds it back onto itself so that the change is not lost. Most folks
know the trick of using shifts instead of a multiply to save time (in
effect CRC generators also rely on a linear congruence, multiplying
polynomials instead, and they too use shifts to save time).
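For comparison, the multiply-by-9-and-add hash from the comment quoted
above looks something like this. This is my own transcription of the
idea, not the actual Tcl code, and the function name is made up:

unsigned long MultiplyAddHash( const char *s )
{
    unsigned long h = 0, c;
    while(( c = (unsigned char) *s++ ))
        h = ( h << 3 ) + h + c;   /* h = h * 9 + c, modulo the word size */
    return h;
}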
I figure that it is faster and more logical to just keep a mapping
table of random numbers on hand. Then you immediately know that any
small change to the input represents a large change to the output.
Most likely it should feed part of the looked-up value back into the
index for future lookups too so that the effect of the change continues
to grow (I hadn't thought of that earlier). This would also mean
that where I used a rotate before, I could get away with just a shift... hmmm.
typedef unsigned long ULONG;
extern ULONG hashing_bit_matrix[256];       /* table of random numbers */

ULONG StringHash( const char *s )
{
    ULONG a = 0, c;
    while(( c = *s++ )) {
        a >>= 1;                                /* keep spreading earlier changes */
        a += c;                                 /* mix in the next character */
        a ^= hashing_bit_matrix[ 0xff & a ];    /* feed a random value back in */
    }
    return( a );
}
Loop unrolling can and should be handled by the compiler.
> 3. mappers & rebalance speedup idea.
> It might be worth keeping the key/item pairs themselves in a doubly
> linked list. This'd make hash-table->list fast, & would make mappers
> a little faster. Rebalancing would also be a little faster. It's
> also helpful for the worst case behavior - it keeps scanning all
> elements at O(n) instead of O(n + number_of_buckets). Of course, this
> gives a higher insertion cost, so it won't always be a win - it
> depends on the operations the user will be doing.
Well, I assume that the saving on mappers which you mention comes from
not having to waste time checking empty buckets. Assuming that the
utilisation is forced to somewhere between 1 and 3 by the resize, we
could take 2 as a typical utilisation and say that about 13% of the
buckets are expected to be empty. This does represent a time saving, I
guess, but the doubly linked list means that it's difficult to use
standard cons cells (which are singly linked), and that is probably
more pain than it is worth.