This has been bugging me for some time now...

The first thing you do when inserting/retreiving from hash table is to calculate the hashCode for the given key and then find the correct bucket by trimming the hashCode to the size of the hashTable by doing hashCode % table_length. Here are 2 statements that you most probably have read somewhere

"Wangchun Li" pointed outed out in the comments to "CLRS theorem 31.20". I approached it with a Ramanujan style proof, will look at the Hardy style proof in a bit.

The first thing you do when inserting/retreiving from hash table is to calculate the hashCode for the given key and then find the correct bucket by trimming the hashCode to the size of the hashTable by doing hashCode % table_length. Here are 2 statements that you most probably have read somewhere

- If you use a power of 2 for table_length, finding (hashCode(key) % 2^n ) is as simple and quick as (hashCode(key) & (2^n -1)). But if your function to calculate hashCode for a given key isn't good, you will suffer from clustering of many keys in a few hash buckets.

- If you use prime numbers for table_length, hashCodes calculated could map into the different hash buckets even if you have a slightly stupid hashCode function.

*If suppose your hashCode function results in the following hashCodes among others {x , 2x, 3x, 4x, 5x, 6x...}, then all these are going to be clustered in just m number of buckets, where m = table_length/GreatestCommonFactor(table_length, x). (It is trivial to verify/derive this). Now you can do one of the following to avoid clustering**Make sure that you don't generate too many hashCodes that are multiples of another hashCode like in {x, 2x, 3x, 4x, 5x, 6x...}.But this may be kind of difficult because you have to come up with a perfect hashCode function.*

*Or simply make m equal to the table_length by making GreatestCommonFactor(table_length, x) equal to 1, i.e by making table_length coprime with x. And if x can be just about any number then make sure that table_length is a prime number.*

"Wangchun Li" pointed outed out in the comments to "CLRS theorem 31.20". I approached it with a Ramanujan style proof, will look at the Hardy style proof in a bit.

## 7 comments:

Can you give the complete proof.

"If suppose your hashCode function results in the following hashCodes among others {x , 2x, 3x, 4x, 5x, 6x...}, then all these are going to be clustered in just m number of buckets".

------------------

can you send the complete proof to my e-mail:qingying1226@yahoo.com.cn

Thanks for the clear explanation. Your post was really useful in clearing a few grey areas I had while reading on hash functions.

Cheers pal, that's been bothering me for years

mark and see it later

"If suppose your hashCode function results in the following hashCodes among others {x , 2x, 3x, 4x, 5x, 6x...}, then all these are going to be clustered in just m number of buckets".

------

proof is in CLRS chapter 31,therom 31.20

And what if your hash table consists of multiples of the prime number used as length? Or the probability of that is low so this case is discarded?

Post a Comment