CSC 378: Data Structures and Algorithm Analysis

Universal Hashing

Motivation

We're constructing a hash table. Suppose the probability distribution of searches,

index1.gif

is fixed but unknown. We might inadvertently choose a bad hash function that gives index2.gif performance, since index3.gif is unknown.

For example, U has 1,000,000 keys, 1000 of which are high-probability. All other keys have zero probability. If the hash function maps all 1000 to the same slot, we'll get terrible performance.

Idea: Make a new, random hash function, h, each time the program runs. (We'll have to rehash our data with the new hash function each time the program runs.)

We can choose h from a special class, H, such that the expected number of collisions is low, no matter the distribution, D.

Unfortunately, we will still occasionally pick a bad hash function. But it'll only last for one run of the program, and it'll occur only infrequently.

Let index4.gif be a set of hash functions.

H is universal if, for each distinct pair, index5.gif , the number of functions, index6.gif , for which index7.gif is exactly index8.gif .

Suppose index9.gif :

index10.gif

So if we pick a random function, h, from H, there is a index11.gif chance that index12.gif . That means that, for such a randomly chosen h,

index13.gif

This is true for any pair, index14.gif : index15.gif .

But this is the same chance of collision as if index16.gif and index17.gif are chosen randomly from index18.gif . So this is the best hash function that we could choose.

Constructing a universal set of hash functions

Choose m to be prime.

For a key, index19.gif , break x into index20.gif pieces, index21.gif .

We'll choose the division into bit strings such that each index22.gif has the same number of bits, and each index23.gif .

For example,

To pick a random hash function from a universal set, randomly choose

index24.gif

where each index25.gif is randomly chosen from index26.gif .

Define a hash function based on this choice of a:

index27.gif

Define index28.gif . The union is taken over all possible as (there are index29.gif of them).

Then index30.gif is a universal set of hash functions, and any index31.gif randomly chosen from index32.gif has the desirable property that index33.gif .

Proof that H is universal

We will prove that H is universal by showing that, index34.gif , if index35.gif is randomly chosen from index36.gif then index37.gif .

From number theory, the equation

index38.gif

(for known integers a, b, and m (with index39.gif ) and unknown x) has a unique solution, x, if m is prime.

Let index40.gif with index41.gif .

Write them as

index42.gif

index43.gif

Since index44.gif , they differ in at least one position. Without loss of generality, assume that index45.gif .

If index46.gif , then

index47.gif

In other words, if index48.gif , then

index49.gif

But for how many value of index50.gif is this true if index51.gif are fixed?

Using the number theory equation above, substitute

index52.gif

Then the equation has a unique solution for index53.gif !

This means that, if we choose index54.gif randomly, there's only one choice of the remaining index55.gif that results in a collision.

Since we're choosing index56.gif randomly from index57.gif , there's a index58.gif chance of choosing the index59.gif that is a solution for the equation.

In other words, index60.gif .

Looking at this another way: For a given index61.gif , there are index62.gif ways to choose the index63.gif such that index64.gif . But since there are index65.gif total ways to choose index66.gif , the probability that we choose one for which index67.gif is exactly

index68.gif

Thus H is a universal set of hash functions.

Summary

Picking a hash function randomly from a universal set gives us good expected performance.

One way to pick such a function is to generate random index69.gif as described above. Then index70.gif is the hash function.

Each time the program is run, we randomly pick a new hash function. This protects us from being stuck with a consistently bad hash function. But it means that we have to rehash all of our data at the start of each execution.

This is a randomized algorithm for hashing.



Back to the lecture index

Copyright 1998 James Stewart