Hash Table

Hash – definition

to “hash” something is to summarize it
a hash function calculates a key from a value (the key indicates where the value is stored)
it always calculates the same key for the same value
hash table operations (put, get, contains, remove) take O(1) time in the average case

Rules of hashing

Hashing something should be pretty fast
The “hash” does not need to be reversible
The “hash” should be repeatable/reliable
The “hash” should avoid repeated keys (collisions)

Types of hashes

hashmaps
- numbers should look like they were assigned randomly
cryptography
- should be really hard to use backwards
- should be really hard to make collisions
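To make the contrast concrete, the sketch below hashes the same string two ways: with `String.hashCode()` (the kind of hash a hashmap uses) and with SHA-256 from `java.security.MessageDigest` (a cryptographic hash). The class name `HashKinds` and the helper `sha256Hex` are made up for this illustration.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class HashKinds {
    // Cryptographic hash: 256 bits, rendered here as 64 hex digits.
    static String sha256Hex(String text) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest(text.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to provide SHA-256.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Hashmap-style hash: a 32-bit int that is cheap to compute.
        System.out.println("abc".hashCode()); // 96354
        // Cryptographic hash: much longer, and designed to be hard to reverse.
        System.out.println(sha256Hex("abc"));
    }
}
```

The hashmap hash only has to spread values out; the cryptographic hash pays extra computation to make reversing it or forcing a collision impractical.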

The hash function

The hash function is our “organizer” (it tells us where things go)

Simple example:

Write a function to compute a hash code for lowercase alphabetic English characters (a-z). The function returns integers from 0 to 25, with each char getting a different int. The function returns a wildcard value for any char that is not a lowercase English letter.

Implementation

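public int getHash(char c) {
  // Fold uppercase letters to lowercase, so 'A' and 'a' share a hash code
  c = Character.toLowerCase(c);
  int hash = c - 'a'; // 'a' is 97, so 'a'..'z' map to 0..25
  if (hash < 0 || hash > 25) return -1; // wildcard for anything outside a-z
  return hash;
}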

What do you get when you run this?

Character a = 'a';
System.out.print(a.hashCode());

Minimal Perfect Hash function

hash function maps a specific set of \(n\) keys to exactly \(n\) consecutive integer values
integer values are typically 0 to n-1
no collisions (perfect hash functions are collision free)
no unused slots (no gaps, key can be used directly as index in array)

Minimal Perfect Hash function

hash – if \(x == y\) then \(h(x) == h(y)\)
perfect – if \(h(x) == h(y)\) then \(x == y\)
minimal – only uses numbers \(0\) to \((n−1)\)
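The three properties can be checked mechanically for a small key set. The sketch below is one way to do it; `MinimalPerfectCheck`, `h`, and `isMinimalPerfect` are names invented for this example, and `h` is assumed to be the a-z hash from the simple example.

```java
class MinimalPerfectCheck {
    // Candidate hash for the key set a-z.
    static int h(char c) {
        return c - 'a';
    }

    // True if h maps the n keys onto 0..n-1 with no collisions and no gaps.
    static boolean isMinimalPerfect(char[] keys) {
        boolean[] used = new boolean[keys.length];
        for (char key : keys) {
            int code = h(key);
            if (code < 0 || code >= keys.length) return false; // not minimal
            if (used[code]) return false;                      // not perfect
            used[code] = true;
        }
        return true;
    }

    public static void main(String[] args) {
        char[] alphabet = "abcdefghijklmnopqrstuvwxyz".toCharArray();
        System.out.println(isMinimalPerfect(alphabet));            // true
        // Same h, different key set: 'e' hashes to 4, outside 0..2.
        System.out.println(isMinimalPerfect("ace".toCharArray())); // false
    }
}
```

The second call shows that "minimal perfect" is a property of a hash function together with a specific key set, not of the function alone.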

Minimal Perfect Hash function

Write a function to compute a hash code for lowercase alphabetic English characters (a-z). The function returns integers from 0 to 25, with each char getting a different int. The function returns a wildcard value for any char that is not a lowercase English letter.

General algorithm

Minimal Perfect Hash function \(h(key)\)

Data: store an array (of appropriate length) named values
put(key, value) – values[\(h(key)\)] = value
contains(key) – return values[\(h(key)\)] != null
get(key) – return values[\(h(key)\)]
remove(key) – values[\(h(key)\)] = null

Implementation

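public class CharHash<T> {
    // Java cannot create a generic array directly, so we cast an Object[]
    @SuppressWarnings("unchecked")
    private T[] table = (T[]) new Object[26];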

    private int getHash(char c) {
        c = Character.toLowerCase(c);
        int hash = (int) c - 97;
        if (hash < 0 || hash > 25) return -1;
        return hash;
    }

    public void put(char key, T value) {
       int index = getHash(key);
       if (index >= 0) table[index] = value;

    }

    public boolean contains(char key) {
        int index = getHash(key);
        if (index >= 0) return table[index] != null;
        return false;
    }

    public T get(char key) {
        int index = getHash(key);
        if (index >= 0) return table[index];
        return null;
    }

    public void remove(char key) {
        int index = getHash(key);
        if (index >= 0) table[index] = null;
    }

}

Minimal Perfect Hash function

Advantages:

simple code
fast

Disadvantages:

hard to find a set of values that works
inflexible (item set is fixed)
large item sets may be wasteful
- Example: As of Unicode version 16.0, there are 154,998 assigned characters (out of a code space of 1,114,112 code points)

Hash functions

Constraints:

hash function needs to return the same key for the same value
- if x.equals(y) then x.hashCode() == y.hashCode()
- if the object doesn’t change, its hashCode doesn’t change
to be fast, hash collisions need to be avoided (hard to mathematically formalize this in the general case)
(As much as possible) if !x.equals(y) then x.hashCode() != y.hashCode()
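As an illustration of these constraints, here is a minimal sketch of a class that overrides `equals` and `hashCode` together. The class `GridPoint` is hypothetical; `Objects.hash` is the standard helper from `java.util`.

```java
import java.util.Objects;

final class GridPoint {
    // Final fields: the object never changes, so its hash code never changes.
    private final int x;
    private final int y;

    GridPoint(int x, int y) {
        this.x = x;
        this.y = y;
    }

    @Override
    public boolean equals(Object other) {
        if (this == other) return true;
        if (!(other instanceof GridPoint)) return false;
        GridPoint p = (GridPoint) other;
        return x == p.x && y == p.y;
    }

    // Built from exactly the fields that equals compares,
    // so equal points always get equal hash codes.
    @Override
    public int hashCode() {
        return Objects.hash(x, y);
    }
}
```

If `hashCode` used a field that `equals` ignores (or the other way round), two equal objects could land in different buckets and a hash table would fail to find them.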

Imperfect hash functions

Two main issues:

not minimal – hash values could be very large or negative
not perfect – collisions are possible

Simple example

Here are the values that are to be stored in the hash table:

37, 7, 73, 77, 11, 68, 50, 66, 19, 4, 1, 18, 76, 71, 31, 45, 12, 25, 49, 53,
69, 87, 41, 2, 43, 23, 62, 38, 47, 51, 55, 65, 58, 91, 17, 63, 27, 88, 48

What should the size of the table be?
What would be a good hash function to store these numbers?

Simple example

Here’s a hash function in java:

public class Hash {
    int tableSize;

    public Hash(int size) {
        tableSize = size;
    }

    public int getHash(int value) {
        return Math.abs(value % tableSize);
    }
}

How many collisions would we get?

Table size: 10

0 - 1   1 - 7   2 - 3   3 - 5   4 - 1 
5 - 4   6 - 2   7 - 7   8 - 6   9 - 3

Simple example – testing

int[] values = new int[]{37, 7, 73, 77, 11, 68, 50, 66, 19, 4, 1, 18, 76, 71,
  31, 45, 12, 25, 49, 53, 69, 87, 41, 2, 43, 23, 62, 38, 47, 51, 55, 65, 58, 
  91, 17, 63, 27, 88, 48};

int size = 20;
Hash hash = new Hash(size);

int[] indices = new int[size];
for (int v : values) {
  indices[hash.getHash(v)] += 1;
}

for (int i=0; i<size; i++) {
  System.out.println(i + " - " + indices[i] );
}

What does Claude say?

Simple and fast solution:

h(k) = k mod m
m = 53 (a prime number)
- A prime table size reduces clustering when the keys share a common factor or follow a regular pattern
- 53 is larger than 39 (the number of values) but not excessively so
- Gives a good load factor of about 0.74 (39/53)

What does Claude say?

Works well for any table size:

h(k) = ⌊m × (k × A mod 1)⌋
Using A ≈ 0.6180339887 (golden ratio constant) with m = 64
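A minimal sketch of this multiplicative formula, assuming the constant A and table size m given above. The class and method names are invented for the example.

```java
class MultiplicativeHash {
    static final double A = 0.6180339887; // golden ratio constant from the formula

    // h(k) = floor(m * (k * A mod 1))
    static int hash(int k, int m) {
        double product = k * A;
        double fraction = product - Math.floor(product); // the "mod 1" step
        return (int) Math.floor(m * fraction);
    }

    public static void main(String[] args) {
        int m = 64;
        for (int k : new int[]{37, 7, 73, 77, 11}) {
            System.out.println(k + " -> " + hash(k, m));
        }
    }
}
```

Because the fractional part is always at least 0 and below 1, the result falls in 0 to m-1 without needing m to be prime.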

Exercise

Consider an initially empty hash table \(H\) of size 15, designed to store integer keys. The hash function is defined as:

hash(k) = |k mod length(\(H\))|

Insert the following sequence of keys in the given order: -524, 16, 322, -15, 78, -212

Exercise

hash(k) = |k mod length(\(H\))|

Insert the following sequence of keys in the given order: -524, 16, 322, -15, 78, -212

0	1	2	3	4	5	6	7	8	9	10	11	12	13	14

Exercise

hash(k) = |k mod length(\(H\))|

Insert the following sequence of keys in the given order: -524, 16, 322, -15, 78, -212

-524 % 15 = -14 (Java’s % is a remainder that keeps the sign of the dividend; the mathematical modulo would give 1)

|-14| = 14

0	1	2	3	4	5	6	7	8	9	10	11	12	13	14

														-524
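The difference between Java’s `%` and the mathematical modulo can be checked directly. The sketch below compares the absolute-value approach used in this exercise with `Math.floorMod`, which is part of the standard library; the class and method names are made up for the example.

```java
class RemainderDemo {
    // Index as computed in this exercise: remainder first, then absolute value.
    static int absIndex(int k, int size) {
        return Math.abs(k % size);
    }

    // Index using the mathematical modulo, which is never negative for size > 0.
    static int floorIndex(int k, int size) {
        return Math.floorMod(k, size);
    }

    public static void main(String[] args) {
        int k = -524;
        int size = 15;
        System.out.println(k % size);            // -14: % keeps the sign of k
        System.out.println(absIndex(k, size));   // 14
        System.out.println(floorIndex(k, size)); // 1
    }
}
```

Both approaches give a valid index, but they place negative keys in different slots, so a table has to use one of them consistently.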

Exercise

hash(k) = |k mod length(\(H\))|

Insert the following sequence of keys in the given order: 16, 322, -15, 78, -212

16 % 15 = 1

0	1	2	3	4	5	6	7	8	9	10	11	12	13	14

	16													-524

Exercise

hash(k) = |k mod length(\(H\))|

Insert the following sequence of keys in the given order: 322, -15, 78, -212

322 % 15 = 7

0	1	2	3	4	5	6	7	8	9	10	11	12	13	14

	16						322							-524

Exercise

hash(k) = |k mod length(\(H\))|

Insert the following sequence of keys in the given order: -15, 78, -212

-15 % 15 = 0

0	1	2	3	4	5	6	7	8	9	10	11	12	13	14

-15	16						322							-524

Exercise

hash(k) = |k mod length(\(H\))|

Insert the following sequence of keys in the given order: 78, -212

78 % 15 = 3

0	1	2	3	4	5	6	7	8	9	10	11	12	13	14

-15	16		78				322							-524

Exercise

hash(k) = |k mod length(\(H\))|

Insert the following sequence of keys in the given order: -212

-212 % 15 = -2, and |-2| = 2

0	1	2	3	4	5	6	7	8	9	10	11	12	13	14

-15	16	-212	78				322							-524

What if the next value to be stored is 120?

Imperfect hash functions

Two main issues:

not minimal – hash values could be very large or negative
not perfect – collisions are possible

Two problems to solve: hash code range, collisions

Hash code range problem

We compute index from hashcode as follows:

int index = Math.abs(hashCode % tableSize);

modulo re-maps a non-negative code to the range 0 to tableSize-1
if hashCode is negative, the result of % is negative or zero, so we use the absolute value to fix it
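The order of the two operations matters. The sketch below (with invented names `index` and `brokenIndex`) shows why the modulo has to come before the absolute value: `Math.abs(Integer.MIN_VALUE)` is still negative, because the matching positive number does not fit in an `int`.

```java
class IndexFromHashCode {
    // Modulo first, then abs: the result is always in 0..tableSize-1.
    static int index(int hashCode, int tableSize) {
        return Math.abs(hashCode % tableSize);
    }

    // Abs first, then modulo: fails for Integer.MIN_VALUE.
    static int brokenIndex(int hashCode, int tableSize) {
        return Math.abs(hashCode) % tableSize;
    }

    public static void main(String[] args) {
        int h = Integer.MIN_VALUE;
        System.out.println(index(h, 15));       // 8
        System.out.println(brokenIndex(h, 15)); // -8, not a valid array index
    }
}
```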

Hash code range problem

We compute index from hashcode as follows:

int index = Math.abs(hashCode % tableSize);

Consequences:

Index is based on table size and data:
- Any time we change table size, everything gets a new index
This may cause collisions in an otherwise perfect hash code

The collision problem

This one is hard. Two general approaches, both compromise efficiency:

Open Addressing: If a slot is taken, probe for the next empty one
Chaining: Each array slot holds a linked list of all items that hashed there. If there’s a collision, just add to the list.

Open Addressing

If the slot is full, put the item somewhere else

linear probing - just go to next free space
quadratic probing - use a more complicated strategy to pick next space
double hashing - have “which try is this” be part of hash code computation
easier to store to a disk (everything is in an array)
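The three probing strategies differ only in which slot they try on the i-th attempt. The sketch below uses common textbook formulas, which are an assumption here since the slides do not fix them: `home + i` for linear, `home + i*i` for quadratic, and `home + i*step` for double hashing, where `step` would come from a second hash function.

```java
class ProbeSequences {
    // Slot tried on attempt i (i = 0 is the home slot).
    static int linear(int home, int i, int size) {
        return (home + i) % size;
    }

    static int quadratic(int home, int i, int size) {
        return (home + i * i) % size;
    }

    // step is the value of a second hash function and must not be 0.
    static int doubleHash(int home, int step, int i, int size) {
        return (home + i * step) % size;
    }

    public static void main(String[] args) {
        int home = 3, size = 11, step = 5;
        for (int i = 0; i < 4; i++) {
            System.out.println(i + ": linear " + linear(home, i, size)
                + ", quadratic " + quadratic(home, i, size)
                + ", double " + doubleHash(home, step, i, size));
        }
    }
}
```

With home slot 3 in a table of size 11, linear probing tries 3, 4, 5, 6; quadratic tries 3, 4, 7, 1; double hashing with step 5 tries 3, 8, 2, 7.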

Open Addressing exercise

Consider an initially empty hash table \(H\) of size 5, designed to store integer keys with linear probing. The hash function is defined as:

hash(k) = k mod length(\(H\))

Insert the following sequence of keys in the given order: 5, 7, 10
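One way to check an answer to this exercise is to run it. The sketch below is a minimal linear-probing table for integer keys; `LinearProbingTable` is a made-up class, and it assumes the table always has at least one free slot (it never grows).

```java
import java.util.Arrays;

class LinearProbingTable {
    private final Integer[] slots;

    LinearProbingTable(int size) {
        slots = new Integer[size];
    }

    // Stores the key and returns the slot used.
    // Assumes a free slot exists; a full table would loop forever.
    int insert(int key) {
        int index = Math.abs(key % slots.length);
        while (slots[index] != null) {
            index = (index + 1) % slots.length; // next slot, wrapping around
        }
        slots[index] = key;
        return index;
    }

    @Override
    public String toString() {
        return Arrays.toString(slots);
    }

    public static void main(String[] args) {
        LinearProbingTable table = new LinearProbingTable(5);
        for (int key : new int[]{5, 7, 10}) {
            System.out.println(key + " -> slot " + table.insert(key));
        }
        System.out.println(table); // [5, 10, 7, null, null]
    }
}
```

Keys 5 and 10 both hash to slot 0, so 10 is pushed to the next free slot, which is slot 1.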

Chaining

Key compromise: hash table doesn’t actually “finish” the job - just sorts into buckets.

put → find “bucket”, search bucket, add or set
get → find “bucket”, search bucket
contains → find “bucket”, search bucket
remove → find “bucket”, search bucket

Chaining

Finding the bucket → O(1)
Searching a bucket is often O(bucketN), where bucketN is the number of items in that bucket
Keep buckets small and all is good.
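A minimal sketch of chaining for integer keys, using `java.util.LinkedList` as the bucket. The class `ChainedIntSet` is invented for this example, and it stores keys only (no values) to stay short.

```java
import java.util.LinkedList;

class ChainedIntSet {
    private final LinkedList<Integer>[] buckets;

    @SuppressWarnings("unchecked")
    ChainedIntSet(int size) {
        buckets = new LinkedList[size];
        for (int i = 0; i < size; i++) {
            buckets[i] = new LinkedList<>();
        }
    }

    // Finding the bucket is O(1).
    private LinkedList<Integer> bucketFor(int key) {
        return buckets[Math.abs(key % buckets.length)];
    }

    // Searching the bucket is O(bucket size).
    void put(int key) {
        LinkedList<Integer> bucket = bucketFor(key);
        if (!bucket.contains(key)) bucket.add(key);
    }

    boolean contains(int key) {
        return bucketFor(key).contains(key);
    }

    void remove(int key) {
        // Integer.valueOf forces remove-by-value rather than remove-by-position.
        bucketFor(key).remove(Integer.valueOf(key));
    }
}
```

With a table of size 15, the keys -15 and 120 share bucket 0, and both can be stored and found because the bucket simply holds two items.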

Bucket Choices

parallel array map - most buckets have 0 to 1 element - waste of space
association list (linked chain) - great memory savings on empty buckets
binary search tree - faster search, more key requirements

Writing a good hash function

The same value must always produce the same hash code
Hash codes should be spread evenly across the output space
The hash function should be fast to compute
A good hash function makes collisions rare
(cryptographic) Given a hash, it should be computationally infeasible to find any input that produces it

Load Factor

λ – the “load factor”, n – current number of keys stored, k – current size of the table

λ = n / k

λ < 1 – array is sparse
λ == 1 – array is probably still sparse (collisions leave some slots empty)
λ > 1 – more items than array slots, so buckets are getting full

Load Factor – exercise

For one of our previous exercises, we had 39 values to store (n). What would be the load factor λ for different table sizes?

Load Factor – exercise

For one of our previous exercises, we had 39 values to store (n). What would be the load factor λ for different table sizes?

λ = 39/10 = 3.9
λ = 39/20 = 1.95
λ = 39/50 = 0.78
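The same arithmetic as a short sketch, with the hypothetical helper `loadFactor`. The cast to `double` matters: with two `int` operands, 39/50 would be integer division and give 0.

```java
class LoadFactor {
    // lambda = n / k, computed in floating point.
    static double loadFactor(int itemCount, int tableSize) {
        return (double) itemCount / tableSize;
    }

    public static void main(String[] args) {
        int n = 39;
        for (int k : new int[]{10, 20, 50}) {
            System.out.println("table size " + k + ": " + loadFactor(n, k));
        }
    }
}
```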

Rehashing – growing the table size

Why?

Keep load factor small
Keep performance fast

Rehashing – growing the table size

When?

track load factor
When too big – make array bigger
(advanced feature) when too small make smaller

Rehashing – growing the table size

How?

Make new bigger array
- DO NOT: copy position to position
- re-compute index for each element:
  - index = hash % tableSize
  - hash doesn’t change
  - tablesize does
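The steps above can be sketched for a small open-addressing table of integers. The names `Rehash`, `indexFor`, and `grow` are made up, and doubling the size is just one possible growth policy.

```java
import java.util.Arrays;

class Rehash {
    static int indexFor(int hash, int tableSize) {
        return Math.abs(hash % tableSize);
    }

    // Builds a table twice as large and re-inserts every key with linear probing.
    static Integer[] grow(Integer[] old) {
        Integer[] bigger = new Integer[old.length * 2];
        for (Integer key : old) {
            if (key == null) continue;                // empty slot, nothing to move
            int index = indexFor(key, bigger.length); // recompute: table size changed
            while (bigger[index] != null) {
                index = (index + 1) % bigger.length;
            }
            bigger[index] = key;
        }
        return bigger;
    }

    public static void main(String[] args) {
        // A size-5 table in which 10 collided with 5 and was pushed to slot 1.
        Integer[] table = {5, 10, 7, null, null};
        System.out.println(Arrays.toString(grow(table)));
        // [10, null, null, null, null, 5, null, 7, null, null]
    }
}
```

Copying position to position would have left 5 in slot 0 and 10 in slot 1, where a lookup in the size-10 table would never find them.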

Practice

Design a Simple Spell Checker.

You’re building a spell checker that needs to quickly determine if a word exists in a dictionary of 100,000 English words

Implement a hash table to store the dictionary. Write a simple hash function for strings that takes a word as input and returns an integer index for an array.

Analyze Your Hash Function

Test your hash function with a sample of words. Does it:

Distribute words evenly across buckets?
Handle similar words (like “cat” and “act”) differently?
Produce many collisions?
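To see what the “cat” and “act” question is getting at, the sketch below compares two simple string hashes. Both are illustrations, not required answers: `sumHash` adds up character codes, and `polyHash` multiplies by 31 at each step (the same multiplier `String.hashCode()` uses). The table size of 70000 is an arbitrary choice for the demo.

```java
class StringHashes {
    // Sum of character codes: order is ignored, so anagrams collide.
    static int sumHash(String word, int tableSize) {
        int total = 0;
        for (char c : word.toCharArray()) {
            total += c;
        }
        return Math.floorMod(total, tableSize);
    }

    // Polynomial hash: multiplying by 31 at each step makes position matter.
    static int polyHash(String word, int tableSize) {
        int total = 0;
        for (char c : word.toCharArray()) {
            total = 31 * total + c;
        }
        // total may overflow to a negative int; floorMod keeps the index valid.
        return Math.floorMod(total, tableSize);
    }

    public static void main(String[] args) {
        int size = 70000;
        System.out.println(sumHash("cat", size) + " " + sumHash("act", size));   // 312 312
        System.out.println(polyHash("cat", size) + " " + polyHash("act", size)); // 28262 26402
    }
}
```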

Solution

public class WordHash {
    private String[] table;
    private int tableSize = 70000;
    private int count;

    public WordHash() {
        table = new String[tableSize];
        count = 0;
    }

    /**
     * Hash function: sums the character codes and scales the sum.
     * Note that anagrams such as "cat" and "act" get the same index.
     * @param word is a string, a word in English
     * @return an index, which is an int between 0 and tableSize-1
     */
    private int getIndex(String word) {
        int total = 0;
        for (char c : word.toCharArray()) {
            total += (int) c;
        }
        // floorMod keeps the index non-negative even if the product overflows
        return Math.floorMod(total * 997, tableSize);
    }

    public void put(String word) {
        int index = getIndex(word);
        // Linear probing: walk forward until a free slot is found
        while (table[index] != null) {
            index = (index + 1) % tableSize; // wrap around after the last slot
        }
        table[index] = word;
        count++;
        if (load() > 0.7) expandArray();
    }

    public boolean contains(String word) {
        int index = getIndex(word);
        // Stop at the word itself or at the first empty slot
        while (table[index] != null && !table[index].equals(word)) {
            index = (index + 1) % tableSize;
        }
        return table[index] != null;
    }

    private int whereToPut(String word, String[] newTable) {
        int index = getIndex(word);
        while (newTable[index] != null) {
            index = (index + 1) % newTable.length;
        }
        return index;
    }

    private void expandArray() {
        tableSize *= 2; // getIndex now computes indices for the bigger table
        String[] newTable = new String[tableSize];
        for (String word : table) {
            // Skip empty slots; only stored words are re-hashed
            if (word != null) newTable[whereToPut(word, newTable)] = word;
        }
        table = newTable;
    }

    public double load() {
        return (double) count / (double) tableSize;
    }

    @Override
    public String toString() {
        StringBuilder retVal = new StringBuilder();
        for (int i = 0; i < tableSize; i++) {
            if (table[i] != null) retVal.append(i).append(" - ").append(table[i]).append("\n");
        }
        return retVal.toString().trim();
    }
}