Hash Table

CSCI 1913 – Introduction to Algorithms, Data Structures, and Program Development
Adriana Picoral

Hash – definition

  • to “hash” something is to summarize it
  • a hash function calculates a key (which indicates where the value is)
  • it always calculates the same key for the same value
  • hash table operations built on it run in O(1) time in the average case

Rules of hashing

  • Hashing something should be pretty fast
  • The “hash” does not need to be reversible
  • The “hash” should be repeatable/reliable
  • The “hash” should avoid repeated keys (collisions)

Types of hashes

  • hashmaps
    • numbers should look like they were assigned randomly
  • cryptography
    • should be really hard to use backwards
    • should be really hard to make collisions

The hash function

  • The hash function is our “organizer” (it tells us where things go)

Simple example:

Write a function to compute a hash code for lowercase alphabetic English characters (a-z). The function returns integers from 0 to 25, with each char getting a different int. For any char that is not a lowercase English letter, the function returns a wild card value.

Implementation

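public int getHash(char c) {
  // note: uppercase letters are folded, so 'A' shares the slot of 'a'
  c = Character.toLowerCase(c);
  int hash = c - 'a';  // 'a' is 97, so a-z map to 0-25
  if (hash < 0 || hash > 25) return -1;  // -1 is the wild card
  return hash;
}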

What do you get when you run this?

Character a = 'a';
System.out.print(a.hashCode());
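
For reference, here is a minimal sketch of what the built-in `hashCode()` returns for a few boxed types. The class name `HashCodeDemo` is only illustrative. In Java, `Character.hashCode()` is the character's numeric value, so the snippet above prints 97.

```java
class HashCodeDemo {
    public static void main(String[] args) {
        Character a = 'a';
        // Character.hashCode() is the char's numeric value: 'a' is 97
        System.out.println(a.hashCode());
        // Integer.hashCode() is the int itself
        System.out.println(Integer.valueOf(42).hashCode());
        // String.hashCode() combines the chars: 97 * 31 + 98 = 3105
        System.out.println("ab".hashCode());
    }
}
```

None of these hash codes depends on a table size; turning a hash code into an array index is a separate step.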

Minimal Perfect Hash function

  • hash function maps a specific set of \(n\) keys to exactly \(n\) consecutive integer values
  • integer values are typically 0 to n-1
  • no collisions (perfect hash functions are collision free)
  • no unused slots (no gaps, key can be used directly as index in array)

Minimal Perfect Hash function

  • hash – if \(x == y\) then \(h(x) == h(y)\)
  • perfect – if \(h(x) == h(y)\) then \(x == y\)
  • minimal – only uses numbers \(0\) to \((n−1)\)
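
The three properties can be checked mechanically for a small key set. This is a rough sketch, assuming the key set is a-z and a hypothetical helper class `MinimalPerfectCheck`; it is not part of the course code.

```java
class MinimalPerfectCheck {
    // Candidate hash for the key set a-z (same idea as getHash above)
    static int h(char c) {
        return c - 'a';
    }

    // True if h maps the 26 keys onto 0..25 with no collisions and no gaps
    static boolean isMinimalPerfect() {
        boolean[] used = new boolean[26];
        for (char c = 'a'; c <= 'z'; c++) {
            int index = h(c);
            if (index < 0 || index >= 26) return false; // not minimal: outside 0..n-1
            if (used[index]) return false;              // not perfect: collision
            used[index] = true;
        }
        // 26 distinct indices inside a range of 26 values leaves no gaps
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isMinimalPerfect());
    }
}
```

The "hash" property (same input, same output) holds automatically here because `h` is a pure function of its argument.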

Minimal Perfect Hash function

Write a function to compute a hash code for lowercase alphabetic English characters (a-z). The function returns integers from 0 to 25, with each char getting a different int. For any char that is not a lowercase English letter, the function returns a wild card value.

General algorithm

Minimal Perfect Hash function \(h(key)\)

  • Data: store an array (of appropriate length) named values
  • put(key, value) – values[\(h(key)\)] = value
  • contains(key) – return values[\(h(key)\)] != null
  • get(key) – return values[\(h(key)\)]
  • remove(key) – values[\(h(key)\)] = null

Implementation

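public class CharHash<T> {
    // Java cannot create a generic array directly, so cast an Object[]
    @SuppressWarnings("unchecked")
    private T[] table = (T[]) new Object[26];

    private int getHash(char c) {
        c = Character.toLowerCase(c);
        int hash = c - 'a';  // 'a' is 97, so a-z map to 0-25
        if (hash < 0 || hash > 25) return -1;  // -1 is the wild card
        return hash;
    }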

    public void put(char key, T value) {
       int index = getHash(key);
       if (index >= 0) table[index] = value;

    }

    public boolean contains(char key) {
        int index = getHash(key);
        if (index >= 0) return table[index] != null;
        return false;
    }

    public T get(char key) {
        int index = getHash(key);
        if (index >= 0) return table[index];
        return null;
    }

    public void remove(char key) {
        int index = getHash(key);
        if (index >= 0) table[index] = null;
    }

}

Minimal Perfect Hash function

Advantages:

  • simple code
  • fast

Disadvantages:

  • hard to find a set of values that works
  • inflexible (item set is fixed)
  • large item sets may be wasteful
    • Example: As of Unicode version 18.0, there are 394,194 assigned characters

Student Rating of Teaching

Go to Canvas; in the menu on the left you will see a “Student Rating of Teaching” option.

Please complete the survey. It’s anonymous, I don’t get the results until winter break, and the results are used for improving my teaching (and providing feedback to my boss).

Quiz 10

You have 10 minutes to complete the quiz

Hash functions

Constraints:

  • hash function needs to return the same key for the same value
    • if x.equals(y) then x.hashCode() == y.hashCode()
    • if the object doesn’t change, its hashCode doesn’t change
  • to be fast, hash collisions need to be avoided (hard to formalize mathematically in the general case)
  • (as much as possible) if !x.equals(y) then x.hashCode() != y.hashCode()
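
A minimal sketch of a class that respects these constraints. `GridPoint` is a made-up example class; the only library call is `java.util.Objects.hash`. The fields are final, so the hash code cannot change after construction.

```java
import java.util.Objects;

class GridPoint {
    private final int x;
    private final int y;

    GridPoint(int x, int y) {
        this.x = x;
        this.y = y;
    }

    @Override
    public boolean equals(Object other) {
        if (this == other) return true;
        if (!(other instanceof GridPoint)) return false;
        GridPoint p = (GridPoint) other;
        return x == p.x && y == p.y;
    }

    // Built from the same fields as equals, so equal points share a hash code
    @Override
    public int hashCode() {
        return Objects.hash(x, y);
    }

    public static void main(String[] args) {
        GridPoint a = new GridPoint(1, 2);
        GridPoint b = new GridPoint(1, 2);
        System.out.println(a.equals(b));
        System.out.println(a.hashCode() == b.hashCode());
    }
}
```

The reverse direction is only "as much as possible": two unequal points may still end up with the same hash code.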

Imperfect hash functions

Two main issues:

  • not minimal – hash values could be very large or negative
  • not perfect – collisions are possible

Simple example

Here are the values that are to be stored in the hash table:

37, 7, 73, 77, 11, 68, 50, 66, 19, 4, 1, 18, 76, 71, 31, 45, 12, 25, 49, 53,
69, 87, 41, 2, 43, 23, 62, 38, 47, 51, 55, 65, 58, 91, 17, 63, 27, 88, 48

  • What should the size of the table be?
  • What would be a good hash function to store these numbers?

Simple example

Here’s a hash function in java:

public class Hash {
    int tableSize;

    public Hash(int size) {
        tableSize = size;
    }

    public int getHash(int value) {
        return Math.abs(value % tableSize);
    }
}

  • How many collisions would we get?

Table size: 10 (each entry is index - number of values hashed to that index)

0 - 1   1 - 7   2 - 3   3 - 5   4 - 1
5 - 4   6 - 2   7 - 7   8 - 6   9 - 3

Simple example – testing

int[] values = new int[]{37, 7, 73, 77, 11, 68, 50, 66, 19, 4, 1, 18, 76, 71,
  31, 45, 12, 25, 49, 53, 69, 87, 41, 2, 43, 23, 62, 38, 47, 51, 55, 65, 58, 
  91, 17, 63, 27, 88, 48};

int size = 20;
Hash hash = new Hash(size);

int[] indices = new int[size];
for (int v : values) {
  indices[hash.getHash(v)] += 1;
}

for (int i=0; i<size; i++) {
  System.out.println(i + " - " + indices[i] );
}

What does Claude say?

Simple and fast solution:

  • h(k) = k mod m
  • m = 53 (a prime number)
    • Prime numbers reduce clustering and collisions
    • 53 is larger than 39 (your data size) but not excessively so
    • Gives a good load factor of about 0.74 (39/53)
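
To compare table sizes on the 39 values from the example, one option is to count collisions directly. This is a sketch under one assumption: a "collision" is counted each time a value lands on an index that an earlier value already used. The class name `CollisionCount` is hypothetical.

```java
class CollisionCount {
    static final int[] VALUES = {37, 7, 73, 77, 11, 68, 50, 66, 19, 4, 1, 18, 76,
        71, 31, 45, 12, 25, 49, 53, 69, 87, 41, 2, 43, 23, 62, 38, 47, 51, 55,
        65, 58, 91, 17, 63, 27, 88, 48};

    // Counts values that land on an index some earlier value already used
    static int countCollisions(int[] values, int tableSize) {
        boolean[] used = new boolean[tableSize];
        int collisions = 0;
        for (int v : values) {
            int index = Math.abs(v % tableSize);
            if (used[index]) collisions++;
            else used[index] = true;
        }
        return collisions;
    }

    public static void main(String[] args) {
        // Prints "table size - collisions" for each candidate size
        for (int size : new int[]{10, 20, 53}) {
            System.out.println(size + " - " + countCollisions(VALUES, size));
        }
    }
}
```

With only 10 slots every slot is used, so 29 of the 39 values collide; with 53 slots the count drops sharply.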

What does Claude say?

Works well for any table size:

  • h(k) = ⌊m × (k × A mod 1)⌋
  • Using A ≈ 0.6180339887 (golden ratio constant) with m = 64
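
The formula above can be sketched in Java as follows. This assumes non-negative keys and uses `double` arithmetic for the fractional part; the class and method names are illustrative only.

```java
class MultiplicationHash {
    // A is approximately the fractional part of the golden ratio
    static final double A = 0.6180339887;

    // h(k) = floor(m * (k * A mod 1)), for non-negative k
    static int hash(int k, int m) {
        double fraction = (k * A) % 1.0; // fractional part of k * A, in [0, 1)
        return (int) Math.floor(m * fraction);
    }

    public static void main(String[] args) {
        int m = 64;
        for (int k : new int[]{37, 7, 73, 77, 11}) {
            System.out.println(k + " -> " + hash(k, m));
        }
    }
}
```

Unlike `k mod m`, this method does not rely on `m` being prime, which is why a power of two such as 64 is a reasonable table size here.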

Exercise

Consider an initially empty hash table \(H\) of size 15, designed to store integer keys. The hash function is defined as:

hash(k) = |k mod length(\(H\))|

Insert the following sequence of keys in the given order: -524, 16, 322, -15, 78, -212

Exercise

hash(k) = |k mod length(\(H\))|

Insert the following sequence of keys in the given order: -524, 16, 322, -15, 78, -212

index  0    1    2    3    4    5    6    7    8    9    10   11   12   13   14
value  (all slots empty)

Exercise

hash(k) = |k mod length(\(H\))|

Insert the following sequence of keys in the given order: -524, 16, 322, -15, 78, -212

-524 % 15 = -14 (Java’s % keeps the sign of the dividend, unlike the mathematical mod)

|-14| = 14

index  0    1    2    3    4    5    6    7    8    9    10   11   12   13   14
value                                                                        -524

Exercise

hash(k) = |k mod length(\(H\))|

Insert the following sequence of keys in the given order: 16, 322, -15, 78, -212

16 % 15 = 1

index  0    1    2    3    4    5    6    7    8    9    10   11   12   13   14
value       16                                                               -524

Exercise

hash(k) = |k mod length(\(H\))|

Insert the following sequence of keys in the given order: 322, -15, 78, -212

322 % 15 = 7

index  0    1    2    3    4    5    6    7    8    9    10   11   12   13   14
value       16                            322                                -524

Exercise

hash(k) = |k mod length(\(H\))|

Insert the following sequence of keys in the given order: -15, 78, -212

-15 % 15 = 0

index  0    1    2    3    4    5    6    7    8    9    10   11   12   13   14
value  -15  16                            322                                -524

Exercise

hash(k) = |k mod length(\(H\))|

Insert the following sequence of keys in the given order: 78, -212

78 % 15 = 3

index  0    1    2    3    4    5    6    7    8    9    10   11   12   13   14
value  -15  16        78                  322                                -524

Exercise

hash(k) = |k mod length(\(H\))|

Insert the following sequence of keys in the given order: -212

|-212 % 15| = |-2| = 2

index  0    1    2    3    4    5    6    7    8    9    10   11   12   13   14
value  -15  16   -212 78                  322                                -524

What if the next value to be stored is 120?
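
The index arithmetic in this exercise can be double-checked with a short sketch. It assumes the hash is the absolute value of Java's `%` remainder, as in the `Hash` class earlier; `ExerciseIndices` is a hypothetical name.

```java
class ExerciseIndices {
    // Absolute value of Java's remainder, as used in the exercise
    static int hash(int k, int tableLength) {
        return Math.abs(k % tableLength);
    }

    public static void main(String[] args) {
        int[] keys = {-524, 16, 322, -15, 78, -212, 120};
        for (int k : keys) {
            System.out.println(k + " -> " + hash(k, 15));
        }
    }
}
```

The key 120 maps to index 0, which already holds -15, so inserting it is a collision.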

Imperfect hash functions

Two main issues:

  • not minimal – hash values could be very large or negative
  • not perfect – collisions are possible

Two problems to solve: hash code range, collisions

Hash code range problem

We compute index from hashcode as follows:

int index = Math.abs(hashCode % tableSize);

  • modulo re-maps a non-negative code to the range 0 to tableSize-1
  • if hashCode is negative, Java’s remainder is negative (or zero), so use the absolute value to fix it

Hash code range problem

We compute index from hashcode as follows:

int index = Math.abs(hashCode % tableSize);

Consequences:

  • Index is based on table size and data:
    • Any time we change table size, everything gets a new index
  • This may cause collisions in an otherwise perfect hash code
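
A small sketch of both consequences, using made-up hash codes 7 and 22. It also shows `Math.floorMod`, a standard-library alternative to the absolute-value approach that is not used elsewhere in these slides; it is mentioned here only for comparison.

```java
class IndexRange {
    static int index(int hashCode, int tableSize) {
        return Math.abs(hashCode % tableSize);
    }

    public static void main(String[] args) {
        // Distinct hash codes 7 and 22 share an index in a table of size 15
        System.out.println(index(7, 15) + " " + index(22, 15));
        // In a table of size 20 the same codes land on different indices
        System.out.println(index(7, 20) + " " + index(22, 20));
        // floorMod also yields a non-negative index, but a different one
        // from the absolute-value approach when the hash code is negative
        System.out.println(index(-524, 15) + " " + Math.floorMod(-524, 15));
    }
}
```

Either approach works as long as the same one is used for every put, get, and remove.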

The collision problem

This one is hard. Two general approaches, both compromise efficiency:

  • Open Addressing: If a slot is taken, probe for the next empty one

  • Chaining: Each array slot holds a linked list of all items that hashed there. If there’s a collision, just add to the list.

Open Addressing

If a slot is full, put the item somewhere else in the same array

  • linear probing - step to the next slot until a free one is found
  • quadratic probing - use a more complicated strategy (growing jumps) to pick the next slot
  • double hashing - have “which try is this” be part of the hash code computation
  • advantage: easier to store to a disk (everything is in one array)
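
The three probing strategies differ only in how the i-th attempt is computed. The formulas below are common textbook variants, chosen here as assumptions (in particular the second hash `1 + |key mod (m-1)|`); the slides do not prescribe specific formulas.

```java
class ProbeSequences {
    // i is the attempt number: 0 for the first try, 1 for the second, ...
    static int linear(int home, int i, int m) {
        return (home + i) % m;
    }

    static int quadratic(int home, int i, int m) {
        return (home + i * i) % m;
    }

    // The second hash must never be 0, otherwise the probe would not move
    static int secondHash(int key, int m) {
        return 1 + Math.abs(key % (m - 1));
    }

    static int doubleHashing(int key, int i, int m) {
        int home = Math.abs(key % m);
        return (home + i * secondHash(key, m)) % m;
    }

    public static void main(String[] args) {
        int m = 7;
        for (int i = 0; i < 4; i++) {
            System.out.println(linear(3, i, m) + " " + quadratic(3, i, m)
                + " " + doubleHashing(10, i, m));
        }
    }
}
```

In all three cases attempt 0 is the home index, so an uncontested insert costs a single array access.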

Open Addressing exercise

Consider an initially empty hash table \(H\) of size 5, designed to store integer keys with linear probing. The hash function is defined as:

hash(k) = k mod length(\(H\))

Insert the following sequence of keys in the given order: 5, 7, 10
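
One way to check an answer to this exercise is a small linear-probing table. This is a minimal sketch that only supports insertion of integer keys and does not check for duplicates; `LinearProbingTable` is a hypothetical class name.

```java
import java.util.Arrays;

class LinearProbingTable {
    private final Integer[] slots;

    LinearProbingTable(int size) {
        slots = new Integer[size];
    }

    // Returns the index where the key was stored, or -1 if the table is full
    int insert(int key) {
        int home = Math.abs(key % slots.length);
        for (int i = 0; i < slots.length; i++) {
            int index = (home + i) % slots.length; // wrap around at the end
            if (slots[index] == null) {
                slots[index] = key;
                return index;
            }
        }
        return -1;
    }

    Integer at(int index) {
        return slots[index];
    }

    public static void main(String[] args) {
        LinearProbingTable t = new LinearProbingTable(5);
        for (int key : new int[]{5, 7, 10}) {
            System.out.println(key + " stored at " + t.insert(key));
        }
        System.out.println(Arrays.toString(t.slots));
    }
}
```

Keys 5 and 10 both have home index 0, so 10 is pushed to the next free slot, index 1.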

Chaining

Key compromise: hash table doesn’t actually “finish” the job - just sorts into buckets.

  • put → find “bucket”, search bucket, add or set
  • get → find “bucket”, search bucket
  • contains → find “bucket”, search bucket
  • remove → find “bucket”, search bucket, delete if found

Chaining

  • Finding the bucket → O(1)
  • Searching a bucket is often O(bucketN), where bucketN is the number of items in that bucket
  • Keep buckets small and all is good.
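
A minimal sketch of chaining, assuming integer keys only (a set of keys rather than key-value pairs, to keep it short). Each bucket is a `java.util.LinkedList`; the class name `ChainedTable` and the helper `bucketSize` are illustrative.

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

class ChainedTable {
    private final List<LinkedList<Integer>> buckets = new ArrayList<>();

    ChainedTable(int size) {
        for (int i = 0; i < size; i++) buckets.add(new LinkedList<>());
    }

    // Finding the bucket is O(1): one remainder and one array access
    private LinkedList<Integer> bucketFor(int key) {
        return buckets.get(Math.abs(key % buckets.size()));
    }

    // Searching inside the bucket is linear in the bucket's size
    void put(int key) {
        LinkedList<Integer> bucket = bucketFor(key);
        if (!bucket.contains(key)) bucket.add(key);
    }

    boolean contains(int key) {
        return bucketFor(key).contains(key);
    }

    void remove(int key) {
        // Integer.valueOf selects remove(Object), not remove(int index)
        bucketFor(key).remove(Integer.valueOf(key));
    }

    int bucketSize(int index) {
        return buckets.get(index).size();
    }

    public static void main(String[] args) {
        ChainedTable t = new ChainedTable(15);
        t.put(-15);
        t.put(120); // same bucket as -15, so it is simply added to the list
        System.out.println(t.bucketSize(0));
    }
}
```

A collision never fails here; it only makes one bucket longer, which is why keeping buckets small matters.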

Bucket Choices

  • parallel array map - most buckets have 0 to 1 element - waste of space
  • association list (linked chain) - great memory savings on empty buckets
  • binary search tree - faster search, more key requirements

Writing a good hash function

  • The same value must always produce the same hash code
  • Hash codes should be spread evenly across the output space
  • The hash function should be fast to compute
  • A good hash function makes collisions rare
  • (cryptographic) Given a hash, it should be computationally infeasible to find any input that produces it

Load Factor

λ – the “load factor”; n – current size of key set; k – current size of table

λ = n / k

  • λ < 1: array is sparse
  • λ == 1: array is probably still sparse (collisions leave some slots empty)
  • λ > 1: more items than array slots – buckets are getting full

Load Factor – exercise

For one of our previous exercises, we had 39 values to store (n). What would be the load factor λ for different table sizes?

Load Factor – exercise

For one of our previous exercises, we had 39 values to store (n). What would be the load factor λ for different table sizes?

  • λ = 39/10 = 3.9
  • λ = 39/20 = 1.95
  • λ = 39/50 = 0.78
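
The same arithmetic as a tiny sketch; `LoadFactor` is a hypothetical helper, and the cast to `double` avoids integer division.

```java
class LoadFactor {
    // n = number of stored keys, k = table size
    static double loadFactor(int n, int k) {
        return (double) n / k;
    }

    public static void main(String[] args) {
        for (int k : new int[]{10, 20, 50}) {
            System.out.println(k + " - " + loadFactor(39, k));
        }
    }
}
```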

Rehashing – growing the table size

Why?

  • Keep load factor small
  • Keep performance fast

Rehashing – growing the table size

When?

  • track load factor
  • When too big – make array bigger
  • (advanced feature) when too small make smaller

Rehashing – growing the table size

How?

  • Make a new, bigger array
    • DO NOT: copy position to position
    • re-compute the index for each element:
      • index = Math.abs(hash % tableSize)
      • hash doesn’t change
      • tableSize does
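
A rough sketch of these steps for a chained table of integer keys, where the key serves as its own hash code. Doubling the size is an assumption made for the example; the class `Rehash` and its methods are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class Rehash {
    static List<List<Integer>> emptyTable(int size) {
        List<List<Integer>> table = new ArrayList<>();
        for (int i = 0; i < size; i++) table.add(new ArrayList<>());
        return table;
    }

    static void put(List<List<Integer>> table, int key) {
        table.get(Math.abs(key % table.size())).add(key);
    }

    // Builds a table twice as large and re-inserts every key into it
    static List<List<Integer>> grow(List<List<Integer>> old) {
        List<List<Integer>> bigger = emptyTable(old.size() * 2);
        for (List<Integer> bucket : old) {
            for (int key : bucket) {
                put(bigger, key); // index is recomputed with the new size
            }
        }
        return bigger;
    }

    public static void main(String[] args) {
        List<List<Integer>> table = emptyTable(15);
        put(table, 7);
        put(table, 22); // collides with 7 while the size is 15
        System.out.println(table.get(7));
        List<List<Integer>> grown = grow(table);
        System.out.println(grown.get(7) + " " + grown.get(22));
    }
}
```

Copying position to position would leave 22 at index 7, where a lookup in the size-30 table would never search for it.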

Practice

Design a Simple Spell Checker.

You’re building a spell checker that needs to quickly determine if a word exists in a dictionary of 100,000 English words.

Implement a hash table to store the dictionary. Write a simple hash function for strings that takes a word as input and returns an integer index for an array.

Analyze Your Hash Function

Test your hash function with a sample of words. Does it:

  • Distribute words evenly across buckets?
  • Handle similar words (like “cat” and “act”) differently?
  • Produce many collisions?
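
To make the “cat” versus “act” question concrete, here is a sketch comparing two candidate string hashes. `sumHash` adds character codes; `polyHash` uses the multiply-by-31 scheme that `String.hashCode()` also uses. Both names and the table size of 70000 are assumptions for illustration.

```java
class StringHashCompare {
    // Order-insensitive: anagrams always collide
    static int sumHash(String word, int tableSize) {
        int total = 0;
        for (char c : word.toCharArray()) total += c;
        return total % tableSize;
    }

    // Order-sensitive: each step multiplies the running hash by 31
    static int polyHash(String word, int tableSize) {
        int h = 0;
        for (char c : word.toCharArray()) h = 31 * h + c;
        // h may overflow to a negative int; floorMod keeps the index >= 0
        return Math.floorMod(h, tableSize);
    }

    public static void main(String[] args) {
        int size = 70000;
        for (String w : new String[]{"cat", "act"}) {
            System.out.println(w + ": " + sumHash(w, size) + " " + polyHash(w, size));
        }
    }
}
```

The sum of character codes for ordinary words also stays within a few thousand, so `sumHash` cannot spread 100,000 words over many distinct indices on its own.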

Solution

public class WordHash {
    private String[] table;
    private int tableSize = 70000;
    private int count;

    public WordHash() {
        table = new String[tableSize];
        count = 0;
    }

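    /**
     * Hash function: sums the character codes, so anagrams such as
     * "cat" and "act" get the same index.
     * @param word is a string, a word in English
     * @return an index, which is an int between 0 and tableSize-1
     */
    private int getIndex(String word) {
        int total = 0;
        for (char c : word.toCharArray()) {
            total += (int) c;
        }
        // floorMod keeps the index non-negative even if total * 997 overflows
        return Math.floorMod(total * 997, tableSize);
    }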

    public void put(String word) {
        int index = getIndex(word);
        // linear probing: step forward, wrapping around at the end of the array
        while (table[index] != null) {
            if (table[index].equals(word)) return; // already stored
            index = (index + 1) % tableSize;
        }
        table[index] = word;
        count++;
        if (load() > 0.7) expandArray();
    }

    public boolean contains(String word) {
        int index = getIndex(word);
        while (table[index] != null && !table[index].equals(word)) {
            index = (index + 1) % tableSize;
        }
        return table[index] != null;
    }

    private int whereToPut(String word, String[] newTable) {
        int index = getIndex(word);
        while (newTable[index] != null) {
            index = (index + 1) % newTable.length;
        }
        return index;
    }

    private void expandArray() {
        tableSize *= 2; // getIndex now maps into the larger table
        String[] newTable = new String[tableSize];
        for (String word : table) {
            if (word == null) continue; // skip empty slots
            newTable[whereToPut(word, newTable)] = word;
        }
        table = newTable;
    }

    public double load() {
        return (double) count / (double) tableSize;
    }

    @Override
    public String toString() {
        String retVal = "";
        for (int i=0; i < tableSize; i++) {
            if (table[i] != null) retVal += i + " - " + table[i] + "\n";
        }
        return retVal.trim();
    }
}