Hash Tables

Definition

What are Hash Tables?

  • Hash Tables are key-value pairs
    • key must be immutable

Hash Function

  • Hash functions convert the value to key
  • One big challenge is to calculate the key more evenly to avoid collision
  • One common way is to calculate the mod of a prime number (the number theory is beyond the scope here)
  • but look at two examples below

Example 1:

10 % 4 -> 2
20 % 4 -> 0
30 % 4 -> 2
40 % 4 -> 0
50 % 4 -> 2

In example 1, the results from hash function don’t distribute evenly, they are either 2 or 0

Example 2:

10 % 7 -> 3
20 % 7 -> 6
30 % 7 -> 2
40 % 7 -> 4
50 % 7 -> 1

In example 2, the results of moding a prime number distribute much more evenly so the chance for collision is small

Below is a good reference to select the prime number to mod based on the data size.

Collision

In example 2, if we keep calculating mod

60 % 7 -> 4 This is a hash collision as 40 % 7 -> 4 too

Chaining

  • One solution to collision is chaining,
    • which basically creates a linked list at the index where collision occurs,
    • e.g. 4: 40%7 -> 60%7
    • if the hash value is not distributed evenly, it might become a linked list, losing the benefits of hash tables
  • however it’s not how Python handles the collision.

Open Addressing

  • Python uses open addressing solution to collisions
    • When an index is taken, Python will find the next available index
  • There are several types of addressing
    • h is the hash function, i is the key, k is the value
    • linear probing: h(k,i) = (h(k)+i) % m, i = 0,1,2…,m-1
    • quadratic probing: h(k,i) = (h(k) + c1 + c2*i**2) % m, i = 0,1,2,…,m-1
    • rehashing: h(k,i) = (h1(k) + i*h2(k))%m
  • In CPython, quadratic probing is used

    • Here’s the details in CPython
    • “Open addressing is preferred over chaining since the link overhead for chaining would be substantial (100% with typical malloc overhead).”
  • Open Addressing Code

    def h(key, M=13):
        return key % M
    
    nums = [765, 431, 96, 142, 579, 226, 903, 388]
    
    def handle_collision(nums, M=13, inserted_set=set()):
        for num in nums:
            index = h(num)
            first_index = index
            i = 1
            while index in inserted_set:
                print("\th({num}) = {num} % M = {index} collision".format(num=num, index=index))
                index = (first_index + i**2) % M
                i += 1
            else:
                print("h({num}) = {num} % M = {index}".format(num=num, index=index))
                inserted_set.add(index)
    
    handle_collision(nums)
    
    h(765) = 765 % M = 11
    h(431) = 431 % M = 2
    h(96) = 96 % M = 5
    h(142) = 142 % M = 12
    h(579) = 579 % M = 7
        h(226) = 226 % M = 5 collision
    h(226) = 226 % M = 6
        h(903) = 903 % M = 6 collision
        h(903) = 903 % M = 7 collision
    h(903) = 903 % M = 10
        h(388) = 388 % M = 11 collision
        h(388) = 388 % M = 12 collision
        h(388) = 388 % M = 2 collision
        h(388) = 388 % M = 7 collision
    h(388) = 388 % M = 1
    

Load Factor

  • As we keep adding elements into hash tables, the spaces might be taken up soon
  • load factor is the ratio of used indices vs. total indices,
  • in the open addressing example,
    • the size of the hash table is 13,
    • there are already 8 indices used,
    • so the load factor is 8 / 13 ~= 0.6
  • Usually when the load factor is above 0.8, we need to open new spaces for rehashing

Rehashing

  • As discussed above, once the load factor is greater than 0.8, we need to rehash
  • How large is the new space depends, in CPython 3.7, it’s currently set to 3, so new space will be used space * 3
  • Once the new space is created, all the elements in the old hash will be copied into the new one, which involves a time complexity of O(n)

Complexity

Time Complexity

  • Hash function: O(1)
  • Memory index access: O(1)
  • Search complexity O(1)

Python Implementation

Hashcode in Python

In python, we can use hash function to calculate the hash value for any object. Hash values are integers. They are used to quickly compare dictionary keys during a dictionary lookup. Numeric values that compare equal have the same hash value (even if they are of different types, as is the case for 1 and 1.0).

# for an int number
a = 20
print(hash(a))

# for the same number but in float
print(hash(20.0))

# a negative int
b = -20
print(hash(b))

# a string
print(hash("c"))

# a float
print(hash(3.1415926))

# a custom object
class ListNode:
    def __init__(self, val):
        self.val = val
        self.next = None

ln = ListNode(20)

print(hash(ListNode))
print(hash(ln))
20
20
-20
-5423205724351203701
326490306866391043
-9223363252756124725
-9223372036577884363

HashTable Abstract Data Type

  • Here I will “build” a simplified version of Hash Table

  • It contains a “Slot” to store indices, it has three status

    • used
    • removed
    • used
  • Three different methods

    • add(key, value)
    • get(key)
    • remove(key)
class Slot:
    def __init__(self, key, val):
        self.key = key
        self.val = val

class HashTable:
    UNUSED = None # slot has not been used
    REMOVED = Slot(None, None) # slot was used before but got removed

    def __init__(self):
        self._table = [HashTable.UNUSED] * 8
        self.length = 0

    # load factor
    @property
    def _load_factor(self):
        return self.length / float(len(self._table))

    def __len__(self):
        return self.length

    def _hash(self, key):
        # use a simple mod function
        return hash(key) % len(self._table)

    def _find_key(self, key):
        index = self._hash(key)
        _len = len(self._table)

        while self._table[index] is not HashTable.UNUSED:
            # only find index when it's actually used
            if self._table[index] is HashTable.REMOVED:
                # one way for cpython to handle hash collision
                index = (index * 5 + 1) % _len
                # if the key is removed, then keep looping
                continue
            elif self._table[index].key == key:
                return index
            else:
                index = (index * 5 + 1) % _len

        return None

    def _slot_can_insert(self, index):
        # check if the slot is able to be inserted new element
        if self._table[index] in (HashTable.UNUSED, HashTable.REMOVED):
            return True
        else:
            return False

    def _find_slot_to_insert(self, key):
        # Find a slot to insert value
        index = self._hash(key)
        _len = len(self._table)

        while not self._slot_can_insert(index):
            # while we cannot insert
            # find the next availabe slot to insert
            index = (index*5 + 1) % _len

        return index

    def __contains__(self, key):
        # in operation
        index = self._find_key(key)
        return index is not None

    def _rehash(self):
        # create new spaces whenever spaces are about to run out
        old_table = self._table
        self._table = [HashTable.UNUSED] * (len(self._table) * 3)
        self.length = 0

        for ele in old_table:
            # only transfer the valid elements
            if ele is not HashTable.UNUSED and ele is not HashTable.REMOVED:
                index = self._find_slot_to_insert(ele.key)
                self._table[index] = ele
                self.length += 1

    def add(self, key, val):
        # add new element to HT
        if key in self:
            # use the __contains__
            # if found update
            index = self._find_key(key)
            self._table[index] = val
            return False # only updated not added
        else:
            # find the availabe slot to insert
            index = self._find_slot_to_insert(key)
            # add the element
            self._table[index] = Slot(key, val)
            # update length
            self.length += 1

            # if load factor is greater than 0.8, rehashing
            if self._load_factor >= 0.8:
                self._rehash()

            return True

    def get(self, key):
        index = self._find_key(key)
        if index is None:
            return False
        else:
            return self._table[index].val

    def delete(self, key):
        index = self._find_key(key)
        if index is None:
            raise Exception("Key Error")
        value = self._table[index].val
        self.length -= 1
        self._table[index] = HashTable.UNUSED
        return value

    def __iter__(self):
        # Iterate thru the ht via key
        for ele in self._table:
            if ele not in (HashTable.UNUSED, HashTable.REMOVED):
                yield ele.key

def test():
    ht = HashTable()

    # test add
    ht.add("a", 0)
    ht.add("b", 42)
    ht.add("c", 74)

    # test iterate
    for ele in ht:
        print(ele)

    # test delete
    print(ht.delete("a"))

    print(ht.get("b"))

test()
b
c
a
0
42