Hash Tables

Definition

What are Hash Tables?

• Hash Tables are key-value pairs
• key must be immutable

Hash Function

• Hash functions convert the value to key
• One big challenge is to calculate the key more evenly to avoid collision
• One common way is to calculate the mod of a prime number (the number theory is beyond the scope here)
• but look at two examples below

Example 1:

10 % 4 -> 2
20 % 4 -> 0
30 % 4 -> 2
40 % 4 -> 0
50 % 4 -> 2

In example 1, the results from hash function don’t distribute evenly, they are either 2 or 0

Example 2:

10 % 7 -> 3
20 % 7 -> 6
30 % 7 -> 2
40 % 7 -> 4
50 % 7 -> 1

In example 2, the results of moding a prime number distribute much more evenly so the chance for collision is small

Below is a good reference to select the prime number to mod based on the data size.

Collision

In example 2, if we keep calculating mod

60 % 7 -> 4 This is a hash collision as 40 % 7 -> 4 too

Chaining

• One solution to collision is chaining,
• which basically creates a linked list at the index where collision occurs,
• e.g. 4: 40%7 -> 60%7
• if the hash value is not distributed evenly, it might become a linked list, losing the benefits of hash tables
• however it’s not how Python handles the collision.

• Python uses open addressing solution to collisions
• When an index is taken, Python will find the next available index
• There are several types of addressing
• h is the hash function, i is the key, k is the value
• linear probing: h(k,i) = (h(k)+i) % m, i = 0,1,2…,m-1
• quadratic probing: h(k,i) = (h(k) + c1 + c2*i**2) % m, i = 0,1,2,…,m-1
• rehashing: h(k,i) = (h1(k) + i*h2(k))%m
• In CPython, quadratic probing is used

• Here’s the details in CPython
• “Open addressing is preferred over chaining since the link overhead for chaining would be substantial (100% with typical malloc overhead).”
• Open Addressing Code

def h(key, M=13):
return key % M

nums = [765, 431, 96, 142, 579, 226, 903, 388]

def handle_collision(nums, M=13, inserted_set=set()):
for num in nums:
index = h(num)
first_index = index
i = 1
while index in inserted_set:
print("\th({num}) = {num} % M = {index} collision".format(num=num, index=index))
index = (first_index + i**2) % M
i += 1
else:
print("h({num}) = {num} % M = {index}".format(num=num, index=index))

handle_collision(nums)

h(765) = 765 % M = 11
h(431) = 431 % M = 2
h(96) = 96 % M = 5
h(142) = 142 % M = 12
h(579) = 579 % M = 7
h(226) = 226 % M = 5 collision
h(226) = 226 % M = 6
h(903) = 903 % M = 6 collision
h(903) = 903 % M = 7 collision
h(903) = 903 % M = 10
h(388) = 388 % M = 11 collision
h(388) = 388 % M = 12 collision
h(388) = 388 % M = 2 collision
h(388) = 388 % M = 7 collision
h(388) = 388 % M = 1

• As we keep adding elements into hash tables, the spaces might be taken up soon
• load factor is the ratio of used indices vs. total indices,
• in the open addressing example,
• the size of the hash table is 13,
• there are already 8 indices used,
• so the load factor is 8 / 13 ~= 0.6
• Usually when the load factor is above 0.8, we need to open new spaces for rehashing

Rehashing

• As discussed above, once the load factor is greater than 0.8, we need to rehash
• How large is the new space depends, in CPython 3.7, it’s currently set to 3, so new space will be used space * 3
• Once the new space is created, all the elements in the old hash will be copied into the new one, which involves a time complexity of O(n)

Complexity

Time Complexity

• Hash function: O(1)
• Memory index access: O(1)
• Search complexity O(1)

Python Implementation

Hashcode in Python

In python, we can use hash function to calculate the hash value for any object. Hash values are integers. They are used to quickly compare dictionary keys during a dictionary lookup. Numeric values that compare equal have the same hash value (even if they are of different types, as is the case for 1 and 1.0).

# for an int number
a = 20
print(hash(a))

# for the same number but in float
print(hash(20.0))

# a negative int
b = -20
print(hash(b))

# a string
print(hash("c"))

# a float
print(hash(3.1415926))

# a custom object
class ListNode:
def __init__(self, val):
self.val = val
self.next = None

ln = ListNode(20)

print(hash(ListNode))
print(hash(ln))
20
20
-20
-5423205724351203701
326490306866391043
-9223363252756124725
-9223372036577884363

HashTable Abstract Data Type

• Here I will “build” a simplified version of Hash Table

• It contains a “Slot” to store indices, it has three status

• used
• removed
• used
• Three different methods

• get(key)
• remove(key)
class Slot:
def __init__(self, key, val):
self.key = key
self.val = val

class HashTable:
UNUSED = None # slot has not been used
REMOVED = Slot(None, None) # slot was used before but got removed

def __init__(self):
self._table = [HashTable.UNUSED] * 8
self.length = 0

@property
return self.length / float(len(self._table))

def __len__(self):
return self.length

def _hash(self, key):
# use a simple mod function
return hash(key) % len(self._table)

def _find_key(self, key):
index = self._hash(key)
_len = len(self._table)

while self._table[index] is not HashTable.UNUSED:
# only find index when it's actually used
if self._table[index] is HashTable.REMOVED:
# one way for cpython to handle hash collision
index = (index * 5 + 1) % _len
# if the key is removed, then keep looping
continue
elif self._table[index].key == key:
return index
else:
index = (index * 5 + 1) % _len

return None

def _slot_can_insert(self, index):
# check if the slot is able to be inserted new element
if self._table[index] in (HashTable.UNUSED, HashTable.REMOVED):
return True
else:
return False

def _find_slot_to_insert(self, key):
# Find a slot to insert value
index = self._hash(key)
_len = len(self._table)

while not self._slot_can_insert(index):
# while we cannot insert
# find the next availabe slot to insert
index = (index*5 + 1) % _len

return index

def __contains__(self, key):
# in operation
index = self._find_key(key)
return index is not None

def _rehash(self):
# create new spaces whenever spaces are about to run out
old_table = self._table
self._table = [HashTable.UNUSED] * (len(self._table) * 3)
self.length = 0

for ele in old_table:
# only transfer the valid elements
if ele is not HashTable.UNUSED and ele is not HashTable.REMOVED:
index = self._find_slot_to_insert(ele.key)
self._table[index] = ele
self.length += 1

def add(self, key, val):
# add new element to HT
if key in self:
# use the __contains__
# if found update
index = self._find_key(key)
self._table[index] = val
return False # only updated not added
else:
# find the availabe slot to insert
index = self._find_slot_to_insert(key)
# add the element
self._table[index] = Slot(key, val)
# update length
self.length += 1

# if load factor is greater than 0.8, rehashing
if self._load_factor >= 0.8:
self._rehash()

return True

def get(self, key):
index = self._find_key(key)
if index is None:
return False
else:
return self._table[index].val

def delete(self, key):
index = self._find_key(key)
if index is None:
raise Exception("Key Error")
value = self._table[index].val
self.length -= 1
self._table[index] = HashTable.UNUSED
return value

def __iter__(self):
# Iterate thru the ht via key
for ele in self._table:
if ele not in (HashTable.UNUSED, HashTable.REMOVED):
yield ele.key

def test():
ht = HashTable()

# test iterate
for ele in ht:
print(ele)

# test delete
print(ht.delete("a"))

print(ht.get("b"))

test()
b
c
a
0
42