More data structures#

Education objectives

  • hash table

  • set

  • notion of complexity

  • dict

standard type set#

Unordered collections of unique elements. Sets are mutable. Not all elements can be part of a set, they must be hashable, a term we will define later.

s0 = set()
{1, 1, 1, 3}
{1, 3}
set([1, 1, 1, 3])
{1, 3}
s1 = {1, 2}
s2 = {2, 3}
print(s1.intersection(s2))
print(s1.union(s2))
{2}
{1, 2, 3}

set: lookup#

Lookup of an element in the set is algorithmically efficient (complexity O(1)), i.e. faster than a look up in a list or a tuple (complexity O(size iterable)). Consider the following example where we look for the last element of a list and a set:

size = 10000
large_list = list(range(size))
large_set = set(large_list)

%timeit size-1 in large_list
%timeit size-1 in large_set
148 μs ± 18.6 μs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
90 ns ± 16.2 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

Back to the “find the removed element” problem#

from random import shuffle, randint

size = 20
number = randint(0, size - 1)
print("integer remove from the list:", number)
numbers = list(range(size))
numbers.remove(number)
shuffle(numbers)
print("shuffled list: ", numbers)
integer remove from the list: 1
shuffled list:  [18, 11, 10, 9, 17, 3, 4, 0, 5, 6, 14, 19, 16, 15, 12, 7, 13, 8, 2]
  • Could the problem be solved using set ?

  • What is the complexity of this solution ?

A possible solution :

full_set = set(range(size))
changed_set = set(numbers)
ns = full_set - changed_set
ns.pop()
1

Complexity#

  • line 1: n insertions –> O(n)

  • line 2 : n insertions –> O(n)

  • line 3: one traversal O(n), with one lookup at each time (O(1) -> O(n)

=> Complexity of the whole algorithm : O(n)

Complexity of the “sum” solution : one traversal for the computation of the sum O(n) with sum at each step O(1) => O(n).

What is a hashtable?#

To be in a set, an object must be hashable, which means a function (called the hash function) is able to assign an index (i.e. a number) to this object. In the lookup example, we look for the presence of an element in a set. To do that, the hash function assign an index to this object and look if the index is present in the set. A set is simply a hash table. https://en.wikipedia.org/wiki/Hash_table

Some elements are not hashable, like list because they are mutable. Imagine you want to make a set of unique lists:

wrong_set = set(([1, 2], [2, 3], [1, 2]))
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[8], line 1
----> 1 wrong_set = set(([1, 2], [2, 3], [1, 2]))

TypeError: unhashable type: 'list'

Iteration on set is non-reproducible#

A set is an unordered sequence. The unordered nature can lead to unexpected and non-reproducible result. Let us create a set containing the first letters of the alphabet.

s_letters = set("abcde")
s_letters
{'a', 'b', 'c', 'd', 'e'}

Now, let’s see what happens when we inter on the set. We can just do that by creating a list from it.

list(s_letters)
['d', 'a', 'b', 'c', 'e']

Is the list ordered? If it is not, it’s surprising but normal: the iteration order is non-deterministic

Worse, exactly the same code can lead to different outputs!

!python3 -c "print(list(set('abcde')))"
['c', 'b', 'e', 'd', 'a']
!python3 -c "print(list(set('abcde')))"
['a', 'e', 'b', 'd', 'c']
!python3 -c "print(list(set('abcde')))"
['d', 'c', 'a', 'e', 'b']

When a program iterates over a set, the iteration order is non-deterministic, making the program’s behavior non-reproducible. Since reproducibility is often a desirable characteristic, one solution is to use the built-in sorted() function to convert the set to a sorted list with consistent ordering. However, this comes at the cost of sorting performance.

!python3 -c "print(sorted(set('abcde')))"
['a', 'b', 'c', 'd', 'e']
!python3 -c "print(sorted(set('abcde')))"
['a', 'b', 'c', 'd', 'e']
!python3 -c "print(sorted(set('abcde')))"
['a', 'b', 'c', 'd', 'e']

dict: set of “key: value” pairs#

The dictionary (dict) is a very important data structure in Python. All namespaces are (nearly) dictionaries and “Namespaces are one honking great idea – let’s do more of those!” (The zen of Python).

A dict is a hashtable (a set) + associated values.

my_dict = {}
my_dict["b"] = 2
my_dict["a"] = 1
print(my_dict)
{'b': 2, 'a': 1}
my_dict = {"a": 1, "b": 2, 0: False, 1: True}
print(my_dict)
{'a': 1, 'b': 2, 0: False, 1: True}

Tip: parallel between dict and list#

You can first think about dict as a super list which can be indexed with other objects than integers (and in particular with str).

my_list = ["value0", "value1"]
my_list.append("value2")
print(my_list)
['value0', 'value1', 'value2']
my_list[1]
'value1'
my_dict = {"key0": "value0", "key1": "value1"}
my_dict["key2"] = "value2"
print(my_dict)
{'key0': 'value0', 'key1': 'value1', 'key2': 'value2'}
my_dict["key1"]
'value1'

dict: public methods#

# dict have 11 public methods (with a list comprehension)
print([name for name in dir(my_dict) if not name.startswith("__")])
['clear', 'copy', 'fromkeys', 'get', 'items', 'keys', 'pop', 'popitem', 'setdefault', 'update', 'values']

dict: different ways to loops over a dictionary#

# loop with items
for key, value in my_dict.items():
    print(key, value)
key0 value0
key1 value1
key2 value2
# loop with values
for value in my_dict.values():
    print(value)
value0
value1
value2
# loop with keys
for key in my_dict.keys():
    print(key)
key0
key1
key2

Note

Note that (since Python 3.6) the iteration order for a dict is deterministic (in contrast as for set). dict conserves the insertion order.

for key, value in {2: 2, 1: 1, 3: 3}.items():
    print(key, value)
2 2
1 1
3 3

Dict comprehension#

grades = {"Albert": 19, "Martin": 10, "Sue": 17}
# Add 1 to everyone
new_grades = {name: grade + 1 for name, grade in grades.items()}
print(new_grades)
{'Albert': 20, 'Martin': 11, 'Sue': 18}

Exercise 15

Write a function that returns a dictionary containing the number of occurrences of letters in a text.

text = "abbbcc"