by Martin Erwig
If the one-by-one search reminds you of the search for an element in a list, you are correct. This is exactly what’s happening here. The difference is that the list search is applied repeatedly for each row, that is, for each character of the clue word. And therein lies the power of this method. Consider what happens to the search space, that is, the set of all 262,144 possible paths across the tile grid. Each tile in the first row identifies a different starting point that can be continued in 8 × 8 × 8 × 8 × 8 = 32,768 different ways by selecting the different combination of letters from the five remaining rows. When Indiana Jones looks at the first tile, which is labeled, say, K, he will of course not step on it, since it doesn’t match the required I. This single decision has removed in one fell swoop a total of 32,768 paths from the search space, namely, all those paths that can be formed by starting with the K tile. And the same reduction happens with each further rejected tile in the first row.
Once Indiana Jones reaches the correct tile, the reduction of the search space is even more dramatic, because as soon as he steps on the tile, the decision process for the first row is completed, and the search space is immediately reduced to the 32,768 paths that can be formed with the remaining five rows. All in all, with at most seven decisions (which are required when the I tile comes last) Indiana Jones has reduced the search space by a factor of eight. And so he continues with the second row of tiles and the letter e. Again, the search space is reduced by a factor of eight to 4,096 with at most seven decisions. Once Indiana Jones has reached the last row, only eight possible paths are left, and again with no more than seven decisions he can complete the path. In the worst case this search requires 6 × 7 = 42 “steps” (only six literal steps)—a remarkably efficient way of finding one path among 262,144.
It might not be obvious, but the challenge mastered by Indiana Jones bears some similarity to Hansel and Gretel’s. In both cases the protagonists have to find a path to safety, and in both cases the path consists of a sequence of locations that are marked—in the case of Hansel and Gretel by pebbles, in the case of Indiana Jones by letters. The search for the next location in the paths is different, though: Hansel and Gretel just have to find any pebble (although they have to make sure not to revisit pebbles), whereas Indiana Jones has to find a specific letter. Both examples again emphasize the role of representation in computation. Indiana Jones’s challenge specifically illustrates that the sequence of signifiers (letters) for individual tiles is itself another signifier (the word Iehova) for a path. Moreover, the fact that the word Iehova stands for a path illustrates how a computation that simply searches for a word becomes meaningful and important in the world.
The way in which Indiana Jones finds the clue word on the grid is exactly how one can search efficiently for a word in a dictionary: first narrow down the group of pages that contain words that start with the same letter as the clue word, then narrow down the group of pages further to those that match the first and second letters, and so on, until the word is found.
To better understand the data structure that lurks behind the tile grid and that makes Indiana Jones’s search so efficient, we look at another way of representing words and dictionaries that uses trees to organize the search space.
Counting with a Dictionary
Chapter 4 described the tree data structure. I used a family tree to demonstrate how to compute a list of heirs in order of their inheritance priority. The computation traversed the whole tree and had to visit every node in the tree. Trees are also an excellent data structure to support the search within a collection. In this case the search uses nodes in the tree to direct the search down one path to find the desired element.
Let’s reconsider Indiana Jones’s tile floor challenge. In the movie, his first step turns into a dramatic scene where he almost gets killed by stepping onto an unsafe tile. He uses the spelling Jehova and steps onto a J tile, which crumbles under his feet. This shows that the challenge is actually more tricky than it seemed at first because Indiana Jones can’t really be so sure about the correct (spelling of the) clue word. In addition to the spellings with I and J there is another one that starts with Y. Moreover, other possible names that could in principle work are, for example, Yahweh and God. Let us assume that these are all the possibilities and that Indiana Jones and his father are certain that one of these words indeed indicates a safe path across the tile floor.
Now what would be a good strategy to decide which tiles to step on? If all the words are equally likely, he could improve his odds by picking a letter that occurs in more than one name. To illustrate, assume he picks J (as he actually does). Since the J occurs only in one of the five names, there is only a one in five (that is, 20%) chance that the tile will hold. On the other hand, stepping onto a v tile is 60% secure, since the v occurs in three of the names. If any of these three words is correct, the v tile will be safe, and since all words are considered to be equally likely correct, the chance of survival increases to three out of five for the v tile.4
Therefore, a good strategy is to first compute the frequency of letters among the five words and then try to step on tiles with the highest occurrence rate. Such a mapping of letters to frequencies is called a histogram. To compute a letter histogram, Indiana Jones needs to maintain a frequency counter for each letter. By scanning through all the words he can increment the counter for each encountered letter. Maintaining counters for different letters is a task that is solved by the dictionary data type (see chapter 4). For this application the keys are individual letters, and the information stored with each key is the letter frequency.
We could use an array with letters as indices as a data structure to implement this dictionary, but this would waste over 50% of the array’s space, since we have only a total of eleven different letters to count. Alternatively, we could also use a list. But that wouldn’t be very efficient. To see this, let’s suppose we scan the words in alphabetical order. Thus we start with the word God and add each of its letters into the list, together with the initial count of 1. We obtain the following list:
G:1 → o:1 → d:1
Note that the insertion of G took one step, the insertion of o took two because we had to add it after the G, and the insertion of d took three steps because we had to add it after both G and o. But couldn’t we instead insert the new letters at the front of the list? Unfortunately, that wouldn’t work because we have to make sure a letter is not in the list before we add it, and so we have to look at all existing letters in the list before we can add a new one. If a letter is already contained in the list, we instead increment its counter. Thus we have spent already 1 + 2 + 3 = 6 steps on the first word.
For the next word Iehova we need 4, 5, and 6 steps to insert I, e, and h, respectively, which brings us to a total of 21 steps. The next letter is an o, which already exists. We only need 2 steps to find it and update its count to 2. Letters v and a are new and need an additional 7 + 8 = 15 steps to be inserted at the end of the list. At this point our list has taken 38 steps to build and looks as follows:
G:1 → o:2 → d:1 → I:1 → e:1 → h:1 → v:1 → a:1
Updating the dictionary with the counts from the word Jehova then takes 9 steps for adding the J and 5, 6, 2, 7, and 8 steps for incrementing the counts for the remaining letters that are already present in the list, which brings the total number of steps to 75. After processing the last two words, Yahweh and Yehova, using 40 and 38 steps, respectively, we finally end up with the following list, which took a total of 153 steps to construct:
G:1 → o:4 → d:1 → I:1 → e:4 → h:4 → v:3 → a:4 → J:1 → Y:2 → w:1
Note that we must not add the second h in Yahweh, since counting it twice for one word would increase, incorrectly, the odds of the letter (in this case from 80% to 100%).5 A final scan of the list shows that the letters o, e, h, and a are the safest to begin with, all having a probability of 80% of being contained in the correct word.
Lean Is Not Always Better
The main drawb
ack of the list data structure implementation of a dictionary is the high cost for repeatedly accessing items that are located toward the end of the list.
The binary search tree data structure tries to avoid this problem by partitioning the search space more evenly to support faster search. A binary tree is a tree in which each node has at most two children. As mentioned, nodes without any children are called leaves. The nodes with children are called internal nodes. Also, the node that has no parent is called the root of the tree.
Let’s look at some binary tree examples shown in figure 5.1. The left one is a tree of a single node—the simplest tree one can imagine. The root, which is also a leaf, consists of the letter G. This tree doesn’t really look much like a tree, similar to a list of one element that doesn’t look like a list. The middle tree has one more node, a child o, added to the root. In the right tree a child d itself has children a and e. This example illustrates that if we cut the link to its parent, a node becomes the root of a separate tree, which is called a subtree of its parent. In this case the tree with root d and two children a and e is a subtree of the node G, as is the single-node tree with root o. This suggests that a tree is inherently a recursive data structure and can be defined in the following way. A tree is either a single node, or it is a node that has one or two subtrees (whose roots are the children of the node). This recursive structure of trees was indicated chapter 4, which presented a recursive algorithm to traverse a family tree.
The idea of a binary search tree is to use internal nodes as boundaries that separate all its descendant nodes into two classes: those nodes with a smaller value than the boundary node are contained in the left subtree, and those with a larger value are contained in the right subtree.
This arrangement can be exploited for searching as follows. If we are looking for a particular value in a tree, we compare it with the root of the tree. If they match, we have found the value and the search is over. Otherwise, in case the value is smaller than the root, we can limit the further search exclusively to the left subtree. We do not have to look at any of the nodes in the right subtrees because all the nodes are known to be larger than the root and thus larger than the sought value. In each case, the internal node defines a boundary that separates the “inside” (given by the left subtree) from the “outside” given by the right subtree just as the return address of the Grail diary defines the “inside” of the next step of Indiana Jones’s search for his father.
Figure 5.1 Three example binary trees. Left: A tree of only one node. Middle: A tree whose root has a right subtree of a single node. Right: A tree whose root has two subtrees. All three trees have the binary search property, which says that nodes in left subtrees are smaller than the root, and nodes in right subtrees are larger than the root.
The three trees shown in figure 5.1 are all binary search trees. They store letters as values that can be compared according to their position in the alphabet. To find a value in such a tree works by repeatedly comparing the sought value to the values in nodes in the tree and correspondingly descending into the right or left subtrees.
Suppose we want to find out whether e is contained in the right tree. We start at the root of the tree and compare e with G. Since e precedes G in the alphabet and is thus “smaller” than G, we continue the search in the left subtree, which contains all values smaller than G that are stored in this tree. Comparing e with the root of the left subtree, d, we find that e is larger and therefore continue the search in the right subtree, which is a tree of only one node. Comparing e with that node finishes the search successfully. Had we searched for, say, f instead, the search would have led to the same node e, but since e does not have a right subtree, the search would have ended there with the conclusion that f is not contained in the tree.
Any search proceeds along a path in the tree, and so the time it takes to find an element or to conclude that it does not exist in the tree never takes longer than the longest path from the root to a leaf in the tree. We can see that the right tree in figure 5.1, which stores five elements, contains paths to leaves that have a length of 2 or 3 only. This means we can find any element in at most three steps. Compare this with a list of five elements, where a search for the last element of the list always takes five steps.
We are now ready to follow the computation of the letter histogram using a binary search tree to represent a dictionary. Again, we start with the word God and add each of its letters into the tree, together with the initial count of 1. We obtain the following tree, which reflects the order of the letters where they appear in the tree:
In this case the insertion of G took one step, as it did for the list, but the insertion of o and d both took only two steps, since they are both children of G. Thus we have saved one step for the first word (1 + 2 + 2 = 5 steps for the tree versus 1 + 2 + 3 = 6 for the list).
Inserting the next word, Iehova, requires three steps for each of the letters, except for o and h, which require two and four steps, respectively. As with lists, processing the o does not create a new element but increments the count of the existing element by 1. Processing this word therefore takes 18 steps and brings the total to 24 steps, compared to 38 steps that were needed for the list data structure. The word Jehova changes the tree structure only slightly by adding a node for J, which takes four steps. With 3 + 4 + 2 + 3 + 3 = 15 more steps the counts for the existing letters are updated, leading to a total of 39 steps, compared to 75 for the list:
Finally, adding Yahwe(h) and Yehova, each requiring 19 steps, we obtain the following tree, which took a total of 76 steps to construct. This is half of the 153 steps that were needed for the list data structure.
This binary search tree represents the same dictionary as the list, but in a form that supports searching and updating faster—in general, at least. The spatial shapes of lists and trees explain some of the differences in their efficiency. The long and narrow structure of the list often forces the search to go long distances and to look at elements that are not relevant. In contrast, the wide and shallow form of the tree directs the search effectively and limits the elements to be considered and the distance traversed. However, a fair comparison between lists and binary search trees requires us to consider a few additional aspects.
Efficiency Hangs in the Balance
In computer science we usually do not compare different data structures by counting the exact number of steps for specific lists or trees, since this can give a misleading impression, specifically for small data structures. Moreover, even in this simple analysis we compared operations that are not of exactly the same complexity and take different amounts of time to execute. For example, to perform a comparison in a list we only have to test whether two elements are equal, whereas in a binary search tree we also have to determine which element is larger in order to direct the search into the corresponding subtree. This means that many of the 153 list operations are simpler and faster to perform than some of the 76 tree operations and that we shouldn’t read too much into the direct comparison of the two numbers.
Instead we consider the increase in runtime of operations as the data structures get bigger and more complicated. Chapter 2 showed examples. For lists we know that the runtime for inserting new and searching for existing elements is linear, that is, the time for inserting or finding is proportional to the length of the list in the worst case. This isn’t too bad, but if such an operation is executed repeatedly, the accumulated runtime becomes quadratic, which can be prohibitive. Recall the strategy that had Hansel go all the way back home for fetching each new pebble.
What, in comparison, is the runtime for inserting an element into a tree and searching for an element in a tree? The final search tree contains 11 elements, and searching and inserting takes between three and five steps. In fact, the time to find or insert an element is bounded by the height of the tree, which is often much smaller than the number of its elements. In a balanced tree, that is, a tree in which all paths from the root to a leaf have the same length (
±1), the height of a tree is logarithmic in size, which means the height grows by 1 only whenever the number of nodes in the tree doubles. For example, a balanced tree with 15 nodes has a height of 4, a balanced tree with 1,000 nodes has height 10, and 1,000,000 nodes fit comfortably into a balanced tree of height 20. Logarithmic runtime is much better than linear runtime, and as the size of a dictionary gets really large, the tree data structure gets better and better compared to the list data structure.
This analysis is based on balanced binary search trees. Can we actually guarantee that constructed trees are balanced, and if not, what is the runtime in the case of unbalanced search trees? The final tree in this example is not balanced, since the path to leaf e is of length 3 while the path to leaf w is of length 5. The order in which the letters are inserted into the tree matters and can lead to different trees. For example, if we insert the letters in the order given by the word doG, we obtain the following binary search tree, which is not balanced at all:
This tree is extremely unbalanced—it is not really different from a list. And this is not some quirky exception. Of the six possible sequences of the three letters, two lead to a balanced tree (God and Gdo) and the other four lead to lists. Fortunately, there are techniques to balance search trees that get out of shape after insertions, but these add to the cost of insertion operations, although the cost can still be kept logarithmic.