1 00:00:00,000 --> 00:00:02,000 [Binary Search] 2 00:00:02,000 --> 00:00:04,000 [Patrick Schmid - Harvard University] 3 00:00:04,000 --> 00:00:07,000 [This is CS50. - CS50.TV] 4 00:00:07,000 --> 00:00:09,000 >> If I gave you a list of Disney character names in alphabetical order 5 00:00:09,000 --> 00:00:11,000 and asked you to find Mickey Mouse, 6 00:00:11,000 --> 00:00:13,000 how would you go about doing this? 7 00:00:13,000 --> 00:00:15,000 One obvious way would be to scan the list from the beginning 8 00:00:15,000 --> 00:00:18,000 and check every name to see if it's Mickey. 9 00:00:18,000 --> 00:00:22,000 But as you read Aladdin, Alice, Ariel, and so forth, 10 00:00:22,000 --> 00:00:25,000 you'll quickly realize that starting at the front of the list wasn't a good idea. 11 00:00:25,000 --> 00:00:29,000 Okay, maybe you should start working backwards from the end of the list. 12 00:00:29,000 --> 00:00:33,000 Now you read Tarzan, Stitch, Snow White, and so forth. 13 00:00:33,000 --> 00:00:36,000 Still, this doesn't seem like the best way to go about it. 14 00:00:36,000 --> 00:00:38,000 >> Well, another way that you could go about doing this is to try to narrow down 15 00:00:38,000 --> 00:00:41,000 the list of names that you have to look at. 16 00:00:41,000 --> 00:00:43,000 Since you know that they are in alphabetical order, 17 00:00:43,000 --> 00:00:45,000 you could just look at the names in the middle of the list 18 00:00:45,000 --> 00:00:49,000 and check if Mickey Mouse is before or after this name. 19 00:00:49,000 --> 00:00:51,000 Looking at the last name in the second column 20 00:00:51,000 --> 00:00:54,000 you'd realize that M for Mickey comes after J for Jasmine, 21 00:00:54,000 --> 00:00:57,000 so you'd simply ignore the first half of the list. 22 00:00:57,000 --> 00:00:59,000 Then you'd probably look at the top of the last column 23 00:00:59,000 --> 00:01:02,000 and see that it begins with Rapunzel. 24 00:01:02,000 --> 00:01:06,000 Mickey comes before Rapunzel; looks like we can ignore the last column as well. 25 00:01:06,000 --> 00:01:08,000 Continuing the search strategy, you'll quickly see that Mickey 26 00:01:08,000 --> 00:01:11,000 is in the first half of the remaining list of names 27 00:01:11,000 --> 00:01:14,000 and finally find Mickey hiding between Merlin and Minnie. 28 00:01:14,000 --> 00:01:17,000 >> What you just did was basically binary search. 29 00:01:17,000 --> 00:01:22,000 As this name implies, it performs its searching strategy in a binary matter. 30 00:01:22,000 --> 00:01:24,000 What does this mean? 31 00:01:24,000 --> 00:01:27,000 Well, given a list of sorted items, the binary search algorithm makes a binary decision-- 32 00:01:27,000 --> 00:01:33,000 left or right, greater than or less than, alphabetically before or after--at each point. 33 00:01:33,000 --> 00:01:35,000 Now that we have a name that goes along with this search algorithm, 34 00:01:35,000 --> 00:01:38,000 let's look at another example, but this time with a list of sorted numbers. 35 00:01:38,000 --> 00:01:43,000 Say we are looking for the number 144 in this list of sorted numbers. 36 00:01:43,000 --> 00:01:46,000 Just like before, we find the number that's in the middle-- 37 00:01:46,000 --> 00:01:50,000 which in this case is 13--and see if 144 is greater than or less than 13. 38 00:01:50,000 --> 00:01:54,000 Since it's clearly greater than 13, we can ignore everything that is 13 or less 39 00:01:54,000 --> 00:01:57,000 and just concentrate on the remaining half. 40 00:01:57,000 --> 00:01:59,000 Since we now have an even number of items left, 41 00:01:59,000 --> 00:02:01,000 we simply choose a number that is close to the middle. 42 00:02:01,000 --> 00:02:03,000 In this case we choose 55. 43 00:02:03,000 --> 00:02:06,000 We could have just as easily chosen 89. 44 00:02:06,000 --> 00:02:11,000 Okay. Again, 144 is greater than 55, so we go to the right. 45 00:02:11,000 --> 00:02:14,000 Fortunately for us, the next middle number is 144, 46 00:02:14,000 --> 00:02:16,000 the one we're looking for. 47 00:02:16,000 --> 00:02:21,000 So in order to find 144 using a binary search, we're able to find it in only 3 steps. 48 00:02:21,000 --> 00:02:24,000 If we would have used linear search here, it would have taken us 12 steps. 49 00:02:24,000 --> 00:02:28,000 As a matter of fact, since this search method halves the number of items 50 00:02:28,000 --> 00:02:31,000 it has to look at at each step, it will find the item it is searching for 51 00:02:31,000 --> 00:02:35,000 in about the log of the number of items in the list. 52 00:02:35,000 --> 00:02:37,000 Now that we've seen 2 examples, let's take a look at 53 00:02:37,000 --> 00:02:41,000 some pseudocode for a recursive function that implements binary search. 54 00:02:41,000 --> 00:02:44,000 Starting at the top, we see that we have to find a function that takes 4 arguments: 55 00:02:44,000 --> 00:02:47,000 key, array, min, and max. 56 00:02:47,000 --> 00:02:51,000 The key is the number that we are looking for, so 144 in the previous example. 57 00:02:51,000 --> 00:02:54,000 Array is the list of numbers that we are searching. 58 00:02:54,000 --> 00:02:57,000 Min and max are the indices of the minimum and maximum positions 59 00:02:57,000 --> 00:02:59,000 that we are currently looking at. 60 00:02:59,000 --> 00:03:03,000 So when we start, min will be zero and max will be the maximum index of the array. 61 00:03:03,000 --> 00:03:07,000 As we narrow the search down, we will update min and max 62 00:03:07,000 --> 00:03:10,000 to be just the range that we are still looking in. 63 00:03:10,000 --> 00:03:12,000 >> Let's skip to the interesting part first. 64 00:03:12,000 --> 00:03:14,000 The first thing we do is find the midpoint, 65 00:03:14,000 --> 00:03:19,000 the index that is halfway between the min and max of the array that we are still considering. 66 00:03:19,000 --> 00:03:22,000 Then we look at the value of the array at that midpoint location 67 00:03:22,000 --> 00:03:25,000 and see if the number that we are looking for is less than that key. 68 00:03:25,000 --> 00:03:27,000 If the number at that position is less, 69 00:03:27,000 --> 00:03:31,000 then it means the key is larger than all numbers to the left of that position. 70 00:03:31,000 --> 00:03:33,000 So we can call binary search function again, 71 00:03:33,000 --> 00:03:36,000 but this time updating the min and max parameters to read just the half 72 00:03:36,000 --> 00:03:40,000 that is greater than or equal to the value we just looked at. 73 00:03:40,000 --> 00:03:44,000 On the other hand, if the key is less than the number at the current midpoint of the array, 74 00:03:44,000 --> 00:03:47,000 we want to go to the left and ignore all numbers that are greater. 75 00:03:47,000 --> 00:03:52,000 Again, we call binary search but this time with the range of min and max updated 76 00:03:52,000 --> 00:03:54,000 to include just the lower half. 77 00:03:54,000 --> 00:03:57,000 If the value at the current midpoint in the array is neither 78 00:03:57,000 --> 00:04:01,000 larger than nor smaller than the key, then it must be equal to the key. 79 00:04:01,000 --> 00:04:05,000 Thus, we can simply return the current midpoint index, and we're done. 80 00:04:05,000 --> 00:04:09,000 Finally, this check here is for the case that the number 81 00:04:09,000 --> 00:04:11,000 isn't actually in the array of numbers we are searching. 82 00:04:11,000 --> 00:04:14,000 If the maximum index of the range that we are looking for 83 00:04:14,000 --> 00:04:17,000 is ever less than the minimum, that means that we've gone too far. 84 00:04:17,000 --> 00:04:20,000 Since the number wasn't in the input array, we return -1 85 00:04:20,000 --> 00:04:24,000 to indicate that nothing was found. 86 00:04:24,000 --> 00:04:26,000 >> You may have noticed that for this algorithm to work, 87 00:04:26,000 --> 00:04:28,000 the list of numbers has to be sorted. 88 00:04:28,000 --> 00:04:31,000 In other words, we can only find 144 using binary search 89 00:04:31,000 --> 00:04:34,000 if all the numbers are ordered from lowest to highest. 90 00:04:34,000 --> 00:04:38,000 If this weren't the case, we wouldn't be able to exclude half of the numbers at each step. 91 00:04:38,000 --> 00:04:40,000 So we have 2 options. 92 00:04:40,000 --> 00:04:43,000 Either we can take an unsorted list and sort it before using binary search, 93 00:04:43,000 --> 00:04:48,000 or we can make sure that the list of numbers is sorted as we add numbers to it. 94 00:04:48,000 --> 00:04:50,000 Thus, instead of sorting just when we have to search, 95 00:04:50,000 --> 00:04:53,000 why not keep the list sorted at all times? 96 00:04:53,000 --> 00:04:57,000 >> One way to keep a list of numbers sorted while simultaneously allowing 97 00:04:57,000 --> 00:04:59,000 one to add or move numbers from this list 98 00:04:59,000 --> 00:05:02,000 is by using something called a binary search tree. 99 00:05:02,000 --> 00:05:05,000 A binary search tree is a data structure that has 3 properties. 100 00:05:05,000 --> 00:05:09,000 First, the left subtree of any node contains only values that are less than 101 00:05:09,000 --> 00:05:11,000 or equal to the node's value. 102 00:05:11,000 --> 00:05:15,000 Second, the right subtree of a node only contains values that are greater than 103 00:05:15,000 --> 00:05:17,000 or equal to the node's value. 104 00:05:17,000 --> 00:05:20,000 And, finally, both the left and right subtrees of all nodes 105 00:05:20,000 --> 00:05:23,000 are also binary search trees. 106 00:05:23,000 --> 00:05:26,000 Let's look at an example with the same numbers we used earlier. 107 00:05:26,000 --> 00:05:30,000 For those of you who have never seen a computer science tree before, 108 00:05:30,000 --> 00:05:34,000 let me begin by telling you that a computer science tree grows downwards. 109 00:05:34,000 --> 00:05:36,000 Yes, unlike trees you are accustomed to, 110 00:05:36,000 --> 00:05:38,000 the root of a computer science tree is at the top, 111 00:05:38,000 --> 00:05:41,000 and the leaves are at the bottom. 112 00:05:41,000 --> 00:05:45,000 Each little box is called a node, and the nodes are connected to each other by edges. 113 00:05:45,000 --> 00:05:48,000 So the root of this tree is a node value with the value 13, 114 00:05:48,000 --> 00:05:52,000 which has 2 children nodes with the values 5 and 34. 115 00:05:52,000 --> 00:05:57,000 A subtree is the tree that is formed just by looking at a subsection of the entire tree. 116 00:05:57,000 --> 00:06:03,000 >> For example, the left subtree of the node 3 is the tree created by the nodes 0, 1, and 2. 117 00:06:03,000 --> 00:06:06,000 So, if we go back to the properties of a binary search tree, 118 00:06:06,000 --> 00:06:09,000 we see that each node in the tree conforms to all 3 properties, namely, 119 00:06:09,000 --> 00:06:15,000 the left subtree only contains values that are less than or equal to the node's value; 120 00:06:15,000 --> 00:06:16,000 the right subtree of all nodes 121 00:06:16,000 --> 00:06:19,000 only contains values that are greater than or equal to the node's value; and 122 00:06:19,000 --> 00:06:25,000 both left and right subtrees of all nodes also are binary search trees. 123 00:06:25,000 --> 00:06:28,000 Although this tree looks different, this is a valid binary search tree 124 00:06:28,000 --> 00:06:30,000 for the same set of numbers. 125 00:06:30,000 --> 00:06:32,000 As a matter of fact, there are many possible ways that you can create 126 00:06:32,000 --> 00:06:35,000 a valid binary search tree from these numbers. 127 00:06:35,000 --> 00:06:38,000 Well, let's go back to the first one we created. 128 00:06:38,000 --> 00:06:40,000 So what can we do with these trees? 129 00:06:40,000 --> 00:06:44,000 Well, we can very simply find the minimum and maximum values. 130 00:06:44,000 --> 00:06:46,000 The minimum values can be found by always going to the left 131 00:06:46,000 --> 00:06:48,000 until there are no more nodes to visit. 132 00:06:48,000 --> 00:06:53,000 Conversely, to find the maximum one simply just goes down to the right at each time. 133 00:06:54,000 --> 00:06:56,000 >> Finding any other number that isn't the minimum or the maximum 134 00:06:56,000 --> 00:06:58,000 is just as easy. 135 00:06:58,000 --> 00:07:00,000 Say we're looking for the number 89. 136 00:07:00,000 --> 00:07:03,000 We simply check the value of each node and go to the left or right, 137 00:07:03,000 --> 00:07:06,000 depending on whether the node's value is less than or greater than 138 00:07:06,000 --> 00:07:08,000 the one we're looking for. 139 00:07:08,000 --> 00:07:11,000 So, starting at the root of 13, we see that 89 is greater, 140 00:07:11,000 --> 00:07:13,000 and so we go to the right. 141 00:07:13,000 --> 00:07:16,000 Then we see the node for 34, and again we go to the right. 142 00:07:16,000 --> 00:07:20,000 89 is still greater than 55, so we continue going to the right. 143 00:07:20,000 --> 00:07:24,000 We then come up with a node with the value of 144 and go to the left. 144 00:07:24,000 --> 00:07:26,000 Lo and behold, 89 is right there. 145 00:07:26,000 --> 00:07:31,000 >> Another thing that we can do is print out all numbers by performing an inorder traversal. 146 00:07:31,000 --> 00:07:35,000 An inorder traversal means to print everything out in the left subtree 147 00:07:35,000 --> 00:07:37,000 followed by printing the node itself 148 00:07:37,000 --> 00:07:40,000 and then followed by printing everything out in the right subtree. 149 00:07:40,000 --> 00:07:43,000 For example, let's take our favorite binary search tree 150 00:07:43,000 --> 00:07:46,000 and print out the numbers in sorted order. 151 00:07:46,000 --> 00:07:49,000 We start at the root of 13, but before printing 13 we have to print out 152 00:07:49,000 --> 00:07:51,000 everything in the left subtree. 153 00:07:51,000 --> 00:07:55,000 So we go to 5. We still have to go deeper down in the tree until we find the left-most node, 154 00:07:55,000 --> 00:07:57,000 which is zero. 155 00:07:57,000 --> 00:08:00,000 After printing zero, we go back up to the 1 and print that out. 156 00:08:00,000 --> 00:08:03,000 Then we go to the right subtree, which is 2, and print that out. 157 00:08:03,000 --> 00:08:05,000 Now that we're done with that subtree, 158 00:08:05,000 --> 00:08:07,000 we can go back up to the 3 and print it out. 159 00:08:07,000 --> 00:08:11,000 Continuing back up, we print the 5 and then the 8. 160 00:08:11,000 --> 00:08:13,000 Now that we have completed the entire left subtree, 161 00:08:13,000 --> 00:08:17,000 we can print out the 13 and start working on the right subtree. 162 00:08:17,000 --> 00:08:21,000 We hop down to 34, but before printing 34 we have to print out its left subtree. 163 00:08:21,000 --> 00:08:27,000 So we print out 21; then we get to print out 34 and visit its right subtree. 164 00:08:27,000 --> 00:08:31,000 Since 55 has no left subtree, we print it out and continue on to its right subtree. 165 00:08:31,000 --> 00:08:36,000 144 has a left subtree, and so we print out the 89, followed by the 144, 166 00:08:36,000 --> 00:08:39,000 and finally the right-most node of 233. 167 00:08:39,000 --> 00:08:44,000 There you have it; all the numbers are printed out in order from lowest to highest. 168 00:08:44,000 --> 00:08:47,000 >> Adding something to the tree is relatively painless as well. 169 00:08:47,000 --> 00:08:51,000 All we have to do is make sure that we follow 3 binary search tree properties 170 00:08:51,000 --> 00:08:53,000 and then insert the value where there is space. 171 00:08:53,000 --> 00:08:55,000 Let's say we want to insert the value of 7. 172 00:08:55,000 --> 00:08:58,000 Since 7 is less than 13, we go to the left. 173 00:08:58,000 --> 00:09:01,000 But it's greater than 5, so we traverse to the right. 174 00:09:01,000 --> 00:09:04,000 Since it's less than 8 and 8 is a leaf node, we add 7 to the left child of 8. 175 00:09:04,000 --> 00:09:09,000 Voila! We've added a number to our binary search tree. 176 00:09:09,000 --> 00:09:12,000 >> If we can add things, we better be able to delete things as well. 177 00:09:12,000 --> 00:09:14,000 Unfortunately for us, deleting is a little bit more complicated-- 178 00:09:14,000 --> 00:09:16,000 not much, but just a little bit. 179 00:09:16,000 --> 00:09:18,000 There are 3 different scenarios that we have to consider 180 00:09:18,000 --> 00:09:21,000 when deleting elements from binary search trees. 181 00:09:21,000 --> 00:09:24,000 First, the easiest case is that the element is a leaf node. 182 00:09:24,000 --> 00:09:27,000 In this case, we simply delete it and go on with our business. 183 00:09:27,000 --> 00:09:30,000 Say we want to delete the 7 that we just added. 184 00:09:30,000 --> 00:09:34,000 Well, we simply find it, remove it, and that's it. 185 00:09:35,000 --> 00:09:37,000 The next case is if the node has only 1 child. 186 00:09:37,000 --> 00:09:40,000 Here we can delete the node, but we first have to make sure 187 00:09:40,000 --> 00:09:42,000 to connect the subtree that is now left parentless 188 00:09:42,000 --> 00:09:44,000 to the parent of the node we just deleted. 189 00:09:44,000 --> 00:09:47,000 Say we want to delete 3 from our tree. 190 00:09:47,000 --> 00:09:50,000 We take the child element of that node and attach it to the parent of the node. 191 00:09:50,000 --> 00:09:54,000 In this case, we're now attaching the 1 to the 5. 192 00:09:54,000 --> 00:09:57,000 This works without a problem because we know, according to the binary search tree property, 193 00:09:57,000 --> 00:10:01,000 that everything in the left subtree of 3 was less than 5. 194 00:10:01,000 --> 00:10:05,000 Now that 3's subtree is taken care of, we can delete it. 195 00:10:05,000 --> 00:10:08,000 The third and final case is the most complex. 196 00:10:08,000 --> 00:10:11,000 This is the case when the node we want to delete has 2 children. 197 00:10:11,000 --> 00:10:15,000 In order to do this, we first have to find the node that has the next largest value, 198 00:10:15,000 --> 00:10:18,000 swap the two, and then delete the node in question. 199 00:10:18,000 --> 00:10:22,000 Notice the node that has the next largest value can't have 2 children itself 200 00:10:22,000 --> 00:10:26,000 since its left child would be a better candidate for the next largest. 201 00:10:26,000 --> 00:10:30,000 Therefore, deleting a node with 2 children amounts to swapping of 2 nodes, 202 00:10:30,000 --> 00:10:33,000 and then deleting is handled by 1 of the 2 aforementioned rules. 203 00:10:33,000 --> 00:10:37,000 For example, let's say we want to delete the root node, 13. 204 00:10:37,000 --> 00:10:39,000 The first thing we do is we find the next largest value in the tree 205 00:10:39,000 --> 00:10:41,000 which, in this case, is 21. 206 00:10:41,000 --> 00:10:46,000 We then swap the 2 nodes, making 13 a leaf and 21 the central group node. 207 00:10:46,000 --> 00:10:49,000 Now we can simply delete 13. 208 00:10:50,000 --> 00:10:53,000 >> As alluded to earlier, there are many possible ways to make a valid binary search tree. 209 00:10:53,000 --> 00:10:56,000 Unfortunately for us, some are worse than others. 210 00:10:56,000 --> 00:10:59,000 For example, what happens when we build a binary search tree 211 00:10:59,000 --> 00:11:01,000 from a sorted list of numbers? 212 00:11:01,000 --> 00:11:04,000 All numbers are just added to the right at each step. 213 00:11:04,000 --> 00:11:06,000 When we want to search for a number, 214 00:11:06,000 --> 00:11:08,000 we have no choice but only to look at the right at each step. 215 00:11:08,000 --> 00:11:11,000 This isn't better than linear search at all. 216 00:11:11,000 --> 00:11:13,000 Although we won't cover them here, there are other, more complex, 217 00:11:13,000 --> 00:11:16,000 data structures that make sure that this doesn't happen. 218 00:11:16,000 --> 00:11:18,000 However, one simple thing that can be done to avoid this is 219 00:11:18,000 --> 00:11:21,000 to just randomly shuffle the input values. 220 00:11:21,000 --> 00:11:26,000 It's highly unlikely that by random chance a shuffled list of numbers is sorted. 221 00:11:26,000 --> 00:11:29,000 If this were the case, casinos wouldn't stay in business for long. 222 00:11:29,000 --> 00:11:31,000 >> There you have it. 223 00:11:31,000 --> 00:11:34,000 You now know about binary searching and binary search trees. 224 00:11:34,000 --> 00:11:36,000 I'm Patrick Schmid, and this is CS50. 225 00:11:36,000 --> 00:11:38,000 [CS50.TV] 226 00:11:38,000 --> 00:11:43,000 One obvious way would be to scan the list from...[beep] 227 00:11:43,000 --> 00:11:46,000 ...the number of items...yep 228 00:11:46,000 --> 00:11:50,000 [laughs] 229 00:11:50,000 --> 00:11:55,000 ...post node of 234...augh. 230 00:11:55,000 --> 00:11:58,000 >>Yay! That was--