1 00:00:00,000 --> 00:00:00,500 2 00:00:00,500 --> 00:00:02,880 SPEAKER 1: Let's take a look at distances now. 3 00:00:02,880 --> 00:00:05,830 So we'll define distance in terms of edit distance. 4 00:00:05,830 --> 00:00:07,680 But what is edit distance? 5 00:00:07,680 --> 00:00:10,260 Most simply put, we can define edit distance 6 00:00:10,260 --> 00:00:14,550 as the lowest cost for turning string A into string B. 7 00:00:14,550 --> 00:00:16,842 But what does cost actually mean? 8 00:00:16,842 --> 00:00:19,800 We can measure cost in terms of the number of operations it would take, 9 00:00:19,800 --> 00:00:24,330 or the number of steps it would take, to convert one string into the other. 10 00:00:24,330 --> 00:00:25,830 What counts as an operation? 11 00:00:25,830 --> 00:00:28,980 Well, there are three different types of operations we can consider. 12 00:00:28,980 --> 00:00:32,235 An insertion just means we add a character into a string. 13 00:00:32,235 --> 00:00:35,190 A deletion means we remove a character from a string. 14 00:00:35,190 --> 00:00:39,399 And a substitution means we replace a character with another character. 15 00:00:39,399 --> 00:00:41,190 And these are the possible valid operations 16 00:00:41,190 --> 00:00:45,190 that we can use when converting one string into another. 17 00:00:45,190 --> 00:00:47,070 So here's what you'll have to do. 18 00:00:47,070 --> 00:00:50,880 Distances is going to return some 2D list of costs, 19 00:00:50,880 --> 00:00:56,070 where the dimensions of that list are the length of A, the string a plus 1, 20 00:00:56,070 --> 00:00:58,470 by the length of B plus 1. 21 00:00:58,470 --> 00:01:02,040 So that you have a big matrix that will represent all of the possible ways 22 00:01:02,040 --> 00:01:05,370 to convert the first I characters of string A, 23 00:01:05,370 --> 00:01:12,270 into the first J characters of string B. So matrix I j, the ITH row and the JTH 24 00:01:12,270 --> 00:01:15,090 column of your matrix, will be a tupple, consisting 25 00:01:15,090 --> 00:01:19,165 first of the edit distance between the first I characters of string A, 26 00:01:19,165 --> 00:01:22,350 and the first J characters of string B. And then 27 00:01:22,350 --> 00:01:25,860 also, the last operation that you had to take in the optimal sequence of 28 00:01:25,860 --> 00:01:29,190 steps to get from one string to the other. 29 00:01:29,190 --> 00:01:32,730 So ultimately, when we're trying to compare the edit distance between two 30 00:01:32,730 --> 00:01:37,710 strings overall, we can just say matrix length of A length of B, 31 00:01:37,710 --> 00:01:39,690 in order to get the number of steps we would 32 00:01:39,690 --> 00:01:43,740 need to take to go from the whole string A into the whole string B. 33 00:01:43,740 --> 00:01:47,400 But in the process, we'll build up all of these intermediate variables which 34 00:01:47,400 --> 00:01:49,540 will help us in our calculations. 35 00:01:49,540 --> 00:01:51,700 So let's take a look at an example. 36 00:01:51,700 --> 00:01:55,800 Let's say we wanted to take the word cat and turn it into the word ate. 37 00:01:55,800 --> 00:01:56,745 How might we do that? 38 00:01:56,745 --> 00:01:59,370 Well, you might think that, well, each word is three characters 39 00:01:59,370 --> 00:02:00,536 so we could just substitute. 40 00:02:00,536 --> 00:02:05,040 We could turn the C into an A, we could then turn the A into a T, 41 00:02:05,040 --> 00:02:08,490 we could then turn the T into an E. And that's three substitutions, 42 00:02:08,490 --> 00:02:12,240 and we were able to turn the string cat into the string ate. 43 00:02:12,240 --> 00:02:15,480 But we might also be able to do this more efficiently, and with fewer 44 00:02:15,480 --> 00:02:16,530 operations. 45 00:02:16,530 --> 00:02:18,870 We could instead, start with the word cat, 46 00:02:18,870 --> 00:02:23,010 and first delete the C using a deletion operation. 47 00:02:23,010 --> 00:02:25,960 And then, inserting an E at the end of the string. 48 00:02:25,960 --> 00:02:29,190 And now we've turned the word cat into the word ate 49 00:02:29,190 --> 00:02:31,690 using only two steps instead of three. 50 00:02:31,690 --> 00:02:33,510 But how do we know that two is the best? 51 00:02:33,510 --> 00:02:35,520 How do we know that there isn't some better way? 52 00:02:35,520 --> 00:02:38,250 Your goal in implementing this edit distance algorithm, 53 00:02:38,250 --> 00:02:42,460 is to find the best way to convert any string into any other string. 54 00:02:42,460 --> 00:02:45,930 So let's take a look at how we can actually do this computation. 55 00:02:45,930 --> 00:02:48,900 Right here, what I have represented is a matrix. 56 00:02:48,900 --> 00:02:52,680 Where along the left hand side, we see one row 57 00:02:52,680 --> 00:02:56,340 for each of the possible characters in the word cat. 58 00:02:56,340 --> 00:03:00,000 Where 0 is going to represent the first zero characters of cat, 59 00:03:00,000 --> 00:03:04,740 1 represents the first character of cat, 2 represents the first two 60 00:03:04,740 --> 00:03:08,790 characters of cat, and 3 represents the first three characters of cat. 61 00:03:08,790 --> 00:03:11,130 Then along the top, in each of the columns, 62 00:03:11,130 --> 00:03:15,390 we have one column for each of the characters inside of our second word, 63 00:03:15,390 --> 00:03:16,230 ate. 64 00:03:16,230 --> 00:03:19,830 And what each of the cells in this matrix is going to represent, 65 00:03:19,830 --> 00:03:22,650 is the number of steps or operations it would take in order 66 00:03:22,650 --> 00:03:28,800 to convert the first I characters of cat into the first J characters of ate. 67 00:03:28,800 --> 00:03:30,760 So let's start with something simple. 68 00:03:30,760 --> 00:03:33,960 Let's fill in cell 0, 0 of this matrix. 69 00:03:33,960 --> 00:03:36,630 This means converting the first 0 characters of cat 70 00:03:36,630 --> 00:03:39,900 into the first 0 characters of ate, which is just converting 71 00:03:39,900 --> 00:03:42,390 an empty string into an empty string. 72 00:03:42,390 --> 00:03:44,790 Well to do that, we don't actually need to do any work, 73 00:03:44,790 --> 00:03:47,790 because the empty string is already the empty string. 74 00:03:47,790 --> 00:03:50,040 So we can fill in 0 into this matrix to mean 75 00:03:50,040 --> 00:03:54,330 that there is no cost for turning the empty string into the empty string. 76 00:03:54,330 --> 00:03:56,940 But what about the cell immediately below it. 77 00:03:56,940 --> 00:04:00,360 Taking the first character of cat and turning it 78 00:04:00,360 --> 00:04:02,760 into the first zero characters of ate. 79 00:04:02,760 --> 00:04:05,700 In other words, turning C into the empty string. 80 00:04:05,700 --> 00:04:09,850 Well we can do that just by deleting the C, which is one operation. 81 00:04:09,850 --> 00:04:14,340 So inside of the cell for 1, 0, we'll store 1 82 00:04:14,340 --> 00:04:18,480 because it will take 1 deletion in order to convert C into empty string, 83 00:04:18,480 --> 00:04:21,810 and also D for deletion, to mean we just deleted 84 00:04:21,810 --> 00:04:25,230 a character as the last step in our process of transitioning 85 00:04:25,230 --> 00:04:27,650 from one string to the other. 86 00:04:27,650 --> 00:04:29,760 What about the next cell immediately below that? 87 00:04:29,760 --> 00:04:32,400 Converting CA into the empty string? 88 00:04:32,400 --> 00:04:34,590 Well we can do that via two deletions. 89 00:04:34,590 --> 00:04:38,160 Deleting the C, and then deleting the A. So inside of that cell, 90 00:04:38,160 --> 00:04:42,690 we can store 2,D, because it will take two steps to convert CA into the empty 91 00:04:42,690 --> 00:04:43,590 string. 92 00:04:43,590 --> 00:04:46,560 And likewise, if we were converting cat all three characters 93 00:04:46,560 --> 00:04:50,160 into the empty string, that would be three deletions. 94 00:04:50,160 --> 00:04:52,044 Now let's try and fill in the first row. 95 00:04:52,044 --> 00:04:54,210 What if we wanted it to go the other way and convert 96 00:04:54,210 --> 00:04:57,270 the empty string into the character A? 97 00:04:57,270 --> 00:04:59,340 Well that just takes one insertion. 98 00:04:59,340 --> 00:05:03,360 We have to insert the letter A. So inside of cell 0, 1, 99 00:05:03,360 --> 00:05:07,470 we can store one the cost, and then I for insertion, 100 00:05:07,470 --> 00:05:10,470 because our last operation was an insertion. 101 00:05:10,470 --> 00:05:14,400 Likewise, if we wanted to convert the empty string into AT, or two 102 00:05:14,400 --> 00:05:17,640 characters, we'd have to insert the A and insert the T, 103 00:05:17,640 --> 00:05:22,150 and so that's a cost of two, and also an insertion as our last operation. 104 00:05:22,150 --> 00:05:25,110 And finally converting into the word ate, all three characters, 105 00:05:25,110 --> 00:05:28,230 would be a cost of three, also using insertions. 106 00:05:28,230 --> 00:05:31,020 So that's the first row in the first column. 107 00:05:31,020 --> 00:05:33,210 But now how do we fill in the rest of the matrix? 108 00:05:33,210 --> 00:05:36,540 In particular, let's try and fill in cell 1,1. 109 00:05:36,540 --> 00:05:41,740 Converting the letter C into the letter A. How might we do that? 110 00:05:41,740 --> 00:05:45,270 Well, given that we already have existing values in the matrix, 111 00:05:45,270 --> 00:05:48,510 from here on out we can define every cell's value 112 00:05:48,510 --> 00:05:52,530 based on the values of other cells that we've already filled in to this matrix. 113 00:05:52,530 --> 00:05:53,850 Consider it this way. 114 00:05:53,850 --> 00:05:59,900 We already know how to convert the empty string to the letter A in one step. 115 00:05:59,900 --> 00:06:04,200 And so, if we can convert the empty string into A in one step, 116 00:06:04,200 --> 00:06:07,200 then to convert C into A, all we would have to do 117 00:06:07,200 --> 00:06:11,910 is convert the empty string into A in one step, leaving us with AC, 118 00:06:11,910 --> 00:06:16,030 and then we'd just have to delete the C, doing one additional step. 119 00:06:16,030 --> 00:06:20,390 Which means that we could get to from C to A using two steps, where 120 00:06:20,390 --> 00:06:22,590 our last step is a deletion. 121 00:06:22,590 --> 00:06:25,200 Alternatively, we can look one cell to the left, 122 00:06:25,200 --> 00:06:28,770 and consider we already know how to take the string C 123 00:06:28,770 --> 00:06:32,220 and convert it into the empty string just by deleting the character 124 00:06:32,220 --> 00:06:33,980 C in one step. 125 00:06:33,980 --> 00:06:38,100 And if we can do that, then we can convert C into A by first deleting 126 00:06:38,100 --> 00:06:43,500 the C, and then inserting the Which is one more step than this cell which 127 00:06:43,500 --> 00:06:46,470 is highlighted, which means that to convert C into A, 128 00:06:46,470 --> 00:06:50,850 we could do that in two steps, where the last step is an insertion. 129 00:06:50,850 --> 00:06:53,100 But of course the most efficient way to do this 130 00:06:53,100 --> 00:06:56,820 is to consider that we can use a substitution. 131 00:06:56,820 --> 00:06:58,950 In particular, we didn't have to do any work 132 00:06:58,950 --> 00:07:01,650 to convert the empty string into the empty string, 133 00:07:01,650 --> 00:07:05,610 and so now when we just consider these last characters the C and the A, 134 00:07:05,610 --> 00:07:10,110 all we have to do is substitute the C for the A in one step, 135 00:07:10,110 --> 00:07:16,710 and using a substitution we can convert C into A using just one operation. 136 00:07:16,710 --> 00:07:20,490 Now let's consider how to convert CA into A. 137 00:07:20,490 --> 00:07:24,690 Well likewise, if we went the route of converting C into A first, 138 00:07:24,690 --> 00:07:30,270 and then deleting the C, that would be one cost in order to convert C into A, 139 00:07:30,270 --> 00:07:33,660 and then one more to delete the C for a total of two. 140 00:07:33,660 --> 00:07:39,120 Or likewise, if we started with CA, we could delete the CA in two steps, 141 00:07:39,120 --> 00:07:43,230 and then add an A back in a third step, so that would be three steps. 142 00:07:43,230 --> 00:07:46,110 Or we can say, well, if we already know how 143 00:07:46,110 --> 00:07:50,100 to turn C into the empty string, after that we could just do 144 00:07:50,100 --> 00:07:52,320 a substitution of the final character. 145 00:07:52,320 --> 00:07:55,020 But the final character of both of these strings on the left 146 00:07:55,020 --> 00:07:59,190 are both A, so there's actually no substitution that needs to happen. 147 00:07:59,190 --> 00:08:04,860 After we convert the C into the empty string, the A is already there for us. 148 00:08:04,860 --> 00:08:08,820 And so there's actually no additional cost of making that substitution, 149 00:08:08,820 --> 00:08:10,620 because there was nothing to substitute. 150 00:08:10,620 --> 00:08:13,710 Both of the characters in question, the second character of cat, 151 00:08:13,710 --> 00:08:17,490 and the first character of ate, are both the letter A. 152 00:08:17,490 --> 00:08:21,130 Now let's try and formalize this in terms of an actual algorithm. 153 00:08:21,130 --> 00:08:23,520 Here's how we would calculate the edit distance. 154 00:08:23,520 --> 00:08:27,300 If we're trying to find the cost of row I and column J, 155 00:08:27,300 --> 00:08:30,950 in particular, the cost of trying to take the first I characters of string 156 00:08:30,950 --> 00:08:34,650 A, and convert it into the first J characters of string B, 157 00:08:34,650 --> 00:08:36,419 we've got three options. 158 00:08:36,419 --> 00:08:39,570 Our last step might be a deletion. 159 00:08:39,570 --> 00:08:42,450 In which case, we can look at the cost of turning 160 00:08:42,450 --> 00:08:45,600 the first I minus one characters of string A, 161 00:08:45,600 --> 00:08:52,230 into the J characters a string B, and then deleting that last character. 162 00:08:52,230 --> 00:08:55,350 Or we can consider the last step to be an insertion. 163 00:08:55,350 --> 00:08:58,620 In which case if we can turn the first I characters of string A, 164 00:08:58,620 --> 00:09:02,010 into the first J minus one characters of string B, 165 00:09:02,010 --> 00:09:05,280 then we can do that conversion, and then just add on the last character 166 00:09:05,280 --> 00:09:07,950 that we need in one more step. 167 00:09:07,950 --> 00:09:10,710 Or alternatively, we can do a substitution. 168 00:09:10,710 --> 00:09:15,900 In which case, if we can convert the first I minus 1 characters of string A, 169 00:09:15,900 --> 00:09:20,940 into the first J minus 1 characters of string B, in a certain amount of steps, 170 00:09:20,940 --> 00:09:25,290 and the last character or in particular the ITH character of string A, 171 00:09:25,290 --> 00:09:27,946 is the same as the JTH character of string B, 172 00:09:27,946 --> 00:09:30,570 then there's actually no additional work that needs to be done. 173 00:09:30,570 --> 00:09:32,490 Just like we saw in that last example. 174 00:09:32,490 --> 00:09:34,650 After we convert the I minus 1 characters 175 00:09:34,650 --> 00:09:36,990 into the J minus 1 characters, we're done. 176 00:09:36,990 --> 00:09:39,240 Because the last character is the same. 177 00:09:39,240 --> 00:09:43,590 But, if those characters are different, if the ITH character of string A 178 00:09:43,590 --> 00:09:46,230 is different from the JTH character of string B, 179 00:09:46,230 --> 00:09:50,340 then we'll need to add one additional step, substituting the ITH character 180 00:09:50,340 --> 00:09:53,610 string A for the JTH character of string B. Which means 181 00:09:53,610 --> 00:09:56,470 we need to add 1 to our total cost. 182 00:09:56,470 --> 00:09:58,170 So we can think of cost like this. 183 00:09:58,170 --> 00:10:02,370 Figure out how much it would cost to a deletion, versus an insertion, 184 00:10:02,370 --> 00:10:05,460 versus a substitution, and then take whichever one of those 185 00:10:05,460 --> 00:10:06,810 is the minimum cost. 186 00:10:06,810 --> 00:10:09,300 Whichever one would be the most efficient way 187 00:10:09,300 --> 00:10:12,796 to translate string A into string B. 188 00:10:12,796 --> 00:10:14,670 If we were to fill in the rest of that table, 189 00:10:14,670 --> 00:10:16,270 it would look something like this. 190 00:10:16,270 --> 00:10:19,140 And then we could look at the lower right hand cell of this table, 191 00:10:19,140 --> 00:10:23,350 to say, in order to convert all of the letters in cat 192 00:10:23,350 --> 00:10:27,730 into all of the letters in ate, it would take just two steps. 193 00:10:27,730 --> 00:10:33,480 Just like we saw before, we would remove the C, and then insert the E. So what 194 00:10:33,480 --> 00:10:35,280 does that mean in terms of your cost. 195 00:10:35,280 --> 00:10:40,830 Inside of matrix IJ, you'll want to store a tupple of cost and operation. 196 00:10:40,830 --> 00:10:43,710 Where cost is going to be an INT, just representing 197 00:10:43,710 --> 00:10:45,540 the minimum number of steps it would take 198 00:10:45,540 --> 00:10:47,520 to convert the first I characters of string A, 199 00:10:47,520 --> 00:10:51,240 into the first J characters of string B. And operation 200 00:10:51,240 --> 00:10:53,130 is going to be that last operation. 201 00:10:53,130 --> 00:10:56,880 It might be operation.deleted, if you deleted a character last, 202 00:10:56,880 --> 00:11:00,030 operation.inserted, If you inserted the character last, 203 00:11:00,030 --> 00:11:04,260 or operation.substituted if you did a substitution. 204 00:11:04,260 --> 00:11:07,890 For matrix 0 0, converting an empty string into an empty string, 205 00:11:07,890 --> 00:11:10,560 there was no operation that actually needed to be done there. 206 00:11:10,560 --> 00:11:15,040 So just store none as your operation in that case. 207 00:11:15,040 --> 00:11:18,300 So now when it comes to actually calculating edit distances, 208 00:11:18,300 --> 00:11:20,940 you'll first want to set up your 2D list. 209 00:11:20,940 --> 00:11:23,612 Then you'll want to add values for your base cases. 210 00:11:23,612 --> 00:11:25,320 Just like we did in that example, filling 211 00:11:25,320 --> 00:11:28,080 in the first row and the first column, in order 212 00:11:28,080 --> 00:11:31,380 to get ready to do all the rest of the calculations. 213 00:11:31,380 --> 00:11:35,247 And then you'll want to fill in all of the other entries into the table. 214 00:11:35,247 --> 00:11:37,080 And as you do this, you may want to consider 215 00:11:37,080 --> 00:11:39,150 using a recursive helper function. 216 00:11:39,150 --> 00:11:43,050 Which may be useful to you as you think about how to compute the costs for all 217 00:11:43,050 --> 00:11:45,096 of the cells inside of this matrix. 218 00:11:45,096 --> 00:11:47,220 Once you've done that, you'll be able to figure out 219 00:11:47,220 --> 00:11:49,770 the optimal edit distance for the fastest way 220 00:11:49,770 --> 00:11:52,850 to turn one string into another. 221 00:11:52,850 --> 00:11:53,780