00:00:00,500 --> 00:00:02,880 SPEAKER 1: Let's take a look at distances now. So we'll define distance in terms of edit distance. But what is edit distance? Most simply put, we can define edit distance as the lowest cost for turning string A into string B. But what does cost actually mean? We can measure cost in terms of the number of operations it would take, or the number of steps it would take, to convert one string into the other. What counts as an operation? Well, there are three different types of operations we can consider. An insertion just means we add a character into a string. A deletion means we remove a character from a string. And a substitution means we replace a character with another character. And these are the possible valid operations that we can use when converting one string into another. So here's what you'll have to do. Distances is going to return some 2D list of costs, where the dimensions of that list are the length of A, the string a plus 1, by the length of B plus 1. So that you have a big matrix that will represent all of the possible ways to convert the first I characters of string A, into the first J characters of string B. So matrix I j, the ITH row and the JTH column of your matrix, will be a tupple, consisting first of the edit distance between the first I characters of string A, and the first J characters of string B. And then also, the last operation that you had to take in the optimal sequence of steps to get from one string to the other. So ultimately, when we're trying to compare the edit distance between two strings overall, we can just say matrix length of A length of B, in order to get the number of steps we would need to take to go from the whole string A into the whole string B. But in the process, we'll build up all of these intermediate variables which will help us in our calculations. So let's take a look at an example. Let's say we wanted to take the word cat and turn it into the word ate. How might we do that? Well, you might think that, well, each word is three characters so we could just substitute. We could turn the C into an A, we could then turn the A into a T, we could then turn the T into an E. And that's three substitutions, and we were able to turn the string cat into the string ate. But we might also be able to do this more efficiently, and with fewer operations. We could instead, start with the word cat, and first delete the C using a deletion operation. And then, inserting an E at the end of the string. And now we've turned the word cat into the word ate using only two steps instead of three. But how do we know that two is the best? How do we know that there isn't some better way? Your goal in implementing this edit distance algorithm, is to find the best way to convert any string into any other string. So let's take a look at how we can actually do this computation. Right here, what I have represented is a matrix. Where along the left hand side, we see one row for each of the possible characters in the word cat. Where 0 is going to represent the first zero characters of cat, 1 represents the first character of cat, 2 represents the first two characters of cat, and 3 represents the first three characters of cat. Then along the top, in each of the columns, we have one column for each of the characters inside of our second word, ate. And what each of the cells in this matrix is going to represent, is the number of steps or operations it would take in order to convert the first I characters of cat into the first J characters of ate. So let's start with something simple. Let's fill in cell 0, 0 of this matrix. This means converting the first 0 characters of cat into the first 0 characters of ate, which is just converting an empty string into an empty string. Well to do that, we don't actually need to do any work, because the empty string is already the empty string. So we can fill in 0 into this matrix to mean that there is no cost for turning the empty string into the empty string. But what about the cell immediately below it. Taking the first character of cat and turning it into the first zero characters of ate. In other words, turning C into the empty string. Well we can do that just by deleting the C, which is one operation. So inside of the cell for 1, 0, we'll store 1 because it will take 1 deletion in order to convert C into empty string, and also D for deletion, to mean we just deleted a character as the last step in our process of transitioning from one string to the other. What about the next cell immediately below that? Converting CA into the empty string? Well we can do that via two deletions. Deleting the C, and then deleting the A. So inside of that cell, we can store 2,D, because it will take two steps to convert CA into the empty string. And likewise, if we were converting cat all three characters into the empty string, that would be three deletions. Now let's try and fill in the first row. What if we wanted it to go the other way and convert the empty string into the character A? Well that just takes one insertion. We have to insert the letter A. So inside of cell 0, 1, we can store one the cost, and then I for insertion, because our last operation was an insertion. Likewise, if we wanted to convert the empty string into AT, or two characters, we'd have to insert the A and insert the T, and so that's a cost of two, and also an insertion as our last operation. And finally converting into the word ate, all three characters, would be a cost of three, also using insertions. So that's the first row in the first column. But now how do we fill in the rest of the matrix? In particular, let's try and fill in cell 1,1. Converting the letter C into the letter A. How might we do that? Well, given that we already have existing values in the matrix, from here on out we can define every cell's value based on the values of other cells that we've already filled in to this matrix. Consider it this way. We already know how to convert the empty string to the letter A in one step. And so, if we can convert the empty string into A in one step, then to convert C into A, all we would have to do is convert the empty string into A in one step, leaving us with AC, and then we'd just have to delete the C, doing one additional step. Which means that we could get to from C to A using two steps, where our last step is a deletion. Alternatively, we can look one cell to the left, and consider we already know how to take the string C and convert it into the empty string just by deleting the character C in one step. And if we can do that, then we can convert C into A by first deleting the C, and then inserting the Which is one more step than this cell which is highlighted, which means that to convert C into A, we could do that in two steps, where the last step is an insertion. But of course the most efficient way to do this is to consider that we can use a substitution. In particular, we didn't have to do any work to convert the empty string into the empty string, and so now when we just consider these last characters the C and the A, all we have to do is substitute the C for the A in one step, and using a substitution we can convert C into A using just one operation. Now let's consider how to convert CA into A. Well likewise, if we went the route of converting C into A first, and then deleting the C, that would be one cost in order to convert C into A, and then one more to delete the C for a total of two. Or likewise, if we started with CA, we could delete the CA in two steps, and then add an A back in a third step, so that would be three steps. Or we can say, well, if we already know how to turn C into the empty string, after that we could just do a substitution of the final character. But the final character of both of these strings on the left are both A, so there's actually no substitution that needs to happen. After we convert the C into the empty string, the A is already there for us. And so there's actually no additional cost of making that substitution, because there was nothing to substitute. Both of the characters in question, the second character of cat, and the first character of ate, are both the letter A. Now let's try and formalize this in terms of an actual algorithm. Here's how we would calculate the edit distance. If we're trying to find the cost of row I and column J, in particular, the cost of trying to take the first I characters of string A, and convert it into the first J characters of string B, we've got three options. Our last step might be a deletion. In which case, we can look at the cost of turning the first I minus one characters of string A, into the J characters a string B, and then deleting that last character. Or we can consider the last step to be an insertion. In which case if we can turn the first I characters of string A, into the first J minus one characters of string B, then we can do that conversion, and then just add on the last character that we need in one more step. Or alternatively, we can do a substitution. In which case, if we can convert the first I minus 1 characters of string A, into the first J minus 1 characters of string B, in a certain amount of steps, and the last character or in particular the ITH character of string A, is the same as the JTH character of string B, then there's actually no additional work that needs to be done. Just like we saw in that last example. After we convert the I minus 1 characters into the J minus 1 characters, we're done. Because the last character is the same. But, if those characters are different, if the ITH character of string A is different from the JTH character of string B, then we'll need to add one additional step, substituting the ITH character string A for the JTH character of string B. Which means we need to add 1 to our total cost. So we can think of cost like this. Figure out how much it would cost to a deletion, versus an insertion, versus a substitution, and then take whichever one of those is the minimum cost. Whichever one would be the most efficient way to translate string A into string B. If we were to fill in the rest of that table, it would look something like this. And then we could look at the lower right hand cell of this table, to say, in order to convert all of the letters in cat into all of the letters in ate, it would take just two steps. Just like we saw before, we would remove the C, and then insert the E. So what does that mean in terms of your cost. Inside of matrix IJ, you'll want to store a tupple of cost and operation. Where cost is going to be an INT, just representing the minimum number of steps it would take to convert the first I characters of string A, into the first J characters of string B. And operation is going to be that last operation. It might be operation.deleted, if you deleted a character last, operation.inserted, If you inserted the character last, or operation.substituted if you did a substitution. For matrix 0 0, converting an empty string into an empty string, there was no operation that actually needed to be done there. So just store none as your operation in that case. So now when it comes to actually calculating edit distances, you'll first want to set up your 2D list. Then you'll want to add values for your base cases. Just like we did in that example, filling in the first row and the first column, in order to get ready to do all the rest of the calculations. And then you'll want to fill in all of the other entries into the table. And as you do this, you may want to consider using a recursive helper function. Which may be useful to you as you think about how to compute the costs for all of the cells inside of this matrix. Once you've done that, you'll be able to figure out the optimal edit distance for the fastest way to turn one string into another.