BRIAN YU: Let's analyze some DNA. In this problem, your task is going to be to take a sequence of DNA and determine which person it most likely belongs to. Recall that a sequence of DNA is really just a sequence of nucleotide bases where there are four different nucleotide bases that are abbreviated T, C, A, and G. If you put enough of these in sequence, you end up with a human's DNA sequence. And interestingly, if you look at a human's DNA sequence, you'll often find places in the DNA where certain parts of the DNA repeat. Take a look at the sequence AGAT for example. You'll notice that in this DNA sequence, it appears here. And then it repeats right after it here and here and here and here. And it repeats five times consecutively in a row. 

This isn't a coincidence. It's something called a short tandem repeat-- a sequence of DNA that repeats consecutively in particular locations in human DNA. The short tandem repeats, also called STRs, vary a lot in the population. In particular, different people have each STR repeated different numbers of times. So what are the key observations here? Well, for any STR, people in a population vary in how many times that particular STR repeats consecutively. You might have a particular STR repeated 25 times. I might have it repeated 28 times. Someone else might have it repeated some different number of times. And matching STR counts can be used to, therefore, identify who a sample of DNA belongs to. 

If we take a sample of DNA and count up for each of the STRs-- how many times do they repeat consecutively and it matches some existing piece of DNA in a DNA database, for example, then it's pretty likely that those two pieces of DNA came from the same person. And in fact, the very DNA sequence that we were looking at earlier, AGAT, is actually one of the STRs that the FBI uses in its own DNA database. Your task in this problem is going to be to do something similar. 

We're going to give you a DNA database of your very own, formatted as a CSV file where each row is going to correspond to a person. And each column is going to correspond to a particular STR, a particular sequence of DNA that repeats some number of times. We can take this CSV file, or comma separated value file, and format it as a table in order to read it a little more easily. And the way to interpret this is that Alice has the DNA sequence AGAT repeated 28 times consecutively somewhere in her DNA. She also has the sequence AATG repeated 42 times consecutively somewhere in her DNA. And she has the TATC sequence repeated 14 times elsewhere in her DNA. Bob has the same thing with different numbers. And Charlie has the same STRs with different numbers as well. 

What we're going to give you in this problem is we're going to give you a CSV file representing all of this data, and we're also going to give you a DNA sequence formatted as a text file called sequence.txt, for example. And your task is going to be to take this DNA sequence and figure out whether or not it matches all of the STR counts for any person in the DNA database that we give you. What does this actually mean? 

Well, it means that your program will first need to open the CSV file and the DNA sequence and read the contents into memory in your Python program. Next up, for each of the STRs, you're going to compute the longest run of consecutive repeats of that STR in the DNA sequence you've been given. Once you've calculated the number of times that each STR repeats consecutively, you're going to compare those counts against every row in the CSV file to look for a match. If you find a match, you'll print out their name. And if there's no match, you'll just print out no match to indicate that there was no match for anyone in your DNA database. 

So let's go through these steps one at a time, starting with opening the files. When you open the CSV file, one thing you'll notice is that the very first row of the CSV file has name as the first column and every subsequent column is going to be an actual DNA sequence representing the STRs that you should be counting. Every other row of the CSV file is going to correspond to a person, where the first column is going to be that person's name. And all of the remaining columns are going to represent the number of times that the STR repeats consecutively in that person's DNA. 

So a couple of hints for doing this-- one thing to note is that Python has a CSV module, a module designed with functions that will help you to read CSV files. In particular, you might want to take a look at reader and DictReader to read a CSV file as a sequence of lists or a sequence of dictionaries, respectively. You'll also note that Python has a sys module-- short for system-- which will give you access to sys.argv for the command line arguments of the program where the first command line argument, other than the program's name, is going to be the DNA database formatted as a CSV file. And the second argument is going to be the DNA sequence so you're actually going to read and do some calculations on. 

Another hint-- once you've opened up a file by calling open and then passing in the file name in parentheses, you can read its contents using f.read if f is the name of the file, for example. Once you've opened up all of the files, the next step is going to be to compute the individual STR accounts. In other words, how many times do they repeat consecutively in a person's DNA? Well, let's take a look at this DNA sequence for example and look at the STR AGAT. 

You'll notice that it appears once here, AGAT. It also appears once here, AGAT. And here it repeats twice in sequence-- AGAT and then AGAT after that. And you could imagine doing this for every position within the DNA sequence, computing at that particular place in the sequence how many times does AGAT repeat starting from that position. Once you've computed all of those values, you could imagine taking whichever of those values is the biggest and that's going to be the longest run of consecutive STRs that you're going to find for this particular DNA sequence. 

So what does this actually mean algorithmically? Well, it means for each position in the DNA sequence, you'll want to compute how many times the STR repeats starting at that position. How do you do that? Well, for every position, you can keep checking successive substrings-- first, checking to see whether the STR matches once, then to see if it matches twice, then three times, so on and so forth until it doesn't repeat anymore. And at the end, you'll just want to keep track of what is the maximal number of times that this particular STR repeats consecutively. 

In order to do this, you might find the len function in Python helpful, which will give you the length of a string s or any other sequence for that matter. And you can also take a string and get a substring of it by using s[i:j] which takes the string s and will return the substring with all the characters from character i up to but not including character j. And that might be helpful if you've got a long sequence of DNA and you want to find out whether a particular subsequence of it matches some expected result. 

Once you've computed all of these STR counts, the last step is to compare it against the data in the CSV file that we've given you. How are you going to do that? Well, you'll want to save the STR counts in some sort of data structure, maybe a list, maybe a dictionary, maybe something else, whatever makes most sense to you. And then for each row in the data, you'll want to check if each of the STR accounts matches. And if each of the STR accounts matches for a particular person, then the DNA is a match and you should print out that person's name. Otherwise, you should just print out no match. 

In order to do this, you might want to remember that int is a function that you can actually call in Python that takes a string and turns it into the integer version of that string. So if you took the string "2" for example in quotation marks and passed it into the int function, what you'd get back is the integer 2. And remember that to confirm a match for a particular person, you'll need to check every column of the CSV file other than the first column, because the first column just stores the person's name. 

So once you've opened up the CSV file and the DNA sequence into memory, calculated all the STR counts, and then compared those counts against each row in your CSV file, you should then have the ability to take a sequence of DNA and figure out most likely who that DNA belongs to. My name is Brian and this was DNA.