1
00:00:00,000 --> 00:00:00,500


2
00:00:00,500 --> 00:00:02,009
BRIAN YU: Let's analyze some DNA.

3
00:00:02,009 --> 00:00:05,610
In this problem, your task is going
to be to take a sequence of DNA

4
00:00:05,610 --> 00:00:09,360
and determine which person
it most likely belongs to.

5
00:00:09,360 --> 00:00:13,800
Recall that a sequence of DNA is really
just a sequence of nucleotide bases

6
00:00:13,800 --> 00:00:16,410
where there are four different
nucleotide bases that

7
00:00:16,410 --> 00:00:21,960
are abbreviated T, C, A, and G. If
you put enough of these in sequence,

8
00:00:21,960 --> 00:00:24,630
you end up with a human's DNA sequence.

9
00:00:24,630 --> 00:00:27,690
And interestingly, if you look
at a human's DNA sequence,

10
00:00:27,690 --> 00:00:32,820
you'll often find places in the DNA
where certain parts of the DNA repeat.

11
00:00:32,820 --> 00:00:36,280
Take a look at the
sequence AGAT for example.

12
00:00:36,280 --> 00:00:39,360
You'll notice that in this
DNA sequence, it appears here.

13
00:00:39,360 --> 00:00:43,650
And then it repeats right after it
here and here and here and here.

14
00:00:43,650 --> 00:00:47,220
And it repeats five times
consecutively in a row.

15
00:00:47,220 --> 00:00:48,780
This isn't a coincidence.

16
00:00:48,780 --> 00:00:51,240
It's something called
a short tandem repeat--

17
00:00:51,240 --> 00:00:54,050
a sequence of DNA that
repeats consecutively

18
00:00:54,050 --> 00:00:56,460
in particular locations in human DNA.

19
00:00:56,460 --> 00:01:01,500
The short tandem repeats, also called
STRs, vary a lot in the population.

20
00:01:01,500 --> 00:01:04,349
In particular, different
people have each STR

21
00:01:04,349 --> 00:01:07,470
repeated different numbers of times.

22
00:01:07,470 --> 00:01:09,810
So what are the key observations here?

23
00:01:09,810 --> 00:01:14,280
Well, for any STR, people in a
population vary in how many times

24
00:01:14,280 --> 00:01:17,370
that particular STR
repeats consecutively.

25
00:01:17,370 --> 00:01:20,100
You might have a particular
STR repeated 25 times.

26
00:01:20,100 --> 00:01:21,960
I might have it repeated 28 times.

27
00:01:21,960 --> 00:01:25,080
Someone else might have it repeated
some different number of times.

28
00:01:25,080 --> 00:01:30,510
And matching STR counts can be used to,
therefore, identify who a sample of DNA

29
00:01:30,510 --> 00:01:31,710
belongs to.

30
00:01:31,710 --> 00:01:35,730
If we take a sample of DNA and
count up for each of the STRs--

31
00:01:35,730 --> 00:01:38,280
how many times do they
repeat consecutively

32
00:01:38,280 --> 00:01:42,150
and it matches some existing
piece of DNA in a DNA database,

33
00:01:42,150 --> 00:01:46,110
for example, then it's pretty
likely that those two pieces of DNA

34
00:01:46,110 --> 00:01:47,670
came from the same person.

35
00:01:47,670 --> 00:01:51,840
And in fact, the very DNA sequence
that we were looking at earlier, AGAT,

36
00:01:51,840 --> 00:01:56,910
is actually one of the STRs that the
FBI uses in its own DNA database.

37
00:01:56,910 --> 00:02:00,220
Your task in this problem is going
to be to do something similar.

38
00:02:00,220 --> 00:02:03,570
We're going to give you a DNA
database of your very own, formatted

39
00:02:03,570 --> 00:02:07,980
as a CSV file where each row is
going to correspond to a person.

40
00:02:07,980 --> 00:02:11,700
And each column is going to
correspond to a particular STR,

41
00:02:11,700 --> 00:02:15,840
a particular sequence of DNA that
repeats some number of times.

42
00:02:15,840 --> 00:02:19,410
We can take this CSV file, or
comma separated value file,

43
00:02:19,410 --> 00:02:22,980
and format it as a table in order
to read it a little more easily.

44
00:02:22,980 --> 00:02:25,650
And the way to interpret
this is that Alice

45
00:02:25,650 --> 00:02:30,660
has the DNA sequence AGAT repeated
28 times consecutively somewhere

46
00:02:30,660 --> 00:02:31,980
in her DNA.

47
00:02:31,980 --> 00:02:37,410
She also has the sequence AATG repeated
42 times consecutively somewhere

48
00:02:37,410 --> 00:02:38,430
in her DNA.

49
00:02:38,430 --> 00:02:43,713
And she has the TATC sequence repeated
14 times elsewhere in her DNA.

50
00:02:43,713 --> 00:02:45,630
Bob has the same thing
with different numbers.

51
00:02:45,630 --> 00:02:49,107
And Charlie has the same STRs
with different numbers as well.

52
00:02:49,107 --> 00:02:50,940
What we're going to
give you in this problem

53
00:02:50,940 --> 00:02:54,660
is we're going to give you a CSV
file representing all of this data,

54
00:02:54,660 --> 00:02:58,110
and we're also going to give you
a DNA sequence formatted as a text

55
00:02:58,110 --> 00:03:00,990
file called sequence.txt, for example.

56
00:03:00,990 --> 00:03:03,990
And your task is going to
be to take this DNA sequence

57
00:03:03,990 --> 00:03:07,050
and figure out whether or
not it matches all of the STR

58
00:03:07,050 --> 00:03:11,460
counts for any person in the
DNA database that we give you.

59
00:03:11,460 --> 00:03:13,150
What does this actually mean?

60
00:03:13,150 --> 00:03:15,420
Well, it means that
your program will first

61
00:03:15,420 --> 00:03:18,390
need to open the CSV
file and the DNA sequence

62
00:03:18,390 --> 00:03:22,380
and read the contents into
memory in your Python program.

63
00:03:22,380 --> 00:03:25,320
Next up, for each of
the STRs, you're going

64
00:03:25,320 --> 00:03:29,790
to compute the longest run of
consecutive repeats of that STR

65
00:03:29,790 --> 00:03:32,150
in the DNA sequence you've been given.

66
00:03:32,150 --> 00:03:33,900
Once you've calculated
the number of times

67
00:03:33,900 --> 00:03:36,780
that each STR repeats
consecutively, you're

68
00:03:36,780 --> 00:03:40,530
going to compare those counts
against every row in the CSV file

69
00:03:40,530 --> 00:03:41,788
to look for a match.

70
00:03:41,788 --> 00:03:43,830
If you find a match, you'll
print out their name.

71
00:03:43,830 --> 00:03:45,747
And if there's no match,
you'll just print out

72
00:03:45,747 --> 00:03:50,410
no match to indicate that there was no
match for anyone in your DNA database.

73
00:03:50,410 --> 00:03:55,050
So let's go through these steps one at
a time, starting with opening the files.

74
00:03:55,050 --> 00:03:57,570
When you open the CSV file,
one thing you'll notice

75
00:03:57,570 --> 00:04:02,400
is that the very first row of the
CSV file has name as the first column

76
00:04:02,400 --> 00:04:07,020
and every subsequent column is going to
be an actual DNA sequence representing

77
00:04:07,020 --> 00:04:09,690
the STRs that you should be counting.

78
00:04:09,690 --> 00:04:12,390
Every other row of the CSV
file is going to correspond

79
00:04:12,390 --> 00:04:16,079
to a person, where the first column
is going to be that person's name.

80
00:04:16,079 --> 00:04:19,890
And all of the remaining columns are
going to represent the number of times

81
00:04:19,890 --> 00:04:23,640
that the STR repeats consecutively
in that person's DNA.

82
00:04:23,640 --> 00:04:25,860
So a couple of hints for doing this--

83
00:04:25,860 --> 00:04:28,310
one thing to note is
that Python has a CSV

84
00:04:28,310 --> 00:04:32,850
module, a module designed with functions
that will help you to read CSV files.

85
00:04:32,850 --> 00:04:36,360
In particular, you might want to
take a look at reader and DictReader

86
00:04:36,360 --> 00:04:40,560
to read a CSV file as a sequence of
lists or a sequence of dictionaries,

87
00:04:40,560 --> 00:04:41,730
respectively.

88
00:04:41,730 --> 00:04:45,270
You'll also note that Python has
a sys module-- short for system--

89
00:04:45,270 --> 00:04:48,270
which will give you access
to sys.argv for the command

90
00:04:48,270 --> 00:04:51,780
line arguments of the program where
the first command line argument, other

91
00:04:51,780 --> 00:04:55,910
than the program's name, is going to
be the DNA database formatted as a CSV

92
00:04:55,910 --> 00:04:56,640
file.

93
00:04:56,640 --> 00:04:58,680
And the second argument
is going to be the DNA

94
00:04:58,680 --> 00:05:02,530
sequence so you're actually going
to read and do some calculations on.

95
00:05:02,530 --> 00:05:06,280
Another hint-- once you've opened
up a file by calling open and then

96
00:05:06,280 --> 00:05:11,110
passing in the file name in parentheses,
you can read its contents using f.read

97
00:05:11,110 --> 00:05:13,990
if f is the name of
the file, for example.

98
00:05:13,990 --> 00:05:16,660
Once you've opened up all
of the files, the next step

99
00:05:16,660 --> 00:05:20,200
is going to be to compute
the individual STR accounts.

100
00:05:20,200 --> 00:05:25,420
In other words, how many times do they
repeat consecutively in a person's DNA?

101
00:05:25,420 --> 00:05:28,330
Well, let's take a look at
this DNA sequence for example

102
00:05:28,330 --> 00:05:31,360
and look at the STR AGAT.

103
00:05:31,360 --> 00:05:34,900
You'll notice that it
appears once here, AGAT.

104
00:05:34,900 --> 00:05:37,570
It also appears once here, AGAT.

105
00:05:37,570 --> 00:05:39,550
And here it repeats twice in sequence--

106
00:05:39,550 --> 00:05:42,430
AGAT and then AGAT after that.

107
00:05:42,430 --> 00:05:46,270
And you could imagine doing this for
every position within the DNA sequence,

108
00:05:46,270 --> 00:05:49,060
computing at that particular
place in the sequence

109
00:05:49,060 --> 00:05:53,620
how many times does AGAT repeat
starting from that position.

110
00:05:53,620 --> 00:05:55,910
Once you've computed
all of those values,

111
00:05:55,910 --> 00:05:58,990
you could imagine taking whichever
of those values is the biggest

112
00:05:58,990 --> 00:06:03,010
and that's going to be the
longest run of consecutive STRs

113
00:06:03,010 --> 00:06:06,310
that you're going to find for
this particular DNA sequence.

114
00:06:06,310 --> 00:06:09,280
So what does this actually
mean algorithmically?

115
00:06:09,280 --> 00:06:12,280
Well, it means for each
position in the DNA sequence,

116
00:06:12,280 --> 00:06:15,460
you'll want to compute how
many times the STR repeats

117
00:06:15,460 --> 00:06:17,470
starting at that position.

118
00:06:17,470 --> 00:06:18,710
How do you do that?

119
00:06:18,710 --> 00:06:22,990
Well, for every position, you can keep
checking successive substrings-- first,

120
00:06:22,990 --> 00:06:26,380
checking to see whether the STR matches
once, then to see if it matches twice,

121
00:06:26,380 --> 00:06:30,493
then three times, so on and so forth
until it doesn't repeat anymore.

122
00:06:30,493 --> 00:06:32,410
And at the end, you'll
just want to keep track

123
00:06:32,410 --> 00:06:36,700
of what is the maximal number of
times that this particular STR

124
00:06:36,700 --> 00:06:38,560
repeats consecutively.

125
00:06:38,560 --> 00:06:41,050
In order to do this, you
might find the len function

126
00:06:41,050 --> 00:06:44,470
in Python helpful, which will
give you the length of a string s

127
00:06:44,470 --> 00:06:46,720
or any other sequence for that matter.

128
00:06:46,720 --> 00:06:53,860
And you can also take a string and get
a substring of it by using s[i:j] which

129
00:06:53,860 --> 00:06:58,480
takes the string s and will return the
substring with all the characters from

130
00:06:58,480 --> 00:07:02,590
character i up to but not
including character j.

131
00:07:02,590 --> 00:07:05,530
And that might be helpful if
you've got a long sequence of DNA

132
00:07:05,530 --> 00:07:08,620
and you want to find out whether
a particular subsequence of it

133
00:07:08,620 --> 00:07:11,650
matches some expected result.

134
00:07:11,650 --> 00:07:14,230
Once you've computed
all of these STR counts,

135
00:07:14,230 --> 00:07:17,950
the last step is to compare it
against the data in the CSV file

136
00:07:17,950 --> 00:07:19,390
that we've given you.

137
00:07:19,390 --> 00:07:20,930
How are you going to do that?

138
00:07:20,930 --> 00:07:24,730
Well, you'll want to save the STR counts
in some sort of data structure, maybe

139
00:07:24,730 --> 00:07:26,920
a list, maybe a dictionary,
maybe something else,

140
00:07:26,920 --> 00:07:28,780
whatever makes most sense to you.

141
00:07:28,780 --> 00:07:30,880
And then for each row
in the data, you'll

142
00:07:30,880 --> 00:07:34,610
want to check if each of
the STR accounts matches.

143
00:07:34,610 --> 00:07:37,450
And if each of the STR accounts
matches for a particular person,

144
00:07:37,450 --> 00:07:41,200
then the DNA is a match and you
should print out that person's name.

145
00:07:41,200 --> 00:07:43,892
Otherwise, you should
just print out no match.

146
00:07:43,892 --> 00:07:45,850
In order to do this, you
might want to remember

147
00:07:45,850 --> 00:07:47,890
that int is a function
that you can actually

148
00:07:47,890 --> 00:07:52,000
call in Python that takes a string
and turns it into the integer

149
00:07:52,000 --> 00:07:53,420
version of that string.

150
00:07:53,420 --> 00:07:56,260
So if you took the string "2"
for example in quotation marks

151
00:07:56,260 --> 00:08:01,140
and passed it into the int function,
what you'd get back is the integer 2.

152
00:08:01,140 --> 00:08:04,150
And remember that to confirm a
match for a particular person,

153
00:08:04,150 --> 00:08:08,380
you'll need to check every column of the
CSV file other than the first column,

154
00:08:08,380 --> 00:08:12,050
because the first column just
stores the person's name.

155
00:08:12,050 --> 00:08:16,350
So once you've opened up the CSV file
and the DNA sequence into memory,

156
00:08:16,350 --> 00:08:19,870
calculated all the STR counts,
and then compared those counts

157
00:08:19,870 --> 00:08:23,020
against each row in your
CSV file, you should then

158
00:08:23,020 --> 00:08:25,720
have the ability to
take a sequence of DNA

159
00:08:25,720 --> 00:08:29,650
and figure out most likely
who that DNA belongs to.

160
00:08:29,650 --> 00:08:33,240
My name is Brian and this was DNA.

161
00:08:33,240 --> 00:08:34,366