1 00:00:00,000 --> 00:00:02,994 [MUSIC PLAYING] 2 00:00:02,994 --> 00:01:22,792 3 00:01:22,792 --> 00:01:23,750 DAVID MALAN: All right. 4 00:01:23,750 --> 00:01:28,230 So this is CS50, and this day we continue our look at Python, 5 00:01:28,230 --> 00:01:30,330 but also introduce another language. 6 00:01:30,330 --> 00:01:32,700 This one is called SQL, or Structured Query Language, 7 00:01:32,700 --> 00:01:35,492 and it's going to allow us to solve some different problems as well 8 00:01:35,492 --> 00:01:37,260 as some old problems better. 9 00:01:37,260 --> 00:01:40,330 But to do this, let's try to motivate some of the problems initially. 10 00:01:40,330 --> 00:01:42,390 So this, of course, is just an example of a spreadsheet. 11 00:01:42,390 --> 00:01:43,920 Odds are you use these pretty frequently, 12 00:01:43,920 --> 00:01:46,253 whether it's in the form of Google Spreadsheets or Excel 13 00:01:46,253 --> 00:01:47,700 or Apple Numbers and the like. 14 00:01:47,700 --> 00:01:50,700 And this is actually the simplest form of what we might call a database, 15 00:01:50,700 --> 00:01:54,000 just some kind of application that stores lots of data. 16 00:01:54,000 --> 00:01:56,910 And odds are if you've ever taken registrations for a student group 17 00:01:56,910 --> 00:01:59,580 or had people sign up or fill out a form or a survey, 18 00:01:59,580 --> 00:02:03,270 odds are you've used Google Forms to get your data into a Google Spreadsheet. 19 00:02:03,270 --> 00:02:05,370 And so we thought we would begin there today, 20 00:02:05,370 --> 00:02:08,370 a representative problem whereby you might want to collect a lot of data 21 00:02:08,370 --> 00:02:10,660 from users and then do something with it, 22 00:02:10,660 --> 00:02:14,670 but ideally more powerfully than you could do as a human on your own. 23 00:02:14,670 --> 00:02:18,600 So we've put into place a Google Form that quite simply 24 00:02:18,600 --> 00:02:20,040 asks a couple of questions today. 
25 00:02:20,040 --> 00:02:21,423 What are your favorite TV shows? 26 00:02:21,423 --> 00:02:24,090 And we ask you for the title of your favorite show and the genre 27 00:02:24,090 --> 00:02:26,340 or genres into which that show falls. 28 00:02:26,340 --> 00:02:29,490 So if you wouldn't mind, go ahead and open up this URL here 29 00:02:29,490 --> 00:02:31,192 on your laptop or phone. 30 00:02:31,192 --> 00:02:32,650 If you don't have one, that's fine. 31 00:02:32,650 --> 00:02:35,790 You can use the person's next to you, if I may. 32 00:02:35,790 --> 00:02:41,970 And take just a moment to go to that URL there and answer those two questions. 33 00:02:41,970 --> 00:02:46,890 What is the title of your favorite TV show, and into what genre or genres 34 00:02:46,890 --> 00:02:48,060 does it fall? 35 00:02:48,060 --> 00:02:51,310 36 00:02:51,310 --> 00:02:52,720 Favorite TV shows. 37 00:02:52,720 --> 00:02:59,260 And when you go to that URL, you'll find yourself at, again, this form here, 38 00:02:59,260 --> 00:03:03,230 asking for your title and for your favorite show. 39 00:03:03,230 --> 00:03:06,040 So by the look of my own screen here we've 40 00:03:06,040 --> 00:03:09,640 got some 62 responses, 74 responses coming in already, 41 00:03:09,640 --> 00:03:10,780 so we can track this live. 42 00:03:10,780 --> 00:03:12,710 So 111 responses so far. 43 00:03:12,710 --> 00:03:14,830 That's a lot of favorite TV shows. 44 00:03:14,830 --> 00:03:16,090 This number's-- keep going up. 45 00:03:16,090 --> 00:03:20,890 169, up to, next, 191. 46 00:03:20,890 --> 00:03:22,480 Keep them coming. 47 00:03:22,480 --> 00:03:23,520 One per person is fine. 48 00:03:23,520 --> 00:03:25,103 It will give us more than enough data. 49 00:03:25,103 --> 00:03:26,860 246. 50 00:03:26,860 --> 00:03:29,300 We'll give it another few seconds, but suffice it to say, 51 00:03:29,300 --> 00:03:31,190 this is kind of a lot of data. 
52 00:03:31,190 --> 00:03:32,920 And if the next step in this process were 53 00:03:32,920 --> 00:03:35,830 for me to analyze the data because I'm curious to know 54 00:03:35,830 --> 00:03:38,050 what's the most popular show on campus or if I wanted 55 00:03:38,050 --> 00:03:40,330 to do some other analytical work on this, 56 00:03:40,330 --> 00:03:42,400 it'll be useful to have all of this data, 57 00:03:42,400 --> 00:03:44,740 not in Google Forms' proprietary interface, 58 00:03:44,740 --> 00:03:48,520 but in Google Spreadsheet form with rows and columns, a format you're probably 59 00:03:48,520 --> 00:03:49,420 pretty familiar with. 60 00:03:49,420 --> 00:03:51,670 So we'll let this counter keep going up and up and up, 61 00:03:51,670 --> 00:03:53,840 and I'm going to go ahead and click on responses. 62 00:03:53,840 --> 00:03:55,990 And if we really wanted to, we could look through-- 63 00:03:55,990 --> 00:03:59,950 The Office is up there, followed by Game of Thrones, Friends, OK. 64 00:03:59,950 --> 00:04:01,500 Perhaps pretty predictable here. 65 00:04:01,500 --> 00:04:03,250 Let me go ahead and click this icon, which 66 00:04:03,250 --> 00:04:06,100 is going to open the Google Spreadsheet version of this. 67 00:04:06,100 --> 00:04:08,590 And you'll see that there's three columns by default. 68 00:04:08,590 --> 00:04:12,230 [LAUGHTER] 69 00:04:12,230 --> 00:04:13,440 Oh, that's funny. 70 00:04:13,440 --> 00:04:16,386 71 00:04:16,386 --> 00:04:18,341 [APPLAUSE] 72 00:04:18,341 --> 00:04:18,841 Thank you. 73 00:04:18,841 --> 00:04:21,787 Thank you. 74 00:04:21,787 --> 00:04:23,616 OK. 75 00:04:23,616 --> 00:04:24,530 OK. 76 00:04:24,530 --> 00:04:25,310 Well-played. 77 00:04:25,310 --> 00:04:28,790 Now-- so you'll see that by default we got three columns. 78 00:04:28,790 --> 00:04:31,370 Timestamp, Google Forms just does automatically 79 00:04:31,370 --> 00:04:33,230 to timestamp the submission of the form. 
80 00:04:33,230 --> 00:04:34,910 Title comes from the first question. 81 00:04:34,910 --> 00:04:36,617 Genres comes from the second question. 82 00:04:36,617 --> 00:04:38,450 And if we kept scrolling and scrolling, odds 83 00:04:38,450 --> 00:04:40,010 are we're seeing more and more rows. 84 00:04:40,010 --> 00:04:43,040 Let's propose that we have enough of these rows already, 85 00:04:43,040 --> 00:04:44,960 and how might we go about analyzing the data? 86 00:04:44,960 --> 00:04:48,080 Well, if you're pretty expert with Google Spreadsheets or Excel 87 00:04:48,080 --> 00:04:50,570 or Numbers, maybe you could use some functions that 88 00:04:50,570 --> 00:04:54,197 are built into these programs or you could do some Command-F or Control-F 89 00:04:54,197 --> 00:04:56,030 and search for the favorite show that you're 90 00:04:56,030 --> 00:04:57,890 looking for to try to run some numbers. 91 00:04:57,890 --> 00:04:59,940 But we have programming abilities nowadays. 92 00:04:59,940 --> 00:05:02,480 And if you haven't used this feature already, 93 00:05:02,480 --> 00:05:04,710 if you go to Google Spreadsheet's file menu, 94 00:05:04,710 --> 00:05:06,960 you can go to download, and you can download your data 95 00:05:06,960 --> 00:05:07,970 into a bunch of formats. 96 00:05:07,970 --> 00:05:09,762 And the most useful one I'd propose for now 97 00:05:09,762 --> 00:05:12,770 is going to be this one, Comma-Separated Values or CSV. 98 00:05:12,770 --> 00:05:14,100 And we've seen this before. 99 00:05:14,100 --> 00:05:15,800 In fact, you've generated this before. 100 00:05:15,800 --> 00:05:18,920 So if I go ahead and do that, it pretty quickly downloaded a CSV file 101 00:05:18,920 --> 00:05:22,700 that if I really wanted to I could open in Excel or Apple Numbers or the like. 102 00:05:22,700 --> 00:05:26,460 But let's do something instead with that, and go ahead and put it into, 103 00:05:26,460 --> 00:05:30,870 for instance, my IDE so that I can actually do something with it. 
104 00:05:30,870 --> 00:05:33,480 I'm going to grab my downloaded file here. 105 00:05:33,480 --> 00:05:37,130 And I'm going to go ahead and drag that into the IDE. 106 00:05:37,130 --> 00:05:39,680 And so now you'll see a whole bunch of files here. 107 00:05:39,680 --> 00:05:40,850 Let me put-- whoops. 108 00:05:40,850 --> 00:05:43,102 Let me put this right in the root of my folder. 109 00:05:43,102 --> 00:05:44,310 And we'll see this file here. 110 00:05:44,310 --> 00:05:46,910 And if I double-click on it, you'll see exactly the same data, 111 00:05:46,910 --> 00:05:48,185 albeit in CSV format. 112 00:05:48,185 --> 00:05:50,810 So it looks the same, but we have commas separating the values. 113 00:05:50,810 --> 00:05:53,580 And Google Spreadsheets did that for us automatically. 114 00:05:53,580 --> 00:05:55,910 So let's now write a program that analyzes this data 115 00:05:55,910 --> 00:05:59,360 and see if we can't make sense of what some of the most popular values 116 00:05:59,360 --> 00:06:00,570 are in there. 117 00:06:00,570 --> 00:06:03,510 Let me go ahead and close this, open a new file, 118 00:06:03,510 --> 00:06:07,150 and I'm going to go ahead and call it, for instance, favorites.py, 119 00:06:07,150 --> 00:06:09,770 since the goal is to figure out everyone's favorites here. 120 00:06:09,770 --> 00:06:12,920 And let me go ahead and use some syntax from last week 121 00:06:12,920 --> 00:06:15,230 that you might be familiar with now from P set 6. 122 00:06:15,230 --> 00:06:17,390 Let me go ahead and import csv, the module, 123 00:06:17,390 --> 00:06:20,600 so that I have access to that library of functionality. 124 00:06:20,600 --> 00:06:22,238 Then let me do this with open. 125 00:06:22,238 --> 00:06:25,280 And then I'm going to go ahead and open this file, and I could rename it, 126 00:06:25,280 --> 00:06:28,220 but you know, I can just grab the whole file name this way, too. 
127 00:06:28,220 --> 00:06:32,360 It's a little excessive, but I'll go ahead and paste that in in read mode. 128 00:06:32,360 --> 00:06:34,530 And that gives me a file. 129 00:06:34,530 --> 00:06:36,990 Let me go ahead and zoom out just a little bit there. 130 00:06:36,990 --> 00:06:40,430 And once I have this file, I can throw it into a DictReader, 131 00:06:40,430 --> 00:06:42,620 for instance, to make it easier to read the CSV. 132 00:06:42,620 --> 00:06:45,080 So reader gets csv.DictReader. 133 00:06:45,080 --> 00:06:47,000 Pass in that file. 134 00:06:47,000 --> 00:06:48,890 And then after this, I have the ability, now, 135 00:06:48,890 --> 00:06:54,170 to iterate over each row like this, much like you probably did for DNA 136 00:06:54,170 --> 00:06:56,400 or one or more other problems, as well. 137 00:06:56,400 --> 00:06:59,390 Now, just as a sanity check, let me just go ahead and print out 138 00:06:59,390 --> 00:07:00,870 each of the row's titles. 139 00:07:00,870 --> 00:07:02,630 So I'm not going to see anything new here, 140 00:07:02,630 --> 00:07:05,720 but this will be a little sanity check that I'm writing correct code. 141 00:07:05,720 --> 00:07:09,930 So run Python of favorites.py, and voila, fwoom, there's all the data. 142 00:07:09,930 --> 00:07:13,270 So The Office is both at the beginning and at the end here, it seems. 143 00:07:13,270 --> 00:07:16,520 All right, so that's all fine and good, a nice stepping stone, but let's start 144 00:07:16,520 --> 00:07:19,580 to analyze this data, this time using some code. 145 00:07:19,580 --> 00:07:22,850 So let me go ahead and say this. 146 00:07:22,850 --> 00:07:25,580 Suppose I wanted to keep track and determine 147 00:07:25,580 --> 00:07:27,530 what the most popular show is. 148 00:07:27,530 --> 00:07:30,480 So I've got this big CSV file with title, title, title, title. 149 00:07:30,480 --> 00:07:32,480 Invariably, there's going to be some duplicates. 
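A runnable sketch of the reading-and-printing step just described; the inline CSV below is a made-up stand-in for the real downloaded file (whose long file name is elided here), with the same three columns the lecture names.

```python
import csv
import io

# Inline stand-in for the downloaded Google Forms export;
# columns match the lecture: Timestamp, title, genres
data = io.StringIO(
    "Timestamp,title,genres\n"
    "10/7/2019 14:01:00,The Office,Comedy\n"
    "10/7/2019 14:01:05,Game of Thrones,Fantasy\n"
    "10/7/2019 14:01:09,The Office,Comedy\n"
)

# DictReader yields each row as a dictionary keyed by the header row,
# so row["title"] is the answer to the first question
reader = csv.DictReader(data)
titles = [row["title"] for row in reader]
print(titles)
```

With a real file, the `io.StringIO` object would be replaced by `open("….csv", "r")` as in the lecture.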
150 00:07:32,480 --> 00:07:35,690 A bunch of you probably like The Office or some other show, 151 00:07:35,690 --> 00:07:38,180 and so it'd be nice to kind of aggregate that data. 152 00:07:38,180 --> 00:07:43,080 What might be a useful data structure in Python whereby for each row 153 00:07:43,080 --> 00:07:45,740 we can use this data structure to just kind of do 154 00:07:45,740 --> 00:07:49,490 the equivalent of like 1, 2, 3, and count each 155 00:07:49,490 --> 00:07:51,720 of the occurrences of those titles? 156 00:07:51,720 --> 00:07:56,030 Any-- OK, I heard a hash table, but more properly in Python? 157 00:07:56,030 --> 00:07:56,950 A dictionary. 158 00:07:56,950 --> 00:07:59,410 So a dictionary is kind of the Swiss army 159 00:07:59,410 --> 00:08:02,140 knife of data structures for a lot of people in the sense 160 00:08:02,140 --> 00:08:05,590 that insofar as it allows you to store keys and values, 161 00:08:05,590 --> 00:08:08,900 and those keys can be strings, and those values can be anything you want, 162 00:08:08,900 --> 00:08:11,763 including, for instance, in this case, numbers, like a counter, 163 00:08:11,763 --> 00:08:13,930 it's a really versatile data structure when you just 164 00:08:13,930 --> 00:08:17,000 want to throw some data into a data structure and do something with it. 165 00:08:17,000 --> 00:08:20,110 So let me go ahead and give myself a dictionary at the top of the file. 166 00:08:20,110 --> 00:08:23,110 I can do this in a couple of ways, and if my goal is to store the counts 167 00:08:23,110 --> 00:08:25,420 of each of these titles, I could call it counts, 168 00:08:25,420 --> 00:08:30,130 and I could set it equal to dict(), which is just a function that hands me 169 00:08:30,130 --> 00:08:31,450 an empty dictionary. 170 00:08:31,450 --> 00:08:33,730 Or I recall I can use this syntax, which is 171 00:08:33,730 --> 00:08:35,919 equivalent to just shorthand notation. 
172 00:08:35,919 --> 00:08:40,210 But what I want to do now inside of my loop, every time I 173 00:08:40,210 --> 00:08:44,210 iterate through this file, I want to check if my-- 174 00:08:44,210 --> 00:08:47,630 I've seen this title before, I want to add 1 to my counter. 175 00:08:47,630 --> 00:08:52,808 But if I've never seen this title before, what do I want to do instead? 176 00:08:52,808 --> 00:08:53,308 Yeah? 177 00:08:53,308 --> 00:08:55,117 AUDIENCE: [INAUDIBLE] 178 00:08:55,117 --> 00:08:56,950 DAVID MALAN: Add the title to the dictionary 179 00:08:56,950 --> 00:09:00,110 as a key with the value of what, the first time I see it? 180 00:09:00,110 --> 00:09:01,970 Well, the first time I see it. 181 00:09:01,970 --> 00:09:02,470 AUDIENCE: 1. 182 00:09:02,470 --> 00:09:03,940 DAVID MALAN: So we can just initialize it to 1. 183 00:09:03,940 --> 00:09:05,290 So there's two scenarios here. 184 00:09:05,290 --> 00:09:08,500 Either we've seen this title before and we want to do plus 1, 185 00:09:08,500 --> 00:09:11,830 or we've not seen the title before and we just want to set it equal to 1. 186 00:09:11,830 --> 00:09:13,630 So how might we express that logic? 187 00:09:13,630 --> 00:09:16,190 Well, let me go ahead and say, first of all, 188 00:09:16,190 --> 00:09:18,280 let's get the title into a variable of its own, 189 00:09:18,280 --> 00:09:20,740 just to shorten our syntax a little bit. 190 00:09:20,740 --> 00:09:25,040 And then I can say something like this, if title in counts-- 191 00:09:25,040 --> 00:09:26,300 well, what does that mean? 192 00:09:26,300 --> 00:09:29,140 That means that inside of my counts dictionary, 193 00:09:29,140 --> 00:09:34,300 I can index into it at that location, and I can increment whatever is there 194 00:09:34,300 --> 00:09:38,830 by 1, assuming on some previous iteration, 195 00:09:38,830 --> 00:09:44,738 else I went ahead and did counts of title equals 1. 196 00:09:44,738 --> 00:09:46,030 So these are the two scenarios. 
197 00:09:46,030 --> 00:09:48,940 If I've seen this title before, go ahead and add 1 198 00:09:48,940 --> 00:09:50,560 to its value in the dictionary. 199 00:09:50,560 --> 00:09:53,960 Otherwise, just default to I have seen it one time. 200 00:09:53,960 --> 00:09:56,710 So some of the shows are going to stay stuck at one if they're not 201 00:09:56,710 --> 00:09:58,502 very popular among the audience, or they're 202 00:09:58,502 --> 00:10:00,703 going to keep getting plus 1, plus 1, plus 1. 203 00:10:00,703 --> 00:10:02,620 So what I often do when writing code like this 204 00:10:02,620 --> 00:10:05,450 is, as I'm sort of finding my way, after I'm done with this, 205 00:10:05,450 --> 00:10:07,075 I'm not going to write the whole thing. 206 00:10:07,075 --> 00:10:10,070 Let me go ahead and just, at the end of this, print counts, right? 207 00:10:10,070 --> 00:10:11,570 I'm just curious to see what it is. 208 00:10:11,570 --> 00:10:13,570 I'm just going to go ahead and add print counts. 209 00:10:13,570 --> 00:10:16,060 Let me go ahead and run now Python of favorites. 210 00:10:16,060 --> 00:10:19,690 And voila, you see a huge, messy data structure. 211 00:10:19,690 --> 00:10:21,520 But let's go to the very top of it. 212 00:10:21,520 --> 00:10:25,210 Notice that it starts with curly braces, indicating it's, indeed, a dictionary. 213 00:10:25,210 --> 00:10:28,480 It starts with a quoted value, which is one of the titles in it, 214 00:10:28,480 --> 00:10:31,450 and it looks like two of you said Dynasty is your favorite show. 215 00:10:31,450 --> 00:10:33,790 25 of you said The Office is your favorite show. 216 00:10:33,790 --> 00:10:36,340 One of you said Blindspot is your favorite show. 217 00:10:36,340 --> 00:10:37,210 One of you, 24. 218 00:10:37,210 --> 00:10:39,060 Friends, 19, and so forth. 219 00:10:39,060 --> 00:10:40,810 So the data is kind of all over the place, 220 00:10:40,810 --> 00:10:43,570 because it's based on the order in which you all input it. 
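The two-branch counting logic just described, as a self-contained sketch; the list of titles here is made-up sample input standing in for the rows of the CSV.

```python
# The two scenarios from the lecture: if the title has been seen
# before, add 1 to its count; otherwise initialize its count to 1
counts = {}
for title in ["The Office", "Dynasty", "The Office", "Friends"]:
    if title in counts:
        counts[title] += 1
    else:
        counts[title] = 1
print(counts)
```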
221 00:10:43,570 --> 00:10:45,487 So let's just clean this up a little bit. 222 00:10:45,487 --> 00:10:47,570 And let me go ahead and print this out as follows. 223 00:10:47,570 --> 00:10:49,660 Instead of just printing a big, messy dictionary, 224 00:10:49,660 --> 00:10:52,990 why don't I go ahead and iterate over the dictionary? 225 00:10:52,990 --> 00:10:55,220 And we didn't see this necessarily last week, 226 00:10:55,220 --> 00:10:57,053 but you can actually do something like this. 227 00:10:57,053 --> 00:11:01,480 If your dictionary's storing titles as keys and counts as values, 228 00:11:01,480 --> 00:11:06,640 we can do something fancy like for title comma count in counts, 229 00:11:06,640 --> 00:11:10,090 but if you want to iterate over not just the keys but all of the items therein, 230 00:11:10,090 --> 00:11:12,250 there's this other function in Python associated 231 00:11:12,250 --> 00:11:15,370 with a dictionary that will hand you back two things at once, 232 00:11:15,370 --> 00:11:16,763 key and value together. 233 00:11:16,763 --> 00:11:18,430 And you can call them anything you want. 234 00:11:18,430 --> 00:11:21,470 I'm calling them title comma count, respectively. 235 00:11:21,470 --> 00:11:25,300 Now, let me go ahead and print out, for instance, something like the title, 236 00:11:25,300 --> 00:11:28,340 and then let me go ahead and just print out the count right next to it. 237 00:11:28,340 --> 00:11:32,540 So now, if I save this and run favorites.py, OK, it's prettier, 238 00:11:32,540 --> 00:11:35,530 but it's still a little messy in that we have-- 239 00:11:35,530 --> 00:11:38,300 the numbers are just right there after the name. 240 00:11:38,300 --> 00:11:39,890 So let's go ahead and tidy this up. 241 00:11:39,890 --> 00:11:41,890 Recall that when you print two things with print 242 00:11:41,890 --> 00:11:44,170 separating them with a comma, what character 243 00:11:44,170 --> 00:11:47,380 did you get automatically in between them? 
244 00:11:47,380 --> 00:11:48,560 Yeah, just a single space. 245 00:11:48,560 --> 00:11:50,630 We saw that at the very beginning of last week. 246 00:11:50,630 --> 00:11:53,740 But we can override that separator, otherwise known as sep, 247 00:11:53,740 --> 00:11:56,950 which has a default value of quote-unquote with a space. 248 00:11:56,950 --> 00:12:00,293 But suppose I just want to give some visual delineation here, 249 00:12:00,293 --> 00:12:03,460 maybe like a vertical bar to kind of sort of make it look like a spreadsheet 250 00:12:03,460 --> 00:12:04,000 again. 251 00:12:04,000 --> 00:12:05,590 Let me go ahead and run this. 252 00:12:05,590 --> 00:12:07,600 And now it's not super, super pretty, but you 253 00:12:07,600 --> 00:12:12,700 can see title followed by a vertical bar followed by the number 1. 254 00:12:12,700 --> 00:12:16,243 So this is all fine and good, but it'd be nice, for instance, 255 00:12:16,243 --> 00:12:18,160 if we could sort these, because I don't really 256 00:12:18,160 --> 00:12:22,000 want to sift through this visually, looking for all of the possible values. 257 00:12:22,000 --> 00:12:25,660 So you know, the easiest way to sort these things might be like this. 258 00:12:25,660 --> 00:12:28,960 Instead of passing in counts.items(), there's 259 00:12:28,960 --> 00:12:31,930 another function in Python where you can literally just sort something 260 00:12:31,930 --> 00:12:32,590 with sorted. 261 00:12:32,590 --> 00:12:34,850 You don't have to worry about bubble sort, selection sort, insertion sort, 262 00:12:34,850 --> 00:12:36,160 merge sort, anything. 263 00:12:36,160 --> 00:12:37,867 Just, you sort it, and Python 264 00:12:37,867 --> 00:12:39,700 will use one of the fastest implementations 265 00:12:39,700 --> 00:12:41,720 of sorting available to it. 266 00:12:41,720 --> 00:12:44,020 Now, if I rerun this, we'll see that-- 267 00:12:44,020 --> 00:12:45,220 [CHUCKLES] OK. 
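A sketch of the sep override together with sorted(), using a few sample counts; note that sorted() compares the (title, count) pairs element by element, so by default it orders by title, which is exactly what produces the surprise here.

```python
# Sample counts standing in for the dictionary built from the CSV
counts = {"The Office": 25, "24": 1, "Friends": 19}

# sorted() orders the pairs by their first element, the title,
# so this is alphabetical by title, not by popularity
for title, count in sorted(counts.items()):
    print(title, count, sep=" | ")
```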
268 00:12:45,220 --> 00:12:47,350 We'll see that we have all of these values 269 00:12:47,350 --> 00:12:51,170 here starting with 24, 9009, Adventure Time. 270 00:12:51,170 --> 00:12:53,420 But that's not what I meant. 271 00:12:53,420 --> 00:12:57,160 I wanted to sort not by title, but by what? 272 00:12:57,160 --> 00:12:58,250 Count. 273 00:12:58,250 --> 00:13:00,550 So this gets a little less obvious. 274 00:13:00,550 --> 00:13:04,390 And it turns out Python does this in a couple of possible ways. 275 00:13:04,390 --> 00:13:06,640 But let me go ahead and introduce this one first. 276 00:13:06,640 --> 00:13:11,170 It turns out that this sorted function takes an optional argument that 277 00:13:11,170 --> 00:13:12,910 happens to be called key. 278 00:13:12,910 --> 00:13:18,280 And key simply specifies, how do you want to sort 279 00:13:18,280 --> 00:13:20,620 the dictionaries that you're being-- 280 00:13:20,620 --> 00:13:23,680 or how do you want to sort the key value pairs that are being passed? 281 00:13:23,680 --> 00:13:26,540 By default, it literally uses the key, so title. 282 00:13:26,540 --> 00:13:29,960 But if you want to use the value, you know what you can do, is this. 283 00:13:29,960 --> 00:13:33,420 Let me go ahead, and up here, even though this is a little weird, 284 00:13:33,420 --> 00:13:35,920 and let me go ahead and say the following. 285 00:13:35,920 --> 00:13:37,960 Def function item. 286 00:13:37,960 --> 00:13:38,617 So def f. 287 00:13:38,617 --> 00:13:40,450 I'm just giving myself a pretty generic name 288 00:13:40,450 --> 00:13:43,075 for a function, because I'm going to get rid of it in a moment. 289 00:13:43,075 --> 00:13:46,780 I'm going to go ahead and return item bracket 1. 290 00:13:46,780 --> 00:13:48,080 Now, why is this? 291 00:13:48,080 --> 00:13:49,780 Take a look at this line here. 
292 00:13:49,780 --> 00:13:53,500 And even though sorted is new today, as is this counts.items(), 293 00:13:53,500 --> 00:13:58,490 all items does is return key comma value, and sorted sorts then by key, 294 00:13:58,490 --> 00:14:02,330 by default, handing them back as title comma count, or again, 295 00:14:02,330 --> 00:14:03,930 I could call these anything I want. 296 00:14:03,930 --> 00:14:07,700 But I'm calling them, more semantically usefully, title comma counts. 297 00:14:07,700 --> 00:14:09,890 So that's giving me back two things. 298 00:14:09,890 --> 00:14:13,370 Well, just like an array in C or a list in Python, 299 00:14:13,370 --> 00:14:16,850 if you're getting two things at once, the first of those things, 300 00:14:16,850 --> 00:14:19,790 conventionally, is going to be accessible in bracket zero. 301 00:14:19,790 --> 00:14:22,190 The second thing is going to be accessible in bracket 1. 302 00:14:22,190 --> 00:14:27,350 So this super, super simple function called f takes, as input, an item. 303 00:14:27,350 --> 00:14:30,050 And if that item has two values, a key and a value, 304 00:14:30,050 --> 00:14:32,270 return item one is just literally going to return 305 00:14:32,270 --> 00:14:33,710 the second thing, whatever it is. 306 00:14:33,710 --> 00:14:36,530 In this case, it's going to return, of course, the value. 307 00:14:36,530 --> 00:14:38,160 Now, why is this useful? 308 00:14:38,160 --> 00:14:43,260 Well, because the sorted function takes an additional argument called key, 309 00:14:43,260 --> 00:14:46,040 you can actually pass it the name of a function 310 00:14:46,040 --> 00:14:51,620 to use in order to determine how to sort the previous items. 311 00:14:51,620 --> 00:14:53,750 So this is an example of a really powerful feature 312 00:14:53,750 --> 00:14:57,860 of Python and other higher-level languages, in which case 313 00:14:57,860 --> 00:15:01,670 you can pass in, as arguments, the names of functions. 
314 00:15:01,670 --> 00:15:02,720 I'm not calling f. 315 00:15:02,720 --> 00:15:04,340 I'm just passing in its name. 316 00:15:04,340 --> 00:15:07,520 If I were calling it, I'd have open paren and a closed paren, 317 00:15:07,520 --> 00:15:09,620 but I'm just passing it in by name. 318 00:15:09,620 --> 00:15:13,050 And now, this is going to, by default, sort things from small to large. 319 00:15:13,050 --> 00:15:15,050 Turns out if you read the documentation, there's 320 00:15:15,050 --> 00:15:19,430 another argument to sorted called reverse that will invert it so that now 321 00:15:19,430 --> 00:15:21,780 you'll get largest to smallest. 322 00:15:21,780 --> 00:15:23,160 So now let's try this. 323 00:15:23,160 --> 00:15:25,460 Let me go ahead and rerun Python of favorites.py 324 00:15:25,460 --> 00:15:29,030 enter, and now, if I scroll up to the top, 325 00:15:29,030 --> 00:15:33,680 we should see The Office is indeed the most popular show, apparently 326 00:15:33,680 --> 00:15:35,660 consistent with Netflix's data these days, too, 327 00:15:35,660 --> 00:15:38,702 followed by Friends, Game of Thrones, Breaking Bad, Rick and Morty, Black 328 00:15:38,702 --> 00:15:39,932 Mirror, and so forth. 329 00:15:39,932 --> 00:15:42,890 And the least popular show among you, although there looks like there's 330 00:15:42,890 --> 00:15:44,735 a lot of ties, Demon Slayer. 331 00:15:44,735 --> 00:15:47,890 [CHUCKLES] Or Game of Thrones, but GOT. 332 00:15:47,890 --> 00:15:49,880 So interestingly, there are some-- 333 00:15:49,880 --> 00:15:51,420 [LAUGHTER] 334 00:15:51,420 --> 00:15:52,320 There's-- OK. 335 00:15:52,320 --> 00:15:54,930 336 00:15:54,930 --> 00:15:55,520 OK. 337 00:15:55,520 --> 00:15:57,950 So here, we've gone and sorted these values. 338 00:15:57,950 --> 00:16:00,680 Now, let me go ahead and clean this up a little bit. 
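The named-function approach just described, end to end; the counts here are sample values rather than the real survey data.

```python
# Sample counts standing in for the dictionary built from the CSV
counts = {"The Office": 25, "Friends": 19, "Demon Slayer": 1}

def f(item):
    # item is one (title, count) pair; bracket 1 is the count
    return item[1]

# Pass the function's name (no parentheses) as key, and reverse the
# default small-to-large order so the most popular title comes first
ranked = sorted(counts.items(), key=f, reverse=True)
print(ranked)
```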
339 00:16:00,680 --> 00:16:05,510 Much like in C where it's kind of silly to define a variable 340 00:16:05,510 --> 00:16:08,270 and then use it only in one place, similarly in Python, 341 00:16:08,270 --> 00:16:11,660 if you're going to define a function, but then only use it in one place, 342 00:16:11,660 --> 00:16:12,920 you don't actually need to. 343 00:16:12,920 --> 00:16:17,248 There's a fancy one-line equivalent where you can actually say this. 344 00:16:17,248 --> 00:16:19,790 And you won't see this too often, but it's a handy thing just 345 00:16:19,790 --> 00:16:21,170 to have seen at least once. 346 00:16:21,170 --> 00:16:25,850 You can define a function called a lambda function that takes in as input 347 00:16:25,850 --> 00:16:30,240 any argument like this, and then you can literally just return that value. 348 00:16:30,240 --> 00:16:34,175 So if I go ahead and get rid of this function f and replace it with this, 349 00:16:34,175 --> 00:16:37,730 the syntax is super weird-looking at first glance, 350 00:16:37,730 --> 00:16:40,520 but lambda is the term of art in computer 351 00:16:40,520 --> 00:16:42,037 science for an anonymous function. 352 00:16:42,037 --> 00:16:43,370 I don't care that it's called f. 353 00:16:43,370 --> 00:16:44,720 I'm never going to use it again. 354 00:16:44,720 --> 00:16:47,360 So lambda just means give me a function. 355 00:16:47,360 --> 00:16:51,080 Item means that is the input to this function, the argument, and the return 356 00:16:51,080 --> 00:16:54,080 value, or output, is going to be item bracket 1. 357 00:16:54,080 --> 00:16:55,580 So it's a little arcane. 358 00:16:55,580 --> 00:16:58,935 Odds are you'll have to double check the documentation for using this yourself 359 00:16:58,935 --> 00:17:00,560 in the future, but it's the same thing. 
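As a sketch, the lambda replaces the named function in one line (same sample counts as before):

```python
counts = {"The Office": 25, "Friends": 19, "Demon Slayer": 1}

# lambda item: item[1] is an anonymous function equivalent to the named f
ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
print(ranked)
```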
360 00:17:00,560 --> 00:17:02,720 And indeed, if I rerun the program, I'm going 361 00:17:02,720 --> 00:17:08,030 to get back the exact same results, this time with the items still sorted 362 00:17:08,030 --> 00:17:09,589 like that. 363 00:17:09,589 --> 00:17:14,569 All right, any questions on the code here, the features of the syntax? 364 00:17:14,569 --> 00:17:16,160 Anything at all? 365 00:17:16,160 --> 00:17:17,261 Yeah. 366 00:17:17,261 --> 00:17:20,303 AUDIENCE: [INAUDIBLE] 367 00:17:20,303 --> 00:17:22,970 DAVID MALAN: Yes, if you're writing a lambda function like this, 368 00:17:22,970 --> 00:17:24,770 you are indeed limited to one line. 369 00:17:24,770 --> 00:17:27,187 That was OK, because f, even though it was two lines, only 370 00:17:27,187 --> 00:17:28,550 had one line of actual content. 371 00:17:28,550 --> 00:17:31,900 But yes, it has to be one line, indeed. 372 00:17:31,900 --> 00:17:32,400 All right. 373 00:17:32,400 --> 00:17:36,360 So the GOT thing opens up some interesting messiness, right? 374 00:17:36,360 --> 00:17:39,990 Because GOT presumably denoted Game of Thrones, and yet most of you 375 00:17:39,990 --> 00:17:42,235 wrote Game of Thrones, capitalized properly, 376 00:17:42,235 --> 00:17:44,610 and yet, it'd be nice if we could have some equivalences. 377 00:17:44,610 --> 00:17:48,300 And honestly, I'm guessing there's a bunch of capitalization differences, 378 00:17:48,300 --> 00:17:48,960 too. 379 00:17:48,960 --> 00:17:50,970 So you know what I might be inclined to do? 380 00:17:50,970 --> 00:17:54,390 What might be the easiest way for us to standardize 381 00:17:54,390 --> 00:17:58,710 what we're calling these shows' titles that would be a little more 382 00:17:58,710 --> 00:18:01,270 resilient against some of you typing with capitals, 383 00:18:01,270 --> 00:18:03,300 some of you typing with lower case? 384 00:18:03,300 --> 00:18:06,130 Let's ignore the GOT problem, though. 
385 00:18:06,130 --> 00:18:09,022 Yeah, so make them all lowercase, or heck, make them all upper case. 386 00:18:09,022 --> 00:18:10,980 It doesn't matter so long as you're consistent. 387 00:18:10,980 --> 00:18:11,688 So you know what? 388 00:18:11,688 --> 00:18:13,740 Before I get title from the row, let me go ahead 389 00:18:13,740 --> 00:18:18,450 and force it all to lower case, and then let me go ahead and-- 390 00:18:18,450 --> 00:18:24,160 in my-- go ahead and force it to lower case there. 391 00:18:24,160 --> 00:18:26,070 So no matter what the input is, I'm going 392 00:18:26,070 --> 00:18:29,790 to go ahead and force it on entry to lower case. 393 00:18:29,790 --> 00:18:33,490 Now, if I do this and scroll back up, looks like our counts went up a little, 394 00:18:33,490 --> 00:18:33,990 right? 395 00:18:33,990 --> 00:18:35,580 The Office is now at 27. 396 00:18:35,580 --> 00:18:37,080 Game of Thrones is now at 21. 397 00:18:37,080 --> 00:18:38,290 Friends is at 20. 398 00:18:38,290 --> 00:18:41,220 So it looks like some of you, kind of, reasonably if sloppily, 399 00:18:41,220 --> 00:18:44,272 didn't capitalize it exactly as the show officially is called. 400 00:18:44,272 --> 00:18:45,480 But that should be OK, right? 401 00:18:45,480 --> 00:18:47,700 This is all about UX or user experience. 402 00:18:47,700 --> 00:18:51,930 This should be one stepping stone toward making your program much, much more 403 00:18:51,930 --> 00:18:54,780 resilient against user input. 404 00:18:54,780 --> 00:18:55,530 But you know what? 405 00:18:55,530 --> 00:18:57,240 What if there were a better way to do this, right? 406 00:18:57,240 --> 00:18:59,490 Instead of writing code, this feels pretty tedious. 407 00:18:59,490 --> 00:19:01,948 Could we not just use a different tool, different language, 408 00:19:01,948 --> 00:19:03,160 and achieve these same goals? 409 00:19:03,160 --> 00:19:05,395 Today's answer is indeed going to be yes, with SQL. 
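The lowercasing-on-entry idea described above, as a sketch; the raw input list is hypothetical, chosen to show capitalization variants collapsing into one key.

```python
# Hypothetical raw input with inconsistent capitalization
counts = {}
for raw in ["The Office", "the office", "THE OFFICE", "Friends"]:
    title = raw.lower()  # force to lowercase on entry
    if title in counts:
        counts[title] += 1
    else:
        counts[title] = 1
print(counts)
```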
410 00:19:05,395 --> 00:19:06,270 But first a question. 411 00:19:06,270 --> 00:19:06,810 AUDIENCE: [INAUDIBLE] 412 00:19:06,810 --> 00:19:07,160 DAVID MALAN: No? 413 00:19:07,160 --> 00:19:07,920 No question. 414 00:19:07,920 --> 00:19:11,520 All right, so then the segue is exactly this. 415 00:19:11,520 --> 00:19:14,940 I'm going to go ahead now and close favorites.py. 416 00:19:14,940 --> 00:19:18,010 And I'm just going to maximize the size of my terminal window here. 417 00:19:18,010 --> 00:19:21,840 And I'm going to go ahead and run a program called SQLite3. 418 00:19:21,840 --> 00:19:25,590 SQLite3 is a command line program that you can install on your Mac or PC-- 419 00:19:25,590 --> 00:19:27,240 it's already on CS50 IDE-- 420 00:19:27,240 --> 00:19:31,470 that allows you to use another language called SQL interactively. 421 00:19:31,470 --> 00:19:34,170 And it's got this really cool feature whereby 422 00:19:34,170 --> 00:19:40,230 if I have this file called, very verbosely, CS50 2019 Lecture 7, 423 00:19:40,230 --> 00:19:43,020 and so forth, I'm going to copy that after highlighting it. 424 00:19:43,020 --> 00:19:47,730 I'm going to run SQLite3, and I'm going to put SQLite3 into .mode CSV. 425 00:19:47,730 --> 00:19:51,180 It's a weird syntax, .mode CSV, but you don't have to use these commands very 426 00:19:51,180 --> 00:19:51,840 often. 427 00:19:51,840 --> 00:19:54,840 Then I'm going to use one other command here, which is called import. 428 00:19:54,840 --> 00:19:58,450 And I'm going to go ahead and import this into a database. 429 00:19:58,450 --> 00:20:00,175 But what database, exactly? 430 00:20:00,175 --> 00:20:03,050 Well, let me go ahead and more specifically-- let me exit out of this 431 00:20:03,050 --> 00:20:03,550 here. 432 00:20:03,550 --> 00:20:04,170 Whoops. 433 00:20:04,170 --> 00:20:08,850 Let me exit out of this here and run SQLite3 with the name of a database.
434 00:20:08,850 --> 00:20:10,830 I'm dealing with favorite shows, so maybe I'll 435 00:20:10,830 --> 00:20:14,040 call this favorites.db, DB denoting database. 436 00:20:14,040 --> 00:20:17,730 I'm going to now go ahead and do .mode CSV. 437 00:20:17,730 --> 00:20:20,550 I'm going to now do .import, quoting the name of the file, 438 00:20:20,550 --> 00:20:23,550 and quotes are important, because it's got some ugly spaces in it, 439 00:20:23,550 --> 00:20:26,940 and I'm going to import this into a table, kind of like a spreadsheet, 440 00:20:26,940 --> 00:20:29,700 that I'm just going to call in all lowercase favorites. 441 00:20:29,700 --> 00:20:30,750 And hit Enter. 442 00:20:30,750 --> 00:20:32,410 Nothing seems to happen now. 443 00:20:32,410 --> 00:20:35,650 But if I go back to my terminal window and type ls, 444 00:20:35,650 --> 00:20:39,780 notice that in addition to my CSV file and favorites.py and src7, 445 00:20:39,780 --> 00:20:41,970 which I downloaded in advance with today's code, 446 00:20:41,970 --> 00:20:44,820 I now have this file called favorites.db. 447 00:20:44,820 --> 00:20:50,140 And if I rerun SQLite, it turns out when I pass in the name of that database, 448 00:20:50,140 --> 00:20:52,920 now I can ask the same kinds of questions as before, 449 00:20:52,920 --> 00:20:55,590 but I can do it with a different syntax. 450 00:20:55,590 --> 00:20:59,670 I can go ahead and select all the titles from my favorites 451 00:20:59,670 --> 00:21:04,350 by typing this command, SELECT title FROM favorites; Enter, and voila, 452 00:21:04,350 --> 00:21:06,010 we get all of the titles there. 453 00:21:06,010 --> 00:21:08,250 Suppose I want to sort those titles as before, 454 00:21:08,250 --> 00:21:14,230 or I can SELECT title FROM favorites, ORDER BY title; ending with-- 455 00:21:14,230 --> 00:21:14,730 oh, sorry. 456 00:21:14,730 --> 00:21:15,870 Semicolons are back. 457 00:21:15,870 --> 00:21:18,300 Then I'm going to go ahead and hit here. 
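For anyone following along outside the sqlite3 prompt, the same import-and-select workflow can be approximated with Python's built-in sqlite3 module; the handful of rows below are invented stand-ins for the survey's actual responses:

```python
# Sketch: load some title rows into a SQLite table and SELECT them back,
# mirroring the sqlite3 commands shown above. The sample titles are invented
# stand-ins for the survey's real responses.
import sqlite3

connection = sqlite3.connect(":memory:")  # a throwaway favorites.db

# Roughly what .import did automatically: create a table, insert each row.
connection.execute("CREATE TABLE favorites (title TEXT)")
connection.executemany(
    "INSERT INTO favorites (title) VALUES (?)",
    [("The Office",), ("Friends",), ("Game of Thrones",), ("Friends",)],
)

# SELECT title FROM favorites ORDER BY title;
rows = connection.execute("SELECT title FROM favorites ORDER BY title").fetchall()
for (title,) in rows:
    print(title)
```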
458 00:21:18,300 --> 00:21:20,730 And you'll see that now it's alphabetized, 459 00:21:20,730 --> 00:21:23,318 and there's all those Offices by first letter. 460 00:21:23,318 --> 00:21:24,360 Now, there are some bugs. 461 00:21:24,360 --> 00:21:26,500 If you scroll down, you'll see lowercase letters as well, 462 00:21:26,500 --> 00:21:28,620 so we'd have to deal with that situation, too. 463 00:21:28,620 --> 00:21:32,820 But this seems to be a user-friendly way of just selecting data 464 00:21:32,820 --> 00:21:33,570 that I care about. 465 00:21:33,570 --> 00:21:34,240 And watch this. 466 00:21:34,240 --> 00:21:36,010 Let me try to group things together. 467 00:21:36,010 --> 00:21:38,430 I'm going to go ahead and select title, and I'm 468 00:21:38,430 --> 00:21:41,880 going to go ahead and say count the number of titles 469 00:21:41,880 --> 00:21:45,000 from this table called favorites. 470 00:21:45,000 --> 00:21:48,660 But I want to group those things by title. 471 00:21:48,660 --> 00:21:51,820 Now, we'll see in a moment what this means, but the effect is what's cool. 472 00:21:51,820 --> 00:21:56,370 Voila, I now have output much like my Python program. 473 00:21:56,370 --> 00:22:00,028 It's not really as orderly as I would like, but notice 474 00:22:00,028 --> 00:22:01,320 what it's done at the very top. 475 00:22:01,320 --> 00:22:06,330 The very first line of output, if I keep scrolling is this here. 476 00:22:06,330 --> 00:22:08,670 On the left is my title, then a vertical bar. 477 00:22:08,670 --> 00:22:13,120 On the right is literally count of title, how many titles match that. 478 00:22:13,120 --> 00:22:14,800 So let me go ahead and do this. 479 00:22:14,800 --> 00:22:18,240 Let me go ahead and first of all limit this to the top 10 titles, 480 00:22:18,240 --> 00:22:20,910 so I can just deal with this and see all this at once. 481 00:22:20,910 --> 00:22:23,460 So here we have some alphabetized list of titles. 
482 00:22:23,460 --> 00:22:27,490 But now, suppose I want to go ahead and rename this thing. 483 00:22:27,490 --> 00:22:28,930 Let me go ahead and say that. 484 00:22:28,930 --> 00:22:30,510 Let me go ahead and say, as-- 485 00:22:30,510 --> 00:22:33,450 COUNT AS n. 486 00:22:33,450 --> 00:22:37,290 Now notice that the title of this output is title, and then n. 487 00:22:37,290 --> 00:22:41,650 But the reason I did that was so that I can do this. 488 00:22:41,650 --> 00:22:47,010 I'm going to now do ORDER BY n, descending order, 489 00:22:47,010 --> 00:22:49,370 and then limit to the top 10. 490 00:22:49,370 --> 00:22:51,120 Enter, and voila. 491 00:22:51,120 --> 00:22:53,460 Now, using a completely different language, 492 00:22:53,460 --> 00:22:56,460 I think I've gotten the same results, except for the lower casing issue, 493 00:22:56,460 --> 00:22:58,085 which we can come back to another time. 494 00:22:58,085 --> 00:23:00,210 But The Office is here as the number one hit. 495 00:23:00,210 --> 00:23:01,350 Friends then, 19. 496 00:23:01,350 --> 00:23:03,820 Game of Thrones, 18, and so forth. 497 00:23:03,820 --> 00:23:06,840 So what's then the motivation in part for this new language called SQL? 498 00:23:06,840 --> 00:23:08,550 And you've seen just snippets of it here. 499 00:23:08,550 --> 00:23:10,850 It allows us to solve the same kind of problems 500 00:23:10,850 --> 00:23:13,350 much more easily, because I don't have to sit down and write 501 00:23:13,350 --> 00:23:16,590 a whole darn Python program, faster though it is than 502 00:23:16,590 --> 00:23:20,933 C. I don't have to write custom code just to answer questions about my data.
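Put together, the full query built up above, grouping by title, counting, aliasing the count as n, ordering by n descending, and limiting to 10, can be tried out like so (again with placeholder rows rather than the real survey data):

```python
# The full query built up above, run via Python's sqlite3 module on a few
# placeholder rows (the lecture's real data came from the Google Form).
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE favorites (title TEXT)")
db.executemany(
    "INSERT INTO favorites (title) VALUES (?)",
    [("The Office",)] * 3 + [("Friends",)] * 2 + [("Veep",)],
)

# SELECT title, COUNT(title) AS n FROM favorites
# GROUP BY title ORDER BY n DESC LIMIT 10;
results = db.execute(
    "SELECT title, COUNT(title) AS n FROM favorites "
    "GROUP BY title ORDER BY n DESC LIMIT 10"
).fetchall()
print(results)  # [('The Office', 3), ('Friends', 2), ('Veep', 1)]
```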
503 00:23:20,933 --> 00:23:23,100 And in fact, if you pursue the web track in a couple 504 00:23:23,100 --> 00:23:25,710 of weeks' time or the mobile track, both of those 505 00:23:25,710 --> 00:23:29,130 are going to offer support for SQL so that if you want to store data 506 00:23:29,130 --> 00:23:32,880 on users or interactive information on an app or a website application, 507 00:23:32,880 --> 00:23:37,050 SQL is going to be ultimately where you can put that data and access it later, 508 00:23:37,050 --> 00:23:40,507 thereby writing even less code than today. 509 00:23:40,507 --> 00:23:43,590 So let's go ahead and take a look at some of the features of these things. 510 00:23:43,590 --> 00:23:45,507 What we've really just done is introduce this, 511 00:23:45,507 --> 00:23:47,580 something called a relational database. 512 00:23:47,580 --> 00:23:50,940 The goal at hand is not to rely on CSV files anymore. 513 00:23:50,940 --> 00:23:54,420 CSVs are like the weakest form of databases you can use. 514 00:23:54,420 --> 00:23:54,960 Why? 515 00:23:54,960 --> 00:23:59,162 Well, every time you want to look in a CSV file for data, you have to open it. 516 00:23:59,162 --> 00:24:00,120 You have to read it in. 517 00:24:00,120 --> 00:24:02,590 You have to iterate over it, line by line by line, 518 00:24:02,590 --> 00:24:05,160 looking for the data you care about, and then you're done. 519 00:24:05,160 --> 00:24:08,942 That's going to be super slow when you have a lot of data in a CSV file. 520 00:24:08,942 --> 00:24:10,650 And in fact, in a little bit today, we'll 521 00:24:10,650 --> 00:24:14,190 download a really big database in text form 522 00:24:14,190 --> 00:24:19,230 that we're going to then see 500-plus megabytes later is much better handled 523 00:24:19,230 --> 00:24:21,420 by what are called relational databases, pieces 524 00:24:21,420 --> 00:24:24,990 of software you can run on your Mac or PC or servers in a company. 
525 00:24:24,990 --> 00:24:28,950 All they do is store data and provide you with faster access to it. 526 00:24:28,950 --> 00:24:32,010 So SQLite3 is just a command line program 527 00:24:32,010 --> 00:24:34,012 via which we can interact with that data. 528 00:24:34,012 --> 00:24:35,220 It's not terribly compelling. 529 00:24:35,220 --> 00:24:37,200 It's, again, a black and white blinking prompt, 530 00:24:37,200 --> 00:24:40,050 but we'll see a graphical user interface, or GUI, in a bit, 531 00:24:40,050 --> 00:24:43,200 too, that'll allow us to navigate that same data. 532 00:24:43,200 --> 00:24:45,570 Here's how we went ahead and imported the data. 533 00:24:45,570 --> 00:24:47,550 And .schema is the last of the commands. 534 00:24:47,550 --> 00:24:49,870 In fact, let me go ahead and type this at the prompt. 535 00:24:49,870 --> 00:24:52,440 Not .mode or .import but .schema. 536 00:24:52,440 --> 00:24:55,860 And you'll see what automatically happened for me is something 537 00:24:55,860 --> 00:24:58,590 like this, CREATE TABLE favorites. 538 00:24:58,590 --> 00:25:02,010 This all happened automatically when I imported my data, but in a bit, 539 00:25:02,010 --> 00:25:04,740 we'll be doing this more manually ourselves. 540 00:25:04,740 --> 00:25:08,580 So where does this then bring us in terms of features? 541 00:25:08,580 --> 00:25:11,280 Well, in the world of storing data, whether it's 542 00:25:11,280 --> 00:25:13,650 storing users who've registered for your websites 543 00:25:13,650 --> 00:25:16,350 or the number of likes a post on Instagram has gotten 544 00:25:16,350 --> 00:25:19,950 or any form of data that you might want to read or write, 545 00:25:19,950 --> 00:25:22,410 there are four fundamental operations in the world. 546 00:25:22,410 --> 00:25:25,180 CRUD is the sort of crude way of remembering this. 547 00:25:25,180 --> 00:25:28,260 These stand for Create, Read, Update, Delete. 
548 00:25:28,260 --> 00:25:30,870 And I dare say that with these four operations 549 00:25:30,870 --> 00:25:34,030 you can do anything you want when it comes to data. 550 00:25:34,030 --> 00:25:38,010 Creating that data, reading that data, updating it, or deleting it. 551 00:25:38,010 --> 00:25:42,450 Now, in the world of SQL, or Structured Query Language, or S-Q-L, 552 00:25:42,450 --> 00:25:44,490 this is just another programming language. 553 00:25:44,490 --> 00:25:47,460 It tends to be used only in the context of databases, 554 00:25:47,460 --> 00:25:51,810 but you can use it to solve problems not unlike Python, but in the context 555 00:25:51,810 --> 00:25:54,090 of the data you care about. 556 00:25:54,090 --> 00:25:56,370 So with SQL, these four CRUD operations actually 557 00:25:56,370 --> 00:25:58,530 have different keywords mapped to them. 558 00:25:58,530 --> 00:26:01,650 They happen to be these, INSERT, SELECT, UPDATE, DELETE. 559 00:26:01,650 --> 00:26:06,360 That is to say, unlike Python, which has hundreds of functions available to you, 560 00:26:06,360 --> 00:26:11,940 today is kind of neat insofar as SQL has maybe a dozen or so total functions 561 00:26:11,940 --> 00:26:14,700 that you can use, a few dozen functions total that you can use. 562 00:26:14,700 --> 00:26:18,790 Much more manageable, and these are the four that we'll use most frequently. 563 00:26:18,790 --> 00:26:21,330 So beyond this, how do you go about creating a table? 564 00:26:21,330 --> 00:26:22,420 Well, what does that mean? 565 00:26:22,420 --> 00:26:24,253 Well, a relational database is literally that. 566 00:26:24,253 --> 00:26:27,660 It's like a fancier version of Google Spreadsheets or Microsoft Excel 567 00:26:27,660 --> 00:26:30,120 or Apple Numbers that allows you to store all of your data 568 00:26:30,120 --> 00:26:31,500 in rows and columns.
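One small, self-contained example of each of those four keywords, run against a scratch table whose name and columns are invented purely for illustration:

```python
# One example of each CRUD keyword -- INSERT, SELECT, UPDATE, DELETE --
# against a scratch table; the table and its values are invented purely
# for illustration.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE shows (title TEXT, genre TEXT)")

# Create: INSERT a row.
db.execute("INSERT INTO shows (title, genre) VALUES (?, ?)", ("Veep", "Comedy"))

# Read: SELECT it back.
row = db.execute("SELECT title, genre FROM shows").fetchone()

# Update: change one column where a condition matches.
db.execute("UPDATE shows SET genre = ? WHERE title = ?", ("Satire", "Veep"))

# Delete: remove matching rows.
db.execute("DELETE FROM shows WHERE title = ?", ("Veep",))
remaining = db.execute("SELECT COUNT(*) FROM shows").fetchone()[0]
```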
569 00:26:31,500 --> 00:26:35,790 But it gives you these four functions and more 570 00:26:35,790 --> 00:26:40,140 via which to select data you care about, or delete or update or insert data 571 00:26:40,140 --> 00:26:41,820 that you care about as well. 572 00:26:41,820 --> 00:26:45,390 So we'll see in a bit that this is going to be the syntax via which in SQL you 573 00:26:45,390 --> 00:26:46,820 can create a new table. 574 00:26:46,820 --> 00:26:49,500 In Google Spreadsheets you would literally go to the plus icon 575 00:26:49,500 --> 00:26:50,640 and say New Sheet. 576 00:26:50,640 --> 00:26:54,000 In SQL, you would express this more programmatically with a line of code 577 00:26:54,000 --> 00:26:54,795 like this. 578 00:26:54,795 --> 00:26:57,420 But you're going to have to make a few decisions along the way, 579 00:26:57,420 --> 00:27:00,280 and that's because SQL has its own data types. 580 00:27:00,280 --> 00:27:03,300 Now, fortunately, they're pretty reminiscent of C and Python. 581 00:27:03,300 --> 00:27:06,302 But there's five main ones here. 582 00:27:06,302 --> 00:27:08,260 Let's just walk through a few examples thereof. 583 00:27:08,260 --> 00:27:09,390 So first, INTEGER. 584 00:27:09,390 --> 00:27:12,930 If you know in advance that you want to store a whole bunch of information 585 00:27:12,930 --> 00:27:19,440 in a database like someone's age or the number of likes a post has 586 00:27:19,440 --> 00:27:21,660 or anything that lends itself to an integer, 587 00:27:21,660 --> 00:27:24,540 SQL supports an integer data type. 588 00:27:24,540 --> 00:27:28,120 But in some versions of SQL, you actually have a decision to make. 589 00:27:28,120 --> 00:27:31,470 Do you want an integer, a smallint, or a bigint? 590 00:27:31,470 --> 00:27:34,710 And these speak to, as in C, the size, the number 591 00:27:34,710 --> 00:27:36,180 of bits that are used to store it. 592 00:27:36,180 --> 00:27:37,530 An integer is 32 bits.
593 00:27:37,530 --> 00:27:39,090 A bigint is 64 bits. 594 00:27:39,090 --> 00:27:41,690 A smallint is fewer than both those. 595 00:27:41,690 --> 00:27:44,720 Now, we're using, for class, something called SQLite. 596 00:27:44,720 --> 00:27:47,363 It's a free and open source, very user friendly version of SQL. 597 00:27:47,363 --> 00:27:49,280 But there's others you might have heard about. 598 00:27:49,280 --> 00:27:51,230 So if you've ever heard of the company Oracle, 599 00:27:51,230 --> 00:27:53,540 their bread and butter is a database program 600 00:27:53,540 --> 00:27:56,210 called Oracle that is a SQL database that 601 00:27:56,210 --> 00:27:58,730 allows companies to store huge amounts of data 602 00:27:58,730 --> 00:28:01,490 and select and create and update and delete data 603 00:28:01,490 --> 00:28:03,180 using software they've written. 604 00:28:03,180 --> 00:28:07,940 MySQL, Postgres, MariaDB, Microsoft Access, 605 00:28:07,940 --> 00:28:12,050 SQL Server, and bunches of others, are all examples of SQL servers. 606 00:28:12,050 --> 00:28:15,162 And they'll each support data types, typically, like this. 607 00:28:15,162 --> 00:28:17,120 We happen to be showing you ones from Postgres, 608 00:28:17,120 --> 00:28:20,390 which is a super popular open source one that you can use on the internet, 609 00:28:20,390 --> 00:28:22,010 for instance for final projects. 610 00:28:22,010 --> 00:28:25,657 Real numbers, just like in Python or C, are similar in spirit to floats. 611 00:28:25,657 --> 00:28:27,240 However, you have a couple of options. 612 00:28:27,240 --> 00:28:29,330 One is a real, which uses 32 bits. 613 00:28:29,330 --> 00:28:31,040 Another is called double precision. 614 00:28:31,040 --> 00:28:34,400 It's actually two words, but it gives you 64 bits as well, 615 00:28:34,400 --> 00:28:36,770 to give you more precision numerically.
616 00:28:36,770 --> 00:28:39,950 Then you've got this catch-all known as numeric in SQL, 617 00:28:39,950 --> 00:28:43,490 whereby if you have a number that's number-like, essentially, 618 00:28:43,490 --> 00:28:46,460 but isn't just an integer-- 619 00:28:46,460 --> 00:28:50,270 this category encompasses things like Boolean values, dates, 620 00:28:50,270 --> 00:28:53,240 which have a predefined format like YYYY, 621 00:28:53,240 --> 00:28:56,390 four digits that typically represent a year. 622 00:28:56,390 --> 00:28:59,660 That would be considered numeric here, as would be a date and a date time, 623 00:28:59,660 --> 00:29:04,950 which has something like 00:00:00 for hours, minutes, seconds. 624 00:29:04,950 --> 00:29:06,780 And then you have other values as well. 625 00:29:06,780 --> 00:29:09,590 What's nice about this numeric type specifically 626 00:29:09,590 --> 00:29:14,570 is you can actually solve the floating point problem in both Python 627 00:29:14,570 --> 00:29:18,410 and C. Recall that floats or even doubles in C 628 00:29:18,410 --> 00:29:22,370 ultimately have finite precision, which is bad in terms of values 629 00:29:22,370 --> 00:29:24,140 not quite adding up as you intend. 630 00:29:24,140 --> 00:29:26,300 But with numeric data types in databases, 631 00:29:26,300 --> 00:29:30,380 you can specify exactly how many digits you want before the decimal point 632 00:29:30,380 --> 00:29:31,830 and after, maximally. 633 00:29:31,830 --> 00:29:35,060 So if you're dealing with financial information or scientific information, 634 00:29:35,060 --> 00:29:39,020 you can be super, super precise thanks to this data type in the database. 635 00:29:39,020 --> 00:29:43,160 So finally, a problem that we've solved, versus those other languages. 636 00:29:43,160 --> 00:29:43,910 Text.
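The precision point above can be demonstrated in Python, using the standard decimal module as a stand-in for what a SQL numeric column gives you; note this is an analogy, since SQLite itself doesn't enforce numeric precision the way, say, Postgres does:

```python
# Why a numeric type matters: binary floats can't represent 0.1 exactly,
# so sums drift; a fixed-precision decimal type (the spirit of SQL's
# numeric) stays exact. Python's decimal module stands in for the
# database column here.
from decimal import Decimal

as_floats = 0.1 + 0.2                          # binary floating point
as_decimals = Decimal("0.1") + Decimal("0.2")  # exact decimal arithmetic

print(as_floats)    # 0.30000000000000004
print(as_decimals)  # 0.3
```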
637 00:29:43,910 --> 00:29:46,050 When it comes to storing data in a database, 638 00:29:46,050 --> 00:29:48,470 you can specify that your columns can either 639 00:29:48,470 --> 00:29:52,220 be char(n), as in individual characters or chars. 640 00:29:52,220 --> 00:29:55,730 You literally specify n, the number of characters 641 00:29:55,730 --> 00:29:58,370 that every cell in that column will be. 642 00:29:58,370 --> 00:29:59,823 You never do this in Excel. 643 00:29:59,823 --> 00:30:02,240 You never do this in Google Spreadsheets or Apple Numbers. 644 00:30:02,240 --> 00:30:06,040 In a database, though, you can tell the database ever more precisely, 645 00:30:06,040 --> 00:30:10,790 I want to store two characters in every cell in some column. 646 00:30:10,790 --> 00:30:11,660 Why? 647 00:30:11,660 --> 00:30:13,310 Well, maybe like US state codes. 648 00:30:13,310 --> 00:30:16,640 If you have MA for Massachusetts or CA for California, 649 00:30:16,640 --> 00:30:20,210 it might be nice to tell the database that you can store, minimally 650 00:30:20,210 --> 00:30:23,420 and maximally, two characters for every value I'm going to give you. 651 00:30:23,420 --> 00:30:24,723 Varchar is a little different. 652 00:30:24,723 --> 00:30:26,390 This is a variable number of characters. 653 00:30:26,390 --> 00:30:30,290 If your column might have a few characters or a lot of characters based 654 00:30:30,290 --> 00:30:33,470 on what some human types in, you can give an upper bound 655 00:30:33,470 --> 00:30:36,710 n on how many characters will be stored in every cell.
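A sketch of such a table definition, using char(2) for state codes and an assumed upper bound of 64 for email addresses; it's run through SQLite here for convenience, which accepts these type names but, unlike Postgres, doesn't actually enforce the lengths:

```python
# A CREATE TABLE sketch using char(n) and varchar(n) as described above.
# It's run through SQLite for convenience; SQLite accepts these type names
# but, unlike Postgres, does not enforce the length limits.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    """
    CREATE TABLE users (
        state CHAR(2),      -- exactly two characters, e.g. 'MA' or 'CA'
        email VARCHAR(64)   -- at most 64 characters, an assumed upper bound
    )
    """
)
db.execute(
    "INSERT INTO users (state, email) VALUES (?, ?)",
    ("MA", "malan@harvard.edu"),
)
row = db.execute("SELECT state, email FROM users").fetchone()
```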
656 00:30:36,710 --> 00:30:40,490 So if you don't know in advance how long a user's email address is going to be 657 00:30:40,490 --> 00:30:43,100 or how long my name or your name is going to be in advance, 658 00:30:43,100 --> 00:30:47,370 you can specify varchar and then pick some upper bound, some value that, ugh, 659 00:30:47,370 --> 00:30:50,600 no one's going to have more than 20 characters in their name or 200 660 00:30:50,600 --> 00:30:55,160 characters in their name, whatever line you decide to draw in the sand. 661 00:30:55,160 --> 00:30:57,710 As an aside, does anyone want to conjecture 662 00:30:57,710 --> 00:31:00,940 what a good upper bound is for the number of characters 663 00:31:00,940 --> 00:31:04,085 in a human name that might register for a website you're making? 664 00:31:04,085 --> 00:31:05,480 AUDIENCE: [INAUDIBLE] 665 00:31:05,480 --> 00:31:06,390 DAVID MALAN: 25? 666 00:31:06,390 --> 00:31:06,890 Pretty good. 667 00:31:06,890 --> 00:31:07,990 AUDIENCE: [INAUDIBLE] 668 00:31:07,990 --> 00:31:09,032 DAVID MALAN: 30, I heard. 669 00:31:09,032 --> 00:31:09,980 AUDIENCE: [INAUDIBLE] 670 00:31:09,980 --> 00:31:11,615 DAVID MALAN: 32, OK. 671 00:31:11,615 --> 00:31:12,490 AUDIENCE: [INAUDIBLE] 672 00:31:12,490 --> 00:31:14,330 DAVID MALAN: 45? 673 00:31:14,330 --> 00:31:15,740 Higher. 674 00:31:15,740 --> 00:31:16,720 60? 675 00:31:16,720 --> 00:31:17,420 AUDIENCE: 64. 676 00:31:17,420 --> 00:31:18,170 DAVID MALAN: 64? 677 00:31:18,170 --> 00:31:20,790 This is the kind of thing that Google, I dare say, is good at. 678 00:31:20,790 --> 00:31:23,570 So let me go ahead and open up Google real fast. 679 00:31:23,570 --> 00:31:26,570 Suppose the goal at hand is to store, in a database 680 00:31:26,570 --> 00:31:32,060 table, that is, the equivalent of a spreadsheet's column, the longest 681 00:31:32,060 --> 00:31:34,811 name in the world. 682 00:31:34,811 --> 00:31:35,950 All right.
683 00:31:35,950 --> 00:31:43,180 Looks like we had 988 characters in a fellow's name, Hubert. 684 00:31:43,180 --> 00:31:46,430 And I won't even try to pronounce his last name, but there it is. 685 00:31:46,430 --> 00:31:48,470 So if you want to fit-- 686 00:31:48,470 --> 00:31:50,795 if you want to be able to fit everyone's name-- 687 00:31:50,795 --> 00:31:52,900 oh, and note, senior. 688 00:31:52,900 --> 00:31:54,760 [LAUGHTER] 689 00:31:54,760 --> 00:31:56,870 You might need even more characters than that. 690 00:31:56,870 --> 00:31:59,950 So this is a rabbit hole we won't really go down today, 691 00:31:59,950 --> 00:32:02,530 but suffice it to say that the answers to these questions 692 00:32:02,530 --> 00:32:03,550 are often not obvious. 693 00:32:03,550 --> 00:32:06,340 And if you think about websites you've visited in the real world, 694 00:32:06,340 --> 00:32:10,930 whether on campus or off, odds are, you have tried typing in some input 695 00:32:10,930 --> 00:32:14,660 to some web-based form or even some mobile application, 696 00:32:14,660 --> 00:32:17,200 where all of a sudden your keyboard stops working, right? 697 00:32:17,200 --> 00:32:19,000 They won't tolerate any more characters. 698 00:32:19,000 --> 00:32:21,490 Maybe it's the Common Application and the length of your essay. 699 00:32:21,490 --> 00:32:23,680 Maybe it's the length that your resume can be in a field. 700 00:32:23,680 --> 00:32:26,440 Maybe it's the length of your name when registering for a website. 701 00:32:26,440 --> 00:32:27,670 Well, why is that? 702 00:32:27,670 --> 00:32:32,860 Those applications are probably using a database, probably using SQL, 703 00:32:32,860 --> 00:32:37,210 and they had to decide, in advance, what is the maximum length of an input 704 00:32:37,210 --> 00:32:39,190 that we're going to tolerate from a human? 705 00:32:39,190 --> 00:32:42,220 And then the computer ultimately enforces that. 706 00:32:42,220 --> 00:32:42,820 Why?
707 00:32:42,820 --> 00:32:46,900 Well, that would be a very clever attack, for some bad actor out there 708 00:32:46,900 --> 00:32:48,310 to exploit. 709 00:32:48,310 --> 00:32:50,998 If you had no bounds on how big the input could 710 00:32:50,998 --> 00:32:53,290 be that a human could type in, they could-- much like I 711 00:32:53,290 --> 00:32:55,123 tried the other day with Emma's name-- 712 00:32:55,123 --> 00:32:57,760 just keep pasting, pasting, pasting a massive input, 713 00:32:57,760 --> 00:33:01,750 hoping to overwhelm your computer's memory, maybe make your server crash, 714 00:33:01,750 --> 00:33:04,370 and therefore deny service to other people. 715 00:33:04,370 --> 00:33:06,670 So there are reasons for these kinds of defenses. 716 00:33:06,670 --> 00:33:08,140 Lastly is BLOB. 717 00:33:08,140 --> 00:33:10,153 At the very top, Binary Large Object. 718 00:33:10,153 --> 00:33:12,070 Don't need it typically that often, but if you 719 00:33:12,070 --> 00:33:16,270 want to store binary data, raw zeros and ones that represent files or the like, 720 00:33:16,270 --> 00:33:18,400 you can store it in a BLOB format as well. 721 00:33:18,400 --> 00:33:20,650 Now, just to give you a whirlwind tour of other syntax 722 00:33:20,650 --> 00:33:23,020 before we start using it, you can insert data 723 00:33:23,020 --> 00:33:25,570 into a database using syntax quite like this. 724 00:33:25,570 --> 00:33:29,590 You can select data from a database using syntax like I already did. 725 00:33:29,590 --> 00:33:32,180 You can use functions as well. 726 00:33:32,180 --> 00:33:35,230 In fact, in just a bit, when we load more data into the database, 727 00:33:35,230 --> 00:33:38,800 I could actually select all of the distinct names from a database-- 728 00:33:38,800 --> 00:33:40,870 and in fact, I could have done this a moment ago.
729 00:33:40,870 --> 00:33:45,430 Rather than SELECT title FROM favorites, as I 730 00:33:45,430 --> 00:33:49,360 did earlier-- that's going to give me all of the titles in that database. 731 00:33:49,360 --> 00:33:54,310 I could instead do something like SELECT DISTINCT title FROM favorites, 732 00:33:54,310 --> 00:33:56,680 and that's going to filter out all of the duplicates 733 00:33:56,680 --> 00:34:00,268 without me having to write a single line of code to do that. 734 00:34:00,268 --> 00:34:01,810 You can also count, as we did before. 735 00:34:01,810 --> 00:34:03,790 You can compute averages, mins, or maxes, which 736 00:34:03,790 --> 00:34:06,220 is really useful for analytical work that you might want 737 00:34:06,220 --> 00:34:07,909 to do as a data scientist or the like. 738 00:34:07,909 --> 00:34:10,659 And there's a whole bunch of other functions available to us as well. 739 00:34:10,659 --> 00:34:12,010 But there's some more powerful features. 740 00:34:12,010 --> 00:34:14,230 And we'll begin to use these in just a little bit. 741 00:34:14,230 --> 00:34:15,880 We used WHERE before. 742 00:34:15,880 --> 00:34:17,199 I used LIMIT before. 743 00:34:17,199 --> 00:34:18,460 I used GROUP BY before. 744 00:34:18,460 --> 00:34:21,280 And we'll see a bunch of examples that reinforce all of these. 745 00:34:21,280 --> 00:34:23,920 But there's keywords like LIKE, whereby if you 746 00:34:23,920 --> 00:34:28,540 want to search for everyone's input who mentions The Office, you know what? 747 00:34:28,540 --> 00:34:31,989 I bet we could do something like this instead. 748 00:34:31,989 --> 00:34:36,280 I know that I can SELECT star for-- oh, let's go ahead 749 00:34:36,280 --> 00:34:41,920 and SELECT title FROM favorites WHERE, and this is kind of nonsensical, 750 00:34:41,920 --> 00:34:45,820 but title equals "The Office," quote unquote. 751 00:34:45,820 --> 00:34:48,969 And I get back everyone who typed in literally The Office.
752 00:34:48,969 --> 00:34:55,090 However, if I instead did something like this, "the office" in lower case, 753 00:34:55,090 --> 00:34:58,780 I'm going to get one of you who typed in "the office" in lowercase. 754 00:34:58,780 --> 00:35:01,240 And the thing here is just the title of this output. 755 00:35:01,240 --> 00:35:03,940 But what if I want to do anything like The Office? 756 00:35:03,940 --> 00:35:10,780 I could say something like LIKE "%office%", and it turns out in SQL, 757 00:35:10,780 --> 00:35:12,880 this new language, percent is a placeholder. 758 00:35:12,880 --> 00:35:16,120 It's a wild card that just says, eh, anything can go before, 759 00:35:16,120 --> 00:35:18,190 and, eh, anything can go after. 760 00:35:18,190 --> 00:35:21,340 So now I'm searching for any titles that have the word The Office. 761 00:35:21,340 --> 00:35:22,810 We picked up another entry here. 762 00:35:22,810 --> 00:35:26,153 So one of you said Office, without the word "the," also pretty reasonable. 763 00:35:26,153 --> 00:35:28,570 Some of the uppercase and lowercase is all over the place. 764 00:35:28,570 --> 00:35:30,010 Now we've caught those. 765 00:35:30,010 --> 00:35:33,100 And if you want to count the total number of offices now, 766 00:35:33,100 --> 00:35:34,810 now I can do something like this. 767 00:35:34,810 --> 00:35:38,020 COUNT title-- so I can combine all of these building 768 00:35:38,020 --> 00:35:40,720 blocks, not unlike Scratch, and get back the total number 769 00:35:40,720 --> 00:35:45,610 of offices, which looks now to be 33 when we tolerate a lot more 770 00:35:45,610 --> 00:35:48,880 variability in our users' input. 771 00:35:48,880 --> 00:35:50,680 Most powerfully of all, we'll see something 772 00:35:50,680 --> 00:35:53,020 like JOIN in just a little bit, and that'll 773 00:35:53,020 --> 00:35:56,600 be all toward the end of using databases better with better design. 
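That wildcard search can be reproduced as follows; note that SQLite's LIKE happens to be case-insensitive for ASCII characters by default, which is why the lowercase variants get caught too:

```python
# The wildcard search above: % matches any run of characters, and SQLite's
# LIKE is case-insensitive for ASCII by default, so lowercase variants of
# "Office" are matched too. Sample rows are invented.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE favorites (title TEXT)")
db.executemany(
    "INSERT INTO favorites (title) VALUES (?)",
    [("The Office",), ("the office",), ("Office",), ("Friends",)],
)

# SELECT COUNT(title) FROM favorites WHERE title LIKE "%office%";
n = db.execute(
    "SELECT COUNT(title) FROM favorites WHERE title LIKE '%office%'"
).fetchone()[0]
print(n)  # 3
```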
774 00:35:56,600 --> 00:36:00,010 But with SELECTs, we, of course, don't just want to select all of the titles. 775 00:36:00,010 --> 00:36:02,360 We might select where some condition is true. 776 00:36:02,360 --> 00:36:04,430 So conditions, just like in Python and C, 777 00:36:04,430 --> 00:36:08,323 exist in SQL, but at the end of our queries, as we've seen. 778 00:36:08,323 --> 00:36:10,240 If you want to update something, for instance, 779 00:36:10,240 --> 00:36:12,550 you can update data in a database by saying 780 00:36:12,550 --> 00:36:16,300 UPDATE table name SET the column name equal to some value, 781 00:36:16,300 --> 00:36:18,200 WHERE condition is true. 782 00:36:18,200 --> 00:36:19,820 So how might I use this? 783 00:36:19,820 --> 00:36:23,640 Well, let's again look at the data where it's like The Office. 784 00:36:23,640 --> 00:36:24,390 And you know what? 785 00:36:24,390 --> 00:36:25,807 I'm going to go ahead and do this. 786 00:36:25,807 --> 00:36:28,307 If I want to clean all of this data up-- so data cleaning is 787 00:36:28,307 --> 00:36:30,140 a very common approach, whether you're doing 788 00:36:30,140 --> 00:36:32,890 research or analytical work, often when you have messy data, 789 00:36:32,890 --> 00:36:34,770 the first pass is honestly grunt work. 790 00:36:34,770 --> 00:36:37,350 Like, writing some code, or god forbid, manually copying 791 00:36:37,350 --> 00:36:40,330 and pasting in an Excel file, just to clean up your data. 792 00:36:40,330 --> 00:36:42,630 You can do this a lot more powerfully in SQL. 793 00:36:42,630 --> 00:36:48,450 So for instance, let me go ahead and say UPDATE my favorites SET 794 00:36:48,450 --> 00:36:52,500 title equal to "The Office," the canonical form, capital 795 00:36:52,500 --> 00:36:56,400 T, capital O, WHERE title LIKE-- 796 00:36:56,400 --> 00:36:59,850 and now let me do this wildcard search, knowing 797 00:36:59,850 --> 00:37:02,940 that that's going to slurp in all 30-plus of those rows. 
798 00:37:02,940 --> 00:37:06,330 But now, when I hit Enter, nothing seems to happen. 799 00:37:06,330 --> 00:37:10,050 But if I then SELECT title again WHERE LIKE "%office%", 800 00:37:10,050 --> 00:37:13,590 now they all look the same, because I've updated my data. 801 00:37:13,590 --> 00:37:17,010 And so in fact, now I could go back to The Office, 802 00:37:17,010 --> 00:37:21,520 because I've canonicalized my data, and everything now looks the same. 803 00:37:21,520 --> 00:37:23,760 So with SQL, can you do that as well? 804 00:37:23,760 --> 00:37:28,620 Previously, let's go ahead and SELECT a title FROM favorites. 805 00:37:28,620 --> 00:37:32,880 And let's just limit this to 10 of them this time, so they don't all fly by. 806 00:37:32,880 --> 00:37:34,980 Suppose-- VEEP is a good show. 807 00:37:34,980 --> 00:37:37,380 Maybe I'm not a fan of, say-- 808 00:37:37,380 --> 00:37:40,770 let's give ourselves 20 here, see what comes up next. 809 00:37:40,770 --> 00:37:44,370 OK, I actually don't like Friends, even though everyone in the world seems to. 810 00:37:44,370 --> 00:37:45,420 No problem. 811 00:37:45,420 --> 00:37:46,665 DELETE FROM favorites-- 812 00:37:46,665 --> 00:37:47,700 [LAUGHTER] 813 00:37:47,700 --> 00:37:50,790 --WHERE title = "Friends";. 814 00:37:50,790 --> 00:37:53,070 Now, no more Friends. 815 00:37:53,070 --> 00:37:54,885 And so you can delete your data as well. 816 00:37:54,885 --> 00:37:56,010 [LAUGHTER] 817 00:37:56,010 --> 00:38:00,390 You can delete your data as well by using a query of this form. 818 00:38:00,390 --> 00:38:04,020 This one is perhaps the most dangerous of all. 819 00:38:04,020 --> 00:38:07,860 Consider what might happen if you're a little sloppy, a little hasty, 820 00:38:07,860 --> 00:38:09,720 and omit the WHERE condition. 821 00:38:09,720 --> 00:38:14,070 What happens if you go ahead and DELETE FROM favorites;? 822 00:38:14,070 --> 00:38:15,143 Everything goes away.
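The difference between a targeted DELETE and the dangerous bare DELETE can be sketched like this (sqlite3 in memory; the sample rows are invented for illustration).

```python
import sqlite3

# A small table of sample favorites.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE favorites (title TEXT)")
db.executemany(
    "INSERT INTO favorites (title) VALUES (?)",
    [("Friends",), ("Veep",), ("The Crown",)],
)

# With a WHERE clause, only the matching rows go away.
db.execute("DELETE FROM favorites WHERE title = 'Friends'")
remaining = [r[0] for r in db.execute("SELECT title FROM favorites")]
print(remaining)  # Friends is gone; the other shows survive

# Omit the WHERE clause and *every* row is deleted -- hence the danger.
db.execute("DELETE FROM favorites")
count = db.execute("SELECT COUNT(*) FROM favorites").fetchone()[0]
print(count)
```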
823 00:38:15,143 --> 00:38:18,060 And slightly worse than that, though we won't have occasion to use it, 824 00:38:18,060 --> 00:38:20,352 there's another statement called DROP, where if you just 825 00:38:20,352 --> 00:38:23,970 want to delete the table altogether, delete all of your rows and columns, 826 00:38:23,970 --> 00:38:28,630 you can say DROP TABLE favorites;, and that will delete all of the data 827 00:38:28,630 --> 00:38:29,130 as well. 828 00:38:29,130 --> 00:38:32,700 And we'll come back to that, because the mere fact that you can scarily 829 00:38:32,700 --> 00:38:35,370 delete lots of data at once makes you super 830 00:38:35,370 --> 00:38:38,580 vulnerable to bad actors on the internet or in the office who 831 00:38:38,580 --> 00:38:41,770 want to potentially wreak havoc on your data. 832 00:38:41,770 --> 00:38:46,210 So we'll come back today, talking about matters of security, as well. 833 00:38:46,210 --> 00:38:49,920 So any questions before, now, we take things up a notch 834 00:38:49,920 --> 00:38:55,270 and look at a data set much more massive than our hundreds of favorites here? 835 00:38:55,270 --> 00:38:56,430 Is that a question there? 836 00:38:56,430 --> 00:38:58,586 Yeah. 837 00:38:58,586 --> 00:39:01,965 AUDIENCE: [INAUDIBLE] 838 00:39:01,965 --> 00:39:04,090 DAVID MALAN: Did not look like Friends got removed. 839 00:39:04,090 --> 00:39:05,050 Oh, interesting. 840 00:39:05,050 --> 00:39:06,730 You are right. 841 00:39:06,730 --> 00:39:09,700 Let's try a little harder. 842 00:39:09,700 --> 00:39:12,010 LIKE "%friends%". 843 00:39:12,010 --> 00:39:15,200 844 00:39:15,200 --> 00:39:16,370 Better? 845 00:39:16,370 --> 00:39:17,190 Now they're gone.
846 00:39:17,190 --> 00:39:19,607 So I'm guessing-- I would have to look closer at the data, 847 00:39:19,607 --> 00:39:21,442 but I'm guessing one or more people typed it 848 00:39:21,442 --> 00:39:23,150 with a slightly different capitalization, 849 00:39:23,150 --> 00:39:25,760 or maybe they hit a spacebar character or something, where 850 00:39:25,760 --> 00:39:29,040 I wasn't doing a precise match, thereby making 851 00:39:29,040 --> 00:39:32,730 the LIKE operator a more robust fix. 852 00:39:32,730 --> 00:39:35,280 However, let's consider for just a moment. 853 00:39:35,280 --> 00:39:36,860 What might the gotcha here be? 854 00:39:36,860 --> 00:39:39,170 Like, this feels like a nice quick and dirty fix, 855 00:39:39,170 --> 00:39:42,118 but it really is quick and dirty in what sense? 856 00:39:42,118 --> 00:39:43,910 AUDIENCE: Like, if you have another TV show 857 00:39:43,910 --> 00:39:47,543 that has the word friend in it, but not Friends, [INAUDIBLE].. 858 00:39:47,543 --> 00:39:49,460 DAVID MALAN: Yeah, there might be another show 859 00:39:49,460 --> 00:39:50,480 with the word "friends" in it. 860 00:39:50,480 --> 00:39:51,320 Quite reasonable. 861 00:39:51,320 --> 00:39:54,380 And I just blew those away as well from my database. 862 00:39:54,380 --> 00:39:56,480 And in fact, we can probably see this for real. 863 00:39:56,480 --> 00:39:59,540 If I go ahead and look at, for instance-- 864 00:39:59,540 --> 00:40:05,360 let me go into a browser here, IMDB.com, super popular website for just getting 865 00:40:05,360 --> 00:40:07,550 movie reviews or information about movies. 866 00:40:07,550 --> 00:40:09,330 Let me go ahead and search for friends. 867 00:40:09,330 --> 00:40:12,830 And yes, the most popular hit seems to be the 1994 version of Friends, 868 00:40:12,830 --> 00:40:14,740 but there's Friends with Benefits. 
869 00:40:14,740 --> 00:40:17,573 There's Just Friends, We Are Your Friends, My Best Friend's Wedding, 870 00:40:17,573 --> 00:40:18,990 Fisherman's Friends, and so forth. 871 00:40:18,990 --> 00:40:21,020 Some of those movies, some of those TV shows. 872 00:40:21,020 --> 00:40:24,950 And had we poked a little further among our own data of hundreds of rows, maybe 873 00:40:24,950 --> 00:40:27,980 some of you did have some of those shows as your favorites as well. 874 00:40:27,980 --> 00:40:30,290 So beware that, too. 875 00:40:30,290 --> 00:40:32,990 So in preparing for today, we wanted to play 876 00:40:32,990 --> 00:40:36,350 with an even juicier data set than one we could generate for ourselves. 877 00:40:36,350 --> 00:40:39,920 And wonderfully, IMDB makes a lot of their database 878 00:40:39,920 --> 00:40:44,900 available for download as TSV files, Tab-Separated Values. 879 00:40:44,900 --> 00:40:48,210 It's really not fundamentally different from Comma Separated Values. 880 00:40:48,210 --> 00:40:51,200 One human at IMDB just decided to use tabs to separate 881 00:40:51,200 --> 00:40:52,763 their values instead of commas. 882 00:40:52,763 --> 00:40:54,680 And if you read through this page here, you'll 883 00:40:54,680 --> 00:40:58,520 see that they let you download a bunch of different files, one of which 884 00:40:58,520 --> 00:41:02,360 we'll focus on, initially, specifically on title.basics.tsv.gz. 885 00:41:02,360 --> 00:41:05,105 886 00:41:05,105 --> 00:41:06,980 So that's a mouthful, but this is just saying 887 00:41:06,980 --> 00:41:09,920 this is basic information about titles in their database. 888 00:41:09,920 --> 00:41:12,785 Gz means it's G-zipped, which is like Zip. 889 00:41:12,785 --> 00:41:14,660 If you're familiar, on a Mac or PC, it's just 890 00:41:14,660 --> 00:41:19,850 another way of compressing information, but you can unzip that on a computer 891 00:41:19,850 --> 00:41:20,390 as well.
892 00:41:20,390 --> 00:41:22,790 Dot TSV means Tab-Separated Values. 893 00:41:22,790 --> 00:41:24,860 Now, we read in advance the information here, 894 00:41:24,860 --> 00:41:26,490 but let me summarize it as follows. 895 00:41:26,490 --> 00:41:28,972 There's some juicy information in that file. 896 00:41:28,972 --> 00:41:31,430 And I'm going to go ahead and download this file over here. 897 00:41:31,430 --> 00:41:33,710 If I click a link on IMDB's website, I'll 898 00:41:33,710 --> 00:41:37,607 see a file called title.basics.tsv.gz. 899 00:41:37,607 --> 00:41:40,190 I'm going to go ahead and let that download in the background. 900 00:41:40,190 --> 00:41:44,600 It looks like it is 100-plus megabytes. 901 00:41:44,600 --> 00:41:47,570 If I click on it on my Mac, it's going to decompress it. 902 00:41:47,570 --> 00:41:53,210 And you'll see here, this apparently is 532 megabytes once it's decompressed. 903 00:41:53,210 --> 00:41:56,840 So that's way bigger than the CSV file I downloaded from Google 904 00:41:56,840 --> 00:42:01,110 a few minutes ago, which is 24 kilobytes, 24,000 bytes. 905 00:42:01,110 --> 00:42:03,560 This is 532 million bytes. 906 00:42:03,560 --> 00:42:07,040 So it's orders of magnitude bigger, which means our design decisions today 907 00:42:07,040 --> 00:42:08,210 are really going to matter. 908 00:42:08,210 --> 00:42:10,500 Now, if we take a look inside of this file, 909 00:42:10,500 --> 00:42:12,260 we'll see a bunch of interesting fields. 910 00:42:12,260 --> 00:42:15,920 And I've essentially summarized this from their own online documentation. 911 00:42:15,920 --> 00:42:18,980 There's these five fields that are of interest right now, 912 00:42:18,980 --> 00:42:20,630 and we'll tease these apart now. 913 00:42:20,630 --> 00:42:22,550 So tconst, I didn't recognize that. 914 00:42:22,550 --> 00:42:26,510 It's an IMDB-specific term, it seems, but it means some kind of title 915 00:42:26,510 --> 00:42:27,200 constant.
916 00:42:27,200 --> 00:42:30,480 It's a unique identifier for every movie in the database. 917 00:42:30,480 --> 00:42:33,360 So for instance, one of the movies in that TSV file, 918 00:42:33,360 --> 00:42:39,040 which I'll open up in a moment, has the unique value tt4786824. 919 00:42:39,040 --> 00:42:39,830 Why? 920 00:42:39,830 --> 00:42:40,970 Just because. 921 00:42:40,970 --> 00:42:44,750 But it turns out we can actually see what this is. 922 00:42:44,750 --> 00:42:57,160 Let me actually grab that value and go to IMDB.com/title/tt4786824. 923 00:42:57,160 --> 00:42:58,160 Enter. 924 00:42:58,160 --> 00:42:59,870 These actually do have significance. 925 00:42:59,870 --> 00:43:01,760 That is the unique identifier for The Crown, 926 00:43:01,760 --> 00:43:04,400 popular hit TV show that you can now watch on Netflix. 927 00:43:04,400 --> 00:43:08,660 So if you go back to the actual field-- 928 00:43:08,660 --> 00:43:09,960 that's how I spent the summer. 929 00:43:09,960 --> 00:43:12,543 So if you go back to the fields here, we have not only tconst, 930 00:43:12,543 --> 00:43:15,440 which uniquely identifies the film, we also have title type. 931 00:43:15,440 --> 00:43:17,990 So it turns out in this 500 megabyte file, 932 00:43:17,990 --> 00:43:22,010 there's huge amounts of information on movies, TV series, documentaries, 933 00:43:22,010 --> 00:43:25,280 shorts, way more data than we could possibly play with in one day. 934 00:43:25,280 --> 00:43:29,270 So we'll focus just on TV series, like this, the show The Crown. 935 00:43:29,270 --> 00:43:31,910 Primary title is the name in the file that is 936 00:43:31,910 --> 00:43:33,660 given to the primary title of the show. 937 00:43:33,660 --> 00:43:35,660 Sometimes shows apparently have different titles 938 00:43:35,660 --> 00:43:38,320 in different geographies or communities, so the primary title 939 00:43:38,320 --> 00:43:40,070 is the one that most people know it by. 
940 00:43:40,070 --> 00:43:43,670 Start year is the year for a TV show that-- in which that TV show began. 941 00:43:43,670 --> 00:43:46,250 The Crown came out first in 2016. 942 00:43:46,250 --> 00:43:51,050 And genres, a little weirdly, is a comma-separated list 943 00:43:51,050 --> 00:43:53,690 of genres into which that show falls. 944 00:43:53,690 --> 00:43:57,320 And I say it's a little weird, because we do have a TSV file, 945 00:43:57,320 --> 00:43:59,580 but they're using commas inside. 946 00:43:59,580 --> 00:44:03,590 So this is probably among the motivations for having used tabs in-- 947 00:44:03,590 --> 00:44:05,660 to separate all of the columns, because they 948 00:44:05,660 --> 00:44:08,060 want to use commas for some other purpose here. 949 00:44:08,060 --> 00:44:09,570 But it's a little messy. 950 00:44:09,570 --> 00:44:13,160 This is actually not necessarily the best design decision here. 951 00:44:13,160 --> 00:44:17,090 So what can we go ahead and do with all of this data here? 952 00:44:17,090 --> 00:44:20,140 So let's go ahead and start to grab some of this data 953 00:44:20,140 --> 00:44:21,640 and write some scripts involving it. 954 00:44:21,640 --> 00:44:23,260 But let me issue a disclaimer. 955 00:44:23,260 --> 00:44:26,000 It turns out when you have a lot of data in a database, 956 00:44:26,000 --> 00:44:30,247 it can take quite a long time to analyze it or parse it. 957 00:44:30,247 --> 00:44:32,830 And so if you ever watch certain baking shows like Julia Child 958 00:44:32,830 --> 00:44:35,620 from down the road here in Cambridge, very often there 959 00:44:35,620 --> 00:44:39,580 would be some TV magic whereby someone mixes the cake mix 960 00:44:39,580 --> 00:44:42,723 and puts it into the oven, and then two minutes later, voila, it comes out. 
961 00:44:42,723 --> 00:44:44,890 That's because some of the scripts we'll write today 962 00:44:44,890 --> 00:44:47,410 might actually take many minutes to run, but we'll 963 00:44:47,410 --> 00:44:49,118 go ahead and run them in the background, 964 00:44:49,118 --> 00:44:51,760 and I'll use the outputs that I've created here in advance. 965 00:44:51,760 --> 00:44:55,600 So how do we go about beginning to navigate this data set? 966 00:44:55,600 --> 00:44:59,830 Well, first of all, let me go ahead and just open this file. 967 00:44:59,830 --> 00:45:02,230 And I'm going to use my own terminal window here. 968 00:45:02,230 --> 00:45:06,790 This file was called title.basics.tsv. 969 00:45:06,790 --> 00:45:09,580 Notice that even in my program here, it took a moment to open, 970 00:45:09,580 --> 00:45:11,060 but this is what it looks like. 971 00:45:11,060 --> 00:45:14,380 It's a little cryptic at first glance, but notice the first row 972 00:45:14,380 --> 00:45:16,030 is the headers of the file. 973 00:45:16,030 --> 00:45:19,900 So we've got tconst and title type and primary title and so forth. 974 00:45:19,900 --> 00:45:24,160 Then every row thereafter seems to start with one of those unique identifiers, 975 00:45:24,160 --> 00:45:27,250 and they seem to be monotonically increasing, 1, 2, 3, 4, 976 00:45:27,250 --> 00:45:29,830 but they start with tt then some number of digits. 977 00:45:29,830 --> 00:45:32,972 Then you see that these are a bunch of shorts, so short films or the like. 978 00:45:32,972 --> 00:45:35,180 And if we kept scrolling, we'd see some other values. 979 00:45:35,180 --> 00:45:36,825 Here's the first movie in the database. 980 00:45:36,825 --> 00:45:39,700 And if we keep scrolling and scrolling and scrolling, we'll see more.
981 00:45:39,700 --> 00:45:43,150 So if I actually use my text editor here and search for literally the word 982 00:45:43,150 --> 00:45:45,040 The Crown, it's going to take a little while 983 00:45:45,040 --> 00:45:48,100 to get to, because there's a bunch of shows apparently called 984 00:45:48,100 --> 00:45:50,470 The Crown or Crowning or something. 985 00:45:50,470 --> 00:45:52,060 But you know what'll be faster? 986 00:45:52,060 --> 00:45:53,860 Let me go ahead and search for-- 987 00:45:53,860 --> 00:45:55,090 what was it? 988 00:45:55,090 --> 00:46:03,020 If I go back to tt4786824, Enter, it takes a moment 989 00:46:03,020 --> 00:46:05,770 to find this because it's searching through hundreds of megabytes, 990 00:46:05,770 --> 00:46:10,493 but voila, down here is this tconst, TV series type. 991 00:46:10,493 --> 00:46:11,410 It's called The Crown. 992 00:46:11,410 --> 00:46:15,920 In another geography it's also called The Crown, 2016, and so forth. 993 00:46:15,920 --> 00:46:17,920 So that's a huge amount of data. 994 00:46:17,920 --> 00:46:19,180 In fact, how much data? 995 00:46:19,180 --> 00:46:22,310 Well, if I actually analyze this with a command on my computer, 996 00:46:22,310 --> 00:46:29,320 there are 6,267,469 shows in the world, be it TV 997 00:46:29,320 --> 00:46:31,330 or movies or documentaries or the like. 998 00:46:31,330 --> 00:46:32,830 That's a huge data set. 999 00:46:32,830 --> 00:46:36,460 So suppose I want to focus really just on a subset of that data. 1000 00:46:36,460 --> 00:46:39,970 What can I go about using instead? 1001 00:46:39,970 --> 00:46:45,040 Well, let me go ahead and open up, for instance, the same file, 1002 00:46:45,040 --> 00:46:47,685 but let me try putting it in CS50 IDE. 1003 00:46:47,685 --> 00:46:50,560 I'm going to go ahead and download this, as you might have downloaded 1004 00:46:50,560 --> 00:46:52,660 past problem sets, with wget. 1005 00:46:52,660 --> 00:46:54,867 That's going to give me a pretty big file.
1006 00:46:54,867 --> 00:46:56,950 There's a command now that you've not had occasion 1007 00:46:56,950 --> 00:47:02,640 to use in the IDE called gunzip, which will unzip a file that starts with-- 1008 00:47:02,640 --> 00:47:04,480 that ends with .gz. 1009 00:47:04,480 --> 00:47:09,060 That's going to take a moment to unzip the whole file. 1010 00:47:09,060 --> 00:47:11,310 A moment, a moment, a moment, OK. 1011 00:47:11,310 --> 00:47:12,970 [CHUCKLES] And no space left on disk. 1012 00:47:12,970 --> 00:47:14,880 But that's OK, because it turns out what I'm also 1013 00:47:14,880 --> 00:47:17,260 going to start doing today is using my own Mac a bit more. 1014 00:47:17,260 --> 00:47:19,260 And you'll recall last week, I did that a little bit, 1015 00:47:19,260 --> 00:47:22,218 because when I wanted to play with speech recognition in my microphone, 1016 00:47:22,218 --> 00:47:25,890 it's a lot easier to have Python running on my Mac or your PC 1017 00:47:25,890 --> 00:47:28,200 if I want to use hardware built into my computer. 1018 00:47:28,200 --> 00:47:31,680 Plus my MacBook is a lot faster, for instance, than a single cloud 1019 00:47:31,680 --> 00:47:33,270 account on a shared server. 1020 00:47:33,270 --> 00:47:37,770 So I'll go ahead and write some of this code here on my own machine instead. 1021 00:47:37,770 --> 00:47:40,930 So let me go ahead and open a file called import.py. 1022 00:47:40,930 --> 00:47:44,700 And I'm going to go ahead and import the CSV module, as always, 1023 00:47:44,700 --> 00:47:49,490 and I'm going to go ahead and open this file, which is called title.basics.tsv 1024 00:47:49,490 --> 00:47:53,425 in read-only mode, and I'm going to call this variable titles. 1025 00:47:53,425 --> 00:47:54,550 What am I next going to do? 1026 00:47:54,550 --> 00:47:58,230 Let me give myself a reader using csv.DictReader, 1027 00:47:58,230 --> 00:48:01,290 as before, reading in those titles.
1028 00:48:01,290 --> 00:48:04,800 But csv.DictReader feels like the wrong reader. 1029 00:48:04,800 --> 00:48:07,040 Why? 1030 00:48:07,040 --> 00:48:09,360 Just to be clear. 1031 00:48:09,360 --> 00:48:11,445 Yeah, I mean, maybe I want TSV. 1032 00:48:11,445 --> 00:48:14,290 But it turns out TSV doesn't exist. 1033 00:48:14,290 --> 00:48:17,310 So even though I might be inclined, for instance, to change this 1034 00:48:17,310 --> 00:48:21,570 to a T and this to a T, the CSV module does enough for me, 1035 00:48:21,570 --> 00:48:24,540 but I need to tell it that I want to use a different delimiter. 1036 00:48:24,540 --> 00:48:26,723 Instead of the default, which looks like this, 1037 00:48:26,723 --> 00:48:28,890 I can actually override that and say, you know what? 1038 00:48:28,890 --> 00:48:29,700 Use a tab. 1039 00:48:29,700 --> 00:48:33,690 And just like backslash n in C and in Python, it's a new line. 1040 00:48:33,690 --> 00:48:37,830 Backslash t in both languages is a tab character. 1041 00:48:37,830 --> 00:48:41,100 All right, so once I've done this, let me go ahead and open up, now-- 1042 00:48:41,100 --> 00:48:43,920 let me go ahead and open up shows0.csv. 1043 00:48:43,920 --> 00:48:47,760 My goal in life now is to make this file a lot more manageable for myself. 1044 00:48:47,760 --> 00:48:53,070 I want to take a 500-megabyte file and extract only the TV shows therein. 1045 00:48:53,070 --> 00:48:56,190 Moreover, how about only the TV shows from 1970 onward? 1046 00:48:56,190 --> 00:48:59,080 We won't go even further back than that. 1047 00:48:59,080 --> 00:49:04,320 So let me go ahead and open up, in write mode, a file called shows0.csv. 1048 00:49:04,320 --> 00:49:06,278 And I'm just going to call that variable shows. 1049 00:49:06,278 --> 00:49:08,445 Then I'm going to go ahead and give myself a writer. 1050 00:49:08,445 --> 00:49:10,290 And you might recall using this in the past.
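That delimiter override can be sketched as follows; since the real title.basics.tsv isn't at hand, the two rows below are an invented in-memory stand-in so the sketch runs anywhere.

```python
import csv
import io

# The csv module has no separate TSV reader; DictReader just needs
# delimiter="\t". StringIO simulates the opened TSV file here.
tsv = io.StringIO(
    "tconst\ttitleType\tprimaryTitle\tstartYear\tgenres\n"
    "tt4786824\ttvSeries\tThe Crown\t2016\tDrama,History\n"
    "tt0000001\tshort\tCarmencita\t1894\tDocumentary,Short\n"
)
reader = csv.DictReader(tsv, delimiter="\t")
rows = list(reader)
print(rows[0]["primaryTitle"])  # fields are keyed by the header row
```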
1051 00:49:10,290 --> 00:49:13,082 This is simply going to give me a variable called writer, via which 1052 00:49:13,082 --> 00:49:14,400 I can write to a new file. 1053 00:49:14,400 --> 00:49:18,750 Because again, the goal is to read this file and write to this file 1054 00:49:18,750 --> 00:49:21,090 a subset of the data therein. 1055 00:49:21,090 --> 00:49:23,160 So let me go ahead and write one row first. 1056 00:49:23,160 --> 00:49:28,180 Write a row, passing in a list of values, specifically tconst, 1057 00:49:28,180 --> 00:49:29,320 which is the title-- 1058 00:49:29,320 --> 00:49:31,860 which is the ID field; primaryTitle, which 1059 00:49:31,860 --> 00:49:35,910 is the title field; startYear, which is the year field; and genres, 1060 00:49:35,910 --> 00:49:38,620 which was one more field that I mentioned earlier. 1061 00:49:38,620 --> 00:49:43,248 So my goal is to export only those four columns that I care about for now. 1062 00:49:43,248 --> 00:49:44,790 So I'm going to go ahead and do this. 1063 00:49:44,790 --> 00:49:49,590 For each row in my reader, I'm going to go ahead and say if that 1064 00:49:49,590 --> 00:49:56,340 row's titleType == tvSeries, which, recall, 1065 00:49:56,340 --> 00:50:00,720 The Crown was an example of, then I'm going to go ahead and write 1066 00:50:00,720 --> 00:50:01,710 to the writer-- 1067 00:50:01,710 --> 00:50:05,940 whoops-- writer.writerow, a list containing what? 1068 00:50:05,940 --> 00:50:13,230 The row's tconst value, the row's primary title value, the row-- 1069 00:50:13,230 --> 00:50:21,214 whoops-- the row's start year, and lastly, the row's-- 1070 00:50:21,214 --> 00:50:24,310 [CHUCKLES] the row's genres. 1071 00:50:24,310 --> 00:50:26,560 So what am I doing here, just to be clear?
1072 00:50:26,560 --> 00:50:30,760 So what I have done is I've first written out-- once and only 1073 00:50:30,760 --> 00:50:33,070 once-- literally these values, because I want 1074 00:50:33,070 --> 00:50:35,200 headers in the first line of my file. 1075 00:50:35,200 --> 00:50:37,750 After that, for each row in the reader, which is currently 1076 00:50:37,750 --> 00:50:40,900 iterating over this file in read-only mode, 1077 00:50:40,900 --> 00:50:44,200 I want to print out the current row's tconst, current row's primary title, 1078 00:50:44,200 --> 00:50:47,170 the current row's start year, and the current row's genres. 1079 00:50:47,170 --> 00:50:52,450 But notice, I'm ignoring movies and shorts and documentaries and bunches 1080 00:50:52,450 --> 00:50:54,080 of other values as well. 1081 00:50:54,080 --> 00:50:54,830 And you know what? 1082 00:50:54,830 --> 00:50:57,310 Just for good measure, let's shrink this a little bit. 1083 00:50:57,310 --> 00:51:02,320 And row, how about, isAdult == "0", for today's purposes? 1084 00:51:02,320 --> 00:51:03,970 So that'll filter the list further. 1085 00:51:03,970 --> 00:51:06,160 If you'd like to flip that later, that's fine. 1086 00:51:06,160 --> 00:51:11,920 So let me go ahead and do Python 3, for version 3 on my Mac-- actually, 1087 00:51:11,920 --> 00:51:17,632 no, let's do python import.py, all right? 1088 00:51:17,632 --> 00:51:19,090 I'm going to cross my fingers here. 1089 00:51:19,090 --> 00:51:22,300 Hopefully the file is working, working, working. 1090 00:51:22,300 --> 00:51:24,010 But it's taking a decent amount of time. 1091 00:51:24,010 --> 00:51:29,290 Like, this is how much time it takes, apparently, to process over 1092 00:51:29,290 --> 00:51:31,840 millions of rows of data. 1093 00:51:31,840 --> 00:51:33,700 Still running. 1094 00:51:33,700 --> 00:51:34,960 Still running.
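The filtering pass just described can be sketched self-containedly; in-memory buffers stand in for the real input and output files, and the rows are invented for illustration.

```python
import csv
import io

# Simulated title.basics.tsv: one TV series and one short film.
titles = io.StringIO(
    "tconst\ttitleType\tprimaryTitle\tisAdult\tstartYear\tgenres\n"
    "tt4786824\ttvSeries\tThe Crown\t0\t2016\tDrama,History\n"
    "tt0000001\tshort\tCarmencita\t0\t1894\tDocumentary,Short\n"
)
shows = io.StringIO()  # stands in for shows0.csv, opened in write mode

reader = csv.DictReader(titles, delimiter="\t")
writer = csv.writer(shows)
writer.writerow(["tconst", "primaryTitle", "startYear", "genres"])  # header, once
for row in reader:
    # Keep only non-adult TV series; everything else is skipped.
    if row["titleType"] == "tvSeries" and row["isAdult"] == "0":
        writer.writerow(
            [row["tconst"], row["primaryTitle"], row["startYear"], row["genres"]]
        )

print(shows.getvalue())
```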
1095 00:51:34,960 --> 00:51:37,465 But the goal, again, is to shrink the amount 1096 00:51:37,465 --> 00:51:40,090 of data I have to ultimately care about so that we can actually 1097 00:51:40,090 --> 00:51:41,980 search it much more effectively. 1098 00:51:41,980 --> 00:51:43,510 So, OK, it actually finished. 1099 00:51:43,510 --> 00:51:46,430 So let me go ahead and open up shows0.csv. 1100 00:51:46,430 --> 00:51:49,840 Notice now, in my text editor, I've got a lot less data. 1101 00:51:49,840 --> 00:51:51,820 I've thrown away everything I don't care about, 1102 00:51:51,820 --> 00:51:54,970 but I've been left with tconst, primary title, start year, and genres, 1103 00:51:54,970 --> 00:51:59,052 and everything herein is now consistent with that filtration. 1104 00:51:59,052 --> 00:52:00,760 But I haven't filtered everything I said. 1105 00:52:00,760 --> 00:52:03,380 What did I say I wanted to get rid of earlier? 1106 00:52:03,380 --> 00:52:06,590 1107 00:52:06,590 --> 00:52:08,090 Yeah, the shows before 1970. 1108 00:52:08,090 --> 00:52:11,550 And clearly some of these are coming from the 1940s and so forth. 1109 00:52:11,550 --> 00:52:14,480 So let's go ahead and get rid of those, but see how. 1110 00:52:14,480 --> 00:52:18,597 This is the CSV file that we just opened, but in Google Spreadsheet form. 1111 00:52:18,597 --> 00:52:20,930 So I literally just imported it into Google Spreadsheets 1112 00:52:20,930 --> 00:52:21,950 so that we could see it. 1113 00:52:21,950 --> 00:52:25,640 Literally the same data as before, and there are those 1940s movies. 1114 00:52:25,640 --> 00:52:28,490 But there's something curious that I wanted to be mindful of. 1115 00:52:28,490 --> 00:52:34,290 If I scroll down in the start years and keep going and going and going, 1116 00:52:34,290 --> 00:52:35,400 huh, those seem OK. 1117 00:52:35,400 --> 00:52:36,650 Those are increasing in order. 1118 00:52:36,650 --> 00:52:37,560 But let me try this.
1119 00:52:37,560 --> 00:52:40,880 Let me just poke around my data, sorting in reverse order. 1120 00:52:40,880 --> 00:52:44,480 It's going to take a while, because even this is a decent number of TV shows. 1121 00:52:44,480 --> 00:52:47,000 Notice this weirdness. 1122 00:52:47,000 --> 00:52:50,630 At the top of start year now, once I've reversed sorted them, 1123 00:52:50,630 --> 00:52:53,450 there's a whole bunch of backslash capital N's. 1124 00:52:53,450 --> 00:52:57,050 Now, this has nothing to do with C, and nothing to do with Python. 1125 00:52:57,050 --> 00:52:59,900 It has everything to do with the documentation. 1126 00:52:59,900 --> 00:53:05,930 If you read IMDB's data, as I only did carefully eventually, you'll see this. 1127 00:53:05,930 --> 00:53:10,370 A backslash and a capital N is used to denote that a particular field is 1128 00:53:10,370 --> 00:53:12,592 missing or null for that title name. 1129 00:53:12,592 --> 00:53:14,300 Now, this is important, because if I want 1130 00:53:14,300 --> 00:53:17,090 to filter for shows from 1970 onward, 1131 00:53:17,090 --> 00:53:18,670 I need to be resilient against that. 1132 00:53:18,670 --> 00:53:20,210 So let me go ahead and do this. 1133 00:53:20,210 --> 00:53:27,590 So if the current row's startYear does not equal backslash N, 1134 00:53:27,590 --> 00:53:31,850 then I'm going to go ahead and check that it's a TV series 1135 00:53:31,850 --> 00:53:34,160 and that it's not an adult show. 1136 00:53:34,160 --> 00:53:35,150 So that would help. 1137 00:53:35,150 --> 00:53:38,110 But furthermore, how can I check this here? 1138 00:53:38,110 --> 00:53:39,530 Let me do year. 1139 00:53:39,530 --> 00:53:43,500 And how can I convert row startYear to an integer? 1140 00:53:43,500 --> 00:53:45,500 Well, everything in a spreadsheet, by definition 1141 00:53:45,500 --> 00:53:50,510 of it having been in a spreadsheet, or a TSV file or a CSV file, is text.
1142 00:53:50,510 --> 00:53:52,940 But start year looks like years, so what Python 1143 00:53:52,940 --> 00:53:55,430 function can I use to actually convert text that 1144 00:53:55,430 --> 00:53:58,840 resembles a number to an actual number? 1145 00:53:58,840 --> 00:53:59,340 Yeah. 1146 00:53:59,340 --> 00:54:01,240 So we can do something like this. 1147 00:54:01,240 --> 00:54:03,840 So I can convert year to an int, and now I 1148 00:54:03,840 --> 00:54:08,530 can say, if year greater than or equal to 1970, 1149 00:54:08,530 --> 00:54:12,348 now I'm going to go ahead and do those lines instead. 1150 00:54:12,348 --> 00:54:14,640 Now, there's an opportunity, surely, for better design, 1151 00:54:14,640 --> 00:54:16,530 because once your code starts doing this, 1152 00:54:16,530 --> 00:54:18,360 you've done something suboptimally, right? 1153 00:54:18,360 --> 00:54:21,112 This is not going to end well if all of my code starts wrapping. 1154 00:54:21,112 --> 00:54:23,820 So I could clean up the logic in a little bit, but let's go ahead 1155 00:54:23,820 --> 00:54:25,750 and run this just one more time. 1156 00:54:25,750 --> 00:54:30,450 This time, changing this to shows1.csv so we can see slightly different 1157 00:54:30,450 --> 00:54:30,960 outputs. 1158 00:54:30,960 --> 00:54:33,546 Let me go ahead and run Python import.py. 1159 00:54:33,546 --> 00:54:37,740 Huh, syntax error, Unicode error, codec-- 1160 00:54:37,740 --> 00:54:39,000 that's a weird one. 1161 00:54:39,000 --> 00:54:43,810 But this is because backslashes, recall, in C and Python, have special meaning. 1162 00:54:43,810 --> 00:54:47,070 So when you do something like this, backslash capital n, 1163 00:54:47,070 --> 00:54:49,860 even though it's not a lower case n, backslash n, recall, 1164 00:54:49,860 --> 00:54:52,110 is the escape character in C and Python. 1165 00:54:52,110 --> 00:54:55,230 So this is like telling Python, this is a special character. 1166 00:54:55,230 --> 00:54:56,092 But it's not.
1167 00:54:56,092 --> 00:54:58,050 And we've never really had occasion to do this, 1168 00:54:58,050 --> 00:55:00,240 but how do you think we could output 1169 00:55:00,240 --> 00:55:05,620 a literal backslash before a capital N? 1170 00:55:05,620 --> 00:55:06,120 Yeah. 1171 00:55:06,120 --> 00:55:08,610 So it turns out the solution to this problem, usually, 1172 00:55:08,610 --> 00:55:11,730 no matter the language, is that if you want a literal character, not 1173 00:55:11,730 --> 00:55:15,070 an escape character, you literally put another one of it before. 1174 00:55:15,070 --> 00:55:17,260 So even though this looks a little funky now, 1175 00:55:17,260 --> 00:55:22,420 this backslash backslash capital N literally will mean backslash N. 1176 00:55:22,420 --> 00:55:22,920 All right. 1177 00:55:22,920 --> 00:55:26,360 So now, let me go ahead and run this on import.py. 1178 00:55:26,360 --> 00:55:29,790 This time, I'm hopefully going to actually generate 1179 00:55:29,790 --> 00:55:36,120 a new file called shows1.csv that has even less data that 1180 00:55:36,120 --> 00:55:42,400 actually is going to contain my shows, but only a subset of them. 1181 00:55:42,400 --> 00:55:45,000 And let's go ahead and pull the cake out of the oven this way. 1182 00:55:45,000 --> 00:55:47,050 This is what I get now this time. 1183 00:55:47,050 --> 00:55:50,058 So if I actually load the CSV, shows1.csv, into Google Spreadsheet, 1184 00:55:50,058 --> 00:55:53,100 just because it's easier to look at than the black and white window, 1185 00:55:53,100 --> 00:55:57,420 now you can see that I apparently am only getting shows 1970 and onward. 1186 00:55:57,420 --> 00:55:59,940 And indeed, if I sorted them, I would see no backslash N's. 1187 00:55:59,940 --> 00:56:03,960 I would have thrown away everything that doesn't meet that criteria. 1188 00:56:03,960 --> 00:56:07,110 Well, let me go ahead and do one last thing here.
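The backslash-doubling just described can be sketched in a few lines; the helper function here is an invented illustration, not the lecture's actual code.

```python
# IMDb writes the two characters backslash-then-N for a missing field.
# In a Python string literal, a lone \N is the start of an escape sequence
# (which is exactly why the interpreter reported a Unicode/codec error),
# so the backslash itself must be doubled: "\\N".
missing = "\\N"
assert len(missing) == 2  # exactly two characters: a backslash and an N

def keep(start_year):
    """True for rows whose start year is known and is 1970 or later."""
    if start_year == "\\N":         # the field is missing; skip this row
        return False
    return int(start_year) >= 1970  # TSV fields are text, so convert first

print(keep("2016"), keep("1944"), keep("\\N"))  # True False False
```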
1189 00:56:07,110 --> 00:56:10,020 I'm going to go ahead and make one more change. 1190 00:56:10,020 --> 00:56:12,330 And first, let's improve the design here. 1191 00:56:12,330 --> 00:56:16,710 This indentation is the result of my asking questions again and again 1192 00:56:16,710 --> 00:56:20,410 and again and indenting if and only if those things are true. 1193 00:56:20,410 --> 00:56:22,950 But notice, you can start to flip your logic here, right? 1194 00:56:22,950 --> 00:56:26,790 Instead of saying if the start year does not equal backslash N, 1195 00:56:26,790 --> 00:56:30,210 what if I just do this and say continue? 1196 00:56:30,210 --> 00:56:32,010 I can then unindent this-- 1197 00:56:32,010 --> 00:56:35,280 because if you've not used it before in Python and in C, 1198 00:56:35,280 --> 00:56:39,880 if you say continue inside of a loop, it's not going to continue down there, 1199 00:56:39,880 --> 00:56:42,690 it's going to [WHOOSH] continue to the start of the loop again. 1200 00:56:42,690 --> 00:56:45,840 So via this logic, we can actually keep wrapping around again and again. 1201 00:56:45,840 --> 00:56:50,460 And here, too, we could say, if year less than 1970, I can go ahead 1202 00:56:50,460 --> 00:56:54,340 and say continue, which would then allow me to unindent this as well. 1203 00:56:54,340 --> 00:56:57,510 So there are solutions, design-wise, to actually avoiding 1204 00:56:57,510 --> 00:56:59,230 that infinite indentation. 1205 00:56:59,230 --> 00:56:59,730 All right. 1206 00:56:59,730 --> 00:57:02,130 Let's go ahead and do one last version. 1207 00:57:02,130 --> 00:57:04,710 Then I'm going to go ahead and pull out of the oven in a-- 1208 00:57:04,710 --> 00:57:05,640 premade. 1209 00:57:05,640 --> 00:57:08,680 So the last thing I didn't load before was this. 1210 00:57:08,680 --> 00:57:12,780 Suppose that I want to load into the CSV file, all of the genres 1211 00:57:12,780 --> 00:57:14,820 associated with the show. 
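That flattening with `continue` can be sketched like this; the rows here are hypothetical stand-ins for the dictionaries a DictReader would produce:

```python
# Hypothetical rows, shaped like IMDb's TSV dictionaries.
rows = [
    {"titleType": "tvSeries", "startYear": "1994"},
    {"titleType": "movie",    "startYear": "1999"},
    {"titleType": "tvSeries", "startYear": "\\N"},
    {"titleType": "tvSeries", "startYear": "1965"},
]

kept = []
for row in rows:
    # Skip anything that is not a TV series.
    if row["titleType"] != "tvSeries":
        continue
    # Skip rows whose start year is missing (IMDb's \N sentinel).
    if row["startYear"] == "\\N":
        continue
    # Safe to convert now; skip shows that began before 1970.
    year = int(row["startYear"])
    if year < 1970:
        continue
    kept.append(row)

print(len(kept))  # 1: only the 1994 TV series survives
```

Each `continue` whooshes back to the top of the loop for the next row, so none of the later checks needs extra indentation.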
1212 00:57:14,820 --> 00:57:17,940 It looks like all of these shows have one or more genres, just like you 1213 00:57:17,940 --> 00:57:20,370 were asked for your favorite shows. 1214 00:57:20,370 --> 00:57:24,780 And so now, we have a CSV file with tconst, primary titles, start year, 1215 00:57:24,780 --> 00:57:29,280 and genres, where genres is itself a comma-separated list. 1216 00:57:29,280 --> 00:57:31,290 But there's a fundamental problem here. 1217 00:57:31,290 --> 00:57:35,040 Even though I have all of this data here, the best I can do 1218 00:57:35,040 --> 00:57:36,460 is a program like this. 1219 00:57:36,460 --> 00:57:42,190 Let me go ahead and search for, for instance, the following. 1220 00:57:42,190 --> 00:57:44,765 Let me go ahead and grab a file real fast. 1221 00:57:44,765 --> 00:57:48,410 1222 00:57:48,410 --> 00:57:53,300 Let me go ahead and grab a copy of shows2.csv 1223 00:57:53,300 --> 00:57:55,970 and write one final program here. 1224 00:57:55,970 --> 00:57:59,712 If I want to go ahead now and search this very large, still, data set, 1225 00:57:59,712 --> 00:58:02,780 in shows2.csv, well, let me go ahead and do this. 1226 00:58:02,780 --> 00:58:06,528 Import CSV. 1227 00:58:06,528 --> 00:58:08,570 Let me go ahead now and ask the user for a title. 1228 00:58:08,570 --> 00:58:11,120 I could use CS50's get string, but there's really no need for that 1229 00:58:11,120 --> 00:58:14,162 anymore now that we have the title function-- the input function, recall. 1230 00:58:14,162 --> 00:58:15,310 So I'll just use that. 1231 00:58:15,310 --> 00:58:20,900 And then I'm going to go ahead and open up shows2.csv in read-only mode. 1232 00:58:20,900 --> 00:58:23,270 And I'm going to call that my file. 1233 00:58:23,270 --> 00:58:27,470 Then I'm going to go ahead and give myself a reader from csv.DictReader, 1234 00:58:27,470 --> 00:58:29,330 passing in that file. 
1235 00:58:29,330 --> 00:58:33,422 And now I'm going to go ahead and, for row in reader, do the following. 1236 00:58:33,422 --> 00:58:35,630 The goal now is to write a Python program that allows 1237 00:58:35,630 --> 00:58:37,620 me to search only those TV shows. 1238 00:58:37,620 --> 00:58:44,420 So I could say something like this, if title == row "primaryTitle", 1239 00:58:44,420 --> 00:58:50,060 then I can go ahead, for instance, and print out row "primaryTitle," 1240 00:58:50,060 --> 00:58:53,040 started in row "startYear." 1241 00:58:53,040 --> 00:58:55,130 So what is the goal of this program? 1242 00:58:55,130 --> 00:58:56,887 It's going to ask the user for input. 1243 00:58:56,887 --> 00:58:58,970 It's going to open this big CSV that I've created. 1244 00:58:58,970 --> 00:59:01,520 But that's still smaller than the 50-megabyte version. 1245 00:59:01,520 --> 00:59:04,670 It's going to iterate over every row in that file via DictReader, 1246 00:59:04,670 --> 00:59:09,800 checking if the title the human typed in equals the current row's primary title. 1247 00:59:09,800 --> 00:59:13,190 And if so, it's going to print the title and year of that show. 1248 00:59:13,190 --> 00:59:17,750 So if I go ahead and run Python of search.py, 1249 00:59:17,750 --> 00:59:22,760 typing in something like The Crown, Enter, voila, I get that answer. 1250 00:59:22,760 --> 00:59:26,710 If I go ahead and do The Office, Enter, there's a bunch of Offices. 1251 00:59:26,710 --> 00:59:29,210 And in fact, if you haven't seen more than the American one, 1252 00:59:29,210 --> 00:59:31,620 there's the UK one, and apparently several others. 1253 00:59:31,620 --> 00:59:33,410 And we can actually corroborate this now. 1254 00:59:33,410 --> 00:59:36,590 If you go to IMDB, where all of this data originally came from, 1255 00:59:36,590 --> 00:59:40,730 and type The Office, there are all of The Offices that actually line up 1256 00:59:40,730 --> 00:59:42,470 with our very own data set. 
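A sketch of that search.py, with shows2.csv simulated by an in-memory string (the rows below are made up) and the `input()` call replaced by a hard-coded title so the logic is self-contained:

```python
import csv
import io

# Simulated shows2.csv; the real script opens the file on disk instead.
data = io.StringIO(
    "tconst,primaryTitle,startYear,genres\n"
    'tt0386676,The Office,2005,"Comedy"\n'
    'tt0290978,The Office,2001,"Comedy,Drama"\n'
    'tt4786824,The Crown,2016,"Drama,History"\n'
)

title = "The Office"  # in the real script: title = input("Title: ")

# Linear search: check every row's primaryTitle against the user's input.
matches = []
for row in csv.DictReader(data):
    if title == row["primaryTitle"]:
        matches.append((row["primaryTitle"], row["startYear"]))

print(matches)  # both Offices, 2005 and 2001
```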
1257 00:59:42,470 --> 00:59:44,690 It's going to be so damn tedious every time 1258 00:59:44,690 --> 00:59:46,610 you want to search for data or update data 1259 00:59:46,610 --> 00:59:49,880 or insert new data to write a Python program to do it, 1260 00:59:49,880 --> 00:59:51,822 so we need a few more features of SQL. 1261 00:59:51,822 --> 00:59:53,780 But I think first, we need some Halloween candy 1262 00:59:53,780 --> 00:59:55,790 for our five-minute break outback. 1263 00:59:55,790 --> 00:59:57,560 All right, we are back. 1264 00:59:57,560 --> 01:00:00,680 So we have a whole bunch more data now, because we've downloaded 1265 01:00:00,680 --> 01:00:03,470 that really big TSV file from IMDB. 1266 01:00:03,470 --> 01:00:07,700 I've simplified it into some CSV files, but that really gets me half of the way 1267 01:00:07,700 --> 01:00:10,910 there, because now if I want to search the data, as with search.py, 1268 01:00:10,910 --> 01:00:14,540 I still have to look over thousands of rows of TV shows, 1269 01:00:14,540 --> 01:00:17,990 and it's only going to be linear search by nature of just how files are read, 1270 01:00:17,990 --> 01:00:19,700 top to bottom, left to right. 1271 01:00:19,700 --> 01:00:25,130 But it turns out in Python that you can actually write SQL code itself. 
1272 01:00:25,130 --> 01:00:28,640 And CS50 has a Python library that has not only get string and get int 1273 01:00:28,640 --> 01:00:33,200 and so forth, but it also has a SQL function built in that allows you 1274 01:00:33,200 --> 01:00:38,000 to connect to, so to speak, a file that ends in something like .db, 1275 01:00:38,000 --> 01:00:41,960 which is to say, in a moment, we'll start to write some Python code now 1276 01:00:41,960 --> 01:00:47,750 toward an end of loading a really large data set like IMDB's into a proper SQL 1277 01:00:47,750 --> 01:00:51,500 database, thereby allowing us hereafter to use all of the power 1278 01:00:51,500 --> 01:00:54,650 and expressiveness of SQL, and more examples of that in just a bit like 1279 01:00:54,650 --> 01:00:58,160 SELECT, UPDATE, DELETE, and INSERT, without having to write a whole bunch 1280 01:00:58,160 --> 01:00:59,120 of Python code. 1281 01:00:59,120 --> 01:01:03,860 So to be clear, instead of using Python to search and manipulate our data, 1282 01:01:03,860 --> 01:01:08,120 we're going to write a script, a program in Python, whose sole purpose in life 1283 01:01:08,120 --> 01:01:10,283 is to get data from one format into another. 1284 01:01:10,283 --> 01:01:12,950 And you can imagine this being generally useful, whether it's 1285 01:01:12,950 --> 01:01:15,260 a Google Spreadsheet you've downloaded or a large data 1286 01:01:15,260 --> 01:01:17,750 set you found on the internet that you want to use for a final project 1287 01:01:17,750 --> 01:01:18,830 or for some other class. 1288 01:01:18,830 --> 01:01:22,640 Python can be a really powerful way of taking one data source as input 1289 01:01:22,640 --> 01:01:26,330 and producing as its output now, as of today, SQL instead. 1290 01:01:26,330 --> 01:01:31,670 So let's go ahead and iterate one final time over our title.basics.tsv. 1291 01:01:31,670 --> 01:01:34,220 But this time, not just save it into a CSV file. 
1292 01:01:34,220 --> 01:01:39,740 Let's put it into a proper SQL database on my own Mac or your PC. 1293 01:01:39,740 --> 01:01:41,510 So let me go ahead and do this. 1294 01:01:41,510 --> 01:01:47,150 First, let me go ahead and say, just like on the slide, db = cs50.sql, 1295 01:01:47,150 --> 01:01:53,567 and then quote unquote, "sqlite:///--" so the third slash is not a typo. 1296 01:01:53,567 --> 01:01:54,650 It should indeed be there. 1297 01:01:54,650 --> 01:01:57,260 And I'm going to say shows3.db, just because this 1298 01:01:57,260 --> 01:01:59,690 is version 3 now of my import script. 1299 01:01:59,690 --> 01:02:03,680 I'm going to go ahead, just as last week, and now import CSV-- 1300 01:02:03,680 --> 01:02:06,240 CS50's library as well. 1301 01:02:06,240 --> 01:02:11,330 But for this to work, the file shows3.db needs to exist first. 1302 01:02:11,330 --> 01:02:15,260 And there's a couple of ways on a Mac or a PC or a Linux computer, typically, 1303 01:02:15,260 --> 01:02:18,440 to create an empty file that's ready to receive data. 1304 01:02:18,440 --> 01:02:20,990 You can literally use the command touch, which 1305 01:02:20,990 --> 01:02:25,160 just will create an empty file by whatever name you type at the prompt. 1306 01:02:25,160 --> 01:02:26,900 Or we can do this programmatically. 1307 01:02:26,900 --> 01:02:28,310 And I'm going to do it programmatically, because I 1308 01:02:28,310 --> 01:02:30,352 bet I'm going to screw up one or more times here, 1309 01:02:30,352 --> 01:02:34,220 and it's going to be useful to let my Python program create and recreate 1310 01:02:34,220 --> 01:02:37,250 the database again and again and again until I get it right. 1311 01:02:37,250 --> 01:02:42,920 So let me go ahead and open a file called shows3.db in write mode. 
1312 01:02:42,920 --> 01:02:45,860 And recall, as with fopen in C, 1313 01:02:45,860 --> 01:02:50,180 anytime you open a file in write mode, it will overwrite any file that exists 1314 01:02:50,180 --> 01:02:52,340 or create any file that doesn't. 1315 01:02:52,340 --> 01:02:53,780 That's all I needed to do. 1316 01:02:53,780 --> 01:02:57,830 So in Python, too, recall that we were able to use this dot notation. 1317 01:02:57,830 --> 01:02:59,780 And it turns out here, when you open a file, 1318 01:02:59,780 --> 01:03:02,900 if you want to immediately close it, because your only goal was to create 1319 01:03:02,900 --> 01:03:06,930 it, you can just do .close on the very thing you just opened. 1320 01:03:06,930 --> 01:03:09,570 That is equivalent, just to be clear, to doing 1321 01:03:09,570 --> 01:03:14,790 something a little more pedantic like this, file = open, and then file.close. 1322 01:03:14,790 --> 01:03:18,070 But we can collapse this into one slick one-liner, so to speak, 1323 01:03:18,070 --> 01:03:21,840 by just doing instead what I did a moment ago. 1324 01:03:21,840 --> 01:03:27,330 All that does is create an empty shows3.db file. 1325 01:03:27,330 --> 01:03:32,385 Now, open that file for SQLite. 1326 01:03:32,385 --> 01:03:34,260 And again, SQLite is the light version of SQL 1327 01:03:34,260 --> 01:03:36,060 that anyone can use on their own Mac or PC. 1328 01:03:36,060 --> 01:03:39,220 You don't need a special server to get up and running with it. 1329 01:03:39,220 --> 01:03:42,990 So now let me go ahead and open up title.basics.tsv. 1330 01:03:42,990 --> 01:03:48,060 And then here, let me go ahead and create myself a DictReader so 1331 01:03:48,060 --> 01:03:50,910 that I can iterate over the lines in that TSV file. 1332 01:03:50,910 --> 01:03:55,180 And now, let me go ahead, and for row in reader, do the following. 1333 01:03:55,180 --> 01:03:57,390 I first want to filter out stuff just as before. 
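That create-then-close idiom can be sketched as follows, using a temporary directory so the example doesn't clutter the working directory:

```python
import os
import tempfile

# open(path, "w") truncates an existing file or creates a missing one;
# calling .close() immediately leaves behind an empty file, ready for
# SQLite to use.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "shows3.db")
    open(path, "w").close()  # same as: file = open(path, "w"); file.close()
    exists = os.path.exists(path)
    size = os.path.getsize(path)

print(exists, size)  # True 0
```

On the command line, `touch shows3.db` would accomplish the same thing, but doing it in the script lets the database be recreated on every run.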
1334 01:03:57,390 --> 01:04:04,200 So I'm going to say if row bracket "titleType" == "tvSeries" 1335 01:04:04,200 --> 01:04:10,830 and row "isAdult" == quote unquote "0", then I'm going to go ahead and check 1336 01:04:10,830 --> 01:04:11,720 one other thing. 1337 01:04:11,720 --> 01:04:15,465 I'm going to go ahead and give myself a start year variable, similar to before, 1338 01:04:15,465 --> 01:04:17,130 although I called it year earlier. 1339 01:04:17,130 --> 01:04:19,980 Then let me go ahead and do row "startYear", 1340 01:04:19,980 --> 01:04:22,050 just so I can cast that to an int. 1341 01:04:22,050 --> 01:04:27,240 But I only want to do that if row "startYear" does not 1342 01:04:27,240 --> 01:04:31,620 equal that special backslash N that IMDB told me to watch out for. 1343 01:04:31,620 --> 01:04:34,600 So I only want to do that if it's not that. 1344 01:04:34,600 --> 01:04:38,670 And then if startYear is greater than or equal to 1970, 1345 01:04:38,670 --> 01:04:41,080 let's go ahead and do the following. 1346 01:04:41,080 --> 01:04:46,680 Let's go ahead and do genres, gets row "genres". 1347 01:04:46,680 --> 01:04:50,073 Let's go ahead and get tconst, gets row "tconst," 1348 01:04:50,073 --> 01:04:52,740 just so I can put these in some slightly shorter variable names, 1349 01:04:52,740 --> 01:04:54,360 just to keep myself sane. 1350 01:04:54,360 --> 01:04:58,770 primaryTitle is going to be from row "primaryTitle." 1351 01:04:58,770 --> 01:05:00,720 and then let me go ahead and give myself-- 1352 01:05:00,720 --> 01:05:04,150 we already have startYear, so those are the only other three fields I need. 1353 01:05:04,150 --> 01:05:08,460 So now, I want to go ahead and insert this row from my TSV 1354 01:05:08,460 --> 01:05:10,002 into a SQLite database. 1355 01:05:10,002 --> 01:05:12,960 And the operative word that we saw earlier that we haven't used it yet, 1356 01:05:12,960 --> 01:05:13,710 is INSERT. 1357 01:05:13,710 --> 01:05:15,390 We did use SELECT. 
1358 01:05:15,390 --> 01:05:16,500 We did use UPDATE. 1359 01:05:16,500 --> 01:05:17,460 We did use DELETE. 1360 01:05:17,460 --> 01:05:18,850 We haven't used INSERT yet. 1361 01:05:18,850 --> 01:05:20,350 So I'm going to do that in a moment. 1362 01:05:20,350 --> 01:05:23,200 But first, I need my database to actually exist, 1363 01:05:23,200 --> 01:05:25,330 so I need to create an actual table. 1364 01:05:25,330 --> 01:05:28,860 So I'm going to go up here first and do this, db, is a reference now, 1365 01:05:28,860 --> 01:05:30,995 a variable representing my database, and I'm 1366 01:05:30,995 --> 01:05:33,120 going to call the only function inside of it that's 1367 01:05:33,120 --> 01:05:36,000 useful for our purposes, called execute. 1368 01:05:36,000 --> 01:05:39,240 What I can now do is execute any SQL I want. 1369 01:05:39,240 --> 01:05:41,280 So what do I want to load into this database? 1370 01:05:41,280 --> 01:05:45,930 I think I want to load in the tconst, the primaryTitle, the startYear, 1371 01:05:45,930 --> 01:05:51,690 and the genres, just like we had earlier from title.basics.tsv. 1372 01:05:51,690 --> 01:05:55,140 I want to load rows that represent this kind of data, all right? 1373 01:05:55,140 --> 01:05:56,620 So how am I going to do this? 1374 01:05:56,620 --> 01:05:58,590 Well, let me go ahead and create a table. 1375 01:05:58,590 --> 01:06:02,880 I'm going to call it shows, because that seems nice and conceptually consistent. 1376 01:06:02,880 --> 01:06:06,510 I'm going to go ahead and create a list of columns now. 1377 01:06:06,510 --> 01:06:08,400 tconst is going to be one column. 1378 01:06:08,400 --> 01:06:10,290 primaryTitle is going to be another. 1379 01:06:10,290 --> 01:06:12,000 startYear is going to be another. 1380 01:06:12,000 --> 01:06:14,610 And genres is going to be the last. 
1381 01:06:14,610 --> 01:06:20,310 I can literally, that is to say, write SQL inside of a string 1382 01:06:20,310 --> 01:06:23,700 that I pass to a Python function called db.execute. 1383 01:06:23,700 --> 01:06:27,300 And because db.execute-- or rather, because db 1384 01:06:27,300 --> 01:06:34,380 was configured with shows3.db, when I execute this string in Python, 1385 01:06:34,380 --> 01:06:38,100 it's going to get executed on that database file, shows3.db. 1386 01:06:38,100 --> 01:06:40,573 So it's a nice way of bridging these two worlds. 1387 01:06:40,573 --> 01:06:42,990 So I'm going to have to be a little more specific, though. 1388 01:06:42,990 --> 01:06:45,707 Recall that SQL has a bunch of types. 1389 01:06:45,707 --> 01:06:47,040 And I'm going to keep it simple. 1390 01:06:47,040 --> 01:06:50,910 I'm going to go ahead and say that the type of that tconst value is text. 1391 01:06:50,910 --> 01:06:53,220 The type of the primaryTitle is text. 1392 01:06:53,220 --> 01:06:56,400 The type of startYear is going to be numeric, kind of a catch 1393 01:06:56,400 --> 01:06:58,020 all for dates and date times. 1394 01:06:58,020 --> 01:07:00,390 And then genres is going to be text as well. 1395 01:07:00,390 --> 01:07:01,770 So the syntax is a little funky. 1396 01:07:01,770 --> 01:07:04,860 You actually specify the name of the column and then the type, 1397 01:07:04,860 --> 01:07:08,408 as opposed to the opposite, which we did in C. But that's the way SQL is. 1398 01:07:08,408 --> 01:07:10,200 So I'm going to go ahead and save that now. 1399 01:07:10,200 --> 01:07:14,180 And just to comment this, this is going to create 1400 01:07:14,180 --> 01:07:21,590 a table called shows in database file called shows3.db, 1401 01:07:21,590 --> 01:07:23,100 just to be super explicit. 1402 01:07:23,100 --> 01:07:24,780 So what am I going to do down here? 
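The CREATE TABLE step can be sketched with Python's built-in sqlite3 module in place of the cs50 library (the SQL string is the same either way; only where `db` comes from differs, and an in-memory database stands in for shows3.db):

```python
import sqlite3

# An in-memory database stands in for the shows3.db file.
db = sqlite3.connect(":memory:")

# Column name first, then type, as SQL expects (the reverse of C).
db.execute(
    "CREATE TABLE shows "
    "(tconst TEXT, primaryTitle TEXT, startYear NUMERIC, genres TEXT)"
)

# .schema in the sqlite3 command-line program prints this DDL; from code,
# the same text can be read back from the sqlite_master catalog table.
schema = db.execute(
    "SELECT sql FROM sqlite_master WHERE name = 'shows'"
).fetchone()[0]
print(schema)
```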
1403 01:07:24,780 --> 01:07:27,450 It looks like I have the ability with CS50's library 1404 01:07:27,450 --> 01:07:29,320 to execute any SQL I want. 1405 01:07:29,320 --> 01:07:34,230 So let me go ahead and insert into shows the following values, 1406 01:07:34,230 --> 01:07:39,570 a tconst, a primaryTitle, a startYear, and a genre-- 1407 01:07:39,570 --> 01:07:40,920 and genres. 1408 01:07:40,920 --> 01:07:41,940 What values? 1409 01:07:41,940 --> 01:07:43,443 I want to insert these values. 1410 01:07:43,443 --> 01:07:45,360 Now, I don't know in advance, so I'm literally 1411 01:07:45,360 --> 01:07:47,380 going to put some question marks here. 1412 01:07:47,380 --> 01:07:51,630 And it turns out in SQL, this is valid syntax for the library we're using. 1413 01:07:51,630 --> 01:07:54,000 This is this INSERT query. 1414 01:07:54,000 --> 01:07:57,750 INSERT INTO the table name a parenthesized list 1415 01:07:57,750 --> 01:08:00,270 of the columns you want to insert data into. 1416 01:08:00,270 --> 01:08:03,078 Then, a set of values in separate parentheses. 1417 01:08:03,078 --> 01:08:05,370 And for now, I'm using question marks for placeholders, 1418 01:08:05,370 --> 01:08:07,248 for reasons we'll come back to. 1419 01:08:07,248 --> 01:08:09,540 But I'm going to go ahead and plug the following values 1420 01:08:09,540 --> 01:08:18,479 into those placeholders, tconst, primaryTitle, startYear, and genres. 1421 01:08:18,479 --> 01:08:20,819 And what the db.execute function is going 1422 01:08:20,819 --> 01:08:24,399 to do for me automatically is it's going to look at this SQL query. 1423 01:08:24,399 --> 01:08:27,630 Notice that, oh, it's got four question marks in it, or placeholders. 1424 01:08:27,630 --> 01:08:32,850 Those, in SQL, are like the %s was in C or like the curly braces 1425 01:08:32,850 --> 01:08:34,319 in Python f-strings. 
1426 01:08:34,319 --> 01:08:38,130 So this says, give me 1, 2, 3, 4 placeholders and plug in, ultimately, 1427 01:08:38,130 --> 01:08:41,220 the following four values, tconst, which is just a variable; 1428 01:08:41,220 --> 01:08:44,040 primaryTitle, which is the same; startYear, which is the same; 1429 01:08:44,040 --> 01:08:45,968 and genres, which is the same. 1430 01:08:45,968 --> 01:08:47,010 So what am I going to do? 1431 01:08:47,010 --> 01:08:52,290 For every TV series in this file that's not an adult series 1432 01:08:52,290 --> 01:08:57,270 and that started in 1970 or later, insert it into my database. 1433 01:08:57,270 --> 01:09:00,689 If I've made no typos, I'm going to go ahead and run this, cross my fingers, 1434 01:09:00,689 --> 01:09:02,160 and enter. 1435 01:09:02,160 --> 01:09:05,279 This one is going to take more time, because it turns out writing 1436 01:09:05,279 --> 01:09:07,109 to a CSV file is actually pretty quick. 1437 01:09:07,109 --> 01:09:10,080 You can just write row, write row, write row, write row, 1438 01:09:10,080 --> 01:09:13,979 but inserting into a SQLite database is going to take more time. 1439 01:09:13,979 --> 01:09:19,000 More time upfront, but it's going to be a lot faster to search thereafter. 1440 01:09:19,000 --> 01:09:23,520 So let me go ahead and do the cake in the oven thing 1441 01:09:23,520 --> 01:09:25,680 and go ahead and now open up a file I made 1442 01:09:25,680 --> 01:09:29,250 in advance in today's src3 directory called shows3.db 1443 01:09:29,250 --> 01:09:33,000 using SQLite3, that command line program we used earlier. 1444 01:09:33,000 --> 01:09:38,310 Recall that I can say .schema to see the types of data in the database. 1445 01:09:38,310 --> 01:09:40,470 And indeed, look what I've done in advance. 1446 01:09:40,470 --> 01:09:47,250 I created a table called shows with exactly those columns in a-- 1447 01:09:47,250 --> 01:09:51,840 with exactly these four columns, tconst, primaryTitle, startYear, and genres. 
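The parameterized INSERT can likewise be sketched with the built-in sqlite3 module, which uses the same `?` placeholder syntax as cs50's db.execute; the row below is a made-up example:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE shows "
    "(tconst TEXT, primaryTitle TEXT, startYear NUMERIC, genres TEXT)"
)

# The ? placeholders are filled in safely by the library, one per value,
# much like %s in C's printf or {} in a Python f-string, but with quoting
# and escaping handled for you.
db.execute(
    "INSERT INTO shows (tconst, primaryTitle, startYear, genres) "
    "VALUES (?, ?, ?, ?)",
    ("tt0386676", "The Office", 2005, "Comedy"),
)

count = db.execute("SELECT COUNT(*) FROM shows").fetchone()[0]
title = db.execute("SELECT primaryTitle FROM shows").fetchone()[0]
print(count, title)  # 1 The Office
```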
1448 01:09:51,840 --> 01:09:55,950 But I did this in advance to save us time so that I can now do SELECT 1449 01:09:55,950 --> 01:09:58,268 * FROM shows. 1450 01:09:58,268 --> 01:09:59,560 And let me not get all of them. 1451 01:09:59,560 --> 01:10:02,580 Let me do the first 10, semicolon. 1452 01:10:02,580 --> 01:10:03,120 All right. 1453 01:10:03,120 --> 01:10:06,610 So we see the first 10 shows from IMDB in whatever order 1454 01:10:06,610 --> 01:10:08,010 IMDB distributes them. 1455 01:10:08,010 --> 01:10:10,590 You can see their ID numbers are incrementing. 1456 01:10:10,590 --> 01:10:11,460 And All My Children. 1457 01:10:11,460 --> 01:10:13,418 I remember growing up with that show years ago. 1458 01:10:13,418 --> 01:10:15,410 And it seems that that's a drama-- 1459 01:10:15,410 --> 01:10:18,540 it's not really mystery, but so be it, and a romance show there. 1460 01:10:18,540 --> 01:10:23,370 But it's indeed 1970, as are every show thereafter in 1970 or onward. 1461 01:10:23,370 --> 01:10:26,350 I can go ahead and search for more like this. 1462 01:10:26,350 --> 01:10:28,590 Let me give myself the first 100 shows. 1463 01:10:28,590 --> 01:10:30,540 All right, so this is a pretty large data set. 1464 01:10:30,540 --> 01:10:32,370 And let me go ahead and count them all. 1465 01:10:32,370 --> 01:10:34,745 Recall that you can use a COUNT function so that we don't 1466 01:10:34,745 --> 01:10:36,330 have to print them all on my screen. 1467 01:10:36,330 --> 01:10:42,090 It looks like there are 153,331 TV series in IMDB. 1468 01:10:42,090 --> 01:10:45,720 Crazier than that, you want to know how many of them came out this year? 1469 01:10:45,720 --> 01:10:50,430 WHERE startYear = 2019;. 1470 01:10:50,430 --> 01:10:57,090 There were 6,099 new TV shows in the world this year according to IMDB. 1471 01:10:57,090 --> 01:11:00,900 Just one of those is The Office that won our-- 1472 01:11:00,900 --> 01:11:01,950 won our vote earlier. 
1473 01:11:01,950 --> 01:11:06,280 So SELECT *, star denoting wild card, in this case, everything. 1474 01:11:06,280 --> 01:11:08,290 And unfortunately, SQL has two wild cards. 1475 01:11:08,290 --> 01:11:12,600 Star means select all of the columns in the table. 1476 01:11:12,600 --> 01:11:17,220 Percent means let any characters come before or after a quoted 1477 01:11:17,220 --> 01:11:18,870 string in a WHERE clause. 1478 01:11:18,870 --> 01:11:24,660 So let me go ahead and SELECT * FROM shows WHERE title = The Office, 1479 01:11:24,660 --> 01:11:26,700 and we'll actually see-- whoops. 1480 01:11:26,700 --> 01:11:28,800 primaryTitle, sorry. 1481 01:11:28,800 --> 01:11:31,530 Let me fix that. primaryTitle = The Office. 1482 01:11:31,530 --> 01:11:33,150 There are all of those Offices. 1483 01:11:33,150 --> 01:11:37,230 And indeed, 2005 is probably the one we know and love. 1484 01:11:37,230 --> 01:11:44,370 If I go to a browser and go to IMDB slash title slash that ID, 1485 01:11:44,370 --> 01:11:46,620 indeed, that's probably the one we're all thinking of, 1486 01:11:46,620 --> 01:11:48,790 unless you voted for the UK version instead. 1487 01:11:48,790 --> 01:11:52,290 So again, this is actually real, live data that we're now playing with. 1488 01:11:52,290 --> 01:11:53,738 Well, what more can we do? 1489 01:11:53,738 --> 01:11:55,530 Well, there's one thing that I don't really 1490 01:11:55,530 --> 01:12:00,540 like about this, which is that when we select all of the columns-- 1491 01:12:00,540 --> 01:12:02,130 and let's go ahead and do this. 1492 01:12:02,130 --> 01:12:04,440 Let's select another hundred of them before. 1493 01:12:04,440 --> 01:12:10,470 This feels a little messy that we have all of these nice, clean columns except 1494 01:12:10,470 --> 01:12:12,270 for when we get to genres. 1495 01:12:12,270 --> 01:12:14,902 Then we just have this arbitrary comma-separated list. 
1496 01:12:14,902 --> 01:12:16,860 Suppose, for instance, I want to search for all 1497 01:12:16,860 --> 01:12:19,820 of the comedies that came out in 2019. 1498 01:12:19,820 --> 01:12:27,870 I could say SELECT * FROM shows where genres = "Comedy" AND-- 1499 01:12:27,870 --> 01:12:29,880 turns out you can use conjunctions like this-- 1500 01:12:29,880 --> 01:12:32,970 startYear = 2019. 1501 01:12:32,970 --> 01:12:34,290 So that gives me a whole bunch. 1502 01:12:34,290 --> 01:12:35,490 Let's count them. 1503 01:12:35,490 --> 01:12:38,490 So COUNT this here, Enter. 1504 01:12:38,490 --> 01:12:42,390 OK, a thousand of those 6,000 shows are comedies. 1505 01:12:42,390 --> 01:12:44,850 But I think that's an underestimate. 1506 01:12:44,850 --> 01:12:48,810 Why is this query buggy at the moment? 1507 01:12:48,810 --> 01:12:49,458 Yeah? 1508 01:12:49,458 --> 01:12:50,587 AUDIENCE: [INAUDIBLE] 1509 01:12:50,587 --> 01:12:52,920 DAVID MALAN: Yeah, some of them had more than one genre, 1510 01:12:52,920 --> 01:12:56,070 so comedy is somewhere in that comma-separated list. 1511 01:12:56,070 --> 01:13:00,450 And so what I should probably do instead is not say genre = "Comedy," 1512 01:13:00,450 --> 01:13:05,490 but maybe genres LIKE "Comedy," and allow something maybe to appear before, 1513 01:13:05,490 --> 01:13:07,140 something maybe to appear after. 1514 01:13:07,140 --> 01:13:11,040 And that's going to give me 1,593 comedies that 1515 01:13:11,040 --> 01:13:13,290 came out this year in 2019. 1516 01:13:13,290 --> 01:13:14,820 So that seems a little better. 1517 01:13:14,820 --> 01:13:17,220 But this is not very robust, right? 1518 01:13:17,220 --> 01:13:19,890 Once you start resorting to techniques like this, 1519 01:13:19,890 --> 01:13:22,920 it should start, as a programmer, to rub you the wrong way. 1520 01:13:22,920 --> 01:13:24,168 It's kind of a hack, right? 
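The difference between `=` and `LIKE '%Comedy%'` is easy to see on a tiny, made-up table: equality only matches rows whose entire genres string is "Comedy", while LIKE with percent wildcards also catches multi-genre rows.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE shows (primaryTitle TEXT, startYear NUMERIC, genres TEXT)"
)
db.executemany(
    "INSERT INTO shows VALUES (?, ?, ?)",
    [
        ("Show A", 2019, "Comedy"),        # exactly "Comedy"
        ("Show B", 2019, "Comedy,Drama"),  # comedy among other genres
        ("Show C", 2019, "Drama"),         # not a comedy at all
    ],
)

# Equality misses the multi-genre row.
exact = db.execute(
    "SELECT COUNT(*) FROM shows WHERE genres = 'Comedy' AND startYear = 2019"
).fetchone()[0]

# LIKE with % wildcards on either side catches it.
fuzzy = db.execute(
    "SELECT COUNT(*) FROM shows "
    "WHERE genres LIKE '%Comedy%' AND startYear = 2019"
).fetchone()[0]

print(exact, fuzzy)  # 1 2
```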
1521 01:13:24,168 --> 01:13:27,210 Like, you're searching for comedy, but there could be something before it 1522 01:13:27,210 --> 01:13:29,640 or something after it, and odds are, there 1523 01:13:29,640 --> 01:13:32,400 is no other word I can think of in the world of genres 1524 01:13:32,400 --> 01:13:36,142 that starts with or ends with comedy, so we're probably OK. 1525 01:13:36,142 --> 01:13:38,850 But this is kind of hack-ish, that you're just kind of searching. 1526 01:13:38,850 --> 01:13:44,790 It would be nice if we could just search for a specific column called genre. 1527 01:13:44,790 --> 01:13:47,170 So how can we go about doing that? 1528 01:13:47,170 --> 01:13:50,670 Well, let me go ahead and do the following instead. 1529 01:13:50,670 --> 01:13:54,570 Let me go ahead and open up a final version of my import script, 1530 01:13:54,570 --> 01:13:58,020 this one that does two things up top. 1531 01:13:58,020 --> 01:14:02,780 At the top, I'm going to create two tables, one called shows, which has-- 1532 01:14:02,780 --> 01:14:04,950 I'm going to clean up the column names, too. 1533 01:14:04,950 --> 01:14:06,930 IMDB is a little nonconventional. 1534 01:14:06,930 --> 01:14:10,290 What most people would do when describing a unique identifier, 1535 01:14:10,290 --> 01:14:12,210 they're going to call it ID, not tconst. 1536 01:14:12,210 --> 01:14:13,895 So we're going to rename it to ID. 1537 01:14:13,895 --> 01:14:16,020 They're not going to call their title primaryTitle. 1538 01:14:16,020 --> 01:14:18,660 They're going to call it title, so we're going to rename it title. 1539 01:14:18,660 --> 01:14:19,770 They're not going to name it startYear. 1540 01:14:19,770 --> 01:14:20,937 We're going to call it year. 1541 01:14:20,937 --> 01:14:22,060 And then that's it. 1542 01:14:22,060 --> 01:14:24,270 We'll come back to primary key in just a moment. 1543 01:14:24,270 --> 01:14:25,530 But notice this. 
1544 01:14:25,530 --> 01:14:28,320 In my new and final version of this script, I'm creating, 1545 01:14:28,320 --> 01:14:33,060 I propose, a second table called genres whose purpose in life 1546 01:14:33,060 --> 01:14:38,250 is to contain a value called show_id and another one called genre. 1547 01:14:38,250 --> 01:14:39,750 So what's going on? 1548 01:14:39,750 --> 01:14:42,810 Well, let me go ahead and show this. 1549 01:14:42,810 --> 01:14:48,930 If I load the resulting database from this one, shows4.db, and I do .schema, 1550 01:14:48,930 --> 01:14:50,850 you'll see that I indeed have two tables. 1551 01:14:50,850 --> 01:14:55,150 Let me go ahead and SELECT * FROM shows WHERE title, this time, 1552 01:14:55,150 --> 01:14:59,410 because I've renamed it from primaryTitle, = The Office. 1553 01:14:59,410 --> 01:14:59,910 OK. 1554 01:14:59,910 --> 01:15:01,050 That's a lot of Offices. 1555 01:15:01,050 --> 01:15:06,700 But let's go, AND year = 2005, which is the one we're all thinking about. 1556 01:15:06,700 --> 01:15:07,860 And it's that one. 1557 01:15:07,860 --> 01:15:09,810 And now, notice this. 1558 01:15:09,810 --> 01:15:12,360 Notice that I'm getting back what? 1559 01:15:12,360 --> 01:15:19,590 An ID, I'm getting back a title, and I'm getting back a year, but no genres. 1560 01:15:19,590 --> 01:15:24,053 That's because there's another table now called genres that's separate. 1561 01:15:24,053 --> 01:15:25,470 And you know, I'm kind of curious. 1562 01:15:25,470 --> 01:15:27,763 I see that a genre table has show_id. 1563 01:15:27,763 --> 01:15:28,680 Let me go and do this. 1564 01:15:28,680 --> 01:15:34,200 SELECT * FROM genres WHERE show_id =, and let 1565 01:15:34,200 --> 01:15:37,830 me do a little copy paste here, = this show_id. 1566 01:15:37,830 --> 01:15:39,780 And what might I see? 1567 01:15:39,780 --> 01:15:41,100 Comedy. 1568 01:15:41,100 --> 01:15:42,840 So what have we done now? 
1569 01:15:42,840 --> 01:15:46,170 For any TV show that was in IMDB's database that 1570 01:15:46,170 --> 01:15:50,010 had a comma-separated list of genres, I've exploded it, so to speak. 1571 01:15:50,010 --> 01:15:52,170 I've split that value on the commas. 1572 01:15:52,170 --> 01:15:56,070 And if the show is a comedy, I've added a row in this genres table, 1573 01:15:56,070 --> 01:15:59,310 but then I've jotted down the show's ID next to that genre 1574 01:15:59,310 --> 01:16:02,670 so I remember that that show was of that genre. 1575 01:16:02,670 --> 01:16:05,770 But if another show has multiple fields-- for instance, 1576 01:16:05,770 --> 01:16:10,980 let's go ahead and search for not The Office but, say, The Crown. 1577 01:16:10,980 --> 01:16:12,540 And there's only one of those. 1578 01:16:12,540 --> 01:16:20,190 And now I do SELECT * FROM genres WHERE show_id = this number, 1579 01:16:20,190 --> 01:16:25,360 we'll see that, oh, The Crown now has drama and history as a genre. 1580 01:16:25,360 --> 01:16:27,570 And so therefore, in the genres table, notice 1581 01:16:27,570 --> 01:16:30,690 that there's two IDs and two genres. 1582 01:16:30,690 --> 01:16:32,220 But now we can use equality. 1583 01:16:32,220 --> 01:16:39,300 I can now search for all of the comedies from 2019 in kind of a powerful way, 1584 01:16:39,300 --> 01:16:47,010 SELECT * FROM shows WHERE id IN-- 1585 01:16:47,010 --> 01:16:48,330 and here's the cool part-- 1586 01:16:48,330 --> 01:17:01,020 SELECT show_id FROM genres WHERE genre = "Comedy" AND year = 2019. 1587 01:17:01,020 --> 01:17:03,960 So this is admittedly a mouthful, but let's consider what's happening. 1588 01:17:03,960 --> 01:17:05,910 First, I'm selecting star from shows. 1589 01:17:05,910 --> 01:17:08,760 That means give me all the shows in all of the columns.
1590 01:17:08,760 --> 01:17:14,640 But filter as follows, only show me those shows where the ID of the show 1591 01:17:14,640 --> 01:17:17,170 is in the following list of IDs. 1592 01:17:17,170 --> 01:17:20,610 Now, you can look at a nested query inside the parentheses here. 1593 01:17:20,610 --> 01:17:25,140 This list here selects all of the show IDs from the genres table 1594 01:17:25,140 --> 01:17:27,380 where genre = "Comedy". 1595 01:17:27,380 --> 01:17:31,770 So the highlighted parenthetical right now returns essentially a list 1596 01:17:31,770 --> 01:17:35,130 of all of the ID numbers of shows that are associated with comedy, 1597 01:17:35,130 --> 01:17:37,900 even if they're associated with other things, too. 1598 01:17:37,900 --> 01:17:41,400 And we're making sure that the year equals 2019. 1599 01:17:41,400 --> 01:17:46,290 So if I now hit Enter, we'll see a whole bunch of results, but we should see, 1600 01:17:46,290 --> 01:17:49,110 if I count these by using my usual syntax, 1601 01:17:49,110 --> 01:17:54,120 there were 1,593 shows that are comedies in 2019. 1602 01:17:54,120 --> 01:17:58,200 That does happen to equal the same count we did earlier by using like, 1603 01:17:58,200 --> 01:18:01,560 but this is better designed in the sense that there's no ambiguity. 1604 01:18:01,560 --> 01:18:03,730 You're not just hackishly looking for a substring, 1605 01:18:03,730 --> 01:18:06,060 so to speak, in a comma-separated list. 1606 01:18:06,060 --> 01:18:08,580 You can actually now search more robustly 1607 01:18:08,580 --> 01:18:10,710 by having redesigned your data. 1608 01:18:10,710 --> 01:18:13,830 And what we've done really is something like this. 
1609 01:18:13,830 --> 01:18:18,150 Instead of storing our data in just one table called shows, 1610 01:18:18,150 --> 01:18:22,230 and every show has an ID, a title, a year, and genres, 1611 01:18:22,230 --> 01:18:26,880 we've instead exploded those genres into two separate tables, such 1612 01:18:26,880 --> 01:18:29,650 that now our shows table looks like this. 1613 01:18:29,650 --> 01:18:32,200 We have an ID, a title, and a year. 1614 01:18:32,200 --> 01:18:37,350 And notice that the ID of a show can also appear over here 1615 01:18:37,350 --> 01:18:40,950 in another table called genres, and just by convention, to keep us sane, 1616 01:18:40,950 --> 01:18:43,380 instead of calling it ID here, we've called 1617 01:18:43,380 --> 01:18:48,060 it show_id to connote that it came from a table called shows, plural, 1618 01:18:48,060 --> 01:18:50,010 but it's a single ID from a show. 1619 01:18:50,010 --> 01:18:55,050 So by convention, humans often call them table name, minus the S, underscore ID. 1620 01:18:55,050 --> 01:18:59,280 And then the keyword here or genre is comedy or drama or documentary 1621 01:18:59,280 --> 01:19:01,443 or some other genre as well. 1622 01:19:01,443 --> 01:19:03,360 And now, this is a little subtle, but the fact 1623 01:19:03,360 --> 01:19:05,940 that this little symbol here, drawn from the database world, 1624 01:19:05,940 --> 01:19:08,370 flails out into three separate places, this 1625 01:19:08,370 --> 01:19:11,370 is a one-to-many relationship, so to speak. 1626 01:19:11,370 --> 01:19:15,630 You can have one show over here mapping to many genres over here, 1627 01:19:15,630 --> 01:19:21,090 or maybe zero, but it's zero or more possible genres. 1628 01:19:21,090 --> 01:19:21,990 All right. 1629 01:19:21,990 --> 01:19:26,720 Any questions just yet? 1630 01:19:26,720 --> 01:19:27,220 All right. 
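The whole redesign just described, splitting the comma-separated genres into a second table and then searching with a nested query, can be sketched end to end with Python's built-in sqlite3 module. The database here is in-memory and the rows (including the show titled "Some Comedy") are made-up stand-ins for IMDb's data:

```python
import sqlite3

db = sqlite3.connect(":memory:")  # stand-in for shows4.db

# One table for shows, one for the exploded genres
db.execute("CREATE TABLE shows (id INTEGER PRIMARY KEY, title TEXT, year INTEGER)")
db.execute("CREATE TABLE genres (show_id INTEGER, genre TEXT)")

# IMDb-style comma-separated genre strings, split into one row apiece
sample = [(1, "The Office", 2005, "Comedy"),
          (2, "The Crown", 2016, "Drama,History"),
          (3, "Some Comedy", 2019, "Comedy")]
for show_id, title, year, genre_list in sample:
    db.execute("INSERT INTO shows VALUES (?, ?, ?)", (show_id, title, year))
    for genre in genre_list.split(","):
        db.execute("INSERT INTO genres VALUES (?, ?)", (show_id, genre))

# The parenthetical yields the IDs of every comedy; the outer query
# then keeps only those shows whose year is 2019
comedies = db.execute("""
    SELECT * FROM shows
    WHERE id IN (SELECT show_id FROM genres WHERE genre = 'Comedy')
    AND year = 2019
""").fetchall()
print(comedies)  # [(3, 'Some Comedy', 2019)]
```

Note that the year filter lives in the outer query, since the genres table has no year column; only shows does.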
1631 01:19:27,220 --> 01:19:30,430 So the real power then, to be clear, is coming, now, 1632 01:19:30,430 --> 01:19:31,870 from this kind of expressiveness. 1633 01:19:31,870 --> 01:19:34,495 So now, let's play around with some other queries and features. 1634 01:19:34,495 --> 01:19:37,330 But first, let's give ourselves a bit more data. 1635 01:19:37,330 --> 01:19:40,420 It turns out besides the file called title.basics.tsv, 1636 01:19:40,420 --> 01:19:44,140 IMDB.com makes a bunch of others available to us as well. 1637 01:19:44,140 --> 01:19:46,900 There is one called name.basics.tsv. 1638 01:19:46,900 --> 01:19:49,420 And this is one that has information on all 1639 01:19:49,420 --> 01:19:53,690 of the names of actors and actresses and directors and writers 1640 01:19:53,690 --> 01:19:54,940 and other people in the world. 1641 01:19:54,940 --> 01:19:58,600 So for instance, there is an nconst, which is like a name constant 1642 01:19:58,600 --> 01:20:03,190 or an ID, nm2946516, which happens to belong 1643 01:20:03,190 --> 01:20:06,550 to the actor whose primary name is Claire Foy, the star of The Crown. 1644 01:20:06,550 --> 01:20:10,990 She was born in 1984, and there are some other fields in that file as well. 1645 01:20:10,990 --> 01:20:14,560 But also juicy is this file, title.principals.tsv, 1646 01:20:14,560 --> 01:20:16,690 and this is where it gets interesting, too. 1647 01:20:16,690 --> 01:20:20,450 In this file, notice there are no actual titles. 1648 01:20:20,450 --> 01:20:21,740 There's no primary titles. 1649 01:20:21,740 --> 01:20:23,800 There's no actual human names. 1650 01:20:23,800 --> 01:20:28,810 Instead, there's just two unique identifiers, a tconst and an nconst, 1651 01:20:28,810 --> 01:20:34,240 which IMDB speak for a title identifier and a name identifier. 
1652 01:20:34,240 --> 01:20:37,450 So for instance, in one of the rows in this TSV file called 1653 01:20:37,450 --> 01:20:42,790 title.principals.tsv, there is a row that starts with tt4786824; 1654 01:20:42,790 --> 01:20:48,220 also has nm2946516, and has the word actress, 1655 01:20:48,220 --> 01:20:52,090 thereby implying that if you look up the nm-- 1656 01:20:52,090 --> 01:20:57,280 the nconst in the names file and you look up the tconst in the titles file, 1657 01:20:57,280 --> 01:21:03,600 you will be able to, by transitivity, infer that Claire Foy is in The Crown. 1658 01:21:03,600 --> 01:21:06,670 This allows us to have a many-to-many relationship. 1659 01:21:06,670 --> 01:21:10,360 One movie or show can have many actors, 1660 01:21:10,360 --> 01:21:16,030 and one actor can be in many shows, so we're using a sort of join file here, 1661 01:21:16,030 --> 01:21:18,070 a join table, that's going to somehow allow 1662 01:21:18,070 --> 01:21:21,590 us to link two different data sets together, and more on that in a moment. 1663 01:21:21,590 --> 01:21:24,700 But what's really fun in IMDB is that it also has a bunch of ratings 1664 01:21:24,700 --> 01:21:28,630 that humans have typed in saying, I give this 10 out of 10, a 0 out of 10, 1665 01:21:28,630 --> 01:21:29,200 and so forth. 1666 01:21:29,200 --> 01:21:31,950 And they keep track of the number of votes that shows have gotten. 1667 01:21:31,950 --> 01:21:36,370 And so in title.ratings.tsv, yet another file you can download from IMDB, 1668 01:21:36,370 --> 01:21:40,120 you can look up a given tconst, the unique identifier for a title, 1669 01:21:40,120 --> 01:21:42,940 what its average rating is and the number of votes.
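That transitivity trick can be sketched in miniature. The IDs below are the numeric parts of the tconst and nconst values mentioned above, and the tables are simplified stand-ins for the real titles, names, and principals files:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE shows (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT);
    -- Join table: one row per (title, person) pairing, so one show can
    -- have many people and one person can be in many shows
    CREATE TABLE stars (show_id INTEGER, person_id INTEGER);
    INSERT INTO shows VALUES (4786824, 'The Crown');
    INSERT INTO people VALUES (2946516, 'Claire Foy');
    INSERT INTO stars VALUES (4786824, 2946516);
""")

# Look up each identifier in its own table to infer who is in what
show_id, person_id = db.execute("SELECT * FROM stars").fetchone()
name = db.execute("SELECT name FROM people WHERE id = ?", (person_id,)).fetchone()[0]
title = db.execute("SELECT title FROM shows WHERE id = ?", (show_id,)).fetchone()[0]
print(name, "is in", title)  # Claire Foy is in The Crown
```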
1670 01:21:42,940 --> 01:21:46,660 And in fact, if I pull this up, for instance, on The Crown, if I go back 1671 01:21:46,660 --> 01:21:48,880 to IMDB itself-- 1672 01:21:48,880 --> 01:21:53,930 IMDB, search for The Crown, the 2016 version, 1673 01:21:53,930 --> 01:21:59,410 you'll see that indeed, it is an 8.7, which lines up with exactly what 1674 01:21:59,410 --> 01:22:00,210 we have here. 1675 01:22:00,210 --> 01:22:03,580 But over time, that number is going to go up or down, because IMDB is updating 1676 01:22:03,580 --> 01:22:06,280 their data set every day as well. 1677 01:22:06,280 --> 01:22:09,730 So besides this data, we also then have the ability 1678 01:22:09,730 --> 01:22:12,580 to consider what this all looks like collectively. 1679 01:22:12,580 --> 01:22:14,920 So in this case here, here is another diagram. 1680 01:22:14,920 --> 01:22:18,610 It's more complicated now, but it just captures the intuition 1681 01:22:18,610 --> 01:22:22,450 that you would derive by just reading through IMDB's documentation, which 1682 01:22:22,450 --> 01:22:23,800 defines the following. 1683 01:22:23,800 --> 01:22:26,420 It turns out if you read closely among those files, 1684 01:22:26,420 --> 01:22:29,710 you'll see that, oh, we can glean a whole bunch of shows that are going 1685 01:22:29,710 --> 01:22:32,050 to have IDs, title, year, and episodes. 1686 01:22:32,050 --> 01:22:37,840 I want to go ahead and associate those shows with a whole bunch of stars. 1687 01:22:37,840 --> 01:22:42,170 But people are the entities in the world that have IDs, names, and birth. 1688 01:22:42,170 --> 01:22:43,910 So now things get a little weird. 1689 01:22:43,910 --> 01:22:46,300 Let's focus only on these two tables here. 1690 01:22:46,300 --> 01:22:48,340 This is a diagram of the goal at hand, and this 1691 01:22:48,340 --> 01:22:49,632 is a script I wrote in advance. 1692 01:22:49,632 --> 01:22:53,150 And in a moment, we'll open up a SQL database that represents this. 
1693 01:22:53,150 --> 01:22:56,290 There's going to be a table called shows, every row of which 1694 01:22:56,290 --> 01:22:59,350 has an ID, title, year, and some number of episodes, 1695 01:22:59,350 --> 01:23:00,962 so you can see how long-running it is. 1696 01:23:00,962 --> 01:23:02,920 There's also going to be a table called people. 1697 01:23:02,920 --> 01:23:05,080 Claire Foy is going to be among them. 1698 01:23:05,080 --> 01:23:09,610 She and the other humans will have an ID, name, and birth year associated 1699 01:23:09,610 --> 01:23:10,960 with them as well. 1700 01:23:10,960 --> 01:23:14,230 But there's going to be this other table here called stars. 1701 01:23:14,230 --> 01:23:19,390 And you'll notice there's a line that links shows to stars to people, 1702 01:23:19,390 --> 01:23:21,190 again by transitivity. 1703 01:23:21,190 --> 01:23:27,580 If there is, in the stars table, a show ID, a.k.a. tconst, and a person ID, 1704 01:23:27,580 --> 01:23:28,260 a.k.a. 1705 01:23:28,260 --> 01:23:32,260 nconst, that links, for instance, Claire Foy to The Crown. 1706 01:23:32,260 --> 01:23:34,480 It's going to link Steve Carell to The Office. 1707 01:23:34,480 --> 01:23:38,450 It's going to link every other actor to their show as well. 1708 01:23:38,450 --> 01:23:41,770 Similarly for writers, we won't play too much with the writers today, 1709 01:23:41,770 --> 01:23:45,760 but writers are people, too, just as stars are people. 1710 01:23:45,760 --> 01:23:48,970 And so here's another feature or design goal of SQL. 1711 01:23:48,970 --> 01:23:52,150 You want to ideally factor out the commonalities, so 1712 01:23:52,150 --> 01:23:55,450 that Claire Foy appears only in one place by name, 1713 01:23:55,450 --> 01:23:58,965 but her unique identifier might appear in bunches of places. 1714 01:23:58,965 --> 01:24:01,840 There's a lot of actors in the world who are also writers themselves. 
1715 01:24:01,840 --> 01:24:03,970 They are, at the end of the day, people, but they 1716 01:24:03,970 --> 01:24:08,200 might appear both in the stars table and in the writers table 1717 01:24:08,200 --> 01:24:10,300 by way of their person ID. 1718 01:24:10,300 --> 01:24:14,050 So the goal of SQL is to not copy and paste Claire Foy, Claire Foy, Claire 1719 01:24:14,050 --> 01:24:17,750 Foy, or Steve Carell, Steve Carell, Steve Carell, all over the place. 1720 01:24:17,750 --> 01:24:20,170 You want to have one authoritative place for all 1721 01:24:20,170 --> 01:24:23,320 of your people, one authoritative place for all of your shows, 1722 01:24:23,320 --> 01:24:25,180 and then you have these other tables called 1723 01:24:25,180 --> 01:24:28,940 join tables, which are similar in spirit to the TSV files 1724 01:24:28,940 --> 01:24:32,790 you can download that somehow link these identifiers together. 1725 01:24:32,790 --> 01:24:38,820 So if I, on my computer here, open a file that I came with in advance, 1726 01:24:38,820 --> 01:24:43,370 that you can also play with online as well, called, finally, shows.db. 1727 01:24:43,370 --> 01:24:45,080 And I do .schema on this one. 1728 01:24:45,080 --> 01:24:48,560 This is the largest of the databases, and this is the one that we, the staff, 1729 01:24:48,560 --> 01:24:51,345 actually generated from all of the TSV files online, 1730 01:24:51,345 --> 01:24:52,970 but we threw away a lot of information. 1731 01:24:52,970 --> 01:24:55,850 We threw away anything before 1970, and we filtered out 1732 01:24:55,850 --> 01:24:59,600 everything except TV series, so that we can actually 1733 01:24:59,600 --> 01:25:01,820 play with the data in this data set. 1734 01:25:01,820 --> 01:25:04,580 But honestly, using a black and white prompt and a terminal window 1735 01:25:04,580 --> 01:25:07,455 tends not to get fun, especially when the data flies over the screen.
1736 01:25:07,455 --> 01:25:10,413 So there are also things called GUIs, Graphical User Interfaces, 1737 01:25:10,413 --> 01:25:13,080 and indeed, there is a program that's freely available for Macs, 1738 01:25:13,080 --> 01:25:16,490 PCs, and other types of operating systems, called DB Browser. 1739 01:25:16,490 --> 01:25:18,480 And indeed, we'll point you at this online. 1740 01:25:18,480 --> 01:25:23,030 This is just a program that allows you to explore SQL files on your own Mac 1741 01:25:23,030 --> 01:25:24,890 or PC much more visibly-- 1742 01:25:24,890 --> 01:25:28,070 or much more visually and much more pleasantly than maybe just 1743 01:25:28,070 --> 01:25:29,870 a command line interface allows. 1744 01:25:29,870 --> 01:25:33,080 So let me go ahead and open up, for instance, shows.db. 1745 01:25:33,080 --> 01:25:34,770 And we'll see a whole bunch of things. 1746 01:25:34,770 --> 01:25:36,890 First of all, the graphical user interface 1747 01:25:36,890 --> 01:25:40,280 here shows me the same information, just in slightly prettier format. 1748 01:25:40,280 --> 01:25:44,450 Shows.db, per my diagram a moment ago, has six tables, people, 1749 01:25:44,450 --> 01:25:46,670 like I mentioned; shows, like I mentioned; 1750 01:25:46,670 --> 01:25:50,398 also, stars and writers, ratings, and then that separate table called genres. 1751 01:25:50,398 --> 01:25:52,190 And if you look over here on the right, you 1752 01:25:52,190 --> 01:25:56,360 can actually see the SQL code we wrote to create those tables. 1753 01:25:56,360 --> 01:26:00,350 But cooler than that is that notice these tabs up top here? 1754 01:26:00,350 --> 01:26:03,170 I am currently on Database Structure, and if you're using Windows, 1755 01:26:03,170 --> 01:26:05,180 your interface will look a little different, 1756 01:26:05,180 --> 01:26:06,740 but the options are still there. 1757 01:26:06,740 --> 01:26:08,960 I'm going to go ahead and click on Browse Data.
1758 01:26:08,960 --> 01:26:12,000 And now you'll see a little dropdown of all of the tables. 1759 01:26:12,000 --> 01:26:14,510 So if I want to go ahead and look at a whole bunch of shows, 1760 01:26:14,510 --> 01:26:18,680 I can actually see all of my data here, not unlike Google Spreadsheets. 1761 01:26:18,680 --> 01:26:23,450 But notice, this is 153,331 movies-- 1762 01:26:23,450 --> 01:26:27,380 or shows, rather, that I can see altogether here. 1763 01:26:27,380 --> 01:26:31,110 And lastly, what's cool is that if I go over to the SQL tab, 1764 01:26:31,110 --> 01:26:33,058 I can now execute some sample queries. 1765 01:26:33,058 --> 01:26:34,350 So let me go ahead and do this. 1766 01:26:34,350 --> 01:26:37,340 SELECT * FROM shows;. 1767 01:26:37,340 --> 01:26:39,530 This is going to give me a whole lot of shows. 1768 01:26:39,530 --> 01:26:40,130 Enter. 1769 01:26:40,130 --> 01:26:43,140 So I hit the little Play button that just executes that query, 1770 01:26:43,140 --> 01:26:46,980 and you see the resulting rows that have come back. 1771 01:26:46,980 --> 01:26:48,800 So again, how did we get to this point? 1772 01:26:48,800 --> 01:26:51,920 We, the staff, downloaded all of those TSV files in advance. 1773 01:26:51,920 --> 01:26:56,150 We wrote a Python script that imported all of the data from those files 1774 01:26:56,150 --> 01:26:58,880 into memory, threw away the stuff we didn't care about, 1775 01:26:58,880 --> 01:27:03,362 and then inserted it into tables like this table called shows. 1776 01:27:03,362 --> 01:27:06,320 And what's nice about this tool is, because if you are an aspiring data 1777 01:27:06,320 --> 01:27:09,020 scientist or you're just trying to get some analytical work done 1778 01:27:09,020 --> 01:27:11,690 for this class or any other, or any other project, 1779 01:27:11,690 --> 01:27:14,370 a graphical tool lends itself to just kind of poking around. 
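The staff's actual import script isn't shown here, so the following is only a minimal sketch of the pipeline just described: read the TSV, throw away rows that aren't TV series from 1970 on, and insert the rest. The IMDb column names (tconst, titleType, primaryTitle, startYear) come from the lecture, and the tiny two-row TSV is fabricated inline so the example runs on its own:

```python
import csv
import io
import sqlite3

# A tiny stand-in for title.basics.tsv (the real file is far larger)
tsv = io.StringIO(
    "tconst\ttitleType\tprimaryTitle\tstartYear\n"
    "tt0386676\ttvSeries\tThe Office\t2005\n"
    "tt0000001\tshort\tSome Short\t1894\n"
)

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE shows (id INTEGER PRIMARY KEY, title TEXT, year INTEGER)")

for row in csv.DictReader(tsv, delimiter="\t"):
    # Keep only TV series from 1970 onward, as the staff's script did
    if row["titleType"] != "tvSeries" or int(row["startYear"]) < 1970:
        continue
    # Strip the tt prefix so the ID becomes a proper number
    db.execute("INSERT INTO shows VALUES (?, ?, ?)",
               (int(row["tconst"][2:]), row["primaryTitle"], int(row["startYear"])))

print(db.execute("SELECT * FROM shows").fetchall())  # [(386676, 'The Office', 2005)]
```

The real files also contain \N placeholders for missing years, which an actual import script would have to handle before calling int().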
1780 01:27:14,370 --> 01:27:19,010 So for instance, you'll see that the shows table no longer has any genres. 1781 01:27:19,010 --> 01:27:20,010 But that's OK. 1782 01:27:20,010 --> 01:27:21,950 We can reconstitute that data. 1783 01:27:21,950 --> 01:27:25,670 I can go ahead and SELECT * FROM shows-- 1784 01:27:25,670 --> 01:27:27,390 but you know what I can do? 1785 01:27:27,390 --> 01:27:30,680 I can actually join in the genres table. 1786 01:27:30,680 --> 01:27:34,250 So I can take the shows table here, the genres table here, 1787 01:27:34,250 --> 01:27:37,640 and essentially link them together by way of the ID. 1788 01:27:37,640 --> 01:27:39,000 How is that possible? 1789 01:27:39,000 --> 01:27:43,010 Well, if you look at shows, a show has an ID, title, year, and episodes. 1790 01:27:43,010 --> 01:27:45,020 Genres has a show ID. 1791 01:27:45,020 --> 01:27:48,560 So if you think of ID as my left hand here, representing my fingers, 1792 01:27:48,560 --> 01:27:51,830 and show ID in genres as my right hand here, 1793 01:27:51,830 --> 01:27:54,830 what we want to do is line up ID with show ID 1794 01:27:54,830 --> 01:28:00,570 to make one larger, wider table that constitutes all of that data together. 1795 01:28:00,570 --> 01:28:01,890 So how do I do this? 1796 01:28:01,890 --> 01:28:04,130 Well, in SQL, you can join two tables. 1797 01:28:04,130 --> 01:28:07,430 You say what table you want to join with what other table and 1798 01:28:07,430 --> 01:28:08,570 how you want to do it. 1799 01:28:08,570 --> 01:28:16,740 Well, I want to go ahead and join it on shows.id = genres.show_id;. 1800 01:28:16,740 --> 01:28:19,700 And now, when I hit execute, it took a moment-- 1801 01:28:19,700 --> 01:28:21,682 indeed, it took 408 milliseconds. 1802 01:28:21,682 --> 01:28:24,140 But my god, that's a lot faster than writing a whole Python 1803 01:28:24,140 --> 01:28:25,160 script to do this.
1804 01:28:25,160 --> 01:28:29,220 Now I have a table with all of the shows as before. 1805 01:28:29,220 --> 01:28:30,740 But notice the table got wider. 1806 01:28:30,740 --> 01:28:33,590 This is a temporary table that SQL has returned to me. 1807 01:28:33,590 --> 01:28:36,077 This one now has genre and show ID. 1808 01:28:36,077 --> 01:28:37,910 So in fact, let me go ahead and filter this. 1809 01:28:37,910 --> 01:28:52,290 WHERE title = "The Office" AND year = 2005, Play, we'll just get this. 1810 01:28:52,290 --> 01:28:55,370 So notice I have a wider table containing all of the columns 1811 01:28:55,370 --> 01:28:57,350 from both of those joined tables. 1812 01:28:57,350 --> 01:29:03,170 But if I change this now to The Crown, and that was 2016, and hit play, 1813 01:29:03,170 --> 01:29:05,875 notice I get back seemingly redundant information. 1814 01:29:05,875 --> 01:29:08,000 But this might be useful nonetheless, because now I 1815 01:29:08,000 --> 01:29:13,040 can iterate over all of the rows knowing that every row has not only a title 1816 01:29:13,040 --> 01:29:16,220 and a year, but also a genre as well. 1817 01:29:16,220 --> 01:29:19,070 So I can reconstitute the table like that. 1818 01:29:19,070 --> 01:29:21,530 Well, how can I ask other questions as well, 1819 01:29:21,530 --> 01:29:26,030 like what are actors that are in other shows as well? 1820 01:29:26,030 --> 01:29:30,000 Well, let me go ahead and do something like this. 1821 01:29:30,000 --> 01:29:33,230 For instance, let's select all of Steve Carell's movies. 1822 01:29:33,230 --> 01:29:35,280 Or-- let's select Steve Carell himself first. 1823 01:29:35,280 --> 01:29:38,280 So SELECT * FROM-- 1824 01:29:38,280 --> 01:29:40,140 let's see-- no. 1825 01:29:40,140 --> 01:29:40,830 Let's do this. 1826 01:29:40,830 --> 01:29:41,910 How about Ellen? 1827 01:29:41,910 --> 01:29:49,405 So SELECT * from people where name = Ellen DeGeneres. 1828 01:29:49,405 --> 01:29:50,280 I spelled that right. 
1829 01:29:50,280 --> 01:29:52,150 Semicolon, Play. 1830 01:29:52,150 --> 01:29:52,650 All right. 1831 01:29:52,650 --> 01:29:56,040 So this is handy, because I now know that Ellen's birth year is 1958, 1832 01:29:56,040 --> 01:29:58,320 but her ID is 1122. 1833 01:29:58,320 --> 01:30:01,530 That's the same thing as that nconst, but we threw away the nm 1834 01:30:01,530 --> 01:30:05,520 and we got rid of all the zeros and made it into a proper number for efficiency, 1835 01:30:05,520 --> 01:30:08,340 better to uniquely identify humans, typically-- or anything-- 1836 01:30:08,340 --> 01:30:13,110 by numbers which fit in 32 or 64 bits rather than longer strings. 1837 01:30:13,110 --> 01:30:16,740 So now I know Ellen's ID is 1122, what can I do? 1838 01:30:16,740 --> 01:30:25,470 Well, let me go ahead and SELECT * FROM stars WHERE person_id = this. 1839 01:30:25,470 --> 01:30:26,970 This will tell me what? 1840 01:30:26,970 --> 01:30:31,920 This will tell me all of the information about shows that Ellen starred in, 1841 01:30:31,920 --> 01:30:34,150 including, presumably, her own. 1842 01:30:34,150 --> 01:30:37,230 OK, so I now see person_id is the same, the same, the same, 1843 01:30:37,230 --> 01:30:39,608 but Ellen is apparently in all of those shows. 1844 01:30:39,608 --> 01:30:41,400 But that's not that helpful, and also, this 1845 01:30:41,400 --> 01:30:44,070 is kind of lame that I've just hardcoded Ellen's ID. 1846 01:30:44,070 --> 01:30:45,550 But I don't have to do that. 1847 01:30:45,550 --> 01:30:55,020 I can do a subquery, SELECT id FROM people WHERE name = "Ellen DeGeneres", 1848 01:30:55,020 --> 01:30:57,330 closed parenthesis, Play. 1849 01:30:57,330 --> 01:31:00,480 Now it's dynamic, so now I've not hardcoded anything. 1850 01:31:00,480 --> 01:31:01,930 But this isn't that useful. 1851 01:31:01,930 --> 01:31:04,770 Let me go ahead and just select show_id here. 
1852 01:31:04,770 --> 01:31:08,910 So I now have SELECT show_id FROM stars WHERE the person_id 1853 01:31:08,910 --> 01:31:12,108 = whatever Ellen's ID is here. 1854 01:31:12,108 --> 01:31:13,650 How can I take this one step further? 1855 01:31:13,650 --> 01:31:19,980 Well, what if I do SELECT * FROM shows WHERE the ID of the show 1856 01:31:19,980 --> 01:31:23,400 is in the following list of values? 1857 01:31:23,400 --> 01:31:25,590 So not only does SQL support equals, when 1858 01:31:25,590 --> 01:31:28,530 you want to compare one value against the next, much like in Python, 1859 01:31:28,530 --> 01:31:33,360 you have the keyword IN, where you can say select everything from shows where 1860 01:31:33,360 --> 01:31:36,930 the ID of the show is in the following list of shows, 1861 01:31:36,930 --> 01:31:42,260 which happens to represent the list of show IDs that Ellen is in. 1862 01:31:42,260 --> 01:31:43,140 Phew. 1863 01:31:43,140 --> 01:31:45,150 Let's hit Play on this. 1864 01:31:45,150 --> 01:31:48,660 It took a moment, but it looks like these are all of the shows 1865 01:31:48,660 --> 01:31:51,120 that Ellen has been in, according to IMDB. 1866 01:31:51,120 --> 01:31:54,600 And it looks like The Ellen DeGeneres Show is one. 1867 01:31:54,600 --> 01:31:58,380 She's been on the air for 2,865 episodes. 1868 01:31:58,380 --> 01:32:01,480 There is Ellen, her original TV show, which was on for a few seasons back 1869 01:32:01,480 --> 01:32:01,980 then. 1870 01:32:01,980 --> 01:32:06,430 Looks like she's doing some game shows these days as of 2017, and so forth. 1871 01:32:06,430 --> 01:32:09,810 So using these step-by-step thought processes 1872 01:32:09,810 --> 01:32:12,300 can we actually build up more interesting queries 1873 01:32:12,300 --> 01:32:15,410 to get back information like that. 1874 01:32:15,410 --> 01:32:21,192 All right, any questions before we try a few others out as well? 1875 01:32:21,192 --> 01:32:21,860 No?
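The step-by-step buildup, from Ellen's hardcoded ID to a fully dynamic nested query, condenses to the sketch below. Only Ellen's ID of 1122 and birth year of 1958 come from the lecture; the shows, their IDs, and their years are invented:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT, birth INTEGER);
    CREATE TABLE shows (id INTEGER PRIMARY KEY, title TEXT, year INTEGER);
    CREATE TABLE stars (show_id INTEGER, person_id INTEGER);
    INSERT INTO people VALUES (1122, 'Ellen DeGeneres', 1958);
    INSERT INTO shows VALUES (7, 'The Ellen DeGeneres Show', 2003);
    INSERT INTO shows VALUES (8, 'Some Other Show', 2010);
    INSERT INTO stars VALUES (7, 1122);
""")

# Innermost query: Ellen's ID; middle: her show IDs; outer: those shows.
# Nothing is hardcoded, so this works for any name you substitute in.
rows = db.execute("""
    SELECT * FROM shows WHERE id IN (
        SELECT show_id FROM stars WHERE person_id IN (
            SELECT id FROM people WHERE name = 'Ellen DeGeneres'
        )
    )
""").fetchall()
print(rows)  # [(7, 'The Ellen DeGeneres Show', 2003)]
```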
1876 01:32:21,860 --> 01:32:22,360 All right. 1877 01:32:22,360 --> 01:32:25,270 Well, let me show one other approach to this same problem. 1878 01:32:25,270 --> 01:32:27,720 It turns out these are what are known as nested queries. 1879 01:32:27,720 --> 01:32:30,012 You keep using the parenthesization, much like in math, 1880 01:32:30,012 --> 01:32:32,490 where you can nest arithmetic expressions in parentheses. 1881 01:32:32,490 --> 01:32:33,900 You can do the same in SQL. 1882 01:32:33,900 --> 01:32:37,590 But you can also join information in slightly different ways as well. 1883 01:32:37,590 --> 01:32:40,480 I can actually do something like this. 1884 01:32:40,480 --> 01:32:47,670 Let me go ahead and SELECT Title FROM the following tables, 1885 01:32:47,670 --> 01:32:56,980 people JOIN stars ON people.id = stars.person_id-- 1886 01:32:56,980 --> 01:32:59,230 and we'll walk through this in just a second-- 1887 01:32:59,230 --> 01:33:03,210 JOIN-- sorry-- 1888 01:33:03,210 --> 01:33:14,640 JOIN shows ON stars.show_id = shows.id where name equals "Ellen DeGeneres." 1889 01:33:14,640 --> 01:33:18,780 This, while more of a mouthful, is equivalent to what I've just done. 1890 01:33:18,780 --> 01:33:21,450 Notice that I've selected the title of the show 1891 01:33:21,450 --> 01:33:24,630 from the following, the result of joining people on stars. 1892 01:33:24,630 --> 01:33:27,330 How do you join the people table and the stars table? 1893 01:33:27,330 --> 01:33:28,740 Well, people have IDs. 1894 01:33:28,740 --> 01:33:31,465 Stars have person IDs, according to the diagram. 1895 01:33:31,465 --> 01:33:33,090 What else do you want to join together? 1896 01:33:33,090 --> 01:33:35,310 Let's join all of that with shows. 1897 01:33:35,310 --> 01:33:36,000 How? 1898 01:33:36,000 --> 01:33:40,740 Well, let's go ahead and say stars.show_id = shows.id.
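Spelled out as runnable code against toy rows (invented, apart from Ellen's ID and birth year from earlier), that same explicit join looks like this:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT, birth INTEGER);
    CREATE TABLE shows (id INTEGER PRIMARY KEY, title TEXT, year INTEGER);
    CREATE TABLE stars (show_id INTEGER, person_id INTEGER);
    INSERT INTO people VALUES (1122, 'Ellen DeGeneres', 1958);
    INSERT INTO shows VALUES (7, 'The Ellen DeGeneres Show', 2003);
    INSERT INTO stars VALUES (7, 1122);
""")

# Link people to stars on the person ID, that result to shows on the
# show ID, and only then filter the joined rows by name
titles = db.execute("""
    SELECT title FROM people
    JOIN stars ON people.id = stars.person_id
    JOIN shows ON stars.show_id = shows.id
    WHERE name = 'Ellen DeGeneres'
""").fetchall()
print(titles)  # [('The Ellen DeGeneres Show',)]
```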
1899 01:33:40,740 --> 01:33:43,770 So it's a way of linking, transitively, multiple tables together, 1900 01:33:43,770 --> 01:33:47,910 and then filter all of that with the name of Ellen DeGeneres. 1901 01:33:47,910 --> 01:33:50,580 This is what we would call an explicit join. 1902 01:33:50,580 --> 01:33:54,210 And if I click Play there, notice it's taking a moment. 1903 01:33:54,210 --> 01:33:58,530 Took, in fact, 1,990 milliseconds, almost two human seconds. 1904 01:33:58,530 --> 01:34:01,800 So it got slower, but it does give me another syntax 1905 01:34:01,800 --> 01:34:02,910 via which to select data. 1906 01:34:02,910 --> 01:34:07,380 And honestly, what's cool about this is I can select other fields as well. 1907 01:34:07,380 --> 01:34:12,240 They don't have to come from the original query in my nested selects. 1908 01:34:12,240 --> 01:34:15,060 But let me go ahead and do this, just as a teaser, and then 1909 01:34:15,060 --> 01:34:17,850 we'll consider some problems before wrapping up. 1910 01:34:17,850 --> 01:34:20,130 So it turns out 2 millisecond-- 1911 01:34:20,130 --> 01:34:25,590 2 seconds is kind of slow, and if I do this again, it took 2,029 milliseconds. 1912 01:34:25,590 --> 01:34:29,345 If I do it again, it took 1,963 milliseconds, 1913 01:34:29,345 --> 01:34:30,720 and I'm looking at the time here. 1914 01:34:30,720 --> 01:34:31,530 That's slow, right? 1915 01:34:31,530 --> 01:34:34,030 That means if you have a finite number of servers in the world 1916 01:34:34,030 --> 01:34:36,720 and each of them can only handle some number of users at a time, 1917 01:34:36,720 --> 01:34:41,700 it's kind of a waste of hardware and of money, arguably, to spend more seconds 1918 01:34:41,700 --> 01:34:46,050 or milliseconds servicing one query from a given user on your app or your web 1919 01:34:46,050 --> 01:34:47,100 application. 1920 01:34:47,100 --> 01:34:49,950 But it turns out we can do something kind of smart here. 
1921 01:34:49,950 --> 01:34:54,340 Notice that in our database structure, we've done a few things here. 1922 01:34:54,340 --> 01:34:57,570 Let me open up, for instance, people. 1923 01:34:57,570 --> 01:35:02,990 Rather, let me go over to people, and you'll see a few keywords 1924 01:35:02,990 --> 01:35:04,460 that I ignored earlier. 1925 01:35:04,460 --> 01:35:07,640 When you're defining a table, you can specify that one of your columns 1926 01:35:07,640 --> 01:35:09,253 is what's called a primary key. 1927 01:35:09,253 --> 01:35:12,420 That is the column by which you're going to uniquely identify all your data, 1928 01:35:12,420 --> 01:35:15,620 so it's like those numeric IDs we've given every person and show. 1929 01:35:15,620 --> 01:35:18,020 A foreign key is what we call the same number 1930 01:35:18,020 --> 01:35:19,590 when it appears in some other table. 1931 01:35:19,590 --> 01:35:24,440 So when we saw a person_id or show_id, those are what are called foreign keys, 1932 01:35:24,440 --> 01:35:27,140 because it's the same numbers, but in another table 1933 01:35:27,140 --> 01:35:29,780 that we're using to join things together, eventually. 1934 01:35:29,780 --> 01:35:31,718 And you can also define columns to be unique, 1935 01:35:31,718 --> 01:35:33,260 or you can index them for efficiency. 1936 01:35:33,260 --> 01:35:35,750 And that's, in fact, where we're going right now. 1937 01:35:35,750 --> 01:35:38,690 If you look at the same query, you'll notice that I'm 1938 01:35:38,690 --> 01:35:40,640 searching on a bunch of columns. 1939 01:35:40,640 --> 01:35:43,380 I'm clearly searching on the name field. 1940 01:35:43,380 --> 01:35:48,050 I'm also searching on-- that is, joining on-- show_id, 1941 01:35:48,050 --> 01:35:52,740 and I'm searching on person_id, not to mention the individual ID fields. 
1942 01:35:52,740 --> 01:35:55,040 But the cool thing about a primary key, which 1943 01:35:55,040 --> 01:35:59,270 we have in advance defined all of our ID columns, all of the columns 1944 01:35:59,270 --> 01:36:03,500 called ID, to be primary keys, you get a feature for free with SQL. 1945 01:36:03,500 --> 01:36:06,530 SQL builds up what's called an index, a very fancy data 1946 01:36:06,530 --> 01:36:09,950 structure, a tree-like structure, that actually allows you to search 1947 01:36:09,950 --> 01:36:12,240 for information quite efficiently. 1948 01:36:12,240 --> 01:36:14,240 So when you define a column to be a primary key, 1949 01:36:14,240 --> 01:36:15,770 you get what's called an index. 1950 01:36:15,770 --> 01:36:19,430 Specifically, if we go back a few weeks to our discussion of trees, 1951 01:36:19,430 --> 01:36:20,248 it looks like this. 1952 01:36:20,248 --> 01:36:22,790 This is what's called a B-tree, which is not a binary tree. 1953 01:36:22,790 --> 01:36:25,460 It's a B-tree in the sense that it's got a lot of nodes 1954 01:36:25,460 --> 01:36:28,130 that might have one or two or more children each, 1955 01:36:28,130 --> 01:36:32,270 but it's very short, with a lot of nodes and a lot of width. 1956 01:36:32,270 --> 01:36:35,570 So that means when you store your data in a tree-like structure, 1957 01:36:35,570 --> 01:36:38,870 long story short, it just tends to be very efficiently searchable. 1958 01:36:38,870 --> 01:36:40,970 So when you define a column to be a primary key, 1959 01:36:40,970 --> 01:36:43,460 you get that speed for free. 1960 01:36:43,460 --> 01:36:46,520 But we can also tell our SQL database, you know what? 1961 01:36:46,520 --> 01:36:51,330 I plan on searching on person IDs and show IDs and names, also, 1962 01:36:51,330 --> 01:36:53,990 so let's go ahead and create myself some indexes. 1963 01:36:53,990 --> 01:36:57,530 Let me go ahead and execute the following queries just once.
1964 01:36:57,530 --> 01:37:04,490 Let me create something called an index called person_index on the stars table 1965 01:37:04,490 --> 01:37:07,040 using its person_id column. 1966 01:37:07,040 --> 01:37:08,810 So pretty quick syntax. 1967 01:37:08,810 --> 01:37:12,230 This means create a B-tree, create a fancy tree structure, 1968 01:37:12,230 --> 01:37:16,730 called person index, on the stars table by person-- 1969 01:37:16,730 --> 01:37:18,020 on the person_id column. 1970 01:37:18,020 --> 01:37:21,810 That is, make a table like this in memory to store all of the person IDs. 1971 01:37:21,810 --> 01:37:22,310 Why? 1972 01:37:22,310 --> 01:37:26,450 So I can find them faster than linear search in the column itself. 1973 01:37:26,450 --> 01:37:27,955 Let me go ahead and execute this. 1974 01:37:27,955 --> 01:37:29,330 And you'll see it takes a moment. 1975 01:37:29,330 --> 01:37:29,930 It's thinking. 1976 01:37:29,930 --> 01:37:31,510 It took me 1.6 seconds. 1977 01:37:31,510 --> 01:37:33,772 So a little slow, but I only have to do this once. 1978 01:37:33,772 --> 01:37:34,730 Let me create two more. 1979 01:37:34,730 --> 01:37:39,860 CREATE INDEX called show_index ON stars ON the show_id column. 1980 01:37:39,860 --> 01:37:42,080 So almost the same, but this just means give me 1981 01:37:42,080 --> 01:37:44,780 a tree that looks like this in memory, so 1982 01:37:44,780 --> 01:37:48,360 that when I query for data like Ellen DeGeneres, 1983 01:37:48,360 --> 01:37:52,640 it searches a tree-like structure instead of a list-like structure 1984 01:37:52,640 --> 01:37:53,960 in a column alone. 1985 01:37:53,960 --> 01:37:54,890 Let me execute that. 1986 01:37:54,890 --> 01:37:58,440 And I screwed up ON stars show_ID. 1987 01:37:58,440 --> 01:37:59,100 There we go. 1988 01:37:59,100 --> 01:38:00,210 That took 1 second. 
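The CREATE INDEX statements dictated above can be sketched against a toy stars table (the real shows.db is of course much larger). EXPLAIN QUERY PLAN is SQLite's way of showing whether a query will scan the whole table or search an index.

```python
import sqlite3

# Toy stars table with a grid of invented show/person pairs.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE stars (show_id INTEGER, person_id INTEGER)")
db.executemany("INSERT INTO stars VALUES (?, ?)",
               [(s, p) for s in range(100) for p in range(100)])

# The two indexes created in lecture, same syntax.
db.execute("CREATE INDEX person_index ON stars (person_id)")
db.execute("CREATE INDEX show_index ON stars (show_id)")

# EXPLAIN QUERY PLAN confirms SQLite now searches the B-tree index
# rather than scanning every row of the table.
row = db.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM stars WHERE person_id = 5"
).fetchone()
detail = row[-1]
print(detail)  # a SEARCH line mentioning person_index
```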
1989 01:38:00,210 --> 01:38:07,320 And lastly, let's create an index called name index on people on the name column 1990 01:38:07,320 --> 01:38:10,620 so that I can search for people by name more efficiently as well. 1991 01:38:10,620 --> 01:38:13,920 Otherwise, without these indexes on a column like name, 1992 01:38:13,920 --> 01:38:17,310 it is going to check every damn cell in the column looking 1993 01:38:17,310 --> 01:38:21,180 for Ellen DeGeneres, Ellen DeGeneres, Ellen DeGeneres, using big O of N, 1994 01:38:21,180 --> 01:38:24,070 running time, or linear search. 1995 01:38:24,070 --> 01:38:26,520 So now, let me go back to my query here. 1996 01:38:26,520 --> 01:38:28,440 I've not made any changes to the query. 1997 01:38:28,440 --> 01:38:31,500 The last time I ran this, it took almost 2 seconds. 1998 01:38:31,500 --> 01:38:34,860 Now, after creating these indexes and telling my SQL database 1999 01:38:34,860 --> 01:38:37,470 I plan to search on those columns, watch this, 2000 01:38:37,470 --> 01:38:40,110 and watch the blue highlighted number. 2001 01:38:40,110 --> 01:38:41,520 8 milliseconds. 2002 01:38:41,520 --> 01:38:42,190 8 millisecond. 2003 01:38:42,190 --> 01:38:43,110 7 milliseconds. 2004 01:38:43,110 --> 01:38:44,490 7, 7, 7. 2005 01:38:44,490 --> 01:38:46,245 Looks like 4 milliseconds that time. 
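The before-and-after timing shown here can be reproduced roughly in code. A sketch on a toy people table; the absolute numbers depend entirely on the machine and data size, but the rows returned are identical either way, only the search strategy (linear scan versus B-tree search) changes.

```python
import sqlite3
import time

# Toy people table: 100,000 invented names plus one we'll search for.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT)")
db.executemany("INSERT INTO people (name) VALUES (?)",
               [(f"person{i}",) for i in range(100_000)])
db.execute("INSERT INTO people (name) VALUES ('Ellen DeGeneres')")

def timed(sql):
    # Run a query and return (rows, elapsed seconds).
    start = time.perf_counter()
    rows = db.execute(sql).fetchall()
    return rows, time.perf_counter() - start

query = "SELECT * FROM people WHERE name = 'Ellen DeGeneres'"
before, t_before = timed(query)   # linear scan of the whole column

db.execute("CREATE INDEX name_index ON people (name)")
after, t_after = timed(query)     # B-tree search via name_index

# Same single row either way; only the speed differs.
print(len(before), f"{t_before:.6f}s -> {t_after:.6f}s")
```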
2006 01:38:46,245 --> 01:38:49,120 So this is what the Googles of the world, the Facebooks of the world, 2007 01:38:49,120 --> 01:38:51,180 the Microsofts, who have very large data, 2008 01:38:51,180 --> 01:38:53,310 they not only store their data in databases 2009 01:38:53,310 --> 01:38:57,150 like we are here, but they also index their tables intelligently, 2010 01:38:57,150 --> 01:39:01,200 drawing in ideas from weeks ago, so that the database, for them, and for free, 2011 01:39:01,200 --> 01:39:04,110 and sort of magically, creates these kinds of structures in memory, 2012 01:39:04,110 --> 01:39:07,380 but does it so that you can search and insert and update 2013 01:39:07,380 --> 01:39:10,620 your data all the more efficiently. 2014 01:39:10,620 --> 01:39:13,560 So there's got to be some problems or some prices paid. 2015 01:39:13,560 --> 01:39:15,810 So let's consider just a couple of those. 2016 01:39:15,810 --> 01:39:19,637 So what problems actually arise when using a SQL database like this? 2017 01:39:19,637 --> 01:39:21,720 Well, the first is what's called a race condition. 2018 01:39:21,720 --> 01:39:24,780 And perhaps the best way to explain this is by way of a-- 2019 01:39:24,780 --> 01:39:27,780 sort of a story that I was told when I took a class like this years 2020 01:39:27,780 --> 01:39:31,690 ago in a class called CS161, Operating Systems. 2021 01:39:31,690 --> 01:39:33,540 So contrived scenario, but consider this. 2022 01:39:33,540 --> 01:39:35,832 You and your roommates have a fridge in your dorm room, 2023 01:39:35,832 --> 01:39:37,890 and you and your roommate really like milk. 2024 01:39:37,890 --> 01:39:41,910 And one of you comes home one day, opens the fridge, and sees, oh, dammit, 2025 01:39:41,910 --> 01:39:42,770 we're out of milk. 2026 01:39:42,770 --> 01:39:45,270 And so you close the fridge and you walk into Harvard Square 2027 01:39:45,270 --> 01:39:47,400 and you head to CVS or some other such store. 
2028 01:39:47,400 --> 01:39:50,670 Meanwhile, your other roommate comes home, opens the fridge, 2029 01:39:50,670 --> 01:39:53,280 really wants some milk, but, oh, darn it, we're out of milk. 2030 01:39:53,280 --> 01:39:56,425 So they close the fridge, walk outside, and head 2031 01:39:56,425 --> 01:39:59,550 to some other store that sells milk, and for reasons that we're contriving, 2032 01:39:59,550 --> 01:40:01,500 don't bump into you at CVS. 2033 01:40:01,500 --> 01:40:04,728 A few minutes later, you both get home, of course, having bought some milk. 2034 01:40:04,728 --> 01:40:06,520 But of course, milk doesn't last that long, 2035 01:40:06,520 --> 01:40:08,460 so now you've got twice as much milk, and surely some of it's 2036 01:40:08,460 --> 01:40:09,127 going to go bad. 2037 01:40:09,127 --> 01:40:11,640 Like horrible, horrible problem. 2038 01:40:11,640 --> 01:40:12,690 Why, right? 2039 01:40:12,690 --> 01:40:16,230 It's contrived, but this-- fond memories of when it was taught to me. 2040 01:40:16,230 --> 01:40:19,320 So what is the fundamental problem here, if you put your finger on it? 2041 01:40:19,320 --> 01:40:22,380 Both you and your roommate were allowed to inspect 2042 01:40:22,380 --> 01:40:26,400 the state of that refrigerator without knowing that the other was 2043 01:40:26,400 --> 01:40:27,720 about to do the same. 2044 01:40:27,720 --> 01:40:31,405 Better would have been for one of you to leave a note on the fridge saying, gone 2045 01:40:31,405 --> 01:40:33,780 for milk, so that the other person doesn't do it as well. 2046 01:40:33,780 --> 01:40:37,230 Or more stringently, just lock the refrigerator 2047 01:40:37,230 --> 01:40:41,430 so that no one else besides you can see inside of it until you have 2048 01:40:41,430 --> 01:40:44,010 updated the state of that refrigerator. 2049 01:40:44,010 --> 01:40:46,320 So this comes into play all too often these days 2050 01:40:46,320 --> 01:40:48,130 when dealing with really large data sets. 
2051 01:40:48,130 --> 01:40:50,880 The Twitters of the world, the Instagrams of the world, all of you 2052 01:40:50,880 --> 01:40:53,130 are probably in the habit of liking or uploading posts 2053 01:40:53,130 --> 01:40:54,640 on those sites or some other. 2054 01:40:54,640 --> 01:40:58,120 But if a lot of other people are, too, especially when things go viral, 2055 01:40:58,120 --> 01:41:01,080 there might be code like this underneath the hood that 2056 01:41:01,080 --> 01:41:02,400 essentially does the following. 2057 01:41:02,400 --> 01:41:05,700 Here's three lines of representative Python code that use SQL. 2058 01:41:05,700 --> 01:41:08,850 The first line here selects the number of likes from a table 2059 01:41:08,850 --> 01:41:11,880 called posts where the ID of the post is something, 2060 01:41:11,880 --> 01:41:14,370 a variable called ID, 1, 2, 3, 4, 5. 2061 01:41:14,370 --> 01:41:17,670 That just tells me, what is the total number of likes this post has? 2062 01:41:17,670 --> 01:41:20,648 Then, I store that answer in a variable using likes. 2063 01:41:20,648 --> 01:41:22,440 And you'll see this syntax in problem set 7 2064 01:41:22,440 --> 01:41:25,560 as you begin to play with SQL yourself in the CS50 library. 2065 01:41:25,560 --> 01:41:29,130 Suppose that your third line of code is to then update posts, 2066 01:41:29,130 --> 01:41:32,670 setting likes equal to some place holder, where that place 2067 01:41:32,670 --> 01:41:34,630 holder is going to be likes plus 1. 
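The lost-update interleaving described here can be simulated in one process: two "servers" both SELECT the like count, then both UPDATE based on their stale copy. A hedged sketch; the posts table and its values are invented, and the CS50 library's db.execute is approximated with Python's own sqlite3 module.

```python
import sqlite3

# Toy posts table; the row and its like count are invented.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, likes INTEGER)")
db.execute("INSERT INTO posts VALUES (1, 1000000)")

# Two servers interleave: both read the same count...
a = db.execute("SELECT likes FROM posts WHERE id = ?", (1,)).fetchone()[0]
b = db.execute("SELECT likes FROM posts WHERE id = ?", (1,)).fetchone()[0]

# ...and both write back their own stale count plus 1.
db.execute("UPDATE posts SET likes = ? WHERE id = ?", (a + 1, 1))
db.execute("UPDATE posts SET likes = ? WHERE id = ?", (b + 1, 1))

(likes,) = db.execute("SELECT likes FROM posts WHERE id = ?", (1,)).fetchone()
print(likes)  # → 1000001, not 1000002: one like was lost
```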
2068 01:41:34,630 --> 01:41:38,490 This is the same problem as the milk example, 2069 01:41:38,490 --> 01:41:41,430 because if you check how many likes there 2070 01:41:41,430 --> 01:41:43,490 are on this post, or your roommate checks how-- 2071 01:41:43,490 --> 01:41:46,050 or if you check how much milk is available left in the fridge 2072 01:41:46,050 --> 01:41:49,440 and then go off and begin to make a decision like buying milk, 2073 01:41:49,440 --> 01:41:52,260 your roommate might, on a server that's running 2074 01:41:52,260 --> 01:41:55,410 the same code to handle thousands and thousands of people, 2075 01:41:55,410 --> 01:41:56,680 might ask that same question. 2076 01:41:56,680 --> 01:41:58,472 How many likes does this current post have? 2077 01:41:58,472 --> 01:42:00,120 Suppose the post has a million likes. 2078 01:42:00,120 --> 01:42:02,400 Both of you execute that first line of code, 2079 01:42:02,400 --> 01:42:04,020 see that, oh, I have a million likes. 2080 01:42:04,020 --> 01:42:06,550 I've just clicked plus 1 on the site. 2081 01:42:06,550 --> 01:42:10,830 And so you try changing the number of likes with this second update query 2082 01:42:10,830 --> 01:42:12,720 to 1 million plus 1. 2083 01:42:12,720 --> 01:42:15,750 But if both of you have been making a decision that 2084 01:42:15,750 --> 01:42:18,660 gets interleaved with the other person, as will absolutely 2085 01:42:18,660 --> 01:42:20,910 happen with Instagram and Twitter, who have thousands 2086 01:42:20,910 --> 01:42:24,870 of servers all operating in parallel, instead of changing the number of likes 2087 01:42:24,870 --> 01:42:30,750 from 1 million to 1,000,002, you might change it both of you to 1,000,001, 2088 01:42:30,750 --> 01:42:33,870 thereby wasting one of the counts, because it gets lost, 2089 01:42:33,870 --> 01:42:36,810 because you inspected the state of the fridge or the likes. 2090 01:42:36,810 --> 01:42:37,983 You made a decision. 
2091 01:42:37,983 --> 01:42:39,900 And then you updated it based on that decision 2092 01:42:39,900 --> 01:42:42,813 without realizing that life changed in between. 2093 01:42:42,813 --> 01:42:44,730 Your roommate got home and checked the fridge, 2094 01:42:44,730 --> 01:42:47,190 too, or someone else clicked the upvote. 2095 01:42:47,190 --> 01:42:49,890 So this is a bad situation, but there are solutions in SQL 2096 01:42:49,890 --> 01:42:54,270 that we won't look at this week, but what are known as transactions. 2097 01:42:54,270 --> 01:42:57,960 That is a solvable problem, and more on that in a higher level database class. 2098 01:42:57,960 --> 01:43:00,690 And one final example of this, because this one you 2099 01:43:00,690 --> 01:43:04,380 can avoid throughout the coming weeks in CS50 and beyond. 2100 01:43:04,380 --> 01:43:08,520 A SQL injection attack is when you write bad code that 2101 01:43:08,520 --> 01:43:12,150 somehow allows some bad actor on the internet or an application 2102 01:43:12,150 --> 01:43:16,530 to trick your code into running SQL code that you did not intend. 2103 01:43:16,530 --> 01:43:19,680 For instance, this is how Yale students log on in New Haven. 2104 01:43:19,680 --> 01:43:22,470 This, of course, is how Harvard students log on here in Cambridge. 2105 01:43:22,470 --> 01:43:24,887 You're asked for a username and password, or a Harvard key 2106 01:43:24,887 --> 01:43:25,680 and a password. 2107 01:43:25,680 --> 01:43:28,290 Suppose, though, you as a user don't cooperate, 2108 01:43:28,290 --> 01:43:30,840 and instead of typing in your email address and password, 2109 01:43:30,840 --> 01:43:34,080 suppose you type something like this, like your email address, 2110 01:43:34,080 --> 01:43:38,520 Malan@Harvard.edu, then maybe a single quote, and then a dash dash. 
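One common fix, sketched here under the same toy schema: push the arithmetic into a single atomic UPDATE, so no code ever acts on a stale snapshot. (Transactions, mentioned above, are the more general tool for grouping several statements together.)

```python
import sqlite3

# Same toy posts table as before; values are invented.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, likes INTEGER)")
db.execute("INSERT INTO posts VALUES (1, 1000000)")

# Each like is one atomic statement: the database itself does the
# read-increment-write, so interleaved likes can't overwrite each other.
for _ in range(2):
    db.execute("UPDATE posts SET likes = likes + 1 WHERE id = ?", (1,))

(likes,) = db.execute("SELECT likes FROM posts WHERE id = ?", (1,)).fetchone()
print(likes)  # → 1000002: both likes counted
```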
2111 01:43:38,520 --> 01:43:40,950 Well, it turns out in SQL, dash dash is how 2112 01:43:40,950 --> 01:43:46,320 you start a comment, similar to hash in Python or slash slash in C. 2113 01:43:46,320 --> 01:43:50,180 But suppose that Harvard had implemented its website such 2114 01:43:50,180 --> 01:43:52,680 that there's a query like this going on underneath the hood, 2115 01:43:52,680 --> 01:43:56,370 if the Harvard key is implemented in Python, that says this, SELECT * FROM 2116 01:43:56,370 --> 01:44:00,630 users WHERE username = placeholder AND password = placeholder, 2117 01:44:00,630 --> 01:44:02,430 passing in username and password. 2118 01:44:02,430 --> 01:44:03,550 That is good. 2119 01:44:03,550 --> 01:44:07,050 That is correct code, because those placeholders are important. 2120 01:44:07,050 --> 01:44:10,890 What the CS50 library does, and other libraries in the world like it, is 2121 01:44:10,890 --> 01:44:12,810 it escapes user input. 2122 01:44:12,810 --> 01:44:17,480 If a user tries to trick your code into inputting comments or single quotes, 2123 01:44:17,480 --> 01:44:21,990 db.execute and other libraries will sanitize the user's input 2124 01:44:21,990 --> 01:44:24,570 and prevent the following from happening. 2125 01:44:24,570 --> 01:44:27,630 If, however, you do this-- not good, but bad. 2126 01:44:27,630 --> 01:44:32,490 Suppose that you really practice what we preached last week using f-strings, 2127 01:44:32,490 --> 01:44:34,920 this is now bad in the context of SQL.
2128 01:44:34,920 --> 01:44:37,830 If you create a format string like this with a little f 2129 01:44:37,830 --> 01:44:41,250 that is literally SELECT * from users WHERE username =, 2130 01:44:41,250 --> 01:44:44,310 and then use the fancy Python notation for curly braces, 2131 01:44:44,310 --> 01:44:48,660 and password = curly braces, this is a correct SQL construction 2132 01:44:48,660 --> 01:44:53,280 so long as the human cooperates and puts their username here and password here. 2133 01:44:53,280 --> 01:44:55,680 But what if the user is malicious, like me, 2134 01:44:55,680 --> 01:45:01,110 and actually includes a single quote in their input, and then dash dash? 2135 01:45:01,110 --> 01:45:04,650 So Malan@Harvard.edu quote dash dash would 2136 01:45:04,650 --> 01:45:08,520 have the effect of injecting that into the user's input. 2137 01:45:08,520 --> 01:45:11,040 But dash dash I claim as a comment, so it's 2138 01:45:11,040 --> 01:45:15,270 a way of tricking the computer into executing not SELECT * WHERE user-- 2139 01:45:15,270 --> 01:45:19,350 SELECT * FROM users where username = this AND password = that, 2140 01:45:19,350 --> 01:45:22,230 it has the effect of just saying SELECT * FROM users where 2141 01:45:22,230 --> 01:45:25,290 username equals Malan@Harvard.edu. 2142 01:45:25,290 --> 01:45:27,720 And if-- and you'll see this in p set 7-- 2143 01:45:27,720 --> 01:45:31,020 db.execute returns one or more rows from the table-- 2144 01:45:31,020 --> 01:45:33,750 that's how you can actually select data using Python code-- 2145 01:45:33,750 --> 01:45:35,670 and the number of rows equals 1, because it's 2146 01:45:35,670 --> 01:45:39,210 selected Malan@Harvard.edu you could log in 2147 01:45:39,210 --> 01:45:45,660 as me because you very simply tricked the computer into executing code 2148 01:45:45,660 --> 01:45:47,400 that you injected. 2149 01:45:47,400 --> 01:45:50,900 And the biggest takeaway here is this is 100% solvable problem. 
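The attack just described can be reproduced in a few lines. A sketch with an invented users table and made-up credentials: the f-string version splices the attacker's quote and dash dash straight into the SQL, while the placeholder version treats the same input as plain data.

```python
import sqlite3

# Invented users table and credentials, for illustration only.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (username TEXT, password TEXT)")
db.execute("INSERT INTO users VALUES ('malan@harvard.edu', 'crimson')")

username = "malan@harvard.edu'--"   # a single quote, then a SQL comment
password = "wrong"

# BAD: the f-string builds
#   ... WHERE username = 'malan@harvard.edu'--' AND password = 'wrong'
# and the dash dash comments out the password check entirely.
bad_rows = db.execute(
    f"SELECT * FROM users WHERE username = '{username}' AND password = '{password}'"
).fetchall()
print(len(bad_rows))  # → 1: logged in with no valid password

# GOOD: placeholders escape the input, so the trick fails.
good_rows = db.execute(
    "SELECT * FROM users WHERE username = ? AND password = ?",
    (username, password),
).fetchall()
print(len(good_rows))  # → 0
```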
2150 01:45:50,900 --> 01:45:53,670 You just need to know that it is solved and not 2151 01:45:53,670 --> 01:45:56,760 do code like this using f strings in Python, but instead 2152 01:45:56,760 --> 01:46:02,370 to use library code like db.execute, or after this course, anything like it. 2153 01:46:02,370 --> 01:46:05,290 With that said, we'll end with a joke. 2154 01:46:05,290 --> 01:46:17,475 2155 01:46:17,475 --> 01:46:18,130 OK. 2156 01:46:18,130 --> 01:46:19,300 That's it for CS50. 2157 01:46:19,300 --> 01:46:21,480 We'll see you next time. 2158 01:46:21,480 --> 01:46:22,000