1
00:00:00,000 --> 00:00:03,493
[MUSIC PLAYING]

3
00:00:49,357 --> 00:00:50,440
DAVID J. MALAN: All right.

4
00:00:50,440 --> 00:00:53,260
This is CS50, and this is week 7.

5
00:00:53,260 --> 00:00:56,207
And today's focus is going
to be entirely on data--

6
00:00:56,207 --> 00:00:58,540
the process of collecting it,
the process of storing it,

7
00:00:58,540 --> 00:01:00,610
the process of searching
it, and so much more.

8
00:01:00,610 --> 00:01:03,280
You'll recall that last week we
started off by playing around

9
00:01:03,280 --> 00:01:04,750
with a relatively small data set.

10
00:01:04,750 --> 00:01:08,630
We asked everyone for what their
preferred house at Hogwarts might be.

11
00:01:08,630 --> 00:01:12,400
And then we proceeded to analyze that
data a little bit using some Python

12
00:01:12,400 --> 00:01:15,730
and counting up how many people wanted
Gryffindor or Slytherin or the others,

13
00:01:15,730 --> 00:01:16,423
as well.

14
00:01:16,423 --> 00:01:19,090
And we ultimately did that by
using a Google form to collect it.

15
00:01:19,090 --> 00:01:21,923
And we stored all of the data in a
Google spreadsheet, which we then

16
00:01:21,923 --> 00:01:24,350
exported, of course, as a CSV file.

17
00:01:24,350 --> 00:01:26,740
So this week, we thought we'd
collect a little more data

18
00:01:26,740 --> 00:01:28,780
and see what kinds of
problems arise when

19
00:01:28,780 --> 00:01:32,350
we start using only a spreadsheet
or, in turn, a CSV file

20
00:01:32,350 --> 00:01:34,220
to store the data that we care about.

21
00:01:34,220 --> 00:01:37,930
So in fact, if you could go ahead
and go to this URL here that you see,

22
00:01:37,930 --> 00:01:41,230
you should see another Google
form, this one asking you

23
00:01:41,230 --> 00:01:42,730
some different questions.

24
00:01:42,730 --> 00:01:46,540
All of us probably have some preferred
TV shows, now more than ever, perhaps.

25
00:01:46,540 --> 00:01:49,150
And what we'd like to do
is ask everyone to input

26
00:01:49,150 --> 00:01:53,590
into that form their favorite
TV show followed by the genre

27
00:01:53,590 --> 00:01:58,190
or genres into which that
particular TV show falls.

28
00:01:58,190 --> 00:02:00,680
So go ahead and take
a moment to do that.

29
00:02:00,680 --> 00:02:03,820
And if you're unable to follow along
at home, what folks are looking at

30
00:02:03,820 --> 00:02:07,360
is a form quite like this one here,
whereby we're just asking them

31
00:02:07,360 --> 00:02:11,350
for the title of their
preferred TV show and the genre

32
00:02:11,350 --> 00:02:15,770
or genres of that specific TV show.

33
00:02:15,770 --> 00:02:16,270
All right.

34
00:02:16,270 --> 00:02:19,270
So let's go ahead and start to look
at some of this data that's come in.

35
00:02:19,270 --> 00:02:23,043
Here is the resulting Google spreadsheet
that Google Forms has created for us.

36
00:02:23,043 --> 00:02:25,960
And you'll notice that by default,
Google Forms, this particular tool,

37
00:02:25,960 --> 00:02:28,180
has three different columns,
at least for this form.

38
00:02:28,180 --> 00:02:30,070
One is a timestamp, and
Google automatically

39
00:02:30,070 --> 00:02:33,340
gives us that based on what day
and time everyone was buzzing in

40
00:02:33,340 --> 00:02:34,390
with the responses.

41
00:02:34,390 --> 00:02:38,770
Then they have a header row
beyond that for title and genres.

42
00:02:38,770 --> 00:02:42,100
I've manually boldfaced it in
advance just to make it stand out.

43
00:02:42,100 --> 00:02:45,580
But you'll notice that the
headings here, Title and Genres,

44
00:02:45,580 --> 00:02:48,790
perfectly match the questions
that we asked in the Google form.

45
00:02:48,790 --> 00:02:53,020
That allows us to therefore line up
your responses with our questions.

46
00:02:53,020 --> 00:02:56,410
And you can see here Punisher
was the first favorite TV

47
00:02:56,410 --> 00:03:00,040
show to be inputted followed by The
Office, Breaking Bad, New Girl, Archer,

48
00:03:00,040 --> 00:03:02,270
another Office, and so forth.

49
00:03:02,270 --> 00:03:04,660
And in the third column,
under Genres, you'll

50
00:03:04,660 --> 00:03:06,520
see that there's something curious here.

51
00:03:06,520 --> 00:03:08,230
While some of the cells--

52
00:03:08,230 --> 00:03:10,330
that is, the little boxes of text--

53
00:03:10,330 --> 00:03:12,970
have just single words
like "comedy" or "drama,"

54
00:03:12,970 --> 00:03:15,550
you'll notice that some of them
have a comma-separated list.

55
00:03:15,550 --> 00:03:19,150
And that comma-separated list is because
some of you checked, as you could,

56
00:03:19,150 --> 00:03:24,730
multiple check boxes to indicate
that Breaking Bad is a crime genre

57
00:03:24,730 --> 00:03:26,830
drama and also thriller.

58
00:03:26,830 --> 00:03:31,240
And so the way Google Forms handles
this is a bit sleazily in the sense

59
00:03:31,240 --> 00:03:35,350
that they just drop all of those
values as a comma-separated list

60
00:03:35,350 --> 00:03:37,853
inside of the spreadsheet itself.

61
00:03:37,853 --> 00:03:40,270
And that's potentially a problem
if we ultimately download

62
00:03:40,270 --> 00:03:43,570
this as a CSV file,
comma-separated values,

63
00:03:43,570 --> 00:03:47,500
because now you have commas
in between the commas.

64
00:03:47,500 --> 00:03:50,480
Fortunately, there's a solution
to that that we'll ultimately see.

65
00:03:50,480 --> 00:03:52,160
So we've got a good amount of data here.

66
00:03:52,160 --> 00:03:55,300
In fact, if I keep scrolling down,
we'll see a few hundred responses now.

67
00:03:55,300 --> 00:03:58,120
And it would be nice to
analyze this data in some way

68
00:03:58,120 --> 00:04:02,410
and figure out what the most popular
TV show is, maybe search for new shows

69
00:04:02,410 --> 00:04:04,143
I might like via their genre.

70
00:04:04,143 --> 00:04:06,310
So you can imagine some
number of queries that could

71
00:04:06,310 --> 00:04:08,620
be answered by way of this data set.

72
00:04:08,620 --> 00:04:12,340
But let's first consider the
limitations of leaving this data

73
00:04:12,340 --> 00:04:14,980
in just a spreadsheet like this.

74
00:04:14,980 --> 00:04:17,440
All of us are probably in the
habit of using occasionally

75
00:04:17,440 --> 00:04:22,490
Google Spreadsheets, Apple Numbers,
Microsoft Excel, or some other tool.

76
00:04:22,490 --> 00:04:27,220
So let's consider what spreadsheets
are good at and what they are bad at.

77
00:04:27,220 --> 00:04:30,370
Would anyone like to volunteer
an answer to the first of those?

78
00:04:30,370 --> 00:04:34,030
What is a spreadsheet
good at or good for?

79
00:04:34,030 --> 00:04:35,080
Yeah, Andrew?

80
00:04:35,080 --> 00:04:36,760
What's your thinking on spreadsheets?

81
00:04:36,760 --> 00:04:39,797
AUDIENCE: [INAUDIBLE]

82
00:04:39,797 --> 00:04:41,880
DAVID J. MALAN: OK, very
good for quickly sorting.

83
00:04:41,880 --> 00:04:42,300
I like that.

84
00:04:42,300 --> 00:04:44,758
I could click on the top of
the Title column, for instance,

85
00:04:44,758 --> 00:04:48,450
and immediately sort all of
those titles alphabetically.

86
00:04:48,450 --> 00:04:49,140
I like that.

87
00:04:49,140 --> 00:04:53,370
Other reasons to use a spreadsheet--
what problems do they solve?

88
00:04:53,370 --> 00:04:55,050
What are they good at?

89
00:04:55,050 --> 00:04:56,670
Other thoughts on spreadsheets.

90
00:04:56,670 --> 00:04:58,530
Yeah, how about Peter?

91
00:04:58,530 --> 00:05:01,947
AUDIENCE: Storing large amounts of
data that you can later analyze.

92
00:05:01,947 --> 00:05:03,780
DAVID J. MALAN: OK, so
storing large amounts

93
00:05:03,780 --> 00:05:05,700
of data that you can later analyze.

94
00:05:05,700 --> 00:05:09,210
It's kind of a nice model for storing
lots of rows of data, so to speak.

95
00:05:09,210 --> 00:05:11,310
I will say that there
actually is a limit.

96
00:05:11,310 --> 00:05:13,890
And in fact, back in the day,
I learned what this limit is.

97
00:05:13,890 --> 00:05:16,515
Long story short, in graduate
school, I was using a spreadsheet

98
00:05:16,515 --> 00:05:17,940
to analyze some research data.

99
00:05:17,940 --> 00:05:23,370
And at one point, I had more data
than Excel supported rows for.

100
00:05:23,370 --> 00:05:28,110
Specifically, I had
some 65,536 rows, which

101
00:05:28,110 --> 00:05:30,360
was too many at that point
for Excel at the time,

102
00:05:30,360 --> 00:05:33,870
because, long story short, if you
recall from a spreadsheet program

103
00:05:33,870 --> 00:05:37,328
like Google Spreadsheets, every
row is numbered from 1 on up.

104
00:05:37,328 --> 00:05:39,120
Well, unfortunately,
at the time, Microsoft

105
00:05:39,120 --> 00:05:43,170
had used a 16-bit integer,
16 bits or 2 bytes,

106
00:05:43,170 --> 00:05:45,150
to represent each of those numbers.

107
00:05:45,150 --> 00:05:49,320
And it turns out that 2 to the
16th power is roughly 65,000.

108
00:05:49,320 --> 00:05:52,000
So at that point, I maxed
out the total number of rows.
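
A minimal sketch of the arithmetic behind that limit, assuming row numbers are stored as unsigned 16-bit integers as described. The name `max_rows` is only illustrative.

```python
# A 16-bit unsigned integer can take 2**16 distinct values,
# so that is the most rows that can each get a unique number.
BITS = 16
max_rows = 2 ** BITS
print(max_rows)  # 65536, i.e. "roughly 65,000"
```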

109
00:05:52,000 --> 00:05:54,828
Now, to Peter's point, they've
increased that in recent years.

110
00:05:54,828 --> 00:05:56,620
And you can actually
store a lot more data.

111
00:05:56,620 --> 00:05:58,870
So spreadsheets are indeed good at that.

112
00:05:58,870 --> 00:06:02,580
But they're not necessarily good at
everything, because at some point,

113
00:06:02,580 --> 00:06:05,310
you're going to have more data
potentially in a spreadsheet

114
00:06:05,310 --> 00:06:07,860
than your Mac or PC can handle.

115
00:06:07,860 --> 00:06:10,860
In fact, if you're actually trying
to build an application, whether it's

116
00:06:10,860 --> 00:06:14,490
Twitter, Instagram, or Facebook
or anything of that scale,

117
00:06:14,490 --> 00:06:17,490
those companies are certainly not
storing their data, suffice it to say,

118
00:06:17,490 --> 00:06:20,700
in a spreadsheet, because there would
just be way too much data to use.

119
00:06:20,700 --> 00:06:23,050
And no one could literally
open it on their computer.

120
00:06:23,050 --> 00:06:25,830
So we'll need a solution
to that problem of scale.

121
00:06:25,830 --> 00:06:29,950
But I don't think we need to throw out
what works well about spreadsheets.

122
00:06:29,950 --> 00:06:33,510
So you can store indeed a
lot of data in row form.

123
00:06:33,510 --> 00:06:36,930
But it would seem that you can also
store a lot of data in column form.

124
00:06:36,930 --> 00:06:39,567
And even though I'm only
showing columns A, B, and C,

125
00:06:39,567 --> 00:06:41,400
of course, you've
probably used spreadsheets

126
00:06:41,400 --> 00:06:42,570
where you add more columns--

127
00:06:42,570 --> 00:06:44,680
D, E, F, and so forth.

128
00:06:44,680 --> 00:06:48,540
So what's the right mental model
for how to think about rows

129
00:06:48,540 --> 00:06:51,540
versus columns in a spreadsheet?

130
00:06:51,540 --> 00:06:57,840
I feel like we probably use them in a
somewhat different way conceptually.

131
00:06:57,840 --> 00:07:00,550
We might think about them
a little differently.

132
00:07:00,550 --> 00:07:04,440
What's the difference between
rows and columns in a spreadsheet?

133
00:07:04,440 --> 00:07:06,570
Sofia.

134
00:07:06,570 --> 00:07:07,890
AUDIENCE: Adding more entries.

135
00:07:07,890 --> 00:07:09,420
Adding more data is--

136
00:07:09,420 --> 00:07:12,720
those are within the rows, but then the
actual attributes or characteristics

137
00:07:12,720 --> 00:07:14,240
of the data should be in columns.

138
00:07:14,240 --> 00:07:15,240
DAVID J. MALAN: Exactly.

139
00:07:15,240 --> 00:07:17,220
When you add more data
to the spreadsheet,

140
00:07:17,220 --> 00:07:19,620
you should really be
adding to the bottom of it,

141
00:07:19,620 --> 00:07:21,310
adding more and more rows.

142
00:07:21,310 --> 00:07:24,300
So these things sort of grow
vertically, even though of course that's

143
00:07:24,300 --> 00:07:25,920
just a human's perception of it.

144
00:07:25,920 --> 00:07:28,740
They grow from top to bottom
by adding more and more rows.

145
00:07:28,740 --> 00:07:31,560
But to Sofia's point,
your columns represent

146
00:07:31,560 --> 00:07:37,920
what we might call attributes or fields
or any other such characteristic that

147
00:07:37,920 --> 00:07:40,030
is a type of data that you're storing.

148
00:07:40,030 --> 00:07:42,930
So in this case of our form,
Timestamp is the first column.

149
00:07:42,930 --> 00:07:44,460
Title is the second column.

150
00:07:44,460 --> 00:07:45,930
Genres is the third column.

151
00:07:45,930 --> 00:07:49,980
And those columns can indeed be thought
of as fields or attributes, properties

152
00:07:49,980 --> 00:07:50,697
of your data.

153
00:07:50,697 --> 00:07:54,030
And those are properties that you should
really decide on in advance when you're

154
00:07:54,030 --> 00:07:56,970
first creating the form, in our case,
or when you're manually creating

155
00:07:56,970 --> 00:07:59,430
the spreadsheet in another case.

156
00:07:59,430 --> 00:08:01,320
You should not really
be in the habit, when

157
00:08:01,320 --> 00:08:05,430
using spreadsheets, of
adding data from left

158
00:08:05,430 --> 00:08:08,370
to right, adding more and
more columns, unless you

159
00:08:08,370 --> 00:08:11,740
decide to collect more types of data.

160
00:08:11,740 --> 00:08:15,873
So just because someone adds a new
favorite TV show to your data set,

161
00:08:15,873 --> 00:08:18,540
you shouldn't be adding that from
left to right in a new column.

162
00:08:18,540 --> 00:08:21,040
You should indeed be adding
it from top to bottom.

163
00:08:21,040 --> 00:08:24,780
But suppose that we actually decided to
collect more information from everyone.

164
00:08:24,780 --> 00:08:28,650
Maybe that form had instead asked you
for your name or your email address

165
00:08:28,650 --> 00:08:30,120
or any other questions.

166
00:08:30,120 --> 00:08:34,480
Those properties or attributes or
fields would belong as new columns.

167
00:08:34,480 --> 00:08:38,309
So this is to say we generally
decide on the layout of our data,

168
00:08:38,309 --> 00:08:41,130
the schema of our data, in advance.

169
00:08:41,130 --> 00:08:45,420
And then from there on out, we proceed
to add, add, add more rows, not

170
00:08:45,420 --> 00:08:47,670
columns, unless we change
our mind and need to change

171
00:08:47,670 --> 00:08:50,230
the schema of our particular data.

172
00:08:50,230 --> 00:08:53,850
So it turns out that spreadsheets
are indeed wonderfully useful,

173
00:08:53,850 --> 00:08:56,700
to Peter's point, for
large or reasonably large

174
00:08:56,700 --> 00:08:59,610
data sets that we might collect.

175
00:08:59,610 --> 00:09:04,500
And we can, of course, per last week,
export those data sets as CSV files.

176
00:09:04,500 --> 00:09:07,290
And so we can go from a
spreadsheet to a simple text

177
00:09:07,290 --> 00:09:11,370
file stored in ASCII or Unicode, more
generally, on your own hard drive

178
00:09:11,370 --> 00:09:12,660
or somewhere in the cloud.

179
00:09:12,660 --> 00:09:16,350
And you can actually think
of that file, that .CSV file,

180
00:09:16,350 --> 00:09:19,540
as what we might call
a flat-file database.

181
00:09:19,540 --> 00:09:22,980
A database is, generally
speaking, a file that stores data.

182
00:09:22,980 --> 00:09:25,997
Or it's a program that
stores data for you.

183
00:09:25,997 --> 00:09:29,080
And all of us have probably thought
about or used databases in some sense.

184
00:09:29,080 --> 00:09:31,560
You're probably familiar
with the fact that all

185
00:09:31,560 --> 00:09:35,310
of those same big websites, Google and
Twitter and Facebook and others, use

186
00:09:35,310 --> 00:09:37,017
databases to store our data.

187
00:09:37,017 --> 00:09:38,850
Well, those databases
are either just really

188
00:09:38,850 --> 00:09:42,120
big files containing lots
of data or special programs

189
00:09:42,120 --> 00:09:44,130
that are storing our data for us.

190
00:09:44,130 --> 00:09:46,350
And a flat file is just
referring to the fact

191
00:09:46,350 --> 00:09:48,580
that it really is a very simple design.

192
00:09:48,580 --> 00:09:51,510
In fact, years ago,
decades ago, humans decided

193
00:09:51,510 --> 00:09:54,780
when storing data in simple
text files that if you

194
00:09:54,780 --> 00:09:57,540
want to store different types
of data, like, to Sofia's point,

195
00:09:57,540 --> 00:10:00,340
different properties or attributes,
well, let's keep it simple.

196
00:10:00,340 --> 00:10:03,780
Let's just separate
those columns with commas

197
00:10:03,780 --> 00:10:06,450
in our flat-file database, a.k.a.

198
00:10:06,450 --> 00:10:07,118
a CSV.

199
00:10:07,118 --> 00:10:08,160
You can use other things.

200
00:10:08,160 --> 00:10:09,430
You can use tabs.

201
00:10:09,430 --> 00:10:12,570
There's things called TSVs,
for Tab-Separated Values.

202
00:10:12,570 --> 00:10:14,760
And frankly, you can
use anything you want.

203
00:10:14,760 --> 00:10:16,050
But there is a corner case.

204
00:10:16,050 --> 00:10:17,980
And we've already seen a preview of it.

205
00:10:17,980 --> 00:10:21,190
What if your actual
data has a comma in it?

206
00:10:21,190 --> 00:10:23,820
What if the title of your
favorite TV show has a comma?

207
00:10:23,820 --> 00:10:27,660
What if Google is presuming to store
genres as a comma-separated list?

208
00:10:27,660 --> 00:10:32,340
Bad things can happen if using a
CSV as your flat-file database.

209
00:10:32,340 --> 00:10:33,760
But there are solutions to that.

210
00:10:33,760 --> 00:10:35,580
And in fact, what the
world typically does

211
00:10:35,580 --> 00:10:39,640
is whenever you have commas
inside of your CSV file,

212
00:10:39,640 --> 00:10:42,300
you just make sure that
the whole string is double

213
00:10:42,300 --> 00:10:44,460
quoted on the far left and far right.

214
00:10:44,460 --> 00:10:46,860
And anything inside of
double quotes is not

215
00:10:46,860 --> 00:10:50,790
mistaken thereafter as
delineating a column

216
00:10:50,790 --> 00:10:53,220
as the other commas in the file might.
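
A small sketch of that quoting convention, using Python's built-in csv module on an in-memory buffer rather than a real file. The row shown is a made-up example; only the quoting behavior is the point.

```python
import csv
import io

# Hypothetical row whose second field contains commas of its own.
row = ["Breaking Bad", "Crime, Drama, Thriller"]

# Write the row as CSV text; the writer quotes any field that contains a comma.
buffer = io.StringIO()
csv.writer(buffer).writerow(row)
line = buffer.getvalue()
print(line)

# Reading the text back treats the quoted commas as data, not as column breaks.
parsed = next(csv.reader(io.StringIO(line)))
print(parsed)
```

The commas inside the double quotes survive the round trip, so the row still has two fields rather than four.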

217
00:10:53,220 --> 00:10:55,590
So that's all that's meant
by a flat-file database.

218
00:10:55,590 --> 00:10:58,860
And CSV is perhaps one of the most
common, the most common, formats

219
00:10:58,860 --> 00:11:01,240
thereof, if only because
all of these programs,

220
00:11:01,240 --> 00:11:03,420
like Google Spreadsheets
and Excel and Numbers,

221
00:11:03,420 --> 00:11:07,137
allow you to save your files as CSVs.

222
00:11:07,137 --> 00:11:08,970
Now, long story short,
those of you who have

223
00:11:08,970 --> 00:11:12,570
used fancier features of spreadsheets
like built-in functions and formulas

224
00:11:12,570 --> 00:11:14,850
and those kinds of
things, those are built in

225
00:11:14,850 --> 00:11:19,120
and proprietary to Google
Spreadsheets and Excel and Numbers.

226
00:11:19,120 --> 00:11:24,900
You cannot use formulas in a CSV file or
a TSV file or in a flat-file database,

227
00:11:24,900 --> 00:11:25,870
more generally.

228
00:11:25,870 --> 00:11:27,990
You can only store static--

229
00:11:27,990 --> 00:11:30,090
that is, unchanging-- values.

230
00:11:30,090 --> 00:11:33,490
So when you export the data,
what you see is what you get.

231
00:11:33,490 --> 00:11:35,242
And that's why people
use fancier programs

232
00:11:35,242 --> 00:11:37,200
like Excel and Numbers
and Google Spreadsheets,

233
00:11:37,200 --> 00:11:38,658
because you get more functionality.

234
00:11:38,658 --> 00:11:41,100
But if you want to export
the data, you can only

235
00:11:41,100 --> 00:11:44,190
get indeed the raw
textual data out of it.

236
00:11:44,190 --> 00:11:45,690
But I daresay that's going to be OK.

237
00:11:45,690 --> 00:11:47,398
In fact, Brian, do
you mind if I go ahead

238
00:11:47,398 --> 00:11:50,160
and download this spreadsheet
as a CSV file now?

239
00:11:50,160 --> 00:11:51,510
BRIAN YU: Yep, go ahead.

240
00:11:51,510 --> 00:11:51,810
DAVID J. MALAN: All right.

241
00:11:51,810 --> 00:11:54,890
I'm going to go ahead in Google
Spreadsheets and go to File, Download.

242
00:11:54,890 --> 00:11:56,640
And you can see a whole
bunch of options--

243
00:11:56,640 --> 00:12:01,850
PDF, Web Page, Comma-Separated
Values, which is the one I want.

244
00:12:01,850 --> 00:12:04,320
So I'm going to indeed
go ahead and choose CSV

245
00:12:04,320 --> 00:12:06,510
from this dropdown in spreadsheets.

246
00:12:06,510 --> 00:12:08,410
That, of course, downloaded
that file for me.

247
00:12:08,410 --> 00:12:11,077
And now I'm going to go ahead and
go into our familiar CS50 IDE.

248
00:12:11,077 --> 00:12:14,600
You'll recall that last week I was
able to upload a file into the IDE.

249
00:12:14,600 --> 00:12:17,350
And I'm going to go ahead and do
the same here this week, as well.

250
00:12:17,350 --> 00:12:20,730
I'm going to go ahead and grab my
file, which ended up in my Downloads

251
00:12:20,730 --> 00:12:22,830
folder on my particular computer here.

252
00:12:22,830 --> 00:12:27,840
And I'm going to go ahead and
drag and drop this into the IDE

253
00:12:27,840 --> 00:12:31,790
such that it ends up in my
home directory, so to speak.

254
00:12:31,790 --> 00:12:34,410
So now I have this file,
Favorite TV Shows Forms.

255
00:12:34,410 --> 00:12:36,750
And in fact, if I double
click this within the IDE,

256
00:12:36,750 --> 00:12:38,880
you'll see familiar data now.

257
00:12:38,880 --> 00:12:42,950
Timestamp comma title comma
genres is our header row

258
00:12:42,950 --> 00:12:46,830
that contains the names of the
properties or attributes in this file.

259
00:12:46,830 --> 00:12:51,390
Then we've got our timestamps
comma favorite title comma and then

260
00:12:51,390 --> 00:12:53,310
a comma-separated list of genres.

261
00:12:53,310 --> 00:12:56,100
And here indeed, notice
that Google took care

262
00:12:56,100 --> 00:13:00,030
to use double quotes around any
values that themselves had commas.

263
00:13:00,030 --> 00:13:02,130
So it's a relatively simple file format.

264
00:13:02,130 --> 00:13:04,560
And I could certainly just
kind of skim through this,

265
00:13:04,560 --> 00:13:07,920
figuring out who likes The Office, who
likes Breaking Bad, or other shows.

266
00:13:07,920 --> 00:13:11,040
But per last week, we now have a
pretty useful programming language

267
00:13:11,040 --> 00:13:14,220
at our disposal, Python, that could
allow us to start manipulating

268
00:13:14,220 --> 00:13:16,860
and analyzing this data more readily.

269
00:13:16,860 --> 00:13:20,100
And here to my point last week about
using the right tool for the job,

270
00:13:20,100 --> 00:13:24,860
you could absolutely do everything we're
about to do in all weeks prior of CS50.

271
00:13:24,860 --> 00:13:27,720
We could have used C for
what we're about to do.

272
00:13:27,720 --> 00:13:31,350
But as you can probably glean, C tends
to be painful for certain things,

273
00:13:31,350 --> 00:13:34,290
like anything involving
string manipulation,

274
00:13:34,290 --> 00:13:36,660
changing strings, analyzing strings.

275
00:13:36,660 --> 00:13:38,290
It's just a real pain, right?

276
00:13:38,290 --> 00:13:42,330
God forbid you had to take this CSV
file and load it all into memory, not

277
00:13:42,330 --> 00:13:43,470
unlike your spell checker.

278
00:13:43,470 --> 00:13:46,950
You would have to be using malloc all
over the place or realloc or the like.

279
00:13:46,950 --> 00:13:50,640
There's just a lot of heavy lifting
involved in just analyzing a text file.

280
00:13:50,640 --> 00:13:53,760
So Python does all of that
for us by just giving us

281
00:13:53,760 --> 00:13:56,130
more functions at our
disposal with which

282
00:13:56,130 --> 00:13:59,470
to start analyzing and opening data.

283
00:13:59,470 --> 00:14:01,570
So let me go ahead and close this file.

284
00:14:01,570 --> 00:14:05,082
And let me go ahead and create
a new one called favorites.py,

285
00:14:05,082 --> 00:14:07,290
wherein I'm going to start
playing with this data set

286
00:14:07,290 --> 00:14:09,900
and see if we can't start
answering some questions about it.

287
00:14:09,900 --> 00:14:12,570
And frankly, to this day,
20-plus years after learning how

288
00:14:12,570 --> 00:14:14,670
to program for the
first time, I myself am

289
00:14:14,670 --> 00:14:18,000
very much in the habit when writing
a new program of just starting simple

290
00:14:18,000 --> 00:14:22,320
and not solving the problem I ultimately
want to but something simpler just

291
00:14:22,320 --> 00:14:24,270
as a sort of proof of
concept to make sure

292
00:14:24,270 --> 00:14:26,510
I have the right plumbing in place.

293
00:14:26,510 --> 00:14:27,510
So by that, I mean this.

294
00:14:27,510 --> 00:14:32,550
Let's go ahead and write a quick program
that simply opens up this file, the CSV

295
00:14:32,550 --> 00:14:37,120
file, iterates over it top to bottom,
and just prints out each of the titles,

296
00:14:37,120 --> 00:14:39,430
just as a quick sanity check
that I know what I'm doing

297
00:14:39,430 --> 00:14:41,460
and I have access to the data therein.

298
00:14:41,460 --> 00:14:43,740
So let me go ahead and import csv.

299
00:14:43,740 --> 00:14:45,840
And then I can do this
in a few different ways.

300
00:14:45,840 --> 00:14:48,030
But by now, you've
probably seen or remembered

301
00:14:48,030 --> 00:14:50,490
my using something like
the open command and the

302
00:14:50,490 --> 00:14:55,260
with keyword to open and eventually
automatically close this file for me.

303
00:14:55,260 --> 00:14:59,710
This file is called Favorite TV
Shows - Form Responses 1.csv.

305
00:15:02,400 --> 00:15:04,560
And I'm going to open
this up in read mode.

306
00:15:04,560 --> 00:15:07,000
Strictly speaking,
the r is not required.

307
00:15:07,000 --> 00:15:09,330
You might see examples
online not including it.

308
00:15:09,330 --> 00:15:13,140
That's because read is the default.
But for parity with C and fopen,

309
00:15:13,140 --> 00:15:15,900
I'm going to be explicit
and actually do "r."

310
00:15:15,900 --> 00:15:18,670
And I'm going to go ahead and
give this a variable name of file.

311
00:15:18,670 --> 00:15:23,820
So this line 3 here has the effect of
opening that CSV file in read-only mode

312
00:15:23,820 --> 00:15:27,532
and creating a variable called
file via which I can reference it.

313
00:15:27,532 --> 00:15:30,240
Now I'm going to go ahead and use
some of that CSV functionality.

314
00:15:30,240 --> 00:15:32,790
I'm going to give myself what
we keep calling a reader, which

315
00:15:32,790 --> 00:15:34,650
I could call xyz or anything else.

316
00:15:34,650 --> 00:15:37,740
But "reader" kind of describes
what this variable is going to do.

317
00:15:37,740 --> 00:15:42,930
And it's going to be the return value
of calling csv.reader on that file.

318
00:15:42,930 --> 00:15:46,740
And so essentially, the
CSV library, per last week,

319
00:15:46,740 --> 00:15:48,360
has a lot of fancy features built in.

320
00:15:48,360 --> 00:15:52,470
And all it needs as input is
an already opened text file.

321
00:15:52,470 --> 00:15:55,120
And then it will then wrap
that file, so to speak,

322
00:15:55,120 --> 00:15:57,270
with a whole bunch of
more useful functionality,

323
00:15:57,270 --> 00:16:01,750
like the ability to read it
column and row at a time.

324
00:16:01,750 --> 00:16:02,250
All right.

325
00:16:02,250 --> 00:16:05,170
Now I'm going to go ahead and,
you know what, just for now,

326
00:16:05,170 --> 00:16:08,190
I'm going to skip the first row.

327
00:16:08,190 --> 00:16:11,310
I'm going to skip the first row,
because the first row has my headings--

328
00:16:11,310 --> 00:16:13,530
Timestamp, Title, and Genres.

329
00:16:13,530 --> 00:16:17,692
And I know what my columns are, so I'm
just going to ignore that line for now.

330
00:16:17,692 --> 00:16:18,900
And now I'm going to do this.

331
00:16:18,900 --> 00:16:24,570
For row in reader, let me go ahead
and print out, quite simply, row.

332
00:16:24,570 --> 00:16:28,890
And I only want title, so I think if
it's three columns from left to right,

333
00:16:28,890 --> 00:16:30,330
it's 0, 1, 2.

334
00:16:30,330 --> 00:16:33,480
So I want to print out
column bracket 1, which

335
00:16:33,480 --> 00:16:35,680
is going to be the second
column zero indexed.

336
00:16:35,680 --> 00:16:36,180
All right.

337
00:16:36,180 --> 00:16:39,240
Let me go ahead and save that,
go down to my terminal window,

338
00:16:39,240 --> 00:16:42,850
and run python of favorites.py
and cross my fingers.

339
00:16:42,850 --> 00:16:43,500
OK.

340
00:16:43,500 --> 00:16:44,920
Voila.

341
00:16:44,920 --> 00:16:46,740
It flew by super fast.

342
00:16:46,740 --> 00:16:49,530
But it looks like, indeed,
these are all of the TV

343
00:16:49,530 --> 00:16:51,150
shows that folks have inputted.

344
00:16:51,150 --> 00:16:53,370
Indeed, there's a few hundred
if I keep scrolling up.

345
00:16:53,370 --> 00:16:55,740
So it looks like my program is working.
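
A sketch of the program just described, written so it can run without the exported file. The `SAMPLE` text, its timestamps, and the helper name `read_titles` are assumptions made for illustration; the lecture's script opens the real CSV file directly.

```python
import csv
import io

# Made-up stand-in for the exported file; the real one has a few hundred rows.
SAMPLE = '''Timestamp,title,genres
10/19/2020 13:00:00,Punisher,Action
10/19/2020 13:00:05,The Office,Comedy
10/19/2020 13:00:09,Breaking Bad,"Crime, Drama, Thriller"
'''


def read_titles(file):
    """Return the second column (index 1) of every row after the header."""
    reader = csv.reader(file)
    next(reader)  # skip the header row: Timestamp, title, genres
    return [row[1] for row in reader]


# With the real file, this would instead be something like:
#     with open("Favorite TV Shows - Form Responses 1.csv", "r") as file:
#         titles = read_titles(file)
titles = read_titles(io.StringIO(SAMPLE))
for title in titles:
    print(title)
```

Note that the index 1 is tied to the column order, which is the weakness addressed next.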

346
00:16:55,740 --> 00:16:57,850
But let's improve it just a little bit.

347
00:16:57,850 --> 00:17:02,490
It turns out that using the
csv.reader isn't necessarily

348
00:17:02,490 --> 00:17:04,050
the best approach in Python.

349
00:17:04,050 --> 00:17:07,589
Many of you have already discovered
a DictReader, a dictionary reader,

350
00:17:07,589 --> 00:17:10,740
which is nice, because then you don't
have to know or keep double checking

351
00:17:10,740 --> 00:17:13,230
what number column your data is in.

352
00:17:13,230 --> 00:17:17,520
You can instead refer to it by
the header itself, so by "title"

353
00:17:17,520 --> 00:17:18,660
or by "genres."

354
00:17:18,660 --> 00:17:21,052
This is also good, because
if you or maybe a colleague

355
00:17:21,052 --> 00:17:23,010
are sort of messing around
with the spreadsheet

356
00:17:23,010 --> 00:17:26,339
and they rearrange the columns
by dragging them left or right,

357
00:17:26,339 --> 00:17:30,120
any numbers you have used
in your code, 0, 1, 2 on up,

358
00:17:30,120 --> 00:17:34,390
could suddenly be incorrect if your
colleague has reordered those columns.

359
00:17:34,390 --> 00:17:37,590
So using a dictionary reader tends to
be a little more robust, because it

360
00:17:37,590 --> 00:17:40,480
uses the titles, not the mere numbers.

361
00:17:40,480 --> 00:17:43,230
It's still fallible if someone,
yourself or someone else,

362
00:17:43,230 --> 00:17:47,978
changes the values in that very first
row and renames titles or genres.

363
00:17:47,978 --> 00:17:49,270
Then things are going to break.

364
00:17:49,270 --> 00:17:51,270
But at that point, we
kind of have to blame you

365
00:17:51,270 --> 00:17:53,730
for not having kept track of
your code versus your data.

366
00:17:53,730 --> 00:17:55,020
But still a risk.

367
00:17:55,020 --> 00:17:58,445
So I'm going to change this to
dictionary reader or DictReader here.

368
00:17:58,445 --> 00:18:00,570
And pretty much the rest
of my code can be the same

369
00:18:00,570 --> 00:18:02,970
except I don't need this
hack here on line 5.

370
00:18:02,970 --> 00:18:06,750
I don't need to just skip over
to the next row from the get-go,

371
00:18:06,750 --> 00:18:10,890
because I now want the dictionary
reader to handle the process of reading

372
00:18:10,890 --> 00:18:11,985
that first row for me.

373
00:18:11,985 --> 00:18:13,860
But otherwise, everything
else stays the same

374
00:18:13,860 --> 00:18:15,693
except for this last
line, where now I think

375
00:18:15,693 --> 00:18:21,300
I can now use row as a
dictionary, not as a list per se,

376
00:18:21,300 --> 00:18:24,880
and print out specifically
the title from each given row.

377
00:18:24,880 --> 00:18:27,690
So let me go ahead and run
python of favorites.py again.

378
00:18:27,690 --> 00:18:31,480
And voila, it looks like I got the
same result, several hundred of them.

379
00:18:31,480 --> 00:18:34,260
But let me stipulate that it's
doing the same thing if we actually

380
00:18:34,260 --> 00:18:36,490
compared both of those side-by-side.

381
00:18:36,490 --> 00:18:36,990
All right.

382
00:18:36,990 --> 00:18:39,180
Before I forge ahead now
to actually augment this

383
00:18:39,180 --> 00:18:44,610
with new functionality, any questions
or confusion on this Python script

384
00:18:44,610 --> 00:18:49,530
we just wrote to open a file, wrap
it with a reader or DictReader,

385
00:18:49,530 --> 00:18:54,510
and then iterate over the rows one
at a time, printing the titles?

386
00:18:54,510 --> 00:18:56,510
Any questions, confusion
on syntax at all?

387
00:18:56,510 --> 00:18:57,010
It's OK.

388
00:18:57,010 --> 00:18:59,370
We've only known or
seen Python for a week.

389
00:18:59,370 --> 00:19:01,380
It's fine if it's still quite new.

390
00:19:01,380 --> 00:19:04,115
Anything, Brian, we should address?

391
00:19:04,115 --> 00:19:04,740
BRIAN YU: Yeah.

392
00:19:04,740 --> 00:19:08,800
So why is it that you don't need
to close the file using the syntax

393
00:19:08,800 --> 00:19:10,190
that you're using right here?

394
00:19:10,190 --> 00:19:11,732
DAVID J. MALAN: Really good question.

395
00:19:11,732 --> 00:19:15,250
Last week, I more pedantically
used open on its own.

396
00:19:15,250 --> 00:19:19,210
And then I later used a close function
that was associated with the file

397
00:19:19,210 --> 00:19:20,470
that I had just opened.

398
00:19:20,470 --> 00:19:23,800
Now, the more Pythonic way
to do things, if you will,

399
00:19:23,800 --> 00:19:27,370
is actually to use this with
keyword, which didn't exist in C.

400
00:19:27,370 --> 00:19:29,830
And it just tends to be a
useful feature in Python

401
00:19:29,830 --> 00:19:35,470
whereby if you say with open, dot dot
dot, it will open the file for you.

402
00:19:35,470 --> 00:19:39,280
Then it will remain open so long
as your code is indented inside

403
00:19:39,280 --> 00:19:41,410
of that with keyword's block.

404
00:19:41,410 --> 00:19:43,780
And as soon as you get to
the end of that indented block,

405
00:19:43,780 --> 00:19:45,732
it will automatically be closed for you.

406
00:19:45,732 --> 00:19:48,190
So this is one of these features
where Python in some sense

407
00:19:48,190 --> 00:19:50,770
is trying to protect us from ourselves.

408
00:19:50,770 --> 00:19:52,900
It's probably pretty
common for humans, myself

409
00:19:52,900 --> 00:19:55,000
included, to forget to close your file.

410
00:19:55,000 --> 00:19:57,580
That can create problems with
saving things permanently.

411
00:19:57,580 --> 00:19:59,990
It can create memory
leaks, as we know from C.

412
00:19:59,990 --> 00:20:02,740
So the with keyword just assumes
that I'm not going to be an idiot

413
00:20:02,740 --> 00:20:04,150
and forget to close the file.

414
00:20:04,150 --> 00:20:08,050
Python is going to do
it for me automatically.
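
A small sketch of what the with keyword does, assuming a hypothetical scratch file created just for this example. The file object's closed attribute shows that the file is open inside the indented block and closed once the block ends.

```python
import os
import tempfile

# Hypothetical scratch file so the sketch is self-contained.
path = os.path.join(tempfile.mkdtemp(), "example.csv")

with open(path, "w") as file:
    file.write("title\nThe Office\n")
    open_inside = not file.closed  # still open inside the indented block

closed_after = file.closed  # closed automatically once the block ends

with open(path) as file:
    contents = file.read()

print(open_inside, closed_after)
```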

415
00:20:08,050 --> 00:20:10,870
Other questions or confusions, Brian?

416
00:20:10,870 --> 00:20:13,690
BRIAN YU: How does
DictReader know that Title

417
00:20:13,690 --> 00:20:16,270
is the name of the key
inside of the dictionary?

418
00:20:16,270 --> 00:20:18,020
DAVID J. MALAN: Really
good question, too.

419
00:20:18,020 --> 00:20:22,090
So it is designed by the
authors of the Python language

420
00:20:22,090 --> 00:20:25,450
to look at the very
first row in the file,

421
00:20:25,450 --> 00:20:29,380
split it on the commas
in that very first row,

422
00:20:29,380 --> 00:20:34,090
and just assume that the first word
or phrase before the first comma

423
00:20:34,090 --> 00:20:37,270
is the name of the first
column, that the second word

424
00:20:37,270 --> 00:20:42,470
or phrase after the first comma
is the name of the second column,

425
00:20:42,470 --> 00:20:43,310
and so forth.

426
00:20:43,310 --> 00:20:47,500
So a DictReader just presumes,
as is the convention with CSVs,

427
00:20:47,500 --> 00:20:51,280
that your first row is going to
contain the headings that you

428
00:20:51,280 --> 00:20:53,290
want to use to refer to those columns.

429
00:20:53,290 --> 00:20:56,860
If your CSV happens not to have
such a heading whereby it just

430
00:20:56,860 --> 00:20:59,050
jumps right in on the
first row to real data,

431
00:20:59,050 --> 00:21:02,140
then you're not going to be able to
use a DictReader correctly, at least

432
00:21:02,140 --> 00:21:04,670
not without some manual configuration.
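
One form that manual configuration can take is the fieldnames parameter of csv.DictReader. This is a sketch under assumptions: the headerless sample data and the key names title and genres are made up for illustration.

```python
import csv
import io

# Hypothetical headerless data: the first row is already a real record.
sample = io.StringIO("The Office,Comedy\nBreaking Bad,Drama\n")

# fieldnames supplies the keys, so no row is consumed as a header.
reader = csv.DictReader(sample, fieldnames=["title", "genres"])
rows = list(reader)

for row in rows:
    print(row["title"], row["genres"])
```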

433
00:21:04,670 --> 00:21:05,170
All right.

434
00:21:05,170 --> 00:21:06,840
So let's go ahead and--

435
00:21:06,840 --> 00:21:08,590
now I feel like there's
a whole mess here.

436
00:21:08,590 --> 00:21:10,562
And some of these shows
are pretty popular.

437
00:21:10,562 --> 00:21:13,270
And as I'm glancing over this, I
definitely see some duplication.

438
00:21:13,270 --> 00:21:15,010
A whole bunch of you like The Office.

439
00:21:15,010 --> 00:21:17,530
A whole bunch of you like
Breaking Bad, Game of Thrones,

440
00:21:17,530 --> 00:21:19,280
and a whole bunch of
other shows, as well.

441
00:21:19,280 --> 00:21:21,250
So it would be nicer,
I think, if we kind of

442
00:21:21,250 --> 00:21:25,480
narrow the scope of our look at this
data by just looking at unique values.

443
00:21:25,480 --> 00:21:26,807
We're looking for unique values.

444
00:21:26,807 --> 00:21:29,140
So rather than just iterate
over the file top to bottom,

445
00:21:29,140 --> 00:21:31,690
printing out one title
after another, why

446
00:21:31,690 --> 00:21:34,330
don't we go ahead and sort of
accumulate all of this data

447
00:21:34,330 --> 00:21:38,800
in some kind of data structure so that
we can throw away duplicate values

448
00:21:38,800 --> 00:21:42,910
and then only print out the unique
titles that we've accumulated?

449
00:21:42,910 --> 00:21:44,630
So I bet we can do this in a few ways.

450
00:21:44,630 --> 00:21:47,650
But if we think back to last week's
demonstration of our dictionary,

451
00:21:47,650 --> 00:21:50,388
you'll recall that I used
what was called a set.

452
00:21:50,388 --> 00:21:52,930
And I'm going to go ahead and
create a variable called titles

453
00:21:52,930 --> 00:21:55,180
and set it equal to
something called set.

454
00:21:55,180 --> 00:21:57,310
And a set is just a
collection of values.

455
00:21:57,310 --> 00:21:58,540
It's kind of like a list.

456
00:21:58,540 --> 00:22:00,580
But it eliminates duplicates for me.

457
00:22:00,580 --> 00:22:02,860
And that would seem to be
exactly the characteristic

458
00:22:02,860 --> 00:22:04,870
that I want for this program.

459
00:22:04,870 --> 00:22:08,560
Now, instead of printing each
title, which is now premature

460
00:22:08,560 --> 00:22:10,480
if I want to first
filter out duplicates,

461
00:22:10,480 --> 00:22:11,990
I'm going to go ahead and do this.

462
00:22:11,990 --> 00:22:17,290
I'm going to go ahead and add to the
titles set using the add function

463
00:22:17,290 --> 00:22:19,570
the current row's title.

464
00:22:19,570 --> 00:22:21,250
So again, I'm not printing it now.

465
00:22:21,250 --> 00:22:25,510
I'm instead adding to the title
set that particular title.

466
00:22:25,510 --> 00:22:27,190
And if it's there already, no big deal.

467
00:22:27,190 --> 00:22:29,260
The set data structure
in Python is going

468
00:22:29,260 --> 00:22:31,000
to throw away the duplicates for me.

469
00:22:31,000 --> 00:22:33,310
And it's only going to go
ahead and keep the uniques.

470
00:22:33,310 --> 00:22:37,330
Now, at the bottom of my file, I need
to do a little more work, admittedly.

471
00:22:37,330 --> 00:22:40,990
Now I have to iterate over the set to
print out only those unique titles.

472
00:22:40,990 --> 00:22:41,740
So let me do this.

473
00:22:41,740 --> 00:22:46,555
For title in titles, go
ahead and print out title.

474
00:22:46,555 --> 00:22:49,180
And this is where Python just
gets really user-friendly, right?

475
00:22:49,180 --> 00:22:53,050
You don't have to do int i gets
0, i less than n, or whatever.

476
00:22:53,050 --> 00:22:55,540
You can just say for title in titles.

477
00:22:55,540 --> 00:22:59,200
And if the titles variable
is the type of data structure

478
00:22:59,200 --> 00:23:04,690
that you can iterate over, which it
will be if it's a list or if it's a set

479
00:23:04,690 --> 00:23:06,940
or even if it's a dictionary,
another data structure

480
00:23:06,940 --> 00:23:11,180
we saw last week in Python, the for loop
in Python will just know what to do.

481
00:23:11,180 --> 00:23:15,880
This will loop over all of
the titles in the titles set.
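
A rough sketch of the set-based version, assuming a single title column and a few made-up rows in place of the real CSV file.

```python
import csv
import io

# Hypothetical sample with one repeated title.
sample = io.StringIO("title\nThe Office\nBreaking Bad\nThe Office\n")

titles = set()
for row in csv.DictReader(sample):
    titles.add(row["title"])  # adding a value already present changes nothing

for title in titles:
    print(title)  # a set has no guaranteed order
```

The repeated title is stored only once, which is why the printed output gets shorter.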

482
00:23:15,880 --> 00:23:18,700
So let me go ahead and save
this file and go ahead now

483
00:23:18,700 --> 00:23:20,920
and run python of favorites.py.

484
00:23:20,920 --> 00:23:25,240
And it looks like, yeah, the
list is different in some way.

485
00:23:25,240 --> 00:23:29,463
But I'm seeing fewer results as I
scroll up, definitely fewer than before,

486
00:23:29,463 --> 00:23:31,630
because my scrollbar didn't
jump nearly as far down.

487
00:23:31,630 --> 00:23:33,260
But honestly, this is kind of a mess.

488
00:23:33,260 --> 00:23:34,660
Let's go ahead and sort this.

489
00:23:34,660 --> 00:23:37,502
Now, in C, it would have been
kind of a pain to sort things.

490
00:23:37,502 --> 00:23:39,460
We'd have to whip out
the pseudocode, probably,

491
00:23:39,460 --> 00:23:41,460
for bubble sort, selection
sort, or, god forbid,

492
00:23:41,460 --> 00:23:43,270
merge sort and then
implement it ourselves.

493
00:23:43,270 --> 00:23:47,210
But no, with Python comes, really, the
proverbial kitchen sink of functions.

494
00:23:47,210 --> 00:23:49,510
So if you want to sort
this set, you know what?

495
00:23:49,510 --> 00:23:50,950
Just say you want it sorted.

496
00:23:50,950 --> 00:23:53,890
There is a function in
Python called sorted

497
00:23:53,890 --> 00:23:57,257
that will use one of those better
algorithms-- maybe it's merge sort.

498
00:23:57,257 --> 00:23:58,840
Maybe it's something called quicksort.

499
00:23:58,840 --> 00:24:00,350
Maybe it's something else altogether.

500
00:24:00,350 --> 00:24:02,380
It's not going to use a
big O of n squared sort.

501
00:24:02,380 --> 00:24:06,940
Someone at Python probably has spent the
time implementing a better sort for us.

502
00:24:06,940 --> 00:24:08,817
But it will go ahead
and sort the set for me.
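
A short sketch of sorted applied to a set, using made-up titles. Note that sorted leaves the set alone and returns a new list, and that plain string comparison puts digits before uppercase letters and uppercase before lowercase. In CPython the built-in sort is a merge-sort variant, so it is not an n squared algorithm.

```python
# Hypothetical titles, including one typed in all lowercase.
titles = {"Friends", "The Office", "Breaking Bad", "24", "avatar"}

ordered = sorted(titles)  # returns a new list; the set itself is unchanged

for title in ordered:
    print(title)
```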

503
00:24:08,817 --> 00:24:10,400
Now let me go ahead and do this again.

504
00:24:10,400 --> 00:24:13,360
Let me increase the size of my
terminal window and rerun python

505
00:24:13,360 --> 00:24:15,070
of favorites.py.

506
00:24:15,070 --> 00:24:15,640
OK.

507
00:24:15,640 --> 00:24:19,330
And now we have an
interesting assortment

508
00:24:19,330 --> 00:24:22,570
of shows that's easier for
me to wrap my mind around,

509
00:24:22,570 --> 00:24:25,743
because I have it now sorted here.

510
00:24:25,743 --> 00:24:28,660
And indeed, if I scroll all the way
up, we should see all of the shows

511
00:24:28,660 --> 00:24:32,257
beginning with numbers
or a period, which

512
00:24:32,257 --> 00:24:34,090
might have just been
someone playing around,

513
00:24:34,090 --> 00:24:36,290
followed by the A words,
the B words, and so forth.

514
00:24:36,290 --> 00:24:38,707
So now it's a little easier
to wrap our minds around this.

515
00:24:38,707 --> 00:24:39,760
But something's up.

516
00:24:39,760 --> 00:24:44,110
I feel like a lot of you like
Avatar: The Last Airbender.

517
00:24:44,110 --> 00:24:47,980
And yet I'm seeing it,
indeed, four different times.

518
00:24:47,980 --> 00:24:49,720
But I thought we were
filtering this down

519
00:24:49,720 --> 00:24:53,350
to uniques by using that set structure.

520
00:24:53,350 --> 00:24:54,340
So what's going on?

521
00:24:54,340 --> 00:24:56,200
And in fact, if I keep
scrolling, I'm pretty

522
00:24:56,200 --> 00:24:59,080
sure I saw more duplicates in here.

523
00:24:59,080 --> 00:25:01,840
BoJack Horseman, Breaking
Bad, Breaking Bad,

524
00:25:01,840 --> 00:25:07,810
Brooklyn Nine-Nine, Brooklyn Nine-Nine,
CS50 in several different flavors.

525
00:25:07,810 --> 00:25:10,210
And yes, it keeps going.

526
00:25:10,210 --> 00:25:11,110
Friends.

527
00:25:11,110 --> 00:25:13,050
So I see a lot of duplicate values.

528
00:25:13,050 --> 00:25:14,980
So what's going on?

529
00:25:14,980 --> 00:25:17,960
Yeah, [? Gadana? ?]

530
00:25:17,960 --> 00:25:22,480
AUDIENCE: Yeah, so your current
sort is case insensitive-- sorry,

531
00:25:22,480 --> 00:25:26,680
is case sensitive, meaning that if
someone spells avatar with capital

532
00:25:26,680 --> 00:25:30,800
A's in some places, then it's going
to be a different result each time.

533
00:25:30,800 --> 00:25:32,050
DAVID J. MALAN: Yeah, exactly.

534
00:25:32,050 --> 00:25:35,530
Some of you weren't quite diligent
when it came to capitalization.

535
00:25:35,530 --> 00:25:38,048
And so in fact, the reality
is, as [? Gadana ?] notes,

536
00:25:38,048 --> 00:25:39,840
that there's differences
in capitalization.

537
00:25:39,840 --> 00:25:41,090
Now, we've addressed this before.

538
00:25:41,090 --> 00:25:43,230
In fact, when you implemented
your spell checker,

539
00:25:43,230 --> 00:25:44,980
you had to deal with
this already when you

540
00:25:44,980 --> 00:25:46,780
were spell checking an arbitrary text.

541
00:25:46,780 --> 00:25:48,160
Some words might be capitalized.

542
00:25:48,160 --> 00:25:50,300
Some might be all
lowercase, all uppercase.

543
00:25:50,300 --> 00:25:52,810
And you wanted to tolerate
different casings.

544
00:25:52,810 --> 00:25:55,840
And so we probably solved this
by just forcing everything

545
00:25:55,840 --> 00:25:58,540
to uppercase or everything to
lowercase and doing things,

546
00:25:58,540 --> 00:26:00,500
therefore, case insensitively.

547
00:26:00,500 --> 00:26:01,750
So give me just a moment here.

548
00:26:01,750 --> 00:26:06,007
And I'm going to go ahead and make
a quick change to my form here.

549
00:26:06,007 --> 00:26:07,840
Let's go ahead and
change this in such a way

550
00:26:07,840 --> 00:26:11,020
that we actually force everything
to uppercase or lowercase.

551
00:26:11,020 --> 00:26:13,760
Doesn't really matter which, but
we need to canonicalize things,

552
00:26:13,760 --> 00:26:14,860
so to speak, in some way.

553
00:26:14,860 --> 00:26:18,790
And to canonicalize things just
means to format all of your data

554
00:26:18,790 --> 00:26:20,020
in some standard way.

555
00:26:20,020 --> 00:26:22,390
So to [? Gadana's ?] point,
let's just standardize

556
00:26:22,390 --> 00:26:23,920
the capitalization of things.

557
00:26:23,920 --> 00:26:25,600
Maybe all uppercase, all lowercase.

558
00:26:25,600 --> 00:26:27,260
We just need to make a judgment call.

559
00:26:27,260 --> 00:26:29,427
So I'm going to go ahead
and make a few tweaks here.

560
00:26:29,427 --> 00:26:30,670
I'm still going to use a set.

561
00:26:30,670 --> 00:26:33,190
I'm still going to
read the CSV as before.

562
00:26:33,190 --> 00:26:37,270
But instead of just adding the
title with row bracket title,

563
00:26:37,270 --> 00:26:40,180
I'm going to go ahead and
force it to uppercase, just

564
00:26:40,180 --> 00:26:42,850
arbitrarily, just for
the sake of uniformity.

565
00:26:42,850 --> 00:26:45,610
And then let's go ahead and check
what exactly has happened here.

566
00:26:45,610 --> 00:26:47,050
I'm not going to change anything else.

567
00:26:47,050 --> 00:26:49,383
But let me go ahead and
increase the size of my terminal

568
00:26:49,383 --> 00:26:52,600
window, rerun python of favorites.py.

569
00:26:52,600 --> 00:26:53,517
And voila.

570
00:26:53,517 --> 00:26:55,600
It's a little harder to
read, just because I'm not

571
00:26:55,600 --> 00:26:56,770
used to reading all caps.

572
00:26:56,770 --> 00:26:58,687
Kind of looks like we're
yelling at ourselves.

573
00:26:58,687 --> 00:27:01,600
But I don't see-- wait a minute.

574
00:27:01,600 --> 00:27:05,560
I still see The Office over here twice.

575
00:27:05,560 --> 00:27:11,920
If I keep scrolling here, so far, I see
Stranger Things and Strainger Things.

576
00:27:11,920 --> 00:27:13,570
That just looks like a typo.

577
00:27:13,570 --> 00:27:15,680
I see two Sherlocks, though.

578
00:27:15,680 --> 00:27:17,380
This is a little suspicious.

579
00:27:17,380 --> 00:27:21,730
So [? Gadana, ?] you and I don't
seem to have solved things fully.

580
00:27:21,730 --> 00:27:24,220
And this one's a little more subtle.

581
00:27:24,220 --> 00:27:30,970
What more should I perhaps do to my data
to ensure we get duplicates removed?

582
00:27:30,970 --> 00:27:32,200
Olivia?

583
00:27:32,200 --> 00:27:34,407
AUDIENCE: Maybe trim around the edges.

584
00:27:34,407 --> 00:27:35,990
DAVID J. MALAN: Trim around the edges.

585
00:27:35,990 --> 00:27:37,480
I like the sound of that,
but what do you mean?

586
00:27:37,480 --> 00:27:38,320
What does that do?

587
00:27:38,320 --> 00:27:40,862
AUDIENCE: Oh, like, trim off
the extra spaces in case someone

588
00:27:40,862 --> 00:27:42,940
put a space before or after the words.

589
00:27:42,940 --> 00:27:44,230
DAVID J. MALAN: Yeah, exactly.

590
00:27:44,230 --> 00:27:47,140
It's pretty common for humans,
intentionally or accidentally,

591
00:27:47,140 --> 00:27:48,920
to hit the Space bar
where they shouldn't.

592
00:27:48,920 --> 00:27:51,790
And in fact, I'm kind of
inferring that I bet one or more

593
00:27:51,790 --> 00:27:55,143
of you accidentally typed
Sherlock, space, and then decided,

594
00:27:55,143 --> 00:27:55,810
nope, that's it.

595
00:27:55,810 --> 00:27:57,040
I'm not typing anything else.

596
00:27:57,040 --> 00:28:00,740
But that space, even though we can't
quite see it obviously, is there.

597
00:28:00,740 --> 00:28:04,060
And when we do a string comparison or
when the set data structure does that,

598
00:28:04,060 --> 00:28:08,110
it's actually going to be noticed
when doing those comparisons.

599
00:28:08,110 --> 00:28:10,092
And therefore they're
not going to be the same.

600
00:28:10,092 --> 00:28:11,800
So I can do this in
a few different ways.

601
00:28:11,800 --> 00:28:15,080
But it turns out, in Python, you
can chain functions together,

602
00:28:15,080 --> 00:28:17,390
which is also, too,
kind of a fancy feature.

603
00:28:17,390 --> 00:28:18,650
Notice what I'm doing here.

604
00:28:18,650 --> 00:28:21,070
I'm still accessing the titles set.

605
00:28:21,070 --> 00:28:23,710
I'm adding the following value to it.

606
00:28:23,710 --> 00:28:27,610
I'm adding the value row
bracket title, but not quite.

607
00:28:27,610 --> 00:28:30,880
That is a string or an
str, in Python speak.

608
00:28:30,880 --> 00:28:33,130
I'm going to go ahead
and strip it, which

609
00:28:33,130 --> 00:28:36,340
means if we look up the documentation
for this function, to Olivia's point,

610
00:28:36,340 --> 00:28:39,130
it's going to strip off or
trim all of the white space

611
00:28:39,130 --> 00:28:41,200
to the left, all of the
white space to the right,

612
00:28:41,200 --> 00:28:43,840
whether that's the Space
bar or the Enter key

613
00:28:43,840 --> 00:28:46,870
or the Tab character or a
few other things, as well.

614
00:28:46,870 --> 00:28:50,200
It's just going to get rid of
leading and trailing white space.

615
00:28:50,200 --> 00:28:53,650
And then whatever's left over, I'm
going to go ahead and force everything

616
00:28:53,650 --> 00:28:56,810
to uppercase in the spirit of
[? Gadana's ?] suggestion, too.

617
00:28:56,810 --> 00:29:00,470
So we're sort of combining two good
ideas now to really massage the data,

618
00:29:00,470 --> 00:29:02,470
if you will, into a cleaner format.
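
A minimal sketch of the chained calls, with a few made-up inputs standing in for what people typed into the form. strip removes leading and trailing white space first, and upper is then called on the result.

```python
# Hypothetical raw inputs with stray spaces, a newline, and mixed case.
raw_titles = ["Sherlock", "sherlock ", " SHERLOCK", "The Office\n"]

titles = set()
for raw in raw_titles:
    # Chained methods run left to right: strip first, then upper.
    titles.add(raw.strip().upper())

for title in sorted(titles):
    print(title)
```

All three spellings of Sherlock collapse into one entry once they are canonicalized.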

619
00:29:02,470 --> 00:29:04,780
And this is such a real-world reality.

620
00:29:04,780 --> 00:29:09,588
Humans, you and I, cannot be trusted to
input data the way we are supposed to.

621
00:29:09,588 --> 00:29:11,380
Sometimes it's all
lowercase, because we're

622
00:29:11,380 --> 00:29:13,463
being a little lazy or a
little social media-like,

623
00:29:13,463 --> 00:29:16,120
even if we're checking
out from Amazon and trying

624
00:29:16,120 --> 00:29:18,310
to input a valid postal address.

625
00:29:18,310 --> 00:29:22,127
Sometimes it's all capitals, because
I can think of a few people in my life

626
00:29:22,127 --> 00:29:24,460
who don't quite understand
the Caps Lock thing just yet.

627
00:29:24,460 --> 00:29:26,710
And so things might be
all capitalized instead.

628
00:29:26,710 --> 00:29:30,580
This is not good for computer
systems that require precision,

629
00:29:30,580 --> 00:29:32,440
to our emphasis in week 0.

630
00:29:32,440 --> 00:29:35,140
And so massaging data
means cleaning it up,

631
00:29:35,140 --> 00:29:38,500
doing some mutations that don't
really change the meaning of the data

632
00:29:38,500 --> 00:29:41,740
but canonicalize it,
standardize it, so that you're

633
00:29:41,740 --> 00:29:44,950
comparing apples and apples, so
to speak, not apples and oranges.

634
00:29:44,950 --> 00:29:47,950
Well, let me go ahead and run
this again in my bigger terminal

635
00:29:47,950 --> 00:29:50,140
window, python of favorites.py.

636
00:29:50,140 --> 00:29:50,710
Voila.

637
00:29:50,710 --> 00:29:55,220
In scrolling up, up, up, I
think we're in a better place.

638
00:29:55,220 --> 00:29:57,520
I only see one Office now.

639
00:29:57,520 --> 00:30:01,510
And if I keep scrolling up and up
and up, I'm seeing typos still,

640
00:30:01,510 --> 00:30:03,910
but nothing related to white space.

641
00:30:03,910 --> 00:30:08,340
And I think we have a much cleaner
unique list of titles at this point.

642
00:30:08,340 --> 00:30:10,800
Of course, if we scroll
up, I would have to be

643
00:30:10,800 --> 00:30:14,970
a lot more clever if I want to detect
things like typographical errors.

644
00:30:14,970 --> 00:30:19,870
It looks like one of you was very
diligent about putting F.R.I.

645
00:30:19,870 --> 00:30:22,970
and so forth but then got bored at
the end and left off the last period.

646
00:30:22,970 --> 00:30:25,470
But that's going to happen when
you're taking in user input.

647
00:30:25,470 --> 00:30:28,140
We've, of course, got all
these variants of CS50.

648
00:30:28,140 --> 00:30:30,570
That's going to be a mess
to clean up, because now you

649
00:30:30,570 --> 00:30:35,130
can imagine having to add a whole bunch
of if conditions and elses and else ifs

650
00:30:35,130 --> 00:30:38,160
to clean all of that up if
we do want to canonicalize

651
00:30:38,160 --> 00:30:41,920
all different flavors of CS50
as, quote unquote, "CS50."

652
00:30:41,920 --> 00:30:43,890
So this is a very slippery slope.

653
00:30:43,890 --> 00:30:47,010
You and I could start writing a huge
amount of code just to clean this up.

654
00:30:47,010 --> 00:30:50,940
But that's the reality when
dealing with real-world data.

655
00:30:50,940 --> 00:30:55,140
Well, let's go ahead now and
improve this program further,

656
00:30:55,140 --> 00:30:57,810
do something a little
fancier, because I now

657
00:30:57,810 --> 00:31:00,090
can trust that my data
has been canonicalized

658
00:31:00,090 --> 00:31:03,900
except for the actual typos or the
weird variants of CS50 and the like.

659
00:31:03,900 --> 00:31:07,470
Let's go ahead and figure out
what's the most popular favorite TV

660
00:31:07,470 --> 00:31:10,510
show among the audience here.

661
00:31:10,510 --> 00:31:12,300
So I'm going to start
where I have before,

662
00:31:12,300 --> 00:31:14,133
with my current code,
because I think I have

663
00:31:14,133 --> 00:31:16,143
most of the building blocks in place.

664
00:31:16,143 --> 00:31:18,810
I'm going to go ahead and clean
up my code a little bit in here.

665
00:31:18,810 --> 00:31:22,050
I'm going to go ahead and give myself
a separate variable now called title

666
00:31:22,050 --> 00:31:26,040
just so that I can think about things
in a little more orderly fashion.

667
00:31:26,040 --> 00:31:29,200
But I'm not going to start adding
things to this set anymore.

668
00:31:29,200 --> 00:31:32,220
In fact, a set, I don't
think, is really going

669
00:31:32,220 --> 00:31:35,880
to be sufficient to keep track
of the popularity of TV shows,

670
00:31:35,880 --> 00:31:38,820
because by definition, the set
is throwing away duplicates.

671
00:31:38,820 --> 00:31:40,680
But the goal now is
kind of the opposite.

672
00:31:40,680 --> 00:31:45,240
I want to know which are the
duplicates so that I can tell you

673
00:31:45,240 --> 00:31:46,860
that this many people like The Office.

674
00:31:46,860 --> 00:31:50,530
This many people like
Breaking Bad and the like.

675
00:31:50,530 --> 00:31:56,010
So what tools do we have in Python's
toolkit via which we could accumulate

676
00:31:56,010 --> 00:31:59,320
or figure out that information?

677
00:31:59,320 --> 00:32:02,740
Any thoughts on what data
structure might help us here

678
00:32:02,740 --> 00:32:07,870
if we want to figure out show,
popularity, show, popularity?

679
00:32:07,870 --> 00:32:11,950
And by popularity, I just mean the
frequency of it in the CSV file.

680
00:32:11,950 --> 00:32:13,720
Santiago?

681
00:32:13,720 --> 00:32:17,110
AUDIENCE: I guess one option
could be to use dictionaries

682
00:32:17,110 --> 00:32:20,410
so that you can have The
Office, I don't know,

683
00:32:20,410 --> 00:32:23,110
20 votes, and then Game
of Thrones, another one,

684
00:32:23,110 --> 00:32:27,023
so that a dictionary could
really help you visualize that.

685
00:32:27,023 --> 00:32:28,690
DAVID J. MALAN: Yeah, perfect instincts.

686
00:32:28,690 --> 00:32:31,660
Recall that a dictionary, at the
end of the day, no matter how

687
00:32:31,660 --> 00:32:34,450
sophisticated its implementation is
underneath the hood,

688
00:32:34,450 --> 00:32:35,680
like your spell checker--

689
00:32:35,680 --> 00:32:38,240
it's just a collection
of key-value pairs.

690
00:32:38,240 --> 00:32:42,790
And indeed, it's maybe one of the most
useful data structures in any language,

691
00:32:42,790 --> 00:32:45,820
because this ability to associate
one piece of data with another

692
00:32:45,820 --> 00:32:49,150
is just a very general
purpose solution to problems.

693
00:32:49,150 --> 00:32:51,730
And indeed, to Santiago's
point, if the problem at hand

694
00:32:51,730 --> 00:32:53,650
is to figure out the
popularity of shows,

695
00:32:53,650 --> 00:32:58,510
well, let's make the keys the titles of
our shows and the frequencies thereof--

696
00:32:58,510 --> 00:32:59,830
the votes, so to speak--

697
00:32:59,830 --> 00:33:01,810
the values of those keys.

698
00:33:01,810 --> 00:33:06,450
We're going to map title to votes, title
to vote, title to vote, and so forth.

699
00:33:06,450 --> 00:33:08,145
So a dictionary is exactly that.

700
00:33:08,145 --> 00:33:09,520
So let me go ahead and scroll up.

701
00:33:09,520 --> 00:33:10,978
And I can make a little tweak here.

702
00:33:10,978 --> 00:33:14,260
Instead of a set, I can instead
say dict and give myself

703
00:33:14,260 --> 00:33:15,598
just an empty dictionary.

704
00:33:15,598 --> 00:33:18,640
There's actually shorthand notation
for that that's a little more common.

705
00:33:18,640 --> 00:33:20,830
So you use two empty curly braces.

706
00:33:20,830 --> 00:33:22,810
That just means the exact same thing.

707
00:33:22,810 --> 00:33:25,270
Give me a dictionary
that's initially empty.

708
00:33:25,270 --> 00:33:27,400
There's no fancy shortcut for an empty set.

709
00:33:27,400 --> 00:33:30,370
You have to literally type out
S-E-T, open paren and closed paren.

710
00:33:30,370 --> 00:33:34,220
But dictionaries are so common,
so popular, so powerful,

711
00:33:34,220 --> 00:33:38,350
they have this little syntactic
shortcut of just two curly braces,

712
00:33:38,350 --> 00:33:39,560
open and closed.
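
A quick sketch comparing those spellings. The variable names are only for illustration.

```python
from_function = dict()   # empty dictionary, spelled out
from_braces = {}         # same thing with the shorthand
empty_set = set()        # an empty set has to be written this way

print(type(from_braces).__name__, type(empty_set).__name__)
```

Two empty curly braces always mean a dictionary, which is why the empty set needs the function call.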

713
00:33:39,560 --> 00:33:42,700
So now that I have that,
let me go ahead and do this.

714
00:33:42,700 --> 00:33:45,580
Inside of my for loop,
instead of printing

715
00:33:45,580 --> 00:33:48,880
the title, which I don't want to do,
and instead of adding it to the set,

716
00:33:48,880 --> 00:33:50,770
I now want to add it to the dictionary.

717
00:33:50,770 --> 00:33:51,860
So how do I do that?

718
00:33:51,860 --> 00:33:55,480
Well, if my dictionary is called titles,
I think I can essentially do something

719
00:33:55,480 --> 00:34:02,710
like this, titles bracket
title = or maybe += 1.

720
00:34:02,710 --> 00:34:07,120
Maybe I can kind of use the dictionary
as just a little cheat sheet

721
00:34:07,120 --> 00:34:12,050
of counts, numbers, that start at
0 and then just add 1, add 2, add 3.

722
00:34:12,050 --> 00:34:17,860
So every time I see The Office, The
Office, The Office, do += 1, += 1.

723
00:34:17,860 --> 00:34:20,199
We can't do ++, because
that's not a thing in Python.

724
00:34:20,199 --> 00:34:24,580
It only exists in C. But this would
seem to go into the dictionary called

725
00:34:24,580 --> 00:34:29,260
titles, look up the key that
matches this specific title,

726
00:34:29,260 --> 00:34:34,340
and then increment whatever
value is there by 1.

727
00:34:34,340 --> 00:34:37,380
But I'm going to go ahead and
run this a little naively here.

728
00:34:37,380 --> 00:34:40,280
Let me go ahead and run
python of favorites.py.

729
00:34:40,280 --> 00:34:43,400
And wow, it broke already on line 9.

730
00:34:43,400 --> 00:34:47,389
So sort of an apt choice
of show to begin with,

731
00:34:47,389 --> 00:34:49,530
we have a key error with Punisher.

732
00:34:49,530 --> 00:34:50,917
So Punisher is bad.

733
00:34:50,917 --> 00:34:52,250
Something bad has just happened.

734
00:34:52,250 --> 00:34:53,250
But what does that mean?

735
00:34:53,250 --> 00:34:55,429
A key error is referring
to the fact that I

736
00:34:55,429 --> 00:34:59,407
tried to access an invalid
key in a dictionary.

737
00:34:59,407 --> 00:35:01,490
This is saying that literally
in this line of code

738
00:35:01,490 --> 00:35:04,610
here, even though titles
is a dictionary and even

739
00:35:04,610 --> 00:35:07,130
though the value of
title, singular, is, quote

740
00:35:07,130 --> 00:35:09,560
unquote, "PUNISHER,"
I'm getting a key error,

741
00:35:09,560 --> 00:35:13,230
because that title does not yet exist.
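
A small sketch that reproduces the error in isolation, assuming an empty dictionary and the one title from the traceback. The try and except are here only so the snippet can show the error without crashing.

```python
titles = {}
title = "PUNISHER"

try:
    # += has to read the current value first, and there is none yet.
    titles[title] += 1
except KeyError as error:
    missing_key = error.args[0]
    print("KeyError:", missing_key)
```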

742
00:35:13,230 --> 00:35:17,060
So even if you're not sure of the
Python syntax for fixing this problem,

743
00:35:17,060 --> 00:35:21,530
what's the intuitive solution here?

744
00:35:21,530 --> 00:35:25,610
I cannot increment the
frequency of the Punisher,

745
00:35:25,610 --> 00:35:28,130
because Punisher is
not in the dictionary.

746
00:35:28,130 --> 00:35:29,986
It almost feels like a catch-22.

747
00:35:29,986 --> 00:35:31,890
[? Greg? ?]

748
00:35:31,890 --> 00:35:35,900
AUDIENCE: I think that you need,
first of all, to create a for loop

749
00:35:35,900 --> 00:35:40,520
and maybe assign a value to
everything in the dictionary.

750
00:35:40,520 --> 00:35:43,683
For example, a value 0, and then add 1.

751
00:35:43,683 --> 00:35:45,350
DAVID J. MALAN: Yeah, so good instincts.

752
00:35:45,350 --> 00:35:46,730
And here, I can use another metaphor.

753
00:35:46,730 --> 00:35:49,147
I worry we might have a chicken
and the egg problem there,

754
00:35:49,147 --> 00:35:51,470
because I don't think I can
go to the top of my code,

755
00:35:51,470 --> 00:35:56,420
add a loop that initializes all of
the values in the dictionary to 0,

756
00:35:56,420 --> 00:36:01,130
because I would need to know all of
the names of the shows at that point.

757
00:36:01,130 --> 00:36:02,180
Now, that's fine.

758
00:36:02,180 --> 00:36:05,330
I think I could take you maybe
more literally, [? Greg, ?]

759
00:36:05,330 --> 00:36:09,630
and open up the CSV file,
iterate over it top to bottom,

760
00:36:09,630 --> 00:36:12,920
and, any time I see a
title, just initialize it

761
00:36:12,920 --> 00:36:16,220
in the dictionary as
having a value of 0, 0, 0.

762
00:36:16,220 --> 00:36:20,280
Then have another for loop, maybe
reopen the file, and do the same.

763
00:36:20,280 --> 00:36:21,380
And that would work.

764
00:36:21,380 --> 00:36:23,540
But it's arguably not very efficient.

765
00:36:23,540 --> 00:36:26,330
It is the same asymptotically, in
terms of big O. But that would

766
00:36:26,330 --> 00:36:28,220
seem to be doing twice as much work.

767
00:36:28,220 --> 00:36:31,820
Iterate over the file once just
to initialize everything to 0.

768
00:36:31,820 --> 00:36:35,330
Then iterate over the file a second
time just to increment the counts.

769
00:36:35,330 --> 00:36:38,360
I think we can do things
a little more efficiently.

770
00:36:38,360 --> 00:36:41,090
I think we can achieve not only
correctness but better design.

771
00:36:41,090 --> 00:36:45,560
Any thoughts on how we can still
solve this problem without having

772
00:36:45,560 --> 00:36:48,290
to iterate over the whole thing twice?

773
00:36:48,290 --> 00:36:50,360
Yeah, [? Semowit? ?]

774
00:36:50,360 --> 00:36:53,450
AUDIENCE: I think we can
add in an if statement

775
00:36:53,450 --> 00:36:55,970
to check if that key
is in the dictionary.

776
00:36:55,970 --> 00:36:59,865
And if it's not, then add it and then
go ahead and increment the value after.

777
00:36:59,865 --> 00:37:00,740
DAVID J. MALAN: Nice.

778
00:37:00,740 --> 00:37:02,460
And we can do exactly that.

779
00:37:02,460 --> 00:37:04,310
So let's just apply that intuition.

780
00:37:04,310 --> 00:37:08,583
If the problem is that I'm trying to
access a key that does not yet exist,

781
00:37:08,583 --> 00:37:10,500
well, let's just be a
little smarter about it.

782
00:37:10,500 --> 00:37:14,090
And to [? Semowit's ?] point,
let's check whether the key exists.

783
00:37:14,090 --> 00:37:16,020
And if it does, then increment it.

784
00:37:16,020 --> 00:37:19,340
But if it does not, then and only
then, [? to Greg's ?] advice,

785
00:37:19,340 --> 00:37:20,730
initialize it to 0.

786
00:37:20,730 --> 00:37:21,570
So let me do that.

787
00:37:21,570 --> 00:37:24,980
Let me go ahead and
say if title in titles,

788
00:37:24,980 --> 00:37:28,520
which is the very Pythonic,
beautiful way of asking a question

789
00:37:28,520 --> 00:37:30,500
like that, way cleaner than in C--

790
00:37:30,500 --> 00:37:35,390
let me go ahead, then, and say
exactly the line from before.

791
00:37:35,390 --> 00:37:40,280
Else, though, if that title is not
yet in the dictionary called titles,

792
00:37:40,280 --> 00:37:41,720
well, that's OK, too.

793
00:37:41,720 --> 00:37:47,150
I can go ahead and say
titles bracket title = 0.

794
00:37:47,150 --> 00:37:51,950
So the difference here is
that I can certainly index

795
00:37:51,950 --> 00:37:57,740
into a dictionary using a key that
doesn't exist if I plan at that moment

796
00:37:57,740 --> 00:37:58,730
to give it a value.

797
00:37:58,730 --> 00:38:02,030
That's OK, and that has always
been OK since last week.

798
00:38:02,030 --> 00:38:07,490
But, however, if I want to go ahead
and increment the value that's there,

799
00:38:07,490 --> 00:38:11,630
I'm going to go ahead and do
that in this separate line.

800
00:38:11,630 --> 00:38:13,850
But I did introduce a bug.

801
00:38:13,850 --> 00:38:15,770
I did introduce a bug here.

802
00:38:15,770 --> 00:38:19,220
I think I need to go one
step further logically.

803
00:38:19,220 --> 00:38:24,480
I don't think I want to
initialize this to 0 per se.

804
00:38:24,480 --> 00:38:29,090
Does anyone see a subtle
bug in my logic here?

805
00:38:29,090 --> 00:38:32,570
If the title is already in the
dictionary, I'm incrementing it by 1.

806
00:38:32,570 --> 00:38:37,000
Otherwise, I'm initializing it to 0.

807
00:38:37,000 --> 00:38:38,500
Any subtle catches here?

808
00:38:38,500 --> 00:38:40,690
Yeah, Olivia, what do you see?

809
00:38:40,690 --> 00:38:44,700
AUDIENCE: I think you should initialize
it to 1, since it's the first instance.

810
00:38:44,700 --> 00:38:45,700
DAVID J. MALAN: Exactly.

811
00:38:45,700 --> 00:38:46,870
I should initialize it to 1.

812
00:38:46,870 --> 00:38:50,137
Otherwise, I'm accidentally
overlooking this particular title,

813
00:38:50,137 --> 00:38:51,970
and I'm going to go
ahead and undercount it.

814
00:38:51,970 --> 00:38:54,100
So I can fix this either by doing this.

815
00:38:54,100 --> 00:38:57,820
Or frankly, if you prefer, I don't
technically need to use an if else.

816
00:38:57,820 --> 00:39:00,730
I can use just an if by doing
something like this instead.

817
00:39:00,730 --> 00:39:04,900
I could say if title not in
titles, then I could go ahead

818
00:39:04,900 --> 00:39:07,420
and say titles bracket title gets 0.

819
00:39:07,420 --> 00:39:12,270
And then after that, I can
blindly, so to speak, just do this.
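
The single-if version might look roughly like this. The list of responses is a hypothetical stand-in for the rows of the CSV file, not the lecture's actual data:

```python
# Hypothetical responses standing in for rows read from the CSV file.
responses = ["PUNISHER", "THE OFFICE", "PUNISHER", "FRIENDS", "THE OFFICE", "PUNISHER"]

titles = {}
for title in responses:
    # Only initialize a title the first time it is seen.
    if title not in titles:
        titles[title] = 0
    # Safe now: the key is guaranteed to exist.
    titles[title] += 1

print(titles)
```

Initializing to 0 is correct here only because the increment always runs afterward, so a first sighting still ends up counted as 1.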

820
00:39:12,270 --> 00:39:13,480
So which one is better?

821
00:39:13,480 --> 00:39:15,460
I think the second one
is maybe a little better

822
00:39:15,460 --> 00:39:17,420
in that I'm saving one line of code.

823
00:39:17,420 --> 00:39:19,270
But it's ensuring with
that if condition,

824
00:39:19,270 --> 00:39:24,190
to [? Semowit's ?] advice, that I'm
not indexing into the titles dictionary

825
00:39:24,190 --> 00:39:27,050
until I'm sure that
the title is in there.

826
00:39:27,050 --> 00:39:31,570
So let me go ahead and run this
now, python of favorites.py, Enter.

827
00:39:31,570 --> 00:39:34,090
And OK, it didn't crash, so that's good.

828
00:39:34,090 --> 00:39:36,400
But I'm not yet seeing
any useful information.

829
00:39:36,400 --> 00:39:38,740
But I now have access to a bit more.

830
00:39:38,740 --> 00:39:41,680
Let me scroll down now to the
bottom of this program, where

831
00:39:41,680 --> 00:39:43,360
I have now this loop.

832
00:39:43,360 --> 00:39:45,460
Let me go ahead and print
out not just the title

833
00:39:45,460 --> 00:39:49,870
but the value of that key in
the dictionary by just indexing

834
00:39:49,870 --> 00:39:50,500
into it here.

835
00:39:50,500 --> 00:39:52,030
And you might not have
seen this syntax before.

836
00:39:52,030 --> 00:39:54,758
But with print, you can actually
pass in multiple arguments.

837
00:39:54,758 --> 00:39:57,550
And by default, print will just
separate them with a space for you.

838
00:39:57,550 --> 00:39:59,860
You can override that behavior
and separate them with anything.
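
As a small illustration of print with multiple arguments, assuming a two-entry dictionary built from counts mentioned in the lecture:

```python
titles = {"THE OFFICE": 26, "BIG BANG THEORY": 9}

for title in titles:
    # Multiple arguments are separated by a single space by default.
    print(title, titles[title])
    # The sep keyword argument overrides that separator.
    print(title, titles[title], sep=": ")
```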

839
00:39:59,860 --> 00:40:02,777
But this is just meant to be a quick
and dirty program that prints out

840
00:40:02,777 --> 00:40:04,820
titles and now the popularity thereof.

841
00:40:04,820 --> 00:40:07,210
So let me run this again,
python of favorites.py.

842
00:40:07,210 --> 00:40:08,470
And voila.

843
00:40:08,470 --> 00:40:12,040
It's kind of all over the place.

844
00:40:12,040 --> 00:40:14,800
Office, super popular
with 26 votes there.

845
00:40:14,800 --> 00:40:18,220
A lot of single votes here.

846
00:40:18,220 --> 00:40:19,810
Big Bang Theory has nine.

847
00:40:19,810 --> 00:40:21,340
You know, this is all nice and good.

848
00:40:21,340 --> 00:40:24,548
But I feel like this is going to take
me forever to wrap my mind around which

849
00:40:24,548 --> 00:40:25,990
are the most popular shows.

850
00:40:25,990 --> 00:40:27,560
So of course, how would we do this?

851
00:40:27,560 --> 00:40:30,250
Well, to the point made earlier,
with spreadsheets, my god,

852
00:40:30,250 --> 00:40:33,100
in Microsoft Excel or Google
Spreadsheets or Apple Numbers,

853
00:40:33,100 --> 00:40:35,590
you just click the column
heading and boom, sorted.

854
00:40:35,590 --> 00:40:38,450
We seem to have lost that capability
unless we now do it in code.

855
00:40:38,450 --> 00:40:40,450
So let me do that for us.

856
00:40:40,450 --> 00:40:42,550
Let me go ahead and go back to my code.

857
00:40:42,550 --> 00:40:48,520
And it looks like sorted, even
though it does work on dictionaries,

858
00:40:48,520 --> 00:40:52,340
is actually sorting
by key, not by value.

859
00:40:52,340 --> 00:40:55,030
And here's where our Python
programming techniques need

860
00:40:55,030 --> 00:40:56,530
to get a little more sophisticated.

861
00:40:56,530 --> 00:40:58,572
And we want to introduce
another feature here now

862
00:40:58,572 --> 00:41:01,660
of Python which is going to
solve this problem specifically

863
00:41:01,660 --> 00:41:03,680
but in a pretty general way.

864
00:41:03,680 --> 00:41:06,310
So if we read the
documentation for sorted,

865
00:41:06,310 --> 00:41:11,320
the sorted function indeed sorts
sets by the values therein.

866
00:41:11,320 --> 00:41:13,840
It sorts lists by the values therein.

867
00:41:13,840 --> 00:41:17,140
It sorts dictionaries
by the keys therein,

868
00:41:17,140 --> 00:41:20,960
because dictionaries have two pieces
of information for every element.

869
00:41:20,960 --> 00:41:23,450
It has a key and a
value, not just a value.

870
00:41:23,450 --> 00:41:25,390
So by default, sorted sorts by key.
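
A quick sketch of that default behavior, using three counts quoted later in the lecture (the set and list values are made up for illustration):

```python
titles = {"THE OFFICE": 26, "FRIENDS": 27, "GAME OF THRONES": 33}

# Sets and lists are sorted by the values they contain.
from_set = sorted({3, 1, 2})
from_list = sorted([3, 1, 2])

# A dictionary is sorted by its keys; the counts are ignored.
by_key = sorted(titles)

print(from_set, from_list, by_key)
```

Note that sorted returns a new list of keys in every case; it does not return a dictionary.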

871
00:41:25,390 --> 00:41:28,150
So we somehow have to
override that behavior.

872
00:41:28,150 --> 00:41:29,390
So how can we do this?

873
00:41:29,390 --> 00:41:31,840
Well, it turns out that
the sorted function

874
00:41:31,840 --> 00:41:35,890
takes another optional
argument literally called key.

875
00:41:35,890 --> 00:41:41,570
And the key argument takes as
its value the name of a function.

876
00:41:41,570 --> 00:41:43,690
And this is where things
get really interesting,

877
00:41:43,690 --> 00:41:45,370
if not confusing, really quickly.

878
00:41:45,370 --> 00:41:50,620
It turns out, in Python, you can
pass around functions as arguments

879
00:41:50,620 --> 00:41:51,790
by way of their name.

880
00:41:51,790 --> 00:41:56,080
And technically, you can do this in C.
It's a lot more syntactically involved.

881
00:41:56,080 --> 00:41:57,777
But in Python, it's very common.

882
00:41:57,777 --> 00:41:59,110
In JavaScript, it's very common.

883
00:41:59,110 --> 00:42:01,930
In a lot of languages, it's very
common to think of functions

884
00:42:01,930 --> 00:42:06,040
as first-class objects, which is a fancy
way of saying you can pass them around

885
00:42:06,040 --> 00:42:08,440
just like they are variables themselves.

886
00:42:08,440 --> 00:42:09,730
We're not calling them yet.

887
00:42:09,730 --> 00:42:11,720
But you can pass them
around by their name.

888
00:42:11,720 --> 00:42:13,310
So what do I mean by this?

889
00:42:13,310 --> 00:42:18,940
Well, I need a function now to
sort my dictionary by its value.

890
00:42:18,940 --> 00:42:22,900
And only I know how to do this, perhaps,
so let me go ahead and give myself

891
00:42:22,900 --> 00:42:25,990
a generic function name just for the
moment called f-- f for function,

892
00:42:25,990 --> 00:42:26,907
kind of like in math--

893
00:42:26,907 --> 00:42:28,907
because we're going to
get rid of it eventually.

894
00:42:28,907 --> 00:42:31,120
But let me go ahead and
temporarily define a function

895
00:42:31,120 --> 00:42:34,150
called f that takes as input a title.

896
00:42:34,150 --> 00:42:39,140
And then it returns for me the
value corresponding to that key.

897
00:42:39,140 --> 00:42:43,060
So I'm going to go ahead and
return titles bracket title.

898
00:42:43,060 --> 00:42:47,480
So here, we have a function whose
purpose in life is super simple.

899
00:42:47,480 --> 00:42:48,700
You give it a title.

900
00:42:48,700 --> 00:42:52,990
It gives you the count thereof, the
frequency, the popularity thereof,

901
00:42:52,990 --> 00:42:56,080
by just looking it up in
that global dictionary.

902
00:42:56,080 --> 00:42:59,830
So it's super simple, but
that's its only purpose in life.

903
00:42:59,830 --> 00:43:03,400
But now, according to the
documentation for sorted,

904
00:43:03,400 --> 00:43:06,730
what it's now going to do, because I'm
passing in a second argument called

905
00:43:06,730 --> 00:43:12,250
key, the sorted function, rather than
just presume you want everything sorted

906
00:43:12,250 --> 00:43:15,280
alphabetically by key,
it's instead going

907
00:43:15,280 --> 00:43:22,420
to call that function f on every one
of the elements in your dictionary.

908
00:43:22,420 --> 00:43:25,960
And depending on your
answer, the return value

909
00:43:25,960 --> 00:43:30,760
you give with that f function,
that will be used instead

910
00:43:30,760 --> 00:43:34,060
to determine the actual ordering.

911
00:43:34,060 --> 00:43:36,900
So by default, sorted just looks at key.

912
00:43:36,900 --> 00:43:39,750
What I'm effectively
doing with this f function

913
00:43:39,750 --> 00:43:44,160
is instead returning the value
corresponding to every key.

914
00:43:44,160 --> 00:43:48,360
And so the logical implication of this,
even though the syntax is a little new,

915
00:43:48,360 --> 00:43:51,690
is that this dictionary
of titles will now

916
00:43:51,690 --> 00:43:55,140
be sorted by value instead of by key.

917
00:43:55,140 --> 00:43:57,460
Because again, by
default, it sorts by key.

918
00:43:57,460 --> 00:44:01,740
But if I define my own key
function and override that behavior

919
00:44:01,740 --> 00:44:05,370
to return the corresponding value,
it's the values, the numbers,

920
00:44:05,370 --> 00:44:08,750
the counts that will actually
be used to sort this thing.
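
A sketch of the key function described above. The function f mirrors the lecture; the count for PUNISHER is an assumed value added for illustration:

```python
# Global dictionary of counts (PUNISHER's count is hypothetical).
titles = {"THE OFFICE": 26, "GAME OF THRONES": 33, "FRIENDS": 27, "PUNISHER": 1}

def f(title):
    # Given a title, return its count from the global dictionary.
    return titles[title]

# sorted calls f on each key and orders the keys by the returned counts.
ordered = sorted(titles, key=f)

print(ordered)
```

The result is still a list of titles; only the basis for comparing them has changed, from the title text to its count.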

921
00:44:08,750 --> 00:44:09,250
All right.

922
00:44:09,250 --> 00:44:11,333
Let's go ahead and see if
that's true in practice.

923
00:44:11,333 --> 00:44:13,440
Let me go ahead and rerun
python of favorites.py.

924
00:44:13,440 --> 00:44:14,790
I should see all the titles.

925
00:44:14,790 --> 00:44:17,520
And voila, conveniently,
the most popular show

926
00:44:17,520 --> 00:44:22,170
seems to be Game of Thrones with 33
votes, followed by Friends with 27,

927
00:44:22,170 --> 00:44:25,000
followed by The Office
with 26, and so forth.

928
00:44:25,000 --> 00:44:27,060
But of course, the list
is kind of backwards.

929
00:44:27,060 --> 00:44:29,680
I mean, it's convenient that I can
see it at the bottom of my screen.

930
00:44:29,680 --> 00:44:32,530
But really, if we're making a list,
it should really be at the top.

931
00:44:32,530 --> 00:44:34,170
So how can we override that behavior?

932
00:44:34,170 --> 00:44:36,840
Turns out the sorted function,
if you read its documentation,

933
00:44:36,840 --> 00:44:41,020
also takes another optional
parameter called reverse.

934
00:44:41,020 --> 00:44:43,590
And if you set reverse
equal to True, capital

935
00:44:43,590 --> 00:44:48,120
T in Python, that's going
to go ahead and give us now

936
00:44:48,120 --> 00:44:50,190
the reverse order of that same sort.
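
Roughly, adding reverse=True to the same kind of call flips the order so the largest count comes first (a sketch with the three counts quoted above):

```python
titles = {"THE OFFICE": 26, "GAME OF THRONES": 33, "FRIENDS": 27}

def f(title):
    return titles[title]

# reverse=True sorts from the highest count down to the lowest.
ordered = sorted(titles, key=f, reverse=True)

print(ordered)
```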

937
00:44:50,190 --> 00:44:53,790
So let me go ahead and maximize my
terminal window, rerun it again.

938
00:44:53,790 --> 00:44:57,480
And voila, if I scroll back up to the
top, it's not alphabetically sorted.

939
00:44:57,480 --> 00:44:59,970
But if I keep going, keep
going, keep going, keep going,

940
00:44:59,970 --> 00:45:01,262
the numbers are getting bigger.

941
00:45:01,262 --> 00:45:06,770
And voila, now Game of Thrones
with 33 is all the way at the top.

942
00:45:06,770 --> 00:45:08,490
All right, so pretty cool.

943
00:45:08,490 --> 00:45:11,360
And again, the new functionality
here in Python, at least,

944
00:45:11,360 --> 00:45:15,110
is that we can actually pass
in functions to functions

945
00:45:15,110 --> 00:45:19,380
and leave it to the
latter to call the former.

946
00:45:19,380 --> 00:45:21,180
So that's complicated just to say.

947
00:45:21,180 --> 00:45:26,400
But any questions or confusion now
on how we are using dictionaries

948
00:45:26,400 --> 00:45:34,290
and how we are sorting things in
this reverse, value-based way?

949
00:45:34,290 --> 00:45:35,450
Any questions or confusion?

950
00:45:35,450 --> 00:45:39,000
Anything in the chat or verbally, Brian?

951
00:45:39,000 --> 00:45:41,470
BRIAN YU: Looks like all
questions are answered here.

952
00:45:41,470 --> 00:45:42,320
DAVID J. MALAN: OK.

953
00:45:42,320 --> 00:45:44,780
Then in that case, let me
point out a common mistake.

954
00:45:44,780 --> 00:45:50,000
Notice that even though f is a function,
notice that I did not call it there.

955
00:45:50,000 --> 00:45:53,630
That would be incorrect, the
reason being we deliberately

956
00:45:53,630 --> 00:45:58,410
want to pass the function
f into the sorted function

957
00:45:58,410 --> 00:46:03,810
so that the sorted function can take it
upon itself to call f again and again

958
00:46:03,810 --> 00:46:04,310
and again.

959
00:46:04,310 --> 00:46:07,227
We don't want to just call it once
by using the parentheses ourselves.

960
00:46:07,227 --> 00:46:11,030
We want to just pass it in by name so
that the sorted function, which comes

961
00:46:11,030 --> 00:46:14,630
with Python, can instead do it for us.
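
To make the mistake concrete, here is a sketch comparing the two forms. The names good and failed are only for illustration:

```python
titles = {"THE OFFICE": 26, "GAME OF THRONES": 33, "FRIENDS": 27}

def f(title):
    return titles[title]

# Correct: pass the function itself; sorted calls it once per title.
good = sorted(titles, key=f)

try:
    # Incorrect: f("FRIENDS") runs immediately and passes the int 27 as key,
    # and sorted cannot call an int.
    sorted(titles, key=f("FRIENDS"))
    failed = False
except TypeError:
    failed = True

print(good, failed)
```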

962
00:46:14,630 --> 00:46:17,060
Santiago, did you have a question?

963
00:46:17,060 --> 00:46:18,710
AUDIENCE: Yes, I was going to ask.

964
00:46:18,710 --> 00:46:21,425
Why didn't we put f of title?

965
00:46:21,425 --> 00:46:24,350


966
00:46:24,350 --> 00:46:26,960
I was going to ask that
question specifically.

967
00:46:26,960 --> 00:46:29,005
DAVID J. MALAN: Oh,
with the parentheses?

968
00:46:29,005 --> 00:46:29,630
AUDIENCE: Yeah.

969
00:46:29,630 --> 00:46:30,963
DAVID J. MALAN: Oh, OK, perfect.

970
00:46:30,963 --> 00:46:34,010
So because that would call the
function once and only once.

971
00:46:34,010 --> 00:46:37,013
We want sorted to be able
to call it again and again.

972
00:46:37,013 --> 00:46:38,930
Now, here's actually an
example, as we've seen

973
00:46:38,930 --> 00:46:40,760
in the past, of a correct solution.

974
00:46:40,760 --> 00:46:45,170
This is behaving as I intend, a list
of sorted titles from top to bottom

975
00:46:45,170 --> 00:46:47,450
in order of popularity.

976
00:46:47,450 --> 00:46:49,820
But it's a little poorly
designed, because I'm

977
00:46:49,820 --> 00:46:53,390
defining this function f, whose name
in the first place is kind of lame.

978
00:46:53,390 --> 00:46:56,660
But I'm defining a function
only to use it in one place.

979
00:46:56,660 --> 00:47:00,860
And my god, the function is so tiny, it
just feels like a waste of keystrokes

980
00:47:00,860 --> 00:47:03,740
to have defined a new function
just to then pass it in.

981
00:47:03,740 --> 00:47:08,150
So it turns out, in Python, if you
have a very short function whose

982
00:47:08,150 --> 00:47:13,130
purpose in life is meant to be to solve
a local problem just once and that's it

983
00:47:13,130 --> 00:47:16,880
and it's short enough that you're pretty
sure you can fit it on one line of code

984
00:47:16,880 --> 00:47:21,080
without things wrapping and starting
to get ugly stylistically, it turns out

985
00:47:21,080 --> 00:47:23,270
you can actually do this instead.

986
00:47:23,270 --> 00:47:26,820
You can copy the code that
you had in mind like this.

987
00:47:26,820 --> 00:47:30,680
And instead of actually
defining f as a function name,

988
00:47:30,680 --> 00:47:34,070
you can actually use a special
keyword in Python called lambda.

989
00:47:34,070 --> 00:47:37,760
You can specify the name of an
argument for your function as before.

990
00:47:37,760 --> 00:47:41,690
And then you can simply specify
the return value, thereafter

991
00:47:41,690 --> 00:47:44,940
deleting the function itself.

992
00:47:44,940 --> 00:47:49,640
So to be clear, key is still an
argument to the sorted function.

993
00:47:49,640 --> 00:47:54,120
It expects as its value
typically the name of a function.

994
00:47:54,120 --> 00:47:57,590
But if you've decided that, eh,
this seems like a waste of effort

995
00:47:57,590 --> 00:47:59,870
to define a function,
then pass the function in,

996
00:47:59,870 --> 00:48:02,840
especially when it's so short,
you can do it in a one liner.

997
00:48:02,840 --> 00:48:05,990
A lambda function is
an anonymous function.

998
00:48:05,990 --> 00:48:09,230
Lambda literally says,
Python, give me a function.

999
00:48:09,230 --> 00:48:11,147
I don't care about its name.

1000
00:48:11,147 --> 00:48:13,230
Therefore, you don't have
to choose a name for it.

1001
00:48:13,230 --> 00:48:17,940
But it does care still about its
arguments and its return value.

1002
00:48:17,940 --> 00:48:23,147
So it's still up to you to provide zero
or more arguments and a return value.

1003
00:48:23,147 --> 00:48:24,230
And notice I've done that.

1004
00:48:24,230 --> 00:48:28,010
I've specified the keyword lambda
followed by the name of the argument

1005
00:48:28,010 --> 00:48:31,490
I want this anonymous,
nameless function to accept.

1006
00:48:31,490 --> 00:48:33,890
And then I'm specifying
the return value.

1007
00:48:33,890 --> 00:48:37,940
And with lambda functions, you
do not need to specify return.

1008
00:48:37,940 --> 00:48:41,000
Whatever you write after
the colon is literally

1009
00:48:41,000 --> 00:48:43,050
what will be returned automatically.
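
A sketch of the same sort with the named function replaced by a lambda, assuming the three counts quoted earlier (the variable ranked is an illustrative name):

```python
titles = {"THE OFFICE": 26, "GAME OF THRONES": 33, "FRIENDS": 27}

# lambda title: titles[title] is an anonymous function that takes a title
# and returns its count; no def and no return keyword are needed.
ranked = sorted(titles, key=lambda title: titles[title], reverse=True)

for title in ranked:
    print(title, titles[title])
```

The behavior is the same as with a separately defined f; the only difference is that the function has no name and lives entirely on the line that uses it.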

1010
00:48:43,050 --> 00:48:45,320
So again, this is a very
Pythonic thing to do.

1011
00:48:45,320 --> 00:48:49,250
It's kind of a very clever one liner,
even though it's a little cryptic

1012
00:48:49,250 --> 00:48:50,757
to see for the very first time.

1013
00:48:50,757 --> 00:48:53,840
But it allows you to condense your
thoughts into a succinct statement that

1014
00:48:53,840 --> 00:48:57,470
gets the job done so you don't have to
start defining more and more functions

1015
00:48:57,470 --> 00:49:02,490
that you or someone else
then need to keep track of.

1016
00:49:02,490 --> 00:49:02,990
All right.

1017
00:49:02,990 --> 00:49:05,300
Any questions, then, on this?

1018
00:49:05,300 --> 00:49:10,580
And I am pretty sure this is as complex
or sophisticated as our Python code

1019
00:49:10,580 --> 00:49:13,290
today will get.

1020
00:49:13,290 --> 00:49:16,020
Yeah, over to Sophia.

1021
00:49:16,020 --> 00:49:19,380
AUDIENCE: I was wondering why
"lambda" is used specifically

1022
00:49:19,380 --> 00:49:21,217
rather than some other keyword.

1023
00:49:21,217 --> 00:49:23,550
DAVID J. MALAN: Yeah, so
there's a long history in this.

1024
00:49:23,550 --> 00:49:27,060
And if, in fact, you take a course on
functional programming-- at Harvard,

1025
00:49:27,060 --> 00:49:28,830
it's called CS51--

1026
00:49:28,830 --> 00:49:32,280
there's a whole etymology
behind keywords like this.

1027
00:49:32,280 --> 00:49:34,360
Let me defer that one for another time.

1028
00:49:34,360 --> 00:49:37,440
But indeed, not only in
Python but in other languages,

1029
00:49:37,440 --> 00:49:41,290
as well, these things have come
to exist called lambda functions.

1030
00:49:41,290 --> 00:49:44,230
So they're actually quite commonplace
in other languages, as well.

1031
00:49:44,230 --> 00:49:48,580
And so Python just
adopted the term of art.

1032
00:49:48,580 --> 00:49:52,060
Mathematically, lambda is often
used as a symbol for functions.

1033
00:49:52,060 --> 00:49:55,980
And so they borrowed that same
idea in the world of programming.

1034
00:49:55,980 --> 00:49:56,550
All right.

1035
00:49:56,550 --> 00:50:00,840
So seeing no other questions, let's go
ahead and solve a related problem still

1036
00:50:00,840 --> 00:50:03,510
with some Python but
that's going to push up

1037
00:50:03,510 --> 00:50:09,150
against the limits of efficiency when it
comes to storing our data in CSV files.

1038
00:50:09,150 --> 00:50:13,113
Let me go ahead and start fresh
in this file, favorites.py.

1039
00:50:13,113 --> 00:50:15,030
All of the code I've
written thus far, though,

1040
00:50:15,030 --> 00:50:16,905
is on the course's
website in advance, so you

1041
00:50:16,905 --> 00:50:18,570
can see the incremental improvement.

1042
00:50:18,570 --> 00:50:21,280
I'm going to go ahead and,
again, import csv at the top.

1043
00:50:21,280 --> 00:50:24,690
And now let's write a program
this time that doesn't just

1044
00:50:24,690 --> 00:50:27,990
automatically open up the
CSV and analyze it looking

1045
00:50:27,990 --> 00:50:31,020
for the total popularity of shows.

1046
00:50:31,020 --> 00:50:35,430
Let's search for a specific
show in the CSV and then

1047
00:50:35,430 --> 00:50:39,082
go ahead and output
the popularity thereof.

1048
00:50:39,082 --> 00:50:41,040
And I can do this in a
bunch of different ways.

1049
00:50:41,040 --> 00:50:43,415
But I'm going to try to make
this as concise as possible.

1050
00:50:43,415 --> 00:50:46,800
I'm first going to ask
the user to input a title.

1051
00:50:46,800 --> 00:50:49,170
I could use CS50's get_string function.

1052
00:50:49,170 --> 00:50:52,330
But recall that it's pretty much
the same as Python's input function,

1053
00:50:52,330 --> 00:50:55,740
so I'm going to use Python's
input function today.

1054
00:50:55,740 --> 00:50:57,780
And then I'm going to
go ahead and, as before,

1055
00:50:57,780 --> 00:51:01,350
open up that same CSV
called Favorite TV Shows -

1056
00:51:01,350 --> 00:51:08,010
Form Responses 1.csv in read-only
mode as a variable called file.

1057
00:51:08,010 --> 00:51:11,160
I'm then going to give myself a
reader, and I'll use a DictReader again

1058
00:51:11,160 --> 00:51:14,520
so I don't have to worry about
knowing which columns things are in,

1059
00:51:14,520 --> 00:51:16,080
passing in file.

1060
00:51:16,080 --> 00:51:17,340
And then let's see.

1061
00:51:17,340 --> 00:51:20,310
If I only care about one title,
I can keep this program simpler.

1062
00:51:20,310 --> 00:51:23,340
I don't need to figure out
the popularity of every show.

1063
00:51:23,340 --> 00:51:26,880
I just need to figure out the
popularity of one show, the title

1064
00:51:26,880 --> 00:51:28,510
that the human has typed in.

1065
00:51:28,510 --> 00:51:32,160
So I'm going to go ahead and give
myself a very simple int called counter

1066
00:51:32,160 --> 00:51:33,480
and set it equal to 0.

1067
00:51:33,480 --> 00:51:34,950
I don't need a whole dictionary.

1068
00:51:34,950 --> 00:51:36,930
Just one variable suffices now.

1069
00:51:36,930 --> 00:51:42,300
And I'm going to go ahead and iterate
over the rows in the reader, as before.

1070
00:51:42,300 --> 00:51:48,600
And then I'm going to say if the current
row's title == the title the human

1071
00:51:48,600 --> 00:51:51,930
typed in, let's go ahead
and increment counter by 1.

1072
00:51:51,930 --> 00:51:54,510
And it's already initialized,
because I did that on line 7.

1073
00:51:54,510 --> 00:51:55,500
So I think I'm good.

1074
00:51:55,500 --> 00:51:57,450
And then at the end
of this program, let's

1075
00:51:57,450 --> 00:51:59,940
very simply print out
the value of counter.
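
A rough sketch of this program. To keep it runnable on its own, it assumes a small in-memory CSV with a title column instead of the real file, and a hard-coded title instead of a call to input:

```python
import csv
import io

# Hypothetical stand-in for the CSV file; the lecture opens the real file instead.
file = io.StringIO("title\nThe Office\nthe office\nFriends\nThe Office \n")

# The lecture reads this from the user with input(); hard-coded here.
title = "The Office"

counter = 0
reader = csv.DictReader(file)
for row in reader:
    # Exact comparison: differences in case or spacing do not match.
    if row["title"] == title:
        counter += 1

print(counter)
```

With this sample data, only one of the three rows that a person would read as "The Office" matches exactly.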

1076
00:51:59,940 --> 00:52:04,920
So the purpose of this program is to
prompt the user for a title of a show

1077
00:52:04,920 --> 00:52:08,220
and then just report
the popularity thereof

1078
00:52:08,220 --> 00:52:11,040
by counting the number of
instances of it in the file.

1079
00:52:11,040 --> 00:52:14,520
So let me go ahead and run this
with python of favorites.py.

1080
00:52:14,520 --> 00:52:15,450
Enter.

1081
00:52:15,450 --> 00:52:21,440
Let me go ahead and type in
"The Office," Enter, and 19.

1082
00:52:21,440 --> 00:52:23,710
Now, I don't remember
exactly what the number was.

1083
00:52:23,710 --> 00:52:26,620
But I remember The Office
was more popular than that.

1084
00:52:26,620 --> 00:52:29,840
I'm pretty sure it was not 19.

1085
00:52:29,840 --> 00:52:35,645
Any intuition as to why this program
is buggy or so it would seem?

1086
00:52:35,645 --> 00:52:37,520
BRIAN YU: A few people
in the chat are saying

1087
00:52:37,520 --> 00:52:40,635
you need to remember to deal with
capitalization and white space again.

1088
00:52:40,635 --> 00:52:41,510
DAVID J. MALAN: Yeah.

1089
00:52:41,510 --> 00:52:44,790
So we need to practice those
same lessons learned from before.

1090
00:52:44,790 --> 00:52:48,830
So I should really canonicalize the
input that the human, I, just typed in

1091
00:52:48,830 --> 00:52:51,980
and also the input that's
coming from the CSV file.

1092
00:52:51,980 --> 00:52:53,990
Perhaps the simplest way
to do this is, up here,

1093
00:52:53,990 --> 00:52:57,200
to first strip off leading and trailing
white space in case I get a little

1094
00:52:57,200 --> 00:52:59,670
sloppy and hit the Space
bar where I shouldn't.

1095
00:52:59,670 --> 00:53:02,612
And then let's go ahead and force
it to uppercase just because.

1096
00:53:02,612 --> 00:53:04,320
It doesn't matter if
it's upper or lower,

1097
00:53:04,320 --> 00:53:06,420
but at least we'll
standardize things that way.

1098
00:53:06,420 --> 00:53:10,130
And then when I do this, look
at the current row's title.

1099
00:53:10,130 --> 00:53:12,170
I think I really need
to do the same thing.

1100
00:53:12,170 --> 00:53:15,140
If I'm going to canonicalize one,
I need to canonicalize the other.

1101
00:53:15,140 --> 00:53:19,790
And now compare the all-caps,
white-space-stripped versions

1102
00:53:19,790 --> 00:53:20,730
of both strings.
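
A sketch of the canonicalized comparison, again assuming a small in-memory CSV with a title column and a hard-coded, deliberately sloppy user input:

```python
import csv
import io

# Hypothetical stand-in for the CSV file, with inconsistent case and spacing.
file = io.StringIO("title\nThe Office\nthe office\nFriends\nThe Office \n")

# Sloppy input (the lecture reads it with input()), stripped and uppercased.
title = "  the office ".strip().upper()

counter = 0
reader = csv.DictReader(file)
for row in reader:
    # Canonicalize the value from the file the same way before comparing.
    if row["title"].strip().upper() == title:
        counter += 1

print(counter)
```

Because both sides go through the same strip and upper calls, all three variants of the title in the sample data are now counted.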

1103
00:53:20,730 --> 00:53:21,920
So now let me rerun it.

1104
00:53:21,920 --> 00:53:24,020
Now I'm going to type
in "The Office," Enter.

1105
00:53:24,020 --> 00:53:24,770
And voila.

1106
00:53:24,770 --> 00:53:28,130
Now I'm at 26, which I think
is where we were at before.

1107
00:53:28,130 --> 00:53:30,890
And in fact, now I, the
user, can be a little sloppy.

1108
00:53:30,890 --> 00:53:32,450
I can say "the office."

1109
00:53:32,450 --> 00:53:35,780
I can run it again and say "the
office" and then, for whatever reason,

1110
00:53:35,780 --> 00:53:37,460
hit the Space bar a lot, Enter.

1111
00:53:37,460 --> 00:53:38,570
It's still going to work.

1112
00:53:38,570 --> 00:53:41,810
And indeed, though we seem to
be belaboring the pedantic here

1113
00:53:41,810 --> 00:53:44,683
with trimming off white space
and so forth, just think.

1114
00:53:44,683 --> 00:53:46,850
In a relatively small
audience here, how many of you

1115
00:53:46,850 --> 00:53:50,150
accidentally hit the Space bar or
capitalized things differently?

1116
00:53:50,150 --> 00:53:52,340
This happens massively at scale.

1117
00:53:52,340 --> 00:53:55,040
And you can imagine this being
important when you're tagging

1118
00:53:55,040 --> 00:53:56,780
friends in some social media account.

1119
00:53:56,780 --> 00:53:58,940
You're doing @Brian or the like.

1120
00:53:58,940 --> 00:54:02,510
You don't want to have to require
the user to type @, capital B,

1121
00:54:02,510 --> 00:54:05,100
lowercase r-i-a-n, and so forth.

1122
00:54:05,100 --> 00:54:07,880
So tolerating disparate,
messy user input

1123
00:54:07,880 --> 00:54:11,990
is such a common problem
to solve, including

1124
00:54:11,990 --> 00:54:14,740
in today's apps that we all use.

1125
00:54:14,740 --> 00:54:15,280
All right.

1126
00:54:15,280 --> 00:54:21,000
Any questions, then, on this
program, which I think is correct?

1127
00:54:21,000 --> 00:54:22,890
Then let me ask a question of you.

1128
00:54:22,890 --> 00:54:27,080
In what sense is this
program poorly designed?

1129
00:54:27,080 --> 00:54:30,860
In what sense is this
program poorly designed?

1130
00:54:30,860 --> 00:54:33,480
This is more subtle.

1131
00:54:33,480 --> 00:54:38,040
But think about the running time
of this program in terms of big O.

1132
00:54:38,040 --> 00:54:45,030
What is the running time of this program
if the CSV file has n different shows

1133
00:54:45,030 --> 00:54:47,610
in it or n different submissions?

1134
00:54:47,610 --> 00:54:50,310
So n is the variable in question.

1135
00:54:50,310 --> 00:54:53,122
Yeah, what's the running time, Andrew?

1136
00:54:53,122 --> 00:54:55,510
AUDIENCE: [INAUDIBLE]

1137
00:54:55,510 --> 00:54:58,260
DAVID J. MALAN: Yeah, it's big O
of n, because I'm literally using

1138
00:54:58,260 --> 00:55:00,400
linear search by way of the for loop.

1139
00:55:00,400 --> 00:55:04,000
That's how a for loop works in Python,
just like in C. Starts at the beginning

1140
00:55:04,000 --> 00:55:06,080
and potentially goes all
the way till the end.

1141
00:55:06,080 --> 00:55:08,730
And so I'm using
implicitly linear search,

1142
00:55:08,730 --> 00:55:11,910
because I'm not using any fancy data
structures, no sets, no dictionaries.

1143
00:55:11,910 --> 00:55:14,140
I'm just looping from top to bottom.
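
[Editor's sketch] To make the running-time point concrete, here is a minimal Python comparison, assuming a small made-up list of submissions (the `titles` list and both function names are hypothetical, not the program from the lecture). The first function rescans every submission for each question, which is linear search; the second tallies everything once into a dictionary so that later lookups no longer depend on n.

```python
# Hypothetical sample of submitted titles; not the lecture's actual data.
titles = ["The Office", "friends ", "FRIENDS", "the office", "Friends"]


def count_linear(titles, query):
    """O(n) per question: scan every submission each time."""
    query = query.strip().upper()
    count = 0
    for title in titles:
        if title.strip().upper() == query:
            count += 1
    return count


def build_index(titles):
    """O(n) once: tally normalized titles in a dictionary."""
    counts = {}
    for title in titles:
        key = title.strip().upper()
        counts[key] = counts.get(key, 0) + 1
    return counts


counts = build_index(titles)
linear_answer = count_linear(titles, "friends")
indexed_answer = counts.get("FRIENDS", 0)
print(linear_answer, indexed_answer)
```

Both approaches give the same count; the difference is that the dictionary is built once and then answers each later question without another pass over the data.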

1144
00:55:14,140 --> 00:55:18,390
So you can imagine that if we surveyed
not just all of the students here

1145
00:55:18,390 --> 00:55:21,390
in class but maybe everyone on
campus or everyone in the world--

1146
00:55:21,390 --> 00:55:24,240
maybe we're Internet
Movie Database, IMDb.

1147
00:55:24,240 --> 00:55:28,710
There could be a huge number of
votes and a huge number of shows.

1148
00:55:28,710 --> 00:55:32,460
And so writing a program, whether
it's in a terminal window like mine

1149
00:55:32,460 --> 00:55:36,480
or maybe on a mobile device or maybe on
a webpage for your laptop or desktop,

1150
00:55:36,480 --> 00:55:40,800
it's probably not the best
design to constantly loop

1151
00:55:40,800 --> 00:55:44,190
over all of the shows in
your database from top

1152
00:55:44,190 --> 00:55:47,250
to bottom just to answer
a single question.

1153
00:55:47,250 --> 00:55:51,502
It would be much nicer to do things
in log of n time or in constant time.

1154
00:55:51,502 --> 00:55:54,210
And thankfully, over the past few
weeks, both in C and in Python,

1155
00:55:54,210 --> 00:55:57,390
we have seen smarter ways to do this.

1156
00:55:57,390 --> 00:56:00,420
But I'm not practicing
what I've preached here.

1157
00:56:00,420 --> 00:56:05,520
And in fact, at some point, this
notion of a flat-file database

1158
00:56:05,520 --> 00:56:07,440
starts to get too primitive for us.

1159
00:56:07,440 --> 00:56:11,670
Flat-file databases, like CSV
files, are wonderfully useful

1160
00:56:11,670 --> 00:56:13,590
when you just want to
do something quickly

1161
00:56:13,590 --> 00:56:16,410
or when you want to download
data from some third party,

1162
00:56:16,410 --> 00:56:18,518
like Google, in a
standard, portable way.

1163
00:56:18,518 --> 00:56:21,810
"Portable" means that it can be used by
different people and different systems.

1164
00:56:21,810 --> 00:56:23,760
CSV is about as simple
as it gets, because you

1165
00:56:23,760 --> 00:56:26,250
don't need to own
Microsoft Word or Apple

1166
00:56:26,250 --> 00:56:28,150
Numbers or any particular product.

1167
00:56:28,150 --> 00:56:30,750
It's just a text file, so
you can use any text editing

1168
00:56:30,750 --> 00:56:34,140
program or any programming
language to access it.

1169
00:56:34,140 --> 00:56:38,730
But flat-file databases aren't
necessarily the best structure

1170
00:56:38,730 --> 00:56:42,750
to use ultimately for larger data
sets, because they don't really

1171
00:56:42,750 --> 00:56:44,790
lend themselves to
more efficient queries.

1172
00:56:44,790 --> 00:56:48,280
So CSV files, pretty much at best,
you have to search top to bottom,

1173
00:56:48,280 --> 00:56:49,140
left to right.

1174
00:56:49,140 --> 00:56:53,070
But it turns out that there are better
databases out there generally known

1175
00:56:53,070 --> 00:56:57,960
as relational databases that, instead
of being files in which you store data,

1176
00:56:57,960 --> 00:57:01,230
they are instead programs
in which you store data.

1177
00:57:01,230 --> 00:57:04,830
Now, to be fair, those programs
use a lot of RAM, memory,

1178
00:57:04,830 --> 00:57:06,390
where they actually store your data.

1179
00:57:06,390 --> 00:57:08,670
And they do certainly persist your data.

1180
00:57:08,670 --> 00:57:12,600
They keep it long term by
storing your data also in files.

1181
00:57:12,600 --> 00:57:16,110
But between you and your data,
there is this running program.

1182
00:57:16,110 --> 00:57:20,070
And if you've ever heard of Oracle
or MySQL or PostgreSQL or SQL Server

1183
00:57:20,070 --> 00:57:23,880
or Microsoft Access or bunches
of other popular products,

1184
00:57:23,880 --> 00:57:26,520
both commercial and free
and open source alike,

1185
00:57:26,520 --> 00:57:30,720
relational databases are so
similar in spirit to spreadsheets.

1186
00:57:30,720 --> 00:57:33,690
But they are implemented in software.

1187
00:57:33,690 --> 00:57:35,490
And they give us more and more features.

1188
00:57:35,490 --> 00:57:37,360
And they use more and
more data structures

1189
00:57:37,360 --> 00:57:42,550
so that we can search for data, insert
data, delete data, update data much,

1190
00:57:42,550 --> 00:57:47,350
much more efficiently than we could if
just using something like a CSV file.

1191
00:57:47,350 --> 00:57:49,600
So let's go ahead and take
our five-minute break here.

1192
00:57:49,600 --> 00:57:52,558
And when we come back, we'll look at
relational databases and, in turn,

1193
00:57:52,558 --> 00:57:54,400
a language called SQL.

1194
00:57:54,400 --> 00:57:55,170
All right.

1195
00:57:55,170 --> 00:57:56,190
So we are back.

1196
00:57:56,190 --> 00:57:58,890
And the goal at hand
now is to transition

1197
00:57:58,890 --> 00:58:02,340
from these fairly simplistic
flat-file databases

1198
00:58:02,340 --> 00:58:04,260
to a more proper relational database.

1199
00:58:04,260 --> 00:58:07,500
And relational databases are
indeed what power so many

1200
00:58:07,500 --> 00:58:10,470
of today's mobile applications,
web applications, and the like.

1201
00:58:10,470 --> 00:58:13,170
Now we're beginning to
transition to real-world software

1202
00:58:13,170 --> 00:58:16,210
with real-world languages, at that.

1203
00:58:16,210 --> 00:58:20,610
And so now, let me introduce
what we're going to call SQLite.

1204
00:58:20,610 --> 00:58:23,040
So it turns out that
a relational database

1205
00:58:23,040 --> 00:58:27,660
is a database that stores all of
the data still in rows and columns.

1206
00:58:27,660 --> 00:58:30,960
But it doesn't do so using
spreadsheets or sheets.

1207
00:58:30,960 --> 00:58:33,930
It instead does so using what
we're going to call tables.

1208
00:58:33,930 --> 00:58:36,120
So it's pretty much the same idea.

1209
00:58:36,120 --> 00:58:39,120
But with tables, we get some
additional functionality.

1210
00:58:39,120 --> 00:58:41,130
With those tables,
we'll have the ability

1211
00:58:41,130 --> 00:58:46,480
to search for data, update data, delete
data, insert new data, and the like.

1212
00:58:46,480 --> 00:58:49,188
And these are things that we
absolutely can do with spreadsheets.

1213
00:58:49,188 --> 00:58:52,105
But in the world of spreadsheets,
if you want to search for something,

1214
00:58:52,105 --> 00:58:54,840
it's you, the human, doing it by
manually clicking and scrolling,

1215
00:58:54,840 --> 00:58:55,340
typically.

1216
00:58:55,340 --> 00:58:57,390
If you want to insert
data, it's you, the human,

1217
00:58:57,390 --> 00:58:59,362
typing it in manually
after adding a new row.

1218
00:58:59,362 --> 00:59:01,320
If you want to delete
something, it's you right

1219
00:59:01,320 --> 00:59:04,350
clicking or Control-clicking
and deleting a whole row

1220
00:59:04,350 --> 00:59:06,750
or updating the individual
cells therein.

1221
00:59:06,750 --> 00:59:11,910
With SQL, Structured Query Language,
we have a new programming language

1222
00:59:11,910 --> 00:59:15,360
that is very often used in conjunction
with other programming languages.

1223
00:59:15,360 --> 00:59:18,930
And so today, we'll see SQL
used on its own initially.

1224
00:59:18,930 --> 00:59:21,990
But we'll also see it in the
context of a Python program.

1225
00:59:21,990 --> 00:59:28,170
So a language like Python can itself
use SQL to do more powerful things

1226
00:59:28,170 --> 00:59:30,660
than Python alone could do.

1227
00:59:30,660 --> 00:59:34,440
So with that said, SQLite is
like a light version of SQL.

1228
00:59:34,440 --> 00:59:35,940
It's a more user-friendly version.

1229
00:59:35,940 --> 00:59:36,780
It's more portable.

1230
00:59:36,780 --> 00:59:40,260
It can be used on Macs and PCs and
phones and laptops and desktops

1231
00:59:40,260 --> 00:59:40,950
and servers.

1232
00:59:40,950 --> 00:59:42,120
But it's incredibly common.

1233
00:59:42,120 --> 00:59:45,810
In fact, in your iPhone and your
Android phone, many of the applications

1234
00:59:45,810 --> 00:59:50,250
you are running today on your own device
are using SQLite underneath the hood.

1235
00:59:50,250 --> 00:59:52,290
So it isn't a toy language per se.

1236
00:59:52,290 --> 00:59:55,150
It's instead a relatively
simple implementation

1237
00:59:55,150 --> 00:59:56,920
of a language generally known as SQL.

1238
00:59:56,920 --> 01:00:00,680
But long story short, there's other
implementations of relational databases

1239
01:00:00,680 --> 01:00:01,180
out there.

1240
01:00:01,180 --> 01:00:02,972
And I rattled off
several of them already--

1241
01:00:02,972 --> 01:00:05,620
Oracle and MySQL and
PostgreSQL and the like.

1242
01:00:05,620 --> 01:00:09,910
Those all have slightly different
flavors or dialects of SQL.

1243
01:00:09,910 --> 01:00:14,860
So SQL is a fairly standard language
for interacting with databases.

1244
01:00:14,860 --> 01:00:16,960
But different companies,
different communities

1245
01:00:16,960 --> 01:00:20,200
have kind of added or subtracted
their own preferred features.

1246
01:00:20,200 --> 01:00:24,340
And so the syntax you use is generally
constant across all platforms.

1247
01:00:24,340 --> 01:00:27,458
But we will standardize
for our purposes on SQLite.

1248
01:00:27,458 --> 01:00:29,500
And indeed, this is what
you would use these days

1249
01:00:29,500 --> 01:00:31,820
in the world of mobile applications.

1250
01:00:31,820 --> 01:00:33,710
So it's very much germane there.

1251
01:00:33,710 --> 01:00:39,760
So with SQLite, we're going to have
ultimately the ability to query data

1252
01:00:39,760 --> 01:00:41,690
and update data, delete
data, and the like.

1253
01:00:41,690 --> 01:00:44,080
But to do so, we actually
need a program with which

1254
01:00:44,080 --> 01:00:46,220
to interact with our database.

1255
01:00:46,220 --> 01:00:50,320
So the way SQLite works is
that it stores all of your data

1256
01:00:50,320 --> 01:00:51,920
still in a file.

1257
01:00:51,920 --> 01:00:53,500
But it's a binary file now.

1258
01:00:53,500 --> 01:00:55,690
That is, it's a file
containing 0's and 1's.

1259
01:00:55,690 --> 01:00:57,550
And those 0's and 1's
might represent text.

1260
01:00:57,550 --> 01:00:58,810
They might represent numbers.

1261
01:00:58,810 --> 01:01:01,570
But it's a more compact,
efficient representation

1262
01:01:01,570 --> 01:01:05,290
than a mere CSV file would
be using ASCII or Unicode.

1263
01:01:05,290 --> 01:01:06,670
So that's the first difference.

1264
01:01:06,670 --> 01:01:10,690
SQLite uses a single
file, a binary file,

1265
01:01:10,690 --> 01:01:14,470
to store all of your data and represent
it inside of that file by way of all

1266
01:01:14,470 --> 01:01:18,110
of those 0's and 1's or the
tables to which I alluded before,

1267
01:01:18,110 --> 01:01:22,180
which are the analogue in the database
world of sheets or spreadsheets

1268
01:01:22,180 --> 01:01:23,690
in the spreadsheet world.

1269
01:01:23,690 --> 01:01:28,940
So to interact with that binary file
wherein all of your data is stored,

1270
01:01:28,940 --> 01:01:31,283
we need some kind of
user-facing program.

1271
01:01:31,283 --> 01:01:32,950
And there's many different tools to use.

1272
01:01:32,950 --> 01:01:36,970
But the standard one
that comes with SQLite

1273
01:01:36,970 --> 01:01:40,750
is called sqlite3, essentially
version 3 of the tool.

1274
01:01:40,750 --> 01:01:44,050
This is a command line tool similar
in spirit to any of the commands

1275
01:01:44,050 --> 01:01:46,000
you've run in a terminal
window thus far that

1276
01:01:46,000 --> 01:01:50,395
allows you to open up that binary file
and interact with all of your tables.

1277
01:01:50,395 --> 01:01:53,020
Now, here again, we kind of have
a chicken and the egg problem.

1278
01:01:53,020 --> 01:01:56,290
If I want to use a database
but I don't yet have a database

1279
01:01:56,290 --> 01:01:58,570
and yet I want to select
data from my database,

1280
01:01:58,570 --> 01:02:00,040
how do I actually load things in?

1281
01:02:00,040 --> 01:02:04,130
Well, you can load data into a
SQLite database in at least two ways.

1282
01:02:04,130 --> 01:02:06,490
One, which I'll do in
a moment, you can just

1283
01:02:06,490 --> 01:02:10,480
import an existing flat-file
database, like a CSV.

1284
01:02:10,480 --> 01:02:15,640
And what you do is you save the CSV
on your Mac or PC or on your CS50 IDE.

1285
01:02:15,640 --> 01:02:18,100
You run a special command with sqlite3.

1286
01:02:18,100 --> 01:02:21,430
And it will just load
the CSV into memory.

1287
01:02:21,430 --> 01:02:23,620
It will figure out where
all of the commas are.

1288
01:02:23,620 --> 01:02:28,510
And it will construct inside of that
binary file the corresponding rows

1289
01:02:28,510 --> 01:02:31,360
and columns using the
appropriate 0's and 1's

1290
01:02:31,360 --> 01:02:32,810
to store all of that information.

1291
01:02:32,810 --> 01:02:35,410
So it just imports it
for you automatically.

1292
01:02:35,410 --> 01:02:39,310
Approach 2 would be to actually
write code in a language like Python

1293
01:02:39,310 --> 01:02:44,290
or any other that actually
manually inserts all of the data

1294
01:02:44,290 --> 01:02:45,155
into your database.

1295
01:02:45,155 --> 01:02:46,280
And we'll do that, as well.
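
[Editor's sketch] The second approach, inserting rows from code, might look roughly like this in Python. This is a sketch under stated assumptions rather than the lecture's own program: it uses Python's built-in `sqlite3` and `csv` modules, an in-memory database, and a tiny made-up CSV (`csv_text`, including its timestamps) standing in for the real export.

```python
import csv
import io
import sqlite3

# Hypothetical stand-in for the exported CSV; the values are made up.
csv_text = (
    "Timestamp,title,genres\n"
    "10/19/2020 13:00:00,The Office,Comedy\n"
    "10/19/2020 13:00:05,Friends,Comedy\n"
    '10/19/2020 13:00:10,Breaking Bad,"Crime, Drama"\n'
)

# ":memory:" keeps the database in RAM; a filename would persist it to disk.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE shows (Timestamp TEXT, title TEXT, genres TEXT)")

# DictReader uses the first CSV line as the column names.
reader = csv.DictReader(io.StringIO(csv_text))
for row in reader:
    # The ? placeholders let sqlite3 fill in each value safely.
    db.execute(
        "INSERT INTO shows (Timestamp, title, genres) VALUES (?, ?, ?)",
        (row["Timestamp"], row["title"], row["genres"]),
    )
db.commit()

row_count = db.execute("SELECT COUNT(*) FROM shows").fetchone()[0]
print(row_count)
```

Letting the `csv` module do the parsing means a quoted value such as "Crime, Drama" stays in one column even though it contains a comma.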

1296
01:02:46,280 --> 01:02:47,290
But let's start simple.

1297
01:02:47,290 --> 01:02:51,070
Let me go ahead and run,
for instance, sqlite3.

1298
01:02:51,070 --> 01:02:54,550
And this is preinstalled on CS50 IDE,
and it's not that hard to get it up

1299
01:02:54,550 --> 01:02:56,570
and running on a Mac and PC, as well.

1300
01:02:56,570 --> 01:02:59,860
I'm going to go ahead and run
sqlite3 in my terminal window here.

1301
01:02:59,860 --> 01:03:00,610
And voila.

1302
01:03:00,610 --> 01:03:03,430
You just see some very simple output.

1303
01:03:03,430 --> 01:03:07,023
It's telling me to type .help if
I want to see some usage hints.

1304
01:03:07,023 --> 01:03:09,190
But I know most of the
commands, and we'll generally

1305
01:03:09,190 --> 01:03:11,232
give you all of the commands
that you might need.

1306
01:03:11,232 --> 01:03:15,760
In fact, one of the commands that we can
use is .mode, and another is .import.

1307
01:03:15,760 --> 01:03:18,100
So generally, you won't
use these that frequently.

1308
01:03:18,100 --> 01:03:21,670
You'll only use them when creating
a database for the first time when

1309
01:03:21,670 --> 01:03:25,002
you are creating that database
from an existing CSV file.

1310
01:03:25,002 --> 01:03:26,710
And indeed, that's my
goal at the moment.

1311
01:03:26,710 --> 01:03:30,610
Let me take our CSV file containing
all of your favorite TV shows

1312
01:03:30,610 --> 01:03:35,650
and load it into SQLite in
a proper relational database

1313
01:03:35,650 --> 01:03:39,460
so that we can do better
than, for instance, big O of n

1314
01:03:39,460 --> 01:03:42,730
when it comes to searching that
data and doing anything else on it.

1315
01:03:42,730 --> 01:03:44,960
So to do this, I have
to execute two commands.

1316
01:03:44,960 --> 01:03:48,280
One, I need to put SQLite into CSV mode.

1317
01:03:48,280 --> 01:03:51,010
And that's just to distinguish
it from other flat-file formats,

1318
01:03:51,010 --> 01:03:53,890
like TSV for tabs or some other format.

1319
01:03:53,890 --> 01:03:56,230
And now I'm going to go
ahead and run .import.

1320
01:03:56,230 --> 01:03:59,920
Then I have to specify the name of
the file to import, which is the CSV.

1321
01:03:59,920 --> 01:04:03,490
And I'm going to go ahead
and call my table shows.

1322
01:04:03,490 --> 01:04:08,500
So .import takes two arguments, the
name of the file that you want to import

1323
01:04:08,500 --> 01:04:12,430
and the name of the table that you
want to create out of that file.

1324
01:04:12,430 --> 01:04:14,680
And again, tables have rows and columns.

1325
01:04:14,680 --> 01:04:18,280
And the commas in the file
are going to delineate

1326
01:04:18,280 --> 01:04:20,290
where those columns begin and end.

1327
01:04:20,290 --> 01:04:21,790
I'm going to go ahead and hit Enter.

1328
01:04:21,790 --> 01:04:24,670
It looks like it flew by pretty fast.

1329
01:04:24,670 --> 01:04:26,560
Nothing seems to have happened.

1330
01:04:26,560 --> 01:04:30,940
But I think that's OK, because now we're
going to go ahead and have the ability

1331
01:04:30,940 --> 01:04:32,710
to actually manipulate that data.

1332
01:04:32,710 --> 01:04:34,630
But how do we manipulate the data?

1333
01:04:34,630 --> 01:04:36,070
We need a new language.

1334
01:04:36,070 --> 01:04:42,280
SQL, Structured Query Language, is the
language used by SQLite and Oracle

1335
01:04:42,280 --> 01:04:45,220
and MySQL and PostgreSQL and
bunches of other products

1336
01:04:45,220 --> 01:04:48,040
whose names you don't need to
know or remember any time soon.

1337
01:04:48,040 --> 01:04:53,260
But SQL is the language we'll use to
query the database for information

1338
01:04:53,260 --> 01:04:54,620
and do something with it.

1339
01:04:54,620 --> 01:04:57,920
Generally speaking, a relational
database and, in turn,

1340
01:04:57,920 --> 01:05:02,480
SQL, which is a language via which you
can interact with relational databases,

1341
01:05:02,480 --> 01:05:04,910
support four fundamental operations.

1342
01:05:04,910 --> 01:05:08,090
And there's sort of a crude
acronym, pun intended,

1343
01:05:08,090 --> 01:05:11,960
that is just helpful for remembering
what those fundamental operations are

1344
01:05:11,960 --> 01:05:13,190
with relational databases.

1345
01:05:13,190 --> 01:05:19,220
CRUD stands for Create,
Read, Update, and Delete.

1346
01:05:19,220 --> 01:05:21,800
And indeed, the acronym
is CRUD, C-R-U-D.

1347
01:05:21,800 --> 01:05:25,040
So it helps you remember that the
four basic operations supported by any

1348
01:05:25,040 --> 01:05:28,590
relational database are
create, read, update, delete.

1349
01:05:28,590 --> 01:05:30,710
"Create" means to
create or add new data.

1350
01:05:30,710 --> 01:05:34,550
"Read" means to access and
load into memory new data.

1351
01:05:34,550 --> 01:05:36,710
We've seen read before
with opening files.

1352
01:05:36,710 --> 01:05:39,140
"Update" and "delete" mean
exactly that, as well,

1353
01:05:39,140 --> 01:05:41,450
if you want to manipulate
the data in your data set.

1354
01:05:41,450 --> 01:05:44,530
Now, those are generic terms
for any relational database.

1355
01:05:44,530 --> 01:05:48,200
Those are the four properties typically
supported by any relational database.

1356
01:05:48,200 --> 01:05:53,490
In the world of SQL, there are some
very specific commands or functions,

1357
01:05:53,490 --> 01:05:58,550
if you will, that implement
those four functionalities.

1358
01:05:58,550 --> 01:06:00,980
They are create and insert--

1359
01:06:00,980 --> 01:06:03,620
achieve the same thing
as create more generally.

1360
01:06:03,620 --> 01:06:07,850
The keyword "select" is what's
used to read data from a database.

1361
01:06:07,850 --> 01:06:09,460
Update and delete are the same.

1362
01:06:09,460 --> 01:06:11,210
So it's kind of an
annoying inconsistency.

1363
01:06:11,210 --> 01:06:14,803
The acronym or the term of art is
CRUD, Create, Read, Update, Delete.

1364
01:06:14,803 --> 01:06:16,970
But in the world of SQL,
the authors of the language

1365
01:06:16,970 --> 01:06:20,030
decided to implement
those four ideas by way

1366
01:06:20,030 --> 01:06:24,760
of these five keywords or functions or
commands, if you will, in the language

1367
01:06:24,760 --> 01:06:25,260
SQL.

1368
01:06:25,260 --> 01:06:28,800
So what you are looking at
are five of the keywords

1369
01:06:28,800 --> 01:06:32,990
that you can use in this new language
called SQL to actually do something

1370
01:06:32,990 --> 01:06:33,980
with your database.
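
[Editor's sketch] As a rough illustration of how the five keywords line up with create, read, update, and delete, here is a small Python example using the built-in `sqlite3` module. The table contents are made up, and the `WHERE` clauses are an assumption on my part, since this part of the lecture has not yet covered filtering.

```python
import sqlite3

db = sqlite3.connect(":memory:")

# Create: CREATE makes the table, INSERT adds rows to it.
db.execute("CREATE TABLE shows (title TEXT, genres TEXT)")
db.execute("INSERT INTO shows (title, genres) VALUES (?, ?)", ("friends", "Comedy"))
db.execute("INSERT INTO shows (title, genres) VALUES (?, ?)", ("The Office", "Comedy"))

# Read: SELECT returns rows without changing them.
before = db.execute("SELECT title FROM shows").fetchall()

# Update: change the values in rows that already exist.
db.execute("UPDATE shows SET title = ? WHERE title = ?", ("Friends", "friends"))

# Delete: remove rows entirely.
db.execute("DELETE FROM shows WHERE title = ?", ("The Office",))

after = db.execute("SELECT title FROM shows").fetchall()
print(before, after)
```

Each row comes back as a tuple, one element per selected column, which is why the printed titles appear wrapped in parentheses.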

1371
01:06:33,980 --> 01:06:35,070
Now, what does that mean?

1372
01:06:35,070 --> 01:06:37,190
Well, suppose that you
wanted to manually create

1373
01:06:37,190 --> 01:06:38,960
a database for the very first time.

1374
01:06:38,960 --> 01:06:39,585
What do you do?

1375
01:06:39,585 --> 01:06:42,752
Well, back in the world of spreadsheets,
it's pretty straightforward, right?

1376
01:06:42,752 --> 01:06:44,300
You'd open up Google Spreadsheets.

1377
01:06:44,300 --> 01:06:46,370
You go to File, New or whatever.

1378
01:06:46,370 --> 01:06:48,350
And then you just, voila,
get a new spreadsheet

1379
01:06:48,350 --> 01:06:51,170
into which you can start creating
rows and columns and the like.

1380
01:06:51,170 --> 01:06:53,840
In Microsoft Excel, Apple
Numbers, same thing--

1381
01:06:53,840 --> 01:06:57,840
File menu, New Spreadsheet or whatever,
and boom, you have a new spreadsheet.

1382
01:06:57,840 --> 01:07:00,860
Now, in the world of SQL,
SQL databases are generally

1383
01:07:00,860 --> 01:07:02,840
meant to be interacted with code.

1384
01:07:02,840 --> 01:07:05,930
However, there are Graphical
User Interfaces, GUIs, by which

1385
01:07:05,930 --> 01:07:07,430
you can interact with them, as well.

1386
01:07:07,430 --> 01:07:11,600
But we're going to use code today to
do so and programs at a command line.

1387
01:07:11,600 --> 01:07:17,120
It turns out that you can
create tables programmatically

1388
01:07:17,120 --> 01:07:19,530
by running a command like this.

1389
01:07:19,530 --> 01:07:24,320
So if you literally type out syntax
along the lines of CREATE TABLE, then

1390
01:07:24,320 --> 01:07:27,230
the name of your table,
indicated here in lowercase,

1391
01:07:27,230 --> 01:07:31,490
then a parenthesis, then the name of
your column that you want to create

1392
01:07:31,490 --> 01:07:36,190
and the type of that column,
a la C, and then comma, dot,

1393
01:07:36,190 --> 01:07:39,050
dot, dot, some more
columns, this is generally

1394
01:07:39,050 --> 01:07:43,350
speaking the syntax you'll use
to create in this language called

1395
01:07:43,350 --> 01:07:44,802
SQL a new table.

1396
01:07:44,802 --> 01:07:46,010
Now, this is in the abstract.

1397
01:07:46,010 --> 01:07:49,077
Again, table in lowercase is
meant to represent the name

1398
01:07:49,077 --> 01:07:50,660
you want to give to your actual table.

1399
01:07:50,660 --> 01:07:52,580
column in lowercase is
meant to be the name

1400
01:07:52,580 --> 01:07:54,080
you want to give to your own column.

1401
01:07:54,080 --> 01:07:54,788
Maybe it's Title.

1402
01:07:54,788 --> 01:07:55,567
Maybe it's Genres.

1403
01:07:55,567 --> 01:07:57,400
And dot, dot, dot just
means, of course, you

1404
01:07:57,400 --> 01:07:59,100
can have even more columns than that.

1405
01:07:59,100 --> 01:08:02,990
But literally in a moment, if I
were to type in this kind of command

1406
01:08:02,990 --> 01:08:06,500
into the terminal window after
running the sqlite3 program,

1407
01:08:06,500 --> 01:08:09,980
I could start creating one
or more tables for myself.

1408
01:08:09,980 --> 01:08:12,920
And in fact, that's what
already happened for me.

1409
01:08:12,920 --> 01:08:15,560
This .import command,
which is not part of SQL--

1410
01:08:15,560 --> 01:08:19,579
this is the equivalent of a Menu
option in Excel or Google Spreadsheets.

1411
01:08:19,579 --> 01:08:22,729
.import just automates a
certain process for me.

1412
01:08:22,729 --> 01:08:24,319
And what it did for me is this.

1413
01:08:24,319 --> 01:08:28,609
If I type now .schema, which is
another SQLite-specific command--

1414
01:08:28,609 --> 01:08:32,479
anything that starts with a .
is specific only to sqlite3,

1415
01:08:32,479 --> 01:08:34,250
this terminal window program.

1416
01:08:34,250 --> 01:08:36,830
Notice what's outputted is this.

1417
01:08:36,830 --> 01:08:44,420
By running .import that automatically
for me created a table in my database

1418
01:08:44,420 --> 01:08:46,010
called shows.

1419
01:08:46,010 --> 01:08:47,540
And it gave it three columns--

1420
01:08:47,540 --> 01:08:50,060
Timestamp, title, and genres.

1421
01:08:50,060 --> 01:08:52,529
Where did those column names come from?

1422
01:08:52,529 --> 01:08:55,220
Well, they came from the
very first line in the CSV.

1423
01:08:55,220 --> 01:08:58,850
And they all looked like text,
so the type of those values

1424
01:08:58,850 --> 01:09:02,490
was just assumed to be text, text, text.

1425
01:09:02,490 --> 01:09:05,090
Now, to be clear, I could
have manually typed this out,

1426
01:09:05,090 --> 01:09:08,359
created these three columns in
a new table called shows for me.

1427
01:09:08,359 --> 01:09:11,870
But again, the .import command
just automated that from a CSV.

1428
01:09:11,870 --> 01:09:17,370
But the SQL is what we see here,
CREATE TABLE shows and so forth.
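
[Editor's sketch] Typing the statement by hand, instead of relying on .import, could look like the following when driven from Python's built-in `sqlite3` module. The three column names and the TEXT type come from the schema described above; the in-memory database is an assumption for illustration. SQLite keeps the text of each CREATE TABLE statement in a built-in table named `sqlite_master`, which is queried here to read the schema back.

```python
import sqlite3

db = sqlite3.connect(":memory:")

# The same shape of statement the lecture describes: table name, then
# a parenthesized list of column names, each followed by its type.
db.execute("CREATE TABLE shows (Timestamp TEXT, title TEXT, genres TEXT)")

# Read back the stored CREATE TABLE text for the shows table.
schema = db.execute(
    "SELECT sql FROM sqlite_master WHERE type = 'table' AND name = 'shows'"
).fetchone()[0]

# PRAGMA table_info lists one row per column; index 1 is the column name.
columns = [row[1] for row in db.execute("PRAGMA table_info(shows)")]
print(schema)
print(columns)
```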

1429
01:09:17,370 --> 01:09:22,609
So that is to say now, in this
database, there is a file--

1430
01:09:22,609 --> 01:09:27,500
or rather, there is a
table called shows inside

1431
01:09:27,500 --> 01:09:29,630
of which is all of the
data from that CSV.

1432
01:09:29,630 --> 01:09:31,580
How do I actually get at that data?

1433
01:09:31,580 --> 01:09:33,830
Well, it turns out there's
other commands, recall.

1434
01:09:33,830 --> 01:09:37,430
Not just CREATE, but also
SELECT, it turns out.

1435
01:09:37,430 --> 01:09:40,850
SELECT is the equivalent of read,
getting data from the database.

1436
01:09:40,850 --> 01:09:42,590
And this one is pretty powerful.

1437
01:09:42,590 --> 01:09:45,950
And the reason that so many data
scientists and statisticians

1438
01:09:45,950 --> 01:09:48,290
use and like using languages like SQL--

1439
01:09:48,290 --> 01:09:51,620
they make it relatively easy to
just get data and filter that data

1440
01:09:51,620 --> 01:09:55,380
and analyze that data using
new syntax for us today,

1441
01:09:55,380 --> 01:09:58,940
but relatively simple syntax
relative to other things we've seen.

1442
01:09:58,940 --> 01:10:03,590
The SELECT command in SQL lets
you select one or more columns

1443
01:10:03,590 --> 01:10:06,710
from your table by the given name.

1444
01:10:06,710 --> 01:10:10,040
So we'll see this now
in just a moment here.

1445
01:10:10,040 --> 01:10:11,460
How might I go about doing this?

1446
01:10:11,460 --> 01:10:15,170
Well, let me go ahead and now, at my
prompt after just clearing the window

1447
01:10:15,170 --> 01:10:17,340
to keep things neat,
let me try this out.

1448
01:10:17,340 --> 01:10:26,090
Let me go ahead and SELECT,
let's say, title FROM shows;.

1449
01:10:26,090 --> 01:10:27,290
So why am I doing this?

1450
01:10:27,290 --> 01:10:29,800
Well, again, the conventional
format for the SELECT command

1451
01:10:29,800 --> 01:10:33,400
is to say SELECT, then the name
of one or more columns, then

1452
01:10:33,400 --> 01:10:37,240
literally the preposition FROM, and then
the name of the table from which you

1453
01:10:37,240 --> 01:10:38,840
want to select that data.

1454
01:10:38,840 --> 01:10:43,390
So if my table is called shows
and the column is called title,

1455
01:10:43,390 --> 01:10:46,930
it stands to reason that SELECT
title FROM shows should give me

1456
01:10:46,930 --> 01:10:48,100
back the data I want.

1457
01:10:48,100 --> 01:10:50,080
Now, notice a couple
of stylistic choices

1458
01:10:50,080 --> 01:10:52,630
that aren't strictly
required but are good style.

1459
01:10:52,630 --> 01:10:56,470
Conventionally, I would
capitalize any SQL keywords,

1460
01:10:56,470 --> 01:10:59,470
including SELECT and FROM
in this case, and then

1461
01:10:59,470 --> 01:11:03,610
lowercase anything that's a
column name or a table name,

1462
01:11:03,610 --> 01:11:07,023
assuming you created those columns
and tables in, in fact, lowercase.

1463
01:11:07,023 --> 01:11:08,690
There's different conventions out there.

1464
01:11:08,690 --> 01:11:09,815
Some people will uppercase.

1465
01:11:09,815 --> 01:11:12,950
Some people will use something called
camel case or snake case or the like.

1466
01:11:12,950 --> 01:11:15,220
But generally speaking, I
would encourage all caps

1467
01:11:15,220 --> 01:11:19,180
for SQL syntax and lowercase
for the column and table names.

1468
01:11:19,180 --> 01:11:21,190
I'm going to go ahead now and hit Enter.

1469
01:11:21,190 --> 01:11:22,060
And voila.

1470
01:11:22,060 --> 01:11:26,950
We see rapidly a whole list of
values outputted from the database.

1471
01:11:26,950 --> 01:11:30,790
And if you think way back, you
might recognize that this actually

1472
01:11:30,790 --> 01:11:35,200
happens to be the same order
as before, because the CSV

1473
01:11:35,200 --> 01:11:39,010
file was loaded top to bottom
into this same database table.

1474
01:11:39,010 --> 01:11:42,370
And so what we're seeing, in fact,
is all of that same data, duplicates

1475
01:11:42,370 --> 01:11:46,030
and miscapitalizations
and weird spacing and all.

1476
01:11:46,030 --> 01:11:48,790
But suppose I want to see
all of the data from the CSV.

1477
01:11:48,790 --> 01:11:51,160
Well, it turns out you can
select multiple columns.

1478
01:11:51,160 --> 01:11:54,478
You can select not only title, but
maybe timestamp was of interest.

1479
01:11:54,478 --> 01:11:56,770
And this one admittedly was
capitalized, because that's

1480
01:11:56,770 --> 01:11:58,360
what it was in the spreadsheet.

1481
01:11:58,360 --> 01:12:00,290
That was not something I chose manually.

1482
01:12:00,290 --> 01:12:02,800
So if I just use a comma-separated
list of column names,

1483
01:12:02,800 --> 01:12:04,090
notice what I can do now.

1484
01:12:04,090 --> 01:12:07,790
It's a little hard to see for us humans,
because there's a lot going on now.

1485
01:12:07,790 --> 01:12:10,120
But notice that in double
quotes on the left,

1486
01:12:10,120 --> 01:12:14,170
there are all of the timestamps, which
represent the time at which you all

1487
01:12:14,170 --> 01:12:15,490
submitted your favorite shows.

1488
01:12:15,490 --> 01:12:19,390
And on the right of the comma,
there's another quoted string

1489
01:12:19,390 --> 01:12:22,210
that is the title of the show
that you liked, although SQLite

1490
01:12:22,210 --> 01:12:27,070
omits the quotes if it's just a single
word, like Friends, just by convention.

1491
01:12:27,070 --> 01:12:29,290
In fact, if I want to
get all of the columns,

1492
01:12:29,290 --> 01:12:31,510
turns out there's some
shorthand syntax for that.

1493
01:12:31,510 --> 01:12:34,270
* is the so-called wildcard operator.

1494
01:12:34,270 --> 01:12:37,780
And it will get me all of the columns
from left to right in my table.

1495
01:12:37,780 --> 01:12:38,500
And voila.

1496
01:12:38,500 --> 01:12:44,180
Now I see all of the data, including
all of the genres, as well.

1497
01:12:44,180 --> 01:12:49,090
So now I effectively have three columns
being outputted all at once here.
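
[Editor's sketch] The three queries just shown (one column, two columns, and * for every column) can be tried from Python's built-in `sqlite3` module as well. The rows below, timestamps included, are made-up placeholders and not the class's actual submissions.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE shows (Timestamp TEXT, title TEXT, genres TEXT)")

# Hypothetical rows standing in for the imported CSV.
db.executemany(
    "INSERT INTO shows VALUES (?, ?, ?)",
    [
        ("10/19/2020 13:00:00", "The Office", "Comedy"),
        ("10/19/2020 13:00:05", "Friends", "Comedy"),
    ],
)

# One column: each row is a 1-tuple, so take element 0.
titles = [row[0] for row in db.execute("SELECT title FROM shows")]

# Two columns, as a comma-separated list of column names.
pairs = db.execute("SELECT Timestamp, title FROM shows").fetchall()

# * selects every column, left to right.
everything = db.execute("SELECT * FROM shows").fetchall()

print(titles)
print(pairs)
print(everything)
```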

1498
01:12:49,090 --> 01:12:51,520
Well, this is not that useful thus far.

1499
01:12:51,520 --> 01:12:53,770
In fact, all I've been doing
is really just outputting

1500
01:12:53,770 --> 01:12:55,060
the contents of the CSV.

1501
01:12:55,060 --> 01:12:58,990
But SQL's powerful because it comes with
other features right out of the box,

1502
01:12:58,990 --> 01:13:02,830
somewhat similar in spirit to functions
that are built into Google Spreadsheets

1503
01:13:02,830 --> 01:13:03,550
and Excel.

1504
01:13:03,550 --> 01:13:06,110
But now we can use them
ultimately in our own code.

1505
01:13:06,110 --> 01:13:09,460
So functions like AVG, COUNT,
DISTINCT, LOWER, MAX, MIN,

1506
01:13:09,460 --> 01:13:13,540
and UPPER and bunches more, these
are all functions built into SQL

1507
01:13:13,540 --> 01:13:19,370
that you can use as part of your query
to alter the data as it's coming back

1508
01:13:19,370 --> 01:13:21,370
from the database-- not
permanently, but as it's

1509
01:13:21,370 --> 01:13:25,040
coming back to you-- so that it's
in a format you actually care about.

1510
01:13:25,040 --> 01:13:26,870
So for instance, one
of my goals earlier,

1511
01:13:26,870 --> 01:13:29,680
was to get back just the
distinct, the unique titles.

1512
01:13:29,680 --> 01:13:32,620
And we had to write all that
annoying code using a set

1513
01:13:32,620 --> 01:13:35,560
and then add things to the set and
then loop over it again, right?

1514
01:13:35,560 --> 01:13:37,180
That was not a huge amount of code.

1515
01:13:37,180 --> 01:13:40,840
But it definitely took us, what, 5, 10
minutes to get the job done at least.

1516
01:13:40,840 --> 01:13:43,780
In SQL, you can do all
of that in one breath.

1517
01:13:43,780 --> 01:13:45,650
I'm going to go ahead now and do this.

1518
01:13:45,650 --> 01:13:49,690
SELECT not just title FROM shows.

1519
01:13:49,690 --> 01:13:54,370
Let me go ahead and SELECT
DISTINCT title FROM shows.

1520
01:13:54,370 --> 01:13:57,640
So DISTINCT, again, is an
available function in SQL

1521
01:13:57,640 --> 01:13:58,900
that does what the name says.

1522
01:13:58,900 --> 01:14:00,650
It's going to filter
out all of the titles

1523
01:14:00,650 --> 01:14:02,450
to just give me the distinct ones back.

1524
01:14:02,450 --> 01:14:08,740
So if I hit Enter now, you'll see a
similarly messy list but including--

1525
01:14:08,740 --> 01:14:10,810
"no idea," someone
that doesn't watch TV--

1526
01:14:10,810 --> 01:14:14,230
including an unsorted
list of those titles.

1527
01:14:14,230 --> 01:14:18,130
So I think we can probably start to
clean this thing up as we did before.

1528
01:14:18,130 --> 01:14:20,950
Let me go ahead and now
SELECT not just DISTINCT,

1529
01:14:20,950 --> 01:14:23,660
but let me go ahead and
uppercase everything as well.

1530
01:14:23,660 --> 01:14:25,970
And I can use UPPER as another function.

1531
01:14:25,970 --> 01:14:27,580
And notice I'm just nesting things.

1532
01:14:27,580 --> 01:14:30,247
The output of one function, as
we've seen in many languages now,

1533
01:14:30,247 --> 01:14:31,450
can be the input to another.

1534
01:14:31,450 --> 01:14:32,830
Let me hit Enter now.

1535
01:14:32,830 --> 01:14:36,610
And now it's getting a little
more canonicalized, so to speak,

1536
01:14:36,610 --> 01:14:39,190
because I'm using
capitalization for everything.

1537
01:14:39,190 --> 01:14:43,690
But it would seem that things
still aren't really sorted.

1538
01:14:43,690 --> 01:14:46,070
It's just the same order
in which you inputted them

1539
01:14:46,070 --> 01:14:48,370
but without duplicates this time.
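
A small sketch of DISTINCT and UPPER together, using made-up titles. Strictly speaking, DISTINCT is a keyword applied to the SELECT rather than a function like UPPER, which is why it is written without parentheses here.

```python
import sqlite3

# Hypothetical titles, including a repeat and a lowercase variant.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE shows (title TEXT)")
db.executemany(
    "INSERT INTO shows VALUES (?)",
    [("The Office",), ("the office",), ("Friends",), ("The Office",)],
)

# DISTINCT alone removes exact duplicates only.
distinct_titles = [t for (t,) in db.execute("SELECT DISTINCT title FROM shows")]

# Uppercasing first makes the case variants collapse into one value.
distinct_upper = [t for (t,) in db.execute("SELECT DISTINCT UPPER(title) FROM shows")]

print(distinct_titles)
print(distinct_upper)
```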

1540
01:14:48,370 --> 01:14:51,700
So it turns out that
SQL has other syntax

1541
01:14:51,700 --> 01:14:55,580
that we can use to make our queries
more precise and more powerful.

1542
01:14:55,580 --> 01:14:57,640
So in addition to these
kinds of functions

1543
01:14:57,640 --> 01:15:00,340
that you can use to alter the
data that's being shown to you

1544
01:15:00,340 --> 01:15:04,570
and coming back, you can also use
these kinds of clauses or syntax

1545
01:15:04,570 --> 01:15:05,800
in SQL queries.

1546
01:15:05,800 --> 01:15:09,130
You can say WHERE, which is
the equivalent of a condition.

1547
01:15:09,130 --> 01:15:13,600
You can say select all of this data
where something is true or false.

1548
01:15:13,600 --> 01:15:17,440
You can say LIKE, where you can say
give me data that isn't exactly this

1549
01:15:17,440 --> 01:15:18,520
but is like this.

1550
01:15:18,520 --> 01:15:20,660
You can order the data by some column.

1551
01:15:20,660 --> 01:15:23,210
You can limit the number
of rows that come back.

1552
01:15:23,210 --> 01:15:26,850
And you can group identical
values together in some way.

1553
01:15:26,850 --> 01:15:28,640
So let's see a few examples of this.

1554
01:15:28,640 --> 01:15:32,055
Let me go back here and
play around now with--

1555
01:15:32,055 --> 01:15:32,930
how about The Office?

1556
01:15:32,930 --> 01:15:34,513
That was the one we looked at earlier.

1557
01:15:34,513 --> 01:15:42,260
So let me go ahead and SELECT title
FROM shows WHERE title = "The Office";.

1558
01:15:42,260 --> 01:15:48,200
So I've added this WHERE predicate, so
to speak, WHERE title = "The Office."

1559
01:15:48,200 --> 01:15:49,190
So SQL's nice.

1560
01:15:49,190 --> 01:15:52,670
Similar in spirit to Python,
it's more user friendly, perhaps,

1561
01:15:52,670 --> 01:15:55,910
than C where everything kind of sort
of reads like an English sentence,

1562
01:15:55,910 --> 01:15:58,230
even though it's a little more precise.

1563
01:15:58,230 --> 01:15:59,880
And it's a little more succinct.

1564
01:15:59,880 --> 01:16:01,130
Let me go ahead and hit Enter.

1565
01:16:01,130 --> 01:16:02,120
And voila.

1566
01:16:02,120 --> 01:16:05,850
That's how many of you
inputted The Office.

1567
01:16:05,850 --> 01:16:08,520
But notice it's not everyone, is it?

1568
01:16:08,520 --> 01:16:10,050
We're missing some still.

1569
01:16:10,050 --> 01:16:14,070
It seems that I got back only
those of you who typed in literally

1570
01:16:14,070 --> 01:16:16,710
"The Office," capital T, capital O.

1571
01:16:16,710 --> 01:16:19,200
So what if I want to be a
little more resilient than that?

1572
01:16:19,200 --> 01:16:23,280
Well, let me get back any rows
where you all typed in "office."

1573
01:16:23,280 --> 01:16:26,820
Maybe you omitted the article "the."

1574
01:16:26,820 --> 01:16:30,390
So let me go ahead and
say not title = "Office",

1575
01:16:30,390 --> 01:16:33,780
but let me go ahead and say
where the title is like "Office."

1576
01:16:33,780 --> 01:16:35,490
But I don't want it to just be "office."

1577
01:16:35,490 --> 01:16:39,120
I want to allow for maybe some stuff
at the beginning, maybe some stuff

1578
01:16:39,120 --> 01:16:39,673
at the end.

1579
01:16:39,673 --> 01:16:42,090
And even though that seems
like a bit of an inconsistency,

1580
01:16:42,090 --> 01:16:46,950
in the context of using LIKE,
there's another wild card character.

1581
01:16:46,950 --> 01:16:51,390
The percent sign represents zero
or more characters to the left.

1582
01:16:51,390 --> 01:16:55,410
And this percent sign represents
zero or more characters to the right.

1583
01:16:55,410 --> 01:16:59,940
So it's kind of this catchall that will
now find me all titles that somewhere

1584
01:16:59,940 --> 01:17:02,980
have O-F-F-I-C-E inside of them.

1585
01:17:02,980 --> 01:17:04,778
And it turns out LIKE
is case insensitive,

1586
01:17:04,778 --> 01:17:07,320
so I don't even need to worry
about capitalization with LIKE.

1587
01:17:07,320 --> 01:17:08,610
Now let me hit Enter.

1588
01:17:08,610 --> 01:17:09,450
And voila.

1589
01:17:09,450 --> 01:17:10,890
Now I get back more answers.

1590
01:17:10,890 --> 01:17:12,780
And you can really
see the messiness now.

1591
01:17:12,780 --> 01:17:15,900
Notice up here one of
you used lowercase.

1592
01:17:15,900 --> 01:17:18,450
That tends to be common when
typing things in quickly.

1593
01:17:18,450 --> 01:17:21,270
One of you did it lowercase
here and then also gave

1594
01:17:21,270 --> 01:17:23,160
us an extra white space at the end.

1595
01:17:23,160 --> 01:17:24,900
One of you just typed in "office."

1596
01:17:24,900 --> 01:17:27,540
One of you typed in "the office"
again with a space at the end.

1597
01:17:27,540 --> 01:17:29,200
And so there's a lot of variation here.

1598
01:17:29,200 --> 01:17:31,560
And that's why, when we
forced everything to uppercase

1599
01:17:31,560 --> 01:17:34,650
and we started trimming
things, we were able to get rid

1600
01:17:34,650 --> 01:17:37,440
of a lot of those redundancies.
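
A sketch comparing the exact match with the LIKE pattern, on made-up responses. The code uses single-quoted strings, which is the standard SQL form for text; SQLite also tolerates the double quotes typed in the lecture. The case-insensitive behavior of LIKE shown here is SQLite's default for ASCII letters.

```python
import sqlite3

# Hypothetical responses with the kinds of variation described above.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE shows (title TEXT)")
db.executemany(
    "INSERT INTO shows VALUES (?)",
    [("The Office",), ("the office",), ("Office",), ("the office ",), ("Friends",)],
)

# = matches only the exact text, capital T and capital O.
exact = db.execute("SELECT title FROM shows WHERE title = 'The Office'").fetchall()

# % stands for zero or more characters on either side of "office".
fuzzy = db.execute("SELECT title FROM shows WHERE title LIKE '%office%'").fetchall()

print(len(exact), len(fuzzy))
```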

1601
01:17:37,440 --> 01:17:40,290
Well, in fact, let's go
ahead and order this now.

1602
01:17:40,290 --> 01:17:44,040
So let me go back to selecting
the distinct uppercase title,

1603
01:17:44,040 --> 01:17:51,060
so SELECT DISTINCT UPPER
of title FROM shows.

1604
01:17:51,060 --> 01:17:56,220
And let me now ORDER BY, which is a
new clause, the uppercased version

1605
01:17:56,220 --> 01:17:57,868
of title.

1606
01:17:57,868 --> 01:17:59,910
So now notice there's a
few things going on here.

1607
01:17:59,910 --> 01:18:01,710
But I'm just building up
more complicated queries

1608
01:18:01,710 --> 01:18:04,260
similar to Scratch, where we just
started throwing more and more puzzle

1609
01:18:04,260 --> 01:18:05,340
pieces at a problem.

1610
01:18:05,340 --> 01:18:10,530
I'm selecting all of the distinct
uppercase titles from the shows table.

1611
01:18:10,530 --> 01:18:13,050
But I'm going to order
the results this time

1612
01:18:13,050 --> 01:18:15,780
by the uppercased version of title.

1613
01:18:15,780 --> 01:18:17,550
So everything is going to be uppercased.

1614
01:18:17,550 --> 01:18:20,460
And then it's going to be sorted
A through Z. Hit Enter now,

1615
01:18:20,460 --> 01:18:23,160
and now things are a little
easier to make sense of.

1616
01:18:23,160 --> 01:18:26,970
Notice the quotes are there only when
there are multiple words in a title.

1617
01:18:26,970 --> 01:18:29,400
Otherwise, sqlite3
doesn't bother showing us.

1618
01:18:29,400 --> 01:18:32,190
But notice here's all the "the" shows.

1619
01:18:32,190 --> 01:18:36,270
And if we keep scrolling up, the P's,
the N's, the M's, the L's, and so

1620
01:18:36,270 --> 01:18:41,190
forth-- it's indeed alphabetized
thanks to using ORDER BY.
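
A sketch of the ORDER BY query on a few made-up titles, assuming the same single-column shows table as a stand-in for the real one.

```python
import sqlite3

# Hypothetical titles in no particular order.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE shows (title TEXT)")
db.executemany(
    "INSERT INTO shows VALUES (?)",
    [("the office",), ("Friends",), ("The Office",), ("game of thrones",)],
)

# Uppercase, de-duplicate, then sort A through Z.
ordered = [
    t
    for (t,) in db.execute(
        "SELECT DISTINCT UPPER(title) FROM shows ORDER BY UPPER(title)"
    )
]

print(ordered)
```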

1621
01:18:41,190 --> 01:18:41,950
All right.

1622
01:18:41,950 --> 01:18:45,540
Well, let's start to solve more
similar problems now in SQL

1623
01:18:45,540 --> 01:18:49,830
by writing way less code than
we did a bit ago in Python.

1624
01:18:49,830 --> 01:18:54,780
Suppose I want to actually figure out
the counts of these most popular shows.

1625
01:18:54,780 --> 01:18:58,050
So I want to combine all
of the identical shows

1626
01:18:58,050 --> 01:19:00,510
and figure out all of
the corresponding counts.

1627
01:19:00,510 --> 01:19:02,330
Well, let me go ahead and try this.

1628
01:19:02,330 --> 01:19:07,932
Let me go ahead and SELECT again
the uppercased version of title.

1629
01:19:07,932 --> 01:19:10,140
But I'm not going to do
DISTINCT this time, because I

1630
01:19:10,140 --> 01:19:11,830
want to do that a little differently.

1631
01:19:11,830 --> 01:19:13,650
I'm going to SELECT
the uppercased version

1632
01:19:13,650 --> 01:19:16,510
of title, the COUNT of those titles--

1633
01:19:16,510 --> 01:19:19,320
so the number of times a
given title appears, so COUNT

1634
01:19:19,320 --> 01:19:20,610
is a new keyword now--

1635
01:19:20,610 --> 01:19:22,080
FROM shows.

1636
01:19:22,080 --> 01:19:25,860
But now how do I figure
out what the count is?

1637
01:19:25,860 --> 01:19:29,700
Well, if you think about this
table as having a lot of titles--

1638
01:19:29,700 --> 01:19:31,930
title, title, title, title, title--

1639
01:19:31,930 --> 01:19:35,970
it would be nice to kind of group
the identical titles together

1640
01:19:35,970 --> 01:19:42,460
and then actually count how many
such titles we grouped together.

1641
01:19:42,460 --> 01:19:47,710
And the syntax for that is literally
to say GROUP BY UPPER(title);.

1642
01:19:47,710 --> 01:19:51,130
This tells SQL to group all
of the uppercased titles

1643
01:19:51,130 --> 01:19:53,860
together, kind of collapse
multiple rows into one,

1644
01:19:53,860 --> 01:19:58,990
but keep track of the count
of titles after that collapse.

1645
01:19:58,990 --> 01:20:01,810
Let me go ahead now and hit Enter.

1646
01:20:01,810 --> 01:20:05,980
And you'll see, very similar to one of
the earlier Python programs we wrote,

1647
01:20:05,980 --> 01:20:10,040
all of the titles on the left followed
by a comma, followed by the count.

1648
01:20:10,040 --> 01:20:11,920
So one of you really
likes Tom and Jerry.

1649
01:20:11,920 --> 01:20:14,470
One of you really likes Top Gear.

1650
01:20:14,470 --> 01:20:17,140
If I scroll up, though, two
of you really liked The Wire.

1651
01:20:17,140 --> 01:20:19,930
23 of you here like The
Office, although we still

1652
01:20:19,930 --> 01:20:22,010
haven't trimmed the issue here.

1653
01:20:22,010 --> 01:20:25,180
So we could still combine that further
by trimming whitespace if we want.

1654
01:20:25,180 --> 01:20:27,040
But now we're getting
these kinds of counts.
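
A sketch of GROUP BY with COUNT on made-up titles. One row has a trailing space on purpose, to show why the untrimmed entries still end up in a group of their own.

```python
import sqlite3

# Hypothetical titles; the last one ends with a space.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE shows (title TEXT)")
db.executemany(
    "INSERT INTO shows VALUES (?)",
    [("The Office",), ("the office",), ("Friends",), ("The Office ",)],
)

# Collapse rows that share an uppercased title and count each group.
counts = dict(
    db.execute("SELECT UPPER(title), COUNT(title) FROM shows GROUP BY UPPER(title)")
)

print(counts)
```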

1655
01:20:27,040 --> 01:20:32,510
Well, how can I go ahead and
order this, as we did before?

1656
01:20:32,510 --> 01:20:39,820
Let me go ahead here and
add ORDER BY COUNT of title

1657
01:20:39,820 --> 01:20:42,010
and then hit semicolon now.

1658
01:20:42,010 --> 01:20:45,310
And now notice, just as
in Python, everything

1659
01:20:45,310 --> 01:20:47,800
is from smallest to largest
initially, with Game of Thrones

1660
01:20:47,800 --> 01:20:49,180
here down on the bottom.

1661
01:20:49,180 --> 01:20:50,360
How can I fix this?

1662
01:20:50,360 --> 01:20:53,890
Well, it turns out you can
order things in descending order,

1663
01:20:53,890 --> 01:20:58,510
D-E-S-C for short instead of A-S-C,
which is the default for ascending--

1664
01:20:58,510 --> 01:21:02,110
so if I do it in descending order, now
I'd have to scroll all the way back up

1665
01:21:02,110 --> 01:21:07,480
to the A's, the very top, to
see where the lines begin.

1666
01:21:07,480 --> 01:21:09,420
Whoops.

1667
01:21:09,420 --> 01:21:13,020
If I scroll all the way back up to the
top, we'll see where all of the A words

1668
01:21:13,020 --> 01:21:14,610
begin up here.

1669
01:21:14,610 --> 01:21:17,252
And now if I want to--

1670
01:21:17,252 --> 01:21:18,210
whoops, whoops, whoops.

1671
01:21:18,210 --> 01:21:20,190
Did I do that right?

1672
01:21:20,190 --> 01:21:20,690
Sorry.

1673
01:21:20,690 --> 01:21:21,950
I don't want to--

1674
01:21:21,950 --> 01:21:23,827
there we go, ORDER BY COUNT descending.

1675
01:21:23,827 --> 01:21:26,660
Now let me go ahead and-- this is
just a little too unwieldy to see.

1676
01:21:26,660 --> 01:21:29,035
Let me just limit myself to
the top 10 and keep it simple

1677
01:21:29,035 --> 01:21:30,920
and only look at the top 10 values here.

1678
01:21:30,920 --> 01:21:31,730
Voila.

1679
01:21:31,730 --> 01:21:36,585
Now I have Game of Thrones at 33,
Friends at 26, The Office at 23--

1680
01:21:36,585 --> 01:21:38,210
though I think I'm still missing a few.

1681
01:21:38,210 --> 01:21:41,660
Brian, do you recall the SQL function
for trimming leading and trailing

1682
01:21:41,660 --> 01:21:43,410
white space?

1683
01:21:43,410 --> 01:21:44,785
BRIAN YU: I think it's just TRIM.

1684
01:21:44,785 --> 01:21:45,660
DAVID J. MALAN: TRIM?

1685
01:21:45,660 --> 01:21:46,340
OK.

1686
01:21:46,340 --> 01:21:47,577
I myself did not remember.

1687
01:21:47,577 --> 01:21:49,160
So when in doubt, google or ask Brian.

1688
01:21:49,160 --> 01:21:51,000
So let me go ahead and fix this.

1689
01:21:51,000 --> 01:21:55,670
Let me go ahead and SELECT uppercase
of trimming the title first.

1690
01:21:55,670 --> 01:22:00,840
And then I'm going to GROUP BY
trimming and then uppercasing it there.

1691
01:22:00,840 --> 01:22:02,372
And now Enter, and voila.

1692
01:22:02,372 --> 01:22:03,080
Thank you, Brian.

1693
01:22:03,080 --> 01:22:07,020
So now we're up to our 26 Offices here.
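
A sketch of the finished query: trim, uppercase, group, sort by count descending, and limit. The data is made up and the limit is 2 rather than 10 so the output stays short.

```python
import sqlite3

# Hypothetical responses; one "office" entry has a trailing space.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE shows (title TEXT)")
db.executemany(
    "INSERT INTO shows VALUES (?)",
    [
        ("Game of Thrones",),
        ("game of thrones",),
        ("Game of Thrones",),
        ("The Office",),
        ("the office ",),
        ("Friends",),
    ],
)

# TRIM removes leading and trailing whitespace before uppercasing.
top = db.execute(
    "SELECT UPPER(TRIM(title)), COUNT(title) FROM shows "
    "GROUP BY UPPER(TRIM(title)) "
    "ORDER BY COUNT(title) DESC LIMIT 2"
).fetchall()

print(top)
```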

1694
01:22:07,020 --> 01:22:09,110
So in short, it took us
a little while to get

1695
01:22:09,110 --> 01:22:10,880
to this point in the story in SQL.

1696
01:22:10,880 --> 01:22:12,020
But notice what we've done.

1697
01:22:12,020 --> 01:22:14,210
We've taken a program
that took us a few minutes

1698
01:22:14,210 --> 01:22:16,790
and certainly a dozen
or more lines of code.

1699
01:22:16,790 --> 01:22:20,300
And we've distilled it into something
that, yes, is a new language

1700
01:22:20,300 --> 01:22:22,310
but is just kind of a one liner.

1701
01:22:22,310 --> 01:22:24,888
And once you get comfortable
with a language like SQL,

1702
01:22:24,888 --> 01:22:27,680
especially if you're not even a
computer scientist but maybe a data

1703
01:22:27,680 --> 01:22:31,130
scientist or an analyst of some sort
who spends a lot of their day looking

1704
01:22:31,130 --> 01:22:33,470
at financial information
or medical information

1705
01:22:33,470 --> 01:22:37,070
or really any data set that can
be loaded into rows and columns,

1706
01:22:37,070 --> 01:22:41,150
once you start to speak
and read SQL as a human, you can

1707
01:22:41,150 --> 01:22:44,990
start to express some pretty
powerful queries relatively succinctly

1708
01:22:44,990 --> 01:22:47,390
and, boom, get back your answer.

1709
01:22:47,390 --> 01:22:50,000
And by using a command
line program, like sqlite3,

1710
01:22:50,000 --> 01:22:53,540
you can immediately see the results
there, albeit as very simplistic text.

1711
01:22:53,540 --> 01:22:56,690
But as mentioned, too, there's
also some graphical programs

1712
01:22:56,690 --> 01:23:00,117
out there, free and commercial, that
also support SQL, where you can still

1713
01:23:00,117 --> 01:23:00,950
type these commands.

1714
01:23:00,950 --> 01:23:03,770
And then it will show it to you
in a more user friendly way, much

1715
01:23:03,770 --> 01:23:07,790
like Windows or
macOS would by default.

1716
01:23:07,790 --> 01:23:16,058
So any questions now on the syntax
or capabilities of SELECT statements?

1717
01:23:16,058 --> 01:23:17,350
BRIAN YU: One question came in.

1718
01:23:17,350 --> 01:23:20,450
Where is the file with this
data actually being stored?

1719
01:23:20,450 --> 01:23:21,700
DAVID J. MALAN: Good question.

1720
01:23:21,700 --> 01:23:24,030
Where is the file actually being stored?

1721
01:23:24,030 --> 01:23:27,460
So before quitting, I can actually
save this file as anything I want.

1722
01:23:27,460 --> 01:23:30,043
The file extension
would typically be .db.

1723
01:23:30,043 --> 01:23:31,960
And in fact, Brian, do
you mind just checking?

1724
01:23:31,960 --> 01:23:34,930
What's the syntax for writing the
file manually with dot something?

1725
01:23:34,930 --> 01:23:36,910
It would be under .help, I think.

1726
01:23:36,910 --> 01:23:39,550
BRIAN YU: I think it's .save
followed by the name of the file.

1727
01:23:39,550 --> 01:23:43,240
DAVID J. MALAN: .save, so I'll
call this shows.db, Enter.

1728
01:23:43,240 --> 01:23:46,600
If I now go ahead and open up
another terminal window and type

1729
01:23:46,600 --> 01:23:49,990
our old friend ls, you'll see
that now I have a CSV file.

1730
01:23:49,990 --> 01:23:51,760
I have my Python file from before.

1731
01:23:51,760 --> 01:23:54,790
And I have a new file called
shows.db, which I've created.

1732
01:23:54,790 --> 01:24:00,910
That is the binary file that contains
the table that I've loaded dynamically

1733
01:24:00,910 --> 01:24:04,700
in from that CSV file.
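
The .save command belongs to the sqlite3 command-line program. As a rough analogue from Python, and not what was typed in the lecture, the sketch below copies an in-memory database to a shows.db file with Connection.backup. The temporary directory and the two rows are assumptions for illustration.

```python
import os
import sqlite3
import tempfile

# Hypothetical in-memory table standing in for the imported CSV.
memory_db = sqlite3.connect(":memory:")
memory_db.execute("CREATE TABLE shows (title TEXT)")
memory_db.executemany("INSERT INTO shows VALUES (?)", [("Friends",), ("The Office",)])
memory_db.commit()

# Copy everything into a file on disk, here inside a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "shows.db")
file_db = sqlite3.connect(path)
memory_db.backup(file_db)
file_db.close()

# Reopening the file shows the table survived outside the original session.
reopened = sqlite3.connect(path)
saved_count = reopened.execute("SELECT COUNT(*) FROM shows").fetchone()[0]
reopened.close()

print(path, saved_count)
```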

1734
01:24:04,700 --> 01:24:08,810
Any other questions on SELECT
queries or what we can do with them?

1735
01:24:08,810 --> 01:24:12,620
BRIAN YU: Yeah, a few people are asking
about what the runtime of this is.

1736
01:24:12,620 --> 01:24:14,430
DAVID J. MALAN: Yeah,
really good question.

1737
01:24:14,430 --> 01:24:15,170
What is the runtime?

1738
01:24:15,170 --> 01:24:18,253
I'm going to come back to that question
in just a little bit if that's OK.

1739
01:24:18,253 --> 01:24:20,960
Right now, it's admittedly big O of n.

1740
01:24:20,960 --> 01:24:23,390
I've not actually done
anything better than we did

1741
01:24:23,390 --> 01:24:26,090
with our CSV file or our Python code.

1742
01:24:26,090 --> 01:24:28,040
Right now, it's still
big O of n by default.

1743
01:24:28,040 --> 01:24:30,230
But there's going to be
a better answer to that

1744
01:24:30,230 --> 01:24:33,030
that's going to make it
something much more logarithmic.

1745
01:24:33,030 --> 01:24:36,687
So let me come back to that feature
when it's time to enable it.

1746
01:24:36,687 --> 01:24:39,020
But in fact, let's start to
take some steps toward that.

1747
01:24:39,020 --> 01:24:40,812
Because it turns out,
when loading in data,

1748
01:24:40,812 --> 01:24:42,853
we're not always going to
have the luxury of just

1749
01:24:42,853 --> 01:24:44,900
having one big file in
CSV format that we import,

1750
01:24:44,900 --> 01:24:46,070
and we go about our business.

1751
01:24:46,070 --> 01:24:47,780
We're going to have to
decide in advance how

1752
01:24:47,780 --> 01:24:50,210
we want to store the data and
what data we want to store

1753
01:24:50,210 --> 01:24:53,120
and what the relationships
might be across not one

1754
01:24:53,120 --> 01:24:55,278
single table, but multiple tables.

1755
01:24:55,278 --> 01:24:57,320
So let me go ahead and
run one other command here

1756
01:24:57,320 --> 01:25:00,170
that actually introduces
the first of a problem.

1757
01:25:00,170 --> 01:25:03,830
Let me go ahead and
SELECT title FROM shows

1758
01:25:03,830 --> 01:25:07,160
WHERE genres equals,
for instance, "Comedy."

1759
01:25:07,160 --> 01:25:08,570
That was one of the genres.

1760
01:25:08,570 --> 01:25:11,690
And notice that we get back
a whole bunch of results.

1761
01:25:11,690 --> 01:25:14,300
But I bet I'm missing some.

1762
01:25:14,300 --> 01:25:16,470
I'm skimming through
this pretty quickly.

1763
01:25:16,470 --> 01:25:19,880
But I bet I'm missing some,
because if I check if genres

1764
01:25:19,880 --> 01:25:21,872
= "Comedy," what am I omitting?

1765
01:25:21,872 --> 01:25:24,830
Well, those of you who checked multiple
boxes might have said something

1766
01:25:24,830 --> 01:25:28,310
is a comedy and a drama
or comedy and romance

1767
01:25:28,310 --> 01:25:30,800
or maybe a couple of other
permutations of genres.

1768
01:25:30,800 --> 01:25:34,070
If I'm searching for
equality here, = "Comedy,"

1769
01:25:34,070 --> 01:25:37,880
I'm only going to get those favorites
from you where you only said,

1770
01:25:37,880 --> 01:25:40,250
my favorite TV show is a comedy.

1771
01:25:40,250 --> 01:25:48,113
But what if we want to do
something like LIKE comedy instead?

1772
01:25:48,113 --> 01:25:50,030
And we could say something
like, well, so long

1773
01:25:50,030 --> 01:25:54,290
as the word "comedy" is in there, then
we should get back even more results.

1774
01:25:54,290 --> 01:25:57,480
And let me stipulate that, indeed,
I now have a longer list of results.

1775
01:25:57,480 --> 01:26:01,010
Now we have all shows where you
checked at least the Comedy box.

1776
01:26:01,010 --> 01:26:03,770
But unfortunately, this
starts to get a little sloppy,

1777
01:26:03,770 --> 01:26:06,410
because recall what the
Genres column looks like.

1778
01:26:06,410 --> 01:26:07,730
SELECT.

1779
01:26:07,730 --> 01:26:11,150
Let me SELECT genres FROM shows;.

1780
01:26:11,150 --> 01:26:16,010
Notice that all of the genres that we
loaded into this table from the CSV

1781
01:26:16,010 --> 01:26:20,030
file are a comma-separated
list of genres.

1782
01:26:20,030 --> 01:26:22,070
That's just the way Google Forms did it.

1783
01:26:22,070 --> 01:26:24,320
And that's fine for CSV purposes.

1784
01:26:24,320 --> 01:26:28,310
That's kind of fine for SQL
purposes, but this is kind of messy.

1785
01:26:28,310 --> 01:26:31,700
Generally speaking, storing
comma-separated lists

1786
01:26:31,700 --> 01:26:35,840
of values in a SQL database is
not what you should be doing.

1787
01:26:35,840 --> 01:26:41,030
The whole point of using a SQL database
is to move away from commas and CSVs

1788
01:26:41,030 --> 01:26:42,860
and to actually store
things more cleanly.

1789
01:26:42,860 --> 01:26:45,920
Because in fact, let
me propose a problem.

1790
01:26:45,920 --> 01:26:50,540
Suppose I want to search not
for comedy but maybe also

1791
01:26:50,540 --> 01:26:55,520
music, like this, thereby allowing
me to find any shows where

1792
01:26:55,520 --> 01:26:59,990
the word "music" is somewhere
in the comma-separated list.

1793
01:26:59,990 --> 01:27:01,940
There's a subtle bug here.

1794
01:27:01,940 --> 01:27:05,690
And you might have to think
back to where we began, the form

1795
01:27:05,690 --> 01:27:07,910
that you pulled up.

1796
01:27:07,910 --> 01:27:09,860
I can't show the whole
thing here, but we

1797
01:27:09,860 --> 01:27:14,060
started with action, adventure,
animation, biography, dot, dot, dot,

1798
01:27:14,060 --> 01:27:15,620
music.

1799
01:27:15,620 --> 01:27:18,500
Musical was also there, so distinct.

1800
01:27:18,500 --> 01:27:22,700
A music video versus a musical
are two different types of genres.

1801
01:27:22,700 --> 01:27:25,250
But notice my query at the moment.

1802
01:27:25,250 --> 01:27:26,930
What's problematic with this?

1803
01:27:26,930 --> 01:27:31,070
At the moment, we would seem to have
a bug whereby this query will select

1804
01:27:31,070 --> 01:27:34,370
not only "music," but also "musical."

1805
01:27:34,370 --> 01:27:36,620
And so this is just where
things are getting messy.

1806
01:27:36,620 --> 01:27:37,400
Now, yeah, you know what?

1807
01:27:37,400 --> 01:27:38,810
We could kind of clean this up.

1808
01:27:38,810 --> 01:27:43,790
Maybe we could put a comma here so
that it can't just be music something.

1809
01:27:43,790 --> 01:27:45,410
It has to be music comma.

1810
01:27:45,410 --> 01:27:47,840
But what if music is the
last box that you checked?

1811
01:27:47,840 --> 01:27:49,310
Well, then it's music nothing.

1812
01:27:49,310 --> 01:27:50,210
There is no comma.

1813
01:27:50,210 --> 01:27:52,262
So now I need to OR things together.

1814
01:27:52,262 --> 01:27:54,470
So maybe I have to do
something like WHERE genres LIKE "%Music,%"

1815
01:27:54,470 --> 01:28:00,800
like this OR genres
LIKE "%Music" like this.

1816
01:28:00,800 --> 01:28:02,750
But honestly, this is
just getting messy.

1817
01:28:02,750 --> 01:28:04,040
This is poorly designed.

1818
01:28:04,040 --> 01:28:07,220
If you're just storing your data as a
comma-separated list of values inside

1819
01:28:07,220 --> 01:28:11,010
of a column and you have to resort
to this kind of hack to figure out,

1820
01:28:11,010 --> 01:28:13,130
well, maybe it's over
here or here or here,

1821
01:28:13,130 --> 01:28:16,640
and thinking about all the permutations
of syntax, you're doing it wrong.

1822
01:28:16,640 --> 01:28:20,130
You're not using a SQL database
to its fullest potential.
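
A sketch of the Music versus Musical bug and the comma workaround, using placeholder show names and made-up genre lists. The workaround gives the right answer on this data, but it relies on the exact comma-and-space layout of the column, which is the fragility being criticized.

```python
import sqlite3

# Placeholder shows with comma-separated genres, as Google Forms exported them.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE shows (title TEXT, genres TEXT)")
db.executemany(
    "INSERT INTO shows VALUES (?, ?)",
    [
        ("Show A", "Comedy, Musical"),
        ("Show B", "Documentary, Music"),
        ("Show C", "Music, Reality-TV"),
        ("Show D", "Music"),
    ],
)

# Naive pattern: also matches "Musical".
naive = {t for (t,) in db.execute("SELECT title FROM shows WHERE genres LIKE '%Music%'")}

# Workaround: "Music" followed by a comma, or "Music" at the very end.
patched = {
    t
    for (t,) in db.execute(
        "SELECT title FROM shows "
        "WHERE genres LIKE '%Music,%' OR genres LIKE '%Music'"
    )
}

print(sorted(naive))
print(sorted(patched))
```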

1823
01:28:20,130 --> 01:28:22,490
So how do we go about
designing this thing better

1824
01:28:22,490 --> 01:28:26,690
and actually load this CSV into
a database a little more cleanly?

1825
01:28:26,690 --> 01:28:31,820
In short, how do we get rid of the
stupid commas in the Genres column

1826
01:28:31,820 --> 01:28:36,740
and instead put one word,
"comedy" or "music" or "musical,"

1827
01:28:36,740 --> 01:28:38,930
in each of those cells, so to speak?

1828
01:28:38,930 --> 01:28:40,250
Not two, not three--

1829
01:28:40,250 --> 01:28:43,820
one only without throwing
away some of those genres.

1830
01:28:43,820 --> 01:28:46,730
Well, let me introduce a few building
blocks that will get us there.

1831
01:28:46,730 --> 01:28:48,680
It turns out, when
creating your own tables

1832
01:28:48,680 --> 01:28:51,260
and loading data into
a database on your own,

1833
01:28:51,260 --> 01:28:53,375
we're going to need
more than just SELECT.

1834
01:28:53,375 --> 01:28:55,220
SELECT, of course, is just for reading.

1835
01:28:55,220 --> 01:28:59,330
But if we're going to do this better
and not just use sqlite3's built-in

1836
01:28:59,330 --> 01:29:04,880
.import command, but instead we're
going to write some code to load all

1837
01:29:04,880 --> 01:29:07,580
of our data into maybe two tables--

1838
01:29:07,580 --> 01:29:10,100
one for the titles, one for the genres--

1839
01:29:10,100 --> 01:29:15,680
we're going to need a little more
expressiveness when it comes to SQL.

1840
01:29:15,680 --> 01:29:17,990
And so for that, we're going
to need, one, the ability

1841
01:29:17,990 --> 01:29:19,113
to create our own tables.

1842
01:29:19,113 --> 01:29:20,780
And we've seen a glimpse of this before.

1843
01:29:20,780 --> 01:29:23,280
But we're also going to need
to see another piece of syntax,

1844
01:29:23,280 --> 01:29:24,500
as well, so inserting.

1845
01:29:24,500 --> 01:29:29,060
Inserting is another command that
you can execute on a SQL database

1846
01:29:29,060 --> 01:29:32,720
in order to actually add data
to a database, which is great.

1847
01:29:32,720 --> 01:29:38,630
Because if I want to ultimately iterate
over that same CSV but, this time,

1848
01:29:38,630 --> 01:29:43,075
manually add all of the
rows to the database myself,

1849
01:29:43,075 --> 01:29:45,200
well, then I'm going to
need some way of inserting.

1850
01:29:45,200 --> 01:29:46,850
And the syntax for that is as follows.

1851
01:29:46,850 --> 01:29:50,720
INSERT INTO the name of the
table, the column or columns

1852
01:29:50,720 --> 01:29:54,890
that you want to insert values into,
then literally the word VALUES,

1853
01:29:54,890 --> 01:29:58,787
and then literally in parentheses
again, the actual list of values.

1854
01:29:58,787 --> 01:30:01,370
So it's a little abstract when
we see it in this generic form.

1855
01:30:01,370 --> 01:30:06,480
But we'll see this more explicitly
in just a moment here, as well.

1856
01:30:06,480 --> 01:30:09,483
So when it comes to inserting
something into a database,

1857
01:30:09,483 --> 01:30:10,650
let's go ahead and try this.

1858
01:30:10,650 --> 01:30:13,100
So suppose that-- let's see.

1859
01:30:13,100 --> 01:30:15,080
What's a show that--

1860
01:30:15,080 --> 01:30:15,985
The Muppet Show.

1861
01:30:15,985 --> 01:30:17,360
I grew up loving The Muppet Show.

1862
01:30:17,360 --> 01:30:18,650
It was out in, like, the '70s.

1863
01:30:18,650 --> 01:30:21,680
And I don't think it was on the
list, but I can check this for sure.

1864
01:30:21,680 --> 01:30:28,100
So SELECT * FROM shows
WHERE title LIKE--

1865
01:30:28,100 --> 01:30:30,950
let's just search for
"muppets" with a wild card.

1866
01:30:30,950 --> 01:30:32,500
And I'm guessing no one put it there.

1867
01:30:32,500 --> 01:30:33,000
Good.

1868
01:30:33,000 --> 01:30:34,320
So it's a missed opportunity.

1869
01:30:34,320 --> 01:30:35,570
I forgot to fill out the form.

1870
01:30:35,570 --> 01:30:37,820
I could go back and fill out
the form and re-import the CSV,

1871
01:30:37,820 --> 01:30:39,487
but let's go ahead and do this manually.

1872
01:30:39,487 --> 01:30:44,420
So let me go ahead and INSERT
INTO shows what columns?

1873
01:30:44,420 --> 01:30:50,360
title and genres, and I guess I
could do a Timestamp just for kicks.

1874
01:30:50,360 --> 01:30:52,220
And then I'm going to
insert what values?

1875
01:30:52,220 --> 01:30:55,430
The values will be, well, I don't
know, whatever time it is now.

1876
01:30:55,430 --> 01:30:58,460
So I'm going to cheat there rather
than look up the date and the time.

1877
01:30:58,460 --> 01:31:01,430
The title will be "The Muppet Show."

1878
01:31:01,430 --> 01:31:05,100
And the genres will be--
it was kind of a comedy.

1879
01:31:05,100 --> 01:31:06,290
It was kind of a musical.

1880
01:31:06,290 --> 01:31:08,360
So we'll kind of leave it at that.

1881
01:31:08,360 --> 01:31:09,350
Semicolon.

1882
01:31:09,350 --> 01:31:11,870
So again, this follows
the standard syntax here

1883
01:31:11,870 --> 01:31:14,030
of specifying the table
you want to insert into,

1884
01:31:14,030 --> 01:31:16,910
the columns you want to
insert into, and the values

1885
01:31:16,910 --> 01:31:18,467
you want to put into those columns.

1886
01:31:18,467 --> 01:31:20,300
And I'm going to go
ahead and hit Enter now.

1887
01:31:20,300 --> 01:31:22,250
Nothing seems to have happened.

1888
01:31:22,250 --> 01:31:28,070
But if I now select that same query--

1889
01:31:28,070 --> 01:31:32,630
oh, OK, it's still nothing,
because I made a subtle mistake.

1890
01:31:32,630 --> 01:31:34,700
I'm not searching for "Muppets," plural.

1891
01:31:34,700 --> 01:31:37,250
I'm searching for "Muppet,"
singular, The Muppet Show.

1892
01:31:37,250 --> 01:31:38,000
Voila.

1893
01:31:38,000 --> 01:31:40,790
Now you see my row in this database.

1894
01:31:40,790 --> 01:31:42,680
And so INSERT would
give us the ability now

1895
01:31:42,680 --> 01:31:44,570
to insert new rows into the database.

1896
01:31:44,570 --> 01:31:48,410
Suppose you want to update something.

1897
01:31:48,410 --> 01:31:51,540
You know, some of the Muppet Shows
were actually pretty dramatic.

1898
01:31:51,540 --> 01:31:52,710
So how might we do that?

1899
01:31:52,710 --> 01:31:56,960
Well, I can say UPDATE shows SET--

1900
01:31:56,960 --> 01:32:04,250
let's see-- genres = "Comedy,
Drama, Musical" WHERE

1901
01:32:04,250 --> 01:32:07,910
title = "The Muppet Show."

1902
01:32:07,910 --> 01:32:10,890
So again, I'll pull up the
canonical syntax for this in a bit.

1903
01:32:10,890 --> 01:32:14,120
But for now, just a little teaser,
you can update things pretty simply.

1904
01:32:14,120 --> 01:32:16,662
And even though it takes a little
getting used to the syntax,

1905
01:32:16,662 --> 01:32:17,960
it kind of does what it says.

1906
01:32:17,960 --> 01:32:23,250
UPDATE shows SET genres =
this WHERE title = that.

1907
01:32:23,250 --> 01:32:24,650
And now I can go ahead and Enter.

1908
01:32:24,650 --> 01:32:27,290
If I go ahead and select the same
thing, just like in a terminal window,

1909
01:32:27,290 --> 01:32:28,250
you can go up and down.

1910
01:32:28,250 --> 01:32:29,600
That's how I'm typing so quickly.

1911
01:32:29,600 --> 01:32:31,600
I'm just going up and
down to previous commands.

1912
01:32:31,600 --> 01:32:32,120
Voila.

1913
01:32:32,120 --> 01:32:36,830
Now I see that The Muppet Show is
a comedy, a drama, and a musical.

1914
01:32:36,830 --> 01:32:40,070
Well, I take issue, though, with
one of the more popular shows that

1915
01:32:40,070 --> 01:32:40,970
was in the list.

1916
01:32:40,970 --> 01:32:44,637
A whole bunch of you
liked, let's say, Friends,

1917
01:32:44,637 --> 01:32:46,220
which I've never really been a fan of.

1918
01:32:46,220 --> 01:32:53,828
And let me go ahead and SELECT title
FROM shows WHERE title = "Friends."

1919
01:32:53,828 --> 01:32:56,120
And maybe I should be a little
more rigorous than that.

1920
01:32:56,120 --> 01:32:59,150
I should say title LIKE
"Friends" just in case

1921
01:32:59,150 --> 01:33:00,650
there were different capitalizations.

1922
01:33:00,650 --> 01:33:01,460
Enter.

1923
01:33:01,460 --> 01:33:03,148
A lot of you really liked Friends.

1924
01:33:03,148 --> 01:33:04,190
In fact, how many of you?

1925
01:33:04,190 --> 01:33:05,610
Well, recall that I can do this.

1926
01:33:05,610 --> 01:33:08,900
I can say COUNT, and I can
let SQL do the count for me.

1927
01:33:08,900 --> 01:33:10,575
26 of you, I disagree with strongly.

1928
01:33:10,575 --> 01:33:12,950
And there's a couple of you
that even added all the dots,

1929
01:33:12,950 --> 01:33:14,240
but we'll deal with you later.

1930
01:33:14,240 --> 01:33:16,100
So suppose I do take issue with this.

1931
01:33:16,100 --> 01:33:22,970
Well, DELETE FROM shows
WHERE title = "Friends"--

1932
01:33:22,970 --> 01:33:24,390
actually, title LIKE "Friends."

1933
01:33:24,390 --> 01:33:25,220
Let's get them all.

1934
01:33:25,220 --> 01:33:26,090
Enter.

1935
01:33:26,090 --> 01:33:29,450
And now if we SELECT
this again, I'm sorry.

1936
01:33:29,450 --> 01:33:30,870
Friends has been canceled.

1937
01:33:30,870 --> 01:33:34,910
So you can again execute these
fundamental commands of CRUD,

1938
01:33:34,910 --> 01:33:38,630
Create, Read, Update, and Delete,
by using CREATE or INSERT,

1939
01:33:38,630 --> 01:33:41,540
by using SELECT, by
using UPDATE literally

1940
01:33:41,540 --> 01:33:43,380
and DELETE literally, as well.

1941
01:33:43,380 --> 01:33:44,580
And that's about it.

1942
01:33:44,580 --> 01:33:46,580
Even though this was a
lot quickly, there really

1943
01:33:46,580 --> 01:33:49,040
are just those four
fundamental operations in SQL

1944
01:33:49,040 --> 01:33:53,090
plus some of these add-on features, like
these additional functions like COUNT

1945
01:33:53,090 --> 01:33:57,420
that you can use and also some of
these keywords like WHERE and the like.
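
The four operations just demonstrated can be sketched in one short script. This is only an illustration: it uses Python's built-in sqlite3 module with an in-memory database rather than the sqlite3 prompt from the lecture, and the "now" timestamp and the two Friends rows are made-up stand-ins for the real form data.

```python
import sqlite3

# In-memory database so the sketch leaves no file behind.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE shows (Timestamp TEXT, title TEXT, genres TEXT)")

# Create: name the table, the columns, and then the values.
db.execute(
    "INSERT INTO shows (Timestamp, title, genres) VALUES (?, ?, ?)",
    ("now", "The Muppet Show", "Comedy, Musical"),
)

# Read: LIKE with % wildcards matches any title containing "muppet".
rows = db.execute(
    "SELECT title, genres FROM shows WHERE title LIKE '%muppet%'"
).fetchall()

# Update: change one column in the rows that match the WHERE clause.
db.execute(
    "UPDATE shows SET genres = ? WHERE title = ?",
    ("Comedy, Drama, Musical", "The Muppet Show"),
)
updated = db.execute(
    "SELECT genres FROM shows WHERE title = 'The Muppet Show'"
).fetchone()[0]

# Delete: LIKE without wildcards still ignores ASCII capitalization.
db.execute("INSERT INTO shows (title) VALUES ('Friends'), ('friends')")
db.execute("DELETE FROM shows WHERE title LIKE 'Friends'")
remaining = db.execute(
    "SELECT COUNT(*) FROM shows WHERE title LIKE 'Friends'"
).fetchone()[0]
total = db.execute("SELECT COUNT(*) FROM shows").fetchone()[0]
```

Searching for "muppets", plural, in the Read step would return no rows, which is the subtle mistake made above.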

1946
01:33:57,420 --> 01:33:59,810
Well, let me propose
that we now do better.

1947
01:33:59,810 --> 01:34:04,580
If we have the ability to select data
and create tables and insert data,

1948
01:34:04,580 --> 01:34:11,270
let's go ahead and write our own Python
script that uses SQL, as in a loop,

1949
01:34:11,270 --> 01:34:16,130
to read over my CSV file and to insert,
insert, insert, insert each of the rows

1950
01:34:16,130 --> 01:34:16,700
manually.

1951
01:34:16,700 --> 01:34:18,408
Because honestly, it
will take me forever

1952
01:34:18,408 --> 01:34:22,220
to manually type out hundreds of SQL
queries to import all of your rows

1953
01:34:22,220 --> 01:34:23,390
into a new database.

1954
01:34:23,390 --> 01:34:25,520
I want to write a program
that does this instead.

1955
01:34:25,520 --> 01:34:29,430
And I'm going to propose that we
design it in the following way.

1956
01:34:29,430 --> 01:34:32,720
I'm going to have two tables
this time, represented here

1957
01:34:32,720 --> 01:34:34,190
with this artist's rendition.

1958
01:34:34,190 --> 01:34:36,020
One is going to be called shows.

1959
01:34:36,020 --> 01:34:38,060
One is going to be called genres.

1960
01:34:38,060 --> 01:34:44,270
And this is a fundamental principle
of designing relational databases,

1961
01:34:44,270 --> 01:34:49,700
to figure out the relationships among
data and to normalize your data.

1962
01:34:49,700 --> 01:34:53,480
To normalize your data means
to eliminate redundancies.

1963
01:34:53,480 --> 01:34:58,520
To normalize your data means to
eliminate mentions of the same words

1964
01:34:58,520 --> 01:35:02,320
again and again and have just single
sources of truth for your data,

1965
01:35:02,320 --> 01:35:02,820
so to speak.

1966
01:35:02,820 --> 01:35:04,140
So what do I mean by that?

1967
01:35:04,140 --> 01:35:07,520
I'm going to propose that we instead
create a simpler table called

1968
01:35:07,520 --> 01:35:10,320
shows that has just two columns.

1969
01:35:10,320 --> 01:35:13,098
One is going to be
called id, which is new.

1970
01:35:13,098 --> 01:35:15,140
The other is going to be
called title, as before.

1971
01:35:15,140 --> 01:35:16,940
Honestly, I don't care about
timestamps, so we're just

1972
01:35:16,940 --> 01:35:19,730
going to throw that value away,
which is another upside of writing

1973
01:35:19,730 --> 01:35:20,420
our own program.

1974
01:35:20,420 --> 01:35:23,030
We can add or remove any data we want.

1975
01:35:23,030 --> 01:35:25,850
For id, I'm introducing
this, which is going

1976
01:35:25,850 --> 01:35:28,490
to be a unique identifier,
literally a simple integer--

1977
01:35:28,490 --> 01:35:31,190
1, 2, 3, all the way up
to a billion or 2 billion,

1978
01:35:31,190 --> 01:35:33,080
however many favorites we have.

1979
01:35:33,080 --> 01:35:35,690
I'm just going to let this
auto increment as we go.

1980
01:35:35,690 --> 01:35:36,710
Why?

1981
01:35:36,710 --> 01:35:42,530
I propose that we move to another
table all of the genres and that,

1982
01:35:42,530 --> 01:35:48,350
instead of having one or two or
three or five genres in one column

1983
01:35:48,350 --> 01:35:51,860
as a stupid comma-separated list--
which is stupid only in the sense

1984
01:35:51,860 --> 01:35:53,180
that it's just messy, right?

1985
01:35:53,180 --> 01:35:55,040
It means that I have
to run stupid commands

1986
01:35:55,040 --> 01:35:57,332
where I'm checking for the
comma here, the comma there.

1987
01:35:57,332 --> 01:35:58,850
It's very hackish, so to speak.

1988
01:35:58,850 --> 01:36:00,080
Bad design.

1989
01:36:00,080 --> 01:36:03,770
Instead of doing that, I'm going
to create another table that

1990
01:36:03,770 --> 01:36:05,580
also has two columns.

1991
01:36:05,580 --> 01:36:09,320
One is going to be called show_id, and
the other is going to be called genre.

1992
01:36:09,320 --> 01:36:12,830
And genre here is just going
to be a single word now.

1993
01:36:12,830 --> 01:36:16,340
That column will contain
single words for genres,

1994
01:36:16,340 --> 01:36:19,400
like "comedy" or "music" or "musical."

1995
01:36:19,400 --> 01:36:23,570
But we're going to associate
all of those genres

1996
01:36:23,570 --> 01:36:27,470
with the original show to which
they belong, per your Google form

1997
01:36:27,470 --> 01:36:31,500
submissions, by using this show_id here.

1998
01:36:31,500 --> 01:36:33,290
So what does this mean in particular?

1999
01:36:33,290 --> 01:36:37,370
By adding to our first table,
shows, this unique identifier--

2000
01:36:37,370 --> 01:36:39,080
1, 2, 3, 4, 5, 6--

2001
01:36:39,080 --> 01:36:44,630
I can now refer to that same show
in a very efficient way using

2002
01:36:44,630 --> 01:36:46,940
a very simple number
instead of redundantly

2003
01:36:46,940 --> 01:36:49,730
having The Office, The Office,
The Office again and again.

2004
01:36:49,730 --> 01:36:52,280
I can refer to it by just
one canonical number, which

2005
01:36:52,280 --> 01:36:54,980
is only going to be 4 bytes or 32 bits.

2006
01:36:54,980 --> 01:36:56,330
Pretty efficient.

2007
01:36:56,330 --> 01:37:00,920
But I can still associate that
show with one genre or two or three

2008
01:37:00,920 --> 01:37:03,210
or more or even none.

2009
01:37:03,210 --> 01:37:07,610
So in this way, every
row in our current table

2010
01:37:07,610 --> 01:37:12,860
is going to become one or more
rows in our new pair of tables.

2011
01:37:12,860 --> 01:37:15,560
We're factoring out
the genres so that we

2012
01:37:15,560 --> 01:37:20,270
can add multiple rows for every
show, potentially, but still

2013
01:37:20,270 --> 01:37:25,050
remap those genres back to
the original show itself.

2014
01:37:25,050 --> 01:37:27,890
So what are some of the buzzwords here?

2015
01:37:27,890 --> 01:37:31,070
What's some of the language
to be familiar with?

2016
01:37:31,070 --> 01:37:35,090
Well, we need to know what kinds
of types are at our disposal here.

2017
01:37:35,090 --> 01:37:37,250
So for that, let me propose this.

2018
01:37:37,250 --> 01:37:41,300
Let me propose that we
have this list here.

2019
01:37:41,300 --> 01:37:44,590
It turns out, in SQLite, there
are five main data types.

2020
01:37:44,590 --> 01:37:46,340
And that's a bit of
an oversimplification,

2021
01:37:46,340 --> 01:37:49,430
but there's five main data types,
some of which look familiar,

2022
01:37:49,430 --> 01:37:51,410
a couple of which are a little weird.

2023
01:37:51,410 --> 01:37:53,810
INTEGER is a thing.

2024
01:37:53,810 --> 01:37:55,910
REAL is the same thing as float.

2025
01:37:55,910 --> 01:38:00,080
So an integer might be a 32-bit or
4-byte value, like 1, 2, 3, or 4,

2026
01:38:00,080 --> 01:38:01,130
positive or negative.

2027
01:38:01,130 --> 01:38:03,213
Real number's going to
have a decimal point in it,

2028
01:38:03,213 --> 01:38:05,570
a floating point value,
probably 32 bits by default.

2029
01:38:05,570 --> 01:38:08,240
But those kinds of things,
the sizes of these types,

2030
01:38:08,240 --> 01:38:10,430
vary by system, just
like they technically

2031
01:38:10,430 --> 01:38:13,760
did in C. So do they vary by
system in the world of SQL.

2032
01:38:13,760 --> 01:38:16,010
But generally speaking, these
are good rules of thumb.

2033
01:38:16,010 --> 01:38:16,970
TEXT is just that.

2034
01:38:16,970 --> 01:38:19,820
It's sort of the equivalent
of a string of some length.

2035
01:38:19,820 --> 01:38:22,362
But then in SQLite, it turns
out there's two other data

2036
01:38:22,362 --> 01:38:23,570
types we've not seen before--

2037
01:38:23,570 --> 01:38:25,010
NUMERIC and BLOB.

2038
01:38:25,010 --> 01:38:26,750
But more on those in just a little bit.

2039
01:38:26,750 --> 01:38:28,370
BLOB is Binary Large Object.

2040
01:38:28,370 --> 01:38:30,860
It means you can store 0's
and 1's in your database.

2041
01:38:30,860 --> 01:38:34,670
NUMERIC is going to be something that's
number-like but isn't a number per se.

2042
01:38:34,670 --> 01:38:38,360
It's like a year or a time,
something that has numbers, but isn't

2043
01:38:38,360 --> 01:38:40,730
just a simple integer at that.
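
One way to poke at these types is SQLite's typeof function, which reports how a value was actually stored. The sketch below is an assumption-laden illustration using Python's built-in sqlite3 module; the samples table and its values are hypothetical.

```python
import sqlite3

db = sqlite3.connect(":memory:")

# typeof() reports how SQLite stored each literal value.
storage = db.execute(
    "SELECT typeof(1), typeof(1.5), typeof('comedy'), typeof(x'00ff')"
).fetchone()

# A NUMERIC column stores number-like text as a number when it can
# and keeps it as text when it cannot.
db.execute("CREATE TABLE samples (value NUMERIC)")
db.execute("INSERT INTO samples (value) VALUES ('1999'), ('12:30')")
numeric = [
    t for (t,) in db.execute("SELECT typeof(value) FROM samples ORDER BY rowid")
]
```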

2044
01:38:40,730 --> 01:38:44,210
And let me propose, too, that SQLite
is going to allow us to specify, too,

2045
01:38:44,210 --> 01:38:49,520
when we create our own columns manually
by executing the SQL code ourselves,

2046
01:38:49,520 --> 01:38:52,430
we can specify that a
column cannot be null.

2047
01:38:52,430 --> 01:38:53,840
Thus far, we've ignored this.

2048
01:38:53,840 --> 01:38:56,090
But some of you might
have taken the fifth

2049
01:38:56,090 --> 01:38:58,850
and just not given us the
title of a show or a genre.

2050
01:38:58,850 --> 01:39:01,020
Your answers might be blank.

2051
01:39:01,020 --> 01:39:03,020
Some of you, maybe in
registering for a website,

2052
01:39:03,020 --> 01:39:06,170
don't want to provide information like
where you live or your phone number.

2053
01:39:06,170 --> 01:39:10,190
So a database in general sometimes
does want to support null values.

2054
01:39:10,190 --> 01:39:12,290
But you might want to say
that it can't be null.

2055
01:39:12,290 --> 01:39:14,390
A website probably needs
your email address,

2056
01:39:14,390 --> 01:39:18,570
needs your password and a few
other fields, but not everything.

2057
01:39:18,570 --> 01:39:22,250
And there's another keyword in SQL, just
so you've seen it, called UNIQUE, where

2058
01:39:22,250 --> 01:39:25,460
you can additionally say that
whatever values are in this column

2059
01:39:25,460 --> 01:39:26,520
must be unique.

2060
01:39:26,520 --> 01:39:28,670
So a website might also use that.

2061
01:39:28,670 --> 01:39:31,910
If you want to make sure that the
same email address can't register

2062
01:39:31,910 --> 01:39:33,830
for your website
multiple times, you just

2063
01:39:33,830 --> 01:39:36,020
specify that the email column is unique.

2064
01:39:36,020 --> 01:39:40,370
That way, you can't put multiple people
in with identical email addresses.
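
As a rough sketch of how those two constraints behave, here is a hypothetical users table (not part of the lecture's database) built with Python's sqlite3 module. The try_insert helper is made up for illustration and simply reports what the database says.

```python
import sqlite3

db = sqlite3.connect(":memory:")

# email is required and must not repeat; phone may be left blank (NULL).
db.execute("CREATE TABLE users (email TEXT NOT NULL UNIQUE, phone TEXT)")
db.execute("INSERT INTO users (email) VALUES (?)", ("alice@example.com",))


def try_insert(email):
    """Attempt an insert and return 'ok' or the database's complaint."""
    try:
        db.execute("INSERT INTO users (email) VALUES (?)", (email,))
        return "ok"
    except sqlite3.IntegrityError as error:
        return str(error)


duplicate = try_insert("alice@example.com")  # same address a second time
missing = try_insert(None)                   # None becomes NULL in SQL
fresh = try_insert("bob@example.com")        # new address, no phone given
```

The database itself rejects the bad rows, so the program does not need its own checks for duplicates or blanks.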

2065
01:39:40,370 --> 01:39:44,060
So long story short, this is just
more of the tools in our SQL toolkit,

2066
01:39:44,060 --> 01:39:46,280
because we'll see some
of these now indirectly.

2067
01:39:46,280 --> 01:39:49,670
And the last piece of jargon we
need before designing our own tables

2068
01:39:49,670 --> 01:39:51,150
is going to be this.

2069
01:39:51,150 --> 01:39:54,110
It turns out that, in
SQL, there's this notion

2070
01:39:54,110 --> 01:39:56,270
of primary keys and foreign keys.

2071
01:39:56,270 --> 01:39:59,390
And we've not seen this in spreadsheets.

2072
01:39:59,390 --> 01:40:02,150
Unless you've been working in
the real world for some years

2073
01:40:02,150 --> 01:40:04,400
and you have fairly fancy
spreadsheets in front of you

2074
01:40:04,400 --> 01:40:06,380
as an analyst or financial
person or the like,

2075
01:40:06,380 --> 01:40:11,750
odds are you've not seen keys or unique
identifiers in quite the same way.

2076
01:40:11,750 --> 01:40:13,170
But they're relatively simple.

2077
01:40:13,170 --> 01:40:17,390
In fact, let me go back to
our picture before and propose

2078
01:40:17,390 --> 01:40:21,230
that when you have two
tables like this and you

2079
01:40:21,230 --> 01:40:25,790
want to use a simple integer to
uniquely identify all of the rows in one

2080
01:40:25,790 --> 01:40:28,395
of the tables, that's
called technically an ID.

2081
01:40:28,395 --> 01:40:30,020
That's what I'll call it by convention.

2082
01:40:30,020 --> 01:40:33,770
You could call it anything you want, but
ID just means it's a unique identifier.

2083
01:40:33,770 --> 01:40:37,470
But semantically, this ID is
what's called a primary key.

2084
01:40:37,470 --> 01:40:43,940
A primary key is the column in a table
that uniquely identifies every row.

2085
01:40:43,940 --> 01:40:46,820
This means you can have
multiple versions of The Office

2086
01:40:46,820 --> 01:40:48,860
in that title field.

2087
01:40:48,860 --> 01:40:52,490
But each of those rows is going to have
its own number uniquely, potentially.

2088
01:40:52,490 --> 01:40:56,630
So primary key uniquely
identifies each row.

2089
01:40:56,630 --> 01:41:01,550
In another table, like genres, which I'm
proposing we create in just a moment,

2090
01:41:01,550 --> 01:41:06,770
it turns out that you're welcome
to refer back to another table

2091
01:41:06,770 --> 01:41:09,260
by way of that unique identifier.

2092
01:41:09,260 --> 01:41:13,710
But when it's in this context,
that ID is called a foreign key.

2093
01:41:13,710 --> 01:41:16,130
So even though I've
called it show_id here,

2094
01:41:16,130 --> 01:41:18,470
that's just a convention
in a lot of SQL databases

2095
01:41:18,470 --> 01:41:23,030
to imply that this is technically
a column called ID in a table

2096
01:41:23,030 --> 01:41:26,760
called show or shows,
plural in this case.

2097
01:41:26,760 --> 01:41:29,900
So if there's a number
1 here, and suppose

2098
01:41:29,900 --> 01:41:34,190
that The Office has a
unique ID of 1, we would

2099
01:41:34,190 --> 01:41:38,420
have a row in this table called
id is 1, title is The Office.

2100
01:41:38,420 --> 01:41:43,730
The Office might be in the comedy
category, the drama category,

2101
01:41:43,730 --> 01:41:46,400
the romance category, so multiple ones.

2102
01:41:46,400 --> 01:41:51,050
Therefore, in the genres table,
we want to output three rows,

2103
01:41:51,050 --> 01:41:56,150
the number 1, 1, 1 in each of
those rows but the words "comedy,"

2104
01:41:56,150 --> 01:42:00,450
"drama," "romance" in each
of those rows respectively.
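
The Office example can be sketched in plain Python before any SQL is involved. The row below is a made-up stand-in for one line of the CSV, and the id of 1 is assumed as in the example above.

```python
# One row as it might arrive from the CSV, genres in a single column.
row = {"title": "The Office", "genres": "Comedy, Drama, Romance"}
show_id = 1  # the unique id assumed for this show

# One row for the shows table: (id, title).
shows = [(show_id, row["title"])]

# One row per genre for the genres table: (show_id, genre).
genres = [(show_id, genre.strip()) for genre in row["genres"].split(",")]
```

Here genres ends up holding three rows, each carrying the same show_id of 1 alongside a single genre.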

2105
01:42:00,450 --> 01:42:03,620
So again, the goal here is to just
design our database better, not

2106
01:42:03,620 --> 01:42:08,120
have these stupid comma-separated lists
of values inside of a single column.

2107
01:42:08,120 --> 01:42:12,980
We want to kind of blow that up,
explode it, into individual rows.

2108
01:42:12,980 --> 01:42:15,710
You might think, well, why don't
we just use multiple columns?

2109
01:42:15,710 --> 01:42:18,560
But again, per our
principle from spreadsheets,

2110
01:42:18,560 --> 01:42:21,650
you should not be in the habit of
adding more and more columns when

2111
01:42:21,650 --> 01:42:25,190
the data is all the same, like
genre, genre, genre, right?

2112
01:42:25,190 --> 01:42:27,410
The stupid way to do this
in the spreadsheet world

2113
01:42:27,410 --> 01:42:29,660
would be to have one
column called Genre 1,

2114
01:42:29,660 --> 01:42:34,100
another column called Genre 2, another
column called Genre 3, Genre 4.

2115
01:42:34,100 --> 01:42:37,340
And you can imagine just how
stupid and inefficient this is.

2116
01:42:37,340 --> 01:42:41,510
A lot of those columns are going to be
empty for shows with very few genres.

2117
01:42:41,510 --> 01:42:43,770
And it's just kind of
messy at that point.

2118
01:42:43,770 --> 01:42:47,030
So better, in the world
of relational databases,

2119
01:42:47,030 --> 01:42:51,350
to have something like a second table,
where you have multiple rows that

2120
01:42:51,350 --> 01:42:55,700
somehow link back to that primary
key by way of what we're calling,

2121
01:42:55,700 --> 01:42:58,440
conceptually, a foreign key.

2122
01:42:58,440 --> 01:42:59,090
All right.

2123
01:42:59,090 --> 01:43:01,640
So let's go ahead now and
try to write this code.

2124
01:43:01,640 --> 01:43:03,710
Let me go back to my IDE.

2125
01:43:03,710 --> 01:43:07,850
Let me quit out of SQLite now.

2126
01:43:07,850 --> 01:43:10,640
And let me just move away.

2127
01:43:10,640 --> 01:43:15,402
I'm going to move this away,
my file, for just a moment

2128
01:43:15,402 --> 01:43:17,360
so that we're only left
with our original data.

2129
01:43:17,360 --> 01:43:21,680
Let's go about implementing a final
version of my Python file that does

2130
01:43:21,680 --> 01:43:23,540
this-- creates two tables--

2131
01:43:23,540 --> 01:43:26,270
one called shows, one called genres--

2132
01:43:26,270 --> 01:43:30,200
and then, two, in a for
loop, iterates over that CSV

2133
01:43:30,200 --> 01:43:34,490
and inserts some data into the shows
and other data into the genres.

2134
01:43:34,490 --> 01:43:36,350
How can we do this programmatically?

2135
01:43:36,350 --> 01:43:38,720
Well, there's a final piece
of the puzzle that we need.

2136
01:43:38,720 --> 01:43:41,912
We need some way of bridging
the world of Python and SQL.

2137
01:43:41,912 --> 01:43:44,120
And here, we do need a
library, because it would just

2138
01:43:44,120 --> 01:43:46,700
be way too painful to
do without a library.

2139
01:43:46,700 --> 01:43:47,540
It can be CS50.

2140
01:43:47,540 --> 01:43:50,730
CS50, as we'll see,
makes this very simple.

2141
01:43:50,730 --> 01:43:53,480
There are other third-party
commercial and open-source libraries

2142
01:43:53,480 --> 01:43:56,522
that you can also use in the real
world, as well, that do the same thing.

2143
01:43:56,522 --> 01:43:58,670
But the syntax is a
little less friendly,

2144
01:43:58,670 --> 01:44:01,880
so we'll start by using the CS50
library, which in Python, recall,

2145
01:44:01,880 --> 01:44:04,330
has functions like get_string
and get_int and get_float.

2146
01:44:04,330 --> 01:44:10,430
But today, it also has support, it turns
out, for SQL capabilities, as well.

2147
01:44:10,430 --> 01:44:12,760
So I'm going to go back
to my Favorites file.

2148
01:44:12,760 --> 01:44:15,970
And I'm going to import
not only CSV, but I'm also

2149
01:44:15,970 --> 01:44:21,310
going to import from the CS50
library a feature called SQL.

2150
01:44:21,310 --> 01:44:25,930
So we have a variable, if you
will, inside of the CS50 library

2151
01:44:25,930 --> 01:44:28,600
or, rather, a function
inside of the CS50 library

2152
01:44:28,600 --> 01:44:31,870
called SQL that, if I
call it, will allow me

2153
01:44:31,870 --> 01:44:35,270
to load a SQLite database into memory.

2154
01:44:35,270 --> 01:44:36,290
So how do I do this?

2155
01:44:36,290 --> 01:44:38,790
Well, let me go ahead and add
a couple of new lines of code.

2156
01:44:38,790 --> 01:44:45,340
Let me go ahead and open
up a file called shows.db,

2157
01:44:45,340 --> 01:44:47,055
but this time in write mode.

2158
01:44:47,055 --> 01:44:49,180
And then just for kicks--
just for now, rather, I'm

2159
01:44:49,180 --> 01:44:50,930
going to go ahead and
close it right away.

2160
01:44:50,930 --> 01:44:54,260
This is a Pythonic way of
creating an empty file.

2161
01:44:54,260 --> 01:44:58,210
It's kind of stupid looking, but
by opening a file called shows.db

2162
01:44:58,210 --> 01:45:00,790
in write mode and then
immediately closing it,

2163
01:45:00,790 --> 01:45:03,670
it has the effect of creating
the file, closing the file.

2164
01:45:03,670 --> 01:45:06,310
So I now have an empty file
with which to interact.

2165
01:45:06,310 --> 01:45:09,100
I could also do this, as
an aside, by doing this--

2166
01:45:09,100 --> 01:45:10,810
touch shows.db.

2167
01:45:10,810 --> 01:45:14,470
touch kind of a strange command,
but in a terminal window,

2168
01:45:14,470 --> 01:45:17,870
it means to create a
file if it doesn't exist.

2169
01:45:17,870 --> 01:45:19,450
So we could also do that instead.

2170
01:45:19,450 --> 01:45:22,420
But that would be independent of Python.
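
A minimal sketch of that open-then-close trick follows. It writes into a temporary directory, which is an assumption made here so the example does not overwrite a real shows.db; the lecture creates the file in the current directory.

```python
import os
import tempfile

# Build a path inside a throwaway directory.
path = os.path.join(tempfile.mkdtemp(), "shows.db")

# Opening in write mode creates the file; closing it right away leaves it empty.
open(path, "w").close()

exists = os.path.exists(path)
size = os.path.getsize(path)
```

Note that write mode also empties a file that already exists, whereas touch leaves existing contents alone.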

2171
01:45:22,420 --> 01:45:24,790
So once I've created this
file, let me go ahead

2172
01:45:24,790 --> 01:45:28,720
and open the file now
as a SQLite database.

2173
01:45:28,720 --> 01:45:31,600
I'm going to declare a variable
called db for database.

2174
01:45:31,600 --> 01:45:34,930
I'm going to use the SQL
function from CS50's library.

2175
01:45:34,930 --> 01:45:38,170
And I'm going to open via
somewhat cryptic string this--

2176
01:45:38,170 --> 01:45:43,600
sqlite:///shows.db.

2177
01:45:43,600 --> 01:45:48,740
Now, it looks like a URL,
http://, but it's SQLite instead.

2178
01:45:48,740 --> 01:45:52,300
And there's three slashes
instead of the usual two.

2179
01:45:52,300 --> 01:45:54,430
But this line of code,
line 6, has the result

2180
01:45:54,430 --> 01:45:57,820
of opening now that otherwise
empty file with nothing

2181
01:45:57,820 --> 01:46:04,040
in it yet as being a SQLite
database using CS50's library.

2182
01:46:04,040 --> 01:46:05,330
Why did I do that?

2183
01:46:05,330 --> 01:46:09,020
Well, I did that because I now
want to create my first table.

2184
01:46:09,020 --> 01:46:12,140
Let me go ahead and execute, db.execute.

2185
01:46:12,140 --> 01:46:16,330
So there's a function called execute
inside of the CS50 SQL library.

2186
01:46:16,330 --> 01:46:17,980
And I'm going to go ahead and run this.

2187
01:46:17,980 --> 01:46:23,770
CREATE TABLE called shows,
the columns of which

2188
01:46:23,770 --> 01:46:27,430
are an id, which is going to
be an integer, a title, which

2189
01:46:27,430 --> 01:46:33,380
is going to be text, the primary key
in which is going to be the id column.

2190
01:46:33,380 --> 01:46:34,870
So this is a bit cryptic.

2191
01:46:34,870 --> 01:46:36,520
But let's see what's happening.

2192
01:46:36,520 --> 01:46:41,950
I seem to now, in line 8, be
combining Python with SQL.

2193
01:46:41,950 --> 01:46:46,000
And this is where now programming
gets really powerful, fancy, cool,

2194
01:46:46,000 --> 01:46:48,250
difficult, however you
want to perceive it.

2195
01:46:48,250 --> 01:46:50,680
I can actually use one
language inside of another.

2196
01:46:50,680 --> 01:46:51,250
How?

2197
01:46:51,250 --> 01:46:53,420
Well, SQL is just a bunch
of textual commands.

2198
01:46:53,420 --> 01:46:55,420
Up until now, I've been
typing them out manually

2199
01:46:55,420 --> 01:46:57,430
in this program called sqlite3.

2200
01:46:57,430 --> 01:47:00,010
There's nothing stopping
me, though, from storing

2201
01:47:00,010 --> 01:47:02,830
those same commands in
Python strings and then

2202
01:47:02,830 --> 01:47:05,890
passing them to a database using code.

2203
01:47:05,890 --> 01:47:08,230
The code I'm using is a
function called execute.

2204
01:47:08,230 --> 01:47:10,990
And its purpose in life,
and CS50 staff wrote this,

2205
01:47:10,990 --> 01:47:18,950
is to pass the argument from your Python
code into the database for execution.

2206
01:47:18,950 --> 01:47:22,510
So it's like the programmatic way
of just typing things manually

2207
01:47:22,510 --> 01:47:25,160
at the SQLite prompt a few minutes ago.

2208
01:47:25,160 --> 01:47:27,880
So that's going to go ahead
and create my table called

2209
01:47:27,880 --> 01:47:30,610
shows, in which I'm going to
store all of those unique IDs

2210
01:47:30,610 --> 01:47:32,290
and also the titles.

2211
01:47:32,290 --> 01:47:33,670
And then let me do this again.

2212
01:47:33,670 --> 01:47:39,040
db.execute CREATE TABLE
genres, and that's

2213
01:47:39,040 --> 01:47:43,670
going to have a column called show_id,
which is an integer also, genre,

2214
01:47:43,670 --> 01:47:45,340
which is text.

2215
01:47:45,340 --> 01:47:48,130
And lastly, it's going
to have a foreign key--

2216
01:47:48,130 --> 01:47:51,190
it's going to wrap a little long here--

2217
01:47:51,190 --> 01:47:56,563
on show_id, which references
the shows table id.

2218
01:47:56,563 --> 01:47:57,730
All right, so this is a lot.

2219
01:47:57,730 --> 01:47:59,860
So let's just recap left to right.

2220
01:47:59,860 --> 01:48:03,730
db.execute is my Python function
that executes any SQL I want.

2221
01:48:03,730 --> 01:48:06,460
CREATE TABLE genres creates
a table called genres.

2222
01:48:06,460 --> 01:48:10,060
The columns in that table will
be something called show_id,

2223
01:48:10,060 --> 01:48:13,630
which is an integer, and
genre, which is a text field.

2224
01:48:13,630 --> 01:48:17,050
But it's going to be one
genre at a time, not multiple.

2225
01:48:17,050 --> 01:48:20,170
And then here, I'm
specifying a foreign key

2226
01:48:20,170 --> 01:48:24,280
will be the show_id column,
which happens to refer back

2227
01:48:24,280 --> 01:48:28,180
to the shows table's id column.

2228
01:48:28,180 --> 01:48:31,480
It's a little cryptic, but all this
is doing is implementing for us

2229
01:48:31,480 --> 01:48:33,470
the equivalent of this picture here.

2230
01:48:33,470 --> 01:48:35,770
I could have manually
typed both of these SQL

2231
01:48:35,770 --> 01:48:37,690
commands at that blinking prompt.

2232
01:48:37,690 --> 01:48:39,850
But again, no, I want
to write a program now

2233
01:48:39,850 --> 01:48:43,720
in Python that creates the tables
for me and now, more interestingly,

2234
01:48:43,720 --> 01:48:47,583
loads the data into that database.
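
The two CREATE TABLE statements described above might look like the following. This sketch swaps in Python's built-in sqlite3 module and an in-memory database in place of the CS50 library and shows.db, so the connection line is an assumption; the SQL strings follow the spoken description.

```python
import sqlite3

db = sqlite3.connect(":memory:")

# shows: a unique integer id plus the title; id is the primary key.
db.execute("CREATE TABLE shows (id INTEGER, title TEXT, PRIMARY KEY(id))")

# genres: one genre per row, tied back to a show by its id.
db.execute(
    "CREATE TABLE genres (show_id INTEGER, genre TEXT, "
    "FOREIGN KEY(show_id) REFERENCES shows(id))"
)

# List the tables that now exist, much as .schema would reveal.
tables = [
    name
    for (name,) in db.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
]
```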

2235
01:48:47,583 --> 01:48:49,000
So let's go ahead and do this now.

2236
01:48:49,000 --> 01:48:51,100
I'm not going to select
a title from the user,

2237
01:48:51,100 --> 01:48:52,660
because I want to import everything.

2238
01:48:52,660 --> 01:48:54,993
I'm not going to use any
counting or anything like that.

2239
01:48:54,993 --> 01:48:57,700
So let's go ahead and just go
inside of my loop as before.

2240
01:48:57,700 --> 01:49:02,240
And this time, let's go
ahead and, for row in reader,

2241
01:49:02,240 --> 01:49:05,110
let's go ahead and get the current
title, as we've always done.

2242
01:49:05,110 --> 01:49:08,640
But let's also, as always, go
ahead and strip it of white space

2243
01:49:08,640 --> 01:49:11,700
and capitalize it, just
to canonicalize it.

2244
01:49:11,700 --> 01:49:15,960
And now I'm going to go ahead and
execute db.execute, quote unquote,

2245
01:49:15,960 --> 01:49:24,707
INSERT INTO shows the title
column, the value of "title."

2246
01:49:24,707 --> 01:49:26,040
So I want to put the title here.

2247
01:49:26,040 --> 01:49:31,690
It turns out that SQL libraries like
ours support one final piece of syntax,

2248
01:49:31,690 --> 01:49:32,850
which is a placeholder.

2249
01:49:32,850 --> 01:49:34,800
In C, we use %s.

2250
01:49:34,800 --> 01:49:37,950
In Python, we just use curly braces
and put the word right there.

2251
01:49:37,950 --> 01:49:41,520
In SQL, we have a third approach to
the same problem-- just syntactically

2252
01:49:41,520 --> 01:49:43,590
different, but conceptually the same.

2253
01:49:43,590 --> 01:49:46,560
You put a question mark where
you want to put a placeholder.

2254
01:49:46,560 --> 01:49:50,670
And then outside of this string, I'm
going to actually type in the value

2255
01:49:50,670 --> 01:49:53,070
that I want to plug
into that question mark.

2256
01:49:53,070 --> 01:49:55,590
So this is so similar
to printf in week 1.

2257
01:49:55,590 --> 01:50:00,180
But instead of %s, it's a question mark
now and then a comma-separated list

2258
01:50:00,180 --> 01:50:03,120
of the arguments you want to
plug in for those placeholders.
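
Here is a hedged sketch of the loop with a question-mark placeholder. It uses Python's built-in sqlite3 module, which expects the values as a tuple, whereas the CS50 library takes them as extra arguments after the string. SAMPLE_CSV is a made-up stand-in for the exported file, and upper() is assumed for the capitalization step.

```python
import csv
import io
import sqlite3

# Made-up stand-in for the exported CSV; the real file has many more rows.
SAMPLE_CSV = (
    "Timestamp,title,genres\n"
    "t1,  the office ,Comedy\n"
    't2,Friends,"Comedy, Romance"\n'
)

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE shows (id INTEGER, title TEXT, PRIMARY KEY(id))")

reader = csv.DictReader(io.StringIO(SAMPLE_CSV))
for row in reader:
    # Canonicalize: drop surrounding whitespace and force uppercase.
    title = row["title"].strip().upper()
    # The ? is filled in by the database library, not by string formatting.
    db.execute("INSERT INTO shows (title) VALUES (?)", (title,))

stored = db.execute("SELECT id, title FROM shows ORDER BY id").fetchall()
```

The id column is never mentioned in the INSERT; SQLite assigns the next integer on its own because id is an integer primary key.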

2259
01:50:03,120 --> 01:50:08,820
So now this line of code 16 has
just inserted all of those values

2260
01:50:08,820 --> 01:50:09,670
into my database.

2261
01:50:09,670 --> 01:50:10,440
And let's go ahead and run this.

2262
01:50:10,440 --> 01:50:12,970
Before I go any further,
let me go ahead and do this.

2263
01:50:12,970 --> 01:50:15,960
I'm going to go ahead now and
run python of favorites.py

2264
01:50:15,960 --> 01:50:18,030
and cross my fingers, as always.

2265
01:50:18,030 --> 01:50:20,010
It's taking a moment, taking a moment.

2266
01:50:20,010 --> 01:50:23,340
That's because there's a
decent-sized file there.

2267
01:50:23,340 --> 01:50:25,650
Or I screwed up.

2268
01:50:25,650 --> 01:50:27,930
This is taking too long.

2269
01:50:27,930 --> 01:50:28,950
Oh, OK.

2270
01:50:28,950 --> 01:50:30,960
I should have just been more patient.

2271
01:50:30,960 --> 01:50:31,560
All right.

2272
01:50:31,560 --> 01:50:33,970
So it just seems my
connection's a little slow.

2273
01:50:33,970 --> 01:50:38,717
So as I expected, everything is
100% correct, and it's working fine.

2274
01:50:38,717 --> 01:50:40,800
So now let's go ahead and
see what I actually did.

2275
01:50:40,800 --> 01:50:44,970
If I type ls, notice that I
have a file called shows.db.

2276
01:50:44,970 --> 01:50:48,180
This is brand new, because my
Python program created it this time.

2277
01:50:48,180 --> 01:50:51,060
Let's go ahead and run
sqlite3 of shows.db

2278
01:50:51,060 --> 01:50:53,080
just so I can now see
what's inside of it.

2279
01:50:53,080 --> 01:50:57,090
Notice that I can do .schema
just to see what tables exist.

2280
01:50:57,090 --> 01:51:00,660
And indeed, the two tables that
I created in my Python code

2281
01:51:00,660 --> 01:51:01,920
seem to exist.

2282
01:51:01,920 --> 01:51:04,020
But notice that there's--

2283
01:51:04,020 --> 01:51:08,730
if I do SELECT * FROM shows,
let's see all the data.

2284
01:51:08,730 --> 01:51:09,750
Voila.

2285
01:51:09,750 --> 01:51:13,170
There is a table that's been
programmatically created.

2286
01:51:13,170 --> 01:51:16,350
And it has, notice this time,
no timestamps, no genres.

2287
01:51:16,350 --> 01:51:20,730
But it has an ID on the left
and the title on the right.

2288
01:51:20,730 --> 01:51:25,350
And amazingly, all of the IDs are
monotonically increasing from 1

2289
01:51:25,350 --> 01:51:27,390
on up to 513, in this case.

2290
01:51:27,390 --> 01:51:28,300
Why is that?

2291
01:51:28,300 --> 01:51:30,600
Well, one of the features
you get in a SQL database

2292
01:51:30,600 --> 01:51:34,410
is if you define an integer column as
being a primary key in SQLite,

2293
01:51:34,410 --> 01:51:36,480
it's going to be auto
incremented for you.

2294
01:51:36,480 --> 01:51:41,970
Recall that nowhere in my code did
I even have a line, an integer,

2295
01:51:41,970 --> 01:51:43,830
inputting 1, then 2, then 3.

2296
01:51:43,830 --> 01:51:45,310
I could absolutely do that.

2297
01:51:45,310 --> 01:51:47,730
I could have done something
like this-- counter--

2298
01:51:47,730 --> 01:51:51,660
rather, I could have done
something like this-- counter = 1.

2299
01:51:51,660 --> 01:51:56,280
And then down here, I could
have said id, title, give myself

2300
01:51:56,280 --> 01:51:59,122
two placeholders, and then
pass in the counter each time.

2301
01:51:59,122 --> 01:52:01,830
I could have implemented this
myself and then, on each iteration,

2302
01:52:01,830 --> 01:52:03,960
done counter += 1.

2303
01:52:03,960 --> 01:52:06,330
But with SQL databases,
as we've seen, you

2304
01:52:06,330 --> 01:52:08,310
get a lot more functionality built in.

2305
01:52:08,310 --> 01:52:11,130
I don't have to do any
of that, because if I've

2306
01:52:11,130 --> 01:52:16,710
declared that ID as being a primary
key, SQLite is going to insert it for me

2307
01:52:16,710 --> 01:52:19,870
and increment it also for me, as well.
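
To see that behavior in isolation, here is a small sketch, again assuming Python's built-in sqlite3 module and an in-memory database rather than the lecture's own setup. The sample titles are made up for illustration. No id is ever supplied in the INSERT, yet each row ends up with one.

```python
import sqlite3

# Assumption: in-memory database, hypothetical sample titles.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE shows (id INTEGER PRIMARY KEY, title TEXT)")

for title in ["THE OFFICE", "FRIENDS", "GLEE"]:
    # Only the title is inserted; SQLite picks the id on its own.
    db.execute("INSERT INTO shows (title) VALUES(?)", (title,))

rows = db.execute("SELECT id, title FROM shows ORDER BY id").fetchall()
print(rows)
```

The manual alternative would be a counter variable that is passed in as a second placeholder and incremented on every iteration, which is exactly the bookkeeping the primary key saves.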

2308
01:52:19,870 --> 01:52:20,400
All right.

2309
01:52:20,400 --> 01:52:24,510
So if I go back to SQLite, though,
notice that I do have IDs and titles.

2310
01:52:24,510 --> 01:52:28,860
But if I SELECT * FROM genres,
there's of course nothing there yet.

2311
01:52:28,860 --> 01:52:32,250
So how now do I get all of the
genres for each of these shows in?

2312
01:52:32,250 --> 01:52:33,910
I need to finish my script.

2313
01:52:33,910 --> 01:52:38,970
So inside of this same loop, I have
not only the title in my current row,

2314
01:52:38,970 --> 01:52:42,570
but I also have genres
in the current row.

2315
01:52:42,570 --> 01:52:45,570
But the genres are separated by commas.

2316
01:52:45,570 --> 01:52:47,880
Recall that in the CSV,
next to every title,

2317
01:52:47,880 --> 01:52:51,450
there's a comma-separated
list of genres.

2318
01:52:51,450 --> 01:52:53,460
How do I get at each genre individually?

2319
01:52:53,460 --> 01:52:59,190
Well, I'd like to be able to say
for genre in row bracket genres.

2320
01:52:59,190 --> 01:53:02,520
But this is not going to
work, because that's not going

2321
01:53:02,520 --> 01:53:05,310
to be split up based on those commas.

2322
01:53:05,310 --> 01:53:07,190
That's literally just
going to iterate over,

2323
01:53:07,190 --> 01:53:10,860
in fact, all of the characters in
that string, as we saw last week.

2324
01:53:10,860 --> 01:53:13,950
But it turns out that strings
in Python have a fancy split

2325
01:53:13,950 --> 01:53:19,300
function, whereby I can split
on a comma followed by a space.

2326
01:53:19,300 --> 01:53:21,930
And what this function
will do for me in Python is

2327
01:53:21,930 --> 01:53:26,130
take a comma separated list of
genres and explode it, so to speak,

2328
01:53:26,130 --> 01:53:31,800
split it on every comma,
space into a Python list

2329
01:53:31,800 --> 01:53:36,570
containing genre after genre
in an actual Python list

2330
01:53:36,570 --> 01:53:37,990
a la square brackets.
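
The difference between looping over the string itself and looping over the result of split can be sketched like this. The genres_field value is a hypothetical example of what one CSV cell might hold.

```python
# Assumption: one cell from the genres column of the CSV.
genres_field = "Comedy, Drama, Romance"

# Iterating over the string directly yields single characters.
first_chars = [c for c in genres_field][:6]

# Splitting on comma-plus-space yields one list element per genre.
genres = genres_field.split(", ")

print(first_chars)
print(genres)
```

Note that the separator includes the space, so the resulting genres carry no leading white space.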

2331
01:53:37,990 --> 01:53:42,360
So now I can iterate over that
list of individual genres.

2332
01:53:42,360 --> 01:53:49,470
And inside of here, I can do db.execute
INSERT INTO genres show_id, genre,

2333
01:53:49,470 --> 01:53:53,130
the values, question
mark, question mark.

2334
01:53:53,130 --> 01:53:56,100
But huh, there's a problem.

2335
01:53:56,100 --> 01:53:59,970
I can definitely plug in the
current genre, which is this.

2336
01:53:59,970 --> 01:54:02,970
But I need to put something here still.

2337
01:54:02,970 --> 01:54:07,560
For that first question mark,
I need a value for the show_id.

2338
01:54:07,560 --> 01:54:11,130
How do I know what the ID
is of the current TV show?

2339
01:54:11,130 --> 01:54:13,650
Well, it turns out the library
can help you with this.

2340
01:54:13,650 --> 01:54:18,970
When you insert new rows into
a table that has a primary key,

2341
01:54:18,970 --> 01:54:23,400
it turns out that most libraries will
return you that value in some way.

2342
01:54:23,400 --> 01:54:26,520
And if I go back to
line 15 and I actually

2343
01:54:26,520 --> 01:54:31,470
store the return value of
db.execute after using INSERT,

2344
01:54:31,470 --> 01:54:34,500
the library will tell me
what was the integer that

2345
01:54:34,500 --> 01:54:36,390
was just used for this given show.

2346
01:54:36,390 --> 01:54:37,650
Maybe it's 1, 2, 3.

2347
01:54:37,650 --> 01:54:39,940
I don't have to know or
care as the programmer.

2348
01:54:39,940 --> 01:54:42,570
But the return value, I
can store in a variable.

2349
01:54:42,570 --> 01:54:47,520
And then down here, I can literally
put that same ID so that now,

2350
01:54:47,520 --> 01:54:51,600
if I am inputting The Office,
whose ID is 1, into the shows table

2351
01:54:51,600 --> 01:54:54,720
and its genres are
comedy, drama, romance,

2352
01:54:54,720 --> 01:54:57,990
I can now inside of this for
loop, this nested for loop,

2353
01:54:57,990 --> 01:55:03,240
insert 1 followed by "comedy," 1
followed by "drama," 1 followed

2354
01:55:03,240 --> 01:55:07,330
by "romance," three rows all at once.
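
Put together, the nested loop might look roughly like the sketch below. It is written against Python's built-in sqlite3 module, which is an assumption: there, the new row's ID is read from the cursor's lastrowid attribute, whereas the library used in lecture hands it back as the return value of db.execute. The csv_rows list is hypothetical sample data standing in for the CSV reader.

```python
import sqlite3

# Assumption: in-memory database instead of shows.db.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE shows (id INTEGER PRIMARY KEY, title TEXT)")
db.execute(
    "CREATE TABLE genres (show_id INTEGER, genre TEXT, "
    "FOREIGN KEY(show_id) REFERENCES shows(id))"
)

# Assumption: two rows as a csv.DictReader might produce them.
csv_rows = [
    {"title": "The Office", "genres": "Comedy, Drama, Romance"},
    {"title": "Glee", "genres": "Musical"},
]

for row in csv_rows:
    title = row["title"].strip().upper()
    cursor = db.execute("INSERT INTO shows (title) VALUES(?)", (title,))
    show_id = cursor.lastrowid  # the primary key SQLite just assigned
    for genre in row["genres"].split(", "):
        # One row per genre, each tagged with the show's ID.
        db.execute(
            "INSERT INTO genres (show_id, genre) VALUES(?, ?)", (show_id, genre)
        )

genre_rows = db.execute(
    "SELECT show_id, genre FROM genres ORDER BY rowid"
).fetchall()
print(genre_rows)
```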

2355
01:55:07,330 --> 01:55:11,980
And so now let's go back down
here into my terminal window.

2356
01:55:11,980 --> 01:55:15,660
Let me remove the old shows.db
with rm, just to start fresh.

2357
01:55:15,660 --> 01:55:19,920
Let me go ahead and rerun
python of favorites.py.

2358
01:55:19,920 --> 01:55:23,733
I'll be more patient this time, because
the cloud's being a little slow.

2359
01:55:23,733 --> 01:55:24,900
So it's doing some thinking.

2360
01:55:24,900 --> 01:55:27,030
And in fact, there's
more work being done now.

2361
01:55:27,030 --> 01:55:29,340
At this point in the story,
my program is presumably

2362
01:55:29,340 --> 01:55:33,060
iterating over all of
the rows in the CSV.

2363
01:55:33,060 --> 01:55:37,170
And it's inserting into the
shows table one at a time,

2364
01:55:37,170 --> 01:55:43,380
and then it's inserting one or
more genres into the genres table.

2365
01:55:43,380 --> 01:55:44,250
It's a little slow.

2366
01:55:44,250 --> 01:55:47,370
If we were on a faster system or if
I were doing it on my own Mac or PC,

2367
01:55:47,370 --> 01:55:49,480
it would probably go down more quickly.

2368
01:55:49,480 --> 01:55:52,740
But you can see here an example of why
I use the .import command in the first

2369
01:55:52,740 --> 01:55:53,130
place.

2370
01:55:53,130 --> 01:55:54,670
That automated some of this process.

2371
01:55:54,670 --> 01:55:58,440
But unfortunately, it didn't allow
me to change the format of my data.

2372
01:55:58,440 --> 01:56:01,530
But the key point to make
here is that even though this

2373
01:56:01,530 --> 01:56:05,490
is taking a little bit of time to insert
these hundreds of rows all at once,

2374
01:56:05,490 --> 01:56:07,260
I'm only going to have to do this once.

2375
01:56:07,260 --> 01:56:10,840
And what was asked a bit ago
was the performance of this.

2376
01:56:10,840 --> 01:56:15,390
It turns out that now that we have
full control over the SQL database,

2377
01:56:15,390 --> 01:56:20,640
it turns out we're going to have
the ability to actually improve

2378
01:56:20,640 --> 01:56:22,230
the performance thereof.

2379
01:56:22,230 --> 01:56:24,000
Oh, OK.

2380
01:56:24,000 --> 01:56:25,830
As expected, it finished right on time.

2381
01:56:25,830 --> 01:56:29,970
And let me go ahead now and
run sqlite3 on shows.db.

2382
01:56:29,970 --> 01:56:32,670
All right, so now I'm back
in my raw SQL environment.

2383
01:56:32,670 --> 01:56:36,180
If I do SELECT * FROM
shows, which I did before,

2384
01:56:36,180 --> 01:56:37,650
we'll see all of this as before.

2385
01:56:37,650 --> 01:56:42,090
If I SELECT * FROM shows
WHERE title = "THE OFFICE,"

2386
01:56:42,090 --> 01:56:45,103
I'll see the actual unique
IDs of all of those.

2387
01:56:45,103 --> 01:56:46,770
We didn't bother eliminating duplicates.

2388
01:56:46,770 --> 01:56:50,610
We just kept everything as is, but
we gave everything a unique ID.

2389
01:56:50,610 --> 01:56:57,520
But if I now do SELECT * FROM genres,
we'll see all of the values there.

2390
01:56:57,520 --> 01:56:59,070
And notice the key detail.

2391
01:56:59,070 --> 01:57:03,360
There is only one genre per row here.

2392
01:57:03,360 --> 01:57:06,480
And so we can ultimately line
those up with our titles.

2393
01:57:06,480 --> 01:57:10,250
And our titles here, we
had all of these here.

2394
01:57:10,250 --> 01:57:12,538
Something's wrong.

2395
01:57:12,538 --> 01:57:13,580
I want to get this right.

2396
01:57:13,580 --> 01:57:15,940
Let's go ahead and take our second
and final five-minute break here.

2397
01:57:15,940 --> 01:57:18,280
And we'll come back, and I
will explain what's going on.

2398
01:57:18,280 --> 01:57:20,170
All right, we are back.

2399
01:57:20,170 --> 01:57:23,710
And just before we broke up, my own
self-doubt was starting to creep in.

2400
01:57:23,710 --> 01:57:26,830
But I'm happy to say, with no
fancy magic behind the scenes,

2401
01:57:26,830 --> 01:57:28,430
everything was actually working fine.

2402
01:57:28,430 --> 01:57:30,263
I was just doubting the
correctness of this.

2403
01:57:30,263 --> 01:57:33,460
If I do SELECT * FROM
shows, I indeed get back

2404
01:57:33,460 --> 01:57:37,540
two columns, one with the unique ID,
the so-called primary key, followed

2405
01:57:37,540 --> 01:57:40,280
by the title of each of those shows.

2406
01:57:40,280 --> 01:57:46,120
And if I similarly SELECT * FROM
genres, I get single genres at a time.

2407
01:57:46,120 --> 01:57:49,600
But on the left-hand side
are not primary keys per se

2408
01:57:49,600 --> 01:57:52,450
but now those same numbers
here in this context called

2409
01:57:52,450 --> 01:57:55,160
foreign keys that map one to the other.

2410
01:57:55,160 --> 01:58:01,260
So for instance, whatever show 512 is
had five different genres associated

2411
01:58:01,260 --> 01:58:01,760
with it.

2412
01:58:01,760 --> 01:58:05,320
And in fact, if I go back a moment to
shows, it looks like Game of Thrones

2413
01:58:05,320 --> 01:58:10,420
was decided by one of you as belonging
in thriller, history, adventure,

2414
01:58:10,420 --> 01:58:14,660
action, and war, as well, those five.

2415
01:58:14,660 --> 01:58:17,320
So now this is what's meant
by relational database.

2416
01:58:17,320 --> 01:58:21,430
You have this relation or
relationship across multiple tables

2417
01:58:21,430 --> 01:58:25,050
that link some data in one to
some other data in the other.

2418
01:58:25,050 --> 01:58:27,550
The catch, though, is that it
would seem a little harder now

2419
01:58:27,550 --> 01:58:30,910
to answer questions, because now
I have to kind of query two tables

2420
01:58:30,910 --> 01:58:34,450
or execute two separate queries
and then combine the data.

2421
01:58:34,450 --> 01:58:36,130
But that's not actually the case.

2422
01:58:36,130 --> 01:58:39,100
Suppose that I want to
answer the question of,

2423
01:58:39,100 --> 01:58:42,760
what are all of the musicals
among your favorite TV shows?

2424
01:58:42,760 --> 01:58:46,490
I can't select just the shows, because
there's no genres in there anymore.

2425
01:58:46,490 --> 01:58:48,730
But I also can't select
just the genres table,

2426
01:58:48,730 --> 01:58:50,900
because there's no titles in there.

2427
01:58:50,900 --> 01:58:55,060
But there is a value that's bridging
one and the other, that foreign key

2428
01:58:55,060 --> 01:58:56,980
to primary key relationship.

2429
01:58:56,980 --> 01:58:59,170
So you know what I can do
off the top of my head?

2430
01:58:59,170 --> 01:59:03,790
I'm pretty sure I can select all of
the show_ids from the genres table

2431
01:59:03,790 --> 01:59:07,072
where a specific genre = "Musical."

2432
01:59:07,072 --> 01:59:09,280
And I don't have to worry
about commas or spaces now,

2433
01:59:09,280 --> 01:59:13,210
because again, in this new version
that I have designed programmatically

2434
01:59:13,210 --> 01:59:16,990
with code, musical and every
other genre is just a single word.

2435
01:59:16,990 --> 01:59:21,220
If I hit Enter, all of
these show_ids were decided

2436
01:59:21,220 --> 01:59:23,930
by you all as belonging to musicals.

2437
01:59:23,930 --> 01:59:25,930
But now this is not
interesting, and I certainly

2438
01:59:25,930 --> 01:59:28,360
don't want to execute 10
or so queries manually

2439
01:59:28,360 --> 01:59:30,400
to look up every one of those IDs.

2440
01:59:30,400 --> 01:59:32,680
But notice what we can
do in SQL, as well.

2441
01:59:32,680 --> 01:59:33,880
I can nest queries.

2442
01:59:33,880 --> 01:59:36,940
Let me put this whole query in
parentheses for just a moment

2443
01:59:36,940 --> 01:59:39,070
and then prepend to it the following.

2444
01:59:39,070 --> 01:59:46,930
SELECT title FROM shows WHERE the
primary key, id, is in this subquery.

2445
01:59:46,930 --> 01:59:50,650
So you can have nested queries similar
in spirit a bit like in Python and C

2446
01:59:50,650 --> 01:59:52,510
when you have nested for loops.

2447
01:59:52,510 --> 01:59:55,690
In this case, just like in grade school
math, whatever is in the parentheses

2448
01:59:55,690 --> 01:59:57,160
will be executed first.

2449
01:59:57,160 --> 02:00:02,140
Then the outer query will be executed
using the results of that inner query.

2450
02:00:02,140 --> 02:00:07,000
So if I select the title from shows
where the ID is in that list of IDs,

2451
02:00:07,000 --> 02:00:07,660
voila.

2452
02:00:07,660 --> 02:00:11,560
It seems that, somewhat
amusingly, several of you

2453
02:00:11,560 --> 02:00:15,280
think that Breaking Bad, Supernatural,
Glee, Sherlock, How I Met Your Mother,

2454
02:00:15,280 --> 02:00:18,190
Hawaii Five-0, Twin Peaks, The
Lawyer, and My Brother, My Brother

2455
02:00:18,190 --> 02:00:19,900
and Me are all musicals.

2456
02:00:19,900 --> 02:00:22,630
I take exception to a few
of those, but so be it.

2457
02:00:22,630 --> 02:00:24,850
You checked the box for
musical for those shows.
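
The nested query above can be tried on a tiny data set. This is a sketch under assumptions: Python's built-in sqlite3 module, an in-memory database, and a handful of made-up rows rather than the class's actual responses.

```python
import sqlite3

db = sqlite3.connect(":memory:")
# Assumption: a few hypothetical rows in the same two-table layout.
db.executescript("""
CREATE TABLE shows (id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE genres (show_id INTEGER, genre TEXT);
INSERT INTO shows (id, title) VALUES
    (1, 'THE OFFICE'), (2, 'GLEE'), (3, 'SHERLOCK');
INSERT INTO genres (show_id, genre) VALUES
    (1, 'Comedy'), (2, 'Musical'), (2, 'Comedy'), (3, 'Musical');
""")

# The inner query finds the show_ids tagged Musical; the outer query
# turns those IDs back into titles.
musicals = [
    row[0]
    for row in db.execute(
        "SELECT title FROM shows WHERE id IN "
        "(SELECT show_id FROM genres WHERE genre = ?) ORDER BY id",
        ("Musical",),
    )
]
print(musicals)
```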

2458
02:00:24,850 --> 02:00:29,260
So even though we've designed
things better in the sense

2459
02:00:29,260 --> 02:00:33,010
that we've normalized our database
by factoring out commonalities

2460
02:00:33,010 --> 02:00:35,050
or, rather, we've cleaned
up the data, there's

2461
02:00:35,050 --> 02:00:37,150
still admittedly some redundancy.

2462
02:00:37,150 --> 02:00:39,370
There's still admittedly
some redundancy.

2463
02:00:39,370 --> 02:00:44,410
But I at least now have
the data in clean fashion

2464
02:00:44,410 --> 02:00:47,800
so that every column has just
a single value in it and not

2465
02:00:47,800 --> 02:00:49,870
some contrived comma-separated list.

2466
02:00:49,870 --> 02:00:51,725
Suppose I want to find
out all of the genres

2467
02:00:51,725 --> 02:00:53,350
that you all thought The Office was in.

2468
02:00:53,350 --> 02:00:55,480
So let's ask kind of
the opposite question.

2469
02:00:55,480 --> 02:00:56,840
Well, how might I do that?

2470
02:00:56,840 --> 02:01:00,430
Well, to figure out The Office, I'm
going to first need to SELECT the id

2471
02:01:00,430 --> 02:01:06,400
FROM shows WHERE title = "THE
OFFICE," because a whole bunch of you

2472
02:01:06,400 --> 02:01:07,330
typed in The Office.

2473
02:01:07,330 --> 02:01:09,902
And we gave each of your
answers a unique identifier

2474
02:01:09,902 --> 02:01:11,110
so we could keep track of it.

2475
02:01:11,110 --> 02:01:12,500
And there's all of those numbers.

2476
02:01:12,500 --> 02:01:14,260
Now, this is, like, dozens of responses.

2477
02:01:14,260 --> 02:01:16,540
I certainly don't want to
execute that many queries.

2478
02:01:16,540 --> 02:01:18,850
But I think a subquery
will help us out again.

2479
02:01:18,850 --> 02:01:21,470
Let me put parentheses
around this whole thing.

2480
02:01:21,470 --> 02:01:27,910
And now let me say SELECT
DISTINCT genre FROM genres WHERE

2481
02:01:27,910 --> 02:01:32,380
the show_id in the genres
table is in that query.

2482
02:01:32,380 --> 02:01:37,400
And just for kicks, let me
go ahead and ORDER BY genre.

2483
02:01:37,400 --> 02:01:38,930
So let me go ahead and execute this.

2484
02:01:38,930 --> 02:01:42,490
And, OK, somewhat amusingly, those
of you who inputted The Office

2485
02:01:42,490 --> 02:01:46,960
checked boxes for animation, comedy,
documentary, drama, family, horror,

2486
02:01:46,960 --> 02:01:49,000
reality-TV, romance, and sci-fi.

2487
02:01:49,000 --> 02:01:51,020
I take exception to a few of those, too.

2488
02:01:51,020 --> 02:01:53,720
But this is what happens
when you accept user input.

2489
02:01:53,720 --> 02:01:57,293
So here again, we have
with this SQL language

2490
02:01:57,293 --> 02:01:59,710
the ability to express fairly
succinctly, even though it's

2491
02:01:59,710 --> 02:02:03,670
a lot of new features today all at
once, what would otherwise take me

2492
02:02:03,670 --> 02:02:06,593
a dozen or two lines in
Python code to implement

2493
02:02:06,593 --> 02:02:09,010
and god knows how many lines
of code and how many hours it

2494
02:02:09,010 --> 02:02:13,060
would take me to implement something
like this in C. Now, admittedly,

2495
02:02:13,060 --> 02:02:15,220
we could do better than this design.

2496
02:02:15,220 --> 02:02:18,550
This table or this picture
represents what we have now.

2497
02:02:18,550 --> 02:02:22,360
But you'll notice a lot of redundancy
implicit in the genres table.

2498
02:02:22,360 --> 02:02:25,510
Any time you check the
comedy box, I have a row now

2499
02:02:25,510 --> 02:02:27,940
that says comedy,
comedy, comedy, comedy.

2500
02:02:27,940 --> 02:02:31,930
And the show_id differs, but I have
the word "comedy" again and again.

2501
02:02:31,930 --> 02:02:35,740
And now, that tends to be frowned upon
in the world of relational databases,

2502
02:02:35,740 --> 02:02:39,370
because if you have a
genre called comedy or one

2503
02:02:39,370 --> 02:02:42,430
called musical or anything
else, you should ideally just

2504
02:02:42,430 --> 02:02:43,970
have that living in one place.

2505
02:02:43,970 --> 02:02:47,980
And so if we really wanted to
be particular and really, truly

2506
02:02:47,980 --> 02:02:51,100
normalize this database, which
is an academic term referring

2507
02:02:51,100 --> 02:02:55,530
to removing all such redundancies,
we could actually do it like this.

2508
02:02:55,530 --> 02:02:59,480
We could have a shows table still with
an id and title, no difference there.

2509
02:02:59,480 --> 02:03:03,890
But we could have a genres table
with two columns, id and name.

2510
02:03:03,890 --> 02:03:05,020
Now, this is its own id.

2511
02:03:05,020 --> 02:03:06,910
It has no connection with the show_id.

2512
02:03:06,910 --> 02:03:10,660
It's just its own unique
identifier, a primary key here now,

2513
02:03:10,660 --> 02:03:12,320
and the name of that genre.

2514
02:03:12,320 --> 02:03:14,350
So you would have one
row in the genres table

2515
02:03:14,350 --> 02:03:17,690
for comedy, for drama, music,
musical, and everything else.

2516
02:03:17,690 --> 02:03:19,870
And then you would use
a third table, which

2517
02:03:19,870 --> 02:03:23,920
is colloquially called a join table,
which I'll draw here in the middle.

2518
02:03:23,920 --> 02:03:25,960
And you can call it
anything you want, but we've

2519
02:03:25,960 --> 02:03:29,920
called it shows_genres to make
clear that this table implements

2520
02:03:29,920 --> 02:03:33,400
a relationship between those two tables.

2521
02:03:33,400 --> 02:03:36,910
And notice that in this table
is really no juicy data.

2522
02:03:36,910 --> 02:03:38,800
It's just foreign keys--

2523
02:03:38,800 --> 02:03:41,380
show_id, genre_id.

2524
02:03:41,380 --> 02:03:43,930
And by having this
third table, we can now

2525
02:03:43,930 --> 02:03:47,890
make sure that the word "comedy"
only appears in one row anywhere.

2526
02:03:47,890 --> 02:03:50,860
The word "musical" only
appears in one row anywhere.

2527
02:03:50,860 --> 02:03:55,450
But we use these more efficient
integers called show_id and genre_id,

2528
02:03:55,450 --> 02:04:00,850
which respectively point to those
primary keys in their primary tables

2529
02:04:00,850 --> 02:04:02,072
to link those two together.

2530
02:04:02,072 --> 02:04:04,780
And this is an example of what's
called in the world of databases

2531
02:04:04,780 --> 02:04:06,790
a many-to-many relationship.

2532
02:04:06,790 --> 02:04:09,610
One show can have many genres.

2533
02:04:09,610 --> 02:04:12,730
One genre can belong to many shows.

2534
02:04:12,730 --> 02:04:14,530
And so by having this
third table, you can

2535
02:04:14,530 --> 02:04:16,730
have that many-to-many relationship.

2536
02:04:16,730 --> 02:04:19,570
And again, the third table now
allows us to truly normalize

2537
02:04:19,570 --> 02:04:23,920
our data set by getting rid of all of
the duplicate comedy, comedy, comedy.
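
The three-table design might be sketched as below. The table and column names follow the ones just described (shows, genres, shows_genres, show_id, genre_id); everything else is an assumption for illustration, including the use of Python's built-in sqlite3 module, the in-memory database, and the deliberately misspelled sample row.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE shows (id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE genres (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE shows_genres (
    show_id INTEGER,
    genre_id INTEGER,
    FOREIGN KEY(show_id) REFERENCES shows(id),
    FOREIGN KEY(genre_id) REFERENCES genres(id)
);
-- Hypothetical sample data; 'Comdy' is misspelled on purpose.
INSERT INTO shows (id, title) VALUES (1, 'THE OFFICE'), (2, 'FRIENDS');
INSERT INTO genres (id, name) VALUES (1, 'Comdy'), (2, 'Romance');
INSERT INTO shows_genres (show_id, genre_id) VALUES (1, 1), (2, 1), (2, 2);
""")

# Each genre name lives in exactly one row, so one UPDATE fixes the typo
# for every show that links to it.
db.execute("UPDATE genres SET name = ? WHERE name = ?", ("Comedy", "Comdy"))

# Follow the join table from the genre name back to the show titles.
comedies = [
    row[0]
    for row in db.execute(
        "SELECT title FROM shows WHERE id IN "
        "(SELECT show_id FROM shows_genres WHERE genre_id = "
        "(SELECT id FROM genres WHERE name = ?)) ORDER BY id",
        ("Comedy",),
    )
]
print(comedies)
```

The join table holds nothing but pairs of integers, which is what lets one show link to many genres and one genre link to many shows.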

2538
02:04:23,920 --> 02:04:25,420
Why is this important?

2539
02:04:25,420 --> 02:04:27,310
Probably not a huge deal for genres.

2540
02:04:27,310 --> 02:04:30,910
But imagine with my current design
if I made a spelling mistake,

2541
02:04:30,910 --> 02:04:32,440
and I misnamed comedy.

2542
02:04:32,440 --> 02:04:36,190
I would now have to change every row
with the word comedy again and again.

2543
02:04:36,190 --> 02:04:39,580
Or if maybe you change
the genres of the shows,

2544
02:04:39,580 --> 02:04:41,840
you would have to change
it in multiple places.

2545
02:04:41,840 --> 02:04:44,260
But with this other
approach with three tables,

2546
02:04:44,260 --> 02:04:46,450
you can argue that now
you only have to change

2547
02:04:46,450 --> 02:04:49,750
the name of a genre in one
place, not all over the place.

2548
02:04:49,750 --> 02:04:52,420
And that, in general, in C
and now in Python and now

2549
02:04:52,420 --> 02:04:57,040
SQL has generally been a good thing
not to copy paste identical values

2550
02:04:57,040 --> 02:05:00,060
all over the place.

2551
02:05:00,060 --> 02:05:00,720
All right.

2552
02:05:00,720 --> 02:05:04,410
So with that said, what other
tools do we have at our disposal?

2553
02:05:04,410 --> 02:05:09,180
Well, it turns out that there are other
data types out there in the real world

2554
02:05:09,180 --> 02:05:11,280
using SQL besides just these five--

2555
02:05:11,280 --> 02:05:13,620
BLOB, INTEGER, NUMERIC, REAL, and TEXT.

2556
02:05:13,620 --> 02:05:15,870
BLOB, again, is for binary
stuff, generally not

2557
02:05:15,870 --> 02:05:18,840
used except for more specialized
applications, let's say.

2558
02:05:18,840 --> 02:05:21,270
INTEGER, which is an
int, typically 32 bits;

2559
02:05:21,270 --> 02:05:23,700
NUMERIC, which is something
like a date or a year

2560
02:05:23,700 --> 02:05:26,220
or time or something like
that; REAL numbers, which

2561
02:05:26,220 --> 02:05:30,180
are floating point values; and
TEXT, which are things like strings.

2562
02:05:30,180 --> 02:05:34,110
But if you graduate ultimately
from SQLite on phones

2563
02:05:34,110 --> 02:05:38,522
and on Macs and PCs to actual
servers that run Oracle, MySQL,

2564
02:05:38,522 --> 02:05:40,230
and PostgreSQL if
you're actually running

2565
02:05:40,230 --> 02:05:42,540
your own internet-style
business, well, it

2566
02:05:42,540 --> 02:05:47,310
turns out that more
sophisticated, even more powerful

2567
02:05:47,310 --> 02:05:50,620
databases come with other
subtypes, if you will.

2568
02:05:50,620 --> 02:05:54,270
So besides INTEGER, you can
specify smallint for small numbers,

2569
02:05:54,270 --> 02:05:57,690
maybe using just a few
bits instead of 32--

2570
02:05:57,690 --> 02:06:01,800
INTEGER or bigint, which
uses 64 bits instead of 32.

2571
02:06:01,800 --> 02:06:05,130
The Facebooks, the Twitters of the
world need to use bigint a lot,

2572
02:06:05,130 --> 02:06:06,720
because they have so much data.

2573
02:06:06,720 --> 02:06:09,330
You and I can get away with
simple integers, because we're not

2574
02:06:09,330 --> 02:06:12,450
going to have more than 4 billion
favorite TV shows in a class,

2575
02:06:12,450 --> 02:06:13,290
certainly.

2576
02:06:13,290 --> 02:06:17,040
Something like REAL, you can have
32-bit real numbers or, a little weirdly

2577
02:06:17,040 --> 02:06:22,470
named, double precision, which is like
a double was in C, using 64 bits instead

2578
02:06:22,470 --> 02:06:23,640
for more precision.

2579
02:06:23,640 --> 02:06:25,230
NUMERIC is kind of this catchall.

2580
02:06:25,230 --> 02:06:29,160
You can have not only dates and date
times but things like Boolean values.

2581
02:06:29,160 --> 02:06:31,200
You can specify the
total number of digits

2582
02:06:31,200 --> 02:06:34,180
to store using this numeric
scale and precision.

2583
02:06:34,180 --> 02:06:37,440
So it relates to numbers that
aren't just quite integers.

2584
02:06:37,440 --> 02:06:39,720
And then you also have
categories of TEXT--

2585
02:06:39,720 --> 02:06:42,570
char followed by a
number, which specifies

2586
02:06:42,570 --> 02:06:47,010
that every value in the column will
have the same number of characters,

2587
02:06:47,010 --> 02:06:50,190
that's helpful for things where you know
the length in advance, like in the US.

2588
02:06:50,190 --> 02:06:54,030
All states, all 50 states,
have two-character codes,

2589
02:06:54,030 --> 02:06:57,450
like MA for Massachusetts,
CA for California.

2590
02:06:57,450 --> 02:07:00,150
char(2) would be appropriate
there, because you

2591
02:07:00,150 --> 02:07:03,000
know every value in the column
is going to have two characters.

2592
02:07:03,000 --> 02:07:05,250
When you don't know,
though, you can use varchar.

2593
02:07:05,250 --> 02:07:08,400
And varchar specifies a
maximum number of characters.

2594
02:07:08,400 --> 02:07:12,060
And so you might specify
varchar of, like, 32.

2595
02:07:12,060 --> 02:07:15,600
No one might be able to type in a
name that's longer than 32 characters,

2596
02:07:15,600 --> 02:07:18,900
or varchar(200) if you want to
allow for something even bigger.

2597
02:07:18,900 --> 02:07:21,690
But this is germane to our
real-world experience with the web.

2598
02:07:21,690 --> 02:07:24,540
If you've ever gone to a website,
started filling out a form,

2599
02:07:24,540 --> 02:07:26,850
and all of a sudden you can't
type any more characters,

2600
02:07:26,850 --> 02:07:28,440
your response is too long--

2601
02:07:28,440 --> 02:07:29,462
why is that?

2602
02:07:29,462 --> 02:07:31,170
Well, one, the
programmers just might not

2603
02:07:31,170 --> 02:07:33,810
want you to keep expressing
yourself in more detail, especially

2604
02:07:33,810 --> 02:07:36,330
if it's a complaint form
on a customer service site.

2605
02:07:36,330 --> 02:07:40,620
But pragmatically, it's probably
because their database was designed

2606
02:07:40,620 --> 02:07:42,570
to store a finite number of characters.

2607
02:07:42,570 --> 02:07:44,025
And you have hit that threshold.

2608
02:07:44,025 --> 02:07:45,900
And you certainly don't
want to have a buffer

2609
02:07:45,900 --> 02:07:50,190
overflow, like in C. So the database
will enforce a maximum value n.

2610
02:07:50,190 --> 02:07:52,830
And then text is for even
bigger chunks of text.

2611
02:07:52,830 --> 02:07:54,930
If you're letting people
copy paste their resumes

2612
02:07:54,930 --> 02:07:59,680
or whole documents or even larger sets
of text, you might use text instead.

2613
02:07:59,680 --> 02:08:03,510
So let's then consider
a real-world data set.

2614
02:08:03,510 --> 02:08:07,320
Things get really interesting, and
all of these very academic ideas

2615
02:08:07,320 --> 02:08:09,600
and recommendations
really come into play

2616
02:08:09,600 --> 02:08:14,830
when we don't have hundreds of favorites
but when we have thousands instead.

2617
02:08:14,830 --> 02:08:19,180
And so what I'm going to go ahead
and do here is download a file here,

2618
02:08:19,180 --> 02:08:25,120
which is a SQLite version of the
IMDb, Internet Movie Database,

2619
02:08:25,120 --> 02:08:27,120
that some of you might
have used in website form

2620
02:08:27,120 --> 02:08:30,330
in order to look up movies and
ratings thereof and the like.

2621
02:08:30,330 --> 02:08:32,100
And what we've done
in advance is we wrote

2622
02:08:32,100 --> 02:08:38,490
a script that downloaded all of that
information in advance as TSV files.

2623
02:08:38,490 --> 02:08:42,600
It turns out that they, Internet
Movie Database, make all of their data

2624
02:08:42,600 --> 02:08:46,650
available as TSV files,
Tab-Separated Values.

2625
02:08:46,650 --> 02:08:54,010
And we went ahead and imported it with a
script into a file called shows.db as follows.

2626
02:08:54,010 --> 02:08:55,800
So I'm going to go
ahead in just a moment

2627
02:08:55,800 --> 02:08:59,520
and open up shows.db, which is
not the version I created earlier

2628
02:08:59,520 --> 02:09:00,990
based on your favorites.

2629
02:09:00,990 --> 02:09:02,820
This is now the version
that we, the staff,

2630
02:09:02,820 --> 02:09:06,630
created in advance by
downloading hundreds of thousands

2631
02:09:06,630 --> 02:09:10,950
of movies and TV shows and actors
and directors from IMDb.com

2632
02:09:10,950 --> 02:09:15,660
under their license and then
imported into a SQLite database.

2633
02:09:15,660 --> 02:09:17,130
So how can I see what's in here?

2634
02:09:17,130 --> 02:09:19,530
Well, let me go ahead
and type .schema, recall.

2635
02:09:19,530 --> 02:09:22,545
And you'll see a whole
bunch of data therein.

2636
02:09:22,545 --> 02:09:25,690
And in fact, in pictorial form,
it actually looks like this.

2637
02:09:25,690 --> 02:09:28,140
Here is a picture that just
gives you the lay of the land.

2638
02:09:28,140 --> 02:09:30,330
There's going to be a
people table that has

2639
02:09:30,330 --> 02:09:33,903
an ID for every person, a
name, and their birth year.

2640
02:09:33,903 --> 02:09:36,570
There's going to be a shows table,
just like we've been talking,

2641
02:09:36,570 --> 02:09:41,100
which is IDs, titles of shows-- also,
though, the year that the show debuted

2642
02:09:41,100 --> 02:09:43,380
and the number of episodes
that the show had.

2643
02:09:43,380 --> 02:09:46,410
Then there's going to be genres,
similar in design to before.

2644
02:09:46,410 --> 02:09:49,620
So we didn't go all out and
factor it out into a third table.

2645
02:09:49,620 --> 02:09:52,830
We just have some duplication
here, admittedly, in genres.

2646
02:09:52,830 --> 02:09:54,240
But then there's a ratings table.

2647
02:09:54,240 --> 02:09:57,240
And here's where you can see where
relational databases get interesting.

2648
02:09:57,240 --> 02:10:01,000
You can have a ratings table
storing ratings, like 1 to 5,

2649
02:10:01,000 --> 02:10:05,080
but also associate those ratings
with a show by way of its show_id.

2650
02:10:05,080 --> 02:10:08,440
And then you can keep track of the
number of votes that that show got.

2651
02:10:08,440 --> 02:10:10,910
Writers, notice, is a separate table.

2652
02:10:10,910 --> 02:10:12,560
And notice this is kind of cool.

2653
02:10:12,560 --> 02:10:19,060
This table, per the arrows, relates to
the shows table and the people table,

2654
02:10:19,060 --> 02:10:20,770
because this is a join table.

2655
02:10:20,770 --> 02:10:24,040
A foreign key of show_id and
a foreign key of person_id

2656
02:10:24,040 --> 02:10:28,250
refer to the shows table and
the people table respectively

2657
02:10:28,250 --> 02:10:32,710
so that a human person can be
a writer for multiple shows

2658
02:10:32,710 --> 02:10:36,560
and one show can have multiple writers,
another many-to-many relationship.

2659
02:10:36,560 --> 02:10:39,310
And then lastly, stars,
the actors in a show.

2660
02:10:39,310 --> 02:10:41,050
Notice that this, too, is a join table.

2661
02:10:41,050 --> 02:10:43,540
It's only got two
foreign keys, a show_id

2662
02:10:43,540 --> 02:10:47,447
and a person_id that are referring
back to those tables respectively.
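A minimal sketch of the join-table design just described, using Python's built-in sqlite3 module rather than the staff's actual import script. The column names birth, year, and episodes, the column types, and the IDs below are assumptions for illustration; the diagram only tells us what each table holds.
```python
import sqlite3
# In-memory database so the sketch leaves no file behind.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT NOT NULL, birth INTEGER);
    CREATE TABLE shows (id INTEGER PRIMARY KEY, title TEXT NOT NULL, year INTEGER, episodes INTEGER);
    -- Join table: nothing but two foreign keys.
    CREATE TABLE stars (
        show_id INTEGER NOT NULL REFERENCES shows(id),
        person_id INTEGER NOT NULL REFERENCES people(id)
    );
""")
# Hypothetical IDs: the person's name is stored once and referenced by ID.
db.execute("INSERT INTO people (id, name, birth) VALUES (1, 'Steve Carell', 1962)")
db.executemany(
    "INSERT INTO shows (id, title, year) VALUES (?, ?, ?)",
    [(10, "The Office", 2005), (11, "The Morning Show", 2019)],
)
db.executemany("INSERT INTO stars (show_id, person_id) VALUES (?, ?)", [(10, 1), (11, 1)])
# Two starring roles, but still only one row holding the name.
star_rows = db.execute("SELECT COUNT(*) FROM stars WHERE person_id = 1").fetchone()[0]
name_rows = db.execute("SELECT COUNT(*) FROM people").fetchone()[0]
print(star_rows, name_rows)
```
A writers table would have the same two-column shape, which is what lets one person row serve every role.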

2663
02:10:47,447 --> 02:10:50,030
And here's where it really makes
sense to use a relational database.

2664
02:10:50,030 --> 02:10:52,930
It would be pretty stupid
and bad design if you

2665
02:10:52,930 --> 02:10:57,520
had names of all of the directors
and names of all of the writers

2666
02:10:57,520 --> 02:11:01,840
and names of all of the stars of these
shows in separate tables in duplicate,

2667
02:11:01,840 --> 02:11:04,330
like Steve Carell, Steve
Carell, Steve Carell.

2668
02:11:04,330 --> 02:11:06,670
All of those actors and
directors and writers

2669
02:11:06,670 --> 02:11:11,450
and every other role in the business
are just people at the end of the day.

2670
02:11:11,450 --> 02:11:13,630
So in a relational
database, the advice would

2671
02:11:13,630 --> 02:11:16,570
be to put all of those
people in a people table

2672
02:11:16,570 --> 02:11:21,040
and then use primary and foreign
keys to refer to, to relate them to,

2673
02:11:21,040 --> 02:11:24,010
these other types of tables.

2674
02:11:24,010 --> 02:11:26,720
The catch is, though,
that when we do this,

2675
02:11:26,720 --> 02:11:31,280
it turns out that things can be
slow when we have lots of data.

2676
02:11:31,280 --> 02:11:33,250
So for instance, let me go into this.

2677
02:11:33,250 --> 02:11:37,210
Let me go ahead and
SELECT * FROM shows;.

2678
02:11:37,210 --> 02:11:38,343
That's a lot of data.

2679
02:11:38,343 --> 02:11:41,260
It's pretty fast on my Mac, and I
switched from the IDE to my Mac just

2680
02:11:41,260 --> 02:11:43,270
to save time, because it's
a little faster doing things

2681
02:11:43,270 --> 02:11:44,800
locally instead of in the cloud.

2682
02:11:44,800 --> 02:11:48,460
Let me go ahead and count the number
of shows in this IMDb database

2683
02:11:48,460 --> 02:11:49,720
by using COUNT.

2684
02:11:49,720 --> 02:11:53,390
153,331 TV shows.

2685
02:11:53,390 --> 02:11:54,250
So that's a lot.

2686
02:11:54,250 --> 02:11:59,110
How about the count of
people from the people table?

2687
02:11:59,110 --> 02:12:06,290
457,886 people who might be stars or
writers or some other role, as well.

2688
02:12:06,290 --> 02:12:07,765
So this is a sizable data set.

2689
02:12:07,765 --> 02:12:09,890
So let me go ahead and do
something simple, though.

2690
02:12:09,890 --> 02:12:14,560
Let me go ahead and SELECT * FROM
shows WHERE title = "The Office."

2691
02:12:14,560 --> 02:12:17,900
And this time, I don't have to worry
about weird capitalization or spacing.

2692
02:12:17,900 --> 02:12:18,850
This is IMDb.

2693
02:12:18,850 --> 02:12:21,727
This is clean data from
an authoritative source.

2694
02:12:21,727 --> 02:12:24,310
Notice that there's actually
different versions of The Office.

2695
02:12:24,310 --> 02:12:26,590
You probably know the
UK one and the US one.

2696
02:12:26,590 --> 02:12:30,520
There's other shows that are unrelated
to that particular type of show.

2697
02:12:30,520 --> 02:12:34,540
But each of them is distinguished,
notice, by the year here.

2698
02:12:34,540 --> 02:12:37,280
All right, so that's kind of a lot.

2699
02:12:37,280 --> 02:12:38,680
And let's do this again.

2700
02:12:38,680 --> 02:12:40,930
Let me go ahead and turn on
a feature temporarily just

2701
02:12:40,930 --> 02:12:44,000
to time this query by turning
on a timer in this program.

2702
02:12:44,000 --> 02:12:45,370
And let me run it again.

2703
02:12:45,370 --> 02:12:51,970
It looks like it took 0.012 seconds
of real time to do that search.

2704
02:12:51,970 --> 02:12:52,780
That's pretty fast.

2705
02:12:52,780 --> 02:12:55,180
I barely noticed, certainly
because it's so fast.

2706
02:12:55,180 --> 02:12:56,710
But let me go ahead and do this.

2707
02:12:56,710 --> 02:13:01,510
Let me go ahead and create an index
called title_index on the table

2708
02:13:01,510 --> 02:13:04,360
called shows on its title column.
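A small sketch of that CREATE INDEX statement, run through Python's sqlite3 module on a made-up shows table. EXPLAIN QUERY PLAN is a SQLite feature not used in the lecture; it is assumed here only as a way to see whether a query scans the whole table or uses the index, and its exact wording varies between SQLite versions.
```python
import sqlite3
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE shows (id INTEGER PRIMARY KEY, title TEXT, year INTEGER)")
# Made-up rows standing in for the real IMDb data.
db.executemany(
    "INSERT INTO shows (title, year) VALUES (?, ?)",
    [(f"Show {i}", 2000) for i in range(1000)] + [("The Office", 2005)],
)
def plan(query, *args):
    """Return SQLite's own description of how it would run the query."""
    rows = db.execute("EXPLAIN QUERY PLAN " + query, args).fetchall()
    return " ".join(row[-1] for row in rows)
query = "SELECT * FROM shows WHERE title = ?"
before = plan(query, "The Office")  # no index yet: a scan of the table
db.execute("CREATE INDEX title_index ON shows (title)")
after = plan(query, "The Office")   # now a search that names title_index
print(before)
print(after)
```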

2709
02:13:04,360 --> 02:13:05,470
Well, what am I doing?

2710
02:13:05,470 --> 02:13:08,680
Well, to answer the question finally
from before about performance,

2711
02:13:08,680 --> 02:13:11,340
by default, everything we've
been doing is indeed big O of n.

2712
02:13:11,340 --> 02:13:13,090
It's just being linearly
searched from top

2713
02:13:13,090 --> 02:13:16,630
to bottom, which seems to call into
question the whole purpose of SQL if we

2714
02:13:16,630 --> 02:13:18,850
were doing no better than with CSVs.

2715
02:13:18,850 --> 02:13:22,480
But an index is a clue
to the database to load

2716
02:13:22,480 --> 02:13:25,960
the data more efficiently in such a
way that you get logarithmic time.

2717
02:13:25,960 --> 02:13:30,520
An index is a fancy data structure
that the SQLite database or the Oracle

2718
02:13:30,520 --> 02:13:33,520
database or the MySQL database,
whatever product you're using,

2719
02:13:33,520 --> 02:13:35,680
builds up for you in memory.

2720
02:13:35,680 --> 02:13:38,560
And then it does something
using syntax like this

2721
02:13:38,560 --> 02:13:42,340
that builds in memory generally
something known as a B-tree.

2722
02:13:42,340 --> 02:13:44,178
We've talked a bit about
trees in the class.

2723
02:13:44,178 --> 02:13:46,720
We talked about binary search
trees, things that kind of look

2724
02:13:46,720 --> 02:13:47,920
like family trees.

2725
02:13:47,920 --> 02:13:50,620
A B-tree is essentially
a family tree that's

2726
02:13:50,620 --> 02:13:53,020
just very wide and not that tall.

2727
02:13:53,020 --> 02:13:56,500
It's a data structure similar in
spirit to what we looked at in C.

2728
02:13:56,500 --> 02:13:59,830
But it tries to keep all of the
leaf nodes, all of the children

2729
02:13:59,830 --> 02:14:02,230
or grandchildren or
great-grandchildren, so to speak,

2730
02:14:02,230 --> 02:14:04,390
as close to the root as possible.

2731
02:14:04,390 --> 02:14:08,320
And the algorithm it uses for that
tends to be proprietary or documented

2732
02:14:08,320 --> 02:14:09,970
based on the system you're using.

2733
02:14:09,970 --> 02:14:12,100
But it doesn't store things in a list.

2734
02:14:12,100 --> 02:14:17,620
It does not store things top to bottom,
like the tables we view them as.

2735
02:14:17,620 --> 02:14:21,640
Underneath the hood, those tables
that look like very tall structures

2736
02:14:21,640 --> 02:14:23,770
are actually, underneath
the hood, implemented

2737
02:14:23,770 --> 02:14:25,820
with fancier things called trees.

2738
02:14:25,820 --> 02:14:29,710
And if we create those trees by creating
what are properly called indexes

2739
02:14:29,710 --> 02:14:34,660
like this, it might take us a moment,
like 0.098 seconds, to create an index.

2740
02:14:34,660 --> 02:14:36,220
But now notice what happens.

2741
02:14:36,220 --> 02:14:40,210
Previously, when I searched the titles
for The Office, using linear search,

2742
02:14:40,210 --> 02:14:43,180
it took 0.012 seconds.

2743
02:14:43,180 --> 02:14:46,750
If I do the same query again
after having created the index

2744
02:14:46,750 --> 02:14:50,920
and having told SQLite, build me
this fancy tree in memory, voila.

2745
02:14:50,920 --> 02:14:55,450
0.001 seconds, so roughly an order
of magnitude faster.

2746
02:14:55,450 --> 02:14:57,550
Now, both are fast to
us humans, certainly.

2747
02:14:57,550 --> 02:15:01,040
But imagine the data set being even
bigger, the query being even bigger.

2748
02:15:01,040 --> 02:15:05,900
These indexes can get
even larger than that.

2749
02:15:05,900 --> 02:15:07,970
Rather, the queries can
take longer than that

2750
02:15:07,970 --> 02:15:11,130
and therefore take even
more time than that.

2751
02:15:11,130 --> 02:15:13,940
But unfortunately, if I've got
all of my data all over the place,

2752
02:15:13,940 --> 02:15:16,970
as in a diagram like this, my god.

2753
02:15:16,970 --> 02:15:18,770
How do I actually get useful work done?

2754
02:15:18,770 --> 02:15:21,590
How do I get back the people
in a movie and the writers

2755
02:15:21,590 --> 02:15:24,260
and the stars and the ratings
if it's all over the place?

2756
02:15:24,260 --> 02:15:26,840
I would seem to have created
such a mess and that I now

2757
02:15:26,840 --> 02:15:28,910
need to execute all of these queries.

2758
02:15:28,910 --> 02:15:32,000
But notice it doesn't have
to be that complicated.

2759
02:15:32,000 --> 02:15:35,660
It turns out that there's another
keyword in SQL, really the last

2760
02:15:35,660 --> 02:15:38,150
that we'll look at here, called JOIN.

2761
02:15:38,150 --> 02:15:41,480
The JOIN keyword, which you can
use implicitly or explicitly,

2762
02:15:41,480 --> 02:15:45,470
allows you to just join tables
together and sort of reconstitute

2763
02:15:45,470 --> 02:15:47,760
a bigger, more user friendly table.

2764
02:15:47,760 --> 02:15:51,020
So for instance, suppose I want to
get all of Steve Carell's TV shows,

2765
02:15:51,020 --> 02:15:52,250
not just The Office.

2766
02:15:52,250 --> 02:15:55,880
Well, recall that I can select
Steve's ID from the people

2767
02:15:55,880 --> 02:15:59,390
table WHERE name = "Steve Carell."

2768
02:15:59,390 --> 02:16:02,780
So again, he has a different ID in
this table, because this is from IMDb.

2769
02:16:02,780 --> 02:16:04,400
But there's his ID.

2770
02:16:04,400 --> 02:16:07,260
And let me go ahead and
turn the timer off for now.

2771
02:16:07,260 --> 02:16:07,760
All right.

2772
02:16:07,760 --> 02:16:11,510
So there is his ID, 126797.

2773
02:16:11,510 --> 02:16:14,780
I could copy paste that into my
code, but that's not necessary

2774
02:16:14,780 --> 02:16:16,490
thanks to these nested queries.

2775
02:16:16,490 --> 02:16:18,660
I can do something like this.

2776
02:16:18,660 --> 02:16:23,720
Let me go ahead and now select all
of the show_ids from the stars table

2777
02:16:23,720 --> 02:16:29,790
where person_id from that
table is equal to this result.

2778
02:16:29,790 --> 02:16:33,240
So there's that join table, stars,
that links people and shows.

2779
02:16:33,240 --> 02:16:35,370
So let me go ahead and execute that.

2780
02:16:35,370 --> 02:16:35,870
All right.

2781
02:16:35,870 --> 02:16:39,559
So there's all of the show_ids
of Steve Carell's TV shows.

2782
02:16:39,559 --> 02:16:40,379
That's a lot.

2783
02:16:40,379 --> 02:16:42,139
And it's very nonobvious what they are.

2784
02:16:42,139 --> 02:16:45,680
So let me do another nested query by
putting all of that in parentheses

2785
02:16:45,680 --> 02:16:51,530
and now SELECT title FROM
shows WHERE the ID of the show

2786
02:16:51,530 --> 02:16:55,820
is in this big, long list of show_ids.

2787
02:16:55,820 --> 02:17:00,260
And there are all of the shows that
he's in, including The Dana Carvey Show

2788
02:17:00,260 --> 02:17:04,430
back when, The Office up at the
top, and then, most recently,

2789
02:17:04,430 --> 02:17:07,142
shows like The Morning Show on Apple TV.
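Here is a hedged sketch of those nested queries, using Python's sqlite3 module and a tiny made-up data set. The IDs and the extra person and show are hypothetical; only the table and column names follow the lecture.
```python
import sqlite3
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE shows (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE stars (show_id INTEGER, person_id INTEGER);
    INSERT INTO people VALUES (1, 'Steve Carell'), (2, 'Someone Else');
    INSERT INTO shows VALUES (10, 'The Office'), (11, 'The Morning Show'), (12, 'Another Show');
    INSERT INTO stars VALUES (10, 1), (11, 1), (12, 2);
""")
# Innermost query finds the person's ID, the middle one finds that
# person's show_ids, and the outer one turns those IDs into titles.
rows = db.execute(
    """
    SELECT title FROM shows WHERE id IN
        (SELECT show_id FROM stars WHERE person_id =
            (SELECT id FROM people WHERE name = ?))
    ORDER BY title
    """,
    ("Steve Carell",),
).fetchall()
titles = [row[0] for row in rows]
print(titles)
```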

2790
02:17:07,142 --> 02:17:09,350
All right, so that's pretty
cool that we can actually

2791
02:17:09,350 --> 02:17:11,129
reconstitute the data like that.

2792
02:17:11,129 --> 02:17:13,889
But it turns out there's different
ways of doing that, as well.

2793
02:17:13,889 --> 02:17:15,950
And you'll see more of
this in the coming weeks

2794
02:17:15,950 --> 02:17:18,150
and in the problem sets
and labs and the like.

2795
02:17:18,150 --> 02:17:19,879
But it turns out we can
do other things, as well.

2796
02:17:19,879 --> 02:17:21,962
And let me just show this
syntax even though it'll

2797
02:17:21,962 --> 02:17:23,670
look a little cryptic at first glance.

2798
02:17:23,670 --> 02:17:26,299
You can also use that
JOIN keyword as follows.

2799
02:17:26,299 --> 02:17:33,350
I can select the title from the people
table joined with the stars table

2800
02:17:33,350 --> 02:17:39,959
on the people.id column equaling
the stars.person_id column.

2801
02:17:39,959 --> 02:17:42,799
So in other words, I can
select a title from the result

2802
02:17:42,799 --> 02:17:46,940
of joining people and stars, like
this, on the id column in one

2803
02:17:46,940 --> 02:17:49,129
and the person_id column in the other.

2804
02:17:49,129 --> 02:17:58,879
And I can join in the shows table on
the stars.show_id equaling the shows.id.

2805
02:17:58,879 --> 02:18:03,799
So again, now I'm joining the primary
and foreign keys on these two tables

2806
02:18:03,799 --> 02:18:07,700
where the name equals "Steve Carell."

2807
02:18:07,700 --> 02:18:10,070
So this is the most cryptic
thing we've seen yet.

2808
02:18:10,070 --> 02:18:12,530
But it just means take this
table and join it with this one

2809
02:18:12,530 --> 02:18:16,580
and then join it with this one and
filter all of the resulting joined rows

2810
02:18:16,580 --> 02:18:18,530
by a name of Steve Carell.

2811
02:18:18,530 --> 02:18:19,520
And voila.

2812
02:18:19,520 --> 02:18:22,469
There we have all of
those answers, as well.
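The same question written with explicit JOINs might look like the sketch below, again with Python's sqlite3 module and made-up rows. The IDs and the extra person and show are assumptions for illustration.
```python
import sqlite3
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE shows (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE stars (show_id INTEGER, person_id INTEGER);
    INSERT INTO people VALUES (1, 'Steve Carell'), (2, 'Someone Else');
    INSERT INTO shows VALUES (10, 'The Office'), (11, 'The Morning Show'), (12, 'Another Show');
    INSERT INTO stars VALUES (10, 1), (11, 1), (12, 2);
""")
# Join people to stars on the person's ID, then stars to shows on the
# show's ID, and finally filter the joined rows by name.
rows = db.execute(
    """
    SELECT title FROM people
    JOIN stars ON people.id = stars.person_id
    JOIN shows ON stars.show_id = shows.id
    WHERE name = ?
    ORDER BY title
    """,
    ("Steve Carell",),
).fetchall()
joined_titles = [row[0] for row in rows]
print(joined_titles)
```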

2813
02:18:22,469 --> 02:18:25,129
And there's other ways
of doing this, too.

2814
02:18:25,129 --> 02:18:27,809
I'll leave unsaid now some
of the syntax for that.

2815
02:18:27,809 --> 02:18:29,480
But that felt a little slow.

2816
02:18:29,480 --> 02:18:32,090
And in fact, let me go ahead
and turn my timer back on.

2817
02:18:32,090 --> 02:18:34,610
Let me re-execute this last query.

2818
02:18:34,610 --> 02:18:40,879
SELECT title FROM people joining
on stars, joining on shows

2819
02:18:40,879 --> 02:18:42,650
WHERE name = "Steve Carell."

2820
02:18:42,650 --> 02:18:44,700
That took over half a second.

2821
02:18:44,700 --> 02:18:47,480
So that was actually
admittedly kind of slow.

2822
02:18:47,480 --> 02:18:50,209
But again, indexes come to
the rescue if, again, we

2823
02:18:50,209 --> 02:18:52,610
don't allow linear search to dominate.

2824
02:18:52,610 --> 02:18:54,889
But let me go ahead and
create a few indexes.

2825
02:18:54,889 --> 02:19:01,940
Create an index called person_index on
the stars table, the person_id column.

2826
02:19:01,940 --> 02:19:02,570
Why?

2827
02:19:02,570 --> 02:19:05,600
Well, my query a moment ago
used the person_id column.

2828
02:19:05,600 --> 02:19:06,510
It filtered on it.

2829
02:19:06,510 --> 02:19:08,000
So that might be a bottleneck.

2830
02:19:08,000 --> 02:19:12,290
I'm going to go ahead and create
another index called show_index

2831
02:19:12,290 --> 02:19:14,870
on the stars table on show_id.

2832
02:19:14,870 --> 02:19:18,290
Similarly, a moment ago, my
query used the show_id column.

2833
02:19:18,290 --> 02:19:21,743
And so that, too, might have been a
bottleneck linearly, top to bottom.

2834
02:19:21,743 --> 02:19:22,910
So let me create that index.

2835
02:19:22,910 --> 02:19:25,368
And then lastly, let me create
an index called name_index--

2836
02:19:25,368 --> 02:19:28,459
and this is perhaps the most obvious,
similar to the show titles before--

2837
02:19:28,459 --> 02:19:31,549
on the people table on the name column.

2838
02:19:31,549 --> 02:19:32,930
And that, too, took a moment.

2839
02:19:32,930 --> 02:19:35,330
Now, in total, this took
almost a full second.

2840
02:19:35,330 --> 02:19:37,850
But these indexes only get created once.

2841
02:19:37,850 --> 02:19:40,070
They get maintained
automatically over time.

2842
02:19:40,070 --> 02:19:42,080
But you don't incur
this with every query.
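Those three statements could be written as in the sketch below, shown with Python's sqlite3 module against empty stand-in tables. Listing the indexes from sqlite_master is an addition here, just to confirm that they were created.
```python
import sqlite3
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE stars (show_id INTEGER, person_id INTEGER);
    -- One index per column that the JOIN query filters or joins on.
    CREATE INDEX person_index ON stars (person_id);
    CREATE INDEX show_index ON stars (show_id);
    CREATE INDEX name_index ON people (name);
""")
# SQLite records every index it maintains in its sqlite_master table.
index_names = sorted(
    row[0] for row in db.execute("SELECT name FROM sqlite_master WHERE type = 'index'")
)
print(index_names)
```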

2843
02:19:42,080 --> 02:19:44,389
Now let me do my SELECT again.

2844
02:19:44,389 --> 02:19:48,800
Let me SELECT title FROM
people joining the stars table,

2845
02:19:48,800 --> 02:19:52,730
joining the shows table
WHERE name = "Steve Carell."

2846
02:19:52,730 --> 02:19:53,690
Boom.

2847
02:19:53,690 --> 02:19:56,630
0.001 seconds.

2848
02:19:56,630 --> 02:20:00,930
That was orders of magnitude faster
than the more than half a second

2849
02:20:00,930 --> 02:20:02,620
it took us a little bit ago.

2850
02:20:02,620 --> 02:20:05,860
So here, too, you see the
power of a relational database.

2851
02:20:05,860 --> 02:20:08,912
So even though we've created some
problems for ourselves over time,

2852
02:20:08,912 --> 02:20:12,120
we've solved them ultimately-- granted,
with some more sophisticated features

2853
02:20:12,120 --> 02:20:13,320
and additional syntax.

2854
02:20:13,320 --> 02:20:15,990
But a relational database
is indeed why you use them

2855
02:20:15,990 --> 02:20:19,470
in the real world for the Twitters, the
Instagrams, the Facebooks, the Googles,

2856
02:20:19,470 --> 02:20:22,590
because they can store
data so efficiently

2857
02:20:22,590 --> 02:20:25,960
without redundancy, because you can
normalize them and factor everything

2858
02:20:25,960 --> 02:20:26,460
out.

2859
02:20:26,460 --> 02:20:28,740
But they can still
maintain the relations

2860
02:20:28,740 --> 02:20:30,570
that you might have
seen in a spreadsheet

2861
02:20:30,570 --> 02:20:32,940
but using something closer
to logarithmic thanks

2862
02:20:32,940 --> 02:20:34,770
to those tree structures.

2863
02:20:34,770 --> 02:20:35,910
But there are problems.

2864
02:20:35,910 --> 02:20:38,880
And what we wanted to do is end
on today two primary problems

2865
02:20:38,880 --> 02:20:42,570
that are introduced with SQL,
because they are just unfortunately

2866
02:20:42,570 --> 02:20:43,920
so commonly done.

2867
02:20:43,920 --> 02:20:45,462
Notice this here.

2868
02:20:45,462 --> 02:20:47,670
There is something generally
known as a SQL injection

2869
02:20:47,670 --> 02:20:51,330
attack, which you are
vulnerable to in any application

2870
02:20:51,330 --> 02:20:52,830
where you're taking user input.

2871
02:20:52,830 --> 02:20:55,800
That hasn't been an issue
for my favorites.py file,

2872
02:20:55,800 --> 02:20:58,260
where I only took input from a CSV.

2873
02:20:58,260 --> 02:21:00,510
But if one of you were
malicious, what if one of you

2874
02:21:00,510 --> 02:21:03,750
had maliciously typed in the
word "delete" or "update"

2875
02:21:03,750 --> 02:21:06,180
or something else as
the title of your show

2876
02:21:06,180 --> 02:21:11,040
and I accidentally plugged it into my
own Python code when executing a query?

2877
02:21:11,040 --> 02:21:14,940
You could potentially
inject SQL into my own code.

2878
02:21:14,940 --> 02:21:15,750
How might that be?

2879
02:21:15,750 --> 02:21:18,960
Well, if logging in via Yale, you'll
typically see a form like this.

2880
02:21:18,960 --> 02:21:21,850
Or logging in via Harvard to
something, you'll see a form like this.

2881
02:21:21,850 --> 02:21:23,767
Here's an example that
I'm pretty sure neither

2882
02:21:23,767 --> 02:21:25,710
Harvard nor Yale are vulnerable to.

2883
02:21:25,710 --> 02:21:28,590
Suppose I type in my email
address to this login form

2884
02:21:28,590 --> 02:21:32,350
as malan@harvard.edu'--.

2885
02:21:32,350 --> 02:21:34,890
It turns out, in SQL, --

2886
02:21:34,890 --> 02:21:38,250
is the symbol for commenting if
you want to comment something out.

2887
02:21:38,250 --> 02:21:40,428
It turns out that the
single quote is used

2888
02:21:40,428 --> 02:21:43,470
when you want to search for something
like Steve Carell or, in this case,

2889
02:21:43,470 --> 02:21:44,930
malan@harvard.edu.

2890
02:21:44,930 --> 02:21:45,930
It can be double quotes.

2891
02:21:45,930 --> 02:21:47,040
It can be single quotes.

2892
02:21:47,040 --> 02:21:50,040
In this case, I'm using
single quotes here.

2893
02:21:50,040 --> 02:21:53,400
But let's consider some sample
code, if you will, in Python.

2894
02:21:53,400 --> 02:21:56,910
Here's a line of code that I
propose might exist in the backend

2895
02:21:56,910 --> 02:22:00,180
for Harvard's authentication
or Yale's or anyone else's.

2896
02:22:00,180 --> 02:22:04,890
Maybe someone wrote some Python code
like this using SELECT * FROM users

2897
02:22:04,890 --> 02:22:06,870
WHERE username = ?

2898
02:22:06,870 --> 02:22:10,770
AND password = ?, and they
plugged in username and password.

2899
02:22:10,770 --> 02:22:13,770
Whatever the user typed into
that web form a moment ago gets

2900
02:22:13,770 --> 02:22:16,270
plugged in here to these question marks.

2901
02:22:16,270 --> 02:22:17,290
This is good.

2902
02:22:17,290 --> 02:22:20,980
This is good code, because you're
using the SQL question marks.

2903
02:22:20,980 --> 02:22:24,315
So if you literally just do what we
preach today and use these question

2904
02:22:24,315 --> 02:22:27,870
mark placeholders, you are safe
from SQL injection attacks.

2905
02:22:27,870 --> 02:22:29,760
Unfortunately, there
are too many developers

2906
02:22:29,760 --> 02:22:34,950
in the world that don't practice this
or don't realize this or do forget this.

2907
02:22:34,950 --> 02:22:38,850
If you instead resort to
Python approaches like this,

2908
02:22:38,850 --> 02:22:42,910
where you use an f-string instead,
which might be your instincts after last

2909
02:22:42,910 --> 02:22:45,660
week, because they're wonderfully
convenient with the curly braces

2910
02:22:45,660 --> 02:22:46,290
and all--

2911
02:22:46,290 --> 02:22:50,370
suppose that you literally
plug in username and password

2912
02:22:50,370 --> 02:22:53,430
not with the question mark
placeholders but just literally

2913
02:22:53,430 --> 02:22:55,260
in between those curly braces.

2914
02:22:55,260 --> 02:22:58,210
Watch what happens if my
username, malan@harvard.edu,

2915
02:22:58,210 --> 02:23:03,120
was actually typed in by me
maliciously as malan@harvard.edu'--.

2916
02:23:03,120 --> 02:23:05,691


2917
02:23:05,691 --> 02:23:09,030
That would have the effect
of tricking this Python

2918
02:23:09,030 --> 02:23:11,610
code into doing essentially this.

2919
02:23:11,610 --> 02:23:13,590
Let me do a find and replace.

2920
02:23:13,590 --> 02:23:22,881
It would trick Python into executing
username = 'malan@harvard.edu'--'

2921
02:23:22,881 --> 02:23:24,660
and then other stuff.

2922
02:23:24,660 --> 02:23:27,480
Unfortunately, the --
again means comment,

2923
02:23:27,480 --> 02:23:33,390
which means you could maybe trick a
server into ignoring the whole password

2924
02:23:33,390 --> 02:23:35,190
part of this SQL query.

2925
02:23:35,190 --> 02:23:37,530
And if the SQL query's
purpose in life is to check,

2926
02:23:37,530 --> 02:23:42,210
is this username and password valid, so
that you can decide to log the user in

2927
02:23:42,210 --> 02:23:44,880
or to say, no, you're
not authorized, well,

2928
02:23:44,880 --> 02:23:48,390
by essentially commenting out
everything related to password,

2929
02:23:48,390 --> 02:23:49,710
notice what I've done.

2930
02:23:49,710 --> 02:23:55,620
I've just now theoretically logged
myself in as malan@harvard.edu without

2931
02:23:55,620 --> 02:24:00,030
even knowing or inputting a password,
because I injected SQL syntax,

2932
02:24:00,030 --> 02:24:04,620
the quote and the --, into my query,
tricking the server into just ignoring

2933
02:24:04,620 --> 02:24:06,870
the password equality check.
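To make the contrast concrete, here is a sketch that runs both versions against a throwaway users table. It uses Python's sqlite3 module in place of CS50's db.execute; the table contents and the stored password are hypothetical, and a real system would not store passwords in plaintext.
```python
import sqlite3
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (username TEXT, password TEXT)")
db.execute("INSERT INTO users VALUES (?, ?)", ("malan@harvard.edu", "hypothetical-password"))
username = "malan@harvard.edu'--"  # malicious input from the login form
password = ""                      # the attacker does not know the password
# UNSAFE: the f-string pastes the input into the SQL itself, so the quote
# ends the string early and the -- comments out the password check.
unsafe_sql = f"SELECT * FROM users WHERE username = '{username}' AND password = '{password}'"
unsafe_rows = db.execute(unsafe_sql).fetchall()
# SAFE: with ? placeholders the input is passed as data, never as SQL,
# so it is compared literally and matches no one.
safe_rows = db.execute(
    "SELECT * FROM users WHERE username = ? AND password = ?",
    (username, password),
).fetchall()
print(len(unsafe_rows), len(safe_rows))
```
The unsafe query comes back with a row even though no password was supplied; the placeholder version comes back empty.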

2934
02:24:06,870 --> 02:24:11,250
And so it turns out that db.execute,
when you execute an INSERT,

2935
02:24:11,250 --> 02:24:15,240
it returns to you as said the
ID of the newly inserted row.

2936
02:24:15,240 --> 02:24:20,370
When you use db.execute to select
rows from a database table,

2937
02:24:20,370 --> 02:24:25,360
it returns to you a list of rows,
each of which is a dictionary.

2938
02:24:25,360 --> 02:24:28,110
So this is now pseudocode
down here with my comment.

2939
02:24:28,110 --> 02:24:31,140
But if you get back one
row, that would seem

2940
02:24:31,140 --> 02:24:34,470
to imply that there is a
user named malan@harvard.edu.

2941
02:24:34,470 --> 02:24:37,830
Don't know what his password is, because
whoever this person is maliciously

2942
02:24:37,830 --> 02:24:40,860
tricked the server into
ignoring that syntax.

2943
02:24:40,860 --> 02:24:43,890
So SQL injection attacks
are unfortunately

2944
02:24:43,890 --> 02:24:46,570
one of the most common
attacks against SQL databases.

2945
02:24:46,570 --> 02:24:51,090
They are completely preventable if
you simply use placeholders and use

2946
02:24:51,090 --> 02:24:53,940
libraries, whether it's CS50's
or other third-party libraries

2947
02:24:53,940 --> 02:24:55,440
that you may use down the road.

2948
02:24:55,440 --> 02:24:58,530
A common meme on the internet
is this picture here.

2949
02:24:58,530 --> 02:25:00,810
If we zoom in on this
person's license plate

2950
02:25:00,810 --> 02:25:02,970
or where the license
plate should be, this

2951
02:25:02,970 --> 02:25:05,940
is an example of someone
theoretically trying

2952
02:25:05,940 --> 02:25:10,350
to trick some camera on the highway
into dropping the whole database.

2953
02:25:10,350 --> 02:25:13,710
DROP is another keyword in SQL
that deletes a database table.

2954
02:25:13,710 --> 02:25:15,810
And this person was either
intentionally or just

2955
02:25:15,810 --> 02:25:19,980
humorously trying to
trick it into executing SQL

2956
02:25:19,980 --> 02:25:21,760
by using syntax like this.

2957
02:25:21,760 --> 02:25:26,070
So characters like single quotes,
--, semicolons are all potentially

2958
02:25:26,070 --> 02:25:29,190
dangerous characters in SQL if
they're passed through unchanged

2959
02:25:29,190 --> 02:25:30,120
to the database.

2960
02:25:30,120 --> 02:25:34,140
A very popular xkcd comic-- let me
give you a moment to just read this--

2961
02:25:34,140 --> 02:25:40,080
is another well-known meme of
sorts now in computer science.

2962
02:25:40,080 --> 02:25:43,880
If you'd like to, read
this one on your own.

2963
02:25:43,880 --> 02:25:51,170
But henceforth, you are now in the
family of educated learners who

2964
02:25:51,170 --> 02:25:54,410
know who Little Bobby Tables is.

2965
02:25:54,410 --> 02:25:56,240
Unfortunately, it's
dead silence in here,

2966
02:25:56,240 --> 02:25:58,340
so I can't tell if anyone is
actually laughing at this joke.

2967
02:25:58,340 --> 02:26:00,110
But anyhow, this is a
very well-known meme.

2968
02:26:00,110 --> 02:26:02,690
So if you're a computer scientist
who knows SQL, you know this one.

2969
02:26:02,690 --> 02:26:05,565
And there's one last problem we'd
like to introduce if you don't mind

2970
02:26:05,565 --> 02:26:07,250
just a couple of final moments here.

2971
02:26:07,250 --> 02:26:09,530
And that is a fundamental
problem in computing

2972
02:26:09,530 --> 02:26:11,690
called race conditions,
which for the first time

2973
02:26:11,690 --> 02:26:14,300
is now manifest in
our discussion of SQL.

2974
02:26:14,300 --> 02:26:18,230
It turns out that SQL and SQL
databases are very often used, again,

2975
02:26:18,230 --> 02:26:21,380
in the real world for very
high-performing applications.

2976
02:26:21,380 --> 02:26:24,320
And by that, I mean, again, the
Googles, the Facebooks, the Twitters

2977
02:26:24,320 --> 02:26:28,490
of the world where lots and lots of
data is coming into servers all at once.

2978
02:26:28,490 --> 02:26:30,200
And case in point,
some of you might have

2979
02:26:30,200 --> 02:26:33,320
clicked Like on this egg some time ago.

2980
02:26:33,320 --> 02:26:35,690
This is the most-liked
Instagram post ever.

2981
02:26:35,690 --> 02:26:39,710
As of last night, it was up
to 50-plus million likes.

2982
02:26:39,710 --> 02:26:42,620
Well eclipsed Kim
Kardashian's previous post,

2983
02:26:42,620 --> 02:26:44,690
which is still at 18 million or so.

2984
02:26:44,690 --> 02:26:47,780
This is to say this is
a hard problem to solve,

2985
02:26:47,780 --> 02:26:51,800
this notion of likes coming
in at such an incredible rate.

2986
02:26:51,800 --> 02:26:55,310
Because suppose that, long
story short, Instagram actually

2987
02:26:55,310 --> 02:26:57,290
has a server with a SQL database.

2988
02:26:57,290 --> 02:27:01,490
And they have code in Python or C++
or whatever language that's talking

2989
02:27:01,490 --> 02:27:02,660
to that database.

2990
02:27:02,660 --> 02:27:04,910
And suppose that they
have code that's trying

2991
02:27:04,910 --> 02:27:06,680
to increment the total number of likes.

2992
02:27:06,680 --> 02:27:08,240
Well, how might this work logically?

2993
02:27:08,240 --> 02:27:11,660
Well, in order to increment the number
of likes that a picture like this egg

2994
02:27:11,660 --> 02:27:14,060
has, you might first
select from the database

2995
02:27:14,060 --> 02:27:18,260
the current number of likes for
the ID of that egg photograph.

2996
02:27:18,260 --> 02:27:19,790
Then you might add 1 to it.

2997
02:27:19,790 --> 02:27:21,797
Then you might update the database.

2998
02:27:21,797 --> 02:27:24,630
And I didn't use it before, but
just like there's INSERT and DELETE,

2999
02:27:24,630 --> 02:27:26,010
there's UPDATE, as well.

3000
02:27:26,010 --> 02:27:29,600
So you might update the database
with the new count plus 1.

3001
02:27:29,600 --> 02:27:31,970
So the code for that might
look a little something

3002
02:27:31,970 --> 02:27:35,600
like this, three lines of code
using CS50's library here,

3003
02:27:35,600 --> 02:27:40,010
where you execute SELECT likes
FROM posts WHERE id = ?,

3004
02:27:40,010 --> 02:27:42,890
where id is the unique
identifier for that egg.

3005
02:27:42,890 --> 02:27:45,740
And then I'm storing the
result in a rows variable,

3006
02:27:45,740 --> 02:27:48,950
which, again, I claim is a list of rows.

3007
02:27:48,950 --> 02:27:52,130
I'm going to go into the first
row, so that's rows bracket 0.

3008
02:27:52,130 --> 02:27:55,070
And I'm going to go into the likes
column to get the actual number.

3009
02:27:55,070 --> 02:27:57,140
And that number, I'm going to
store in a variable called likes.

3010
02:27:57,140 --> 02:27:58,880
So this is going to
be, like, 50,000,000,

3011
02:27:58,880 --> 02:28:01,100
and I want it to go to 50,000,001.

3012
02:28:01,100 --> 02:28:02,370
So how do I do that?

3013
02:28:02,370 --> 02:28:08,780
Well, I execute on the database
UPDATE posts SET likes = ?.

3014
02:28:08,780 --> 02:28:10,980
And then I just plug in likes + 1.
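
The three steps just described (select the count, add 1, update the row) can be sketched roughly as follows. This is a minimal illustration, not the code from the slide: it uses Python's built-in sqlite3 module in place of CS50's library, the `posts` table and its `id` and `likes` columns are taken from the queries read aloud, the id value 1 is a hypothetical stand-in for the egg photograph, and the `WHERE id = ?` on the UPDATE is an assumption added so that only one row changes.

```python
import sqlite3

# Stand-in database: one posts table with id and likes columns (in memory).
db = sqlite3.connect(":memory:")
db.row_factory = sqlite3.Row  # lets a row be indexed by column name
db.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, likes INTEGER NOT NULL)")
db.execute("INSERT INTO posts (id, likes) VALUES (?, ?)", (1, 50_000_000))

post_id = 1  # hypothetical id for the egg photograph

# Step 1: select the current number of likes for that post.
rows = db.execute("SELECT likes FROM posts WHERE id = ?", (post_id,)).fetchall()
likes = rows[0]["likes"]

# Steps 2 and 3: add 1 and write the new count back.
# The WHERE clause is an assumption, added so only this post is updated.
db.execute("UPDATE posts SET likes = ? WHERE id = ?", (likes + 1, post_id))
db.commit()

new_likes = db.execute(
    "SELECT likes FROM posts WHERE id = ?", (post_id,)
).fetchone()["likes"]
print(new_likes)
```

Run on its own, with nobody else touching the table, this behaves as intended; the trouble described next only appears when several copies run at once.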

3015
02:28:10,980 --> 02:28:15,020
The problem, though, with the Instagrams
and Googles and Twitters of the world

3016
02:28:15,020 --> 02:28:16,790
is that they don't just have one server.

3017
02:28:16,790 --> 02:28:18,710
They have many thousands of servers.

3018
02:28:18,710 --> 02:28:22,580
And all of those servers might in
parallel be receiving clicks from you

3019
02:28:22,580 --> 02:28:23,960
and me on the internet.

3020
02:28:23,960 --> 02:28:28,310
And those clicks translate into this
code getting executed, executed,

3021
02:28:28,310 --> 02:28:28,970
executed.

3022
02:28:28,970 --> 02:28:32,930
And the problem is that when you have
three lines of code and suppose Brian

3023
02:28:32,930 --> 02:28:35,420
and I click on that egg
at roughly the same time,

3024
02:28:35,420 --> 02:28:40,010
my three lines might not get executed
before his three lines or vice versa.

3025
02:28:40,010 --> 02:28:42,650
They might get commingled
chronologically.

3026
02:28:42,650 --> 02:28:46,130
My first line might get executed, then
Brian's first line might get executed.

3027
02:28:46,130 --> 02:28:48,750
My second line might get
executed, Brian's second line.

3028
02:28:48,750 --> 02:28:50,960
So they might get interspersed
on different servers

3029
02:28:50,960 --> 02:28:53,900
or just temporally in
time, chronologically.

3030
02:28:53,900 --> 02:28:56,690
That's problematic, because
suppose Brian and I click

3031
02:28:56,690 --> 02:28:58,580
on that egg roughly at the same time.

3032
02:28:58,580 --> 02:29:01,010
And we get back the same
answer to the SELECT query.

3033
02:29:01,010 --> 02:29:03,290
50 million is the current count.

3034
02:29:03,290 --> 02:29:06,620
Then our next lines of code execute
on the servers we happen to be on,

3035
02:29:06,620 --> 02:29:09,260
which adds 1 to the likes.

3036
02:29:09,260 --> 02:29:14,780
The server might accidentally end
up updating the row for the egg

3037
02:29:14,780 --> 02:29:20,960
with 50,000,001 both times,
because the fundamental problem is

3038
02:29:20,960 --> 02:29:24,890
if my code executes while
Brian's code executes,

3039
02:29:24,890 --> 02:29:29,480
we are both checking the value of a
variable at essentially the same time.

3040
02:29:29,480 --> 02:29:32,090
And we are both then
making a conclusion--

3041
02:29:32,090 --> 02:29:35,190
oh, the current likes are 50 million.

3042
02:29:35,190 --> 02:29:36,470
We are then making a decision.

3043
02:29:36,470 --> 02:29:38,310
Let's add 1 to 50 million.

3044
02:29:38,310 --> 02:29:41,600
We are then updating the
value with 50,000,001.

3045
02:29:41,600 --> 02:29:46,640
The problem is, though, that, really,
if Brian's code or the server he happens

3046
02:29:46,640 --> 02:29:50,780
to be connected to on Instagram happens
to have selected the number of likes

3047
02:29:50,780 --> 02:29:53,900
first, he should be allowed
to finish the code that's

3048
02:29:53,900 --> 02:29:57,950
being executed so that when I
select it, I see 50,000,001,

3049
02:29:57,950 --> 02:30:02,270
and I add 1 to that so the
new count is 50,000,002.

3050
02:30:02,270 --> 02:30:04,070
This is what's known
as a race condition.
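
To make the lost like concrete, here is a small sketch that replays the interleaving by hand rather than with real servers. It assumes Python's built-in sqlite3 module and a single in-memory database; the helper names `read_likes` and `write_likes` and the variables `david_sees` and `brian_sees` are hypothetical, chosen only for this illustration.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, likes INTEGER NOT NULL)")
db.execute("INSERT INTO posts (id, likes) VALUES (1, 50000000)")


def read_likes(post_id):
    """Return the current like count for one post."""
    row = db.execute("SELECT likes FROM posts WHERE id = ?", (post_id,)).fetchone()
    return row[0]


def write_likes(post_id, likes):
    """Overwrite the like count for one post."""
    db.execute("UPDATE posts SET likes = ? WHERE id = ?", (likes, post_id))


# Both clicks run their SELECT before either one runs its UPDATE.
david_sees = read_likes(1)
brian_sees = read_likes(1)

# Each adds 1 to the value it saw, so both write the same number.
write_likes(1, david_sees + 1)
write_likes(1, brian_sees + 1)

final = read_likes(1)
print(final)  # 50000001: two clicks arrived, but only one was counted
```

The second write simply overwrites the first, which is why the count ends at 50,000,001 instead of 50,000,002.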

3051
02:30:04,070 --> 02:30:06,980
When you write code in a multiserver--

3052
02:30:06,980 --> 02:30:11,120
more fancily known as a multithreaded
environment-- lines of code

3053
02:30:11,120 --> 02:30:16,160
chronologically can get commingled on
different servers at any given time.

3054
02:30:16,160 --> 02:30:18,200
The problem fundamentally
derives from the fact

3055
02:30:18,200 --> 02:30:22,430
that if Brian's server is in the middle
of checking the state of a variable,

3056
02:30:22,430 --> 02:30:23,840
I should be locked out.

3057
02:30:23,840 --> 02:30:26,870
I should not be allowed to click
on that button at the same time,

3058
02:30:26,870 --> 02:30:30,590
or my code should not be
allowed to execute logically.

3059
02:30:30,590 --> 02:30:33,050
So there is a solution
when you have to write code

3060
02:30:33,050 --> 02:30:36,500
like this, as is common for Twitter and
Instagram and Facebook and the like,

3061
02:30:36,500 --> 02:30:38,420
to use what are called transactions.

3062
02:30:38,420 --> 02:30:41,815
Transactions add a few new pieces
of syntax that we won't dwell on today

3063
02:30:41,815 --> 02:30:43,690
and you don't need to
use in the coming days.

3064
02:30:43,690 --> 02:30:46,180
But they do solve a
fundamentally hard problem.

3065
02:30:46,180 --> 02:30:50,500
Transactions essentially allow
you to lock a table or, really,

3066
02:30:50,500 --> 02:30:54,885
a row in the table so that
if Brian's click on that egg

3067
02:30:54,885 --> 02:30:57,760
results in some code executing that's
in the process of checking what

3068
02:30:57,760 --> 02:31:02,770
is the total like count, my click on the
egg will not get handled by the server

3069
02:31:02,770 --> 02:31:05,630
until his code is done executing.

3070
02:31:05,630 --> 02:31:08,470
So in green here, I've proposed
the way you should do this.

3071
02:31:08,470 --> 02:31:12,968
You shouldn't just execute the middle
three lines, "you" being Facebook,

3072
02:31:12,968 --> 02:31:13,510
in this case.

3073
02:31:13,510 --> 02:31:17,200
Instagram should execute
BEGIN TRANSACTION first, then

3074
02:31:17,200 --> 02:31:19,300
COMMIT the transaction at the end.

3075
02:31:19,300 --> 02:31:22,780
And the design of transactions is
that all of the lines in between

3076
02:31:22,780 --> 02:31:26,320
will either succeed
altogether or fail altogether.
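
A rough sketch of wrapping those same lines in a transaction, under several assumptions: it uses Python's built-in sqlite3 module rather than CS50's library, the function name `like` is hypothetical, and it issues SQLite's `BEGIN IMMEDIATE` variant, which takes the write lock up front, instead of the plain BEGIN TRANSACTION shown on the slide. SQLite locks the whole database file rather than a single row, so this only approximates what a larger database server would do.

```python
import sqlite3

# isolation_level=None turns off the module's implicit transactions,
# so the BEGIN and COMMIT below are the only ones issued.
db = sqlite3.connect(":memory:", isolation_level=None)
db.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, likes INTEGER NOT NULL)")
db.execute("INSERT INTO posts (id, likes) VALUES (1, 50000000)")


def like(post_id):
    """Read, add 1, and write back as one all-or-nothing unit."""
    db.execute("BEGIN IMMEDIATE")  # take the write lock before reading
    try:
        row = db.execute(
            "SELECT likes FROM posts WHERE id = ?", (post_id,)
        ).fetchone()
        db.execute(
            "UPDATE posts SET likes = ? WHERE id = ?", (row[0] + 1, post_id)
        )
        db.execute("COMMIT")
    except Exception:
        # If anything in between fails, undo whatever was done so far.
        db.execute("ROLLBACK")
        raise


like(1)  # first click
like(1)  # second click
total = db.execute("SELECT likes FROM posts WHERE id = 1").fetchone()[0]
print(total)  # 50000002
```

With the lock held from the first line of the transaction, a second connection that tried to begin its own would have to wait until the COMMIT, so its SELECT would see the updated count.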

3077
02:31:26,320 --> 02:31:28,180
The database won't get
into this funky state

3078
02:31:28,180 --> 02:31:32,320
where we start losing
track of likes on eggs.

3079
02:31:32,320 --> 02:31:34,660
And though this has not been
an issue in recent years,

3080
02:31:34,660 --> 02:31:36,952
back in the day when Twitter
was first getting started,

3081
02:31:36,952 --> 02:31:40,232
Twitter was super popular and
super offline a lot of the time.

3082
02:31:40,232 --> 02:31:42,190
There was this thing
called a Fail Whale, which

3083
02:31:42,190 --> 02:31:44,037
is the picture they
showed on their website

3084
02:31:44,037 --> 02:31:46,120
when they were getting too
much traffic to handle.

3085
02:31:46,120 --> 02:31:49,540
That was because when people are liking
and tweeting and retweeting things,

3086
02:31:49,540 --> 02:31:51,520
it's a huge amount of data coming in.

3087
02:31:51,520 --> 02:31:54,500
And it turns out it's very
hard to solve these problems.

3088
02:31:54,500 --> 02:31:58,450
But locking the database table or
the rows with these transactions

3089
02:31:58,450 --> 02:32:00,490
is one way fundamentally to solve this.

3090
02:32:00,490 --> 02:32:03,160
And in our final extra
time today, we thought

3091
02:32:03,160 --> 02:32:05,080
we would play this out
in the same example

3092
02:32:05,080 --> 02:32:07,510
that I was taught transactions
in some years ago.

3093
02:32:07,510 --> 02:32:10,750
Suppose that the scenario at hand
is that you and your roommates

3094
02:32:10,750 --> 02:32:12,370
have a nice dorm fridge.

3095
02:32:12,370 --> 02:32:15,100
And you're all in the habit
of drinking lots of milk,

3096
02:32:15,100 --> 02:32:17,050
and you want to be able
to drink some milk.

3097
02:32:17,050 --> 02:32:19,420
But you go to the fridge,
like I'm about to here.

3098
02:32:19,420 --> 02:32:22,210
And you realize, uh-oh,
we're out of milk.

3099
02:32:22,210 --> 02:32:25,570
And so now I am inspecting the
state of this refrigerator, which

3100
02:32:25,570 --> 02:32:27,970
is quite old but also quite empty.

3101
02:32:27,970 --> 02:32:30,160
And the state of this
variable, being empty,

3102
02:32:30,160 --> 02:32:33,620
tells me that I should go to
CVS and buy some more milk.

3103
02:32:33,620 --> 02:32:35,080
So what do I then do?

3104
02:32:35,080 --> 02:32:37,150
I'm presumably going
to close the fridge,

3105
02:32:37,150 --> 02:32:40,600
and I'm going to go and
leave and go head to CVS.

3106
02:32:40,600 --> 02:32:43,510
Unfortunately, the same problem
arises that we'll act out here

3107
02:32:43,510 --> 02:32:46,150
in our final 60 or so
seconds together, whereby

3108
02:32:46,150 --> 02:32:49,660
if Brian now, my roommate in
this story, also wants some milk,

3109
02:32:49,660 --> 02:32:52,060
he comes by when I'm
already headed to the store,

3110
02:32:52,060 --> 02:32:55,310
inspects the state of the fridge,
and realizes, oh, we're out of milk.

3111
02:32:55,310 --> 02:32:57,650
So he nicely will go restock, as well.

3112
02:32:57,650 --> 02:32:59,620
So let's see how this
plays out, and we'll

3113
02:32:59,620 --> 02:33:03,590
see if there isn't a
similar, analogous solution.

3114
02:33:03,590 --> 02:33:05,620
So I've checked the
state of the variable.

3115
02:33:05,620 --> 02:33:06,920
We're indeed out of milk.

3116
02:33:06,920 --> 02:33:08,030
I'll be right back.

3117
02:33:08,030 --> 02:33:09,085
Just going to go to CVS.

3118
02:33:09,085 --> 02:33:26,336


3119
02:33:26,336 --> 02:33:29,829
[MUSIC PLAYING]

3120
02:33:29,829 --> 02:34:44,240


3121
02:34:44,240 --> 02:34:45,020
All right.

3122
02:34:45,020 --> 02:34:46,550
I am now back from the store.

3123
02:34:46,550 --> 02:34:47,870
I've picked up some milk.

3124
02:34:47,870 --> 02:34:50,090
Going to go ahead and put
it into the fridge and--

3125
02:34:50,090 --> 02:34:51,710
oh, how did this happen?

3126
02:34:51,710 --> 02:34:53,570
Now there's multiple jugs of milk.

3127
02:34:53,570 --> 02:34:55,490
And of course, milk
does not last that long.

3128
02:34:55,490 --> 02:34:57,282
And Brian and I don't
drink that much milk.

3129
02:34:57,282 --> 02:34:58,970
So this is a really serious problem.

3130
02:34:58,970 --> 02:35:03,150
We've sort of tried to update the value
of this variable at the same time.

3131
02:35:03,150 --> 02:35:05,030
So how do we go about fixing this?

3132
02:35:05,030 --> 02:35:07,320
What's the actual solution here?

3133
02:35:07,320 --> 02:35:09,920
Well, I dare say that we
can draw some inspiration

3134
02:35:09,920 --> 02:35:13,787
from the world of transactions
and the world of databases.

3135
02:35:13,787 --> 02:35:15,620
And perhaps create a
visual for here that we

3136
02:35:15,620 --> 02:35:18,270
hope you never forget if you
take nothing away from today.

3137
02:35:18,270 --> 02:35:21,327
Let's go ahead and act this out
one last time where, this time,

3138
02:35:21,327 --> 02:35:22,910
I'm going to be a little more extreme.

3139
02:35:22,910 --> 02:35:24,290
I go ahead and open the fridge.

3140
02:35:24,290 --> 02:35:25,940
I realize, oh, we're out of milk.

3141
02:35:25,940 --> 02:35:27,380
I'm going to go to the store.

3142
02:35:27,380 --> 02:35:29,450
I do not want to allow
for this situation

3143
02:35:29,450 --> 02:35:32,490
where Brian accidentally
checks the fridge, as well.

3144
02:35:32,490 --> 02:35:37,670
So I am going to lock
the refrigerator instead.

3145
02:35:37,670 --> 02:35:41,390
Let me go ahead and
drape this through here.

3146
02:35:41,390 --> 02:35:43,940


3147
02:35:43,940 --> 02:35:49,050
A little extreme, but I think so
long as he can't get into the fridge,

3148
02:35:49,050 --> 02:35:52,790
this shouldn't be a problem.

3149
02:35:52,790 --> 02:35:56,060
Let me go ahead now and
just attach the lock here.

3150
02:35:56,060 --> 02:35:57,170
Almost got it.

3151
02:35:57,170 --> 02:35:58,340
Come on.

3152
02:35:58,340 --> 02:35:59,570
All right.

3153
02:35:59,570 --> 02:36:01,993
Now the fridge is locked.

3154
02:36:01,993 --> 02:36:03,410
Now I'm going to go get some milk.
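
The locked fridge maps fairly directly onto a lock in code. The sketch below is only an analogy in Python, using the standard threading module; the names `fridge`, `fridge_lock`, `purchases`, and `restock` are hypothetical and chosen for this illustration.

```python
import threading

fridge = {"milk": 0}            # the shared state both roommates inspect
fridge_lock = threading.Lock()  # the padlock on the fridge door
purchases = []                  # who actually went to the store


def restock(name):
    # Holding the lock covers both the check and the purchase,
    # so nobody else can look inside the fridge in between.
    with fridge_lock:
        if fridge["milk"] == 0:
            purchases.append(name)
            fridge["milk"] += 1


roommates = [
    threading.Thread(target=restock, args=(name,)) for name in ("David", "Brian")
]
for thread in roommates:
    thread.start()
for thread in roommates:
    thread.join()

print(fridge["milk"])  # 1
```

Whichever roommate gets the lock first buys the milk; the other waits, then finds the fridge stocked and buys nothing.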

3155
02:36:03,410 --> 02:36:17,210


3156
02:36:17,210 --> 02:36:18,210
BRIAN YU: [SIGHS]

3157
02:36:18,210 --> 02:36:18,710


3158
02:36:18,710 --> 02:36:22,060
[MUSIC PLAYING]

3159
02:36:22,060 --> 02:37:19,000