1
00:00:00,000 --> 00:00:02,988
[MUSIC PLAYING]

2
00:00:02,988 --> 00:01:01,790


3
00:01:01,790 --> 00:01:03,500
SPEAKER 1: All right.

4
00:01:03,500 --> 00:01:09,590
This is CS50, and this is week 7
on the day before All Hallows' Eve.

5
00:01:09,590 --> 00:01:13,760
Today, we will introduce yet
another language, the goal of which

6
00:01:13,760 --> 00:01:16,430
is not to introduce another
language for language's sake,

7
00:01:16,430 --> 00:01:20,510
but to really begin to emphasize that
when it comes to the world of software

8
00:01:20,510 --> 00:01:23,390
programming, engineering
more generally, there's often

9
00:01:23,390 --> 00:01:25,170
different tools for different jobs.

10
00:01:25,170 --> 00:01:28,632
And if you were to try to use C to
solve all of our future problems,

11
00:01:28,632 --> 00:01:30,590
it would actually be
painful, as you've already

12
00:01:30,590 --> 00:01:35,360
seen how much more pleasant it is to
solve certain problems with Python.

13
00:01:35,360 --> 00:01:37,460
But today, quite quickly,
we'll realize that even

14
00:01:37,460 --> 00:01:40,440
Python's actually not the most pleasant
way to solve a lot of problems,

15
00:01:40,440 --> 00:01:43,100
especially when it comes
to data, data at scale.

16
00:01:43,100 --> 00:01:45,830
And, in fact, as we transition
in the final weeks of CS50

17
00:01:45,830 --> 00:01:49,130
to the world of web programming, and if
you so choose with your final project

18
00:01:49,130 --> 00:01:51,500
mobile programming, you'll
actually need more tools

19
00:01:51,500 --> 00:01:54,150
in your tool kit than
C and Python alone.

20
00:01:54,150 --> 00:01:57,620
In fact, today we'll introduce a
database-centric language called S-Q-L,

21
00:01:57,620 --> 00:01:58,590
or SQL.

22
00:01:58,590 --> 00:02:02,400
Next week, we'll explore markup
languages, like HTML and CSS.

23
00:02:02,400 --> 00:02:03,970
A bit of JavaScript, as well.

24
00:02:03,970 --> 00:02:07,290
And then we'll synthesize all of this
together at the end of the class,

25
00:02:07,290 --> 00:02:10,360
as some of you might, for
your final projects, as well.

26
00:02:10,360 --> 00:02:13,120
But, before we do that,
and talk about data,

27
00:02:13,120 --> 00:02:15,310
let's actually start
to gather some data.

28
00:02:15,310 --> 00:02:19,770
So if you could visit this URL
here on your phone, or a laptop.

29
00:02:19,770 --> 00:02:23,460
Or, if easier, here's a
bar code version thereof.

30
00:02:23,460 --> 00:02:25,980
You can point your camera
at this bar code here,

31
00:02:25,980 --> 00:02:29,220
and it's going to pull up a
relatively short Google form that's

32
00:02:29,220 --> 00:02:32,040
just going to ask you
a couple of questions,

33
00:02:32,040 --> 00:02:35,070
so that we can collect
some actual live data

34
00:02:35,070 --> 00:02:37,800
and actually play around with
it, initially, in Python.

35
00:02:37,800 --> 00:02:41,020
So if you go to that URL there,
you'll see a Google form.

36
00:02:41,020 --> 00:02:44,577
And if you haven't been able to
scan it quite yet, that's fine.

37
00:02:44,577 --> 00:02:47,160
Just kind of look over the
shoulder of the person next to you.

38
00:02:47,160 --> 00:02:49,960
But you, or they, will see a
little something like this.

39
00:02:49,960 --> 00:02:53,130
So among the questions will be,
which is your favorite language?

40
00:02:53,130 --> 00:02:56,160
On the list, only thus far,
is Scratch, C, and Python,

41
00:02:56,160 --> 00:02:58,350
and below that you'll see
another question asking

42
00:02:58,350 --> 00:03:02,960
about your favorite problem
across the problem sets thus far.

43
00:03:02,960 --> 00:03:04,710
Each of them is radio
buttons, which means

44
00:03:04,710 --> 00:03:08,800
you'll be able to select one
button for each of those questions.

45
00:03:08,800 --> 00:03:11,310
And, ultimately, what's
going to be nice is

46
00:03:11,310 --> 00:03:14,132
that if you've never used Google
Forms before as an administrator,

47
00:03:14,132 --> 00:03:17,340
all of that data is actually going to
end up being, not only in Google Forms,

48
00:03:17,340 --> 00:03:19,525
but also, if you so choose,
in Google Spreadsheets.

49
00:03:19,525 --> 00:03:22,650
Which is an example, of course, of a
spreadsheet software similar in spirit

50
00:03:22,650 --> 00:03:26,940
to Apple Numbers on Macs or Microsoft
Excel on different platforms.

51
00:03:26,940 --> 00:03:29,220
And Google Sheets is
going to allow us to store

52
00:03:29,220 --> 00:03:30,960
all of that data in rows and columns.

53
00:03:30,960 --> 00:03:33,090
And so, since Google made
both of these products,

54
00:03:33,090 --> 00:03:34,690
they integrated one with the other.

55
00:03:34,690 --> 00:03:39,930
So, in fact, if I, on my laptop
here in another window, open

56
00:03:39,930 --> 00:03:41,280
this up-- let me flip over.

57
00:03:41,280 --> 00:03:42,660
Here's the live spreadsheet.

58
00:03:42,660 --> 00:03:45,510
And we'll see that the very
first person who buzzed in really

59
00:03:45,510 --> 00:03:48,090
liked Python, as did a lot
of other people thereafter.

60
00:03:48,090 --> 00:03:52,390
But, Hello, World was your
favorite in Python, which is great.

61
00:03:52,390 --> 00:03:54,330
There's a couple of
votes for Scratch here.

62
00:03:54,330 --> 00:03:58,110
If we scroll down, there's one hold
out for C, who really liked Credit,

63
00:03:58,110 --> 00:03:59,320
in this case here.

64
00:03:59,320 --> 00:04:03,250
And if we scroll down further, it looks
like Python and Scratch are in there.

65
00:04:03,250 --> 00:04:05,727
A few more C's, and so on and so forth.

66
00:04:05,727 --> 00:04:08,310
So suppose that we wanted to,
actually, now analyze this data.

67
00:04:08,310 --> 00:04:11,587
Now, any of you who have used Excel,
or Numbers, or Google Spreadsheets

68
00:04:11,587 --> 00:04:13,920
know that it comes with built
in functions and formulas,

69
00:04:13,920 --> 00:04:17,040
and we can do all of that, but suppose
there was a huge amount of data.

70
00:04:17,040 --> 00:04:19,920
Or, suppose that this data was
not coming in via Google Sheets,

71
00:04:19,920 --> 00:04:23,130
but via your own web application,
or your own mobile application,

72
00:04:23,130 --> 00:04:25,530
and it's just ending up in
some kind of spreadsheet.

73
00:04:25,530 --> 00:04:28,530
Well, wouldn't it be nice if we could
actually analyze that kind of data

74
00:04:28,530 --> 00:04:29,250
with code.

75
00:04:29,250 --> 00:04:30,570
And, in fact, you can.

76
00:04:30,570 --> 00:04:35,220
And the simplest way to store a bunch
of data isn't with anything fancy,

77
00:04:35,220 --> 00:04:39,450
in fact, but just a literal text
file, something ending in .txt,

78
00:04:39,450 --> 00:04:41,520
or, maybe more commonly, .csv.

79
00:04:41,520 --> 00:04:44,400
In fact, what we'd call
a flat file database

80
00:04:44,400 --> 00:04:48,820
is literally just code for a text
file containing all of your data.

81
00:04:48,820 --> 00:04:53,040
But your data typically has delimiters
that separate some values from others.

82
00:04:53,040 --> 00:04:55,050
And, in fact, the most
common approach, daresay,

83
00:04:55,050 --> 00:04:59,430
is to use, indeed, comma-separated
values, or, CSV files.

84
00:04:59,430 --> 00:05:04,080
And what that means is that in simple
text alone, just asking, maybe Unicode,

85
00:05:04,080 --> 00:05:06,960
you sort of mimic the
idea of rows and columns

86
00:05:06,960 --> 00:05:09,570
by using newline characters,
like backslash n,

87
00:05:09,570 --> 00:05:11,260
to represent row after row after row.

88
00:05:11,260 --> 00:05:12,510
That's pretty straightforward.

89
00:05:12,510 --> 00:05:14,280
To just move the cursor
to the next line.

90
00:05:14,280 --> 00:05:18,930
And because text files aren't graphical,
so there's no notion of vertical bars

91
00:05:18,930 --> 00:05:21,510
that you can put between what
you and I think of as columns,

92
00:05:21,510 --> 00:05:24,630
you literally just use a comma,
or some other such symbol

93
00:05:24,630 --> 00:05:27,212
to separate one value from another.

94
00:05:27,212 --> 00:05:28,920
So, in fact, let me
go ahead and do this.

95
00:05:28,920 --> 00:05:31,837
Even if you've never done this before,
it turns out, in Google Sheets,

96
00:05:31,837 --> 00:05:35,130
and also Excel and Numbers,
you can export your data,

97
00:05:35,130 --> 00:05:38,880
not in some proprietary Apple
or Microsoft or Google format,

98
00:05:38,880 --> 00:05:43,380
but in a globally portable
format known as .csv.

99
00:05:43,380 --> 00:05:44,970
So let me go up to file.

100
00:05:44,970 --> 00:05:47,037
I will go to download.

101
00:05:47,037 --> 00:05:49,620
And notice, I can export this
in a bunch of different formats,

102
00:05:49,620 --> 00:05:52,722
but the one I care about for
today is going to be .csv.

103
00:05:52,722 --> 00:05:55,180
On my Mac, that's going to put
it into my downloads folder.

104
00:05:55,180 --> 00:05:57,722
And what I'm going to go ahead
here and do, in just a moment,

105
00:05:57,722 --> 00:06:00,690
is let me go ahead it
and open up VS Code.

106
00:06:00,690 --> 00:06:03,300
So, in short, I've downloaded this file.

107
00:06:03,300 --> 00:06:04,980
Google gave it a long default file name.

108
00:06:04,980 --> 00:06:06,900
I've renamed it to favorites.csv.

109
00:06:06,900 --> 00:06:10,200
Let me go ahead and actually
open it now in VS Code here.

110
00:06:10,200 --> 00:06:12,480
And this now is the exact same data.

111
00:06:12,480 --> 00:06:14,800
If you're still submitting the form,
I'm afraid you didn't make the cut off.

112
00:06:14,800 --> 00:06:17,220
So we're not going to see more
data ending up in this file

113
00:06:17,220 --> 00:06:19,260
because I've literally
downloaded a copy of it.

114
00:06:19,260 --> 00:06:21,240
But, indeed, if I scroll
through this, we'll

115
00:06:21,240 --> 00:06:25,050
see that it's got some
399 lines of data.

116
00:06:25,050 --> 00:06:28,080
Or, technically, 398
because the very first one

117
00:06:28,080 --> 00:06:30,990
is what we'd call a header
row, which just describes

118
00:06:30,990 --> 00:06:32,820
what each of these columns means.

119
00:06:32,820 --> 00:06:35,670
Now notice that even though the
Google Sheets interface actually

120
00:06:35,670 --> 00:06:39,780
put all of this data in proper
graphical rows and columns,

121
00:06:39,780 --> 00:06:42,510
you can still see the rows because
they're just separate lines,

122
00:06:42,510 --> 00:06:45,900
and you can kind of see the
columns by focusing on the comma

123
00:06:45,900 --> 00:06:47,970
here, the comma here, the comma here.

124
00:06:47,970 --> 00:06:50,820
And then, also, the comma
here, the comma here,

125
00:06:50,820 --> 00:06:52,540
the comma here, and so forth.

126
00:06:52,540 --> 00:06:53,910
Now, a quick subtlety.

127
00:06:53,910 --> 00:06:59,490
Hello, World is a two-word problem
name, and it itself has a comma in it.

128
00:06:59,490 --> 00:07:06,090
What's to stop me from confusing the
comma in Hello, World from the commas

129
00:07:06,090 --> 00:07:09,410
that Google, apparently,
inserted into this file?

130
00:07:09,410 --> 00:07:09,910
Yeah.

131
00:07:09,910 --> 00:07:11,925
So, automatically, you
all did not type this.

132
00:07:11,925 --> 00:07:13,300
We did not type it into the form.

133
00:07:13,300 --> 00:07:16,750
But Google is smart enough, as
is Apple and Microsoft, when

134
00:07:16,750 --> 00:07:20,958
they export CSVs to somehow escape
seemingly dangerous characters,

135
00:07:20,958 --> 00:07:22,750
or characters that
could just break things.

136
00:07:22,750 --> 00:07:25,180
And, in this case, a
convention in the CSV world

137
00:07:25,180 --> 00:07:28,340
is just to quote any strings
that themselves have commas

138
00:07:28,340 --> 00:07:32,150
so that any program you're writing that
reads this file doesn't get confused.

139
00:07:32,150 --> 00:07:35,260
So the only thing that came
in automatically from Google

140
00:07:35,260 --> 00:07:37,690
is just this timestamp
here, based on the time

141
00:07:37,690 --> 00:07:39,370
of day in our own local timezone.

142
00:07:39,370 --> 00:07:40,970
That was added automatically.

143
00:07:40,970 --> 00:07:43,100
So we have three, and
not just two columns.

144
00:07:43,100 --> 00:07:45,460
So with that said, we
have three columns:

145
00:07:45,460 --> 00:07:47,980
timestamp, language, and problem.

146
00:07:47,980 --> 00:07:50,200
The latter two of which
came from all of you.

147
00:07:50,200 --> 00:07:52,840
Let's actually write some
code that analyzes this data.

148
00:07:52,840 --> 00:07:55,720
Let's figure out what the most
popular something or other is here.

149
00:07:55,720 --> 00:07:57,970
So I'll close the CSV file.

150
00:07:57,970 --> 00:08:02,140
Let me go ahead and use what seems to
be folks favorite language, thus far,

151
00:08:02,140 --> 00:08:05,320
and write a file called
favorites.py, for instance.

152
00:08:05,320 --> 00:08:08,440
And I'm going to use Python
to open that CSV file,

153
00:08:08,440 --> 00:08:12,180
analyze it, crunch some numbers, and
output some statistics, if you will.

154
00:08:12,180 --> 00:08:15,920
So the easiest way to manipulate CSV
files, as you might have gleaned,

155
00:08:15,920 --> 00:08:19,940
is not to just open the file yourself,
look for commas, allocate memory,

156
00:08:19,940 --> 00:08:22,220
or anything like that
in C. In Python, you

157
00:08:22,220 --> 00:08:27,230
can literally just import CSV,
which is a Python module that

158
00:08:27,230 --> 00:08:29,570
gives you CSV related functionality.

159
00:08:29,570 --> 00:08:32,390
In Python, there's a bunch
of ways to open files.

160
00:08:32,390 --> 00:08:35,960
One way to open a file
is like this in Python.

161
00:08:35,960 --> 00:08:40,039
You can say file equals
open, similar to fopen in C.

162
00:08:40,039 --> 00:08:44,600
You can specify the name of the file
you want to open, like favorites.csv,

163
00:08:44,600 --> 00:08:48,330
and you can explicitly say you
want to open the file for reading,

164
00:08:48,330 --> 00:08:50,240
quote unquote, "r," just like fopen.

165
00:08:50,240 --> 00:08:53,280
Strictly speaking, in
Python read is implied.

166
00:08:53,280 --> 00:08:55,580
So if you omit the second
argument, it will still work.

167
00:08:55,580 --> 00:08:58,490
But for parity with fopen,
I'll do the same here.

168
00:08:58,490 --> 00:09:03,020
Then you can, maybe, do something
with file, and then, at the end,

169
00:09:03,020 --> 00:09:05,070
you can close the file like this.

170
00:09:05,070 --> 00:09:08,840
So in Python, there really is a
mapping between fopen and fclose.

171
00:09:08,840 --> 00:09:12,560
But in Python the functions are
called open and close for short.

172
00:09:12,560 --> 00:09:16,130
But a more common way, a more
Pythonic way, so to speak, i.e.

173
00:09:16,130 --> 00:09:19,040
the way people tend to do
it in Python, is actually

174
00:09:19,040 --> 00:09:22,610
to use a keyword that didn't exist
in C, where you instead say with.

175
00:09:22,610 --> 00:09:26,990
And you say, with open this file
name as a specific file name,

176
00:09:26,990 --> 00:09:28,970
and then indent it inside of that.

177
00:09:28,970 --> 00:09:30,950
Now you can do whatever
you want with the file.

178
00:09:30,950 --> 00:09:33,350
And the implication of using
with, even though it's not

179
00:09:33,350 --> 00:09:36,740
obvious from the keyword itself, is that
the file will be automatically closed

180
00:09:36,740 --> 00:09:37,770
for you later.

181
00:09:37,770 --> 00:09:39,650
So this is just a minor
Python convenience

182
00:09:39,650 --> 00:09:42,020
so that you don't have to
remember to close the file.

183
00:09:42,020 --> 00:09:44,030
Unless something goes
wrong, it will just

184
00:09:44,030 --> 00:09:47,690
close automatically as soon as
you're outside of this with block.

185
00:09:47,690 --> 00:09:49,460
So how do I go about reading a CSV?

186
00:09:49,460 --> 00:09:52,940
The simplest way is to actually
give yourself a variable called,

187
00:09:52,940 --> 00:09:53,900
maybe, reader.

188
00:09:53,900 --> 00:09:56,780
Like if I want to read the CSV,
I'll call my variable reader.

189
00:09:56,780 --> 00:10:02,270
Set that equal to the return value of
a function that comes with Python CSV

190
00:10:02,270 --> 00:10:07,080
module called reader, in lowercase,
and you just pass in the file name.

191
00:10:07,080 --> 00:10:09,560
So the first line there on
line three opens the file

192
00:10:09,560 --> 00:10:11,570
and gives me access
to the bytes there in.

193
00:10:11,570 --> 00:10:14,450
Line four now actually
uses this library that

194
00:10:14,450 --> 00:10:18,260
comes with Python to just go read it
for me, figure out where the commas are,

195
00:10:18,260 --> 00:10:21,690
so that it can hand me, line
after line, the data in the file.

196
00:10:21,690 --> 00:10:24,320
Now the first piece of data
in the file, though, was what?

197
00:10:24,320 --> 00:10:27,080
What's the very first row?

198
00:10:27,080 --> 00:10:30,692
So it was that header row with
timestamp, language, problem.

199
00:10:30,692 --> 00:10:32,900
I actually want to skip that
because that's not data.

200
00:10:32,900 --> 00:10:36,870
That's what we might call metadata
that's describing my actual data.

201
00:10:36,870 --> 00:10:39,670
So one way to deal with this, I
can literally just say next reader.

202
00:10:39,670 --> 00:10:42,170
And next, even though the
semantics here are a little weird,

203
00:10:42,170 --> 00:10:46,100
it just means, no, no, give me the
next line from that reader instead

204
00:10:46,100 --> 00:10:47,550
of the first by default.

205
00:10:47,550 --> 00:10:51,650
And now inside of this with block,
I can do something like this.

206
00:10:51,650 --> 00:10:56,810
For each row in that reader, let's
do something super simple initially.

207
00:10:56,810 --> 00:11:00,380
Let's just print out row bracket one.

208
00:11:00,380 --> 00:11:01,910
So row bracket one.

209
00:11:01,910 --> 00:11:03,330
So what's going on here?

210
00:11:03,330 --> 00:11:07,040
Well, the CSV reader in Python
is going to return to you, inside

211
00:11:07,040 --> 00:11:09,990
of this loop, one row after another.

212
00:11:09,990 --> 00:11:12,080
Each of those rows, though,
has how many columns?

213
00:11:12,080 --> 00:11:12,920
Three.

214
00:11:12,920 --> 00:11:15,470
The timestamp, the
language, and the problem.

215
00:11:15,470 --> 00:11:19,430
And just like in C, our lists
in Python are zero index.

216
00:11:19,430 --> 00:11:20,720
So zero, one, two.

217
00:11:20,720 --> 00:11:25,350
So if I'm printing out row bracket one,
that's the second, or middle, column.

218
00:11:25,350 --> 00:11:29,220
So what's this going
to print row after row?

219
00:11:29,220 --> 00:11:32,340
Each of the languages
that you all replied with.

220
00:11:32,340 --> 00:11:32,940
So that's all.

221
00:11:32,940 --> 00:11:34,440
This isn't doing any
kind of analytics yet.

222
00:11:34,440 --> 00:11:36,440
I'm just going through
the motions to, at least,

223
00:11:36,440 --> 00:11:37,900
print out some data of interest.

224
00:11:37,900 --> 00:11:38,692
So let me run this.

225
00:11:38,692 --> 00:11:41,550
Python of favorites.py, enter.

226
00:11:41,550 --> 00:11:43,270
And it happens super fast.

227
00:11:43,270 --> 00:11:47,170
But if I scroll back in my terminal
window, there is all of that raw data.

228
00:11:47,170 --> 00:11:49,472
So this is to say, once you
know the function names,

229
00:11:49,472 --> 00:11:51,180
once you know the
keywords, it's actually

230
00:11:51,180 --> 00:11:54,090
pretty simple in Python to just
get up and running with a file,

231
00:11:54,090 --> 00:11:56,070
and start looking at the data therein.

232
00:11:56,070 --> 00:12:02,910
But it turns out that it's a little
sub-optimal to use the reader alone.

233
00:12:02,910 --> 00:12:05,400
It turns out there's better
ways to do this, and let

234
00:12:05,400 --> 00:12:06,780
me make this clear as follows.

235
00:12:06,780 --> 00:12:09,660
I don't strictly need a variable,
but let me actually declare

236
00:12:09,660 --> 00:12:13,560
a variable called favorite, set
it equal to row bracket one,

237
00:12:13,560 --> 00:12:15,630
and then print out that favorite value.

238
00:12:15,630 --> 00:12:17,438
This is not doing anything new.

239
00:12:17,438 --> 00:12:19,230
It's just declaring an
additional variable,

240
00:12:19,230 --> 00:12:22,380
but I wanted to highlight the fact
that I'm just kind of trusting that row

241
00:12:22,380 --> 00:12:24,630
bracket one is the problem--

242
00:12:24,630 --> 00:12:26,430
or, is the language I care about.

243
00:12:26,430 --> 00:12:27,660
Is language, is language.

244
00:12:27,660 --> 00:12:30,790
But suppose one of you, or I,
go into the Google spreadsheet,

245
00:12:30,790 --> 00:12:32,668
and like anyone might
with a spreadsheet,

246
00:12:32,668 --> 00:12:34,210
you might start moving things around.

247
00:12:34,210 --> 00:12:36,947
And you might swap some of
the columns left and right.

248
00:12:36,947 --> 00:12:39,280
You might delete one of the
columns, add something else.

249
00:12:39,280 --> 00:12:41,830
In short, spreadsheets
are arguably fragile

250
00:12:41,830 --> 00:12:44,590
in that it's pretty easy in the
GUI, the graphical user interface,

251
00:12:44,590 --> 00:12:46,040
to change them around.

252
00:12:46,040 --> 00:12:50,080
And so my code in Python,
accordingly, is, arguably, fragile

253
00:12:50,080 --> 00:12:53,770
because I'm just hoping that row
bracket one is always the data

254
00:12:53,770 --> 00:12:54,890
that I care about.

255
00:12:54,890 --> 00:12:56,450
So what would be marginally better?

256
00:12:56,450 --> 00:12:59,110
Well, let's actually use
that header row instead.

257
00:12:59,110 --> 00:13:02,920
And more common, arguably, in Python,
is not to use a simple reader,

258
00:13:02,920 --> 00:13:05,650
but, instead, what we would
call a Dictionary Reader.

259
00:13:05,650 --> 00:13:09,190
I'm going to change this to
DictReader, capital D, capital R,

260
00:13:09,190 --> 00:13:12,010
and that's it for the
change to line four.

261
00:13:12,010 --> 00:13:14,197
On line five, I'm going to
get rid of the next line

262
00:13:14,197 --> 00:13:16,030
because I don't want
to skip the header now.

263
00:13:16,030 --> 00:13:19,210
What DictReader does,
which reader does not,

264
00:13:19,210 --> 00:13:22,690
is it automatically analyzes
that first line in the file,

265
00:13:22,690 --> 00:13:25,510
figures out what are all
of your columns called,

266
00:13:25,510 --> 00:13:30,220
and, thereafter, when you iterate over
this reader what each of your rows

267
00:13:30,220 --> 00:13:35,650
now is-- it's no longer a list of
size three, bracket zero, bracket

268
00:13:35,650 --> 00:13:36,490
one, bracket two.

269
00:13:36,490 --> 00:13:39,670
Each row that you get back
in this loop on line five

270
00:13:39,670 --> 00:13:44,890
is now, wonderfully, a dictionary
instead, the keys of which

271
00:13:44,890 --> 00:13:46,090
are from the header field.

272
00:13:46,090 --> 00:13:47,650
Timestamp, language, problem.

273
00:13:47,650 --> 00:13:52,460
The values of which are whatever
each of you typed in again and again.

274
00:13:52,460 --> 00:13:53,890
So how do I change the code?

275
00:13:53,890 --> 00:13:57,800
I no longer have these numeric indices
because row is no longer a list,

276
00:13:57,800 --> 00:13:58,850
it's a dictionary.

277
00:13:58,850 --> 00:14:02,350
So if I literally want the
language that you typed in,

278
00:14:02,350 --> 00:14:06,520
I can use row bracket, quote unquote,
"language," treating row as a Dict

279
00:14:06,520 --> 00:14:09,010
not as a list anymore.

280
00:14:09,010 --> 00:14:12,700
Now this is, again, more robust because
if you move the columns around on me,

281
00:14:12,700 --> 00:14:15,340
code is still going to work,
at least so long as you don't

282
00:14:15,340 --> 00:14:17,570
rename the columns at the very top.

283
00:14:17,570 --> 00:14:20,180
So that's still one
assumption I'm making.

284
00:14:20,180 --> 00:14:20,680
All right.

285
00:14:20,680 --> 00:14:23,770
Well, beyond that, what
could I actually do here?

286
00:14:23,770 --> 00:14:27,422
Well, just to be clear, I don't,
strictly speaking, need this variable.

287
00:14:27,422 --> 00:14:28,880
So no need to highlight this again.

288
00:14:28,880 --> 00:14:30,755
So let me just simplify
the code a little bit

289
00:14:30,755 --> 00:14:33,220
and get rid of this variable,
and instead just print out

290
00:14:33,220 --> 00:14:34,960
the language in that row.

291
00:14:34,960 --> 00:14:40,360
As a quick check, let me
rerun Python of favorites.py

292
00:14:40,360 --> 00:14:43,430
and it seems to still work, even
though there's a lot of output.

293
00:14:43,430 --> 00:14:45,610
We're not going to check
all 399 lines, but it

294
00:14:45,610 --> 00:14:48,160
looks like it printed out all
of those popular languages.

295
00:14:48,160 --> 00:14:50,120
Well, what more can we actually do?

296
00:14:50,120 --> 00:14:52,420
Well, let's actually now
start to crunch some numbers

297
00:14:52,420 --> 00:14:55,450
and figure out how many people
like Scratch, how many people like

298
00:14:55,450 --> 00:14:56,890
C, how many people like Python.

299
00:14:56,890 --> 00:14:58,580
Let's start to analyze this.

300
00:14:58,580 --> 00:15:02,290
So maybe the most pedantic
way to do this in Python

301
00:15:02,290 --> 00:15:07,113
would be to just create some variables
and do all of the counting myself.

302
00:15:07,113 --> 00:15:08,780
So let me actually go ahead and do this.

303
00:15:08,780 --> 00:15:10,113
Let me delete this code for now.

304
00:15:10,113 --> 00:15:12,370
And after opening the
reader, let me create

305
00:15:12,370 --> 00:15:14,980
a variable called Scratch
and set it equal to zero,

306
00:15:14,980 --> 00:15:19,060
a variable called C, set it equal
to zero, a variable called Python,

307
00:15:19,060 --> 00:15:23,290
set it equal to zero, just so that I
have three counters, similar to what

308
00:15:23,290 --> 00:15:26,620
we did in week zero, week one,
anytime we counted anything Honestly,

309
00:15:26,620 --> 00:15:29,600
this looks a little
stupid, and it's not wrong.

310
00:15:29,600 --> 00:15:31,600
In fact, this is how you
would do it in Python,

311
00:15:31,600 --> 00:15:33,910
but Python also has some clever syntax.

312
00:15:33,910 --> 00:15:38,800
If you want to be really cool,
you can do Scratch comma, c comma,

313
00:15:38,800 --> 00:15:43,932
Python equals zero, zero, zero, and
do all three at once if you like that.

314
00:15:43,932 --> 00:15:47,140
So it tightens up the code a little bit,
even though the effect is ultimately

315
00:15:47,140 --> 00:15:47,710
the same.

316
00:15:47,710 --> 00:15:52,010
Now let's go ahead and iterate
over this file row by row by row.

317
00:15:52,010 --> 00:15:54,730
And if we see Scratch,
increment the Scratch counter.

318
00:15:54,730 --> 00:15:56,320
If we see C, increment the C counter.

319
00:15:56,320 --> 00:15:59,000
If we see Python, increment
the Python counter, instead.

320
00:15:59,000 --> 00:16:00,080
So how do I do this?

321
00:16:00,080 --> 00:16:03,220
Well, I could do something like,
for each row in the reader,

322
00:16:03,220 --> 00:16:07,180
just like before, let me go ahead
and get that favorite variable

323
00:16:07,180 --> 00:16:11,230
and set it equal to the language in
that dictionary that just came back

324
00:16:11,230 --> 00:16:12,740
as part of this iteration.

325
00:16:12,740 --> 00:16:19,060
And now I can do something like this, if
favorite equals, equals "Scratch" then,

326
00:16:19,060 --> 00:16:26,710
with my indentation, I can do Scratch
plus equals one, elif favorite equals,

327
00:16:26,710 --> 00:16:27,670
equals "C."

328
00:16:27,670 --> 00:16:31,290
I can go ahead and
increment C plus equals one.

329
00:16:31,290 --> 00:16:34,440
Else-- and I don't think I
want else, just to be safe.

330
00:16:34,440 --> 00:16:37,410
Even though we only saw three options,
I think just to be super safe,

331
00:16:37,410 --> 00:16:41,040
elif favorite equals, equals
"Python," then let's go ahead

332
00:16:41,040 --> 00:16:42,210
and increment Python.

333
00:16:42,210 --> 00:16:45,360
The one thing we can't do that we
could do in C is the plus, plus trick.

334
00:16:45,360 --> 00:16:47,890
So plus equals is as
close as we can get.

335
00:16:47,890 --> 00:16:48,900
So what have I done?

336
00:16:48,900 --> 00:16:52,110
Inside of this loop, I've just
incremented each of those counters

337
00:16:52,110 --> 00:16:56,230
by one if I see Scratch, or
C, or Python, again and again.

338
00:16:56,230 --> 00:16:56,730
All right.

339
00:16:56,730 --> 00:17:00,060
Outside of the loop and
outside of the with block,

340
00:17:00,060 --> 00:17:02,370
because once I'm done
reading all of the rows,

341
00:17:02,370 --> 00:17:06,493
I might as well let the with clause
close the file automatically,

342
00:17:06,493 --> 00:17:08,410
let's just go ahead and
print out some values.

343
00:17:08,410 --> 00:17:13,560
So I'm going to go ahead and print out
Scratch colon, and then inside of there

344
00:17:13,560 --> 00:17:16,290
let's print out whatever the
value of that variable is.

345
00:17:16,290 --> 00:17:19,770
Let's then go ahead
and print out C colon,

346
00:17:19,770 --> 00:17:22,030
and then whatever the
value of the C variable is.

347
00:17:22,030 --> 00:17:26,160
And then, lastly, let's print out
Python colon and whatever the value is.

348
00:17:26,160 --> 00:17:29,117
And now I made three typos here.

349
00:17:29,117 --> 00:17:30,700
This is not going to print the values.

350
00:17:30,700 --> 00:17:32,320
What do I need to do?

351
00:17:32,320 --> 00:17:34,030
Sorry?

352
00:17:34,030 --> 00:17:36,560
Yeah, I'm missing the f in
front of each of these strings

353
00:17:36,560 --> 00:17:38,470
so that I actually get formatted.

354
00:17:38,470 --> 00:17:41,800
And that is to say, the variables get
interpolated inside of the curly braces

355
00:17:41,800 --> 00:17:43,220
like we saw last week.

356
00:17:43,220 --> 00:17:43,720
All right.

357
00:17:43,720 --> 00:17:45,928
So, honestly, that's kind
of a decent amount of code,

358
00:17:45,928 --> 00:17:49,630
like 18 or so lines of code, just
to count the number of responses.

359
00:17:49,630 --> 00:17:51,070
But let's see if I got it right.

360
00:17:51,070 --> 00:17:55,330
Let me open my terminal and
run Python of favorites.py,

361
00:17:55,330 --> 00:17:58,420
and now I see, by an
overwhelming amount,

362
00:17:58,420 --> 00:18:04,660
folks like Python, followed by C,
followed by Scratch, in that order.

363
00:18:04,660 --> 00:18:06,910
But that was a decent amount
of code to have to write,

364
00:18:06,910 --> 00:18:10,540
and it turns out there's actually better
ways of doing this, more Pythonic ways,

365
00:18:10,540 --> 00:18:12,530
more programmatic ways of doing this.

366
00:18:12,530 --> 00:18:15,760
And if we think back to one of
our universal data structures.

367
00:18:15,760 --> 00:18:18,910
Think back to how we preached
last week, and the week before,

368
00:18:18,910 --> 00:18:21,280
the value of these
dictionaries, more generally.

369
00:18:21,280 --> 00:18:23,317
Like the CSV module,
clearly a fan of them,

370
00:18:23,317 --> 00:18:25,150
because that's what
DictReader is giving us,

371
00:18:25,150 --> 00:18:27,220
dictionary, after
dictionary, after dictionary.

372
00:18:27,220 --> 00:18:29,290
And this was the general
idea of a dictionary.

373
00:18:29,290 --> 00:18:31,840
It associates keys with
values, much like you

374
00:18:31,840 --> 00:18:34,940
might in a two-column table
on a chalkboard, or the like.

375
00:18:34,940 --> 00:18:39,340
Well, this is what I need if I want
to keep track of how many people said

376
00:18:39,340 --> 00:18:42,250
Scratch, and C, and Python.

377
00:18:42,250 --> 00:18:46,900
If I had a piece of chalk, I
could just write Scratch, and C,

378
00:18:46,900 --> 00:18:48,670
and Python as three keys.

379
00:18:48,670 --> 00:18:51,910
And then with my chalk and, maybe,
an eraser, keep track of the values.

380
00:18:51,910 --> 00:18:52,900
They all start at zero.

381
00:18:52,900 --> 00:18:54,550
Then, I add one, add two, add three.

382
00:18:54,550 --> 00:18:56,800
Or, maybe, on a chalkboard,
I actually use hash marks.

383
00:18:56,800 --> 00:19:00,910
But a dictionary is kind of the perfect
data structure for just associating

384
00:19:00,910 --> 00:19:05,770
something like Scratch, C, Python,
with something else-- keys with values,

385
00:19:05,770 --> 00:19:06,890
respectively.

386
00:19:06,890 --> 00:19:09,250
So this is going to
look a little weirder,

387
00:19:09,250 --> 00:19:13,040
but it's going to be pretty conventional
to do something like this instead.

388
00:19:13,040 --> 00:19:15,550
Let me go back into VS Code.

389
00:19:15,550 --> 00:19:17,360
I'll close my terminal window.

390
00:19:17,360 --> 00:19:20,680
And let me go ahead and
actually delete a lot of this

391
00:19:20,680 --> 00:19:22,630
because I can simplify this further.

392
00:19:22,630 --> 00:19:26,500
Let me go ahead and now give
myself, maybe, a variable just,

393
00:19:26,500 --> 00:19:30,698
generically, called counts, and set
that equal to an empty dictionary.

394
00:19:30,698 --> 00:19:32,740
And you can actually do
this in a couple of ways.

395
00:19:32,740 --> 00:19:34,727
You can literally write
out dict with nothing

396
00:19:34,727 --> 00:19:37,060
in parentheses, which will
give you an empty dictionary,

397
00:19:37,060 --> 00:19:39,602
like the picture on the board,
but a little more conventional

398
00:19:39,602 --> 00:19:43,450
is to just use two keystrokes and use
two curly braces with nothing inside.

399
00:19:43,450 --> 00:19:47,390
That gives me an empty dictionary
like this picture here.

400
00:19:47,390 --> 00:19:49,160
Now my loop is going to be the same.

401
00:19:49,160 --> 00:19:52,000
I'm going to do, for
each row in the reader,

402
00:19:52,000 --> 00:19:55,550
I'm going to go ahead and grab the
favorite language for this person.

403
00:19:55,550 --> 00:19:58,652
So favorite equals row,
quote unquote, "language."

404
00:19:58,652 --> 00:20:00,860
But now I'm going to do
something a little different.

405
00:20:00,860 --> 00:20:02,680
There's two scenarios here.

406
00:20:02,680 --> 00:20:07,690
Either, I have seen this language before
and I want to increment it by one.

407
00:20:07,690 --> 00:20:11,950
Or, if I've never seen this language
before because the loop just started,

408
00:20:11,950 --> 00:20:16,950
what should I initialize the
count for this language to?

409
00:20:16,950 --> 00:20:18,840
So, one because I've only seen it once.

410
00:20:18,840 --> 00:20:19,710
Exactly.

411
00:20:19,710 --> 00:20:22,470
So now let me go ahead
and do exactly that.

412
00:20:22,470 --> 00:20:26,370
If this current favorite,
that I am seeing in the row,

413
00:20:26,370 --> 00:20:29,997
is already in the counts
dictionary-- and in Python,

414
00:20:29,997 --> 00:20:31,830
this is literally how
you ask that question.

415
00:20:31,830 --> 00:20:33,840
If favorite in counts.

416
00:20:33,840 --> 00:20:38,250
That will check, is there a key with
this name, Scratch, C, or Python?

417
00:20:38,250 --> 00:20:42,390
If so, go into that location
in the counts dictionary,

418
00:20:42,390 --> 00:20:46,500
index into it at the favorite
location, because favorite is a string.

419
00:20:46,500 --> 00:20:49,140
It's either, quote unquote,
"Scratch," "C," or "Python,"

420
00:20:49,140 --> 00:20:53,490
and just increment it by one like this.

421
00:20:53,490 --> 00:20:54,780
Else, as you noted.

422
00:20:54,780 --> 00:21:00,360
If it's not there implicitly, then
counts bracket favorite should probably

423
00:21:00,360 --> 00:21:03,210
be set equal-- not to zero
because we're literally

424
00:21:03,210 --> 00:21:04,920
are seeing it in the current row.

425
00:21:04,920 --> 00:21:06,660
Let's initialize it to one.

426
00:21:06,660 --> 00:21:08,460
And, thereafter, if
we see it again, it's

427
00:21:08,460 --> 00:21:11,610
going to be plus equals one,
plus equals one, plus equals one.

428
00:21:11,610 --> 00:21:15,330
So now outside of that loop, outside
of the with block, let me do this.

429
00:21:15,330 --> 00:21:18,200
For each favorite in those counts--

430
00:21:18,200 --> 00:21:20,920
And this, too, in Python is a
trick if you want to iterate over

431
00:21:20,920 --> 00:21:22,870
all of the keys in a dictionary.

432
00:21:22,870 --> 00:21:26,560
That is, if you want to iterate over the
left-hand column of all of these keys,

433
00:21:26,560 --> 00:21:30,220
you literally can say, for
something in that dictionary.

434
00:21:30,220 --> 00:21:33,730
So for favorite in counts, this is
giving me a variable called favorite

435
00:21:33,730 --> 00:21:37,420
and updating it automatically
top to bottom in that dictionary.

436
00:21:37,420 --> 00:21:40,600
Let's go ahead and print
out an f string that's

437
00:21:40,600 --> 00:21:44,230
going to say whatever the name
of that language is colon,

438
00:21:44,230 --> 00:21:50,950
and whatever the value of that
language is in that there dictionary.

439
00:21:50,950 --> 00:21:54,040
So, again, logically the only
thing that's new is this.

440
00:21:54,040 --> 00:21:57,940
I'm now using one dictionary
instead of three variables

441
00:21:57,940 --> 00:22:02,170
to keep track of three things, like
updating this chalkboard with three

442
00:22:02,170 --> 00:22:04,043
different things, Scratch, C, Python.

443
00:22:04,043 --> 00:22:06,460
And the last thing I'm doing,
which is a little different,

444
00:22:06,460 --> 00:22:09,430
is once I have that dictionary,
whether there's three languages,

445
00:22:09,430 --> 00:22:11,597
or, maybe, tomorrow there'll
be fourth because we're

446
00:22:11,597 --> 00:22:12,970
going to introduce SQL today.

447
00:22:12,970 --> 00:22:16,807
Well, this will iterate over all of
those keys and print out the values.

448
00:22:16,807 --> 00:22:18,640
All right, so if I
didn't do anything wrong,

449
00:22:18,640 --> 00:22:21,940
if I do Python of
favorites.py and hit enter.

450
00:22:21,940 --> 00:22:23,043
There we have it.

451
00:22:23,043 --> 00:22:25,210
And it happens to be in a
different order this time.

452
00:22:25,210 --> 00:22:28,090
That's because we saw Python
first, we then saw Scratch,

453
00:22:28,090 --> 00:22:31,870
and, eventually, we saw C. But if
we wanted to sort these differently,

454
00:22:31,870 --> 00:22:34,780
we actually could with
some different code.

455
00:22:34,780 --> 00:22:36,170
But, in short, what have we done?

456
00:22:36,170 --> 00:22:39,340
We've created this kind of
structure in memory with three keys,

457
00:22:39,340 --> 00:22:42,100
Python, C, and Scratch,
because each time

458
00:22:42,100 --> 00:22:47,140
we encounter such a language from you
all, we either set our counter to one

459
00:22:47,140 --> 00:22:50,110
or increment it by one instead.

460
00:22:50,110 --> 00:22:56,320
Any questions on this code, or this
general idea of using dictionaries as,

461
00:22:56,320 --> 00:23:00,490
like, a little cheat sheet for
doing some math in this way?

462
00:23:00,490 --> 00:23:04,200
Super common paradigm.

463
00:23:04,200 --> 00:23:04,700
All right.

464
00:23:04,700 --> 00:23:06,890
Well, let me tweak this a little bit.

465
00:23:06,890 --> 00:23:10,622
Right now, in my output, we're
seeing Python, Scratch and C.

466
00:23:10,622 --> 00:23:13,580
Maybe, for the sake of discussion,
suppose we want to sort this by key.

467
00:23:13,580 --> 00:23:14,580
We can actually do that.

468
00:23:14,580 --> 00:23:16,370
Let me close my terminal temporarily.

469
00:23:16,370 --> 00:23:19,037
And it turns out, in Python,
there's a bunch of ways to do this,

470
00:23:19,037 --> 00:23:21,890
but the simplest way to
sort a dictionary by key

471
00:23:21,890 --> 00:23:24,500
is literally to use a
function called sorted

472
00:23:24,500 --> 00:23:26,970
that comes with Python,
that just does it for you.

473
00:23:26,970 --> 00:23:29,720
And even if you pass it a dictionary,
it will sort that dictionary

474
00:23:29,720 --> 00:23:34,170
by the left-hand column so you can
iterate it over alphabetically instead.

475
00:23:34,170 --> 00:23:36,110
So if I go back now to VS Code.

476
00:23:36,110 --> 00:23:39,710
If I open my terminal window and
I rerun Python on favorites.py,

477
00:23:39,710 --> 00:23:41,840
now that I've added the
sorted call, we should now

478
00:23:41,840 --> 00:23:45,470
see just because it's sorted
alphabetically instead.

479
00:23:45,470 --> 00:23:49,195
Now that's not that useful, especially
if we had lots of languages.

480
00:23:49,195 --> 00:23:51,320
You probably don't care
about it being alphabetized

481
00:23:51,320 --> 00:23:54,570
as much as you care about it being
ranked by which is the most popular,

482
00:23:54,570 --> 00:23:55,700
which is the least popular.

483
00:23:55,700 --> 00:23:58,220
And, for that, there's a bunch
of ways to do this in Python.

484
00:23:58,220 --> 00:24:03,410
And, I think, the simplest way to sort
by value the right-hand column instead

485
00:24:03,410 --> 00:24:06,900
of the left-hand column is probably
to make this change instead.

486
00:24:06,900 --> 00:24:09,240
Let me close my terminal temporarily.

487
00:24:09,240 --> 00:24:14,110
Let me still use the sorted function,
which by default sorts by key,

488
00:24:14,110 --> 00:24:16,750
but let's change it to be as follows.

489
00:24:16,750 --> 00:24:21,960
Let's change it to sort by a
function called counts.get,

490
00:24:21,960 --> 00:24:24,360
which is a little weird,
but this comes back

491
00:24:24,360 --> 00:24:28,290
to last week's brief discussion of
object-oriented programming, or oop.

492
00:24:28,290 --> 00:24:30,150
Remember, in Python,
that almost everything

493
00:24:30,150 --> 00:24:31,860
is like an object of some sort.

494
00:24:31,860 --> 00:24:36,930
An int is an object, a dictionary
is an object, a string is an object.

495
00:24:36,930 --> 00:24:40,590
Which is to say that, not only do these
things have values like, quote unquote,

496
00:24:40,590 --> 00:24:44,340
"Hello, World," or 50, these
variables, these objects,

497
00:24:44,340 --> 00:24:46,980
can also have functions
built into them, a.k.a.

498
00:24:46,980 --> 00:24:47,670
methods.

499
00:24:47,670 --> 00:24:52,440
So it turns out that because counts
is a dictionary, because I made it so,

500
00:24:52,440 --> 00:24:55,350
that counts dictionary, like
any dictionary in Python,

501
00:24:55,350 --> 00:24:57,150
comes with a function called get.

502
00:24:57,150 --> 00:25:01,290
And if you just tell the sorted
function to use that built-in method,

503
00:25:01,290 --> 00:25:05,400
it will actually, for every key, get
its value, get its value, get its value,

504
00:25:05,400 --> 00:25:09,720
and sort effectively by the
right-hand column instead of the left.

505
00:25:09,720 --> 00:25:13,140
Now we'll see down the line, perhaps,
more sophisticated ways of using this,

506
00:25:13,140 --> 00:25:15,990
but, for now, this just
overrides the default behavior

507
00:25:15,990 --> 00:25:19,510
and sorts the dictionary, not
by key, but by value instead.

508
00:25:19,510 --> 00:25:20,010
All right.

509
00:25:20,010 --> 00:25:24,120
So now watch this if I run
Python of favorites.py once more.

510
00:25:24,120 --> 00:25:28,140
Previously, it was in the order in
which the languages appeared first

511
00:25:28,140 --> 00:25:31,930
in the CSV file, then it
was sorted alphabetically.

512
00:25:31,930 --> 00:25:34,960
Now it should be sorted by value.

513
00:25:34,960 --> 00:25:35,850
And, indeed.

514
00:25:35,850 --> 00:25:39,180
Scratch is the least with
40, C is the next with 78,

515
00:25:39,180 --> 00:25:41,458
Python is the biggest with 280.

516
00:25:41,458 --> 00:25:43,500
That's not much of a top
10, or a top three list.

517
00:25:43,500 --> 00:25:44,820
Let's actually reverse it.

518
00:25:44,820 --> 00:25:46,800
And the easiest way in
Python to do that is

519
00:25:46,800 --> 00:25:50,130
to pass a third argument into
sorted, and you would know this

520
00:25:50,130 --> 00:25:51,750
by just reading the documentation.

521
00:25:51,750 --> 00:25:56,370
You can literally say, reverse
equals True, capital T,

522
00:25:56,370 --> 00:26:00,480
and now if I rerun this one last
time, Python of favorites.py,

523
00:26:00,480 --> 00:26:04,110
I'll see the same values but with
the whole thing reversed in order.

524
00:26:04,110 --> 00:26:07,230
Long story short, even though
this might feel like a slog,

525
00:26:07,230 --> 00:26:11,070
like adding this and looking up
this, so much easier than in C

526
00:26:11,070 --> 00:26:13,980
where you would have had to figure
out, how does bubble sort work?

527
00:26:13,980 --> 00:26:17,520
Let me implement bubble sort, selection
sort, any of those sorting algorithms,

528
00:26:17,520 --> 00:26:18,660
or use some other library.

529
00:26:18,660 --> 00:26:21,430
In Python, you just get a lot
more for free, so to speak.

530
00:26:21,430 --> 00:26:24,390
It's just built in once you get
comfy with the documentation.

531
00:26:24,390 --> 00:26:29,370
And, to be clear, this is an
argument, as is this, as is this.

532
00:26:29,370 --> 00:26:33,540
But in Python, we have not only
positional arguments, which

533
00:26:33,540 --> 00:26:36,720
are based on what position they
are in, left to right, just like C,

534
00:26:36,720 --> 00:26:39,900
you also have these
named parameters whereby

535
00:26:39,900 --> 00:26:42,720
they have explicit names
that you can use yourself,

536
00:26:42,720 --> 00:26:45,930
to make clear that you're using
this one but not this other one.

537
00:26:45,930 --> 00:26:50,980
More parameters in Python can
be optional than in C. Phew.

538
00:26:50,980 --> 00:26:51,480
All right.

539
00:26:51,480 --> 00:26:54,727
Any questions about that technique yet?

540
00:26:54,727 --> 00:26:57,810
And if you're feeling like this is
starting to take the fun out of Python,

541
00:26:57,810 --> 00:27:01,930
that's actually kind of the
point of doing this the hard way.

542
00:27:01,930 --> 00:27:02,430
All right.

543
00:27:02,430 --> 00:27:05,110
Well, let's do it one other
way that's marginally better.

544
00:27:05,110 --> 00:27:09,833
It turns out, in Python there really
is this rich ecosystem of libraries,

545
00:27:09,833 --> 00:27:12,750
the code that comes with the language
itself, or, even, third parties.

546
00:27:12,750 --> 00:27:17,040
And coming with the
language is another module

547
00:27:17,040 --> 00:27:19,920
called the collections
module, or package here,

548
00:27:19,920 --> 00:27:25,350
whereby if I use from collections, I
can import something called Counter,

549
00:27:25,350 --> 00:27:28,140
capital C. And it
turns out, if this felt

550
00:27:28,140 --> 00:27:30,450
a little bit painful to
create a dictionary yourself,

551
00:27:30,450 --> 00:27:33,470
initialize it, maybe, to
zero or one, like this.

552
00:27:33,470 --> 00:27:36,547
Turns out, you have the same problem
that people before you have had,

553
00:27:36,547 --> 00:27:38,130
and so there's another way to do this.

554
00:27:38,130 --> 00:27:42,150
You can create a variable called
counts, set it equal to Counter, capital

555
00:27:42,150 --> 00:27:43,770
C, open paren, close paren.

556
00:27:43,770 --> 00:27:46,140
And this is a different type of object.

557
00:27:46,140 --> 00:27:48,300
It's a different type of
object in Python, that

558
00:27:48,300 --> 00:27:50,615
has counting capabilities built in.

559
00:27:50,615 --> 00:27:53,490
And so if I actually want to use
this counter instead, I can do this.

560
00:27:53,490 --> 00:27:57,930
For each row in the reader, let's go
ahead and grab the favorite language

561
00:27:57,930 --> 00:28:00,540
from that row, just like before.

562
00:28:00,540 --> 00:28:04,530
And without doing any of that headache
of like, if, elif, or any of this,

563
00:28:04,530 --> 00:28:09,060
you can literally just index
into that counter using favorite,

564
00:28:09,060 --> 00:28:14,340
quote unquote, "Scratch," or "C," or
"Python," and increment it by one.

565
00:28:14,340 --> 00:28:17,680
What the Counter class is going
to do for you, so to speak--

566
00:28:17,680 --> 00:28:20,190
Another example of object-oriented
programming, and counts

567
00:28:20,190 --> 00:28:21,435
is now an object thereof.

568
00:28:21,435 --> 00:28:25,107
What this whole feature of Counter
is going to do for you is it's

569
00:28:25,107 --> 00:28:26,940
going to automatically
initialize everything

570
00:28:26,940 --> 00:28:28,710
to zero, even if you've
never seen it before,

571
00:28:28,710 --> 00:28:30,940
and then you can just blindly
start incrementing it.

572
00:28:30,940 --> 00:28:33,570
So, in short, there's just
more pleasant ways, sometimes,

573
00:28:33,570 --> 00:28:36,280
to do something in Python as well.

574
00:28:36,280 --> 00:28:36,780
All right.

575
00:28:36,780 --> 00:28:42,750
How about, lastly, let's
make things, maybe--

576
00:28:42,750 --> 00:28:44,340
Oh, actually, let's do this.

577
00:28:44,340 --> 00:28:47,918
We can even simplify the sorting here.

578
00:28:47,918 --> 00:28:49,710
Let me actually take
this one step further.

579
00:28:49,710 --> 00:28:52,950
Instead of manually figuring out how
to sort this, I'm going to do this.

580
00:28:52,950 --> 00:29:02,650
For each favorite, and the count
thereof in the counts variable's most

581
00:29:02,650 --> 00:29:07,820
common function's return value, go
ahead and print out this as well.

582
00:29:07,820 --> 00:29:09,790
So in short, again, a bit new syntax.

583
00:29:09,790 --> 00:29:11,210
But what's going on here?

584
00:29:11,210 --> 00:29:14,410
Well, it turns out that
this counts class and--

585
00:29:14,410 --> 00:29:17,478
sorry, this Counter class, and,
in turn, this counts variable,

586
00:29:17,478 --> 00:29:20,770
comes with a function built in that you
would only know from the documentation.

587
00:29:20,770 --> 00:29:23,260
It's literally called
most underscore common,

588
00:29:23,260 --> 00:29:31,700
and what it returns to you when you call
it is a pair of key value, key value.

589
00:29:31,700 --> 00:29:34,480
And so this, too, is a trick in
Python that we did not have in C.

590
00:29:34,480 --> 00:29:37,180
If you want to iterate
over something, but grab

591
00:29:37,180 --> 00:29:40,120
two variables at a time on
each iteration like this,

592
00:29:40,120 --> 00:29:42,880
you separate them by commas
and can get favorite count,

593
00:29:42,880 --> 00:29:44,840
favorite count, favorite count.

594
00:29:44,840 --> 00:29:50,260
So if I run this now, Python of
favorites.py, this, too, just works.

595
00:29:50,260 --> 00:29:53,440
And it's getting a little simpler,
a little tighter than before

596
00:29:53,440 --> 00:29:57,740
than if we had actually
done it all manually.

597
00:29:57,740 --> 00:30:01,310
Lastly, here is a code that's
the shortest version thereof.

598
00:30:01,310 --> 00:30:03,650
We're down to like 14 or 15 lines.

599
00:30:03,650 --> 00:30:08,480
If I wanted to change this to
analyze the most popular problem thus

600
00:30:08,480 --> 00:30:12,620
far in the class, how do I
go about changing the code

601
00:30:12,620 --> 00:30:18,260
to print out, top to bottom, the
most popular problem or problems?

602
00:30:18,260 --> 00:30:19,280
What line should change?

603
00:30:19,280 --> 00:30:20,570
Yeah.

604
00:30:20,570 --> 00:30:22,070
So, yeah.

605
00:30:22,070 --> 00:30:22,610
Line 10.

606
00:30:22,610 --> 00:30:25,310
Because I've written this in
kind of a general purpose way

607
00:30:25,310 --> 00:30:27,282
and using dictionaries
with keys, it suffices

608
00:30:27,282 --> 00:30:29,240
to change language to,
quote unquote, "problem"

609
00:30:29,240 --> 00:30:31,325
because that was the
third column from the CSV.

610
00:30:31,325 --> 00:30:33,200
And so now, if you're
curious, let's actually

611
00:30:33,200 --> 00:30:35,240
make my terminal window a bit bigger.

612
00:30:35,240 --> 00:30:37,820
Python of favorites.py, enter.

613
00:30:37,820 --> 00:30:38,750
And, OK.

614
00:30:38,750 --> 00:30:40,910
Tragically, we peaked
early with Hello, World--

615
00:30:40,910 --> 00:30:45,060
is the most popular problem thus far,
followed by Filter, then Scratch.

616
00:30:45,060 --> 00:30:45,560
OK.

617
00:30:45,560 --> 00:30:46,700
Peaked even earlier.

618
00:30:46,700 --> 00:30:51,660
Mario, DNA, and so forth, and
a bunch of others thereafter.

619
00:30:51,660 --> 00:30:55,477
So based on this sample size, here's
the ranking of the problems thus far.

620
00:30:55,477 --> 00:30:56,060
So, we got it.

621
00:30:56,060 --> 00:30:59,570
More Hello, World problems
in the weeks to come.

622
00:30:59,570 --> 00:31:01,160
All right.

623
00:31:01,160 --> 00:31:04,520
Now that we've done
that in that way, let's

624
00:31:04,520 --> 00:31:07,070
just make this program
slightly interactive

625
00:31:07,070 --> 00:31:11,070
and see how we can really take a
fundamentally different approach.

626
00:31:11,070 --> 00:31:12,620
I'm going to go into VS Code.

627
00:31:12,620 --> 00:31:15,798
I'm going to keep everything the
same, except that, at the bottom,

628
00:31:15,798 --> 00:31:17,840
I'm going to get rid of
this loop because I don't

629
00:31:17,840 --> 00:31:20,120
want any more print out everything.

630
00:31:20,120 --> 00:31:22,040
I want to look up specific counts.

631
00:31:22,040 --> 00:31:25,060
Like, how popular was this problem,
how popular was this other problem?

632
00:31:25,060 --> 00:31:28,310
And what I'm going to go ahead and do
is to create a variable called favorite,

633
00:31:28,310 --> 00:31:30,180
set it equal to--

634
00:31:30,180 --> 00:31:32,060
I could use get string
in the CS50 library,

635
00:31:32,060 --> 00:31:34,790
but we saw last week there's no
need to for strings, certainly.

636
00:31:34,790 --> 00:31:37,340
Let me just use the input
function that comes with Python,

637
00:31:37,340 --> 00:31:39,890
and prompt the human for
their favorite problem.

638
00:31:39,890 --> 00:31:45,020
And then let me go ahead and print out,
for instance, an f string containing

639
00:31:45,020 --> 00:31:51,920
whatever their favorite is, colon,
and whatever the count is thereof

640
00:31:51,920 --> 00:31:54,290
of that favorite, close quote.

641
00:31:54,290 --> 00:31:56,400
So let me open my terminal window.

642
00:31:56,400 --> 00:31:59,300
Let me run Python of
favorites.py, enter.

643
00:31:59,300 --> 00:32:04,640
And if I type in Hello, World, looks
like 65 people, indeed, like that one.

644
00:32:04,640 --> 00:32:06,860
If I run Python of favorites.py again.

645
00:32:06,860 --> 00:32:09,350
I type in Scratch, now we see that one.

646
00:32:09,350 --> 00:32:12,570
If I type in anything else, I'm
going to get its specific value.

647
00:32:12,570 --> 00:32:18,050
So this is to say, not only can we write
Python code to analyze some data pretty

648
00:32:18,050 --> 00:32:20,720
tightly versus the manual
code we wrote out earlier,

649
00:32:20,720 --> 00:32:23,238
you can also make these
programs interactive as well.

650
00:32:23,238 --> 00:32:25,530
And this is going to be a
super common paradigm, right,

651
00:32:25,530 --> 00:32:28,880
if you go into the world of consulting,
analytics, data science, more

652
00:32:28,880 --> 00:32:29,540
generally.

653
00:32:29,540 --> 00:32:34,310
Among your roles is going to be to
analyze data, to ask questions of data,

654
00:32:34,310 --> 00:32:37,610
get back the answer. be ask questions
of data, get back the answer.

655
00:32:37,610 --> 00:32:40,190
Honestly, life gets pretty
tedious, even though you've only

656
00:32:40,190 --> 00:32:43,670
been programming in Python, perhaps, for
like one week, a week and a half now.

657
00:32:43,670 --> 00:32:46,880
When you have to write code to
solve all of the world's problems--

658
00:32:46,880 --> 00:32:49,250
and there's this sort of
tenant in programming,

659
00:32:49,250 --> 00:32:53,270
that programmers tend to avoid
writing code as much as they can

660
00:32:53,270 --> 00:32:55,190
because, ideally, you
would solve problems

661
00:32:55,190 --> 00:32:58,640
with the right tool for the job,
minimizing the number of lines of code

662
00:32:58,640 --> 00:32:59,880
you actually write.

663
00:32:59,880 --> 00:33:01,920
So how do we actually get to that point?

664
00:33:01,920 --> 00:33:06,740
Well, instead of just dealing with
CSV files, pure text, it turns out

665
00:33:06,740 --> 00:33:09,200
there's an entire world
of proper databases.

666
00:33:09,200 --> 00:33:12,080
Not flat file databases, where you
store everything in text files,

667
00:33:12,080 --> 00:33:16,370
but a database program, a piece
of software running on a computer,

668
00:33:16,370 --> 00:33:18,922
running on a server, that's
always listening for you.

669
00:33:18,922 --> 00:33:22,130
It's got a lot of memory, it's got a
lot of space, and in turn a lot of data,

670
00:33:22,130 --> 00:33:25,460
and it supports a
database specific language

671
00:33:25,460 --> 00:33:31,020
that makes it much easier, much faster
to ask questions of the very same data.

672
00:33:31,020 --> 00:33:33,290
It's a relational database
in the sense, too,

673
00:33:33,290 --> 00:33:35,480
that it's not even
necessarily one spreadsheet,

674
00:33:35,480 --> 00:33:37,430
one set of rows and columns.

675
00:33:37,430 --> 00:33:41,960
You can have two sheets, three sheets,
30 sheets across which there might very

676
00:33:41,960 --> 00:33:44,840
well be relationships, or relations.

677
00:33:44,840 --> 00:33:48,830
So S-Q-L, or SQL, is a
database specific language,

678
00:33:48,830 --> 00:33:52,625
stands for Structured Query Language,
that's a declarative language whereby

679
00:33:52,625 --> 00:33:54,500
you're not going to be
in the habit with SQL,

680
00:33:54,500 --> 00:33:58,870
typically, of writing loops and
conditionals, and this kind of thing.

681
00:33:58,870 --> 00:34:02,260
You're instead going to describe
the data that you want to get back,

682
00:34:02,260 --> 00:34:05,680
you're going to describe the
question that you want the answer to,

683
00:34:05,680 --> 00:34:08,650
and we'll do this using a
relatively small grammar.

684
00:34:08,650 --> 00:34:11,620
That is to say, there's not
that many keywords in SQL.

685
00:34:11,620 --> 00:34:12,850
It's a pretty small language.

686
00:34:12,850 --> 00:34:16,120
But it's going to allow us to eliminate
dozens of lines of Python code,

687
00:34:16,120 --> 00:34:16,900
perhaps.

688
00:34:16,900 --> 00:34:19,429
SQL follows this CRUD paradigm.

689
00:34:19,429 --> 00:34:23,230
So C-R-U-D, which simply means
that in a relational database,

690
00:34:23,230 --> 00:34:25,030
you can really only do four things.

691
00:34:25,030 --> 00:34:30,100
You can create data, read data-- that
is, look at it or analyze it somehow.

692
00:34:30,100 --> 00:34:32,540
Update the data, or delete the data.

693
00:34:32,540 --> 00:34:33,770
So, CRUD, for short.

694
00:34:33,770 --> 00:34:36,673
And that really speaks to
just how relatively simple

695
00:34:36,673 --> 00:34:39,340
the world is, even though we'll
just scratch the surface of some

696
00:34:39,340 --> 00:34:40,423
of its capabilities today.

697
00:34:40,423 --> 00:34:42,560
And you'll explore more over time.

698
00:34:42,560 --> 00:34:45,340
Specifically, in SQL, there's
going to be other keywords that

699
00:34:45,340 --> 00:34:46,719
map to those four ideas.

700
00:34:46,719 --> 00:34:49,480
Technically, you don't just
create data in the world of SQL,

701
00:34:49,480 --> 00:34:53,270
you can also insert data, like
inserting more rows into a sheet.

702
00:34:53,270 --> 00:34:55,330
And it's not the word
"read" that people use.

703
00:34:55,330 --> 00:34:56,739
People say to "select" data.

704
00:34:56,739 --> 00:35:00,170
But they mean to read data, which
is sort of the opposite of writing

705
00:35:00,170 --> 00:35:01,220
or creating data.

706
00:35:01,220 --> 00:35:04,070
But the U and the D are the
same, except that there's also

707
00:35:04,070 --> 00:35:07,580
a keyword in SQL known as DROP,
which lets you very destructively,

708
00:35:07,580 --> 00:35:11,610
very dangerously delete entire
database tables, as well.

709
00:35:11,610 --> 00:35:15,500
So how do we do this, and what's
the connection to our favorites data

710
00:35:15,500 --> 00:35:16,190
thus far?

711
00:35:16,190 --> 00:35:18,920
Well, here is the syntax
in this language called

712
00:35:18,920 --> 00:35:22,880
SQL via which you can create a table.

713
00:35:22,880 --> 00:35:26,510
So the jargon is a little different,
but the ideas are exactly the same

714
00:35:26,510 --> 00:35:27,840
from the world of spreadsheets.

715
00:35:27,840 --> 00:35:33,720
What you call a sheet in a spreadsheet,
the database world calls a table.

716
00:35:33,720 --> 00:35:37,320
It's a table of rows and columns,
but it's the exact same idea.

717
00:35:37,320 --> 00:35:40,320
You're going to have discretion
over what to call the table,

718
00:35:40,320 --> 00:35:42,710
just like you can call a
spreadsheet something, or else,

719
00:35:42,710 --> 00:35:44,790
and you can also specify
the types of data

720
00:35:44,790 --> 00:35:46,790
that you want to store
in your rows and columns.

721
00:35:46,790 --> 00:35:49,520
And it's going to go a little more
deeply than just formatting it,

722
00:35:49,520 --> 00:35:51,687
like in Excel, and Numbers,
and Google Spreadsheets,

723
00:35:51,687 --> 00:35:55,250
you can actually control, maybe,
how big the data could be depending

724
00:35:55,250 --> 00:35:57,140
on the database you're actually using.

725
00:35:57,140 --> 00:35:59,990
In CS50, we're going to
use a light version of SQL.

726
00:35:59,990 --> 00:36:01,820
Literally, a language called--

727
00:36:01,820 --> 00:36:04,970
an implementation of SQL
called SQLite, which has really

728
00:36:04,970 --> 00:36:08,360
all of the core functionality that
you would see in the real world,

729
00:36:08,360 --> 00:36:10,702
and with larger, more
scalable systems, but it's

730
00:36:10,702 --> 00:36:13,160
going to allow us to focus on
a lot of the building blocks.

731
00:36:13,160 --> 00:36:17,010
And SQLite's actually really popular
on Macs, PCs, and phones, nowadays.

732
00:36:17,010 --> 00:36:20,450
A lot of the data that games and other
applications on your phone might store,

733
00:36:20,450 --> 00:36:24,470
actually have a file, a binary
file with zeros and ones,

734
00:36:24,470 --> 00:36:26,393
that's in the SQLite format.

735
00:36:26,393 --> 00:36:29,060
So if you do a mobile app, for
instance, for your final project,

736
00:36:29,060 --> 00:36:32,220
you'll have an opportunity to
play with something like this.

737
00:36:32,220 --> 00:36:34,760
Well, how do you actually run SQLite3?

738
00:36:34,760 --> 00:36:37,500
It's just a command built
into your code space.

739
00:36:37,500 --> 00:36:40,400
So this is a program you could
install on your own Mac, your own PC,

740
00:36:40,400 --> 00:36:40,910
or the like.

741
00:36:40,910 --> 00:36:43,280
We'll do everything as we've
done before, in the cloud,

742
00:36:43,280 --> 00:36:44,780
and actually use your code space.

743
00:36:44,780 --> 00:36:48,170
And by that I mean, we can just
start to play with this data now

744
00:36:48,170 --> 00:36:51,890
using SQL instead of Python.

745
00:36:51,890 --> 00:36:52,710
So let me do this.

746
00:36:52,710 --> 00:36:55,190
Let me open up my terminal
window here, and let

747
00:36:55,190 --> 00:36:58,010
me go ahead and maximize my
terminal window just because we'll

748
00:36:58,010 --> 00:37:00,140
focus now on the files here.

749
00:37:00,140 --> 00:37:03,530
Recall that I have a file
called favorites.csv,

750
00:37:03,530 --> 00:37:05,510
and that CSV file is just text.

751
00:37:05,510 --> 00:37:08,300
But let me load it
into a proper database

752
00:37:08,300 --> 00:37:11,510
so I can actually use this
other language called SQL on it.

753
00:37:11,510 --> 00:37:14,210
To do this, I'm going
to run SQLite3, which

754
00:37:14,210 --> 00:37:16,070
just means the third
version of it, and I'm

755
00:37:16,070 --> 00:37:18,950
going to create a new
database called favorites.db.

756
00:37:18,950 --> 00:37:21,380
That's just a convention,
but it means here comes

757
00:37:21,380 --> 00:37:23,030
a database that I'm going to create.

758
00:37:23,030 --> 00:37:25,370
Notice, I'm not using
the tabbed code editor.

759
00:37:25,370 --> 00:37:28,460
I'm not using the code command
because the code command is generally

760
00:37:28,460 --> 00:37:29,600
for text files.

761
00:37:29,600 --> 00:37:33,440
SQLite3 is going to create a binary
file, zeros and ones, ultimately.

762
00:37:33,440 --> 00:37:35,750
When I run that, it's going
to ask me to verify yes.

763
00:37:35,750 --> 00:37:37,400
I'm going to hit y and then enter.

764
00:37:37,400 --> 00:37:40,940
And now I'm at the SQLite prompt,
which is not the dollar sign.

765
00:37:40,940 --> 00:37:43,730
It literally says SQLite
with an angled bracket.

766
00:37:43,730 --> 00:37:49,220
Now, one time only, I want to
go ahead and load favorites.csv

767
00:37:49,220 --> 00:37:53,150
into this database so I can actually
play around with it using not Python,

768
00:37:53,150 --> 00:37:54,982
but this new language called SQL.

769
00:37:54,982 --> 00:37:56,940
And the way I'm going to
do this is as follows.

770
00:37:56,940 --> 00:37:57,773
I'm going to do dot.

771
00:37:57,773 --> 00:38:00,130
Mode csv, enter.

772
00:38:00,130 --> 00:38:03,430
And that just puts SQLite into CSV mode.

773
00:38:03,430 --> 00:38:05,650
It has different modes for
different file formats.

774
00:38:05,650 --> 00:38:09,700
I'm going to .import and then I'm
going to specify the file that I want

775
00:38:09,700 --> 00:38:11,800
to import, which is favorites.csv.

776
00:38:11,800 --> 00:38:13,030
And then this one's up to me.

777
00:38:13,030 --> 00:38:15,580
What is the name of the
table I want to create?

778
00:38:15,580 --> 00:38:18,012
And table, again, is essentially
synonymous with sheets.

779
00:38:18,012 --> 00:38:19,720
So I'm going to call
everything the same.

780
00:38:19,720 --> 00:38:22,700
I'm going to call my
table favorites as well.

781
00:38:22,700 --> 00:38:25,420
So what this command is
essentially going to do,

782
00:38:25,420 --> 00:38:29,890
is all of those lines of Python code
that open the file, read it row by row,

783
00:38:29,890 --> 00:38:31,028
and do something with it--

784
00:38:31,028 --> 00:38:32,320
This is just built into SQLite.

785
00:38:32,320 --> 00:38:37,720
It's going to load the whole darn
CSV into this new favorites.db file,

786
00:38:37,720 --> 00:38:38,980
and then that's it for now.

787
00:38:38,980 --> 00:38:42,220
I'm going to go ahead and literally
type .quit to get out of SQLite.

788
00:38:42,220 --> 00:38:43,810
I'm back at my dollar sign prompt.

789
00:38:43,810 --> 00:38:47,500
If I type ls, I have
not only favorites.csv,

790
00:38:47,500 --> 00:38:54,790
I also have favorites.db now as well, a
brand new file, and in that file now is

791
00:38:54,790 --> 00:38:57,560
an optimized version of the CSV file.

792
00:38:57,560 --> 00:39:00,860
In that DB file now is a
version of the data that's

793
00:39:00,860 --> 00:39:04,430
going to lend itself to CRUD
operations, creating, reading, update,

794
00:39:04,430 --> 00:39:08,810
and deleting, using this
new language called SQL.

795
00:39:08,810 --> 00:39:10,478
All right, so how do I get into this?

796
00:39:10,478 --> 00:39:13,520
Well, let me clear my terminal window
and pretend that I'm doing this now

797
00:39:13,520 --> 00:39:14,062
the next day.

798
00:39:14,062 --> 00:39:15,478
I've already created the database.

799
00:39:15,478 --> 00:39:16,670
That's a one time operation.

800
00:39:16,670 --> 00:39:18,545
Once you've got the
data, now I'm going to go

801
00:39:18,545 --> 00:39:24,080
ahead and again run SQLite3 favorites.db
just to open the file again.

802
00:39:24,080 --> 00:39:27,260
But it's already now-- all
of the data is in there.

803
00:39:27,260 --> 00:39:32,420
Just as a teaser, let
me go ahead and do this.

804
00:39:32,420 --> 00:39:38,580
.schema is a SQLite command that just
shows me the schema of this database

805
00:39:38,580 --> 00:39:39,080
table.

806
00:39:39,080 --> 00:39:41,130
And we'll see more about
this in a little bit,

807
00:39:41,130 --> 00:39:44,630
but for now this is
showing me, essentially,

808
00:39:44,630 --> 00:39:48,170
the SQL command that
was automatically run

809
00:39:48,170 --> 00:39:51,650
when I imported this database
the first time around.

810
00:39:51,650 --> 00:39:54,680
And, for now, just notice
that it mentions timestamp,

811
00:39:54,680 --> 00:39:57,270
it mentions language,
it mentions problem.

812
00:39:57,270 --> 00:39:59,790
Very loosely, it calls
each of those texts.

813
00:39:59,790 --> 00:40:02,370
So we're not trying very hard
to distinguish one type of data

814
00:40:02,370 --> 00:40:02,912
from another.

815
00:40:02,912 --> 00:40:03,690
It's all text.

816
00:40:03,690 --> 00:40:06,790
But notice, create table
If not exists favorites.

817
00:40:06,790 --> 00:40:09,420
This is essentially
the create table syntax

818
00:40:09,420 --> 00:40:11,490
that I alluded to
earlier via which you can

819
00:40:11,490 --> 00:40:15,510
create a table in a SQLite database.

820
00:40:15,510 --> 00:40:17,280
But more on that in just a bit.

821
00:40:17,280 --> 00:40:22,500
Here now is how we can actually
get at the data in that database.

822
00:40:22,500 --> 00:40:28,380
It turns out that we can select
one or more columns from a database

823
00:40:28,380 --> 00:40:30,180
table using syntax like this.

824
00:40:30,180 --> 00:40:33,870
Literally, the keyword select, then
the name of one or more columns

825
00:40:33,870 --> 00:40:38,550
that are in that database, and then from
the specific table that you care about.

826
00:40:38,550 --> 00:40:41,760
And notice that in capital
letters here are all of the SQL

827
00:40:41,760 --> 00:40:44,760
specific keywords, select,
and from, in particular.

828
00:40:44,760 --> 00:40:47,940
And in lowercase, by convention,
here are the placeholders

829
00:40:47,940 --> 00:40:51,270
for the columns that you, or I, have
created, and the tables that you, or I,

830
00:40:51,270 --> 00:40:52,210
have created.

831
00:40:52,210 --> 00:40:53,820
So if I go back to SQLite here.

832
00:40:53,820 --> 00:40:57,090
Let me just clear with Control L, which
will just freshen up the screen here

833
00:40:57,090 --> 00:40:58,770
so we can focus on what's new.

834
00:40:58,770 --> 00:41:03,600
If I want to select everything
from the table called favorites,

835
00:41:03,600 --> 00:41:04,890
here's what I can do.

836
00:41:04,890 --> 00:41:09,570
Select star from favorites semicolon.

837
00:41:09,570 --> 00:41:12,660
And, do forgive me, semicolons
are back for SQL, in this case.

838
00:41:12,660 --> 00:41:16,680
But select star from favorites uses a
syntax you might not be familiar with.

839
00:41:16,680 --> 00:41:18,660
Star here has nothing
to do with pointers.

840
00:41:18,660 --> 00:41:20,100
Star is a wild card.

841
00:41:20,100 --> 00:41:22,290
It means give me everything,
no matter what it's

842
00:41:22,290 --> 00:41:24,480
called, from this particular table.

843
00:41:24,480 --> 00:41:27,030
When I hit enter, what
we're going to see

844
00:41:27,030 --> 00:41:30,810
is the entire contents
of the favorite table

845
00:41:30,810 --> 00:41:34,480
that's the result of having imported
that CSV into this database.

846
00:41:34,480 --> 00:41:37,380
So when I hit enter,
there is all of that data.

847
00:41:37,380 --> 00:41:40,110
And SQLite, just to be
friendly, it's using

848
00:41:40,110 --> 00:41:44,220
what we might call ASCII art, just
very simple text, like hyphens,

849
00:41:44,220 --> 00:41:47,310
and vertical bars, and pluses on
the corner, to make it look pretty

850
00:41:47,310 --> 00:41:49,690
and make it look like
it is a proper table.

851
00:41:49,690 --> 00:41:52,620
But what you're really seeing
is the contents of favorites.db,

852
00:41:52,620 --> 00:41:54,750
specifically in that table.

853
00:41:54,750 --> 00:41:57,130
Specifically, if I only
care about languages,

854
00:41:57,130 --> 00:41:59,130
let me try something
more specific than star.

855
00:41:59,130 --> 00:42:03,270
Select language from
favorites semicolon.

856
00:42:03,270 --> 00:42:05,640
This is going to give me
just a single column now,

857
00:42:05,640 --> 00:42:09,150
of all of the favorites that you
selected for language specifically.

858
00:42:09,150 --> 00:42:12,730
This is a little overwhelming to
see all 399 or so pieces of data,

859
00:42:12,730 --> 00:42:14,680
so let me actually
truncate it a little bit.

860
00:42:14,680 --> 00:42:20,333
Let me do select language
from favorites limit 10.

861
00:42:20,333 --> 00:42:22,500
So we're about to see that
there's little tricks you

862
00:42:22,500 --> 00:42:26,430
can use to tweak the
behavior of the language

863
00:42:26,430 --> 00:42:28,500
in order to get back more or less data.

864
00:42:28,500 --> 00:42:31,110
In fact, it turns out
there's a bunch of keywords

865
00:42:31,110 --> 00:42:35,880
like these built into SQL, much like
Google Spreadsheets, Apple Numbers,

866
00:42:35,880 --> 00:42:38,293
Microsoft Excel, and certainly Python.

867
00:42:38,293 --> 00:42:41,460
There's a lot of functionality that you
just get for free with the language.

868
00:42:41,460 --> 00:42:44,610
If you want to calculate an average,
count the number of things in a file,

869
00:42:44,610 --> 00:42:48,090
get the unique or distinct values,
force everything to lowercase,

870
00:42:48,090 --> 00:42:50,380
force everything to uppercase,
get the maximum value,

871
00:42:50,380 --> 00:42:53,505
minimum value-- much like spreadsheets,
if you're familiar with that world,

872
00:42:53,505 --> 00:42:56,880
you get all of that functionality
in SQL but also more.

873
00:42:56,880 --> 00:42:59,680
So, for instance, if I go back
to my terminal window here.

874
00:42:59,680 --> 00:43:02,490
Let me go ahead and select
the total number of favorites

875
00:43:02,490 --> 00:43:06,150
in this table, the total number
of rows that you all inputted.

876
00:43:06,150 --> 00:43:10,320
So I could do select star
from favorites semicolon,

877
00:43:10,320 --> 00:43:12,450
and then I could literally
start counting these.

878
00:43:12,450 --> 00:43:15,330
Like 1, 2, 3, 4-- there's
clearly a better way.

879
00:43:15,330 --> 00:43:19,480
And, indeed, on our list of
functions was a count function.

880
00:43:19,480 --> 00:43:22,200
And so the way I can use
that in SQL is like this.

881
00:43:22,200 --> 00:43:25,230
Select count of star--

882
00:43:25,230 --> 00:43:28,113
so pass star in as an argument
to the count function.

883
00:43:28,113 --> 00:43:30,780
You don't care what columns you're
counting just count them all.

884
00:43:30,780 --> 00:43:33,440
From favorites semicolon.

885
00:43:33,440 --> 00:43:35,190
And now, you're actually
going to get back

886
00:43:35,190 --> 00:43:40,050
like a little baby table that has just
one row, one column inside of which--

887
00:43:40,050 --> 00:43:43,410
one cell of which has
the total actual count.

888
00:43:43,410 --> 00:43:47,610
And it's 398 because 399, recall,
included the actual header

889
00:43:47,610 --> 00:43:49,480
row from the file.

890
00:43:49,480 --> 00:43:49,980
All right.

891
00:43:49,980 --> 00:43:52,260
So suppose you want to--

892
00:43:52,260 --> 00:43:54,480
actually, note that this
is the exact same thing

893
00:43:54,480 --> 00:43:58,840
as counting a specific
column because every row

894
00:43:58,840 --> 00:44:00,670
has the same number of columns, three.

895
00:44:00,670 --> 00:44:03,790
We could just say select
the count of languages,

896
00:44:03,790 --> 00:44:07,720
or select the count of problems.

897
00:44:07,720 --> 00:44:10,023
All of those are going to
give me back the same answer.

898
00:44:10,023 --> 00:44:12,190
It is, therefore, conventional
in SQL if you're just

899
00:44:12,190 --> 00:44:14,080
trying to count the
number of rows, don't even

900
00:44:14,080 --> 00:44:15,413
worry about what they're called.

901
00:44:15,413 --> 00:44:18,722
Just do count star to get
back everything more simply.

902
00:44:18,722 --> 00:44:21,430
All right, but what if we want to
get back the distinct languages

903
00:44:21,430 --> 00:44:23,972
and we didn't know a priori that
this came from a Google form

904
00:44:23,972 --> 00:44:25,107
with three radio buttons?

905
00:44:25,107 --> 00:44:26,690
Well, we could do something like this.

906
00:44:26,690 --> 00:44:33,130
We could select the distinct languages
from the favorites table, enter.

907
00:44:33,130 --> 00:44:35,950
And that gives me Python,
Scratch, C because distinct

908
00:44:35,950 --> 00:44:38,200
is one of the other functions
that comes with SQL.

909
00:44:38,200 --> 00:44:40,990
This is, obviously, very easily
countable with my human eyes,

910
00:44:40,990 --> 00:44:43,210
but if I wanted to do
this more dynamically,

911
00:44:43,210 --> 00:44:47,530
I could change this to be
count the distinct languages.

912
00:44:47,530 --> 00:44:50,200
And just like in C, just like
in Python, just like in Scratch,

913
00:44:50,200 --> 00:44:52,987
I can nest these functions
and pass the output of one

914
00:44:52,987 --> 00:44:54,070
into the input of another.

915
00:44:54,070 --> 00:44:59,550
If I hit enter now, I now
get three in this case here.

916
00:44:59,550 --> 00:45:00,050
OK.

917
00:45:00,050 --> 00:45:04,580
Let me pause to see if there's any
questions or confusion just yet.

918
00:45:04,580 --> 00:45:05,120
Yeah.

919
00:45:05,120 --> 00:45:07,000
[INDISTINCT SPEECH]

920
00:45:07,000 --> 00:45:07,770
Does SQLite--

921
00:45:07,770 --> 00:45:09,910
[INDISTINCT SPEECH]

922
00:45:09,910 --> 00:45:11,980
SQLite3 is a program.

923
00:45:11,980 --> 00:45:16,210
And it's an implementation
of the SQLite language, which

924
00:45:16,210 --> 00:45:18,820
itself is a lightweight
version of what the world known

925
00:45:18,820 --> 00:45:21,890
as SQL, which is a very
convoluted way of saying

926
00:45:21,890 --> 00:45:23,390
there's lots of humans in the world.

927
00:45:23,390 --> 00:45:25,420
Not everyone agrees what SQL should be.

928
00:45:25,420 --> 00:45:28,720
Microsoft might disagree with Oracle,
might disagree with other companies,

929
00:45:28,720 --> 00:45:29,390
as well.

930
00:45:29,390 --> 00:45:32,500
So there's a common
subset of SQL in the world

931
00:45:32,500 --> 00:45:35,200
that almost everyone knows,
and learns, and uses,

932
00:45:35,200 --> 00:45:37,900
but there are also some
vendor specific features.

933
00:45:37,900 --> 00:45:40,790
SQLite tries to distill things
really into the essence,

934
00:45:40,790 --> 00:45:44,430
and so that's what you increasingly see
on Android, on iOS, on Macs, and PCs,

935
00:45:44,430 --> 00:45:44,930
as well.

936
00:45:44,930 --> 00:45:47,650
So we use it because it's
relatively canonical.

937
00:45:47,650 --> 00:45:49,330
Good question.

938
00:45:49,330 --> 00:45:49,830
All right.

939
00:45:49,830 --> 00:45:53,550
So let's do a few other things by
introducing a few other keywords

940
00:45:53,550 --> 00:45:55,440
without trying all of these right now.

941
00:45:55,440 --> 00:45:59,730
Here in this list is a
bunch of new keywords

942
00:45:59,730 --> 00:46:01,740
that are going to give
us even finer control.

943
00:46:01,740 --> 00:46:04,920
And we saw limit already, and
that just limits the output.

944
00:46:04,920 --> 00:46:08,010
But you can also have what
are called predicates.

945
00:46:08,010 --> 00:46:11,280
You can literally use the keyword
where to start filtering the data,

946
00:46:11,280 --> 00:46:14,640
without using an if, and an elif, and
an elif, and an elif, and so forth.

947
00:46:14,640 --> 00:46:17,640
You can just in one line
express something conditionally,

948
00:46:17,640 --> 00:46:21,067
you can order the data, and you can
even group similar data together.

949
00:46:21,067 --> 00:46:22,150
So what do I mean by this?

950
00:46:22,150 --> 00:46:24,622
Let me go back to VS Code
here, and let me play around

951
00:46:24,622 --> 00:46:25,830
with a few different queries.

952
00:46:25,830 --> 00:46:32,490
Let me select, maybe, the count of
rows from favorites, which previously

953
00:46:32,490 --> 00:46:38,250
was going to be 398 if I just get back
all of the rows, but suppose I only

954
00:46:38,250 --> 00:46:41,190
want to know how many of you
liked C. I can then say something

955
00:46:41,190 --> 00:46:46,500
like where the language in each
row equals, quote unquote, "C,"

956
00:46:46,500 --> 00:46:49,750
and the convention here is to
use single quotes, though SQLite

957
00:46:49,750 --> 00:46:51,790
is tolerant of other formats as well.

958
00:46:51,790 --> 00:46:56,257
If I hit enter here, I'll see, indeed,
as we saw with Python, the 78 number.

959
00:46:56,257 --> 00:46:57,340
That, honestly, took what?

960
00:46:57,340 --> 00:46:59,690
13, 14, 15 lines of code?

961
00:46:59,690 --> 00:47:04,990
Now I've distilled that kind of
query into a single line of SQL code

962
00:47:04,990 --> 00:47:07,180
instead, by using this
built in functionality.

963
00:47:07,180 --> 00:47:11,110
Suppose I really want to get specific,
and how many of you really liked

964
00:47:11,110 --> 00:47:14,500
Hello, World in C as your favorite?

965
00:47:14,500 --> 00:47:16,510
Well, I could change this query.

966
00:47:16,510 --> 00:47:19,330
And just like your dollar
sign prompt, your shell,

967
00:47:19,330 --> 00:47:22,660
you can go up and down in your
history in SQLite to save keystrokes.

968
00:47:22,660 --> 00:47:24,520
You can use Boolean logic.

969
00:47:24,520 --> 00:47:27,790
And I can say language
equals C AND, maybe,

970
00:47:27,790 --> 00:47:31,490
problem equals, quote
unquote, "Hello, World,"

971
00:47:31,490 --> 00:47:34,730
and the number of you that
liked that problem was seven.

972
00:47:34,730 --> 00:47:37,900
So really, really early on
likes Hello, World in C.

973
00:47:37,900 --> 00:47:39,910
Now notice a couple of key differences.

974
00:47:39,910 --> 00:47:43,900
One, I'm using AND, and not
ampersand, ampersand like in C. I'm

975
00:47:43,900 --> 00:47:46,100
using single equal signs.

976
00:47:46,100 --> 00:47:50,030
So SQL behaves like Scratch does,
which is not like Python or C. Why?

977
00:47:50,030 --> 00:47:52,780
Different people have implemented
different languages differently.

978
00:47:52,780 --> 00:47:57,310
Equals, equals, equality
in the world of SQL

979
00:47:57,310 --> 00:47:59,120
for comparing things left and right.

980
00:47:59,120 --> 00:47:59,620
All right.

981
00:47:59,620 --> 00:48:01,870
Things are now going to get
a little more interesting,

982
00:48:01,870 --> 00:48:05,140
but the whole goal of
all of that Python code

983
00:48:05,140 --> 00:48:09,650
was to analyze the ranking of
languages and popularity thereof.

984
00:48:09,650 --> 00:48:12,400
Turns out in SQL, once
you have the vocabulary,

985
00:48:12,400 --> 00:48:14,590
it's pretty easy to do
something like that.

986
00:48:14,590 --> 00:48:15,560
I'm going to do this.

987
00:48:15,560 --> 00:48:19,120
I'm going to select all of
the languages in the table,

988
00:48:19,120 --> 00:48:23,020
but I'm also going to
select the count thereof.

989
00:48:23,020 --> 00:48:26,020
And then I'm going to do that
from the favorites table,

990
00:48:26,020 --> 00:48:29,560
but I'm going to group
by language because I

991
00:48:29,560 --> 00:48:33,370
claimed a moment ago that group by is
another one of our key phrases in SQL

992
00:48:33,370 --> 00:48:35,120
that's going to let us group data.

993
00:48:35,120 --> 00:48:37,600
And what this effectively
means is that if you've

994
00:48:37,600 --> 00:48:41,530
got this table with a lot of
duplicate languages, again, and again,

995
00:48:41,530 --> 00:48:45,130
and again, you can group by
that column, and, essentially,

996
00:48:45,130 --> 00:48:48,850
smush all of the Python rows together,
all of the Scratch rows together,

997
00:48:48,850 --> 00:48:52,930
all of the C rows together, but
figure out how many of those

998
00:48:52,930 --> 00:48:55,240
rows just got smushed together.

999
00:48:55,240 --> 00:48:57,910
Effectively, doing all of
that dictionary legwork,

1000
00:48:57,910 --> 00:49:01,090
or the counter legwork,
that I did in 13--

1001
00:49:01,090 --> 00:49:02,990
15 lines of Python code.

1002
00:49:02,990 --> 00:49:06,700
So if I hit enter here,
this now is the motivation

1003
00:49:06,700 --> 00:49:08,410
for what we're now starting to do.

1004
00:49:08,410 --> 00:49:11,860
I have distilled into
a single line of code

1005
00:49:11,860 --> 00:49:14,140
in a language called SQL
what, indeed, took me

1006
00:49:14,140 --> 00:49:18,280
more than a dozen lines of Python
code just to get back an answer.

1007
00:49:18,280 --> 00:49:20,110
And I can do the same
thing with problem.

1008
00:49:20,110 --> 00:49:23,230
I can just change language here,
for instance, to problem instead.

1009
00:49:23,230 --> 00:49:26,300
But, per this list, I can not only
group things, I can order them.

1010
00:49:26,300 --> 00:49:29,440
So if you actually want to get a top
10, or a top three list, well let's

1011
00:49:29,440 --> 00:49:30,880
just change this query slightly.

1012
00:49:30,880 --> 00:49:37,970
Before the semicolon, let me order
by the count of those rows semicolon.

1013
00:49:37,970 --> 00:49:40,570
And now what I get is
from smallest to largest.

1014
00:49:40,570 --> 00:49:42,850
40, 78, 280.

1015
00:49:42,850 --> 00:49:44,590
If you want to flip that, that's fine.

1016
00:49:44,590 --> 00:49:49,630
By default, order by uses
ascending order, abbreviated A-S-C.

1017
00:49:49,630 --> 00:49:53,810
If you want to do descending order,
D-E-S-C, you can do that as well.

1018
00:49:53,810 --> 00:49:57,680
And now we have a top three
list, from largest to smallest.

1019
00:49:57,680 --> 00:50:01,360
Now, honestly, this is a bit of a
mouthful to use count star over here,

1020
00:50:01,360 --> 00:50:02,920
count star over here.

1021
00:50:02,920 --> 00:50:06,580
There's a nicety in SQL, too, where
you can create little aliases of sorts.

1022
00:50:06,580 --> 00:50:08,890
So if I use the same query again--

1023
00:50:08,890 --> 00:50:10,850
let me scroll over to the left.

1024
00:50:10,850 --> 00:50:13,990
I can actually use the
keyword as here, and I

1025
00:50:13,990 --> 00:50:18,370
can rename this weird looking column,
count star, to anything I want.

1026
00:50:18,370 --> 00:50:19,930
I can rename it to n.

1027
00:50:19,930 --> 00:50:24,280
And then at the end of this query,
I can order by n, essentially,

1028
00:50:24,280 --> 00:50:26,990
creating a synonym, if you
will, for one versus the other.

1029
00:50:26,990 --> 00:50:30,310
So if I hit enter now, same exact
thing, but my little baby table

1030
00:50:30,310 --> 00:50:33,310
that came back-- not a technical
term-- has two columns, one of which

1031
00:50:33,310 --> 00:50:36,580
is more simply called n
now instead of count star.

1032
00:50:36,580 --> 00:50:40,330
It just makes it minorly more
convenient in your actual SQL code

1033
00:50:40,330 --> 00:50:44,110
to reference things that might
actually be a little annoying to type.

1034
00:50:44,110 --> 00:50:47,200
Lastly, suppose we want
to get a top one list,

1035
00:50:47,200 --> 00:50:49,000
and we just want the
most popular language.

1036
00:50:49,000 --> 00:50:51,790
Honestly, I can just
do limit one, enter.

1037
00:50:51,790 --> 00:50:55,720
That gives me just this tiny little
table, a temporary table, really,

1038
00:50:55,720 --> 00:50:56,860
with one row.

1039
00:50:56,860 --> 00:51:00,460
And, honestly, if I don't even
care about what the language is,

1040
00:51:00,460 --> 00:51:02,230
I can omit that entirely.

1041
00:51:02,230 --> 00:51:06,340
Just see how many people really
like the most popular language.

1042
00:51:06,340 --> 00:51:07,660
280, in this case.

1043
00:51:07,660 --> 00:51:10,910
But, of course, it's more interesting
to see what it actually is.

1044
00:51:10,910 --> 00:51:13,420
So, in short, just by turning
these knobs syntactically,

1045
00:51:13,420 --> 00:51:16,900
it's relatively easy to start
getting at more and more data.

1046
00:51:16,900 --> 00:51:19,270
And more answers there, too.

1047
00:51:19,270 --> 00:51:19,770
Phew.

1048
00:51:19,770 --> 00:51:24,270
Questions on this thus far?

1049
00:51:24,270 --> 00:51:25,200
Any questions?

1050
00:51:25,200 --> 00:51:26,040
No?

1051
00:51:26,040 --> 00:51:26,670
OK.

1052
00:51:26,670 --> 00:51:29,305
Well suppose that this
week, for instance.

1053
00:51:29,305 --> 00:51:31,680
One of our new problems is
going to be called Fiftyville,

1054
00:51:31,680 --> 00:51:33,960
and it's going to allow you
to explore the world of SQL

1055
00:51:33,960 --> 00:51:35,793
in the context of a
place called Fiftyville.

1056
00:51:35,793 --> 00:51:38,350
Suppose that suddenly becomes
your favorite problem.

1057
00:51:38,350 --> 00:51:40,920
Well, how can we go about
adding more data to a database?

1058
00:51:40,920 --> 00:51:43,720
Well, we've seen create
table for creating the table,

1059
00:51:43,720 --> 00:51:47,230
we've seen select for
selecting data there from.

1060
00:51:47,230 --> 00:51:50,340
Turns out there's also
an insert into command

1061
00:51:50,340 --> 00:51:54,210
that you can use to insert
new data into a table.

1062
00:51:54,210 --> 00:51:57,450
Now, I did this in bulk by just
importing that whole CSV file,

1063
00:51:57,450 --> 00:52:00,370
and SQLite3 did it all
for me automatically.

1064
00:52:00,370 --> 00:52:04,320
But in the real world, if you don't have
a captive audience, every one of whom

1065
00:52:04,320 --> 00:52:06,540
is submitting the form
at the same time--

1066
00:52:06,540 --> 00:52:09,300
but maybe it's an application
that's running 24/7,

1067
00:52:09,300 --> 00:52:13,210
you're going to get more and more data
over time, just like Google itself.

1068
00:52:13,210 --> 00:52:16,290
So if you write code like
this, you can insert one row

1069
00:52:16,290 --> 00:52:20,530
at a time, one row at a time, and
actually change the data in your table.

1070
00:52:20,530 --> 00:52:24,820
So just as a check, let me do
select star from favorites, enter.

1071
00:52:24,820 --> 00:52:26,500
Just to see all of the data.

1072
00:52:26,500 --> 00:52:31,030
And the last data we got was
at 1:41 PM, and 21 seconds.

1073
00:52:31,030 --> 00:52:34,390
Suppose now I've decided I
want to insert one new row.

1074
00:52:34,390 --> 00:52:35,560
I can do this.

1075
00:52:35,560 --> 00:52:39,550
Insert into favorites--
and then I have to specify

1076
00:52:39,550 --> 00:52:41,710
what columns do I want to insert into.

1077
00:52:41,710 --> 00:52:47,230
I'm going to insert a new language
column, and a new problem column.

1078
00:52:47,230 --> 00:52:47,840
Timestamp?

1079
00:52:47,840 --> 00:52:48,340
I could.

1080
00:52:48,340 --> 00:52:51,590
I don't really want to look up the time,
so I'm going to leave that one blank.

1081
00:52:51,590 --> 00:52:54,520
And I'm going to put in
values as follows for this.

1082
00:52:54,520 --> 00:52:57,850
SQL for the language, which wasn't
even an option on the form earlier,

1083
00:52:57,850 --> 00:53:02,830
and Fiftyville for the name
of the problem, semicolon.

1084
00:53:02,830 --> 00:53:04,540
So there's a bit of dichotomy here.

1085
00:53:04,540 --> 00:53:07,690
In the first set of parentheses,
you specify a comma separated list

1086
00:53:07,690 --> 00:53:09,910
of the columns that you
want to put data into.

1087
00:53:09,910 --> 00:53:11,890
In the second set of
parentheses, you actually

1088
00:53:11,890 --> 00:53:15,680
specify the values that you
want to put into those columns.

1089
00:53:15,680 --> 00:53:18,350
So when I hit enter
nothing seems to happen,

1090
00:53:18,350 --> 00:53:20,510
which, in general, is a
good thing at my terminal.

1091
00:53:20,510 --> 00:53:25,030
But if I now rerun select star
from favorites, we will see--

1092
00:53:25,030 --> 00:53:27,070
voila, a brand new row.

1093
00:53:27,070 --> 00:53:29,320
We don't know what time or
date it was inputted at.

1094
00:53:29,320 --> 00:53:33,130
In fact, we see an old friend, null,
which indicates the absence of a value,

1095
00:53:33,130 --> 00:53:37,610
but we do indeed see that SQL, in
Fiftyville, is actually now in there.

1096
00:53:37,610 --> 00:53:42,340
So in the world of SQL, null has nothing
to do with pointers or addresses.

1097
00:53:42,340 --> 00:53:46,120
The world of SQL, it's just using the
same word to represent the same idea,

1098
00:53:46,120 --> 00:53:48,700
that there's no data
here, but it has nothing

1099
00:53:48,700 --> 00:53:51,080
to do with actual memory
addresses in this case.

1100
00:53:51,080 --> 00:53:53,830
But suppose that you don't want
to do that, and, like, no, no, no.

1101
00:53:53,830 --> 00:53:54,640
Let's just delete that.

1102
00:53:54,640 --> 00:53:56,348
Fiftyville hasn't even
been released yet,

1103
00:53:56,348 --> 00:53:58,720
nor have we even finished
talking about SQL.

1104
00:53:58,720 --> 00:54:02,110
How do we delete data
from a database table?

1105
00:54:02,110 --> 00:54:03,670
Well, there's a delete from command.

1106
00:54:03,670 --> 00:54:05,360
Let me go back to VS Code here.

1107
00:54:05,360 --> 00:54:08,080
Let me go ahead and clear my
terminal just to keep things clean.

1108
00:54:08,080 --> 00:54:10,570
Let me go ahead and
delete from favorites,

1109
00:54:10,570 --> 00:54:14,290
and let me not hit enter
here after a semicolon.

1110
00:54:14,290 --> 00:54:16,330
This is one of the
most destructive things

1111
00:54:16,330 --> 00:54:19,570
you can do as a database administrator.

1112
00:54:19,570 --> 00:54:21,550
If you Google around,
there are horror stories

1113
00:54:21,550 --> 00:54:25,750
of interns in the real world executing
commands like this at their companies.

1114
00:54:25,750 --> 00:54:28,510
This will delete
everything from favorites.

1115
00:54:28,510 --> 00:54:31,660
So if you ever do this, remember,
we told you not to today.

1116
00:54:31,660 --> 00:54:38,530
But if we add a where clause here, only
delete rows where the timestamp column

1117
00:54:38,530 --> 00:54:41,433
is null, this is more reasonable.

1118
00:54:41,433 --> 00:54:43,600
And, frankly, any companies
you work for should also

1119
00:54:43,600 --> 00:54:45,767
have backups of their
database, so we shouldn't even

1120
00:54:45,767 --> 00:54:48,680
be reading about these horror
stories, but such is the real world.

1121
00:54:48,680 --> 00:54:52,960
So this is going to delete any
row from the favorites table

1122
00:54:52,960 --> 00:54:57,260
where Timestamp, capital T, because
that's how Google did it, is null.

1123
00:54:57,260 --> 00:54:58,570
I go ahead and hit enter.

1124
00:54:58,570 --> 00:55:03,880
Nothing seems to happen, but if I do
select star from favorites semicolon

1125
00:55:03,880 --> 00:55:06,550
that now row is, again, gone.

1126
00:55:06,550 --> 00:55:11,410
So you can use these predicates, these
where conditions, coupled with select,

1127
00:55:11,410 --> 00:55:14,740
coupled with delete, and
other operations as well.

1128
00:55:14,740 --> 00:55:16,520
What if I actually
want to make a change?

1129
00:55:16,520 --> 00:55:18,910
So if you want to update
existing data like this.

1130
00:55:18,910 --> 00:55:20,540
Well, we could do this.

1131
00:55:20,540 --> 00:55:25,300
I could update this table I could
set one column equal to this value

1132
00:55:25,300 --> 00:55:27,230
where some condition is true.

1133
00:55:27,230 --> 00:55:28,638
So how might this work?

1134
00:55:28,638 --> 00:55:30,430
Well, let me boldly
claim that a lot of you

1135
00:55:30,430 --> 00:55:33,070
are really going to like
Fiftyville in the world of SQL,

1136
00:55:33,070 --> 00:55:35,860
so all of these favorites
are sort of passe now.

1137
00:55:35,860 --> 00:55:36,890
So let's do this.

1138
00:55:36,890 --> 00:55:39,950
Let me go ahead and update
the favorites table,

1139
00:55:39,950 --> 00:55:43,150
setting the language column
equal to "SQL," quote unquote.

1140
00:55:43,150 --> 00:55:47,650
And with a comma let me go ahead and
also update the problem column to be

1141
00:55:47,650 --> 00:55:49,810
equal to, quote unquote, "Fiftyville."

1142
00:55:49,810 --> 00:55:52,930
I'm not going to have any kind
of where here, which means this

1143
00:55:52,930 --> 00:55:56,030
is just going to do its
thing on all of the rows.

1144
00:55:56,030 --> 00:55:58,270
So if I hit enter nothing
seems to have happened,

1145
00:55:58,270 --> 00:56:00,850
but if I now do select
star from favorites,

1146
00:56:00,850 --> 00:56:04,160
everyone's favorite is
literally that problem.

1147
00:56:04,160 --> 00:56:07,690
So this too is destructive,
unlike the real digital world,

1148
00:56:07,690 --> 00:56:10,120
there's no Control Z, or undo that.

1149
00:56:10,120 --> 00:56:13,480
You better have made a backup
of your database, otherwise

1150
00:56:13,480 --> 00:56:15,460
that's not a good thing.

1151
00:56:15,460 --> 00:56:17,830
In this case, I do have this
CSV file, so I could just

1152
00:56:17,830 --> 00:56:19,703
delete my favorites.db file.

1153
00:56:19,703 --> 00:56:22,870
I could re-import the data, so I haven't
really lost anything of importance,

1154
00:56:22,870 --> 00:56:26,140
but you could in the case of
the real world and any data

1155
00:56:26,140 --> 00:56:27,470
you're actually working on.

1156
00:56:27,470 --> 00:56:29,410
So just to make the
point, let me go ahead

1157
00:56:29,410 --> 00:56:33,250
and delete from favorites
semicolon, enter.

1158
00:56:33,250 --> 00:56:34,660
Let me reselect.

1159
00:56:34,660 --> 00:56:36,710
There's no data there anymore.

1160
00:56:36,710 --> 00:56:42,890
And in fact, if I do select
count star from favorites,

1161
00:56:42,890 --> 00:56:45,220
we'll see as much that the
answer is, in fact, zero

1162
00:56:45,220 --> 00:56:48,040
because everything has now been deleted.

1163
00:56:48,040 --> 00:56:51,340
Any questions, then, on that code there?

1164
00:56:51,340 --> 00:56:51,950
No?

1165
00:56:51,950 --> 00:56:52,450
All right.

1166
00:56:52,450 --> 00:56:55,533
So if not too scared yet, let's go
ahead and take our 10-minute break now.

1167
00:56:55,533 --> 00:56:59,110
Halloween candy is served,
and we'll be back in 10.

1168
00:56:59,110 --> 00:57:01,060
All right.

1169
00:57:01,060 --> 00:57:05,750
So we are back, and before we dive
back into SQL and some real world data,

1170
00:57:05,750 --> 00:57:10,300
it turns out, unbeknownst to me,
we've had a Halloween costume contest.

1171
00:57:10,300 --> 00:57:14,560
So it's now time to announce the winners
of this year's CS50 costume contest.

1172
00:57:14,560 --> 00:57:17,530
If our two winners would
like to come on up who,

1173
00:57:17,530 --> 00:57:19,495
I'm told during break, dressed up as me.

1174
00:57:19,495 --> 00:57:22,265
[APPLAUSE]

1175
00:57:22,265 --> 00:57:25,620


1176
00:57:25,620 --> 00:57:26,460
Come on over.

1177
00:57:26,460 --> 00:57:28,085
Would you like to introduce yourselves?

1178
00:57:28,085 --> 00:57:31,240
SPEAKER 2: Yes Hi, everyone.

1179
00:57:31,240 --> 00:57:32,100
I'm David.

1180
00:57:32,100 --> 00:57:34,380
I am living in Matthews as a first year.

1181
00:57:34,380 --> 00:57:38,700
I'm planning on studying
gov and computer science.

1182
00:57:38,700 --> 00:57:39,820
SPEAKER 3: Hi, everyone.

1183
00:57:39,820 --> 00:57:40,740
I'm David.

1184
00:57:40,740 --> 00:57:43,020
I'm a second year in
Mather, and I'm planning

1185
00:57:43,020 --> 00:57:44,745
on just studying computer science.

1186
00:57:44,745 --> 00:57:45,480
[LAUGHING]

1187
00:57:45,480 --> 00:57:46,605
SPEAKER 1: Well, thank you.

1188
00:57:46,605 --> 00:57:50,190
We have some Oreos for you.

1189
00:57:50,190 --> 00:57:51,497
Thank you, both, so much.

1190
00:57:51,497 --> 00:57:52,830
Did anyone else dress like this?

1191
00:57:52,830 --> 00:57:55,437
We have two more Oreos, if you'd like?

1192
00:57:55,437 --> 00:57:56,020
Intentionally?

1193
00:57:56,020 --> 00:57:58,082
[LAUGHING]

1194
00:57:58,082 --> 00:57:59,040
Oh, that's pretty good.

1195
00:57:59,040 --> 00:57:59,540
OK.

1196
00:57:59,540 --> 00:58:00,750
Yes, we have one more winner.

1197
00:58:00,750 --> 00:58:01,515
Come on down.

1198
00:58:01,515 --> 00:58:03,810
[LAUGHS]

1199
00:58:03,810 --> 00:58:09,580


1200
00:58:09,580 --> 00:58:10,317
Thank you.

1201
00:58:10,317 --> 00:58:10,900
SPEAKER 4: Hi.

1202
00:58:10,900 --> 00:58:12,010
SPEAKER 1: This is intentional?

1203
00:58:12,010 --> 00:58:12,590
SPEAKER 4: No.

1204
00:58:12,590 --> 00:58:13,340
SPEAKER 1: Oh, OK.

1205
00:58:13,340 --> 00:58:14,410
[LAUGHING]

1206
00:58:14,410 --> 00:58:16,150
SPEAKER 4: Hello, my name is David.

1207
00:58:16,150 --> 00:58:18,490
I'm from Canada, and I'm a first year.

1208
00:58:18,490 --> 00:58:20,200
I'm not sure what I'm going to study.

1209
00:58:20,200 --> 00:58:20,783
SPEAKER 1: OK.

1210
00:58:20,783 --> 00:58:21,800
Welcome, as well.

1211
00:58:21,800 --> 00:58:22,675
SPEAKER 4: Thank you.

1212
00:58:22,675 --> 00:58:23,560
SPEAKER 1: All right.

1213
00:58:23,560 --> 00:58:26,270
[APPLAUSE]

1214
00:58:26,270 --> 00:58:27,732


1215
00:58:27,732 --> 00:58:29,690
Up until now, we've played
around with the data

1216
00:58:29,690 --> 00:58:33,695
that you all gave us, which was based,
very simply, on your favorite language

1217
00:58:33,695 --> 00:58:34,820
and your favorite problems.

1218
00:58:34,820 --> 00:58:38,030
But it turns out there's a lot
of real world data in, indeed,

1219
00:58:38,030 --> 00:58:40,880
the real world, some of
which is quite voluminous.

1220
00:58:40,880 --> 00:58:44,720
And, indeed, there can be not just
dozens, or hundreds, but thousands,

1221
00:58:44,720 --> 00:58:47,962
hundreds of thousands, even millions
of rows in the biggest of databases.

1222
00:58:47,962 --> 00:58:50,420
And so what we thought we'd do
in the latter part of today,

1223
00:58:50,420 --> 00:58:53,360
is really, actually, get our
hands dirty with a real world

1224
00:58:53,360 --> 00:58:57,020
data set from the Internet Movie
Database, otherwise known as IMDb.

1225
00:58:57,020 --> 00:59:01,010
And, in fact, if you go to
imdb.com, you'll be able to answer,

1226
00:59:01,010 --> 00:59:03,650
via their web interface, some
of the very questions we'll

1227
00:59:03,650 --> 00:59:05,720
do today using SQL alone.

1228
00:59:05,720 --> 00:59:10,520
But what you'll find, ultimately,
is that what websites like imdb.com,

1229
00:59:10,520 --> 00:59:13,880
or their mobile app versions
thereof, are probably doing

1230
00:59:13,880 --> 00:59:17,420
is-- yes, giving you a nice pretty
graphical interface to type queries,

1231
00:59:17,420 --> 00:59:22,760
but underneath the hood they are
passing your input into SQL queries,

1232
00:59:22,760 --> 00:59:25,800
or similar queries, that
they formed most of.

1233
00:59:25,800 --> 00:59:28,320
They're just waiting for
placeholders, like the keywords,

1234
00:59:28,320 --> 00:59:30,160
that you're actually searching for.

1235
00:59:30,160 --> 00:59:33,420
So let's go ahead and experiment,
maybe, with just some real-world data,

1236
00:59:33,420 --> 00:59:36,725
initially, before we consider how
to actually store it at scale.

1237
00:59:36,725 --> 00:59:38,850
So let me open up, just
for the sake of discussion,

1238
00:59:38,850 --> 00:59:41,190
an actual, empty
spreadsheet, just so I have

1239
00:59:41,190 --> 00:59:43,650
some rows and columns to play with.

1240
00:59:43,650 --> 00:59:47,790
And let me propose that we want to
model TV shows from the real world.

1241
00:59:47,790 --> 00:59:49,420
How can we go about doing this?

1242
00:59:49,420 --> 00:59:53,020
Well, maybe I could start in
this first column A, so to speak,

1243
00:59:53,020 --> 00:59:54,600
and I could create a title column.

1244
00:59:54,600 --> 00:59:57,660
And then, maybe, a column
for the star of that show.

1245
00:59:57,660 --> 01:00:00,060
And a very popular show,
of course, is The Office.

1246
01:00:00,060 --> 01:00:03,900
So I might put this into the
second cell in that first column.

1247
01:00:03,900 --> 01:00:06,870
And under star, I could put
someone like Steve Carell.

1248
01:00:06,870 --> 01:00:10,200
But, of course, he wasn't
the only star of the show.

1249
01:00:10,200 --> 01:00:11,740
There are others as well.

1250
01:00:11,740 --> 01:00:14,370
And so if I want to put in
someone like Rainn Wilson,

1251
01:00:14,370 --> 01:00:16,210
well, maybe, I need
a second star column.

1252
01:00:16,210 --> 01:00:17,640
So Rainn Wilson.

1253
01:00:17,640 --> 01:00:20,010
But even as early as
the first season, there

1254
01:00:20,010 --> 01:00:24,420
was also another star in
the credits, John Krasinski.

1255
01:00:24,420 --> 01:00:25,440
So he was a star.

1256
01:00:25,440 --> 01:00:29,280
Jenna Fischer was top
credited in the first season.

1257
01:00:29,280 --> 01:00:30,990
So Jenna Fischer.

1258
01:00:30,990 --> 01:00:37,020
And then BJ Novak, Harvard alum, was
also in the first season's opening

1259
01:00:37,020 --> 01:00:37,870
credits, as well.

1260
01:00:37,870 --> 01:00:39,360
So we've got all--

1261
01:00:39,360 --> 01:00:41,170
one, two, three, four,
five of these folks.

1262
01:00:41,170 --> 01:00:42,630
Hopefully, I didn't
misspell anyone's name,

1263
01:00:42,630 --> 01:00:45,030
but here's the beginnings
of a real-world data set.

1264
01:00:45,030 --> 01:00:47,970
And we could imagine doing this
for everyone's favorite shows,

1265
01:00:47,970 --> 01:00:49,630
adding more, and more rows.

1266
01:00:49,630 --> 01:00:52,650
But let's consider, as we often
do, not just the correctness

1267
01:00:52,650 --> 01:00:56,220
of this implementation, but the design.

1268
01:00:56,220 --> 01:00:57,990
It's pretty straightforward.

1269
01:00:57,990 --> 01:00:58,972
It's very readable.

1270
01:00:58,972 --> 01:01:00,430
So I think it's good in that sense.

1271
01:01:00,430 --> 01:01:03,510
But if you start to nitpick
what's poorly designed,

1272
01:01:03,510 --> 01:01:06,240
even in the world of spreadsheets,
about what I've done here--

1273
01:01:06,240 --> 01:01:08,610
assuming that the next
row is another show,

1274
01:01:08,610 --> 01:01:12,110
the next row is another
show, and so forth.

1275
01:01:12,110 --> 01:01:13,280
What's bad about this?

1276
01:01:13,280 --> 01:01:13,780
Yeah?

1277
01:01:13,780 --> 01:01:15,272
[INDISTINCT SPEECH]

1278
01:01:15,272 --> 01:01:17,980
Yeah, so each row is going to have
a different number of columns,

1279
01:01:17,980 --> 01:01:20,440
and even I, kind of, couldn't
make up my mind from the get go.

1280
01:01:20,440 --> 01:01:22,000
Like, do I have just one star column?

1281
01:01:22,000 --> 01:01:22,500
Or two?

1282
01:01:22,500 --> 01:01:23,792
Or maybe now I'm up to 5?

1283
01:01:23,792 --> 01:01:25,750
For even bigger shows,
and later in The Office,

1284
01:01:25,750 --> 01:01:27,625
when more people got
top billing, we're going

1285
01:01:27,625 --> 01:01:29,770
to need more than five
columns for stars.

1286
01:01:29,770 --> 01:01:31,060
So that's fine.

1287
01:01:31,060 --> 01:01:34,682
We can clearly scroll to the right,
and just keep adding more columns,

1288
01:01:34,682 --> 01:01:36,640
but there should be
something about that design

1289
01:01:36,640 --> 01:01:38,890
that like rubs you the
wrong way, like something

1290
01:01:38,890 --> 01:01:41,620
feels a little off if some
rows have this many columns,

1291
01:01:41,620 --> 01:01:42,700
others have this many.

1292
01:01:42,700 --> 01:01:45,520
The data would be very jagged
along the right hand side.

1293
01:01:45,520 --> 01:01:48,320
It would be very sparse, which
would be another way to describe it.

1294
01:01:48,320 --> 01:01:50,000
There's probably a better way.

1295
01:01:50,000 --> 01:01:53,440
So maybe I should flip
the data around, and maybe

1296
01:01:53,440 --> 01:01:57,520
a better approach here would be
to just have one column for stars.

1297
01:01:57,520 --> 01:01:59,870
So let me do this.

1298
01:01:59,870 --> 01:02:04,180
Let me just move Rainn Wilson over
here, and John Krasinski over here,

1299
01:02:04,180 --> 01:02:09,980
and Jenna Fischer over here,
and BJ Novak over here, as well.

1300
01:02:09,980 --> 01:02:13,850
I'll get rid of all of these
superfluous, identically named columns.

1301
01:02:13,850 --> 01:02:17,840
And now this is sort of better because
now I can have any number of stars

1302
01:02:17,840 --> 01:02:21,330
in the vertical, although it's a
little weird to leave this blank.

1303
01:02:21,330 --> 01:02:23,850
So maybe I should of copy paste here.

1304
01:02:23,850 --> 01:02:27,920
So in some sense this is
better, in that now I only

1305
01:02:27,920 --> 01:02:30,860
have one title column, one
star column, and I can just

1306
01:02:30,860 --> 01:02:34,790
keep adding row, row, row,
for each show and its stars.

1307
01:02:34,790 --> 01:02:37,840
But what's now poorly
designed about this?

1308
01:02:37,840 --> 01:02:38,340
Yeah?

1309
01:02:38,340 --> 01:02:39,970
[INDISTINCT SPEECH]

1310
01:02:39,970 --> 01:02:40,470
Yeah.

1311
01:02:40,470 --> 01:02:41,580
I'm repeating the title.

1312
01:02:41,580 --> 01:02:44,220
And, in general, copy, paste,
repeating yourself in code

1313
01:02:44,220 --> 01:02:46,090
has generally been a bad thing.

1314
01:02:46,090 --> 01:02:49,110
It's generally gotten us in trouble
if I make a change, or maybe

1315
01:02:49,110 --> 01:02:50,550
a typographical error somewhere.

1316
01:02:50,550 --> 01:02:52,050
Maybe it propagates elsewhere.

1317
01:02:52,050 --> 01:02:54,450
And if nothing else, it's
just a lot of wasted space.

1318
01:02:54,450 --> 01:02:57,300
If this is actually going to be
stored in a database, in a CSV file,

1319
01:02:57,300 --> 01:03:00,300
why are you duplicating the same
string again, and again, and again,

1320
01:03:00,300 --> 01:03:01,440
for large TV shows?

1321
01:03:01,440 --> 01:03:02,370
That's just wasteful.

1322
01:03:02,370 --> 01:03:04,780
It just doesn't seem wise.

1323
01:03:04,780 --> 01:03:06,970
So how can we eliminate that redundancy?

1324
01:03:06,970 --> 01:03:10,260
Well, unfortunately, in the world
of spreadsheets, things kind of now

1325
01:03:10,260 --> 01:03:13,080
escalate quickly to be kind of annoying.

1326
01:03:13,080 --> 01:03:16,150
But let me do it, nonetheless,
with just a small bit of data.

1327
01:03:16,150 --> 01:03:18,250
Let me propose that we do this instead.

1328
01:03:18,250 --> 01:03:21,977
Let me create not one sheet
but maybe multiple sheets,

1329
01:03:21,977 --> 01:03:24,060
and assume that there's
some kind of relationship,

1330
01:03:24,060 --> 01:03:25,680
or relation, across these sheets.

1331
01:03:25,680 --> 01:03:29,100
So, just to be pedantic, let me call
this sheet, not the default sheet one,

1332
01:03:29,100 --> 01:03:30,360
but let's call this shows.

1333
01:03:30,360 --> 01:03:34,750
And in this sheet, I'm going to
have a title column for every show,

1334
01:03:34,750 --> 01:03:36,990
and I think I'm going
to be proactive here.

1335
01:03:36,990 --> 01:03:41,200
I'm going to start giving every show
a unique ID number, much like Harvard

1336
01:03:41,200 --> 01:03:43,150
affiliates have Harvard
IDs, Yale affiliates

1337
01:03:43,150 --> 01:03:44,900
have Yale ID numbers, and so forth.

1338
01:03:44,900 --> 01:03:48,280
Let's go ahead and give each
show its own unique identifier,

1339
01:03:48,280 --> 01:03:49,940
for reasons we'll soon see.

1340
01:03:49,940 --> 01:03:53,110
So for The Office let
me, just for consistency

1341
01:03:53,110 --> 01:03:55,030
with the actual Internet
Movie Database, I'm

1342
01:03:55,030 --> 01:03:58,488
going to give it a
unique number of 386676.

1343
01:03:58,488 --> 01:04:00,280
The specifics don't
really matter, but that

1344
01:04:00,280 --> 01:04:03,010
happens to be what people in
the real world actually do.

1345
01:04:03,010 --> 01:04:05,950
But that's it for TV shows,
even though I could imagine

1346
01:04:05,950 --> 01:04:07,490
there being many more in this sheet.

1347
01:04:07,490 --> 01:04:10,840
Let me create another sheet now
here, and I'll call it people.

1348
01:04:10,840 --> 01:04:15,490
And in the people sheet, let me
keep track of all of those TV stars.

1349
01:04:15,490 --> 01:04:20,500
So one column will be name, another
will be also called ID here,

1350
01:04:20,500 --> 01:04:25,070
but it's going to be a
person ID, not a show ID.

1351
01:04:25,070 --> 01:04:27,610
And here we have, for
instance, Steve Carell.

1352
01:04:27,610 --> 01:04:29,570
We have Rainn Wilson.

1353
01:04:29,570 --> 01:04:31,870
We have John Krasinski.

1354
01:04:31,870 --> 01:04:34,300
We have Jenna Fischer.

1355
01:04:34,300 --> 01:04:36,670
And we have BJ Novak.

1356
01:04:36,670 --> 01:04:38,600
And, this is going to
be a little tedious,

1357
01:04:38,600 --> 01:04:41,070
but just to be consistent
with reality, turns out

1358
01:04:41,070 --> 01:04:46,230
that according to IMDb Steve Carell's
unique number in the world is 136797,

1359
01:04:46,230 --> 01:04:56,640
Rainn's is 933988, John's is
1024677, Jenna's is 278979,

1360
01:04:56,640 --> 01:05:01,814
and, lastly, BJ Novak's is 1145983.

1361
01:05:01,814 --> 01:05:02,650
Phew.

1362
01:05:02,650 --> 01:05:03,150
OK.

1363
01:05:03,150 --> 01:05:06,360
So now we have the same people, but
they each have a unique ID number.

1364
01:05:06,360 --> 01:05:10,590
Lastly, let's associate those shows
with those people in a way that

1365
01:05:10,590 --> 01:05:13,320
avoids the two problems we
identified earlier, which

1366
01:05:13,320 --> 01:05:16,650
was having a variable number
of columns in one case

1367
01:05:16,650 --> 01:05:18,960
versus redundancy in the second case.

1368
01:05:18,960 --> 01:05:22,710
Let's really tighten things
up so nothing is in duplicate

1369
01:05:22,710 --> 01:05:24,880
that doesn't actually need to be.

1370
01:05:24,880 --> 01:05:28,890
So I'm going to create a third
sheet here, and I'll call it stars,

1371
01:05:28,890 --> 01:05:31,770
like the TV stars for these
shows, and what I'm going to do

1372
01:05:31,770 --> 01:05:36,292
is have only two columns,
a show ID, and a person ID.

1373
01:05:36,292 --> 01:05:38,250
I could write these in
different ways, but it's

1374
01:05:38,250 --> 01:05:41,740
conventional in the database world
to use snake case, so to speak,

1375
01:05:41,740 --> 01:05:46,150
where everything is lowercase,
with underscores instead of spaces.

1376
01:05:46,150 --> 01:05:49,390
And for show ID-- well, we're
only mocking up one show for now,

1377
01:05:49,390 --> 01:05:54,100
but I'm going to go ahead and say
386676, which is The Office, I claimed.

1378
01:05:54,100 --> 01:05:57,160
And now I'm going to go ahead
and have all of those person IDs,

1379
01:05:57,160 --> 01:06:00,230
and this time it's OK to copy
paste, if only to save time.

1380
01:06:00,230 --> 01:06:04,540
So I'm going to grab all of these
five stars IDs, paste them there,

1381
01:06:04,540 --> 01:06:08,840
and I am going to indulge
by duplicating the show ID,

1382
01:06:08,840 --> 01:06:10,580
even though we didn't like that earlier.

1383
01:06:10,580 --> 01:06:14,380
Now this indeed has escalated
quickly because none of these sheets

1384
01:06:14,380 --> 01:06:17,590
are very useful to look at to the
human eye because none of them

1385
01:06:17,590 --> 01:06:18,910
has the complete picture.

1386
01:06:18,910 --> 01:06:22,600
We have shows in one, people in another,
and then like this cryptic mapping

1387
01:06:22,600 --> 01:06:24,250
of numbers in the third.

1388
01:06:24,250 --> 01:06:29,200
But I propose that this is the
right way to implement data

1389
01:06:29,200 --> 01:06:32,530
if your goal is to have a
canonical source of truth

1390
01:06:32,530 --> 01:06:34,060
for every show and every person.

1391
01:06:34,060 --> 01:06:37,990
That is to say, you only say the
show's name once in one place.

1392
01:06:37,990 --> 01:06:43,030
You only write the TV star's
names once in one place.

1393
01:06:43,030 --> 01:06:46,330
And you associate them, you
relate one with the other,

1394
01:06:46,330 --> 01:06:48,710
by way of this third sheet here.

1395
01:06:48,710 --> 01:06:52,270
So if you've not seen it already,
notice that if The Office has

1396
01:06:52,270 --> 01:06:56,800
this unique ID, 386676,
notice in the stars table,

1397
01:06:56,800 --> 01:06:59,620
that same value appears multiple times.

1398
01:06:59,620 --> 01:07:04,810
But what this third sheet is doing
is associating that same show ID

1399
01:07:04,810 --> 01:07:08,510
with one, two, three, four,
five different people.

1400
01:07:08,510 --> 01:07:11,860
Now I can see that it's similar
in spirit to what we already

1401
01:07:11,860 --> 01:07:13,780
indicted as bad design a moment ago.

1402
01:07:13,780 --> 01:07:15,460
The Office, The Office, The Office.

1403
01:07:15,460 --> 01:07:18,970
But think about our world of C. In
the world of C, and really computers

1404
01:07:18,970 --> 01:07:22,180
in general, data takes up finite
amount of space, typically.

1405
01:07:22,180 --> 01:07:24,470
Like an integer is four bytes, 32 bits.

1406
01:07:24,470 --> 01:07:27,070
So even though, yes, I'm
duplicating this value,

1407
01:07:27,070 --> 01:07:30,460
it's just the same four
bytes, four bytes, four bytes.

1408
01:07:30,460 --> 01:07:36,340
It's not t-h-e space
o-f-f-i-c-e, null character.

1409
01:07:36,340 --> 01:07:38,755
It's not the same 11
bytes again, and again.

1410
01:07:38,755 --> 01:07:41,380
It's just a number, and numbers
tend to be much more efficient.

1411
01:07:41,380 --> 01:07:43,660
That computers can crunch
numbers much more quickly.

1412
01:07:43,660 --> 01:07:47,920
Duplicating numbers is in
general allowed, or smiled upon.

1413
01:07:47,920 --> 01:07:50,660
Duplicating strings will
get you into trouble.

1414
01:07:50,660 --> 01:07:54,212
So with that said, is this
a useful spreadsheet now?

1415
01:07:54,212 --> 01:07:56,920
Would you want to be handed this
in your job and asked questions?

1416
01:07:56,920 --> 01:07:59,020
Like, hey, who stars in The Office?

1417
01:07:59,020 --> 01:08:01,390
You can answer it, but you
have to look up one sheet,

1418
01:08:01,390 --> 01:08:02,950
then another, then a third.

1419
01:08:02,950 --> 01:08:06,490
Or you need to use VLOOKUP, or special
functions in Excel, or Google Sheets.

1420
01:08:06,490 --> 01:08:08,823
I mean, you're just creating
a lot of work for yourself.

1421
01:08:08,823 --> 01:08:12,340
But, academically, if you
will, systematically, this

1422
01:08:12,340 --> 01:08:15,640
has a lot of merit because
we've avoided all duplication.

1423
01:08:15,640 --> 01:08:20,800
We've normalized the data, so to speak,
by factoring out any duplication.

1424
01:08:20,800 --> 01:08:22,700
So where are we going with this?

1425
01:08:22,700 --> 01:08:25,720
Well, it turns out that we'll
play now with some actual data

1426
01:08:25,720 --> 01:08:28,481
from the real world from the
actual Internet Movie database.

1427
01:08:28,481 --> 01:08:31,189
And in a moment, it's going to
look a little something like this.

1428
01:08:31,189 --> 01:08:34,689
This is an artist's rendition
of five different-- sorry,

1429
01:08:34,689 --> 01:08:36,800
six different tables.

1430
01:08:36,800 --> 01:08:39,430
So not one, but six
different tables that we

1431
01:08:39,430 --> 01:08:43,479
created using freely available
Internet Movie Database data.

1432
01:08:43,479 --> 01:08:47,904
They kindly provide, not CSVs,
but TSVs, tab separated values,

1433
01:08:47,904 --> 01:08:49,779
which are essentially
the same thing, but you

1434
01:08:49,779 --> 01:08:51,729
look for tab characters
instead of commas

1435
01:08:51,729 --> 01:08:55,120
in order to convert them, as
we did, into our own format,

1436
01:08:55,120 --> 01:08:56,800
SQLite in this case.

1437
01:08:56,800 --> 01:08:59,404
But we'll see that there
is a file that we've

1438
01:08:59,404 --> 01:09:02,529
made available from today, which you
can download off the course's website,

1439
01:09:02,529 --> 01:09:07,149
called shows.db, and it contains
all of this same information.

1440
01:09:07,149 --> 01:09:10,660
And in that shows.db file there
are indeed these six tables,

1441
01:09:10,660 --> 01:09:13,180
but let's focus on just
two of them initially.

1442
01:09:13,180 --> 01:09:16,038
IMDb, the Internet Movie Database,
is all about rating TV shows,

1443
01:09:16,038 --> 01:09:18,580
and tracking that kind of
information, so let's actually take

1444
01:09:18,580 --> 01:09:20,497
a look at some of these
ratings and figure out

1445
01:09:20,497 --> 01:09:22,970
how we can actually
answer actual questions.

1446
01:09:22,970 --> 01:09:24,760
So let me go over to VS Code.

1447
01:09:24,760 --> 01:09:29,750
And let me run SQLite of shows.db,
which is a file that already exists.

1448
01:09:29,750 --> 01:09:33,010
There's no CSVs, no TSVs, we
did all of this for you already.

1449
01:09:33,010 --> 01:09:34,932
When I hit enter, I
get my SQLite prompt.

1450
01:09:34,932 --> 01:09:36,640
And the first thing
I like to do whenever

1451
01:09:36,640 --> 01:09:39,189
I'm playing with a SQL database
for the first time-- maybe

1452
01:09:39,189 --> 01:09:41,410
I got it from a class,
or my boss, or the like--

1453
01:09:41,410 --> 01:09:44,979
is just to wrap my mind around
what's inside of the database

1454
01:09:44,979 --> 01:09:48,069
because you're not typically going to
be provided with pictures like this.

1455
01:09:48,069 --> 01:09:49,729
You're just going to be given a file.

1456
01:09:49,729 --> 01:09:51,220
So let me just select some data.

1457
01:09:51,220 --> 01:09:54,468
Let me select star from the shows table.

1458
01:09:54,468 --> 01:09:57,010
I don't really want to see all
of it, so let me just limit it

1459
01:09:57,010 --> 01:09:59,390
to the first 10 shows in the table.

1460
01:09:59,390 --> 01:10:03,310
And here, we can infer what
the shows table looks like.

1461
01:10:03,310 --> 01:10:07,270
Every show has an ID, a title, the
year in which it debuted, apparently,

1462
01:10:07,270 --> 01:10:11,320
and the number of episodes as of
last night when we exported the data.

1463
01:10:11,320 --> 01:10:15,010
So that seems to reflect this
picture, and this is technically

1464
01:10:15,010 --> 01:10:19,000
an entity relationship diagram, a
standard way of depicting things.

1465
01:10:19,000 --> 01:10:21,670
And you'll see that in
our picture shows, indeed,

1466
01:10:21,670 --> 01:10:25,480
have an ID column, title column,
year column, and episodes column.

1467
01:10:25,480 --> 01:10:26,840
Well what about these ratings?

1468
01:10:26,840 --> 01:10:29,650
Well, according to the picture,
that has a show ID, a rating,

1469
01:10:29,650 --> 01:10:30,830
and a votes column.

1470
01:10:30,830 --> 01:10:34,690
So let's go back to VS Code
here, and let's do select star

1471
01:10:34,690 --> 01:10:39,670
from ratings, limit 10, just to wrap
our mind around some of the data.

1472
01:10:39,670 --> 01:10:42,100
And there we have a
show ID in the left, we

1473
01:10:42,100 --> 01:10:45,170
have rating in the middle, which seems
to be like a floating point value,

1474
01:10:45,170 --> 01:10:46,940
and then votes, which
seems to be an integer.

1475
01:10:46,940 --> 01:10:48,815
So we have some different
types of data here.

1476
01:10:48,815 --> 01:10:50,270
But there's a lot of data.

1477
01:10:50,270 --> 01:10:51,610
In fact, if I do this now.

1478
01:10:51,610 --> 01:10:55,180
Select star from shows--

1479
01:10:55,180 --> 01:10:56,630
let's not select all of the data.

1480
01:10:56,630 --> 01:11:00,670
Let's do select count stars
from shows, and in this database

1481
01:11:00,670 --> 01:11:04,640
there are 214,000
shows in this database.

1482
01:11:04,640 --> 01:11:08,780
So we're well past the 398 rows that
we've been talking about thus far.

1483
01:11:08,780 --> 01:11:12,010
So it turns out, per this
diagram, there's actually

1484
01:11:12,010 --> 01:11:16,360
a standard relationship between
these two tables, shows and ratings

1485
01:11:16,360 --> 01:11:18,880
respectively, and it's what
we would call a one-to-one one

1486
01:11:18,880 --> 01:11:23,430
relationship whereby every show
in this design has one rating.

1487
01:11:23,430 --> 01:11:25,180
And this is indicated,
technically, if you

1488
01:11:25,180 --> 01:11:27,755
look at what the arrowheads
look like on these diagrams.

1489
01:11:27,755 --> 01:11:30,130
This indicates that it's a
one-to-one relationship, which

1490
01:11:30,130 --> 01:11:34,450
means every show has one rating, which
means every row in the shows table

1491
01:11:34,450 --> 01:11:36,970
has a corresponding row
in the ratings table.

1492
01:11:36,970 --> 01:11:39,640
Strictly speaking, they
could be in the same table.

1493
01:11:39,640 --> 01:11:42,130
You could just join them together
and make one wider table,

1494
01:11:42,130 --> 01:11:44,800
but IMDb keeps the
data separate so we too

1495
01:11:44,800 --> 01:11:47,210
kept them separate in
two separate tables here.

1496
01:11:47,210 --> 01:11:50,120
So what does this
actually mean in practice?

1497
01:11:50,120 --> 01:11:54,590
Well, let's actually take a look in VS
Code at the schema for these tables.

1498
01:11:54,590 --> 01:11:57,160
Let me clear my screen,
and let me .schema,

1499
01:11:57,160 --> 01:12:01,570
but specifically look at the schema,
or the design of, the shows table.

1500
01:12:01,570 --> 01:12:06,640
So you can do .schema shows, and any
command in SQLite with a dot is SQLite

1501
01:12:06,640 --> 01:12:08,050
specific.

1502
01:12:08,050 --> 01:12:10,900
In the real world, if you're
using other products like Oracle,

1503
01:12:10,900 --> 01:12:14,350
or Postgre, or MySQL, or others,
they have different commands,

1504
01:12:14,350 --> 01:12:16,300
but anything else that
we've been typing,

1505
01:12:16,300 --> 01:12:20,260
especially the capitalized
keywords, is indeed standard SQL.

1506
01:12:20,260 --> 01:12:25,310
If I hit enter here, here is what the
shows table apparently looks like.

1507
01:12:25,310 --> 01:12:28,480
In other words, here is the create
table command that we, the staff,

1508
01:12:28,480 --> 01:12:30,550
ran in order to create
this table for you,

1509
01:12:30,550 --> 01:12:32,560
and then we imported a
bunch of data into it.

1510
01:12:32,560 --> 01:12:38,200
Every show has an ID, has a title,
has a year, has a number of episodes.

1511
01:12:38,200 --> 01:12:39,610
But there's more detail here.

1512
01:12:39,610 --> 01:12:43,600
Apparently, the ID is an integer, the
title is text, and it is not null,

1513
01:12:43,600 --> 01:12:47,170
cannot be null, the year is
numeric, whatever that means,

1514
01:12:47,170 --> 01:12:49,030
and the episodes is an integer.

1515
01:12:49,030 --> 01:12:50,740
Well, that is now familiar.

1516
01:12:50,740 --> 01:12:51,830
And then primary key.

1517
01:12:51,830 --> 01:12:53,622
So there's some other
stuff going on there,

1518
01:12:53,622 --> 01:12:57,760
which we'll come back to, but let
me also do .schema ratings, enter,

1519
01:12:57,760 --> 01:13:00,280
and we'll see a couple
of other data types here.

1520
01:13:00,280 --> 01:13:03,070
There's show ID, which is
still an integer, but not null.

1521
01:13:03,070 --> 01:13:05,560
Rating, which is a real number, a.k.a.

1522
01:13:05,560 --> 01:13:07,840
float, but it, too, cannot be null.

1523
01:13:07,840 --> 01:13:10,750
And then some number of votes,
which is an integer, cannot be null.

1524
01:13:10,750 --> 01:13:12,560
And then there's mention of foreign key.

1525
01:13:12,560 --> 01:13:16,300
So unlike our data set for the
favorites a bit ago, which we just

1526
01:13:16,300 --> 01:13:20,530
did automatically, and imported,
this database by us, and by IMDb,

1527
01:13:20,530 --> 01:13:23,680
has been more thoughtfully
designed, where there's actually

1528
01:13:23,680 --> 01:13:27,640
some relationships across multiple
tables, rather than previously

1529
01:13:27,640 --> 01:13:28,420
just one.

1530
01:13:28,420 --> 01:13:32,300
Now in the world of SQL, we have
indeed different data types.

1531
01:13:32,300 --> 01:13:35,140
For instance, we have
these five primarily.

1532
01:13:35,140 --> 01:13:37,750
One, cutely named
BLOB, which is actually

1533
01:13:37,750 --> 01:13:41,990
Binary Large Object, which generally
means like a file or some piece of data

1534
01:13:41,990 --> 01:13:43,190
that's zeros and ones.

1535
01:13:43,190 --> 01:13:46,160
Though, generally, it's best
to store files on file systems,

1536
01:13:46,160 --> 01:13:50,090
like in folders, on disks, so
to speak, not in your database.

1537
01:13:50,090 --> 01:13:51,440
There's integers we've seen.

1538
01:13:51,440 --> 01:13:52,410
There's numeric.

1539
01:13:52,410 --> 01:13:54,740
Which numeric is more
like dates and times,

1540
01:13:54,740 --> 01:13:58,340
things that are numbers, but not
necessarily integers or floating point

1541
01:13:58,340 --> 01:13:58,970
values.

1542
01:13:58,970 --> 01:14:01,460
There's reals, which do
have decimal points in them.

1543
01:14:01,460 --> 01:14:02,780
And then there's just text.

1544
01:14:02,780 --> 01:14:06,290
In other SQL databases in the
real world, in your future jobs,

1545
01:14:06,290 --> 01:14:09,740
or your future classes that
you might use SQL again in,

1546
01:14:09,740 --> 01:14:14,960
there are even more data types in other
databases: Oracle, MySQL, Postgre,

1547
01:14:14,960 --> 01:14:15,605
and so forth.

1548
01:14:15,605 --> 01:14:17,480
But these are sort of
representative of them.

1549
01:14:17,480 --> 01:14:19,767
They just get more
precise in other systems.

1550
01:14:19,767 --> 01:14:22,100
But there's also some keywords,
that we've seen already,

1551
01:14:22,100 --> 01:14:24,230
that you can specify when
designing a database.

1552
01:14:24,230 --> 01:14:26,090
--that this column cannot be null.

1553
01:14:26,090 --> 01:14:29,720
If you want to make sure that
no one can insert or update

1554
01:14:29,720 --> 01:14:32,030
data unbeknownst to
you that is null, you

1555
01:14:32,030 --> 01:14:34,340
can impose that when creating the table.

1556
01:14:34,340 --> 01:14:36,740
And unlike Excel, and
Google Spreadsheets,

1557
01:14:36,740 --> 01:14:40,010
and Apple Numbers, which will generally
let the human type in or not type

1558
01:14:40,010 --> 01:14:43,520
in anything they want, with a
database you have more protections

1559
01:14:43,520 --> 01:14:45,365
over the integrity of your data.

1560
01:14:45,365 --> 01:14:48,900
Moreover, you can specify that a
column's values must be unique.

1561
01:14:48,900 --> 01:14:50,750
If you want to avoid
duplicates, like you

1562
01:14:50,750 --> 01:14:54,740
don't want the same person to be able to
register twice for your website thereby

1563
01:14:54,740 --> 01:14:57,500
making sure they have
one unique email address,

1564
01:14:57,500 --> 01:14:59,840
your database can help with that too.

1565
01:14:59,840 --> 01:15:03,470
You don't have to rely on Python
to check if it already exists.

1566
01:15:03,470 --> 01:15:07,520
But there's this other feature
of relational databases, that

1567
01:15:07,520 --> 01:15:11,480
is databases that have multiple tables
across which there are relationships,

1568
01:15:11,480 --> 01:15:13,880
and that's these keywords
we saw briefly a moment ago.

1569
01:15:13,880 --> 01:15:16,620
Primary key, and foreign key.

1570
01:15:16,620 --> 01:15:20,180
And we started to scratch
this surface here.

1571
01:15:20,180 --> 01:15:22,490
It turns out what I
was doing was actually

1572
01:15:22,490 --> 01:15:25,340
best practice in the world
of relational databases

1573
01:15:25,340 --> 01:15:30,680
I gave or, really, IMDb gave every
show in the world a unique ID.

1574
01:15:30,680 --> 01:15:36,470
And that unique ID, in this case
386676, is a numeric value, an integer,

1575
01:15:36,470 --> 01:15:38,930
that uniquely identifies that TV show.

1576
01:15:38,930 --> 01:15:43,370
In other words, this is the
primary key for this table.

1577
01:15:43,370 --> 01:15:46,370
Technically a sheet, but I'm using
that just because it's easier to type

1578
01:15:46,370 --> 01:15:48,020
in than my black and white window.

1579
01:15:48,020 --> 01:15:53,960
This ID column is the primary key for
shows, as I was mocking up earlier.

1580
01:15:53,960 --> 01:15:57,290
What is the primary key
in the people sheet here?

1581
01:15:57,290 --> 01:15:58,580
It is also ID.

1582
01:15:58,580 --> 01:16:02,450
It's a different ID, but it's by
convention often called the same thing.

1583
01:16:02,450 --> 01:16:07,037
But this people column,
called ID, is its primary key.

1584
01:16:07,037 --> 01:16:08,870
And you can perhaps see
where this is going.

1585
01:16:08,870 --> 01:16:13,910
Those same numbers also happen
to appear in this third table,

1586
01:16:13,910 --> 01:16:17,180
but in that context, they're
sort of foreign keys.

1587
01:16:17,180 --> 01:16:21,420
They didn't come from this star sheet,
but they are in this star sheet,

1588
01:16:21,420 --> 01:16:23,690
so they're sort of
relatively foreign to it.

1589
01:16:23,690 --> 01:16:29,030
So foreign keys is simply the presence
of primary keys in some other table,

1590
01:16:29,030 --> 01:16:30,630
in some other data set.

1591
01:16:30,630 --> 01:16:33,050
And so it's just a
description of relativity,

1592
01:16:33,050 --> 01:16:36,860
but the foreign key is the column
that uniquely identifies your data.

1593
01:16:36,860 --> 01:16:41,190
Foreign keys is just the appearance
of those same numbers elsewhere.

1594
01:16:41,190 --> 01:16:42,570
So what does this mean?

1595
01:16:42,570 --> 01:16:45,020
If we go back to VS
Code here, you'll see

1596
01:16:45,020 --> 01:16:48,860
that when we created this
table using IMDb's real data,

1597
01:16:48,860 --> 01:16:52,640
we specified that in our shows
table, which is bigger than the one

1598
01:16:52,640 --> 01:16:56,180
I mocked up with Google Sheets there,
has not only an ID and a title it,

1599
01:16:56,180 --> 01:16:58,615
again, also has year in
which the show debuted,

1600
01:16:58,615 --> 01:16:59,990
and the total number of episodes.

1601
01:16:59,990 --> 01:17:02,810
Because that's juicy data
that comes from IMDb,

1602
01:17:02,810 --> 01:17:04,670
beyond what I mocked up a moment ago.

1603
01:17:04,670 --> 01:17:08,150
In the ratings table,
meanwhile, there's a show ID,

1604
01:17:08,150 --> 01:17:12,360
which is also an integer, just like
this ID, but as the name implies,

1605
01:17:12,360 --> 01:17:17,810
this show ID column is actually going
to be a foreign key that references

1606
01:17:17,810 --> 01:17:20,640
the shows tables ID column.

1607
01:17:20,640 --> 01:17:23,420
So this is the relational
in relational databases.

1608
01:17:23,420 --> 01:17:25,530
These are two tables
that have a relationship,

1609
01:17:25,530 --> 01:17:32,520
and that relationship is that show ID
is referring to this actual ID here.

1610
01:17:32,520 --> 01:17:34,520
And it allows us
essentially, conceptually,

1611
01:17:34,520 --> 01:17:37,350
to link these two tables together.

1612
01:17:37,350 --> 01:17:39,353
So what does this actually mean?

1613
01:17:39,353 --> 01:17:40,770
Well, let me go ahead and do this.

1614
01:17:40,770 --> 01:17:42,140
Let me go back to VS Code here.

1615
01:17:42,140 --> 01:17:45,120
I'll clear my terminal, and let's
play around with some of this data.

1616
01:17:45,120 --> 01:17:48,080
So let's go ahead and do
this just to experiment.

1617
01:17:48,080 --> 01:17:51,320
Select star from ratings where--

1618
01:17:51,320 --> 01:17:54,860
let's get all of the good shows
where, just like Rotten Tomatoes,

1619
01:17:54,860 --> 01:17:57,410
we'll do the cut off at 6.0 out of 10.

1620
01:17:57,410 --> 01:18:01,063
So where rating is greater
than or equal to 6.0.

1621
01:18:01,063 --> 01:18:03,230
And just so I don't overwhelm
my screen, let me just

1622
01:18:03,230 --> 01:18:05,370
limit this to the first 10 results.

1623
01:18:05,370 --> 01:18:08,510
In other words, this is
SQL syntax for selecting

1624
01:18:08,510 --> 01:18:12,920
all of the ratings that are at
least 6.0 or higher from that table.

1625
01:18:12,920 --> 01:18:13,640
Enter.

1626
01:18:13,640 --> 01:18:15,350
And we see just the first 10 of them.

1627
01:18:15,350 --> 01:18:18,540
Not the top 10 because we've not sorted
or grouped or anything like that,

1628
01:18:18,540 --> 01:18:20,490
but the first 10 in the table.

1629
01:18:20,490 --> 01:18:23,780
So what is interesting
here is that we've

1630
01:18:23,780 --> 01:18:26,390
seen just some sampling
of the data, if you will.

1631
01:18:26,390 --> 01:18:28,790
But this isn't all
that interesting here.

1632
01:18:28,790 --> 01:18:32,220
Let me actually distill this just to
the show ID because, in other words,

1633
01:18:32,220 --> 01:18:34,190
I want to know 10 good shows to watch.

1634
01:18:34,190 --> 01:18:36,440
So let me just select show ID.

1635
01:18:36,440 --> 01:18:39,480
So same result, but less data.

1636
01:18:39,480 --> 01:18:41,700
It's just that first column thereof.

1637
01:18:41,700 --> 01:18:43,540
Now this is going to
be a little annoying,

1638
01:18:43,540 --> 01:18:47,460
but if I want to find out
the names of these shows--

1639
01:18:47,460 --> 01:18:50,400
think about the picture
from whence we came.

1640
01:18:50,400 --> 01:18:55,320
All of the show's names are in the shows
table, but all of the show's ratings

1641
01:18:55,320 --> 01:18:57,520
are in the ratings table.

1642
01:18:57,520 --> 01:19:00,058
So even if I do select
star from ratings,

1643
01:19:00,058 --> 01:19:02,100
I'm never going to know
what show I'm looking at.

1644
01:19:02,100 --> 01:19:04,500
Like, what the heck is show ID 62614?

1645
01:19:04,500 --> 01:19:05,680
Well, I could do this.

1646
01:19:05,680 --> 01:19:14,370
I could select star from shows where the
ID of the show equals 62614 semicolon.

1647
01:19:14,370 --> 01:19:15,240
Enter.

1648
01:19:15,240 --> 01:19:18,270
OK, so I could watch
this show from 1981.

1649
01:19:18,270 --> 01:19:19,290
Let me do another one.

1650
01:19:19,290 --> 01:19:24,030
Select star from shows
where ID equals 63881.

1651
01:19:24,030 --> 01:19:26,710
So I'm just grabbing
the second ID from here.

1652
01:19:26,710 --> 01:19:27,210
OK.

1653
01:19:27,210 --> 01:19:29,530
So Catweazle, a kid's show from 1970.

1654
01:19:29,530 --> 01:19:30,030
All right.

1655
01:19:30,030 --> 01:19:31,120
So I'll watch that.

1656
01:19:31,120 --> 01:19:32,620
So now let's do another one.

1657
01:19:32,620 --> 01:19:33,810
I'll just copy this.

1658
01:19:33,810 --> 01:19:37,530
Like suffice it to say, this is
not the best way to look up data,

1659
01:19:37,530 --> 01:19:41,610
where I'm literally copying and pasting
values from one query into the next.

1660
01:19:41,610 --> 01:19:43,950
But this is where SQL
gets a bit powerful.

1661
01:19:43,950 --> 01:19:46,170
I can have nested queries.

1662
01:19:46,170 --> 01:19:47,710
I can put one inside of the other.

1663
01:19:47,710 --> 01:19:48,793
So let me instead do this.

1664
01:19:48,793 --> 01:19:51,130
Let me clear the screen,
and let me instead do this.

1665
01:19:51,130 --> 01:19:57,620
Select star from shows
where the ID of the show

1666
01:19:57,620 --> 01:20:01,160
is in the following list of IDs.

1667
01:20:01,160 --> 01:20:02,630
Select.

1668
01:20:02,630 --> 01:20:05,480
And, actually, I'll do
this on a separate line.

1669
01:20:05,480 --> 01:20:12,020
Select show ID from ratings
where the rating value is

1670
01:20:12,020 --> 01:20:15,800
greater than or equal to 6.0 semicolon.

1671
01:20:15,800 --> 01:20:17,570
So I've separated this onto two lines.

1672
01:20:17,570 --> 01:20:19,710
The dot, dot, dot is just
a continuation character,

1673
01:20:19,710 --> 01:20:23,480
which means same query, multiple lines,
but the parentheses are deliberate.

1674
01:20:23,480 --> 01:20:27,500
Just like grade school math, I want
what's in parentheses to happen first.

1675
01:20:27,500 --> 01:20:29,930
And so what the database
will do is we'll

1676
01:20:29,930 --> 01:20:33,740
select, as before, all of the
show IDs from the ratings table

1677
01:20:33,740 --> 01:20:36,713
where the rating value is
at least 6.0 out of 10.

1678
01:20:36,713 --> 01:20:38,630
And that's going to
return to me, effectively,

1679
01:20:38,630 --> 01:20:41,750
a list, some kind of
collection of show IDs,

1680
01:20:41,750 --> 01:20:43,760
which previously I was
copying and pasting,

1681
01:20:43,760 --> 01:20:47,390
now the database will
do the legwork for me.

1682
01:20:47,390 --> 01:20:51,410
It will now select everything from the
shows table, where the ID of the show

1683
01:20:51,410 --> 01:20:53,655
is in that list of values.

1684
01:20:53,655 --> 01:20:56,280
And it's actually going to be
more than 10 unless I go in there

1685
01:20:56,280 --> 01:20:58,690
and say limit 10, which I can do.

1686
01:20:58,690 --> 01:21:00,510
So let me go ahead and hit enter now.

1687
01:21:00,510 --> 01:21:03,630
And now I see more useful
information not just

1688
01:21:03,630 --> 01:21:05,640
the ratings information,
which in a vacuum

1689
01:21:05,640 --> 01:21:07,320
tells me nothing about what to watch.

1690
01:21:07,320 --> 01:21:12,360
Now I see the show ID, the title,
the year, and the episodes.

1691
01:21:12,360 --> 01:21:17,200
But notably, what is-- and if I want
to distill this into just the title,

1692
01:21:17,200 --> 01:21:20,010
let me actually go back
here and instead do

1693
01:21:20,010 --> 01:21:25,140
select just the title from shows
where the ID is in this whole list.

1694
01:21:25,140 --> 01:21:28,210
And I'll re-execute it by just
copying and pasting the same.

1695
01:21:28,210 --> 01:21:31,240
The only difference now is instead
of star, I'm selecting title.

1696
01:21:31,240 --> 01:21:34,080
Here's now how the data
analyst at IMDb might

1697
01:21:34,080 --> 01:21:38,430
be selecting 10 shows that are really
good to watch, according to ratings.

1698
01:21:38,430 --> 01:21:41,490
But what, of course, is
missing from the output,

1699
01:21:41,490 --> 01:21:45,065
whether I do star or just title?

1700
01:21:45,065 --> 01:21:45,690
What's missing?

1701
01:21:45,690 --> 01:21:45,960
Yeah?

1702
01:21:45,960 --> 01:21:46,860
AUDIENCE: Rating.

1703
01:21:46,860 --> 01:21:47,940
- The actual rating.

1704
01:21:47,940 --> 01:21:51,000
I know these are at least
6.0, but which is 7.0?

1705
01:21:51,000 --> 01:21:52,050
Which is 10.0?

1706
01:21:52,050 --> 01:21:55,270
It'd be nice to actually
combine the data in some way.

1707
01:21:55,270 --> 01:21:57,840
So we can actually do that,
too, because it turns out

1708
01:21:57,840 --> 01:22:00,960
that when you have two tables in
the world of SQL, or even more,

1709
01:22:00,960 --> 01:22:03,720
you can actually join them together.

1710
01:22:03,720 --> 01:22:08,700
You can join them together, literally
using a keyword called join.

1711
01:22:08,700 --> 01:22:10,540
And you can do this as follows.

1712
01:22:10,540 --> 01:22:13,253
Let me propose, with a sample
data set, these two tables.

1713
01:22:13,253 --> 01:22:16,170
And dot, dot, dot just means we don't
really care about the specifics.

1714
01:22:16,170 --> 01:22:17,800
We just care about the structure.

1715
01:22:17,800 --> 01:22:21,000
So on the left here is a
simplified version of my shows

1716
01:22:21,000 --> 01:22:24,270
table that has a show
ID and a show title,

1717
01:22:24,270 --> 01:22:27,990
but I've omitted year and
episodes as just intellectually

1718
01:22:27,990 --> 01:22:30,180
distracting from the
basic structure here.

1719
01:22:30,180 --> 01:22:31,740
But they're there in the real table.

1720
01:22:31,740 --> 01:22:34,890
On the right hand side here,
we have the ratings table

1721
01:22:34,890 --> 01:22:37,290
with just two of its columns,
the show ID and the rating,

1722
01:22:37,290 --> 01:22:39,457
but I've omitted the votes
because it doesn't really

1723
01:22:39,457 --> 01:22:40,750
add anything to the discussion.

1724
01:22:40,750 --> 01:22:42,600
But let me propose this.

1725
01:22:42,600 --> 01:22:45,750
Notice that these two
tables and these two rows

1726
01:22:45,750 --> 01:22:47,790
therein definitely have commonalities.

1727
01:22:47,790 --> 01:22:50,370
Like, they both have the same ID.

1728
01:22:50,370 --> 01:22:54,150
In the left table it's 386676, a.k.a.

1729
01:22:54,150 --> 01:22:55,860
The primary key of that row.

1730
01:22:55,860 --> 01:23:01,200
But it's also appearing in duplicate
in the right hand table, 386676,

1731
01:23:01,200 --> 01:23:02,977
and in that context it's a foreign key.

1732
01:23:02,977 --> 01:23:04,810
The point, though, is
that they're the same.

1733
01:23:04,810 --> 01:23:08,320
So wouldn't it be nice if I could
treat one table here, one table here,

1734
01:23:08,320 --> 01:23:12,150
and if my fingertips represent these
identical values, kind of like glue

1735
01:23:12,150 --> 01:23:15,715
them together so I get one wider
table with all of the information

1736
01:23:15,715 --> 01:23:18,090
together to satisfy your
concern, that we don't even know

1737
01:23:18,090 --> 01:23:20,015
what the ratings are of those shows.

1738
01:23:20,015 --> 01:23:21,390
Well let me go ahead and do this.

1739
01:23:21,390 --> 01:23:24,488
Just for artist's rendition,
let me flip title and ID,

1740
01:23:24,488 --> 01:23:25,780
which has no functional effect.

1741
01:23:25,780 --> 01:23:28,488
It's just going to put the numbers
closer together on the screen.

1742
01:23:28,488 --> 01:23:32,610
Let me then literally highlight the fact
that these two numbers are identical,

1743
01:23:32,610 --> 01:23:35,190
and let me propose that we
do the equivalent of this,

1744
01:23:35,190 --> 01:23:38,910
we somehow join these two
tables on that common value.

1745
01:23:38,910 --> 01:23:40,995
Strictly speaking, I
don't need both values

1746
01:23:40,995 --> 01:23:43,870
because they're duplicates, so I
don't care if one of them goes away.

1747
01:23:43,870 --> 01:23:46,470
But what I'd really like
to do is select, indeed,

1748
01:23:46,470 --> 01:23:52,230
a temporary table that is the joined
version of that original data.

1749
01:23:52,230 --> 01:23:54,840
And frankly, I don't really
care as the user what the ID is.

1750
01:23:54,840 --> 01:23:57,690
Heck, all I care about is what show
to watch and what its rating is.

1751
01:23:57,690 --> 01:23:59,370
Give me the title and the rating.

1752
01:23:59,370 --> 01:24:01,590
All of these numbers are,
again, metadata, things

1753
01:24:01,590 --> 01:24:05,490
that the computer cares about,
but we humans probably do not.

1754
01:24:05,490 --> 01:24:07,890
So how can we implement that idea?

1755
01:24:07,890 --> 01:24:09,840
Of taking one data set
that has a relationship

1756
01:24:09,840 --> 01:24:12,510
with this data set, and
somehow combine it together?

1757
01:24:12,510 --> 01:24:14,170
Well let me go back to VS Code here.

1758
01:24:14,170 --> 01:24:15,330
Let me clear my screen.

1759
01:24:15,330 --> 01:24:20,220
And this is going to be a bit cryptic
at first, but it's very step by step.

1760
01:24:20,220 --> 01:24:21,010
Let me do this.

1761
01:24:21,010 --> 01:24:25,890
Select star from shows,
but not from shows alone.

1762
01:24:25,890 --> 01:24:28,650
Let me join it with the ratings table.

1763
01:24:28,650 --> 01:24:32,400
So let me select everything
from shows joined with ratings,

1764
01:24:32,400 --> 01:24:35,910
but I need to tell the database, well,
what do I want to join things on?

1765
01:24:35,910 --> 01:24:37,200
What are my fingertips?

1766
01:24:37,200 --> 01:24:40,030
Specifically, I want them to
join on those common integers.

1767
01:24:40,030 --> 01:24:44,940
So I can literally say, on, and then
I can specify one table on the left.

1768
01:24:44,940 --> 01:24:51,030
Shows.id on the left should equal
the ratings table's shows.id column

1769
01:24:51,030 --> 01:24:52,470
on the right.

1770
01:24:52,470 --> 01:24:57,120
Again, if I'm joining shows with
ratings it's called ID in one,

1771
01:24:57,120 --> 01:25:00,640
it's called show ID in the other,
but it's the exact same thing.

1772
01:25:00,640 --> 01:25:03,210
In fact, if I rewind, this
is where we came from.

1773
01:25:03,210 --> 01:25:05,290
Two tables with the same value.

1774
01:25:05,290 --> 01:25:10,170
So with this query here, if I go ahead
and now specify not just that but let

1775
01:25:10,170 --> 01:25:14,760
me further say, where rating is
greater than or equal to 6.0.

1776
01:25:14,760 --> 01:25:16,500
And, heck, let's limit it to 10.

1777
01:25:16,500 --> 01:25:21,420
Just fits on the screen, so it's more
of a mouthful, but when I hit enter now

1778
01:25:21,420 --> 01:25:26,220
we have a wider table that indeed
contains everything, star, from having

1779
01:25:26,220 --> 01:25:29,340
joined these two tables left and right.

1780
01:25:29,340 --> 01:25:31,710
Now, again, I don't really
care about much of this data,

1781
01:25:31,710 --> 01:25:34,410
like year and episodes and
definitely not the IDs.

1782
01:25:34,410 --> 01:25:36,070
So let me actually hit up.

1783
01:25:36,070 --> 01:25:38,070
Let me go to the beginning
of the query, and let

1784
01:25:38,070 --> 01:25:41,640
me just select the title of the
show and the rating of the show.

1785
01:25:41,640 --> 01:25:45,780
The query is getting a little long and
it's wrapping, but it's the same query.

1786
01:25:45,780 --> 01:25:49,500
Except, instead of star,
I've done title comma rating.

1787
01:25:49,500 --> 01:25:50,910
Now when I hit enter--

1788
01:25:50,910 --> 01:25:54,190
like, this is the list that would have
been nice to see the first time around.

1789
01:25:54,190 --> 01:25:57,280
Show me 10 shows with a
rating of 6.0 or higher,

1790
01:25:57,280 --> 01:25:59,620
but remind me what the
rating actually is so,

1791
01:25:59,620 --> 01:26:03,580
maybe, I can prioritize the sevens, the
eights, the nines, and even the tens,

1792
01:26:03,580 --> 01:26:05,236
if any.

1793
01:26:05,236 --> 01:26:11,760
Any questions about this
technique of joining two tables?

1794
01:26:11,760 --> 01:26:15,210
This sort of solves the problem that
we created in the world of this sheet,

1795
01:26:15,210 --> 01:26:18,030
where I was just kind of playing
around, where I sort of moved

1796
01:26:18,030 --> 01:26:20,280
all the data into its
separate locations, which

1797
01:26:20,280 --> 01:26:21,750
is not at all pleasant to use.

1798
01:26:21,750 --> 01:26:25,590
But with SQL, with join, you can
still get any of the data you want.

1799
01:26:25,590 --> 01:26:27,546
Yeah.

1800
01:26:27,546 --> 01:26:30,498
[INDISTINCT SPEECH]

1801
01:26:30,498 --> 01:26:34,450


1802
01:26:34,450 --> 01:26:35,308
Correct.

1803
01:26:35,308 --> 01:26:37,600
So, yes, I should have called
that out more explicitly.

1804
01:26:37,600 --> 01:26:41,472
In my query here, I was using dot
notation, which we've seen in Python,

1805
01:26:41,472 --> 01:26:43,930
we've seen in C. It means
something similar in spirit here,

1806
01:26:43,930 --> 01:26:45,972
but it has nothing to do
with structured objects.

1807
01:26:45,972 --> 01:26:48,650
In this case, it has to do
with tables and columns.

1808
01:26:48,650 --> 01:26:53,320
So shows.id just makes clear that I
want the ID column from the shows table

1809
01:26:53,320 --> 01:26:57,490
to line up with the show ID
column from the ratings table.

1810
01:26:57,490 --> 01:27:01,630
Strictly speaking, I don't need
to do that because in this case

1811
01:27:01,630 --> 01:27:03,460
there is no ambiguity.

1812
01:27:03,460 --> 01:27:06,340
One table has a column
called ID, the other table

1813
01:27:06,340 --> 01:27:07,810
has a column called show ID.

1814
01:27:07,810 --> 01:27:10,840
So certainly the database can
just figure this out for me.

1815
01:27:10,840 --> 01:27:15,730
But, for best practice, and for the
sake of being explicit, using the dot

1816
01:27:15,730 --> 01:27:17,950
notation and table names
can help, especially

1817
01:27:17,950 --> 01:27:20,270
if there's some common
language across them.

1818
01:27:20,270 --> 01:27:20,770
All right.

1819
01:27:20,770 --> 01:27:23,030
Well let's go back to
the bigger data set here.

1820
01:27:23,030 --> 01:27:25,060
These are all six tables in IMDb.

1821
01:27:25,060 --> 01:27:28,450
We focused for just a moment there
on like shows and ratings alone,

1822
01:27:28,450 --> 01:27:29,620
but what about genres.

1823
01:27:29,620 --> 01:27:33,250
So, genres, like comedy, and
documentary, and drama, and so forth.

1824
01:27:33,250 --> 01:27:37,190
Turns out that this actually implements
a different type of relationship.

1825
01:27:37,190 --> 01:27:39,530
Previously we saw a
one-to-one relationship,

1826
01:27:39,530 --> 01:27:42,080
but it turns out that
IMDb supports what's

1827
01:27:42,080 --> 01:27:44,820
called a one-to-many relationship
when it comes to genres.

1828
01:27:44,820 --> 01:27:45,320
Why?

1829
01:27:45,320 --> 01:27:48,650
Well, shows like The Office, I do
think are generally considered comedy

1830
01:27:48,650 --> 01:27:51,650
and that's it, but there
are certainly other TV shows

1831
01:27:51,650 --> 01:27:55,140
that might have multiple
genres associated with them.

1832
01:27:55,140 --> 01:27:59,160
Maybe it's comedy and a bit of romance
thrown in, like rom-coms, and so forth,

1833
01:27:59,160 --> 01:28:03,980
so you could imagine some shows
having two or three or more genres,

1834
01:28:03,980 --> 01:28:07,520
and so one-to-many means that
one show can have many genres.

1835
01:28:07,520 --> 01:28:11,450
One-to-one would mean one show can
have one rating, as we've seen.

1836
01:28:11,450 --> 01:28:17,330
So why don't we go ahead and focus,
maybe, on how about a query like this.

1837
01:28:17,330 --> 01:28:19,577
Let me go back to VS Code
here, clear my screen,

1838
01:28:19,577 --> 01:28:21,410
and let's just look at
some of those genres.

1839
01:28:21,410 --> 01:28:25,700
Select star from genres and
then I'll limit it to 10.

1840
01:28:25,700 --> 01:28:28,590
And, again, I do this just to wrap
my mind around a new data set.

1841
01:28:28,590 --> 01:28:30,740
I could look at the schema, but
that tends to be more cryptic.

1842
01:28:30,740 --> 01:28:32,310
I just want to look at the raw data.

1843
01:28:32,310 --> 01:28:32,810
OK.

1844
01:28:32,810 --> 01:28:37,227
It looks like here there are a bunch
of genres: comedy, adventure, comedy.

1845
01:28:37,227 --> 01:28:38,810
So two comedies, which is interesting.

1846
01:28:38,810 --> 01:28:40,100
Oh, interesting.

1847
01:28:40,100 --> 01:28:42,410
Family, action, sci-fi, family.

1848
01:28:42,410 --> 01:28:45,260
So the values here are
duplicated, which it turns out

1849
01:28:45,260 --> 01:28:47,000
is not the best design of IMDb.

1850
01:28:47,000 --> 01:28:49,710
We literally just imported
the data as they implement it.

1851
01:28:49,710 --> 01:28:55,190
But notice that show ID
62614 is a comedy, but so is

1852
01:28:55,190 --> 01:29:00,950
show 63881, and so is show 65270.

1853
01:29:00,950 --> 01:29:04,320
So it turns out that in the real world
sometimes data is somewhat messy.

1854
01:29:04,320 --> 01:29:06,830
There's duplication of
comedy, comedy, comedy,

1855
01:29:06,830 --> 01:29:09,380
but such is the way IMDb's data is.

1856
01:29:09,380 --> 01:29:12,694
But what's more interesting to
me, for now, is notice this.

1857
01:29:12,694 --> 01:29:16,530
This show ID in three rows is the same.

1858
01:29:16,530 --> 01:29:19,700
So there's some show out there
that's considered by the world

1859
01:29:19,700 --> 01:29:22,257
to be an adventure, a
comedy, and a family show.

1860
01:29:22,257 --> 01:29:23,340
So let's see what that is.

1861
01:29:23,340 --> 01:29:26,880
Let me just highlight and copy
that value, 63881, and do this.

1862
01:29:26,880 --> 01:29:31,378
Select star from shows where the
ID of the show equals that value.

1863
01:29:31,378 --> 01:29:33,170
And it turns out, we
saw it briefly before,

1864
01:29:33,170 --> 01:29:37,400
it's a show from the 1970s called
Catweazle, which falls into all three

1865
01:29:37,400 --> 01:29:38,880
of those categories.

1866
01:29:38,880 --> 01:29:43,880
So by using a one-to-many relationship,
sort of depicted by this picture here,

1867
01:29:43,880 --> 01:29:47,588
you can implement that same idea
without having that jagged edge.

1868
01:29:47,588 --> 01:29:50,630
When we looked at the spreadsheet
earlier, in an earlier version of this,

1869
01:29:50,630 --> 01:29:55,490
we had star, star, star, which we
could do again, genre, genre, genre,

1870
01:29:55,490 --> 01:29:58,460
but now we instead have
two separate tables

1871
01:29:58,460 --> 01:30:02,247
where this many-to-many
relationship is implemented across.

1872
01:30:02,247 --> 01:30:03,830
So let's actually play around with it.

1873
01:30:03,830 --> 01:30:07,640
Let me go back to VS Code
here, and let's actually take

1874
01:30:07,640 --> 01:30:10,520
a look at the schema for genres.

1875
01:30:10,520 --> 01:30:14,030
And we'll see that it's pretty
small, as the picture suggests.

1876
01:30:14,030 --> 01:30:14,990
It's called genres.

1877
01:30:14,990 --> 01:30:18,530
Every row has a show ID, which
is an integer, cannot be null.

1878
01:30:18,530 --> 01:30:21,200
It has a genre, which
is text, cannot be null.

1879
01:30:21,200 --> 01:30:25,190
And that show ID is a foreign key
in this table that references an ID

1880
01:30:25,190 --> 01:30:26,690
column in the shows table.

1881
01:30:26,690 --> 01:30:29,240
So very similar in
spirit, and so it really

1882
01:30:29,240 --> 01:30:32,090
is just kind of on the
honor system that we're only

1883
01:30:32,090 --> 01:30:34,670
putting one row for
each show in ratings,

1884
01:30:34,670 --> 01:30:39,150
but 0 or more shows in the
genres table for shows, as well.

1885
01:30:39,150 --> 01:30:41,360
So what can we do once we
want to tinker with genres?

1886
01:30:41,360 --> 01:30:42,420
Well, let me do this.

1887
01:30:42,420 --> 01:30:49,587
How about we select the show ID from the
genres table where the genre is comedy.

1888
01:30:49,587 --> 01:30:51,170
Like, I'm in the mood for some comedy.

1889
01:30:51,170 --> 01:30:54,950
Let's see all of the available comedies,
except let's just limit it to 10.

1890
01:30:54,950 --> 01:30:59,150
Here are the show IDs for 10 comedies
according to the Internet Movie

1891
01:30:59,150 --> 01:30:59,803
Database.

1892
01:30:59,803 --> 01:31:01,220
Well, that's not very interesting.

1893
01:31:01,220 --> 01:31:03,350
I care about the title,
so we can do that.

1894
01:31:03,350 --> 01:31:08,990
Select title from shows where the
ID of the show is not equal to,

1895
01:31:08,990 --> 01:31:12,710
but rather, in the following
subquery, if you will.

1896
01:31:12,710 --> 01:31:18,170
Select show ID from genres where
genre equals, quote unquote,

1897
01:31:18,170 --> 01:31:21,900
"Comedy," limit 10, just
to keep things simple.

1898
01:31:21,900 --> 01:31:25,460
So same query as before, but now
I'm using it as a nested query

1899
01:31:25,460 --> 01:31:28,730
to select the actual titles
whose IDs match those their.

1900
01:31:28,730 --> 01:31:29,540
Enter.

1901
01:31:29,540 --> 01:31:32,540
And there are those
titles for 10 comedies,

1902
01:31:32,540 --> 01:31:34,240
maybe a couple of which we've seen.

1903
01:31:34,240 --> 01:31:35,480
Catweazel we've seen before.

1904
01:31:35,480 --> 01:31:36,970
But what if we want to now--

1905
01:31:36,970 --> 01:31:38,380
let's see.

1906
01:31:38,380 --> 01:31:42,085
Maybe we want to get all of
the-- let's flip it around.

1907
01:31:42,085 --> 01:31:43,630
Catweazle keeps coming up.

1908
01:31:43,630 --> 01:31:47,170
Why don't we figure out, with a
query, what all of its genres are?

1909
01:31:47,170 --> 01:31:48,110
So Catweazle.

1910
01:31:48,110 --> 01:31:48,610
Let's see.

1911
01:31:48,610 --> 01:31:55,450
So let's do select star
from genres limit 10,

1912
01:31:55,450 --> 01:31:57,430
because I know it's in the the first 10.

1913
01:31:57,430 --> 01:31:59,950
This was, 63881 was Catweazle's ID.

1914
01:31:59,950 --> 01:32:00,880
So let's do this.

1915
01:32:00,880 --> 01:32:06,860
So select genre from genres where
the show ID equals that value.

1916
01:32:06,860 --> 01:32:07,360
OK.

1917
01:32:07,360 --> 01:32:09,152
So there's the same
query as we did before.

1918
01:32:09,152 --> 01:32:10,480
Can we make this dynamic?

1919
01:32:10,480 --> 01:32:11,650
Well, we can, too.

1920
01:32:11,650 --> 01:32:13,870
What if I instead, more
dynamically do, select

1921
01:32:13,870 --> 01:32:18,250
genre from genres where the show ID--

1922
01:32:18,250 --> 01:32:19,180
and not in.

1923
01:32:19,180 --> 01:32:23,020
If I'm looking for a specific
show, now I can actually do equals.

1924
01:32:23,020 --> 01:32:25,480
And in my subquery, I
could do this, select ID

1925
01:32:25,480 --> 01:32:28,660
from shows where the
title of the show equals,

1926
01:32:28,660 --> 01:32:33,160
quote unquote, "Catweazel,"
semicolon, enter.

1927
01:32:33,160 --> 01:32:35,720
So, again, even though I'm
typing these very quickly,

1928
01:32:35,720 --> 01:32:38,630
I'm really just composing
similar, smaller ideas

1929
01:32:38,630 --> 01:32:41,060
that we've seen before into
larger and larger queries

1930
01:32:41,060 --> 01:32:42,573
to just get at more of this data.

1931
01:32:42,573 --> 01:32:44,490
So what's really going
on underneath the hood?

1932
01:32:44,490 --> 01:32:46,160
Well, you can think of it like this.

1933
01:32:46,160 --> 01:32:50,060
If we've got this relationship
between shows and genres,

1934
01:32:50,060 --> 01:32:52,580
here's an excerpt from
shows, and I didn't bother

1935
01:32:52,580 --> 01:32:54,170
showing the thousands of other shows.

1936
01:32:54,170 --> 01:32:56,900
Here's an excerpt from
genres on the right.

1937
01:32:56,900 --> 01:33:00,530
What is that query-- what are
we essentially trying to do?

1938
01:33:00,530 --> 01:33:02,570
Well, let me flip this around here.

1939
01:33:02,570 --> 01:33:06,290
Let me highlight the fact that
this is the same, this is the same,

1940
01:33:06,290 --> 01:33:07,260
this is the same.

1941
01:33:07,260 --> 01:33:10,040
So wouldn't it be nice if I could
kind of get these all together?

1942
01:33:10,040 --> 01:33:11,660
Well, if I join these
tables, we're actually

1943
01:33:11,660 --> 01:33:13,370
going to notice an interesting artifact.

1944
01:33:13,370 --> 01:33:16,490
If I join them together, as
we did before with ratings,

1945
01:33:16,490 --> 01:33:20,990
I'm going to need to fill in the gap
there because this is not a table.

1946
01:33:20,990 --> 01:33:24,860
Tables, by definition, always have
the same number of rows and columns.

1947
01:33:24,860 --> 01:33:26,730
You can't have gaps in them like this.

1948
01:33:26,730 --> 01:33:29,820
So the simplest thing to do is
just to fill that in this way.

1949
01:33:29,820 --> 01:33:33,170
But if I were to try to
combine two tables that

1950
01:33:33,170 --> 01:33:36,320
have this one-to-many
relationship, you're

1951
01:33:36,320 --> 01:33:38,090
actually going to get duplication.

1952
01:33:38,090 --> 01:33:40,190
It's not duplication
in the original tables,

1953
01:33:40,190 --> 01:33:43,850
but in the temporary tables,
otherwise known as a result set,

1954
01:33:43,850 --> 01:33:45,750
that's coming back to us.

1955
01:33:45,750 --> 01:33:47,070
So what do I mean by this?

1956
01:33:47,070 --> 01:33:50,060
Well, if we actually implement
this same idea as before,

1957
01:33:50,060 --> 01:33:52,970
where we try to join
these two tables, let

1958
01:33:52,970 --> 01:33:55,640
me propose that we do
it with this syntax.

1959
01:33:55,640 --> 01:33:59,540
Let me do select star
from shows join genres,

1960
01:33:59,540 --> 01:34:04,850
which is just like we did with ratings
but now let's join it on shows.id

1961
01:34:04,850 --> 01:34:07,760
equals genres.show_ID.

1962
01:34:07,760 --> 01:34:14,120
But let's just do this for Catweazel,
where ID equals 63881, semicolon.

1963
01:34:14,120 --> 01:34:16,250
With the ratings, it
worked perfectly because it

1964
01:34:16,250 --> 01:34:19,920
was a one-to-one relationship, so
the rows just got wider, if you will.

1965
01:34:19,920 --> 01:34:23,660
But now, because it's a
one-to-many relationship, when

1966
01:34:23,660 --> 01:34:26,960
you execute these queries, you are
going to get back duplicate data

1967
01:34:26,960 --> 01:34:28,400
but this is OK.

1968
01:34:28,400 --> 01:34:30,740
It's considered OK because
this is sort of ephemeral.

1969
01:34:30,740 --> 01:34:32,540
These result sets,
these temporary tables,

1970
01:34:32,540 --> 01:34:34,700
exist just for us to
look at the data, just

1971
01:34:34,700 --> 01:34:36,440
for us to crunch the numbers somehow.

1972
01:34:36,440 --> 01:34:42,167
It's not actually stored in
duplicate in the database itself.

1973
01:34:42,167 --> 01:34:44,000
If I wanted to tighten
this further, though,

1974
01:34:44,000 --> 01:34:47,570
let me actually get rid of the star
and let me just do title genre.

1975
01:34:47,570 --> 01:34:49,040
And, indeed, we can now see.

1976
01:34:49,040 --> 01:34:52,130
OK, Catweazel three times has
three different categories.

1977
01:34:52,130 --> 01:34:54,440
But, generally, we don't
even care about that so I

1978
01:34:54,440 --> 01:34:57,830
can even whittle this query
down to just selecting genre,

1979
01:34:57,830 --> 01:35:00,500
and that, too, will just give me
the result, effectively, hiding

1980
01:35:00,500 --> 01:35:01,550
the duplication.

1981
01:35:01,550 --> 01:35:04,957
But when you join data with
a one-to-many relationship

1982
01:35:04,957 --> 01:35:07,790
you're temporarily going to get
duplicates, which is actually useful

1983
01:35:07,790 --> 01:35:10,580
because it's very easy then
to get at the show's title

1984
01:35:10,580 --> 01:35:14,080
no matter where you are in some loop.

1985
01:35:14,080 --> 01:35:14,620
All right.

1986
01:35:14,620 --> 01:35:15,912
Well, what more can we do here?

1987
01:35:15,912 --> 01:35:19,470
Well let me propose that we revisit the
main database, here with six tables,

1988
01:35:19,470 --> 01:35:21,970
and let's look at, perhaps, the
juiciest, and the one that's

1989
01:35:21,970 --> 01:35:26,680
really what most people use IMDb for
is to look up shows and people therein.

1990
01:35:26,680 --> 01:35:28,300
Let's focus on these three tables.

1991
01:35:28,300 --> 01:35:32,140
And we can infer, from this diagram,
that there's now, for the first time,

1992
01:35:32,140 --> 01:35:34,270
three tables involved in a relationship.

1993
01:35:34,270 --> 01:35:38,710
There's people, there's shows, but
I've proposed this intermediary stars

1994
01:35:38,710 --> 01:35:40,930
table, much like I
temporarily, in Google Sheets,

1995
01:35:40,930 --> 01:35:44,120
gave us a third sheet to
link the two together.

1996
01:35:44,120 --> 01:35:46,810
This stars table we're about
to see, it's purpose in life

1997
01:35:46,810 --> 01:35:49,990
is to join two other tables together.

1998
01:35:49,990 --> 01:35:54,670
And, in fact, it's only going to have
two columns, show ID and person ID.

1999
01:35:54,670 --> 01:35:57,790
So what this is going to do
for us is implement this idea,

2000
01:35:57,790 --> 01:35:59,360
many-to-many relationship.

2001
01:35:59,360 --> 01:35:59,860
Why?

2002
01:35:59,860 --> 01:36:02,810
Because any TV show can
obviously have many people in it,

2003
01:36:02,810 --> 01:36:06,670
but one person can presumably
star in many different shows.

2004
01:36:06,670 --> 01:36:09,680
Like, Steve Carell has been in
multiple shows, not just The Office.

2005
01:36:09,680 --> 01:36:11,870
So when you have a
many-to-many relationship,

2006
01:36:11,870 --> 01:36:16,655
you actually do need this third table
to bridge the two any number of times.

2007
01:36:16,655 --> 01:36:19,280
But it's going to make our life
a little more unpleasant to get

2008
01:36:19,280 --> 01:36:22,800
the data we want because it's going to
add some additional steps, if you will.

2009
01:36:22,800 --> 01:36:24,300
So let me do this.

2010
01:36:24,300 --> 01:36:28,340
Suppose that I want to get
everything I know about The Office.

2011
01:36:28,340 --> 01:36:30,800
Well, let's start with a
single query here in VS Code.

2012
01:36:30,800 --> 01:36:35,780
Select star from shows where title
equals, quote unquote, "The Office."

2013
01:36:35,780 --> 01:36:37,130
And I should see--

2014
01:36:37,130 --> 01:36:38,060
Oh, interesting.

2015
01:36:38,060 --> 01:36:41,810
Several attempts at creating a TV
show called The Office over the years.

2016
01:36:41,810 --> 01:36:45,740
You can, perhaps, infer the year in
which the most popular of them began.

2017
01:36:45,740 --> 01:36:46,910
2005?

2018
01:36:46,910 --> 01:36:49,580
So I presume this is the one
we all know and have watched,

2019
01:36:49,580 --> 01:36:54,200
at least in the US, which is this one,
386676, which matches the ID that I

2020
01:36:54,200 --> 01:36:55,770
very carefully used earlier.

2021
01:36:55,770 --> 01:36:58,520
So let me actually be a
little more deliberate,

2022
01:36:58,520 --> 01:37:02,690
where the title equals The
Office and the year equals 2005.

2023
01:37:02,690 --> 01:37:07,400
That query now gets us The Office that
we all in the US and, perhaps, love.

2024
01:37:07,400 --> 01:37:09,830
But now let's actually
do something like, get

2025
01:37:09,830 --> 01:37:11,960
all of the people who
starred in it, at least

2026
01:37:11,960 --> 01:37:14,650
according to IMDb,
whoever had top billing.

2027
01:37:14,650 --> 01:37:16,370
So how can I do this?

2028
01:37:16,370 --> 01:37:20,350
Well, unfortunately, in the
shows table there are no people,

2029
01:37:20,350 --> 01:37:24,310
and there's no stars even, but I
could do a nested query like this.

2030
01:37:24,310 --> 01:37:32,350
Why don't I select the person
ID from the stars table where--

2031
01:37:32,350 --> 01:37:45,990
whoops, where the-- sorry, where show
ID equals, and then in parentheses

2032
01:37:45,990 --> 01:37:48,430
let me do that same query as
before and for time's sake,

2033
01:37:48,430 --> 01:37:52,260
I'll just copy paste so that
we get back the one and only

2034
01:37:52,260 --> 01:37:53,640
Office in that subquery.

2035
01:37:53,640 --> 01:37:56,710
So what I'm going to do is take
an intermediate step, a baby step,

2036
01:37:56,710 --> 01:37:57,240
if you will.

2037
01:37:57,240 --> 01:38:00,992
Right now, I have found in the
shows table The Office I care about.

2038
01:38:00,992 --> 01:38:02,700
But if I want to get
to the people table,

2039
01:38:02,700 --> 01:38:06,720
I have to take a step through the stars
table, this intermediate table, just

2040
01:38:06,720 --> 01:38:09,130
to get anywhere close to
the names of those people.

2041
01:38:09,130 --> 01:38:11,070
So what can I get from the stars table?

2042
01:38:11,070 --> 01:38:14,040
Well, why don't I at least
select all of the person

2043
01:38:14,040 --> 01:38:18,340
IDs in that table that are somehow
associated with the same show ID?

2044
01:38:18,340 --> 01:38:20,340
So in VS Code, what I'm doing is this.

2045
01:38:20,340 --> 01:38:23,340
Select person ID from
that intermediate stars

2046
01:38:23,340 --> 01:38:27,690
table, where the show ID in question is
whatever the show ID is for The Office.

2047
01:38:27,690 --> 01:38:30,510
I could literally type
386676, but I'm trying

2048
01:38:30,510 --> 01:38:33,630
to do this more dynamically so
I've used a nested query instead.

2049
01:38:33,630 --> 01:38:34,890
All right, this is correct.

2050
01:38:34,890 --> 01:38:36,760
Whoops, ironically.

2051
01:38:36,760 --> 01:38:38,085
Where show ID--

2052
01:38:38,085 --> 01:38:41,540


2053
01:38:41,540 --> 01:38:42,440
Oh, thank you.

2054
01:38:42,440 --> 01:38:43,610
Oh, who said that?

2055
01:38:43,610 --> 01:38:44,570
OK.

2056
01:38:44,570 --> 01:38:45,110
Batman.

2057
01:38:45,110 --> 01:38:47,370
I owe Batman Oreos after class.

2058
01:38:47,370 --> 01:38:48,470
Thank you, Batman.

2059
01:38:48,470 --> 01:38:50,330
That's de facto.

2060
01:38:50,330 --> 01:38:50,930
Yes.

2061
01:38:50,930 --> 01:38:52,355
Other good prize for that costume.

2062
01:38:52,355 --> 01:38:53,330
[LAUGHS]

2063
01:38:53,330 --> 01:38:55,160
OK, so let me fix this.

2064
01:38:55,160 --> 01:38:55,910
My apologies.

2065
01:38:55,910 --> 01:38:58,790
So let's go ahead and
select person ID from stars

2066
01:38:58,790 --> 01:39:02,390
where show ID equals, and this
is where I messed up before.

2067
01:39:02,390 --> 01:39:05,510
I did select star from
shows, but I can't

2068
01:39:05,510 --> 01:39:09,410
look for a show ID equaling
an entire row of information.

2069
01:39:09,410 --> 01:39:16,160
I instead need to do select ID from
shows where the shows title equals,

2070
01:39:16,160 --> 01:39:20,510
quote unquote, "The Office" and
the year of that show is 2005.

2071
01:39:20,510 --> 01:39:25,340
And just to call this out, much
like in C, I'm quoting strings,

2072
01:39:25,340 --> 01:39:26,780
but I'm not quoting numbers.

2073
01:39:26,780 --> 01:39:28,910
It's not necessary for
something like 2005.

2074
01:39:28,910 --> 01:39:33,410
And just to be super clear, too, I have
generally adopted a style already today

2075
01:39:33,410 --> 01:39:37,850
of capitalizing any SQL keywords,
like select, like from, like where,

2076
01:39:37,850 --> 01:39:40,470
and so forth, and then using
lowercase for everything else.

2077
01:39:40,470 --> 01:39:42,180
Strictly speaking, that's not necessary.

2078
01:39:42,180 --> 01:39:44,130
Stylistically, we would
encourage you to be

2079
01:39:44,130 --> 01:39:46,592
in the habit of using
uppercase for your SQL keywords

2080
01:39:46,592 --> 01:39:48,300
because they just pop
more on the screen.

2081
01:39:48,300 --> 01:39:49,883
It sort of makes things more readable.

2082
01:39:49,883 --> 01:39:52,200
But strictly speaking,
SQL itself does not care

2083
01:39:52,200 --> 01:39:54,690
about that capitalization of keywords.

2084
01:39:54,690 --> 01:39:56,490
All right, now let me cross my fingers.

2085
01:39:56,490 --> 01:39:59,100
And now I get back this
list of person IDs.

2086
01:39:59,100 --> 01:40:00,840
And, again, my goal
was to figure out who

2087
01:40:00,840 --> 01:40:03,018
is in The Office that debuted in 2005.

2088
01:40:03,018 --> 01:40:06,060
This is not that interesting because
I don't know who any of these people

2089
01:40:06,060 --> 01:40:08,820
are, but here's where we
can do one additional step

2090
01:40:08,820 --> 01:40:10,650
and nest a nested query.

2091
01:40:10,650 --> 01:40:15,000
So let me actually select the
names from the people table

2092
01:40:15,000 --> 01:40:18,720
where the ID of those people is in.

2093
01:40:18,720 --> 01:40:20,820
And then on a new line,
just to make it pretty,

2094
01:40:20,820 --> 01:40:26,770
where the ID equals this
query, which I'll paste here,

2095
01:40:26,770 --> 01:40:31,260
and then that equals, and then
another line and indent further,

2096
01:40:31,260 --> 01:40:32,602
this query here.

2097
01:40:32,602 --> 01:40:35,310
So just to save time, I'm copying
and pasting the previous query,

2098
01:40:35,310 --> 01:40:38,970
but I'm wrapping it with one
outermost query, now, that's

2099
01:40:38,970 --> 01:40:42,690
saying select names from people
where the ID of those people

2100
01:40:42,690 --> 01:40:47,940
is in this result set, where the ID
of those shows is in this result set.

2101
01:40:47,940 --> 01:40:52,170
So the parentheses make clear,
ideally, what's happening

2102
01:40:52,170 --> 01:40:53,790
and what order from inside out.

2103
01:40:53,790 --> 01:40:54,540
Enter.

2104
01:40:54,540 --> 01:40:58,860
And there we have it, at least according
to IMDb, for the latest season,

2105
01:40:58,860 --> 01:41:04,350
like this is the top-billed stars
that are in this here database.

2106
01:41:04,350 --> 01:41:04,860
All right.

2107
01:41:04,860 --> 01:41:06,302
So how can we do something else?

2108
01:41:06,302 --> 01:41:08,260
Well, let me just do it
in the other direction.

2109
01:41:08,260 --> 01:41:11,177
Suppose we want to see all of Steve
Carell shows, not just The Office.

2110
01:41:11,177 --> 01:41:16,650
Select title from shows where
the ID of the show is in.

2111
01:41:16,650 --> 01:41:18,990
And then here, I'm
going to do select show

2112
01:41:18,990 --> 01:41:24,150
ID from stars where person ID equals.

2113
01:41:24,150 --> 01:41:26,490
And then here, indenting
for clarity, I'm

2114
01:41:26,490 --> 01:41:28,740
going to select Steve
Carell's ID by saying,

2115
01:41:28,740 --> 01:41:32,850
select ID from people where
the name of that person is,

2116
01:41:32,850 --> 01:41:35,040
quote unquote, "Steve Carell."

2117
01:41:35,040 --> 01:41:36,780
And so in this way,

2118
01:41:36,780 --> 01:41:42,250
I'm writing the SQL
query sort of in reverse.

2119
01:41:42,250 --> 01:41:44,102
I'm asking first what I care about.

2120
01:41:44,102 --> 01:41:45,810
But before I can even
answer that, I have

2121
01:41:45,810 --> 01:41:48,510
to answer this nested query,
what is Steve Carell's ID?

2122
01:41:48,510 --> 01:41:52,320
Once I have that, what are all the
show IDs that person ID has been in?

2123
01:41:52,320 --> 01:41:55,740
And then, please tell me what the
title of all of those shows is.

2124
01:41:55,740 --> 01:41:57,510
Let me go ahead and cross my fingers.

2125
01:41:57,510 --> 01:41:58,200
And, voila.

2126
01:41:58,200 --> 01:42:00,950
Some of these you might have heard
of, some of you might not have,

2127
01:42:00,950 --> 01:42:03,750
but if you were to go on imdb.com
and search for Steve Carell,

2128
01:42:03,750 --> 01:42:08,445
you would presumably see this here
list of shows that he's been in,

2129
01:42:08,445 --> 01:42:11,410
in some particular order.

2130
01:42:11,410 --> 01:42:15,330
Just to show you two other syntaxes,
but let me not emphasize this

2131
01:42:15,330 --> 01:42:17,200
because it will look complicated.

2132
01:42:17,200 --> 01:42:19,500
There are other ways to
solve this same problem.

2133
01:42:19,500 --> 01:42:21,660
If you prefer the
approach of joining, we

2134
01:42:21,660 --> 01:42:25,720
can actually join not just
two, but three tables together.

2135
01:42:25,720 --> 01:42:26,535
But question first.

2136
01:42:26,535 --> 01:42:28,310
[INDISTINCT SPEECH]

2137
01:42:28,310 --> 01:42:29,810
No, just stretching first.

2138
01:42:29,810 --> 01:42:35,120
So two final ways to execute the same
idea, but the first of them that I just

2139
01:42:35,120 --> 01:42:37,850
did is arguably, relatively, simpler.

2140
01:42:37,850 --> 01:42:38,910
You could do this.

2141
01:42:38,910 --> 01:42:42,800
I could select the title
from the shows table

2142
01:42:42,800 --> 01:42:49,220
by joining it on the stars
table on the shows ID column,

2143
01:42:49,220 --> 01:42:52,820
equaling the stars
tables show ID column.

2144
01:42:52,820 --> 01:42:58,460
And then I can further join it on the
people table, on stars dot person ID

2145
01:42:58,460 --> 01:43:00,453
equaling people.id.

2146
01:43:00,453 --> 01:43:03,620
This is a mouthful, and even I am kind
of crossing my fingers that it didn't

2147
01:43:03,620 --> 01:43:06,510
screw up when transcribing
it from my printout here,

2148
01:43:06,510 --> 01:43:11,180
but what I'm effectively doing is
joining one, two, three tables all

2149
01:43:11,180 --> 01:43:17,570
together by telling the database how
to join the shows table, with the stars

2150
01:43:17,570 --> 01:43:19,070
table, and the people table.

2151
01:43:19,070 --> 01:43:22,940
Specifically, the way to bridge
that picture, per the diagram,

2152
01:43:22,940 --> 01:43:28,890
is to specify that shows.id should
be lined up with stars.show ID.

2153
01:43:28,890 --> 01:43:34,380
And stars.person ID should
be lined up with people ID.

2154
01:43:34,380 --> 01:43:35,008
And that's it.

2155
01:43:35,008 --> 01:43:37,050
That essentially allows
us to connect these three

2156
01:43:37,050 --> 01:43:39,000
tables with their common fields.

2157
01:43:39,000 --> 01:43:42,390
If I hit enter now, I'm going
to get back, somewhat slowly

2158
01:43:42,390 --> 01:43:47,160
actually, a really long
list, with some duplication,

2159
01:43:47,160 --> 01:43:49,830
of all of those particular shows.

2160
01:43:49,830 --> 01:43:52,182
Actually, all shows in
the database because I

2161
01:43:52,182 --> 01:43:53,640
didn't practice what I'm preaching.

2162
01:43:53,640 --> 01:43:56,140
I wanted to search for just
Steve Carell, what you're seeing

2163
01:43:56,140 --> 01:43:58,950
is the entirety of the tens
of thousands of TV shows.

2164
01:43:58,950 --> 01:44:00,360
Control C is your friend.

2165
01:44:00,360 --> 01:44:05,070
Let me go ahead and reload SQLite,
and let me type that again.

2166
01:44:05,070 --> 01:44:06,690
Let me type that once more, sorry.

2167
01:44:06,690 --> 01:44:13,320
Select title from shows, join stars
on shows.id equals stars.show ID,

2168
01:44:13,320 --> 01:44:20,872
join people on stars.person
ID equals people ID where--

2169
01:44:20,872 --> 01:44:22,080
this was the part I left out.

2170
01:44:22,080 --> 01:44:24,990
Name equals Steve Carell.

2171
01:44:24,990 --> 01:44:28,710
And if I didn't screw up
by typing so fast-- enter.

2172
01:44:28,710 --> 01:44:31,500
Now we get, a little
more slowly than before,

2173
01:44:31,500 --> 01:44:34,435
those same shows that
Steve Carell starred in.

2174
01:44:34,435 --> 01:44:36,810
So this is just to say there's
another way of doing this.

2175
01:44:36,810 --> 01:44:39,420
But maybe a third way, which
is a little simpler than that,

2176
01:44:39,420 --> 01:44:41,250
explicitly joining them in that way.

2177
01:44:41,250 --> 01:44:47,490
You can, alternatively, still do
this, select title from shows, stars,

2178
01:44:47,490 --> 01:44:48,270
and people.

2179
01:44:48,270 --> 01:44:51,480
You just literally enumerate,
with commas, what three tables

2180
01:44:51,480 --> 01:44:52,920
you want to join somehow.

2181
01:44:52,920 --> 01:44:55,020
And then you can instead
of using join, you

2182
01:44:55,020 --> 01:44:58,560
can just use where clauses to
make sure they line up properly.

2183
01:44:58,560 --> 01:45:03,630
You can say, where shows.id
equals stars.show ID,

2184
01:45:03,630 --> 01:45:09,180
and people ID equals
stars.person ID, and name

2185
01:45:09,180 --> 01:45:11,760
equals, quote unquote, "Steve Carell."

2186
01:45:11,760 --> 01:45:14,430
And I realize this is hard to
keep track of everything now,

2187
01:45:14,430 --> 01:45:16,353
all these darn different
ways to do this,

2188
01:45:16,353 --> 01:45:19,020
this is just to say that there's
different approaches to solving

2189
01:45:19,020 --> 01:45:20,728
the same problem and
for different people

2190
01:45:20,728 --> 01:45:23,970
you might think about things a little
more differently than someone else.

2191
01:45:23,970 --> 01:45:27,630
If I hit enter here, this too it's a
little slower than the nested selects

2192
01:45:27,630 --> 01:45:31,590
it seems, but it does, in
fact, give us that same answer.

2193
01:45:31,590 --> 01:45:34,440
And just for thoroughness,
if I go back to our diagram.

2194
01:45:34,440 --> 01:45:38,700
Besides the tables we've seen,
there's actually another writers table

2195
01:45:38,700 --> 01:45:39,480
in there, as well.

2196
01:45:39,480 --> 01:45:42,480
If you're curious to see what writers
is, let's just glance at that real

2197
01:45:42,480 --> 01:45:42,980
fast.

2198
01:45:42,980 --> 01:45:47,010
In VS Code, let me do .schema writers,
and it's actually almost the same

2199
01:45:47,010 --> 01:45:48,330
as stars.

2200
01:45:48,330 --> 01:45:52,440
Except this case, in writers,
we are associating a show

2201
01:45:52,440 --> 01:45:54,750
with a person ID, both
of which in this context

2202
01:45:54,750 --> 01:45:58,770
are foreign keys that indeed
reference back shows and people ID.

2203
01:45:58,770 --> 01:46:02,520
Which, again, if I do this schema
stars, which we didn't see before,

2204
01:46:02,520 --> 01:46:04,120
is structurally the same.

2205
01:46:04,120 --> 01:46:06,540
So the relationship is essentially
embodied, in this case,

2206
01:46:06,540 --> 01:46:12,690
by IMDb, and in turn by us, by way of
the tables name, TV stars or writers

2207
01:46:12,690 --> 01:46:14,410
thereof.

2208
01:46:14,410 --> 01:46:14,940
All right.

2209
01:46:14,940 --> 01:46:16,000
I know that's a lot.

2210
01:46:16,000 --> 01:46:19,720
Any questions before we take
a higher level step back?

2211
01:46:19,720 --> 01:46:20,220
Yeah.

2212
01:46:20,220 --> 01:46:22,992
[INDISTINCT SPEECH]

2213
01:46:22,992 --> 01:46:25,183


2214
01:46:25,183 --> 01:46:25,850
A good question.

2215
01:46:25,850 --> 01:46:28,400
Does SQL provide any way to figure
out the mapping between tables

2216
01:46:28,400 --> 01:46:29,540
without looking at the database?

2217
01:46:29,540 --> 01:46:30,380
Short answer, no.

2218
01:46:30,380 --> 01:46:32,668
Like, this is the dia--
well, that's not quite fair.

2219
01:46:32,668 --> 01:46:34,460
Depending on the database
you're using, you

2220
01:46:34,460 --> 01:46:36,080
might be able to click
a button, for instance,

2221
01:46:36,080 --> 01:46:38,997
and get a nice pretty picture like
this, that shows the relationships.

2222
01:46:38,997 --> 01:46:40,730
Indeed, we use software
to generate this.

2223
01:46:40,730 --> 01:46:42,980
We didn't do this diagram,
for instance, by hand.

2224
01:46:42,980 --> 01:46:45,530
SQLite itself does not
provide you with that.

2225
01:46:45,530 --> 01:46:48,425
In SQLite the best you
can do is run .schema.

2226
01:46:48,425 --> 01:46:50,300
And if you don't specify
a table name, you'll

2227
01:46:50,300 --> 01:46:52,120
get everything from the table described.

2228
01:46:52,120 --> 01:46:53,870
Once you get comfortable
with SQL, though,

2229
01:46:53,870 --> 01:46:56,240
the idea is that you can read
the text and sort of infer what

2230
01:46:56,240 --> 01:46:57,080
the structure is.

2231
01:46:57,080 --> 01:46:58,850
But, yes, there are
graphical programs can

2232
01:46:58,850 --> 01:47:00,710
generate prettier
pictures like this, but it

2233
01:47:00,710 --> 01:47:02,900
depends on the software you're using.

2234
01:47:02,900 --> 01:47:03,470
Yeah.

2235
01:47:03,470 --> 01:47:06,600
[INDISTINCT SPEECH]

2236
01:47:06,600 --> 01:47:10,590
SQL is not case sensitive
with respect to its keywords.

2237
01:47:10,590 --> 01:47:13,920
But table names, and other
things that you chose,

2238
01:47:13,920 --> 01:47:16,452
you've got to be consistent
with how you capitalize them.

2239
01:47:16,452 --> 01:47:19,410
I've done everything in lowercase,
but that tends to be one convention.

2240
01:47:19,410 --> 01:47:23,970
Other people might use camel case,
where you alternate caps appropriately.

2241
01:47:23,970 --> 01:47:26,737
All right, so let's take a
higher level look at this,

2242
01:47:26,737 --> 01:47:29,820
and also consider some of the actual
real world problems that, tragically,

2243
01:47:29,820 --> 01:47:32,200
are still with us in some form today.

2244
01:47:32,200 --> 01:47:34,980
Notice that some of the
queries we executed a bit ago

2245
01:47:34,980 --> 01:47:38,440
were actually relatively slow whereas
I hit enter and got a lot of my results

2246
01:47:38,440 --> 01:47:38,940
like that.

2247
01:47:38,940 --> 01:47:41,130
Those last two queries,
where I was joining

2248
01:47:41,130 --> 01:47:43,500
all of those tables looking
for Steve Carell's shows

2249
01:47:43,500 --> 01:47:45,138
were actually relatively slow.

2250
01:47:45,138 --> 01:47:46,680
And let's try to take a simpler case.

2251
01:47:46,680 --> 01:47:47,310
Let me do this.

2252
01:47:47,310 --> 01:47:51,480
In SQLite you can actually time your
queries by running .timer and then

2253
01:47:51,480 --> 01:47:52,330
turning it on.

2254
01:47:52,330 --> 01:47:55,620
This is just going to keep track now
of how many seconds or milliseconds any

2255
01:47:55,620 --> 01:47:57,787
of your queries take if
you're curious to figure out

2256
01:47:57,787 --> 01:47:58,930
what's faster, what's slow.

2257
01:47:58,930 --> 01:48:00,930
Let me do something
relatively simple like this,

2258
01:48:00,930 --> 01:48:04,860
select star from shows where
the title of the show equals,

2259
01:48:04,860 --> 01:48:07,198
quote unquote, "The Office," semicolon.

2260
01:48:07,198 --> 01:48:08,740
All right, that was pretty darn fast.

2261
01:48:08,740 --> 01:48:13,240
And it took 0.044 seconds in reality.

2262
01:48:13,240 --> 01:48:16,210
If you care further, you can break
this time down into user time,

2263
01:48:16,210 --> 01:48:18,708
like spent in my account,
versus system time, which means

2264
01:48:18,708 --> 01:48:20,500
spent in the operating
system, essentially.

2265
01:48:20,500 --> 01:48:28,120
But we'll focus on the real
wall-clock time of 0.044 seconds.

2266
01:48:28,120 --> 01:48:29,290
Pretty darn fast.

2267
01:48:29,290 --> 01:48:32,770
But we can actually do better because
it turns out in the world of SQL,

2268
01:48:32,770 --> 01:48:37,690
you can create what are called indexes,
which is a fancy way of saying a data

2269
01:48:37,690 --> 01:48:41,170
structure that makes it
faster to perform queries,

2270
01:48:41,170 --> 01:48:43,660
like selects, and even
other queries as well.

2271
01:48:43,660 --> 01:48:46,180
In an index, you can
use syntax like this,

2272
01:48:46,180 --> 01:48:49,420
create index, the name of the
index, on a specific table

2273
01:48:49,420 --> 01:48:50,710
on the specific columns.

2274
01:48:50,710 --> 01:48:54,340
And what I mean by this is if you know
that your application, like imdb.com

2275
01:48:54,340 --> 01:48:58,480
or their mobile app, is going to
search on certain columns frequently,

2276
01:48:58,480 --> 01:49:01,240
you can prepare the database
in advance to build up

2277
01:49:01,240 --> 01:49:04,450
some fancy data structures in
memory so it can get back answers

2278
01:49:04,450 --> 01:49:06,080
even faster than that.

2279
01:49:06,080 --> 01:49:08,200
So case in point, let
me go back to VS Code

2280
01:49:08,200 --> 01:49:12,700
here, and let me create an index
called, whatever, title index,

2281
01:49:12,700 --> 01:49:17,180
for instance on the shows table,
specifically on the title column.

2282
01:49:17,180 --> 01:49:20,830
So that's simply the syntax for
telling the database in advance,

2283
01:49:20,830 --> 01:49:23,350
index this column because I'm
going to do a lot of searching

2284
01:49:23,350 --> 01:49:25,210
on it so I want the queries to be fast.

2285
01:49:25,210 --> 01:49:26,020
Enter.

2286
01:49:26,020 --> 01:49:26,920
It took a moment.

2287
01:49:26,920 --> 01:49:30,010
It took almost half a second, but
the index I only have to build once.

2288
01:49:30,010 --> 01:49:32,320
Even though, technically, you have
to maintain it over time if you're

2289
01:49:32,320 --> 01:49:34,210
doing updates, deletes, and inserts.

2290
01:49:34,210 --> 01:49:36,610
But now let me do the
same query as before,

2291
01:49:36,610 --> 01:49:41,530
select star from shows where title
equals, quote unquote, "The Office."

2292
01:49:41,530 --> 01:49:45,580
Previously, that query
took 0.044 seconds.

2293
01:49:45,580 --> 01:49:47,080
When I hit enter now?

2294
01:49:47,080 --> 01:49:47,890
Boom.

2295
01:49:47,890 --> 01:49:50,770
I mean, it takes no time
at all, or less time

2296
01:49:50,770 --> 01:49:53,998
than it's even keeping track of
in terms of significant digits.

2297
01:49:53,998 --> 01:49:57,040
Now, that might not seem like a big
deal to us humans and our human eyes,

2298
01:49:57,040 --> 01:49:59,620
but if you've got hundreds,
thousands, millions of users,

2299
01:49:59,620 --> 01:50:03,580
as maybe the real imdb.com
has, you just saved yourself

2300
01:50:03,580 --> 01:50:05,900
a fortune in servers and complexity.

2301
01:50:05,900 --> 01:50:06,400
Why?

2302
01:50:06,400 --> 01:50:08,320
Because the same server
can, clearly, now

2303
01:50:08,320 --> 01:50:12,400
handle way more people per unit of
time, per second, because each query

2304
01:50:12,400 --> 01:50:14,027
takes less and less time.

2305
01:50:14,027 --> 01:50:16,360
I mean, we're all too familiar
here and at Yale, surely,

2306
01:50:16,360 --> 01:50:19,087
with certain University applications
that are just so darn slow.

2307
01:50:19,087 --> 01:50:20,920
When you click a button
and the stupid thing

2308
01:50:20,920 --> 01:50:23,740
spins and makes you wait and
wait, a lot of the time that

2309
01:50:23,740 --> 01:50:27,820
can be explained by poor database
design, or databases that might not

2310
01:50:27,820 --> 01:50:29,200
have been indexed properly.

2311
01:50:29,200 --> 01:50:31,492
So when you're searching for
some course, for instance,

2312
01:50:31,492 --> 01:50:33,310
in the course catalog,
it's taking forever

2313
01:50:33,310 --> 01:50:35,170
because underneath the
hood it's essentially

2314
01:50:35,170 --> 01:50:37,730
doing linear search over everything.

2315
01:50:37,730 --> 01:50:40,030
But, by contrast, in
a relational database,

2316
01:50:40,030 --> 01:50:42,670
when you create an index
in advance because you

2317
01:50:42,670 --> 01:50:45,640
have a hunch that maybe users are
going to search on that column,

2318
01:50:45,640 --> 01:50:47,560
like show titles.

2319
01:50:47,560 --> 01:50:49,270
Essentially, you're
building up in memory

2320
01:50:49,270 --> 01:50:51,700
what's called a b-tree,
which is not a binary tree.

2321
01:50:51,700 --> 01:50:54,370
It's still a tree, though, if
you think back to week five.

2322
01:50:54,370 --> 01:50:57,820
But it's a very short,
fat tree instead, where

2323
01:50:57,820 --> 01:51:02,650
every node might have two, or three,
or 30 children, which essentially

2324
01:51:02,650 --> 01:51:04,820
pulls the height of the tree way up.

2325
01:51:04,820 --> 01:51:07,990
Which is to say that when you
search for some value in a b-tree,

2326
01:51:07,990 --> 01:51:10,040
it's invariably going
to be in the leaves.

2327
01:51:10,040 --> 01:51:13,870
So the shorter the tree is, the fewer
steps it takes to find the value

2328
01:51:13,870 --> 01:51:14,750
you care about.

2329
01:51:14,750 --> 01:51:17,860
So when you run create table,
that kind of data structure

2330
01:51:17,860 --> 01:51:21,680
is being magically created
for you by the database

2331
01:51:21,680 --> 01:51:25,540
so it's not a simple linear
search through the entire column,

2332
01:51:25,540 --> 01:51:27,350
top to bottom.

2333
01:51:27,350 --> 01:51:31,330
So with that said, we can see this
really with more complicated queries.

2334
01:51:31,330 --> 01:51:33,260
And let me go back to VS Code here.

2335
01:51:33,260 --> 01:51:36,853
Let me propose to run that
same, slow query before,

2336
01:51:36,853 --> 01:51:39,520
even though it's fine if you're
not comfortable with the syntax.

2337
01:51:39,520 --> 01:51:41,710
It was relatively
slow, though, to watch.

2338
01:51:41,710 --> 01:51:46,360
So let's do select title
from shows stars and people

2339
01:51:46,360 --> 01:51:56,500
where shows.id equals stars.show ID, and
people.id equals stars.dot person ID,

2340
01:51:56,500 --> 01:51:59,050
and name equals Steve Carell.

2341
01:51:59,050 --> 01:52:00,820
So this was the last of those queries.

2342
01:52:00,820 --> 01:52:03,280
That just searches for all
of Steve Carell's TV shows

2343
01:52:03,280 --> 01:52:06,430
without using joins explicitly
but by just enumerating

2344
01:52:06,430 --> 01:52:10,210
all three tables in question, and
then using where to cleverly connect

2345
01:52:10,210 --> 01:52:11,390
the dots, so to speak.

2346
01:52:11,390 --> 01:52:12,770
But my timer is still on.

2347
01:52:12,770 --> 01:52:16,390
So notice now, when I hit enter,
it doesn't just feel slow.

2348
01:52:16,390 --> 01:52:19,390
It actually took 2.763 seconds.

2349
01:52:19,390 --> 01:52:20,410
Like, that's slow.

2350
01:52:20,410 --> 01:52:21,250
That's expensive.

2351
01:52:21,250 --> 01:52:22,650
That's going to annoy your users.

2352
01:52:22,650 --> 01:52:25,900
That's going to annoy your students if
the database is thinking, and thinking,

2353
01:52:25,900 --> 01:52:27,730
and thinking, and taking that much time.

2354
01:52:27,730 --> 01:52:29,770
But let's note this.

2355
01:52:29,770 --> 01:52:34,080
That same query I just executed
touched a bunch of columns,

2356
01:52:34,080 --> 01:52:35,830
and it turns out that
whenever you declare

2357
01:52:35,830 --> 01:52:40,330
a primary key in a database, at least
in SQLite, you get an index for free.

2358
01:52:40,330 --> 01:52:43,060
Primary keys are automatically
indexed, so searching

2359
01:52:43,060 --> 01:52:44,320
for a number in that column?

2360
01:52:44,320 --> 01:52:45,040
Super fast.

2361
01:52:45,040 --> 01:52:48,580
Not linear search, it's something
logarithmic, most likely, or, ideally,

2362
01:52:48,580 --> 01:52:51,100
closer to something
like constant time even.

2363
01:52:51,100 --> 01:52:54,940
Here, though, I'm touching
not just shows.id,

2364
01:52:54,940 --> 01:52:58,840
but I'm also filtering on
stars.show ID, so a foreign key.

2365
01:52:58,840 --> 01:53:01,000
Foreign keys are not indexed by default.

2366
01:53:01,000 --> 01:53:02,260
I'm looking at people ID.

2367
01:53:02,260 --> 01:53:03,190
That's a primary key.

2368
01:53:03,190 --> 01:53:03,880
That's indexed.

2369
01:53:03,880 --> 01:53:08,200
But stars.person ID, not indexed
by default, is a foreign key.

2370
01:53:08,200 --> 01:53:11,890
Lastly, I'm filtering by
name in the people table.

2371
01:53:11,890 --> 01:53:17,140
Names are not indexed by default. So
I'm touching three separate columns,

2372
01:53:17,140 --> 01:53:20,980
two foreign keys, one name
field, that have no fancy tree

2373
01:53:20,980 --> 01:53:22,570
structure built for them.

2374
01:53:22,570 --> 01:53:23,900
But I can do that.

2375
01:53:23,900 --> 01:53:27,190
Let me go down to my terminal here.

2376
01:53:27,190 --> 01:53:30,550
Let me create one index called,
say, person index, though, I

2377
01:53:30,550 --> 01:53:35,750
could call it anything I want, on the
stars table on the person ID column.

2378
01:53:35,750 --> 01:53:38,140
So that indexes that foreign key.

2379
01:53:38,140 --> 01:53:41,050
Took 1.7 seconds, but I
only have to do it once.

2380
01:53:41,050 --> 01:53:46,720
Create index, show index
on the stars table show ID.

2381
01:53:46,720 --> 01:53:49,210
So another foreign key
is getting its own index.

2382
01:53:49,210 --> 01:53:51,490
Took 1.4 seconds, but
it's a one time thing.

2383
01:53:51,490 --> 01:53:54,070
And, lastly, let's index
all of those actors names.

2384
01:53:54,070 --> 01:53:59,770
Create index called name index on
the people table on the name column.

2385
01:53:59,770 --> 01:54:00,460
Enter.

2386
01:54:00,460 --> 01:54:04,520
That took 1.0 seconds,
but a one time operation.

2387
01:54:04,520 --> 01:54:07,850
So, essentially, I've built up
like three of these trees in memory

2388
01:54:07,850 --> 01:54:09,920
now, specifically for these columns.

2389
01:54:09,920 --> 01:54:15,160
So now recall, previously,
that slow query.

2390
01:54:15,160 --> 01:54:18,610
If I scroll back up,
that took, what was it?

2391
01:54:18,610 --> 01:54:20,570
2.7 seconds?

2392
01:54:20,570 --> 01:54:21,070
I think.

2393
01:54:21,070 --> 01:54:23,330
2.7 seconds, previously.

2394
01:54:23,330 --> 01:54:29,080
But if I now run the same
thing, select title from show

2395
01:54:29,080 --> 01:54:33,850
stars people where shows ID
equals stars.show show ID,

2396
01:54:33,850 --> 01:54:37,240
and people ID equals
stars.person person ID,

2397
01:54:37,240 --> 01:54:42,730
and name equals Steve, so close, Carell.

2398
01:54:42,730 --> 01:54:47,320
Same query as before previously took
2.7 seconds, which was the most annoying

2399
01:54:47,320 --> 01:54:47,890
of them yet.

2400
01:54:47,890 --> 01:54:49,420
Now, when I hit enter?

2401
01:54:49,420 --> 01:54:50,230
Boom.

2402
01:54:50,230 --> 01:54:55,240
0.001 seconds, which is the
difference, again, between bunches

2403
01:54:55,240 --> 01:55:00,970
of linear searches and, in this case,
searching a fancier, week-five style,

2404
01:55:00,970 --> 01:55:02,650
b-tree, in this case.

2405
01:55:02,650 --> 01:55:04,010
So indexes matter.

2406
01:55:04,010 --> 01:55:07,120
So what's then, maybe,
the trade off here?

2407
01:55:07,120 --> 01:55:09,680
Like, why not index every
column in every table?

2408
01:55:09,680 --> 01:55:10,930
Because this is feeling great.

2409
01:55:10,930 --> 01:55:14,110
Like, we're speeding things up
by factors of 1,000, practically.

2410
01:55:14,110 --> 01:55:15,190
What's the trade off?

2411
01:55:15,190 --> 01:55:16,240
[INDISTINCT SPEECH]

2412
01:55:16,240 --> 01:55:18,010
Lots and lots of memory, or space.

2413
01:55:18,010 --> 01:55:20,200
Yeah, so you're just
trading off space for time,

2414
01:55:20,200 --> 01:55:23,200
which we said a couple of weeks ago
is an acceptable trade off depending

2415
01:55:23,200 --> 01:55:24,610
on what resources you have.

2416
01:55:24,610 --> 01:55:28,160
But it's probably an
overcorrection to index everything,

2417
01:55:28,160 --> 01:55:31,445
especially since it will slightly slow
down inserts, updates, and deletes

2418
01:55:31,445 --> 01:55:34,570
because you have to maintain this tree
structure so it doesn't devolve back

2419
01:55:34,570 --> 01:55:37,600
into a linked list or something linear.

2420
01:55:37,600 --> 01:55:43,900
But, in fact, being selective about
it is perhaps the best strategy.

2421
01:55:43,900 --> 01:55:44,440
All right.

2422
01:55:44,440 --> 01:55:48,250
So that we can now solve some
other problems more generally,

2423
01:55:48,250 --> 01:55:49,780
let me just connect two dots.

2424
01:55:49,780 --> 01:55:53,740
Even though we focused today on
SQL, specifically for databases,

2425
01:55:53,740 --> 01:55:56,590
you can actually combine
one language with another

2426
01:55:56,590 --> 01:55:58,010
and solve different problems.

2427
01:55:58,010 --> 01:55:59,420
And so, in fact, let me do this.

2428
01:55:59,420 --> 01:56:02,530
Let me revisit our
favorites.py from earlier,

2429
01:56:02,530 --> 01:56:07,610
but let me actually now use the
favorites database as follows.

2430
01:56:07,610 --> 01:56:09,980
Let me go into VS Code here.

2431
01:56:09,980 --> 01:56:16,810
Let me remove favorites.db because if
you recall it made everyone's count

2432
01:56:16,810 --> 01:56:18,760
previously--

2433
01:56:18,760 --> 01:56:19,510
became Fiftyville.

2434
01:56:19,510 --> 01:56:21,640
So let me remove that file.

2435
01:56:21,640 --> 01:56:26,140
Let me run SQLite3 on
favorites.db again.

2436
01:56:26,140 --> 01:56:27,940
Let me create the file anew.

2437
01:56:27,940 --> 01:56:30,490
Let me set the mode to CSV again.

2438
01:56:30,490 --> 01:56:34,630
Let me import that file called
favorites.csv into an identical table

2439
01:56:34,630 --> 01:56:36,160
as before called favorites.

2440
01:56:36,160 --> 01:56:37,150
And then quit.

2441
01:56:37,150 --> 01:56:40,970
So I've just reset things to my
backup, if you will, from the CSV file.

2442
01:56:40,970 --> 01:56:43,300
So, I again have a favorites.db.

2443
01:56:43,300 --> 01:56:48,033
Let me now minimize my terminal
window here and reopen favorites.py,

2444
01:56:48,033 --> 01:56:50,200
and let me just go ahead
and get rid of that version

2445
01:56:50,200 --> 01:56:53,830
entirely and focus this time
on not talking to a CSV file,

2446
01:56:53,830 --> 01:56:55,780
opening it, iterating over the rows.

2447
01:56:55,780 --> 01:56:58,930
We can actually use Python
to execute SQL queries

2448
01:56:58,930 --> 01:57:00,950
and kind of get the best of both worlds.

2449
01:57:00,950 --> 01:57:03,160
So let me do this here.

2450
01:57:03,160 --> 01:57:09,580
Let me, from the CS50 Python library,
import our own SQL functionality.

2451
01:57:09,580 --> 01:57:12,052
And this is a training wheel
we still provide for SQL

2452
01:57:12,052 --> 01:57:14,260
because it's just much easier
than using the industry

2453
01:57:14,260 --> 01:57:15,760
standard libraries for SQL.

2454
01:57:15,760 --> 01:57:18,540
It just is painful for simple tasks.

2455
01:57:18,540 --> 01:57:21,380
So now, let me create a variable
called db, for database.

2456
01:57:21,380 --> 01:57:24,410
I'm going to set it equal to this
SQL function that CS50 wrote.

2457
01:57:24,410 --> 01:57:28,730
And this is going to look weird, but
the way you open a DB file in Python,

2458
01:57:28,730 --> 01:57:31,580
whether it's with CS50
library or someone else's, you

2459
01:57:31,580 --> 01:57:36,560
say SQLite colon slash,
slash, slash favorites.db.

2460
01:57:36,560 --> 01:57:38,780
So weird syntax, but it's commonplace.

2461
01:57:38,780 --> 01:57:41,900
Three slashes, not two
like a URL is usually.

2462
01:57:41,900 --> 01:57:44,300
Now let's use a variable
called favorite and set

2463
01:57:44,300 --> 01:57:48,020
it equal to the return value of input by
asking the human for their favorite TV

2464
01:57:48,020 --> 01:57:48,650
show.

2465
01:57:48,650 --> 01:57:51,860
And now, previously we
opened up the CSV file,

2466
01:57:51,860 --> 01:57:56,270
iterated over it looking for the show
they typed in, show they typed in.

2467
01:57:56,270 --> 01:57:59,060
Or, rather-- Oh, sorry, not the show.

2468
01:57:59,060 --> 01:58:01,610
The problem, or the
problem that we typed in.

2469
01:58:01,610 --> 01:58:03,150
Let me instead do this.

2470
01:58:03,150 --> 01:58:07,290
Let me set, use a db
variables, execute function,

2471
01:58:07,290 --> 01:58:11,360
which comes with the CS50 library,
and let me execute this SQL query.

2472
01:58:11,360 --> 01:58:20,250
Select count star as n from favorites
where problem equals question mark.

2473
01:58:20,250 --> 01:58:23,000
And the question mark is a little
weird, but think of it, for now,

2474
01:58:23,000 --> 01:58:24,902
like C's percent s.

2475
01:58:24,902 --> 01:58:26,750
Comma favorite.

2476
01:58:26,750 --> 01:58:29,840
I'm going to plug in whatever the
human typed in into that query where

2477
01:58:29,840 --> 01:58:30,920
the question mark is.

2478
01:58:30,920 --> 01:58:34,340
So no percent s, I'm using a
question mark in this world of SQL.

2479
01:58:34,340 --> 01:58:36,883
This is going to give me
back a temporary table,

2480
01:58:36,883 --> 01:58:39,800
and I'm going to store that temporary
table in a variable called rows.

2481
01:58:39,800 --> 01:58:42,840
Because the temporary table is
essentially zero or more rows,

2482
01:58:42,840 --> 01:58:44,660
so I'm going to name my variable rows.

2483
01:58:44,660 --> 01:58:48,170
And then if I want to get
back the first and only row,

2484
01:58:48,170 --> 01:58:51,860
I can literally do row equals rows
bracket zero just to be pedantic,

2485
01:58:51,860 --> 01:58:55,970
just to get, literally, the first row
in that temporary table, or result set.

2486
01:58:55,970 --> 01:58:59,810
And now if I want to print
back the n column therein,

2487
01:58:59,810 --> 01:59:03,590
I can do print row, quote unquote, "n."

2488
01:59:03,590 --> 01:59:06,420
So let me take a step back and do this.

2489
01:59:06,420 --> 01:59:12,710
Let me go into SQLite3 of favorites.db
and let me literally type something

2490
01:59:12,710 --> 01:59:13,580
like this.

2491
01:59:13,580 --> 01:59:16,640
Here I'm in SQLite3 at the bottom
and the problem I'm searching,

2492
01:59:16,640 --> 01:59:19,520
for instance, is Scratch semicolon.

2493
01:59:19,520 --> 01:59:23,480
Notice that is how in SQL, in
the command line of SQLite3,

2494
01:59:23,480 --> 01:59:24,960
I can get back the answer I want.

2495
01:59:24,960 --> 01:59:26,960
But what if I want to
make a Python program that

2496
01:59:26,960 --> 01:59:29,480
queries the database for that value
and then prints something out?

2497
01:59:29,480 --> 01:59:31,790
In two weeks time, what if
I want to do that to make

2498
01:59:31,790 --> 01:59:35,300
a web application, like
imdb.com or a mobile app,

2499
01:59:35,300 --> 01:59:37,490
that writes code to
generate a user interface

2500
01:59:37,490 --> 01:59:39,320
but that pulls the data from a database?

2501
01:59:39,320 --> 01:59:42,170
You're not going to have your
human users using SQLite3,

2502
01:59:42,170 --> 01:59:44,280
you're going to generate
the output for them.

2503
01:59:44,280 --> 01:59:46,790
So let me close my terminal window here.

2504
01:59:46,790 --> 01:59:49,190
Rather, let me close out of SQLite 3.

2505
01:59:49,190 --> 01:59:53,840
Let me now run Python
of favorites.py, enter.

2506
01:59:53,840 --> 01:59:56,450
I'm prompted for my favorite
using the input function.

2507
01:59:56,450 --> 02:00:00,290
I type in Scratch and hit
enter, and there's my 34.

2508
02:00:00,290 --> 02:00:02,650
So this is an incredibly
common practice,

2509
02:00:02,650 --> 02:00:04,400
to use one language
for what it's best at.

2510
02:00:04,400 --> 02:00:07,430
Like SQL is best at reading
data from databases,

2511
02:00:07,430 --> 02:00:11,060
Python is, maybe, best in this
case for creating a user interface

2512
02:00:11,060 --> 02:00:13,040
or, eventually, making
a web application.

2513
02:00:13,040 --> 02:00:16,560
But it's certainly fine to
use one inside of the other.

2514
02:00:16,560 --> 02:00:19,740
The documentation for this library,
if and when you get curious,

2515
02:00:19,740 --> 02:00:23,520
is it this URL here, along with a
lot of CS50's own documentation.

2516
02:00:23,520 --> 02:00:26,340
But there are some
problems, nonetheless, we

2517
02:00:26,340 --> 02:00:29,760
might encounter in this world, and we
thought we'd end on these challenges.

2518
02:00:29,760 --> 02:00:33,420
It turns out that in the world of
SQL, even though we haven't touched

2519
02:00:33,420 --> 02:00:36,638
upon this yet, you're generally
working not with like hundreds

2520
02:00:36,638 --> 02:00:39,180
of people and their favorite
languages and problems, not even

2521
02:00:39,180 --> 02:00:42,600
thousands of movies, but like
millions of things in the database,

2522
02:00:42,600 --> 02:00:45,180
like Instagram posts or
TikTok videos or the like.

2523
02:00:45,180 --> 02:00:48,510
Those are huge databases
with millions of rows.

2524
02:00:48,510 --> 02:00:51,510
The problem with SQL and,
really, databases in general,

2525
02:00:51,510 --> 02:00:54,030
is if you have thousands,
millions of things

2526
02:00:54,030 --> 02:00:57,765
happening at once, things can get out
of order and your math can be wrong.

2527
02:00:57,765 --> 02:00:59,890
You can lose track of how
many likes something has.

2528
02:00:59,890 --> 02:01:03,870
And so, for instance, as of last night
this remains the most popular Instagram

2529
02:01:03,870 --> 02:01:06,000
post, to date, in the world.

2530
02:01:06,000 --> 02:01:08,368
It was clicked on by so many people.

2531
02:01:08,368 --> 02:01:10,410
And you might think that's
pretty straightforward

2532
02:01:10,410 --> 02:01:13,230
to keep track click,
click, click, but not

2533
02:01:13,230 --> 02:01:16,870
when there's millions of devices in
the world and thousands of servers,

2534
02:01:16,870 --> 02:01:20,000
probably, at Meta running
Instagram's back end.

2535
02:01:20,000 --> 02:01:23,290
So how do you actually keep
track of all of these likes?

2536
02:01:23,290 --> 02:01:26,980
Well, maybe, Meta is
using code like this

2537
02:01:26,980 --> 02:01:30,370
to implement the counter
for likes on Instagram.

2538
02:01:30,370 --> 02:01:33,160
Maybe they are using lines of
code similar to what we just

2539
02:01:33,160 --> 02:01:37,510
wrote db.execute select
the current number of likes

2540
02:01:37,510 --> 02:01:40,750
from the posts table
where the ID of the post

2541
02:01:40,750 --> 02:01:43,760
equals whatever the one is
that the user clicked on.

2542
02:01:43,760 --> 02:01:48,160
So the post a moment ago, presumably,
has a unique ID, a primary key,

2543
02:01:48,160 --> 02:01:51,160
and that number just gets plugged in
here when Meta wants to figure out,

2544
02:01:51,160 --> 02:01:53,140
all right, someone just
clicked on this post.

2545
02:01:53,140 --> 02:01:56,320
Let's figure out what the current
number of likes so we can add 1 to it

2546
02:01:56,320 --> 02:01:57,890
and update the database.

2547
02:01:57,890 --> 02:02:00,910
So this query here gives
us a temporary table

2548
02:02:00,910 --> 02:02:04,360
containing the current number of likes
before you or someone else clicked.

2549
02:02:04,360 --> 02:02:06,820
Maybe then we just
declare a variable called

2550
02:02:06,820 --> 02:02:09,772
likes to get at the
first row's likes column.

2551
02:02:09,772 --> 02:02:11,980
So this is just Python
syntax, similar to what I did,

2552
02:02:11,980 --> 02:02:15,640
just to get the actual number you care
about, like 34 or whatever million

2553
02:02:15,640 --> 02:02:16,480
it is here.

2554
02:02:16,480 --> 02:02:20,650
But then suppose there's a second
database query and a third line of code

2555
02:02:20,650 --> 02:02:26,200
that updates the posts table setting
the likes equal to this value

2556
02:02:26,200 --> 02:02:29,030
where the ID of the post is this value.

2557
02:02:29,030 --> 02:02:32,870
So these question marks are similar,
again, in spirit to printf's percent s.

2558
02:02:32,870 --> 02:02:34,810
They're placeholders for
things that are going

2559
02:02:34,810 --> 02:02:37,070
to be plugged in after the commas.

2560
02:02:37,070 --> 02:02:38,860
So if I want to update
the number of likes

2561
02:02:38,860 --> 02:02:41,080
to be whatever the current
number is plus one,

2562
02:02:41,080 --> 02:02:45,100
I put it there as a second argument
and then I plug-in the ID of that post.

2563
02:02:45,100 --> 02:02:49,420
The problem, though, with large
systems like the Metas, the Googles,

2564
02:02:49,420 --> 02:02:51,640
the Microsofts, and
others of the world, is

2565
02:02:51,640 --> 02:02:54,880
that they are executing code
like this on multiple servers,

2566
02:02:54,880 --> 02:02:58,930
thousands of servers, that might
be executing slightly out of order.

2567
02:02:58,930 --> 02:03:01,070
One might be faster,
one might be slower.

2568
02:03:01,070 --> 02:03:04,990
Which is to say, even though
these three lines of code

2569
02:03:04,990 --> 02:03:07,240
represent what should happen
when I click on that post

2570
02:03:07,240 --> 02:03:09,670
and you click on that post
and you click on that post,

2571
02:03:09,670 --> 02:03:13,600
the lines of code chronologically
might get shuffled a little bit.

2572
02:03:13,600 --> 02:03:16,090
Like, maybe, this line of
code gets executed for me

2573
02:03:16,090 --> 02:03:18,100
and then it gets executed
for you and then you,

2574
02:03:18,100 --> 02:03:20,900
and then the server moves
on to the next line of code.

2575
02:03:20,900 --> 02:03:23,890
So it's sort of multitasking,
handling lots of users at once.

2576
02:03:23,890 --> 02:03:27,940
The problem with this is that you
run into a race condition of sorts,

2577
02:03:27,940 --> 02:03:30,820
where the servers are sort
of racing to handle one user

2578
02:03:30,820 --> 02:03:34,580
but other users requests are
happening at the same time.

2579
02:03:34,580 --> 02:03:39,220
So the analogy that I was taught
years ago in an operating systems

2580
02:03:39,220 --> 02:03:41,860
class actually pertains to
something like a refrigerator here.

2581
02:03:41,860 --> 02:03:43,360
So we have a mini refrigerator here.

2582
02:03:43,360 --> 02:03:45,850
Suppose you've got one in
your dorm or your house room,

2583
02:03:45,850 --> 02:03:48,340
and you come home one day.

2584
02:03:48,340 --> 02:03:51,860
And you really like milk, so you
open the fridge and you look inside,

2585
02:03:51,860 --> 02:03:52,480
and there's--

2586
02:03:52,480 --> 02:03:53,590
Oh, we're out of milk.

2587
02:03:53,590 --> 02:03:57,160
So you close the fridge, you walk
out to CVS, or somewhere else,

2588
02:03:57,160 --> 02:03:58,600
and go to get more milk.

2589
02:03:58,600 --> 02:04:01,240
Meanwhile, though, your roommate
comes home, who in this story

2590
02:04:01,240 --> 02:04:03,850
also likes milk, and
so they decide that--

2591
02:04:03,850 --> 02:04:05,620
Oh, I'm out of milk in the fridge.

2592
02:04:05,620 --> 02:04:07,120
So they, maybe, head out.

2593
02:04:07,120 --> 02:04:10,660
Maybe they follow a different
path to go get more milk as well.

2594
02:04:10,660 --> 02:04:13,330
Some number of minutes pass,
you both come home later on.

2595
02:04:13,330 --> 02:04:14,080
Like, Oh, darn it.

2596
02:04:14,080 --> 02:04:15,830
Now we have twice as
much milk as we need.

2597
02:04:15,830 --> 02:04:17,247
We don't really like it that much.

2598
02:04:17,247 --> 02:04:19,720
And some of it's going to go
sour now, so it's wasted like.

2599
02:04:19,720 --> 02:04:20,720
We've made a mistake.

2600
02:04:20,720 --> 02:04:22,780
We should not have bought
twice as much milk.

2601
02:04:22,780 --> 02:04:27,220
Now, stupid story, but the
point is that both of you

2602
02:04:27,220 --> 02:04:30,250
made decisions based on
the state of a variable.

2603
02:04:30,250 --> 02:04:33,790
But the problem was that variable
was in the process of being

2604
02:04:33,790 --> 02:04:36,430
updated when someone else looked at it.

2605
02:04:36,430 --> 02:04:39,470
The first person in the story
was on their way to the store

2606
02:04:39,470 --> 02:04:41,560
so the variable was
about to be incremented

2607
02:04:41,560 --> 02:04:43,840
in terms of quantity of
milk, but the other person

2608
02:04:43,840 --> 02:04:46,430
didn't know that yet so they
too tried to increment it.

2609
02:04:46,430 --> 02:04:48,760
And, in that case, we ended
up with too much milk.

2610
02:04:48,760 --> 02:04:52,450
But suppose what might happen
here is similar in spirit.

2611
02:04:52,450 --> 02:04:56,560
Suppose that post, at some point
in time, had just 1,000,000 likes.

2612
02:04:56,560 --> 02:05:00,310
And suppose this line of code got
executed for me, for you, and for you

2613
02:05:00,310 --> 02:05:02,050
after all three of us clicked on it.

2614
02:05:02,050 --> 02:05:05,530
Well, the value of our likes
variable in Meta servers

2615
02:05:05,530 --> 02:05:08,020
might be 1,000,000,
1,000,000, and 1,000,000.

2616
02:05:08,020 --> 02:05:11,470
They therefore update 1,000,000
to be 1,000,000 plus one.

2617
02:05:11,470 --> 02:05:14,890
And so what they update the
database to be is 1,000,001,

2618
02:05:14,890 --> 02:05:16,080
but they do it three times.

2619
02:05:16,080 --> 02:05:18,690
1,000,001, 1,000,001, 1,000,001.

2620
02:05:18,690 --> 02:05:21,900
But they've lost two of
those likes because they

2621
02:05:21,900 --> 02:05:26,040
might have inspected the variable while
some other server, some other user's

2622
02:05:26,040 --> 02:05:28,510
like, was being processed.

2623
02:05:28,510 --> 02:05:31,350
So long story short, when
you have lots of data

2624
02:05:31,350 --> 02:05:34,365
on lots of servers, all of
which is happening very quickly,

2625
02:05:34,365 --> 02:05:36,240
you run into these so
called race conditions.

2626
02:05:36,240 --> 02:05:39,750
And code like this can be
dangerous, even though it might not

2627
02:05:39,750 --> 02:05:41,460
look incorrect at a glance.

2628
02:05:41,460 --> 02:05:44,280
Thankfully, in the world of
SQL, though you won't generally

2629
02:05:44,280 --> 02:05:47,700
have to do this certainly for problem
sets sake, there are solutions to this.

2630
02:05:47,700 --> 02:05:51,210
But too many engineers in the world
don't know this, don't remember this,

2631
02:05:51,210 --> 02:05:53,580
or don't appreciate this reality.

2632
02:05:53,580 --> 02:05:55,920
There are keywords in
certain databases that

2633
02:05:55,920 --> 02:05:59,610
let you instead begin a transaction,
which means, essentially,

2634
02:05:59,610 --> 02:06:02,970
that these three lines
of code should either all

2635
02:06:02,970 --> 02:06:05,370
happen together or not at all.

2636
02:06:05,370 --> 02:06:06,660
They should be atomic.

2637
02:06:06,660 --> 02:06:09,390
That is to say, they should
all happen without interruption

2638
02:06:09,390 --> 02:06:11,340
or they just shouldn't happen at all.

2639
02:06:11,340 --> 02:06:16,000
And that ensures that the
math does not go wrong

2640
02:06:16,000 --> 02:06:18,820
because my like will get
counted, and then you're like,

2641
02:06:18,820 --> 02:06:22,390
and then you're like, as opposed
to them being intermingled and lost

2642
02:06:22,390 --> 02:06:23,630
track of accordingly.

2643
02:06:23,630 --> 02:06:26,590
So in Python, using
the CS50 library, you

2644
02:06:26,590 --> 02:06:30,280
could wrap these three lines of
code by saying begin transaction

2645
02:06:30,280 --> 02:06:33,580
to the database, commit the
transaction to the database,

2646
02:06:33,580 --> 02:06:37,867
and that relatively simple solution
avoids this problem of race conditions.

2647
02:06:37,867 --> 02:06:38,950
But this, too, is a topic.

2648
02:06:38,950 --> 02:06:41,020
If you Google, invariably
you'll see that this

2649
02:06:41,020 --> 02:06:46,000
is a problem that has hit various
servers and apps over time.

2650
02:06:46,000 --> 02:06:49,660
But there's one other problem that
the world is still not very good at,

2651
02:06:49,660 --> 02:06:52,060
and that's known as a
SQL injection attack.

2652
02:06:52,060 --> 02:06:55,690
And it turns out that even with what
we've been doing, even code like this

2653
02:06:55,690 --> 02:07:00,160
here, is all too easily vulnerable
to being hacked, if you will.

2654
02:07:00,160 --> 02:07:04,670
That is misused in some way, unless
you practice what I'm preaching,

2655
02:07:04,670 --> 02:07:07,070
which is using placeholders like this.

2656
02:07:07,070 --> 02:07:10,570
It turns out that it's very dangerous
to take user input in, generally.

2657
02:07:10,570 --> 02:07:13,150
Like, most of your users
might be nice, good people,

2658
02:07:13,150 --> 02:07:16,330
but there's always going to be
someone who's malicious or curious

2659
02:07:16,330 --> 02:07:19,670
or just execute something you don't
expect, and things can go wrong.

2660
02:07:19,670 --> 02:07:21,940
So in the world of SQL,
here's what can happen.

2661
02:07:21,940 --> 02:07:24,890
For instance, here is the
login screen for Yale accounts.

2662
02:07:24,890 --> 02:07:27,220
Here's the comparable
screen for Harvard accounts.

2663
02:07:27,220 --> 02:07:29,803
And all of us are in their habit
of using one of these screens

2664
02:07:29,803 --> 02:07:31,150
or something similar or another.

2665
02:07:31,150 --> 02:07:34,450
You're often asked for a username
or email address and a password.

2666
02:07:34,450 --> 02:07:38,950
But suppose that Harvard or Yale
or Google or Microsoft or wherever,

2667
02:07:38,950 --> 02:07:42,280
are taking user input from forms
like this, be it on the web

2668
02:07:42,280 --> 02:07:46,540
or on a mobile app, and they're just
plugging your input into a SQL query

2669
02:07:46,540 --> 02:07:50,320
that they wrote in advance that just is
waiting for your username or password

2670
02:07:50,320 --> 02:07:53,980
to be plugged in so they can ask
a complete query of the database.

2671
02:07:53,980 --> 02:07:57,850
It turns out that if I'm a
bit malicious, or curious,

2672
02:07:57,850 --> 02:08:01,810
I could, maybe, type in some
funky syntax to fields like this

2673
02:08:01,810 --> 02:08:04,570
that I know have special
meaning to certain databases.

2674
02:08:04,570 --> 02:08:09,490
And it turns out in the world of SQL,
SQLite in particular, single quotes

2675
02:08:09,490 --> 02:08:10,570
are clearly important.

2676
02:08:10,570 --> 02:08:12,850
Because I've been using them
all day long for strings,

2677
02:08:12,850 --> 02:08:16,745
and I didn't mention this yet, but the
comment character in SQLite is dash,

2678
02:08:16,745 --> 02:08:17,245
dash.

2679
02:08:17,245 --> 02:08:20,203
If you want the rest of a line to be
ignored, you just say, dash, dash.

2680
02:08:20,203 --> 02:08:23,605
So it's not hash, it's not slash, slash,
like in Python and C, respectively.

2681
02:08:23,605 --> 02:08:26,210
It's dash, dash, or two hyphens.

2682
02:08:26,210 --> 02:08:28,887
So suppose that I'm a
hacker or a curious student,

2683
02:08:28,887 --> 02:08:30,970
and I want to see if Harvard
knows what it's doing

2684
02:08:30,970 --> 02:08:32,680
when it comes to SQL injection attacks.

2685
02:08:32,680 --> 02:08:36,347
I could literally type in, maybe
my username with a single quote

2686
02:08:36,347 --> 02:08:37,180
and then dash, dash.

2687
02:08:37,180 --> 02:08:38,870
Well, why would I do this?

2688
02:08:38,870 --> 02:08:43,240
Well, suppose for the sake of discussion
that some developer at Harvard,

2689
02:08:43,240 --> 02:08:46,030
or any website, really,
has written a line of code

2690
02:08:46,030 --> 02:08:49,870
like this to check if the
username and password just

2691
02:08:49,870 --> 02:08:52,760
typed in match what's in the database.

2692
02:08:52,760 --> 02:08:53,890
So how might I do this?

2693
02:08:53,890 --> 02:08:57,410
And if so, let the user log in, show
them their account, or whatever.

2694
02:08:57,410 --> 02:08:59,800
So here's the line of code in question.

2695
02:08:59,800 --> 02:09:05,050
Select star from users where username
equals question mark and password

2696
02:09:05,050 --> 02:09:05,980
equals question mark.

2697
02:09:05,980 --> 02:09:07,040
This is correct.

2698
02:09:07,040 --> 02:09:08,650
This is green because it's good.

2699
02:09:08,650 --> 02:09:10,960
It is not vulnerable to attack because

2700
02:09:10,960 --> 02:09:13,090
I'm using these
placeholders here, which,

2701
02:09:13,090 --> 02:09:15,370
even though we've implemented
in the CS50 library,

2702
02:09:15,370 --> 02:09:18,040
most SQL libraries
support the same syntax.

2703
02:09:18,040 --> 02:09:19,600
But it can vary by system.

2704
02:09:19,600 --> 02:09:20,890
This is safe.

2705
02:09:20,890 --> 02:09:23,230
What is not safe is to
use some of the stuff

2706
02:09:23,230 --> 02:09:25,750
you learned last week where
you can just use f strings,

2707
02:09:25,750 --> 02:09:29,680
or format your own strings and
interpolate values with curly braces.

2708
02:09:29,680 --> 02:09:31,930
For instance, suppose
you took me at my word

2709
02:09:31,930 --> 02:09:33,790
last week that you
can do this in Python.

2710
02:09:33,790 --> 02:09:37,430
This code here, unfortunately,
in yellow, is not safe.

2711
02:09:37,430 --> 02:09:39,190
Here's a format string in Python.

2712
02:09:39,190 --> 02:09:41,140
Here's the beginning of a SQL query.

2713
02:09:41,140 --> 02:09:43,810
Here's a placeholder
for the user username.

2714
02:09:43,810 --> 02:09:45,730
Here's a placeholder
for the user's password.

2715
02:09:45,730 --> 02:09:49,400
And I've proactively, correctly,
put single quotes around them.

2716
02:09:49,400 --> 02:09:53,410
The problem is if you just
blindly plug users input

2717
02:09:53,410 --> 02:09:58,390
into pre-made queries like this,
they can finish your thought for you,

2718
02:09:58,390 --> 02:10:01,280
in ways you don't expect,
if you trust the users.

2719
02:10:01,280 --> 02:10:03,910
For instance, if I plug
in for my username,

2720
02:10:03,910 --> 02:10:07,600
mail in at harvard.edu,
single quote, dash, dash.

2721
02:10:07,600 --> 02:10:09,200
Notice what happens.

2722
02:10:09,200 --> 02:10:11,470
Here's the single quote from the query.

2723
02:10:11,470 --> 02:10:13,100
Here's what I typed in.

2724
02:10:13,100 --> 02:10:14,150
But, wait a minute.

2725
02:10:14,150 --> 02:10:17,560
It looks like this single
quote that I typed in finishes

2726
02:10:17,560 --> 02:10:19,810
the thought that the developer started.

2727
02:10:19,810 --> 02:10:22,527
The dash, dash means, heck,
ignore the rest of that.

2728
02:10:22,527 --> 02:10:25,360
And, indeed, I've lifted grayed out
because what effectively happens

2729
02:10:25,360 --> 02:10:29,510
is that only executed is what's
in yellow here at the moment.

2730
02:10:29,510 --> 02:10:30,010
Why?

2731
02:10:30,010 --> 02:10:32,440
Because everything
after that close quote,

2732
02:10:32,440 --> 02:10:35,255
which finishes the developers
thought, is just ignored.

2733
02:10:35,255 --> 02:10:38,380
So I mean, this is literally an example
of how you can hack into a database

2734
02:10:38,380 --> 02:10:39,910
by injecting SQL.

2735
02:10:39,910 --> 02:10:42,220
Like dash, dash is an example of SQL.

2736
02:10:42,220 --> 02:10:45,280
Silly as it is, it's a comment
that tells the database

2737
02:10:45,280 --> 02:10:46,460
to ignore the password.

2738
02:10:46,460 --> 02:10:47,450
So what does this mean?

2739
02:10:47,450 --> 02:10:50,770
Well, of course you're going to get
rows back now because if you're only

2740
02:10:50,770 --> 02:10:53,950
searching for someone by username, it
doesn't matter what their password is,

2741
02:10:53,950 --> 02:10:57,460
what they typed in, you've
essentially blacked out that part

2742
02:10:57,460 --> 02:10:59,530
and you're not even
checking the password.

2743
02:10:59,530 --> 02:11:03,640
The effect then would be that you could
log in as me or Carter or anyone else,

2744
02:11:03,640 --> 02:11:08,410
just by knowing their usernames if
the Harvard developer wrote code

2745
02:11:08,410 --> 02:11:09,183
in this way.

2746
02:11:09,183 --> 02:11:12,100
And even though I'm pretty sure
Harvard key does not suffer from this,

2747
02:11:12,100 --> 02:11:14,590
so many darn websites have in the past.

2748
02:11:14,590 --> 02:11:17,733
And if you Google, SQL
injection attacks-- search by

2749
02:11:17,733 --> 02:11:20,150
and Google, for instance, the
past month or the past year,

2750
02:11:20,150 --> 02:11:22,940
you will tragically likely
see results because humans

2751
02:11:22,940 --> 02:11:24,650
continue to make this mistake.

2752
02:11:24,650 --> 02:11:27,830
The solution though,
ultimately, is actually just

2753
02:11:27,830 --> 02:11:30,710
use placeholders, just
use the library that

2754
02:11:30,710 --> 02:11:32,990
escapes potentially dangerous input.

2755
02:11:32,990 --> 02:11:35,600
And this looks a little
weird, but in C we

2756
02:11:35,600 --> 02:11:38,000
saw that the escape
character was a backslash?

2757
02:11:38,000 --> 02:11:41,690
And that made backslash n, or backslash
something else, be treated specially.

2758
02:11:41,690 --> 02:11:45,750
Weirdly, in SQL it tends to
be just another single quote.

2759
02:11:45,750 --> 02:11:48,450
So if you do quote, quote
that actually means--

2760
02:11:48,450 --> 02:11:49,580
I mean, a literal quote.

2761
02:11:49,580 --> 02:11:52,010
It's not like the empty string,
or nothing in between it.

2762
02:11:52,010 --> 02:11:53,060
So it looks weird.

2763
02:11:53,060 --> 02:11:56,630
But, long story short, if
you use a library like CS50,

2764
02:11:56,630 --> 02:11:58,460
or anything in the
real world that handles

2765
02:11:58,460 --> 02:12:01,130
escaping of user input,
the whole query you

2766
02:12:01,130 --> 02:12:04,910
can think of as being now good and green
again because it doesn't matter what

2767
02:12:04,910 --> 02:12:05,780
the human types in.

2768
02:12:05,780 --> 02:12:10,680
Any scary characters will be
properly escaped by the database.

2769
02:12:10,680 --> 02:12:12,740
So it'll depend on the
library you're using,

2770
02:12:12,740 --> 02:12:15,200
but it almost always is
the syntax using question

2771
02:12:15,200 --> 02:12:18,530
marks or some similar placeholder.

2772
02:12:18,530 --> 02:12:19,130
All right.

2773
02:12:19,130 --> 02:12:23,270
So with that said, you are
now inducted into the Hall

2774
02:12:23,270 --> 02:12:26,020
of People who know now a little
something about databases.

2775
02:12:26,020 --> 02:12:28,520
We've only just scratched the
surface of using the language,

2776
02:12:28,520 --> 02:12:31,340
but it's now something we'll use to
build up more and more interesting

2777
02:12:31,340 --> 02:12:33,673
applications, especially for
final projects when we soon

2778
02:12:33,673 --> 02:12:37,160
transition to web programming or mobile
app development, if you go that route.

2779
02:12:37,160 --> 02:12:41,540
But you'll soon be able to speak a
language, literally and figuratively,

2780
02:12:41,540 --> 02:12:44,310
that those before you
have acquired as well.

2781
02:12:44,310 --> 02:12:48,170
So you are now qualified to
understand this sort of internet joke.

2782
02:12:48,170 --> 02:12:51,020
This is someone who, if
I Zoom in, was trying

2783
02:12:51,020 --> 02:12:55,820
to get out of paying some camera based
tolls by tricking the state or the city

2784
02:12:55,820 --> 02:12:58,130
into deleting or dropping
their whole database.

2785
02:12:58,130 --> 02:13:01,100
Drop means delete the whole
thing, not just the rows therein.

2786
02:13:01,100 --> 02:13:02,270
OK, so maybe not that funny.

2787
02:13:02,270 --> 02:13:06,910
But this is now the note will end
on, similar to other xkcd comics

2788
02:13:06,910 --> 02:13:07,910
we've introduced you to.

2789
02:13:07,910 --> 02:13:11,150
Every CS person out there
has seen, knows this comic.

2790
02:13:11,150 --> 02:13:14,150
So if you ever refer to, with a wink--

2791
02:13:14,150 --> 02:13:17,090
if you ever refer to little
Bobby tables with a wink,

2792
02:13:17,090 --> 02:13:20,795
if it's a computer scientist on the
other end, they'll know whom you mean.

2793
02:13:20,795 --> 02:13:23,918


2794
02:13:23,918 --> 02:13:25,330
[LAUGHING]

2795
02:13:25,330 --> 02:13:26,260
OK, there we go.

2796
02:13:26,260 --> 02:13:29,690


2797
02:13:29,690 --> 02:13:30,230
All right.

2798
02:13:30,230 --> 02:13:30,740
Tough crowd.

2799
02:13:30,740 --> 02:13:33,990
All right, Batman, come on down for your
cookies, and we'll see you next time.

2800
02:13:33,990 --> 02:13:34,520
[APPLAUSE]

2801
02:13:34,520 --> 02:13:37,570
[MUSIC PLAYING]

2802
02:13:37,570 --> 02:14:04,000