[MUSIC PLAYING]

BRIAN YU: Welcome back, everybody, to our final class in an Introduction to Artificial Intelligence with Python. So far in this class, we've been taking problems that we want to solve intelligently and framing them in ways that computers are going to be able to make sense of. We've been taking problems and framing them as search problems or constraint satisfaction problems or optimization problems, for example. In essence, we have been trying to communicate about problems in ways that our computer is going to be able to understand.

Today, the goal is going to be to get computers to understand the way you and I communicate naturally, via our own natural languages, languages like English. But natural language contains a lot of nuance and complexity that's going to make it challenging for computers to understand. So we'll need to explore some new tools and some new techniques to allow computers to make sense of natural language.

So what is it exactly that we're trying to get computers to do? Well, these tasks all fall under the general heading of natural language processing: getting computers to work with natural language. They include tasks like automatic summarization: given a long text, can we train the computer to come up with a shorter representation of it? Information extraction: getting the computer to pull relevant facts or details out of some text. Machine translation, like Google Translate: translating some text from one language into another language. Question answering: if you've ever asked a question to your phone or had a conversation with an AI chatbot, you provide some text to the computer, and the computer is able to understand that text and then generate some text in response. Text classification, where we provide some text to the computer and the computer assigns it a label: positive or negative, inbox or spam, for example. And there are several other kinds of tasks that all fall under this heading of natural language processing.
But before we take a look at how the computer might try to solve these kinds of tasks, it might be useful for us to think about language in general. What are the kinds of challenges that we might need to deal with as we start to think about language and getting a computer to be able to understand it?

One part of language that we'll need to consider is the syntax of language. Syntax is all about the structure of language. Language is composed of individual words, and those words are composed together into some kind of structured whole. And if our computer is going to be able to understand language, it's going to need to understand something about that structure.

So let's take a couple of examples. Here, for instance, is a sentence: "Just before nine o'clock Sherlock Holmes stepped briskly into the room." That sentence is made up of words, and those words together form a structured whole. This is syntactically valid as a sentence. But we could take some of those same words, rearrange them, and come up with a sentence that is not syntactically valid. For example, "Just before Sherlock Holmes nine o'clock stepped briskly the room" is still composed of valid words, but they don't form any kind of logical whole. This is not a syntactically well-formed sentence.

Another interesting challenge is that some sentences will have multiple possible valid structures. Here's a sentence, for example: "I saw the man on the mountain with a telescope." This is a valid sentence, but it actually has two different possible structures that lend themselves to two different interpretations and two different meanings. Maybe I'm the one doing the seeing and the one with the telescope, or maybe the man on the mountain is the one with the telescope. And so natural language is ambiguous. Sometimes the same sentence can be interpreted in multiple ways, and that's something that we'll need to think about as well.

And this lends itself to another problem within language that we'll need to think about, which is semantics.
While syntax is all about the structure of language, semantics is about the meaning of language. It's not enough for a computer just to know that a sentence is well-structured if it doesn't know what that sentence means. And so semantics is going to concern itself with the meaning of words and the meaning of sentences.

If we go back to that same sentence as before, "Just before nine o'clock Sherlock Holmes stepped briskly into the room," I could come up with another sentence, say, "A few minutes before nine, Sherlock Holmes walked quickly into the room." Those are two different sentences, with some of the words the same and some of the words different, but the two sentences have essentially the same meaning. And so ideally, whatever model we build will be able to understand that these two sentences, while different, mean something very similar.

Some syntactically well-formed sentences don't mean anything at all. A famous example from linguist Noam Chomsky is the sentence, "Colorless green ideas sleep furiously." This is a syntactically, structurally well-formed sentence: we've got adjectives modifying a noun, ideas, and we've got a verb and an adverb in the correct positions. But when taken as a whole, the sentence doesn't really mean anything. And so if our computers are going to be able to work with natural language and perform tasks in natural language processing, these are some concerns we'll need to think about. We'll need to be thinking about syntax, and we'll need to be thinking about semantics.

So how could we go about trying to teach a computer to understand the structure of natural language? Well, one approach we might take is to start by thinking about the rules of natural language. Our natural languages have rules. In English, for example, nouns tend to come before verbs, and nouns can be modified by adjectives. And so if only we could formalize those rules, then we could give those rules to a computer, and the computer would be able to make sense of them and understand them. So let's try to do exactly that.
We're going to try to define a formal grammar, where a formal grammar is some system of rules for generating sentences in a language. This is going to be a rule-based approach to natural language processing. We're going to give the computer some rules that we know about language, and have the computer use those rules to make sense of the structure of language. There are a number of different types of formal grammars, and each one of them has slightly different use cases. But today, we're going to focus specifically on one kind of grammar known as a context-free grammar.

So how does a context-free grammar work? Well, here is a sentence that we might want a computer to generate: "She saw the city." We're going to call each of these words a terminal symbol: a terminal symbol because once our computer has generated the word, there's nothing else for it to generate. Once it's generated the sentence, the computer is done. We're going to associate each of these terminal symbols with a nonterminal symbol that generates it. So here we've got N, which stands for noun, like "she" or "city." We've got V as a nonterminal symbol, which stands for a verb. And then we have D, which stands for determiner. A determiner is a word like "the" or "a" or "an" in English, for example. So each of these nonterminal symbols can generate the terminal symbols that we ultimately care about generating.

But how do we know, or how does the computer know, which nonterminal symbols are associated with which terminal symbols? Well, to do that, we need some kind of rule. Here are what we call rewriting rules, which have a nonterminal symbol on the left-hand side of an arrow, and on the right side is what that nonterminal symbol can be replaced with. So here, we're saying the nonterminal symbol N, which again stands for noun, could be replaced by any of these options separated by vertical bars: N could be replaced by "she" or "city" or "car" or "Harry." D, for determiner, could be replaced by "the," "a," or "an," and so forth. Each of these nonterminal symbols could be replaced by any of these words. We can also have nonterminal symbols that are replaced by other nonterminal symbols.
Here's an interesting rule: NP → N | D N. So what does that mean? Well, NP stands for a noun phrase. Sometimes when we have a noun phrase in a sentence, it's not just a single word; it could be multiple words. And so here, we're saying a noun phrase could be just a noun, or it could be a determiner followed by a noun. So we might have a noun phrase that's just a noun, like "she." That's a noun phrase. Or we could have a noun phrase that's multiple words, something like "the city." That also acts as a noun phrase, but in this case, it's composed of two words: a determiner, "the," and a noun, "city."

We could do the same for verb phrases. A verb phrase, or VP, might be just a verb, or it might be a verb followed by a noun phrase. So we could have a verb phrase that's just a single word, like the word "walked," or we could have a verb phrase that is an entire phrase, something like "saw the city."

A sentence, meanwhile, we might then define as a noun phrase followed by a verb phrase. And so this would allow us to generate a sentence like "She saw the city": an entire sentence made up of a noun phrase, which is just the word "she," and then a verb phrase, which is "saw the city" — "saw," which is a verb, and then "the city," which itself is also a noun phrase.

And so if we could give these rules to a computer, explaining to it what nonterminal symbols could be replaced by what other symbols, then a computer could take a sentence and begin to understand the structure of that sentence. So let's take a look at an example of how we might do that. To do that, we're going to use a Python library called NLTK, or the Natural Language Toolkit, which we'll see a couple of times today. It contains a lot of helpful features and functions that we can use for trying to deal with and process natural language. So here, we'll take a look at how we can use NLTK in order to parse a context-free grammar.

Let's go ahead and open up cfg0.py, cfg standing for context-free grammar. What you'll see in this file is that I first import NLTK, the Natural Language Toolkit.
The first thing I do is define a context-free grammar, saying that a sentence is a noun phrase followed by a verb phrase, defining what a noun phrase is, defining what a verb phrase is, and then giving some examples of what I can do with these nonterminal symbols: D for determiner, N for noun, and V for verb. We're going to use NLTK to parse that grammar. Then we'll ask the user for some input in the form of a sentence and split it into words. And then we'll use this context-free grammar parser to try to parse that sentence and print out the resulting syntax tree.

So let's take a look at an example. We'll go ahead and go into my cfg directory and run cfg0.py. Here, I'm asked to type in a sentence. Let's say I type in "she walked." When I do that, I see that "she walked" is a valid sentence, where "she" is a noun phrase, and "walked" is the corresponding verb phrase. I could try to do this with a more complex sentence, too. I could do something like "she saw the city." And here, we see that "she" is the noun phrase, and then "saw the city" is the entire verb phrase that makes up this sentence.

So that was a very simple grammar. Let's take a look at a slightly more complex grammar. Here is cfg1.py, where a sentence is still a noun phrase followed by a verb phrase, but I've added some other possible nonterminal symbols, too. I have AP for adjective phrase and PP for prepositional phrase. And we've specified that we could have an adjective phrase before a noun phrase, or a prepositional phrase after a noun, for example. So there are lots of additional ways that we might try to structure a sentence and interpret and parse one of those resulting sentences.

So let's see that one in action. We'll go ahead and run cfg1.py with this new grammar, and we'll try a sentence like "she saw the wide street." Here, Python's NLTK is able to parse that sentence and identify that "she saw the wide street" has this particular structure: a sentence with a noun phrase and a verb phrase, where that verb phrase has a noun phrase that, within it, contains an adjective.
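As a rough sketch, a grammar in the spirit of cfg1.py might be defined and parsed with NLTK something like the following; the exact rules and word lists in the course files may differ.

```python
import nltk

# A grammar in the spirit of cfg1.py: a sentence is a noun phrase
# followed by a verb phrase, with adjective phrases (AP) and
# prepositional phrases (PP) allowed as well. The particular rules
# and vocabulary here are illustrative.
grammar = nltk.CFG.fromstring("""
    S -> NP VP

    AP -> A | A AP
    NP -> N | D NP | AP NP | N PP
    PP -> P NP
    VP -> V | V NP | V NP PP

    A -> "wide" | "blue" | "small"
    D -> "the" | "a" | "an"
    N -> "she" | "city" | "car" | "street" | "dog" | "binoculars"
    P -> "on" | "with" | "before"
    V -> "saw" | "walked"
""")

parser = nltk.ChartParser(grammar)

# Read a sentence from the user and print every parse tree the
# grammar allows for it.
sentence = input("Sentence: ").split()
try:
    for tree in parser.parse(sentence):
        tree.pretty_print()
except ValueError:
    print("No parse tree possible.")
```

With a grammar like this, "she saw the wide street" parses one way, while an ambiguous sentence can produce more than one tree, as we'll see next.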
And so it's able to get some sense for what the structure of this language actually is.

Let's try another example. Let's say "she saw the dog with the binoculars," and we'll try that sentence. Here, we get one possible syntax tree for "she saw the dog with the binoculars." But notice that this sentence is actually a little bit ambiguous in our own natural language. Who has the binoculars? Is it she who has the binoculars, or the dog who has the binoculars? And NLTK is able to identify both possible structures for the sentence. In one parse, "the dog with the binoculars" is an entire noun phrase; it's all underneath one NP, so it's the dog that has the binoculars. But we also get an alternative parse tree, where "the dog" is just the noun phrase, and "with the binoculars" is a prepositional phrase modifying "saw." So she saw the dog, and she used the binoculars in order to see the dog.

So this allows us to get a sense for the structure of natural language, but it relies on us writing all of these rules. And it would take a lot of effort to write all of the rules for any possible sentence that someone might write or say in the English language. Language is complicated, and as a result, there are going to be some very complex rules. So what else might we try?

We might try to take a statistical lens toward approaching this problem of natural language processing. If we were able to give the computer a lot of existing data of sentences written in the English language, what could we try to learn from that data? Well, it might be difficult to try to interpret long pieces of text all at once. So instead, what we might want to do is break up that longer text into smaller pieces of information. In particular, we might try to create n-grams out of a longer sequence of text. An n-gram is just some contiguous sequence of n items from a sample of text. It might be n characters in a row, or n words in a row, for example. So let's take a passage from Sherlock Holmes and look for all of the trigrams.
A trigram is an n-gram where n is equal to three. So in this case, we're looking for sequences of three words in a row. The trigrams here would be phrases like "how often have." That's three words in a row. "Often have I" is another trigram. "Have I said." "I said to." "Said to you." "To you that." These are all trigrams: sequences of three words that appear in sequence. And if we could give the computer a large corpus of text and have it pull out all of the trigrams, it could get a sense for what sequences of three words tend to appear next to each other in our own natural language, and as a result, get some sense for what the structure of the language actually is.

So let's take a look at an example of that. How can we use NLTK to get access to information about n-grams? Here we're going to open up ngrams.py. This is a Python program that's going to load a corpus of data, just some text files, into our computer's memory. Then we're going to use NLTK's ngrams function, which is going to go through the corpus of text, pulling out all of the n-grams for a particular value of n. And then, by using Python's Counter class, we're going to figure out what the most common n-grams inside of this entire corpus of text are. We're going to need a dataset in order to do this, and I've prepared a dataset of some of the stories of Sherlock Holmes. It's just a bunch of text files, a lot of words for it to analyze. And as a result, we'll get a sense for what sequences of two words or three words tend to be most common in natural language.

So let's give this a try. We'll go into my ngrams directory and we'll run ngrams.py. We'll try an n value of two, so we're looking for sequences of two words in a row, and we'll use our corpus of stories from Sherlock Holmes. When we run this program, we get a list of the most common n-grams where n is equal to two, otherwise known as bigrams. The most common one is "of the." That's a sequence of two words that appears quite frequently in natural language.
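Here is a rough sketch of what a program like ngrams.py might look like. The command-line interface and file layout are assumptions, but the core idea is the one just described: tokenize the corpus, feed it to NLTK's ngrams function, and count the results with Python's Counter.

```python
from collections import Counter

import nltk
import os
import sys


def main():
    """Print the most frequent n-grams in a corpus of text files."""
    n = int(sys.argv[1])       # e.g. 2 for bigrams, 3 for trigrams
    corpus_dir = sys.argv[2]   # directory of .txt files, e.g. Sherlock Holmes stories

    # Read every file in the corpus and tokenize it into lowercase words.
    # (May require nltk.download("punkt") the first time word_tokenize is used.)
    words = []
    for filename in os.listdir(corpus_dir):
        with open(os.path.join(corpus_dir, filename)) as f:
            words.extend(
                word.lower()
                for word in nltk.word_tokenize(f.read())
                if any(c.isalpha() for c in word)
            )

    # Count every contiguous sequence of n words and print the most common.
    ngrams = Counter(nltk.ngrams(words, n))
    for ngram, count in ngrams.most_common(10):
        print(f"{count}: {ngram}")


if __name__ == "__main__":
    main()
```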
Then come "in the" and "it was." These are all common sequences of two words that appear in a row. Let's instead now try running ngrams with n equal to three. Let's get all of the trigrams and see what we get. Now we see the most common trigrams are "it was a," "one of the," and "I think that." These are all sequences of three words that appear quite frequently.

And we were able to do this essentially via a process known as tokenization. Tokenization is the process of splitting a sequence of characters into pieces. In this case, we're splitting a long sequence of text into individual words, and then looking at sequences of those words to get a sense for the structure of natural language.

So once we've done the tokenization and built up our corpus of n-grams, what can we do with that information? Well, one thing we might try is to build a Markov chain, which you might recall from when we talked about probability. Recall that a Markov chain is some sequence of values where we can predict one value based on the values that came before it. And as a result, if we know all of the common n-grams in the English language, what words tend to be associated with what other words in sequence, we can use that to predict what word might come next in a sequence of words. And so we could build a Markov chain for language in order to try to generate natural language that follows the same statistical patterns as some input data.

So let's take a look at that and build a Markov chain for natural language. As input, I'm going to use the works of William Shakespeare. Here, I have a file, shakespeare.txt, which is just a bunch of the works of William Shakespeare. It's a long text file, so plenty of data to analyze. And here in generator.py, I'm using a third-party Python library in order to do this analysis. We're going to read in the sample of text, then we're going to train a Markov model based on that text, and then we're going to have the Markov chain generate some sentences; a rough sketch of such a generator appears below.
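The lecture doesn't name the third-party library; one library commonly used for this kind of Markov-chain text generation is markovify, which is what this sketch assumes.

```python
import sys

import markovify

# Read the sample of text (e.g. shakespeare.txt).
with open(sys.argv[1]) as f:
    text = f.read()

# Train a Markov model on the text: the library learns which words
# tend to follow which other words in the input.
model = markovify.Text(text)

# Generate five sentences that follow the same statistical patterns
# as the input. (make_sentence can return None if it fails to build
# a sentence that differs enough from the original text.)
for _ in range(5):
    print(model.make_sentence())
    print()
```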
We're going to generate sentences that don't appear in the original text, but that follow the same statistical patterns. The model generates them based on the n-grams, trying to predict what word is likely to come next, based on those statistical patterns.

So we'll go ahead and go into our markov directory and run this generator with the works of William Shakespeare as input. What we're going to get are five new sentences, where these sentences are not necessarily sentences from the original input text itself, but sentences that follow the same statistical patterns. It's predicting what word is likely to come next, based on the input data that we've seen and the types of words that tend to appear in sequence there. And so we're able to generate these sentences. Of course, so far, there's no guarantee that any of the sentences that are generated actually mean anything or make any sense. They just happen to follow the statistical patterns that our computer is already aware of. So we'll return to this issue of how to generate text in perhaps a more accurate or more meaningful way a little bit later.

Let's now turn our attention to a slightly different problem, and that's the problem of text classification. Text classification is the problem where we have some text, and we want to put that text into some kind of category; we want to apply some sort of label to that text. This kind of problem shows up in a wide variety of places. A common place might be your email inbox, for example. You get an email and you want your computer to be able to identify whether the email belongs in your inbox or whether it should be filtered out into spam. So we need to classify the text: is it a good email or is it spam?

Another common use case is sentiment analysis. We might want to know whether the sentiment of some text is positive or negative. And so how might we do that? This comes up in situations like product reviews, where we might have a bunch of reviews for a product on some website: "My grandson loved it! So much fun."
"Product broke after a few days." "One of the best games I've played in a long time." And "Kind of cheap and flimsy, not worth it." These are some example sentences that you might see on a product review website. And you and I could pretty easily look at this list of product reviews and decide which ones are positive and which ones are negative. We might say the first one and the third one seem like positive sentiment messages, but the second one and the fourth one seem like negative sentiment messages.

But how did we know that? And how could we train a computer to be able to figure that out as well? Well, you might have keyed in on particular words, where those particular words tend to mean something positive or negative. So you might have identified that words like "loved" and "fun" and "best" tend to be associated with positive messages, and words like "broke" and "cheap" and "flimsy" tend to be associated with negative messages. So if only we could train a computer to learn what words tend to be associated with positive versus negative messages, then maybe we could train a computer to do this kind of sentiment analysis as well.

So we're going to try to do just that. We're going to use a model known as the bag-of-words model, which is a model that represents text as just an unordered collection of words. For the purpose of this model, we're not going to worry about the sequence and the ordering of the words, which word came first, second, or third. We're just going to treat the text as a collection of words in no particular order. We're losing information there, of course; the order of words is important, and we'll come back to that a little bit later. But for now, to simplify our model, it'll help us tremendously just to think about text as some unordered collection of words.

And in particular, we're going to use the bag-of-words model to build something known as a Naive Bayes classifier. So what is a Naive Bayes classifier? Well, it's a tool that's going to allow us to classify text based on Bayes' rule.
Again, as you might remember from when we talked about probability, Bayes' rule says that the probability of b given a is equal to the probability of a given b, multiplied by the probability of b, divided by the probability of a.

So how are we going to use this rule to analyze text? Well, what are we interested in? We're interested in the probability that a message has a positive sentiment and the probability that a message has a negative sentiment, which, for simplicity, I'm going to represent here just with these emoji, a happy face and a frowny face, for positive and negative sentiment. And so if I had a review, something like "My grandson loved it," then what I'm interested in is not just the probability that a message has positive sentiment, but the conditional probability that a message has positive sentiment given that this is the message: "My grandson loved it."

But how do I go about calculating this value, the probability that the message is positive given that the review is this sequence of words? Well, here's where the bag-of-words model comes in. Rather than treat this review as a string, a sequence of words in order, we're just going to treat it as an unordered collection of words. We're going to try to calculate the probability that the review is positive, given that all of these words, "my grandson loved it," are in the review in no particular order — just this unordered collection of words. And this is a conditional probability, which we can then apply Bayes' rule to in order to make sense of it.

So according to Bayes' rule, what is this conditional probability equal to? It's equal to the probability that all four of these words are in the review, given that the review is positive, multiplied by the probability that the review is positive, divided by the probability that all of these words happen to be in the review. So this is the value now that we're going to try to calculate.

Now, one thing you might notice is that the denominator here, the probability that all of these words appear in the review, doesn't actually depend on whether we're looking at the positive sentiment or negative sentiment case. So we can actually get rid of this denominator.
We don't need to calculate it. We can just say that this probability is proportional to the numerator. And then at the end, we're going to need to normalize the probability distribution to make sure that all of the values sum up to one.

So now, how do we calculate this value? Well, this is the probability of all of these words given positive, times the probability of positive. And that, by the definition of joint probability, is just one big joint probability: the probability that all of these things are the case, that it's a positive review and that all four of these words are in the review. But still, it's not entirely obvious how we calculate that value. And here is where we need to make one more assumption. This is where the naive part of Naive Bayes comes in. We're going to make the assumption that all of the words are independent of each other. By that, I mean that if the word "grandson" is in the review, that doesn't change the probability that the word "loved" is in the review or that the word "it" is in the review, for example. In practice, this assumption might not be true. It's almost certainly the case that the probabilities of words do depend on each other. But it's going to simplify our analysis and still give us reasonably good results just to assume that the words are independent of each other and that they only depend on whether the review is positive or negative. You might, for example, expect the word "loved" to appear more often in a positive review than in a negative review.

So, what does that mean? Well, if we make this assumption, then we can say that this value, the probability we're interested in, is not directly proportional to, but is naively proportional to, this value: the probability that the review is positive, times the probability that "my" is in the review given that it's positive, times the probability that "grandson" is in the review given that it's positive, and so on for the other two words that happen to be in this review. And now this value, which looks a little more complex, is actually a value that we can calculate pretty easily.
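Putting that derivation in one place, with the four review words written out, the chain of reasoning looks roughly like this:

\[
P(\text{positive} \mid \text{"my"}, \text{"grandson"}, \text{"loved"}, \text{"it"})
= \frac{P(\text{"my"}, \text{"grandson"}, \text{"loved"}, \text{"it"} \mid \text{positive}) \; P(\text{positive})}{P(\text{"my"}, \text{"grandson"}, \text{"loved"}, \text{"it"})}
\]
\[
\propto \; P(\text{positive}) \cdot P(\text{"my"} \mid \text{positive}) \cdot P(\text{"grandson"} \mid \text{positive}) \cdot P(\text{"loved"} \mid \text{positive}) \cdot P(\text{"it"} \mid \text{positive})
\]

The first step is Bayes' rule; the proportionality comes from dropping the denominator, which is the same for both labels; and the product over the individual words is the naive independence assumption.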
So how are we going to estimate the probability that the review is positive? Well, if we have some training data, some example reviews where each one has already been labeled as positive or negative, then we can estimate the probability that a review is positive just by counting the number of positive samples and dividing by the total number of samples in our training data. And for the conditional probabilities, the probability of "loved" given that the review is positive is going to be the number of positive samples with "loved" in them divided by the total number of positive samples.

So let's take a look at an actual example to see how we could try to calculate these values. Here, I've put together some sample data. The way to interpret the sample data is that, based on the training data, 49% of the reviews are positive and 51% are negative. And then over here in this table, we have some conditional probabilities. If the review is positive, then there's a 30% chance that "my" appears in it, and if the review is negative, there's a 20% chance that "my" appears in it. And based on our training data, among the positive reviews, 1% of them contain the word "grandson," and among the negative reviews, 2% contain the word "grandson."

So, using this data, let's try to calculate the value we're interested in. To do that, we'll need to multiply all of these values together: the probability of positive, and then all of these positive conditional probabilities. When we do that, we get some value. Then we can do the same thing for the negative case: take the probability that it's negative, multiply it by all of these conditional probabilities, and we're going to get some other value. Now, these values don't sum to one; they're not a probability distribution yet. But I can normalize them, and that tells me that we're going to predict that "My grandson loved it" is a positive sentiment review with probability 0.68, and a negative sentiment review with probability 0.32.
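As a rough check on that arithmetic: the lecture only reads out the priors (0.49 and 0.51) and the conditional probabilities for "my" and "grandson," so the values for "loved" and "it" in the sketch below are assumptions, chosen only so the numbers land near the 0.68 / 0.32 result described above.

```python
# Prior probabilities from the training data (given in the lecture).
priors = {"positive": 0.49, "negative": 0.51}

# Conditional probabilities P(word | sentiment).
# "my" and "grandson" come from the lecture; "loved" and "it" are
# assumed values, included only for illustration.
conditionals = {
    "my":       {"positive": 0.30, "negative": 0.20},
    "grandson": {"positive": 0.01, "negative": 0.02},
    "loved":    {"positive": 0.32, "negative": 0.08},  # assumed
    "it":       {"positive": 0.30, "negative": 0.40},  # assumed
}

# Multiply the prior by each word's conditional probability
# (the naive independence step).
scores = {}
for sentiment, prior in priors.items():
    score = prior
    for word in ["my", "grandson", "loved", "it"]:
        score *= conditionals[word][sentiment]
    scores[sentiment] = score

# Normalize so the two values sum to 1.
total = sum(scores.values())
for sentiment, score in scores.items():
    print(sentiment, round(score / total, 2))
# With these numbers: positive 0.68, negative 0.32
```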
So, what problems might we run into here? What could potentially go wrong when doing this kind of analysis to decide whether text has a positive or negative sentiment? Well, a couple of problems might arise. One problem might be: what if the word "grandson" never appears in any of the positive reviews? If that were the case, then when we try to calculate the probability that the review is positive, we're going to multiply all these values together and we're just going to get 0 for the positive case, because we're going to ultimately multiply by that 0 value. And so we're going to say that we think there is no chance that the review is positive, because it contains the word "grandson," and in our training data, we've never seen the word "grandson" appear in a positive sentiment message before. And that's probably not the right analysis, because with rare words, it might just be that nowhere in our training data did we ever see the word "grandson" appear in a message that has positive sentiment.

So, what can we do to solve this problem? Well, one thing we'll often do is some kind of additive smoothing, where we add some value alpha to each value in our distribution just to smooth out the data a little bit. A common form of this is Laplace smoothing, where we add 1 to each value in our distribution. In essence, we pretend we've seen each value one more time than we actually have. If we've never seen the word "grandson" in a positive review, we pretend we've seen it once. If we've seen it once, we pretend we've seen it twice, just to avoid the possibility that we might multiply by 0 and, as a result, get some results we don't want in our analysis.

So let's see what this looks like in practice. Let's try to do some Naive Bayes classification in order to classify text as either positive or negative. We'll take a look at sentiment.py. What this is going to do is load some sample data into memory, some examples of positive reviews and negative reviews, and then we're going to train a Naive Bayes classifier on all of this training data; a rough sketch of such a program appears below.
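This sketch shows one way such a classifier might be put together with NLTK's built-in NaiveBayesClassifier; the corpus file names and the exact feature scheme are assumptions here, not necessarily what the course's sentiment.py does.

```python
import os
import sys

import nltk


def extract_words(document):
    # Tokenize and lowercase, keeping only alphabetic tokens.
    # (May require nltk.download("punkt") the first time.)
    return set(
        word.lower() for word in nltk.word_tokenize(document)
        if any(c.isalpha() for c in word)
    )


def load_data(directory):
    # Assumes a positives.txt and a negatives.txt, one review per line.
    result = []
    for filename in ["positives.txt", "negatives.txt"]:
        with open(os.path.join(directory, filename)) as f:
            result.append([line.strip() for line in f if line.strip()])
    return result  # (positive reviews, negative reviews)


def main():
    positives, negatives = load_data(sys.argv[1])

    # Build labeled feature sets: each feature marks that a word is present.
    training = []
    for document in positives:
        training.append(({word: True for word in extract_words(document)}, "Positive"))
    for document in negatives:
        training.append(({word: True for word in extract_words(document)}, "Negative"))

    classifier = nltk.NaiveBayesClassifier.train(training)

    # Classify a new review typed by the user and print the probabilities.
    review = input("s: ")
    features = {word: True for word in extract_words(review)}
    result = classifier.prob_classify(features)
    for label in result.samples():
        print(f"{label}: {result.prob(label):.4f}")


if __name__ == "__main__":
    main()
```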
The training data includes all of the words we see in positive reviews and all of the words we see in negative reviews. And then we're going to try to classify some input. We're going to do this based on a corpus of data. I have some example positive reviews, like "It was great! So much fun," and then some negative reviews: "Not worth it." "Kind of cheap." Those are some examples of negative reviews.

So now, let's try to run this classifier and see how it would classify particular text as either positive or negative. We'll go ahead and run our sentiment analysis on this corpus, and we need to provide it with a review. So I'll say something like "I enjoyed it." And we see that the classifier says there's about a 0.92 probability that we think this particular review is positive. Let's try something negative: we'll try "kind of overpriced." And we see that there is a 0.96 probability now that we think this particular review is negative. And so our Naive Bayes classifier has learned what kinds of words tend to appear in positive reviews and what kinds of words tend to appear in negative reviews. As a result of that, we've been able to design a classifier that can predict whether a particular review is positive or negative.

So this definitely is a useful tool that we can use to try to make some predictions. But we had to make some assumptions in order to get there. So what if we now want to try to build some more sophisticated models, use some tools from machine learning to take better advantage of language data, to be able to draw more accurate conclusions and solve new kinds of tasks and new kinds of problems? Well, we've seen a couple of times now that when we want to take some data, take some input, and put it in a way that the computer is going to be able to make sense of, it can be helpful to take that data and turn it into numbers.
And so what we might want to try to do is come up with some word representation, some way to take a word and translate its meaning into numbers. Because, for example, if we wanted to use a neural network to process language, to give our language to a neural network and have it make some predictions or perform some analysis, a neural network takes as input, and produces as output, a vector of values, a vector of numbers. And so what we might want to do is take our data, take words, and convert them into some kind of numeric representation.

So how might we do that? How might we take words and turn them into numbers? Let's take a look at an example. Here's a sentence: "He wrote a book." And let's say I wanted to take each of those words and turn it into a vector of values. Here's one way I might do that. We'll say "he" is going to be a vector that has a 1 in the first position, and the rest of the values are 0. "Wrote" will have a 1 in the second position, and the rest of the values are 0. "A" has a 1 in the third position, with the rest of the values 0. And "book" has a 1 in the fourth position, with the rest of the values 0. So each of these words now has a distinct vector representation. And this is what we often call a one-hot representation: a representation of the meaning of a word as a vector with a single 1, where all of the rest of the values are 0. And so when doing this, we now have a numeric representation for every word, and we could pass those vector representations into a neural network or other models that require some kind of numeric data as input.

But this one-hot representation actually has a couple of problems, and it's not ideal for a few reasons. One reason is that here, we're just looking at four words. But if you imagine a vocabulary of thousands of words or more, these vectors are going to get quite long in order to have a distinct vector for every possible word in our vocabulary. And as a result, these longer vectors are going to be more difficult to deal with, more difficult to train, and so forth. So that might be a problem.
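Before moving on to the second problem, here's a tiny sketch of what building these one-hot vectors might look like in code, using just the four words from the example sentence as the vocabulary:

```python
# Build a one-hot vector for each word in a small vocabulary.
vocabulary = ["he", "wrote", "a", "book"]

one_hot = {
    word: [1 if i == index else 0 for i in range(len(vocabulary))]
    for index, word in enumerate(vocabulary)
}

print(one_hot["he"])    # [1, 0, 0, 0]
print(one_hot["book"])  # [0, 0, 0, 1]
```

With thousands of words in the vocabulary, each of these vectors would have thousands of entries, nearly all of them zero.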
692 00:33:21,000 --> 00:33:23,550 Another problem is a little bit more subtle. 693 00:33:23,550 --> 00:33:26,460 If we want to represent a word as a vector, 694 00:33:26,460 --> 00:33:30,270 and in particular, the meaning of a word as a vector, then ideally, 695 00:33:30,270 --> 00:33:33,300 it should be the case that words that have similar meanings 696 00:33:33,300 --> 00:33:36,360 should also have similar vector representations, 697 00:33:36,360 --> 00:33:40,320 so that they're close together inside a vector space. 698 00:33:40,320 --> 00:33:42,180 But that's not really going to be the case 699 00:33:42,180 --> 00:33:45,990 with these one-hot representations, because if we take some similar words, 700 00:33:45,990 --> 00:33:49,590 say the word wrote and the word authored, which mean similar things, 701 00:33:49,590 --> 00:33:53,400 they have entirely different vector representations. 702 00:33:53,400 --> 00:33:54,870 Likewise book and novel. 703 00:33:54,870 --> 00:33:57,270 Those two words mean somewhat similar things, 704 00:33:57,270 --> 00:34:00,390 but they have entirely different vector representations, 705 00:34:00,390 --> 00:34:03,420 because they each have a 1 in some different position. 706 00:34:03,420 --> 00:34:05,340 And so that's not ideal either. 707 00:34:05,340 --> 00:34:07,440 So what we might be interested in instead, 708 00:34:07,440 --> 00:34:10,110 is some kind of distributed representation. 709 00:34:10,110 --> 00:34:12,900 A distributed representation is the representation 710 00:34:12,900 --> 00:34:16,710 of the meaning of a word distributed across multiple values, 711 00:34:16,710 --> 00:34:20,130 instead of just being one-hot with a 1 in one position. 712 00:34:20,130 --> 00:34:24,540 Here is what a distributed representation of words might look like. 713 00:34:24,540 --> 00:34:27,840 Each word is associated with some vector of values, 714 00:34:27,840 --> 00:34:30,570 with the meaning distributed across multiple values, 715 00:34:30,570 --> 00:34:35,250 ideally in such a way that similar words have a similar vector 716 00:34:35,250 --> 00:34:36,480 representation. 717 00:34:36,480 --> 00:34:38,639 But how are we going to come up with those values? 718 00:34:38,639 --> 00:34:40,110 Where do those values come from? 719 00:34:40,110 --> 00:34:43,590 How can we define the meaning of a word in this distributed 720 00:34:43,590 --> 00:34:45,210 sequence of numbers? 721 00:34:45,210 --> 00:34:47,909 Well, to do that, we're going to draw inspiration from a quote 722 00:34:47,909 --> 00:34:50,400 from British linguist JR Firth, who said, 723 00:34:50,400 --> 00:34:53,580 "You shall know a word by the company it keeps." 724 00:34:53,580 --> 00:34:56,370 In other words, we're going to define the meaning of a word 725 00:34:56,370 --> 00:35:00,600 based on the words that appear around it, the context words around it. 726 00:35:00,600 --> 00:35:02,460 Take, for example, this context: 727 00:35:02,460 --> 00:35:04,560 "For blank he ate." 728 00:35:04,560 --> 00:35:08,130 You might wonder what words could reasonably fill in that blank. 729 00:35:08,130 --> 00:35:11,520 Well, it might be words like breakfast, or lunch, or dinner. 730 00:35:11,520 --> 00:35:13,920 All of those could reasonably fill in that blank. 731 00:35:13,920 --> 00:35:16,160 And so what we're going to say is that because 732 00:35:16,160 --> 00:35:20,240 the words breakfast and lunch and dinner appear in similar contexts, 733 00:35:20,240 --> 00:35:22,580 they must have similar meanings.
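To make the "company it keeps" idea concrete, here is a toy sketch that simply counts which words appear near each other in a handful of made-up sentences. Real models like word2vec learn dense vectors rather than keeping raw counts, but the intuition is the same: breakfast, lunch, and dinner end up with nearly identical contexts.

# A toy illustration of "you shall know a word by the company it keeps":
# count the words that appear near each word in a small made-up corpus.

from collections import Counter, defaultdict

corpus = [
    "for breakfast he ate eggs",
    "for lunch he ate a sandwich",
    "for dinner he ate pasta",
    "he wrote a book",
    "she wrote a novel",
]

window = 2  # how many words on each side count as "context"
contexts = defaultdict(Counter)

for sentence in corpus:
    tokens = sentence.split()
    for i, word in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if i != j:
                contexts[word][tokens[j]] += 1

# breakfast, lunch, and dinner end up with very similar context counts,
# which is exactly why we would want them to have similar vectors.
print(contexts["breakfast"])
print(contexts["lunch"])
print(contexts["book"])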
734 00:35:22,580 --> 00:35:25,880 And that's something our computer could understand and try to learn. 735 00:35:25,880 --> 00:35:28,310 A computer could look at a big corpus of text, 736 00:35:28,310 --> 00:35:31,730 look at what words tend to appear in similar contexts to each other, 737 00:35:31,730 --> 00:35:35,270 and use that to identify which words have a similar meaning. 738 00:35:35,270 --> 00:35:39,440 And should therefore, appear close to each other inside a vector space. 739 00:35:39,440 --> 00:35:43,640 And so one common model for doing this is known as the word2vec model. 740 00:35:43,640 --> 00:35:47,060 It's a model for generating word vectors, a vector representation 741 00:35:47,060 --> 00:35:49,700 for every word by looking at data and looking 742 00:35:49,700 --> 00:35:52,250 at the context in which a word appears. 743 00:35:52,250 --> 00:35:53,690 The idea is going to be this. 744 00:35:53,690 --> 00:35:58,070 If you start out with all of the words just in some random position in space 745 00:35:58,070 --> 00:36:02,000 and train it on some training data, what the word2vec model will do, 746 00:36:02,000 --> 00:36:05,240 is start to learn what words appear in similar contexts. 747 00:36:05,240 --> 00:36:08,180 And it will move these vectors around in such a way 748 00:36:08,180 --> 00:36:10,550 that hopefully, words with similar meanings, 749 00:36:10,550 --> 00:36:12,800 breakfast, lunch, and dinner, book, memoir, 750 00:36:12,800 --> 00:36:18,300 novel, will hopefully appear to be near to each other as vectors, as well. 751 00:36:18,300 --> 00:36:22,110 So, let's now take a look at what word2vec might look like in practice 752 00:36:22,110 --> 00:36:24,300 when implemented in code. 753 00:36:24,300 --> 00:36:29,010 What I have here inside of words.txt is a pre-trained model 754 00:36:29,010 --> 00:36:32,700 where each of these words has some vector representation trained 755 00:36:32,700 --> 00:36:33,480 by word2vec. 756 00:36:33,480 --> 00:36:38,070 Each of these words has some sequence of values representing its meaning, 757 00:36:38,070 --> 00:36:40,860 hopefully in such a way, that similar words are 758 00:36:40,860 --> 00:36:43,110 represented by similar vectors. 759 00:36:43,110 --> 00:36:46,890 I also have this file, vectors.py, which is going to open up the words 760 00:36:46,890 --> 00:36:48,300 and form them into a dictionary. 761 00:36:48,300 --> 00:36:50,970 And we also define some useful functions, like distance, 762 00:36:50,970 --> 00:36:53,460 to get the distance between two word vectors. 763 00:36:53,460 --> 00:36:56,730 And closest words define which words are nearby 764 00:36:56,730 --> 00:36:59,490 in terms of having close vectors to each other. 765 00:36:59,490 --> 00:37:01,650 And so let's give this a try. 766 00:37:01,650 --> 00:37:05,010 We'll go ahead and open a python interpreter. 767 00:37:05,010 --> 00:37:09,510 And I'm going to import these vectors. 768 00:37:09,510 --> 00:37:14,970 And we might say, all right, what is the vector representation of the word book. 769 00:37:14,970 --> 00:37:18,480 And we get this big long vector that represents the word 770 00:37:18,480 --> 00:37:20,400 book as a sequence of values. 771 00:37:20,400 --> 00:37:23,670 And this sequence of values by itself is not all that meaningful. 772 00:37:23,670 --> 00:37:26,850 But it is meaningful in the context of comparing it 773 00:37:26,850 --> 00:37:29,610 to other vectors for other words. 
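The exact contents of the course's vectors.py aren't shown here, but a minimal version of those helpers might look something like the sketch below, assuming words.txt stores one word per line followed by its vector values, and using cosine distance for the comparisons. The real file may differ in its details.

# A possible sketch of the helpers described here: loading word vectors,
# a distance function, and a closest_words function.

import math


def load_words(filename):
    """Load lines of the form 'word v1 v2 ... vn' into a dictionary.
    (The exact format of the course's words.txt is an assumption here.)"""
    vectors = {}
    with open(filename) as f:
        for line in f:
            word, *values = line.split()
            vectors[word] = [float(v) for v in values]
    return vectors


def distance(v1, v2):
    """Cosine distance: 0 means the vectors point the same way;
    larger values mean the words are less similar."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    return 1 - dot / (norm1 * norm2)


def closest_words(words, vector, n=10):
    """Return the n words whose vectors are closest to the given vector."""
    return sorted(words, key=lambda w: distance(vector, words[w]))[:n]


if __name__ == "__main__":
    words = load_words("words.txt")
    print(distance(words["book"], words["novel"]))
    print(closest_words(words, words["book"], 10))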
774 00:37:29,610 --> 00:37:31,800 So we could use this distance function, which 775 00:37:31,800 --> 00:37:35,065 is going to get us the distance between two word vectors. 776 00:37:35,065 --> 00:37:37,440 And we might say, what is the distance between the vector 777 00:37:37,440 --> 00:37:42,360 representation for the word book and the vector representation for the word 778 00:37:42,360 --> 00:37:43,590 novel. 779 00:37:43,590 --> 00:37:45,840 And we see that it's 0.34. 780 00:37:45,840 --> 00:37:48,480 You can kind of interpret 0 as being really close together, 781 00:37:48,480 --> 00:37:50,310 and 1 being very far apart. 782 00:37:50,310 --> 00:37:55,140 And so now, what is the distance between book and let's say, breakfast? 783 00:37:55,140 --> 00:37:58,110 Well, book and breakfast are more different from each other 784 00:37:58,110 --> 00:38:00,090 than book and novel are, so I would hopefully, 785 00:38:00,090 --> 00:38:01,890 expect the distance to be larger. 786 00:38:01,890 --> 00:38:03,060 And in fact, it is. 787 00:38:03,060 --> 00:38:05,040 0.64 approximately. 788 00:38:05,040 --> 00:38:07,650 These two words are further away from each other. 789 00:38:07,650 --> 00:38:12,960 And what about now, the distance between let's say, lunch and breakfast? 790 00:38:12,960 --> 00:38:14,580 Well, that's about 0.2. 791 00:38:14,580 --> 00:38:15,960 Those are even closer together. 792 00:38:15,960 --> 00:38:19,190 They have a meaning that is closer to each other. 793 00:38:19,190 --> 00:38:23,660 Another interesting thing we might do is calculate the closest words. 794 00:38:23,660 --> 00:38:29,030 We might say, what are the closest words according to word2vec to the word book, 795 00:38:29,030 --> 00:38:31,550 and let's say, let's get the 10 closest words. 796 00:38:31,550 --> 00:38:35,960 What are the 10 closest vectors to the vector representation for the word 797 00:38:35,960 --> 00:38:36,830 book? 798 00:38:36,830 --> 00:38:40,220 And when we perform that analysis, we get this list of words. 799 00:38:40,220 --> 00:38:42,260 The closest one is book itself. 800 00:38:42,260 --> 00:38:46,640 But we also have books plural, and then essay, memoir, essays, novella, 801 00:38:46,640 --> 00:38:48,050 anthology, and so on. 802 00:38:48,050 --> 00:38:52,040 All of these words mean something similar to the word book, according 803 00:38:52,040 --> 00:38:55,970 to word2vec, at least, because they have a similar vector representation. 804 00:38:55,970 --> 00:38:58,250 So it seems like we've done a pretty good job 805 00:38:58,250 --> 00:39:03,200 of trying to capture this kind of vector representation of word meaning. 806 00:39:03,200 --> 00:39:05,990 One other interesting side effect of word2vec 807 00:39:05,990 --> 00:39:08,150 is that it's also able to capture something 808 00:39:08,150 --> 00:39:11,240 about the relationships between words, as well. 809 00:39:11,240 --> 00:39:12,770 Let's take a look at an example. 810 00:39:12,770 --> 00:39:16,130 Here, for instance, are two words, man and king. 811 00:39:16,130 --> 00:39:19,740 And these are each represented by word2vec as vectors. 812 00:39:19,740 --> 00:39:24,750 So what might happen if I subtracted one from the other, calculated the value 813 00:39:24,750 --> 00:39:26,700 king minus man? 
814 00:39:26,700 --> 00:39:30,600 Well, that will be the vector that will take us from man to king, 815 00:39:30,600 --> 00:39:33,840 somehow represent this relationship between the vector 816 00:39:33,840 --> 00:39:38,310 representation of the word man, and the vector representation of the word king. 817 00:39:38,310 --> 00:39:41,820 And that's what this value, king minus man, represents. 818 00:39:41,820 --> 00:39:46,260 So what would happen if I took the vector representation of the word woman 819 00:39:46,260 --> 00:39:50,550 and added that same value, king minus man, to it? 820 00:39:50,550 --> 00:39:54,300 What would we get as the closest word to that, for example? 821 00:39:54,300 --> 00:39:55,230 Well, we could try it. 822 00:39:55,230 --> 00:39:59,280 Let's go ahead and go back to our python interpreter and give this a try. 823 00:39:59,280 --> 00:40:03,690 I could say, what is the closest word to the vector representation of the word 824 00:40:03,690 --> 00:40:06,810 king minus the representation of the word man, 825 00:40:06,810 --> 00:40:10,710 plus the representation of the word woman? 826 00:40:10,710 --> 00:40:13,740 And we see that the closest word is the word queen. 827 00:40:13,740 --> 00:40:17,040 We've somehow been able to capture the relationship between king and man, 828 00:40:17,040 --> 00:40:23,310 and then we apply it to the word woman, we get as the result, the word queen. 829 00:40:23,310 --> 00:40:27,180 So word2vec has been able to capture not just the words and how they're 830 00:40:27,180 --> 00:40:30,300 similar to each other, but also something about the relationships 831 00:40:30,300 --> 00:40:33,840 between words and how those words are connected to each other. 832 00:40:33,840 --> 00:40:36,720 So now that we have this vector representation of words, 833 00:40:36,720 --> 00:40:37,920 what can we now do with it? 834 00:40:37,920 --> 00:40:40,470 Now we can represent words as numbers, and so we 835 00:40:40,470 --> 00:40:44,400 might try to pass those words as input to say, a neural network. 836 00:40:44,400 --> 00:40:46,650 Neural networks we've seen are very powerful tools 837 00:40:46,650 --> 00:40:50,010 for identifying patterns and making predictions. 838 00:40:50,010 --> 00:40:53,070 Recall that a neural network you can think of as all of these units. 839 00:40:53,070 --> 00:40:56,460 But really what the neural network is doing, is taking some input, 840 00:40:56,460 --> 00:40:59,610 passing it into the network, and then producing some output. 841 00:40:59,610 --> 00:41:02,130 And by providing the neural network with training data, 842 00:41:02,130 --> 00:41:04,950 we're able to update the weights inside of the network, 843 00:41:04,950 --> 00:41:08,610 so that the neural network can do a more accurate job of translating 844 00:41:08,610 --> 00:41:10,950 those inputs into those outputs. 845 00:41:10,950 --> 00:41:13,890 And now that we can represent words as numbers 846 00:41:13,890 --> 00:41:15,910 that could be the input or output, you could 847 00:41:15,910 --> 00:41:19,330 imagine passing a word in as input to a neural network 848 00:41:19,330 --> 00:41:20,980 and getting a word as output. 849 00:41:20,980 --> 00:41:22,750 And so when might that be useful? 850 00:41:22,750 --> 00:41:26,230 One common use for neural networks is in machine translation. 851 00:41:26,230 --> 00:41:29,440 When we want to translate text from one language into another. 
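Before turning to neural networks, here is one way to try these same comparisons, and the king minus man plus woman analogy, with a widely available pre-trained model. This sketch assumes the gensim library and its downloadable GloVe vectors, a related technique that also produces distributed word vectors, so the exact numbers and neighbors will differ from the ones shown in the demo.

# A hedged sketch using the gensim library and one of its downloadable
# pre-trained models (GloVe rather than word2vec, but the same kind of
# distributed word vectors).

import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-100")  # downloads on first use

# Distance-style comparisons (gensim's distance is 1 minus cosine similarity).
print(vectors.distance("book", "novel"))
print(vectors.distance("book", "breakfast"))
print(vectors.distance("lunch", "breakfast"))

# Nearest neighbors of "book".
print(vectors.most_similar("book", topn=10))

# The king - man + woman analogy.
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))

With word vectors like these in hand, we can get back to the question of feeding language into a neural network.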
852 00:41:29,440 --> 00:41:33,820 Say, translate English into French, by passing English into the neural network 853 00:41:33,820 --> 00:41:35,530 and getting some French output. 854 00:41:35,530 --> 00:41:39,130 You might imagine, for instance, that we could take the English word for lamp, 855 00:41:39,130 --> 00:41:43,090 pass it into the neural network, get the French word for lamp as output. 856 00:41:43,090 --> 00:41:47,350 But in practice, when we're translating text from one language to another, 857 00:41:47,350 --> 00:41:51,370 we're usually not just interested in translating a single word from one 858 00:41:51,370 --> 00:41:53,320 language to another, but a sequence. 859 00:41:53,320 --> 00:41:55,660 Say, a sentence or a paragraph of words. 860 00:41:55,660 --> 00:41:57,730 Here, for example, is another paragraph, again 861 00:41:57,730 --> 00:42:01,390 taken from Sherlock Holmes written in English, and what I might want to do, 862 00:42:01,390 --> 00:42:04,960 is take that entire sentence, pass it into the neural network, 863 00:42:04,960 --> 00:42:09,430 and get as output, a French translation of the same sentence. 864 00:42:09,430 --> 00:42:12,070 But recall that a neural network's input and output 865 00:42:12,070 --> 00:42:14,260 need to be of some fixed size. 866 00:42:14,260 --> 00:42:16,660 And a sentence is not a fixed size, it's variable in length. 867 00:42:16,660 --> 00:42:19,990 You might have shorter sentences and you might have longer sentences. 868 00:42:19,990 --> 00:42:23,020 So somehow, we need to solve the problem of translating 869 00:42:23,020 --> 00:42:27,100 a sequence into another sequence by means of a neural network. 870 00:42:27,100 --> 00:42:29,980 And that's going to be true not only for machine translation, 871 00:42:29,980 --> 00:42:33,340 but also for other problems, problems like question answering. 872 00:42:33,340 --> 00:42:36,280 If I want to pass as input a question, something like, 873 00:42:36,280 --> 00:42:38,740 what is the capital of Massachusetts, feed 874 00:42:38,740 --> 00:42:41,080 that as input into the neural network, I would 875 00:42:41,080 --> 00:42:45,250 hope that what I would get as output is a sentence like, the capital is Boston. 876 00:42:45,250 --> 00:42:49,330 Again, translating some sequence into some other sequence. 877 00:42:49,330 --> 00:42:52,870 And if you've ever had a conversation with an AI chatbot 878 00:42:52,870 --> 00:42:55,420 or have ever asked your phone a question, 879 00:42:55,420 --> 00:42:56,920 it needs to do something like this. 880 00:42:56,920 --> 00:43:00,220 It needs to understand the sequence of words that you, the human, 881 00:43:00,220 --> 00:43:02,890 provided as input, and then the computer needs 882 00:43:02,890 --> 00:43:05,470 to generate some sequence of words as output. 883 00:43:05,470 --> 00:43:06,880 So how can we do this? 884 00:43:06,880 --> 00:43:10,180 Well, one tool that we can use, is the recurrent neural network, 885 00:43:10,180 --> 00:43:13,360 which we took a look at last time, which is a way for us to provide 886 00:43:13,360 --> 00:43:16,150 a sequence of values to a neural network by running 887 00:43:16,150 --> 00:43:18,010 the neural network multiple times. 888 00:43:18,010 --> 00:43:21,490 And each time we run the neural network, what we're going to do, 889 00:43:21,490 --> 00:43:24,400 is we're going to keep track of some hidden state.
890 00:43:24,400 --> 00:43:26,290 And that hidden state is going to be passed 891 00:43:26,290 --> 00:43:29,650 from one run of the neural network to the next run of the neural network, 892 00:43:29,650 --> 00:43:32,530 keeping track of all of the relevant information. 893 00:43:32,530 --> 00:43:35,920 And so let's take a look at how we could apply that to something like this. 894 00:43:35,920 --> 00:43:39,850 And in particular, we're going to look at an architecture known as an encoder 895 00:43:39,850 --> 00:43:42,610 decoder architecture, where we're going to encode 896 00:43:42,610 --> 00:43:45,610 this question into some kind of hidden state, 897 00:43:45,610 --> 00:43:49,720 and then use a decoder to decode that hidden state into the output 898 00:43:49,720 --> 00:43:51,323 that we're interested in. 899 00:43:51,323 --> 00:43:52,740 So what's that going to look like? 900 00:43:52,740 --> 00:43:54,990 We'll start with the first word, the word what. 901 00:43:54,990 --> 00:43:57,030 That goes into our neural network. 902 00:43:57,030 --> 00:44:00,000 And it's going to produce some hidden state. 903 00:44:00,000 --> 00:44:04,200 This is some information about the word what that our neural network is 904 00:44:04,200 --> 00:44:05,940 going to need to keep track of. 905 00:44:05,940 --> 00:44:08,700 Then when the second word comes along, we're 906 00:44:08,700 --> 00:44:11,640 going to feed it into that same encoder neural network, 907 00:44:11,640 --> 00:44:15,300 but it's going to get as input that hidden state, as well. 908 00:44:15,300 --> 00:44:17,610 So we pass in the second word, we also get 909 00:44:17,610 --> 00:44:19,860 the information about the hidden state, and that's 910 00:44:19,860 --> 00:44:22,710 going to continue for the other words in the input. 911 00:44:22,710 --> 00:44:24,870 This is going to produce a new hidden state. 912 00:44:24,870 --> 00:44:29,610 And so then when we get to the third word, the, that goes into the encoder, 913 00:44:29,610 --> 00:44:31,740 it also gets access to the hidden state. 914 00:44:31,740 --> 00:44:35,010 And then it produces a new hidden state that gets passed in to the next run 915 00:44:35,010 --> 00:44:36,450 when we use the word capital. 916 00:44:36,450 --> 00:44:39,300 And the same thing is going to repeat for the other words that 917 00:44:39,300 --> 00:44:40,920 appear in the input. 918 00:44:40,920 --> 00:44:46,560 So of, Massachusetts, that produces one final piece of hidden state. 919 00:44:46,560 --> 00:44:49,470 Now somehow, we need to signal the fact that we're done. 920 00:44:49,470 --> 00:44:51,040 There's nothing left in the input. 921 00:44:51,040 --> 00:44:54,010 And we typically do this by passing some kind of special token, 922 00:44:54,010 --> 00:44:56,770 say an end token, into the neural network. 923 00:44:56,770 --> 00:44:59,830 And now the decoding process is going to start. 924 00:44:59,830 --> 00:45:02,620 We're going to generate the word, the. 925 00:45:02,620 --> 00:45:05,410 But in addition to generating the word, the, 926 00:45:05,410 --> 00:45:10,420 this decoder network is also going to generate some kind of hidden state. 927 00:45:10,420 --> 00:45:12,490 And so what happens the next time? 928 00:45:12,490 --> 00:45:14,770 Well, to generate the next word, it might 929 00:45:14,770 --> 00:45:17,830 be helpful to know what the first word was. 930 00:45:17,830 --> 00:45:22,210 So we might pass the first word, the, back into the decoder network. 
931 00:45:22,210 --> 00:45:24,280 It's going to get as input this hidden state, 932 00:45:24,280 --> 00:45:26,860 and it's going to generate the next word, capital. 933 00:45:26,860 --> 00:45:29,380 And that's also going to generate some hidden state. 934 00:45:29,380 --> 00:45:31,810 And we'll repeat that, passing capital into the network 935 00:45:31,810 --> 00:45:35,230 to generate the third word, is, and then one more time, in order 936 00:45:35,230 --> 00:45:37,330 to get the fourth word, Boston. 937 00:45:37,330 --> 00:45:38,800 And at that point, we're done. 938 00:45:38,800 --> 00:45:40,210 But how do we know we're done? 939 00:45:40,210 --> 00:45:42,250 Usually we'll do this one more time. 940 00:45:42,250 --> 00:45:46,030 Pass Boston into the decoder network and get as output 941 00:45:46,030 --> 00:45:49,990 some end token to indicate that that is the end of our output. 942 00:45:49,990 --> 00:45:53,080 And so this then is how we could use a recurrent neural network 943 00:45:53,080 --> 00:45:56,500 to take some input, encode it into some hidden state, 944 00:45:56,500 --> 00:46:00,580 and then use that hidden state to decode it into the output we're interested in. 945 00:46:00,580 --> 00:46:04,120 To visualize it in a slightly different way, we have some input sequence. 946 00:46:04,120 --> 00:46:06,070 This is just some sequence of words. 947 00:46:06,070 --> 00:46:09,820 That input sequence goes into the encoder, which in this case, 948 00:46:09,820 --> 00:46:13,570 is a recurrent neural network generating these hidden states along the way, 949 00:46:13,570 --> 00:46:16,900 until we generate some final hidden state, at which point, 950 00:46:16,900 --> 00:46:18,580 we start the decoding process. 951 00:46:18,580 --> 00:46:20,530 Again, using a recurrent neural network. 952 00:46:20,530 --> 00:46:23,290 That's going to generate the output sequence, as well. 953 00:46:23,290 --> 00:46:25,480 So we've got the encoder, which is encoding 954 00:46:25,480 --> 00:46:28,870 the information about the input sequence into this hidden state. 955 00:46:28,870 --> 00:46:31,750 And then the decoder, which takes that hidden state 956 00:46:31,750 --> 00:46:35,620 and uses it in order to generate the output sequence. 957 00:46:35,620 --> 00:46:37,150 But there are some problems. 958 00:46:37,150 --> 00:46:39,370 And for many years, this was the state of the art. 959 00:46:39,370 --> 00:46:41,830 The recurrent neural network and variants on this approach 960 00:46:41,830 --> 00:46:44,890 were some of the best ways we knew in order to perform tasks 961 00:46:44,890 --> 00:46:46,182 in natural language processing. 962 00:46:46,182 --> 00:46:48,973 But there are some problems that we might want to try to deal with, 963 00:46:48,973 --> 00:46:50,890 and that have been dealt with over the years 964 00:46:50,890 --> 00:46:53,770 to try and improve upon this kind of model. 965 00:46:53,770 --> 00:46:57,610 And one problem you might notice happens in this encoder stage. 966 00:46:57,610 --> 00:47:00,430 We've taken this input sequence, the sequence of words, 967 00:47:00,430 --> 00:47:04,780 and encoded it all into this final piece of hidden state. 968 00:47:04,780 --> 00:47:09,010 And that final piece of hidden state needs to contain all of the information 969 00:47:09,010 --> 00:47:14,050 from the input sequence that we need in order to generate the output sequence.
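Here is a heavily simplified sketch of that encoder-decoder idea using PyTorch GRUs. It is untrained, and the vocabulary size, dimensions, and token ids are arbitrary assumptions; the point is just to show the hidden state flowing through the encoder and the decoder generating one token at a time until an end token appears. Notice that the decoder sees only the single final hidden state, which is exactly the limitation discussed next.

# A minimal, untrained sketch of a sequence-to-sequence encoder-decoder
# built from GRUs. Vocabulary size, dimensions, and token ids are made up.

import torch
import torch.nn as nn

VOCAB_SIZE = 1000   # size of a made-up shared vocabulary
HIDDEN = 128
END_TOKEN = 0       # assume id 0 marks the end of a sequence


class Encoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, HIDDEN)
        self.rnn = nn.GRU(HIDDEN, HIDDEN, batch_first=True)

    def forward(self, tokens):
        # tokens: (batch, sequence_length) of word ids
        outputs, hidden = self.rnn(self.embed(tokens))
        # `outputs` holds the hidden state after every input word;
        # `hidden` is the final hidden state summarizing the whole input.
        return outputs, hidden


class Decoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, HIDDEN)
        self.rnn = nn.GRU(HIDDEN, HIDDEN, batch_first=True)
        self.out = nn.Linear(HIDDEN, VOCAB_SIZE)

    def forward(self, previous_token, hidden):
        output, hidden = self.rnn(self.embed(previous_token), hidden)
        scores = self.out(output.squeeze(1))    # scores over the vocabulary
        return scores, hidden


encoder, decoder = Encoder(), Decoder()
question = torch.randint(1, VOCAB_SIZE, (1, 6))   # stand-in for the input words

_, hidden = encoder(question)                     # encode the whole question
token = torch.tensor([[END_TOKEN]])               # decoding starts from the end token

for _ in range(20):                               # generate at most 20 output words
    scores, hidden = decoder(token, hidden)
    token = scores.argmax(dim=1, keepdim=True)    # pick the most likely next word
    if token.item() == END_TOKEN:                 # stop once the end token appears
        break

# Note that the decoder relies entirely on the single final hidden state
# produced by the encoder.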
970 00:47:14,050 --> 00:47:17,440 And while that's possible, it becomes increasingly difficult 971 00:47:17,440 --> 00:47:19,690 as the sequence gets larger and larger. 972 00:47:19,690 --> 00:47:22,240 For larger and larger input sequences, it's 973 00:47:22,240 --> 00:47:24,310 going to become more and more difficult to store 974 00:47:24,310 --> 00:47:28,930 all of the information we need about the input inside this single hidden state 975 00:47:28,930 --> 00:47:30,010 piece of context. 976 00:47:30,010 --> 00:47:33,070 That's a lot of information to pack into just a single value. 977 00:47:33,070 --> 00:47:36,220 It might be useful for us when generating output, 978 00:47:36,220 --> 00:47:41,860 to not just refer to this one value, but to all of the previous hidden values 979 00:47:41,860 --> 00:47:44,410 that have been generated by the encoder. 980 00:47:44,410 --> 00:47:45,610 And so that might be useful. 981 00:47:45,610 --> 00:47:46,390 But how could we do that? 982 00:47:46,390 --> 00:47:47,890 We've got a lot of different values. 983 00:47:47,890 --> 00:47:49,420 We need to combine them somehow. 984 00:47:49,420 --> 00:47:52,990 So you could imagine adding them together, taking the average of them, 985 00:47:52,990 --> 00:47:53,700 for example. 986 00:47:53,700 --> 00:47:57,270 But doing that would assume that all of these pieces of hidden state 987 00:47:57,270 --> 00:47:58,920 are equally important. 988 00:47:58,920 --> 00:48:00,780 But that's not necessarily true either. 989 00:48:00,780 --> 00:48:02,940 Some of these pieces of hidden state are going 990 00:48:02,940 --> 00:48:06,000 to be more important than others, depending on what word they 991 00:48:06,000 --> 00:48:07,800 most closely correspond to. 992 00:48:07,800 --> 00:48:11,250 This piece of hidden state very closely corresponds to the first word 993 00:48:11,250 --> 00:48:12,330 of the input sequence. 994 00:48:12,330 --> 00:48:16,020 This one very closely corresponds to the second word of the input sequence, 995 00:48:16,020 --> 00:48:16,980 for example. 996 00:48:16,980 --> 00:48:20,460 And some of those are going to be more important than others. 997 00:48:20,460 --> 00:48:22,830 To make matters more complicated, depending 998 00:48:22,830 --> 00:48:25,770 on which word of the output sequence we're generating, 999 00:48:25,770 --> 00:48:29,220 different input words might be more or less important. 1000 00:48:29,220 --> 00:48:31,950 And so what we really want, is some way to decide 1001 00:48:31,950 --> 00:48:36,330 for ourselves which of the input values are worth paying attention to 1002 00:48:36,330 --> 00:48:37,770 at what point in time. 1003 00:48:37,770 --> 00:48:41,490 And this is the key idea behind a mechanism known as Attention. 1004 00:48:41,490 --> 00:48:44,760 Attention is all about letting us decide which 1005 00:48:44,760 --> 00:48:47,280 values are important to pay attention to when 1006 00:48:47,280 --> 00:48:51,210 generating, in this case, the next word in our sequence. 1007 00:48:51,210 --> 00:48:53,430 So let's take a look at an example of that. 1008 00:48:53,430 --> 00:48:54,600 Here's a sentence. 1009 00:48:54,600 --> 00:48:57,030 What is the capital of Massachusetts. 1010 00:48:57,030 --> 00:48:58,350 Same sentence as before. 1011 00:48:58,350 --> 00:49:02,490 And let's imagine that we were trying to answer that question by generating 1012 00:49:02,490 --> 00:49:03,510 tokens of output. 1013 00:49:03,510 --> 00:49:05,190 So what would the output look like? 
1014 00:49:05,190 --> 00:49:08,340 Well, it's going to look like something like the capital is. 1015 00:49:08,340 --> 00:49:11,850 And let's say we're now trying to generate this last word here. 1016 00:49:11,850 --> 00:49:13,230 What is that last word? 1017 00:49:13,230 --> 00:49:15,990 How is the computer going to figure it out? 1018 00:49:15,990 --> 00:49:19,290 Well, what it's going to need to do, is decide which 1019 00:49:19,290 --> 00:49:21,540 values it's going to pay attention to. 1020 00:49:21,540 --> 00:49:23,910 And so the Attention mechanism will allow 1021 00:49:23,910 --> 00:49:27,450 us to calculate some Attention scores for each word, 1022 00:49:27,450 --> 00:49:30,960 some value corresponding to each word, determining 1023 00:49:30,960 --> 00:49:35,490 how relevant is it for us to pay attention to that word right now. 1024 00:49:35,490 --> 00:49:38,520 And in this case, when generating the fourth word of the output 1025 00:49:38,520 --> 00:49:42,840 sequence, the most important words to pay attention to might be capital 1026 00:49:42,840 --> 00:49:46,380 and Massachusetts, for example, that those words 1027 00:49:46,380 --> 00:49:48,423 are going to be particularly relevant. 1028 00:49:48,423 --> 00:49:50,340 And there are a number of different mechanisms 1029 00:49:50,340 --> 00:49:53,140 that have been used in order to calculate these attention scores. 1030 00:49:53,140 --> 00:49:56,200 It could be something as simple as a dot product to see 1031 00:49:56,200 --> 00:49:59,740 how similar two vectors are, or we could train an entire neural network 1032 00:49:59,740 --> 00:50:01,360 to calculate these Attention scores. 1033 00:50:01,360 --> 00:50:03,970 But the key idea, is that during the training 1034 00:50:03,970 --> 00:50:05,860 process for our neural network, we're going 1035 00:50:05,860 --> 00:50:08,890 to learn how to calculate these Attention scores. 1036 00:50:08,890 --> 00:50:13,210 Our model is going to learn what is important to pay attention to in order 1037 00:50:13,210 --> 00:50:16,450 to decide what the next word should be. 1038 00:50:16,450 --> 00:50:19,690 So the result of all of this, calculating these Attention scores, 1039 00:50:19,690 --> 00:50:23,950 is that we can calculate some value, some value for each input word, 1040 00:50:23,950 --> 00:50:27,490 determining how important is it for us to pay attention 1041 00:50:27,490 --> 00:50:29,140 to that particular value. 1042 00:50:29,140 --> 00:50:31,330 And recall that each of these input words 1043 00:50:31,330 --> 00:50:36,550 is also associated with one of these hidden state context vectors, capturing 1044 00:50:36,550 --> 00:50:39,040 information about the sentence up to that point, 1045 00:50:39,040 --> 00:50:42,880 but primarily focused on that word in particular. 1046 00:50:42,880 --> 00:50:45,760 And so what we can now do, is if we have all of these vectors 1047 00:50:45,760 --> 00:50:48,790 and we have values representing how important is it 1048 00:50:48,790 --> 00:50:51,580 for us to pay attention to those particular vectors, 1049 00:50:51,580 --> 00:50:53,650 is we can take a weighted average. 
1050 00:50:53,650 --> 00:50:56,440 We can take all of these vectors, multiply them 1051 00:50:56,440 --> 00:50:58,960 by their Attention scores, and add them up 1052 00:50:58,960 --> 00:51:01,420 to get some new vector value, which is going 1053 00:51:01,420 --> 00:51:04,630 to represent the context from the input, but specifically 1054 00:51:04,630 --> 00:51:08,800 paying attention to the words that we think are most important. 1055 00:51:08,800 --> 00:51:13,510 And once we've done that, that context vector can be fed into our decoder 1056 00:51:13,510 --> 00:51:17,920 in order to say that the word should be, in this case, Boston. 1057 00:51:17,920 --> 00:51:21,130 So Attention is this very powerful tool that 1058 00:51:21,130 --> 00:51:23,680 allows any word when we're trying to decode it, 1059 00:51:23,680 --> 00:51:27,490 to decide which words from the input should we pay attention to 1060 00:51:27,490 --> 00:51:30,550 in order to determine what's important for generating 1061 00:51:30,550 --> 00:51:32,710 the next word of the output. 1062 00:51:32,710 --> 00:51:34,990 And one of the first places this was really used, 1063 00:51:34,990 --> 00:51:37,270 was in the field of machine translation. 1064 00:51:37,270 --> 00:51:39,670 Here's an example of a diagram from the paper that 1065 00:51:39,670 --> 00:51:42,100 introduced this idea, which was focused on trying 1066 00:51:42,100 --> 00:51:45,250 to translate English sentences into French sentences. 1067 00:51:45,250 --> 00:51:47,950 So we have an input English sentence up along the top, 1068 00:51:47,950 --> 00:51:50,590 and then along the left side, the output French equivalent 1069 00:51:50,590 --> 00:51:51,940 of that same sentence. 1070 00:51:51,940 --> 00:51:55,810 And what you see in all of these squares are the Attention scores 1071 00:51:55,810 --> 00:52:00,580 visualized, where a lighter square indicates a higher Attention score. 1072 00:52:00,580 --> 00:52:03,610 And what you'll notice, is that there's a strong correspondence 1073 00:52:03,610 --> 00:52:06,910 between the French word and the equivalent English word. 1074 00:52:06,910 --> 00:52:09,610 That the French word for agreement is really 1075 00:52:09,610 --> 00:52:12,130 paying attention to the English word for agreement 1076 00:52:12,130 --> 00:52:14,440 in order to decide what French word should 1077 00:52:14,440 --> 00:52:16,480 be generated at that point in time. 1078 00:52:16,480 --> 00:52:18,820 And sometimes you might pay attention to multiple words. 1079 00:52:18,820 --> 00:52:21,640 If you look at the French word for economic, 1080 00:52:21,640 --> 00:52:25,150 that's primarily paying attention to the English word for economic, 1081 00:52:25,150 --> 00:52:29,770 but also paying attention to the English word for European, in this case, too. 1082 00:52:29,770 --> 00:52:34,600 And so Attention scores are very easy to visualize to get a sense for what 1083 00:52:34,600 --> 00:52:37,540 is our machine learning model really paying attention to. 1084 00:52:37,540 --> 00:52:41,290 What information is it using in order to determine what's important 1085 00:52:41,290 --> 00:52:45,220 and what's not in order to determine what the ultimate output token should 1086 00:52:45,220 --> 00:52:46,210 be. 
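Stripped of everything else, the attention computation described here can be sketched in a few lines. The vectors below are random stand-ins for real hidden states; the point is the pattern of scoring each hidden state, turning the scores into weights with a softmax, and taking the weighted average as the context.

# A sketch of the core attention computation: score each encoder hidden
# state against the decoder's current state, softmax the scores into
# weights, and take the weighted average as the context vector.

import numpy as np

rng = np.random.default_rng(0)

hidden_size = 8
input_words = ["what", "is", "the", "capital", "of", "massachusetts"]

# One hidden state per input word (what the encoder would have produced),
# plus the decoder's current state while it decides on the next output word.
encoder_states = rng.normal(size=(len(input_words), hidden_size))
decoder_state = rng.normal(size=hidden_size)

# Attention scores: here, a simple dot product between the decoder state
# and each encoder state. (This could also be a small trained network.)
scores = encoder_states @ decoder_state

# Softmax turns the scores into weights that are positive and sum to 1.
weights = np.exp(scores - scores.max())
weights /= weights.sum()

# The context vector is the attention-weighted average of the hidden states.
context = weights @ encoder_states

for word, weight in zip(input_words, weights):
    print(f"{word:15s} {weight:.3f}")
print("context vector:", context.round(3))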
1087 00:52:46,210 --> 00:52:48,580 And so when we combine the Attention mechanism 1088 00:52:48,580 --> 00:52:52,390 with a recurrent neural network, we can get very powerful and useful results, 1089 00:52:52,390 --> 00:52:55,780 where we're able to generate an output sequence by paying attention 1090 00:52:55,780 --> 00:52:57,430 to the input sequence, too. 1091 00:52:57,430 --> 00:53:00,100 But there are other problems with this approach of using 1092 00:53:00,100 --> 00:53:01,690 a recurrent neural network, as well. 1093 00:53:01,690 --> 00:53:04,930 In particular, notice that every run of the neural network 1094 00:53:04,930 --> 00:53:07,270 depends on the output of the previous step. 1095 00:53:07,270 --> 00:53:10,240 And that was important for getting a sense for the sequence of words 1096 00:53:10,240 --> 00:53:12,130 and the ordering of those particular words. 1097 00:53:12,130 --> 00:53:15,340 But we can't run this unit of the neural network 1098 00:53:15,340 --> 00:53:18,730 until after we've calculated the hidden state from the run 1099 00:53:18,730 --> 00:53:21,010 before it, from the previous input token. 1100 00:53:21,010 --> 00:53:25,300 And what that means, is that it's very difficult to parallelize this process. 1101 00:53:25,300 --> 00:53:27,850 As the input sequence gets longer and longer, 1102 00:53:27,850 --> 00:53:30,820 we might want to use parallelism to try and speed up 1103 00:53:30,820 --> 00:53:32,920 this process of training the neural network 1104 00:53:32,920 --> 00:53:34,960 and making sense of all of this language data. 1105 00:53:34,960 --> 00:53:37,300 But it's difficult to do that and it's slow to do 1106 00:53:37,300 --> 00:53:39,670 that with a recurrent neural network, because all of it 1107 00:53:39,670 --> 00:53:41,800 needs to be performed in sequence. 1108 00:53:41,800 --> 00:53:44,140 And that's become an increasing challenge 1109 00:53:44,140 --> 00:53:47,260 as we've started to get larger and larger language models. 1110 00:53:47,260 --> 00:53:49,660 The more language data that we have available to us 1111 00:53:49,660 --> 00:53:52,650 to use to train our machine learning models, the more accurate 1112 00:53:52,650 --> 00:53:55,590 they can be, the better representation of language they can have, 1113 00:53:55,590 --> 00:53:59,430 the better understanding they can have, and the better results that we can see. 1114 00:53:59,430 --> 00:54:02,220 And so we've seen this growth of large language models 1115 00:54:02,220 --> 00:54:05,400 that are using larger and larger datasets, but as a result, 1116 00:54:05,400 --> 00:54:07,380 they take longer and longer to train. 1117 00:54:07,380 --> 00:54:10,650 And so this problem, that recurrent neural networks are not 1118 00:54:10,650 --> 00:54:14,400 easy to parallelize, has become an increasing problem. 1119 00:54:14,400 --> 00:54:17,250 And as a result of that, that was one of the main motivations 1120 00:54:17,250 --> 00:54:20,130 for a different architecture for thinking about how 1121 00:54:20,130 --> 00:54:21,870 to deal with natural language. 1122 00:54:21,870 --> 00:54:24,480 And that's known as the Transformer architecture.
1123 00:54:24,480 --> 00:54:26,760 And this has been a significant milestone 1124 00:54:26,760 --> 00:54:28,620 in the world of natural language processing 1125 00:54:28,620 --> 00:54:32,640 for really increasing how well we can perform these kinds of natural language 1126 00:54:32,640 --> 00:54:35,280 processing tasks, as well as how quickly we 1127 00:54:35,280 --> 00:54:39,240 can train a machine learning model to be able to produce effective results. 1128 00:54:39,240 --> 00:54:42,600 There are a number of different types of Transformers in terms of how they work, 1129 00:54:42,600 --> 00:54:44,433 but what we're going to take a look at here, 1130 00:54:44,433 --> 00:54:48,120 is the basic architecture for how one might work with a Transformer 1131 00:54:48,120 --> 00:54:51,280 to get a sense for what's involved and what we're doing. 1132 00:54:51,280 --> 00:54:55,060 So let's start with the model we were looking at before, specifically 1133 00:54:55,060 --> 00:54:58,420 at this encoder part of our encoder decoder architecture, 1134 00:54:58,420 --> 00:55:02,020 where we used a recurrent neural network to take this input sequence 1135 00:55:02,020 --> 00:55:05,560 and capture all of this information about the hidden state 1136 00:55:05,560 --> 00:55:08,860 and the information we need to know about that input sequence. 1137 00:55:08,860 --> 00:55:12,550 Right now, it all needs to happen in this linear progression. 1138 00:55:12,550 --> 00:55:14,980 But what the Transformer is going to allow us to do, 1139 00:55:14,980 --> 00:55:17,710 is process each of the words independently 1140 00:55:17,710 --> 00:55:19,510 in a way that's easy to parallelize. 1141 00:55:19,510 --> 00:55:22,000 Rather than have each word wait for some other word, 1142 00:55:22,000 --> 00:55:25,360 each word is going to go through this same neural network 1143 00:55:25,360 --> 00:55:30,110 and produce some kind of encoded representation of that particular input 1144 00:55:30,110 --> 00:55:30,610 word. 1145 00:55:30,610 --> 00:55:33,200 And all of this is going to happen in parallel. 1146 00:55:33,200 --> 00:55:35,200 Now it's happening for all of the words at once, 1147 00:55:35,200 --> 00:55:37,870 but we're really just going to focus on what's happening for one word 1148 00:55:37,870 --> 00:55:38,647 to make it clear. 1149 00:55:38,647 --> 00:55:41,230 But know that whatever you're seeing happen for this one word, 1150 00:55:41,230 --> 00:55:44,950 is going to happen for all of the other input words, too. 1151 00:55:44,950 --> 00:55:46,690 So what's going on here? 1152 00:55:46,690 --> 00:55:49,150 Well, we start with some input word. 1153 00:55:49,150 --> 00:55:53,410 That input word goes into the neural network, and the output is hopefully, 1154 00:55:53,410 --> 00:55:57,400 some encoded representation of the input word, the information 1155 00:55:57,400 --> 00:56:00,730 we need to know about the input word that's going to be relevant to us 1156 00:56:00,730 --> 00:56:02,620 as we're generating the output. 1157 00:56:02,620 --> 00:56:05,470 And because we're doing this each word independently, 1158 00:56:05,470 --> 00:56:06,700 it's easy to parallelize. 1159 00:56:06,700 --> 00:56:08,770 We don't have to wait for the previous word 1160 00:56:08,770 --> 00:56:12,100 before we run this word through the neural network. 1161 00:56:12,100 --> 00:56:16,210 But what did we lose in this process by trying to parallelize this whole thing? 1162 00:56:16,210 --> 00:56:19,060 Well, we've lost all notion of word ordering. 
1163 00:56:19,060 --> 00:56:20,920 The order of words is important. 1164 00:56:20,920 --> 00:56:23,740 The sentence, Sherlock Holmes gave the book to Watson, 1165 00:56:23,740 --> 00:56:26,920 has a different meaning than Watson gave the book to Sherlock Holmes. 1166 00:56:26,920 --> 00:56:30,760 And so we want to keep track of that information about word position. 1167 00:56:30,760 --> 00:56:33,670 In the recurrent neural network, that happened for us automatically, 1168 00:56:33,670 --> 00:56:37,120 because we could run each word one at a time through the neural network, 1169 00:56:37,120 --> 00:56:41,050 get the hidden state, pass it on to the next run of the neural network. 1170 00:56:41,050 --> 00:56:43,600 But that's not the case here with the Transformer, 1171 00:56:43,600 --> 00:56:48,460 where each word is being processed independent of all of the other ones. 1172 00:56:48,460 --> 00:56:50,800 So what are we going to do to try to solve that problem? 1173 00:56:50,800 --> 00:56:56,440 One thing we can do, is add some kind of positional encoding to the input word. 1174 00:56:56,440 --> 00:56:58,990 The positional encoding is some vector that 1175 00:56:58,990 --> 00:57:01,713 represents the position of the word in the sentence. 1176 00:57:01,713 --> 00:57:04,630 This is the first word, the second word, the third word, and so forth. 1177 00:57:04,630 --> 00:57:07,420 We're going to add that to the input word. 1178 00:57:07,420 --> 00:57:10,600 And the result of that is going to be a vector that captures 1179 00:57:10,600 --> 00:57:12,280 multiple pieces of information. 1180 00:57:12,280 --> 00:57:15,790 It captures the input word itself, as well as where in the sentence 1181 00:57:15,790 --> 00:57:16,720 it appears. 1182 00:57:16,720 --> 00:57:19,240 The result of that, is we can pass the output 1183 00:57:19,240 --> 00:57:23,200 of that addition, the addition of the input word and the positional encoding, 1184 00:57:23,200 --> 00:57:24,400 into the neural network. 1185 00:57:24,400 --> 00:57:26,470 That way the neural network knows the word 1186 00:57:26,470 --> 00:57:28,570 and where it appears in the sentence, and can 1187 00:57:28,570 --> 00:57:31,600 use both of those pieces of information to determine 1188 00:57:31,600 --> 00:57:35,980 how best to represent the meaning of that word in the encoded representation 1189 00:57:35,980 --> 00:57:37,540 at the end of it. 1190 00:57:37,540 --> 00:57:41,470 In addition to what we have here, in addition to the positional encoding 1191 00:57:41,470 --> 00:57:43,630 and this feed forward neural network, we're 1192 00:57:43,630 --> 00:57:46,780 also going to add one additional component, which 1193 00:57:46,780 --> 00:57:49,240 is going to be a Self-Attention step. 1194 00:57:49,240 --> 00:57:52,060 This is going to be Attention where we're paying attention 1195 00:57:52,060 --> 00:57:53,950 to the other input words. 1196 00:57:53,950 --> 00:57:56,680 Because the meaning or interpretation of an input word 1197 00:57:56,680 --> 00:58:00,220 might vary depending on the other words in the input, as well. 1198 00:58:00,220 --> 00:58:02,980 And so we're going to allow each word in the input 1199 00:58:02,980 --> 00:58:05,590 to decide what other words in the input it 1200 00:58:05,590 --> 00:58:10,150 should pay attention to in order to decide on its encoded representation. 
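One common way to implement the positional encoding just described is the sinusoidal scheme from the original Transformer paper, sketched below with made-up embeddings; learned positional embeddings are another frequent choice.

# The sinusoidal positional encoding: each position gets a vector of sines
# and cosines at different frequencies, added to the word's embedding.

import numpy as np


def positional_encoding(num_positions, dim):
    """Return an array of shape (num_positions, dim) of positional encodings."""
    positions = np.arange(num_positions)[:, np.newaxis]       # (positions, 1)
    dims = np.arange(dim)[np.newaxis, :]                      # (1, dim)
    angle_rates = 1 / np.power(10000, (2 * (dims // 2)) / dim)
    angles = positions * angle_rates
    encoding = np.zeros((num_positions, dim))
    encoding[:, 0::2] = np.sin(angles[:, 0::2])               # even dimensions
    encoding[:, 1::2] = np.cos(angles[:, 1::2])               # odd dimensions
    return encoding


# Pretend word embeddings for "sherlock holmes gave the book to watson".
embedding_dim = 16
words = "sherlock holmes gave the book to watson".split()
embeddings = np.random.default_rng(0).normal(size=(len(words), embedding_dim))

# Adding the positional encoding gives each word a representation that
# reflects both what the word is and where it appears in the sentence.
inputs = embeddings + positional_encoding(len(words), embedding_dim)
print(inputs.shape)  # (7, 16)

With positions folded in, the Self-Attention step can then let each word's representation depend on the words around it.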
1201 00:58:10,150 --> 00:58:13,540 And that's going to allow us to get a better encoded representation 1202 00:58:13,540 --> 00:58:16,510 for each word, because words are defined by their context, 1203 00:58:16,510 --> 00:58:20,800 by the words around them and how they're used in that particular context. 1204 00:58:20,800 --> 00:58:23,710 This kind of Self-Attention is so valuable in fact, 1205 00:58:23,710 --> 00:58:25,960 that oftentimes, the Transformer will use 1206 00:58:25,960 --> 00:58:29,860 multiple different Self-Attention layers at the same time 1207 00:58:29,860 --> 00:58:32,530 to allow for this model to be able to pay attention 1208 00:58:32,530 --> 00:58:35,800 to multiple facets of the input at the same time. 1209 00:58:35,800 --> 00:58:39,748 We call this Multi-Headed Attention, where each attention head can 1210 00:58:39,748 --> 00:58:41,290 pay attention to something different. 1211 00:58:41,290 --> 00:58:44,590 And as a result, this network can learn to pay attention 1212 00:58:44,590 --> 00:58:48,880 to many different parts of the input for this input word all at the same time. 1213 00:58:48,880 --> 00:58:51,650 And in the spirit of deep learning, these two steps, 1214 00:58:51,650 --> 00:58:55,460 this Multi-Headed Self-Attention layer and this neural network layer, 1215 00:58:55,460 --> 00:58:58,520 can themselves be repeated multiple times, too, 1216 00:58:58,520 --> 00:59:01,130 in order to get a deeper representation, in order 1217 00:59:01,130 --> 00:59:03,680 to learn deeper patterns within the input text, 1218 00:59:03,680 --> 00:59:06,710 and ultimately, get a better representation of language, 1219 00:59:06,710 --> 00:59:09,950 in order to get useful encoded representations 1220 00:59:09,950 --> 00:59:12,110 of all of the input words. 1221 00:59:12,110 --> 00:59:14,810 And so this is the process that a transformer 1222 00:59:14,810 --> 00:59:19,400 might use in order to take an input word and get its encoded representation. 1223 00:59:19,400 --> 00:59:23,120 And the key idea, is to really rely on this Attention step 1224 00:59:23,120 --> 00:59:25,790 in order to get information that's useful in order 1225 00:59:25,790 --> 00:59:28,190 to determine how to encode that word. 1226 00:59:28,190 --> 00:59:31,370 And that process is going to repeat for all of the input 1227 00:59:31,370 --> 00:59:33,200 words that are in the input sequence. 1228 00:59:33,200 --> 00:59:35,090 We're going to take all of the input words, 1229 00:59:35,090 --> 00:59:38,240 encode them with some kind of positional encoding, 1230 00:59:38,240 --> 00:59:42,320 feed those into these Self-Attention and feed forward neural networks in order 1231 00:59:42,320 --> 00:59:46,340 to ultimately get these encoded representations of the words. 1232 00:59:46,340 --> 00:59:47,960 That's the result of the encoder. 1233 00:59:47,960 --> 00:59:51,390 We get all of these encoded representations that 1234 00:59:51,390 --> 00:59:53,640 will be useful to us when it comes time then 1235 00:59:53,640 --> 00:59:57,390 to try to decode all of this information into the output 1236 00:59:57,390 --> 00:59:58,890 sequence we're interested in. 1237 00:59:58,890 --> 01:00:02,490 And again, this might take place in the context of machine translation, 1238 01:00:02,490 --> 01:00:05,910 where the output is going to be the same sentence in a different language. 1239 01:00:05,910 --> 01:00:09,720 Or it might be an answer to a question, in the case of an AI chatbot, 1240 01:00:09,720 --> 01:00:10,590 for example.
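As a sketch of what that encoder stack might look like in code, PyTorch's built-in modules bundle exactly these pieces: multi-headed self-attention followed by a feed-forward network, repeated several times. The dimensions, head counts, and layer counts here are arbitrary choices for illustration.

# A sketch of a Transformer encoder built from PyTorch's built-in modules.

import torch
import torch.nn as nn

embedding_dim = 64
num_heads = 4    # four attention heads, each free to attend to something different
num_layers = 3   # the self-attention + feed-forward block repeated three times

encoder_layer = nn.TransformerEncoderLayer(
    d_model=embedding_dim,
    nhead=num_heads,
    dim_feedforward=128,
    batch_first=True,
)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)

# A batch of one sentence, seven "words", each already embedded and combined
# with its positional encoding (random stand-ins here).
inputs = torch.randn(1, 7, embedding_dim)

# Every word is processed in parallel, and each word's encoded representation
# can depend on every other word through self-attention.
encoded = encoder(inputs)
print(encoded.shape)  # torch.Size([1, 7, 64])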
1241 01:00:10,590 --> 01:00:15,330 And so now let's take a look at how that decoder is going to work. 1242 01:00:15,330 --> 01:00:18,420 Ultimately, it's going to have a very similar structure. 1243 01:00:18,420 --> 01:00:21,330 Any time we're trying to generate the next output word, 1244 01:00:21,330 --> 01:00:24,480 we need to know what the previous output word is, 1245 01:00:24,480 --> 01:00:28,620 as well as its positional encoding, where in the output sequence are we. 1246 01:00:28,620 --> 01:00:30,600 And we're going to have these same steps. 1247 01:00:30,600 --> 01:00:33,960 Self-Attention, because we might want an output word 1248 01:00:33,960 --> 01:00:37,170 to be able to pay attention to other words in that same output, 1249 01:00:37,170 --> 01:00:38,940 as well as a neural network. 1250 01:00:38,940 --> 01:00:41,820 And that might itself repeat multiple times. 1251 01:00:41,820 --> 01:00:45,240 But in this decoder, we're going to add one additional step. 1252 01:00:45,240 --> 01:00:47,250 We're going to add an additional Attention 1253 01:00:47,250 --> 01:00:50,760 step, where instead of Self-Attention, where the output word is going 1254 01:00:50,760 --> 01:00:54,360 to pay attention to other output words, in this step, 1255 01:00:54,360 --> 01:00:57,660 we're going to allow the output word to pay attention 1256 01:00:57,660 --> 01:00:59,700 to the encoded representations. 1257 01:00:59,700 --> 01:01:03,600 So recall that the encoder is taking all of the input words 1258 01:01:03,600 --> 01:01:07,650 and transforming them into these encoded representations of all of the input 1259 01:01:07,650 --> 01:01:08,220 words. 1260 01:01:08,220 --> 01:01:10,012 But it's going to be important for us to be 1261 01:01:10,012 --> 01:01:12,750 able to decide which of those encoded representations 1262 01:01:12,750 --> 01:01:16,560 we want to pay attention to when generating any particular token 1263 01:01:16,560 --> 01:01:18,000 in the output sequence. 1264 01:01:18,000 --> 01:01:20,040 And that's what this additional Attention 1265 01:01:20,040 --> 01:01:21,990 step is going to allow us to do. 1266 01:01:21,990 --> 01:01:25,560 It's saying that every time we're generating a word of the output, 1267 01:01:25,560 --> 01:01:28,140 we can pay attention to the other words in the output, 1268 01:01:28,140 --> 01:01:31,500 because we might want to know, what are the words we've generated previously. 1269 01:01:31,500 --> 01:01:33,420 And we want to pay attention to some of them 1270 01:01:33,420 --> 01:01:36,900 to decide what word is going to be next in the sequence. 1271 01:01:36,900 --> 01:01:40,530 But we also care about paying attention to the input words, too. 1272 01:01:40,530 --> 01:01:44,490 And we want the ability to decide which of these encoded representations 1273 01:01:44,490 --> 01:01:46,890 of the input words are going to be relevant in order 1274 01:01:46,890 --> 01:01:49,080 for us to generate the next step. 1275 01:01:49,080 --> 01:01:51,300 And so these two pieces combine together. 1276 01:01:51,300 --> 01:01:54,360 We have this encoder that takes all of the input words 1277 01:01:54,360 --> 01:01:56,970 and produces this encoded representation. 1278 01:01:56,970 --> 01:02:00,990 And we have this decoder that is able to take the previous output word, 1279 01:02:00,990 --> 01:02:05,550 pay attention to that encoded input, and then generate the next output word. 
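And the decoder side can be sketched the same way with PyTorch's built-in decoder modules, where each layer applies self-attention over the output generated so far and then cross-attention over the encoder's representations. Again, the dimensions and the random inputs are stand-ins, not a full working translation system.

# A sketch of a Transformer decoder: self-attention over the outputs so far,
# plus cross-attention over the encoder's representations ("memory").

import torch
import torch.nn as nn

embedding_dim = 64

decoder_layer = nn.TransformerDecoderLayer(
    d_model=embedding_dim,
    nhead=4,
    dim_feedforward=128,
    batch_first=True,
)
decoder = nn.TransformerDecoder(decoder_layer, num_layers=3)

# Encoded representations of a 6-word input sentence (what the encoder produced).
memory = torch.randn(1, 6, embedding_dim)

# Embeddings (plus positional encodings) of the 3 output words generated so far.
outputs_so_far = torch.randn(1, 3, embedding_dim)

# Each output position attends to the previous outputs and to the encoded
# input; a final linear layer (omitted here) would turn these into word
# probabilities. During training, a causal mask would also be passed so
# positions can't peek at later output words.
decoded = decoder(tgt=outputs_so_far, memory=memory)
print(decoded.shape)  # torch.Size([1, 3, 64])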
1280 01:02:05,550 --> 01:02:08,220 And this is one of the possible architectures 1281 01:02:08,220 --> 01:02:12,510 we could use for a transformer, with the key idea being these attention 1282 01:02:12,510 --> 01:02:15,720 steps, that allow words to pay attention to each other. 1283 01:02:15,720 --> 01:02:19,508 During the training process here, we can now much more easily parallelize 1284 01:02:19,508 --> 01:02:22,800 this, because we don't have to wait for all of the words to happen in sequence. 1285 01:02:22,800 --> 01:02:26,550 And we can learn how we should perform these attention steps. 1286 01:02:26,550 --> 01:02:30,060 The model is able to learn what is important to pay attention to, 1287 01:02:30,060 --> 01:02:32,040 what things do I need to pay attention to 1288 01:02:32,040 --> 01:02:36,570 in order to be more accurate at predicting what the output word is. 1289 01:02:36,570 --> 01:02:40,800 And this has proved to be a tremendously effective model for conversational AI 1290 01:02:40,800 --> 01:02:43,770 agents, for building machine translation systems. 1291 01:02:43,770 --> 01:02:46,470 And there have been many variants proposed on this model, too. 1292 01:02:46,470 --> 01:02:48,930 Some transformers only use an encoder. 1293 01:02:48,930 --> 01:02:50,580 Some only use a decoder. 1294 01:02:50,580 --> 01:02:54,210 Some use some other combination of these different particular features. 1295 01:02:54,210 --> 01:02:57,390 But the key ideas ultimately remain the same. 1296 01:02:57,390 --> 01:03:01,380 This real focus on trying to pay attention to what is most important. 1297 01:03:01,380 --> 01:03:03,600 And the world of natural language processing 1298 01:03:03,600 --> 01:03:05,700 is fast-growing and fast-evolving. 1299 01:03:05,700 --> 01:03:08,310 Year after year, we keep coming up with new models that 1300 01:03:08,310 --> 01:03:11,190 allow us to do an even better job of performing 1301 01:03:11,190 --> 01:03:13,650 these natural language-related tasks, all 1302 01:03:13,650 --> 01:03:16,320 in the service of solving the tricky problem, which 1303 01:03:16,320 --> 01:03:17,580 is our own natural language. 1304 01:03:17,580 --> 01:03:20,250 We've seen how the syntax and semantics of our language 1305 01:03:20,250 --> 01:03:23,430 is ambiguous and introduces all of these new challenges 1306 01:03:23,430 --> 01:03:25,590 that we need to think about if we're going 1307 01:03:25,590 --> 01:03:30,120 to be able to design AI agents that are able to work with language effectively. 1308 01:03:30,120 --> 01:03:32,848 So as we think about where we've been in this class, all 1309 01:03:32,848 --> 01:03:35,640 of the different types of artificial intelligence we've considered, 1310 01:03:35,640 --> 01:03:39,130 we've looked at artificial intelligence in a wide variety of different forms 1311 01:03:39,130 --> 01:03:39,630 now. 1312 01:03:39,630 --> 01:03:42,510 We started by taking a look at search problems, where 1313 01:03:42,510 --> 01:03:45,060 we looked at how AI can search for solutions, 1314 01:03:45,060 --> 01:03:48,060 play games, and find the optimal decision to make. 1315 01:03:48,060 --> 01:03:52,370 We talked about knowledge, how AI can represent information that it knows 1316 01:03:52,370 --> 01:03:56,270 and use that information to generate new knowledge, as well. 
1317 01:03:56,270 --> 01:03:59,420 Then we looked at what AI can do when it's less certain, when 1318 01:03:59,420 --> 01:04:02,240 it doesn't know things for sure, and we have to represent things 1319 01:04:02,240 --> 01:04:03,590 in terms of probability. 1320 01:04:03,590 --> 01:04:05,780 We then took a look at optimization problems. 1321 01:04:05,780 --> 01:04:08,690 We saw how a lot of problems in AI can be boiled down 1322 01:04:08,690 --> 01:04:12,230 to trying to maximize or minimize some function. 1323 01:04:12,230 --> 01:04:14,840 And we looked at strategies that AI can use in order 1324 01:04:14,840 --> 01:04:17,510 to do that kind of maximizing and minimizing. 1325 01:04:17,510 --> 01:04:19,550 We then looked at the world of machine learning, 1326 01:04:19,550 --> 01:04:22,550 learning from data in order to figure out some patterns 1327 01:04:22,550 --> 01:04:26,030 and identify how to perform a task by looking at the training data 1328 01:04:26,030 --> 01:04:27,500 that we have available to it. 1329 01:04:27,500 --> 01:04:30,770 And one of the most powerful tools there was the neural network, 1330 01:04:30,770 --> 01:04:34,400 the sequence of units whose weights can be trained in order to allow 1331 01:04:34,400 --> 01:04:37,100 us to really effectively go from input to output 1332 01:04:37,100 --> 01:04:40,980 and predict how to get there by learning these underlying patterns. 1333 01:04:40,980 --> 01:04:44,630 And then today, we took a look at language itself, trying to understand 1334 01:04:44,630 --> 01:04:48,350 how can we train the computer to be able to understand our natural language, 1335 01:04:48,350 --> 01:04:51,210 to be able to understand syntax and semantics, 1336 01:04:51,210 --> 01:04:54,540 make sense of and generate natural language, which introduces 1337 01:04:54,540 --> 01:04:56,490 a number of interesting problems, too. 1338 01:04:56,490 --> 01:04:59,640 And we've really just scratched the surface of artificial intelligence. 1339 01:04:59,640 --> 01:05:02,910 There is so much interesting research and interesting new techniques 1340 01:05:02,910 --> 01:05:05,250 and algorithms and ideas being introduced to try 1341 01:05:05,250 --> 01:05:07,140 to solve these types of problems. 1342 01:05:07,140 --> 01:05:09,660 So I hope you enjoyed this exploration into the world 1343 01:05:09,660 --> 01:05:10,950 of artificial intelligence. 1344 01:05:10,950 --> 01:05:13,950 A huge thanks to all of the course's teaching staff and production team 1345 01:05:13,950 --> 01:05:15,360 for making the class possible. 1346 01:05:15,360 --> 01:05:19,730 This was an introduction to Artificial Intelligence with Python. 1347 01:05:19,730 --> 01:05:21,000