[MUSIC PLAYING]

SPEAKER 1: OK, welcome back, everyone, to our final topic in an introduction to artificial intelligence with Python. And today, the topic is language.

So thus far in the class, we've seen a number of different ways of interacting with AI, artificial intelligence, but it's mostly been happening in the way of us formulating problems in ways that it can understand--learning to speak the language of AI, so to speak, by trying to take a problem and formulate it as a search problem, or by trying to take a problem and make it a constraint satisfaction problem--something that our AI is able to understand. Today, we're going to try and come up with algorithms and ideas that allow our AI to meet us halfway, so to speak--to allow AI to be able to understand, and interpret, and get some sort of meaning out of human language--the type of language, the spoken language, like English, or some other language that we naturally speak. And this turns out to be a really challenging task for AI. It really encompasses a number of different types of tasks, all under the broad heading of natural language processing: the idea of coming up with algorithms that allow our AI to be able to process and understand natural language.

So these tasks vary in terms of the types of things we might want an AI to perform, and therefore, the types of algorithms that we might use. But some common tasks that you might see are things like automatic summarization. You give an AI a long document, and you would like for the AI to be able to summarize it--come up with a shorter representation of the same idea, but still in some kind of natural language, like English. Something like information extraction--given a whole corpus of information in some body of documents or on the internet, for example, we'd like for our AI to be able to extract some sort of meaningful semantic information out of all of that content that it's able to look at and read. Language identification--the task of, given a page, can you figure out what language that document is written in?
This is the type of thing you might see if you use a web browser where, if you open up a page in another language, that web browser might ask you: oh, I think it's in this language--would you like me to translate it into English for you, for example? And that language identification process is a task that our AI needs to be able to do, which is then related to machine translation, the process of taking text in one language and translating it into another language--on which there's been a lot of research and development over the course of the last several years. And it keeps getting better, in terms of how it is that AI is able to take text in one language and transform that text into another language as well.

In addition to that, we have topics like named entity recognition. Given some sequence of text, can you pick out what the named entities are? These are names of companies, or names of people, or names of locations, for example, which are often relevant or important parts of a particular document. Speech recognition is a related task, having to do not with text that is written, but with text that is spoken--being able to process audio and figure out, what are the actual words that are spoken there? And if you think about smart home devices, like Siri or Alexa, for example, these are all devices that are now able to listen when we speak, figure out what words we are saying, and draw some sort of meaning out of that as well. We've talked about how you could formulate something like that, for instance, as a hidden Markov model to be able to draw those sorts of conclusions.

Text classification, more generally, is a broad category of types of ideas, whenever we want to take some kind of text and put it into some sort of category. And we've seen these classification type problems and how we can use statistical machine learning approaches to be able to solve them. We'll be able to do something very similar with natural language, though we may need to make a couple of adjustments, as we'll see soon.
And then there's something like word sense disambiguation--the idea that, unlike in the language of numbers, where AI has very precise representations of everything, words are a little bit fuzzy in terms of their meaning, and words can have multiple different meanings. Natural language is inherently ambiguous, and we'll take a look at some of those ambiguities in due time today. But one challenging task, if you want an AI to be able to understand natural language, is being able to disambiguate, or differentiate, between different possible meanings of words. If I say a sentence like, I went to the bank, you need to figure out: do I mean the bank where I deposit and withdraw money, or do I mean the bank like the riverbank? Different words can have different meanings that we might want to figure out. And the context in which a word appears--the wider sentence, or paragraph, or paper in which a particular word appears--might help to inform how it is that we disambiguate between different meanings or different senses that a word might have.

And there are many other topics within natural language processing, many other algorithms that have been devised in order to deal with and address these sorts of problems. Today, we're really just going to scratch the surface, looking at some of the fundamental ideas that lie behind many of these tasks within natural language processing--within this idea of trying to come up with AI algorithms that are able to do something meaningful with the languages that we speak every day.

And so to introduce this idea: when we think about language, we can often think about it in a couple of different parts. The first part refers to the syntax of language. This has to do with just the structure of language and how it is that that structure works. And if you think about natural language, syntax is one of those things that, if you're a native speaker of a language, comes pretty readily to you. You don't have to think too much about it.
If I give you a sentence from Sir Arthur Conan Doyle's Sherlock Holmes, for example, a sentence like this--"just before 9 o'clock, Sherlock Holmes stepped briskly into the room"--I think we could probably all agree that this is a well-formed grammatical sentence. Syntactically, it makes sense, in terms of the way that this particular sentence is structured. And syntax applies not just to natural language, but to programming languages as well. If you've ever seen a syntax error in a program that you've written, it's likely because you wrote some sort of program that was not syntactically well-formed. The structure of it was not a valid program. In the same way, we can look at English sentences, or sentences in any natural language, and make the same kinds of judgments. I can say that this sentence is syntactically well-formed. When all the parts are put together, all these words in this order, it constructs a grammatical sentence--or a sentence that most people would agree is grammatical.

But there are also grammatically ill-formed sentences. A sentence like, "just before Sherlock Holmes 9 o'clock stepped briskly the room"--well, I think we would all agree that this is not a well-formed sentence. Syntactically, it doesn't make sense. And this is the type of thing that, if we want our AI, for example, to be able to generate natural language--to be able to speak to us the way a chatbot would speak to us, for example--then our AI is going to need to know this distinction somehow: to know what kinds of sentences are grammatical and what kinds of sentences are not. We might come up with rules, or with ways to statistically learn these ideas, and we'll talk about some of those methods as well.

Syntax can also be ambiguous. It's not just that some sentences are well-formed and others are not--there are certain ways that you could take a sentence and potentially construct multiple different structures for that sentence. A sentence like, "I saw the man on the mountain with a telescope"--well, this is grammatically well-formed--syntactically, it makes sense--but what is the structure of the sentence?
Is it the man on the mountain who has the telescope, or am I seeing the man on the mountain and using the telescope in order to see him? There's some interesting ambiguity here, where the sentence could have potentially two different types of structures. And this is one of the ideas that we'll come back to as well, in terms of how to think about dealing with AI when natural language is inherently ambiguous.

So that, then, is syntax: the structure of language, and getting an understanding for how it is that, depending on the order and placement of words, we can come up with different structures for language. But in addition to language having structure, language also has meaning. And now we get into the world of semantics: the idea of what it is that a word, or a sequence of words, or a sentence, or an entire essay actually means. And so a sentence like, "just before 9:00, Sherlock Holmes stepped briskly into the room," is a different sentence from a sentence like, "Sherlock Holmes stepped briskly into the room just before 9:00." And yet they have effectively the same meaning. They're different sentences, so an AI reading them would recognize them as different, but we as humans can look at both of the sentences and say: yeah, they mean basically the same thing. And maybe, in this case, that's just because I moved the order of the words around. Originally, 9 o'clock was near the beginning of the sentence. Now 9 o'clock is near the end of the sentence.

But you might imagine that I could come up with a different sentence entirely, a sentence like, "a few minutes before 9:00, Sherlock Holmes walked quickly into the room." And OK, that also has a very similar meaning, but I'm using different words in order to express that idea. Ideally, AI would be able to recognize that these two sentences--these different sets of words that are similar to each other--have similar meanings, and be able to get at that idea as well.

Then there are also ways that a syntactically well-formed sentence might not mean anything at all.
A famous example from linguist Noam Chomsky is this sentence here--"colorless green ideas sleep furiously." Syntactically, that sentence is perfectly fine. Colorless and green are adjectives that modify the noun ideas. Sleep is a verb. Furiously is an adverb. These are correct constructions in terms of the order of words, but it turns out this sentence is meaningless. If you tried to ascribe meaning to the sentence, what does it mean? It's not easy to determine what it is that it might mean.

Semantics itself can also be ambiguous, given that different structures can have different types of meanings, and different words can have different kinds of meanings, so the same sentence with the same structure might end up meaning different types of things. My favorite example is a headline that appeared in the Los Angeles Times a little while back. The headline says, "Big rig carrying fruit crashes on 210 freeway, creates jam." So depending on how it is you look at the sentence--how you interpret the sentence--it can have multiple different meanings. And so here, too, are challenges in this world of natural language processing: being able to understand both the syntax of language and the semantics of language. Today, we'll take a look at both of those ideas.

We're going to start by talking about syntax and getting a sense for how it is that language is structured--how we can start by coming up with some rules, some ways that we can tell our computer, tell our AI, what types of things are valid sentences and what types of things are not valid sentences. And ultimately, we'd like to use that information to allow our AI to draw meaningful conclusions, to be able to do something with language.

And so to do so, we're going to start by introducing the notion of formal grammar. And what formal grammar is all about is this: a formal grammar is a system of rules that generate sentences in a language. I would like to know what the valid English sentences are--not in terms of what they mean--just in terms of their structure, their syntactic structure.
What structures of English are valid, correct sentences? What structures of English are not valid? And this is going to apply in a very similar way to other natural languages as well, where language follows certain types of structures. We intuitively know what these structures mean, but it's going to be helpful to try and really formally define what those structures mean as well.

There are a number of different types of formal grammar, all across what's known as the Chomsky hierarchy of grammars. And you may have seen some of these before. If you've ever worked with regular expressions, those correspond to regular languages, which are one particular type of language on this hierarchy. But also on this hierarchy is a type of grammar known as a context-free grammar. And this is the one we're going to spend the most time taking a look at today. A context-free grammar is a way of generating sentences in a language via what are known as rewriting rules--replacing one symbol with other symbols. And we'll take a look in a moment at just what that means.

So let's imagine, for example, a simple sentence in English, a sentence like, "she saw the city"--a valid, syntactically well-formed English sentence. We'd like some way for our AI to be able to look at the sentence and figure out, what is the structure of the sentence? If you imagine a question-answering format--if you want to ask the AI a question like, what did she see?--well, then the AI wants to be able to look at this sentence and recognize that what she saw is the city--to be able to figure that out. And that requires some understanding of what it is that the structure of this sentence really looks like.

So where do we begin? Each of these words--she, saw, the, city--we are going to call terminal symbols. They are symbols in our language--where each of these words is just a symbol--and this is ultimately what we care about generating. We care about generating these words. But each of these words we're also going to associate with what we're going to call a non-terminal symbol.
And these non-terminal symbols initially are going to look kind of like parts of speech, if you remember back to English grammar--where she is an N for noun, saw is a V for verb, and the is a D. D stands for determiner. These are words like the, a, and an, for example. And then city--well, city is also a noun, so an N goes there. So each of these--N, V, and D--these are what we might call non-terminal symbols. They're not actually words in the language. She, saw, the, city--those are the words in the language. But we use these non-terminal symbols to generate the terminal symbols--the terminal symbols being words like she, saw, the, city, the words that are actually in a language like English.

And so in order to translate these non-terminal symbols into terminal symbols, we have what are known as rewriting rules, and these rules look something like this. We have N on the left side of an arrow, and the arrow says: if I have an N non-terminal symbol, then I can turn it into any of these various different possibilities that are separated with a vertical line. So a noun could translate into the word she. A noun could translate into the word city, or car, or Harry, or any number of other things. These are all examples of nouns, for example. Meanwhile, a determiner, D, could translate into the, or a, or an. V for verb could translate into any of these verbs. P for preposition could translate into any of those prepositions--to, on, over, and so forth. And then ADJ for adjective can translate into any of these possible adjectives as well.

So these, then, are rules in our context-free grammar. When we are defining what it is that our grammar is--what the structure of the English language, or any other language, is--we give it these types of rules, saying that a noun could be any of these possibilities, and a verb could be any of those possibilities. But it turns out we can then begin to construct other rules, where it's not just one non-terminal translating into one terminal symbol. We're always going to have one non-terminal on the left-hand side of the arrow, but on the right-hand side of the arrow, we could have other things.
We could even have other non-terminal symbols. So what do I mean by this? Well, we have the idea of nouns--like she, city, car, Harry, for example--but there are also noun phrases--phrases that work as nouns--that are not just a single word, but multiple words. The city is two words that, together, operate as what we might call a noun phrase. It's multiple words, but they're together operating as a noun. Or think about a more complex expression, like the big city--three words all operating as a single noun--or the car on the street--multiple words now, but that entire set of words operates kind of like a noun. It substitutes as a noun phrase.

And so to do this, we'll introduce the notion of a new non-terminal symbol called NP, which will stand for noun phrase. And this rewriting rule says that a noun phrase could be a noun--so something like she is a noun, and therefore, it can also be a noun phrase--but a noun phrase could also be a determiner, D, followed by a noun. So there are two ways we can have a noun phrase in this very simple grammar. Of course, the English language is more complex than just this, but here, a noun phrase is either a noun or it is a determiner followed by a noun.

So for the first case, a noun phrase that is just a noun, that would allow us to generate noun phrases like she, because a noun phrase is just a noun, and a noun could be the word she, for example. Meanwhile, if we wanted to look at an example where a noun phrase becomes a determiner and a noun, then we get a structure like this. And now we're starting to see the structure of language emerge from these rules in a syntax tree, as we'll call it--this tree-like structure that represents the syntax of our natural language. Here, we have a noun phrase, and this noun phrase is composed of a determiner and a noun, where the determiner is the word the, according to that rule, and the noun is the word city. So here, then, is a noun phrase that consists of multiple words inside of its structure.
And using this idea of taking one symbol and rewriting it using other symbols--which might be terminal symbols, like the and city, but might also be non-terminal symbols, like D for determiner or N for noun--we can begin to construct more and more complex structures. In addition to noun phrases, we can also think about verb phrases. So what might a verb phrase look like? Well, a verb phrase might just be a single verb. In a sentence like "I walked," walked is a verb, and it is acting as the verb phrase in that sentence. But there are also more complex verb phrases that aren't just a single word, but multiple words. If you think of a sentence like "she saw the city," for example, saw the city is really that entire verb phrase. It's capturing what it is that she is doing, for example.

And so our verb phrase might have a rule like this: a verb phrase is either just a plain verb, or it is a verb followed by a noun phrase. And we saw before that a noun phrase is either a noun or a determiner followed by a noun. So a verb phrase might be something simple, like a verb phrase that is just a verb, and that verb could be the word walked, for example. But it could also be something more sophisticated, something like this, where we begin to see a larger syntax tree--where the way to read the syntax tree is that a verb phrase is a verb and a noun phrase, where that verb could be something like saw. And here is a noun phrase we've seen before, the noun phrase the city--a noun phrase composed of the determiner the and the noun city, all put together to construct this larger verb phrase.

And then, just to give one more example of a rule, we could also have a rule like this: sentence S goes to a noun phrase and a verb phrase. The basic structure of a sentence is that it is a noun phrase followed by a verb phrase. And this is a formal grammar way of expressing the idea that you might have learned when you learned English grammar--that a sentence is a subject and a verb, a subject and an action, something that's happening to a particular noun phrase.
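Written out together, the small grammar we've built up so far might look like this, using the same arrow notation--where the word lists on the right-hand sides are just the example words we've mentioned, not an exhaustive vocabulary:

    S  -> NP VP
    NP -> N | D N
    VP -> V | V NP
    D  -> "the" | "a" | "an"
    N  -> "she" | "city" | "car" | "Harry"
    V  -> "saw" | "walked"

Reading it from the top down, a sentence is a noun phrase followed by a verb phrase, and every phrase eventually bottoms out in the terminal symbols--the actual words.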
And so using this structure, we could construct a sentence that looks like this. A sentence consists of a noun phrase and a verb phrase. The noun phrase could just be a noun, like the word she. The verb phrase could be a verb and a noun phrase, where--and this is something we've seen before--the verb is saw and the noun phrase is the city.

So now look what we've done here. By defining a set of rules, there are algorithms that we can run that take these words--the CYK algorithm is one example of this, if you want to look into it--where you start with a set of terminal symbols, like she saw the city, and then, using these rules, you're able to figure out how it is that you get from a sentence to she saw the city. And it's all through these rewriting rules. The sentence is a noun phrase and a verb phrase; a verb phrase could be a verb and a noun phrase; so on and so forth--where you can imagine taking this structure and figuring out how it is that you could generate a parse tree--a syntax tree--for that set of terminal symbols, that set of words.

And if you tried to do this for a sentence that was not grammatical, something like "saw the city she," well, that wouldn't work. There would be no way to use these rules to generate that sentence, because it is not inside of that language. So this sort of model can be very helpful if the rules are expressive enough to express all the ideas that you might want to express inside of natural language. Of course, using just the simple rules we have here, there are many sentences that we won't be able to generate--sentences that we might agree are grammatical and syntactically well-formed, but that we're not going to be able to construct using these rules. In that case, we might just need some more complex rules in order to deal with those sorts of cases. And so this type of approach can be powerful if you're dealing with a limited set of rules and words that you really care about dealing with.
And one way we can actually interact with this in Python is by using a Python library called NLTK, short for Natural Language Toolkit, which we'll see a couple of times today, and which has a wide variety of different functions and classes that we can take advantage of, all meant to deal with natural language. One such algorithm it has is the ability to parse a context-free grammar--to be able to take some words and figure out, according to some context-free grammar, how you would construct the syntax tree for them. So let's go ahead and take a look at NLTK now by examining how we might construct some context-free grammars with it.

So here, inside of cfg0--cfg is short for context-free grammar--I have a sample context-free grammar, which has rules that we've seen before. A sentence goes to a noun phrase followed by a verb phrase. A noun phrase is either a determiner and a noun, or a noun. A verb phrase is either a verb, or a verb and a noun phrase. The order of these things doesn't really matter. Determiners could be the word the or the word a. A noun could be the word she, city, or car. And a verb could be the word saw or the word walked.

Now, using NLTK, which I've imported here at the top, I'm going to go ahead and parse this grammar and save it inside of this variable called parser. Next, my program is going to ask the user for input--just type in a sentence--and dot split will split it on all of the spaces, so I end up getting each of the individual words. We're going to save that inside of this list called sentence. And then we'll go ahead and try to parse the sentence, and for each tree we parse out of it, we're going to pretty print it to the screen, just so it displays in my terminal, and we're also going to draw it. It turns out that NLTK has some graphics capability, so we can visually see what that tree looks like as well. And there are multiple different ways a sentence might be parsed, which is why we're putting it inside of this for loop--and we'll see why that can be helpful in a moment, too.

All right, now that I have that, let's go ahead and try it.
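Put together, the program just described might look roughly like this--a minimal sketch reconstructed from that description, so the actual cfg0.py may differ in its details:

    import nltk

    # The sample grammar described above: a sentence is a noun phrase
    # followed by a verb phrase, and phrases bottom out in actual words.
    grammar = nltk.CFG.fromstring("""
        S -> NP VP

        NP -> D N | N
        VP -> V | V NP

        D -> "the" | "a"
        N -> "she" | "city" | "car"
        V -> "saw" | "walked"
    """)

    parser = nltk.ChartParser(grammar)

    # Ask the user for a sentence and split it into words on spaces.
    sentence = input("Sentence: ").split()

    # There may be multiple ways to parse the sentence, so loop over
    # every tree, printing a text version and drawing it graphically.
    try:
        for tree in parser.parse(sentence):
            tree.pretty_print()
            tree.draw()
    except ValueError:
        # Raised if a word isn't covered by the grammar at all.
        print("No parse tree possible.")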
I'll cd into cfg, and we'll go ahead and run cfg0. It then prompts me to type in a sentence. Let me type in a very simple sentence--something like she walked, for example--and press Return. What I get is, on the left-hand side, a text-based representation of the syntax tree. And on the right side here--let me go ahead and make it bigger--we see a visual representation of that same syntax tree. This is how it is that my computer has now parsed the sentence she walked: it's a sentence that consists of a noun phrase and a verb phrase, where each phrase is just a single noun or verb, she and then walked--the same type of structure we've seen before. But this now is our computer able to understand the structure of the sentence, to be able to get some sort of structural understanding of how it is that parts of the sentence relate to each other.

Let me now give it another sentence. I could try something like she saw the city, for example--the words we were dealing with a moment ago. And then we end up getting this syntax tree out of it--again, a sentence that has a noun phrase and a verb phrase. The noun phrase is fairly simple--it's just she--but the verb phrase is more complex. It is now saw the city, for example.

Let's do one more with this grammar. Let's do something like she saw a car. And that is going to look very similar--we also get she, but our verb phrase is now different. It's saw a car, because there are multiple possible determiners in our language and multiple possible nouns. I haven't given this grammar that many words, but if I gave it a larger vocabulary, it would then be able to understand more and more different types of sentences.

And just to give you a sense of some added complexity we could add here: the more complex our grammar, the more rules we add, the more different types of sentences we'll then have the ability to generate. So let's take a look at cfg1, for example, where I've added a whole number of other different types of rules.
I've added adjective phrases, where we can have multiple adjectives inside of a noun phrase as well. So a noun phrase could be an adjective phrase followed by a noun phrase. If I wanted to say something like the big city, that's an adjective phrase followed by a noun phrase. Or we could also have a noun and a prepositional phrase--so the car on the street, for example. On the street is a prepositional phrase, and we might want to combine those two ideas together, because the car on the street can still operate as something kind of like a noun phrase as well. There's no need to understand all of these rules in too much detail--it starts to get into the nature of English grammar--but now we have a more complex way of understanding these types of sentences.

So if I run Python cfg1, I can try typing something like she saw the wide street, for example--a more complex sentence. And if we make that larger, you can see what this sentence looks like. I'll go ahead and shrink it a little bit. So now we have a sentence like this--she saw the wide street. The wide street is one entire noun phrase, saw the wide street is an entire verb phrase, and she saw the wide street ends up forming that entire sentence.

So let's take a look at one more example, to introduce this notion of ambiguity. I can run Python cfg1, and let me type a sentence like she saw a dog with binoculars. So there's a sentence, and here now is one possible syntax tree to represent this idea--she saw, the noun phrase a dog, and then the prepositional phrase with binoculars. And the way to interpret the sentence is that what it is that she saw was a dog. And how did she do the seeing? She did the seeing with binoculars. And so this is one possible way to interpret this: she was using binoculars, and using those binoculars, she saw a dog.
But another possible way to parse that sentence would be with this tree over here, where you have something like she saw a dog with binoculars, where a dog with binoculars forms an entire noun phrase of its own--the same words in the same order, but a different grammatical structure, where now we have a dog with binoculars all inside of this noun phrase. Meaning, what did she see? What she saw was a dog, and that dog happened to have binoculars with it. So there are different ways to parse the sentence--different structures for the sentence--even given the same possible sequence of words. And this particular NLTK algorithm has the ability to find all of these, to be able to understand the different ways that you might be able to parse a sentence, and to be able to extract some sort of useful meaning out of that sentence as well.

So that, then, is a brief look at what we can do with the structure of language--using these context-free grammar rules to be able to describe that structure. But what we might also care about is understanding how it is that these sequences of words are likely to relate to each other, in terms of the actual words themselves. The grammar that we saw before would allow us to generate a sentence like, I ate a banana, for example, where I is the noun phrase and ate a banana is a verb phrase. But it would also allow for sentences like, I ate a blue car, for example, which is also syntactically well-formed according to the rules, but is probably a much less likely sentence for a person to actually speak. And we might want our AI to be able to encapsulate the idea that certain sequences of words are more or less likely than others.

So to deal with that, we'll introduce the notion of an n-gram. An n-gram, most generally, just refers to some sequence of n items inside of our text. And those items might take various different forms. We can have character n-grams, which are just a contiguous sequence of n characters--so three characters in a row, for example, or four characters in a row. We can also have word n-grams, which are a contiguous sequence of n words in a row from a particular sample of text.
And these end up proving quite useful. You can choose n to decide how long our sequence is going to be. So when n is 1, we're just looking at a single word or a single character, and that is what we might call a unigram--just one item. If we're looking at two characters or two words, that's generally called a bigram--an n-gram where n is equal to 2, looking at two words that are consecutive. And then, if there are three items, we'll often call those trigrams--three characters in a row, or three words that happen to be in a contiguous sequence.

So if we took a sentence, for example--here's a sentence from, again, Sherlock Holmes--"how often have I said to you that, when you have eliminated the impossible, whatever remains, however improbable, must be the truth." What are the trigrams that we can extract from this sentence? If we're looking at sequences of three words, well, the first trigram would be how often have--just a sequence of three words. Then we can look at the next trigram, often have I. The next trigram is have I said. Then I said to, said to you, to you that, for example. Those are all word trigrams--sequences of three contiguous words that show up in the text.

And extracting those bigrams and trigrams--or n-grams, more generally--turns out to be quite helpful, because often, when we're analyzing a lot of text, it's not going to be particularly meaningful for us to try to analyze the entire text at one time. Instead, we want to segment that text into pieces that we can begin to do some analysis of. Our AI might never have seen this entire sentence before, but it's probably seen the trigram to you that before, because to you that is something that might have come up in other documents that our AI has seen. And therefore, it knows a little bit about that particular sequence of three words in a row--or something like have I said, another sequence of three words that's probably quite popular, in terms of where you see it inside the English language. So we'd like some way to be able to extract these sorts of n-grams.
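As a quick illustration--setting aside for a moment how we split text into words, which is exactly what we'll turn to next--extracting word trigrams from an already tokenized sentence is just a matter of sliding a window of three across the list. A minimal sketch:

    # A list of word tokens (how we obtain these is the next topic).
    tokens = ["how", "often", "have", "I", "said", "to", "you", "that"]

    # Slide a window of size n across the tokens to get the n-grams.
    n = 3
    trigrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    print(trigrams)
    # [('how', 'often', 'have'), ('often', 'have', 'I'), ('have', 'I', 'said'), ...]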
629 00:30:32,433 --> 00:30:33,350 And how do we do that? 630 00:30:33,350 --> 00:30:35,770 How do we extract sequences of three words? 631 00:30:35,770 --> 00:30:39,490 Well, we need to take our input and somehow separate it 632 00:30:39,490 --> 00:30:41,810 into all of the individual words. 633 00:30:41,810 --> 00:30:45,010 And this is a process generally known as tokenization, 634 00:30:45,010 --> 00:30:48,250 the task of splitting up some sequence into distinct pieces, 635 00:30:48,250 --> 00:30:50,440 where we call those pieces tokens. 636 00:30:50,440 --> 00:30:53,480 Most commonly, this refers to something like word tokenization. 637 00:30:53,480 --> 00:30:55,810 I have some sequence of text and I want to split it up 638 00:30:55,810 --> 00:30:58,810 into all of the words that show up in that text. 639 00:30:58,810 --> 00:31:01,240 But it might also come up in the context of something 640 00:31:01,240 --> 00:31:02,680 like sentence tokenization. 641 00:31:02,680 --> 00:31:05,950 I have a long sequence of text and I'd like to split it up 642 00:31:05,950 --> 00:31:08,050 into sentences, for example. 643 00:31:08,050 --> 00:31:11,260 And so how might word tokenization work, the task of splitting up 644 00:31:11,260 --> 00:31:13,660 our sequence of characters into words? 645 00:31:13,660 --> 00:31:15,640 Well, we've also already seen this idea. 646 00:31:15,640 --> 00:31:18,610 We've seen that, in word tokenization just a moment ago, I 647 00:31:18,610 --> 00:31:22,660 took an input sequence and I just called Python's split method on it, where 648 00:31:22,660 --> 00:31:25,360 the split method took that sequence of words 649 00:31:25,360 --> 00:31:29,880 and just separated it based on where the spaces showed up in that word. 650 00:31:29,880 --> 00:31:33,640 And so if I had a sentence like, whatever remains, however improbable, 651 00:31:33,640 --> 00:31:37,620 must be the truth, how would I tokenize this? 652 00:31:37,620 --> 00:31:41,460 Well, the naive approach is just to say, anytime you see a space, 653 00:31:41,460 --> 00:31:42,600 go ahead and split it up. 654 00:31:42,600 --> 00:31:46,800 We're going to split up this particular string just by looking for spaces. 655 00:31:46,800 --> 00:31:49,830 And what we get when we do that is a sentence like this-- 656 00:31:49,830 --> 00:31:53,660 whatever remains, however improbable, must be the truth. 657 00:31:53,660 --> 00:31:56,160 But what you'll notice here is that, if we just split things 658 00:31:56,160 --> 00:32:00,930 up in terms of where the spaces are, we end up keeping the punctuation around. 659 00:32:00,930 --> 00:32:02,960 There's a comma after the word remains. 660 00:32:02,960 --> 00:32:06,030 There's a comma after improbable, a period after truth. 661 00:32:06,030 --> 00:32:08,160 And this poses a little bit of a challenge, when 662 00:32:08,160 --> 00:32:11,820 we think about trying to tokenize things into individual words, 663 00:32:11,820 --> 00:32:15,150 because if you're comparing words to each other, this word 664 00:32:15,150 --> 00:32:16,712 truth with a period after it-- 665 00:32:16,712 --> 00:32:18,420 if you just string compare it, it's going 666 00:32:18,420 --> 00:32:21,270 to be different from the word truth without a period after it. 
667 00:32:21,270 --> 00:32:23,810 And so this punctuation can sometimes pose a problem for us, 668 00:32:23,810 --> 00:32:27,060 and so we might want some way of dealing with it-- either treating punctuation 669 00:32:27,060 --> 00:32:30,990 as a separate token altogether or maybe removing that punctuation entirely 670 00:32:30,990 --> 00:32:32,920 from our sequence as well. 671 00:32:32,920 --> 00:32:35,020 So that might be something we want to do. 672 00:32:35,020 --> 00:32:38,010 But there are other cases where it becomes a little bit less clear. 673 00:32:38,010 --> 00:32:40,680 If I said something like, just before 9 o'clock, 674 00:32:40,680 --> 00:32:43,110 Sherlock Holmes stepped briskly into the room, 675 00:32:43,110 --> 00:32:46,167 well, this apostrophe after 9 o'clock-- 676 00:32:46,167 --> 00:32:48,750 after the O in 9 o'clock-- is that something we should remove? 677 00:32:48,750 --> 00:32:52,080 Should we split based on that as well, and do O and clock? 678 00:32:52,080 --> 00:32:54,090 There are some interesting questions there too. 679 00:32:54,090 --> 00:32:57,360 And it gets even trickier if you begin to think about hyphenated words-- 680 00:32:57,360 --> 00:33:00,650 something like this, where we have a whole bunch of words 681 00:33:00,650 --> 00:33:03,840 that are hyphenated and then you need to make a judgment call. 682 00:33:03,840 --> 00:33:06,180 Is that a place where you're going to split things apart 683 00:33:06,180 --> 00:33:09,840 into individual words, or are you going to consider frock-coat, and well-cut, 684 00:33:09,840 --> 00:33:13,300 and pearl-grey to be individual words of their own? 685 00:33:13,300 --> 00:33:16,530 And so those tend to pose challenges that we need to somehow deal with 686 00:33:16,530 --> 00:33:19,890 and something we need to decide as we go about trying 687 00:33:19,890 --> 00:33:21,790 to perform this kind of analysis. 688 00:33:21,790 --> 00:33:25,950 Similar challenges arise when it comes to the world of sentence tokenization. 689 00:33:25,950 --> 00:33:29,410 Imagine this sequence of sentences, for example. 690 00:33:29,410 --> 00:33:31,927 If you take a look at this particular sequence of sentences, 691 00:33:31,927 --> 00:33:35,010 you could probably imagine you could extract the sentences pretty readily. 692 00:33:35,010 --> 00:33:38,060 Here is one sentence and here is a second sentence, 693 00:33:38,060 --> 00:33:43,060 so we have two different sentences inside of this particular passage. 694 00:33:43,060 --> 00:33:46,260 And the distinguishing feature seems to be the period-- 695 00:33:46,260 --> 00:33:48,963 that a period separates one sentence from another. 696 00:33:48,963 --> 00:33:50,880 And maybe there are other types of punctuation 697 00:33:50,880 --> 00:33:52,830 you might include here as well-- 698 00:33:52,830 --> 00:33:55,740 an exclamation point, for example, or a question mark. 699 00:33:55,740 --> 00:33:58,080 But those are the types of punctuation that we know 700 00:33:58,080 --> 00:34:00,750 tend to come at the end of sentences. 701 00:34:00,750 --> 00:34:04,410 But it gets trickier again if you look at a sentence like this-- not just 702 00:34:04,410 --> 00:34:07,140 talking to Sherlock, but instead of talking to Sherlock, 703 00:34:07,140 --> 00:34:09,449 talking to Mr. Holmes. 704 00:34:09,449 --> 00:34:11,313 Well now, we have a period at the end of Mr.
705 00:34:11,313 --> 00:34:13,230 And so if you were just separating on periods, 706 00:34:13,230 --> 00:34:15,570 you might imagine this would be a sentence, 707 00:34:15,570 --> 00:34:17,760 and then just Holmes would be a sentence, 708 00:34:17,760 --> 00:34:19,800 and then we'd have a third sentence down below. 709 00:34:19,800 --> 00:34:23,159 Things do get a little bit trickier as you start 710 00:34:23,159 --> 00:34:25,050 to imagine these sorts of situations. 711 00:34:25,050 --> 00:34:27,690 And dialogue too starts to make this trickier as well-- 712 00:34:27,690 --> 00:34:31,860 that if you have these sorts of lines that are inside of something that-- 713 00:34:31,860 --> 00:34:33,150 he said, for example-- 714 00:34:33,150 --> 00:34:35,639 that he said this particular sequence of words 715 00:34:35,639 --> 00:34:37,469 and then this particular sequence of words. 716 00:34:37,469 --> 00:34:40,170 There are interesting challenges that arise there too, 717 00:34:40,170 --> 00:34:42,389 in terms of how it is that we take the sentence 718 00:34:42,389 --> 00:34:46,268 and split it up into individual sentences as well. 719 00:34:46,268 --> 00:34:48,810 And these are just things that our algorithm needs to decide. 720 00:34:48,810 --> 00:34:51,370 In practice, there are usually some heuristics that we can use. 721 00:34:51,370 --> 00:34:53,610 We know there are certain occurrences of periods, 722 00:34:53,610 --> 00:34:56,580 like the period after Mr., or in other examples where 723 00:34:56,580 --> 00:34:59,010 we know that is not the beginning of a new sentence, 724 00:34:59,010 --> 00:35:01,770 and so we can encode those rules into our AI 725 00:35:01,770 --> 00:35:04,680 to allow it to be able to do this tokenization the way 726 00:35:04,680 --> 00:35:06,060 that we want it to. 727 00:35:06,060 --> 00:35:09,960 So once we have this ability to tokenize a particular passage-- 728 00:35:09,960 --> 00:35:12,930 take the passage, split it up into individual words-- 729 00:35:12,930 --> 00:35:17,110 from there, we can begin to extract what the n-grams actually are. 730 00:35:17,110 --> 00:35:20,190 So we can actually take a look at this by going 731 00:35:20,190 --> 00:35:23,250 into a Python program that will serve the purpose of extracting 732 00:35:23,250 --> 00:35:24,630 these n-grams. 733 00:35:24,630 --> 00:35:27,510 And again, we can use NLTK, the Natural Language Toolkit, in order 734 00:35:27,510 --> 00:35:28,720 to help us here. 735 00:35:28,720 --> 00:35:33,540 So I'll go ahead and go into ngrams and we'll take a look at ngrams.py. 736 00:35:33,540 --> 00:35:36,280 And what we have here is we are going to take 737 00:35:36,280 --> 00:35:39,190 some corpus of text, just some sequence of documents, 738 00:35:39,190 --> 00:35:43,960 and use all those documents and extract what the most popular n-grams happen 739 00:35:43,960 --> 00:35:44,800 to be. 740 00:35:44,800 --> 00:35:48,490 So in order to do so, we're going to go ahead and load data from a directory 741 00:35:48,490 --> 00:35:50,510 that we specify in the command line argument. 742 00:35:50,510 --> 00:35:53,170 We'll also take in a number n as a command line argument 743 00:35:53,170 --> 00:35:55,390 as well, in terms of what our number should be, 744 00:35:55,390 --> 00:36:00,480 in terms of how many words we're going to look at in sequence. 745 00:36:00,480 --> 00:36:05,330 Then we're going to go ahead and just count up all of the nltk.ngrams.
746 00:36:05,330 --> 00:36:09,170 So we're going to look at all of the grams across this entire corpus 747 00:36:09,170 --> 00:36:11,600 and save it inside this variable ngrams. 748 00:36:11,600 --> 00:36:14,090 And then we're going to look at the most common ones 749 00:36:14,090 --> 00:36:15,423 and go ahead and print them out. 750 00:36:15,423 --> 00:36:18,020 And so in order to do so, I'm not only using NLTK-- 751 00:36:18,020 --> 00:36:21,290 I'm also using Counter, which is built into Python as well, where I can just 752 00:36:21,290 --> 00:36:25,800 count up, how many times do these various different grams appear? 753 00:36:25,800 --> 00:36:27,480 So we'll go ahead and show that. 754 00:36:27,480 --> 00:36:31,500 We'll go into ngrams, and I'll say something like python ngrams-- 755 00:36:31,500 --> 00:36:34,020 and let's just first look for the unigrams, sequences 756 00:36:34,020 --> 00:36:37,000 of one word inside of a corpus. 757 00:36:37,000 --> 00:36:39,270 And the corpus that I've prepared is I have 758 00:36:39,270 --> 00:36:42,720 all of the-- or some of these stories from Sherlock Holmes 759 00:36:42,720 --> 00:36:47,140 all here, where each one is just one of the Sherlock Holmes stories. 760 00:36:47,140 --> 00:36:50,010 And so I have a whole bunch of text here inside of this corpus, 761 00:36:50,010 --> 00:36:54,270 and I'll go ahead and provide that corpus as a command line argument. 762 00:36:54,270 --> 00:36:55,980 And now what my program is going to do is 763 00:36:55,980 --> 00:36:59,000 it's going to load all of the Sherlock Holmes stories into memory-- 764 00:36:59,000 --> 00:37:01,500 or all the ones that I've provided in this corpus at least-- 765 00:37:01,500 --> 00:37:04,200 and it's just going to look for the most popular unigrams, 766 00:37:04,200 --> 00:37:07,050 the most popular sequences of one word. 767 00:37:07,050 --> 00:37:12,060 And it seems the most popular one is just the word the, used 9,700 times; 768 00:37:12,060 --> 00:37:15,930 followed by I, used 5,000 times; and, used about 5,000 times-- 769 00:37:15,930 --> 00:37:18,370 the kinds of words you might expect. 770 00:37:18,370 --> 00:37:24,900 So now let's go ahead and check for bigrams, for example, ngrams 2, holmes. 771 00:37:24,900 --> 00:37:28,740 All right, again, sequences of two words now that appear multiple times-- 772 00:37:28,740 --> 00:37:32,840 of the, in the, it was, to the, it is, I have-- so on and so forth. 773 00:37:32,840 --> 00:37:34,590 These are the types of bigrams that happen 774 00:37:34,590 --> 00:37:37,590 to come up quite often inside this corpus of the Sherlock 775 00:37:37,590 --> 00:37:38,400 Holmes stories. 776 00:37:38,400 --> 00:37:41,060 And it probably is true across other corpora as well, 777 00:37:41,060 --> 00:37:43,472 but we could only find out if we actually tested it. 778 00:37:43,472 --> 00:37:45,180 And now, just for good measure, let's try 779 00:37:45,180 --> 00:37:50,120 one more-- maybe try three, looking now for trigrams that happen to show up. 780 00:37:50,120 --> 00:37:54,570 And now we get it was the, one of the, I think that, out of the. 781 00:37:54,570 --> 00:37:56,850 These are sequences of three words now that 782 00:37:56,850 --> 00:38:00,900 happen to come up multiple times across this particular corpus. 783 00:38:00,900 --> 00:38:02,970 So what are the potential use cases here? 784 00:38:02,970 --> 00:38:04,440 Now we have some sort of data.
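Here is a minimal sketch of what an ngrams.py-style program can look like, assuming the value of n and a directory of plain-text files are passed as command line arguments-- an approximation of the program being described, not necessarily its exact code:

import os
import sys
from collections import Counter

import nltk  # word_tokenize also needs NLTK's "punkt" tokenizer data

n = int(sys.argv[1])
words = []
for filename in os.listdir(sys.argv[2]):
    with open(os.path.join(sys.argv[2], filename)) as f:
        # Lowercase the text and keep only tokens that contain letters.
        words.extend(
            token.lower()
            for token in nltk.word_tokenize(f.read())
            if any(c.isalpha() for c in token)
        )

# Count every contiguous sequence of n words and print the most common.
ngrams = Counter(nltk.ngrams(words, n))
for gram, count in ngrams.most_common(10):
    print(f"{count}: {gram}")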
785 00:38:04,440 --> 00:38:07,890 We have data about how often particular sequences of words 786 00:38:07,890 --> 00:38:11,010 show up in this particular order, and using that, 787 00:38:11,010 --> 00:38:13,410 we can begin to do some sort of predictions. 788 00:38:13,410 --> 00:38:18,090 We might be able to say that, if you see the words that it was, 789 00:38:18,090 --> 00:38:19,950 there's a reasonable chance the word that 790 00:38:19,950 --> 00:38:22,130 comes after it should be the word a. 791 00:38:22,130 --> 00:38:26,340 And if I see the words one of, it's reasonable to imagine 792 00:38:26,340 --> 00:38:29,190 that the next word might be the word the, for example, 793 00:38:29,190 --> 00:38:32,640 because we have this data about trigrams, sequences of three words 794 00:38:32,640 --> 00:38:33,900 and how often they come up. 795 00:38:33,900 --> 00:38:36,150 And now, based on two words, you might be 796 00:38:36,150 --> 00:38:40,110 able to predict what the third word happens to be. 797 00:38:40,110 --> 00:38:43,650 And one model we can use for that is a model we've actually seen before. 798 00:38:43,650 --> 00:38:45,280 It's the Markov model. 799 00:38:45,280 --> 00:38:47,100 Recall again that the Markov model really 800 00:38:47,100 --> 00:38:50,010 just refers to some sequence of events that happen one time 801 00:38:50,010 --> 00:38:54,150 step after another time step, where every unit has some ability 802 00:38:54,150 --> 00:38:57,150 to predict what the next unit is going to be-- 803 00:38:57,150 --> 00:39:00,330 or maybe the past two units predict what the next unit is going to be, 804 00:39:00,330 --> 00:39:03,270 or the past three predict what the next one is going to be. 805 00:39:03,270 --> 00:39:05,490 And we can use a Markov model and apply it 806 00:39:05,490 --> 00:39:08,100 to language for a very naive and simple approach 807 00:39:08,100 --> 00:39:11,340 at trying to generate natural language, at getting our AI 808 00:39:11,340 --> 00:39:14,340 to be able to speak English-like text. 809 00:39:14,340 --> 00:39:18,360 And the way it's going to work is we're going to say something like, come up 810 00:39:18,360 --> 00:39:20,280 with some probability distribution. 811 00:39:20,280 --> 00:39:23,070 Given these two words, what is the probability 812 00:39:23,070 --> 00:39:25,830 distribution over what the third word could possibly 813 00:39:25,830 --> 00:39:27,240 be based on all the data? 814 00:39:27,240 --> 00:39:30,660 If you see it was, what are the possible third words we might have? 815 00:39:30,660 --> 00:39:32,190 How often do they come up? 816 00:39:32,190 --> 00:39:35,070 And using that information, we can try and construct 817 00:39:35,070 --> 00:39:37,450 what we expect the third word to be. 818 00:39:37,450 --> 00:39:39,270 And if you keep doing this, the effect is 819 00:39:39,270 --> 00:39:42,030 that our Markov model can effectively start 820 00:39:42,030 --> 00:39:45,330 to generate text-- can be able to generate text that 821 00:39:45,330 --> 00:39:48,330 was not in the original corpus, but that sounds 822 00:39:48,330 --> 00:39:49,770 kind of like the original corpus. 823 00:39:49,770 --> 00:39:54,130 It's using the same sorts of rules that the original corpus was using. 824 00:39:54,130 --> 00:39:56,370 So let's take a look at an example of that 825 00:39:56,370 --> 00:40:01,740 as well, where here now, I have another corpus that I have here, 826 00:40:01,740 --> 00:40:04,990 and it is the corpus of all of the works of William Shakespeare.
827 00:40:04,990 --> 00:40:09,900 So I've got a whole bunch of stories from Shakespeare, and all of them 828 00:40:09,900 --> 00:40:12,610 are just inside of this big text file. 829 00:40:12,610 --> 00:40:16,590 And so what I might like to do is look at what all of the n-grams are-- 830 00:40:16,590 --> 00:40:20,400 maybe look at all the trigrams inside of shakespeare.txt-- 831 00:40:20,400 --> 00:40:23,040 and figure out, given two words, can I predict 832 00:40:23,040 --> 00:40:24,548 what the third word is likely to be? 833 00:40:24,548 --> 00:40:26,340 And then just keep repeating this process-- 834 00:40:26,340 --> 00:40:27,240 I have two words-- 835 00:40:27,240 --> 00:40:29,400 predict the third word; then, from the second and third word, 836 00:40:29,400 --> 00:40:31,900 predict the fourth word; and from the third and fourth word, 837 00:40:31,900 --> 00:40:36,090 predict the fifth word, ultimately generating random sentences that 838 00:40:36,090 --> 00:40:39,420 sound like Shakespeare, that are using similar patterns of words 839 00:40:39,420 --> 00:40:43,140 that Shakespeare used, but that never actually showed up in Shakespeare 840 00:40:43,140 --> 00:40:44,770 itself. 841 00:40:44,770 --> 00:40:47,640 And so to do so, I'll show you generator.py, 842 00:40:47,640 --> 00:40:50,910 which, again, is just going to read data from a particular file. 843 00:40:50,910 --> 00:40:54,210 And I'm using a Python library called markovify, which is just 844 00:40:54,210 --> 00:40:56,050 going to do this process for me. 845 00:40:56,050 --> 00:40:59,370 So there are libraries out there that can just train on a bunch of text 846 00:40:59,370 --> 00:41:02,978 and come up with a Markov model based on that text. 847 00:41:02,978 --> 00:41:04,770 And I'm going to go ahead and just generate 848 00:41:04,770 --> 00:41:07,920 five randomly generated sentences. 849 00:41:07,920 --> 00:41:11,850 So we'll go ahead and go into markov. 850 00:41:11,850 --> 00:41:14,750 I'll run the generator on shakespeare.txt. 851 00:41:14,750 --> 00:41:18,290 What we'll see is it's going to load that data, and then here's what we get. 852 00:41:18,290 --> 00:41:21,320 We get five different sentences, and these 853 00:41:21,320 --> 00:41:24,890 are sentences that never showed up in any Shakespeare play, 854 00:41:24,890 --> 00:41:27,680 but that are designed to sound like Shakespeare, 855 00:41:27,680 --> 00:41:30,320 that are designed to just take two words and predict, 856 00:41:30,320 --> 00:41:34,100 given those two words, what would Shakespeare have been likely to choose 857 00:41:34,100 --> 00:41:35,517 as the third word that follows it. 858 00:41:35,517 --> 00:41:38,100 And you know, these sentences probably don't have any meaning. 859 00:41:38,100 --> 00:41:41,600 It's not like the AI is trying to express any sort of underlying meaning 860 00:41:41,600 --> 00:41:42,110 here. 861 00:41:42,110 --> 00:41:44,870 It's just trying to understand, based on the sequence 862 00:41:44,870 --> 00:41:50,190 of words, what is likely to come after it as a next word, for example. 863 00:41:50,190 --> 00:41:53,593 And these are the types of sentences that it's able to come up with, 864 00:41:53,593 --> 00:41:54,260 just generating. 865 00:41:54,260 --> 00:41:58,100 And if you ran this multiple times, you would end up getting different results.
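Here is a minimal sketch of a generator.py-style script using markovify (pip install markovify)-- the actual program may differ, but these library calls are the standard ones:

import sys

import markovify

# Read the training corpus (e.g., shakespeare.txt) given on the command line.
with open(sys.argv[1]) as f:
    text = f.read()

# Train a Markov model on the corpus.
text_model = markovify.Text(text)

# Generate five sentences in the style of the corpus.
for _ in range(5):
    print(text_model.make_sentence())  # can return None if generation fails
    print()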
866 00:41:58,100 --> 00:42:01,580 I could run this again and get an entirely different set 867 00:42:01,580 --> 00:42:04,100 of five different sentences that also are 868 00:42:04,100 --> 00:42:08,810 supposed to sound kind of like the way that Shakespeare's sentences sounded 869 00:42:08,810 --> 00:42:10,340 as well. 870 00:42:10,340 --> 00:42:12,430 And so that then was a look at how it is we 871 00:42:12,430 --> 00:42:16,580 can use Markov models to be able to naively attempt to generate language. 872 00:42:16,580 --> 00:42:18,580 The language doesn't mean a whole lot right now. 873 00:42:18,580 --> 00:42:21,430 You wouldn't want to use the system in its current form 874 00:42:21,430 --> 00:42:23,200 to do something like machine translation, 875 00:42:23,200 --> 00:42:26,020 because it wouldn't be able to encapsulate any meaning, 876 00:42:26,020 --> 00:42:30,240 but we're starting to see now that our AI is getting a little bit better 877 00:42:30,240 --> 00:42:31,990 at trying to speak our language, at trying 878 00:42:31,990 --> 00:42:36,500 to be able to process natural language in some sort of meaningful way. 879 00:42:36,500 --> 00:42:38,830 So we'll now take a look at a couple of other tasks 880 00:42:38,830 --> 00:42:41,140 that we might want our AI to be able to perform. 881 00:42:41,140 --> 00:42:44,920 And one such task is text categorization, which really is just 882 00:42:44,920 --> 00:42:46,138 a classification problem. 883 00:42:46,138 --> 00:42:48,430 And we've talked about classification problems already, 884 00:42:48,430 --> 00:42:51,670 these problems where we would like to take some object 885 00:42:51,670 --> 00:42:54,540 and categorize it into a number of different classes. 886 00:42:54,540 --> 00:42:58,750 And so the way this comes up in text is anytime you have some sample of text 887 00:42:58,750 --> 00:43:02,080 and you want to put it inside of a category, where I want to say something 888 00:43:02,080 --> 00:43:06,760 like, given an email, does it belong in the inbox or does it belong in spam? 889 00:43:06,760 --> 00:43:08,890 Which of these two categories does it belong in? 890 00:43:08,890 --> 00:43:12,250 And you do that by looking at the text and being 891 00:43:12,250 --> 00:43:16,660 able to do some sort of analysis on that text to be able to draw conclusions, 892 00:43:16,660 --> 00:43:20,200 to be able to say that, given the words that show up in the email, 893 00:43:20,200 --> 00:43:22,510 I think this is probably belonging in the inbox, 894 00:43:22,510 --> 00:43:25,825 or I think it probably belongs in spam instead. 895 00:43:25,825 --> 00:43:27,700 And you might imagine doing this for a number 896 00:43:27,700 --> 00:43:30,910 of different types of classification problems of this sort. 897 00:43:30,910 --> 00:43:34,360 So you might imagine that another common example of this type of idea 898 00:43:34,360 --> 00:43:37,690 is something like sentiment analysis, where I want to analyze, 899 00:43:37,690 --> 00:43:41,880 given a sample of text, does it have a positive sentiment 900 00:43:41,880 --> 00:43:43,780 or does it have a negative sentiment?
901 00:43:43,780 --> 00:43:47,082 And this might come up in the case of product reviews on a website, 902 00:43:47,082 --> 00:43:50,290 for example, or feedback on a website, where you have a whole bunch of data-- 903 00:43:50,290 --> 00:43:53,230 samples of text that are provided by users of a website-- 904 00:43:53,230 --> 00:43:57,010 and you want to be able to quickly analyze, are these reviews positive, 905 00:43:57,010 --> 00:43:59,710 are the reviews negative-- what is it that people 906 00:43:59,710 --> 00:44:03,460 are saying-- just to get a general sense, 907 00:44:03,460 --> 00:44:08,840 to be able to categorize text into one of these two different categories. 908 00:44:08,840 --> 00:44:10,630 So how might we approach this problem? 909 00:44:10,630 --> 00:44:13,010 Well, let's take a look at some sample product reviews. 910 00:44:13,010 --> 00:44:16,000 Here are some sample product reviews that we might come up with. 911 00:44:16,000 --> 00:44:16,930 My grandson loved it. 912 00:44:16,930 --> 00:44:17,890 So much fun. 913 00:44:17,890 --> 00:44:20,290 Product broke after a few days. 914 00:44:20,290 --> 00:44:22,368 One of the best games I've played in a long time. 915 00:44:22,368 --> 00:44:23,410 Kind of cheap and flimsy. 916 00:44:23,410 --> 00:44:24,400 Not worth it. 917 00:44:24,400 --> 00:44:28,360 Different product reviews that you might imagine seeing on Amazon, or eBay, 918 00:44:28,360 --> 00:44:31,690 or some other website where people are selling products, for instance. 919 00:44:31,690 --> 00:44:34,480 And we humans can pretty easily categorize these 920 00:44:34,480 --> 00:44:37,060 into positive sentiment or negative sentiment. 921 00:44:37,060 --> 00:44:39,790 We'd probably say that the first and the third one, those 922 00:44:39,790 --> 00:44:41,620 are positive sentiment messages. 923 00:44:41,620 --> 00:44:44,380 The second one and the fourth one, those are probably 924 00:44:44,380 --> 00:44:46,060 negative sentiment messages. 925 00:44:46,060 --> 00:44:48,680 But how could a computer do the same thing? 926 00:44:48,680 --> 00:44:53,470 How could it try and take these reviews and assess, are they positive 927 00:44:53,470 --> 00:44:55,420 or are they negative? 928 00:44:55,420 --> 00:44:57,940 Well, ultimately, it depends upon the words 929 00:44:57,940 --> 00:45:02,530 that happen to be in these particular reviews-- inside 930 00:45:02,530 --> 00:45:03,850 of these particular sentences. 931 00:45:03,850 --> 00:45:06,040 For now we're going to ignore the structure 932 00:45:06,040 --> 00:45:08,120 and how the words are related to each other, 933 00:45:08,120 --> 00:45:11,230 and we're just going to focus on what the words actually are. 934 00:45:11,230 --> 00:45:14,710 So there are probably some key words here, words like loved, 935 00:45:14,710 --> 00:45:16,330 and fun, and best. 936 00:45:16,330 --> 00:45:20,770 Those probably show up in more positive reviews, whereas words 937 00:45:20,770 --> 00:45:23,137 like broke, and cheap, and flimsy-- 938 00:45:23,137 --> 00:45:24,970 well, those are words that probably are more 939 00:45:24,970 --> 00:45:29,930 likely to come up inside of negative reviews, instead of positive reviews.
940 00:45:29,930 --> 00:45:33,550 So one way to approach this sort of text analysis idea 941 00:45:33,550 --> 00:45:37,900 is to say, let's, for now, ignore the structures of these sentences-- to say, 942 00:45:37,900 --> 00:45:40,870 we're not going to care about how it is the words relate to each other. 943 00:45:40,870 --> 00:45:43,540 We're not going to try and parse these sentences to construct 944 00:45:43,540 --> 00:45:45,850 the grammatical structure like we saw a moment ago. 945 00:45:45,850 --> 00:45:49,060 But we can probably just rely on the words that were actually 946 00:45:49,060 --> 00:45:52,000 used-- rely on the fact that the positive reviews are 947 00:45:52,000 --> 00:45:54,820 more likely to have words like best, and loved, and fun, 948 00:45:54,820 --> 00:45:58,360 and that the negative reviews are more likely to have the negative words 949 00:45:58,360 --> 00:46:00,017 that we've highlighted there as well. 950 00:46:00,017 --> 00:46:03,100 And this sort of model-- this approach to trying to think about language-- 951 00:46:03,100 --> 00:46:05,610 is generally known as the bag of words model, 952 00:46:05,610 --> 00:46:09,023 where we're going to model a sample of text not by caring about its structure, 953 00:46:09,023 --> 00:46:12,970 but just by caring about the unordered collection of words that 954 00:46:12,970 --> 00:46:16,060 show up inside of a sample-- that all we care about 955 00:46:16,060 --> 00:46:18,040 is what words are in the text. 956 00:46:18,040 --> 00:46:20,552 And we don't care about what the order of those words is. 957 00:46:20,552 --> 00:46:22,510 We don't care about the structure of the words. 958 00:46:22,510 --> 00:46:25,210 We don't care what noun goes with what adjective 959 00:46:25,210 --> 00:46:26,870 or how things agree with each other. 960 00:46:26,870 --> 00:46:28,830 We just care about the words. 961 00:46:28,830 --> 00:46:31,120 And it turns out this approach tends to work 962 00:46:31,120 --> 00:46:34,810 pretty well for doing classifications like positive sentiment 963 00:46:34,810 --> 00:46:36,142 or negative sentiment. 964 00:46:36,142 --> 00:46:38,350 And you could imagine doing this in a number of ways. 965 00:46:38,350 --> 00:46:41,740 We've talked about different approaches to trying to solve classification style 966 00:46:41,740 --> 00:46:43,870 problems, but when it comes to natural language, 967 00:46:43,870 --> 00:46:48,110 one of the most popular approaches is the naive Bayes approach. 968 00:46:48,110 --> 00:46:52,530 And this is one approach to trying to analyze the probability that something 969 00:46:52,530 --> 00:46:54,940 is positive sentiment or negative sentiment, 970 00:46:54,940 --> 00:46:58,515 or just trying to categorize some text into possible categories. 971 00:46:58,515 --> 00:47:01,390 And it doesn't just work for text-- it works for other types of ideas 972 00:47:01,390 --> 00:47:03,550 as well-- but it is quite popular in the world 973 00:47:03,550 --> 00:47:05,980 of analyzing text and natural language. 974 00:47:05,980 --> 00:47:09,450 And the naive Bayes approach is based on Bayes' rule, which 975 00:47:09,450 --> 00:47:11,950 you might recall from when we talked about probability, 976 00:47:11,950 --> 00:47:14,020 that Bayes' rule looks like this-- 977 00:47:14,020 --> 00:47:17,690 that the probability of some event b, given a, 978 00:47:17,690 --> 00:47:20,320 can be expressed using this expression over here.
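Written out symbolically, the rule being referred to is:

$$P(b \mid a) = \frac{P(a \mid b)\,P(b)}{P(a)}$$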
979 00:47:20,320 --> 00:47:25,150 Probability of b given a is the probability of a given b multiplied 980 00:47:25,150 --> 00:47:28,590 by the probability of b divided by the probability of a. 981 00:47:28,590 --> 00:47:32,290 And we saw that this came about as a result of just the definition 982 00:47:32,290 --> 00:47:35,740 of conditional probability and looking at what it means for two events 983 00:47:35,740 --> 00:47:37,010 to happen together. 984 00:47:37,010 --> 00:47:40,038 This was our formulation then of Bayes' rule, which 985 00:47:40,038 --> 00:47:41,330 turned out to be quite helpful. 986 00:47:41,330 --> 00:47:43,990 We were able to predict one event in terms of another 987 00:47:43,990 --> 00:47:49,218 by flipping the order of those events inside of this probability calculation. 988 00:47:49,218 --> 00:47:51,760 And it turns out this approach is going to be quite helpful-- 989 00:47:51,760 --> 00:47:53,110 and we'll see why in a moment-- 990 00:47:53,110 --> 00:47:55,330 for being able to do this sort of sentiment analysis, 991 00:47:55,330 --> 00:47:58,750 because I want to say, you know, what is the probability 992 00:47:58,750 --> 00:48:02,350 that a message is positive, or what is the probability 993 00:48:02,350 --> 00:48:03,727 that the message is negative? 994 00:48:03,727 --> 00:48:06,310 And I'll go ahead and simplify this using the emojis, just 995 00:48:06,310 --> 00:48:10,450 for simplicity-- probability of positive, probability of negative. 996 00:48:10,450 --> 00:48:12,340 And that is what I would like to calculate, 997 00:48:12,340 --> 00:48:15,310 but I'd like to calculate that given some information-- 998 00:48:15,310 --> 00:48:18,940 given information like here is a sample of text-- 999 00:48:18,940 --> 00:48:20,440 my grandson loved it. 1000 00:48:20,440 --> 00:48:24,280 And I would like to know not just what is the probability that any message is 1001 00:48:24,280 --> 00:48:27,880 positive, but what is the probability that the message is positive, 1002 00:48:27,880 --> 00:48:32,890 given my grandson loved it as the text of the sample? 1003 00:48:32,890 --> 00:48:36,340 So given this information that inside the sample are the words my grandson 1004 00:48:36,340 --> 00:48:41,860 loved it, what is the probability then that this is a positive message? 1005 00:48:41,860 --> 00:48:44,650 Well, according to the bag of words model, what we're going to do 1006 00:48:44,650 --> 00:48:46,930 is really ignore the ordering of the words-- 1007 00:48:46,930 --> 00:48:50,420 not treat this as a single sentence that has some structure to it, 1008 00:48:50,420 --> 00:48:52,750 but just treat it as a whole bunch of different words. 1009 00:48:52,750 --> 00:48:55,180 We're going to say something like, what is the probability 1010 00:48:55,180 --> 00:48:58,420 that this is a positive message, given that the word my 1011 00:48:58,420 --> 00:49:01,810 was in the message, given that the word grandson was in the message, 1012 00:49:01,810 --> 00:49:05,520 given that the word loved was in the message, and given that the word it 1013 00:49:05,520 --> 00:49:06,380 was in the message? 1014 00:49:06,380 --> 00:49:07,720 The bag of words model here-- 1015 00:49:07,720 --> 00:49:11,380 we're treating the entire sample as just a whole bunch 1016 00:49:11,380 --> 00:49:12,740 of different words.
1017 00:49:12,740 --> 00:49:15,910 And so this then is what I'd like to calculate, this probability-- 1018 00:49:15,910 --> 00:49:18,610 given all those words, what is the probability 1019 00:49:18,610 --> 00:49:20,920 that this is a positive message? 1020 00:49:20,920 --> 00:49:23,530 And this is where we can now apply Bayes' rule. 1021 00:49:23,530 --> 00:49:28,315 This is really the probability of some b, given some a. 1022 00:49:28,315 --> 00:49:30,400 And that now is what I'd like to calculate. 1023 00:49:30,400 --> 00:49:34,723 So according to Bayes' rule, this whole expression is equal to-- 1024 00:49:34,723 --> 00:49:35,890 well, it's the probability-- 1025 00:49:35,890 --> 00:49:37,420 I switched the order of them-- 1026 00:49:37,420 --> 00:49:40,270 it's the probability of all of these words, 1027 00:49:40,270 --> 00:49:42,910 given that it's a positive message, multiplied 1028 00:49:42,910 --> 00:49:46,930 by the probability that it is a positive message, divided 1029 00:49:46,930 --> 00:49:49,575 by the probability of all of those words. 1030 00:49:49,575 --> 00:49:51,700 So this then is just an application of Bayes' rule. 1031 00:49:51,700 --> 00:49:56,680 We've already seen where I want to express the probability of positive, 1032 00:49:56,680 --> 00:50:02,440 given the words, as related somehow to the probability of the words, 1033 00:50:02,440 --> 00:50:04,718 given that it's a positive message. 1034 00:50:04,718 --> 00:50:06,760 And it turns out-- as you might recall from back 1035 00:50:06,760 --> 00:50:09,965 when we talked about probability-- that this denominator is 1036 00:50:09,965 --> 00:50:10,840 going to be the same. 1037 00:50:10,840 --> 00:50:13,840 Regardless of whether we're looking at positive or negative messages, 1038 00:50:13,840 --> 00:50:15,850 the probability of these words doesn't change, 1039 00:50:15,850 --> 00:50:18,805 because we don't have a positive or negative down below. 1040 00:50:18,805 --> 00:50:20,680 So we can just say that, rather than saying 1041 00:50:20,680 --> 00:50:23,980 that this expression up here is equal to this expression down below, 1042 00:50:23,980 --> 00:50:27,130 it's really just proportional to the numerator. 1043 00:50:27,130 --> 00:50:29,530 We can ignore the denominator for now. 1044 00:50:29,530 --> 00:50:32,770 Using the denominator would get us an exact probability. 1045 00:50:32,770 --> 00:50:34,780 But it turns out that what we'll really just do 1046 00:50:34,780 --> 00:50:38,780 is figure out what the probability is proportional to, and at the end, 1047 00:50:38,780 --> 00:50:41,500 we'll have to normalize the probability distribution-- make 1048 00:50:41,500 --> 00:50:46,270 sure the probability distribution ultimately sums up to the number 1. 1049 00:50:46,270 --> 00:50:49,730 So now I've been able to formulate this probability-- 1050 00:50:49,730 --> 00:50:51,520 which is what I want to care about-- 1051 00:50:51,520 --> 00:50:56,530 as proportional to multiplying these two things together-- probability of words, 1052 00:50:56,530 --> 00:51:01,580 given positive message, multiplied by the probability of positive message.
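In symbols, dropping the denominator leaves the proportionality:

$$P(\text{positive} \mid \text{words}) \;\propto\; P(\text{words} \mid \text{positive})\,P(\text{positive})$$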
1053 00:51:01,580 --> 00:51:04,060 But again, if you think back to our probability rules, 1054 00:51:04,060 --> 00:51:09,070 we can calculate this really as just a joint probability of all of these 1055 00:51:09,070 --> 00:51:14,140 things happening-- that the probability of positive message multiplied 1056 00:51:14,140 --> 00:51:17,470 by the probability of these words, given the positive message-- 1057 00:51:17,470 --> 00:51:20,890 well, that's just the joint probability of all of these things. 1058 00:51:20,890 --> 00:51:23,530 This is the same thing as the probability 1059 00:51:23,530 --> 00:51:27,670 that it's a positive message, and my is in the sample, 1060 00:51:27,670 --> 00:51:30,820 and grandson is in the sample, and loved is in the sample, 1061 00:51:30,820 --> 00:51:33,160 and it is in the sample. 1062 00:51:33,160 --> 00:51:36,640 So using that rule for the definition of joint probability, 1063 00:51:36,640 --> 00:51:40,630 I've been able to say that this entire expression is now 1064 00:51:40,630 --> 00:51:43,570 proportional to this sequence-- 1065 00:51:43,570 --> 00:51:47,530 this joint probability of these words and this positive that's 1066 00:51:47,530 --> 00:51:49,670 in there as well. 1067 00:51:49,670 --> 00:51:51,790 And so now the interesting question is just how 1068 00:51:51,790 --> 00:51:54,050 to calculate that joint probability. 1069 00:51:54,050 --> 00:51:55,870 How do I figure out the probability that, 1070 00:51:55,870 --> 00:51:59,980 given some arbitrary message, that it is positive, and the word my is in there, 1071 00:51:59,980 --> 00:52:03,040 and the word grandson is in there, and the word loved is in there, 1072 00:52:03,040 --> 00:52:04,740 and the word it is in there? 1073 00:52:04,740 --> 00:52:07,990 Well, you'll recall that we can calculate a joint probability 1074 00:52:07,990 --> 00:52:12,480 by multiplying together all of these conditional probabilities. 1075 00:52:12,480 --> 00:52:16,350 If I want to know the probability of a, and b, and c, 1076 00:52:16,350 --> 00:52:19,530 I can calculate that as the probability of a times 1077 00:52:19,530 --> 00:52:24,300 the probability of b, given a, times the probability of c, given a and b. 1078 00:52:24,300 --> 00:52:27,570 I can just multiply these conditional probabilities together 1079 00:52:27,570 --> 00:52:31,290 in order to get the overall joint probability that I care about. 1080 00:52:31,290 --> 00:52:32,790 And we could do the same thing here. 1081 00:52:32,790 --> 00:52:35,340 I could say, let's multiply the probability 1082 00:52:35,340 --> 00:52:39,180 of positive by the probability of the word my showing up in the message, 1083 00:52:39,180 --> 00:52:42,810 given that it's positive, multiplied by the probability of grandson 1084 00:52:42,810 --> 00:52:45,550 showing up in the message, given that the word my is in there 1085 00:52:45,550 --> 00:52:48,930 and that it's positive, multiplied by the probability of loved, 1086 00:52:48,930 --> 00:52:51,930 given these three things, multiplied by the probability of it, 1087 00:52:51,930 --> 00:52:53,500 given these four things. 1088 00:52:53,500 --> 00:52:56,882 And that's going to end up being a fairly complex calculation to make, 1089 00:52:56,882 --> 00:52:58,590 one that we probably aren't going to have 1090 00:52:58,590 --> 00:53:00,210 a good way of knowing the answer to.
1091 00:53:00,210 --> 00:53:04,140 What is the probability that grandson is in the message, given 1092 00:53:04,140 --> 00:53:08,010 that it is positive and the word my is in the message? 1093 00:53:08,010 --> 00:53:12,040 That's not something we're really going to have a readily easy answer to, 1094 00:53:12,040 --> 00:53:15,270 and so this is where the naive part of naive Bayes comes about. 1095 00:53:15,270 --> 00:53:16,950 We're going to simplify this notion. 1096 00:53:16,950 --> 00:53:20,340 Rather than compute exactly what that probability distribution is, 1097 00:53:20,340 --> 00:53:23,880 we're going to assume that these words are 1098 00:53:23,880 --> 00:53:26,710 going to be effectively independent of each other, 1099 00:53:26,710 --> 00:53:28,980 if we know that it's already a positive message. 1100 00:53:28,980 --> 00:53:32,670 If it's a positive message, it doesn't change the probability 1101 00:53:32,670 --> 00:53:34,620 that the word grandson is in the message, 1102 00:53:34,620 --> 00:53:37,620 if I know that the word loved is in the message, for example. 1103 00:53:37,620 --> 00:53:39,750 And that might not necessarily be true in practice. 1104 00:53:39,750 --> 00:53:41,610 In the real world, it might not be the case 1105 00:53:41,610 --> 00:53:43,650 that these words are actually independent, 1106 00:53:43,650 --> 00:53:45,960 but we're going to assume it to simplify our model. 1107 00:53:45,960 --> 00:53:48,030 And it turns out that simplification still 1108 00:53:48,030 --> 00:53:51,590 lets us get pretty good results out of it as well. 1109 00:53:51,590 --> 00:53:55,320 And what we're going to assume is that the probability that all of these words 1110 00:53:55,320 --> 00:53:58,690 show up depends only on whether it's positive or negative. 1111 00:53:58,690 --> 00:54:01,170 I can still say that loved is more likely to come up 1112 00:54:01,170 --> 00:54:04,510 in a positive message than a negative message, which is probably true, 1113 00:54:04,510 --> 00:54:08,010 but we're also going to say that it's not going to change whether or not 1114 00:54:08,010 --> 00:54:12,020 loved is more likely or less likely to come up if I know that the word my is 1115 00:54:12,020 --> 00:54:13,643 in the message, for example. 1116 00:54:13,643 --> 00:54:16,060 And so those are the assumptions that we're going to make. 1117 00:54:16,060 --> 00:54:20,310 So while the top expression is proportional to this bottom expression, 1118 00:54:20,310 --> 00:54:24,750 we're going to say it's naively proportional to this expression, 1119 00:54:24,750 --> 00:54:27,480 probability of being a positive message. 1120 00:54:27,480 --> 00:54:30,300 And then, for each of the words that show up in the sample, 1121 00:54:30,300 --> 00:54:33,270 I'm going to multiply what's the probability that my 1122 00:54:33,270 --> 00:54:35,370 is in the message, given that it's positive, 1123 00:54:35,370 --> 00:54:37,980 times the probability of grandson being in the message, given 1124 00:54:37,980 --> 00:54:40,050 that it's positive-- and then so on and so forth 1125 00:54:40,050 --> 00:54:44,040 for the other words that happen to be inside of the sample. 1126 00:54:44,040 --> 00:54:47,580 And it turns out that these are numbers that we can calculate.
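Putting the naive independence assumption together with the earlier proportionality, the quantity we actually compute for a sample containing words $w_1, \dots, w_n$ is:

$$P(\text{positive} \mid w_1, \dots, w_n) \;\propto\; P(\text{positive}) \prod_{i=1}^{n} P(w_i \mid \text{positive})$$

and the analogous product for the negative category.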
1127 00:54:47,580 --> 00:54:50,640 The reason we've done all of this math is to get to this point, 1128 00:54:50,640 --> 00:54:54,870 to be able to calculate this probability distribution that we care about, 1129 00:54:54,870 --> 00:54:58,410 given these terms that we can actually calculate. 1130 00:54:58,410 --> 00:55:02,250 And we can calculate them, given some data available to us. 1131 00:55:02,250 --> 00:55:04,530 And this is what a lot of natural language processing 1132 00:55:04,530 --> 00:55:05,590 is about these days. 1133 00:55:05,590 --> 00:55:07,330 It's about analyzing data. 1134 00:55:07,330 --> 00:55:10,440 If I give you a whole bunch of data with a whole bunch of reviews, 1135 00:55:10,440 --> 00:55:13,380 and I've labeled them as positive or negative, 1136 00:55:13,380 --> 00:55:17,250 then you can begin to calculate these particular terms. 1137 00:55:17,250 --> 00:55:20,490 I can calculate the probability that a message is positive just 1138 00:55:20,490 --> 00:55:22,710 by looking at my data and saying, how many 1139 00:55:22,710 --> 00:55:26,250 positive samples were there, and divide that by the number of total samples. 1140 00:55:26,250 --> 00:55:29,477 That is my probability that a message is positive. 1141 00:55:29,477 --> 00:55:32,310 What is the probability that the word loved is in the message, given 1142 00:55:32,310 --> 00:55:33,330 that it's positive? 1143 00:55:33,330 --> 00:55:35,490 Well, I can calculate that based on my data too. 1144 00:55:35,490 --> 00:55:38,970 Let me just look at how many positive samples have the word loved in it 1145 00:55:38,970 --> 00:55:41,730 and divide that by my total number of positive samples. 1146 00:55:41,730 --> 00:55:44,430 And that will give me an approximation for, 1147 00:55:44,430 --> 00:55:47,950 what is the probability that loved is going to show up inside of the review, 1148 00:55:47,950 --> 00:55:51,570 given that we know that the review is positive. 1149 00:55:51,570 --> 00:55:55,160 And so this then allows us to be able to calculate these probabilities. 1150 00:55:55,160 --> 00:55:56,910 So let's now actually do this calculation. 1151 00:55:56,910 --> 00:56:00,390 Let's calculate for the sentence, my grandson loved it. 1152 00:56:00,390 --> 00:56:01,890 Is it a positive or negative review? 1153 00:56:01,890 --> 00:56:04,030 How could we figure out those probabilities? 1154 00:56:04,030 --> 00:56:07,110 Well, again, this up here is the expression we're trying to calculate. 1155 00:56:07,110 --> 00:56:10,350 And I'll give you the data that is available to us. 1156 00:56:10,350 --> 00:56:13,080 And the way to interpret this data in this case 1157 00:56:13,080 --> 00:56:19,127 is that, of all of the messages, 49% of them were positive and 51% of them 1158 00:56:19,127 --> 00:56:19,710 were negative. 1159 00:56:19,710 --> 00:56:22,350 Maybe online reviews tend to be a little bit more negative than they 1160 00:56:22,350 --> 00:56:24,683 are positive-- or at least based on this particular data 1161 00:56:24,683 --> 00:56:26,620 sample, that's what I have. 1162 00:56:26,620 --> 00:56:31,800 And then I have distributions for each of the various different words-- 1163 00:56:31,800 --> 00:56:34,290 that, given that it's a positive message, 1164 00:56:34,290 --> 00:56:38,040 how many positive messages had the word my in them? 1165 00:56:38,040 --> 00:56:39,335 It's about 30%. 1166 00:56:39,335 --> 00:56:42,210 And for negative messages, how many of those had the word my in them?
1167 00:56:42,210 --> 00:56:47,910 About 20%-- so it seems like the word my comes up more often in positive 1168 00:56:47,910 --> 00:56:52,140 messages-- at least slightly more often based on this analysis here. 1169 00:56:52,140 --> 00:56:54,270 Grandson, for example-- maybe that showed up 1170 00:56:54,270 --> 00:56:58,680 in 1% of all positive messages and 2% of all negative messages 1171 00:56:58,680 --> 00:57:00,330 had the word grandson in it. 1172 00:57:00,330 --> 00:57:05,010 The word loved showed up in 32% of all positive messages, 8% 1173 00:57:05,010 --> 00:57:07,090 of all negative messages, for example. 1174 00:57:07,090 --> 00:57:10,230 And then the word it showed up in 30% of positive messages, 1175 00:57:10,230 --> 00:57:15,130 40% of negative messages-- again, just arbitrary data here just for example, 1176 00:57:15,130 --> 00:57:19,560 but now we have data with which we can begin to calculate this expression. 1177 00:57:19,560 --> 00:57:22,950 So how do I calculate multiplying all these values together? 1178 00:57:22,950 --> 00:57:25,650 Well, it's just going to be multiplying probability 1179 00:57:25,650 --> 00:57:29,400 that it's positive times the probability of my, given positive, 1180 00:57:29,400 --> 00:57:32,190 times the probability of grandson, given positive-- 1181 00:57:32,190 --> 00:57:34,290 so on and so forth for each of the other words. 1182 00:57:34,290 --> 00:57:37,780 And if you do that multiplication and multiply all of those values together, 1183 00:57:37,780 --> 00:57:42,000 you get this, 0.00014112. 1184 00:57:42,000 --> 00:57:44,760 By itself, this is not a meaningful number, 1185 00:57:44,760 --> 00:57:48,810 but it's going to be meaningful if you compared this expression-- 1186 00:57:48,810 --> 00:57:53,250 the probability that it's positive times the probability of all of the words, 1187 00:57:53,250 --> 00:57:55,680 given that I know that the message is positive, 1188 00:57:55,680 --> 00:57:59,350 and compare it to the same thing, but for negative sentiment messages 1189 00:57:59,350 --> 00:57:59,850 instead. 1190 00:57:59,850 --> 00:58:03,090 I want to know the probability that it's a negative message 1191 00:58:03,090 --> 00:58:05,430 times the probability of all of these words, 1192 00:58:05,430 --> 00:58:07,900 given that it's a negative message. 1193 00:58:07,900 --> 00:58:09,360 And so how can I do that? 1194 00:58:09,360 --> 00:58:13,280 Well, to do that, you just multiply probability of negative times 1195 00:58:13,280 --> 00:58:15,500 all of these conditional probabilities. 1196 00:58:15,500 --> 00:58:19,520 And if I take those five values, multiply all of them together, 1197 00:58:19,520 --> 00:58:26,730 then what I get is this value for negative: 0.00006528-- 1198 00:58:26,730 --> 00:58:30,080 again, in isolation, not a particularly meaningful number. 1199 00:58:30,080 --> 00:58:35,300 What is meaningful is treating these two values as a probability distribution 1200 00:58:35,300 --> 00:58:39,260 and normalizing them, making it so that both of these values sum up to 1 1201 00:58:39,260 --> 00:58:41,450 the way a probability distribution should. 1202 00:58:41,450 --> 00:58:45,740 And we do so by adding these two up and then dividing each of these values 1203 00:58:45,740 --> 00:58:48,120 by their total in order to be able to normalize them.
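Here is that arithmetic in a few lines of Python, using the hypothetical numbers from the example:

# P(positive) times P(word | positive) for my, grandson, loved, it:
p_positive = 0.49 * 0.30 * 0.01 * 0.32 * 0.30  # about 0.00014112
# P(negative) times P(word | negative) for the same four words:
p_negative = 0.51 * 0.20 * 0.02 * 0.08 * 0.40  # about 0.00006528

# Normalize so the two values sum to 1:
total = p_positive + p_negative
print(p_positive / total)  # about 0.6837
print(p_negative / total)  # about 0.3163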
1204 00:58:48,120 --> 00:58:51,170 And when we do that, when we normalize this probability distribution, 1205 00:58:51,170 --> 00:58:58,400 you end up getting something like this, positive 0.6837, negative 0.3163. 1206 00:58:58,400 --> 00:59:02,990 It seems like we've been able to conclude that we are about 68% 1207 00:59:02,990 --> 00:59:06,500 confident-- we think there's a probability of 0.68 1208 00:59:06,500 --> 00:59:09,470 that this message is a positive message-- my grandson loved it. 1209 00:59:09,470 --> 00:59:11,540 And why are we 68% confident? 1210 00:59:11,540 --> 00:59:15,350 Well, it seems like we're more confident than not because the word 1211 00:59:15,350 --> 00:59:18,350 loved showed up in 32% of positive messages, 1212 00:59:18,350 --> 00:59:20,420 but only 8% of negative messages. 1213 00:59:20,420 --> 00:59:22,410 So that was a pretty strong indicator. 1214 00:59:22,410 --> 00:59:25,070 And for the others, while it's true that the word 1215 00:59:25,070 --> 00:59:27,260 it showed up more often in negative messages, 1216 00:59:27,260 --> 00:59:30,170 it wasn't enough to offset that loved shows up 1217 00:59:30,170 --> 00:59:34,560 far more often in positive messages than negative messages. 1218 00:59:34,560 --> 00:59:37,970 And so this type of analysis is how we can apply naive Bayes. 1219 00:59:37,970 --> 00:59:39,650 We've just done this calculation. 1220 00:59:39,650 --> 00:59:42,933 And we end up getting not just a categorization of positive or negative, 1221 00:59:42,933 --> 00:59:44,600 but I get some sort of confidence level. 1222 00:59:44,600 --> 00:59:47,660 What do I think the probability is that it's positive? 1223 00:59:47,660 --> 00:59:52,560 And I can say I think it's positive with this particular probability. 1224 00:59:52,560 --> 00:59:55,820 And so naive Bayes can be quite powerful at trying to achieve this. 1225 00:59:55,820 --> 00:59:58,250 Using just this bag of words model, where all I'm doing 1226 00:59:58,250 --> 01:00:00,950 is looking at what words show up in the sample, 1227 01:00:00,950 --> 01:00:03,870 I'm able to draw these sorts of conclusions. 1228 01:00:03,870 --> 01:00:07,280 Now, one potential drawback-- something that you'll notice pretty quickly 1229 01:00:07,280 --> 01:00:10,190 if you start applying this rule exactly as is-- 1230 01:00:10,190 --> 01:00:15,500 is what happens depending on if 0's are inside this data somewhere. 1231 01:00:15,500 --> 01:00:20,410 Let's imagine, for example, this same sentence-- my grandson loved it-- 1232 01:00:20,410 --> 01:00:24,980 but let's instead imagine that this value here, instead of being 0.01, 1233 01:00:24,980 --> 01:00:28,970 was 0, meaning inside of our data set, it has never 1234 01:00:28,970 --> 01:00:33,620 before happened that in a positive message the word grandson showed up. 1235 01:00:33,620 --> 01:00:35,450 And that's certainly possible. 1236 01:00:35,450 --> 01:00:37,817 If I have a pretty small data set, it's quite likely 1237 01:00:37,817 --> 01:00:40,400 that not all the messages are going to have the word grandson. 1238 01:00:40,400 --> 01:00:43,400 Maybe it is the case that no positive messages have ever 1239 01:00:43,400 --> 01:00:46,370 had the word grandson in it, at least in my data set. 1240 01:00:46,370 --> 01:00:49,640 But if it is the case that 2% of the negative messages 1241 01:00:49,640 --> 01:00:52,340 have still had the word grandson in it, then we 1242 01:00:52,340 --> 01:00:54,330 run into an interesting challenge.
1243 01:00:54,330 --> 01:00:57,730 And the challenge is this-- when I multiply all of the positive numbers 1244 01:00:57,730 --> 01:01:00,980 together and multiply all the negative numbers together to calculate these two 1245 01:01:00,980 --> 01:01:06,800 probabilities, what I end up getting is a positive value of 0.000. 1246 01:01:06,800 --> 01:01:10,010 I get pure 0's, because when I multiply all of these numbers 1247 01:01:10,010 --> 01:01:12,470 together-- when I multiply something by 0, 1248 01:01:12,470 --> 01:01:15,770 doesn't matter what the other numbers are-- the result is going to be 0. 1249 01:01:15,770 --> 01:01:19,710 And the same thing could happen with the negative numbers as well. 1250 01:01:19,710 --> 01:01:24,320 So this then would seem to be a problem that, because grandson has never 1251 01:01:24,320 --> 01:01:27,630 showed up in any of the positive messages inside of our sample, 1252 01:01:27,630 --> 01:01:31,340 we're able to say-- we seem to be concluding that there is a 0% 1253 01:01:31,340 --> 01:01:33,110 chance that the message is positive. 1254 01:01:33,110 --> 01:01:37,105 And therefore, it must be negative, because the only cases where 1255 01:01:37,105 --> 01:01:39,980 we've seen the word grandson come up are inside of a negative message. 1256 01:01:39,980 --> 01:01:43,340 And in doing so, we've totally ignored all of the other probabilities 1257 01:01:43,340 --> 01:01:46,940 that a positive message is much more likely to have the word loved in it, 1258 01:01:46,940 --> 01:01:49,190 because we've multiplied by 0, which just 1259 01:01:49,190 --> 01:01:53,670 means none of the other probabilities can possibly matter at all. 1260 01:01:53,670 --> 01:01:55,920 So this then is a challenge that we need to deal with. 1261 01:01:55,920 --> 01:01:57,380 It means that we're likely not going to be 1262 01:01:57,380 --> 01:02:00,220 able to get the correct results if we just purely use this approach. 1263 01:02:00,220 --> 01:02:02,720 And it's for that reason there are a number of possible ways 1264 01:02:02,720 --> 01:02:06,230 we can try and make sure that we never multiply something by 0. 1265 01:02:06,230 --> 01:02:08,750 It's OK to multiply something by a small number, 1266 01:02:08,750 --> 01:02:10,640 because then it can still be counterbalanced 1267 01:02:10,640 --> 01:02:14,540 by other larger numbers, but multiplying by 0 means it's the end of the story. 1268 01:02:14,540 --> 01:02:16,520 You multiply a number by 0, and the output's 1269 01:02:16,520 --> 01:02:21,230 going to be 0, no matter how big any of the other numbers happen to be. 1270 01:02:21,230 --> 01:02:23,810 So one approach that's fairly common in naive Bayes is 1271 01:02:23,810 --> 01:02:29,090 this idea of additive smoothing, adding some value alpha to each of the values 1272 01:02:29,090 --> 01:02:31,943 in our distribution just to smooth the data a little bit. 1273 01:02:31,943 --> 01:02:33,860 One such approach is called Laplace smoothing, 1274 01:02:33,860 --> 01:02:37,530 which basically just means adding one to each value in our distribution. 1275 01:02:37,530 --> 01:02:43,540 So if I have 100 samples and zero of them contain the word grandson, 1276 01:02:43,540 --> 01:02:45,290 well then I might say that, you know what? 1277 01:02:45,290 --> 01:02:49,460 Instead, let's pretend that I've had one additional sample where the word 1278 01:02:49,460 --> 01:02:53,210 grandson appeared and one additional sample where the word grandson didn't 1279 01:02:53,210 --> 01:02:53,840 appear.
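Here is a tiny sketch of that add-one idea, assuming we're estimating the probability that a word appears in a message of a given category:

def smoothed(count, total):
    # Pretend we saw one extra sample that contained the word
    # and one extra sample that did not.
    return (count + 1) / (total + 2)

# 0 of 100 positive samples contained the word "grandson":
print(smoothed(0, 100))  # 1/102 -- small, but never exactly zero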
1280 01:02:53,840 --> 01:02:57,150 So I'll say all right, now I have 1 out of 102-- 1281 01:02:57,150 --> 01:03:01,550 so one sample that does have the word grandson out of 102 total. 1282 01:03:01,550 --> 01:03:05,070 I'm basically creating two samples that didn't exist before. 1283 01:03:05,070 --> 01:03:08,830 But in doing so, I've been able to smooth the distribution a little bit 1284 01:03:08,830 --> 01:03:12,040 to make sure that I never have to multiply anything by 0. 1285 01:03:12,040 --> 01:03:17,080 By pretending I've seen one more value in each category than I actually have, 1286 01:03:17,080 --> 01:03:19,390 this gets us that result of not having to worry 1287 01:03:19,390 --> 01:03:22,180 about multiplying a number by 0. 1288 01:03:22,180 --> 01:03:24,580 So this then is an approach that we can use in order 1289 01:03:24,580 --> 01:03:27,670 to try and apply naive Bayes, even in situations 1290 01:03:27,670 --> 01:03:31,730 where we're dealing with words that we might not necessarily have seen before. 1291 01:03:31,730 --> 01:03:35,140 And let's now take a look at how we could actually apply that in practice. 1292 01:03:35,140 --> 01:03:38,490 It turns out that NLTK, in addition to having the ability to extract 1293 01:03:38,490 --> 01:03:41,110 n-grams and tokenize things into words, also 1294 01:03:41,110 --> 01:03:45,400 has the ability to be able to apply naive Bayes on some samples of text, 1295 01:03:45,400 --> 01:03:46,920 for example. 1296 01:03:46,920 --> 01:03:48,430 And so let's go ahead and do that. 1297 01:03:48,430 --> 01:03:52,840 What I've done is, inside of sentiment, I've prepared a corpus of just 1298 01:03:52,840 --> 01:03:55,997 some reviews that I've generated, but you can imagine using real reviews. 1299 01:03:55,997 --> 01:03:58,330 I just have a couple of positive reviews-- it was great. 1300 01:03:58,330 --> 01:03:58,873 So much fun. 1301 01:03:58,873 --> 01:03:59,540 Would recommend. 1302 01:03:59,540 --> 01:04:00,550 My grandson loved it. 1303 01:04:00,550 --> 01:04:01,712 Those sorts of messages. 1304 01:04:01,712 --> 01:04:04,420 And then I have a whole bunch of negative reviews-- not worth it, 1305 01:04:04,420 --> 01:04:07,190 kind of cheap, really bad, didn't work the way we expected-- 1306 01:04:07,190 --> 01:04:08,470 just one on each line. 1307 01:04:08,470 --> 01:04:11,860 A whole bunch of positive reviews and negative reviews. 1308 01:04:11,860 --> 01:04:15,130 And what I'd like to do now is analyze them somehow. 1309 01:04:15,130 --> 01:04:19,690 So here then is sentiment.py, and what we're going to do first 1310 01:04:19,690 --> 01:04:23,680 is extract all of the positive and negative sentences, 1311 01:04:23,680 --> 01:04:28,600 create a set of all of the words that were used across all of the messages, 1312 01:04:28,600 --> 01:04:33,340 and then we're going to go ahead and train NLTK's naive Bayes classifier 1313 01:04:33,340 --> 01:04:34,810 on all of this training data. 1314 01:04:34,810 --> 01:04:36,850 And what the training data effectively is, is I 1315 01:04:36,850 --> 01:04:40,300 take all of the positive messages and give them the label positive, all 1316 01:04:40,300 --> 01:04:42,790 the negative messages and give them the label negative, 1317 01:04:42,790 --> 01:04:45,880 and then I'll go ahead and apply this classifier to it, where I'd say, 1318 01:04:45,880 --> 01:04:48,100 I would like to take all of this training data 1319 01:04:48,100 --> 01:04:52,030 and now have the ability to classify it as positive or negative.
1320 01:04:52,030 --> 01:04:53,860 I'll then take some input from the user. 1321 01:04:53,860 --> 01:04:56,890 They can just type in some sequence of words. 1322 01:04:56,890 --> 01:04:59,020 And then I would like to classify that sequence 1323 01:04:59,020 --> 01:05:01,450 as either positive or negative, and then I'll 1324 01:05:01,450 --> 01:05:04,482 go ahead and print out what the probabilities of each happen to be. 1325 01:05:04,482 --> 01:05:07,690 And there are some helper functions here that just organize things in the way 1326 01:05:07,690 --> 01:05:09,610 that NLTK is expecting them to be. 1327 01:05:09,610 --> 01:05:12,307 But the key idea here is that I'm taking the positive messages, 1328 01:05:12,307 --> 01:05:14,140 labeling them, taking the negative messages, 1329 01:05:14,140 --> 01:05:16,840 labeling them, putting them inside of a classifier, 1330 01:05:16,840 --> 01:05:21,380 and then now trying to classify some new text that comes about. 1331 01:05:21,380 --> 01:05:23,030 So let's go ahead and try it. 1332 01:05:23,030 --> 01:05:26,740 I'll go ahead and go into sentiment, and we'll run python sentiment.py, 1333 01:05:26,740 --> 01:05:29,328 passing in as input that corpus that contains 1334 01:05:29,328 --> 01:05:31,120 all of the positive and negative messages-- 1335 01:05:31,120 --> 01:05:34,480 because depending on the corpus, that's going to affect the probabilities. 1336 01:05:34,480 --> 01:05:36,970 The effectiveness of our ability to classify 1337 01:05:36,970 --> 01:05:41,045 is entirely dependent on how good our data is, how much data we have, 1338 01:05:41,045 --> 01:05:42,670 and how well it happens to be labeled. 1339 01:05:42,670 --> 01:05:44,640 So now I can try something and say-- 1340 01:05:44,640 --> 01:05:47,170 let's try a review like, this was great-- 1341 01:05:47,170 --> 01:05:49,800 just some review that I might leave. 1342 01:05:49,800 --> 01:05:53,200 And it seems that, all right, there is a 96% chance it estimates 1343 01:05:53,200 --> 01:05:54,930 that this was a positive message-- 1344 01:05:54,930 --> 01:05:58,480 a 4% chance that it was negative-- likely because the word great 1345 01:05:58,480 --> 01:06:00,610 shows up inside of the positive messages, 1346 01:06:00,610 --> 01:06:03,080 but doesn't show up inside of the negative messages. 1347 01:06:03,080 --> 01:06:06,160 And that might be something that our AI is able to capitalize on. 1348 01:06:06,160 --> 01:06:09,640 And really, what it's going to look for are the differentiating words-- 1349 01:06:09,640 --> 01:06:12,490 that if the probability of words like this, and was, 1350 01:06:12,490 --> 01:06:15,530 and is is pretty similar between positive and negative messages, 1351 01:06:15,530 --> 01:06:17,680 then the naive Bayes classifier isn't going 1352 01:06:17,680 --> 01:06:21,202 to end up using those values as having some sort of importance 1353 01:06:21,202 --> 01:06:21,910 in the algorithm. 1354 01:06:21,910 --> 01:06:23,710 Because if they're the same on both sides, 1355 01:06:23,710 --> 01:06:26,560 when you multiply that value in for both positive and negative, 1356 01:06:26,560 --> 01:06:28,270 you end up getting about the same thing.
1357 01:06:28,270 --> 01:06:30,730 What ultimately makes the difference in naive Bayes 1358 01:06:30,730 --> 01:06:34,210 is when you multiply by a value that's much bigger for one category 1359 01:06:34,210 --> 01:06:36,880 than for another category-- when one word like great 1360 01:06:36,880 --> 01:06:39,910 is much more likely to show up in one type of message 1361 01:06:39,910 --> 01:06:41,260 than another type of message. 1362 01:06:41,260 --> 01:06:43,385 And that's one of the nice things about naive Bayes-- 1363 01:06:43,385 --> 01:06:45,250 that, without me telling it that great 1364 01:06:45,250 --> 01:06:48,210 is more important to care about than this or was, 1365 01:06:48,210 --> 01:06:50,380 naive Bayes can figure that out based on the data. 1366 01:06:50,380 --> 01:06:53,740 It can figure out that this shows up about the same amount 1367 01:06:53,740 --> 01:06:56,560 between the two, but great, that is a discriminator, 1368 01:06:56,560 --> 01:07:00,060 a word that can be different between the two types of messages. 1369 01:07:00,060 --> 01:07:01,400 So I could try it again-- 1370 01:07:01,400 --> 01:07:04,583 type in a sentence like, lots of fun, for example. 1371 01:07:04,583 --> 01:07:06,250 This one it's a little less sure about-- 1372 01:07:06,250 --> 01:07:10,690 62% chance that it's positive, 37% chance that it's negative-- maybe 1373 01:07:10,690 --> 01:07:12,720 because there aren't as clear discriminators 1374 01:07:12,720 --> 01:07:15,310 or differentiators inside of this data. 1375 01:07:15,310 --> 01:07:16,400 I'll try one more-- 1376 01:07:16,400 --> 01:07:20,430 say, kind of overpriced. 1377 01:07:20,430 --> 01:07:23,633 And all right, now it's 95%, 96% sure that this 1378 01:07:23,633 --> 01:07:25,800 is a negative sentiment-- likely because of the word 1379 01:07:25,800 --> 01:07:29,032 overpriced, because it's shown up in a negative sentiment expression 1380 01:07:29,032 --> 01:07:31,740 before, and therefore, it thinks, you know what, this is probably 1381 01:07:31,740 --> 01:07:34,720 going to be a negative sentence. 1382 01:07:34,720 --> 01:07:37,830 And so naive Bayes has now given us the ability to classify text. 1383 01:07:37,830 --> 01:07:40,350 Given enough training data, given enough examples, 1384 01:07:40,350 --> 01:07:44,400 we can train our AI to be able to look at natural language, human words, 1385 01:07:44,400 --> 01:07:46,410 figure out which words are likely to show up 1386 01:07:46,410 --> 01:07:48,870 in positive as opposed to negative sentiment messages, 1387 01:07:48,870 --> 01:07:50,670 and categorize them accordingly. 1388 01:07:50,670 --> 01:07:52,420 And you could imagine doing the same thing 1389 01:07:52,420 --> 01:07:55,170 anytime you want to take text and group it into categories. 1390 01:07:55,170 --> 01:07:58,300 If I want to take an email and categorize it 1391 01:07:58,300 --> 01:08:01,560 as a good email or as a spam email, you could apply a similar idea. 1392 01:08:01,560 --> 01:08:04,020 Try and look for the discriminating words, 1393 01:08:04,020 --> 01:08:07,230 the words that make it more likely to be a spam email or not, 1394 01:08:07,230 --> 01:08:10,830 and just train a naive Bayes classifier to be able to figure out 1395 01:08:10,830 --> 01:08:14,250 what that distribution is and to be able to figure out how to categorize 1396 01:08:14,250 --> 01:08:15,978 an email as good or as spam. 1397 01:08:15,978 --> 01:08:19,020 Now, of course, it's not going to be able to give us a definitive answer.
1398 01:08:19,020 --> 01:08:22,950 It gives us a probability distribution, something like 63% 1399 01:08:22,950 --> 01:08:25,380 positive, 37% negative. 1400 01:08:25,380 --> 01:08:29,550 And that might be why the spam filters in our email sometimes make mistakes, 1401 01:08:29,550 --> 01:08:32,700 sometimes think that a good email is actually spam or vice 1402 01:08:32,700 --> 01:08:36,000 versa, because ultimately, the best that it can do 1403 01:08:36,000 --> 01:08:37,890 is calculate a probability distribution. 1404 01:08:37,890 --> 01:08:40,290 If natural language is ambiguous, we can usually 1405 01:08:40,290 --> 01:08:42,960 just deal in the world of probabilities to try and get 1406 01:08:42,960 --> 01:08:47,100 an answer that is reasonably good, even if we aren't able to guarantee for sure 1407 01:08:47,100 --> 01:08:50,970 that it's the answer we actually expect it to be. 1408 01:08:50,970 --> 01:08:54,600 That then was a look at how we can begin to take some text 1409 01:08:54,600 --> 01:08:59,910 and be able to analyze the text and group it into some sorts of categories. 1410 01:08:59,910 --> 01:09:04,140 But ultimately, in addition to just being able to analyze text and categorize it, 1411 01:09:04,140 --> 01:09:08,130 we'd like to be able to figure out information about the text, 1412 01:09:08,130 --> 01:09:11,130 to get some sort of meaning out of the text as well. 1413 01:09:11,130 --> 01:09:13,500 And this starts to get us into the world of information, 1414 01:09:13,500 --> 01:09:16,620 of being able to try and take data in the form of text 1415 01:09:16,620 --> 01:09:18,450 and retrieve information from it. 1416 01:09:18,450 --> 01:09:22,500 So one type of problem is known as information retrieval, or IR, 1417 01:09:22,500 --> 01:09:26,979 which is the task of finding relevant documents in response to a query. 1418 01:09:26,979 --> 01:09:30,330 So this is something like you type a query into a search engine, 1419 01:09:30,330 --> 01:09:32,279 like Google, or you're typing something 1420 01:09:32,279 --> 01:09:35,640 into some system-- a library catalog, 1421 01:09:35,640 --> 01:09:38,609 for example-- that's going to look for responses to a query. 1422 01:09:38,609 --> 01:09:43,217 I want to look for documents that are about the US Constitution or something, 1423 01:09:43,217 --> 01:09:45,300 and I would like to get a whole bunch of documents 1424 01:09:45,300 --> 01:09:47,819 that match that query back to me. 1425 01:09:47,819 --> 01:09:50,819 But you might imagine that what I really want to be able to do, 1426 01:09:50,819 --> 01:09:53,160 in order to solve this task effectively, 1427 01:09:53,160 --> 01:09:55,830 is to be able to take documents and figure out, 1428 01:09:55,830 --> 01:09:57,870 what are those documents about? 1429 01:09:57,870 --> 01:10:01,680 I want to be able to say what it is that these particular documents are 1430 01:10:01,680 --> 01:10:03,900 about-- what are the topics of those documents-- 1431 01:10:03,900 --> 01:10:08,160 so that I can then more effectively be able to retrieve information 1432 01:10:08,160 --> 01:10:10,050 from those particular documents. 1433 01:10:10,050 --> 01:10:13,560 And this refers to a set of tasks generally known as topic modeling, 1434 01:10:13,560 --> 01:10:17,918 where I'd like to discover what the topics are for a set of documents. 1435 01:10:17,918 --> 01:10:19,710 And this is something that humans could do.
1436 01:10:19,710 --> 01:10:21,800 A human could read a document and tell you, all right, 1437 01:10:21,800 --> 01:10:23,883 here's what this document is about, and could give maybe 1438 01:10:23,883 --> 01:10:27,862 a couple of topics-- who the important people in this document are, what 1439 01:10:27,862 --> 01:10:30,570 the important objects in the document are-- could probably tell you 1440 01:10:30,570 --> 01:10:32,370 that kind of thing. 1441 01:10:32,370 --> 01:10:35,160 But we'd like for our AI to be able to do the same thing. 1442 01:10:35,160 --> 01:10:38,760 Given some document, can you tell me what the important words 1443 01:10:38,760 --> 01:10:39,870 in this document are? 1444 01:10:39,870 --> 01:10:42,095 What are the words that set this document apart 1445 01:10:42,095 --> 01:10:44,220 that I might care about if I'm looking at documents 1446 01:10:44,220 --> 01:10:47,128 based on keywords, for example? 1447 01:10:47,128 --> 01:10:49,920 And so one instinctive idea-- an intuitive idea that probably makes 1448 01:10:49,920 --> 01:10:50,580 sense-- 1449 01:10:50,580 --> 01:10:53,250 is let's just use term frequency. 1450 01:10:53,250 --> 01:10:56,100 Term frequency is just defined as the number of times 1451 01:10:56,100 --> 01:10:58,650 a particular term appears in a document. 1452 01:10:58,650 --> 01:11:03,300 If I have a document with 100 words and one particular word shows up 10 times, 1453 01:11:03,300 --> 01:11:05,440 it has a term frequency of 10. 1454 01:11:05,440 --> 01:11:06,690 It shows up pretty often. 1455 01:11:06,690 --> 01:11:09,000 Maybe that's going to be an important word. 1456 01:11:09,000 --> 01:11:10,750 And sometimes, you'll also see this framed 1457 01:11:10,750 --> 01:11:14,620 as a proportion of the total number of words, so 10 words out of 100. 1458 01:11:14,620 --> 01:11:19,110 Maybe it has a term frequency of 0.1, meaning 10% of all of the words 1459 01:11:19,110 --> 01:11:21,530 are this particular word that I care about. 1460 01:11:21,530 --> 01:11:23,280 Ultimately, that doesn't change how relatively 1461 01:11:23,280 --> 01:11:26,300 important the words are for any one particular document-- 1462 01:11:26,300 --> 01:11:27,730 it's the same idea. 1463 01:11:27,730 --> 01:11:31,050 The idea is look for words that show up more frequently, because those 1464 01:11:31,050 --> 01:11:35,970 are more likely to be the important words inside of a corpus of documents. 1465 01:11:35,970 --> 01:11:37,840 And so let's go ahead and give that a try. 1466 01:11:37,840 --> 01:11:40,980 Let's say I wanted to find out what the Sherlock Holmes stories are about. 1467 01:11:40,980 --> 01:11:42,780 I have a whole bunch of Sherlock Holmes stories 1468 01:11:42,780 --> 01:11:45,000 and I want to know, in general, what are they about? 1469 01:11:45,000 --> 01:11:47,708 What are the important characters? 1470 01:11:47,708 --> 01:11:49,000 What are the important objects? 1471 01:11:49,000 --> 01:11:52,170 What are the important parts of the story, just in terms of words? 1472 01:11:52,170 --> 01:11:55,350 And I'd like for the AI to be able to figure that out on its own, 1473 01:11:55,350 --> 01:11:57,660 and we'll do so by looking at term frequency-- 1474 01:11:57,660 --> 01:12:01,930 by looking at, what are the words that show up the most often? 1475 01:12:01,930 --> 01:12:06,250 So I'll go ahead and go into the tfidf directory. 1476 01:12:06,250 --> 01:12:08,350 You'll see why it's called that in a moment.
1477 01:12:08,350 --> 01:12:14,290 But let's first open up tf0.py, which is going to calculate the top five term 1478 01:12:14,290 --> 01:12:17,092 frequencies 1479 01:12:17,092 --> 01:12:19,300 for a corpus of documents, a whole bunch of documents 1480 01:12:19,300 --> 01:12:22,930 where each document is just a story from Sherlock Holmes. 1481 01:12:22,930 --> 01:12:26,772 We're going to load all the data into our corpus 1482 01:12:26,772 --> 01:12:29,850 and we're going to figure out, what are all of the words that 1483 01:12:29,850 --> 01:12:32,610 show up inside of that corpus? 1484 01:12:32,610 --> 01:12:35,187 And we're going to basically just assemble all 1485 01:12:35,187 --> 01:12:36,770 of the term frequencies. 1486 01:12:36,770 --> 01:12:39,510 We're going to calculate, how often does each of these terms 1487 01:12:39,510 --> 01:12:41,880 appear inside of each document? 1488 01:12:41,880 --> 01:12:43,368 And we'll print out the top five. 1489 01:12:43,368 --> 01:12:45,660 And so there are some data structures involved that you 1490 01:12:45,660 --> 01:12:47,160 can take a look at if you'd like to. 1491 01:12:47,160 --> 01:12:50,550 The exact code is not so important, but it is the idea of what we're doing. 1492 01:12:50,550 --> 01:12:54,450 We're taking each of these documents and first sorting them. 1493 01:12:54,450 --> 01:12:56,340 We're saying, take all the words that show up 1494 01:12:56,340 --> 01:13:00,080 and sort them by how often each word shows up. 1495 01:13:00,080 --> 01:13:04,710 And let's go ahead and just, for each document, save the top five 1496 01:13:04,710 --> 01:13:07,720 terms that happen to show up in each of those documents. 1497 01:13:07,720 --> 01:13:10,900 So again, some helper functions you can take a look at if you're interested. 1498 01:13:10,900 --> 01:13:13,440 But the key idea here is that all we're going to do 1499 01:13:13,440 --> 01:13:18,240 is run tf0 on the Sherlock Holmes stories. 1500 01:13:18,240 --> 01:13:21,840 And what I'm hoping to get out of this process is I am hoping to figure out, 1501 01:13:21,840 --> 01:13:25,150 what are the important words in Sherlock Holmes, for example? 1502 01:13:25,150 --> 01:13:29,370 So we'll go ahead and run this and see what we get. 1503 01:13:29,370 --> 01:13:30,982 And it's loading the data. 1504 01:13:30,982 --> 01:13:31,940 And here's what we get. 1505 01:13:31,940 --> 01:13:36,530 For this particular story, the important words are the, and and, and I, 1506 01:13:36,530 --> 01:13:37,368 and to, and of. 1507 01:13:37,368 --> 01:13:39,410 Those are the words that show up most frequently. 1508 01:13:39,410 --> 01:13:45,000 In this particular story, it's the, and and, and I, and a, and of. 1509 01:13:45,000 --> 01:13:47,000 This is not particularly useful to us. 1510 01:13:47,000 --> 01:13:48,230 We're using term frequencies. 1511 01:13:48,230 --> 01:13:50,930 We're looking at what words show up the most frequently in each 1512 01:13:50,930 --> 01:13:54,830 of these various different documents, but what we get naturally 1513 01:13:54,830 --> 01:13:57,470 are just the words that show up a lot in English. 1514 01:13:57,470 --> 01:14:00,385 Words like the, and of, and and happen to show up a lot in English, 1515 01:14:00,385 --> 01:14:02,510 and therefore, they happen to show up a lot in each 1516 01:14:02,510 --> 01:14:04,052 of these various different documents.
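In rough outline, the counting that tf0.py performs could look like the sketch below. The "holmes" directory name is an assumption based on the walkthrough, and the tokenizer assumes NLTK's punkt data has been downloaded:

```python
# A rough sketch of counting term frequencies across a directory of
# plain-text documents, one Sherlock Holmes story per file.
import os
import nltk  # assumes nltk.download("punkt") has been run once

frequencies = {}  # document name -> {word: count}
for filename in os.listdir("holmes"):
    with open(os.path.join("holmes", filename), encoding="utf-8") as f:
        contents = f.read().lower()
    counts = {}
    for word in nltk.word_tokenize(contents):
        if word.isalpha():
            counts[word] = counts.get(word, 0) + 1
    frequencies[filename] = counts

# Print the five most frequent terms in each document.
for filename, counts in frequencies.items():
    top = sorted(counts.items(), key=lambda item: item[1], reverse=True)[:5]
    print(filename, top)
```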
1517 01:14:04,052 --> 01:14:06,320 This is not a particularly useful metric for us 1518 01:14:06,320 --> 01:14:08,690 to be able to analyze what words are important, 1519 01:14:08,690 --> 01:14:12,960 because these words are just part of the grammatical structure of English. 1520 01:14:12,960 --> 01:14:17,610 And it turns out we can categorize words into a couple of different categories. 1521 01:14:17,610 --> 01:14:21,102 These words happen to be known as what we might call function words, words 1522 01:14:21,102 --> 01:14:23,060 that have little meaning on their own, but that 1523 01:14:23,060 --> 01:14:26,100 are used to grammatically connect different parts of a sentence. 1524 01:14:26,100 --> 01:14:29,120 These are words like am, and by, and do, and is, and which, 1525 01:14:29,120 --> 01:14:32,130 and with, and yet-- words that, on their own, what do they mean? 1526 01:14:32,130 --> 01:14:33,140 It's hard to say. 1527 01:14:33,140 --> 01:14:35,390 They get their meaning from how they connect 1528 01:14:35,390 --> 01:14:36,980 different parts of the sentence. 1529 01:14:36,980 --> 01:14:40,610 And these function words are what we might call a closed class of words 1530 01:14:40,610 --> 01:14:41,990 in a language like English. 1531 01:14:41,990 --> 01:14:44,690 There's really just some fixed list of function words, 1532 01:14:44,690 --> 01:14:46,190 and they don't change very often. 1533 01:14:46,190 --> 01:14:48,260 There's just some list of words that are commonly 1534 01:14:48,260 --> 01:14:52,460 used to connect other grammatical structures in the language. 1535 01:14:52,460 --> 01:14:56,120 And that's in contrast with what we might call content words, words 1536 01:14:56,120 --> 01:14:58,970 that carry meaning independently-- words like algorithm, 1537 01:14:58,970 --> 01:15:02,580 category, computer, words that actually have some sort of meaning. 1538 01:15:02,580 --> 01:15:05,150 And these are usually the words that we care about. 1539 01:15:05,150 --> 01:15:07,250 These are the words where we want to figure out, 1540 01:15:07,250 --> 01:15:10,020 what are the important words in our document? 1541 01:15:10,020 --> 01:15:12,230 We probably care about the content words more 1542 01:15:12,230 --> 01:15:15,380 than we care about the function words. 1543 01:15:15,380 --> 01:15:20,770 And so one strategy we could apply is to just ignore all of the function words. 1544 01:15:20,770 --> 01:15:26,120 So here in tf1.py, I've done the same exact thing, 1545 01:15:26,120 --> 01:15:31,790 except I'm going to load a whole bunch of words from a function_words.txt 1546 01:15:31,790 --> 01:15:35,670 file, inside of which are just a whole bunch of function words in alphabetical 1547 01:15:35,670 --> 01:15:36,170 order. 1548 01:15:36,170 --> 01:15:38,570 These are just words 1549 01:15:38,570 --> 01:15:41,870 that are used to connect other words in English, 1550 01:15:41,870 --> 01:15:44,275 and someone has compiled this particular list. 1551 01:15:44,275 --> 01:15:46,400 And these are the words that I just want to ignore. 1552 01:15:46,400 --> 01:15:49,790 If any of these words show up, let's just ignore them as top terms, 1553 01:15:49,790 --> 01:15:52,790 because these are probably not words that I care about 1554 01:15:52,790 --> 01:15:56,570 if I want to analyze what the important terms inside of a document 1555 01:15:56,570 --> 01:15:57,860 happen to be.
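Sketched out, the one change tf1.py makes over tf0.py is a filtering step before counting; the file name follows the walkthrough, and the example sentence is invented:

```python
# Sketch of filtering out function words before counting term
# frequencies, per the tf1.py change described above.
import nltk  # assumes nltk.download("punkt") has been run once

with open("function_words.txt", encoding="utf-8") as f:
    function_words = set(f.read().split())

def term_frequencies(text):
    counts = {}
    for word in nltk.word_tokenize(text.lower()):
        if word in function_words:
            continue  # skip the grammatical glue: the, of, and, ...
        if word.isalpha():
            counts[word] = counts.get(word, 0) + 1
    return counts

print(term_frequencies("The inspector and Holmes spoke, and the inspector left."))
```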
1556 01:15:57,860 --> 01:16:01,820 So in tf1, what we're ultimately doing is, 1557 01:16:01,820 --> 01:16:05,360 if the word is in my set of function words, 1558 01:16:05,360 --> 01:16:08,720 I'm just going to skip over it, just ignore any of the function words 1559 01:16:08,720 --> 01:16:11,210 by continuing on to the next word and then 1560 01:16:11,210 --> 01:16:14,010 just calculating the frequencies for the remaining words instead. 1561 01:16:14,010 --> 01:16:16,520 So I'm going to pretend the function words aren't there, 1562 01:16:16,520 --> 01:16:19,550 and now maybe I can get a better sense for what 1563 01:16:19,550 --> 01:16:23,060 terms are important in each of the various different Sherlock Holmes 1564 01:16:23,060 --> 01:16:24,560 stories. 1565 01:16:24,560 --> 01:16:29,080 So now let's run tf1 on the Sherlock Holmes corpus and see what we get now. 1566 01:16:29,080 --> 01:16:32,510 And let's look at, what is the most important term in each of the stories? 1567 01:16:32,510 --> 01:16:34,760 Well, it seems like, for each of the stories, 1568 01:16:34,760 --> 01:16:36,770 the most important word is Holmes. 1569 01:16:36,770 --> 01:16:38,270 I guess that's what we would expect. 1570 01:16:38,270 --> 01:16:39,380 They're all Sherlock Holmes stories. 1571 01:16:39,380 --> 01:16:40,922 And Holmes is not a function word. 1572 01:16:40,922 --> 01:16:44,360 It's not the, or a, or an, so it wasn't ignored. 1573 01:16:44,360 --> 01:16:46,130 But Holmes and man-- 1574 01:16:46,130 --> 01:16:50,760 these are probably not what I mean when I say, what are the important words? 1575 01:16:50,760 --> 01:16:52,700 Even though Holmes does show up the most often, 1576 01:16:52,700 --> 01:16:54,890 it's not giving me a whole lot of information here 1577 01:16:54,890 --> 01:16:57,800 about what each of the different Sherlock Holmes stories 1578 01:16:57,800 --> 01:16:59,460 are actually about. 1579 01:16:59,460 --> 01:17:02,880 And the reason why is because Sherlock Holmes shows up in all the stories, 1580 01:17:02,880 --> 01:17:06,950 and so it's not meaningful for me to say that this story is about Sherlock 1581 01:17:06,950 --> 01:17:09,560 Holmes when I want to try and figure out the different topics 1582 01:17:09,560 --> 01:17:11,180 across the corpus of documents. 1583 01:17:11,180 --> 01:17:13,640 What I really want to know is, what words show up 1584 01:17:13,640 --> 01:17:18,170 in this document that show up less frequently in the other documents, 1585 01:17:18,170 --> 01:17:19,380 for example? 1586 01:17:19,380 --> 01:17:22,730 And so to get at that idea, we're going to introduce the notion 1587 01:17:22,730 --> 01:17:25,850 of inverse document frequency. 1588 01:17:25,850 --> 01:17:29,450 Inverse document frequency is a measure of how common, 1589 01:17:29,450 --> 01:17:33,530 or rare, a word happens to be across an entire corpus of documents. 1590 01:17:33,530 --> 01:17:35,960 And mathematically, it's usually calculated like this-- 1591 01:17:35,960 --> 01:17:39,440 as the logarithm of the total number of documents 1592 01:17:39,440 --> 01:17:43,550 divided by the number of documents containing the word. 1593 01:17:43,550 --> 01:17:47,510 So if a word like Holmes shows up in all of the documents, 1594 01:17:47,510 --> 01:17:50,870 well, then the total number of documents is however many documents there 1595 01:17:50,870 --> 01:17:55,110 are, and the number of documents containing Holmes is going to be that same number.
1596 01:17:55,110 --> 01:17:58,760 So when you divide these two, you'll get 1, and the logarithm of 1 1597 01:17:58,760 --> 01:18:00,460 is just 0. 1598 01:18:00,460 --> 01:18:04,370 And so what we get is, if Holmes shows up in all of the documents, 1599 01:18:04,370 --> 01:18:07,040 it has an inverse document frequency of 0. 1600 01:18:07,040 --> 01:18:09,560 And you can now think of inverse document frequency 1601 01:18:09,560 --> 01:18:13,370 as a measure of how rare the word that 1602 01:18:13,370 --> 01:18:16,280 shows up in this particular document is-- that if a word doesn't show up 1603 01:18:16,280 --> 01:18:21,060 across many documents at all, this number is going to be much higher. 1604 01:18:21,060 --> 01:18:24,710 And this then gets us to a model known as tf-idf, 1605 01:18:24,710 --> 01:18:28,310 which is a method for ranking what words are important in a document 1606 01:18:28,310 --> 01:18:30,440 by multiplying these two ideas together. 1607 01:18:30,440 --> 01:18:37,190 Multiply term frequency, or TF, by inverse document frequency, or IDF, 1608 01:18:37,190 --> 01:18:39,890 where the idea here now is that how important a word is 1609 01:18:39,890 --> 01:18:41,540 depends on two things. 1610 01:18:41,540 --> 01:18:44,197 It depends on how often it shows up in the document, using 1611 01:18:44,197 --> 01:18:46,280 the heuristic that, if a word shows up more often, 1612 01:18:46,280 --> 01:18:47,900 it's probably more important. 1613 01:18:47,900 --> 01:18:51,170 And we multiply that by inverse document frequency, or IDF, 1614 01:18:51,170 --> 01:18:54,900 because if the word is rarer, but it shows up in the document, 1615 01:18:54,900 --> 01:18:57,200 it's probably more important than if the word shows up 1616 01:18:57,200 --> 01:19:00,200 across most or all of the documents, because then it's probably 1617 01:19:00,200 --> 01:19:02,990 a less important factor in what the different topics 1618 01:19:02,990 --> 01:19:06,840 across the different documents in the corpus happen to be. 1619 01:19:06,840 --> 01:19:11,060 And so now let's go ahead and apply this algorithm on the Sherlock Holmes 1620 01:19:11,060 --> 01:19:13,340 corpus. 1621 01:19:13,340 --> 01:19:15,650 And here's tfidf.py. 1622 01:19:15,650 --> 01:19:18,860 Now what I'm doing is, for each of the documents, 1623 01:19:18,860 --> 01:19:22,120 for each word, I'm calculating its TF score, 1624 01:19:22,120 --> 01:19:25,160 term frequency, multiplied by the inverse document 1625 01:19:25,160 --> 01:19:28,190 frequency of that word-- not just looking at the single value, 1626 01:19:28,190 --> 01:19:30,410 but multiplying these two values together 1627 01:19:30,410 --> 01:19:33,650 in order to compute the overall values. 1628 01:19:33,650 --> 01:19:37,610 And now, if I run tfidf on the Holmes corpus, 1629 01:19:37,610 --> 01:19:40,615 this is going to try and get us a better approximation for what's 1630 01:19:40,615 --> 01:19:41,990 important in each of the stories. 1631 01:19:41,990 --> 01:19:44,000 And it seems like it's trying to extract here 1632 01:19:44,000 --> 01:19:46,280 probably the names of characters that 1633 01:19:46,280 --> 01:19:49,010 happen to be important in the story-- characters that show up 1634 01:19:49,010 --> 01:19:51,380 in this story that don't show up in the other stories-- 1635 01:19:51,380 --> 01:19:53,930 and prioritizing the more important characters that 1636 01:19:53,930 --> 01:19:56,510 happen to show up more often.
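Putting the two measures together, a minimal sketch of the tf-idf computation might look like this. The toy corpus is invented for clarity; the real tfidf code reads the Holmes files instead:

```python
# A minimal sketch of tf-idf: score = term frequency * log(total
# documents / documents containing the word). Toy corpus for clarity.
import math

corpus = {
    "story1": ["holmes", "watson", "snake", "snake"],
    "story2": ["holmes", "watson", "treaty"],
}

for document, words in corpus.items():
    scores = []
    for word in set(words):
        tf = words.count(word)
        containing = sum(word in others for others in corpus.values())
        idf = math.log(len(corpus) / containing)
        scores.append((word, tf * idf))
    # "holmes" appears in every document, so its idf (and score) is 0.
    print(document, sorted(scores, key=lambda s: s[1], reverse=True))
```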
1637 01:19:56,510 --> 01:20:00,170 And so this then might be a better analysis of what types of topics 1638 01:20:00,170 --> 01:20:02,070 are more or less important. 1639 01:20:02,070 --> 01:20:05,330 I also have another corpus, which is a corpus of all of the Federalist 1640 01:20:05,330 --> 01:20:07,700 Papers from American history. 1641 01:20:07,700 --> 01:20:11,240 If I go ahead and run tfidf on the Federalist Papers, 1642 01:20:11,240 --> 01:20:14,330 we can begin to see what the important words in each 1643 01:20:14,330 --> 01:20:16,910 of the various different Federalist Papers happen to be-- 1644 01:20:16,910 --> 01:20:22,070 that in Federalist Paper Number 61, it seems like it's a lot about elections. 1645 01:20:22,070 --> 01:20:25,350 In Federalist Paper 66, it's about the Senate and impeachments. 1646 01:20:25,350 --> 01:20:28,470 You can start to extract what the important terms and 1647 01:20:28,470 --> 01:20:32,540 words are just by looking at what things 1648 01:20:32,540 --> 01:20:34,800 don't show up across many of the documents, 1649 01:20:34,800 --> 01:20:38,637 but show up frequently enough in certain of the documents. 1650 01:20:38,637 --> 01:20:40,470 And so this can be a helpful tool for trying 1651 01:20:40,470 --> 01:20:43,350 to figure out this kind of topic modeling, 1652 01:20:43,350 --> 01:20:47,100 figuring out what it is that a particular document happens 1653 01:20:47,100 --> 01:20:48,620 to be about. 1654 01:20:48,620 --> 01:20:53,070 And so this then is starting to get us into this world of semantics, 1655 01:20:53,070 --> 01:20:56,880 what it is that things actually mean when we're talking about language. 1656 01:20:56,880 --> 01:20:59,100 Now, we're no longer going to just think about the bag of words, 1657 01:20:59,100 --> 01:21:02,670 where we treat a sample of text as just a whole bunch of words 1658 01:21:02,670 --> 01:21:04,320 and we don't care about the order. 1659 01:21:04,320 --> 01:21:06,870 Now, when we get into the world of semantics, 1660 01:21:06,870 --> 01:21:10,750 we really do start to care about what it is that these words actually mean, 1661 01:21:10,750 --> 01:21:12,850 how it is these words relate to each other, 1662 01:21:12,850 --> 01:21:17,250 and in particular, how we can extract information out of that text. 1663 01:21:17,250 --> 01:21:20,970 Information extraction is the task of extracting knowledge 1664 01:21:20,970 --> 01:21:23,970 from our documents-- figuring out, given a whole bunch of text, 1665 01:21:23,970 --> 01:21:28,140 can we automate the process of having an AI look at those documents 1666 01:21:28,140 --> 01:21:31,710 and get out what the useful or relevant knowledge inside those documents 1667 01:21:31,710 --> 01:21:33,190 happens to be? 1668 01:21:33,190 --> 01:21:34,950 So let's take a look at an example. 1669 01:21:34,950 --> 01:21:37,415 I'll give you two samples from news articles. 1670 01:21:37,415 --> 01:21:40,290 Here up above is a sample of a news article from the Harvard Business 1671 01:21:40,290 --> 01:21:42,310 Review that was about Facebook. 1672 01:21:42,310 --> 01:21:45,630 Down below is an example of a Business Insider article from 2018 1673 01:21:45,630 --> 01:21:47,550 that was about Amazon.
1674 01:21:47,550 --> 01:21:49,710 And there's some information here that we might 1675 01:21:49,710 --> 01:21:51,570 want an AI to be able to extract-- 1676 01:21:51,570 --> 01:21:54,030 information, knowledge about these companies 1677 01:21:54,030 --> 01:21:55,670 that we might want to extract. 1678 01:21:55,670 --> 01:21:58,020 And in particular, what I might want to extract is-- 1679 01:21:58,020 --> 01:22:02,260 let's say I want to know data about when companies were founded-- 1680 01:22:02,260 --> 01:22:05,250 that I want to know that Facebook was founded in 2004 1681 01:22:05,250 --> 01:22:07,190 and Amazon in 1994-- 1682 01:22:07,190 --> 01:22:10,500 that that is important information that I happen to care about. 1683 01:22:10,500 --> 01:22:13,110 Well, how do we extract that information from the text? 1684 01:22:13,110 --> 01:22:15,660 What is my way of being able to understand this text 1685 01:22:15,660 --> 01:22:18,810 and figure out, all right, Facebook was founded in 2004? 1686 01:22:18,810 --> 01:22:22,710 Well, what I can look for are templates or patterns, things 1687 01:22:22,710 --> 01:22:26,700 that happen to show up across multiple different documents that give me 1688 01:22:26,700 --> 01:22:28,922 some sense for what this knowledge happens to mean. 1689 01:22:28,922 --> 01:22:30,630 And what we'll notice is a common pattern 1690 01:22:30,630 --> 01:22:34,500 between both of these passages, which is this phrasing here. 1691 01:22:34,500 --> 01:22:37,890 When Facebook was founded in 2004, comma-- 1692 01:22:37,890 --> 01:22:42,360 and then down below, when Amazon was founded in 1994, comma. 1693 01:22:42,360 --> 01:22:47,640 And those two templates end up giving us a mechanism for trying to extract 1694 01:22:47,640 --> 01:22:53,220 information-- that this notion, when company was founded in year, comma, 1695 01:22:53,220 --> 01:22:56,310 can tell us something about when a company was founded, 1696 01:22:56,310 --> 01:22:58,820 because if we set our AI loose on the web, 1697 01:22:58,820 --> 01:23:01,530 let it look at a whole bunch of pages or a whole bunch of articles, 1698 01:23:01,530 --> 01:23:03,360 and it finds this pattern-- 1699 01:23:03,360 --> 01:23:06,930 when blank was founded in blank, comma-- 1700 01:23:06,930 --> 01:23:09,840 well, then our AI can pretty reasonably conclude 1701 01:23:09,840 --> 01:23:13,740 that there's a good chance that this is going to be some company, 1702 01:23:13,740 --> 01:23:17,470 and this is going to be the year that company was founded, for example-- 1703 01:23:17,470 --> 01:23:20,907 might not be perfect, but at least it's a good heuristic. 1704 01:23:20,907 --> 01:23:22,740 And so you might imagine that, if you wanted 1705 01:23:22,740 --> 01:23:25,650 to train an AI to be able to look for information, 1706 01:23:25,650 --> 01:23:27,810 you might give the AI templates like this-- 1707 01:23:27,810 --> 01:23:31,200 not only a template like, when blank was founded in blank, 1708 01:23:31,200 --> 01:23:34,710 but also templates like, the book blank was written by blank, for example. 1709 01:23:34,710 --> 01:23:37,500 Just give it some templates where it can search the web, 1710 01:23:37,500 --> 01:23:41,640 search a whole big corpus of documents, looking for templates that match that, 1711 01:23:41,640 --> 01:23:44,970 and if it finds that, then it's able to figure out, 1712 01:23:44,970 --> 01:23:47,370 all right, here's the company and here's the year.
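As a sketch, a single hand-written template of that kind is just a regular expression. The text below is a stand-in paraphrasing the two articles, not their actual wording:

```python
# A sketch of matching the template "when COMPANY was founded in YEAR,"
# with a regular expression. The text is a stand-in for the articles.
import re

template = re.compile(r"[Ww]hen (\w+) was founded in (\d{4}),")

text = (
    "When Facebook was founded in 2004, few predicted its growth. "
    "Back when Amazon was founded in 1994, online retail barely existed."
)

for company, year in template.findall(text):
    print(company, year)  # Facebook 2004, then Amazon 1994
```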
1713 01:23:47,370 --> 01:23:50,250 But of course, that requires us to write these templates. 1714 01:23:50,250 --> 01:23:53,547 It requires us to figure out, what is the structure of this information 1715 01:23:53,547 --> 01:23:54,630 likely going to look like? 1716 01:23:54,630 --> 01:23:56,190 And it might be difficult to know. 1717 01:23:56,190 --> 01:23:58,500 Different websites are, of course, going to do this differently. 1718 01:23:58,500 --> 01:24:01,830 This type of method isn't going to be able to extract all of the information, 1719 01:24:01,830 --> 01:24:04,170 because if the words are in a slightly different order, 1720 01:24:04,170 --> 01:24:06,840 it won't match on that particular template. 1721 01:24:06,840 --> 01:24:11,310 But one thing we can do is, rather than give our AI the template, 1722 01:24:11,310 --> 01:24:13,290 we can give the AI the data. 1723 01:24:13,290 --> 01:24:19,540 We can tell the AI, Facebook was founded in 2004 and Amazon was founded in 1994, 1724 01:24:19,540 --> 01:24:22,440 and just tell the AI those two pieces of information, 1725 01:24:22,440 --> 01:24:24,780 and then set the AI loose on the web. 1726 01:24:24,780 --> 01:24:30,030 And now the idea is that the AI can begin to look for where Facebook and 2004 1727 01:24:30,030 --> 01:24:33,150 show up together, where Amazon and 1994 show up together, 1728 01:24:33,150 --> 01:24:36,150 and it can discover these templates for itself. 1729 01:24:36,150 --> 01:24:38,580 It can discover that this kind of phrasing-- 1730 01:24:38,580 --> 01:24:40,320 when blank was founded in blank-- 1731 01:24:40,320 --> 01:24:45,030 tends to relate Facebook to 2004, and relates Amazon to 1994, 1732 01:24:45,030 --> 01:24:49,320 so maybe it will hold the same relation for other pairs as well. 1733 01:24:49,320 --> 01:24:51,572 And this ends up being-- this automated template 1734 01:24:51,572 --> 01:24:54,030 generation ends up being quite powerful, and we'll go ahead 1735 01:24:54,030 --> 01:24:56,250 and take a look at that now as well. 1736 01:24:56,250 --> 01:24:59,040 What I have here inside of the templates directory 1737 01:24:59,040 --> 01:25:03,120 is a file called companies.csv, and this is all of the data 1738 01:25:03,120 --> 01:25:04,520 that I am going to give to my AI. 1739 01:25:04,520 --> 01:25:09,000 I'm going to give it the pairs Amazon, 1994 and Facebook, 2004. 1740 01:25:09,000 --> 01:25:11,190 And what I'm going to tell my AI to do is 1741 01:25:11,190 --> 01:25:14,010 search a corpus of documents for other data-- 1742 01:25:14,010 --> 01:25:16,620 other pairs like this, other relationships. 1743 01:25:16,620 --> 01:25:18,990 I'm not telling the AI that this is a company and the date 1744 01:25:18,990 --> 01:25:19,920 that it was founded. 1745 01:25:19,920 --> 01:25:23,750 I'm just giving it Amazon, 1994 and Facebook, 2004 1746 01:25:23,750 --> 01:25:25,550 and letting the AI do the rest.
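A much-simplified sketch of that discovery step: find where a seed pair co-occurs, and turn the text between the two values into a reusable pattern. The actual template generation in search.py is more general than this; the documents here are invented stand-ins:

```python
# A simplified sketch of discovering a template from seed pairs like
# (Facebook, 2004): keep the text between the two values as a pattern.
import re

seeds = [("Facebook", "2004"), ("Amazon", "1994")]
documents = [
    "When Facebook was founded in 2004, the social web changed.",
    "Back when Amazon was founded in 1994, e-commerce was young.",
]

templates = set()
for document in documents:
    for name, year in seeds:
        match = re.search(re.escape(name) + r"(.{1,40}?)" + re.escape(year),
                          document)
        if match:
            # Generalize the seed values into capture groups.
            templates.add(r"(\w+)" + re.escape(match.group(1)) + r"(\d{4})")

print(templates)  # one template: (\w+) was founded in (\d{4})
```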
1747 01:25:25,550 --> 01:25:28,640 And what the AI is going to do is it's going to look through my corpus-- 1748 01:25:28,640 --> 01:25:30,770 here's my corpus of documents-- 1749 01:25:30,770 --> 01:25:33,590 and it's going to find, inside of Business Insider, 1750 01:25:33,590 --> 01:25:38,580 that we have sentences like, back when Amazon was founded in 1994, comma-- 1751 01:25:38,580 --> 01:25:42,740 and that kind of phrasing is going to be similar to this Harvard Business Review 1752 01:25:42,740 --> 01:25:46,935 story that has a sentence like, when Facebook was founded in 2004-- 1753 01:25:46,935 --> 01:25:49,310 and it's going to look across a number of other documents 1754 01:25:49,310 --> 01:25:53,820 for similar types of patterns to be able to extract that kind of information. 1755 01:25:53,820 --> 01:25:56,450 And to see what it will do, I'll go ahead and run it. 1756 01:25:56,450 --> 01:25:58,660 So I'll go ahead and go into the templates directory. 1757 01:25:58,660 --> 01:26:01,220 And I'll say python search.py. 1758 01:26:01,220 --> 01:26:05,030 I'm going to look for data like the data in companies.csv 1759 01:26:05,030 --> 01:26:08,690 inside of the companies directory, which contains a whole bunch of news articles 1760 01:26:08,690 --> 01:26:10,900 that I've curated in advance. 1761 01:26:10,900 --> 01:26:12,080 And here's what I get-- 1762 01:26:12,080 --> 01:26:15,560 Google 1998, Apple 1976, Microsoft 1975-- 1763 01:26:15,560 --> 01:26:16,400 so on and so forth-- 1764 01:26:16,400 --> 01:26:18,470 Walmart 1962, for example. 1765 01:26:18,470 --> 01:26:20,810 These are all of the pieces of data that happened 1766 01:26:20,810 --> 01:26:23,750 to match that same template that we were able to find before. 1767 01:26:23,750 --> 01:26:25,430 And how was it able to find this? 1768 01:26:25,430 --> 01:26:29,460 Well, it's probably because, if we look at the Forbes article, 1769 01:26:29,460 --> 01:26:34,730 for example, it has a phrase in it like, when Walmart was founded in 1962, 1770 01:26:34,730 --> 01:26:38,000 comma-- and so it's able to identify these sorts of patterns 1771 01:26:38,000 --> 01:26:39,890 and extract information from them. 1772 01:26:39,890 --> 01:26:42,650 Now, granted, I have curated all these stories in advance 1773 01:26:42,650 --> 01:26:46,130 in order to make sure that there is data that it's able to match on. 1774 01:26:46,130 --> 01:26:49,100 And in practice, it's not always going to be in this exact format 1775 01:26:49,100 --> 01:26:52,430 when you're seeing a company related to the year in which it was founded, 1776 01:26:52,430 --> 01:26:56,030 but if you give the AI access to enough data-- like all of the text data 1777 01:26:56,030 --> 01:26:58,910 on the internet-- and just have the AI crawl the internet looking 1778 01:26:58,910 --> 01:27:02,720 for information, it can, with some probability, 1779 01:27:02,720 --> 01:27:05,780 try and extract information using these sorts of templates 1780 01:27:05,780 --> 01:27:08,330 and be able to generate interesting sorts of knowledge. 1781 01:27:08,330 --> 01:27:10,940 And the more knowledge it learns, the more new templates 1782 01:27:10,940 --> 01:27:13,190 it's able to construct, looking for constructions that 1783 01:27:13,190 --> 01:27:15,930 show up in other locations as well. 1784 01:27:15,930 --> 01:27:17,910 So let's take a look at another example.
1785 01:27:17,910 --> 01:27:20,955 And here I'll show you presidents.csv, 1786 01:27:20,955 --> 01:27:23,330 where I have two presidents and their inauguration dates-- 1787 01:27:23,330 --> 01:27:28,220 so George Washington, 1789 and Barack Obama, 2009, for example. 1788 01:27:28,220 --> 01:27:31,430 And I'm also going to give our AI a corpus that 1789 01:27:31,430 --> 01:27:34,550 just contains a single document, which is the Wikipedia 1790 01:27:34,550 --> 01:27:37,880 article for the list of presidents of the United States, for example-- 1791 01:27:37,880 --> 01:27:39,680 just information about presidents. 1792 01:27:39,680 --> 01:27:45,147 And I'd like to extract from this raw HTML document on a web page information 1793 01:27:45,147 --> 01:27:45,980 about the presidents. 1794 01:27:45,980 --> 01:27:50,460 So I can run the search on presidents.csv. 1795 01:27:50,460 --> 01:27:53,720 And what I get is a whole bunch of data about presidents 1796 01:27:53,720 --> 01:27:56,300 and what year they were likely inaugurated, by looking 1797 01:27:56,300 --> 01:27:58,010 for patterns that match-- 1798 01:27:58,010 --> 01:28:00,180 Barack Obama, 2009, for example-- 1799 01:28:00,180 --> 01:28:02,280 these sorts of patterns that happen 1800 01:28:02,280 --> 01:28:07,287 to give us some clues as to what it is that a story happens to be about. 1801 01:28:07,287 --> 01:28:08,370 So here's another example. 1802 01:28:08,370 --> 01:28:12,710 If I open up the olympics directory, here is a scraped version 1803 01:28:12,710 --> 01:28:15,050 of the Olympics home page that has information 1804 01:28:15,050 --> 01:28:16,610 about various different Olympics. 1805 01:28:16,610 --> 01:28:20,360 And maybe I want to extract Olympic locations and years 1806 01:28:20,360 --> 01:28:21,980 from this particular page. 1807 01:28:21,980 --> 01:28:24,950 Well, the way I can do that is using the exact same algorithm. 1808 01:28:24,950 --> 01:28:29,730 I'm just saying, all right, here are two Olympics and where they were located-- 1809 01:28:29,730 --> 01:28:32,160 so 2012, London, for example. 1810 01:28:32,160 --> 01:28:35,030 Let me go ahead and just run this process, 1811 01:28:35,030 --> 01:28:39,440 python search.py, on olympics.csv, looking at the whole Olympics data set, 1812 01:28:39,440 --> 01:28:41,280 and here I get some information back. 1813 01:28:41,280 --> 01:28:43,310 Now, this information-- not totally perfect. 1814 01:28:43,310 --> 01:28:45,530 There are a couple of examples that are obviously not 1815 01:28:45,530 --> 01:28:48,955 quite right, because my template might have been a little bit too general. 1816 01:28:48,955 --> 01:28:51,080 Maybe it was looking for a broad category of things, 1817 01:28:51,080 --> 01:28:55,190 and certain strange things happened to match on that particular template. 1818 01:28:55,190 --> 01:28:58,730 So you could imagine adding rules to try and make this process more intelligent, 1819 01:28:58,730 --> 01:29:02,000 making sure the thing on the left is just a year, 1820 01:29:02,000 --> 01:29:04,280 for instance, and doing other sorts of analysis. 1821 01:29:04,280 --> 01:29:07,040 But purely just based on some data, we are 1822 01:29:07,040 --> 01:29:10,700 able to extract some interesting information using some algorithms.
1823 01:29:10,700 --> 01:29:16,100 And all search.py is really doing here is taking my corpus of data, 1824 01:29:16,100 --> 01:29:18,260 finding templates that match it-- 1825 01:29:18,260 --> 01:29:22,280 here, I'm filtering down to just the top two templates that happen to match-- 1826 01:29:22,280 --> 01:29:26,960 and then using those templates to extract results from the data 1827 01:29:26,960 --> 01:29:30,860 that I have access to, being able to look for all of the information 1828 01:29:30,860 --> 01:29:31,670 that I care about. 1829 01:29:31,670 --> 01:29:33,587 And that's ultimately what's going to let me 1830 01:29:33,587 --> 01:29:38,390 print out those results and figure out what the matches happen to be. 1831 01:29:38,390 --> 01:29:41,090 And so information extraction is another powerful tool 1832 01:29:41,090 --> 01:29:43,970 when it comes to trying to extract information. 1833 01:29:43,970 --> 01:29:46,220 But of course, it only works in very limited contexts. 1834 01:29:46,220 --> 01:29:49,640 It only works when I'm able to find templates that look exactly 1835 01:29:49,640 --> 01:29:53,000 like this, in order to come up with some sort of match that 1836 01:29:53,000 --> 01:29:55,430 is able to connect the text to some pair of data-- 1837 01:29:55,430 --> 01:29:57,890 that this company was founded in this year. 1838 01:29:57,890 --> 01:30:01,670 What I might want to do, as we start to think about the semantics of words, 1839 01:30:01,670 --> 01:30:04,880 is to begin to imagine some way of coming up with definitions 1840 01:30:04,880 --> 01:30:08,120 for all words, being able to relate all of the words in a dictionary 1841 01:30:08,120 --> 01:30:12,110 to each other, because that's ultimately what's going to be necessary if we want 1842 01:30:12,110 --> 01:30:13,530 our AI to be able to communicate. 1843 01:30:13,530 --> 01:30:18,500 We need some representation of what it is that words mean. 1844 01:30:18,500 --> 01:30:22,340 And one approach to doing this is a famous data set called WordNet. 1845 01:30:22,340 --> 01:30:24,440 And what WordNet is, is a human-curated data set-- 1846 01:30:24,440 --> 01:30:27,380 researchers have curated together a whole bunch of words, 1847 01:30:27,380 --> 01:30:29,595 their definitions, their various different senses-- 1848 01:30:29,595 --> 01:30:31,970 because a word might have multiple different meanings-- 1849 01:30:31,970 --> 01:30:35,347 and also how those words relate to one another. 1850 01:30:35,347 --> 01:30:36,680 And so what we mean by this is-- 1851 01:30:36,680 --> 01:30:38,750 I can show you an example of WordNet. 1852 01:30:38,750 --> 01:30:40,550 WordNet comes built into NLTK. 1853 01:30:40,550 --> 01:30:44,060 Using NLTK, you can download and access WordNet. 1854 01:30:44,060 --> 01:30:48,080 So let me go into the wordnet directory, go ahead and run it, 1855 01:30:48,080 --> 01:30:52,100 and extract information about a word-- a word like city, for example. 1856 01:30:52,100 --> 01:30:53,600 Go ahead and press Return. 1857 01:30:53,600 --> 01:30:56,210 And here is the information that I get back about a city. 1858 01:30:56,210 --> 01:30:59,360 It turns out that city has three different senses, three 1859 01:30:59,360 --> 01:31:01,460 different meanings, according to WordNet. 1860 01:31:01,460 --> 01:31:03,770 And it's really just kind of like a dictionary, where 1861 01:31:03,770 --> 01:31:07,400 each sense is associated with its meaning-- just some definition 1862 01:31:07,400 --> 01:31:08,810 provided by a human.
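Through NLTK, the WordNet lookup itself is short. This assumes the WordNet data has already been fetched with nltk.download:

```python
# A small sketch of querying WordNet via NLTK; run
# nltk.download("wordnet") once beforehand to fetch the data.
from nltk.corpus import wordnet

for synset in wordnet.synsets("city"):
    # Each sense comes with a human-written definition...
    print(synset.name(), "-", synset.definition())
    # ...and hypernyms: the "is a kind of" categories mentioned here.
    for hypernym in synset.hypernyms():
        print("   a kind of:", hypernym.name())
```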
1863 01:31:08,810 --> 01:31:13,130 And then it's also got categories, for example, that a word belongs to-- 1864 01:31:13,130 --> 01:31:15,830 that a city is a type of municipality, a city 1865 01:31:15,830 --> 01:31:18,150 is a type of administrative district. 1866 01:31:18,150 --> 01:31:20,510 And that allows me to relate words to other words. 1867 01:31:20,510 --> 01:31:24,380 So one of the powers of WordNet is the ability to take one word 1868 01:31:24,380 --> 01:31:28,590 and connect it to other related words. 1869 01:31:28,590 --> 01:31:33,380 If I do another example, let me try the word house, for instance. 1870 01:31:33,380 --> 01:31:36,690 I'll type in the word house and see what I get back. 1871 01:31:36,690 --> 01:31:38,750 Well, all right, a house is a kind of building. 1872 01:31:38,750 --> 01:31:42,160 A house is somehow related to a family unit. 1873 01:31:42,160 --> 01:31:43,910 And so you might imagine trying to come up 1874 01:31:43,910 --> 01:31:46,760 with these various different ways of describing a house. 1875 01:31:46,760 --> 01:31:47,490 It is a building. 1876 01:31:47,490 --> 01:31:48,500 It is a dwelling. 1877 01:31:48,500 --> 01:31:51,110 And researchers have just curated these relationships 1878 01:31:51,110 --> 01:31:55,100 between these various different words to say that a house is a type of building, 1879 01:31:55,100 --> 01:31:58,890 that a house is a type of dwelling, for example. 1880 01:31:58,890 --> 01:32:01,370 But this type of approach, while certainly 1881 01:32:01,370 --> 01:32:04,640 helpful for being able to relate words to one another, 1882 01:32:04,640 --> 01:32:06,920 doesn't scale particularly well. 1883 01:32:06,920 --> 01:32:08,990 As you start to think about language changing, 1884 01:32:08,990 --> 01:32:11,870 as you start to think about all the various different relationships 1885 01:32:11,870 --> 01:32:16,070 that words might have to one another, this challenge of word representation 1886 01:32:16,070 --> 01:32:18,200 ends up being difficult. What we've done is just 1887 01:32:18,200 --> 01:32:23,450 define a word as a sentence that explains what that word is, 1888 01:32:23,450 --> 01:32:26,030 but what we really would like is some way 1889 01:32:26,030 --> 01:32:28,615 to represent the meaning of a word in a way 1890 01:32:28,615 --> 01:32:31,240 that our AI is going to be able to do something useful with. 1891 01:32:31,240 --> 01:32:33,830 Anytime we want our AI to be able to look at text 1892 01:32:33,830 --> 01:32:35,840 and really understand what that text means, 1893 01:32:35,840 --> 01:32:38,360 to relate text and words to similar words 1894 01:32:38,360 --> 01:32:40,700 and understand the relationships between words, 1895 01:32:40,700 --> 01:32:44,745 we'd like some way that a computer can represent this information. 1896 01:32:44,745 --> 01:32:46,620 And what we've seen all throughout the course 1897 01:32:46,620 --> 01:32:48,800 multiple times now is the idea that, when 1898 01:32:48,800 --> 01:32:51,110 we want our AI to represent something, it 1899 01:32:51,110 --> 01:32:54,890 can be helpful to have the AI represent it using numbers-- 1900 01:32:54,890 --> 01:32:57,530 we've seen that we can represent utilities in a game, 1901 01:32:57,530 --> 01:32:59,900 like winning, or losing, or drawing, as a number-- 1902 01:32:59,900 --> 01:33:01,520 1, negative 1, or 0.
1903 01:33:01,520 --> 01:33:04,400 We've seen other ways that we can take data and turn it 1904 01:33:04,400 --> 01:33:06,650 into a vector of features, where we just have 1905 01:33:06,650 --> 01:33:11,270 a whole bunch of numbers that represent some particular piece of data. 1906 01:33:11,270 --> 01:33:14,340 And if we ever want to pass words into a neural network, 1907 01:33:14,340 --> 01:33:16,580 for instance, to be able to say, given some word, 1908 01:33:16,580 --> 01:33:18,650 translate this sentence into another sentence, 1909 01:33:18,650 --> 01:33:21,890 or to be able to do interesting classifications with neural networks 1910 01:33:21,890 --> 01:33:26,000 on individual words, we need some representation of words 1911 01:33:26,000 --> 01:33:27,980 just in terms of vectors-- 1912 01:33:27,980 --> 01:33:31,820 a way to represent words, just by using individual numbers 1913 01:33:31,820 --> 01:33:34,495 to define the meaning of a word. 1914 01:33:34,495 --> 01:33:35,370 So how do we do that? 1915 01:33:35,370 --> 01:33:37,767 How do we take words and turn them into vectors 1916 01:33:37,767 --> 01:33:40,100 that we can use to represent the meaning of those words? 1917 01:33:40,100 --> 01:33:42,110 Well, one way is to do this. 1918 01:33:42,110 --> 01:33:46,280 If I have four words that I want to encode, like he wrote a book, 1919 01:33:46,280 --> 01:33:49,250 I can just say, let's let the word he be this vector-- 1920 01:33:49,250 --> 01:33:51,470 1, 0, 0, 0. 1921 01:33:51,470 --> 01:33:53,990 Wrote will be 0, 1, 0, 0. 1922 01:33:53,990 --> 01:33:56,390 A will be 0, 0, 1, 0. 1923 01:33:56,390 --> 01:33:59,570 Book will be 0, 0, 0, 1. 1924 01:33:59,570 --> 01:34:03,410 Effectively, what I have here is what's known as a one-hot representation, 1925 01:34:03,410 --> 01:34:06,930 or a one-hot encoding, which is a representation of meaning 1926 01:34:06,930 --> 01:34:10,580 where meaning is a vector that has a single 1 in it and the rest are 0's. 1927 01:34:10,580 --> 01:34:14,540 The location of the 1 tells me the meaning of the word-- 1928 01:34:14,540 --> 01:34:17,020 a 1 in the first position, that means he-- 1929 01:34:17,020 --> 01:34:19,510 a 1 in the second position, that means wrote. 1930 01:34:19,510 --> 01:34:21,740 And every word in the dictionary is going 1931 01:34:21,740 --> 01:34:24,770 to be assigned some representation like this, where we just 1932 01:34:24,770 --> 01:34:28,320 assign one place in the vector that has a 1 for that word 1933 01:34:28,320 --> 01:34:29,450 and 0 for the other words. 1934 01:34:29,450 --> 01:34:31,580 And now I have representations of words that 1935 01:34:31,580 --> 01:34:33,710 are different for a whole bunch of different words. 1936 01:34:33,710 --> 01:34:36,853 This is the one-hot representation. 1937 01:34:36,853 --> 01:34:38,270 So what are the drawbacks of this? 1938 01:34:38,270 --> 01:34:40,970 Why is this not necessarily a great approach? 1939 01:34:40,970 --> 01:34:42,980 Well, here, I am only creating enough vectors 1940 01:34:42,980 --> 01:34:45,530 to represent four words in a dictionary. 1941 01:34:45,530 --> 01:34:49,580 If you imagine a dictionary with 50,000 words that I might want to represent, 1942 01:34:49,580 --> 01:34:51,590 now these vectors get enormously long. 1943 01:34:51,590 --> 01:34:54,800 These are 50,000-dimensional vectors to represent 1944 01:34:54,800 --> 01:34:58,940 a vocabulary of 50,000 words-- that he is a 1 followed by all these 0's.
1945 01:34:58,940 --> 01:35:01,280 Wrote has a whole bunch of 0's in it. 1946 01:35:01,280 --> 01:35:05,070 That's not a particularly tractable way of trying to represent words, 1947 01:35:05,070 --> 01:35:09,860 if I'm going to have to deal with vectors of length 50,000. 1948 01:35:09,860 --> 01:35:12,140 Another problem-- a subtler problem-- 1949 01:35:12,140 --> 01:35:14,870 is that ideally, I'd like for these vectors 1950 01:35:14,870 --> 01:35:17,960 to somehow represent meaning in a way that I can extract 1951 01:35:17,960 --> 01:35:21,740 useful information out of-- that if I have the sentences he wrote a book 1952 01:35:21,740 --> 01:35:26,270 and he authored a novel, well, wrote and authored are going to be two 1953 01:35:26,270 --> 01:35:28,040 totally different vectors. 1954 01:35:28,040 --> 01:35:32,180 And book and novel are going to be two totally different vectors inside 1955 01:35:32,180 --> 01:35:35,030 of my vector space that have nothing to do with each other. 1956 01:35:35,030 --> 01:35:38,420 The 1 is just located in a different position. 1957 01:35:38,420 --> 01:35:40,790 And really, what I would like to have happen 1958 01:35:40,790 --> 01:35:43,600 is for wrote and authored to have vectors 1959 01:35:43,600 --> 01:35:47,020 that are similar to one another, and for book and novel 1960 01:35:47,020 --> 01:35:49,900 to have vector representations that are similar to one another, 1961 01:35:49,900 --> 01:35:52,780 because they are words that have similar meanings. 1962 01:35:52,780 --> 01:35:56,320 Because their meanings are similar, ideally, 1963 01:35:56,320 --> 01:35:59,860 when I put them in vector form and use a vector to represent meanings, 1964 01:35:59,860 --> 01:36:04,400 I would like for those vectors to be similar to one another as well. 1965 01:36:04,400 --> 01:36:06,640 So rather than this one-hot representation, 1966 01:36:06,640 --> 01:36:10,000 where we represent a word's meaning by just giving it a vector that has a 1 1967 01:36:10,000 --> 01:36:12,620 in a particular location, what we're going to do-- 1968 01:36:12,620 --> 01:36:15,400 which is a bit of a strange thing the first time you see it-- 1969 01:36:15,400 --> 01:36:18,640 is what we're going to call a distributed representation. 1970 01:36:18,640 --> 01:36:21,580 We are going to represent the meaning of a word as just 1971 01:36:21,580 --> 01:36:25,330 a whole bunch of different values-- not just a single 1 and the rest 0's, 1972 01:36:25,330 --> 01:36:26,630 but a whole bunch of values. 1973 01:36:26,630 --> 01:36:31,240 So for example, in he wrote a book, he might just be a big vector. 1974 01:36:31,240 --> 01:36:34,510 Maybe it's 50 dimensions, maybe it's 100 dimensions, but certainly fewer 1975 01:36:34,510 --> 01:36:39,430 than tens of thousands, where each value is just some number-- 1976 01:36:39,430 --> 01:36:42,160 and the same thing for wrote, and a, and book. 1977 01:36:42,160 --> 01:36:45,070 And the idea now is that, using these vector representations, 1978 01:36:45,070 --> 01:36:48,850 I'd hope that wrote and authored have vector representations that 1979 01:36:48,850 --> 01:36:50,317 are pretty close to one another. 1980 01:36:50,317 --> 01:36:52,900 Their distance is not too far apart-- and same with the vector 1981 01:36:52,900 --> 01:36:56,230 representations for book and novel.
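To see the contrast concretely, here is a toy comparison. The distributed values are invented for illustration: one-hot vectors make every pair of distinct words equally unrelated, while distributed vectors can place wrote and authored close together.

```python
# Toy comparison of one-hot vs. distributed word vectors using cosine
# similarity. The distributed values are made up for illustration.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms

one_hot = {"wrote": [0, 1, 0, 0], "authored": [0, 0, 1, 0]}
distributed = {"wrote": [0.9, 0.1, 0.4], "authored": [0.8, 0.2, 0.5]}

print(cosine(one_hot["wrote"], one_hot["authored"]))          # 0.0
print(cosine(distributed["wrote"], distributed["authored"]))  # ~0.98
```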
1982 01:36:56,230 --> 01:37:00,940 So this is what a lot of statistical machine learning 1983 01:37:00,940 --> 01:37:02,710 approaches to natural language processing 1984 01:37:02,710 --> 01:37:06,760 are about: using these vector representations of words. 1985 01:37:06,760 --> 01:37:10,190 But how on earth do we define a word as just a whole bunch 1986 01:37:10,190 --> 01:37:11,440 of these sequences of numbers? 1987 01:37:11,440 --> 01:37:16,668 What does it even mean to talk about the meaning of a word? 1988 01:37:16,668 --> 01:37:18,460 The famous quote that answers this question 1989 01:37:18,460 --> 01:37:22,930 is from a British linguist in the 1950s, J.R. Firth, who said, "You shall 1990 01:37:22,930 --> 01:37:25,060 know a word by the company it keeps." 1991 01:37:25,060 --> 01:37:28,150 1992 01:37:28,150 --> 01:37:30,400 And what we mean by that is the idea that we 1993 01:37:30,400 --> 01:37:35,290 can define a word in terms of the words that show up around it, that we can get 1994 01:37:35,290 --> 01:37:39,070 at the meaning of a word based on the context in which that word happens 1995 01:37:39,070 --> 01:37:40,370 to appear. 1996 01:37:40,370 --> 01:37:43,900 That if I have a sentence like this, four words in sequence-- 1997 01:37:43,900 --> 01:37:46,180 for blank he ate-- 1998 01:37:46,180 --> 01:37:47,442 what goes in the blank? 1999 01:37:47,442 --> 01:37:49,150 Well, you might imagine that, in English, 2000 01:37:49,150 --> 01:37:52,192 the types of words that might fill in the blank are words like breakfast, 2001 01:37:52,192 --> 01:37:53,170 or lunch, or dinner. 2002 01:37:53,170 --> 01:37:56,480 These are the kinds of words that fill in that blank. 2003 01:37:56,480 --> 01:38:00,730 And so if we want to define what lunch or dinner means, 2004 01:38:00,730 --> 01:38:03,970 we can define it in terms of what words happen 2005 01:38:03,970 --> 01:38:07,030 to show up around it-- that if a word shows up 2006 01:38:07,030 --> 01:38:09,700 in a particular context and another word happens to show up 2007 01:38:09,700 --> 01:38:13,750 in a very similar context, then those two words are probably 2008 01:38:13,750 --> 01:38:15,040 related to each other. 2009 01:38:15,040 --> 01:38:18,280 They probably have a similar meaning to one another. 2010 01:38:18,280 --> 01:38:20,950 And this then is the foundational idea of an algorithm 2011 01:38:20,950 --> 01:38:24,760 known as word2vec, which is a model for generating word vectors. 2012 01:38:24,760 --> 01:38:28,960 You give word2vec a corpus of documents, just a whole bunch of text, 2013 01:38:28,960 --> 01:38:34,832 and what word2vec will produce is vectors for each word. 2014 01:38:34,832 --> 01:38:36,790 And there are a number of ways that it can do this. 2015 01:38:36,790 --> 01:38:40,300 One common way is through what's known as the skip-gram architecture, which 2016 01:38:40,300 --> 01:38:44,470 basically uses a neural network to predict context words, 2017 01:38:44,470 --> 01:38:47,240 given a target word-- so given a word like lunch, 2018 01:38:47,240 --> 01:38:50,350 use a neural network to try and predict, given the word lunch, what 2019 01:38:50,350 --> 01:38:53,190 words are going to show up around it. 2020 01:38:53,190 --> 01:38:55,210 And so the way we might represent this is 2021 01:38:55,210 --> 01:38:57,760 with a big neural network like this, where 2022 01:38:57,760 --> 01:39:00,820 we have one input cell for every word.
2023 01:39:00,820 --> 01:39:04,900 Every word gets one node inside this neural network. 2024 01:39:04,900 --> 01:39:07,780 And the goal is to use this neural network to predict, 2025 01:39:07,780 --> 01:39:09,790 given a target word, a context word. 2026 01:39:09,790 --> 01:39:14,030 Given a word like lunch, can I predict the probabilities of other words 2027 01:39:14,030 --> 01:39:18,560 showing up in a context of one word away or two words away, for instance, 2028 01:39:18,560 --> 01:39:21,970 in some sort of window of context? 2029 01:39:21,970 --> 01:39:27,400 And if you just give the AI, this neural network, a whole bunch of data of words 2030 01:39:27,400 --> 01:39:30,790 and what words show up in context, you can train a neural network 2031 01:39:30,790 --> 01:39:34,600 to do this calculation, to be able to predict, given a target word-- 2032 01:39:34,600 --> 01:39:39,103 can I predict what those context words ultimately should be? 2033 01:39:39,103 --> 01:39:41,020 And it will do so using the same methods we've 2034 01:39:41,020 --> 01:39:43,850 talked about-- backpropagating the error from the context word 2035 01:39:43,850 --> 01:39:46,090 back through this neural network. 2036 01:39:46,090 --> 01:39:48,790 And what you get is, if we use a single layer-- 2037 01:39:48,790 --> 01:39:50,950 just a single layer of hidden nodes-- 2038 01:39:50,950 --> 01:39:54,960 what I get is, for every single one of these words-- 2039 01:39:54,960 --> 01:39:59,680 from this word, for example-- I get five edges, each of which 2040 01:39:59,680 --> 01:40:02,695 has a weight, to each of these five hidden nodes. 2041 01:40:02,695 --> 01:40:05,950 In other words, I get five numbers that effectively 2042 01:40:05,950 --> 01:40:10,180 are going to represent this particular target word here. 2043 01:40:10,180 --> 01:40:13,750 And the number of hidden nodes I choose in this middle layer here-- 2044 01:40:13,750 --> 01:40:14,420 I can pick that. 2045 01:40:14,420 --> 01:40:17,830 Maybe I'll choose to have 50 hidden nodes or 100 hidden nodes. 2046 01:40:17,830 --> 01:40:19,720 And then, for each of these target words, 2047 01:40:19,720 --> 01:40:22,630 I'll have 50 different values or 100 different values, 2048 01:40:22,630 --> 01:40:26,050 and those values we can effectively treat as the numerical 2049 01:40:26,050 --> 01:40:29,320 vector representation of that word. 2050 01:40:29,320 --> 01:40:33,520 And the general idea here is that, if words are similar-- 2051 01:40:33,520 --> 01:40:37,660 two words show up in similar contexts, meaning, for those target words, 2052 01:40:37,660 --> 01:40:40,380 I'd like to predict similar context words-- 2053 01:40:40,380 --> 01:40:43,180 well, then the values I choose in these vectors 2054 01:40:43,180 --> 01:40:45,940 here-- these numerical values for the weights of these edges-- 2055 01:40:45,940 --> 01:40:49,180 are probably going to be similar, because for two different words that 2056 01:40:49,180 --> 01:40:51,580 show up in similar contexts, I would like 2057 01:40:51,580 --> 01:40:55,030 for these values that are calculated to ultimately 2058 01:40:55,030 --> 01:40:58,250 be very similar to one another.
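The following is a toy version of that training loop in plain numpy, under some simplifying assumptions: a tiny made-up corpus, a full softmax over the vocabulary (real word2vec implementations use tricks like negative sampling), and a window of two words. The point is just to show that the input word selects one row of the first weight matrix, and after training those rows serve as the word vectors:

```python
# A toy skip-gram trainer: predict context words from a target word.
# Purely illustrative; not the course's code or a production word2vec.
import numpy as np

sentences = [s.split() for s in [
    "he wrote a book", "he authored a novel",
    "she wrote a novel", "she authored a book",
]]
vocab = sorted({w for s in sentences for w in s})
index = {w: i for i, w in enumerate(vocab)}

# Collect (target, context) pairs within a window of 2 words on each side.
pairs = []
for s in sentences:
    for i in range(len(s)):
        for j in range(max(0, i - 2), min(len(s), i + 3)):
            if j != i:
                pairs.append((index[s[i]], index[s[j]]))

rng = np.random.default_rng(0)
hidden = 5                              # number of hidden nodes (our choice)
W1 = rng.normal(scale=0.1, size=(len(vocab), hidden))  # rows = word vectors
W2 = rng.normal(scale=0.1, size=(hidden, len(vocab)))

for epoch in range(200):
    for target, context in pairs:
        h = W1[target]                  # a one-hot input just selects a row
        scores = h @ W2
        probs = np.exp(scores) / np.exp(scores).sum()   # softmax
        grad = probs.copy()
        grad[context] -= 1              # gradient of the cross-entropy loss
        grad_h = W2 @ grad              # backpropagate to the hidden layer
        W2 -= 0.05 * np.outer(h, grad)
        W1[target] -= 0.05 * grad_h

# Words that appear in similar contexts (book/novel) should now sit closer
# together in W1 than unrelated pairs (book/wrote).
d = lambda a, b: np.linalg.norm(W1[index[a]] - W1[index[b]])
print(d("book", "novel"), d("book", "wrote"))
```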
2059 01:40:58,250 --> 01:41:01,030 And so ultimately, the high-level way you can picture this 2060 01:41:01,030 --> 01:41:02,980 is that what this word2vec training method is 2061 01:41:02,980 --> 01:41:06,790 going to do is, given a whole bunch of words, where initially, 2062 01:41:06,790 --> 01:41:09,430 recall, we initialize these weights randomly and just pick 2063 01:41:09,430 --> 01:41:11,650 random weights to start with. 2064 01:41:11,650 --> 01:41:14,050 Over time, as we train the neural network, 2065 01:41:14,050 --> 01:41:17,680 we're going to adjust these weights, adjust the vector representations 2066 01:41:17,680 --> 01:41:20,860 of each of these words, so that gradually, 2067 01:41:20,860 --> 01:41:24,970 words that show up in similar contexts grow closer to one another, 2068 01:41:24,970 --> 01:41:27,190 and words that show up in different contexts 2069 01:41:27,190 --> 01:41:29,210 get farther away from one another. 2070 01:41:29,210 --> 01:41:32,890 And as a result, hopefully I get vector representations 2071 01:41:32,890 --> 01:41:36,760 of words like breakfast, and lunch, and dinner that are similar to one another, 2072 01:41:36,760 --> 01:41:39,100 and then words like book, and memoir, and novel 2073 01:41:39,100 --> 01:41:42,830 are also going to be similar to one another as well. 2074 01:41:42,830 --> 01:41:46,510 So using this algorithm, we're able to take a corpus of data 2075 01:41:46,510 --> 01:41:50,230 and just train our computer, train this neural network, to be able to figure out 2076 01:41:50,230 --> 01:41:52,650 what vector, what sequence of numbers, is going 2077 01:41:52,650 --> 01:41:55,900 to represent each of these words-- which is, again, a bit of a strange concept 2078 01:41:55,900 --> 01:41:59,450 to think about, representing a word just as a whole bunch of numbers. 2079 01:41:59,450 --> 01:42:02,860 But we'll see in a moment just how powerful this really can be. 2080 01:42:02,860 --> 01:42:08,290 So we'll go ahead and go into vectors, and what I have inside of vectors.py-- 2081 01:42:08,290 --> 01:42:09,910 which I'll open up now-- 2082 01:42:09,910 --> 01:42:14,800 is I'm opening up words.txt, which is a pretrained model-- 2083 01:42:14,800 --> 01:42:17,230 I've already run word2vec, and it's already given me 2084 01:42:17,230 --> 01:42:19,810 a whole bunch of vectors for each of these possible words. 2085 01:42:19,810 --> 01:42:22,330 And I'm just going to take like 50,000 of them 2086 01:42:22,330 --> 01:42:26,420 and go ahead and save their vectors inside of a dictionary called words. 2087 01:42:26,420 --> 01:42:29,260 And then I've also defined some functions called distance; 2088 01:42:29,260 --> 01:42:33,820 closest_words, which gets me the closest words to a particular word; 2089 01:42:33,820 --> 01:42:38,390 and closest_word, which just gets me the one closest word, for example. 2090 01:42:38,390 --> 01:42:39,860 And so now let me try doing this. 2091 01:42:39,860 --> 01:42:43,180 Let me open up the Python interpreter and say something like 2092 01:42:43,180 --> 01:42:46,080 from vectors import *-- 2093 01:42:46,080 --> 01:42:48,590 just import everything from vectors. 2094 01:42:48,590 --> 01:42:51,700 And now let's take a look at the meanings of some words. 2095 01:42:51,700 --> 01:42:55,760 Let me look at the word city, for example. 2096 01:42:55,760 --> 01:43:01,130 And here is a big array that is the vector representation of the word 2097 01:43:01,130 --> 01:43:01,630 city.
2098 01:43:01,630 --> 01:43:04,755 And this doesn't mean anything, in terms of what these numbers exactly are, 2099 01:43:04,755 --> 01:43:07,390 but this is how my computer is representing 2100 01:43:07,390 --> 01:43:08,990 the meaning of the word city. 2101 01:43:08,990 --> 01:43:11,200 We can do a different word, like the word house, 2102 01:43:11,200 --> 01:43:14,860 and here then is the vector representation of the word house, 2103 01:43:14,860 --> 01:43:17,140 for example-- just a whole bunch of numbers. 2104 01:43:17,140 --> 01:43:20,650 And this is encoding somehow the meaning of the word house. 2105 01:43:20,650 --> 01:43:22,390 And how do I get at that idea? 2106 01:43:22,390 --> 01:43:24,880 Well, one way to measure how good this is is by looking at 2107 01:43:24,880 --> 01:43:29,282 what the distance is between various different words. 2108 01:43:29,282 --> 01:43:31,240 There are a number of ways you can define distance. 2109 01:43:31,240 --> 01:43:33,310 In the context of vectors, one common way is what's 2110 01:43:33,310 --> 01:43:35,860 known as the cosine distance, which has to do with measuring 2111 01:43:35,860 --> 01:43:37,580 the angle between vectors. 2112 01:43:37,580 --> 01:43:40,150 But in short, it's just measuring, how far apart 2113 01:43:40,150 --> 01:43:42,710 are these two vectors from each other? 2114 01:43:42,710 --> 01:43:47,210 So if I take a word like the word book, how far away is it from itself-- 2115 01:43:47,210 --> 01:43:49,540 how far away is the word book from book-- 2116 01:43:49,540 --> 01:43:50,440 well, that's zero. 2117 01:43:50,440 --> 01:43:54,400 The word book is zero distance away from itself. 2118 01:43:54,400 --> 01:43:59,180 But let's see how far away the word book is from a word like breakfast, 2119 01:43:59,180 --> 01:44:03,790 where we're going to say one is very far away, zero is not far away. 2120 01:44:03,790 --> 01:44:07,430 All right, book is about 0.64 away from breakfast. 2121 01:44:07,430 --> 01:44:09,560 They seem to be pretty far apart. 2122 01:44:09,560 --> 01:44:12,920 But let's now try and calculate the distance from the word book 2123 01:44:12,920 --> 01:44:16,842 to the word novel, for example. 2124 01:44:16,842 --> 01:44:18,800 Now, those two words are closer to each other-- 2125 01:44:18,800 --> 01:44:19,730 0.34. 2126 01:44:19,730 --> 01:44:21,950 The vector representation of the word book 2127 01:44:21,950 --> 01:44:25,190 is closer to the vector representation of the word novel 2128 01:44:25,190 --> 01:44:28,350 than it is to the vector representation of the word breakfast. 2129 01:44:28,350 --> 01:44:34,010 And I can do the same thing and, say, compare breakfast to lunch, 2130 01:44:34,010 --> 01:44:35,765 for example. 2131 01:44:35,765 --> 01:44:37,640 And those two words are even closer together. 2132 01:44:37,640 --> 01:44:40,010 They have an even more similar relationship 2133 01:44:40,010 --> 01:44:42,470 between one word and another. 2134 01:44:42,470 --> 01:44:45,500 So now it seems we have some representation of words, 2135 01:44:45,500 --> 01:44:49,610 representing a word using vectors, that allows us to be able to say something 2136 01:44:49,610 --> 01:44:52,340 like: words that are similar to each other 2137 01:44:52,340 --> 01:44:55,940 ultimately have a smaller distance between them.
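The full code behind this demo isn't shown on screen, but given what the lecture describes, a plausible sketch of vectors.py might look like the following. The words.txt format, one word per line followed by its vector components, is an assumption; cosine distance here is 1 minus the cosine of the angle between the vectors, so a vector is 0 away from itself and roughly 1 away from an unrelated vector:

```python
# A plausible sketch of vectors.py, not the course's exact code. Assumes
# words.txt stores one word per line followed by that word's vector
# components, and that distance means cosine distance.
import numpy as np

words = {}
with open("words.txt") as f:
    for line in f:
        row = line.split()
        words[row[0]] = np.array(row[1:], dtype=np.float64)

def distance(w1, w2):
    # Cosine distance: 1 - cos(angle). 0 means the vectors point the same
    # way; values near 1 mean they are essentially unrelated.
    return 1 - (w1 @ w2) / (np.linalg.norm(w1) * np.linalg.norm(w2))

def closest_words(embedding, n=10):
    # All words, sorted by distance to the given vector; keep the top n.
    return sorted(words, key=lambda w: distance(embedding, words[w]))[:n]

def closest_word(embedding):
    return closest_words(embedding, n=1)[0]
```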
2138 01:44:55,940 --> 01:44:58,070 And this turns out to be incredibly powerful, to be 2139 01:44:58,070 --> 01:45:01,760 able to represent the meaning of words in terms of their relationships 2140 01:45:01,760 --> 01:45:03,620 to other words as well. 2141 01:45:03,620 --> 01:45:05,000 I can tell you as well-- 2142 01:45:05,000 --> 01:45:06,980 I have a function called closest_words that 2143 01:45:06,980 --> 01:45:09,320 basically just takes a word 2144 01:45:09,320 --> 01:45:11,520 and gets all the closest words to it. 2145 01:45:11,520 --> 01:45:15,980 So let me get the closest words to book, for example, 2146 01:45:15,980 --> 01:45:18,500 and maybe get the 10 closest words. 2147 01:45:18,500 --> 01:45:20,950 We'll limit ourselves to 10. 2148 01:45:20,950 --> 01:45:21,450 And right, 2149 01:45:21,450 --> 01:45:24,420 book is obviously closest to itself-- the word book-- 2150 01:45:24,420 --> 01:45:27,630 but it's also closely related to books, and essay, and memoir, and essays, 2151 01:45:27,630 --> 01:45:29,450 and novella, and anthology. 2152 01:45:29,450 --> 01:45:32,370 And why are these the words that it computed as closest to it? 2153 01:45:32,370 --> 01:45:34,710 Well, because based on the corpus of information 2154 01:45:34,710 --> 01:45:38,220 that this algorithm was trained on, the vectors 2155 01:45:38,220 --> 01:45:41,270 arose based on what words show up in similar contexts-- 2156 01:45:41,270 --> 01:45:45,420 that the word book shows up in contexts similar to words 2157 01:45:45,420 --> 01:45:47,730 like memoir and essays, for example. 2158 01:45:47,730 --> 01:45:49,110 And if I do something like-- 2159 01:45:49,110 --> 01:45:53,740 let me get the closest words to city-- 2160 01:45:53,740 --> 01:45:56,800 you end up getting city, town, township, village. 2161 01:45:56,800 --> 01:46:02,200 These are words that happen to show up in a similar context to the word city. 2162 01:46:02,200 --> 01:46:05,787 Now, where things get really interesting is that, because these are vectors, 2163 01:46:05,787 --> 01:46:07,120 we can do mathematics with them. 2164 01:46:07,120 --> 01:46:11,210 We can calculate the relationships between various different words. 2165 01:46:11,210 --> 01:46:16,240 So I can say something like, all right, what if I had man and king? 2166 01:46:16,240 --> 01:46:18,790 These are two different vectors, and this is a famous example 2167 01:46:18,790 --> 01:46:20,950 that comes out of word2vec. 2168 01:46:20,950 --> 01:46:24,920 I can take these two vectors and just subtract them from each other. 2169 01:46:24,920 --> 01:46:28,040 This line here, the distance here, is another vector 2170 01:46:28,040 --> 01:46:30,430 that represents king minus man. 2171 01:46:30,430 --> 01:46:33,123 Now, what does it mean to take a word and subtract another word? 2172 01:46:33,123 --> 01:46:34,540 Normally, that doesn't make sense. 2173 01:46:34,540 --> 01:46:37,082 In the world of vectors, though, you can take some vector, some 2174 01:46:37,082 --> 01:46:40,090 sequence of numbers, subtract some other sequence of numbers, 2175 01:46:40,090 --> 01:46:43,240 and get a new vector, a new sequence of numbers. 2176 01:46:43,240 --> 01:46:46,690 And what this new sequence of numbers is effectively going to do 2177 01:46:46,690 --> 01:46:52,000 is tell me, what do I need to do to get from man to king? 2178 01:46:52,000 --> 01:46:54,640 What is the relationship then between these two words?
2179 01:46:54,640 --> 01:46:58,120 And this is some vector representation of what 2180 01:46:58,120 --> 01:47:00,640 takes us from man to king. 2181 01:47:00,640 --> 01:47:04,730 And we can then take this value and add it to another vector. 2182 01:47:04,730 --> 01:47:07,700 You might imagine that the word woman, for example, 2183 01:47:07,700 --> 01:47:10,330 is another vector that exists somewhere inside of this space, 2184 01:47:10,330 --> 01:47:12,430 somewhere inside of this vector space. 2185 01:47:12,430 --> 01:47:15,550 And what might happen if I took this same idea, king 2186 01:47:15,550 --> 01:47:19,930 minus man-- took that same vector and just added it to woman? 2187 01:47:19,930 --> 01:47:22,480 What will we find around here? 2188 01:47:22,480 --> 01:47:24,230 It's an interesting question we might ask, 2189 01:47:24,230 --> 01:47:27,700 and we can answer it very easily, because I have vector representations 2190 01:47:27,700 --> 01:47:30,500 of all of these things. 2191 01:47:30,500 --> 01:47:31,660 Let's go back here. 2192 01:47:31,660 --> 01:47:34,690 Let me look at the representation of the word man. 2193 01:47:34,690 --> 01:47:36,887 Here's the vector representation of man. 2194 01:47:36,887 --> 01:47:38,970 Let's look at the representation of the word king. 2195 01:47:38,970 --> 01:47:41,222 Here's the representation of the word king. 2196 01:47:41,222 --> 01:47:42,430 And I can subtract these two. 2197 01:47:42,430 --> 01:47:46,260 What is the vector representation of king minus man? 2198 01:47:46,260 --> 01:47:48,250 It's this array right here-- 2199 01:47:48,250 --> 01:47:49,600 a whole bunch of values. 2200 01:47:49,600 --> 01:47:53,620 So king minus man now represents the relationship between king and man 2201 01:47:53,620 --> 01:47:55,940 in some sort of numerical vector format. 2202 01:47:55,940 --> 01:48:00,170 So what happens then if I add woman to that? 2203 01:48:00,170 --> 01:48:04,640 Whatever took us from man to king, go ahead and apply that same vector 2204 01:48:04,640 --> 01:48:07,520 to the vector representation of the word woman, 2205 01:48:07,520 --> 01:48:10,960 and that gives us this vector here. 2206 01:48:10,960 --> 01:48:15,130 And now, just out of curiosity, let's take this expression 2207 01:48:15,130 --> 01:48:20,720 and find, what is the closest word to that expression? 2208 01:48:20,720 --> 01:48:25,130 And amazingly, what we get is the word queen-- 2209 01:48:25,130 --> 01:48:28,820 that somehow, when you take the distance between man and king-- 2210 01:48:28,820 --> 01:48:32,090 this numerical representation of how man is related to king-- 2211 01:48:32,090 --> 01:48:34,780 and add that same notion, king minus man, 2212 01:48:34,780 --> 01:48:37,100 to the vector representation of the word woman, 2213 01:48:37,100 --> 01:48:40,790 what we get is the vector representation, or something close 2214 01:48:40,790 --> 01:48:43,490 to the vector representation, of the word queen, 2215 01:48:43,490 --> 01:48:48,130 because this distance somehow encoded the relationship between these two 2216 01:48:48,130 --> 01:48:48,630 words.
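In code, using the words dictionary and closest_word helper sketched above (the lowercase keys are an assumption about how the pretrained file names its words), that whole demo comes down to one line of vector arithmetic:

```python
# king - man + woman lands near queen in the lecture's pretrained vectors.
print(closest_word(words["king"] - words["man"] + words["woman"]))  # queen
```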
2217 01:48:48,630 --> 01:48:50,422 And when you run it through this algorithm, 2218 01:48:50,422 --> 01:48:53,240 it's not programmed to do this, but if you just try and figure 2219 01:48:53,240 --> 01:48:55,700 out how to predict words based on context words, 2220 01:48:55,700 --> 01:48:59,960 you get vectors that are able to make these SAT-like analogies out 2221 01:48:59,960 --> 01:49:02,232 of the information that has been given. 2222 01:49:02,232 --> 01:49:03,690 So there are more examples of this. 2223 01:49:03,690 --> 01:49:06,230 We can say, all right, let's figure out, what 2224 01:49:06,230 --> 01:49:10,790 is the distance between Paris and France? 2225 01:49:10,790 --> 01:49:12,580 So Paris and France are words. 2226 01:49:12,580 --> 01:49:14,390 They each have a vector representation. 2227 01:49:14,390 --> 01:49:18,680 This then is a vector representation of the distance between Paris and France-- 2228 01:49:18,680 --> 01:49:21,530 what takes us from France to Paris. 2229 01:49:21,530 --> 01:49:26,540 And let me go ahead and add the vector representation of England to that. 2230 01:49:26,540 --> 01:49:29,690 So this then is the vector representation 2231 01:49:29,690 --> 01:49:35,470 of going Paris minus France plus England-- 2232 01:49:35,470 --> 01:49:38,130 so the distance between France and Paris as vectors. 2233 01:49:38,130 --> 01:49:40,860 Add the England vector, and let's go ahead 2234 01:49:40,860 --> 01:49:43,860 and find the closest word to that. 2235 01:49:43,860 --> 01:49:47,080 2236 01:49:47,080 --> 01:49:48,550 And it turns out to be London. 2237 01:49:48,550 --> 01:49:51,610 You take this relationship, the relationship between France and Paris, 2238 01:49:51,610 --> 01:49:55,000 go ahead and add the England vector to it, and the closest vector to that 2239 01:49:55,000 --> 01:49:57,120 happens to be the vector for the word London. 2240 01:49:57,120 --> 01:49:58,120 We can do more examples. 2241 01:49:58,120 --> 01:50:00,700 I can say, let's take the word for teacher-- 2242 01:50:00,700 --> 01:50:03,700 that vector representation-- and let me subtract 2243 01:50:03,700 --> 01:50:05,470 the vector representation of school. 2244 01:50:05,470 --> 01:50:09,310 So what I'm left with is, what takes us from school to teacher? 2245 01:50:09,310 --> 01:50:14,050 And apply that vector to a word like hospital and see, 2246 01:50:14,050 --> 01:50:15,670 what is the closest word to that? 2247 01:50:15,670 --> 01:50:17,680 Turns out the closest word is nurse. 2248 01:50:17,680 --> 01:50:23,400 Let's try a couple more examples-- the word ramen, for example, 2249 01:50:23,400 --> 01:50:25,610 minus the word Japan. 2250 01:50:25,610 --> 01:50:28,150 So what is the relationship between Japan and ramen? 2251 01:50:28,150 --> 01:50:30,310 Add the word for America to that. 2252 01:50:30,310 --> 01:50:33,340 Want to take a guess at what you might get as a result? 2253 01:50:33,340 --> 01:50:35,840 Turns out you get burritos as the relationship. 2254 01:50:35,840 --> 01:50:38,050 If you do the subtraction, do the addition, 2255 01:50:38,050 --> 01:50:42,080 this is the answer that you happen to get as a consequence of this as well.
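The same arithmetic covers all the other analogies in this demo; as before, the lowercase keys are an assumption about the pretrained file, and the outputs in the comments are the ones reported in the lecture, so exact results depend on that data:

```python
# More SAT-like analogies as vector arithmetic, with the helpers above.
print(closest_word(words["paris"] - words["france"] + words["england"]))     # london
print(closest_word(words["teacher"] - words["school"] + words["hospital"]))  # nurse
print(closest_word(words["ramen"] - words["japan"] + words["america"]))      # burritos
```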
2256 01:50:42,080 --> 01:50:44,703 So these very interesting analogies arise 2257 01:50:44,703 --> 01:50:46,620 in the relationships between these words-- 2258 01:50:46,620 --> 01:50:50,420 that if you just map out all of these words into a vector space, 2259 01:50:50,420 --> 01:50:54,380 you can get some pretty interesting results as a consequence of that. 2260 01:50:54,380 --> 01:50:58,360 And this idea of representing words as vectors turns out 2261 01:50:58,360 --> 01:51:01,300 to be incredibly useful and powerful anytime 2262 01:51:01,300 --> 01:51:04,420 we want to be able to do some statistical work with 2263 01:51:04,420 --> 01:51:06,910 regard to natural language-- to be able to 2264 01:51:06,910 --> 01:51:09,350 represent words not just as their characters, 2265 01:51:09,350 --> 01:51:12,280 but to represent them as numbers, numbers that say something 2266 01:51:12,280 --> 01:51:14,910 or mean something about the words themselves, 2267 01:51:14,910 --> 01:51:18,250 and somehow relate the meaning of a word to other words that 2268 01:51:18,250 --> 01:51:19,920 might happen to exist-- 2269 01:51:19,920 --> 01:51:23,020 so many tools then for being able to work inside 2270 01:51:23,020 --> 01:51:24,910 of this world of natural language. 2271 01:51:24,910 --> 01:51:26,417 Natural language is tricky. 2272 01:51:26,417 --> 01:51:29,500 We have to deal with the syntax of language and the semantics of language, 2273 01:51:29,500 --> 01:51:33,100 but we've really just seen the beginning of some of the ideas that are 2274 01:51:33,100 --> 01:51:37,450 underlying a lot of natural language processing-- the ability to take text, 2275 01:51:37,450 --> 01:51:40,270 extract information out of it, get some sort of meaning out of it, 2276 01:51:40,270 --> 01:51:43,990 generate sentences, maybe by having some knowledge of the grammar or maybe just 2277 01:51:43,990 --> 01:51:47,380 by looking at probabilities of what words are likely to show up based 2278 01:51:47,380 --> 01:51:49,780 on other words that have shown up previously-- 2279 01:51:49,780 --> 01:51:52,300 and then finally, the ability to take words 2280 01:51:52,300 --> 01:51:55,330 and come up with some distributed representation of them, to take words 2281 01:51:55,330 --> 01:51:58,240 and represent them as numbers, and use those numbers 2282 01:51:58,240 --> 01:52:02,210 to be able to say something meaningful about those words as well. 2283 01:52:02,210 --> 01:52:04,390 So this then is yet another topic in this broader 2284 01:52:04,390 --> 01:52:06,300 heading of artificial intelligence. 2285 01:52:06,300 --> 01:52:08,380 And just as I look back at where we've been now, 2286 01:52:08,380 --> 01:52:11,320 we started our conversation by talking about the world of search, 2287 01:52:11,320 --> 01:52:14,590 about trying to solve problems like tic-tac-toe by searching 2288 01:52:14,590 --> 01:52:17,500 for a solution, by exploring our various different possibilities 2289 01:52:17,500 --> 01:52:21,220 and looking at what algorithms we can apply to be able to efficiently 2290 01:52:21,220 --> 01:52:22,300 try and search a space. 2291 01:52:22,300 --> 01:52:25,930 We looked at some simple algorithms and then looked at some optimizations 2292 01:52:25,930 --> 01:52:28,780 we could make to those algorithms, and ultimately, that 2293 01:52:28,780 --> 01:52:31,742 was in service of trying to get our AI to know things about the world.
2294 01:52:31,742 --> 01:52:34,450 And this has been a lot of what we've talked about today as well, 2295 01:52:34,450 --> 01:52:37,270 trying to get knowledge out of text-based information, 2296 01:52:37,270 --> 01:52:41,440 the ability to take information and draw conclusions based on that information. 2297 01:52:41,440 --> 01:52:43,630 If I know these two things for certain, maybe I 2298 01:52:43,630 --> 01:52:46,660 can draw a third conclusion as well. 2299 01:52:46,660 --> 01:52:49,330 That then was related to the idea of uncertainty. 2300 01:52:49,330 --> 01:52:51,460 If we don't know something for sure, can we 2301 01:52:51,460 --> 01:52:54,420 predict something, figure out the probabilities of something? 2302 01:52:54,420 --> 01:52:56,170 And we saw that again today in the context 2303 01:52:56,170 --> 01:52:59,200 of trying to predict whether a tweet or whether a message 2304 01:52:59,200 --> 01:53:01,420 has positive sentiment or negative sentiment, 2305 01:53:01,420 --> 01:53:04,022 and trying to draw that conclusion as well. 2306 01:53:04,022 --> 01:53:05,980 Then we took a look at optimization-- the sorts 2307 01:53:05,980 --> 01:53:09,490 of problems where we're looking for a global or local maximum 2308 01:53:09,490 --> 01:53:10,300 or minimum. 2309 01:53:10,300 --> 01:53:13,420 This has come up time and time again, especially most recently 2310 01:53:13,420 --> 01:53:16,750 in the context of neural networks, which are really just a kind of optimization 2311 01:53:16,750 --> 01:53:20,110 problem where we're trying to minimize the total amount of loss 2312 01:53:20,110 --> 01:53:23,110 based on the setting of the weights of our neural network, 2313 01:53:23,110 --> 01:53:26,710 based on the setting of what vector representations for words we 2314 01:53:26,710 --> 01:53:27,880 happen to choose. 2315 01:53:27,880 --> 01:53:30,430 And those ultimately helped us to be able to solve 2316 01:53:30,430 --> 01:53:33,940 learning-related problems-- the ability to take a whole bunch of data, 2317 01:53:33,940 --> 01:53:37,650 and rather than us telling the AI exactly what to do, 2318 01:53:37,650 --> 01:53:40,030 let the AI learn patterns from the data for itself. 2319 01:53:40,030 --> 01:53:43,770 Let it figure out what makes an inbox message different from a spam message. 2320 01:53:43,770 --> 01:53:45,520 Let it figure out what makes a counterfeit 2321 01:53:45,520 --> 01:53:47,560 bill different from an authentic bill, and being 2322 01:53:47,560 --> 01:53:49,820 able to draw that analysis as well. 2323 01:53:49,820 --> 01:53:52,390 And one of the big tools in learning that we used 2324 01:53:52,390 --> 01:53:54,220 was neural networks, these structures that 2325 01:53:54,220 --> 01:53:58,180 allow us to relate inputs to outputs by training these internal networks 2326 01:53:58,180 --> 01:54:02,410 to learn some sort of function that maps us from some input to some output-- 2327 01:54:02,410 --> 01:54:05,770 ultimately yet another model in this language of artificial intelligence 2328 01:54:05,770 --> 01:54:08,320 that we can use to communicate with our AI.
2329 01:54:08,320 --> 01:54:10,210 Then finally today, we looked at some ways 2330 01:54:10,210 --> 01:54:12,850 that AI can begin to communicate with us, looking at ways 2331 01:54:12,850 --> 01:54:16,240 that AI can begin to get an understanding of the syntax 2332 01:54:16,240 --> 01:54:19,990 and the semantics of language, to be able to generate sentences, 2333 01:54:19,990 --> 01:54:23,110 to be able to predict things about text that's written in a spoken 2334 01:54:23,110 --> 01:54:25,360 language or a written language like English, 2335 01:54:25,360 --> 01:54:27,927 and to be able to do interesting analysis there as well. 2336 01:54:27,927 --> 01:54:30,010 And there's so much more active research 2337 01:54:30,010 --> 01:54:33,160 happening all over the areas within artificial intelligence today, 2338 01:54:33,160 --> 01:54:36,890 and we've really only just seen the beginning of what AI has to offer. 2339 01:54:36,890 --> 01:54:39,310 So I hope you enjoyed this exploration into this world 2340 01:54:39,310 --> 01:54:41,235 of artificial intelligence with Python. 2341 01:54:41,235 --> 01:54:44,110 A big thank you to the course's teaching staff and the production team 2342 01:54:44,110 --> 01:54:45,700 for making this class possible. 2343 01:54:45,700 --> 01:54:49,940 This was an Introduction to Artificial Intelligence with Python. 2344 01:54:49,940 --> 01:54:51,000