>> LUCAS FREITAS: Hey. Welcome, everyone. My name is Lucas Freitas. I'm a junior at [INAUDIBLE] studying computer science with a focus in computational linguistics, so my secondary is in language and linguistic theory. I'm really excited to teach you guys a little bit about the field. It's a very exciting area to study, and one with a lot of potential for the future. So I'm really excited that you guys are considering projects in computational linguistics, and I'll be more than happy to advise any of you if you decide to pursue one of those.

>> So first of all, what is computational linguistics? Computational linguistics is the intersection between linguistics and computer science. So what is linguistics? What is computer science? Well, from linguistics, what we take is the languages. Linguistics is actually the study of natural language in general. By natural language we mean language that we actually use to communicate with each other, so we're not exactly talking about C or Java. We're talking more about English and Chinese and other languages that we use to communicate with each other.

>> The challenging thing about that is that right now we have almost 7,000 languages in the world, so there is quite a high variety of languages that we can study. And then you think that it's probably very hard to do, for example, translation from one language to another, considering that you have almost 7,000 of them. If you think of doing translation from one language to another, you have more than a million different combinations from language to language, so it's really challenging to build some kind of translation system for every single language.

>> So linguistics deals with syntax, semantics, pragmatics. You guys don't exactly need to know what they are. But the very interesting thing is that as a native speaker, when you learn a language as a child, you actually learn all of those things -- syntax, semantics, and pragmatics -- by yourself.
And nobody has to teach you syntax for you to understand how sentences are structured. So it's really interesting, because it's something that comes very intuitively.

>> And what are we taking from computer science? Well, the most important things that we have in computer science are, first of all, artificial intelligence and machine learning. What we're trying to do in computational linguistics is teach a computer how to do something with language.

>> So, for example, in machine translation, I'm trying to teach my computer how to go from one language to the other -- basically like teaching a computer two languages. If I do natural language processing, which is the case, for example, of Facebook's Graph Search, I teach my computer how to understand queries well.

>> So if you say "the photos of my friends," Facebook doesn't treat that as one big string that just has a bunch of words. It actually understands the relation between "photos" and "my friends" and understands that "photos" are a property of "my friends."

>> So that's part of, for example, natural language processing. It's trying to understand what the relation between the words in a sentence is. And the big question is, can you teach a computer how to speak a language in general? Which is a very interesting question to think about, because maybe in the future you're going to be able to talk to your cell phone -- kind of like what we do with Siri, but something more like, you can actually say whatever you want and the phone is going to understand everything. And it can ask follow-up questions and keep talking. That's something really exciting, in my opinion.

>> So, something about natural languages. Something really interesting about natural languages -- and credit for this goes to my linguistics professor, Maria Polinsky -- is an example she gives, and I think it's really interesting. We learn language from when we're born, and then our native language kind of grows on us.

>> And basically you learn language from minimal input, right? You're just getting input from your parents of what your language sounds like, and you just learn it.
So it's interesting, because look at these sentences, for example. Look at "Mary puts on a coat every time she leaves the house."

>> In this case, it's possible to have the word "she" refer to Mary, right? You can say "Mary puts on a coat every time Mary leaves the house," so that's fine. But then if you look at the sentence "She puts on a coat every time Mary leaves the house," you know it's impossible to say that "she" is referring to Mary.

>> There's no way of reading it as "Mary puts on a coat every time Mary leaves the house." So it's interesting, because this is the kind of intuition that every native speaker has. Nobody was taught that this is the way the syntax works -- that you can only have "she" referring to Mary in the first case, and actually in this other one too, but not in this one. But everyone kind of gets to the same answer. Everyone agrees on that. So it's really interesting how, although you don't know all the rules of your language, you kind of understand how the language works.

>> So the interesting thing about natural language is that you don't have to know any syntax to know whether a sentence is grammatical or ungrammatical, in most cases. Which makes you think that maybe what happens is that through your life, you just keep getting more and more sentences told to you, and you keep memorizing all of those sentences. Then when someone tells you something, you hear that sentence, you look at your vocabulary of sentences, and you see if that sentence is there. If it is there, you say it's grammatical. If it's not, you say it's ungrammatical.

>> So in that case, you would say, oh, you have a huge list of all possible sentences, and when you hear a sentence, you know whether it's grammatical or not based on that. The thing is, if you look at a sentence like, for example, "The five-headed CS50 TFs cooked the blind octopus using a DAPA mug," it's definitely not a sentence that you've heard before. But at the same time you know it's pretty much grammatical, right? There are no grammatical mistakes, and you can say that it's a possible sentence.
>> So it makes us think that the way we actually learn language is not only by having a huge database of possible words or sentences, but more by understanding the relation between the words in those sentences. Does that make sense?

So then the question is, can computers learn languages? Can we teach language to computers?

>> Let's think of the difference between a native speaker of a language and a computer. What happens with the speaker? Well, the native speaker learns a language from exposure to it, usually in the early childhood years. Basically, you have a baby, you keep talking to it, and it just learns how to speak the language, right? So you're basically giving input to the baby. Then you can argue that a computer can do the same thing, right? You can just give language as input to the computer --

>> for example, a bunch of files that have books in English. Maybe that's one way that you could possibly teach a computer English, right? And in fact, if you think about it, it takes you maybe a couple of days to read a book. For a computer it takes a second to look at all the words in a book. So you can see that this argument of just getting input from around you is not enough to say that language learning is something only humans can do. Computers can also get input.

>> The second thing is that native speakers also have a brain that has language learning capability. But if you think about it, a brain is a fixed thing. When you are born, it's already set -- this is your brain. And as you grow up, you just get more input of language, and maybe nutrients and other stuff, but pretty much your brain is a fixed thing.

>> So you can say, well, maybe you can build a computer that has a bunch of functions and methods that just mimic language learning capability. In that sense, you could say, well, I can have a computer that has all the things I need to learn language. And the last thing is that a native speaker learns from trial and error. So basically, another important thing in language learning is that you kind of learn things by making generalizations from what you hear.
>> So as you are growing up, you learn that some words are more like nouns and some other ones are adjectives. And you don't have to have any knowledge of linguistics to understand that. You just know that some words are positioned in one part of the sentence and some others in other parts of the sentence.

>> And when you produce something like a sentence that is not correct -- maybe because of an overgeneralization, for example. Maybe when you're growing up, you notice that the plural is usually formed by putting an S at the end of the word, and then you try to do the plural of "deer" as "deers" or "tooth" as "tooths." Then your parents or someone corrects you and says, no, the plural of "deer" is "deer," and the plural of "tooth" is "teeth." And then you learn those things. So you learn from trial and error.

>> But you can also do that with a computer. You can have something called reinforcement learning, which is basically giving a computer a reward whenever it does something correctly, and giving it the opposite of a reward when it does something wrong. You can actually see that if you go to Google Translate and you try to translate a sentence, it asks you for feedback. So if you say, oh, there's a better translation for this sentence, you can type it in. And then if a lot of people keep saying that's a better translation, it just learns that it should use that translation instead of the one it was giving.

>> So it's a very philosophical question whether computers are going to be able to talk or not in the future. But I have high hopes that they can, just based on those arguments. But it's more of a philosophical question.

>> So while computers still cannot talk, what are the things that we can do? One really cool thing is data classification. For example, you guys know that email services do spam filtering. Whenever you receive spam, it tries to filter it into another box. So how does it do that? It's not like the computer just knows which email addresses are sending spam. It's more based on the content of the message, or maybe the title, or maybe some pattern that you have.
>> So basically, what you can do is get a lot of data -- emails that are spam and emails that are not spam -- and learn what kind of patterns you have in the ones that are spam. And this is part of computational linguistics. It's called data classification. We're actually going to see an example of that in the next slides.

>> The second thing is natural language processing, which is what Graph Search is doing: letting you write a sentence, and it tries to understand what the meaning is and give you a better result. Actually, if you go to Google or Bing and you search for something like Lady Gaga's height, you're actually going to get 5' 1" instead of just pages about her, because it actually understands what you're talking about. So that's part of natural language processing.

>> Or also, when you're using Siri, first you have an algorithm that tries to translate what you're saying into words, into text, and then it tries to translate that into meaning. So that's all part of natural language processing.

>> Then you have machine translation -- which is actually one of my favorites -- which is just translating from one language to another. You can see that when you're doing machine translation, you have infinite possibilities of sentences, so there's no way of just storing every single translation. You have to come up with interesting algorithms to be able to translate every single sentence in some way.

>> You guys have any questions so far? No? OK.

>> So what are we going to see today? First of all, I'm going to talk about the classification problem -- the one that I was describing with spam. What I'm going to do is, given lyrics to a song, can you try to figure out, with high probability, who the singer is? Let's say that I have songs from Lady Gaga and Katy Perry; if I give you a new song, can you figure out whether it's Katy Perry or Lady Gaga?

>> For the second one, I'm just going to talk about the segmentation problem. I don't know if you guys know, but Chinese, Japanese, other East Asian languages, and other languages in general don't have spaces between words.
And if you think about the way that your computer tries to do natural language processing, it looks at the words and tries to understand the relations between them, right? But then if you have Chinese, and you have zero spaces, it's really hard to find out what the relation between words is, because the text doesn't have delimited words in the first place. So you have to do something called segmentation, which just means putting spaces between what we'd call words in those languages. Make sense?

>> And then we're going to talk about syntax -- just a little bit about natural language processing. It's going to be just an overview. So today, basically what I want to do is give you guys a little bit of an idea of the possibilities of what you can do with computational linguistics. Then you can see what you think is cool among those things, and maybe you can think of a project and come talk to me, and I can give you advice on how to implement it.

>> So syntax is going to be a little bit about Graph Search and machine translation. I'm just going to give an example of how you could, for example, translate something from Portuguese to English. Sounds good?

>> So first, the classification problem. I'll say that this part of the seminar is going to be the most challenging one, just because there's going to be some coding. But it's going to be Python. I know you guys don't know Python, so I'm just going to explain at a high level what I'm doing, and you don't have to care too much about the syntax, because that's something you guys can learn. OK? Sounds good.

>> So what is the classification problem? You're given some lyrics to a song, and you want to guess who is singing it. And this works for any other kind of problem. It can be, for example, that you have a presidential campaign and you have a speech, and you want to find out whether it was, for example, Obama or Mitt Romney. Or you can have a bunch of emails, and you want to figure out whether they are
spam or not. So it's just classifying some data based on the words that you have there.

>> To do that, you have to make some assumptions. A lot of computational linguistics is making assumptions -- usually smart assumptions -- so that you can get good results. You try to create a model for it, then you try it out and see if it works, if it gives you good precision. If it does, then you try to improve it. If it doesn't, you're like, OK, maybe I should make a different assumption.

>> So the assumption that we're going to make is that an artist usually sings about a topic multiple times, and maybe uses certain words multiple times just because they're used to them. You can just think of your friends. I'm sure you guys all have friends who say their signature phrase in literally every single sentence -- some specific word or some specific phrase that they say in every single sentence.

>> And what you can say is that if you see a sentence that has that signature phrase, you can guess that your friend is probably the one saying it, right? So you make that assumption, and then that's how you create a model.

>> The example that I'm going to give is about Lady Gaga; for example, people say that she uses "baby" in all her number one songs. And actually this is a video that shows her saying the word "baby" in different songs.

>> [VIDEO PLAYBACK]

>> -(SINGING) Baby. Baby. Baby. Baby. Baby. Babe. Baby. Baby. Baby. Baby.

>> [END VIDEO PLAYBACK]

>> LUCAS FREITAS: So there are, I think, 40 songs here in which she says the word "baby." So you can basically guess that if you see a song that has the word "baby," there's some high probability that it's Lady Gaga. But let's try to develop this a little more formally.

>> So these are lyrics to songs by Lady Gaga and Katy Perry. You look at Lady Gaga, and you see a lot of occurrences of "baby" and a lot of occurrences of "way." And then Katy Perry has a lot of occurrences of "the" and a lot of occurrences of "fire."
>> So basically what we want to do is this: you get a lyric. Let's say that you get a lyric for a song that is just "baby." If you just get the word "baby," and this is all the data that you have from Lady Gaga and Katy Perry, who would you guess is the person who sings the song? Lady Gaga or Katy Perry? Lady Gaga, right? Because she's the only one who says "baby." This sounds stupid, right? OK, this is really easy. I'm just looking at the two songs, and of course she's the only one who has "baby."

>> But what if you have a bunch of words? If you have an actual lyric, something like "baby, I just went to see a [? CFT ?] lecture," or something like that, then you actually have to figure out -- based on all those words -- who is the artist who probably sang this song. So let's try to develop this a little further.

>> OK, so based just on the data that we got, it seems that Gaga is probably the singer. But how can we write this more formally? There's going to be a little bit of statistics here, so if you get lost, just try to understand the concept. It doesn't matter if you don't understand the equations perfectly well. This is all going to be online.

>> So basically what I'm calculating is the probability that this song is by Lady Gaga given that -- this bar means "given that" -- I saw the word "baby." Does that make sense? So I'm trying to calculate that probability.

>> There is this theorem called Bayes' theorem that says the probability of A given B is the probability of B given A, times the probability of A, over the probability of B. It's a long equation. But what you have to understand from it is that this is what I want to calculate, right? The probability that the song is by Lady Gaga given that I saw the word "baby."

>> And now what I'm getting is the probability of the word "baby" given that I have Lady Gaga. And what is that, basically? What it means is: what is the probability of seeing the word "baby" in Gaga lyrics?
If I want to calculate that in a very simple way, it's just the number of times I see "baby" over the total number of words in Gaga lyrics, right? What is the frequency with which I see that word in Gaga's work? Make sense?

>> The second term is the probability of Gaga. What does that mean? It basically means: what is the probability of classifying some lyrics as Gaga? That's kind of weird, but let's think of an example. Let's say that the probability of having "baby" in a song is the same for Gaga and Britney Spears, but Britney Spears has twice as many songs as Lady Gaga. So if someone just randomly gives you lyrics containing "baby," the first thing you look at is: what is the probability of having "baby" in a Gaga song, and of having "baby" in a Britney song? And it's the same.

>> So the second thing you'll look at is: what is the probability of this lyric by itself being a Gaga lyric, and what is the probability of it being a Britney lyric? Since Britney has so many more lyrics than Gaga, you would probably say, well, this is probably a Britney lyric. That's why we have this term right here: the probability of Gaga. Makes sense? Does it? OK.

>> And the last one is just the probability of "baby," which doesn't really matter that much. It's the probability of seeing "baby" in English. We usually don't care that much about that term. Does that make sense? So the probability of Gaga is called the prior probability of the class Gaga, because it just means: what is the probability of having that class -- which is Gaga -- in general, with no conditions.

>> And then when I have the probability of Gaga given "baby," we call it the posterior probability, because it's the probability of having Gaga given some evidence. So I'm giving you the evidence that I saw the word "baby" in the song. Makes sense? OK.

>> So if I calculated that for each of the songs for Lady Gaga, what would that be -- apparently, I cannot move this.
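(In symbols, the quantity being computed here is just Bayes' theorem applied to this example; the notation below is mine, not taken from the slide:)

P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}
\qquad\Rightarrow\qquad
P(\text{Gaga} \mid \text{``baby''}) = \frac{P(\text{``baby''} \mid \text{Gaga})\,P(\text{Gaga})}{P(\text{``baby''})}

The numerator holds the word frequency and the prior just described; the denominator is the probability of "baby" in English, the term we mostly ignore.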
The probability of Gaga will be something like 2 over 24, times 1/2, over 2 over 53. It doesn't matter exactly where these numbers come from; what matters is that it's a number that is going to be more than 0, right?

>> And then when I do Katy Perry, the probability of "baby" given Katy is already 0, right? Because there's no "baby" in Katy Perry. So this whole thing becomes 0, and Gaga wins, which means that Gaga is probably the singer. Does that make sense? OK.

>> So if I want to make this more formal, I can actually build a model for multiple words. Let's say that I have something like "baby, I am on fire," or something. So it has multiple words. In this case, you can see that "baby" is in Gaga but not in Katy, and "fire" is in Katy but not in Gaga, right? So it's getting trickier, right? Because it seems that you almost have a tie between the two.

>> So what you have to do is assume independence among the words. Basically what that means is that I'm calculating the probability of seeing "baby," the probability of seeing "I," and "am," and "on," and "fire," all separately. Then I multiply all of them together, and I see what the probability of seeing the whole sentence is. Make sense?

>> So basically, if I have just one word, what I want to find is the arg max, which means: which class gives me the highest probability? Which class gives me the highest probability for the probability of the class given the word? So in this case, Gaga given "baby," or Katy given "baby." Make sense?

>> And just from Bayes -- that equation that I showed -- we create this fraction. The only thing is that you see that the probability of the word given the class changes depending on the class, right? The number of "baby"s that I have in Gaga is different from Katy. The probability of the class also changes because it's just the number of songs each of them has.

>> But the probability of the word itself is going to be the same for all the artists, right?
So the probability of the word is just: what is the probability of seeing that word in the English language? It's the same for all of them. And since it's constant, we can just drop it and not care about it. So this is actually the equation we're looking for.

>> And if I have multiple words, I'm still going to have the prior probability here. The only thing is that I'm multiplying in the probabilities of all the other words. So I'm multiplying all of them. Make sense? It looks weird, but it basically means: calculate the prior of the class, and then multiply it by the probability of each of the words being in that class.

>> And you know that the probability of a word given a class is going to be the number of times you see that word in that class, divided by the number of words you have in that class in general. Make sense? It's just how "baby" was 2 over the number of words that I had in the lyrics. So just the frequency.

>> But there is one thing. Remember how I was showing that the probability of "baby" being in lyrics from Katy Perry was 0, just because Katy Perry didn't have "baby" at all? It sounds a little harsh to simply say that lyrics cannot be from an artist just because they don't contain that particular word anywhere.

>> So you could say, well, if you don't have this word, I'm going to give you a lower probability, but I'm just not going to give you 0 right away. Because maybe it was something like "fire, fire, fire, fire," which is totally Katy Perry, and then one "baby," and it goes to 0 right away just because there was one "baby."

>> So basically what we do is something called Laplace smoothing. This just means that I'm giving some probability even to words that do not appear. So what I do is, when I'm calculating this, I always add 1 to the numerator. So even if the word doesn't appear -- in this case, if the count is 0 -- I'm still calculating this as 1 over the total number of words. Otherwise, I take the count that I have and add 1. So I'm covering both cases. Make sense?
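(Putting the decision rule we just built into symbols -- my notation, reconstructed from the description above, with the constant denominator dropped and the simplified smoothing applied only to the numerator:)

\hat{c} = \arg\max_{c \in \{\text{Gaga},\,\text{Katy}\}} \; P(c) \prod_{i=1}^{n} P(w_i \mid c),
\qquad
P(w_i \mid c) = \frac{\text{count}(w_i, c) + 1}{\text{total number of words in } c}

Here w_1, ..., w_n are the words of the new lyric, P(c) is the prior of the class, and count(w_i, c) is how many times word w_i appears in that artist's lyrics.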
>> So now let's do some coding. I'm going to have to do it pretty fast, but it's just important that you guys understand the concepts. What we're trying to do is implement exactly the thing I just described: I want to put in lyrics from Lady Gaga and Katy Perry, and the program is going to be able to say whether new lyrics are from Gaga or Katy Perry. Make sense? OK.

>> So I have this program I'm going to call classify.py. This is Python. It's a new programming language for you, and it is very similar in some ways to C and PHP. If you want to learn Python after knowing C, it's really not that much of a challenge, just because Python is much easier than C, first of all, and a lot of things are already implemented for you. So just as PHP has functions that sort a list, or append something to an array, and so on, Python has all of those as well.

>> So I'm just going to explain quickly how we could do the classification problem here. Let's say that in this case, I have lyrics from Gaga and Katy Perry. The way that I have those lyrics is that the first word of each lyric is the name of the artist, and the rest is the lyrics. So let's say that I have this list in which the first one is lyrics by Gaga -- so here I am on the right track -- and the next one is Katy, and it also has her lyrics.

>> This is how you declare a variable in Python. You don't have to give the data type. You just write "lyrics," kind of like in PHP. Make sense?

>> So what are the things that I have to calculate to be able to compute the probabilities? I have to calculate the "priors" of each of the different classes that I have. I have to calculate the "posteriors," or pretty much the probabilities of each of the different words that I can have for each artist. So within Gaga, for example, I'm going to have a list of how many times I see each of the words. Make sense?

>> And finally, I'm just going to have a list called "words" that is just going to hold how many words I have for each artist.
So for Gaga, for example, when I look at the lyrics, I had, I think, 24 words in total. So this list is just going to have 24 for Gaga, and another number for Katy. Make sense? OK.

>> So now, actually, let's go to the coding. In Python, you can actually return a bunch of different things from a function. So I'm going to create this function called "conditional," which is going to return all of those things: the "priors," the "probabilities," and the "words." So "conditional," and it's going to be called on "lyrics."

>> So now I want to actually write this function. The way I can write this function is to just define it with "def." So I did "def conditional," and it takes "lyrics." And what this is going to do is, first of all, I have my priors that I want to calculate.

>> The way that I can do this is to create a dictionary in Python, which is pretty much the same thing as a hash table, or like an associative array in PHP. This is how I declare a dictionary. And basically what this means is that the prior for Gaga is 0.5, for example, if 50% of the lyrics are from Gaga and 50% are from Katy. Make sense? So I have to figure out how to calculate the priors.

>> The next ones that I have to do, also, are the probabilities and the words. So the probabilities for Gaga are the list of all the probabilities that I have for each of the words for Gaga. If I go to probabilities of Gaga, "baby," for example, it'll give me something like 2 over 24 in that case. Make sense? So I go to "probabilities," go to the "Gaga" bucket that has a list of all the Gaga words, then I go to "baby," and I see the probability.

>> And finally I have this "words" dictionary. So here, "probabilities," and then "words." So if I do "words," "Gaga," what is going to happen is that it's going to give me 24, saying that I have 24 words within lyrics from Gaga. Makes sense? So here, "words" equals dah-dah-dah. OK.

>> So what I'm going to do is iterate over each of the lyrics, so each of the strings that I have in the list.
And I'm going to calculate those things for each of the candidates. Makes sense? So I have to do a for loop.

>> In Python what I can do is "for line in lyrics" -- the same thing as a "foreach" statement in PHP. Remember how in PHP I could say "foreach lyrics as line"? Makes sense? So I'm taking each of the lines -- in this case, this string and the next string. And for each of the lines, what I'm going to do first is split the line into a list of words separated by spaces.

>> The cool thing about Python is that you could just Google "how can I split a string into words?" and it's going to tell you how to do it. And the way to do it is just "line = line.split()", which is basically going to give you a list with each of the words here. Makes sense? So now that I've done that, I want to know who the singer of that song is. To do that I just have to get the first element of the array, right? So I can just say "singer = line[0]". Makes sense?

>> And then what I need to do is, first of all, update how many words I have under "Gaga." So I'm just going to calculate how many words I have in this list, right? Because this is how many words I have in the lyrics, and I'm just going to add it to the "Gaga" entry. Does that make sense? Don't focus too much on the syntax. Think more about the concepts. That's the most important part. OK.

>> So what I can do is, if "Gaga" is already in that list -- so "if singer in words", which means that I already have words by Gaga -- I just want to add the additional words to it. So what I do is "words[singer] += len(line) - 1". And I can just use the length of the line -- how many elements I have in the array. I have to do minus 1 just because the first element of the array is the singer name, and that's not lyrics. Makes sense? OK.

>> "Else" means that I actually want to insert Gaga into the list. So I just do "words[singer] = len(line) - 1", sorry. The only difference between the two lines is that in this one, the entry doesn't exist yet, so I'm just initializing it.
In this one, I'm actually adding to it. OK. So this was adding to "words."

>> Now I want to add to the priors. So how do I calculate the priors? The priors can be calculated by counting how many times you see that singer among all of the singers that you have, right? So for Gaga and Katy Perry, in this case, I see Gaga once and Katy Perry once.

>> So basically the priors for Gaga and for Katy Perry would each just be one, right? Just how many times I see the artist. This is very easy to calculate. I can do something similar: "if singer in priors," I'm just going to add 1 to their priors entry. So "priors[singer] += 1", and then "else" I'm going to do "priors[singer] = 1". Makes sense?

>> So if it doesn't exist, I just set it to 1; otherwise, I add 1. OK, so now all that I have left to do is add each of the words to the probabilities. I have to count how many times I see each of the words, so I just have to do another for loop over the line.

>> The first thing that I'm going to do is check whether the singer already has a probabilities entry. So I'm checking: if the singer doesn't have a probabilities entry, I'm just going to initialize one for them. It's not even an array, sorry, it's a dictionary. So the probabilities for the singer are going to be an empty dictionary; I'm just initializing a dictionary for it. OK?

>> And now I can actually do a for loop to calculate each of the words' probabilities. OK. So what I can do is a for loop, just iterating over the array. The way that I can do that in Python is "for i in range" -- from 1, because I want to start at the second element, since the first one is the singer name -- up to the length of the line. And when I do range, it actually goes from 1 to the length of the line minus 1, so it already does that n-minus-1 thing for arrays, which is very convenient. Makes sense?
>> So for each of these, what I'm going to do is, just like in the other one, check whether the word in this position in the line is already in the probabilities.

>> And as I said here, for the probabilities of words I put "probabilities[singer]" -- so the name of the singer. If the word is already in "probabilities[singer]", it means that I want to add 1 to it, so I'm going to do "probabilities[singer]", and the word is "line[i]", and I'm going to add 1. And "else", I'm just going to initialize it to 1: "line[i]". Makes sense?

>> So, I've calculated all of the arrays. Now all that I have to do for this one is just "return priors, probabilities, words." Let's see if there are any errors -- OK, it seems everything is working so far. So, does that make sense? In some way? OK.

>> So now I have all the probabilities. The only thing I have left is to have the thing that calculates the product of all the probabilities when I get new lyrics.

>> So let's say that I now want to call this function "classify()", and the thing that function takes is just one argument. Let's say "Baby, I am on fire", and it's going to figure out: what is the probability that this is Gaga? What is the probability that this is Katy? Sounds good?

>> So I'm just going to create a new function called "classify()", and it's going to take some lyrics as well. Besides the lyrics, I also have to pass it the priors, the probabilities, and the words. So I'm going to send lyrics, priors, probabilities, words.

>> So this takes lyrics, priors, probabilities, words. What does it do? It's basically going to go through all the possible candidates that you have as a singer. And where are those candidates? They're in the priors, right? So I have all of those there. So I'm going to have a dictionary of all possible candidates, and then for each candidate in the priors -- so that means it's going to be Gaga and Katy; if I had more, it would be more.
727 00:38:17,740 --> 00:38:20,410 I'm going to start calculating this probability. 728 00:38:20,410 --> 00:38:28,310 The probability, as we saw in the PowerPoint, is the prior times the 729 00:38:28,310 --> 00:38:30,800 product of each of the other probabilities. 730 00:38:30,800 --> 00:38:32,520 >> So I can do the same here. 731 00:38:32,520 --> 00:38:36,330 I can just say the probability is initially just the prior. 732 00:38:36,330 --> 00:38:40,340 So priors of the candidate. 733 00:38:40,340 --> 00:38:40,870 Right? 734 00:38:40,870 --> 00:38:45,360 And now I have to iterate over all the words that I have in the lyrics to be 735 00:38:45,360 --> 00:38:48,820 able to add the probability for each of them, OK? 736 00:38:48,820 --> 00:38:57,900 So, "for word in lyrics," what I'm going to do is, if the word is in 737 00:38:57,900 --> 00:39:01,640 "probabilities[candidate]", which means that it's a word that the 738 00:39:01,640 --> 00:39:03,640 candidate has in their lyrics-- 739 00:39:03,640 --> 00:39:05,940 for example, "baby" for Gaga-- 740 00:39:05,940 --> 00:39:11,710 what I'm going to do is that the probability is going to be multiplied 741 00:39:11,710 --> 00:39:22,420 by 1 plus the probabilities of the candidate for that word. 742 00:39:22,420 --> 00:39:25,710 And it's called "word". 743 00:39:25,710 --> 00:39:32,440 This divided by the number of words that I have for that candidate. 744 00:39:32,440 --> 00:39:37,450 The total number of words that I have for the singer that I'm looking at. 745 00:39:37,450 --> 00:39:40,290 >> "Else," it means it's a new word, so it'd be like, for example, 746 00:39:40,290 --> 00:39:41,860 "fire" for Lady Gaga. 747 00:39:41,860 --> 00:39:45,760 So I just want to do 1 over "word[candidate]". 748 00:39:45,760 --> 00:39:47,710 So I don't want to put this term here. 749 00:39:47,710 --> 00:39:50,010 >> So it's going to be basically copying and pasting this. 750 00:39:50,010 --> 00:39:54,380 751 00:39:54,380 --> 00:39:56,000 But I'm going to delete this part. 752 00:39:56,000 --> 00:39:57,610 So it's just going to be 1 over that. 753 00:39:57,610 --> 00:40:00,900 754 00:40:00,900 --> 00:40:02,150 Sounds good? 755 00:40:02,150 --> 00:40:03,980 756 00:40:03,980 --> 00:40:09,700 And now at the end, I'm just going to print the name of the candidate and 757 00:40:09,700 --> 00:40:15,750 the probability that they have of having this sentence in their lyrics. 758 00:40:15,750 --> 00:40:16,200 Makes sense? 759 00:40:16,200 --> 00:40:18,390 And I actually don't even need this dictionary. 760 00:40:18,390 --> 00:40:19,510 Makes sense? 761 00:40:19,510 --> 00:40:21,810 >> So, let's see if this actually works. 762 00:40:21,810 --> 00:40:24,880 So if I run this, it didn't work. 763 00:40:24,880 --> 00:40:26,130 Wait one second. 764 00:40:26,130 --> 00:40:28,870 765 00:40:28,870 --> 00:40:31,720 "words[candidate]", "words[candidate]", that's 766 00:40:31,720 --> 00:40:33,750 the name of the array. 767 00:40:33,750 --> 00:40:41,435 OK. So, it says there's some bug for "candidate in priors." 768 00:40:41,435 --> 00:40:46,300 769 00:40:46,300 --> 00:40:48,760 Let me just chill a little bit. 770 00:40:48,760 --> 00:40:50,360 OK. 771 00:40:50,360 --> 00:40:51,305 Let's try. 772 00:40:51,305 --> 00:40:51,720 OK. 773 00:40:51,720 --> 00:40:58,710 >> So it gives Katy Perry has this probability of this times 10 to the 774 00:40:58,710 --> 00:41:02,200 minus 7, and Gaga has this times 10 to the minus 6. 775 00:41:02,200 --> 00:41:05,610 So you see it shows that Gaga has a higher probability.
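A minimal sketch in Python of the two functions walked through above. This is not the seminar's exact code; the training-data format (one song per line, with the singer's name as the first token), the variable names, and the tiny example data are assumptions made for illustration.

# Sketch of the training step: count songs per singer (priors),
# word occurrences per singer (probabilities), and total words per
# singer (words). Assumed input: lines like "gaga baby baby fire ...".
def train(lines):
    priors = {}           # how many songs we have seen per singer
    probabilities = {}    # per-singer word counts
    words = {}            # total word count per singer
    for line in lines:
        tokens = line.split()
        singer = tokens[0]
        priors[singer] = priors.get(singer, 0) + 1
        if singer not in probabilities:
            probabilities[singer] = {}
            words[singer] = 0
        for i in range(1, len(tokens)):
            word = tokens[i]
            probabilities[singer][word] = probabilities[singer].get(word, 0) + 1
            words[singer] += 1
    return priors, probabilities, words

# Sketch of the classification step: prior times one factor per word,
# using (1 + count) / total for words seen in that singer's lyrics and
# 1 / total for words never seen for that singer.
def classify(lyrics, priors, probabilities, words):
    for candidate in priors:
        probability = priors[candidate]
        for word in lyrics.split():
            if word in probabilities[candidate]:
                probability *= (1 + probabilities[candidate][word]) / words[candidate]
            else:
                probability *= 1 / words[candidate]
        print(candidate, probability)

# Hypothetical usage with made-up data, just to show the shape of it:
training = ["gaga baby baby love fire", "katy baby roar tiger fire fire"]
priors, probabilities, words = train(training)
classify("baby i am on fire", priors, probabilities, words)

Each candidate is printed with its score, and the candidate with the highest score is taken as the guess, which is the comparison being made on screen here.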
776 00:41:05,610 --> 00:41:09,260 So "Baby, I'm on Fire" is probably a Gaga song. 777 00:41:09,260 --> 00:41:10,580 Makes sense? 778 00:41:10,580 --> 00:41:12,030 So this is what we did. 779 00:41:12,030 --> 00:41:16,010 >> This code is going to be posted online, so you guys can check it out. 780 00:41:16,010 --> 00:41:20,720 Maybe use some of it if you want to do a project or something similar. 781 00:41:20,720 --> 00:41:22,150 OK. 782 00:41:22,150 --> 00:41:25,930 This was just to show what computational 783 00:41:25,930 --> 00:41:27,230 linguistics code looks like. 784 00:41:27,230 --> 00:41:33,040 But now let's go to more high-level stuff. 785 00:41:33,040 --> 00:41:33,340 OK. 786 00:41:33,340 --> 00:41:35,150 >> So the other problems I was talking about-- 787 00:41:35,150 --> 00:41:37,550 the segmentation problem is the first of them. 788 00:41:37,550 --> 00:41:40,820 So you have here Japanese. 789 00:41:40,820 --> 00:41:43,420 And then you see that there are no spaces. 790 00:41:43,420 --> 00:41:49,110 So this basically means that it's the top of the chair, right? 791 00:41:49,110 --> 00:41:50,550 You speak Japanese? 792 00:41:50,550 --> 00:41:52,840 It's the top of the chair, right? 793 00:41:52,840 --> 00:41:54,480 >> STUDENT: I don't know what the kanji over there is. 794 00:41:54,480 --> 00:41:57,010 >> LUCAS FREITAS: It's [SPEAKING JAPANESE] 795 00:41:57,010 --> 00:41:57,950 OK. 796 00:41:57,950 --> 00:42:00,960 So it basically means chair of top. 797 00:42:00,960 --> 00:42:03,620 So if you had to put a space it would be here. 798 00:42:03,620 --> 00:42:05,970 And then you have [? Ueda-san. ?] 799 00:42:05,970 --> 00:42:09,040 Which basically means Mr. Ueda. 800 00:42:09,040 --> 00:42:13,180 And you see that "Ueda" and you have a space and then "san." So you see that 801 00:42:13,180 --> 00:42:15,470 here "Ue" is like by itself. 802 00:42:15,470 --> 00:42:17,750 And here it has a character next to it. 803 00:42:17,750 --> 00:42:21,720 >> So it's not like in those languages each character means a word, so you 804 00:42:21,720 --> 00:42:23,980 could just put a lot of spaces. 805 00:42:23,980 --> 00:42:25,500 Characters relate to each other. 806 00:42:25,500 --> 00:42:28,680 And they can be together in groups of like two, three, or one. 807 00:42:28,680 --> 00:42:34,520 So you actually have to create some kind of way of putting those spaces. 808 00:42:34,520 --> 00:42:38,850 >> And the thing is that whenever you get data from those Asian languages, 809 00:42:38,850 --> 00:42:40,580 everything comes unsegmented. 810 00:42:40,580 --> 00:42:45,940 Because no one who writes Japanese or Chinese writes with spaces. 811 00:42:45,940 --> 00:42:48,200 Whenever you're writing Chinese or Japanese, you just write everything 812 00:42:48,200 --> 00:42:48,710 with no spaces. 813 00:42:48,710 --> 00:42:52,060 It doesn't even make sense to put spaces. 814 00:42:52,060 --> 00:42:57,960 So then when you get data from some East Asian language, if you want to 815 00:42:57,960 --> 00:43:00,760 actually do something with that you have to segment it first. 816 00:43:00,760 --> 00:43:05,130 >> Think of doing the example of the lyrics without spaces. 817 00:43:05,130 --> 00:43:07,950 So the only lyrics that you have will be sentences, right? 818 00:43:07,950 --> 00:43:09,470 Separated by periods. 819 00:43:09,470 --> 00:43:13,930 But then having just the sentence will not really help in giving information 820 00:43:13,930 --> 00:43:17,760 about who those lyrics are by. 821 00:43:17,760 --> 00:43:18,120 Right?
822 00:43:18,120 --> 00:43:20,010 So you should put spaces first. 823 00:43:20,010 --> 00:43:21,990 So how can you do that? 824 00:43:21,990 --> 00:43:24,920 >> So then comes the idea of a language model, which is something really 825 00:43:24,920 --> 00:43:26,870 important for computational linguistics. 826 00:43:26,870 --> 00:43:32,790 So a language model is basically a table of probabilities that shows, 827 00:43:32,790 --> 00:43:36,260 first of all, what is the probability of having a word in a language? 828 00:43:36,260 --> 00:43:39,590 So showing how frequent a word is. 829 00:43:39,590 --> 00:43:43,130 And then also showing the relation between the words in a sentence. 830 00:43:43,130 --> 00:43:51,500 >> So the main idea is, if a stranger came to you and said a sentence to 831 00:43:51,500 --> 00:43:55,600 you, what is the probability that, for example, "this is my sister [? GTF" ?] 832 00:43:55,600 --> 00:43:57,480 was the sentence that the person said? 833 00:43:57,480 --> 00:44:00,380 So obviously some sentences are more common than others. 834 00:44:00,380 --> 00:44:04,450 For example, "good morning," or "good night," or "hey there," is much more 835 00:44:04,450 --> 00:44:08,260 common than most sentences that we have in English. 836 00:44:08,260 --> 00:44:11,060 So why are those sentences more frequent? 837 00:44:11,060 --> 00:44:14,060 >> First of all, it's because you have words that are more frequent. 838 00:44:14,060 --> 00:44:20,180 So, for example, if you say "the dog is big" and "the dog is gigantic," you 839 00:44:20,180 --> 00:44:23,880 usually probably hear "the dog is big" more often, because "big" is more 840 00:44:23,880 --> 00:44:27,260 frequent in English than "gigantic." So, one of the 841 00:44:27,260 --> 00:44:30,100 things is the word frequency. 842 00:44:30,100 --> 00:44:34,490 >> The second thing which is really important is just the 843 00:44:34,490 --> 00:44:35,490 order of the words. 844 00:44:35,490 --> 00:44:39,500 So, it's common to say "the cat is inside the box," but you don't usually 845 00:44:39,500 --> 00:44:44,250 see "the box inside is the cat." So you see that there is some importance 846 00:44:44,250 --> 00:44:46,030 in the order of the words. 847 00:44:46,030 --> 00:44:50,160 You cannot just say that those two sentences have the same probability 848 00:44:50,160 --> 00:44:53,010 just because they have the same words. 849 00:44:53,010 --> 00:44:55,550 You actually have to care about order as well. 850 00:44:55,550 --> 00:44:57,650 Make sense? 851 00:44:57,650 --> 00:44:59,490 >> So what do we do? 852 00:44:59,490 --> 00:45:01,550 So what am I trying to get to? 853 00:45:01,550 --> 00:45:04,400 I'm trying to get to what we call the n-gram models. 854 00:45:04,400 --> 00:45:09,095 So n-gram models basically assume that, for each word that 855 00:45:09,095 --> 00:45:10,960 you have in a sentence, 856 00:45:10,960 --> 00:45:15,020 the probability of having that word there depends not only on the 857 00:45:15,020 --> 00:45:18,395 frequency of that word in the language, but also on the words that 858 00:45:18,395 --> 00:45:19,860 are surrounding it. 859 00:45:19,860 --> 00:45:25,810 >> So for example, usually when you see something like "on" or "at," you're 860 00:45:25,810 --> 00:45:28,040 probably going to see a noun after it, right? 861 00:45:28,040 --> 00:45:31,750 Because when you have a preposition, usually it takes a noun after it.
862 00:45:31,750 --> 00:45:35,540 Or if you have a verb that is transitive, you usually are going to 863 00:45:35,540 --> 00:45:36,630 have a noun phrase. 864 00:45:36,630 --> 00:45:38,780 So it's going to have a noun somewhere around it. 865 00:45:38,780 --> 00:45:44,950 >> So, basically, what it does is that it considers the probability of having 866 00:45:44,950 --> 00:45:47,960 words next to each other, when you're calculating the 867 00:45:47,960 --> 00:45:49,050 probability of a sentence. 868 00:45:49,050 --> 00:45:50,960 And that's what a language model is, basically. 869 00:45:50,960 --> 00:45:54,620 Just saying what's the probability of having a specific 870 00:45:54,620 --> 00:45:57,120 sentence in a language? 871 00:45:57,120 --> 00:45:59,110 So why is that useful, basically? 872 00:45:59,110 --> 00:46:02,390 And first of all, what is an n-gram model, then? 873 00:46:02,390 --> 00:46:08,850 >> So an n-gram model means that each word depends on the 874 00:46:08,850 --> 00:46:12,700 previous N minus 1 words. 875 00:46:12,700 --> 00:46:18,150 So, basically, it means that if I look, for example, at "the CS50 TF" when 876 00:46:18,150 --> 00:46:21,500 I'm calculating the probability of the sentence, it'll be like the 877 00:46:21,500 --> 00:46:25,280 probability of having the word "the," times the probability of having "the 878 00:46:25,280 --> 00:46:31,720 CS50," times the probability of having "the CS50 TF." So, basically, I count 879 00:46:31,720 --> 00:46:35,720 all possible ways of stretching it. 880 00:46:35,720 --> 00:46:41,870 >> And then usually when you're doing this, as in a project, you put N to be 881 00:46:41,870 --> 00:46:42,600 a low value. 882 00:46:42,600 --> 00:46:45,930 So, you usually have bigrams or trigrams. 883 00:46:45,930 --> 00:46:51,090 So that you just count two words, a group of two words, or three words, 884 00:46:51,090 --> 00:46:52,620 just for performance reasons. 885 00:46:52,620 --> 00:46:56,395 And also because maybe if you have something like "the CS50 TF," when you 886 00:46:56,395 --> 00:47:00,510 have "TF," it's very important that "CS50" is next to it, right? 887 00:47:00,510 --> 00:47:04,050 Those two things are usually next to each other. 888 00:47:04,050 --> 00:47:06,410 >> If you think of "TF," it's probably going to have what 889 00:47:06,410 --> 00:47:07,890 class it's TF'ing for. 890 00:47:07,890 --> 00:47:11,330 Also "the" is really important for "CS50 TF." 891 00:47:11,330 --> 00:47:14,570 But if you have something like "The CS50 TF went to class and gave their 892 00:47:14,570 --> 00:47:20,060 students some candy," "candy" and "the" have no relation really, right? 893 00:47:20,060 --> 00:47:23,670 They're so distant from each other that it doesn't really matter what 894 00:47:23,670 --> 00:47:25,050 words you have. 895 00:47:25,050 --> 00:47:31,210 >> So by doing a bigram or a trigram, it just means that you're limiting 896 00:47:31,210 --> 00:47:33,430 yourself to some words that are around it. 897 00:47:33,430 --> 00:47:35,810 Make sense? 898 00:47:35,810 --> 00:47:40,630 So when you want to do segmentation, basically, what you want to do is see 899 00:47:40,630 --> 00:47:44,850 what are all the possible ways that you can segment the sentence. 900 00:47:44,850 --> 00:47:49,090 >> Such that you see what is the probability of each of those sentences 901 00:47:49,090 --> 00:47:50,880 existing in the language? 902 00:47:50,880 --> 00:47:53,410 So what you do is like, well, let me try to put a space here.
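A bigram language model of the kind being described could be sketched as below. The training sentences, function names, and add-one smoothing here are illustrative assumptions rather than the exact model from the talk; the space-placement walkthrough that continues below would score each candidate segmentation with a model like this and keep the highest-scoring one.

# Sketch of a bigram language model with add-one smoothing.
from collections import defaultdict

def train_bigram_model(sentences):
    unigrams = defaultdict(int)   # how often each word appears
    bigrams = defaultdict(int)    # how often each pair of adjacent words appears
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split()
        for i, token in enumerate(tokens):
            unigrams[token] += 1
            if i > 0:
                bigrams[(tokens[i - 1], token)] += 1
    return unigrams, bigrams

def sentence_probability(sentence, unigrams, bigrams):
    # P(w1 ... wn) is approximated as P(w1 | <s>) * P(w2 | w1) * ... * P(wn | wn-1),
    # with add-one smoothing so unseen pairs do not zero everything out.
    tokens = ["<s>"] + sentence.split()
    vocabulary = len(unigrams)
    probability = 1.0
    for i in range(1, len(tokens)):
        previous, current = tokens[i - 1], tokens[i]
        pair_count = bigrams.get((previous, current), 0)
        previous_count = unigrams.get(previous, 0)
        probability *= (pair_count + 1) / (previous_count + vocabulary)
    return probability

# Hypothetical usage: the natural word order should get the higher score.
unigrams, bigrams = train_bigram_model(["the cat is inside the box", "the dog is big"])
print(sentence_probability("the cat is inside the box", unigrams, bigrams))
print(sentence_probability("the box inside is the cat", unigrams, bigrams))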
903 00:47:53,410 --> 00:47:55,570 So you put a space there and you see what is the 904 00:47:55,570 --> 00:47:57,590 probability of that sentence? 905 00:47:57,590 --> 00:48:00,240 Then you are like, OK, maybe that was not that good. 906 00:48:00,240 --> 00:48:03,420 So I put a space there and a space there, and you calculate the 907 00:48:03,420 --> 00:48:06,240 probability now, and you see that it's a higher probability. 908 00:48:06,240 --> 00:48:12,160 >> So this is an algorithm called the TANGO segmentation algorithm, which is 909 00:48:12,160 --> 00:48:14,990 actually something that would be really cool for a project. It 910 00:48:14,990 --> 00:48:20,860 basically takes unsegmented text, which can be Japanese or Chinese or maybe 911 00:48:20,860 --> 00:48:26,080 English without spaces, and tries to put spaces between words, and it does 912 00:48:26,080 --> 00:48:29,120 that by using a language model and trying to see what is the highest 913 00:48:29,120 --> 00:48:31,270 probability you can get. 914 00:48:31,270 --> 00:48:32,230 OK. 915 00:48:32,230 --> 00:48:33,800 So this is segmentation. 916 00:48:33,800 --> 00:48:35,450 >> Now syntax. 917 00:48:35,450 --> 00:48:40,940 So, syntax is being used for so many things right now. 918 00:48:40,940 --> 00:48:44,880 So for Graph Search, for Siri, for pretty much any kind of natural 919 00:48:44,880 --> 00:48:46,490 language processing you have. 920 00:48:46,490 --> 00:48:49,140 So what are the important things about syntax? 921 00:48:49,140 --> 00:48:52,390 So, sentences in general have what we call constituents. 922 00:48:52,390 --> 00:48:57,080 Which are kind of like groups of words that have a function in the sentence. 923 00:48:57,080 --> 00:49:02,220 And they cannot really be taken apart from each other. 924 00:49:02,220 --> 00:49:07,380 >> So, if I say, for example, "Lauren loves Milo," I know that "Lauren" is a 925 00:49:07,380 --> 00:49:10,180 constituent and then "loves Milo" is also another one. 926 00:49:10,180 --> 00:49:16,860 Because you cannot say like "Lauren Milo loves" to have the same meaning. 927 00:49:16,860 --> 00:49:18,020 It's not going to have the same meaning. 928 00:49:18,020 --> 00:49:22,500 Or I cannot say like "Milo Lauren loves." It doesn't have the same 929 00:49:22,500 --> 00:49:25,890 meaning when you do that. 930 00:49:25,890 --> 00:49:31,940 >> So the two most important things about syntax are the lexical types, which are 931 00:49:31,940 --> 00:49:35,390 basically the functions that you have for words by themselves. 932 00:49:35,390 --> 00:49:39,180 So you have to know that "Lauren" and "Milo" are nouns. 933 00:49:39,180 --> 00:49:41,040 "Love" is a verb. 934 00:49:41,040 --> 00:49:45,660 And the second important thing is the phrasal types. 935 00:49:45,660 --> 00:49:48,990 So you know that "loves Milo" is actually a verbal phrase. 936 00:49:48,990 --> 00:49:52,390 So when I say "Lauren," I know that Lauren is doing something. 937 00:49:52,390 --> 00:49:53,620 What is she doing? 938 00:49:53,620 --> 00:49:54,570 She's loving Milo. 939 00:49:54,570 --> 00:49:56,440 So it's a whole thing. 940 00:49:56,440 --> 00:50:01,640 But its components are a noun and a verb. 941 00:50:01,640 --> 00:50:04,210 But together, they make a verb phrase. 942 00:50:04,210 --> 00:50:08,680 >> So, what can we actually do with computational linguistics? 943 00:50:08,680 --> 00:50:13,810 So, if I have something, for example, "friends of Allison."
I see, if I just 944 00:50:13,810 --> 00:50:17,440 did a syntactic tree, I would know that "friends" is a noun phrase, it is a 945 00:50:17,440 --> 00:50:21,480 noun, and then "of Allison" is a prepositional phrase in which "of" is 946 00:50:21,480 --> 00:50:24,810 a preposition and "Allison" is a noun. 947 00:50:24,810 --> 00:50:30,910 What I could do is teach my computer that when I have a noun phrase one and 948 00:50:30,910 --> 00:50:33,080 then a prepositional phrase, 949 00:50:33,080 --> 00:50:39,020 so in this case, "friends" and then "of Allison," I know that this means that 950 00:50:39,020 --> 00:50:43,110 NP2, the second one, owns NP1. 951 00:50:43,110 --> 00:50:47,680 >> So I can create some kind of relation, some kind of function for it. 952 00:50:47,680 --> 00:50:52,370 So whenever I see this structure, which matches exactly with "friends of 953 00:50:52,370 --> 00:50:56,030 Allison," I know that Allison owns the friends. 954 00:50:56,030 --> 00:50:58,830 So the friends are something that Allison has. 955 00:50:58,830 --> 00:50:59,610 Makes sense? 956 00:50:59,610 --> 00:51:01,770 So this is basically what Graph Search does. 957 00:51:01,770 --> 00:51:04,360 It just creates rules for a lot of things. 958 00:51:04,360 --> 00:51:08,190 So "friends of Allison," "my friends who live in Cambridge," "my friends 959 00:51:08,190 --> 00:51:12,970 who go to Harvard." It creates rules for all of those things. 960 00:51:12,970 --> 00:51:14,930 >> Now machine translation. 961 00:51:14,930 --> 00:51:18,850 So, machine translation is also something statistical. 962 00:51:18,850 --> 00:51:21,340 And actually, if you get involved in computational linguistics, a lot of 963 00:51:21,340 --> 00:51:23,580 your stuff is going to be statistics. 964 00:51:23,580 --> 00:51:26,670 So, as in the example I was doing, there were a lot of probabilities that I was 965 00:51:26,670 --> 00:51:30,540 calculating, and then you get to this very small number that's the final 966 00:51:30,540 --> 00:51:33,180 probability, and that's what gives you the answer. 967 00:51:33,180 --> 00:51:37,540 Machine translation also uses a statistical model. 968 00:51:37,540 --> 00:51:44,790 And if you want to think of machine translation in the simplest possible 969 00:51:44,790 --> 00:51:48,970 way, what you can do is just translate word by word, right? 970 00:51:48,970 --> 00:51:52,150 >> When you're learning a language for the first time, that's usually what 971 00:51:52,150 --> 00:51:52,910 you do, right? 972 00:51:52,910 --> 00:51:57,050 If you want to translate a sentence in your language to the language 973 00:51:57,050 --> 00:52:00,060 you're learning, usually, first, you translate each of the words 974 00:52:00,060 --> 00:52:03,180 individually, and then you try to put the words into place. 975 00:52:03,180 --> 00:52:07,100 >> So if I wanted to translate this, [SPEAKING PORTUGUESE] 976 00:52:07,100 --> 00:52:10,430 which means "the white cat ran away." If I wanted to translate it from 977 00:52:10,430 --> 00:52:13,650 Portuguese to English, what I could do is, first, I just 978 00:52:13,650 --> 00:52:14,800 translate word by word. 979 00:52:14,800 --> 00:52:20,570 So "o" is "the," "gato" is "cat," "branco" is "white," and then "fugiu" is 980 00:52:20,570 --> 00:52:21,650 "ran away." 981 00:52:21,650 --> 00:52:26,130 So then I have all the words here, but they're not in order. 982 00:52:26,130 --> 00:52:29,590 It's like "the cat white ran away," which is ungrammatical.
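The word-by-word step just described might look like the sketch below; the tiny glossary is a made-up assumption for illustration only.

# Sketch of naive word-by-word translation with a made-up glossary.
glossary = {
    "o": "the",
    "gato": "cat",
    "branco": "white",
    "fugiu": "ran away",
}

def translate_word_by_word(sentence):
    # Look each word up, keeping the original word order;
    # fall back to the word itself if it is not in the glossary.
    return " ".join(glossary.get(word, word) for word in sentence.split())

print(translate_word_by_word("o gato branco fugiu"))  # prints "the cat white ran away"

The reordering step described next would then try permutations of these words and keep the order that the language model scores highest.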
983 00:52:29,590 --> 00:52:34,490 So, then I can have a second step, which is going to be finding the ideal 984 00:52:34,490 --> 00:52:36,610 position for each of the words. 985 00:52:36,610 --> 00:52:40,240 So I know that I actually want to have "white cat" instead of "cat white." So 986 00:52:40,240 --> 00:52:46,050 what I can do is, the most naive method would be to create all the 987 00:52:46,050 --> 00:52:49,720 possible permutations of the words, of the positions. 988 00:52:49,720 --> 00:52:53,300 And then see which one has the highest probability according 989 00:52:53,300 --> 00:52:54,970 to my language model. 990 00:52:54,970 --> 00:52:58,390 And then when I find the one that has the highest probability, which is 991 00:52:58,390 --> 00:53:01,910 probably "the white cat ran away," that's my translation. 992 00:53:01,910 --> 00:53:06,710 >> And this is a simple way of explaining how a lot of machine translation 993 00:53:06,710 --> 00:53:07,910 algorithms work. 994 00:53:07,910 --> 00:53:08,920 Does that make sense? 995 00:53:08,920 --> 00:53:12,735 This is also something really exciting that you guys can maybe explore for a 996 00:53:12,735 --> 00:53:13,901 final project, yeah? 997 00:53:13,901 --> 00:53:15,549 >> STUDENT: Well, you said it was the naive way, so what's 998 00:53:15,549 --> 00:53:17,200 the non-naive way? 999 00:53:17,200 --> 00:53:18,400 >> LUCAS FREITAS: The non-naive way? 1000 00:53:18,400 --> 00:53:19,050 OK. 1001 00:53:19,050 --> 00:53:22,860 So the first thing that is bad about this method is that I just translated 1002 00:53:22,860 --> 00:53:24,330 words, word by word. 1003 00:53:24,330 --> 00:53:30,570 But sometimes you have words that can have multiple translations. 1004 00:53:30,570 --> 00:53:32,210 I'm going to try to think of something. 1005 00:53:32,210 --> 00:53:37,270 For example, "manga" in Portuguese can either be "mango" or "sleeve." So 1006 00:53:37,270 --> 00:53:40,450 when you're trying to translate word by word, it might be giving you 1007 00:53:40,450 --> 00:53:42,050 something that makes no sense. 1008 00:53:42,050 --> 00:53:45,770 >> So you actually want to look at all the possible translations of the 1009 00:53:45,770 --> 00:53:49,840 words and see, first of all, what is the order. 1010 00:53:49,840 --> 00:53:52,000 We were talking about permuting the things? 1011 00:53:52,000 --> 00:53:54,150 To see all the possible orders and choose the one with the highest 1012 00:53:54,150 --> 00:53:54,990 probability? 1013 00:53:54,990 --> 00:53:57,860 You can also choose all the possible translations for each 1014 00:53:57,860 --> 00:54:00,510 word and then see-- 1015 00:54:00,510 --> 00:54:01,950 combined with the permutations-- 1016 00:54:01,950 --> 00:54:03,710 which one has the highest probability. 1017 00:54:03,710 --> 00:54:08,590 >> Plus, you can also look at not only words but phrases, 1018 00:54:08,590 --> 00:54:11,700 so you can analyze the relations between the words and then get a 1019 00:54:11,700 --> 00:54:13,210 better translation. 1020 00:54:13,210 --> 00:54:16,690 Also, something else: this semester I'm actually doing research in 1021 00:54:16,690 --> 00:54:19,430 Chinese-English machine translation, so translating from 1022 00:54:19,430 --> 00:54:20,940 Chinese into English.
1023 00:54:20,940 --> 00:54:26,760 >> And something we do is, besides using a statistical model, which is just 1024 00:54:26,760 --> 00:54:30,570 seeing the probabilities of seeing words in some position in a sentence, I'm 1025 00:54:30,570 --> 00:54:35,360 actually also adding some syntax to my model, saying, oh, if I see this kind 1026 00:54:35,360 --> 00:54:39,420 of construction, this is what I want to change it to when I translate. 1027 00:54:39,420 --> 00:54:43,880 So you can also add some kind of element of syntax to make the 1028 00:54:43,880 --> 00:54:47,970 translation more efficient and more precise. 1029 00:54:47,970 --> 00:54:48,550 OK. 1030 00:54:48,550 --> 00:54:51,010 >> So how can you get started, if you want to do something in computational 1031 00:54:51,010 --> 00:54:51,980 linguistics? 1032 00:54:51,980 --> 00:54:54,560 >> First, you choose a project that involves languages. 1033 00:54:54,560 --> 00:54:56,310 So, there are so many out there. 1034 00:54:56,310 --> 00:54:58,420 There are so many things you can do. 1035 00:54:58,420 --> 00:55:00,510 And then you can think of a model that you can use. 1036 00:55:00,510 --> 00:55:04,710 Usually that means thinking of assumptions, like, oh, when I was 1037 00:55:04,710 --> 00:55:05,770 thinking of the lyrics. 1038 00:55:05,770 --> 00:55:09,510 I was like, well, if I want to figure out who wrote this, I probably want 1039 00:55:09,510 --> 00:55:15,400 to look at the words the person used and see who uses those words very often. 1040 00:55:15,400 --> 00:55:18,470 So try to make assumptions and try to think of models. 1041 00:55:18,470 --> 00:55:21,395 And then you can also search online for the kind of problem that you have, 1042 00:55:21,395 --> 00:55:24,260 and it's going to suggest to you models that maybe 1043 00:55:24,260 --> 00:55:26,560 model that thing well. 1044 00:55:26,560 --> 00:55:29,080 >> And also you can always email me. 1045 00:55:29,080 --> 00:55:31,140 me@lfreitas.com. 1046 00:55:31,140 --> 00:55:34,940 And I can just answer your questions. 1047 00:55:34,940 --> 00:55:38,600 We might even meet up so I can give suggestions on ways of 1048 00:55:38,600 --> 00:55:41,490 implementing your project. 1049 00:55:41,490 --> 00:55:45,610 And I mean, if you get involved with computational linguistics, it's going 1050 00:55:45,610 --> 00:55:46,790 to be great. 1051 00:55:46,790 --> 00:55:48,370 You're going to see there is so much potential. 1052 00:55:48,370 --> 00:55:52,060 And the industry wants to hire you so badly because of that. 1053 00:55:52,060 --> 00:55:54,720 So I hope you guys enjoyed this. 1054 00:55:54,720 --> 00:55:57,030 If you guys have any questions, you can ask me after this. 1055 00:55:57,030 --> 00:55:58,280 But thank you. 1056 00:55:58,280 --> 00:56:00,150