>> LUCAS FREITAS: Hey. Welcome, everyone. My name is Lucas Freitas. I'm a junior at [INAUDIBLE] studying computer science with a focus in computational linguistics, so my secondary is in language and linguistic theory. I'm really excited to teach you guys a little bit about the field. It's a very exciting area to study, and one with a lot of potential for the future. So I'm really excited that you guys are considering projects in computational linguistics, and I'll be more than happy to advise any of you if you decide to pursue one of those.

>> So first of all, what is computational linguistics? Computational linguistics is the intersection between linguistics and computer science. So what is linguistics? What is computer science? Well, from linguistics, what we take is the languages. Linguistics is actually the study of natural language in general. By natural language we mean language that we actually use to communicate with each other, so we're not exactly talking about C or Java. We're talking more about English and Chinese and other languages that we use to communicate with each other.

>> The challenging thing about that is that right now we have almost 7,000 languages in the world, so there is quite a high variety of languages that we can study. And then you think that it's probably very hard to do, for example, translation from one language to another, considering that you have almost 7,000 of them. If you think of doing translation from one language to another, you have more than a million different combinations from language to language, so it's really challenging to build some kind of translation system for every single language.

>> So linguistics deals with syntax, semantics, pragmatics. You guys don't exactly need to know what they are. But the very interesting thing is that as a native speaker, when you learn a language as a child, you actually learn all of those things -- syntax, semantics, and pragmatics -- by yourself.
And nobody has to teach you syntax for you to understand how sentences are structured. So it's really interesting, because it's something that comes very intuitively.

>> And what are we taking from computer science? Well, the most important things that we have in computer science are, first of all, artificial intelligence and machine learning. What we're trying to do in computational linguistics is teach a computer how to do something with language.

>> So, for example, in machine translation, I'm trying to teach my computer how to go from one language to the other -- basically like teaching a computer two languages. If I do natural language processing, which is the case, for example, of Facebook's Graph Search, I teach my computer how to understand queries well.

>> So if you say "the photos of my friends," Facebook doesn't treat that as one big string that just has a bunch of words. It actually understands the relation between "photos" and "my friends" and understands that "photos" are a property of "my friends."

>> So that's part of, for example, natural language processing. It's trying to understand what the relation between the words in a sentence is. And the big question is, can you teach a computer how to speak a language in general? Which is a very interesting question to think about, because maybe in the future you're going to be able to talk to your cell phone -- kind of like what we do with Siri, but something more like, you can actually say whatever you want and the phone is going to understand everything. And it can ask follow-up questions and keep talking. That's something really exciting, in my opinion.

>> So, something about natural languages. Something really interesting about natural languages -- and credit for this goes to my linguistics professor, Maria Polinsky -- is an example she gives, and I think it's really interesting. We learn language from when we're born, and then our native language kind of grows on us.

>> And basically you learn language from minimal input, right? You're just getting input from your parents of what your language sounds like, and you just learn it.
So it's interesting, because look at these sentences, for example. Look at "Mary puts on a coat every time she leaves the house."

>> In this case, it's possible to have the word "she" refer to Mary, right? You can say "Mary puts on a coat every time Mary leaves the house," so that's fine. But then if you look at the sentence "She puts on a coat every time Mary leaves the house," you know it's impossible to say that "she" is referring to Mary.

>> There's no way of reading it as "Mary puts on a coat every time Mary leaves the house." So it's interesting, because this is the kind of intuition that every native speaker has. Nobody was taught that this is the way the syntax works -- that you can only have "she" referring to Mary in the first case, and actually in this other one too, but not in this one. But everyone kind of gets to the same answer. Everyone agrees on that. So it's really interesting how, although you don't know all the rules of your language, you kind of understand how the language works.

>> So the interesting thing about natural language is that you don't have to know any syntax to know whether a sentence is grammatical or ungrammatical, in most cases. Which makes you think that maybe what happens is that through your life, you just keep getting more and more sentences told to you, and you keep memorizing all of those sentences. Then when someone tells you something, you hear that sentence, you look at your vocabulary of sentences, and you see if that sentence is there. If it is there, you say it's grammatical. If it's not, you say it's ungrammatical.

>> So in that case, you would say, oh, you have a huge list of all possible sentences, and when you hear a sentence, you know whether it's grammatical or not based on that. The thing is, if you look at a sentence like, for example, "The five-headed CS50 TFs cooked the blind octopus using a DAPA mug," it's definitely not a sentence that you've heard before. But at the same time you know it's pretty much grammatical, right? There are no grammatical mistakes, and you can say that it's a possible sentence.
>> So it makes us think that the way we actually learn language is not only by having a huge database of possible words or sentences, but more by understanding the relation between the words in those sentences. Does that make sense?

So then the question is, can computers learn languages? Can we teach language to computers?

>> Let's think of the difference between a native speaker of a language and a computer. What happens with the speaker? Well, the native speaker learns a language from exposure to it, usually in the early childhood years. Basically, you have a baby, you keep talking to it, and it just learns how to speak the language, right? So you're basically giving input to the baby. Then you can argue that a computer can do the same thing, right? You can just give language as input to the computer --

>> for example, a bunch of files that have books in English. Maybe that's one way that you could possibly teach a computer English, right? And in fact, if you think about it, it takes you maybe a couple of days to read a book. For a computer it takes a second to look at all the words in a book. So you can see that this argument of just getting input from around you is not enough to say that language learning is something only humans can do. Computers can also get input.

>> The second thing is that native speakers also have a brain that has language learning capability. But if you think about it, a brain is a fixed thing. When you are born, it's already set -- this is your brain. And as you grow up, you just get more input of language, and maybe nutrients and other stuff, but pretty much your brain is a fixed thing.

>> So you can say, well, maybe you can build a computer that has a bunch of functions and methods that just mimic language learning capability. In that sense, you could say, well, I can have a computer that has all the things I need to learn language. And the last thing is that a native speaker learns from trial and error. So basically, another important thing in language learning is that you kind of learn things by making generalizations from what you hear.
>> So as you are growing up, you learn that some words are more like nouns and some other ones are adjectives. And you don't have to have any knowledge of linguistics to understand that. You just know that some words are positioned in one part of the sentence and some others in other parts of the sentence.

>> And when you produce something like a sentence that is not correct -- maybe because of an overgeneralization, for example. Maybe when you're growing up, you notice that the plural is usually formed by putting an S at the end of the word, and then you try to do the plural of "deer" as "deers" or "tooth" as "tooths." Then your parents or someone corrects you and says, no, the plural of "deer" is "deer," and the plural of "tooth" is "teeth." And then you learn those things. So you learn from trial and error.

>> But you can also do that with a computer. You can have something called reinforcement learning, which is basically giving a computer a reward whenever it does something correctly, and giving it the opposite of a reward when it does something wrong. You can actually see that if you go to Google Translate and you try to translate a sentence, it asks you for feedback. So if you say, oh, there's a better translation for this sentence, you can type it in. And then if a lot of people keep saying that's a better translation, it just learns that it should use that translation instead of the one it was giving.

>> So it's a very philosophical question whether computers are going to be able to talk or not in the future. But I have high hopes that they can, just based on those arguments. But it's more of a philosophical question.

>> So while computers still cannot talk, what are the things that we can do? One really cool thing is data classification. For example, you guys know that email services do spam filtering. Whenever you receive spam, it tries to filter it into another box. So how does it do that? It's not like the computer just knows which email addresses are sending spam. It's more based on the content of the message, or maybe the title, or maybe some pattern that you have.
>> So basically, what you can do is get a lot of data -- emails that are spam and emails that are not spam -- and learn what kind of patterns you have in the ones that are spam. And this is part of computational linguistics. It's called data classification. We're actually going to see an example of that in the next slides.

>> The second thing is natural language processing, which is what Graph Search is doing: letting you write a sentence, and it tries to understand what the meaning is and give you a better result. Actually, if you go to Google or Bing and you search for something like Lady Gaga's height, you're actually going to get 5' 1" instead of just pages about her, because it actually understands what you're talking about. So that's part of natural language processing.

>> Or also, when you're using Siri, first you have an algorithm that tries to translate what you're saying into words, into text, and then it tries to translate that into meaning. So that's all part of natural language processing.

>> Then you have machine translation -- which is actually one of my favorites -- which is just translating from one language to another. You can see that when you're doing machine translation, you have infinite possibilities of sentences, so there's no way of just storing every single translation. You have to come up with interesting algorithms to be able to translate every single sentence in some way.

>> You guys have any questions so far? No? OK.

>> So what are we going to see today? First of all, I'm going to talk about the classification problem -- the one that I was describing with spam. What I'm going to do is, given lyrics to a song, can you try to figure out, with high probability, who the singer is? Let's say that I have songs from Lady Gaga and Katy Perry; if I give you a new song, can you figure out whether it's Katy Perry or Lady Gaga?

>> For the second one, I'm just going to talk about the segmentation problem. I don't know if you guys know, but Chinese, Japanese, other East Asian languages, and other languages in general don't have spaces between words.
And if you think about the way that your computer tries to do natural language processing, it looks at the words and tries to understand the relations between them, right? But then if you have Chinese, and you have zero spaces, it's really hard to find out what the relation between words is, because the text doesn't have delimited words in the first place. So you have to do something called segmentation, which just means putting spaces between what we'd call words in those languages. Make sense?

>> And then we're going to talk about syntax -- just a little bit about natural language processing. It's going to be just an overview. So today, basically what I want to do is give you guys a little bit of an idea of the possibilities of what you can do with computational linguistics. Then you can see what you think is cool among those things, and maybe you can think of a project and come talk to me, and I can give you advice on how to implement it.

>> So syntax is going to be a little bit about Graph Search and machine translation. I'm just going to give an example of how you could, for example, translate something from Portuguese to English. Sounds good?

>> So first, the classification problem. I'll say that this part of the seminar is going to be the most challenging one, just because there's going to be some coding. But it's going to be Python. I know you guys don't know Python, so I'm just going to explain at a high level what I'm doing, and you don't have to care too much about the syntax, because that's something you guys can learn. OK? Sounds good.

>> So what is the classification problem? You're given some lyrics to a song, and you want to guess who is singing it. And this works for any other kind of problem. It can be, for example, that you have a presidential campaign and you have a speech, and you want to find out whether it was, for example, Obama or Mitt Romney. Or you can have a bunch of emails, and you want to figure out whether they are
spam or not. So it's just classifying some data based on the words that you have there.

>> To do that, you have to make some assumptions. A lot of computational linguistics is making assumptions -- usually smart assumptions -- so that you can get good results. You try to create a model for it, then you try it out and see if it works, if it gives you good precision. If it does, then you try to improve it. If it doesn't, you're like, OK, maybe I should make a different assumption.

>> So the assumption that we're going to make is that an artist usually sings about a topic multiple times, and maybe uses certain words multiple times just because they're used to them. You can just think of your friends. I'm sure you guys all have friends who say their signature phrase in literally every single sentence -- some specific word or some specific phrase that they say in every single sentence.

>> And what you can say is that if you see a sentence that has that signature phrase, you can guess that your friend is probably the one saying it, right? So you make that assumption, and then that's how you create a model.

>> The example that I'm going to give is about Lady Gaga; for example, people say that she uses "baby" in all her number one songs. And actually this is a video that shows her saying the word "baby" in different songs.

>> [VIDEO PLAYBACK]

>> -(SINGING) Baby. Baby. Baby. Baby. Baby. Babe. Baby. Baby. Baby. Baby.

>> [END VIDEO PLAYBACK]

>> LUCAS FREITAS: So there are, I think, 40 songs here in which she says the word "baby." So you can basically guess that if you see a song that has the word "baby," there's some high probability that it's Lady Gaga. But let's try to develop this a little more formally.

>> So these are lyrics to songs by Lady Gaga and Katy Perry. You look at Lady Gaga, and you see a lot of occurrences of "baby" and a lot of occurrences of "way." And then Katy Perry has a lot of occurrences of "the" and a lot of occurrences of "fire."
>> So basically what we want to do is this: you get a lyric. Let's say that you get a lyric for a song that is just "baby." If you just get the word "baby," and this is all the data that you have from Lady Gaga and Katy Perry, who would you guess is the person who sings the song? Lady Gaga or Katy Perry? Lady Gaga, right? Because she's the only one who says "baby." This sounds stupid, right? OK, this is really easy. I'm just looking at the two songs, and of course she's the only one who has "baby."

>> But what if you have a bunch of words? If you have an actual lyric, something like "baby, I just went to see a [? CFT ?] lecture," or something like that, then you actually have to figure out -- based on all those words -- who is the artist who probably sang this song. So let's try to develop this a little further.

>> OK, so based just on the data that we got, it seems that Gaga is probably the singer. But how can we write this more formally? There's going to be a little bit of statistics here, so if you get lost, just try to understand the concept. It doesn't matter if you don't understand the equations perfectly well. This is all going to be online.

>> So basically what I'm calculating is the probability that this song is by Lady Gaga given that -- this bar means "given that" -- I saw the word "baby." Does that make sense? So I'm trying to calculate that probability.

>> There is this theorem called Bayes' theorem that says the probability of A given B is the probability of B given A, times the probability of A, over the probability of B. It's a long equation. But what you have to understand from it is that this is what I want to calculate, right? The probability that the song is by Lady Gaga given that I saw the word "baby."

>> And now what I'm getting is the probability of the word "baby" given that I have Lady Gaga. And what is that, basically? What it means is: what is the probability of seeing the word "baby" in Gaga lyrics?
If I want to calculate that in a very simple way, it's just the number of times I see "baby" over the total number of words in Gaga lyrics, right? What is the frequency with which I see that word in Gaga's work? Make sense?

>> The second term is the probability of Gaga. What does that mean? It basically means: what is the probability of classifying some lyrics as Gaga? That's kind of weird, but let's think of an example. Let's say that the probability of having "baby" in a song is the same for Gaga and Britney Spears, but Britney Spears has twice as many songs as Lady Gaga. So if someone just randomly gives you lyrics containing "baby," the first thing you look at is: what is the probability of having "baby" in a Gaga song, and of having "baby" in a Britney song? And it's the same.

>> So the second thing you'll look at is: what is the probability of this lyric by itself being a Gaga lyric, and what is the probability of it being a Britney lyric? Since Britney has so many more lyrics than Gaga, you would probably say, well, this is probably a Britney lyric. That's why we have this term right here: the probability of Gaga. Makes sense? Does it? OK.

>> And the last one is just the probability of "baby," which doesn't really matter that much. It's the probability of seeing "baby" in English. We usually don't care that much about that term. Does that make sense? So the probability of Gaga is called the prior probability of the class Gaga, because it just means: what is the probability of having that class -- which is Gaga -- in general, with no conditions.

>> And then when I have the probability of Gaga given "baby," we call it the posterior probability, because it's the probability of having Gaga given some evidence. So I'm giving you the evidence that I saw the word "baby" in the song. Makes sense? OK.

>> So if I calculated that for each of the songs for Lady Gaga, what would that be -- apparently, I cannot move this.
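(In symbols, the quantity being computed here is just Bayes' theorem applied to this example; the notation below is mine, not taken from the slide:)

P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}
\qquad\Rightarrow\qquad
P(\text{Gaga} \mid \text{``baby''}) = \frac{P(\text{``baby''} \mid \text{Gaga})\,P(\text{Gaga})}{P(\text{``baby''})}

The numerator holds the word frequency and the prior just described; the denominator is the probability of "baby" in English, the term we mostly ignore.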
The probability of Gaga will be something like 2 over 24, times 1/2, over 2 over 53. It doesn't matter exactly where these numbers come from; what matters is that it's a number that is going to be more than 0, right?

>> And then when I do Katy Perry, the probability of "baby" given Katy is already 0, right? Because there's no "baby" in Katy Perry. So this whole thing becomes 0, and Gaga wins, which means that Gaga is probably the singer. Does that make sense? OK.

>> So if I want to make this more formal, I can actually build a model for multiple words. Let's say that I have something like "baby, I am on fire," or something. So it has multiple words. In this case, you can see that "baby" is in Gaga but not in Katy, and "fire" is in Katy but not in Gaga, right? So it's getting trickier, right? Because it seems that you almost have a tie between the two.

>> So what you have to do is assume independence among the words. Basically what that means is that I'm calculating the probability of seeing "baby," the probability of seeing "I," and "am," and "on," and "fire," all separately. Then I multiply all of them together, and I see what the probability of seeing the whole sentence is. Make sense?

>> So basically, if I have just one word, what I want to find is the arg max, which means: which class gives me the highest probability? Which class gives me the highest probability for the probability of the class given the word? So in this case, Gaga given "baby," or Katy given "baby." Make sense?

>> And just from Bayes -- that equation that I showed -- we create this fraction. The only thing is that you see that the probability of the word given the class changes depending on the class, right? The number of "baby"s that I have in Gaga is different from Katy. The probability of the class also changes because it's just the number of songs each of them has.

>> But the probability of the word itself is going to be the same for all the artists, right?
So the probability of the word is just: what is the probability of seeing that word in the English language? It's the same for all of them. And since it's constant, we can just drop it and not care about it. So this is actually the equation we're looking for.

>> And if I have multiple words, I'm still going to have the prior probability here. The only thing is that I'm multiplying in the probabilities of all the other words. So I'm multiplying all of them. Make sense? It looks weird, but it basically means: calculate the prior of the class, and then multiply it by the probability of each of the words being in that class.

>> And you know that the probability of a word given a class is going to be the number of times you see that word in that class, divided by the number of words you have in that class in general. Make sense? It's just how "baby" was 2 over the number of words that I had in the lyrics. So just the frequency.

>> But there is one thing. Remember how I was showing that the probability of "baby" being in lyrics from Katy Perry was 0, just because Katy Perry didn't have "baby" at all? It sounds a little harsh to simply say that lyrics cannot be from an artist just because they don't contain that particular word anywhere.

>> So you could say, well, if you don't have this word, I'm going to give you a lower probability, but I'm just not going to give you 0 right away. Because maybe it was something like "fire, fire, fire, fire," which is totally Katy Perry, and then one "baby," and it goes to 0 right away just because there was one "baby."

>> So basically what we do is something called Laplace smoothing. This just means that I'm giving some probability even to words that do not appear. So what I do is, when I'm calculating this, I always add 1 to the numerator. So even if the word doesn't appear -- in this case, if the count is 0 -- I'm still calculating this as 1 over the total number of words. Otherwise, I take the count that I have and add 1. So I'm covering both cases. Make sense?
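(Putting the decision rule we just built into symbols -- my notation, reconstructed from the description above, with the constant denominator dropped and the simplified smoothing applied only to the numerator:)

\hat{c} = \arg\max_{c \in \{\text{Gaga},\,\text{Katy}\}} \; P(c) \prod_{i=1}^{n} P(w_i \mid c),
\qquad
P(w_i \mid c) = \frac{\text{count}(w_i, c) + 1}{\text{total number of words in } c}

Here w_1, ..., w_n are the words of the new lyric, P(c) is the prior of the class, and count(w_i, c) is how many times word w_i appears in that artist's lyrics.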
>> So now let's do some coding. I'm going to have to do it pretty fast, but it's just important that you guys understand the concepts. What we're trying to do is implement exactly the thing I just described: I want to put in lyrics from Lady Gaga and Katy Perry, and the program is going to be able to say whether new lyrics are from Gaga or Katy Perry. Make sense? OK.

>> So I have this program I'm going to call classify.py. This is Python. It's a new programming language for you, and it is very similar in some ways to C and PHP. If you want to learn Python after knowing C, it's really not that much of a challenge, just because Python is much easier than C, first of all, and a lot of things are already implemented for you. So just as PHP has functions that sort a list, or append something to an array, and so on, Python has all of those as well.

>> So I'm just going to explain quickly how we could do the classification problem here. Let's say that in this case, I have lyrics from Gaga and Katy Perry. The way that I have those lyrics is that the first word of each lyric is the name of the artist, and the rest is the lyrics. So let's say that I have this list in which the first one is lyrics by Gaga -- so here I am on the right track -- and the next one is Katy, and it also has her lyrics.

>> This is how you declare a variable in Python. You don't have to give the data type. You just write "lyrics," kind of like in PHP. Make sense?

>> So what are the things that I have to calculate to be able to compute the probabilities? I have to calculate the "priors" of each of the different classes that I have. I have to calculate the "posteriors," or pretty much the probabilities of each of the different words that I can have for each artist. So within Gaga, for example, I'm going to have a list of how many times I see each of the words. Make sense?

>> And finally, I'm just going to have a list called "words" that is just going to hold how many words I have for each artist.
So for Gaga, for example, when I look at the lyrics, I had, I think, 24 words in total. So this list is just going to have 24 for Gaga, and another number for Katy. Make sense? OK.

>> So now, actually, let's go to the coding. In Python, you can actually return a bunch of different things from a function. So I'm going to create this function called "conditional," which is going to return all of those things: the "priors," the "probabilities," and the "words." So "conditional," and it's going to be called on "lyrics."

>> So now I want to actually write this function. The way I can write this function is to just define it with "def." So I did "def conditional," and it takes "lyrics." And what this is going to do is, first of all, I have my priors that I want to calculate.

>> The way that I can do this is to create a dictionary in Python, which is pretty much the same thing as a hash table, or like an associative array in PHP. This is how I declare a dictionary. And basically what this means is that the prior for Gaga is 0.5, for example, if 50% of the lyrics are from Gaga and 50% are from Katy. Make sense? So I have to figure out how to calculate the priors.

>> The next ones that I have to do, also, are the probabilities and the words. So the probabilities for Gaga are the list of all the probabilities that I have for each of the words for Gaga. If I go to probabilities of Gaga, "baby," for example, it'll give me something like 2 over 24 in that case. Make sense? So I go to "probabilities," go to the "Gaga" bucket that has a list of all the Gaga words, then I go to "baby," and I see the probability.

>> And finally I have this "words" dictionary. So here, "probabilities," and then "words." So if I do "words," "Gaga," what is going to happen is that it's going to give me 24, saying that I have 24 words within lyrics from Gaga. Makes sense? So here, "words" equals dah-dah-dah. OK.

>> So what I'm going to do is iterate over each of the lyrics, so each of the strings that I have in the list.
And I'm going to calculate those things for each of the candidates. Makes sense? So I have to do a for loop.

>> In Python what I can do is "for line in lyrics" -- the same thing as a "foreach" statement in PHP. Remember how in PHP I could say "foreach lyrics as line"? Makes sense? So I'm taking each of the lines -- in this case, this string and the next string. And for each of the lines, what I'm going to do first is split the line into a list of words separated by spaces.

>> The cool thing about Python is that you could just Google "how can I split a string into words?" and it's going to tell you how to do it. And the way to do it is just "line = line.split()", which is basically going to give you a list with each of the words here. Makes sense? So now that I've done that, I want to know who the singer of that song is. To do that I just have to get the first element of the array, right? So I can just say "singer = line[0]". Makes sense?

>> And then what I need to do is, first of all, update how many words I have under "Gaga." So I'm just going to calculate how many words I have in this list, right? Because this is how many words I have in the lyrics, and I'm just going to add it to the "Gaga" entry. Does that make sense? Don't focus too much on the syntax. Think more about the concepts. That's the most important part. OK.

>> So what I can do is, if "Gaga" is already in that list -- so "if singer in words", which means that I already have words by Gaga -- I just want to add the additional words to it. So what I do is "words[singer] += len(line) - 1". And I can just use the length of the line -- how many elements I have in the array. I have to do minus 1 just because the first element of the array is the singer name, and that's not lyrics. Makes sense? OK.

>> "Else" means that I actually want to insert Gaga into the list. So I just do "words[singer] = len(line) - 1", sorry. The only difference between the two lines is that in this one, the entry doesn't exist yet, so I'm just initializing it.
In this one, I'm actually adding to it. OK. So this was adding to "words."

>> Now I want to add to the priors. So how do I calculate the priors? The priors can be calculated by counting how many times you see that singer among all of the singers that you have, right? So for Gaga and Katy Perry, in this case, I see Gaga once and Katy Perry once.

>> So basically the priors for Gaga and for Katy Perry would each just be one, right? Just how many times I see the artist. This is very easy to calculate. I can do something similar: "if singer in priors," I'm just going to add 1 to their priors entry. So "priors[singer] += 1", and then "else" I'm going to do "priors[singer] = 1". Makes sense?

>> So if it doesn't exist, I just set it to 1; otherwise, I add 1. OK, so now all that I have left to do is add each of the words to the probabilities. I have to count how many times I see each of the words, so I just have to do another for loop over the line.

>> The first thing that I'm going to do is check whether the singer already has a probabilities entry. So I'm checking: if the singer doesn't have a probabilities entry, I'm just going to initialize one for them. It's not even an array, sorry, it's a dictionary. So the probabilities for the singer are going to be an empty dictionary; I'm just initializing a dictionary for it. OK?

>> And now I can actually do a for loop to calculate each of the words' probabilities. OK. So what I can do is a for loop, just iterating over the array. The way that I can do that in Python is "for i in range" -- from 1, because I want to start at the second element, since the first one is the singer name -- up to the length of the line. And when I do range, it actually goes from 1 to the length of the line minus 1, so it already does that n-minus-1 thing for arrays, which is very convenient. Makes sense?
>> So for each of these, what I'm going to do is, just like in the other one, check whether the word in this position in the line is already in the probabilities.

>> And as I said here, for the probabilities of words I put "probabilities[singer]" -- so the name of the singer. If the word is already in "probabilities[singer]", it means that I want to add 1 to it, so I'm going to do "probabilities[singer]", and the word is "line[i]", and I'm going to add 1. And "else", I'm just going to initialize it to 1: "line[i]". Makes sense?

>> So, I've calculated all of the arrays. Now all that I have to do for this one is just "return priors, probabilities, words." Let's see if there are any errors -- OK, it seems everything is working so far. So, does that make sense? In some way? OK.

>> So now I have all the probabilities. The only thing I have left is to have the thing that calculates the product of all the probabilities when I get new lyrics.

>> So let's say that I now want to call this function "classify()", and the thing that function takes is just one argument. Let's say "Baby, I am on fire", and it's going to figure out: what is the probability that this is Gaga? What is the probability that this is Katy? Sounds good?

>> So I'm just going to create a new function called "classify()", and it's going to take some lyrics as well. Besides the lyrics, I also have to pass it the priors, the probabilities, and the words. So I'm going to send lyrics, priors, probabilities, words.

>> So this takes lyrics, priors, probabilities, words. What does it do? It's basically going to go through all the possible candidates that you have as a singer. And where are those candidates? They're in the priors, right? So I have all of those there. So I'm going to have a dictionary of all possible candidates, and then for each candidate in the priors -- so that means it's going to be Gaga and Katy; if I had more, it would be more.
727 00:38:17,740 --> 00:38:20,410 I'm going to start calculating this probability. 728 00:38:20,410 --> 00:38:28,310 The probability, as we saw in the PowerPoint, is the prior times the 729 00:38:28,310 --> 00:38:30,800 product of each of the other probabilities. 730 00:38:30,800 --> 00:38:32,520 >> So I can do the same here. 731 00:38:32,520 --> 00:38:36,330 I can just say the probability is initially just the prior. 732 00:38:36,330 --> 00:38:40,340 So priors of the candidate. 733 00:38:40,340 --> 00:38:40,870 Right? 734 00:38:40,870 --> 00:38:45,360 And now I have to iterate over all the words that I have in the lyrics to be 735 00:38:45,360 --> 00:38:48,820 able to add the probability for each of them, OK? 736 00:38:48,820 --> 00:38:57,900 So, "for word in lyrics," what I'm going to do is, if the word is in 737 00:38:57,900 --> 00:39:01,640 "probabilities[candidate]", which means that it's a word that the 738 00:39:01,640 --> 00:39:03,640 candidate has in their lyrics-- 739 00:39:03,640 --> 00:39:05,940 for example, "baby" for Gaga-- 740 00:39:05,940 --> 00:39:11,710 what I'm going to do is that the probability is going to be multiplied 741 00:39:11,710 --> 00:39:22,420 by 1 plus the probabilities of the candidate for that word. 742 00:39:22,420 --> 00:39:25,710 And it's called "word". 743 00:39:25,710 --> 00:39:32,440 This divided by the number of words that I have for that candidate. 744 00:39:32,440 --> 00:39:37,450 The total number of words that I have for the singer that I'm looking at. 745 00:39:37,450 --> 00:39:40,290 >> "Else," it means it's a new word, so it'd be like, for example, 746 00:39:40,290 --> 00:39:41,860 "fire" for Lady Gaga. 747 00:39:41,860 --> 00:39:45,760 So I just want to do 1 over "word[candidate]". 748 00:39:45,760 --> 00:39:47,710 So I don't want to put this term here. 749 00:39:47,710 --> 00:39:50,010 >> So it's going to be basically copying and pasting this. 750 00:39:50,010 --> 00:39:54,380 751 00:39:54,380 --> 00:39:56,000 But I'm going to delete this part. 752 00:39:56,000 --> 00:39:57,610 So it's just going to be 1 over that. 753 00:39:57,610 --> 00:40:00,900 754 00:40:00,900 --> 00:40:02,150 Sounds good? 755 00:40:02,150 --> 00:40:03,980 756 00:40:03,980 --> 00:40:09,700 And now at the end, I'm just going to print the name of the candidate and 757 00:40:09,700 --> 00:40:15,750 the probability that they have of having this sentence in their lyrics. 758 00:40:15,750 --> 00:40:16,200 Makes sense? 759 00:40:16,200 --> 00:40:18,390 And I actually don't even need this dictionary. 760 00:40:18,390 --> 00:40:19,510 Makes sense? 761 00:40:19,510 --> 00:40:21,810 >> So, let's see if this actually works. 762 00:40:21,810 --> 00:40:24,880 So if I run this, it didn't work. 763 00:40:24,880 --> 00:40:26,130 Wait one second. 764 00:40:26,130 --> 00:40:28,870 765 00:40:28,870 --> 00:40:31,720 "words[candidate]", "words[candidate]", that's 766 00:40:31,720 --> 00:40:33,750 the name of the array. 767 00:40:33,750 --> 00:40:41,435 OK. So, it says there's some bug for "candidate in priors." 768 00:40:41,435 --> 00:40:46,300 769 00:40:46,300 --> 00:40:48,760 Let me just chill a little bit. 770 00:40:48,760 --> 00:40:50,360 OK. 771 00:40:50,360 --> 00:40:51,305 Let's try. 772 00:40:51,305 --> 00:40:51,720 OK. 773 00:40:51,720 --> 00:40:58,710 >> So it gives Katy Perry has this probability of this times 10 to the 774 00:40:58,710 --> 00:41:02,200 minus 7, and Gaga has this times 10 to the minus 6. 775 00:41:02,200 --> 00:41:05,610 So you see it shows that Gaga has a higher probability.
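A minimal sketch in Python of the two functions walked through above. This is not the seminar's exact code; the training-data format (one song per line, with the singer's name as the first token), the variable names, and the tiny example data are assumptions made for illustration.

# Sketch of the training step: count songs per singer (priors),
# word occurrences per singer (probabilities), and total words per
# singer (words). Assumed input: lines like "gaga baby baby fire ...".
def train(lines):
    priors = {}           # how many songs we have seen per singer
    probabilities = {}    # per-singer word counts
    words = {}            # total word count per singer
    for line in lines:
        tokens = line.split()
        singer = tokens[0]
        priors[singer] = priors.get(singer, 0) + 1
        if singer not in probabilities:
            probabilities[singer] = {}
            words[singer] = 0
        for i in range(1, len(tokens)):
            word = tokens[i]
            probabilities[singer][word] = probabilities[singer].get(word, 0) + 1
            words[singer] += 1
    return priors, probabilities, words

# Sketch of the classification step: prior times one factor per word,
# using (1 + count) / total for words seen in that singer's lyrics and
# 1 / total for words never seen for that singer.
def classify(lyrics, priors, probabilities, words):
    for candidate in priors:
        probability = priors[candidate]
        for word in lyrics.split():
            if word in probabilities[candidate]:
                probability *= (1 + probabilities[candidate][word]) / words[candidate]
            else:
                probability *= 1 / words[candidate]
        print(candidate, probability)

# Hypothetical usage with made-up data, just to show the shape of it:
training = ["gaga baby baby love fire", "katy baby roar tiger fire fire"]
priors, probabilities, words = train(training)
classify("baby i am on fire", priors, probabilities, words)

Each candidate is printed with its score, and the candidate with the highest score is taken as the guess, which is the comparison being made on screen here.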
776 00:41:05,610 --> 00:41:09,260 So "Baby, I'm on Fire" is probably a Gaga song. 777 00:41:09,260 --> 00:41:10,580 Makes sense? 778 00:41:10,580 --> 00:41:12,030 So this is what we did. 779 00:41:12,030 --> 00:41:16,010 >> This code is going to be posted online, so you guys can check it out. 780 00:41:16,010 --> 00:41:20,720 Maybe use some of it if you want to do a project or something similar. 781 00:41:20,720 --> 00:41:22,150 OK. 782 00:41:22,150 --> 00:41:25,930 This was just to show what computational 783 00:41:25,930 --> 00:41:27,230 linguistics code looks like. 784 00:41:27,230 --> 00:41:33,040 But now let's go to more high-level stuff. 785 00:41:33,040 --> 00:41:33,340 OK. 786 00:41:33,340 --> 00:41:35,150 >> So the other problems I was talking about-- 787 00:41:35,150 --> 00:41:37,550 the segmentation problem is the first of them. 788 00:41:37,550 --> 00:41:40,820 So you have here Japanese. 789 00:41:40,820 --> 00:41:43,420 And then you see that there are no spaces. 790 00:41:43,420 --> 00:41:49,110 So this basically means that it's the top of the chair, right? 791 00:41:49,110 --> 00:41:50,550 You speak Japanese? 792 00:41:50,550 --> 00:41:52,840 It's the top of the chair, right? 793 00:41:52,840 --> 00:41:54,480 >> STUDENT: I don't know what the kanji over there is. 794 00:41:54,480 --> 00:41:57,010 >> LUCAS FREITAS: It's [SPEAKING JAPANESE] 795 00:41:57,010 --> 00:41:57,950 OK. 796 00:41:57,950 --> 00:42:00,960 So it basically means chair of top. 797 00:42:00,960 --> 00:42:03,620 So if you had to put a space it would be here. 798 00:42:03,620 --> 00:42:05,970 And then you have [? Ueda-san. ?] 799 00:42:05,970 --> 00:42:09,040 Which basically means Mr. Ueda. 800 00:42:09,040 --> 00:42:13,180 And you see that "Ueda" and you have a space and then "san." So you see that 801 00:42:13,180 --> 00:42:15,470 here "Ue" is like by itself. 802 00:42:15,470 --> 00:42:17,750 And here it has a character next to it. 803 00:42:17,750 --> 00:42:21,720 >> So it's not like in those languages each character means a word, so you 804 00:42:21,720 --> 00:42:23,980 could just put a lot of spaces. 805 00:42:23,980 --> 00:42:25,500 Characters relate to each other. 806 00:42:25,500 --> 00:42:28,680 And they can be together in groups of like two, three, or one. 807 00:42:28,680 --> 00:42:34,520 So you actually have to create some kind of way of putting those spaces. 808 00:42:34,520 --> 00:42:38,850 >> And the thing is that whenever you get data from those Asian languages, 809 00:42:38,850 --> 00:42:40,580 everything comes unsegmented. 810 00:42:40,580 --> 00:42:45,940 Because no one who writes Japanese or Chinese writes with spaces. 811 00:42:45,940 --> 00:42:48,200 Whenever you're writing Chinese or Japanese, you just write everything 812 00:42:48,200 --> 00:42:48,710 with no spaces. 813 00:42:48,710 --> 00:42:52,060 It doesn't even make sense to put spaces. 814 00:42:52,060 --> 00:42:57,960 So then when you get data from some East Asian language, if you want to 815 00:42:57,960 --> 00:43:00,760 actually do something with that you have to segment it first. 816 00:43:00,760 --> 00:43:05,130 >> Think of doing the example of the lyrics without spaces. 817 00:43:05,130 --> 00:43:07,950 So the only lyrics that you have will be sentences, right? 818 00:43:07,950 --> 00:43:09,470 Separated by periods. 819 00:43:09,470 --> 00:43:13,930 But then having just the sentence will not really help in giving information 820 00:43:13,930 --> 00:43:17,760 about who those lyrics are by. 821 00:43:17,760 --> 00:43:18,120 Right?
822 00:43:18,120 --> 00:43:20,010 So you should put spaces first. 823 00:43:20,010 --> 00:43:21,990 So how can you do that? 824 00:43:21,990 --> 00:43:24,920 >> So then comes the idea of a language model, which is something really 825 00:43:24,920 --> 00:43:26,870 important for computational linguistics. 826 00:43:26,870 --> 00:43:32,790 So a language model is basically a table of probabilities that shows, 827 00:43:32,790 --> 00:43:36,260 first of all, what is the probability of having a word in a language? 828 00:43:36,260 --> 00:43:39,590 So showing how frequent a word is. 829 00:43:39,590 --> 00:43:43,130 And then also showing the relation between the words in a sentence. 830 00:43:43,130 --> 00:43:51,500 >> So the main idea is, if a stranger came to you and said a sentence to 831 00:43:51,500 --> 00:43:55,600 you, what is the probability that, for example, "this is my sister [? GTF" ?] 832 00:43:55,600 --> 00:43:57,480 was the sentence that the person said? 833 00:43:57,480 --> 00:44:00,380 So obviously some sentences are more common than others. 834 00:44:00,380 --> 00:44:04,450 For example, "good morning," or "good night," or "hey there," is much more 835 00:44:04,450 --> 00:44:08,260 common than most sentences that we have in English. 836 00:44:08,260 --> 00:44:11,060 So why are those sentences more frequent? 837 00:44:11,060 --> 00:44:14,060 >> First of all, it's because you have words that are more frequent. 838 00:44:14,060 --> 00:44:20,180 So, for example, if you say "the dog is big" and "the dog is gigantic," you 839 00:44:20,180 --> 00:44:23,880 usually probably hear "the dog is big" more often, because "big" is more 840 00:44:23,880 --> 00:44:27,260 frequent in English than "gigantic." So, one of the 841 00:44:27,260 --> 00:44:30,100 things is the word frequency. 842 00:44:30,100 --> 00:44:34,490 >> The second thing which is really important is just the 843 00:44:34,490 --> 00:44:35,490 order of the words. 844 00:44:35,490 --> 00:44:39,500 So, it's common to say "the cat is inside the box," but you don't usually 845 00:44:39,500 --> 00:44:44,250 see "the box inside is the cat." So you see that there is some importance 846 00:44:44,250 --> 00:44:46,030 in the order of the words. 847 00:44:46,030 --> 00:44:50,160 You cannot just say that those two sentences have the same probability 848 00:44:50,160 --> 00:44:53,010 just because they have the same words. 849 00:44:53,010 --> 00:44:55,550 You actually have to care about order as well. 850 00:44:55,550 --> 00:44:57,650 Make sense? 851 00:44:57,650 --> 00:44:59,490 >> So what do we do? 852 00:44:59,490 --> 00:45:01,550 So what am I trying to get to? 853 00:45:01,550 --> 00:45:04,400 I'm trying to get to what we call the n-gram models. 854 00:45:04,400 --> 00:45:09,095 So n-gram models basically assume that, for each word that 855 00:45:09,095 --> 00:45:10,960 you have in a sentence, 856 00:45:10,960 --> 00:45:15,020 the probability of having that word there depends not only on the 857 00:45:15,020 --> 00:45:18,395 frequency of that word in the language, but also on the words that 858 00:45:18,395 --> 00:45:19,860 are surrounding it. 859 00:45:19,860 --> 00:45:25,810 >> So for example, usually when you see something like "on" or "at," you're 860 00:45:25,810 --> 00:45:28,040 probably going to see a noun after it, right? 861 00:45:28,040 --> 00:45:31,750 Because when you have a preposition, usually it takes a noun after it.
862 00:45:31,750 --> 00:45:35,540 Or if you have a verb that is transitive, you usually are going to 863 00:45:35,540 --> 00:45:36,630 have a noun phrase. 864 00:45:36,630 --> 00:45:38,780 So it's going to have a noun somewhere around it. 865 00:45:38,780 --> 00:45:44,950 >> So, basically, what it does is that it considers the probability of having 866 00:45:44,950 --> 00:45:47,960 words next to each other, when you're calculating the 867 00:45:47,960 --> 00:45:49,050 probability of a sentence. 868 00:45:49,050 --> 00:45:50,960 And that's what a language model is, basically. 869 00:45:50,960 --> 00:45:54,620 Just saying what's the probability of having a specific 870 00:45:54,620 --> 00:45:57,120 sentence in a language? 871 00:45:57,120 --> 00:45:59,110 So why is that useful, basically? 872 00:45:59,110 --> 00:46:02,390 And first of all, what is an n-gram model, then? 873 00:46:02,390 --> 00:46:08,850 >> So an n-gram model means that each word depends on the 874 00:46:08,850 --> 00:46:12,700 previous N minus 1 words. 875 00:46:12,700 --> 00:46:18,150 So, basically, it means that if I look, for example, at "the CS50 TF" when 876 00:46:18,150 --> 00:46:21,500 I'm calculating the probability of the sentence, it'll be like the 877 00:46:21,500 --> 00:46:25,280 probability of having the word "the," times the probability of having "the 878 00:46:25,280 --> 00:46:31,720 CS50," times the probability of having "the CS50 TF." So, basically, I count 879 00:46:31,720 --> 00:46:35,720 all possible ways of stretching it. 880 00:46:35,720 --> 00:46:41,870 >> And then usually when you're doing this, as in a project, you put N to be 881 00:46:41,870 --> 00:46:42,600 a low value. 882 00:46:42,600 --> 00:46:45,930 So, you usually have bigrams or trigrams. 883 00:46:45,930 --> 00:46:51,090 So that you just count two words, a group of two words, or three words, 884 00:46:51,090 --> 00:46:52,620 just for performance reasons. 885 00:46:52,620 --> 00:46:56,395 And also because maybe if you have something like "the CS50 TF," when you 886 00:46:56,395 --> 00:47:00,510 have "TF," it's very important that "CS50" is next to it, right? 887 00:47:00,510 --> 00:47:04,050 Those two things are usually next to each other. 888 00:47:04,050 --> 00:47:06,410 >> If you think of "TF," it's probably going to have what 889 00:47:06,410 --> 00:47:07,890 class it's TF'ing for. 890 00:47:07,890 --> 00:47:11,330 Also "the" is really important for "CS50 TF." 891 00:47:11,330 --> 00:47:14,570 But if you have something like "The CS50 TF went to class and gave their 892 00:47:14,570 --> 00:47:20,060 students some candy," "candy" and "the" have no relation really, right? 893 00:47:20,060 --> 00:47:23,670 They're so distant from each other that it doesn't really matter what 894 00:47:23,670 --> 00:47:25,050 words you have. 895 00:47:25,050 --> 00:47:31,210 >> So by doing a bigram or a trigram, it just means that you're limiting 896 00:47:31,210 --> 00:47:33,430 yourself to some words that are around it. 897 00:47:33,430 --> 00:47:35,810 Make sense? 898 00:47:35,810 --> 00:47:40,630 So when you want to do segmentation, basically, what you want to do is see 899 00:47:40,630 --> 00:47:44,850 what are all the possible ways that you can segment the sentence. 900 00:47:44,850 --> 00:47:49,090 >> Such that you see what is the probability of each of those sentences 901 00:47:49,090 --> 00:47:50,880 existing in the language? 902 00:47:50,880 --> 00:47:53,410 So what you do is like, well, let me try to put a space here.
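A bigram language model of the kind being described could be sketched as below. The training sentences, function names, and add-one smoothing here are illustrative assumptions rather than the exact model from the talk; the space-placement walkthrough that continues below would score each candidate segmentation with a model like this and keep the highest-scoring one.

# Sketch of a bigram language model with add-one smoothing.
from collections import defaultdict

def train_bigram_model(sentences):
    unigrams = defaultdict(int)   # how often each word appears
    bigrams = defaultdict(int)    # how often each pair of adjacent words appears
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split()
        for i, token in enumerate(tokens):
            unigrams[token] += 1
            if i > 0:
                bigrams[(tokens[i - 1], token)] += 1
    return unigrams, bigrams

def sentence_probability(sentence, unigrams, bigrams):
    # P(w1 ... wn) is approximated as P(w1 | <s>) * P(w2 | w1) * ... * P(wn | wn-1),
    # with add-one smoothing so unseen pairs do not zero everything out.
    tokens = ["<s>"] + sentence.split()
    vocabulary = len(unigrams)
    probability = 1.0
    for i in range(1, len(tokens)):
        previous, current = tokens[i - 1], tokens[i]
        pair_count = bigrams.get((previous, current), 0)
        previous_count = unigrams.get(previous, 0)
        probability *= (pair_count + 1) / (previous_count + vocabulary)
    return probability

# Hypothetical usage: the natural word order should get the higher score.
unigrams, bigrams = train_bigram_model(["the cat is inside the box", "the dog is big"])
print(sentence_probability("the cat is inside the box", unigrams, bigrams))
print(sentence_probability("the box inside is the cat", unigrams, bigrams))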
903 00:47:53,410 --> 00:47:55,570 So you put a space there and you see what is the 904 00:47:55,570 --> 00:47:57,590 probability of that sentence? 905 00:47:57,590 --> 00:48:00,240 Then you are like, OK, maybe that was not that good. 906 00:48:00,240 --> 00:48:03,420 So I put a space there and a space there, and you calculate the 907 00:48:03,420 --> 00:48:06,240 probability now, and you see that it's a higher probability. 908 00:48:06,240 --> 00:48:12,160 >> So this is an algorithm called the TANGO segmentation algorithm, which is 909 00:48:12,160 --> 00:48:14,990 actually something that would be really cool for a project. It 910 00:48:14,990 --> 00:48:20,860 basically takes unsegmented text, which can be Japanese or Chinese or maybe 911 00:48:20,860 --> 00:48:26,080 English without spaces, and tries to put spaces between words, and it does 912 00:48:26,080 --> 00:48:29,120 that by using a language model and trying to see what is the highest 913 00:48:29,120 --> 00:48:31,270 probability you can get. 914 00:48:31,270 --> 00:48:32,230 OK. 915 00:48:32,230 --> 00:48:33,800 So this is segmentation. 916 00:48:33,800 --> 00:48:35,450 >> Now syntax. 917 00:48:35,450 --> 00:48:40,940 So, syntax is being used for so many things right now. 918 00:48:40,940 --> 00:48:44,880 So for Graph Search, for Siri, for pretty much any kind of natural 919 00:48:44,880 --> 00:48:46,490 language processing you have. 920 00:48:46,490 --> 00:48:49,140 So what are the important things about syntax? 921 00:48:49,140 --> 00:48:52,390 So, sentences in general have what we call constituents. 922 00:48:52,390 --> 00:48:57,080 Which are kind of like groups of words that have a function in the sentence. 923 00:48:57,080 --> 00:49:02,220 And they cannot really be taken apart from each other. 924 00:49:02,220 --> 00:49:07,380 >> So, if I say, for example, "Lauren loves Milo," I know that "Lauren" is a 925 00:49:07,380 --> 00:49:10,180 constituent and then "loves Milo" is also another one. 926 00:49:10,180 --> 00:49:16,860 Because you cannot say like "Lauren Milo loves" to have the same meaning. 927 00:49:16,860 --> 00:49:18,020 It's not going to have the same meaning. 928 00:49:18,020 --> 00:49:22,500 Or I cannot say like "Milo Lauren loves." It doesn't have the same 929 00:49:22,500 --> 00:49:25,890 meaning when you do that. 930 00:49:25,890 --> 00:49:31,940 >> So the two most important things about syntax are the lexical types, which are 931 00:49:31,940 --> 00:49:35,390 basically the functions that you have for words by themselves. 932 00:49:35,390 --> 00:49:39,180 So you have to know that "Lauren" and "Milo" are nouns. 933 00:49:39,180 --> 00:49:41,040 "Love" is a verb. 934 00:49:41,040 --> 00:49:45,660 And the second important thing is the phrasal types. 935 00:49:45,660 --> 00:49:48,990 So you know that "loves Milo" is actually a verbal phrase. 936 00:49:48,990 --> 00:49:52,390 So when I say "Lauren," I know that Lauren is doing something. 937 00:49:52,390 --> 00:49:53,620 What is she doing? 938 00:49:53,620 --> 00:49:54,570 She's loving Milo. 939 00:49:54,570 --> 00:49:56,440 So it's a whole thing. 940 00:49:56,440 --> 00:50:01,640 But its components are a noun and a verb. 941 00:50:01,640 --> 00:50:04,210 But together, they make a verb phrase. 942 00:50:04,210 --> 00:50:08,680 >> So, what can we actually do with computational linguistics? 943 00:50:08,680 --> 00:50:13,810 So, if I have something, for example, "friends of Allison."
I see, if I just 944 00:50:13,810 --> 00:50:17,440 did a syntactic tree, I would know that "friends" is a noun phrase, it is a 945 00:50:17,440 --> 00:50:21,480 noun, and then "of Allison" is a prepositional phrase in which "of" is 946 00:50:21,480 --> 00:50:24,810 a preposition and "Allison" is a noun. 947 00:50:24,810 --> 00:50:30,910 What I could do is teach my computer that when I have a noun phrase one and 948 00:50:30,910 --> 00:50:33,080 then a prepositional phrase, 949 00:50:33,080 --> 00:50:39,020 so in this case, "friends" and then "of Allison," I know that this means that 950 00:50:39,020 --> 00:50:43,110 NP2, the second one, owns NP1. 951 00:50:43,110 --> 00:50:47,680 >> So I can create some kind of relation, some kind of function for it. 952 00:50:47,680 --> 00:50:52,370 So whenever I see this structure, which matches exactly with "friends of 953 00:50:52,370 --> 00:50:56,030 Allison," I know that Allison owns the friends. 954 00:50:56,030 --> 00:50:58,830 So the friends are something that Allison has. 955 00:50:58,830 --> 00:50:59,610 Makes sense? 956 00:50:59,610 --> 00:51:01,770 So this is basically what Graph Search does. 957 00:51:01,770 --> 00:51:04,360 It just creates rules for a lot of things. 958 00:51:04,360 --> 00:51:08,190 So "friends of Allison," "my friends who live in Cambridge," "my friends 959 00:51:08,190 --> 00:51:12,970 who go to Harvard." It creates rules for all of those things. 960 00:51:12,970 --> 00:51:14,930 >> Now machine translation. 961 00:51:14,930 --> 00:51:18,850 So, machine translation is also something statistical. 962 00:51:18,850 --> 00:51:21,340 And actually, if you get involved in computational linguistics, a lot of 963 00:51:21,340 --> 00:51:23,580 your stuff is going to be statistics. 964 00:51:23,580 --> 00:51:26,670 So, as in the example I was doing, there were a lot of probabilities that I was 965 00:51:26,670 --> 00:51:30,540 calculating, and then you get to this very small number that's the final 966 00:51:30,540 --> 00:51:33,180 probability, and that's what gives you the answer. 967 00:51:33,180 --> 00:51:37,540 Machine translation also uses a statistical model. 968 00:51:37,540 --> 00:51:44,790 And if you want to think of machine translation in the simplest possible 969 00:51:44,790 --> 00:51:48,970 way, what you can do is just translate word by word, right? 970 00:51:48,970 --> 00:51:52,150 >> When you're learning a language for the first time, that's usually what 971 00:51:52,150 --> 00:51:52,910 you do, right? 972 00:51:52,910 --> 00:51:57,050 If you want to translate a sentence in your language to the language 973 00:51:57,050 --> 00:52:00,060 you're learning, usually, first, you translate each of the words 974 00:52:00,060 --> 00:52:03,180 individually, and then you try to put the words into place. 975 00:52:03,180 --> 00:52:07,100 >> So if I wanted to translate this, [SPEAKING PORTUGUESE] 976 00:52:07,100 --> 00:52:10,430 which means "the white cat ran away." If I wanted to translate it from 977 00:52:10,430 --> 00:52:13,650 Portuguese to English, what I could do is, first, I just 978 00:52:13,650 --> 00:52:14,800 translate word by word. 979 00:52:14,800 --> 00:52:20,570 So "o" is "the," "gato" is "cat," "branco" is "white," and then "fugiu" is 980 00:52:20,570 --> 00:52:21,650 "ran away." 981 00:52:21,650 --> 00:52:26,130 So then I have all the words here, but they're not in order. 982 00:52:26,130 --> 00:52:29,590 It's like "the cat white ran away," which is ungrammatical.
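The word-by-word step just described might look like the sketch below; the tiny glossary is a made-up assumption for illustration only.

# Sketch of naive word-by-word translation with a made-up glossary.
glossary = {
    "o": "the",
    "gato": "cat",
    "branco": "white",
    "fugiu": "ran away",
}

def translate_word_by_word(sentence):
    # Look each word up, keeping the original word order;
    # fall back to the word itself if it is not in the glossary.
    return " ".join(glossary.get(word, word) for word in sentence.split())

print(translate_word_by_word("o gato branco fugiu"))  # prints "the cat white ran away"

The reordering step described next would then try permutations of these words and keep the order that the language model scores highest.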
983 00:52:29,590 --> 00:52:34,490 So, then I can have a second step, which is going to be finding the ideal 984 00:52:34,490 --> 00:52:36,610 position for each of the words. 985 00:52:36,610 --> 00:52:40,240 So I know that I actually want to have "white cat" instead of "cat white." So 986 00:52:40,240 --> 00:52:46,050 what I can do is, the most naive method would be to create all the 987 00:52:46,050 --> 00:52:49,720 possible permutations of the words, of the positions. 988 00:52:49,720 --> 00:52:53,300 And then see which one has the highest probability according 989 00:52:53,300 --> 00:52:54,970 to my language model. 990 00:52:54,970 --> 00:52:58,390 And then when I find the one that has the highest probability, which is 991 00:52:58,390 --> 00:53:01,910 probably "the white cat ran away," that's my translation. 992 00:53:01,910 --> 00:53:06,710 >> And this is a simple way of explaining how a lot of machine translation 993 00:53:06,710 --> 00:53:07,910 algorithms work. 994 00:53:07,910 --> 00:53:08,920 Does that make sense? 995 00:53:08,920 --> 00:53:12,735 This is also something really exciting that you guys can maybe explore for a 996 00:53:12,735 --> 00:53:13,901 final project, yeah? 997 00:53:13,901 --> 00:53:15,549 >> STUDENT: Well, you said it was the naive way, so what's 998 00:53:15,549 --> 00:53:17,200 the non-naive way? 999 00:53:17,200 --> 00:53:18,400 >> LUCAS FREITAS: The non-naive way? 1000 00:53:18,400 --> 00:53:19,050 OK. 1001 00:53:19,050 --> 00:53:22,860 So the first thing that is bad about this method is that I just translated 1002 00:53:22,860 --> 00:53:24,330 words, word by word. 1003 00:53:24,330 --> 00:53:30,570 But sometimes you have words that can have multiple translations. 1004 00:53:30,570 --> 00:53:32,210 I'm going to try to think of something. 1005 00:53:32,210 --> 00:53:37,270 For example, "manga" in Portuguese can either be "mango" or "sleeve." So 1006 00:53:37,270 --> 00:53:40,450 when you're trying to translate word by word, it might be giving you 1007 00:53:40,450 --> 00:53:42,050 something that makes no sense. 1008 00:53:42,050 --> 00:53:45,770 >> So you actually want to look at all the possible translations of the 1009 00:53:45,770 --> 00:53:49,840 words and see, first of all, what is the order. 1010 00:53:49,840 --> 00:53:52,000 We were talking about permuting the things? 1011 00:53:52,000 --> 00:53:54,150 To see all the possible orders and choose the one with the highest 1012 00:53:54,150 --> 00:53:54,990 probability? 1013 00:53:54,990 --> 00:53:57,860 You can also choose all the possible translations for each 1014 00:53:57,860 --> 00:54:00,510 word and then see-- 1015 00:54:00,510 --> 00:54:01,950 combined with the permutations-- 1016 00:54:01,950 --> 00:54:03,710 which one has the highest probability. 1017 00:54:03,710 --> 00:54:08,590 >> Plus, you can also look at not only words but phrases, 1018 00:54:08,590 --> 00:54:11,700 so you can analyze the relations between the words and then get a 1019 00:54:11,700 --> 00:54:13,210 better translation. 1020 00:54:13,210 --> 00:54:16,690 Also, something else: this semester I'm actually doing research in 1021 00:54:16,690 --> 00:54:19,430 Chinese-English machine translation, so translating from 1022 00:54:19,430 --> 00:54:20,940 Chinese into English.
1023 00:54:20,940 --> 00:54:26,760 >> And something we do is, besides using a statistical model, which is just 1024 00:54:26,760 --> 00:54:30,570 seeing the probabilities of seeing words in some position in a sentence, I'm 1025 00:54:30,570 --> 00:54:35,360 actually also adding some syntax to my model, saying, oh, if I see this kind 1026 00:54:35,360 --> 00:54:39,420 of construction, this is what I want to change it to when I translate. 1027 00:54:39,420 --> 00:54:43,880 So you can also add some kind of element of syntax to make the 1028 00:54:43,880 --> 00:54:47,970 translation more efficient and more precise. 1029 00:54:47,970 --> 00:54:48,550 OK. 1030 00:54:48,550 --> 00:54:51,010 >> So how can you get started, if you want to do something in computational 1031 00:54:51,010 --> 00:54:51,980 linguistics? 1032 00:54:51,980 --> 00:54:54,560 >> First, you choose a project that involves languages. 1033 00:54:54,560 --> 00:54:56,310 So, there are so many out there. 1034 00:54:56,310 --> 00:54:58,420 There are so many things you can do. 1035 00:54:58,420 --> 00:55:00,510 And then you can think of a model that you can use. 1036 00:55:00,510 --> 00:55:04,710 Usually that means thinking of assumptions, like, oh, when I was 1037 00:55:04,710 --> 00:55:05,770 thinking of the lyrics. 1038 00:55:05,770 --> 00:55:09,510 I was like, well, if I want to figure out who wrote this, I probably want 1039 00:55:09,510 --> 00:55:15,400 to look at the words the person used and see who uses those words very often. 1040 00:55:15,400 --> 00:55:18,470 So try to make assumptions and try to think of models. 1041 00:55:18,470 --> 00:55:21,395 And then you can also search online for the kind of problem that you have, 1042 00:55:21,395 --> 00:55:24,260 and it's going to suggest to you models that maybe 1043 00:55:24,260 --> 00:55:26,560 model that thing well. 1044 00:55:26,560 --> 00:55:29,080 >> And also you can always email me. 1045 00:55:29,080 --> 00:55:31,140 me@lfreitas.com. 1046 00:55:31,140 --> 00:55:34,940 And I can just answer your questions. 1047 00:55:34,940 --> 00:55:38,600 We might even meet up so I can give suggestions on ways of 1048 00:55:38,600 --> 00:55:41,490 implementing your project. 1049 00:55:41,490 --> 00:55:45,610 And I mean, if you get involved with computational linguistics, it's going 1050 00:55:45,610 --> 00:55:46,790 to be great. 1051 00:55:46,790 --> 00:55:48,370 You're going to see there is so much potential. 1052 00:55:48,370 --> 00:55:52,060 And the industry wants to hire you so badly because of that. 1053 00:55:52,060 --> 00:55:54,720 So I hope you guys enjoyed this. 1054 00:55:54,720 --> 00:55:57,030 If you guys have any questions, you can ask me after this. 1055 00:55:57,030 --> 00:55:58,280 But thank you. 1056 00:55:58,280 --> 00:56:00,150