[MUSIC PLAYING]

BRIAN YU: Welcome back, everybody, to our final class in an Introduction to Artificial Intelligence with Python. So far in this class, we've been taking problems that we want to solve intelligently and framing them in ways that computers are going to be able to make sense of. We've been taking problems and framing them as search problems or constraint satisfaction problems or optimization problems, for example. In essence, we have been trying to communicate about problems in ways that our computer is going to be able to understand.

Today, the goal is going to be to get computers to understand the way you and I communicate naturally, via our own natural languages, languages like English. But natural language contains a lot of nuance and complexity that's going to make it challenging for computers to understand. So we'll need to explore some new tools and some new techniques to allow computers to make sense of natural language.

So what is it exactly that we're trying to get computers to do? Well, these tasks all fall under the general heading of natural language processing: getting computers to work with natural language. They include tasks like automatic summarization: given a long text, can we train the computer to come up with a shorter representation of it? Information extraction: getting the computer to pull relevant facts or details out of some text. Machine translation, like Google Translate: translating some text from one language into another language. Question answering: if you've ever asked a question to your phone or had a conversation with an AI chatbot, you provide some text to the computer, and the computer is able to understand that text and then generate some text in response. Text classification, where we provide some text to the computer and the computer assigns it a label: positive or negative, inbox or spam, for example. And there are several other kinds of tasks that all fall under this heading of natural language processing.
But before we take a look at how the computer might try to solve these kinds of tasks, it might be useful for us to think about language in general. What are the kinds of challenges that we might need to deal with as we start to think about language and getting a computer to be able to understand it?

One part of language that we'll need to consider is the syntax of language. Syntax is all about the structure of language. Language is composed of individual words, and those words are composed together into some kind of structured whole. And if our computer is going to be able to understand language, it's going to need to understand something about that structure.

So let's take a couple of examples. Here, for instance, is a sentence: "Just before nine o'clock Sherlock Holmes stepped briskly into the room." That sentence is made up of words, and those words together form a structured whole. This is syntactically valid as a sentence. But we could take some of those same words, rearrange them, and come up with a sentence that is not syntactically valid. For example, "Just before Sherlock Holmes nine o'clock stepped briskly the room" is still composed of valid words, but they don't form any kind of logical whole. This is not a syntactically well-formed sentence.

Another interesting challenge is that some sentences will have multiple possible valid structures. Here's a sentence, for example: "I saw the man on the mountain with a telescope." This is a valid sentence, but it actually has two different possible structures that lend themselves to two different interpretations and two different meanings. Maybe I'm the one doing the seeing and the one with the telescope, or maybe the man on the mountain is the one with the telescope. And so natural language is ambiguous. Sometimes the same sentence can be interpreted in multiple ways, and that's something that we'll need to think about as well.

And this lends itself to another problem within language that we'll need to think about, which is semantics.
While syntax is all about the structure of language, semantics is about the meaning of language. It's not enough for a computer just to know that a sentence is well-structured if it doesn't know what that sentence means. And so semantics is going to concern itself with the meaning of words and the meaning of sentences.

If we go back to that same sentence as before, "Just before nine o'clock Sherlock Holmes stepped briskly into the room," I could come up with another sentence, say, "A few minutes before nine, Sherlock Holmes walked quickly into the room." Those are two different sentences, with some of the words the same and some of the words different, but the two sentences have essentially the same meaning. And so ideally, whatever model we build will be able to understand that these two sentences, while different, mean something very similar.

Some syntactically well-formed sentences don't mean anything at all. A famous example from linguist Noam Chomsky is the sentence, "Colorless green ideas sleep furiously." This is a syntactically, structurally well-formed sentence: we've got adjectives modifying a noun, ideas, and we've got a verb and an adverb in the correct positions. But when taken as a whole, the sentence doesn't really mean anything. And so if our computers are going to be able to work with natural language and perform tasks in natural language processing, these are some concerns we'll need to think about. We'll need to be thinking about syntax, and we'll need to be thinking about semantics.

So how could we go about trying to teach a computer to understand the structure of natural language? Well, one approach we might take is to start by thinking about the rules of natural language. Our natural languages have rules. In English, for example, nouns tend to come before verbs, and nouns can be modified by adjectives. And so if only we could formalize those rules, then we could give those rules to a computer, and the computer would be able to make sense of them and understand them. So let's try to do exactly that.
We're going to try to define a formal grammar, where a formal grammar is some system of rules for generating sentences in a language. This is going to be a rule-based approach to natural language processing. We're going to give the computer some rules that we know about language, and have the computer use those rules to make sense of the structure of language. There are a number of different types of formal grammars, and each one of them has slightly different use cases. But today, we're going to focus specifically on one kind of grammar known as a context-free grammar.

So how does a context-free grammar work? Well, here is a sentence that we might want a computer to generate: "She saw the city." We're going to call each of these words a terminal symbol: a terminal symbol because once our computer has generated the word, there's nothing else for it to generate. Once it's generated the sentence, the computer is done. We're going to associate each of these terminal symbols with a nonterminal symbol that generates it. So here we've got N, which stands for noun, like "she" or "city." We've got V as a nonterminal symbol, which stands for a verb. And then we have D, which stands for determiner. A determiner is a word like "the" or "a" or "an" in English, for example. So each of these nonterminal symbols can generate the terminal symbols that we ultimately care about generating.

But how do we know, or how does the computer know, which nonterminal symbols are associated with which terminal symbols? Well, to do that, we need some kind of rule. Here are what we call rewriting rules, which have a nonterminal symbol on the left-hand side of an arrow, and on the right side is what that nonterminal symbol can be replaced with. So here, we're saying the nonterminal symbol N, which again stands for noun, could be replaced by any of these options separated by vertical bars: N could be replaced by "she" or "city" or "car" or "Harry." D, for determiner, could be replaced by "the," "a," or "an," and so forth. Each of these nonterminal symbols could be replaced by any of these words. We can also have nonterminal symbols that are replaced by other nonterminal symbols.
Here's an interesting rule: NP → N | D N. So what does that mean? Well, NP stands for a noun phrase. Sometimes when we have a noun phrase in a sentence, it's not just a single word; it could be multiple words. And so here, we're saying a noun phrase could be just a noun, or it could be a determiner followed by a noun. So we might have a noun phrase that's just a noun, like "she." That's a noun phrase. Or we could have a noun phrase that's multiple words, something like "the city." That also acts as a noun phrase, but in this case, it's composed of two words: a determiner, "the," and a noun, "city."

We could do the same for verb phrases. A verb phrase, or VP, might be just a verb, or it might be a verb followed by a noun phrase. So we could have a verb phrase that's just a single word, like the word "walked," or we could have a verb phrase that is an entire phrase, something like "saw the city."

A sentence, meanwhile, we might then define as a noun phrase followed by a verb phrase. And so this would allow us to generate a sentence like "She saw the city": an entire sentence made up of a noun phrase, which is just the word "she," and then a verb phrase, which is "saw the city" — "saw," which is a verb, and then "the city," which itself is also a noun phrase.

And so if we could give these rules to a computer, explaining to it what nonterminal symbols could be replaced by what other symbols, then a computer could take a sentence and begin to understand the structure of that sentence. So let's take a look at an example of how we might do that. To do that, we're going to use a Python library called NLTK, or the Natural Language Toolkit, which we'll see a couple of times today. It contains a lot of helpful features and functions that we can use for trying to deal with and process natural language. So here, we'll take a look at how we can use NLTK in order to parse a context-free grammar.

Let's go ahead and open up cfg0.py, cfg standing for context-free grammar. What you'll see in this file is that I first import NLTK, the Natural Language Toolkit.
The first thing I do is define a context-free grammar, saying that a sentence is a noun phrase followed by a verb phrase, defining what a noun phrase is, defining what a verb phrase is, and then giving some examples of what I can do with these nonterminal symbols: D for determiner, N for noun, and V for verb. We're going to use NLTK to parse that grammar. Then we'll ask the user for some input in the form of a sentence and split it into words. And then we'll use this context-free grammar parser to try to parse that sentence and print out the resulting syntax tree.

So let's take a look at an example. We'll go ahead and go into my cfg directory and run cfg0.py. Here, I'm asked to type in a sentence. Let's say I type in "she walked." When I do that, I see that "she walked" is a valid sentence, where "she" is a noun phrase, and "walked" is the corresponding verb phrase. I could try to do this with a more complex sentence, too. I could do something like "she saw the city." And here, we see that "she" is the noun phrase, and then "saw the city" is the entire verb phrase that makes up this sentence.

So that was a very simple grammar. Let's take a look at a slightly more complex grammar. Here is cfg1.py, where a sentence is still a noun phrase followed by a verb phrase, but I've added some other possible nonterminal symbols, too. I have AP for adjective phrase and PP for prepositional phrase. And we've specified that we could have an adjective phrase before a noun phrase, or a prepositional phrase after a noun, for example. So there are lots of additional ways that we might try to structure a sentence and interpret and parse one of those resulting sentences.

So let's see that one in action. We'll go ahead and run cfg1.py with this new grammar, and we'll try a sentence like "she saw the wide street." Here, Python's NLTK is able to parse that sentence and identify that "she saw the wide street" has this particular structure: a sentence with a noun phrase and a verb phrase, where that verb phrase has a noun phrase that, within it, contains an adjective.
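As a rough sketch, a grammar in the spirit of cfg1.py might be defined and parsed with NLTK something like the following; the exact rules and word lists in the course files may differ.

```python
import nltk

# A grammar in the spirit of cfg1.py: a sentence is a noun phrase
# followed by a verb phrase, with adjective phrases (AP) and
# prepositional phrases (PP) allowed as well. The particular rules
# and vocabulary here are illustrative.
grammar = nltk.CFG.fromstring("""
    S -> NP VP

    AP -> A | A AP
    NP -> N | D NP | AP NP | N PP
    PP -> P NP
    VP -> V | V NP | V NP PP

    A -> "wide" | "blue" | "small"
    D -> "the" | "a" | "an"
    N -> "she" | "city" | "car" | "street" | "dog" | "binoculars"
    P -> "on" | "with" | "before"
    V -> "saw" | "walked"
""")

parser = nltk.ChartParser(grammar)

# Read a sentence from the user and print every parse tree the
# grammar allows for it.
sentence = input("Sentence: ").split()
try:
    for tree in parser.parse(sentence):
        tree.pretty_print()
except ValueError:
    print("No parse tree possible.")
```

With a grammar like this, "she saw the wide street" parses one way, while an ambiguous sentence can produce more than one tree, as we'll see next.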
And so it's able to get some sense for what the structure of this language actually is.

Let's try another example. Let's say "she saw the dog with the binoculars," and we'll try that sentence. Here, we get one possible syntax tree for "she saw the dog with the binoculars." But notice that this sentence is actually a little bit ambiguous in our own natural language. Who has the binoculars? Is it she who has the binoculars, or the dog who has the binoculars? And NLTK is able to identify both possible structures for the sentence. In one parse, "the dog with the binoculars" is an entire noun phrase; it's all underneath one NP, so it's the dog that has the binoculars. But we also get an alternative parse tree, where "the dog" is just the noun phrase, and "with the binoculars" is a prepositional phrase modifying "saw." So she saw the dog, and she used the binoculars in order to see the dog.

So this allows us to get a sense for the structure of natural language, but it relies on us writing all of these rules. And it would take a lot of effort to write all of the rules for any possible sentence that someone might write or say in the English language. Language is complicated, and as a result, there are going to be some very complex rules. So what else might we try?

We might try to take a statistical lens toward approaching this problem of natural language processing. If we were able to give the computer a lot of existing data of sentences written in the English language, what could we try to learn from that data? Well, it might be difficult to try to interpret long pieces of text all at once. So instead, what we might want to do is break up that longer text into smaller pieces of information. In particular, we might try to create n-grams out of a longer sequence of text. An n-gram is just some contiguous sequence of n items from a sample of text. It might be n characters in a row, or n words in a row, for example. So let's take a passage from Sherlock Holmes and look for all of the trigrams.
A trigram is an n-gram where n is equal to three. So in this case, we're looking for sequences of three words in a row. The trigrams here would be phrases like "how often have." That's three words in a row. "Often have I" is another trigram. "Have I said." "I said to." "Said to you." "To you that." These are all trigrams: sequences of three words that appear in sequence. And if we could give the computer a large corpus of text and have it pull out all of the trigrams, it could get a sense for what sequences of three words tend to appear next to each other in our own natural language, and as a result, get some sense for what the structure of the language actually is.

So let's take a look at an example of that. How can we use NLTK to get access to information about n-grams? Here we're going to open up ngrams.py. This is a Python program that's going to load a corpus of data, just some text files, into our computer's memory. Then we're going to use NLTK's ngrams function, which is going to go through the corpus of text, pulling out all of the n-grams for a particular value of n. And then, by using Python's Counter class, we're going to figure out what the most common n-grams inside of this entire corpus of text are. We're going to need a dataset in order to do this, and I've prepared a dataset of some of the stories of Sherlock Holmes. It's just a bunch of text files, a lot of words for it to analyze. And as a result, we'll get a sense for what sequences of two words or three words tend to be most common in natural language.

So let's give this a try. We'll go into my ngrams directory and we'll run ngrams.py. We'll try an n value of two, so we're looking for sequences of two words in a row, and we'll use our corpus of stories from Sherlock Holmes. When we run this program, we get a list of the most common n-grams where n is equal to two, otherwise known as bigrams. The most common one is "of the." That's a sequence of two words that appears quite frequently in natural language.
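Here is a rough sketch of what a program like ngrams.py might look like. The command-line interface and file layout are assumptions, but the core idea is the one just described: tokenize the corpus, feed it to NLTK's ngrams function, and count the results with Python's Counter.

```python
from collections import Counter

import nltk
import os
import sys


def main():
    """Print the most frequent n-grams in a corpus of text files."""
    n = int(sys.argv[1])       # e.g. 2 for bigrams, 3 for trigrams
    corpus_dir = sys.argv[2]   # directory of .txt files, e.g. Sherlock Holmes stories

    # Read every file in the corpus and tokenize it into lowercase words.
    # (May require nltk.download("punkt") the first time word_tokenize is used.)
    words = []
    for filename in os.listdir(corpus_dir):
        with open(os.path.join(corpus_dir, filename)) as f:
            words.extend(
                word.lower()
                for word in nltk.word_tokenize(f.read())
                if any(c.isalpha() for c in word)
            )

    # Count every contiguous sequence of n words and print the most common.
    ngrams = Counter(nltk.ngrams(words, n))
    for ngram, count in ngrams.most_common(10):
        print(f"{count}: {ngram}")


if __name__ == "__main__":
    main()
```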
Then come "in the" and "it was." These are all common sequences of two words that appear in a row. Let's instead now try running ngrams with n equal to three. Let's get all of the trigrams and see what we get. Now we see the most common trigrams are "it was a," "one of the," and "I think that." These are all sequences of three words that appear quite frequently.

And we were able to do this essentially via a process known as tokenization. Tokenization is the process of splitting a sequence of characters into pieces. In this case, we're splitting a long sequence of text into individual words, and then looking at sequences of those words to get a sense for the structure of natural language.

So once we've done the tokenization and built up our corpus of n-grams, what can we do with that information? Well, one thing we might try is to build a Markov chain, which you might recall from when we talked about probability. Recall that a Markov chain is some sequence of values where we can predict one value based on the values that came before it. And as a result, if we know all of the common n-grams in the English language, what words tend to be associated with what other words in sequence, we can use that to predict what word might come next in a sequence of words. And so we could build a Markov chain for language in order to try to generate natural language that follows the same statistical patterns as some input data.

So let's take a look at that and build a Markov chain for natural language. As input, I'm going to use the works of William Shakespeare. Here, I have a file, shakespeare.txt, which is just a bunch of the works of William Shakespeare. It's a long text file, so plenty of data to analyze. And here in generator.py, I'm using a third-party Python library in order to do this analysis. We're going to read in the sample of text, then we're going to train a Markov model based on that text, and then we're going to have the Markov chain generate some sentences; a rough sketch of such a generator appears below.
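The lecture doesn't name the third-party library; one library commonly used for this kind of Markov-chain text generation is markovify, which is what this sketch assumes.

```python
import sys

import markovify

# Read the sample of text (e.g. shakespeare.txt).
with open(sys.argv[1]) as f:
    text = f.read()

# Train a Markov model on the text: the library learns which words
# tend to follow which other words in the input.
model = markovify.Text(text)

# Generate five sentences that follow the same statistical patterns
# as the input. (make_sentence can return None if it fails to build
# a sentence that differs enough from the original text.)
for _ in range(5):
    print(model.make_sentence())
    print()
```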
We're going to generate sentences that don't appear in the original text, but that follow the same statistical patterns. The model generates them based on the n-grams, trying to predict what word is likely to come next, based on those statistical patterns.

So we'll go ahead and go into our markov directory and run this generator with the works of William Shakespeare as input. What we're going to get are five new sentences, where these sentences are not necessarily sentences from the original input text itself, but sentences that follow the same statistical patterns. It's predicting what word is likely to come next, based on the input data that we've seen and the types of words that tend to appear in sequence there. And so we're able to generate these sentences. Of course, so far, there's no guarantee that any of the sentences that are generated actually mean anything or make any sense. They just happen to follow the statistical patterns that our computer is already aware of. So we'll return to this issue of how to generate text in perhaps a more accurate or more meaningful way a little bit later.

Let's now turn our attention to a slightly different problem, and that's the problem of text classification. Text classification is the problem where we have some text, and we want to put that text into some kind of category; we want to apply some sort of label to that text. This kind of problem shows up in a wide variety of places. A common place might be your email inbox, for example. You get an email and you want your computer to be able to identify whether the email belongs in your inbox or whether it should be filtered out into spam. So we need to classify the text: is it a good email or is it spam?

Another common use case is sentiment analysis. We might want to know whether the sentiment of some text is positive or negative. And so how might we do that? This comes up in situations like product reviews, where we might have a bunch of reviews for a product on some website: "My grandson loved it! So much fun."
"Product broke after a few days." "One of the best games I've played in a long time." And "Kind of cheap and flimsy, not worth it." These are some example sentences that you might see on a product review website. And you and I could pretty easily look at this list of product reviews and decide which ones are positive and which ones are negative. We might say the first one and the third one seem like positive sentiment messages, but the second one and the fourth one seem like negative sentiment messages.

But how did we know that? And how could we train a computer to be able to figure that out as well? Well, you might have keyed in on particular words, where those particular words tend to mean something positive or negative. So you might have identified that words like "loved" and "fun" and "best" tend to be associated with positive messages, and words like "broke" and "cheap" and "flimsy" tend to be associated with negative messages. So if only we could train a computer to learn what words tend to be associated with positive versus negative messages, then maybe we could train a computer to do this kind of sentiment analysis as well.

So we're going to try to do just that. We're going to use a model known as the bag-of-words model, which is a model that represents text as just an unordered collection of words. For the purpose of this model, we're not going to worry about the sequence and the ordering of the words, which word came first, second, or third. We're just going to treat the text as a collection of words in no particular order. We're losing information there, of course; the order of words is important, and we'll come back to that a little bit later. But for now, to simplify our model, it'll help us tremendously just to think about text as some unordered collection of words.

And in particular, we're going to use the bag-of-words model to build something known as a Naive Bayes classifier. So what is a Naive Bayes classifier? Well, it's a tool that's going to allow us to classify text based on Bayes' rule.
Again, as you might remember from when we talked about probability, Bayes' rule says that the probability of b given a is equal to the probability of a given b, multiplied by the probability of b, divided by the probability of a.

So how are we going to use this rule to analyze text? Well, what are we interested in? We're interested in the probability that a message has a positive sentiment and the probability that a message has a negative sentiment, which, for simplicity, I'm going to represent here just with these emoji, a happy face and a frowny face, for positive and negative sentiment. And so if I had a review, something like "My grandson loved it," then what I'm interested in is not just the probability that a message has positive sentiment, but the conditional probability that a message has positive sentiment given that this is the message: "My grandson loved it."

But how do I go about calculating this value, the probability that the message is positive given that the review is this sequence of words? Well, here's where the bag-of-words model comes in. Rather than treat this review as a string, a sequence of words in order, we're just going to treat it as an unordered collection of words. We're going to try to calculate the probability that the review is positive, given that all of these words, "my grandson loved it," are in the review in no particular order — just this unordered collection of words. And this is a conditional probability, which we can then apply Bayes' rule to in order to make sense of it.

So according to Bayes' rule, what is this conditional probability equal to? It's equal to the probability that all four of these words are in the review, given that the review is positive, multiplied by the probability that the review is positive, divided by the probability that all of these words happen to be in the review. So this is the value now that we're going to try to calculate.

Now, one thing you might notice is that the denominator here, the probability that all of these words appear in the review, doesn't actually depend on whether we're looking at the positive sentiment or negative sentiment case. So we can actually get rid of this denominator.
We don't need to calculate it. We can just say that this probability is proportional to the numerator. And then at the end, we're going to need to normalize the probability distribution to make sure that all of the values sum up to one.

So now, how do we calculate this value? Well, this is the probability of all of these words given positive, times the probability of positive. And that, by the definition of joint probability, is just one big joint probability: the probability that all of these things are the case, that it's a positive review and that all four of these words are in the review. But still, it's not entirely obvious how we calculate that value. And here is where we need to make one more assumption. This is where the naive part of Naive Bayes comes in. We're going to make the assumption that all of the words are independent of each other. By that, I mean that if the word "grandson" is in the review, that doesn't change the probability that the word "loved" is in the review or that the word "it" is in the review, for example. In practice, this assumption might not be true. It's almost certainly the case that the probabilities of words do depend on each other. But it's going to simplify our analysis and still give us reasonably good results just to assume that the words are independent of each other and that they only depend on whether the review is positive or negative. You might, for example, expect the word "loved" to appear more often in a positive review than in a negative review.

So, what does that mean? Well, if we make this assumption, then we can say that this value, the probability we're interested in, is not directly proportional to, but is naively proportional to, this value: the probability that the review is positive, times the probability that "my" is in the review given that it's positive, times the probability that "grandson" is in the review given that it's positive, and so on for the other two words that happen to be in this review. And now this value, which looks a little more complex, is actually a value that we can calculate pretty easily.
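Putting that derivation in one place, with the four review words written out, the chain of reasoning looks roughly like this:

\[
P(\text{positive} \mid \text{"my"}, \text{"grandson"}, \text{"loved"}, \text{"it"})
= \frac{P(\text{"my"}, \text{"grandson"}, \text{"loved"}, \text{"it"} \mid \text{positive}) \; P(\text{positive})}{P(\text{"my"}, \text{"grandson"}, \text{"loved"}, \text{"it"})}
\]
\[
\propto \; P(\text{positive}) \cdot P(\text{"my"} \mid \text{positive}) \cdot P(\text{"grandson"} \mid \text{positive}) \cdot P(\text{"loved"} \mid \text{positive}) \cdot P(\text{"it"} \mid \text{positive})
\]

The first step is Bayes' rule; the proportionality comes from dropping the denominator, which is the same for both labels; and the product over the individual words is the naive independence assumption.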
So how are we going to estimate the probability that the review is positive? Well, if we have some training data, some example reviews where each one has already been labeled as positive or negative, then we can estimate the probability that a review is positive just by counting the number of positive samples and dividing by the total number of samples in our training data. And for the conditional probabilities, the probability of "loved" given that the review is positive is going to be the number of positive samples with "loved" in them divided by the total number of positive samples.

So let's take a look at an actual example to see how we could try to calculate these values. Here, I've put together some sample data. The way to interpret the sample data is that, based on the training data, 49% of the reviews are positive and 51% are negative. And then over here in this table, we have some conditional probabilities. If the review is positive, then there's a 30% chance that "my" appears in it, and if the review is negative, there's a 20% chance that "my" appears in it. And based on our training data, among the positive reviews, 1% of them contain the word "grandson," and among the negative reviews, 2% contain the word "grandson."

So, using this data, let's try to calculate the value we're interested in. To do that, we'll need to multiply all of these values together: the probability of positive, and then all of these positive conditional probabilities. When we do that, we get some value. Then we can do the same thing for the negative case: take the probability that it's negative, multiply it by all of these conditional probabilities, and we're going to get some other value. Now, these values don't sum to one; they're not a probability distribution yet. But I can normalize them, and that tells me that we're going to predict that "My grandson loved it" is a positive sentiment review with probability 0.68, and a negative sentiment review with probability 0.32.
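As a rough check on that arithmetic: the lecture only reads out the priors (0.49 and 0.51) and the conditional probabilities for "my" and "grandson," so the values for "loved" and "it" in the sketch below are assumptions, chosen only so the numbers land near the 0.68 / 0.32 result described above.

```python
# Prior probabilities from the training data (given in the lecture).
priors = {"positive": 0.49, "negative": 0.51}

# Conditional probabilities P(word | sentiment).
# "my" and "grandson" come from the lecture; "loved" and "it" are
# assumed values, included only for illustration.
conditionals = {
    "my":       {"positive": 0.30, "negative": 0.20},
    "grandson": {"positive": 0.01, "negative": 0.02},
    "loved":    {"positive": 0.32, "negative": 0.08},  # assumed
    "it":       {"positive": 0.30, "negative": 0.40},  # assumed
}

# Multiply the prior by each word's conditional probability
# (the naive independence step).
scores = {}
for sentiment, prior in priors.items():
    score = prior
    for word in ["my", "grandson", "loved", "it"]:
        score *= conditionals[word][sentiment]
    scores[sentiment] = score

# Normalize so the two values sum to 1.
total = sum(scores.values())
for sentiment, score in scores.items():
    print(sentiment, round(score / total, 2))
# With these numbers: positive 0.68, negative 0.32
```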
So, what problems might we run into here? What could potentially go wrong when doing this kind of analysis to decide whether text has a positive or negative sentiment? Well, a couple of problems might arise. One problem might be: what if the word "grandson" never appears in any of the positive reviews? If that were the case, then when we try to calculate the probability that the review is positive, we're going to multiply all these values together and we're just going to get 0 for the positive case, because we're going to ultimately multiply by that 0 value. And so we're going to say that we think there is no chance that the review is positive, because it contains the word "grandson," and in our training data, we've never seen the word "grandson" appear in a positive sentiment message before. And that's probably not the right analysis, because with rare words, it might just be that nowhere in our training data did we ever see the word "grandson" appear in a message that has positive sentiment.

So, what can we do to solve this problem? Well, one thing we'll often do is some kind of additive smoothing, where we add some value alpha to each value in our distribution just to smooth out the data a little bit. A common form of this is Laplace smoothing, where we add 1 to each value in our distribution. In essence, we pretend we've seen each value one more time than we actually have. If we've never seen the word "grandson" in a positive review, we pretend we've seen it once. If we've seen it once, we pretend we've seen it twice, just to avoid the possibility that we might multiply by 0 and, as a result, get some results we don't want in our analysis.

So let's see what this looks like in practice. Let's try to do some Naive Bayes classification in order to classify text as either positive or negative. We'll take a look at sentiment.py. What this is going to do is load some sample data into memory, some examples of positive reviews and negative reviews, and then we're going to train a Naive Bayes classifier on all of this training data; a rough sketch of such a program appears below.
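This sketch shows one way such a classifier might be put together with NLTK's built-in NaiveBayesClassifier; the corpus file names and the exact feature scheme are assumptions here, not necessarily what the course's sentiment.py does.

```python
import os
import sys

import nltk


def extract_words(document):
    # Tokenize and lowercase, keeping only alphabetic tokens.
    # (May require nltk.download("punkt") the first time.)
    return set(
        word.lower() for word in nltk.word_tokenize(document)
        if any(c.isalpha() for c in word)
    )


def load_data(directory):
    # Assumes a positives.txt and a negatives.txt, one review per line.
    result = []
    for filename in ["positives.txt", "negatives.txt"]:
        with open(os.path.join(directory, filename)) as f:
            result.append([line.strip() for line in f if line.strip()])
    return result  # (positive reviews, negative reviews)


def main():
    positives, negatives = load_data(sys.argv[1])

    # Build labeled feature sets: each feature marks that a word is present.
    training = []
    for document in positives:
        training.append(({word: True for word in extract_words(document)}, "Positive"))
    for document in negatives:
        training.append(({word: True for word in extract_words(document)}, "Negative"))

    classifier = nltk.NaiveBayesClassifier.train(training)

    # Classify a new review typed by the user and print the probabilities.
    review = input("s: ")
    features = {word: True for word in extract_words(review)}
    result = classifier.prob_classify(features)
    for label in result.samples():
        print(f"{label}: {result.prob(label):.4f}")


if __name__ == "__main__":
    main()
```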
The training data includes all of the words we see in positive reviews and all of the words we see in negative reviews. And then we're going to try to classify some input. We're going to do this based on a corpus of data. I have some example positive reviews, like "It was great! So much fun," and then some negative reviews: "Not worth it." "Kind of cheap." Those are some examples of negative reviews.

So now, let's try to run this classifier and see how it would classify particular text as either positive or negative. We'll go ahead and run our sentiment analysis on this corpus, and we need to provide it with a review. So I'll say something like "I enjoyed it." And we see that the classifier says there's about a 0.92 probability that we think this particular review is positive. Let's try something negative: we'll try "kind of overpriced." And we see that there is a 0.96 probability now that we think this particular review is negative. And so our Naive Bayes classifier has learned what kinds of words tend to appear in positive reviews and what kinds of words tend to appear in negative reviews. As a result of that, we've been able to design a classifier that can predict whether a particular review is positive or negative.

So this definitely is a useful tool that we can use to try to make some predictions. But we had to make some assumptions in order to get there. So what if we now want to try to build some more sophisticated models, use some tools from machine learning to take better advantage of language data, to be able to draw more accurate conclusions and solve new kinds of tasks and new kinds of problems? Well, we've seen a couple of times now that when we want to take some data, take some input, and put it in a way that the computer is going to be able to make sense of, it can be helpful to take that data and turn it into numbers.
And so what we might want to try to do is come up with some word representation, some way to take a word and translate its meaning into numbers. Because, for example, if we wanted to use a neural network to process language, to give our language to a neural network and have it make some predictions or perform some analysis, a neural network takes as input, and produces as output, a vector of values, a vector of numbers. And so what we might want to do is take our data, take words, and convert them into some kind of numeric representation.

So how might we do that? How might we take words and turn them into numbers? Let's take a look at an example. Here's a sentence: "He wrote a book." And let's say I wanted to take each of those words and turn it into a vector of values. Here's one way I might do that. We'll say "he" is going to be a vector that has a 1 in the first position, and the rest of the values are 0. "Wrote" will have a 1 in the second position, and the rest of the values are 0. "A" has a 1 in the third position, with the rest of the values 0. And "book" has a 1 in the fourth position, with the rest of the values 0. So each of these words now has a distinct vector representation. And this is what we often call a one-hot representation: a representation of the meaning of a word as a vector with a single 1, where all of the rest of the values are 0. And so when doing this, we now have a numeric representation for every word, and we could pass those vector representations into a neural network or other models that require some kind of numeric data as input.

But this one-hot representation actually has a couple of problems, and it's not ideal for a few reasons. One reason is that here, we're just looking at four words. But if you imagine a vocabulary of thousands of words or more, these vectors are going to get quite long in order to have a distinct vector for every possible word in our vocabulary. And as a result, these longer vectors are going to be more difficult to deal with, more difficult to train, and so forth. So that might be a problem.
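Before moving on to the second problem, here's a tiny sketch of what building these one-hot vectors might look like in code, using just the four words from the example sentence as the vocabulary:

```python
# Build a one-hot vector for each word in a small vocabulary.
vocabulary = ["he", "wrote", "a", "book"]

one_hot = {
    word: [1 if i == index else 0 for i in range(len(vocabulary))]
    for index, word in enumerate(vocabulary)
}

print(one_hot["he"])    # [1, 0, 0, 0]
print(one_hot["book"])  # [0, 0, 0, 1]
```

With thousands of words in the vocabulary, each of these vectors would have thousands of entries, nearly all of them zero.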
692 00:33:21,000 --> 00:33:23,550 Another problem is a little bit more subtle. 693 00:33:23,550 --> 00:33:26,460 If we want to represent a word as a vector, 694 00:33:26,460 --> 00:33:30,270 and in particular, the meaning of a word as a vector, then ideally, 695 00:33:30,270 --> 00:33:33,300 it should be the case that words that have similar meanings 696 00:33:33,300 --> 00:33:36,360 should also have similar vector representations, 697 00:33:36,360 --> 00:33:40,320 so that they're close together inside a vector space. 698 00:33:40,320 --> 00:33:42,180 But that's not really going to be the case 699 00:33:42,180 --> 00:33:45,990 with these one-hot representations, because if we take some similar words, 700 00:33:45,990 --> 00:33:49,590 say the word wrote and the word authored, which mean similar things, 701 00:33:49,590 --> 00:33:53,400 they have entirely different vector representations. 702 00:33:53,400 --> 00:33:54,870 Likewise book and novel. 703 00:33:54,870 --> 00:33:57,270 Those two words mean somewhat similar things, 704 00:33:57,270 --> 00:34:00,390 but they have entirely different vector representations, 705 00:34:00,390 --> 00:34:03,420 because they each have a 1 in some different position. 706 00:34:03,420 --> 00:34:05,340 And so that's not ideal either. 707 00:34:05,340 --> 00:34:07,440 So what we might be interested in instead, 708 00:34:07,440 --> 00:34:10,110 is some kind of distributed representation. 709 00:34:10,110 --> 00:34:12,900 A distributed representation is the representation 710 00:34:12,900 --> 00:34:16,710 of the meaning of a word distributed across multiple values, 711 00:34:16,710 --> 00:34:20,130 instead of just being one-hot with a 1 in one position. 712 00:34:20,130 --> 00:34:24,540 Here is what a distributed representation of words might look like. 713 00:34:24,540 --> 00:34:27,840 Each word is associated with some vector of values, 714 00:34:27,840 --> 00:34:30,570 with the meaning distributed across multiple values, 715 00:34:30,570 --> 00:34:35,250 ideally in such a way that similar words have a similar vector 716 00:34:35,250 --> 00:34:36,480 representation. 717 00:34:36,480 --> 00:34:38,639 But how are we going to come up with those values? 718 00:34:38,639 --> 00:34:40,110 Where do those values come from? 719 00:34:40,110 --> 00:34:43,590 How can we define the meaning of a word in this distributed 720 00:34:43,590 --> 00:34:45,210 sequence of numbers? 721 00:34:45,210 --> 00:34:47,909 Well, to do that, we're going to draw inspiration from a quote 722 00:34:47,909 --> 00:34:50,400 from British linguist JR Firth, who said, 723 00:34:50,400 --> 00:34:53,580 "You shall know a word by the company it keeps." 724 00:34:53,580 --> 00:34:56,370 In other words, we're going to define the meaning of a word 725 00:34:56,370 --> 00:35:00,600 based on the words that appear around it, the context words around it. 726 00:35:00,600 --> 00:35:02,460 Take, for example, this context: 727 00:35:02,460 --> 00:35:04,560 "For blank he ate." 728 00:35:04,560 --> 00:35:08,130 You might wonder what words could reasonably fill in that blank. 729 00:35:08,130 --> 00:35:11,520 Well, it might be words like breakfast, or lunch, or dinner. 730 00:35:11,520 --> 00:35:13,920 All of those could reasonably fill in that blank. 731 00:35:13,920 --> 00:35:16,160 And so what we're going to say is that because 732 00:35:16,160 --> 00:35:20,240 the words breakfast and lunch and dinner appear in similar contexts, 733 00:35:20,240 --> 00:35:22,580 they must have similar meanings.
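To make the "company it keeps" idea concrete, here is a toy sketch that simply counts which words appear near each other in a handful of made-up sentences. Real models like word2vec learn dense vectors rather than keeping raw counts, but the intuition is the same: breakfast, lunch, and dinner end up with nearly identical contexts.

# A toy illustration of "you shall know a word by the company it keeps":
# count the words that appear near each word in a small made-up corpus.

from collections import Counter, defaultdict

corpus = [
    "for breakfast he ate eggs",
    "for lunch he ate a sandwich",
    "for dinner he ate pasta",
    "he wrote a book",
    "she wrote a novel",
]

window = 2  # how many words on each side count as "context"
contexts = defaultdict(Counter)

for sentence in corpus:
    tokens = sentence.split()
    for i, word in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if i != j:
                contexts[word][tokens[j]] += 1

# breakfast, lunch, and dinner end up with very similar context counts,
# which is exactly why we would want them to have similar vectors.
print(contexts["breakfast"])
print(contexts["lunch"])
print(contexts["book"])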
734 00:35:22,580 --> 00:35:25,880 And that's something our computer could understand and try to learn. 735 00:35:25,880 --> 00:35:28,310 A computer could look at a big corpus of text, 736 00:35:28,310 --> 00:35:31,730 look at what words tend to appear in similar contexts to each other, 737 00:35:31,730 --> 00:35:35,270 and use that to identify which words have a similar meaning. 738 00:35:35,270 --> 00:35:39,440 And should therefore, appear close to each other inside a vector space. 739 00:35:39,440 --> 00:35:43,640 And so one common model for doing this is known as the word2vec model. 740 00:35:43,640 --> 00:35:47,060 It's a model for generating word vectors, a vector representation 741 00:35:47,060 --> 00:35:49,700 for every word by looking at data and looking 742 00:35:49,700 --> 00:35:52,250 at the context in which a word appears. 743 00:35:52,250 --> 00:35:53,690 The idea is going to be this. 744 00:35:53,690 --> 00:35:58,070 If you start out with all of the words just in some random position in space 745 00:35:58,070 --> 00:36:02,000 and train it on some training data, what the word2vec model will do, 746 00:36:02,000 --> 00:36:05,240 is start to learn what words appear in similar contexts. 747 00:36:05,240 --> 00:36:08,180 And it will move these vectors around in such a way 748 00:36:08,180 --> 00:36:10,550 that hopefully, words with similar meanings, 749 00:36:10,550 --> 00:36:12,800 breakfast, lunch, and dinner, book, memoir, 750 00:36:12,800 --> 00:36:18,300 novel, will hopefully appear to be near to each other as vectors, as well. 751 00:36:18,300 --> 00:36:22,110 So, let's now take a look at what word2vec might look like in practice 752 00:36:22,110 --> 00:36:24,300 when implemented in code. 753 00:36:24,300 --> 00:36:29,010 What I have here inside of words.txt is a pre-trained model 754 00:36:29,010 --> 00:36:32,700 where each of these words has some vector representation trained 755 00:36:32,700 --> 00:36:33,480 by word2vec. 756 00:36:33,480 --> 00:36:38,070 Each of these words has some sequence of values representing its meaning, 757 00:36:38,070 --> 00:36:40,860 hopefully in such a way, that similar words are 758 00:36:40,860 --> 00:36:43,110 represented by similar vectors. 759 00:36:43,110 --> 00:36:46,890 I also have this file, vectors.py, which is going to open up the words 760 00:36:46,890 --> 00:36:48,300 and form them into a dictionary. 761 00:36:48,300 --> 00:36:50,970 And we also define some useful functions, like distance, 762 00:36:50,970 --> 00:36:53,460 to get the distance between two word vectors. 763 00:36:53,460 --> 00:36:56,730 And closest words define which words are nearby 764 00:36:56,730 --> 00:36:59,490 in terms of having close vectors to each other. 765 00:36:59,490 --> 00:37:01,650 And so let's give this a try. 766 00:37:01,650 --> 00:37:05,010 We'll go ahead and open a python interpreter. 767 00:37:05,010 --> 00:37:09,510 And I'm going to import these vectors. 768 00:37:09,510 --> 00:37:14,970 And we might say, all right, what is the vector representation of the word book. 769 00:37:14,970 --> 00:37:18,480 And we get this big long vector that represents the word 770 00:37:18,480 --> 00:37:20,400 book as a sequence of values. 771 00:37:20,400 --> 00:37:23,670 And this sequence of values by itself is not all that meaningful. 772 00:37:23,670 --> 00:37:26,850 But it is meaningful in the context of comparing it 773 00:37:26,850 --> 00:37:29,610 to other vectors for other words. 
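The exact contents of the course's vectors.py aren't shown here, but a minimal version of those helpers might look something like the sketch below, assuming words.txt stores one word per line followed by its vector values, and using cosine distance for the comparisons. The real file may differ in its details.

# A possible sketch of the helpers described here: loading word vectors,
# a distance function, and a closest_words function.

import math


def load_words(filename):
    """Load lines of the form 'word v1 v2 ... vn' into a dictionary.
    (The exact format of the course's words.txt is an assumption here.)"""
    vectors = {}
    with open(filename) as f:
        for line in f:
            word, *values = line.split()
            vectors[word] = [float(v) for v in values]
    return vectors


def distance(v1, v2):
    """Cosine distance: 0 means the vectors point the same way;
    larger values mean the words are less similar."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    return 1 - dot / (norm1 * norm2)


def closest_words(words, vector, n=10):
    """Return the n words whose vectors are closest to the given vector."""
    return sorted(words, key=lambda w: distance(vector, words[w]))[:n]


if __name__ == "__main__":
    words = load_words("words.txt")
    print(distance(words["book"], words["novel"]))
    print(closest_words(words, words["book"], 10))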
774 00:37:29,610 --> 00:37:31,800 So we could use this distance function, which 775 00:37:31,800 --> 00:37:35,065 is going to get us the distance between two word vectors. 776 00:37:35,065 --> 00:37:37,440 And we might say, what is the distance between the vector 777 00:37:37,440 --> 00:37:42,360 representation for the word book and the vector representation for the word 778 00:37:42,360 --> 00:37:43,590 novel. 779 00:37:43,590 --> 00:37:45,840 And we see that it's 0.34. 780 00:37:45,840 --> 00:37:48,480 You can kind of interpret 0 as being really close together, 781 00:37:48,480 --> 00:37:50,310 and 1 being very far apart. 782 00:37:50,310 --> 00:37:55,140 And so now, what is the distance between book and let's say, breakfast? 783 00:37:55,140 --> 00:37:58,110 Well, book and breakfast are more different from each other 784 00:37:58,110 --> 00:38:00,090 than book and novel are, so I would hopefully, 785 00:38:00,090 --> 00:38:01,890 expect the distance to be larger. 786 00:38:01,890 --> 00:38:03,060 And in fact, it is. 787 00:38:03,060 --> 00:38:05,040 0.64 approximately. 788 00:38:05,040 --> 00:38:07,650 These two words are further away from each other. 789 00:38:07,650 --> 00:38:12,960 And what about now, the distance between let's say, lunch and breakfast? 790 00:38:12,960 --> 00:38:14,580 Well, that's about 0.2. 791 00:38:14,580 --> 00:38:15,960 Those are even closer together. 792 00:38:15,960 --> 00:38:19,190 They have a meaning that is closer to each other. 793 00:38:19,190 --> 00:38:23,660 Another interesting thing we might do is calculate the closest words. 794 00:38:23,660 --> 00:38:29,030 We might say, what are the closest words according to word2vec to the word book, 795 00:38:29,030 --> 00:38:31,550 and let's say, let's get the 10 closest words. 796 00:38:31,550 --> 00:38:35,960 What are the 10 closest vectors to the vector representation for the word 797 00:38:35,960 --> 00:38:36,830 book? 798 00:38:36,830 --> 00:38:40,220 And when we perform that analysis, we get this list of words. 799 00:38:40,220 --> 00:38:42,260 The closest one is book itself. 800 00:38:42,260 --> 00:38:46,640 But we also have books plural, and then essay, memoir, essays, novella, 801 00:38:46,640 --> 00:38:48,050 anthology, and so on. 802 00:38:48,050 --> 00:38:52,040 All of these words mean something similar to the word book, according 803 00:38:52,040 --> 00:38:55,970 to word2vec, at least, because they have a similar vector representation. 804 00:38:55,970 --> 00:38:58,250 So it seems like we've done a pretty good job 805 00:38:58,250 --> 00:39:03,200 of trying to capture this kind of vector representation of word meaning. 806 00:39:03,200 --> 00:39:05,990 One other interesting side effect of word2vec 807 00:39:05,990 --> 00:39:08,150 is that it's also able to capture something 808 00:39:08,150 --> 00:39:11,240 about the relationships between words, as well. 809 00:39:11,240 --> 00:39:12,770 Let's take a look at an example. 810 00:39:12,770 --> 00:39:16,130 Here, for instance, are two words, man and king. 811 00:39:16,130 --> 00:39:19,740 And these are each represented by word2vec as vectors. 812 00:39:19,740 --> 00:39:24,750 So what might happen if I subtracted one from the other, calculated the value 813 00:39:24,750 --> 00:39:26,700 king minus man? 
814 00:39:26,700 --> 00:39:30,600 Well, that will be the vector that will take us from man to king, 815 00:39:30,600 --> 00:39:33,840 somehow represent this relationship between the vector 816 00:39:33,840 --> 00:39:38,310 representation of the word man, and the vector representation of the word king. 817 00:39:38,310 --> 00:39:41,820 And that's what this value, king minus man, represents. 818 00:39:41,820 --> 00:39:46,260 So what would happen if I took the vector representation of the word woman 819 00:39:46,260 --> 00:39:50,550 and added that same value, king minus man, to it? 820 00:39:50,550 --> 00:39:54,300 What would we get as the closest word to that, for example? 821 00:39:54,300 --> 00:39:55,230 Well, we could try it. 822 00:39:55,230 --> 00:39:59,280 Let's go ahead and go back to our python interpreter and give this a try. 823 00:39:59,280 --> 00:40:03,690 I could say, what is the closest word to the vector representation of the word 824 00:40:03,690 --> 00:40:06,810 king minus the representation of the word man, 825 00:40:06,810 --> 00:40:10,710 plus the representation of the word woman? 826 00:40:10,710 --> 00:40:13,740 And we see that the closest word is the word queen. 827 00:40:13,740 --> 00:40:17,040 We've somehow been able to capture the relationship between king and man, 828 00:40:17,040 --> 00:40:23,310 and then we apply it to the word woman, we get as the result, the word queen. 829 00:40:23,310 --> 00:40:27,180 So word2vec has been able to capture not just the words and how they're 830 00:40:27,180 --> 00:40:30,300 similar to each other, but also something about the relationships 831 00:40:30,300 --> 00:40:33,840 between words and how those words are connected to each other. 832 00:40:33,840 --> 00:40:36,720 So now that we have this vector representation of words, 833 00:40:36,720 --> 00:40:37,920 what can we now do with it? 834 00:40:37,920 --> 00:40:40,470 Now we can represent words as numbers, and so we 835 00:40:40,470 --> 00:40:44,400 might try to pass those words as input to say, a neural network. 836 00:40:44,400 --> 00:40:46,650 Neural networks we've seen are very powerful tools 837 00:40:46,650 --> 00:40:50,010 for identifying patterns and making predictions. 838 00:40:50,010 --> 00:40:53,070 Recall that a neural network you can think of as all of these units. 839 00:40:53,070 --> 00:40:56,460 But really what the neural network is doing, is taking some input, 840 00:40:56,460 --> 00:40:59,610 passing it into the network, and then producing some output. 841 00:40:59,610 --> 00:41:02,130 And by providing the neural network with training data, 842 00:41:02,130 --> 00:41:04,950 we're able to update the weights inside of the network, 843 00:41:04,950 --> 00:41:08,610 so that the neural network can do a more accurate job of translating 844 00:41:08,610 --> 00:41:10,950 those inputs into those outputs. 845 00:41:10,950 --> 00:41:13,890 And now that we can represent words as numbers 846 00:41:13,890 --> 00:41:15,910 that could be the input or output, you could 847 00:41:15,910 --> 00:41:19,330 imagine passing a word in as input to a neural network 848 00:41:19,330 --> 00:41:20,980 and getting a word as output. 849 00:41:20,980 --> 00:41:22,750 And so when might that be useful? 850 00:41:22,750 --> 00:41:26,230 One common use for neural networks is in machine translation. 851 00:41:26,230 --> 00:41:29,440 When we want to translate text from one language into another. 
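Before turning to neural networks, here is one way to try these same comparisons, and the king minus man plus woman analogy, with a widely available pre-trained model. This sketch assumes the gensim library and its downloadable GloVe vectors, a related technique that also produces distributed word vectors, so the exact numbers and neighbors will differ from the ones shown in the demo.

# A hedged sketch using the gensim library and one of its downloadable
# pre-trained models (GloVe rather than word2vec, but the same kind of
# distributed word vectors).

import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-100")  # downloads on first use

# Distance-style comparisons (gensim's distance is 1 minus cosine similarity).
print(vectors.distance("book", "novel"))
print(vectors.distance("book", "breakfast"))
print(vectors.distance("lunch", "breakfast"))

# Nearest neighbors of "book".
print(vectors.most_similar("book", topn=10))

# The king - man + woman analogy.
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))

With word vectors like these in hand, we can get back to the question of feeding language into a neural network.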
852 00:41:29,440 --> 00:41:33,820 Say, translate English into French, by passing English into the neural network 853 00:41:33,820 --> 00:41:35,530 and getting some French output. 854 00:41:35,530 --> 00:41:39,130 You might imagine, for instance, that we could take the English word for lamp, 855 00:41:39,130 --> 00:41:43,090 pass it into the neural network, get the French word for lamp as output. 856 00:41:43,090 --> 00:41:47,350 But in practice, when we're translating text from one language to another, 857 00:41:47,350 --> 00:41:51,370 we're usually not just interested in translating a single word from one 858 00:41:51,370 --> 00:41:53,320 language to another, but a sequence. 859 00:41:53,320 --> 00:41:55,660 Say, a sentence or a paragraph of words. 860 00:41:55,660 --> 00:41:57,730 Here, for example, is another paragraph, again 861 00:41:57,730 --> 00:42:01,390 taken from Sherlock Holmes written in English, and what I might want to do, 862 00:42:01,390 --> 00:42:04,960 is take that entire sentence, pass it into the neural network, 863 00:42:04,960 --> 00:42:09,430 and get as output, a French translation of the same sentence. 864 00:42:09,430 --> 00:42:12,070 But recall that a neural network's input and output 865 00:42:12,070 --> 00:42:14,260 need to be of some fixed size. 866 00:42:14,260 --> 00:42:16,660 And a sentence is not a fixed size, it's variable in length. 867 00:42:16,660 --> 00:42:19,990 You might have shorter sentences and you might have longer sentences. 868 00:42:19,990 --> 00:42:23,020 So somehow, we need to solve the problem of translating 869 00:42:23,020 --> 00:42:27,100 a sequence into another sequence by means of a neural network. 870 00:42:27,100 --> 00:42:29,980 And that's going to be true not only for machine translation, 871 00:42:29,980 --> 00:42:33,340 but also for other problems, problems like question answering. 872 00:42:33,340 --> 00:42:36,280 If I want to pass as input a question, something like, 873 00:42:36,280 --> 00:42:38,740 what is the capital of Massachusetts, feed 874 00:42:38,740 --> 00:42:41,080 that as input into the neural network, I would 875 00:42:41,080 --> 00:42:45,250 hope that what I would get as output is a sentence like, the capital is Boston. 876 00:42:45,250 --> 00:42:49,330 Again, translating some sequence into some other sequence. 877 00:42:49,330 --> 00:42:52,870 And if you've ever had a conversation with an AI chatbot 878 00:42:52,870 --> 00:42:55,420 or have ever asked your phone a question, 879 00:42:55,420 --> 00:42:56,920 it needs to do something like this. 880 00:42:56,920 --> 00:43:00,220 It needs to understand the sequence of words that you, the human, 881 00:43:00,220 --> 00:43:02,890 provided as input, and then the computer needs 882 00:43:02,890 --> 00:43:05,470 to generate some sequence of words as output. 883 00:43:05,470 --> 00:43:06,880 So how can we do this? 884 00:43:06,880 --> 00:43:10,180 Well, one tool that we can use, is the recurrent neural network, 885 00:43:10,180 --> 00:43:13,360 which we took a look at last time, which is a way for us to provide 886 00:43:13,360 --> 00:43:16,150 a sequence of values to a neural network by running 887 00:43:16,150 --> 00:43:18,010 the neural network multiple times. 888 00:43:18,010 --> 00:43:21,490 And each time we run the neural network, what we're going to do, 889 00:43:21,490 --> 00:43:24,400 is we're going to keep track of some hidden state.
890 00:43:24,400 --> 00:43:26,290 And that hidden state is going to be passed 891 00:43:26,290 --> 00:43:29,650 from one run of the neural network to the next run of the neural network, 892 00:43:29,650 --> 00:43:32,530 keeping track of all of the relevant information. 893 00:43:32,530 --> 00:43:35,920 And so let's take a look at how we could apply that to something like this. 894 00:43:35,920 --> 00:43:39,850 And in particular, we're going to look at an architecture known as an encoder 895 00:43:39,850 --> 00:43:42,610 decoder architecture, where we're going to encode 896 00:43:42,610 --> 00:43:45,610 this question into some kind of hidden state, 897 00:43:45,610 --> 00:43:49,720 and then use a decoder to decode that hidden state into the output 898 00:43:49,720 --> 00:43:51,323 that we're interested in. 899 00:43:51,323 --> 00:43:52,740 So what's that going to look like? 900 00:43:52,740 --> 00:43:54,990 We'll start with the first word, the word what. 901 00:43:54,990 --> 00:43:57,030 That goes into our neural network. 902 00:43:57,030 --> 00:44:00,000 And it's going to produce some hidden state. 903 00:44:00,000 --> 00:44:04,200 This is some information about the word what that our neural network is 904 00:44:04,200 --> 00:44:05,940 going to need to keep track of. 905 00:44:05,940 --> 00:44:08,700 Then when the second word comes along, we're 906 00:44:08,700 --> 00:44:11,640 going to feed it into that same encoder neural network, 907 00:44:11,640 --> 00:44:15,300 but it's going to get as input that hidden state, as well. 908 00:44:15,300 --> 00:44:17,610 So we pass in the second word, we also get 909 00:44:17,610 --> 00:44:19,860 the information about the hidden state, and that's 910 00:44:19,860 --> 00:44:22,710 going to continue for the other words in the input. 911 00:44:22,710 --> 00:44:24,870 This is going to produce a new hidden state. 912 00:44:24,870 --> 00:44:29,610 And so then when we get to the third word, the, that goes into the encoder, 913 00:44:29,610 --> 00:44:31,740 it also gets access to the hidden state. 914 00:44:31,740 --> 00:44:35,010 And then it produces a new hidden state that gets passed in to the next run 915 00:44:35,010 --> 00:44:36,450 when we use the word capital. 916 00:44:36,450 --> 00:44:39,300 And the same thing is going to repeat for the other words that 917 00:44:39,300 --> 00:44:40,920 appear in the input. 918 00:44:40,920 --> 00:44:46,560 So of, Massachusetts, that produces one final piece of hidden state. 919 00:44:46,560 --> 00:44:49,470 Now somehow, we need to signal the fact that we're done. 920 00:44:49,470 --> 00:44:51,040 There's nothing left in the input. 921 00:44:51,040 --> 00:44:54,010 And we typically do this by passing some kind of special token, 922 00:44:54,010 --> 00:44:56,770 say an end token, into the neural network. 923 00:44:56,770 --> 00:44:59,830 And now the decoding process is going to start. 924 00:44:59,830 --> 00:45:02,620 We're going to generate the word, the. 925 00:45:02,620 --> 00:45:05,410 But in addition to generating the word, the, 926 00:45:05,410 --> 00:45:10,420 this decoder network is also going to generate some kind of hidden state. 927 00:45:10,420 --> 00:45:12,490 And so what happens the next time? 928 00:45:12,490 --> 00:45:14,770 Well, to generate the next word, it might 929 00:45:14,770 --> 00:45:17,830 be helpful to know what the first word was. 930 00:45:17,830 --> 00:45:22,210 So we might pass the first word, the, back into the decoder network. 
931 00:45:22,210 --> 00:45:24,280 It's going to get as input this hidden state, 932 00:45:24,280 --> 00:45:26,860 and it's going to generate the next word, capital. 933 00:45:26,860 --> 00:45:29,380 And that's also going to generate some hidden state. 934 00:45:29,380 --> 00:45:31,810 And we'll repeat that, passing capital into the network 935 00:45:31,810 --> 00:45:35,230 to generate the third word, is, and then one more time, in order 936 00:45:35,230 --> 00:45:37,330 to get the fourth word, Boston. 937 00:45:37,330 --> 00:45:38,800 And at that point, we're done. 938 00:45:38,800 --> 00:45:40,210 But how do we know we're done? 939 00:45:40,210 --> 00:45:42,250 Usually we'll do this one more time. 940 00:45:42,250 --> 00:45:46,030 Pass Boston into the decoder network and get as output 941 00:45:46,030 --> 00:45:49,990 some end token to indicate that that is the end of our output. 942 00:45:49,990 --> 00:45:53,080 And so this then is how we could use a recurrent neural network 943 00:45:53,080 --> 00:45:56,500 to take some input, encode it into some hidden state, 944 00:45:56,500 --> 00:46:00,580 and then use that hidden state to decode it into the output we're interested in. 945 00:46:00,580 --> 00:46:04,120 To visualize it in a slightly different way, we have some input sequence. 946 00:46:04,120 --> 00:46:06,070 This is just some sequence of words. 947 00:46:06,070 --> 00:46:09,820 That input sequence goes into the encoder, which in this case, 948 00:46:09,820 --> 00:46:13,570 is a recurrent neural network generating these hidden states along the way, 949 00:46:13,570 --> 00:46:16,900 until we generate some final hidden state, at which point, 950 00:46:16,900 --> 00:46:18,580 we start the decoding process. 951 00:46:18,580 --> 00:46:20,530 Again, using a recurrent neural network. 952 00:46:20,530 --> 00:46:23,290 That's going to generate the output sequence, as well. 953 00:46:23,290 --> 00:46:25,480 So we've got the encoder, which is encoding 954 00:46:25,480 --> 00:46:28,870 the information about the input sequence into this hidden state. 955 00:46:28,870 --> 00:46:31,750 And then the decoder, which takes that hidden state 956 00:46:31,750 --> 00:46:35,620 and uses it in order to generate the output sequence. 957 00:46:35,620 --> 00:46:37,150 But there are some problems. 958 00:46:37,150 --> 00:46:39,370 And for many years, this was the state of the art. 959 00:46:39,370 --> 00:46:41,830 The recurrent neural network and variants on this approach 960 00:46:41,830 --> 00:46:44,890 were some of the best ways we knew in order to perform tasks 961 00:46:44,890 --> 00:46:46,182 in natural language processing. 962 00:46:46,182 --> 00:46:48,973 But there are some problems that we might want to try to deal with, 963 00:46:48,973 --> 00:46:50,890 and that have been dealt with over the years 964 00:46:50,890 --> 00:46:53,770 to try and improve upon this kind of model. 965 00:46:53,770 --> 00:46:57,610 And one problem you might notice happens in this encoder stage. 966 00:46:57,610 --> 00:47:00,430 We've taken this input sequence, the sequence of words, 967 00:47:00,430 --> 00:47:04,780 and encoded it all into this final piece of hidden state. 968 00:47:04,780 --> 00:47:09,010 And that final piece of hidden state needs to contain all of the information 969 00:47:09,010 --> 00:47:14,050 from the input sequence that we need in order to generate the output sequence.
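Here is a heavily simplified sketch of that encoder-decoder idea using PyTorch GRUs. It is untrained, and the vocabulary size, dimensions, and token ids are arbitrary assumptions; the point is just to show the hidden state flowing through the encoder and the decoder generating one token at a time until an end token appears. Notice that the decoder sees only the single final hidden state, which is exactly the limitation discussed next.

# A minimal, untrained sketch of a sequence-to-sequence encoder-decoder
# built from GRUs. Vocabulary size, dimensions, and token ids are made up.

import torch
import torch.nn as nn

VOCAB_SIZE = 1000   # size of a made-up shared vocabulary
HIDDEN = 128
END_TOKEN = 0       # assume id 0 marks the end of a sequence


class Encoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, HIDDEN)
        self.rnn = nn.GRU(HIDDEN, HIDDEN, batch_first=True)

    def forward(self, tokens):
        # tokens: (batch, sequence_length) of word ids
        outputs, hidden = self.rnn(self.embed(tokens))
        # `outputs` holds the hidden state after every input word;
        # `hidden` is the final hidden state summarizing the whole input.
        return outputs, hidden


class Decoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, HIDDEN)
        self.rnn = nn.GRU(HIDDEN, HIDDEN, batch_first=True)
        self.out = nn.Linear(HIDDEN, VOCAB_SIZE)

    def forward(self, previous_token, hidden):
        output, hidden = self.rnn(self.embed(previous_token), hidden)
        scores = self.out(output.squeeze(1))    # scores over the vocabulary
        return scores, hidden


encoder, decoder = Encoder(), Decoder()
question = torch.randint(1, VOCAB_SIZE, (1, 6))   # stand-in for the input words

_, hidden = encoder(question)                     # encode the whole question
token = torch.tensor([[END_TOKEN]])               # decoding starts from the end token

for _ in range(20):                               # generate at most 20 output words
    scores, hidden = decoder(token, hidden)
    token = scores.argmax(dim=1, keepdim=True)    # pick the most likely next word
    if token.item() == END_TOKEN:                 # stop once the end token appears
        break

# Note that the decoder relies entirely on the single final hidden state
# produced by the encoder.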
970 00:47:14,050 --> 00:47:17,440 And while that's possible, it becomes increasingly difficult 971 00:47:17,440 --> 00:47:19,690 as the sequence gets larger and larger. 972 00:47:19,690 --> 00:47:22,240 For larger and larger input sequences, it's 973 00:47:22,240 --> 00:47:24,310 going to become more and more difficult to store 974 00:47:24,310 --> 00:47:28,930 all of the information we need about the input inside this single hidden state 975 00:47:28,930 --> 00:47:30,010 piece of context. 976 00:47:30,010 --> 00:47:33,070 That's a lot of information to pack into just a single value. 977 00:47:33,070 --> 00:47:36,220 It might be useful for us when generating output, 978 00:47:36,220 --> 00:47:41,860 to not just refer to this one value, but to all of the previous hidden values 979 00:47:41,860 --> 00:47:44,410 that have been generated by the encoder. 980 00:47:44,410 --> 00:47:45,610 And so that might be useful. 981 00:47:45,610 --> 00:47:46,390 But how could we do that? 982 00:47:46,390 --> 00:47:47,890 We've got a lot of different values. 983 00:47:47,890 --> 00:47:49,420 We need to combine them somehow. 984 00:47:49,420 --> 00:47:52,990 So you could imagine adding them together, taking the average of them, 985 00:47:52,990 --> 00:47:53,700 for example. 986 00:47:53,700 --> 00:47:57,270 But doing that would assume that all of these pieces of hidden state 987 00:47:57,270 --> 00:47:58,920 are equally important. 988 00:47:58,920 --> 00:48:00,780 But that's not necessarily true either. 989 00:48:00,780 --> 00:48:02,940 Some of these pieces of hidden state are going 990 00:48:02,940 --> 00:48:06,000 to be more important than others, depending on what word they 991 00:48:06,000 --> 00:48:07,800 most closely correspond to. 992 00:48:07,800 --> 00:48:11,250 This piece of hidden state very closely corresponds to the first word 993 00:48:11,250 --> 00:48:12,330 of the input sequence. 994 00:48:12,330 --> 00:48:16,020 This one very closely corresponds to the second word of the input sequence, 995 00:48:16,020 --> 00:48:16,980 for example. 996 00:48:16,980 --> 00:48:20,460 And some of those are going to be more important than others. 997 00:48:20,460 --> 00:48:22,830 To make matters more complicated, depending 998 00:48:22,830 --> 00:48:25,770 on which word of the output sequence we're generating, 999 00:48:25,770 --> 00:48:29,220 different input words might be more or less important. 1000 00:48:29,220 --> 00:48:31,950 And so what we really want, is some way to decide 1001 00:48:31,950 --> 00:48:36,330 for ourselves which of the input values are worth paying attention to 1002 00:48:36,330 --> 00:48:37,770 at what point in time. 1003 00:48:37,770 --> 00:48:41,490 And this is the key idea behind a mechanism known as Attention. 1004 00:48:41,490 --> 00:48:44,760 Attention is all about letting us decide which 1005 00:48:44,760 --> 00:48:47,280 values are important to pay attention to when 1006 00:48:47,280 --> 00:48:51,210 generating, in this case, the next word in our sequence. 1007 00:48:51,210 --> 00:48:53,430 So let's take a look at an example of that. 1008 00:48:53,430 --> 00:48:54,600 Here's a sentence. 1009 00:48:54,600 --> 00:48:57,030 What is the capital of Massachusetts. 1010 00:48:57,030 --> 00:48:58,350 Same sentence as before. 1011 00:48:58,350 --> 00:49:02,490 And let's imagine that we were trying to answer that question by generating 1012 00:49:02,490 --> 00:49:03,510 tokens of output. 1013 00:49:03,510 --> 00:49:05,190 So what would the output look like? 
1014 00:49:05,190 --> 00:49:08,340 Well, it's going to look like something like the capital is. 1015 00:49:08,340 --> 00:49:11,850 And let's say we're now trying to generate this last word here. 1016 00:49:11,850 --> 00:49:13,230 What is that last word? 1017 00:49:13,230 --> 00:49:15,990 How is the computer going to figure it out? 1018 00:49:15,990 --> 00:49:19,290 Well, what it's going to need to do, is decide which 1019 00:49:19,290 --> 00:49:21,540 values it's going to pay attention to. 1020 00:49:21,540 --> 00:49:23,910 And so the Attention mechanism will allow 1021 00:49:23,910 --> 00:49:27,450 us to calculate some Attention scores for each word, 1022 00:49:27,450 --> 00:49:30,960 some value corresponding to each word, determining 1023 00:49:30,960 --> 00:49:35,490 how relevant is it for us to pay attention to that word right now. 1024 00:49:35,490 --> 00:49:38,520 And in this case, when generating the fourth word of the output 1025 00:49:38,520 --> 00:49:42,840 sequence, the most important words to pay attention to might be capital 1026 00:49:42,840 --> 00:49:46,380 and Massachusetts, for example, that those words 1027 00:49:46,380 --> 00:49:48,423 are going to be particularly relevant. 1028 00:49:48,423 --> 00:49:50,340 And there are a number of different mechanisms 1029 00:49:50,340 --> 00:49:53,140 that have been used in order to calculate these attention scores. 1030 00:49:53,140 --> 00:49:56,200 It could be something as simple as a dot product to see 1031 00:49:56,200 --> 00:49:59,740 how similar two vectors are, or we could train an entire neural network 1032 00:49:59,740 --> 00:50:01,360 to calculate these Attention scores. 1033 00:50:01,360 --> 00:50:03,970 But the key idea, is that during the training 1034 00:50:03,970 --> 00:50:05,860 process for our neural network, we're going 1035 00:50:05,860 --> 00:50:08,890 to learn how to calculate these Attention scores. 1036 00:50:08,890 --> 00:50:13,210 Our model is going to learn what is important to pay attention to in order 1037 00:50:13,210 --> 00:50:16,450 to decide what the next word should be. 1038 00:50:16,450 --> 00:50:19,690 So the result of all of this, calculating these Attention scores, 1039 00:50:19,690 --> 00:50:23,950 is that we can calculate some value, some value for each input word, 1040 00:50:23,950 --> 00:50:27,490 determining how important is it for us to pay attention 1041 00:50:27,490 --> 00:50:29,140 to that particular value. 1042 00:50:29,140 --> 00:50:31,330 And recall that each of these input words 1043 00:50:31,330 --> 00:50:36,550 is also associated with one of these hidden state context vectors, capturing 1044 00:50:36,550 --> 00:50:39,040 information about the sentence up to that point, 1045 00:50:39,040 --> 00:50:42,880 but primarily focused on that word in particular. 1046 00:50:42,880 --> 00:50:45,760 And so what we can now do, is if we have all of these vectors 1047 00:50:45,760 --> 00:50:48,790 and we have values representing how important is it 1048 00:50:48,790 --> 00:50:51,580 for us to pay attention to those particular vectors, 1049 00:50:51,580 --> 00:50:53,650 is we can take a weighted average. 
1050 00:50:53,650 --> 00:50:56,440 We can take all of these vectors, multiply them 1051 00:50:56,440 --> 00:50:58,960 by their Attention scores, and add them up 1052 00:50:58,960 --> 00:51:01,420 to get some new vector value, which is going 1053 00:51:01,420 --> 00:51:04,630 to represent the context from the input, but specifically 1054 00:51:04,630 --> 00:51:08,800 paying attention to the words that we think are most important. 1055 00:51:08,800 --> 00:51:13,510 And once we've done that, that context vector can be fed into our decoder 1056 00:51:13,510 --> 00:51:17,920 in order to say that the word should be, in this case, Boston. 1057 00:51:17,920 --> 00:51:21,130 So Attention is this very powerful tool that 1058 00:51:21,130 --> 00:51:23,680 allows any word when we're trying to decode it, 1059 00:51:23,680 --> 00:51:27,490 to decide which words from the input should we pay attention to 1060 00:51:27,490 --> 00:51:30,550 in order to determine what's important for generating 1061 00:51:30,550 --> 00:51:32,710 the next word of the output. 1062 00:51:32,710 --> 00:51:34,990 And one of the first places this was really used, 1063 00:51:34,990 --> 00:51:37,270 was in the field of machine translation. 1064 00:51:37,270 --> 00:51:39,670 Here's an example of a diagram from the paper that 1065 00:51:39,670 --> 00:51:42,100 introduced this idea, which was focused on trying 1066 00:51:42,100 --> 00:51:45,250 to translate English sentences into French sentences. 1067 00:51:45,250 --> 00:51:47,950 So we have an input English sentence up along the top, 1068 00:51:47,950 --> 00:51:50,590 and then along the left side, the output French equivalent 1069 00:51:50,590 --> 00:51:51,940 of that same sentence. 1070 00:51:51,940 --> 00:51:55,810 And what you see in all of these squares are the Attention scores 1071 00:51:55,810 --> 00:52:00,580 visualized, where a lighter square indicates a higher Attention score. 1072 00:52:00,580 --> 00:52:03,610 And what you'll notice, is that there's a strong correspondence 1073 00:52:03,610 --> 00:52:06,910 between the French word and the equivalent English word. 1074 00:52:06,910 --> 00:52:09,610 That the French word for agreement is really 1075 00:52:09,610 --> 00:52:12,130 paying attention to the English word for agreement 1076 00:52:12,130 --> 00:52:14,440 in order to decide what French word should 1077 00:52:14,440 --> 00:52:16,480 be generated at that point in time. 1078 00:52:16,480 --> 00:52:18,820 And sometimes you might pay attention to multiple words. 1079 00:52:18,820 --> 00:52:21,640 If you look at the French word for economic, 1080 00:52:21,640 --> 00:52:25,150 that's primarily paying attention to the English word for economic, 1081 00:52:25,150 --> 00:52:29,770 but also paying attention to the English word for European, in this case, too. 1082 00:52:29,770 --> 00:52:34,600 And so Attention scores are very easy to visualize to get a sense for what 1083 00:52:34,600 --> 00:52:37,540 is our machine learning model really paying attention to. 1084 00:52:37,540 --> 00:52:41,290 What information is it using in order to determine what's important 1085 00:52:41,290 --> 00:52:45,220 and what's not in order to determine what the ultimate output token should 1086 00:52:45,220 --> 00:52:46,210 be. 
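Stripped of everything else, the attention computation described here can be sketched in a few lines. The vectors below are random stand-ins for real hidden states; the point is the pattern of scoring each hidden state, turning the scores into weights with a softmax, and taking the weighted average as the context.

# A sketch of the core attention computation: score each encoder hidden
# state against the decoder's current state, softmax the scores into
# weights, and take the weighted average as the context vector.

import numpy as np

rng = np.random.default_rng(0)

hidden_size = 8
input_words = ["what", "is", "the", "capital", "of", "massachusetts"]

# One hidden state per input word (what the encoder would have produced),
# plus the decoder's current state while it decides on the next output word.
encoder_states = rng.normal(size=(len(input_words), hidden_size))
decoder_state = rng.normal(size=hidden_size)

# Attention scores: here, a simple dot product between the decoder state
# and each encoder state. (This could also be a small trained network.)
scores = encoder_states @ decoder_state

# Softmax turns the scores into weights that are positive and sum to 1.
weights = np.exp(scores - scores.max())
weights /= weights.sum()

# The context vector is the attention-weighted average of the hidden states.
context = weights @ encoder_states

for word, weight in zip(input_words, weights):
    print(f"{word:15s} {weight:.3f}")
print("context vector:", context.round(3))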
1087 00:52:46,210 --> 00:52:48,580 And so when we combine the Attention mechanism 1088 00:52:48,580 --> 00:52:52,390 with a recurrent neural network, we can get very powerful and useful results, 1089 00:52:52,390 --> 00:52:55,780 where we're able to generate an output sequence by paying attention 1090 00:52:55,780 --> 00:52:57,430 to the input sequence, too. 1091 00:52:57,430 --> 00:53:00,100 But there are other problems with this approach of using 1092 00:53:00,100 --> 00:53:01,690 a recurrent neural network, as well. 1093 00:53:01,690 --> 00:53:04,930 In particular, notice that every run of the neural network 1094 00:53:04,930 --> 00:53:07,270 depends on the output of the previous step. 1095 00:53:07,270 --> 00:53:10,240 And that was important for getting a sense for the sequence of words 1096 00:53:10,240 --> 00:53:12,130 and the ordering of those particular words. 1097 00:53:12,130 --> 00:53:15,340 But we can't run this unit of the neural network 1098 00:53:15,340 --> 00:53:18,730 until after we've calculated the hidden state from the run 1099 00:53:18,730 --> 00:53:21,010 before it, from the previous input token. 1100 00:53:21,010 --> 00:53:25,300 And what that means, is that it's very difficult to parallelize this process. 1101 00:53:25,300 --> 00:53:27,850 As the input sequence gets longer and longer, 1102 00:53:27,850 --> 00:53:30,820 we might want to use parallelism to try and speed up 1103 00:53:30,820 --> 00:53:32,920 this process of training the neural network 1104 00:53:32,920 --> 00:53:34,960 and making sense of all of this language data. 1105 00:53:34,960 --> 00:53:37,300 But it's difficult to do that and it's slow to do 1106 00:53:37,300 --> 00:53:39,670 that with a recurrent neural network, because all of it 1107 00:53:39,670 --> 00:53:41,800 needs to be performed in sequence. 1108 00:53:41,800 --> 00:53:44,140 And that's become an increasing challenge 1109 00:53:44,140 --> 00:53:47,260 as we've started to get larger and larger language models. 1110 00:53:47,260 --> 00:53:49,660 The more language data that we have available to us 1111 00:53:49,660 --> 00:53:52,650 to use to train our machine learning models, the more accurate 1112 00:53:52,650 --> 00:53:55,590 they can be, the better representation of language they can have, 1113 00:53:55,590 --> 00:53:59,430 the better understanding they can have, and the better results that we can see. 1114 00:53:59,430 --> 00:54:02,220 And so we've seen this growth of large language models 1115 00:54:02,220 --> 00:54:05,400 that are using larger and larger datasets, but as a result, 1116 00:54:05,400 --> 00:54:07,380 they take longer and longer to train. 1117 00:54:07,380 --> 00:54:10,650 And so this problem, that recurrent neural networks are not 1118 00:54:10,650 --> 00:54:14,400 easy to parallelize, has become an increasing problem. 1119 00:54:14,400 --> 00:54:17,250 And as a result of that, that was one of the main motivations 1120 00:54:17,250 --> 00:54:20,130 for a different architecture for thinking about how 1121 00:54:20,130 --> 00:54:21,870 to deal with natural language. 1122 00:54:21,870 --> 00:54:24,480 And that's known as the Transformer architecture.
1123 00:54:24,480 --> 00:54:26,760 And this has been a significant milestone 1124 00:54:26,760 --> 00:54:28,620 in the world of natural language processing 1125 00:54:28,620 --> 00:54:32,640 for really increasing how well we can perform these kinds of natural language 1126 00:54:32,640 --> 00:54:35,280 processing tasks, as well as how quickly we 1127 00:54:35,280 --> 00:54:39,240 can train a machine learning model to be able to produce effective results. 1128 00:54:39,240 --> 00:54:42,600 There are a number of different types of Transformers in terms of how they work, 1129 00:54:42,600 --> 00:54:44,433 but what we're going to take a look at here, 1130 00:54:44,433 --> 00:54:48,120 is the basic architecture for how one might work with a Transformer 1131 00:54:48,120 --> 00:54:51,280 to get a sense for what's involved and what we're doing. 1132 00:54:51,280 --> 00:54:55,060 So let's start with the model we were looking at before, specifically 1133 00:54:55,060 --> 00:54:58,420 at this encoder part of our encoder decoder architecture, 1134 00:54:58,420 --> 00:55:02,020 where we used a recurrent neural network to take this input sequence 1135 00:55:02,020 --> 00:55:05,560 and capture all of this information about the hidden state 1136 00:55:05,560 --> 00:55:08,860 and the information we need to know about that input sequence. 1137 00:55:08,860 --> 00:55:12,550 Right now, it all needs to happen in this linear progression. 1138 00:55:12,550 --> 00:55:14,980 But what the Transformer is going to allow us to do, 1139 00:55:14,980 --> 00:55:17,710 is process each of the words independently 1140 00:55:17,710 --> 00:55:19,510 in a way that's easy to parallelize. 1141 00:55:19,510 --> 00:55:22,000 Rather than have each word wait for some other word, 1142 00:55:22,000 --> 00:55:25,360 each word is going to go through this same neural network 1143 00:55:25,360 --> 00:55:30,110 and produce some kind of encoded representation of that particular input 1144 00:55:30,110 --> 00:55:30,610 word. 1145 00:55:30,610 --> 00:55:33,200 And all of this is going to happen in parallel. 1146 00:55:33,200 --> 00:55:35,200 Now it's happening for all of the words at once, 1147 00:55:35,200 --> 00:55:37,870 but we're really just going to focus on what's happening for one word 1148 00:55:37,870 --> 00:55:38,647 to make it clear. 1149 00:55:38,647 --> 00:55:41,230 But know that whatever you're seeing happen for this one word, 1150 00:55:41,230 --> 00:55:44,950 is going to happen for all of the other input words, too. 1151 00:55:44,950 --> 00:55:46,690 So what's going on here? 1152 00:55:46,690 --> 00:55:49,150 Well, we start with some input word. 1153 00:55:49,150 --> 00:55:53,410 That input word goes into the neural network, and the output is hopefully, 1154 00:55:53,410 --> 00:55:57,400 some encoded representation of the input word, the information 1155 00:55:57,400 --> 00:56:00,730 we need to know about the input word that's going to be relevant to us 1156 00:56:00,730 --> 00:56:02,620 as we're generating the output. 1157 00:56:02,620 --> 00:56:05,470 And because we're doing this each word independently, 1158 00:56:05,470 --> 00:56:06,700 it's easy to parallelize. 1159 00:56:06,700 --> 00:56:08,770 We don't have to wait for the previous word 1160 00:56:08,770 --> 00:56:12,100 before we run this word through the neural network. 1161 00:56:12,100 --> 00:56:16,210 But what did we lose in this process by trying to parallelize this whole thing? 1162 00:56:16,210 --> 00:56:19,060 Well, we've lost all notion of word ordering. 
1163 00:56:19,060 --> 00:56:20,920 The order of words is important. 1164 00:56:20,920 --> 00:56:23,740 The sentence, Sherlock Holmes gave the book to Watson, 1165 00:56:23,740 --> 00:56:26,920 has a different meaning than Watson gave the book to Sherlock Holmes. 1166 00:56:26,920 --> 00:56:30,760 And so we want to keep track of that information about word position. 1167 00:56:30,760 --> 00:56:33,670 In the recurrent neural network, that happened for us automatically, 1168 00:56:33,670 --> 00:56:37,120 because we could run each word one at a time through the neural network, 1169 00:56:37,120 --> 00:56:41,050 get the hidden state, pass it on to the next run of the neural network. 1170 00:56:41,050 --> 00:56:43,600 But that's not the case here with the Transformer, 1171 00:56:43,600 --> 00:56:48,460 where each word is being processed independent of all of the other ones. 1172 00:56:48,460 --> 00:56:50,800 So what are we going to do to try to solve that problem? 1173 00:56:50,800 --> 00:56:56,440 One thing we can do, is add some kind of positional encoding to the input word. 1174 00:56:56,440 --> 00:56:58,990 The positional encoding is some vector that 1175 00:56:58,990 --> 00:57:01,713 represents the position of the word in the sentence. 1176 00:57:01,713 --> 00:57:04,630 This is the first word, the second word, the third word, and so forth. 1177 00:57:04,630 --> 00:57:07,420 We're going to add that to the input word. 1178 00:57:07,420 --> 00:57:10,600 And the result of that is going to be a vector that captures 1179 00:57:10,600 --> 00:57:12,280 multiple pieces of information. 1180 00:57:12,280 --> 00:57:15,790 It captures the input word itself, as well as where in the sentence 1181 00:57:15,790 --> 00:57:16,720 it appears. 1182 00:57:16,720 --> 00:57:19,240 The result of that, is we can pass the output 1183 00:57:19,240 --> 00:57:23,200 of that addition, the addition of the input word and the positional encoding, 1184 00:57:23,200 --> 00:57:24,400 into the neural network. 1185 00:57:24,400 --> 00:57:26,470 That way the neural network knows the word 1186 00:57:26,470 --> 00:57:28,570 and where it appears in the sentence, and can 1187 00:57:28,570 --> 00:57:31,600 use both of those pieces of information to determine 1188 00:57:31,600 --> 00:57:35,980 how best to represent the meaning of that word in the encoded representation 1189 00:57:35,980 --> 00:57:37,540 at the end of it. 1190 00:57:37,540 --> 00:57:41,470 In addition to what we have here, in addition to the positional encoding 1191 00:57:41,470 --> 00:57:43,630 and this feed forward neural network, we're 1192 00:57:43,630 --> 00:57:46,780 also going to add one additional component, which 1193 00:57:46,780 --> 00:57:49,240 is going to be a Self-Attention step. 1194 00:57:49,240 --> 00:57:52,060 This is going to be Attention where we're paying attention 1195 00:57:52,060 --> 00:57:53,950 to the other input words. 1196 00:57:53,950 --> 00:57:56,680 Because the meaning or interpretation of an input word 1197 00:57:56,680 --> 00:58:00,220 might vary depending on the other words in the input, as well. 1198 00:58:00,220 --> 00:58:02,980 And so we're going to allow each word in the input 1199 00:58:02,980 --> 00:58:05,590 to decide what other words in the input it 1200 00:58:05,590 --> 00:58:10,150 should pay attention to in order to decide on its encoded representation. 
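One common way to implement the positional encoding just described is the sinusoidal scheme from the original Transformer paper, sketched below with made-up embeddings; learned positional embeddings are another frequent choice.

# The sinusoidal positional encoding: each position gets a vector of sines
# and cosines at different frequencies, added to the word's embedding.

import numpy as np


def positional_encoding(num_positions, dim):
    """Return an array of shape (num_positions, dim) of positional encodings."""
    positions = np.arange(num_positions)[:, np.newaxis]       # (positions, 1)
    dims = np.arange(dim)[np.newaxis, :]                      # (1, dim)
    angle_rates = 1 / np.power(10000, (2 * (dims // 2)) / dim)
    angles = positions * angle_rates
    encoding = np.zeros((num_positions, dim))
    encoding[:, 0::2] = np.sin(angles[:, 0::2])               # even dimensions
    encoding[:, 1::2] = np.cos(angles[:, 1::2])               # odd dimensions
    return encoding


# Pretend word embeddings for "sherlock holmes gave the book to watson".
embedding_dim = 16
words = "sherlock holmes gave the book to watson".split()
embeddings = np.random.default_rng(0).normal(size=(len(words), embedding_dim))

# Adding the positional encoding gives each word a representation that
# reflects both what the word is and where it appears in the sentence.
inputs = embeddings + positional_encoding(len(words), embedding_dim)
print(inputs.shape)  # (7, 16)

With positions folded in, the Self-Attention step can then let each word's representation depend on the words around it.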
1201 00:58:10,150 --> 00:58:13,540 And that's going to allow us to get a better encoded representation 1202 00:58:13,540 --> 00:58:16,510 for each word, because words are defined by their context, 1203 00:58:16,510 --> 00:58:20,800 by the words around them and how they're used in that particular context. 1204 00:58:20,800 --> 00:58:23,710 This kind of Self-Attention is so valuable in fact, 1205 00:58:23,710 --> 00:58:25,960 that oftentimes, the Transformer will use 1206 00:58:25,960 --> 00:58:29,860 multiple different Self-Attention layers at the same time 1207 00:58:29,860 --> 00:58:32,530 to allow for this model to be able to pay attention 1208 00:58:32,530 --> 00:58:35,800 to multiple facets of the input at the same time. 1209 00:58:35,800 --> 00:58:39,748 We call this Multi-Headed Attention, where each attention head can 1210 00:58:39,748 --> 00:58:41,290 pay attention to something different. 1211 00:58:41,290 --> 00:58:44,590 And as a result, this network can learn to pay attention 1212 00:58:44,590 --> 00:58:48,880 to many different parts of the input for this input word all at the same time. 1213 00:58:48,880 --> 00:58:51,650 And in the spirit of deep learning, these two steps, 1214 00:58:51,650 --> 00:58:55,460 this Multi-Headed Self-Attention layer and this neural network layer, 1215 00:58:55,460 --> 00:58:58,520 can themselves be repeated multiple times, too, 1216 00:58:58,520 --> 00:59:01,130 in order to get a deeper representation, in order 1217 00:59:01,130 --> 00:59:03,680 to learn deeper patterns within the input text, 1218 00:59:03,680 --> 00:59:06,710 and ultimately, get a better representation of language, 1219 00:59:06,710 --> 00:59:09,950 in order to get useful encoded representations 1220 00:59:09,950 --> 00:59:12,110 of all of the input words. 1221 00:59:12,110 --> 00:59:14,810 And so this is the process that a transformer 1222 00:59:14,810 --> 00:59:19,400 might use in order to take an input word and get its encoded representation. 1223 00:59:19,400 --> 00:59:23,120 And the key idea, is to really rely on this Attention step 1224 00:59:23,120 --> 00:59:25,790 in order to get information that's useful in order 1225 00:59:25,790 --> 00:59:28,190 to determine how to encode that word. 1226 00:59:28,190 --> 00:59:31,370 And that process is going to repeat for all of the input 1227 00:59:31,370 --> 00:59:33,200 words that are in the input sequence. 1228 00:59:33,200 --> 00:59:35,090 We're going to take all of the input words, 1229 00:59:35,090 --> 00:59:38,240 encode them with some kind of positional encoding, 1230 00:59:38,240 --> 00:59:42,320 feed those into these Self-Attention and feed forward neural networks in order 1231 00:59:42,320 --> 00:59:46,340 to ultimately get these encoded representations of the words. 1232 00:59:46,340 --> 00:59:47,960 That's the result of the encoder. 1233 00:59:47,960 --> 00:59:51,390 We get all of these encoded representations that 1234 00:59:51,390 --> 00:59:53,640 will be useful to us when it comes time then 1235 00:59:53,640 --> 00:59:57,390 to try to decode all of this information into the output 1236 00:59:57,390 --> 00:59:58,890 sequence we're interested in. 1237 00:59:58,890 --> 01:00:02,490 And again, this might take place in the context of machine translation, 1238 01:00:02,490 --> 01:00:05,910 where the output is going to be the same sentence in a different language. 1239 01:00:05,910 --> 01:00:09,720 Or it might be an answer to a question, in the case of an AI chatbot, 1240 01:00:09,720 --> 01:00:10,590 for example.
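As a sketch of what that encoder stack might look like in code, PyTorch's built-in modules bundle exactly these pieces: multi-headed self-attention followed by a feed-forward network, repeated several times. The dimensions, head counts, and layer counts here are arbitrary choices for illustration.

# A sketch of a Transformer encoder built from PyTorch's built-in modules.

import torch
import torch.nn as nn

embedding_dim = 64
num_heads = 4    # four attention heads, each free to attend to something different
num_layers = 3   # the self-attention + feed-forward block repeated three times

encoder_layer = nn.TransformerEncoderLayer(
    d_model=embedding_dim,
    nhead=num_heads,
    dim_feedforward=128,
    batch_first=True,
)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)

# A batch of one sentence, seven "words", each already embedded and combined
# with its positional encoding (random stand-ins here).
inputs = torch.randn(1, 7, embedding_dim)

# Every word is processed in parallel, and each word's encoded representation
# can depend on every other word through self-attention.
encoded = encoder(inputs)
print(encoded.shape)  # torch.Size([1, 7, 64])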
1241 01:00:10,590 --> 01:00:15,330 And so now let's take a look at how that decoder is going to work. 1242 01:00:15,330 --> 01:00:18,420 Ultimately, it's going to have a very similar structure. 1243 01:00:18,420 --> 01:00:21,330 Any time we're trying to generate the next output word, 1244 01:00:21,330 --> 01:00:24,480 we need to know what the previous output word is, 1245 01:00:24,480 --> 01:00:28,620 as well as its positional encoding, where in the output sequence are we. 1246 01:00:28,620 --> 01:00:30,600 And we're going to have these same steps. 1247 01:00:30,600 --> 01:00:33,960 Self-Attention, because we might want an output word 1248 01:00:33,960 --> 01:00:37,170 to be able to pay attention to other words in that same output, 1249 01:00:37,170 --> 01:00:38,940 as well as a neural network. 1250 01:00:38,940 --> 01:00:41,820 And that might itself repeat multiple times. 1251 01:00:41,820 --> 01:00:45,240 But in this decoder, we're going to add one additional step. 1252 01:00:45,240 --> 01:00:47,250 We're going to add an additional Attention 1253 01:00:47,250 --> 01:00:50,760 step, where instead of Self-Attention, where the output word is going 1254 01:00:50,760 --> 01:00:54,360 to pay attention to other output words, in this step, 1255 01:00:54,360 --> 01:00:57,660 we're going to allow the output word to pay attention 1256 01:00:57,660 --> 01:00:59,700 to the encoded representations. 1257 01:00:59,700 --> 01:01:03,600 So recall that the encoder is taking all of the input words 1258 01:01:03,600 --> 01:01:07,650 and transforming them into these encoded representations of all of the input 1259 01:01:07,650 --> 01:01:08,220 words. 1260 01:01:08,220 --> 01:01:10,012 But it's going to be important for us to be 1261 01:01:10,012 --> 01:01:12,750 able to decide which of those encoded representations 1262 01:01:12,750 --> 01:01:16,560 we want to pay attention to when generating any particular token 1263 01:01:16,560 --> 01:01:18,000 in the output sequence. 1264 01:01:18,000 --> 01:01:20,040 And that's what this additional Attention 1265 01:01:20,040 --> 01:01:21,990 step is going to allow us to do. 1266 01:01:21,990 --> 01:01:25,560 It's saying that every time we're generating a word of the output, 1267 01:01:25,560 --> 01:01:28,140 we can pay attention to the other words in the output, 1268 01:01:28,140 --> 01:01:31,500 because we might want to know, what are the words we've generated previously. 1269 01:01:31,500 --> 01:01:33,420 And we want to pay attention to some of them 1270 01:01:33,420 --> 01:01:36,900 to decide what word is going to be next in the sequence. 1271 01:01:36,900 --> 01:01:40,530 But we also care about paying attention to the input words, too. 1272 01:01:40,530 --> 01:01:44,490 And we want the ability to decide which of these encoded representations 1273 01:01:44,490 --> 01:01:46,890 of the input words are going to be relevant in order 1274 01:01:46,890 --> 01:01:49,080 for us to generate the next step. 1275 01:01:49,080 --> 01:01:51,300 And so these two pieces combine together. 1276 01:01:51,300 --> 01:01:54,360 We have this encoder that takes all of the input words 1277 01:01:54,360 --> 01:01:56,970 and produces this encoded representation. 1278 01:01:56,970 --> 01:02:00,990 And we have this decoder that is able to take the previous output word, 1279 01:02:00,990 --> 01:02:05,550 pay attention to that encoded input, and then generate the next output word. 
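And the decoder side can be sketched the same way with PyTorch's built-in decoder modules, where each layer applies self-attention over the output generated so far and then cross-attention over the encoder's representations. Again, the dimensions and the random inputs are stand-ins, not a full working translation system.

# A sketch of a Transformer decoder: self-attention over the outputs so far,
# plus cross-attention over the encoder's representations ("memory").

import torch
import torch.nn as nn

embedding_dim = 64

decoder_layer = nn.TransformerDecoderLayer(
    d_model=embedding_dim,
    nhead=4,
    dim_feedforward=128,
    batch_first=True,
)
decoder = nn.TransformerDecoder(decoder_layer, num_layers=3)

# Encoded representations of a 6-word input sentence (what the encoder produced).
memory = torch.randn(1, 6, embedding_dim)

# Embeddings (plus positional encodings) of the 3 output words generated so far.
outputs_so_far = torch.randn(1, 3, embedding_dim)

# Each output position attends to the previous outputs and to the encoded
# input; a final linear layer (omitted here) would turn these into word
# probabilities. During training, a causal mask would also be passed so
# positions can't peek at later output words.
decoded = decoder(tgt=outputs_so_far, memory=memory)
print(decoded.shape)  # torch.Size([1, 3, 64])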
1280 01:02:05,550 --> 01:02:08,220 And this is one of the possible architectures 1281 01:02:08,220 --> 01:02:12,510 we could use for a transformer, with the key idea being these attention 1282 01:02:12,510 --> 01:02:15,720 steps, that allow words to pay attention to each other. 1283 01:02:15,720 --> 01:02:19,508 During the training process here, we can now much more easily parallelize 1284 01:02:19,508 --> 01:02:22,800 this, because we don't have to wait for all of the words to happen in sequence. 1285 01:02:22,800 --> 01:02:26,550 And we can learn how we should perform these attention steps. 1286 01:02:26,550 --> 01:02:30,060 The model is able to learn what is important to pay attention to, 1287 01:02:30,060 --> 01:02:32,040 what things do I need to pay attention to 1288 01:02:32,040 --> 01:02:36,570 in order to be more accurate at predicting what the output word is. 1289 01:02:36,570 --> 01:02:40,800 And this has proved to be a tremendously effective model for conversational AI 1290 01:02:40,800 --> 01:02:43,770 agents, for building machine translation systems. 1291 01:02:43,770 --> 01:02:46,470 And there have been many variants proposed on this model, too. 1292 01:02:46,470 --> 01:02:48,930 Some transformers only use an encoder. 1293 01:02:48,930 --> 01:02:50,580 Some only use a decoder. 1294 01:02:50,580 --> 01:02:54,210 Some use some other combination of these different particular features. 1295 01:02:54,210 --> 01:02:57,390 But the key ideas ultimately remain the same. 1296 01:02:57,390 --> 01:03:01,380 This real focus on trying to pay attention to what is most important. 1297 01:03:01,380 --> 01:03:03,600 And the world of natural language processing 1298 01:03:03,600 --> 01:03:05,700 is fast-growing and fast-evolving. 1299 01:03:05,700 --> 01:03:08,310 Year after year, we keep coming up with new models that 1300 01:03:08,310 --> 01:03:11,190 allow us to do an even better job of performing 1301 01:03:11,190 --> 01:03:13,650 these natural language-related tasks, all 1302 01:03:13,650 --> 01:03:16,320 in the service of solving the tricky problem, which 1303 01:03:16,320 --> 01:03:17,580 is our own natural language. 1304 01:03:17,580 --> 01:03:20,250 We've seen how the syntax and semantics of our language 1305 01:03:20,250 --> 01:03:23,430 is ambiguous and introduces all of these new challenges 1306 01:03:23,430 --> 01:03:25,590 that we need to think about if we're going 1307 01:03:25,590 --> 01:03:30,120 to be able to design AI agents that are able to work with language effectively. 1308 01:03:30,120 --> 01:03:32,848 So as we think about where we've been in this class, all 1309 01:03:32,848 --> 01:03:35,640 of the different types of artificial intelligence we've considered, 1310 01:03:35,640 --> 01:03:39,130 we've looked at artificial intelligence in a wide variety of different forms 1311 01:03:39,130 --> 01:03:39,630 now. 1312 01:03:39,630 --> 01:03:42,510 We started by taking a look at search problems, where 1313 01:03:42,510 --> 01:03:45,060 we looked at how AI can search for solutions, 1314 01:03:45,060 --> 01:03:48,060 play games, and find the optimal decision to make. 1315 01:03:48,060 --> 01:03:52,370 We talked about knowledge, how AI can represent information that it knows 1316 01:03:52,370 --> 01:03:56,270 and use that information to generate new knowledge, as well. 
1317 01:03:56,270 --> 01:03:59,420 Then we looked at what AI can do when it's less certain, when 1318 01:03:59,420 --> 01:04:02,240 it doesn't know things for sure, and we have to represent things 1319 01:04:02,240 --> 01:04:03,590 in terms of probability. 1320 01:04:03,590 --> 01:04:05,780 We then took a look at optimization problems. 1321 01:04:05,780 --> 01:04:08,690 We saw how a lot of problems in AI can be boiled down 1322 01:04:08,690 --> 01:04:12,230 to trying to maximize or minimize some function. 1323 01:04:12,230 --> 01:04:14,840 And we looked at strategies that AI can use in order 1324 01:04:14,840 --> 01:04:17,510 to do that kind of maximizing and minimizing. 1325 01:04:17,510 --> 01:04:19,550 We then looked at the world of machine learning, 1326 01:04:19,550 --> 01:04:22,550 learning from data in order to figure out some patterns 1327 01:04:22,550 --> 01:04:26,030 and identify how to perform a task by looking at the training data 1328 01:04:26,030 --> 01:04:27,500 that we have available to it. 1329 01:04:27,500 --> 01:04:30,770 And one of the most powerful tools there was the neural network, 1330 01:04:30,770 --> 01:04:34,400 the sequence of units whose weights can be trained in order to allow 1331 01:04:34,400 --> 01:04:37,100 us to really effectively go from input to output 1332 01:04:37,100 --> 01:04:40,980 and predict how to get there by learning these underlying patterns. 1333 01:04:40,980 --> 01:04:44,630 And then today, we took a look at language itself, trying to understand 1334 01:04:44,630 --> 01:04:48,350 how can we train the computer to be able to understand our natural language, 1335 01:04:48,350 --> 01:04:51,210 to be able to understand syntax and semantics, 1336 01:04:51,210 --> 01:04:54,540 make sense of and generate natural language, which introduces 1337 01:04:54,540 --> 01:04:56,490 a number of interesting problems, too. 1338 01:04:56,490 --> 01:04:59,640 And we've really just scratched the surface of artificial intelligence. 1339 01:04:59,640 --> 01:05:02,910 There is so much interesting research and interesting new techniques 1340 01:05:02,910 --> 01:05:05,250 and algorithms and ideas being introduced to try 1341 01:05:05,250 --> 01:05:07,140 to solve these types of problems. 1342 01:05:07,140 --> 01:05:09,660 So I hope you enjoyed this exploration into the world 1343 01:05:09,660 --> 01:05:10,950 of artificial intelligence. 1344 01:05:10,950 --> 01:05:13,950 A huge thanks to all of the course's teaching staff and production team 1345 01:05:13,950 --> 01:05:15,360 for making the class possible. 1346 01:05:15,360 --> 01:05:19,730 This was an introduction to Artificial Intelligence with Python. 1347 01:05:19,730 --> 01:05:21,000