[MUSIC PLAYING]

SPEAKER 1: OK, welcome back, everyone, to our final topic in an introduction to artificial intelligence with Python. And today, the topic is language.

So thus far in the class, we've seen a number of different ways of interacting with AI, artificial intelligence, but it's mostly been happening in the way of us formulating problems in ways that it can understand--learning to speak the language of AI, so to speak, by trying to take a problem and formulate it as a search problem, or by trying to take a problem and make it a constraint satisfaction problem--something that our AI is able to understand. Today, we're going to try and come up with algorithms and ideas that allow our AI to meet us halfway, so to speak--to allow AI to be able to understand, and interpret, and get some sort of meaning out of human language--the type of language, the spoken language, like English, or some other language that we naturally speak. And this turns out to be a really challenging task for AI. It really encompasses a number of different types of tasks, all under the broad heading of natural language processing: the idea of coming up with algorithms that allow our AI to be able to process and understand natural language.

So these tasks vary in terms of the types of things we might want an AI to perform, and therefore, the types of algorithms that we might use. But some common tasks that you might see are things like automatic summarization. You give an AI a long document, and you would like for the AI to be able to summarize it--come up with a shorter representation of the same idea, but still in some kind of natural language, like English. Something like information extraction--given a whole corpus of information in some body of documents or on the internet, for example, we'd like for our AI to be able to extract some sort of meaningful semantic information out of all of that content that it's able to look at and read. Language identification--the task of, given a page, can you figure out what language that document is written in?
This is the type of thing you might see if you use a web browser where, if you open up a page in another language, that web browser might ask you: oh, I think it's in this language--would you like me to translate it into English for you, for example? And that language identification process is a task that our AI needs to be able to do, which is then related to machine translation, the process of taking text in one language and translating it into another language--on which there's been a lot of research and development over the course of the last several years. And it keeps getting better, in terms of how it is that AI is able to take text in one language and transform that text into another language as well.

In addition to that, we have topics like named entity recognition. Given some sequence of text, can you pick out what the named entities are? These are names of companies, or names of people, or names of locations, for example, which are often relevant or important parts of a particular document. Speech recognition is a related task, having to do not with text that is written, but with text that is spoken--being able to process audio and figure out, what are the actual words that are spoken there? And if you think about smart home devices, like Siri or Alexa, for example, these are all devices that are now able to listen when we speak, figure out what words we are saying, and draw some sort of meaning out of that as well. We've talked about how you could formulate something like that, for instance, as a hidden Markov model to be able to draw those sorts of conclusions.

Text classification, more generally, is a broad category of types of ideas, whenever we want to take some kind of text and put it into some sort of category. And we've seen these classification type problems and how we can use statistical machine learning approaches to be able to solve them. We'll be able to do something very similar with natural language, though we may need to make a couple of adjustments, as we'll see soon.
And then there's something like word sense disambiguation--the idea that, unlike in the language of numbers, where AI has very precise representations of everything, words are a little bit fuzzy in terms of their meaning, and words can have multiple different meanings. Natural language is inherently ambiguous, and we'll take a look at some of those ambiguities in due time today. But one challenging task, if you want an AI to be able to understand natural language, is being able to disambiguate, or differentiate, between different possible meanings of words. If I say a sentence like, I went to the bank, you need to figure out: do I mean the bank where I deposit and withdraw money, or do I mean the bank like the riverbank? Different words can have different meanings that we might want to figure out. And the context in which a word appears--the wider sentence, or paragraph, or paper in which a particular word appears--might help to inform how it is that we disambiguate between different meanings or different senses that a word might have.

And there are many other topics within natural language processing, many other algorithms that have been devised in order to deal with and address these sorts of problems. Today, we're really just going to scratch the surface, looking at some of the fundamental ideas that lie behind many of these tasks within natural language processing--within this idea of trying to come up with AI algorithms that are able to do something meaningful with the languages that we speak every day.

And so to introduce this idea: when we think about language, we can often think about it in a couple of different parts. The first part refers to the syntax of language. This has to do with just the structure of language and how it is that that structure works. And if you think about natural language, syntax is one of those things that, if you're a native speaker of a language, comes pretty readily to you. You don't have to think too much about it.
If I give you a sentence from Sir Arthur Conan Doyle's Sherlock Holmes, for example, a sentence like this--"just before 9 o'clock, Sherlock Holmes stepped briskly into the room"--I think we could probably all agree that this is a well-formed grammatical sentence. Syntactically, it makes sense, in terms of the way that this particular sentence is structured. And syntax applies not just to natural language, but to programming languages as well. If you've ever seen a syntax error in a program that you've written, it's likely because you wrote some sort of program that was not syntactically well-formed. The structure of it was not a valid program. In the same way, we can look at English sentences, or sentences in any natural language, and make the same kinds of judgments. I can say that this sentence is syntactically well-formed. When all the parts are put together, all these words in this order, it constructs a grammatical sentence--or a sentence that most people would agree is grammatical.

But there are also grammatically ill-formed sentences. A sentence like, "just before Sherlock Holmes 9 o'clock stepped briskly the room"--well, I think we would all agree that this is not a well-formed sentence. Syntactically, it doesn't make sense. And this is the type of thing that, if we want our AI, for example, to be able to generate natural language--to be able to speak to us the way a chatbot would speak to us, for example--then our AI is going to need to know this distinction somehow: to know what kinds of sentences are grammatical and what kinds of sentences are not. We might come up with rules, or with ways to statistically learn these ideas, and we'll talk about some of those methods as well.

Syntax can also be ambiguous. It's not just that some sentences are well-formed and others are not--there are certain ways that you could take a sentence and potentially construct multiple different structures for that sentence. A sentence like, "I saw the man on the mountain with a telescope"--well, this is grammatically well-formed--syntactically, it makes sense--but what is the structure of the sentence?
Is it the man on the mountain who has the telescope, or am I seeing the man on the mountain and using the telescope in order to see him? There's some interesting ambiguity here, where the sentence could have potentially two different types of structures. And this is one of the ideas that we'll come back to as well, in terms of how to think about dealing with AI when natural language is inherently ambiguous.

So that, then, is syntax: the structure of language, and getting an understanding for how it is that, depending on the order and placement of words, we can come up with different structures for language. But in addition to language having structure, language also has meaning. And now we get into the world of semantics: the idea of what it is that a word, or a sequence of words, or a sentence, or an entire essay actually means. And so a sentence like, "just before 9:00, Sherlock Holmes stepped briskly into the room," is a different sentence from a sentence like, "Sherlock Holmes stepped briskly into the room just before 9:00." And yet they have effectively the same meaning. They're different sentences, so an AI reading them would recognize them as different, but we as humans can look at both of the sentences and say: yeah, they mean basically the same thing. And maybe, in this case, that's just because I moved the order of the words around. Originally, 9 o'clock was near the beginning of the sentence. Now 9 o'clock is near the end of the sentence.

But you might imagine that I could come up with a different sentence entirely, a sentence like, "a few minutes before 9:00, Sherlock Holmes walked quickly into the room." And OK, that also has a very similar meaning, but I'm using different words in order to express that idea. Ideally, AI would be able to recognize that these two sentences--these different sets of words that are similar to each other--have similar meanings, and be able to get at that idea as well.

Then there are also ways that a syntactically well-formed sentence might not mean anything at all.
A famous example from linguist Noam Chomsky is this sentence here--"colorless green ideas sleep furiously." Syntactically, that sentence is perfectly fine. Colorless and green are adjectives that modify the noun ideas. Sleep is a verb. Furiously is an adverb. These are correct constructions in terms of the order of words, but it turns out this sentence is meaningless. If you tried to ascribe meaning to the sentence, what does it mean? It's not easy to determine what it is that it might mean.

Semantics itself can also be ambiguous, given that different structures can have different types of meanings, and different words can have different kinds of meanings, so the same sentence with the same structure might end up meaning different types of things. My favorite example is a headline that appeared in the Los Angeles Times a little while back. The headline says, "Big rig carrying fruit crashes on 210 freeway, creates jam." So depending on how it is you look at the sentence--how you interpret the sentence--it can have multiple different meanings. And so here, too, are challenges in this world of natural language processing: being able to understand both the syntax of language and the semantics of language. Today, we'll take a look at both of those ideas.

We're going to start by talking about syntax and getting a sense for how it is that language is structured--how we can start by coming up with some rules, some ways that we can tell our computer, tell our AI, what types of things are valid sentences and what types of things are not valid sentences. And ultimately, we'd like to use that information to allow our AI to draw meaningful conclusions, to be able to do something with language.

And so to do so, we're going to start by introducing the notion of formal grammar. And what formal grammar is all about is this: a formal grammar is a system of rules that generate sentences in a language. I would like to know what the valid English sentences are--not in terms of what they mean--just in terms of their structure, their syntactic structure.
What structures of English are valid, correct sentences? What structures of English are not valid? And this is going to apply in a very similar way to other natural languages as well, where language follows certain types of structures. We intuitively know what these structures mean, but it's going to be helpful to try and really formally define what those structures mean as well.

There are a number of different types of formal grammar, all across what's known as the Chomsky hierarchy of grammars. And you may have seen some of these before. If you've ever worked with regular expressions, those correspond to regular languages, which are one particular type of language on this hierarchy. But also on this hierarchy is a type of grammar known as a context-free grammar. And this is the one we're going to spend the most time taking a look at today. A context-free grammar is a way of generating sentences in a language via what are known as rewriting rules--replacing one symbol with other symbols. And we'll take a look in a moment at just what that means.

So let's imagine, for example, a simple sentence in English, a sentence like, "she saw the city"--a valid, syntactically well-formed English sentence. We'd like some way for our AI to be able to look at the sentence and figure out, what is the structure of the sentence? If you imagine a question-answering format--if you want to ask the AI a question like, what did she see?--well, then the AI wants to be able to look at this sentence and recognize that what she saw is the city--to be able to figure that out. And that requires some understanding of what it is that the structure of this sentence really looks like.

So where do we begin? Each of these words--she, saw, the, city--we are going to call terminal symbols. They are symbols in our language--where each of these words is just a symbol--and this is ultimately what we care about generating. We care about generating these words. But each of these words we're also going to associate with what we're going to call a non-terminal symbol.
And these non-terminal symbols initially are going to look kind of like parts of speech, if you remember back to English grammar--where she is an N for noun, saw is a V for verb, and the is a D. D stands for determiner. These are words like the, a, and an, for example. And then city--well, city is also a noun, so an N goes there. So each of these--N, V, and D--these are what we might call non-terminal symbols. They're not actually words in the language. She, saw, the, city--those are the words in the language. But we use these non-terminal symbols to generate the terminal symbols--the terminal symbols being words like she, saw, the, city, the words that are actually in a language like English.

And so in order to translate these non-terminal symbols into terminal symbols, we have what are known as rewriting rules, and these rules look something like this. We have N on the left side of an arrow, and the arrow says: if I have an N non-terminal symbol, then I can turn it into any of these various different possibilities that are separated with a vertical line. So a noun could translate into the word she. A noun could translate into the word city, or car, or Harry, or any number of other things. These are all examples of nouns, for example. Meanwhile, a determiner, D, could translate into the, or a, or an. V for verb could translate into any of these verbs. P for preposition could translate into any of those prepositions--to, on, over, and so forth. And then ADJ for adjective can translate into any of these possible adjectives as well.

So these, then, are rules in our context-free grammar. When we are defining what it is that our grammar is--what the structure of the English language, or any other language, is--we give it these types of rules, saying that a noun could be any of these possibilities, and a verb could be any of those possibilities. But it turns out we can then begin to construct other rules, where it's not just one non-terminal translating into one terminal symbol. We're always going to have one non-terminal on the left-hand side of the arrow, but on the right-hand side of the arrow, we could have other things.
We could even have other non-terminal symbols. So what do I mean by this? Well, we have the idea of nouns--like she, city, car, Harry, for example--but there are also noun phrases--phrases that work as nouns--that are not just a single word, but multiple words. The city is two words that, together, operate as what we might call a noun phrase. It's multiple words, but they're together operating as a noun. Or think about a more complex expression, like the big city--three words all operating as a single noun--or the car on the street--multiple words now, but that entire set of words operates kind of like a noun. It substitutes as a noun phrase.

And so to do this, we'll introduce the notion of a new non-terminal symbol called NP, which will stand for noun phrase. And this rewriting rule says that a noun phrase could be a noun--so something like she is a noun, and therefore, it can also be a noun phrase--but a noun phrase could also be a determiner, D, followed by a noun. So there are two ways we can have a noun phrase in this very simple grammar. Of course, the English language is more complex than just this, but here, a noun phrase is either a noun or it is a determiner followed by a noun.

So for the first case, a noun phrase that is just a noun, that would allow us to generate noun phrases like she, because a noun phrase is just a noun, and a noun could be the word she, for example. Meanwhile, if we wanted to look at an example where a noun phrase becomes a determiner and a noun, then we get a structure like this. And now we're starting to see the structure of language emerge from these rules in a syntax tree, as we'll call it--this tree-like structure that represents the syntax of our natural language. Here, we have a noun phrase, and this noun phrase is composed of a determiner and a noun, where the determiner is the word the, according to that rule, and the noun is the word city. So here, then, is a noun phrase that consists of multiple words inside of its structure.
And using this idea of taking one symbol and rewriting it using other symbols--which might be terminal symbols, like the and city, but might also be non-terminal symbols, like D for determiner or N for noun--we can begin to construct more and more complex structures. In addition to noun phrases, we can also think about verb phrases. So what might a verb phrase look like? Well, a verb phrase might just be a single verb. In a sentence like "I walked," walked is a verb, and it is acting as the verb phrase in that sentence. But there are also more complex verb phrases that aren't just a single word, but multiple words. If you think of a sentence like "she saw the city," for example, saw the city is really that entire verb phrase. It's capturing what it is that she is doing, for example.

And so our verb phrase might have a rule like this: a verb phrase is either just a plain verb, or it is a verb followed by a noun phrase. And we saw before that a noun phrase is either a noun or a determiner followed by a noun. So a verb phrase might be something simple, like a verb phrase that is just a verb, and that verb could be the word walked, for example. But it could also be something more sophisticated, something like this, where we begin to see a larger syntax tree--where the way to read the syntax tree is that a verb phrase is a verb and a noun phrase, where that verb could be something like saw. And here is a noun phrase we've seen before, the noun phrase the city--a noun phrase composed of the determiner the and the noun city, all put together to construct this larger verb phrase.

And then, just to give one more example of a rule, we could also have a rule like this: sentence S goes to a noun phrase and a verb phrase. The basic structure of a sentence is that it is a noun phrase followed by a verb phrase. And this is a formal grammar way of expressing the idea that you might have learned when you learned English grammar--that a sentence is a subject and a verb, a subject and an action, something that's happening to a particular noun phrase.
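Written out together, the small grammar we've built up so far might look like this, using the same arrow notation--where the word lists on the right-hand sides are just the example words we've mentioned, not an exhaustive vocabulary:

    S  -> NP VP
    NP -> N | D N
    VP -> V | V NP
    D  -> "the" | "a" | "an"
    N  -> "she" | "city" | "car" | "Harry"
    V  -> "saw" | "walked"

Reading it from the top down, a sentence is a noun phrase followed by a verb phrase, and every phrase eventually bottoms out in the terminal symbols--the actual words.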
And so using this structure, we could construct a sentence that looks like this. A sentence consists of a noun phrase and a verb phrase. The noun phrase could just be a noun, like the word she. The verb phrase could be a verb and a noun phrase, where--and this is something we've seen before--the verb is saw and the noun phrase is the city.

So now look what we've done here. By defining a set of rules, there are algorithms that we can run that take these words--the CYK algorithm is one example of this, if you want to look into it--where you start with a set of terminal symbols, like she saw the city, and then, using these rules, you're able to figure out how it is that you get from a sentence to she saw the city. And it's all through these rewriting rules. The sentence is a noun phrase and a verb phrase; a verb phrase could be a verb and a noun phrase; so on and so forth--where you can imagine taking this structure and figuring out how it is that you could generate a parse tree--a syntax tree--for that set of terminal symbols, that set of words.

And if you tried to do this for a sentence that was not grammatical, something like "saw the city she," well, that wouldn't work. There would be no way to use these rules to generate that sentence, because it is not inside of that language. So this sort of model can be very helpful if the rules are expressive enough to express all the ideas that you might want to express inside of natural language. Of course, using just the simple rules we have here, there are many sentences that we won't be able to generate--sentences that we might agree are grammatical and syntactically well-formed, but that we're not going to be able to construct using these rules. In that case, we might just need some more complex rules in order to deal with those sorts of cases. And so this type of approach can be powerful if you're dealing with a limited set of rules and words that you really care about dealing with.
And one way we can actually interact with this in Python is by using a Python library called NLTK, short for Natural Language Toolkit, which we'll see a couple of times today, and which has a wide variety of different functions and classes that we can take advantage of, all meant to deal with natural language. One such algorithm it has is the ability to parse a context-free grammar--to be able to take some words and figure out, according to some context-free grammar, how you would construct the syntax tree for them. So let's go ahead and take a look at NLTK now by examining how we might construct some context-free grammars with it.

So here, inside of cfg0--cfg is short for context-free grammar--I have a sample context-free grammar, which has rules that we've seen before. A sentence goes to a noun phrase followed by a verb phrase. A noun phrase is either a determiner and a noun, or a noun. A verb phrase is either a verb, or a verb and a noun phrase. The order of these things doesn't really matter. Determiners could be the word the or the word a. A noun could be the word she, city, or car. And a verb could be the word saw or the word walked.

Now, using NLTK, which I've imported here at the top, I'm going to go ahead and parse this grammar and save it inside of this variable called parser. Next, my program is going to ask the user for input--just type in a sentence--and dot split will split it on all of the spaces, so I end up getting each of the individual words. We're going to save that inside of this list called sentence. And then we'll go ahead and try to parse the sentence, and for each tree we parse out of it, we're going to pretty print it to the screen, just so it displays in my terminal, and we're also going to draw it. It turns out that NLTK has some graphics capability, so we can visually see what that tree looks like as well. And there are multiple different ways a sentence might be parsed, which is why we're putting it inside of this for loop--and we'll see why that can be helpful in a moment, too.

All right, now that I have that, let's go ahead and try it.
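Put together, the program just described might look roughly like this--a minimal sketch reconstructed from that description, so the actual cfg0.py may differ in its details:

    import nltk

    # The sample grammar described above: a sentence is a noun phrase
    # followed by a verb phrase, and phrases bottom out in actual words.
    grammar = nltk.CFG.fromstring("""
        S -> NP VP

        NP -> D N | N
        VP -> V | V NP

        D -> "the" | "a"
        N -> "she" | "city" | "car"
        V -> "saw" | "walked"
    """)

    parser = nltk.ChartParser(grammar)

    # Ask the user for a sentence and split it into words on spaces.
    sentence = input("Sentence: ").split()

    # There may be multiple ways to parse the sentence, so loop over
    # every tree, printing a text version and drawing it graphically.
    try:
        for tree in parser.parse(sentence):
            tree.pretty_print()
            tree.draw()
    except ValueError:
        # Raised if a word isn't covered by the grammar at all.
        print("No parse tree possible.")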
I'll cd into cfg, and we'll go ahead and run cfg0. It then prompts me to type in a sentence. Let me type in a very simple sentence--something like she walked, for example--and press Return. What I get is, on the left-hand side, a text-based representation of the syntax tree. And on the right side here--let me go ahead and make it bigger--we see a visual representation of that same syntax tree. This is how it is that my computer has now parsed the sentence she walked: it's a sentence that consists of a noun phrase and a verb phrase, where each phrase is just a single noun or verb, she and then walked--the same type of structure we've seen before. But this now is our computer able to understand the structure of the sentence, to be able to get some sort of structural understanding of how it is that parts of the sentence relate to each other.

Let me now give it another sentence. I could try something like she saw the city, for example--the words we were dealing with a moment ago. And then we end up getting this syntax tree out of it--again, a sentence that has a noun phrase and a verb phrase. The noun phrase is fairly simple--it's just she--but the verb phrase is more complex. It is now saw the city, for example.

Let's do one more with this grammar. Let's do something like she saw a car. And that is going to look very similar--we also get she, but our verb phrase is now different. It's saw a car, because there are multiple possible determiners in our language and multiple possible nouns. I haven't given this grammar that many words, but if I gave it a larger vocabulary, it would then be able to understand more and more different types of sentences.

And just to give you a sense of some added complexity we could add here: the more complex our grammar, the more rules we add, the more different types of sentences we'll then have the ability to generate. So let's take a look at cfg1, for example, where I've added a whole number of other different types of rules.
I've added adjective phrases, where we can have multiple adjectives inside of a noun phrase as well. So a noun phrase could be an adjective phrase followed by a noun phrase. If I wanted to say something like the big city, that's an adjective phrase followed by a noun phrase. Or we could also have a noun and a prepositional phrase--so the car on the street, for example. On the street is a prepositional phrase, and we might want to combine those two ideas together, because the car on the street can still operate as something kind of like a noun phrase as well. There's no need to understand all of these rules in too much detail--it starts to get into the nature of English grammar--but now we have a more complex way of understanding these types of sentences.

So if I run Python cfg1, I can try typing something like she saw the wide street, for example--a more complex sentence. And if we make that larger, you can see what this sentence looks like. I'll go ahead and shrink it a little bit. So now we have a sentence like this--she saw the wide street. The wide street is one entire noun phrase, saw the wide street is an entire verb phrase, and she saw the wide street ends up forming that entire sentence.

So let's take a look at one more example, to introduce this notion of ambiguity. I can run Python cfg1, and let me type a sentence like she saw a dog with binoculars. So there's a sentence, and here now is one possible syntax tree to represent this idea--she saw, the noun phrase a dog, and then the prepositional phrase with binoculars. And the way to interpret the sentence is that what it is that she saw was a dog. And how did she do the seeing? She did the seeing with binoculars. And so this is one possible way to interpret this: she was using binoculars, and using those binoculars, she saw a dog.
But another possible way to parse that sentence would be with this tree over here, where you have something like she saw a dog with binoculars, where a dog with binoculars forms an entire noun phrase of its own--the same words in the same order, but a different grammatical structure, where now we have a dog with binoculars all inside of this noun phrase. Meaning, what did she see? What she saw was a dog, and that dog happened to have binoculars with it. So there are different ways to parse the sentence--different structures for the sentence--even given the same possible sequence of words. And this particular NLTK algorithm has the ability to find all of these, to be able to understand the different ways that you might be able to parse a sentence, and to be able to extract some sort of useful meaning out of that sentence as well.

So that, then, is a brief look at what we can do with the structure of language--using these context-free grammar rules to be able to describe that structure. But what we might also care about is understanding how it is that these sequences of words are likely to relate to each other, in terms of the actual words themselves. The grammar that we saw before would allow us to generate a sentence like, I ate a banana, for example, where I is the noun phrase and ate a banana is a verb phrase. But it would also allow for sentences like, I ate a blue car, for example, which is also syntactically well-formed according to the rules, but is probably a much less likely sentence for a person to actually speak. And we might want our AI to be able to encapsulate the idea that certain sequences of words are more or less likely than others.

So to deal with that, we'll introduce the notion of an n-gram. An n-gram, most generally, just refers to some sequence of n items inside of our text. And those items might take various different forms. We can have character n-grams, which are just a contiguous sequence of n characters--so three characters in a row, for example, or four characters in a row. We can also have word n-grams, which are a contiguous sequence of n words in a row from a particular sample of text.
And these end up proving quite useful. You can choose n to decide how long our sequence is going to be. So when n is 1, we're just looking at a single word or a single character, and that is what we might call a unigram--just one item. If we're looking at two characters or two words, that's generally called a bigram--an n-gram where n is equal to 2, looking at two words that are consecutive. And then, if there are three items, we'll often call those trigrams--three characters in a row, or three words that happen to be in a contiguous sequence.

So if we took a sentence, for example--here's a sentence from, again, Sherlock Holmes--"how often have I said to you that, when you have eliminated the impossible, whatever remains, however improbable, must be the truth." What are the trigrams that we can extract from this sentence? If we're looking at sequences of three words, well, the first trigram would be how often have--just a sequence of three words. Then we can look at the next trigram, often have I. The next trigram is have I said. Then I said to, said to you, to you that, for example. Those are all word trigrams--sequences of three contiguous words that show up in the text.

And extracting those bigrams and trigrams--or n-grams, more generally--turns out to be quite helpful, because often, when we're analyzing a lot of text, it's not going to be particularly meaningful for us to try to analyze the entire text at one time. Instead, we want to segment that text into pieces that we can begin to do some analysis of. Our AI might never have seen this entire sentence before, but it's probably seen the trigram to you that before, because to you that is something that might have come up in other documents that our AI has seen. And therefore, it knows a little bit about that particular sequence of three words in a row--or something like have I said, another sequence of three words that's probably quite popular, in terms of where you see it inside the English language. So we'd like some way to be able to extract these sorts of n-grams.
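As a quick illustration--setting aside for a moment how we split text into words, which is exactly what we'll turn to next--extracting word trigrams from an already tokenized sentence is just a matter of sliding a window of three across the list. A minimal sketch:

    # A list of word tokens (how we obtain these is the next topic).
    tokens = ["how", "often", "have", "I", "said", "to", "you", "that"]

    # Slide a window of size n across the tokens to get the n-grams.
    n = 3
    trigrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    print(trigrams)
    # [('how', 'often', 'have'), ('often', 'have', 'I'), ('have', 'I', 'said'), ...]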
629 00:30:32,433 --> 00:30:33,350 And how do we do that? 630 00:30:33,350 --> 00:30:35,770 How do we extract sequences of three words? 631 00:30:35,770 --> 00:30:39,490 Well, we need to take our input and somehow separate it 632 00:30:39,490 --> 00:30:41,810 into all of the individual words. 633 00:30:41,810 --> 00:30:45,010 And this is a process generally known as tokenization, 634 00:30:45,010 --> 00:30:48,250 the task of splitting up some sequence into distinct pieces, 635 00:30:48,250 --> 00:30:50,440 where we call those pieces tokens. 636 00:30:50,440 --> 00:30:53,480 Most commonly, this refers to something like word tokenization. 637 00:30:53,480 --> 00:30:55,810 I have some sequence of text and I want to split it up 638 00:30:55,810 --> 00:30:58,810 into all of the words that show up in that text. 639 00:30:58,810 --> 00:31:01,240 But it might also come up in the context of something 640 00:31:01,240 --> 00:31:02,680 like sentence tokenization. 641 00:31:02,680 --> 00:31:05,950 I have a long sequence of text and I'd like to split it up 642 00:31:05,950 --> 00:31:08,050 into sentences, for example. 643 00:31:08,050 --> 00:31:11,260 And so how might word tokenization work, the task of splitting up 644 00:31:11,260 --> 00:31:13,660 our sequence of characters into words? 645 00:31:13,660 --> 00:31:15,640 Well, we've also already seen this idea. 646 00:31:15,640 --> 00:31:18,610 We've seen that, in word tokenization just a moment ago, I 647 00:31:18,610 --> 00:31:22,660 took an input sequence and I just called Python's split method on it, where 648 00:31:22,660 --> 00:31:25,360 the split method took that sequence of words 649 00:31:25,360 --> 00:31:29,880 and just separated it based on where the spaces showed up in that word. 650 00:31:29,880 --> 00:31:33,640 And so if I had a sentence like, whatever remains, however improbable, 651 00:31:33,640 --> 00:31:37,620 must be the truth, how would I tokenize this? 652 00:31:37,620 --> 00:31:41,460 Well, the naive approach is just to say, anytime you see a space, 653 00:31:41,460 --> 00:31:42,600 go ahead and split it up. 654 00:31:42,600 --> 00:31:46,800 We're going to split up this particular string just by looking for spaces. 655 00:31:46,800 --> 00:31:49,830 And what we get when we do that is a sentence like this-- 656 00:31:49,830 --> 00:31:53,660 whatever remains, however improbable, must be the truth. 657 00:31:53,660 --> 00:31:56,160 But what you'll notice here is that, if we just split things 658 00:31:56,160 --> 00:32:00,930 up in terms of where the spaces are, we end up keeping the punctuation around. 659 00:32:00,930 --> 00:32:02,960 There's a comma after the word remains. 660 00:32:02,960 --> 00:32:06,030 There's a comma after improbable, a period after truth. 661 00:32:06,030 --> 00:32:08,160 And this poses a little bit of a challenge, when 662 00:32:08,160 --> 00:32:11,820 we think about trying to tokenize things into individual words, 663 00:32:11,820 --> 00:32:15,150 because if you're comparing words to each other, this word 664 00:32:15,150 --> 00:32:16,712 truth with a period after it-- 665 00:32:16,712 --> 00:32:18,420 if you just string compare it, it's going 666 00:32:18,420 --> 00:32:21,270 to be different from the word truth without a period after it. 
667 00:32:21,270 --> 00:32:23,810 And so this punctuation can sometimes pose a problem for us, 668 00:32:23,810 --> 00:32:27,060 and so we might want some way of dealing with it-- either treating punctuation 669 00:32:27,060 --> 00:32:30,990 as a separate token altogether or maybe removing that punctuation entirely 670 00:32:30,990 --> 00:32:32,920 from our sequence as well. 671 00:32:32,920 --> 00:32:35,020 So that might be something we want to do. 672 00:32:35,020 --> 00:32:38,010 But there are other cases where it becomes a little bit less clear. 673 00:32:38,010 --> 00:32:40,680 If I said something like, just before 9 o'clock, 674 00:32:40,680 --> 00:32:43,110 Sherlock Holmes stepped briskly into the room, 675 00:32:43,110 --> 00:32:46,167 well, this apostrophe after 9 o'clock-- 676 00:32:46,167 --> 00:32:48,750 after the O in 9 o'clock-- is that something we should remove? 677 00:32:48,750 --> 00:32:52,080 Should we split based on that as well, and do O and clock? 678 00:32:52,080 --> 00:32:54,090 There are some interesting questions there too. 679 00:32:54,090 --> 00:32:57,360 And it gets even trickier if you begin to think about hyphenated words-- 680 00:32:57,360 --> 00:33:00,650 something like this, where we have a whole bunch of words 681 00:33:00,650 --> 00:33:03,840 that are hyphenated and then you need to make a judgment call. 682 00:33:03,840 --> 00:33:06,180 Is that a place where you're going to split things apart 683 00:33:06,180 --> 00:33:09,840 into individual words, or are you going to consider frock-coat, and well-cut, 684 00:33:09,840 --> 00:33:13,300 and pearl-grey to be individual words of their own? 685 00:33:13,300 --> 00:33:16,530 And so those tend to pose challenges that we need to somehow deal with 686 00:33:16,530 --> 00:33:19,890 and something we need to decide as we go about trying 687 00:33:19,890 --> 00:33:21,790 to perform this kind of analysis. 688 00:33:21,790 --> 00:33:25,950 Similar challenges arise when it comes to the world of sentence tokenization. 689 00:33:25,950 --> 00:33:29,410 Imagine this sequence of sentences, for example. 690 00:33:29,410 --> 00:33:31,927 If you take a look at this particular sequence of sentences, 691 00:33:31,927 --> 00:33:35,010 you could probably imagine you could extract the sentences pretty readily. 692 00:33:35,010 --> 00:33:38,060 Here is one sentence and here is a second sentence, 693 00:33:38,060 --> 00:33:43,060 so we have two different sentences inside of this particular passage. 694 00:33:43,060 --> 00:33:46,260 And the distinguishing feature seems to be the period-- 695 00:33:46,260 --> 00:33:48,963 that a period separates one sentence from another. 696 00:33:48,963 --> 00:33:50,880 And maybe there are other types of punctuation 697 00:33:50,880 --> 00:33:52,830 you might include here as well-- 698 00:33:52,830 --> 00:33:55,740 an exclamation point, for example, or a question mark. 699 00:33:55,740 --> 00:33:58,080 But those are the types of punctuation that we know 700 00:33:58,080 --> 00:34:00,750 tend to come at the end of sentences. 701 00:34:00,750 --> 00:34:04,410 But it gets trickier again if you look at a sentence like this-- not just 702 00:34:04,410 --> 00:34:07,140 talking to Sherlock, but instead of talking to Sherlock, 703 00:34:07,140 --> 00:34:09,449 talking to Mr. Holmes. 704 00:34:09,449 --> 00:34:11,313 Well now, we have a period at the end of Mr.
705 00:34:11,313 --> 00:34:13,230 And so if you were just separating on periods, 706 00:34:13,230 --> 00:34:15,570 you might imagine this would be a sentence, 707 00:34:15,570 --> 00:34:17,760 and then just Holmes would be a sentence, 708 00:34:17,760 --> 00:34:19,800 and then we'd have a third sentence down below. 709 00:34:19,800 --> 00:34:23,159 Things do get a little bit trickier as you start 710 00:34:23,159 --> 00:34:25,050 to imagine these sorts of situations. 711 00:34:25,050 --> 00:34:27,690 And dialogue too starts to make this trickier as well-- 712 00:34:27,690 --> 00:34:31,860 that if you have these sorts of lines that are inside of something that-- 713 00:34:31,860 --> 00:34:33,150 he said, for example-- 714 00:34:33,150 --> 00:34:35,639 that he said this particular sequence of words 715 00:34:35,639 --> 00:34:37,469 and then this particular sequence of words. 716 00:34:37,469 --> 00:34:40,170 There are interesting challenges that arise there too, 717 00:34:40,170 --> 00:34:42,389 in terms of how it is that we take the sentence 718 00:34:42,389 --> 00:34:46,268 and split it up into individual sentences as well. 719 00:34:46,268 --> 00:34:48,810 And these are just things that our algorithm needs to decide. 720 00:34:48,810 --> 00:34:51,370 In practice, there are usually some heuristics that we can use. 721 00:34:51,370 --> 00:34:53,610 We know there are certain occurrences of periods, 722 00:34:53,610 --> 00:34:56,580 like the period after Mr., or in other examples where 723 00:34:56,580 --> 00:34:59,010 we know that is not the beginning of a new sentence, 724 00:34:59,010 --> 00:35:01,770 and so we can encode those rules into our AI 725 00:35:01,770 --> 00:35:04,680 to allow it to be able to do this tokenization the way 726 00:35:04,680 --> 00:35:06,060 that we want it to. 727 00:35:06,060 --> 00:35:09,960 So once we have this ability to tokenize a particular passage-- 728 00:35:09,960 --> 00:35:12,930 take the passage, split it up into individual words-- 729 00:35:12,930 --> 00:35:17,110 from there, we can begin to extract what the n-grams actually are. 730 00:35:17,110 --> 00:35:20,190 So we can actually take a look at this by going 731 00:35:20,190 --> 00:35:23,250 into a Python program that will serve the purpose of extracting 732 00:35:23,250 --> 00:35:24,630 these n-grams. 733 00:35:24,630 --> 00:35:27,510 And again, we can use NLTK, the Natural Language Toolkit, in order 734 00:35:27,510 --> 00:35:28,720 to help us here. 735 00:35:28,720 --> 00:35:33,540 So I'll go ahead and go into ngrams and we'll take a look at ngrams.py. 736 00:35:33,540 --> 00:35:36,280 And what we have here is we are going to take 737 00:35:36,280 --> 00:35:39,190 some corpus of text, just some sequence of documents, 738 00:35:39,190 --> 00:35:43,960 and use all those documents and extract what the most popular n-grams happen 739 00:35:43,960 --> 00:35:44,800 to be. 740 00:35:44,800 --> 00:35:48,490 So in order to do so, we're going to go ahead and load data from a directory 741 00:35:48,490 --> 00:35:50,510 that we specify in the command line argument. 742 00:35:50,510 --> 00:35:53,170 We'll also take in a number n as a command line argument 743 00:35:53,170 --> 00:35:55,390 as well, in terms of what our number should be, 744 00:35:55,390 --> 00:36:00,480 in terms of how many words we're going to look at in sequence. 745 00:36:00,480 --> 00:36:05,330 Then we're going to go ahead and just count up all of the nltk.ngrams.
746 00:36:05,330 --> 00:36:09,170 So we're going to look at all of the grams across this entire corpus 747 00:36:09,170 --> 00:36:11,600 and save it inside this variable ngrams. 748 00:36:11,600 --> 00:36:14,090 And then we're going to look at the most common ones 749 00:36:14,090 --> 00:36:15,423 and go ahead and print them out. 750 00:36:15,423 --> 00:36:18,020 And so in order to do so, I'm not only using NLTK-- 751 00:36:18,020 --> 00:36:21,290 I'm also using Counter, which is built into Python as well, where I can just 752 00:36:21,290 --> 00:36:25,800 count up, how many times do these various different grams appear? 753 00:36:25,800 --> 00:36:27,480 So we'll go ahead and show that. 754 00:36:27,480 --> 00:36:31,500 We'll go into ngrams, and I'll say something like python ngrams-- 755 00:36:31,500 --> 00:36:34,020 and let's just first look for the unigrams, sequences 756 00:36:34,020 --> 00:36:37,000 of one word inside of a corpus. 757 00:36:37,000 --> 00:36:39,270 And the corpus that I've prepared is I have 758 00:36:39,270 --> 00:36:42,720 all of the-- or some of these stories from Sherlock Holmes 759 00:36:42,720 --> 00:36:47,140 all here, where each one is just one of the Sherlock Holmes stories. 760 00:36:47,140 --> 00:36:50,010 And so I have a whole bunch of text here inside of this corpus, 761 00:36:50,010 --> 00:36:54,270 and I'll go ahead and provide that corpus as a command line argument. 762 00:36:54,270 --> 00:36:55,980 And now what my program is going to do is 763 00:36:55,980 --> 00:36:59,000 it's going to load all of the Sherlock Holmes stories into memory-- 764 00:36:59,000 --> 00:37:01,500 or all the ones that I've provided in this corpus at least-- 765 00:37:01,500 --> 00:37:04,200 and it's just going to look for the most popular unigrams, 766 00:37:04,200 --> 00:37:07,050 the most popular sequences of one word. 767 00:37:07,050 --> 00:37:12,060 And it seems the most popular one is just the word the, used 9,700 times; 768 00:37:12,060 --> 00:37:15,930 followed by I, used 5,000 times; and, used about 5,000 times-- 769 00:37:15,930 --> 00:37:18,370 the kinds of words you might expect. 770 00:37:18,370 --> 00:37:24,900 So now let's go ahead and check for bigrams, for example, ngrams 2, holmes. 771 00:37:24,900 --> 00:37:28,740 All right, again, sequences of two words now that appear multiple times-- 772 00:37:28,740 --> 00:37:32,840 of the, in the, it was, to the, it is, I have-- so on and so forth. 773 00:37:32,840 --> 00:37:34,590 These are the types of bigrams that happen 774 00:37:34,590 --> 00:37:37,590 to come up quite often inside this corpus of the Sherlock 775 00:37:37,590 --> 00:37:38,400 Holmes stories. 776 00:37:38,400 --> 00:37:41,060 And it probably is true across other corpora as well, 777 00:37:41,060 --> 00:37:43,472 but we could only find out if we actually tested it. 778 00:37:43,472 --> 00:37:45,180 And now, just for good measure, let's try 779 00:37:45,180 --> 00:37:50,120 one more-- maybe try three, looking now for trigrams that happen to show up. 780 00:37:50,120 --> 00:37:54,570 And now we get it was the, one of the, I think that, out of the. 781 00:37:54,570 --> 00:37:56,850 These are sequences of three words now that 782 00:37:56,850 --> 00:38:00,900 happen to come up multiple times across this particular corpus. 783 00:38:00,900 --> 00:38:02,970 So what are the potential use cases here? 784 00:38:02,970 --> 00:38:04,440 Now we have some sort of data.
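Here is a minimal sketch of what an ngrams.py-style program can look like, assuming the value of n and a directory of plain-text files are passed as command line arguments-- an approximation of the program being described, not necessarily its exact code:

import os
import sys
from collections import Counter

import nltk  # word_tokenize also needs NLTK's "punkt" tokenizer data

n = int(sys.argv[1])
words = []
for filename in os.listdir(sys.argv[2]):
    with open(os.path.join(sys.argv[2], filename)) as f:
        # Lowercase the text and keep only tokens that contain letters.
        words.extend(
            token.lower()
            for token in nltk.word_tokenize(f.read())
            if any(c.isalpha() for c in token)
        )

# Count every contiguous sequence of n words and print the most common.
ngrams = Counter(nltk.ngrams(words, n))
for gram, count in ngrams.most_common(10):
    print(f"{count}: {gram}")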
785 00:38:04,440 --> 00:38:07,890 We have data about how often particular sequences of words 786 00:38:07,890 --> 00:38:11,010 show up in this particular order, and using that, 787 00:38:11,010 --> 00:38:13,410 we can begin to do some sort of predictions. 788 00:38:13,410 --> 00:38:18,090 We might be able to say that, if you see the words that it was, 789 00:38:18,090 --> 00:38:19,950 there's a reasonable chance the word that 790 00:38:19,950 --> 00:38:22,130 comes after it should be the word a. 791 00:38:22,130 --> 00:38:26,340 And if I see the words one of, it's reasonable to imagine 792 00:38:26,340 --> 00:38:29,190 that the next word might be the word the, for example, 793 00:38:29,190 --> 00:38:32,640 because we have this data about trigrams, sequences of three words 794 00:38:32,640 --> 00:38:33,900 and how often they come up. 795 00:38:33,900 --> 00:38:36,150 And now, based on two words, you might be 796 00:38:36,150 --> 00:38:40,110 able to predict what the third word happens to be. 797 00:38:40,110 --> 00:38:43,650 And one model we can use for that is a model we've actually seen before. 798 00:38:43,650 --> 00:38:45,280 It's the Markov model. 799 00:38:45,280 --> 00:38:47,100 Recall again that the Markov model really 800 00:38:47,100 --> 00:38:50,010 just refers to some sequence of events that happen one time 801 00:38:50,010 --> 00:38:54,150 step after another time step, where every unit has some ability 802 00:38:54,150 --> 00:38:57,150 to predict what the next unit is going to be-- 803 00:38:57,150 --> 00:39:00,330 or maybe the past two units predict what the next unit is going to be, 804 00:39:00,330 --> 00:39:03,270 or the past three predict what the next one is going to be. 805 00:39:03,270 --> 00:39:05,490 And we can use a Markov model and apply it 806 00:39:05,490 --> 00:39:08,100 to language for a very naive and simple approach 807 00:39:08,100 --> 00:39:11,340 at trying to generate natural language, at getting our AI 808 00:39:11,340 --> 00:39:14,340 to be able to speak English-like text. 809 00:39:14,340 --> 00:39:18,360 And the way it's going to work is we're going to say something like, come up 810 00:39:18,360 --> 00:39:20,280 with some probability distribution. 811 00:39:20,280 --> 00:39:23,070 Given these two words, what is the probability 812 00:39:23,070 --> 00:39:25,830 distribution over what the third word could possibly 813 00:39:25,830 --> 00:39:27,240 be based on all the data? 814 00:39:27,240 --> 00:39:30,660 If you see it was, what are the possible third words we might have? 815 00:39:30,660 --> 00:39:32,190 How often do they come up? 816 00:39:32,190 --> 00:39:35,070 And using that information, we can try and construct 817 00:39:35,070 --> 00:39:37,450 what we expect the third word to be. 818 00:39:37,450 --> 00:39:39,270 And if you keep doing this, the effect is 819 00:39:39,270 --> 00:39:42,030 that our Markov model can effectively start 820 00:39:42,030 --> 00:39:45,330 to generate text-- can be able to generate text that 821 00:39:45,330 --> 00:39:48,330 was not in the original corpus, but that sounds 822 00:39:48,330 --> 00:39:49,770 kind of like the original corpus. 823 00:39:49,770 --> 00:39:54,130 It's using the same sorts of rules that the original corpus was using. 824 00:39:54,130 --> 00:39:56,370 So let's take a look at an example of that 825 00:39:56,370 --> 00:40:01,740 as well, where here now, I have another corpus that I have here, 826 00:40:01,740 --> 00:40:04,990 and it is the corpus of all of the works of William Shakespeare.
827 00:40:04,990 --> 00:40:09,900 So I've got a whole bunch of stories from Shakespeare, and all of them 828 00:40:09,900 --> 00:40:12,610 are just inside of this big text file. 829 00:40:12,610 --> 00:40:16,590 And so what I might like to do is look at what all of the n-grams are-- 830 00:40:16,590 --> 00:40:20,400 maybe look at all the trigrams inside of shakespeare.txt-- 831 00:40:20,400 --> 00:40:23,040 and figure out, given two words, can I predict 832 00:40:23,040 --> 00:40:24,548 what the third word is likely to be? 833 00:40:24,548 --> 00:40:26,340 And then just keep repeating this process-- 834 00:40:26,340 --> 00:40:27,240 I have two words-- 835 00:40:27,240 --> 00:40:29,400 predict the third word; then, from the second and third word, 836 00:40:29,400 --> 00:40:31,900 predict the fourth word; and from the third and fourth word, 837 00:40:31,900 --> 00:40:36,090 predict the fifth word, ultimately generating random sentences that 838 00:40:36,090 --> 00:40:39,420 sound like Shakespeare, that are using similar patterns of words 839 00:40:39,420 --> 00:40:43,140 that Shakespeare used, but that never actually showed up in Shakespeare 840 00:40:43,140 --> 00:40:44,770 itself. 841 00:40:44,770 --> 00:40:47,640 And so to do so, I'll show you generator.py, 842 00:40:47,640 --> 00:40:50,910 which, again, is just going to read data from a particular file. 843 00:40:50,910 --> 00:40:54,210 And I'm using a Python library called markovify, which is just 844 00:40:54,210 --> 00:40:56,050 going to do this process for me. 845 00:40:56,050 --> 00:40:59,370 So there are libraries out there that can just train on a bunch of text 846 00:40:59,370 --> 00:41:02,978 and come up with a Markov model based on that text. 847 00:41:02,978 --> 00:41:04,770 And I'm going to go ahead and just generate 848 00:41:04,770 --> 00:41:07,920 five randomly generated sentences. 849 00:41:07,920 --> 00:41:11,850 So we'll go ahead and go into markov. 850 00:41:11,850 --> 00:41:14,750 I'll run the generator on shakespeare.txt. 851 00:41:14,750 --> 00:41:18,290 What we'll see is it's going to load that data, and then here's what we get. 852 00:41:18,290 --> 00:41:21,320 We get five different sentences, and these 853 00:41:21,320 --> 00:41:24,890 are sentences that never showed up in any Shakespeare play, 854 00:41:24,890 --> 00:41:27,680 but that are designed to sound like Shakespeare, 855 00:41:27,680 --> 00:41:30,320 that are designed to just take two words and predict, 856 00:41:30,320 --> 00:41:34,100 given those two words, what would Shakespeare have been likely to choose 857 00:41:34,100 --> 00:41:35,517 as the third word that follows it. 858 00:41:35,517 --> 00:41:38,100 And you know, these sentences probably don't have any meaning. 859 00:41:38,100 --> 00:41:41,600 It's not like the AI is trying to express any sort of underlying meaning 860 00:41:41,600 --> 00:41:42,110 here. 861 00:41:42,110 --> 00:41:44,870 It's just trying to understand, based on the sequence 862 00:41:44,870 --> 00:41:50,190 of words, what is likely to come after it as a next word, for example. 863 00:41:50,190 --> 00:41:53,593 And these are the types of sentences that it's able to come up with, 864 00:41:53,593 --> 00:41:54,260 just generating. 865 00:41:54,260 --> 00:41:58,100 And if you ran this multiple times, you would end up getting different results.
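Here is a minimal sketch of a generator.py-style script using markovify (pip install markovify)-- the actual program may differ, but these library calls are the standard ones:

import sys

import markovify

# Read the training corpus (e.g., shakespeare.txt) given on the command line.
with open(sys.argv[1]) as f:
    text = f.read()

# Train a Markov model on the corpus.
text_model = markovify.Text(text)

# Generate five sentences in the style of the corpus.
for _ in range(5):
    print(text_model.make_sentence())  # can return None if generation fails
    print()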
866 00:41:58,100 --> 00:42:01,580 I could run this again and get an entirely different set 867 00:42:01,580 --> 00:42:04,100 of five different sentences that also are 868 00:42:04,100 --> 00:42:08,810 supposed to sound kind of like the way that Shakespeare's sentences sounded 869 00:42:08,810 --> 00:42:10,340 as well. 870 00:42:10,340 --> 00:42:12,430 And so that then was a look at how it is we 871 00:42:12,430 --> 00:42:16,580 can use Markov models to be able to naively attempt to generate language. 872 00:42:16,580 --> 00:42:18,580 The language doesn't mean a whole lot right now. 873 00:42:18,580 --> 00:42:21,430 You wouldn't want to use the system in its current form 874 00:42:21,430 --> 00:42:23,200 to do something like machine translation, 875 00:42:23,200 --> 00:42:26,020 because it wouldn't be able to encapsulate any meaning, 876 00:42:26,020 --> 00:42:30,240 but we're starting to see now that our AI is getting a little bit better 877 00:42:30,240 --> 00:42:31,990 at trying to speak our language, at trying 878 00:42:31,990 --> 00:42:36,500 to be able to process natural language in some sort of meaningful way. 879 00:42:36,500 --> 00:42:38,830 So we'll now take a look at a couple of other tasks 880 00:42:38,830 --> 00:42:41,140 that we might want our AI to be able to perform. 881 00:42:41,140 --> 00:42:44,920 And one such task is text categorization, which really is just 882 00:42:44,920 --> 00:42:46,138 a classification problem. 883 00:42:46,138 --> 00:42:48,430 And we've talked about classification problems already, 884 00:42:48,430 --> 00:42:51,670 these problems where we would like to take some object 885 00:42:51,670 --> 00:42:54,540 and categorize it into a number of different classes. 886 00:42:54,540 --> 00:42:58,750 And so the way this comes up in text is anytime you have some sample of text 887 00:42:58,750 --> 00:43:02,080 and you want to put it inside of a category, where I want to say something 888 00:43:02,080 --> 00:43:06,760 like, given an email, does it belong in the inbox or does it belong in spam? 889 00:43:06,760 --> 00:43:08,890 Which of these two categories does it belong in? 890 00:43:08,890 --> 00:43:12,250 And you do that by looking at the text and being 891 00:43:12,250 --> 00:43:16,660 able to do some sort of analysis on that text to be able to draw conclusions, 892 00:43:16,660 --> 00:43:20,200 to be able to say that, given the words that show up in the email, 893 00:43:20,200 --> 00:43:22,510 I think this is probably belonging in the inbox, 894 00:43:22,510 --> 00:43:25,825 or I think it probably belongs in spam instead. 895 00:43:25,825 --> 00:43:27,700 And you might imagine doing this for a number 896 00:43:27,700 --> 00:43:30,910 of different types of classification problems of this sort. 897 00:43:30,910 --> 00:43:34,360 So you might imagine that another common example of this type of idea 898 00:43:34,360 --> 00:43:37,690 is something like sentiment analysis, where I want to analyze, 899 00:43:37,690 --> 00:43:41,880 given a sample of text, does it have a positive sentiment 900 00:43:41,880 --> 00:43:43,780 or does it have a negative sentiment?
901 00:43:43,780 --> 00:43:47,082 And this might come up in the case of product reviews on a website, 902 00:43:47,082 --> 00:43:50,290 for example, or feedback on a website, where you have a whole bunch of data-- 903 00:43:50,290 --> 00:43:53,230 samples of text that are provided by users of a website-- 904 00:43:53,230 --> 00:43:57,010 and you want to be able to quickly analyze, are these reviews positive, 905 00:43:57,010 --> 00:43:59,710 are the reviews negative-- what is it that people 906 00:43:59,710 --> 00:44:03,460 are saying-- just to get a general sense, 907 00:44:03,460 --> 00:44:08,840 to be able to categorize text into one of these two different categories. 908 00:44:08,840 --> 00:44:10,630 So how might we approach this problem? 909 00:44:10,630 --> 00:44:13,010 Well, let's take a look at some sample product reviews. 910 00:44:13,010 --> 00:44:16,000 Here are some sample product reviews that we might come up with. 911 00:44:16,000 --> 00:44:16,930 My grandson loved it. 912 00:44:16,930 --> 00:44:17,890 So much fun. 913 00:44:17,890 --> 00:44:20,290 Product broke after a few days. 914 00:44:20,290 --> 00:44:22,368 One of the best games I've played in a long time. 915 00:44:22,368 --> 00:44:23,410 Kind of cheap and flimsy. 916 00:44:23,410 --> 00:44:24,400 Not worth it. 917 00:44:24,400 --> 00:44:28,360 Different product reviews that you might imagine seeing on Amazon, or eBay, 918 00:44:28,360 --> 00:44:31,690 or some other website where people are selling products, for instance. 919 00:44:31,690 --> 00:44:34,480 And we humans can pretty easily categorize these 920 00:44:34,480 --> 00:44:37,060 into positive sentiment or negative sentiment. 921 00:44:37,060 --> 00:44:39,790 We'd probably say that the first and the third one, those 922 00:44:39,790 --> 00:44:41,620 are positive sentiment messages. 923 00:44:41,620 --> 00:44:44,380 The second one and the fourth one, those are probably 924 00:44:44,380 --> 00:44:46,060 negative sentiment messages. 925 00:44:46,060 --> 00:44:48,680 But how could a computer do the same thing? 926 00:44:48,680 --> 00:44:53,470 How could it try and take these reviews and assess, are they positive 927 00:44:53,470 --> 00:44:55,420 or are they negative? 928 00:44:55,420 --> 00:44:57,940 Well, ultimately, it depends upon the words 929 00:44:57,940 --> 00:45:02,530 that happen to be in these particular reviews-- inside 930 00:45:02,530 --> 00:45:03,850 of these particular sentences. 931 00:45:03,850 --> 00:45:06,040 For now we're going to ignore the structure 932 00:45:06,040 --> 00:45:08,120 and how the words are related to each other, 933 00:45:08,120 --> 00:45:11,230 and we're just going to focus on what the words actually are. 934 00:45:11,230 --> 00:45:14,710 So there are probably some key words here, words like loved, 935 00:45:14,710 --> 00:45:16,330 and fun, and best. 936 00:45:16,330 --> 00:45:20,770 Those probably show up in more positive reviews, whereas words 937 00:45:20,770 --> 00:45:23,137 like broke, and cheap, and flimsy-- 938 00:45:23,137 --> 00:45:24,970 well, those are words that probably are more 939 00:45:24,970 --> 00:45:29,930 likely to come up inside of negative reviews, instead of positive reviews.
940 00:45:29,930 --> 00:45:33,550 So one way to approach this sort of text analysis idea 941 00:45:33,550 --> 00:45:37,900 is to say, let's, for now, ignore the structures of these sentences-- to say, 942 00:45:37,900 --> 00:45:40,870 we're not going to care about how it is the words relate to each other. 943 00:45:40,870 --> 00:45:43,540 We're not going to try and parse these sentences to construct 944 00:45:43,540 --> 00:45:45,850 the grammatical structure like we saw a moment ago. 945 00:45:45,850 --> 00:45:49,060 But we can probably just rely on the words that were actually 946 00:45:49,060 --> 00:45:52,000 used-- rely on the fact that the positive reviews are 947 00:45:52,000 --> 00:45:54,820 more likely to have words like best, and loved, and fun, 948 00:45:54,820 --> 00:45:58,360 and that the negative reviews are more likely to have the negative words 949 00:45:58,360 --> 00:46:00,017 that we've highlighted there as well. 950 00:46:00,017 --> 00:46:03,100 And this sort of model-- this approach to trying to think about language-- 951 00:46:03,100 --> 00:46:05,610 is generally known as the bag of words model, 952 00:46:05,610 --> 00:46:09,023 where we're going to model a sample of text not by caring about its structure, 953 00:46:09,023 --> 00:46:12,970 but just by caring about the unordered collection of words that 954 00:46:12,970 --> 00:46:16,060 show up inside of a sample-- that all we care about 955 00:46:16,060 --> 00:46:18,040 is what words are in the text. 956 00:46:18,040 --> 00:46:20,552 And we don't care about what the order of those words is. 957 00:46:20,552 --> 00:46:22,510 We don't care about the structure of the words. 958 00:46:22,510 --> 00:46:25,210 We don't care what noun goes with what adjective 959 00:46:25,210 --> 00:46:26,870 or how things agree with each other. 960 00:46:26,870 --> 00:46:28,830 We just care about the words. 961 00:46:28,830 --> 00:46:31,120 And it turns out this approach tends to work 962 00:46:31,120 --> 00:46:34,810 pretty well for doing classifications like positive sentiment 963 00:46:34,810 --> 00:46:36,142 or negative sentiment. 964 00:46:36,142 --> 00:46:38,350 And you could imagine doing this in a number of ways. 965 00:46:38,350 --> 00:46:41,740 We've talked about different approaches to trying to solve classification style 966 00:46:41,740 --> 00:46:43,870 problems, but when it comes to natural language, 967 00:46:43,870 --> 00:46:48,110 one of the most popular approaches is the naive Bayes approach. 968 00:46:48,110 --> 00:46:52,530 And this is one approach to trying to analyze the probability that something 969 00:46:52,530 --> 00:46:54,940 is positive sentiment or negative sentiment, 970 00:46:54,940 --> 00:46:58,515 or just trying to categorize some text into possible categories. 971 00:46:58,515 --> 00:47:01,390 And it doesn't just work for text-- it works for other types of ideas 972 00:47:01,390 --> 00:47:03,550 as well-- but it is quite popular in the world 973 00:47:03,550 --> 00:47:05,980 of analyzing text and natural language. 974 00:47:05,980 --> 00:47:09,450 And the naive Bayes approach is based on Bayes' rule, which 975 00:47:09,450 --> 00:47:11,950 you might recall from when we talked about probability, 976 00:47:11,950 --> 00:47:14,020 that Bayes' rule looks like this-- 977 00:47:14,020 --> 00:47:17,690 that the probability of some event b, given a, 978 00:47:17,690 --> 00:47:20,320 can be expressed using this expression over here.
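Written out symbolically, the rule being referred to is:

$$P(b \mid a) = \frac{P(a \mid b)\,P(b)}{P(a)}$$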
979 00:47:20,320 --> 00:47:25,150 Probability of b given a is the probability of a given b multiplied 980 00:47:25,150 --> 00:47:28,590 by the probability of b divided by the probability of a. 981 00:47:28,590 --> 00:47:32,290 And we saw that this came about as a result of just the definition 982 00:47:32,290 --> 00:47:35,740 of conditional probability and looking at what it means for two events 983 00:47:35,740 --> 00:47:37,010 to happen together. 984 00:47:37,010 --> 00:47:40,038 This was our formulation then of Bayes' rule, which 985 00:47:40,038 --> 00:47:41,330 turned out to be quite helpful. 986 00:47:41,330 --> 00:47:43,990 We were able to predict one event in terms of another 987 00:47:43,990 --> 00:47:49,218 by flipping the order of those events inside of this probability calculation. 988 00:47:49,218 --> 00:47:51,760 And it turns out this approach is going to be quite helpful-- 989 00:47:51,760 --> 00:47:53,110 and we'll see why in a moment-- 990 00:47:53,110 --> 00:47:55,330 for being able to do this sort of sentiment analysis, 991 00:47:55,330 --> 00:47:58,750 because I want to say, you know, what is the probability 992 00:47:58,750 --> 00:48:02,350 that a message is positive, or what is the probability 993 00:48:02,350 --> 00:48:03,727 that the message is negative? 994 00:48:03,727 --> 00:48:06,310 And I'll go ahead and simplify this using the emojis, just 995 00:48:06,310 --> 00:48:10,450 for simplicity-- probability of positive, probability of negative. 996 00:48:10,450 --> 00:48:12,340 And that is what I would like to calculate, 997 00:48:12,340 --> 00:48:15,310 but I'd like to calculate that given some information-- 998 00:48:15,310 --> 00:48:18,940 given information like here is a sample of text-- 999 00:48:18,940 --> 00:48:20,440 my grandson loved it. 1000 00:48:20,440 --> 00:48:24,280 And I would like to know not just what is the probability that any message is 1001 00:48:24,280 --> 00:48:27,880 positive, but what is the probability that the message is positive, 1002 00:48:27,880 --> 00:48:32,890 given my grandson loved it as the text of the sample? 1003 00:48:32,890 --> 00:48:36,340 So given this information that inside the sample are the words my grandson 1004 00:48:36,340 --> 00:48:41,860 loved it, what is the probability then that this is a positive message? 1005 00:48:41,860 --> 00:48:44,650 Well, according to the bag of words model, what we're going to do 1006 00:48:44,650 --> 00:48:46,930 is really ignore the ordering of the words-- 1007 00:48:46,930 --> 00:48:50,420 not treat this as a single sentence that has some structure to it, 1008 00:48:50,420 --> 00:48:52,750 but just treat it as a whole bunch of different words. 1009 00:48:52,750 --> 00:48:55,180 We're going to say something like, what is the probability 1010 00:48:55,180 --> 00:48:58,420 that this is a positive message, given that the word my 1011 00:48:58,420 --> 00:49:01,810 was in the message, given that the word grandson was in the message, 1012 00:49:01,810 --> 00:49:05,520 given that the word loved was in the message, and given that the word it 1013 00:49:05,520 --> 00:49:06,380 was in the message? 1014 00:49:06,380 --> 00:49:07,720 The bag of words model here-- 1015 00:49:07,720 --> 00:49:11,380 we're treating the entire sample as just a whole bunch 1016 00:49:11,380 --> 00:49:12,740 of different words.
1017 00:49:12,740 --> 00:49:15,910 And so this then is what I'd like to calculate, this probability-- 1018 00:49:15,910 --> 00:49:18,610 given all those words, what is the probability 1019 00:49:18,610 --> 00:49:20,920 that this is a positive message? 1020 00:49:20,920 --> 00:49:23,530 And this is where we can now apply Bayes' rule. 1021 00:49:23,530 --> 00:49:28,315 This is really the probability of some b, given some a. 1022 00:49:28,315 --> 00:49:30,400 And that now is what I'd like to calculate. 1023 00:49:30,400 --> 00:49:34,723 So according to Bayes' rule, this whole expression is equal to-- 1024 00:49:34,723 --> 00:49:35,890 well, it's the probability-- 1025 00:49:35,890 --> 00:49:37,420 I switched the order of them-- 1026 00:49:37,420 --> 00:49:40,270 it's the probability of all of these words, 1027 00:49:40,270 --> 00:49:42,910 given that it's a positive message, multiplied 1028 00:49:42,910 --> 00:49:46,930 by the probability that it is a positive message, divided 1029 00:49:46,930 --> 00:49:49,575 by the probability of all of those words. 1030 00:49:49,575 --> 00:49:51,700 So this then is just an application of Bayes' rule. 1031 00:49:51,700 --> 00:49:56,680 We've already seen where I want to express the probability of positive, 1032 00:49:56,680 --> 00:50:02,440 given the words, as related somehow to the probability of the words, 1033 00:50:02,440 --> 00:50:04,718 given that it's a positive message. 1034 00:50:04,718 --> 00:50:06,760 And it turns out-- as you might recall from back 1035 00:50:06,760 --> 00:50:09,965 when we talked about probability-- that this denominator is 1036 00:50:09,965 --> 00:50:10,840 going to be the same. 1037 00:50:10,840 --> 00:50:13,840 Regardless of whether we're looking at positive or negative messages, 1038 00:50:13,840 --> 00:50:15,850 the probability of these words doesn't change, 1039 00:50:15,850 --> 00:50:18,805 because we don't have a positive or negative down below. 1040 00:50:18,805 --> 00:50:20,680 So we can just say that, rather than saying 1041 00:50:20,680 --> 00:50:23,980 that this expression up here is equal to this expression down below, 1042 00:50:23,980 --> 00:50:27,130 it's really just proportional to the numerator. 1043 00:50:27,130 --> 00:50:29,530 We can ignore the denominator for now. 1044 00:50:29,530 --> 00:50:32,770 Using the denominator would get us an exact probability. 1045 00:50:32,770 --> 00:50:34,780 But it turns out that what we'll really just do 1046 00:50:34,780 --> 00:50:38,780 is figure out what the probability is proportional to, and at the end, 1047 00:50:38,780 --> 00:50:41,500 we'll have to normalize the probability distribution-- make 1048 00:50:41,500 --> 00:50:46,270 sure the probability distribution ultimately sums up to the number 1. 1049 00:50:46,270 --> 00:50:49,730 So now I've been able to formulate this probability-- 1050 00:50:49,730 --> 00:50:51,520 which is what I want to care about-- 1051 00:50:51,520 --> 00:50:56,530 as proportional to multiplying these two things together-- probability of words, 1052 00:50:56,530 --> 00:51:01,580 given positive message, multiplied by the probability of positive message.
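In symbols, dropping the denominator leaves the proportionality:

$$P(\text{positive} \mid \text{words}) \;\propto\; P(\text{words} \mid \text{positive})\,P(\text{positive})$$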
1053 00:51:01,580 --> 00:51:04,060 But again, if you think back to our probability rules, 1054 00:51:04,060 --> 00:51:09,070 we can calculate this really as just a joint probability of all of these 1055 00:51:09,070 --> 00:51:14,140 things happening-- that the probability of positive message multiplied 1056 00:51:14,140 --> 00:51:17,470 by the probability of these words, given the positive message-- 1057 00:51:17,470 --> 00:51:20,890 well, that's just the joint probability of all of these things. 1058 00:51:20,890 --> 00:51:23,530 This is the same thing as the probability 1059 00:51:23,530 --> 00:51:27,670 that it's a positive message, and my is in the sample, 1060 00:51:27,670 --> 00:51:30,820 and grandson is in the sample, and loved is in the sample, 1061 00:51:30,820 --> 00:51:33,160 and it is in the sample. 1062 00:51:33,160 --> 00:51:36,640 So using that rule for the definition of joint probability, 1063 00:51:36,640 --> 00:51:40,630 I've been able to say that this entire expression is now 1064 00:51:40,630 --> 00:51:43,570 proportional to this sequence-- 1065 00:51:43,570 --> 00:51:47,530 this joint probability of these words and this positive that's 1066 00:51:47,530 --> 00:51:49,670 in there as well. 1067 00:51:49,670 --> 00:51:51,790 And so now the interesting question is just how 1068 00:51:51,790 --> 00:51:54,050 to calculate that joint probability. 1069 00:51:54,050 --> 00:51:55,870 How do I figure out the probability that, 1070 00:51:55,870 --> 00:51:59,980 given some arbitrary message, that it is positive, and the word my is in there, 1071 00:51:59,980 --> 00:52:03,040 and the word grandson is in there, and the word loved is in there, 1072 00:52:03,040 --> 00:52:04,740 and the word it is in there? 1073 00:52:04,740 --> 00:52:07,990 Well, you'll recall that we can calculate a joint probability 1074 00:52:07,990 --> 00:52:12,480 by multiplying together all of these conditional probabilities. 1075 00:52:12,480 --> 00:52:16,350 If I want to know the probability of a, and b, and c, 1076 00:52:16,350 --> 00:52:19,530 I can calculate that as the probability of a times 1077 00:52:19,530 --> 00:52:24,300 the probability of b, given a, times the probability of c, given a and b. 1078 00:52:24,300 --> 00:52:27,570 I can just multiply these conditional probabilities together 1079 00:52:27,570 --> 00:52:31,290 in order to get the overall joint probability that I care about. 1080 00:52:31,290 --> 00:52:32,790 And we could do the same thing here. 1081 00:52:32,790 --> 00:52:35,340 I could say, let's multiply the probability 1082 00:52:35,340 --> 00:52:39,180 of positive by the probability of the word my showing up in the message, 1083 00:52:39,180 --> 00:52:42,810 given that it's positive, multiplied by the probability of grandson 1084 00:52:42,810 --> 00:52:45,550 showing up in the message, given that the word my is in there 1085 00:52:45,550 --> 00:52:48,930 and that it's positive, multiplied by the probability of loved, 1086 00:52:48,930 --> 00:52:51,930 given these three things, multiplied by the probability of it, 1087 00:52:51,930 --> 00:52:53,500 given these four things. 1088 00:52:53,500 --> 00:52:56,882 And that's going to end up being a fairly complex calculation to make, 1089 00:52:56,882 --> 00:52:58,590 one that we probably aren't going to have 1090 00:52:58,590 --> 00:53:00,210 a good way of knowing the answer to.
1091 00:53:00,210 --> 00:53:04,140 What is the probability that grandson is in the message, given 1092 00:53:04,140 --> 00:53:08,010 that it is positive and the word my is in the message? 1093 00:53:08,010 --> 00:53:12,040 That's not something we're really going to have a readily easy answer to, 1094 00:53:12,040 --> 00:53:15,270 and so this is where the naive part of naive Bayes comes about. 1095 00:53:15,270 --> 00:53:16,950 We're going to simplify this notion. 1096 00:53:16,950 --> 00:53:20,340 Rather than compute exactly what that probability distribution is, 1097 00:53:20,340 --> 00:53:23,880 we're going to assume that these words are 1098 00:53:23,880 --> 00:53:26,710 going to be effectively independent of each other, 1099 00:53:26,710 --> 00:53:28,980 if we know that it's already a positive message. 1100 00:53:28,980 --> 00:53:32,670 If it's a positive message, it doesn't change the probability 1101 00:53:32,670 --> 00:53:34,620 that the word grandson is in the message, 1102 00:53:34,620 --> 00:53:37,620 if I know that the word loved is in the message, for example. 1103 00:53:37,620 --> 00:53:39,750 And that might not necessarily be true in practice. 1104 00:53:39,750 --> 00:53:41,610 In the real world, it might not be the case 1105 00:53:41,610 --> 00:53:43,650 that these words are actually independent, 1106 00:53:43,650 --> 00:53:45,960 but we're going to assume it to simplify our model. 1107 00:53:45,960 --> 00:53:48,030 And it turns out that simplification still 1108 00:53:48,030 --> 00:53:51,590 lets us get pretty good results out of it as well. 1109 00:53:51,590 --> 00:53:55,320 And what we're going to assume is that the probability that all of these words 1110 00:53:55,320 --> 00:53:58,690 show up depends only on whether it's positive or negative. 1111 00:53:58,690 --> 00:54:01,170 I can still say that loved is more likely to come up 1112 00:54:01,170 --> 00:54:04,510 in a positive message than a negative message, which is probably true, 1113 00:54:04,510 --> 00:54:08,010 but we're also going to say that it's not going to change whether or not 1114 00:54:08,010 --> 00:54:12,020 loved is more likely or less likely to come up if I know that the word my is 1115 00:54:12,020 --> 00:54:13,643 in the message, for example. 1116 00:54:13,643 --> 00:54:16,060 And so those are the assumptions that we're going to make. 1117 00:54:16,060 --> 00:54:20,310 So while the top expression is proportional to this bottom expression, 1118 00:54:20,310 --> 00:54:24,750 we're going to say it's naively proportional to this expression, 1119 00:54:24,750 --> 00:54:27,480 probability of being a positive message. 1120 00:54:27,480 --> 00:54:30,300 And then, for each of the words that show up in the sample, 1121 00:54:30,300 --> 00:54:33,270 I'm going to multiply what's the probability that my 1122 00:54:33,270 --> 00:54:35,370 is in the message, given that it's positive, 1123 00:54:35,370 --> 00:54:37,980 times the probability of grandson being in the message, given 1124 00:54:37,980 --> 00:54:40,050 that it's positive-- and then so on and so forth 1125 00:54:40,050 --> 00:54:44,040 for the other words that happen to be inside of the sample. 1126 00:54:44,040 --> 00:54:47,580 And it turns out that these are numbers that we can calculate.
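Putting the naive independence assumption together with the earlier proportionality, the quantity we actually compute for a sample containing words $w_1, \dots, w_n$ is:

$$P(\text{positive} \mid w_1, \dots, w_n) \;\propto\; P(\text{positive}) \prod_{i=1}^{n} P(w_i \mid \text{positive})$$

and the analogous product for the negative category.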
1127 00:54:47,580 --> 00:54:50,640 The reason we've done all of this math is to get to this point, 1128 00:54:50,640 --> 00:54:54,870 to be able to calculate this probability distribution that we care about, 1129 00:54:54,870 --> 00:54:58,410 given these terms that we can actually calculate. 1130 00:54:58,410 --> 00:55:02,250 And we can calculate them, given some data available to us. 1131 00:55:02,250 --> 00:55:04,530 And this is what a lot of natural language processing 1132 00:55:04,530 --> 00:55:05,590 is about these days. 1133 00:55:05,590 --> 00:55:07,330 It's about analyzing data. 1134 00:55:07,330 --> 00:55:10,440 If I give you a whole bunch of data with a whole bunch of reviews, 1135 00:55:10,440 --> 00:55:13,380 and I've labeled them as positive or negative, 1136 00:55:13,380 --> 00:55:17,250 then you can begin to calculate these particular terms. 1137 00:55:17,250 --> 00:55:20,490 I can calculate the probability that a message is positive just 1138 00:55:20,490 --> 00:55:22,710 by looking at my data and saying, how many 1139 00:55:22,710 --> 00:55:26,250 positive samples were there, and divide that by the number of total samples. 1140 00:55:26,250 --> 00:55:29,477 That is my probability that a message is positive. 1141 00:55:29,477 --> 00:55:32,310 What is the probability that the word loved is in the message, given 1142 00:55:32,310 --> 00:55:33,330 that it's positive? 1143 00:55:33,330 --> 00:55:35,490 Well, I can calculate that based on my data too. 1144 00:55:35,490 --> 00:55:38,970 Let me just look at how many positive samples have the word loved in it 1145 00:55:38,970 --> 00:55:41,730 and divide that by my total number of positive samples. 1146 00:55:41,730 --> 00:55:44,430 And that will give me an approximation for, 1147 00:55:44,430 --> 00:55:47,950 what is the probability that loved is going to show up inside of the review, 1148 00:55:47,950 --> 00:55:51,570 given that we know that the review is positive. 1149 00:55:51,570 --> 00:55:55,160 And so this then allows us to be able to calculate these probabilities. 1150 00:55:55,160 --> 00:55:56,910 So let's now actually do this calculation. 1151 00:55:56,910 --> 00:56:00,390 Let's calculate for the sentence, my grandson loved it. 1152 00:56:00,390 --> 00:56:01,890 Is it a positive or negative review? 1153 00:56:01,890 --> 00:56:04,030 How could we figure out those probabilities? 1154 00:56:04,030 --> 00:56:07,110 Well, again, this up here is the expression we're trying to calculate. 1155 00:56:07,110 --> 00:56:10,350 And I'll give you the data that is available to us. 1156 00:56:10,350 --> 00:56:13,080 And the way to interpret this data in this case 1157 00:56:13,080 --> 00:56:19,127 is that, of all of the messages, 49% of them were positive and 51% of them 1158 00:56:19,127 --> 00:56:19,710 were negative. 1159 00:56:19,710 --> 00:56:22,350 Maybe online reviews tend to be a little bit more negative than they 1160 00:56:22,350 --> 00:56:24,683 are positive-- or at least based on this particular data 1161 00:56:24,683 --> 00:56:26,620 sample, that's what I have. 1162 00:56:26,620 --> 00:56:31,800 And then I have distributions for each of the various different words-- 1163 00:56:31,800 --> 00:56:34,290 that, given that it's a positive message, 1164 00:56:34,290 --> 00:56:38,040 how many positive messages had the word my in them? 1165 00:56:38,040 --> 00:56:39,335 It's about 30%. 1166 00:56:39,335 --> 00:56:42,210 And for negative messages, how many of those had the word my in them?
1167 00:56:42,210 --> 00:56:47,910 About 20%-- so it seems like the word my comes up more often in positive 1168 00:56:47,910 --> 00:56:52,140 messages-- at least slightly more often based on this analysis here. 1169 00:56:52,140 --> 00:56:54,270 Grandson, for example-- maybe that showed up 1170 00:56:54,270 --> 00:56:58,680 in 1% of all positive messages and 2% of all negative messages 1171 00:56:58,680 --> 00:57:00,330 had the word grandson in it. 1172 00:57:00,330 --> 00:57:05,010 The word loved showed up in 32% of all positive messages, 8% 1173 00:57:05,010 --> 00:57:07,090 of all negative messages, for example. 1174 00:57:07,090 --> 00:57:10,230 And then the word it showed up in 30% of positive messages, 1175 00:57:10,230 --> 00:57:15,130 40% of negative messages-- again, just arbitrary data here just for example, 1176 00:57:15,130 --> 00:57:19,560 but now we have data with which we can begin to calculate this expression. 1177 00:57:19,560 --> 00:57:22,950 So how do I calculate multiplying all these values together? 1178 00:57:22,950 --> 00:57:25,650 Well, it's just going to be multiplying probability 1179 00:57:25,650 --> 00:57:29,400 that it's positive times the probability of my, given positive, 1180 00:57:29,400 --> 00:57:32,190 times the probability of grandson, given positive-- 1181 00:57:32,190 --> 00:57:34,290 so on and so forth for each of the other words. 1182 00:57:34,290 --> 00:57:37,780 And if you do that multiplication and multiply all of those values together, 1183 00:57:37,780 --> 00:57:42,000 you get this, 0.00014112. 1184 00:57:42,000 --> 00:57:44,760 By itself, this is not a meaningful number, 1185 00:57:44,760 --> 00:57:48,810 but it's going to be meaningful if you compared this expression-- 1186 00:57:48,810 --> 00:57:53,250 the probability that it's positive times the probability of all of the words, 1187 00:57:53,250 --> 00:57:55,680 given that I know that the message is positive, 1188 00:57:55,680 --> 00:57:59,350 and compare it to the same thing, but for negative sentiment messages 1189 00:57:59,350 --> 00:57:59,850 instead. 1190 00:57:59,850 --> 00:58:03,090 I want to know the probability that it's a negative message 1191 00:58:03,090 --> 00:58:05,430 times the probability of all of these words, 1192 00:58:05,430 --> 00:58:07,900 given that it's a negative message. 1193 00:58:07,900 --> 00:58:09,360 And so how can I do that? 1194 00:58:09,360 --> 00:58:13,280 Well, to do that, you just multiply probability of negative times 1195 00:58:13,280 --> 00:58:15,500 all of these conditional probabilities. 1196 00:58:15,500 --> 00:58:19,520 And if I take those five values, multiply all of them together, 1197 00:58:19,520 --> 00:58:26,730 then what I get is this value for negative: 0.00006528-- 1198 00:58:26,730 --> 00:58:30,080 again, in isolation, not a particularly meaningful number. 1199 00:58:30,080 --> 00:58:35,300 What is meaningful is treating these two values as a probability distribution 1200 00:58:35,300 --> 00:58:39,260 and normalizing them, making it so that both of these values sum up to 1 1201 00:58:39,260 --> 00:58:41,450 the way a probability distribution should. 1202 00:58:41,450 --> 00:58:45,740 And we do so by adding these two up and then dividing each of these values 1203 00:58:45,740 --> 00:58:48,120 by their total in order to be able to normalize them.
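Here is that arithmetic in a few lines of Python, using the hypothetical numbers from the example:

# P(positive) times P(word | positive) for my, grandson, loved, it:
p_positive = 0.49 * 0.30 * 0.01 * 0.32 * 0.30  # about 0.00014112
# P(negative) times P(word | negative) for the same four words:
p_negative = 0.51 * 0.20 * 0.02 * 0.08 * 0.40  # about 0.00006528

# Normalize so the two values sum to 1:
total = p_positive + p_negative
print(p_positive / total)  # about 0.6837
print(p_negative / total)  # about 0.3163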
1204 00:58:48,120 --> 00:58:51,170 And when we do that, when we normalize this probability distribution, 1205 00:58:51,170 --> 00:58:58,400 you end up getting something like this, positive 0.6837, negative 0.3163. 1206 00:58:58,400 --> 00:59:02,990 It seems like we've been able to conclude that we are about 68% 1207 00:59:02,990 --> 00:59:06,500 confident-- we think there's a probability of 0.68 1208 00:59:06,500 --> 00:59:09,470 that this message is a positive message-- my grandson loved it. 1209 00:59:09,470 --> 00:59:11,540 And why are we 68% confident? 1210 00:59:11,540 --> 00:59:15,350 Well, it seems like we're more confident than not because the word 1211 00:59:15,350 --> 00:59:18,350 loved showed up in 32% of positive messages, 1212 00:59:18,350 --> 00:59:20,420 but only 8% of negative messages. 1213 00:59:20,420 --> 00:59:22,410 So that was a pretty strong indicator. 1214 00:59:22,410 --> 00:59:25,070 And for the others, while it's true that the word 1215 00:59:25,070 --> 00:59:27,260 it showed up more often in negative messages, 1216 00:59:27,260 --> 00:59:30,170 it wasn't enough to offset that loved shows up 1217 00:59:30,170 --> 00:59:34,560 far more often in positive messages than negative messages. 1218 00:59:34,560 --> 00:59:37,970 And so this type of analysis is how we can apply naive Bayes. 1219 00:59:37,970 --> 00:59:39,650 We've just done this calculation. 1220 00:59:39,650 --> 00:59:42,933 And we end up getting not just a categorization of positive or negative, 1221 00:59:42,933 --> 00:59:44,600 but I get some sort of confidence level. 1222 00:59:44,600 --> 00:59:47,660 What do I think the probability is that it's positive? 1223 00:59:47,660 --> 00:59:52,560 And I can say I think it's positive with this particular probability. 1224 00:59:52,560 --> 00:59:55,820 And so naive Bayes can be quite powerful at trying to achieve this. 1225 00:59:55,820 --> 00:59:58,250 Using just this bag of words model, where all I'm doing 1226 00:59:58,250 --> 01:00:00,950 is looking at what words show up in the sample, 1227 01:00:00,950 --> 01:00:03,870 I'm able to draw these sorts of conclusions. 1228 01:00:03,870 --> 01:00:07,280 Now, one potential drawback-- something that you'll notice pretty quickly 1229 01:00:07,280 --> 01:00:10,190 if you start applying this rule exactly as is-- 1230 01:00:10,190 --> 01:00:15,500 is what happens depending on if 0's are inside this data somewhere. 1231 01:00:15,500 --> 01:00:20,410 Let's imagine, for example, this same sentence-- my grandson loved it-- 1232 01:00:20,410 --> 01:00:24,980 but let's instead imagine that this value here, instead of being 0.01, 1233 01:00:24,980 --> 01:00:28,970 was 0, meaning inside of our data set, it has never 1234 01:00:28,970 --> 01:00:33,620 before happened that in a positive message the word grandson showed up. 1235 01:00:33,620 --> 01:00:35,450 And that's certainly possible. 1236 01:00:35,450 --> 01:00:37,817 If I have a pretty small data set, it's quite likely 1237 01:00:37,817 --> 01:00:40,400 that not all the messages are going to have the word grandson. 1238 01:00:40,400 --> 01:00:43,400 Maybe it is the case that no positive messages have ever 1239 01:00:43,400 --> 01:00:46,370 had the word grandson in it, at least in my data set. 1240 01:00:46,370 --> 01:00:49,640 But if it is the case that 2% of the negative messages 1241 01:00:49,640 --> 01:00:52,340 have still had the word grandson in it, then we 1242 01:00:52,340 --> 01:00:54,330 run into an interesting challenge.
1243 01:00:54,330 --> 01:00:57,730 And the challenge is this-- when I multiply all of the positive numbers 1244 01:00:57,730 --> 01:01:00,980 together and multiply all the negative numbers together to calculate these two 1245 01:01:00,980 --> 01:01:06,800 probabilities, what I end up getting is a positive value of 0.000. 1246 01:01:06,800 --> 01:01:10,010 I get pure 0's, because when I multiply all of these numbers 1247 01:01:10,010 --> 01:01:12,470 together-- when I multiply something by 0, 1248 01:01:12,470 --> 01:01:15,770 doesn't matter what the other numbers are-- the result is going to be 0. 1249 01:01:15,770 --> 01:01:19,710 And the same thing could happen with the negative numbers as well. 1250 01:01:19,710 --> 01:01:24,320 So this then would seem to be a problem that, because grandson has never 1251 01:01:24,320 --> 01:01:27,630 showed up in any of the positive messages inside of our sample, 1252 01:01:27,630 --> 01:01:31,340 we're able to say-- we seem to be concluding that there is a 0% 1253 01:01:31,340 --> 01:01:33,110 chance that the message is positive. 1254 01:01:33,110 --> 01:01:37,105 And therefore, it must be negative, because the only cases where 1255 01:01:37,105 --> 01:01:39,980 we've seen the word grandson come up are inside of a negative message. 1256 01:01:39,980 --> 01:01:43,340 And in doing so, we've totally ignored all of the other probabilities 1257 01:01:43,340 --> 01:01:46,940 that a positive message is much more likely to have the word loved in it, 1258 01:01:46,940 --> 01:01:49,190 because we've multiplied by 0, which just 1259 01:01:49,190 --> 01:01:53,670 means none of the other probabilities can possibly matter at all. 1260 01:01:53,670 --> 01:01:55,920 So this then is a challenge that we need to deal with. 1261 01:01:55,920 --> 01:01:57,380 It means that we're likely not going to be 1262 01:01:57,380 --> 01:02:00,220 able to get the correct results if we just purely use this approach. 1263 01:02:00,220 --> 01:02:02,720 And it's for that reason there are a number of possible ways 1264 01:02:02,720 --> 01:02:06,230 we can try and make sure that we never multiply something by 0. 1265 01:02:06,230 --> 01:02:08,750 It's OK to multiply something by a small number, 1266 01:02:08,750 --> 01:02:10,640 because then it can still be counterbalanced 1267 01:02:10,640 --> 01:02:14,540 by other larger numbers, but multiplying by 0 means it's the end of the story. 1268 01:02:14,540 --> 01:02:16,520 You multiply a number by 0, and the output's 1269 01:02:16,520 --> 01:02:21,230 going to be 0, no matter how big any of the other numbers happen to be. 1270 01:02:21,230 --> 01:02:23,810 So one approach that's fairly common in naive Bayes is 1271 01:02:23,810 --> 01:02:29,090 this idea of additive smoothing, adding some value alpha to each of the values 1272 01:02:29,090 --> 01:02:31,943 in our distribution just to smooth the data a little bit. 1273 01:02:31,943 --> 01:02:33,860 One such approach is called Laplace smoothing, 1274 01:02:33,860 --> 01:02:37,530 which basically just means adding one to each value in our distribution. 1275 01:02:37,530 --> 01:02:43,540 So if I have 100 samples and zero of them contain the word grandson, 1276 01:02:43,540 --> 01:02:45,290 well then I might say that, you know what? 1277 01:02:45,290 --> 01:02:49,460 Instead, let's pretend that I've had one additional sample where the word 1278 01:02:49,460 --> 01:02:53,210 grandson appeared and one additional sample where the word grandson didn't 1279 01:02:53,210 --> 01:02:53,840 appear.
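Here is a tiny sketch of that add-one idea, assuming we're estimating the probability that a word appears in a message of a given category:

def smoothed(count, total):
    # Pretend we saw one extra sample that contained the word
    # and one extra sample that did not.
    return (count + 1) / (total + 2)

# 0 of 100 positive samples contained the word "grandson":
print(smoothed(0, 100))  # 1/102 -- small, but never exactly zero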
1280 01:02:53,840 --> 01:02:57,150 So I'll say all right, now I have 1 out of 102-- 1281 01:02:57,150 --> 01:03:01,550 so one sample that does have the word grandson out of 102 total. 1282 01:03:01,550 --> 01:03:05,070 I'm basically creating two samples that didn't exist before. 1283 01:03:05,070 --> 01:03:08,830 But in doing so, I've been able to smooth the distribution a little bit 1284 01:03:08,830 --> 01:03:12,040 to make sure that I never have to multiply anything by 0. 1285 01:03:12,040 --> 01:03:17,080 By pretending I've seen one more value in each category than I actually have, 1286 01:03:17,080 --> 01:03:19,390 this gets us that result of not having to worry 1287 01:03:19,390 --> 01:03:22,180 about multiplying a number by 0. 1288 01:03:22,180 --> 01:03:24,580 So this then is an approach that we can use in order 1289 01:03:24,580 --> 01:03:27,670 to try and apply naive Bayes, even in situations 1290 01:03:27,670 --> 01:03:31,730 where we're dealing with words that we might not necessarily have seen before. 1291 01:03:31,730 --> 01:03:35,140 And let's now take a look at how we could actually apply that in practice. 1292 01:03:35,140 --> 01:03:38,490 It turns out that NLTK, in addition to having the ability to extract 1293 01:03:38,490 --> 01:03:41,110 n-grams and tokenize things into words, also 1294 01:03:41,110 --> 01:03:45,400 has the ability to be able to apply naive Bayes on some samples of text, 1295 01:03:45,400 --> 01:03:46,920 for example. 1296 01:03:46,920 --> 01:03:48,430 And so let's go ahead and do that. 1297 01:03:48,430 --> 01:03:52,840 What I've done is, inside of sentiment, I've prepared a corpus of just 1298 01:03:52,840 --> 01:03:55,997 some reviews that I've generated, but you can imagine using real reviews. 1299 01:03:55,997 --> 01:03:58,330 I just have a couple of positive reviews-- it was great. 1300 01:03:58,330 --> 01:03:58,873 So much fun. 1301 01:03:58,873 --> 01:03:59,540 Would recommend. 1302 01:03:59,540 --> 01:04:00,550 My grandson loved it. 1303 01:04:00,550 --> 01:04:01,712 Those sorts of messages. 1304 01:04:01,712 --> 01:04:04,420 And then I have a whole bunch of negative reviews-- not worth it, 1305 01:04:04,420 --> 01:04:07,190 kind of cheap, really bad, didn't work the way we expected-- 1306 01:04:07,190 --> 01:04:08,470 just one on each line. 1307 01:04:08,470 --> 01:04:11,860 A whole bunch of positive reviews and negative reviews. 1308 01:04:11,860 --> 01:04:15,130 And what I'd like to do now is analyze them somehow. 1309 01:04:15,130 --> 01:04:19,690 So here then is sentiment.py, and what we're going to do first 1310 01:04:19,690 --> 01:04:23,680 is extract all of the positive and negative sentences, 1311 01:04:23,680 --> 01:04:28,600 create a set of all of the words that were used across all of the messages, 1312 01:04:28,600 --> 01:04:33,340 and then we're going to go ahead and train NLTK's naive Bayes classifier 1313 01:04:33,340 --> 01:04:34,810 on all of this training data. 1314 01:04:34,810 --> 01:04:36,850 And what the training data effectively is, is I 1315 01:04:36,850 --> 01:04:40,300 take all of the positive messages and give them the label positive, all 1316 01:04:40,300 --> 01:04:42,790 the negative messages and give them the label negative, 1317 01:04:42,790 --> 01:04:45,880 and then I'll go ahead and apply this classifier to it, where I'd say, 1318 01:04:45,880 --> 01:04:48,100 I would like to take all of this training data 1319 01:04:48,100 --> 01:04:52,030 and now have the ability to classify it as positive or negative.
1320 01:04:52,030 --> 01:04:53,860 I'll then take some input from the user. 1321 01:04:53,860 --> 01:04:56,890 They can just type in some sequence of words. 1322 01:04:56,890 --> 01:04:59,020 And then I would like to classify that sequence 1323 01:04:59,020 --> 01:05:01,450 as either positive or negative, and then I'll 1324 01:05:01,450 --> 01:05:04,482 go ahead and print out what the probabilities of each happen to be. 1325 01:05:04,482 --> 01:05:07,690 And there are some helper functions here that just organize things in the way 1326 01:05:07,690 --> 01:05:09,610 that NLTK is expecting them to be. 1327 01:05:09,610 --> 01:05:12,307 But the key idea here is that I'm taking the positive messages, 1328 01:05:12,307 --> 01:05:14,140 labeling them, taking the negative messages, 1329 01:05:14,140 --> 01:05:16,840 labeling them, putting them inside of a classifier, 1330 01:05:16,840 --> 01:05:21,380 and then now trying to classify some new text that comes about. 1331 01:05:21,380 --> 01:05:23,030 So let's go ahead and try it. 1332 01:05:23,030 --> 01:05:26,740 I'll go ahead and go into sentiment, and we'll run python sentiment.py, 1333 01:05:26,740 --> 01:05:29,328 passing in as input that corpus that contains 1334 01:05:29,328 --> 01:05:31,120 all of the positive and negative messages-- 1335 01:05:31,120 --> 01:05:34,480 because depending on the corpus, that's going to affect the probabilities. 1336 01:05:34,480 --> 01:05:36,970 The effectiveness of our ability to classify 1337 01:05:36,970 --> 01:05:41,045 is entirely dependent on how good our data is, how much data we have, 1338 01:05:41,045 --> 01:05:42,670 and how well it happens to be labeled. 1339 01:05:42,670 --> 01:05:44,640 So now I can try something and say-- 1340 01:05:44,640 --> 01:05:47,170 let's try a review like, this was great-- 1341 01:05:47,170 --> 01:05:49,800 just some review that I might leave. 1342 01:05:49,800 --> 01:05:53,200 And it seems that, all right, there is a 96% chance it estimates 1343 01:05:53,200 --> 01:05:54,930 that this was a positive message-- 1344 01:05:54,930 --> 01:05:58,480 a 4% chance that it was negative-- likely because the word great 1345 01:05:58,480 --> 01:06:00,610 shows up inside of the positive messages, 1346 01:06:00,610 --> 01:06:03,080 but doesn't show up inside of the negative messages. 1347 01:06:03,080 --> 01:06:06,160 And that might be something that our AI is able to capitalize on. 1348 01:06:06,160 --> 01:06:09,640 And really, what it's going to look for are the differentiating words-- 1349 01:06:09,640 --> 01:06:12,490 that if the probability of words like this, and was, 1350 01:06:12,490 --> 01:06:15,530 and is is pretty similar between positive and negative messages, 1351 01:06:15,530 --> 01:06:17,680 then the naive Bayes classifier isn't going 1352 01:06:17,680 --> 01:06:21,202 to end up using those values as having some sort of importance 1353 01:06:21,202 --> 01:06:21,910 in the algorithm. 1354 01:06:21,910 --> 01:06:23,710 Because if they're the same on both sides, 1355 01:06:23,710 --> 01:06:26,560 when you multiply that value in for both positive and negative, 1356 01:06:26,560 --> 01:06:28,270 you end up getting about the same thing.
1357 01:06:28,270 --> 01:06:30,730 What ultimately makes the difference in naive Bayes 1358 01:06:30,730 --> 01:06:34,210 is when you multiply by a value that's much bigger for one category 1359 01:06:34,210 --> 01:06:36,880 than for another category-- when one word like great 1360 01:06:36,880 --> 01:06:39,910 is much more likely to show up in one type of message 1361 01:06:39,910 --> 01:06:41,260 than another type of message. 1362 01:06:41,260 --> 01:06:43,385 And that's one of the nice things about naive Bayes-- 1363 01:06:43,385 --> 01:06:45,250 that, without me telling it that great 1364 01:06:45,250 --> 01:06:48,210 is more important to care about than this or was, 1365 01:06:48,210 --> 01:06:50,380 naive Bayes can figure that out based on the data. 1366 01:06:50,380 --> 01:06:53,740 It can figure out that this shows up about the same amount 1367 01:06:53,740 --> 01:06:56,560 between the two, but great, that is a discriminator, 1368 01:06:56,560 --> 01:07:00,060 a word that can be different between the two types of messages. 1369 01:07:00,060 --> 01:07:01,400 So I could try it again-- 1370 01:07:01,400 --> 01:07:04,583 type in a sentence like, lots of fun, for example. 1371 01:07:04,583 --> 01:07:06,250 This one it's a little less sure about-- 1372 01:07:06,250 --> 01:07:10,690 62% chance that it's positive, 37% chance that it's negative-- maybe 1373 01:07:10,690 --> 01:07:12,720 because there aren't as clear discriminators 1374 01:07:12,720 --> 01:07:15,310 or differentiators inside of this data. 1375 01:07:15,310 --> 01:07:16,400 I'll try one more-- 1376 01:07:16,400 --> 01:07:20,430 say, kind of overpriced. 1377 01:07:20,430 --> 01:07:23,633 And all right, now it's 95%, 96% sure that this 1378 01:07:23,633 --> 01:07:25,800 is a negative sentiment-- likely because of the word 1379 01:07:25,800 --> 01:07:29,032 overpriced, because it's shown up in a negative sentiment expression 1380 01:07:29,032 --> 01:07:31,740 before, and therefore, it thinks, you know what, this is probably 1381 01:07:31,740 --> 01:07:34,720 going to be a negative sentence. 1382 01:07:34,720 --> 01:07:37,830 And so naive Bayes has now given us the ability to classify text. 1383 01:07:37,830 --> 01:07:40,350 Given enough training data, given enough examples, 1384 01:07:40,350 --> 01:07:44,400 we can train our AI to be able to look at natural language, human words, 1385 01:07:44,400 --> 01:07:46,410 figure out which words are likely to show up 1386 01:07:46,410 --> 01:07:48,870 in positive as opposed to negative sentiment messages, 1387 01:07:48,870 --> 01:07:50,670 and categorize them accordingly. 1388 01:07:50,670 --> 01:07:52,420 And you could imagine doing the same thing 1389 01:07:52,420 --> 01:07:55,170 anytime you want to take text and group it into categories. 1390 01:07:55,170 --> 01:07:58,300 If I want to take an email and categorize it 1391 01:07:58,300 --> 01:08:01,560 as a good email or as a spam email, you could apply a similar idea. 1392 01:08:01,560 --> 01:08:04,020 Try and look for the discriminating words, 1393 01:08:04,020 --> 01:08:07,230 the words that make it more likely to be a spam email or not, 1394 01:08:07,230 --> 01:08:10,830 and just train a naive Bayes classifier to be able to figure out 1395 01:08:10,830 --> 01:08:14,250 what that distribution is and to be able to figure out how to categorize 1396 01:08:14,250 --> 01:08:15,978 an email as good or as spam. 1397 01:08:15,978 --> 01:08:19,020 Now, of course, it's not going to be able to give us a definitive answer.
1398 01:08:19,020 --> 01:08:22,950 It gives us a probability distribution, something like 63% 1399 01:08:22,950 --> 01:08:25,380 positive, 37% negative. 1400 01:08:25,380 --> 01:08:29,550 And that might be why the spam filters in our email sometimes make mistakes, 1401 01:08:29,550 --> 01:08:32,700 sometimes think that a good email is actually spam or vice 1402 01:08:32,700 --> 01:08:36,000 versa, because ultimately, the best that it can do 1403 01:08:36,000 --> 01:08:37,890 is calculate a probability distribution. 1404 01:08:37,890 --> 01:08:40,290 If natural language is ambiguous, we can usually 1405 01:08:40,290 --> 01:08:42,960 just deal in the world of probabilities to try and get 1406 01:08:42,960 --> 01:08:47,100 an answer that is reasonably good, even if we aren't able to guarantee for sure 1407 01:08:47,100 --> 01:08:50,970 that it's the answer we actually expect it to be. 1408 01:08:50,970 --> 01:08:54,600 That then was a look at how we can begin to take some text 1409 01:08:54,600 --> 01:08:59,910 and be able to analyze the text and group it into some sorts of categories. 1410 01:08:59,910 --> 01:09:04,140 But ultimately, in addition to just being able to analyze text and categorize it, 1411 01:09:04,140 --> 01:09:08,130 we'd like to be able to figure out information about the text, 1412 01:09:08,130 --> 01:09:11,130 to get some sort of meaning out of the text as well. 1413 01:09:11,130 --> 01:09:13,500 And this starts to get us into the world of information, 1414 01:09:13,500 --> 01:09:16,620 of being able to try and take data in the form of text 1415 01:09:16,620 --> 01:09:18,450 and retrieve information from it. 1416 01:09:18,450 --> 01:09:22,500 So one type of problem is known as information retrieval, or IR, 1417 01:09:22,500 --> 01:09:26,979 which is the task of finding relevant documents in response to a query. 1418 01:09:26,979 --> 01:09:30,330 So this is something like you type a query into a search engine, 1419 01:09:30,330 --> 01:09:32,279 like Google, or you're typing something 1420 01:09:32,279 --> 01:09:35,640 into some system-- a library catalog, 1421 01:09:35,640 --> 01:09:38,609 for example-- that's going to look for responses to a query. 1422 01:09:38,609 --> 01:09:43,217 I want to look for documents that are about the US Constitution or something, 1423 01:09:43,217 --> 01:09:45,300 and I would like to get a whole bunch of documents 1424 01:09:45,300 --> 01:09:47,819 that match that query back to me. 1425 01:09:47,819 --> 01:09:50,819 But you might imagine that what I really want to be able to do, 1426 01:09:50,819 --> 01:09:53,160 in order to solve this task effectively, 1427 01:09:53,160 --> 01:09:55,830 is to be able to take documents and figure out, 1428 01:09:55,830 --> 01:09:57,870 what are those documents about? 1429 01:09:57,870 --> 01:10:01,680 I want to be able to say what it is that these particular documents are 1430 01:10:01,680 --> 01:10:03,900 about-- what are the topics of those documents-- 1431 01:10:03,900 --> 01:10:08,160 so that I can then more effectively be able to retrieve information 1432 01:10:08,160 --> 01:10:10,050 from those particular documents. 1433 01:10:10,050 --> 01:10:13,560 And this refers to a set of tasks generally known as topic modeling, 1434 01:10:13,560 --> 01:10:17,918 where I'd like to discover what the topics are for a set of documents. 1435 01:10:17,918 --> 01:10:19,710 And this is something that humans could do.
1436 01:10:19,710 --> 01:10:21,800 A human could read a document and tell you, all right, 1437 01:10:21,800 --> 01:10:23,883 here's what this document is about, and could give maybe 1438 01:10:23,883 --> 01:10:27,862 a couple of topics-- who the important people in this document are, what 1439 01:10:27,862 --> 01:10:30,570 the important objects in the document are-- could probably tell you 1440 01:10:30,570 --> 01:10:32,370 that kind of thing. 1441 01:10:32,370 --> 01:10:35,160 But we'd like for our AI to be able to do the same thing. 1442 01:10:35,160 --> 01:10:38,760 Given some document, can you tell me what the important words 1443 01:10:38,760 --> 01:10:39,870 in this document are? 1444 01:10:39,870 --> 01:10:42,095 What are the words that set this document apart 1445 01:10:42,095 --> 01:10:44,220 that I might care about if I'm looking at documents 1446 01:10:44,220 --> 01:10:47,128 based on keywords, for example? 1447 01:10:47,128 --> 01:10:49,920 And so one instinctive idea-- an intuitive idea that probably makes 1448 01:10:49,920 --> 01:10:50,580 sense-- 1449 01:10:50,580 --> 01:10:53,250 is let's just use term frequency. 1450 01:10:53,250 --> 01:10:56,100 Term frequency is just defined as the number of times 1451 01:10:56,100 --> 01:10:58,650 a particular term appears in a document. 1452 01:10:58,650 --> 01:11:03,300 If I have a document with 100 words and one particular word shows up 10 times, 1453 01:11:03,300 --> 01:11:05,440 it has a term frequency of 10. 1454 01:11:05,440 --> 01:11:06,690 It shows up pretty often. 1455 01:11:06,690 --> 01:11:09,000 Maybe that's going to be an important word. 1456 01:11:09,000 --> 01:11:10,750 And sometimes, you'll also see this framed 1457 01:11:10,750 --> 01:11:14,620 as a proportion of the total number of words, so 10 words out of 100. 1458 01:11:14,620 --> 01:11:19,110 Maybe it has a term frequency of 0.1, meaning 10% of all of the words 1459 01:11:19,110 --> 01:11:21,530 are this particular word that I care about. 1460 01:11:21,530 --> 01:11:23,280 Ultimately, that doesn't change how relatively 1461 01:11:23,280 --> 01:11:26,300 important the words are for any one particular document-- 1462 01:11:26,300 --> 01:11:27,730 it's the same idea. 1463 01:11:27,730 --> 01:11:31,050 The idea is look for words that show up more frequently, because those 1464 01:11:31,050 --> 01:11:35,970 are more likely to be the important words inside of a corpus of documents. 1465 01:11:35,970 --> 01:11:37,840 And so let's go ahead and give that a try. 1466 01:11:37,840 --> 01:11:40,980 Let's say I wanted to find out what the Sherlock Holmes stories are about. 1467 01:11:40,980 --> 01:11:42,780 I have a whole bunch of Sherlock Holmes stories 1468 01:11:42,780 --> 01:11:45,000 and I want to know, in general, what are they about? 1469 01:11:45,000 --> 01:11:47,708 What are the important characters? 1470 01:11:47,708 --> 01:11:49,000 What are the important objects? 1471 01:11:49,000 --> 01:11:52,170 What are the important parts of the story, just in terms of words? 1472 01:11:52,170 --> 01:11:55,350 And I'd like for the AI to be able to figure that out on its own, 1473 01:11:55,350 --> 01:11:57,660 and we'll do so by looking at term frequency-- 1474 01:11:57,660 --> 01:12:01,930 by looking at, what are the words that show up the most often? 1475 01:12:01,930 --> 01:12:06,250 So I'll go ahead and go into the tfidf directory. 1476 01:12:06,250 --> 01:12:08,350 You'll see why it's called that in a moment.
1477 01:12:08,350 --> 01:12:14,290 But let's first open up tf0.py, which is going to calculate the top five term 1478 01:12:14,290 --> 01:12:17,092 frequencies 1479 01:12:17,092 --> 01:12:19,300 for a corpus of documents, a whole bunch of documents 1480 01:12:19,300 --> 01:12:22,930 where each document is just a story from Sherlock Holmes. 1481 01:12:22,930 --> 01:12:26,772 We're going to load all the data into our corpus 1482 01:12:26,772 --> 01:12:29,850 and we're going to figure out, what are all of the words that 1483 01:12:29,850 --> 01:12:32,610 show up inside of that corpus? 1484 01:12:32,610 --> 01:12:35,187 And we're going to basically just assemble all 1485 01:12:35,187 --> 01:12:36,770 of the term frequencies. 1486 01:12:36,770 --> 01:12:39,510 We're going to calculate, how often does each of these terms 1487 01:12:39,510 --> 01:12:41,880 appear inside of each document? 1488 01:12:41,880 --> 01:12:43,368 And we'll print out the top five. 1489 01:12:43,368 --> 01:12:45,660 And so there are some data structures involved that you 1490 01:12:45,660 --> 01:12:47,160 can take a look at if you'd like to. 1491 01:12:47,160 --> 01:12:50,550 The exact code is not so important, but it is the idea of what we're doing. 1492 01:12:50,550 --> 01:12:54,450 We're taking each of these documents and first sorting them. 1493 01:12:54,450 --> 01:12:56,340 We're saying, take all the words that show up 1494 01:12:56,340 --> 01:13:00,080 and sort them by how often each word shows up. 1495 01:13:00,080 --> 01:13:04,710 And let's go ahead and just, for each document, save the top five 1496 01:13:04,710 --> 01:13:07,720 terms that happen to show up in each of those documents. 1497 01:13:07,720 --> 01:13:10,900 So again, some helper functions you can take a look at if you're interested. 1498 01:13:10,900 --> 01:13:13,440 But the key idea here is that all we're going to do 1499 01:13:13,440 --> 01:13:18,240 is run tf0 on the Sherlock Holmes stories. 1500 01:13:18,240 --> 01:13:21,840 And what I'm hoping to get out of this process is I am hoping to figure out, 1501 01:13:21,840 --> 01:13:25,150 what are the important words in Sherlock Holmes, for example? 1502 01:13:25,150 --> 01:13:29,370 So we'll go ahead and run this and see what we get. 1503 01:13:29,370 --> 01:13:30,982 And it's loading the data. 1504 01:13:30,982 --> 01:13:31,940 And here's what we get. 1505 01:13:31,940 --> 01:13:36,530 For this particular story, the important words are the, and and, and I, 1506 01:13:36,530 --> 01:13:37,368 and to, and of. 1507 01:13:37,368 --> 01:13:39,410 Those are the words that show up most frequently. 1508 01:13:39,410 --> 01:13:45,000 In this particular story, it's the, and and, and I, and a, and of. 1509 01:13:45,000 --> 01:13:47,000 This is not particularly useful to us. 1510 01:13:47,000 --> 01:13:48,230 We're using term frequencies. 1511 01:13:48,230 --> 01:13:50,930 We're looking at what words show up the most frequently in each 1512 01:13:50,930 --> 01:13:54,830 of these various different documents, but what we get naturally 1513 01:13:54,830 --> 01:13:57,470 are just the words that show up a lot in English. 1514 01:13:57,470 --> 01:14:00,385 Words like the, and of, and and happen to show up a lot in English, 1515 01:14:00,385 --> 01:14:02,510 and therefore, they happen to show up a lot in each 1516 01:14:02,510 --> 01:14:04,052 of these various different documents.
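In rough outline, the counting that tf0.py performs could look like the sketch below. The "holmes" directory name is an assumption based on the walkthrough, and the tokenizer assumes NLTK's punkt data has been downloaded:

```python
# A rough sketch of counting term frequencies across a directory of
# plain-text documents, one Sherlock Holmes story per file.
import os
import nltk  # assumes nltk.download("punkt") has been run once

frequencies = {}  # document name -> {word: count}
for filename in os.listdir("holmes"):
    with open(os.path.join("holmes", filename), encoding="utf-8") as f:
        contents = f.read().lower()
    counts = {}
    for word in nltk.word_tokenize(contents):
        if word.isalpha():
            counts[word] = counts.get(word, 0) + 1
    frequencies[filename] = counts

# Print the five most frequent terms in each document.
for filename, counts in frequencies.items():
    top = sorted(counts.items(), key=lambda item: item[1], reverse=True)[:5]
    print(filename, top)
```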
1517 01:14:04,052 --> 01:14:06,320 This is not a particularly useful metric for us 1518 01:14:06,320 --> 01:14:08,690 to be able to analyze what words are important, 1519 01:14:08,690 --> 01:14:12,960 because these words are just part of the grammatical structure of English. 1520 01:14:12,960 --> 01:14:17,610 And it turns out we can categorize words into a couple of different categories. 1521 01:14:17,610 --> 01:14:21,102 These words happen to be known as what we might call function words, words 1522 01:14:21,102 --> 01:14:23,060 that have little meaning on their own, but that 1523 01:14:23,060 --> 01:14:26,100 are used to grammatically connect different parts of a sentence. 1524 01:14:26,100 --> 01:14:29,120 These are words like am, and by, and do, and is, and which, 1525 01:14:29,120 --> 01:14:32,130 and with, and yet-- words that, on their own, what do they mean? 1526 01:14:32,130 --> 01:14:33,140 It's hard to say. 1527 01:14:33,140 --> 01:14:35,390 They get their meaning from how they connect 1528 01:14:35,390 --> 01:14:36,980 different parts of the sentence. 1529 01:14:36,980 --> 01:14:40,610 And these function words are what we might call a closed class of words 1530 01:14:40,610 --> 01:14:41,990 in a language like English. 1531 01:14:41,990 --> 01:14:44,690 There's really just some fixed list of function words, 1532 01:14:44,690 --> 01:14:46,190 and they don't change very often. 1533 01:14:46,190 --> 01:14:48,260 There's just some list of words that are commonly 1534 01:14:48,260 --> 01:14:52,460 used to connect other grammatical structures in the language. 1535 01:14:52,460 --> 01:14:56,120 And that's in contrast with what we might call content words, words 1536 01:14:56,120 --> 01:14:58,970 that carry meaning independently-- words like algorithm, 1537 01:14:58,970 --> 01:15:02,580 category, computer, words that actually have some sort of meaning. 1538 01:15:02,580 --> 01:15:05,150 And these are usually the words that we care about. 1539 01:15:05,150 --> 01:15:07,250 These are the words where we want to figure out, 1540 01:15:07,250 --> 01:15:10,020 what are the important words in our document? 1541 01:15:10,020 --> 01:15:12,230 We probably care about the content words more 1542 01:15:12,230 --> 01:15:15,380 than we care about the function words. 1543 01:15:15,380 --> 01:15:20,770 And so one strategy we could apply is to just ignore all of the function words. 1544 01:15:20,770 --> 01:15:26,120 So here in tf1.py, I've done the same exact thing, 1545 01:15:26,120 --> 01:15:31,790 except I'm going to load a whole bunch of words from a function_words.txt 1546 01:15:31,790 --> 01:15:35,670 file, inside of which are just a whole bunch of function words in alphabetical 1547 01:15:35,670 --> 01:15:36,170 order. 1548 01:15:36,170 --> 01:15:38,570 These are just words 1549 01:15:38,570 --> 01:15:41,870 that are used to connect other words in English, 1550 01:15:41,870 --> 01:15:44,275 and someone has compiled this particular list. 1551 01:15:44,275 --> 01:15:46,400 And these are the words that I just want to ignore. 1552 01:15:46,400 --> 01:15:49,790 If any of these words show up, let's just ignore them as top terms, 1553 01:15:49,790 --> 01:15:52,790 because these are probably not words that I care about 1554 01:15:52,790 --> 01:15:56,570 if I want to analyze what the important terms inside of a document 1555 01:15:56,570 --> 01:15:57,860 happen to be.
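Sketched out, the one change tf1.py makes over tf0.py is a filtering step before counting; the file name follows the walkthrough, and the example sentence is invented:

```python
# Sketch of filtering out function words before counting term
# frequencies, per the tf1.py change described above.
import nltk  # assumes nltk.download("punkt") has been run once

with open("function_words.txt", encoding="utf-8") as f:
    function_words = set(f.read().split())

def term_frequencies(text):
    counts = {}
    for word in nltk.word_tokenize(text.lower()):
        if word in function_words:
            continue  # skip the grammatical glue: the, of, and, ...
        if word.isalpha():
            counts[word] = counts.get(word, 0) + 1
    return counts

print(term_frequencies("The inspector and Holmes spoke, and the inspector left."))
```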
1556 01:15:57,860 --> 01:16:01,820 So in tf1, what we're ultimately doing is, 1557 01:16:01,820 --> 01:16:05,360 if the word is in my set of function words, 1558 01:16:05,360 --> 01:16:08,720 I'm just going to skip over it, just ignore any of the function words 1559 01:16:08,720 --> 01:16:11,210 by continuing on to the next word and then 1560 01:16:11,210 --> 01:16:14,010 just calculating the frequencies for the remaining words instead. 1561 01:16:14,010 --> 01:16:16,520 So I'm going to pretend the function words aren't there, 1562 01:16:16,520 --> 01:16:19,550 and now maybe I can get a better sense for what 1563 01:16:19,550 --> 01:16:23,060 terms are important in each of the various different Sherlock Holmes 1564 01:16:23,060 --> 01:16:24,560 stories. 1565 01:16:24,560 --> 01:16:29,080 So now let's run tf1 on the Sherlock Holmes corpus and see what we get now. 1566 01:16:29,080 --> 01:16:32,510 And let's look at, what is the most important term in each of the stories? 1567 01:16:32,510 --> 01:16:34,760 Well, it seems like, for each of the stories, 1568 01:16:34,760 --> 01:16:36,770 the most important word is Holmes. 1569 01:16:36,770 --> 01:16:38,270 I guess that's what we would expect. 1570 01:16:38,270 --> 01:16:39,380 They're all Sherlock Holmes stories. 1571 01:16:39,380 --> 01:16:40,922 And Holmes is not a function word. 1572 01:16:40,922 --> 01:16:44,360 It's not the, or a, or an, so it wasn't ignored. 1573 01:16:44,360 --> 01:16:46,130 But Holmes and man-- 1574 01:16:46,130 --> 01:16:50,760 these are probably not what I mean when I say, what are the important words? 1575 01:16:50,760 --> 01:16:52,700 Even though Holmes does show up the most often, 1576 01:16:52,700 --> 01:16:54,890 it's not giving me a whole lot of information here 1577 01:16:54,890 --> 01:16:57,800 about what each of the different Sherlock Holmes stories 1578 01:16:57,800 --> 01:16:59,460 are actually about. 1579 01:16:59,460 --> 01:17:02,880 And the reason why is because Sherlock Holmes shows up in all the stories, 1580 01:17:02,880 --> 01:17:06,950 and so it's not meaningful for me to say that this story is about Sherlock 1581 01:17:06,950 --> 01:17:09,560 Holmes when I want to try and figure out the different topics 1582 01:17:09,560 --> 01:17:11,180 across the corpus of documents. 1583 01:17:11,180 --> 01:17:13,640 What I really want to know is, what words show up 1584 01:17:13,640 --> 01:17:18,170 in this document that show up less frequently in the other documents, 1585 01:17:18,170 --> 01:17:19,380 for example? 1586 01:17:19,380 --> 01:17:22,730 And so to get at that idea, we're going to introduce the notion 1587 01:17:22,730 --> 01:17:25,850 of inverse document frequency. 1588 01:17:25,850 --> 01:17:29,450 Inverse document frequency is a measure of how common, 1589 01:17:29,450 --> 01:17:33,530 or rare, a word happens to be across an entire corpus of documents. 1590 01:17:33,530 --> 01:17:35,960 And mathematically, it's usually calculated like this-- 1591 01:17:35,960 --> 01:17:39,440 as the logarithm of the total number of documents 1592 01:17:39,440 --> 01:17:43,550 divided by the number of documents containing the word. 1593 01:17:43,550 --> 01:17:47,510 So if a word like Holmes shows up in all of the documents, 1594 01:17:47,510 --> 01:17:50,870 well, then the total number of documents is however many documents there 1595 01:17:50,870 --> 01:17:55,110 are, and the number of documents containing Holmes is going to be that same number.
1596 01:17:55,110 --> 01:17:58,760 So when you divide these two, you'll get 1, and the logarithm of 1 1597 01:17:58,760 --> 01:18:00,460 is just 0. 1598 01:18:00,460 --> 01:18:04,370 And so what we get is, if Holmes shows up in all of the documents, 1599 01:18:04,370 --> 01:18:07,040 it has an inverse document frequency of 0. 1600 01:18:07,040 --> 01:18:09,560 And you can now think of inverse document frequency 1601 01:18:09,560 --> 01:18:13,370 as a measure of how rare the word that 1602 01:18:13,370 --> 01:18:16,280 shows up in this particular document is-- that if a word doesn't show up 1603 01:18:16,280 --> 01:18:21,060 across many documents at all, this number is going to be much higher. 1604 01:18:21,060 --> 01:18:24,710 And this then gets us to a model known as tf-idf, 1605 01:18:24,710 --> 01:18:28,310 which is a method for ranking what words are important in a document 1606 01:18:28,310 --> 01:18:30,440 by multiplying these two ideas together. 1607 01:18:30,440 --> 01:18:37,190 Multiply term frequency, or TF, by inverse document frequency, or IDF, 1608 01:18:37,190 --> 01:18:39,890 where the idea here now is that how important a word is 1609 01:18:39,890 --> 01:18:41,540 depends on two things. 1610 01:18:41,540 --> 01:18:44,197 It depends on how often it shows up in the document, using 1611 01:18:44,197 --> 01:18:46,280 the heuristic that, if a word shows up more often, 1612 01:18:46,280 --> 01:18:47,900 it's probably more important. 1613 01:18:47,900 --> 01:18:51,170 And we multiply that by inverse document frequency, or IDF, 1614 01:18:51,170 --> 01:18:54,900 because if the word is rarer, but it shows up in the document, 1615 01:18:54,900 --> 01:18:57,200 it's probably more important than if the word shows up 1616 01:18:57,200 --> 01:19:00,200 across most or all of the documents, because then it's probably 1617 01:19:00,200 --> 01:19:02,990 a less important factor in what the different topics 1618 01:19:02,990 --> 01:19:06,840 across the different documents in the corpus happen to be. 1619 01:19:06,840 --> 01:19:11,060 And so now let's go ahead and apply this algorithm on the Sherlock Holmes 1620 01:19:11,060 --> 01:19:13,340 corpus. 1621 01:19:13,340 --> 01:19:15,650 And here's tfidf.py. 1622 01:19:15,650 --> 01:19:18,860 Now what I'm doing is, for each of the documents, 1623 01:19:18,860 --> 01:19:22,120 for each word, I'm calculating its TF score, 1624 01:19:22,120 --> 01:19:25,160 term frequency, multiplied by the inverse document 1625 01:19:25,160 --> 01:19:28,190 frequency of that word-- not just looking at the single value, 1626 01:19:28,190 --> 01:19:30,410 but multiplying these two values together 1627 01:19:30,410 --> 01:19:33,650 in order to compute the overall values. 1628 01:19:33,650 --> 01:19:37,610 And now, if I run tfidf on the Holmes corpus, 1629 01:19:37,610 --> 01:19:40,615 this is going to try and get us a better approximation for what's 1630 01:19:40,615 --> 01:19:41,990 important in each of the stories. 1631 01:19:41,990 --> 01:19:44,000 And it seems like it's trying to extract here 1632 01:19:44,000 --> 01:19:46,280 probably the names of characters that 1633 01:19:46,280 --> 01:19:49,010 happen to be important in the story-- characters that show up 1634 01:19:49,010 --> 01:19:51,380 in this story that don't show up in the other stories-- 1635 01:19:51,380 --> 01:19:53,930 and prioritizing the more important characters that 1636 01:19:53,930 --> 01:19:56,510 happen to show up more often.
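Putting the two measures together, a minimal sketch of the tf-idf computation might look like this. The toy corpus is invented for clarity; the real tfidf code reads the Holmes files instead:

```python
# A minimal sketch of tf-idf: score = term frequency * log(total
# documents / documents containing the word). Toy corpus for clarity.
import math

corpus = {
    "story1": ["holmes", "watson", "snake", "snake"],
    "story2": ["holmes", "watson", "treaty"],
}

for document, words in corpus.items():
    scores = []
    for word in set(words):
        tf = words.count(word)
        containing = sum(word in others for others in corpus.values())
        idf = math.log(len(corpus) / containing)
        scores.append((word, tf * idf))
    # "holmes" appears in every document, so its idf (and score) is 0.
    print(document, sorted(scores, key=lambda s: s[1], reverse=True))
```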
1637 01:19:56,510 --> 01:20:00,170 And so this then might be a better analysis of what types of topics 1638 01:20:00,170 --> 01:20:02,070 are more or less important. 1639 01:20:02,070 --> 01:20:05,330 I also have another corpus, which is a corpus of all of the Federalist 1640 01:20:05,330 --> 01:20:07,700 Papers from American history. 1641 01:20:07,700 --> 01:20:11,240 If I go ahead and run tfidf on the Federalist Papers, 1642 01:20:11,240 --> 01:20:14,330 we can begin to see what the important words in each 1643 01:20:14,330 --> 01:20:16,910 of the various different Federalist Papers happen to be-- 1644 01:20:16,910 --> 01:20:22,070 that in Federalist Paper Number 61, it seems like it's a lot about elections. 1645 01:20:22,070 --> 01:20:25,350 In Federalist Paper 66, it's about the Senate and impeachments. 1646 01:20:25,350 --> 01:20:28,470 You can start to extract what the important terms and 1647 01:20:28,470 --> 01:20:32,540 words are just by looking at what things 1648 01:20:32,540 --> 01:20:34,800 don't show up across many of the documents, 1649 01:20:34,800 --> 01:20:38,637 but show up frequently enough in certain of the documents. 1650 01:20:38,637 --> 01:20:40,470 And so this can be a helpful tool for trying 1651 01:20:40,470 --> 01:20:43,350 to figure out this kind of topic modeling, 1652 01:20:43,350 --> 01:20:47,100 figuring out what it is that a particular document happens 1653 01:20:47,100 --> 01:20:48,620 to be about. 1654 01:20:48,620 --> 01:20:53,070 And so this then is starting to get us into this world of semantics, 1655 01:20:53,070 --> 01:20:56,880 what it is that things actually mean when we're talking about language. 1656 01:20:56,880 --> 01:20:59,100 Now, we're no longer going to just think about the bag of words, 1657 01:20:59,100 --> 01:21:02,670 where we treat a sample of text as just a whole bunch of words 1658 01:21:02,670 --> 01:21:04,320 and we don't care about the order. 1659 01:21:04,320 --> 01:21:06,870 Now, when we get into the world of semantics, 1660 01:21:06,870 --> 01:21:10,750 we really do start to care about what it is that these words actually mean, 1661 01:21:10,750 --> 01:21:12,850 how it is these words relate to each other, 1662 01:21:12,850 --> 01:21:17,250 and in particular, how we can extract information out of that text. 1663 01:21:17,250 --> 01:21:20,970 Information extraction is the task of extracting knowledge 1664 01:21:20,970 --> 01:21:23,970 from our documents-- figuring out, given a whole bunch of text, 1665 01:21:23,970 --> 01:21:28,140 can we automate the process of having an AI look at those documents 1666 01:21:28,140 --> 01:21:31,710 and get out what the useful or relevant knowledge inside those documents 1667 01:21:31,710 --> 01:21:33,190 happens to be? 1668 01:21:33,190 --> 01:21:34,950 So let's take a look at an example. 1669 01:21:34,950 --> 01:21:37,415 I'll give you two samples from news articles. 1670 01:21:37,415 --> 01:21:40,290 Here up above is a sample of a news article from the Harvard Business 1671 01:21:40,290 --> 01:21:42,310 Review that was about Facebook. 1672 01:21:42,310 --> 01:21:45,630 Down below is an example of a Business Insider article from 2018 1673 01:21:45,630 --> 01:21:47,550 that was about Amazon.
1674 01:21:47,550 --> 01:21:49,710 And there's some information here that we might 1675 01:21:49,710 --> 01:21:51,570 want an AI to be able to extract-- 1676 01:21:51,570 --> 01:21:54,030 information, knowledge about these companies 1677 01:21:54,030 --> 01:21:55,670 that we might want to extract. 1678 01:21:55,670 --> 01:21:58,020 And in particular, what I might want to extract is-- 1679 01:21:58,020 --> 01:22:02,260 let's say I want to know data about when companies were founded-- 1680 01:22:02,260 --> 01:22:05,250 that I want to know that Facebook was founded in 2004 1681 01:22:05,250 --> 01:22:07,190 and Amazon in 1994-- 1682 01:22:07,190 --> 01:22:10,500 that that is important information that I happen to care about. 1683 01:22:10,500 --> 01:22:13,110 Well, how do we extract that information from the text? 1684 01:22:13,110 --> 01:22:15,660 What is my way of being able to understand this text 1685 01:22:15,660 --> 01:22:18,810 and figure out, all right, Facebook was founded in 2004? 1686 01:22:18,810 --> 01:22:22,710 Well, what I can look for are templates or patterns, things 1687 01:22:22,710 --> 01:22:26,700 that happen to show up across multiple different documents that give me 1688 01:22:26,700 --> 01:22:28,922 some sense for what this knowledge happens to mean. 1689 01:22:28,922 --> 01:22:30,630 And what we'll notice is a common pattern 1690 01:22:30,630 --> 01:22:34,500 between both of these passages, which is this phrasing here. 1691 01:22:34,500 --> 01:22:37,890 When Facebook was founded in 2004, comma-- 1692 01:22:37,890 --> 01:22:42,360 and then down below, when Amazon was founded in 1994, comma. 1693 01:22:42,360 --> 01:22:47,640 And those two templates end up giving us a mechanism for trying to extract 1694 01:22:47,640 --> 01:22:53,220 information-- that this notion, when company was founded in year, comma, 1695 01:22:53,220 --> 01:22:56,310 can tell us something about when a company was founded, 1696 01:22:56,310 --> 01:22:58,820 because if we set our AI loose on the web, 1697 01:22:58,820 --> 01:23:01,530 let it look at a whole bunch of pages or a whole bunch of articles, 1698 01:23:01,530 --> 01:23:03,360 and it finds this pattern-- 1699 01:23:03,360 --> 01:23:06,930 when blank was founded in blank, comma-- 1700 01:23:06,930 --> 01:23:09,840 well, then our AI can pretty reasonably conclude 1701 01:23:09,840 --> 01:23:13,740 that there's a good chance that this is going to be some company, 1702 01:23:13,740 --> 01:23:17,470 and this is going to be the year that company was founded, for example-- 1703 01:23:17,470 --> 01:23:20,907 might not be perfect, but at least it's a good heuristic. 1704 01:23:20,907 --> 01:23:22,740 And so you might imagine that, if you wanted 1705 01:23:22,740 --> 01:23:25,650 to train an AI to be able to look for information, 1706 01:23:25,650 --> 01:23:27,810 you might give the AI templates like this-- 1707 01:23:27,810 --> 01:23:31,200 not only a template like, when blank was founded in blank, 1708 01:23:31,200 --> 01:23:34,710 but also templates like, the book blank was written by blank, for example. 1709 01:23:34,710 --> 01:23:37,500 Just give it some templates where it can search the web, 1710 01:23:37,500 --> 01:23:41,640 search a whole big corpus of documents, looking for templates that match that, 1711 01:23:41,640 --> 01:23:44,970 and if it finds that, then it's able to figure out, 1712 01:23:44,970 --> 01:23:47,370 all right, here's the company and here's the year.
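As a sketch, a single hand-written template of that kind is just a regular expression. The text below is a stand-in paraphrasing the two articles, not their actual wording:

```python
# A sketch of matching the template "when COMPANY was founded in YEAR,"
# with a regular expression. The text is a stand-in for the articles.
import re

template = re.compile(r"[Ww]hen (\w+) was founded in (\d{4}),")

text = (
    "When Facebook was founded in 2004, few predicted its growth. "
    "Back when Amazon was founded in 1994, online retail barely existed."
)

for company, year in template.findall(text):
    print(company, year)  # Facebook 2004, then Amazon 1994
```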
1713 01:23:47,370 --> 01:23:50,250 But of course, that requires us to write these templates. 1714 01:23:50,250 --> 01:23:53,547 It requires us to figure out, what is the structure of this information 1715 01:23:53,547 --> 01:23:54,630 likely going to look like? 1716 01:23:54,630 --> 01:23:56,190 And it might be difficult to know. 1717 01:23:56,190 --> 01:23:58,500 Different websites are, of course, going to do this differently. 1718 01:23:58,500 --> 01:24:01,830 This type of method isn't going to be able to extract all of the information, 1719 01:24:01,830 --> 01:24:04,170 because if the words are in a slightly different order, 1720 01:24:04,170 --> 01:24:06,840 it won't match on that particular template. 1721 01:24:06,840 --> 01:24:11,310 But one thing we can do is, rather than give our AI the template, 1722 01:24:11,310 --> 01:24:13,290 we can give the AI the data. 1723 01:24:13,290 --> 01:24:19,540 We can tell the AI, Facebook was founded in 2004 and Amazon was founded in 1994, 1724 01:24:19,540 --> 01:24:22,440 and just tell the AI those two pieces of information, 1725 01:24:22,440 --> 01:24:24,780 and then set the AI loose on the web. 1726 01:24:24,780 --> 01:24:30,030 And now the idea is that the AI can begin to look for where Facebook and 2004 1727 01:24:30,030 --> 01:24:33,150 show up together, where Amazon and 1994 show up together, 1728 01:24:33,150 --> 01:24:36,150 and it can discover these templates for itself. 1729 01:24:36,150 --> 01:24:38,580 It can discover that this kind of phrasing-- 1730 01:24:38,580 --> 01:24:40,320 when blank was founded in blank-- 1731 01:24:40,320 --> 01:24:45,030 tends to relate Facebook to 2004, and relates Amazon to 1994, 1732 01:24:45,030 --> 01:24:49,320 so maybe it will hold the same relation for other pairs as well. 1733 01:24:49,320 --> 01:24:51,572 And this ends up being-- this automated template 1734 01:24:51,572 --> 01:24:54,030 generation ends up being quite powerful, and we'll go ahead 1735 01:24:54,030 --> 01:24:56,250 and take a look at that now as well. 1736 01:24:56,250 --> 01:24:59,040 What I have here inside of the templates directory 1737 01:24:59,040 --> 01:25:03,120 is a file called companies.csv, and this is all of the data 1738 01:25:03,120 --> 01:25:04,520 that I am going to give to my AI. 1739 01:25:04,520 --> 01:25:09,000 I'm going to give it the pairs Amazon, 1994 and Facebook, 2004. 1740 01:25:09,000 --> 01:25:11,190 And what I'm going to tell my AI to do is 1741 01:25:11,190 --> 01:25:14,010 search a corpus of documents for other data-- 1742 01:25:14,010 --> 01:25:16,620 other pairs like this, other relationships. 1743 01:25:16,620 --> 01:25:18,990 I'm not telling the AI that this is a company and the date 1744 01:25:18,990 --> 01:25:19,920 that it was founded. 1745 01:25:19,920 --> 01:25:23,750 I'm just giving it Amazon, 1994 and Facebook, 2004 1746 01:25:23,750 --> 01:25:25,550 and letting the AI do the rest.
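A much-simplified sketch of that discovery step: find where a seed pair co-occurs, and turn the text between the two values into a reusable pattern. The actual template generation in search.py is more general than this; the documents here are invented stand-ins:

```python
# A simplified sketch of discovering a template from seed pairs like
# (Facebook, 2004): keep the text between the two values as a pattern.
import re

seeds = [("Facebook", "2004"), ("Amazon", "1994")]
documents = [
    "When Facebook was founded in 2004, the social web changed.",
    "Back when Amazon was founded in 1994, e-commerce was young.",
]

templates = set()
for document in documents:
    for name, year in seeds:
        match = re.search(re.escape(name) + r"(.{1,40}?)" + re.escape(year),
                          document)
        if match:
            # Generalize the seed values into capture groups.
            templates.add(r"(\w+)" + re.escape(match.group(1)) + r"(\d{4})")

print(templates)  # one template: (\w+) was founded in (\d{4})
```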
1747 01:25:25,550 --> 01:25:28,640 And what the AI is going to do is it's going to look through my corpus-- 1748 01:25:28,640 --> 01:25:30,770 here's my corpus of documents-- 1749 01:25:30,770 --> 01:25:33,590 and it's going to find, inside of Business Insider, 1750 01:25:33,590 --> 01:25:38,580 that we have sentences like, back when Amazon was founded in 1994, comma-- 1751 01:25:38,580 --> 01:25:42,740 and that kind of phrasing is going to be similar to this Harvard Business Review 1752 01:25:42,740 --> 01:25:46,935 story that has a sentence like, when Facebook was founded in 2004-- 1753 01:25:46,935 --> 01:25:49,310 and it's going to look across a number of other documents 1754 01:25:49,310 --> 01:25:53,820 for similar types of patterns to be able to extract that kind of information. 1755 01:25:53,820 --> 01:25:56,450 And to see what it will do, I'll go ahead and run it. 1756 01:25:56,450 --> 01:25:58,660 So I'll go ahead and go into the templates directory. 1757 01:25:58,660 --> 01:26:01,220 And I'll say python search.py. 1758 01:26:01,220 --> 01:26:05,030 I'm going to look for data like the data in companies.csv 1759 01:26:05,030 --> 01:26:08,690 inside of the companies directory, which contains a whole bunch of news articles 1760 01:26:08,690 --> 01:26:10,900 that I've curated in advance. 1761 01:26:10,900 --> 01:26:12,080 And here's what I get-- 1762 01:26:12,080 --> 01:26:15,560 Google 1998, Apple 1976, Microsoft 1975-- 1763 01:26:15,560 --> 01:26:16,400 so on and so forth-- 1764 01:26:16,400 --> 01:26:18,470 Walmart 1962, for example. 1765 01:26:18,470 --> 01:26:20,810 These are all of the pieces of data that happened 1766 01:26:20,810 --> 01:26:23,750 to match that same template that we were able to find before. 1767 01:26:23,750 --> 01:26:25,430 And how was it able to find this? 1768 01:26:25,430 --> 01:26:29,460 Well, it's probably because, if we look at the Forbes article, 1769 01:26:29,460 --> 01:26:34,730 for example, it has a phrase in it like, when Walmart was founded in 1962, 1770 01:26:34,730 --> 01:26:38,000 comma-- and so it's able to identify these sorts of patterns 1771 01:26:38,000 --> 01:26:39,890 and extract information from them. 1772 01:26:39,890 --> 01:26:42,650 Now, granted, I have curated all these stories in advance 1773 01:26:42,650 --> 01:26:46,130 in order to make sure that there is data that it's able to match on. 1774 01:26:46,130 --> 01:26:49,100 And in practice, it's not always going to be in this exact format 1775 01:26:49,100 --> 01:26:52,430 when you're seeing a company related to the year in which it was founded, 1776 01:26:52,430 --> 01:26:56,030 but if you give the AI access to enough data-- like all of the text data 1777 01:26:56,030 --> 01:26:58,910 on the internet-- and just have the AI crawl the internet looking 1778 01:26:58,910 --> 01:27:02,720 for information, it can, with some probability, 1779 01:27:02,720 --> 01:27:05,780 try and extract information using these sorts of templates 1780 01:27:05,780 --> 01:27:08,330 and be able to generate interesting sorts of knowledge. 1781 01:27:08,330 --> 01:27:10,940 And the more knowledge it learns, the more new templates 1782 01:27:10,940 --> 01:27:13,190 it's able to construct, looking for constructions that 1783 01:27:13,190 --> 01:27:15,930 show up in other locations as well. 1784 01:27:15,930 --> 01:27:17,910 So let's take a look at another example.
1785 01:27:17,910 --> 01:27:20,955 And here I'll show you presidents.csv, 1786 01:27:20,955 --> 01:27:23,330 where I have two presidents and their inauguration dates-- 1787 01:27:23,330 --> 01:27:28,220 so George Washington, 1789 and Barack Obama, 2009, for example. 1788 01:27:28,220 --> 01:27:31,430 And I'm also going to give our AI a corpus that 1789 01:27:31,430 --> 01:27:34,550 just contains a single document, which is the Wikipedia 1790 01:27:34,550 --> 01:27:37,880 article for the list of presidents of the United States, for example-- 1791 01:27:37,880 --> 01:27:39,680 just information about presidents. 1792 01:27:39,680 --> 01:27:45,147 And I'd like to extract from this raw HTML document on a web page information 1793 01:27:45,147 --> 01:27:45,980 about the presidents. 1794 01:27:45,980 --> 01:27:50,460 So I can run the search on presidents.csv. 1795 01:27:50,460 --> 01:27:53,720 And what I get is a whole bunch of data about presidents 1796 01:27:53,720 --> 01:27:56,300 and what year they were likely inaugurated, by looking 1797 01:27:56,300 --> 01:27:58,010 for patterns that match-- 1798 01:27:58,010 --> 01:28:00,180 Barack Obama, 2009, for example-- 1799 01:28:00,180 --> 01:28:02,280 these sorts of patterns that happen 1800 01:28:02,280 --> 01:28:07,287 to give us some clues as to what it is that a story happens to be about. 1801 01:28:07,287 --> 01:28:08,370 So here's another example. 1802 01:28:08,370 --> 01:28:12,710 If I open up the olympics directory, here is a scraped version 1803 01:28:12,710 --> 01:28:15,050 of the Olympics home page that has information 1804 01:28:15,050 --> 01:28:16,610 about various different Olympics. 1805 01:28:16,610 --> 01:28:20,360 And maybe I want to extract Olympic locations and years 1806 01:28:20,360 --> 01:28:21,980 from this particular page. 1807 01:28:21,980 --> 01:28:24,950 Well, the way I can do that is using the exact same algorithm. 1808 01:28:24,950 --> 01:28:29,730 I'm just saying, all right, here are two Olympics and where they were located-- 1809 01:28:29,730 --> 01:28:32,160 so 2012, London, for example. 1810 01:28:32,160 --> 01:28:35,030 Let me go ahead and just run this process, 1811 01:28:35,030 --> 01:28:39,440 python search.py, on olympics.csv, looking at the whole Olympics data set, 1812 01:28:39,440 --> 01:28:41,280 and here I get some information back. 1813 01:28:41,280 --> 01:28:43,310 Now, this information-- not totally perfect. 1814 01:28:43,310 --> 01:28:45,530 There are a couple of examples that are obviously not 1815 01:28:45,530 --> 01:28:48,955 quite right, because my template might have been a little bit too general. 1816 01:28:48,955 --> 01:28:51,080 Maybe it was looking for a broad category of things, 1817 01:28:51,080 --> 01:28:55,190 and certain strange things happened to match on that particular template. 1818 01:28:55,190 --> 01:28:58,730 So you could imagine adding rules to try and make this process more intelligent, 1819 01:28:58,730 --> 01:29:02,000 making sure the thing on the left is just a year, 1820 01:29:02,000 --> 01:29:04,280 for instance, and doing other sorts of analysis. 1821 01:29:04,280 --> 01:29:07,040 But purely just based on some data, we are 1822 01:29:07,040 --> 01:29:10,700 able to extract some interesting information using some algorithms.
1823 01:29:10,700 --> 01:29:16,100 And all search.py is really doing here is taking my corpus of data, 1824 01:29:16,100 --> 01:29:18,260 finding templates that match it-- 1825 01:29:18,260 --> 01:29:22,280 here, I'm filtering down to just the top two templates that happen to match-- 1826 01:29:22,280 --> 01:29:26,960 and then using those templates to extract results from the data 1827 01:29:26,960 --> 01:29:30,860 that I have access to, being able to look for all of the information 1828 01:29:30,860 --> 01:29:31,670 that I care about. 1829 01:29:31,670 --> 01:29:33,587 And that's ultimately what's going to let me 1830 01:29:33,587 --> 01:29:38,390 print out those results and figure out what the matches happen to be. 1831 01:29:38,390 --> 01:29:41,090 And so information extraction is another powerful tool 1832 01:29:41,090 --> 01:29:43,970 when it comes to trying to extract information. 1833 01:29:43,970 --> 01:29:46,220 But of course, it only works in very limited contexts. 1834 01:29:46,220 --> 01:29:49,640 It only works when I'm able to find templates that look exactly 1835 01:29:49,640 --> 01:29:53,000 like this, in order to come up with some sort of match that 1836 01:29:53,000 --> 01:29:55,430 is able to connect the text to some pair of data-- 1837 01:29:55,430 --> 01:29:57,890 that this company was founded in this year. 1838 01:29:57,890 --> 01:30:01,670 What I might want to do, as we start to think about the semantics of words, 1839 01:30:01,670 --> 01:30:04,880 is to begin to imagine some way of coming up with definitions 1840 01:30:04,880 --> 01:30:08,120 for all words, being able to relate all of the words in a dictionary 1841 01:30:08,120 --> 01:30:12,110 to each other, because that's ultimately what's going to be necessary if we want 1842 01:30:12,110 --> 01:30:13,530 our AI to be able to communicate. 1843 01:30:13,530 --> 01:30:18,500 We need some representation of what it is that words mean. 1844 01:30:18,500 --> 01:30:22,340 And one approach to doing this is a famous data set called WordNet. 1845 01:30:22,340 --> 01:30:24,440 And what WordNet is, is a human-curated data set-- 1846 01:30:24,440 --> 01:30:27,380 researchers have curated together a whole bunch of words, 1847 01:30:27,380 --> 01:30:29,595 their definitions, their various different senses-- 1848 01:30:29,595 --> 01:30:31,970 because a word might have multiple different meanings-- 1849 01:30:31,970 --> 01:30:35,347 and also how those words relate to one another. 1850 01:30:35,347 --> 01:30:36,680 And so what we mean by this is-- 1851 01:30:36,680 --> 01:30:38,750 I can show you an example of WordNet. 1852 01:30:38,750 --> 01:30:40,550 WordNet comes built into NLTK. 1853 01:30:40,550 --> 01:30:44,060 Using NLTK, you can download and access WordNet. 1854 01:30:44,060 --> 01:30:48,080 So let me go into the wordnet directory, go ahead and run it, 1855 01:30:48,080 --> 01:30:52,100 and extract information about a word-- a word like city, for example. 1856 01:30:52,100 --> 01:30:53,600 Go ahead and press Return. 1857 01:30:53,600 --> 01:30:56,210 And here is the information that I get back about a city. 1858 01:30:56,210 --> 01:30:59,360 It turns out that city has three different senses, three 1859 01:30:59,360 --> 01:31:01,460 different meanings, according to WordNet. 1860 01:31:01,460 --> 01:31:03,770 And it's really just kind of like a dictionary, where 1861 01:31:03,770 --> 01:31:07,400 each sense is associated with its meaning-- just some definition 1862 01:31:07,400 --> 01:31:08,810 provided by a human.
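Through NLTK, the WordNet lookup itself is short. This assumes the WordNet data has already been fetched with nltk.download:

```python
# A small sketch of querying WordNet via NLTK; run
# nltk.download("wordnet") once beforehand to fetch the data.
from nltk.corpus import wordnet

for synset in wordnet.synsets("city"):
    # Each sense comes with a human-written definition...
    print(synset.name(), "-", synset.definition())
    # ...and hypernyms: the "is a kind of" categories mentioned here.
    for hypernym in synset.hypernyms():
        print("   a kind of:", hypernym.name())
```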
1863 01:31:08,810 --> 01:31:13,130 And then it's also got categories, for example, that a word belongs to-- 1864 01:31:13,130 --> 01:31:15,830 that a city is a type of municipality, a city 1865 01:31:15,830 --> 01:31:18,150 is a type of administrative district. 1866 01:31:18,150 --> 01:31:20,510 And that allows me to relate words to other words. 1867 01:31:20,510 --> 01:31:24,380 So one of the powers of WordNet is the ability to take one word 1868 01:31:24,380 --> 01:31:28,590 and connect it to other related words. 1869 01:31:28,590 --> 01:31:33,380 If I do another example, let me try the word house, for instance. 1870 01:31:33,380 --> 01:31:36,690 I'll type in the word house and see what I get back. 1871 01:31:36,690 --> 01:31:38,750 Well, all right, a house is a kind of building. 1872 01:31:38,750 --> 01:31:42,160 A house is somehow related to a family unit. 1873 01:31:42,160 --> 01:31:43,910 And so you might imagine trying to come up 1874 01:31:43,910 --> 01:31:46,760 with these various different ways of describing a house. 1875 01:31:46,760 --> 01:31:47,490 It is a building. 1876 01:31:47,490 --> 01:31:48,500 It is a dwelling. 1877 01:31:48,500 --> 01:31:51,110 And researchers have just curated these relationships 1878 01:31:51,110 --> 01:31:55,100 between these various different words to say that a house is a type of building, 1879 01:31:55,100 --> 01:31:58,890 that a house is a type of dwelling, for example. 1880 01:31:58,890 --> 01:32:01,370 But this type of approach, while certainly 1881 01:32:01,370 --> 01:32:04,640 helpful for being able to relate words to one another, 1882 01:32:04,640 --> 01:32:06,920 doesn't scale particularly well. 1883 01:32:06,920 --> 01:32:08,990 As you start to think about language changing, 1884 01:32:08,990 --> 01:32:11,870 as you start to think about all the various different relationships 1885 01:32:11,870 --> 01:32:16,070 that words might have to one another, this challenge of word representation 1886 01:32:16,070 --> 01:32:18,200 ends up being difficult. What we've done is just 1887 01:32:18,200 --> 01:32:23,450 define a word as a sentence that explains what that word is, 1888 01:32:23,450 --> 01:32:26,030 but what we really would like is some way 1889 01:32:26,030 --> 01:32:28,615 to represent the meaning of a word in a way 1890 01:32:28,615 --> 01:32:31,240 that our AI is going to be able to do something useful with. 1891 01:32:31,240 --> 01:32:33,830 Anytime we want our AI to be able to look at text 1892 01:32:33,830 --> 01:32:35,840 and really understand what that text means, 1893 01:32:35,840 --> 01:32:38,360 to relate text and words to similar words 1894 01:32:38,360 --> 01:32:40,700 and understand the relationships between words, 1895 01:32:40,700 --> 01:32:44,745 we'd like some way that a computer can represent this information. 1896 01:32:44,745 --> 01:32:46,620 And what we've seen all throughout the course 1897 01:32:46,620 --> 01:32:48,800 multiple times now is the idea that, when 1898 01:32:48,800 --> 01:32:51,110 we want our AI to represent something, it 1899 01:32:51,110 --> 01:32:54,890 can be helpful to have the AI represent it using numbers-- 1900 01:32:54,890 --> 01:32:57,530 we've seen that we can represent utilities in a game, 1901 01:32:57,530 --> 01:32:59,900 like winning, or losing, or drawing, as a number-- 1902 01:32:59,900 --> 01:33:01,520 1, negative 1, or 0.
1903 01:33:01,520 --> 01:33:04,400 We've seen other ways that we can take data and turn it 1904 01:33:04,400 --> 01:33:06,650 into a vector of features, where we just have 1905 01:33:06,650 --> 01:33:11,270 a whole bunch of numbers that represent some particular piece of data. 1906 01:33:11,270 --> 01:33:14,340 And if we ever want to pass words into a neural network, 1907 01:33:14,340 --> 01:33:16,580 for instance, to be able to say, given some word, 1908 01:33:16,580 --> 01:33:18,650 translate this sentence into another sentence, 1909 01:33:18,650 --> 01:33:21,890 or to be able to do interesting classifications with neural networks 1910 01:33:21,890 --> 01:33:26,000 on individual words, we need some representation of words 1911 01:33:26,000 --> 01:33:27,980 just in terms of vectors-- 1912 01:33:27,980 --> 01:33:31,820 a way to represent words, just by using individual numbers 1913 01:33:31,820 --> 01:33:34,495 to define the meaning of a word. 1914 01:33:34,495 --> 01:33:35,370 So how do we do that? 1915 01:33:35,370 --> 01:33:37,767 How do we take words and turn them into vectors 1916 01:33:37,767 --> 01:33:40,100 that we can use to represent the meaning of those words? 1917 01:33:40,100 --> 01:33:42,110 Well, one way is to do this. 1918 01:33:42,110 --> 01:33:46,280 If I have four words that I want to encode, like he wrote a book, 1919 01:33:46,280 --> 01:33:49,250 I can just say, let's let the word he be this vector-- 1920 01:33:49,250 --> 01:33:51,470 1, 0, 0, 0. 1921 01:33:51,470 --> 01:33:53,990 Wrote will be 0, 1, 0, 0. 1922 01:33:53,990 --> 01:33:56,390 A will be 0, 0, 1, 0. 1923 01:33:56,390 --> 01:33:59,570 Book will be 0, 0, 0, 1. 1924 01:33:59,570 --> 01:34:03,410 Effectively, what I have here is what's known as a one-hot representation, 1925 01:34:03,410 --> 01:34:06,930 or a one-hot encoding, which is a representation of meaning 1926 01:34:06,930 --> 01:34:10,580 where meaning is a vector that has a single 1 in it and the rest are 0's. 1927 01:34:10,580 --> 01:34:14,540 The location of the 1 tells me the meaning of the word-- 1928 01:34:14,540 --> 01:34:17,020 a 1 in the first position, that means he-- 1929 01:34:17,020 --> 01:34:19,510 a 1 in the second position, that means wrote. 1930 01:34:19,510 --> 01:34:21,740 And every word in the dictionary is going 1931 01:34:21,740 --> 01:34:24,770 to be assigned some representation like this, where we just 1932 01:34:24,770 --> 01:34:28,320 assign one place in the vector that has a 1 for that word 1933 01:34:28,320 --> 01:34:29,450 and 0 for the other words. 1934 01:34:29,450 --> 01:34:31,580 And now I have representations of words that 1935 01:34:31,580 --> 01:34:33,710 are different for a whole bunch of different words. 1936 01:34:33,710 --> 01:34:36,853 This is the one-hot representation. 1937 01:34:36,853 --> 01:34:38,270 So what are the drawbacks of this? 1938 01:34:38,270 --> 01:34:40,970 Why is this not necessarily a great approach? 1939 01:34:40,970 --> 01:34:42,980 Well, here, I am only creating enough vectors 1940 01:34:42,980 --> 01:34:45,530 to represent four words in a dictionary. 1941 01:34:45,530 --> 01:34:49,580 If you imagine a dictionary with 50,000 words that I might want to represent, 1942 01:34:49,580 --> 01:34:51,590 now these vectors get enormously long. 1943 01:34:51,590 --> 01:34:54,800 These are 50,000-dimensional vectors to represent 1944 01:34:54,800 --> 01:34:58,940 a vocabulary of 50,000 words-- that he is a 1 followed by all these 0's.
1945 01:34:58,940 --> 01:35:01,280 Wrote has a whole bunch of 0's in it. 1946 01:35:01,280 --> 01:35:05,070 That's not a particularly tractable way of trying to represent words, 1947 01:35:05,070 --> 01:35:09,860 if I'm going to have to deal with vectors of length 50,000. 1948 01:35:09,860 --> 01:35:12,140 Another problem-- a subtler problem-- 1949 01:35:12,140 --> 01:35:14,870 is that ideally, I'd like for these vectors 1950 01:35:14,870 --> 01:35:17,960 to somehow represent meaning in a way that I can extract 1951 01:35:17,960 --> 01:35:21,740 useful information out of-- that if I have the sentences he wrote a book 1952 01:35:21,740 --> 01:35:26,270 and he authored a novel, well, wrote and authored are going to be two 1953 01:35:26,270 --> 01:35:28,040 totally different vectors. 1954 01:35:28,040 --> 01:35:32,180 And book and novel are going to be two totally different vectors inside 1955 01:35:32,180 --> 01:35:35,030 of my vector space that have nothing to do with each other. 1956 01:35:35,030 --> 01:35:38,420 The 1 is just located in a different position. 1957 01:35:38,420 --> 01:35:40,790 And really, what I would like to have happen 1958 01:35:40,790 --> 01:35:43,600 is for wrote and authored to have vectors 1959 01:35:43,600 --> 01:35:47,020 that are similar to one another, and for book and novel 1960 01:35:47,020 --> 01:35:49,900 to have vector representations that are similar to one another, 1961 01:35:49,900 --> 01:35:52,780 because they are words that have similar meanings. 1962 01:35:52,780 --> 01:35:56,320 Because their meanings are similar, ideally, 1963 01:35:56,320 --> 01:35:59,860 when I put them in vector form and use a vector to represent meanings, 1964 01:35:59,860 --> 01:36:04,400 I would like for those vectors to be similar to one another as well. 1965 01:36:04,400 --> 01:36:06,640 So rather than this one-hot representation, 1966 01:36:06,640 --> 01:36:10,000 where we represent a word's meaning by just giving it a vector that has a 1 1967 01:36:10,000 --> 01:36:12,620 in a particular location, what we're going to do-- 1968 01:36:12,620 --> 01:36:15,400 which is a bit of a strange thing the first time you see it-- 1969 01:36:15,400 --> 01:36:18,640 is what we're going to call a distributed representation. 1970 01:36:18,640 --> 01:36:21,580 We are going to represent the meaning of a word as just 1971 01:36:21,580 --> 01:36:25,330 a whole bunch of different values-- not just a single 1 and the rest 0's, 1972 01:36:25,330 --> 01:36:26,630 but a whole bunch of values. 1973 01:36:26,630 --> 01:36:31,240 So for example, in he wrote a book, he might just be a big vector. 1974 01:36:31,240 --> 01:36:34,510 Maybe it's 50 dimensions, maybe it's 100 dimensions, but certainly fewer 1975 01:36:34,510 --> 01:36:39,430 than tens of thousands, where each value is just some number-- 1976 01:36:39,430 --> 01:36:42,160 and the same thing for wrote, and a, and book. 1977 01:36:42,160 --> 01:36:45,070 And the idea now is that, using these vector representations, 1978 01:36:45,070 --> 01:36:48,850 I'd hope that wrote and authored have vector representations that 1979 01:36:48,850 --> 01:36:50,317 are pretty close to one another. 1980 01:36:50,317 --> 01:36:52,900 Their distance is not too far apart-- and same with the vector 1981 01:36:52,900 --> 01:36:56,230 representations for book and novel.
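To see the contrast concretely, here is a toy comparison. The distributed values are invented for illustration: one-hot vectors make every pair of distinct words equally unrelated, while distributed vectors can place wrote and authored close together.

```python
# Toy comparison of one-hot vs. distributed word vectors using cosine
# similarity. The distributed values are made up for illustration.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms

one_hot = {"wrote": [0, 1, 0, 0], "authored": [0, 0, 1, 0]}
distributed = {"wrote": [0.9, 0.1, 0.4], "authored": [0.8, 0.2, 0.5]}

print(cosine(one_hot["wrote"], one_hot["authored"]))          # 0.0
print(cosine(distributed["wrote"], distributed["authored"]))  # ~0.98
```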
1982 01:36:56,230 --> 01:37:00,940 So this is what a lot of statistical machine learning 1983 01:37:00,940 --> 01:37:02,710 approaches to natural language processing 1984 01:37:02,710 --> 01:37:06,760 are about: using these vector representations of words. 1985 01:37:06,760 --> 01:37:10,190 But how on earth do we define a word as just a whole bunch 1986 01:37:10,190 --> 01:37:11,440 of these sequences of numbers? 1987 01:37:11,440 --> 01:37:16,668 What does it even mean to talk about the meaning of a word? 1988 01:37:16,668 --> 01:37:18,460 The famous quote that answers this question 1989 01:37:18,460 --> 01:37:22,930 is from a British linguist in the 1950s, J.R. Firth, who said, "You shall 1990 01:37:22,930 --> 01:37:25,060 know a word by the company it keeps." 1991 01:37:25,060 --> 01:37:28,150 1992 01:37:28,150 --> 01:37:30,400 And what we mean by that is the idea that we 1993 01:37:30,400 --> 01:37:35,290 can define a word in terms of the words that show up around it, that we can get 1994 01:37:35,290 --> 01:37:39,070 at the meaning of a word based on the context in which that word happens 1995 01:37:39,070 --> 01:37:40,370 to appear. 1996 01:37:40,370 --> 01:37:43,900 That if I have a sentence like this, four words in sequence-- 1997 01:37:43,900 --> 01:37:46,180 for blank he ate-- 1998 01:37:46,180 --> 01:37:47,442 what goes in the blank? 1999 01:37:47,442 --> 01:37:49,150 Well, you might imagine that, in English, 2000 01:37:49,150 --> 01:37:52,192 the types of words that might fill in the blank are words like breakfast, 2001 01:37:52,192 --> 01:37:53,170 or lunch, or dinner. 2002 01:37:53,170 --> 01:37:56,480 These are the kinds of words that fill in that blank. 2003 01:37:56,480 --> 01:38:00,730 And so if we want to define what lunch or dinner means, 2004 01:38:00,730 --> 01:38:03,970 we can define it in terms of what words happen 2005 01:38:03,970 --> 01:38:07,030 to show up around it-- that if a word shows up 2006 01:38:07,030 --> 01:38:09,700 in a particular context and another word happens to show up 2007 01:38:09,700 --> 01:38:13,750 in a very similar context, then those two words are probably 2008 01:38:13,750 --> 01:38:15,040 related to each other. 2009 01:38:15,040 --> 01:38:18,280 They probably have a similar meaning to one another. 2010 01:38:18,280 --> 01:38:20,950 And this then is the foundational idea of an algorithm 2011 01:38:20,950 --> 01:38:24,760 known as word2vec, which is a model for generating word vectors. 2012 01:38:24,760 --> 01:38:28,960 You give word2vec a corpus of documents, just a whole bunch of text, 2013 01:38:28,960 --> 01:38:34,832 and what word2vec will produce is vectors for each word. 2014 01:38:34,832 --> 01:38:36,790 And there are a number of ways that it can do this. 2015 01:38:36,790 --> 01:38:40,300 One common way is through what's known as the skip-gram architecture, which 2016 01:38:40,300 --> 01:38:44,470 basically uses a neural network to predict context words, 2017 01:38:44,470 --> 01:38:47,240 given a target word-- so given a word like lunch, 2018 01:38:47,240 --> 01:38:50,350 use a neural network to try and predict, given the word lunch, what 2019 01:38:50,350 --> 01:38:53,190 words are going to show up around it. 2020 01:38:53,190 --> 01:38:55,210 And so the way we might represent this is 2021 01:38:55,210 --> 01:38:57,760 with a big neural network like this, where 2022 01:38:57,760 --> 01:39:00,820 we have one input cell for every word.
2023 01:39:00,820 --> 01:39:04,900 Every word gets one node inside this neural network. 2024 01:39:04,900 --> 01:39:07,780 And the goal is to use this neural network to predict, 2025 01:39:07,780 --> 01:39:09,790 given a target word, a context word. 2026 01:39:09,790 --> 01:39:14,030 Given a word like lunch, can I predict the probabilities of other words 2027 01:39:14,030 --> 01:39:18,560 showing up in a context of one word away or two words away, for instance, 2028 01:39:18,560 --> 01:39:21,970 in some sort of window of context? 2029 01:39:21,970 --> 01:39:27,400 And if you just give the AI, this neural network, a whole bunch of data of words 2030 01:39:27,400 --> 01:39:30,790 and what words show up in context, you can train a neural network 2031 01:39:30,790 --> 01:39:34,600 to do this calculation, to be able to predict, given a target word-- 2032 01:39:34,600 --> 01:39:39,103 can I predict what those context words ultimately should be? 2033 01:39:39,103 --> 01:39:41,020 And it will do so using the same methods we've 2034 01:39:41,020 --> 01:39:43,850 talked about-- backpropagating the error from the context word 2035 01:39:43,850 --> 01:39:46,090 back through this neural network. 2036 01:39:46,090 --> 01:39:48,790 And what you get is, if we use a single layer-- 2037 01:39:48,790 --> 01:39:50,950 just a single layer of hidden nodes-- 2038 01:39:50,950 --> 01:39:54,960 what I get is, for every single one of these words-- 2039 01:39:54,960 --> 01:39:59,680 from this word, for example-- I get five edges, each of which 2040 01:39:59,680 --> 01:40:02,695 has a weight, to each of these five hidden nodes. 2041 01:40:02,695 --> 01:40:05,950 In other words, I get five numbers that effectively 2042 01:40:05,950 --> 01:40:10,180 are going to represent this particular target word here. 2043 01:40:10,180 --> 01:40:13,750 And the number of hidden nodes I choose in this middle layer here-- 2044 01:40:13,750 --> 01:40:14,420 I can pick that. 2045 01:40:14,420 --> 01:40:17,830 Maybe I'll choose to have 50 hidden nodes or 100 hidden nodes. 2046 01:40:17,830 --> 01:40:19,720 And then, for each of these target words, 2047 01:40:19,720 --> 01:40:22,630 I'll have 50 different values or 100 different values, 2048 01:40:22,630 --> 01:40:26,050 and those values we can effectively treat as the numerical 2049 01:40:26,050 --> 01:40:29,320 vector representation of that word. 2050 01:40:29,320 --> 01:40:33,520 And the general idea here is that, if words are similar-- 2051 01:40:33,520 --> 01:40:37,660 two words show up in similar contexts, meaning, for those target words, 2052 01:40:37,660 --> 01:40:40,380 I'd like to predict similar context words-- 2053 01:40:40,380 --> 01:40:43,180 well, then the values I choose in these vectors 2054 01:40:43,180 --> 01:40:45,940 here-- these numerical values for the weights of these edges-- 2055 01:40:45,940 --> 01:40:49,180 are probably going to be similar, because for two different words that 2056 01:40:49,180 --> 01:40:51,580 show up in similar contexts, I would like 2057 01:40:51,580 --> 01:40:55,030 for these values that are calculated to ultimately 2058 01:40:55,030 --> 01:40:58,250 be very similar to one another.
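The following is a toy version of that training loop in plain numpy, under some simplifying assumptions: a tiny made-up corpus, a full softmax over the vocabulary (real word2vec implementations use tricks like negative sampling), and a window of two words. The point is just to show that the input word selects one row of the first weight matrix, and after training those rows serve as the word vectors:

```python
# A toy skip-gram trainer: predict context words from a target word.
# Purely illustrative; not the course's code or a production word2vec.
import numpy as np

sentences = [s.split() for s in [
    "he wrote a book", "he authored a novel",
    "she wrote a novel", "she authored a book",
]]
vocab = sorted({w for s in sentences for w in s})
index = {w: i for i, w in enumerate(vocab)}

# Collect (target, context) pairs within a window of 2 words on each side.
pairs = []
for s in sentences:
    for i in range(len(s)):
        for j in range(max(0, i - 2), min(len(s), i + 3)):
            if j != i:
                pairs.append((index[s[i]], index[s[j]]))

rng = np.random.default_rng(0)
hidden = 5                              # number of hidden nodes (our choice)
W1 = rng.normal(scale=0.1, size=(len(vocab), hidden))  # rows = word vectors
W2 = rng.normal(scale=0.1, size=(hidden, len(vocab)))

for epoch in range(200):
    for target, context in pairs:
        h = W1[target]                  # a one-hot input just selects a row
        scores = h @ W2
        probs = np.exp(scores) / np.exp(scores).sum()   # softmax
        grad = probs.copy()
        grad[context] -= 1              # gradient of the cross-entropy loss
        grad_h = W2 @ grad              # backpropagate to the hidden layer
        W2 -= 0.05 * np.outer(h, grad)
        W1[target] -= 0.05 * grad_h

# Words that appear in similar contexts (book/novel) should now sit closer
# together in W1 than unrelated pairs (book/wrote).
d = lambda a, b: np.linalg.norm(W1[index[a]] - W1[index[b]])
print(d("book", "novel"), d("book", "wrote"))
```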
2059 01:40:58,250 --> 01:41:01,030 And so ultimately, the high-level way you can picture this 2060 01:41:01,030 --> 01:41:02,980 is that what this word2vec training method is 2061 01:41:02,980 --> 01:41:06,790 going to do is, given a whole bunch of words, where initially, 2062 01:41:06,790 --> 01:41:09,430 recall, we initialize these weights randomly and just pick 2063 01:41:09,430 --> 01:41:11,650 random weights to start with. 2064 01:41:11,650 --> 01:41:14,050 Over time, as we train the neural network, 2065 01:41:14,050 --> 01:41:17,680 we're going to adjust these weights, adjust the vector representations 2066 01:41:17,680 --> 01:41:20,860 of each of these words, so that gradually, 2067 01:41:20,860 --> 01:41:24,970 words that show up in similar contexts grow closer to one another, 2068 01:41:24,970 --> 01:41:27,190 and words that show up in different contexts 2069 01:41:27,190 --> 01:41:29,210 get farther away from one another. 2070 01:41:29,210 --> 01:41:32,890 And as a result, hopefully I get vector representations 2071 01:41:32,890 --> 01:41:36,760 of words like breakfast, and lunch, and dinner that are similar to one another, 2072 01:41:36,760 --> 01:41:39,100 and then words like book, and memoir, and novel 2073 01:41:39,100 --> 01:41:42,830 are also going to be similar to one another as well. 2074 01:41:42,830 --> 01:41:46,510 So using this algorithm, we're able to take a corpus of data 2075 01:41:46,510 --> 01:41:50,230 and just train our computer, train this neural network, to be able to figure out 2076 01:41:50,230 --> 01:41:52,650 what vector, what sequence of numbers, is going 2077 01:41:52,650 --> 01:41:55,900 to represent each of these words-- which is, again, a bit of a strange concept 2078 01:41:55,900 --> 01:41:59,450 to think about, representing a word just as a whole bunch of numbers. 2079 01:41:59,450 --> 01:42:02,860 But we'll see in a moment just how powerful this really can be. 2080 01:42:02,860 --> 01:42:08,290 So we'll go ahead and go into vectors, and what I have inside of vectors.py-- 2081 01:42:08,290 --> 01:42:09,910 which I'll open up now-- 2082 01:42:09,910 --> 01:42:14,800 is I'm opening up words.txt, which is a pretrained model-- 2083 01:42:14,800 --> 01:42:17,230 I've already run word2vec, and it's already given me 2084 01:42:17,230 --> 01:42:19,810 a whole bunch of vectors for each of these possible words. 2085 01:42:19,810 --> 01:42:22,330 And I'm just going to take like 50,000 of them 2086 01:42:22,330 --> 01:42:26,420 and go ahead and save their vectors inside of a dictionary called words. 2087 01:42:26,420 --> 01:42:29,260 And then I've also defined some functions called distance; 2088 01:42:29,260 --> 01:42:33,820 closest_words, which gets me the closest words to a particular word; 2089 01:42:33,820 --> 01:42:38,390 and closest_word, which just gets me the one closest word, for example. 2090 01:42:38,390 --> 01:42:39,860 And so now let me try doing this. 2091 01:42:39,860 --> 01:42:43,180 Let me open up the Python interpreter and say something like 2092 01:42:43,180 --> 01:42:46,080 from vectors import *-- 2093 01:42:46,080 --> 01:42:48,590 just import everything from vectors. 2094 01:42:48,590 --> 01:42:51,700 And now let's take a look at the meanings of some words. 2095 01:42:51,700 --> 01:42:55,760 Let me look at the word city, for example. 2096 01:42:55,760 --> 01:43:01,130 And here is a big array that is the vector representation of the word 2097 01:43:01,130 --> 01:43:01,630 city.
2098 01:43:01,630 --> 01:43:04,755 And this doesn't mean anything, in terms of what these numbers exactly are, 2099 01:43:04,755 --> 01:43:07,390 but this is how my computer is representing 2100 01:43:07,390 --> 01:43:08,990 the meaning of the word city. 2101 01:43:08,990 --> 01:43:11,200 We can do a different word, like the word house, 2102 01:43:11,200 --> 01:43:14,860 and here then is the vector representation of the word house, 2103 01:43:14,860 --> 01:43:17,140 for example-- just a whole bunch of numbers. 2104 01:43:17,140 --> 01:43:20,650 And this is encoding somehow the meaning of the word house. 2105 01:43:20,650 --> 01:43:22,390 And how do I get at that idea? 2106 01:43:22,390 --> 01:43:24,880 Well, one way to measure how good this is is by looking at 2107 01:43:24,880 --> 01:43:29,282 what the distance is between various different words. 2108 01:43:29,282 --> 01:43:31,240 There are a number of ways you can define distance. 2109 01:43:31,240 --> 01:43:33,310 In the context of vectors, one common way is what's 2110 01:43:33,310 --> 01:43:35,860 known as the cosine distance, which has to do with measuring 2111 01:43:35,860 --> 01:43:37,580 the angle between vectors. 2112 01:43:37,580 --> 01:43:40,150 But in short, it's just measuring, how far apart 2113 01:43:40,150 --> 01:43:42,710 are these two vectors from each other? 2114 01:43:42,710 --> 01:43:47,210 So if I take a word like the word book, how far away is it from itself-- 2115 01:43:47,210 --> 01:43:49,540 how far away is the word book from book-- 2116 01:43:49,540 --> 01:43:50,440 well, that's zero. 2117 01:43:50,440 --> 01:43:54,400 The word book is zero distance away from itself. 2118 01:43:54,400 --> 01:43:59,180 But let's see how far away the word book is from a word like breakfast, 2119 01:43:59,180 --> 01:44:03,790 where we're going to say one is very far away, zero is not far away. 2120 01:44:03,790 --> 01:44:07,430 All right, book is about 0.64 away from breakfast. 2121 01:44:07,430 --> 01:44:09,560 They seem to be pretty far apart. 2122 01:44:09,560 --> 01:44:12,920 But let's now try and calculate the distance from the word book 2123 01:44:12,920 --> 01:44:16,842 to the word novel, for example. 2124 01:44:16,842 --> 01:44:18,800 Now, those two words are closer to each other-- 2125 01:44:18,800 --> 01:44:19,730 0.34. 2126 01:44:19,730 --> 01:44:21,950 The vector representation of the word book 2127 01:44:21,950 --> 01:44:25,190 is closer to the vector representation of the word novel 2128 01:44:25,190 --> 01:44:28,350 than it is to the vector representation of the word breakfast. 2129 01:44:28,350 --> 01:44:34,010 And I can do the same thing and, say, compare breakfast to lunch, 2130 01:44:34,010 --> 01:44:35,765 for example. 2131 01:44:35,765 --> 01:44:37,640 And those two words are even closer together. 2132 01:44:37,640 --> 01:44:40,010 They have an even more similar relationship 2133 01:44:40,010 --> 01:44:42,470 between one word and another. 2134 01:44:42,470 --> 01:44:45,500 So now it seems we have some representation of words, 2135 01:44:45,500 --> 01:44:49,610 representing a word using vectors, that allows us to be able to say something 2136 01:44:49,610 --> 01:44:52,340 like: words that are similar to each other 2137 01:44:52,340 --> 01:44:55,940 ultimately have a smaller distance between them.
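The full code behind this demo isn't shown on screen, but given what the lecture describes, a plausible sketch of vectors.py might look like the following. The words.txt format, one word per line followed by its vector components, is an assumption; cosine distance here is 1 minus the cosine of the angle between the vectors, so a vector is 0 away from itself and roughly 1 away from an unrelated vector:

```python
# A plausible sketch of vectors.py, not the course's exact code. Assumes
# words.txt stores one word per line followed by that word's vector
# components, and that distance means cosine distance.
import numpy as np

words = {}
with open("words.txt") as f:
    for line in f:
        row = line.split()
        words[row[0]] = np.array(row[1:], dtype=np.float64)

def distance(w1, w2):
    # Cosine distance: 1 - cos(angle). 0 means the vectors point the same
    # way; values near 1 mean they are essentially unrelated.
    return 1 - (w1 @ w2) / (np.linalg.norm(w1) * np.linalg.norm(w2))

def closest_words(embedding, n=10):
    # All words, sorted by distance to the given vector; keep the top n.
    return sorted(words, key=lambda w: distance(embedding, words[w]))[:n]

def closest_word(embedding):
    return closest_words(embedding, n=1)[0]
```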
2138 01:44:55,940 --> 01:44:58,070 And this turns out to be incredibly powerful, to be 2139 01:44:58,070 --> 01:45:01,760 able to represent the meaning of words in terms of their relationships 2140 01:45:01,760 --> 01:45:03,620 to other words as well. 2141 01:45:03,620 --> 01:45:05,000 I can tell you as well-- 2142 01:45:05,000 --> 01:45:06,980 I have a function called closest_words that 2143 01:45:06,980 --> 01:45:09,320 basically just takes a word 2144 01:45:09,320 --> 01:45:11,520 and gets all the closest words to it. 2145 01:45:11,520 --> 01:45:15,980 So let me get the closest words to book, for example, 2146 01:45:15,980 --> 01:45:18,500 and maybe get the 10 closest words. 2147 01:45:18,500 --> 01:45:20,950 We'll limit ourselves to 10. 2148 01:45:20,950 --> 01:45:21,450 And right, 2149 01:45:21,450 --> 01:45:24,420 book is obviously closest to itself-- the word book-- 2150 01:45:24,420 --> 01:45:27,630 but it's also closely related to books, and essay, and memoir, and essays, 2151 01:45:27,630 --> 01:45:29,450 and novella, and anthology. 2152 01:45:29,450 --> 01:45:32,370 And why are these the words that it computed as closest to it? 2153 01:45:32,370 --> 01:45:34,710 Well, because based on the corpus of information 2154 01:45:34,710 --> 01:45:38,220 that this algorithm was trained on, the vectors 2155 01:45:38,220 --> 01:45:41,270 arose based on what words show up in similar contexts-- 2156 01:45:41,270 --> 01:45:45,420 that the word book shows up in contexts similar to words 2157 01:45:45,420 --> 01:45:47,730 like memoir and essays, for example. 2158 01:45:47,730 --> 01:45:49,110 And if I do something like-- 2159 01:45:49,110 --> 01:45:53,740 let me get the closest words to city-- 2160 01:45:53,740 --> 01:45:56,800 you end up getting city, town, township, village. 2161 01:45:56,800 --> 01:46:02,200 These are words that happen to show up in a similar context to the word city. 2162 01:46:02,200 --> 01:46:05,787 Now, where things get really interesting is that, because these are vectors, 2163 01:46:05,787 --> 01:46:07,120 we can do mathematics with them. 2164 01:46:07,120 --> 01:46:11,210 We can calculate the relationships between various different words. 2165 01:46:11,210 --> 01:46:16,240 So I can say something like, all right, what if I had man and king? 2166 01:46:16,240 --> 01:46:18,790 These are two different vectors, and this is a famous example 2167 01:46:18,790 --> 01:46:20,950 that comes out of word2vec. 2168 01:46:20,950 --> 01:46:24,920 I can take these two vectors and just subtract them from each other. 2169 01:46:24,920 --> 01:46:28,040 This line here, the distance here, is another vector 2170 01:46:28,040 --> 01:46:30,430 that represents king minus man. 2171 01:46:30,430 --> 01:46:33,123 Now, what does it mean to take a word and subtract another word? 2172 01:46:33,123 --> 01:46:34,540 Normally, that doesn't make sense. 2173 01:46:34,540 --> 01:46:37,082 In the world of vectors, though, you can take some vector, some 2174 01:46:37,082 --> 01:46:40,090 sequence of numbers, subtract some other sequence of numbers, 2175 01:46:40,090 --> 01:46:43,240 and get a new vector, a new sequence of numbers. 2176 01:46:43,240 --> 01:46:46,690 And what this new sequence of numbers is effectively going to do 2177 01:46:46,690 --> 01:46:52,000 is tell me, what do I need to do to get from man to king? 2178 01:46:52,000 --> 01:46:54,640 What is the relationship then between these two words?
2179 01:46:54,640 --> 01:46:58,120 And this is some vector representation of what 2180 01:46:58,120 --> 01:47:00,640 takes us from man to king. 2181 01:47:00,640 --> 01:47:04,730 And we can then take this value and add it to another vector. 2182 01:47:04,730 --> 01:47:07,700 You might imagine that the word woman, for example, 2183 01:47:07,700 --> 01:47:10,330 is another vector that exists somewhere inside of this space, 2184 01:47:10,330 --> 01:47:12,430 somewhere inside of this vector space. 2185 01:47:12,430 --> 01:47:15,550 And what might happen if I took this same idea, king 2186 01:47:15,550 --> 01:47:19,930 minus man-- took that same vector and just added it to woman? 2187 01:47:19,930 --> 01:47:22,480 What will we find around here? 2188 01:47:22,480 --> 01:47:24,230 It's an interesting question we might ask, 2189 01:47:24,230 --> 01:47:27,700 and we can answer it very easily, because I have vector representations 2190 01:47:27,700 --> 01:47:30,500 of all of these things. 2191 01:47:30,500 --> 01:47:31,660 Let's go back here. 2192 01:47:31,660 --> 01:47:34,690 Let me look at the representation of the word man. 2193 01:47:34,690 --> 01:47:36,887 Here's the vector representation of man. 2194 01:47:36,887 --> 01:47:38,970 Let's look at the representation of the word king. 2195 01:47:38,970 --> 01:47:41,222 Here's the representation of the word king. 2196 01:47:41,222 --> 01:47:42,430 And I can subtract these two. 2197 01:47:42,430 --> 01:47:46,260 What is the vector representation of king minus man? 2198 01:47:46,260 --> 01:47:48,250 It's this array right here-- 2199 01:47:48,250 --> 01:47:49,600 a whole bunch of values. 2200 01:47:49,600 --> 01:47:53,620 So king minus man now represents the relationship between king and man 2201 01:47:53,620 --> 01:47:55,940 in some sort of numerical vector format. 2202 01:47:55,940 --> 01:48:00,170 So what happens then if I add woman to that? 2203 01:48:00,170 --> 01:48:04,640 Whatever took us from man to king, go ahead and apply that same vector 2204 01:48:04,640 --> 01:48:07,520 to the vector representation of the word woman, 2205 01:48:07,520 --> 01:48:10,960 and that gives us this vector here. 2206 01:48:10,960 --> 01:48:15,130 And now, just out of curiosity, let's take this expression 2207 01:48:15,130 --> 01:48:20,720 and find, what is the closest word to that expression? 2208 01:48:20,720 --> 01:48:25,130 And amazingly, what we get is the word queen-- 2209 01:48:25,130 --> 01:48:28,820 that somehow, when you take the distance between man and king-- 2210 01:48:28,820 --> 01:48:32,090 this numerical representation of how man is related to king-- 2211 01:48:32,090 --> 01:48:34,780 and add that same notion, king minus man, 2212 01:48:34,780 --> 01:48:37,100 to the vector representation of the word woman, 2213 01:48:37,100 --> 01:48:40,790 what we get is the vector representation, or something close 2214 01:48:40,790 --> 01:48:43,490 to the vector representation, of the word queen, 2215 01:48:43,490 --> 01:48:48,130 because this distance somehow encoded the relationship between these two 2216 01:48:48,130 --> 01:48:48,630 words.
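In code, using the words dictionary and closest_word helper sketched above (the lowercase keys are an assumption about how the pretrained file names its words), that whole demo comes down to one line of vector arithmetic:

```python
# king - man + woman lands near queen in the lecture's pretrained vectors.
print(closest_word(words["king"] - words["man"] + words["woman"]))  # queen
```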
2217 01:48:48,630 --> 01:48:50,422 And when you run it through this algorithm, 2218 01:48:50,422 --> 01:48:53,240 it's not programmed to do this, but if you just try and figure 2219 01:48:53,240 --> 01:48:55,700 out how to predict words based on context words, 2220 01:48:55,700 --> 01:48:59,960 you get vectors that are able to make these SAT-like analogies out 2221 01:48:59,960 --> 01:49:02,232 of the information that has been given. 2222 01:49:02,232 --> 01:49:03,690 So there are more examples of this. 2223 01:49:03,690 --> 01:49:06,230 We can say, all right, let's figure out, what 2224 01:49:06,230 --> 01:49:10,790 is the distance between Paris and France? 2225 01:49:10,790 --> 01:49:12,580 So Paris and France are words. 2226 01:49:12,580 --> 01:49:14,390 They each have a vector representation. 2227 01:49:14,390 --> 01:49:18,680 This then is a vector representation of the distance between Paris and France-- 2228 01:49:18,680 --> 01:49:21,530 what takes us from France to Paris. 2229 01:49:21,530 --> 01:49:26,540 And let me go ahead and add the vector representation of England to that. 2230 01:49:26,540 --> 01:49:29,690 So this then is the vector representation 2231 01:49:29,690 --> 01:49:35,470 of going Paris minus France plus England-- 2232 01:49:35,470 --> 01:49:38,130 so the distance between France and Paris as vectors. 2233 01:49:38,130 --> 01:49:40,860 Add the England vector, and let's go ahead 2234 01:49:40,860 --> 01:49:43,860 and find the closest word to that. 2235 01:49:43,860 --> 01:49:47,080 2236 01:49:47,080 --> 01:49:48,550 And it turns out to be London. 2237 01:49:48,550 --> 01:49:51,610 You take this relationship, the relationship between France and Paris, 2238 01:49:51,610 --> 01:49:55,000 go ahead and add the England vector to it, and the closest vector to that 2239 01:49:55,000 --> 01:49:57,120 happens to be the vector for the word London. 2240 01:49:57,120 --> 01:49:58,120 We can do more examples. 2241 01:49:58,120 --> 01:50:00,700 I can say, let's take the word for teacher-- 2242 01:50:00,700 --> 01:50:03,700 that vector representation-- and let me subtract 2243 01:50:03,700 --> 01:50:05,470 the vector representation of school. 2244 01:50:05,470 --> 01:50:09,310 So what I'm left with is, what takes us from school to teacher? 2245 01:50:09,310 --> 01:50:14,050 And apply that vector to a word like hospital and see, 2246 01:50:14,050 --> 01:50:15,670 what is the closest word to that? 2247 01:50:15,670 --> 01:50:17,680 Turns out the closest word is nurse. 2248 01:50:17,680 --> 01:50:23,400 Let's try a couple more examples-- the word ramen, for example, 2249 01:50:23,400 --> 01:50:25,610 minus the word Japan. 2250 01:50:25,610 --> 01:50:28,150 So what is the relationship between Japan and ramen? 2251 01:50:28,150 --> 01:50:30,310 Add the word for America to that. 2252 01:50:30,310 --> 01:50:33,340 Want to take a guess at what you might get as a result? 2253 01:50:33,340 --> 01:50:35,840 Turns out you get burritos as the relationship. 2254 01:50:35,840 --> 01:50:38,050 If you do the subtraction, do the addition, 2255 01:50:38,050 --> 01:50:42,080 this is the answer that you happen to get as a consequence of this as well.
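The same arithmetic covers all the other analogies in this demo; as before, the lowercase keys are an assumption about the pretrained file, and the outputs in the comments are the ones reported in the lecture, so exact results depend on that data:

```python
# More SAT-like analogies as vector arithmetic, with the helpers above.
print(closest_word(words["paris"] - words["france"] + words["england"]))     # london
print(closest_word(words["teacher"] - words["school"] + words["hospital"]))  # nurse
print(closest_word(words["ramen"] - words["japan"] + words["america"]))      # burritos
```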
2256 01:50:42,080 --> 01:50:44,703 So these very interesting analogies arise 2257 01:50:44,703 --> 01:50:46,620 in the relationships between these words-- 2258 01:50:46,620 --> 01:50:50,420 that if you just map out all of these words into a vector space, 2259 01:50:50,420 --> 01:50:54,380 you can get some pretty interesting results as a consequence of that. 2260 01:50:54,380 --> 01:50:58,360 And this idea of representing words as vectors turns out 2261 01:50:58,360 --> 01:51:01,300 to be incredibly useful and powerful anytime 2262 01:51:01,300 --> 01:51:04,420 we want to be able to do some statistical work with 2263 01:51:04,420 --> 01:51:06,910 regard to natural language-- to be able to 2264 01:51:06,910 --> 01:51:09,350 represent words not just as their characters, 2265 01:51:09,350 --> 01:51:12,280 but to represent them as numbers, numbers that say something 2266 01:51:12,280 --> 01:51:14,910 or mean something about the words themselves, 2267 01:51:14,910 --> 01:51:18,250 and somehow relate the meaning of a word to other words that 2268 01:51:18,250 --> 01:51:19,920 might happen to exist-- 2269 01:51:19,920 --> 01:51:23,020 so many tools then for being able to work inside 2270 01:51:23,020 --> 01:51:24,910 of this world of natural language. 2271 01:51:24,910 --> 01:51:26,417 Natural language is tricky. 2272 01:51:26,417 --> 01:51:29,500 We have to deal with the syntax of language and the semantics of language, 2273 01:51:29,500 --> 01:51:33,100 but we've really just seen the beginning of some of the ideas that are 2274 01:51:33,100 --> 01:51:37,450 underlying a lot of natural language processing-- the ability to take text, 2275 01:51:37,450 --> 01:51:40,270 extract information out of it, get some sort of meaning out of it, 2276 01:51:40,270 --> 01:51:43,990 generate sentences, maybe by having some knowledge of the grammar or maybe just 2277 01:51:43,990 --> 01:51:47,380 by looking at probabilities of what words are likely to show up based 2278 01:51:47,380 --> 01:51:49,780 on other words that have shown up previously-- 2279 01:51:49,780 --> 01:51:52,300 and then finally, the ability to take words 2280 01:51:52,300 --> 01:51:55,330 and come up with some distributed representation of them, to take words 2281 01:51:55,330 --> 01:51:58,240 and represent them as numbers, and use those numbers 2282 01:51:58,240 --> 01:52:02,210 to be able to say something meaningful about those words as well. 2283 01:52:02,210 --> 01:52:04,390 So this then is yet another topic in this broader 2284 01:52:04,390 --> 01:52:06,300 heading of artificial intelligence. 2285 01:52:06,300 --> 01:52:08,380 And just as I look back at where we've been now, 2286 01:52:08,380 --> 01:52:11,320 we started our conversation by talking about the world of search, 2287 01:52:11,320 --> 01:52:14,590 about trying to solve problems like tic-tac-toe by searching 2288 01:52:14,590 --> 01:52:17,500 for a solution, by exploring our various different possibilities 2289 01:52:17,500 --> 01:52:21,220 and looking at what algorithms we can apply to be able to efficiently 2290 01:52:21,220 --> 01:52:22,300 try and search a space. 2291 01:52:22,300 --> 01:52:25,930 We looked at some simple algorithms and then looked at some optimizations 2292 01:52:25,930 --> 01:52:28,780 we could make to those algorithms, and ultimately, that 2293 01:52:28,780 --> 01:52:31,742 was in service of trying to get our AI to know things about the world.
2294 01:52:31,742 --> 01:52:34,450 And this has been a lot of what we've talked about today as well, 2295 01:52:34,450 --> 01:52:37,270 trying to get knowledge out of text-based information, 2296 01:52:37,270 --> 01:52:41,440 the ability to take information and draw conclusions based on that information. 2297 01:52:41,440 --> 01:52:43,630 If I know these two things for certain, maybe I 2298 01:52:43,630 --> 01:52:46,660 can draw a third conclusion as well. 2299 01:52:46,660 --> 01:52:49,330 That then was related to the idea of uncertainty. 2300 01:52:49,330 --> 01:52:51,460 If we don't know something for sure, can we 2301 01:52:51,460 --> 01:52:54,420 predict something, figure out the probabilities of something? 2302 01:52:54,420 --> 01:52:56,170 And we saw that again today in the context 2303 01:52:56,170 --> 01:52:59,200 of trying to predict whether a tweet or whether a message 2304 01:52:59,200 --> 01:53:01,420 has positive sentiment or negative sentiment, 2305 01:53:01,420 --> 01:53:04,022 and trying to draw that conclusion as well. 2306 01:53:04,022 --> 01:53:05,980 Then we took a look at optimization-- the sorts 2307 01:53:05,980 --> 01:53:09,490 of problems where we're looking for a global or local maximum 2308 01:53:09,490 --> 01:53:10,300 or minimum. 2309 01:53:10,300 --> 01:53:13,420 This has come up time and time again, especially most recently 2310 01:53:13,420 --> 01:53:16,750 in the context of neural networks, which are really just a kind of optimization 2311 01:53:16,750 --> 01:53:20,110 problem where we're trying to minimize the total amount of loss 2312 01:53:20,110 --> 01:53:23,110 based on the setting of the weights of our neural network, 2313 01:53:23,110 --> 01:53:26,710 based on the setting of what vector representations for words we 2314 01:53:26,710 --> 01:53:27,880 happen to choose. 2315 01:53:27,880 --> 01:53:30,430 And those ultimately helped us to be able to solve 2316 01:53:30,430 --> 01:53:33,940 learning-related problems-- the ability to take a whole bunch of data, 2317 01:53:33,940 --> 01:53:37,650 and rather than us telling the AI exactly what to do, 2318 01:53:37,650 --> 01:53:40,030 let the AI learn patterns from the data for itself. 2319 01:53:40,030 --> 01:53:43,770 Let it figure out what makes an inbox message different from a spam message. 2320 01:53:43,770 --> 01:53:45,520 Let it figure out what makes a counterfeit 2321 01:53:45,520 --> 01:53:47,560 bill different from an authentic bill, and being 2322 01:53:47,560 --> 01:53:49,820 able to draw that analysis as well. 2323 01:53:49,820 --> 01:53:52,390 And one of the big tools in learning that we used 2324 01:53:52,390 --> 01:53:54,220 was neural networks, these structures that 2325 01:53:54,220 --> 01:53:58,180 allow us to relate inputs to outputs by training these internal networks 2326 01:53:58,180 --> 01:54:02,410 to learn some sort of function that maps us from some input to some output-- 2327 01:54:02,410 --> 01:54:05,770 ultimately yet another model in this language of artificial intelligence 2328 01:54:05,770 --> 01:54:08,320 that we can use to communicate with our AI.
2329 01:54:08,320 --> 01:54:10,210 Then finally today, we looked at some ways 2330 01:54:10,210 --> 01:54:12,850 that AI can begin to communicate with us, looking at ways 2331 01:54:12,850 --> 01:54:16,240 that AI can begin to get an understanding of the syntax 2332 01:54:16,240 --> 01:54:19,990 and the semantics of language, to be able to generate sentences, 2333 01:54:19,990 --> 01:54:23,110 to be able to predict things about text that's written in a spoken 2334 01:54:23,110 --> 01:54:25,360 language or a written language like English, 2335 01:54:25,360 --> 01:54:27,927 and to be able to do interesting analysis there as well. 2336 01:54:27,927 --> 01:54:30,010 And there's so much more active research 2337 01:54:30,010 --> 01:54:33,160 happening all over the areas within artificial intelligence today, 2338 01:54:33,160 --> 01:54:36,890 and we've really only just seen the beginning of what AI has to offer. 2339 01:54:36,890 --> 01:54:39,310 So I hope you enjoyed this exploration into this world 2340 01:54:39,310 --> 01:54:41,235 of artificial intelligence with Python. 2341 01:54:41,235 --> 01:54:44,110 A big thank you to the course's teaching staff and the production team 2342 01:54:44,110 --> 01:54:45,700 for making this class possible. 2343 01:54:45,700 --> 01:54:49,940 This was an Introduction to Artificial Intelligence with Python. 2344 01:54:49,940 --> 01:54:51,000