WEBVTT X-TIMESTAMP-MAP=LOCAL:00:00:00.000,MPEGTS:900000 00:00:00.000 --> 00:00:03.493 [MUSIC PLAYING] 00:00:17.873 --> 00:00:21.040 SPEAKER 1: OK, welcome back, everyone, to our final topic in an introduction 00:00:21.040 --> 00:00:23.050 to artificial intelligence with Python. 00:00:23.050 --> 00:00:25.390 And today, the topic is language. 00:00:25.390 --> 00:00:27.280 So thus far in the class, we've seen a number 00:00:27.280 --> 00:00:30.700 of different ways of interacting with AI, artificial intelligence, 00:00:30.700 --> 00:00:34.690 but it's mostly been happening in the way of us formulating problems 00:00:34.690 --> 00:00:38.320 in ways that AI can understand-- learning to speak the language of AI, 00:00:38.320 --> 00:00:41.800 so to speak, by trying to take a problem and formulate it as a search problem, 00:00:41.800 --> 00:00:45.160 or by trying to take a problem and make it a constraint satisfaction problem-- 00:00:45.160 --> 00:00:47.800 something that our AI is able to understand. 00:00:47.800 --> 00:00:50.800 Today, we're going to try and come up with algorithms and ideas that 00:00:50.800 --> 00:00:53.170 allow our AI to meet us halfway, so to speak-- 00:00:53.170 --> 00:00:56.770 to be able to allow AI to be able to understand, and interpret, and get 00:00:56.770 --> 00:00:58.915 some sort of meaning out of human language-- 00:00:58.915 --> 00:01:00.790 the type of language, the spoken language, 00:01:00.790 --> 00:01:03.760 like English, or some other language that we naturally speak. 00:01:03.760 --> 00:01:06.700 And this turns out to be a really challenging task for AI. 00:01:06.700 --> 00:01:09.850 And it really encompasses a number of different types of tasks 00:01:09.850 --> 00:01:13.210 all under the broad heading of natural language processing, 00:01:13.210 --> 00:01:15.190 the idea of coming up with algorithms that 00:01:15.190 --> 00:01:19.910 allow our AI to be able to process and understand natural language. 
00:01:19.910 --> 00:01:22.000 So these tasks vary in terms of the types of tasks 00:01:22.000 --> 00:01:24.490 we might want an AI to perform, and therefore, the types of 00:01:24.490 --> 00:01:25.698 algorithms that we might use. 00:01:25.698 --> 00:01:28.030 But some common tasks that you might see 00:01:28.030 --> 00:01:30.250 are things like automatic summarization. 00:01:30.250 --> 00:01:33.520 You give an AI a long document, and you would like for the AI 00:01:33.520 --> 00:01:35.680 to be able to summarize it, come up with a shorter 00:01:35.680 --> 00:01:39.850 representation of the same idea, but still in some kind of natural language, 00:01:39.850 --> 00:01:40.780 like English. 00:01:40.780 --> 00:01:44.740 Something like information extraction-- given a whole corpus of information 00:01:44.740 --> 00:01:46.750 in some body of documents or on the internet, 00:01:46.750 --> 00:01:49.840 for example, we'd like for our AI to be able to extract 00:01:49.840 --> 00:01:54.070 some sort of meaningful semantic information out of all of that content 00:01:54.070 --> 00:01:56.470 that it's able to look at and read. 00:01:56.470 --> 00:01:59.020 Language identification-- the task of, given a page, 00:01:59.020 --> 00:02:01.562 can you figure out what language that document is written in? 00:02:01.562 --> 00:02:04.520 This is the type of thing you might see if you use a web browser where, 00:02:04.520 --> 00:02:06.280 if you open up a page in another language, 00:02:06.280 --> 00:02:09.400 that web browser might ask you, oh, I think it's in this language-- would 00:02:09.400 --> 00:02:12.070 you like me to translate it into English for you, for example? 
00:02:12.070 --> 00:02:15.070 And that language identification process is a task 00:02:15.070 --> 00:02:17.800 that our AI needs to be able to do, which is then related 00:02:17.800 --> 00:02:21.550 to machine translation, the process of taking text in one language 00:02:21.550 --> 00:02:24.190 and translating it into another language-- which there's 00:02:24.190 --> 00:02:26.710 been a lot of research and development on really 00:02:26.710 --> 00:02:28.490 over the course of the last several years. 00:02:28.490 --> 00:02:30.323 And it keeps getting better, in terms of how 00:02:30.323 --> 00:02:33.010 it is that AI is able to take text in one language 00:02:33.010 --> 00:02:37.010 and transform that text into another language as well. 00:02:37.010 --> 00:02:40.330 In addition to that, we have topics like named entity recognition. 00:02:40.330 --> 00:02:43.840 Given some sequence of text, can you pick out what the named entities are? 00:02:43.840 --> 00:02:46.300 These are names of companies, or names of people, 00:02:46.300 --> 00:02:50.050 or names of locations for example, which are often relevant or important parts 00:02:50.050 --> 00:02:51.580 of a particular document. 00:02:51.580 --> 00:02:55.720 Speech recognition is a related task, not to do with the text that is written, 00:02:55.720 --> 00:02:58.840 but text that is spoken-- being able to process audio and figure out, 00:02:58.840 --> 00:03:01.070 what are the actual words that are spoken there? 00:03:01.070 --> 00:03:04.180 And if you think about smart home devices, like Siri or Alexa, 00:03:04.180 --> 00:03:06.370 for example, these are all devices that are now 00:03:06.370 --> 00:03:09.460 able to listen to us when we speak, figure out 00:03:09.460 --> 00:03:13.190 what words we are saying, and draw some sort of meaning out of that as well. 
00:03:13.190 --> 00:03:15.398 We've talked about how you could formulate something, 00:03:15.398 --> 00:03:17.860 for instance, as a hidden Markov model to be able to draw 00:03:17.860 --> 00:03:19.250 those sorts of conclusions. 00:03:19.250 --> 00:03:22.150 Text classification, more generally, is a broad category 00:03:22.150 --> 00:03:25.090 of types of ideas, whenever we want to take some kind of text 00:03:25.090 --> 00:03:27.010 and put it into some sort of category. 00:03:27.010 --> 00:03:29.440 And we've seen these classification type problems 00:03:29.440 --> 00:03:31.930 and how we can use statistical machine learning approaches 00:03:31.930 --> 00:03:32.983 to be able to solve them. 00:03:32.983 --> 00:03:35.650 We'll be able to do something very similar with natural language, 00:03:35.650 --> 00:03:38.910 though we may need to make a couple of adjustments that we'll see soon. 00:03:38.910 --> 00:03:41.500 And then something like word sense disambiguation, 00:03:41.500 --> 00:03:45.010 the idea that, unlike in the language of numbers, 00:03:45.010 --> 00:03:48.520 where AI has very precise representations of everything, words 00:03:48.520 --> 00:03:50.980 are a little bit fuzzy, in terms of their meaning, 00:03:50.980 --> 00:03:52.980 and words can have multiple different meanings-- 00:03:52.980 --> 00:03:55.180 and natural language is inherently ambiguous, 00:03:55.180 --> 00:03:58.360 and we'll take a look at some of those ambiguities in due time today. 00:03:58.360 --> 00:04:00.760 But one challenging task, if you want an AI 00:04:00.760 --> 00:04:02.950 to be able to understand natural language, 00:04:02.950 --> 00:04:05.860 is being able to disambiguate or differentiate 00:04:05.860 --> 00:04:08.080 between different possible meanings of words. 
00:04:08.080 --> 00:04:12.050 If I say a sentence like, I went to the bank, you need to figure out, 00:04:12.050 --> 00:04:14.680 do I mean the bank where I deposit and withdraw money or do 00:04:14.680 --> 00:04:16.240 I mean the bank like the river bank? 00:04:16.240 --> 00:04:18.250 And different words can have different meanings 00:04:18.250 --> 00:04:19.260 that we might want to figure out. 00:04:19.260 --> 00:04:21.519 And based on the context in which a word appears-- 00:04:21.519 --> 00:04:23.890 the wider sentence, or paragraph, or paper 00:04:23.890 --> 00:04:25.630 in which a particular word appears-- 00:04:25.630 --> 00:04:27.880 that might help to inform how it is that we 00:04:27.880 --> 00:04:31.390 disambiguate between different meanings or different senses 00:04:31.390 --> 00:04:32.430 that a word might have. 00:04:32.430 --> 00:04:35.527 And there are many other topics within natural language processing, 00:04:35.527 --> 00:04:37.360 many other algorithms that have been devised 00:04:37.360 --> 00:04:40.190 in order to deal with and address these sorts of problems. 00:04:40.190 --> 00:04:42.607 And today, we're really just going to scratch the surface, 00:04:42.607 --> 00:04:46.240 looking at some of the fundamental ideas that are behind many of these tasks 00:04:46.240 --> 00:04:49.750 within natural language processing, within this idea of trying to come up 00:04:49.750 --> 00:04:53.800 with AI algorithms that are able to do something meaningful with the languages 00:04:53.800 --> 00:04:55.780 that we speak every day. 00:04:55.780 --> 00:04:58.480 And so to introduce this idea, when we think about language, 00:04:58.480 --> 00:05:01.160 we can often think about it in a couple of different parts. 00:05:01.160 --> 00:05:04.520 The first part refers to the syntax of language. 00:05:04.520 --> 00:05:07.630 This is more to do with just the structure of language 00:05:07.630 --> 00:05:09.830 and how it is that that structure works. 
00:05:09.830 --> 00:05:13.060 And if you think about natural language, syntax is one of those things 00:05:13.060 --> 00:05:15.160 that, if you're a native speaker of a language, 00:05:15.160 --> 00:05:16.570 it comes pretty readily to you. 00:05:16.570 --> 00:05:18.320 You don't have to think too much about it. 00:05:18.320 --> 00:05:21.600 If I give you a sentence from Sir Arthur Conan Doyle's Sherlock Holmes, 00:05:21.600 --> 00:05:23.190 for example, a sentence like this-- 00:05:23.190 --> 00:05:27.225 "just before 9 o'clock, Sherlock Holmes stepped briskly into the room"-- 00:05:27.225 --> 00:05:29.100 I think we could probably all agree that this 00:05:29.100 --> 00:05:31.830 is a well-formed grammatical sentence. 00:05:31.830 --> 00:05:34.920 Syntactically, it makes sense, in terms of the way 00:05:34.920 --> 00:05:37.232 that this particular sentence is structured. 00:05:37.232 --> 00:05:40.440 And syntax applies not just to natural language, but to programming languages 00:05:40.440 --> 00:05:40.940 as well. 00:05:40.940 --> 00:05:44.430 If you've ever seen a syntax error in a program that you've written, 00:05:44.430 --> 00:05:47.280 it's likely because you wrote some sort of program 00:05:47.280 --> 00:05:49.470 that was not syntactically well-formed. 00:05:49.470 --> 00:05:52.080 The structure of it was not a valid program. 00:05:52.080 --> 00:05:54.780 In the same way, we can look at English sentences, or sentences 00:05:54.780 --> 00:05:57.600 in any natural language, and make the same kinds of judgments. 00:05:57.600 --> 00:06:01.290 I can say that this sentence is syntactically well-formed. 00:06:01.290 --> 00:06:04.260 When all the parts are put together, all these words are in this order, 00:06:04.260 --> 00:06:08.250 it constructs a grammatical sentence, or a sentence that most people would agree 00:06:08.250 --> 00:06:09.720 is grammatical. 00:06:09.720 --> 00:06:11.970 But there are also grammatically ill-formed sentences. 
00:06:11.970 --> 00:06:14.370 A sentence like, "just before Sherlock Holmes 00:06:14.370 --> 00:06:16.518 9 o'clock stepped briskly the room"-- 00:06:16.518 --> 00:06:19.560 well, I think we would all agree that this is not a well-formed sentence. 00:06:19.560 --> 00:06:22.290 Syntactically, it doesn't make sense. 00:06:22.290 --> 00:06:25.290 And this is the type of thing that, if we want our AI, for example, 00:06:25.290 --> 00:06:27.330 to be able to generate natural language-- 00:06:27.330 --> 00:06:30.250 to be able to speak to us the way a chat bot would speak to us, 00:06:30.250 --> 00:06:31.010 for example-- 00:06:31.010 --> 00:06:34.260 well then our AI is going to need to be able to know this distinction somehow, 00:06:34.260 --> 00:06:37.980 is going to need to know what kinds of sentences are grammatical, 00:06:37.980 --> 00:06:39.330 what kinds of sentences are not. 00:06:39.330 --> 00:06:42.930 And we might come up with rules or ways to statistically learn these ideas, 00:06:42.930 --> 00:06:45.840 and we'll talk about some of those methods as well. 00:06:45.840 --> 00:06:47.910 Syntax can also be ambiguous. 00:06:47.910 --> 00:06:50.970 It's not just that some sentences are well-formed and others are not-- 00:06:50.970 --> 00:06:54.180 there are certain ways that you could take a sentence 00:06:54.180 --> 00:06:58.260 and potentially construct multiple different structures for that sentence. 00:06:58.260 --> 00:07:01.830 A sentence like, "I saw the man on the mountain with a telescope," well, 00:07:01.830 --> 00:07:05.080 this is grammatically well-formed-- syntactically, it makes sense-- 00:07:05.080 --> 00:07:07.350 but what is the structure of the sentence? 00:07:07.350 --> 00:07:10.680 Is it the man on the mountain who has the telescope, or am 00:07:10.680 --> 00:07:13.860 I seeing the man on the mountain and I am using the telescope in order 00:07:13.860 --> 00:07:15.270 to see the man on the mountain? 
00:07:15.270 --> 00:07:19.050 There's some interesting ambiguity here, where it could have potentially 00:07:19.050 --> 00:07:21.090 two different types of structures. 00:07:21.090 --> 00:07:23.940 And this is one of the ideas that we'll come back to also, 00:07:23.940 --> 00:07:27.690 in terms of how to think about dealing with AI when natural language is 00:07:27.690 --> 00:07:29.820 inherently ambiguous. 00:07:29.820 --> 00:07:32.070 So that then is syntax, the structure of language, 00:07:32.070 --> 00:07:34.080 and getting an understanding for how it is 00:07:34.080 --> 00:07:36.330 that, depending on the order and placement of words, 00:07:36.330 --> 00:07:38.910 we can come up with different structures for language. 00:07:38.910 --> 00:07:42.300 But in addition to language having structure, language also has meaning. 00:07:42.300 --> 00:07:44.700 And now we get into the world of semantics, the idea of, 00:07:44.700 --> 00:07:47.190 what it is that a word, or a sequence of words, 00:07:47.190 --> 00:07:51.200 or a sentence, or an entire essay actually means? 00:07:51.200 --> 00:07:54.300 And so a sentence like, "just before 9:00, Sherlock Holmes 00:07:54.300 --> 00:07:58.230 stepped briskly into the room," is a different sentence 00:07:58.230 --> 00:08:01.860 from a sentence like, "Sherlock Holmes stepped briskly into the room just 00:08:01.860 --> 00:08:03.300 before 9:00." 00:08:03.300 --> 00:08:06.480 And yet they have effectively the same meaning. 00:08:06.480 --> 00:08:08.430 They're different sentences, so an AI reading 00:08:08.430 --> 00:08:11.550 them would recognize them as different, but we as humans 00:08:11.550 --> 00:08:13.650 can look at both the sentences and say, yeah, 00:08:13.650 --> 00:08:15.295 they mean basically the same thing. 00:08:15.295 --> 00:08:18.420 And maybe, in this case, it was just because I moved the order of the words 00:08:18.420 --> 00:08:18.920 around. 
00:08:18.920 --> 00:08:21.520 Originally, 9 o'clock was near the beginning of the sentence. 00:08:21.520 --> 00:08:23.700 Now 9 o'clock is near the end of the sentence. 00:08:23.700 --> 00:08:26.950 But you might imagine that I could come up with a different sentence entirely, 00:08:26.950 --> 00:08:29.670 a sentence like, "a few minutes before 9:00, Sherlock Holmes 00:08:29.670 --> 00:08:31.820 walked quickly into the room." 00:08:31.820 --> 00:08:34.650 And OK, that also has a very similar meaning, 00:08:34.650 --> 00:08:37.799 but I'm using different words in order to express that idea. 00:08:37.799 --> 00:08:40.230 And ideally, AI would be able to recognize 00:08:40.230 --> 00:08:43.230 that these two sentences, these different sets of words that 00:08:43.230 --> 00:08:46.020 are similar to each other, have similar meanings, 00:08:46.020 --> 00:08:49.090 and to be able to get at that idea as well. 00:08:49.090 --> 00:08:52.350 Then there are also ways that a syntactically well-formed sentence 00:08:52.350 --> 00:08:54.150 might not mean anything at all. 00:08:54.150 --> 00:08:57.360 A famous example from linguist Noam Chomsky is this sentence here-- 00:08:57.360 --> 00:09:00.570 "colorless green ideas sleep furiously." 00:09:00.570 --> 00:09:03.660 Syntactically, that sentence is perfectly fine. 00:09:03.660 --> 00:09:07.080 Colorless and green are adjectives that modify the noun ideas. 00:09:07.080 --> 00:09:08.010 Sleep is a verb. 00:09:08.010 --> 00:09:09.240 Furiously is an adverb. 00:09:09.240 --> 00:09:12.900 These are correct constructions, in terms of the order of words, 00:09:12.900 --> 00:09:15.150 but it turns out this sentence is meaningless. 00:09:15.150 --> 00:09:18.270 If you tried to ascribe meaning to the sentence, what does it mean? 00:09:18.270 --> 00:09:20.250 And it's not easy to be able to determine 00:09:20.250 --> 00:09:21.660 what it is that it might mean. 
00:09:21.660 --> 00:09:25.355 Semantics itself can also be ambiguous, given that different structures can 00:09:25.355 --> 00:09:26.730 have different types of meanings. 00:09:26.730 --> 00:09:29.110 Different words can have different kinds of meanings, 00:09:29.110 --> 00:09:31.290 so the same sentence with the same structure 00:09:31.290 --> 00:09:33.300 might end up meaning different types of things. 00:09:33.300 --> 00:09:35.880 So my favorite example is 00:09:35.880 --> 00:09:39.570 a headline that was in the Los Angeles Times a little while back. 00:09:39.570 --> 00:09:43.410 The headline says, "Big rig carrying fruit crashes on 210 freeway, 00:09:43.410 --> 00:09:44.633 creates jam." 00:09:44.633 --> 00:09:46.800 So depending on how it is you look at the sentence-- 00:09:46.800 --> 00:09:50.440 how you interpret the sentence-- it can have multiple different meanings. 00:09:50.440 --> 00:09:53.730 And so here too are challenges in this world of natural language processing, 00:09:53.730 --> 00:09:56.640 being able to understand both the syntax of language 00:09:56.640 --> 00:09:58.013 and the semantics of language. 00:09:58.013 --> 00:10:00.180 And today, we'll take a look at both of those ideas. 00:10:00.180 --> 00:10:02.280 We're going to start by talking about syntax 00:10:02.280 --> 00:10:05.550 and getting a sense for how it is that language is structured, 00:10:05.550 --> 00:10:09.150 and how we can start by coming up with some rules, some ways 00:10:09.150 --> 00:10:12.930 that we can tell our computer, tell our AI what types of things 00:10:12.930 --> 00:10:16.540 are valid sentences, what types of things are not valid sentences. 00:10:16.540 --> 00:10:19.070 And ultimately, we'd like to use that information 00:10:19.070 --> 00:10:21.680 to be able to allow our AI to draw meaningful conclusions, 00:10:21.680 --> 00:10:23.743 to be able to do something with language. 
00:10:23.743 --> 00:10:25.910 And so to do so, we're going to start by introducing 00:10:25.910 --> 00:10:27.830 the notion of formal grammar. 00:10:27.830 --> 00:10:30.320 And what formal grammar is all about is this: formal grammar 00:10:30.320 --> 00:10:34.400 is a system of rules that generate sentences in a language. 00:10:34.400 --> 00:10:38.120 I would like to know what are the valid English sentences-- 00:10:38.120 --> 00:10:39.710 not in terms of what they mean-- 00:10:39.710 --> 00:10:42.590 just in terms of their structure-- their syntactic structure. 00:10:42.590 --> 00:10:45.740 What structures of English are valid, correct sentences? 00:10:45.740 --> 00:10:47.780 What structures of English are not valid? 00:10:47.780 --> 00:10:50.930 And this is going to apply in a very similar way to other natural languages 00:10:50.930 --> 00:10:54.110 as well, where language follows certain types of structures. 00:10:54.110 --> 00:10:56.870 And we intuitively know what these structures mean, 00:10:56.870 --> 00:10:59.840 but it's going to be helpful to try and really formally define 00:10:59.840 --> 00:11:01.980 what the structures mean as well. 00:11:01.980 --> 00:11:04.520 There are a number of different types of formal grammar 00:11:04.520 --> 00:11:07.318 all across what's known as the Chomsky hierarchy of grammars. 00:11:07.318 --> 00:11:09.110 And you may have seen some of these before. 00:11:09.110 --> 00:11:11.780 If you've ever worked with regular expressions before, 00:11:11.780 --> 00:11:14.300 those belong to a class of regular languages. 00:11:14.300 --> 00:11:19.320 They correspond to regular languages, which are a particular type of language. 00:11:19.320 --> 00:11:21.860 But also on this hierarchy is a type of grammar 00:11:21.860 --> 00:11:23.193 known as a context-free grammar. 00:11:23.193 --> 00:11:25.235 And this is the one we're going to spend the most 00:11:25.235 --> 00:11:27.120 time on taking a look at today. 
00:11:27.120 --> 00:11:31.640 And what a context-free grammar is is a way 00:11:31.640 --> 00:11:34.760 of generating sentences in a language via what 00:11:34.760 --> 00:11:39.020 are known as rewriting rules-- replacing one symbol with other symbols. 00:11:39.020 --> 00:11:42.360 And we'll take a look in a moment at just what that means. 00:11:42.360 --> 00:11:45.950 So let's imagine, for example, a simple sentence in English, 00:11:45.950 --> 00:11:48.520 a sentence like, "she saw the city"-- 00:11:48.520 --> 00:11:52.190 a valid, syntactically well-formed English sentence. 00:11:52.190 --> 00:11:55.640 But we'd like some way for our AI to be able to look at the sentence 00:11:55.640 --> 00:12:00.200 and figure out, what is the structure of the sentence? 00:12:00.200 --> 00:12:02.630 If you imagine an AI in a question answering format-- 00:12:02.630 --> 00:12:05.812 if you want to ask the AI a question like, what did she see, 00:12:05.812 --> 00:12:08.270 well, then the AI wants to be able to look at this sentence 00:12:08.270 --> 00:12:13.530 and recognize that what she saw is the city-- to be able to figure that out. 00:12:13.530 --> 00:12:15.770 And it requires some understanding of what 00:12:15.770 --> 00:12:19.760 it is that the structure of this sentence really looks like. 00:12:19.760 --> 00:12:20.960 So where do we begin? 00:12:20.960 --> 00:12:23.410 Each of these words-- she, saw, the, city-- 00:12:23.410 --> 00:12:25.585 we are going to call terminal symbols. 00:12:25.585 --> 00:12:28.460 They're symbols in our language-- where each of these words is just 00:12:28.460 --> 00:12:29.480 a symbol-- 00:12:29.480 --> 00:12:32.470 and this is ultimately what we care about generating. 00:12:32.470 --> 00:12:34.730 We care about generating these words. 00:12:34.730 --> 00:12:37.280 But each of these words we're also going to associate 00:12:37.280 --> 00:12:40.130 with what we're going to call a non-terminal symbol. 
00:12:40.130 --> 00:12:43.460 And these non-terminal symbols initially are going to look kind of like parts 00:12:43.460 --> 00:12:46.260 of speech, if you remember back to like English grammar-- 00:12:46.260 --> 00:12:49.880 where she is a noun, saw is a V for verb, 00:12:49.880 --> 00:12:52.550 the is a D. D stands for determiner. 00:12:52.550 --> 00:12:55.730 These are words like the, and a, and an, for example. 00:12:55.730 --> 00:12:59.550 And then city-- well, city is also a noun, so an N goes there. 00:12:59.550 --> 00:13:00.320 So each of these-- 00:13:00.320 --> 00:13:01.730 N, V, and D-- 00:13:01.730 --> 00:13:04.460 these are what we might call non-terminal symbols. 00:13:04.460 --> 00:13:07.370 They're not actually words in the language. 00:13:07.370 --> 00:13:10.010 She saw the city-- those are the words in the language. 00:13:10.010 --> 00:13:14.210 But we use these non-terminal symbols to generate the terminal symbols, 00:13:14.210 --> 00:13:16.640 the terminal symbols which are like, she saw the city-- 00:13:16.640 --> 00:13:20.000 the words that are actually in a language like English. 00:13:20.000 --> 00:13:24.260 And so in order to translate these non-terminal symbols into terminal 00:13:24.260 --> 00:13:27.422 symbols, we have what are known as rewriting rules, 00:13:27.422 --> 00:13:29.130 and these rules look something like this. 00:13:29.130 --> 00:13:32.570 We have N on the left side of an arrow, and the arrow 00:13:32.570 --> 00:13:35.480 says, if I have an N non-terminal symbol, 00:13:35.480 --> 00:13:39.410 then I can turn it into any of these various different possibilities 00:13:39.410 --> 00:13:42.120 that are separated with a vertical line. 00:13:42.120 --> 00:13:45.480 So a noun could translate into the word she. 00:13:45.480 --> 00:13:49.720 A noun could translate into the word city, or car, or Harry, 00:13:49.720 --> 00:13:50.970 or any number of other things. 
00:13:50.970 --> 00:13:53.810 These are all examples of nouns, for example. 00:13:53.810 --> 00:13:58.490 Meanwhile, a determiner, D, could translate into the, or a, or an. 00:13:58.490 --> 00:14:01.310 V for verb could translate into any of these verbs. 00:14:01.310 --> 00:14:04.430 P for preposition could translate into any of those prepositions-- 00:14:04.430 --> 00:14:06.440 to, on, over, and so forth. 00:14:06.440 --> 00:14:11.420 And then ADJ for adjective can translate into any of these possible adjectives 00:14:11.420 --> 00:14:12.390 as well. 00:14:12.390 --> 00:14:15.650 So these then are rules in our context-free grammar. 00:14:15.650 --> 00:14:18.110 When we are defining what it is that our grammar is, 00:14:18.110 --> 00:14:21.500 what is the structure of the English language or any other language, 00:14:21.500 --> 00:14:24.710 we give it these types of rules saying that a noun could 00:14:24.710 --> 00:14:29.360 be any of these possibilities, a verb could be any of those possibilities. 00:14:29.360 --> 00:14:32.900 But it turns out we can then begin to construct other rules where 00:14:32.900 --> 00:14:37.392 it's not just one non-terminal translating into one terminal symbol. 00:14:37.392 --> 00:14:40.100 We're always going to have one non-terminal on the left-hand side 00:14:40.100 --> 00:14:42.515 of the arrow, but on the right-hand side of the arrow, 00:14:42.515 --> 00:14:43.640 we could have other things. 00:14:43.640 --> 00:14:46.830 We could even have other non-terminal symbols. 00:14:46.830 --> 00:14:48.030 So what do I mean by this? 00:14:48.030 --> 00:14:53.070 Well, we have the idea of nouns-- like she, city, car, Harry, for example-- 00:14:53.070 --> 00:14:55.340 but there are also noun phrases-- 00:14:55.340 --> 00:14:57.760 phrases that work as nouns-- 00:14:57.760 --> 00:15:00.900 that are not just a single word, but multiple words. 
00:15:00.900 --> 00:15:04.400 Like the city is two words that, together, operate 00:15:04.400 --> 00:15:06.140 as what we might call a noun phrase. 00:15:06.140 --> 00:15:08.870 It's multiple words, but they're together operating as a noun. 00:15:08.870 --> 00:15:12.410 Or if you think about a more complex expression, like the big city-- 00:15:12.410 --> 00:15:15.380 three words all operating as a single noun-- 00:15:15.380 --> 00:15:17.200 or the car on the street-- 00:15:17.200 --> 00:15:22.390 multiple words now, but that entire set of words operates kind of like a noun. 00:15:22.390 --> 00:15:25.130 It substitutes as a noun phrase. 00:15:25.130 --> 00:15:27.100 And so to do this, we'll introduce the notion 00:15:27.100 --> 00:15:32.380 of a new non-terminal symbol called NP, which will stand for noun phrase. 00:15:32.380 --> 00:15:36.220 And this rewriting rule says that a noun phrase could be a noun-- 00:15:36.220 --> 00:15:39.250 so something like she is a noun, and therefore, it 00:15:39.250 --> 00:15:40.810 can also be a noun phrase-- 00:15:40.810 --> 00:15:46.360 but a noun phrase could also be a determiner, D, followed by a noun-- 00:15:46.360 --> 00:15:49.315 so two ways we can have a noun phrase in this very simple grammar. 00:15:49.315 --> 00:15:51.940 Of course, the English language is more complex than just this, 00:15:51.940 --> 00:15:57.460 but a noun phrase is either a noun or it is a determiner followed by a noun. 00:15:57.460 --> 00:16:00.130 So for the first example, a noun phrase that is just a noun, 00:16:00.130 --> 00:16:04.150 that would allow us to generate noun phrases like she, 00:16:04.150 --> 00:16:07.960 because a noun phrase is just a noun, and a noun 00:16:07.960 --> 00:16:10.833 could be the word she, for example. 
00:16:10.833 --> 00:16:13.750 Meanwhile, if we wanted to look at one of the examples of these, where 00:16:13.750 --> 00:16:16.750 a noun phrase becomes a determiner and a noun, 00:16:16.750 --> 00:16:18.460 then we get a structure like this. 00:16:18.460 --> 00:16:21.250 And now we're starting to see the structure of language 00:16:21.250 --> 00:16:24.970 emerge from these rules in a syntax tree, as we'll call it, 00:16:24.970 --> 00:16:29.260 this tree-like structure that represents the syntax of our natural language. 00:16:29.260 --> 00:16:31.960 Here, we have a noun phrase, and this noun phrase 00:16:31.960 --> 00:16:36.460 is composed of a determiner and a noun, where the determiner is the word the, 00:16:36.460 --> 00:16:40.310 according to that rule, and the noun is the word city. 00:16:40.310 --> 00:16:43.930 So here then is a noun phrase that consists of multiple words inside 00:16:43.930 --> 00:16:45.130 of the structure. 00:16:45.130 --> 00:16:50.140 And using this idea of taking one symbol and rewriting it using other symbols-- 00:16:50.140 --> 00:16:52.900 that might be terminal symbols, like the and city, 00:16:52.900 --> 00:16:57.670 but might also be non-terminal symbols, like D for determiner or N for noun-- 00:16:57.670 --> 00:17:01.090 then we can begin to construct more and more complex structures. 00:17:01.090 --> 00:17:04.420 In addition to noun phrases, we can also think about verb phrases. 00:17:04.420 --> 00:17:06.740 So what might a verb phrase look like? 00:17:06.740 --> 00:17:09.670 Well, a verb phrase might just be a single verb. 00:17:09.670 --> 00:17:13.660 In a sentence like "I walked," walked is a verb, 00:17:13.660 --> 00:17:17.329 and that is acting as the verb phrase in that sentence. 00:17:17.329 --> 00:17:21.493 But there are also more complex verb phrases that aren't just a single word, 00:17:21.493 --> 00:17:22.660 but that are multiple words. 
00:17:22.660 --> 00:17:25.970 If you think of the sentence like "she saw the city," for example, 00:17:25.970 --> 00:17:29.260 saw the city is really that entire verb phrase. 00:17:29.260 --> 00:17:33.245 It's capturing what it is that she is doing, for example. 00:17:33.245 --> 00:17:35.370 And so our verb phrase might have a rule like this. 00:17:35.370 --> 00:17:38.830 A verb phrase is either just a plain verb 00:17:38.830 --> 00:17:43.090 or it is a verb followed by a noun phrase. 00:17:43.090 --> 00:17:45.940 And we saw before that a noun phrase is either a noun 00:17:45.940 --> 00:17:48.580 or it is a determiner followed by a noun. 00:17:48.580 --> 00:17:50.710 And so a verb phrase might be something simple, 00:17:50.710 --> 00:17:52.960 like a verb phrase that is just a verb. 00:17:52.960 --> 00:17:55.587 And that verb could be the word walked for example. 00:17:55.587 --> 00:17:57.670 But it could also be something more sophisticated, 00:17:57.670 --> 00:18:01.780 something like this now, where we begin to see a larger syntax tree, 00:18:01.780 --> 00:18:04.450 where the way to read the syntax tree is that a verb 00:18:04.450 --> 00:18:07.690 phrase is a verb and a noun phrase, where 00:18:07.690 --> 00:18:09.380 that verb could be something like saw. 00:18:09.380 --> 00:18:12.130 And this is a noun phrase we've seen before, this noun phrase that 00:18:12.130 --> 00:18:17.050 is the city-- a noun phrase composed of the determiner the and the noun 00:18:17.050 --> 00:18:21.068 city all put together to construct this larger verb phrase. 00:18:21.068 --> 00:18:23.110 And then just to give one more example of a rule, 00:18:23.110 --> 00:18:24.652 we could also have a rule like this-- 00:18:24.652 --> 00:18:28.180 sentence S goes to noun phrase and a verb phrase. 00:18:28.180 --> 00:18:30.580 The basic structure of a sentence is that it is 00:18:30.580 --> 00:18:32.680 a noun phrase followed by verb phrase. 
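NOTE: The rewriting rules described so far can be written down directly as data. Here is a minimal sketch in Python (not code from the lecture itself) that encodes this small grammar as a dictionary and randomly expands symbols into sentences; the non-terminal names S, NP, VP, N, D, and V match the rules above, while the `generate` function is a hypothetical helper added just for illustration.

```python
import random

# The lecture's toy context-free grammar as data: each non-terminal maps
# to a list of alternatives, and each alternative is a sequence of symbols.
# Symbols not in the dictionary are terminals (actual words).
GRAMMAR = {
    "S":  [["NP", "VP"]],
    "NP": [["N"], ["D", "N"]],
    "VP": [["V"], ["V", "NP"]],
    "N":  [["she"], ["city"], ["car"], ["Harry"]],
    "D":  [["the"], ["a"], ["an"]],
    "V":  [["saw"], ["walked"]],
}

def generate(symbol="S"):
    """Expand a symbol into a list of terminal words by picking rules at random."""
    if symbol not in GRAMMAR:          # terminal symbol: emit the word itself
        return [symbol]
    alternative = random.choice(GRAMMAR[symbol])
    words = []
    for sym in alternative:            # recursively expand each symbol in the rule
        words.extend(generate(sym))
    return words

print(" ".join(generate()))            # e.g. "she saw the city"
```

Every sentence this sketch prints is, by construction, derivable from S, which is exactly what it means for a grammar to generate a language.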
00:18:32.680 --> 00:18:35.320 And this is a formal grammar way of expressing the idea 00:18:35.320 --> 00:18:38.445 that you might have learned when you learned English grammar, when you read 00:18:38.445 --> 00:18:42.190 that a sentence is like a subject and a verb, subject and action-- 00:18:42.190 --> 00:18:45.330 something that's happening to a particular noun phrase. 00:18:45.330 --> 00:18:47.650 And so using this structure, we could construct 00:18:47.650 --> 00:18:49.740 a sentence that looks like this. 00:18:49.740 --> 00:18:53.140 A sentence consists of a noun phrase and a verb phrase. 00:18:53.140 --> 00:18:56.080 A noun phrase could just be a noun, like the word she. 00:18:56.080 --> 00:18:58.180 The verb phrase could be a verb and a noun phrase, 00:18:58.180 --> 00:19:00.940 where-- this is something we've seen before-- the verb is saw 00:19:00.940 --> 00:19:03.838 and the noun phrase is the city. 00:19:03.838 --> 00:19:05.380 And so now look what we've done here. 00:19:05.380 --> 00:19:08.160 What we've done is, by defining a set of rules, 00:19:08.160 --> 00:19:11.940 there are algorithms that we can run that take these words-- 00:19:11.940 --> 00:19:15.190 and the CYK algorithm, for example, is one example of this if you want to look 00:19:15.190 --> 00:19:15.880 into that-- 00:19:15.880 --> 00:19:20.200 where you start with a set of terminal symbols, like she saw the city, 00:19:20.200 --> 00:19:22.630 and then using these rules, you're able to figure out, 00:19:22.630 --> 00:19:26.958 how is it that you go from a sentence to she saw the city? 00:19:26.958 --> 00:19:28.750 And it's all through these rewriting rules. 00:19:28.750 --> 00:19:31.310 So the sentence is a noun phrase and a verb phrase. 
00:19:31.310 --> 00:19:34.600 A verb phrase could be a verb and a noun phrase, so on and so forth, 00:19:34.600 --> 00:19:37.000 where you can imagine taking this structure 00:19:37.000 --> 00:19:41.510 and figuring out how it is that you could generate a parse tree-- 00:19:41.510 --> 00:19:46.290 a syntax tree-- for that set of terminal symbols, that set of words. 00:19:46.290 --> 00:19:49.990 And if you tried to do this for a sentence that was not grammatical, 00:19:49.990 --> 00:19:53.830 something like "saw the city she," well, that wouldn't work. 00:19:53.830 --> 00:19:56.320 There'd be no way to take these rules and use 00:19:56.320 --> 00:19:58.720 them to be able to generate that sentence, because it 00:19:58.720 --> 00:20:01.220 is not inside of that language. 00:20:01.220 --> 00:20:03.490 So this sort of model can be very helpful 00:20:03.490 --> 00:20:06.040 if the rules are expressive enough to express 00:20:06.040 --> 00:20:09.400 all the ideas that you might want to express inside of natural language. 00:20:09.400 --> 00:20:12.003 Of course, using just the simple rules we have here, 00:20:12.003 --> 00:20:14.920 there are many sentences that we won't be able to generate-- sentences 00:20:14.920 --> 00:20:18.280 that we might agree are grammatically and syntactically well-formed, 00:20:18.280 --> 00:20:21.450 but that we're not going to be able to construct using these rules. 00:20:21.450 --> 00:20:23.200 And then, in that case, we might just need 00:20:23.200 --> 00:20:28.300 to have some more complex rules in order to deal with those sorts of cases. 00:20:28.300 --> 00:20:30.370 And so this type of approach can be powerful 00:20:30.370 --> 00:20:33.430 if you're dealing with a limited set of rules and words 00:20:33.430 --> 00:20:35.230 that you really care about dealing with. 
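The rewriting idea just described can be sketched in a few lines of pure Python. This is a toy illustration only (it is not the CYK algorithm, and not how NLTK implements parsing); the names GRAMMAR, parse, and match are mine, and the grammar is the small one from the lecture.

```python
# Toy context-free grammar from the lecture:
# S -> NP VP, NP -> D N | N, VP -> V | V NP, plus a few terminal words.
GRAMMAR = {
    "S":  [["NP", "VP"]],
    "NP": [["D", "N"], ["N"]],
    "VP": [["V"], ["V", "NP"]],
    "D":  [["the"], ["a"]],
    "N":  [["she"], ["city"], ["car"]],
    "V":  [["saw"], ["walked"]],
}

def parse(symbol, words):
    """Return a syntax tree if `symbol` can derive `words`, else None."""
    if symbol not in GRAMMAR:                       # terminal: a literal word
        return symbol if list(words) == [symbol] else None
    for rhs in GRAMMAR[symbol]:
        children = match(rhs, list(words))
        if children is not None:
            return [symbol] + children
    return None

def match(rhs, words):
    """Try to split `words` so each symbol in `rhs` derives one piece."""
    if not rhs:
        return [] if not words else None
    first, rest = rhs[0], rhs[1:]
    for i in range(1, len(words) - len(rest) + 1):  # each symbol covers >= 1 word
        head = parse(first, words[:i])
        if head is not None:
            tail = match(rest, words[i:])
            if tail is not None:
                return [head] + tail
    return None

print(parse("S", "she saw the city".split()))
```

Running this prints a nested tree mirroring the syntax tree in the lecture: the sentence splits into a noun phrase (she) and a verb phrase (saw the city), while an ungrammatical sequence like "saw the city she" yields no tree at all.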
00:20:35.230 --> 00:20:37.690 And one way we can actually interact with this in Python 00:20:37.690 --> 00:20:42.100 is by using a Python library called NLTK, short for natural language 00:20:42.100 --> 00:20:44.410 toolkit, which we'll see a couple of times today, 00:20:44.410 --> 00:20:47.410 which has a wide variety of different functions and classes 00:20:47.410 --> 00:20:49.300 that we can take advantage of that are all 00:20:49.300 --> 00:20:51.100 meant to deal with natural language. 00:20:51.100 --> 00:20:54.700 And one such algorithm that it has is the ability to parse 00:20:54.700 --> 00:20:57.670 a context-free grammar, to be able to take some words 00:20:57.670 --> 00:20:59.920 and figure out according to some context-free grammar, 00:20:59.920 --> 00:21:02.892 how would you construct the syntax tree for it? 00:21:02.892 --> 00:21:04.600 So let's go ahead and take a look at NLTK 00:21:04.600 --> 00:21:09.950 now by examining how we might construct some context-free grammars with it. 00:21:09.950 --> 00:21:12.110 So here inside of cfg0-- 00:21:12.110 --> 00:21:14.410 cfg's short for context-free grammar-- 00:21:14.410 --> 00:21:19.230 I have a sample context-free grammar which has rules that we've seen before. 00:21:19.230 --> 00:21:22.330 So sentence goes to noun phrase followed by a verb phrase. 00:21:22.330 --> 00:21:25.900 Noun phrase is either a determiner and a noun or a noun. 00:21:25.900 --> 00:21:29.080 Verb phrase is either a verb or a verb and a noun phrase. 00:21:29.080 --> 00:21:32.020 The order of these things doesn't really matter. 00:21:32.020 --> 00:21:34.480 Determiners could be the word the or the word a. 00:21:34.480 --> 00:21:37.630 A noun could be the word she, city, or car. 00:21:37.630 --> 00:21:42.040 And a verb could be the word saw or it could be the word walked. 
00:21:42.040 --> 00:21:45.100 Now, using NLTK, which I've imported here at the top, 00:21:45.100 --> 00:21:47.800 I'm going to go ahead and parse this grammar 00:21:47.800 --> 00:21:50.823 and save it inside of this variable called parser. 00:21:50.823 --> 00:21:52.990 Next, my program is going to ask the user for input. 00:21:52.990 --> 00:21:55.630 Just type in a sentence, and dot split will just 00:21:55.630 --> 00:21:57.790 split it on all of the spaces, so I end up 00:21:57.790 --> 00:22:00.360 getting each of the individual words. 00:22:00.360 --> 00:22:03.400 We're going to save that inside of this list called sentence. 00:22:03.400 --> 00:22:08.350 And then we'll go ahead and try to parse the sentence, and for each sentence 00:22:08.350 --> 00:22:10.840 we parse, we're going to pretty print it to the screen, 00:22:10.840 --> 00:22:12.327 just so it displays in my terminal. 00:22:12.327 --> 00:22:13.660 And we're also going to draw it. 00:22:13.660 --> 00:22:16.210 It turns out that NLTK has some graphics capacity, 00:22:16.210 --> 00:22:19.632 so we can really visually see what that tree looks like as well. 00:22:19.632 --> 00:22:22.340 And there are multiple different ways a sentence might be parsed, 00:22:22.340 --> 00:22:24.700 which is why we're putting it inside of this for loop. 00:22:24.700 --> 00:22:27.762 And we'll see why that can be helpful in a moment too. 00:22:27.762 --> 00:22:30.220 All right, now that I have that, let's go ahead and try it. 00:22:30.220 --> 00:22:34.840 I'll cd into cfg, and we'll go ahead and run cfg0. 00:22:34.840 --> 00:22:37.450 So it then is going to prompt me to type in a sentence. 00:22:37.450 --> 00:22:39.658 And let me type in a very simple sentence-- something 00:22:39.658 --> 00:22:42.070 like she walked, for example. 00:22:42.070 --> 00:22:43.240 Press Return. 
00:22:43.240 --> 00:22:45.510 So what I get is, on the left-hand side, you 00:22:45.510 --> 00:22:48.902 can see a text-based representation of the syntax tree. 00:22:48.902 --> 00:22:51.610 And on the right side here-- let me go ahead and make it bigger-- 00:22:51.610 --> 00:22:55.240 we see a visual representation of that same syntax tree. 00:22:55.240 --> 00:22:59.960 This is how it is that my computer has now parsed the sentence she walked. 00:22:59.960 --> 00:23:02.980 It's a sentence that consists of a noun phrase and a verb phrase, 00:23:02.980 --> 00:23:06.790 where each phrase is just a single noun or verb, she and then walked-- 00:23:06.790 --> 00:23:09.100 same type of structure we've seen before, 00:23:09.100 --> 00:23:11.410 but this now is our computer able to understand 00:23:11.410 --> 00:23:13.990 the structure of the sentence, to be able to get 00:23:13.990 --> 00:23:17.920 some sort of structural understanding of how it is that parts of the sentence 00:23:17.920 --> 00:23:19.660 relate to each other. 00:23:19.660 --> 00:23:21.460 Let me now give it another sentence. 00:23:21.460 --> 00:23:25.180 I could try something like she saw the city, for example-- 00:23:25.180 --> 00:23:27.350 the words we were dealing with a moment ago. 00:23:27.350 --> 00:23:31.050 And then we end up getting this syntax tree out of it-- 00:23:31.050 --> 00:23:34.170 again, a sentence that has a noun phrase and a verb phrase. 00:23:34.170 --> 00:23:35.800 The noun phrase is fairly simple. 00:23:35.800 --> 00:23:36.960 It's just she. 00:23:36.960 --> 00:23:38.460 But the verb phrase is more complex. 00:23:38.460 --> 00:23:42.390 It is now saw the city, for example. 00:23:42.390 --> 00:23:44.790 Let's do one more with this grammar. 00:23:44.790 --> 00:23:47.343 Let's do something like she saw a car. 00:23:47.343 --> 00:23:49.010 And that is going to look very similar-- 00:23:49.010 --> 00:23:50.328 that we also get she. 
00:23:50.328 --> 00:23:51.870 But our verb phrase is now different. 00:23:51.870 --> 00:23:55.220 It's saw a car, because there are multiple possible determiners 00:23:55.220 --> 00:23:57.307 in our language and multiple possible nouns. 00:23:57.307 --> 00:23:59.390 I haven't given this grammar that many words, 00:23:59.390 --> 00:24:01.790 but if I gave it a larger vocabulary, it would then 00:24:01.790 --> 00:24:06.360 be able to understand more and more different types of sentences. 00:24:06.360 --> 00:24:09.590 And just to give you a sense of some added complexity we could add here, 00:24:09.590 --> 00:24:12.568 the more complex our grammar, the more rules we add, 00:24:12.568 --> 00:24:14.360 the more different types of sentences we'll 00:24:14.360 --> 00:24:15.860 then have the ability to generate. 00:24:15.860 --> 00:24:18.410 So let's take a look at cfg1, for example, 00:24:18.410 --> 00:24:21.590 where I've added a whole number of other different types of rules. 00:24:21.590 --> 00:24:25.970 I've added adjective phrases, where we can have multiple adjectives inside 00:24:25.970 --> 00:24:27.590 of a noun phrase as well. 00:24:27.590 --> 00:24:31.310 So a noun phrase could be an adjective phrase followed by a noun phrase. 00:24:31.310 --> 00:24:33.650 If I wanted to say something like the big city, 00:24:33.650 --> 00:24:37.250 that's an adjective phrase followed by a noun phrase. 00:24:37.250 --> 00:24:40.740 Or we could also have a noun and a prepositional phrase-- 00:24:40.740 --> 00:24:43.250 so the car on the street, for example. 00:24:43.250 --> 00:24:46.100 On the street is a prepositional phrase, and we 00:24:46.100 --> 00:24:50.060 might want to combine those two ideas together, because the car on the street 00:24:50.060 --> 00:24:53.333 can still operate as something kind of like a noun phrase as well. 
00:24:53.333 --> 00:24:56.000 So no need to understand all of these rules in too much detail-- 00:24:56.000 --> 00:24:59.240 it starts to get into the nature of English grammar-- 00:24:59.240 --> 00:25:04.980 but now we have a more complex way of understanding these types of sentences. 00:25:04.980 --> 00:25:07.190 So if I run Python cfg1-- 00:25:07.190 --> 00:25:13.130 and I can try typing something like she saw the wide street, for example-- 00:25:13.130 --> 00:25:14.840 a more complex sentence. 00:25:14.840 --> 00:25:18.990 And if we make that larger, you can see what this sentence looks like. 00:25:18.990 --> 00:25:21.700 I'll go ahead and shrink it a little bit. 00:25:21.700 --> 00:25:26.100 So now we have a sentence like this-- she saw the wide street. 00:25:26.100 --> 00:25:28.830 The wide street is one entire noun phrase, 00:25:28.830 --> 00:25:31.470 saw the wide street is an entire verb phrase, 00:25:31.470 --> 00:25:35.830 and she saw the wide street ends up forming that entire sentence. 00:25:35.830 --> 00:25:40.150 So let's take a look at one more example to introduce this notion of ambiguity. 00:25:40.150 --> 00:25:42.060 So I can run Python cfg1. 00:25:42.060 --> 00:25:48.540 Let me type a sentence like she saw a dog with binoculars. 00:25:48.540 --> 00:25:52.860 So there's a sentence, and here now is one possible syntax tree 00:25:52.860 --> 00:25:54.510 to represent this idea-- 00:25:54.510 --> 00:25:59.190 she saw, the noun phrase a dog, and then the prepositional phrase 00:25:59.190 --> 00:26:00.390 with binoculars. 00:26:00.390 --> 00:26:06.000 And the way to interpret the sentence is that what it is that she saw was a dog. 00:26:06.000 --> 00:26:07.980 And how did she do the seeing? 00:26:07.980 --> 00:26:10.680 She did the seeing with binoculars. 00:26:10.680 --> 00:26:13.080 And so this is one possible way to interpret this. 00:26:13.080 --> 00:26:14.730 She was using binoculars. 
00:26:14.730 --> 00:26:18.170 Using those binoculars, she saw a dog. 00:26:18.170 --> 00:26:21.000 But another possible way to parse that sentence 00:26:21.000 --> 00:26:25.020 would be with this tree over here, where you have something 00:26:25.020 --> 00:26:31.000 like she saw a dog with binoculars, where a dog with binoculars 00:26:31.000 --> 00:26:33.340 forms an entire noun phrase of its own-- 00:26:33.340 --> 00:26:37.000 same words in the same order, but a different grammatical structure, 00:26:37.000 --> 00:26:41.350 where now we have a dog with binoculars all inside of this noun phrase, 00:26:41.350 --> 00:26:42.700 meaning what did she see? 00:26:42.700 --> 00:26:44.920 What she saw was a dog, and that dog happened 00:26:44.920 --> 00:26:49.210 to have binoculars with it-- so different ways to parse the sentence-- 00:26:49.210 --> 00:26:53.700 different structures for the sentence-- even given the same possible sequence of words. 00:26:53.700 --> 00:26:56.320 And NLTK's algorithm-- this particular algorithm-- 00:26:56.320 --> 00:26:58.150 has the ability to find all of these, to be 00:26:58.150 --> 00:27:00.610 able to understand the different ways that you might 00:27:00.610 --> 00:27:05.080 be able to parse a sentence and be able to extract some sort of useful meaning 00:27:05.080 --> 00:27:07.900 out of that sentence as well. 00:27:07.900 --> 00:27:11.650 So that then is a brief look at what we can do-- 00:27:11.650 --> 00:27:16.300 at getting at the structure of language, at using these context-free grammar 00:27:16.300 --> 00:27:19.270 rules to be able to describe the structure of language. 00:27:19.270 --> 00:27:22.150 But what we might also care about is understanding 00:27:22.150 --> 00:27:24.700 how it is that these sequences of words are 00:27:24.700 --> 00:27:29.080 likely to relate to each other in terms of the actual words themselves. 
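The two readings of "she saw a dog with binoculars" can be reproduced with a small sketch that enumerates every parse instead of stopping at the first one. The grammar below (with prepositional-phrase rules added) and the helper names parses and expansions are mine, a toy stand-in for what NLTK's chart parser reports for a richer grammar.

```python
# Toy grammar extended with prepositional phrases (PP), so a PP can
# attach either to the verb phrase or inside the noun phrase.
GRAMMAR = {
    "S":  [["NP", "VP"]],
    "NP": [["D", "N"], ["N"], ["D", "N", "PP"], ["N", "PP"]],
    "VP": [["V"], ["V", "NP"], ["V", "NP", "PP"]],
    "PP": [["P", "NP"]],
    "D":  [["a"], ["the"]],
    "N":  [["she"], ["dog"], ["binoculars"]],
    "V":  [["saw"]],
    "P":  [["with"]],
}

def parses(symbol, words):
    """All syntax trees by which `symbol` derives `words`."""
    words = list(words)
    if symbol not in GRAMMAR:                       # terminal: a literal word
        return [symbol] if words == [symbol] else []
    trees = []
    for rhs in GRAMMAR[symbol]:
        for children in expansions(rhs, words):
            trees.append([symbol] + children)
    return trees

def expansions(rhs, words):
    """All ways to split `words` among the symbols of `rhs`."""
    if not rhs:
        return [[]] if not words else []
    first, rest = rhs[0], rhs[1:]
    results = []
    for i in range(1, len(words) - len(rest) + 1):  # each symbol covers >= 1 word
        for head in parses(first, words[:i]):
            for tail in expansions(rest, words[i:]):
                results.append([head] + tail)
    return results

trees = parses("S", "she saw a dog with binoculars".split())
print(len(trees))   # one tree attaches the PP to the verb, one to the noun phrase
```

The ambiguous sentence yields exactly two trees, while an unambiguous one like "she saw a dog" yields one.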
00:27:29.080 --> 00:27:33.100 The grammar that we saw before could allow us to generate a sentence like, 00:27:33.100 --> 00:27:37.930 I ate a banana, for example, where I is the noun phrase and ate a banana 00:27:37.930 --> 00:27:39.190 is a verb phrase. 00:27:39.190 --> 00:27:41.800 But it would also allow for sentences like, I 00:27:41.800 --> 00:27:46.180 ate a blue car, for example, which is also syntactically well-formed 00:27:46.180 --> 00:27:50.830 according to the rules, but is probably a sentence that a person is 00:27:50.830 --> 00:27:51.640 less likely to speak. 00:27:51.640 --> 00:27:54.550 And we might want for our AI to be able to encapsulate 00:27:54.550 --> 00:28:00.140 the idea that certain sequences of words are more or less likely than others. 00:28:00.140 --> 00:28:03.880 So to deal with that, we'll introduce the notion of an n-gram, 00:28:03.880 --> 00:28:06.910 and an n-gram, more generally, just refers to some sequence 00:28:06.910 --> 00:28:09.880 of n items inside of our text. 00:28:09.880 --> 00:28:12.350 And those items might take various different forms. 00:28:12.350 --> 00:28:15.220 We can have character n-grams, which are just a contiguous 00:28:15.220 --> 00:28:18.520 sequence of n characters-- so three characters in a row, 00:28:18.520 --> 00:28:20.770 for example, or four characters in a row. 00:28:20.770 --> 00:28:23.500 We can also have word n-grams, which are a contiguous 00:28:23.500 --> 00:28:28.840 sequence of n words in a row from a particular sample of text. 00:28:28.840 --> 00:28:30.760 And these end up proving quite useful, and you 00:28:30.760 --> 00:28:34.700 can choose our n to decide how long our sequence is going to be. 00:28:34.700 --> 00:28:39.170 So when n is 1, we're just looking at a single word or a single character. 00:28:39.170 --> 00:28:42.760 And that is what we might call a unigram, just one item. 
00:28:42.760 --> 00:28:45.160 If we're looking at two characters or two words, 00:28:45.160 --> 00:28:47.590 that's generally called a bigram-- so an n-gram 00:28:47.590 --> 00:28:51.205 where n is equal to 2, looking at two words that are consecutive. 00:28:51.205 --> 00:28:53.080 And then, if there are three items, you might 00:28:53.080 --> 00:28:56.200 imagine we'll often call those trigrams-- so three characters 00:28:56.200 --> 00:29:00.770 in a row or three words that happen to be in a contiguous sequence. 00:29:00.770 --> 00:29:04.000 And so if we took a sentence, for example-- 00:29:04.000 --> 00:29:06.367 here's a sentence from, again, Sherlock Holmes-- 00:29:06.367 --> 00:29:08.200 "how often have I said to you that, when you 00:29:08.200 --> 00:29:10.540 have eliminated the impossible, whatever remains, 00:29:10.540 --> 00:29:13.300 however improbable, must be the truth." 00:29:13.300 --> 00:29:16.090 What are the trigrams that we can extract from the sentence? 00:29:16.090 --> 00:29:18.830 If we're looking at sequences of three words, 00:29:18.830 --> 00:29:21.280 well, the first trigram would be how often 00:29:21.280 --> 00:29:23.890 have-- just a sequence of three words. 00:29:23.890 --> 00:29:25.960 And then we can look at the next trigram, 00:29:25.960 --> 00:29:29.200 often have I. The next trigram is have I said. 00:29:29.200 --> 00:29:32.320 Then I said to, said to you, to you that, for example-- 00:29:32.320 --> 00:29:36.700 those are all trigrams of words, sequences of three contiguous words 00:29:36.700 --> 00:29:38.410 that show up in the text. 
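The sliding window just described is a one-liner in Python. This is a minimal sketch (NLTK offers the same behavior as nltk.ngrams); the function name ngrams is mine, and punctuation is dropped from the example sentence for simplicity.

```python
def ngrams(tokens, n):
    """All contiguous sequences of n tokens, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# The Sherlock Holmes sentence from the lecture, with punctuation removed.
words = ("how often have I said to you that when you have eliminated "
         "the impossible whatever remains however improbable "
         "must be the truth").split()

trigrams = ngrams(words, 3)
print(trigrams[:3])
```

The first few trigrams printed are exactly those walked through above: ('how', 'often', 'have'), ('often', 'have', 'I'), ('have', 'I', 'said').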
00:29:38.410 --> 00:29:43.120 And extracting those bigrams and trigrams, or n-grams more generally, 00:29:43.120 --> 00:29:45.820 turns out to be quite helpful, because often, 00:29:45.820 --> 00:29:48.113 when we're dealing with analyzing a lot of text, 00:29:48.113 --> 00:29:50.530 it's not going to be particularly meaningful for us to try 00:29:50.530 --> 00:29:53.990 and analyze the entire text at one time. 00:29:53.990 --> 00:29:57.670 But instead, we want to segment that text into pieces that we 00:29:57.670 --> 00:29:59.650 can begin to do some analysis of-- 00:29:59.650 --> 00:30:03.070 that our AI might never have seen this entire sentence before, 00:30:03.070 --> 00:30:07.810 but it's probably seen the trigram to you that before, 00:30:07.810 --> 00:30:11.710 because to you that is something that might have come up in other documents 00:30:11.710 --> 00:30:13.240 that our AI has seen before. 00:30:13.240 --> 00:30:16.900 And therefore, it knows a little bit about that particular sequence 00:30:16.900 --> 00:30:20.890 of three words in a row-- or something like have I said, 00:30:20.890 --> 00:30:24.820 another example of another sequence of three words that's probably 00:30:24.820 --> 00:30:28.880 quite popular, in terms of where you see it inside the English language. 00:30:28.880 --> 00:30:32.433 So we'd like some way to be able to extract these sorts of n-grams. 00:30:32.433 --> 00:30:33.350 And how do we do that? 00:30:33.350 --> 00:30:35.770 How do we extract sequences of three words? 00:30:35.770 --> 00:30:39.490 Well, we need to take our input and somehow separate it 00:30:39.490 --> 00:30:41.810 into all of the individual words. 00:30:41.810 --> 00:30:45.010 And this is a process generally known as tokenization, 00:30:45.010 --> 00:30:48.250 the task of splitting up some sequence into distinct pieces, 00:30:48.250 --> 00:30:50.440 where we call those pieces tokens. 
00:30:50.440 --> 00:30:53.480 Most commonly, this refers to something like word tokenization. 00:30:53.480 --> 00:30:55.810 I have some sequence of text and I want to split it up 00:30:55.810 --> 00:30:58.810 into all of the words that show up in that text. 00:30:58.810 --> 00:31:01.240 But it might also come up in the context of something 00:31:01.240 --> 00:31:02.680 like sentence tokenization. 00:31:02.680 --> 00:31:05.950 I have a long sequence of text and I'd like to split it up 00:31:05.950 --> 00:31:08.050 into sentences, for example. 00:31:08.050 --> 00:31:11.260 And so how might word tokenization work, the task of splitting up 00:31:11.260 --> 00:31:13.660 our sequence of characters into words? 00:31:13.660 --> 00:31:15.640 Well, we've also already seen this idea. 00:31:15.640 --> 00:31:18.610 We've seen that, in word tokenization just a moment ago, I 00:31:18.610 --> 00:31:22.660 took an input sequence and I just called Python's split method on it, where 00:31:22.660 --> 00:31:25.360 the split method took that sequence of words 00:31:25.360 --> 00:31:29.880 and just separated it based on where the spaces showed up in that word. 00:31:29.880 --> 00:31:33.640 And so if I had a sentence like, whatever remains, however improbable, 00:31:33.640 --> 00:31:37.620 must be the truth, how would I tokenize this? 00:31:37.620 --> 00:31:41.460 Well, the naive approach is just to say, anytime you see a space, 00:31:41.460 --> 00:31:42.600 go ahead and split it up. 00:31:42.600 --> 00:31:46.800 We're going to split up this particular string just by looking for spaces. 00:31:46.800 --> 00:31:49.830 And what we get when we do that is a sentence like this-- 00:31:49.830 --> 00:31:53.660 whatever remains, however improbable, must be the truth. 00:31:53.660 --> 00:31:56.160 But what you'll notice here is that, if we just split things 00:31:56.160 --> 00:32:00.930 up in terms of where the spaces are, we end up keeping the punctuation around. 
00:32:00.930 --> 00:32:02.960 There's a comma after the word remains. 00:32:02.960 --> 00:32:06.030 There's a comma after improbable, a period after truth. 00:32:06.030 --> 00:32:08.160 And this poses a little bit of a challenge, when 00:32:08.160 --> 00:32:11.820 we think about trying to tokenize things into individual words, 00:32:11.820 --> 00:32:15.150 because if you're comparing words to each other, this word 00:32:15.150 --> 00:32:16.712 truth with a period after it-- 00:32:16.712 --> 00:32:18.420 if you just string compare it, it's going 00:32:18.420 --> 00:32:21.270 to be different from the word truth without a period after it. 00:32:21.270 --> 00:32:23.810 And so this punctuation can sometimes pose a problem for us, 00:32:23.810 --> 00:32:27.060 and so we might want some way of dealing with it-- either treating punctuation 00:32:27.060 --> 00:32:30.990 as a separate token altogether or maybe removing that punctuation entirely 00:32:30.990 --> 00:32:32.920 from our sequence as well. 00:32:32.920 --> 00:32:35.020 So that might be something we want to do. 00:32:35.020 --> 00:32:38.010 But there are other cases where it becomes a little bit less clear. 00:32:38.010 --> 00:32:40.680 If I said something like, just before 9 o'clock, 00:32:40.680 --> 00:32:43.110 Sherlock Holmes stepped briskly into the room, 00:32:43.110 --> 00:32:46.167 well, this apostrophe after 9 o'clock-- 00:32:46.167 --> 00:32:48.750 after the O in 9 o'clock-- is that something we should remove? 00:32:48.750 --> 00:32:52.080 Should we split based on that as well, into O and clock? 00:32:52.080 --> 00:32:54.090 There are some interesting questions there too. 00:32:54.090 --> 00:32:57.360 And it gets even trickier if you begin to think about hyphenated words-- 00:32:57.360 --> 00:33:00.650 something like this, where we have a whole bunch of words 00:33:00.650 --> 00:33:03.840 that are hyphenated and then you need to make a judgment call. 
00:33:03.840 --> 00:33:06.180 Is that a place where you're going to split things apart 00:33:06.180 --> 00:33:09.840 into individual words, or are you going to consider frock-coat, and well-cut, 00:33:09.840 --> 00:33:13.300 and pearl-grey to be individual words of their own? 00:33:13.300 --> 00:33:16.530 And so those tend to pose challenges that we need to somehow deal with 00:33:16.530 --> 00:33:19.890 and something we need to decide as we go about trying 00:33:19.890 --> 00:33:21.790 to perform this kind of analysis. 00:33:21.790 --> 00:33:25.950 Similar challenges arise when it comes to the world of sentence tokenization. 00:33:25.950 --> 00:33:29.410 Imagine this sequence of sentences, for example. 00:33:29.410 --> 00:33:31.927 If you take a look at this particular sequence of sentences, 00:33:31.927 --> 00:33:35.010 you could probably imagine you could extract the sentences pretty readily. 00:33:35.010 --> 00:33:38.060 Here is one sentence and here is a second sentence, 00:33:38.060 --> 00:33:43.060 so we have two different sentences inside of this particular passage. 00:33:43.060 --> 00:33:46.260 And the distinguishing feature seems to be the period-- 00:33:46.260 --> 00:33:48.963 that a period separates one sentence from another. 00:33:48.963 --> 00:33:50.880 And maybe there are other types of punctuation 00:33:50.880 --> 00:33:52.830 you might include here as well-- 00:33:52.830 --> 00:33:55.740 an exclamation point, for example, or a question mark. 00:33:55.740 --> 00:33:58.080 But those are the types of punctuation that we know 00:33:58.080 --> 00:34:00.750 tend to come at the end of sentences. 00:34:00.750 --> 00:34:04.410 But it gets trickier again if you look at a sentence like this-- not just 00:34:04.410 --> 00:34:07.140 talking to Sherlock, but instead of talking to Sherlock, 00:34:07.140 --> 00:34:09.449 talking to Mr. Holmes. 00:34:09.449 --> 00:34:11.313 Well now, we have a period at the end of Mr. 
00:34:11.313 --> 00:34:13.230 And so if you were just separating on periods, 00:34:13.230 --> 00:34:15.570 you might imagine this would be a sentence, 00:34:15.570 --> 00:34:17.760 and then just Holmes would be a sentence, 00:34:17.760 --> 00:34:19.800 and then we'd have a third sentence down below. 00:34:19.800 --> 00:34:23.159 Things do get a little bit trickier as you start 00:34:23.159 --> 00:34:25.050 to imagine these sorts of situations. 00:34:25.050 --> 00:34:27.690 And dialogue too starts to make this trickier as well-- 00:34:27.690 --> 00:34:31.860 that if you have these sorts of lines that are inside of something that-- 00:34:31.860 --> 00:34:33.150 he said, for example-- 00:34:33.150 --> 00:34:35.639 that he said this particular sequence of words 00:34:35.639 --> 00:34:37.469 and then this particular sequence of words. 00:34:37.469 --> 00:34:40.170 There are interesting challenges that arise there too, 00:34:40.170 --> 00:34:42.389 in terms of how it is that we take the sentence 00:34:42.389 --> 00:34:46.268 and split it up into individual sentences as well. 00:34:46.268 --> 00:34:48.810 And these are just things that our algorithm needs to decide. 00:34:48.810 --> 00:34:51.370 In practice, there are usually some heuristics that we can use. 00:34:51.370 --> 00:34:53.610 We know there are certain occurrences of periods, 00:34:53.610 --> 00:34:56.580 like the period after Mr., or in other examples where 00:34:56.580 --> 00:34:59.010 we know that is not the beginning of a new sentence, 00:34:59.010 --> 00:35:01.770 and so we can encode those rules into our AI 00:35:01.770 --> 00:35:04.680 to allow it to be able to do this tokenization the way 00:35:04.680 --> 00:35:06.060 that we want it to. 
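Both tokenization tasks, and the heuristics just mentioned, can be sketched in a few lines. These are hedged toy versions, not the real thing (NLTK's word and sentence tokenizers are far more thorough); the tiny abbreviation list here is an illustrative stand-in I made up.

```python
import re

def word_tokenize(text):
    """Split into words, keeping punctuation as separate tokens.
    Apostrophes stay inside words, so o'clock remains one token."""
    return re.findall(r"[A-Za-z0-9']+|[^\sA-Za-z0-9]", text)

# A tiny, illustrative list of abbreviations whose period does not end a sentence.
ABBREVIATIONS = {"mr.", "mrs.", "dr.", "st."}

def sentence_tokenize(text):
    """Split on sentence-final ., !, or ? -- unless the token is a known
    abbreviation like Mr., which should not trigger a split."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if token[-1] in ".!?" and token.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:                      # any trailing words without final punctuation
        sentences.append(" ".join(current))
    return sentences

print(word_tokenize("whatever remains, however improbable, must be the truth."))
print(sentence_tokenize("He went to see Mr. Holmes. It was urgent."))
```

The word tokenizer separates the commas and final period into their own tokens, and the sentence tokenizer keeps "Mr. Holmes" inside one sentence rather than splitting after "Mr."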
00:35:06.060 --> 00:35:09.960 So once we have this ability to tokenize a particular passage-- 00:35:09.960 --> 00:35:12.930 take the passage, split it up into individual words-- 00:35:12.930 --> 00:35:17.110 from there, we can begin to extract what the n-grams actually are. 00:35:17.110 --> 00:35:20.190 So we can actually take a look at this by going 00:35:20.190 --> 00:35:23.250 into a Python program that will serve the purpose of extracting 00:35:23.250 --> 00:35:24.630 these n-grams. 00:35:24.630 --> 00:35:27.510 And again, we can use NLTK, the Natural Language Toolkit, in order 00:35:27.510 --> 00:35:28.720 to help us here. 00:35:28.720 --> 00:35:33.540 So I'll go ahead and go into ngrams and we'll take a look at ngrams.py. 00:35:33.540 --> 00:35:36.280 And what we have here is we are going to take 00:35:36.280 --> 00:35:39.190 some corpus of text, just some sequence of documents, 00:35:39.190 --> 00:35:43.960 and use all those documents and extract what the most popular n-grams happen 00:35:43.960 --> 00:35:44.800 to be. 00:35:44.800 --> 00:35:48.490 So in order to do so, we're going to go ahead and load data from a directory 00:35:48.490 --> 00:35:50.510 that we specify in the command line argument. 00:35:50.510 --> 00:35:53.170 We'll also take in a number n as a command line argument 00:35:53.170 --> 00:35:55.390 as well, in terms of what our number should be, 00:35:55.390 --> 00:36:00.480 in terms of how many words we're going to look at in sequence. 00:36:00.480 --> 00:36:05.330 Then we're going to go ahead and just count up all of the nltk.ngrams. 00:36:05.330 --> 00:36:09.170 So we're going to look at all of the n-grams across this entire corpus 00:36:09.170 --> 00:36:11.600 and save it inside this variable ngrams. 00:36:11.600 --> 00:36:14.090 And then we're going to look at the most common ones 00:36:14.090 --> 00:36:15.423 and go ahead and print them out. 
00:36:15.423 --> 00:36:18.020 And so in order to do so, I'm not only using NLTK-- 00:36:18.020 --> 00:36:21.290 I'm also using Counter, which is built into Python as well, where I can just 00:36:21.290 --> 00:36:25.800 count up how many times these various different n-grams appear. 00:36:25.800 --> 00:36:27.480 So we'll go ahead and show that. 00:36:27.480 --> 00:36:31.500 We'll go into ngrams, and I'll say something like python ngrams-- 00:36:31.500 --> 00:36:34.020 and let's just first look for the unigrams, sequences 00:36:34.020 --> 00:36:37.000 of one word inside of a corpus. 00:36:37.000 --> 00:36:39.270 And the corpus that I've prepared is I have 00:36:39.270 --> 00:36:42.720 all of the-- or some of these stories from Sherlock Holmes 00:36:42.720 --> 00:36:47.140 all here, where each one is just one of the Sherlock Holmes stories. 00:36:47.140 --> 00:36:50.010 And so I have a whole bunch of text here inside of this corpus, 00:36:50.010 --> 00:36:54.270 and I'll go ahead and provide that corpus as a command line argument. 00:36:54.270 --> 00:36:55.980 And now what my program is going to do is 00:36:55.980 --> 00:36:59.000 it's going to load all of the Sherlock Holmes stories into memory-- 00:36:59.000 --> 00:37:01.500 or all the ones that I've provided in this corpus at least-- 00:37:01.500 --> 00:37:04.200 and it's just going to look for the most popular unigrams, 00:37:04.200 --> 00:37:07.050 the most popular sequences of one word. 00:37:07.050 --> 00:37:12.060 And it seems the most popular one is just the word the, used 9,700 times; 00:37:12.060 --> 00:37:15.930 followed by I, used 5,000 times; and, used about 5,000 times-- 00:37:15.930 --> 00:37:18.370 the kinds of words you might expect. 00:37:18.370 --> 00:37:24.900 So now let's go ahead and check for bigrams, for example, ngrams 2, holmes. 
00:37:24.900 --> 00:37:28.740 All right, again, sequences of two words now that appear multiple times-- 00:37:28.740 --> 00:37:32.840 of the, in the, it was, to the, it is, I have-- so on and so forth. 00:37:32.840 --> 00:37:34.590 These are the types of bigrams that happen 00:37:34.590 --> 00:37:37.590 to come up quite often inside this corpus, inside of the Sherlock 00:37:37.590 --> 00:37:38.400 Holmes stories. 00:37:38.400 --> 00:37:41.060 And it probably is true across other corpora as well, 00:37:41.060 --> 00:37:43.472 but we could only find out if we actually tested it. 00:37:43.472 --> 00:37:45.180 And now, just for good measure, let's try 00:37:45.180 --> 00:37:50.120 one more-- maybe try three, looking now for trigrams that happen to show up. 00:37:50.120 --> 00:37:54.570 And now we get it was the, one of the, I think that, out of the. 00:37:54.570 --> 00:37:56.850 These are sequences of three words now that 00:37:56.850 --> 00:38:00.900 happen to come up multiple times across this particular corpus. 00:38:00.900 --> 00:38:02.970 So what are the potential use cases here? 00:38:02.970 --> 00:38:04.440 Now we have some sort of data. 00:38:04.440 --> 00:38:07.890 We have data about how often particular sequences of words 00:38:07.890 --> 00:38:11.010 show up in this particular order, and using that, 00:38:11.010 --> 00:38:13.410 we can begin to do some sort of predictions. 00:38:13.410 --> 00:38:18.090 We might be able to say that, if you see the words it was, 00:38:18.090 --> 00:38:19.950 there's a reasonable chance the word that 00:38:19.950 --> 00:38:22.130 comes after it should be the word a. 00:38:22.130 --> 00:38:26.340 And if I see the words one of, it's reasonable to imagine 00:38:26.340 --> 00:38:29.190 that the next word might be the word the, for example, 00:38:29.190 --> 00:38:32.640 because we have this data about trigrams, sequences of three words 00:38:32.640 --> 00:38:33.900 and how often they come up. 
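The prediction idea can be sketched directly from trigram counts: given two words, tally every third word that has followed them. A minimal sketch, using Python's built-in Counter; the tiny made-up corpus and the function name next_word_distribution are mine.

```python
from collections import Counter

# A tiny made-up corpus, tokenized into words.
tokens = "it was the best of times it was the worst of times it was a dream".split()

# Count every trigram (sequence of three consecutive words).
trigram_counts = Counter(zip(tokens, tokens[1:], tokens[2:]))

def next_word_distribution(w1, w2):
    """How often each third word follows the bigram (w1, w2)."""
    follows = Counter()
    for (a, b, c), n in trigram_counts.items():
        if (a, b) == (w1, w2):
            follows[c] += n
    return follows

print(next_word_distribution("it", "was"))
```

In this toy corpus, "it was" is followed by "the" twice and by "a" once, so "the" would be the more likely prediction for the third word.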
00:38:33.900 --> 00:38:36.150 And now, based on two words, you might be 00:38:36.150 --> 00:38:40.110 able to predict what the third word happens to be. 00:38:40.110 --> 00:38:43.650 And one model we can use for that is a model we've actually seen before. 00:38:43.650 --> 00:38:45.280 It's the Markov model. 00:38:45.280 --> 00:38:47.100 Recall again that the Markov model really 00:38:47.100 --> 00:38:50.010 just refers to some sequence of events that happen one time 00:38:50.010 --> 00:38:54.150 step after another, where every unit has some ability 00:38:54.150 --> 00:38:57.150 to predict what the next unit is going to be-- 00:38:57.150 --> 00:39:00.330 or maybe the past two units predict what the next unit is going to be, 00:39:00.330 --> 00:39:03.270 or the past three predict what the next one is going to be. 00:39:03.270 --> 00:39:05.490 And we can use a Markov model and apply it 00:39:05.490 --> 00:39:08.100 to language for a very naive and simple approach 00:39:08.100 --> 00:39:11.340 at trying to generate natural language, at getting our AI 00:39:11.340 --> 00:39:14.340 to be able to speak English-like text. 00:39:14.340 --> 00:39:18.360 And the way it's going to work is we're going to say something like, come up 00:39:18.360 --> 00:39:20.280 with some probability distribution. 00:39:20.280 --> 00:39:23.070 Given these two words, what is the probability 00:39:23.070 --> 00:39:25.830 distribution over what the third word could possibly 00:39:25.830 --> 00:39:27.240 be based on all the data? 00:39:27.240 --> 00:39:30.660 If you see it was, what are the possible third words we might have? 00:39:30.660 --> 00:39:32.190 How often do they come up? 00:39:32.190 --> 00:39:35.070 And using that information, we can try and construct 00:39:35.070 --> 00:39:37.450 what we expect the third word to be. 
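The probability distribution described here, given two words a distribution over the third, can be estimated directly from trigram counts. A minimal sketch on a toy corpus (a real run would load a large corpus like the lecture's):

```python
# Estimate P(third word | previous two words) from trigram counts.
# The toy corpus below is illustrative; real use would load a full text.
from collections import Counter

def third_word_distribution(text, w1, w2):
    words = text.lower().split()
    trigrams = zip(words, words[1:], words[2:])
    # Count only the trigrams whose first two words match (w1, w2).
    counts = Counter(c for a, b, c in trigrams if (a, b) == (w1, w2))
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

corpus = "it was a dark night and it was a cold night and it was the end"
print(third_word_distribution(corpus, "it", "was"))  # e.g. {'a': 2/3, 'the': 1/3}
```

With real data, this is exactly the kind of estimate that says "after it was, the word a is quite likely."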
00:39:37.450 --> 00:39:39.270 And if you keep doing this, the effect is 00:39:39.270 --> 00:39:42.030 that our Markov model can effectively start 00:39:42.030 --> 00:39:45.330 to generate text-- can be able to generate text that 00:39:45.330 --> 00:39:48.330 was not in the original corpus, but that sounds 00:39:48.330 --> 00:39:49.770 kind of like the original corpus. 00:39:49.770 --> 00:39:54.130 It's using the same sorts of rules that the original corpus was using. 00:39:54.130 --> 00:39:56.370 So let's take a look at an example of that 00:39:56.370 --> 00:40:01.740 as well, where here now, I have another corpus that I have here, 00:40:01.740 --> 00:40:04.990 and it is the corpus of all of the works of William Shakespeare. 00:40:04.990 --> 00:40:09.900 So I've got a whole bunch of stories from Shakespeare, and all of them 00:40:09.900 --> 00:40:12.610 are just inside of this big text file. 00:40:12.610 --> 00:40:16.590 And so what I might like to do is look at what all of the n-grams are-- 00:40:16.590 --> 00:40:20.400 maybe look at all the trigrams inside of shakespeare.txt-- 00:40:20.400 --> 00:40:23.040 and figure out, given two words, can I predict 00:40:23.040 --> 00:40:24.548 what the third word is likely to be? 00:40:24.548 --> 00:40:26.340 And then just keep repeating this process-- 00:40:26.340 --> 00:40:27.240 I have two words-- 00:40:27.240 --> 00:40:29.400 predict the third word; then, from the second and third word, 00:40:29.400 --> 00:40:31.900 predict the fourth word; and from the third and fourth word, 00:40:31.900 --> 00:40:36.090 predict the fifth word, ultimately generating random sentences that 00:40:36.090 --> 00:40:39.420 sound like Shakespeare, that are using similar patterns of words 00:40:39.420 --> 00:40:43.140 that Shakespeare used, but that never actually showed up in Shakespeare 00:40:43.140 --> 00:40:44.770 itself. 
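The repeat-and-slide loop just described can be hand-rolled in a few lines. This is a minimal sketch of the idea only; the markovify library used in the demo adds sentence-boundary handling and other refinements on top of it, and the corpus and starting words here are illustrative.

```python
# Hand-rolled sketch of the generation loop: sample a third word from
# trigram data, slide the two-word window forward, and repeat.
# Output varies from run to run, just like the lecture's demo.
import random
from collections import defaultdict

def build_model(text):
    words = text.split()
    model = defaultdict(list)
    for a, b, c in zip(words, words[1:], words[2:]):
        model[(a, b)].append(c)  # candidate third words for each word pair
    return model

def generate(model, w1, w2, length=8):
    output = [w1, w2]
    for _ in range(length - 2):
        candidates = model.get((w1, w2))
        if not candidates:
            break  # no trigram starts with this pair; stop early
        w1, w2 = w2, random.choice(candidates)
        output.append(w2)
    return " ".join(output)

model = build_model("it was a dark night and it was a cold night and it was a dark day")
print(generate(model, "it", "was"))
```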
00:40:44.770 --> 00:40:47.640 And so to do so, I'll show you generator.py, 00:40:47.640 --> 00:40:50.910 which, again, is just going to read data from a particular file. 00:40:50.910 --> 00:40:54.210 And I'm using a Python library called markovify, which is just 00:40:54.210 --> 00:40:56.050 going to do this process for me. 00:40:56.050 --> 00:40:59.370 So there are libraries out there that can just train on a bunch of text 00:40:59.370 --> 00:41:02.978 and come up with a Markov model based on that text. 00:41:02.978 --> 00:41:04.770 And I'm going to go ahead and just generate 00:41:04.770 --> 00:41:07.920 five randomly generated sentences. 00:41:07.920 --> 00:41:11.850 So we'll go ahead and go into markov. 00:41:11.850 --> 00:41:14.750 I'll run the generator on shakespeare.txt. 00:41:14.750 --> 00:41:18.290 What we'll see is it's going to load that data, and then here's what we get. 00:41:18.290 --> 00:41:21.320 We get five different sentences, and these 00:41:21.320 --> 00:41:24.890 are sentences that never showed up in any Shakespeare play, 00:41:24.890 --> 00:41:27.680 but that are designed to sound like Shakespeare, 00:41:27.680 --> 00:41:30.320 that are designed to just take two words and predict, 00:41:30.320 --> 00:41:34.100 given those two words, what would Shakespeare have been likely to choose 00:41:34.100 --> 00:41:35.517 as the third word that follows it. 00:41:35.517 --> 00:41:38.100 And you know, these sentences probably don't have any meaning. 00:41:38.100 --> 00:41:41.600 It's not like the AI is trying to express any sort of underlying meaning 00:41:41.600 --> 00:41:42.110 here. 00:41:42.110 --> 00:41:44.870 It's just trying to understand, based on the sequence 00:41:44.870 --> 00:41:50.190 of words, what is likely to come after it as a next word, for example. 00:41:50.190 --> 00:41:53.593 And these are the types of sentences that it's able to come up with, 00:41:53.593 --> 00:41:54.260 just generating. 
00:41:54.260 --> 00:41:58.100 And if you ran this multiple times, you would end up getting different results. 00:41:58.100 --> 00:42:01.580 I could run this again and get an entirely different set 00:42:01.580 --> 00:42:04.100 of five different sentences that also are 00:42:04.100 --> 00:42:08.810 supposed to sound kind of like the way that Shakespeare's sentences sounded 00:42:08.810 --> 00:42:10.340 as well. 00:42:10.340 --> 00:42:12.430 And so that then was a look at how it is we 00:42:12.430 --> 00:42:16.580 can use Markov models to be able to naively attempt generating language. 00:42:16.580 --> 00:42:18.580 The language doesn't mean a whole lot right now. 00:42:18.580 --> 00:42:21.430 You wouldn't want to use the system in its current form 00:42:21.430 --> 00:42:23.200 to do something like machine translation, 00:42:23.200 --> 00:42:26.020 because it wouldn't be able to encapsulate any meaning, 00:42:26.020 --> 00:42:30.240 but we're starting to see now that our AI is getting a little bit better 00:42:30.240 --> 00:42:31.990 at trying to speak our language, at trying 00:42:31.990 --> 00:42:36.500 to be able to process natural language in some sort of meaningful way. 00:42:36.500 --> 00:42:38.830 So we'll now take a look at a couple of other tasks 00:42:38.830 --> 00:42:41.140 that we might want our AI to be able to perform. 00:42:41.140 --> 00:42:44.920 And one such task is text categorization, which really is just 00:42:44.920 --> 00:42:46.138 a classification problem. 00:42:46.138 --> 00:42:48.430 And we've talked about classification problems already, 00:42:48.430 --> 00:42:51.670 these problems where we would like to take some object 00:42:51.670 --> 00:42:54.540 and categorize it into a number of different classes. 
00:42:54.540 --> 00:42:58.750 And so the way this comes up in text is anytime you have some sample of text 00:42:58.750 --> 00:43:02.080 and you want to put it inside of a category, where I want to say something 00:43:02.080 --> 00:43:06.760 like, given an email, does it belong in the inbox or does it belong in spam? 00:43:06.760 --> 00:43:08.890 Which of these two categories does it belong in? 00:43:08.890 --> 00:43:12.250 And you do that by looking at the text and being 00:43:12.250 --> 00:43:16.660 able to do some sort of analysis on that text to be able to draw conclusions, 00:43:16.660 --> 00:43:20.200 to be able to say that, given the words that show up in the email, 00:43:20.200 --> 00:43:22.510 I think this probably belongs in the inbox, 00:43:22.510 --> 00:43:25.825 or I think it probably belongs in spam instead. 00:43:25.825 --> 00:43:27.700 And you might imagine doing this for a number 00:43:27.700 --> 00:43:30.910 of different types of classification problems of this sort. 00:43:30.910 --> 00:43:34.360 So you might imagine that another common example of this type of idea 00:43:34.360 --> 00:43:37.690 is something like sentiment analysis, where I want to analyze, 00:43:37.690 --> 00:43:41.880 given a sample of text, does it have a positive sentiment 00:43:41.880 --> 00:43:43.780 or does it have a negative sentiment? 
00:43:43.780 --> 00:43:47.082 And this might come up in the case of product reviews on a website, 00:43:47.082 --> 00:43:50.290 for example, or feedback on a website, where you have a whole bunch of data-- 00:43:50.290 --> 00:43:53.230 samples of text that are provided by users of a website-- 00:43:53.230 --> 00:43:57.010 and you want to be able to quickly analyze, are these reviews positive, 00:43:57.010 --> 00:43:59.710 are the reviews negative, what is it that people 00:43:59.710 --> 00:44:03.460 are saying, just to get a sense for what it is that people are saying, 00:44:03.460 --> 00:44:08.840 to be able to categorize text into one of these two different categories. 00:44:08.840 --> 00:44:10.630 So how might we approach this problem? 00:44:10.630 --> 00:44:13.010 Well, let's take a look at some sample product reviews. 00:44:13.010 --> 00:44:16.000 Here are some sample product reviews that we might come up with. 00:44:16.000 --> 00:44:16.930 My grandson loved it. 00:44:16.930 --> 00:44:17.890 So much fun. 00:44:17.890 --> 00:44:20.290 Product broke after a few days. 00:44:20.290 --> 00:44:22.368 One of the best games I've played in a long time. 00:44:22.368 --> 00:44:23.410 Kind of cheap and flimsy. 00:44:23.410 --> 00:44:24.400 Not worth it. 00:44:24.400 --> 00:44:28.360 Different product reviews that you might imagine seeing on Amazon, or eBay, 00:44:28.360 --> 00:44:31.690 or some other website where people are selling products, for instance. 00:44:31.690 --> 00:44:34.480 And we humans can pretty easily categorize these 00:44:34.480 --> 00:44:37.060 into positive sentiment or negative sentiment. 00:44:37.060 --> 00:44:39.790 We'd probably say that the first and the third one, those 00:44:39.790 --> 00:44:41.620 are positive sentiment messages. 00:44:41.620 --> 00:44:44.380 The second one and the fourth one, those are probably 00:44:44.380 --> 00:44:46.060 negative sentiment messages. 00:44:46.060 --> 00:44:48.680 But how could a computer do the same thing? 
00:44:48.680 --> 00:44:53.470 How could it try and take these reviews and assess, are they positive 00:44:53.470 --> 00:44:55.420 or are they negative? 00:44:55.420 --> 00:44:57.940 Well, ultimately, it depends upon the words 00:44:57.940 --> 00:45:02.530 that happen to be in this particular-- these particular reviews-- inside 00:45:02.530 --> 00:45:03.850 of these particular sentences. 00:45:03.850 --> 00:45:06.040 For now we're going to ignore the structure 00:45:06.040 --> 00:45:08.120 and how the words are related to each other, 00:45:08.120 --> 00:45:11.230 and we're just going to focus on what the words actually are. 00:45:11.230 --> 00:45:14.710 So there are probably some key words here, words like loved, 00:45:14.710 --> 00:45:16.330 and fun, and best. 00:45:16.330 --> 00:45:20.770 Those probably show up in more positive reviews, whereas words 00:45:20.770 --> 00:45:23.137 like broke, and cheap, and flimsy-- 00:45:23.137 --> 00:45:24.970 well, those are words that probably are more 00:45:24.970 --> 00:45:29.930 likely to come up inside of negative reviews, instead of positive reviews. 00:45:29.930 --> 00:45:33.550 So one way to approach this sort of text analysis idea 00:45:33.550 --> 00:45:37.900 is to say, let's, for now, ignore the structures of these sentences-- to say, 00:45:37.900 --> 00:45:40.870 we're not going to care about how it is the words relate to each other. 00:45:40.870 --> 00:45:43.540 We're not going to try and parse these sentences to construct 00:45:43.540 --> 00:45:45.850 the grammatical structure like we saw a moment ago. 
00:45:45.850 --> 00:45:49.060 But we can probably just rely on the words that were actually 00:45:49.060 --> 00:45:52.000 used-- rely on the fact that the positive reviews are 00:45:52.000 --> 00:45:54.820 more likely to have words like best, and loved, and fun, 00:45:54.820 --> 00:45:58.360 and that the negative reviews are more likely to have the negative words 00:45:58.360 --> 00:46:00.017 that we've highlighted there as well. 00:46:00.017 --> 00:46:03.100 And this sort of model-- this approach to trying to think about language-- 00:46:03.100 --> 00:46:05.610 is generally known as the bag of words model, 00:46:05.610 --> 00:46:09.023 where we're going to model a sample of text not by caring about its structure, 00:46:09.023 --> 00:46:12.970 but just by caring about the unordered collection of words that 00:46:12.970 --> 00:46:16.060 show up inside of a sample-- that all we care about 00:46:16.060 --> 00:46:18.040 is what words are in the text. 00:46:18.040 --> 00:46:20.552 And we don't care about what the order of those words is. 00:46:20.552 --> 00:46:22.510 We don't care about the structure of the words. 00:46:22.510 --> 00:46:25.210 We don't care what noun goes with what adjective 00:46:25.210 --> 00:46:26.870 or how things agree with each other. 00:46:26.870 --> 00:46:28.830 We just care about the words. 00:46:28.830 --> 00:46:31.120 And it turns out this approach tends to work 00:46:31.120 --> 00:46:34.810 pretty well for doing classifications like positive sentiment 00:46:34.810 --> 00:46:36.142 or negative sentiment. 00:46:36.142 --> 00:46:38.350 And you could imagine doing this in a number of ways. 00:46:38.350 --> 00:46:41.740 We've talked about different approaches to trying to solve classification style 00:46:41.740 --> 00:46:43.870 problems, but when it comes to natural language, 00:46:43.870 --> 00:46:48.110 one of the most popular approaches is the naive Bayes approach. 
00:46:48.110 --> 00:46:52.530 And this is one approach to trying to analyze the probability that something 00:46:52.530 --> 00:46:54.940 is positive sentiment or negative sentiment, 00:46:54.940 --> 00:46:58.515 or just trying to categorize some text into possible categories. 00:46:58.515 --> 00:47:01.390 And it doesn't just work for text-- it works for other types of ideas 00:47:01.390 --> 00:47:03.550 as well-- but it is quite popular in the world 00:47:03.550 --> 00:47:05.980 of analyzing text and natural language. 00:47:05.980 --> 00:47:09.450 And the naive Bayes approach is based on Bayes' rule, which 00:47:09.450 --> 00:47:11.950 you might recall back from when we talked about probability, 00:47:11.950 --> 00:47:14.020 that Bayes' rule looks like this-- 00:47:14.020 --> 00:47:17.690 that the probability of some event b, given a, 00:47:17.690 --> 00:47:20.320 can be expressed using this expression over here. 00:47:20.320 --> 00:47:25.150 Probability of b given a is the probability of a given b multiplied 00:47:25.150 --> 00:47:28.590 by the probability of b divided by the probability of a. 00:47:28.590 --> 00:47:32.290 And we saw that this came about as a result of just the definition 00:47:32.290 --> 00:47:35.740 of conditional probability and looking at what it means for two events 00:47:35.740 --> 00:47:37.010 to happen together. 00:47:37.010 --> 00:47:40.038 This was our formulation then of Bayes' rule, which 00:47:40.038 --> 00:47:41.330 turned out to be quite helpful. 00:47:41.330 --> 00:47:43.990 We were able to predict one event in terms of another 00:47:43.990 --> 00:47:49.218 by flipping the order of those events inside of this probability calculation. 
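As a quick numeric sanity check of Bayes' rule as stated, here is a sketch with made-up probabilities (the numbers are not from the lecture); the denominator P(a) is expanded with the law of total probability so the example is self-contained:

```python
# Sanity check of Bayes' rule: P(b|a) = P(a|b) * P(b) / P(a).
# All probabilities below are made up purely for illustration.
p_b = 0.3
p_a_given_b = 0.8
p_a_given_not_b = 0.2

# Law of total probability: P(a) = P(a|b)P(b) + P(a|not b)P(not b)
p_a = p_a_given_b * p_b + p_a_given_not_b * (1 - p_b)  # 0.24 + 0.14 = 0.38

p_b_given_a = p_a_given_b * p_b / p_a
print(round(p_b_given_a, 4))  # → 0.6316
```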
00:47:49.218 --> 00:47:51.760 And it turns out this approach is going to be quite helpful-- 00:47:51.760 --> 00:47:53.110 and we'll see why in a moment-- 00:47:53.110 --> 00:47:55.330 for being able to do this sort of sentiment analysis, 00:47:55.330 --> 00:47:58.750 because I want to say, you know, what is the probability 00:47:58.750 --> 00:48:02.350 that a message is positive, or what is the probability 00:48:02.350 --> 00:48:03.727 that the message is negative? 00:48:03.727 --> 00:48:06.310 And I'll go ahead and simplify this just using the emojis just 00:48:06.310 --> 00:48:10.450 for simplicity-- probability of positive, probability of negative. 00:48:10.450 --> 00:48:12.340 And that is what I would like to calculate, 00:48:12.340 --> 00:48:15.310 but I'd like to calculate that given some information-- 00:48:15.310 --> 00:48:18.940 given information like here is a sample of text-- 00:48:18.940 --> 00:48:20.440 my grandson loved it. 00:48:20.440 --> 00:48:24.280 And I would like to know not just what is the probability that any message is 00:48:24.280 --> 00:48:27.880 positive, but what is the probability that the message is positive, 00:48:27.880 --> 00:48:32.890 given my grandson loved it as the text of the sample? 00:48:32.890 --> 00:48:36.340 So given this information that inside the sample are the words my grandson 00:48:36.340 --> 00:48:41.860 loved it, what is the probability then that this is a positive message? 00:48:41.860 --> 00:48:44.650 Well, according to the bag of words model, what we're going to do 00:48:44.650 --> 00:48:46.930 is really ignore the ordering of the words-- 00:48:46.930 --> 00:48:50.420 not treat this as a single sentence that has some structure to it, 00:48:50.420 --> 00:48:52.750 but just treat it as a whole bunch of different words. 
00:48:52.750 --> 00:48:55.180 We're going to say something like, what is the probability 00:48:55.180 --> 00:48:58.420 that this is a positive message, given that the word my 00:48:58.420 --> 00:49:01.810 was in the message, given that the word grandson was in the message, 00:49:01.810 --> 00:49:05.520 given that the word loved was in the message, and given that the word it 00:49:05.520 --> 00:49:06.380 was in the message? 00:49:06.380 --> 00:49:07.720 The bag of words model here-- 00:49:07.720 --> 00:49:11.380 we're treating the entire sample as just a whole bunch 00:49:11.380 --> 00:49:12.740 of different words. 00:49:12.740 --> 00:49:15.910 And so this then is what I'd like to calculate, this probability-- 00:49:15.910 --> 00:49:18.610 given all those words, what is the probability 00:49:18.610 --> 00:49:20.920 that this is a positive message? 00:49:20.920 --> 00:49:23.530 And this is where we can now apply Bayes' rule. 00:49:23.530 --> 00:49:28.315 This is really the probability of some b, given some a. 00:49:28.315 --> 00:49:30.400 And that now is what I'd like to calculate. 00:49:30.400 --> 00:49:34.723 So according to Bayes' rule, this whole expression is equal to-- 00:49:34.723 --> 00:49:35.890 well, it's the probability-- 00:49:35.890 --> 00:49:37.420 I switched the order of them-- 00:49:37.420 --> 00:49:40.270 it's the probability of all of these words, 00:49:40.270 --> 00:49:42.910 given that it's a positive message, multiplied 00:49:42.910 --> 00:49:46.930 by the probability that it is a positive message divided 00:49:46.930 --> 00:49:49.575 by the probability of all of those words. 00:49:49.575 --> 00:49:51.700 So this then is just an application of Bayes' rule. 00:49:51.700 --> 00:49:56.680 We've already seen where I want to express the probability of positive, 00:49:56.680 --> 00:50:02.440 given the words, as related to somehow the probability of the words, 00:50:02.440 --> 00:50:04.718 given that it's a positive message. 
00:50:04.718 --> 00:50:06.760 And it turns out that-- as you might recall, back 00:50:06.760 --> 00:50:09.965 when we talked about probability, that this denominator is 00:50:09.965 --> 00:50:10.840 going to be the same. 00:50:10.840 --> 00:50:13.840 Regardless of whether we're looking at positive or negative messages, 00:50:13.840 --> 00:50:15.850 the probability of these words doesn't change, 00:50:15.850 --> 00:50:18.805 because we don't have a positive or negative down below. 00:50:18.805 --> 00:50:20.680 So we can just say that, rather than just say 00:50:20.680 --> 00:50:23.980 that this expression up here is equal to this expression down below, 00:50:23.980 --> 00:50:27.130 it's really just proportional to just the numerator. 00:50:27.130 --> 00:50:29.530 We can ignore the denominator for now. 00:50:29.530 --> 00:50:32.770 Using the denominator would get us an exact probability. 00:50:32.770 --> 00:50:34.780 But it turns out that what we'll really just do 00:50:34.780 --> 00:50:38.780 is figure out what the probability is proportional to, and at the end, 00:50:38.780 --> 00:50:41.500 we'll have to normalize the probability distribution-- make 00:50:41.500 --> 00:50:46.270 sure the probability distribution ultimately sums up to the number 1. 00:50:46.270 --> 00:50:49.730 So now I've been able to formulate this probability-- 00:50:49.730 --> 00:50:51.520 which is what I want to care about-- 00:50:51.520 --> 00:50:56.530 as proportional to multiplying these two things together-- probability of words, 00:50:56.530 --> 00:51:01.580 given positive message, multiplied by the probability of positive message. 
00:51:01.580 --> 00:51:04.060 But again, if you think back to our probability rules, 00:51:04.060 --> 00:51:09.070 we can calculate this really as just a joint probability of all of these 00:51:09.070 --> 00:51:14.140 things happening-- that the probability of positive message multiplied 00:51:14.140 --> 00:51:17.470 by the probability of these words, given the positive message-- 00:51:17.470 --> 00:51:20.890 well, that's just the joint probability of all of these things. 00:51:20.890 --> 00:51:23.530 This is the same thing as the probability 00:51:23.530 --> 00:51:27.670 that it's a positive message, and my is in the sample, 00:51:27.670 --> 00:51:30.820 and grandson is in the sample, and loved is in the sample, 00:51:30.820 --> 00:51:33.160 and it is in the sample. 00:51:33.160 --> 00:51:36.640 So using that rule for the definition of joint probability, 00:51:36.640 --> 00:51:40.630 I've been able to say that this entire expression is now 00:51:40.630 --> 00:51:43.570 proportional to this sequence-- 00:51:43.570 --> 00:51:47.530 this joint probability of these words and this positive that's 00:51:47.530 --> 00:51:49.670 in there as well. 00:51:49.670 --> 00:51:51.790 And so now the interesting question is just how 00:51:51.790 --> 00:51:54.050 to calculate that joint probability. 00:51:54.050 --> 00:51:55.870 How do I figure out the probability that, 00:51:55.870 --> 00:51:59.980 given some arbitrary message, that it is positive, and the word my is in there, 00:51:59.980 --> 00:52:03.040 and the word grandson is in there, and the word loved is in there, 00:52:03.040 --> 00:52:04.740 and the word it is in there? 00:52:04.740 --> 00:52:07.990 Well, you'll recall that we can calculate a joint probability 00:52:07.990 --> 00:52:12.480 by multiplying together all of these conditional probabilities. 
00:52:12.480 --> 00:52:16.350 If I want to know the probability of a, and b, and c, 00:52:16.350 --> 00:52:19.530 I can calculate that as the probability of a times 00:52:19.530 --> 00:52:24.300 the probability of b, given a, times the probability of c, given a and b. 00:52:24.300 --> 00:52:27.570 I can just multiply these conditional probabilities together 00:52:27.570 --> 00:52:31.290 in order to get the overall joint probability that I care about. 00:52:31.290 --> 00:52:32.790 And we could do the same thing here. 00:52:32.790 --> 00:52:35.340 I could say, let's multiply the probability 00:52:35.340 --> 00:52:39.180 of positive by the probability of the word my showing up in the message, 00:52:39.180 --> 00:52:42.810 given that it's positive, multiplied by the probability of grandson 00:52:42.810 --> 00:52:45.550 showing up in the message, given that the word my is in there 00:52:45.550 --> 00:52:48.930 and that it's positive, multiplied by the probability of loved, 00:52:48.930 --> 00:52:51.930 given these three things, multiplied by the probability of it, 00:52:51.930 --> 00:52:53.500 given these four things. 00:52:53.500 --> 00:52:56.882 And that's going to end up being a fairly complex calculation to make, 00:52:56.882 --> 00:52:58.590 one that we probably aren't going to have 00:52:58.590 --> 00:53:00.210 a good way of knowing the answer to. 00:53:00.210 --> 00:53:04.140 What is the probability that grandson is in the message, given 00:53:04.140 --> 00:53:08.010 that it is positive and the word my is in the message? 00:53:08.010 --> 00:53:12.040 That's not something we're really going to have a readily easy answer to, 00:53:12.040 --> 00:53:15.270 and so this is where the naive part of naive Bayes comes about. 00:53:15.270 --> 00:53:16.950 We're going to simplify this notion. 
00:53:16.950 --> 00:53:20.340 Rather than compute exactly what that probability distribution is, 00:53:20.340 --> 00:53:23.880 we're going to assume that these words are 00:53:23.880 --> 00:53:26.710 going to be effectively independent of each other, 00:53:26.710 --> 00:53:28.980 if we know that it's already a positive message. 00:53:28.980 --> 00:53:32.670 If it's a positive message, it doesn't change the probability 00:53:32.670 --> 00:53:34.620 that the word grandson is in the message, 00:53:34.620 --> 00:53:37.620 if I know that the word loved is in the message, for example. 00:53:37.620 --> 00:53:39.750 And that might not necessarily be true in practice. 00:53:39.750 --> 00:53:41.610 In the real world, it might not be the case 00:53:41.610 --> 00:53:43.650 that these words are actually independent, 00:53:43.650 --> 00:53:45.960 but we're going to assume it to simplify our model. 00:53:45.960 --> 00:53:48.030 And it turns out that simplification still 00:53:48.030 --> 00:53:51.590 lets us get pretty good results out of it as well. 00:53:51.590 --> 00:53:55.320 And what we're going to assume is that the probability that all of these words 00:53:55.320 --> 00:53:58.690 show up depends only on whether it's positive or negative. 00:53:58.690 --> 00:54:01.170 I can still say that loved is more likely to come up 00:54:01.170 --> 00:54:04.510 in a positive message than a negative message, which is probably true, 00:54:04.510 --> 00:54:08.010 but we're also going to say that it's not going to change whether or not 00:54:08.010 --> 00:54:12.020 loved is more likely or less likely to come up if I know that the word my is 00:54:12.020 --> 00:54:13.643 in the message, for example. 00:54:13.643 --> 00:54:16.060 And so those are the assumptions that we're going to make. 
00:54:16.060 --> 00:54:20.310 So while the top expression is proportional to this bottom expression, 00:54:20.310 --> 00:54:24.750 we're going to say it's naively proportional to this expression, 00:54:24.750 --> 00:54:27.480 probability of being a positive message. 00:54:27.480 --> 00:54:30.300 And then, for each of the words that show up in the sample, 00:54:30.300 --> 00:54:33.270 I'm going to multiply what's the probability that my 00:54:33.270 --> 00:54:35.370 is in the message, given that it's positive, 00:54:35.370 --> 00:54:37.980 times the probability of grandson being in the message, given 00:54:37.980 --> 00:54:40.050 that it's positive-- and then so on and so forth 00:54:40.050 --> 00:54:44.040 for the other words that happen to be inside of the sample. 00:54:44.040 --> 00:54:47.580 And it turns out that these are numbers that we can calculate. 00:54:47.580 --> 00:54:50.640 The reason we've done all of this math is to get to this point, 00:54:50.640 --> 00:54:54.870 to be able to calculate this probability distribution that we care about, 00:54:54.870 --> 00:54:58.410 given these terms that we can actually calculate. 00:54:58.410 --> 00:55:02.250 And we can calculate them, given some data available to us. 00:55:02.250 --> 00:55:04.530 And this is what a lot of natural language processing 00:55:04.530 --> 00:55:05.590 is about these days. 00:55:05.590 --> 00:55:07.330 It's about analyzing data. 00:55:07.330 --> 00:55:10.440 If I give you a whole bunch of data with a whole bunch of reviews, 00:55:10.440 --> 00:55:13.380 and I've labeled them as positive or negative, 00:55:13.380 --> 00:55:17.250 then you can begin to calculate these particular terms. 00:55:17.250 --> 00:55:20.490 I can calculate the probability that a message is positive just 00:55:20.490 --> 00:55:22.710 by looking at my data and saying, how many 00:55:22.710 --> 00:55:26.250 positive samples were there, and divide that by the number of total samples. 
00:55:26.250 --> 00:55:29.477 That is my probability that a message is positive. 00:55:29.477 --> 00:55:32.310 What is the probability that the word loved is in the message, given 00:55:32.310 --> 00:55:33.330 that it's positive? 00:55:33.330 --> 00:55:35.490 Well, I can calculate that based on my data too. 00:55:35.490 --> 00:55:38.970 Let me just look at how many positive samples have the word loved in it 00:55:38.970 --> 00:55:41.730 and divide that by my total number of positive samples. 00:55:41.730 --> 00:55:44.430 And that will give me an approximation for, 00:55:44.430 --> 00:55:47.950 what is the probability that loved is going to show up inside of the review, 00:55:47.950 --> 00:55:51.570 given that we know that the review is positive. 00:55:51.570 --> 00:55:55.160 And so this then allows us to be able to calculate these probabilities. 00:55:55.160 --> 00:55:56.910 So let's now actually do this calculation. 00:55:56.910 --> 00:56:00.390 Let's calculate for the sentence, my grandson loved it. 00:56:00.390 --> 00:56:01.890 Is it a positive or negative review? 00:56:01.890 --> 00:56:04.030 How could we figure out those probabilities? 00:56:04.030 --> 00:56:07.110 Well, again, this up here is the expression we're trying to calculate. 00:56:07.110 --> 00:56:10.350 And I'll give you the data that is available to us. 00:56:10.350 --> 00:56:13.080 And the way to interpret this data in this case 00:56:13.080 --> 00:56:19.127 is that, of all of the messages, 49% of them were positive and 51% of them 00:56:19.127 --> 00:56:19.710 were negative. 00:56:19.710 --> 00:56:22.350 Maybe online reviews tend to be a little bit more negative than they 00:56:22.350 --> 00:56:24.683 are positive-- or at least based on this particular data 00:56:24.683 --> 00:56:26.620 sample, that's what I have. 
00:56:26.620 --> 00:56:31.800 And then I have distributions for each of the various different words-- 00:56:31.800 --> 00:56:34.290 that, given that it's a positive message, 00:56:34.290 --> 00:56:38.040 how many positive messages had the word my in them? 00:56:38.040 --> 00:56:39.335 It's about 30%. 00:56:39.335 --> 00:56:42.210 And for negative messages, how many of those had the word my in them? 00:56:42.210 --> 00:56:47.910 About 20%-- so it seems like the word my comes up more often in positive 00:56:47.910 --> 00:56:52.140 messages-- at least slightly more often based on this analysis here. 00:56:52.140 --> 00:56:54.270 Grandson, for example-- maybe that showed up 00:56:54.270 --> 00:56:58.680 in 1% of all positive messages and 2% of all negative messages 00:56:58.680 --> 00:57:00.330 had the word grandson in it. 00:57:00.330 --> 00:57:05.010 The word loved showed up in 32% of all positive messages, 8% 00:57:05.010 --> 00:57:07.090 of all negative messages, for example. 00:57:07.090 --> 00:57:10.230 And then the word it showed up in 30% of positive messages, 00:57:10.230 --> 00:57:15.130 40% of negative messages-- again, just arbitrary data here just for example, 00:57:15.130 --> 00:57:19.560 but now we have data with which we can begin to calculate this expression. 00:57:19.560 --> 00:57:22.950 So how do I calculate multiplying all these values together? 00:57:22.950 --> 00:57:25.650 Well, it's just going to be multiplying probability 00:57:25.650 --> 00:57:29.400 that it's positive times the probability of my, given positive, 00:57:29.400 --> 00:57:32.190 times the probability of grandson, given positive-- 00:57:32.190 --> 00:57:34.290 so on and so forth for each of the other words. 00:57:34.290 --> 00:57:37.780 And if you do that multiplication and multiply all of those values together, 00:57:37.780 --> 00:57:42.000 you get this, 0.00014112. 
00:57:42.000 --> 00:57:44.760 By itself, this is not a meaningful number, 00:57:44.760 --> 00:57:48.810 but it's going to be meaningful if you compare this expression-- 00:57:48.810 --> 00:57:53.250 the probability that it's positive times the probability of all of the words, 00:57:53.250 --> 00:57:55.680 given that I know that the message is positive, 00:57:55.680 --> 00:57:59.350 and compare it to the same thing, but for negative sentiment messages 00:57:59.350 --> 00:57:59.850 instead. 00:57:59.850 --> 00:58:03.090 I want to know the probability that it's a negative message 00:58:03.090 --> 00:58:05.430 times the probability of all of these words, 00:58:05.430 --> 00:58:07.900 given that it's a negative message. 00:58:07.900 --> 00:58:09.360 And so how can I do that? 00:58:09.360 --> 00:58:13.280 Well, to do that, you just multiply the probability of negative times 00:58:13.280 --> 00:58:15.500 all of these conditional probabilities. 00:58:15.500 --> 00:58:19.520 And if I take those five values, multiply all of them together, 00:58:19.520 --> 00:58:26.730 then what I get is this value for negative: 0.00006528-- 00:58:26.730 --> 00:58:30.080 again, in isolation, not a particularly meaningful number. 00:58:30.080 --> 00:58:35.300 What is meaningful is treating these two values as a probability distribution 00:58:35.300 --> 00:58:39.260 and normalizing them, making it so that both of these values sum up to 1 00:58:39.260 --> 00:58:41.450 the way a probability distribution should. 00:58:41.450 --> 00:58:45.740 And we do so by adding these two up and then dividing each of these values 00:58:45.740 --> 00:58:48.120 by their total in order to be able to normalize them. 00:58:48.120 --> 00:58:51.170 And when we do that, when we normalize this probability distribution, 00:58:51.170 --> 00:58:58.400 you end up getting something like this: positive 0.6837, negative 0.3163.
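That arithmetic can be checked in a few lines of Python. This is a minimal sketch of the calculation just described, with the example probabilities from the lecture's made-up data hard-coded; it is not the NLTK implementation used later.

```python
# Worked example: P(positive | "my grandson loved it") via naive Bayes.
# The probabilities below are the example values from the lecture's data.
p_positive = 0.49
p_negative = 0.51

# P(word | sentiment) for each word in "my grandson loved it"
word_given_positive = {"my": 0.30, "grandson": 0.01, "loved": 0.32, "it": 0.30}
word_given_negative = {"my": 0.20, "grandson": 0.02, "loved": 0.08, "it": 0.40}

# Multiply the prior by each conditional probability (the "naive" step).
score_positive = p_positive
for p in word_given_positive.values():
    score_positive *= p

score_negative = p_negative
for p in word_given_negative.values():
    score_negative *= p

# Normalize so the two scores sum to 1, turning them into a distribution.
total = score_positive + score_negative
prob_positive = score_positive / total
prob_negative = score_negative / total

print(round(score_positive, 8))  # 0.00014112
print(round(prob_positive, 4))   # 0.6837
```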
00:58:58.400 --> 00:59:02.990 It seems like we've been able to conclude that we are about 68% 00:59:02.990 --> 00:59:06.500 confident-- we think there's a probability of 0.68 00:59:06.500 --> 00:59:09.470 that this message is a positive message-- my grandson loved it. 00:59:09.470 --> 00:59:11.540 And why are we 68% confident? 00:59:11.540 --> 00:59:15.350 Well, it seems like we're more confident than not because the word 00:59:15.350 --> 00:59:18.350 loved showed up in 32% of positive messages, 00:59:18.350 --> 00:59:20.420 but only 8% of negative messages. 00:59:20.420 --> 00:59:22.410 So that was a pretty strong indicator. 00:59:22.410 --> 00:59:25.070 And for the others, while it's true that the word 00:59:25.070 --> 00:59:27.260 it showed up more often in negative messages, 00:59:27.260 --> 00:59:30.170 it wasn't enough to offset the fact that loved shows up 00:59:30.170 --> 00:59:34.560 far more often in positive messages than negative messages. 00:59:34.560 --> 00:59:37.970 And so this type of analysis is how we can apply naive Bayes. 00:59:37.970 --> 00:59:39.650 We've just done this calculation. 00:59:39.650 --> 00:59:42.933 And we end up getting not just a categorization of positive or negative, 00:59:42.933 --> 00:59:44.600 but also some sort of confidence level. 00:59:44.600 --> 00:59:47.660 What do I think the probability is that it's positive? 00:59:47.660 --> 00:59:52.560 And I can say I think it's positive with this particular probability. 00:59:52.560 --> 00:59:55.820 And so naive Bayes can be quite powerful at trying to achieve this. 00:59:55.820 --> 00:59:58.250 Using just this bag of words model, where all I'm doing 00:59:58.250 --> 01:00:00.950 is looking at what words show up in the sample, 01:00:00.950 --> 01:00:03.870 I'm able to draw these sorts of conclusions.
01:00:03.870 --> 01:00:07.280 Now, one potential drawback-- something that you'll notice pretty quickly 01:00:07.280 --> 01:00:10.190 if you start applying this rule exactly as is-- 01:00:10.190 --> 01:00:15.500 is what happens if 0's are inside this data somewhere. 01:00:15.500 --> 01:00:20.410 Let's imagine, for example, this same sentence-- my grandson loved it-- 01:00:20.410 --> 01:00:24.980 but let's instead imagine that this value here, instead of being 0.01, 01:00:24.980 --> 01:00:28.970 was 0, meaning inside of our data set, it has never 01:00:28.970 --> 01:00:33.620 before happened that in a positive message the word grandson showed up. 01:00:33.620 --> 01:00:35.450 And that's certainly possible. 01:00:35.450 --> 01:00:37.817 If I have a pretty small data set, it's probably likely 01:00:37.817 --> 01:00:40.400 that not all the messages are going to have the word grandson. 01:00:40.400 --> 01:00:43.400 Maybe it is the case that no positive messages have ever 01:00:43.400 --> 01:00:46.370 had the word grandson in them, at least in my data set. 01:00:46.370 --> 01:00:49.640 But if it is the case that 2% of the negative messages 01:00:49.640 --> 01:00:52.340 have still had the word grandson in them, then we 01:00:52.340 --> 01:00:54.330 run into an interesting challenge. 01:00:54.330 --> 01:00:57.730 And the challenge is this-- when I multiply all of the positive numbers 01:00:57.730 --> 01:01:00.980 together and multiply all the negative numbers together to calculate these two 01:01:00.980 --> 01:01:06.800 probabilities, what I end up getting is a positive value of 0.000. 01:01:06.800 --> 01:01:10.010 I get pure 0's, because when I multiply all of these numbers 01:01:10.010 --> 01:01:12.470 together-- when I multiply something by 0, 01:01:12.470 --> 01:01:15.770 it doesn't matter what the other numbers are-- the result is going to be 0. 01:01:15.770 --> 01:01:19.710 And the same thing can be said of negative numbers as well.
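Numerically, the collapse looks like this. The sketch below reuses the same example probabilities as before, with P(grandson | positive) zeroed out:

```python
# Same example values as before, but P(grandson | positive) is now 0.
# Multiplying by 0 wipes out every other probability in the product.
p_positive = 0.49 * 0.30 * 0 * 0.32 * 0.30
p_negative = 0.51 * 0.20 * 0.02 * 0.08 * 0.40

print(p_positive)  # 0.0 -- no other word can rescue the positive score
print(p_negative)  # a small but nonzero value, so negative "wins" outright
```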
01:01:19.710 --> 01:01:24.320 So this then would seem to be a problem that, because grandson has never 01:01:24.320 --> 01:01:27.630 showed up in any of the positive messages inside of our sample, 01:01:27.630 --> 01:01:31.340 we're able to say-- we seem to be concluding that there is a 0% 01:01:31.340 --> 01:01:33.110 chance that the message is positive. 01:01:33.110 --> 01:01:37.105 And therefore, it must be negative, because the only cases where 01:01:37.105 --> 01:01:39.980 we've seen the word grandson come up is inside of a negative message. 01:01:39.980 --> 01:01:43.340 And in doing so, we've totally ignored all of the other probabilities 01:01:43.340 --> 01:01:46.940 that a positive message is much more likely to have the word loved in it, 01:01:46.940 --> 01:01:49.190 because we've multiplied by 0, which just 01:01:49.190 --> 01:01:53.670 means none of the other probabilities can possibly matter at all. 01:01:53.670 --> 01:01:55.920 So this then is a challenge that we need to deal with. 01:01:55.920 --> 01:01:57.380 It means that we're likely not going to be 01:01:57.380 --> 01:02:00.220 able to get the correct results if we just purely use this approach. 01:02:00.220 --> 01:02:02.720 And it's for that reason there are a number of possible ways 01:02:02.720 --> 01:02:06.230 we can try and make sure that we never multiply something by 0. 01:02:06.230 --> 01:02:08.750 It's OK to multiply something by a small number, 01:02:08.750 --> 01:02:10.640 because then it can still be counterbalanced 01:02:10.640 --> 01:02:14.540 by other larger numbers, but multiplying by 0 means it's the end of the story. 01:02:14.540 --> 01:02:16.520 You multiply a number by 0, and the output's 01:02:16.520 --> 01:02:21.230 going to be 0, no matter how big any of the other numbers happen to be. 
01:02:21.230 --> 01:02:23.810 So one approach that's fairly common in naive Bayes is 01:02:23.810 --> 01:02:29.090 this idea of additive smoothing, adding some value alpha to each of the values 01:02:29.090 --> 01:02:31.943 in our distribution just to smooth the data a little bit. 01:02:31.943 --> 01:02:33.860 One such approach is called Laplace smoothing, 01:02:33.860 --> 01:02:37.530 which basically just means adding one to each value in our distribution. 01:02:37.530 --> 01:02:43.540 So if I have 100 samples and zero of them contain the word grandson, 01:02:43.540 --> 01:02:45.290 well then I might say that, you know what? 01:02:45.290 --> 01:02:49.460 Instead, let's pretend that I've had one additional sample where the word 01:02:49.460 --> 01:02:53.210 grandson appeared and one additional sample where the word grandson didn't 01:02:53.210 --> 01:02:53.840 appear. 01:02:53.840 --> 01:02:57.150 So I'll say, all right, now I have 1 of 102-- 01:02:57.150 --> 01:03:01.550 so one sample that does have the word grandson out of 102 total. 01:03:01.550 --> 01:03:05.070 I'm basically creating two samples that didn't exist before. 01:03:05.070 --> 01:03:08.830 But in doing so, I've been able to smooth the distribution a little bit 01:03:08.830 --> 01:03:12.040 to make sure that I never have to multiply anything by 0. 01:03:12.040 --> 01:03:17.080 By pretending I've seen one more value in each category than I actually have, 01:03:17.080 --> 01:03:19.390 this gets us the result of not having to worry 01:03:19.390 --> 01:03:22.180 about multiplying a number by 0. 01:03:22.180 --> 01:03:24.580 So this then is an approach that we can use in order 01:03:24.580 --> 01:03:27.670 to try and apply naive Bayes, even in situations 01:03:27.670 --> 01:03:31.730 where we're dealing with words that we might not necessarily have seen before. 01:03:31.730 --> 01:03:35.140 And let's now take a look at how we could actually apply that in practice.
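Before turning to the library, the smoothing formula itself can be sketched directly. This is a minimal sketch of additive smoothing for a binary present/absent feature, matching the 100-sample example above:

```python
def smoothed_probability(count, total, alpha=1):
    """Additive smoothing (Laplace smoothing when alpha=1) for a binary
    feature: pretend we saw alpha extra samples that contain the word
    and alpha extra samples that don't."""
    return (count + alpha) / (total + 2 * alpha)

# 0 of 100 positive samples contain "grandson": without smoothing the
# probability would be 0 and would zero out the entire product.
print(smoothed_probability(0, 100))   # 1/102, small but nonzero

# A common word is barely affected: 32 of 100 becomes 33/102.
print(smoothed_probability(32, 100))
```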
01:03:35.140 --> 01:03:38.490 It turns out that NLTK, in addition to having the ability to extract 01:03:38.490 --> 01:03:41.110 n-grams and tokenize things into words, also 01:03:41.110 --> 01:03:45.400 has the ability to be able to apply naive Bayes on some samples of text, 01:03:45.400 --> 01:03:46.920 for example. 01:03:46.920 --> 01:03:48.430 And so let's go ahead and do that. 01:03:48.430 --> 01:03:52.840 What I've done is, inside of sentiment, I've prepared a corpus of just 01:03:52.840 --> 01:03:55.997 some reviews that I've generated, but you can imagine using real reviews. 01:03:55.997 --> 01:03:58.330 I just have a couple of positive reviews-- it was great. 01:03:58.330 --> 01:03:58.873 So much fun. 01:03:58.873 --> 01:03:59.540 Would recommend. 01:03:59.540 --> 01:04:00.550 My grandson loved it. 01:04:00.550 --> 01:04:01.712 Those sorts of messages. 01:04:01.712 --> 01:04:04.420 And then I have a whole bunch of negative reviews-- not worth it, 01:04:04.420 --> 01:04:07.190 kind of cheap, really bad, didn't work the way we expected-- 01:04:07.190 --> 01:04:08.470 just one on each line. 01:04:08.470 --> 01:04:11.860 A whole bunch of positive reviews and negative reviews. 01:04:11.860 --> 01:04:15.130 And what I'd like to do now is analyze them somehow. 01:04:15.130 --> 01:04:19.690 So here then is sentiment.py, and what we're going to do first 01:04:19.690 --> 01:04:23.680 is extract all of the positive and negative sentences, 01:04:23.680 --> 01:04:28.600 create a set of all of the words that were used across all of the messages, 01:04:28.600 --> 01:04:33.340 and then we're going to go ahead and train NLTK's naive Bayes classifier 01:04:33.340 --> 01:04:34.810 on all of this training data.
01:04:34.810 --> 01:04:36.850 And what the training data effectively is, is: I 01:04:36.850 --> 01:04:40.300 take all of the positive messages and give them the label positive, all 01:04:40.300 --> 01:04:42.790 the negative messages and give them the label negative, 01:04:42.790 --> 01:04:45.880 and then I'll go ahead and apply this classifier to it, where I'd say, 01:04:45.880 --> 01:04:48.100 I would like to take all of this training data 01:04:48.100 --> 01:04:52.030 and now have the ability to classify it as positive or negative. 01:04:52.030 --> 01:04:53.860 I'll then take some input from the user. 01:04:53.860 --> 01:04:56.890 They can just type in some sequence of words. 01:04:56.890 --> 01:04:59.020 And then I would like to classify that sequence 01:04:59.020 --> 01:05:01.450 as either positive or negative, and then I'll 01:05:01.450 --> 01:05:04.482 go ahead and print out what the probabilities of each happen to be. 01:05:04.482 --> 01:05:07.690 And there are some helper functions here that just organize things in the way 01:05:07.690 --> 01:05:09.610 that NLTK is expecting them to be. 01:05:09.610 --> 01:05:12.307 But the key idea here is that I'm taking the positive messages, 01:05:12.307 --> 01:05:14.140 labeling them, taking the negative messages, 01:05:14.140 --> 01:05:16.840 labeling them, putting them inside of a classifier, 01:05:16.840 --> 01:05:21.380 and then now trying to classify some new text that comes about. 01:05:21.380 --> 01:05:23.030 So let's go ahead and try it. 01:05:23.030 --> 01:05:26.740 I'll go ahead and go into sentiment, and we'll run python sentiment.py, 01:05:26.740 --> 01:05:29.328 passing in as input that corpus that contains 01:05:29.328 --> 01:05:31.120 all of the positive and negative messages-- 01:05:31.120 --> 01:05:34.480 because depending on the corpus, that's going to affect the probabilities.
01:05:34.480 --> 01:05:36.970 The effectiveness of our ability to classify 01:05:36.970 --> 01:05:41.045 is entirely dependent on how good our data is, and how much data we have, 01:05:41.045 --> 01:05:42.670 and how well they happen to be labeled. 01:05:42.670 --> 01:05:44.640 So now I can try something and say-- 01:05:44.640 --> 01:05:47.170 let's try a review like, this was great-- 01:05:47.170 --> 01:05:49.800 just some review that I might leave. 01:05:49.800 --> 01:05:53.200 And it seems that, all right, there is a 96% chance it estimates 01:05:53.200 --> 01:05:54.930 that this was a positive message-- 01:05:54.930 --> 01:05:58.480 4% chance that it was negative, likely because the word great 01:05:58.480 --> 01:06:00.610 shows up inside of the positive messages, 01:06:00.610 --> 01:06:03.080 but doesn't show up inside of the negative messages. 01:06:03.080 --> 01:06:06.160 And that might be something that our AI is able to capitalize on. 01:06:06.160 --> 01:06:09.640 And really, what it's going to look for are the differentiating words-- 01:06:09.640 --> 01:06:12.490 that if the probability of words like this and was 01:06:12.490 --> 01:06:15.530 is pretty similar between positive and negative messages, 01:06:15.530 --> 01:06:17.680 then the naive Bayes classifier isn't going 01:06:17.680 --> 01:06:21.202 to end up using those values as having some sort of importance 01:06:21.202 --> 01:06:21.910 in the algorithm. 01:06:21.910 --> 01:06:23.710 Because if they're the same on both sides, 01:06:23.710 --> 01:06:26.560 you multiply that value for both positive and negative, 01:06:26.560 --> 01:06:28.270 you end up getting about the same thing.
01:06:28.270 --> 01:06:30.730 What ultimately makes the difference in naive Bayes 01:06:30.730 --> 01:06:34.210 is when you multiply by a value that's much bigger for one category 01:06:34.210 --> 01:06:36.880 than for another category-- when one word like great 01:06:36.880 --> 01:06:39.910 is much more likely to show up in one type of message 01:06:39.910 --> 01:06:41.260 than another type of message. 01:06:41.260 --> 01:06:43.385 And that's one of the nice things about naive Bayes-- 01:06:43.385 --> 01:06:45.250 that, without me telling it that great 01:06:45.250 --> 01:06:48.210 is more important to care about than this or was, 01:06:48.210 --> 01:06:50.380 naive Bayes can figure that out based on the data. 01:06:50.380 --> 01:06:53.740 It can figure out that this shows up about the same amount of time 01:06:53.740 --> 01:06:56.560 between the two, but great, that is a discriminator, 01:06:56.560 --> 01:07:00.060 a word that can be different between the two types of messages. 01:07:00.060 --> 01:07:01.400 So I could try it again-- 01:07:01.400 --> 01:07:04.583 type in a sentence like, lots of fun, for example. 01:07:04.583 --> 01:07:06.250 This one it's a little less sure about-- 01:07:06.250 --> 01:07:10.690 62% chance that it's positive, 37% chance that it's negative-- maybe 01:07:10.690 --> 01:07:12.720 because there aren't as clear discriminators 01:07:12.720 --> 01:07:15.310 or differentiators inside of this data. 01:07:15.310 --> 01:07:16.400 I'll try one more-- 01:07:16.400 --> 01:07:20.430 say kind of overpriced. 01:07:20.430 --> 01:07:23.633 And all right, now 95%, 96% sure that this 01:07:23.633 --> 01:07:25.800 is a negative sentiment-- likely because of the word 01:07:25.800 --> 01:07:29.032 overpriced, because it's shown up in a negative sentiment expression 01:07:29.032 --> 01:07:31.740 before, and therefore, it thinks, you know what, this is probably 01:07:31.740 --> 01:07:34.720 going to be a negative sentence.
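The behavior just described can be sketched from scratch, without NLTK: a tiny naive Bayes classifier with Laplace smoothing, trained on a handful of reviews like the lecture's. The reviews below (and therefore the exact probabilities) are invented for illustration, not the actual corpus; logs are used to avoid multiplying many small numbers.

```python
import math
from collections import Counter

def train(positives, negatives):
    """Count, for each word, how many messages of each label contain it."""
    pos_counts = Counter(w for msg in positives for w in set(msg.lower().split()))
    neg_counts = Counter(w for msg in negatives for w in set(msg.lower().split()))
    return pos_counts, len(positives), neg_counts, len(negatives)

def classify(text, model, alpha=1):
    pos_counts, n_pos, neg_counts, n_neg = model
    # Work in log space to avoid underflow from long products of small values.
    log_pos = math.log(n_pos / (n_pos + n_neg))
    log_neg = math.log(n_neg / (n_pos + n_neg))
    for word in set(text.lower().split()):
        # Laplace-smoothed P(word present | label)
        log_pos += math.log((pos_counts[word] + alpha) / (n_pos + 2 * alpha))
        log_neg += math.log((neg_counts[word] + alpha) / (n_neg + 2 * alpha))
    # Normalize back into a probability distribution over the two labels.
    p_pos = 1 / (1 + math.exp(log_neg - log_pos))
    return {"positive": p_pos, "negative": 1 - p_pos}

positives = ["it was great", "so much fun", "would recommend", "my grandson loved it"]
negatives = ["not worth it", "kind of cheap", "really bad", "didn't work the way we expected"]
model = train(positives, negatives)

result = classify("this was great", model)
# "great" appears only in positive reviews, so positive should win,
# while "this" (seen in neither set) contributes equally to both labels.
print(result)
```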
01:07:34.720 --> 01:07:37.830 And so naive Bayes has now given us the ability to classify text. 01:07:37.830 --> 01:07:40.350 Given enough training data, given enough examples, 01:07:40.350 --> 01:07:44.400 we can train our AI to be able to look at natural language, human words, 01:07:44.400 --> 01:07:46.410 figure out which words are likely to show up 01:07:46.410 --> 01:07:48.870 in positive as opposed to negative sentiment messages, 01:07:48.870 --> 01:07:50.670 and categorize them accordingly. 01:07:50.670 --> 01:07:52.420 And you could imagine doing the same thing 01:07:52.420 --> 01:07:55.170 anytime you want to take text and group it into categories. 01:07:55.170 --> 01:07:58.300 If I want to take an email and categorize it 01:07:58.300 --> 01:08:01.560 as a good email or as a spam email, you could apply a similar idea. 01:08:01.560 --> 01:08:04.020 Try and look for the discriminating words, 01:08:04.020 --> 01:08:07.230 the words that make it more likely to be a spam email or not, 01:08:07.230 --> 01:08:10.830 and just train a naive Bayes classifier to be able to figure out 01:08:10.830 --> 01:08:14.250 what that distribution is and to be able to figure out how to categorize 01:08:14.250 --> 01:08:15.978 an email as good or as spam. 01:08:15.978 --> 01:08:19.020 Now, of course, it's not going to be able to give us a definitive answer. 01:08:19.020 --> 01:08:22.950 It gives us a probability distribution, something like 63% 01:08:22.950 --> 01:08:25.380 positive, 37% negative. 01:08:25.380 --> 01:08:29.550 And that might be why the spam filters in our email sometimes make mistakes, 01:08:29.550 --> 01:08:32.700 sometimes think that a good email is actually spam or vice 01:08:32.700 --> 01:08:36.000 versa, because ultimately, the best that it can do 01:08:36.000 --> 01:08:37.890 is calculate a probability distribution.
01:08:37.890 --> 01:08:40.290 If natural language is ambiguous, we can usually 01:08:40.290 --> 01:08:42.960 just deal in the world of probabilities to try and get 01:08:42.960 --> 01:08:47.100 an answer that is reasonably good, even if we aren't able to guarantee for sure 01:08:47.100 --> 01:08:50.970 that it is the answer that we actually expect it to be. 01:08:50.970 --> 01:08:54.600 That then was a look at how we can begin to take some text 01:08:54.600 --> 01:08:59.910 and to be able to analyze the text and group it into some sorts of categories. 01:08:59.910 --> 01:09:04.140 But ultimately, in addition to just being able to analyze text and categorize it, 01:09:04.140 --> 01:09:08.130 we'd like to be able to figure out information about the text, 01:09:08.130 --> 01:09:11.130 get at some sort of meaning out of the text as well. 01:09:11.130 --> 01:09:13.500 And this starts to get us into the world of information, 01:09:13.500 --> 01:09:16.620 of being able to try and take data in the form of text 01:09:16.620 --> 01:09:18.450 and retrieve information from it. 01:09:18.450 --> 01:09:22.500 So one type of problem is known as information retrieval, or IR, 01:09:22.500 --> 01:09:26.979 which is the task of finding relevant documents in response to a query. 01:09:26.979 --> 01:09:30.330 So this is something like you type in a query into a search engine, 01:09:30.330 --> 01:09:32.279 like Google, or you're typing something 01:09:32.279 --> 01:09:35.640 into some system-- inside of a library catalog, 01:09:35.640 --> 01:09:38.609 for example-- that's going to look for responses to a query. 01:09:38.609 --> 01:09:43.217 I want to look for documents that are about the US constitution or something, 01:09:43.217 --> 01:09:45.300 and I would like to get a whole bunch of documents 01:09:45.300 --> 01:09:47.819 that match that query back to me.
01:09:47.819 --> 01:09:50.819 But you might imagine that what I really want to be able to do 01:09:50.819 --> 01:09:53.160 is, in order to solve this task effectively, 01:09:53.160 --> 01:09:55.830 I need to be able to take documents and figure out, 01:09:55.830 --> 01:09:57.870 what are those documents about? 01:09:57.870 --> 01:10:01.680 I want to be able to say what is it that these particular documents are 01:10:01.680 --> 01:10:03.900 about-- what are the topics of those documents-- 01:10:03.900 --> 01:10:08.160 so that I can then more effectively be able to retrieve information 01:10:08.160 --> 01:10:10.050 from those particular documents. 01:10:10.050 --> 01:10:13.560 And this refers to a set of tasks generally known as topic modeling, 01:10:13.560 --> 01:10:17.918 where I'd like to discover what the topics are for a set of documents. 01:10:17.918 --> 01:10:19.710 And this is something that humans could do. 01:10:19.710 --> 01:10:21.800 A human could read a document and tell you, all right, 01:10:21.800 --> 01:10:23.883 here's what this document is about, and give maybe 01:10:23.883 --> 01:10:27.862 a couple of topics, or who the important people in this document are, what 01:10:27.862 --> 01:10:30.570 the important objects in the document are-- a human can probably tell you 01:10:30.570 --> 01:10:32.370 that kind of thing. 01:10:32.370 --> 01:10:35.160 But we'd like for our AI to be able to do the same thing. 01:10:35.160 --> 01:10:38.760 Given some document, can you tell me what the important words 01:10:38.760 --> 01:10:39.870 in this document are? 01:10:39.870 --> 01:10:42.095 What are the words that set this document apart 01:10:42.095 --> 01:10:44.220 that I might care about if I'm looking at documents 01:10:44.220 --> 01:10:47.128 based on keywords, for example? 01:10:47.128 --> 01:10:49.920 And so one instinctive idea-- an intuitive idea that probably makes 01:10:49.920 --> 01:10:50.580 sense-- 01:10:50.580 --> 01:10:53.250 is let's just use term frequency.
01:10:53.250 --> 01:10:56.100 Term frequency is just defined as the number of times 01:10:56.100 --> 01:10:58.650 a particular term appears in a document. 01:10:58.650 --> 01:11:03.300 If I have a document with 100 words and one particular word shows up 10 times, 01:11:03.300 --> 01:11:05.440 it has a term frequency of 10. 01:11:05.440 --> 01:11:06.690 It shows up pretty often. 01:11:06.690 --> 01:11:09.000 Maybe that's going to be an important word. 01:11:09.000 --> 01:11:10.750 And sometimes, you'll also see this framed 01:11:10.750 --> 01:11:14.620 as a proportion of the total number of words, so 10 words out of 100. 01:11:14.620 --> 01:11:19.110 Maybe it has a term frequency of 0.1, meaning 10% of all of the words 01:11:19.110 --> 01:11:21.530 are this particular word that I care about. 01:11:21.530 --> 01:11:23.280 Ultimately, that doesn't change how relatively 01:11:23.280 --> 01:11:26.300 important the words are for any one particular document-- 01:11:26.300 --> 01:11:27.730 it's the same idea. 01:11:27.730 --> 01:11:31.050 The idea is look for words that show up more frequently, because those 01:11:31.050 --> 01:11:35.970 are more likely to be the important words inside of a corpus of documents. 01:11:35.970 --> 01:11:37.840 And so let's go ahead and give that a try. 01:11:37.840 --> 01:11:40.980 Let's say I wanted to find out what the Sherlock Holmes stories are about. 01:11:40.980 --> 01:11:42.780 I have a whole bunch of Sherlock Holmes stories 01:11:42.780 --> 01:11:45.000 and I want to know, in general, what are they about? 01:11:45.000 --> 01:11:47.708 What are the important characters? 01:11:47.708 --> 01:11:49.000 What are the important objects? 01:11:49.000 --> 01:11:52.170 What are the important parts of the story, just in terms of words? 
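Term frequency itself takes only a few lines to compute. A quick sketch, where the sample sentence is invented for illustration:

```python
from collections import Counter

document = "the dog chased the cat and the cat ran"
words = document.split()

# Raw term frequency: how many times each term appears in the document.
term_frequencies = Counter(words)
print(term_frequencies["the"])  # 3

# The same idea framed as a proportion of the total number of words.
print(term_frequencies["the"] / len(words))  # 3 out of 9 words
```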
01:11:52.170 --> 01:11:55.350 And I'd like for the AI to be able to figure that out on its own, 01:11:55.350 --> 01:11:57.660 and we'll do so by looking at term frequency-- 01:11:57.660 --> 01:12:01.930 by looking at, what are the words that show up the most often? 01:12:01.930 --> 01:12:06.250 So we'll go ahead, and I'll go ahead and go into the tfidf directory. 01:12:06.250 --> 01:12:08.350 You'll see why it's called that in a moment. 01:12:08.350 --> 01:12:14.290 But let's first open up tf0.py, which is going to calculate the top 10 term 01:12:14.290 --> 01:12:17.092 frequencies-- or maybe top five term frequencies 01:12:17.092 --> 01:12:19.300 for a corpus of documents, a whole bunch of documents 01:12:19.300 --> 01:12:22.930 where each document is just a story from Sherlock Holmes. 01:12:22.930 --> 01:12:26.772 We're going to load all the data into our corpus 01:12:26.772 --> 01:12:29.850 and we're going to figure out, what are all of the words that 01:12:29.850 --> 01:12:32.610 show up inside of that corpus? 01:12:32.610 --> 01:12:35.187 And we're going to basically just assemble all 01:12:35.187 --> 01:12:36.770 of the term frequencies. 01:12:36.770 --> 01:12:39.510 We're going to calculate how often each of these terms 01:12:39.510 --> 01:12:41.880 appears inside of the document. 01:12:41.880 --> 01:12:43.368 And we'll print out the top five. 01:12:43.368 --> 01:12:45.660 And so there are some data structures involved that you 01:12:45.660 --> 01:12:47.160 can take a look at if you'd like to. 01:12:47.160 --> 01:12:50.550 The exact code is not so important; what matters is the idea of what we're doing. 01:12:50.550 --> 01:12:54.450 We're taking each of these documents and first sorting them. 01:12:54.450 --> 01:12:56.340 We're saying, take all the words that show up 01:12:56.340 --> 01:13:00.080 and sort them by how often each word shows up.
01:13:00.080 --> 01:13:04.710 And let's go ahead and just, for each document, save the top five 01:13:04.710 --> 01:13:07.720 terms that happen to show up in each of those documents. 01:13:07.720 --> 01:13:10.900 So again, some helper functions you can take a look at if you're interested. 01:13:10.900 --> 01:13:13.440 But the key idea here is that all we're going to do 01:13:13.440 --> 01:13:18.240 is run tf0 on the Sherlock Holmes stories. 01:13:18.240 --> 01:13:21.840 And what I'm hoping to get out of this process is I am hoping to figure out, 01:13:21.840 --> 01:13:25.150 what are the important words in Sherlock Holmes, for example? 01:13:25.150 --> 01:13:29.370 So we'll go ahead and run this and see what we get. 01:13:29.370 --> 01:13:30.982 And it's loading the data. 01:13:30.982 --> 01:13:31.940 And here's what we get. 01:13:31.940 --> 01:13:36.530 For this particular story, the important words are the, and and, and I, 01:13:36.530 --> 01:13:37.368 and to, and of. 01:13:37.368 --> 01:13:39.410 Those are the words that show up more frequently. 01:13:39.410 --> 01:13:45.000 In this particular story, it's the, and and, and I, and a, and of. 01:13:45.000 --> 01:13:47.000 This is not particularly useful to us. 01:13:47.000 --> 01:13:48.230 We're using term frequencies. 01:13:48.230 --> 01:13:50.930 We're looking at what words show up the most frequently in each 01:13:50.930 --> 01:13:54.830 of these various different documents, but what we get naturally 01:13:54.830 --> 01:13:57.470 are just the words that show up a lot in English. 01:13:57.470 --> 01:14:00.385 The words the, and of, and and happen to show up a lot in English, 01:14:00.385 --> 01:14:02.510 and therefore, they happen to show up a lot in each 01:14:02.510 --> 01:14:04.052 of these various different documents.
01:14:04.052 --> 01:14:06.320 This is not a particularly useful metric for us 01:14:06.320 --> 01:14:08.690 to be able to analyze what words are important, 01:14:08.690 --> 01:14:12.960 because these words are just part of the grammatical structure of English. 01:14:12.960 --> 01:14:17.610 And it turns out we can categorize words into a couple of different categories. 01:14:17.610 --> 01:14:21.102 These words happen to be known as what we might call function words, words 01:14:21.102 --> 01:14:23.060 that have little meaning on their own, but that 01:14:23.060 --> 01:14:26.100 are used to grammatically connect different parts of a sentence. 01:14:26.100 --> 01:14:29.120 These are words like am, and by, and do, and is, and which, 01:14:29.120 --> 01:14:32.130 and with, and yet-- words that, on their own, what do they mean? 01:14:32.130 --> 01:14:33.140 It's hard to say. 01:14:33.140 --> 01:14:35.390 They get their meaning from how they connect 01:14:35.390 --> 01:14:36.980 different parts of the sentence. 01:14:36.980 --> 01:14:40.610 And these function words are what we might call a closed class of words 01:14:40.610 --> 01:14:41.990 in a language like English. 01:14:41.990 --> 01:14:44.690 There's really just some fixed list of function words, 01:14:44.690 --> 01:14:46.190 and they don't change very often. 01:14:46.190 --> 01:14:48.260 There's just some list of words that are commonly 01:14:48.260 --> 01:14:52.460 used to connect other grammatical structures in the language. 01:14:52.460 --> 01:14:56.120 And that's in contrast with what we might call content words, words 01:14:56.120 --> 01:14:58.970 that carry meaning independently-- words like algorithm, 01:14:58.970 --> 01:15:02.580 category, computer, words that actually have some sort of meaning. 01:15:02.580 --> 01:15:05.150 And these are usually the words that we care about. 
01:15:05.150 --> 01:15:07.250 These are the words where we want to figure out, 01:15:07.250 --> 01:15:10.020 what are the important words in our document? 01:15:10.020 --> 01:15:12.230 We probably care about the content words more 01:15:12.230 --> 01:15:15.380 than we care about the function words. 01:15:15.380 --> 01:15:20.770 And so one strategy we could apply is just ignore all of the function words. 01:15:20.770 --> 01:15:26.120 So here in tf1.py, I've done the same exact thing, 01:15:26.120 --> 01:15:31.790 except I'm going to load a whole bunch of words from a function_words.txt 01:15:31.790 --> 01:15:35.670 file, inside of which are just a whole bunch of function words in alphabetical 01:15:35.670 --> 01:15:36.170 order. 01:15:36.170 --> 01:15:38.570 These are just a whole bunch of function words 01:15:38.570 --> 01:15:41.870 that are just words that are used to connect other words in English, 01:15:41.870 --> 01:15:44.275 and someone has just compiled this particular list. 01:15:44.275 --> 01:15:46.400 And these are the words that I just want to ignore. 01:15:46.400 --> 01:15:49.790 If it's any of these words, let's just ignore it as one of the top terms, 01:15:49.790 --> 01:15:52.790 because these are not words that I probably care about 01:15:52.790 --> 01:15:56.570 if I want to analyze what the important terms inside of a document 01:15:56.570 --> 01:15:57.860 happen to be. 01:15:57.860 --> 01:16:01.820 So in tf1, what we're ultimately doing is, 01:16:01.820 --> 01:16:05.360 if the word is in my set of function words, 01:16:05.360 --> 01:16:08.720 I'm just going to skip over it, just ignore any of the function words 01:16:08.720 --> 01:16:11.210 by continuing on to the next word and then 01:16:11.210 --> 01:16:14.010 just calculating the frequencies for those words instead.
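The skip-over-function-words step can be sketched like this. The tiny function-word set below stands in for the lecture's much longer function_words.txt, and the sample sentence is invented:

```python
from collections import Counter

# A tiny stand-in for the function_words.txt file described above.
function_words = {"the", "a", "an", "and", "of", "to", "i", "it", "was"}

document = "the adventure of the speckled band was a mystery"

# Count term frequencies, skipping any word in the function-word set.
counts = Counter(
    word for word in document.lower().split()
    if word not in function_words
)
print(counts.most_common(2))  # only content words survive the filter
```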
01:16:14.010 --> 01:16:16.520 So I'm going to pretend the function words aren't there, 01:16:16.520 --> 01:16:19.550 and now maybe I can get a better sense for what 01:16:19.550 --> 01:16:23.060 terms are important in each of the various different Sherlock Holmes 01:16:23.060 --> 01:16:24.560 stories. 01:16:24.560 --> 01:16:29.080 So now let's run tf1 on the Sherlock Holmes corpus and see what we get now. 01:16:29.080 --> 01:16:32.510 And let's look at, what is the most important term in each of the stories? 01:16:32.510 --> 01:16:34.760 Well, it seems like, for each of the stories, 01:16:34.760 --> 01:16:36.770 the most important word is Holmes. 01:16:36.770 --> 01:16:38.270 I guess that's what we would expect. 01:16:38.270 --> 01:16:39.380 They're all Sherlock Holmes stories. 01:16:39.380 --> 01:16:40.922 And Holmes is not a function word. 01:16:40.922 --> 01:16:44.360 It's not the, or a, or an, so it wasn't ignored. 01:16:44.360 --> 01:16:46.130 But Holmes and man-- 01:16:46.130 --> 01:16:50.760 these are probably not what I mean when I say, what are the important words? 01:16:50.760 --> 01:16:52.700 Even though Holmes does show up the most often, 01:16:52.700 --> 01:16:54.890 it's not giving me a whole lot of information here 01:16:54.890 --> 01:16:57.800 about what each of the different Sherlock Holmes stories 01:16:57.800 --> 01:16:59.460 are actually about. 01:16:59.460 --> 01:17:02.880 And the reason why is because Sherlock Holmes shows up in all the stories, 01:17:02.880 --> 01:17:06.950 and so it's not meaningful for me to say that this story is about Sherlock 01:17:06.950 --> 01:17:09.560 Holmes if I want to try and figure out the different topics 01:17:09.560 --> 01:17:11.180 across the corpus of documents. 01:17:11.180 --> 01:17:13.640 What I really want to know is, what words show up 01:17:13.640 --> 01:17:18.170 in this document that show up less frequently in the other documents, 01:17:18.170 --> 01:17:19.380 for example?
01:17:19.380 --> 01:17:22.730 And so to get at that idea, we're going to introduce the notion 01:17:22.730 --> 01:17:25.850 of inverse document frequency. 01:17:25.850 --> 01:17:29.450 Inverse document frequency is a measure of how common, 01:17:29.450 --> 01:17:33.530 or rare, a word happens to be across an entire corpus of documents. 01:17:33.530 --> 01:17:35.960 And mathematically, it's usually calculated like this-- 01:17:35.960 --> 01:17:39.440 as the logarithm of the total number of documents 01:17:39.440 --> 01:17:43.550 divided by the number of documents containing the word. 01:17:43.550 --> 01:17:47.510 So if a word like Holmes shows up in all of the documents, 01:17:47.510 --> 01:17:50.870 well, then the total number of documents 01:17:50.870 --> 01:17:55.110 and the number of documents containing Holmes are going to be the same number. 01:17:55.110 --> 01:17:58.760 So when you divide these two together, you'll get 1, and the logarithm of 1 01:17:58.760 --> 01:18:00.460 is just 0. 01:18:00.460 --> 01:18:04.370 And so what we get is, if Holmes shows up in all of the documents, 01:18:04.370 --> 01:18:07.040 it has an inverse document frequency of 0. 01:18:07.040 --> 01:18:09.560 And you can think now of inverse document frequency 01:18:09.560 --> 01:18:13.370 as a measure of how rare the word that 01:18:13.370 --> 01:18:16.280 shows up in this particular document is-- that if a word doesn't show up 01:18:16.280 --> 01:18:21.060 across many documents at all, this number is going to be much higher. 01:18:21.060 --> 01:18:24.710 And this then gets us to a model known as tf-idf, 01:18:24.710 --> 01:18:28.310 which is a method for ranking what words are important in the document 01:18:28.310 --> 01:18:30.440 by multiplying these two ideas together. 
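The idf formula just described translates directly into code. Here is a minimal sketch (the toy documents are invented for illustration):

```python
import math

def inverse_document_frequency(word, documents):
    """idf = log(total number of documents / number of documents containing the word)."""
    containing = sum(1 for doc in documents if word in doc)
    return math.log(len(documents) / containing)

# Each document represented as a set of its words
documents = [
    {"holmes", "man", "elections"},
    {"holmes", "man", "senate"},
    {"holmes", "watson", "impeachment"},
]
print(inverse_document_frequency("holmes", documents))  # 0.0 -- shows up everywhere
print(inverse_document_frequency("senate", documents))  # log(3), much higher -- rare word
```

As the lecture notes, "holmes" appears in every document, so the ratio is 1 and the logarithm is 0; a word appearing in only one of three documents scores log(3).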
01:18:30.440 --> 01:18:37.190 Multiply term frequency, or TF, by inverse document frequency, or IDF, 01:18:37.190 --> 01:18:39.890 where the idea here now is that how important a word is 01:18:39.890 --> 01:18:41.540 depends on two things. 01:18:41.540 --> 01:18:44.197 It depends on how often it shows up in the document, using 01:18:44.197 --> 01:18:46.280 the heuristic that, if a word shows up more often, 01:18:46.280 --> 01:18:47.900 it's probably more important. 01:18:47.900 --> 01:18:51.170 And we multiply that by inverse document frequency, IDF, 01:18:51.170 --> 01:18:54.900 because if the word is rarer, but it shows up in the document, 01:18:54.900 --> 01:18:57.200 it's probably more important than if the word shows up 01:18:57.200 --> 01:19:00.200 across most or all of the documents, because then it's probably 01:19:00.200 --> 01:19:02.990 a less important factor in what the different topics 01:19:02.990 --> 01:19:06.840 across the different documents in the corpus happen to be. 01:19:06.840 --> 01:19:11.060 And so now let's go ahead and apply this algorithm on the Sherlock Holmes 01:19:11.060 --> 01:19:13.340 corpus. 01:19:13.340 --> 01:19:15.650 And here's tfidf. 01:19:15.650 --> 01:19:18.860 Now what I'm doing is, for each of the documents, 01:19:18.860 --> 01:19:22.120 for each word, I'm calculating its TF score, 01:19:22.120 --> 01:19:25.160 term frequency, multiplied by the inverse document 01:19:25.160 --> 01:19:28.190 frequency of that word-- not just looking at the single value, 01:19:28.190 --> 01:19:30.410 but multiplying these two values together 01:19:30.410 --> 01:19:33.650 in order to compute the overall values. 01:19:33.650 --> 01:19:37.610 And now, if I run tfidf on the Holmes corpus, 01:19:37.610 --> 01:19:40.615 this is going to try and get us a better approximation for what's 01:19:40.615 --> 01:19:41.990 important in each of the stories. 
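Putting the two ideas together, a bare-bones version of the tf-idf computation the lecture's tfidf program performs might look like this (the toy corpus is made up; the real program reads the Sherlock Holmes files from disk):

```python
import math
from collections import Counter

def tfidf(documents):
    """For each document (a list of words), score each word by tf * idf."""
    doc_count = Counter()  # number of documents containing each word
    for doc in documents:
        doc_count.update(set(doc))
    scores = []
    for doc in documents:
        tf = Counter(doc)
        scores.append({
            word: tf[word] * math.log(len(documents) / doc_count[word])
            for word in tf
        })
    return scores

docs = [
    ["holmes", "holmes", "elections", "elections", "elections"],
    ["holmes", "senate", "impeachment"],
]
scores = tfidf(docs)
# "holmes" appears in every document, so its tf-idf score is 0 in both
print(max(scores[0], key=scores[0].get))  # elections
```

Even though "holmes" has the highest term frequency in the first document, its idf of 0 zeroes it out, and the document-specific word wins.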
01:19:41.990 --> 01:19:44.000 And it seems like it's trying to extract here 01:19:44.000 --> 01:19:46.280 probably like the names of characters that 01:19:46.280 --> 01:19:49.010 happen to be important in the story-- characters that show up 01:19:49.010 --> 01:19:51.380 in this story that don't show up in the other stories-- 01:19:51.380 --> 01:19:53.930 and prioritizing the more important characters that 01:19:53.930 --> 01:19:56.510 happen to show up more often. 01:19:56.510 --> 01:20:00.170 And so this then might be a better analysis of what types of topics 01:20:00.170 --> 01:20:02.070 are more or less important. 01:20:02.070 --> 01:20:05.330 I also have another corpus, which is a corpus of all of the Federalist 01:20:05.330 --> 01:20:07.700 Papers from American history. 01:20:07.700 --> 01:20:11.240 If I go ahead and run tfidf on the Federalist Papers, 01:20:11.240 --> 01:20:14.330 we can begin to see what the important words in each 01:20:14.330 --> 01:20:16.910 of the various different Federalist Papers happen to be-- 01:20:16.910 --> 01:20:22.070 that in Federalist Paper Number 61, it seems like it's a lot about elections. 01:20:22.070 --> 01:20:25.350 In Federalist Paper Number 66, it's about the Senate and impeachments. 01:20:25.350 --> 01:20:28.470 You can start to extract what the important terms and what 01:20:28.470 --> 01:20:32.540 the important words are just by looking at what things 01:20:32.540 --> 01:20:34.800 don't show up across many of the documents, 01:20:34.800 --> 01:20:38.637 but show up frequently enough in certain of the documents. 01:20:38.637 --> 01:20:40.470 And so this can be a helpful tool for trying 01:20:40.470 --> 01:20:43.350 to figure out this kind of topic modeling, 01:20:43.350 --> 01:20:47.100 figuring out what it is that a particular document happens 01:20:47.100 --> 01:20:48.620 to be about. 
01:20:48.620 --> 01:20:53.070 And so this then is starting to get us into this world of semantics, 01:20:53.070 --> 01:20:56.880 what it is that things actually mean when we're talking about language. 01:20:56.880 --> 01:20:59.100 Now, we're no longer just going to think about the bag of words, 01:20:59.100 --> 01:21:02.670 where we treat a sample of text as just a whole bunch of words. 01:21:02.670 --> 01:21:04.320 And we don't care about the order. 01:21:04.320 --> 01:21:06.870 Now, when we get into the world of semantics, 01:21:06.870 --> 01:21:10.750 we really do start to care about what it is that these words actually mean, 01:21:10.750 --> 01:21:12.850 how it is these words relate to each other, 01:21:12.850 --> 01:21:17.250 and in particular, how we can extract information out of that text. 01:21:17.250 --> 01:21:20.970 Information extraction is the task of extracting knowledge 01:21:20.970 --> 01:21:23.970 from our documents-- figuring out, given a whole bunch of text, 01:21:23.970 --> 01:21:28.140 can we automate the process of having an AI look at those documents 01:21:28.140 --> 01:21:31.710 and get out what the useful or relevant knowledge inside those documents 01:21:31.710 --> 01:21:33.190 happens to be? 01:21:33.190 --> 01:21:34.950 So let's take a look at an example. 01:21:34.950 --> 01:21:37.415 I'll give you two samples from news articles. 01:21:37.415 --> 01:21:40.290 Here up above is a sample of a news article from the Harvard Business 01:21:40.290 --> 01:21:42.310 Review that was about Facebook. 01:21:42.310 --> 01:21:45.630 Down below is an example of a Business Insider article from 2018 01:21:45.630 --> 01:21:47.550 that was about Amazon. 01:21:47.550 --> 01:21:49.710 And there's some information here that we might 01:21:49.710 --> 01:21:51.570 want an AI to be able to extract-- 01:21:51.570 --> 01:21:54.030 information, knowledge about these companies 01:21:54.030 --> 01:21:55.670 that we might want to extract. 
01:21:55.670 --> 01:21:58.020 And in particular, what I might want to extract is-- 01:21:58.020 --> 01:22:02.260 let's say I want to know data about when companies were founded-- 01:22:02.260 --> 01:22:05.250 that I wanted to know that Facebook was founded in 2004, 01:22:05.250 --> 01:22:07.190 Amazon founded in 1994-- 01:22:07.190 --> 01:22:10.500 that that is important information that I happen to care about. 01:22:10.500 --> 01:22:13.110 Well, how do we extract that information from the text? 01:22:13.110 --> 01:22:15.660 What is my way of being able to understand this text 01:22:15.660 --> 01:22:18.810 and figure out, all right, Facebook was founded in 2004? 01:22:18.810 --> 01:22:22.710 Well, what I can look for are templates or patterns, things 01:22:22.710 --> 01:22:26.700 that happen to show up across multiple different documents that give me 01:22:26.700 --> 01:22:28.922 some sense for what this knowledge happens to mean. 01:22:28.922 --> 01:22:30.630 And what we'll notice is a common pattern 01:22:30.630 --> 01:22:34.500 between both of these passages, which is this phrasing here. 01:22:34.500 --> 01:22:37.890 When Facebook was founded in 2004, comma-- 01:22:37.890 --> 01:22:42.360 and then down below, when Amazon was founded in 1994, comma. 
01:22:42.360 --> 01:22:47.640 And those two templates end up giving us a mechanism for trying to extract 01:22:47.640 --> 01:22:53.220 information-- that this notion, when company was founded in year comma, 01:22:53.220 --> 01:22:56.310 this can tell us something about when a company was founded, 01:22:56.310 --> 01:22:58.820 because if we set our AI loose on the web, 01:22:58.820 --> 01:23:01.530 let it look at a whole bunch of papers or a whole bunch of articles, 01:23:01.530 --> 01:23:03.360 and it finds this pattern-- 01:23:03.360 --> 01:23:06.930 when blank was founded in blank, comma-- 01:23:06.930 --> 01:23:09.840 well, then our AI can pretty reasonably conclude 01:23:09.840 --> 01:23:13.740 that there's a good chance that this is going to be like some company, 01:23:13.740 --> 01:23:17.470 and this is going to be like the year that company was founded, for example-- 01:23:17.470 --> 01:23:20.907 might not be perfect, but at least it's a good heuristic. 01:23:20.907 --> 01:23:22.740 And so you might imagine that, if you wanted 01:23:22.740 --> 01:23:25.650 to train an AI to be able to look for information, 01:23:25.650 --> 01:23:27.810 you might give the AI templates like this-- 01:23:27.810 --> 01:23:31.200 not only give it a template like when company blank was founded in blank, 01:23:31.200 --> 01:23:34.710 but give it like, the book blank was written by blank, for example. 01:23:34.710 --> 01:23:37.500 Just give it some templates where it can search the web, 01:23:37.500 --> 01:23:41.640 search a whole big corpus of documents, looking for templates that match that, 01:23:41.640 --> 01:23:44.970 and if it finds that, then it's able to figure out, 01:23:44.970 --> 01:23:47.370 all right, here's the company and here's the year. 01:23:47.370 --> 01:23:50.250 But of course, that requires us to write these templates. 
01:23:50.250 --> 01:23:53.547 It requires us to figure out, what is the structure of this information 01:23:53.547 --> 01:23:54.630 likely going to look like? 01:23:54.630 --> 01:23:56.190 And it might be difficult to know. 01:23:56.190 --> 01:23:58.500 Different websites are, of course, going to do this differently. 01:23:58.500 --> 01:24:01.830 This type of method isn't going to be able to extract all of the information, 01:24:01.830 --> 01:24:04.170 because if the words are slightly in a different order, 01:24:04.170 --> 01:24:06.840 it won't match on that particular template. 01:24:06.840 --> 01:24:11.310 But one thing we can do is, rather than give our AI the template, 01:24:11.310 --> 01:24:13.290 we can give the AI the data. 01:24:13.290 --> 01:24:19.540 We can tell the AI, Facebook was founded in 2004 and Amazon was founded in 1994, 01:24:19.540 --> 01:24:22.440 and just tell the AI those two pieces of information, 01:24:22.440 --> 01:24:24.780 and then set the AI loose on the web. 01:24:24.780 --> 01:24:30.030 And now the idea is that the AI can begin to look for, where do Facebook and 2004 01:24:30.030 --> 01:24:33.150 show up together, where do Amazon and 1994 show up together, 01:24:33.150 --> 01:24:36.150 and it can discover these templates for itself. 01:24:36.150 --> 01:24:38.580 It can discover that this kind of phrasing-- 01:24:38.580 --> 01:24:40.320 when blank was founded in blank-- 01:24:40.320 --> 01:24:45.030 tends to relate Facebook to 2004, and it relates Amazon to 1994, 01:24:45.030 --> 01:24:49.320 so maybe the same relation will hold for others as well. 01:24:49.320 --> 01:24:51.572 And this ends up being-- this automated template 01:24:51.572 --> 01:24:54.030 generation ends up being quite powerful, and we'll go ahead 01:24:54.030 --> 01:24:56.250 and take a look at that now as well. 
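The details of the course's search.py aren't shown in the transcript, but the core idea, discovering templates from seed pairs and then reusing them to extract new pairs, can be sketched like this (the corpus sentences, the 30-character window, and the regex approach are all illustrative assumptions, not the real program):

```python
import re

def find_templates(seed_pairs, corpus):
    """Find short text templates that connect each seed (entity, value) pair."""
    templates = set()
    for entity, value in seed_pairs:
        for document in corpus:
            # Capture up to 30 characters sandwiched between the pair
            for match in re.finditer(
                re.escape(entity) + r"(.{1,30}?)" + re.escape(value), document
            ):
                templates.add(match.group(1))
    return templates

def apply_templates(templates, corpus):
    """Use the discovered templates to extract new (entity, value) pairs."""
    results = set()
    for template in templates:
        pattern = r"(\w+)" + re.escape(template) + r"(\w+)"
        for document in corpus:
            results.update(re.findall(pattern, document))
    return results

corpus = [
    "When Facebook was founded in 2004, the world changed.",
    "Back when Amazon was founded in 1994, e-commerce was new.",
    "When Walmart was founded in 1962, retail looked different.",
]
seeds = [("Facebook", "2004"), ("Amazon", "1994")]
templates = find_templates(seeds, corpus)  # discovers " was founded in "
print(apply_templates(templates, corpus))  # includes ('Walmart', '1962')
```

Given only the two seed pairs, the discovered template " was founded in " matches the Walmart sentence too, yielding a new pair the AI was never told about.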
01:24:56.250 --> 01:24:59.040 What I have here inside of the templates directory 01:24:59.040 --> 01:25:03.120 is a file called companies.csv, and this is all of the data 01:25:03.120 --> 01:25:04.520 that I am going to give to my AI. 01:25:04.520 --> 01:25:09.000 I'm going to give it the pair Amazon, 1994 and Facebook, 2004. 01:25:09.000 --> 01:25:11.190 And what I'm going to tell my AI to do is 01:25:11.190 --> 01:25:14.010 search a corpus of documents for other data-- 01:25:14.010 --> 01:25:16.620 other pairs like this, other relationships. 01:25:16.620 --> 01:25:18.990 I'm not telling the AI that this is a company and the date 01:25:18.990 --> 01:25:19.920 that it was founded. 01:25:19.920 --> 01:25:23.750 I'm just giving it Amazon, 1994 and Facebook, 2004 01:25:23.750 --> 01:25:25.550 and letting the AI do the rest. 01:25:25.550 --> 01:25:28.640 And what the AI is going to do is it's going to look through my corpus-- 01:25:28.640 --> 01:25:30.770 here's my corpus of documents-- 01:25:30.770 --> 01:25:33.590 and it's going to find, like inside of Business Insider, 01:25:33.590 --> 01:25:38.580 that we have sentences like, back when Amazon was founded in 1994, comma-- 01:25:38.580 --> 01:25:42.740 and that kind of phrasing is going to be similar to this Harvard Business Review 01:25:42.740 --> 01:25:46.935 story that has a sentence like, when Facebook was founded in 2004-- 01:25:46.935 --> 01:25:49.310 and it's going to look across a number of other documents 01:25:49.310 --> 01:25:53.820 for similar types of patterns to be able to extract that kind of information. 01:25:53.820 --> 01:25:56.450 And what it will do is, if I go ahead and run, 01:25:56.450 --> 01:25:58.660 I'll go ahead and go into templates. 01:25:58.660 --> 01:26:01.220 So I'll say python search.py. 
01:26:01.220 --> 01:26:05.030 I'm going to look for data like the data in companies.csv 01:26:05.030 --> 01:26:08.690 inside of the companies directory, which contains a whole bunch of news articles 01:26:08.690 --> 01:26:10.900 that I've curated in advance. 01:26:10.900 --> 01:26:12.080 And here's what I get-- 01:26:12.080 --> 01:26:15.560 Google 1998, Apple 1976, Microsoft 1975-- 01:26:15.560 --> 01:26:16.400 so on and so forth-- 01:26:16.400 --> 01:26:18.470 Walmart 1962, for example. 01:26:18.470 --> 01:26:20.810 These are all of the pieces of data that happen 01:26:20.810 --> 01:26:23.750 to match that same template that we were able to find before. 01:26:23.750 --> 01:26:25.430 And how was it able to find this? 01:26:25.430 --> 01:26:29.460 Well, it's probably because, if we look at the Forbes article, 01:26:29.460 --> 01:26:34.730 for example, that it has a phrase in it like, when Walmart was founded in 1962, 01:26:34.730 --> 01:26:38.000 comma-- that it's able to identify these sorts of patterns 01:26:38.000 --> 01:26:39.890 and extract information from them. 01:26:39.890 --> 01:26:42.650 Now, granted, I have curated all these stories in advance 01:26:42.650 --> 01:26:46.130 in order to make sure that there is data that it's able to match on. 01:26:46.130 --> 01:26:49.100 And in practice, it's not always going to be in this exact format 01:26:49.100 --> 01:26:52.430 when you're seeing a company related to the year in which it was founded, 01:26:52.430 --> 01:26:56.030 but if you give the AI access to enough data-- like all of the data of text 01:26:56.030 --> 01:26:58.910 on the internet-- and just have the AI crawl the internet looking 01:26:58.910 --> 01:27:02.720 for information, it can very reliably, or with some probability, 01:27:02.720 --> 01:27:05.780 try and extract information using these sorts of templates 01:27:05.780 --> 01:27:08.330 and be able to generate interesting sorts of knowledge. 
01:27:08.330 --> 01:27:10.940 And the more knowledge it learns, the more new templates 01:27:10.940 --> 01:27:13.190 it's able to construct, looking for constructions that 01:27:13.190 --> 01:27:15.930 show up in other locations as well. 01:27:15.930 --> 01:27:17.910 So let's take a look at another example. 01:27:17.910 --> 01:27:20.955 And here I'll show you presidents.csv, 01:27:20.955 --> 01:27:23.330 where I have two presidents and their inauguration date-- 01:27:23.330 --> 01:27:28.220 so George Washington 1789, Barack Obama 2009, for example. 01:27:28.220 --> 01:27:31.430 And I also am going to give to our AI a corpus that 01:27:31.430 --> 01:27:34.550 just contains a single document, which is the Wikipedia 01:27:34.550 --> 01:27:37.880 article for the list of presidents of the United States, for example-- 01:27:37.880 --> 01:27:39.680 just information about presidents. 01:27:39.680 --> 01:27:45.147 And I'd like to extract from this raw HTML document on a web page information 01:27:45.147 --> 01:27:45.980 about the presidents. 01:27:45.980 --> 01:27:50.460 So I can say search in presidents.csv. 01:27:50.460 --> 01:27:53.720 And what I get is a whole bunch of data about presidents 01:27:53.720 --> 01:27:56.300 and what year they were likely inaugurated, by looking 01:27:56.300 --> 01:27:58.010 for patterns that matched-- 01:27:58.010 --> 01:28:00.180 Barack Obama 2009, for example-- 01:28:00.180 --> 01:28:02.280 looking for these sorts of patterns that happen 01:28:02.280 --> 01:28:07.287 to give us some clues as to what it is that a story happens to be about. 01:28:07.287 --> 01:28:08.370 So here's another example. 01:28:08.370 --> 01:28:12.710 If I open up inside of the olympics directory, here is a scraped version 01:28:12.710 --> 01:28:15.050 of the Olympic home page that has information 01:28:15.050 --> 01:28:16.610 about various different Olympics. 
01:28:16.610 --> 01:28:20.360 And maybe I want to extract Olympic locations and years 01:28:20.360 --> 01:28:21.980 from this particular page. 01:28:21.980 --> 01:28:24.950 Well, the way I can do that is using the exact same algorithm. 01:28:24.950 --> 01:28:29.730 I'm just saying, all right, here are two Olympics and where they were located-- 01:28:29.730 --> 01:28:32.160 so 2012 London, for example. 01:28:32.160 --> 01:28:35.030 Let me go ahead and just run this process, 01:28:35.030 --> 01:28:39.440 Python search, on olympics.csv, look at all the Olympic data set, 01:28:39.440 --> 01:28:41.280 and here I get some information back. 01:28:41.280 --> 01:28:43.310 Now, this information-- not totally perfect. 01:28:43.310 --> 01:28:45.530 There are a couple of examples that are obviously not 01:28:45.530 --> 01:28:48.955 quite right, because my template might have been a little bit too general. 01:28:48.955 --> 01:28:51.080 Maybe it was looking for a broad category of things 01:28:51.080 --> 01:28:55.190 and certain strange things happened to capture on that particular template. 01:28:55.190 --> 01:28:58.730 So you could imagine adding rules to try and make this process more intelligent, 01:28:58.730 --> 01:29:02.000 making sure the thing on the left is just a year, for example-- 01:29:02.000 --> 01:29:04.280 and doing other sorts of analysis. 01:29:04.280 --> 01:29:07.040 But purely just based on some data, we are 01:29:07.040 --> 01:29:10.700 able to extract some interesting information using some algorithms. 
01:29:10.700 --> 01:29:16.100 And all search.py is really doing here is it is taking my corpus of data, 01:29:16.100 --> 01:29:18.260 finding templates that match it-- 01:29:18.260 --> 01:29:22.280 here, I'm filtering down to just the top two templates that happen to match-- 01:29:22.280 --> 01:29:26.960 and then using those templates to extract results from the data 01:29:26.960 --> 01:29:30.860 that I have access to, being able to look for all of the information 01:29:30.860 --> 01:29:31.670 that I care about. 01:29:31.670 --> 01:29:33.587 And that's ultimately what's going to help me 01:29:33.587 --> 01:29:38.390 to print out those results and figure out what the matches happen to be. 01:29:38.390 --> 01:29:41.090 And so information extraction is another powerful tool 01:29:41.090 --> 01:29:43.970 when it comes to trying to extract information. 01:29:43.970 --> 01:29:46.220 But of course, it only works in very limited contexts. 01:29:46.220 --> 01:29:49.640 It only works when I'm able to find templates that look exactly 01:29:49.640 --> 01:29:53.000 like this in order to come up with some sort of match that 01:29:53.000 --> 01:29:55.430 is able to connect this to some pair of data, 01:29:55.430 --> 01:29:57.890 that this company was founded in this year. 01:29:57.890 --> 01:30:01.670 What I might want to do, as we start to think about the semantics of words, 01:30:01.670 --> 01:30:04.880 is to begin to imagine some way of coming up with definitions 01:30:04.880 --> 01:30:08.120 for all words, being able to relate all of the words in a dictionary 01:30:08.120 --> 01:30:12.110 to each other, because that's ultimately what's going to be necessary if we want 01:30:12.110 --> 01:30:13.530 our AI to be able to communicate. 01:30:13.530 --> 01:30:18.500 We need some representation of what it is that words mean. 01:30:18.500 --> 01:30:22.340 And one approach to doing this is a famous data set called WordNet. 
01:30:22.340 --> 01:30:24.440 And what WordNet is is a human-curated data set-- 01:30:24.440 --> 01:30:27.380 researchers have curated together a whole bunch of words, 01:30:27.380 --> 01:30:29.595 their definitions, their various different senses-- 01:30:29.595 --> 01:30:31.970 because a word might have multiple different meanings-- 01:30:31.970 --> 01:30:35.347 and also how those words relate to one another. 01:30:35.347 --> 01:30:36.680 And so what we mean by this is-- 01:30:36.680 --> 01:30:38.750 I can show you an example of WordNet. 01:30:38.750 --> 01:30:40.550 WordNet comes built into NLTK. 01:30:40.550 --> 01:30:44.060 Using NLTK, you can download and access WordNet. 01:30:44.060 --> 01:30:48.080 So let me go into WordNet, and go ahead and run WordNet, 01:30:48.080 --> 01:30:52.100 and extract information about a word-- a word like city, for example. 01:30:52.100 --> 01:30:53.600 Go ahead and press Return. 01:30:53.600 --> 01:30:56.210 And here is the information that I get back about a city. 01:30:56.210 --> 01:30:59.360 It turns out that city has three different senses, three 01:30:59.360 --> 01:31:01.460 different meanings, according to WordNet. 01:31:01.460 --> 01:31:03.770 And it's really just kind of like a dictionary, where 01:31:03.770 --> 01:31:07.400 each sense is associated with its meaning-- just some definition 01:31:07.400 --> 01:31:08.810 provided by a human. 01:31:08.810 --> 01:31:13.130 And then it's also got categories, for example, that a word belongs to-- 01:31:13.130 --> 01:31:15.830 that a city is a type of municipality, a city 01:31:15.830 --> 01:31:18.150 is a type of administrative district. 01:31:18.150 --> 01:31:20.510 And that allows me to relate words to other words. 01:31:20.510 --> 01:31:24.380 So one of the powers of WordNet is the ability to take one word 01:31:24.380 --> 01:31:28.590 and connect it to other related words. 01:31:28.590 --> 01:31:33.380 If I do another example, let me try the word house, for instance. 
01:31:33.380 --> 01:31:36.690 I'll type in the word house and see what I get back. 01:31:36.690 --> 01:31:38.750 Well, all right, a house is a kind of building. 01:31:38.750 --> 01:31:42.160 A house is somehow related to a family unit. 01:31:42.160 --> 01:31:43.910 And so you might imagine trying to come up 01:31:43.910 --> 01:31:46.760 with these various different ways of describing a house. 01:31:46.760 --> 01:31:47.490 It is a building. 01:31:47.490 --> 01:31:48.500 It is a dwelling. 01:31:48.500 --> 01:31:51.110 And researchers have just curated these relationships 01:31:51.110 --> 01:31:55.100 between these various different words to say that a house is a type of building, 01:31:55.100 --> 01:31:58.890 that a house is a type of dwelling, for example. 01:31:58.890 --> 01:32:01.370 But this type of approach, while certainly 01:32:01.370 --> 01:32:04.640 helpful for being able to relate words to one another, 01:32:04.640 --> 01:32:06.920 doesn't scale particularly well. 01:32:06.920 --> 01:32:08.990 As you start to think about language changing, 01:32:08.990 --> 01:32:11.870 as you start to think about all the various different relationships 01:32:11.870 --> 01:32:16.070 that words might have to one another, this challenge of word representation 01:32:16.070 --> 01:32:18.200 ends up being difficult. What we've done is just 01:32:18.200 --> 01:32:23.450 defined a word as a sentence that explains what it is that that word is, 01:32:23.450 --> 01:32:26.030 but what we really would like is some way 01:32:26.030 --> 01:32:28.615 to represent the meaning of a word in a way 01:32:28.615 --> 01:32:31.240 that our AI is going to be able to do something useful with it. 
01:32:31.240 --> 01:32:33.830 Anytime we want our AI to be able to look at texts 01:32:33.830 --> 01:32:35.840 and really understand what that text means, 01:32:35.840 --> 01:32:38.360 to relate text and words to similar words 01:32:38.360 --> 01:32:40.700 and understand the relationship between words, 01:32:40.700 --> 01:32:44.745 we'd like some way that a computer can represent this information. 01:32:44.745 --> 01:32:46.620 And what we've seen all throughout the course 01:32:46.620 --> 01:32:48.800 multiple times now is the idea that, when 01:32:48.800 --> 01:32:51.110 we want our AI to represent something, it 01:32:51.110 --> 01:32:54.890 can be helpful to have the AI represent it using numbers-- 01:32:54.890 --> 01:32:57.530 that we've seen that we can represent utilities in a game, 01:32:57.530 --> 01:32:59.900 like winning, or losing, or drawing, as a number-- 01:32:59.900 --> 01:33:01.520 1, negative 1, or 0. 01:33:01.520 --> 01:33:04.400 We've seen other ways that we can take data and turn it 01:33:04.400 --> 01:33:06.650 into a vector of features, where we just have 01:33:06.650 --> 01:33:11.270 a whole bunch of numbers that represent some particular piece of data. 01:33:11.270 --> 01:33:14.340 And if we ever want to pass words into a neural network, 01:33:14.340 --> 01:33:16.580 for instance, to be able to say, given some word, 01:33:16.580 --> 01:33:18.650 translate this sentence into another sentence, 01:33:18.650 --> 01:33:21.890 or to be able to do interesting classifications with neural networks 01:33:21.890 --> 01:33:26.000 on individual words, we need some representation of words 01:33:26.000 --> 01:33:27.980 just in terms of vectors-- 01:33:27.980 --> 01:33:31.820 some way to represent words just by using individual numbers 01:33:31.820 --> 01:33:34.495 to define the meaning of a word. 01:33:34.495 --> 01:33:35.370 So how do we do that? 
01:33:35.370 --> 01:33:37.767 How do we take words and turn them into vectors 01:33:37.767 --> 01:33:40.100 that we can use to represent the meaning of those words? 01:33:40.100 --> 01:33:42.110 Well, one way is to do this. 01:33:42.110 --> 01:33:46.280 If I have four words that I want to encode, like he wrote a book, 01:33:46.280 --> 01:33:49.250 I can just say, let's let the word he be this vector-- 01:33:49.250 --> 01:33:51.470 1, 0, 0, 0. 01:33:51.470 --> 01:33:53.990 Wrote will be 0, 1, 0, 0. 01:33:53.990 --> 01:33:56.390 A will be 0, 0, 1, 0. 01:33:56.390 --> 01:33:59.570 Book will be 0, 0, 0, 1. 01:33:59.570 --> 01:34:03.410 Effectively, what I have here is what's known as a one-hot representation 01:34:03.410 --> 01:34:06.930 or a one-hot encoding, which is a representation of meaning, 01:34:06.930 --> 01:34:10.580 where meaning is a vector that has a single 1 in it and the rest are 0's. 01:34:10.580 --> 01:34:14.540 The location of the 1 tells me the meaning of the word-- 01:34:14.540 --> 01:34:17.020 that a 1 in the first position, that means he-- 01:34:17.020 --> 01:34:19.510 a 1 in the second position, that means wrote. 01:34:19.510 --> 01:34:21.740 And every word in the dictionary is going 01:34:21.740 --> 01:34:24.770 to be assigned to some representation like this, where we just 01:34:24.770 --> 01:34:28.320 assign one place in the vector that has a 1 for the word 01:34:28.320 --> 01:34:29.450 and 0 for the other words. 01:34:29.450 --> 01:34:31.580 And now I have representations of words that 01:34:31.580 --> 01:34:33.710 are different for a whole bunch of different words. 01:34:33.710 --> 01:34:36.853 This is this one-hot representation. 01:34:36.853 --> 01:34:38.270 So what are the drawbacks of this? 01:34:38.270 --> 01:34:40.970 Why is this not necessarily a great approach? 01:34:40.970 --> 01:34:42.980 Well, here, I am only creating enough vectors 01:34:42.980 --> 01:34:45.530 to represent four words in a dictionary. 
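The one-hot scheme just described is simple enough to build in a few lines (a minimal sketch over the four-word example from the lecture):

```python
def one_hot_encode(vocabulary):
    """Map each word to a vector with a single 1 at that word's own position."""
    return {
        word: [1 if i == j else 0 for j in range(len(vocabulary))]
        for i, word in enumerate(vocabulary)
    }

vectors = one_hot_encode(["he", "wrote", "a", "book"])
print(vectors["he"])     # [1, 0, 0, 0]
print(vectors["wrote"])  # [0, 1, 0, 0]
print(vectors["book"])   # [0, 0, 0, 1]
```

Note that each vector's length equals the vocabulary size, which is exactly the scaling problem raised next.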
01:34:45.530 --> 01:34:49.580 If you imagine a dictionary with 50,000 words that I might want to represent, 01:34:49.580 --> 01:34:51.590 now these vectors get enormously long. 01:34:51.590 --> 01:34:54.800 These are 50,000 dimensional vectors to represent 01:34:54.800 --> 01:34:58.940 a vocabulary of 50,000 words-- that he is a 1 followed by all these 0's. 01:34:58.940 --> 01:35:01.280 Wrote has a whole bunch of 0's in it. 01:35:01.280 --> 01:35:05.070 That's not a particularly tractable way of trying to represent words, 01:35:05.070 --> 01:35:09.860 if I'm going to have to deal with vectors of length 50,000. 01:35:09.860 --> 01:35:12.140 Another problem-- a subtler problem-- 01:35:12.140 --> 01:35:14.870 is that ideally, I'd like for these vectors 01:35:14.870 --> 01:35:17.960 to somehow represent meaning in a way that I can extract 01:35:17.960 --> 01:35:21.740 useful information out of-- that if I have the sentence he wrote a book 01:35:21.740 --> 01:35:26.270 and he authored a novel, well, wrote and authored are going to be two 01:35:26.270 --> 01:35:28.040 totally different vectors. 01:35:28.040 --> 01:35:32.180 And book and novel are going to be two totally different vectors inside 01:35:32.180 --> 01:35:35.030 of my vector space that have nothing to do with each other. 01:35:35.030 --> 01:35:38.420 The 1 is just located in a different position. 01:35:38.420 --> 01:35:40.790 And really, what I would like to have happen 01:35:40.790 --> 01:35:43.600 is for wrote and authored to have vectors 01:35:43.600 --> 01:35:47.020 that are similar to one another, and for book and novel 01:35:47.020 --> 01:35:49.900 to have vector representations that are similar to one another, 01:35:49.900 --> 01:35:52.780 because they are words that have similar meanings. 
01:35:52.780 --> 01:35:56.320 Because their meanings are similar, ideally, I'd like for-- 01:35:56.320 --> 01:35:59.860 when I put them in vector form and use a vector to represent meanings, 01:35:59.860 --> 01:36:04.400 I would like for those vectors to be similar to one another as well. 01:36:04.400 --> 01:36:06.640 So rather than this one-hot representation, 01:36:06.640 --> 01:36:10.000 where we represent a word's meaning by just giving it a vector that is one 01:36:10.000 --> 01:36:12.620 in a particular location, what we're going to do-- 01:36:12.620 --> 01:36:15.400 which is a bit of a strange thing the first time you see it-- 01:36:15.400 --> 01:36:18.640 is what we're going to call a distributed representation. 01:36:18.640 --> 01:36:21.580 We are going to represent the meaning of a word as just 01:36:21.580 --> 01:36:25.330 a whole bunch of different values-- not just a single 1 and the rest 0's, 01:36:25.330 --> 01:36:26.630 but a whole bunch of values. 01:36:26.630 --> 01:36:31.240 So for example, in he wrote a book, he might just be a big vector. 01:36:31.240 --> 01:36:34.510 Maybe it's 50 dimensions, maybe it's 100 dimensions, but certainly fewer 01:36:34.510 --> 01:36:39.430 than like tens of thousands, where each value is just some number-- 01:36:39.430 --> 01:36:42.160 and same thing for wrote, and a, and book. 01:36:42.160 --> 01:36:45.070 And the idea now is that, using these vector representations, 01:36:45.070 --> 01:36:48.850 I'd hope that wrote and authored have vector representations that 01:36:48.850 --> 01:36:50.317 are pretty close to one another. 01:36:50.317 --> 01:36:52.900 Their distance is not too far apart-- and same with the vector 01:36:52.900 --> 01:36:56.230 representations for book and novel. 
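One common way to measure how "close" two distributed word vectors are is cosine similarity. Here is a minimal sketch; the three-dimensional vectors below are invented purely for illustration (real distributed representations have 50 to 100 or more dimensions and are learned from data):

```python
import math

def cosine_similarity(u, v):
    """Similarity of two vectors: near 1.0 for similar directions, near 0 for unrelated."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy distributed representations (values made up for illustration only)
book = [0.9, 0.1, 0.8]
novel = [0.8, 0.2, 0.9]
breakfast = [0.1, 0.9, 0.0]

print(cosine_similarity(book, novel))      # close to 1: similar meanings
print(cosine_similarity(book, breakfast))  # much smaller: unrelated meanings
```

With one-hot vectors, any two distinct words would have cosine similarity exactly 0, which is precisely why distributed representations are needed.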
01:36:56.230 --> 01:37:00.940 So this is going to be the goal of a lot of what statistical machine learning 01:37:00.940 --> 01:37:02.710 approaches to natural language processing 01:37:02.710 --> 01:37:06.760 are about: using these vector representations of words. 01:37:06.760 --> 01:37:10.190 But how on earth do we define a word as just a whole bunch 01:37:10.190 --> 01:37:11.440 of numbers in a sequence? 01:37:11.440 --> 01:37:16.668 What does it even mean to talk about the meaning of a word? 01:37:16.668 --> 01:37:18.460 The famous quote that answers this question 01:37:18.460 --> 01:37:22.930 is from a British linguist in the 1950s, J.R. Firth, who said, "You shall 01:37:22.930 --> 01:37:25.060 know a word by the company it keeps." 01:37:28.150 --> 01:37:30.400 And what we mean by that is the idea that we 01:37:30.400 --> 01:37:35.290 can define a word in terms of the words that show up around it, that we can get 01:37:35.290 --> 01:37:39.070 at the meaning of a word based on the context in which that word happens 01:37:39.070 --> 01:37:40.370 to appear. 01:37:40.370 --> 01:37:43.900 That if I have a sentence like this, four words in sequence-- 01:37:43.900 --> 01:37:46.180 for blank he ate-- 01:37:46.180 --> 01:37:47.442 what goes in the blank? 01:37:47.442 --> 01:37:49.150 Well, you might imagine that, in English, 01:37:49.150 --> 01:37:52.192 the types of words that might fill in the blank are words like breakfast, 01:37:52.192 --> 01:37:53.170 or lunch, or dinner. 01:37:53.170 --> 01:37:56.480 These are the kinds of words that fill in that blank. 
01:37:56.480 --> 01:38:00.730 And so if we want to define what lunch or dinner means, 01:38:00.730 --> 01:38:03.970 we can define it in terms of what words happen 01:38:03.970 --> 01:38:07.030 to show up around it-- that if a word shows up 01:38:07.030 --> 01:38:09.700 in a particular context and another word happens to show up 01:38:09.700 --> 01:38:13.750 in very similar contexts, then those two words are probably 01:38:13.750 --> 01:38:15.040 related to each other. 01:38:15.040 --> 01:38:18.280 They probably have a similar meaning to one another. 01:38:18.280 --> 01:38:20.950 And this then is the foundational idea of an algorithm 01:38:20.950 --> 01:38:24.760 known as word2vec, which is a model for generating word vectors. 01:38:24.760 --> 01:38:28.960 You give word2vec a corpus of documents, just a whole bunch of text, 01:38:28.960 --> 01:38:34.832 and what word2vec will produce is vectors for each word. 01:38:34.832 --> 01:38:36.790 And there are a number of ways that it can do this. 01:38:36.790 --> 01:38:40.300 One common way is through what's known as the skip-gram architecture, which 01:38:40.300 --> 01:38:44.470 basically uses a neural network to predict context words, 01:38:44.470 --> 01:38:47.240 given a target word-- so given a word like lunch, 01:38:47.240 --> 01:38:50.350 use a neural network to try and predict, given the word lunch, what 01:38:50.350 --> 01:38:53.190 words are going to show up around it. 01:38:53.190 --> 01:38:55.210 And so the way we might represent this is 01:38:55.210 --> 01:38:57.760 with a big neural network like this, where 01:38:57.760 --> 01:39:00.820 we have one input cell for every word. 01:39:00.820 --> 01:39:04.900 Every word gets one node inside this neural network. 01:39:04.900 --> 01:39:07.780 And the goal is to use this neural network to predict, 01:39:07.780 --> 01:39:09.790 given a target word, a context word. 
01:39:09.790 --> 01:39:14.030 Given a word like lunch, can I predict the probabilities of other words 01:39:14.030 --> 01:39:18.560 showing up in a context of one word away or two words away, for instance, 01:39:18.560 --> 01:39:21.970 in some sort of window of context? 01:39:21.970 --> 01:39:27.400 And if you just give the AI, this neural network, a whole bunch of data of words 01:39:27.400 --> 01:39:30.790 and what words show up in context, you can train a neural network 01:39:30.790 --> 01:39:34.600 to do this calculation, to be able to predict, given a target word-- 01:39:34.600 --> 01:39:39.103 can I predict what those context words ultimately should be? 01:39:39.103 --> 01:39:41.020 And it will do so using the same methods we've 01:39:41.020 --> 01:39:43.850 talked about-- back-propagating the error from the context word 01:39:43.850 --> 01:39:46.090 back through this neural network. 01:39:46.090 --> 01:39:48.790 And what you get is, if we use a single layer-- 01:39:48.790 --> 01:39:50.950 just a single layer of hidden nodes-- 01:39:50.950 --> 01:39:54.960 what I get is, for every single one of these words, I get-- 01:39:54.960 --> 01:39:59.680 from this word, for example, I get five edges, each of which 01:39:59.680 --> 01:40:02.695 has a weight to each of these five hidden nodes. 01:40:02.695 --> 01:40:05.950 In other words, I get five numbers that effectively 01:40:05.950 --> 01:40:10.180 are going to represent this particular target word here. 01:40:10.180 --> 01:40:13.750 And the number of hidden nodes I choose in this middle layer here-- 01:40:13.750 --> 01:40:14.420 I can pick that. 01:40:14.420 --> 01:40:17.830 Maybe I'll choose to have 50 hidden nodes or 100 hidden nodes. 
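The skip-gram setup described here, predicting the words within some window around a target word, starts from (target, context) training pairs extracted from a corpus. A minimal sketch of that extraction step (the sentence and window size are illustrative assumptions, not the lecture's code):

```python
# Build skip-gram training pairs: for each target word, every word
# within `window` positions of it becomes a context word to predict.

def skipgram_pairs(tokens, window=2):
    """Yield (target, context) pairs for every word in the token list."""
    pairs = []
    for i, target in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

sentence = "for lunch he ate".split()
print(skipgram_pairs(sentence, window=1))
# [('for', 'lunch'), ('lunch', 'for'), ('lunch', 'he'),
#  ('he', 'lunch'), ('he', 'ate'), ('ate', 'he')]
```

The neural network is then trained on these pairs, adjusting its weights so that each target word's input edges come to predict its observed context words.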
01:40:17.830 --> 01:40:19.720 And then, for each of these target words, 01:40:19.720 --> 01:40:22.630 I'll have 50 different values or 100 different values, 01:40:22.630 --> 01:40:26.050 and those values we can effectively treat as the vector 01:40:26.050 --> 01:40:29.320 numerical representation of that word. 01:40:29.320 --> 01:40:33.520 And the general idea here is that, if words are similar-- 01:40:33.520 --> 01:40:37.660 two words show up in similar contexts, meaning, using those same target words, 01:40:37.660 --> 01:40:40.380 I'd like to predict similar context words-- 01:40:40.380 --> 01:40:43.180 well, then these vectors and these values I choose in these vectors 01:40:43.180 --> 01:40:45.940 here-- these numerical values for the weights of these edges-- 01:40:45.940 --> 01:40:49.180 are probably going to be similar, because for two different words that 01:40:49.180 --> 01:40:51.580 show up in similar contexts, I would like 01:40:51.580 --> 01:40:55.030 for these values that are calculated to ultimately 01:40:55.030 --> 01:40:58.250 be very similar to one another. 01:40:58.250 --> 01:41:01.030 And so ultimately, the high-level way you can picture this 01:41:01.030 --> 01:41:02.980 is that what this word2vec training method is 01:41:02.980 --> 01:41:06.790 going to do is, given a whole bunch of words, where initially, 01:41:06.790 --> 01:41:09.430 recall, we initialize these weights randomly, just picking 01:41:09.430 --> 01:41:11.650 random weights to start. 01:41:11.650 --> 01:41:14.050 Over time, as we train the neural network, 01:41:14.050 --> 01:41:17.680 we're going to adjust these weights, adjust the vector representations 01:41:17.680 --> 01:41:20.860 of each of these words so that gradually, 01:41:20.860 --> 01:41:24.970 words that show up in similar contexts grow closer to one another, 01:41:24.970 --> 01:41:27.190 and words that show up in different contexts 01:41:27.190 --> 01:41:29.210 get farther away from one another. 
01:41:29.210 --> 01:41:32.890 And as a result, hopefully I get vector representations 01:41:32.890 --> 01:41:36.760 of words like breakfast, and lunch, and dinner that are similar to one another, 01:41:36.760 --> 01:41:39.100 and then words like book, and memoir, and novel 01:41:39.100 --> 01:41:42.830 are also going to be similar to one another as well. 01:41:42.830 --> 01:41:46.510 So using this algorithm, we're able to take a corpus of data 01:41:46.510 --> 01:41:50.230 and just train our computer, train this neural network, to be able to figure out 01:41:50.230 --> 01:41:52.650 what vector, what sequence of numbers, is going 01:41:52.650 --> 01:41:55.900 to represent each of these words-- which is, again, a bit of a strange concept 01:41:55.900 --> 01:41:59.450 to think about, representing a word just as a whole bunch of numbers. 01:41:59.450 --> 01:42:02.860 But we'll see in a moment just how powerful this really can be. 01:42:02.860 --> 01:42:08.290 So we'll go ahead and go into vectors, and what I have inside of vectors.py-- 01:42:08.290 --> 01:42:09.910 which I'll open up now-- 01:42:09.910 --> 01:42:14.800 is I'm opening up words.txt, which is a pretrained model that just-- 01:42:14.800 --> 01:42:17.230 I've already run word2vec and it's already given me 01:42:17.230 --> 01:42:19.810 a whole bunch of vectors for each of these possible words. 01:42:19.810 --> 01:42:22.330 And I'm just going to take like 50,000 of them 01:42:22.330 --> 01:42:26.420 and go ahead and save their vectors inside of a dictionary called words. 01:42:26.420 --> 01:42:29.260 And then I've also defined some functions called distance; 01:42:29.260 --> 01:42:33.820 closest_words, which will get me the closest words to a particular word; 01:42:33.820 --> 01:42:38.390 and then closest_word, which just gets me the one closest word, for example. 01:42:38.390 --> 01:42:39.860 And so now let me try doing this. 
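Loading pretrained vectors into a dictionary called words, as vectors.py is described as doing, might look roughly like the sketch below. The file format assumed here (each line holding a word followed by the numbers of its vector) is a guess at what words.txt contains, and the demonstration writes a tiny stand-in file rather than using the real one:

```python
# Sketch: read up to `limit` pretrained word vectors into a dictionary.
# Assumed format: one word per line, the word first, then its numbers.
import os
import tempfile

def load_vectors(path, limit=50000):
    words = {}
    with open(path) as f:
        for i, line in enumerate(f):
            if i >= limit:
                break
            parts = line.split()
            words[parts[0]] = [float(x) for x in parts[1:]]
    return words

# Tiny demonstration file standing in for the real words.txt:
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write("city 0.1 -0.4 0.7\nhouse 0.2 -0.3 0.6\n")
    path = f.name

words = load_vectors(path)
print(words["city"])  # [0.1, -0.4, 0.7]
os.unlink(path)
```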
01:42:39.860 --> 01:42:43.180 Let me open up the Python interpreter and say something like, 01:42:43.180 --> 01:42:46.080 from vectors import star-- 01:42:46.080 --> 01:42:48.590 just import everything from vectors. 01:42:48.590 --> 01:42:51.700 And now let's take a look at the meanings of some words. 01:42:51.700 --> 01:42:55.760 Let me look at the word city, for example. 01:42:55.760 --> 01:43:01.130 And here is a big array that is the vector representation of the word 01:43:01.130 --> 01:43:01.630 city. 01:43:01.630 --> 01:43:04.755 And this doesn't mean anything, in terms of what these numbers exactly are, 01:43:04.755 --> 01:43:07.390 but this is how my computer is representing 01:43:07.390 --> 01:43:08.990 the meaning of the word city. 01:43:08.990 --> 01:43:11.200 We can do a different word, like the word house, 01:43:11.200 --> 01:43:14.860 and here then is the vector representation of the word house, 01:43:14.860 --> 01:43:17.140 for example-- just a whole bunch of numbers. 01:43:17.140 --> 01:43:20.650 And this is encoding somehow the meaning of the word house. 01:43:20.650 --> 01:43:22.390 And how do I get at that idea? 01:43:22.390 --> 01:43:24.880 Well, one way to measure how good this is is by looking at, 01:43:24.880 --> 01:43:29.282 what is the distance between various different words? 01:43:29.282 --> 01:43:31.240 There are a number of ways you can define distance. 01:43:31.240 --> 01:43:33.310 In the context of vectors, one common way is what's 01:43:33.310 --> 01:43:35.860 known as the cosine distance, which has to do with measuring 01:43:35.860 --> 01:43:37.580 the angle between vectors. 01:43:37.580 --> 01:43:40.150 But in short, it's just measuring, how far apart 01:43:40.150 --> 01:43:42.710 are these two vectors from each other? 01:43:42.710 --> 01:43:47.210 So if I take a word like the word book, how far away is it from itself-- 01:43:47.210 --> 01:43:49.540 how far away is the word book from book-- 01:43:49.540 --> 01:43:50.440 well, that's zero. 
01:43:50.440 --> 01:43:54.400 The word book is zero distance away from itself. 01:43:54.400 --> 01:43:59.180 But let's see how far away the word book is from a word like breakfast, 01:43:59.180 --> 01:44:03.790 where we're going to say one is very far away, zero is not far away. 01:44:03.790 --> 01:44:07.430 All right, book is about 0.64 away from breakfast. 01:44:07.430 --> 01:44:09.560 They seem to be pretty far apart. 01:44:09.560 --> 01:44:12.920 But let's now try and calculate the distance from the word book 01:44:12.920 --> 01:44:16.842 to the word novel, for example. 01:44:16.842 --> 01:44:18.800 Now, those two words are closer to each other-- 01:44:18.800 --> 01:44:19.730 0.34. 01:44:19.730 --> 01:44:21.950 The vector representation of the word book 01:44:21.950 --> 01:44:25.190 is closer to the vector representation of the word novel 01:44:25.190 --> 01:44:28.350 than it is to the vector representation of the word breakfast. 01:44:28.350 --> 01:44:34.010 And I can do the same thing and, say, compare breakfast to lunch, 01:44:34.010 --> 01:44:35.765 for example. 01:44:35.765 --> 01:44:37.640 And those two words are even closer together. 01:44:37.640 --> 01:44:40.010 They have an even more similar relationship 01:44:40.010 --> 01:44:42.470 between one word and another. 01:44:42.470 --> 01:44:45.500 So now it seems we have some representation of words, 01:44:45.500 --> 01:44:49.610 representing a word using vectors, that allows us to be able to say something 01:44:49.610 --> 01:44:52.340 like words that are similar to each other 01:44:52.340 --> 01:44:55.940 ultimately have a smaller distance between them. 01:44:55.940 --> 01:44:58.070 And this turns out to be incredibly powerful to be 01:44:58.070 --> 01:45:01.760 able to represent the meaning of words in terms of their relationships 01:45:01.760 --> 01:45:03.620 to other words as well. 
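One way to picture the distance function used here is the following sketch of cosine distance, where identical vectors are 0 apart and unrelated ones approach 1. The three-dimensional example vectors are invented for illustration; real word vectors are much longer, and this may differ in detail from the lecture's distance function:

```python
# Cosine distance: 1 minus the cosine of the angle between two vectors.
from math import sqrt

def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return 1 - dot / (norm_u * norm_v)

# Toy vectors: book and novel point in similar directions,
# breakfast in a rather different one.
book = [0.5, 0.8, -0.1]
novel = [0.4, 0.9, -0.2]
breakfast = [-0.7, 0.1, 0.6]

print(cosine_distance(book, book))  # approximately 0.0
print(cosine_distance(book, novel) < cosine_distance(book, breakfast))  # True
```

Because only the angle matters, two vectors of very different lengths can still have distance near zero if they point the same way.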
01:45:03.620 --> 01:45:05.000 I can tell you as well-- 01:45:05.000 --> 01:45:06.980 I have a function called closest_words that 01:45:06.980 --> 01:45:09.320 basically just takes a word 01:45:09.320 --> 01:45:11.520 and gets all the closest words to it. 01:45:11.520 --> 01:45:15.980 So let me get the closest words to book, for example, 01:45:15.980 --> 01:45:18.500 and maybe get the 10 closest words. 01:45:18.500 --> 01:45:20.950 We'll limit ourselves to 10. 01:45:20.950 --> 01:45:21.450 And, right, 01:45:21.450 --> 01:45:24.420 book is obviously closest to itself-- the word book-- 01:45:24.420 --> 01:45:27.630 but it's also closely related to books, and essay, and memoir, and essays, 01:45:27.630 --> 01:45:29.450 and novella, and anthology. 01:45:29.450 --> 01:45:32.370 And why are these the words it computed as being close to it? 01:45:32.370 --> 01:45:34.710 Well, because based on the corpus of information 01:45:34.710 --> 01:45:38.220 that this algorithm was trained on, the vectors 01:45:38.220 --> 01:45:41.270 arose based on what words show up in a similar context-- 01:45:41.270 --> 01:45:45.420 that the word book shows up in similar contexts to words 01:45:45.420 --> 01:45:47.730 like memoir and essays, for example. 01:45:47.730 --> 01:45:49.110 And if I do something like-- 01:45:49.110 --> 01:45:53.740 let me get the closest words to city-- 01:45:53.740 --> 01:45:56.800 you end up getting city, town, township, village. 01:45:56.800 --> 01:46:02.200 These are words that happen to show up in a similar context to the word city. 01:46:02.200 --> 01:46:05.787 Now, where things get really interesting is that, because these are vectors, 01:46:05.787 --> 01:46:07.120 we can do mathematics with them. 01:46:07.120 --> 01:46:11.210 We can calculate the relationships between various different words. 01:46:11.210 --> 01:46:16.240 So I can say something like, all right, what if I had man and king? 
01:46:16.240 --> 01:46:18.790 These are two different vectors, and this is a famous example 01:46:18.790 --> 01:46:20.950 that comes out of word2vec. 01:46:20.950 --> 01:46:24.920 I can take these two vectors and just subtract them from each other. 01:46:24.920 --> 01:46:28.040 This line here, the distance here, is another vector 01:46:28.040 --> 01:46:30.430 that represents king minus man. 01:46:30.430 --> 01:46:33.123 Now, what does it mean to take a word and subtract another word? 01:46:33.123 --> 01:46:34.540 Normally, that doesn't make sense. 01:46:34.540 --> 01:46:37.082 In the world of vectors, though, you can take some vector, some 01:46:37.082 --> 01:46:40.090 sequence of numbers, subtract some other sequence of numbers, 01:46:40.090 --> 01:46:43.240 and get a new vector, get a new sequence of numbers. 01:46:43.240 --> 01:46:46.690 And what this new sequence of numbers is effectively going to do 01:46:46.690 --> 01:46:52.000 is it is going to tell me, what do I need to do to get from man to king? 01:46:52.000 --> 01:46:54.640 What is the relationship then between these two words? 01:46:54.640 --> 01:46:58.120 And this is some vector representation of what 01:46:58.120 --> 01:47:00.640 takes us from man to king. 01:47:00.640 --> 01:47:04.730 And we can then take this value and add it to another vector. 01:47:04.730 --> 01:47:07.700 You might imagine that the word woman, for example, 01:47:07.700 --> 01:47:10.330 is another vector that exists somewhere inside of this space, 01:47:10.330 --> 01:47:12.430 somewhere inside of this vector space. 01:47:12.430 --> 01:47:15.550 And what might happen if I took this same idea, king 01:47:15.550 --> 01:47:19.930 minus man-- took that same vector and just added it to woman? 01:47:19.930 --> 01:47:22.480 What will we find around here? 
01:47:22.480 --> 01:47:24.230 It's an interesting question we might ask, 01:47:24.230 --> 01:47:27.700 and we can answer it very easily, because I have vector representations 01:47:27.700 --> 01:47:30.500 of all of these things. 01:47:30.500 --> 01:47:31.660 Let's go back here. 01:47:31.660 --> 01:47:34.690 Let me look at the representation of the word man. 01:47:34.690 --> 01:47:36.887 Here's the vector representation of man. 01:47:36.887 --> 01:47:38.970 Let's look at the representation of the word king. 01:47:38.970 --> 01:47:41.222 Here's the representation of the word king. 01:47:41.222 --> 01:47:42.430 And I can subtract these two. 01:47:42.430 --> 01:47:46.260 What is the vector representation of king minus man? 01:47:46.260 --> 01:47:48.250 It's this array right here-- 01:47:48.250 --> 01:47:49.600 whole bunch of values. 01:47:49.600 --> 01:47:53.620 So king minus man now represents the relationship between king and man 01:47:53.620 --> 01:47:55.940 in some sort of numerical vector format. 01:47:55.940 --> 01:48:00.170 So what happens then if I add woman to that? 01:48:00.170 --> 01:48:04.640 Whatever took us from man to king, go ahead and apply that same vector 01:48:04.640 --> 01:48:07.520 to the vector representation of the word woman, 01:48:07.520 --> 01:48:10.960 and that gives us this vector here. 01:48:10.960 --> 01:48:15.130 And now, just out of curiosity, let's take this expression 01:48:15.130 --> 01:48:20.720 and find, what is the closest word to that expression? 01:48:20.720 --> 01:48:25.130 And amazingly, what we get is we get the word queen-- 01:48:25.130 --> 01:48:28.820 that somehow, when you take the distance between man and king-- 01:48:28.820 --> 01:48:32.090 this numerical representation of how man is related to king-- 01:48:32.090 --> 01:48:34.780 and add that same notion, king minus man, 01:48:34.780 --> 01:48:37.100 to the vector representation of the word woman. 
01:48:37.100 --> 01:48:40.790 What we get is we get the vector representation, or something close 01:48:40.790 --> 01:48:43.490 to the vector representation, of the word queen, 01:48:43.490 --> 01:48:48.130 because this distance somehow encoded the relationship between these two 01:48:48.130 --> 01:48:48.630 words. 01:48:48.630 --> 01:48:50.422 And when you run it through this algorithm, 01:48:50.422 --> 01:48:53.240 it's not programmed to do this, but if you just try and figure 01:48:53.240 --> 01:48:55.700 out how to predict words based on context words, 01:48:55.700 --> 01:48:59.960 you get vectors that are able to make these SAT-like analogies out 01:48:59.960 --> 01:49:02.232 of the information that has been given. 01:49:02.232 --> 01:49:03.690 So there are more examples of this. 01:49:03.690 --> 01:49:06.230 We can say, all right, let's figure out, what 01:49:06.230 --> 01:49:10.790 is the distance between Paris and France? 01:49:10.790 --> 01:49:12.580 So Paris and France are words. 01:49:12.580 --> 01:49:14.390 They each have a vector representation. 01:49:14.390 --> 01:49:18.680 This then is a vector representation of the distance between Paris and France-- 01:49:18.680 --> 01:49:21.530 what takes us from France to Paris. 01:49:21.530 --> 01:49:26.540 And let me go ahead and add the vector representation of England to that. 01:49:26.540 --> 01:49:29.690 So this then is the vector representation 01:49:29.690 --> 01:49:35.470 of going Paris minus France plus England-- 01:49:35.470 --> 01:49:38.130 so the distance between France and Paris as vectors. 01:49:38.130 --> 01:49:40.860 Add the England vector, and let's go ahead 01:49:40.860 --> 01:49:43.860 and find the closest word to that. 01:49:47.080 --> 01:49:48.550 And it turns out to be London. 01:49:48.550 --> 01:49:51.610 You do this relationship, the relationship between France and Paris. 
01:49:51.610 --> 01:49:55.000 Go ahead and add the England vector to it, and the closest vector to that 01:49:55.000 --> 01:49:57.120 happens to be the vector for the word London. 01:49:57.120 --> 01:49:58.120 We can do more examples. 01:49:58.120 --> 01:50:00.700 I can say, let's take the word for teacher-- 01:50:00.700 --> 01:50:03.700 that vector representation-- and let me subtract 01:50:03.700 --> 01:50:05.470 the vector representation of school. 01:50:05.470 --> 01:50:09.310 So what I'm left with is, what takes us from school to teacher? 01:50:09.310 --> 01:50:14.050 And apply that vector to a word like hospital and see, 01:50:14.050 --> 01:50:15.670 what is the closest word to that-- 01:50:15.670 --> 01:50:17.680 turns out the closest word is nurse. 01:50:17.680 --> 01:50:23.400 Let's try a couple more examples-- the word ramen, for example. 01:50:23.400 --> 01:50:25.610 Subtract the word Japan. 01:50:25.610 --> 01:50:28.150 So what is the relationship between Japan and ramen? 01:50:28.150 --> 01:50:30.310 Add the word for America to that. 01:50:30.310 --> 01:50:33.340 Want to take a guess as to what you might get as a result? 01:50:33.340 --> 01:50:35.840 Turns out you get burritos as the result. 01:50:35.840 --> 01:50:38.050 If you do the subtraction, do the addition, 01:50:38.050 --> 01:50:42.080 this is the answer that you happen to get as a consequence of this as well. 01:50:42.080 --> 01:50:44.703 So these very interesting analogies arise 01:50:44.703 --> 01:50:46.620 in the relationships between these words-- 01:50:46.620 --> 01:50:50.420 that if you just map out all of these words into a vector space, 01:50:50.420 --> 01:50:54.380 you can get some pretty interesting results as a consequence of that. 
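The king minus man plus woman arithmetic walked through above can be sketched with toy vectors. These 3-dimensional embeddings are invented for illustration (real word2vec vectors have far more dimensions and would not line up this neatly), and closest_word here is a brute-force stand-in for the lecture's function of the same name:

```python
# Word-analogy arithmetic with invented toy embeddings.
from math import sqrt

embeddings = {
    "man":   [1.0, 0.0, 0.2],
    "woman": [1.0, 1.0, 0.2],
    "king":  [2.0, 0.0, 0.9],
    "queen": [2.0, 1.0, 0.9],
    "apple": [-1.0, 0.3, -0.5],
}

def euclidean(u, v):
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def closest_word(vector):
    """Return the vocabulary word whose vector is nearest to `vector`."""
    return min(embeddings, key=lambda w: euclidean(embeddings[w], vector))

# king - man + woman: apply the man-to-king direction to woman.
target = [k - m + w for k, m, w in
          zip(embeddings["king"], embeddings["man"], embeddings["woman"])]
print(closest_word(target))  # queen
```

With real trained vectors the result is only approximately queen, which is why the nearest-word search matters: the analogy lands near, not exactly on, the answer.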
01:50:54.380 --> 01:50:58.360 And this idea of representing words as vectors turns out 01:50:58.360 --> 01:51:01.300 to be incredibly useful and powerful anytime 01:51:01.300 --> 01:51:04.420 we want to be able to do some statistical work with 01:51:04.420 --> 01:51:06.910 regards to natural language-- to be able 01:51:06.910 --> 01:51:09.350 to represent words not just as their characters, 01:51:09.350 --> 01:51:12.280 but to represent them as numbers, numbers that say something 01:51:12.280 --> 01:51:14.910 or mean something about the words themselves, 01:51:14.910 --> 01:51:18.250 and somehow relate the meaning of a word to other words that 01:51:18.250 --> 01:51:19.920 might happen to exist-- 01:51:19.920 --> 01:51:23.020 so many tools then for being able to work inside 01:51:23.020 --> 01:51:24.910 of this world of natural language. 01:51:24.910 --> 01:51:26.417 Natural language is tricky. 01:51:26.417 --> 01:51:29.500 We have to deal with the syntax of language and the semantics of language, 01:51:29.500 --> 01:51:33.100 but we've really just seen just the beginning of some of the ideas that are 01:51:33.100 --> 01:51:37.450 underlying a lot of natural language processing-- the ability to take text, 01:51:37.450 --> 01:51:40.270 extract information out of it, get some sort of meaning out of it, 01:51:40.270 --> 01:51:43.990 generate sentences maybe by having some knowledge of the grammar or maybe just 01:51:43.990 --> 01:51:47.380 by looking at probabilities of what words are likely to show up based 01:51:47.380 --> 01:51:49.780 on other words that have shown up previously-- 01:51:49.780 --> 01:51:52.300 and then finally, the ability to take words 01:51:52.300 --> 01:51:55.330 and come up with some distributed representation of them, to take words 01:51:55.330 --> 01:51:58.240 and represent them as numbers, and use those numbers 01:51:58.240 --> 01:52:02.210 to be able to say something meaningful about those words as well. 
01:52:02.210 --> 01:52:04.390 So this then is yet another topic in this broader 01:52:04.390 --> 01:52:06.300 heading of artificial intelligence. 01:52:06.300 --> 01:52:08.380 And just as I look back at where we've been now, 01:52:08.380 --> 01:52:11.320 we started our conversation by talking about the world of search, 01:52:11.320 --> 01:52:14.590 about trying to solve problems like tic-tac-toe by searching 01:52:14.590 --> 01:52:17.500 for a solution, by exploring our various different possibilities 01:52:17.500 --> 01:52:21.220 and looking at what algorithms we can apply to be able to efficiently 01:52:21.220 --> 01:52:22.300 try and search a space. 01:52:22.300 --> 01:52:25.930 We looked at some simple algorithms and then looked at some optimizations 01:52:25.930 --> 01:52:28.780 we could make to those algorithms, and ultimately, that 01:52:28.780 --> 01:52:31.742 was in service of trying to get our AI to know things about the world. 01:52:31.742 --> 01:52:34.450 And this has been a lot of what we've talked about today as well, 01:52:34.450 --> 01:52:37.270 trying to get knowledge out of text-based information, 01:52:37.270 --> 01:52:41.440 the ability to take information and draw conclusions based on that information. 01:52:41.440 --> 01:52:43.630 If I know these two things for certain, maybe I 01:52:43.630 --> 01:52:46.660 can draw a third conclusion as well. 01:52:46.660 --> 01:52:49.330 That then was related to the idea of uncertainty. 01:52:49.330 --> 01:52:51.460 If we don't know something for sure, can we 01:52:51.460 --> 01:52:54.420 predict something, figure out the probabilities of something? 01:52:54.420 --> 01:52:56.170 And we saw that again today in the context 01:52:56.170 --> 01:52:59.200 of trying to predict whether a tweet or whether a message 01:52:59.200 --> 01:53:01.420 is positive sentiment or negative sentiment, 01:53:01.420 --> 01:53:04.022 and trying to draw that conclusion as well. 
01:53:04.022 --> 01:53:05.980 Then we took a look at optimization-- the sorts 01:53:05.980 --> 01:53:09.490 of problems where we're looking for a global or local maximum 01:53:09.490 --> 01:53:10.300 or minimum. 01:53:10.300 --> 01:53:13.420 This has come up time and time again, especially most recently 01:53:13.420 --> 01:53:16.750 in the context of neural networks, which are really just a kind of optimization 01:53:16.750 --> 01:53:20.110 problem where we're trying to minimize the total amount of loss 01:53:20.110 --> 01:53:23.110 based on the setting of the weights of our neural network, 01:53:23.110 --> 01:53:26.710 based on the setting of what vector representations for words we 01:53:26.710 --> 01:53:27.880 happen to choose. 01:53:27.880 --> 01:53:30.430 And those ultimately helped us to be able to solve 01:53:30.430 --> 01:53:33.940 learning-related problems-- the ability to take a whole bunch of data, 01:53:33.940 --> 01:53:37.650 and rather than us tell the AI exactly what to do, 01:53:37.650 --> 01:53:40.030 let the AI learn patterns from the data for itself. 01:53:40.030 --> 01:53:43.770 Let it figure out what makes an inbox message different from a spam message. 01:53:43.770 --> 01:53:45.520 Let it figure out what makes a counterfeit 01:53:45.520 --> 01:53:47.560 bill different from an authentic bill, and being 01:53:47.560 --> 01:53:49.820 able to draw that analysis as well. 01:53:49.820 --> 01:53:52.390 And one of the big tools in learning that we used 01:53:52.390 --> 01:53:54.220 were neural networks, these structures that 01:53:54.220 --> 01:53:58.180 allow us to relate inputs to outputs by training these internal networks 01:53:58.180 --> 01:54:02.410 to learn some sort of function that maps us from some input to some output-- 01:54:02.410 --> 01:54:05.770 ultimately yet another model in this language of artificial intelligence 01:54:05.770 --> 01:54:08.320 that we can use to communicate with our AI. 
01:54:08.320 --> 01:54:10.210 Then finally today, we looked at some ways 01:54:10.210 --> 01:54:12.850 that AI can begin to communicate with us, looking at ways 01:54:12.850 --> 01:54:16.240 that AI can begin to get an understanding for the syntax 01:54:16.240 --> 01:54:19.990 and the semantics of language to be able to generate sentences, 01:54:19.990 --> 01:54:23.110 to be able to predict things about text that's written in a spoken 01:54:23.110 --> 01:54:25.360 language or a written language like English, 01:54:25.360 --> 01:54:27.927 and to be able to do interesting analysis there as well. 01:54:27.927 --> 01:54:30.010 And there's so much more in active research that's 01:54:30.010 --> 01:54:33.160 happening all over the areas within artificial intelligence today, 01:54:33.160 --> 01:54:36.890 and we've really only just seen the beginning of what AI has to offer. 01:54:36.890 --> 01:54:39.310 So I hope you enjoyed this exploration into this world 01:54:39.310 --> 01:54:41.235 of artificial intelligence with Python. 01:54:41.235 --> 01:54:44.110 A big thank you to the course's teaching staff and the production team 01:54:44.110 --> 01:54:45.700 for making this class possible. 01:54:45.700 --> 01:54:49.940 This was an Introduction to Artificial Intelligence with Python.