[MUSIC PLAYING] SPEAKER 1: OK, welcome back, everyone, to our final topic in an introduction to artificial intelligence with Python. And today, the topic is language. So thus far in the class, we've seen a number of different ways of interacting with AI, artificial intelligence, but it's mostly been happening in the way of us formulating problems in ways that AI can understand-- learning to speak the language of AI, so to speak, by trying to take a problem and formulate it as a search problem, or by trying to take a problem and make it a constraint satisfaction problem-- something that our AI is able to understand. Today, we're going to try and come up with algorithms and ideas that allow our AI to meet us halfway, so to speak-- to allow AI to be able to understand, and interpret, and get some sort of meaning out of human language-- the type of language we naturally speak, a spoken language like English, or some other language. And this turns out to be a really challenging task for AI. And it really encompasses a number of different types of tasks, all under the broad heading of natural language processing, the idea of coming up with algorithms that allow our AI to be able to process and understand natural language. So these tasks vary in terms of the types of tasks we might want an AI to perform, and therefore, the types of algorithms that we might use. But some common tasks that you might see are things like automatic summarization. You give an AI a long document, and you would like for the AI to be able to summarize it, come up with a shorter representation of the same idea, but still in some kind of natural language, like English. Something like information extraction-- given a whole corpus of information in some body of documents or on the internet, for example, we'd like for our AI to be able to extract some sort of meaningful semantic information out of all of that content that it's able to look at and read. Language identification-- the task of, given a page, can you figure out what language that document is written in? This is the type of thing you might see if you use a web browser where, if you open up a page in another language, that web browser might ask you, oh, I think it's in this language-- would you like me to translate it into English for you, for example? And that language identification process is a task that our AI needs to be able to do, which is then related to machine translation, the process of taking text in one language and translating it into another language-- where there's been a lot of research and development really over the course of the last several years. And it keeps getting better, in terms of how it is that AI is able to take text in one language and transform that text into another language as well. In addition to that, we have topics like named entity recognition. Given some sequence of text, can you pick out what the named entities are? These are names of companies, or names of people, or names of locations, for example, which are often relevant or important parts of a particular document. Speech recognition is a related task, having to do not with text that is written, but text that is spoken-- being able to process audio and figure out, what are the actual words that are spoken there? And if you think about smart home devices, like Siri or Alexa, for example, these are all devices that are now able to listen to us when we speak, figure out what words we are saying, and draw some sort of meaning out of that as well.
We've talked about how you could formulate something, for instance, as a hidden Markov model to be able to draw those sorts of conclusions. Text classification, more generally, is a broad category of types of ideas, whenever we want to take some kind of text and put it into some sort of category. And we've seen these classification type problems and how we can use statistical machine learning approaches to be able to solve them. We'll be able to do something very similar with natural language, though we may need to make a couple of adjustments that we'll see soon. And then something like word sense disambiguation, the idea that, unlike in the language of numbers, where AI has very precise representations of everything, words are a little bit fuzzy in terms of their meaning, and words can have multiple different meanings-- and natural language is inherently ambiguous, and we'll take a look at some of those ambiguities in due time today. But one challenging task, if you want an AI to be able to understand natural language, is being able to disambiguate or differentiate between different possible meanings of words. If I say a sentence like, I went to the bank, you need to figure out, do I mean the bank where I deposit and withdraw money or do I mean the bank like the river bank? And different words can have different meanings that we might want to figure out. And based on the context in which a word appears-- the wider sentence, or paragraph, or paper in which a particular word appears-- that might help to inform how it is that we disambiguate between different meanings or different senses that a word might have. And there are many other topics within natural language processing, many other algorithms that have been devised in order to deal with and address these sorts of problems. And today, we're really just going to scratch the surface, looking at some of the fundamental ideas that are behind many of these ideas within natural language processing, within this idea of trying to come up with AI algorithms that are able to do something meaningful with the languages that we speak every day. And so to introduce this idea, when we think about language, we can often think about it in a couple of different parts. The first part refers to the syntax of language. This is more to do with just the structure of language and how it is that that structure works. And if you think about natural language, syntax is one of those things that, if you're a native speaker of a language, comes pretty readily to you. You don't have to think too much about it. If I give you a sentence from Sir Arthur Conan Doyle's Sherlock Holmes, for example, a sentence like this-- "just before 9 o'clock, Sherlock Holmes stepped briskly into the room"-- I think we could probably all agree that this is a well-formed grammatical sentence. Syntactically, it makes sense, in terms of the way that this particular sentence is structured. And syntax applies not just to natural language, but to programming languages as well. If you've ever seen a syntax error in a program that you've written, it's likely because you wrote some sort of program that was not syntactically well-formed. The structure of it was not a valid program. In the same way, we can look at English sentences, or sentences in any natural language, and make the same kinds of judgments. I can say that this sentence is syntactically well-formed.
When all the parts are put together, all these words are in this order, it constructs a grammatical sentence, or a sentence that most people would agree is grammatical. But there are also grammatically ill-formed sentences. A sentence like, "just before Sherlock Holmes 9 o'clock stepped briskly the room"-- well, I think we would all agree that this is not a well-formed sentence. Syntactically, it doesn't make sense. And this is the type of thing that, if we want our AI, for example, to be able to generate natural language-- to be able to speak to us the way a chat bot would speak to us, for example-- well, then our AI is going to need to know this distinction somehow, to know what kinds of sentences are grammatical and what kinds of sentences are not. And we might come up with rules or ways to statistically learn these ideas, and we'll talk about some of those methods as well. Syntax can also be ambiguous. It's not just that sentences are either well-formed or not well-formed-- there are certain ways that you could take a sentence and potentially construct multiple different structures for that sentence. A sentence like, "I saw the man on the mountain with a telescope," well, this is grammatically well-formed-- syntactically, it makes sense-- but what is the structure of the sentence? Is it the man on the mountain who has the telescope, or am I seeing the man on the mountain and I am using the telescope in order to see the man on the mountain? There's some interesting ambiguity here, where it could have potentially two different types of structures. And this is one of the ideas that we'll come back to also, in terms of how to think about dealing with AI when natural language is inherently ambiguous. So that then is syntax, the structure of language, and getting an understanding for how it is that, depending on the order and placement of words, we can come up with different structures for language. But in addition to language having structure, language also has meaning. And now we get into the world of semantics, the idea of what it is that a word, or a sequence of words, or a sentence, or an entire essay actually means. And so a sentence like, "just before 9:00, Sherlock Holmes stepped briskly into the room," is a different sentence from a sentence like, "Sherlock Holmes stepped briskly into the room just before 9:00." And yet they have effectively the same meaning. They're different sentences, so an AI reading them would recognize them as different, but we as humans can look at both of the sentences and say, yeah, they mean basically the same thing. And maybe, in this case, it was just because I moved the order of the words around. Originally, 9 o'clock was near the beginning of the sentence. Now 9 o'clock is near the end of the sentence. But you might imagine that I could come up with a different sentence entirely, a sentence like, "a few minutes before 9:00, Sherlock Holmes walked quickly into the room." And OK, that also has a very similar meaning, but I'm using different words in order to express that idea. And ideally, AI would be able to recognize that these two sentences, these different sets of words that are similar to each other, have similar meanings, and to be able to get at that idea as well. Then there are also ways that a syntactically well-formed sentence might not mean anything at all. A famous example from linguist Noam Chomsky is this sentence here-- "colorless green ideas sleep furiously."
Syntactically, that sentence is perfectly fine. Colorless and green are adjectives that modify the noun ideas. Sleep is a verb. Furiously is an adverb. These are correct constructions, in terms of the order of words, but it turns out this sentence is meaningless. If you tried to ascribe meaning to the sentence, what does it mean? It's not easy to determine what it is that it might mean. Semantics itself can also be ambiguous, given that different structures can have different types of meanings. Different words can have different kinds of meanings, so the same sentence with the same structure might end up meaning different types of things. So my favorite example is a headline that was in the Los Angeles Times a little while back. The headline says, "Big rig carrying fruit crashes on 210 freeway, creates jam." So depending on how it is you look at the sentence-- how you interpret the sentence-- it can have multiple different meanings: is the jam a traffic jam, or is it jam made out of the fruit? And so here too are challenges in this world of natural language processing, being able to understand both the syntax of language and the semantics of language. And today, we'll take a look at both of those ideas. We're going to start by talking about syntax and getting a sense for how it is that language is structured, and how we can start by coming up with some rules, some ways that we can tell our computer, tell our AI what types of things are valid sentences, what types of things are not valid sentences. And ultimately, we'd like to use that information to be able to allow our AI to draw meaningful conclusions, to be able to do something with language. And so to do so, we're going to start by introducing the notion of formal grammar. And what formal grammar is all about is this: a formal grammar is a system of rules that generates sentences in a language. I would like to know what are the valid English sentences-- not in terms of what they mean-- just in terms of their structure-- their syntactic structure. What structures of English are valid, correct sentences? What structures of English are not valid? And this is going to apply in a very similar way to other natural languages as well, where language follows certain types of structures. And we intuitively know what these structures mean, but it's going to be helpful to try and really formally define what the structures mean as well. There are a number of different types of formal grammar, all across what's known as the Chomsky hierarchy of grammars. And you may have seen some of these before. If you've ever worked with regular expressions before, those correspond to regular languages, which are a particular type of language in this hierarchy. But also on this hierarchy is a type of grammar known as a context-free grammar. And this is the one we're going to spend the most time on taking a look at today. And what a context-free grammar is, is a way of generating sentences in a language via what are known as rewriting rules-- replacing one symbol with other symbols. And we'll take a look in a moment at just what that means. So let's imagine, for example, a simple sentence in English, a sentence like, "she saw the city"-- a valid, syntactically well-formed English sentence. But we'd like for some way for our AI to be able to look at the sentence and figure out, what is the structure of the sentence?
If you imagine a kind of question answering format-- if you want to ask the AI a question like, what did she see, well, then the AI wants to be able to look at this sentence and recognize that what she saw is the city-- to be able to figure that out. And it requires some understanding of what it is that the structure of this sentence really looks like. So where do we begin? Each of these words-- she, saw, the, city-- we are going to call terminal symbols. These are symbols in our language-- where each of these words is just a symbol-- and this is ultimately what we care about generating. We care about generating these words. But each of these words we're also going to associate with what we're going to call a non-terminal symbol. And these non-terminal symbols initially are going to look kind of like parts of speech, if you remember back to English grammar-- where she is an N for noun, saw is a V for verb, the is a D. D stands for determiner. These are words like the, and a, and an, for example. And then city-- well, city is also a noun, so an N goes there. So each of these-- N, V, and D-- these are what we might call non-terminal symbols. They're not actually words in the language. She saw the city-- those are the words in the language. But we use these non-terminal symbols to generate the terminal symbols-- the terminal symbols being words like she, saw, the, city-- the words that are actually in a language like English. And so in order to translate these non-terminal symbols into terminal symbols, we have what are known as rewriting rules, and these rules look something like this. We have N on the left side of an arrow, and the arrow says, if I have an N non-terminal symbol, then I can turn it into any of these various different possibilities that are separated with a vertical line. So a noun could translate into the word she. A noun could translate into the word city, or car, or Harry, or any number of other things. These are all examples of nouns, for example. Meanwhile, a determiner, D, could translate into the, or a, or an. V for verb could translate into any of these verbs. P for preposition could translate into any of those prepositions-- to, on, over, and so forth. And then ADJ for adjective can translate into any of these possible adjectives as well. So these then are rules in our context-free grammar. When we are defining what it is that our grammar is, what is the structure of the English language or any other language, we give it these types of rules saying that a noun could be any of these possibilities, a verb could be any of those possibilities. But it turns out we can then begin to construct other rules where it's not just one non-terminal translating into one terminal symbol. We're always going to have one non-terminal on the left-hand side of the arrow, but on the right-hand side of the arrow, we could have other things. We could even have other non-terminal symbols. So what do I mean by this? Well, we have the idea of nouns-- like she, city, car, Harry, for example-- but there are also noun phrases-- phrases that work as nouns-- that are not just a single word, but multiple words. Like the city is two words that, together, operate as what we might call a noun phrase. It's multiple words, but they're together operating as a noun. Or if you think about a more complex expression, like the big city-- three words all operating as a single noun-- or the car on the street-- multiple words now, but that entire set of words operates kind of like a noun.
It substitutes as a noun phrase. And so to do this, we'll introduce the notion of a new non-terminal symbol called NP, which will stand for noun phrase. And this rewriting rule says that a noun phrase could be a noun-- so something like she is a noun, and therefore, it can also be a noun phrase-- but a noun phrase could also be a determiner, D, followed by a noun-- so two ways we can have a noun phrase in this very simple grammar. Of course, the English language is more complex than just this, but a noun phrase is either a noun or it is a determiner followed by a noun. So for the first example, a noun phrase that is just a noun, that would allow us to generate noun phrases like she, because a noun phrase is just a noun, and a noun could be the word she, for example. Meanwhile, if we wanted to look at one of the examples of these, where a noun phrase becomes a determiner and a noun, then we get a structure like this. And now we're starting to see the structure of language emerge from these rules in a syntax tree, as we'll call it, this tree-like structure that represents the syntax of our natural language. Here, we have a noun phrase, and this noun phrase is composed of a determiner and a noun, where the determiner is the word the, according to that rule, and the noun is the word city. So here then is a noun phrase that consists of multiple words inside of the structure. And using this idea of taking one symbol and rewriting it using other symbols-- that might be terminal symbols, like the and city, but might also be non-terminal symbols, like D for determiner or N for noun-- then we can begin to construct more and more complex structures. In addition to noun phrases, we can also think about verb phrases. So what might a verb phrase look like? Well, a verb phrase might just be a single verb. In a sentence like "I walked," walked is a verb, and that is acting as the verb phrase in that sentence. But there are also more complex verb phrases that aren't just a single word, but that are multiple words. If you think of a sentence like "she saw the city," for example, saw the city is really that entire verb phrase. It's capturing what it is that she is doing, for example. And so our verb phrase might have a rule like this. A verb phrase is either just a plain verb or it is a verb followed by a noun phrase. And we saw before that a noun phrase is either a noun or it is a determiner followed by a noun. And so a verb phrase might be something simple, like a verb phrase that is just a verb. And that verb could be the word walked, for example. But it could also be something more sophisticated, something like this, where we begin to see a larger syntax tree, where the way to read the syntax tree is that a verb phrase is a verb and a noun phrase, where that verb could be something like saw. And this is a noun phrase we've seen before, this noun phrase that is the city-- a noun phrase composed of the determiner the and the noun city, all put together to construct this larger verb phrase. And then just to give one more example of a rule, we could also have a rule like this-- sentence S goes to a noun phrase and a verb phrase. The basic structure of a sentence is that it is a noun phrase followed by a verb phrase. And this is a formal grammar way of expressing the idea that you might have learned when you learned English grammar, when you learned that a sentence is like a subject and a verb-- a subject and an action, something that's happening to a particular noun phrase.
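Collecting the rewriting rules described so far, this simple grammar looks roughly like the following (the terminal word lists are abbreviated to the examples mentioned above):

```
S   -> NP VP

NP  -> N | D N
VP  -> V | V NP

N   -> "she" | "city" | "car" | "Harry" | ...
D   -> "the" | "a" | "an"
V   -> "saw" | "walked" | ...
P   -> "to" | "on" | "over" | ...
ADJ -> "big" | "wide" | "blue" | ...
```

Reading a rule like NP -> N | D N: a noun phrase can be rewritten either as a noun or as a determiner followed by a noun.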
And so using this structure, we could construct a sentence that looks like this. A sentence consists of a noun phrase and a verb phrase. A noun phrase could just be a noun, like the word she. The verb phrase could be a verb and a noun phrase, where-- this is something we've seen before-- the verb is saw and the noun phrase is the city. And so now look what we've done here. What we've done is, by defining a set of rules, there are algorithms that we can run that take these words-- and the CYK algorithm, for example, is one example of this, if you want to look into that-- where you start with a set of terminal symbols, like she saw the city, and then, using these rules, you're able to figure out, how is it that you go from a sentence to she saw the city? And it's all through these rewriting rules. So the sentence is a noun phrase and a verb phrase. A verb phrase could be a verb and a noun phrase, so on and so forth, where you can imagine taking this structure and figuring out how it is that you could generate a parse tree-- a syntax tree-- for that set of terminal symbols, that set of words. And if you tried to do this for a sentence that was not grammatical, something like "saw the city she," well, that wouldn't work. There'd be no way to take a sentence and use these rules to be able to generate that sentence that is not inside of that language. So this sort of model can be very helpful if the rules are expressive enough to express all the ideas that you might want to express inside of natural language. Of course, using just the simple rules we have here, there are many sentences that we won't be able to generate-- sentences that we might agree are grammatical and syntactically well-formed, but that we're not going to be able to construct using these rules. And in that case, we might just need to have some more complex rules in order to deal with those sorts of cases. And so this type of approach can be powerful if you're dealing with a limited set of rules and words that you really care about dealing with. And one way we can actually interact with this in Python is by using a Python library called NLTK, short for natural language toolkit, which we'll see a couple of times today, which has a wide variety of different functions and classes that we can take advantage of that are all meant to deal with natural language. And one such algorithm that it has is the ability to parse a context-free grammar, to be able to take some words and figure out, according to some context-free grammar, how you would construct the syntax tree for it. So let's go ahead and take a look at NLTK now by examining how we might construct some context-free grammars with it. So here inside of cfg0-- cfg is short for context-free grammar-- I have a sample context-free grammar which has rules that we've seen before. So sentence goes to noun phrase followed by a verb phrase. Noun phrase is either a determiner and a noun or a noun. Verb phrase is either a verb or a verb and a noun phrase. The order of these things doesn't really matter. Determiners could be the word the or the word a. A noun could be the word she, city, or car. And a verb could be the word saw or it could be the word walked. Now, using NLTK, which I've imported here at the top, I'm going to go ahead and parse this grammar and save it inside of this variable called parser. Next, my program is going to ask the user for input. Just type in a sentence, and dot split will just split it on all of the spaces, so I end up getting each of the individual words.
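For readers following along without the course files, here is a rough sketch of the kind of script being described here-- a reconstruction built on NLTK's CFG and ChartParser classes, not necessarily the exact cfg0.py file:

```python
import nltk

# The same simple grammar described above, written in NLTK's notation.
grammar = nltk.CFG.fromstring("""
    S -> NP VP

    NP -> D N | N
    VP -> V | V NP

    D -> "the" | "a"
    N -> "she" | "city" | "car"
    V -> "saw" | "walked"
""")

# A chart parser that finds syntax trees according to that grammar.
parser = nltk.ChartParser(grammar)

# Ask the user for a sentence and split it into words on spaces.
sentence = input("Sentence: ").split()

try:
    # There may be more than one valid way to parse the sentence.
    for tree in parser.parse(sentence):
        tree.pretty_print()  # text-based representation of the tree
        tree.draw()          # graphical window showing the same tree
except ValueError:
    print("No parse tree possible.")
```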
We're going to save that inside of this list called sentence. And then we'll go ahead and try to parse the sentence, and for each sentence we parse, we're going to pretty print it to the screen, just so it displays in my terminal. And we're also going to draw it. It turns out that NLTK has some graphics capacity, so we can really visually see what that tree looks like as well. And there are multiple different ways a sentence might be parsed, which is why we're putting it inside of this for loop. And we'll see why that can be helpful in a moment too. All right, now that I have that, let's go ahead and try it. I'll cd into cfg, and we'll go ahead and run cfg0. So it then is going to prompt me to type in a sentence. And let me type in a very simple sentence-- something like she walked, for example. Press Return. So what I get is, on the left-hand side, you can see a text-based representation of the syntax tree. And on the right side here-- let me go ahead and make it bigger-- we see a visual representation of that same syntax tree. This is how it is that my computer has now parsed the sentence she walked. It's a sentence that consists of a noun phrase and a verb phrase, where each phrase is just a single noun or verb, she and then walked-- same type of structure we've seen before, but this now is our computer able to understand the structure of the sentence, to be able to get some sort of structural understanding of how it is that parts of the sentence relate to each other. Let me now give it another sentence. I could try something like she saw the city, for example-- the words we were dealing with a moment ago. And then we end up getting this syntax tree out of it-- again, a sentence that has a noun phrase and a verb phrase. The noun phrase is fairly simple. It's just she. But the verb phrase is more complex. It is now saw the city, for example. Let's do one more with this grammar. Let's do something like she saw a car. And that is going to look very similar-- that we also get she. But our verb phrase is now different. It's saw a car, because there are multiple possible determiners in our language and multiple possible nouns. I haven't given this grammar that many words, but if I gave it a larger vocabulary, it would then be able to understand more and more different types of sentences. And just to give you a sense of some added complexity we could add here, the more complex our grammar, the more rules we add, the more different types of sentences we'll then have the ability to generate. So let's take a look at cfg1, for example, where I've added a whole number of other different types of rules. I've added adjective phrases, where we can have multiple adjectives inside of a noun phrase as well. So a noun phrase could be an adjective phrase followed by a noun phrase. If I wanted to say something like the big city, that's an adjective phrase followed by a noun phrase. Or we could also have a noun and a prepositional phrase-- so the car on the street, for example. On the street is a prepositional phrase, and we might want to combine those two ideas together, because the car on the street can still operate as something kind of like a noun phrase as well. So no need to understand all of these rules in too much detail-- it starts to get into the nature of English grammar-- but now we have a more complex way of understanding these types of sentences. So if I run Python cfg1-- and I can try typing something like she saw the wide street, for example-- a more complex sentence.
And if we make that larger, you can see what this sentence looks like. I'll go ahead and shrink it a little bit. So now we have a sentence like this-- she saw the wide street. The wide street is one entire noun phrase, saw the wide street is an entire verb phrase, and she saw the wide street ends up forming that entire sentence. So let's take a look at one more example to introduce this notion of ambiguity. So I can run Python cfg1. Let me type a sentence like she saw a dog with binoculars. So there's a sentence, and here now is one possible syntax tree to represent this idea-- she saw, the noun phrase a dog, and then the prepositional phrase with binoculars. And the way to interpret the sentence is that what it is that she saw was a dog. And how did she do the seeing? She did the seeing with binoculars. And so this is one possible way to interpret this. She was using binoculars. Using those binoculars, she saw a dog. But another possible way to parse that sentence would be with this tree over here, where you have something like she saw a dog with binoculars, where a dog with binoculars forms an entire noun phrase of its own-- same words in the same order, but a different grammatical structure, where now we have a dog with binoculars all inside of this noun phrase, meaning what did she see? What she saw was a dog, and that dog happened to have binoculars with it-- so different ways to parse the sentence-- different structures for the sentence-- even given the same possible sequence of words. And NLTK's parsing algorithm has the ability to find all of these, to be able to understand the different ways that you might be able to parse a sentence and be able to extract some sort of useful meaning out of that sentence as well. So that then is a brief look at what we can do by getting at the structure of language, using these context-free grammar rules to be able to describe the structure of language. But what we might also care about is understanding how it is that these sequences of words are likely to relate to each other in terms of the actual words themselves. The grammar that we saw before could allow us to generate a sentence like, I ate a banana, for example, where I is the noun phrase and ate a banana is the verb phrase. But it would also allow for sentences like, I ate a blue car, for example, which is also syntactically well-formed according to the rules, but is probably a less likely sentence for a person to actually speak. And we might want for our AI to be able to encapsulate the idea that certain sequences of words are more or less likely than others. So to deal with that, we'll introduce the notion of an n-gram, and an n-gram, more generally, just refers to some sequence of n items inside of our text. And those items might take various different forms. We can have character n-grams, which are just a contiguous sequence of n characters-- so three characters in a row, for example, or four characters in a row. We can also have word n-grams, which are a contiguous sequence of n words in a row from a particular sample of text. And these end up proving quite useful, and we can choose n to decide how long our sequence is going to be. So when n is 1, we're just looking at a single word or a single character. And that is what we might call a unigram, just one item. If we're looking at two characters or two words, that's generally called a bigram-- so an n-gram where n is equal to 2, looking at two words that are consecutive.
And then, if there are three items, you might imagine we'll often call those trigrams-- so three characters in a row or three words that happen to be in a contiguous sequence. And so if we took a sentence, for example-- here's a sentence from, again, Sherlock Holmes-- "how often have I said to you that, when you have eliminated the impossible, whatever remains, however improbable, must be the truth." What are the trigrams that we can extract from the sentence? If we're looking at sequences of three words, well, the first trigram would be how often have-- just a sequence of three words. And then we can look at the next trigram, often have I. The next trigram is have I said. Then I said to, said to you, to you that, for example-- those are all trigrams of words, sequences of three contiguous words that show up in the text. And extracting those bigrams and trigrams, or n-grams more generally, turns out to be quite helpful, because often, when we're dealing with analyzing a lot of text, it's not going to be particularly meaningful for us to try and analyze the entire text at one time. But instead, we want to segment that text into pieces that we can begin to do some analysis of-- that our AI might never have seen this entire sentence before, but it's probably seen the trigram to you that before, because to you that is something that might have come up in other documents that our AI has seen before. And therefore, it knows a little bit about that particular sequence of three words in a row-- or something like have I said, another example of a sequence of three words that's probably quite popular, in terms of where you see it inside the English language. So we'd like some way to be able to extract these sorts of n-grams. And how do we do that? How do we extract sequences of three words? Well, we need to take our input and somehow separate it into all of the individual words. And this is a process generally known as tokenization, the task of splitting up some sequence into distinct pieces, where we call those pieces tokens. Most commonly, this refers to something like word tokenization. I have some sequence of text and I want to split it up into all of the words that show up in that text. But it might also come up in the context of something like sentence tokenization. I have a long sequence of text and I'd like to split it up into sentences, for example. And so how might word tokenization work, the task of splitting up our sequence of characters into words? Well, we've actually already seen a version of this idea. Just a moment ago, I took an input sequence and called Python's split method on it, where the split method took that sequence of characters and just separated it based on where the spaces showed up. And so if I had a sentence like, whatever remains, however improbable, must be the truth, how would I tokenize this? Well, the naive approach is just to say, anytime you see a space, go ahead and split it up. We're going to split up this particular string just by looking for spaces. And what we get when we do that is a sentence like this-- whatever remains, however improbable, must be the truth. But what you'll notice here is that, if we just split things up in terms of where the spaces are, we end up keeping the punctuation around. There's a comma after the word remains. There's a comma after improbable, a period after truth.
And this poses a little bit of a challenge, when we think about trying to tokenize things into individual words, because if you're comparing words to each other, this word truth with a period after it-- if you just string compare it, it's going to be different from the word truth without a period after it. And so this punctuation can sometimes pose a problem for us, and so we might want some way of dealing with it-- either treating punctuation as a separate token altogether or maybe removing that punctuation entirely from our sequence as well. So that might be something we want to do. But there are other cases where it becomes a little bit less clear. If I said something like, just before 9 o'clock, Sherlock Holmes stepped briskly into the room, well, this apostrophe in o'clock-- after the o in o'clock-- is that something we should remove? Should we split based on that as well, and end up with o and clock? There are some interesting questions there too. And it gets even trickier if you begin to think about hyphenated words-- something like this, where we have a whole bunch of words that are hyphenated and then you need to make a judgment call. Is that a place where you're going to split things apart into individual words, or are you going to consider frock-coat, and well-cut, and pearl-grey to be individual words of their own? And so those tend to pose challenges that we need to somehow deal with, and something we need to decide as we go about trying to perform this kind of analysis. Similar challenges arise when it comes to the world of sentence tokenization. Imagine this sequence of sentences, for example. If you take a look at this particular sequence of sentences, you could probably imagine you could extract the sentences pretty readily. Here is one sentence and here is a second sentence, so we have two different sentences inside of this particular passage. And the distinguishing feature seems to be the period-- that a period separates one sentence from another. And maybe there are other types of punctuation you might include here as well-- an exclamation point, for example, or a question mark. But those are the types of punctuation that we know tend to come at the end of sentences. But it gets trickier again if you look at a sentence like this-- not talking to Sherlock now, but instead talking to Mr. Holmes. Well now, we have a period at the end of Mr. And so if you were just separating on periods, you might imagine this would be a sentence, and then just Holmes would be a sentence, and then we'd have a third sentence down below. Things do get a little bit trickier as you start to imagine these sorts of situations. And dialogue too starts to make this trickier as well-- if you have these sorts of quoted lines inside of something like he said, for example, where he said this particular sequence of words and then this particular sequence of words. There are interesting challenges that arise there too, in terms of how it is that we take the passage and split it up into individual sentences as well. And these are just things that our algorithm needs to decide. In practice, there are usually some heuristics that we can use. We know there are certain occurrences of periods, like the period after Mr., or in other examples, where we know that is not the beginning of a new sentence, and so we can encode those rules into our AI to allow it to be able to do this tokenization the way that we want it to.
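As a concrete illustration, NLTK ships with word and sentence tokenizers that encode many of these heuristics for us. A minimal sketch, assuming the punkt tokenizer models have been downloaded (the example strings are the ones from above):

```python
import nltk

# nltk.download("punkt")  # one-time download of tokenizer models, if needed

text = "Whatever remains, however improbable, must be the truth."

# Naive approach: split on spaces. Punctuation stays attached to the words.
print(text.split())
# ['Whatever', 'remains,', 'however', 'improbable,', 'must', 'be', 'the', 'truth.']

# NLTK's word tokenizer treats punctuation as tokens of its own.
print(nltk.word_tokenize(text))
# ['Whatever', 'remains', ',', 'however', 'improbable', ',', 'must', 'be', 'the', 'truth', '.']

# The sentence tokenizer uses heuristics so that an abbreviation like "Mr."
# is typically not treated as the end of a sentence.
passage = "I am talking to Mr. Holmes. He stepped briskly into the room."
print(nltk.sent_tokenize(passage))
# typically: ['I am talking to Mr. Holmes.', 'He stepped briskly into the room.']
```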
So once we have this ability to tokenize a particular passage-- take the passage, split it up into individual words-- from there, we can begin to extract what the n-grams actually are. So we can actually take a look at this by going into a Python program that will serve the purpose of extracting these n-grams. And again, we can use NLTK, the natural language toolkit, in order to help us here. So I'll go ahead and go into ngrams and we'll take a look at ngrams.py. And what we have here is we are going to take some corpus of text, just some collection of documents, and use all those documents and extract what the most popular n-grams happen to be. So in order to do so, we're going to go ahead and load data from a directory that we specify as a command line argument. We'll also take in a number n as a command line argument as well, in terms of how many words we're going to look at in sequence. Then we're going to go ahead and just count up all of the n-grams, using nltk.ngrams. So we're going to look at all of the n-grams across this entire corpus and save them inside this variable ngrams. And then we're going to look at the most common ones and go ahead and print them out. And so in order to do so, I'm not only using NLTK-- I'm also using Counter, which is built into Python as well, where I can just count up, how many times do these various different n-grams appear? So we'll go ahead and show that. We'll go into ngrams, and I'll say something like python ngrams-- and let's just first look for the unigrams, sequences of one word inside of a corpus. And the corpus that I've prepared is some of the stories from Sherlock Holmes, all here, where each file is just one of the Sherlock Holmes stories. And so I have a whole bunch of text here inside of this corpus, and I'll go ahead and provide that corpus as a command line argument. And now what my program is going to do is it's going to load all of the Sherlock Holmes stories into memory-- or all the ones that I've provided in this corpus at least-- and it's just going to look for the most popular unigrams, the most popular sequences of one word. And it seems the most popular one is just the word the, used some 9,700 times; followed by I, used 5,000 times; and and, used about 5,000 times-- the kinds of words you might expect. So now let's go ahead and check for bigrams, for example, ngrams 2, holmes. All right, again, sequences of two words now that appear multiple times-- of the, in the, it was, to the, it is, I have-- so on and so forth. These are the types of bigrams that happen to come up quite often inside this corpus, inside of the Sherlock Holmes stories. And it probably is true across other corpora as well, but we could only find out if we actually tested it. And now, just for good measure, let's try one more-- maybe try three, looking now for trigrams that happen to show up. And now we get it was the, one of the, I think that, out of the. These are sequences of three words now that happen to come up multiple times across this particular corpus. So what are the potential use cases here? Now we have some sort of data. We have data about how often particular sequences of words show up in this particular order, and using that, we can begin to do some sort of predictions. We might be able to say that, if you see the words it was, there's a reasonable chance the word that comes after them should be the word a.
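For reference, the core of a counting script like the one being described might look roughly like this. The names ngrams.py and the holmes directory come from the walkthrough; the code itself is a reconstruction, not necessarily the exact course file:

```python
import os
import sys
from collections import Counter

import nltk


def main():
    """Print the most common n-grams in a corpus of text files."""
    # Usage: python ngrams.py n corpus_directory
    n = int(sys.argv[1])
    directory = sys.argv[2]

    # Load every document in the corpus and tokenize it into lowercase words,
    # keeping only tokens that contain at least one letter.
    contents = []
    for filename in os.listdir(directory):
        with open(os.path.join(directory, filename)) as f:
            contents.extend(
                word.lower()
                for word in nltk.word_tokenize(f.read())
                if any(c.isalpha() for c in word)
            )

    # Count all contiguous sequences of n words across the corpus.
    ngrams = Counter(nltk.ngrams(contents, n))

    # Show the ten most common n-grams and their counts.
    for ngram, count in ngrams.most_common(10):
        print(f"{count}: {ngram}")


if __name__ == "__main__":
    main()
```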
And if I see the words one of, it's reasonable to imagine that the next word might be the word the, for example, because we have this data about trigrams, sequences of three words and how often they come up. And now, based on two words, you might be able to predict what the third word happens to be. And one model we can use for that is a model we've actually seen before. It's the Markov model. Recall again that the Markov model really just refers to some sequence of events that happen one time step after another, where every unit has some ability to predict what the next unit is going to be-- or maybe the past two units predict what the next unit is going to be, or the past three predict what the next one is going to be. And we can use a Markov model and apply it to language for a very naive and simple approach at trying to generate natural language, at getting our AI to be able to speak English-like text. And the way it's going to work is we're going to say something like, come up with some probability distribution. Given these two words, what is the probability distribution over what the third word could possibly be, based on all the data? If you see it was, what are the possible third words we might have, and how often do they come up? And using that information, we can try and construct what we expect the third word to be. And if you keep doing this, the effect is that our Markov model can effectively start to generate text-- can generate text that was not in the original corpus, but that sounds kind of like the original corpus. It's using the same sorts of rules that the original corpus was using. So let's take a look at an example of that as well, where here I have another corpus, and it is a corpus of the works of William Shakespeare. So I've got a whole bunch of stories from Shakespeare, and all of them are just inside of this big text file. And so what I might like to do is look at what all of the n-grams are-- maybe look at all the trigrams inside of shakespeare.txt-- and figure out, given two words, can I predict what the third word is likely to be? And then just keep repeating this process-- I have two words-- predict the third word; then, from the second and third word, predict the fourth word; and from the third and fourth word, predict the fifth word, ultimately generating random sentences that sound like Shakespeare, that are using similar patterns of words that Shakespeare used, but that never actually showed up in Shakespeare itself. And so to do so, I'll show you generator.py, which, again, is just going to read data from a particular file. And I'm using a Python library called markovify, which is just going to do this process for me. So there are libraries out there that can just train on a bunch of text and come up with a Markov model based on that text. And I'm going to go ahead and just generate five randomly generated sentences. So we'll go ahead and go into markov. I'll run the generator on shakespeare.txt. What we'll see is it's going to load that data, and then here's what we get. We get five different sentences, and these are sentences that never showed up in any Shakespeare play, but that are designed to sound like Shakespeare, that are designed to just take two words and predict, given those two words, what would Shakespeare have been likely to choose as the third word that follows it. And you know, these sentences probably don't have any meaning.
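A sketch of what a generator script like this might look like, using markovify's basic API (markovify.Text to train a model on a body of text, make_sentence to sample from it); the exact file layout is an assumption based on the walkthrough:

```python
import sys

import markovify

# Read the training text, e.g. shakespeare.txt, from a command line argument.
with open(sys.argv[1]) as f:
    text = f.read()

# Train a Markov chain text model on that corpus.
model = markovify.Text(text)

# Generate five sentences that mimic the corpus without copying it.
for _ in range(5):
    print(model.make_sentence())
    print()
```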
It's not like the AI is trying to express any sort of underlying meaning here. It's just trying to understand, based on a sequence of words, what is likely to come after it as a next word, for example. And these are the types of sentences that it's able to come up with, just by generating. And if you ran this multiple times, you would end up getting different results. I could run this again and get an entirely different set of five sentences that also are supposed to sound kind of like the way that Shakespeare's sentences sounded as well. And so that then was a look at how it is we can use Markov models to be able to naively attempt generating language. The language doesn't mean a whole lot right now. You wouldn't want to use this system in its current form to do something like machine translation, because it wouldn't be able to encapsulate any meaning, but we're starting to see now that our AI is getting a little bit better at trying to speak our language, at trying to be able to process natural language in some sort of meaningful way. So we'll now take a look at a couple of other tasks that we might want our AI to be able to perform. And one such task is text categorization, which really is just a classification problem. And we've talked about classification problems already, these problems where we would like to take some object and categorize it into a number of different classes. And so the way this comes up in text is anytime you have some sample of text and you want to put it inside of a category, where I want to say something like, given an email, does it belong in the inbox or does it belong in spam? Which of these two categories does it belong in? And you do that by looking at the text and being able to do some sort of analysis on that text to be able to draw conclusions, to be able to say that, given the words that show up in the email, I think this probably belongs in the inbox, or I think it probably belongs in spam instead. And you might imagine doing this for a number of different types of classification problems of this sort. So you might imagine that another common example of this type of idea is something like sentiment analysis, where I want to analyze, given a sample of text, does it have a positive sentiment or does it have a negative sentiment? And this might come up in the case of product reviews on a website, for example, or feedback on a website, where you have a whole bunch of data-- samples of text that are provided by users of a website-- and you want to be able to quickly analyze, are these reviews positive, are the reviews negative, what is it that people are saying-- to be able to categorize text into one of these two different categories. So how might we approach this problem? Well, let's take a look at some sample product reviews. Here are some sample product reviews that we might come up with. My grandson loved it. So much fun. Product broke after a few days. One of the best games I've played in a long time. Kind of cheap and flimsy. Not worth it. Different product reviews that you might imagine seeing on Amazon, or eBay, or some other website where people are selling products, for instance. And we humans can pretty easily categorize these into positive sentiment or negative sentiment. We'd probably say that the first and the third one, those are positive sentiment messages. The second one and the fourth one, those are probably negative sentiment messages.
But how could a computer do the same thing? How could it try and take these reviews and assess, are they positive or are they negative? Well, ultimately, it depends upon the words that happen to be in these particular reviews-- inside of these particular sentences. For now, we're going to ignore the structure and how the words are related to each other, and we're just going to focus on what the words actually are. So there are probably some key words here, words like loved, and fun, and best. Those probably show up in more positive reviews, whereas words like broke, and cheap, and flimsy-- well, those are words that are probably more likely to come up inside of negative reviews instead of positive reviews. So one way to approach this sort of text analysis idea is to say, let's, for now, ignore the structures of these sentences-- to say, we're not going to care about how it is the words relate to each other. We're not going to try and parse these sentences to construct the grammatical structure like we saw a moment ago. But we can probably just rely on the words that were actually used-- rely on the fact that the positive reviews are more likely to have words like best, and loved, and fun, and that the negative reviews are more likely to have the negative words that we've highlighted there as well. And this sort of model-- this approach to trying to think about language-- is generally known as the bag of words model, where we're going to model a sample of text not by caring about its structure, but just by caring about the unordered collection of words that show up inside of a sample-- that all we care about is what words are in the text. And we don't care about what the order of those words is. We don't care about the structure of the words. We don't care what noun goes with what adjective or how things agree with each other. We just care about the words. And it turns out this approach tends to work pretty well for doing classifications like positive sentiment or negative sentiment. And you could imagine doing this in a number of ways. We've talked about different approaches to trying to solve classification style problems, but when it comes to natural language, one of the most popular approaches is the naive Bayes approach. And this is one approach to trying to analyze the probability that something is positive sentiment or negative sentiment, or just trying to categorize some text into possible categories. And it doesn't just work for text-- it works for other types of ideas as well-- but it is quite popular in the world of analyzing text and natural language. And the naive Bayes approach is based on Bayes' rule, which you might recall back from when we talked about probability-- that Bayes' rule looks like this: the probability of some event b, given a, can be expressed using this expression over here. The probability of b given a is the probability of a given b, multiplied by the probability of b, divided by the probability of a. And we saw that this came about as a result of just the definition of conditional probability and looking at what it means for two events to happen together. This was our formulation then of Bayes' rule, which turned out to be quite helpful. We were able to predict one event in terms of another by flipping the order of those events inside of this probability calculation.
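Written out as a formula, the rule being described is:

$$P(b \mid a) = \frac{P(a \mid b)\,P(b)}{P(a)}$$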
And it turns out this approach is going to be quite helpful-- and we'll see why in a moment-- for being able to do this sort of sentiment analysis, because I want to say, you know, what is the probability that a message is positive, or what is the probability that the message is negative? And I'll go ahead and simplify this using the emojis, just for simplicity-- probability of positive, probability of negative. And that is what I would like to calculate, but I'd like to calculate that given some information-- given information like, here is a sample of text-- my grandson loved it. And I would like to know not just what is the probability that any message is positive, but what is the probability that the message is positive, given my grandson loved it as the text of the sample? So given this information that inside the sample are the words my grandson loved it, what is the probability then that this is a positive message? Well, according to the bag of words model, what we're going to do is really ignore the ordering of the words-- not treat this as a single sentence that has some structure to it, but just treat it as a whole bunch of different words. We're going to say something like, what is the probability that this is a positive message, given that the word my was in the message, given that the word grandson was in the message, given that the word loved was in the message, and given that the word it was in the message? The bag of words model here-- we're treating the entire sample as just a whole bunch of different words. And so this then is what I'd like to calculate, this probability-- given all those words, what is the probability that this is a positive message? And this is where we can now apply Bayes' rule. This is really the probability of some b, given some a. And that now is what I'd like to calculate. So according to Bayes' rule, this whole expression is equal to-- well, it's the probability-- I switched the order of them-- it's the probability of all of these words, given that it's a positive message, multiplied by the probability that it is a positive message, divided by the probability of all of those words. So this then is just an application of Bayes' rule, which we've already seen, where I want to express the probability of positive, given the words, as related somehow to the probability of the words, given that it's a positive message. And it turns out, as you might recall from back when we talked about probability, that this denominator is going to be the same regardless of whether we're looking at positive or negative messages. The probability of these words doesn't change, because we don't have a positive or negative down below. So rather than say that this expression up here is equal to this expression down below, we can just say that it's proportional to just the numerator. We can ignore the denominator for now. Using the denominator would get us an exact probability. But it turns out that what we'll really just do is figure out what the probability is proportional to, and at the end, we'll have to normalize the probability distribution-- make sure the probability distribution ultimately sums up to the number 1. So now I've been able to formulate this probability-- which is what I care about-- as proportional to multiplying these two things together-- the probability of the words, given a positive message, multiplied by the probability of a positive message.
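In symbols, writing the words of the sample as $w_1, \dots, w_4$ (my, grandson, loved, it), that is:

$$P(\text{positive} \mid w_1, w_2, w_3, w_4) \;\propto\; P(w_1, w_2, w_3, w_4 \mid \text{positive})\; P(\text{positive})$$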
But again, if you think back to our probability rules, we can calculate this really as just a joint probability of all of these things happening-- that the probability of a positive message multiplied by the probability of these words, given the positive message-- well, that's just the joint probability of all of these things. This is the same thing as the probability that it's a positive message, and my is in the sample, and grandson is in the sample, and loved is in the sample, and it is in the sample. So using that rule for the definition of joint probability, I've been able to say that this entire expression is now proportional to this joint probability of these words and this positive that's in there as well. And so now the interesting question is just how to calculate that joint probability. How do I figure out the probability that, given some arbitrary message, it is positive, and the word my is in there, and the word grandson is in there, and the word loved is in there, and the word it is in there? Well, you'll recall that we can calculate a joint probability by multiplying together all of these conditional probabilities. If I want to know the probability of a, and b, and c, I can calculate that as the probability of a, times the probability of b given a, times the probability of c given a and b. I can just multiply these conditional probabilities together in order to get the overall joint probability that I care about. And we could do the same thing here. I could say, let's multiply the probability of positive by the probability of the word my showing up in the message, given that it's positive, multiplied by the probability of grandson showing up in the message, given that the word my is in there and that it's positive, multiplied by the probability of loved, given these three things, multiplied by the probability of it, given these four things. And that's going to end up being a fairly complex calculation to make, one that we probably aren't going to have a good way of knowing the answer to. What is the probability that grandson is in the message, given that it is positive and the word my is in the message? That's not something we're really going to have a readily easy answer to, and so this is where the naive part of naive Bayes comes about. We're going to simplify this notion. Rather than compute exactly what that probability distribution is, we're going to assume that these words are going to be effectively independent of each other, if we know that it's already a positive message. If it's a positive message, it doesn't change the probability that the word grandson is in the message if I know that the word loved is in the message, for example. And that might not necessarily be true in practice. In the real world, it might not be the case that these words are actually independent, but we're going to assume it to simplify our model. And it turns out that simplification still lets us get pretty good results out of it as well. And what we're going to assume is that the probability that all of these words show up depends only on whether it's positive or negative. I can still say that loved is more likely to come up in a positive message than a negative message, which is probably true, but we're also going to say that it's not going to change whether or not loved is more likely or less likely to come up if I know that the word my is in the message, for example. And so those are the assumptions that we're going to make.
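Under that naive independence assumption, the quantity we want to compute becomes:

$$P(\text{positive} \mid w_1, w_2, w_3, w_4) \;\propto\; P(\text{positive}) \prod_{i=1}^{4} P(w_i \mid \text{positive})$$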
So while the top expression is proportional to the bottom expression, we're going to say it's naively proportional to this expression: the probability of a positive message, and then, for each of the words that shows up in the sample, the probability that my is in the message, given that it's positive, times the probability of grandson being in the message, given that it's positive-- and so on and so forth for the other words that happen to be inside of the sample. And it turns out that these are numbers we can calculate. The reason we've done all of this math is to get to this point, to be able to calculate the probability distribution we care about in terms of quantities we can actually compute-- and we can compute them given some data available to us. And this is what a lot of natural language processing is about these days-- it's about analyzing data. If I give you a whole bunch of reviews, and I've labeled them as positive or negative, then you can begin to calculate these particular terms. I can calculate the probability that a message is positive just by looking at my data and asking, how many positive samples were there, divided by the total number of samples? That is my probability that a message is positive. What is the probability that the word loved is in the message, given that it's positive? Well, I can calculate that from my data too: look at how many positive samples have the word loved in them and divide that by my total number of positive samples. That gives me an approximation for the probability that loved is going to show up inside of a review, given that we know the review is positive. And so this allows us to calculate these probabilities. So let's now actually do this calculation for the sentence my grandson loved it-- is it a positive or a negative review? How could we figure out those probabilities? Well, again, this up here is the expression we're trying to calculate, and here is the data that is available to us. The way to interpret this data is that, of all of the messages, 49% of them were positive and 51% of them were negative. Maybe online reviews tend to be a little bit more negative than they are positive-- or at least that's what this particular data sample shows. And then I have distributions for each of the various different words: given that it's a positive message, how many positive messages had the word my in them? About 30%. And for negative messages, how many of those had the word my in them? About 20%-- so it seems like the word my comes up at least slightly more often in positive messages, based on this analysis. Grandson, for example, maybe showed up in 1% of all positive messages and 2% of all negative messages. The word loved showed up in 32% of all positive messages and 8% of all negative messages. And the word it showed up in 30% of positive messages and 40% of negative messages-- again, just arbitrary data here for the sake of example, but now we have data with which we can begin to calculate this expression. So how do I calculate it? It's just going to be multiplying the probability that it's positive, times the probability of my, given positive, times the probability of grandson, given positive-- and so on and so forth for each of the other words.
And if you do that multiplication and multiply all of those values together, you get this: 0.00014112. By itself, this is not a meaningful number, but it becomes meaningful if you compare this expression-- the probability that it's positive times the probability of all of the words, given that the message is positive-- to the same thing for negative sentiment messages instead. I want to know the probability that it's a negative message times the probability of all of these words, given that it's a negative message. And how can I do that? Well, you just multiply the probability of negative times all of these conditional probabilities. And if I take those five values and multiply all of them together, what I get is this value for negative: 0.00006528-- again, in isolation, not a particularly meaningful number. What is meaningful is treating these two values as a probability distribution and normalizing them-- making it so that both of these values sum up to 1, the way a probability distribution should. We do so by adding these two up and then dividing each of these values by their total in order to normalize them. And when we normalize this probability distribution, you end up getting something like this: positive 0.6837, negative 0.3163. It seems we've been able to conclude that we are about 68% confident-- we think there's a probability of about 0.68 that this message, my grandson loved it, is a positive message. And why are we 68% confident? Well, it seems like we're more confident than not because the word loved showed up in 32% of positive messages, but only 8% of negative messages, so that was a pretty strong indicator. And for the others, while it's true that the word it showed up more often in negative messages, that wasn't enough to offset that loved shows up far more often in positive messages than negative messages. And so this type of analysis is how we can apply naive Bayes. We've just done this calculation, and we end up getting not just a categorization of positive or negative, but some sort of confidence level-- what I think the probability is that it's positive. And I can say I think it's positive with this particular probability. So naive Bayes can be quite powerful at trying to achieve this. Using just this bag of words model, where all I'm doing is looking at what words show up in the sample, I'm able to draw these sorts of conclusions. Now, one potential drawback-- something you'll notice pretty quickly if you start applying this rule exactly as is-- is what happens if 0's are inside this data somewhere. Let's imagine the same sentence-- my grandson loved it-- but let's instead imagine that this value here, instead of being 0.01, was 0, meaning that inside of our data set, it has never happened that a positive message contained the word grandson. And that's certainly possible. If I have a pretty small data set, it may well be that not all of the words show up-- maybe no positive message has ever had the word grandson in it, at least in my data set. But if it is the case that 2% of the negative messages have still had the word grandson in them, then we run into an interesting challenge.
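Before turning to that challenge, here is the full calculation from this example written out in Python, just to check the arithmetic (the probabilities are the made-up numbers from the data above):

```python
# P(positive) * P(my|positive) * P(grandson|positive) * P(loved|positive) * P(it|positive)
p_positive = 0.49 * 0.30 * 0.01 * 0.32 * 0.30

# P(negative) * P(my|negative) * P(grandson|negative) * P(loved|negative) * P(it|negative)
p_negative = 0.51 * 0.20 * 0.02 * 0.08 * 0.40

print(p_positive)   # ~0.00014112
print(p_negative)   # ~0.00006528

# Normalize so the two values sum to 1, like a proper probability distribution
total = p_positive + p_negative
print(round(p_positive / total, 4))   # 0.6837
print(round(p_negative / total, 4))   # 0.3163
```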
And the challenge is this: when I multiply all of the positive numbers together and all of the negative numbers together to calculate these two values, what I end up getting for positive is a pure 0, because when I multiply something by 0, it doesn't matter what the other numbers are-- the result is going to be 0. And the same thing could happen on the negative side as well. So this would seem to be a problem: because grandson has never shown up in any of the positive messages inside of our sample, we seem to be concluding that there is a 0% chance that the message is positive, and that therefore it must be negative, because the only cases where we've seen the word grandson come up are inside of negative messages. And in doing so, we've totally ignored all of the other probabilities-- that a positive message is much more likely to have the word loved in it-- because we've multiplied by 0, which means none of the other probabilities can possibly matter at all. So this is a challenge we need to deal with. It means we're likely not going to get correct results if we just purely use this approach, and it's for that reason that there are a number of possible ways to try and make sure we never multiply something by 0. It's OK to multiply something by a small number, because then it can still be counterbalanced by other larger numbers, but multiplying by 0 means it's the end of the story-- you multiply a number by 0 and the output is going to be 0, no matter how big any of the other numbers happen to be. So one approach that's fairly common in naive Bayes is the idea of additive smoothing: adding some value alpha to each of the values in our distribution just to smooth the data a little bit. One such approach is called Laplace smoothing, which basically just means adding 1 to each value in our distribution. So if I have 100 samples and zero of them contain the word grandson, well then I might say, you know what? Let's pretend that I've had one additional sample where the word grandson appeared and one additional sample where it didn't. So I'll say, all right, now I have 1 out of 102-- that is, (0 + 1) / (100 + 2)-- one sample that does have the word grandson out of 102 total. I'm basically creating two samples that didn't exist before, but in doing so, I've been able to smooth the distribution a little bit to make sure that I never have to multiply anything by 0. By pretending I've seen one more value in each category than I actually have, I never have to worry about multiplying a number by 0. So this is an approach we can use to apply naive Bayes even in situations where we're dealing with words we might not have seen before. And let's now take a look at how we could actually apply that in practice. It turns out that NLTK, in addition to having the ability to extract n-grams and tokenize things into words, also has the ability to apply naive Bayes to samples of text. So let's go ahead and do that. What I've done is, inside of sentiment, I've prepared a corpus of reviews that I've generated myself, but you can imagine using real reviews. I just have a couple of positive reviews-- it was great. So much fun. Would recommend. My grandson loved it. Those sorts of messages.
And then I have a whole bunch of negative reviews-- not worth it, kind of cheap, really bad, didn't work the way we expected-- just one on each line. A whole bunch of positive reviews and negative reviews, and what I'd like to do now is analyze them somehow. So here then is sentiment.py. What we're going to do first is extract all of the positive and negative sentences, create a set of all of the words that were used across all of the messages, and then go ahead and train NLTK's naive Bayes classifier on all of this training data. What the training data effectively is: I take all of the positive messages and give them the label positive, all of the negative messages and give them the label negative, and then I apply the classifier, where I say, I would like to take all of this training data and now have the ability to classify messages as positive or negative. I'll then take some input from the user-- they can just type in some sequence of words-- and I'll classify that sequence as either positive or negative and print out what the probabilities of each happen to be. There are some helper functions here that just organize things in the way NLTK is expecting them to be, but the key idea is that I'm taking the positive messages and labeling them, taking the negative messages and labeling them, putting them inside of a classifier, and then trying to classify some new text that comes along (a sketch of what that code boils down to appears below). So let's go ahead and try it. I'll go into sentiment and run python sentiment.py, passing in as input that corpus that contains all of the positive and negative messages-- because the corpus is going to affect the probabilities. The effectiveness of our ability to classify is entirely dependent on how good our data is, how much data we have, and how well it happens to be labeled. So now I can try something-- let's try a review like, this was great. And it seems that, all right, it estimates a 96% chance that this was a positive message and a 4% chance that it was negative-- likely because the word great shows up inside of the positive messages but doesn't show up inside of the negative messages, and that is something our AI is able to capitalize on. What it's really going to look for are the differentiating words. If the probability of words like this and was is pretty similar between positive and negative messages, then the naive Bayes classifier isn't going to end up treating those values as having much importance in the algorithm-- because if they're about the same on both sides, you multiply that value for both positive and negative and end up getting about the same thing. What ultimately makes the difference in naive Bayes is when you multiply by a value that's much bigger for one category than for another-- when one word, like great, is much more likely to show up in one type of message than another type of message. And that's one of the nice things about naive Bayes: without me telling it that great is more important to care about than this or was, naive Bayes can figure that out based on the data. It can figure out that this shows up about the same amount of the time across the two, but that great is a discriminator, a word that differs between the two types of messages. So I could try it again-- type in a sentence like, lots of fun, for example.
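For reference, here is roughly what a sentiment.py like the one just described boils down to. This is a sketch, not the lecture's exact file: the file layout (positives.txt and negatives.txt inside a directory passed as a command-line argument) and the helper names are assumptions, but nltk.NaiveBayesClassifier really does train on a list of (feature dictionary, label) pairs and prob_classify returns a probability distribution over the labels.

```python
import sys
import nltk  # requires the punkt tokenizer data, e.g. nltk.download("punkt")


def extract_words(document):
    # Bag of words: the set of lowercase alphabetic tokens in the document
    return set(
        word.lower() for word in nltk.word_tokenize(document)
        if any(c.isalpha() for c in word)
    )


def load_data(directory):
    result = []
    for filename in ["positives.txt", "negatives.txt"]:
        with open(f"{directory}/{filename}") as f:
            result.append([line.strip() for line in f if line.strip()])
    return result


def generate_features(documents, words, label):
    # NLTK wants (feature dict, label) pairs: which known words appear in each document
    return [
        ({word: (word in extract_words(document)) for word in words}, label)
        for document in documents
    ]


def main():
    positives, negatives = load_data(sys.argv[1])
    words = set()
    for document in positives + negatives:
        words.update(extract_words(document))

    training = (generate_features(positives, words, "Positive")
                + generate_features(negatives, words, "Negative"))
    classifier = nltk.NaiveBayesClassifier.train(training)

    s = input("s: ")
    result = classifier.prob_classify(
        {word: (word in extract_words(s)) for word in words}
    )
    for label in result.samples():
        print(f"{label}: {result.prob(label):.4f}")


if __name__ == "__main__":
    main()
```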
This one it's a little less sure about-- a 62% chance that it's positive, a 37% chance that it's negative-- maybe because there aren't as clear discriminators or differentiators inside of this data. I'll try one more-- say, kind of overpriced. And all right, now it's 95%, 96% sure that this is a negative sentiment-- likely because of the word overpriced, because that's shown up in a negative sentiment expression before, and therefore it thinks, you know what, this is probably going to be a negative review. And so naive Bayes has now given us the ability to classify text. Given enough training data, given enough examples, we can train our AI to look at natural language, human words, figure out which words are likely to show up in positive as opposed to negative sentiment messages, and categorize them accordingly. And you could imagine doing the same thing any time you want to take text and group it into categories. If I want to categorize an email as a good email or as a spam email, I could apply a similar idea: look for the discriminating words, the words that make it more likely to be a spam email or not, and just train a naive Bayes classifier to figure out what that distribution is and how to categorize an email as good or as spam. Now, of course, it's not going to give us a definitive answer. It gives us a probability distribution, something like 63% positive, 37% negative. And that might be why our email spam filters sometimes make mistakes-- sometimes think that a good email is actually spam, or vice versa-- because ultimately, the best they can do is calculate a probability distribution. If natural language is ambiguous, we can usually just deal in the world of probabilities to try and get an answer that is reasonably good, even if we aren't able to guarantee that it's exactly the answer we expect. That, then, was a look at how we can take some text, analyze it, and group it into categories. But ultimately, in addition to just being able to analyze text and categorize it, we'd like to be able to figure out information about the text-- to get some sort of meaning out of the text as well. And this starts to get us into the world of information-- of trying to take data in the form of text and retrieve information from it. So one type of problem is known as information retrieval, or IR, which is the task of finding relevant documents in response to a query. This is something like typing a query into a search engine, like Google, or typing something into a system-- a library catalog, for example-- that's going to look for responses to a query. I want to look for documents that are about the US Constitution, say, and I would like to get back a whole bunch of documents that match that query. But you might imagine that, in order to solve this task effectively, what I really need to be able to do is take documents and figure out what those documents are about-- what the topics of those documents are-- so that I can then more effectively retrieve information from those particular documents. And this refers to a set of tasks generally known as topic modeling, where I'd like to discover what the topics are for a set of documents.
And this is something that humans can do. A human could read a document and tell you what the document is about-- give maybe a couple of topics, who the important people in the document are, what the important objects in the document are-- a human can probably tell you that kind of thing. But we'd like for our AI to be able to do the same thing. Given some document, can you tell me what the important words in this document are-- the words that set this document apart, that I might care about if I'm looking at documents based on keywords, for example? And so one intuitive idea that probably makes sense is to just use term frequency. Term frequency is defined as the number of times a particular term appears in a document. If I have a document with 100 words and one particular word shows up 10 times, it has a term frequency of 10-- it shows up pretty often, so maybe it's going to be an important word. Sometimes you'll also see this framed as a proportion of the total number of words-- 10 words out of 100, a term frequency of 0.1, meaning 10% of all of the words are this particular word that I care about. Ultimately, that doesn't change how relatively important the words are within any one particular document-- it's the same idea: look for words that show up more frequently, because those are more likely to be the important words inside of a corpus of documents. And so let's go ahead and give that a try. Let's say I wanted to find out what the Sherlock Holmes stories are about. I have a whole bunch of Sherlock Holmes stories, and I want to know, in general, what are they about? What are the important characters, the important objects, the important parts of the story, just in terms of words? I'd like for the AI to figure that out on its own, and we'll do so by looking at term frequency-- by looking at what words show up the most often. So I'll go ahead and go into the tfidf directory-- you'll see why it's called that in a moment-- but let's first open up tf0.py, which is going to calculate the top five term frequencies for a corpus of documents, where each document is just a story from Sherlock Holmes. We're going to load all of the data into our corpus, figure out what words show up inside of that corpus, and basically just assemble all of the term frequencies-- calculate how often each of these terms appears inside of each document-- and print out the top five. There are some data structures involved that you can take a look at if you'd like to; the exact code is not so important, but the idea is that we're taking each of these documents, taking all the words that show up, sorting them by how often each word shows up, and then, for each document, saving the top five terms that happen to show up in that document. So again, some helper functions you can take a look at if you're interested, but the key idea is that all we're going to do is run tf0 on the Sherlock Holmes stories (a sketch of what tf0.py boils down to appears below). And what I'm hoping to get out of this process is to figure out, what are the important words in Sherlock Holmes, for example? So we'll go ahead and run this and see what we get. And it's loading the data.
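Roughly, what tf0.py is doing under the hood is just counting-- something like this sketch (the real file also handles loading every story in the corpus directory):

```python
import nltk
from collections import Counter


def top_terms(text, n=5):
    # Term frequency: how many times each word appears in this one document
    words = [word.lower() for word in nltk.word_tokenize(text) if word.isalpha()]
    return Counter(words).most_common(n)

# For one Sherlock Holmes story, this might return something like
# [("the", 600), ("and", 400), ("i", 350), ("to", 300), ("of", 280)]
```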
And here's what we get. For this particular story, the important words are the, and and, and I, and to, and of. Those are the words that show up most frequently. In this other story, it's the, and and, and I, and a, and of. This is not particularly useful to us. We're using term frequencies-- we're looking at what words show up the most frequently in each of these various different documents-- but what we get are just the words that show up a lot in English. The words the, and of, and and happen to show up a lot in English, and therefore they happen to show up a lot in each of these various different documents. This is not a particularly useful metric for figuring out which words are important, because these words are just part of the grammatical structure of English. And it turns out we can categorize words into a couple of different categories. These words are known as what we might call function words-- words that have little meaning on their own, but that are used to grammatically connect different parts of a sentence. These are words like am, and by, and do, and is, and which, and with, and yet-- words that, on their own, are hard to assign a meaning to; they get their meaning from how they connect different parts of the sentence. And these function words are what we might call a closed class of words in a language like English: there's really just some fixed list of function words, and it doesn't change very often-- just some list of words that are commonly used to connect other grammatical structures in the language. That's in contrast with what we might call content words, words that carry meaning independently-- words like algorithm, category, computer-- words that actually have some sort of meaning on their own. And these are usually the words we care about. If we want to figure out the important words in a document, we probably care about the content words more than the function words. So one strategy we could apply is to just ignore all of the function words. Here in tf1.py, I've done the exact same thing, except I'm going to load a whole bunch of words from a function_words.txt file, inside of which is a list of function words in alphabetical order-- words that are just used to connect other words in English, which someone has compiled into this particular list. These are the words I want to ignore: if a word is one of them, let's not count it among the top terms, because these are not words I care about if I want to analyze what the important terms inside of a document happen to be. So what tf1 is ultimately doing is: if the word is in my set of function words, I just skip over it, continuing on to the next word, and then I calculate the frequencies for the remaining words instead. I'm going to pretend the function words aren't there, and now maybe I can get a better sense for what terms are important in each of the various different Sherlock Holmes stories. So now let's run tf1 on the Sherlock Holmes corpus and see what we get. And let's look at, what is the most important term in each of the stories? Well, it seems like, for each of the stories, the most important word is Holmes. I guess that's what we would expect-- they're all Sherlock Holmes stories, and Holmes is not a function word.
It's not the, or a, or an, so it wasn't ignored. But Holmes and man-- these are probably not what I mean when I ask, what are the important words? Even though Holmes does show up the most often, it's not giving me a whole lot of information about what each of the different Sherlock Holmes stories is actually about. And the reason why is that Sherlock Holmes shows up in all the stories, so it's not meaningful to say that this story is about Sherlock Holmes when what I want is to figure out the different topics across the corpus of documents. What I really want to know is, what words show up in this document that show up less frequently in the other documents? And to get at that idea, we're going to introduce the notion of inverse document frequency. Inverse document frequency is a measure of how common, or rare, a word happens to be across an entire corpus of documents. Mathematically, it's usually calculated as the logarithm of the total number of documents divided by the number of documents containing the word. So if a word like Holmes shows up in all of the documents, then the total number of documents and the number of documents containing Holmes are the same number. When you divide the two, you get 1, and the logarithm of 1 is just 0. So if Holmes shows up in all of the documents, it has an inverse document frequency of 0. You can think of inverse document frequency as a measure of how rare a word is: if a word doesn't show up across many documents at all, this number is going to be much higher. And this gets us to a model known as tf-idf, which is a method for ranking which words are important in a document by multiplying these two ideas together: multiply term frequency, or TF, by inverse document frequency, or IDF. The idea is that how important a word is depends on two things. It depends on how often it shows up in the document, using the heuristic that, if a word shows up more often, it's probably more important. And we multiply that by inverse document frequency, because if a word is rarer but shows up in this document, it's probably more important than a word that shows up across most or all of the documents, which is probably a less important factor in what the different topics across the documents in the corpus happen to be. So now let's apply this algorithm to the Sherlock Holmes corpus. Here's tfidf.py. What I'm doing now is, for each of the documents, for each word, calculating its TF score, term frequency, multiplied by the inverse document frequency of that word-- not just looking at a single value, but multiplying the two values together to compute the overall score (a sketch of that computation appears below). And now, if I run tfidf on the Holmes corpus, this is going to try and get us a better approximation of what's important in each of the stories. And it seems like what it's extracting here are probably the names of characters that happen to be important in each story-- characters that show up in this story but don't show up in the other stories-- and prioritizing the more important characters that happen to show up more often. So this might be a better analysis of what topics are more or less important. I also have another corpus, which is a corpus of all of the Federalist Papers from American history.
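Here is a minimal sketch of that tf-idf computation (assuming corpus is a dictionary mapping each document's name to its list of words; the lecture's tfidf.py differs in its details):

```python
import math
from collections import Counter


def tfidf_top_terms(corpus, n=5):
    """corpus: dict mapping document name -> list of words in that document."""
    num_documents = len(corpus)

    # Inverse document frequency: log(total documents / documents containing the word)
    document_counts = Counter()
    for words in corpus.values():
        document_counts.update(set(words))
    idf = {word: math.log(num_documents / count)
           for word, count in document_counts.items()}

    # tf-idf score = term frequency * inverse document frequency
    top = {}
    for name, words in corpus.items():
        tf = Counter(words)
        scored = sorted(((tf[word] * idf[word], word) for word in tf), reverse=True)
        top[name] = scored[:n]
    return top
```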
If I go ahead and run tfidf on the Federalist Papers, we can begin to see what the important words in each of the various different Federalist Papers happen to be-- that Federalist Paper Number 61 seems to be a lot about elections, and Federalist Paper Number 66 about the Senate and impeachments. You can start to extract the important terms and the important words just by looking at what doesn't show up across many of the documents, but does show up frequently enough in certain of the documents. And so this can be a helpful tool for this kind of topic modeling, figuring out what it is that a particular document happens to be about. This then starts to get us into the world of semantics-- what it is that things actually mean when we're talking about language. Now we're no longer going to think just in terms of the bag of words, where we treat a sample of text as a whole bunch of words and don't care about the order. When we get into the world of semantics, we really do start to care about what these words actually mean, how these words relate to each other, and in particular, how we can extract information out of that text. Information extraction is extracting knowledge from our documents-- figuring out, given a whole bunch of text, can we automate the process of having an AI look at those documents and get out whatever useful or relevant knowledge happens to be inside them? So let's take a look at an example. I'll give you two samples from news articles. Here up above is a sample of a news article from the Harvard Business Review that was about Facebook. Down below is an example of a Business Insider article from 2018 that was about Amazon. And there's some information here that we might want an AI to be able to extract-- knowledge about these companies. In particular, let's say I want to know when companies were founded-- that Facebook was founded in 2004 and Amazon was founded in 1994-- that that is important information that I happen to care about. Well, how do we extract that information from the text? What is my way of understanding this text and figuring out that Facebook was founded in 2004? Well, what I can look for are templates or patterns, things that happen to show up across multiple different documents that give me some sense for what this knowledge happens to mean. And what we'll notice is a common pattern between both of these passages, which is this phrasing here: when Facebook was founded in 2004, comma-- and then down below, when Amazon was founded in 1994, comma. Those two templates end up giving us a mechanism for extracting information-- this notion, when company was founded in year, comma, can tell us something about when a company was founded, because if we set our AI loose on the web, let it look at a whole bunch of pages or a whole bunch of articles, and it finds this pattern-- when blank was founded in blank, comma-- then our AI can pretty reasonably conclude that there's a good chance the first blank is going to be some company and the second blank is going to be the year that company was founded. It might not be perfect, but at least it's a good heuristic.
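If we wrote that template by hand, it could be as simple as a regular expression-- a toy sketch, since real articles will phrase things in plenty of ways this pattern won't catch:

```python
import re

text = ("When Facebook was founded in 2004, ... "
        "Back when Amazon was founded in 1994, ...")

# Hand-written template: "when <company> was founded in <year>,"
for company, year in re.findall(r"[Ww]hen (\w+) was founded in (\d{4}),", text):
    print(company, year)   # Facebook 2004, then Amazon 1994
```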
And so you might imagine that, if you wanted to train an AI to be able to look for information, you might give the AI templates like this-- not only a template like when company blank was founded in blank, but also templates like, the book blank was written by blank, for example. Just give it some templates it can use to search the web, to search a whole big corpus of documents, looking for text that matches, and if it finds a match, it's able to figure out, all right, here's the company and here's the year. But of course, that requires us to write these templates. It requires us to figure out what the structure of this information is likely going to look like, and that might be difficult to know. Different websites are, of course, going to phrase things differently, and this type of method isn't going to be able to extract all of the information, because if the words are in a slightly different order, it won't match that particular template. But one thing we can do is, rather than give our AI the template, give our AI the data. We can tell the AI that Facebook was founded in 2004 and Amazon was founded in 1994-- just give the AI those two pieces of information-- and then set the AI loose on the web. And now the idea is that the AI can begin to look for where Facebook and 2004 show up together and where Amazon and 1994 show up together, and it can discover the templates for itself. It can discover that this kind of phrasing-- when blank was founded in blank-- tends to relate Facebook to 2004 and Amazon to 1994, so maybe the same relation will hold for others as well. And this automated template generation ends up being quite powerful, and we'll go ahead and take a look at that now as well. What I have here inside of the templates directory is a file called companies.csv, and this is all of the data that I am going to give to my AI. I'm going to give it the pair Amazon, 1994 and the pair Facebook, 2004, and what I'm going to tell my AI to do is search a corpus of documents for other data-- other pairs, other relationships like this. I'm not telling the AI that this is a company and the date it was founded. I'm just giving it Amazon, 1994 and Facebook, 2004 and letting the AI do the rest. And what the AI is going to do is look through my corpus-- here's my corpus of documents-- and find, inside of Business Insider, that we have sentences like, back when Amazon was founded in 1994, comma-- and that kind of phrasing is going to be similar to this Harvard Business Review story that has a sentence like, when Facebook was founded in 2004-- and it's going to look across a number of other documents for similar types of patterns to be able to extract that kind of information. So if I go ahead and run it-- I'll go into templates and say python search.py. I'm going to look for data like the data in companies.csv inside of the companies directory, which contains a whole bunch of news articles that I've curated in advance. And here's what I get-- Google 1998, Apple 1976, Microsoft 1975, and so on and so forth-- Walmart 1962, for example. These are all of the pieces of data that happened to match that same kind of template that we were able to find before. And how was it able to find this?
Well, it's probably because, if we look at the Forbes article, for example, it has a phrase in it like, when Walmart was founded in 1962, comma-- it's able to identify these sorts of patterns and extract information from them. Now, granted, I have curated all of these stories in advance to make sure that there is data it's able to match on, and in practice, a company won't always be related to its founding year in this exact format. But if you give the AI access to enough data-- like all of the text on the internet-- and just have the AI crawl the internet looking for information, it can fairly reliably, or at least with some probability, extract information using these sorts of templates and generate interesting knowledge. And the more knowledge it learns, the more new templates it's able to construct, looking for constructions that show up in other locations as well. So let's take a look at another example. Here I'll show you presidents.csv, where I have two presidents and their inauguration dates-- George Washington 1789 and Barack Obama 2009, for example. And I'm also going to give our AI a corpus that contains just a single document, which is the Wikipedia article for the list of presidents of the United States-- just information about presidents. And I'd like to extract, from this raw HTML document of a web page, information about the presidents. So I can run the search on presidents.csv, and what I get is a whole bunch of data about presidents and what year they were likely inaugurated, by looking for patterns that matched Barack Obama and 2009, for example-- the sorts of patterns that happen to give us some clues as to what the document is about. So here's another example. If I open up the olympics directory, here is a scraped version of the Olympic home page that has information about various different Olympic Games, and maybe I want to extract Olympic locations and years from this particular page. Well, I can do that using the exact same algorithm. I'm just saying, all right, here are two Olympics and where they were located-- 2012 London, for example. Let me go ahead and run the search on olympics.csv, looking at the whole Olympic data set, and here I get some information back. Now, this information is not totally perfect. There are a couple of examples that are obviously not quite right, because my template might have been a little bit too general-- maybe it was looking for too broad a category of things, and certain strange things happened to get captured by that particular template. So you could imagine adding rules to make this process more intelligent-- making sure the thing on the left is actually a year, for instance, and doing other sorts of analysis. But purely based on some data, we are able to extract some interesting information using these algorithms. And all search.py is really doing here is taking my corpus of data, finding templates that match it-- here, I'm filtering down to just the top two templates that happen to match-- and then using those templates to extract results from the data I have access to, looking for all of the information that I care about (a simplified sketch of that idea follows below). And that's ultimately what lets me print out those results and figure out what the matches happen to be.
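A toy version of that idea-- use the seed pairs to discover the connecting text, then reuse that text to pull out new pairs-- might look like the sketch below. The lecture's search.py is more careful (it ranks templates, keeps only the top ones, and handles messier text), so treat this only as an illustration of the technique:

```python
import re


def discover_middles(seeds, text):
    """Find the literal text that connects each seed pair, e.g. " was founded in "."""
    middles = set()
    for a, b in seeds:
        for match in re.finditer(re.escape(a) + r"(\s.{1,40}?\s)" + re.escape(b), text):
            middles.add(match.group(1))
    return middles


def extract_pairs(middles, text):
    """Reuse each discovered middle: grab the token just before it and just after it."""
    pairs = set()
    for middle in middles:
        for match in re.finditer(r"(\w+)" + re.escape(middle) + r"(\w+)", text):
            pairs.add((match.group(1), match.group(2)))
    return pairs


seeds = [("Amazon", "1994"), ("Facebook", "2004")]
corpus = ("Back when Amazon was founded in 1994, ... "
          "When Facebook was founded in 2004, ... "
          "When Walmart was founded in 1962, ...")

middles = discover_middles(seeds, corpus)   # {" was founded in "}
print(extract_pairs(middles, corpus))       # includes ("Walmart", "1962")
```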
And so information extraction is another powerful tool when it comes to trying to extract information, but of course, it only works in fairly limited contexts. It only works when I'm able to find templates that look exactly like this in order to come up with some sort of match that connects a pair of data-- that this company was founded in this year. What I might want to do, as we start to think about the semantics of words, is to imagine some way of coming up with definitions for all words-- being able to relate all of the words in a dictionary to each other-- because that's ultimately what's going to be necessary if we want our AI to be able to communicate. We need some representation of what it is that words mean. One approach to doing this is a famous data set called WordNet. WordNet is human-curated: researchers have curated together a whole bunch of words, their definitions, their various different senses-- because a word might have multiple different meanings-- and also how those words relate to one another. And so what we mean by this is-- I can show you an example of WordNet. WordNet comes built into NLTK; using NLTK, you can download and access WordNet (a short snippet showing that lookup appears below). So let me go into wordnet, go ahead and run it, and extract information about a word-- a word like city, for example. Go ahead and press Return, and here is the information that I get back about a city. It turns out that city has three different senses, three different meanings, according to WordNet. And it's really just kind of like a dictionary, where each sense is associated with its meaning-- some definition provided by a human. It's also got categories that a word belongs to-- that a city is a type of municipality, a city is a type of administrative district-- and that allows me to relate words to other words. So one of the powers of WordNet is the ability to take one word and connect it to other related words. For another example, let me try the word house. I'll type in the word house and see what I get back. Well, all right, a house is a kind of building, and a house is somehow related to a family unit. And so you might imagine trying to come up with these various different ways of describing a house: it is a building, it is a dwelling. Researchers have curated these relationships between these various different words to say that a house is a type of building, that a house is a type of dwelling, for example. But this type of approach, while certainly helpful for relating words to one another, doesn't scale particularly well. As you start to think about language changing, and about all the various different relationships that words might have to one another, this challenge of word representation ends up being difficult. What we've done is just define a word as a sentence that explains what that word is, but what we really would like is some way to represent the meaning of a word so that our AI can do something useful with it. Any time we want our AI to be able to look at text and really understand what that text means, to relate text and words to similar words and understand the relationships between words, we'd like some way that a computer can represent this information.
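For reference, the WordNet lookup shown a moment ago takes only a few lines with NLTK-- this assumes the WordNet data has already been downloaded (for example with nltk.download("wordnet")):

```python
from nltk.corpus import wordnet

for synset in wordnet.synsets("city"):
    # Each synset is one sense of the word, with a human-written definition
    print(synset.name(), "-", synset.definition())
    # Hypernyms are the broader categories this sense belongs to, e.g. municipality
    print("   categories:", [hypernym.name() for hypernym in synset.hypernyms()])
```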
And what we've seen multiple times throughout the course is the idea that, when we want our AI to represent something, it can be helpful to have the AI represent it using numbers. We've seen that we can represent utilities in a game, like winning, losing, or drawing, as a number-- 1, negative 1, or 0. We've seen other ways we can take data and turn it into a vector of features, where we just have a whole bunch of numbers that represent some particular piece of data. And if we ever want to pass words into a neural network-- for instance, to say, given some sentence, translate it into another sentence, or to do interesting classifications with neural networks on individual words-- we need some representation of words in terms of vectors, a way to represent words using individual numbers to define the meaning of a word. So how do we do that? How do we take words and turn them into vectors that we can use to represent the meaning of those words? Well, one way is to do this. If I have four words that I want to encode, like he wrote a book, I can just say, let's let the word he be this vector: 1, 0, 0, 0. Wrote will be 0, 1, 0, 0. A will be 0, 0, 1, 0. Book will be 0, 0, 0, 1. Effectively, what I have here is what's known as a one-hot representation, or a one-hot encoding-- a representation of meaning where meaning is a vector that has a single 1 in it and the rest are 0's. The location of the 1 tells me the meaning of the word: a 1 in the first position means he, a 1 in the second position means wrote. Every word in the dictionary is going to be assigned some representation like this, where we assign one place in the vector that has a 1 for that word and a 0 for all the other words. And now I have representations that are different for a whole bunch of different words. This is the one-hot representation. So what are the drawbacks of this? Why is this not necessarily a great approach? Well, here, I am only creating enough vectors to represent four words in a dictionary. If you imagine a dictionary with 50,000 words that I might want to represent, now these vectors get enormously long: 50,000-dimensional vectors to represent a vocabulary of 50,000 words-- he is a 1 followed by 49,999 0's, and wrote has a whole bunch of 0's in it as well. That's not a particularly tractable way of trying to represent words if I'm going to have to deal with vectors of length 50,000. Another, subtler problem is that, ideally, I'd like for these vectors to somehow represent meaning in a way I can extract useful information out of. If I have the sentences he wrote a book and he authored a novel, well, wrote and authored are going to be two totally different vectors, and book and novel are going to be two totally different vectors inside of my vector space that have nothing to do with each other-- each one is just located in a different position. And really, what I would like is for wrote and authored to have vectors that are similar to one another, and for book and novel to have vector representations that are similar to one another, because they are words that have similar meanings. Because their meanings are similar, when I put them in vector form and use a vector to represent meaning, I'd like for those vectors to be similar to one another as well.
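A one-hot encoding is trivial to construct, which is part of its appeal-- and the sketch below also makes the scaling problem obvious, since every vector is as long as the whole vocabulary:

```python
vocabulary = ["he", "wrote", "a", "book"]

# One position per word: a 1 in that word's own position, 0's everywhere else
one_hot = {
    word: [1 if i == j else 0 for j in range(len(vocabulary))]
    for i, word in enumerate(vocabulary)
}

print(one_hot["he"])     # [1, 0, 0, 0]
print(one_hot["book"])   # [0, 0, 0, 1]
# With a 50,000-word vocabulary, each of these vectors would have 50,000 entries.
```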
So rather than this one-hot representation, where we represent a word's meaning by giving it a vector that has a 1 in one particular location, what we're going to do-- which is a bit of a strange thing the first time you see it-- is use what we'll call a distributed representation. We are going to represent the meaning of a word as a whole bunch of different values-- not a single 1 and the rest 0's, but a whole bunch of values. So for example, in he wrote a book, he might just be a bigger vector-- maybe it's 50 dimensions, maybe it's 100 dimensions, but certainly fewer than tens of thousands-- where each value is just some number, and the same for wrote, and a, and book. And the idea now is that, using these vector representations, I'd hope that wrote and authored have vector representations that are pretty close to one another-- their distance is not too far apart-- and the same with the vector representations for book and novel. This is what a lot of statistical machine learning approaches to natural language processing are about: using these vector representations of words. But how on earth do we define a word as just a sequence of numbers? What does it even mean to talk about the meaning of a word? The famous quote that answers this question is from a British linguist in the 1950s, J.R. Firth, who said, "You shall know a word by the company it keeps." What we mean by that is the idea that we can define a word in terms of the words that show up around it-- that we can get at the meaning of a word based on the context in which that word happens to appear. If I have a sequence of four words-- for blank he ate-- what goes in the blank? Well, in English, the types of words that might fill in the blank are words like breakfast, or lunch, or dinner. And so if we want to define what lunch or dinner means, we can define it in terms of the words that happen to show up around it: if one word shows up in a particular context and another word happens to show up in a very similar context, then those two words are probably related to each other-- they probably have a similar meaning. And this then is the foundational idea of an algorithm known as word2vec, which is a model for generating word vectors. You give word2vec a corpus of documents, just a whole bunch of text, and what word2vec will produce is a vector for each word. There are a number of ways it can do this. One common way is through what's known as the skip-gram architecture, which basically uses a neural network to predict context words given a target word-- so given a word like lunch, use a neural network to try and predict what words are going to show up around it. The way we might represent this is with a big neural network where we have one input cell for every word-- every word gets one node inside this neural network-- and the goal is to use this neural network to predict, given a target word, a context word. Given a word like lunch, can I predict the probabilities of other words showing up in its context, one word away or two words away, for instance-- in some sort of context window?
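The lecture switches to a pretrained model in a moment, but just to make the training step concrete: a library like gensim (not used in the lecture-- this is only an illustration) can train a skip-gram word2vec model in a few lines. With a toy corpus this small the resulting vectors won't be meaningful, but the shape of the process is the point:

```python
from gensim.models import Word2Vec

# Each "sentence" is a list of tokens; a real corpus would contain far more text
sentences = [
    ["for", "breakfast", "he", "ate", "eggs"],
    ["for", "lunch", "he", "ate", "a", "sandwich"],
    ["for", "dinner", "he", "ate", "pasta"],
]

# sg=1 selects the skip-gram architecture: predict context words from a target word
model = Word2Vec(sentences, vector_size=50, window=2, sg=1, min_count=1)

print(model.wv["lunch"])                # a 50-dimensional vector for "lunch"
print(model.wv.most_similar("lunch"))   # words whose vectors are closest to it
```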
And if you just give this neural network a whole bunch of data about words and which words show up in their context, you can train a neural network to do this calculation-- to be able to predict, given a target word, what those context words ultimately should be. And it will do so using the same methods we've talked about: backpropagating the error from the context word back through the neural network. And what you get, if we use a single layer-- just a single layer of hidden nodes-- is that for every single one of these words-- from this word, for example-- I get five edges, each of which has a weight, to each of these five hidden nodes. In other words, I get five numbers that effectively represent this particular target word. And the number of hidden nodes I choose in this middle layer-- I can pick that. Maybe I'll choose to have 50 hidden nodes or 100 hidden nodes, and then, for each of these target words, I'll have 50 different values or 100 different values, and those values we can effectively treat as the numerical vector representation of that word. The general idea here is that, if two words are similar-- they show up in similar contexts, meaning that, using those target words, I'd like to predict similar context words-- then the values I end up with in these vectors, these numerical weights on these edges, are probably going to be similar, because for two different words that show up in similar contexts, I would like the values that get calculated to ultimately be very similar to one another. And so, at a high level, what this word2vec training method is going to do is, given a whole bunch of words-- where initially, recall, we initialize these weights randomly, just picking whatever random weights we choose-- over time, as we train the neural network, we're going to adjust these weights, adjust the vector representations of each of these words, so that gradually, words that show up in similar contexts grow closer to one another and words that show up in different contexts get farther away from one another. And as a result, hopefully I get vector representations of words like breakfast, and lunch, and dinner that are similar to one another, and words like book, and memoir, and novel that are also similar to one another. So using this algorithm, we're able to take a corpus of data and train this neural network to figure out what vector, what sequence of numbers, is going to represent each of these words-- which is, again, a bit of a strange concept to think about, representing a word as just a whole bunch of numbers. But we'll see in a moment just how powerful this really can be. So we'll go ahead and go into vectors, and what I have inside of vectors.py-- which I'll open up now-- is that I'm opening up words.txt, which is a pretrained model: I've already run word2vec, and it's already given me a whole bunch of vectors for each of these possible words. I'm just going to take about 50,000 of them and go ahead and save their vectors inside of a dictionary called words. And then I've also defined some functions: distance; closest_words, which gets me the closest words to a particular word; and closest_word, which just gets me the one closest word, for example (a sketch of what those helpers might look like follows below). And so now let me try doing this.
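Those helpers boil down to something like this sketch (the lecture's vectors.py differs in its details; here, words is assumed to be the dictionary mapping each word to a NumPy array, passed in explicitly):

```python
import numpy as np


def distance(v1, v2):
    # Cosine distance: 0 when the vectors point the same way, larger as they diverge
    return 1 - np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))


def closest_words(words, vector, n=10):
    # Sort the whole vocabulary by distance to the given vector and keep the top n
    return sorted(words, key=lambda word: distance(vector, words[word]))[:n]


def closest_word(words, vector):
    return closest_words(words, vector, n=1)[0]
```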
Let me open up the Python interpreter and say something like, from vectors import star-- just import everything from vectors-- and now let's take a look at the meanings of some words. Let me look at the word city, for example. And here is a big array that is the vector representation of the word city. These exact numbers don't mean anything on their own, but this is how my computer is representing the meaning of the word city. We can do a different word, like house, and here then is the vector representation of the word house-- just a whole bunch of numbers that is somehow encoding the meaning of the word house. And how do I get at that idea? Well, one way to measure how good this is is by looking at the distance between various different words. There are a number of ways you can define distance. In the context of vectors, one common way is what's known as the cosine distance, which has to do with measuring the angle between vectors-- but in short, it's just measuring how far apart these two vectors are from each other. So if I take a word like book, how far away is it from itself-- how far away is the word book from book? Well, that's 0. The word book is zero distance away from itself. But let's see how far the word book is from a word like breakfast, where we're going to say 1 is very far away and 0 is not far away at all. All right, book is about 0.64 away from breakfast-- they seem to be pretty far apart. But let's now calculate the distance from book to novel, for example. Those two words are closer to each other: 0.34. The vector representation of the word book is closer to the vector representation of the word novel than it is to the vector representation of the word breakfast. And I can do the same thing and, say, compare breakfast to lunch, and those two words are even closer together-- they have an even more similar relationship to one another. So now it seems we have some representation of words-- representing a word using vectors-- that allows us to say that words that are similar to each other ultimately have a smaller distance between them. And this turns out to be incredibly powerful for representing the meaning of words in terms of their relationships to other words. I also have a function called closest_words that takes a word and gets the closest words to it. So let me get the closest words to book-- maybe the 10 closest words; we'll limit ourselves to 10. And all right, book is obviously closest to itself-- the word book-- but it's also closely related to books, and essay, and memoir, and essays, and novella, and anthology. And why are these the words it computed as close to it? Well, because based on the corpus of information that this algorithm was trained on, the vectors arose based on what words show up in similar contexts-- the word book shows up in contexts similar to those of words like memoir and essays, for example. And if I get the closest words to city, you end up getting city, town, township, village-- words that happen to show up in a similar context to the word city. Now, where things get really interesting is that, because these are vectors, we can do mathematics with them. We can calculate the relationships between various different words.
So I can say something like, all right, what if I had man and king? These are two different vectors, and this is a famous example that comes out of word2vec. I can take these two vectors and just subtract one from the other. This line here, this distance, is another vector that represents king minus man. Now, what does it mean to take a word and subtract another word? Normally, that doesn't make sense. In the world of vectors, though, you can take some vector, some sequence of numbers, subtract some other sequence of numbers, and get a new vector, a new sequence of numbers. And what this new sequence of numbers is effectively going to tell me is, what do I need to do to get from man to king? What is the relationship between these two words? This is some vector representation of what takes us from man to king, and we can then take this value and add it to another vector. You might imagine that the word woman, for example, is another vector that exists somewhere inside of this vector space. What might happen if I took this same idea, king minus man-- took that same vector and just added it to woman? What will we find around there? It's an interesting question, and we can answer it very easily, because I have vector representations of all of these things. Let's go back here. Let me look at the representation of the word man-- here's the vector representation of man. Let's look at the representation of the word king-- here's the representation of the word king. And I can subtract these two. What is the vector representation of king minus man? It's this array right here-- a whole bunch of values. So king minus man now represents the relationship between king and man in some sort of numerical vector format. So what happens if I add woman to that? Whatever took us from man to king-- go ahead and apply that same vector to the vector representation of the word woman-- and that gives us this vector here. And now, just out of curiosity, let's take this expression and find the closest word to it. And amazingly, what we get is the word queen. Somehow, when you take the distance between man and king-- this numerical representation of how man is related to king-- and add that same notion, king minus man, to the vector representation of the word woman, what we get is the vector representation, or something close to the vector representation, of the word queen, because this distance somehow encoded the relationship between these two words. And when you run it through this algorithm, it's not programmed to do this-- but if you just try to figure out how to predict words based on context words, you get vectors that are able to make these SAT-like analogies out of the information they've been given. So there are more examples of this. We can say, all right, let's figure out the distance between Paris and France. Paris and France are words; they each have a vector representation. This then is a vector representation of the distance between Paris and France-- what takes us from France to Paris. And let me go ahead and add the vector representation of England to that. So this then is Paris minus France plus England-- the distance between France and Paris as vectors, with the England vector added-- and let's go ahead and find the closest word to that. And it turns out to be London.
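With the words dictionary of NumPy vectors and the hypothetical helpers sketched earlier, that whole demonstration is a couple of lines of vector arithmetic (assuming the pretrained vocabulary stores its keys in lowercase):

```python
# king - man captures the relationship; adding it to woman lands near queen
print(closest_word(words, words["king"] - words["man"] + words["woman"]))        # queen

# The same trick with capitals: Paris - France + England lands near London
print(closest_word(words, words["paris"] - words["france"] + words["england"]))  # london
```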
Take this relationship, the relationship between France and Paris, go ahead and add the England vector to it, and the closest vector to that happens to be the vector for the word London. We can do more examples. I can say, let's take the word teacher-- that vector representation-- and let me subtract the vector representation of school. So what I'm left with is, what takes us from school to teacher? Apply that vector to a word like hospital and see, what is the closest word to that? Turns out the closest word is nurse. Let's try a couple more examples-- take the word ramen, for example, and subtract the word Japan. So what is the relationship between Japan and ramen? Add the word America to that. Want to take a guess at what you might get as a result? Turns out you get burritos. If you do the subtraction, do the addition, this is the answer that you happen to get as a consequence of this as well. So these very interesting analogies arise in the relationships between these words-- that if you just map out all of these words into a vector space, you can get some pretty interesting results as a consequence of that. And this idea of representing words as vectors turns out to be incredibly useful and powerful anytime we want to be able to do some statistical work with regard to natural language-- to be able to represent words not just as their characters, but to represent them as numbers, numbers that say something or mean something about the words themselves, and somehow relate the meaning of a word to other words that might happen to exist. So there are many tools, then, for being able to work inside of this world of natural language. Natural language is tricky. We have to deal with the syntax of language and the semantics of language, but we've really just seen the beginning of some of the ideas that are underlying a lot of natural language processing-- the ability to take text, extract information out of it, get some sort of meaning out of it, generate sentences maybe by having some knowledge of the grammar or maybe just by looking at probabilities of what words are likely to show up based on other words that have shown up previously-- and then finally, the ability to take words and come up with some distributed representation of them, to take words and represent them as numbers, and use those numbers to be able to say something meaningful about those words as well.
So this then is yet another topic in this broader heading of artificial intelligence. And just as I look back at where we've been now, we started our conversation by talking about the world of search, about trying to solve problems like tic-tac-toe by searching for a solution, by exploring our various different possibilities and looking at what algorithms we can apply to be able to efficiently search a space. We looked at some simple algorithms and then looked at some optimizations we could make to those algorithms, and ultimately, that was in service of trying to get our AI to know things about the world. And this has been a lot of what we've talked about today as well, trying to get knowledge out of text-based information-- the ability to take information and draw conclusions based on that information. If I know these two things for certain, maybe I can draw a third conclusion as well. That then was related to the idea of uncertainty. If we don't know something for sure, can we predict something, figure out the probabilities of something?
And we saw that again today in the context of trying to predict whether a tweet or a message has positive or negative sentiment, and trying to draw that conclusion as well. Then we took a look at optimization-- the sorts of problems where we're looking for a global or local maximum or minimum. This has come up time and time again, especially most recently in the context of neural networks, which are really just a kind of optimization problem where we're trying to minimize the total amount of loss based on the setting of the weights of our neural network, or based on which vector representations for words we happen to choose. And those ultimately helped us to be able to solve learning-related problems-- the ability to take a whole bunch of data, and rather than us telling the AI exactly what to do, let the AI learn patterns from the data for itself. Let it figure out what makes an inbox message different from a spam message. Let it figure out what makes a counterfeit bill different from an authentic bill, and be able to draw that analysis as well. And one of the big tools in learning that we used was neural networks, these structures that allow us to relate inputs to outputs by training internal networks to learn some sort of function that maps us from some input to some output-- ultimately yet another model in this language of artificial intelligence that we can use to communicate with our AI. Then finally today, we looked at some ways that AI can begin to communicate with us-- looking at ways that AI can begin to get an understanding of the syntax and the semantics of language, to be able to generate sentences, to be able to predict things about text written in a natural language like English, and to be able to do interesting analysis there as well. And there's so much more active research happening across the areas within artificial intelligence today, and we've really only just seen the beginning of what AI has to offer. So I hope you enjoyed this exploration into this world of artificial intelligence with Python. A big thank you to the course's teaching staff and the production team for making this class possible. This was an Introduction to Artificial Intelligence with Python.