[ORCHESTRA TUNING] [MUSIC PLAYING] DAVID MALAN: All right. This is CS50's Introduction to Programming with Python. My name is David Malan, and this is our week on regular expressions. So a regular expression, otherwise known as a regex, is really just a pattern. And indeed, it's quite common in programming to want to use patterns to match on some kind of data, often user input. For instance, if the user types in an email address, whether to your program, or a website, or an app on your phone, you might ideally want to be able to validate that they did indeed type in an email address and not something completely different. So using regular expressions, we're going to have the newfound capability to define patterns in our code to compare them against data that we're receiving from someone else, whether it's just to validate it, or, heck, even if we want to clean up a whole lot of data that itself might be messy because it, too, came from us humans. Before, though, we use these regular expressions, let me propose that we solve a few problems using just some simpler syntax and see what kind of limitations we run up against. Let me propose that I open up VS Code here, and let me create a file called validate.py, the goal at hand being to validate, how about just that, a user's email address. They've come to your app, they've come to your website, they type in their email address, and we want to say yes or no, this email address looks valid. All right. Let me go ahead and type code of validate.py to create a new tab here. And then within this tab, let me go ahead and start writing some code, how about, that keeps things simple initially. First, let me go ahead and prompt the user for their email address. And I'll store the return value of input in a variable called email, asking them "what's your email?" question mark. 
I'm going to go ahead and preemptively at least clean up the user's input a little bit by minimally just calling strip at the end of my call to input, because recall that input returns a string or a str. strs come with some built-in methods or functions, one of which is strip, which has the effect of stripping off any leading whitespace to the left or any trailing whitespace to the right. So that's just going to go ahead and at least avoid the human having accidentally typed in a space character. We're going to throw it away just in case. Now I'm going to do something simple. For a user's input to be an email address, I think we can all agree that it's got to minimally have an @ sign somewhere in it. So let's start simple. If the user has typed in something with an @ sign, let's very generously just say, OK, valid, looks like an email address. And if we're missing that @ sign, let's say invalid, because clearly it's not an email address. It's not going to be the best version of my code yet, but we'll start simple. So I'm going to ask the question, if there is an @ symbol in the user's email address, go ahead and print out, for instance, quote, unquote, "valid." Else, if there's not, print "invalid," because now I'm pretty confident that the email address is, in fact, invalid. Now, what is this code doing? Well, if @ sign in email is a Pythonic way of asking is this string quote, unquote "@" in this other string email, no matter where it is-- at the beginning, the middle, or the end. It's going to search through the entire string for you automatically. I could do this more verbosely. And I could use a for loop or a while loop and look at every character in the user's email address, looking to see if it's an @ sign. But this is one of the things that's nice about Python. You can do more with less. So just by saying if "@" quote, unquote in email, we're achieving that same result. We're going to get back True if it's somewhere in there, thus valid, or False if it is not. 
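The check described so far can be sketched as a small function. The name `looks_valid` and the sample inputs are assumptions introduced here so the logic can be run without typing at a prompt; the program in the lecture calls `input` and prints directly.

```python
def looks_valid(email: str) -> bool:
    """First, very generous check: is there an @ sign anywhere?"""
    email = email.strip()  # drop leading and trailing whitespace
    return "@" in email    # True if "@" appears anywhere in the string


# Sample inputs chosen for illustration only.
for candidate in ["malan@harvard.edu", "  malan@harvard.edu  ", "malan"]:
    print(repr(candidate), "valid" if looks_valid(candidate) else "invalid")
```

Because `in` only asks whether the character appears somewhere, a string consisting of nothing but an @ sign would also pass this check.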
Well, let me go ahead now and run this program in my terminal window with python of validate.py. And I'm going to go ahead and give it my email address-- malan@harvard.edu, Enter. And indeed, it's valid. Looks valid, is valid. But of course, this program is technically broken. It's buggy. What would be an example input, if someone might like to volunteer an answer here, that would be considered valid but you and I know it really isn't valid? AUDIENCE: Yeah, thank you. Well, for instance, you can type just two @ signs and that's it, and it'll still be valid-- still be valid according to your program, but missing something. DAVID MALAN: Exactly. We've set a very low bar here. In fact, if I go ahead and rerun python of validate.py, and I'll just type in one @ sign, that's it-- no username, no domain name, this doesn't really look like an email address. But unfortunately, my code thinks it, in fact, is, because it's obviously just looking for an @ sign alone. Well, how could we improve this? Well, minimally an email address, I think, tends to have, though this is not actually a requirement, tends to have an @ sign and a single dot at least, maybe somewhere in the domain name-- so malan@harvard.edu. So let's check for that dot as well. But again, strictly speaking it doesn't even have to be that case. But I'm going for my own email address, at least for now, as our test case. So let me go ahead and change my code now and say, not only if @ is in email, but also dot is in email as well. So I'm asking now two questions. I have two Boolean expressions-- if @ in email, and dot in email-- and I'm anding them together logically-- this is a logical and, so to speak. So if it's the case that @ is in email and dot is in email, OK, now I'm going to go ahead and say valid. All right. This would still seem to work for my email address. Let me go ahead and run python validate.py, malan@harvard.edu, Enter, and that, of course, is valid as expected. 
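As a rough sketch, the two-question version might look like the following. The function name `looks_valid` and the sample inputs are illustrative assumptions, not the lecture's exact file.

```python
def looks_valid(email: str) -> bool:
    """Second attempt: require both an @ sign and a dot, anywhere."""
    email = email.strip()
    # Two separate Boolean expressions joined by a logical and.
    return "@" in email and "." in email


for candidate in ["malan@harvard.edu", "@", "malan"]:
    print(repr(candidate), "valid" if looks_valid(candidate) else "invalid")
```

A lone @ sign is now rejected, but neither test says anything about where the @ sign or the dot has to appear.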
But here, too, we can be a little adversarial and type in something nonsensical like "@." and unfortunately, that, too, is going to be mistaken as valid, even though there's still no username, domain name, or anything like that. So I think we need to be a little more methodical here. In fact, notice that if I do this like this, the @ sign can be anywhere, and the dot can be anywhere. But if I'm assuming the user is going to have a traditional domain name like harvard.edu or gmail.com, I really want to look for the dot in the domain name only, not necessarily just the username. So let me go ahead and do this. Let me go ahead and introduce a bit more logic here, and instead do this. Let me go ahead and do email.split of quote, unquote @ sign. So email, again, is a string or a str. strs come with methods, not just strip but also another one called split that, as the name implies, will split one str into multiple ones if you give it a character or more to split on. So this is hopefully going to return to me two parts from a traditional email address, the username and the domain name. And it turns out I can unpack that sequence of responses by doing this-- username comma domain equals this. I could store it in a list or some other structure, but if I already know in advance what kinds of values I'm expecting, a username and hopefully a domain, I'm going to go ahead and do it like this instead and just define two variables at once on one line of code. And now I'm going to be a little more precise. If username-- if username, then I'm going to go ahead and say, print "valid." Else, I'm going to go ahead and say print "invalid." Now, this isn't good enough. But I'm at least checking for the presence of a username now. And you might not have seen this before, but if you simply ask a question like "if username," and username is a string, well, username-- "if username" is going to give me a true answer if username is anything except none or quote, unquote "nothing." 
So there's a truthy value here, whereby if username has at least one character, that's going to be considered true. But if username has no characters, it's going to be considered a false value effectively. But this isn't good enough. I don't want to just check for username. I want to also check that it's the case that dot is in the domain name as well. So notice here there's a bit of potential confusion with the English language. Here, I seem to be saying "if username and dot in domain," as though I'm asking the question, "if the username and the dot are in the domain," but that's not what this means. These are two separate Boolean expressions-- "if username," and separately, "if dot in domain." And if I parenthesize this, we could make that even more clear by putting parentheses there, parentheses here. So just to be clear, it's really two Boolean expressions that we're anding together, not one longer English-like sentence. Now, if I go ahead and run this, python validate.py Enter, I'll do my own email address again, malan@harvard.edu, and that's valid. And it looks like I could tolerate something like this. If I do malan@, just say, harvard, I think at the moment this is going to be invalid. Now, maybe the top-level domain harvard exists. But at the moment, it looks like we're looking for something more. We're looking for a top-level domain too, like .edu. For now, we'll just consider this to be invalid. But it's not just that we want to do-- it's not just that we want to check for the presence of a username and the presence of a dot. Let's be more specific. Let's start to now narrow the scope of this program, not just to be about generic emails more generally, but about edu addresses, so specifically for someone in a US university, for instance, whose email address tends to end with .edu. I can be a little more precise. And you might recall this function already. 
Instead of just saying, is there a dot somewhere in domain, let me instead say, and the domain ends with quote, unquote ".edu." Now we're being even more precise. We want there to be minimally a username that's not empty-- it's not just quote, unquote "nothing"-- and we want the domain name to actually end with .edu. Let me go ahead and run python of validate.py. And just to make sure I haven't made things even worse, let me at least test my own email address, which does seem to be valid. Now, it seems that I minimally need to provide a username, because we definitely do have that check in place. So I'm going to go ahead and say malan. And now I'm going to go ahead and say @. And it looks like I could be a little malicious here, just say malan@.edu, as though minimally meeting the requirements of this pattern. And that, of course, is considered valid, but I'm pretty sure there's no one at malan@.edu. We need to have some domain name in there. So we're still being a bit too generous. Now, we could absolutely continue to iterate on this program, and we could add some more Boolean expressions. We could maybe use some other Python methods for checking more precisely is there something to the left of the dot, to the right of the dot. We could use split multiple times. But honestly, this just escalates quickly. Like, you end up having to write a lot of code just to express something that's relatively simple in spirit-- just format this like an email address. So how can we go about improving this? Well, it turns out in Python there's a library for regular expressions. It's called succinctly R-E. And in the re library, you have a lot of capabilities to define and check for and even replace patterns. Again, a regular expression is a pattern. 
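The split-and-endswith version described above can be sketched like this. The `try`/`except` is an assumption added here and is not part of the lecture's code: unpacking into two variables raises `ValueError` unless the input contains exactly one @ sign, so this sketch treats that case as invalid. The name `looks_valid` is likewise hypothetical.

```python
def looks_valid(email: str) -> bool:
    """Non-regex check: a non-empty username and a domain ending in .edu."""
    email = email.strip()
    try:
        # split("@") returns a list; unpacking needs exactly two parts.
        username, domain = email.split("@")
    except ValueError:
        # Assumption: zero or several @ signs count as invalid here.
        return False
    # "if username" is truthy only when the string is non-empty.
    return bool(username) and domain.endswith(".edu")


for candidate in ["malan@harvard.edu", "malan@harvard", "malan@.edu"]:
    print(repr(candidate), "valid" if looks_valid(candidate) else "invalid")
```

Even with the extra guard, `malan@.edu` still passes, which is the kind of gap that keeps demanding more hand-written checks.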
And this library, the re library in Python, is going to let us define some of these patterns, like a pattern for an email address, and then use some built-in functions to actually validate a user's input against that pattern or even use these patterns to change the user's input or extract partial information therefrom. We'll see examples of all this and more. So what can and should I do with this library? Well, first and foremost, it comes with a lot of functionality. Here is the URL, for instance, to the official documentation. And let me propose that we focus on using one of the most versatile functions in the library, namely this-- search. re.search is the name of the function in the re module that allows you to pass in a few arguments. The first is going to be a pattern that you want to search for in, for instance, a string that came from a user. The string argument here is going to be the actual string that you want to search for that pattern. And then there's a third argument optionally that's a whole bunch of flags. A flag in general is like a parameter you can pass in to modify the behavior of the function. But initially, we're not even going to use this. We're just going to pass in a couple of arguments instead. So let me go ahead and employ this re library, this regular expression library, and just improve on this design incrementally. So we're not going to solve this problem all at once, but we'll take some incremental steps. I'm going to go back to VS Code here. And I'm going to go ahead now and get rid of most of this code. But I'm going to go into the top of my file and first of all, import this re library. So import re gives me access to that function and more. Now, after I've gotten the user's input in the same way as before, stripping off any leading or trailing whitespace, I'm just going to use this function super trivially for now, even though this isn't really a big step forward. 
I'm going to say, if re.search finds quote, unquote "@" in the email address, then let's go ahead and print "valid." Else, let's go ahead and print "invalid." At the moment, this is really no better than my very first version where I was just asking Python, if @ sign in the email address. But now I'm at least beginning to use this library by using its own re.search function, which for now you can assume returns a true value effectively if, indeed, the @ sign is in email. Just to make sure that this version does work as I expect, let me go ahead and run python of validate.py and Enter. I'll type in my actual email address, and we're back in business. But of course, this is not great, because if I similarly run this version of the program and just type in an @ sign, that's not an email address, and yet my code, of course, thinks it is valid. So how can I do better than this? Well, we need a bit more vocabulary in the realm of regular expressions, in order to be able to express ourselves a little more precisely. Really, the pattern I want to ultimately define is going to be something like, I want there to be something to the left, then an @ sign, then something to the right. And that something to the right should end with .edu but should also have something before the .edu, like Harvard, or Yale, or any other school in the US as well. Well, how can I go about doing this? Well, it turns out that in the world of regular expressions, whether in Python or a lot of other languages as well, there are certain symbols that you can use to define patterns. At the moment, I've just used literal raw text. If I go back to my code here, this technically qualifies as a regular expression. I've passed in a quoted string inside of which is an @ sign. Now, that's not a very interesting pattern. It's just an @ sign. But it turns out that once you have access to regular expressions or a library that offers that feature, you can more powerfully express yourself as follows. 
Let me reveal that the pattern that you pass to re.search can take a whole bunch of special symbols. And here's just some of them. In the examples we're about to see, in the patterns we're about to define, here are the special symbols. You can use a single period, a dot, to just represent any character except a newline, a blank line. So that is to say, if I don't really care what letters of the alphabet are in the user's username, I just want there to be one or more characters in the user's name, dot allows me to express A through z, uppercase and lowercase, and a bunch of other letters as well. * is going to mean-- a single asterisk-- zero or more repetitions. So if I say something *, that means that I'm willing to accept either zero repetitions, that is, nothing at all, or more repetitions-- 1, or 2, or 3, or 300. If you see a plus in my pattern, so that's going to mean one or more repetitions. That is to say, there's got to be at least one character there, one symbol, and then there's optionally more after that. And then you can say zero or one repetition. You can use a single question mark after a symbol, and that will say, I want zero of this character or one, but that's all I'll expect. And then lastly, there's going to be a way to specify a specific number of symbols. If you use these curly braces and a number, represented here symbolically as m, you can specify that you want m repetitions, be it 1, or 2, or 3, or 300. You can specify the number of repetitions yourself. And if you want a range of repetitions, like you want this few characters or this many characters, you can use curly braces and two numbers inside, called here m and n, which would be a range of m through n repetitions. Now, what does all of this mean? Well, let me go back to VS Code here, and let me propose that we iterate on this solution further. It's not sufficient to just check for the @ sign. We know that already. We minimally want something to the left and to the right. 
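To make the symbols above concrete, here is a small sketch that tries each one against a few strings. It uses `re.fullmatch`, which requires the whole string to match, rather than the lecture's `re.search`; that choice, the helper name `matches`, and the sample strings are assumptions made here so the effect of each quantifier is easy to see in isolation.

```python
import re


def matches(pattern: str, text: str) -> bool:
    """True if the entire text matches the pattern."""
    return re.fullmatch(pattern, text) is not None


examples = [
    (r"a.c", "abc"),       # . stands for any one character
    (r"a.c", "ac"),        # ...but there must be exactly one
    (r"ab*", "a"),         # * allows zero repetitions of b
    (r"ab*", "abbb"),      # ...or several
    (r"ab+", "a"),         # + needs at least one b
    (r"ab+", "abbb"),
    (r"ab?", "ab"),        # ? allows zero or one b
    (r"ab?", "abb"),       # ...but not two
    (r"a{3}", "aaa"),      # {m} means exactly m repetitions
    (r"a{2,3}", "aaaa"),   # {m,n} means between m and n repetitions
]

for pattern, text in examples:
    print(f"{pattern!r:12} {text!r:8} {matches(pattern, text)}")
```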
So how can I represent that? I don't really care what the user's username is, or what letters of the alphabet are in it, be it malan or anyone else's. So what I'm going to do to the left of this @ sign is I'm going to use a single period-- the dot that, again, indicates any character except for a newline. But I don't just want a single character. Otherwise, the person's username could only be a at such and such, or b at such and such. I want it to be multiple such characters. So I'm going to initially use a *. So dot * means give me something to the left, and I'm going to do another one, dot * something to the right. Now, this isn't perfect, but it's at least a step forward. Because now what I'm going to go ahead and do is this. I'm going to rerun python of validate.py. And I'm going to keep testing my own email address just to make sure I haven't made things worse. And that's now OK. I'm now going to go ahead and type in some other input, like how about just malan@ with no domain name whatsoever. And you would think this is going to be invalid. But, but, but it's still considered valid. But why is that? If I go back to this chart, why is malan@ with no domain now considered valid? What's my mistake here by having used .*@.* as my regular expression or regex? AUDIENCE: Because you're using the * instead of the plus sign. DAVID MALAN: Exactly. The *, again, means zero or more repetitions. So re.search is perfectly happy to accept nothing after the @ sign, because that would be zero repetitions. So I think I minimally need to evolve this and go back to my code here. And let me go ahead and change this from dot * to dot +. And let me change the ending from dot * to dot + so that now when I run my code here-- let me go ahead and run python of validate.py. I'm going to test my email address as always. Still working. Now let me go ahead and type in that same thing from before that was accidentally considered valid. Now I hit Enter, finally it's invalid. 
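The difference between the two patterns can be checked side by side. This is a minimal sketch; the helper name `found` and the sample inputs are assumptions for illustration. Note that `re.search` returns a match object when the pattern is found and `None` otherwise, which is why it works in an `if`.

```python
import re


def found(pattern: str, text: str) -> bool:
    """True if re.search finds the pattern somewhere in text."""
    return re.search(pattern, text) is not None


for email in ["malan@harvard.edu", "malan@", "@"]:
    star = found(r".*@.*", email)   # zero or more on each side
    plus = found(r".+@.+", email)   # one or more on each side
    print(f"{email!r:22} .*@.* -> {star}   .+@.+ -> {plus}")
```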
So now we're making some progress on being a little more precise as to what it is we're doing. Now, I'll note here, like with almost everything in programming, Python included, there's often multiple ways to solve the same problem. And does anyone see a way in my code here that I can make a slight tweak if I forgot that the plus operator exists and go back to using a *? If I allowed you only to use dots and only stars, could you recreate the notion of plus? AUDIENCE: Yes. Use another dot, dot dot *. DAVID MALAN: Yeah. Because if a dot means any character, we'll just use a dot. And then when you want to say "or more," use another dot and then the *. So equivalent to dot + would have been dot dot *, because the first dot means any character, and the second pair of characters, dot *, means zero or more other characters. And to be clear, it doesn't have to be the same character. Just by doing dot or dot * does not mean your whole username needs to be a, or aa, or aaa, or aaaa. It can vary with each symbol. It just means zero or more of any character back to back. So I could do this on both the left and the right. Which one is better? You know, it depends. I think an argument could be made that this is even more clear, because it's obvious now that there's a dot, which means any character, and then there's the dot *. But if you're in the habit of doing this frequently, one of the reasons things like the plus exist is just to consolidate your code into something a little more succinct. And if you're familiar with seeing the plus now, maybe this is more readable to you. So again, just like with Python more generally, you're going to often see different ways to express the same patterns, and reasonable people might agree or disagree as to which way is better than another. Well, let me propose to you that we can think about both of these models a little more graphically. 
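One quick way to convince yourself that dot dot * behaves like dot + is to run both patterns over the same inputs and compare. The sample list below is an assumption chosen for illustration, and agreement on a handful of strings is a sanity check rather than a proof.

```python
import re

samples = ["malan@harvard.edu", "malan@", "@harvard.edu", "@", "a@b", ""]

results = []
for text in samples:
    with_plus = re.search(r".+@.+", text) is not None
    with_dot_star = re.search(r"..*@..*", text) is not None
    results.append((text, with_plus, with_dot_star))
    print(f"{text!r:22} .+@.+ -> {with_plus}   ..*@..* -> {with_dot_star}")
```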
If this looks a little cryptic to you, let me go ahead and rewind to the previous incarnation of this regular expression, which was just a single dot *. This regular expression, .*@.* means what again? It means zero or more characters followed by a literal @ sign followed by zero or more other characters. Now when you pass this pattern in as an argument to re.search, it's going to read it from left to right and then use it to try to match against the input, email, in this case, that the user typed in. Now, how is the computer, how is re.search going to keep track of whether or not the user's email matches this pattern? Well, it turns out that it's going to be using a machine of sorts implemented in software known as a finite state machine, or more formally, a nondeterministic finite automaton. And the way it works, if we depict this graphically, is as follows. The re.search function starts over here in a so-called start state. That's the sort of condition in which it begins. And then it's going to read the user's email address from left to right. And it's going to decide whether or not to stay in this first state or transition to the next state. So for instance, in this first state, as the function is reading my email address, malan@harvard.edu, it's going to follow this curved edge up and around to itself, a reflexive edge. And it's labeled dot, because dot, again, just means any character. So as the function is reading my email address, malan@harvard.edu, from left to right, it's going to follow these transitions as follows, M-A-L-A-N. And then it's hopefully going to follow this transition to the second state, because there's a literal @ sign both in this machine as well as in my email address. Then it's going to try to read the rest of my address, H-A-R-V-A-R-D dot E-D-U, and that's it. And then the computer is going to check. 
Did it end up in an accept state, a final state, that's actually depicted here pictorially a little differently with double circles, one inside of the other? And that just means that if the computer finds itself in that second accept state after having read all of the user's input, it is, indeed, a valid email address. If by some chance, the machine somehow ended up stuck in that first state, which does not have double circles and is therefore not an accept state, the computer would conclude this is an invalid email address instead. By contrast, if we go back to my other version of the code where I instead had dot plus on both the left and the right, recall that re.search is going to use one of these state machines in order to decide from left to right whether or not to accept the user's input, like malan@harvard.edu. Can we get from the start state, so to speak, to an accept state to decide, yep, this was, in fact, meeting the pattern? Well, let's propose that this nondeterministic finite automaton looked like this instead. We're going to start as before in the leftmost start state, and we're going to necessarily consume one character per this first edge, which is labeled with a dot to indicate that we can consume any one character, like the m in malan@harvard.edu. Then we can spend some time consuming more characters before the @ sign, so the A-L-A-N. Then we can consume the @ sign. Then we can consume at least one more character, because recall that the regex has dot plus this time. And then we can consume even more characters if we want. So if we first consume the H in harvard.edu, that leaves the A-R-V-A-R-D, and then dot E-D-U. And now here, too, we're at the end of the story, but we're in an accept state, because that circle at the end has two circles total, which means that if the computer, if this function, finds itself in that accept state after reading the entirety of the user's input, it is, too, in fact, a valid email address. 
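The second machine can be modeled by hand in a few lines. This is only an illustrative sketch of the diagram for dot plus, @, dot plus: the state numbers, the `TRANSITIONS` table, and the `accepts` function are all assumptions made here, and nothing below is meant as a description of how Python's re module is actually implemented internally.

```python
# States: 0 = start, 1 = read one or more characters,
#         2 = just read the "@", 3 = accept (one or more characters after "@").
ANY = object()  # label for the "dot" edges: any character except a newline

TRANSITIONS = {
    0: [(ANY, 1)],
    1: [(ANY, 1), ("@", 2)],  # on "@" both edges apply: nondeterminism
    2: [(ANY, 3)],
    3: [(ANY, 3)],
}
ACCEPT = {3}


def accepts(text: str) -> bool:
    """Track every state the machine could be in after each character."""
    current = {0}
    for ch in text:
        following = set()
        for state in current:
            for label, target in TRANSITIONS[state]:
                if (label is ANY and ch != "\n") or label == ch:
                    following.add(target)
        current = following  # an empty set means the machine got stuck
    return bool(current & ACCEPT)


for email in ["malan@harvard.edu", "malan@", "@"]:
    print(repr(email), "valid" if accepts(email) else "invalid")
```

Keeping a set of possible states is one simple way to handle the point where an @ sign could either be treated as "any character" or as the literal @ edge.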
If by contrast, we had gotten stuck in one of those other states, unable to follow a transition, one of those edges, and therefore unable to make progress in the user's input from left to right, then we would have to conclude that email address is, in fact, invalid. Well, how can we go about improving this code further? Let me propose now that we check not only for a username and also something after the username, like a domain name, but minimally require that the string ends with .edu as well. Well, I think I could do this fairly straightforwardly. Not only do I want there to be something after the @ sign, like the domain like Harvard, I want the whole thing to end with .edu. But there's a little bit of danger here. What have I done wrong by implementing my regular expression now in this way, by using .+@.+.edu? What could go wrong with this version? AUDIENCE: The dot is-- the dot means something else in this context, where it means three or more repetitions of a character, which is why it will interpret it [INAUDIBLE]. DAVID MALAN: Exactly. Even though I mean for it to mean literally .edu, a period, and then edu, unfortunately in the world of regular expressions, dot means any character, which means that this string could technically end in aedu, or bedu, or cedu, and so forth, but that's not, in fact, what I want. So any instincts now as to how I could fix this problem? And let me demonstrate the problem more clearly. Let me go ahead and run this code here. Let me go ahead and type in malan@harvard.edu. And as always, this does, in fact, work. But watch what happens here. Let me go ahead and do malan@harvard and then-- malan@harvard?edu, Enter, that, too, is valid. So I could put any character there and it's still going to be accepted. But I don't want ?edu. I want .edu literally. Any instincts, then, for how we can solve this problem here? 
How can I get this new function, re.search, and a regular expression more generally, to literally mean a dot, might you think? AUDIENCE: You can use the escape character, the backslash? DAVID MALAN: Indeed. The so-called escape character, which we've seen before outside of the context of regular expressions when we talked about newlines. Backslash n was a way of telling the computer I want a newline, but without actually literally hitting Enter and moving the cursor yourself. And you don't want a literal n on the screen. So backslash n was a way to escape n and convey that you want a newline. It turns out regular expressions use a similar technique to solve this problem here. In fact, let me go into my regular expression. And before that final dot, let me put a single backslash. In the world of regular expressions, this is a so-called special sequence. And it indicates, per this backslash and a single dot, that I literally want to match on a dot. It's not that I want to match on any character and then edu. I want to match on a dot, or a period, edu. But we don't want Python to misinterpret this backslash as beginning an escape sequence, something special like backslash n, which even though we as the programmer might type two characters backslash n, it really is interpreted by Python as a single newline. We don't want any kind of misinterpretation like that here. So it turns out there's one other thing we should do for regular expressions like this that have a backslash used in this way. I want to specify to Python that I want this string, this regular expression in double quotes, to be treated as a raw string, literally putting an r at the beginning of the string to indicate to Python that you should not try to interpret any backslashes in the usual way. I want to literally pass the backslash and the dot and the edu into this particular function, search, in this case. 
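Here is a short sketch contrasting the unescaped and escaped dot, along with what the r prefix changes. The variable names `unescaped` and `escaped` and the sample inputs are assumptions for illustration.

```python
import re

unescaped = r".+@.+.edu"   # the final dot still means "any character"
escaped = r".+@.+\.edu"    # backslash-dot means a literal period

for email in ["malan@harvard.edu", "malan@harvard?edu"]:
    loose = re.search(unescaped, email) is not None
    strict = re.search(escaped, email) is not None
    print(f"{email!r:22} unescaped -> {loose}   escaped -> {strict}")

# In a raw string the backslash stays an ordinary character,
# so r"\n" is two characters while "\n" is a single newline.
print(len(r"\n"), len("\n"))
```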
So it's similar in spirit to using that f at the beginning of a format string, which, of course, tells Python to format the string in a certain way, plugging in variables that might be between curly braces. But in this case, r indicates a raw string that I want passed in exactly as is. Now, it's only strictly necessary if you are, in fact, using backslashes to indicate that you want some special sequence, like backslash dot. But in general, it's probably a good habit to get into to just use raw strings for all of your regular expressions so that if you eventually go back in, make a change, make an addition, you don't accidentally introduce a backslash and then forget that that might have some special or misinterpreted meaning. Well, let me go ahead and try this new regular expression. I'll clear my terminal window, run python of validate-- run python of validate.py. And then I'll type in my email address correctly, malan@harvard.edu. And that's, fortunately, still valid. Let me clear my screen and run it one more time, python of validate.py. And this time, let's mistype it as malan@harvard?edu, whereby there's obviously not a dot there, but there is some other single character that last time was misinterpreted as valid. But this time, now that I've improved my regular expression, it's discovered as, indeed, invalid. Any questions now on this technique for matching something to the left of the @ sign, something to the right, and now ending with .edu explicitly? AUDIENCE: What happens when user inserts multiple @ signs? DAVID MALAN: A good question. And you kind of called me out here. Well, when in doubt, let's try. Let me go ahead and do python of validate.py, malan@@@harvard.edu, which also is incorrect, unfortunately, my code thinks it's valid. So another problem to solve, but a shortcoming for now. Other questions on these regular expressions thus far? AUDIENCE: Can you use curly brackets m instead of backslash? 
DAVID MALAN: Can you use curly brackets instead of backslash? Not in this case. If you want a literal dot, backslash dot is the way to do it literally. How about one other question on regular expressions? AUDIENCE: Is this the same thing that Google Forms uses in order to categorize data in, let's say, some-- if you've got multiple people sending in requests about some feedback? Do they categorize the data that they get using this particular regular expression thing? DAVID MALAN: Indeed. If you've ever used Google Forms to not just submit it but to create a Google Form, one of the menu options is for response validation, in English at least. And what that allows you to do is specify that the user has to input an email address, or a URL, or a string of some length. But there's an even more powerful feature that some of you may not have ever noticed. And indeed, if you'd like to open up Google Forms, create a new form temporarily, and poke around, you will actually see, in English at least, quote, unquote "regular expression" mentioned as one of the mechanisms you can use to validate your users' input into your Google Form. So in fact, after today you can start avoiding the specific dropdowns of like email address, or URL, or the like, and you can express your own patterns precisely as well. Regular expressions can even be used in VS Code itself. If you go and find, or do a find and replace in VS Code, you can, of course, just type in words, like you could into Microsoft Word or Google Docs. You can also type, if you check the right box, regular expressions and start searching for patterns, not literally specific values. Well, let me propose that we now enhance this implementation further by introducing a few other symbols, because right now with my code, I keep saying that I want my email address to end with .edu and start with a username, but I'm being a little too generous. This does, in fact, work as expected for my own email address, malan@harvard.edu. 
But what if I type in a sentence like, "my email address is malan@harvard.edu," and suppose I've typed that into the program or I've typed that into a Google Form? Is this going to be considered valid or invalid? Well, let's consider. It's got an @ sign, so we're good there. It's got one or more characters to the left of the @ sign. It's got one or more characters to the right of the @ sign. It's got a literal .edu somewhere in there to the right of the @ sign. And granted, there's more stuff to the right. There's literally this period at the end of my English sentence. But that's OK, because at the moment, my regular expression is not so precise as to say, the pattern must start with the username and end with the .edu. Technically, it's left unsaid what more can be to the left and what more can be to the right. So when I hit Enter now, you'll see that that whole sentence in English is valid, and that's obviously not what you want. In fact, consider the case of using Google Forms or Office 365 to collect data from users. If you don't validate your input, your users might very well type in a full sentence or something else with a typographical error, not an actual email. So if you're just trying to copy all of the results that have been typed into your form so you can paste them into Gmail or some email program, it's going to break, because you're going to accidentally paste something like a whole English sentence into the program instead of just an email address, which is what your mailer expects. So how can I be more precise? Well, let me propose we introduce a few more symbols as well. It turns out in the context of a regular expression, one of these patterns, you can use the caret symbol, the little triangular mark, to represent that you want this pattern to match the start of the string specifically-- not anywhere but the start of the user's string. 
By contrast, you can use a $ sign in your regular expression to say that you want to match the end of the string, or technically just before the newline at the end of the string. But for all intents and purposes, think of caret as meaning "start of the string" and $ sign as meaning "end of the string." It is a weird thing that one is a caret and one is $ sign. These are not really things that I think of as opposites, like parentheses or something like that. But those are the symbols the world chose many years ago. So let me go back to VS Code now. And let me add this feature to my code here. Let me specify that yes, I do want to search for this pattern, but I want the user's input to start with this pattern and end with this pattern. So even though it's going to start looking even more cryptic, I put a caret symbol here at the beginning, and I put a $ sign here at the end. That does not mean I want the user to type a caret symbol or a $ sign. This is special symbology that indicates to re.search that it should now only look for an exact match against this pattern. So if I now go back to my terminal window-- and I'll leave the previous result on the screen-- let me type the exact same thing. "My email address is malan@harvard.edu," Enter-- sorry, period. And now I'm going to go ahead and hit Enter. Now that's considered invalid. But let me clear the screen. And just to make sure I didn't break things, let me type in just my email address, and that, too, is valid. Any questions now on this version of my regular expression, which, note, goes further to specify even more precisely that I want it to match at the start and the end? Any questions on this one here? AUDIENCE: OK. You have slash, and .edu, then the $ sign. But the dot is one of the regular expression, right? DAVID MALAN: It normally is. But this backslash that I deliberately put before this period here is an escape character. 
It is a way of telling re.search that I don't want any character there, I literally want a period there. And it's the only way you can distinguish one from the other. If I got rid of that slash, this would mean that the email address just has to end with any character, then an E, then a D, then a U. I don't want that. I want literally a period, then the E, then the D, then the U. This is actually common convention in programming and technology in general. If you and I decide on a convention, whereby we're using some character on the keyboard to mean something special, invariably we create a future problem for ourselves when we want to literally use that same character. And so the solution in general to that problem is to somehow escape the character so that it's clear to the computer that it's not that special symbol, it's literally the symbol it sees. AUDIENCE: So we don't even know the-- we don't need another slash before the $ sign? DAVID MALAN: No. Because in this case, $ sign means something special. Per this chart here, $ sign by itself does not mean US dollars or currency. It literally means "match the end of the string." If, however, I wanted the user to literally type in $ sign at the end of their input, the solution would be the same. I would put a backslash before the $ sign, which means my email address would have to be something like malan@harvard.edu $ sign, which is obviously not correct either. So backslashes just allow you to tell the computer to not treat those symbols specially, like meaning something special, but to treat them literally instead. How about one other question here on regular expressions? AUDIENCE: You said one represents to make it one plus, then you said one was to make it one with nothing. DAVID MALAN: Sure. AUDIENCE: So why would you add the plus? DAVID MALAN: Let me rewind in time. 
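The effect of the caret, the $ sign, and the escaping backslash can be sketched as follows. This is an illustration under assumptions: the constant names are hypothetical, and the `LITERAL_DOLLAR` pattern exists only to show what escaping the $ sign would mean, not something the lecture recommends.

```python
import re

UNANCHORED = r".+@.+\.edu"          # may match anywhere inside the input
ANCHORED = r"^.+@.+\.edu$"          # must span the input from start to end
UNESCAPED_DOT = r"^.+@.+.edu$"      # the dot before edu now means "any character"
LITERAL_DOLLAR = r"^.+@.+\.edu\$$"  # \$ is a literal dollar sign, the last $ is the anchor

sentence = "My email address is malan@harvard.edu."

print(bool(re.search(UNANCHORED, sentence)))             # the address is found inside the sentence
print(bool(re.search(ANCHORED, sentence)))               # extra words and the final period break the match
print(bool(re.search(ANCHORED, "malan@harvard.edu")))    # the bare address still matches
print(bool(re.search(UNESCAPED_DOT, "malan@harvard?edu")))
print(bool(re.search(LITERAL_DOLLAR, "malan@harvard.edu$")))
```

Note that only the metacharacters that should be taken literally get a backslash; the anchors themselves stay unescaped.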
I think what you're referring to was one of our earlier versions that initially looked like this, which just meant zero or more characters, then an @ sign, then zero or more other characters. We then evolved that to be this, dot plus on both sides, which means one or more characters on the left, then an @ sign, then one or more characters on the right. And if I'm interpreting your question correctly, one of the points I made earlier was that if you didn't use plus or forgot that it exists, you could equivalently achieve the exact same result with two dots and a *, because the first dot means any character-- it's got to be there-- the second dot * means zero or more other characters, and same on the right. So it's just another way of expressing the same idea. "One or more" can be represented like this with dot dot *, or you can just use the handier syntax of dot +, which means the same thing. All right. So I daresay there's still some problems with the regular expression in this current form, because even though now we're starting to look for the user name at the beginning of the string from the user, and we're looking for the .edu literally at the end of the string from the user, those dots are a little too encompassing right now. I'm allowed to type in more than the single @ sign. Why? Because @ is a character, and dot means any character. So honestly, I can have as many @ signs in this thing at the moment as I want. For instance, if I run python of validate.py, malan@harvard.edu, still works as expected. But if I also run python of validate.py and incorrectly do malan@@@harvard.edu, should be invalid, but it's considered valid instead. So I think we need to be a little more restrictive when it comes to that dot. And we can't just say, oh, any old character there is fine. We need to be more specific. Well, it turns out that regular expressions also support this syntax. 
You can use square brackets inside of your pattern, and inside of those square brackets include one or more characters that you want to look for specifically. Alternatively, you can inside of those square brackets put a caret symbol, which unfortunately in this context, means something completely different from "match the start of the string." But this would be the complement operator inside of the square brackets, which means "you cannot match any of these characters." So things are about to look even more cryptic now. But that's why we're focusing on regular expressions on their own here. If I don't want to allow any character, which is what a dot is, let me go ahead and I could just say, well, I only want to support A, or Bs, or Cs, or Ds, or Es, or Fs, or Gs. I could type in the whole alphabet here plus some numbers to actually include all of the letters that I do want to allow. But honestly, a little simpler would be this. I could use a ^ symbol and then an @ sign, which has the effect of saying, this is the set of characters that has everything except an @ sign. And I can do the same thing over here. Instead of a dot to the right of the @ sign, I can do open bracket ^, @ sign. And I admit, things are starting to escalate quickly here, but let's start from the left and go to the right. This ^ outside of the square brackets at the very start of my string, as before, means "match from the start of the string." And let's jump ahead. The $ sign all the way at the end of the regular expression means "match at the end of the string." So if we can mentally tick those off as straightforward, let's now focus on everything else in the middle. Well, to the left here we have new syntax-- a square bracket, another ^, an @ sign, and a closed square bracket, and then a +. The + means the same thing as always. It means "one or more of the things to the left." What is the thing to the left? Well, this is the new syntax. 
Inside of square brackets here, I have a ^ symbol and then an @ sign. That just means any character except an @ sign. It's a weird syntax, but this is how we can express that simple idea-- any character on the keyboard except for an @ sign. And heck, even other characters that aren't physically on your keyboard but that nonetheless exist. Then we have a literal @ sign, then we have another one of these same things-- square bracket, ^@ closed bracket, which means any character except an @ sign, then one or more of those things, followed by literally a period edu. So now let me go ahead and do this again. Let me rerun python of validate.py and test my own email address to make sure I've not made things worse. And we're good. Now let me go ahead and clear my screen and run python of validate.py again and do malan@@@harvard.edu, crossing my fingers this time. And finally, this now is invalid. Why? I'm allowing myself to have one @ sign in the middle of the user's input, but everything to the left per this new syntax cannot be an @ sign. It can be anything but one or more times. And everything to the right of the @ sign can be anything but an @ sign one or more times followed by, lastly, a literal .edu. So again, the new syntax is quite simply this-- square brackets allow you to specify a set of characters that you literally type out at your keyboard-- A, B, C, D, E, F, or the complement, the opposite, the ^ symbol, which means "not," and then the one or more symbols you want to exclude. Questions now on this syntax here? AUDIENCE: So right after @ sign, can we use the curly brackets m one so that we can only have one repetition of the @ symbol? DAVID MALAN: Absolutely. So we could do this. Let me go ahead and pull up VS Code. And let me delete the current form of a regular expression and go back to where we began, which was just dot * @ and dot *. I could absolutely do something like this and require that I want at least one of any character here. 
And then I could do something more to have any more as well. So the curly brace syntax, which we saw on the slide earlier but didn't yet use, absolutely can be used to specify a specific number of characters. But honestly, this is more verbose than is necessary. The best solution, arguably, or the simplest, at least, ultimately, is just to say dot +. But there, too, another example of how you can solve the same problem multiple ways. Let me go back to where the regular expression just was and take other questions as well. Questions on the sets of characters or complementing that set? AUDIENCE: So can you use that same syntax to say that you don't want a certain character throughout the whole string? DAVID MALAN: You could. It's going to be-- you could absolutely use the same character to exclude-- you could absolutely use this syntax to exclude a certain character from the entire string. But it would be a little harder right now, because we're still requiring .edu at the end. But yes, absolutely. Other questions? AUDIENCE: What happens if the user inputs .edu in the beginning of the string? DAVID MALAN: A good question. What happens if the user types in .edu at the beginning of the string? Well, let me go back to VS Code here. And let's try to solve this in two different ways. First, let's look at the regular expression and see if we can infer if that's going to be tolerated. Well, according to the current cryptic regular expression, I'm saying that you can have any character except the @ sign. So that would work. I could have the dot for the .edu. But then I have to have an @ sign. So that wouldn't really work, because if I'm just typing in .edu, we're not going to pass that constraint. So now let me try this by running the program. Let me type in just literally .edu. That doesn't work. But, but, but I could do this, .edu@.edu. That, too, is invalid. But let me do this, .edu@something.edu. That passes. So it's starting to get a little weird now. 
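The complemented set can be tried against the inputs from this part of the demonstration with a short sketch like the one below. The helper name `looks_valid` is an assumption for illustration; the pattern is the one described above, with `[^@]` on both sides of a literal @ sign.

```python
import re

# [^@] means "any single character except an @ sign".
PATTERN = r"^[^@]+@[^@]+\.edu$"


def looks_valid(email: str) -> bool:
    """Return True if email matches the pattern from start to end."""
    return re.search(PATTERN, email) is not None


samples = [
    "malan@harvard.edu",
    "malan@@@harvard.edu",
    ".edu",
    ".edu@.edu",
    ".edu@something.edu",
]
for sample in samples:
    print(sample, "->", "valid" if looks_valid(sample) else "invalid")
```

The input .edu@.edu fails because `[^@]+` needs at least one character between the @ sign and the final .edu, while .edu@something.edu satisfies every part of the pattern.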
Maybe it's valid, maybe it's not. But I think we'll eventually be more precise, too. How about one more question on this regular expression and these complementing of sets? AUDIENCE: Can we use another domain name, the string input? DAVID MALAN: Can you use another domain name? Absolutely. I'm using my own just for the sake of demonstration. But you could absolutely use any domain or top-level domain. And I'm using .edu, which is very US centric. But this would absolutely work exactly the same for any top-level domain. All right. Let me go ahead now and propose that we improve this regular expression further, because if I pull it up again in VS Code here, you'll see that I'm being a little too tolerant still. It turns out that there are certain requirements for someone's username and domain name in an email address. There is an official standard in the world for what an email address can be and what characters can be in it. And this is way too accommodating of all the characters in the world except for the @ symbol. So let's actually narrow the definition of what we're going to tolerate in usernames. And companies like Gmail could certainly do this as well. Suppose that it's not just that I want to exclude @ sign. Suppose that I only want to allow for, say, characters that normally appear in words, like letters of the alphabet, A through z, be it uppercase or lowercase, maybe some numbers, and heck, maybe even an underscore could be allowed, too. Well, we can use this same square bracket syntax to specify a set of characters as follows. I could do abcdefghij-- oh, my god. This is going to take forever. I'm going to have to type out all 26 letters of the alphabet, both lowercase and uppercase. So let me stop doing that. There's a better way already. If you want to specify within these square brackets a range of letters, you can actually just do a hyphen. If you literally do a-z in these square brackets, the computer is going to know you mean a through z. 
You do not need to type 26 letters of the alphabet. If you want to include uppercase letters as well, you just do the same. No spaces, no commas, you literally just keep typing a through capital Z. So I have little a hyphen little z, big A hyphen big Z. No spaces, no commas, no separators. You just keep specifying those ranges. If I additionally want numbers, I could do 01234-- nope. You don't need to type in all 10 decimal digits. You can just say 0 through 9 using a hyphen as well. And if you now want to support underscores as well, which is pretty common in usernames for email addresses, you can literally just type an underscore at the end. Notice that all of these characters are inside of square brackets, which just again, means here is a set of characters that I want to allow. I have not used a ^ symbol at the beginning of this whole thing, because I don't want to complement it-- complement it with an E, not compliment it with an I-- I don't want to complement it by making it the opposite. I literally want to accept only these characters. I'm going to go ahead and do the same thing on the right. If I want to require that the domain name similarly come from this set of characters, which admittedly is a little too narrow, but it's familiar for now so we'll keep it simple, I'm going to go ahead and paste that exact same set of characters over there to the right. And so now, it's much more restrictive. Now I'm going to go ahead and run python of validate.py. I'm going to test my own email address, and we're still good. I'm going to clear my screen and run it once more, this time trying to break it. Let me go ahead and do something like, how about, david_malan@harvard.edu, Enter, but that, too, is going to be valid. But if I do something completely wrong again, like malan@@@harvard.edu, that's still going to be invalid. Why? 
Because my regular expression currently only allows for a single @ in the middle, because everything to the left must be alphanumeric-- alphabetical or numeric-- or an underscore, the same thing to the right, followed by the .edu. Now honestly, this is a regular expression that you might be in the habit of typing in the real world. As cryptic as this might look, this is the world of regular expressions. So you'll get more comfortable with this syntax over time. But thankfully, some of these patterns are so common that there are built-in shortcuts for representing some of the same information. That is to say, you don't have to constantly type out all of the symbols that you want to include, because odds are some other programmer has had the same problem. So built into regular expressions themselves are some additional patterns you can use. And in fact, I can go ahead and get rid of this entire set, a through z lowercase, A through Z uppercase, 0 through 9 and an underscore, and just replace it with a single backslash w. Backslash w in this case represents a "word character," which is commonly known as an alphanumeric symbol or the underscore as well. I'm going to do the same thing over here. I'm going to highlight the entire set of square brackets, delete it, and replace it with a single backslash w. And now I feel like we're making progress, because even though it's cryptic, and would have looked even more cryptic a little bit ago, now it's at least starting to read a little more friendly. This ^ on the left means "start matching at the beginning of the string." Backslash w means "any word character." The + means "one or more." @ symbol literally. Then another word character, one or more. Then a literal dot, then literally edu, and then match at the very end of the string, and that's it. So there's more of these, too. 
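A brief sketch comparing the explicit set with the backslash w shorthand follows; the constant names are hypothetical. One caveat that is a property of Python rather than something stated in the lecture: for ordinary str patterns in Python 3, `\w` also matches letters and digits outside a-z, A-Z, and 0-9, so it is only identical to the explicit set when the `re.ASCII` flag is passed.

```python
import re

EXPLICIT = r"^[a-zA-Z0-9_]+@[a-zA-Z0-9_]+\.edu$"
SHORTHAND = r"^\w+@\w+\.edu$"

for email in ["malan@harvard.edu", "david_malan@harvard.edu", "malan@@@harvard.edu"]:
    explicit_ok = re.search(EXPLICIT, email) is not None
    shorthand_ok = re.search(SHORTHAND, email) is not None
    print(email, explicit_ok, shorthand_ok)

# An accented letter is a word character by default, but not under re.ASCII.
accented = "jos\u00e9@harvard.edu"
print(re.search(EXPLICIT, accented) is not None)
print(re.search(SHORTHAND, accented) is not None)
print(re.search(SHORTHAND, accented, re.ASCII) is not None)
```

For the plain ASCII inputs used in the demonstration, the two patterns agree.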
And we won't use them all here, but here is a partial list of the patterns you can use within a regular expression. One, you have backslash d for any decimal digit, "decimal digit" meaning 0 through 9. Commonly done here, too, is if you want to do the opposite of that, the complement, so to speak, you can do backslash capital D, which is anything that's not a decimal digit. So it might be letters, and punctuation, and other symbols as well. Meanwhile, backslash s means whitespace characters, like a single hit of the space, or maybe hitting Tab on the keyboard. That's whitespace. Backslash capital S is the opposite or complement of that-- anything that's not a whitespace character. Backslash w, we've seen, a word character, as well as numbers and the underscore. And if you want the complement or opposite of that, you can use backslash capital W to give you everything but a word character. Again, these are just common patterns that so many people were presumably using in yesteryear that it's now baked into the regular expression syntax so that you can more succinctly express your same ideas. Any questions, then, on this approach here, where we're now using backslash w to represent my word character? AUDIENCE: So what I want to ask about was the-- actually the previous approach, like the square bracket approach. Could we accept lists in there? DAVID MALAN: Yes. We'll see this before long. But suppose you wanted to tolerate not just .edu, but maybe .edu, or .com, you could do this. You could introduce parentheses, and then you can or those together. I could say com or edu. Could also add in something like in the US, or gov, or net, or anything else, or org, or the like. And each of the vertical bars here means something special. It means "or." And the parentheses simply group things together. Formally, you have this syntax here-- A or B, A or vertical bar B, means "A has to match or B has to match," where A and B can be any other patterns you want. 
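To make the shorthand classes and the vertical bar concrete, here is a small sketch. The sample text and the names `text` and `TLD_PATTERN` are made up for illustration; the list of top-level domains mirrors the ones mentioned above.

```python
import re

text = "cs50 rocks_2024!"

print(re.findall(r"\d+", text))  # runs of decimal digits
print(re.findall(r"\D+", text))  # runs of anything that is not a digit
print(re.findall(r"\s", text))   # whitespace characters
print(re.findall(r"\S+", text))  # runs of non-whitespace
print(re.findall(r"\w+", text))  # runs of word characters
print(re.findall(r"\W", text))   # anything that is not a word character

# Parentheses group the alternatives; each vertical bar means "or".
TLD_PATTERN = r"^\w+@\w+\.(com|edu|gov|net|org)$"
for email in ["malan@harvard.edu", "malan@example.com", "malan@harvard.io"]:
    print(email, re.search(TLD_PATTERN, email) is not None)
```

Without the parentheses, the vertical bars would split the entire pattern into alternatives rather than just the top-level domain.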
In parentheses, you can group those things together. So just like math, you can combine ideas into one phrase and do this thing or the other. And there's other syntax as well that we'll soon see. Other questions on these regular expressions and this syntax here? AUDIENCE: What if we put spaces in the expression? DAVID MALAN: Sure. So if you want spaces in there, you can't use backslash w alone, because that is only a word character which is alphabetical, numerical, or the underscore. But you could do this. You could go back to this approach whereby you use square brackets. And you could say a through z, or A through Z, or 0 through 9, or underscore, or I'm going to hit the space bar, a single space. You can put a literal space inside of the square brackets, which will allow you then to detect a space. Alternatively, I could still use backslash w, but I could combine it as follows. I could say, give me a backslash w or a backslash s, because recall that backslash s is whitespace. So it's even more than a single space. It could be a tab. But by putting those things in parentheses, now you can match either the thing on the left or the thing on the right one or more times. How about one other question on these regular expressions? AUDIENCE: Perfect. So I was going to ask, does the backslash w include a dot? Because-- no, OK. DAVID MALAN: No, it only includes letters, numbers, and underscore. That is it. AUDIENCE: And I was wondering, you gave an example at the beginning that had spaces, like this is my email, so-and-so. I don't think our current version-- or even quite a long while ago stopped accepting it. Was that because of the ^ or because of something else? DAVID MALAN: No, the reason I was handling spaces in other English words when I typed out my email address as malan@harvard.edu was because we were using initially dot *, or dot +, which is any character. And even after that, we said anything except the @ sign, which includes spaces. 
Only once I started using square brackets and a through z and 0 through 9 and underscore did we finally get to the point where we would reject white space. And in fact, I can run this here. Let me go into the current version of my code in VS Code, which is using, again, the backslash w's for word characters, let me run python of validate.py and incorrectly type in something like "my email address is malan@harvard.edu," period, which has spaces to the left of my username, and that is now invalid, because space is not a word character. You're going to notice, too, that technically I'm not allowing dots. And some of you might be thinking, wait a minute. My Gmail address has a dot in it. That's something we're going to still have to fix. A backslash w is not the end all here. It's just allowing us to express our previous solution a little more succinctly. Now, one thing we're still not handling quite properly is uppercase versus lowercase. The backslash w technically does handle lowercase letters and uppercase, because it's the exact same thing as that set from before, which had little a through little z and big A through big Z. But watch this. Let me go ahead in my current form run python of validate.py, and just because my Caps lock key is down, MALAN@HARVARD.EDU, shouting my email address. It's going to be OK in terms of the MALAN. It's going to be OK in terms of the HARVARD, because those are matching the backslash w, which does include lowercase and uppercase. But I'm about to see invalid. Why? Why is MALAN@HARVARD.EDU invalid when it's in all caps here, even though I'm using backslash w? AUDIENCE: Yeah. So you are asking for the domain.edu in lowercase, and you're typing it in uppercase. DAVID MALAN: Exactly. I'm typing in my email address in all uppercase, but I'm looking for literally ".edu." And as I see you with AirPods and so many of you with headphones, I apologize for yelling into my microphone just now to make this point. 
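The two rejections just described, plus the earlier suggestions for deliberately allowing spaces, can be sketched as below. The constant names are assumptions for illustration, and the space-tolerant patterns are shown on plain phrases rather than on email addresses.

```python
import re

EMAIL = r"^\w+@\w+\.edu$"

# A space is not a word character, so the sentence is rejected.
print(re.search(EMAIL, "my email address is malan@harvard.edu.") is not None)

# \w accepts uppercase letters, but the literal edu in the pattern is lowercase.
print(re.search(EMAIL, "MALAN@HARVARD.EDU") is not None)

# Two ways to allow spaces on purpose:
SET_WITH_SPACE = r"^[a-zA-Z0-9_ ]+$"  # a literal space inside the brackets
WORD_OR_SPACE = r"^(\w|\s)+$"         # \s also covers tabs and other whitespace

print(re.search(SET_WITH_SPACE, "my email address") is not None)
print(re.search(WORD_OR_SPACE, "my email address") is not None)
print(re.search(SET_WITH_SPACE, "my\temail") is not None)
print(re.search(WORD_OR_SPACE, "my\temail") is not None)
```

The tab example shows the difference between a literal space in a set and the broader backslash s class.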
But let's see if we can't fix that. Well, if my pattern on line 5 is expecting it to be lowercase, there's actually a few ways I can solve this. One would be something we've seen before. I could just force the user's input to all lowercase. And I could put onto the end of my first line .lower and actually force it all to lowercase. Alternatively, I could do that a little later. Instead of passing an email, I could pass in the lowercase version of email, because email addresses should, in fact, be case insensitive. So that would work, too. But there's another mechanism here, which is worth seeing. It turns out that that function before called re.search supports, recall, a third argument as well, these so-called flags. And flags are configuration options, typically to a function, that allow you to configure it a little differently. And how might I go about configuring this call to re.search a little bit differently insofar as I'm currently only passing in two arguments? Well, it turns out that some of the flags you can pass into this function are these. It turns out that the regular expression library in Python, a.k.a. re, comes with a few built-in variables, so to speak, things that you can think of as constants, that have meaning to re.search. And they do so as follows. If you pass in as a flag re.IGNORECASE, what re.search is going to do is ignore the case of the user's input. It can be uppercase, lowercase, a combination thereof, the case is going to be ignored. It will be treated case insensitively. And you can do other things, too, that we won't do here. But if you want to handle the user's input that maybe spans multiple lines-- maybe they didn't just type in an email address but an entire paragraph of text, and you want to match different lines of that text that is multiple lines. 
Another flag is for re.MULTILINE for just that, or re.DOTALL, whereby you can configure the dot to recognize not just any character except newlines but any character plus newlines as well. But for now, let me go ahead and just make use of this first one. Let me pass in a third argument to re.search, which is re.IGNORECASE. Let me now rerun the program without clearing my screen, python of validate.py. Let me type in again in all caps, effectively shouting, MALAN@HARVARD.EDU, Enter, and now it's considered valid, because I'm telling re.search specifically to ignore the case of the input. And that, too, here is fine. And why might I do this approach rather than call .lower in one of those other locations? Eh, if I don't actually want to change the user's input for whatever reason, I can still treat it case insensitively without actually changing the value of that variable itself. All right, any final questions now on this validation of email addresses? AUDIENCE: So the pattern is a string, right? DAVID MALAN: Mm-hmm. AUDIENCE: Can we use an f-string? DAVID MALAN: You can. Yes, you can use an f-string so that you could plug in, for instance, the value of a variable and pass it into the function. Other questions on this? AUDIENCE: Backslash w character, could we take it as an input from the user? DAVID MALAN: Technically yes. That's not a problem we're trying to solve right now. We want the user to provide literal input, like their email address, not necessarily a regular expression. But you could imagine building software that asks the user, especially if they're more advanced users, to type in a regular expression for some reason to validate something else against that. And in fact, that's what Google is doing. If you play around with Google Forms and create a form with response validation and select Regular Expression, Google lets you and me type in our own regular expressions, which would be a perfect example of that. All right. 
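The options discussed here can be compared side by side in a short sketch. The variable names are hypothetical, the two-line sample is invented, and the f-string portion is one possible way to plug in a variable (using `re.escape` so the plugged-in text is treated literally), not necessarily how the lecture would write it.

```python
import re

PATTERN = r"^\w+@\w+\.edu$"
email = "MALAN@HARVARD.EDU"

# Option 1: lowercase a copy of the input before matching.
lowered_ok = re.search(PATTERN, email.lower()) is not None

# Option 2: leave the input alone and pass a flag as the third argument.
flag_ok = re.search(PATTERN, email, re.IGNORECASE) is not None

print(lowered_ok, flag_ok, email)  # email itself is unchanged

# re.MULTILINE lets ^ and $ match at the start and end of each line.
two_lines = "a@b.edu\nc@d.edu"
per_line = re.findall(PATTERN, two_lines, re.MULTILINE)
print(per_line)

# re.DOTALL lets the dot match a newline as well.
dot_plain = re.search(r"a.b", "a\nb") is not None
dot_all = re.search(r"a.b", "a\nb", re.DOTALL) is not None
print(dot_plain, dot_all)

# An f-string can build a pattern; the rf prefix keeps backslashes raw.
tld = "edu"  # hypothetical variable holding the required top-level domain
built = rf"^\w+@\w+\.{re.escape(tld)}$"
print(built == PATTERN)
```

One thing to watch with f-strings is that curly braces are also regular expression syntax, so a literal brace in the pattern has to be doubled.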
Well, let me propose that we try to solve one other problem here, whereby if I go into the same version as before, which is now ignoring case, but I type in one of my other email addresses. Let me go ahead and run python of validate.py. And this time, let me type in not malan@harvard.edu, which I use primarily, but another email address of mine, malan@cs50.harvard.edu, which forwards to the same. Let me go ahead and hit Enter now. And huh, invalid, even though I'm pretty sure that is, in fact, my email address. Well, let's put our finger on the reason why. Why at the moment is malan@cs50.harvard.edu being considered invalid, even though I'm pretty sure I send and receive email from that address, too? Why might that be? AUDIENCE: Because there is a dot that has come after the @ symbol. DAVID MALAN: Exactly. There's a dot after my cs50. And I'm not expecting any dots there, I'm expecting only, again, word characters, which is A through z, 0 through 9, and underscore. So I'm going to have to retool here. But how could I go about doing this? Well, it turns out theoretically, there could be other email addresses, even though they'd be getting a little excessively long, for instance, malan@something.cs50.harvard.edu, which does not technically exist, but it could. You can have, of course, multiple dots in a domain name like we see here. Wouldn't it be nice if we could handle that as well? Well, let me propose that we modify my regular expression as follows. It turns out that you can group ideas together. And you can not only ask whether or not this pattern matches or this one using syntax like A vertical bar B, which means "either A or B," you can also group things together and then apply some other operator to them as well. In fact, let me go back to the code here. And let me propose that if I want to tolerate a subdomain, like cs50, that may or may not be there, let me go ahead and change it as follows. I could naively do this. 
If I want to support subdomains, I could say, well, let's allow for other word characters plus, and then a literal dot. And notice, I'll highlight in blue here what I've just added. Everything else is the same, but I'm now adding room for another sequence of one or more word characters and then a literal dot. So this now, I think, if I rerun python of validate.py, will work for malan@cs50.harvard.edu, Enter. Unfortunately, does anyone see where this is going? Let me rerun python of validate.py and type in as I keep doing, malan@harvard.edu, which up until now has kept working despite all of my changes. But now, ugh, finally I've broken my own email address. So logically what's the solution here? Well, there's a bunch of ways we could solve this. I could maybe start using two regular expressions and support email addresses of the form username@domain.tld, or username@subdomain.domain.tld, where TLD just means Top Level Domain, like edu. Or I could maybe just modify this one, because I'd prefer not to have two regular expressions or one that's twice as big. Why don't I just specify to re.search that part of this pattern is optional? What was the symbol we saw earlier that allows you to specify that the thing before it is technically optional? AUDIENCE: The straight bar? We were using the straight bar as an-- optional, make the argument optional. DAVID MALAN: So we could. We could use a vertical bar and some parentheses and say, "either there's something here or there's nothing." We could do that in parentheses. But I think there's actually an even easier way. AUDIENCE: Actually, it's a question mark. DAVID MALAN: Indeed, question mark. Think back to this summary here of our first set of symbols, whereby we had not just dot and * and +, but also a question mark, which means literally "zero or one repetitions," which effectively means optional. It's either there, one, or it's not, zero. Now, how can I translate that to this code here? 
Well, let me go ahead and surround this part of my pattern with parentheses, which doesn't mean I want literally a parentheses in the user's input, I just want to group these characters together. And in fact, this now will still work. I've only added parentheses around the new part for the subdomain. Let me run python of validate.py. Let me run malan@cs50.harvard.edu, Enter. That's still valid. But to be clear, if I rerun it again for malan@harvard.edu, that is still invalid, but not if I go in here and say, after the parentheses, which now is one logical unit, it's one big group of ideas together, I add a single question mark there. This will now tell re.search that that whole thing in parentheses can either be there once or be there not at all, zero times. So what does this translate into when I run it? Well, let me go ahead and rerun it with malan@cs50.harvard.edu so that the subdomain is there. That works as before. Let me clear my screen and run it again, python of validate.py with malan@harvard.edu, which used to work then broke. Are we back in business now? We are. That's now valid again. Questions now on this approach, where we've used not just the question mark but the parentheses as well? AUDIENCE: Yeah. You said it works for zero or one repetitions. What if you have more? DAVID MALAN: What if you have more? That's OK. That's where you could do *. * is zero or more, which gives you all the flexibility in the world. AUDIENCE: Yeah. So I was just asking that-- with question marks, there's only one repetition allowed. DAVID MALAN: It means zero or one repetition. So it's either not there or it is there. And so that's why this pattern now, if I go back to my code, even though again, it admittedly looks cryptic, let me highlight everything after the @ sign and before the $ sign. This now represents a domain name, like harvard.edu, or a subdomain within the domain name. Why? Well, this part to the right is the same as always. 
Backslash w + means something like Harvard or Yale. Backslash .edu means literally ".edu." So the new part is this. In parentheses, I have another set of backslash w + backslash dot now. But it's all in parentheses. I'm now having a question mark right after that, which means that whole thing in parentheses either can be there or not be there. Either of those is acceptable. So a question mark effectively makes something optional. It would not be correct to remove the parentheses, because what would this mean? If I removed the parentheses, that would mean that only this dot is optional, which isn't really what we want to express. I want the subdomain, like cs50, and the additional dot to be what's there or not there. How about one other question on regexes here? AUDIENCE: Can we use this for the usernames? DAVID MALAN: Absolutely. We still have other problems. We're not solving all of the problems today just yet. But absolutely. Right now, we are not letting you have a period in your username. And again, some of you with Gmail accounts or other accounts, you probably have not just underscores, numbers, and letters. You might have periods, too. Well, we could fix that, not using question mark here per se. But now that we have these parentheses at our disposal, what I could do is this. I could use parentheses to surround the backslash w to say "any word character," which is the same thing, again, as a letter, or a number, or an underscore. But I could also or in, using a vertical bar, something else, like a literal dot. Now, a literal dot needs to be escaped, otherwise it represents any character, which would be a regression, a step back. But now notice what I've done. In parentheses, I'm telling re.search that those first few characters in your email address, that is your username, have to be word characters, like A through z, uppercase or lowercase, or 0 through 9, or an underscore, or a literal dot. We could do this differently, too. 
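As a rough recap of the last few steps, here is a minimal sketch of the three patterns discussed so far. The patterns are reconstructed from the narration rather than copied from the screen, and the names `NAIVE`, `OPTIONAL`, `DOTTED_USER`, and `is_valid` are hypothetical, chosen only for this illustration.

```python
import re

# Patterns reconstructed from the narration; treat them as assumptions.
NAIVE = r"^\w+@\w+\.\w+\.edu$"                 # subdomain is required
OPTIONAL = r"^\w+@(\w+\.)?\w+\.edu$"           # subdomain appears zero or one times
DOTTED_USER = r"^(\w|\.)+@(\w+\.)?\w+\.edu$"   # username may also contain dots


def is_valid(pattern, email):
    """Return True if email matches pattern, ignoring case."""
    return re.search(pattern, email, re.IGNORECASE) is not None


for email in ["malan@harvard.edu", "malan@cs50.harvard.edu", "david.malan@harvard.edu"]:
    print(
        email,
        is_valid(NAIVE, email),
        is_valid(OPTIONAL, email),
        is_valid(DOTTED_USER, email),
    )
```

The naive pattern rejects malan@harvard.edu because it insists on a subdomain, while the grouped and optional version accepts both forms. Only the last pattern accepts a dot in the username.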
I could get rid of the parentheses and the or, and I could just use a set of characters. I could, again, manually say a through z, A through Z, 0 through 9, underscore, and then I could do a literal dot with a backslash period. And now I technically don't even need the uppercase, because I'm already telling the computer to ignore case. I can just pick one or the other. Which one is better is really up to you. Whichever one you think is more readable would generally be the better design. All right. Let me propose that I rewind this in time to where we left off, which was here. And let me propose that there are, indeed, still limitations of this solution, not just with the username, not just with the domain name. We're still being a little too restrictive. So would you like to see the official regular expression that at least browsers use nowadays whenever you type in an email address to a web form, and the web form, the browser, tells you yes or no, your email address is syntactically valid? Ready? Ready? Here is-- and this isn't even officially the right regular expression. It's a simplified version that browsers use because it catches most mistakes but not all. Here we go. This is the regular expression for a valid email address, at least as browsers nowadays implement them. Now it's crazy cryptic at first glance. But note-- and it's wrapping on to many lines, but it's just one pattern. But just notice the now-familiar symbols. There is the ^ symbol at the very top. There is the $ sign at the very end. There is a square bracket over here and then some of these ranges plus other characters. Turns out you don't normally see these characters in email addresses. It looks like you're swearing at someone in their username. But they're valid characters. They're valid officially. That doesn't mean that Gmail is going to allow you to put $ signs and other punctuation in your username. But officially, some servers might allow that. 
So if you really want to validate a user's email address, you would actually come up with or copy-paste something like this. But honestly, this looks so cryptic. And if you were to type it out manually, you are so likely to make a mistake. What's the better solution here instead? This is where, per past weeks, libraries are your friend. Surely someone else on the internet, a programmer more experienced than you, even, has come up with code that validates email addresses properly, using this regular expression or even something more sophisticated than that. So generally, if the problem at hand is to validate input that is pretty conventional-- an email address, a URL, something where there's an official definition that's independent of you yourself-- find a popular library that you're comfortable using and use it in your code to validate email addresses. This is not a wheel, necessarily, that you yourself should invent. We've used email addresses, though, to iteratively start from something simple, too simple, and build on top of that. So you could certainly imagine using regular expressions still to validate things that aren't email addresses but are data that are important to you. So we at least now have these building blocks. Now, besides the regular expressions themselves, it turns out there's other functions in Python's re library for regular expressions. Among them is this function here, re.match, which is actually very similar to re.search, except you don't have to specify the ^ symbol at the very beginning of your regex if you want to match from the start of a string. re.match by design will automatically start matching from the start of the string for you. Similar in spirit is re.fullmatch, which does the same thing but not only matches at the start of the string but the end of the string, so that you, too, don't need to type in the ^ symbol or the $ sign as well. 
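To make the difference between these three functions concrete, here is a small sketch. The pattern is a simplified stand-in for the ones built earlier and deliberately has no ^ or $ in it, so the anchoring comes only from the function being called.

```python
import re

# Simplified pattern for illustration only; note there is no ^ or $ here.
pattern = r"\w+@\w+\.edu"

# re.search looks anywhere in the string, so extra text on either side is fine.
print(bool(re.search(pattern, "my email is malan@harvard.edu ok")))  # True

# re.match only matches starting at the beginning of the string.
print(bool(re.match(pattern, "malan@harvard.edu ok")))               # True
print(bool(re.match(pattern, "my email is malan@harvard.edu")))      # False

# re.fullmatch requires the whole string, start to end, to match.
print(bool(re.fullmatch(pattern, "malan@harvard.edu")))              # True
print(bool(re.fullmatch(pattern, "malan@harvard.edu ok")))           # False
```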
But let's go ahead and transition back now to some actual code, whereby we solve a different problem in spirit. Rather than just validate the user's input and make sure it looks the way we want, let's just assume that the users are not going to type in data exactly as we want, and so we're going to have to clean up their input. This happens so often when you're using like a Google Form, or Office 365 form, or anything else to collect user input. No matter what your form question says, your users are not necessarily going to follow those directions. They might go ahead and type in something that's a little differently formatted than you might like. Now, you could certainly go through the results and download a CSV, or open the Google spreadsheet, or equivalent in Excel, and just clean up all of the data manually. But if you've got lots of submissions-- dozens, hundreds, thousands of rows in your data set-- doing things manually might not be very fun. It might be much more effective to write code, as in Python, that can allow you to clean up that data and any future data as well. So let me propose that we go ahead here and close validate.py. And let's go ahead and create a new program altogether called format.py, the goal of which is to reformat the user's input in the format we expect. I'm going to go ahead and run code of format.py. And let's suppose that the data we're going to reformat is the user's name-- so not email address but name this time. And we're going to hope that they type in their name properly, like David Malan. But some users might be in the habit, for whatever reason, of typing their name backwards, if you will, with a comma, such as Malan comma David instead. Now, it's fine because both are clearly as readable to the human. 
But if you want to standardize how those names are stored in your system, perhaps a database, or CSV file, or something else, it would be nice to at least standardize or canonicalize the format in which you're storing your data, so that if you print out the user's name it's always the same format, David Malan, and there's no commas or backwardness to it. So let's go ahead and do something familiar. Let's go ahead and give myself a variable called name and set it equal to the return value of input, asking the user, as we've done many times, "what's your name," question mark. I'm going to go ahead and proactively at least clean up some messiness, as we keep doing here, by just stripping off any leading or trailing whitespace. Just in case the user accidentally hits the spacebar, we don't want that ultimately in our data set. And now let me go ahead and do this as we've done before. Let me just go ahead quickly and print out, just to make sure I'm off to the right start, "hello," and then in curly braces name, so making an fstring to format "hello," comma, "name." Now let me go ahead and clear my screen and run python of format.py. Let me behave and type in my name as I normally would, David, space, Malan, Enter. And I think the output looks pretty good. It looks as expected grammatically. Let me now go ahead, though, and play this game again. But this time, maybe because I'm not thinking, or I'm just in the habit of doing last name comma first, I do Malan, comma, David, and hit Enter. All right. Well, this now is weird. Even though the program is just spitting out exactly what I typed in, arguably this is not close to correct, at least grammatically. It should really say "hello, David Malan." Now, maybe I could have some if conditions and I could just reject the user's input if they type a comma or get their names backwards somehow. 
But that's going to be too little too late if the user has already submitted a form online, and I already have the data, and now I need to go in and clean it up. And it's not going to be fun to go through manually in Google Spreadsheets, or Apple Numbers, or Microsoft Excel and manually fix a lot of people's names to get rid of the commas and move the first name before the last, as is conventional in the US. So let's do this. It could be a little fragile, but let's start to express ourselves a little programmatically here and ask this. If there is a comma in the person's name, which is Pythonic-- I'm just asking the question, is this shorter string in this longer string?-- then let me go ahead and do this. Let me go ahead and grab that name in the variable, split on not just the comma but the space after, assuming the human typed in a space after their name. And let me go ahead and store the result of that splitting of Malan, comma, David into two variables. Let's do last, comma, first, again unpacking the sequence of values that comes back. Now let me go ahead and reformat the name. So I'm going to forcibly change the user's name to be as I expect. So name is actually going to be this format string-- first name then last name, both in curly braces but formatted together with a single space, so that I'm overwriting the user's input and updating my name variable accordingly. For the moment, to be clear, this program is interactive. Like, the users, like me, are typing their name into the program. But imagine the data already is in a CSV file. It came in from some process like a Google Form or something else online. You could imagine writing code similar to this, but that maybe goes and reads that file into memory first. Maybe it's a CSV via CSV Reader or DictReader, and then iterating over each of those names. But we'll keep it simple and just do one name at a time. 
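The split-based approach just described might look roughly like this. It is a sketch, not the exact code on screen: the logic is wrapped in a hypothetical function called `reformat` so it can be tried on sample strings instead of prompting with input.

```python
def reformat(name):
    """Turn 'Last, First' into 'First Last'; leave other input alone."""
    name = name.strip()
    if "," in name:
        # Assumes exactly one comma followed by exactly one space.
        last, first = name.split(", ")
        name = f"{first} {last}"
    return name


print(f"hello, {reformat('David Malan')}")   # hello, David Malan
print(f"hello, {reformat('Malan, David')}")  # hello, David Malan
```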
But now what's kind of interesting here is if I go back to my terminal window and clear it, and run python of format.py, and hit Enter, I'm going to type in David, space, Malan as before. And I think we're still good. But I'm also going to go ahead and do this-- python of format.py Malan, comma, David, with a space in between, crossing my fingers and hit Enter, and voila. That now has been fixed. Such a simple thing to be sure. But it is so commonly necessary to clean up users input. Here we see at least one way to do so pretty easily. Now, to be fair, there's some problems here. And in fact, can someone imagine a scenario in which this code really doesn't fix the user's input? What could still go wrong even with this fix in my code? Any thoughts? AUDIENCE: If they typed in their name comma and then [INAUDIBLE]. DAVID MALAN: Oh, and then something else. Yeah. So let me try this, for instance. Let me go ahead and run a program. And I am the only David Malan that I know. But suppose I were, let's say, junior like this. And it's common, in English at least, to sometimes put a comma there. You don't necessarily need the comma, but I'm one of those people who uses a comma. That's now really, really broken. So I've broken some assumption there. And so that could certainly go wrong here. What else? Well, let me go ahead and run this again. And if I did Malan, comma, David, no space, because I'm being a little sloppy, I'm not paying attention, which is going to happen when you have lots of users ultimately, well, this really broke now. Notice I have a ValueError, an actual exception. Why? Well, because split is supposed to be splitting the string into two strings by looking for the comma and a space. But if there is no comma and space, it can't split it into two things. And the fact that I have two variables on the left, but I'm only getting back one thing on the right, means that I can't do this code quite as this. So it's fragile to be sure. 
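Both failure cases can be reproduced with a short sketch. The function name `reformat` is hypothetical and simply wraps the split-based logic described above so that it can be called on test strings.

```python
def reformat(name):
    """Split-based version: assumes 'Last, First' with one comma and one space."""
    name = name.strip()
    if "," in name:
        last, first = name.split(", ")
        name = f"{first} {last}"
    return name


# A comma that is not separating last name from first name gets misread.
print(reformat("David Malan, Jr."))  # Jr. David Malan

# No space after the comma: split returns one item, so unpacking fails.
try:
    reformat("Malan,David")
except ValueError as error:
    print("ValueError:", error)
```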
But wouldn't it be nice if we could at least improve it? For instance, we now know some regular expression syntax. What if I at least wanted to make this space optional? Well, I could use my newfound regular expression syntax and put a question mark. Question mark means zero or one of the things to the left. What's the thing to the left? It's literally a space. I don't even need parentheses if there's just one thing there. So that would be the start of a pattern that says, I must have a comma, and then I may or may not have a space, zero or one spaces thereafter. Unfortunately, the version of split that's built into the str type, as in this case, doesn't support regular expressions. If we want our regular expressions, we need to go use that library here. So let me go ahead and do this. Let me go in and leave this code as is but go up to the top now and import re to import the library for regular expressions. And now let me go ahead and start changing my approach here. I'm going to go ahead and do this. I'm going to use the same function called re.search, and I'm going to search for a pattern that I think will be last, comma, first. So let me use my newfound regular expression syntax and represent a pattern for something like Malan, comma, space, David. How can I do this? Well, inside of my quotes for re.search, I'm going to have something, so dot +. Then I'm going to have a comma. Then I'm going to have a space. Then I'm going to have something, dot +. Now I'm going to preemptively refine this a little bit. I want this whole pattern to start matching at the beginning of the user's input. So I'm going to add the ^ right away. And I want the end of the user's input to be matched as well, so I'm going to add the $ sign at the end, so that I'm literally expecting any character one or more times, then a comma, then a space, then any other character one or more times. And then that is it. And I'm going to pass in the name variable as before. 
Now, when we've used re.search in the past, we really used it just to answer a question. Does the user's input match the following pattern or not, true or false, effectively. But re.search is actually more powerful than that. You can actually get back more information. And you can do this. You can specify a variable and then an assignment operator, and get back more precise answers to what has been found when searched for. But what is it you want to get back? Well, it turns out there's this other feature of regular expressions which allows you to use parentheses, not just to group things together, but to capture them. It turns out when you specify parentheses in a regular expression, unbeknownst to us up until now, everything in the parentheses will be returned to you as a return value from the re.search function. It's going to allow you to extract specific amounts of information from the user's own input. You can reverse this process, too, by using the non-capturing version as well. You can use parentheses, and then literally a question mark, and a colon, and then some other stuff. And that will say, don't bother capturing this. I just want to group things. But for now, we're going to use just the parentheses themselves. So how am I going to do this? Well, if I want to get back the user's last name and first name, I think what I want to capture is the dot + here and the dot + here. So I've deliberately surrounded in parentheses the dot + both to the left and the right of the comma, not because I'm grouping them together per se-- I'm not adding a question mark, I'm not adding another + or a *-- I'm using parentheses now for capturing purposes. Why? Well, I'm going to do this next. I'm going to still ask a Boolean question like, "if there are matches, then do this." So if matches is not effectively false, like None, I do expect I've gotten back some matches. And watch what I can do now. 
I can do last, comma, first equals whatever matches in and get back all of the groups of matches. Then go ahead and update name just like before with a format string and do first and then last in curly braces as well, and then at the very bottom, just like before, print out, for instance, "hello," comma, "name." So the new code now is everything highlighted here. I'm using re.search to search for whether the user typed their name in last, comma, first format. But I am more powerfully using re.search to capture some of the user's input. What's going to get captured? Anything I surrounded in parentheses will be returned to me as return values. How do you get at those return values? You ask the variable to which you assign them for all of the groups, all of the groups of parentheses that were captured. So let me go ahead and do this. Let me go ahead now and run python of format.py, Enter. And I'm going to type my name as usual. In this case, nothing happens with this if condition. Why? Because I did not type a comma, and so this search does not find a comma, so there are no matches. So we immediately just print out "hello, name." Nothing interesting or new there. But if I now go ahead, and clear my screen, and run python of format.py, and do Malan, comma, space, David, Enter, we've reformatted my name. Well, how did this work? Let me be a little more explicit now. It turns out I don't have to just say matches.groups. I can get specific groups back that I want. So let me change my code a little bit more. Let me go ahead now and just say this. Let's update name to-- actually, let's do this. Let's say that the last name is going to be in the matches but specifically group 1. The first name is going to be in the matches but specifically group 2. Why 1 and 2? Because this is the first set of parentheses to the left of the comma. This is the second set of parentheses to the right of the comma. 
And based on the input, this would be the user's last name in this scenario, Malan. This would be the user's first name, David, in this scenario. That's why I'm using group 1 for the last name and group 2 for the first name. And now I'm going to go ahead and say name equals fstring, again, first and then last, done. And let me refine this one last step before we take questions. I don't really need these variables if I'm immediately using them. Let's just go ahead and tighten this up further as we've done in the past for design's sake. If I want to make the name the concatenation of the person's first name and last name, let's just do this. matches.group 2 first, plus a space, plus matches.group 1. So it's just up to me from left to right, this is group 1, this is group 2. So group 1 is last, group 2 is first. So if I want to flip them around and update the value of name, I can explicitly get group 2 first, concatenate using +, a single space, and then concatenate on group 1. All right. That was a lot. Let me pause to see if there are questions. The key difference here is we're still using re.search the exact same way, but now I'm using its return value, not just to answer a question true or false, but to actually get back specific matches anything I captured, so to speak, with parentheses. AUDIENCE: Why is it here we're using 1 and 2 instead of 0 and 1 for capturing the first? DAVID MALAN: Really good question. A good observation. In almost every other context, we've started counting at 0 and 1 instead of 1 and 2. It turns out there's something else in location 0 when it comes back from re.search related to the string itself. So according to the documentation of this function only, 1 is the first set of parentheses, and 2 is the second set, and onward from there. Just a different convention here. Other questions? AUDIENCE: What if we write nothing, like whitespace, comma, whitespace? How do we check truth of condition? 
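As a recap of the capturing steps above, here is a minimal sketch using a fixed sample string instead of input. The pattern is reconstructed from the narration. The `only_first` variable is a hypothetical addition that shows the non-capturing `(?:...)` form mentioned earlier.

```python
import re

name = "Malan, David"

# Two sets of capturing parentheses: group 1 is the last name, group 2 the first.
matches = re.search(r"^(.+), (.+)$", name)
if matches:
    print(matches.groups())   # ('Malan', 'David')
    print(matches.group(0))   # Malan, David  (the entire matched string)
    name = matches.group(2) + " " + matches.group(1)

print(f"hello, {name}")       # hello, David Malan

# With (?:...) the first set of parentheses groups without capturing,
# so the first name becomes group 1.
only_first = re.search(r"^(?:.+), (.+)$", "Malan, David")
print(only_first.groups())    # ('David',)
```

Group 0 holding the whole match is the reason the captured groups start counting at 1.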
DAVID MALAN: Before I answer directly, let me just run this and make sure I've not broken anything further. Let me run python of format.py. Let me type in David, space, Malan, the right way. Let me run it once more. Let me type in Malan, comma, David, the wrong way that we're fixing. And we're still good. But I think it will still break. Let me run it a third time with Malan, comma, David with no space. And now it's still broken. Why? Because I'm still looking for comma space. Now, how can I fix that? One way I could do that is to add a question mark here, which again, is zero or one of the thing before. So if I have a space and then a question mark literally, no need for any parentheses, then I can literally tolerate both Malan, comma, space, David or Malan, comma, David. So let's try again. Before, this did not work. Let's do Malan, comma, David with no space. Now it does actually work. So we can tolerate different amounts of whitespace if I am a little more precise with my pattern. Let me go ahead and try once more. Let me very weirdly but possibly hit the space bar a few too many times so now they're really separated. This, again, is not going to work quite right, because the extra whitespace is going to get captured along with my first name. So now I might want to strip, left and right, any of the leading whitespace on the result. Or what I could do here is say this. Instead of zero or one, I could use a * here, so space *. And now if I run this once more with Malan, comma, space, space, space, David, Enter, now we've cleaned up things further. So you can imagine, depending on how messy the data is that you're cleaning up, your regular expressions might need to get more and more sophisticated. It really depends on just how many problems we want to solve at once. Well, allow me to propose that we forge ahead further just to clean this up even more so, using a feature that's actually relatively new to Python itself. 
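The three variations on the space after the comma can be compared side by side. This is a sketch under a few assumptions: the patterns are reconstructed from the narration, and `reformat` and the three constant names are hypothetical helpers for illustration.

```python
import re

EXACTLY_ONE = r"^(.+), (.+)$"    # comma followed by exactly one space
ZERO_OR_ONE = r"^(.+), ?(.+)$"   # the space is optional
ZERO_OR_MORE = r"^(.+), *(.+)$"  # any number of spaces, including none


def reformat(name, pattern):
    """Swap 'Last, First' into 'First Last' if the pattern matches."""
    name = name.strip()
    matches = re.search(pattern, name)
    if matches:
        name = matches.group(2) + " " + matches.group(1)
    return name


for pattern in (EXACTLY_ONE, ZERO_OR_ONE, ZERO_OR_MORE):
    for sample in ("Malan, David", "Malan,David", "Malan,   David"):
        # repr makes any leftover whitespace visible in the output.
        print(pattern, repr(sample), "->", repr(reformat(sample, pattern)))
```

With the optional-space pattern, the extra spaces in the last sample end up inside the second group, so the result starts with stray whitespace. The * version absorbs them before the group begins.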
It is very common when using regular expressions to do exactly what I've done here-- to call a function like re.search with capturing parentheses inside, such that you get back a return value that I'm calling matches-- you could call it something else, but I'm calling it by default matches. And then notice on the next line, I'm saying "if matches." Wouldn't it be nice if I could just tighten things up further and do these all on the same line? Well, you can sort of. Let me go ahead and do this. Let me get rid of this if. And let me just try to say something like this. If matches equals re.search and then colon-- so combining my if condition into just one line instead of those two. In C, or C++, or Java, you would actually do something like this, surrounding the whole thing with parentheses, sometimes double sets to suppress any warnings, if you want to do two things at once, that is, if you want to not only assign the return value of re.search to a variable called matches, but also subsequently ask a Boolean question, is this effectively true or false. That's what I was doing a moment ago. Let me undo this. A moment ago, I was getting back the return value and assigning it to matches, and then I was asking the question. Well, it turns out this need to have two lines of code presumably rubbed people the wrong way for too long in Python. And so you can now combine these two kinds of lines into one. But you need a new operator. You cannot just say, "if matches equals re.search" and then a colon at the end. You instead need to do this. You need to do colon equals if and only if you want to assign something from right to left and you want to ask an if or an elif question on the same line. This is affectionately known, as you can see here, as the walrus operator. And it's new to Python in recent years. And it both allows you to assign a value, as I'm doing from right to left, and ask a Boolean question about it, like I'm doing with the if or equivalently elif. 
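Here is a small sketch of the two-line version next to the walrus version. The function names `two_lines` and `one_line` are hypothetical, and the pattern is reconstructed from the narration. The `:=` operator requires Python 3.8 or later.

```python
import re

PATTERN = r"^(.+), *(.+)$"


def two_lines(name):
    """Assign first, then test on a separate line."""
    matches = re.search(PATTERN, name)
    if matches:
        name = matches.group(2) + " " + matches.group(1)
    return name


def one_line(name):
    """Assign and test in the same if, using := (Python 3.8+)."""
    if matches := re.search(PATTERN, name):
        name = matches.group(2) + " " + matches.group(1)
    return name


print(two_lines("Malan, David"))  # David Malan
print(one_line("Malan, David"))   # David Malan
```

Writing `if matches = re.search(...)` with a single equals sign is a SyntaxError in Python, which is why the separate operator exists.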
Does anyone know why this is called the walrus operator? If you kind of look at it like this, perhaps, if you're familiar with walruses, it kind of sort of looks like a walrus. So a minor detail but a relatively new feature of Python that honestly, you'll probably continue to see online, and in source code, and in textbooks, and so forth, increasingly so now that it does exist. It does not change the logic at all. If I run python of format.py and type Malan, comma, space, David, it still fixes things, but it's tightened up my code just a bit more. All right. Let's go ahead and look at one final problem to solve, that of extracting information now as well. So at this point, we've now validated the user's input by checking whether or not it meets a certain pattern. We've cleaned up the user's input by checking against a pattern, whether it matches or not, and if it does match, we kind of reorganize some of the user's information so we can clean up their input and standardize the format in which we're storing or printing it, in this case. Let's do one final example where we're very specifically extracting information in order to answer some question. So let me propose this. Let me go ahead and close format.py and create a new file called twitter.py, the goal of which is to prompt users for the URL of their Twitter profile and extract from it, infer from that URL, what is the user's username. Now, why might you want to do this? Well, one, you might want users to be able to just very easily copy and paste the URL from their own Twitter profile into your form, into your app, so that you can figure out what their username is. Or you might have a form that asks the user for their Twitter username, and because people aren't necessarily paying very close attention, some people type their username. Some people type their whole URL or something else altogether. 
It would be nice now that you're a programmer to just be more tolerant of different types of input and just take on the burden of canonicalizing, standardizing the data, but being flexible with the users. It's arguably a better user experience if you just let me copy-paste or type in what I want, you clean it up. You're the programmer not me. Lends for a better experience, perhaps. Well, let me go ahead and do this with twitter.py. Let me first go ahead and prompt the user here for a value for a variable that I'll call url, and just ask them to input the URL of their Twitter profile. I'm going to go ahead and strip off any leading or trailing whitespace, just in case users accidentally hit the spacebar. That's literally the least I can do quite easily. But now let's go ahead and do this. Suppose that the user's address is the following. Let me print out what did they type in. And let me clear my screen and run python of twitter.py. I'm going to go ahead and type in, for instance, https://twitter.com/davidjmalan, which happens to be my own Twitter username. For now, we're just going to print it back onto the screen just to make sure I've not messed up yet. OK. So I've printed back out the exact same URL. But the goal at hand is to extract the username only. Now, let me just ask, perhaps, a straightforward question. Logically, what do I need to do to get at the user's username? AUDIENCE: Well, we just ignore what's before the username and then just extract the username? DAVID MALAN: Perfect. Yeah, I mean, it is as simple as that. If you know the username is at the end, well, let's just somehow ignore everything to the beginning. Well, what's at the beginning? Well, it's a URL. So we're probably going to need to ignore an HTTPS, a ://, a twitter.com, and a /. So we just want to throw all of that away. Why? Because if it's an URL, we know by how Twitter works that the username comes at the end. So let's use that very simple idea to get at the information we want. 
I'm going to try this a few different ways. Let me go back into my program here. And instead of just printing it out, which was just to see what's going on, let me do this. Let me create a new variable called username. And let me call url.replace. It turns out that if URL is a string or a str in Python, it, again, comes with multiple methods, like strip, and split, and others as well, one of which is called replace. And replace will do just that. You pass it two arguments, the first of which is, what do you want to replace? The second argument is, what do you want to replace it with? So if I want to get rid of, as I've proposed, really just everything before the username, that is, the Twitter URL or the beginning thereof, let's just say this. Go ahead and replace "https://twitter.com/", close quote, that's what I want to replace. And comma, second argument, what do you want to replace it with? Nothing. So I'm literally going to pass in quote unquote to effectively do a find and replace. That's what the replace method does, just like you can do it in Microsoft Word or Google Docs. This is the programmer's way of doing find and replace. Now let me go ahead and print out just the username. So I'll use an fstring like this. I'll say username, colon, and then in curly braces, username, just to format it nicely. All right. Let me go ahead and clear my screen and run python of twitter.py, Enter, URL. Here we go. https://twitter.com/davidjmalan, Enter. OK. Now we've made some progress. Done for the day, right? Well, what is suboptimal about this? Can anyone critique or find fault with my program? It is working now, but it's a little fragile. I bet we could contrive some scenarios where I think it works but it doesn't. AUDIENCE: Well, I have a few ideas, actually. Well, first of all, if we don't specify HTTPS, it will be broken. Secondly, if we have a slash at the end, it also will be broken. If we have a question mark or something after question mark, it also won't work. 
So a lot of scenarios, actually. DAVID MALAN: Oh, my god. I mean, here we are. I was pretending to think I was done. But my god, like, Alex gave us a whole laundry list of problems. And just to recap, then, what if it's not HTTPS, it's HTTP? Slightly less secure, but I should still be able to tolerate that programmatically. What if the protocol is not there? What if the user just typed twitter.com/davidjmalan? It would be nice to tolerate that rather than show an error and make me type in the protocol. Why? It's not good user experience. What if it had a slash at the end of the username, or a question mark? If you think about URLs you've seen on the web, there's very commonly more information, especially if it's been shared on social media. There might be HTTP parameters, so to speak, just stuff there that we don't want. There could be a www.twitter.com, which I'm also not expecting but does work if you go to that URL, too. So there's just so many things that can go wrong. And even if I come back to my contrived example from earlier, what if I run this program and say this-- "my username is https://twitter.com/davidjmalan," Enter. Well, that too just didn't really work-- it got rid of the-- actually-- [LAUGHS] OK, actually that kind of worked. But the goal here is to actually get the user's username, not an English sentence describing the user's username. So I would argue that even though I just accidentally created perfectly correct English grammar, I did not extract the Twitter username correctly. I don't want words like "my username is" as part of my input. So how can we go about improving this, and maybe chipping away at some of those problems one by one? Well, let me clear my screen here. Let me come back up to my code. And let me not just replace it, but let me do something else instead. I'm going to go ahead, and instead of using replace, I'm going to use another function called removeprefix.
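As a rough sketch of the two string-method attempts named here, the following compares str.replace with str.removeprefix (available in Python 3.9 and later). The function names, the PREFIX constant, and the sample inputs are illustrative assumptions, not code shown in the lecture.

```python
PREFIX = "https://twitter.com/"


def username_via_replace(url):
    # replace removes the prefix wherever it occurs in the string
    return url.strip().replace(PREFIX, "")


def username_via_removeprefix(url):
    # removeprefix removes it only when the string starts with it (Python 3.9+)
    return url.strip().removeprefix(PREFIX)


samples = [
    "https://twitter.com/davidjmalan",
    "http://twitter.com/davidjmalan",
    "My username is https://twitter.com/davidjmalan",
]

for text in samples:
    print(repr(username_via_replace(text)), repr(username_via_removeprefix(text)))
```

Only the first sample yields a bare username from either function; the other two show how a different protocol or leading words defeat an exact-prefix approach.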
A prefix is a string or a substring that comes at the start of another. So if I remove prefix, I don't need a second argument for this function. I just need one. What prefix do you want to remove? So this will at least now fix the problem I just described of typing in like a whole sentence, where the URL is there, but it's not at the beginning, it's only at the end. So here, this still is not correct. But we don't create this weird-looking output that just removes the URL part of the input-- "my username is https://twitter.com/davidjmalan." A moment ago, it did remove the URL and left only the davidjmalan. This is not perfect still. But at least now, it does not weirdly remove the URL and then leave the English. It's just leaving it alone. So maybe I could handle this better, but at least it's removing it from the part of the string I might anticipate. Well, what else could we do here? Well, it turns out that regular expressions just let us express patterns much more precisely. We could spend all day using a whole bunch of different Python functions like removeprefix, or remove, and strip, and others, and kind of make our way to the right solution. But a regular expression just allows you to more succinctly, if admittedly more cryptically, express these kinds of patterns and goals. And we've seen from parentheses, which can be used not just to group symbols together as sets but to capture information as well, we have a very powerful tool now in our toolkit. So let me do this. Let me go ahead and start fresh here and import the re library as before at the very top of my program. I'm still going to get the user's URL via the same line of code. But I'm now going to use another function as well. It turns out that there's not just re.search, or re.match, or re.fullmatch. There's also re.sub in the regular expression library, where "sub" here means "substitute." And it takes more arguments, but they're fairly straightforward. 
The first argument to re.sub is the pattern, the regular expression that you want to look for. Then you have a replacement string-- what do you want to replace that pattern with? And where do you want to do all that? Well, you pass in the string that you want to do the substitution on. Then there's some other arguments that I'll wave my hands at for now. Among them are those same flags and also a count, like how many times do you want to do find and replace? Do you want it to do all, do you want to do just one, or so forth you can have further control there, too, just like you would in Google Docs or Microsoft Word. Well, let me go back to my code here, and let me do this. I'm going to go ahead and call re not search but re.sub for substitute. I'm going to pass in the following regular expression, "https://twitter.com/" and then I'm going to close my quote. And now what do I want to replace that with? Well, like before with the simple str replace function, I want to replace it with nothing, just get rid of it altogether. But what string do I want to pass in to do this to? The URL from the user. And now let me go ahead and assign the return value of re.sub to a variable called username. So re.sub's purpose in life is, again, to substitute some value for some regular expression some number of times. It essentially is find and replace using regular expressions. And it returns to you the resulting string once you've done all those substitutions. So now the very last line of my code can be the same as before, print-- and I'll use an fstring, username, colon, and then in curly braces, username. So I can print out literally just that. All right. Let's try this and see what happens. I'll clear my terminal window, run python of twitter.py. And here we go, https://twitter.com/davidjmalan. Cross my fingers and hit Enter. OK, now we're in business. But it is still a little fragile. And so let me ask the group, what problem should I now further chip away at? 
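A minimal sketch of the re.sub call described above, assuming a hard-coded URL in place of the input prompt so the example is self-contained. The pattern is the literal one from this point in the lecture, before any refinement; the second call is an added illustration of the count argument on an arbitrary string.

```python
import re

url = "https://twitter.com/davidjmalan"

# re.sub(pattern, replacement, string, count=0, flags=0)
username = re.sub("https://twitter.com/", "", url)
print(f"Username: {username}")

# count limits how many replacements are made; the default of 0 means all
once = re.sub("a", "", "davidjmalan", count=1)
print(once)
```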
They've been said before, but let's be clear. What's one or more problems that still remain? AUDIENCE: The protocols and the domain prefix [INAUDIBLE]. DAVID MALAN: Good. The protocols, so HTTP versus HTTPS. Maybe the subdomain, www, should it be there or not? And there's a few other mistakes here, too. Let me actually stay with the group. What are some other shortcomings of this current solution? AUDIENCE: If we use a phrase like you did before, we are going to have the same problem, because it's not taking into account the first part of the text. DAVID MALAN: Good. I might still allow for some words, some English to the left of the URL because I didn't use my ^ symbol. So I'll fix that. And any final observations on shortcomings here? AUDIENCE: Well, it could be an HTTP, or there could be less than two slashes. DAVID MALAN: OK. So it could be HTTP. And I think that was mentioned, too, in terms of protocol. There could be fewer than two slashes. That I'm not going to worry about. If the user gives me one instead of two, that's really user error. And I could be tolerant of it, but you know what, at that point I'm OK yelling at them with an error message saying, please fix your input. Otherwise, we could be here all day long trying to handle all possible typos. For now, I think in the interests of usability, or user experience, UX, let's at least be tolerant of all possible valid inputs, or reasonable inputs, if you will. So let me go here, and let me start chipping away at these here. What are some problems we can solve? Well, let me propose that we first address the issue of matching from the beginning of the string. So let me add the ^ to the beginning. And let me add not a $ sign at the end, though, right? Because I don't want to match all the way to the end, because I want to tolerate a username there. So I think we just want the ^ symbol there. There's a subtle bug that no one yet mentioned.
And let me just kind of highlight it and see if it jumps out at you now. It's a little subtle here on my screen. I've highlighted in blue a final bug here-- maybe some smiles on the screen, yeah? Can we take one hand here? Why am I highlighting the dot in twitter.com, even though it definitely should be there? AUDIENCE: So the dot without a backslash means any character except a newline. DAVID MALAN: Yeah, exactly. It means any character. So I could type in something like twitter?com, or twitter anything com, and that would actually be tolerated. It's not really that bad, because why would the user do that? But if I want to be correct, and I want to be able to test my own code properly, I should really get this detail right. So that's an easy fix, too, but it's a common mistake. Anytime you're writing regular expressions that happen to involve special symbols, like dots in a URL or domain name, a $ sign in something involving currency, remember you might, indeed, need to escape it with a backslash like this here. All right. Let me ask the group about the protocol specifically. So HTTPS is a good thing in the world. It means secure. There is encryption being used. So generally, you like to see HTTPS. But you still see people typing or copy-pasting HTTP. What would be the simplest fix here to tolerate, as has been proposed, both HTTP and HTTPS? I'm going to propose that I could do this. I could do HTTP vertical bar or HTTPS, which, again, means A or B. But I think I can be smarter than that. I can keep my code a little more succinct. Any recommendations here for tolerating HTTP or HTTPS? AUDIENCE: We could try to put in question mark behind the S. DAVID MALAN: Perfect. Just use a question mark. Both of those would be viable solutions. If you want to be super explicit in your code, fine. Use parentheses and say HTTP or HTTPS, so that you, the reader, your boss, your teacher just know exactly what you're doing. 
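The two points just made, the unescaped dot and the optional S, can be checked with a short sketch. The variable names and the sample URLs are assumptions chosen for illustration.

```python
import re

# Unescaped: "." matches any character except a newline
loose = r"^https://twitter.com/"
# Escaped: "\." matches only a literal dot
strict = r"^https://twitter\.com/"

print(bool(re.search(loose, "https://twitter?com/davidjmalan")))   # True
print(bool(re.search(strict, "https://twitter?com/davidjmalan")))  # False

# Two equivalent ways to accept either protocol
succinct = r"^https?://twitter\.com/"
explicit = r"^(http|https)://twitter\.com/"

for url in ("http://twitter.com/davidjmalan", "https://twitter.com/davidjmalan"):
    print(bool(re.search(succinct, url)), bool(re.search(explicit, url)))
```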
But if you keep taking the more verbose approach all the time, it might actually become less readable, certainly once your regular expressions get this big instead of this big. So let's save space where we can. And I would argue that this is pretty reasonable, so long as you're in the habit of reading regular expressions and know that question mark does not mean a literal question mark, but it means zero or one of the thing before. I think we've effectively made the S optional here. Now, what else can I do? Well, suppose we want to tolerate the www dot, which may or may not be there, but it will work if you go to a browser. I could do this-- www dot-- wait, I want a backslash there so I don't repeat the same mistake as before. But this is no good either, because I want to tolerate being there or not being there. And now I've just required that it be there. But I think I can take the same approach. Any recommendations? How do I make the www. optional, just to hammer this home? AUDIENCE: We can group-- make a square and a question mark. DAVID MALAN: Perfect. So question mark is the short answer again. But we have to be a little smarter this time. As Maria has noted, we need parentheses now. Because if I just put a question mark after the dot, that just means the dot is optional. And that's wrong, because we don't want the user to type in W-W-W-T-W-I-T-T-E-R. We want the dot to be there or just not at all with no www. So we need to group this whole thing together, put a parenthesis there, and then a parenthesis, not after the third W, after the dot, so that that whole thing is either there or it's not there. And what else could we still do here? There's going to be one other thing we should tolerate. And it's been said before, and I'll pluck this one off. What about the protocol? Like, what if the user just doesn't type or doesn't copy-paste the http:// or an https://? Honestly, you and I are not in the habit, generally, of even typing the protocol anymore nowadays. 
You just let the browser figure it out for you, and automatically add it instead. So this one's going to look like more of a mouthful. But if I want this whole thing here in blue to be optional, it's actually the same solution as Maria offered a moment ago. I'm going to go ahead and put a parenthesis over here, and a parenthesis after the two slashes, and then a question mark so as to make that whole thing optional as well. And this is OK. It's totally fine to make this whole thing optional, or inside of it, this little thing, just the S optional as well. So long as I'm applying the same principles again and again, either on a small scale or a bigger scale, it's totally fine to nest one of these inside of the other. Questions now on any of these refinements to this parsing, this analyzing of Twitter? AUDIENCE: What if we put a vertical bar besides this www dot? DAVID MALAN: What if we use a vertical bar there? So we could do something like that, too. We could do something like this. Instead of the question mark, I could do www dot or nothing and just leave that and the parentheses. That, too, would be fine. I personally tend not to like that, because it's a little less obvious to me-- wait, a minute. Is that deliberate, or did I forget to finish my thought by putting something after the vertical bar? But that, too, would be allowed there as well, if that's what you mean. Other questions on where we left things here, where we made the protocol optional, too? AUDIENCE: What happens if we have parenthesis, and inside we have another parenthesis, and another parenthesis? Will it interfere with each other? DAVID MALAN: If you have parentheses inside of parentheses, that, too, is totally fine. And indeed, that should be one of the reassuring lessons today. As complicated as each of these regular expressions has admittedly gotten, I'm just applying the exact same principles and the exact same syntax again and again. 
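The refinements made so far can be laid out one step at a time, which also illustrates the incremental approach. This is a sketch only: the list names and sample URLs are assumptions, and each pattern is a reconstruction of what is dictated aloud above.

```python
import re

# Each entry adds one refinement to the previous pattern
steps = [
    r"https://twitter.com/",                # starting point
    r"^https://twitter\.com/",              # anchor at the start, escape the dot
    r"^https?://twitter\.com/",             # make the s optional
    r"^https?://(www\.)?twitter\.com/",     # make "www." optional as a group
    r"^(https?://)?(www\.)?twitter\.com/",  # make the whole protocol optional
]

urls = [
    "https://twitter.com/davidjmalan",
    "http://twitter.com/davidjmalan",
    "https://www.twitter.com/davidjmalan",
    "twitter.com/davidjmalan",
]

for pattern in steps:
    # An input that still contains "twitter.com" was not handled by this step
    results = [re.sub(pattern, "", url) for url in urls]
    print(pattern, results)
```

With the last pattern, all four sample inputs reduce to the bare username, while each earlier pattern leaves at least one of them untouched.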
So it's totally fine to have parentheses inside of parentheses if they're each solving different problems. And in fact, the lesson I would really emphasize the most today is that you will not be happy if you try to write out a whole complicated regular expression all at once. Like, if you're anything like me, you will fail, and you will have trouble finding the mistake. Because my god, look at these things. They are, even to me all these years later, cryptic. The better way, I would argue, whether you're new to programming or as old to it as I am, is to just take these baby steps, these incremental steps where you do something simple, you make sure it works. You add one more feature, make sure it works. Add one more feature, make sure it works. And hopefully, by the end, because you've done each of those steps one at a time, the whole thing will make sense to you. But you'll also have gotten each of those steps correct at each turn. So please, do avoid the inclination to try to come up with long, sophisticated regular expressions all at once, because it's just not a good use of time if you then stare at it trying to find a mistake that you could have caught if you did things more incrementally instead. All right. There still remains, arguably, at least one problem with this solution in that even though I'm calling re.sub to substitute the URL with nothing, quote, unquote, I then in my final line of code, line 6, am just blindly assuming that it all worked, and I'm going to go ahead and print out the username. But what if the user-- if I clear my screen here and run python of twitter.py-- doesn't even type a Twitter URL? What if they do something like https://google.com/, like completely unrelated, for whatever reason, Enter, that is not their Twitter username.
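The problem just described can be reproduced in a few lines. This sketch assumes the pattern as refined so far and uses hard-coded URLs in place of the input prompt.

```python
import re

pattern = r"^(https?://)?(www\.)?twitter\.com/"

for url in ["https://twitter.com/davidjmalan", "https://google.com/"]:
    # When the pattern matches nothing, re.sub returns the string unchanged,
    # so the unrelated URL is printed as though it were a username
    username = re.sub(pattern, "", url)
    print(f"Username: {username}")
```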
So we need to have some conditional logic, I would argue, so that for this program's sake, we're only printing out or, in a back end system, we're only saving into our database or a CSV file the username if we actually matched the proper pattern. So rather than use re.sub, which is useful for cleaning up data, as we've done here to get rid of something we don't want there, why don't we go back to re.search, where we began today, and use it to solve this same problem but in a way that's conditional, whereby I can confidently say, yes or no, at the end of my program, here's the username, or here it is not? So let me go ahead now. And I'll clear my terminal window here. I'm going to keep most of-- I'm going to keep the first two lines the same, where I import re, and I get the URL from the user. But this time, let's do this. Let's this time search for, using re.search instead of re.sub, the following. I'm going to start matching at the beginning of the string, https, question mark to make the S optional, colon, slash, slash, I'm going to make my www optional by putting that in parentheses with a question mark there, then a twitter.com with a literal dot there so I stay ahead of that issue, too, then a slash. And then well, this is where davidjmalan is supposed to go. How do I detect this? Well, I think I'll just tolerate anything at the end of the URL here, with dot +. All right, $ sign at the very end, close quote. For the moment, I'm going to stipulate that we're not going to worry about question marks at the end or hashes, like for fragment IDs in URLs. We're going to assume for simplicity now that the URL just ends with the username alone. Now what am I going to do? Well, I want to search for this URL specifically, and I'm going to ignore case, so re.IGNORECASE, applying that same lesson learned from before. re.search, recall, will return to you the matches you've captured. Well, what do I want to capture? Well, I want to capture everything to the right of the twitter.com URL here.
So let me surround what should be the user's username with parentheses, not for making them optional but to say, "capture this set of characters." Now, re.search, recall, returns an answer. matches will be my variable name again, but I could call it anything I want. And then I can do this. If matches, now I know I can do this. Let's print out the format string, username colon. And then what do I want to print out? Well, I think I want to print out matches.group 1 for my matched username. All right. So what am I doing just to recap? Line 1, I'm importing the library. Line 2, I'm getting the URL from the user. So nothing new there. Line 5, I'm searching the user's URL, as indicated here as the second argument, for this regular expression, this pattern. I have surrounded the dot + with parentheses so that they are captured ultimately, so I can extract, in this final scenario, the user's username. If I indeed got a match, and matches is non-none, it is actually containing some match, then and only then, print out username. In this way, let me try this now. If I run python of twitter.py and type in https://www.google.com/, now nothing gets printed. So I've at least solved the mistake we just saw, where I was just assuming that my code worked. Now I'm making sure that I have searched for and found the Twitter URL prefix. All right. Well, let's run this for real now. Python of twitter.py https://twitter.com/davidjmalan. But note, I could use HTTP, I could use www. I'm just going to go ahead here and hit Enter. Huh, none. What has gone wrong? This one's a bit more subtle. But why does matches.group 1 contain nothing? Wait a minute. Let me-- maybe I did this wrong. Maybe-- maybe do we need the www? Let me run it again. So here we go. https://, let's add a www.twitter.com/davidjmalan. All right. Enter. Ho, ho, ho. What is going on? AUDIENCE: You have to say group 2. DAVID MALAN: I have to say group 2? Well, wait-- oh, right, because we had the subdomain was optional. 
And to make it optional, I needed to use parentheses here. And so I then said zero or one. OK. So that means that actually, I'm unintentionally but by design capturing the www dot, or none of it if it wasn't there before, but I have a second match over here because I have a second set of parentheses. So I think, yep, let me change matches.group 1 to matches.group 2, and let's run this. Python of twitter.py https://www.twitter-- let's do this, twitter.com/davidjmalan, Enter, and now we've got access to the username. Let me go ahead and tighten it up a little bit further. If you like our new friend-- it's hard not to like. If we like our old friend the walrus operator, let's go ahead and add this just to tighten things up. Let me go back to VS Code here, and let me get rid of the unnecessary condition there and combine it up here, if matches equals that. But let's change the single assignment operator to the walrus operator. Now I've tightened things up further. But I bet, I bet, I bet there might be another solution here. And indeed, it turns out that we can come back to this final set of syntax. Recall that when we introduced these parentheses, we did it so that we could do A or B, for instance, with the vertical bar. Then you can even combine more than just one bar. We use the group to combine ideas like the www dot. And then there's this admittedly weird syntax at the bottom here, up until now not used. There is a non-capturing version of parentheses if you want to use parentheses logically because you need to, but you don't want to bother capturing the result. And this would arguably be a better solution here, because, yes, if I go back to VS Code, I do need to surround the www dot with parentheses, at least as I've written my regex here, because I wanted to put the question mark after it. But I don't need the www dot coming back.
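The group-numbering surprise can be seen directly in a short sketch. It assumes the pattern as dictated above, with a capturing group around the optional "www." and a second one around the username, and hard-coded URLs instead of the input prompt.

```python
import re

pattern = r"^https?://(www\.)?twitter\.com/(.+)$"

for url in ["https://twitter.com/davidjmalan", "https://www.twitter.com/davidjmalan"]:
    # The walrus operator assigns and tests in one expression (Python 3.8+)
    if matches := re.search(pattern, url, re.IGNORECASE):
        # group(1) is the optional "www." (None when absent);
        # group(2) is the username
        print(matches.group(1), matches.group(2))

# Prints:
# None davidjmalan
# www. davidjmalan
```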
In fact, let's only extract the data we care about, just so there's no confusion down the road, for me, or my colleagues, or my teachers. So what could I do? Well, the syntax per this slide is to use a question mark and a colon immediately after the open parentheses. It looks weird admittedly. Those of you who have prior programming experience might recognize the syntax from ternary operators, doing an if else all in one line. A question mark colon at the beginning of that parenthetical means, yes, I'm using parentheses to group these things together, but no, you do not need to capture them instead. So I can change my code back now to matches.group 1. I'll clear my screen here, run python of twitter.py. I'll again run here https://twitter.com/davidjmalan with or without the www. And now, I indeed get back that username. Any questions, then, on these final techniques? AUDIENCE: So first of all, could we move the ^ right at the beginning of Twitter, and then just start reading from there, and then get rid of everything else before that, the kind of www issues that we had? And then my second question is, how would we use kind of, I guess, either a list or a dictionary to sort the .com kind of thing, because we have .co.uk, and that kind of stuff. How would we bring that into the re function? DAVID MALAN: A good question but no. If I move the ^ before twitter.com and throw away the protocol and the www, then the user is going to have to type in literally twitter.com/username. They can't even type in that other stuff. So that would be a regression, a step back. As for the .com, the .org, and .edu, and so forth, the short answer is there's many different solutions here. If I wanted to be stringent about .com-- and suppose that Twitter probably owns multiple domain names, even though they tend to use just this one. Suppose they have something like .org as well. You could use more parentheses here and do something like this-- com or org. 
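Returning to the non-capturing syntax introduced before the question, a minimal sketch might look like this. The extract function is a hypothetical helper added so the behaviour can be checked without typing input.

```python
import re

# (?:...) groups "www\." so the ? applies to all of it, without capturing it,
# which leaves the username as group 1
pattern = r"^https?://(?:www\.)?twitter\.com/(.+)$"


def extract(url):
    if matches := re.search(pattern, url, re.IGNORECASE):
        return matches.group(1)
    return None


print(extract("https://twitter.com/davidjmalan"))
print(extract("https://www.twitter.com/davidjmalan"))
print(extract("https://google.com/"))
```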
I'd probably want to go in and add a question mark colon to make it non-capturing, because I don't care which it is, I just want to tolerate both. Alternatively, we could capture that. We could do something like this, where we do dot + so as to actually capture that. And then we could do something like this. If matches.group 1 now equals equals com, then we could support this. So you could imagine factoring out the logic just by extracting the Top-Level Domain, or TLD, and then just using Python code, maybe a list, maybe a dictionary, to validate elsewhere, outside of the regex, if it's, in fact, what you expect. For now, though, we kept things simple. We focused only on the .com in this case. Let's make one final change to this program so that we're being a little more specific with the definition of a Twitter username. It turns out that we're being a little too generous over here, whereby we're accepting one or more of any character. I checked the documentation for Twitter. And Twitter only supports letters of the alphabet, A through Z, numbers 0 through 9, or underscores, so not just dot, which is literally anything. So let me go ahead and be more precise here. At the end of my string, let me go ahead and say, this set of symbols in square brackets. I'm going to go ahead and say a through z, 0 through 9, and an underscore. Because, again, those are the only valid symbols. I don't need to bother with both uppercase and lowercase letters, because we're using re.IGNORECASE over here. But I want to make sure now that I tolerate not only one or more of these symbols here but also maybe some other stuff at the end of the URL. I'm now going to be OK with there being a slash, or a question mark, or a hash at the end of the URL, all of which are valid symbols in a URL, but I know from Twitter's documentation, are not part of the username. All right.
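A sketch of where the program ends up, plus one possible take on validating the top-level domain outside the regex. The function names and the ALLOWED_TLDS list are hypothetical, and the TLD variant swaps in [a-z]+ for the dot + mentioned above so that the capture cannot run past the slash; that substitution is an assumption of this sketch, not the lecture's code.

```python
import re

# Optional s, optional non-captured "www.", then a captured username made of
# letters, digits, and underscores. There is no "$", so a trailing "/", "?",
# or "#" and anything after it is tolerated but not captured.
USERNAME = r"^https?://(?:www\.)?twitter\.com/([a-z0-9_]+)"


def extract_username(url):
    if matches := re.search(USERNAME, url.strip(), re.IGNORECASE):
        return matches.group(1)
    return None


# Variant: capture the TLD and validate it in Python instead of in the regex
ALLOWED_TLDS = ["com", "org"]  # hypothetical list, for illustration only
WITH_TLD = r"^https?://(?:www\.)?twitter\.([a-z]+)/([a-z0-9_]+)"


def extract_username_checking_tld(url):
    if matches := re.search(WITH_TLD, url.strip(), re.IGNORECASE):
        if matches.group(1).lower() in ALLOWED_TLDS:
            return matches.group(2)
    return None


print(extract_username("https://twitter.com/davidjmalan/"))
print(extract_username_checking_tld("https://twitter.org/davidjmalan"))
```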
Now I'm going to go ahead and run python of twitter.py one final time, typing in https://twitter.com/davidjmalan, maybe with, maybe without a trailing slash. But hopefully, with my biggest fingers crossed here, I'm going to go ahead now and hit Enter, and thankfully my username is, indeed, davidjmalan. So what more is there in the world of regular expressions and this library? Not just re.search and re.sub; there are other functions, too. There's re.split, via which you can split a string, not using a specific character or characters like a comma and a space, but multiple characters as well. And there's even functions like re.findall, which can allow you to search for multiple copies of the same pattern in different places in a string so that you can perhaps manipulate more than just one. So at the end of the day now, you've really learned a whole other language, like that of regular expressions, and we've used them in Python. But these regular expressions actually exist in so many languages, too, among them JavaScript, and Java, and Ruby, and more. So with this new language, even though it's admittedly cryptic when you use it for the first time, you have this newfound ability to express these patterns that, again, you can use to validate data, to clean up data, or even extract data, and from any data set you might have in mind. That's it for this week. We will see you next time.
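For a brief taste of the two functions mentioned in closing, here is a small sketch. The sample strings, including the example_user handle, are made up for illustration.

```python
import re

# re.split: split on a pattern, here a comma followed by any amount of whitespace
name = "Malan,   David"
last, first = re.split(r",\s*", name)
print(first, last)

# re.findall: every non-overlapping match in the string; with exactly one
# capturing group, the returned list holds just that group's text
text = "Follow https://twitter.com/davidjmalan and https://twitter.com/example_user"
usernames = re.findall(r"twitter\.com/([a-z0-9_]+)", text, re.IGNORECASE)
print(usernames)
```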