1 00:00:00,000 --> 00:00:03,960 [ORCHESTRA TUNING] 2 00:00:03,960 --> 00:00:14,850 3 00:00:14,850 --> 00:00:18,315 [MUSIC PLAYING] 4 00:00:18,315 --> 00:00:24,270 5 00:00:24,270 --> 00:00:25,470 DAVID MALAN: All right. 6 00:00:25,470 --> 00:00:28,680 This is CS50's Introduction to Programming with Python. 7 00:00:28,680 --> 00:00:32,350 My name is David Malan, and this is our week on regular expressions. 8 00:00:32,350 --> 00:00:37,170 So a regular expression, otherwise known as a regex, is really just a pattern. 9 00:00:37,170 --> 00:00:39,120 And indeed, it's quite common in programming 10 00:00:39,120 --> 00:00:43,620 to want to use patterns to match on some kind of data, often user input. 11 00:00:43,620 --> 00:00:47,190 For instance, if the user types in an email address, whether to your program, 12 00:00:47,190 --> 00:00:49,200 or a website, or an app on your phone, you 13 00:00:49,200 --> 00:00:50,940 might ideally want to be able to validate 14 00:00:50,940 --> 00:00:53,130 that they did indeed type in an email address 15 00:00:53,130 --> 00:00:54,790 and not something completely different. 16 00:00:54,790 --> 00:00:58,080 So using regular expressions, we're going to have the newfound capability 17 00:00:58,080 --> 00:01:02,280 to define patterns in our code to compare them against data that we're 18 00:01:02,280 --> 00:01:04,920 receiving from someone else, whether it's just to validate it, 19 00:01:04,920 --> 00:01:07,470 or, heck, even if we want to clean up a whole lot of data 20 00:01:07,470 --> 00:01:11,220 that itself might be messy because it, too, came from us humans. 21 00:01:11,220 --> 00:01:14,010 Before, though, we use these regular expressions, 22 00:01:14,010 --> 00:01:19,620 let me propose that we solve a few problems using just some simpler syntax 23 00:01:19,620 --> 00:01:22,140 and see what kind of limitations we run up against. 
24 00:01:22,140 --> 00:01:24,720 Let me propose that I open up VS Code here, 25 00:01:24,720 --> 00:01:27,900 and let me create a file called validate.py, the goal at hand 26 00:01:27,900 --> 00:01:30,768 being to validate, how about just that, a user's email address. 27 00:01:30,768 --> 00:01:33,060 They've come to your app, they've come to your website, 28 00:01:33,060 --> 00:01:34,935 they type in their email address, and we want 29 00:01:34,935 --> 00:01:38,380 to say yes or no, this email address looks valid. 30 00:01:38,380 --> 00:01:38,880 All right. 31 00:01:38,880 --> 00:01:43,980 Let me go ahead and type code of validate.py to create a new tab here. 32 00:01:43,980 --> 00:01:47,850 And then within this tab, let me go ahead and start writing some code, 33 00:01:47,850 --> 00:01:50,010 how about, that keeps things simple initially. 34 00:01:50,010 --> 00:01:53,130 First, let me go ahead and prompt the user for their email address. 35 00:01:53,130 --> 00:01:57,510 And I'll store the return value of input in a variable called email, 36 00:01:57,510 --> 00:01:59,940 asking them "what's your email?" 37 00:01:59,940 --> 00:02:00,742 question mark. 38 00:02:00,742 --> 00:02:02,700 I'm going to go ahead and preemptively at least 39 00:02:02,700 --> 00:02:06,780 clean up the user's input a little bit by minimally just calling strip 40 00:02:06,780 --> 00:02:10,020 at the end of my call to input, because recall 41 00:02:10,020 --> 00:02:12,240 that input returns a string or a str. 42 00:02:12,240 --> 00:02:16,200 strs come with some built-in methods or functions, one of which 43 00:02:16,200 --> 00:02:18,390 is strip, which has the effect of stripping off 44 00:02:18,390 --> 00:02:22,395 any leading whitespace to the left or any trailing whitespace to the right. 45 00:02:22,395 --> 00:02:24,270 So that's just going to go ahead and at least 46 00:02:24,270 --> 00:02:27,480 avoid the human having accidentally typed in a space character. 
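As a small illustration of the stripping step just described, here is a minimal sketch. It assumes a hard-coded string in place of a live call to `input`, so that it runs without a prompt; the variable name `raw` is hypothetical.

```python
# Stand-in for what input("What's your email? ") might return if the
# user accidentally typed spaces around the address.
raw = "  malan@harvard.edu "

# str.strip() removes leading and trailing whitespace only.
email = raw.strip()

print(email)  # malan@harvard.edu
```

In the program itself, the same effect would come from calling `.strip()` directly on the value returned by `input`.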
47 00:02:27,480 --> 00:02:29,860 We're going to throw it away just in case. 48 00:02:29,860 --> 00:02:31,500 Now I'm going to do something simple. 49 00:02:31,500 --> 00:02:35,165 For a user's input to be an email address, 50 00:02:35,165 --> 00:02:37,290 I think we can all agree that it's got to minimally 51 00:02:37,290 --> 00:02:39,130 have an @ sign somewhere in it. 52 00:02:39,130 --> 00:02:40,140 So let's start simple. 53 00:02:40,140 --> 00:02:43,230 If the user has typed in something with an @ sign, let's 54 00:02:43,230 --> 00:02:46,800 very generously just say, OK, valid, looks like an email address. 55 00:02:46,800 --> 00:02:50,940 And if we're missing that @ sign, let's say invalid, because clearly it's 56 00:02:50,940 --> 00:02:51,870 not an email address. 57 00:02:51,870 --> 00:02:55,078 It's not going to be the best version of my code yet, but we'll start simple. 58 00:02:55,078 --> 00:02:59,700 So I'm going to ask the question, if there is an @ symbol in the user's 59 00:02:59,700 --> 00:03:03,570 email address, go ahead and print out, for instance, quote, unquote, "valid." 60 00:03:03,570 --> 00:03:06,750 Else, if there's not, now I'm pretty confident that the email 61 00:03:06,750 --> 00:03:09,250 address is, in fact, invalid. 62 00:03:09,250 --> 00:03:10,650 Now, what is this code doing? 63 00:03:10,650 --> 00:03:16,320 Well, if @ sign in email is a Pythonic way of asking is this string quote, 64 00:03:16,320 --> 00:03:20,762 unquote "@" in this other string email, no matter where it is-- 65 00:03:20,762 --> 00:03:22,470 at the beginning, the middle, or the end. 66 00:03:22,470 --> 00:03:25,470 It's going to automatically search through the entire string for you 67 00:03:25,470 --> 00:03:26,220 automatically. 68 00:03:26,220 --> 00:03:27,660 I could do this more verbosely. 
69 00:03:27,660 --> 00:03:29,670 And I could use a for loop or a while loop 70 00:03:29,670 --> 00:03:32,640 and look at every character in the user's email address, 71 00:03:32,640 --> 00:03:34,128 looking to see if it's an @ sign. 72 00:03:34,128 --> 00:03:36,420 But this is one of the things that's nice about Python. 73 00:03:36,420 --> 00:03:38,020 You can do more with less. 74 00:03:38,020 --> 00:03:41,190 So just by saying if "@" quote, unquote in email, 75 00:03:41,190 --> 00:03:43,800 we're achieving that same result. We're going to get back true 76 00:03:43,800 --> 00:03:47,670 if it's somewhere in there, thus valid, or false if it is not. 77 00:03:47,670 --> 00:03:50,970 Well, let me go ahead now and run this program in my terminal window 78 00:03:50,970 --> 00:03:53,100 with python of validate.py. 79 00:03:53,100 --> 00:03:56,730 And I'm going to go ahead and give it my email address-- malan@harvard.edu, 80 00:03:56,730 --> 00:03:57,570 Enter. 81 00:03:57,570 --> 00:03:58,770 And indeed, it's valid. 82 00:03:58,770 --> 00:04:00,210 Looks valid, is valid. 83 00:04:00,210 --> 00:04:03,480 But of course, this program is technically broken. 84 00:04:03,480 --> 00:04:04,410 It's buggy. 85 00:04:04,410 --> 00:04:07,110 What would be an example input, if someone 86 00:04:07,110 --> 00:04:10,770 might like to volunteer an answer here, that would be considered valid 87 00:04:10,770 --> 00:04:13,197 but you and I know it really isn't valid? 88 00:04:13,197 --> 00:04:14,280 AUDIENCE: Yeah, thank you. 89 00:04:14,280 --> 00:04:17,880 Well, for instance, you can type just two signs and that's it, 90 00:04:17,880 --> 00:04:20,760 and it'll still be valid-- 91 00:04:20,760 --> 00:04:23,815 still be valid according to your program, but missing something. 92 00:04:23,815 --> 00:04:24,690 DAVID MALAN: Exactly. 93 00:04:24,690 --> 00:04:26,680 We've set a very low bar here. 
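The first version of the check, along with the more verbose loop mentioned above, might be sketched as follows. The function names `looks_valid` and `looks_valid_verbose` are hypothetical, and the checks are wrapped in functions (rather than calling `input` and `print`) only so the example runs on its own.

```python
def looks_valid(email):
    # True if an "@" appears anywhere in the string.
    return "@" in email


def looks_valid_verbose(email):
    # The same check written out by hand, one character at a time.
    for character in email:
        if character == "@":
            return True
    return False


print(looks_valid("malan@harvard.edu"))  # True
print(looks_valid("@"))                  # True, even though "@" is not an address
print(looks_valid_verbose("@"))          # True, same result as the short form
```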
94 00:04:26,680 --> 00:04:29,430 In fact, if I go ahead and rerun python of validate.py, 95 00:04:29,430 --> 00:04:33,512 and I'll just type in one @ sign, that's it-- no username, no domain name, 96 00:04:33,512 --> 00:04:35,470 this doesn't really look like an email address. 97 00:04:35,470 --> 00:04:38,790 But unfortunately, my code thinks it, in fact, is, because it's obviously 98 00:04:38,790 --> 00:04:40,710 just looking for an @ sign alone. 99 00:04:40,710 --> 00:04:42,250 Well, how could we improve this? 100 00:04:42,250 --> 00:04:45,500 Well, minimally an email address, I think, tends to have, 101 00:04:45,500 --> 00:04:47,250 though this is not actually a requirement, 102 00:04:47,250 --> 00:04:51,600 tends to have an @ sign and a single dot at least, maybe somewhere in the domain 103 00:04:51,600 --> 00:04:54,240 name-- so malan@harvard.edu. 104 00:04:54,240 --> 00:04:55,960 So let's check for that dot as well. 105 00:04:55,960 --> 00:04:59,040 But again, strictly speaking it doesn't even have to be that case. 106 00:04:59,040 --> 00:05:02,190 But I'm going for my own email address, at least for now, as our test case. 107 00:05:02,190 --> 00:05:06,450 So let me go ahead and change my code now and say, not only if @ is in email, 108 00:05:06,450 --> 00:05:11,050 but also dot is in email as well. 109 00:05:11,050 --> 00:05:12,690 So I'm asking now two questions. 110 00:05:12,690 --> 00:05:16,050 I have two Boolean expressions-- if @ in email, 111 00:05:16,050 --> 00:05:20,650 and I'm anding them together logically-- this is a logical and, so to speak. 112 00:05:20,650 --> 00:05:24,600 So if it's the case that @ is in email and dot is in email, OK, 113 00:05:24,600 --> 00:05:26,470 now I'm going to go ahead and say valid. 114 00:05:26,470 --> 00:05:26,970 All right. 115 00:05:26,970 --> 00:05:29,460 This would still seem to work for my email address. 
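A minimal sketch of the two Boolean expressions joined with `and`, again using a hypothetical function name, `looks_valid`, in place of the interactive program:

```python
def looks_valid(email):
    # Two separate Boolean expressions, both of which must be True.
    return "@" in email and "." in email


print(looks_valid("malan@harvard.edu"))  # True
print(looks_valid("@"))                  # False: there is no dot
```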
116 00:05:29,460 --> 00:05:34,500 Let me go ahead and run python validate.py, malan@harvard.edu, Enter, 117 00:05:34,500 --> 00:05:36,390 and that, of course, is valid as expected. 118 00:05:36,390 --> 00:05:39,870 But here, too, we can be a little adversarial and type in something 119 00:05:39,870 --> 00:05:41,505 nonsensical like "@." 120 00:05:41,505 --> 00:05:45,180 and unfortunately, that, too, is going to be mistaken as valid, 121 00:05:45,180 --> 00:05:48,820 even though there's still no username, domain name, or anything like that. 122 00:05:48,820 --> 00:05:51,180 So I think we need to be a little more methodical here. 123 00:05:51,180 --> 00:05:57,660 In fact, notice that if I do this like this, the @ sign can be anywhere, 124 00:05:57,660 --> 00:05:59,250 and the dot can be anywhere. 125 00:05:59,250 --> 00:06:02,190 But if I'm assuming the user is going to have a traditional domain 126 00:06:02,190 --> 00:06:05,880 name like harvard.edu or gmail.com, I really 127 00:06:05,880 --> 00:06:10,110 want to look for the dot in the domain name only, not necessarily 128 00:06:10,110 --> 00:06:11,580 just the username. 129 00:06:11,580 --> 00:06:13,510 So let me go ahead and do this. 130 00:06:13,510 --> 00:06:18,250 Let me go ahead and introduce a bit more logic here, and instead do this. 131 00:06:18,250 --> 00:06:24,060 Let me go ahead and do email.split of quote, unquote @ sign. 132 00:06:24,060 --> 00:06:26,460 So email, again, is a string or a str. 133 00:06:26,460 --> 00:06:29,550 strs come with methods, not just strip but also 134 00:06:29,550 --> 00:06:32,190 another one called split that, as the name implies, 135 00:06:32,190 --> 00:06:36,570 will split one str into multiple ones if you give it a character or more 136 00:06:36,570 --> 00:06:37,950 to split on. 137 00:06:37,950 --> 00:06:42,390 So this is hopefully going to return to me two parts from a traditional email 138 00:06:42,390 --> 00:06:44,880 address, the username and the domain name. 
139 00:06:44,880 --> 00:06:47,850 And it turns out I can unpack that sequence of responses 140 00:06:47,850 --> 00:06:52,410 by doing this-- username comma domain equals this. 141 00:06:52,410 --> 00:06:55,060 I could store it in a list or some other structure, 142 00:06:55,060 --> 00:06:58,530 but if I already know in advance what kinds of values I'm expecting, 143 00:06:58,530 --> 00:07:00,510 a username and hopefully a domain, I'm going 144 00:07:00,510 --> 00:07:04,050 to go ahead and do it like this instead and just define two variables at once 145 00:07:04,050 --> 00:07:05,310 on one line of code. 146 00:07:05,310 --> 00:07:07,290 And now I'm going to be a little more precise. 147 00:07:07,290 --> 00:07:13,230 If username-- if username, then I'm going to go ahead 148 00:07:13,230 --> 00:07:15,370 and say, print "valid." 149 00:07:15,370 --> 00:07:18,820 Else, I'm going to go ahead and say print "invalid." 150 00:07:18,820 --> 00:07:20,040 Now, this isn't good enough. 151 00:07:20,040 --> 00:07:22,800 But I'm at least checking for the presence of a username now. 152 00:07:22,800 --> 00:07:25,217 And you might not have seen this before, but if you simply 153 00:07:25,217 --> 00:07:28,680 ask a question like "if username," and username is a string, 154 00:07:28,680 --> 00:07:31,320 well, username-- "if username" is going to give me 155 00:07:31,320 --> 00:07:35,730 a true answer if username is anything except none or quote, 156 00:07:35,730 --> 00:07:36,840 unquote "nothing." 157 00:07:36,840 --> 00:07:41,820 So there's a truthy value here, whereby if username has at least one character, 158 00:07:41,820 --> 00:07:43,350 that's going to be considered true. 159 00:07:43,350 --> 00:07:46,170 But if username has no characters, it's going 160 00:07:46,170 --> 00:07:49,245 to be considered a false value effectively. 161 00:07:49,245 --> 00:07:50,370 But this isn't good enough. 162 00:07:50,370 --> 00:07:52,037 I don't want to just check for username. 
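The splitting, unpacking, and truthiness check described above could look roughly like this. The function name `looks_valid` is hypothetical, and the sketch assumes the input contains exactly one @ sign, since unpacking into two variables raises a `ValueError` otherwise.

```python
def looks_valid(email):
    # Assumes exactly one "@"; with zero or several, the unpacking
    # below would raise ValueError.
    username, domain = email.split("@")

    # An empty string is falsy, so this branch runs only when the
    # username has at least one character.
    if username:
        return True
    else:
        return False


print(looks_valid("malan@harvard.edu"))  # True
print(looks_valid("@harvard.edu"))       # False: the username is empty
```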
163 00:07:52,037 --> 00:07:57,160 I want to also check that it's the case that dot is in the domain name as well. 164 00:07:57,160 --> 00:08:00,180 So notice here there's a bit of potential confusion 165 00:08:00,180 --> 00:08:01,830 with the English language. 166 00:08:01,830 --> 00:08:04,620 Here, I seem to be saying "if username and dot 167 00:08:04,620 --> 00:08:09,660 in domain," as though I'm asking the question, "if the username and the dot 168 00:08:09,660 --> 00:08:12,270 are in the domain," but that's not what this means. 169 00:08:12,270 --> 00:08:15,540 These are two separate Boolean expressions-- "if username," 170 00:08:15,540 --> 00:08:19,690 and separately, "if dot in domain." 171 00:08:19,690 --> 00:08:23,100 And if I parenthesize this, we could make that even more clear by putting 172 00:08:23,100 --> 00:08:25,000 parentheses there, parentheses here. 173 00:08:25,000 --> 00:08:27,390 So just to be clear, it's really two Boolean expressions 174 00:08:27,390 --> 00:08:30,840 that we're anding together, not one longer English-like sentence. 175 00:08:30,840 --> 00:08:35,580 Now, if I go ahead and run this, python validate.py Enter, 176 00:08:35,580 --> 00:08:39,809 I'll do my own email address again, malan@harvard.edu, and that's valid. 177 00:08:39,809 --> 00:08:43,710 And it looks like I could tolerate something like this. 178 00:08:43,710 --> 00:08:47,970 If I do malan@, just say, harvard, I think at the moment 179 00:08:47,970 --> 00:08:49,480 this is going to be invalid. 180 00:08:49,480 --> 00:08:52,150 Now, maybe the top-level domain harvard exists. 181 00:08:52,150 --> 00:08:54,900 But at the moment, it looks like we're looking for something more. 182 00:08:54,900 --> 00:08:58,380 We're looking for a top-level domain too, like .edu. 183 00:08:58,380 --> 00:09:01,540 For now, we'll just consider this to be invalid. 
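A sketch of the parenthesized form, which makes the two separate Boolean expressions explicit. The function name `looks_valid` is hypothetical, and `bool(...)` is added here only so the function returns `True` or `False` rather than the username string itself.

```python
def looks_valid(email):
    # Assumes exactly one "@" in the input.
    username, domain = email.split("@")

    # Two independent questions joined by `and`:
    #   (1) is the username non-empty?
    #   (2) is there a dot somewhere in the domain?
    return bool(username) and ("." in domain)


print(looks_valid("malan@harvard.edu"))  # True
print(looks_valid("malan@harvard"))      # False: no dot in the domain
```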
184 00:09:01,540 --> 00:09:04,510 But it's not just that we want to do-- 185 00:09:04,510 --> 00:09:07,260 it's not just that we want to check for the presence of a username 186 00:09:07,260 --> 00:09:08,370 and the presence of a dot. 187 00:09:08,370 --> 00:09:09,520 Let's be more specific. 188 00:09:09,520 --> 00:09:11,687 Let's start to now narrow the scope of this program, 189 00:09:11,687 --> 00:09:15,600 not just to be about generic emails more generally, but about edu addresses, 190 00:09:15,600 --> 00:09:18,780 so specifically for someone in a US university, for instance, 191 00:09:18,780 --> 00:09:21,450 whose email address tends to end with .edu. 192 00:09:21,450 --> 00:09:23,310 I can be a little more precise. 193 00:09:23,310 --> 00:09:25,350 And you might recall this function already. 194 00:09:25,350 --> 00:09:28,590 Instead of just saying, is there a dot somewhere in domain, 195 00:09:28,590 --> 00:09:34,740 let me instead say, and the domain ends with quote, unquote ".edu." 196 00:09:34,740 --> 00:09:36,420 Now we're being even more precise. 197 00:09:36,420 --> 00:09:40,200 We want there to be minimally a username that's not empty-- it's not just quote, 198 00:09:40,200 --> 00:09:45,190 unquote "nothing"-- and we want the domain name to actually end with .edu. 199 00:09:45,190 --> 00:09:47,448 Let me go ahead and run python of validate.py. 200 00:09:47,448 --> 00:09:49,740 And just to make sure I haven't made things even worse, 201 00:09:49,740 --> 00:09:53,470 let me at least test my own email address, which does seem to be valid. 202 00:09:53,470 --> 00:09:56,070 Now, it seems that I minimally need to provide a username, 203 00:09:56,070 --> 00:09:58,380 because we definitely do have that check in place. 204 00:09:58,380 --> 00:10:00,210 So I'm going to go ahead and say malan. 205 00:10:00,210 --> 00:10:02,790 And now I'm going to go ahead and say @. 
206 00:10:02,790 --> 00:10:05,880 And it looks like I could be a little malicious here, 207 00:10:05,880 --> 00:10:09,030 just say malan@.edu, as though minimally meeting 208 00:10:09,030 --> 00:10:11,340 the requirements of this pattern. 209 00:10:11,340 --> 00:10:13,200 And that, of course, is considered valid, 210 00:10:13,200 --> 00:10:17,010 but I'm pretty sure there's no one at malan@.edu. 211 00:10:17,010 --> 00:10:19,350 We need to have some domain name in there. 212 00:10:19,350 --> 00:10:21,360 So we're still being a bit too generous. 213 00:10:21,360 --> 00:10:24,510 Now, we could absolutely continue to iterate on this program, 214 00:10:24,510 --> 00:10:26,640 and we could add some more Boolean expressions. 215 00:10:26,640 --> 00:10:28,590 We could maybe use some other Python methods 216 00:10:28,590 --> 00:10:31,530 for checking more precisely is there something to the left of the dot, 217 00:10:31,530 --> 00:10:32,550 to the right of the dot. 218 00:10:32,550 --> 00:10:34,320 We could use split multiple times. 219 00:10:34,320 --> 00:10:36,180 But honestly, this just escalates quickly. 220 00:10:36,180 --> 00:10:39,450 Like, you end up having to write a lot of code just 221 00:10:39,450 --> 00:10:42,360 to express something that's relatively simple in spirit-- 222 00:10:42,360 --> 00:10:45,550 just format this like an email address. 223 00:10:45,550 --> 00:10:47,920 So how can we go about improving this? 224 00:10:47,920 --> 00:10:52,350 Well, it turns out in Python there's a library for regular expressions. 225 00:10:52,350 --> 00:10:55,620 It's called succinctly R-E. And in the re library, 226 00:10:55,620 --> 00:11:00,510 you have a lot of capabilities to define and check for and even replace 227 00:11:00,510 --> 00:11:01,440 patterns. 228 00:11:01,440 --> 00:11:03,630 Again, a regular expression is a pattern. 
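Before moving on to the re library, the last split-based check described above (a non-empty username and a domain ending in .edu) might be sketched as follows. The function name `looks_valid` is hypothetical, and the sketch assumes exactly one @ sign in the input.

```python
def looks_valid(email):
    # Assumes exactly one "@" in the input.
    username, domain = email.split("@")

    # Non-empty username, and a domain whose last characters are ".edu".
    return bool(username) and domain.endswith(".edu")


print(looks_valid("malan@harvard.edu"))  # True
print(looks_valid("malan@harvard"))      # False
print(looks_valid("malan@.edu"))         # True, although no such domain exists
```

The last call shows the limitation discussed above: `.edu` on its own still satisfies `endswith`.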
229 00:11:03,630 --> 00:11:05,998 And this library, the re library in Python, 230 00:11:05,998 --> 00:11:08,040 is going to let us define some of these patterns, 231 00:11:08,040 --> 00:11:09,915 like a pattern for an email address, and then 232 00:11:09,915 --> 00:11:12,720 use some built-in functions to actually validate 233 00:11:12,720 --> 00:11:14,820 a user's input against that pattern or even 234 00:11:14,820 --> 00:11:17,250 use these patterns to change the user's input 235 00:11:17,250 --> 00:11:19,650 or extract partial information therefrom. 236 00:11:19,650 --> 00:11:22,030 We'll see examples of all this and more. 237 00:11:22,030 --> 00:11:24,045 So what can and should I do with this library? 238 00:11:24,045 --> 00:11:26,670 Well, first and foremost, it comes with a lot of functionality. 239 00:11:26,670 --> 00:11:29,760 Here is the URL, for instance, to the official documentation. 240 00:11:29,760 --> 00:11:31,710 And let me propose that we focus on using 241 00:11:31,710 --> 00:11:36,600 one of the most versatile functions in the library, namely this-- search. 242 00:11:36,600 --> 00:11:40,440 re.search is the name of the function in the re module 243 00:11:40,440 --> 00:11:42,910 that allows you to pass in a few arguments. 244 00:11:42,910 --> 00:11:46,620 The first is going to be a pattern that you want to search for in, 245 00:11:46,620 --> 00:11:48,900 for instance, a string that came from a user. 246 00:11:48,900 --> 00:11:51,977 The string argument here is going to be the actual string that you 247 00:11:51,977 --> 00:11:53,310 want to search for that pattern. 248 00:11:53,310 --> 00:11:55,410 And then there's a third argument optionally 249 00:11:55,410 --> 00:11:56,790 that's a whole bunch of flags. 250 00:11:56,790 --> 00:11:59,880 A flag in general is like a parameter you can pass in 251 00:11:59,880 --> 00:12:01,510 to modify the behavior of the function. 252 00:12:01,510 --> 00:12:03,510 But initially, we're not even going to use this. 
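As a quick sketch of the call shape just described, `re.search` takes a pattern, then the string to search, and optionally flags (omitted here). The variable names `found` and `missing` are hypothetical. The function returns a match object when the pattern is found and `None` when it is not.

```python
import re

# re.search(pattern, string, flags=0)
found = re.search("@", "malan@harvard.edu")
missing = re.search("@", "malan")

print(found is not None)  # True: a match object was returned
print(missing)            # None: the pattern does not occur in the string
```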
253 00:12:03,510 --> 00:12:06,610 We're just going to pass in a couple of arguments instead. 254 00:12:06,610 --> 00:12:11,700 So let me go ahead and employ this re library, this regular expression 255 00:12:11,700 --> 00:12:15,162 library, and just improve on this design incrementally. 256 00:12:15,162 --> 00:12:17,370 So we're not going to solve this problem all at once, 257 00:12:17,370 --> 00:12:19,590 but we'll take some incremental steps. 258 00:12:19,590 --> 00:12:21,840 I'm going to go back to VS Code here. 259 00:12:21,840 --> 00:12:25,050 And I'm going to go ahead now and get rid of most of this code. 260 00:12:25,050 --> 00:12:28,230 But I'm going to go into the top of my file and first of all, 261 00:12:28,230 --> 00:12:30,030 import this re library. 262 00:12:30,030 --> 00:12:33,030 So import re gives me access to that function and more. 263 00:12:33,030 --> 00:12:36,150 Now, after I've gotten the user's input in the same way as before, 264 00:12:36,150 --> 00:12:38,790 stripping off any leading or trailing whitespace, 265 00:12:38,790 --> 00:12:42,250 I'm just going to use this function super trivially for now, 266 00:12:42,250 --> 00:12:44,460 even though this isn't really a big step forward. 267 00:12:44,460 --> 00:12:50,190 I'm going to say, if re.search contains quote, unquote "@" 268 00:12:50,190 --> 00:12:53,700 in the email address, then let's go ahead and print "valid." 269 00:12:53,700 --> 00:12:55,740 Else, let's go ahead and print "invalid." 270 00:12:55,740 --> 00:12:59,730 At the moment, this is really no better than my very first version 271 00:12:59,730 --> 00:13:04,150 where I was just asking Python, if @ sign in the email address. 272 00:13:04,150 --> 00:13:08,880 But now I'm at least beginning to use this library by using its own re.search 273 00:13:08,880 --> 00:13:13,740 function, which for now you can assume returns a true value effectively 274 00:13:13,740 --> 00:13:16,440 if, indeed, the @ sign is in email. 
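The first re-based version of the validator might be sketched like this. The function name `looks_valid` is hypothetical and stands in for the interactive program, returning the word that the program would print.

```python
import re


def looks_valid(email):
    # re.search returns a truthy match object if "@" occurs anywhere
    # in email, and None (which is falsy) otherwise.
    if re.search("@", email):
        return "valid"
    else:
        return "invalid"


print(looks_valid("malan@harvard.edu"))  # valid
print(looks_valid("malan"))              # invalid
```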
275 00:13:16,440 --> 00:13:19,800 Just to make sure that this version does work as I expect, let me go ahead 276 00:13:19,800 --> 00:13:22,590 and run python of validate.py and Enter. 277 00:13:22,590 --> 00:13:26,220 I'll type in my actual email address, and we're back in business. 278 00:13:26,220 --> 00:13:29,370 But of course, this is not great, because if I similarly 279 00:13:29,370 --> 00:13:32,400 run this version of the program and just type in an @ sign, 280 00:13:32,400 --> 00:13:35,860 not an email address, and yet my code, of course, thinks it is valid. 281 00:13:35,860 --> 00:13:37,980 So how can I do better than this? 282 00:13:37,980 --> 00:13:42,330 Well, we need a bit more vocabulary in the realm of regular expressions, 283 00:13:42,330 --> 00:13:46,290 in order to be able to express ourselves a little more precisely. 284 00:13:46,290 --> 00:13:48,900 Really, the pattern I want to ultimately define 285 00:13:48,900 --> 00:13:52,410 is going to be something like, I want there to be something to the left, 286 00:13:52,410 --> 00:13:55,320 then an @ sign, then something to the right. 287 00:13:55,320 --> 00:13:59,310 And that something to the right should end with .edu but should also have 288 00:13:59,310 --> 00:14:02,160 something before the .edu, like Harvard, or Yale, 289 00:14:02,160 --> 00:14:04,680 or any other school in the US as well. 290 00:14:04,680 --> 00:14:06,550 Well, how can I go about doing this? 291 00:14:06,550 --> 00:14:11,040 Well, it turns out that in the world of regular expressions, whether in Python 292 00:14:11,040 --> 00:14:14,220 or a lot of other languages as well, there are certain symbols 293 00:14:14,220 --> 00:14:16,140 that you can use to define patterns. 294 00:14:16,140 --> 00:14:19,030 At the moment, I've just used literal raw text. 295 00:14:19,030 --> 00:14:21,600 If I go back to my code here, this technically 296 00:14:21,600 --> 00:14:23,940 qualifies as a regular expression. 
297 00:14:23,940 --> 00:14:28,290 I've passed in a quoted string inside of which is an @ sign. 298 00:14:28,290 --> 00:14:30,550 Now, that's not a very interesting pattern. 299 00:14:30,550 --> 00:14:31,500 It's just an @ sign. 300 00:14:31,500 --> 00:14:34,290 But it turns out that once you have access to regular expressions 301 00:14:34,290 --> 00:14:37,350 or a library that offers that feature, you can more 302 00:14:37,350 --> 00:14:40,360 powerfully express yourself as follows. 303 00:14:40,360 --> 00:14:43,770 Let me reveal that the pattern that you pass to re.search 304 00:14:43,770 --> 00:14:45,690 can take a whole bunch of special symbols. 305 00:14:45,690 --> 00:14:47,160 And here's just some of them. 306 00:14:47,160 --> 00:14:51,630 In the examples we're about to see, in the patterns we're about to define, 307 00:14:51,630 --> 00:14:53,040 here are the special symbols. 308 00:14:53,040 --> 00:14:56,280 You can use a single period, a dot, to just represent 309 00:14:56,280 --> 00:14:59,040 any character except a newline, a blank line. 310 00:14:59,040 --> 00:15:02,190 So that is to say, if I don't really care what letters of the alphabet 311 00:15:02,190 --> 00:15:04,200 are in the user's username, I just want there 312 00:15:04,200 --> 00:15:07,410 to be one or more characters in the user's name, 313 00:15:07,410 --> 00:15:11,340 dot allows me to express A through z, uppercase and lowercase, 314 00:15:11,340 --> 00:15:13,560 and a bunch of other letters as well. 315 00:15:13,560 --> 00:15:18,850 * is going to mean-- a single asterisk-- zero or more repetitions. 316 00:15:18,850 --> 00:15:21,630 So if I say something *, that means that I'm 317 00:15:21,630 --> 00:15:24,450 willing to accept either zero repetitions, that is, 318 00:15:24,450 --> 00:15:27,510 nothing at all, or more repetitions-- 319 00:15:27,510 --> 00:15:29,580 1, or 2, or 3, or 300. 
320 00:15:29,580 --> 00:15:31,950 If you see a plus in my pattern, so that's 321 00:15:31,950 --> 00:15:34,135 going to mean one or more repetitions. 322 00:15:34,135 --> 00:15:37,260 That is to say, there's got to be at least one character there, one symbol, 323 00:15:37,260 --> 00:15:40,180 and then there's optionally more after that. 324 00:15:40,180 --> 00:15:43,110 And then you can say zero or one repetition. 325 00:15:43,110 --> 00:15:46,590 You can use a single question mark after a symbol, and that will say, 326 00:15:46,590 --> 00:15:51,260 I want zero of this character or one, but that's all I'll expect. 327 00:15:51,260 --> 00:15:53,010 And then lastly, there's going to be a way 328 00:15:53,010 --> 00:15:55,140 to specify a specific number of symbols. 329 00:15:55,140 --> 00:15:57,330 If you use these curly braces and a number, 330 00:15:57,330 --> 00:15:59,610 represented here symbolically as m, you can 331 00:15:59,610 --> 00:16:03,720 specify that you want m repetitions, be it 1, or 2, or 3, or 300. 332 00:16:03,720 --> 00:16:06,190 You can specify the number of repetitions yourself. 333 00:16:06,190 --> 00:16:08,280 And if you want a range of repetitions, like you 334 00:16:08,280 --> 00:16:11,100 want this few characters or this many characters, 335 00:16:11,100 --> 00:16:13,770 you can use curly braces and two numbers inside, 336 00:16:13,770 --> 00:16:18,760 called here m and n, which would be a range of m through n repetitions. 337 00:16:18,760 --> 00:16:20,140 Now, what does all of this mean? 338 00:16:20,140 --> 00:16:22,380 Well, let me go back to VS Code here, and let 339 00:16:22,380 --> 00:16:25,650 me propose that we iterate on this solution further. 340 00:16:25,650 --> 00:16:27,985 It's not sufficient to just check for the @ sign. 341 00:16:27,985 --> 00:16:28,860 We know that already. 342 00:16:28,860 --> 00:16:31,600 We minimally want something to the left and to the right. 
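The symbols listed above can be tried out in isolation. The sketch below uses `re.fullmatch` rather than the `re.search` function used in this lecture, because `fullmatch` requires the whole string to match and so makes the repetition counts easy to see. That substitution, the helper name `fully_matches`, and the sample strings are all assumptions made for illustration.

```python
import re


def fully_matches(pattern, text):
    # True only if the entire text matches the pattern.
    return re.fullmatch(pattern, text) is not None


print(fully_matches(".", "a"))          # True: any single character
print(fully_matches(".", "\n"))         # False: the dot skips newlines
print(fully_matches("a*", ""))          # True: zero repetitions allowed
print(fully_matches("a+", ""))          # False: at least one required
print(fully_matches("a?", "aa"))        # False: zero or one only
print(fully_matches("a{3}", "aaa"))     # True: exactly 3 repetitions
print(fully_matches("a{2,4}", "aaaaa")) # False: 5 is outside 2 through 4
```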
343 00:16:31,600 --> 00:16:33,210 So how can I represent that? 344 00:16:33,210 --> 00:16:35,910 I don't really care what the user's username is, 345 00:16:35,910 --> 00:16:40,020 or what letters of the alphabet are in it, be it malan or anyone else's. 346 00:16:40,020 --> 00:16:42,600 So what I'm going to do to the left of this @ sign 347 00:16:42,600 --> 00:16:44,410 is I'm going to use a single period-- 348 00:16:44,410 --> 00:16:49,600 the dot that, again, indicates any character except for a newline. 349 00:16:49,600 --> 00:16:51,630 But I don't just want a single character. 350 00:16:51,630 --> 00:16:55,900 Otherwise, the person's username could only be a at such and such, 351 00:16:55,900 --> 00:16:57,450 or b at such and such. 352 00:16:57,450 --> 00:17:00,130 I want it to be multiple such characters. 353 00:17:00,130 --> 00:17:01,680 So I'm going to initially use a *. 354 00:17:01,680 --> 00:17:05,550 So dot * means give me something to the left, and I'm going to do another one, 355 00:17:05,550 --> 00:17:07,619 dot * something to the right. 356 00:17:07,619 --> 00:17:10,589 Now, this isn't perfect, but it's at least a step forward. 357 00:17:10,589 --> 00:17:12,871 Because now what I'm going to go ahead and do is this. 358 00:17:12,871 --> 00:17:14,579 I'm going to rerun python of validate.py. 359 00:17:14,579 --> 00:17:17,040 And I'm going to keep testing my own email address just to make 360 00:17:17,040 --> 00:17:18,415 sure I haven't made things worse. 361 00:17:18,415 --> 00:17:19,800 And that's now OK. 362 00:17:19,800 --> 00:17:22,530 I'm now going to go ahead and type in some other input, 363 00:17:22,530 --> 00:17:28,380 like how about just malan@ with no domain name whatsoever. 364 00:17:28,380 --> 00:17:30,640 And you would think this is going to be invalid. 365 00:17:30,640 --> 00:17:34,680 But, but, but it's still considered valid. 366 00:17:34,680 --> 00:17:35,850 But why is that? 
367 00:17:35,850 --> 00:17:42,120 If I go back to this chart, why is malan@ with no domain now considered 368 00:17:42,120 --> 00:17:43,260 valid? 369 00:17:43,260 --> 00:17:50,010 What's my mistake here by having used .*@.* as my regular expression 370 00:17:50,010 --> 00:17:50,670 or regex? 371 00:17:50,670 --> 00:17:54,355 AUDIENCE: Because you're using the * instead of the plus sign. 372 00:17:54,355 --> 00:17:55,230 DAVID MALAN: Exactly. 373 00:17:55,230 --> 00:17:58,090 The *, again, means zero or more repetitions. 374 00:17:58,090 --> 00:18:03,120 So re.search is perfectly happy to accept nothing after the @ sign, 375 00:18:03,120 --> 00:18:05,230 because that would be zero repetitions. 376 00:18:05,230 --> 00:18:09,000 So I think I minimally need to evolve this and go back to my code here. 377 00:18:09,000 --> 00:18:12,990 And let me go ahead and change this from dot * to dot +. 378 00:18:12,990 --> 00:18:16,620 And let me change the ending from dot * to dot + 379 00:18:16,620 --> 00:18:18,900 so that now when I run my code here-- 380 00:18:18,900 --> 00:18:21,510 let me go ahead and run python of validate.py. 381 00:18:21,510 --> 00:18:23,490 I'm going to test my email address as always. 382 00:18:23,490 --> 00:18:24,600 Still working. 383 00:18:24,600 --> 00:18:27,690 Now let me go ahead and type in that same thing from before that 384 00:18:27,690 --> 00:18:29,820 was accidentally considered valid. 385 00:18:29,820 --> 00:18:32,590 Now I hit Enter, finally it's invalid. 386 00:18:32,590 --> 00:18:35,460 So now we're making some progress on being a little more 387 00:18:35,460 --> 00:18:37,560 precise as to what it is we're doing. 388 00:18:37,560 --> 00:18:40,920 Now, I'll note here, like with almost everything in programming, 389 00:18:40,920 --> 00:18:45,090 Python included, there's often multiple ways to solve the same problem. 
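The difference between the two patterns discussed above can be checked directly. This is a minimal sketch with hard-coded inputs in place of a prompt; the variable names are hypothetical.

```python
import re

# .*@.* allows zero characters on either side of the "@".
star_result = bool(re.search(".*@.*", "malan@"))

# .+@.+ demands at least one character on each side.
plus_result = bool(re.search(".+@.+", "malan@"))
plus_full = bool(re.search(".+@.+", "malan@harvard.edu"))

print(star_result)  # True: nothing after the "@" is acceptable to *
print(plus_result)  # False: + needs at least one character after "@"
print(plus_full)    # True
```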
390 00:18:45,090 --> 00:18:49,410 And does anyone see a way in my code here 391 00:18:49,410 --> 00:18:54,360 that I can make a slight tweak if I forgot that the plus operator exists 392 00:18:54,360 --> 00:18:56,880 and go back to using a *? 393 00:18:56,880 --> 00:19:00,570 If I allowed you only to use dots and only stars, 394 00:19:00,570 --> 00:19:03,570 could you recreate the notion of plus? 395 00:19:03,570 --> 00:19:04,890 AUDIENCE: Yes. 396 00:19:04,890 --> 00:19:06,930 Use another dot, dot dot *. 397 00:19:06,930 --> 00:19:07,680 DAVID MALAN: Yeah. 398 00:19:07,680 --> 00:19:10,290 Because if a dot means any character, we'll just use a dot. 399 00:19:10,290 --> 00:19:14,040 And then when you want to say "or more," use another dot and then the *. 400 00:19:14,040 --> 00:19:18,300 So equivalent to dot + would have been dot dot *, 401 00:19:18,300 --> 00:19:21,870 because the first dot means any character, and the second pair 402 00:19:21,870 --> 00:19:25,050 of characters, dot *, means zero or more other characters. 403 00:19:25,050 --> 00:19:27,600 And to be clear, it doesn't have to be the same character. 404 00:19:27,600 --> 00:19:31,830 Just by doing dot or dot * does not mean your whole username needs to be 405 00:19:31,830 --> 00:19:35,310 a, or aa, or aaa, or aaaa. 406 00:19:35,310 --> 00:19:37,230 It can vary with each symbol. 407 00:19:37,230 --> 00:19:41,790 It just means zero or more of any character back to back. 408 00:19:41,790 --> 00:19:44,050 So I could do this on both the left and the right. 409 00:19:44,050 --> 00:19:45,120 Which one is better? 410 00:19:45,120 --> 00:19:46,110 You know, it depends. 411 00:19:46,110 --> 00:19:49,860 I think an argument could be made that this is even more clear, because it's 412 00:19:49,860 --> 00:19:52,380 obvious now that there's a dot, which means any character, 413 00:19:52,380 --> 00:19:53,910 and then there's the dot *. 
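One way to convince yourself that dot dot * and dot + behave alike is to run both patterns over a few inputs and compare. The sample strings and the helper name `agree` below are assumptions chosen for illustration.

```python
import re

# A few hypothetical inputs, some valid-looking and some not.
samples = ["malan@harvard.edu", "malan@", "@harvard.edu", "a@b", "@"]


def agree(text):
    # "..*" is one required character followed by zero or more others,
    # which should accept exactly what ".+" accepts.
    with_plus = re.search(".+@.+", text) is not None
    with_dot_star = re.search("..*@..*", text) is not None
    return with_plus == with_dot_star


print(all(agree(s) for s in samples))  # True
```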
414 00:19:53,910 --> 00:19:56,250 But if you're in the habit of doing this frequently, 415 00:19:56,250 --> 00:19:58,500 one of the reasons things like the plus exist 416 00:19:58,500 --> 00:20:01,750 is just to consolidate your code into something a little more succinct. 417 00:20:01,750 --> 00:20:03,750 And if you're familiar with seeing the plus now, 418 00:20:03,750 --> 00:20:05,470 maybe this is more readable to you. 419 00:20:05,470 --> 00:20:07,590 So again, just like with Python more generally, 420 00:20:07,590 --> 00:20:10,590 you're going to often see different ways to express the same patterns, 421 00:20:10,590 --> 00:20:12,750 and reasonable people might agree or disagree 422 00:20:12,750 --> 00:20:15,810 as to which way is better than another. 423 00:20:15,810 --> 00:20:18,030 Well, let me propose to you that we can think 424 00:20:18,030 --> 00:20:20,520 about both of these models a little more graphically. 425 00:20:20,520 --> 00:20:22,770 If this looks a little cryptic to you, let me go ahead 426 00:20:22,770 --> 00:20:26,610 and rewind to the previous incarnation of this regular expression, which 427 00:20:26,610 --> 00:20:28,830 was just a single dot *. 428 00:20:28,830 --> 00:20:32,910 This regular expression, .*@.* means what again? 429 00:20:32,910 --> 00:20:36,690 It means zero or more characters followed by a literal @ sign followed 430 00:20:36,690 --> 00:20:38,580 by zero or more other characters. 431 00:20:38,580 --> 00:20:41,850 Now when you pass this pattern in as an argument to re.search, 432 00:20:41,850 --> 00:20:45,030 it's going to read it from left to right and then use 433 00:20:45,030 --> 00:20:48,750 it to try to match against the input, email, in this case, 434 00:20:48,750 --> 00:20:50,100 that the user typed in. 435 00:20:50,100 --> 00:20:53,070 Now, how is the computer, how is re.search 436 00:20:53,070 --> 00:20:57,760 going to keep track of whether or not the user's email matches this pattern? 
437 00:20:57,760 --> 00:21:01,230 Well, it turns out that it's going to be using a machine of sorts implemented 438 00:21:01,230 --> 00:21:03,540 in software known as a finite state machine, or more 439 00:21:03,540 --> 00:21:06,750 formally, a nondeterministic finite automaton. 440 00:21:06,750 --> 00:21:09,930 And the way it works, if we depict this graphically, is as follows. 441 00:21:09,930 --> 00:21:14,940 The re.search function starts over here in a so-called start state. 442 00:21:14,940 --> 00:21:16,980 That's the sort of condition in which it begins. 443 00:21:16,980 --> 00:21:20,730 And then it's going to read the user's email address from left to right. 444 00:21:20,730 --> 00:21:24,030 And it's going to decide whether or not to stay in this first state 445 00:21:24,030 --> 00:21:26,170 or transition to the next state. 446 00:21:26,170 --> 00:21:29,970 So for instance, in this first state, as the function is reading my email address, 447 00:21:29,970 --> 00:21:35,130 malan@harvard.edu, it's going to follow this curved edge up and around 448 00:21:35,130 --> 00:21:36,870 to itself, a reflexive edge. 449 00:21:36,870 --> 00:21:40,030 And it's labeled dot, because dot, again, just means any character. 450 00:21:40,030 --> 00:21:43,840 So as the function is reading my email address, malan@harvard.edu, 451 00:21:43,840 --> 00:21:48,270 from left to right, it's going to follow these transitions as follows, 452 00:21:48,270 --> 00:21:53,070 M-A-L-A-N. 453 00:21:53,070 --> 00:21:56,040 And then it's hopefully going to follow this transition 454 00:21:56,040 --> 00:22:00,000 to the second state, because there's a literal @ sign both in this machine 455 00:22:00,000 --> 00:22:01,630 as well as in my email address. 456 00:22:01,630 --> 00:22:10,070 Then it's going to try to read the rest of my address, H-A-R-V-A-R-D dot E-D-U, 457 00:22:10,070 --> 00:22:11,190 and that's it. 458 00:22:11,190 --> 00:22:12,870 And then the computer is going to check.
459 00:22:12,870 --> 00:22:16,260 Did it end up in an accept state, a final state, 460 00:22:16,260 --> 00:22:18,120 that's actually depicted here pictorially 461 00:22:18,120 --> 00:22:21,150 a little differently with double circles, one inside of the other? 462 00:22:21,150 --> 00:22:25,410 And that just means that if the computer finds itself in that second 463 00:22:25,410 --> 00:22:29,130 accept state after having read all of the user's input, 464 00:22:29,130 --> 00:22:31,560 it is, indeed, a valid email address. 465 00:22:31,560 --> 00:22:34,350 If by some chance, the machine somehow ended up 466 00:22:34,350 --> 00:22:37,020 stuck in that first state, which does not have double circles 467 00:22:37,020 --> 00:22:39,300 and is therefore not an accept state, the computer 468 00:22:39,300 --> 00:22:42,810 would conclude this is an invalid email address instead. 469 00:22:42,810 --> 00:22:45,630 By contrast, if we go back to my other version 470 00:22:45,630 --> 00:22:49,800 of the code where I instead had dot plus on both the left and the right, 471 00:22:49,800 --> 00:22:53,130 recall that re.search is going to use one of these state machines 472 00:22:53,130 --> 00:22:57,030 in order to decide from left to right whether or not to accept the user's 473 00:22:57,030 --> 00:22:59,310 input, like malan@harvard.edu. 474 00:22:59,310 --> 00:23:02,850 Can we get from the start state, so to speak, to an accept state 475 00:23:02,850 --> 00:23:05,940 to decide, yep, this was, in fact, meeting the pattern? 476 00:23:05,940 --> 00:23:09,900 Well, let's propose that this nondeterministic finite automaton 477 00:23:09,900 --> 00:23:11,430 looked like this instead.
478 00:23:11,430 --> 00:23:14,310 We're going to start as before in the leftmost start state, 479 00:23:14,310 --> 00:23:18,180 and we're going to necessarily consume one character per this first edge, 480 00:23:18,180 --> 00:23:21,480 which is labeled with a dot to indicate that we can consume any one character, 481 00:23:21,480 --> 00:23:24,411 like the m in malan@harvard.edu. 482 00:23:24,411 --> 00:23:27,960 Then we can spend some time consuming more characters before the @ sign, 483 00:23:27,960 --> 00:23:31,290 so the A-L-A-N. 484 00:23:31,290 --> 00:23:33,340 Then we can consume the @ sign. 485 00:23:33,340 --> 00:23:36,270 Then we can consume at least one more character, because recall 486 00:23:36,270 --> 00:23:38,760 that the regex has dot plus this time. 487 00:23:38,760 --> 00:23:42,190 And then we can consume even more characters if we want. 488 00:23:42,190 --> 00:23:45,900 So if we first consume the H in harvard.edu, 489 00:23:45,900 --> 00:23:53,885 that leaves the A-R-V-A-R-D, and then dot E-D-U. 490 00:23:53,885 --> 00:23:56,560 And now here, too, we're at the end of the story, 491 00:23:56,560 --> 00:23:59,760 but we're in an accept state, because that circle at the end 492 00:23:59,760 --> 00:24:03,840 has two circles total, which means that if the computer, if this function, 493 00:24:03,840 --> 00:24:07,830 finds itself in that accept state after reading the entirety of the user's 494 00:24:07,830 --> 00:24:11,580 input, it is, too, in fact, a valid email address. 495 00:24:11,580 --> 00:24:15,390 If by contrast, we had gotten stuck in one of those other states, 496 00:24:15,390 --> 00:24:18,180 unable to follow a transition, one of those edges, 497 00:24:18,180 --> 00:24:22,440 and therefore unable to make progress in the user's input from left to right, 498 00:24:22,440 --> 00:24:26,670 then we would have to conclude that email address is, in fact, invalid.
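The walk through the diagram above can be sketched by hand in a few lines. This is only an illustration of the machine as described for dot plus, @, dot plus over the whole input, with state numbers 0 through 3 and the function name `accepts` chosen as assumptions. It is not a claim about how the re module is implemented internally.

```python
def accepts(text):
    # States (numbering is an assumption for this sketch):
    #   0 = start, 1 = one or more characters read,
    #   2 = the @ sign just read, 3 = accept (one or more characters after @).
    # Because the machine is nondeterministic, we track every state
    # it could be in at once.
    states = {0}
    for ch in text:
        if ch == "\n":
            return False  # no edge matches a newline, so the machine is stuck
        next_states = set()
        for state in states:
            if state == 0:
                next_states.add(1)      # first dot: any one character
            elif state == 1:
                next_states.add(1)      # loop: more characters before the @
                if ch == "@":
                    next_states.add(2)  # literal @ edge
            elif state == 2:
                next_states.add(3)      # dot: first character after the @
            elif state == 3:
                next_states.add(3)      # loop: more characters after the @
        states = next_states
    # Valid only if some path ends in the accept state once all input is read.
    return 3 in states


print(accepts("malan@harvard.edu"))  # True
print(accepts("malan@"))             # False
```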
499 00:24:26,670 --> 00:24:29,490 Well, how can we go about improving this code further? 500 00:24:29,490 --> 00:24:33,660 Let me propose now that we check not only for a username and also something 501 00:24:33,660 --> 00:24:37,320 after the username, like a domain name, but minimally require that the string 502 00:24:37,320 --> 00:24:39,600 ends with .edu as well. 503 00:24:39,600 --> 00:24:41,970 Well, I think I could do this fairly straightforwardly. 504 00:24:41,970 --> 00:24:44,940 Not only do I want there to be something after the @ sign, 505 00:24:44,940 --> 00:24:49,818 like the domain like Harvard, I want the whole thing to end with .edu. 506 00:24:49,818 --> 00:24:52,320 But there's a little bit of danger here. 507 00:24:52,320 --> 00:24:57,660 What have I done wrong by implementing my regular expression now in this way, 508 00:24:57,660 --> 00:24:59,110 by using .+@.+.edu? 509 00:24:59,110 --> 00:25:01,938 510 00:25:01,938 --> 00:25:06,080 What could go wrong with this version? 511 00:25:06,080 --> 00:25:08,360 AUDIENCE: The dot is-- the dot means something 512 00:25:08,360 --> 00:25:11,510 else in this context, where it means three or more repetitions 513 00:25:11,510 --> 00:25:14,630 of a character, which is why it will interpret it [INAUDIBLE].. 514 00:25:14,630 --> 00:25:15,570 DAVID MALAN: Exactly. 515 00:25:15,570 --> 00:25:19,340 Even though I mean for it to mean literally .edu, a period, 516 00:25:19,340 --> 00:25:22,560 and then edu, unfortunately in the world of regular expressions, 517 00:25:22,560 --> 00:25:26,720 dot means any character, which means that this string could technically end 518 00:25:26,720 --> 00:25:34,080 in aedu, or bedu, or cedu, and so forth, but that's not, in fact, what I want. 519 00:25:34,080 --> 00:25:37,670 So any instincts now as to how I could fix this problem? 520 00:25:37,670 --> 00:25:39,770 And let me demonstrate the problem more clearly. 521 00:25:39,770 --> 00:25:41,900 Let me go ahead and run this code here.
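The danger described here can be seen in a short sketch: with an unescaped final dot, any character is accepted in front of edu. The helper name `looks_valid` is an assumption for illustration.

```python
import re


def looks_valid(email):
    # The last dot is NOT literal here: it still means "any character",
    # so the input only has to contain some character followed by edu.
    return re.search(".+@.+.edu", email) is not None


print(looks_valid("malan@harvard.edu"))  # True
print(looks_valid("malan@harvard?edu"))  # True, which is not what we want
```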
522 00:25:41,900 --> 00:25:45,050 Let me go ahead and type in malan@harvard.edu. 523 00:25:45,050 --> 00:25:47,240 And as always, this does, in fact, work. 524 00:25:47,240 --> 00:25:48,680 But watch what happens here. 525 00:25:48,680 --> 00:25:52,520 Let me go ahead and do malan@harvard and then-- 526 00:25:52,520 --> 00:25:57,992 malan@harvard?edu, Enter, that, too, is valid. 527 00:25:57,992 --> 00:26:00,950 So I could put any character there and it's still going to be accepted. 528 00:26:00,950 --> 00:26:02,420 But I don't want ?edu. 529 00:26:02,420 --> 00:26:04,670 I want .edu literally. 530 00:26:04,670 --> 00:26:08,700 Any instincts, then, for how we can solve this problem here? 531 00:26:08,700 --> 00:26:12,770 How can I get this new function, re.search, and a regular expression 532 00:26:12,770 --> 00:26:16,160 more generally, to literally mean a dot, might you think? 533 00:26:16,160 --> 00:26:19,257 AUDIENCE: You can use the escape character, the backslash? 534 00:26:19,257 --> 00:26:20,090 DAVID MALAN: Indeed. 535 00:26:20,090 --> 00:26:22,927 The so-called escape character, which we've seen before outside 536 00:26:22,927 --> 00:26:25,760 of the context of regular expressions when we talked about newlines. 537 00:26:25,760 --> 00:26:29,640 Backslash n was a way of telling the computer I want a newline, 538 00:26:29,640 --> 00:26:32,810 but without actually literally hitting Enter and moving the cursor yourself. 539 00:26:32,810 --> 00:26:35,090 And you don't want a literal n on the screen. 540 00:26:35,090 --> 00:26:39,350 So backslash n was a way to escape n and convey that you want a newline. 541 00:26:39,350 --> 00:26:41,900 It turns out regular expressions use a similar technique 542 00:26:41,900 --> 00:26:43,640 to solve this problem here. 543 00:26:43,640 --> 00:26:45,770 In fact, let me go into my regular expression. 544 00:26:45,770 --> 00:26:49,370 And before that final dot, let me put a single backslash. 
545 00:26:49,370 --> 00:26:52,880 In the world of regular expressions, this is a so-called special sequence. 546 00:26:52,880 --> 00:26:55,940 And it indicates, per this backslash and a single dot, 547 00:26:55,940 --> 00:26:58,290 that I literally want to match on a dot. 548 00:26:58,290 --> 00:27:02,180 It's not that I want to match on any character and then edu. 549 00:27:02,180 --> 00:27:05,300 I want to match on a dot, or a period, edu. 550 00:27:05,300 --> 00:27:09,050 But we don't want Python to misinterpret this backslash 551 00:27:09,050 --> 00:27:12,710 as beginning an escape sequence, something special like backslash 552 00:27:12,710 --> 00:27:15,590 n, which even though we as the programmer might type two characters 553 00:27:15,590 --> 00:27:20,090 backslash n, it really is interpreted by Python as a single newline. 554 00:27:20,090 --> 00:27:22,833 We don't want any kind of misinterpretation like that here. 555 00:27:22,833 --> 00:27:26,000 So it turns out there's one other thing we should do for regular expressions 556 00:27:26,000 --> 00:27:29,180 like this that have a backslash used in this way. 557 00:27:29,180 --> 00:27:33,440 I want to specify to Python that I want this string, this regular expression 558 00:27:33,440 --> 00:27:36,200 in double quotes, to be treated as a raw string, 559 00:27:36,200 --> 00:27:38,510 literally putting an r at the beginning of the string 560 00:27:38,510 --> 00:27:41,240 to indicate to Python that you should not try to interpret 561 00:27:41,240 --> 00:27:43,550 any backslashes in the usual way. 562 00:27:43,550 --> 00:27:46,850 I want to literally pass the backslash and the dot and the edu 563 00:27:46,850 --> 00:27:50,030 into this particular function, search, in this case. 
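To make the raw-string point concrete, here is a small sketch. The length comparison shows how Python treats backslash n with and without the r prefix, and the hypothetical helper `looks_valid` applies the escaped dot described above.

```python
import re

# In an ordinary string, backslash n is one character (a newline).
# In a raw string, it stays as two characters: a backslash and an n.
print(len("\n"))   # 1
print(len(r"\n"))  # 2


def looks_valid(email):
    # \. inside a raw string reaches re.search as backslash-dot,
    # which matches a literal period rather than any character.
    return re.search(r".+@.+\.edu", email) is not None


print(looks_valid("malan@harvard.edu"))  # True
print(looks_valid("malan@harvard?edu"))  # False
```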
564 00:27:50,030 --> 00:27:53,750 So it's similar in spirit to using that f at the beginning of a format 565 00:27:53,750 --> 00:27:57,170 string, which, of course, tells Python to format the string in a certain way, 566 00:27:57,170 --> 00:27:59,720 plugging in variables that might be between curly braces. 567 00:27:59,720 --> 00:28:02,900 But in this case, r indicates a raw string 568 00:28:02,900 --> 00:28:05,570 that I want passed in exactly as is. 569 00:28:05,570 --> 00:28:09,380 Now, it's only strictly necessary if you are, in fact, using backslashes 570 00:28:09,380 --> 00:28:12,860 to indicate that you want some special sequence, like backslash dot. 571 00:28:12,860 --> 00:28:14,750 But in general, it's probably a good habit 572 00:28:14,750 --> 00:28:18,540 to get into to just use raw strings for all of your regular expressions 573 00:28:18,540 --> 00:28:21,480 so that if you eventually go back in, make a change, make an addition, 574 00:28:21,480 --> 00:28:23,600 you don't accidentally introduce a backslash 575 00:28:23,600 --> 00:28:28,113 and then forget that that might have some special or misinterpreted meaning. 576 00:28:28,113 --> 00:28:30,530 Well, let me go ahead and try this new regular expression. 577 00:28:30,530 --> 00:28:34,430 I'll clear my terminal window, run python of validate-- 578 00:28:34,430 --> 00:28:36,800 run python of validate.py. 579 00:28:36,800 --> 00:28:40,496 And then I'll type in my email address correctly, malan@harvard.edu. 580 00:28:40,496 --> 00:28:42,710 And that's, fortunately, still valid. 581 00:28:42,710 --> 00:28:46,490 Let me clear my screen and run it one more time, python of validate.py. 582 00:28:46,490 --> 00:28:50,930 And this time, let's mistype it as malan@harvard?edu, 583 00:28:50,930 --> 00:28:53,540 whereby there's obviously not a dot there, 584 00:28:53,540 --> 00:28:57,710 but there is some other single character that last time was misinterpreted 585 00:28:57,710 --> 00:28:58,430 as valid. 
586 00:28:58,430 --> 00:29:01,970 But this time, now that I've improved my regular expression, 587 00:29:01,970 --> 00:29:05,270 it's discovered as, indeed, invalid. 588 00:29:05,270 --> 00:29:10,850 Any questions now on this technique for matching something to the left of the @ 589 00:29:10,850 --> 00:29:15,320 sign, something to the right, and now ending with .edu explicitly? 590 00:29:15,320 --> 00:29:18,582 AUDIENCE: What happens when user inserts multiple @ signs? 591 00:29:18,582 --> 00:29:19,790 DAVID MALAN: A good question. 592 00:29:19,790 --> 00:29:21,320 And you kind of called me out here. 593 00:29:21,320 --> 00:29:22,910 Well, when in doubt, let's try. 594 00:29:22,910 --> 00:29:29,340 Let me go ahead and do python of validate.py, malan@@@harvard.edu, 595 00:29:29,340 --> 00:29:34,020 which also is incorrect, unfortunately, my code thinks it's valid. 596 00:29:34,020 --> 00:29:37,490 So another problem to solve, but a shortcoming for now. 597 00:29:37,490 --> 00:29:41,510 Other questions on these regular expressions thus far? 598 00:29:41,510 --> 00:29:46,108 AUDIENCE: Can you use curly brackets m instead of backslash? 599 00:29:46,108 --> 00:29:48,650 DAVID MALAN: Can you use curly brackets instead of backslash? 600 00:29:48,650 --> 00:29:49,490 Not in this case. 601 00:29:49,490 --> 00:29:53,750 If you want a literal dot, backslash dot is the way to do it literally. 602 00:29:53,750 --> 00:29:56,660 How about one other question on regular expressions? 603 00:29:56,660 --> 00:30:00,620 AUDIENCE: Is this the same thing that Google Forms uses in order 604 00:30:00,620 --> 00:30:06,590 to categorize data in, let's say, some-- if you've got multiple people sending 605 00:30:06,590 --> 00:30:09,380 in requests about some feedback? 606 00:30:09,380 --> 00:30:12,170 Do they categorize the data that they get 607 00:30:12,170 --> 00:30:14,247 using this particular regular expression thing? 608 00:30:14,247 --> 00:30:15,080 DAVID MALAN: Indeed. 
609 00:30:15,080 --> 00:30:17,450 If you've ever used Google Forms to not just submit it 610 00:30:17,450 --> 00:30:20,900 but to create a Google Form, one of the menu options 611 00:30:20,900 --> 00:30:23,570 is for response validation, in English at least. 612 00:30:23,570 --> 00:30:25,340 And what that allows you to do is specify 613 00:30:25,340 --> 00:30:29,060 that the user has to input an email address, or a URL, 614 00:30:29,060 --> 00:30:31,400 or a string of some length. 615 00:30:31,400 --> 00:30:33,830 But there's an even more powerful feature that some of you 616 00:30:33,830 --> 00:30:35,150 may not have ever noticed. 617 00:30:35,150 --> 00:30:37,340 And indeed, if you'd like to open up Google Forms, 618 00:30:37,340 --> 00:30:41,180 create a new form temporarily, and poke around, you will actually see, 619 00:30:41,180 --> 00:30:44,270 in English at least, quote, unquote "regular expression" 620 00:30:44,270 --> 00:30:46,070 mentioned as one of the mechanisms you can 621 00:30:46,070 --> 00:30:49,520 use to validate your users' input into your Google Form. 622 00:30:49,520 --> 00:30:53,690 So in fact, after today you can start avoiding the specific dropdowns 623 00:30:53,690 --> 00:30:55,610 of like email address, or URL, or the like, 624 00:30:55,610 --> 00:30:59,540 and you can express your own patterns precisely as well. 625 00:30:59,540 --> 00:31:02,900 Regular expressions can even be used in VS Code itself. 626 00:31:02,900 --> 00:31:06,440 If you go and find, or do a find and replace in VS Code, 627 00:31:06,440 --> 00:31:08,690 you can, of course, just type in words, like you could 628 00:31:08,690 --> 00:31:10,880 into Microsoft Word or Google Docs. 629 00:31:10,880 --> 00:31:14,990 You can also type, if you check the right box, regular expressions 630 00:31:14,990 --> 00:31:19,670 and start searching for patterns, not literally specific values. 
631 00:31:19,670 --> 00:31:24,080 Well, let me propose that we now enhance this implementation further 632 00:31:24,080 --> 00:31:28,010 by introducing a few other symbols, because right now with my code, 633 00:31:28,010 --> 00:31:32,540 I keep saying that I want my email address to end with .edu and start with 634 00:31:32,540 --> 00:31:35,780 a username, but I'm being a little too generous. 635 00:31:35,780 --> 00:31:38,690 This does, in fact, work as expected for my own email address, 636 00:31:38,690 --> 00:31:40,928 malan@harvard.edu. 637 00:31:40,928 --> 00:31:45,350 But what if I type in a sentence like, "my email address 638 00:31:45,350 --> 00:31:50,180 is malan@harvard.edu," and suppose I've typed that into the program 639 00:31:50,180 --> 00:31:52,310 or I've typed that into a Google Form? 640 00:31:52,310 --> 00:31:57,680 Is this going to be considered valid or invalid? 641 00:31:57,680 --> 00:31:59,390 Well, let's consider. 642 00:31:59,390 --> 00:32:01,970 It's got @ sign, so we're good there. 643 00:32:01,970 --> 00:32:05,570 It's got one or more characters to the left of the @ sign. 644 00:32:05,570 --> 00:32:09,050 It's got one or more characters to the right of the @ sign. 645 00:32:09,050 --> 00:32:14,390 It's got a literal .edu somewhere in there to the right of the @ sign. 646 00:32:14,390 --> 00:32:16,460 And granted, there's more stuff to the right. 647 00:32:16,460 --> 00:32:19,700 There's literally this period at the end of my English sentence. 648 00:32:19,700 --> 00:32:23,600 But that's OK, because at the moment, my regular expression is not so precise 649 00:32:23,600 --> 00:32:29,156 as to say, the pattern must start with the username and end with the .edu. 650 00:32:29,156 --> 00:32:32,573 Technically, it's left unsaid what more can be to the left 651 00:32:32,573 --> 00:32:33,990 and what more can be to the right. 
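The reasoning above can be checked with a quick sketch: because nothing in the pattern pins it to the start or end of the input, a whole sentence that merely contains an address still passes. The helper name `looks_valid` is an assumption.

```python
import re


def looks_valid(text):
    # re.search looks for the pattern anywhere in the text,
    # so extra words before or after the address are tolerated.
    return re.search(r".+@.+\.edu", text) is not None


print(looks_valid("malan@harvard.edu"))                       # True
print(looks_valid("my email address is malan@harvard.edu."))  # True
```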
652 00:32:33,990 --> 00:32:37,970 So when I hit Enter now, you'll see that that whole sentence in English 653 00:32:37,970 --> 00:32:40,500 is valid, and that's obviously not what you want. 654 00:32:40,500 --> 00:32:43,430 In fact, consider the case of using Google Forms or Office 655 00:32:43,430 --> 00:32:45,620 365 to collect data from users. 656 00:32:45,620 --> 00:32:48,320 If you don't validate your input, your users 657 00:32:48,320 --> 00:32:51,170 might very well type in a full sentence or something else 658 00:32:51,170 --> 00:32:53,550 with a typographical error, not an actual email. 659 00:32:53,550 --> 00:32:55,993 So if you're just trying to copy all of the results that 660 00:32:55,993 --> 00:32:58,160 have been typed into your form so you can paste them 661 00:32:58,160 --> 00:33:00,767 into Gmail or some email program, it's going to break, 662 00:33:00,767 --> 00:33:04,100 because you're going to accidentally paste something like a whole English sentence 663 00:33:04,100 --> 00:33:07,010 into the program instead of just an email address, which 664 00:33:07,010 --> 00:33:08,690 is what your mailer expects. 665 00:33:08,690 --> 00:33:10,280 So how can I be more precise? 666 00:33:10,280 --> 00:33:13,550 Well, let me propose we introduce a few more symbols as well. 667 00:33:13,550 --> 00:33:17,540 It turns out in the context of a regular expression, one of these patterns, 668 00:33:17,540 --> 00:33:21,170 you can use the caret symbol, the little triangular mark, 669 00:33:21,170 --> 00:33:24,080 to represent that you want this pattern to match 670 00:33:24,080 --> 00:33:27,110 the start of the string specifically-- not anywhere 671 00:33:27,110 --> 00:33:29,330 but the start of the user's string.
672 00:33:29,330 --> 00:33:34,040 By contrast, you can use a $ sign in your regular expression to say that you 673 00:33:34,040 --> 00:33:37,790 want to match the end of the string, or technically just before the newline 674 00:33:37,790 --> 00:33:38,910 at the end of the string. 675 00:33:38,910 --> 00:33:41,810 But for all intents and purposes, think of caret as meaning "start 676 00:33:41,810 --> 00:33:45,650 of the string" and $ sign as meaning "end of the string." 677 00:33:45,650 --> 00:33:49,310 It is a weird thing that one is a caret and one is $ sign. 678 00:33:49,310 --> 00:33:51,710 These are not really things that I think of as opposites, 679 00:33:51,710 --> 00:33:53,670 like a parentheses or something like that. 680 00:33:53,670 --> 00:33:56,430 But those are the symbols the world chose many years ago. 681 00:33:56,430 --> 00:33:58,370 So let me go back to VS Code now. 682 00:33:58,370 --> 00:34:01,460 And let me add this feature to my code here. 683 00:34:01,460 --> 00:34:04,790 Let me specify that yes, I do want to search for this pattern, 684 00:34:04,790 --> 00:34:08,480 but I want the user's input to start with this pattern 685 00:34:08,480 --> 00:34:09,860 and end with this pattern. 686 00:34:09,860 --> 00:34:12,440 So even though it's going to start looking even more cryptic, 687 00:34:12,440 --> 00:34:14,690 I put a caret symbol here at the beginning, 688 00:34:14,690 --> 00:34:17,270 and I put a $ sign here at the end. 689 00:34:17,270 --> 00:34:21,199 That does not mean I want the user to type a caret symbol or a $ sign. 690 00:34:21,199 --> 00:34:25,130 This is special symbology that indicates to re.search 691 00:34:25,130 --> 00:34:29,280 that it should only look for now an exact match against this pattern. 692 00:34:29,280 --> 00:34:31,699 So if I now go back to my terminal window-- 693 00:34:31,699 --> 00:34:33,920 and I'll leave the previous result on the screen-- 694 00:34:33,920 --> 00:34:35,540 let me type the exact same thing. 
695 00:34:35,540 --> 00:34:39,610 "My email address is malan@harvard.edu," Enter-- 696 00:34:39,610 --> 00:34:41,000 sorry, period. 697 00:34:41,000 --> 00:34:43,070 And now I'm going to go ahead and hit Enter. 698 00:34:43,070 --> 00:34:45,770 Now that's considered invalid. 699 00:34:45,770 --> 00:34:47,090 But let me clear the screen. 700 00:34:47,090 --> 00:34:48,923 And just to make sure I didn't break things, 701 00:34:48,923 --> 00:34:53,330 let me type in just my email address, and that, too, is valid. 702 00:34:53,330 --> 00:34:58,250 Any questions now on this version of my regular expression, which, note, 703 00:34:58,250 --> 00:35:01,670 goes further to specify even more precisely 704 00:35:01,670 --> 00:35:06,120 that I want it to match at the start and the end? 705 00:35:06,120 --> 00:35:08,568 Any questions on this one here? 706 00:35:08,568 --> 00:35:09,110 AUDIENCE: OK. 707 00:35:09,110 --> 00:35:13,160 You have slash, and .edu, then the $ sign. 708 00:35:13,160 --> 00:35:18,170 But the dot is one of the regular expression, right? 709 00:35:18,170 --> 00:35:19,460 DAVID MALAN: It normally is. 710 00:35:19,460 --> 00:35:24,590 But this backslash that I deliberately put before this period here 711 00:35:24,590 --> 00:35:26,180 is an escape character. 712 00:35:26,180 --> 00:35:30,710 It is a way of telling re.search that I don't want any character there, 713 00:35:30,710 --> 00:35:33,140 I literally want a period there. 714 00:35:33,140 --> 00:35:36,080 And it's the only way you can distinguish one from the other. 715 00:35:36,080 --> 00:35:40,550 If I got rid of that slash, this would mean that the email address just 716 00:35:40,550 --> 00:35:43,610 has to end with any character, then an E, then a D, 717 00:35:43,610 --> 00:35:45,180 then a U. I don't want that. 718 00:35:45,180 --> 00:35:49,730 I want literally a period, then the E, then the D, then the U.
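Putting the caret, the $ sign, and the escaped dot together, a minimal sketch of the anchored check might look like this. The helper names `anchored` and `anchored_no_escape` are assumptions used only to contrast the two patterns.

```python
import re


def anchored(text):
    # ^ and $ pin the pattern to the start and end of the input,
    # and \. requires a literal period before edu.
    return re.search(r"^.+@.+\.edu$", text) is not None


def anchored_no_escape(text):
    # Same anchors, but the final dot again means "any character".
    return re.search(r"^.+@.+.edu$", text) is not None


print(anchored("malan@harvard.edu"))                       # True
print(anchored("My email address is malan@harvard.edu."))  # False
print(anchored("malan@harvard?edu"))                       # False
print(anchored_no_escape("malan@harvard?edu"))             # True
```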
719 00:35:49,730 --> 00:35:53,780 This is actually common convention in programming and technology in general. 720 00:35:53,780 --> 00:35:55,820 If you and I decide on a convention, whereby 721 00:35:55,820 --> 00:35:59,180 we're using some character on the keyboard to mean something special, 722 00:35:59,180 --> 00:36:02,060 invariably we create a future problem for ourself 723 00:36:02,060 --> 00:36:04,820 when we want to literally use that same character. 724 00:36:04,820 --> 00:36:07,190 And so the solution in general to that problem 725 00:36:07,190 --> 00:36:10,790 is to somehow escape the character so that it's clear to the computer 726 00:36:10,790 --> 00:36:14,510 that it's not that special symbol, it's literally the symbol it sees. 727 00:36:14,510 --> 00:36:19,700 AUDIENCE: So we don't even know the-- we don't need another slash before the $ 728 00:36:19,700 --> 00:36:20,930 sign? 729 00:36:20,930 --> 00:36:22,150 DAVID MALAN: No. 730 00:36:22,150 --> 00:36:25,550 Because in this case, $ sign means something special. 731 00:36:25,550 --> 00:36:30,590 Per this chart here, $ sign by itself does not mean US dollars or currency. 732 00:36:30,590 --> 00:36:33,420 It literally means "match the end of the string." 733 00:36:33,420 --> 00:36:38,600 If, however, I wanted the user to literally type in $ sign at the end 734 00:36:38,600 --> 00:36:40,910 of their input, the solution would be the same. 735 00:36:40,910 --> 00:36:43,700 I would put a backslash before the $ sign, 736 00:36:43,700 --> 00:36:48,242 which means my email address would have to be something like malan@harvard.edu 737 00:36:48,242 --> 00:36:50,850 $ sign, which is obviously not correct too. 738 00:36:50,850 --> 00:36:55,280 So backslashes just allow you to tell the computer to not treat 739 00:36:55,280 --> 00:36:58,310 those symbols specially, like meaning something special, 740 00:36:58,310 --> 00:37:00,950 but to treat them literally instead.
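As a purely hypothetical illustration of escaping the $ sign, the sketch below requires a literal dollar character at the end of the input. The helper name `ends_with_dollar` is an assumption, and the pattern is not one the lecture recommends for real email addresses.

```python
import re


def ends_with_dollar(text):
    # \$ matches a literal dollar sign; the final, unescaped $
    # still means "end of the string".
    return re.search(r"^.+@.+\.edu\$$", text) is not None


print(ends_with_dollar("malan@harvard.edu$"))  # True
print(ends_with_dollar("malan@harvard.edu"))   # False
```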
741 00:37:00,950 --> 00:37:04,550 How about one other question here on regular expressions? 742 00:37:04,550 --> 00:37:09,010 AUDIENCE: You said one represents to make it one plus, 743 00:37:09,010 --> 00:37:11,095 then you said one was to make it one with nothing. 744 00:37:11,095 --> 00:37:11,845 DAVID MALAN: Sure. 745 00:37:11,845 --> 00:37:13,220 AUDIENCE: So why would you add the plus? 746 00:37:13,220 --> 00:37:14,360 DAVID MALAN: Let me rewind in time. 747 00:37:14,360 --> 00:37:17,027 I think what you're referring to was one of our earlier versions 748 00:37:17,027 --> 00:37:20,360 that initially looked like this, which just meant zero or more 749 00:37:20,360 --> 00:37:24,710 characters, then an @ sign, then zero or more other characters. 750 00:37:24,710 --> 00:37:29,090 We then evolved that to be this, dot plus on both sides, which 751 00:37:29,090 --> 00:37:31,340 means one or more characters on the left, then 752 00:37:31,340 --> 00:37:34,320 an @ sign, then one or more characters on the right. 753 00:37:34,320 --> 00:37:36,560 And if I'm interpreting your question correctly, 754 00:37:36,560 --> 00:37:40,370 one of the points I made earlier was that if you didn't use plus or forgot 755 00:37:40,370 --> 00:37:44,510 that it exists, you could equivalently achieve the exact same result with two 756 00:37:44,510 --> 00:37:48,380 dots and a *, because the first dot means any character-- 757 00:37:48,380 --> 00:37:49,550 it's got to be there-- 758 00:37:49,550 --> 00:37:54,170 the second dot * means zero or more other characters, 759 00:37:54,170 --> 00:37:55,380 and same on the right. 760 00:37:55,380 --> 00:37:57,950 So it's just another way of expressing the same idea. 761 00:37:57,950 --> 00:38:01,970 "One or more" can be represented like this with dot dot *, 762 00:38:01,970 --> 00:38:06,840 or you can just use the handier syntax of dot +, which means the same thing. 763 00:38:06,840 --> 00:38:07,340 All right.
764 00:38:07,340 --> 00:38:10,507 So I daresay there's still some problems with the regular expression in this 765 00:38:10,507 --> 00:38:13,790 current form, because even though now we're starting to look for the user 766 00:38:13,790 --> 00:38:16,010 name at the beginning of the string from the user, 767 00:38:16,010 --> 00:38:20,390 and we're looking for the .edu literally at the end of the string from the user, 768 00:38:20,390 --> 00:38:23,780 those dots are a little too encompassing right now. 769 00:38:23,780 --> 00:38:26,450 I'm allowed to type in more than the single @ sign. 770 00:38:26,450 --> 00:38:27,020 Why? 771 00:38:27,020 --> 00:38:30,720 Because @ is a character, and dot means any character. 772 00:38:30,720 --> 00:38:34,650 So honestly, I can have as many @ signs in this thing at the moment as I want. 773 00:38:34,650 --> 00:38:37,280 For instance, if I run python of validate.py, 774 00:38:37,280 --> 00:38:40,500 malan@harvard.edu, still works as expected. 775 00:38:40,500 --> 00:38:44,270 But if I also run python of validate.py and incorrectly do 776 00:38:44,270 --> 00:38:51,030 malan@@@harvard.edu, should be invalid, but it's considered valid instead. 777 00:38:51,030 --> 00:38:55,670 So I think we need to be a little more restrictive when it comes to that dot. 778 00:38:55,670 --> 00:38:59,180 And we can't just say, oh, any old character there is fine. 779 00:38:59,180 --> 00:39:00,950 We need to be more specific. 780 00:39:00,950 --> 00:39:05,390 Well, it turns out that regular expressions also support this syntax. 781 00:39:05,390 --> 00:39:08,990 You can use square brackets inside of your pattern, 782 00:39:08,990 --> 00:39:14,210 and inside of those square brackets include one or more characters 783 00:39:14,210 --> 00:39:17,000 that you want to look for specifically. 
784 00:39:17,000 --> 00:39:20,510 Alternatively, you can inside of those square brackets 785 00:39:20,510 --> 00:39:23,660 put a caret symbol, which unfortunately in this context, 786 00:39:23,660 --> 00:39:27,150 means something completely different from "match the start of the string." 787 00:39:27,150 --> 00:39:30,870 But this would be the complement operator inside of the square brackets, 788 00:39:30,870 --> 00:39:34,320 which means "you cannot match any of these characters." 789 00:39:34,320 --> 00:39:36,980 So things are about to look even more cryptic now. 790 00:39:36,980 --> 00:39:41,000 But that's why we're focusing on regular expressions on their own here. 791 00:39:41,000 --> 00:39:46,850 If I don't want to allow any character, which is what a dot is, let me go ahead 792 00:39:46,850 --> 00:39:52,610 and I could just say, well, I only want to support A, or Bs, or Cs, or Ds, 793 00:39:52,610 --> 00:39:54,200 or Es, or Fs, or Gs. 794 00:39:54,200 --> 00:39:56,750 I could type in the whole alphabet here plus some numbers 795 00:39:56,750 --> 00:40:00,110 to actually include all of the letters that I do want to allow. 796 00:40:00,110 --> 00:40:02,570 But honestly, a little simpler would be this. 797 00:40:02,570 --> 00:40:09,020 I could use a ^ symbol and then an @ sign, which has the effect of saying, 798 00:40:09,020 --> 00:40:14,270 this is the set of characters that has everything except an @ sign. 799 00:40:14,270 --> 00:40:16,130 And I can do the same thing over here. 800 00:40:16,130 --> 00:40:23,270 Instead of a dot to the right of the @ sign, I can do open bracket ^, @ sign. 801 00:40:23,270 --> 00:40:26,390 And I admit, things are starting to escalate quickly here, 802 00:40:26,390 --> 00:40:28,940 but let's start from the left and go to the right. 803 00:40:28,940 --> 00:40:33,020 This ^ outside of the square brackets at the very start of my string, 804 00:40:33,020 --> 00:40:35,810 as before, means "match from the start of the string." 
805 00:40:35,810 --> 00:40:36,890 And let's jump ahead. 806 00:40:36,890 --> 00:40:40,580 The $ sign all the way at the end of the regular expression means "match 807 00:40:40,580 --> 00:40:42,180 at the end of the string." 808 00:40:42,180 --> 00:40:45,290 So if we can mentally tick those off as straightforward, let's 809 00:40:45,290 --> 00:40:47,630 now focus on everything else in the middle. 810 00:40:47,630 --> 00:40:50,510 Well, to the left here we have new syntax-- 811 00:40:50,510 --> 00:40:56,840 a square bracket, another ^, an @ sign, and a closed square bracket, and then 812 00:40:56,840 --> 00:40:57,560 a +. 813 00:40:57,560 --> 00:40:59,780 The + means the same thing as always. 814 00:40:59,780 --> 00:41:03,110 It means "one or more of the things to the left." 815 00:41:03,110 --> 00:41:04,830 What is the thing to the left? 816 00:41:04,830 --> 00:41:06,650 Well, this is the new syntax. 817 00:41:06,650 --> 00:41:10,880 Inside of square brackets here, I have a ^ symbol and then an @ sign. 818 00:41:10,880 --> 00:41:14,990 That just means any character except an @ sign. 819 00:41:14,990 --> 00:41:18,890 It's a weird syntax, but this is how we can express that simple idea-- 820 00:41:18,890 --> 00:41:23,022 any character on the keyboard except for an @ sign. 821 00:41:23,022 --> 00:41:25,980 And heck, even other characters that aren't physically on your keyboard 822 00:41:25,980 --> 00:41:28,020 but that nonetheless exist. 823 00:41:28,020 --> 00:41:32,120 Then we have a literal @ sign, then we have another one of these same things-- 824 00:41:32,120 --> 00:41:36,950 square bracket, ^@ closed bracket, which means any character except an @ sign, 825 00:41:36,950 --> 00:41:42,710 then one or more of those things, followed by literally a period edu. 826 00:41:42,710 --> 00:41:45,960 So now let me go ahead and do this again. 
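Read left to right as described, the pattern can be tried out in a few lines of Python. This is a sketch under the assumption that the earlier version was `^.+@.+\.edu$`; the variable names are illustrative.

```python
import re

any_char = r"^.+@.+\.edu$"     # earlier version: the dot also accepts extra @ signs
no_at = r"^[^@]+@[^@]+\.edu$"  # complemented set: any character except @

verdicts = {}
for email in ["malan@harvard.edu", "malan@@@harvard.edu"]:
    old = "valid" if re.search(any_char, email) else "invalid"
    new = "valid" if re.search(no_at, email) else "invalid"
    verdicts[email] = (old, new)
    print(f"{email}: dot version says {old}, [^@] version says {new}")
```

Only the second version rejects the address with three @ signs, because `[^@]` can never consume an @ on either side of the one literal @.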
827 00:41:45,960 --> 00:41:49,280 Let me rerun python of validate.py and test my own email address 828 00:41:49,280 --> 00:41:51,595 to make sure I've not made things worse. 829 00:41:51,595 --> 00:41:52,220 And we're good. 830 00:41:52,220 --> 00:41:55,250 Now let me go ahead and clear my screen and run python of validate.py 831 00:41:55,250 --> 00:42:00,750 again and do malan@@@harvard.edu, crossing my fingers this time. 832 00:42:00,750 --> 00:42:03,020 And finally, this now is invalid. 833 00:42:03,020 --> 00:42:03,830 Why? 834 00:42:03,830 --> 00:42:08,600 I'm allowing myself to have one @ sign in the middle of the user's input, 835 00:42:08,600 --> 00:42:13,220 but everything to the left per this new syntax cannot be an @ sign. 836 00:42:13,220 --> 00:42:15,950 It can be anything but one or more times. 837 00:42:15,950 --> 00:42:20,570 And everything to the right of the @ sign can be anything but an @ sign one 838 00:42:20,570 --> 00:42:25,430 or more times followed by, lastly, a literal .edu. 839 00:42:25,430 --> 00:42:27,590 So again, the new syntax is quite simply this-- 840 00:42:27,590 --> 00:42:31,985 square brackets allow you to specify a set of characters that you literally 841 00:42:31,985 --> 00:42:33,110 type out at your keyboard-- 842 00:42:33,110 --> 00:42:36,410 A, B, C, D, E, F, or the complement, the opposite, 843 00:42:36,410 --> 00:42:40,550 the ^ symbol, which means "not," and then the one or more symbols you 844 00:42:40,550 --> 00:42:42,520 want to exclude. 845 00:42:42,520 --> 00:42:45,230 Questions now on this syntax here? 846 00:42:45,230 --> 00:42:49,450 AUDIENCE: So right after @ sign, can we use the curly brackets m one 847 00:42:49,450 --> 00:42:52,770 so that we can only have one repetition of the @ symbol? 848 00:42:52,770 --> 00:42:53,770 DAVID MALAN: Absolutely. 849 00:42:53,770 --> 00:42:54,800 So we could do this. 850 00:42:54,800 --> 00:42:56,680 Let me go ahead and pull up VS Code. 
851 00:42:56,680 --> 00:42:59,680 And let me delete the current form of a regular expression 852 00:42:59,680 --> 00:43:03,580 and go back to where we began, which was just dot * @ and dot *. 853 00:43:03,580 --> 00:43:06,130 I could absolutely do something like this 854 00:43:06,130 --> 00:43:10,480 and require that I want at least one of any character here. 855 00:43:10,480 --> 00:43:13,760 And then I could do something more to have any more as well. 856 00:43:13,760 --> 00:43:16,710 So the curly brace syntax, which we saw on the slide earlier 857 00:43:16,710 --> 00:43:18,460 but didn't yet use, absolutely can be used 858 00:43:18,460 --> 00:43:21,400 to specify a specific number of characters. 859 00:43:21,400 --> 00:43:24,160 But honestly, this is more verbose than is necessary. 860 00:43:24,160 --> 00:43:27,130 The best solution, arguably, or the simplest, at least, 861 00:43:27,130 --> 00:43:29,500 ultimately, is just to say dot +. 862 00:43:29,500 --> 00:43:32,650 But there, too, another example of how you can solve the same problem 863 00:43:32,650 --> 00:43:34,010 multiple ways. 864 00:43:34,010 --> 00:43:36,340 Let me go back to where the regular expression just was 865 00:43:36,340 --> 00:43:39,170 and take other questions as well. 866 00:43:39,170 --> 00:43:44,790 Questions on the sets of characters or complementing that set? 867 00:43:44,790 --> 00:43:47,370 AUDIENCE: So can you use that same syntax 868 00:43:47,370 --> 00:43:51,780 to say that you don't want a certain character throughout the whole string? 869 00:43:51,780 --> 00:43:52,740 DAVID MALAN: You could. 870 00:43:52,740 --> 00:43:54,600 It's going to be-- 871 00:43:54,600 --> 00:43:58,530 you could absolutely use the same character to exclude-- 872 00:43:58,530 --> 00:44:01,830 you could absolutely use this syntax to exclude a certain character 873 00:44:01,830 --> 00:44:03,210 from the entire string. 
874 00:44:03,210 --> 00:44:05,130 But it would be a little harder right now, 875 00:44:05,130 --> 00:44:07,530 because we're still requiring .edu at the end. 876 00:44:07,530 --> 00:44:10,770 But yes, absolutely. 877 00:44:10,770 --> 00:44:12,220 Other questions? 878 00:44:12,220 --> 00:44:16,620 AUDIENCE: What happens if the user inputs .edu in the beginning 879 00:44:16,620 --> 00:44:17,632 of the string? 880 00:44:17,632 --> 00:44:18,840 DAVID MALAN: A good question. 881 00:44:18,840 --> 00:44:22,000 What happens if the user types in .edu at the beginning of the string? 882 00:44:22,000 --> 00:44:23,577 Well, let me go back to VS Code here. 883 00:44:23,577 --> 00:44:25,660 And let's try to solve this in two different ways. 884 00:44:25,660 --> 00:44:27,452 First, let's look at the regular expression 885 00:44:27,452 --> 00:44:31,080 and see if we can infer if that's going to be tolerated. 886 00:44:31,080 --> 00:44:34,950 Well, according to the current cryptic regular expression, 887 00:44:34,950 --> 00:44:38,730 I'm saying that you can have any character except the @ sign. 888 00:44:38,730 --> 00:44:41,910 So that would work. I could have the dot for the .edu. 889 00:44:41,910 --> 00:44:44,490 But then I have to have an @ sign. 890 00:44:44,490 --> 00:44:48,940 So that wouldn't really work, because if I'm just typing in .edu, 891 00:44:48,940 --> 00:44:51,010 we're not going to pass that constraint. 892 00:44:51,010 --> 00:44:53,710 So now let me try this by running the program. 893 00:44:53,710 --> 00:44:55,810 Let me type in just literally .edu. 894 00:44:55,810 --> 00:44:57,090 That doesn't work. 895 00:44:57,090 --> 00:45:02,505 But, but, but I could do this, .edu@.edu. 896 00:45:02,505 --> 00:45:04,140 That, too, is invalid. 897 00:45:04,140 --> 00:45:07,581 But let me do this, .edu@something.edu. 898 00:45:07,581 --> 00:45:10,365 899 00:45:10,365 --> 00:45:11,490 That passes. 900 00:45:11,490 --> 00:45:13,470 So it's starting to get a little weird now.
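The three inputs just tried can be reproduced with a short sketch. The function name `check` is hypothetical, and the pattern is assumed to be the `[^@]` version discussed above.

```python
import re

pattern = r"^[^@]+@[^@]+\.edu$"


def check(email):
    """Return "valid" or "invalid" for the assumed pattern (hypothetical helper)."""
    return "valid" if re.search(pattern, email) else "invalid"


print(check(".edu"))                # invalid: there is no @ sign at all
print(check(".edu@.edu"))           # invalid: nothing sits between the @ and ".edu"
print(check(".edu@something.edu"))  # valid: ".edu" is treated as the username
```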
901 00:45:13,470 --> 00:45:15,030 Maybe it's valid, maybe it's not. 902 00:45:15,030 --> 00:45:18,120 But I think we'll eventually be more precise, too. 903 00:45:18,120 --> 00:45:21,570 How about one more question on this regular expression 904 00:45:21,570 --> 00:45:23,310 and these complementing of sets? 905 00:45:23,310 --> 00:45:27,765 AUDIENCE: Can we use another domain name, the string input? 906 00:45:27,765 --> 00:45:29,640 DAVID MALAN: Can you use another domain name? 907 00:45:29,640 --> 00:45:30,240 Absolutely. 908 00:45:30,240 --> 00:45:32,460 I'm using my own just for the sake of demonstration. 909 00:45:32,460 --> 00:45:35,970 But you could absolutely use any domain or top-level domain. 910 00:45:35,970 --> 00:45:38,520 And I'm using .edu, which is very US centric. 911 00:45:38,520 --> 00:45:43,330 But this would absolutely work exactly the same for any top-level domain. 912 00:45:43,330 --> 00:45:43,830 All right. 913 00:45:43,830 --> 00:45:47,700 Let me go ahead now and propose that we improve this regular expression 914 00:45:47,700 --> 00:45:50,880 further, because if I pull it up again in VS Code here, 915 00:45:50,880 --> 00:45:53,790 you'll see that I'm being a little too tolerant still. 916 00:45:53,790 --> 00:45:58,140 It turns out that there are certain requirements for someone's username 917 00:45:58,140 --> 00:46:00,240 and domain name in an email address. 918 00:46:00,240 --> 00:46:03,840 There is an official standard in the world for what an email address can be 919 00:46:03,840 --> 00:46:05,670 and what characters can be in it. 920 00:46:05,670 --> 00:46:09,480 And this is way too accommodating of all the characters 921 00:46:09,480 --> 00:46:11,710 in the world except for the @ symbol. 922 00:46:11,710 --> 00:46:14,190 So let's actually narrow the definition of what 923 00:46:14,190 --> 00:46:16,110 we're going to tolerate in usernames. 924 00:46:16,110 --> 00:46:19,200 And companies like Gmail could certainly do this as well. 
925 00:46:19,200 --> 00:46:22,200 Suppose that it's not just that I want to exclude @ sign. 926 00:46:22,200 --> 00:46:25,470 Suppose that I only want to allow for, say, 927 00:46:25,470 --> 00:46:27,600 characters that normally appear in words, 928 00:46:27,600 --> 00:46:31,500 like letters of the alphabet, A through z, be it uppercase or lowercase, 929 00:46:31,500 --> 00:46:35,520 maybe some numbers, and heck, maybe even an underscore could be allowed, too. 930 00:46:35,520 --> 00:46:38,550 Well, we can use this same square bracket syntax 931 00:46:38,550 --> 00:46:41,340 to specify a set of characters as follows. 932 00:46:41,340 --> 00:46:44,860 I could do abcdefghij-- 933 00:46:44,860 --> 00:46:45,360 oh, my god. 934 00:46:45,360 --> 00:46:46,290 This is going to take forever. 935 00:46:46,290 --> 00:46:49,140 I'm going to have to type out all 26 letters of the alphabet, 936 00:46:49,140 --> 00:46:50,940 both lowercase and uppercase. 937 00:46:50,940 --> 00:46:52,260 So let me stop doing that. 938 00:46:52,260 --> 00:46:53,700 There's a better way already. 939 00:46:53,700 --> 00:46:58,180 If you want to specify within these square brackets a range of letters, 940 00:46:58,180 --> 00:47:00,550 you can actually just do a hyphen. 941 00:47:00,550 --> 00:47:04,920 If you literally do a-z in these square brackets, 942 00:47:04,920 --> 00:47:07,470 the computer is going to know you mean a through z. 943 00:47:07,470 --> 00:47:10,620 You do not need to type 26 letters of the alphabet. 944 00:47:10,620 --> 00:47:14,190 If you want to include uppercase letters as well, you just do the same. 945 00:47:14,190 --> 00:47:19,440 No spaces, no commas, you literally just keep typing a through capital Z. 946 00:47:19,440 --> 00:47:23,880 So I have little a hyphen little z, big A hyphen 947 00:47:23,880 --> 00:47:26,640 big Z. No spaces, no commas, no separators. 948 00:47:26,640 --> 00:47:28,830 You just keep specifying those ranges. 
949 00:47:28,830 --> 00:47:32,350 If I additionally want numbers, I could do 01234-- 950 00:47:32,350 --> 00:47:32,850 nope. 951 00:47:32,850 --> 00:47:35,070 You don't need to type in all 10 decimal digits. 952 00:47:35,070 --> 00:47:39,070 You can just say 0 through 9 using a hyphen as well. 953 00:47:39,070 --> 00:47:41,280 And if you now want to support underscores 954 00:47:41,280 --> 00:47:44,280 as well, which is pretty common in usernames for email addresses, 955 00:47:44,280 --> 00:47:48,160 you can literally just type an underscore at the end. 956 00:47:48,160 --> 00:47:51,180 Notice that all of these characters are inside 957 00:47:51,180 --> 00:47:55,860 of square brackets, which just again, means here is a set of characters 958 00:47:55,860 --> 00:47:57,180 that I want to allow. 959 00:47:57,180 --> 00:48:02,100 I have not used a ^ symbol at the beginning of this whole thing, 960 00:48:02,100 --> 00:48:05,370 because I don't want to complement it-- complement it with an E, 961 00:48:05,370 --> 00:48:07,230 not compliment it with an I-- 962 00:48:07,230 --> 00:48:09,940 I don't want to complement it by making it the opposite. 963 00:48:09,940 --> 00:48:13,225 I literally want to accept only these characters. 964 00:48:13,225 --> 00:48:15,600 I'm going to go ahead and do the same thing on the right. 965 00:48:15,600 --> 00:48:19,530 If I want to require that the domain name similarly 966 00:48:19,530 --> 00:48:22,800 come from this set of characters, which admittedly is a little too narrow, 967 00:48:22,800 --> 00:48:25,210 but it's familiar for now so we'll keep it simple, 968 00:48:25,210 --> 00:48:29,490 I'm going to go ahead and paste that exact same set of characters over there 969 00:48:29,490 --> 00:48:30,490 to the right. 970 00:48:30,490 --> 00:48:33,600 And so now, it's much more restrictive. 971 00:48:33,600 --> 00:48:36,660 Now I'm going to go ahead and run python of validate.py. 
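As a rough sketch of the set built up so far, with ranges for lowercase letters, uppercase letters, digits, and a trailing underscore pasted on both sides of the @, the following Python tries a few inputs. The variable names and the exact sample list are assumptions for illustration.

```python
import re

# a-z, A-Z, 0-9 are ranges; the final underscore is a literal character.
pattern = r"^[a-zA-Z0-9_]+@[a-zA-Z0-9_]+\.edu$"

outcomes = {}
for email in [
    "malan@harvard.edu",
    "david_malan@harvard.edu",
    "malan@@@harvard.edu",
]:
    outcomes[email] = re.search(pattern, email) is not None
    print(email, "->", "valid" if outcomes[email] else "invalid")
```

Because there is no ^ right after the opening bracket, the set lists what is allowed rather than what is excluded.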
972 00:48:36,660 --> 00:48:39,420 I'm going to test my own email address, and we're still good. 973 00:48:39,420 --> 00:48:42,180 I'm going to clear my screen and run it once more, 974 00:48:42,180 --> 00:48:44,520 this time trying to break it. 975 00:48:44,520 --> 00:48:51,270 Let me go ahead and do something like, how about, david_malan@harvard.edu, 976 00:48:51,270 --> 00:48:54,790 Enter, but that, too, is going to be valid. 977 00:48:54,790 --> 00:48:57,330 But if I do something completely wrong again, 978 00:48:57,330 --> 00:49:02,790 like malan@@@harvard.edu, that's still going to be invalid. 979 00:49:02,790 --> 00:49:03,330 Why? 980 00:49:03,330 --> 00:49:06,090 Because my regular expression currently only allows 981 00:49:06,090 --> 00:49:09,480 for a single @ in the middle, because everything to the left 982 00:49:09,480 --> 00:49:11,530 must be alphanumeric-- 983 00:49:11,530 --> 00:49:14,420 alphabetical or numeric-- or an underscore, 984 00:49:14,420 --> 00:49:18,301 the same thing to the right, followed by the .edu. 985 00:49:18,301 --> 00:49:20,770 Now honestly, this is a regular expression 986 00:49:20,770 --> 00:49:23,890 that you might be in the habit of typing in the real world. 987 00:49:23,890 --> 00:49:27,860 As cryptic as this might look, this is the world of regular expressions. 988 00:49:27,860 --> 00:49:30,560 So you'll get more comfortable with this syntax over time. 989 00:49:30,560 --> 00:49:32,890 But thankfully, some of these patterns are 990 00:49:32,890 --> 00:49:36,910 so common that there are built-in shortcuts for representing 991 00:49:36,910 --> 00:49:38,680 some of the same information. 992 00:49:38,680 --> 00:49:42,373 That is to say, you don't have to constantly type out all of the symbols 993 00:49:42,373 --> 00:49:45,040 that you want to include, because odds are some other programmer 994 00:49:45,040 --> 00:49:46,280 has had the same problem. 
995 00:49:46,280 --> 00:49:49,030 So built into regular expressions themselves 996 00:49:49,030 --> 00:49:51,250 are some additional patterns you can use. 997 00:49:51,250 --> 00:49:56,170 And in fact, I can go ahead and get rid of this entire set, a through z 998 00:49:56,170 --> 00:49:59,830 lowercase, A through Z uppercase, 0 through 9 and an underscore, 999 00:49:59,830 --> 00:50:03,640 and just replace it with a single backslash w. 1000 00:50:03,640 --> 00:50:07,210 Backslash w in this case represents a "word character," 1001 00:50:07,210 --> 00:50:13,330 which is commonly known as an alphanumeric symbol or the underscore 1002 00:50:13,330 --> 00:50:14,052 as well. 1003 00:50:14,052 --> 00:50:15,760 I'm going to do the same thing over here. 1004 00:50:15,760 --> 00:50:18,310 I'm going to highlight the entire set of square brackets, 1005 00:50:18,310 --> 00:50:21,430 delete it, and replace it with a single backslash w. 1006 00:50:21,430 --> 00:50:23,720 And now I feel like we're making progress, 1007 00:50:23,720 --> 00:50:25,720 because even though it's cryptic, and would have 1008 00:50:25,720 --> 00:50:29,320 looked way cryptic a little bit ago-- 1009 00:50:29,320 --> 00:50:32,680 and even though it would have looked even more cryptic a little bit ago, now 1010 00:50:32,680 --> 00:50:35,470 it's at least starting to read a little more friendly. 1011 00:50:35,470 --> 00:50:39,160 This ^ on the left means "start matching at the beginning of the string." 1012 00:50:39,160 --> 00:50:42,100 Backslash w means "any word character." 1013 00:50:42,100 --> 00:50:44,140 The + means "one or more." 1014 00:50:44,140 --> 00:50:45,370 @ symbol literally. 1015 00:50:45,370 --> 00:50:49,720 Then another word character, one or more, then a literal dot, then 1016 00:50:49,720 --> 00:50:54,200 literally edu, and then match at the very end of the string, and that's it. 1017 00:50:54,200 --> 00:50:55,660 So there's more of these, too.
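Here is a minimal sketch comparing the explicit set with the backslash w shorthand. The variable names and sample addresses are illustrative. One caveat worth stating as a fact about Python rather than about the lecture: on ordinary `str` patterns in Python 3, `\w` also matches non-ASCII letters and digits unless the `re.ASCII` flag is passed, so the two patterns agree on ASCII input but are not identical in general.

```python
import re

explicit = r"^[a-zA-Z0-9_]+@[a-zA-Z0-9_]+\.edu$"
shorthand = r"^\w+@\w+\.edu$"

agreement = {}
for email in [
    "malan@harvard.edu",
    "david_malan@harvard.edu",
    "malan@@@harvard.edu",
    "david.malan@harvard.edu",  # the dot is not a word character
]:
    a = re.search(explicit, email) is not None
    b = re.search(shorthand, email) is not None
    agreement[email] = (a, b)
    print(email, a, b)

# Caveat: \w is Unicode-aware by default for str patterns.
unicode_word = re.search(r"^\w+$", "caf\u00e9") is not None          # True
ascii_word = re.search(r"^\w+$", "caf\u00e9", re.ASCII) is not None  # False
print(unicode_word, ascii_word)
```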
1018 00:50:55,660 --> 00:50:57,910 And we won't use them all here, but here is 1019 00:50:57,910 --> 00:51:02,950 a partial list of the patterns you can use within a regular expression. 1020 00:51:02,950 --> 00:51:07,060 One, you have backslash d for any decimal digit, "decimal digit" meaning 1021 00:51:07,060 --> 00:51:08,590 0 through 9. 1022 00:51:08,590 --> 00:51:12,550 Commonly done here, too, is if you want to do the opposite of that, 1023 00:51:12,550 --> 00:51:17,020 the complement, so to speak, you can do backslash capital D, which 1024 00:51:17,020 --> 00:51:19,480 is anything that's not a decimal digit. 1025 00:51:19,480 --> 00:51:23,990 So it might be letters, and punctuation, and other symbols as well. 1026 00:51:23,990 --> 00:51:27,280 Meanwhile, backslash s means whitespace characters, 1027 00:51:27,280 --> 00:51:30,490 like a single hit of the space, or maybe hitting Tab on the keyboard. 1028 00:51:30,490 --> 00:51:31,720 That's whitespace. 1029 00:51:31,720 --> 00:51:35,110 Backslash capital S is the opposite or complement 1030 00:51:35,110 --> 00:51:38,080 of that-- anything that's not a whitespace character. 1031 00:51:38,080 --> 00:51:41,680 Backslash w, we've seen, a word character, as well as 1032 00:51:41,680 --> 00:51:43,390 numbers and the underscore. 1033 00:51:43,390 --> 00:51:45,970 And if you want the complement or opposite of that, 1034 00:51:45,970 --> 00:51:50,950 you can use backslash capital W to give you everything but a word character. 1035 00:51:50,950 --> 00:51:54,130 Again, these are just common patterns that so many people were presumably 1036 00:51:54,130 --> 00:51:58,520 using in yesteryear that it's now baked into the regular expression syntax 1037 00:51:58,520 --> 00:52:02,710 so that you can more succinctly express your same ideas. 1038 00:52:02,710 --> 00:52:05,320 Any questions, then, on this approach here, 1039 00:52:05,320 --> 00:52:12,340 where we're now using backslash w to represent my word character? 
1040 00:52:12,340 --> 00:52:14,230 AUDIENCE: So what I want to ask about was 1041 00:52:14,230 --> 00:52:17,590 the-- actually the previous approach, like the square bracket approach. 1042 00:52:17,590 --> 00:52:19,792 Could we accept lists in there? 1043 00:52:19,792 --> 00:52:20,500 DAVID MALAN: Yes. 1044 00:52:20,500 --> 00:52:21,730 We'll see this before long. 1045 00:52:21,730 --> 00:52:27,460 But suppose you wanted to tolerate not just .edu, but maybe .edu, or .com, 1046 00:52:27,460 --> 00:52:28,450 you could do this. 1047 00:52:28,450 --> 00:52:32,500 You could introduce parentheses, and then you can or those together. 1048 00:52:32,500 --> 00:52:35,470 I could say com or edu. 1049 00:52:35,470 --> 00:52:40,180 Could also add in something like in the US, or gov, or net, 1050 00:52:40,180 --> 00:52:42,670 or anything else, or org, or the like. 1051 00:52:42,670 --> 00:52:45,190 And each of the vertical bars here means something special. 1052 00:52:45,190 --> 00:52:46,180 It means "or." 1053 00:52:46,180 --> 00:52:48,610 And the parentheses simply group things together. 1054 00:52:48,610 --> 00:52:50,920 Formally, you have this syntax here-- 1055 00:52:50,920 --> 00:52:56,530 A or B, A or vertical bar B, means "A has to match or B has to match," 1056 00:52:56,530 --> 00:52:59,080 where A and B can be any other patterns you want. 1057 00:52:59,080 --> 00:53:01,520 In parentheses, you can group those things together. 1058 00:53:01,520 --> 00:53:05,710 So just like math, you can combine ideas into one phrase 1059 00:53:05,710 --> 00:53:07,600 and do this thing or the other. 1060 00:53:07,600 --> 00:53:09,970 And there's other syntax as well that we'll soon see. 1061 00:53:09,970 --> 00:53:14,750 Other questions on these regular expressions and this syntax here? 1062 00:53:14,750 --> 00:53:16,990 AUDIENCE: What if we put spaces in the expression? 1063 00:53:16,990 --> 00:53:17,740 DAVID MALAN: Sure. 
1064 00:53:17,740 --> 00:53:21,910 So if you want spaces in there, you can't use backslash w alone, 1065 00:53:21,910 --> 00:53:25,690 because that is only a word character which is alphabetical, numerical, 1066 00:53:25,690 --> 00:53:27,100 or the underscore. 1067 00:53:27,100 --> 00:53:28,580 But you could do this. 1068 00:53:28,580 --> 00:53:32,170 You could go back to this approach whereby you use square brackets. 1069 00:53:32,170 --> 00:53:37,120 And you could say a through z, or A through Z, or 0 through 9, 1070 00:53:37,120 --> 00:53:40,693 or underscore, or I'm going to hit the space bar, a single space. 1071 00:53:40,693 --> 00:53:43,360 You can put a literal space inside of the square brackets, which 1072 00:53:43,360 --> 00:53:45,700 will allow you then to detect a space. 1073 00:53:45,700 --> 00:53:49,420 Alternatively, I could still use backslash w, 1074 00:53:49,420 --> 00:53:51,280 but I could combine it as follows. 1075 00:53:51,280 --> 00:53:54,700 I could say, give me a backslash w or a backslash s, 1076 00:53:54,700 --> 00:53:57,287 because recall that backslash s is whitespace. 1077 00:53:57,287 --> 00:53:58,870 So it's even more than a single space. 1078 00:53:58,870 --> 00:53:59,770 It could be a tab. 1079 00:53:59,770 --> 00:54:02,140 But by putting those things in parentheses, now 1080 00:54:02,140 --> 00:54:04,060 you can match either the thing on the left 1081 00:54:04,060 --> 00:54:07,400 or the thing on the right one or more times. 1082 00:54:07,400 --> 00:54:12,290 How about one other question on these regular expressions? 1083 00:54:12,290 --> 00:54:13,040 AUDIENCE: Perfect. 1084 00:54:13,040 --> 00:54:19,070 So I was going to ask, does the backslash w include a dot? 1085 00:54:19,070 --> 00:54:20,730 Because-- no, OK. 1086 00:54:20,730 --> 00:54:24,230 DAVID MALAN: No, it only includes letters, numbers, and underscore. 1087 00:54:24,230 --> 00:54:25,387 That is it.
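The grouping ideas from the last few answers, alternation with a vertical bar inside parentheses and two ways of admitting spaces, can be sketched as follows. The patterns are illustrative assumptions rather than the lecture's final code; in particular, the list of top-level domains is just the handful mentioned aloud.

```python
import re

# Alternation: any one of these top-level domains may follow the literal dot.
tld = r"^\w+@\w+\.(com|edu|gov|net|org)$"
edu_ok = re.search(tld, "malan@harvard.edu") is not None  # True
com_ok = re.search(tld, "malan@example.com") is not None  # True
xyz_ok = re.search(tld, "malan@example.xyz") is not None  # False
print(edu_ok, com_ok, xyz_ok)

# Two ways to allow spaces among word characters.
set_with_space = r"^[a-zA-Z0-9_ ]+$"  # a literal space inside the set
word_or_space = r"^(\w|\s)+$"         # \s also covers tabs and newlines

space_checks = {}
for text in ["hello world", "hello\tworld", "hello.world"]:
    space_checks[text] = (
        re.search(set_with_space, text) is not None,
        re.search(word_or_space, text) is not None,
    )
    print(repr(text), space_checks[text])
```

The tab-separated sample shows the difference between the two approaches: a literal space in the set accepts only the space character, while backslash s accepts any whitespace. Neither accepts the dot.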
1088 00:54:25,387 --> 00:54:27,470 AUDIENCE: And I was wondering, you gave an example 1089 00:54:27,470 --> 00:54:33,140 at the beginning that had spaces, like this is my email, so-and-so. 1090 00:54:33,140 --> 00:54:35,420 I don't think our current version-- 1091 00:54:35,420 --> 00:54:39,110 or even quite a long while ago stopped accepting it. 1092 00:54:39,110 --> 00:54:43,915 Was that because of the ^ or because of something else? 1093 00:54:43,915 --> 00:54:47,960 DAVID MALAN: No, the reason I was handling spaces in other English words 1094 00:54:47,960 --> 00:54:51,425 when I typed out my email address as malan@harvard.edu 1095 00:54:51,425 --> 00:54:57,380 was because we were using initially dot *, or dot +, which is any character. 1096 00:54:57,380 --> 00:55:01,340 And even after that, we said anything except the @ sign, 1097 00:55:01,340 --> 00:55:02,870 which includes spaces. 1098 00:55:02,870 --> 00:55:08,000 Only once I started using square brackets and a through z and 0 1099 00:55:08,000 --> 00:55:11,210 through 9 and underscore did we finally get to the point 1100 00:55:11,210 --> 00:55:13,040 where we would reject white space. 1101 00:55:13,040 --> 00:55:14,970 And in fact, I can run this here. 1102 00:55:14,970 --> 00:55:18,980 Let me go into the current version of my code in VS Code, which is using, again, 1103 00:55:18,980 --> 00:55:21,620 the backslash w's for word characters, let 1104 00:55:21,620 --> 00:55:24,860 me run python of validate.py and incorrectly type in something 1105 00:55:24,860 --> 00:55:30,020 like "my email address is malan@harvard.edu," period, which 1106 00:55:30,020 --> 00:55:34,250 has spaces to the left of my username, and that is now invalid, 1107 00:55:34,250 --> 00:55:36,590 because space is not a word character. 1108 00:55:36,590 --> 00:55:39,860 You're going to notice, too, that technically I'm not allowing dots. 1109 00:55:39,860 --> 00:55:41,902 And some of you might be thinking, wait a minute. 
1110 00:55:41,902 --> 00:55:43,880 My Gmail address has a dot in it. 1111 00:55:43,880 --> 00:55:46,280 That's something we're going to still have to fix. 1112 00:55:46,280 --> 00:55:49,160 A backslash w is not the end all here. 1113 00:55:49,160 --> 00:55:52,520 It's just allowing us to express our previous solution 1114 00:55:52,520 --> 00:55:54,020 a little more succinctly. 1115 00:55:54,020 --> 00:55:57,260 Now, one thing we're still not handling quite properly 1116 00:55:57,260 --> 00:55:59,180 is uppercase versus lowercase. 1117 00:55:59,180 --> 00:56:03,200 The backslash w technically does handle lowercase letters and uppercase, 1118 00:56:03,200 --> 00:56:06,450 because it's the exact same thing as that set from before, 1119 00:56:06,450 --> 00:56:11,670 which had little a through little z and big A through big Z. But watch this. 1120 00:56:11,670 --> 00:56:14,960 Let me go ahead in my current form run python of validate.py, 1121 00:56:14,960 --> 00:56:19,376 and just because my Caps lock key is down, MALAN@HARVARD.EDU, 1122 00:56:19,376 --> 00:56:21,080 shouting my email address. 1123 00:56:21,080 --> 00:56:23,640 It's going to be OK in terms of the MALAN. 1124 00:56:23,640 --> 00:56:25,940 It's going to be OK in terms of the HARVARD, 1125 00:56:25,940 --> 00:56:28,790 because those are matching the backslash w, which 1126 00:56:28,790 --> 00:56:31,490 does include lowercase and uppercase. 1127 00:56:31,490 --> 00:56:34,310 But I'm about to see invalid. 1128 00:56:34,310 --> 00:56:35,210 Why? 1129 00:56:35,210 --> 00:56:41,670 Why is MALAN@HARVARD.EDU invalid when it's in all caps here, 1130 00:56:41,670 --> 00:56:44,195 even though I'm using backslash w? 1131 00:56:44,195 --> 00:56:44,820 AUDIENCE: Yeah. 1132 00:56:44,820 --> 00:56:50,010 So you are asking for the domain.edu in lowercase, 1133 00:56:50,010 --> 00:56:52,105 and you're typing it in uppercase. 1134 00:56:52,105 --> 00:56:52,980 DAVID MALAN: Exactly. 
1135 00:56:52,980 --> 00:56:55,980 I'm typing in my email address in all uppercase, 1136 00:56:55,980 --> 00:56:57,892 but I'm looking for literally ".edu." 1137 00:56:57,892 --> 00:57:00,600 And as I see you with AirPods and so many of you with headphones, 1138 00:57:00,600 --> 00:57:03,810 I apologize for yelling into my microphone just now to make this point. 1139 00:57:03,810 --> 00:57:05,770 But let's see if we can't fix that. 1140 00:57:05,770 --> 00:57:11,925 Well, if my pattern on line 5 is expecting it to be lowercase, 1141 00:57:11,925 --> 00:57:13,800 there's actually a few ways I can solve this. 1142 00:57:13,800 --> 00:57:15,840 One would be something we've seen before. 1143 00:57:15,840 --> 00:57:19,050 I could just force the user's input to all lowercase. 1144 00:57:19,050 --> 00:57:23,610 And I could put onto the end of my first line .lower and actually force it all 1145 00:57:23,610 --> 00:57:24,480 to lowercase. 1146 00:57:24,480 --> 00:57:26,880 Alternatively, I could do that a little later. 1147 00:57:26,880 --> 00:57:31,050 Instead of passing an email, I could pass in the lowercase version of email, 1148 00:57:31,050 --> 00:57:33,810 because email addresses should, in fact, be case insensitive. 1149 00:57:33,810 --> 00:57:34,980 So that would work, too. 1150 00:57:34,980 --> 00:57:37,590 But there's another mechanism here, which is worth seeing. 1151 00:57:37,590 --> 00:57:43,890 It turns out that that function before called re.search supports, recall, 1152 00:57:43,890 --> 00:57:46,800 a third argument as well, these so-called flags. 1153 00:57:46,800 --> 00:57:49,170 And flags are configuration options, typically 1154 00:57:49,170 --> 00:57:52,290 to a function, that allow you to configure it a little differently. 
1155 00:57:52,290 --> 00:57:55,290 And how might I go about configuring this call 1156 00:57:55,290 --> 00:57:59,910 to re.search a little bit differently insofar as I'm currently only passing 1157 00:57:59,910 --> 00:58:00,900 in two arguments? 1158 00:58:00,900 --> 00:58:04,650 Well, it turns out that some of the flags you can pass into this function 1159 00:58:04,650 --> 00:58:05,790 are these. 1160 00:58:05,790 --> 00:58:10,110 It turns out that the regular expression library in Python, a.k.a. 1161 00:58:10,110 --> 00:58:14,040 re, comes with a few built-in variables, so to speak, 1162 00:58:14,040 --> 00:58:16,110 things that you can think of as constants, 1163 00:58:16,110 --> 00:58:19,920 that have meaning to re.search. 1164 00:58:19,920 --> 00:58:21,760 And they do so as follows. 1165 00:58:21,760 --> 00:58:26,220 If you pass in as a flag re.IGNORECASE, what re.search is going to do 1166 00:58:26,220 --> 00:58:28,530 is ignore the case of the user's input. 1167 00:58:28,530 --> 00:58:30,880 It can be uppercase, lowercase, a combination thereof, 1168 00:58:30,880 --> 00:58:32,470 the case is going to be ignored. 1169 00:58:32,470 --> 00:58:34,327 It will be treated case insensitively. 1170 00:58:34,327 --> 00:58:36,660 And you can do other things, too, that we won't do here. 1171 00:58:36,660 --> 00:58:40,650 But if you want to handle the user's input that maybe spans multiple lines-- 1172 00:58:40,650 --> 00:58:44,040 maybe they didn't just type in an email address but an entire paragraph 1173 00:58:44,040 --> 00:58:46,410 of text, and you want to match different lines 1174 00:58:46,410 --> 00:58:48,210 of that text that is multiple lines. 1175 00:58:48,210 --> 00:58:52,950 Another flag is for re.MULTILINE for just that, or re.DOTALL, 1176 00:58:52,950 --> 00:58:57,990 whereby you can configure the dot to recognize not just 1177 00:58:57,990 --> 00:59:02,830 any character except newlines but any character plus newlines as well. 
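A minimal sketch of the three flags just mentioned. The email pattern is the one assumed from the narration, and the sample strings are made up for illustration:

```python
import re

PATTERN = r"^\w+@\w+\.edu$"  # assumed pattern, as narrated
shouted = "MALAN@HARVARD.EDU"

# Without a flag, the literal "edu" must be lowercase.
print(bool(re.search(PATTERN, shouted)))                 # False
# With re.IGNORECASE, letters match regardless of case.
print(bool(re.search(PATTERN, shouted, re.IGNORECASE)))  # True

# re.DOTALL lets "." match a newline as well as any other character.
print(bool(re.search(r"^a.b$", "a\nb")))             # False
print(bool(re.search(r"^a.b$", "a\nb", re.DOTALL)))  # True

# re.MULTILINE lets ^ and $ match at the start and end of every line.
text = "first line\nmalan@harvard.edu\nlast line"
print(bool(re.search(PATTERN, text)))                # False
print(bool(re.search(PATTERN, text, re.MULTILINE)))  # True
```

Flags can also be combined with `|`, for example `re.IGNORECASE | re.MULTILINE`.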
1178 00:59:02,830 --> 00:59:05,850 But for now, let me go ahead and just make use of this first one. 1179 00:59:05,850 --> 00:59:13,170 Let me pass in a third argument to re.search, which is re.IGNORECASE. 1180 00:59:13,170 --> 00:59:15,330 Let me now rerun the program without clearing 1181 00:59:15,330 --> 00:59:17,670 my screen, python of validate.py. 1182 00:59:17,670 --> 00:59:20,850 Let me type in again in all caps, effectively shouting, 1183 00:59:20,850 --> 00:59:25,200 MALAN@HARVARD.EDU, Enter, and now it's considered valid, 1184 00:59:25,200 --> 00:59:27,690 because I'm telling re.search specifically 1185 00:59:27,690 --> 00:59:29,460 to ignore the case of the input. 1186 00:59:29,460 --> 00:59:30,960 And that, too, here is fine. 1187 00:59:30,960 --> 00:59:34,500 And why might I do this approach rather than call .lower in one of those other 1188 00:59:34,500 --> 00:59:35,280 locations? 1189 00:59:35,280 --> 00:59:39,000 Eh, if I don't actually want to change the user's input for whatever reason, 1190 00:59:39,000 --> 00:59:43,290 I can still treat it case insensitively without actually changing 1191 00:59:43,290 --> 00:59:46,140 the value of that variable itself. 1192 00:59:46,140 --> 00:59:51,970 All right, any final questions now on this validation of email addresses? 1193 00:59:51,970 --> 00:59:54,600 AUDIENCE: So the pattern is a string, right? 1194 00:59:54,600 --> 00:59:55,800 DAVID MALAN: Mm-hmm. 1195 00:59:55,800 --> 00:59:57,390 AUDIENCE: Can we use an f-string? 1196 00:59:57,390 --> 00:59:58,440 DAVID MALAN: You can. 1197 00:59:58,440 --> 01:00:01,780 Yes, you can use an f-string so that you could plug in, for instance, 1198 01:00:01,780 --> 01:00:04,830 the value of a variable and pass it into the function. 1199 01:00:04,830 --> 01:00:06,000 Other questions on this? 1200 01:00:06,000 --> 01:00:10,342 AUDIENCE: Backslash w character, could we take it as an input from the user? 1201 01:00:10,342 --> 01:00:11,550 DAVID MALAN: Technically yes.
1202 01:00:11,550 --> 01:00:13,440 That's not a problem we're trying to solve right now. 1203 01:00:13,440 --> 01:00:16,530 We want the user to provide literal input, like their email address, 1204 01:00:16,530 --> 01:00:18,750 not necessarily a regular expression. 1205 01:00:18,750 --> 01:00:22,230 But you could imagine building software that asks the user, especially 1206 01:00:22,230 --> 01:00:25,800 if they're more advanced users, to type in a regular expression for some reason 1207 01:00:25,800 --> 01:00:27,722 to validate something else against that. 1208 01:00:27,722 --> 01:00:29,430 And in fact, that's what Google is doing. 1209 01:00:29,430 --> 01:00:33,630 If you play around with Google Forms and create a form with response validation 1210 01:00:33,630 --> 01:00:37,590 and select Regular Expression, Google lets you and I type 1211 01:00:37,590 --> 01:00:41,530 in our own regular expressions, which would be a perfect example of that. 1212 01:00:41,530 --> 01:00:42,030 All right. 1213 01:00:42,030 --> 01:00:45,900 Well, let me propose that we try to solve one other problem here, 1214 01:00:45,900 --> 01:00:51,480 whereby if I go into the same version as before, which is now ignoring case, 1215 01:00:51,480 --> 01:00:54,100 but I type in one of my other email addresses. 1216 01:00:54,100 --> 01:00:56,280 Let me go ahead and run python of validate.py. 1217 01:00:56,280 --> 01:00:59,580 And this time, let me type in not malan@harvard.edu, which 1218 01:00:59,580 --> 01:01:01,920 I use primarily, but another email address 1219 01:01:01,920 --> 01:01:06,030 of mine, malan@cs50.harvard.edu, which forwards to the same. 1220 01:01:06,030 --> 01:01:07,920 Let me go ahead and hit Enter now. 1221 01:01:07,920 --> 01:01:11,940 And huh, invalid, even though I'm pretty sure that 1222 01:01:11,940 --> 01:01:13,380 is, in fact, my email address. 1223 01:01:13,380 --> 01:01:15,920 Well, let's put our finger on the reason why. 
1224 01:01:15,920 --> 01:01:20,400 Why at the moment is malan@cs50.harvard.edu 1225 01:01:20,400 --> 01:01:25,890 being considered invalid, even though I'm pretty sure I send and receive 1226 01:01:25,890 --> 01:01:27,330 email from that address, too? 1227 01:01:27,330 --> 01:01:30,470 1228 01:01:30,470 --> 01:01:32,000 Why might that be? 1229 01:01:32,000 --> 01:01:38,475 AUDIENCE: Because there is a dot that has come after the @ symbol. 1230 01:01:38,475 --> 01:01:39,350 DAVID MALAN: Exactly. 1231 01:01:39,350 --> 01:01:42,230 There's a dot after my cs50. 1232 01:01:42,230 --> 01:01:45,080 And I'm not expecting any dots there, I'm expecting only, 1233 01:01:45,080 --> 01:01:50,240 again, word characters, which is A through z, 0 through 9, and underscore. 1234 01:01:50,240 --> 01:01:52,130 So I'm going to have to retool here. 1235 01:01:52,130 --> 01:01:54,090 But how could I go about doing this? 1236 01:01:54,090 --> 01:01:57,613 Well, it turns out theoretically, there could be other email addresses, 1237 01:01:57,613 --> 01:02:00,530 even though they'd be getting a little excessively long, for instance, 1238 01:02:00,530 --> 01:02:05,210 malan@something.cs50.harvard.edu, which does not technically exist, 1239 01:02:05,210 --> 01:02:06,125 but it could. 1240 01:02:06,125 --> 01:02:09,950 You can have, of course, multiple dots in a domain name like we see here. 1241 01:02:09,950 --> 01:02:12,500 Wouldn't it be nice if we could handle that as well? 1242 01:02:12,500 --> 01:02:16,670 Well, let me propose that we modify my regular expression as follows. 1243 01:02:16,670 --> 01:02:20,240 It turns out that you can group ideas together. 
1244 01:02:20,240 --> 01:02:24,050 And you can not only ask whether or not this pattern matches 1245 01:02:24,050 --> 01:02:29,780 or this one using syntax like A vertical bar B, which means "either A or B," 1246 01:02:29,780 --> 01:02:34,280 you can also group things together and then apply some other operator to them 1247 01:02:34,280 --> 01:02:35,100 as well. 1248 01:02:35,100 --> 01:02:37,160 In fact, let me go back to the code here. 1249 01:02:37,160 --> 01:02:42,260 And let me propose that if I want to tolerate a subdomain, like cs50, 1250 01:02:42,260 --> 01:02:46,700 that may or may not be there, let me go ahead and change it as follows. 1251 01:02:46,700 --> 01:02:48,320 I could naively do this. 1252 01:02:48,320 --> 01:02:51,210 If I want to support subdomains, I could say, well, 1253 01:02:51,210 --> 01:02:55,640 let's allow for other word characters plus, and then a literal dot. 1254 01:02:55,640 --> 01:02:58,970 And notice, I'll highlight in blue here what I've just added. 1255 01:02:58,970 --> 01:03:04,190 Everything else is the same, but I'm now adding room for another sequence of one 1256 01:03:04,190 --> 01:03:07,650 or more word characters and then a literal dot. 1257 01:03:07,650 --> 01:03:12,380 So this now, I think, if I rerun python of validate.py, 1258 01:03:12,380 --> 01:03:16,310 will work for malan@cs50.harvard.edu, Enter. 1259 01:03:16,310 --> 01:03:19,610 Unfortunately, does anyone see where this is going? 1260 01:03:19,610 --> 01:03:22,310 Let me rerun python of validate.py and type 1261 01:03:22,310 --> 01:03:25,010 in as I keep doing, malan@harvard.edu, which up until now 1262 01:03:25,010 --> 01:03:27,290 has kept working despite all of my changes. 1263 01:03:27,290 --> 01:03:33,110 But now, ugh, finally I've broken my own email address. 1264 01:03:33,110 --> 01:03:35,540 So logically what's the solution here? 1265 01:03:35,540 --> 01:03:37,730 Well, there's a bunch of ways we could solve this. 
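A minimal sketch of that naive change, assuming the pattern simply gains one extra `\w+\.` after the @ sign and that re.IGNORECASE is still passed in; `is_valid` is a hypothetical helper standing in for the interactive program:

```python
import re

# Naive attempt, as narrated: require one more run of word characters and a
# literal dot before the domain itself.
NAIVE = r"^\w+@\w+\.\w+\.edu$"


def is_valid(email):
    """Return True if email matches the naive pattern (hypothetical helper)."""
    return re.search(NAIVE, email, re.IGNORECASE) is not None


print(is_valid("malan@cs50.harvard.edu"))  # True
print(is_valid("malan@harvard.edu"))       # False: the subdomain is now mandatory
```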
1266 01:03:37,730 --> 01:03:40,430 I could maybe start using two regular expressions 1267 01:03:40,430 --> 01:03:46,370 and support email addresses of the form username@domain.tld, 1268 01:03:46,370 --> 01:03:51,350 or username@subdomain.domain.tld, where TLD just 1269 01:03:51,350 --> 01:03:53,917 means Top Level Domain, like edu. 1270 01:03:53,917 --> 01:03:56,000 Or I could maybe just modify this one, because I'd 1271 01:03:56,000 --> 01:04:00,920 prefer not to have two regular expressions or one that's twice as big. 1272 01:04:00,920 --> 01:04:06,470 Why don't I just specify to re.search that part of this pattern is optional? 1273 01:04:06,470 --> 01:04:10,400 What was the symbol we saw earlier that allows 1274 01:04:10,400 --> 01:04:15,440 you to specify that the thing before it is technically optional? 1275 01:04:15,440 --> 01:04:16,610 AUDIENCE: The straight bar? 1276 01:04:16,610 --> 01:04:19,790 We were using the straight bar as an-- 1277 01:04:19,790 --> 01:04:22,678 optional, make the argument optional. 1278 01:04:22,678 --> 01:04:23,720 DAVID MALAN: So we could. 1279 01:04:23,720 --> 01:04:26,210 We could use a vertical bar and some parentheses 1280 01:04:26,210 --> 01:04:29,480 and say, "either there's something here or there's nothing." 1281 01:04:29,480 --> 01:04:31,010 We could do that in parentheses. 1282 01:04:31,010 --> 01:04:33,860 But I think there's actually an even easier way. 1283 01:04:33,860 --> 01:04:36,332 AUDIENCE: Actually, it's a question mark. 1284 01:04:36,332 --> 01:04:37,790 DAVID MALAN: Indeed, question mark. 1285 01:04:37,790 --> 01:04:41,240 Think back to this summary here of our first set of symbols, 1286 01:04:41,240 --> 01:04:46,130 whereby we had not just dot and * and +, but also a question mark, which 1287 01:04:46,130 --> 01:04:49,370 means literally "zero or one repetitions," which 1288 01:04:49,370 --> 01:04:50,810 effectively means optional. 1289 01:04:50,810 --> 01:04:54,740 It's either there, one, or it's not, zero. 
1290 01:04:54,740 --> 01:04:57,650 Now, how can I translate that to this code here? 1291 01:04:57,650 --> 01:05:03,150 Well, let me go ahead and surround this part of my pattern with parentheses, 1292 01:05:03,150 --> 01:05:06,740 which doesn't mean I want literally a parentheses in the user's input, 1293 01:05:06,740 --> 01:05:09,410 I just want to group these characters together. 1294 01:05:09,410 --> 01:05:11,480 And in fact, this now will still work. 1295 01:05:11,480 --> 01:05:14,960 I've only added parentheses around the new part for the subdomain. 1296 01:05:14,960 --> 01:05:17,000 Let me run python of validate.py. 1297 01:05:17,000 --> 01:05:20,060 Let me run malan@cs50.harvard.edu, Enter. 1298 01:05:20,060 --> 01:05:21,110 That's still valid. 1299 01:05:21,110 --> 01:05:25,730 But to be clear, if I rerun it again for malan@harvard.edu, that is still 1300 01:05:25,730 --> 01:05:31,310 invalid, but not if I go in here and say, after the parentheses, which 1301 01:05:31,310 --> 01:05:36,410 now is one logical unit, it's one big group of ideas together, 1302 01:05:36,410 --> 01:05:38,690 I add a single question mark there. 1303 01:05:38,690 --> 01:05:43,910 This will now tell re.search that that whole thing in parentheses 1304 01:05:43,910 --> 01:05:49,020 can either be there once or be there not at all, zero times. 1305 01:05:49,020 --> 01:05:51,530 So what does this translate into when I run it? 1306 01:05:51,530 --> 01:05:56,030 Well, let me go ahead and rerun it with malan@cs50.harvard.edu 1307 01:05:56,030 --> 01:05:57,770 so that the subdomain is there. 1308 01:05:57,770 --> 01:05:59,720 That works as before. 1309 01:05:59,720 --> 01:06:01,860 Let me clear my screen and run it again, python 1310 01:06:01,860 --> 01:06:06,830 of validate.py with malan@harvard.edu, which used to work then broke. 1311 01:06:06,830 --> 01:06:08,330 Are we back in business now? 1312 01:06:08,330 --> 01:06:09,260 We are. 
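A minimal sketch of the pattern after this change, assuming the rest of the expression is as narrated; `is_valid` is a hypothetical helper, and the third address is the made-up example from earlier:

```python
import re

# The subdomain and its trailing dot are grouped with parentheses, and the
# whole group is made optional with "?" (zero or one repetitions).
PATTERN = r"^\w+@(\w+\.)?\w+\.edu$"


def is_valid(email):
    """Return True if email matches PATTERN (hypothetical helper)."""
    return re.search(PATTERN, email, re.IGNORECASE) is not None


print(is_valid("malan@harvard.edu"))                 # True: group matched zero times
print(is_valid("malan@cs50.harvard.edu"))            # True: group matched once
print(is_valid("malan@something.cs50.harvard.edu"))  # False: two subdomains
```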
1313 01:06:09,260 --> 01:06:11,810 That's now valid again. 1314 01:06:11,810 --> 01:06:14,540 Questions now on this approach, where we've used 1315 01:06:14,540 --> 01:06:18,655 not just the question mark but the parentheses as well? 1316 01:06:18,655 --> 01:06:19,280 AUDIENCE: Yeah. 1317 01:06:19,280 --> 01:06:22,130 You said it works for zero or one repetitions. 1318 01:06:22,130 --> 01:06:23,912 What if you have more? 1319 01:06:23,912 --> 01:06:25,370 DAVID MALAN: What if you have more? 1320 01:06:25,370 --> 01:06:26,220 That's OK. 1321 01:06:26,220 --> 01:06:28,610 That's where you could do *. 1322 01:06:28,610 --> 01:06:33,835 * is zero or more, which gives you all the flexibility in the world. 1323 01:06:33,835 --> 01:06:34,460 AUDIENCE: Yeah. 1324 01:06:34,460 --> 01:06:37,050 So I was just asking that-- 1325 01:06:37,050 --> 01:06:40,670 with question marks, there's only one repetition allowed. 1326 01:06:40,670 --> 01:06:42,810 DAVID MALAN: It means zero or one repetition. 1327 01:06:42,810 --> 01:06:45,630 So it's either not there or it is there. 1328 01:06:45,630 --> 01:06:49,940 And so that's why this pattern now, if I go back to my code, even though again, 1329 01:06:49,940 --> 01:06:54,650 it admittedly looks cryptic, let me highlight everything after the @ sign 1330 01:06:54,650 --> 01:06:56,060 and before the $ sign. 1331 01:06:56,060 --> 01:07:01,001 This now represents a domain name, like harvard.edu, 1332 01:07:01,001 --> 01:07:03,920 or a subdomain within the domain name. 1333 01:07:03,920 --> 01:07:04,700 Why? 1334 01:07:04,700 --> 01:07:07,700 Well, this part to the right is the same as always. 1335 01:07:07,700 --> 01:07:11,330 Backslash w + means something like Harvard or Yale. 1336 01:07:11,330 --> 01:07:14,810 Backslash .edu means literally ".edu." 1337 01:07:14,810 --> 01:07:16,430 So the new part is this. 1338 01:07:16,430 --> 01:07:22,370 In parentheses, I have another set of backslash w + backslash dot now. 
1339 01:07:22,370 --> 01:07:24,080 But it's all in parentheses. 1340 01:07:24,080 --> 01:07:26,870 I'm now having a question mark right after that, 1341 01:07:26,870 --> 01:07:30,710 which means that whole thing in parentheses either can be there, 1342 01:07:30,710 --> 01:07:31,850 or it can't be there. 1343 01:07:31,850 --> 01:07:34,010 It's either of those that are acceptable. 1344 01:07:34,010 --> 01:07:37,880 So a question mark effectively makes something optional. 1345 01:07:37,880 --> 01:07:40,670 It would not be correct to remove the parentheses, 1346 01:07:40,670 --> 01:07:42,150 because what would this mean? 1347 01:07:42,150 --> 01:07:44,690 If I removed the parentheses, that would mean 1348 01:07:44,690 --> 01:07:49,580 that only this dot is optional, which isn't really what we want to express. 1349 01:07:49,580 --> 01:07:54,050 I want the subdomain, like cs50, and the additional dot 1350 01:07:54,050 --> 01:07:56,060 to be what's there or not there. 1351 01:07:56,060 --> 01:07:59,270 How about one other question on regexes here? 1352 01:07:59,270 --> 01:08:01,530 AUDIENCE: Can we use this for the usernames? 1353 01:08:01,530 --> 01:08:02,530 DAVID MALAN: Absolutely. 1354 01:08:02,530 --> 01:08:04,000 We still have other problems. 1355 01:08:04,000 --> 01:08:06,280 We're not solving all of the problems today just yet. 1356 01:08:06,280 --> 01:08:07,330 But absolutely. 1357 01:08:07,330 --> 01:08:11,380 Right now, we are not letting you have a period in your username. 1358 01:08:11,380 --> 01:08:14,088 And again, some of you with Gmail accounts or other accounts, you 1359 01:08:14,088 --> 01:08:16,463 probably have not just underscores, numbers, and letters. 1360 01:08:16,463 --> 01:08:17,740 You might have periods, too. 1361 01:08:17,740 --> 01:08:21,790 Well, we could fix that, not using question mark here per se.
1362 01:08:21,790 --> 01:08:25,630 But now that we have these parentheses at our disposal, what I could do 1363 01:08:25,630 --> 01:08:26,350 is this. 1364 01:08:26,350 --> 01:08:30,399 I could use parentheses to surround the backslash w 1365 01:08:30,399 --> 01:08:33,819 to say "any word character," which is the same thing, again, as a letter, 1366 01:08:33,819 --> 01:08:35,529 or a number, or an underscore. 1367 01:08:35,529 --> 01:08:40,120 But I could also or in, using a vertical bar, something else, 1368 01:08:40,120 --> 01:08:41,800 like a literal dot. 1369 01:08:41,800 --> 01:08:44,770 Now, a literal dot needs to be escaped, otherwise it 1370 01:08:44,770 --> 01:08:47,859 represents any character, which would be a regression, a step back. 1371 01:08:47,859 --> 01:08:49,540 But now notice what I've done. 1372 01:08:49,540 --> 01:08:54,370 In parentheses, I'm telling re.search that those first few characters 1373 01:08:54,370 --> 01:08:56,800 in your email address, that is your username, 1374 01:08:56,800 --> 01:09:02,049 has to be a word character, like A through z, uppercase or lowercase, or 0 1375 01:09:02,049 --> 01:09:05,290 through 9, or an underscore, or a literal dot. 1376 01:09:05,290 --> 01:09:06,760 We could do this differently, too. 1377 01:09:06,760 --> 01:09:09,220 I could get rid of the parentheses and the 1378 01:09:09,220 --> 01:09:12,010 or, and I could just use a set of characters. 1379 01:09:12,010 --> 01:09:17,890 I could, again, manually say a through z, A through Z, 0 through 9, 1380 01:09:17,890 --> 01:09:22,540 underscore, and then I could do a literal dot with a backslash period. 1381 01:09:22,540 --> 01:09:25,029 And now I technically don't even need the uppercase, 1382 01:09:25,029 --> 01:09:27,590 because I'm already telling the computer to ignore case. 1383 01:09:27,590 --> 01:09:29,359 I can just pick one or the other. 1384 01:09:29,359 --> 01:09:31,120 Which one is better is really up to you. 
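A minimal sketch of the two username variants just described, assuming the domain part of the pattern is unchanged and re.IGNORECASE is still passed in. The address david.malan@harvard.edu is a made-up example, and `is_valid` is a hypothetical helper:

```python
import re

# Variant 1: a group that accepts either a word character or a literal dot.
WITH_GROUP = r"^(\w|\.)+@(\w+\.)?\w+\.edu$"
# Variant 2: an explicit set; uppercase is covered by re.IGNORECASE.
WITH_SET = r"^[a-z0-9_\.]+@(\w+\.)?\w+\.edu$"


def is_valid(pattern, email):
    """Return True if email matches pattern, ignoring case (hypothetical helper)."""
    return re.search(pattern, email, re.IGNORECASE) is not None


for pattern in (WITH_GROUP, WITH_SET):
    print(is_valid(pattern, "david.malan@harvard.edu"))  # True
    print(is_valid(pattern, "DAVID.MALAN@HARVARD.EDU"))  # True
    print(is_valid(pattern, "david malan@harvard.edu"))  # False: space not allowed
```

The two behave the same for plain ASCII input. In Python, `\w` also accepts non-ASCII letters and digits by default, so the explicit set is slightly stricter.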
1385 01:09:31,120 --> 01:09:35,600 Whichever one you think is more readable would generally be the better design. 1386 01:09:35,600 --> 01:09:36,100 All right. 1387 01:09:36,100 --> 01:09:38,979 Let me propose that I rewind this in time 1388 01:09:38,979 --> 01:09:42,819 to where we left off, which was here. 1389 01:09:42,819 --> 01:09:44,800 And let me propose that there are, indeed, 1390 01:09:44,800 --> 01:09:48,935 still limitations of this solution, not just with the username, not just 1391 01:09:48,935 --> 01:09:49,810 with the domain name. 1392 01:09:49,810 --> 01:09:51,700 We're still being a little too restrictive. 1393 01:09:51,700 --> 01:09:54,910 So would you like to see the official regular expression 1394 01:09:54,910 --> 01:09:58,720 that at least browsers use nowadays whenever you type in an email address 1395 01:09:58,720 --> 01:10:01,450 to a web form, and the web form, the browser, 1396 01:10:01,450 --> 01:10:05,680 tells you yes or no, your email address is syntactically valid? 1397 01:10:05,680 --> 01:10:06,670 Ready? 1398 01:10:06,670 --> 01:10:07,810 Ready? 1399 01:10:07,810 --> 01:10:12,730 Here is-- and this isn't even officially the right regular expression. 1400 01:10:12,730 --> 01:10:15,670 It's a simplified version that browsers use because it 1401 01:10:15,670 --> 01:10:18,100 catches most mistakes but not all. 1402 01:10:18,100 --> 01:10:19,460 Here we go. 1403 01:10:19,460 --> 01:10:23,710 This is the regular expression for a valid email address, 1404 01:10:23,710 --> 01:10:27,550 at least as browsers nowadays implement them. 1405 01:10:27,550 --> 01:10:30,610 Now it's crazy cryptic at first glance. 1406 01:10:30,610 --> 01:10:34,930 But note-- and it's wrapping on to many lines, but it's just one pattern. 1407 01:10:34,930 --> 01:10:37,930 But just notice the now-familiar symbols. 1408 01:10:37,930 --> 01:10:40,540 There is the ^ symbol at the very top. 1409 01:10:40,540 --> 01:10:43,280 There is the $ sign at the very end. 
1410 01:10:43,280 --> 01:10:45,730 There is a square bracket over here and then some 1411 01:10:45,730 --> 01:10:47,860 of these ranges plus other characters. 1412 01:10:47,860 --> 01:10:51,280 Turns out you don't normally see these characters in email addresses. 1413 01:10:51,280 --> 01:10:53,770 It looks like you're swearing at someone in their username. 1414 01:10:53,770 --> 01:10:55,450 But they're valid characters. 1415 01:10:55,450 --> 01:10:56,680 They're valid officially. 1416 01:10:56,680 --> 01:11:00,670 That doesn't mean that Gmail is going to allow you to put $ signs and other 1417 01:11:00,670 --> 01:11:02,260 punctuation in your username. 1418 01:11:02,260 --> 01:11:04,850 But officially, some servers might allow that. 1419 01:11:04,850 --> 01:11:08,080 So if you really want to validate a user's email address, 1420 01:11:08,080 --> 01:11:12,250 you would actually come up with or copy-paste something like this. 1421 01:11:12,250 --> 01:11:14,680 But honestly, this looks so cryptic. 1422 01:11:14,680 --> 01:11:18,680 And if you were to type it out manually, you are so likely to make a mistake. 1423 01:11:18,680 --> 01:11:21,040 What's the better solution here instead? 1424 01:11:21,040 --> 01:11:24,820 This is where, per past weeks, libraries are your friend. 1425 01:11:24,820 --> 01:11:28,360 Surely someone else on the internet, a programmer more 1426 01:11:28,360 --> 01:11:31,360 experienced than you, even, has come up with code 1427 01:11:31,360 --> 01:11:35,830 that validates email addresses properly, using this regular expression or even 1428 01:11:35,830 --> 01:11:37,580 something more sophisticated than that. 
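The expression shown on screen is not captured in this transcript. As a hedged stand-in, the sketch below uses the pattern published in the WHATWG HTML specification for email inputs, which is presumably the one meant; treat that identification as an assumption. The names `HTML_EMAIL` and `looks_like_email` are hypothetical:

```python
import re

# Pattern from the WHATWG HTML specification for <input type="email">.
# Assumption: this is the expression displayed at this point in the lecture.
HTML_EMAIL = (
    r"^[a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+"
    r"@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?"
    r"(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*$"
)


def looks_like_email(text):
    """Return True if text matches HTML_EMAIL (hypothetical helper)."""
    return re.search(HTML_EMAIL, text) is not None


print(looks_like_email("malan@cs50.harvard.edu"))  # True
print(looks_like_email("my$name@example.com"))     # True: "$" is allowed here
print(looks_like_email("malan@"))                  # False
```

Even this pattern only checks syntax. It cannot tell whether the address exists or can receive mail.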
1429 01:11:37,580 --> 01:11:40,030 So generally, if the problem at hand is to validate 1430 01:11:40,030 --> 01:11:43,060 input that is pretty conventional-- an email address, 1431 01:11:43,060 --> 01:11:46,570 a URL, something where there's an official definition that's 1432 01:11:46,570 --> 01:11:50,710 independent of you yourself-- find a popular library that you're 1433 01:11:50,710 --> 01:11:55,130 comfortable using and use it in your code to validate email addresses. 1434 01:11:55,130 --> 01:11:58,750 This is not a wheel, necessarily, that you yourself should invent. 1435 01:11:58,750 --> 01:12:01,870 We've used email addresses, though, to iteratively start 1436 01:12:01,870 --> 01:12:05,300 from something simple, too simple, and build on top of that. 1437 01:12:05,300 --> 01:12:07,960 So you could certainly imagine using regular expressions still 1438 01:12:07,960 --> 01:12:10,210 to validate things that aren't email addresses but are 1439 01:12:10,210 --> 01:12:12,230 data that are important to you. 1440 01:12:12,230 --> 01:12:14,980 So we at least now have these building blocks. 1441 01:12:14,980 --> 01:12:17,380 Now, besides the regular expressions themselves, 1442 01:12:17,380 --> 01:12:20,290 it turns out there's other functions in Python's re 1443 01:12:20,290 --> 01:12:22,030 library for regular expressions. 1444 01:12:22,030 --> 01:12:24,280 Among them is this function here, re.match, 1445 01:12:24,280 --> 01:12:26,980 which is actually very similar to re.search, 1446 01:12:26,980 --> 01:12:29,462 except you don't have to specify the ^ symbol 1447 01:12:29,462 --> 01:12:31,420 at the very beginning of your regex if you want 1448 01:12:31,420 --> 01:12:33,400 to match from the start of a string. 1449 01:12:33,400 --> 01:12:36,958 re.match by design will automatically start matching 1450 01:12:36,958 --> 01:12:38,500 from the start of the string for you. 
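A minimal sketch of how re.search and re.match differ, shown alongside re.fullmatch, which additionally requires the match to reach the end of the string. The pattern is the simple one assumed from earlier in the narration, and the sample strings are made up:

```python
import re

email = "malan@harvard.edu"
sentence = "my address is malan@harvard.edu"

# re.search scans the whole string, so ^ is needed to pin the start.
print(bool(re.search(r"^\w+@\w+\.edu$", email)))    # True
print(bool(re.search(r"\w+@\w+\.edu$", sentence)))  # True: found mid-string

# re.match only matches at the start of the string, so ^ can be dropped.
print(bool(re.match(r"\w+@\w+\.edu$", email)))      # True
print(bool(re.match(r"\w+@\w+\.edu$", sentence)))   # False

# re.fullmatch must consume the entire string, so neither ^ nor $ is needed.
print(bool(re.fullmatch(r"\w+@\w+\.edu", email)))             # True
print(bool(re.fullmatch(r"\w+@\w+\.edu", email + " or so")))  # False
```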
1451 01:12:38,500 --> 01:12:42,580 Similar in spirit is re.fullmatch, which does the same thing but not only 1452 01:12:42,580 --> 01:12:45,730 matches at the start of the string but the end of the string, so that you, 1453 01:12:45,730 --> 01:12:50,240 too, don't need to type in the ^ symbol or the $ sign as well. 1454 01:12:50,240 --> 01:12:53,170 But let's go ahead and transition back now to some actual code, 1455 01:12:53,170 --> 01:12:55,420 whereby we solve a different problem in spirit. 1456 01:12:55,420 --> 01:12:57,920 Rather than just validate the user's input 1457 01:12:57,920 --> 01:13:00,290 and make sure it looks the way we want, let's just 1458 01:13:00,290 --> 01:13:04,020 assume that the users are not going to type in data exactly as we want, 1459 01:13:04,020 --> 01:13:06,290 and so we're going to have to clean up their input. 1460 01:13:06,290 --> 01:13:10,580 This happens so often when you're using like a Google Form, or Office 365 form, 1461 01:13:10,580 --> 01:13:12,800 or anything else to collect user input. 1462 01:13:12,800 --> 01:13:15,800 No matter what your form question says, your users 1463 01:13:15,800 --> 01:13:18,225 are not necessarily going to follow those directions. 1464 01:13:18,225 --> 01:13:20,600 They might go ahead and type in something that's a little 1465 01:13:20,600 --> 01:13:22,910 differently formatted than you might like. 1466 01:13:22,910 --> 01:13:26,810 Now, you could certainly go through the results and download a CSV, 1467 01:13:26,810 --> 01:13:29,720 or open the Google spreadsheet, or equivalent in Excel, 1468 01:13:29,720 --> 01:13:31,980 and just clean up all of the data manually. 1469 01:13:31,980 --> 01:13:34,250 But if you've got lots of submissions-- dozens, 1470 01:13:34,250 --> 01:13:37,070 hundreds, thousands of rows in your data set-- 1471 01:13:37,070 --> 01:13:39,170 doing things manually might not be very fun. 
1472 01:13:39,170 --> 01:13:42,680 It might be much more effective to write code, as in Python, 1473 01:13:42,680 --> 01:13:47,220 that can allow you to clean up that data and any future data as well. 1474 01:13:47,220 --> 01:13:51,620 So let me propose that we go ahead here and close validate.py. 1475 01:13:51,620 --> 01:13:55,460 And let's go ahead and create a new program altogether called format.py, 1476 01:13:55,460 --> 01:13:59,990 the goal of which is to reformat the user's input in the format we expect. 1477 01:13:59,990 --> 01:14:03,080 I'm going to go ahead and run code of format.py. 1478 01:14:03,080 --> 01:14:06,170 And let's suppose that the data we're going to reformat 1479 01:14:06,170 --> 01:14:09,703 is the user's name-- so not email address but name this time. 1480 01:14:09,703 --> 01:14:11,870 And we're going to hope that they type in their name 1481 01:14:11,870 --> 01:14:14,270 properly, like David Malan. 1482 01:14:14,270 --> 01:14:16,610 But some users might be in the habit, for whatever 1483 01:14:16,610 --> 01:14:19,020 reason, of typing their name backwards, if you will, 1484 01:14:19,020 --> 01:14:23,030 with a comma, such as Malan comma David instead. 1485 01:14:23,030 --> 01:14:27,740 Now, it's fine because both are clearly as readable to the human. 1486 01:14:27,740 --> 01:14:30,530 But if you want to standardize how those names are stored 1487 01:14:30,530 --> 01:14:34,250 in your system, perhaps a database, or CSV file, or something else, 1488 01:14:34,250 --> 01:14:37,970 it would be nice to at least standardize or canonicalize the format in which 1489 01:14:37,970 --> 01:14:41,060 you're storing your data, so that if you print out the user's name 1490 01:14:41,060 --> 01:14:43,250 it's always the same format, David Malan, 1491 01:14:43,250 --> 01:14:46,410 and there's no commas or backwardness to it. 1492 01:14:46,410 --> 01:14:48,650 So let's go ahead and do something familiar. 
1493 01:14:48,650 --> 01:14:50,990 Let's go ahead and give myself a variable called name 1494 01:14:50,990 --> 01:14:53,120 and set it equal to the return value of input, 1495 01:14:53,120 --> 01:14:56,300 asking the user, as we've done many times, "what's your name," 1496 01:14:56,300 --> 01:14:57,170 question mark. 1497 01:14:57,170 --> 01:15:00,290 I'm going to go ahead and proactively at least clean up some messiness, 1498 01:15:00,290 --> 01:15:03,950 as we keep doing here, by just stripping off any leading or trailing whitespace. 1499 01:15:03,950 --> 01:15:06,470 Just in case the user accidentally hits the spacebar, 1500 01:15:06,470 --> 01:15:09,720 we don't want that ultimately in our data set. 1501 01:15:09,720 --> 01:15:12,260 And now let me go ahead and do this as we've done before. 1502 01:15:12,260 --> 01:15:14,900 Let me just go ahead quickly and print out, just to make sure 1503 01:15:14,900 --> 01:15:18,650 I'm off to the right start, "hello," and then in curly braces name, 1504 01:15:18,650 --> 01:15:22,010 so making an f-string to format "hello," comma, "name." 1505 01:15:22,010 --> 01:15:25,730 Now let me go ahead and clear my screen and run python of format.py. 1506 01:15:25,730 --> 01:15:29,510 Let me behave and type in my name as I normally would, David, space, Malan, 1507 01:15:29,510 --> 01:15:30,170 Enter. 1508 01:15:30,170 --> 01:15:32,270 And I think the output looks pretty good. 1509 01:15:32,270 --> 01:15:34,490 It looks as expected grammatically. 1510 01:15:34,490 --> 01:15:37,283 Let me now go ahead, though, and play this game again. 1511 01:15:37,283 --> 01:15:39,200 But this time, maybe because I'm not thinking, 1512 01:15:39,200 --> 01:15:41,600 or I'm just in the habit of doing last name comma first, 1513 01:15:41,600 --> 01:15:44,700 I do Malan, comma, David, and hit Enter. 1514 01:15:44,700 --> 01:15:45,200 All right. 1515 01:15:45,200 --> 01:15:47,270 Well, this now is weird.
1516 01:15:47,270 --> 01:15:51,020 Even though the program is just spitting out exactly what I typed in, 1517 01:15:51,020 --> 01:15:54,020 arguably this is not close to correct, at least grammatically. 1518 01:15:54,020 --> 01:15:56,810 It should really say "hello, David Malan." 1519 01:15:56,810 --> 01:15:58,820 Now, maybe I could have some if conditions 1520 01:15:58,820 --> 01:16:01,910 and I could just reject the user's input if they type a comma 1521 01:16:01,910 --> 01:16:03,800 or get their names backwards somehow. 1522 01:16:03,800 --> 01:16:07,190 But that's going to be too little too late if the user has already 1523 01:16:07,190 --> 01:16:10,580 submitted a form online, and I already have the data, 1524 01:16:10,580 --> 01:16:12,600 and now I need to go in and clean it up. 1525 01:16:12,600 --> 01:16:14,750 And it's not going to be fun to go through manually 1526 01:16:14,750 --> 01:16:17,900 in Google Spreadsheets, or Apple Numbers, or Microsoft Excel 1527 01:16:17,900 --> 01:16:21,650 and manually fix a lot of people's names to get rid of the commas 1528 01:16:21,650 --> 01:16:25,700 and move the first name before the last, as is conventional in the US. 1529 01:16:25,700 --> 01:16:27,080 So let's do this. 1530 01:16:27,080 --> 01:16:29,780 It could be a little fragile, but let's start 1531 01:16:29,780 --> 01:16:32,990 to express ourselves a little programmatically here and ask this. 1532 01:16:32,990 --> 01:16:37,940 If there is a comma in the person's name, which is Pythonic-- 1533 01:16:37,940 --> 01:16:41,960 I'm just asking the question, is this shorter string in this longer string?-- 1534 01:16:41,960 --> 01:16:43,650 then let me go ahead and do this. 1535 01:16:43,650 --> 01:16:46,340 Let me go ahead and grab that name in the variable, 1536 01:16:46,340 --> 01:16:50,840 split on not just the comma but the space after, 1537 01:16:50,840 --> 01:16:53,480 assuming the human typed in a space after their name. 
1538 01:16:53,480 --> 01:16:57,080 And let me go ahead and store the result of that splitting of Malan, comma, 1539 01:16:57,080 --> 01:16:58,860 David into two variables. 1540 01:16:58,860 --> 01:17:02,000 Let's do last, comma, first, again unpacking 1541 01:17:02,000 --> 01:17:04,310 the sequence of values that comes back. 1542 01:17:04,310 --> 01:17:07,170 Now let me go ahead and reformat the name. 1543 01:17:07,170 --> 01:17:10,160 So I'm going to forcibly change the user's name to be as I expect. 1544 01:17:10,160 --> 01:17:13,580 So name is actually going to be this format string-- 1545 01:17:13,580 --> 01:17:18,830 first name then last name, both in curly braces but formatted together 1546 01:17:18,830 --> 01:17:22,580 with a single space, so that I'm overwriting the user's input 1547 01:17:22,580 --> 01:17:25,280 and updating my name variable accordingly. 1548 01:17:25,280 --> 01:17:27,770 For the moment, to be clear, this program is interactive. 1549 01:17:27,770 --> 01:17:31,250 Like, the users, like me, are typing their name into the program. 1550 01:17:31,250 --> 01:17:34,340 But imagine the data already is in a CSV file. 1551 01:17:34,340 --> 01:17:37,730 It came in from some process like a Google Form or something else online. 1552 01:17:37,730 --> 01:17:40,370 You could imagine writing code similar to this, 1553 01:17:40,370 --> 01:17:43,550 but that maybe goes and reads that file into memory first. 1554 01:17:43,550 --> 01:17:46,640 Maybe it's a CSV via CSV Reader or DictReader, 1555 01:17:46,640 --> 01:17:48,860 and then iterating over each of those names. 1556 01:17:48,860 --> 01:17:51,630 But we'll keep it simple and just do one name at a time. 1557 01:17:51,630 --> 01:17:55,070 But now what's kind of interesting here is if I go back to my terminal window 1558 01:17:55,070 --> 01:17:57,940 and clear it, and run python of format.py, 1559 01:17:57,940 --> 01:18:01,240 and hit Enter, I'm going to type in David, space, Malan as before. 
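A minimal sketch of the split-based cleanup described above. The lecture's format.py reads the name with input(); here the same logic is wrapped in a function so it can be run without typing anything. The function name format_name is an assumption made for illustration and does not come from the lecture.

```python
def format_name(name: str) -> str:
    """Rewrite 'Last, First' as 'First Last'; leave other input unchanged."""
    name = name.strip()
    if "," in name:
        # Assumes exactly one comma followed by exactly one space.
        last, first = name.split(", ")
        name = f"{first} {last}"
    return name


print(f"hello, {format_name('David Malan')}")
print(f"hello, {format_name('Malan, David')}")
```

Both calls should greet the same name, since the second input is reordered before printing.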
1560 01:18:01,240 --> 01:18:03,130 And I think we're still good. 1561 01:18:03,130 --> 01:18:05,290 But I'm also going to go ahead and do this-- 1562 01:18:05,290 --> 01:18:10,630 python of format.py Malan, comma, David, with a space in between, 1563 01:18:10,630 --> 01:18:13,960 crossing my fingers and hit Enter, and voila. 1564 01:18:13,960 --> 01:18:15,640 That now has been fixed. 1565 01:18:15,640 --> 01:18:18,400 Such a simple thing to be sure. 1566 01:18:18,400 --> 01:18:22,300 But it is so commonly necessary to clean up users' input. 1567 01:18:22,300 --> 01:18:25,870 Here we see at least one way to do so pretty easily. 1568 01:18:25,870 --> 01:18:28,480 Now, to be fair, there's some problems here. 1569 01:18:28,480 --> 01:18:32,500 And in fact, can someone imagine a scenario in which this code really 1570 01:18:32,500 --> 01:18:34,570 doesn't fix the user's input? 1571 01:18:34,570 --> 01:18:39,760 What could still go wrong even with this fix in my code? 1572 01:18:39,760 --> 01:18:40,810 Any thoughts? 1573 01:18:40,810 --> 01:18:44,322 AUDIENCE: If they typed in their name comma and then [INAUDIBLE].. 1574 01:18:44,322 --> 01:18:46,030 DAVID MALAN: Oh, and then something else. 1575 01:18:46,030 --> 01:18:46,530 Yeah. 1576 01:18:46,530 --> 01:18:48,730 So let me try this, for instance. 1577 01:18:48,730 --> 01:18:50,410 Let me go ahead and run a program. 1578 01:18:50,410 --> 01:18:53,350 And I am the only David Malan that I know. 1579 01:18:53,350 --> 01:18:57,850 But suppose I were, let's say, junior like this. 1580 01:18:57,850 --> 01:19:00,850 And it's common, in English at least, to sometimes put a comma there. 1581 01:19:00,850 --> 01:19:02,350 You don't necessarily need the comma, but I'm 1582 01:19:02,350 --> 01:19:04,120 one of those people who uses a comma. 1583 01:19:04,120 --> 01:19:06,730 That's now really, really broken. 1584 01:19:06,730 --> 01:19:08,830 So I've broken some assumption there.
1585 01:19:08,830 --> 01:19:10,970 And so that could certainly go wrong here. 1586 01:19:10,970 --> 01:19:11,470 What else? 1587 01:19:11,470 --> 01:19:13,178 Well, let me go ahead and run this again. 1588 01:19:13,178 --> 01:19:15,540 And if I did Malan, comma, David, no space, 1589 01:19:15,540 --> 01:19:17,290 because I'm being a little sloppy, I'm not 1590 01:19:17,290 --> 01:19:20,500 paying attention, which is going to happen when you have lots of users 1591 01:19:20,500 --> 01:19:22,750 ultimately, well, this really broke now. 1592 01:19:22,750 --> 01:19:25,870 Notice I have a ValueError, an actual exception. 1593 01:19:25,870 --> 01:19:26,410 Why? 1594 01:19:26,410 --> 01:19:31,330 Well, because split is supposed to be splitting the string into two strings 1595 01:19:31,330 --> 01:19:34,000 by looking for the comma and a space. 1596 01:19:34,000 --> 01:19:37,720 But if there is no comma and space, it can't split it into two things. 1597 01:19:37,720 --> 01:19:40,900 And the fact that I have two variables on the left, 1598 01:19:40,900 --> 01:19:44,290 but I'm only getting back one thing on the right, 1599 01:19:44,290 --> 01:19:47,030 means that I can't do this code quite like this. 1600 01:19:47,030 --> 01:19:48,467 So it's fragile to be sure. 1601 01:19:48,467 --> 01:19:50,800 But wouldn't it be nice if we could at least improve it? 1602 01:19:50,800 --> 01:19:53,710 For instance, we now know some regular expression syntax. 1603 01:19:53,710 --> 01:19:56,920 What if I at least wanted to make this space optional? 1604 01:19:56,920 --> 01:20:00,010 Well, I could use my newfound regular expression syntax 1605 01:20:00,010 --> 01:20:04,330 and put a question mark. Question mark means zero or one of the things 1606 01:20:04,330 --> 01:20:05,080 to the left. 1607 01:20:05,080 --> 01:20:06,490 What's the thing to the left? 1608 01:20:06,490 --> 01:20:07,850 It's literally a space.
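The ValueError described above can be reproduced in isolation. This is only a sketch: the variable names parts and unpacked are chosen here for illustration, and the input is hard-coded rather than typed at a prompt.

```python
name = "Malan,David"  # comma with no space after it

# ", " never occurs in the string, so split returns a one-element list.
parts = name.split(", ")
print(parts)

unpacked = False
try:
    # Two variables on the left, one value on the right.
    last, first = parts
    unpacked = True
except ValueError as error:
    print("could not unpack:", error)
```

Because str.split takes a literal separator rather than a pattern, writing ", ?" as its argument would search for that exact three-character text, which is why the lecture turns to the re library next.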
1609 01:20:07,850 --> 01:20:10,760 I don't even need parentheses if there's just one thing there. 1610 01:20:10,760 --> 01:20:15,040 So that would be the start of a pattern that says, I must have a comma, 1611 01:20:15,040 --> 01:20:19,240 and then I may or may not have a space, zero or one spaces thereafter. 1612 01:20:19,240 --> 01:20:25,810 Unfortunately, the version of split that's built into the str variable, 1613 01:20:25,810 --> 01:20:28,600 as in this case, doesn't support regular expressions. 1614 01:20:28,600 --> 01:20:32,120 If we want our regular expressions, we need to go use that library here. 1615 01:20:32,120 --> 01:20:33,650 So let me go ahead and do this. 1616 01:20:33,650 --> 01:20:37,550 Let me go in and leave this code as is but go up to the top 1617 01:20:37,550 --> 01:20:41,650 now and import re to import the library for regular expressions. 1618 01:20:41,650 --> 01:20:46,000 And now let me go ahead and start changing my approach here. 1619 01:20:46,000 --> 01:20:47,630 I'm going to go ahead and do this. 1620 01:20:47,630 --> 01:20:50,890 I'm going to use the same function called re.search, 1621 01:20:50,890 --> 01:20:54,370 and I'm going to search for a pattern that I 1622 01:20:54,370 --> 01:20:56,650 think will be last, comma, first. 1623 01:20:56,650 --> 01:20:59,050 So let me use my newfound regular expression syntax 1624 01:20:59,050 --> 01:21:04,390 and represent a pattern for something like Malan, comma, space, David. 1625 01:21:04,390 --> 01:21:05,660 How can I do this? 1626 01:21:05,660 --> 01:21:10,570 Well, inside of my quotes for re.search, I'm going to have something-- 1627 01:21:10,570 --> 01:21:11,950 so dot +-- 1628 01:21:11,950 --> 01:21:12,610 sorry. 1629 01:21:12,610 --> 01:21:14,980 I'm going to have something, so dot +. 1630 01:21:14,980 --> 01:21:16,540 Then I'm going to have a comma. 1631 01:21:16,540 --> 01:21:17,890 Then I'm going to have a space. 
1632 01:21:17,890 --> 01:21:20,440 Then I'm going to have something dot +. 1633 01:21:20,440 --> 01:21:23,200 Now I'm going to preemptively refine this a little bit. 1634 01:21:23,200 --> 01:21:25,288 I want this whole pattern to start matching 1635 01:21:25,288 --> 01:21:26,830 at the beginning of the user's input. 1636 01:21:26,830 --> 01:21:28,960 So I'm going to add the ^ right away. 1637 01:21:28,960 --> 01:21:33,070 And I want the end of the user's input to be matched as well, so that I'm 1638 01:21:33,070 --> 01:21:37,720 literally expecting any character one or more times, then a comma then a space, 1639 01:21:37,720 --> 01:21:40,180 then any other character one or more times. 1640 01:21:40,180 --> 01:21:42,280 And then that is it. 1641 01:21:42,280 --> 01:21:46,430 And I'm going to pass in the name variable as before. 1642 01:21:46,430 --> 01:21:50,300 Now, when we've used re.search in the past, 1643 01:21:50,300 --> 01:21:52,900 we really used it just to answer a question. 1644 01:21:52,900 --> 01:21:57,040 Does the user's input match the following pattern or not, 1645 01:21:57,040 --> 01:21:59,140 true or false, effectively. 1646 01:21:59,140 --> 01:22:02,600 But re.search is actually more powerful than that. 1647 01:22:02,600 --> 01:22:05,110 You can actually get back more information. 1648 01:22:05,110 --> 01:22:06,430 And you can do this. 1649 01:22:06,430 --> 01:22:10,000 You can specify a variable and then an assignment operator, 1650 01:22:10,000 --> 01:22:15,250 and get back more precise answers to what has been found when searched for. 1651 01:22:15,250 --> 01:22:17,500 But what is it you want to get back? 1652 01:22:17,500 --> 01:22:21,260 Well, it turns out there's this other feature of regular expressions 1653 01:22:21,260 --> 01:22:25,330 which allow you to use parentheses, not just to group things together, 1654 01:22:25,330 --> 01:22:27,070 but to capture them. 
1655 01:22:27,070 --> 01:22:31,750 It turns out when you specify parentheses in a regular expression, 1656 01:22:31,750 --> 01:22:35,140 unbeknownst to us up until now, everything in the parentheses 1657 01:22:35,140 --> 01:22:41,350 will be returned to you as a return value from the re.search function. 1658 01:22:41,350 --> 01:22:45,700 It's going to allow you to extract specific amounts of information 1659 01:22:45,700 --> 01:22:47,530 from the user's own input. 1660 01:22:47,530 --> 01:22:51,730 You can reverse this process, too, by using the non-capturing version 1661 01:22:51,730 --> 01:22:52,340 as well. 1662 01:22:52,340 --> 01:22:55,507 You can use parentheses, and then literally a question mark, and a colon, 1663 01:22:55,507 --> 01:22:56,590 and then some other stuff. 1664 01:22:56,590 --> 01:22:58,400 And that will say, don't bother capturing this. 1665 01:22:58,400 --> 01:22:59,567 I just want to group things. 1666 01:22:59,567 --> 01:23:02,850 But for now, we're going to use just the parentheses themselves. 1667 01:23:02,850 --> 01:23:04,200 So how am I going to do this? 1668 01:23:04,200 --> 01:23:08,780 Well, if I want to get back the user's last name and first name, 1669 01:23:08,780 --> 01:23:16,190 I think what I want to capture is the dot + here and the dot + here. 1670 01:23:16,190 --> 01:23:19,190 So I've deliberately surrounded in parentheses 1671 01:23:19,190 --> 01:23:22,160 the dot + both to the left and the right of the comma, 1672 01:23:22,160 --> 01:23:24,660 not because I'm grouping them together per se-- 1673 01:23:24,660 --> 01:23:28,190 I'm not adding a question mark, I'm not adding another + or a *-- 1674 01:23:28,190 --> 01:23:32,420 I'm using parentheses now for capturing purposes. 1675 01:23:32,420 --> 01:23:33,200 Why? 1676 01:23:33,200 --> 01:23:34,820 Well, I'm going to do this next.
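To make the distinction between capturing and non-capturing parentheses concrete, here is a small sketch. The variable names capturing and non_capturing are illustrative only; the patterns follow the one being built in the lecture.

```python
import re

name = "Malan, David"

# Two capturing groups: the text on both sides of the comma is returned.
capturing = re.search(r"^(.+), (.+)$", name)
print(capturing.groups())

# (?:...) groups without capturing, so only the second part is returned.
non_capturing = re.search(r"^(?:.+), (.+)$", name)
print(non_capturing.groups())
```

Both patterns match the same input; they differ only in how much of it is handed back through groups().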
1677 01:23:34,820 --> 01:23:38,690 I'm going to still ask a Boolean question like, "if there are matches, 1678 01:23:38,690 --> 01:23:40,320 then do this." 1679 01:23:40,320 --> 01:23:44,360 So if matches is not effectively false, like none, 1680 01:23:44,360 --> 01:23:47,720 I do expect I've gotten back some matches. 1681 01:23:47,720 --> 01:23:49,400 And watch what I can do now. 1682 01:23:49,400 --> 01:23:54,170 I can do last, comma, first equals whatever matches in 1683 01:23:54,170 --> 01:23:56,930 and get back all of the groups of matches. 1684 01:23:56,930 --> 01:24:00,020 Then go ahead and update name just like before with a format string 1685 01:24:00,020 --> 01:24:03,770 and do first and then last in curly braces 1686 01:24:03,770 --> 01:24:06,770 as well, and then at the very bottom, just like before, print out, 1687 01:24:06,770 --> 01:24:09,830 for instance, "hello," comma, "name." 1688 01:24:09,830 --> 01:24:13,970 So the new code now is everything highlighted here. 1689 01:24:13,970 --> 01:24:19,700 I'm using re.search to search for whether the user typed their name 1690 01:24:19,700 --> 01:24:21,620 in last, comma, first format. 1691 01:24:21,620 --> 01:24:27,440 But I am more powerfully using re.search to capture some of the user's input. 1692 01:24:27,440 --> 01:24:28,850 What's going to get captured? 1693 01:24:28,850 --> 01:24:31,400 Anything I surrounded in parentheses will 1694 01:24:31,400 --> 01:24:34,250 be returned to me as return values. 1695 01:24:34,250 --> 01:24:36,650 How do you get at those return values? 1696 01:24:36,650 --> 01:24:40,490 You ask the variable to which you assign them for all of the groups, 1697 01:24:40,490 --> 01:24:44,250 all of the groups of parentheses that were captured. 1698 01:24:44,250 --> 01:24:46,020 So let me go ahead and do this. 1699 01:24:46,020 --> 01:24:49,970 Let me go ahead now and run python of format.py, Enter. 1700 01:24:49,970 --> 01:24:51,950 And I'm going to type my name as usual. 
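Put together, the re.search version described above might look like the following sketch. As before, the logic is wrapped in a hypothetical format_name function instead of calling input(), so the wrapper and its name are assumptions rather than the lecture's exact code.

```python
import re


def format_name(name: str) -> str:
    """Rewrite 'Last, First' as 'First Last' using capture groups."""
    name = name.strip()
    # ^ and $ anchor the pattern to the whole input; each (.+) is captured.
    matches = re.search(r"^(.+), (.+)$", name)
    if matches:
        # groups() returns every captured group, in left-to-right order.
        last, first = matches.groups()
        name = f"{first} {last}"
    return name


print(f"hello, {format_name('David Malan')}")
print(f"hello, {format_name('Malan, David')}")
```

When the input has no comma, re.search returns None, the if is skipped, and the name is printed as typed.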
1701 01:24:51,950 --> 01:24:56,900 In this case, nothing happens with this if condition. 1702 01:24:56,900 --> 01:24:57,500 Why? 1703 01:24:57,500 --> 01:25:03,270 Because I did not type a comma, and so this search does not find a comma, 1704 01:25:03,270 --> 01:25:04,632 so there are no matches. 1705 01:25:04,632 --> 01:25:06,590 So we immediately just print out "hello, name." 1706 01:25:06,590 --> 01:25:08,370 Nothing interesting or new there. 1707 01:25:08,370 --> 01:25:12,920 But if I now go ahead, and clear my screen, and run python of format.py, 1708 01:25:12,920 --> 01:25:18,740 and do Malan, comma, space, David, Enter, we've reformatted my name. 1709 01:25:18,740 --> 01:25:19,940 Well, how did this work? 1710 01:25:19,940 --> 01:25:22,100 Let me be a little more explicit now. 1711 01:25:22,100 --> 01:25:24,560 It turns out I don't have to just say matches.groups. 1712 01:25:24,560 --> 01:25:28,020 I can get specific groups back that I want. 1713 01:25:28,020 --> 01:25:30,290 So let me change my code a little bit more. 1714 01:25:30,290 --> 01:25:33,470 Let me go ahead now and just say this. 1715 01:25:33,470 --> 01:25:36,620 Let's update name to-- 1716 01:25:36,620 --> 01:25:37,980 actually, let's do this. 1717 01:25:37,980 --> 01:25:42,530 Let's say that the last name is going to be in the matches 1718 01:25:42,530 --> 01:25:44,330 but specifically group 1. 1719 01:25:44,330 --> 01:25:48,020 The first name is going to be in the matches but specifically group 2. 1720 01:25:48,020 --> 01:25:49,100 Why 1 and 2? 1721 01:25:49,100 --> 01:25:52,490 Because this is the first set of parentheses to the left of the comma. 1722 01:25:52,490 --> 01:25:55,520 This is the second set of parentheses to the right of the comma. 1723 01:25:55,520 --> 01:25:58,700 And based on the input, this would be the user's last name 1724 01:25:58,700 --> 01:26:00,140 in this scenario, Malan. 1725 01:26:00,140 --> 01:26:03,560 This would be the user's first name, David, in this scenario. 
1726 01:26:03,560 --> 01:26:07,340 That's why I'm using group 1 for the last name 1727 01:26:07,340 --> 01:26:09,720 and group 2 for the first name. 1728 01:26:09,720 --> 01:26:16,100 And now I'm going to go ahead and say name equals fstring, again, first 1729 01:26:16,100 --> 01:26:18,980 and then last, done. 1730 01:26:18,980 --> 01:26:23,340 And let me refine this one last step before we take questions. 1731 01:26:23,340 --> 01:26:26,090 I don't really need these variables if I'm immediately using them. 1732 01:26:26,090 --> 01:26:28,423 Let's just go ahead and tighten this up further as we've 1733 01:26:28,423 --> 01:26:29,990 done in the past for design's sake. 1734 01:26:29,990 --> 01:26:32,722 If I want to make the name the concatenation 1735 01:26:32,722 --> 01:26:34,430 of the person's first name and last name, 1736 01:26:34,430 --> 01:26:37,970 let's just do this. matches.group 2 first, 1737 01:26:37,970 --> 01:26:43,400 plus a space, plus matches.group 1. 1738 01:26:43,400 --> 01:26:46,910 So it's just up to me from left to right, this is group 1, 1739 01:26:46,910 --> 01:26:47,630 this is group 2. 1740 01:26:47,630 --> 01:26:51,000 So group 1 is last, group 2 is first. 1741 01:26:51,000 --> 01:26:54,860 So if I want to flip them around and update the value of name, 1742 01:26:54,860 --> 01:27:00,290 I can explicitly get group 2 first, concatenate using +, a single space, 1743 01:27:00,290 --> 01:27:03,540 and then concatenate on group 1. 1744 01:27:03,540 --> 01:27:04,170 All right. 1745 01:27:04,170 --> 01:27:05,280 That was a lot. 1746 01:27:05,280 --> 01:27:07,620 Let me pause to see if there are questions. 
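The tightened version, using numbered groups and concatenation, can be sketched as follows. The input is hard-coded here for illustration. In Python's re module, group(0) holds the entire matched text, which is why the captured groups are numbered from 1.

```python
import re

name = "Malan, David"

matches = re.search(r"^(.+), (.+)$", name)
if matches:
    whole = matches.group(0)  # the entire match, not a captured group
    # Group 1 is the last name, group 2 the first name, so flip them.
    name = matches.group(2) + " " + matches.group(1)

print(f"hello, {name}")
```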
1747 01:27:07,620 --> 01:27:11,670 The key difference here is we're still using re.search the exact same way, 1748 01:27:11,670 --> 01:27:15,090 but now I'm using its return value, not just to answer 1749 01:27:15,090 --> 01:27:17,400 a question true or false, but to actually 1750 01:27:17,400 --> 01:27:21,750 get back specific matches anything I captured, so to speak, 1751 01:27:21,750 --> 01:27:23,190 with parentheses. 1752 01:27:23,190 --> 01:27:26,270 AUDIENCE: Why is it here we're using 1 and 2 instead of 0 and 1 1753 01:27:26,270 --> 01:27:27,270 for capturing the first? 1754 01:27:27,270 --> 01:27:29,010 DAVID MALAN: Really good question. 1755 01:27:29,010 --> 01:27:30,060 A good observation. 1756 01:27:30,060 --> 01:27:32,070 In almost every other context, we've started 1757 01:27:32,070 --> 01:27:35,250 counting at 0 and 1 instead of 1 and 2. 1758 01:27:35,250 --> 01:27:38,190 It turns out there's something else in location 0 1759 01:27:38,190 --> 01:27:41,530 when it comes back from re.search related to the string itself. 1760 01:27:41,530 --> 01:27:45,000 So according to the documentation of this function only, 1761 01:27:45,000 --> 01:27:49,110 1 is the first set of parentheses, and 2 is the second set, 1762 01:27:49,110 --> 01:27:50,460 and onward from there. 1763 01:27:50,460 --> 01:27:52,540 Just a different convention here. 1764 01:27:52,540 --> 01:27:53,580 Other questions? 1765 01:27:53,580 --> 01:27:59,820 AUDIENCE: What if we write nothing, like whitespace, comma, whitespace? 1766 01:27:59,820 --> 01:28:03,317 How do we check truth of condition? 1767 01:28:03,317 --> 01:28:05,400 DAVID MALAN: Before I answer directly, let me just 1768 01:28:05,400 --> 01:28:07,733 run this and make sure I've not broken anything further. 1769 01:28:07,733 --> 01:28:09,360 Let me run python of format.py. 1770 01:28:09,360 --> 01:28:12,060 Let me type in David, space, Malan, the right way. 1771 01:28:12,060 --> 01:28:13,200 Let me run it once more. 
1772 01:28:13,200 --> 01:28:16,650 Let me type in Malan, comma, David, the wrong way that we're fixing. 1773 01:28:16,650 --> 01:28:17,850 And we're still good. 1774 01:28:17,850 --> 01:28:19,410 But I think it will still break. 1775 01:28:19,410 --> 01:28:23,610 Let me run it a third time with Malan, comma, David with no space. 1776 01:28:23,610 --> 01:28:26,190 And now it's still broken. 1777 01:28:26,190 --> 01:28:26,790 Why? 1778 01:28:26,790 --> 01:28:30,930 Because I'm still looking for comma space. 1779 01:28:30,930 --> 01:28:32,220 Now, how can I fix that? 1780 01:28:32,220 --> 01:28:35,070 One way I could do that is to add a question mark here, which again, 1781 01:28:35,070 --> 01:28:37,510 is zero or one of the thing before. 1782 01:28:37,510 --> 01:28:40,950 So if I have a space and then a question mark literally, no need for any 1783 01:28:40,950 --> 01:28:46,290 parentheses, then I can literally tolerate both Malan, comma, space, 1784 01:28:46,290 --> 01:28:48,610 David or Malan, comma, David. 1785 01:28:48,610 --> 01:28:49,680 So let's try again. 1786 01:28:49,680 --> 01:28:51,120 Before, this did not work. 1787 01:28:51,120 --> 01:28:53,310 Let's do Malan, comma, David with no space. 1788 01:28:53,310 --> 01:28:55,990 Now it does actually work. 1789 01:28:55,990 --> 01:28:58,740 So we can tolerate different amounts of whitespace 1790 01:28:58,740 --> 01:29:01,890 if I am a little more precise with my formula. 1791 01:29:01,890 --> 01:29:03,420 Let me go ahead and try once more. 1792 01:29:03,420 --> 01:29:07,260 Let me very weirdly but possibly hit the space bar a few too many times 1793 01:29:07,260 --> 01:29:08,850 so now they're really separated. 1794 01:29:08,850 --> 01:29:13,020 This, again, is not going to work quite right, because it's going 1795 01:29:13,020 --> 01:29:15,160 to consume all of that whitespace.
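A short sketch of how the optional-space pattern behaves on the three inputs just tried. The list of inputs and the names pattern and results are illustrative; the third input, with three spaces, is an assumed stand-in for hitting the space bar a few too many times.

```python
import re

pattern = r"^(.+), ?(.+)$"  # comma, then zero or one space

results = []
for name in ["Malan, David", "Malan,David", "Malan,   David"]:
    matches = re.search(pattern, name)
    results.append(matches.group(2) + " " + matches.group(1))

for result in results:
    # repr makes any leftover leading whitespace visible.
    print(repr(result))
```

With the third input, only one of the spaces is matched by the optional space, so the rest end up inside the second captured group and at the front of the reformatted name.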
1796 01:29:15,160 --> 01:29:18,420 So now I might want to strip, left and right, any 1797 01:29:18,420 --> 01:29:21,720 of the leading white space on the result. Or what I could do here 1798 01:29:21,720 --> 01:29:22,930 is say this. 1799 01:29:22,930 --> 01:29:29,670 Instead of zero or one, I could use a * here, so space *. 1800 01:29:29,670 --> 01:29:33,000 And now if I run this once more with Malan, comma, space, space, space, 1801 01:29:33,000 --> 01:29:35,920 David, Enter, now we've cleaned up things further. 1802 01:29:35,920 --> 01:29:39,510 So you can imagine, depending on how messy the data is that you're 1803 01:29:39,510 --> 01:29:41,550 cleaning up, your regular expressions might need 1804 01:29:41,550 --> 01:29:43,500 to get more and more sophisticated. 1805 01:29:43,500 --> 01:29:46,830 It really depends on just how many problems we want to solve at once. 1806 01:29:46,830 --> 01:29:51,900 Well, allow me to propose that we forge ahead further just to clean this up 1807 01:29:51,900 --> 01:29:53,940 even more so, using a feature that's actually 1808 01:29:53,940 --> 01:29:56,430 relatively new to Python itself. 1809 01:29:56,430 --> 01:29:59,220 It is very common when using regular expressions 1810 01:29:59,220 --> 01:30:03,210 to do exactly what I've done here-- to call a function like re.search 1811 01:30:03,210 --> 01:30:07,300 with capturing parentheses inside, such that you get back a return 1812 01:30:07,300 --> 01:30:10,050 value that I'm calling matches-- you could call it something else, 1813 01:30:10,050 --> 01:30:12,090 but I'm calling it by default matches. 1814 01:30:12,090 --> 01:30:15,690 And then notice on the next line, I'm saying "if matches." 1815 01:30:15,690 --> 01:30:19,080 Wouldn't it be nice if I could just tighten things up further and do these 1816 01:30:19,080 --> 01:30:20,700 all on the same line? 1817 01:30:20,700 --> 01:30:23,070 Well, you can sort of. 1818 01:30:23,070 --> 01:30:24,850 Let me go ahead and do this. 
1819 01:30:24,850 --> 01:30:26,340 Let me get rid of this if. 1820 01:30:26,340 --> 01:30:28,500 And let me just try to say something like this. 1821 01:30:28,500 --> 01:30:32,370 If matches equals re.search and then colon-- 1822 01:30:32,370 --> 01:30:39,090 so combining my if condition into just one line instead of those two. 1823 01:30:39,090 --> 01:30:43,455 In C, or C++, or Java, you would actually do something like this, 1824 01:30:43,455 --> 01:30:45,330 surrounding the whole thing with parentheses, 1825 01:30:45,330 --> 01:30:47,550 sometimes double sets to suppress any warnings, 1826 01:30:47,550 --> 01:30:49,980 if you want to do two things at once. 1827 01:30:49,980 --> 01:30:55,530 If you want to not only assign the return value of re.search 1828 01:30:55,530 --> 01:30:58,080 to a variable called matches, but you want 1829 01:30:58,080 --> 01:31:03,408 to subsequently ask a Boolean question, is this effectively true or false. 1830 01:31:03,408 --> 01:31:04,950 That's what I was doing a moment ago. 1831 01:31:04,950 --> 01:31:06,060 Let me undo this. 1832 01:31:06,060 --> 01:31:08,430 A moment ago, I was getting back the return value 1833 01:31:08,430 --> 01:31:12,090 and assigning it to matches, and then I was asking the question. 1834 01:31:12,090 --> 01:31:16,530 Well, it turns out this need to have two lines of code presumably rubbed 1835 01:31:16,530 --> 01:31:18,840 people wrong for too long in Python. 1836 01:31:18,840 --> 01:31:22,170 And so you can now combine these two kinds of lines into one. 1837 01:31:22,170 --> 01:31:24,450 But you need a new operator. 1838 01:31:24,450 --> 01:31:27,720 You cannot just say, "if matches equals re.search" 1839 01:31:27,720 --> 01:31:29,580 and then in a colon at the end. 1840 01:31:29,580 --> 01:31:32,170 You instead need to do this. 
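The combined form uses the := operator, which both assigns and yields the assigned value so it can be tested in the same if. The sketch below assumes Python 3.8 or later, where this operator is available, and again uses a hypothetical format_name wrapper instead of input(). The pattern tolerates any number of spaces after the comma.

```python
import re


def format_name(name: str) -> str:
    """Rewrite 'Last, First' as 'First Last', assigning and testing at once."""
    name = name.strip()
    # := assigns the result of re.search to matches, then if tests it.
    if matches := re.search(r"^(.+), *(.+)$", name):
        name = matches.group(2) + " " + matches.group(1)
    return name


print(f"hello, {format_name('Malan,   David')}")
```

A plain = inside the if condition is a syntax error in Python, which is why a separate operator is needed.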
1841 01:31:32,170 --> 01:31:38,130 You need to do colon equals if and only if you want to assign something 1842 01:31:38,130 --> 01:31:42,390 from right to left and you want to ask an if or an elif 1843 01:31:42,390 --> 01:31:44,820 question on the same line. 1844 01:31:44,820 --> 01:31:48,870 This is affectionately known, as you can see here, as the walrus operator. 1845 01:31:48,870 --> 01:31:51,480 And it's new to Python in recent years. 1846 01:31:51,480 --> 01:31:56,280 And it both allows you to assign a value as I'm doing from right to left, 1847 01:31:56,280 --> 01:32:00,180 and ask a Boolean question about it, like I'm 1848 01:32:00,180 --> 01:32:02,960 doing with the if or equivalently elif. 1849 01:32:02,960 --> 01:32:06,650 Does anyone know why this is called the walrus operator? 1850 01:32:06,650 --> 01:32:09,920 If you kind of look at it like this, perhaps, 1851 01:32:09,920 --> 01:32:14,040 if you're familiar with walruses, it kind of sort of looks like a walrus. 1852 01:32:14,040 --> 01:32:17,720 So a minor detail but a relatively new feature of Python that honestly, you'll 1853 01:32:17,720 --> 01:32:21,170 probably continue to see online, and in source code, and in textbooks, 1854 01:32:21,170 --> 01:32:24,300 and so forth, increasingly so now that it does exist. 1855 01:32:24,300 --> 01:32:25,910 It does not change the logic at all. 1856 01:32:25,910 --> 01:32:29,660 If I run python of format.py and type Malan, comma, space, David, 1857 01:32:29,660 --> 01:32:33,750 it still fixes things, but it's tightened up my code just a bit more. 1858 01:32:33,750 --> 01:32:34,250 All right. 1859 01:32:34,250 --> 01:32:37,010 Let's go ahead and look at one final problem 1860 01:32:37,010 --> 01:32:40,470 to solve, that of extracting information now as well. 1861 01:32:40,470 --> 01:32:43,460 So at this point, we've now validated the user's input 1862 01:32:43,460 --> 01:32:46,160 by checking whether or not it meets a certain pattern.
1863 01:32:46,160 --> 01:32:49,100 We've cleaned up the user's input by checking 1864 01:32:49,100 --> 01:32:51,470 against a pattern, whether it matches or not, and if it 1865 01:32:51,470 --> 01:32:54,350 does match, we kind of reorganize some of the user's information 1866 01:32:54,350 --> 01:32:57,800 so we can clean up their input and standardize the format in which we're 1867 01:32:57,800 --> 01:32:59,540 storing or printing it, in this case. 1868 01:32:59,540 --> 01:33:03,350 Let's do one final example where we're very specifically extracting 1869 01:33:03,350 --> 01:33:06,440 information in order to answer some question. 1870 01:33:06,440 --> 01:33:07,830 So let me propose this. 1871 01:33:07,830 --> 01:33:12,650 Let me go ahead and close format.py and create a new file called twitter.py, 1872 01:33:12,650 --> 01:33:17,690 the goal of which is to prompt users for the URL of their Twitter profile 1873 01:33:17,690 --> 01:33:23,562 and extract from it, infer from that URL, what is the user's username. 1874 01:33:23,562 --> 01:33:25,020 Now, why might you want to do this? 1875 01:33:25,020 --> 01:33:28,228 Well, one, you might want users to be able to just very easily copy and paste 1876 01:33:28,228 --> 01:33:32,330 the URL from their own Twitter profile into your form, into your app, 1877 01:33:32,330 --> 01:33:36,140 so that you can figure out what their username is. 1878 01:33:36,140 --> 01:33:40,430 Or you might have a form that asks the user for their Twitter username, 1879 01:33:40,430 --> 01:33:43,400 and because people aren't necessarily paying very close attention, 1880 01:33:43,400 --> 01:33:45,530 some people type their username. 1881 01:33:45,530 --> 01:33:49,340 Some people type their whole URL or something else altogether. 
1882 01:33:49,340 --> 01:33:51,350 It would be nice now that you're a programmer 1883 01:33:51,350 --> 01:33:53,780 to just be more tolerant of different types of input 1884 01:33:53,780 --> 01:33:58,100 and just take on the burden of canonicalizing, standardizing the data, 1885 01:33:58,100 --> 01:34:00,140 but being flexible with the users. 1886 01:34:00,140 --> 01:34:03,500 It's arguably a better user experience if you just let me copy-paste 1887 01:34:03,500 --> 01:34:05,660 or type in what I want, you clean it up. 1888 01:34:05,660 --> 01:34:07,550 You're the programmer not me. 1889 01:34:07,550 --> 01:34:09,920 Lends for a better experience, perhaps. 1890 01:34:09,920 --> 01:34:12,620 Well, let me go ahead and do this with twitter.py. 1891 01:34:12,620 --> 01:34:17,120 Let me first go ahead and prompt the user here for a value for a variable 1892 01:34:17,120 --> 01:34:21,702 that I'll call url, and just ask them to input the URL of their Twitter profile. 1893 01:34:21,702 --> 01:34:23,660 I'm going to go ahead and strip off any leading 1894 01:34:23,660 --> 01:34:26,810 or trailing whitespace, just in case users accidentally hit the spacebar. 1895 01:34:26,810 --> 01:34:29,940 That's literally the least I can do quite easily. 1896 01:34:29,940 --> 01:34:32,100 But now let's go ahead and do this. 1897 01:34:32,100 --> 01:34:37,185 Suppose that the user's address is the following. 1898 01:34:37,185 --> 01:34:38,810 Let me print out what did they type in. 1899 01:34:38,810 --> 01:34:41,190 And let me clear my screen and run python of twitter.py. 1900 01:34:41,190 --> 01:34:43,190 I'm going to go ahead and type in, for instance, 1901 01:34:43,190 --> 01:34:50,240 https://twitter.com/davidjmalan, which happens to be my own Twitter username. 1902 01:34:50,240 --> 01:34:53,090 For now, we're just going to print it back onto the screen just 1903 01:34:53,090 --> 01:34:54,640 to make sure I've not messed up yet. 1904 01:34:54,640 --> 01:34:55,140 OK. 
1905 01:34:55,140 --> 01:34:57,260 So I've printed back out the exact same URL. 1906 01:34:57,260 --> 01:35:01,310 But the goal at hand is to extract the username only. 1907 01:35:01,310 --> 01:35:05,060 Now, let me just ask, perhaps, a straightforward question. 1908 01:35:05,060 --> 01:35:09,830 Logically, what do I need to do to get at the user's username? 1909 01:35:09,830 --> 01:35:13,880 AUDIENCE: Well, we just ignore what's before the username 1910 01:35:13,880 --> 01:35:16,065 and then just extract the username? 1911 01:35:16,065 --> 01:35:16,940 DAVID MALAN: Perfect. 1912 01:35:16,940 --> 01:35:18,380 Yeah, I mean, it is as simple as that. 1913 01:35:18,380 --> 01:35:20,720 If you know the username is at the end, well, let's just 1914 01:35:20,720 --> 01:35:22,920 somehow ignore everything to the beginning. 1915 01:35:22,920 --> 01:35:24,170 Well, what's at the beginning? 1916 01:35:24,170 --> 01:35:25,130 Well, it's a URL. 1917 01:35:25,130 --> 01:35:30,890 So we're probably going to need to ignore an HTTPS, a ://, a twitter.com, 1918 01:35:30,890 --> 01:35:31,910 and a /. 1919 01:35:31,910 --> 01:35:33,840 So we just want to throw all of that away. 1920 01:35:33,840 --> 01:35:34,340 Why? 1921 01:35:34,340 --> 01:35:37,400 Because if it's an URL, we know by how Twitter works 1922 01:35:37,400 --> 01:35:39,240 that the username comes at the end. 1923 01:35:39,240 --> 01:35:43,418 So let's use that very simple idea to get at the information we want. 1924 01:35:43,418 --> 01:35:45,210 I'm going to try this a few different ways. 1925 01:35:45,210 --> 01:35:46,620 Let me go back into my program here. 1926 01:35:46,620 --> 01:35:49,820 And instead of just printing it out, which was just to see what's going on, 1927 01:35:49,820 --> 01:35:50,880 let me do this. 1928 01:35:50,880 --> 01:35:53,180 Let me create a new variable called username. 1929 01:35:53,180 --> 01:35:56,810 And let me call url.replace. 
1930 01:35:56,810 --> 01:36:01,340 It turns out that if URL is a string or a str in Python, 1931 01:36:01,340 --> 01:36:05,840 it, again, comes with multiple methods, like strip, and split, 1932 01:36:05,840 --> 01:36:08,750 and others as well, one of which is called replace. 1933 01:36:08,750 --> 01:36:10,400 And replace will do just that. 1934 01:36:10,400 --> 01:36:14,360 You pass it two arguments, the first of which is, what do you want to replace? 1935 01:36:14,360 --> 01:36:17,640 The second argument is, what do you want to replace it with? 1936 01:36:17,640 --> 01:36:19,940 So if I want to get rid of, as I've proposed, 1937 01:36:19,940 --> 01:36:21,740 really just everything before the username, 1938 01:36:21,740 --> 01:36:26,090 that is, the Twitter URL or the beginning thereof, let's just say this. 1939 01:36:26,090 --> 01:36:31,520 Go ahead and replace "https://twitter.com/", 1940 01:36:31,520 --> 01:36:34,340 close quote, that's what I want to replace. 1941 01:36:34,340 --> 01:36:37,160 And comma, second argument, what do you want to replace it with? 1942 01:36:37,160 --> 01:36:37,880 Nothing. 1943 01:36:37,880 --> 01:36:40,100 So I'm literally going to pass in quote unquote 1944 01:36:40,100 --> 01:36:42,190 to effectively do a find and replace. 1945 01:36:42,190 --> 01:36:44,690 That's what the replace method does, just like you can do it 1946 01:36:44,690 --> 01:36:46,100 in Microsoft Word or Google Docs. 1947 01:36:46,100 --> 01:36:49,280 This is the programmer's way of doing find and replace. 1948 01:36:49,280 --> 01:36:52,940 Now let me go ahead and print out just the username. 1949 01:36:52,940 --> 01:36:54,780 So I'll use an f-string like this. 1950 01:36:54,780 --> 01:36:57,590 I'll say username, colon, and then in curly braces, 1951 01:36:57,590 --> 01:36:59,700 username, just to format it nicely. 1952 01:36:59,700 --> 01:37:00,200 All right.
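A minimal sketch of twitter.py at this point, assuming the find-and-replace is wrapped in a function so it can be run without typing at a prompt. The name extract_username is hypothetical; the lecture's version calls input and stores the result in url.

```python
def extract_username(url: str) -> str:
    """Hypothetical helper: remove the literal Twitter prefix wherever it occurs."""
    # strip() drops leading/trailing whitespace; replace() swaps the prefix for "".
    return url.strip().replace("https://twitter.com/", "")


username = extract_username("https://twitter.com/davidjmalan")
print(f"Username: {username}")
```

Because replace looks for that exact text, an input that starts with http:// instead of https:// would come back unchanged.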
1953 01:37:00,200 --> 01:37:04,410 Let me go ahead and clear my screen and run python of twitter.py, Enter, URL. 1954 01:37:04,410 --> 01:37:12,580 Here we go. https://twitter.com/davidjmalan, Enter. 1955 01:37:12,580 --> 01:37:13,300 OK. 1956 01:37:13,300 --> 01:37:15,040 Now we've made some progress. 1957 01:37:15,040 --> 01:37:17,360 Done for the day, right? 1958 01:37:17,360 --> 01:37:19,580 Well, what is suboptimal about this? 1959 01:37:19,580 --> 01:37:24,150 Can anyone critique or find fault with my program? 1960 01:37:24,150 --> 01:37:27,950 It is working now, but it's a little fragile. 1961 01:37:27,950 --> 01:37:31,880 I bet we could contrive some scenarios where I think it works but it doesn't. 1962 01:37:31,880 --> 01:37:33,890 AUDIENCE: Well, I have a few ideas, actually. 1963 01:37:33,890 --> 01:37:39,980 Well, first of all, if we don't specify HTTPS, it will be broken. 1964 01:37:39,980 --> 01:37:44,760 Secondly, if we have a slash at the end, it also will be broken. 1965 01:37:44,760 --> 01:37:48,320 If we have a question mark or something after question mark, 1966 01:37:48,320 --> 01:37:49,590 it also won't work. 1967 01:37:49,590 --> 01:37:51,160 So a lot of scenarios, actually. 1968 01:37:51,160 --> 01:37:52,160 DAVID MALAN: Oh, my god. 1969 01:37:52,160 --> 01:37:52,993 I mean, here we are. 1970 01:37:52,993 --> 01:37:54,650 I was pretending to think I was done. 1971 01:37:54,650 --> 01:37:57,920 But my god, like, Alex gave us a whole laundry list of problems. 1972 01:37:57,920 --> 01:38:01,700 And just to recap, then, what if it's not HTTPS, it's HTTP? 1973 01:38:01,700 --> 01:38:03,590 Slightly less secure, but I should still be 1974 01:38:03,590 --> 01:38:05,713 able to tolerate that programmatically. 1975 01:38:05,713 --> 01:38:07,130 What if the protocol is not there? 1976 01:38:07,130 --> 01:38:09,740 What if the user just typed twitter.com/davidjmalan? 
1977 01:38:09,740 --> 01:38:12,680 It would be nice to tolerate that rather than show an error 1978 01:38:12,680 --> 01:38:14,150 and make me type in the protocol. 1979 01:38:14,150 --> 01:38:14,660 Why? 1980 01:38:14,660 --> 01:38:16,050 It's not good user experience. 1981 01:38:16,050 --> 01:38:20,030 What if it had a slash at the end of the username, or a question mark? 1982 01:38:20,030 --> 01:38:22,500 If you think about URLs you've seen on the web, 1983 01:38:22,500 --> 01:38:24,920 there's very commonly more information, especially 1984 01:38:24,920 --> 01:38:26,540 if it's been shared on social media. 1985 01:38:26,540 --> 01:38:28,640 There might be HTTP parameters, so to speak, 1986 01:38:28,640 --> 01:38:30,230 just stuff there that we don't want. 1987 01:38:30,230 --> 01:38:34,880 There could be a www.twitter.com, which I'm also not expecting but does 1988 01:38:34,880 --> 01:38:37,360 work if you go to that URL, too. 1989 01:38:37,360 --> 01:38:39,540 So there's just so many things that can go wrong. 1990 01:38:39,540 --> 01:38:43,010 And even if I come back to my contrived example from earlier, 1991 01:38:43,010 --> 01:38:45,350 what if I run this program and say this-- 1992 01:38:45,350 --> 01:38:52,610 "my username is https://twitter.com/davidjmalan," 1993 01:38:52,610 --> 01:38:53,540 Enter. 1994 01:38:53,540 --> 01:38:58,570 Well, that too just didn't really work-- it got rid of the-- actually-- 1995 01:38:58,570 --> 01:39:01,730 [LAUGHS] OK, actually that kind of worked. 1996 01:39:01,730 --> 01:39:05,390 But the goal here is to actually get the user's username, 1997 01:39:05,390 --> 01:39:08,210 not an English sentence describing the user's username. 1998 01:39:08,210 --> 01:39:11,150 So I would argue that even though I just accidentally created 1999 01:39:11,150 --> 01:39:13,670 perfectly correct English grammar, I did not 2000 01:39:13,670 --> 01:39:15,860 extract the Twitter username correctly.
2001 01:39:15,860 --> 01:39:19,890 I don't want words like "my username is" as part of my input. 2002 01:39:19,890 --> 01:39:22,940 So how can we go about improving this, and maybe chipping away 2003 01:39:22,940 --> 01:39:24,530 at some of those problems one by one? 2004 01:39:24,530 --> 01:39:26,280 Well, let me clear my screen here. 2005 01:39:26,280 --> 01:39:27,780 Let me come back up to my code. 2006 01:39:27,780 --> 01:39:31,640 And let me not just replace it, but let me do something else instead. 2007 01:39:31,640 --> 01:39:34,040 I'm going to go ahead, and instead of using replace, 2008 01:39:34,040 --> 01:39:36,950 I'm going to use another function called removeprefix. 2009 01:39:36,950 --> 01:39:42,060 A prefix is a string or a substring that comes at the start of another. 2010 01:39:42,060 --> 01:39:45,320 So if I remove prefix, I don't need a second argument for this function. 2011 01:39:45,320 --> 01:39:46,220 I just need one. 2012 01:39:46,220 --> 01:39:48,540 What prefix do you want to remove? 2013 01:39:48,540 --> 01:39:51,680 So this will at least now fix the problem I just 2014 01:39:51,680 --> 01:39:54,860 described of typing in like a whole sentence, where the URL is there, 2015 01:39:54,860 --> 01:39:57,600 but it's not at the beginning, it's only at the end. 2016 01:39:57,600 --> 01:39:59,930 So here, this still is not correct. 2017 01:39:59,930 --> 01:40:04,100 But we don't create this weird-looking output that just removes the URL part 2018 01:40:04,100 --> 01:40:05,360 of the input-- 2019 01:40:05,360 --> 01:40:11,330 "my username is https://twitter.com/davidjmalan." 2020 01:40:11,330 --> 01:40:16,700 A moment ago, it did remove the URL and left only the davidjmalan. 2021 01:40:16,700 --> 01:40:17,990 This is not perfect still. 2022 01:40:17,990 --> 01:40:21,830 But at least now, it does not weirdly remove the URL 2023 01:40:21,830 --> 01:40:23,030 and then leave the English. 2024 01:40:23,030 --> 01:40:24,420 It's just leaving it alone. 
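A short sketch contrasting the two string methods on the sentence-style input, assuming Python 3.9 or later (where str.removeprefix is available). The function names with_replace and with_removeprefix are hypothetical.

```python
PREFIX = "https://twitter.com/"


def with_replace(url: str) -> str:
    # replace() removes the prefix text anywhere in the string.
    return url.strip().replace(PREFIX, "")


def with_removeprefix(url: str) -> str:
    # removeprefix() removes the text only when the string starts with it.
    return url.strip().removeprefix(PREFIX)


sentence = "my username is https://twitter.com/davidjmalan"
print(with_replace(sentence))       # my username is davidjmalan
print(with_removeprefix(sentence))  # my username is https://twitter.com/davidjmalan
```

Neither result is a bare username, but removeprefix at least leaves unexpected input untouched instead of partially rewriting it.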
2025 01:40:24,420 --> 01:40:26,600 So maybe I could handle this better, but at least 2026 01:40:26,600 --> 01:40:30,710 it's removing it from the part of the string I might anticipate. 2027 01:40:30,710 --> 01:40:32,550 Well, what else could we do here? 2028 01:40:32,550 --> 01:40:35,180 Well, it turns out that regular expressions just 2029 01:40:35,180 --> 01:40:37,940 let us express patterns much more precisely. 2030 01:40:37,940 --> 01:40:41,180 We could spend all day using a whole bunch of different Python functions 2031 01:40:41,180 --> 01:40:44,810 like removeprefix, or remove, and strip, and others, and kind of 2032 01:40:44,810 --> 01:40:47,240 make our way to the right solution. 2033 01:40:47,240 --> 01:40:50,310 But a regular expression just allows you to more succinctly, 2034 01:40:50,310 --> 01:40:55,040 if admittedly more cryptically, express these kinds of patterns and goals. 2035 01:40:55,040 --> 01:40:57,260 And we've seen from parentheses, which can 2036 01:40:57,260 --> 01:41:00,170 be used not just to group symbols together as sets 2037 01:41:00,170 --> 01:41:05,180 but to capture information as well, we have a very powerful tool now 2038 01:41:05,180 --> 01:41:06,630 in our toolkit. 2039 01:41:06,630 --> 01:41:07,800 So let me do this. 2040 01:41:07,800 --> 01:41:12,530 Let me go ahead and start fresh here and import the re library 2041 01:41:12,530 --> 01:41:14,450 as before at the very top of my program. 2042 01:41:14,450 --> 01:41:17,900 I'm still going to get the user's URL via the same line of code. 2043 01:41:17,900 --> 01:41:20,970 But I'm now going to use another function as well. 2044 01:41:20,970 --> 01:41:24,950 It turns out that there's not just re.search, or re.match, 2045 01:41:24,950 --> 01:41:26,060 or re.fullmatch. 2046 01:41:26,060 --> 01:41:30,860 There's also re.sub in the regular expression library, where "sub" here 2047 01:41:30,860 --> 01:41:32,000 means "substitute." 
2048 01:41:32,000 --> 01:41:35,220 And it takes more arguments, but they're fairly straightforward. 2049 01:41:35,220 --> 01:41:38,990 The first argument to re.sub is the pattern, the regular expression 2050 01:41:38,990 --> 01:41:40,280 that you want to look for. 2051 01:41:40,280 --> 01:41:43,160 Then you have a replacement string-- what do 2052 01:41:43,160 --> 01:41:45,470 you want to replace that pattern with? 2053 01:41:45,470 --> 01:41:47,390 And where do you want to do all that? 2054 01:41:47,390 --> 01:41:51,265 Well, you pass in the string that you want to do the substitution on. 2055 01:41:51,265 --> 01:41:54,140 Then there's some other arguments that I'll wave my hands at for now. 2056 01:41:54,140 --> 01:41:56,240 Among them are those same flags and also a count, 2057 01:41:56,240 --> 01:41:58,970 like how many times do you want to do find and replace? 2058 01:41:58,970 --> 01:42:01,670 Do you want it to do all, do you want to do just one, 2059 01:42:01,670 --> 01:42:04,070 or so forth you can have further control there, too, 2060 01:42:04,070 --> 01:42:06,770 just like you would in Google Docs or Microsoft Word. 2061 01:42:06,770 --> 01:42:10,160 Well, let me go back to my code here, and let me do this. 2062 01:42:10,160 --> 01:42:15,020 I'm going to go ahead and call re not search but re.sub for substitute. 2063 01:42:15,020 --> 01:42:18,320 I'm going to pass in the following regular expression, 2064 01:42:18,320 --> 01:42:25,610 "https://twitter.com/" and then I'm going to close my quote. 2065 01:42:25,610 --> 01:42:27,860 And now what do I want to replace that with? 2066 01:42:27,860 --> 01:42:31,460 Well, like before with the simple str replace function, 2067 01:42:31,460 --> 01:42:34,380 I want to replace it with nothing, just get rid of it altogether. 2068 01:42:34,380 --> 01:42:37,730 But what string do I want to pass in to do this to? 2069 01:42:37,730 --> 01:42:39,810 The URL from the user. 
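A small sketch of the argument order just described: pattern, replacement, string, then the optional count. The sample string "a-b-c" is an assumption chosen only to make the effect of count visible.

```python
import re

# Standard-library signature: re.sub(pattern, repl, string, count=0, flags=0)
sample = "a-b-c"  # hypothetical input, not from the lecture

all_replaced = re.sub("-", "+", sample)             # default count=0 replaces every match
first_replaced = re.sub("-", "+", sample, count=1)  # replaces only the first match

print(all_replaced)    # a+b+c
print(first_replaced)  # a+b-c
```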
2070 01:42:39,810 --> 01:42:44,360 And now let me go ahead and assign the return value of re.sub 2071 01:42:44,360 --> 01:42:46,100 to a variable called username. 2072 01:42:46,100 --> 01:42:49,460 So re.sub's purpose in life is, again, to substitute 2073 01:42:49,460 --> 01:42:52,490 some value for some regular expression some number of times. 2074 01:42:52,490 --> 01:42:56,360 It essentially is find and replace using regular expressions. 2075 01:42:56,360 --> 01:42:59,090 And it returns to you the resulting string 2076 01:42:59,090 --> 01:43:01,400 once you've done all those substitutions. 2077 01:43:01,400 --> 01:43:04,850 So now the very last line of my code can be the same as before, print-- 2078 01:43:04,850 --> 01:43:08,960 and I'll use an f-string, username, colon, and then in curly braces, 2079 01:43:08,960 --> 01:43:09,590 username. 2080 01:43:09,590 --> 01:43:12,300 So I can print out literally just that. 2081 01:43:12,300 --> 01:43:12,800 All right. 2082 01:43:12,800 --> 01:43:14,300 Let's try this and see what happens. 2083 01:43:14,300 --> 01:43:17,390 I'll clear my terminal window, run python of twitter.py. 2084 01:43:17,390 --> 01:43:23,690 And here we go, https://twitter.com/davidjmalan. 2085 01:43:23,690 --> 01:43:25,940 Cross my fingers and hit Enter. 2086 01:43:25,940 --> 01:43:28,580 OK, now we're in business. 2087 01:43:28,580 --> 01:43:30,560 But it is still a little fragile. 2088 01:43:30,560 --> 01:43:34,730 And so let me ask the group, what problem should I now 2089 01:43:34,730 --> 01:43:36,125 further chip away at? 2090 01:43:36,125 --> 01:43:38,000 They've been said before, but let's be clear. 2091 01:43:38,000 --> 01:43:40,460 What's one or more problems that still remain? 2092 01:43:40,460 --> 01:43:44,690 AUDIENCE: The protocols and the domain prefix [INAUDIBLE]. 2093 01:43:44,690 --> 01:43:45,440 DAVID MALAN: Good. 2094 01:43:45,440 --> 01:43:48,020 The protocols, so HTTP versus HTTPS.
2095 01:43:48,020 --> 01:43:51,980 Maybe the subdomain, www, should it be there or not? 2096 01:43:51,980 --> 01:43:54,200 And there's a few other mistakes here, too. 2097 01:43:54,200 --> 01:43:55,770 Let me actually stay with the group. 2098 01:43:55,770 --> 01:43:59,600 What are some other shortcomings of this current solution? 2099 01:43:59,600 --> 01:44:03,590 AUDIENCE: If we use a phrase like you do before, 2100 01:44:03,590 --> 01:44:07,940 we are going to have the same problem, because it's not taking account 2101 01:44:07,940 --> 01:44:11,150 in the first part of the text example. 2102 01:44:11,150 --> 01:44:11,900 DAVID MALAN: Good. 2103 01:44:11,900 --> 01:44:16,220 I might still allow for some words, some English to the left of the URL 2104 01:44:16,220 --> 01:44:17,810 because I didn't use my ^ symbol. 2105 01:44:17,810 --> 01:44:18,770 So I'll fix that. 2106 01:44:18,770 --> 01:44:22,450 And any final observations on shortcomings here? 2107 01:44:22,450 --> 01:44:26,993 AUDIENCE: Well, it could be an HTTP, or there could be less than two slashes. 2108 01:44:26,993 --> 01:44:27,660 DAVID MALAN: OK. 2109 01:44:27,660 --> 01:44:28,493 So it could be HTTP. 2110 01:44:28,493 --> 01:44:30,910 And I think that was mentioned, too, in terms of protocol. 2111 01:44:30,910 --> 01:44:32,570 There could be fewer than two slashes. 2112 01:44:32,570 --> 01:44:34,550 That I'm not going to worry about. 2113 01:44:34,550 --> 01:44:38,720 If the user gives me one instead of two, that's really user error. 2114 01:44:38,720 --> 01:44:41,420 And I could be tolerant of it, but you know what, at that point 2115 01:44:41,420 --> 01:44:45,570 I'm OK yelling at them with an error message saying, please fix your input. 2116 01:44:45,570 --> 01:44:48,890 Otherwise, we could be here all day long trying to handle all possible typos.
2117 01:44:48,890 --> 01:44:51,740 For now, I think in the interests of usability, 2118 01:44:51,740 --> 01:44:54,560 or user experience, UX, let's at least be 2119 01:44:54,560 --> 01:44:59,130 tolerant of all possible valid inputs or reasonable inputs, if you will. 2120 01:44:59,130 --> 01:45:01,940 So let me go here, and let me start chipping away at these here. 2121 01:45:01,940 --> 01:45:03,530 What are some problems we can solve? 2122 01:45:03,530 --> 01:45:08,735 Well, let me propose that we first address the issue of matching 2123 01:45:08,735 --> 01:45:10,110 from the beginning of the string. 2124 01:45:10,110 --> 01:45:11,900 So let me add the ^ to the beginning. 2125 01:45:11,900 --> 01:45:15,362 And let me add not a $ sign at the end, though, right? 2126 01:45:15,362 --> 01:45:17,570 Because I don't want to match all the way to the end, 2127 01:45:17,570 --> 01:45:19,950 because I want to tolerate a username there. 2128 01:45:19,950 --> 01:45:23,210 So I think we just want the ^ symbol there. 2129 01:45:23,210 --> 01:45:26,000 There's a subtle bug that no one yet mentioned. 2130 01:45:26,000 --> 01:45:30,860 And let me just kind of highlight it and see if it jumps out at you now. 2131 01:45:30,860 --> 01:45:32,730 It's a little subtle here on my screen. 2132 01:45:32,730 --> 01:45:37,610 I've highlighted in blue a final bug here-- 2133 01:45:37,610 --> 01:45:39,860 maybe some smiles on the screen, yeah? 2134 01:45:39,860 --> 01:45:41,400 Can we take one hand here? 2135 01:45:41,400 --> 01:45:46,730 Why am I highlighting the dot in twitter.com, even though it definitely 2136 01:45:46,730 --> 01:45:47,900 should be there? 2137 01:45:47,900 --> 01:45:52,610 AUDIENCE: So the dot without a backslash means any character except a newline. 2138 01:45:52,610 --> 01:45:53,990 DAVID MALAN: Yeah, exactly. 2139 01:45:53,990 --> 01:45:55,500 It means any character.
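A brief sketch of both points, the ^ anchor and the unescaped dot, assuming a backslash is used to make the dot literal. The variable names and the mistyped sample input are hypothetical.

```python
import re

unescaped = r"^https://twitter.com/"  # "." matches any character except a newline
escaped = r"^https://twitter\.com/"   # "\." matches only a literal dot

typo = "https://twitter?com/davidjmalan"  # hypothetical mistyped input
sentence = "my username is https://twitter.com/davidjmalan"

print(re.sub(unescaped, "", typo))    # davidjmalan (the typo is tolerated)
print(re.sub(escaped, "", typo))      # https://twitter?com/davidjmalan (unchanged)
print(re.sub(escaped, "", sentence))  # unchanged, because ^ requires a match at the start
```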
2140 01:45:55,500 --> 01:46:01,555 So I could type in something like twitter?com, or twitter anything com, 2141 01:46:01,555 --> 01:46:03,660 and that would actually be tolerated. 2142 01:46:03,660 --> 01:46:07,230 It's not really that bad, because why would the user do that? 2143 01:46:07,230 --> 01:46:09,410 But if I want to be correct, and I want to be 2144 01:46:09,410 --> 01:46:13,280 able to test my own code properly, I should really get this detail right. 2145 01:46:13,280 --> 01:46:16,040 So that's an easy fix, too, but it's a common mistake. 2146 01:46:16,040 --> 01:46:19,190 Anytime you're writing regular expressions that happen to involve 2147 01:46:19,190 --> 01:46:23,210 special symbols, like dots in a URL or domain name, 2148 01:46:23,210 --> 01:46:27,230 a $ sign in something involving currency, remember you might, indeed, 2149 01:46:27,230 --> 01:46:30,390 need to escape it with a backslash like this here. 2150 01:46:30,390 --> 01:46:30,890 All right. 2151 01:46:30,890 --> 01:46:34,040 Let me ask the group about the protocol specifically. 2152 01:46:34,040 --> 01:46:36,690 So HTTPS is a good thing in the world. 2153 01:46:36,690 --> 01:46:37,860 It means secure. 2154 01:46:37,860 --> 01:46:39,360 There is encryption being used. 2155 01:46:39,360 --> 01:46:41,840 So generally, you like to see HTTPS. 2156 01:46:41,840 --> 01:46:46,370 But you still see people typing or copy-pasting HTTP. 2157 01:46:46,370 --> 01:46:50,960 What would be the simplest fix here to tolerate, as has been proposed, 2158 01:46:50,960 --> 01:46:54,380 both HTTP and HTTPS? 2159 01:46:54,380 --> 01:46:56,600 I'm going to propose that I could do this. 2160 01:46:56,600 --> 01:47:02,630 I could do HTTP vertical bar or HTTPS, which, again, means A or B. 2161 01:47:02,630 --> 01:47:04,490 But I think I can be smarter than that. 2162 01:47:04,490 --> 01:47:06,770 I can keep my code a little more succinct. 
2163 01:47:06,770 --> 01:47:13,400 Any recommendations here for tolerating HTTP or HTTPS? 2164 01:47:13,400 --> 01:47:16,845 AUDIENCE: We could try to put in question mark behind the S. 2165 01:47:16,845 --> 01:47:17,720 DAVID MALAN: Perfect. 2166 01:47:17,720 --> 01:47:19,340 Just use a question mark. 2167 01:47:19,340 --> 01:47:21,110 Both of those would be viable solutions. 2168 01:47:21,110 --> 01:47:23,330 If you want to be super explicit in your code, fine. 2169 01:47:23,330 --> 01:47:28,730 Use parentheses and say HTTP or HTTPS, so that you, the reader, your boss, 2170 01:47:28,730 --> 01:47:31,410 your teacher just know exactly what you're doing. 2171 01:47:31,410 --> 01:47:35,090 But if you keep taking the more verbose approach all the time, 2172 01:47:35,090 --> 01:47:37,760 it might actually become less readable, certainly 2173 01:47:37,760 --> 01:47:40,580 once your regular expressions get this big instead of this big. 2174 01:47:40,580 --> 01:47:42,290 So let's save space where we can. 2175 01:47:42,290 --> 01:47:45,030 And I would argue that this is pretty reasonable, so 2176 01:47:45,030 --> 01:47:47,640 long as you're in the habit of reading regular expressions 2177 01:47:47,640 --> 01:47:50,390 and know that question mark does not mean a literal question mark, 2178 01:47:50,390 --> 01:47:52,970 but it means zero or one of the thing before. 2179 01:47:52,970 --> 01:47:56,510 I think we've effectively made the S optional here. 2180 01:47:56,510 --> 01:47:58,410 Now, what else can I do? 2181 01:47:58,410 --> 01:48:03,860 Well, suppose we want to tolerate the www dot, which may or may not be there, 2182 01:48:03,860 --> 01:48:06,050 but it will work if you go to a browser. 2183 01:48:06,050 --> 01:48:07,220 I could do this-- 2184 01:48:07,220 --> 01:48:11,720 www dot-- wait, I want a backslash there so I don't 2185 01:48:11,720 --> 01:48:13,310 repeat the same mistake as before. 
2186 01:48:13,310 --> 01:48:19,220 But this is no good either, because I want to tolerate being there or not 2187 01:48:19,220 --> 01:48:19,760 being there. 2188 01:48:19,760 --> 01:48:21,890 And now I've just required that it be there. 2189 01:48:21,890 --> 01:48:24,290 But I think I can take the same approach. 2190 01:48:24,290 --> 01:48:25,550 Any recommendations? 2191 01:48:25,550 --> 01:48:27,200 How do I make the www. 2192 01:48:27,200 --> 01:48:30,230 optional, just to hammer this home? 2193 01:48:30,230 --> 01:48:32,480 AUDIENCE: We can group-- 2194 01:48:32,480 --> 01:48:35,835 make a square and a question mark. 2195 01:48:35,835 --> 01:48:36,710 DAVID MALAN: Perfect. 2196 01:48:36,710 --> 01:48:38,825 So question mark is the short answer again. 2197 01:48:38,825 --> 01:48:40,700 But we have to be a little smarter this time. 2198 01:48:40,700 --> 01:48:43,130 As Maria has noted, we need parentheses now. 2199 01:48:43,130 --> 01:48:46,160 Because if I just put a question mark after the dot, 2200 01:48:46,160 --> 01:48:48,147 that just means the dot is optional. 2201 01:48:48,147 --> 01:48:50,480 And that's wrong, because we don't want the user to type 2202 01:48:50,480 --> 01:48:56,690 in W-W-W-T-W-I-T-T-E-R. We want the dot to be there or just not at all with no 2203 01:48:56,690 --> 01:48:57,490 www. 2204 01:48:57,490 --> 01:49:00,080 So we need to group this whole thing together, 2205 01:49:00,080 --> 01:49:04,160 put a parenthesis there, and then a parenthesis, not after the third W, 2206 01:49:04,160 --> 01:49:09,920 after the dot, so that that whole thing is either there or it's not there. 2207 01:49:09,920 --> 01:49:12,338 And what else could we still do here? 2208 01:49:12,338 --> 01:49:14,630 There's going to be one other thing we should tolerate. 2209 01:49:14,630 --> 01:49:16,922 And it's been said before, and I'll pluck this one off. 2210 01:49:16,922 --> 01:49:18,260 What about the protocol? 
2211 01:49:18,260 --> 01:49:23,805 Like, what if the user just doesn't type or doesn't copy-paste the http:// 2212 01:49:23,805 --> 01:49:26,660 or an https://? 2213 01:49:26,660 --> 01:49:28,460 Honestly, you and I are not in the habit, 2214 01:49:28,460 --> 01:49:31,730 generally, of even typing the protocol anymore nowadays. 2215 01:49:31,730 --> 01:49:34,010 You just let the browser figure it out for you, 2216 01:49:34,010 --> 01:49:36,590 and automatically add it instead. 2217 01:49:36,590 --> 01:49:38,900 So this one's going to look like more of a mouthful. 2218 01:49:38,900 --> 01:49:43,520 But if I want this whole thing here in blue to be optional, 2219 01:49:43,520 --> 01:49:46,880 it's actually the same solution as Maria offered a moment ago. 2220 01:49:46,880 --> 01:49:49,550 I'm going to go ahead and put a parenthesis over here, 2221 01:49:49,550 --> 01:49:53,960 and a parenthesis after the two slashes, and then a question 2222 01:49:53,960 --> 01:49:57,120 mark so as to make that whole thing optional as well. 2223 01:49:57,120 --> 01:49:58,320 And this is OK. 2224 01:49:58,320 --> 01:50:00,920 It's totally fine to make this whole thing 2225 01:50:00,920 --> 01:50:06,480 optional, or inside of it, this little thing, just the S optional as well. 2226 01:50:06,480 --> 01:50:09,350 So long as I'm applying the same principles again and again, 2227 01:50:09,350 --> 01:50:11,390 either on a small scale or a bigger scale, 2228 01:50:11,390 --> 01:50:16,680 it's totally fine to nest one of these inside of the other. 2229 01:50:16,680 --> 01:50:20,730 Questions now on any of these refinements 2230 01:50:20,730 --> 01:50:23,730 to this parsing, this analyzing of Twitter? 2231 01:50:23,730 --> 01:50:29,850 AUDIENCE: What if we put a vertical bar besides this www dot? 2232 01:50:29,850 --> 01:50:31,930 DAVID MALAN: What if we use a vertical bar there? 2233 01:50:31,930 --> 01:50:34,110 So we could do something like that, too. 
2234 01:50:34,110 --> 01:50:36,690 We could do something like this. 2235 01:50:36,690 --> 01:50:41,370 Instead of the question mark, I could do www dot or nothing 2236 01:50:41,370 --> 01:50:43,680 and just leave that and the parentheses. 2237 01:50:43,680 --> 01:50:45,160 That, too, would be fine. 2238 01:50:45,160 --> 01:50:47,743 I personally tend not to like that, because it's a little less 2239 01:50:47,743 --> 01:50:49,035 obvious to me-- wait, a minute. 2240 01:50:49,035 --> 01:50:52,260 Is that deliberate, or did I forget to finish my thought by putting something 2241 01:50:52,260 --> 01:50:53,460 after the vertical bar? 2242 01:50:53,460 --> 01:50:57,630 But that, too, would be allowed there as well, if that's what you mean. 2243 01:50:57,630 --> 01:50:59,790 Other questions on where we left things here, 2244 01:50:59,790 --> 01:51:03,090 where we made the protocol optional, too? 2245 01:51:03,090 --> 01:51:07,260 AUDIENCE: What happens if we have parenthesis, 2246 01:51:07,260 --> 01:51:10,173 and inside we have another parenthesis, and another parenthesis? 2247 01:51:10,173 --> 01:51:11,590 Will it interfere with each other? 2248 01:51:11,590 --> 01:51:14,298 DAVID MALAN: If you have parentheses inside of parentheses, that, 2249 01:51:14,298 --> 01:51:15,660 too, is totally fine. 2250 01:51:15,660 --> 01:51:19,680 And indeed, that should be one of the reassuring lessons today. 2251 01:51:19,680 --> 01:51:23,670 As complicated as each of these regular expressions has admittedly gotten, 2252 01:51:23,670 --> 01:51:27,570 I'm just applying the exact same principles and the exact same syntax 2253 01:51:27,570 --> 01:51:29,110 again and again. 2254 01:51:29,110 --> 01:51:31,988 So it's totally fine to have parentheses inside of parentheses 2255 01:51:31,988 --> 01:51:33,780 if they're each solving different problems. 
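A sketch of the pattern as refined so far, assuming it is still used with re.sub: the protocol group is optional, the S inside it is optional, and the www. group is optional. The helper name extract_username and the list of sample inputs are hypothetical.

```python
import re

# Optional protocol (with an optional "s"), optional "www.", then the escaped domain.
PREFIX = r"^(https?://)?(www\.)?twitter\.com/"


def extract_username(url: str) -> str:
    """Hypothetical helper: remove a recognised Twitter prefix from the start."""
    return re.sub(PREFIX, "", url.strip())


for candidate in (
    "https://twitter.com/davidjmalan",
    "http://www.twitter.com/davidjmalan",
    "www.twitter.com/davidjmalan",
    "twitter.com/davidjmalan",
):
    print(extract_username(candidate))  # davidjmalan each time
```

Writing the www part as (www\.|) instead of (www\.)? should behave the same way here; the question-mark form just makes the intent easier to see.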
2256 01:51:33,780 --> 01:51:37,200 And in fact, the lesson I would really emphasize the most today 2257 01:51:37,200 --> 01:51:41,250 is that you will not be happy if you try to write out 2258 01:51:41,250 --> 01:51:44,820 a whole complicated regular expression all at once. 2259 01:51:44,820 --> 01:51:47,310 Like, if you're anything like me, you will fail, 2260 01:51:47,310 --> 01:51:49,428 and you will have trouble finding the mistake. 2261 01:51:49,428 --> 01:51:50,970 Because my god, look at these things. 2262 01:51:50,970 --> 01:51:53,880 They are, even to me all these years later, cryptic. 2263 01:51:53,880 --> 01:51:57,240 The better way, I would argue, whether you're new to programming 2264 01:51:57,240 --> 01:52:01,110 or as old to it as I am, is to just take these baby 2265 01:52:01,110 --> 01:52:03,750 steps, these incremental steps where you do something simple, 2266 01:52:03,750 --> 01:52:04,710 you make sure it works. 2267 01:52:04,710 --> 01:52:07,080 You add one more feature, make sure it works. 2268 01:52:07,080 --> 01:52:09,120 Add one more feature, make sure it works. 2269 01:52:09,120 --> 01:52:12,360 And hopefully, by the end, because you've done each of those steps one 2270 01:52:12,360 --> 01:52:15,490 at a time, the whole thing will make sense to you. 2271 01:52:15,490 --> 01:52:20,310 But you'll also have gotten each of those steps correct at each turn. 2272 01:52:20,310 --> 01:52:23,970 So please, do avoid the inclination to try 2273 01:52:23,970 --> 01:52:26,550 to come up with long, sophisticated regular expressions 2274 01:52:26,550 --> 01:52:29,580 all at once, because it's just not a good use of time 2275 01:52:29,580 --> 01:52:32,100 if you then stare at it trying to find a mistake that you 2276 01:52:32,100 --> 01:52:35,230 could have caught if you did things more incrementally instead. 2277 01:52:35,230 --> 01:52:35,730 All right.
2278 01:52:35,730 --> 01:52:38,160 There still remains, arguably, at least one problem 2279 01:52:38,160 --> 01:52:40,050 with this solution in that even though I'm 2280 01:52:40,050 --> 01:52:44,040 calling re.sub to substitute the URL with nothing, 2281 01:52:44,040 --> 01:52:47,410 quote, unquote, I then in my final line of code, line 6, 2282 01:52:47,410 --> 01:52:49,590 am just blindly assuming that it all worked, 2283 01:52:49,590 --> 01:52:52,200 and I'm going to go ahead and print out the username. 2284 01:52:52,200 --> 01:52:53,520 But what if the user-- 2285 01:52:53,520 --> 01:52:56,310 if I clear my screen here and run python of twitter.py-- 2286 01:52:56,310 --> 01:52:58,110 doesn't even type a Twitter URL? 2287 01:52:58,110 --> 01:53:02,805 What if they do something like https://google.com/, 2288 01:53:02,805 --> 01:53:06,090 like completely unrelated, for whatever reason, 2289 01:53:06,090 --> 01:53:08,970 Enter, that is not their Twitter username. 2290 01:53:08,970 --> 01:53:12,300 So we need to have some conditional logic, I would argue, 2291 01:53:12,300 --> 01:53:15,690 so that for this program's sake, we're only printing out 2292 01:53:15,690 --> 01:53:19,920 or, in a back end system, we're only saving into our database or a CSV 2293 01:53:19,920 --> 01:53:24,090 file the username if we actually matched the proper pattern. 2294 01:53:24,090 --> 01:53:29,010 So rather than use re.sub, which is useful for cleaning up data, 2295 01:53:29,010 --> 01:53:32,340 as we've done here to get rid of something we don't want there, 2296 01:53:32,340 --> 01:53:37,080 why don't we go back to re.search, where we began today, 2297 01:53:37,080 --> 01:53:41,100 and use it to solve this same problem but in a way that's conditional, 2298 01:53:41,100 --> 01:53:44,490 whereby I can confidently say, yes or no, at the end of my program, 2299 01:53:44,490 --> 01:53:47,260 here's the username, or here it is not? 2300 01:53:47,260 --> 01:53:48,300 So let me go ahead now. 
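A small sketch of the problem just described, assuming the same prefix pattern as the re.sub version: when nothing matches, re.sub returns the input unchanged, so the program would report the whole URL as if it were a username.

```python
import re

PREFIX = r"^(https?://)?(www\.)?twitter\.com/"

unrelated = "https://google.com/"
username = re.sub(PREFIX, "", unrelated)  # no match, so nothing is substituted

print(f"Username: {username}")  # Username: https://google.com/
```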
2301 01:53:48,300 --> 01:53:50,340 And I'll clear my terminal window here. 2302 01:53:50,340 --> 01:53:52,560 I'm going to keep most of-- 2303 01:53:52,560 --> 01:53:55,800 I'm going to keep the first two lines the same, where I import re, 2304 01:53:55,800 --> 01:53:57,520 and I get the URL from the user. 2305 01:53:57,520 --> 01:53:59,010 But this time, let's do this. 2306 01:53:59,010 --> 01:54:03,630 Let's this time search for, using re.search instead of re.sub, 2307 01:54:03,630 --> 01:54:04,470 the following. 2308 01:54:04,470 --> 01:54:09,510 I'm going to start matching at the beginning of the string, https, 2309 01:54:09,510 --> 01:54:13,380 question mark to make the S optional, colon, slash, slash, 2310 01:54:13,380 --> 01:54:19,710 I'm going to make my www optional by putting that in parentheses with a question mark there, 2311 01:54:19,710 --> 01:54:24,000 then a twitter.com with a literal dot there so I stay ahead of that issue, 2312 01:54:24,000 --> 01:54:26,640 too, then a slash. 2313 01:54:26,640 --> 01:54:30,330 And then well, this is where davidjmalan is supposed to go. 2314 01:54:30,330 --> 01:54:31,710 How do I detect this? 2315 01:54:31,710 --> 01:54:35,580 Well, I think I'll just tolerate anything at the end of the URL here. 2316 01:54:35,580 --> 01:54:38,532 All right, $ sign at the very end, close quote. 2317 01:54:38,532 --> 01:54:40,740 For the moment, I'm going to stipulate that we're not 2318 01:54:40,740 --> 01:54:43,830 going to worry about question marks at the end or hashes, 2319 01:54:43,830 --> 01:54:45,600 like for fragment IDs in URLs. 2320 01:54:45,600 --> 01:54:48,630 We're going to assume for simplicity now that the URL just 2321 01:54:48,630 --> 01:54:50,610 ends with the username alone. 2322 01:54:50,610 --> 01:54:52,110 Now what am I going to do?
2323 01:54:52,110 --> 01:54:54,330 Well, I want to search for this URL specifically, 2324 01:54:54,330 --> 01:54:58,230 and I'm going to ignore case, so re.IGNORECASE, 2325 01:54:58,230 --> 01:55:00,840 applying that same lesson learned from before. 2326 01:55:00,840 --> 01:55:05,717 re.search, recall, will return to you the matches you've captured. 2327 01:55:05,717 --> 01:55:07,050 Well, what do I want to capture? 2328 01:55:07,050 --> 01:55:12,420 Well, I want to capture everything to the right of the twitter.com URL here. 2329 01:55:12,420 --> 01:55:17,560 So let me surround what should be the user's username with parentheses, 2330 01:55:17,560 --> 01:55:21,580 not for making them optional but to say, "capture this set of characters." 2331 01:55:21,580 --> 01:55:24,730 Now, re.search, recall, returns an answer. 2332 01:55:24,730 --> 01:55:28,600 matches will be my variable name again, but I could call it anything I want. 2333 01:55:28,600 --> 01:55:29,950 And then I can do this. 2334 01:55:29,950 --> 01:55:33,680 If matches, now I know I can do this. 2335 01:55:33,680 --> 01:55:36,370 Let's print out the format string, username colon. 2336 01:55:36,370 --> 01:55:40,190 And then what do I want to print out? 2337 01:55:40,190 --> 01:55:44,440 Well, I think I want to print out matches.group 1 for my matched 2338 01:55:44,440 --> 01:55:45,700 username. 2339 01:55:45,700 --> 01:55:46,210 All right. 2340 01:55:46,210 --> 01:55:47,980 So what am I doing just to recap? 2341 01:55:47,980 --> 01:55:49,960 Line 1, I'm importing the library. 2342 01:55:49,960 --> 01:55:52,280 Line 2, I'm getting the URL from the user. 2343 01:55:52,280 --> 01:55:53,230 So nothing new there. 2344 01:55:53,230 --> 01:55:59,740 Line 5, I'm searching the user's URL, as indicated here as the second argument, 2345 01:55:59,740 --> 01:56:03,220 for this regular expression, this pattern. 
2346 01:56:03,220 --> 01:56:07,720 I have surrounded the dot + with parentheses 2347 01:56:07,720 --> 01:56:11,380 so that they are captured ultimately, so I can extract, 2348 01:56:11,380 --> 01:56:14,320 in this final scenario, the user's username. 2349 01:56:14,320 --> 01:56:18,580 If I indeed got a match, and matches is non-none, 2350 01:56:18,580 --> 01:56:23,470 it is actually containing some match, then and only then, print out username. 2351 01:56:23,470 --> 01:56:25,420 In this way, let me try this now. 2352 01:56:25,420 --> 01:56:31,110 If I run python of twitter.py and type in https://www.google.com/, 2353 01:56:31,110 --> 01:56:33,370 now nothing gets printed. 2354 01:56:33,370 --> 01:56:36,010 So I've at least solved the mistake we just saw, 2355 01:56:36,010 --> 01:56:38,050 where I was just assuming that my code worked. 2356 01:56:38,050 --> 01:56:44,000 Now I'm making sure that I have searched for and found the Twitter URL prefix. 2357 01:56:44,000 --> 01:56:44,500 All right. 2358 01:56:44,500 --> 01:56:45,917 Well, let's run this for real now. 2359 01:56:45,917 --> 01:56:51,730 Python of twitter.py https://twitter.com/davidjmalan. 2360 01:56:51,730 --> 01:56:55,420 But note, I could use HTTP, I could use www. 2361 01:56:55,420 --> 01:56:58,430 I'm just going to go ahead here and hit Enter. 2362 01:56:58,430 --> 01:57:01,730 Huh, none. 2363 01:57:01,730 --> 01:57:05,480 What has gone wrong? 2364 01:57:05,480 --> 01:57:08,060 This one's a bit more subtle. 2365 01:57:08,060 --> 01:57:13,027 But why does matches.group 1 contain nothing? 2366 01:57:13,027 --> 01:57:13,610 Wait a minute. 2367 01:57:13,610 --> 01:57:15,450 Let me-- maybe I did this wrong. 2368 01:57:15,450 --> 01:57:17,707 Maybe-- maybe do we need the www? 2369 01:57:17,707 --> 01:57:18,540 Let me run it again. 2370 01:57:18,540 --> 01:57:24,740 So here we go. https://, let's add a www.twitter.com/davidjmalan. 2371 01:57:24,740 --> 01:57:25,500 All right. 
2372 01:57:25,500 --> 01:57:26,470 Enter. 2373 01:57:26,470 --> 01:57:28,550 Ho, ho, ho. 2374 01:57:28,550 --> 01:57:31,170 What is going on? 2375 01:57:31,170 --> 01:57:32,720 AUDIENCE: You have to say group 2. 2376 01:57:32,720 --> 01:57:34,520 DAVID MALAN: I have to say group 2? 2377 01:57:34,520 --> 01:57:39,140 Well, wait-- oh, right, because we had the subdomain was optional. 2378 01:57:39,140 --> 01:57:42,560 And to make it optional, I needed to use parentheses here. 2379 01:57:42,560 --> 01:57:44,070 And so I then said zero or one. 2380 01:57:44,070 --> 01:57:44,570 OK. 2381 01:57:44,570 --> 01:57:49,910 So that means that actually, I'm unintentionally but by design 2382 01:57:49,910 --> 01:57:54,710 capturing the www dot, or none of it if it wasn't there before, 2383 01:57:54,710 --> 01:57:56,645 but I have a second match over here because I 2384 01:57:56,645 --> 01:57:58,020 have a second set of parentheses. 2385 01:57:58,020 --> 01:58:00,350 So I think, yep, let me change matches.group 1 2386 01:58:00,350 --> 01:58:02,300 to matches.group 2, and let's run this. 2387 01:58:02,300 --> 01:58:07,460 Python of twitter.py https://www.twitter-- 2388 01:58:07,460 --> 01:58:13,070 let's do this, twitter.com/davidjmalan, Enter, 2389 01:58:13,070 --> 01:58:15,920 and now we've got access to the username. 2390 01:58:15,920 --> 01:58:19,040 Let me go ahead and tighten it up a little bit further. 2391 01:58:19,040 --> 01:58:21,513 If you like our new friend-- 2392 01:58:21,513 --> 01:58:22,430 it's hard not to like. 2393 01:58:22,430 --> 01:58:26,060 If we like our old friend the walrus operator, let's go ahead 2394 01:58:26,060 --> 01:58:27,740 and add this just to tighten things up. 2395 01:58:27,740 --> 01:58:31,460 Let me go back to VS Code here, and let me get rid of the unnecessary condition 2396 01:58:31,460 --> 01:58:34,580 there and combine it up here, if matches equals that.
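The group-numbering surprise discussed above is easier to see with both groups printed side by side. This is a minimal sketch assuming the pattern as it was dictated (optional `www.` in parentheses, then `.+` in parentheses); `PATTERN` and `groups_for` are names introduced here for illustration only.

```python
import re

# Pattern as described in the lecture: two sets of capturing parentheses.
PATTERN = r"^https?://(www\.)?twitter\.com/(.+)$"

def groups_for(url):
    """Return (group 1, group 2) on a match, or None when the URL does not match."""
    matches = re.search(PATTERN, url, re.IGNORECASE)
    if matches:
        return matches.group(1), matches.group(2)
    return None

# Without www, group 1 did not participate in the match, so it is None.
print(groups_for("https://twitter.com/davidjmalan"))      # (None, 'davidjmalan')

# With www, group 1 holds the subdomain and group 2 holds the username.
print(groups_for("https://www.twitter.com/davidjmalan"))  # ('www.', 'davidjmalan')

# An unrelated URL does not match at all.
print(groups_for("https://www.google.com/"))              # None
```

Groups are numbered by the position of their opening parenthesis, left to right, which is why the username lands in group 2 here.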
2397 01:58:34,580 --> 01:58:38,090 But let's change the single assignment operator to the walrus operator. 2398 01:58:38,090 --> 01:58:40,040 Now I've tightened things up further. 2399 01:58:40,040 --> 01:58:43,940 But I bet, I bet, I bet there might be another solution here. 2400 01:58:43,940 --> 01:58:50,630 And indeed, it turns out that we can come back to this final set of syntax. 2401 01:58:50,630 --> 01:58:52,940 Recall that when we introduced these parentheses, 2402 01:58:52,940 --> 01:58:56,720 we did it so that we could do A or B, for instance, with the vertical bar. 2403 01:58:56,720 --> 01:58:59,060 Then you can even combine more than just one bar. 2404 01:58:59,060 --> 01:59:02,900 We use the group to combine ideas like the www dot. 2405 01:59:02,900 --> 01:59:07,760 And then there's this admittedly weird syntax at the bottom here, up until now 2406 01:59:07,760 --> 01:59:08,690 not used. 2407 01:59:08,690 --> 01:59:12,230 There is a non-capturing version of parentheses 2408 01:59:12,230 --> 01:59:15,050 if you want to use parentheses logically because you need to, 2409 01:59:15,050 --> 01:59:18,080 but you don't want to bother capturing the result. 2410 01:59:18,080 --> 01:59:20,450 And this would arguably be a better solution 2411 01:59:20,450 --> 01:59:23,630 here, because, yes, if I go back to VS Code, I do 2412 01:59:23,630 --> 01:59:27,560 need to surround the www dot with parentheses, at least 2413 01:59:27,560 --> 01:59:30,170 as I've written my regex here, because I wanted 2414 01:59:30,170 --> 01:59:31,910 to put the question mark after it. 2415 01:59:31,910 --> 01:59:35,120 But I don't need the www dot coming back. 2416 01:59:35,120 --> 01:59:37,580 In fact, let's only extract the data we care about, 2417 01:59:37,580 --> 01:59:40,280 just so there's no confusion down the road, for me, 2418 01:59:40,280 --> 01:59:42,120 or my colleagues, or my teachers. 2419 01:59:42,120 --> 01:59:43,860 So what could I do?
2420 01:59:43,860 --> 01:59:48,800 Well, the syntax per this slide is to use a question mark and a colon 2421 01:59:48,800 --> 01:59:51,410 immediately after the open parentheses. 2422 01:59:51,410 --> 01:59:52,910 It looks weird admittedly. 2423 01:59:52,910 --> 01:59:55,040 Those of you who have prior programming experience 2424 01:59:55,040 --> 01:59:59,300 might recognize the syntax from ternary operators, doing an if else all in one 2425 01:59:59,300 --> 01:59:59,960 line. 2426 01:59:59,960 --> 02:00:04,190 A question mark colon at the beginning of that parenthetical 2427 02:00:04,190 --> 02:00:08,160 means, yes, I'm using parentheses to group these things together, 2428 02:00:08,160 --> 02:00:11,640 but no, you do not need to capture them instead. 2429 02:00:11,640 --> 02:00:15,500 So I can change my code back now to matches.group 1. 2430 02:00:15,500 --> 02:00:18,260 I'll clear my screen here, run python of twitter.py. 2431 02:00:18,260 --> 02:00:24,350 I'll again run here https://twitter.com/davidjmalan 2432 02:00:24,350 --> 02:00:26,480 with or without the www. 2433 02:00:26,480 --> 02:00:30,590 And now, I indeed get back that username. 2434 02:00:30,590 --> 02:00:37,280 Any questions, then, on these final techniques? 2435 02:00:37,280 --> 02:00:40,940 AUDIENCE: So first of all, could we move the ^ right 2436 02:00:40,940 --> 02:00:44,270 at the beginning of Twitter, and then just start reading from there, 2437 02:00:44,270 --> 02:00:49,700 and then get rid of everything else before that, the kind of www issues 2438 02:00:49,700 --> 02:00:50,930 that we had? 2439 02:00:50,930 --> 02:00:56,240 And then my second question is, how would we use kind of, I guess, 2440 02:00:56,240 --> 02:01:01,640 either a list or a dictionary to sort the .com kind of thing, 2441 02:01:01,640 --> 02:01:05,120 because we have .co.uk, and that kind of stuff. 2442 02:01:05,120 --> 02:01:08,330 How would we bring that into the re function? 
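The two refinements demonstrated above, the walrus operator and the non-capturing `(?:...)` group, can be combined in one sketch. The function name `extract_username` is hypothetical (the lecture's program prints directly rather than returning a value), and the walrus operator assumes Python 3.8 or later.

```python
import re

def extract_username(url):
    """Return the Twitter username from a URL, or None if the URL does not match."""
    # (?:www\.)? groups the subdomain so that ? can apply to it, without capturing it.
    # The walrus operator assigns the result and tests it in the same expression.
    if matches := re.search(
        r"^https?://(?:www\.)?twitter\.com/(.+)$", url, re.IGNORECASE
    ):
        # With the subdomain no longer captured, the username is group 1 again.
        return matches.group(1)
    return None

print(extract_username("https://twitter.com/davidjmalan"))      # davidjmalan
print(extract_username("https://www.twitter.com/davidjmalan"))  # davidjmalan
print(extract_username("https://www.google.com/"))              # None
```

Making the subdomain non-capturing means the only numbered group is the one the program actually uses, so a later edit to the prefix cannot silently shift the group index.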
2443 02:01:08,330 --> 02:01:09,830 DAVID MALAN: A good question but no. 2444 02:01:09,830 --> 02:01:15,560 If I move the ^ before twitter.com and throw away the protocol and the www, 2445 02:01:15,560 --> 02:01:20,960 then the user is going to have to type in literally twitter.com/username. 2446 02:01:20,960 --> 02:01:23,040 They can't even type in that other stuff. 2447 02:01:23,040 --> 02:01:25,170 So that would be a regression, a step back. 2448 02:01:25,170 --> 02:01:29,120 As for the .com, the .org, and .edu, and so forth, 2449 02:01:29,120 --> 02:01:31,970 the short answer is there's many different solutions here. 2450 02:01:31,970 --> 02:01:37,190 If I wanted to be stringent about .com-- and suppose that Twitter probably owns 2451 02:01:37,190 --> 02:01:40,620 multiple domain names, even though they tend to use just this one. 2452 02:01:40,620 --> 02:01:43,800 Suppose they have something like .org as well. 2453 02:01:43,800 --> 02:01:47,810 You could use more parentheses here and do something like this-- com or org. 2454 02:01:47,810 --> 02:01:50,270 I'd probably want to go in and add a question mark 2455 02:01:50,270 --> 02:01:53,060 colon to make it non-capturing, because I don't care which 2456 02:01:53,060 --> 02:01:55,100 it is, I just want to tolerate both. 2457 02:01:55,100 --> 02:01:58,220 Alternatively, we could capture that. 2458 02:01:58,220 --> 02:02:01,850 We could do something like this, where we do dot + so as 2459 02:02:01,850 --> 02:02:03,410 to actually capture that. 2460 02:02:03,410 --> 02:02:05,570 And then we could do something like this. 2461 02:02:05,570 --> 02:02:13,640 If matches.group 1 now equals equals com, then we could support this. 
2462 02:02:13,640 --> 02:02:18,020 So you could imagine factoring out the logic just by extracting the Top-Level 2463 02:02:18,020 --> 02:02:21,410 Domain, or TLD, and then just using Python code, maybe a list, maybe 2464 02:02:21,410 --> 02:02:24,860 a dictionary, to validate elsewhere, outside of the regex, 2465 02:02:24,860 --> 02:02:26,780 if it's, in fact, what you expect. 2466 02:02:26,780 --> 02:02:28,700 For now, though, we kept things simple. 2467 02:02:28,700 --> 02:02:31,860 We focused only on the .com in this case. 2468 02:02:31,860 --> 02:02:33,767 Let's make one final change to this program 2469 02:02:33,767 --> 02:02:36,350 so that we're being a little more specific with the definition 2470 02:02:36,350 --> 02:02:37,640 of a Twitter username. 2471 02:02:37,640 --> 02:02:41,000 It turns out that we're being a little too generous over here, whereby we're 2472 02:02:41,000 --> 02:02:43,280 accepting one or more of any character. 2473 02:02:43,280 --> 02:02:45,050 I checked the documentation for Twitter. 2474 02:02:45,050 --> 02:02:48,890 And Twitter only supports letters of the alphabet, a through Z, 2475 02:02:48,890 --> 02:02:53,370 numbers 0 through 9, or underscores, so not just dot, 2476 02:02:53,370 --> 02:02:55,020 which is literally anything. 2477 02:02:55,020 --> 02:02:57,230 So let me go ahead and be more precise here. 2478 02:02:57,230 --> 02:02:59,870 At the end of my string, let me go ahead and say, 2479 02:02:59,870 --> 02:03:03,510 this set of symbols in square brackets. 2480 02:03:03,510 --> 02:03:08,058 I'm going to go ahead and say a through Z, 0 through 9, and an underscore. 2481 02:03:08,058 --> 02:03:10,100 Because, again, those are the only valid symbols. 2482 02:03:10,100 --> 02:03:12,740 I don't need to bother with an uppercase A or a lowercase z, 2483 02:03:12,740 --> 02:03:16,140 because we're using re.IGNORECASE over here. 
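The idea of capturing the top-level domain and validating it in ordinary Python code, as suggested in the answer above, might look like the following sketch. Several things here are assumptions: the set `ALLOWED_TLDS` and its contents are hypothetical, the function name is made up, and the character class `[a-z.]+` is used in place of the `.+` mentioned in the lecture so that a slash cannot end up inside the captured domain.

```python
import re

# Hypothetical allow-list; the lecture only supposes that ".org" might also exist.
ALLOWED_TLDS = {"com", "org"}

def extract_username(url):
    """Return the username if the URL is a Twitter URL with an allowed TLD, else None."""
    matches = re.search(
        r"^https?://(?:www\.)?twitter\.([a-z.]+)/(.+)$", url, re.IGNORECASE
    )
    # Group 1 is now the TLD and group 2 the username; the TLD is checked outside the regex.
    if matches and matches.group(1).lower() in ALLOWED_TLDS:
        return matches.group(2)
    return None

print(extract_username("https://twitter.com/davidjmalan"))  # davidjmalan
print(extract_username("https://twitter.org/davidjmalan"))  # davidjmalan
print(extract_username("https://twitter.net/davidjmalan"))  # None
```

If the set of domains is small and fixed, the non-capturing alternation `(?:com|org)` mentioned earlier keeps everything in the pattern instead; the allow-list approach is easier to extend without touching the regex.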
2484 02:03:16,140 --> 02:03:19,760 But I want to make sure now that I tolerate not only one or more 2485 02:03:19,760 --> 02:03:24,260 of these symbols here but also maybe some other stuff at the end of the URL. 2486 02:03:24,260 --> 02:03:27,710 I'm now going to be OK with there being a slash, or a question mark, 2487 02:03:27,710 --> 02:03:31,730 or a hash at the end of the URL, all of which are valid symbols in a URL, 2488 02:03:31,730 --> 02:03:34,130 but I know from Twitter's documentation, 2489 02:03:34,130 --> 02:03:36,390 are not part of the username. 2490 02:03:36,390 --> 02:03:36,890 All right. 2491 02:03:36,890 --> 02:03:39,770 Now I'm going to go ahead and run python of twitter.py one 2492 02:03:39,770 --> 02:03:46,610 final time, typing in https://twitter.com/davidjmalan, maybe 2493 02:03:46,610 --> 02:03:48,320 with, maybe without a trailing slash. 2494 02:03:48,320 --> 02:03:52,070 But hopefully, with my biggest fingers crossed here, I'm going to go ahead now 2495 02:03:52,070 --> 02:03:56,630 and hit Enter, and thankfully my username is, indeed, davidjmalan. 2496 02:03:56,630 --> 02:03:59,300 So what more is there in the world of regular expressions 2497 02:03:59,300 --> 02:04:00,320 and this own library? 2498 02:04:00,320 --> 02:04:04,340 Not just re.search and also re.sub, there's other functions, too. 2499 02:04:04,340 --> 02:04:07,850 There's re.split, via which you can split a string, not 2500 02:04:07,850 --> 02:04:11,480 using a specific character or characters like a comma and a space, 2501 02:04:11,480 --> 02:04:14,010 but multiple characters as well. 2502 02:04:14,010 --> 02:04:16,550 And there's even functions like re.findall, 2503 02:04:16,550 --> 02:04:20,540 which can allow you to search for multiple copies of the same pattern 2504 02:04:20,540 --> 02:04:23,120 in different places in a string so that you can perhaps 2505 02:04:23,120 --> 02:04:25,200 manipulate more than just one.
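The tightened username rule from the final change above (letters, digits, and underscores only, with a slash, question mark, or hash tolerated afterward) can be sketched as follows. The exact pattern on screen is not recoverable from the transcript, so the trailing `(?:[/?#].*)?$` is an assumption about one way to tolerate that extra material; simply dropping the `$` anchor would also accept it, but would additionally accept a URL such as `twitter.com/david-malan` and capture only `david`.

```python
import re

def extract_username(url):
    """Return a Twitter username made of a-z, 0-9, and _ only, or None if no match."""
    if matches := re.search(
        # [a-z0-9_]+ is the username; uppercase is covered by re.IGNORECASE.
        # (?:[/?#].*)? optionally allows a trailing slash, query string, or fragment.
        r"^https?://(?:www\.)?twitter\.com/([a-z0-9_]+)(?:[/?#].*)?$",
        url.strip(),
        re.IGNORECASE,
    ):
        return matches.group(1)
    return None

print(extract_username("https://twitter.com/davidjmalan"))          # davidjmalan
print(extract_username("https://twitter.com/davidjmalan/"))         # davidjmalan
print(extract_username("https://twitter.com/davidjmalan?lang=en"))  # davidjmalan
print(extract_username("https://twitter.com/david-malan"))          # None
```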
2506 02:04:25,200 --> 02:04:28,820 So at the end of the day now, you've really learned a whole other language, 2507 02:04:28,820 --> 02:04:31,700 like that of regular expressions, and we've used them in Python. 2508 02:04:31,700 --> 02:04:35,670 But these regular expressions actually exist in so many languages, too, 2509 02:04:35,670 --> 02:04:38,930 among them JavaScript, and Java, and Ruby, and more. 2510 02:04:38,930 --> 02:04:42,300 So with this new language, even though it's admittedly cryptic 2511 02:04:42,300 --> 02:04:45,050 when you use it for the first time, you have this newfound ability 2512 02:04:45,050 --> 02:04:48,800 to express these patterns that, again, you can use to validate data, 2513 02:04:48,800 --> 02:04:53,310 to clean up data, or even extract data, and from any data set 2514 02:04:53,310 --> 02:04:54,470 you might have in mind. 2515 02:04:54,470 --> 02:04:55,830 That's it for this week. 2516 02:04:55,830 --> 02:04:58,570 We will see you next time. 2517 02:04:58,570 --> 02:05:00,000