WEBVTT 00:00:00.000 --> 00:00:03.960 align:middle line:90% [ORCHESTRA TUNING] 00:00:03.960 --> 00:00:14.850 align:middle line:90% 00:00:14.850 --> 00:00:18.315 align:middle line:90% [MUSIC PLAYING] 00:00:18.315 --> 00:00:24.270 align:middle line:90% 00:00:24.270 --> 00:00:25.470 align:middle line:90% DAVID MALAN: All right. 00:00:25.470 --> 00:00:28.680 align:middle line:84% This is CS50's Introduction to Programming with Python. 00:00:28.680 --> 00:00:32.350 align:middle line:84% My name is David Malan, and this is our week on regular expressions. 00:00:32.350 --> 00:00:37.170 align:middle line:84% So a regular expression, otherwise known as a regex, is really just a pattern. 00:00:37.170 --> 00:00:39.120 align:middle line:84% And indeed, it's quite common in programming 00:00:39.120 --> 00:00:43.620 align:middle line:84% to want to use patterns to match on some kind of data, often user input. 00:00:43.620 --> 00:00:47.190 align:middle line:84% For instance, if the user types in an email address, whether to your program, 00:00:47.190 --> 00:00:49.200 align:middle line:84% or a website, or an app on your phone, you 00:00:49.200 --> 00:00:50.940 align:middle line:84% might ideally want to be able to validate 00:00:50.940 --> 00:00:53.130 align:middle line:84% that they did indeed type in an email address 00:00:53.130 --> 00:00:54.790 align:middle line:90% and not something completely different. 
00:00:54.790 --> 00:00:58.080 align:middle line:84% So using regular expressions, we're going to have the newfound capability 00:00:58.080 --> 00:01:02.280 align:middle line:84% to define patterns in our code to compare them against data that we're 00:01:02.280 --> 00:01:04.920 align:middle line:84% receiving from someone else, whether it's just to validate it, 00:01:04.920 --> 00:01:07.470 align:middle line:84% or, heck, even if we want to clean up a whole lot of data 00:01:07.470 --> 00:01:11.220 align:middle line:84% that itself might be messy because it, too, came from us humans. 00:01:11.220 --> 00:01:14.010 align:middle line:84% Before, though, we use these regular expressions, 00:01:14.010 --> 00:01:19.620 align:middle line:84% let me propose that we solve a few problems using just some simpler syntax 00:01:19.620 --> 00:01:22.140 align:middle line:84% and see what kind of limitations we run up against. 00:01:22.140 --> 00:01:24.720 align:middle line:84% Let me propose that I open up VS Code here, 00:01:24.720 --> 00:01:27.900 align:middle line:84% and let me create a file called validate.py, the goal at hand 00:01:27.900 --> 00:01:30.768 align:middle line:84% being to validate, how about just that, a user's email address. 00:01:30.768 --> 00:01:33.060 align:middle line:84% They've come to your app, they've come to your website, 00:01:33.060 --> 00:01:34.935 align:middle line:84% they type in their email address, and we want 00:01:34.935 --> 00:01:38.380 align:middle line:84% to say yes or no, this email address looks valid. 00:01:38.380 --> 00:01:38.880 align:middle line:90% All right. 00:01:38.880 --> 00:01:43.980 align:middle line:84% Let me go ahead and type code of validate.py to create a new tab here. 00:01:43.980 --> 00:01:47.850 align:middle line:84% And then within this tab, let me go ahead and start writing some code, 00:01:47.850 --> 00:01:50.010 align:middle line:84% how about, that keeps things simple initially. 
00:01:50.010 --> 00:01:53.130 align:middle line:84% First, let me go ahead and prompt the user for their email address. 00:01:53.130 --> 00:01:57.510 align:middle line:84% And I'll store the return value of input in a variable called email, 00:01:57.510 --> 00:01:59.940 align:middle line:90% asking them "what's your email?" 00:01:59.940 --> 00:02:00.742 align:middle line:90% question mark. 00:02:00.742 --> 00:02:02.700 align:middle line:84% I'm going to go ahead and preemptively at least 00:02:02.700 --> 00:02:06.780 align:middle line:84% clean up the user's input a little bit by minimally just calling strip 00:02:06.780 --> 00:02:10.020 align:middle line:84% at the end of my call to input, because recall 00:02:10.020 --> 00:02:12.240 align:middle line:90% that input returns a string or a str. 00:02:12.240 --> 00:02:16.200 align:middle line:84% strs come with some built-in methods or functions, one of which 00:02:16.200 --> 00:02:18.390 align:middle line:84% is strip, which has the effect of stripping off 00:02:18.390 --> 00:02:22.395 align:middle line:84% any leading whitespace to the left or any trailing whitespace to the right. 00:02:22.395 --> 00:02:24.270 align:middle line:84% So that's just going to go ahead and at least 00:02:24.270 --> 00:02:27.480 align:middle line:84% avoid the human having accidentally typed in a space character. 00:02:27.480 --> 00:02:29.860 align:middle line:84% We're going to throw it away just in case. 00:02:29.860 --> 00:02:31.500 align:middle line:90% Now I'm going to do something simple. 00:02:31.500 --> 00:02:35.165 align:middle line:84% For a user's input to be an email address, 00:02:35.165 --> 00:02:37.290 align:middle line:84% I think we can all agree that it's got to minimally 00:02:37.290 --> 00:02:39.130 align:middle line:90% have an @ sign somewhere in it. 00:02:39.130 --> 00:02:40.140 align:middle line:90% So let's start simple.
00:02:40.140 --> 00:02:43.230 align:middle line:84% If the user has typed in something with an @ sign, let's 00:02:43.230 --> 00:02:46.800 align:middle line:84% very generously just say, OK, valid, looks like an email address. 00:02:46.800 --> 00:02:50.940 align:middle line:84% And if we're missing that @ sign, let's say invalid, because clearly it's 00:02:50.940 --> 00:02:51.870 align:middle line:90% not an email address. 00:02:51.870 --> 00:02:55.078 align:middle line:84% It's not going to be the best version of my code yet, but we'll start simple. 00:02:55.078 --> 00:02:59.700 align:middle line:84% So I'm going to ask the question, if there is an @ symbol in the user's 00:02:59.700 --> 00:03:03.570 align:middle line:84% email address, go ahead and print out, for instance, quote, unquote, "valid." 00:03:03.570 --> 00:03:06.750 align:middle line:84% Else, if there's not, now I'm pretty confident that the email 00:03:06.750 --> 00:03:09.250 align:middle line:90% address is, in fact, invalid. 00:03:09.250 --> 00:03:10.650 align:middle line:90% Now, what is this code doing? 00:03:10.650 --> 00:03:16.320 align:middle line:84% Well, if @ sign in email is a Pythonic way of asking is this string quote, 00:03:16.320 --> 00:03:20.762 align:middle line:84% unquote "@" in this other string email, no matter where it is-- 00:03:20.762 --> 00:03:22.470 align:middle line:84% at the beginning, the middle, or the end. 00:03:22.470 --> 00:03:25.470 align:middle line:84% It's going to automatically search through the entire string for you 00:03:25.470 --> 00:03:26.220 align:middle line:90% automatically. 00:03:26.220 --> 00:03:27.660 align:middle line:90% I could do this more verbosely. 00:03:27.660 --> 00:03:29.670 align:middle line:84% And I could use a for loop or a while loop 00:03:29.670 --> 00:03:32.640 align:middle line:84% and look at every character in the user's email address, 00:03:32.640 --> 00:03:34.128 align:middle line:90% looking to see if it's an @ sign. 
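The check just described can be sketched as below. This is only an illustration: the function names `is_valid` and `is_valid_verbose` and the sample inputs are assumptions, not part of the lecture, where the string comes from `input("What's your email? ").strip()` and the result is printed directly.

```python
def is_valid(email):
    # Membership test: True if "@" appears anywhere in the string.
    return "@" in email


def is_valid_verbose(email):
    # The same check written out by hand, one character at a time.
    for character in email:
        if character == "@":
            return True
    return False


# In the lecture, email would come from input("What's your email? ").strip().
for email in ["malan@harvard.edu", "malan"]:
    print(email, "valid" if is_valid(email) else "invalid")
```

Both functions give the same answer for any string; the `in` operator simply does the searching for you.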
00:03:34.128 --> 00:03:36.420 align:middle line:84% But this is one of the things that's nice about Python. 00:03:36.420 --> 00:03:38.020 align:middle line:90% You can do more with less. 00:03:38.020 --> 00:03:41.190 align:middle line:84% So just by saying if "@" quote, unquote in email, 00:03:41.190 --> 00:03:43.800 align:middle line:84% we're achieving that same result. We're going to get back true 00:03:43.800 --> 00:03:47.670 align:middle line:84% if it's somewhere in there, thus valid, or false if it is not. 00:03:47.670 --> 00:03:50.970 align:middle line:84% Well, let me go ahead now and run this program in my terminal window 00:03:50.970 --> 00:03:53.100 align:middle line:90% with python of validate.py. 00:03:53.100 --> 00:03:56.730 align:middle line:84% And I'm going to go ahead and give it my email address-- malan@harvard.edu, 00:03:56.730 --> 00:03:57.570 align:middle line:90% Enter. 00:03:57.570 --> 00:03:58.770 align:middle line:90% And indeed, it's valid. 00:03:58.770 --> 00:04:00.210 align:middle line:90% Looks valid, is valid. 00:04:00.210 --> 00:04:03.480 align:middle line:84% But of course, this program is technically broken. 00:04:03.480 --> 00:04:04.410 align:middle line:90% It's buggy. 00:04:04.410 --> 00:04:07.110 align:middle line:84% What would be an example input, if someone 00:04:07.110 --> 00:04:10.770 align:middle line:84% might like to volunteer an answer here, that would be considered valid 00:04:10.770 --> 00:04:13.197 align:middle line:84% but you and I know it really isn't valid? 00:04:13.197 --> 00:04:14.280 align:middle line:90% AUDIENCE: Yeah, thank you. 00:04:14.280 --> 00:04:17.880 align:middle line:84% Well, for instance, you can type just two signs and that's it, 00:04:17.880 --> 00:04:20.760 align:middle line:90% and it'll still be valid-- 00:04:20.760 --> 00:04:23.815 align:middle line:84% still be valid according to your program, but missing something. 
00:04:23.815 --> 00:04:24.690 align:middle line:90% DAVID MALAN: Exactly. 00:04:24.690 --> 00:04:26.680 align:middle line:90% We've set a very low bar here. 00:04:26.680 --> 00:04:29.430 align:middle line:84% In fact, if I go ahead and rerun python of validate.py, 00:04:29.430 --> 00:04:33.512 align:middle line:84% and I'll just type in one @ sign, that's it-- no username, no domain name, 00:04:33.512 --> 00:04:35.470 align:middle line:84% this doesn't really look like an email address. 00:04:35.470 --> 00:04:38.790 align:middle line:84% But unfortunately, my code thinks it, in fact, is, because it's obviously 00:04:38.790 --> 00:04:40.710 align:middle line:90% just looking for an @ sign alone. 00:04:40.710 --> 00:04:42.250 align:middle line:90% Well, how could we improve this? 00:04:42.250 --> 00:04:45.500 align:middle line:84% Well, minimally an email address, I think, tends to have, 00:04:45.500 --> 00:04:47.250 align:middle line:84% though this is not actually a requirement, 00:04:47.250 --> 00:04:51.600 align:middle line:84% tends to have an @ sign and a single dot at least, maybe somewhere in the domain 00:04:51.600 --> 00:04:54.240 align:middle line:90% name-- so malan@harvard.edu. 00:04:54.240 --> 00:04:55.960 align:middle line:90% So let's check for that dot as well. 00:04:55.960 --> 00:04:59.040 align:middle line:84% But again, strictly speaking it doesn't even have to be that case. 00:04:59.040 --> 00:05:02.190 align:middle line:84% But I'm going for my own email address, at least for now, as our test case. 00:05:02.190 --> 00:05:06.450 align:middle line:84% So let me go ahead and change my code now and say, not only if @ is in email, 00:05:06.450 --> 00:05:11.050 align:middle line:90% but also dot is in email as well. 00:05:11.050 --> 00:05:12.690 align:middle line:90% So I'm asking now two questions. 
00:05:12.690 --> 00:05:16.050 align:middle line:84% I have two Boolean expressions-- if @ in email, 00:05:16.050 --> 00:05:20.650 align:middle line:84% and I'm anding them together logically-- this is a logical and, so to speak. 00:05:20.650 --> 00:05:24.600 align:middle line:84% So if it's the case that @ is in email and dot is in email, OK, 00:05:24.600 --> 00:05:26.470 align:middle line:90% now I'm going to go ahead and say valid. 00:05:26.470 --> 00:05:26.970 align:middle line:90% All right. 00:05:26.970 --> 00:05:29.460 align:middle line:84% This would still seem to work for my email address. 00:05:29.460 --> 00:05:34.500 align:middle line:84% Let me go ahead and run python validate.py, malan@harvard.edu, Enter, 00:05:34.500 --> 00:05:36.390 align:middle line:84% and that, of course, is valid as expected. 00:05:36.390 --> 00:05:39.870 align:middle line:84% But here, too, we can be a little adversarial and type in something 00:05:39.870 --> 00:05:41.505 align:middle line:90% nonsensical like "@." 00:05:41.505 --> 00:05:45.180 align:middle line:84% and unfortunately, that, too, is going to be mistaken as valid, 00:05:45.180 --> 00:05:48.820 align:middle line:84% even though there's still no username, domain name, or anything like that. 00:05:48.820 --> 00:05:51.180 align:middle line:84% So I think we need to be a little more methodical here. 00:05:51.180 --> 00:05:57.660 align:middle line:84% In fact, notice that if I do this like this, the @ sign can be anywhere, 00:05:57.660 --> 00:05:59.250 align:middle line:90% and the dot can be anywhere. 00:05:59.250 --> 00:06:02.190 align:middle line:84% But if I'm assuming the user is going to have a traditional domain 00:06:02.190 --> 00:06:05.880 align:middle line:84% name like harvard.edu or gmail.com, I really 00:06:05.880 --> 00:06:10.110 align:middle line:84% want to look for the dot in the domain name only, not necessarily 00:06:10.110 --> 00:06:11.580 align:middle line:90% just the username.
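A minimal sketch of this second version follows. The function name `is_valid` and the test strings are assumptions for illustration; `ma.lan@harvard` in particular is a made-up input chosen to show that a dot in the username alone is enough to pass.

```python
def is_valid(email):
    # Two separate Boolean expressions joined with a logical and.
    return "@" in email and "." in email


# Neither check cares where in the string the character appears.
for email in ["malan@harvard.edu", "@.", "ma.lan@harvard", "malan@harvard"]:
    print(repr(email), "valid" if is_valid(email) else "invalid")
```

Only the last input is rejected, which is why the check needs to look at the username and the domain separately.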
00:06:11.580 --> 00:06:13.510 align:middle line:90% So let me go ahead and do this. 00:06:13.510 --> 00:06:18.250 align:middle line:84% Let me go ahead and introduce a bit more logic here, and instead do this. 00:06:18.250 --> 00:06:24.060 align:middle line:84% Let me go ahead and do email.split of quote, unquote @ sign. 00:06:24.060 --> 00:06:26.460 align:middle line:90% So email, again, is a string or a str. 00:06:26.460 --> 00:06:29.550 align:middle line:84% strs come with methods, not just strip but also 00:06:29.550 --> 00:06:32.190 align:middle line:84% another one called split that, as the name implies, 00:06:32.190 --> 00:06:36.570 align:middle line:84% will split one str into multiple ones if you give it a character or more 00:06:36.570 --> 00:06:37.950 align:middle line:90% to split on. 00:06:37.950 --> 00:06:42.390 align:middle line:84% So this is hopefully going to return to me two parts from a traditional email 00:06:42.390 --> 00:06:44.880 align:middle line:84% address, the username and the domain name. 00:06:44.880 --> 00:06:47.850 align:middle line:84% And it turns out I can unpack that sequence of responses 00:06:47.850 --> 00:06:52.410 align:middle line:84% by doing this-- username comma domain equals this. 00:06:52.410 --> 00:06:55.060 align:middle line:84% I could store it in a list or some other structure, 00:06:55.060 --> 00:06:58.530 align:middle line:84% but if I already know in advance what kinds of values I'm expecting, 00:06:58.530 --> 00:07:00.510 align:middle line:84% a username and hopefully a domain, I'm going 00:07:00.510 --> 00:07:04.050 align:middle line:84% to go ahead and do it like this instead and just define two variables at once 00:07:04.050 --> 00:07:05.310 align:middle line:90% on one line of code. 00:07:05.310 --> 00:07:07.290 align:middle line:84% And now I'm going to be a little more precise. 
00:07:07.290 --> 00:07:13.230 align:middle line:84% If username-- if username, then I'm going to go ahead 00:07:13.230 --> 00:07:15.370 align:middle line:90% and say, print "valid." 00:07:15.370 --> 00:07:18.820 align:middle line:84% Else, I'm going to go ahead and say print "invalid." 00:07:18.820 --> 00:07:20.040 align:middle line:90% Now, this isn't good enough. 00:07:20.040 --> 00:07:22.800 align:middle line:84% But I'm at least checking for the presence of a username now. 00:07:22.800 --> 00:07:25.217 align:middle line:84% And you might not have seen this before, but if you simply 00:07:25.217 --> 00:07:28.680 align:middle line:84% ask a question like "if username," and username is a string, 00:07:28.680 --> 00:07:31.320 align:middle line:84% well, username-- "if username" is going to give me 00:07:31.320 --> 00:07:35.730 align:middle line:84% a true answer if username is anything except none or quote, 00:07:35.730 --> 00:07:36.840 align:middle line:90% unquote "nothing." 00:07:36.840 --> 00:07:41.820 align:middle line:84% So there's a truthy value here, whereby if username has at least one character, 00:07:41.820 --> 00:07:43.350 align:middle line:90% that's going to be considered true. 00:07:43.350 --> 00:07:46.170 align:middle line:84% But if username has no characters, it's going 00:07:46.170 --> 00:07:49.245 align:middle line:84% to be considered a false value effectively. 00:07:49.245 --> 00:07:50.370 align:middle line:90% But this isn't good enough. 00:07:50.370 --> 00:07:52.037 align:middle line:90% I don't want to just check for username. 00:07:52.037 --> 00:07:57.160 align:middle line:84% I want to also check that it's the case that dot is in the domain name as well. 00:07:57.160 --> 00:08:00.180 align:middle line:84% So notice here there's a bit of potential confusion 00:08:00.180 --> 00:08:01.830 align:middle line:90% with the English language. 
00:08:01.830 --> 00:08:04.620 align:middle line:84% Here, I seem to be saying "if username and dot 00:08:04.620 --> 00:08:09.660 align:middle line:84% in domain," as though I'm asking the question, "if the username and the dot 00:08:09.660 --> 00:08:12.270 align:middle line:84% are in the domain," but that's not what this means. 00:08:12.270 --> 00:08:15.540 align:middle line:84% These are two separate Boolean expressions-- "if username," 00:08:15.540 --> 00:08:19.690 align:middle line:90% and separately, "if dot in domain." 00:08:19.690 --> 00:08:23.100 align:middle line:84% And if I parenthesize this, we could make that even more clear by putting 00:08:23.100 --> 00:08:25.000 align:middle line:90% parentheses there, parentheses here. 00:08:25.000 --> 00:08:27.390 align:middle line:84% So just to be clear, it's really two Boolean expressions 00:08:27.390 --> 00:08:30.840 align:middle line:84% that we're anding together, not one longer English-like sentence. 00:08:30.840 --> 00:08:35.580 align:middle line:84% Now, if I go ahead and run this, python validate.py Enter, 00:08:35.580 --> 00:08:39.809 align:middle line:84% I'll do my own email address again, malan@harvard.edu, and that's valid. 00:08:39.809 --> 00:08:43.710 align:middle line:84% And it looks like I could tolerate something like this. 00:08:43.710 --> 00:08:47.970 align:middle line:84% If I do malan@, just say, harvard, I think at the moment 00:08:47.970 --> 00:08:49.480 align:middle line:90% this is going to be invalid. 00:08:49.480 --> 00:08:52.150 align:middle line:84% Now, maybe the top-level domain harvard exists. 00:08:52.150 --> 00:08:54.900 align:middle line:84% But at the moment, it looks like we're looking for something more. 00:08:54.900 --> 00:08:58.380 align:middle line:84% We're looking for a top-level domain too, like .edu. 00:08:58.380 --> 00:09:01.540 align:middle line:84% For now, we'll just consider this to be invalid.
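The split-and-unpack version might look roughly like this. The wrapper function `is_valid` and the call to `bool` are assumptions added so the result is a plain True or False; the lecture writes the condition directly in an `if`. Note one limitation the lecture does not address here: unpacking into two names only works when the input contains exactly one `@`.

```python
def is_valid(email):
    # split("@") returns a list; unpacking it into two names assumes
    # exactly one "@" in the input, otherwise Python raises ValueError.
    username, domain = email.split("@")
    # Read this as (username) and ("." in domain): two separate Boolean
    # expressions. A non-empty string is truthy; an empty one is falsy.
    return bool((username) and ("." in domain))


for email in ["malan@harvard.edu", "malan@harvard", "@harvard.edu"]:
    print(repr(email), "valid" if is_valid(email) else "invalid")
```

The parentheses are optional, since `in` binds more tightly than `and`, but they make the grouping explicit.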
00:09:01.540 --> 00:09:04.510 align:middle line:90% But it's not just that we want to do-- 00:09:04.510 --> 00:09:07.260 align:middle line:84% it's not just that we want to check for the presence of a username 00:09:07.260 --> 00:09:08.370 align:middle line:90% and the presence of a dot. 00:09:08.370 --> 00:09:09.520 align:middle line:90% Let's be more specific. 00:09:09.520 --> 00:09:11.687 align:middle line:84% Let's start to now narrow the scope of this program, 00:09:11.687 --> 00:09:15.600 align:middle line:84% not just to be about generic emails more generally, but about edu addresses, 00:09:15.600 --> 00:09:18.780 align:middle line:84% so specifically for someone in a US university, for instance, 00:09:18.780 --> 00:09:21.450 align:middle line:84% whose email address tends to end with .edu. 00:09:21.450 --> 00:09:23.310 align:middle line:90% I can be a little more precise. 00:09:23.310 --> 00:09:25.350 align:middle line:84% And you might recall this function already. 00:09:25.350 --> 00:09:28.590 align:middle line:84% Instead of just saying, is there a dot somewhere in domain, 00:09:28.590 --> 00:09:34.740 align:middle line:84% let me instead say, and the domain ends with quote, unquote ".edu." 00:09:34.740 --> 00:09:36.420 align:middle line:90% Now we're being even more precise. 00:09:36.420 --> 00:09:40.200 align:middle line:84% We want there to be minimally a username that's not empty-- it's not just quote, 00:09:40.200 --> 00:09:45.190 align:middle line:84% unquote "nothing"-- and we want the domain name to actually end with .edu. 00:09:45.190 --> 00:09:47.448 align:middle line:84% Let me go ahead and run python of validate.py. 00:09:47.448 --> 00:09:49.740 align:middle line:84% And just to make sure I haven't made things even worse, 00:09:49.740 --> 00:09:53.470 align:middle line:84% let me at least test my own email address, which does seem to be valid. 
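Narrowing the check to .edu addresses could be sketched as follows. As before, the function name `is_valid` and the sample inputs (including the gmail.com address) are illustrative assumptions, and the sketch still assumes exactly one `@` in the input.

```python
def is_valid(email):
    # Assumes exactly one "@"; any other count raises ValueError.
    username, domain = email.split("@")
    # Require a non-empty username and a domain that ends with ".edu".
    return bool(username) and domain.endswith(".edu")


for email in ["malan@harvard.edu", "malan@gmail.com", "@harvard.edu"]:
    print(repr(email), "valid" if is_valid(email) else "invalid")
```

`str.endswith` only looks at the final characters of the domain; it says nothing about what, if anything, comes before them.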
00:09:53.470 --> 00:09:56.070 align:middle line:84% Now, it seems that I minimally need to provide a username, 00:09:56.070 --> 00:09:58.380 align:middle line:84% because we definitely do have that check in place. 00:09:58.380 --> 00:10:00.210 align:middle line:90% So I'm going to go ahead and say malan. 00:10:00.210 --> 00:10:02.790 align:middle line:90% And now I'm going to go ahead and say @. 00:10:02.790 --> 00:10:05.880 align:middle line:84% And it looks like I could be a little malicious here, 00:10:05.880 --> 00:10:09.030 align:middle line:84% just say malan@.edu, as though minimally meeting 00:10:09.030 --> 00:10:11.340 align:middle line:90% the requirements of this pattern. 00:10:11.340 --> 00:10:13.200 align:middle line:84% And that, of course, is considered valid, 00:10:13.200 --> 00:10:17.010 align:middle line:84% but I'm pretty sure there's no one at malan@.edu. 00:10:17.010 --> 00:10:19.350 align:middle line:84% We need to have some domain name in there. 00:10:19.350 --> 00:10:21.360 align:middle line:84% So we're still not being quite as generous. 00:10:21.360 --> 00:10:24.510 align:middle line:84% Now, we could absolutely continue to iterate on this program, 00:10:24.510 --> 00:10:26.640 align:middle line:84% and we could add some more Boolean expressions. 00:10:26.640 --> 00:10:28.590 align:middle line:84% We could maybe use some other Python methods 00:10:28.590 --> 00:10:31.530 align:middle line:84% for checking more precisely is there something to the left of the dot, 00:10:31.530 --> 00:10:32.550 align:middle line:90% to the right of the dot. 00:10:32.550 --> 00:10:34.320 align:middle line:90% We could use split multiple times. 00:10:34.320 --> 00:10:36.180 align:middle line:84% But honestly, this just escalates quickly. 
00:10:36.180 --> 00:10:39.450 align:middle line:84% Like, you end up having to write a lot of code just 00:10:39.450 --> 00:10:42.360 align:middle line:84% to express something that's relatively simple in spirit-- 00:10:42.360 --> 00:10:45.550 align:middle line:90% just format this like an email address. 00:10:45.550 --> 00:10:47.920 align:middle line:90% So how can we go about improving this? 00:10:47.920 --> 00:10:52.350 align:middle line:84% Well, it turns out in Python there's a library for regular expressions. 00:10:52.350 --> 00:10:55.620 align:middle line:84% It's called succinctly R-E. And in the re library, 00:10:55.620 --> 00:11:00.510 align:middle line:84% you have a lot of capabilities to define and check for and even replace 00:11:00.510 --> 00:11:01.440 align:middle line:90% patterns. 00:11:01.440 --> 00:11:03.630 align:middle line:84% Again, a regular expression is a pattern. 00:11:03.630 --> 00:11:05.998 align:middle line:84% And this library, the re library in Python, 00:11:05.998 --> 00:11:08.040 align:middle line:84% is going to let us define some of these patterns, 00:11:08.040 --> 00:11:09.915 align:middle line:84% like a pattern for an email address, and then 00:11:09.915 --> 00:11:12.720 align:middle line:84% use some built-in functions to actually validate 00:11:12.720 --> 00:11:14.820 align:middle line:84% a user's input against that pattern or even 00:11:14.820 --> 00:11:17.250 align:middle line:84% use these patterns to change the user's input 00:11:17.250 --> 00:11:19.650 align:middle line:84% or extract partial information therefrom. 00:11:19.650 --> 00:11:22.030 align:middle line:90% We'll see examples of all this and more. 00:11:22.030 --> 00:11:24.045 align:middle line:84% So what can and should I do with this library? 00:11:24.045 --> 00:11:26.670 align:middle line:84% Well, first and foremost, it comes with a lot of functionality. 
00:11:26.670 --> 00:11:29.760 align:middle line:84% Here is the URL, for instance, to the official documentation. 00:11:29.760 --> 00:11:31.710 align:middle line:84% And let me propose that we focus on using 00:11:31.710 --> 00:11:36.600 align:middle line:84% one of the most versatile functions in the library, namely this-- search. 00:11:36.600 --> 00:11:40.440 align:middle line:84% re.search is the name of the function in the re module 00:11:40.440 --> 00:11:42.910 align:middle line:84% that allows you to pass in a few arguments. 00:11:42.910 --> 00:11:46.620 align:middle line:84% The first is going to be a pattern that you want to search for in, 00:11:46.620 --> 00:11:48.900 align:middle line:84% for instance, a string that came from a user. 00:11:48.900 --> 00:11:51.977 align:middle line:84% The string argument here is going to be the actual string that you 00:11:51.977 --> 00:11:53.310 align:middle line:90% want to search for that pattern. 00:11:53.310 --> 00:11:55.410 align:middle line:84% And then there's a third argument optionally 00:11:55.410 --> 00:11:56.790 align:middle line:90% that's a whole bunch of flags. 00:11:56.790 --> 00:11:59.880 align:middle line:84% A flag in general is like a parameter you can pass in 00:11:59.880 --> 00:12:01.510 align:middle line:90% to modify the behavior of the function. 00:12:01.510 --> 00:12:03.510 align:middle line:84% But initially, we're not even going to use this. 00:12:03.510 --> 00:12:06.610 align:middle line:84% We're just going to pass in a couple of arguments instead. 00:12:06.610 --> 00:12:11.700 align:middle line:84% So let me go ahead and employ this re library, this regular expression 00:12:11.700 --> 00:12:15.162 align:middle line:84% library, and just improve on this design incrementally. 00:12:15.162 --> 00:12:17.370 align:middle line:84% So we're not going to solve this problem all at once, 00:12:17.370 --> 00:12:19.590 align:middle line:90% but we'll take some incremental steps.
00:12:19.590 --> 00:12:21.840 align:middle line:90% I'm going to go back to VS Code here. 00:12:21.840 --> 00:12:25.050 align:middle line:84% And I'm going to go ahead now and get rid of most of this code. 00:12:25.050 --> 00:12:28.230 align:middle line:84% But I'm going to go into the top of my file and first of all, 00:12:28.230 --> 00:12:30.030 align:middle line:90% import this re library. 00:12:30.030 --> 00:12:33.030 align:middle line:84% So import re gives me access to that function and more. 00:12:33.030 --> 00:12:36.150 align:middle line:84% Now, after I've gotten the user's input in the same way as before, 00:12:36.150 --> 00:12:38.790 align:middle line:84% stripping off any leading or trailing whitespace, 00:12:38.790 --> 00:12:42.250 align:middle line:84% I'm just going to use this function super trivially for now, 00:12:42.250 --> 00:12:44.460 align:middle line:84% even though this isn't really a big step forward. 00:12:44.460 --> 00:12:50.190 align:middle line:84% I'm going to say, if re.search contains quote, unquote "@" 00:12:50.190 --> 00:12:53.700 align:middle line:84% in the email address, then let's go ahead and print "valid." 00:12:53.700 --> 00:12:55.740 align:middle line:84% Else, let's go ahead and print "invalid." 00:12:55.740 --> 00:12:59.730 align:middle line:84% At the moment, this is really no better than my very first version 00:12:59.730 --> 00:13:04.150 align:middle line:84% where I was just asking Python, if @ sign in the email address. 00:13:04.150 --> 00:13:08.880 align:middle line:84% But now I'm at least beginning to use this library by using its own re.search 00:13:08.880 --> 00:13:13.740 align:middle line:84% function, which for now you can assume returns a true value effectively 00:13:13.740 --> 00:13:16.440 align:middle line:90% if, indeed, the @ sign is in email.
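Here is a small sketch of that first use of `re.search`. The wrapper function `is_valid` and the sample inputs are assumptions for illustration. Strictly speaking, `re.search` returns a match object when the pattern is found and `None` when it is not, which is why it can be used directly in an `if`.

```python
import re


def is_valid(email):
    # re.search(pattern, string) scans the whole string for the pattern.
    # It returns a match object (truthy) on success and None on failure.
    return re.search("@", email) is not None


# In the lecture the condition is written directly as:
#     if re.search("@", email):
for email in ["malan@harvard.edu", "malan"]:
    print(repr(email), "valid" if is_valid(email) else "invalid")
```

With a pattern that is just a literal `@`, this behaves the same as the earlier `"@" in email` test.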
00:13:16.440 --> 00:13:19.800 align:middle line:84% Just to make sure that this version does work as I expect, let me go ahead 00:13:19.800 --> 00:13:22.590 align:middle line:90% and run python of validate.py and Enter. 00:13:22.590 --> 00:13:26.220 align:middle line:84% I'll type in my actual email address, and we're back in business. 00:13:26.220 --> 00:13:29.370 align:middle line:84% But of course, this is not great, because if I similarly 00:13:29.370 --> 00:13:32.400 align:middle line:84% run this version of the program and just type in an @ sign, 00:13:32.400 --> 00:13:35.860 align:middle line:84% not an email address, and yet my code, of course, thinks it is valid. 00:13:35.860 --> 00:13:37.980 align:middle line:90% So how can I do better than this? 00:13:37.980 --> 00:13:42.330 align:middle line:84% Well, we need a bit more vocabulary in the realm of regular expressions, 00:13:42.330 --> 00:13:46.290 align:middle line:84% in order to be able to express ourselves a little more precisely. 00:13:46.290 --> 00:13:48.900 align:middle line:84% Really, the pattern I want to ultimately define 00:13:48.900 --> 00:13:52.410 align:middle line:84% is going to be something like, I want there to be something to the left, 00:13:52.410 --> 00:13:55.320 align:middle line:84% then an @ sign, then something to the right. 00:13:55.320 --> 00:13:59.310 align:middle line:84% And that something to the right should end with .edu but should also have 00:13:59.310 --> 00:14:02.160 align:middle line:84% something before the .edu, like Harvard, or Yale, 00:14:02.160 --> 00:14:04.680 align:middle line:90% or any other school in the US as well. 00:14:04.680 --> 00:14:06.550 align:middle line:90% Well, how can I go about doing this? 
00:14:06.550 --> 00:14:11.040 align:middle line:84% Well, it turns out that in the world of regular expressions, whether in Python 00:14:11.040 --> 00:14:14.220 align:middle line:84% or a lot of other languages as well, there are certain symbols 00:14:14.220 --> 00:14:16.140 align:middle line:90% that you can use to define patterns. 00:14:16.140 --> 00:14:19.030 align:middle line:84% At the moment, I've just used literal raw text. 00:14:19.030 --> 00:14:21.600 align:middle line:84% If I go back to my code here, this technically 00:14:21.600 --> 00:14:23.940 align:middle line:90% qualifies as a regular expression. 00:14:23.940 --> 00:14:28.290 align:middle line:84% I've passed in a quoted string inside of which is an @ sign. 00:14:28.290 --> 00:14:30.550 align:middle line:84% Now, that's not a very interesting pattern. 00:14:30.550 --> 00:14:31.500 align:middle line:90% It's just an @ sign. 00:14:31.500 --> 00:14:34.290 align:middle line:84% But it turns out that once you have access to regular expressions 00:14:34.290 --> 00:14:37.350 align:middle line:84% or a library that offers that feature, you can more 00:14:37.350 --> 00:14:40.360 align:middle line:90% powerfully express yourself as follows. 00:14:40.360 --> 00:14:43.770 align:middle line:84% Let me reveal that the pattern that you pass to re.search 00:14:43.770 --> 00:14:45.690 align:middle line:84% can take a whole bunch of special symbols. 00:14:45.690 --> 00:14:47.160 align:middle line:90% And here's just some of them. 00:14:47.160 --> 00:14:51.630 align:middle line:84% In the examples we're about to see, in the patterns we're about to define, 00:14:51.630 --> 00:14:53.040 align:middle line:90% here are the special symbols. 00:14:53.040 --> 00:14:56.280 align:middle line:84% You can use a single period, a dot, to just represent 00:14:56.280 --> 00:14:59.040 align:middle line:84% any character except a newline, a blank line. 
00:14:59.040 --> 00:15:02.190 align:middle line:84% So that is to say, if I don't really care what letters of the alphabet 00:15:02.190 --> 00:15:04.200 align:middle line:84% are in the user's username, I just want there 00:15:04.200 --> 00:15:07.410 align:middle line:84% to be one or more characters in the user's name, 00:15:07.410 --> 00:15:11.340 align:middle line:84% dot allows me to express A through z, uppercase and lowercase, 00:15:11.340 --> 00:15:13.560 align:middle line:90% and a bunch of other letters as well. 00:15:13.560 --> 00:15:18.850 align:middle line:84% * is going to mean-- a single asterisk-- zero or more repetitions. 00:15:18.850 --> 00:15:21.630 align:middle line:84% So if I say something *, that means that I'm 00:15:21.630 --> 00:15:24.450 align:middle line:84% willing to accept either zero repetitions, that is, 00:15:24.450 --> 00:15:27.510 align:middle line:90% nothing at all, or more repetitions-- 00:15:27.510 --> 00:15:29.580 align:middle line:90% 1, or 2, or 3, or 300. 00:15:29.580 --> 00:15:31.950 align:middle line:84% If you see a plus in my pattern, so that's 00:15:31.950 --> 00:15:34.135 align:middle line:90% going to mean one or more repetitions. 00:15:34.135 --> 00:15:37.260 align:middle line:84% That is to say, there's got to be at least one character there, one symbol, 00:15:37.260 --> 00:15:40.180 align:middle line:84% and then there's optionally more after that. 00:15:40.180 --> 00:15:43.110 align:middle line:84% And then you can say zero or one repetition. 00:15:43.110 --> 00:15:46.590 align:middle line:84% You can use a single question mark after a symbol, and that will say, 00:15:46.590 --> 00:15:51.260 align:middle line:84% I want zero of this character or one, but that's all I'll expect. 00:15:51.260 --> 00:15:53.010 align:middle line:84% And then lastly, there's going to be a way 00:15:53.010 --> 00:15:55.140 align:middle line:90% to specify a specific number of symbols. 
00:15:55.140 --> 00:15:57.330 align:middle line:84% If you use these curly braces and a number, 00:15:57.330 --> 00:15:59.610 align:middle line:84% represented here symbolically as m, you can 00:15:59.610 --> 00:16:03.720 align:middle line:84% specify that you want m repetitions, be it 1, or 2, or 3, or 300. 00:16:03.720 --> 00:16:06.190 align:middle line:84% You can specify the number of repetitions yourself. 00:16:06.190 --> 00:16:08.280 align:middle line:84% And if you want a range of repetitions, like you 00:16:08.280 --> 00:16:11.100 align:middle line:84% want this few characters or this many characters, 00:16:11.100 --> 00:16:13.770 align:middle line:84% you can use curly braces and two numbers inside, 00:16:13.770 --> 00:16:18.760 align:middle line:84% called here m and n, which would be a range of m through n repetitions. 00:16:18.760 --> 00:16:20.140 align:middle line:90% Now, what does all of this mean? 00:16:20.140 --> 00:16:22.380 align:middle line:84% Well, let me go back to VS Code here, and let 00:16:22.380 --> 00:16:25.650 align:middle line:84% me propose that we iterate on this solution further. 00:16:25.650 --> 00:16:27.985 align:middle line:84% It's not sufficient to just check for the @ sign. 00:16:27.985 --> 00:16:28.860 align:middle line:90% We know that already. 00:16:28.860 --> 00:16:31.600 align:middle line:84% We minimally want something to the left and to the right. 00:16:31.600 --> 00:16:33.210 align:middle line:90% So how can I represent that? 00:16:33.210 --> 00:16:35.910 align:middle line:84% I don't really care what the user's username is, 00:16:35.910 --> 00:16:40.020 align:middle line:84% or what letters of the alphabet are in it, be it malan or anyone else's. 
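To make the symbols just listed concrete, here is a minimal sketch that tries each one against a few short strings. The helper name matches is hypothetical, and re.fullmatch is used (rather than the re.search seen in validate.py) only as an illustrative assumption, so that the whole string has to fit the pattern and the repetition counts are easy to see.

```python
import re


def matches(pattern, text):
    """Return True if the entire text fits the pattern (illustrative helper)."""
    return re.fullmatch(pattern, text) is not None


print(matches(r".", "a"))           # True: dot is any one character
print(matches(r".", "\n"))          # False: dot does not match a newline
print(matches(r"a*", ""))           # True: zero repetitions are allowed
print(matches(r"a+", ""))           # False: at least one repetition is required
print(matches(r"a?", "aa"))         # False: zero or one repetition only
print(matches(r"a{3}", "aaa"))      # True: exactly three repetitions
print(matches(r"a{2,4}", "aaaaa"))  # False: five is outside the range 2 through 4
```

Swapping in other strings is a quick way to check your intuition for each symbol before using it in a longer pattern.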
00:16:40.020 --> 00:16:42.600 align:middle line:84% So what I'm going to do to the left of this @ sign 00:16:42.600 --> 00:16:44.410 align:middle line:90% is I'm going to use a single period-- 00:16:44.410 --> 00:16:49.600 align:middle line:84% the dot that, again, indicates any character except for a newline. 00:16:49.600 --> 00:16:51.630 align:middle line:84% But I don't just want a single character. 00:16:51.630 --> 00:16:55.900 align:middle line:84% Otherwise, the person's username could only be a at such and such, 00:16:55.900 --> 00:16:57.450 align:middle line:90% or b at such and such. 00:16:57.450 --> 00:17:00.130 align:middle line:84% I want it to be multiple such characters. 00:17:00.130 --> 00:17:01.680 align:middle line:90% So I'm going to initially use a *. 00:17:01.680 --> 00:17:05.550 align:middle line:84% So dot * means give me something to the left, and I'm going to do another one, 00:17:05.550 --> 00:17:07.619 align:middle line:90% dot * something to the right. 00:17:07.619 --> 00:17:10.589 align:middle line:84% Now, this isn't perfect, but it's at least a step forward. 00:17:10.589 --> 00:17:12.871 align:middle line:84% Because now what I'm going to go ahead and do is this. 00:17:12.871 --> 00:17:14.579 align:middle line:84% I'm going to rerun python of validate.py. 00:17:14.579 --> 00:17:17.040 align:middle line:84% And I'm going to keep testing my own email address just to make 00:17:17.040 --> 00:17:18.415 align:middle line:90% sure I haven't made things worse. 00:17:18.415 --> 00:17:19.800 align:middle line:90% And that's now OK. 00:17:19.800 --> 00:17:22.530 align:middle line:84% I'm now going to go ahead and type in some other input, 00:17:22.530 --> 00:17:28.380 align:middle line:84% like how about just malan@ with no domain name whatsoever. 00:17:28.380 --> 00:17:30.640 align:middle line:84% And you would think this is going to be invalid. 00:17:30.640 --> 00:17:34.680 align:middle line:84% But, but, but it's still considered valid.
00:17:34.680 --> 00:17:35.850 align:middle line:90% But why is that? 00:17:35.850 --> 00:17:42.120 align:middle line:84% If I go back to this chart, why is malan@ with no domain now considered 00:17:42.120 --> 00:17:43.260 align:middle line:90% valid? 00:17:43.260 --> 00:17:50.010 align:middle line:84% What's my mistake here by having used .*@.* as my regular expression 00:17:50.010 --> 00:17:50.670 align:middle line:90% or regex? 00:17:50.670 --> 00:17:54.355 align:middle line:84% AUDIENCE: Because you're using the * instead of the plus sign. 00:17:54.355 --> 00:17:55.230 align:middle line:90% DAVID MALAN: Exactly. 00:17:55.230 --> 00:17:58.090 align:middle line:84% The *, again, means zero or more repetitions. 00:17:58.090 --> 00:18:03.120 align:middle line:84% So re.search is perfectly happy to accept nothing after the @ sign, 00:18:03.120 --> 00:18:05.230 align:middle line:90% because that would be zero repetitions. 00:18:05.230 --> 00:18:09.000 align:middle line:84% So I think I minimally need to evolve this and go back to my code here. 00:18:09.000 --> 00:18:12.990 align:middle line:84% And let me go ahead and change this from dot * to dot +. 00:18:12.990 --> 00:18:16.620 align:middle line:84% And let me change the ending from dot * to dot + 00:18:16.620 --> 00:18:18.900 align:middle line:90% so that now when I run my code here-- 00:18:18.900 --> 00:18:21.510 align:middle line:84% let me go ahead and run python of validate.py. 00:18:21.510 --> 00:18:23.490 align:middle line:84% I'm going to test my email address as always. 00:18:23.490 --> 00:18:24.600 align:middle line:90% Still working. 00:18:24.600 --> 00:18:27.690 align:middle line:84% Now let me go ahead and type in that same thing from before that 00:18:27.690 --> 00:18:29.820 align:middle line:90% was accidentally considered valid. 00:18:29.820 --> 00:18:32.590 align:middle line:90% Now I hit Enter, finally it's invalid. 
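The difference between the star and the plus here can be checked directly. This is a small sketch rather than the validate.py from the lecture: the helper looks_valid is a hypothetical name, and the inputs are hard-coded instead of being read from the user.

```python
import re


def looks_valid(pattern, email):
    """Return True if the pattern is found somewhere in email (illustrative helper)."""
    return re.search(pattern, email) is not None


# With *, zero characters after the @ sign are acceptable.
print(looks_valid(r".*@.*", "malan@"))             # True
# With +, at least one character is required on each side.
print(looks_valid(r".+@.+", "malan@"))             # False
print(looks_valid(r".+@.+", "@harvard.edu"))       # False
print(looks_valid(r".+@.+", "malan@harvard.edu"))  # True
```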
00:18:32.590 --> 00:18:35.460 align:middle line:84% So now we're making some progress on being a little more 00:18:35.460 --> 00:18:37.560 align:middle line:90% precise as to what it is we're doing. 00:18:37.560 --> 00:18:40.920 align:middle line:84% Now, I'll note here, like with almost everything in programming, 00:18:40.920 --> 00:18:45.090 align:middle line:84% Python included, there's often multiple ways to solve the same problem. 00:18:45.090 --> 00:18:49.410 align:middle line:84% And does anyone see a way in my code here 00:18:49.410 --> 00:18:54.360 align:middle line:84% that I can make a slight tweak if I forgot that the plus operator exists 00:18:54.360 --> 00:18:56.880 align:middle line:90% and go back to using a *? 00:18:56.880 --> 00:19:00.570 align:middle line:84% If I allowed you only to use dots and only stars, 00:19:00.570 --> 00:19:03.570 align:middle line:90% could you recreate the notion of plus? 00:19:03.570 --> 00:19:04.890 align:middle line:90% AUDIENCE: Yes. 00:19:04.890 --> 00:19:06.930 align:middle line:90% Use another dot, dot dot *. 00:19:06.930 --> 00:19:07.680 align:middle line:90% DAVID MALAN: Yeah. 00:19:07.680 --> 00:19:10.290 align:middle line:84% Because if a dot means any character, we'll just use a dot. 00:19:10.290 --> 00:19:14.040 align:middle line:84% And then when you want to say "or more," use another dot and then the *. 00:19:14.040 --> 00:19:18.300 align:middle line:84% So equivalent to dot + would have been dot dot *, 00:19:18.300 --> 00:19:21.870 align:middle line:84% because the first dot means any character, and the second pair 00:19:21.870 --> 00:19:25.050 align:middle line:84% of characters, dot *, means zero or more other characters. 00:19:25.050 --> 00:19:27.600 align:middle line:84% And to be clear, it doesn't have to be the same character. 
00:19:27.600 --> 00:19:31.830 align:middle line:84% Just by doing dot or dot * does not mean your whole username needs to be 00:19:31.830 --> 00:19:35.310 align:middle line:90% a, or aa, or aaa, or aaaa. 00:19:35.310 --> 00:19:37.230 align:middle line:90% It can vary with each symbol. 00:19:37.230 --> 00:19:41.790 align:middle line:84% It just means zero or more of any character back to back. 00:19:41.790 --> 00:19:44.050 align:middle line:84% So I could do this on both the left and the right. 00:19:44.050 --> 00:19:45.120 align:middle line:90% Which one is better? 00:19:45.120 --> 00:19:46.110 align:middle line:90% You know, it depends. 00:19:46.110 --> 00:19:49.860 align:middle line:84% I think an argument could be made that this is even more clear, because it's 00:19:49.860 --> 00:19:52.380 align:middle line:84% obvious now that there's a dot, which means any character, 00:19:52.380 --> 00:19:53.910 align:middle line:90% and then there's the dot *. 00:19:53.910 --> 00:19:56.250 align:middle line:84% But if you're in the habit of doing this frequently, 00:19:56.250 --> 00:19:58.500 align:middle line:84% one of the reasons things like the plus exist 00:19:58.500 --> 00:20:01.750 align:middle line:84% is just to consolidate your code into something a little more succinct. 00:20:01.750 --> 00:20:03.750 align:middle line:84% And if you're familiar with seeing the plus now, 00:20:03.750 --> 00:20:05.470 align:middle line:90% maybe this is more readable to you. 00:20:05.470 --> 00:20:07.590 align:middle line:84% So again, just like with Python more generally, 00:20:07.590 --> 00:20:10.590 align:middle line:84% you're going to often see different ways to express the same patterns, 00:20:10.590 --> 00:20:12.750 align:middle line:84% and reasonable people might agree or disagree 00:20:12.750 --> 00:20:15.810 align:middle line:90% as to which way is better than another. 
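One way to convince yourself that dot dot * behaves like dot + is to run both spellings over the same inputs and compare. The sample strings and the helper name agree below are assumptions chosen for illustration only.

```python
import re


def agree(text):
    """Return True if '.+@.+' and '..*@..*' give the same verdict for text."""
    with_plus = re.search(r".+@.+", text) is not None
    with_dot_star = re.search(r"..*@..*", text) is not None
    return with_plus == with_dot_star


samples = ["malan@harvard.edu", "malan@", "@harvard.edu", "a@b", "@", ""]
for sample in samples:
    # Every line should end in True, since the two patterns are equivalent.
    print(repr(sample), agree(sample))
```

Both versions accept and reject the same strings; the choice between them is about readability, as discussed above.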
00:20:15.810 --> 00:20:18.030 align:middle line:84% Well, let me propose to you that we can think 00:20:18.030 --> 00:20:20.520 align:middle line:84% about both of these models a little more graphically. 00:20:20.520 --> 00:20:22.770 align:middle line:84% If this looks a little cryptic to you, let me go ahead 00:20:22.770 --> 00:20:26.610 align:middle line:84% and rewind to the previous incarnation of this regular expression, which 00:20:26.610 --> 00:20:28.830 align:middle line:90% was just a single dot *. 00:20:28.830 --> 00:20:32.910 align:middle line:84% This regular expression, .*@.* means what again? 00:20:32.910 --> 00:20:36.690 align:middle line:84% It means zero or more characters followed by a literal @ sign followed 00:20:36.690 --> 00:20:38.580 align:middle line:90% by zero or more other characters. 00:20:38.580 --> 00:20:41.850 align:middle line:84% Now when you pass this pattern in as an argument to re.search, 00:20:41.850 --> 00:20:45.030 align:middle line:84% it's going to read it from left to right and then use 00:20:45.030 --> 00:20:48.750 align:middle line:84% it to try to match against the input, email, in this case, 00:20:48.750 --> 00:20:50.100 align:middle line:90% that the user typed in. 00:20:50.100 --> 00:20:53.070 align:middle line:84% Now, how is the computer, how is re.search 00:20:53.070 --> 00:20:57.760 align:middle line:84% going to keep track of whether or not the user's email matches this pattern? 00:20:57.760 --> 00:21:01.230 align:middle line:84% Well, it turns out that it's going to be using a machine of sorts implemented 00:21:01.230 --> 00:21:03.540 align:middle line:84% in software known as a finite state machine, or more 00:21:03.540 --> 00:21:06.750 align:middle line:84% formally, a nondeterministic finite automaton. 00:21:06.750 --> 00:21:09.930 align:middle line:84% And the way it works, if we depict this graphically, is as follows. 
00:21:09.930 --> 00:21:14.940 align:middle line:84% The re.search function starts over here in a so-called start state. 00:21:14.940 --> 00:21:16.980 align:middle line:84% That's the sort of condition in which it begins. 00:21:16.980 --> 00:21:20.730 align:middle line:84% And then it's going to read the user's email address from left to right. 00:21:20.730 --> 00:21:24.030 align:middle line:84% And it's going to decide whether or not to stay in this first state 00:21:24.030 --> 00:21:26.170 align:middle line:90% or transition to the next state. 00:21:26.170 --> 00:21:29.970 align:middle line:84% So for instance, in this first state, as the function is reading my email address, 00:21:29.970 --> 00:21:35.130 align:middle line:84% malan@harvard.edu, it's going to follow this curved edge up and around 00:21:35.130 --> 00:21:36.870 align:middle line:90% to itself, a reflexive edge. 00:21:36.870 --> 00:21:40.030 align:middle line:84% And it's labeled dot, because dot, again, just means any character. 00:21:40.030 --> 00:21:43.840 align:middle line:84% So as the function is reading my email address, malan@harvard.edu, 00:21:43.840 --> 00:21:48.270 align:middle line:84% from left to right, it's going to follow these transitions as follows, 00:21:48.270 --> 00:21:53.070 align:middle line:90% M-A-L-A-N. 00:21:53.070 --> 00:21:56.040 align:middle line:84% And then it's hopefully going to follow this transition 00:21:56.040 --> 00:22:00.000 align:middle line:84% to the second state, because there's a literal @ sign both in this machine 00:22:00.000 --> 00:22:01.630 align:middle line:90% as well as in my email address. 00:22:01.630 --> 00:22:10.070 align:middle line:84% Then it's going to try to read the rest of my address, H-A-R-V-A-R-D dot E-D-U, 00:22:10.070 --> 00:22:11.190 align:middle line:90% and that's it. 00:22:11.190 --> 00:22:12.870 align:middle line:90% And then the computer is going to check.
00:22:12.870 --> 00:22:16.260 align:middle line:84% Did it end up in an accept state, a final state, 00:22:16.260 --> 00:22:18.120 align:middle line:84% that's actually depicted here pictorially 00:22:18.120 --> 00:22:21.150 align:middle line:84% a little differently with double circles, one inside of the other? 00:22:21.150 --> 00:22:25.410 align:middle line:84% And that just means that if the computer finds itself in that second 00:22:25.410 --> 00:22:29.130 align:middle line:84% accept state after having read all of the user's input, 00:22:29.130 --> 00:22:31.560 align:middle line:90% it is, indeed, a valid email address. 00:22:31.560 --> 00:22:34.350 align:middle line:84% If by some chance, the machine somehow ended up 00:22:34.350 --> 00:22:37.020 align:middle line:84% stuck in that first state, which does not have double circles 00:22:37.020 --> 00:22:39.300 align:middle line:84% and is therefore not an accept state, the computer 00:22:39.300 --> 00:22:42.810 align:middle line:84% would conclude this is an invalid email address instead. 00:22:42.810 --> 00:22:45.630 align:middle line:84% By contrast, if we go back to my other version 00:22:45.630 --> 00:22:49.800 align:middle line:84% of the code where I instead had dot plus on both the left and the right, 00:22:49.800 --> 00:22:53.130 align:middle line:84% recall that re.search is going to use one of these state machines 00:22:53.130 --> 00:22:57.030 align:middle line:84% in order to decide from left to right whether or not to accept the user's 00:22:57.030 --> 00:22:59.310 align:middle line:90% input, like malan@harvard.edu. 00:22:59.310 --> 00:23:02.850 align:middle line:84% Can we get from the start state, so to speak, to an accept state 00:23:02.850 --> 00:23:05.940 align:middle line:84% to decide, yep, this was, in fact, meeting the pattern?
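The first machine, the one drawn for .*@.*, can be sketched in a few lines of Python. This is only a conceptual model of the picture described above: the function name accepts and the state numbers 0 and 1 are assumptions made for illustration, and the sketch makes no claim about how re.search is implemented internally. Because a dot also matches an @ sign, the machine can be in more than one state at once, so the code tracks a set of states.

```python
def accepts(text):
    """Simulate the two-state machine for .*@.* over the whole input."""
    # State 0 is the start state; state 1 is the accept state (double circle).
    states = {0}
    for ch in text:
        next_states = set()
        for state in states:
            if ch != "\n":
                # The looping edge labeled dot: stay in the same state.
                next_states.add(state)
            if state == 0 and ch == "@":
                # The edge labeled @: move from the start state to the accept state.
                next_states.add(1)
        states = next_states
    # Valid only if, after reading all input, the accept state is reachable.
    return 1 in states


print(accepts("malan@harvard.edu"))  # True
print(accepts("malan@"))             # True: zero characters after @ are fine
print(accepts("malan"))              # False: never leaves the start state
```

The machine for dot plus on both sides would need two extra states, one on each side of the @ sign, so that at least one character has to be consumed before and after it.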
00:23:05.940 --> 00:23:09.900 align:middle line:84% Well, let's propose that this nondeterministic finite automaton 00:23:09.900 --> 00:23:11.430 align:middle line:90% looked like this instead. 00:23:11.430 --> 00:23:14.310 align:middle line:84% We're going to start as before in the leftmost start state, 00:23:14.310 --> 00:23:18.180 align:middle line:84% and we're going to necessarily consume one character per this first edge, 00:23:18.180 --> 00:23:21.480 align:middle line:84% which is labeled with a dot to indicate that we can consume any one character, 00:23:21.480 --> 00:23:24.411 align:middle line:90% like the m in malan@harvard.edu. 00:23:24.411 --> 00:23:27.960 align:middle line:84% Then we can spend some time consuming more characters before the @ sign, 00:23:27.960 --> 00:23:31.290 align:middle line:90% so the A-L-A-N. 00:23:31.290 --> 00:23:33.340 align:middle line:90% Then we can consume the @ sign. 00:23:33.340 --> 00:23:36.270 align:middle line:84% Then we can consume at least one more character, because recall 00:23:36.270 --> 00:23:38.760 align:middle line:90% that the regex has dot plus this time. 00:23:38.760 --> 00:23:42.190 align:middle line:84% And then we can consume even more characters if we want. 00:23:42.190 --> 00:23:45.900 align:middle line:84% So if we first consume the H in harvard.edu, 00:23:45.900 --> 00:23:53.885 align:middle line:84% then leaves the A-R-V-A-R-D, and then dot E-D-U. 
00:23:53.885 --> 00:23:56.560 align:middle line:84% And now here, too, we're at the end of the story, 00:23:56.560 --> 00:23:59.760 align:middle line:84% but we're in an accept state, because that circle at the end 00:23:59.760 --> 00:24:03.840 align:middle line:84% has two circles total, which means that if the computer, if this function, 00:24:03.840 --> 00:24:07.830 align:middle line:84% finds itself in that accept state after reading the entirety of the user's 00:24:07.830 --> 00:24:11.580 align:middle line:84% input, it is, too, in fact, a valid email address. 00:24:11.580 --> 00:24:15.390 align:middle line:84% If by contrast, we had gotten stuck in one of those other states, 00:24:15.390 --> 00:24:18.180 align:middle line:84% unable to follow a transition, one of those edges, 00:24:18.180 --> 00:24:22.440 align:middle line:84% and therefore unable to make progress in the user's input from left to right, 00:24:22.440 --> 00:24:26.670 align:middle line:84% then we would have to conclude that email address is, in fact, invalid. 00:24:26.670 --> 00:24:29.490 align:middle line:84% Well, how can we go about improving this code further? 00:24:29.490 --> 00:24:33.660 align:middle line:84% Let me propose now that we check not only for a username and also something 00:24:33.660 --> 00:24:37.320 align:middle line:84% after the username, like a domain name, but minimally require that the string 00:24:37.320 --> 00:24:39.600 align:middle line:90% ends with .edu as well. 00:24:39.600 --> 00:24:41.970 align:middle line:84% Well, I think I could do this fairly straightforwardly. 00:24:41.970 --> 00:24:44.940 align:middle line:84% Not only do I want there to be something after the @ sign, 00:24:44.940 --> 00:24:49.818 align:middle line:84% like the domain like Harvard, I want the whole thing to end with .edu. 00:24:49.818 --> 00:24:52.320 align:middle line:90% But there's a little bit of danger here.
00:24:52.320 --> 00:24:57.660 align:middle line:84% What have I done wrong by implementing my regular expression now in this way, 00:24:57.660 --> 00:24:59.110 align:middle line:90% by using .+@.+.edu? 00:24:59.110 --> 00:25:01.938 align:middle line:90% 00:25:01.938 --> 00:25:06.080 align:middle line:90% What could go wrong with this version? 00:25:06.080 --> 00:25:08.360 align:middle line:84% AUDIENCE: The dot is-- the dot means something 00:25:08.360 --> 00:25:11.510 align:middle line:84% else in this context, where it means three or more repetitions 00:25:11.510 --> 00:25:14.630 align:middle line:84% of a character, which is why it will interpret it [INAUDIBLE]. 00:25:14.630 --> 00:25:15.570 align:middle line:90% DAVID MALAN: Exactly. 00:25:15.570 --> 00:25:19.340 align:middle line:84% Even though I mean for it to mean literally .edu, a period, 00:25:19.340 --> 00:25:22.560 align:middle line:84% and then .edu, unfortunately in the world of regular expressions, 00:25:22.560 --> 00:25:26.720 align:middle line:84% dot means any character, which means that this string could technically end 00:25:26.720 --> 00:25:34.080 align:middle line:84% in aedu, or bedu, or cedu, and so forth, but that's not, in fact, what I want. 00:25:34.080 --> 00:25:37.670 align:middle line:84% So any instincts now as to how I could fix this problem? 00:25:37.670 --> 00:25:39.770 align:middle line:84% And let me demonstrate the problem more clearly. 00:25:39.770 --> 00:25:41.900 align:middle line:90% Let me go ahead and run this code here. 00:25:41.900 --> 00:25:45.050 align:middle line:84% Let me go ahead and type in malan@harvard.edu. 00:25:45.050 --> 00:25:47.240 align:middle line:90% And as always, this does, in fact, work. 00:25:47.240 --> 00:25:48.680 align:middle line:90% But watch what happens here.
00:25:48.680 --> 00:25:52.520 align:middle line:84% Let me go ahead and do malan@harvard and then-- 00:25:52.520 --> 00:25:57.992 align:middle line:84% malan@harvard?edu, Enter, that, too, is valid. 00:25:57.992 --> 00:26:00.950 align:middle line:84% So I could put any character there and it's still going to be accepted. 00:26:00.950 --> 00:26:02.420 align:middle line:90% But I don't want ?edu. 00:26:02.420 --> 00:26:04.670 align:middle line:90% I want .edu literally. 00:26:04.670 --> 00:26:08.700 align:middle line:84% Any instincts, then, for how we can solve this problem here? 00:26:08.700 --> 00:26:12.770 align:middle line:84% How can I get this new function, re.search, and a regular expression 00:26:12.770 --> 00:26:16.160 align:middle line:84% more generally, to literally mean a dot, might you think? 00:26:16.160 --> 00:26:19.257 align:middle line:84% AUDIENCE: You can use the escape character, the backslash? 00:26:19.257 --> 00:26:20.090 align:middle line:90% DAVID MALAN: Indeed. 00:26:20.090 --> 00:26:22.927 align:middle line:84% The so-called escape character, which we've seen before outside 00:26:22.927 --> 00:26:25.760 align:middle line:84% of the context of regular expressions when we talked about newlines. 00:26:25.760 --> 00:26:29.640 align:middle line:84% Backslash n was a way of telling the computer I want a newline, 00:26:29.640 --> 00:26:32.810 align:middle line:84% but without actually literally hitting Enter and moving the cursor yourself. 00:26:32.810 --> 00:26:35.090 align:middle line:84% And you don't want a literal n on the screen. 00:26:35.090 --> 00:26:39.350 align:middle line:84% So backslash n was a way to escape n and convey that you want a newline. 00:26:39.350 --> 00:26:41.900 align:middle line:84% It turns out regular expressions use a similar technique 00:26:41.900 --> 00:26:43.640 align:middle line:90% to solve this problem here. 00:26:43.640 --> 00:26:45.770 align:middle line:84% In fact, let me go into my regular expression. 
00:26:45.770 --> 00:26:49.370 align:middle line:84% And before that final dot, let me put a single backslash. 00:26:49.370 --> 00:26:52.880 align:middle line:84% In the world of regular expressions, this is a so-called special sequence. 00:26:52.880 --> 00:26:55.940 align:middle line:84% And it indicates, per this backslash and a single dot, 00:26:55.940 --> 00:26:58.290 align:middle line:90% that I literally want to match on a dot. 00:26:58.290 --> 00:27:02.180 align:middle line:84% It's not that I want to match on any character and then edu. 00:27:02.180 --> 00:27:05.300 align:middle line:84% I want to match on a dot, or a period, edu. 00:27:05.300 --> 00:27:09.050 align:middle line:84% But we don't want Python to misinterpret this backslash 00:27:09.050 --> 00:27:12.710 align:middle line:84% as beginning an escape sequence, something special like backslash 00:27:12.710 --> 00:27:15.590 align:middle line:84% n, which even though we as the programmer might type two characters 00:27:15.590 --> 00:27:20.090 align:middle line:84% backslash n, it really is interpreted by Python as a single newline. 00:27:20.090 --> 00:27:22.833 align:middle line:84% We don't want any kind of misinterpretation like that here. 00:27:22.833 --> 00:27:26.000 align:middle line:84% So it turns out there's one other thing we should do for regular expressions 00:27:26.000 --> 00:27:29.180 align:middle line:84% like this that have a backslash used in this way. 00:27:29.180 --> 00:27:33.440 align:middle line:84% I want to specify to Python that I want this string, this regular expression 00:27:33.440 --> 00:27:36.200 align:middle line:84% in double quotes, to be treated as a raw string, 00:27:36.200 --> 00:27:38.510 align:middle line:84% literally putting an r at the beginning of the string 00:27:38.510 --> 00:27:41.240 align:middle line:84% to indicate to Python that you should not try to interpret 00:27:41.240 --> 00:27:43.550 align:middle line:90% any backslashes in the usual way. 
00:27:43.550 --> 00:27:46.850 align:middle line:84% I want to literally pass the backslash and the dot and the edu 00:27:46.850 --> 00:27:50.030 align:middle line:84% into this particular function, search, in this case. 00:27:50.030 --> 00:27:53.750 align:middle line:84% So it's similar in spirit to using that f at the beginning of a format 00:27:53.750 --> 00:27:57.170 align:middle line:84% string, which, of course, tells Python to format the string in a certain way, 00:27:57.170 --> 00:27:59.720 align:middle line:84% plugging in variables that might be between curly braces. 00:27:59.720 --> 00:28:02.900 align:middle line:84% But in this case, r indicates a raw string 00:28:02.900 --> 00:28:05.570 align:middle line:90% that I want passed in exactly as is. 00:28:05.570 --> 00:28:09.380 align:middle line:84% Now, it's only strictly necessary if you are, in fact, using backslashes 00:28:09.380 --> 00:28:12.860 align:middle line:84% to indicate that you want some special sequence, like backslash dot. 00:28:12.860 --> 00:28:14.750 align:middle line:84% But in general, it's probably a good habit 00:28:14.750 --> 00:28:18.540 align:middle line:84% to get into to just use raw strings for all of your regular expressions 00:28:18.540 --> 00:28:21.480 align:middle line:84% so that if you eventually go back in, make a change, make an addition, 00:28:21.480 --> 00:28:23.600 align:middle line:84% you don't accidentally introduce a backslash 00:28:23.600 --> 00:28:28.113 align:middle line:84% and then forget that that might have some special or misinterpreted meaning. 00:28:28.113 --> 00:28:30.530 align:middle line:84% Well, let me go ahead and try this new regular expression. 00:28:30.530 --> 00:28:34.430 align:middle line:84% I'll clear my terminal window, run python of validate-- 00:28:34.430 --> 00:28:36.800 align:middle line:90% run python of validate.py. 00:28:36.800 --> 00:28:40.496 align:middle line:84% And then I'll type in my email address correctly, malan@harvard.edu. 
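The effect of the raw string and the escaped dot can be seen in a short experiment. This is a sketch with hard-coded inputs rather than the interactive validate.py, and the helper name ends_in_edu is hypothetical.

```python
import re

# In a raw string, the backslash is kept as its own character.
print(len(r"\n"))  # 2: a backslash and the letter n
print(len("\n"))   # 1: a single newline character


def ends_in_edu(pattern, email):
    """Return True if the pattern is found in email (illustrative helper)."""
    return re.search(pattern, email) is not None


escaped = r".+@.+\.edu"   # backslash dot: a literal period before edu
unescaped = r".+@.+.edu"  # plain dot: any character before edu

print(ends_in_edu(escaped, "malan@harvard.edu"))    # True
print(ends_in_edu(escaped, "malan@harvard?edu"))    # False
print(ends_in_edu(unescaped, "malan@harvard?edu"))  # True: the mistake from before
```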
00:28:40.496 --> 00:28:42.710 align:middle line:90% And that's, fortunately, still valid. 00:28:42.710 --> 00:28:46.490 align:middle line:84% Let me clear my screen and run it one more time, python of validate.py. 00:28:46.490 --> 00:28:50.930 align:middle line:84% And this time, let's mistype it as malan@harvard?edu, 00:28:50.930 --> 00:28:53.540 align:middle line:84% whereby there's obviously not a dot there, 00:28:53.540 --> 00:28:57.710 align:middle line:84% but there is some other single character that last time was misinterpreted 00:28:57.710 --> 00:28:58.430 align:middle line:90% as valid. 00:28:58.430 --> 00:29:01.970 align:middle line:84% But this time, now that I've improved my regular expression, 00:29:01.970 --> 00:29:05.270 align:middle line:90% it's discovered as, indeed, invalid. 00:29:05.270 --> 00:29:10.850 align:middle line:84% Any questions now on this technique for matching something to the left of the @ 00:29:10.850 --> 00:29:15.320 align:middle line:84% sign, something to the right, and now ending with .edu explicitly? 00:29:15.320 --> 00:29:18.582 align:middle line:84% AUDIENCE: What happens when user inserts multiple @ signs? 00:29:18.582 --> 00:29:19.790 align:middle line:90% DAVID MALAN: A good question. 00:29:19.790 --> 00:29:21.320 align:middle line:90% And you kind of called me out here. 00:29:21.320 --> 00:29:22.910 align:middle line:90% Well, when in doubt, let's try. 00:29:22.910 --> 00:29:29.340 align:middle line:84% Let me go ahead and do python of validate.py, malan@@@harvard.edu, 00:29:29.340 --> 00:29:34.020 align:middle line:84% which also is incorrect, unfortunately, my code thinks it's valid. 00:29:34.020 --> 00:29:37.490 align:middle line:84% So another problem to solve, but a shortcoming for now. 00:29:37.490 --> 00:29:41.510 align:middle line:84% Other questions on these regular expressions thus far? 00:29:41.510 --> 00:29:46.108 align:middle line:84% AUDIENCE: Can you use curly brackets m instead of backslash? 
00:29:46.108 --> 00:29:48.650 align:middle line:84% DAVID MALAN: Can you use curly brackets instead of backslash? 00:29:48.650 --> 00:29:49.490 align:middle line:90% Not in this case. 00:29:49.490 --> 00:29:53.750 align:middle line:84% If you want a literal dot, backslash dot is the way to do it literally. 00:29:53.750 --> 00:29:56.660 align:middle line:84% How about one other question on regular expressions? 00:29:56.660 --> 00:30:00.620 align:middle line:84% AUDIENCE: Is this the same thing that Google Forms uses in order 00:30:00.620 --> 00:30:06.590 align:middle line:84% to categorize data in, let's say, some-- if you've got multiple people sending 00:30:06.590 --> 00:30:09.380 align:middle line:90% in requests about some feedback? 00:30:09.380 --> 00:30:12.170 align:middle line:84% Do they categorize the data that they get 00:30:12.170 --> 00:30:14.247 align:middle line:84% using this particular regular expression thing? 00:30:14.247 --> 00:30:15.080 align:middle line:90% DAVID MALAN: Indeed. 00:30:15.080 --> 00:30:17.450 align:middle line:84% If you've ever used Google Forms to not just submit it 00:30:17.450 --> 00:30:20.900 align:middle line:84% but to create a Google Form, one of the menu options 00:30:20.900 --> 00:30:23.570 align:middle line:84% is for response validation, in English at least. 00:30:23.570 --> 00:30:25.340 align:middle line:84% And what that allows you to do is specify 00:30:25.340 --> 00:30:29.060 align:middle line:84% that the user has to input an email address, or a URL, 00:30:29.060 --> 00:30:31.400 align:middle line:90% or a string of some length. 00:30:31.400 --> 00:30:33.830 align:middle line:84% But there's an even more powerful feature that some of you 00:30:33.830 --> 00:30:35.150 align:middle line:90% may not have ever noticed. 
00:30:35.150 --> 00:30:37.340 align:middle line:84% And indeed, if you'd like to open up Google Forms, 00:30:37.340 --> 00:30:41.180 align:middle line:84% create a new form temporarily, and poke around, you will actually see, 00:30:41.180 --> 00:30:44.270 align:middle line:84% in English at least, quote, unquote "regular expression" 00:30:44.270 --> 00:30:46.070 align:middle line:84% mentioned as one of the mechanisms you can 00:30:46.070 --> 00:30:49.520 align:middle line:84% use to validate your users' input into your Google Form. 00:30:49.520 --> 00:30:53.690 align:middle line:84% So in fact, after today you can start avoiding the specific dropdowns 00:30:53.690 --> 00:30:55.610 align:middle line:84% of like email address, or URL, or the like, 00:30:55.610 --> 00:30:59.540 align:middle line:84% and you can express your own patterns precisely as well. 00:30:59.540 --> 00:31:02.900 align:middle line:84% Regular expressions can even be used in VS Code itself. 00:31:02.900 --> 00:31:06.440 align:middle line:84% If you go and find, or do a find and replace in VS Code, 00:31:06.440 --> 00:31:08.690 align:middle line:84% you can, of course, just type in words, like you could 00:31:08.690 --> 00:31:10.880 align:middle line:90% into Microsoft Word or Google Docs. 00:31:10.880 --> 00:31:14.990 align:middle line:84% You can also type, if you check the right box, regular expressions 00:31:14.990 --> 00:31:19.670 align:middle line:84% and start searching for patterns, not literally specific values. 00:31:19.670 --> 00:31:24.080 align:middle line:84% Well, let me propose that we now enhance this implementation further 00:31:24.080 --> 00:31:28.010 align:middle line:84% by introducing a few other symbols, because right now with my code, 00:31:28.010 --> 00:31:32.540 align:middle line:84% I keep saying that I want my email address to end with .edu and start with 00:31:32.540 --> 00:31:35.780 align:middle line:84% a username, but I'm being a little too generous. 
00:31:35.780 --> 00:31:38.690 align:middle line:84% This does, in fact, work as expected for my own email address, 00:31:38.690 --> 00:31:40.928 align:middle line:90% malan@harvard.edu. 00:31:40.928 --> 00:31:45.350 align:middle line:84% But what if I type in a sentence like, "my email address 00:31:45.350 --> 00:31:50.180 align:middle line:84% is malan@harvard.edu," and suppose I've typed that into the program 00:31:50.180 --> 00:31:52.310 align:middle line:90% or I've typed that into a Google Form? 00:31:52.310 --> 00:31:57.680 align:middle line:84% Is this going to be considered valid or invalid? 00:31:57.680 --> 00:31:59.390 align:middle line:90% Well, let's consider. 00:31:59.390 --> 00:32:01.970 align:middle line:90% It's got @ sign, so we're good there. 00:32:01.970 --> 00:32:05.570 align:middle line:84% It's got one or more characters to the left of the @ sign. 00:32:05.570 --> 00:32:09.050 align:middle line:84% It's got one or more characters to the right of the @ sign. 00:32:09.050 --> 00:32:14.390 align:middle line:84% It's got a literal .edu somewhere in there to the right of the @ sign. 00:32:14.390 --> 00:32:16.460 align:middle line:84% And granted, there's more stuff to the right. 00:32:16.460 --> 00:32:19.700 align:middle line:84% There's literally this period at the end of my English sentence. 00:32:19.700 --> 00:32:23.600 align:middle line:84% But that's OK, because at the moment, my regular expression is not so precise 00:32:23.600 --> 00:32:29.156 align:middle line:84% as to say, the pattern must start with the username and end with the .edu. 00:32:29.156 --> 00:32:32.573 align:middle line:84% Technically, it's left unsaid what more can be to the left 00:32:32.573 --> 00:32:33.990 align:middle line:90% and what more can be to the right. 
00:32:33.990 --> 00:32:37.970 align:middle line:84% So when I hit Enter now, you'll see that that whole sentence in English 00:32:37.970 --> 00:32:40.500 align:middle line:84% is valid, and that's obviously not what you want. 00:32:40.500 --> 00:32:43.430 align:middle line:84% In fact, consider the case of using Google Forms or Office 00:32:43.430 --> 00:32:45.620 align:middle line:90% 365 to collect data from users. 00:32:45.620 --> 00:32:48.320 align:middle line:84% If you don't validate your input, your users 00:32:48.320 --> 00:32:51.170 align:middle line:84% might very well type in a full sentence or something else 00:32:51.170 --> 00:32:53.550 align:middle line:84% with a typographical error, not an actual email. 00:32:53.550 --> 00:32:55.993 align:middle line:84% So if you're just trying to copy all of the results that 00:32:55.993 --> 00:32:58.160 align:middle line:84% have been typed into your form so you can paste them 00:32:58.160 --> 00:33:00.767 align:middle line:84% into Gmail or some email program, it's going to break, 00:33:00.767 --> 00:33:04.100 align:middle line:84% because you're going to accidentally paste something like a whole English sentence 00:33:04.100 --> 00:33:07.010 align:middle line:84% into the program instead of just an email address, which 00:33:07.010 --> 00:33:08.690 align:middle line:90% is what your mailer expects. 00:33:08.690 --> 00:33:10.280 align:middle line:90% So how can I be more precise? 00:33:10.280 --> 00:33:13.550 align:middle line:84% Well, let me propose we introduce a few more symbols as well.
00:33:13.550 --> 00:33:17.540 align:middle line:84% It turns out in the context of a regular expression, one of these patterns, 00:33:17.540 --> 00:33:21.170 align:middle line:84% you can use the caret symbol, the little triangular mark, 00:33:21.170 --> 00:33:24.080 align:middle line:84% to represent that you want this pattern to match 00:33:24.080 --> 00:33:27.110 align:middle line:84% the start of the string specifically-- not anywhere 00:33:27.110 --> 00:33:29.330 align:middle line:90% but the start of the user's string. 00:33:29.330 --> 00:33:34.040 align:middle line:84% By contrast, you can use a $ sign in your regular expression to say that you 00:33:34.040 --> 00:33:37.790 align:middle line:84% want to match the end of the string, or technically just before the newline 00:33:37.790 --> 00:33:38.910 align:middle line:90% at the end of the string. 00:33:38.910 --> 00:33:41.810 align:middle line:84% But for all intents and purposes, think of caret as meaning "start 00:33:41.810 --> 00:33:45.650 align:middle line:84% of the string" and $ sign as meaning "end of the string." 00:33:45.650 --> 00:33:49.310 align:middle line:84% It is a weird thing that one is a caret and one is $ sign. 00:33:49.310 --> 00:33:51.710 align:middle line:84% These are not really things that I think of as opposites, 00:33:51.710 --> 00:33:53.670 align:middle line:84% like a parentheses or something like that. 00:33:53.670 --> 00:33:56.430 align:middle line:84% But those are the symbols the world chose many years ago. 00:33:56.430 --> 00:33:58.370 align:middle line:90% So let me go back to VS Code now. 00:33:58.370 --> 00:34:01.460 align:middle line:84% And let me add this feature to my code here. 
00:34:01.460 --> 00:34:04.790 align:middle line:84% Let me specify that yes, I do want to search for this pattern, 00:34:04.790 --> 00:34:08.480 align:middle line:84% but I want the user's input to start with this pattern 00:34:08.480 --> 00:34:09.860 align:middle line:90% and end with this pattern. 00:34:09.860 --> 00:34:12.440 align:middle line:84% So even though it's going to start looking even more cryptic, 00:34:12.440 --> 00:34:14.690 align:middle line:84% I put a caret symbol here at the beginning, 00:34:14.690 --> 00:34:17.270 align:middle line:90% and I put a $ sign here at the end. 00:34:17.270 --> 00:34:21.199 align:middle line:84% That does not mean I want the user to type a caret symbol or a $ sign. 00:34:21.199 --> 00:34:25.130 align:middle line:84% This is special symbology that indicates to re.search 00:34:25.130 --> 00:34:29.280 align:middle line:84% that it should only look for now an exact match against this pattern. 00:34:29.280 --> 00:34:31.699 align:middle line:84% So if I now go back to my terminal window-- 00:34:31.699 --> 00:34:33.920 align:middle line:84% and I'll leave the previous result on the screen-- 00:34:33.920 --> 00:34:35.540 align:middle line:90% let me type the exact same thing. 00:34:35.540 --> 00:34:39.610 align:middle line:84% "My email address malan@harvard.edu," Enter-- 00:34:39.610 --> 00:34:41.000 align:middle line:90% sorry, period. 00:34:41.000 --> 00:34:43.070 align:middle line:84% And now I'm going to go ahead and hit Enter. 00:34:43.070 --> 00:34:45.770 align:middle line:90% Now that's considered invalid. 00:34:45.770 --> 00:34:47.090 align:middle line:90% But let me clear the screen. 00:34:47.090 --> 00:34:48.923 align:middle line:84% And just to make sure I didn't break things, 00:34:48.923 --> 00:34:53.330 align:middle line:84% let me type in just my email address, and that, too, is valid. 
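The effect of the caret and the $ sign described above can be sketched roughly as follows. This is a minimal illustration, not the exact validate.py from the lecture: the constant names and the `is_valid` helper are assumptions made for the example, and the inputs are hard-coded instead of read from the user.

```python
import re

# Pattern before anchoring: ".edu" may appear anywhere in the input.
UNANCHORED = r".+@.+\.edu"
# Pattern after anchoring: the whole input must fit the pattern.
ANCHORED = r"^.+@.+\.edu$"


def is_valid(pattern, s):
    """Hypothetical helper: True if re.search finds the pattern in s."""
    return re.search(pattern, s) is not None


sentence = "My email address is malan@harvard.edu."
address = "malan@harvard.edu"

print(is_valid(UNANCHORED, sentence))  # True: the sentence slips through
print(is_valid(ANCHORED, sentence))    # False: the trailing period breaks the match
print(is_valid(ANCHORED, address))     # True: the bare address still works
```

As mentioned, $ also matches just before a newline at the very end of the string, so `"malan@harvard.edu\n"` would still be accepted by the anchored pattern.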
00:34:53.330 --> 00:34:58.250 align:middle line:84% Any questions now on this version of my regular expression, which, note, 00:34:58.250 --> 00:35:01.670 align:middle line:84% goes further to specify even more precisely 00:35:01.670 --> 00:35:06.120 align:middle line:84% that I want it to match at the start and the end? 00:35:06.120 --> 00:35:08.568 align:middle line:90% Any questions on this one here? 00:35:08.568 --> 00:35:09.110 align:middle line:90% AUDIENCE: OK. 00:35:09.110 --> 00:35:13.160 align:middle line:84% You have slash, and .edu, then the $ sign. 00:35:13.160 --> 00:35:18.170 align:middle line:84% But the dot is one of the regular expression, right? 00:35:18.170 --> 00:35:19.460 align:middle line:90% DAVID MALAN: It normally is. 00:35:19.460 --> 00:35:24.590 align:middle line:84% But this backslash that I deliberately put before this period here 00:35:24.590 --> 00:35:26.180 align:middle line:90% is an escape character. 00:35:26.180 --> 00:35:30.710 align:middle line:84% It is a way of telling re.search that I don't want any character there, 00:35:30.710 --> 00:35:33.140 align:middle line:90% I literally want a period there. 00:35:33.140 --> 00:35:36.080 align:middle line:84% And it's the only way you can distinguish one from the other. 00:35:36.080 --> 00:35:40.550 align:middle line:84% If I got rid of that slash, this would mean that the email address just 00:35:40.550 --> 00:35:43.610 align:middle line:84% has to end with any character, then an E, then a D, 00:35:43.610 --> 00:35:45.180 align:middle line:90% then a U. I don't want that. 00:35:45.180 --> 00:35:49.730 align:middle line:84% I want literally a period, then the E, then the D, then the U. 00:35:49.730 --> 00:35:53.780 align:middle line:84% This is actually common convention in programming and technology in general.
00:35:53.780 --> 00:35:55.820 align:middle line:84% If you and I decide on a convention, whereby 00:35:55.820 --> 00:35:59.180 align:middle line:84% we're using some character on the keyboard to mean something special, 00:35:59.180 --> 00:36:02.060 align:middle line:84% invariably we create a future problem for ourself 00:36:02.060 --> 00:36:04.820 align:middle line:84% when we want to literally use that same character. 00:36:04.820 --> 00:36:07.190 align:middle line:84% And so the solution in general to that problem 00:36:07.190 --> 00:36:10.790 align:middle line:84% is to somehow escape the character so that it's clear to the computer 00:36:10.790 --> 00:36:14.510 align:middle line:84% that it's not that special symbol, it's literally the symbol it sees. 00:36:14.510 --> 00:36:19.700 align:middle line:84% AUDIENCE: So we don't even know the-- we don't need another slash before the $ 00:36:19.700 --> 00:36:20.930 align:middle line:90% sign? 00:36:20.930 --> 00:36:22.150 align:middle line:90% DAVID MALAN: No. 00:36:22.150 --> 00:36:25.550 align:middle line:84% Because in this case, $ sign means something special. 00:36:25.550 --> 00:36:30.590 align:middle line:84% Per this chart here, $ sign by itself does not mean US dollars or currency. 00:36:30.590 --> 00:36:33.420 align:middle line:84% It literally means "match the end of the string." 00:36:33.420 --> 00:36:38.600 align:middle line:84% If, however, I wanted the user to literally type in $ sign at the end 00:36:38.600 --> 00:36:40.910 align:middle line:84% of their input, the solution would be the same. 00:36:40.910 --> 00:36:43.700 align:middle line:84% I would put a backslash before the $ sign, 00:36:43.700 --> 00:36:48.242 align:middle line:84% which means my email address would have to be something like malan@harvard.edu 00:36:48.242 --> 00:36:50.850 align:middle line:84% $ sign, which is obviously not correct too. 
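To make the escaping discussion concrete, here is a small sketch comparing an unescaped dot, an escaped dot, and an escaped $ sign. The variable names and the sample input `malan@harvard?edu` are assumptions chosen for illustration only.

```python
import re

# Without the backslash, "." means "any character" before "edu".
unescaped = r"^.+@.+.edu$"
# With the backslash, only a literal period is accepted before "edu".
escaped = r"^.+@.+\.edu$"
# "\$" demands a literal dollar sign; the final "$" is still the end anchor.
literal_dollar = r"^.+@.+\.edu\$$"

odd_input = "malan@harvard?edu"

print(bool(re.search(unescaped, odd_input)))  # True: "?" counts as "any character"
print(bool(re.search(escaped, odd_input)))    # False: a real period is required

print(bool(re.search(literal_dollar, "malan@harvard.edu$")))  # True
print(bool(re.search(literal_dollar, "malan@harvard.edu")))   # False
```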
00:36:50.850 --> 00:36:55.280 align:middle line:84% So backslashes just allow you to tell the computer to not treat 00:36:55.280 --> 00:36:58.310 align:middle line:84% those symbols specially, like meaning something special, 00:36:58.310 --> 00:37:00.950 align:middle line:90% but to treat them literally instead. 00:37:00.950 --> 00:37:04.550 align:middle line:84% How about one other question here on regular expressions? 00:37:04.550 --> 00:37:09.010 align:middle line:84% AUDIENCE: You said one represents to make it one plus, 00:37:09.010 --> 00:37:11.095 align:middle line:84% then you said one was to make it one with nothing. 00:37:11.095 --> 00:37:11.845 align:middle line:90% DAVID MALAN: Sure. 00:37:11.845 --> 00:37:13.220 align:middle line:90% AUDIENCE: So why would you add the plus? 00:37:13.220 --> 00:37:14.360 align:middle line:90% DAVID MALAN: Let me rewind in time. 00:37:14.360 --> 00:37:17.027 align:middle line:84% I think what you're referring to was one of our earlier versions 00:37:17.027 --> 00:37:20.360 align:middle line:84% that initially looked like this, which just meant zero or more 00:37:20.360 --> 00:37:24.710 align:middle line:84% characters, then an @ sign, then zero or more other characters. 00:37:24.710 --> 00:37:29.090 align:middle line:84% We then evolved that to be this, dot plus on both sides, which 00:37:29.090 --> 00:37:31.340 align:middle line:84% means one or more characters on the left, then 00:37:31.340 --> 00:37:34.320 align:middle line:84% an @ sign, then one or more characters on the right.
00:37:34.320 --> 00:37:36.560 align:middle line:84% And if I'm interpreting your question correctly, 00:37:36.560 --> 00:37:40.370 align:middle line:84% one of the points I made earlier was that if you didn't use plus or forgot 00:37:40.370 --> 00:37:44.510 align:middle line:84% that it exists, you could equivalently achieve the exact same result with two 00:37:44.510 --> 00:37:48.380 align:middle line:84% dots and a *, because the first dot means any character-- 00:37:48.380 --> 00:37:49.550 align:middle line:90% it's got to be there-- 00:37:49.550 --> 00:37:54.170 align:middle line:84% the second dot * means zero or more other characters, 00:37:54.170 --> 00:37:55.380 align:middle line:90% and same on the right. 00:37:55.380 --> 00:37:57.950 align:middle line:84% So it's just another way of expressing the same idea. 00:37:57.950 --> 00:38:01.970 align:middle line:84% "One or more" can be represented like this with dot dot *, 00:38:01.970 --> 00:38:06.840 align:middle line:84% or you can just use the handier syntax of dot +, which means the same thing. 00:38:06.840 --> 00:38:07.340 align:middle line:90% All right. 00:38:07.340 --> 00:38:10.507 align:middle line:84% So I daresay there's still some problems with the regular expression in this 00:38:10.507 --> 00:38:13.790 align:middle line:84% current form, because even though now we're starting to look for the user 00:38:13.790 --> 00:38:16.010 align:middle line:84% name at the beginning of the string from the user, 00:38:16.010 --> 00:38:20.390 align:middle line:84% and we're looking for the .edu literally at the end of the string from the user, 00:38:20.390 --> 00:38:23.780 align:middle line:84% those dots are a little too encompassing right now. 00:38:23.780 --> 00:38:26.450 align:middle line:84% I'm allowed to type in more than the single @ sign. 00:38:26.450 --> 00:38:27.020 align:middle line:90% Why? 00:38:27.020 --> 00:38:30.720 align:middle line:84% Because @ is a character, and dot means any character. 
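The equivalence of dot dot * and dot +, and the problem with extra @ signs, can be checked with a short sketch like this one. The names `DOT_DOT_STAR`, `DOT_PLUS`, and `samples` are illustrative assumptions, not code from the lecture.

```python
import re

DOT_DOT_STAR = r"^..*@..*\.edu$"  # one character, then zero or more
DOT_PLUS = r"^.+@.+\.edu$"        # one or more characters

samples = [
    "malan@harvard.edu",    # well-formed
    "@harvard.edu",         # nothing before the @ sign
    "malan@.edu",           # nothing between the @ sign and ".edu"
    "malan@@@harvard.edu",  # too many @ signs
]

for s in samples:
    a = re.search(DOT_DOT_STAR, s) is not None
    b = re.search(DOT_PLUS, s) is not None
    # Both spellings of "one or more" agree on every sample.
    print(f"{s!r}: dot dot * -> {a}, dot + -> {b}")
```

Both patterns accept `malan@@@harvard.edu`, because the dot is happy to match an @ sign like any other character.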
00:38:30.720 --> 00:38:34.650 align:middle line:84% So honestly, I can have as many @ signs in this thing at the moment as I want. 00:38:34.650 --> 00:38:37.280 align:middle line:84% For instance, if I run python of validate.py, 00:38:37.280 --> 00:38:40.500 align:middle line:84% malan@harvard.edu, still works as expected. 00:38:40.500 --> 00:38:44.270 align:middle line:84% But if I also run python of validate.py and incorrectly do 00:38:44.270 --> 00:38:51.030 align:middle line:84% malan@@@harvard.edu, should be invalid, but it's considered valid instead. 00:38:51.030 --> 00:38:55.670 align:middle line:84% So I think we need to be a little more restrictive when it comes to that dot. 00:38:55.670 --> 00:38:59.180 align:middle line:84% And we can't just say, oh, any old character there is fine. 00:38:59.180 --> 00:39:00.950 align:middle line:90% We need to be more specific. 00:39:00.950 --> 00:39:05.390 align:middle line:84% Well, it turns out that regular expressions also support this syntax. 00:39:05.390 --> 00:39:08.990 align:middle line:84% You can use square brackets inside of your pattern, 00:39:08.990 --> 00:39:14.210 align:middle line:84% and inside of those square brackets include one or more characters 00:39:14.210 --> 00:39:17.000 align:middle line:90% that you want to look for specifically. 00:39:17.000 --> 00:39:20.510 align:middle line:84% Alternatively, you can inside of those square brackets 00:39:20.510 --> 00:39:23.660 align:middle line:84% put a caret symbol, which unfortunately in this context, 00:39:23.660 --> 00:39:27.150 align:middle line:84% means something completely different from "match the start of the string." 00:39:27.150 --> 00:39:30.870 align:middle line:84% But this would be the complement operator inside of the square brackets, 00:39:30.870 --> 00:39:34.320 align:middle line:84% which means "you cannot match any of these characters." 
00:39:34.320 --> 00:39:36.980 align:middle line:84% So things are about to look even more cryptic now. 00:39:36.980 --> 00:39:41.000 align:middle line:84% But that's why we're focusing on regular expressions on their own here. 00:39:41.000 --> 00:39:46.850 align:middle line:84% If I don't want to allow any character, which is what a dot is, let me go ahead 00:39:46.850 --> 00:39:52.610 align:middle line:84% and I could just say, well, I only want to support A, or Bs, or Cs, or Ds, 00:39:52.610 --> 00:39:54.200 align:middle line:90% or Es, or Fs, or Gs. 00:39:54.200 --> 00:39:56.750 align:middle line:84% I could type in the whole alphabet here plus some numbers 00:39:56.750 --> 00:40:00.110 align:middle line:84% to actually include all of the letters that I do want to allow. 00:40:00.110 --> 00:40:02.570 align:middle line:84% But honestly, a little simpler would be this. 00:40:02.570 --> 00:40:09.020 align:middle line:84% I could use a ^ symbol and then an @ sign, which has the effect of saying, 00:40:09.020 --> 00:40:14.270 align:middle line:84% this is the set of characters that has everything except an @ sign. 00:40:14.270 --> 00:40:16.130 align:middle line:90% And I can do the same thing over here. 00:40:16.130 --> 00:40:23.270 align:middle line:84% Instead of a dot to the right of the @ sign, I can do open bracket ^, @ sign. 00:40:23.270 --> 00:40:26.390 align:middle line:84% And I admit, things are starting to escalate quickly here, 00:40:26.390 --> 00:40:28.940 align:middle line:84% but let's start from the left and go to the right. 00:40:28.940 --> 00:40:33.020 align:middle line:84% This ^ outside of the square brackets at the very start of my string, 00:40:33.020 --> 00:40:35.810 align:middle line:84% as before, means "match from the start of the string." 00:40:35.810 --> 00:40:36.890 align:middle line:90% And let's jump ahead. 
00:40:36.890 --> 00:40:40.580 align:middle line:84% The $ sign all the way at the end of the regular expression means "match 00:40:40.580 --> 00:40:42.180 align:middle line:90% at the end of the string." 00:40:42.180 --> 00:40:45.290 align:middle line:84% So if we can mentally tick those off as straightforward, let's 00:40:45.290 --> 00:40:47.630 align:middle line:84% now focus on everything else in the middle. 00:40:47.630 --> 00:40:50.510 align:middle line:84% Well, to the left here we have new syntax-- 00:40:50.510 --> 00:40:56.840 align:middle line:84% a square bracket, another ^, an @ sign, and a closed square bracket, and then 00:40:56.840 --> 00:40:57.560 align:middle line:90% a +. 00:40:57.560 --> 00:40:59.780 align:middle line:90% The + means the same thing as always. 00:40:59.780 --> 00:41:03.110 align:middle line:84% It means "one or more of the things to the left." 00:41:03.110 --> 00:41:04.830 align:middle line:90% What is the thing to the left? 00:41:04.830 --> 00:41:06.650 align:middle line:90% Well, this is the new syntax. 00:41:06.650 --> 00:41:10.880 align:middle line:84% Inside of square brackets here, I have a ^ symbol and then an @ sign. 00:41:10.880 --> 00:41:14.990 align:middle line:84% That just means any character except an @ sign. 00:41:14.990 --> 00:41:18.890 align:middle line:84% It's a weird syntax, but this is how we can express that simple idea-- 00:41:18.890 --> 00:41:23.022 align:middle line:84% any character on the keyboard except for an @ sign. 00:41:23.022 --> 00:41:25.980 align:middle line:84% And heck, even other characters that aren't physically on your keyboard 00:41:25.980 --> 00:41:28.020 align:middle line:90% but that nonetheless exist. 
00:41:28.020 --> 00:41:32.120 align:middle line:84% Then we have a literal @ sign, then we have another one of these same things-- 00:41:32.120 --> 00:41:36.950 align:middle line:84% square bracket, ^@ closed bracket, which means any character except an @ sign, 00:41:36.950 --> 00:41:42.710 align:middle line:84% then one or more of those things, followed by literally a period edu. 00:41:42.710 --> 00:41:45.960 align:middle line:84% So now let me go ahead and do this again. 00:41:45.960 --> 00:41:49.280 align:middle line:84% Let me rerun python of validate.py and test my own email address 00:41:49.280 --> 00:41:51.595 align:middle line:90% to make sure I've not made things worse. 00:41:51.595 --> 00:41:52.220 align:middle line:90% And we're good. 00:41:52.220 --> 00:41:55.250 align:middle line:84% Now let me go ahead and clear my screen and run python of validate.py 00:41:55.250 --> 00:42:00.750 align:middle line:84% again and do malan@@@harvard.edu, crossing my fingers this time. 00:42:00.750 --> 00:42:03.020 align:middle line:90% And finally, this now is invalid. 00:42:03.020 --> 00:42:03.830 align:middle line:90% Why? 00:42:03.830 --> 00:42:08.600 align:middle line:84% I'm allowing myself to have one @ sign in the middle of the user's input, 00:42:08.600 --> 00:42:13.220 align:middle line:84% but everything to the left per this new syntax cannot be an @ sign. 00:42:13.220 --> 00:42:15.950 align:middle line:84% It can be anything but one or more times. 00:42:15.950 --> 00:42:20.570 align:middle line:84% And everything to the right of the @ sign can be anything but an @ sign one 00:42:20.570 --> 00:42:25.430 align:middle line:84% or more times followed by, lastly, a literal .edu. 
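Here is a minimal sketch of the complemented set in action, assuming the pattern is written as described above. The constant name `NO_EXTRA_AT` is an assumption for the example, and the inputs are hard-coded instead of typed at a prompt.

```python
import re

# [^@] means "any character except an @ sign"; the + repeats it one or more times.
NO_EXTRA_AT = r"^[^@]+@[^@]+\.edu$"

print(bool(re.search(NO_EXTRA_AT, "malan@harvard.edu")))    # True
print(bool(re.search(NO_EXTRA_AT, "malan@@@harvard.edu")))  # False
```

The second input fails because, after the one literal @ sign, the next character is another @ sign, which the set explicitly excludes.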
00:42:25.430 --> 00:42:27.590 align:middle line:84% So again, the new syntax is quite simply this-- 00:42:27.590 --> 00:42:31.985 align:middle line:84% square brackets allow you to specify a set of characters that you literally 00:42:31.985 --> 00:42:33.110 align:middle line:90% type out at your keyboard-- 00:42:33.110 --> 00:42:36.410 align:middle line:84% A, B, C, D, E, F, or the complement, the opposite, 00:42:36.410 --> 00:42:40.550 align:middle line:84% the ^ symbol, which means "not," and then the one or more symbols you 00:42:40.550 --> 00:42:42.520 align:middle line:90% want to exclude. 00:42:42.520 --> 00:42:45.230 align:middle line:90% Questions now on this syntax here? 00:42:45.230 --> 00:42:49.450 align:middle line:84% AUDIENCE: So right after @ sign, can we use the curly brackets m one 00:42:49.450 --> 00:42:52.770 align:middle line:84% so that we can only have one repetition of the @ symbol? 00:42:52.770 --> 00:42:53.770 align:middle line:90% DAVID MALAN: Absolutely. 00:42:53.770 --> 00:42:54.800 align:middle line:90% So we could do this. 00:42:54.800 --> 00:42:56.680 align:middle line:90% Let me go ahead and pull up VS Code. 00:42:56.680 --> 00:42:59.680 align:middle line:84% And let me delete the current form of a regular expression 00:42:59.680 --> 00:43:03.580 align:middle line:84% and go back to where we began, which was just dot * @ and dot *. 00:43:03.580 --> 00:43:06.130 align:middle line:84% I could absolutely do something like this 00:43:06.130 --> 00:43:10.480 align:middle line:84% and require that I want at least one of any character here. 00:43:10.480 --> 00:43:13.760 align:middle line:84% And then I could do something more to have any more as well. 
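The transcript does not show exactly what was typed at this point, so the following is only a guess at the idea: in Python's `re` syntax, `{1}` means "exactly one" and `{1,}` means "one or more", so either can stand in for the plus. The three pattern names are assumptions for illustration.

```python
import re

WITH_PLUS = r"^.+@.+$"             # one or more on each side
WITH_BRACES = r"^.{1}.*@.{1}.*$"   # exactly one, then zero or more
WITH_OPEN_RANGE = r"^.{1,}@.{1,}$" # one or more, spelled with braces

for s in ["malan@harvard.edu", "@harvard.edu", "malan@", "a@b"]:
    results = [
        re.search(p, s) is not None
        for p in (WITH_PLUS, WITH_BRACES, WITH_OPEN_RANGE)
    ]
    # All three spellings give the same answer for each input.
    print(f"{s!r}: {results}")
```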
00:43:13.760 --> 00:43:16.710 align:middle line:84% So the curly brace syntax, which we saw on the slide earlier 00:43:16.710 --> 00:43:18.460 align:middle line:84% but didn't yet use, absolutely can be used 00:43:18.460 --> 00:43:21.400 align:middle line:84% to specify a specific number of characters. 00:43:21.400 --> 00:43:24.160 align:middle line:84% But honestly, this is more verbose than is necessary. 00:43:24.160 --> 00:43:27.130 align:middle line:84% The best solution, arguably, or the simplest, at least, 00:43:27.130 --> 00:43:29.500 align:middle line:90% ultimately, is just to say dot +. 00:43:29.500 --> 00:43:32.650 align:middle line:84% But there, too, another example of how you can solve the same problem 00:43:32.650 --> 00:43:34.010 align:middle line:90% multiple ways. 00:43:34.010 --> 00:43:36.340 align:middle line:84% Let me go back to where the regular expression just was 00:43:36.340 --> 00:43:39.170 align:middle line:90% and take other questions as well. 00:43:39.170 --> 00:43:44.790 align:middle line:84% Questions on the sets of characters or complementing that set? 00:43:44.790 --> 00:43:47.370 align:middle line:84% AUDIENCE: So can you use that same syntax 00:43:47.370 --> 00:43:51.780 align:middle line:84% to say that you don't want a certain character throughout the whole string? 00:43:51.780 --> 00:43:52.740 align:middle line:90% DAVID MALAN: You could. 00:43:52.740 --> 00:43:54.600 align:middle line:90% It's going to be-- 00:43:54.600 --> 00:43:58.530 align:middle line:84% you could absolutely use the same character to exclude-- 00:43:58.530 --> 00:44:01.830 align:middle line:84% you could absolutely use this syntax to exclude a certain character 00:44:01.830 --> 00:44:03.210 align:middle line:90% from the entire string. 00:44:03.210 --> 00:44:05.130 align:middle line:84% But it would be a little harder right now, 00:44:05.130 --> 00:44:07.530 align:middle line:84% because we're still requiring .edu at the end.
00:44:07.530 --> 00:44:10.770 align:middle line:90% But yes, absolutely. 00:44:10.770 --> 00:44:12.220 align:middle line:90% Other questions? 00:44:12.220 --> 00:44:16.620 align:middle line:84% AUDIENCE: What happens if the user inputs .edu in the beginning 00:44:16.620 --> 00:44:17.632 align:middle line:90% of the string? 00:44:17.632 --> 00:44:18.840 align:middle line:90% DAVID MALAN: A good question. 00:44:18.840 --> 00:44:22.000 align:middle line:84% What happens if the user types in .edu at the beginning of the string? 00:44:22.000 --> 00:44:23.577 align:middle line:90% Well, let me go back to VS Code here. 00:44:23.577 --> 00:44:25.660 align:middle line:84% And let's try to solve this in two different ways. 00:44:25.660 --> 00:44:27.452 align:middle line:84% First, let's look at the regular expression 00:44:27.452 --> 00:44:31.080 align:middle line:84% and see if we can infer if that's going to be tolerated. 00:44:31.080 --> 00:44:34.950 align:middle line:84% Well, according to the current cryptic regular expression, 00:44:34.950 --> 00:44:38.730 align:middle line:84% I'm saying that you can have any character except the @ sign. 00:44:38.730 --> 00:44:41.910 align:middle line:84% So that would work. I could have the dot for the .edu. 00:44:41.910 --> 00:44:44.490 align:middle line:90% But then I have to have an @ sign. 00:44:44.490 --> 00:44:48.940 align:middle line:84% So that wouldn't really work, because if I'm just typing in .edu, 00:44:48.940 --> 00:44:51.010 align:middle line:90% we're not going to pass that constraint. 00:44:51.010 --> 00:44:53.710 align:middle line:84% So now let me try this by running the program. 00:44:53.710 --> 00:44:55.810 align:middle line:90% Let me type in just literally .edu. 00:44:55.810 --> 00:44:57.090 align:middle line:90% That doesn't work. 00:44:57.090 --> 00:45:02.505 align:middle line:84% But, but, but I could do this, .edu@.edu. 00:45:02.505 --> 00:45:04.140 align:middle line:90% That, too, is invalid.
00:45:04.140 --> 00:45:07.581 align:middle line:90% But let me do this, .edu@something.edu. 00:45:07.581 --> 00:45:10.365 align:middle line:90% 00:45:10.365 --> 00:45:11.490 align:middle line:90% That passes. 00:45:11.490 --> 00:45:13.470 align:middle line:84% So it's starting to get a little weird now. 00:45:13.470 --> 00:45:15.030 align:middle line:90% Maybe it's valid, maybe it's not. 00:45:15.030 --> 00:45:18.120 align:middle line:84% But I think we'll eventually be more precise, too. 00:45:18.120 --> 00:45:21.570 align:middle line:84% How about one more question on this regular expression 00:45:21.570 --> 00:45:23.310 align:middle line:90% and these complementing of sets? 00:45:23.310 --> 00:45:27.765 align:middle line:84% AUDIENCE: Can we use another domain name, the string input? 00:45:27.765 --> 00:45:29.640 align:middle line:84% DAVID MALAN: Can you use another domain name? 00:45:29.640 --> 00:45:30.240 align:middle line:90% Absolutely. 00:45:30.240 --> 00:45:32.460 align:middle line:84% I'm using my own just for the sake of demonstration. 00:45:32.460 --> 00:45:35.970 align:middle line:84% But you could absolutely use any domain or top-level domain. 00:45:35.970 --> 00:45:38.520 align:middle line:84% And I'm using .edu, which is very US centric. 00:45:38.520 --> 00:45:43.330 align:middle line:84% But this would absolutely work exactly the same for any top-level domain. 00:45:43.330 --> 00:45:43.830 align:middle line:90% All right. 00:45:43.830 --> 00:45:47.700 align:middle line:84% Let me go ahead now and propose that we improve this regular expression 00:45:47.700 --> 00:45:50.880 align:middle line:84% further, because if I pull it up again in VS Code here, 00:45:50.880 --> 00:45:53.790 align:middle line:84% you'll see that I'm being a little too tolerant still. 
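The three ".edu" inputs tried above, and the point about other top-level domains, can be reproduced with a sketch along these lines. The `pattern_for` helper and its `tld` parameter are hypothetical, introduced only for this example; `re.escape` is the standard-library function that backslash-escapes special characters such as the period.

```python
import re


def pattern_for(tld):
    """Hypothetical helper: build the same pattern for any top-level domain."""
    # re.escape turns ".edu" into a form where the period is matched literally.
    return rf"^[^@]+@[^@]+{re.escape('.' + tld)}$"


edu = pattern_for("edu")

print(bool(re.search(edu, ".edu")))                # False: no @ sign at all
print(bool(re.search(edu, ".edu@.edu")))           # False: nothing between "@" and ".edu"
print(bool(re.search(edu, ".edu@something.edu")))  # True: questionable, but it passes

# Swapping in another top-level domain works the same way.
print(bool(re.search(pattern_for("org"), "someone@example.org")))  # True
```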
00:45:53.790 --> 00:45:58.140 align:middle line:84% It turns out that there are certain requirements for someone's username 00:45:58.140 --> 00:46:00.240 align:middle line:90% and domain name in an email address. 00:46:00.240 --> 00:46:03.840 align:middle line:84% There is an official standard in the world for what an email address can be 00:46:03.840 --> 00:46:05.670 align:middle line:90% and what characters can be in it. 00:46:05.670 --> 00:46:09.480 align:middle line:84% And this is way too accommodating of all the characters 00:46:09.480 --> 00:46:11.710 align:middle line:90% in the world except for the @ symbol. 00:46:11.710 --> 00:46:14.190 align:middle line:84% So let's actually narrow the definition of what 00:46:14.190 --> 00:46:16.110 align:middle line:90% we're going to tolerate in usernames. 00:46:16.110 --> 00:46:19.200 align:middle line:84% And companies like Gmail could certainly do this as well. 00:46:19.200 --> 00:46:22.200 align:middle line:84% Suppose that it's not just that I want to exclude @ sign. 00:46:22.200 --> 00:46:25.470 align:middle line:84% Suppose that I only want to allow for, say, 00:46:25.470 --> 00:46:27.600 align:middle line:84% characters that normally appear in words, 00:46:27.600 --> 00:46:31.500 align:middle line:84% like letters of the alphabet, A through z, be it uppercase or lowercase, 00:46:31.500 --> 00:46:35.520 align:middle line:84% maybe some numbers, and heck, maybe even an underscore could be allowed, too. 00:46:35.520 --> 00:46:38.550 align:middle line:84% Well, we can use this same square bracket syntax 00:46:38.550 --> 00:46:41.340 align:middle line:84% to specify a set of characters as follows. 00:46:41.340 --> 00:46:44.860 align:middle line:90% I could do abcdefghij-- 00:46:44.860 --> 00:46:45.360 align:middle line:90% oh, my god. 00:46:45.360 --> 00:46:46.290 align:middle line:90% This is going to take forever. 
00:46:46.290 --> 00:46:49.140 align:middle line:84% I'm going to have to type out all 26 letters of the alphabet, 00:46:49.140 --> 00:46:50.940 align:middle line:90% both lowercase and uppercase. 00:46:50.940 --> 00:46:52.260 align:middle line:90% So let me stop doing that. 00:46:52.260 --> 00:46:53.700 align:middle line:90% There's a better way already. 00:46:53.700 --> 00:46:58.180 align:middle line:84% If you want to specify within these square brackets a range of letters, 00:46:58.180 --> 00:47:00.550 align:middle line:90% you can actually just do a hyphen. 00:47:00.550 --> 00:47:04.920 align:middle line:84% If you literally do a-z in these square brackets, 00:47:04.920 --> 00:47:07.470 align:middle line:84% the computer is going to know you mean a through z. 00:47:07.470 --> 00:47:10.620 align:middle line:84% You do not need to type 26 letters of the alphabet. 00:47:10.620 --> 00:47:14.190 align:middle line:84% If you want to include uppercase letters as well, you just do the same. 00:47:14.190 --> 00:47:19.440 align:middle line:84% No spaces, no commas, you literally just keep typing a through capital Z. 00:47:19.440 --> 00:47:23.880 align:middle line:84% So I have little a hyphen little z, big A hyphen 00:47:23.880 --> 00:47:26.640 align:middle line:84% big Z. No spaces, no commas, no separators. 00:47:26.640 --> 00:47:28.830 align:middle line:90% You just keep specifying those ranges. 00:47:28.830 --> 00:47:32.350 align:middle line:84% If I additionally want numbers, I could do 01234-- 00:47:32.350 --> 00:47:32.850 align:middle line:90% nope. 00:47:32.850 --> 00:47:35.070 align:middle line:84% You don't need to type in all 10 decimal digits. 00:47:35.070 --> 00:47:39.070 align:middle line:84% You can just say 0 through 9 using a hyphen as well. 
00:47:39.070 --> 00:47:41.280 align:middle line:84% And if you now want to support underscores 00:47:41.280 --> 00:47:44.280 align:middle line:84% as well, which is pretty common in usernames for email addresses, 00:47:44.280 --> 00:47:48.160 align:middle line:84% you can literally just type an underscore at the end. 00:47:48.160 --> 00:47:51.180 align:middle line:84% Notice that all of these characters are inside 00:47:51.180 --> 00:47:55.860 align:middle line:84% of square brackets, which just again, means here is a set of characters 00:47:55.860 --> 00:47:57.180 align:middle line:90% that I want to allow. 00:47:57.180 --> 00:48:02.100 align:middle line:84% I have not used a ^ symbol at the beginning of this whole thing, 00:48:02.100 --> 00:48:05.370 align:middle line:84% because I don't want to complement it-- complement it with an E, 00:48:05.370 --> 00:48:07.230 align:middle line:90% not compliment it with an I-- 00:48:07.230 --> 00:48:09.940 align:middle line:84% I don't want to complement it by making it the opposite. 00:48:09.940 --> 00:48:13.225 align:middle line:84% I literally want to accept only these characters. 00:48:13.225 --> 00:48:15.600 align:middle line:84% I'm going to go ahead and do the same thing on the right. 00:48:15.600 --> 00:48:19.530 align:middle line:84% If I want to require that the domain name similarly 00:48:19.530 --> 00:48:22.800 align:middle line:84% come from this set of characters, which admittedly is a little too narrow, 00:48:22.800 --> 00:48:25.210 align:middle line:84% but it's familiar for now so we'll keep it simple, 00:48:25.210 --> 00:48:29.490 align:middle line:84% I'm going to go ahead and paste that exact same set of characters over there 00:48:29.490 --> 00:48:30.490 align:middle line:90% to the right. 00:48:30.490 --> 00:48:33.600 align:middle line:90% And so now, it's much more restrictive. 00:48:33.600 --> 00:48:36.660 align:middle line:84% Now I'm going to go ahead and run python of validate.py. 
00:48:36.660 --> 00:48:39.420 align:middle line:84% I'm going to test my own email address, and we're still good. 00:48:39.420 --> 00:48:42.180 align:middle line:84% I'm going to clear my screen and run it once more, 00:48:42.180 --> 00:48:44.520 align:middle line:90% this time trying to break it. 00:48:44.520 --> 00:48:51.270 align:middle line:84% Let me go ahead and do something like, how about, david_malan@harvard.edu, 00:48:51.270 --> 00:48:54.790 align:middle line:84% Enter, but that, too, is going to be valid. 00:48:54.790 --> 00:48:57.330 align:middle line:84% But if I do something completely wrong again, 00:48:57.330 --> 00:49:02.790 align:middle line:84% like malan@@@harvard.edu, that's still going to be invalid. 00:49:02.790 --> 00:49:03.330 align:middle line:90% Why? 00:49:03.330 --> 00:49:06.090 align:middle line:84% Because my regular expression currently only allows 00:49:06.090 --> 00:49:09.480 align:middle line:84% for a single @ in the middle, because everything to the left 00:49:09.480 --> 00:49:11.530 align:middle line:90% must be alphanumeric-- 00:49:11.530 --> 00:49:14.420 align:middle line:84% alphabetical or numeric-- or an underscore, 00:49:14.420 --> 00:49:18.301 align:middle line:84% the same thing to the right, followed by the .edu. 00:49:18.301 --> 00:49:20.770 align:middle line:84% Now honestly, this is a regular expression 00:49:20.770 --> 00:49:23.890 align:middle line:84% that you might be in the habit of typing in the real world. 00:49:23.890 --> 00:49:27.860 align:middle line:84% As cryptic as this might look, this is the world of regular expressions. 00:49:27.860 --> 00:49:30.560 align:middle line:84% So you'll get more comfortable with this syntax over time. 
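The pattern described above can be sketched roughly as follows. The helper name `is_valid` is an assumption made for illustration; the lecture's validate.py reads the address with input() rather than taking an argument.

```python
import re


def is_valid(email):
    # Hypothetical helper: one or more allowed characters, a single @,
    # one or more allowed characters, then a literal ".edu" at the end.
    pattern = r"^[a-zA-Z0-9_]+@[a-zA-Z0-9_]+\.edu$"
    return re.search(pattern, email) is not None


print(is_valid("malan@harvard.edu"))        # True
print(is_valid("david_malan@harvard.edu"))  # True
print(is_valid("malan@@@harvard.edu"))      # False
```

The last address fails because, after the first @, the next character would have to come from the bracketed set, and @ is not in it.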
00:49:30.560 --> 00:49:32.890 align:middle line:84% But thankfully, some of these patterns are 00:49:32.890 --> 00:49:36.910 align:middle line:84% so common that there are built-in shortcuts for representing 00:49:36.910 --> 00:49:38.680 align:middle line:90% some of the same information. 00:49:38.680 --> 00:49:42.373 align:middle line:84% That is to say, you don't have to constantly type out all of the symbols 00:49:42.373 --> 00:49:45.040 align:middle line:84% that you want to include, because odds are some other programmer 00:49:45.040 --> 00:49:46.280 align:middle line:90% has had the same problem. 00:49:46.280 --> 00:49:49.030 align:middle line:84% So built into regular expressions themselves 00:49:49.030 --> 00:49:51.250 align:middle line:84% are some additional patterns you can use. 00:49:51.250 --> 00:49:56.170 align:middle line:84% And in fact, I can go ahead and get rid of this entire set, a through z 00:49:56.170 --> 00:49:59.830 align:middle line:84% lowercase, A through Z uppercase, 0 through 9 and an underscore, 00:49:59.830 --> 00:50:03.640 align:middle line:84% and just replace it with a single backslash w. 00:50:03.640 --> 00:50:07.210 align:middle line:84% Backslash w in this case represents a "word character," 00:50:07.210 --> 00:50:13.330 align:middle line:84% which is commonly known as an alphanumeric symbol or the underscore 00:50:13.330 --> 00:50:14.052 align:middle line:90% as well. 00:50:14.052 --> 00:50:15.760 align:middle line:84% I'm going to do the same thing over here. 00:50:15.760 --> 00:50:18.310 align:middle line:84% I'm going to highlight the entire set of square brackets, 00:50:18.310 --> 00:50:21.430 align:middle line:84% delete it, and replace it with a single backslash w.
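A small sketch of that substitution, with illustrative variable names (`bracketed`, `shorthand`) that are not from the lecture. One caveat worth stating as a fact about Python rather than about the lecture: for ordinary str patterns, `\w` also matches non-ASCII letters and digits, so it is slightly broader than the bracketed set unless the `re.ASCII` flag is passed.

```python
import re

bracketed = r"^[a-zA-Z0-9_]+@[a-zA-Z0-9_]+\.edu$"
shorthand = r"^\w+@\w+\.edu$"

# For plain ASCII input the two patterns agree.
for email in ["malan@harvard.edu", "david_malan@harvard.edu", "malan@@@harvard.edu"]:
    assert bool(re.search(bracketed, email)) == bool(re.search(shorthand, email))

# Where they differ: an accented letter counts as a word character by default.
accented = "jos\u00e9@harvard.edu"
print(bool(re.search(shorthand, accented)))            # True
print(bool(re.search(shorthand, accented, re.ASCII)))  # False
print(bool(re.search(bracketed, accented)))            # False
```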
00:50:21.430 --> 00:50:23.720 align:middle line:84% And now I feel like we're making progress, 00:50:23.720 --> 00:50:25.720 align:middle line:84% because even though it's cryptic, and would have 00:50:25.720 --> 00:50:29.320 align:middle line:90% looked way cryptic a little bit ago-- 00:50:29.320 --> 00:50:32.680 align:middle line:84% and even though it would have looked even more cryptic a little bit ago, now 00:50:32.680 --> 00:50:35.470 align:middle line:84% it's at least starting to read a little more friendly. 00:50:35.470 --> 00:50:39.160 align:middle line:84% This ^ on the left means "start matching at the beginning of the string." 00:50:39.160 --> 00:50:42.100 align:middle line:90% Backslash w means "any word character." 00:50:42.100 --> 00:50:44.140 align:middle line:90% The + means "one or more." 00:50:44.140 --> 00:50:45.370 align:middle line:90% @ symbol literally. 00:50:45.370 --> 00:50:49.720 align:middle line:84% Then another word character, one or more, then a literal dot, then 00:50:49.720 --> 00:50:54.200 align:middle line:84% literally edu, and then match at the very end of the string, and that's it. 00:50:54.200 --> 00:50:55.660 align:middle line:90% So there's more of these, too. 00:50:55.660 --> 00:50:57.910 align:middle line:84% And we won't use them all here, but here is 00:50:57.910 --> 00:51:02.950 align:middle line:84% a partial list of the patterns you can use within a regular expression. 00:51:02.950 --> 00:51:07.060 align:middle line:84% One, you have backslash d for any decimal digit, "decimal digit" meaning 00:51:07.060 --> 00:51:08.590 align:middle line:90% 0 through 9. 00:51:08.590 --> 00:51:12.550 align:middle line:84% Commonly done here, too, is if you want to do the opposite of that, 00:51:12.550 --> 00:51:17.020 align:middle line:84% the complement, so to speak, you can do backslash capital D, which 00:51:17.020 --> 00:51:19.480 align:middle line:90% is anything that's not a decimal digit.
00:51:19.480 --> 00:51:23.990 align:middle line:84% So it might be letters, and punctuation, and other symbols as well. 00:51:23.990 --> 00:51:27.280 align:middle line:84% Meanwhile, backslash s means whitespace characters, 00:51:27.280 --> 00:51:30.490 align:middle line:84% like a single hit of the space, or maybe hitting Tab on the keyboard. 00:51:30.490 --> 00:51:31.720 align:middle line:90% That's whitespace. 00:51:31.720 --> 00:51:35.110 align:middle line:84% Backslash capital S is the opposite or complement 00:51:35.110 --> 00:51:38.080 align:middle line:84% of that-- anything that's not a whitespace character. 00:51:38.080 --> 00:51:41.680 align:middle line:84% Backslash w, we've seen, a word character, as well as 00:51:41.680 --> 00:51:43.390 align:middle line:90% numbers and the underscore. 00:51:43.390 --> 00:51:45.970 align:middle line:84% And if you want the complement or opposite of that, 00:51:45.970 --> 00:51:50.950 align:middle line:84% you can use backslash capital W to give you everything but a word character. 00:51:50.950 --> 00:51:54.130 align:middle line:84% Again, these are just common patterns that so many people were presumably 00:51:54.130 --> 00:51:58.520 align:middle line:84% using in yesteryear that it's now baked into the regular expression syntax 00:51:58.520 --> 00:52:02.710 align:middle line:84% so that you can more succinctly express your same ideas. 00:52:02.710 --> 00:52:05.320 align:middle line:84% Any questions, then, on this approach here, 00:52:05.320 --> 00:52:12.340 align:middle line:84% where we're now using backslash w to represent my word character? 00:52:12.340 --> 00:52:14.230 align:middle line:84% AUDIENCE: So what I want to ask about was 00:52:14.230 --> 00:52:17.590 align:middle line:84% the-- actually the previous approach, like the square bracket approach. 00:52:17.590 --> 00:52:19.792 align:middle line:90% Could we accept lists in there? 00:52:19.792 --> 00:52:20.500 align:middle line:90% DAVID MALAN: Yes. 
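To make the six shorthand classes above concrete, here is a short sketch that runs each one over a made-up sample string. The string and variable names are illustrative only.

```python
import re

text = "CS50 is_fun\t2022!"

digits = re.findall(r"\d", text)      # decimal digits
non_digits = re.findall(r"\D", text)  # complement of \d
spaces = re.findall(r"\s", text)      # whitespace: here a space and a tab
non_spaces = re.findall(r"\S", text)  # complement of \s
word_chars = re.findall(r"\w", text)  # letters, digits, underscore
non_word = re.findall(r"\W", text)    # complement of \w

print(digits)    # ['5', '0', '2', '0', '2', '2']
print(spaces)    # [' ', '\t']
print(non_word)  # [' ', '\t', '!']
```

Each uppercase class is the exact complement of its lowercase partner, so every character in the sample lands in exactly one list of each pair.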
00:52:20.500 --> 00:52:21.730 align:middle line:90% We'll see this before long. 00:52:21.730 --> 00:52:27.460 align:middle line:84% But suppose you wanted to tolerate not just .edu, but maybe .edu, or .com, 00:52:27.460 --> 00:52:28.450 align:middle line:90% you could do this. 00:52:28.450 --> 00:52:32.500 align:middle line:84% You could introduce parentheses, and then you can or those together. 00:52:32.500 --> 00:52:35.470 align:middle line:90% I could say com or edu. 00:52:35.470 --> 00:52:40.180 align:middle line:84% Could also add in something like in the US, or gov, or net, 00:52:40.180 --> 00:52:42.670 align:middle line:90% or anything else, or org, or the like. 00:52:42.670 --> 00:52:45.190 align:middle line:84% And each of the vertical bars here means something special. 00:52:45.190 --> 00:52:46.180 align:middle line:90% It means "or." 00:52:46.180 --> 00:52:48.610 align:middle line:84% And the parentheses simply group things together. 00:52:48.610 --> 00:52:50.920 align:middle line:90% Formally, you have this syntax here-- 00:52:50.920 --> 00:52:56.530 align:middle line:84% A or B, A or vertical bar B, means "A has to match or B has to match," 00:52:56.530 --> 00:52:59.080 align:middle line:84% where A and B can be any other patterns you want. 00:52:59.080 --> 00:53:01.520 align:middle line:84% In parentheses, you can group those things together. 00:53:01.520 --> 00:53:05.710 align:middle line:84% So just like math, you can combine ideas into one phrase 00:53:05.710 --> 00:53:07.600 align:middle line:90% and do this thing or the other. 00:53:07.600 --> 00:53:09.970 align:middle line:84% And there's other syntax as well that we'll soon see. 00:53:09.970 --> 00:53:14.750 align:middle line:84% Other questions on these regular expressions and this syntax here? 00:53:14.750 --> 00:53:16.990 align:middle line:84% AUDIENCE: What if we put spaces in the expression? 00:53:16.990 --> 00:53:17.740 align:middle line:90% DAVID MALAN: Sure. 
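The grouping and "or" syntax just described can be sketched as below. The helper name `is_valid` and the exact list of top-level domains are assumptions for illustration.

```python
import re


def is_valid(email):
    # Hypothetical helper: the parentheses group the alternatives,
    # and each vertical bar means "or".
    pattern = r"^\w+@\w+\.(com|edu|gov|net|org)$"
    return re.search(pattern, email) is not None


print(is_valid("malan@harvard.edu"))  # True
print(is_valid("malan@example.com"))  # True
print(is_valid("malan@harvard.io"))   # False

# Without the parentheses the bar splits the WHOLE pattern in two,
# so anything that merely ends in "edu" would match.
unparenthesized = r"^\w+@\w+\.com|edu$"
print(bool(re.search(unparenthesized, "not an email edu")))  # True
```

The last line shows why the parentheses matter: the vertical bar binds more loosely than everything around it, so the grouping is what confines the choice to the top-level domain.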
00:53:17.740 --> 00:53:21.910 align:middle line:84% So if you want spaces in there, you can't use backslash w alone, 00:53:21.910 --> 00:53:25.690 align:middle line:84% because that is only a word character which is alphabetical, numerical, 00:53:25.690 --> 00:53:27.100 align:middle line:90% or the underscore. 00:53:27.100 --> 00:53:28.580 align:middle line:90% But you could do this. 00:53:28.580 --> 00:53:32.170 align:middle line:84% You could go back to this approach whereby you use square brackets. 00:53:32.170 --> 00:53:37.120 align:middle line:84% And you could say a through z, or A through Z, or 0 through 9, 00:53:37.120 --> 00:53:40.693 align:middle line:84% or underscore, or I'm going to hit the space bar, a single space. 00:53:40.693 --> 00:53:43.360 align:middle line:84% You can put a literal space inside of the square brackets, which 00:53:43.360 --> 00:53:45.700 align:middle line:90% will allow you then to detect a space. 00:53:45.700 --> 00:53:49.420 align:middle line:84% Alternatively, I could still use backslash w, 00:53:49.420 --> 00:53:51.280 align:middle line:90% but I could combine it as follows. 00:53:51.280 --> 00:53:54.700 align:middle line:84% I could say, give me a backslash w or a backslash s, 00:53:54.700 --> 00:53:57.287 align:middle line:84% because recall that backslash s is whitespace. 00:53:57.287 --> 00:53:58.870 align:middle line:90% So it's even more than a single space. 00:53:58.870 --> 00:53:59.770 align:middle line:90% It could be a tab. 00:53:59.770 --> 00:54:02.140 align:middle line:84% But by putting those things in parentheses, now 00:54:02.140 --> 00:54:04.060 align:middle line:84% you can match either the thing on the left 00:54:04.060 --> 00:54:07.400 align:middle line:84% or the thing on the right one or more times. 00:54:07.400 --> 00:54:12.290 align:middle line:84% How about one other question on these regular expressions? 00:54:12.290 --> 00:54:13.040 align:middle line:90% AUDIENCE: Perfect.
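Both ways of allowing spaces can be sketched side by side. The variable names and sample strings are illustrative, and the patterns are applied to a bare name rather than a full email address to keep the difference visible.

```python
import re

# A literal space added to the bracketed set: allows spaces but not tabs.
with_space = r"^[a-zA-Z0-9_ ]+$"

# A word character OR any whitespace, grouped and repeated one or more times.
with_whitespace = r"^(\w|\s)+$"

print(bool(re.search(with_space, "david malan")))        # True
print(bool(re.search(with_space, "david\tmalan")))       # False
print(bool(re.search(with_whitespace, "david malan")))   # True
print(bool(re.search(with_whitespace, "david\tmalan")))  # True
```

A third spelling, `[\w\s]+`, puts both shorthand classes inside one set of square brackets and should behave the same as the parenthesized version here.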
00:54:13.040 --> 00:54:19.070 align:middle line:84% So I was going to ask, does the backslash w include a dot? 00:54:19.070 --> 00:54:20.730 align:middle line:90% Because-- no, OK. 00:54:20.730 --> 00:54:24.230 align:middle line:84% DAVID MALAN: No, it only includes letters, numbers, and underscore. 00:54:24.230 --> 00:54:25.387 align:middle line:90% That is it. 00:54:25.387 --> 00:54:27.470 align:middle line:84% AUDIENCE: And I was wondering, you gave an example 00:54:27.470 --> 00:54:33.140 align:middle line:84% at the beginning that had spaces, like this is my email, so-and-so. 00:54:33.140 --> 00:54:35.420 align:middle line:90% I don't think our current version-- 00:54:35.420 --> 00:54:39.110 align:middle line:84% or even quite a long while ago stopped accepting it. 00:54:39.110 --> 00:54:43.915 align:middle line:84% Was that because of the ^ or because of something else? 00:54:43.915 --> 00:54:47.960 align:middle line:84% DAVID MALAN: No, the reason I was handling spaces in other English words 00:54:47.960 --> 00:54:51.425 align:middle line:84% when I typed out my email address as malan@harvard.edu 00:54:51.425 --> 00:54:57.380 align:middle line:84% was because we were using initially dot *, or dot +, which is any character. 00:54:57.380 --> 00:55:01.340 align:middle line:84% And even after that, we said anything except the @ sign, 00:55:01.340 --> 00:55:02.870 align:middle line:90% which includes spaces. 00:55:02.870 --> 00:55:08.000 align:middle line:84% Only once I started using square brackets and a through z and 0 00:55:08.000 --> 00:55:11.210 align:middle line:84% through 9 and underscore did we finally get to the point 00:55:11.210 --> 00:55:13.040 align:middle line:90% where we would reject whitespace. 00:55:13.040 --> 00:55:14.970 align:middle line:90% And in fact, I can run this here.
00:55:14.970 --> 00:55:18.980 align:middle line:84% Let me go into the current version of my code in VS Code, which is using, again, 00:55:18.980 --> 00:55:21.620 align:middle line:84% the backslash w's for word characters, let 00:55:21.620 --> 00:55:24.860 align:middle line:84% me run python of validate.py and incorrectly type in something 00:55:24.860 --> 00:55:30.020 align:middle line:84% like "my email address is malan@harvard.edu," period, which 00:55:30.020 --> 00:55:34.250 align:middle line:84% has spaces to the left of my username, and that is now invalid, 00:55:34.250 --> 00:55:36.590 align:middle line:90% because space is not a word character. 00:55:36.590 --> 00:55:39.860 align:middle line:84% You're going to notice, too, that technically I'm not allowing dots. 00:55:39.860 --> 00:55:41.902 align:middle line:84% And some of you might be thinking, wait a minute. 00:55:41.902 --> 00:55:43.880 align:middle line:90% My Gmail address has a dot in it. 00:55:43.880 --> 00:55:46.280 align:middle line:84% That's something we're going to still have to fix. 00:55:46.280 --> 00:55:49.160 align:middle line:90% A backslash w is not the end all here. 00:55:49.160 --> 00:55:52.520 align:middle line:84% It's just allowing us to express our previous solution 00:55:52.520 --> 00:55:54.020 align:middle line:90% a little more succinctly. 00:55:54.020 --> 00:55:57.260 align:middle line:84% Now, one thing we're still not handling quite properly 00:55:57.260 --> 00:55:59.180 align:middle line:90% is uppercase versus lowercase. 00:55:59.180 --> 00:56:03.200 align:middle line:84% The backslash w technically does handle lowercase letters and uppercase, 00:56:03.200 --> 00:56:06.450 align:middle line:84% because it's the exact same thing as that set from before, 00:56:06.450 --> 00:56:11.670 align:middle line:84% which had little a through little z and big A through big Z. But watch this. 
00:56:11.670 --> 00:56:14.960 align:middle line:84% Let me go ahead in my current form run python of validate.py, 00:56:14.960 --> 00:56:19.376 align:middle line:84% and just because my Caps lock key is down, MALAN@HARVARD.EDU, 00:56:19.376 --> 00:56:21.080 align:middle line:90% shouting my email address. 00:56:21.080 --> 00:56:23.640 align:middle line:84% It's going to be OK in terms of the MALAN. 00:56:23.640 --> 00:56:25.940 align:middle line:84% It's going to be OK in terms of the HARVARD, 00:56:25.940 --> 00:56:28.790 align:middle line:84% because those are matching the backslash w, which 00:56:28.790 --> 00:56:31.490 align:middle line:90% does include lowercase and uppercase. 00:56:31.490 --> 00:56:34.310 align:middle line:90% But I'm about to see invalid. 00:56:34.310 --> 00:56:35.210 align:middle line:90% Why? 00:56:35.210 --> 00:56:41.670 align:middle line:84% Why is MALAN@HARVARD.EDU invalid when it's in all caps here, 00:56:41.670 --> 00:56:44.195 align:middle line:90% even though I'm using backslash w? 00:56:44.195 --> 00:56:44.820 align:middle line:90% AUDIENCE: Yeah. 00:56:44.820 --> 00:56:50.010 align:middle line:84% So you are asking for the domain.edu in lowercase, 00:56:50.010 --> 00:56:52.105 align:middle line:90% and you're typing it in uppercase. 00:56:52.105 --> 00:56:52.980 align:middle line:90% DAVID MALAN: Exactly. 00:56:52.980 --> 00:56:55.980 align:middle line:84% I'm typing in my email address in all uppercase, 00:56:55.980 --> 00:56:57.892 align:middle line:90% but I'm looking for literally ".edu." 00:56:57.892 --> 00:57:00.600 align:middle line:84% And as I see you with AirPods and so many of you with headphones, 00:57:00.600 --> 00:57:03.810 align:middle line:84% I apologize for yelling into my microphone just now to make this point. 00:57:03.810 --> 00:57:05.770 align:middle line:90% But let's see if we can't fix that. 
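The failure just demonstrated can be reproduced with a few lines. This is a sketch that calls re.search directly on fixed strings instead of prompting with input().

```python
import re

pattern = r"^\w+@\w+\.edu$"

# \w accepts both cases, but the literal "edu" in the pattern is lowercase only.
print(bool(re.search(pattern, "malan@harvard.edu")))  # True
print(bool(re.search(pattern, "MALAN@HARVARD.EDU")))  # False
print(bool(re.search(pattern, "MALAN@HARVARD.edu")))  # True
```

The third address shows that the uppercase username and domain are not the problem; only the top-level domain has to match the pattern's literal lowercase text.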
00:57:05.770 --> 00:57:11.925 align:middle line:84% Well, if my pattern on line 5 is expecting it to be lowercase, 00:57:11.925 --> 00:57:13.800 align:middle line:84% there's actually a few ways I can solve this. 00:57:13.800 --> 00:57:15.840 align:middle line:84% One would be something we've seen before. 00:57:15.840 --> 00:57:19.050 align:middle line:84% I could just force the user's input to all lowercase. 00:57:19.050 --> 00:57:23.610 align:middle line:84% And I could put onto the end of my first line .lower and actually force it all 00:57:23.610 --> 00:57:24.480 align:middle line:90% to lowercase. 00:57:24.480 --> 00:57:26.880 align:middle line:84% Alternatively, I could do that a little later. 00:57:26.880 --> 00:57:31.050 align:middle line:84% Instead of passing an email, I could pass in the lowercase version of email, 00:57:31.050 --> 00:57:33.810 align:middle line:84% because email addresses should, in fact, be case insensitive. 00:57:33.810 --> 00:57:34.980 align:middle line:90% So that would work, too. 00:57:34.980 --> 00:57:37.590 align:middle line:84% But there's another mechanism here, which is worth seeing. 00:57:37.590 --> 00:57:43.890 align:middle line:84% It turns out that that function before called re.search supports, recall, 00:57:43.890 --> 00:57:46.800 align:middle line:84% a third argument as well, these so-called flags. 00:57:46.800 --> 00:57:49.170 align:middle line:84% And flags are configuration options, typically 00:57:49.170 --> 00:57:52.290 align:middle line:84% to a function, that allow you to configure it a little differently. 00:57:52.290 --> 00:57:55.290 align:middle line:84% And how might I go about configuring this call 00:57:55.290 --> 00:57:59.910 align:middle line:84% to re.search a little bit differently insofar as I'm currently only passing 00:57:59.910 --> 00:58:00.900 align:middle line:90% in two arguments?
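The two lowercasing options mentioned above might look like this. The variable `raw` is a stand-in for whatever input() would return, and the other names are illustrative.

```python
import re

pattern = r"^\w+@\w+\.edu$"
raw = "MALAN@HARVARD.EDU"  # stand-in for the user's typed input

# Option 1: lowercase once, up front, and keep only the lowercase version.
email = raw.lower()
option_one = re.search(pattern, email) is not None

# Option 2: keep the original text and lowercase only what is passed to re.search.
option_two = re.search(pattern, raw.lower()) is not None

print(option_one, option_two)  # True True
print(raw)                     # MALAN@HARVARD.EDU
```

Option 2 leaves `raw` untouched, which matters if the original capitalization is needed later.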
00:58:00.900 --> 00:58:04.650 align:middle line:84% Well, it turns out that some of the flags you can pass into this function 00:58:04.650 --> 00:58:05.790 align:middle line:90% are these. 00:58:05.790 --> 00:58:10.110 align:middle line:84% It turns out that the regular expression library in Python, a.k.a. 00:58:10.110 --> 00:58:14.040 align:middle line:84% re, comes with a few built-in variables, so to speak, 00:58:14.040 --> 00:58:16.110 align:middle line:84% things that you can think of as constants, 00:58:16.110 --> 00:58:19.920 align:middle line:90% that have meaning to re.search. 00:58:19.920 --> 00:58:21.760 align:middle line:90% And they do so as follows. 00:58:21.760 --> 00:58:26.220 align:middle line:84% If you pass in as a flag re.IGNORECASE, what re.search is going to do 00:58:26.220 --> 00:58:28.530 align:middle line:90% is ignore the case of the user's input. 00:58:28.530 --> 00:58:30.880 align:middle line:84% It can be uppercase, lowercase, a combination thereof, 00:58:30.880 --> 00:58:32.470 align:middle line:90% the case is going to be ignored. 00:58:32.470 --> 00:58:34.327 align:middle line:90% It will be treated case insensitively. 00:58:34.327 --> 00:58:36.660 align:middle line:84% And you can do other things, too, that we won't do here. 00:58:36.660 --> 00:58:40.650 align:middle line:84% But if you want to handle the user's input that maybe spans multiple lines-- 00:58:40.650 --> 00:58:44.040 align:middle line:84% maybe they didn't just type in an email address but an entire paragraph 00:58:44.040 --> 00:58:46.410 align:middle line:84% of text, and you want to match different lines 00:58:46.410 --> 00:58:48.210 align:middle line:90% of that text that is multiple lines. 
00:58:48.210 --> 00:58:52.950 align:middle line:84% Another flag is for re.MULTILINE for just that, or re.DOTALL, 00:58:52.950 --> 00:58:57.990 align:middle line:84% whereby you can configure the dot to recognize not just 00:58:57.990 --> 00:59:02.830 align:middle line:84% any character except newlines but any character plus newlines as well. 00:59:02.830 --> 00:59:05.850 align:middle line:84% But for now, let me go ahead and just make use of this first one. 00:59:05.850 --> 00:59:13.170 align:middle line:84% Let me pass in a third argument to re.search, which is re.IGNORECASE. 00:59:13.170 --> 00:59:15.330 align:middle line:84% Let me now rerun the program without clearing 00:59:15.330 --> 00:59:17.670 align:middle line:90% my screen, python of validate.py. 00:59:17.670 --> 00:59:20.850 align:middle line:84% Let me type in again in all caps, effectively shouting, 00:59:20.850 --> 00:59:25.200 align:middle line:84% MALAN@HARVARD.EDU, Enter, and now it's considered valid, 00:59:25.200 --> 00:59:27.690 align:middle line:84% because I'm telling re.search specifically 00:59:27.690 --> 00:59:29.460 align:middle line:90% to ignore the case of the input. 00:59:29.460 --> 00:59:30.960 align:middle line:90% And that, too, here is fine. 00:59:30.960 --> 00:59:34.500 align:middle line:84% And why might I do this approach rather than call .lower in one of those other 00:59:34.500 --> 00:59:35.280 align:middle line:90% locations? 00:59:35.280 --> 00:59:39.000 align:middle line:84% Eh, if I don't actually want to change the user's input for whatever reason, 00:59:39.000 --> 00:59:43.290 align:middle line:84% I can still treat it case insensitively without actually changing 00:59:43.290 --> 00:59:46.140 align:middle line:90% the value of that variable itself. 00:59:46.140 --> 00:59:51.970 align:middle line:84% All right, any final questions now on this validation of email addresses? 00:59:51.970 --> 00:59:54.600 align:middle line:84% AUDIENCE: So the pattern is a string, right? 
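A sketch of the flag in use, on a fixed string rather than input(). The last two lines also illustrate what re.DOTALL changes about the dot, using a made-up two-line string.

```python
import re

email = "MALAN@HARVARD.EDU"
pattern = r"^\w+@\w+\.edu$"

case_sensitive = re.search(pattern, email)
case_insensitive = re.search(pattern, email, re.IGNORECASE)

print("Valid" if case_sensitive else "Invalid")    # Invalid
print("Valid" if case_insensitive else "Invalid")  # Valid
print(email)  # MALAN@HARVARD.EDU (the variable itself is unchanged)

# By default the dot does not match a newline; re.DOTALL lets it.
print(bool(re.search(r"^a.b$", "a\nb")))             # False
print(bool(re.search(r"^a.b$", "a\nb", re.DOTALL)))  # True
```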
00:59:54.600 --> 00:59:55.800 align:middle line:90% DAVID MALAN: Mm-hmm. 00:59:55.800 --> 00:59:57.390 align:middle line:90% AUDIENCE: Can we use an f-string? 00:59:57.390 --> 00:59:58.440 align:middle line:90% DAVID MALAN: You can. 00:59:58.440 --> 01:00:01.780 align:middle line:84% Yes, you can use an f-string so that you could plug in, for instance, 01:00:01.780 --> 01:00:04.830 align:middle line:84% the value of a variable and pass it into the function. 01:00:04.830 --> 01:00:06.000 align:middle line:90% Other questions on this? 01:00:06.000 --> 01:00:10.342 align:middle line:84% AUDIENCE: Backslash w character, could we take it as an input from the user? 01:00:10.342 --> 01:00:11.550 align:middle line:90% DAVID MALAN: Technically yes. 01:00:11.550 --> 01:00:13.440 align:middle line:84% That's not a problem we're trying to solve right now. 01:00:13.440 --> 01:00:16.530 align:middle line:84% We want the user to provide literal input, like their email address, 01:00:16.530 --> 01:00:18.750 align:middle line:90% not necessarily a regular expression. 01:00:18.750 --> 01:00:22.230 align:middle line:84% But you could imagine building software that asks the user, especially 01:00:22.230 --> 01:00:25.800 align:middle line:84% if they're more advanced users, to type in a regular expression for some reason 01:00:25.800 --> 01:00:27.722 align:middle line:90% to validate something else against that. 01:00:27.722 --> 01:00:29.430 align:middle line:84% And in fact, that's what Google is doing. 01:00:29.430 --> 01:00:33.630 align:middle line:84% If you play around with Google Forms and create a form with response validation 01:00:33.630 --> 01:00:37.590 align:middle line:84% and select Regular Expression, Google lets you and I type 01:00:37.590 --> 01:00:41.530 align:middle line:84% in our own regular expressions, which would be a perfect example of that. 01:00:41.530 --> 01:00:42.030 align:middle line:90% All right.
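One way the f-string idea could look, as a sketch: the variable `tld` is hypothetical, and wrapping it in re.escape is an added precaution (not something the lecture shows) so that any special characters in the variable are treated literally.

```python
import re

tld = "edu"  # hypothetical variable plugged into the pattern

# The rf prefix combines a raw string (so \w stays as written) with an f-string.
pattern = rf"^\w+@\w+\.{re.escape(tld)}$"

print(pattern)                                        # ^\w+@\w+\.edu$
print(bool(re.search(pattern, "malan@harvard.edu")))  # True
print(bool(re.search(pattern, "malan@harvard.com")))  # False
```

One thing to watch for: curly braces are also regex syntax for counted repetition, so inside an f-string they have to be doubled when meant for the regex rather than for interpolation.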
01:00:42.030 --> 01:00:45.900 align:middle line:84% Well, let me propose that we try to solve one other problem here, 01:00:45.900 --> 01:00:51.480 align:middle line:84% whereby if I go into the same version as before, which is now ignoring case, 01:00:51.480 --> 01:00:54.100 align:middle line:84% but I type in one of my other email addresses. 01:00:54.100 --> 01:00:56.280 align:middle line:84% Let me go ahead and run python of validate.py. 01:00:56.280 --> 01:00:59.580 align:middle line:84% And this time, let me type in not malan@harvard.edu, which 01:00:59.580 --> 01:01:01.920 align:middle line:84% I use primarily, but another email address 01:01:01.920 --> 01:01:06.030 align:middle line:84% of mine, malan@cs50.harvard.edu, which forwards to the same. 01:01:06.030 --> 01:01:07.920 align:middle line:90% Let me go ahead and hit Enter now. 01:01:07.920 --> 01:01:11.940 align:middle line:84% And huh, invalid, even though I'm pretty sure that 01:01:11.940 --> 01:01:13.380 align:middle line:90% is, in fact, my email address. 01:01:13.380 --> 01:01:15.920 align:middle line:84% Well, let's put our finger on the reason why. 01:01:15.920 --> 01:01:20.400 align:middle line:84% Why at the moment is malan@cs50.harvard.edu 01:01:20.400 --> 01:01:25.890 align:middle line:84% being considered invalid, even though I'm pretty sure I send and receive 01:01:25.890 --> 01:01:27.330 align:middle line:90% email from that address, too? 01:01:27.330 --> 01:01:30.470 align:middle line:90% 01:01:30.470 --> 01:01:32.000 align:middle line:90% Why might that be? 01:01:32.000 --> 01:01:38.475 align:middle line:84% AUDIENCE: Because there is a dot that has come after the @ symbol. 01:01:38.475 --> 01:01:39.350 align:middle line:90% DAVID MALAN: Exactly. 01:01:39.350 --> 01:01:42.230 align:middle line:90% There's a dot after my cs50. 
01:01:42.230 --> 01:01:45.080 align:middle line:84% And I'm not expecting any dots there, I'm expecting only, 01:01:45.080 --> 01:01:50.240 align:middle line:84% again, word characters, which is A through z, 0 through 9, and underscore. 01:01:50.240 --> 01:01:52.130 align:middle line:90% So I'm going to have to retool here. 01:01:52.130 --> 01:01:54.090 align:middle line:90% But how could I go about doing this? 01:01:54.090 --> 01:01:57.613 align:middle line:84% Well, it turns out theoretically, there could be other email addresses, 01:01:57.613 --> 01:02:00.530 align:middle line:84% even though they'd be getting a little excessively long, for instance, 01:02:00.530 --> 01:02:05.210 align:middle line:84% malan@something.cs50.harvard.edu, which does not technically exist, 01:02:05.210 --> 01:02:06.125 align:middle line:90% but it could. 01:02:06.125 --> 01:02:09.950 align:middle line:84% You can have, of course, multiple dots in a domain name like we see here. 01:02:09.950 --> 01:02:12.500 align:middle line:84% Wouldn't it be nice if we could handle that as well? 01:02:12.500 --> 01:02:16.670 align:middle line:84% Well, let me propose that we modify my regular expression as follows. 01:02:16.670 --> 01:02:20.240 align:middle line:84% It turns out that you can group ideas together. 01:02:20.240 --> 01:02:24.050 align:middle line:84% And you can not only ask whether or not this pattern matches 01:02:24.050 --> 01:02:29.780 align:middle line:84% or this one using syntax like A vertical bar B, which means "either A or B," 01:02:29.780 --> 01:02:34.280 align:middle line:84% you can also group things together and then apply some other operator to them 01:02:34.280 --> 01:02:35.100 align:middle line:90% as well. 01:02:35.100 --> 01:02:37.160 align:middle line:84% In fact, let me go back to the code here. 
01:02:37.160 --> 01:02:42.260 align:middle line:84% And let me propose that if I want to tolerate a subdomain, like cs50, 01:02:42.260 --> 01:02:46.700 align:middle line:84% that may or may not be there, let me go ahead and change it as follows. 01:02:46.700 --> 01:02:48.320 align:middle line:90% I could naively do this. 01:02:48.320 --> 01:02:51.210 align:middle line:84% If I want to support subdomains, I could say, well, 01:02:51.210 --> 01:02:55.640 align:middle line:84% let's allow for other word characters plus, and then a literal dot. 01:02:55.640 --> 01:02:58.970 align:middle line:84% And notice, I'll highlight in blue here what I've just added. 01:02:58.970 --> 01:03:04.190 align:middle line:84% Everything else is the same, but I'm now adding room for another sequence of one 01:03:04.190 --> 01:03:07.650 align:middle line:84% or more word characters and then a literal dot. 01:03:07.650 --> 01:03:12.380 align:middle line:84% So this now, I think, if I rerun python of validate.py, 01:03:12.380 --> 01:03:16.310 align:middle line:84% will work for malan@cs50.harvard.edu, Enter. 01:03:16.310 --> 01:03:19.610 align:middle line:84% Unfortunately, does anyone see where this is going? 01:03:19.610 --> 01:03:22.310 align:middle line:84% Let me rerun python of validate.py and type 01:03:22.310 --> 01:03:25.010 align:middle line:84% in as I keep doing, malan@harvard.edu, which up until now 01:03:25.010 --> 01:03:27.290 align:middle line:84% has kept working despite all of my changes. 01:03:27.290 --> 01:03:33.110 align:middle line:84% But now, ugh, finally I've broken my own email address. 01:03:33.110 --> 01:03:35.540 align:middle line:90% So logically what's the solution here? 01:03:35.540 --> 01:03:37.730 align:middle line:84% Well, there's a bunch of ways we could solve this. 
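The naive change and the breakage it causes can be sketched like this, again on fixed strings and without the re.IGNORECASE flag for brevity.

```python
import re

# The naive version: a second "\w+\." that is always required.
naive = r"^\w+@\w+\.\w+\.edu$"

print(bool(re.search(naive, "malan@cs50.harvard.edu")))  # True
print(bool(re.search(naive, "malan@harvard.edu")))       # False
```

The shorter address now fails because the pattern insists on two dot-separated names before ".edu", and there is only one.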
01:03:37.730 --> 01:03:40.430 align:middle line:84% I could maybe start using two regular expressions 01:03:40.430 --> 01:03:46.370 align:middle line:84% and support email addresses of the form username@domain.tld, 01:03:46.370 --> 01:03:51.350 align:middle line:84% or username@subdomain.domain.tld, where TLD just 01:03:51.350 --> 01:03:53.917 align:middle line:90% means Top Level Domain, like edu. 01:03:53.917 --> 01:03:56.000 align:middle line:84% Or I could maybe just modify this one, because I'd 01:03:56.000 --> 01:04:00.920 align:middle line:84% prefer not to have two regular expressions or one that's twice as big. 01:04:00.920 --> 01:04:06.470 align:middle line:84% Why don't I just specify to re.search that part of this pattern is optional? 01:04:06.470 --> 01:04:10.400 align:middle line:84% What was the symbol we saw earlier that allows 01:04:10.400 --> 01:04:15.440 align:middle line:84% you to specify that the thing before it is technically optional? 01:04:15.440 --> 01:04:16.610 align:middle line:90% AUDIENCE: The straight bar? 01:04:16.610 --> 01:04:19.790 align:middle line:90% We were using the straight bar as an-- 01:04:19.790 --> 01:04:22.678 align:middle line:90% optional, make the argument optional. 01:04:22.678 --> 01:04:23.720 align:middle line:90% DAVID MALAN: So we could. 01:04:23.720 --> 01:04:26.210 align:middle line:84% We could use a vertical bar and some parentheses 01:04:26.210 --> 01:04:29.480 align:middle line:84% and say, "either there's something here or there's nothing." 01:04:29.480 --> 01:04:31.010 align:middle line:90% We could do that in parentheses. 01:04:31.010 --> 01:04:33.860 align:middle line:84% But I think there's actually an even easier way. 01:04:33.860 --> 01:04:36.332 align:middle line:84% AUDIENCE: Actually, it's a question mark. 01:04:36.332 --> 01:04:37.790 align:middle line:90% DAVID MALAN: Indeed, question mark. 
01:04:37.790 --> 01:04:41.240 align:middle line:84% Think back to this summary here of our first set of symbols, 01:04:41.240 --> 01:04:46.130 align:middle line:84% whereby we had not just dot and * and +, but also a question mark, which 01:04:46.130 --> 01:04:49.370 align:middle line:84% means literally "zero or one repetitions," which 01:04:49.370 --> 01:04:50.810 align:middle line:90% effectively means optional. 01:04:50.810 --> 01:04:54.740 align:middle line:84% It's either there, one, or it's not, zero. 01:04:54.740 --> 01:04:57.650 align:middle line:84% Now, how can I translate that to this code here? 01:04:57.650 --> 01:05:03.150 align:middle line:84% Well, let me go ahead and surround this part of my pattern with parentheses, 01:05:03.150 --> 01:05:06.740 align:middle line:84% which doesn't mean I want literally a parentheses in the user's input, 01:05:06.740 --> 01:05:09.410 align:middle line:84% I just want to group these characters together. 01:05:09.410 --> 01:05:11.480 align:middle line:90% And in fact, this now will still work. 01:05:11.480 --> 01:05:14.960 align:middle line:84% I've only added parentheses around the new part for the subdomain. 01:05:14.960 --> 01:05:17.000 align:middle line:90% Let me run python of validate.py. 01:05:17.000 --> 01:05:20.060 align:middle line:84% Let me run malan@cs50.harvard.edu, Enter. 01:05:20.060 --> 01:05:21.110 align:middle line:90% That's still valid. 01:05:21.110 --> 01:05:25.730 align:middle line:84% But to be clear, if I rerun it again for malan@harvard.edu, that is still 01:05:25.730 --> 01:05:31.310 align:middle line:84% invalid, but not if I go in here and say, after the parentheses, which 01:05:31.310 --> 01:05:36.410 align:middle line:84% now is one logical unit, it's one big group of ideas together, 01:05:36.410 --> 01:05:38.690 align:middle line:90% I add a single question mark there. 
01:05:38.690 --> 01:05:43.910 align:middle line:84% This will now tell re.search that that whole thing in parentheses 01:05:43.910 --> 01:05:49.020 align:middle line:84% can either be there once or be there not at all, zero times. 01:05:49.020 --> 01:05:51.530 align:middle line:84% So what does this translate into when I run it? 01:05:51.530 --> 01:05:56.030 align:middle line:84% Well, let me go ahead and rerun it with malan@cs50.harvard.edu 01:05:56.030 --> 01:05:57.770 align:middle line:90% so that the subdomain is there. 01:05:57.770 --> 01:05:59.720 align:middle line:90% That works as before. 01:05:59.720 --> 01:06:01.860 align:middle line:84% Let me clear my screen and run it again, python 01:06:01.860 --> 01:06:06.830 align:middle line:84% of validate.py with malan@harvard.edu, which used to work then broke. 01:06:06.830 --> 01:06:08.330 align:middle line:90% Are we back in business now? 01:06:08.330 --> 01:06:09.260 align:middle line:90% We are. 01:06:09.260 --> 01:06:11.810 align:middle line:90% That's now valid again. 01:06:11.810 --> 01:06:14.540 align:middle line:84% Questions now on this approach, where we've used 01:06:14.540 --> 01:06:18.655 align:middle line:84% not just the question mark but the parentheses as well? 01:06:18.655 --> 01:06:19.280 align:middle line:90% AUDIENCE: Yeah. 01:06:19.280 --> 01:06:22.130 align:middle line:84% You said it works for zero or one repetitions. 01:06:22.130 --> 01:06:23.912 align:middle line:90% What if you have more? 01:06:23.912 --> 01:06:25.370 align:middle line:90% DAVID MALAN: What if you have more? 01:06:25.370 --> 01:06:26.220 align:middle line:90% That's OK. 01:06:26.220 --> 01:06:28.610 align:middle line:90% That's where you could do *. 01:06:28.610 --> 01:06:33.835 align:middle line:84% * is zero or more, which gives you all the flexibility in the world. 01:06:33.835 --> 01:06:34.460 align:middle line:90% AUDIENCE: Yeah. 
01:06:34.460 --> 01:06:37.050 align:middle line:90% So I was just asking that-- 01:06:37.050 --> 01:06:40.670 align:middle line:84% with question marks, there's only one repetition allowed. 01:06:40.670 --> 01:06:42.810 align:middle line:84% DAVID MALAN: It means zero or one repetition. 01:06:42.810 --> 01:06:45.630 align:middle line:90% So it's either not there or it is there. 01:06:45.630 --> 01:06:49.940 align:middle line:84% And so that's why this pattern now, if I go back to my code, even though again, 01:06:49.940 --> 01:06:54.650 align:middle line:84% it admittedly looks cryptic, let me highlight everything after the @ sign 01:06:54.650 --> 01:06:56.060 align:middle line:90% and before the $ sign. 01:06:56.060 --> 01:07:01.001 align:middle line:84% This now represents a domain name, like harvard.edu, 01:07:01.001 --> 01:07:03.920 align:middle line:90% or a subdomain within the domain name. 01:07:03.920 --> 01:07:04.700 align:middle line:90% Why? 01:07:04.700 --> 01:07:07.700 align:middle line:84% Well, this part to the right is the same as always. 01:07:07.700 --> 01:07:11.330 align:middle line:84% Backslash w + means something like Harvard or Yale. 01:07:11.330 --> 01:07:14.810 align:middle line:90% Backslash .edu means literally ".edu." 01:07:14.810 --> 01:07:16.430 align:middle line:90% So the new part is this. 01:07:16.430 --> 01:07:22.370 align:middle line:84% In parentheses, I have another set of backslash w + backslash dot now. 01:07:22.370 --> 01:07:24.080 align:middle line:90% But it's all in parentheses. 01:07:24.080 --> 01:07:26.870 align:middle line:84% I'm now having a question mark right after that, 01:07:26.870 --> 01:07:30.710 align:middle line:84% which means that whole thing in parentheses either can be there, 01:07:30.710 --> 01:07:31.850 align:middle line:90% or it can't be there. 01:07:31.850 --> 01:07:34.010 align:middle line:84% It's either of those that are acceptable. 
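To make the optional group concrete, here is a minimal sketch of the pattern as it is described aloud. The exact code on screen is not in the transcript, so the pattern string and the helper name `is_valid` are assumptions reconstructed from the spoken description (`\w+`, an escaped dot, `.edu`, and ignoring case).

```python
import re

# Assumed reconstruction of the pattern being described:
#   ^\w+@        username, then a literal @
#   (\w+\.)?     an optional subdomain plus its dot, grouped so ? applies to both
#   \w+\.edu$    the domain, then a literal ".edu" at the end
PATTERN = r"^\w+@(\w+\.)?\w+\.edu$"


def is_valid(email: str) -> bool:
    """Return True if the email matches the sketch pattern, ignoring case."""
    return re.search(PATTERN, email, re.IGNORECASE) is not None


for candidate in ["malan@harvard.edu", "malan@cs50.harvard.edu", "malan@harvard"]:
    print(candidate, "Valid" if is_valid(candidate) else "Invalid")
```

Because the question mark follows the closing parenthesis, the whole group, word characters and dot together, is what may appear zero or one times.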
01:07:34.010 --> 01:07:37.880 align:middle line:84% So a question mark effectively makes something optional. 01:07:37.880 --> 01:07:40.670 align:middle line:84% It would not be correct to remove the parentheses, 01:07:40.670 --> 01:07:42.150 align:middle line:90% because what would this mean? 01:07:42.150 --> 01:07:44.690 align:middle line:84% If I removed the parentheses, that would mean 01:07:44.690 --> 01:07:49.580 align:middle line:84% that only this dot is optional, which isn't really what we want to express. 01:07:49.580 --> 01:07:54.050 align:middle line:84% I want the subdomain, like cs50 and the additional dot 01:07:54.050 --> 01:07:56.060 align:middle line:90% to be what's there or not there. 01:07:56.060 --> 01:07:59.270 align:middle line:84% How about one other question on regexes here? 01:07:59.270 --> 01:08:01.530 align:middle line:84% AUDIENCE: Can we use this for the usernames? 01:08:01.530 --> 01:08:02.530 align:middle line:90% DAVID MALAN: Absolutely. 01:08:02.530 --> 01:08:04.000 align:middle line:90% We still have other problems. 01:08:04.000 --> 01:08:06.280 align:middle line:84% We're not solving all of the problems today just yet. 01:08:06.280 --> 01:08:07.330 align:middle line:90% But absolutely. 01:08:07.330 --> 01:08:11.380 align:middle line:84% Right now, we are not letting you have a period in your username. 01:08:11.380 --> 01:08:14.088 align:middle line:84% And again, some of you with Gmail accounts or other accounts, you 01:08:14.088 --> 01:08:16.463 align:middle line:84% probably have not just underscores, numbers, and letters. 01:08:16.463 --> 01:08:17.740 align:middle line:90% You might have periods, too. 01:08:17.740 --> 01:08:21.790 align:middle line:84% Well, we could fix that, not using question mark here per se. 01:08:21.790 --> 01:08:25.630 align:middle line:84% But now that we have these parentheses at our disposal, what I could do 01:08:25.630 --> 01:08:26.350 align:middle line:90% is this.
01:08:26.350 --> 01:08:30.399 align:middle line:84% I could use parentheses to surround the backslash w 01:08:30.399 --> 01:08:33.819 align:middle line:84% to say "any word character," which is the same thing, again, as a letter, 01:08:33.819 --> 01:08:35.529 align:middle line:90% or a number, or an underscore. 01:08:35.529 --> 01:08:40.120 align:middle line:84% But I could also or in, using a vertical bar, something else, 01:08:40.120 --> 01:08:41.800 align:middle line:90% like a literal dot. 01:08:41.800 --> 01:08:44.770 align:middle line:84% Now, a literal dot needs to be escaped, otherwise it 01:08:44.770 --> 01:08:47.859 align:middle line:84% represents any character, which would be a regression, a step back. 01:08:47.859 --> 01:08:49.540 align:middle line:90% But now notice what I've done. 01:08:49.540 --> 01:08:54.370 align:middle line:84% In parentheses, I'm telling re.search that those first few characters 01:08:54.370 --> 01:08:56.800 align:middle line:84% in your email address, that is your username, 01:08:56.800 --> 01:09:02.049 align:middle line:84% has to be a word character, like A through z, uppercase or lowercase, or 0 01:09:02.049 --> 01:09:05.290 align:middle line:84% through 9, or an underscore, or a literal dot. 01:09:05.290 --> 01:09:06.760 align:middle line:90% We could do this differently, too. 01:09:06.760 --> 01:09:09.220 align:middle line:84% I could get rid of the parentheses and the 01:09:09.220 --> 01:09:12.010 align:middle line:84% or, and I could just use a set of characters. 01:09:12.010 --> 01:09:17.890 align:middle line:84% I could, again, manually say a through z, A through Z, 0 through 9, 01:09:17.890 --> 01:09:22.540 align:middle line:84% underscore, and then I could do a literal dot with a backslash period. 01:09:22.540 --> 01:09:25.029 align:middle line:84% And now I technically don't even need the uppercase, 01:09:25.029 --> 01:09:27.590 align:middle line:84% because I'm already telling the computer to ignore case. 
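The two ways of allowing a period in the username can be compared side by side. This is a sketch only: both pattern strings and the helper name `is_valid` are assumptions based on the spoken description, and the domain part reuses the optional-subdomain idea from earlier.

```python
import re

# Option 1: a group with a vertical bar, "a word character OR a literal dot".
WITH_ALTERNATION = r"^(\w|\.)+@(\w+\.)?\w+\.edu$"

# Option 2: a set of characters in square brackets. Only lowercase letters are
# listed because re.IGNORECASE is passed below.
WITH_CHARACTER_SET = r"^[a-z0-9_\.]+@(\w+\.)?\w+\.edu$"


def is_valid(pattern: str, email: str) -> bool:
    """Return True if the email matches the given pattern, ignoring case."""
    return re.search(pattern, email, re.IGNORECASE) is not None


for candidate in ["david.malan@harvard.edu", "david malan@harvard.edu"]:
    print(
        candidate,
        is_valid(WITH_ALTERNATION, candidate),
        is_valid(WITH_CHARACTER_SET, candidate),
    )
```

One caveat worth knowing: in Python, `\w` on ordinary strings also matches non-English letters, so the two options are close but not strictly identical.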
01:09:27.590 --> 01:09:29.359 align:middle line:90% I can just pick one or the other. 01:09:29.359 --> 01:09:31.120 align:middle line:90% Which one is better is really up to you. 01:09:31.120 --> 01:09:35.600 align:middle line:84% Whichever one you think is more readable would generally be the better design. 01:09:35.600 --> 01:09:36.100 align:middle line:90% All right. 01:09:36.100 --> 01:09:38.979 align:middle line:84% Let me propose that I rewind this in time 01:09:38.979 --> 01:09:42.819 align:middle line:90% to where we left off, which was here. 01:09:42.819 --> 01:09:44.800 align:middle line:84% And let me propose that there are, indeed, 01:09:44.800 --> 01:09:48.935 align:middle line:84% still limitations of this solution, not just with the username, not just 01:09:48.935 --> 01:09:49.810 align:middle line:90% with the domain name. 01:09:49.810 --> 01:09:51.700 align:middle line:84% We're still being a little too restrictive. 01:09:51.700 --> 01:09:54.910 align:middle line:84% So would you like to see the official regular expression 01:09:54.910 --> 01:09:58.720 align:middle line:84% that at least browsers use nowadays whenever you type in an email address 01:09:58.720 --> 01:10:01.450 align:middle line:84% to a web form, and the web form, the browser, 01:10:01.450 --> 01:10:05.680 align:middle line:84% tells you yes or no, your email address is syntactically valid? 01:10:05.680 --> 01:10:06.670 align:middle line:90% Ready? 01:10:06.670 --> 01:10:07.810 align:middle line:90% Ready? 01:10:07.810 --> 01:10:12.730 align:middle line:84% Here is-- and this isn't even officially the right regular expression. 01:10:12.730 --> 01:10:15.670 align:middle line:84% It's a simplified version that browsers use because it 01:10:15.670 --> 01:10:18.100 align:middle line:90% catches most mistakes but not all. 01:10:18.100 --> 01:10:19.460 align:middle line:90% Here we go. 
01:10:19.460 --> 01:10:23.710 align:middle line:84% This is the regular expression for a valid email address, 01:10:23.710 --> 01:10:27.550 align:middle line:84% at least as browsers nowadays implement them. 01:10:27.550 --> 01:10:30.610 align:middle line:90% Now it's crazy cryptic at first glance. 01:10:30.610 --> 01:10:34.930 align:middle line:84% But note-- and it's wrapping on to many lines, but it's just one pattern. 01:10:34.930 --> 01:10:37.930 align:middle line:84% But just notice the now-familiar symbols. 01:10:37.930 --> 01:10:40.540 align:middle line:90% There is the ^ symbol at the very top. 01:10:40.540 --> 01:10:43.280 align:middle line:90% There is the $ sign at the very end. 01:10:43.280 --> 01:10:45.730 align:middle line:84% There is a square bracket over here and then some 01:10:45.730 --> 01:10:47.860 align:middle line:90% of these ranges plus other characters. 01:10:47.860 --> 01:10:51.280 align:middle line:84% Turns out you don't normally see these characters in email addresses. 01:10:51.280 --> 01:10:53.770 align:middle line:84% It looks like you're swearing at someone in their username. 01:10:53.770 --> 01:10:55.450 align:middle line:90% But they're valid characters. 01:10:55.450 --> 01:10:56.680 align:middle line:90% They're valid officially. 01:10:56.680 --> 01:11:00.670 align:middle line:84% That doesn't mean that Gmail is going to allow you to put $ signs and other 01:11:00.670 --> 01:11:02.260 align:middle line:90% punctuation in your username. 01:11:02.260 --> 01:11:04.850 align:middle line:84% But officially, some servers might allow that. 01:11:04.850 --> 01:11:08.080 align:middle line:84% So if you really want to validate a user's email address, 01:11:08.080 --> 01:11:12.250 align:middle line:84% you would actually come up with or copy-paste something like this. 01:11:12.250 --> 01:11:14.680 align:middle line:90% But honestly, this looks so cryptic. 
01:11:14.680 --> 01:11:18.680 align:middle line:84% And if you were to type it out manually, you are so likely to make a mistake. 01:11:18.680 --> 01:11:21.040 align:middle line:90% What's the better solution here instead? 01:11:21.040 --> 01:11:24.820 align:middle line:84% This is where, per past weeks, libraries are your friend. 01:11:24.820 --> 01:11:28.360 align:middle line:84% Surely someone else on the internet, a programmer more 01:11:28.360 --> 01:11:31.360 align:middle line:84% experienced than you, even, has come up with code 01:11:31.360 --> 01:11:35.830 align:middle line:84% that validates email addresses properly, using this regular expression or even 01:11:35.830 --> 01:11:37.580 align:middle line:90% something more sophisticated than that. 01:11:37.580 --> 01:11:40.030 align:middle line:84% So generally, if the problem at hand is to validate 01:11:40.030 --> 01:11:43.060 align:middle line:84% input that is pretty conventional-- an email address, 01:11:43.060 --> 01:11:46.570 align:middle line:84% a URL, something where there's an official definition that's 01:11:46.570 --> 01:11:50.710 align:middle line:84% independent of you yourself-- find a popular library that you're 01:11:50.710 --> 01:11:55.130 align:middle line:84% comfortable using and use it in your code to validate email addresses. 01:11:55.130 --> 01:11:58.750 align:middle line:84% This is not a wheel, necessarily, that you yourself should invent. 01:11:58.750 --> 01:12:01.870 align:middle line:84% We've used email addresses, though, to iteratively start 01:12:01.870 --> 01:12:05.300 align:middle line:84% from something simple, too simple, and build on top of that. 01:12:05.300 --> 01:12:07.960 align:middle line:84% So you could certainly imagine using regular expressions still 01:12:07.960 --> 01:12:10.210 align:middle line:84% to validate things that aren't email addresses but are 01:12:10.210 --> 01:12:12.230 align:middle line:90% data that are important to you. 
01:12:12.230 --> 01:12:14.980 align:middle line:84% So we at least now have these building blocks. 01:12:14.980 --> 01:12:17.380 align:middle line:84% Now, besides the regular expressions themselves, 01:12:17.380 --> 01:12:20.290 align:middle line:84% it turns out there's other functions in Python's re 01:12:20.290 --> 01:12:22.030 align:middle line:90% library for regular expressions. 01:12:22.030 --> 01:12:24.280 align:middle line:84% Among them is this function here, re.match, 01:12:24.280 --> 01:12:26.980 align:middle line:84% which is actually very similar to re.search, 01:12:26.980 --> 01:12:29.462 align:middle line:84% except you don't have to specify the ^ symbol 01:12:29.462 --> 01:12:31.420 align:middle line:84% at the very beginning of your regex if you want 01:12:31.420 --> 01:12:33.400 align:middle line:90% to match from the start of a string. 01:12:33.400 --> 01:12:36.958 align:middle line:84% re.match by design will automatically start matching 01:12:36.958 --> 01:12:38.500 align:middle line:90% from the start of the string for you. 01:12:38.500 --> 01:12:42.580 align:middle line:84% Similar in spirit is re.fullmatch, which does the same thing but not only 01:12:42.580 --> 01:12:45.730 align:middle line:84% matches at the start of the string but the end of the string, so that you, 01:12:45.730 --> 01:12:50.240 align:middle line:84% too, don't need to type in the ^ symbol or the $ sign as well. 01:12:50.240 --> 01:12:53.170 align:middle line:84% But let's go ahead and transition back now to some actual code, 01:12:53.170 --> 01:12:55.420 align:middle line:84% whereby we solve a different problem in spirit. 
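A short sketch of how those three functions differ when the pattern itself has no ^ or $ in it. The sample pattern, the sample strings, and the helper name `compare` are made up for illustration; `re.search`, `re.match`, and `re.fullmatch` are the real functions from Python's `re` library.

```python
import re

# Deliberately written without ^ or $ so the functions' own anchoring shows.
PATTERN = r"\w+@\w+\.edu"


def compare(text: str) -> tuple[bool, bool, bool]:
    """Return whether search, match, and fullmatch each find a match."""
    return (
        re.search(PATTERN, text) is not None,     # anywhere in the string
        re.match(PATTERN, text) is not None,      # must begin at the start
        re.fullmatch(PATTERN, text) is not None,  # must cover the whole string
    )


for text in ["malan@harvard.edu", "email: malan@harvard.edu", "malan@harvard.edu!!"]:
    print(repr(text), compare(text))
```

Only the first string satisfies all three. The second has extra text at the start, so `re.match` and `re.fullmatch` reject it, and the third has extra text at the end, so only `re.fullmatch` rejects it.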
01:12:55.420 --> 01:12:57.920 align:middle line:84% Rather than just validate the user's input 01:12:57.920 --> 01:13:00.290 align:middle line:84% and make sure it looks the way we want, let's just 01:13:00.290 --> 01:13:04.020 align:middle line:84% assume that the users are not going to type in data exactly as we want, 01:13:04.020 --> 01:13:06.290 align:middle line:84% and so we're going to have to clean up their input. 01:13:06.290 --> 01:13:10.580 align:middle line:84% This happens so often when you're using like a Google Form, or Office 365 form, 01:13:10.580 --> 01:13:12.800 align:middle line:90% or anything else to collect user input. 01:13:12.800 --> 01:13:15.800 align:middle line:84% No matter what your form question says, your users 01:13:15.800 --> 01:13:18.225 align:middle line:84% are not necessarily going to follow those directions. 01:13:18.225 --> 01:13:20.600 align:middle line:84% They might go ahead and type in something that's a little 01:13:20.600 --> 01:13:22.910 align:middle line:84% differently formatted than you might like. 01:13:22.910 --> 01:13:26.810 align:middle line:84% Now, you could certainly go through the results and download a CSV, 01:13:26.810 --> 01:13:29.720 align:middle line:84% or open the Google spreadsheet, or equivalent in Excel, 01:13:29.720 --> 01:13:31.980 align:middle line:84% and just clean up all of the data manually. 01:13:31.980 --> 01:13:34.250 align:middle line:84% But if you've got lots of submissions-- dozens, 01:13:34.250 --> 01:13:37.070 align:middle line:84% hundreds, thousands of rows in your data set-- 01:13:37.070 --> 01:13:39.170 align:middle line:84% doing things manually might not be very fun. 01:13:39.170 --> 01:13:42.680 align:middle line:84% It might be much more effective to write code, as in Python, 01:13:42.680 --> 01:13:47.220 align:middle line:84% that can allow you to clean up that data and any future data as well. 
01:13:47.220 --> 01:13:51.620 align:middle line:84% So let me propose that we go ahead here and close validate.py. 01:13:51.620 --> 01:13:55.460 align:middle line:84% And let's go ahead and create a new program altogether called format.py, 01:13:55.460 --> 01:13:59.990 align:middle line:84% the goal of which is to reformat the user's input in the format we expect. 01:13:59.990 --> 01:14:03.080 align:middle line:84% I'm going to go ahead and run code of format.py. 01:14:03.080 --> 01:14:06.170 align:middle line:84% And let's suppose that the data we're going to reformat 01:14:06.170 --> 01:14:09.703 align:middle line:84% is the user's name-- so not email address but name this time. 01:14:09.703 --> 01:14:11.870 align:middle line:84% And we're going to hope that they type in their name 01:14:11.870 --> 01:14:14.270 align:middle line:90% properly, like David Malan. 01:14:14.270 --> 01:14:16.610 align:middle line:84% But some users might be in the habit, for whatever 01:14:16.610 --> 01:14:19.020 align:middle line:84% reason, of typing their name backwards, if you will, 01:14:19.020 --> 01:14:23.030 align:middle line:84% with a comma, such as Malan comma David instead. 01:14:23.030 --> 01:14:27.740 align:middle line:84% Now, it's fine because both are clearly as readable to the human. 01:14:27.740 --> 01:14:30.530 align:middle line:84% But if you want to standardize how those names are stored 01:14:30.530 --> 01:14:34.250 align:middle line:84% in your system, perhaps a database, or CSV file, or something else, 01:14:34.250 --> 01:14:37.970 align:middle line:84% it would be nice to at least standardize or canonicalize the format in which 01:14:37.970 --> 01:14:41.060 align:middle line:84% you're storing your data, so that if you print out the user's name 01:14:41.060 --> 01:14:43.250 align:middle line:84% it's always the same format, David Malan, 01:14:43.250 --> 01:14:46.410 align:middle line:84% and there's no commas or backwardness to it. 
01:14:46.410 --> 01:14:48.650 align:middle line:84% So let's go ahead and do something familiar. 01:14:48.650 --> 01:14:50.990 align:middle line:84% Let's go ahead and give myself a variable called name 01:14:50.990 --> 01:14:53.120 align:middle line:84% and set it equal to the return value of input, 01:14:53.120 --> 01:14:56.300 align:middle line:84% asking the user, as we've done many times, "what's your name," 01:14:56.300 --> 01:14:57.170 align:middle line:90% question mark. 01:14:57.170 --> 01:15:00.290 align:middle line:84% I'm going to go ahead and proactively at least clean up some messiness, 01:15:00.290 --> 01:15:03.950 align:middle line:84% as we keep doing here, by just stripping off any leading or trailing whitespace. 01:15:03.950 --> 01:15:06.470 align:middle line:84% Just in case the user accidentally hits the spacebar, 01:15:06.470 --> 01:15:09.720 align:middle line:84% we don't want that ultimately in our data set. 01:15:09.720 --> 01:15:12.260 align:middle line:84% And now let me go ahead and do this as we've done before. 01:15:12.260 --> 01:15:14.900 align:middle line:84% Let me just go ahead quickly and print out, just to make sure 01:15:14.900 --> 01:15:18.650 align:middle line:84% I'm off to the right start, "hello," and then in curly braces name, 01:15:18.650 --> 01:15:22.010 align:middle line:84% so making an f-string to format "hello," comma, "name." 01:15:22.010 --> 01:15:25.730 align:middle line:84% Now let me go ahead and clear my screen and run python of format.py. 01:15:25.730 --> 01:15:29.510 align:middle line:84% Let me behave and type in my name as I normally would, David, space, Malan, 01:15:29.510 --> 01:15:30.170 align:middle line:90% Enter. 01:15:30.170 --> 01:15:32.270 align:middle line:84% And I think the output looks pretty good. 01:15:32.270 --> 01:15:34.490 align:middle line:90% It looks as expected grammatically. 01:15:34.490 --> 01:15:37.283 align:middle line:84% Let me now go ahead, though, and play this game again.
01:15:37.283 --> 01:15:39.200 align:middle line:84% But this time, maybe because I'm not thinking, 01:15:39.200 --> 01:15:41.600 align:middle line:84% or I'm just in the habit of doing last name comma first, 01:15:41.600 --> 01:15:44.700 align:middle line:90% I do Malan, comma, David, and hit Enter. 01:15:44.700 --> 01:15:45.200 align:middle line:90% All right. 01:15:45.200 --> 01:15:47.270 align:middle line:90% Well, this now is weird. 01:15:47.270 --> 01:15:51.020 align:middle line:84% Even though the program is just spitting out exactly what I typed in, 01:15:51.020 --> 01:15:54.020 align:middle line:84% arguably this is not close to correct, at least grammatically. 01:15:54.020 --> 01:15:56.810 align:middle line:84% It should really say "hello, David Malan." 01:15:56.810 --> 01:15:58.820 align:middle line:84% Now, maybe I could have some if conditions 01:15:58.820 --> 01:16:01.910 align:middle line:84% and I could just reject the user's input if they type a comma 01:16:01.910 --> 01:16:03.800 align:middle line:90% or get their names backwards somehow. 01:16:03.800 --> 01:16:07.190 align:middle line:84% But that's going to be too little too late if the user has already 01:16:07.190 --> 01:16:10.580 align:middle line:84% submitted a form online, and I already have the data, 01:16:10.580 --> 01:16:12.600 align:middle line:90% and now I need to go in and clean it up. 01:16:12.600 --> 01:16:14.750 align:middle line:84% And it's not going to be fun to go through manually 01:16:14.750 --> 01:16:17.900 align:middle line:84% in Google Spreadsheets, or Apple Numbers, or Microsoft Excel 01:16:17.900 --> 01:16:21.650 align:middle line:84% and manually fix a lot of people's names to get rid of the commas 01:16:21.650 --> 01:16:25.700 align:middle line:84% and move the first name before the last, as is conventional in the US. 01:16:25.700 --> 01:16:27.080 align:middle line:90% So let's do this. 
01:16:27.080 --> 01:16:29.780 align:middle line:84% It could be a little fragile, but let's start 01:16:29.780 --> 01:16:32.990 align:middle line:84% to express ourselves a little programmatically here and ask this. 01:16:32.990 --> 01:16:37.940 align:middle line:84% If there is a comma in the person's name, which is Pythonic-- 01:16:37.940 --> 01:16:41.960 align:middle line:84% I'm just asking the question, is this shorter string in this longer string?-- 01:16:41.960 --> 01:16:43.650 align:middle line:90% then let me go ahead and do this. 01:16:43.650 --> 01:16:46.340 align:middle line:84% Let me go ahead and grab that name in the variable, 01:16:46.340 --> 01:16:50.840 align:middle line:84% split on not just the comma but the space after, 01:16:50.840 --> 01:16:53.480 align:middle line:84% assuming the human typed in a space after their name. 01:16:53.480 --> 01:16:57.080 align:middle line:84% And let me go ahead and store the result of that splitting of Malan, comma, 01:16:57.080 --> 01:16:58.860 align:middle line:90% David into two variables. 01:16:58.860 --> 01:17:02.000 align:middle line:84% Let's do last, comma, first, again unpacking 01:17:02.000 --> 01:17:04.310 align:middle line:90% the sequence of values that comes back. 01:17:04.310 --> 01:17:07.170 align:middle line:84% Now let me go ahead and reformat the name. 01:17:07.170 --> 01:17:10.160 align:middle line:84% So I'm going to forcibly change the user's name to be as I expect. 01:17:10.160 --> 01:17:13.580 align:middle line:84% So name is actually going to be this format string-- 01:17:13.580 --> 01:17:18.830 align:middle line:84% first name then last name, both in curly braces but formatted together 01:17:18.830 --> 01:17:22.580 align:middle line:84% with a single space, so that I'm overwriting the user's input 01:17:22.580 --> 01:17:25.280 align:middle line:84% and updating my name variable accordingly. 
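The steps just described can be sketched as follows. In the lecture the name comes from `input`; here the logic is wrapped in a function, with the hypothetical name `reformat`, so it can be tried on sample strings without typing.

```python
def reformat(name: str) -> str:
    """Turn "Last, First" into "First Last"; leave other names unchanged."""
    # Strip leading and trailing whitespace, as format.py does after input().
    name = name.strip()
    if "," in name:
        # Assumes exactly one comma followed by one space, which is fragile.
        last, first = name.split(", ")
        name = f"{first} {last}"
    return name


print(f"hello, {reformat('David Malan')}")
print(f"hello, {reformat('Malan, David')}")
```

The unpacking into `last, first` only works when `split` returns exactly two pieces, which is the assumption this version leans on.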
01:17:25.280 --> 01:17:27.770 align:middle line:84% For the moment, to be clear, this program is interactive. 01:17:27.770 --> 01:17:31.250 align:middle line:84% Like, the users, like me, are typing their name into the program. 01:17:31.250 --> 01:17:34.340 align:middle line:84% But imagine the data already is in a CSV file. 01:17:34.340 --> 01:17:37.730 align:middle line:84% It came in from some process like a Google Form or something else online. 01:17:37.730 --> 01:17:40.370 align:middle line:84% You could imagine writing code similar to this, 01:17:40.370 --> 01:17:43.550 align:middle line:84% but that maybe goes and reads that file into memory first. 01:17:43.550 --> 01:17:46.640 align:middle line:84% Maybe it's a CSV via csv.reader or csv.DictReader, 01:17:46.640 --> 01:17:48.860 align:middle line:84% and then iterating over each of those names. 01:17:48.860 --> 01:17:51.630 align:middle line:84% But we'll keep it simple and just do one name at a time. 01:17:51.630 --> 01:17:55.070 align:middle line:84% But now what's kind of interesting here is if I go back to my terminal window 01:17:55.070 --> 01:17:57.940 align:middle line:84% and clear it, and run python of format.py, 01:17:57.940 --> 01:18:01.240 align:middle line:84% and hit Enter, I'm going to type in David, space, Malan as before. 01:18:01.240 --> 01:18:03.130 align:middle line:90% And I think we're still good. 01:18:03.130 --> 01:18:05.290 align:middle line:84% But I'm also going to go ahead and do this-- 01:18:05.290 --> 01:18:10.630 align:middle line:84% python of format.py Malan, comma, David, with a space in between, 01:18:10.630 --> 01:18:13.960 align:middle line:84% crossing my fingers and hit Enter, and voila. 01:18:13.960 --> 01:18:15.640 align:middle line:90% That now has been fixed. 01:18:15.640 --> 01:18:18.400 align:middle line:90% Such a simple thing to be sure. 01:18:18.400 --> 01:18:22.300 align:middle line:84% But it is so commonly necessary to clean up users' input.
01:18:22.300 --> 01:18:25.870 align:middle line:84% Here we see at least one way to do so pretty easily. 01:18:25.870 --> 01:18:28.480 align:middle line:84% Now, to be fair, there's some problems here. 01:18:28.480 --> 01:18:32.500 align:middle line:84% And in fact, can someone imagine a scenario in which this code really 01:18:32.500 --> 01:18:34.570 align:middle line:90% doesn't fix the user's input? 01:18:34.570 --> 01:18:39.760 align:middle line:84% What could still go wrong even with this fix in my code? 01:18:39.760 --> 01:18:40.810 align:middle line:90% Any thoughts? 01:18:40.810 --> 01:18:44.322 align:middle line:84% AUDIENCE: If they typed in their name comma and then [INAUDIBLE]. 01:18:44.322 --> 01:18:46.030 align:middle line:84% DAVID MALAN: Oh, and then something else. 01:18:46.030 --> 01:18:46.530 align:middle line:90% Yeah. 01:18:46.530 --> 01:18:48.730 align:middle line:90% So let me try this, for instance. 01:18:48.730 --> 01:18:50.410 align:middle line:90% Let me go ahead and run a program. 01:18:50.410 --> 01:18:53.350 align:middle line:84% And I am the only David Malan that I know. 01:18:53.350 --> 01:18:57.850 align:middle line:84% But suppose I were, let's say, junior like this. 01:18:57.850 --> 01:19:00.850 align:middle line:84% And it's common, in English at least, to sometimes put a comma there. 01:19:00.850 --> 01:19:02.350 align:middle line:84% You don't necessarily need the comma, but I'm 01:19:02.350 --> 01:19:04.120 align:middle line:90% one of those people who uses a comma. 01:19:04.120 --> 01:19:06.730 align:middle line:90% That's now really, really broken. 01:19:06.730 --> 01:19:08.830 align:middle line:90% So I've broken some assumption there. 01:19:08.830 --> 01:19:10.970 align:middle line:84% And so that could certainly go wrong here. 01:19:10.970 --> 01:19:11.470 align:middle line:90% What else? 01:19:11.470 --> 01:19:13.178 align:middle line:84% Well, let me go ahead and run this again.
01:19:13.178 --> 01:19:15.540 align:middle line:84% And if I did Malan, comma, David, no space, 01:19:15.540 --> 01:19:17.290 align:middle line:84% because I'm being a little sloppy, I'm not 01:19:17.290 --> 01:19:20.500 align:middle line:84% paying attention, which is going to happen when you have lots of users 01:19:20.500 --> 01:19:22.750 align:middle line:90% ultimately, well, this really broke now. 01:19:22.750 --> 01:19:25.870 align:middle line:84% Notice I have a ValueError, an actual exception. 01:19:25.870 --> 01:19:26.410 align:middle line:90% Why? 01:19:26.410 --> 01:19:31.330 align:middle line:84% Well, because split is supposed to be splitting the string into two strings 01:19:31.330 --> 01:19:34.000 align:middle line:90% by looking for the comma and a space. 01:19:34.000 --> 01:19:37.720 align:middle line:84% But if there is no comma and space, it can't split it into two things. 01:19:37.720 --> 01:19:40.900 align:middle line:84% And the fact that I have two variables on the left, 01:19:40.900 --> 01:19:44.290 align:middle line:84% but I'm only getting back one thing on the right, 01:19:44.290 --> 01:19:47.030 align:middle line:84% means that I can't do this code quite as this. 01:19:47.030 --> 01:19:48.467 align:middle line:90% So it's fragile to be sure. 01:19:48.467 --> 01:19:50.800 align:middle line:84% But wouldn't it be nice if we could at least improve it? 01:19:50.800 --> 01:19:53.710 align:middle line:84% For instance, we now know some regular expression syntax. 01:19:53.710 --> 01:19:56.920 align:middle line:84% What if I at least wanted to make this space optional? 01:19:56.920 --> 01:20:00.010 align:middle line:84% Well, I could use my newfound regular expression syntax 01:20:00.010 --> 01:20:04.330 align:middle line:84% and put a question mark. Question mark means zero or one of the things 01:20:04.330 --> 01:20:05.080 align:middle line:90% to the left. 01:20:05.080 --> 01:20:06.490 align:middle line:90% What's the thing to the left?
01:20:06.490 --> 01:20:07.850 align:middle line:90% It's literally a space. 01:20:07.850 --> 01:20:10.760 align:middle line:84% I don't even need parentheses if there's just one thing there. 01:20:10.760 --> 01:20:15.040 align:middle line:84% So that would be the start of a pattern that says, I must have a comma, 01:20:15.040 --> 01:20:19.240 align:middle line:84% and then I may or may not have a space, zero or one spaces thereafter. 01:20:19.240 --> 01:20:25.810 align:middle line:84% Unfortunately, the version of split that's built into the str variable, 01:20:25.810 --> 01:20:28.600 align:middle line:84% as in this case, doesn't support regular expressions. 01:20:28.600 --> 01:20:32.120 align:middle line:84% If we want our regular expressions, we need to go use that library here. 01:20:32.120 --> 01:20:33.650 align:middle line:90% So let me go ahead and do this. 01:20:33.650 --> 01:20:37.550 align:middle line:84% Let me go in and leave this code as is but go up to the top 01:20:37.550 --> 01:20:41.650 align:middle line:84% now and import re to import the library for regular expressions. 01:20:41.650 --> 01:20:46.000 align:middle line:84% And now let me go ahead and start changing my approach here. 01:20:46.000 --> 01:20:47.630 align:middle line:90% I'm going to go ahead and do this. 01:20:47.630 --> 01:20:50.890 align:middle line:84% I'm going to use the same function called re.search, 01:20:50.890 --> 01:20:54.370 align:middle line:84% and I'm going to search for a pattern that I 01:20:54.370 --> 01:20:56.650 align:middle line:90% think will be last, comma, first. 01:20:56.650 --> 01:20:59.050 align:middle line:84% So let me use my newfound regular expression syntax 01:20:59.050 --> 01:21:04.390 align:middle line:84% and represent a pattern for something like Malan, comma, space, David. 01:21:04.390 --> 01:21:05.660 align:middle line:90% How can I do this? 
01:21:05.660 --> 01:21:10.570 align:middle line:84% Well, inside of my quotes for re.search, I'm going to have something-- 01:21:10.570 --> 01:21:11.950 align:middle line:90% so dot +-- 01:21:11.950 --> 01:21:12.610 align:middle line:90% sorry. 01:21:12.610 --> 01:21:14.980 align:middle line:90% I'm going to have something, so dot +. 01:21:14.980 --> 01:21:16.540 align:middle line:90% Then I'm going to have a comma. 01:21:16.540 --> 01:21:17.890 align:middle line:90% Then I'm going to have a space. 01:21:17.890 --> 01:21:20.440 align:middle line:90% Then I'm going to have something dot +. 01:21:20.440 --> 01:21:23.200 align:middle line:84% Now I'm going to preemptively refine this a little bit. 01:21:23.200 --> 01:21:25.288 align:middle line:84% I want this whole pattern to start matching 01:21:25.288 --> 01:21:26.830 align:middle line:90% at the beginning of the user's input. 01:21:26.830 --> 01:21:28.960 align:middle line:90% So I'm going to add the ^ right away. 01:21:28.960 --> 01:21:33.070 align:middle line:84% And I want the end of the user's input to be matched as well, so that I'm 01:21:33.070 --> 01:21:37.720 align:middle line:84% literally expecting any character one or more times, then a comma then a space, 01:21:37.720 --> 01:21:40.180 align:middle line:84% then any other character one or more times. 01:21:40.180 --> 01:21:42.280 align:middle line:90% And then that is it. 01:21:42.280 --> 01:21:46.430 align:middle line:84% And I'm going to pass in the name variable as before. 01:21:46.430 --> 01:21:50.300 align:middle line:84% Now, when we've used re.search in the past, 01:21:50.300 --> 01:21:52.900 align:middle line:84% we really used it just to answer a question. 01:21:52.900 --> 01:21:57.040 align:middle line:84% Does the user's input match the following pattern or not, 01:21:57.040 --> 01:21:59.140 align:middle line:90% true or false, effectively. 01:21:59.140 --> 01:22:02.600 align:middle line:84% But re.search is actually more powerful than that. 
01:22:02.600 --> 01:22:05.110 align:middle line:84% You can actually get back more information. 01:22:05.110 --> 01:22:06.430 align:middle line:90% And you can do this. 01:22:06.430 --> 01:22:10.000 align:middle line:84% You can specify a variable and then an assignment operator, 01:22:10.000 --> 01:22:15.250 align:middle line:84% and get back more precise answers to what has been found when searched for. 01:22:15.250 --> 01:22:17.500 align:middle line:90% But what is it you want to get back? 01:22:17.500 --> 01:22:21.260 align:middle line:84% Well, it turns out there's this other feature of regular expressions 01:22:21.260 --> 01:22:25.330 align:middle line:84% which allows you to use parentheses, not just to group things together, 01:22:25.330 --> 01:22:27.070 align:middle line:90% but to capture them. 01:22:27.070 --> 01:22:31.750 align:middle line:84% It turns out when you specify parentheses in a regular expression 01:22:31.750 --> 01:22:35.140 align:middle line:84% unbeknownst to us up until now, everything in the parentheses 01:22:35.140 --> 01:22:41.350 align:middle line:84% will be returned to you as a return value from the re.search function. 01:22:41.350 --> 01:22:45.700 align:middle line:84% It's going to allow you to extract specific amounts of information 01:22:45.700 --> 01:22:47.530 align:middle line:90% from the user's own input. 01:22:47.530 --> 01:22:51.730 align:middle line:84% You can reverse this process, too, by using the non-capturing version 01:22:51.730 --> 01:22:52.340 align:middle line:90% as well. 01:22:52.340 --> 01:22:55.507 align:middle line:84% You can use parentheses, and then literally a question mark, and a colon, 01:22:55.507 --> 01:22:56.590 align:middle line:90% and then some other stuff. 01:22:56.590 --> 01:22:58.400 align:middle line:84% And that will say, don't bother capturing this. 01:22:58.400 --> 01:22:59.567 align:middle line:90% I just want to group things.
01:22:59.567 --> 01:23:02.850 align:middle line:84% But for now, we're going to use just the parentheses themselves. 01:23:02.850 --> 01:23:04.200 align:middle line:90% So how am I going to do this? 01:23:04.200 --> 01:23:08.780 align:middle line:84% Well, if I want to get back the user's last name and first name, 01:23:08.780 --> 01:23:16.190 align:middle line:84% I think what I want to capture is the dot + here and the dot + here. 01:23:16.190 --> 01:23:19.190 align:middle line:84% So I've deliberately surrounded in parentheses 01:23:19.190 --> 01:23:22.160 align:middle line:84% the dot + both to the left and the right of the comma, 01:23:22.160 --> 01:23:24.660 align:middle line:84% not because I'm grouping them together per se-- 01:23:24.660 --> 01:23:28.190 align:middle line:84% I'm not adding a question mark, I'm not adding up another + or a *-- 01:23:28.190 --> 01:23:32.420 align:middle line:84% I'm using parentheses now for capturing purposes. 01:23:32.420 --> 01:23:33.200 align:middle line:90% Why? 01:23:33.200 --> 01:23:34.820 align:middle line:90% Well, I'm going to do this next. 01:23:34.820 --> 01:23:38.690 align:middle line:84% I'm going to still ask a Boolean question like, "if there are matches, 01:23:38.690 --> 01:23:40.320 align:middle line:90% then do this." 01:23:40.320 --> 01:23:44.360 align:middle line:84% So if matches is not effectively false, like none, 01:23:44.360 --> 01:23:47.720 align:middle line:84% I do expect I've gotten back some matches. 01:23:47.720 --> 01:23:49.400 align:middle line:90% And watch what I can do now. 01:23:49.400 --> 01:23:54.170 align:middle line:84% I can do last, comma, first equals whatever matches in 01:23:54.170 --> 01:23:56.930 align:middle line:84% and get back all of the groups of matches. 
01:23:56.930 --> 01:24:00.020 align:middle line:84% Then go ahead and update name just like before with a format string 01:24:00.020 --> 01:24:03.770 align:middle line:84% and do first and then last in curly braces 01:24:03.770 --> 01:24:06.770 align:middle line:84% as well, and then at the very bottom, just like before, print out, 01:24:06.770 --> 01:24:09.830 align:middle line:90% for instance, "hello," comma, "name." 01:24:09.830 --> 01:24:13.970 align:middle line:84% So the new code now is everything highlighted here. 01:24:13.970 --> 01:24:19.700 align:middle line:84% I'm using re.search to search for whether the user typed their name 01:24:19.700 --> 01:24:21.620 align:middle line:90% in last, comma, first format. 01:24:21.620 --> 01:24:27.440 align:middle line:84% But I am more powerfully using re.search to capture some of the user's input. 01:24:27.440 --> 01:24:28.850 align:middle line:90% What's going to get captured? 01:24:28.850 --> 01:24:31.400 align:middle line:84% Anything I surrounded in parentheses will 01:24:31.400 --> 01:24:34.250 align:middle line:90% be returned to me as return values. 01:24:34.250 --> 01:24:36.650 align:middle line:90% How do you get at those return values? 01:24:36.650 --> 01:24:40.490 align:middle line:84% You ask the variable to which you assign them for all of the groups, 01:24:40.490 --> 01:24:44.250 align:middle line:84% all of the groups of parentheses that were captured. 01:24:44.250 --> 01:24:46.020 align:middle line:90% So let me go ahead and do this. 01:24:46.020 --> 01:24:49.970 align:middle line:84% Let me go ahead now and run python of format.py, Enter. 01:24:49.970 --> 01:24:51.950 align:middle line:90% And I'm going to type my name as usual. 01:24:51.950 --> 01:24:56.900 align:middle line:84% In this case, nothing happens with this if condition. 01:24:56.900 --> 01:24:57.500 align:middle line:90% Why? 
01:24:57.500 --> 01:25:03.270 align:middle line:84% Because I did not type a comma, and so this search does not find a comma, 01:25:03.270 --> 01:25:04.632 align:middle line:90% so there are no matches. 01:25:04.632 --> 01:25:06.590 align:middle line:84% So we immediately just print out "hello, name." 01:25:06.590 --> 01:25:08.370 align:middle line:90% Nothing interesting or new there. 01:25:08.370 --> 01:25:12.920 align:middle line:84% But if I now go ahead, and clear my screen, and run python of format.py, 01:25:12.920 --> 01:25:18.740 align:middle line:84% and do Malan, comma, space, David, Enter, we've reformatted my name. 01:25:18.740 --> 01:25:19.940 align:middle line:90% Well, how did this work? 01:25:19.940 --> 01:25:22.100 align:middle line:90% Let me be a little more explicit now. 01:25:22.100 --> 01:25:24.560 align:middle line:84% It turns out I don't have to just say matches.groups. 01:25:24.560 --> 01:25:28.020 align:middle line:84% I can get specific groups back that I want. 01:25:28.020 --> 01:25:30.290 align:middle line:84% So let me change my code a little bit more. 01:25:30.290 --> 01:25:33.470 align:middle line:90% Let me go ahead now and just say this. 01:25:33.470 --> 01:25:36.620 align:middle line:90% Let's update name to-- 01:25:36.620 --> 01:25:37.980 align:middle line:90% actually, let's do this. 01:25:37.980 --> 01:25:42.530 align:middle line:84% Let's say that the last name is going to be in the matches 01:25:42.530 --> 01:25:44.330 align:middle line:90% but specifically group 1. 01:25:44.330 --> 01:25:48.020 align:middle line:84% The first name is going to be in the matches but specifically group 2. 01:25:48.020 --> 01:25:49.100 align:middle line:90% Why 1 and 2? 01:25:49.100 --> 01:25:52.490 align:middle line:84% Because this is the first set of parentheses to the left of the comma. 01:25:52.490 --> 01:25:55.520 align:middle line:84% This is the second set of parentheses to the right of the comma. 
01:25:55.520 --> 01:25:58.700 align:middle line:84% And based on the input, this would be the user's last name 01:25:58.700 --> 01:26:00.140 align:middle line:90% in this scenario, Malan. 01:26:00.140 --> 01:26:03.560 align:middle line:84% This would be the user's first name, David, in this scenario. 01:26:03.560 --> 01:26:07.340 align:middle line:84% That's why I'm using group 1 for the last name 01:26:07.340 --> 01:26:09.720 align:middle line:90% and group 2 for the first name. 01:26:09.720 --> 01:26:16.100 align:middle line:84% And now I'm going to go ahead and say name equals fstring, again, first 01:26:16.100 --> 01:26:18.980 align:middle line:90% and then last, done. 01:26:18.980 --> 01:26:23.340 align:middle line:84% And let me refine this one last step before we take questions. 01:26:23.340 --> 01:26:26.090 align:middle line:84% I don't really need these variables if I'm immediately using them. 01:26:26.090 --> 01:26:28.423 align:middle line:84% Let's just go ahead and tighten this up further as we've 01:26:28.423 --> 01:26:29.990 align:middle line:90% done in the past for design's sake. 01:26:29.990 --> 01:26:32.722 align:middle line:84% If I want to make the name the concatenation 01:26:32.722 --> 01:26:34.430 align:middle line:84% of the person's first name and last name, 01:26:34.430 --> 01:26:37.970 align:middle line:84% let's just do this. matches.group 2 first, 01:26:37.970 --> 01:26:43.400 align:middle line:90% plus a space, plus matches.group 1. 01:26:43.400 --> 01:26:46.910 align:middle line:84% So it's just up to me from left to right, this is group 1, 01:26:46.910 --> 01:26:47.630 align:middle line:90% this is group 2. 01:26:47.630 --> 01:26:51.000 align:middle line:90% So group 1 is last, group 2 is first. 
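The capture-group version of format.py described here can be sketched roughly as follows. This is a minimal sketch, not the exact code from the lecture: the function name reformat is an assumption made so the logic can be called without prompting for input, while the pattern and the use of group 1 and group 2 follow what is described above.

```python
import re


def reformat(name):
    """Turn 'Last, First' into 'First Last'; leave any other input unchanged."""
    # The two sets of parentheses capture whatever appears on either
    # side of the comma and space.
    matches = re.search(r"^(.+), (.+)$", name)
    if matches:
        # group(1) is the first set of parentheses (the last name),
        # group(2) is the second set (the first name).
        return matches.group(2) + " " + matches.group(1)
    return name


print(reformat("Malan, David"))  # David Malan
print(reformat("David Malan"))   # David Malan (no comma, so no match)
```

Input without the comma and space, such as "Malan,David", does not match this pattern and is returned as typed.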
01:26:51.000 --> 01:26:54.860 align:middle line:84% So if I want to flip them around and update the value of name, 01:26:54.860 --> 01:27:00.290 align:middle line:84% I can explicitly get group 2 first, concatenate using +, a single space, 01:27:00.290 --> 01:27:03.540 align:middle line:90% and then concatenate on group 1. 01:27:03.540 --> 01:27:04.170 align:middle line:90% All right. 01:27:04.170 --> 01:27:05.280 align:middle line:90% That was a lot. 01:27:05.280 --> 01:27:07.620 align:middle line:84% Let me pause to see if there are questions. 01:27:07.620 --> 01:27:11.670 align:middle line:84% The key difference here is we're still using re.search the exact same way, 01:27:11.670 --> 01:27:15.090 align:middle line:84% but now I'm using its return value, not just to answer 01:27:15.090 --> 01:27:17.400 align:middle line:84% a question true or false, but to actually 01:27:17.400 --> 01:27:21.750 align:middle line:84% get back specific matches anything I captured, so to speak, 01:27:21.750 --> 01:27:23.190 align:middle line:90% with parentheses. 01:27:23.190 --> 01:27:26.270 align:middle line:84% AUDIENCE: Why is it here we're using 1 and 2 instead of 0 and 1 01:27:26.270 --> 01:27:27.270 align:middle line:90% for capturing the first? 01:27:27.270 --> 01:27:29.010 align:middle line:90% DAVID MALAN: Really good question. 01:27:29.010 --> 01:27:30.060 align:middle line:90% A good observation. 01:27:30.060 --> 01:27:32.070 align:middle line:84% In almost every other context, we've started 01:27:32.070 --> 01:27:35.250 align:middle line:90% counting at 0 and 1 instead of 1 and 2. 01:27:35.250 --> 01:27:38.190 align:middle line:84% It turns out there's something else in location 0 01:27:38.190 --> 01:27:41.530 align:middle line:84% when it comes back from re.search related to the string itself. 
01:27:41.530 --> 01:27:45.000 align:middle line:84% So according to the documentation of this function only, 01:27:45.000 --> 01:27:49.110 align:middle line:84% 1 is the first set of parentheses, and 2 is the second set, 01:27:49.110 --> 01:27:50.460 align:middle line:90% and onward from there. 01:27:50.460 --> 01:27:52.540 align:middle line:90% Just a different convention here. 01:27:52.540 --> 01:27:53.580 align:middle line:90% Other questions? 01:27:53.580 --> 01:27:59.820 align:middle line:84% AUDIENCE: What if we write nothing, like whitespace, comma, whitespace? 01:27:59.820 --> 01:28:03.317 align:middle line:90% How do we check truth of condition? 01:28:03.317 --> 01:28:05.400 align:middle line:84% DAVID MALAN: Before I answer directly, let me just 01:28:05.400 --> 01:28:07.733 align:middle line:84% run this and make sure I've not broken anything further. 01:28:07.733 --> 01:28:09.360 align:middle line:90% Let me run python of format.py. 01:28:09.360 --> 01:28:12.060 align:middle line:84% Let me type in David, space, Malan, the right way. 01:28:12.060 --> 01:28:13.200 align:middle line:90% Let me run it once more. 01:28:13.200 --> 01:28:16.650 align:middle line:84% Let me type in Malan, comma, David, the wrong way that we're fixing. 01:28:16.650 --> 01:28:17.850 align:middle line:90% And we're still good. 01:28:17.850 --> 01:28:19.410 align:middle line:90% But I think it will still break. 01:28:19.410 --> 01:28:23.610 align:middle line:84% Let me run it a third time with Malan, comma, David with no space. 01:28:23.610 --> 01:28:26.190 align:middle line:90% And now it's still broken. 01:28:26.190 --> 01:28:26.790 align:middle line:90% Why? 01:28:26.790 --> 01:28:30.930 align:middle line:84% Because I'm still looking for comma space. 01:28:30.930 --> 01:28:32.220 align:middle line:90% Now, how can I fix that? 
01:28:32.220 --> 01:28:35.070 align:middle line:84% One way I could do that is to add a question mark here, which again, 01:28:35.070 --> 01:28:37.510 align:middle line:90% is zero or one of the thing before. 01:28:37.510 --> 01:28:40.950 align:middle line:84% So if I have a space and then a question mark literally, no need for any 01:28:40.950 --> 01:28:46.290 align:middle line:84% parentheses, then I can literally tolerate both Malan, comma, space, 01:28:46.290 --> 01:28:48.610 align:middle line:90% David or Malan, comma, David. 01:28:48.610 --> 01:28:49.680 align:middle line:90% So let's try again. 01:28:49.680 --> 01:28:51.120 align:middle line:90% Before, this did not work. 01:28:51.120 --> 01:28:53.310 align:middle line:84% Let's do Malan, comma, David with no space. 01:28:53.310 --> 01:28:55.990 align:middle line:90% Now it does actually work. 01:28:55.990 --> 01:28:58.740 align:middle line:84% So we can tolerate different amounts of whitespace 01:28:58.740 --> 01:29:01.890 align:middle line:84% if I am a little more precise with my formula. 01:29:01.890 --> 01:29:03.420 align:middle line:90% Let me go ahead and try once more. 01:29:03.420 --> 01:29:07.260 align:middle line:84% Let me very weirdly but possibly hit the space bar a few too many times 01:29:07.260 --> 01:29:08.850 align:middle line:90% so now they're really separated. 01:29:08.850 --> 01:29:13.020 align:middle line:84% This, again, is not going to work quite right, because it's going 01:29:13.020 --> 01:29:15.160 align:middle line:90% to consume all of that whitespace. 01:29:15.160 --> 01:29:18.420 align:middle line:84% So now I might want to strip, left and right, any 01:29:18.420 --> 01:29:21.720 align:middle line:84% of the leading whitespace on the result. Or what I could do here 01:29:21.720 --> 01:29:22.930 align:middle line:90% is say this. 01:29:22.930 --> 01:29:29.670 align:middle line:84% Instead of zero or one, I could use a * here, so space *.
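The difference between a space followed by a question mark (zero or one space) and a space followed by a * (zero or more spaces) can be seen side by side. This is a sketch under assumptions: the helper first_last and the two pattern variable names are hypothetical and exist only for the comparison.

```python
import re


def first_last(name, pattern):
    """Apply a 'Last, First' pattern and return 'First Last' if it matches."""
    matches = re.search(pattern, name)
    if matches:
        return matches.group(2) + " " + matches.group(1)
    return name


# Space followed by ?: zero or one space after the comma.
optional_space = r"^(.+), ?(.+)$"
# Space followed by *: zero or more spaces after the comma.
any_spaces = r"^(.+), *(.+)$"

# With no space, both patterns match.
print(first_last("Malan,David", optional_space))
print(first_last("Malan,David", any_spaces))

# With three spaces, only one is consumed by " ?", so the extra
# spaces end up captured as part of the first name.
print(repr(first_last("Malan,   David", optional_space)))
print(repr(first_last("Malan,   David", any_spaces)))
```

Printing with repr makes the leftover leading spaces visible in the " ?" case.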
01:29:29.670 --> 01:29:33.000 align:middle line:84% And now if I run this once more with Malan, comma, space, space, space, 01:29:33.000 --> 01:29:35.920 align:middle line:84% David, Enter, now we've cleaned up things further. 01:29:35.920 --> 01:29:39.510 align:middle line:84% So you can imagine, depending on how messy the data is that you're 01:29:39.510 --> 01:29:41.550 align:middle line:84% cleaning up, your regular expressions might need 01:29:41.550 --> 01:29:43.500 align:middle line:90% to get more and more sophisticated. 01:29:43.500 --> 01:29:46.830 align:middle line:84% It really depends on just how many problems we want to solve at once. 01:29:46.830 --> 01:29:51.900 align:middle line:84% Well, allow me to propose that we forge ahead further just to clean this up 01:29:51.900 --> 01:29:53.940 align:middle line:84% even more so, using a feature that's actually 01:29:53.940 --> 01:29:56.430 align:middle line:90% relatively new to Python itself. 01:29:56.430 --> 01:29:59.220 align:middle line:84% It is very common when using regular expressions 01:29:59.220 --> 01:30:03.210 align:middle line:84% to do exactly what I've done here-- to call a function like re.search 01:30:03.210 --> 01:30:07.300 align:middle line:84% with capturing parentheses inside, such that you get back a return 01:30:07.300 --> 01:30:10.050 align:middle line:84% value that I'm calling matches-- you could call it something else, 01:30:10.050 --> 01:30:12.090 align:middle line:90% but I'm calling it by default matches. 01:30:12.090 --> 01:30:15.690 align:middle line:84% And then notice on the next line, I'm saying "if matches." 01:30:15.690 --> 01:30:19.080 align:middle line:84% Wouldn't it be nice if I could just tighten things up further and do these 01:30:19.080 --> 01:30:20.700 align:middle line:90% all on the same line? 01:30:20.700 --> 01:30:23.070 align:middle line:90% Well, you can sort of. 01:30:23.070 --> 01:30:24.850 align:middle line:90% Let me go ahead and do this. 
01:30:24.850 --> 01:30:26.340 align:middle line:90% Let me get rid of this if. 01:30:26.340 --> 01:30:28.500 align:middle line:84% And let me just try to say something like this. 01:30:28.500 --> 01:30:32.370 align:middle line:84% If matches equals re.search and then colon-- 01:30:32.370 --> 01:30:39.090 align:middle line:84% so combining my if condition into just one line instead of those two. 01:30:39.090 --> 01:30:43.455 align:middle line:84% In C, or C++, or Java, you would actually do something like this, 01:30:43.455 --> 01:30:45.330 align:middle line:84% surrounding the whole thing with parentheses, 01:30:45.330 --> 01:30:47.550 align:middle line:84% sometimes double sets to suppress any warnings, 01:30:47.550 --> 01:30:49.980 align:middle line:90% if you want to do two things at once. 01:30:49.980 --> 01:30:55.530 align:middle line:84% If you want to not only assign the return value of re.search 01:30:55.530 --> 01:30:58.080 align:middle line:84% to a variable called matches, but you want 01:30:58.080 --> 01:31:03.408 align:middle line:84% to subsequently ask a Boolean question, is this effectively true or false. 01:31:03.408 --> 01:31:04.950 align:middle line:90% That's what I was doing a moment ago. 01:31:04.950 --> 01:31:06.060 align:middle line:90% Let me undo this. 01:31:06.060 --> 01:31:08.430 align:middle line:84% A moment ago, I was getting back the return value 01:31:08.430 --> 01:31:12.090 align:middle line:84% and assigning it to matches, and then I was asking the question. 01:31:12.090 --> 01:31:16.530 align:middle line:84% Well, it turns out this need to have two lines of code presumably rubbed 01:31:16.530 --> 01:31:18.840 align:middle line:90% people wrong for too long in Python. 01:31:18.840 --> 01:31:22.170 align:middle line:84% And so you can now combine these two kinds of lines into one. 01:31:22.170 --> 01:31:24.450 align:middle line:90% But you need a new operator. 
01:31:24.450 --> 01:31:27.720 align:middle line:84% You cannot just say, "if matches equals re.search" 01:31:27.720 --> 01:31:29.580 align:middle line:90% and then a colon at the end. 01:31:29.580 --> 01:31:32.170 align:middle line:90% You instead need to do this. 01:31:32.170 --> 01:31:38.130 align:middle line:84% You need to do colon equals if and only if you want to assign something 01:31:38.130 --> 01:31:42.390 align:middle line:84% from right to left and you want to ask an if or an elif 01:31:42.390 --> 01:31:44.820 align:middle line:90% question on the same line. 01:31:44.820 --> 01:31:48.870 align:middle line:84% This is affectionately known, as you can see here, as the walrus operator. 01:31:48.870 --> 01:31:51.480 align:middle line:90% And it's new to Python in recent years. 01:31:51.480 --> 01:31:56.280 align:middle line:84% And it both allows you to assign a value as I'm doing from right to left, 01:31:56.280 --> 01:32:00.180 align:middle line:84% and ask a Boolean question about it, like I'm 01:32:00.180 --> 01:32:02.960 align:middle line:90% doing with the if or equivalently elif. 01:32:02.960 --> 01:32:06.650 align:middle line:84% Does anyone know why this is called the walrus operator? 01:32:06.650 --> 01:32:09.920 align:middle line:84% If you kind of look at it like this, perhaps, 01:32:09.920 --> 01:32:14.040 align:middle line:84% if you're familiar with walruses, it kind of sort of looks like a walrus. 01:32:14.040 --> 01:32:17.720 align:middle line:84% So a minor detail but a relatively new feature of Python that honestly, you'll 01:32:17.720 --> 01:32:21.170 align:middle line:84% probably continue to see online, and in source code, and in textbooks, 01:32:21.170 --> 01:32:24.300 align:middle line:84% and so forth, increasingly so now that it does exist. 01:32:24.300 --> 01:32:25.910 align:middle line:90% It does not change the logic at all.
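The walrus operator version might look something like the sketch below. The function name reformat is an assumption made for illustration; the := operator itself requires Python 3.8 or later.

```python
import re


def reformat(name):
    """Turn 'Last, First' into 'First Last' using the walrus operator."""
    # := assigns the result of re.search to matches and, in the same
    # expression, lets the if test whether that result is truthy.
    if matches := re.search(r"^(.+), *(.+)$", name):
        return matches.group(2) + " " + matches.group(1)
    return name


print(reformat("Malan, David"))  # David Malan
```

Writing a single = in that position is a syntax error in Python, which is why the separate operator is needed.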
01:32:25.910 --> 01:32:29.660 align:middle line:84% If I run python of format.py and type Malan, comma, space, David, 01:32:29.660 --> 01:32:33.750 align:middle line:84% it still fixes things, but it's tightened up my code just a bit more. 01:32:33.750 --> 01:32:34.250 align:middle line:90% All right. 01:32:34.250 --> 01:32:37.010 align:middle line:84% Let's go ahead and look at one final problem 01:32:37.010 --> 01:32:40.470 align:middle line:84% to solve, that of extracting information now as well. 01:32:40.470 --> 01:32:43.460 align:middle line:84% So at this point, we've now validated the user's input 01:32:43.460 --> 01:32:46.160 align:middle line:84% by checking whether or not it meets a certain pattern. 01:32:46.160 --> 01:32:49.100 align:middle line:84% We've cleaned up the user's input by checking 01:32:49.100 --> 01:32:51.470 align:middle line:84% against a pattern, whether it matches or not, and if it 01:32:51.470 --> 01:32:54.350 align:middle line:84% does match, we kind of reorganize some of the user's information 01:32:54.350 --> 01:32:57.800 align:middle line:84% so we can clean up their input and standardize the format in which we're 01:32:57.800 --> 01:32:59.540 align:middle line:90% storing or printing it, in this case. 01:32:59.540 --> 01:33:03.350 align:middle line:84% Let's do one final example where we're very specifically extracting 01:33:03.350 --> 01:33:06.440 align:middle line:84% information in order to answer some question. 01:33:06.440 --> 01:33:07.830 align:middle line:90% So let me propose this. 01:33:07.830 --> 01:33:12.650 align:middle line:84% Let me go ahead and close format.py and create a new file called twitter.py, 01:33:12.650 --> 01:33:17.690 align:middle line:84% the goal of which is to prompt users for the URL of their Twitter profile 01:33:17.690 --> 01:33:23.562 align:middle line:84% and extract from it, infer from that URL, what is the user's username. 
01:33:23.562 --> 01:33:25.020 align:middle line:90% Now, why might you want to do this? 01:33:25.020 --> 01:33:28.228 align:middle line:84% Well, one, you might want users to be able to just very easily copy and paste 01:33:28.228 --> 01:33:32.330 align:middle line:84% the URL from their own Twitter profile into your form, into your app, 01:33:32.330 --> 01:33:36.140 align:middle line:84% so that you can figure out what their username is. 01:33:36.140 --> 01:33:40.430 align:middle line:84% Or you might have a form that asks the user for their Twitter username, 01:33:40.430 --> 01:33:43.400 align:middle line:84% and because people aren't necessarily paying very close attention, 01:33:43.400 --> 01:33:45.530 align:middle line:90% some people type their username. 01:33:45.530 --> 01:33:49.340 align:middle line:84% Some people type their whole URL or something else altogether. 01:33:49.340 --> 01:33:51.350 align:middle line:84% It would be nice now that you're a programmer 01:33:51.350 --> 01:33:53.780 align:middle line:84% to just be more tolerant of different types of input 01:33:53.780 --> 01:33:58.100 align:middle line:84% and just take on the burden of canonicalizing, standardizing the data, 01:33:58.100 --> 01:34:00.140 align:middle line:90% but being flexible with the users. 01:34:00.140 --> 01:34:03.500 align:middle line:84% It's arguably a better user experience if you just let me copy-paste 01:34:03.500 --> 01:34:05.660 align:middle line:90% or type in what I want, you clean it up. 01:34:05.660 --> 01:34:07.550 align:middle line:90% You're the programmer not me. 01:34:07.550 --> 01:34:09.920 align:middle line:90% Lends for a better experience, perhaps. 01:34:09.920 --> 01:34:12.620 align:middle line:84% Well, let me go ahead and do this with twitter.py. 
01:34:12.620 --> 01:34:17.120 align:middle line:84% Let me first go ahead and prompt the user here for a value for a variable 01:34:17.120 --> 01:34:21.702 align:middle line:84% that I'll call url, and just ask them to input the URL of their Twitter profile. 01:34:21.702 --> 01:34:23.660 align:middle line:84% I'm going to go ahead and strip off any leading 01:34:23.660 --> 01:34:26.810 align:middle line:84% or trailing whitespace, just in case users accidentally hit the spacebar. 01:34:26.810 --> 01:34:29.940 align:middle line:84% That's literally the least I can do quite easily. 01:34:29.940 --> 01:34:32.100 align:middle line:90% But now let's go ahead and do this. 01:34:32.100 --> 01:34:37.185 align:middle line:84% Suppose that the user's address is the following. 01:34:37.185 --> 01:34:38.810 align:middle line:90% Let me print out what did they type in. 01:34:38.810 --> 01:34:41.190 align:middle line:84% And let me clear my screen and run python of twitter.py. 01:34:41.190 --> 01:34:43.190 align:middle line:84% I'm going to go ahead and type in, for instance, 01:34:43.190 --> 01:34:50.240 align:middle line:84% https://twitter.com/davidjmalan, which happens to be my own Twitter username. 01:34:50.240 --> 01:34:53.090 align:middle line:84% For now, we're just going to print it back onto the screen just 01:34:53.090 --> 01:34:54.640 align:middle line:90% to make sure I've not messed up yet. 01:34:54.640 --> 01:34:55.140 align:middle line:90% OK. 01:34:55.140 --> 01:34:57.260 align:middle line:84% So I've printed back out the exact same URL. 01:34:57.260 --> 01:35:01.310 align:middle line:84% But the goal at hand is to extract the username only. 01:35:01.310 --> 01:35:05.060 align:middle line:84% Now, let me just ask, perhaps, a straightforward question. 01:35:05.060 --> 01:35:09.830 align:middle line:84% Logically, what do I need to do to get at the user's username? 
01:35:09.830 --> 01:35:13.880 align:middle line:84% AUDIENCE: Well, we just ignore what's before the username 01:35:13.880 --> 01:35:16.065 align:middle line:90% and then just extract the username? 01:35:16.065 --> 01:35:16.940 align:middle line:90% DAVID MALAN: Perfect. 01:35:16.940 --> 01:35:18.380 align:middle line:90% Yeah, I mean, it is as simple as that. 01:35:18.380 --> 01:35:20.720 align:middle line:84% If you know the username is at the end, well, let's just 01:35:20.720 --> 01:35:22.920 align:middle line:84% somehow ignore everything at the beginning. 01:35:22.920 --> 01:35:24.170 align:middle line:90% Well, what's at the beginning? 01:35:24.170 --> 01:35:25.130 align:middle line:90% Well, it's a URL. 01:35:25.130 --> 01:35:30.890 align:middle line:84% So we're probably going to need to ignore an HTTPS, a ://, a twitter.com, 01:35:30.890 --> 01:35:31.910 align:middle line:90% and a /. 01:35:31.910 --> 01:35:33.840 align:middle line:84% So we just want to throw all of that away. 01:35:33.840 --> 01:35:34.340 align:middle line:90% Why? 01:35:34.340 --> 01:35:37.400 align:middle line:84% Because if it's a URL, we know by how Twitter works 01:35:37.400 --> 01:35:39.240 align:middle line:90% that the username comes at the end. 01:35:39.240 --> 01:35:43.418 align:middle line:84% So let's use that very simple idea to get at the information we want. 01:35:43.418 --> 01:35:45.210 align:middle line:84% I'm going to try this a few different ways. 01:35:45.210 --> 01:35:46.620 align:middle line:90% Let me go back into my program here. 01:35:46.620 --> 01:35:49.820 align:middle line:84% And instead of just printing it out, which was just to see what's going on, 01:35:49.820 --> 01:35:50.880 align:middle line:90% let me do this. 01:35:50.880 --> 01:35:53.180 align:middle line:84% Let me create a new variable called username. 01:35:53.180 --> 01:35:56.810 align:middle line:90% And let me call url.replace.
01:35:56.810 --> 01:36:01.340 align:middle line:84% It turns out that if URL is a string or a str in Python, 01:36:01.340 --> 01:36:05.840 align:middle line:84% it, again, comes with multiple methods, like strip, and split, 01:36:05.840 --> 01:36:08.750 align:middle line:84% and others as well, one of which is called replace. 01:36:08.750 --> 01:36:10.400 align:middle line:90% And replace will do just that. 01:36:10.400 --> 01:36:14.360 align:middle line:84% You pass it two arguments, the first of which is, what do you want to replace? 01:36:14.360 --> 01:36:17.640 align:middle line:84% The second argument is, what do you want to replace it with? 01:36:17.640 --> 01:36:19.940 align:middle line:84% So if I want to get rid of, as I've proposed, 01:36:19.940 --> 01:36:21.740 align:middle line:84% really just everything before the username, 01:36:21.740 --> 01:36:26.090 align:middle line:84% that is, the Twitter URL or the beginning thereof, let's just say this. 01:36:26.090 --> 01:36:31.520 align:middle line:84% Go ahead and replace "https://twitter.com/", 01:36:31.520 --> 01:36:34.340 align:middle line:84% close quote, that's what I want to replace. 01:36:34.340 --> 01:36:37.160 align:middle line:84% And comma, second argument, what do you want to replace it with? 01:36:37.160 --> 01:36:37.880 align:middle line:90% Nothing. 01:36:37.880 --> 01:36:40.100 align:middle line:84% So I'm literally going to pass in quote unquote 01:36:40.100 --> 01:36:42.190 align:middle line:90% to effectively do a find and replace. 01:36:42.190 --> 01:36:44.690 align:middle line:84% That's what the replace method does, just like you can do it 01:36:44.690 --> 01:36:46.100 align:middle line:90% in Microsoft Word or Google Docs. 01:36:46.100 --> 01:36:49.280 align:middle line:84% This is the programmer's way of doing find and replace. 01:36:49.280 --> 01:36:52.940 align:middle line:84% Now let me go ahead and print out just the username. 
01:36:52.940 --> 01:36:54.780 align:middle line:90% So I'll use an f-string like this. 01:36:54.780 --> 01:36:57.590 align:middle line:84% I'll say username, colon, and then in curly braces, 01:36:57.590 --> 01:36:59.700 align:middle line:90% username, just to format it nicely. 01:36:59.700 --> 01:37:00.200 align:middle line:90% All right. 01:37:00.200 --> 01:37:04.410 align:middle line:84% Let me go ahead and clear my screen and run python of twitter.py, Enter, URL. 01:37:04.410 --> 01:37:12.580 align:middle line:84% Here we go. https://twitter.com/davidjmalan, Enter. 01:37:12.580 --> 01:37:13.300 align:middle line:90% OK. 01:37:13.300 --> 01:37:15.040 align:middle line:90% Now we've made some progress. 01:37:15.040 --> 01:37:17.360 align:middle line:90% Done for the day, right? 01:37:17.360 --> 01:37:19.580 align:middle line:90% Well, what is suboptimal about this? 01:37:19.580 --> 01:37:24.150 align:middle line:84% Can anyone critique or find fault with my program? 01:37:24.150 --> 01:37:27.950 align:middle line:84% It is working now, but it's a little fragile. 01:37:27.950 --> 01:37:31.880 align:middle line:84% I bet we could contrive some scenarios where I think it works but it doesn't. 01:37:31.880 --> 01:37:33.890 align:middle line:84% AUDIENCE: Well, I have a few ideas, actually. 01:37:33.890 --> 01:37:39.980 align:middle line:84% Well, first of all, if we don't specify HTTPS, it will be broken. 01:37:39.980 --> 01:37:44.760 align:middle line:84% Secondly, if we have a slash at the end, it also will be broken. 01:37:44.760 --> 01:37:48.320 align:middle line:84% If we have a question mark or something after question mark, 01:37:48.320 --> 01:37:49.590 align:middle line:90% it also won't work. 01:37:49.590 --> 01:37:51.160 align:middle line:90% So a lot of scenarios, actually. 01:37:51.160 --> 01:37:52.160 align:middle line:90% DAVID MALAN: Oh, my god. 01:37:52.160 --> 01:37:52.993 align:middle line:90% I mean, here we are. 
01:37:52.993 --> 01:37:54.650 align:middle line:90% I was pretending to think I was done. 01:37:54.650 --> 01:37:57.920 align:middle line:84% But my god, like, Alex gave us a whole laundry list of problems. 01:37:57.920 --> 01:38:01.700 align:middle line:84% And just to recap, then, what if it's not HTTPS, it's HTTP? 01:38:01.700 --> 01:38:03.590 align:middle line:84% Slightly less secure, but I should still be 01:38:03.590 --> 01:38:05.713 align:middle line:90% able to tolerate that programmatically. 01:38:05.713 --> 01:38:07.130 align:middle line:90% What if the protocol is not there? 01:38:07.130 --> 01:38:09.740 align:middle line:84% What if the user just typed twitter.com/davidjmalan? 01:38:09.740 --> 01:38:12.680 align:middle line:84% It would be nice to tolerate that rather than show an error 01:38:12.680 --> 01:38:14.150 align:middle line:90% and make me type in the protocol. 01:38:14.150 --> 01:38:14.660 align:middle line:90% Why? 01:38:14.660 --> 01:38:16.050 align:middle line:90% It's not good user experience. 01:38:16.050 --> 01:38:20.030 align:middle line:84% What if it had a slash at the end of the username, or a question mark? 01:38:20.030 --> 01:38:22.500 align:middle line:84% If you think about URLs you've seen on the web, 01:38:22.500 --> 01:38:24.920 align:middle line:84% there's very commonly more information, especially 01:38:24.920 --> 01:38:26.540 align:middle line:90% if it's been shared on social media. 01:38:26.540 --> 01:38:28.640 align:middle line:84% There might be HTTP parameters, so to speak, 01:38:28.640 --> 01:38:30.230 align:middle line:90% just stuff there that we don't want. 01:38:30.230 --> 01:38:34.880 align:middle line:84% There could be a www.twitter.com, which I'm also not expecting but does 01:38:34.880 --> 01:38:37.360 align:middle line:90% work if you go to that URL, too. 01:38:37.360 --> 01:38:39.540 align:middle line:84% So there's just so many things that can go wrong. 
01:38:39.540 --> 01:38:43.010 align:middle line:84% And even if I come back to my contrived example as earlier, 01:38:43.010 --> 01:38:45.350 align:middle line:84% what if I run this program and say this-- 01:38:45.350 --> 01:38:52.610 align:middle line:84% "my username is https://twitter.com/davidjmalan," 01:38:52.610 --> 01:38:53.540 align:middle line:90% Enter. 01:38:53.540 --> 01:38:58.570 align:middle line:84% Well, that too just didn't really work-- it got rid of the-- actually-- 01:38:58.570 --> 01:39:01.730 align:middle line:84% [LAUGHS] OK, actually that kind of worked. 01:39:01.730 --> 01:39:05.390 align:middle line:84% But the goal here is to actually get the user's username, 01:39:05.390 --> 01:39:08.210 align:middle line:84% not an English sentence describing the user's username. 01:39:08.210 --> 01:39:11.150 align:middle line:84% So I would argue that even though I just accidentally created 01:39:11.150 --> 01:39:13.670 align:middle line:84% perfectly correct English grammar, I did not 01:39:13.670 --> 01:39:15.860 align:middle line:90% extract the Twitter username correctly. 01:39:15.860 --> 01:39:19.890 align:middle line:84% I don't want words like "my username is" as part of my input. 01:39:19.890 --> 01:39:22.940 align:middle line:84% So how can we go about improving this, and maybe chipping away 01:39:22.940 --> 01:39:24.530 align:middle line:90% at some of those problems one by one? 01:39:24.530 --> 01:39:26.280 align:middle line:90% Well, let me clear my screen here. 01:39:26.280 --> 01:39:27.780 align:middle line:90% Let me come back up to my code. 01:39:27.780 --> 01:39:31.640 align:middle line:84% And let me not just replace it, but let me do something else instead. 01:39:31.640 --> 01:39:34.040 align:middle line:84% I'm going to go ahead, and instead of using replace, 01:39:34.040 --> 01:39:36.950 align:middle line:84% I'm going to use another function called removeprefix. 
01:39:36.950 --> 01:39:42.060 align:middle line:84% A prefix is a string or a substring that comes at the start of another. 01:39:42.060 --> 01:39:45.320 align:middle line:84% So if I remove prefix, I don't need a second argument for this function. 01:39:45.320 --> 01:39:46.220 align:middle line:90% I just need one. 01:39:46.220 --> 01:39:48.540 align:middle line:90% What prefix do you want to remove? 01:39:48.540 --> 01:39:51.680 align:middle line:84% So this will at least now fix the problem I just 01:39:51.680 --> 01:39:54.860 align:middle line:84% described of typing in like a whole sentence, where the URL is there, 01:39:54.860 --> 01:39:57.600 align:middle line:84% but it's not at the beginning, it's only at the end. 01:39:57.600 --> 01:39:59.930 align:middle line:90% So here, this still is not correct. 01:39:59.930 --> 01:40:04.100 align:middle line:84% But we don't create this weird-looking output that just removes the URL part 01:40:04.100 --> 01:40:05.360 align:middle line:90% of the input-- 01:40:05.360 --> 01:40:11.330 align:middle line:84% "my username is https://twitter.com/davidjmalan." 01:40:11.330 --> 01:40:16.700 align:middle line:84% A moment ago, it did remove the URL and left only the davidjmalan. 01:40:16.700 --> 01:40:17.990 align:middle line:90% This is not perfect still. 01:40:17.990 --> 01:40:21.830 align:middle line:84% But at least now, it does not weirdly remove the URL 01:40:21.830 --> 01:40:23.030 align:middle line:90% and then leave the English. 01:40:23.030 --> 01:40:24.420 align:middle line:90% It's just leaving it alone. 01:40:24.420 --> 01:40:26.600 align:middle line:84% So maybe I could handle this better, but at least 01:40:26.600 --> 01:40:30.710 align:middle line:84% it's removing it from the part of the string I might anticipate. 01:40:30.710 --> 01:40:32.550 align:middle line:90% Well, what else could we do here? 
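To make the difference between the two string methods concrete, here is a small sketch. It assumes Python 3.9 or later, where `str.removeprefix` was added; the variable names are illustrative only.

```python
PREFIX = "https://twitter.com/"

sentence = "My username is https://twitter.com/davidjmalan"
plain_url = "https://twitter.com/davidjmalan"

# replace() removes the prefix wherever it occurs, even mid-sentence.
replaced = sentence.replace(PREFIX, "")

# removeprefix() only acts when the string actually starts with the prefix.
left_alone = sentence.removeprefix(PREFIX)
stripped = plain_url.removeprefix(PREFIX)

print(replaced)
print(left_alone)
print(stripped)
```

The sentence input is left alone by `removeprefix`, which is the behavior described above: not a full solution, but no longer a half-edited string.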
01:40:32.550 --> 01:40:35.180 align:middle line:84% Well, it turns out that regular expressions just 01:40:35.180 --> 01:40:37.940 align:middle line:84% let us express patterns much more precisely. 01:40:37.940 --> 01:40:41.180 align:middle line:84% We could spend all day using a whole bunch of different Python functions 01:40:41.180 --> 01:40:44.810 align:middle line:84% like removeprefix, or remove, and strip, and others, and kind of 01:40:44.810 --> 01:40:47.240 align:middle line:90% make our way to the right solution. 01:40:47.240 --> 01:40:50.310 align:middle line:84% But a regular expression just allows you to more succinctly, 01:40:50.310 --> 01:40:55.040 align:middle line:84% if admittedly more cryptically, express these kinds of patterns and goals. 01:40:55.040 --> 01:40:57.260 align:middle line:84% And we've seen from parentheses, which can 01:40:57.260 --> 01:41:00.170 align:middle line:84% be used not just to group symbols together as sets 01:41:00.170 --> 01:41:05.180 align:middle line:84% but to capture information as well, we have a very powerful tool now 01:41:05.180 --> 01:41:06.630 align:middle line:90% in our toolkit. 01:41:06.630 --> 01:41:07.800 align:middle line:90% So let me do this. 01:41:07.800 --> 01:41:12.530 align:middle line:84% Let me go ahead and start fresh here and import the re library 01:41:12.530 --> 01:41:14.450 align:middle line:90% as before at the very top of my program. 01:41:14.450 --> 01:41:17.900 align:middle line:84% I'm still going to get the user's URL via the same line of code. 01:41:17.900 --> 01:41:20.970 align:middle line:84% But I'm now going to use another function as well. 01:41:20.970 --> 01:41:24.950 align:middle line:84% It turns out that there's not just re.search, or re.match, 01:41:24.950 --> 01:41:26.060 align:middle line:90% or re.fullmatch. 
01:41:26.060 --> 01:41:30.860 align:middle line:84% There's also re.sub in the regular expression library, where "sub" here 01:41:30.860 --> 01:41:32.000 align:middle line:90% means "substitute." 01:41:32.000 --> 01:41:35.220 align:middle line:84% And it takes more arguments, but they're fairly straightforward. 01:41:35.220 --> 01:41:38.990 align:middle line:84% The first argument to re.sub is the pattern, the regular expression 01:41:38.990 --> 01:41:40.280 align:middle line:90% that you want to look for. 01:41:40.280 --> 01:41:43.160 align:middle line:84% Then you have a replacement string-- what do 01:41:43.160 --> 01:41:45.470 align:middle line:90% you want to replace that pattern with? 01:41:45.470 --> 01:41:47.390 align:middle line:90% And where do you want to do all that? 01:41:47.390 --> 01:41:51.265 align:middle line:84% Well, you pass in the string that you want to do the substitution on. 01:41:51.265 --> 01:41:54.140 align:middle line:84% Then there's some other arguments that I'll wave my hands at for now. 01:41:54.140 --> 01:41:56.240 align:middle line:84% Among them are those same flags and also a count, 01:41:56.240 --> 01:41:58.970 align:middle line:84% like how many times do you want to do find and replace? 01:41:58.970 --> 01:42:01.670 align:middle line:84% Do you want it to do all, do you want to do just one, 01:42:01.670 --> 01:42:04.070 align:middle line:84% or so forth you can have further control there, too, 01:42:04.070 --> 01:42:06.770 align:middle line:84% just like you would in Google Docs or Microsoft Word. 01:42:06.770 --> 01:42:10.160 align:middle line:84% Well, let me go back to my code here, and let me do this. 01:42:10.160 --> 01:42:15.020 align:middle line:84% I'm going to go ahead and call re not search but re.sub for substitute. 
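As a quick sketch of the arguments just listed, `re.sub` takes the pattern, the replacement, and the string, plus optional `count` and `flags`. The sample text below is made up purely for illustration.

```python
import re

text = "a-b-c-d"

# Default count=0 means "replace every match".
all_replaced = re.sub(r"-", "", text)

# count=1 limits the substitution to the first match only.
first_only = re.sub(r"-", "", text, count=1)

# flags work as with re.search, e.g. case-insensitive matching.
no_x = re.sub(r"x", "", "aXbxc", flags=re.IGNORECASE)

print(all_replaced, first_only, no_x)
```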
01:42:15.020 --> 01:42:18.320 align:middle line:84% I'm going to pass in the following regular expression, 01:42:18.320 --> 01:42:25.610 align:middle line:84% "https://twitter.com/" and then I'm going to close my quote. 01:42:25.610 --> 01:42:27.860 align:middle line:84% And now what do I want to replace that with? 01:42:27.860 --> 01:42:31.460 align:middle line:84% Well, like before with the simple str replace function, 01:42:31.460 --> 01:42:34.380 align:middle line:84% I want to replace it with nothing, just get rid of it altogether. 01:42:34.380 --> 01:42:37.730 align:middle line:84% But what string do I want to pass in to do this to? 01:42:37.730 --> 01:42:39.810 align:middle line:90% The URL from the user. 01:42:39.810 --> 01:42:44.360 align:middle line:84% And now let me go ahead and assign the return value of re.sub 01:42:44.360 --> 01:42:46.100 align:middle line:90% to a variable called username. 01:42:46.100 --> 01:42:49.460 align:middle line:84% So re.sub's purpose in life is, again, to substitute 01:42:49.460 --> 01:42:52.490 align:middle line:84% some value for some regular expression some number of times. 01:42:52.490 --> 01:42:56.360 align:middle line:84% It essentially is find and replace using regular expressions. 01:42:56.360 --> 01:42:59.090 align:middle line:84% And it returns to you the resulting string 01:42:59.090 --> 01:43:01.400 align:middle line:84% once you've done all those substitutions. 01:43:01.400 --> 01:43:04.850 align:middle line:84% So now the very last line of my code can be the same as before, print-- 01:43:04.850 --> 01:43:08.960 align:middle line:84% and I'll use an f-string, username, colon, and then in curly braces, 01:43:08.960 --> 01:43:09.590 align:middle line:90% username. 01:43:09.590 --> 01:43:12.300 align:middle line:90% So I can print out literally just that. 01:43:12.300 --> 01:43:12.800 align:middle line:90% All right. 01:43:12.800 --> 01:43:14.300 align:middle line:90% Let's try this and see what happens. 
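The program as dictated at this point might look roughly like the sketch below. The lecture reads the URL with `input`; here a hypothetical helper `strip_prefix` takes it as a parameter instead so the result can be checked directly.

```python
import re


def strip_prefix(url: str) -> str:
    # First draft of the pattern, exactly as spoken: no anchoring, no escaping yet.
    return re.sub("https://twitter.com/", "", url)


username = strip_prefix("https://twitter.com/davidjmalan")
print(f"Username: {username}")
```

At this stage the pattern behaves much like `str.replace`, so a sentence containing the URL still gets its URL part removed.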
01:43:14.300 --> 01:43:17.390 align:middle line:84% I'll clear my terminal window, run python of twitter.py. 01:43:17.390 --> 01:43:23.690 align:middle line:84% And here we go, https://twitter.com/davidjmalan. 01:43:23.690 --> 01:43:25.940 align:middle line:90% Cross my fingers and hit Enter. 01:43:25.940 --> 01:43:28.580 align:middle line:90% OK, now we're in business. 01:43:28.580 --> 01:43:30.560 align:middle line:90% But it is still a little fragile. 01:43:30.560 --> 01:43:34.730 align:middle line:84% And so let me ask the group, what problem should I now 01:43:34.730 --> 01:43:36.125 align:middle line:90% further chip away at? 01:43:36.125 --> 01:43:38.000 align:middle line:84% They've been said before, but let's be clear. 01:43:38.000 --> 01:43:40.460 align:middle line:84% What's one or more problems that still remain? 01:43:40.460 --> 01:43:44.690 align:middle line:84% AUDIENCE: The protocols and the domain prefix [INAUDIBLE]. 01:43:44.690 --> 01:43:45.440 align:middle line:90% DAVID MALAN: Good. 01:43:45.440 --> 01:43:48.020 align:middle line:90% The protocols, so HTTP versus HTTPS. 01:43:48.020 --> 01:43:51.980 align:middle line:84% Maybe the subdomain, www, should it be there or not? 01:43:51.980 --> 01:43:54.200 align:middle line:84% And there's a few other mistakes here, too. 01:43:54.200 --> 01:43:55.770 align:middle line:90% Let me actually stay with the group. 01:43:55.770 --> 01:43:59.600 align:middle line:84% What are some other shortcomings of this current solution? 01:43:59.600 --> 01:44:03.590 align:middle line:84% AUDIENCE: If we use a phrase like you do before, 01:44:03.590 --> 01:44:07.940 align:middle line:84% we are going to have the same problem, because it's not taking account 01:44:07.940 --> 01:44:11.150 align:middle line:90% in the first part of the text example. 01:44:11.150 --> 01:44:11.900 align:middle line:90% DAVID MALAN: Good. 
01:44:11.900 --> 01:44:16.220 align:middle line:84% I might still allow for some words, some English to the left of the URL 01:44:16.220 --> 01:44:17.810 align:middle line:90% because I didn't use my ^ symbol. 01:44:17.810 --> 01:44:18.770 align:middle line:90% So I'll fix that. 01:44:18.770 --> 01:44:22.450 align:middle line:84% And any final observations on shortcomings here? 01:44:22.450 --> 01:44:26.993 align:middle line:84% AUDIENCE: Well, it could be an HTTP, or there could be less than two slashes. 01:44:26.993 --> 01:44:27.660 align:middle line:90% DAVID MALAN: OK. 01:44:27.660 --> 01:44:28.493 align:middle line:90% So it could be HTTP. 01:44:28.493 --> 01:44:30.910 align:middle line:84% And I think that was mentioned, too, in terms of protocol. 01:44:30.910 --> 01:44:32.570 align:middle line:90% There could be fewer than two slashes. 01:44:32.570 --> 01:44:34.550 align:middle line:90% That I'm not going to worry about. 01:44:34.550 --> 01:44:38.720 align:middle line:84% If the user gives me one instead of two, that's really user error. 01:44:38.720 --> 01:44:41.420 align:middle line:84% And I could be tolerant of it, but you know what, at that point 01:44:41.420 --> 01:44:45.570 align:middle line:84% I'm OK yelling at them with an error message saying, please fix your input. 01:44:45.570 --> 01:44:48.890 align:middle line:84% Otherwise, we could be here all day long trying to handle all possible typos. 01:44:48.890 --> 01:44:51.740 align:middle line:84% For now, I think in the interests of usability, 01:44:51.740 --> 01:44:54.560 align:middle line:84% or user experience, UX, let's at least be 01:44:54.560 --> 01:44:59.130 align:middle line:84% tolerant of all possible valid inputs or reasonable inputs, if you will. 01:44:59.130 --> 01:45:01.940 align:middle line:84% So let me go here, and let me start chipping away at these here. 01:45:01.940 --> 01:45:03.530 align:middle line:90% What are some problems we can solve? 
01:45:03.530 --> 01:45:08.735 align:middle line:84% Well, let me propose that we first address the issue of matching 01:45:08.735 --> 01:45:10.110 align:middle line:90% from the beginning of the string. 01:45:10.110 --> 01:45:11.900 align:middle line:90% So let me add the ^ to the beginning. 01:45:11.900 --> 01:45:15.362 align:middle line:84% And let me add not a $ sign at the end, though, right? 01:45:15.362 --> 01:45:17.570 align:middle line:84% Because I don't want to match all the way to the end, 01:45:17.570 --> 01:45:19.950 align:middle line:84% because I want to tolerate a username there. 01:45:19.950 --> 01:45:23.210 align:middle line:84% So I think we just want the ^ symbol there. 01:45:23.210 --> 01:45:26.000 align:middle line:84% There's a subtle bug that no one yet mentioned. 01:45:26.000 --> 01:45:30.860 align:middle line:84% And let me just kind of highlight it and see if it jumps out at you now. 01:45:30.860 --> 01:45:32.730 align:middle line:90% It's a little subtle here on my screen. 01:45:32.730 --> 01:45:37.610 align:middle line:84% I've highlighted in blue a final bug here-- 01:45:37.610 --> 01:45:39.860 align:middle line:90% maybe some smiles on the screen, yeah? 01:45:39.860 --> 01:45:41.400 align:middle line:90% Can we take one hand here? 01:45:41.400 --> 01:45:46.730 align:middle line:84% Why am I highlighting the dot in twitter.com, even though it definitely 01:45:46.730 --> 01:45:47.900 align:middle line:90% should be there? 01:45:47.900 --> 01:45:52.610 align:middle line:84% AUDIENCE: So the dot without a backslash means any character except a newline. 01:45:52.610 --> 01:45:53.990 align:middle line:90% DAVID MALAN: Yeah, exactly. 01:45:53.990 --> 01:45:55.500 align:middle line:90% It means any character. 01:45:55.500 --> 01:46:01.555 align:middle line:84% So I could type in something like twitter?com, or twitter anything com, 01:46:01.555 --> 01:46:03.660 align:middle line:90% and that would actually be tolerated. 
01:46:03.660 --> 01:46:07.230 align:middle line:84% It's not really that bad, because why would the user do that? 01:46:07.230 --> 01:46:09.410 align:middle line:84% But if I want to be correct, and I want to be 01:46:09.410 --> 01:46:13.280 align:middle line:84% able to test my own code properly, I should really get this detail right. 01:46:13.280 --> 01:46:16.040 align:middle line:84% So that's an easy fix, too, but it's a common mistake. 01:46:16.040 --> 01:46:19.190 align:middle line:84% Anytime you're writing regular expressions that happen to involve 01:46:19.190 --> 01:46:23.210 align:middle line:84% special symbols, like dots in a URL or domain name, 01:46:23.210 --> 01:46:27.230 align:middle line:84% a $ sign in something involving currency, remember you might, indeed, 01:46:27.230 --> 01:46:30.390 align:middle line:84% need to escape it with a backslash like this here. 01:46:30.390 --> 01:46:30.890 align:middle line:90% All right. 01:46:30.890 --> 01:46:34.040 align:middle line:84% Let me ask the group about the protocol specifically. 01:46:34.040 --> 01:46:36.690 align:middle line:90% So HTTPS is a good thing in the world. 01:46:36.690 --> 01:46:37.860 align:middle line:90% It means secure. 01:46:37.860 --> 01:46:39.360 align:middle line:90% There is encryption being used. 01:46:39.360 --> 01:46:41.840 align:middle line:90% So generally, you like to see HTTPS. 01:46:41.840 --> 01:46:46.370 align:middle line:84% But you still see people typing or copy-pasting HTTP. 01:46:46.370 --> 01:46:50.960 align:middle line:84% What would be the simplest fix here to tolerate, as has been proposed, 01:46:50.960 --> 01:46:54.380 align:middle line:90% both HTTP and HTTPS? 01:46:54.380 --> 01:46:56.600 align:middle line:84% I'm going to propose that I could do this. 01:46:56.600 --> 01:47:02.630 align:middle line:84% I could do HTTP vertical bar or HTTPS, which, again, means A or B. 
01:47:02.630 --> 01:47:04.490 align:middle line:90% But I think I can be smarter than that. 01:47:04.490 --> 01:47:06.770 align:middle line:84% I can keep my code a little more succinct. 01:47:06.770 --> 01:47:13.400 align:middle line:84% Any recommendations here for tolerating HTTP or HTTPS? 01:47:13.400 --> 01:47:16.845 align:middle line:84% AUDIENCE: We could try to put in question mark behind the S. 01:47:16.845 --> 01:47:17.720 align:middle line:90% DAVID MALAN: Perfect. 01:47:17.720 --> 01:47:19.340 align:middle line:90% Just use a question mark. 01:47:19.340 --> 01:47:21.110 align:middle line:90% Both of those would be viable solutions. 01:47:21.110 --> 01:47:23.330 align:middle line:84% If you want to be super explicit in your code, fine. 01:47:23.330 --> 01:47:28.730 align:middle line:84% Use parentheses and say HTTP or HTTPS, so that you, the reader, your boss, 01:47:28.730 --> 01:47:31.410 align:middle line:84% your teacher just know exactly what you're doing. 01:47:31.410 --> 01:47:35.090 align:middle line:84% But if you keep taking the more verbose approach all the time, 01:47:35.090 --> 01:47:37.760 align:middle line:84% it might actually become less readable, certainly 01:47:37.760 --> 01:47:40.580 align:middle line:84% once your regular expressions get this big instead of this big. 01:47:40.580 --> 01:47:42.290 align:middle line:90% So let's save space where we can. 01:47:42.290 --> 01:47:45.030 align:middle line:84% And I would argue that this is pretty reasonable, so 01:47:45.030 --> 01:47:47.640 align:middle line:84% long as you're in the habit of reading regular expressions 01:47:47.640 --> 01:47:50.390 align:middle line:84% and know that question mark does not mean a literal question mark, 01:47:50.390 --> 01:47:52.970 align:middle line:84% but it means zero or one of the thing before. 01:47:52.970 --> 01:47:56.510 align:middle line:84% I think we've effectively made the S optional here. 
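The three refinements so far (anchoring with `^`, escaping the dot, and making the S optional with `?`) can be sketched together like this. The constant names `LOOSE` and `STRICT` are only labels for this illustration.

```python
import re

# Earlier draft: unescaped dot matches any character except a newline.
LOOSE = r"^https://twitter.com/"

# Refined: "s?" means zero or one "s"; "\." means a literal dot.
STRICT = r"^https?://twitter\.com/"

odd_input = "https://twitter?com/davidjmalan"

# The loose pattern wrongly accepts "twitter?com".
loose_result = re.sub(LOOSE, "", odd_input)

# The strict pattern leaves that input untouched.
strict_result = re.sub(STRICT, "", odd_input)

# Both protocols are now tolerated.
from_http = re.sub(STRICT, "", "http://twitter.com/davidjmalan")
from_https = re.sub(STRICT, "", "https://twitter.com/davidjmalan")

# And ^ means a URL buried in a sentence is no longer removed.
sentence = re.sub(STRICT, "", "My username is https://twitter.com/davidjmalan")

print(loose_result, strict_result, from_http, from_https, sentence, sep="\n")
```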
01:47:56.510 --> 01:47:58.410 align:middle line:90% Now, what else can I do? 01:47:58.410 --> 01:48:03.860 align:middle line:84% Well, suppose we want to tolerate the www dot, which may or may not be there, 01:48:03.860 --> 01:48:06.050 align:middle line:90% but it will work if you go to a browser. 01:48:06.050 --> 01:48:07.220 align:middle line:90% I could do this-- 01:48:07.220 --> 01:48:11.720 align:middle line:84% www dot-- wait, I want a backslash there so I don't 01:48:11.720 --> 01:48:13.310 align:middle line:90% repeat the same mistake as before. 01:48:13.310 --> 01:48:19.220 align:middle line:84% But this is no good either, because I want to tolerate being there or not 01:48:19.220 --> 01:48:19.760 align:middle line:90% being there. 01:48:19.760 --> 01:48:21.890 align:middle line:84% And now I've just required that it be there. 01:48:21.890 --> 01:48:24.290 align:middle line:84% But I think I can take the same approach. 01:48:24.290 --> 01:48:25.550 align:middle line:90% Any recommendations? 01:48:25.550 --> 01:48:27.200 align:middle line:90% How do I make the www. 01:48:27.200 --> 01:48:30.230 align:middle line:90% optional, just to hammer this home? 01:48:30.230 --> 01:48:32.480 align:middle line:90% AUDIENCE: We can group-- 01:48:32.480 --> 01:48:35.835 align:middle line:90% make a square and a question mark. 01:48:35.835 --> 01:48:36.710 align:middle line:90% DAVID MALAN: Perfect. 01:48:36.710 --> 01:48:38.825 align:middle line:84% So question mark is the short answer again. 01:48:38.825 --> 01:48:40.700 align:middle line:84% But we have to be a little smarter this time. 01:48:40.700 --> 01:48:43.130 align:middle line:84% As Maria has noted, we need parentheses now. 01:48:43.130 --> 01:48:46.160 align:middle line:84% Because if I just put a question mark after the dot, 01:48:46.160 --> 01:48:48.147 align:middle line:90% that just means the dot is optional. 
01:48:48.147 --> 01:48:50.480 align:middle line:84% And that's wrong, because we don't want the user to type 01:48:50.480 --> 01:48:56.690 align:middle line:84% in W-W-W-T-W-I-T-T-E-R. We want the dot to be there or just not at all with no 01:48:56.690 --> 01:48:57.490 align:middle line:90% www. 01:48:57.490 --> 01:49:00.080 align:middle line:84% So we need to group this whole thing together, 01:49:00.080 --> 01:49:04.160 align:middle line:84% put a parenthesis there, and then a parenthesis, not after the third W, 01:49:04.160 --> 01:49:09.920 align:middle line:84% after the dot, so that that whole thing is either there or it's not there. 01:49:09.920 --> 01:49:12.338 align:middle line:90% And what else could we still do here? 01:49:12.338 --> 01:49:14.630 align:middle line:84% There's going to be one other thing we should tolerate. 01:49:14.630 --> 01:49:16.922 align:middle line:84% And it's been said before, and I'll pluck this one off. 01:49:16.922 --> 01:49:18.260 align:middle line:90% What about the protocol? 01:49:18.260 --> 01:49:23.805 align:middle line:84% Like, what if the user just doesn't type or doesn't copy-paste the http:// 01:49:23.805 --> 01:49:26.660 align:middle line:90% or an https://? 01:49:26.660 --> 01:49:28.460 align:middle line:84% Honestly, you and I are not in the habit, 01:49:28.460 --> 01:49:31.730 align:middle line:84% generally, of even typing the protocol anymore nowadays. 01:49:31.730 --> 01:49:34.010 align:middle line:84% You just let the browser figure it out for you, 01:49:34.010 --> 01:49:36.590 align:middle line:90% and automatically add it instead. 01:49:36.590 --> 01:49:38.900 align:middle line:84% So this one's going to look like more of a mouthful. 01:49:38.900 --> 01:49:43.520 align:middle line:84% But if I want this whole thing here in blue to be optional, 01:49:43.520 --> 01:49:46.880 align:middle line:84% it's actually the same solution as Maria offered a moment ago. 
01:49:46.880 --> 01:49:49.550 align:middle line:84% I'm going to go ahead and put a parenthesis over here, 01:49:49.550 --> 01:49:53.960 align:middle line:84% and a parenthesis after the two slashes, and then a question 01:49:53.960 --> 01:49:57.120 align:middle line:84% mark so as to make that whole thing optional as well. 01:49:57.120 --> 01:49:58.320 align:middle line:90% And this is OK. 01:49:58.320 --> 01:50:00.920 align:middle line:84% It's totally fine to make this whole thing 01:50:00.920 --> 01:50:06.480 align:middle line:84% optional, or inside of it, this little thing, just the S optional as well. 01:50:06.480 --> 01:50:09.350 align:middle line:84% So long as I'm applying the same principles again and again, 01:50:09.350 --> 01:50:11.390 align:middle line:84% either on a small scale or a bigger scale, 01:50:11.390 --> 01:50:16.680 align:middle line:84% it's totally fine to nest one of these inside of the other. 01:50:16.680 --> 01:50:20.730 align:middle line:84% Questions now on any of these refinements 01:50:20.730 --> 01:50:23.730 align:middle line:84% to this parsing, this analyzing of Twitter? 01:50:23.730 --> 01:50:29.850 align:middle line:84% AUDIENCE: What if we put a vertical bar besides this www dot? 01:50:29.850 --> 01:50:31.930 align:middle line:84% DAVID MALAN: What if we use a vertical bar there? 01:50:31.930 --> 01:50:34.110 align:middle line:90% So we could do something like that, too. 01:50:34.110 --> 01:50:36.690 align:middle line:90% We could do something like this. 01:50:36.690 --> 01:50:41.370 align:middle line:84% Instead of the question mark, I could do www dot or nothing 01:50:41.370 --> 01:50:43.680 align:middle line:90% and just leave that and the parentheses. 01:50:43.680 --> 01:50:45.160 align:middle line:90% That, too, would be fine. 
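Putting the optional groups together, the pattern at this point might look like the sketch below, with the vertical-bar variant from the question alongside it. The names `PATTERN` and `PATTERN_ALT` are assumptions for this illustration.

```python
import re

# (https?://)? makes the whole protocol optional, with the s optional inside it.
# (www\.)? makes "www." optional as a unit, dot included.
PATTERN = r"^(https?://)?(www\.)?twitter\.com/"

# Equivalent alternative: "www." or nothing.
PATTERN_ALT = r"^(https?://)?(www\.|)twitter\.com/"

inputs = [
    "https://twitter.com/davidjmalan",
    "http://www.twitter.com/davidjmalan",
    "www.twitter.com/davidjmalan",
    "twitter.com/davidjmalan",
]

results = [re.sub(PATTERN, "", url) for url in inputs]
results_alt = [re.sub(PATTERN_ALT, "", url) for url in inputs]

# "www" without its dot is not accepted, because the group is all-or-nothing.
missing_dot = re.sub(PATTERN, "", "wwwtwitter.com/davidjmalan")

print(results)
print(missing_dot)
```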
01:50:45.160 --> 01:50:47.743 align:middle line:84% I personally tend not to like that, because it's a little less 01:50:47.743 --> 01:50:49.035 align:middle line:90% obvious to me-- wait, a minute. 01:50:49.035 --> 01:50:52.260 align:middle line:84% Is that deliberate, or did I forget to finish my thought by putting something 01:50:52.260 --> 01:50:53.460 align:middle line:90% after the vertical bar? 01:50:53.460 --> 01:50:57.630 align:middle line:84% But that, too, would be allowed there as well, if that's what you mean. 01:50:57.630 --> 01:50:59.790 align:middle line:84% Other questions on where we left things here, 01:50:59.790 --> 01:51:03.090 align:middle line:84% where we made the protocol optional, too? 01:51:03.090 --> 01:51:07.260 align:middle line:84% AUDIENCE: What happens if we have parenthesis, 01:51:07.260 --> 01:51:10.173 align:middle line:84% and inside we have another parenthesis, and another parenthesis? 01:51:10.173 --> 01:51:11.590 align:middle line:90% Will it interfere with each other? 01:51:11.590 --> 01:51:14.298 align:middle line:84% DAVID MALAN: If you have parentheses inside of parentheses, that, 01:51:14.298 --> 01:51:15.660 align:middle line:90% too, is totally fine. 01:51:15.660 --> 01:51:19.680 align:middle line:84% And indeed, that should be one of the reassuring lessons today. 01:51:19.680 --> 01:51:23.670 align:middle line:84% As complicated as each of these regular expressions has admittedly gotten, 01:51:23.670 --> 01:51:27.570 align:middle line:84% I'm just applying the exact same principles and the exact same syntax 01:51:27.570 --> 01:51:29.110 align:middle line:90% again and again. 01:51:29.110 --> 01:51:31.988 align:middle line:84% So it's totally fine to have parentheses inside of parentheses 01:51:31.988 --> 01:51:33.780 align:middle line:84% if they're each solving different problems. 
01:51:33.780 --> 01:51:37.200 align:middle line:84% And in fact, the lesson I would really emphasize the most today 01:51:37.200 --> 01:51:41.250 align:middle line:84% is that you will not be happy if you try to write out 01:51:41.250 --> 01:51:44.820 align:middle line:84% a whole complicated regular expression all at once. 01:51:44.820 --> 01:51:47.310 align:middle line:84% Like, if you're anything like me, you will fail, 01:51:47.310 --> 01:51:49.428 align:middle line:84% and you will have trouble finding the mistake. 01:51:49.428 --> 01:51:50.970 align:middle line:90% Because my god, look at these things. 01:51:50.970 --> 01:51:53.880 align:middle line:84% They are, even to me all these years later, cryptic. 01:51:53.880 --> 01:51:57.240 align:middle line:84% The better way, I would argue, whether you're new to programming 01:51:57.240 --> 01:52:01.110 align:middle line:84% or as old to it as I am, is to just take these baby 01:52:01.110 --> 01:52:03.750 align:middle line:84% steps, these incremental steps where you do something simple, 01:52:03.750 --> 01:52:04.710 align:middle line:90% you make sure it works. 01:52:04.710 --> 01:52:07.080 align:middle line:84% You add one more feature, make sure it works. 01:52:07.080 --> 01:52:09.120 align:middle line:84% Add one more feature, make sure it works. 01:52:09.120 --> 01:52:12.360 align:middle line:84% And hopefully, by the end, because you've done each of those steps one 01:52:12.360 --> 01:52:15.490 align:middle line:84% at a time, the whole thing will make sense to you. 01:52:15.490 --> 01:52:20.310 align:middle line:84% But you'll also have gotten each of those steps correct at each turn. 
01:52:20.310 --> 01:52:23.970 align:middle line:84% So please, do avoid the inclination to try 01:52:23.970 --> 01:52:26.550 align:middle line:84% to come up with long, sophisticated regular expressions 01:52:26.550 --> 01:52:29.580 align:middle line:84% all at once, because it's just not a good use of time 01:52:29.580 --> 01:52:32.100 align:middle line:84% if you then stare at it trying to find a mistake that you 01:52:32.100 --> 01:52:35.230 align:middle line:84% could have caught if you did things more incrementally instead. 01:52:35.230 --> 01:52:35.730 align:middle line:90% All right. 01:52:35.730 --> 01:52:38.160 align:middle line:84% There still remains, arguably, at least one problem 01:52:38.160 --> 01:52:40.050 align:middle line:84% with this solution in that even though I'm 01:52:40.050 --> 01:52:44.040 align:middle line:84% calling re.sub to substitute the URL with nothing, 01:52:44.040 --> 01:52:47.410 align:middle line:84% quote, unquote, I then in my final line of code, line 6, 01:52:47.410 --> 01:52:49.590 align:middle line:84% am just blindly assuming that it all worked, 01:52:49.590 --> 01:52:52.200 align:middle line:84% and I'm going to go ahead and print out the username. 01:52:52.200 --> 01:52:53.520 align:middle line:90% But what if the user-- 01:52:53.520 --> 01:52:56.310 align:middle line:84% if I clear my screen here and run python of twitter.py-- 01:52:56.310 --> 01:52:58.110 align:middle line:90% doesn't even type a Twitter URL? 01:52:58.110 --> 01:53:02.805 align:middle line:84% What if they do something like https://google.com/, 01:53:02.805 --> 01:53:06.090 align:middle line:84% like completely unrelated, for whatever reason, 01:53:06.090 --> 01:53:08.970 align:middle line:84% Enter, that is not their Twitter username.
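The problem described here can be reproduced in a few lines. This is a sketch only: the pattern is reconstructed from the spoken description (optional protocol, optional www.), and `strip_prefix` is a hypothetical helper name. The key behavior is that `re.sub` hands the string back unchanged when the pattern does not match, so nothing signals that the input was not a Twitter URL.

```python
import re


def strip_prefix(url):
    """Remove a leading Twitter URL prefix; non-matching input comes back as-is."""
    return re.sub(
        r"^(https?://)?(www\.)?twitter\.com/",  # assumed form of the lecture's pattern
        "",
        url,
        flags=re.IGNORECASE,
    )


print(strip_prefix("https://twitter.com/davidjmalan"))  # davidjmalan
print(strip_prefix("https://google.com/"))              # https://google.com/
```

Printing the second result as a "username" is exactly the blind assumption being called out.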
01:53:08.970 --> 01:53:12.300 align:middle line:84% So we need to have some conditional logic, I would argue, 01:53:12.300 --> 01:53:15.690 align:middle line:84% so that for this program's sake, we're only printing out 01:53:15.690 --> 01:53:19.920 align:middle line:84% or, in a back end system, we're only saving into our database or a CSV 01:53:19.920 --> 01:53:24.090 align:middle line:84% file the username if we actually matched the proper pattern. 01:53:24.090 --> 01:53:29.010 align:middle line:84% So rather than use re.sub, which is useful for cleaning up data, 01:53:29.010 --> 01:53:32.340 align:middle line:84% as we've done here to get rid of something we don't want there, 01:53:32.340 --> 01:53:37.080 align:middle line:84% why don't we go back to re.search, where we began today, 01:53:37.080 --> 01:53:41.100 align:middle line:84% and use it to solve this same problem but in a way that's conditional, 01:53:41.100 --> 01:53:44.490 align:middle line:84% whereby I can confidently say, yes or no, at the end of my program, 01:53:44.490 --> 01:53:47.260 align:middle line:90% here's the username, or here it is not? 01:53:47.260 --> 01:53:48.300 align:middle line:90% So let me go ahead now. 01:53:48.300 --> 01:53:50.340 align:middle line:90% And I'll clear my terminal window here. 01:53:50.340 --> 01:53:52.560 align:middle line:90% I'm going to keep most of-- 01:53:52.560 --> 01:53:55.800 align:middle line:84% I'm going to keep the first two lines the same, where I import re, 01:53:55.800 --> 01:53:57.520 align:middle line:90% and I get the URL from the user. 01:53:57.520 --> 01:53:59.010 align:middle line:90% But this time, let's do this. 01:53:59.010 --> 01:54:03.630 align:middle line:84% Let's this time search for, using re.search instead of re.sub, 01:54:03.630 --> 01:54:04.470 align:middle line:90% the following.
01:54:04.470 --> 01:54:09.510 align:middle line:84% I'm going to start matching at the beginning of the string, https, 01:54:09.510 --> 01:54:13.380 align:middle line:84% question mark to make the S optional, colon, slash, slash, 01:54:13.380 --> 01:54:19.710 align:middle line:84% I'm going to make my www optional by putting that in question marks there, 01:54:19.710 --> 01:54:24.000 align:middle line:84% then a twitter.com with a literal dot there so I stay ahead of that issue, 01:54:24.000 --> 01:54:26.640 align:middle line:90% too, then a slash. 01:54:26.640 --> 01:54:30.330 align:middle line:84% And then well, this is where davidjmalan is supposed to go. 01:54:30.330 --> 01:54:31.710 align:middle line:90% How do I detect this? 01:54:31.710 --> 01:54:35.580 align:middle line:84% Well, I think I'll just tolerate anything at the end of the URL here. 01:54:35.580 --> 01:54:38.532 align:middle line:84% All right, $ sign at the very end, close quote. 01:54:38.532 --> 01:54:40.740 align:middle line:84% For the moment, I'm going to stipulate that we're not 01:54:40.740 --> 01:54:43.830 align:middle line:84% going to worry about question marks at the end or hashes, 01:54:43.830 --> 01:54:45.600 align:middle line:90% like for fragment IDs in URLs. 01:54:45.600 --> 01:54:48.630 align:middle line:84% We're going to assume for simplicity now that the URL just 01:54:48.630 --> 01:54:50.610 align:middle line:90% ends with the username alone. 01:54:50.610 --> 01:54:52.110 align:middle line:90% Now what am I going to do? 01:54:52.110 --> 01:54:54.330 align:middle line:84% Well, I want to search for this URL specifically, 01:54:54.330 --> 01:54:58.230 align:middle line:84% and I'm going to ignore case, so re.IGNORECASE, 01:54:58.230 --> 01:55:00.840 align:middle line:84% applying that same lesson learned from before. 01:55:00.840 --> 01:55:05.717 align:middle line:84% re.search, recall, will return to you the matches you've captured. 
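As a rough sketch of the pattern as dictated so far, before any capturing parentheses are added around the username: `re.search` returns a match object when the pattern matches and None otherwise, which is what makes a yes-or-no check possible. The helper name `looks_like_twitter_url` is hypothetical.

```python
import re

# Pattern as spoken above: optional "s", optional "www.", a literal dot
# before "com", then one or more of anything up to the end of the string.
PATTERN = r"^https?://(www\.)?twitter\.com/.+$"


def looks_like_twitter_url(url):
    """True when re.search finds the pattern, False when it returns None."""
    return re.search(PATTERN, url, re.IGNORECASE) is not None


print(looks_like_twitter_url("https://twitter.com/davidjmalan"))  # True
print(looks_like_twitter_url("https://google.com/"))              # False
```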
01:55:05.717 --> 01:55:07.050 align:middle line:90% Well, what do I want to capture? 01:55:07.050 --> 01:55:12.420 align:middle line:84% Well, I want to capture everything to the right of the twitter.com URL here. 01:55:12.420 --> 01:55:17.560 align:middle line:84% So let me surround what should be the user's username with parentheses, 01:55:17.560 --> 01:55:21.580 align:middle line:84% not for making them optional but to say, "capture this set of characters." 01:55:21.580 --> 01:55:24.730 align:middle line:84% Now, re.search, recall, returns an answer. 01:55:24.730 --> 01:55:28.600 align:middle line:84% matches will be my variable name again, but I could call it anything I want. 01:55:28.600 --> 01:55:29.950 align:middle line:90% And then I can do this. 01:55:29.950 --> 01:55:33.680 align:middle line:90% If matches, now I know I can do this. 01:55:33.680 --> 01:55:36.370 align:middle line:84% Let's print out the format string, username colon. 01:55:36.370 --> 01:55:40.190 align:middle line:90% And then what do I want to print out? 01:55:40.190 --> 01:55:44.440 align:middle line:84% Well, I think I want to print out matches.group 1 for my matched 01:55:44.440 --> 01:55:45.700 align:middle line:90% username. 01:55:45.700 --> 01:55:46.210 align:middle line:90% All right. 01:55:46.210 --> 01:55:47.980 align:middle line:90% So what am I doing just to recap? 01:55:47.980 --> 01:55:49.960 align:middle line:90% Line 1, I'm importing the library. 01:55:49.960 --> 01:55:52.280 align:middle line:84% Line 2, I'm getting the URL from the user. 01:55:52.280 --> 01:55:53.230 align:middle line:90% So nothing new there. 01:55:53.230 --> 01:55:59.740 align:middle line:84% Line 5, I'm searching the user's URL, as indicated here as the second argument, 01:55:59.740 --> 01:56:03.220 align:middle line:84% for this regular expression, this pattern. 
01:56:03.220 --> 01:56:07.720 align:middle line:84% I have surrounded the dot + with parentheses 01:56:07.720 --> 01:56:11.380 align:middle line:84% so that they are captured ultimately, so I can extract, 01:56:11.380 --> 01:56:14.320 align:middle line:84% in this final scenario, the user's username. 01:56:14.320 --> 01:56:18.580 align:middle line:84% If I indeed got a match, and matches is non-none, 01:56:18.580 --> 01:56:23.470 align:middle line:84% it is actually containing some match, then and only then, print out username. 01:56:23.470 --> 01:56:25.420 align:middle line:90% In this way, let me try this now. 01:56:25.420 --> 01:56:31.110 align:middle line:84% If I run python of twitter.py and type in https://www.google.com/, 01:56:31.110 --> 01:56:33.370 align:middle line:90% now nothing gets printed. 01:56:33.370 --> 01:56:36.010 align:middle line:84% So I've at least solved the mistake we just saw, 01:56:36.010 --> 01:56:38.050 align:middle line:84% where I was just assuming that my code worked. 01:56:38.050 --> 01:56:44.000 align:middle line:84% Now I'm making sure that I have searched for and found the Twitter URL prefix. 01:56:44.000 --> 01:56:44.500 align:middle line:90% All right. 01:56:44.500 --> 01:56:45.917 align:middle line:90% Well, let's run this for real now. 01:56:45.917 --> 01:56:51.730 align:middle line:84% Python of twitter.py https://twitter.com/davidjmalan. 01:56:51.730 --> 01:56:55.420 align:middle line:84% But note, I could use HTTP, I could use www. 01:56:55.420 --> 01:56:58.430 align:middle line:84% I'm just going to go ahead here and hit Enter. 01:56:58.430 --> 01:57:01.730 align:middle line:90% Huh, none. 01:57:01.730 --> 01:57:05.480 align:middle line:90% What has gone wrong? 01:57:05.480 --> 01:57:08.060 align:middle line:90% This one's a bit more subtle. 01:57:08.060 --> 01:57:13.027 align:middle line:84% But why does matches.group 1 contain nothing? 01:57:13.027 --> 01:57:13.610 align:middle line:90% Wait a minute. 
01:57:13.610 --> 01:57:15.450 align:middle line:90% Let me-- maybe I did this wrong. 01:57:15.450 --> 01:57:17.707 align:middle line:90% Maybe-- maybe do we need the www? 01:57:17.707 --> 01:57:18.540 align:middle line:90% Let me run it again. 01:57:18.540 --> 01:57:24.740 align:middle line:84% So here we go. https://, let's add a www.twitter.com/davidjmalan. 01:57:24.740 --> 01:57:25.500 align:middle line:90% All right. 01:57:25.500 --> 01:57:26.470 align:middle line:90% Enter. 01:57:26.470 --> 01:57:28.550 align:middle line:90% Ho, ho, ho. 01:57:28.550 --> 01:57:31.170 align:middle line:90% What is going on? 01:57:31.170 --> 01:57:32.720 align:middle line:90% AUDIENCE: You have to say group 2. 01:57:32.720 --> 01:57:34.520 align:middle line:90% DAVID MALAN: I have to say group 2? 01:57:34.520 --> 01:57:39.140 align:middle line:84% Well, wait-- oh, right, because we had the subdomain was optional. 01:57:39.140 --> 01:57:42.560 align:middle line:84% And to make it optional, I needed to use parentheses here. 01:57:42.560 --> 01:57:44.070 align:middle line:90% And so I then said zero or one. 01:57:44.070 --> 01:57:44.570 align:middle line:90% OK. 01:57:44.570 --> 01:57:49.910 align:middle line:84% So that means that actually, I'm unintentionally but by design 01:57:49.910 --> 01:57:54.710 align:middle line:84% capturing the www dot, or none of it if it wasn't there before, 01:57:54.710 --> 01:57:56.645 align:middle line:84% but I have a second match over here because I 01:57:56.645 --> 01:57:58.020 align:middle line:90% have a second set of parentheses. 01:57:58.020 --> 01:58:00.350 align:middle line:84% So I think, yep, let me change matches.group 1 01:58:00.350 --> 01:58:02.300 align:middle line:90% to matches.group 2, and let's run this.
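The surprise here comes down to how groups are numbered. A small sketch, assuming the pattern as dictated and using a hypothetical helper `groups_for`, shows what each group holds: group 1 belongs to the optional www., so it is either "www." or None, and the username lands in group 2.

```python
import re

PATTERN = r"^https?://(www\.)?twitter\.com/(.+)$"


def groups_for(url):
    """Return (group 1, group 2) for a matching URL, or None if there is no match."""
    matches = re.search(PATTERN, url, re.IGNORECASE)
    if matches:
        return (matches.group(1), matches.group(2))
    return None


print(groups_for("https://twitter.com/davidjmalan"))      # (None, 'davidjmalan')
print(groups_for("https://www.twitter.com/davidjmalan"))  # ('www.', 'davidjmalan')
print(groups_for("https://www.google.com/"))              # None
```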
01:58:02.300 --> 01:58:07.460 align:middle line:84% Python of twitter.py https://www.twitter-- 01:58:07.460 --> 01:58:13.070 align:middle line:84% let's do this, twitter.com/davidjmalan, Enter, 01:58:13.070 --> 01:58:15.920 align:middle line:84% and now we've got access to the username. 01:58:15.920 --> 01:58:19.040 align:middle line:84% Let me go ahead and tighten it up a little bit further. 01:58:19.040 --> 01:58:21.513 align:middle line:90% If you like our new friend-- 01:58:21.513 --> 01:58:22.430 align:middle line:90% it's hard not to like. 01:58:22.430 --> 01:58:26.060 align:middle line:84% If we like our old friend the walrus operator, let's go ahead 01:58:26.060 --> 01:58:27.740 align:middle line:90% and add this just to tighten things up. 01:58:27.740 --> 01:58:31.460 align:middle line:84% Let me go back to VS Code here, and let me get rid of the unnecessary condition 01:58:31.460 --> 01:58:34.580 align:middle line:84% there and combine it up here, if matches equals that. 01:58:34.580 --> 01:58:38.090 align:middle line:84% But let's change the single assignment operator to the walrus operator. 01:58:38.090 --> 01:58:40.040 align:middle line:90% Now I've tightened things up further. 01:58:40.040 --> 01:58:43.940 align:middle line:84% But I bet, I bet, I bet there might be another solution here. 01:58:43.940 --> 01:58:50.630 align:middle line:84% And indeed, it turns out that we can come back to this final set of syntax. 01:58:50.630 --> 01:58:52.940 align:middle line:84% Recall that when we introduced these parentheses, 01:58:52.940 --> 01:58:56.720 align:middle line:84% we did it so that we could do A or B, for instance, with the vertical bar. 01:58:56.720 --> 01:58:59.060 align:middle line:84% Then you can even combine more than just one bar. 01:58:59.060 --> 01:59:02.900 align:middle line:84% We use the group to combine ideas like the www dot.
01:59:02.900 --> 01:59:07.760 align:middle line:84% And then there's this admittedly weird syntax at the bottom here, up until now 01:59:07.760 --> 01:59:08.690 align:middle line:90% not used. 01:59:08.690 --> 01:59:12.230 align:middle line:84% There is a non-capturing version of parentheses 01:59:12.230 --> 01:59:15.050 align:middle line:84% if you want to use parentheses logically because you need to, 01:59:15.050 --> 01:59:18.080 align:middle line:84% but you don't want to bother capturing the result. 01:59:18.080 --> 01:59:20.450 align:middle line:84% And this would arguably be a better solution 01:59:20.450 --> 01:59:23.630 align:middle line:84% here, because, yes, if I go back to VS Code, I do 01:59:23.630 --> 01:59:27.560 align:middle line:84% need to surround the www dot with parentheses, at least 01:59:27.560 --> 01:59:30.170 align:middle line:84% as I've written my regex here, because I wanted 01:59:30.170 --> 01:59:31.910 align:middle line:90% to put the question mark after it. 01:59:31.910 --> 01:59:35.120 align:middle line:84% But I don't need the www dot coming back. 01:59:35.120 --> 01:59:37.580 align:middle line:84% In fact, let's only extract the data we care about, 01:59:37.580 --> 01:59:40.280 align:middle line:84% just so there's no confusion down the road, for me, 01:59:40.280 --> 01:59:42.120 align:middle line:90% or my colleagues, or my teachers. 01:59:42.120 --> 01:59:43.860 align:middle line:90% So what could I do? 01:59:43.860 --> 01:59:48.800 align:middle line:84% Well, the syntax per this slide is to use a question mark and a colon 01:59:48.800 --> 01:59:51.410 align:middle line:90% immediately after the open parentheses. 01:59:51.410 --> 01:59:52.910 align:middle line:90% It looks weird admittedly. 
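The question-mark-and-colon syntax just described can be sketched as follows. The helper name `extract_username` is an assumption for illustration, and the walrus operator requires Python 3.8 or later. Because `(?:www\.)?` groups without capturing, the username becomes group 1 again.

```python
import re

# (?:...) still groups "www." so the trailing ? applies to all of it,
# but the group is not captured and does not get a number.
PATTERN = r"^https?://(?:www\.)?twitter\.com/(.+)$"


def extract_username(url):
    """Return the captured username, or None when the URL does not match."""
    if matches := re.search(PATTERN, url, re.IGNORECASE):
        return matches.group(1)
    return None


print(extract_username("https://twitter.com/davidjmalan"))      # davidjmalan
print(extract_username("https://www.twitter.com/davidjmalan"))  # davidjmalan
print(extract_username("https://www.google.com/"))              # None
```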
01:59:52.910 --> 01:59:55.040 align:middle line:84% Those of you who have prior programming experience 01:59:55.040 --> 01:59:59.300 align:middle line:84% might recognize the syntax from ternary operators, doing an if else all in one 01:59:59.300 --> 01:59:59.960 align:middle line:90% line. 01:59:59.960 --> 02:00:04.190 align:middle line:84% A question mark colon at the beginning of that parenthetical 02:00:04.190 --> 02:00:08.160 align:middle line:84% means, yes, I'm using parentheses to group these things together, 02:00:08.160 --> 02:00:11.640 align:middle line:84% but no, you do not need to capture them instead. 02:00:11.640 --> 02:00:15.500 align:middle line:84% So I can change my code back now to matches.group 1. 02:00:15.500 --> 02:00:18.260 align:middle line:84% I'll clear my screen here, run python of twitter.py. 02:00:18.260 --> 02:00:24.350 align:middle line:84% I'll again run here https://twitter.com/davidjmalan 02:00:24.350 --> 02:00:26.480 align:middle line:90% with or without the www. 02:00:26.480 --> 02:00:30.590 align:middle line:84% And now, I indeed get back that username. 02:00:30.590 --> 02:00:37.280 align:middle line:84% Any questions, then, on these final techniques? 02:00:37.280 --> 02:00:40.940 align:middle line:84% AUDIENCE: So first of all, could we move the ^ right 02:00:40.940 --> 02:00:44.270 align:middle line:84% at the beginning of Twitter, and then just start reading from there, 02:00:44.270 --> 02:00:49.700 align:middle line:84% and then get rid of everything else before that, the kind of www issues 02:00:49.700 --> 02:00:50.930 align:middle line:90% that we had? 02:00:50.930 --> 02:00:56.240 align:middle line:84% And then my second question is, how would we use kind of, I guess, 02:00:56.240 --> 02:01:01.640 align:middle line:84% either a list or a dictionary to sort the .com kind of thing, 02:01:01.640 --> 02:01:05.120 align:middle line:84% because we have .co.uk, and that kind of stuff. 
02:01:05.120 --> 02:01:08.330 align:middle line:84% How would we bring that into the re function? 02:01:08.330 --> 02:01:09.830 align:middle line:90% DAVID MALAN: A good question but no. 02:01:09.830 --> 02:01:15.560 align:middle line:84% If I move the ^ before twitter.com and throw away the protocol and the www, 02:01:15.560 --> 02:01:20.960 align:middle line:84% then the user is going to have to type in literally twitter.com/username. 02:01:20.960 --> 02:01:23.040 align:middle line:84% They can't even type in that other stuff. 02:01:23.040 --> 02:01:25.170 align:middle line:84% So that would be a regression, a step back. 02:01:25.170 --> 02:01:29.120 align:middle line:84% As for the .com, the .org, and .edu, and so forth, 02:01:29.120 --> 02:01:31.970 align:middle line:84% the short answer is there's many different solutions here. 02:01:31.970 --> 02:01:37.190 align:middle line:84% If I wanted to be stringent about .com-- and suppose that Twitter probably owns 02:01:37.190 --> 02:01:40.620 align:middle line:84% multiple domain names, even though they tend to use just this one. 02:01:40.620 --> 02:01:43.800 align:middle line:84% Suppose they have something like .org as well. 02:01:43.800 --> 02:01:47.810 align:middle line:84% You could use more parentheses here and do something like this-- com or org. 02:01:47.810 --> 02:01:50.270 align:middle line:84% I'd probably want to go in and add a question mark 02:01:50.270 --> 02:01:53.060 align:middle line:84% colon to make it non-capturing, because I don't care which 02:01:53.060 --> 02:01:55.100 align:middle line:90% it is, I just want to tolerate both. 02:01:55.100 --> 02:01:58.220 align:middle line:90% Alternatively, we could capture that. 02:01:58.220 --> 02:02:01.850 align:middle line:84% We could do something like this, where we do dot + so as 02:02:01.850 --> 02:02:03.410 align:middle line:90% to actually capture that. 02:02:03.410 --> 02:02:05.570 align:middle line:84% And then we could do something like this. 
02:02:05.570 --> 02:02:13.640 align:middle line:84% If matches.group 1 now equals equals com, then we could support this. 02:02:13.640 --> 02:02:18.020 align:middle line:84% So you could imagine factoring out the logic just by extracting the Top-Level 02:02:18.020 --> 02:02:21.410 align:middle line:84% Domain, or TLD, and then just using Python code, maybe a list, maybe 02:02:21.410 --> 02:02:24.860 align:middle line:84% a dictionary, to validate elsewhere, outside of the regex, 02:02:24.860 --> 02:02:26.780 align:middle line:90% if it's, in fact, what you expect. 02:02:26.780 --> 02:02:28.700 align:middle line:90% For now, though, we kept things simple. 02:02:28.700 --> 02:02:31.860 align:middle line:84% We focused only on the .com in this case. 02:02:31.860 --> 02:02:33.767 align:middle line:84% Let's make one final change to this program 02:02:33.767 --> 02:02:36.350 align:middle line:84% so that we're being a little more specific with the definition 02:02:36.350 --> 02:02:37.640 align:middle line:90% of a Twitter username. 02:02:37.640 --> 02:02:41.000 align:middle line:84% It turns out that we're being a little too generous over here, whereby we're 02:02:41.000 --> 02:02:43.280 align:middle line:90% accepting one or more of any character. 02:02:43.280 --> 02:02:45.050 align:middle line:90% I checked the documentation for Twitter. 02:02:45.050 --> 02:02:48.890 align:middle line:84% And Twitter only supports letters of the alphabet, a through Z, 02:02:48.890 --> 02:02:53.370 align:middle line:84% numbers 0 through 9, or underscores, so not just dot, 02:02:53.370 --> 02:02:55.020 align:middle line:90% which is literally anything. 02:02:55.020 --> 02:02:57.230 align:middle line:84% So let me go ahead and be more precise here. 02:02:57.230 --> 02:02:59.870 align:middle line:84% At the end of my string, let me go ahead and say, 02:02:59.870 --> 02:03:03.510 align:middle line:90% this set of symbols in square brackets. 
02:03:03.510 --> 02:03:08.058 align:middle line:84% I'm going to go ahead and say a through z, 0 through 9, and an underscore. 02:03:08.058 --> 02:03:10.100 align:middle line:84% Because, again, those are the only valid symbols. 02:03:10.100 --> 02:03:12.740 align:middle line:84% I don't need to bother with an uppercase A or a lowercase z, 02:03:12.740 --> 02:03:16.140 align:middle line:84% because we're using re.IGNORECASE over here. 02:03:16.140 --> 02:03:19.760 align:middle line:84% But I want to make sure now that I tolerate not only one or more 02:03:19.760 --> 02:03:24.260 align:middle line:84% of these symbols here but also maybe some other stuff at the end of the URL. 02:03:24.260 --> 02:03:27.710 align:middle line:84% I'm now going to be OK with there being a slash, or a question mark, 02:03:27.710 --> 02:03:31.730 align:middle line:84% or a hash at the end of the URL, all of which are valid symbols in a URL, 02:03:31.730 --> 02:03:34.130 align:middle line:84% but I know from Twitter's documentation, 02:03:34.130 --> 02:03:36.390 align:middle line:90% are not part of the username. 02:03:36.390 --> 02:03:36.890 align:middle line:90% All right. 02:03:36.890 --> 02:03:39.770 align:middle line:84% Now I'm going to go ahead and run python of twitter.py one 02:03:39.770 --> 02:03:46.610 align:middle line:84% final time, typing in https://twitter.com/davidjmalan, maybe 02:03:46.610 --> 02:03:48.320 align:middle line:90% with, maybe without a trailing slash. 02:03:48.320 --> 02:03:52.070 align:middle line:84% But hopefully, with my biggest fingers crossed here, I'm going to go ahead now 02:03:52.070 --> 02:03:56.630 align:middle line:84% and hit Enter, and thankfully my username is, indeed, davidjmalan. 02:03:56.630 --> 02:03:59.300 align:middle line:84% So what more is there in the world of regular expressions 02:03:59.300 --> 02:04:00.320 align:middle line:90% and this library?
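One way to sketch this final version is shown below. The transcript does not spell out exactly how the trailing slash, question mark, or hash is tolerated, so this sketch assumes the simplest option: restrict the username to `[a-z0-9_]+` and drop the `$` anchor so that anything after the username is ignored. The helper name `extract_username` is hypothetical.

```python
import re

# Assumption: no $ at the end, so "/", "?..." or "#..." may follow the username.
PATTERN = r"^https?://(?:www\.)?twitter\.com/([a-z0-9_]+)"


def extract_username(url):
    """Return only the letters, digits, and underscores after twitter.com/."""
    if matches := re.search(PATTERN, url, re.IGNORECASE):
        return matches.group(1)
    return None


print(extract_username("https://twitter.com/davidjmalan"))          # davidjmalan
print(extract_username("https://twitter.com/davidjmalan/"))         # davidjmalan
print(extract_username("https://twitter.com/davidjmalan?lang=en"))  # davidjmalan
print(extract_username("https://www.google.com/"))                  # None
```

A design note on this assumption: without the `$`, the pattern stops at the first character that is not a letter, digit, or underscore, whatever that character is. A stricter variant could require that the next character, if any, be one of the three tolerated symbols.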
02:04:00.320 --> 02:04:04.340 align:middle line:84% Not just re.search and also re.sub, there's other functions, too. 02:04:04.340 --> 02:04:07.850 align:middle line:84% There's re.split, via which you can split a string, not 02:04:07.850 --> 02:04:11.480 align:middle line:84% using a specific character or characters like a comma and a space, 02:04:11.480 --> 02:04:14.010 align:middle line:90% but multiple characters as well. 02:04:14.010 --> 02:04:16.550 align:middle line:84% And there's even functions like re.findall, 02:04:16.550 --> 02:04:20.540 align:middle line:84% which can allow you to search for multiple copies of the same pattern 02:04:20.540 --> 02:04:23.120 align:middle line:84% in different places in a string so that you can perhaps 02:04:23.120 --> 02:04:25.200 align:middle line:90% manipulate more than just one. 02:04:25.200 --> 02:04:28.820 align:middle line:84% So at the end of the day now, you've really learned a whole other language, 02:04:28.820 --> 02:04:31.700 align:middle line:84% like that of regular expressions, and we've used them in Python. 02:04:31.700 --> 02:04:35.670 align:middle line:84% But these regular expressions actually exist in so many languages, too, 02:04:35.670 --> 02:04:38.930 align:middle line:84% among them JavaScript, and Java, and Ruby, and more. 02:04:38.930 --> 02:04:42.300 align:middle line:84% So with this new language, even though it's admittedly cryptic 02:04:42.300 --> 02:04:45.050 align:middle line:84% when you use it for the first time, you have this newfound ability 02:04:45.050 --> 02:04:48.800 align:middle line:84% to express these patterns that, again, you can use to validate data, 02:04:48.800 --> 02:04:53.310 align:middle line:84% to clean up data, or even extract data, and from any data set 02:04:53.310 --> 02:04:54.470 align:middle line:90% you might have in mind. 02:04:54.470 --> 02:04:55.830 align:middle line:90% That's it for this week. 
02:04:55.830 --> 02:04:58.570 align:middle line:90% We will see you next time. 02:04:58.570 --> 02:05:00.000 align:middle line:90%