WEBVTT X-TIMESTAMP-MAP=LOCAL:00:00:00.000,MPEGTS:900000 00:00:00.000 --> 00:00:02.796 [MUSIC PLAYING] 00:00:05.600 --> 00:00:09.450 SPEAKER: Well, hello, one and all, and welcome to our short on capture groups. 00:00:09.450 --> 00:00:11.960 Now, those of you who have made international calls 00:00:11.960 --> 00:00:14.730 might be familiar with what's called a country calling code. 00:00:14.730 --> 00:00:19.170 And I have here three of them up above in a dictionary called locations. 00:00:19.170 --> 00:00:22.550 I have plus 1, which often involves numbers 00:00:22.550 --> 00:00:24.480 from the United States and Canada. 00:00:24.480 --> 00:00:28.520 I have plus 62, which often involves numbers from Indonesia, 00:00:28.520 --> 00:00:33.660 and plus 505 that involves numbers from Nicaragua. 00:00:33.660 --> 00:00:38.090 And I have down below here, in main, a program that in this case 00:00:38.090 --> 00:00:41.460 validates phone numbers internationally. 00:00:41.460 --> 00:00:44.060 So I have here a pattern that I'm going to look 00:00:44.060 --> 00:00:46.660 for within each of these phone numbers I'm 00:00:46.660 --> 00:00:50.340 going to enter into my program down below on line 7. 00:00:50.340 --> 00:00:52.680 In this case, notice what I'm expecting. 00:00:52.680 --> 00:00:56.510 I'm expecting the literal actual character plus here. 00:00:56.510 --> 00:00:59.930 I've escaped it with this backslash because plus has other 00:00:59.930 --> 00:01:02.030 meanings within regular expressions. 00:01:02.030 --> 00:01:06.930 I'm now expecting any kind of number, in this case, 0 through 9, 00:01:06.930 --> 00:01:09.550 between 1 and 3 times. 00:01:09.550 --> 00:01:13.810 So notice here, the country code for the US and Canada, that's plus 1. 00:01:13.810 --> 00:01:15.580 So only one number here. 00:01:15.580 --> 00:01:18.210 For Indonesia, it's two, 62. 00:01:18.210 --> 00:01:21.130 And for Nicaragua, it's three, 505.
00:01:21.130 --> 00:01:25.440 So between, in this case, one and three numbers following some plus. 00:01:25.440 --> 00:01:29.400 Thereafter, there will hopefully be a space in a valid number, 00:01:29.400 --> 00:01:34.560 and then there will be exactly three numbers, a dash, again, exactly three 00:01:34.560 --> 00:01:39.760 numbers, followed by a dash again, and then exactly four numbers. 00:01:39.760 --> 00:01:42.400 So this is the pattern we are looking for. 00:01:42.400 --> 00:01:45.630 And down below, on lines 9 through 13, well, this 00:01:45.630 --> 00:01:48.280 is the code doing that work for us. 00:01:48.280 --> 00:01:52.530 We've stored, within number, the user's phone number that they have entered, 00:01:52.530 --> 00:01:55.080 and we're going to check, using re.search, 00:01:55.080 --> 00:02:00.040 if we found a match for our pattern within the number string. 00:02:00.040 --> 00:02:03.790 If we do have a match returned to us, we'll print valid. 00:02:03.790 --> 00:02:06.780 If we don't, we'll print invalid. 00:02:06.780 --> 00:02:09.780 Let me go ahead, down below here, let me type "main" 00:02:09.780 --> 00:02:12.690 to ensure I call main when I run this program. 00:02:12.690 --> 00:02:15.820 And I'll go ahead and run Python of groups.py. 00:02:15.820 --> 00:02:19.620 And if I hit Enter, now I should be able to enter a number. 00:02:19.620 --> 00:02:21.780 I'll test one here, plus 1. 00:02:21.780 --> 00:02:26.230 And I'll type in that 617-495-1000. 00:02:26.230 --> 00:02:29.620 I'll hit Enter here, and we'll see that is valid. 00:02:29.620 --> 00:02:32.140 Maybe I'll try it, too, for Indonesia. 00:02:32.140 --> 00:02:36.030 I'll do plus 62, and I'll enter in my phone number again. 00:02:36.030 --> 00:02:38.800 And I'll hit Enter, and I'll see that's valid as well. 00:02:38.800 --> 00:02:41.110 And just for good measure, I'll test Nicaragua.
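The validation step narrated so far can be sketched roughly as follows. This is a reconstruction from the narration, not the exact groups.py shown on screen: the pattern string, the function name validate, the capitalized return values, and the use of [0-9] for "0 through 9" are all assumptions, and fixed sample numbers stand in for the typed input.

```python
import re

# Assumed pattern, reconstructed from the narration: a literal plus sign,
# one to three digits, a space, then digits grouped 3-3-4 with dashes.
PATTERN = r"\+[0-9]{1,3} [0-9]{3}-[0-9]{3}-[0-9]{4}"


def validate(number):
    """Return "Valid" if the pattern is found in number, else "Invalid"."""
    match = re.search(PATTERN, number)  # a Match object, or None
    return "Valid" if match else "Invalid"


# Sample numbers in place of interactive input.
for sample in ["+1 617-495-1000", "+62 617-495-1000", "+505 617-495-1000"]:
    print(sample, "->", validate(sample))
```

The backslash before the plus matters: an unescaped plus is a quantifier meaning "one or more of the preceding item", so it would not match a literal plus sign.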
00:02:41.110 --> 00:02:44.250 I'll do plus 505, and I'll type in this number, 00:02:44.250 --> 00:02:47.220 and we'll see that is valid as well. 00:02:47.220 --> 00:02:51.210 So it seems like our pattern is working, but there's 00:02:51.210 --> 00:02:54.930 more we could do with this program, I think, thanks to this feature 00:02:54.930 --> 00:02:57.280 called a capture group. 00:02:57.280 --> 00:03:00.330 Well, maybe what I want to do is not just share 00:03:00.330 --> 00:03:04.110 if this number is valid or invalid, but maybe tell 00:03:04.110 --> 00:03:08.830 somebody from what country, in this case, this number is calling from. 00:03:08.830 --> 00:03:10.420 You can think of your phone. 00:03:10.420 --> 00:03:13.060 When it receives some call from an unknown number, 00:03:13.060 --> 00:03:16.790 it might at least tell you the location or the area that number is calling from. 00:03:16.790 --> 00:03:18.540 What if we could write the same thing here 00:03:18.540 --> 00:03:21.510 where people call us internationally and we show the user, 00:03:21.510 --> 00:03:24.690 in this case, what country they are calling from? 00:03:24.690 --> 00:03:27.420 Well, in this case, we don't want to just test 00:03:27.420 --> 00:03:31.240 to see if we find the pattern within our phone numbers here. 00:03:31.240 --> 00:03:34.210 We also want to extract some portion of it, 00:03:34.210 --> 00:03:37.260 in this case, the very first portion, the country calling 00:03:37.260 --> 00:03:41.670 code-- plus 505 for Nicaragua, plus 62 for Indonesia, 00:03:41.670 --> 00:03:44.520 or plus 1 for the US and Canada. 00:03:44.520 --> 00:03:49.740 But we run into a problem here if we were to use maybe a simple string manipulation.
00:03:49.740 --> 00:03:53.190 If I enter in some number and was trying to extract, 00:03:53.190 --> 00:03:56.700 in this case, the country calling code, well, I wouldn't immediately 00:03:56.700 --> 00:04:01.500 know whether I should extract, in this case, the first two characters, plus 1, 00:04:01.500 --> 00:04:06.450 the first three characters, plus 62, or, in this case, the first four characters, 00:04:06.450 --> 00:04:07.990 plus 505. 00:04:07.990 --> 00:04:11.730 But thankfully, I can actually use regular expressions and capture 00:04:11.730 --> 00:04:16.829 groups to dynamically capture the portion of the content I'm looking for. 00:04:16.829 --> 00:04:21.540 Now, the way I can make a capture group is by using parentheses inside 00:04:21.540 --> 00:04:23.620 of a regular expression. 00:04:23.620 --> 00:04:27.480 So really I want to capture or extract, in this case, 00:04:27.480 --> 00:04:32.100 the country calling code, which we said the pattern exists for right here, 00:04:32.100 --> 00:04:36.870 a literal plus sign followed by one to three numbers. 00:04:36.870 --> 00:04:40.140 Now, I can encase this inside of parentheses, 00:04:40.140 --> 00:04:43.140 and this becomes my own capture group. 00:04:43.140 --> 00:04:46.810 But how could I maybe find the information I capture? 00:04:46.810 --> 00:04:51.510 Well, if I find a match here, turns out that this match object in Python 00:04:51.510 --> 00:04:54.720 comes with a method called group-- 00:04:54.720 --> 00:04:55.360 group. 00:04:55.360 --> 00:04:57.090 If I were to do-- 00:04:57.090 --> 00:05:00.400 let me do match.group. 00:05:00.400 --> 00:05:04.010 Well, this would help me find all of the capture groups 00:05:04.010 --> 00:05:07.670 I've actually implemented in my regular expression 00:05:07.670 --> 00:05:10.730 and extract them from this match. 00:05:10.730 --> 00:05:14.700 Because, let's say, this is the first capture group we have, 00:05:14.700 --> 00:05:16.950 I could go ahead and type 1 here.
00:05:16.950 --> 00:05:21.420 Capture groups, at least in Python, in this particular case, are one-indexed. 00:05:21.420 --> 00:05:25.400 So I'm saying here, if there is a match, go ahead and give me, 00:05:25.400 --> 00:05:30.530 in this case, the result that I found within the first capture group. 00:05:30.530 --> 00:05:34.410 I'll go ahead and store this in a variable called country_code. 00:05:34.410 --> 00:05:37.190 And I could-- instead of printing valid here, 00:05:37.190 --> 00:05:40.430 maybe I'll go ahead and print something like country_code 00:05:40.430 --> 00:05:42.750 and see what we can find. 00:05:42.750 --> 00:05:46.310 Well, I'll go ahead and run Python of groups.py, 00:05:46.310 --> 00:05:52.070 and I'll go ahead and do plus 1, followed by my phone number, hit Enter. 00:05:52.070 --> 00:05:53.970 And now we'll see plus 1. 00:05:53.970 --> 00:05:58.730 So it seems like we extracted, in this case, the portion of our content 00:05:58.730 --> 00:06:02.643 that matched the pattern within these parentheses. 00:06:02.643 --> 00:06:04.060 I'll try it with another one here. 00:06:04.060 --> 00:06:08.040 I'll do Python of groups.py plus 62. 00:06:08.040 --> 00:06:09.640 And we'll see plus 62. 00:06:09.640 --> 00:06:11.070 So it is dynamic. 00:06:11.070 --> 00:06:14.130 And it's not looking for the first two characters all the time 00:06:14.130 --> 00:06:16.630 or the first three characters all the time. 00:06:16.630 --> 00:06:18.340 It's looking for this pattern. 00:06:18.340 --> 00:06:22.890 And when it finds it, it's returning it to us as appropriate. 00:06:22.890 --> 00:06:25.240 Now, what else could we do with this? 00:06:25.240 --> 00:06:27.940 Well, country_code literally is a string here.
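The extraction just demonstrated might look something like the sketch below. The pattern string and the helper name extract_country_code are assumptions made for illustration; what comes from the narration is the parenthesized group around the plus sign and digits, and the call to match.group(1).

```python
import re

# Parentheses around the plus sign and its one to three digits make that
# part of the pattern capture group 1 (groups are numbered from 1).
PATTERN = r"(\+[0-9]{1,3}) [0-9]{3}-[0-9]{3}-[0-9]{4}"


def extract_country_code(number):
    """Return the captured country calling code, or None if no match."""
    match = re.search(PATTERN, number)
    if match:
        return match.group(1)  # text matched by the first capture group
    return None


for sample in ["+1 617-495-1000", "+62 617-495-1000", "+505 617-495-1000"]:
    print(extract_country_code(sample))
```

A fixed slice such as number[:2] would return "+1" correctly but "+6" for the Indonesian number, which is why the group, rather than a character count, decides where the code ends.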
00:06:27.940 --> 00:06:32.010 So if I wanted to, in this case, find the country somebody 00:06:32.010 --> 00:06:35.220 is calling from based on their country calling code, well, 00:06:35.220 --> 00:06:40.660 I could perhaps use country_code as the key for this dictionary here. 00:06:40.660 --> 00:06:46.180 I could type locations bracket country_code. 00:06:46.180 --> 00:06:51.810 And because each of these country calling codes is a key in my dictionary, 00:06:51.810 --> 00:06:55.620 I should hopefully find, in this case, the actual location 00:06:55.620 --> 00:06:57.100 they are calling from. 00:06:57.100 --> 00:06:58.300 Let's try this out. 00:06:58.300 --> 00:07:02.890 I'll run Python of groups.py, and I'll now type-- oops. 00:07:02.890 --> 00:07:08.350 I'll now type-- let's do plus 1 again, followed by the number, hit Enter. 00:07:08.350 --> 00:07:10.950 And now we'll see United States and Canada. 00:07:10.950 --> 00:07:12.130 I could do this again. 00:07:12.130 --> 00:07:17.130 I could try, let's say, a plus 62 and number again-- 00:07:17.130 --> 00:07:18.240 Indonesia. 00:07:18.240 --> 00:07:23.320 I'll now try plus 505, and I'll see Nicaragua. 00:07:23.320 --> 00:07:25.590 So the capture group here is doing the work 00:07:25.590 --> 00:07:28.440 of finding the portion of our content that matches 00:07:28.440 --> 00:07:31.590 some pattern we were looking for. 00:07:31.590 --> 00:07:34.950 Well, I think we've really seen a lot of what this can do for us, 00:07:34.950 --> 00:07:38.250 but there is one more feature to take a look at. 00:07:38.250 --> 00:07:45.240 Here, notice how on line 11 I am really using indices, indexes, to find 00:07:45.240 --> 00:07:48.100 the capture group I'm looking for.
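The dictionary lookup described above can be sketched as follows. The dictionary contents follow the narration, but the exact string values, the pattern, and the helper name find_location are assumptions. The narration indexes with locations[country_code]; the membership check here is an added precaution, since a well-formed number with a code missing from the dictionary would otherwise raise KeyError.

```python
import re

# Country calling codes as keys, per the narration; values are assumed.
locations = {
    "+1": "United States and Canada",
    "+62": "Indonesia",
    "+505": "Nicaragua",
}

PATTERN = r"(\+[0-9]{1,3}) [0-9]{3}-[0-9]{3}-[0-9]{4}"


def find_location(number):
    """Map a phone number to a location via its captured country code."""
    match = re.search(PATTERN, number)
    if not match:
        return "Invalid"
    country_code = match.group(1)  # a plain string such as "+62"
    if country_code in locations:
        return locations[country_code]  # the lookup from the narration
    return "Unknown"  # assumed fallback, not part of the narration


print(find_location("+505 617-495-1000"))
```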
00:07:48.100 --> 00:07:52.380 But a more complex regular expression might involve more than one 00:07:52.380 --> 00:07:56.430 capture group, could involve up to, I don't know, more than one, 00:07:56.430 --> 00:07:58.390 two, three, four, could get up to 10. 00:07:58.390 --> 00:08:01.410 However many it is, it can be helpful to have a better way 00:08:01.410 --> 00:08:04.210 to refer to these capture groups. 00:08:04.210 --> 00:08:07.620 So if this capture group has some particular meaning to it, 00:08:07.620 --> 00:08:10.770 I could actually give it a name to refer to later 00:08:10.770 --> 00:08:13.630 on within the regular expression. 00:08:13.630 --> 00:08:16.410 And the way I do this is with the following syntax. 00:08:16.410 --> 00:08:22.290 Within my capture group, after the first parenthesis, I can type question mark, capital P, 00:08:22.290 --> 00:08:26.310 and then open bracket close bracket, or, in this case, less than 00:08:26.310 --> 00:08:31.690 sign, greater than sign, with some name for this capture group between them. 00:08:31.690 --> 00:08:35.760 I could call this country_code just like this. 00:08:35.760 --> 00:08:39.780 So now this pattern here and the capture group 00:08:39.780 --> 00:08:44.610 has a name I can refer to later to extract it. 00:08:44.610 --> 00:08:49.170 Down here on line 11, I could, in this case, use 1, 00:08:49.170 --> 00:08:52.230 but now I could actually make use of country_code, the name I 00:08:52.230 --> 00:08:54.820 gave for this particular capture group. 00:08:54.820 --> 00:08:57.960 And I could type in country_code just like this, 00:08:57.960 --> 00:09:02.910 which will say, find for me, in this case, the capture group that I named 00:09:02.910 --> 00:09:05.830 country_code and use that instead. 00:09:05.830 --> 00:09:08.610 I'll type Python of groups.py. 00:09:08.610 --> 00:09:11.940 I'll go ahead and type plus 1, same number here. 00:09:11.940 --> 00:09:14.670 And now we'll see United States and Canada.
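The named-group syntax just described can be sketched as below. The (?P<name>...) form and group("country_code") are standard features of Python's re module; the surrounding pattern, the dictionary values, and the helper name find_location are assumptions carried over for illustration.

```python
import re

locations = {
    "+1": "United States and Canada",
    "+62": "Indonesia",
    "+505": "Nicaragua",
}

# (?P<country_code>...) is an ordinary capture group that also has a name.
PATTERN = r"(?P<country_code>\+[0-9]{1,3}) [0-9]{3}-[0-9]{3}-[0-9]{4}"


def find_location(number):
    """Look up the location using the named capture group."""
    match = re.search(PATTERN, number)
    if not match:
        return "Invalid"
    country_code = match.group("country_code")  # by name instead of by index
    return locations.get(country_code, "Unknown")  # assumed fallback


print(find_location("+1 617-495-1000"))
```

A named group keeps its position number as well, so match.group(1) would still return the same text here.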
00:09:14.670 --> 00:09:19.020 Seems to work but is now a little more readable, since we can name something 00:09:19.020 --> 00:09:22.350 that we might later hope to capture in our programs. 00:09:22.350 --> 00:09:25.690 So this was our brief foray into capture groups. 00:09:25.690 --> 00:09:27.130 And this was our short. 00:09:27.130 --> 00:09:29.810 We'll see you next time.