1 00:00:00,000 --> 00:00:02,796 [MUSIC PLAYING] 2 00:00:02,796 --> 00:00:05,600 3 00:00:05,600 --> 00:00:09,450 SPEAKER: Well, hello, one and all, and welcome to our short on capture groups. 4 00:00:09,450 --> 00:00:11,960 Now, those of you who have made international calls 5 00:00:11,960 --> 00:00:14,730 might be familiar with what's called a country calling code. 6 00:00:14,730 --> 00:00:19,170 And I have here three of them up above in a dictionary called locations. 7 00:00:19,170 --> 00:00:22,550 I have one, plus 1, which often involves numbers 8 00:00:22,550 --> 00:00:24,480 from the United States and Canada. 9 00:00:24,480 --> 00:00:28,520 I have one, plus 62, which often involves numbers from Indonesia, 10 00:00:28,520 --> 00:00:33,660 and one, plus 505, that involves numbers from Nicaragua. 11 00:00:33,660 --> 00:00:38,090 And I have down below here, in main, a program that in this case 12 00:00:38,090 --> 00:00:41,460 validates phone numbers internationally. 13 00:00:41,460 --> 00:00:44,060 So I have here a pattern that I'm going to look 14 00:00:44,060 --> 00:00:46,660 for within each of these phone numbers I'm 15 00:00:46,660 --> 00:00:50,340 going to enter into my program down below on line 7. 16 00:00:50,340 --> 00:00:52,680 In this case, notice what I'm expecting. 17 00:00:52,680 --> 00:00:56,510 I'm expecting the literal actual character plus here. 18 00:00:56,510 --> 00:00:59,930 I've escaped it with this backslash because plus has other 19 00:00:59,930 --> 00:01:02,030 meanings within regular expressions. 20 00:01:02,030 --> 00:01:06,930 I'm now expecting any kind of number, in this case, 0 through 9, 21 00:01:06,930 --> 00:01:09,550 between 1 and 3 times. 22 00:01:09,550 --> 00:01:13,810 So notice here, the country code for the US and Canada, that's plus 1. 23 00:01:13,810 --> 00:01:15,580 So only one number here. 24 00:01:15,580 --> 00:01:18,210 For Indonesia, it's two, 62. 25 00:01:18,210 --> 00:01:21,130 And for Nicaragua, it's three, 505.
26 00:01:21,130 --> 00:01:25,440 So between, in this case, one and three numbers following some plus. 27 00:01:25,440 --> 00:01:29,400 Thereafter, there will hopefully be a space in a valid number, 28 00:01:29,400 --> 00:01:34,560 and then there will be exactly three numbers, a dash, again, exactly three 29 00:01:34,560 --> 00:01:39,760 numbers, followed by a dash again, and then exactly four numbers. 30 00:01:39,760 --> 00:01:42,400 So this is the pattern we are looking for. 31 00:01:42,400 --> 00:01:45,630 And down below, on lines 9 through 13, well, this 32 00:01:45,630 --> 00:01:48,280 is the code doing that work for us. 33 00:01:48,280 --> 00:01:52,530 We've stored, within number, the user's phone number that they have entered, 34 00:01:52,530 --> 00:01:55,080 and we're going to check, using re.search, 35 00:01:55,080 --> 00:02:00,040 if we found a match for our pattern within the number string. 36 00:02:00,040 --> 00:02:03,790 If a match is returned to us, we'll print valid. 37 00:02:03,790 --> 00:02:06,780 If we don't, we'll print invalid. 38 00:02:06,780 --> 00:02:09,780 Let me go ahead, down below here, let me type "main" 39 00:02:09,780 --> 00:02:12,690 to ensure I call main when I run this program. 40 00:02:12,690 --> 00:02:15,820 And I'll go ahead and run Python of groups.py. 41 00:02:15,820 --> 00:02:19,620 And if I hit Enter, now I should be able to enter a number. 42 00:02:19,620 --> 00:02:21,780 I'll test one here, plus 1. 43 00:02:21,780 --> 00:02:26,230 And I'll type in that 617-495-1000. 44 00:02:26,230 --> 00:02:29,620 I'll hit Enter here, and we'll see that is valid. 45 00:02:29,620 --> 00:02:32,140 Maybe I'll try it, too, for Indonesia. 46 00:02:32,140 --> 00:02:36,030 I'll do plus 62, and I'll enter in my phone number again. 47 00:02:36,030 --> 00:02:38,800 And I'll hit Enter, and I'll see if that's valid as well. 48 00:02:38,800 --> 00:02:41,110 And just for good measure, I'll test Nicaragua.
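The validation step described above can be sketched roughly as follows. The exact pattern shown on screen is not part of the transcript, so the `PATTERN` string here is an assumption reconstructed from the spoken description (an escaped plus, one to three digits, a space, then three, three, and four digits separated by dashes). The helper name `validate` is hypothetical and takes the number as an argument instead of prompting for it.

```python
import re

# Assumed pattern, reconstructed from the spoken description:
# a literal plus, 1 to 3 digits, a space, then ddd-ddd-dddd.
PATTERN = r"\+\d{1,3} \d{3}-\d{3}-\d{4}"


def validate(number):
    """Return "Valid" if the pattern is found in number, else "Invalid"."""
    match = re.search(PATTERN, number)
    if match:
        return "Valid"
    return "Invalid"


print(validate("+1 617-495-1000"))    # Valid
print(validate("+62 617-495-1000"))   # Valid
print(validate("+505 617-495-1000"))  # Valid
print(validate("617-495-1000"))       # Invalid (no country calling code)
```

Note that `re.search` looks for the pattern anywhere in the string, so this sketch does not reject extra text before or after the number.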
49 00:02:41,110 --> 00:02:44,250 I'll do plus 505, and I'll type in this number, 50 00:02:44,250 --> 00:02:47,220 and we'll see that is valid as well. 51 00:02:47,220 --> 00:02:51,210 So it seems like our pattern is working, but there's 52 00:02:51,210 --> 00:02:54,930 more we could do with this program, I think, thanks to this feature 53 00:02:54,930 --> 00:02:57,280 called a capture group. 54 00:02:57,280 --> 00:03:00,330 Well, maybe what I want to do is not just share 55 00:03:00,330 --> 00:03:04,110 if this number is valid or invalid, but maybe tell 56 00:03:04,110 --> 00:03:08,830 somebody from what country, in this case, this number is calling. 57 00:03:08,830 --> 00:03:10,420 You can think of your phone. 58 00:03:10,420 --> 00:03:13,060 When it receives some call from an unknown number, 59 00:03:13,060 --> 00:03:16,790 it might at least tell you the location or the area that number is calling from. 60 00:03:16,790 --> 00:03:18,540 What if we could write the same thing here 61 00:03:18,540 --> 00:03:21,510 where people call us internationally and we show the user, 62 00:03:21,510 --> 00:03:24,690 in this case, what country they are calling from? 63 00:03:24,690 --> 00:03:27,420 Well, in this case, we don't want to just test 64 00:03:27,420 --> 00:03:31,240 to see if we find the pattern within our phone numbers here. 65 00:03:31,240 --> 00:03:34,210 We also want to extract some portion of it, 66 00:03:34,210 --> 00:03:37,260 in this case, the very first portion, the country calling 67 00:03:37,260 --> 00:03:41,670 code-- plus 505 for Nicaragua, plus 62 for Indonesia, 68 00:03:41,670 --> 00:03:44,520 or plus 1 for the US and Canada. 69 00:03:44,520 --> 00:03:49,740 But we run into a problem here if we were to use maybe simple string manipulation.
70 00:03:49,740 --> 00:03:53,190 If I enter in some number and was trying to extract, 71 00:03:53,190 --> 00:03:56,700 in this case, the country calling code, well, I wouldn't immediately 72 00:03:56,700 --> 00:04:01,500 know whether I should extract, in this case, the first two characters, plus 1, 73 00:04:01,500 --> 00:04:06,450 the first three characters, plus 62, or, in this case, the first four characters, 74 00:04:06,450 --> 00:04:07,990 plus 505. 75 00:04:07,990 --> 00:04:11,730 But thankfully, I can actually use regular expressions and capture 76 00:04:11,730 --> 00:04:16,829 groups to dynamically capture the portion of the content I'm looking for. 77 00:04:16,829 --> 00:04:21,540 Now, the way I can make a capture group is by using parentheses inside 78 00:04:21,540 --> 00:04:23,620 of a regular expression. 79 00:04:23,620 --> 00:04:27,480 So really I want to capture or extract, in this case, 80 00:04:27,480 --> 00:04:32,100 the country calling code, which we said the pattern exists for right here, 81 00:04:32,100 --> 00:04:36,870 a literal plus sign followed by one to three numbers. 82 00:04:36,870 --> 00:04:40,140 Now, I can encase this inside of parentheses, 83 00:04:40,140 --> 00:04:43,140 and this becomes my own capture group. 84 00:04:43,140 --> 00:04:46,810 But how could I maybe find the information I capture? 85 00:04:46,810 --> 00:04:51,510 Well, if I find a match here, turns out that this match object in Python 86 00:04:51,510 --> 00:04:54,720 comes with a method called group-- 87 00:04:54,720 --> 00:04:55,360 group. 88 00:04:55,360 --> 00:04:57,090 If I were to do-- 89 00:04:57,090 --> 00:05:00,400 let me do match.group. 90 00:05:00,400 --> 00:05:04,010 Well, this would help me find all of the capture groups 91 00:05:04,010 --> 00:05:07,670 I've actually implemented in my regular expression 92 00:05:07,670 --> 00:05:10,730 and extract them from this match.
93 00:05:10,730 --> 00:05:14,700 Because, let's say, this is the first capture group we have, 94 00:05:14,700 --> 00:05:16,950 I could go ahead and type 1 here. 95 00:05:16,950 --> 00:05:21,420 Capture groups, at least in Python, in this particular case, are one-indexed. 96 00:05:21,420 --> 00:05:25,400 So I'm saying here, if there is a match, go ahead and give me, 97 00:05:25,400 --> 00:05:30,530 in this case, the result that I found within the first capture group. 98 00:05:30,530 --> 00:05:34,410 I'll go ahead and store this in a variable called country_code. 99 00:05:34,410 --> 00:05:37,190 And I could-- instead of printing valid here, 100 00:05:37,190 --> 00:05:40,430 maybe I'll go ahead and print something like country_code 101 00:05:40,430 --> 00:05:42,750 and see what we can find. 102 00:05:42,750 --> 00:05:46,310 Well, I'll go ahead and run Python of groups.py, 103 00:05:46,310 --> 00:05:52,070 and I'll go ahead and do plus 1, followed by my phone number, hit Enter. 104 00:05:52,070 --> 00:05:53,970 And now we'll see plus 1. 105 00:05:53,970 --> 00:05:58,730 So it seems like we extracted, in this case, the portion of our content 106 00:05:58,730 --> 00:06:02,643 that matched the pattern within these parentheses. 107 00:06:02,643 --> 00:06:04,060 I'll try it with another one here. 108 00:06:04,060 --> 00:06:08,040 I'll do Python of groups.py plus 62. 109 00:06:08,040 --> 00:06:09,640 And we'll see plus 62. 110 00:06:09,640 --> 00:06:11,070 So it is dynamic. 111 00:06:11,070 --> 00:06:14,130 And it's not looking for the first two characters all the time 112 00:06:14,130 --> 00:06:16,630 or the first three characters all the time. 113 00:06:16,630 --> 00:06:18,340 It's looking for this pattern. 114 00:06:18,340 --> 00:06:22,890 And when it finds it, it's returning it to us as appropriate. 115 00:06:22,890 --> 00:06:25,240 Now, what else could we do with this? 116 00:06:25,240 --> 00:06:27,940 Well, country_code literally is a string here.
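As a rough illustration of the capture group just described, the sketch below wraps the country calling code portion of the pattern in parentheses and reads it back with `match.group(1)`. The pattern string is an assumption based on the spoken description, and the helper name `extract_country_code` is hypothetical. A fixed-width slice is included only for contrast.

```python
import re

# Assumed pattern: the parentheses around the plus and the 1 to 3 digits
# form the first (and only) capture group.
PATTERN = r"(\+\d{1,3}) \d{3}-\d{3}-\d{4}"


def extract_country_code(number):
    """Return the text matched by capture group 1, or None if no match."""
    match = re.search(PATTERN, number)
    if match:
        return match.group(1)
    return None


print(extract_country_code("+1 617-495-1000"))   # +1
print(extract_country_code("+62 617-495-1000"))  # +62

# For contrast: a fixed-width slice cannot adapt to the code's length.
print("+62 617-495-1000"[:2])                    # +6
```

In Python, `match.group(0)` refers to the entire match, which is why the first capture group is number 1.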
117 00:06:27,940 --> 00:06:32,010 So if I wanted to, in this case, find the country somebody 118 00:06:32,010 --> 00:06:35,220 is calling from based on their country calling code, well, 119 00:06:35,220 --> 00:06:40,660 I could perhaps use country_code as the key for this dictionary here. 120 00:06:40,660 --> 00:06:46,180 I could type locations bracket country_code. 121 00:06:46,180 --> 00:06:51,810 And because each of these country calling codes is a key in my dictionary, 122 00:06:51,810 --> 00:06:55,620 I should hopefully find, in this case, the actual location 123 00:06:55,620 --> 00:06:57,100 they are calling from. 124 00:06:57,100 --> 00:06:58,300 Let's try this out. 125 00:06:58,300 --> 00:07:02,890 I'll run Python of groups.py, and I'll now type-- oops. 126 00:07:02,890 --> 00:07:08,350 I'll now type-- let's do plus 1 again, followed by the number, hit Enter. 127 00:07:08,350 --> 00:07:10,950 And now we'll see United States and Canada. 128 00:07:10,950 --> 00:07:12,130 I could do this again. 129 00:07:12,130 --> 00:07:17,130 I could try, let's say, a plus 62 and number again-- 130 00:07:17,130 --> 00:07:18,240 Indonesia. 131 00:07:18,240 --> 00:07:23,320 I'll now try plus 505, and I'll see Nicaragua. 132 00:07:23,320 --> 00:07:25,590 So the capture group here is doing the work 133 00:07:25,590 --> 00:07:28,440 of finding the portion of our content that matches 134 00:07:28,440 --> 00:07:31,590 some pattern we were looking for. 135 00:07:31,590 --> 00:07:34,950 Well, I think we've really seen a lot of what this can do for us, 136 00:07:34,950 --> 00:07:38,250 but there is one more feature to take a look at. 137 00:07:38,250 --> 00:07:45,240 Here, notice how on line 11 I am really using indices, indexes, to find 138 00:07:45,240 --> 00:07:48,100 the capture group I'm looking for.
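The dictionary lookup described above might look something like this. The `locations` keys and values follow what is said in the short; the pattern string and the helper name `locate` are assumptions made for illustration.

```python
import re

# Country calling codes mentioned in the short, keyed by the captured text.
locations = {
    "+1": "United States and Canada",
    "+62": "Indonesia",
    "+505": "Nicaragua",
}

# Assumed pattern, with the country calling code as capture group 1.
PATTERN = r"(\+\d{1,3}) \d{3}-\d{3}-\d{4}"


def locate(number):
    """Return the location for number's country calling code, or "Invalid"."""
    match = re.search(PATTERN, number)
    if match:
        country_code = match.group(1)
        return locations[country_code]
    return "Invalid"


print(locate("+1 617-495-1000"))    # United States and Canada
print(locate("+62 617-495-1000"))   # Indonesia
print(locate("+505 617-495-1000"))  # Nicaragua
```

With only three keys in the dictionary, a well-formed number with any other code (say, plus 44) would raise a `KeyError` in this sketch; `locations.get(country_code)` is one way to handle that case.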
139 00:07:48,100 --> 00:07:52,380 But a more complex regular expression might involve more than one 140 00:07:52,380 --> 00:07:56,430 capture group, could involve up to, I don't know, more than one, 141 00:07:56,430 --> 00:07:58,390 two, three, four, could get up to 10. 142 00:07:58,390 --> 00:08:01,410 However many it is, it can be helpful to have a better way 143 00:08:01,410 --> 00:08:04,210 to refer to these capture groups. 144 00:08:04,210 --> 00:08:07,620 So if this capture group has some particular meaning to it, 145 00:08:07,620 --> 00:08:10,770 I could actually give it a name to refer to later 146 00:08:10,770 --> 00:08:13,630 on within the regular expression. 147 00:08:13,630 --> 00:08:16,410 And the way I do this is with the following syntax. 148 00:08:16,410 --> 00:08:22,290 Within my capture group, after the first parenthesis, I can type question mark, capital P, 149 00:08:22,290 --> 00:08:26,310 and then open bracket, close bracket, or, in this case, less than 150 00:08:26,310 --> 00:08:31,690 sign, greater than sign, with some name for this capture group in between. 151 00:08:31,690 --> 00:08:35,760 I could call this country_code just like this. 152 00:08:35,760 --> 00:08:39,780 So now this pattern here and the capture group 153 00:08:39,780 --> 00:08:44,610 has a name I can refer to later to extract it with. 154 00:08:44,610 --> 00:08:49,170 Down here on line 11, I could, in this case, use 1, 155 00:08:49,170 --> 00:08:52,230 but now I could actually make use of country_code, the name I 156 00:08:52,230 --> 00:08:54,820 gave for this particular capture group. 157 00:08:54,820 --> 00:08:57,960 And I could type in country_code just like this, 158 00:08:57,960 --> 00:09:02,910 which will say, find for me, in this case, the capture group that I named 159 00:09:02,910 --> 00:09:05,830 country_code and use that instead. 160 00:09:05,830 --> 00:09:08,610 I'll type Python of groups.py. 161 00:09:08,610 --> 00:09:11,940 I'll go ahead and type plus 1, same number here.
162 00:09:11,940 --> 00:09:14,670 And now we'll see United States and Canada. 163 00:09:14,670 --> 00:09:19,020 Seems to work but is now a little more readable, since we can name something 164 00:09:19,020 --> 00:09:22,350 that we might later hope to capture in our programs. 165 00:09:22,350 --> 00:09:25,690 So this was our brief foray into capture groups. 166 00:09:25,690 --> 00:09:27,130 And this was our short. 167 00:09:27,130 --> 00:09:29,810 We'll see you next time. 168 00:09:29,810 --> 00:09:31,000
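The named capture group walked through above can be sketched as follows. Python's syntax for this is `(?P<name>...)`, with a capital P. The rest of the pattern and the helper name `locate_by_name` are assumptions reconstructed from the spoken description.

```python
import re

locations = {
    "+1": "United States and Canada",
    "+62": "Indonesia",
    "+505": "Nicaragua",
}

# Assumed pattern: the same capture group as before, now named country_code.
PATTERN = r"(?P<country_code>\+\d{1,3}) \d{3}-\d{3}-\d{4}"


def locate_by_name(number):
    """Look up the location using the named group instead of its index."""
    match = re.search(PATTERN, number)
    if match:
        country_code = match.group("country_code")
        return locations[country_code]
    return "Invalid"


print(locate_by_name("+1 617-495-1000"))  # United States and Canada
```

A named group still keeps its position, so `match.group(1)` would return the same text here; the name simply makes the intent easier to read.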