WEBVTT
X-TIMESTAMP-MAP=LOCAL:00:00:00.000,MPEGTS:900000

00:00:00.000 --> 00:00:02.796
[MUSIC PLAYING]

00:00:05.600 --> 00:00:09.450
SPEAKER: Well, hello, one and all, and
welcome to our short on capture groups.

00:00:09.450 --> 00:00:11.960
Now, those of you who have
made international calls

00:00:11.960 --> 00:00:14.730
might be familiar with what's
called a country calling code.

00:00:14.730 --> 00:00:19.170
And I have here three of them up above
in a dictionary called locations.

00:00:19.170 --> 00:00:22.550
I have one, plus 1, which
often involves numbers

00:00:22.550 --> 00:00:24.480
from the United States and Canada.

00:00:24.480 --> 00:00:28.520
I have one, plus 62, which often
involves numbers from Indonesia,

00:00:28.520 --> 00:00:33.660
and one, plus 505, that involves
numbers from Nicaragua.

00:00:33.660 --> 00:00:38.090
And I have down below here, in
main, a program that in this case

00:00:38.090 --> 00:00:41.460
validates phone numbers internationally.

00:00:41.460 --> 00:00:44.060
So I have here a pattern
that I'm going to look

00:00:44.060 --> 00:00:46.660
for within each of
these phone numbers I'm

00:00:46.660 --> 00:00:50.340
going to enter into my
program down below on line 7.

00:00:50.340 --> 00:00:52.680
In this case, notice what I'm expecting.

00:00:52.680 --> 00:00:56.510
I'm expecting the literal
actual character plus here.

00:00:56.510 --> 00:00:59.930
I've escaped it with this backslash
because plus has other

00:00:59.930 --> 00:01:02.030
meanings within regular expressions.

00:01:02.030 --> 00:01:06.930
I'm now expecting any kind of
number, in this case, 0 through 9,

00:01:06.930 --> 00:01:09.550
between 1 and 3 times.

00:01:09.550 --> 00:01:13.810
So notice here, the country code for
the US and Canada, that's plus 1.

00:01:13.810 --> 00:01:15.580
So only one number here.

00:01:15.580 --> 00:01:18.210
For Indonesia, it's two, 62.

00:01:18.210 --> 00:01:21.130
And for Nicaragua, it's three, 505.

00:01:21.130 --> 00:01:25.440
So between, in this case, one and
three numbers following some plus.

00:01:25.440 --> 00:01:29.400
Thereafter, there will hopefully
be a space for this valid number,

00:01:29.400 --> 00:01:34.560
and then there will be exactly three
numbers, a dash, again, exactly three

00:01:34.560 --> 00:01:39.760
numbers, followed by a dash again,
and then exactly four numbers.

00:01:39.760 --> 00:01:42.400
So this is the pattern
we are looking for.
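
NOTE
A minimal sketch of the pattern just described. The exact on-screen code is
not part of this transcript, so the pattern string below is an assumed
reconstruction (it uses \d where the speaker says "0 through 9").
```python
import re
# Assumed reconstruction of the spoken pattern, piece by piece:
#   \+        a literal plus sign (escaped, because + is a quantifier)
#   \d{1,3}   one to three digits: the country calling code
#   (space)   a single space
#   \d{3}-    exactly three digits, then a dash
#   \d{3}-    exactly three digits, then a dash
#   \d{4}     exactly four digits
pattern = r"\+\d{1,3} \d{3}-\d{3}-\d{4}"
for number in ["+1 617-495-1000", "+62 617-495-1000", "+505 617-495-1000", "617-495-1000"]:
    # re.search returns a match object or None; bool() turns that into True/False
    print(number, bool(re.search(pattern, number)))
```
The last sample has no plus sign or country code, so it should not match.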

00:01:42.400 --> 00:01:45.630
And down below, on lines
9 through 13, well, this

00:01:45.630 --> 00:01:48.280
is the code doing that work for us.

00:01:48.280 --> 00:01:52.530
We've stored, within number, the user's
phone number that they have entered,

00:01:52.530 --> 00:01:55.080
and we're going to
check, using re.search,

00:01:55.080 --> 00:02:00.040
if we found a match for our
pattern within the number string.

00:02:00.040 --> 00:02:03.790
If we do have a match returned
to us, we'll print valid.

00:02:03.790 --> 00:02:06.780
If we don't, we'll print invalid.
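
NOTE
A sketch of the valid/invalid check described here, assuming the same
reconstructed pattern. The helper name check is hypothetical, and it returns
the message instead of reading from input() so it can be run non-interactively.
```python
import re
# Assumed reconstruction of the pattern from the video
pattern = r"\+\d{1,3} \d{3}-\d{3}-\d{4}"
def check(number):
    # re.search gives back a match object on success and None on failure
    match = re.search(pattern, number)
    if match:
        return "Valid"
    return "Invalid"
print(check("+1 617-495-1000"))
print(check("617-495-1000"))
```
In the program from the video, the string would come from the user and the
result would be printed inside main.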

00:02:06.780 --> 00:02:09.780
Let me go ahead, down below
here, let me type "main"

00:02:09.780 --> 00:02:12.690
to ensure I call main
when I run this program.

00:02:12.690 --> 00:02:15.820
And I'll go ahead and
run Python of groups.py.

00:02:15.820 --> 00:02:19.620
And if I hit Enter, now I should
be able to enter a number.

00:02:19.620 --> 00:02:21.780
I'll test one here, plus 1.

00:02:21.780 --> 00:02:26.230
And I'll type in that 617-495-1000.

00:02:26.230 --> 00:02:29.620
I'll hit Enter here, and
we'll see that is valid.

00:02:29.620 --> 00:02:32.140
Maybe I'll try it, too, for Indonesia.

00:02:32.140 --> 00:02:36.030
I'll do plus 62, and I'll
enter in my phone number again.

00:02:36.030 --> 00:02:38.800
And I'll hit Enter, and I'll
see if that's valid as well.

00:02:38.800 --> 00:02:41.110
And just for good measure,
I'll test Nicaragua.

00:02:41.110 --> 00:02:44.250
I'll do plus 505, and
I'll type in this number,

00:02:44.250 --> 00:02:47.220
and we'll see that is valid as well.

00:02:47.220 --> 00:02:51.210
So it seems like our pattern
is working, but there's

00:02:51.210 --> 00:02:54.930
more we could do with this program,
I think, thanks to this feature

00:02:54.930 --> 00:02:57.280
called a capture group.

00:02:57.280 --> 00:03:00.330
Well, maybe what I want
to do is not just share

00:03:00.330 --> 00:03:04.110
if this number is valid
or invalid, but maybe tell

00:03:04.110 --> 00:03:08.830
somebody from what country, in this
case, this number is calling from.

00:03:08.830 --> 00:03:10.420
You can think of your phone.

00:03:10.420 --> 00:03:13.060
When it receives some call
from an unknown number,

00:03:13.060 --> 00:03:16.790
it might at least tell you the location
or the area that number is calling from.

00:03:16.790 --> 00:03:18.540
What if we could write
the same thing here

00:03:18.540 --> 00:03:21.510
where people call us internationally
and we show the user,

00:03:21.510 --> 00:03:24.690
in this case, what country
they are calling from?

00:03:24.690 --> 00:03:27.420
Well, in this case, we
don't want to just test

00:03:27.420 --> 00:03:31.240
to see if we find the pattern
within our phone numbers here.

00:03:31.240 --> 00:03:34.210
We also want to extract
some portion of it,

00:03:34.210 --> 00:03:37.260
in this case, the very first
portion, the country calling

00:03:37.260 --> 00:03:41.670
code-- plus 505 for Nicaragua,
plus 62 for Indonesia,

00:03:41.670 --> 00:03:44.520
or plus 1 for the US and Canada.

00:03:44.520 --> 00:03:49.740
But we run into a problem here if we were to
use maybe simple string manipulation.

00:03:49.740 --> 00:03:53.190
If I enter in some number
and was trying to extract,

00:03:53.190 --> 00:03:56.700
in this case, the country calling
code, well, I wouldn't immediately

00:03:56.700 --> 00:04:01.500
know whether I should extract, in this
case, the first two characters, plus 1,

00:04:01.500 --> 00:04:06.450
the first three characters, plus 62, or,
in this case, the first four characters,

00:04:06.450 --> 00:04:07.990
plus 505.
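
NOTE
To illustrate the problem with fixed-length slicing, here is a small sketch
using the sample numbers from earlier. No single slice length recovers the
country calling code for all three.
```python
numbers = ["+1 617-495-1000", "+62 617-495-1000", "+505 617-495-1000"]
for number in numbers:
    # Show the first two, three, and four characters of each number
    print(repr(number[:2]), repr(number[:3]), repr(number[:4]))
```
Each slice length is right for one number and wrong for the other two.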

00:04:07.990 --> 00:04:11.730
But thankfully, I can actually use
regular expressions and capture

00:04:11.730 --> 00:04:16.829
groups to dynamically capture the
portion of the content I'm looking for.

00:04:16.829 --> 00:04:21.540
Now, the way I can make a capture
group is by using parentheses inside

00:04:21.540 --> 00:04:23.620
of a regular expression.

00:04:23.620 --> 00:04:27.480
So really I want to capture
or extract, in this case,

00:04:27.480 --> 00:04:32.100
the country calling code, which we
said the pattern exists for right here,

00:04:32.100 --> 00:04:36.870
a literal plus sign
followed by one to three numbers.

00:04:36.870 --> 00:04:40.140
Now, I can encase this
inside of parentheses,

00:04:40.140 --> 00:04:43.140
and this becomes my own capture group.

00:04:43.140 --> 00:04:46.810
But how could I maybe find
the information I capture?

00:04:46.810 --> 00:04:51.510
Well, if I find a match here, turns
out that this match object in Python

00:04:51.510 --> 00:04:54.720
comes with a method called group--

00:04:54.720 --> 00:04:55.360
group.

00:04:55.360 --> 00:04:57.090
If I were to do--

00:04:57.090 --> 00:05:00.400
let me do match.group.

00:05:00.400 --> 00:05:04.010
Well, this would help me find
all of the capture groups

00:05:04.010 --> 00:05:07.670
I've actually implemented
in my regular expression

00:05:07.670 --> 00:05:10.730
and extract them from this match.

00:05:10.730 --> 00:05:14.700
Because, let's say, this is the
first capture group we have,

00:05:14.700 --> 00:05:16.950
I could go ahead and type 1 here.

00:05:16.950 --> 00:05:21.420
Capture groups, at least in Python, in
this particular case, are one-indexed.

00:05:21.420 --> 00:05:25.400
So I'm saying here, if there is
a match, go ahead and give me,

00:05:25.400 --> 00:05:30.530
in this case, the result that I
found within the first capture group.

00:05:30.530 --> 00:05:34.410
I'll go ahead and store this in
a variable called country_code.

00:05:34.410 --> 00:05:37.190
And I could-- instead
of printing valid here,

00:05:37.190 --> 00:05:40.430
maybe I'll go ahead and print
something like country_code

00:05:40.430 --> 00:05:42.750
and see what we can find.

00:05:42.750 --> 00:05:46.310
Well I'll go ahead and
run Python of groups.py,

00:05:46.310 --> 00:05:52.070
and I'll go ahead and do plus 1,
followed by my phone number, hit Enter.

00:05:52.070 --> 00:05:53.970
And now we'll see plus 1.

00:05:53.970 --> 00:05:58.730
So it seems like we extracted, in
this case, the portion of our content

00:05:58.730 --> 00:06:02.643
that matched the pattern
within these parentheses.

00:06:02.643 --> 00:06:04.060
I'll try it with another one here.

00:06:04.060 --> 00:06:08.040
I'll do Python of groups.py plus 62.

00:06:08.040 --> 00:06:09.640
And we'll see plus 62.

00:06:09.640 --> 00:06:11.070
So it is dynamic.

00:06:11.070 --> 00:06:14.130
And it's not looking for the
first two characters all the time

00:06:14.130 --> 00:06:16.630
or the first three
characters all the time.

00:06:16.630 --> 00:06:18.340
It's looking for this pattern.

00:06:18.340 --> 00:06:22.890
And when it finds it, it's
returning it to us as appropriate.
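
NOTE
A sketch of the capture group in action, assuming the reconstructed pattern
with parentheses added around the country calling code. The function name
extract_country_code is hypothetical.
```python
import re
# Parentheses around \+\d{1,3} make it capture group 1 (assumed reconstruction)
pattern = r"(\+\d{1,3}) \d{3}-\d{3}-\d{4}"
def extract_country_code(number):
    match = re.search(pattern, number)
    if match:
        # group(1) is the text matched by the first pair of parentheses;
        # group(0) would be the entire match
        return match.group(1)
    return None
print(extract_country_code("+1 617-495-1000"))
print(extract_country_code("+62 617-495-1000"))
print(extract_country_code("+505 617-495-1000"))
```
The captured text varies in length with the input, which fixed slicing
cannot do.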

00:06:22.890 --> 00:06:25.240
Now, what else could we do with this?

00:06:25.240 --> 00:06:27.940
Well, country_code
literally is a string here.

00:06:27.940 --> 00:06:32.010
So if I wanted to, in this
case, find the country somebody

00:06:32.010 --> 00:06:35.220
is calling from based on their
country calling code, well,

00:06:35.220 --> 00:06:40.660
I could perhaps use country_code as
the key for this dictionary here.

00:06:40.660 --> 00:06:46.180
I could type locations,
bracket, country_code.

00:06:46.180 --> 00:06:51.810
And because each of these country
calling codes is a key in my dictionary,

00:06:51.810 --> 00:06:55.620
I should hopefully find, in
this case, the actual location

00:06:55.620 --> 00:06:57.100
they are calling from.

00:06:57.100 --> 00:06:58.300
Let's try this out.

00:06:58.300 --> 00:07:02.890
I'll run Python of groups.py,
and I'll now type-- oops.

00:07:02.890 --> 00:07:08.350
I'll now type-- let's do plus 1 again,
followed by the number, hit Enter.

00:07:08.350 --> 00:07:10.950
And now we'll see United
States and Canada.

00:07:10.950 --> 00:07:12.130
I could do this again.

00:07:12.130 --> 00:07:17.130
I could try, let's say, a
plus 62 and number again--

00:07:17.130 --> 00:07:18.240
Indonesia.

00:07:18.240 --> 00:07:23.320
I'll now try plus 505,
and I'll see Nicaragua.
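
NOTE
A sketch of the dictionary lookup, assuming locations maps each code string
to the country names heard in the video. The function name locate and the
"Invalid" message are hypothetical.
```python
import re
# Assumed contents of the locations dictionary
locations = {"+1": "United States and Canada", "+62": "Indonesia", "+505": "Nicaragua"}
pattern = r"(\+\d{1,3}) \d{3}-\d{3}-\d{4}"
def locate(number):
    match = re.search(pattern, number)
    if match:
        country_code = match.group(1)
        # Indexing raises KeyError for a code that is not in the dictionary
        return locations[country_code]
    return "Invalid"
print(locate("+1 617-495-1000"))
print(locate("+62 617-495-1000"))
print(locate("+505 617-495-1000"))
```
A well-formed number with an unlisted code would raise KeyError here;
locations.get(country_code) is one way to handle that case.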

00:07:23.320 --> 00:07:25.590
So the capture group
here is doing the work

00:07:25.590 --> 00:07:28.440
of finding the portion of
our content that matches

00:07:28.440 --> 00:07:31.590
some pattern we were looking for.

00:07:31.590 --> 00:07:34.950
Well, I think we've really seen
a lot of what this can do for us,

00:07:34.950 --> 00:07:38.250
but there is one more
feature to take a look at.

00:07:38.250 --> 00:07:45.240
Here, notice how on line 11 I am
really using indices, indexes, to find

00:07:45.240 --> 00:07:48.100
the capture group I'm looking for.

00:07:48.100 --> 00:07:52.380
But a more complex regular expression
might involve more than one

00:07:52.380 --> 00:07:56.430
capture group, could involve up
to, I don't know, more than one,

00:07:56.430 --> 00:07:58.390
two, three, four, could get up to 10.

00:07:58.390 --> 00:08:01.410
However many it is, it can be
helpful to have a better way

00:08:01.410 --> 00:08:04.210
to refer to these capture groups.

00:08:04.210 --> 00:08:07.620
So if this capture group has
some particular meaning to it,

00:08:07.620 --> 00:08:10.770
I could actually give it
a name to refer to later

00:08:10.770 --> 00:08:13.630
on within the regular expression.

00:08:13.630 --> 00:08:16.410
And the way I do this is
with the following syntax.

00:08:16.410 --> 00:08:22.290
Within my capture group, after the first
parenthesis, I can type question mark, capital P,

00:08:22.290 --> 00:08:26.310
and then open bracket close
bracket, or, in this case, less than

00:08:26.310 --> 00:08:31.690
sign, greater than sign, with
some name for this capture group in between.

00:08:31.690 --> 00:08:35.760
I could call this
country_code just like this.

00:08:35.760 --> 00:08:39.780
So now this pattern here
and the capture group

00:08:39.780 --> 00:08:44.610
has a name I can refer to
later to extract it with.

00:08:44.610 --> 00:08:49.170
Down here on line 11, I
could, in this case, use 1,

00:08:49.170 --> 00:08:52.230
but now I could actually make
use of country_code, the name I

00:08:52.230 --> 00:08:54.820
gave for this particular capture group.

00:08:54.820 --> 00:08:57.960
And I could type in
country_code just like this,

00:08:57.960 --> 00:09:02.910
which will say, find for me, in this
case, the capture group that I named

00:09:02.910 --> 00:09:05.830
country_code and use that instead.

00:09:05.830 --> 00:09:08.610
I'll type Python of groups.py.

00:09:08.610 --> 00:09:11.940
I'll go ahead and type
plus 1, same number here.

00:09:11.940 --> 00:09:14.670
And now we'll see United
States and Canada.
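
NOTE
A sketch of the named capture group, again assuming the reconstructed
pattern and dictionary. In Python the syntax is (?P<name>...) with a
capital P. The function name locate is hypothetical.
```python
import re
# Assumed contents of the locations dictionary
locations = {"+1": "United States and Canada", "+62": "Indonesia", "+505": "Nicaragua"}
# (?P<country_code>...) gives the first capture group a name
pattern = r"(?P<country_code>\+\d{1,3}) \d{3}-\d{3}-\d{4}"
def locate(number):
    match = re.search(pattern, number)
    if match:
        # Look the group up by name; match.group(1) would return the same text
        country_code = match.group("country_code")
        return locations[country_code]
    return "Invalid"
print(locate("+1 617-495-1000"))
```
Naming the group keeps the lookup readable if more groups are added to the
pattern later.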

00:09:14.670 --> 00:09:19.020
Seems to work but is now a little more
readable, since we can name something

00:09:19.020 --> 00:09:22.350
that we might later hope
to capture in our programs.

00:09:22.350 --> 00:09:25.690
So this was our brief
foray into capture groups.

00:09:25.690 --> 00:09:27.130
And this was our short.

00:09:27.130 --> 00:09:29.810
We'll see you next time.