WEBVTT

00:00:00.000 --> 00:00:03.960 align:middle line:90%
[ORCHESTRA TUNING]

00:00:14.850 --> 00:00:18.315 align:middle line:90%
[MUSIC PLAYING]

00:00:24.270 --> 00:00:25.470 align:middle line:90%
DAVID MALAN: All right.

00:00:25.470 --> 00:00:28.680 align:middle line:84%
This is CS50's Introduction
to Programming with Python.

00:00:28.680 --> 00:00:32.350 align:middle line:84%
My name is David Malan, and this
is our week on regular expressions.

00:00:32.350 --> 00:00:37.170 align:middle line:84%
So a regular expression, otherwise known
as a regex, is really just a pattern.

00:00:37.170 --> 00:00:39.120 align:middle line:84%
And indeed, it's quite
common in programming

00:00:39.120 --> 00:00:43.620 align:middle line:84%
to want to use patterns to match on
some kind of data, often user input.

00:00:43.620 --> 00:00:47.190 align:middle line:84%
For instance, if the user types in an
email address, whether to your program,

00:00:47.190 --> 00:00:49.200 align:middle line:84%
or a website, or an
app on your phone, you

00:00:49.200 --> 00:00:50.940 align:middle line:84%
might ideally want to
be able to validate

00:00:50.940 --> 00:00:53.130 align:middle line:84%
that they did indeed
type in an email address

00:00:53.130 --> 00:00:54.790 align:middle line:90%
and not something completely different.

00:00:54.790 --> 00:00:58.080 align:middle line:84%
So using regular expressions, we're
going to have the newfound capability

00:00:58.080 --> 00:01:02.280 align:middle line:84%
to define patterns in our code to
compare them against data that we're

00:01:02.280 --> 00:01:04.920 align:middle line:84%
receiving from someone else,
whether it's just to validate it,

00:01:04.920 --> 00:01:07.470 align:middle line:84%
or, heck, even if we want to
clean up a whole lot of data

00:01:07.470 --> 00:01:11.220 align:middle line:84%
that itself might be messy because
it, too, came from us humans.

00:01:11.220 --> 00:01:14.010 align:middle line:84%
Before, though, we use
these regular expressions,

00:01:14.010 --> 00:01:19.620 align:middle line:84%
let me propose that we solve a few
problems using just some simpler syntax

00:01:19.620 --> 00:01:22.140 align:middle line:84%
and see what kind of
limitations we run up against.

00:01:22.140 --> 00:01:24.720 align:middle line:84%
Let me propose that I
open up VS Code here,

00:01:24.720 --> 00:01:27.900 align:middle line:84%
and let me create a file called
validate.py, the goal at hand

00:01:27.900 --> 00:01:30.768 align:middle line:84%
being to validate, how about just
that, a user's email address.

00:01:30.768 --> 00:01:33.060 align:middle line:84%
They've come to your app,
they've come to your website,

00:01:33.060 --> 00:01:34.935 align:middle line:84%
they type in their email
address, and we want

00:01:34.935 --> 00:01:38.380 align:middle line:84%
to say yes or no, this
email address looks valid.

00:01:38.380 --> 00:01:38.880 align:middle line:90%
All right.

00:01:38.880 --> 00:01:43.980 align:middle line:84%
Let me go ahead and type code of
validate.py to create a new tab here.

00:01:43.980 --> 00:01:47.850 align:middle line:84%
And then within this tab, let me go
ahead and start writing some code,

00:01:47.850 --> 00:01:50.010 align:middle line:84%
how about, that keeps
things simple initially.

00:01:50.010 --> 00:01:53.130 align:middle line:84%
First, let me go ahead and prompt
the user for their email address.

00:01:53.130 --> 00:01:57.510 align:middle line:84%
And I'll store the return value of
input in a variable called email,

00:01:57.510 --> 00:01:59.940 align:middle line:90%
asking them "what's your email?"

00:01:59.940 --> 00:02:00.742 align:middle line:90%
question mark.

00:02:00.742 --> 00:02:02.700 align:middle line:84%
I'm going to go ahead
and preemptively at least

00:02:02.700 --> 00:02:06.780 align:middle line:84%
clean up the user's input a little
bit by minimally just calling strip

00:02:06.780 --> 00:02:10.020 align:middle line:84%
at the end of my call
to input, because recall

00:02:10.020 --> 00:02:12.240 align:middle line:90%
that input returns a string or a str.

00:02:12.240 --> 00:02:16.200 align:middle line:84%
strs come with some built-in
methods or functions, one of which

00:02:16.200 --> 00:02:18.390 align:middle line:84%
is strip, which has the
effect of stripping off

00:02:18.390 --> 00:02:22.395 align:middle line:84%
any leading whitespace to the left or
any trailing whitespace to the right.

00:02:22.395 --> 00:02:24.270 align:middle line:84%
So that's just going to
go ahead and at least

00:02:24.270 --> 00:02:27.480 align:middle line:84%
avoid the human having accidentally
typed in a space character.

00:02:27.480 --> 00:02:29.860 align:middle line:84%
We're going to throw
it away just in case.

00:02:29.860 --> 00:02:31.500 align:middle line:90%
Now I'm going to do something simple.

00:02:31.500 --> 00:02:35.165 align:middle line:84%
For a user's input to
be an email address,

00:02:35.165 --> 00:02:37.290 align:middle line:84%
I think we can all agree
that it's got a minimal we

00:02:37.290 --> 00:02:39.130 align:middle line:90%
have an @ sign somewhere in it.

00:02:39.130 --> 00:02:40.140 align:middle line:90%
So let's start simple.

00:02:40.140 --> 00:02:43.230 align:middle line:84%
If the user has typed in
something with an @ sign, let's

00:02:43.230 --> 00:02:46.800 align:middle line:84%
very generously just say, OK,
valid, looks like an email address.

00:02:46.800 --> 00:02:50.940 align:middle line:84%
And if we're missing that @ sign,
let's say invalid, because clearly it's

00:02:50.940 --> 00:02:51.870 align:middle line:90%
not an email address.

00:02:51.870 --> 00:02:55.078 align:middle line:84%
It's not going to be the best version
of my code yet, but we'll start simple.

00:02:55.078 --> 00:02:59.700 align:middle line:84%
So I'm going to ask the question, if
there is an @ symbol in the user's

00:02:59.700 --> 00:03:03.570 align:middle line:84%
email address, go ahead and print out,
for instance, quote, unquote, "valid."

00:03:03.570 --> 00:03:06.750 align:middle line:84%
Else, if there's not, now I'm
pretty confident that the email

00:03:06.750 --> 00:03:09.250 align:middle line:90%
address is, in fact, invalid.

00:03:09.250 --> 00:03:10.650 align:middle line:90%
Now, what is this code doing?

00:03:10.650 --> 00:03:16.320 align:middle line:84%
Well, if @ sign in email is a Pythonic
way of asking is this string quote,

00:03:16.320 --> 00:03:20.762 align:middle line:84%
unquote "@" in this other string
email, no matter where it is--

00:03:20.762 --> 00:03:22.470 align:middle line:84%
at the beginning, the
middle, or the end.

00:03:22.470 --> 00:03:25.470 align:middle line:84%
It's going to search
through the entire string for you

00:03:25.470 --> 00:03:26.220 align:middle line:90%
automatically.

00:03:26.220 --> 00:03:27.660 align:middle line:90%
I could do this more verbosely.

00:03:27.660 --> 00:03:29.670 align:middle line:84%
And I could use a for
loop or a while loop

00:03:29.670 --> 00:03:32.640 align:middle line:84%
and look at every character
in the user's email address,

00:03:32.640 --> 00:03:34.128 align:middle line:90%
looking to see if it's an @ sign.

00:03:34.128 --> 00:03:36.420 align:middle line:84%
But this is one of the things
that's nice about Python.

00:03:36.420 --> 00:03:38.020 align:middle line:90%
You can do more with less.

00:03:38.020 --> 00:03:41.190 align:middle line:84%
So just by saying if "@"
quote, unquote in email,

00:03:41.190 --> 00:03:43.800 align:middle line:84%
we're achieving that same result.
We're going to get back true

00:03:43.800 --> 00:03:47.670 align:middle line:84%
if it's somewhere in there, thus
valid, or false if it is not.
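
NOTE
Editor's sketch, not the code typed on screen: a minimal version of the
first validator described above, next to the longer loop it replaces.
The function names looks_valid and looks_valid_verbose are hypothetical,
and the sample inputs stand in for what the lecture reads via input().
```python
def looks_valid(email: str) -> bool:
    # Hypothetical helper: version 1 only checks that an @ sign appears somewhere.
    return "@" in email
#
def looks_valid_verbose(email: str) -> bool:
    # The loop-based equivalent mentioned above: inspect each character in turn.
    for character in email:
        if character == "@":
            return True
    return False
#
# In the lecture the string comes from input("What's your email? ").strip().
for candidate in ["malan@harvard.edu", "@", "malan"]:
    print(candidate, "valid" if looks_valid(candidate.strip()) else "invalid")
```
Both helpers agree on every input; the `in` operator is simply the shorter spelling.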

00:03:47.670 --> 00:03:50.970 align:middle line:84%
Well, let me go ahead now and run
this program in my terminal window

00:03:50.970 --> 00:03:53.100 align:middle line:90%
with python of validate.py.

00:03:53.100 --> 00:03:56.730 align:middle line:84%
And I'm going to go ahead and give it
my email address-- malan@harvard.edu,

00:03:56.730 --> 00:03:57.570 align:middle line:90%
Enter.

00:03:57.570 --> 00:03:58.770 align:middle line:90%
And indeed, it's valid.

00:03:58.770 --> 00:04:00.210 align:middle line:90%
Looks valid, is valid.

00:04:00.210 --> 00:04:03.480 align:middle line:84%
But of course, this program
is technically broken.

00:04:03.480 --> 00:04:04.410 align:middle line:90%
It's buggy.

00:04:04.410 --> 00:04:07.110 align:middle line:84%
What would be an example
input, if someone

00:04:07.110 --> 00:04:10.770 align:middle line:84%
might like to volunteer an answer
here, that would be considered valid

00:04:10.770 --> 00:04:13.197 align:middle line:84%
but you and I know it
really isn't valid?

00:04:13.197 --> 00:04:14.280 align:middle line:90%
AUDIENCE: Yeah, thank you.

00:04:14.280 --> 00:04:17.880 align:middle line:84%
Well, for instance, you can type
just two signs and that's it,

00:04:17.880 --> 00:04:20.760 align:middle line:90%
and it'll still be valid--

00:04:20.760 --> 00:04:23.815 align:middle line:84%
still be valid according to your
program, but missing something.

00:04:23.815 --> 00:04:24.690 align:middle line:90%
DAVID MALAN: Exactly.

00:04:24.690 --> 00:04:26.680 align:middle line:90%
We've set a very low bar here.

00:04:26.680 --> 00:04:29.430 align:middle line:84%
In fact, if I go ahead and
rerun python of validate.py,

00:04:29.430 --> 00:04:33.512 align:middle line:84%
and I'll just type in one @ sign,
that's it-- no username, no domain name,

00:04:33.512 --> 00:04:35.470 align:middle line:84%
this doesn't really look
like an email address.

00:04:35.470 --> 00:04:38.790 align:middle line:84%
But unfortunately, my code thinks it,
in fact, is, because it's obviously

00:04:38.790 --> 00:04:40.710 align:middle line:90%
just looking for an @ sign alone.

00:04:40.710 --> 00:04:42.250 align:middle line:90%
Well, how could we improve this?

00:04:42.250 --> 00:04:45.500 align:middle line:84%
Well, minimally an email
address, I think, tends to have,

00:04:45.500 --> 00:04:47.250 align:middle line:84%
though this is not
actually a requirement,

00:04:47.250 --> 00:04:51.600 align:middle line:84%
tends to have an @ sign and a single dot
at least, maybe somewhere in the domain

00:04:51.600 --> 00:04:54.240 align:middle line:90%
name-- so malan@harvard.edu.

00:04:54.240 --> 00:04:55.960 align:middle line:90%
So let's check for that dot as well.

00:04:55.960 --> 00:04:59.040 align:middle line:84%
But again, strictly speaking it
doesn't even have to be that case.

00:04:59.040 --> 00:05:02.190 align:middle line:84%
But I'm going for my own email address,
at least for now, as our test case.

00:05:02.190 --> 00:05:06.450 align:middle line:84%
So let me go ahead and change my code
now and say, not only if @ is in email,

00:05:06.450 --> 00:05:11.050 align:middle line:90%
but also dot is in email as well.

00:05:11.050 --> 00:05:12.690 align:middle line:90%
So I'm asking now two questions.

00:05:12.690 --> 00:05:16.050 align:middle line:84%
I have two Boolean
expressions-- if @ in email,

00:05:16.050 --> 00:05:20.650 align:middle line:84%
and I'm anding them together logically--
this is a logical and, so to speak.

00:05:20.650 --> 00:05:24.600 align:middle line:84%
So if it's the case that @ is in
email and dot is in email, OK,

00:05:24.600 --> 00:05:26.470 align:middle line:90%
now I'm going to go ahead and say valid.

00:05:26.470 --> 00:05:26.970 align:middle line:90%
All right.

00:05:26.970 --> 00:05:29.460 align:middle line:84%
This would still seem to
work for my email address.

00:05:29.460 --> 00:05:34.500 align:middle line:84%
Let me go ahead and run python
validate.py, malan@harvard.edu, Enter,

00:05:34.500 --> 00:05:36.390 align:middle line:84%
and that, of course,
is valid, as expected.

00:05:36.390 --> 00:05:39.870 align:middle line:84%
But here, too, we can be a little
adversarial and type in something

00:05:39.870 --> 00:05:41.505 align:middle line:90%
nonsensical like "@."

00:05:41.505 --> 00:05:45.180 align:middle line:84%
and unfortunately, that, too, is
going to be mistaken as valid,

00:05:45.180 --> 00:05:48.820 align:middle line:84%
even though there's still no username,
domain name, or anything like that.
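
NOTE
Editor's sketch of the second version discussed above, assuming a
hypothetical helper named looks_valid. It shows why "@." slips through:
each check only asks whether the character appears anywhere.
```python
def looks_valid(email: str) -> bool:
    # Hypothetical helper: version 2 requires an @ sign and a dot, anywhere.
    return "@" in email and "." in email
#
print(looks_valid("malan@harvard.edu"))  # True
print(looks_valid("@."))  # True, even though there is no username or domain
print(looks_valid("malan@harvard"))  # False, because there is no dot
```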

00:05:48.820 --> 00:05:51.180 align:middle line:84%
So I think we need to be a
little more methodical here.

00:05:51.180 --> 00:05:57.660 align:middle line:84%
In fact, notice that if I do this
like this, the @ sign can be anywhere,

00:05:57.660 --> 00:05:59.250 align:middle line:90%
and the dot can be anywhere.

00:05:59.250 --> 00:06:02.190 align:middle line:84%
But if I'm assuming the user is
going to have a traditional domain

00:06:02.190 --> 00:06:05.880 align:middle line:84%
name like harvard.edu
or gmail.com, I really

00:06:05.880 --> 00:06:10.110 align:middle line:84%
want to look for the dot in the
domain name only, not necessarily

00:06:10.110 --> 00:06:11.580 align:middle line:90%
just the username.

00:06:11.580 --> 00:06:13.510 align:middle line:90%
So let me go ahead and do this.

00:06:13.510 --> 00:06:18.250 align:middle line:84%
Let me go ahead and introduce a bit
more logic here, and instead do this.

00:06:18.250 --> 00:06:24.060 align:middle line:84%
Let me go ahead and do email.split
of quote, unquote @ sign.

00:06:24.060 --> 00:06:26.460 align:middle line:90%
So email, again, is a string or a str.

00:06:26.460 --> 00:06:29.550 align:middle line:84%
strs come with methods,
not just strip but also

00:06:29.550 --> 00:06:32.190 align:middle line:84%
another one called split
that, as the name implies,

00:06:32.190 --> 00:06:36.570 align:middle line:84%
will split one str into multiple ones
if you give it a character or more

00:06:36.570 --> 00:06:37.950 align:middle line:90%
to split on.

00:06:37.950 --> 00:06:42.390 align:middle line:84%
So this is hopefully going to return to
me two parts from a traditional email

00:06:42.390 --> 00:06:44.880 align:middle line:84%
address, the username
and the domain name.

00:06:44.880 --> 00:06:47.850 align:middle line:84%
And it turns out I can unpack
that sequence of responses

00:06:47.850 --> 00:06:52.410 align:middle line:84%
by doing this-- username
comma domain equals this.

00:06:52.410 --> 00:06:55.060 align:middle line:84%
I could store it in a list
or some other structure,

00:06:55.060 --> 00:06:58.530 align:middle line:84%
but if I already know in advance
what kinds of values I'm expecting,

00:06:58.530 --> 00:07:00.510 align:middle line:84%
a username and hopefully
a domain, I'm going

00:07:00.510 --> 00:07:04.050 align:middle line:84%
to go ahead and do it like this instead
and just define two variables at once

00:07:04.050 --> 00:07:05.310 align:middle line:90%
on one line of code.
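
NOTE
Editor's sketch of split and unpacking as described above. The sample
address is the one used in the lecture; the try/except at the end is an
addition of mine, included only to show what happens when the input
does not contain exactly one @ sign.
```python
email = "malan@harvard.edu"
# str.split returns a list of the pieces around the separator.
parts = email.split("@")
print(parts)  # ['malan', 'harvard.edu']
# Unpacking assigns both pieces at once; it needs exactly two pieces.
username, domain = email.split("@")
print(username, domain)  # malan harvard.edu
# Without exactly one @ sign, the unpacking raises ValueError.
try:
    first, second = "malan".split("@")
except ValueError:
    print("invalid")
```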

00:07:05.310 --> 00:07:07.290 align:middle line:84%
And now I'm going to be
a little more precise.

00:07:07.290 --> 00:07:13.230 align:middle line:84%
If username-- if username,
then I'm going to go ahead

00:07:13.230 --> 00:07:15.370 align:middle line:90%
and say, print "valid."

00:07:15.370 --> 00:07:18.820 align:middle line:84%
Else, I'm going to go ahead
and say print "invalid."

00:07:18.820 --> 00:07:20.040 align:middle line:90%
Now, this isn't good enough.

00:07:20.040 --> 00:07:22.800 align:middle line:84%
But I'm at least checking for
the presence of a username now.

00:07:22.800 --> 00:07:25.217 align:middle line:84%
And you might not have seen
this before, but if you simply

00:07:25.217 --> 00:07:28.680 align:middle line:84%
ask a question like "if username,"
and username is a string,

00:07:28.680 --> 00:07:31.320 align:middle line:84%
well, username-- "if
username" is going to give me

00:07:31.320 --> 00:07:35.730 align:middle line:84%
a true answer if username is
anything except none or quote,

00:07:35.730 --> 00:07:36.840 align:middle line:90%
unquote "nothing."

00:07:36.840 --> 00:07:41.820 align:middle line:84%
So there's a truthy value here, whereby
if username has at least one character,

00:07:41.820 --> 00:07:43.350 align:middle line:90%
that's going to be considered true.

00:07:43.350 --> 00:07:46.170 align:middle line:84%
But if username has no
characters, it's going

00:07:46.170 --> 00:07:49.245 align:middle line:84%
to be considered a
false value effectively.

00:07:49.245 --> 00:07:50.370 align:middle line:90%
But this isn't good enough.

00:07:50.370 --> 00:07:52.037 align:middle line:90%
I don't want to just check for username.

00:07:52.037 --> 00:07:57.160 align:middle line:84%
I want to also check that it's the case
that dot is in the domain name as well.

00:07:57.160 --> 00:08:00.180 align:middle line:84%
So notice here there's a
bit of potential confusion

00:08:00.180 --> 00:08:01.830 align:middle line:90%
with the English language.

00:08:01.830 --> 00:08:04.620 align:middle line:84%
Here, I seem to be saying
"if username and dot

00:08:04.620 --> 00:08:09.660 align:middle line:84%
in domain," as though I'm asking the
question, "if the username and the dot

00:08:09.660 --> 00:08:12.270 align:middle line:84%
are in the domain," but
that's not what this means.

00:08:12.270 --> 00:08:15.540 align:middle line:84%
These are two separate Boolean
expressions-- "if username,"

00:08:15.540 --> 00:08:19.690 align:middle line:90%
and separately, "if dot in domain."

00:08:19.690 --> 00:08:23.100 align:middle line:84%
And if I parenthesis this, we could
make that even more clear by putting

00:08:23.100 --> 00:08:25.000 align:middle line:90%
parentheses there, parentheses here.

00:08:25.000 --> 00:08:27.390 align:middle line:84%
So just to be clear, it's
really two Boolean expressions

00:08:27.390 --> 00:08:30.840 align:middle line:84%
that we're anding together, not
one longer English-like sentence.

00:08:30.840 --> 00:08:35.580 align:middle line:84%
Now, if I go ahead and run
this, python validate.py Enter,

00:08:35.580 --> 00:08:39.809 align:middle line:84%
I'll do my own email address again,
malan@harvard.edu, and that's valid.

00:08:39.809 --> 00:08:43.710 align:middle line:84%
And it looks like I could
tolerate something like this.

00:08:43.710 --> 00:08:47.970 align:middle line:84%
If I do malan@, just say,
harvard, I think at the moment

00:08:47.970 --> 00:08:49.480 align:middle line:90%
this is going to be invalid.

00:08:49.480 --> 00:08:52.150 align:middle line:84%
Now, maybe the top-level
domain harvard exists.

00:08:52.150 --> 00:08:54.900 align:middle line:84%
But at the moment, it looks like
we're looking for something more.

00:08:54.900 --> 00:08:58.380 align:middle line:84%
We're looking for a top-level
domain too, like .edu.

00:08:58.380 --> 00:09:01.540 align:middle line:84%
For now, we'll just
consider this to be invalid.

00:09:01.540 --> 00:09:04.510 align:middle line:90%
But it's not just that we want to do--

00:09:04.510 --> 00:09:07.260 align:middle line:84%
it's not just that we want to check
for the presence of a username

00:09:07.260 --> 00:09:08.370 align:middle line:90%
and the presence of a dot.

00:09:08.370 --> 00:09:09.520 align:middle line:90%
Let's be more specific.

00:09:09.520 --> 00:09:11.687 align:middle line:84%
Let's start to now narrow
the scope of this program,

00:09:11.687 --> 00:09:15.600 align:middle line:84%
not just to be about generic emails
more generally, but about edu addresses,

00:09:15.600 --> 00:09:18.780 align:middle line:84%
so specifically for someone in
a US university, for instance,

00:09:18.780 --> 00:09:21.450 align:middle line:84%
whose email address
tends to end with .edu.

00:09:21.450 --> 00:09:23.310 align:middle line:90%
I can be a little more precise.

00:09:23.310 --> 00:09:25.350 align:middle line:84%
And you might recall
this function already.

00:09:25.350 --> 00:09:28.590 align:middle line:84%
Instead of just saying, is
there a dot somewhere in domain,

00:09:28.590 --> 00:09:34.740 align:middle line:84%
let me instead say, and the domain
ends with quote, unquote ".edu."

00:09:34.740 --> 00:09:36.420 align:middle line:90%
Now we're being even more precise.

00:09:36.420 --> 00:09:40.200 align:middle line:84%
We want there to be minimally a username
that's not empty-- it's not just quote,

00:09:40.200 --> 00:09:45.190 align:middle line:84%
unquote "nothing"-- and we want the
domain name to actually end with .edu.

00:09:45.190 --> 00:09:47.448 align:middle line:84%
Let me go ahead and run
python of validate.py.

00:09:47.448 --> 00:09:49.740 align:middle line:84%
And just to make sure I
haven't made things even worse,

00:09:49.740 --> 00:09:53.470 align:middle line:84%
let me at least test my own email
address, which does seem to be valid.

00:09:53.470 --> 00:09:56.070 align:middle line:84%
Now, it seems that I minimally
need to provide a username,

00:09:56.070 --> 00:09:58.380 align:middle line:84%
because we definitely do
have that check in place.

00:09:58.380 --> 00:10:00.210 align:middle line:90%
So I'm going to go ahead and say malan.

00:10:00.210 --> 00:10:02.790 align:middle line:90%
And now I'm going to go ahead and say @.

00:10:02.790 --> 00:10:05.880 align:middle line:84%
And it looks like I could
be a little malicious here,

00:10:05.880 --> 00:10:09.030 align:middle line:84%
just say malan@.edu, as
though minimally meeting

00:10:09.030 --> 00:10:11.340 align:middle line:90%
the requirements of this pattern.

00:10:11.340 --> 00:10:13.200 align:middle line:84%
And that, of course,
is considered valid,

00:10:13.200 --> 00:10:17.010 align:middle line:84%
but I'm pretty sure there's
no one at malan@.edu.

00:10:17.010 --> 00:10:19.350 align:middle line:84%
We need to have some
domain name in there.
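
NOTE
Editor's sketch of the .edu version described above, again with a
hypothetical helper named looks_valid and assuming exactly one @ sign.
It reproduces the malan@.edu loophole from the lecture.
```python
def looks_valid(email: str) -> bool:
    # Hypothetical helper; assumes exactly one @ sign so the unpacking succeeds.
    username, domain = email.split("@")
    # Non-empty username and a domain that ends with ".edu".
    return bool(username) and domain.endswith(".edu")
#
print(looks_valid("malan@harvard.edu"))  # True
print(looks_valid("malan@.edu"))  # True, although nothing precedes .edu
print(looks_valid("malan@harvard.com"))  # False
```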

00:10:19.350 --> 00:10:21.360 align:middle line:84%
So we're still not
So we're still being a bit too generous.

00:10:21.360 --> 00:10:24.510 align:middle line:84%
Now, we could absolutely continue
to iterate on this program,

00:10:24.510 --> 00:10:26.640 align:middle line:84%
and we could add some
more Boolean expressions.

00:10:26.640 --> 00:10:28.590 align:middle line:84%
We could maybe use some
other Python methods

00:10:28.590 --> 00:10:31.530 align:middle line:84%
for checking more precisely is there
something to the left of the dot,

00:10:31.530 --> 00:10:32.550 align:middle line:90%
to the right of the dot.

00:10:32.550 --> 00:10:34.320 align:middle line:90%
We could use split multiple times.

00:10:34.320 --> 00:10:36.180 align:middle line:84%
But honestly, this
just escalates quickly.

00:10:36.180 --> 00:10:39.450 align:middle line:84%
Like, you end up having to
write a lot of code just

00:10:39.450 --> 00:10:42.360 align:middle line:84%
to express something that's
relatively simple in spirit--

00:10:42.360 --> 00:10:45.550 align:middle line:90%
just format this like an email address.

00:10:45.550 --> 00:10:47.920 align:middle line:90%
So how can we go about improving this?

00:10:47.920 --> 00:10:52.350 align:middle line:84%
Well, it turns out in Python there's
a library for regular expressions.

00:10:52.350 --> 00:10:55.620 align:middle line:84%
It's called succinctly
R-E. And in the re library,

00:10:55.620 --> 00:11:00.510 align:middle line:84%
you have a lot of capabilities to
define and check for and even replace

00:11:00.510 --> 00:11:01.440 align:middle line:90%
patterns.

00:11:01.440 --> 00:11:03.630 align:middle line:84%
Again, a regular
expression is a pattern.

00:11:03.630 --> 00:11:05.998 align:middle line:84%
And this library, the
re library in Python,

00:11:05.998 --> 00:11:08.040 align:middle line:84%
is going to let us define
some of these patterns,

00:11:08.040 --> 00:11:09.915 align:middle line:84%
like a pattern for an
email address, and then

00:11:09.915 --> 00:11:12.720 align:middle line:84%
use some built-in functions
to actually validate

00:11:12.720 --> 00:11:14.820 align:middle line:84%
a user's input against
that pattern or even

00:11:14.820 --> 00:11:17.250 align:middle line:84%
use these patterns to
change the user's input

00:11:17.250 --> 00:11:19.650 align:middle line:84%
or extract partial
information therefrom.

00:11:19.650 --> 00:11:22.030 align:middle line:90%
We'll see examples of all this and more.

00:11:22.030 --> 00:11:24.045 align:middle line:84%
So what can and should
I do with this library?

00:11:24.045 --> 00:11:26.670 align:middle line:84%
Well, first and foremost, it
comes with a lot of functionality.

00:11:26.670 --> 00:11:29.760 align:middle line:84%
Here is the URL, for instance,
to the official documentation.

00:11:29.760 --> 00:11:31.710 align:middle line:84%
And let me propose
that we focus on using

00:11:31.710 --> 00:11:36.600 align:middle line:84%
one of the most versatile functions
in the library, namely this-- search.

00:11:36.600 --> 00:11:40.440 align:middle line:84%
re.search is the name of the function in the re module
function and the re module

00:11:40.440 --> 00:11:42.910 align:middle line:84%
that allows you to pass
in a few arguments.

00:11:42.910 --> 00:11:46.620 align:middle line:84%
The first is going to be a pattern
that you want to search for in,

00:11:46.620 --> 00:11:48.900 align:middle line:84%
for instance, a string
that came from a user.

00:11:48.900 --> 00:11:51.977 align:middle line:84%
The string argument here is going
to be the actual string that you

00:11:51.977 --> 00:11:53.310 align:middle line:90%
want to search for that pattern.

00:11:53.310 --> 00:11:55.410 align:middle line:84%
And then there's a third
argument optionally

00:11:55.410 --> 00:11:56.790 align:middle line:90%
that's a whole bunch of flags.

00:11:56.790 --> 00:11:59.880 align:middle line:84%
A flag in general is like
a parameter you can pass in

00:11:59.880 --> 00:12:01.510 align:middle line:90%
to modify the behavior of the function.

00:12:01.510 --> 00:12:03.510 align:middle line:84%
But initially, we're not
even going to use this.

00:12:03.510 --> 00:12:06.610 align:middle line:84%
We're just going to pass in a
couple of arguments instead.
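
NOTE
Editor's sketch of the argument order described above: pattern first,
then the string to scan, then optional flags. re.IGNORECASE is a real
flag that I picked for illustration; the lecture does not use any flag
at this point.
```python
import re
# re.search(pattern, string, flags=0): pattern first, then the string to scan.
print(re.search("@", "malan@harvard.edu") is not None)  # True
print(re.search("@", "malan") is not None)  # False
# The optional third argument changes how matching behaves.
print(re.search("EDU", "malan@harvard.edu") is not None)  # False
print(re.search("EDU", "malan@harvard.edu", re.IGNORECASE) is not None)  # True
```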

00:12:06.610 --> 00:12:11.700 align:middle line:84%
So let me go ahead and employ this
re library, this regular expression

00:12:11.700 --> 00:12:15.162 align:middle line:84%
library, and just improve on
this design incrementally.

00:12:15.162 --> 00:12:17.370 align:middle line:84%
So we're not going to solve
this problem all at once,

00:12:17.370 --> 00:12:19.590 align:middle line:90%
but we'll take some incremental steps.

00:12:19.590 --> 00:12:21.840 align:middle line:90%
I'm going to go back to VS Code here.

00:12:21.840 --> 00:12:25.050 align:middle line:84%
And I'm going to go ahead now
and get rid of most of this code.

00:12:25.050 --> 00:12:28.230 align:middle line:84%
But I'm going to go into the top
of my file and, first of all,

00:12:28.230 --> 00:12:30.030 align:middle line:90%
import this re library.

00:12:30.030 --> 00:12:33.030 align:middle line:84%
So import re gives me access
to that function and more.

00:12:33.030 --> 00:12:36.150 align:middle line:84%
Now, after I've gotten the user's
input in the same way as before,

00:12:36.150 --> 00:12:38.790 align:middle line:84%
stripping off any leading
or trailing whitespace,

00:12:38.790 --> 00:12:42.250 align:middle line:84%
I'm just going to use this
function super trivially for now,

00:12:42.250 --> 00:12:44.460 align:middle line:84%
even though this isn't
really a big step forward.

00:12:44.460 --> 00:12:50.190 align:middle line:84%
I'm going to say, if re.search
contains quote, unquote "@"

00:12:50.190 --> 00:12:53.700 align:middle line:84%
in the email address, then let's
go ahead and print "valid."

00:12:53.700 --> 00:12:55.740 align:middle line:84%
Else, let's go ahead
and print "invalid."

00:12:55.740 --> 00:12:59.730 align:middle line:84%
At the moment, this is really no
better than my very first version

00:12:59.730 --> 00:13:04.150 align:middle line:84%
where I was just asking Python,
if @ sign in the email address.

00:13:04.150 --> 00:13:08.880 align:middle line:84%
But now I'm at least beginning to use
this library by using its own re.search

00:13:08.880 --> 00:13:13.740 align:middle line:84%
function, which for now you can assume
returns a true value effectively

00:13:13.740 --> 00:13:16.440 align:middle line:90%
if, indeed, the @ sign is in email.

00:13:16.440 --> 00:13:19.800 align:middle line:84%
Just to make sure that this version
does work as I expect, let me go ahead

00:13:19.800 --> 00:13:22.590 align:middle line:90%
and run python of validate.py and Enter.

00:13:22.590 --> 00:13:26.220 align:middle line:84%
I'll type in my actual email
address, and we're back in business.

00:13:26.220 --> 00:13:29.370 align:middle line:84%
But of course, this is not
great, because if I similarly

00:13:29.370 --> 00:13:32.400 align:middle line:84%
run this version of the program
and just type in an @ sign,

00:13:32.400 --> 00:13:35.860 align:middle line:84%
not an email address, my
code, of course, thinks it is valid.
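
NOTE
Editor's sketch of what "returns a true value effectively" means here:
re.search gives back a match object when the pattern is found and None
when it is not. The helper name looks_valid is hypothetical.
```python
import re
#
def looks_valid(email: str) -> bool:
    # re.search returns a match object (truthy) or None (falsy).
    return re.search("@", email) is not None
#
match = re.search("@", "malan@harvard.edu")
print(match.span())  # (5, 6): the @ sign sits at index 5
print(looks_valid("malan@harvard.edu"))  # True
print(looks_valid("@"))  # True: same weakness as the very first version
print(looks_valid("malan"))  # False
```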

00:13:35.860 --> 00:13:37.980 align:middle line:90%
So how can I do better than this?

00:13:37.980 --> 00:13:42.330 align:middle line:84%
Well, we need a bit more vocabulary
in the realm of regular expressions,

00:13:42.330 --> 00:13:46.290 align:middle line:84%
in order to be able to express
ourselves a little more precisely.

00:13:46.290 --> 00:13:48.900 align:middle line:84%
Really, the pattern I
want to ultimately define

00:13:48.900 --> 00:13:52.410 align:middle line:84%
is going to be something like, I want
there to be something to the left,

00:13:52.410 --> 00:13:55.320 align:middle line:84%
then an @ sign, then
something to the right.

00:13:55.320 --> 00:13:59.310 align:middle line:84%
And that something to the right should
end with .edu but should also have

00:13:59.310 --> 00:14:02.160 align:middle line:84%
something before the .edu,
like Harvard, or Yale,

00:14:02.160 --> 00:14:04.680 align:middle line:90%
or any other school in the US as well.

00:14:04.680 --> 00:14:06.550 align:middle line:90%
Well, how can I go about doing this?

00:14:06.550 --> 00:14:11.040 align:middle line:84%
Well, it turns out that in the world of
regular expressions, whether in Python

00:14:11.040 --> 00:14:14.220 align:middle line:84%
or a lot of other languages as
well, there are certain symbols

00:14:14.220 --> 00:14:16.140 align:middle line:90%
that you can use to define patterns.

00:14:16.140 --> 00:14:19.030 align:middle line:84%
At the moment, I've just
used literal raw text.

00:14:19.030 --> 00:14:21.600 align:middle line:84%
If I go back to my code
here, this technically

00:14:21.600 --> 00:14:23.940 align:middle line:90%
qualifies as a regular expression.

00:14:23.940 --> 00:14:28.290 align:middle line:84%
I've passed in a quoted string
inside of which is an @ sign.

00:14:28.290 --> 00:14:30.550 align:middle line:84%
Now, that's not a very
interesting pattern.

00:14:30.550 --> 00:14:31.500 align:middle line:90%
It's just an @ sign.

00:14:31.500 --> 00:14:34.290 align:middle line:84%
But it turns out that once you
have access to regular expressions

00:14:34.290 --> 00:14:37.350 align:middle line:84%
or a library that offers
that feature, you can more

00:14:37.350 --> 00:14:40.360 align:middle line:90%
powerfully express yourself as follows.

00:14:40.360 --> 00:14:43.770 align:middle line:84%
Let me reveal that the pattern
that you pass to re.search

00:14:43.770 --> 00:14:45.690 align:middle line:84%
can take a whole bunch
of special symbols.

00:14:45.690 --> 00:14:47.160 align:middle line:90%
And here's just some of them.

00:14:47.160 --> 00:14:51.630 align:middle line:84%
In the examples we're about to see,
in the patterns we're about to define,

00:14:51.630 --> 00:14:53.040 align:middle line:90%
here are the special symbols.

00:14:53.040 --> 00:14:56.280 align:middle line:84%
You can use a single period,
a dot, to just represent

00:14:56.280 --> 00:14:59.040 align:middle line:84%
any character except a
newline, that is, a line break.

00:14:59.040 --> 00:15:02.190 align:middle line:84%
So that is to say, if I don't really
care what letters of the alphabet

00:15:02.190 --> 00:15:04.200 align:middle line:84%
are in the user's
username, I just want there

00:15:04.200 --> 00:15:07.410 align:middle line:84%
to be one or more characters
in the user's name,

00:15:07.410 --> 00:15:11.340 align:middle line:84%
dot allows me to express A through
Z, uppercase and lowercase,

00:15:11.340 --> 00:15:13.560 align:middle line:90%
and a bunch of other letters as well.

00:15:13.560 --> 00:15:18.850 align:middle line:84%
* is going to mean-- a single
asterisk-- zero or more repetitions.

00:15:18.850 --> 00:15:21.630 align:middle line:84%
So if I say something
*, that means that I'm

00:15:21.630 --> 00:15:24.450 align:middle line:84%
willing to accept either
zero repetitions, that is,

00:15:24.450 --> 00:15:27.510 align:middle line:90%
nothing at all, or more repetitions--

00:15:27.510 --> 00:15:29.580 align:middle line:90%
1, or 2, or 3, or 300.

00:15:29.580 --> 00:15:31.950 align:middle line:84%
If you see a plus in
my pattern, so that's

00:15:31.950 --> 00:15:34.135 align:middle line:90%
going to mean one or more repetitions.

00:15:34.135 --> 00:15:37.260 align:middle line:84%
That is to say, there's got to be at
least one character there, one symbol,

00:15:37.260 --> 00:15:40.180 align:middle line:84%
and then there's
optionally more after that.

00:15:40.180 --> 00:15:43.110 align:middle line:84%
And then you can say
zero or one repetition.

00:15:43.110 --> 00:15:46.590 align:middle line:84%
You can use a single question mark
after a symbol, and that will say,

00:15:46.590 --> 00:15:51.260 align:middle line:84%
I want zero of this character or
one, but that's all I'll expect.

00:15:51.260 --> 00:15:53.010 align:middle line:84%
And then lastly, there's
going to be a way

00:15:53.010 --> 00:15:55.140 align:middle line:90%
to specify a specific number of symbols.

00:15:55.140 --> 00:15:57.330 align:middle line:84%
If you use these curly
braces and a number,

00:15:57.330 --> 00:15:59.610 align:middle line:84%
represented here
symbolically as m, you can

00:15:59.610 --> 00:16:03.720 align:middle line:84%
specify that you want m repetitions,
be it 1, or 2, or 3, or 300.

00:16:03.720 --> 00:16:06.190 align:middle line:84%
You can specify the number
of repetitions yourself.

00:16:06.190 --> 00:16:08.280 align:middle line:84%
And if you want a range
of repetitions, like you

00:16:08.280 --> 00:16:11.100 align:middle line:84%
want this few characters
or this many characters,

00:16:11.100 --> 00:16:13.770 align:middle line:84%
you can use curly braces
and two numbers inside,

00:16:13.770 --> 00:16:18.760 align:middle line:84%
called here m and n, which would be
a range of m through n repetitions.
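
NOTE
A minimal sketch of the symbols in the chart, assuming Python's built-in re module. The helper name matches is hypothetical, and it uses re.fullmatch rather than re.search so that the whole string has to fit the pattern.
```python
import re
def matches(pattern, text):
    # Hypothetical helper: True only if the entire text fits the pattern.
    return re.fullmatch(pattern, text) is not None
print(matches(r".", "x"), matches(r".", "\n"))              # . is any one character except a newline
print(matches(r"a*", ""), matches(r"a*", "aaa"))            # * is zero or more repetitions
print(matches(r"a+", ""), matches(r"a+", "aaa"))            # + is one or more repetitions
print(matches(r"a?", "a"), matches(r"a?", "aa"))            # ? is zero or one repetition
print(matches(r"a{3}", "aaa"), matches(r"a{3}", "aa"))      # {m} is exactly m repetitions
print(matches(r"a{2,4}", "aaaa"), matches(r"a{2,4}", "a"))  # {m,n} is m through n repetitions
```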

00:16:18.760 --> 00:16:20.140 align:middle line:90%
Now, what does all of this mean?

00:16:20.140 --> 00:16:22.380 align:middle line:84%
Well, let me go back to
VS Code here, and let

00:16:22.380 --> 00:16:25.650 align:middle line:84%
me propose that we iterate
on this solution further.

00:16:25.650 --> 00:16:27.985 align:middle line:84%
It's not sufficient to
just check for the @ sign.

00:16:27.985 --> 00:16:28.860 align:middle line:90%
We know that already.

00:16:28.860 --> 00:16:31.600 align:middle line:84%
We minimally want something
to the left and to the right.

00:16:31.600 --> 00:16:33.210 align:middle line:90%
So how can I represent that?

00:16:33.210 --> 00:16:35.910 align:middle line:84%
I don't really care what
the user's username is,

00:16:35.910 --> 00:16:40.020 align:middle line:84%
or what letters of the alphabet are
in it, be it malan or anyone else's.

00:16:40.020 --> 00:16:42.600 align:middle line:84%
So what I'm going to do to
the left of this @ sign

00:16:42.600 --> 00:16:44.410 align:middle line:90%
is I'm going to use a single period--

00:16:44.410 --> 00:16:49.600 align:middle line:84%
the dot that, again, indicates any
character except for a newline.

00:16:49.600 --> 00:16:51.630 align:middle line:84%
But I don't just want
a single character.

00:16:51.630 --> 00:16:55.900 align:middle line:84%
Otherwise, the person's username
could only be a at such and such,

00:16:55.900 --> 00:16:57.450 align:middle line:90%
or b at such and such.

00:16:57.450 --> 00:17:00.130 align:middle line:84%
I want it to be multiple
such characters.

00:17:00.130 --> 00:17:01.680 align:middle line:90%
So I'm going to initially use a *.

00:17:01.680 --> 00:17:05.550 align:middle line:84%
So dot * means give me something to the
left, and I'm going to do another one,

00:17:05.550 --> 00:17:07.619 align:middle line:90%
dot * something to the right.

00:17:07.619 --> 00:17:10.589 align:middle line:84%
Now, this isn't perfect, but
it's at least a step forward.

00:17:10.589 --> 00:17:12.871 align:middle line:84%
Because now what I'm going
to go ahead and do is this.

00:17:12.871 --> 00:17:14.579 align:middle line:84%
I'm going to rerun
python of validate.py.

00:17:14.579 --> 00:17:17.040 align:middle line:84%
And I'm going to keep testing my
own email address just to make

00:17:17.040 --> 00:17:18.415 align:middle line:90%
sure I haven't made things worse.

00:17:18.415 --> 00:17:19.800 align:middle line:90%
And that's now OK.

00:17:19.800 --> 00:17:22.530 align:middle line:84%
I'm now going to go ahead
and type in some other input,

00:17:22.530 --> 00:17:28.380 align:middle line:84%
like how about just malan@
with no domain name whatsoever.

00:17:28.380 --> 00:17:30.640 align:middle line:84%
And you would think this
is going to be invalid.

00:17:30.640 --> 00:17:34.680 align:middle line:84%
But, but, but it's
still considered valid.

00:17:34.680 --> 00:17:35.850 align:middle line:90%
But why is that?

00:17:35.850 --> 00:17:42.120 align:middle line:84%
If I go back to this chart, why is
malan@ with no domain now considered

00:17:42.120 --> 00:17:43.260 align:middle line:90%
valid?

00:17:43.260 --> 00:17:50.010 align:middle line:84%
What's my mistake here by having
used .*@.* as my regular expression

00:17:50.010 --> 00:17:50.670 align:middle line:90%
or regex?

00:17:50.670 --> 00:17:54.355 align:middle line:84%
AUDIENCE: Because you're using
the * instead of the plus sign.

00:17:54.355 --> 00:17:55.230 align:middle line:90%
DAVID MALAN: Exactly.

00:17:55.230 --> 00:17:58.090 align:middle line:84%
The *, again, means zero
or more repetitions.

00:17:58.090 --> 00:18:03.120 align:middle line:84%
So re.search is perfectly happy to
accept nothing after the @ sign,

00:18:03.120 --> 00:18:05.230 align:middle line:90%
because that would be zero repetitions.

00:18:05.230 --> 00:18:09.000 align:middle line:84%
So I think I minimally need to evolve
this and go back to my code here.

00:18:09.000 --> 00:18:12.990 align:middle line:84%
And let me go ahead and change
this from dot * to dot +.

00:18:12.990 --> 00:18:16.620 align:middle line:84%
And let me change the
ending from dot * to dot +

00:18:16.620 --> 00:18:18.900 align:middle line:90%
so that now when I run my code here--

00:18:18.900 --> 00:18:21.510 align:middle line:84%
let me go ahead and run
python of validate.py.

00:18:21.510 --> 00:18:23.490 align:middle line:84%
I'm going to test my
email address as always.

00:18:23.490 --> 00:18:24.600 align:middle line:90%
Still working.

00:18:24.600 --> 00:18:27.690 align:middle line:84%
Now let me go ahead and type in
that same thing from before that

00:18:27.690 --> 00:18:29.820 align:middle line:90%
was accidentally considered valid.

00:18:29.820 --> 00:18:32.590 align:middle line:90%
Now I hit Enter, finally it's invalid.
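
NOTE
A small sketch comparing the two patterns tried so far. The function name is_valid is an assumption made for illustration, and the sample inputs are hard-coded here rather than typed in at a prompt.
```python
import re
def is_valid(pattern, email):
    # Hypothetical helper: re.search returns a match object or None.
    return re.search(pattern, email) is not None
for email in ["malan@harvard.edu", "malan@", "@"]:
    # First column uses * (zero or more), second uses + (one or more).
    print(email, is_valid(r".*@.*", email), is_valid(r".+@.+", email))
```
With the star version, all three inputs are accepted. With the plus version, only the full address is.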

00:18:32.590 --> 00:18:35.460 align:middle line:84%
So now we're making some
progress on being a little more

00:18:35.460 --> 00:18:37.560 align:middle line:90%
precise as to what it is we're doing.

00:18:37.560 --> 00:18:40.920 align:middle line:84%
Now, I'll note here, like with
almost everything in programming,

00:18:40.920 --> 00:18:45.090 align:middle line:84%
Python included, there's often multiple
ways to solve the same problem.

00:18:45.090 --> 00:18:49.410 align:middle line:84%
And does anyone see
a way in my code here

00:18:49.410 --> 00:18:54.360 align:middle line:84%
that I can make a slight tweak if I
forgot that the plus operator exists

00:18:54.360 --> 00:18:56.880 align:middle line:90%
and go back to using a *?

00:18:56.880 --> 00:19:00.570 align:middle line:84%
If I allowed you only to
use dots and only stars,

00:19:00.570 --> 00:19:03.570 align:middle line:90%
could you recreate the notion of plus?

00:19:03.570 --> 00:19:04.890 align:middle line:90%
AUDIENCE: Yes.

00:19:04.890 --> 00:19:06.930 align:middle line:90%
Use another dot, dot dot *.

00:19:06.930 --> 00:19:07.680 align:middle line:90%
DAVID MALAN: Yeah.

00:19:07.680 --> 00:19:10.290 align:middle line:84%
Because if a dot means any
character, we'll just use a dot.

00:19:10.290 --> 00:19:14.040 align:middle line:84%
And then when you want to say "or
more," use another dot and then the *.

00:19:14.040 --> 00:19:18.300 align:middle line:84%
So equivalent to dot +
would have been dot dot *,

00:19:18.300 --> 00:19:21.870 align:middle line:84%
because the first dot means any
character, and the second pair

00:19:21.870 --> 00:19:25.050 align:middle line:84%
of characters, dot *, means
zero or more other characters.

00:19:25.050 --> 00:19:27.600 align:middle line:84%
And to be clear, it doesn't
have to be the same character.

00:19:27.600 --> 00:19:31.830 align:middle line:84%
Just by doing dot or dot * does not
mean your whole username needs to be

00:19:31.830 --> 00:19:35.310 align:middle line:90%
a, or aa, or aaa, or aaaa.

00:19:35.310 --> 00:19:37.230 align:middle line:90%
It can vary with each symbol.

00:19:37.230 --> 00:19:41.790 align:middle line:84%
It just means zero or more of
any character back to back.
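
NOTE
A quick check of that equivalence, assuming re.fullmatch so that each sample has to fit the pattern from start to end. The sample strings are made up for illustration.
```python
import re
samples = ["", "a", "ab", "aaa", "malan"]
for s in samples:
    with_plus = re.fullmatch(r".+", s) is not None
    with_dot_star = re.fullmatch(r"..*", s) is not None
    # The two results agree for every sample; only the empty string is rejected.
    print(repr(s), with_plus, with_dot_star)
```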

00:19:41.790 --> 00:19:44.050 align:middle line:84%
So I could do this on both
the left and the right.

00:19:44.050 --> 00:19:45.120 align:middle line:90%
Which one is better?

00:19:45.120 --> 00:19:46.110 align:middle line:90%
You know, it depends.

00:19:46.110 --> 00:19:49.860 align:middle line:84%
I think an argument could be made that
this is even more clear, because it's

00:19:49.860 --> 00:19:52.380 align:middle line:84%
obvious now that there's a
dot, which means any character,

00:19:52.380 --> 00:19:53.910 align:middle line:90%
and then there's the dot *.

00:19:53.910 --> 00:19:56.250 align:middle line:84%
But if you're in the habit
of doing this frequently,

00:19:56.250 --> 00:19:58.500 align:middle line:84%
one of the reasons things
like the plus exist

00:19:58.500 --> 00:20:01.750 align:middle line:84%
is just to consolidate your code into
something a little more succinct.

00:20:01.750 --> 00:20:03.750 align:middle line:84%
And if you're familiar
with seeing the plus now,

00:20:03.750 --> 00:20:05.470 align:middle line:90%
maybe this is more readable to you.

00:20:05.470 --> 00:20:07.590 align:middle line:84%
So again, just like with
Python more generally,

00:20:07.590 --> 00:20:10.590 align:middle line:84%
you're going to often see different
ways to express the same patterns,

00:20:10.590 --> 00:20:12.750 align:middle line:84%
and reasonable people
might agree or disagree

00:20:12.750 --> 00:20:15.810 align:middle line:90%
as to which way is better than another.

00:20:15.810 --> 00:20:18.030 align:middle line:84%
Well, let me propose to
you that we can think

00:20:18.030 --> 00:20:20.520 align:middle line:84%
about both of these models
a little more graphically.

00:20:20.520 --> 00:20:22.770 align:middle line:84%
If this looks a little cryptic
to you, let me go ahead

00:20:22.770 --> 00:20:26.610 align:middle line:84%
and rewind to the previous incarnation
of this regular expression, which

00:20:26.610 --> 00:20:28.830 align:middle line:90%
was just a single dot *.

00:20:28.830 --> 00:20:32.910 align:middle line:84%
This regular expression,
.*@.* means what again?

00:20:32.910 --> 00:20:36.690 align:middle line:84%
It means zero or more characters
followed by a literal @ sign followed

00:20:36.690 --> 00:20:38.580 align:middle line:90%
by zero or more other characters.

00:20:38.580 --> 00:20:41.850 align:middle line:84%
Now when you pass this pattern
in as an argument to re.search,

00:20:41.850 --> 00:20:45.030 align:middle line:84%
it's going to read it from
left to right and then use

00:20:45.030 --> 00:20:48.750 align:middle line:84%
it to try to match against the
input, email, in this case,

00:20:48.750 --> 00:20:50.100 align:middle line:90%
that the user typed in.

00:20:50.100 --> 00:20:53.070 align:middle line:84%
Now, how is the computer,
how is re.search

00:20:53.070 --> 00:20:57.760 align:middle line:84%
going to keep track of whether or not
the user's email matches this pattern?

00:20:57.760 --> 00:21:01.230 align:middle line:84%
Well, it turns out that it's going to
be using a machine of sorts implemented

00:21:01.230 --> 00:21:03.540 align:middle line:84%
in software known as a
finite state machine, or more

00:21:03.540 --> 00:21:06.750 align:middle line:84%
formally, a nondeterministic
finite automaton.

00:21:06.750 --> 00:21:09.930 align:middle line:84%
And the way it works, if we depict
this graphically, is as follows.

00:21:09.930 --> 00:21:14.940 align:middle line:84%
The re.search function starts over
here in a so-called start state.

00:21:14.940 --> 00:21:16.980 align:middle line:84%
That's the sort of condition
in which it begins.

00:21:16.980 --> 00:21:20.730 align:middle line:84%
And then it's going to read the user's
email address from left to right.

00:21:20.730 --> 00:21:24.030 align:middle line:84%
And it's going to decide whether
or not to stay in this first state

00:21:24.030 --> 00:21:26.170 align:middle line:90%
or transition to the next state.

00:21:26.170 --> 00:21:29.970 align:middle line:84%
So for instance, in this first state,
as the function is reading my email address,

00:21:29.970 --> 00:21:35.130 align:middle line:84%
malan@harvard.edu, it's going to
follow this curved edge up and around

00:21:35.130 --> 00:21:36.870 align:middle line:90%
to itself, a reflexive edge.

00:21:36.870 --> 00:21:40.030 align:middle line:84%
And it's labeled dot, because dot,
again, just means any character.

00:21:40.030 --> 00:21:43.840 align:middle line:84%
So as the function is reading my
email address, malan@harvard.edu,

00:21:43.840 --> 00:21:48.270 align:middle line:84%
from left to right, it's going to
follow these transitions as follows,

00:21:48.270 --> 00:21:53.070 align:middle line:90%
M-A-L-A-N.

00:21:53.070 --> 00:21:56.040 align:middle line:84%
And then it's hopefully going
to follow this transition

00:21:56.040 --> 00:22:00.000 align:middle line:84%
to the second state, because there's
a literal @ sign both in this machine

00:22:00.000 --> 00:22:01.630 align:middle line:90%
as well as in my email address.

00:22:01.630 --> 00:22:10.070 align:middle line:84%
Then it's going to try to read the rest
of my address, H-A-R-V-A-R-D dot E-D-U,

00:22:10.070 --> 00:22:11.190 align:middle line:90%
and that's it.

00:22:11.190 --> 00:22:12.870 align:middle line:90%
And then the computer is going to check.

00:22:12.870 --> 00:22:16.260 align:middle line:84%
Did it end up in an accept
state, a final state,

00:22:16.260 --> 00:22:18.120 align:middle line:84%
that's actually depicted
here pictorially

00:22:18.120 --> 00:22:21.150 align:middle line:84%
a little differently with double
circles, one inside of the other?

00:22:21.150 --> 00:22:25.410 align:middle line:84%
And that just means that if the
computer finds itself in that second

00:22:25.410 --> 00:22:29.130 align:middle line:84%
accept state after having
read all of the user's input,

00:22:29.130 --> 00:22:31.560 align:middle line:90%
it is, indeed, a valid email address.

00:22:31.560 --> 00:22:34.350 align:middle line:84%
If by some chance, the
machine somehow ended up

00:22:34.350 --> 00:22:37.020 align:middle line:84%
stuck in that first state, which
does not have double circles

00:22:37.020 --> 00:22:39.300 align:middle line:84%
and is therefore not an
accept state, the computer

00:22:39.300 --> 00:22:42.810 align:middle line:84%
would conclude this is an
invalid email address instead.

00:22:42.810 --> 00:22:45.630 align:middle line:84%
By contrast, if we go back
to my other version

00:22:45.630 --> 00:22:49.800 align:middle line:84%
of the code where I instead had dot
plus on both the left and the right,

00:22:49.800 --> 00:22:53.130 align:middle line:84%
recall that re.search is going to
use one of these state machines

00:22:53.130 --> 00:22:57.030 align:middle line:84%
in order to decide from left to right
whether or not to accept the user's

00:22:57.030 --> 00:22:59.310 align:middle line:90%
input, like malan@harvard.edu.

00:22:59.310 --> 00:23:02.850 align:middle line:84%
Can we get from the start state,
so to speak, to an accept state

00:23:02.850 --> 00:23:05.940 align:middle line:84%
to decide, yep, this was, in
fact, meeting the pattern?

00:23:05.940 --> 00:23:09.900 align:middle line:84%
Well, let's propose that this
nondeterministic finite automaton

00:23:09.900 --> 00:23:11.430 align:middle line:90%
looked like this instead.

00:23:11.430 --> 00:23:14.310 align:middle line:84%
We're going to start as before
in the leftmost start state,

00:23:14.310 --> 00:23:18.180 align:middle line:84%
and we're going to necessarily consume
one character per this first edge,

00:23:18.180 --> 00:23:21.480 align:middle line:84%
which is labeled with a dot to indicate
that we can consume any one character,

00:23:21.480 --> 00:23:24.411 align:middle line:90%
like the m in malan@harvard.edu.

00:23:24.411 --> 00:23:27.960 align:middle line:84%
Then we can spend some time consuming
more characters before the @ sign,

00:23:27.960 --> 00:23:31.290 align:middle line:90%
so the A-L-A-N.

00:23:31.290 --> 00:23:33.340 align:middle line:90%
Then we can consume the @ sign.

00:23:33.340 --> 00:23:36.270 align:middle line:84%
Then we can consume at least one
more character, because recall

00:23:36.270 --> 00:23:38.760 align:middle line:90%
that the regex has dot plus this time.

00:23:38.760 --> 00:23:42.190 align:middle line:84%
And then we can consume even
more characters if we want.

00:23:42.190 --> 00:23:45.900 align:middle line:84%
So if we first consume
the H in harvard.edu,

00:23:45.900 --> 00:23:53.885 align:middle line:84%
that leaves the A-R-V-A-R-D,
and then dot E-D-U.

00:23:53.885 --> 00:23:56.560 align:middle line:84%
And now here, too, we're
at the end of the story,

00:23:56.560 --> 00:23:59.760 align:middle line:84%
but we're in an accept state,
because that circle at the end

00:23:59.760 --> 00:24:03.840 align:middle line:84%
has two circles total, which means
that if the computer, if this function,

00:24:03.840 --> 00:24:07.830 align:middle line:84%
finds itself in that accept state after
reading the entirety of the user's

00:24:07.830 --> 00:24:11.580 align:middle line:84%
input, it is, too, in fact,
a valid email address.

00:24:11.580 --> 00:24:15.390 align:middle line:84%
If by contrast, we had gotten
stuck in one of those other states,

00:24:15.390 --> 00:24:18.180 align:middle line:84%
unable to follow a transition,
one of those edges,

00:24:18.180 --> 00:24:22.440 align:middle line:84%
and therefore unable to make progress
in the user's input from left to right,

00:24:22.440 --> 00:24:26.670 align:middle line:84%
then we would have to conclude that
email address is, in fact, invalid.
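
NOTE
A rough sketch of the second diagram, the machine for .+@.+, written by hand. The state numbers, the two edge tables, and the function name accepts are all assumptions for illustration, and this is not a description of how Python's re module is implemented internally. Because an @ can follow either the dot loop or the @ edge, the sketch tracks a set of possible states at once.
```python
DOT_EDGES = {0: 1, 1: 1, 2: 3, 3: 3}  # edges labeled "." (any character except a newline)
AT_EDGES = {1: 2}                     # the single edge labeled with a literal @
START, ACCEPT = 0, 3
def accepts(text):
    states = {START}
    for ch in text:
        next_states = set()
        for state in states:
            if ch != "\n" and state in DOT_EDGES:
                next_states.add(DOT_EDGES[state])
            if ch == "@" and state in AT_EDGES:
                next_states.add(AT_EDGES[state])
        states = next_states
    # Valid only if some path ends in the accept state after all input is read.
    return ACCEPT in states
print(accepts("malan@harvard.edu"), accepts("malan@"), accepts("@harvard.edu"))
```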

00:24:26.670 --> 00:24:29.490 align:middle line:84%
Well, how can we go about
improving this code further?

00:24:29.490 --> 00:24:33.660 align:middle line:84%
Let me propose now that we check not
only for a username and also something

00:24:33.660 --> 00:24:37.320 align:middle line:84%
after the username, like a domain name,
but minimally require that the string

00:24:37.320 --> 00:24:39.600 align:middle line:90%
ends with .edu as well.

00:24:39.600 --> 00:24:41.970 align:middle line:84%
Well, I think I could do
this fairly straightforwardly.

00:24:41.970 --> 00:24:44.940 align:middle line:84%
Not only do I want there to
be something after the @ sign,

00:24:44.940 --> 00:24:49.818 align:middle line:84%
like the domain, harvard, I want
the whole thing to end with .edu.

00:24:49.818 --> 00:24:52.320 align:middle line:90%
But there's a little bit of danger here.

00:24:52.320 --> 00:24:57.660 align:middle line:84%
What have I done wrong by implementing
my regular expression now in this way,

00:24:57.660 --> 00:24:59.110 align:middle line:90%
by using .+@.+.edu?

00:24:59.110 --> 00:25:01.938 align:middle line:90%


00:25:01.938 --> 00:25:06.080 align:middle line:90%
What could go wrong with this version?

00:25:06.080 --> 00:25:08.360 align:middle line:84%
AUDIENCE: The dot is--
the dot means something

00:25:08.360 --> 00:25:11.510 align:middle line:84%
else in this context, where it
means three or more repetitions

00:25:11.510 --> 00:25:14.630 align:middle line:84%
of a character, which is why it
will interpret it [INAUDIBLE].

00:25:14.630 --> 00:25:15.570 align:middle line:90%
DAVID MALAN: Exactly.

00:25:15.570 --> 00:25:19.340 align:middle line:84%
Even though I mean for it to
mean literally .edu, a period,

00:25:19.340 --> 00:25:22.560 align:middle line:84%
and then edu, unfortunately in
the world of regular expressions,

00:25:22.560 --> 00:25:26.720 align:middle line:84%
dot means any character, which means
that this string could technically end

00:25:26.720 --> 00:25:34.080 align:middle line:84%
in aedu, or bedu, or cedu, and so forth,
but that's not, in fact, what I want.

00:25:34.080 --> 00:25:37.670 align:middle line:84%
So any instincts now as to
how I could fix this problem?

00:25:37.670 --> 00:25:39.770 align:middle line:84%
And let me demonstrate
the problem more clearly.

00:25:39.770 --> 00:25:41.900 align:middle line:90%
Let me go ahead and run this code here.

00:25:41.900 --> 00:25:45.050 align:middle line:84%
Let me go ahead and type
in malan@harvard.edu.

00:25:45.050 --> 00:25:47.240 align:middle line:90%
And as always, this does, in fact, work.

00:25:47.240 --> 00:25:48.680 align:middle line:90%
But watch what happens here.

00:25:48.680 --> 00:25:52.520 align:middle line:84%
Let me go ahead and do
malan@harvard and then--

00:25:52.520 --> 00:25:57.992 align:middle line:84%
malan@harvard?edu, Enter,
that, too, is valid.

00:25:57.992 --> 00:26:00.950 align:middle line:84%
So I could put any character there
and it's still going to be accepted.

00:26:00.950 --> 00:26:02.420 align:middle line:90%
But I don't want ?edu.

00:26:02.420 --> 00:26:04.670 align:middle line:90%
I want .edu literally.

00:26:04.670 --> 00:26:08.700 align:middle line:84%
Any instincts, then, for how
we can solve this problem here?

00:26:08.700 --> 00:26:12.770 align:middle line:84%
How can I get this new function,
re.search, and a regular expression

00:26:12.770 --> 00:26:16.160 align:middle line:84%
more generally, to literally
mean a dot, might you think?

00:26:16.160 --> 00:26:19.257 align:middle line:84%
AUDIENCE: You can use the
escape character, the backslash?

00:26:19.257 --> 00:26:20.090 align:middle line:90%
DAVID MALAN: Indeed.

00:26:20.090 --> 00:26:22.927 align:middle line:84%
The so-called escape character,
which we've seen before outside

00:26:22.927 --> 00:26:25.760 align:middle line:84%
of the context of regular expressions
when we talked about newlines.

00:26:25.760 --> 00:26:29.640 align:middle line:84%
Backslash n was a way of telling
the computer I want a newline,

00:26:29.640 --> 00:26:32.810 align:middle line:84%
but without actually literally hitting
Enter and moving the cursor yourself.

00:26:32.810 --> 00:26:35.090 align:middle line:84%
And you don't want a
literal n on the screen.

00:26:35.090 --> 00:26:39.350 align:middle line:84%
So backslash n was a way to escape n
and convey that you want a newline.

00:26:39.350 --> 00:26:41.900 align:middle line:84%
It turns out regular expressions
use a similar technique

00:26:41.900 --> 00:26:43.640 align:middle line:90%
to solve this problem here.

00:26:43.640 --> 00:26:45.770 align:middle line:84%
In fact, let me go into
my regular expression.

00:26:45.770 --> 00:26:49.370 align:middle line:84%
And before that final dot,
let me put a single backslash.

00:26:49.370 --> 00:26:52.880 align:middle line:84%
In the world of regular expressions,
this is a so-called special sequence.

00:26:52.880 --> 00:26:55.940 align:middle line:84%
And it indicates, per this
backslash and a single dot,

00:26:55.940 --> 00:26:58.290 align:middle line:90%
that I literally want to match on a dot.

00:26:58.290 --> 00:27:02.180 align:middle line:84%
It's not that I want to match
on any character and then edu.

00:27:02.180 --> 00:27:05.300 align:middle line:84%
I want to match on a
dot, or a period, edu.

00:27:05.300 --> 00:27:09.050 align:middle line:84%
But we don't want Python to
misinterpret this backslash

00:27:09.050 --> 00:27:12.710 align:middle line:84%
as beginning an escape sequence,
something special like backslash

00:27:12.710 --> 00:27:15.590 align:middle line:84%
n, which even though we as the
programmer might type two characters

00:27:15.590 --> 00:27:20.090 align:middle line:84%
backslash n, it really is interpreted
by Python as a single newline.

00:27:20.090 --> 00:27:22.833 align:middle line:84%
We don't want any kind of
misinterpretation like that here.

00:27:22.833 --> 00:27:26.000 align:middle line:84%
So it turns out there's one other thing
we should do for regular expressions

00:27:26.000 --> 00:27:29.180 align:middle line:84%
like this that have a
backslash used in this way.

00:27:29.180 --> 00:27:33.440 align:middle line:84%
I want to specify to Python that I want
this string, this regular expression

00:27:33.440 --> 00:27:36.200 align:middle line:84%
in double quotes, to be
treated as a raw string,

00:27:36.200 --> 00:27:38.510 align:middle line:84%
literally putting an r at
the beginning of the string

00:27:38.510 --> 00:27:41.240 align:middle line:84%
to indicate to Python that you
should not try to interpret

00:27:41.240 --> 00:27:43.550 align:middle line:90%
any backslashes in the usual way.

00:27:43.550 --> 00:27:46.850 align:middle line:84%
I want to literally pass the
backslash and the dot and the edu

00:27:46.850 --> 00:27:50.030 align:middle line:84%
into this particular function,
search, in this case.

00:27:50.030 --> 00:27:53.750 align:middle line:84%
So it's similar in spirit to using
that f at the beginning of a format

00:27:53.750 --> 00:27:57.170 align:middle line:84%
string, which, of course, tells Python
to format the string in a certain way,

00:27:57.170 --> 00:27:59.720 align:middle line:84%
plugging in variables that
might be between curly braces.

00:27:59.720 --> 00:28:02.900 align:middle line:84%
But in this case, r
indicates a raw string

00:28:02.900 --> 00:28:05.570 align:middle line:90%
that I want passed in exactly as is.
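
NOTE
A short illustration of what the r prefix changes, using only built-in string behavior. The variable names are hypothetical.
```python
ordinary = "\n"  # Python interprets backslash n as one newline character
raw = r"\n"      # the raw string keeps two characters: a backslash and an n
print(len(ordinary), len(raw))
pattern = r".+@.+\.edu"
# Without the r prefix, the backslash itself would have to be doubled.
print(pattern == ".+@.+\\.edu")
```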

00:28:05.570 --> 00:28:09.380 align:middle line:84%
Now, it's only strictly necessary if
you are, in fact, using backslashes

00:28:09.380 --> 00:28:12.860 align:middle line:84%
to indicate that you want some
special sequence, like backslash dot.

00:28:12.860 --> 00:28:14.750 align:middle line:84%
But in general, it's
probably a good habit

00:28:14.750 --> 00:28:18.540 align:middle line:84%
to get into to just use raw strings
for all of your regular expressions

00:28:18.540 --> 00:28:21.480 align:middle line:84%
so that if you eventually go back
in, make a change, make an addition,

00:28:21.480 --> 00:28:23.600 align:middle line:84%
you don't accidentally
introduce a backslash

00:28:23.600 --> 00:28:28.113 align:middle line:84%
and then forget that that might have
some special or misinterpreted meaning.

00:28:28.113 --> 00:28:30.530 align:middle line:84%
Well, let me go ahead and try
this new regular expression.

00:28:30.530 --> 00:28:34.430 align:middle line:84%
I'll clear my terminal window,
run python of validate--

00:28:34.430 --> 00:28:36.800 align:middle line:90%
run python of validate.py.

00:28:36.800 --> 00:28:40.496 align:middle line:84%
And then I'll type in my email
address correctly, malan@harvard.edu.

00:28:40.496 --> 00:28:42.710 align:middle line:90%
And that's, fortunately, still valid.

00:28:42.710 --> 00:28:46.490 align:middle line:84%
Let me clear my screen and run it
one more time, python of validate.py.

00:28:46.490 --> 00:28:50.930 align:middle line:84%
And this time, let's mistype
it as malan@harvard?edu,

00:28:50.930 --> 00:28:53.540 align:middle line:84%
whereby there's obviously
not a dot there,

00:28:53.540 --> 00:28:57.710 align:middle line:84%
but there is some other single character
that last time was misinterpreted

00:28:57.710 --> 00:28:58.430 align:middle line:90%
as valid.

00:28:58.430 --> 00:29:01.970 align:middle line:84%
But this time, now that I've
improved my regular expression,

00:29:01.970 --> 00:29:05.270 align:middle line:90%
it's discovered as, indeed, invalid.
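
NOTE
A sketch of that before-and-after comparison. The constant names UNESCAPED and ESCAPED and the helper is_valid are assumptions for illustration.
```python
import re
UNESCAPED = r".+@.+.edu"   # the final dot still means any character
ESCAPED = r".+@.+\.edu"    # backslash dot means a literal period
def is_valid(pattern, email):
    # Hypothetical helper: True if re.search finds the pattern in the input.
    return re.search(pattern, email) is not None
for email in ["malan@harvard.edu", "malan@harvard?edu"]:
    print(email, is_valid(UNESCAPED, email), is_valid(ESCAPED, email))
```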

00:29:05.270 --> 00:29:10.850 align:middle line:84%
Any questions now on this technique for
matching something to the left of the @

00:29:10.850 --> 00:29:15.320 align:middle line:84%
sign, something to the right, and
now ending with .edu explicitly?

00:29:15.320 --> 00:29:18.582 align:middle line:84%
AUDIENCE: What happens when
user inserts multiple @ signs?

00:29:18.582 --> 00:29:19.790 align:middle line:90%
DAVID MALAN: A good question.

00:29:19.790 --> 00:29:21.320 align:middle line:90%
And you kind of called me out here.

00:29:21.320 --> 00:29:22.910 align:middle line:90%
Well, when in doubt, let's try.

00:29:22.910 --> 00:29:29.340 align:middle line:84%
Let me go ahead and do python of
validate.py, malan@@@harvard.edu,

00:29:29.340 --> 00:29:34.020 align:middle line:84%
which also is incorrect. Unfortunately,
my code thinks it's valid.

00:29:34.020 --> 00:29:37.490 align:middle line:84%
So another problem to solve,
but a shortcoming for now.
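
NOTE
A brief sketch of why the extra @ signs slip through: the dot matches any character except a newline, and @ is one of those characters. The variable names are hypothetical.
```python
import re
pattern = r".+@.+\.edu"
match = re.search(pattern, "malan@@@harvard.edu")
print(match is not None)
# The dot is happy to match an @ sign, so the extra ones are absorbed by .+
print(re.fullmatch(r".", "@") is not None)
```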

00:29:37.490 --> 00:29:41.510 align:middle line:84%
Other questions on these
regular expressions thus far?

00:29:41.510 --> 00:29:46.108 align:middle line:84%
AUDIENCE: Can you use curly
brackets m instead of backslash?

00:29:46.108 --> 00:29:48.650 align:middle line:84%
DAVID MALAN: Can you use curly
brackets instead of backslash?

00:29:48.650 --> 00:29:49.490 align:middle line:90%
Not in this case.

00:29:49.490 --> 00:29:53.750 align:middle line:84%
If you want a literal dot, backslash
dot is the way to do it literally.

00:29:53.750 --> 00:29:56.660 align:middle line:84%
How about one other question
on regular expressions?

00:29:56.660 --> 00:30:00.620 align:middle line:84%
AUDIENCE: Is this the same thing
that Google Forms uses in order

00:30:00.620 --> 00:30:06.590 align:middle line:84%
to categorize data in, let's say, some--
if you've got multiple people sending

00:30:06.590 --> 00:30:09.380 align:middle line:90%
in requests about some feedback?

00:30:09.380 --> 00:30:12.170 align:middle line:84%
Do they categorize
the data that they get

00:30:12.170 --> 00:30:14.247 align:middle line:84%
using this particular
regular expression thing?

00:30:14.247 --> 00:30:15.080 align:middle line:90%
DAVID MALAN: Indeed.

00:30:15.080 --> 00:30:17.450 align:middle line:84%
If you've ever used Google
Forms to not just submit it

00:30:17.450 --> 00:30:20.900 align:middle line:84%
but to create a Google Form,
one of the menu options

00:30:20.900 --> 00:30:23.570 align:middle line:84%
is for response validation,
in English at least.

00:30:23.570 --> 00:30:25.340 align:middle line:84%
And what that allows
you to do is specify

00:30:25.340 --> 00:30:29.060 align:middle line:84%
that the user has to input
an email address, or a URL,

00:30:29.060 --> 00:30:31.400 align:middle line:90%
or a string of some length.

00:30:31.400 --> 00:30:33.830 align:middle line:84%
But there's an even more
powerful feature that some of you

00:30:33.830 --> 00:30:35.150 align:middle line:90%
may not have ever noticed.

00:30:35.150 --> 00:30:37.340 align:middle line:84%
And indeed, if you'd like
to open up Google Forms,

00:30:37.340 --> 00:30:41.180 align:middle line:84%
create a new form temporarily, and
poke around, you will actually see,

00:30:41.180 --> 00:30:44.270 align:middle line:84%
in English at least, quote,
unquote "regular expression"

00:30:44.270 --> 00:30:46.070 align:middle line:84%
mentioned as one of
the mechanisms you can

00:30:46.070 --> 00:30:49.520 align:middle line:84%
use to validate your users'
input into your Google Form.

00:30:49.520 --> 00:30:53.690 align:middle line:84%
So in fact, after today you can
start avoiding the specific dropdowns

00:30:53.690 --> 00:30:55.610 align:middle line:84%
of like email address,
or URL, or the like,

00:30:55.610 --> 00:30:59.540 align:middle line:84%
and you can express your own
patterns precisely as well.

00:30:59.540 --> 00:31:02.900 align:middle line:84%
Regular expressions can even
be used in VS Code itself.

00:31:02.900 --> 00:31:06.440 align:middle line:84%
If you go and find, or do a
find and replace in VS Code,

00:31:06.440 --> 00:31:08.690 align:middle line:84%
you can, of course, just
type in words, like you could

00:31:08.690 --> 00:31:10.880 align:middle line:90%
into Microsoft Word or Google Docs.

00:31:10.880 --> 00:31:14.990 align:middle line:84%
You can also type, if you check
the right box, regular expressions

00:31:14.990 --> 00:31:19.670 align:middle line:84%
and start searching for patterns,
not literally specific values.

00:31:19.670 --> 00:31:24.080 align:middle line:84%
Well, let me propose that we now
enhance this implementation further

00:31:24.080 --> 00:31:28.010 align:middle line:84%
by introducing a few other symbols,
because right now with my code,

00:31:28.010 --> 00:31:32.540 align:middle line:84%
I keep saying that I want my email
address to end with .edu and start with

00:31:32.540 --> 00:31:35.780 align:middle line:84%
a username, but I'm being
a little too generous.

00:31:35.780 --> 00:31:38.690 align:middle line:84%
This does, in fact, work as
expected for my own email address,

00:31:38.690 --> 00:31:40.928 align:middle line:90%
malan@harvard.edu.

00:31:40.928 --> 00:31:45.350 align:middle line:84%
But what if I type in a
sentence like, "my email address

00:31:45.350 --> 00:31:50.180 align:middle line:84%
is malan@harvard.edu," and suppose
I've typed that into the program

00:31:50.180 --> 00:31:52.310 align:middle line:90%
or I've typed that into a Google Form?

00:31:52.310 --> 00:31:57.680 align:middle line:84%
Is this going to be
considered valid or invalid?

00:31:57.680 --> 00:31:59.390 align:middle line:90%
Well, let's consider.

00:31:59.390 --> 00:32:01.970 align:middle line:90%
It's got an @ sign, so we're good there.

00:32:01.970 --> 00:32:05.570 align:middle line:84%
It's got one or more characters
to the left of the @ sign.

00:32:05.570 --> 00:32:09.050 align:middle line:84%
It's got one or more characters
to the right of the @ sign.

00:32:09.050 --> 00:32:14.390 align:middle line:84%
It's got a literal .edu somewhere
in there to the right of the @ sign.

00:32:14.390 --> 00:32:16.460 align:middle line:84%
And granted, there's
more stuff to the right.

00:32:16.460 --> 00:32:19.700 align:middle line:84%
There's literally this period at
the end of my English sentence.

00:32:19.700 --> 00:32:23.600 align:middle line:84%
But that's OK, because at the moment,
my regular expression is not so precise

00:32:23.600 --> 00:32:29.156 align:middle line:84%
as to say, the pattern must start with
the username and end with the .edu.

00:32:29.156 --> 00:32:32.573 align:middle line:84%
Technically, it's left unsaid
what more can be to the left

00:32:32.573 --> 00:32:33.990 align:middle line:90%
and what more can be to the right.

00:32:33.990 --> 00:32:37.970 align:middle line:84%
So when I hit Enter now, you'll see
that that whole sentence in English

00:32:37.970 --> 00:32:40.500 align:middle line:84%
is valid, and that's
obviously not what you want.

00:32:40.500 --> 00:32:43.430 align:middle line:84%
In fact, consider the case of
using Google Forms or Office

00:32:43.430 --> 00:32:45.620 align:middle line:90%
365 to collect data from users.

00:32:45.620 --> 00:32:48.320 align:middle line:84%
If you don't validate
your input, your users

00:32:48.320 --> 00:32:51.170 align:middle line:84%
might very well type in a full
sentence or something else

00:32:51.170 --> 00:32:53.550 align:middle line:84%
with a typographical
error, not an actual email.

00:32:53.550 --> 00:32:55.993 align:middle line:84%
So if you're just trying to
copy all of the results that

00:32:55.993 --> 00:32:58.160 align:middle line:84%
have been typed into your
form so you can paste them

00:32:58.160 --> 00:33:00.767 align:middle line:84%
into Gmail or some email
program, it's going to break,

00:33:00.767 --> 00:33:04.100 align:middle line:84%
because you're going to accidentally paste
something like a whole English sentence

00:33:04.100 --> 00:33:07.010 align:middle line:84%
into the program instead of
just an email address, which

00:33:07.010 --> 00:33:08.690 align:middle line:90%
is what your mailer expects.

00:33:08.690 --> 00:33:10.280 align:middle line:90%
So how can I be more precise?

00:33:10.280 --> 00:33:13.550 align:middle line:84%
Well, let me propose we introduce
a few more symbols as well.

00:33:13.550 --> 00:33:17.540 align:middle line:84%
It turns out in the context of a regular
expression, one of these patterns,

00:33:17.540 --> 00:33:21.170 align:middle line:84%
you can use the caret symbol,
the little triangular mark,

00:33:21.170 --> 00:33:24.080 align:middle line:84%
to represent that you
want this pattern to match

00:33:24.080 --> 00:33:27.110 align:middle line:84%
the start of the string
specifically-- not anywhere

00:33:27.110 --> 00:33:29.330 align:middle line:90%
but the start of the user's string.

00:33:29.330 --> 00:33:34.040 align:middle line:84%
By contrast, you can use a $ sign in
your regular expression to say that you

00:33:34.040 --> 00:33:37.790 align:middle line:84%
want to match the end of the string,
or technically just before the newline

00:33:37.790 --> 00:33:38.910 align:middle line:90%
at the end of the string.

00:33:38.910 --> 00:33:41.810 align:middle line:84%
But for all intents and purposes,
think of caret as meaning "start

00:33:41.810 --> 00:33:45.650 align:middle line:84%
of the string" and $ sign as
meaning "end of the string."

00:33:45.650 --> 00:33:49.310 align:middle line:84%
It is a weird thing that one
is a caret and one is a $ sign.

00:33:49.310 --> 00:33:51.710 align:middle line:84%
These are not really things
that I think of as opposites,

00:33:51.710 --> 00:33:53.670 align:middle line:84%
like parentheses or
something like that.

00:33:53.670 --> 00:33:56.430 align:middle line:84%
But those are the symbols the
world chose many years ago.

00:33:56.430 --> 00:33:58.370 align:middle line:90%
So let me go back to VS Code now.

00:33:58.370 --> 00:34:01.460 align:middle line:84%
And let me add this
feature to my code here.

00:34:01.460 --> 00:34:04.790 align:middle line:84%
Let me specify that yes, I do
want to search for this pattern,

00:34:04.790 --> 00:34:08.480 align:middle line:84%
but I want the user's input
to start with this pattern

00:34:08.480 --> 00:34:09.860 align:middle line:90%
and end with this pattern.

00:34:09.860 --> 00:34:12.440 align:middle line:84%
So even though it's going to
start looking even more cryptic,

00:34:12.440 --> 00:34:14.690 align:middle line:84%
I put a caret symbol
here at the beginning,

00:34:14.690 --> 00:34:17.270 align:middle line:90%
and I put a $ sign here at the end.

00:34:17.270 --> 00:34:21.199 align:middle line:84%
That does not mean I want the user
to type a caret symbol or a $ sign.

00:34:21.199 --> 00:34:25.130 align:middle line:84%
This is special symbology
that indicates to re.search

00:34:25.130 --> 00:34:29.280 align:middle line:84%
that it should only look for now an
exact match against this pattern.

00:34:29.280 --> 00:34:31.699 align:middle line:84%
So if I now go back to
my terminal window--

00:34:31.699 --> 00:34:33.920 align:middle line:84%
and I'll leave the previous
result on the screen--

00:34:33.920 --> 00:34:35.540 align:middle line:90%
let me type the exact same thing.

00:34:35.540 --> 00:34:39.610 align:middle line:84%
"My email address is
malan@harvard.edu," Enter--

00:34:39.610 --> 00:34:41.000 align:middle line:90%
sorry, period.

00:34:41.000 --> 00:34:43.070 align:middle line:84%
And now I'm going to
go ahead and hit Enter.

00:34:43.070 --> 00:34:45.770 align:middle line:90%
Now that's considered invalid.

00:34:45.770 --> 00:34:47.090 align:middle line:90%
But let me clear the screen.

00:34:47.090 --> 00:34:48.923 align:middle line:84%
And just to make sure
I didn't break things,

00:34:48.923 --> 00:34:53.330 align:middle line:84%
let me type in just my email
address, and that, too, is valid.
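
NOTE
A minimal sketch of the demo above, assuming the pattern at this point in the
lecture is ".+@.+\.edu" and that validity is decided by re.search. The helper
name is_valid is hypothetical and exists only for illustration.
```python
import re
# The same pattern without and with the ^ and $ anchors.
unanchored = r".+@.+\.edu"
anchored = r"^.+@.+\.edu$"
sentence = "My email address is malan@harvard.edu."
email = "malan@harvard.edu"
def is_valid(pattern, s):
    # re.search returns a match object on success and None otherwise.
    return re.search(pattern, s) is not None
print(is_valid(unanchored, sentence))  # True: the pattern may match anywhere
print(is_valid(anchored, sentence))    # False: the trailing period breaks the match
print(is_valid(anchored, email))       # True
```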

00:34:53.330 --> 00:34:58.250 align:middle line:84%
Any questions now on this version of
my regular expression, which, note,

00:34:58.250 --> 00:35:01.670 align:middle line:84%
goes further to specify
even more precisely

00:35:01.670 --> 00:35:06.120 align:middle line:84%
that I want it to match
at the start and the end?

00:35:06.120 --> 00:35:08.568 align:middle line:90%
Any questions on this one here?

00:35:08.568 --> 00:35:09.110 align:middle line:90%
AUDIENCE: OK.

00:35:09.110 --> 00:35:13.160 align:middle line:84%
You have slash, and
.edu, then the $ sign.

00:35:13.160 --> 00:35:18.170 align:middle line:84%
But the dot is one of the
regular expression, right?

00:35:18.170 --> 00:35:19.460 align:middle line:90%
DAVID MALAN: It normally is.

00:35:19.460 --> 00:35:24.590 align:middle line:84%
But this backslash that I deliberately
put before this period here

00:35:24.590 --> 00:35:26.180 align:middle line:90%
is an escape character.

00:35:26.180 --> 00:35:30.710 align:middle line:84%
It is a way of telling re.search that
I don't want any character there,

00:35:30.710 --> 00:35:33.140 align:middle line:90%
I literally want a period there.

00:35:33.140 --> 00:35:36.080 align:middle line:84%
And it's the only way you can
distinguish one from the other.

00:35:36.080 --> 00:35:40.550 align:middle line:84%
If I got rid of that slash, this
would mean that the email address just

00:35:40.550 --> 00:35:43.610 align:middle line:84%
has to end with any character,
then an E, then a D,

00:35:43.610 --> 00:35:45.180 align:middle line:90%
then a U. I don't want that.

00:35:45.180 --> 00:35:49.730 align:middle line:84%
I want literally a period, then
the E, then the D, then the U.

00:35:49.730 --> 00:35:53.780 align:middle line:84%
This is actually common convention in
programming and technology in general.

00:35:53.780 --> 00:35:55.820 align:middle line:84%
If you and I decide on
a convention, whereby

00:35:55.820 --> 00:35:59.180 align:middle line:84%
we're using some character on the
keyboard to mean something special,

00:35:59.180 --> 00:36:02.060 align:middle line:84%
invariably we create a
future problem for ourselves

00:36:02.060 --> 00:36:04.820 align:middle line:84%
when we want to literally
use that same character.

00:36:04.820 --> 00:36:07.190 align:middle line:84%
And so the solution in
general to that problem

00:36:07.190 --> 00:36:10.790 align:middle line:84%
is to somehow escape the character
so that it's clear to the computer

00:36:10.790 --> 00:36:14.510 align:middle line:84%
that it's not that special symbol,
it's literally the symbol it sees.

00:36:14.510 --> 00:36:19.700 align:middle line:84%
AUDIENCE: So we don't even know the--
we don't need another slash before the $

00:36:19.700 --> 00:36:20.930 align:middle line:90%
sign?

00:36:20.930 --> 00:36:22.150 align:middle line:90%
DAVID MALAN: No.

00:36:22.150 --> 00:36:25.550 align:middle line:84%
Because in this case, $ sign
means something special.

00:36:25.550 --> 00:36:30.590 align:middle line:84%
Per this chart here, $ sign by itself
does not mean US dollars or currency.

00:36:30.590 --> 00:36:33.420 align:middle line:84%
It literally means "match
the end of the string."

00:36:33.420 --> 00:36:38.600 align:middle line:84%
If, however, I wanted the user to
literally type in $ sign at the end

00:36:38.600 --> 00:36:40.910 align:middle line:84%
of their input, the
solution would be the same.

00:36:40.910 --> 00:36:43.700 align:middle line:84%
I would put a backslash
before the $ sign,

00:36:43.700 --> 00:36:48.242 align:middle line:84%
which means my email address would have
to be something like malan@harvard.edu

00:36:48.242 --> 00:36:50.850 align:middle line:84%
$ sign, which is
obviously not correct too.

00:36:50.850 --> 00:36:55.280 align:middle line:84%
So backslashes just allow you
to tell the computer to not treat

00:36:55.280 --> 00:36:58.310 align:middle line:84%
those symbols specially, like
meaning something special,

00:36:58.310 --> 00:37:00.950 align:middle line:90%
but to treat them literally instead.
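
NOTE
A small sketch of the escaping rule discussed above, assuming the same
anchored pattern as before. The variable names and the matches helper are
hypothetical. It contrasts an unescaped dot, an escaped dot, and an escaped
dollar sign.
```python
import re
# Unescaped dot: any character may stand where the period should be.
loose = r"^.+@.+.edu$"
# Escaped dot: only a literal period is accepted there.
strict = r"^.+@.+\.edu$"
# Escaped dollar sign: the input must literally end in "$".
literal_dollar = r"^.+@.+\.edu\$$"
def matches(pattern, s):
    return re.search(pattern, s) is not None
print(matches(loose, "malan@harvard?edu"))            # True
print(matches(strict, "malan@harvard?edu"))           # False
print(matches(strict, "malan@harvard.edu"))           # True
print(matches(literal_dollar, "malan@harvard.edu$"))  # True
print(matches(literal_dollar, "malan@harvard.edu"))   # False
```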

00:37:00.950 --> 00:37:04.550 align:middle line:84%
How about one other question
here on regular expressions?

00:37:04.550 --> 00:37:09.010 align:middle line:84%
AUDIENCE: You said one
represents to make it one plus,

00:37:09.010 --> 00:37:11.095 align:middle line:84%
then you said one was to
make it one with nothing.

00:37:11.095 --> 00:37:11.845 align:middle line:90%
DAVID MALAN: Sure.

00:37:11.845 --> 00:37:13.220 align:middle line:90%
AUDIENCE: So why would you add the plus?

00:37:13.220 --> 00:37:14.360 align:middle line:90%
DAVID MALAN: Let me rewind in time.

00:37:14.360 --> 00:37:17.027 align:middle line:84%
I think what you're referring to
was one of our earlier versions

00:37:17.027 --> 00:37:20.360 align:middle line:84%
that initially looked like this,
which just meant zero or more

00:37:20.360 --> 00:37:24.710 align:middle line:84%
characters, then an @ sign, then
zero or more other characters.

00:37:24.710 --> 00:37:29.090 align:middle line:84%
We then evolved that to be
this, dot plus on both sides, which

00:37:29.090 --> 00:37:31.340 align:middle line:84%
means one or more
characters on the left, then

00:37:31.340 --> 00:37:34.320 align:middle line:84%
an @ sign, then one or more
characters on the right.

00:37:34.320 --> 00:37:36.560 align:middle line:84%
And if I'm interpreting
your question correctly,

00:37:36.560 --> 00:37:40.370 align:middle line:84%
one of the points I made earlier was
that if you didn't use plus or forgot

00:37:40.370 --> 00:37:44.510 align:middle line:84%
that it exists, you could equivalently
achieve the exact same result with two

00:37:44.510 --> 00:37:48.380 align:middle line:84%
dots and a *, because the
first dot means any character--

00:37:48.380 --> 00:37:49.550 align:middle line:90%
it's got to be there--

00:37:49.550 --> 00:37:54.170 align:middle line:84%
the second dot * means zero
or more other characters,

00:37:54.170 --> 00:37:55.380 align:middle line:90%
and same on the right.

00:37:55.380 --> 00:37:57.950 align:middle line:84%
So it's just another way of
expressing the same idea.

00:37:57.950 --> 00:38:01.970 align:middle line:84%
"One or more" can be represented
like this with dot dot *,

00:38:01.970 --> 00:38:06.840 align:middle line:84%
or you can just use the handier syntax
of dot +, which means the same thing.
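
NOTE
A quick check of the equivalence described above, written as a sketch with
hypothetical variable names and sample strings. It compares the dot-star,
dot-plus, and dot-dot-star forms on a few inputs.
```python
import re
zero_or_more = r".*@.*"   # the earliest version
plus_form = r".+@.+"      # one or more on each side
star_form = r"..*@..*"    # one character, then zero or more: same idea
samples = ["malan@harvard", "a@b", "@", "malan@", "@harvard"]
results = []
for s in samples:
    row = tuple(re.search(p, s) is not None
                for p in (zero_or_more, plus_form, star_form))
    results.append((s, row))
    print(repr(s), row)
```
On every sample the second and third columns agree, while the first also
accepts a bare "@".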

00:38:06.840 --> 00:38:07.340 align:middle line:90%
All right.

00:38:07.340 --> 00:38:10.507 align:middle line:84%
So I daresay there's still some problems
with the regular expression in this

00:38:10.507 --> 00:38:13.790 align:middle line:84%
current form, because even though now
we're starting to look for the user

00:38:13.790 --> 00:38:16.010 align:middle line:84%
name at the beginning of
the string from the user,

00:38:16.010 --> 00:38:20.390 align:middle line:84%
and we're looking for the .edu literally
at the end of the string from the user,

00:38:20.390 --> 00:38:23.780 align:middle line:84%
those dots are a little
too encompassing right now.

00:38:23.780 --> 00:38:26.450 align:middle line:84%
I'm allowed to type in more
than the single @ sign.

00:38:26.450 --> 00:38:27.020 align:middle line:90%
Why?

00:38:27.020 --> 00:38:30.720 align:middle line:84%
Because @ is a character,
and dot means any character.

00:38:30.720 --> 00:38:34.650 align:middle line:84%
So honestly, I can have as many @ signs
in this thing at the moment as I want.

00:38:34.650 --> 00:38:37.280 align:middle line:84%
For instance, if I run
python of validate.py,

00:38:37.280 --> 00:38:40.500 align:middle line:84%
malan@harvard.edu,
still works as expected.

00:38:40.500 --> 00:38:44.270 align:middle line:84%
But if I also run python of
validate.py and incorrectly do

00:38:44.270 --> 00:38:51.030 align:middle line:84%
malan@@@harvard.edu, should be invalid,
but it's considered valid instead.

00:38:51.030 --> 00:38:55.670 align:middle line:84%
So I think we need to be a little more
restrictive when it comes to that dot.

00:38:55.670 --> 00:38:59.180 align:middle line:84%
And we can't just say, oh, any
old character there is fine.

00:38:59.180 --> 00:39:00.950 align:middle line:90%
We need to be more specific.

00:39:00.950 --> 00:39:05.390 align:middle line:84%
Well, it turns out that regular
expressions also support this syntax.

00:39:05.390 --> 00:39:08.990 align:middle line:84%
You can use square brackets
inside of your pattern,

00:39:08.990 --> 00:39:14.210 align:middle line:84%
and inside of those square brackets
include one or more characters

00:39:14.210 --> 00:39:17.000 align:middle line:90%
that you want to look for specifically.

00:39:17.000 --> 00:39:20.510 align:middle line:84%
Alternatively, you can inside
of those square brackets

00:39:20.510 --> 00:39:23.660 align:middle line:84%
put a caret symbol, which
unfortunately in this context,

00:39:23.660 --> 00:39:27.150 align:middle line:84%
means something completely different
from "match the start of the string."

00:39:27.150 --> 00:39:30.870 align:middle line:84%
But this would be the complement
operator inside of the square brackets,

00:39:30.870 --> 00:39:34.320 align:middle line:84%
which means "you cannot match
any of these characters."

00:39:34.320 --> 00:39:36.980 align:middle line:84%
So things are about to
look even more cryptic now.

00:39:36.980 --> 00:39:41.000 align:middle line:84%
But that's why we're focusing on
regular expressions on their own here.

00:39:41.000 --> 00:39:46.850 align:middle line:84%
If I don't want to allow any character,
which is what a dot is, let me go ahead

00:39:46.850 --> 00:39:52.610 align:middle line:84%
and I could just say, well, I only
want to support As, or Bs, or Cs, or Ds,

00:39:52.610 --> 00:39:54.200 align:middle line:90%
or Es, or Fs, or Gs.

00:39:54.200 --> 00:39:56.750 align:middle line:84%
I could type in the whole
alphabet here plus some numbers

00:39:56.750 --> 00:40:00.110 align:middle line:84%
to actually include all of the
letters that I do want to allow.

00:40:00.110 --> 00:40:02.570 align:middle line:84%
But honestly, a little
simpler would be this.

00:40:02.570 --> 00:40:09.020 align:middle line:84%
I could use a ^ symbol and then an @
sign, which has the effect of saying,

00:40:09.020 --> 00:40:14.270 align:middle line:84%
this is the set of characters that
has everything except an @ sign.

00:40:14.270 --> 00:40:16.130 align:middle line:90%
And I can do the same thing over here.

00:40:16.130 --> 00:40:23.270 align:middle line:84%
Instead of a dot to the right of the @
sign, I can do open bracket ^, @ sign.

00:40:23.270 --> 00:40:26.390 align:middle line:84%
And I admit, things are starting
to escalate quickly here,

00:40:26.390 --> 00:40:28.940 align:middle line:84%
but let's start from the
left and go to the right.

00:40:28.940 --> 00:40:33.020 align:middle line:84%
This ^ outside of the square brackets
at the very start of my string,

00:40:33.020 --> 00:40:35.810 align:middle line:84%
as before, means "match from
the start of the string."

00:40:35.810 --> 00:40:36.890 align:middle line:90%
And let's jump ahead.

00:40:36.890 --> 00:40:40.580 align:middle line:84%
The $ sign all the way at the end of
the regular expression means "match

00:40:40.580 --> 00:40:42.180 align:middle line:90%
at the end of the string."

00:40:42.180 --> 00:40:45.290 align:middle line:84%
So if we can mentally tick those
off as straightforward, let's

00:40:45.290 --> 00:40:47.630 align:middle line:84%
now focus on everything
else in the middle.

00:40:47.630 --> 00:40:50.510 align:middle line:84%
Well, to the left here
we have new syntax--

00:40:50.510 --> 00:40:56.840 align:middle line:84%
a square bracket, another ^, an @ sign,
and a closed square bracket, and then

00:40:56.840 --> 00:40:57.560 align:middle line:90%
a +.

00:40:57.560 --> 00:40:59.780 align:middle line:90%
The + means the same thing as always.

00:40:59.780 --> 00:41:03.110 align:middle line:84%
It means "one or more of
the things to the left."

00:41:03.110 --> 00:41:04.830 align:middle line:90%
What is the thing to the left?

00:41:04.830 --> 00:41:06.650 align:middle line:90%
Well, this is the new syntax.

00:41:06.650 --> 00:41:10.880 align:middle line:84%
Inside of square brackets here, I
have a ^ symbol and then an @ sign.

00:41:10.880 --> 00:41:14.990 align:middle line:84%
That just means any
character except an @ sign.

00:41:14.990 --> 00:41:18.890 align:middle line:84%
It's a weird syntax, but this is how
we can express that simple idea--

00:41:18.890 --> 00:41:23.022 align:middle line:84%
any character on the keyboard
except for an @ sign.

00:41:23.022 --> 00:41:25.980 align:middle line:84%
And heck, even other characters that
aren't physically on your keyboard

00:41:25.980 --> 00:41:28.020 align:middle line:90%
but that nonetheless exist.

00:41:28.020 --> 00:41:32.120 align:middle line:84%
Then we have a literal @ sign, then we
have another one of these same things--

00:41:32.120 --> 00:41:36.950 align:middle line:84%
square bracket, ^@ closed bracket, which
means any character except an @ sign,

00:41:36.950 --> 00:41:42.710 align:middle line:84%
then one or more of those things,
followed by literally a period edu.

00:41:42.710 --> 00:41:45.960 align:middle line:84%
So now let me go ahead
and do this again.

00:41:45.960 --> 00:41:49.280 align:middle line:84%
Let me rerun python of validate.py
and test my own email address

00:41:49.280 --> 00:41:51.595 align:middle line:90%
to make sure I've not made things worse.

00:41:51.595 --> 00:41:52.220 align:middle line:90%
And we're good.

00:41:52.220 --> 00:41:55.250 align:middle line:84%
Now let me go ahead and clear my
screen and run python of validate.py

00:41:55.250 --> 00:42:00.750 align:middle line:84%
again and do malan@@@harvard.edu,
crossing my fingers this time.

00:42:00.750 --> 00:42:03.020 align:middle line:90%
And finally, this now is invalid.

00:42:03.020 --> 00:42:03.830 align:middle line:90%
Why?

00:42:03.830 --> 00:42:08.600 align:middle line:84%
I'm allowing myself to have one @ sign
in the middle of the user's input,

00:42:08.600 --> 00:42:13.220 align:middle line:84%
but everything to the left per this
new syntax cannot be an @ sign.

00:42:13.220 --> 00:42:15.950 align:middle line:84%
It can be anything
but an @ sign, one or more times.

00:42:15.950 --> 00:42:20.570 align:middle line:84%
And everything to the right of the @
sign can be anything but an @ sign one

00:42:20.570 --> 00:42:25.430 align:middle line:84%
or more times followed by,
lastly, a literal .edu.

00:42:25.430 --> 00:42:27.590 align:middle line:84%
So again, the new syntax
is quite simply this--

00:42:27.590 --> 00:42:31.985 align:middle line:84%
square brackets allow you to specify
a set of characters that you literally

00:42:31.985 --> 00:42:33.110 align:middle line:90%
type out at your keyboard--

00:42:33.110 --> 00:42:36.410 align:middle line:84%
A, B, C, D, E, F, or the
complement, the opposite,

00:42:36.410 --> 00:42:40.550 align:middle line:84%
the ^ symbol, which means "not,"
and then the one or more symbols you

00:42:40.550 --> 00:42:42.520 align:middle line:90%
want to exclude.
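
NOTE
A sketch comparing the dot version with the complemented-set version
described above. The variable names and the check helper are hypothetical;
the two patterns follow the ones spoken aloud in the lecture.
```python
import re
dot_version = r"^.+@.+\.edu$"
set_version = r"^[^@]+@[^@]+\.edu$"  # [^@] means any character except @
def check(pattern, s):
    return re.search(pattern, s) is not None
for s in ["malan@harvard.edu", "malan@@@harvard.edu"]:
    print(repr(s), check(dot_version, s), check(set_version, s))
```
Only the set version rejects the input with three @ signs, because the dot is
free to match an @ while [^@] is not.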

00:42:42.520 --> 00:42:45.230 align:middle line:90%
Questions now on this syntax here?

00:42:45.230 --> 00:42:49.450 align:middle line:84%
AUDIENCE: So right after @ sign,
can we use the curly brackets m one

00:42:49.450 --> 00:42:52.770 align:middle line:84%
so that we can only have one
repetition of the @ symbol?

00:42:52.770 --> 00:42:53.770 align:middle line:90%
DAVID MALAN: Absolutely.

00:42:53.770 --> 00:42:54.800 align:middle line:90%
So we could do this.

00:42:54.800 --> 00:42:56.680 align:middle line:90%
Let me go ahead and pull up VS Code.

00:42:56.680 --> 00:42:59.680 align:middle line:84%
And let me delete the current
form of a regular expression

00:42:59.680 --> 00:43:03.580 align:middle line:84%
and go back to where we began,
which was just dot * @ and dot *.

00:43:03.580 --> 00:43:06.130 align:middle line:84%
I could absolutely do
something like this

00:43:06.130 --> 00:43:10.480 align:middle line:84%
and require that I want at
least one of any character here.

00:43:10.480 --> 00:43:13.760 align:middle line:84%
And then I could do something
more to have any more as well.

00:43:13.760 --> 00:43:16.710 align:middle line:84%
So the curly brace syntax, which
we saw on the slide earlier

00:43:16.710 --> 00:43:18.460 align:middle line:84%
but didn't yet use,
absolutely can be used

00:43:18.460 --> 00:43:21.400 align:middle line:84%
to specify a specific
number of characters.

00:43:21.400 --> 00:43:24.160 align:middle line:84%
But honestly, this is more
verbose than is necessary.

00:43:24.160 --> 00:43:27.130 align:middle line:84%
The best solution, arguably,
or the simplest, at least,

00:43:27.130 --> 00:43:29.500 align:middle line:90%
ultimately, is just to say dot +.

00:43:29.500 --> 00:43:32.650 align:middle line:84%
But there, too, another example of
how you can solve the same problem

00:43:32.650 --> 00:43:34.010 align:middle line:90%
multiple ways.
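
NOTE
The transcript does not show exactly what was typed here, so this is only a
sketch under the assumption that the curly-brace quantifier was used roughly
as below. Python's re module accepts {m} for exactly m repetitions and {m,}
for m or more. The variable names are hypothetical.
```python
import re
plus_form = r"^.+@.+$"
brace_form = r"^.{1,}@.{1,}$"       # {1,} means one or more
brace_then_star = r"^.{1}.*@.{1}.*$"  # exactly one, then zero or more
samples = ["malan@harvard", "a@b", "@", "malan@", "@harvard"]
outcomes = {}
for s in samples:
    outcomes[s] = tuple(re.search(p, s) is not None
                        for p in (plus_form, brace_form, brace_then_star))
    print(repr(s), outcomes[s])
```
All three forms agree on each sample, which is why the shorter dot-plus form
is usually preferred.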

00:43:34.010 --> 00:43:36.340 align:middle line:84%
Let me go back to where the
regular expression just was

00:43:36.340 --> 00:43:39.170 align:middle line:90%
and take other questions as well.

00:43:39.170 --> 00:43:44.790 align:middle line:84%
Questions on the sets of characters
or complementing that set?

00:43:44.790 --> 00:43:47.370 align:middle line:84%
AUDIENCE: So can you
use that same syntax

00:43:47.370 --> 00:43:51.780 align:middle line:84%
to say that you don't want a certain
character throughout the whole string?

00:43:51.780 --> 00:43:52.740 align:middle line:90%
DAVID MALAN: You could.

00:43:52.740 --> 00:43:54.600 align:middle line:90%
It's going to be--

00:43:54.600 --> 00:43:58.530 align:middle line:84%
you could absolutely use the
same character to exclude--

00:43:58.530 --> 00:44:01.830 align:middle line:84%
you could absolutely use this syntax
to exclude a certain character

00:44:01.830 --> 00:44:03.210 align:middle line:90%
from the entire string.

00:44:03.210 --> 00:44:05.130 align:middle line:84%
But it would be a
little harder right now,

00:44:05.130 --> 00:44:07.530 align:middle line:84%
because we're still
requiring .edu at the end.

00:44:07.530 --> 00:44:10.770 align:middle line:90%
But yes, absolutely.

00:44:10.770 --> 00:44:12.220 align:middle line:90%
Other questions?

00:44:12.220 --> 00:44:16.620 align:middle line:84%
AUDIENCE: What happens if the
user inputs .edu in the beginning

00:44:16.620 --> 00:44:17.632 align:middle line:90%
of the string?

00:44:17.632 --> 00:44:18.840 align:middle line:90%
DAVID MALAN: A good question.

00:44:18.840 --> 00:44:22.000 align:middle line:84%
What happens if the user types in
.edu at the beginning of the string?

00:44:22.000 --> 00:44:23.577 align:middle line:90%
Well, let me go back to VS Code here.

00:44:23.577 --> 00:44:25.660 align:middle line:84%
And let's try to solve
this in two different ways.

00:44:25.660 --> 00:44:27.452 align:middle line:84%
First, let's look at
the regular expression

00:44:27.452 --> 00:44:31.080 align:middle line:84%
and see if we can infer if
that's going to be tolerated.

00:44:31.080 --> 00:44:34.950 align:middle line:84%
Well, according to the current
cryptic regular expression,

00:44:34.950 --> 00:44:38.730 align:middle line:84%
I'm saying that you can have
any character except the @ sign.

00:44:38.730 --> 00:44:41.910 align:middle line:84%
So that would work. I could
have the dot for the .edu.

00:44:41.910 --> 00:44:44.490 align:middle line:90%
But then I have to have an @ sign.

00:44:44.490 --> 00:44:48.940 align:middle line:84%
So that wouldn't really work,
because if I'm just typing in .edu,

00:44:48.940 --> 00:44:51.010 align:middle line:90%
we're not going to pass that constraint.

00:44:51.010 --> 00:44:53.710 align:middle line:84%
So now let me try this
by running the program.

00:44:53.710 --> 00:44:55.810 align:middle line:90%
Let me type in just literally .edu.

00:44:55.810 --> 00:44:57.090 align:middle line:90%
That doesn't work.

00:44:57.090 --> 00:45:02.505 align:middle line:84%
But, but, but I could
do this, .edu@.edu.

00:45:02.505 --> 00:45:04.140 align:middle line:90%
That, too, is invalid.

00:45:04.140 --> 00:45:07.581 align:middle line:90%
But let me do this, .edu@something.edu.

00:45:07.581 --> 00:45:10.365 align:middle line:90%


00:45:10.365 --> 00:45:11.490 align:middle line:90%
That passes.

00:45:11.490 --> 00:45:13.470 align:middle line:84%
So it's starting to
get a little weird now.

00:45:13.470 --> 00:45:15.030 align:middle line:90%
Maybe it's valid, maybe it's not.

00:45:15.030 --> 00:45:18.120 align:middle line:84%
But I think we'll eventually
be more precise, too.
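
NOTE
A sketch reproducing the three inputs tried above, assuming the pattern in
use is the complemented-set version. The names pattern, inputs, and verdicts
are hypothetical.
```python
import re
pattern = r"^[^@]+@[^@]+\.edu$"
inputs = [".edu", ".edu@.edu", ".edu@something.edu"]
verdicts = {}
for s in inputs:
    verdicts[s] = "Valid" if re.search(pattern, s) else "Invalid"
    print(repr(s), verdicts[s])
```
The second input fails because [^@]+ needs at least one character between the
@ sign and the final .edu.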

00:45:18.120 --> 00:45:21.570 align:middle line:84%
How about one more question
on this regular expression

00:45:21.570 --> 00:45:23.310 align:middle line:90%
and this complementing of sets?

00:45:23.310 --> 00:45:27.765 align:middle line:84%
AUDIENCE: Can we use another
domain name, the string input?

00:45:27.765 --> 00:45:29.640 align:middle line:84%
DAVID MALAN: Can you
use another domain name?

00:45:29.640 --> 00:45:30.240 align:middle line:90%
Absolutely.

00:45:30.240 --> 00:45:32.460 align:middle line:84%
I'm using my own just for
the sake of demonstration.

00:45:32.460 --> 00:45:35.970 align:middle line:84%
But you could absolutely use
any domain or top-level domain.

00:45:35.970 --> 00:45:38.520 align:middle line:84%
And I'm using .edu,
which is very US centric.

00:45:38.520 --> 00:45:43.330 align:middle line:84%
But this would absolutely work exactly
the same for any top-level domain.
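
NOTE
One possible way to reuse the same pattern for other top-level domains,
offered as a sketch rather than anything shown in the lecture. The function
make_validator is hypothetical. re.escape is a real function in Python's re
module that backslash-escapes special characters such as the dot.
```python
import re
def make_validator(tld):
    # Escape the suffix so that any dot in it is matched literally.
    pattern = re.compile(r"^[^@]+@[^@]+\." + re.escape(tld) + r"$")
    return lambda s: pattern.search(s) is not None
is_edu = make_validator("edu")
is_org = make_validator("org")
is_co_uk = make_validator("co.uk")
print(is_edu("malan@harvard.edu"))
print(is_org("malan@harvard.edu"))
print(is_co_uk("someone@example.co.uk"))
```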

00:45:43.330 --> 00:45:43.830 align:middle line:90%
All right.

00:45:43.830 --> 00:45:47.700 align:middle line:84%
Let me go ahead now and propose that
we improve this regular expression

00:45:47.700 --> 00:45:50.880 align:middle line:84%
further, because if I pull
it up again in VS Code here,

00:45:50.880 --> 00:45:53.790 align:middle line:84%
you'll see that I'm being a
little too tolerant still.

00:45:53.790 --> 00:45:58.140 align:middle line:84%
It turns out that there are certain
requirements for someone's username

00:45:58.140 --> 00:46:00.240 align:middle line:90%
and domain name in an email address.

00:46:00.240 --> 00:46:03.840 align:middle line:84%
There is an official standard in the
world for what an email address can be

00:46:03.840 --> 00:46:05.670 align:middle line:90%
and what characters can be in it.

00:46:05.670 --> 00:46:09.480 align:middle line:84%
And this is way too accommodating
of all the characters

00:46:09.480 --> 00:46:11.710 align:middle line:90%
in the world except for the @ symbol.

00:46:11.710 --> 00:46:14.190 align:middle line:84%
So let's actually narrow
the definition of what

00:46:14.190 --> 00:46:16.110 align:middle line:90%
we're going to tolerate in usernames.

00:46:16.110 --> 00:46:19.200 align:middle line:84%
And companies like Gmail could
certainly do this as well.

00:46:19.200 --> 00:46:22.200 align:middle line:84%
Suppose that it's not just
that I want to exclude the @ sign.

00:46:22.200 --> 00:46:25.470 align:middle line:84%
Suppose that I only
want to allow for, say,

00:46:25.470 --> 00:46:27.600 align:middle line:84%
characters that normally
appear in words,

00:46:27.600 --> 00:46:31.500 align:middle line:84%
like letters of the alphabet, a through
z, be it uppercase or lowercase,

00:46:31.500 --> 00:46:35.520 align:middle line:84%
maybe some numbers, and heck, maybe even
an underscore could be allowed, too.

00:46:35.520 --> 00:46:38.550 align:middle line:84%
Well, we can use this
same square bracket syntax

00:46:38.550 --> 00:46:41.340 align:middle line:84%
to specify a set of
characters as follows.

00:46:41.340 --> 00:46:44.860 align:middle line:90%
I could do abcdefghij--

00:46:44.860 --> 00:46:45.360 align:middle line:90%
oh, my god.

00:46:45.360 --> 00:46:46.290 align:middle line:90%
This is going to take forever.

00:46:46.290 --> 00:46:49.140 align:middle line:84%
I'm going to have to type out
all 26 letters of the alphabet,

00:46:49.140 --> 00:46:50.940 align:middle line:90%
both lowercase and uppercase.

00:46:50.940 --> 00:46:52.260 align:middle line:90%
So let me stop doing that.

00:46:52.260 --> 00:46:53.700 align:middle line:90%
There's a better way already.

00:46:53.700 --> 00:46:58.180 align:middle line:84%
If you want to specify within these
square brackets a range of letters,

00:46:58.180 --> 00:47:00.550 align:middle line:90%
you can actually just do a hyphen.

00:47:00.550 --> 00:47:04.920 align:middle line:84%
If you literally do a-z
in these square brackets,

00:47:04.920 --> 00:47:07.470 align:middle line:84%
the computer is going to
know you mean a through z.

00:47:07.470 --> 00:47:10.620 align:middle line:84%
You do not need to type 26
letters of the alphabet.

00:47:10.620 --> 00:47:14.190 align:middle line:84%
If you want to include uppercase
letters as well, you just do the same.

00:47:14.190 --> 00:47:19.440 align:middle line:84%
No spaces, no commas, you literally
just keep typing capital A through capital Z.

00:47:19.440 --> 00:47:23.880 align:middle line:84%
So I have little a hyphen
little z, big A hyphen

00:47:23.880 --> 00:47:26.640 align:middle line:84%
big Z. No spaces, no
commas, no separators.

00:47:26.640 --> 00:47:28.830 align:middle line:90%
You just keep specifying those ranges.

00:47:28.830 --> 00:47:32.350 align:middle line:84%
If I additionally want
numbers, I could do 01234--

00:47:32.350 --> 00:47:32.850 align:middle line:90%
nope.

00:47:32.850 --> 00:47:35.070 align:middle line:84%
You don't need to type
in all 10 decimal digits.

00:47:35.070 --> 00:47:39.070 align:middle line:84%
You can just say 0 through
9 using a hyphen as well.

00:47:39.070 --> 00:47:41.280 align:middle line:84%
And if you now want
to support underscores

00:47:41.280 --> 00:47:44.280 align:middle line:84%
as well, which is pretty common
in usernames for email addresses,

00:47:44.280 --> 00:47:48.160 align:middle line:84%
you can literally just type
an underscore at the end.

00:47:48.160 --> 00:47:51.180 align:middle line:84%
Notice that all of these
characters are inside

00:47:51.180 --> 00:47:55.860 align:middle line:84%
of square brackets, which just again,
means here is a set of characters

00:47:55.860 --> 00:47:57.180 align:middle line:90%
that I want to allow.

00:47:57.180 --> 00:48:02.100 align:middle line:84%
I have not used a ^ symbol at the
beginning of this whole thing,

00:48:02.100 --> 00:48:05.370 align:middle line:84%
because I don't want to complement
it-- complement it with an E,

00:48:05.370 --> 00:48:07.230 align:middle line:90%
not compliment it with an I--

00:48:07.230 --> 00:48:09.940 align:middle line:84%
I don't want to complement
it by making it the opposite.

00:48:09.940 --> 00:48:13.225 align:middle line:84%
I literally want to accept
only these characters.

00:48:13.225 --> 00:48:15.600 align:middle line:84%
I'm going to go ahead and do
the same thing on the right.

00:48:15.600 --> 00:48:19.530 align:middle line:84%
If I want to require that
the domain name similarly

00:48:19.530 --> 00:48:22.800 align:middle line:84%
come from this set of characters, which
admittedly is a little too narrow,

00:48:22.800 --> 00:48:25.210 align:middle line:84%
but it's familiar for now
so we'll keep it simple,

00:48:25.210 --> 00:48:29.490 align:middle line:84%
I'm going to go ahead and paste that
exact same set of characters over there

00:48:29.490 --> 00:48:30.490 align:middle line:90%
to the right.

00:48:30.490 --> 00:48:33.600 align:middle line:90%
And so now, it's much more restrictive.

00:48:33.600 --> 00:48:36.660 align:middle line:84%
Now I'm going to go ahead and
run python of validate.py.

00:48:36.660 --> 00:48:39.420 align:middle line:84%
I'm going to test my own email
address, and we're still good.

00:48:39.420 --> 00:48:42.180 align:middle line:84%
I'm going to clear my
screen and run it once more,

00:48:42.180 --> 00:48:44.520 align:middle line:90%
this time trying to break it.

00:48:44.520 --> 00:48:51.270 align:middle line:84%
Let me go ahead and do something like,
how about, david_malan@harvard.edu,

00:48:51.270 --> 00:48:54.790 align:middle line:84%
Enter, but that, too,
is going to be valid.

00:48:54.790 --> 00:48:57.330 align:middle line:84%
But if I do something
completely wrong again,

00:48:57.330 --> 00:49:02.790 align:middle line:84%
like malan@@@harvard.edu, that's
still going to be invalid.

00:49:02.790 --> 00:49:03.330 align:middle line:90%
Why?

00:49:03.330 --> 00:49:06.090 align:middle line:84%
Because my regular expression
currently only allows

00:49:06.090 --> 00:49:09.480 align:middle line:84%
for a single @ in the middle,
because everything to the left

00:49:09.480 --> 00:49:11.530 align:middle line:90%
must be alphanumeric--

00:49:11.530 --> 00:49:14.420 align:middle line:84%
alphabetical or numeric--
or an underscore,

00:49:14.420 --> 00:49:18.301 align:middle line:84%
the same thing to the
right, followed by the .edu.
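
NOTE
A minimal sketch of the character-set pattern just described, assuming it is
written as ^[a-zA-Z0-9_]+@[a-zA-Z0-9_]+\.edu$ and wrapped in a helper named
is_valid (a hypothetical name, not shown in the lecture).
```python
import re
# Assumed pattern: one or more set characters, a single @, one or more set
# characters, then a literal ".edu" at the very end of the string.
PATTERN = r"^[a-zA-Z0-9_]+@[a-zA-Z0-9_]+\.edu$"
def is_valid(email):
    """Return True if email matches the character-set pattern."""
    return re.search(PATTERN, email) is not None
print(is_valid("malan@harvard.edu"))        # True
print(is_valid("david_malan@harvard.edu"))  # True
print(is_valid("malan@@@harvard.edu"))      # False
```
The third address fails because the set on the right of the @ does not contain
@ itself, so nothing can match the second and third @ signs.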

00:49:18.301 --> 00:49:20.770 align:middle line:84%
Now honestly, this is
a regular expression

00:49:20.770 --> 00:49:23.890 align:middle line:84%
that you might be in the habit
of typing in the real world.

00:49:23.890 --> 00:49:27.860 align:middle line:84%
As cryptic as this might look, this
is the world of regular expressions.

00:49:27.860 --> 00:49:30.560 align:middle line:84%
So you'll get more comfortable
with this syntax over time.

00:49:30.560 --> 00:49:32.890 align:middle line:84%
But thankfully, some
of these patterns are

00:49:32.890 --> 00:49:36.910 align:middle line:84%
so common that there are built-in
shortcuts for representing

00:49:36.910 --> 00:49:38.680 align:middle line:90%
some of the same information.

00:49:38.680 --> 00:49:42.373 align:middle line:84%
That is to say, you don't have to
constantly type out all of the symbols

00:49:42.373 --> 00:49:45.040 align:middle line:84%
that you want to include, because
odds are some other programmer

00:49:45.040 --> 00:49:46.280 align:middle line:90%
has had the same problem.

00:49:46.280 --> 00:49:49.030 align:middle line:84%
So built into regular
expressions themselves

00:49:49.030 --> 00:49:51.250 align:middle line:84%
are some additional
patterns you can use.

00:49:51.250 --> 00:49:56.170 align:middle line:84%
And in fact, I can go ahead and get
rid of this entire set, a through z

00:49:56.170 --> 00:49:59.830 align:middle line:84%
lowercase, A through Z uppercase,
0 through 9 and an underscore,

00:49:59.830 --> 00:50:03.640 align:middle line:84%
and just replace it with
a single backslash w.

00:50:03.640 --> 00:50:07.210 align:middle line:84%
Backslash w in this case
represents a "word character,"

00:50:07.210 --> 00:50:13.330 align:middle line:84%
which is commonly known as an
alphanumeric symbol or the underscore

00:50:13.330 --> 00:50:14.052 align:middle line:90%
as well.

00:50:14.052 --> 00:50:15.760 align:middle line:84%
I'm going to do the
same thing over here.

00:50:15.760 --> 00:50:18.310 align:middle line:84%
I'm going to highlight the
entire set of square brackets,

00:50:18.310 --> 00:50:21.430 align:middle line:84%
delete it, and replace it
with a single backslash w.

00:50:21.430 --> 00:50:23.720 align:middle line:84%
And now I feel like
we're making progress,

00:50:23.720 --> 00:50:25.720 align:middle line:84%
because even though it's
cryptic, and would have

00:50:25.720 --> 00:50:29.320 align:middle line:90%
looked way cryptic a little bit ago--

00:50:29.320 --> 00:50:32.680 align:middle line:84%
and even though it would have looked
even more cryptic a little bit ago, now

00:50:32.680 --> 00:50:35.470 align:middle line:84%
it's at least starting to
read a little more friendly.

00:50:35.470 --> 00:50:39.160 align:middle line:84%
This ^ on the left means "start matching
at the beginning of the string."

00:50:39.160 --> 00:50:42.100 align:middle line:90%
Backslash w means "any word character."

00:50:42.100 --> 00:50:44.140 align:middle line:90%
The + means "one or more."

00:50:44.140 --> 00:50:45.370 align:middle line:90%
@ symbol literally.

00:50:45.370 --> 00:50:49.720 align:middle line:84%
Then another word character, one
or more, then a literal dot, then

00:50:49.720 --> 00:50:54.200 align:middle line:84%
literally edu, and then match at the
very end of the string, and that's it.
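
NOTE
A small sketch comparing the explicit set with the backslash w shorthand,
assuming the two patterns below (the constant names are illustrative). One
caveat worth knowing: in Python 3, backslash w on an ordinary str pattern also
matches non-ASCII letters unless the re.ASCII flag is passed, so the two
patterns agree on the lecture's inputs but are not identical in general.
```python
import re
SET_PATTERN = r"^[a-zA-Z0-9_]+@[a-zA-Z0-9_]+\.edu$"
WORD_PATTERN = r"^\w+@\w+\.edu$"
# The two patterns give the same answer for the addresses tried so far.
for email in ["malan@harvard.edu", "david_malan@harvard.edu", "malan@@@harvard.edu"]:
    same = bool(re.search(SET_PATTERN, email)) == bool(re.search(WORD_PATTERN, email))
    print(email, same)
# They differ on a non-ASCII letter such as e with an acute accent (U+00E9).
accented = "jos\u00e9@harvard.edu"
print(bool(re.search(SET_PATTERN, accented)))             # False
print(bool(re.search(WORD_PATTERN, accented)))            # True
print(bool(re.search(WORD_PATTERN, accented, re.ASCII)))  # False
```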

00:50:54.200 --> 00:50:55.660 align:middle line:90%
So there's more of these, too.

00:50:55.660 --> 00:50:57.910 align:middle line:84%
And we won't use them
all here, but here is

00:50:57.910 --> 00:51:02.950 align:middle line:84%
a partial list of the patterns you
can use within a regular expression.

00:51:02.950 --> 00:51:07.060 align:middle line:84%
One, you have backslash d for any
decimal digit, "decimal digit" meaning

00:51:07.060 --> 00:51:08.590 align:middle line:90%
0 through 9.

00:51:08.590 --> 00:51:12.550 align:middle line:84%
Commonly done here, too, is if you
want to do the opposite of that,

00:51:12.550 --> 00:51:17.020 align:middle line:84%
the complement, so to speak, you
can do backslash capital D, which

00:51:17.020 --> 00:51:19.480 align:middle line:90%
is anything that's not a decimal digit.

00:51:19.480 --> 00:51:23.990 align:middle line:84%
So it might be letters, and
punctuation, and other symbols as well.

00:51:23.990 --> 00:51:27.280 align:middle line:84%
Meanwhile, backslash s
means whitespace characters,

00:51:27.280 --> 00:51:30.490 align:middle line:84%
like a single hit of the space, or
maybe hitting Tab on the keyboard.

00:51:30.490 --> 00:51:31.720 align:middle line:90%
That's whitespace.

00:51:31.720 --> 00:51:35.110 align:middle line:84%
Backslash capital S is
the opposite or complement

00:51:35.110 --> 00:51:38.080 align:middle line:84%
of that-- anything that's
not a whitespace character.

00:51:38.080 --> 00:51:41.680 align:middle line:84%
Backslash w, we've seen, a
word character, as well as

00:51:41.680 --> 00:51:43.390 align:middle line:90%
numbers and the underscore.

00:51:43.390 --> 00:51:45.970 align:middle line:84%
And if you want the complement
or opposite of that,

00:51:45.970 --> 00:51:50.950 align:middle line:84%
you can use backslash capital W to give
you everything but a word character.

00:51:50.950 --> 00:51:54.130 align:middle line:84%
Again, these are just common patterns
that so many people were presumably

00:51:54.130 --> 00:51:58.520 align:middle line:84%
using in yesteryear that it's now baked
into the regular expression syntax

00:51:58.520 --> 00:52:02.710 align:middle line:84%
so that you can more succinctly
express your same ideas.
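
NOTE
A quick sketch of the six shorthand classes listed above, run with re.findall
against a made-up sample string (the string is an assumption chosen only to
contain letters, digits, an underscore, a space, and a tab).
```python
import re
text = "CS50 is_fun\t2022"
print(re.findall(r"\d", text))   # ['5', '0', '2', '0', '2', '2']
print(re.findall(r"\s", text))   # [' ', '\t']
print(re.findall(r"\W", text))   # [' ', '\t']
print(re.findall(r"\D+", text))  # ['CS', ' is_fun\t']
print(re.findall(r"\S+", text))  # ['CS50', 'is_fun', '2022']
print(re.findall(r"\w+", text))  # ['CS50', 'is_fun', '2022']
```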

00:52:02.710 --> 00:52:05.320 align:middle line:84%
Any questions, then,
on this approach here,

00:52:05.320 --> 00:52:12.340 align:middle line:84%
where we're now using backslash
w to represent my word character?

00:52:12.340 --> 00:52:14.230 align:middle line:84%
AUDIENCE: So what I
want to ask about was

00:52:14.230 --> 00:52:17.590 align:middle line:84%
the-- actually the previous approach,
like the square bracket approach.

00:52:17.590 --> 00:52:19.792 align:middle line:90%
Could we accept lists in there?

00:52:19.792 --> 00:52:20.500 align:middle line:90%
DAVID MALAN: Yes.

00:52:20.500 --> 00:52:21.730 align:middle line:90%
We'll see this before long.

00:52:21.730 --> 00:52:27.460 align:middle line:84%
But suppose you wanted to tolerate not
just .edu, but maybe .edu, or .com,

00:52:27.460 --> 00:52:28.450 align:middle line:90%
you could do this.

00:52:28.450 --> 00:52:32.500 align:middle line:84%
You could introduce parentheses,
and then you can or those together.

00:52:32.500 --> 00:52:35.470 align:middle line:90%
I could say com or edu.

00:52:35.470 --> 00:52:40.180 align:middle line:84%
Could also add in something
like in the US, or gov, or net,

00:52:40.180 --> 00:52:42.670 align:middle line:90%
or anything else, or org, or the like.

00:52:42.670 --> 00:52:45.190 align:middle line:84%
And each of the vertical bars
here means something special.

00:52:45.190 --> 00:52:46.180 align:middle line:90%
It means "or."

00:52:46.180 --> 00:52:48.610 align:middle line:84%
And the parentheses simply
group things together.

00:52:48.610 --> 00:52:50.920 align:middle line:90%
Formally, you have this syntax here--

00:52:50.920 --> 00:52:56.530 align:middle line:84%
A or B, A or vertical bar B, means
"A has to match or B has to match,"

00:52:56.530 --> 00:52:59.080 align:middle line:84%
where A and B can be any
other patterns you want.

00:52:59.080 --> 00:53:01.520 align:middle line:84%
In parentheses, you can
group those things together.

00:53:01.520 --> 00:53:05.710 align:middle line:84%
So just like math, you can
combine ideas into one phrase

00:53:05.710 --> 00:53:07.600 align:middle line:90%
and do this thing or the other.
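
NOTE
A minimal sketch of grouping with parentheses and "or" with the vertical bar,
assuming the list of top-level domains mentioned in the answer above. The
helper name is_valid is illustrative.
```python
import re
# The group (com|edu|gov|net|org) matches exactly one of the alternatives.
PATTERN = r"^\w+@\w+\.(com|edu|gov|net|org)$"
def is_valid(email):
    """Return True if email ends in one of the listed top-level domains."""
    return re.search(PATTERN, email) is not None
print(is_valid("malan@harvard.edu"))  # True
print(is_valid("malan@gmail.com"))    # True
print(is_valid("malan@harvard.io"))   # False
```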

00:53:07.600 --> 00:53:09.970 align:middle line:84%
And there's other syntax as
well that we'll soon see.

00:53:09.970 --> 00:53:14.750 align:middle line:84%
Other questions on these regular
expressions and this syntax here?

00:53:14.750 --> 00:53:16.990 align:middle line:84%
AUDIENCE: What if we put
spaces in the expression?

00:53:16.990 --> 00:53:17.740 align:middle line:90%
DAVID MALAN: Sure.

00:53:17.740 --> 00:53:21.910 align:middle line:84%
So if you want spaces in there,
you can't use backslash w alone,

00:53:21.910 --> 00:53:25.690 align:middle line:84%
because that is only a word character
which is alphabetical, numerical,

00:53:25.690 --> 00:53:27.100 align:middle line:90%
or the underscore.

00:53:27.100 --> 00:53:28.580 align:middle line:90%
But you could do this.

00:53:28.580 --> 00:53:32.170 align:middle line:84%
You could go back to this approach
whereby you use square brackets.

00:53:32.170 --> 00:53:37.120 align:middle line:84%
And you could say a through z,
or A through Z, or 0 through 9,

00:53:37.120 --> 00:53:40.693 align:middle line:84%
or underscore, or I'm going to
hit the space bar, a single space.

00:53:40.693 --> 00:53:43.360 align:middle line:84%
You can put a literal space inside
of the square brackets, which

00:53:43.360 --> 00:53:45.700 align:middle line:90%
will allow you then to detect a space.

00:53:45.700 --> 00:53:49.420 align:middle line:84%
Alternatively, I could
still use backslash w,

00:53:49.420 --> 00:53:51.280 align:middle line:90%
but I could combine it as follows.

00:53:51.280 --> 00:53:54.700 align:middle line:84%
I could say, give me a
backslash w or a backslash s,

00:53:54.700 --> 00:53:57.287 align:middle line:84%
because recall that
backslash s is whitespace.

00:53:57.287 --> 00:53:58.870 align:middle line:90%
So it's even more than a single space.

00:53:58.870 --> 00:53:59.770 align:middle line:90%
It could be a tab.

00:53:59.770 --> 00:54:02.140 align:middle line:84%
But by putting those
things in parentheses, now

00:54:02.140 --> 00:54:04.060 align:middle line:84%
you can match either
the thing on the left

00:54:04.060 --> 00:54:07.400 align:middle line:84%
or the thing on the
right one or more times.
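
NOTE
A short sketch of the two ways to tolerate spaces described above: a literal
space inside the square brackets, or a group that accepts backslash w or
backslash s. The sample strings and constant names are assumptions.
```python
import re
SET_WITH_SPACE = r"^[a-zA-Z0-9_ ]+$"  # note the literal space before the ]
WORD_OR_SPACE = r"^(\w|\s)+$"         # word character or any whitespace
print(bool(re.search(SET_WITH_SPACE, "David Malan")))   # True
print(bool(re.search(SET_WITH_SPACE, "David\tMalan")))  # False
print(bool(re.search(WORD_OR_SPACE, "David\tMalan")))   # True
print(bool(re.search(r"^\w+$", "David Malan")))         # False
```
The second line is False because a tab is whitespace but is not the literal
space character that was added to the set.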

00:54:07.400 --> 00:54:12.290 align:middle line:84%
How about one other question
on these regular expressions?

00:54:12.290 --> 00:54:13.040 align:middle line:90%
AUDIENCE: Perfect.

00:54:13.040 --> 00:54:19.070 align:middle line:84%
So I was going to ask, does
the backslash w include a dot?

00:54:19.070 --> 00:54:20.730 align:middle line:90%
Because-- no, OK.

00:54:20.730 --> 00:54:24.230 align:middle line:84%
DAVID MALAN: No, it only includes
letters, numbers, and underscore.

00:54:24.230 --> 00:54:25.387 align:middle line:90%
That is it.

00:54:25.387 --> 00:54:27.470 align:middle line:84%
AUDIENCE: And I was
wondering, you gave an example

00:54:27.470 --> 00:54:33.140 align:middle line:84%
at the beginning that had spaces,
like this is my email, so-and-so.

00:54:33.140 --> 00:54:35.420 align:middle line:90%
I don't think our current version--

00:54:35.420 --> 00:54:39.110 align:middle line:84%
or even quite a long while
ago stopped accepting it.

00:54:39.110 --> 00:54:43.915 align:middle line:84%
Was that because of the ^ or
because of something else?

00:54:43.915 --> 00:54:47.960 align:middle line:84%
DAVID MALAN: No, the reason I was
handling spaces in other English words

00:54:47.960 --> 00:54:51.425 align:middle line:84%
when I typed out my email
address as malan@harvard.edu

00:54:51.425 --> 00:54:57.380 align:middle line:84%
was because we were using initially dot
*, or dot +, which is any character.

00:54:57.380 --> 00:55:01.340 align:middle line:84%
And even after that, we said
anything except the @ sign,

00:55:01.340 --> 00:55:02.870 align:middle line:90%
which includes spaces.

00:55:02.870 --> 00:55:08.000 align:middle line:84%
Only once I started using square
brackets and a through z and 0

00:55:08.000 --> 00:55:11.210 align:middle line:84%
through 9 and underscore did
we finally get to the point

00:55:11.210 --> 00:55:13.040 align:middle line:90%
where we would reject whitespace.

00:55:13.040 --> 00:55:14.970 align:middle line:90%
And in fact, I can run this here.

00:55:14.970 --> 00:55:18.980 align:middle line:84%
Let me go into the current version of my
code in VS Code, which is using, again,

00:55:18.980 --> 00:55:21.620 align:middle line:84%
the backslash w's for
word characters, let

00:55:21.620 --> 00:55:24.860 align:middle line:84%
me run python of validate.py and
incorrectly type in something

00:55:24.860 --> 00:55:30.020 align:middle line:84%
like "my email address is
malan@harvard.edu," period, which

00:55:30.020 --> 00:55:34.250 align:middle line:84%
has spaces to the left of my
username, and that is now invalid,

00:55:34.250 --> 00:55:36.590 align:middle line:90%
because space is not a word character.

00:55:36.590 --> 00:55:39.860 align:middle line:84%
You're going to notice, too, that
technically I'm not allowing dots.

00:55:39.860 --> 00:55:41.902 align:middle line:84%
And some of you might be
thinking, wait a minute.

00:55:41.902 --> 00:55:43.880 align:middle line:90%
My Gmail address has a dot in it.

00:55:43.880 --> 00:55:46.280 align:middle line:84%
That's something we're
going to still have to fix.

00:55:46.280 --> 00:55:49.160 align:middle line:90%
A backslash w is not the end all here.

00:55:49.160 --> 00:55:52.520 align:middle line:84%
It's just allowing us to
express our previous solution

00:55:52.520 --> 00:55:54.020 align:middle line:90%
a little more succinctly.

00:55:54.020 --> 00:55:57.260 align:middle line:84%
Now, one thing we're still
not handling quite properly

00:55:57.260 --> 00:55:59.180 align:middle line:90%
is uppercase versus lowercase.

00:55:59.180 --> 00:56:03.200 align:middle line:84%
The backslash w technically does
handle lowercase letters and uppercase,

00:56:03.200 --> 00:56:06.450 align:middle line:84%
because it's the exact same
thing as that set from before,

00:56:06.450 --> 00:56:11.670 align:middle line:84%
which had little a through little z and
big A through big Z. But watch this.

00:56:11.670 --> 00:56:14.960 align:middle line:84%
Let me go ahead and, in my current
form, run python of validate.py,

00:56:14.960 --> 00:56:19.376 align:middle line:84%
and just because my Caps lock
key is down, MALAN@HARVARD.EDU,

00:56:19.376 --> 00:56:21.080 align:middle line:90%
shouting my email address.

00:56:21.080 --> 00:56:23.640 align:middle line:84%
It's going to be OK
in terms of the MALAN.

00:56:23.640 --> 00:56:25.940 align:middle line:84%
It's going to be OK in
terms of the HARVARD,

00:56:25.940 --> 00:56:28.790 align:middle line:84%
because those are matching
the backslash w, which

00:56:28.790 --> 00:56:31.490 align:middle line:90%
does include lowercase and uppercase.

00:56:31.490 --> 00:56:34.310 align:middle line:90%
But I'm about to see invalid.

00:56:34.310 --> 00:56:35.210 align:middle line:90%
Why?

00:56:35.210 --> 00:56:41.670 align:middle line:84%
Why is MALAN@HARVARD.EDU invalid
when it's in all caps here,

00:56:41.670 --> 00:56:44.195 align:middle line:90%
even though I'm using backslash w?

00:56:44.195 --> 00:56:44.820 align:middle line:90%
AUDIENCE: Yeah.

00:56:44.820 --> 00:56:50.010 align:middle line:84%
So you are asking for the
domain .edu in lowercase,

00:56:50.010 --> 00:56:52.105 align:middle line:90%
and you're typing it in uppercase.

00:56:52.105 --> 00:56:52.980 align:middle line:90%
DAVID MALAN: Exactly.

00:56:52.980 --> 00:56:55.980 align:middle line:84%
I'm typing in my email
address in all uppercase,

00:56:55.980 --> 00:56:57.892 align:middle line:90%
but I'm looking for literally ".edu."

00:56:57.892 --> 00:57:00.600 align:middle line:84%
And as I see you with AirPods and
so many of you with headphones,

00:57:00.600 --> 00:57:03.810 align:middle line:84%
I apologize for yelling into my
microphone just now to make this point.

00:57:03.810 --> 00:57:05.770 align:middle line:90%
But let's see if we can't fix that.

00:57:05.770 --> 00:57:11.925 align:middle line:84%
Well, if my pattern on line 5
is expecting it to be lowercase,

00:57:11.925 --> 00:57:13.800 align:middle line:84%
there's actually a few
ways I can solve this.

00:57:13.800 --> 00:57:15.840 align:middle line:84%
One would be something
we've seen before.

00:57:15.840 --> 00:57:19.050 align:middle line:84%
I could just force the user's
input to all lowercase.

00:57:19.050 --> 00:57:23.610 align:middle line:84%
And I could put onto the end of my first
line .lower and actually force it all

00:57:23.610 --> 00:57:24.480 align:middle line:90%
to lowercase.

00:57:24.480 --> 00:57:26.880 align:middle line:84%
Alternatively, I could
do that a little later.

00:57:26.880 --> 00:57:31.050 align:middle line:84%
Instead of passing in email, I could
pass in the lowercase version of email,

00:57:31.050 --> 00:57:33.810 align:middle line:84%
because email addresses should,
in fact, be case insensitive.

00:57:33.810 --> 00:57:34.980 align:middle line:90%
So that would work, too.

00:57:34.980 --> 00:57:37.590 align:middle line:84%
But there's another mechanism
here, which is worth seeing.

00:57:37.590 --> 00:57:43.890 align:middle line:84%
It turns out that that function before
called re.search supports, recall,

00:57:43.890 --> 00:57:46.800 align:middle line:84%
a third argument as well,
these so-called flags.

00:57:46.800 --> 00:57:49.170 align:middle line:84%
And flags are configuration
options, typically

00:57:49.170 --> 00:57:52.290 align:middle line:84%
to a function, that allow you to
configure it a little differently.

00:57:52.290 --> 00:57:55.290 align:middle line:84%
And how might I go about
configuring this call

00:57:55.290 --> 00:57:59.910 align:middle line:84%
to re.search a little bit differently
insofar as I'm currently only passing

00:57:59.910 --> 00:58:00.900 align:middle line:90%
in two arguments?

00:58:00.900 --> 00:58:04.650 align:middle line:84%
Well, it turns out that some of the
flags you can pass into this function

00:58:04.650 --> 00:58:05.790 align:middle line:90%
are these.

00:58:05.790 --> 00:58:10.110 align:middle line:84%
It turns out that the regular
expression library in Python, a.k.a.

00:58:10.110 --> 00:58:14.040 align:middle line:84%
re, comes with a few built-in
variables, so to speak,

00:58:14.040 --> 00:58:16.110 align:middle line:84%
things that you can
think of as constants,

00:58:16.110 --> 00:58:19.920 align:middle line:90%
that have meaning to re.search.

00:58:19.920 --> 00:58:21.760 align:middle line:90%
And they do so as follows.

00:58:21.760 --> 00:58:26.220 align:middle line:84%
If you pass in as a flag re.IGNORECASE,
what re.search is going to do

00:58:26.220 --> 00:58:28.530 align:middle line:90%
is ignore the case of the user's input.

00:58:28.530 --> 00:58:30.880 align:middle line:84%
It can be uppercase, lowercase,
a combination thereof,

00:58:30.880 --> 00:58:32.470 align:middle line:90%
the case is going to be ignored.

00:58:32.470 --> 00:58:34.327 align:middle line:90%
It will be treated case insensitively.

00:58:34.327 --> 00:58:36.660 align:middle line:84%
And you can do other things,
too, that we won't do here.

00:58:36.660 --> 00:58:40.650 align:middle line:84%
But if you want to handle the user's
input that maybe spans multiple lines--

00:58:40.650 --> 00:58:44.040 align:middle line:84%
maybe they didn't just type in an
email address but an entire paragraph

00:58:44.040 --> 00:58:46.410 align:middle line:84%
of text, and you want
to match different lines

00:58:46.410 --> 00:58:48.210 align:middle line:90%
of that text that is multiple lines.

00:58:48.210 --> 00:58:52.950 align:middle line:84%
Another flag is for re.MULTILINE
for just that, or re.DOTALL,

00:58:52.950 --> 00:58:57.990 align:middle line:84%
whereby you can configure
the dot to recognize not just

00:58:57.990 --> 00:59:02.830 align:middle line:84%
any character except newlines but
any character plus newlines as well.
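
NOTE
A brief sketch of the two flags that the lecture mentions but does not use,
re.MULTILINE and re.DOTALL, applied to a made-up three-line string (the string
is an assumption for illustration).
```python
import re
text = "hello\nmalan@harvard.edu\nbye"
# Without re.MULTILINE, ^ and $ anchor only to the whole string.
print(bool(re.search(r"^\w+@\w+\.edu$", text)))                # False
# With re.MULTILINE, ^ and $ also match at the start and end of each line.
print(bool(re.search(r"^\w+@\w+\.edu$", text, re.MULTILINE)))  # True
# Without re.DOTALL, the dot stops at a newline.
print(bool(re.search(r"hello.+bye", text)))                    # False
# With re.DOTALL, the dot matches newlines too.
print(bool(re.search(r"hello.+bye", text, re.DOTALL)))         # True
```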

00:59:02.830 --> 00:59:05.850 align:middle line:84%
But for now, let me go ahead and
just make use of this first one.

00:59:05.850 --> 00:59:13.170 align:middle line:84%
Let me pass in a third argument to
re.search, which is re.IGNORECASE.

00:59:13.170 --> 00:59:15.330 align:middle line:84%
Let me now rerun the
program without clearing

00:59:15.330 --> 00:59:17.670 align:middle line:90%
my screen, python of validate.py.

00:59:17.670 --> 00:59:20.850 align:middle line:84%
Let me type in again in all
caps, effectively shouting,

00:59:20.850 --> 00:59:25.200 align:middle line:84%
MALAN@HARVARD.EDU, Enter, and
now it's considered valid,

00:59:25.200 --> 00:59:27.690 align:middle line:84%
because I'm telling
re.search specifically

00:59:27.690 --> 00:59:29.460 align:middle line:90%
to ignore the case of the input.

00:59:29.460 --> 00:59:30.960 align:middle line:90%
And that, too, here is fine.

00:59:30.960 --> 00:59:34.500 align:middle line:84%
And why might I do this approach rather
than call .lower in one of those other

00:59:34.500 --> 00:59:35.280 align:middle line:90%
locations?

00:59:35.280 --> 00:59:39.000 align:middle line:84%
Eh, if I don't actually want to change
the user's input for whatever reason,

00:59:39.000 --> 00:59:43.290 align:middle line:84%
I can still treat it case
insensitively without actually changing

00:59:43.290 --> 00:59:46.140 align:middle line:90%
the value of that variable itself.
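
NOTE
A minimal sketch contrasting the two fixes discussed above, lowercasing the
input versus passing re.IGNORECASE as the third argument. The pattern is
assumed to be the backslash w version from earlier in the lecture.
```python
import re
PATTERN = r"^\w+@\w+\.edu$"
email = "MALAN@HARVARD.EDU"
print(bool(re.search(PATTERN, email)))                 # False: "EDU" is not "edu"
print(bool(re.search(PATTERN, email.lower())))         # True: matches a lowercased copy
print(bool(re.search(PATTERN, email, re.IGNORECASE)))  # True: case ignored by the flag
print(email)                                           # MALAN@HARVARD.EDU, unchanged
```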

00:59:46.140 --> 00:59:51.970 align:middle line:84%
All right, any final questions now on
this validation of email addresses?

00:59:51.970 --> 00:59:54.600 align:middle line:84%
AUDIENCE: So the pattern
is a string, right?

00:59:54.600 --> 00:59:55.800 align:middle line:90%
DAVID MALAN: Mm-hmm.

00:59:55.800 --> 00:59:57.390 align:middle line:90%
AUDIENCE: Can we use an f-string?

00:59:57.390 --> 00:59:58.440 align:middle line:90%
DAVID MALAN: You can.

00:59:58.440 --> 01:00:01.780 align:middle line:84%
Yes, you can use an f-string so that
you could plug in, for instance,

01:00:01.780 --> 01:00:04.830 align:middle line:84%
the value of a variable and
pass it into the function.
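
NOTE
A sketch of building the pattern with an f-string, as the answer above allows.
The helper is_valid and its tld parameter are hypothetical. Two assumptions
worth flagging: the rf prefix keeps the string raw so the backslashes survive,
and re.escape is used so that any special characters in the plugged-in value
are treated literally. Literal curly braces in such a pattern would need to be
doubled.
```python
import re
def is_valid(email, tld="edu"):
    """Return True if email ends in the given top-level domain."""
    # re.escape guards against regex metacharacters in the interpolated value.
    pattern = rf"^\w+@\w+\.{re.escape(tld)}$"
    return re.search(pattern, email) is not None
print(is_valid("malan@harvard.edu"))       # True
print(is_valid("malan@gmail.com"))         # False
print(is_valid("malan@gmail.com", "com"))  # True
```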

01:00:04.830 --> 01:00:06.000 align:middle line:90%
Other questions on this?

01:00:06.000 --> 01:00:10.342 align:middle line:84%
AUDIENCE: Backslash w character, could
we take it as an input from the user?

01:00:10.342 --> 01:00:11.550 align:middle line:90%
DAVID MALAN: Technically yes.

01:00:11.550 --> 01:00:13.440 align:middle line:84%
That's not a problem we're
trying to solve right now.

01:00:13.440 --> 01:00:16.530 align:middle line:84%
We want the user to provide literal
input, like their email address,

01:00:16.530 --> 01:00:18.750 align:middle line:90%
not necessarily a regular expression.

01:00:18.750 --> 01:00:22.230 align:middle line:84%
But you could imagine building
software that asks the user, especially

01:00:22.230 --> 01:00:25.800 align:middle line:84%
if they're more advanced users, to type
in a regular expression for some reason

01:00:25.800 --> 01:00:27.722 align:middle line:90%
to validate something else against that.

01:00:27.722 --> 01:00:29.430 align:middle line:84%
And in fact, that's
what Google is doing.

01:00:29.430 --> 01:00:33.630 align:middle line:84%
If you play around with Google Forms and
create a form with response validation

01:00:33.630 --> 01:00:37.590 align:middle line:84%
and select Regular Expression,
Google lets you and I type

01:00:37.590 --> 01:00:41.530 align:middle line:84%
in our own regular expressions, which
would be a perfect example of that.

01:00:41.530 --> 01:00:42.030 align:middle line:90%
All right.

01:00:42.030 --> 01:00:45.900 align:middle line:84%
Well, let me propose that we try
to solve one other problem here,

01:00:45.900 --> 01:00:51.480 align:middle line:84%
whereby if I go into the same version
as before, which is now ignoring case,

01:00:51.480 --> 01:00:54.100 align:middle line:84%
but I type in one of my
other email addresses.

01:00:54.100 --> 01:00:56.280 align:middle line:84%
Let me go ahead and run
python of validate.py.

01:00:56.280 --> 01:00:59.580 align:middle line:84%
And this time, let me type in
not malan@harvard.edu, which

01:00:59.580 --> 01:01:01.920 align:middle line:84%
I use primarily, but
another email address

01:01:01.920 --> 01:01:06.030 align:middle line:84%
of mine, malan@cs50.harvard.edu,
which forwards to the same.

01:01:06.030 --> 01:01:07.920 align:middle line:90%
Let me go ahead and hit Enter now.

01:01:07.920 --> 01:01:11.940 align:middle line:84%
And huh, invalid, even
though I'm pretty sure that

01:01:11.940 --> 01:01:13.380 align:middle line:90%
is, in fact, my email address.

01:01:13.380 --> 01:01:15.920 align:middle line:84%
Well, let's put our
finger on the reason why.

01:01:15.920 --> 01:01:20.400 align:middle line:84%
Why at the moment is
malan@cs50.harvard.edu

01:01:20.400 --> 01:01:25.890 align:middle line:84%
being considered invalid, even though
I'm pretty sure I send and receive

01:01:25.890 --> 01:01:27.330 align:middle line:90%
email from that address, too?

01:01:27.330 --> 01:01:30.470 align:middle line:90%


01:01:30.470 --> 01:01:32.000 align:middle line:90%
Why might that be?

01:01:32.000 --> 01:01:38.475 align:middle line:84%
AUDIENCE: Because there is a dot
that has come after the @ symbol.

01:01:38.475 --> 01:01:39.350 align:middle line:90%
DAVID MALAN: Exactly.

01:01:39.350 --> 01:01:42.230 align:middle line:90%
There's a dot after my cs50.

01:01:42.230 --> 01:01:45.080 align:middle line:84%
And I'm not expecting any dots
there, I'm expecting only,

01:01:45.080 --> 01:01:50.240 align:middle line:84%
again, word characters, which is A
through z, 0 through 9, and underscore.

01:01:50.240 --> 01:01:52.130 align:middle line:90%
So I'm going to have to retool here.
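
NOTE
A small sketch that reproduces the failure just diagnosed, assuming the
current pattern is the backslash w version with re.IGNORECASE. The last two
lines show that the dot in the subdomain is the one character backslash w
will not accept.
```python
import re
PATTERN = r"^\w+@\w+\.edu$"
print(bool(re.search(PATTERN, "malan@harvard.edu", re.IGNORECASE)))       # True
print(bool(re.search(PATTERN, "malan@cs50.harvard.edu", re.IGNORECASE)))  # False
print(bool(re.search(r"^\w+$", "cs50.harvard")))                          # False
print(re.findall(r"\W", "cs50.harvard"))                                  # ['.']
```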

01:01:52.130 --> 01:01:54.090 align:middle line:90%
But how could I go about doing this?

01:01:54.090 --> 01:01:57.613 align:middle line:84%
Well, it turns out theoretically,
there could be other email addresses,

01:01:57.613 --> 01:02:00.530 align:middle line:84%
even though they'd be getting a
little excessively long, for instance,

01:02:00.530 --> 01:02:05.210 align:middle line:84%
malan@something.cs50.harvard.edu,
which does not technically exist,

01:02:05.210 --> 01:02:06.125 align:middle line:90%
but it could.

01:02:06.125 --> 01:02:09.950 align:middle line:84%
You can have, of course, multiple dots
in a domain name like we see here.

01:02:09.950 --> 01:02:12.500 align:middle line:84%
Wouldn't it be nice if we
could handle that as well?

01:02:12.500 --> 01:02:16.670 align:middle line:84%
Well, let me propose that we modify
my regular expression as follows.

01:02:16.670 --> 01:02:20.240 align:middle line:84%
It turns out that you
can group ideas together.

01:02:20.240 --> 01:02:24.050 align:middle line:84%
And you can not only ask whether
or not this pattern matches

01:02:24.050 --> 01:02:29.780 align:middle line:84%
or this one using syntax like A vertical
bar B, which means "either A or B,"

01:02:29.780 --> 01:02:34.280 align:middle line:84%
you can also group things together and
then apply some other operator to them

01:02:34.280 --> 01:02:35.100 align:middle line:90%
as well.

01:02:35.100 --> 01:02:37.160 align:middle line:84%
In fact, let me go
back to the code here.

01:02:37.160 --> 01:02:42.260 align:middle line:84%
And let me propose that if I want
to tolerate a subdomain, like cs50,

01:02:42.260 --> 01:02:46.700 align:middle line:84%
that may or may not be there, let me
go ahead and change it as follows.

01:02:46.700 --> 01:02:48.320 align:middle line:90%
I could naively do this.

01:02:48.320 --> 01:02:51.210 align:middle line:84%
If I want to support
subdomains, I could say, well,

01:02:51.210 --> 01:02:55.640 align:middle line:84%
let's allow for other word characters
plus, and then a literal dot.

01:02:55.640 --> 01:02:58.970 align:middle line:84%
And notice, I'll highlight in
blue here what I've just added.

01:02:58.970 --> 01:03:04.190 align:middle line:84%
Everything else is the same, but I'm now
adding room for another sequence of one

01:03:04.190 --> 01:03:07.650 align:middle line:84%
or more word characters
and then a literal dot.

01:03:07.650 --> 01:03:12.380 align:middle line:84%
So this now, I think, if I
rerun python of validate.py,

01:03:12.380 --> 01:03:16.310 align:middle line:84%
will work for
malan@cs50.harvard.edu, Enter.

01:03:16.310 --> 01:03:19.610 align:middle line:84%
Unfortunately, does anyone
see where this is going?

01:03:19.610 --> 01:03:22.310 align:middle line:84%
Let me rerun python of
validate.py and type

01:03:22.310 --> 01:03:25.010 align:middle line:84%
in as I keep doing,
malan@harvard.edu, which up until now

01:03:25.010 --> 01:03:27.290 align:middle line:84%
has kept working despite
all of my changes.

01:03:27.290 --> 01:03:33.110 align:middle line:84%
But now, ugh, finally I've
broken my own email address.
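
NOTE
Editor's sketch of the naive change described above. The exact pattern is
an assumption reconstructed from the narration (the earlier regex plus one
required subdomain), and is_valid is a hypothetical helper name.
```python
import re
# Assumed pattern: username, @, a REQUIRED subdomain, domain, then .edu
NAIVE = r"^\w+@\w+\.\w+\.edu$"
def is_valid(email: str) -> bool:
    """Return True if email matches NAIVE, ignoring case."""
    return re.search(NAIVE, email, re.IGNORECASE) is not None
print(is_valid("malan@cs50.harvard.edu"))  # True
print(is_valid("malan@harvard.edu"))  # False, the subdomain is now mandatory
```
The second call fails because the added part is required rather than optional.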

01:03:33.110 --> 01:03:35.540 align:middle line:90%
So logically what's the solution here?

01:03:35.540 --> 01:03:37.730 align:middle line:84%
Well, there's a bunch of
ways we could solve this.

01:03:37.730 --> 01:03:40.430 align:middle line:84%
I could maybe start using
two regular expressions

01:03:40.430 --> 01:03:46.370 align:middle line:84%
and support email addresses of
the form username@domain.tld,

01:03:46.370 --> 01:03:51.350 align:middle line:84%
or username@subdomain.domain.tld,
where TLD just

01:03:51.350 --> 01:03:53.917 align:middle line:90%
means Top Level Domain, like edu.

01:03:53.917 --> 01:03:56.000 align:middle line:84%
Or I could maybe just
modify this one, because I'd

01:03:56.000 --> 01:04:00.920 align:middle line:84%
prefer not to have two regular
expressions or one that's twice as big.

01:04:00.920 --> 01:04:06.470 align:middle line:84%
Why don't I just specify to re.search
that part of this pattern is optional?

01:04:06.470 --> 01:04:10.400 align:middle line:84%
What was the symbol we
saw earlier that allows

01:04:10.400 --> 01:04:15.440 align:middle line:84%
you to specify that the thing
before it is technically optional?

01:04:15.440 --> 01:04:16.610 align:middle line:90%
AUDIENCE: The straight bar?

01:04:16.610 --> 01:04:19.790 align:middle line:90%
We were using the straight bar as an--

01:04:19.790 --> 01:04:22.678 align:middle line:90%
optional, make the argument optional.

01:04:22.678 --> 01:04:23.720 align:middle line:90%
DAVID MALAN: So we could.

01:04:23.720 --> 01:04:26.210 align:middle line:84%
We could use a vertical
bar and some parentheses

01:04:26.210 --> 01:04:29.480 align:middle line:84%
and say, "either there's something
here or there's nothing."

01:04:29.480 --> 01:04:31.010 align:middle line:90%
We could do that in parentheses.

01:04:31.010 --> 01:04:33.860 align:middle line:84%
But I think there's
actually an even easier way.

01:04:33.860 --> 01:04:36.332 align:middle line:84%
AUDIENCE: Actually,
it's a question mark.

01:04:36.332 --> 01:04:37.790 align:middle line:90%
DAVID MALAN: Indeed, question mark.

01:04:37.790 --> 01:04:41.240 align:middle line:84%
Think back to this summary here
of our first set of symbols,

01:04:41.240 --> 01:04:46.130 align:middle line:84%
whereby we had not just dot and * and
+, but also a question mark, which

01:04:46.130 --> 01:04:49.370 align:middle line:84%
means literally "zero or
one repetitions," which

01:04:49.370 --> 01:04:50.810 align:middle line:90%
effectively means optional.

01:04:50.810 --> 01:04:54.740 align:middle line:84%
It's either there,
one, or it's not, zero.

01:04:54.740 --> 01:04:57.650 align:middle line:84%
Now, how can I translate
that to this code here?

01:04:57.650 --> 01:05:03.150 align:middle line:84%
Well, let me go ahead and surround this
part of my pattern with parentheses,

01:05:03.150 --> 01:05:06.740 align:middle line:84%
which doesn't mean I want literally
a parenthesis in the user's input,

01:05:06.740 --> 01:05:09.410 align:middle line:84%
I just want to group
these characters together.

01:05:09.410 --> 01:05:11.480 align:middle line:90%
And in fact, this now will still work.

01:05:11.480 --> 01:05:14.960 align:middle line:84%
I've only added parentheses around
the new part for the subdomain.

01:05:14.960 --> 01:05:17.000 align:middle line:90%
Let me run python of validate.py.

01:05:17.000 --> 01:05:20.060 align:middle line:84%
Let me run
malan@cs50.harvard.edu, Enter.

01:05:20.060 --> 01:05:21.110 align:middle line:90%
That's still valid.

01:05:21.110 --> 01:05:25.730 align:middle line:84%
But to be clear, if I rerun it again
for malan@harvard.edu, that is still

01:05:25.730 --> 01:05:31.310 align:middle line:84%
invalid, but not if I go in here and
say, after the parentheses, which

01:05:31.310 --> 01:05:36.410 align:middle line:84%
now is one logical unit, it's
one big group of ideas together,

01:05:36.410 --> 01:05:38.690 align:middle line:90%
I add a single question mark there.

01:05:38.690 --> 01:05:43.910 align:middle line:84%
This will now tell re.search that
that whole thing in parentheses

01:05:43.910 --> 01:05:49.020 align:middle line:84%
can either be there once or be
there not at all, zero times.

01:05:49.020 --> 01:05:51.530 align:middle line:84%
So what does this translate
into when I run it?

01:05:51.530 --> 01:05:56.030 align:middle line:84%
Well, let me go ahead and rerun
it with malan@cs50.harvard.edu

01:05:56.030 --> 01:05:57.770 align:middle line:90%
so that the subdomain is there.

01:05:57.770 --> 01:05:59.720 align:middle line:90%
That works as before.

01:05:59.720 --> 01:06:01.860 align:middle line:84%
Let me clear my screen
and run it again, python

01:06:01.860 --> 01:06:06.830 align:middle line:84%
of validate.py with malan@harvard.edu,
which used to work then broke.

01:06:06.830 --> 01:06:08.330 align:middle line:90%
Are we back in business now?

01:06:08.330 --> 01:06:09.260 align:middle line:90%
We are.

01:06:09.260 --> 01:06:11.810 align:middle line:90%
That's now valid again.
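
NOTE
Editor's sketch of the optional subdomain, assuming the pattern is the one
read aloud in this segment. The helper name is_valid is hypothetical.
```python
import re
# (\w+\.)? groups the subdomain and its dot, then makes the group optional
PATTERN = r"^\w+@(\w+\.)?\w+\.edu$"
def is_valid(email: str) -> bool:
    """Return True if email matches PATTERN, ignoring case."""
    return re.search(PATTERN, email, re.IGNORECASE) is not None
print(is_valid("malan@cs50.harvard.edu"))  # True, group present once
print(is_valid("malan@harvard.edu"))  # True, group present zero times
print(is_valid("malan@a.b.harvard.edu"))  # False, two subdomains
```
The last case shows the limit of ?, which allows at most one repetition.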

01:06:11.810 --> 01:06:14.540 align:middle line:84%
Questions now on this
approach, where we've used

01:06:14.540 --> 01:06:18.655 align:middle line:84%
not just the question mark
but the parentheses as well?

01:06:18.655 --> 01:06:19.280 align:middle line:90%
AUDIENCE: Yeah.

01:06:19.280 --> 01:06:22.130 align:middle line:84%
You said it works for
zero or one repetitions.

01:06:22.130 --> 01:06:23.912 align:middle line:90%
What if you have more?

01:06:23.912 --> 01:06:25.370 align:middle line:90%
DAVID MALAN: What if you have more?

01:06:25.370 --> 01:06:26.220 align:middle line:90%
That's OK.

01:06:26.220 --> 01:06:28.610 align:middle line:90%
That's where you could do *.

01:06:28.610 --> 01:06:33.835 align:middle line:84%
* is zero or more, which gives you
all the flexibility in the world.
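
NOTE
Editor's sketch of the * variant mentioned in this answer. It is not shown
on screen in the lecture, so the pattern and the helper name is_valid are
assumptions for illustration.
```python
import re
# (\w+\.)* allows zero or more subdomains instead of at most one
PATTERN = r"^\w+@(\w+\.)*\w+\.edu$"
def is_valid(email: str) -> bool:
    """Return True if email matches PATTERN, ignoring case."""
    return re.search(PATTERN, email, re.IGNORECASE) is not None
print(is_valid("malan@harvard.edu"))  # True
print(is_valid("malan@something.cs50.harvard.edu"))  # True
print(is_valid("malan@.edu"))  # False, a domain name is still required
```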

01:06:33.835 --> 01:06:34.460 align:middle line:90%
AUDIENCE: Yeah.

01:06:34.460 --> 01:06:37.050 align:middle line:90%
So I was just asking that--

01:06:37.050 --> 01:06:40.670 align:middle line:84%
with question marks, there's
only one repetition allowed.

01:06:40.670 --> 01:06:42.810 align:middle line:84%
DAVID MALAN: It means
zero or one repetition.

01:06:42.810 --> 01:06:45.630 align:middle line:90%
So it's either not there or it is there.

01:06:45.630 --> 01:06:49.940 align:middle line:84%
And so that's why this pattern now, if
I go back to my code, even though again,

01:06:49.940 --> 01:06:54.650 align:middle line:84%
it admittedly looks cryptic, let me
highlight everything after the @ sign

01:06:54.650 --> 01:06:56.060 align:middle line:90%
and before the $ sign.

01:06:56.060 --> 01:07:01.001 align:middle line:84%
This now represents a domain
name, like harvard.edu,

01:07:01.001 --> 01:07:03.920 align:middle line:90%
or a subdomain within the domain name.

01:07:03.920 --> 01:07:04.700 align:middle line:90%
Why?

01:07:04.700 --> 01:07:07.700 align:middle line:84%
Well, this part to the
right is the same as always.

01:07:07.700 --> 01:07:11.330 align:middle line:84%
Backslash w + means something
like Harvard or Yale.

01:07:11.330 --> 01:07:14.810 align:middle line:90%
Backslash .edu means literally ".edu."

01:07:14.810 --> 01:07:16.430 align:middle line:90%
So the new part is this.

01:07:16.430 --> 01:07:22.370 align:middle line:84%
In parentheses, I have another set
of backslash w + backslash dot now.

01:07:22.370 --> 01:07:24.080 align:middle line:90%
But it's all in parentheses.

01:07:24.080 --> 01:07:26.870 align:middle line:84%
I'm now having a question
mark right after that,

01:07:26.870 --> 01:07:30.710 align:middle line:84%
which means that whole thing in
parentheses either can be there,

01:07:30.710 --> 01:07:31.850 align:middle line:90%
or it can be absent.

01:07:31.850 --> 01:07:34.010 align:middle line:84%
It's either of those
that are acceptable.

01:07:34.010 --> 01:07:37.880 align:middle line:84%
So a question mark effectively
makes something optional.

01:07:37.880 --> 01:07:40.670 align:middle line:84%
It would not be correct
to remove the parentheses,

01:07:40.670 --> 01:07:42.150 align:middle line:90%
because what would this mean?

01:07:42.150 --> 01:07:44.690 align:middle line:84%
If I removed the
parentheses, that would mean

01:07:44.690 --> 01:07:49.580 align:middle line:84%
that only this dot is optional, which
isn't really what we want to express.

01:07:49.580 --> 01:07:54.050 align:middle line:84%
I want the subdomain, like
cs50 and the additional dot

01:07:54.050 --> 01:07:56.060 align:middle line:90%
to be what's there or not there.
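
NOTE
Editor's sketch comparing the grouped and ungrouped forms. The ungrouped
pattern is an assumption about what removing the parentheses would leave,
and the address malan@x.edu is a made-up test input.
```python
import re
GROUPED = r"^\w+@(\w+\.)?\w+\.edu$"  # ? applies to the whole subdomain group
UNGROUPED = r"^\w+@\w+\.?\w+\.edu$"  # ? applies only to the dot before it
def matches(pattern: str, email: str) -> bool:
    """Return True if email matches pattern, ignoring case."""
    return re.search(pattern, email, re.IGNORECASE) is not None
print(matches(GROUPED, "malan@x.edu"))  # True
print(matches(UNGROUPED, "malan@x.edu"))  # False
```
Without the group, both \w+ parts are still required, so a one-letter
domain cannot satisfy them.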

01:07:56.060 --> 01:07:59.270 align:middle line:84%
How about one other
question on regexes here?

01:07:59.270 --> 01:08:01.530 align:middle line:84%
AUDIENCE: Can we use
this for the usernames?

01:08:01.530 --> 01:08:02.530 align:middle line:90%
DAVID MALAN: Absolutely.

01:08:02.530 --> 01:08:04.000 align:middle line:90%
We still have other problems.

01:08:04.000 --> 01:08:06.280 align:middle line:84%
We're not solving all of
the problems today just yet.

01:08:06.280 --> 01:08:07.330 align:middle line:90%
But absolutely.

01:08:07.330 --> 01:08:11.380 align:middle line:84%
Right now, we are not letting you
have a period in your username.

01:08:11.380 --> 01:08:14.088 align:middle line:84%
And again, some of you with Gmail
accounts or other accounts, you

01:08:14.088 --> 01:08:16.463 align:middle line:84%
probably have not just
underscores, numbers, and letters.

01:08:16.463 --> 01:08:17.740 align:middle line:90%
You might have periods, too.

01:08:17.740 --> 01:08:21.790 align:middle line:84%
Well, we could fix that, not
using question mark here per se.

01:08:21.790 --> 01:08:25.630 align:middle line:84%
But now that we have these parentheses
at our disposal, what I could do

01:08:25.630 --> 01:08:26.350 align:middle line:90%
is this.

01:08:26.350 --> 01:08:30.399 align:middle line:84%
I could use parentheses to
surround the backslash w

01:08:30.399 --> 01:08:33.819 align:middle line:84%
to say "any word character," which is
the same thing, again, as a letter,

01:08:33.819 --> 01:08:35.529 align:middle line:90%
or a number, or an underscore.

01:08:35.529 --> 01:08:40.120 align:middle line:84%
But I could also OR in, using
a vertical bar, something else,

01:08:40.120 --> 01:08:41.800 align:middle line:90%
like a literal dot.

01:08:41.800 --> 01:08:44.770 align:middle line:84%
Now, a literal dot needs
to be escaped, otherwise it

01:08:44.770 --> 01:08:47.859 align:middle line:84%
represents any character, which
would be a regression, a step back.

01:08:47.859 --> 01:08:49.540 align:middle line:90%
But now notice what I've done.

01:08:49.540 --> 01:08:54.370 align:middle line:84%
In parentheses, I'm telling re.search
that those first few characters

01:08:54.370 --> 01:08:56.800 align:middle line:84%
in your email address,
that is your username,

01:08:56.800 --> 01:09:02.049 align:middle line:84%
each have to be a word character, like A
through z, uppercase or lowercase, or 0

01:09:02.049 --> 01:09:05.290 align:middle line:84%
through 9, or an underscore,
or a literal dot.

01:09:05.290 --> 01:09:06.760 align:middle line:90%
We could do this differently, too.

01:09:06.760 --> 01:09:09.220 align:middle line:84%
I could get rid of the
parentheses and the

01:09:09.220 --> 01:09:12.010 align:middle line:84%
or, and I could just
use a set of characters.

01:09:12.010 --> 01:09:17.890 align:middle line:84%
I could, again, manually say a
through z, A through Z, 0 through 9,

01:09:17.890 --> 01:09:22.540 align:middle line:84%
underscore, and then I could do a
literal dot with a backslash period.

01:09:22.540 --> 01:09:25.029 align:middle line:84%
And now I technically don't
even need the uppercase,

01:09:25.029 --> 01:09:27.590 align:middle line:84%
because I'm already telling
the computer to ignore case.

01:09:27.590 --> 01:09:29.359 align:middle line:90%
I can just pick one or the other.

01:09:29.359 --> 01:09:31.120 align:middle line:90%
Which one is better is really up to you.

01:09:31.120 --> 01:09:35.600 align:middle line:84%
Whichever one you think is more readable
would generally be the better design.
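
NOTE
Editor's sketch of the two username options described above. Keeping the +
after the group or set is an assumption, as is the helper name matches.
Note that in Python \w also matches non-English letters, so the explicit
set is somewhat stricter than the alternation.
```python
import re
ALTERNATION = r"^(\w|\.)+@(\w+\.)?\w+\.edu$"  # word character OR literal dot
CHAR_SET = r"^[a-z0-9_\.]+@(\w+\.)?\w+\.edu$"  # explicit set of characters
def matches(pattern: str, email: str) -> bool:
    """Return True if email matches pattern, ignoring case."""
    return re.search(pattern, email, re.IGNORECASE) is not None
for pattern in (ALTERNATION, CHAR_SET):
    print(matches(pattern, "David.Malan@harvard.edu"))  # True both times
    print(matches(pattern, "david malan@harvard.edu"))  # False both times
```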

01:09:35.600 --> 01:09:36.100 align:middle line:90%
All right.

01:09:36.100 --> 01:09:38.979 align:middle line:84%
Let me propose that
I rewind this in time

01:09:38.979 --> 01:09:42.819 align:middle line:90%
to where we left off, which was here.

01:09:42.819 --> 01:09:44.800 align:middle line:84%
And let me propose
that there are, indeed,

01:09:44.800 --> 01:09:48.935 align:middle line:84%
still limitations of this solution,
not just with the username, not just

01:09:48.935 --> 01:09:49.810 align:middle line:90%
with the domain name.

01:09:49.810 --> 01:09:51.700 align:middle line:84%
We're still being a
little too restrictive.

01:09:51.700 --> 01:09:54.910 align:middle line:84%
So would you like to see the
official regular expression

01:09:54.910 --> 01:09:58.720 align:middle line:84%
that at least browsers use nowadays
whenever you type in an email address

01:09:58.720 --> 01:10:01.450 align:middle line:84%
to a web form, and the
web form, the browser,

01:10:01.450 --> 01:10:05.680 align:middle line:84%
tells you yes or no, your email
address is syntactically valid?

01:10:05.680 --> 01:10:06.670 align:middle line:90%
Ready?

01:10:06.670 --> 01:10:07.810 align:middle line:90%
Ready?

01:10:07.810 --> 01:10:12.730 align:middle line:84%
Here is-- and this isn't even
officially the right regular expression.

01:10:12.730 --> 01:10:15.670 align:middle line:84%
It's a simplified version
that browsers use because it

01:10:15.670 --> 01:10:18.100 align:middle line:90%
catches most mistakes but not all.

01:10:18.100 --> 01:10:19.460 align:middle line:90%
Here we go.

01:10:19.460 --> 01:10:23.710 align:middle line:84%
This is the regular expression
for a valid email address,

01:10:23.710 --> 01:10:27.550 align:middle line:84%
at least as browsers
nowadays implement it.

01:10:27.550 --> 01:10:30.610 align:middle line:90%
Now it's crazy cryptic at first glance.

01:10:30.610 --> 01:10:34.930 align:middle line:84%
But note-- and it's wrapping on to
many lines, but it's just one pattern.

01:10:34.930 --> 01:10:37.930 align:middle line:84%
But just notice the
now-familiar symbols.

01:10:37.930 --> 01:10:40.540 align:middle line:90%
There is the ^ symbol at the very top.

01:10:40.540 --> 01:10:43.280 align:middle line:90%
There is the $ sign at the very end.

01:10:43.280 --> 01:10:45.730 align:middle line:84%
There is a square bracket
over here and then some

01:10:45.730 --> 01:10:47.860 align:middle line:90%
of these ranges plus other characters.

01:10:47.860 --> 01:10:51.280 align:middle line:84%
Turns out you don't normally see
these characters in email addresses.

01:10:51.280 --> 01:10:53.770 align:middle line:84%
It looks like you're swearing
at someone in their username.

01:10:53.770 --> 01:10:55.450 align:middle line:90%
But they're valid characters.

01:10:55.450 --> 01:10:56.680 align:middle line:90%
They're valid officially.

01:10:56.680 --> 01:11:00.670 align:middle line:84%
That doesn't mean that Gmail is going
to allow you to put $ signs and other

01:11:00.670 --> 01:11:02.260 align:middle line:90%
punctuation in your username.

01:11:02.260 --> 01:11:04.850 align:middle line:84%
But officially, some
servers might allow that.

01:11:04.850 --> 01:11:08.080 align:middle line:84%
So if you really want to
validate a user's email address,

01:11:08.080 --> 01:11:12.250 align:middle line:84%
you would actually come up with
or copy-paste something like this.

01:11:12.250 --> 01:11:14.680 align:middle line:90%
But honestly, this looks so cryptic.

01:11:14.680 --> 01:11:18.680 align:middle line:84%
And if you were to type it out manually,
you are so likely to make a mistake.

01:11:18.680 --> 01:11:21.040 align:middle line:90%
What's the better solution here instead?

01:11:21.040 --> 01:11:24.820 align:middle line:84%
This is where, per past weeks,
libraries are your friend.

01:11:24.820 --> 01:11:28.360 align:middle line:84%
Surely someone else on the
internet, a programmer more

01:11:28.360 --> 01:11:31.360 align:middle line:84%
experienced than you,
even, has come up with code

01:11:31.360 --> 01:11:35.830 align:middle line:84%
that validates email addresses properly,
using this regular expression or even

01:11:35.830 --> 01:11:37.580 align:middle line:90%
something more sophisticated than that.

01:11:37.580 --> 01:11:40.030 align:middle line:84%
So generally, if the problem
at hand is to validate

01:11:40.030 --> 01:11:43.060 align:middle line:84%
input that is pretty
conventional-- an email address,

01:11:43.060 --> 01:11:46.570 align:middle line:84%
a URL, something where there's
an official definition that's

01:11:46.570 --> 01:11:50.710 align:middle line:84%
independent of you yourself--
find a popular library that you're

01:11:50.710 --> 01:11:55.130 align:middle line:84%
comfortable using and use it in your
code to validate email addresses.

01:11:55.130 --> 01:11:58.750 align:middle line:84%
This is not a wheel, necessarily,
that you yourself should invent.

01:11:58.750 --> 01:12:01.870 align:middle line:84%
We've used email addresses,
though, to iteratively start

01:12:01.870 --> 01:12:05.300 align:middle line:84%
from something simple, too
simple, and build on top of that.

01:12:05.300 --> 01:12:07.960 align:middle line:84%
So you could certainly imagine
using regular expressions still

01:12:07.960 --> 01:12:10.210 align:middle line:84%
to validate things that
aren't email addresses but are

01:12:10.210 --> 01:12:12.230 align:middle line:90%
data that are important to you.

01:12:12.230 --> 01:12:14.980 align:middle line:84%
So we at least now have
these building blocks.

01:12:14.980 --> 01:12:17.380 align:middle line:84%
Now, besides the regular
expressions themselves,

01:12:17.380 --> 01:12:20.290 align:middle line:84%
it turns out there are other
functions in Python's re

01:12:20.290 --> 01:12:22.030 align:middle line:90%
library for regular expressions.

01:12:22.030 --> 01:12:24.280 align:middle line:84%
Among them is this
function here, re.match,

01:12:24.280 --> 01:12:26.980 align:middle line:84%
which is actually very
similar to re.search,

01:12:26.980 --> 01:12:29.462 align:middle line:84%
except you don't have
to specify the ^ symbol

01:12:29.462 --> 01:12:31.420 align:middle line:84%
at the very beginning of
your regex if you want

01:12:31.420 --> 01:12:33.400 align:middle line:90%
to match from the start of a string.

01:12:33.400 --> 01:12:36.958 align:middle line:84%
re.match by design will
automatically start matching

01:12:36.958 --> 01:12:38.500 align:middle line:90%
from the start of the string for you.

01:12:38.500 --> 01:12:42.580 align:middle line:84%
Similar in spirit is re.fullmatch,
which does the same thing but not only

01:12:42.580 --> 01:12:45.730 align:middle line:84%
matches at the start of the string but
the end of the string, so that you,

01:12:45.730 --> 01:12:50.240 align:middle line:84%
too, don't need to type in the
^ symbol or the $ sign as well.
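
NOTE
Editor's sketch contrasting re.search, re.match, and re.fullmatch, using
the simpler email pattern from earlier in the lecture as an assumed
example. The variable names are illustrative only.
```python
import re
email = "malan@harvard.edu"
a = re.search(r"^\w+@\w+\.edu$", email)  # both anchors written out
b = re.match(r"\w+@\w+\.edu$", email)  # start of string is implied
c = re.fullmatch(r"\w+@\w+\.edu", email)  # start and end are implied
print(all(m is not None for m in (a, b, c)))  # True
# With trailing text, match (no $) still succeeds but fullmatch does not
noisy = "malan@harvard.edu is my address"
print(re.match(r"\w+@\w+\.edu", noisy) is not None)  # True
print(re.fullmatch(r"\w+@\w+\.edu", noisy) is not None)  # False
```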

01:12:50.240 --> 01:12:53.170 align:middle line:84%
But let's go ahead and transition
back now to some actual code,

01:12:53.170 --> 01:12:55.420 align:middle line:84%
whereby we solve a
different problem in spirit.

01:12:55.420 --> 01:12:57.920 align:middle line:84%
Rather than just
validate the user's input

01:12:57.920 --> 01:13:00.290 align:middle line:84%
and make sure it looks the
way we want, let's just

01:13:00.290 --> 01:13:04.020 align:middle line:84%
assume that the users are not going
to type in data exactly as we want,

01:13:04.020 --> 01:13:06.290 align:middle line:84%
and so we're going to have
to clean up their input.

01:13:06.290 --> 01:13:10.580 align:middle line:84%
This happens so often when you're using
like a Google Form, or Office 365 form,

01:13:10.580 --> 01:13:12.800 align:middle line:90%
or anything else to collect user input.

01:13:12.800 --> 01:13:15.800 align:middle line:84%
No matter what your form
question says, your users

01:13:15.800 --> 01:13:18.225 align:middle line:84%
are not necessarily going
to follow those directions.

01:13:18.225 --> 01:13:20.600 align:middle line:84%
They might go ahead and type
in something that's a little

01:13:20.600 --> 01:13:22.910 align:middle line:84%
differently formatted
than you might like.

01:13:22.910 --> 01:13:26.810 align:middle line:84%
Now, you could certainly go through
the results and download a CSV,

01:13:26.810 --> 01:13:29.720 align:middle line:84%
or open the Google spreadsheet,
or equivalent in Excel,

01:13:29.720 --> 01:13:31.980 align:middle line:84%
and just clean up all
of the data manually.

01:13:31.980 --> 01:13:34.250 align:middle line:84%
But if you've got lots
of submissions-- dozens,

01:13:34.250 --> 01:13:37.070 align:middle line:84%
hundreds, thousands of
rows in your data set--

01:13:37.070 --> 01:13:39.170 align:middle line:84%
doing things manually
might not be very fun.

01:13:39.170 --> 01:13:42.680 align:middle line:84%
It might be much more effective
to write code, as in Python,

01:13:42.680 --> 01:13:47.220 align:middle line:84%
that can allow you to clean up that
data and any future data as well.

01:13:47.220 --> 01:13:51.620 align:middle line:84%
So let me propose that we go
ahead here and close validate.py.

01:13:51.620 --> 01:13:55.460 align:middle line:84%
And let's go ahead and create a new
program altogether called format.py,

01:13:55.460 --> 01:13:59.990 align:middle line:84%
the goal of which is to reformat the
user's input in the format we expect.

01:13:59.990 --> 01:14:03.080 align:middle line:84%
I'm going to go ahead and
run code of format.py.

01:14:03.080 --> 01:14:06.170 align:middle line:84%
And let's suppose that the
data we're going to reformat

01:14:06.170 --> 01:14:09.703 align:middle line:84%
is the user's name-- so not
email address but name this time.

01:14:09.703 --> 01:14:11.870 align:middle line:84%
And we're going to hope
that they type in their name

01:14:11.870 --> 01:14:14.270 align:middle line:90%
properly, like David Malan.

01:14:14.270 --> 01:14:16.610 align:middle line:84%
But some users might be
in the habit, for whatever

01:14:16.610 --> 01:14:19.020 align:middle line:84%
reason, of typing their
name backwards, if you will,

01:14:19.020 --> 01:14:23.030 align:middle line:84%
with a comma, such as
Malan comma David instead.

01:14:23.030 --> 01:14:27.740 align:middle line:84%
Now, it's fine because both are
clearly as readable to the human.

01:14:27.740 --> 01:14:30.530 align:middle line:84%
But if you want to standardize
how those names are stored

01:14:30.530 --> 01:14:34.250 align:middle line:84%
in your system, perhaps a database,
or CSV file, or something else,

01:14:34.250 --> 01:14:37.970 align:middle line:84%
it would be nice to at least standardize
or canonicalize the format in which

01:14:37.970 --> 01:14:41.060 align:middle line:84%
you're storing your data, so that
if you print out the user's name

01:14:41.060 --> 01:14:43.250 align:middle line:84%
it's always the same
format, David Malan,

01:14:43.250 --> 01:14:46.410 align:middle line:84%
and there's no commas
or backwardness to it.

01:14:46.410 --> 01:14:48.650 align:middle line:84%
So let's go ahead and
do something familiar.

01:14:48.650 --> 01:14:50.990 align:middle line:84%
Let's go ahead and give
myself a variable called name

01:14:50.990 --> 01:14:53.120 align:middle line:84%
and set it equal to the
return value of input,

01:14:53.120 --> 01:14:56.300 align:middle line:84%
asking the user, as we've done
many times, "what's your name,"

01:14:56.300 --> 01:14:57.170 align:middle line:90%
question mark.

01:14:57.170 --> 01:15:00.290 align:middle line:84%
I'm going to go ahead and proactively
at least clean up some messiness,

01:15:00.290 --> 01:15:03.950 align:middle line:84%
as we keep doing here, by just stripping
off any leading or trailing whitespace.

01:15:03.950 --> 01:15:06.470 align:middle line:84%
Just in case the user
accidentally hits the spacebar,

01:15:06.470 --> 01:15:09.720 align:middle line:84%
we don't want that
ultimately in our data set.

01:15:09.720 --> 01:15:12.260 align:middle line:84%
And now let me go ahead and
do this as we've done before.

01:15:12.260 --> 01:15:14.900 align:middle line:84%
Let me just go ahead quickly
and print out, just to make sure

01:15:14.900 --> 01:15:18.650 align:middle line:84%
I'm off to the right start, "hello,"
and then in curly braces name,

01:15:18.650 --> 01:15:22.010 align:middle line:84%
so making an f-string to
format "hello," comma, "name."
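
NOTE
Editor's sketch of this first version of format.py. To keep it runnable
without typing at a prompt, a hypothetical function greet takes the raw
text that input would have returned.
```python
def greet(raw: str) -> str:
    """Strip stray whitespace, then build the greeting with an f-string."""
    name = raw.strip()
    return f"hello, {name}"
print(greet("  David Malan "))  # hello, David Malan
print(greet("Malan, David"))  # hello, Malan, David
```
The second call shows that a name typed last-first is echoed back unchanged.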

01:15:22.010 --> 01:15:25.730 align:middle line:84%
Now let me go ahead and clear my
screen and run python of format.py.

01:15:25.730 --> 01:15:29.510 align:middle line:84%
Let me behave and type in my name as
I normally would, David, space, Malan,

01:15:29.510 --> 01:15:30.170 align:middle line:90%
Enter.

01:15:30.170 --> 01:15:32.270 align:middle line:84%
And I think the output
looks pretty good.

01:15:32.270 --> 01:15:34.490 align:middle line:90%
It looks as expected grammatically.

01:15:34.490 --> 01:15:37.283 align:middle line:84%
Let me now go ahead, though,
and play this game again.

01:15:37.283 --> 01:15:39.200 align:middle line:84%
But this time, maybe
because I'm not thinking,

01:15:39.200 --> 01:15:41.600 align:middle line:84%
or I'm just in the habit of
doing last name comma first,

01:15:41.600 --> 01:15:44.700 align:middle line:90%
I do Malan, comma, David, and hit Enter.

01:15:44.700 --> 01:15:45.200 align:middle line:90%
All right.

01:15:45.200 --> 01:15:47.270 align:middle line:90%
Well, this now is weird.

01:15:47.270 --> 01:15:51.020 align:middle line:84%
Even though the program is just
spitting out exactly what I typed in,

01:15:51.020 --> 01:15:54.020 align:middle line:84%
arguably this is not close to
correct, at least grammatically.

01:15:54.020 --> 01:15:56.810 align:middle line:84%
It should really say
"hello, David Malan."

01:15:56.810 --> 01:15:58.820 align:middle line:84%
Now, maybe I could
have some if conditions

01:15:58.820 --> 01:16:01.910 align:middle line:84%
and I could just reject the
user's input if they type a comma

01:16:01.910 --> 01:16:03.800 align:middle line:90%
or get their names backwards somehow.

01:16:03.800 --> 01:16:07.190 align:middle line:84%
But that's going to be too little
too late if the user has already

01:16:07.190 --> 01:16:10.580 align:middle line:84%
submitted a form online,
and I already have the data,

01:16:10.580 --> 01:16:12.600 align:middle line:90%
and now I need to go in and clean it up.

01:16:12.600 --> 01:16:14.750 align:middle line:84%
And it's not going to be
fun to go through manually

01:16:14.750 --> 01:16:17.900 align:middle line:84%
in Google Spreadsheets, or Apple
Numbers, or Microsoft Excel

01:16:17.900 --> 01:16:21.650 align:middle line:84%
and manually fix a lot of people's
names to get rid of the commas

01:16:21.650 --> 01:16:25.700 align:middle line:84%
and move the first name before the
last, as is conventional in the US.

01:16:25.700 --> 01:16:27.080 align:middle line:90%
So let's do this.

01:16:27.080 --> 01:16:29.780 align:middle line:84%
It could be a little
fragile, but let's start

01:16:29.780 --> 01:16:32.990 align:middle line:84%
to express ourselves a little
programmatically here and ask this.

01:16:32.990 --> 01:16:37.940 align:middle line:84%
If there is a comma in the
person's name, which is Pythonic--

01:16:37.940 --> 01:16:41.960 align:middle line:84%
I'm just asking the question, is this
shorter string in this longer string?--

01:16:41.960 --> 01:16:43.650 align:middle line:90%
then let me go ahead and do this.

01:16:43.650 --> 01:16:46.340 align:middle line:84%
Let me go ahead and grab
that name in the variable,

01:16:46.340 --> 01:16:50.840 align:middle line:84%
split on not just the
comma but the space after,

01:16:50.840 --> 01:16:53.480 align:middle line:84%
assuming the human typed in
a space after the comma.

01:16:53.480 --> 01:16:57.080 align:middle line:84%
And let me go ahead and store the result
of that splitting of Malan, comma,

01:16:57.080 --> 01:16:58.860 align:middle line:90%
David into two variables.

01:16:58.860 --> 01:17:02.000 align:middle line:84%
Let's do last, comma,
first, again unpacking

01:17:02.000 --> 01:17:04.310 align:middle line:90%
the sequence of values that comes back.

01:17:04.310 --> 01:17:07.170 align:middle line:84%
Now let me go ahead
and reformat the name.

01:17:07.170 --> 01:17:10.160 align:middle line:84%
So I'm going to forcibly change
the user's name to be as I expect.

01:17:10.160 --> 01:17:13.580 align:middle line:84%
So name is actually going
to be this format string--

01:17:13.580 --> 01:17:18.830 align:middle line:84%
first name then last name, both in
curly braces but formatted together

01:17:18.830 --> 01:17:22.580 align:middle line:84%
with a single space, so that
I'm overwriting the user's input

01:17:22.580 --> 01:17:25.280 align:middle line:84%
and updating my name
variable accordingly.
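As a rough sketch of the steps just described: the function name `reformat` is an assumption for illustration, since the program in the lecture works on a name the user types in rather than on a function argument.

```python
def reformat(name):
    """Turn 'Last, First' into 'First Last' using only str methods."""
    name = name.strip()
    if "," in name:
        # Split on the comma plus the space after it, then unpack both parts
        last, first = name.split(", ")
        # Overwrite the original value with the conventional order
        name = f"{first} {last}"
    return name


print(f"hello, {reformat('Malan, David')}")
```

Wrapping the logic in a function is only a convenience for trying several inputs; the if, the split, and the f-string are the parts described above.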

01:17:25.280 --> 01:17:27.770 align:middle line:84%
For the moment, to be clear,
this program is interactive.

01:17:27.770 --> 01:17:31.250 align:middle line:84%
Like, the users, like me, are
typing their name into the program.

01:17:31.250 --> 01:17:34.340 align:middle line:84%
But imagine the data
already is in a CSV file.

01:17:34.340 --> 01:17:37.730 align:middle line:84%
It came in from some process like a
Google Form or something else online.

01:17:37.730 --> 01:17:40.370 align:middle line:84%
You could imagine writing
code similar to this,

01:17:40.370 --> 01:17:43.550 align:middle line:84%
but that maybe goes and reads
that file into memory first.

01:17:43.550 --> 01:17:46.640 align:middle line:84%
Maybe it's a CSV via csv.reader
or csv.DictReader,

01:17:46.640 --> 01:17:48.860 align:middle line:84%
and then iterating over
each of those names.

01:17:48.860 --> 01:17:51.630 align:middle line:84%
But we'll keep it simple and
just do one name at a time.
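A minimal sketch of that idea, assuming a column called `name` and using an in-memory string as a hypothetical stand-in for a CSV file exported from a form. The `reformat` helper is likewise an assumption for illustration.

```python
import csv
import io


def reformat(name):
    """Turn 'Last, First' into 'First Last'; leave other names alone."""
    if "," in name:
        last, first = name.split(", ")
        name = f"{first} {last}"
    return name


# Hypothetical data; a real program might open a file on disk instead
data = io.StringIO('name\n"Malan, David"\nDavid Malan\n')

# DictReader yields one dict per row, keyed by the header line
names = [reformat(row["name"]) for row in csv.DictReader(data)]
print(names)
```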

01:17:51.630 --> 01:17:55.070 align:middle line:84%
But now what's kind of interesting here
is if I go back to my terminal window

01:17:55.070 --> 01:17:57.940 align:middle line:84%
and clear it, and run
python of format.py,

01:17:57.940 --> 01:18:01.240 align:middle line:84%
and hit Enter, I'm going to type
in David, space, Malan as before.

01:18:01.240 --> 01:18:03.130 align:middle line:90%
And I think we're still good.

01:18:03.130 --> 01:18:05.290 align:middle line:84%
But I'm also going to
go ahead and do this--

01:18:05.290 --> 01:18:10.630 align:middle line:84%
python of format.py Malan, comma,
David, with a space in between,

01:18:10.630 --> 01:18:13.960 align:middle line:84%
crossing my fingers and
hit Enter, and voila.

01:18:13.960 --> 01:18:15.640 align:middle line:90%
That now has been fixed.

01:18:15.640 --> 01:18:18.400 align:middle line:90%
Such a simple thing to be sure.

01:18:18.400 --> 01:18:22.300 align:middle line:84%
But it is so commonly necessary
to clean up users' input.

01:18:22.300 --> 01:18:25.870 align:middle line:84%
Here we see at least one
way to do so pretty easily.

01:18:25.870 --> 01:18:28.480 align:middle line:84%
Now, to be fair, there's
some problems here.

01:18:28.480 --> 01:18:32.500 align:middle line:84%
And in fact, can someone imagine a
scenario in which this code really

01:18:32.500 --> 01:18:34.570 align:middle line:90%
doesn't fix the user's input?

01:18:34.570 --> 01:18:39.760 align:middle line:84%
What could still go wrong
even with this fix in my code?

01:18:39.760 --> 01:18:40.810 align:middle line:90%
Any thoughts?

01:18:40.810 --> 01:18:44.322 align:middle line:84%
AUDIENCE: If they typed in their
name comma and then [INAUDIBLE].

01:18:44.322 --> 01:18:46.030 align:middle line:84%
DAVID MALAN: Oh, and
then something else.

01:18:46.030 --> 01:18:46.530 align:middle line:90%
Yeah.

01:18:46.530 --> 01:18:48.730 align:middle line:90%
So let me try this, for instance.

01:18:48.730 --> 01:18:50.410 align:middle line:90%
Let me go ahead and run the program.

01:18:50.410 --> 01:18:53.350 align:middle line:84%
And I am the only David
Malan that I know.

01:18:53.350 --> 01:18:57.850 align:middle line:84%
But suppose I were, let's
say, junior like this.

01:18:57.850 --> 01:19:00.850 align:middle line:84%
And it's common, in English at least,
to sometimes put a comma there.

01:19:00.850 --> 01:19:02.350 align:middle line:84%
You don't necessarily
need the comma, but I'm

01:19:02.350 --> 01:19:04.120 align:middle line:90%
one of those people who uses a comma.

01:19:04.120 --> 01:19:06.730 align:middle line:90%
That's now really, really broken.

01:19:06.730 --> 01:19:08.830 align:middle line:90%
So I've broken some assumption there.

01:19:08.830 --> 01:19:10.970 align:middle line:84%
And so that could
certainly go wrong here.

01:19:10.970 --> 01:19:11.470 align:middle line:90%
What else?

01:19:11.470 --> 01:19:13.178 align:middle line:84%
Well, let me go ahead
and run this again.

01:19:13.178 --> 01:19:15.540 align:middle line:84%
And if I did Malan,
comma, David, no space,

01:19:15.540 --> 01:19:17.290 align:middle line:84%
because I'm being a
little sloppy, I'm not

01:19:17.290 --> 01:19:20.500 align:middle line:84%
paying attention, which is going to
happen when you have lots of users

01:19:20.500 --> 01:19:22.750 align:middle line:90%
ultimately, well, this really broke now.

01:19:22.750 --> 01:19:25.870 align:middle line:84%
Notice I have a ValueError,
an actual exception.

01:19:25.870 --> 01:19:26.410 align:middle line:90%
Why?

01:19:26.410 --> 01:19:31.330 align:middle line:84%
Well, because split is supposed to be
splitting the string into two strings

01:19:31.330 --> 01:19:34.000 align:middle line:90%
by looking for the comma and a space.

01:19:34.000 --> 01:19:37.720 align:middle line:84%
But if there is no comma and space,
it can't split it into two things.

01:19:37.720 --> 01:19:40.900 align:middle line:84%
And the fact that I have
two variables on the left,

01:19:40.900 --> 01:19:44.290 align:middle line:84%
but I'm only getting back
one thing on the right,

01:19:44.290 --> 01:19:47.030 align:middle line:84%
means that I can't run
this code quite as is.
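Both failures can be reproduced in a few lines. This is a sketch under assumptions: the exact inputs typed in the lecture are not shown in the transcript, so `"David Malan, Jr."` and the helper name `reformat` are illustrative.

```python
def reformat(name):
    """The fragile str.split version of the fix."""
    if "," in name:
        last, first = name.split(", ")
        name = f"{first} {last}"
    return name


# A suffix after a comma is mistaken for the first name
print(reformat("David Malan, Jr."))  # prints: Jr. David Malan

# No space after the comma: split returns a single item, so unpacking
# into two variables raises ValueError
try:
    reformat("Malan,David")
except ValueError as error:
    print("ValueError:", error)
```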

01:19:47.030 --> 01:19:48.467 align:middle line:90%
So it's fragile to be sure.

01:19:48.467 --> 01:19:50.800 align:middle line:84%
But wouldn't it be nice if
we could at least improve it?

01:19:50.800 --> 01:19:53.710 align:middle line:84%
For instance, we now know some
regular expressions syntax.

01:19:53.710 --> 01:19:56.920 align:middle line:84%
What if I at least wanted
to make this space optional?

01:19:56.920 --> 01:20:00.010 align:middle line:84%
Well, I could use my newfound
regular expression syntax

01:20:00.010 --> 01:20:04.330 align:middle line:84%
and put a question mark. A question
mark means zero or one of the things

01:20:04.330 --> 01:20:05.080 align:middle line:90%
to the left.

01:20:05.080 --> 01:20:06.490 align:middle line:90%
What's the thing to the left?

01:20:06.490 --> 01:20:07.850 align:middle line:90%
It's literally a space.

01:20:07.850 --> 01:20:10.760 align:middle line:84%
I don't even need parentheses
if there's just one thing there.

01:20:10.760 --> 01:20:15.040 align:middle line:84%
So that would be the start of a
pattern that says, I must have a comma,

01:20:15.040 --> 01:20:19.240 align:middle line:84%
and then I may or may not have a
space, zero or one spaces thereafter.

01:20:19.240 --> 01:20:25.810 align:middle line:84%
Unfortunately, the version of split
that's built into the str type,

01:20:25.810 --> 01:20:28.600 align:middle line:84%
as in this case, doesn't
support regular expressions.

01:20:28.600 --> 01:20:32.120 align:middle line:84%
If we want our regular expressions,
we need to go use that library here.
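For comparison only: the re library has its own split function that does accept a pattern. The lecture takes a different route with re.search, so treat this as an aside; `split_name` is a hypothetical helper name.

```python
import re


def split_name(name):
    # ", ?" means a comma followed by zero or one spaces
    return re.split(r", ?", name)


print(split_name("Malan, David"))
print(split_name("Malan,David"))
```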

01:20:32.120 --> 01:20:33.650 align:middle line:90%
So let me go ahead and do this.

01:20:33.650 --> 01:20:37.550 align:middle line:84%
Let me go in and leave this
code as is but go up to the top

01:20:37.550 --> 01:20:41.650 align:middle line:84%
now and import re to import the
library for regular expressions.

01:20:41.650 --> 01:20:46.000 align:middle line:84%
And now let me go ahead and
start changing my approach here.

01:20:46.000 --> 01:20:47.630 align:middle line:90%
I'm going to go ahead and do this.

01:20:47.630 --> 01:20:50.890 align:middle line:84%
I'm going to use the same
function called re.search,

01:20:50.890 --> 01:20:54.370 align:middle line:84%
and I'm going to search
for a pattern that I

01:20:54.370 --> 01:20:56.650 align:middle line:90%
think will be last, comma, first.

01:20:56.650 --> 01:20:59.050 align:middle line:84%
So let me use my newfound
regular expression syntax

01:20:59.050 --> 01:21:04.390 align:middle line:84%
and represent a pattern for something
like Malan, comma, space, David.

01:21:04.390 --> 01:21:05.660 align:middle line:90%
How can I do this?

01:21:05.660 --> 01:21:10.570 align:middle line:84%
Well, inside of my quotes for
re.search, I'm going to have something--

01:21:10.570 --> 01:21:11.950 align:middle line:90%
so dot +--

01:21:11.950 --> 01:21:12.610 align:middle line:90%
sorry.

01:21:12.610 --> 01:21:14.980 align:middle line:90%
I'm going to have something, so dot +.

01:21:14.980 --> 01:21:16.540 align:middle line:90%
Then I'm going to have a comma.

01:21:16.540 --> 01:21:17.890 align:middle line:90%
Then I'm going to have a space.

01:21:17.890 --> 01:21:20.440 align:middle line:90%
Then I'm going to have something dot +.

01:21:20.440 --> 01:21:23.200 align:middle line:84%
Now I'm going to preemptively
refine this a little bit.

01:21:23.200 --> 01:21:25.288 align:middle line:84%
I want this whole
pattern to start matching

01:21:25.288 --> 01:21:26.830 align:middle line:90%
at the beginning of the user's input.

01:21:26.830 --> 01:21:28.960 align:middle line:90%
So I'm going to add the ^ right away.

01:21:28.960 --> 01:21:33.070 align:middle line:84%
And I want the end of the user's input
to be matched as well, so that I'm

01:21:33.070 --> 01:21:37.720 align:middle line:84%
literally expecting any character one or
more times, then a comma then a space,

01:21:37.720 --> 01:21:40.180 align:middle line:84%
then any other character
one or more times.

01:21:40.180 --> 01:21:42.280 align:middle line:90%
And then that is it.

01:21:42.280 --> 01:21:46.430 align:middle line:84%
And I'm going to pass in
the name variable as before.
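Written out, the pattern described so far might look like the sketch below. The `$` for the end of the input and the helper name `is_last_first` are assumptions, since the transcript names only the `^` explicitly.

```python
import re


def is_last_first(name):
    # ^ and $ anchor the pattern to the start and end of the input;
    # .+ is any character one or more times
    return re.search(r"^.+, .+$", name) is not None


print(is_last_first("Malan, David"))
print(is_last_first("David Malan"))
```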

01:21:46.430 --> 01:21:50.300 align:middle line:84%
Now, when we've used
re.search in the past,

01:21:50.300 --> 01:21:52.900 align:middle line:84%
we really used it just
to answer a question.

01:21:52.900 --> 01:21:57.040 align:middle line:84%
Does the user's input match
the following pattern or not,

01:21:57.040 --> 01:21:59.140 align:middle line:90%
true or false, effectively.

01:21:59.140 --> 01:22:02.600 align:middle line:84%
But re.search is actually
more powerful than that.

01:22:02.600 --> 01:22:05.110 align:middle line:84%
You can actually get
back more information.

01:22:05.110 --> 01:22:06.430 align:middle line:90%
And you can do this.

01:22:06.430 --> 01:22:10.000 align:middle line:84%
You can specify a variable and
then an assignment operator,

01:22:10.000 --> 01:22:15.250 align:middle line:84%
and get back more precise answers to
what has been found when searched for.

01:22:15.250 --> 01:22:17.500 align:middle line:90%
But what is it you want to get back?

01:22:17.500 --> 01:22:21.260 align:middle line:84%
Well, it turns out there's this
other feature of regular expressions

01:22:21.260 --> 01:22:25.330 align:middle line:84%
which allow you to use parentheses,
not just to group things together,

01:22:25.330 --> 01:22:27.070 align:middle line:90%
but to capture them.

01:22:27.070 --> 01:22:31.750 align:middle line:84%
It turns out when you specify
parentheses in a regular expression

01:22:31.750 --> 01:22:35.140 align:middle line:84%
unbeknownst to us up until now,
everything in the parentheses

01:22:35.140 --> 01:22:41.350 align:middle line:84%
will be returned to you as a return
value from the re.search function.

01:22:41.350 --> 01:22:45.700 align:middle line:84%
It's going to allow you to extract
specific amounts of information

01:22:45.700 --> 01:22:47.530 align:middle line:90%
from the user's own input.

01:22:47.530 --> 01:22:51.730 align:middle line:84%
You can reverse this process, too,
by using the non-capturing version

01:22:51.730 --> 01:22:52.340 align:middle line:90%
as well.

01:22:52.340 --> 01:22:55.507 align:middle line:84%
You can use parentheses, and then
literally a question mark, and a colon,

01:22:55.507 --> 01:22:56.590 align:middle line:90%
and then some other stuff.

01:22:56.590 --> 01:22:58.400 align:middle line:84%
And that will say, don't
bother capturing this.

01:22:58.400 --> 01:22:59.567 align:middle line:90%
I just want to group things.

01:22:59.567 --> 01:23:02.850 align:middle line:84%
But for now, we're going to use
just the parentheses themselves.
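A small sketch of the difference between the two kinds of parentheses. The variable names and the sample input are assumptions for illustration.

```python
import re

# Plain parentheses group and capture
capturing = re.search(r"^(.+), (.+)$", "Malan, David")

# (?:...) groups without capturing
non_capturing = re.search(r"^(?:.+), (?:.+)$", "Malan, David")

print(capturing.groups())      # the two captured pieces
print(non_capturing.groups())  # an empty tuple: nothing was captured
```

Both patterns match the same input; only the first hands back the pieces.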

01:23:02.850 --> 01:23:04.200 align:middle line:90%
So how am I going to do this?

01:23:04.200 --> 01:23:08.780 align:middle line:84%
Well, if I want to get back the
user's last name and first name,

01:23:08.780 --> 01:23:16.190 align:middle line:84%
I think what I want to capture is
the dot + here and the dot + here.

01:23:16.190 --> 01:23:19.190 align:middle line:84%
So I've deliberately
surrounded in parentheses

01:23:19.190 --> 01:23:22.160 align:middle line:84%
the dot + both to the left
and the right of the comma,

01:23:22.160 --> 01:23:24.660 align:middle line:84%
not because I'm grouping
them together per se--

01:23:24.660 --> 01:23:28.190 align:middle line:84%
I'm not adding a question mark, I'm
not adding another + or a *--

01:23:28.190 --> 01:23:32.420 align:middle line:84%
I'm using parentheses now
for capturing purposes.

01:23:32.420 --> 01:23:33.200 align:middle line:90%
Why?

01:23:33.200 --> 01:23:34.820 align:middle line:90%
Well, I'm going to do this next.

01:23:34.820 --> 01:23:38.690 align:middle line:84%
I'm going to still ask a Boolean
question like, "if there are matches,

01:23:38.690 --> 01:23:40.320 align:middle line:90%
then do this."

01:23:40.320 --> 01:23:44.360 align:middle line:84%
So if matches is not
effectively false, like None,

01:23:44.360 --> 01:23:47.720 align:middle line:84%
I do expect I've gotten
back some matches.

01:23:47.720 --> 01:23:49.400 align:middle line:90%
And watch what I can do now.

01:23:49.400 --> 01:23:54.170 align:middle line:84%
I can do last, comma, first
equals matches.groups

01:23:54.170 --> 01:23:56.930 align:middle line:84%
and get back all of
the groups of matches.

01:23:56.930 --> 01:24:00.020 align:middle line:84%
Then go ahead and update name just
like before with a format string

01:24:00.020 --> 01:24:03.770 align:middle line:84%
and do first and then
last in curly braces

01:24:03.770 --> 01:24:06.770 align:middle line:84%
as well, and then at the very
bottom, just like before, print out,

01:24:06.770 --> 01:24:09.830 align:middle line:90%
for instance, "hello," comma, "name."
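Put together, the code described here might look roughly like this. The wrapper function `reformat` is an assumption so the logic can be tried on several inputs; the lecture's program reads the name from the user instead.

```python
import re


def reformat(name):
    # Each pair of parentheses captures one piece of the input
    matches = re.search(r"^(.+), (.+)$", name)
    if matches:
        # groups() returns every captured piece, in left-to-right order
        last, first = matches.groups()
        name = f"{first} {last}"
    return name


print(f"hello, {reformat('Malan, David')}")
```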

01:24:09.830 --> 01:24:13.970 align:middle line:84%
So the new code now is
everything highlighted here.

01:24:13.970 --> 01:24:19.700 align:middle line:84%
I'm using re.search to search for
whether the user typed their name

01:24:19.700 --> 01:24:21.620 align:middle line:90%
in last, comma, first format.

01:24:21.620 --> 01:24:27.440 align:middle line:84%
But I am more powerfully using re.search
to capture some of the user's input.

01:24:27.440 --> 01:24:28.850 align:middle line:90%
What's going to get captured?

01:24:28.850 --> 01:24:31.400 align:middle line:84%
Anything I surrounded
in parentheses will

01:24:31.400 --> 01:24:34.250 align:middle line:90%
be returned to me as return values.

01:24:34.250 --> 01:24:36.650 align:middle line:90%
How do you get at those return values?

01:24:36.650 --> 01:24:40.490 align:middle line:84%
You ask the variable to which you
assign them for all of the groups,

01:24:40.490 --> 01:24:44.250 align:middle line:84%
all of the groups of
parentheses that were captured.

01:24:44.250 --> 01:24:46.020 align:middle line:90%
So let me go ahead and do this.

01:24:46.020 --> 01:24:49.970 align:middle line:84%
Let me go ahead now and run
python of format.py, Enter.

01:24:49.970 --> 01:24:51.950 align:middle line:90%
And I'm going to type my name as usual.

01:24:51.950 --> 01:24:56.900 align:middle line:84%
In this case, nothing happens
with this if condition.

01:24:56.900 --> 01:24:57.500 align:middle line:90%
Why?

01:24:57.500 --> 01:25:03.270 align:middle line:84%
Because I did not type a comma, and
so this search does not find a comma,

01:25:03.270 --> 01:25:04.632 align:middle line:90%
so there are no matches.

01:25:04.632 --> 01:25:06.590 align:middle line:84%
So we immediately just
print out "hello, name."

01:25:06.590 --> 01:25:08.370 align:middle line:90%
Nothing interesting or new there.

01:25:08.370 --> 01:25:12.920 align:middle line:84%
But if I now go ahead, and clear my
screen, and run python of format.py,

01:25:12.920 --> 01:25:18.740 align:middle line:84%
and do Malan, comma, space, David,
Enter, we've reformatted my name.

01:25:18.740 --> 01:25:19.940 align:middle line:90%
Well, how did this work?

01:25:19.940 --> 01:25:22.100 align:middle line:90%
Let me be a little more explicit now.

01:25:22.100 --> 01:25:24.560 align:middle line:84%
It turns out I don't have
to just say matches.groups.

01:25:24.560 --> 01:25:28.020 align:middle line:84%
I can get specific
groups back that I want.

01:25:28.020 --> 01:25:30.290 align:middle line:84%
So let me change my
code a little bit more.

01:25:30.290 --> 01:25:33.470 align:middle line:90%
Let me go ahead now and just say this.

01:25:33.470 --> 01:25:36.620 align:middle line:90%
Let's update name to--

01:25:36.620 --> 01:25:37.980 align:middle line:90%
actually, let's do this.

01:25:37.980 --> 01:25:42.530 align:middle line:84%
Let's say that the last name
is going to be in the matches

01:25:42.530 --> 01:25:44.330 align:middle line:90%
but specifically group 1.

01:25:44.330 --> 01:25:48.020 align:middle line:84%
The first name is going to be in the
matches but specifically group 2.

01:25:48.020 --> 01:25:49.100 align:middle line:90%
Why 1 and 2?

01:25:49.100 --> 01:25:52.490 align:middle line:84%
Because this is the first set of
parentheses to the left of the comma.

01:25:52.490 --> 01:25:55.520 align:middle line:84%
This is the second set of parentheses
to the right of the comma.

01:25:55.520 --> 01:25:58.700 align:middle line:84%
And based on the input, this
would be the user's last name

01:25:58.700 --> 01:26:00.140 align:middle line:90%
in this scenario, Malan.

01:26:00.140 --> 01:26:03.560 align:middle line:84%
This would be the user's first
name, David, in this scenario.

01:26:03.560 --> 01:26:07.340 align:middle line:84%
That's why I'm using
group 1 for the last name

01:26:07.340 --> 01:26:09.720 align:middle line:90%
and group 2 for the first name.

01:26:09.720 --> 01:26:16.100 align:middle line:84%
And now I'm going to go ahead and
say name equals f-string, again, first

01:26:16.100 --> 01:26:18.980 align:middle line:90%
and then last, done.

01:26:18.980 --> 01:26:23.340 align:middle line:84%
And let me refine this one last
step before we take questions.

01:26:23.340 --> 01:26:26.090 align:middle line:84%
I don't really need these variables
if I'm immediately using them.

01:26:26.090 --> 01:26:28.423 align:middle line:84%
Let's just go ahead and tighten
this up further as we've

01:26:28.423 --> 01:26:29.990 align:middle line:90%
done in the past for design's sake.

01:26:29.990 --> 01:26:32.722 align:middle line:84%
If I want to make the
name the concatenation

01:26:32.722 --> 01:26:34.430 align:middle line:84%
of the person's first
name and last name,

01:26:34.430 --> 01:26:37.970 align:middle line:84%
let's just do this.
matches.group 2 first,

01:26:37.970 --> 01:26:43.400 align:middle line:90%
plus a space, plus matches.group 1.

01:26:43.400 --> 01:26:46.910 align:middle line:84%
So it's just up to me from
left to right, this is group 1,

01:26:46.910 --> 01:26:47.630 align:middle line:90%
this is group 2.

01:26:47.630 --> 01:26:51.000 align:middle line:90%
So group 1 is last, group 2 is first.

01:26:51.000 --> 01:26:54.860 align:middle line:84%
So if I want to flip them around
and update the value of name,

01:26:54.860 --> 01:27:00.290 align:middle line:84%
I can explicitly get group 2 first,
concatenate using +, a single space,

01:27:00.290 --> 01:27:03.540 align:middle line:90%
and then concatenate on group 1.
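The tightened-up version could be sketched as follows, again with `reformat` as an assumed wrapper name.

```python
import re


def reformat(name):
    matches = re.search(r"^(.+), (.+)$", name)
    if matches:
        # Group 1 is the last name and group 2 the first name,
        # so flip them while concatenating
        name = matches.group(2) + " " + matches.group(1)
    return name


print(f"hello, {reformat('Malan, David')}")
```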

01:27:03.540 --> 01:27:04.170 align:middle line:90%
All right.

01:27:04.170 --> 01:27:05.280 align:middle line:90%
That was a lot.

01:27:05.280 --> 01:27:07.620 align:middle line:84%
Let me pause to see if
there are questions.

01:27:07.620 --> 01:27:11.670 align:middle line:84%
The key difference here is we're still
using re.search the exact same way,

01:27:11.670 --> 01:27:15.090 align:middle line:84%
but now I'm using its return
value, not just to answer

01:27:15.090 --> 01:27:17.400 align:middle line:84%
a question true or
false, but to actually

01:27:17.400 --> 01:27:21.750 align:middle line:84%
get back specific matches,
anything I captured, so to speak,

01:27:21.750 --> 01:27:23.190 align:middle line:90%
with parentheses.

01:27:23.190 --> 01:27:26.270 align:middle line:84%
AUDIENCE: Why is it here we're
using 1 and 2 instead of 0 and 1

01:27:26.270 --> 01:27:27.270 align:middle line:90%
for capturing the first?

01:27:27.270 --> 01:27:29.010 align:middle line:90%
DAVID MALAN: Really good question.

01:27:29.010 --> 01:27:30.060 align:middle line:90%
A good observation.

01:27:30.060 --> 01:27:32.070 align:middle line:84%
In almost every other
context, we've started

01:27:32.070 --> 01:27:35.250 align:middle line:90%
counting at 0 and 1 instead of 1 and 2.

01:27:35.250 --> 01:27:38.190 align:middle line:84%
It turns out there's
something else in location 0

01:27:38.190 --> 01:27:41.530 align:middle line:84%
when it comes back from re.search:
the entire string that matched.

01:27:41.530 --> 01:27:45.000 align:middle line:84%
So according to the documentation
of this function only,

01:27:45.000 --> 01:27:49.110 align:middle line:84%
1 is the first set of parentheses,
and 2 is the second set,

01:27:49.110 --> 01:27:50.460 align:middle line:90%
and onward from there.

01:27:50.460 --> 01:27:52.540 align:middle line:90%
Just a different convention here.
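A quick way to see the numbering for yourself, with a sample input chosen for illustration:

```python
import re

matches = re.search(r"^(.+), (.+)$", "Malan, David")

print(matches.group(0))  # the entire text that matched the pattern
print(matches.group(1))  # the first set of parentheses
print(matches.group(2))  # the second set of parentheses
```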

01:27:52.540 --> 01:27:53.580 align:middle line:90%
Other questions?

01:27:53.580 --> 01:27:59.820 align:middle line:84%
AUDIENCE: What if we write nothing,
like whitespace, comma, whitespace?

01:27:59.820 --> 01:28:03.317 align:middle line:90%
How do we check truth of condition?

01:28:03.317 --> 01:28:05.400 align:middle line:84%
DAVID MALAN: Before I
answer directly, let me just

01:28:05.400 --> 01:28:07.733 align:middle line:84%
run this and make sure I've
not broken anything further.

01:28:07.733 --> 01:28:09.360 align:middle line:90%
Let me run python of format.py.

01:28:09.360 --> 01:28:12.060 align:middle line:84%
Let me type in David,
space, Malan, the right way.

01:28:12.060 --> 01:28:13.200 align:middle line:90%
Let me run it once more.

01:28:13.200 --> 01:28:16.650 align:middle line:84%
Let me type in Malan, comma, David,
the wrong way that we're fixing.

01:28:16.650 --> 01:28:17.850 align:middle line:90%
And we're still good.

01:28:17.850 --> 01:28:19.410 align:middle line:90%
But I think it will still break.

01:28:19.410 --> 01:28:23.610 align:middle line:84%
Let me run it a third time with
Malan, comma, David with no space.

01:28:23.610 --> 01:28:26.190 align:middle line:90%
And now it's still broken.

01:28:26.190 --> 01:28:26.790 align:middle line:90%
Why?

01:28:26.790 --> 01:28:30.930 align:middle line:84%
Because I'm still
looking for comma space.

01:28:30.930 --> 01:28:32.220 align:middle line:90%
Now, how can I fix that?

01:28:32.220 --> 01:28:35.070 align:middle line:84%
One way I could do that is to add
a question mark here, which again,

01:28:35.070 --> 01:28:37.510 align:middle line:90%
is zero or one of the thing before.

01:28:37.510 --> 01:28:40.950 align:middle line:84%
So if I have a space and then a
question mark literally, no need for any

01:28:40.950 --> 01:28:46.290 align:middle line:84%
parentheses, then I can literally
tolerate both Malan, comma, space,

01:28:46.290 --> 01:28:48.610 align:middle line:90%
David or Malan, comma, David.

01:28:48.610 --> 01:28:49.680 align:middle line:90%
So let's try again.

01:28:49.680 --> 01:28:51.120 align:middle line:90%
Before, this did not work.

01:28:51.120 --> 01:28:53.310 align:middle line:84%
Let's do Malan, comma,
David with no space.

01:28:53.310 --> 01:28:55.990 align:middle line:90%
Now it does actually work.
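A sketch of the pattern with the optional space, assuming the same hypothetical `reformat` wrapper as before. The last call shows what is left over when more than one space follows the comma.

```python
import re


def reformat(name):
    # " ?" tolerates zero or one spaces after the comma
    matches = re.search(r"^(.+), ?(.+)$", name)
    if matches:
        name = matches.group(2) + " " + matches.group(1)
    return name


print(reformat("Malan,David"))
print(reformat("Malan, David"))
# With three spaces, only one is absorbed by " ?", so the other two
# end up at the front of the captured first name
print(repr(reformat("Malan,   David")))
```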

01:28:55.990 --> 01:28:58.740 align:middle line:84%
So we can tolerate different
amounts of whitespace

01:28:58.740 --> 01:29:01.890 align:middle line:84%
if I am a little more
precise with my formula.

01:29:01.890 --> 01:29:03.420 align:middle line:90%
Let me go ahead and try once more.

01:29:03.420 --> 01:29:07.260 align:middle line:84%
Let me very weirdly but possibly hit
the space bar a few too many times

01:29:07.260 --> 01:29:08.850 align:middle line:90%
so now they're really separated.

01:29:08.850 --> 01:29:13.020 align:middle line:84%
This, again, is not going to work
quite right, because it's going

01:29:13.020 --> 01:29:15.160 align:middle line:90%
to consume all of that whitespace.

01:29:15.160 --> 01:29:18.420 align:middle line:84%
So now I might want to
strip, left and right, any

01:29:18.420 --> 01:29:21.720 align:middle line:84%
of the leading whitespace on the
result. Or what I could do here

01:29:21.720 --> 01:29:22.930 align:middle line:90%
is say this.

01:29:22.930 --> 01:29:29.670 align:middle line:84%
Instead of zero or one, I
could use a * here, so space *.

01:29:29.670 --> 01:29:33.000 align:middle line:84%
And now if I run this once more with
Malan, comma, space, space, space,

01:29:33.000 --> 01:29:35.920 align:middle line:84%
David, Enter, now we've
cleaned up things further.
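With the `*` in place, the sketch might become the following. The name `reformat` is still an assumed wrapper.

```python
import re


def reformat(name):
    # " *" tolerates zero or more spaces after the comma
    matches = re.search(r"^(.+), *(.+)$", name)
    if matches:
        name = matches.group(2) + " " + matches.group(1)
    return name


for raw in ["Malan,David", "Malan, David", "Malan,   David"]:
    print(reformat(raw))
```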

01:29:35.920 --> 01:29:39.510 align:middle line:84%
So you can imagine, depending on
how messy the data is that you're

01:29:39.510 --> 01:29:41.550 align:middle line:84%
cleaning up, your regular
expressions might need

01:29:41.550 --> 01:29:43.500 align:middle line:90%
to get more and more sophisticated.

01:29:43.500 --> 01:29:46.830 align:middle line:84%
It really depends on just how many
problems we want to solve at once.

01:29:46.830 --> 01:29:51.900 align:middle line:84%
Well, allow me to propose that we forge
ahead further just to clean this up

01:29:51.900 --> 01:29:53.940 align:middle line:84%
even more so, using a
feature that's actually

01:29:53.940 --> 01:29:56.430 align:middle line:90%
relatively new to Python itself.

01:29:56.430 --> 01:29:59.220 align:middle line:84%
It is very common when
using regular expressions

01:29:59.220 --> 01:30:03.210 align:middle line:84%
to do exactly what I've done here--
to call a function like re.search

01:30:03.210 --> 01:30:07.300 align:middle line:84%
with capturing parentheses inside,
such that you get back a return

01:30:07.300 --> 01:30:10.050 align:middle line:84%
value that I'm calling matches--
you could call it something else,

01:30:10.050 --> 01:30:12.090 align:middle line:90%
but I'm calling it by default matches.

01:30:12.090 --> 01:30:15.690 align:middle line:84%
And then notice on the next
line, I'm saying "if matches."

01:30:15.690 --> 01:30:19.080 align:middle line:84%
Wouldn't it be nice if I could just
tighten things up further and do these

01:30:19.080 --> 01:30:20.700 align:middle line:90%
all on the same line?

01:30:20.700 --> 01:30:23.070 align:middle line:90%
Well, you can sort of.

01:30:23.070 --> 01:30:24.850 align:middle line:90%
Let me go ahead and do this.

01:30:24.850 --> 01:30:26.340 align:middle line:90%
Let me get rid of this if.

01:30:26.340 --> 01:30:28.500 align:middle line:84%
And let me just try to
say something like this.

01:30:28.500 --> 01:30:32.370 align:middle line:84%
If matches equals
re.search and then colon--

01:30:32.370 --> 01:30:39.090 align:middle line:84%
so combining my if condition into
just one line instead of those two.

01:30:39.090 --> 01:30:43.455 align:middle line:84%
In C, or C++, or Java, you would
actually do something like this,

01:30:43.455 --> 01:30:45.330 align:middle line:84%
surrounding the whole
thing with parentheses,

01:30:45.330 --> 01:30:47.550 align:middle line:84%
sometimes double sets to
suppress any warnings,

01:30:47.550 --> 01:30:49.980 align:middle line:90%
if you want to do two things at once.

01:30:49.980 --> 01:30:55.530 align:middle line:84%
If you want to not only assign
the return value of re.search

01:30:55.530 --> 01:30:58.080 align:middle line:84%
to a variable called
matches, but you want

01:30:58.080 --> 01:31:03.408 align:middle line:84%
to subsequently ask a Boolean question,
is this effectively true or false.

01:31:03.408 --> 01:31:04.950 align:middle line:90%
That's what I was doing a moment ago.

01:31:04.950 --> 01:31:06.060 align:middle line:90%
Let me undo this.

01:31:06.060 --> 01:31:08.430 align:middle line:84%
A moment ago, I was getting
back the return value

01:31:08.430 --> 01:31:12.090 align:middle line:84%
and assigning it to matches, and
then I was asking the question.

01:31:12.090 --> 01:31:16.530 align:middle line:84%
Well, it turns out this need to have
two lines of code presumably rubbed

01:31:16.530 --> 01:31:18.840 align:middle line:90%
people the wrong way for too long in Python.

01:31:18.840 --> 01:31:22.170 align:middle line:84%
And so you can now combine these
two kinds of lines into one.

01:31:22.170 --> 01:31:24.450 align:middle line:90%
But you need a new operator.

01:31:24.450 --> 01:31:27.720 align:middle line:84%
You cannot just say, "if
matches equals re.search"

01:31:27.720 --> 01:31:29.580 align:middle line:90%
and then a colon at the end.

01:31:29.580 --> 01:31:32.170 align:middle line:90%
You instead need to do this.

01:31:32.170 --> 01:31:38.130 align:middle line:84%
You need to do colon equals if and
only if you want to assign something

01:31:38.130 --> 01:31:42.390 align:middle line:84%
from right to left and you
want to ask an if or an elif

01:31:42.390 --> 01:31:44.820 align:middle line:90%
question on the same line.

01:31:44.820 --> 01:31:48.870 align:middle line:84%
This is affectionately known, as you can
see here, as the walrus operator.

01:31:48.870 --> 01:31:51.480 align:middle line:90%
And it's new to Python in recent years.

01:31:51.480 --> 01:31:56.280 align:middle line:84%
And it both allows you to assign a
value as I'm doing from right to left,

01:31:56.280 --> 01:32:00.180 align:middle line:84%
and ask a Boolean question
about it, like I'm

01:32:00.180 --> 01:32:02.960 align:middle line:90%
doing with the if or equivalently elif.
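A minimal sketch of the combined line, assuming Python 3.8 or later (the release that added this operator) and the same hypothetical `reformat` wrapper.

```python
import re


def reformat(name):
    # := assigns the result of re.search to matches and lets the if
    # test that same value, all on one line
    if matches := re.search(r"^(.+), *(.+)$", name):
        name = matches.group(2) + " " + matches.group(1)
    return name


print(f"hello, {reformat('Malan, David')}")
```

A plain `=` in that position is a syntax error in Python, which is why the new operator is needed.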

01:32:02.960 --> 01:32:06.650 align:middle line:84%
Does anyone know why this is
called the walrus operator?

01:32:06.650 --> 01:32:09.920 align:middle line:84%
If you kind of look at
it like this, perhaps,

01:32:09.920 --> 01:32:14.040 align:middle line:84%
if you're familiar with walruses, it
kind of sort of looks like a walrus.

01:32:14.040 --> 01:32:17.720 align:middle line:84%
So a minor detail but a relatively new
feature of Python that honestly, you'll

01:32:17.720 --> 01:32:21.170 align:middle line:84%
probably continue to see online, and
in source code, and in textbooks,

01:32:21.170 --> 01:32:24.300 align:middle line:84%
and so forth, increasingly
so now that it does exist.

01:32:24.300 --> 01:32:25.910 align:middle line:90%
It does not change the logic at all.

01:32:25.910 --> 01:32:29.660 align:middle line:84%
If I run python of format.py and
type Malan, comma, space, David,

01:32:29.660 --> 01:32:33.750 align:middle line:84%
it still fixes things, but it's
tightened up my code just a bit more.

01:32:33.750 --> 01:32:34.250 align:middle line:90%
All right.

01:32:34.250 --> 01:32:37.010 align:middle line:84%
Let's go ahead and look
at one final problem

01:32:37.010 --> 01:32:40.470 align:middle line:84%
to solve, that of extracting
information now as well.

01:32:40.470 --> 01:32:43.460 align:middle line:84%
So at this point, we've now
validated the user's input

01:32:43.460 --> 01:32:46.160 align:middle line:84%
by checking whether or not
it meets a certain pattern.

01:32:46.160 --> 01:32:49.100 align:middle line:84%
We've cleaned up the
user's input by checking

01:32:49.100 --> 01:32:51.470 align:middle line:84%
against a pattern, whether
it matches or not, and if it

01:32:51.470 --> 01:32:54.350 align:middle line:84%
does match, we kind of reorganize
some of the user's information

01:32:54.350 --> 01:32:57.800 align:middle line:84%
so we can clean up their input and
standardize the format in which we're

01:32:57.800 --> 01:32:59.540 align:middle line:90%
storing or printing it, in this case.

01:32:59.540 --> 01:33:03.350 align:middle line:84%
Let's do one final example where
we're very specifically extracting

01:33:03.350 --> 01:33:06.440 align:middle line:84%
information in order to
answer some question.

01:33:06.440 --> 01:33:07.830 align:middle line:90%
So let me propose this.

01:33:07.830 --> 01:33:12.650 align:middle line:84%
Let me go ahead and close format.py and
create a new file called twitter.py,

01:33:12.650 --> 01:33:17.690 align:middle line:84%
the goal of which is to prompt users
for the URL of their Twitter profile

01:33:17.690 --> 01:33:23.562 align:middle line:84%
and extract from it, infer from that
URL, what is the user's username.

01:33:23.562 --> 01:33:25.020 align:middle line:90%
Now, why might you want to do this?

01:33:25.020 --> 01:33:28.228 align:middle line:84%
Well, one, you might want users to be
able to just very easily copy and paste

01:33:28.228 --> 01:33:32.330 align:middle line:84%
the URL from their own Twitter
profile into your form, into your app,

01:33:32.330 --> 01:33:36.140 align:middle line:84%
so that you can figure out
what their username is.

01:33:36.140 --> 01:33:40.430 align:middle line:84%
Or you might have a form that asks
the user for their Twitter username,

01:33:40.430 --> 01:33:43.400 align:middle line:84%
and because people aren't necessarily
paying very close attention,

01:33:43.400 --> 01:33:45.530 align:middle line:90%
some people type their username.

01:33:45.530 --> 01:33:49.340 align:middle line:84%
Some people type their whole URL
or something else altogether.

01:33:49.340 --> 01:33:51.350 align:middle line:84%
It would be nice now
that you're a programmer

01:33:51.350 --> 01:33:53.780 align:middle line:84%
to just be more tolerant
of different types of input

01:33:53.780 --> 01:33:58.100 align:middle line:84%
and just take on the burden of
canonicalizing, standardizing the data,

01:33:58.100 --> 01:34:00.140 align:middle line:90%
but being flexible with the users.

01:34:00.140 --> 01:34:03.500 align:middle line:84%
It's arguably a better user experience
if you just let me copy-paste

01:34:03.500 --> 01:34:05.660 align:middle line:90%
or type in what I want, you clean it up.

01:34:05.660 --> 01:34:07.550 align:middle line:90%
You're the programmer not me.

01:34:07.550 --> 01:34:09.920 align:middle line:90%
Lends for a better experience, perhaps.

01:34:09.920 --> 01:34:12.620 align:middle line:84%
Well, let me go ahead and
do this with twitter.py.

01:34:12.620 --> 01:34:17.120 align:middle line:84%
Let me first go ahead and prompt the
user here for a value for a variable

01:34:17.120 --> 01:34:21.702 align:middle line:84%
that I'll call url, and just ask them to
input the URL of their Twitter profile.

01:34:21.702 --> 01:34:23.660 align:middle line:84%
I'm going to go ahead
and strip off any leading

01:34:23.660 --> 01:34:26.810 align:middle line:84%
or trailing whitespace, just in case
users accidentally hit the spacebar.

01:34:26.810 --> 01:34:29.940 align:middle line:84%
That's literally the least
I can do quite easily.

01:34:29.940 --> 01:34:32.100 align:middle line:90%
But now let's go ahead and do this.

01:34:32.100 --> 01:34:37.185 align:middle line:84%
Suppose that the user's
address is the following.

01:34:37.185 --> 01:34:38.810 align:middle line:90%
Let me print out what did they type in.

01:34:38.810 --> 01:34:41.190 align:middle line:84%
And let me clear my screen
and run python of twitter.py.

01:34:41.190 --> 01:34:43.190 align:middle line:84%
I'm going to go ahead and
type in, for instance,

01:34:43.190 --> 01:34:50.240 align:middle line:84%
https://twitter.com/davidjmalan, which
happens to be my own Twitter username.

01:34:50.240 --> 01:34:53.090 align:middle line:84%
For now, we're just going to
print it back onto the screen just

01:34:53.090 --> 01:34:54.640 align:middle line:90%
to make sure I've not messed up yet.

01:34:54.640 --> 01:34:55.140 align:middle line:90%
OK.

01:34:55.140 --> 01:34:57.260 align:middle line:84%
So I've printed back
out the exact same URL.

01:34:57.260 --> 01:35:01.310 align:middle line:84%
But the goal at hand is to
extract the username only.

01:35:01.310 --> 01:35:05.060 align:middle line:84%
Now, let me just ask, perhaps,
a straightforward question.

01:35:05.060 --> 01:35:09.830 align:middle line:84%
Logically, what do I need to do
to get at the user's username?

01:35:09.830 --> 01:35:13.880 align:middle line:84%
AUDIENCE: Well, we just ignore
what's before the username

01:35:13.880 --> 01:35:16.065 align:middle line:90%
and then just extract the username?

01:35:16.065 --> 01:35:16.940 align:middle line:90%
DAVID MALAN: Perfect.

01:35:16.940 --> 01:35:18.380 align:middle line:90%
Yeah, I mean, it is as simple as that.

01:35:18.380 --> 01:35:20.720 align:middle line:84%
If you know the username is
at the end, well, let's just

01:35:20.720 --> 01:35:22.920 align:middle line:84%
somehow ignore everything
to the beginning.

01:35:22.920 --> 01:35:24.170 align:middle line:90%
Well, what's at the beginning?

01:35:24.170 --> 01:35:25.130 align:middle line:90%
Well, it's a URL.

01:35:25.130 --> 01:35:30.890 align:middle line:84%
So we're probably going to need to
ignore an HTTPS, a ://, a twitter.com,

01:35:30.890 --> 01:35:31.910 align:middle line:90%
and a /.

01:35:31.910 --> 01:35:33.840 align:middle line:84%
So we just want to
throw all of that away.

01:35:33.840 --> 01:35:34.340 align:middle line:90%
Why?

01:35:34.340 --> 01:35:37.400 align:middle line:84%
Because if it's a URL, we
know by how Twitter works

01:35:37.400 --> 01:35:39.240 align:middle line:90%
that the username comes at the end.

01:35:39.240 --> 01:35:43.418 align:middle line:84%
So let's use that very simple idea
to get at the information we want.

01:35:43.418 --> 01:35:45.210 align:middle line:84%
I'm going to try this
a few different ways.

01:35:45.210 --> 01:35:46.620 align:middle line:90%
Let me go back into my program here.

01:35:46.620 --> 01:35:49.820 align:middle line:84%
And instead of just printing it out,
which was just to see what's going on,

01:35:49.820 --> 01:35:50.880 align:middle line:90%
let me do this.

01:35:50.880 --> 01:35:53.180 align:middle line:84%
Let me create a new
variable called username.

01:35:53.180 --> 01:35:56.810 align:middle line:90%
And let me call url.replace.

01:35:56.810 --> 01:36:01.340 align:middle line:84%
It turns out that if URL is
a string or a str in Python,

01:36:01.340 --> 01:36:05.840 align:middle line:84%
it, again, comes with multiple
methods, like strip, and split,

01:36:05.840 --> 01:36:08.750 align:middle line:84%
and others as well, one of
which is called replace.

01:36:08.750 --> 01:36:10.400 align:middle line:90%
And replace will do just that.

01:36:10.400 --> 01:36:14.360 align:middle line:84%
You pass it two arguments, the first of
which is, what do you want to replace?

01:36:14.360 --> 01:36:17.640 align:middle line:84%
The second argument is, what
do you want to replace it with?

01:36:17.640 --> 01:36:19.940 align:middle line:84%
So if I want to get rid
of, as I've proposed,

01:36:19.940 --> 01:36:21.740 align:middle line:84%
really just everything
before the username,

01:36:21.740 --> 01:36:26.090 align:middle line:84%
that is, the Twitter URL or the
beginning thereof, let's just say this.

01:36:26.090 --> 01:36:31.520 align:middle line:84%
Go ahead and replace
"https://twitter.com/",

01:36:31.520 --> 01:36:34.340 align:middle line:84%
close quote, that's
what I want to replace.

01:36:34.340 --> 01:36:37.160 align:middle line:84%
And comma, second argument, what
do you want to replace it with?

01:36:37.160 --> 01:36:37.880 align:middle line:90%
Nothing.

01:36:37.880 --> 01:36:40.100 align:middle line:84%
So I'm literally going
to pass in quote unquote

01:36:40.100 --> 01:36:42.190 align:middle line:90%
to effectively do a find and replace.

01:36:42.190 --> 01:36:44.690 align:middle line:84%
That's what the replace method
does, just like you can do it

01:36:44.690 --> 01:36:46.100 align:middle line:90%
in Microsoft Word or Google Docs.

01:36:46.100 --> 01:36:49.280 align:middle line:84%
This is the programmer's way
of doing find and replace.

01:36:49.280 --> 01:36:52.940 align:middle line:84%
Now let me go ahead and
print out just the username.

01:36:52.940 --> 01:36:54.780 align:middle line:90%
So I'll use an f-string like this.

01:36:54.780 --> 01:36:57.590 align:middle line:84%
I'll say username, colon,
and then in curly braces,

01:36:57.590 --> 01:36:59.700 align:middle line:90%
username, just to format it nicely.
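
NOTE
Editor's sketch of the str.replace approach just described. The lecture reads the URL with
input(); wrapping the logic in a function named username_from is an assumption made here
so the example can run without a prompt.
```python
def username_from(url):
    # Replace the literal prefix with the empty string, wherever it occurs in the text.
    return url.strip().replace("https://twitter.com/", "")
print(f"Username: {username_from('https://twitter.com/davidjmalan')}")
```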

01:36:59.700 --> 01:37:00.200 align:middle line:90%
All right.

01:37:00.200 --> 01:37:04.410 align:middle line:84%
Let me go ahead and clear my screen and
run python of twitter.py, Enter, URL.

01:37:04.410 --> 01:37:12.580 align:middle line:84%
Here we go.
https://twitter.com/davidjmalan, Enter.

01:37:12.580 --> 01:37:13.300 align:middle line:90%
OK.

01:37:13.300 --> 01:37:15.040 align:middle line:90%
Now we've made some progress.

01:37:15.040 --> 01:37:17.360 align:middle line:90%
Done for the day, right?

01:37:17.360 --> 01:37:19.580 align:middle line:90%
Well, what is suboptimal about this?

01:37:19.580 --> 01:37:24.150 align:middle line:84%
Can anyone critique or
find fault with my program?

01:37:24.150 --> 01:37:27.950 align:middle line:84%
It is working now, but
it's a little fragile.

01:37:27.950 --> 01:37:31.880 align:middle line:84%
I bet we could contrive some scenarios
where I think it works but it doesn't.

01:37:31.880 --> 01:37:33.890 align:middle line:84%
AUDIENCE: Well, I have
a few ideas, actually.

01:37:33.890 --> 01:37:39.980 align:middle line:84%
Well, first of all, if we don't
specify HTTPS, it will be broken.

01:37:39.980 --> 01:37:44.760 align:middle line:84%
Secondly, if we have a slash at
the end, it also will be broken.

01:37:44.760 --> 01:37:48.320 align:middle line:84%
If we have a question mark or
something after question mark,

01:37:48.320 --> 01:37:49.590 align:middle line:90%
it also won't work.

01:37:49.590 --> 01:37:51.160 align:middle line:90%
So a lot of scenarios, actually.

01:37:51.160 --> 01:37:52.160 align:middle line:90%
DAVID MALAN: Oh, my god.

01:37:52.160 --> 01:37:52.993 align:middle line:90%
I mean, here we are.

01:37:52.993 --> 01:37:54.650 align:middle line:90%
I was pretending to think I was done.

01:37:54.650 --> 01:37:57.920 align:middle line:84%
But my god, like, Alex gave us a
whole laundry list of problems.

01:37:57.920 --> 01:38:01.700 align:middle line:84%
And just to recap, then, what
if it's not HTTPS, it's HTTP?

01:38:01.700 --> 01:38:03.590 align:middle line:84%
Slightly less secure,
but I should still be

01:38:03.590 --> 01:38:05.713 align:middle line:90%
able to tolerate that programmatically.

01:38:05.713 --> 01:38:07.130 align:middle line:90%
What if the protocol is not there?

01:38:07.130 --> 01:38:09.740 align:middle line:84%
What if the user just typed
twitter.com/davidjmalan?

01:38:09.740 --> 01:38:12.680 align:middle line:84%
It would be nice to tolerate
that rather than show an error

01:38:12.680 --> 01:38:14.150 align:middle line:90%
and make me type in the protocol.

01:38:14.150 --> 01:38:14.660 align:middle line:90%
Why?

01:38:14.660 --> 01:38:16.050 align:middle line:90%
It's not good user experience.

01:38:16.050 --> 01:38:20.030 align:middle line:84%
What if it had a slash at the end
of the username, or a question mark?

01:38:20.030 --> 01:38:22.500 align:middle line:84%
If you think about URLs
you've seen on the web,

01:38:22.500 --> 01:38:24.920 align:middle line:84%
there's very commonly more
information, especially

01:38:24.920 --> 01:38:26.540 align:middle line:90%
if it's been shared on social media.

01:38:26.540 --> 01:38:28.640 align:middle line:84%
There might be HTTP
parameters, so to speak,

01:38:28.640 --> 01:38:30.230 align:middle line:90%
just stuff there that we don't want.

01:38:30.230 --> 01:38:34.880 align:middle line:84%
There could be a www.twitter.com,
which I'm also not expecting but does

01:38:34.880 --> 01:38:37.360 align:middle line:90%
work if you go to that URL, too.

01:38:37.360 --> 01:38:39.540 align:middle line:84%
So there's just so many
things that can go wrong.
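
NOTE
Editor's sketch of why the str.replace version is fragile, using the cases raised above.
The sample inputs, including the ?lang=en query string, are hypothetical and chosen only
to exercise each case. The first three come back unchanged; the last two keep extra characters.
```python
def username_from(url):
    return url.strip().replace("https://twitter.com/", "")
samples = [
    "http://twitter.com/davidjmalan",           # http instead of https
    "twitter.com/davidjmalan",                  # no protocol at all
    "https://www.twitter.com/davidjmalan",      # www subdomain
    "https://twitter.com/davidjmalan/",         # trailing slash
    "https://twitter.com/davidjmalan?lang=en",  # query string
]
for sample in samples:
    print(repr(username_from(sample)))
```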

01:38:39.540 --> 01:38:43.010 align:middle line:84%
And even if I come back to my
contrived example as earlier,

01:38:43.010 --> 01:38:45.350 align:middle line:84%
what if I run this
program and say this--

01:38:45.350 --> 01:38:52.610 align:middle line:84%
"my username is
https://twitter.com/davidjmalan,"

01:38:52.610 --> 01:38:53.540 align:middle line:90%
Enter.

01:38:53.540 --> 01:38:58.570 align:middle line:84%
Well, that too just didn't really
work-- it got rid of the-- actually--

01:38:58.570 --> 01:39:01.730 align:middle line:84%
[LAUGHS] OK, actually
that kind of worked.

01:39:01.730 --> 01:39:05.390 align:middle line:84%
But the goal here is to actually
get the user's username,

01:39:05.390 --> 01:39:08.210 align:middle line:84%
not an English sentence
describing the user's username.

01:39:08.210 --> 01:39:11.150 align:middle line:84%
So I would argue that even though
I just accidentally created

01:39:11.150 --> 01:39:13.670 align:middle line:84%
perfectly correct English
grammar, I did not

01:39:13.670 --> 01:39:15.860 align:middle line:90%
extract the Twitter username correctly.

01:39:15.860 --> 01:39:19.890 align:middle line:84%
I don't want words like "my
username is" as part of my input.

01:39:19.890 --> 01:39:22.940 align:middle line:84%
So how can we go about improving
this, and maybe chipping away

01:39:22.940 --> 01:39:24.530 align:middle line:90%
at some of those problems one by one?

01:39:24.530 --> 01:39:26.280 align:middle line:90%
Well, let me clear my screen here.

01:39:26.280 --> 01:39:27.780 align:middle line:90%
Let me come back up to my code.

01:39:27.780 --> 01:39:31.640 align:middle line:84%
And let me not just replace it, but
let me do something else instead.

01:39:31.640 --> 01:39:34.040 align:middle line:84%
I'm going to go ahead, and
instead of using replace,

01:39:34.040 --> 01:39:36.950 align:middle line:84%
I'm going to use another
function called removeprefix.

01:39:36.950 --> 01:39:42.060 align:middle line:84%
A prefix is a string or a substring
that comes at the start of another.

01:39:42.060 --> 01:39:45.320 align:middle line:84%
So if I remove prefix, I don't need
a second argument for this function.

01:39:45.320 --> 01:39:46.220 align:middle line:90%
I just need one.

01:39:46.220 --> 01:39:48.540 align:middle line:90%
What prefix do you want to remove?

01:39:48.540 --> 01:39:51.680 align:middle line:84%
So this will at least now
fix the problem I just

01:39:51.680 --> 01:39:54.860 align:middle line:84%
described of typing in like a whole
sentence, where the URL is there,

01:39:54.860 --> 01:39:57.600 align:middle line:84%
but it's not at the beginning,
it's only at the end.

01:39:57.600 --> 01:39:59.930 align:middle line:90%
So here, this still is not correct.

01:39:59.930 --> 01:40:04.100 align:middle line:84%
But we don't create this weird-looking
output that just removes the URL part

01:40:04.100 --> 01:40:05.360 align:middle line:90%
of the input--

01:40:05.360 --> 01:40:11.330 align:middle line:84%
"my username is
https://twitter.com/davidjmalan."

01:40:11.330 --> 01:40:16.700 align:middle line:84%
A moment ago, it did remove the
URL and left only the davidjmalan.

01:40:16.700 --> 01:40:17.990 align:middle line:90%
This is not perfect still.

01:40:17.990 --> 01:40:21.830 align:middle line:84%
But at least now, it does
not weirdly remove the URL

01:40:21.830 --> 01:40:23.030 align:middle line:90%
and then leave the English.

01:40:23.030 --> 01:40:24.420 align:middle line:90%
It's just leaving it alone.

01:40:24.420 --> 01:40:26.600 align:middle line:84%
So maybe I could handle
this better, but at least

01:40:26.600 --> 01:40:30.710 align:middle line:84%
it's removing it from the part
of the string I might anticipate.
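
NOTE
Editor's sketch of the removeprefix variant. str.removeprefix is available from Python 3.9;
the function name username_from is an assumption. A sentence that merely contains the URL
is left untouched, because the prefix is removed only when the string starts with it.
```python
def username_from(url):
    # Only trims the text when it appears at the very start of the string.
    return url.strip().removeprefix("https://twitter.com/")
print(username_from("https://twitter.com/davidjmalan"))
print(username_from("My username is https://twitter.com/davidjmalan"))
```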

01:40:30.710 --> 01:40:32.550 align:middle line:90%
Well, what else could we do here?

01:40:32.550 --> 01:40:35.180 align:middle line:84%
Well, it turns out that
regular expressions just

01:40:35.180 --> 01:40:37.940 align:middle line:84%
let us express patterns
much more precisely.

01:40:37.940 --> 01:40:41.180 align:middle line:84%
We could spend all day using a whole
bunch of different Python functions

01:40:41.180 --> 01:40:44.810 align:middle line:84%
like removeprefix, or replace, and
strip, and others, and kind of

01:40:44.810 --> 01:40:47.240 align:middle line:90%
make our way to the right solution.

01:40:47.240 --> 01:40:50.310 align:middle line:84%
But a regular expression just
allows you to more succinctly,

01:40:50.310 --> 01:40:55.040 align:middle line:84%
if admittedly more cryptically, express
these kinds of patterns and goals.

01:40:55.040 --> 01:40:57.260 align:middle line:84%
And we've seen from
parentheses, which can

01:40:57.260 --> 01:41:00.170 align:middle line:84%
be used not just to group
symbols together as sets

01:41:00.170 --> 01:41:05.180 align:middle line:84%
but to capture information as well,
we have a very powerful tool now

01:41:05.180 --> 01:41:06.630 align:middle line:90%
in our toolkit.

01:41:06.630 --> 01:41:07.800 align:middle line:90%
So let me do this.

01:41:07.800 --> 01:41:12.530 align:middle line:84%
Let me go ahead and start fresh
here and import the re library

01:41:12.530 --> 01:41:14.450 align:middle line:90%
as before at the very top of my program.

01:41:14.450 --> 01:41:17.900 align:middle line:84%
I'm still going to get the user's
URL via the same line of code.

01:41:17.900 --> 01:41:20.970 align:middle line:84%
But I'm now going to use
another function as well.

01:41:20.970 --> 01:41:24.950 align:middle line:84%
It turns out that there's not
just re.search, or re.match,

01:41:24.950 --> 01:41:26.060 align:middle line:90%
or re.fullmatch.

01:41:26.060 --> 01:41:30.860 align:middle line:84%
There's also re.sub in the regular
expression library, where "sub" here

01:41:30.860 --> 01:41:32.000 align:middle line:90%
means "substitute."

01:41:32.000 --> 01:41:35.220 align:middle line:84%
And it takes more arguments, but
they're fairly straightforward.

01:41:35.220 --> 01:41:38.990 align:middle line:84%
The first argument to re.sub is
the pattern, the regular expression

01:41:38.990 --> 01:41:40.280 align:middle line:90%
that you want to look for.

01:41:40.280 --> 01:41:43.160 align:middle line:84%
Then you have a replacement
string-- what do

01:41:43.160 --> 01:41:45.470 align:middle line:90%
you want to replace that pattern with?

01:41:45.470 --> 01:41:47.390 align:middle line:90%
And where do you want to do all that?

01:41:47.390 --> 01:41:51.265 align:middle line:84%
Well, you pass in the string that
you want to do the substitution on.

01:41:51.265 --> 01:41:54.140 align:middle line:84%
Then there's some other arguments
that I'll wave my hands at for now.

01:41:54.140 --> 01:41:56.240 align:middle line:84%
Among them are those same
flags and also a count,

01:41:56.240 --> 01:41:58.970 align:middle line:84%
like how many times do you
want to do find and replace?

01:41:58.970 --> 01:42:01.670 align:middle line:84%
Do you want it to do all,
do you want to do just one,

01:42:01.670 --> 01:42:04.070 align:middle line:84%
or so forth? You can have
further control there, too,

01:42:04.070 --> 01:42:06.770 align:middle line:84%
just like you would in Google
Docs or Microsoft Word.
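
NOTE
Editor's sketch of the re.sub arguments listed above: pattern, replacement, string, then the
optional count and flags. The sample text "banana" is made up for illustration.
```python
import re
text = "banana"
# count defaults to 0, which means "replace every match".
replace_all = re.sub(r"a", "-", text)
# count=1 stops after the first match.
replace_one = re.sub(r"a", "-", text, count=1)
# flags accepts the same options as re.search, such as re.IGNORECASE.
ignore_case = re.sub(r"A", "-", text, flags=re.IGNORECASE)
print(replace_all, replace_one, ignore_case)
```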

01:42:06.770 --> 01:42:10.160 align:middle line:84%
Well, let me go back to my
code here, and let me do this.

01:42:10.160 --> 01:42:15.020 align:middle line:84%
I'm going to go ahead and call re
not search but re.sub for substitute.

01:42:15.020 --> 01:42:18.320 align:middle line:84%
I'm going to pass in the
following regular expression,

01:42:18.320 --> 01:42:25.610 align:middle line:84%
"https://twitter.com/" and then
I'm going to close my quote.

01:42:25.610 --> 01:42:27.860 align:middle line:84%
And now what do I want
to replace that with?

01:42:27.860 --> 01:42:31.460 align:middle line:84%
Well, like before with the
simple str replace function,

01:42:31.460 --> 01:42:34.380 align:middle line:84%
I want to replace it with nothing,
just get rid of it altogether.

01:42:34.380 --> 01:42:37.730 align:middle line:84%
But what string do I want
to pass in to do this to?

01:42:37.730 --> 01:42:39.810 align:middle line:90%
The URL from the user.

01:42:39.810 --> 01:42:44.360 align:middle line:84%
And now let me go ahead and
assign the return value of re.sub

01:42:44.360 --> 01:42:46.100 align:middle line:90%
to a variable called username.

01:42:46.100 --> 01:42:49.460 align:middle line:84%
So re.sub's purpose in life
is, again, to substitute

01:42:49.460 --> 01:42:52.490 align:middle line:84%
some value for some regular
expression some number of times.

01:42:52.490 --> 01:42:56.360 align:middle line:84%
It essentially is find and
replace using regular expressions.

01:42:56.360 --> 01:42:59.090 align:middle line:84%
And it returns to you
the resulting string

01:42:59.090 --> 01:43:01.400 align:middle line:84%
once you've done all
those substitutions.

01:43:01.400 --> 01:43:04.850 align:middle line:84%
So now the very last line of my code
can be the same as before, print--

01:43:04.850 --> 01:43:08.960 align:middle line:84%
and I'll use an f-string, username,
colon, and then in curly braces,

01:43:08.960 --> 01:43:09.590 align:middle line:90%
username.

01:43:09.590 --> 01:43:12.300 align:middle line:90%
So I can print out literally just that.
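
NOTE
Editor's sketch of this first re.sub version. It behaves much like str.replace at this stage,
since the pattern has no anchor yet. The function wrapper username_from is an assumption
standing in for the lecture's input() prompt.
```python
import re
def username_from(url):
    # First argument is the pattern, second the replacement, third the string to search.
    return re.sub(r"https://twitter.com/", "", url.strip())
print(f"Username: {username_from('https://twitter.com/davidjmalan')}")
```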

01:43:12.300 --> 01:43:12.800 align:middle line:90%
All right.

01:43:12.800 --> 01:43:14.300 align:middle line:90%
Let's try this and see what happens.

01:43:14.300 --> 01:43:17.390 align:middle line:84%
I'll clear my terminal window,
run python of twitter.py.

01:43:17.390 --> 01:43:23.690 align:middle line:84%
And here we go,
https://twitter.com/davidjmalan.

01:43:23.690 --> 01:43:25.940 align:middle line:90%
Cross my fingers and hit Enter.

01:43:25.940 --> 01:43:28.580 align:middle line:90%
OK, now we're in business.

01:43:28.580 --> 01:43:30.560 align:middle line:90%
But it is still a little fragile.

01:43:30.560 --> 01:43:34.730 align:middle line:84%
And so let me ask the group,
what problem should I now

01:43:34.730 --> 01:43:36.125 align:middle line:90%
further chip away at?

01:43:36.125 --> 01:43:38.000 align:middle line:84%
They've been said before,
but let's be clear.

01:43:38.000 --> 01:43:40.460 align:middle line:84%
What's one or more
problems that still remain?

01:43:40.460 --> 01:43:44.690 align:middle line:84%
AUDIENCE: The protocols and
the domain prefix [INAUDIBLE].

01:43:44.690 --> 01:43:45.440 align:middle line:90%
DAVID MALAN: Good.

01:43:45.440 --> 01:43:48.020 align:middle line:90%
The protocols, so HTTP versus HTTPS.

01:43:48.020 --> 01:43:51.980 align:middle line:84%
Maybe the subdomain, www,
should it be there or not?

01:43:51.980 --> 01:43:54.200 align:middle line:84%
And there's a few other
mistakes here, too.

01:43:54.200 --> 01:43:55.770 align:middle line:90%
Let me actually stay with the group.

01:43:55.770 --> 01:43:59.600 align:middle line:84%
What are some other shortcomings
of this current solution?

01:43:59.600 --> 01:44:03.590 align:middle line:84%
AUDIENCE: If we use a
phrase like you do before,

01:44:03.590 --> 01:44:07.940 align:middle line:84%
we are going to have the same problem,
because it's not taking account

01:44:07.940 --> 01:44:11.150 align:middle line:90%
in the first part of the text example.

01:44:11.150 --> 01:44:11.900 align:middle line:90%
DAVID MALAN: Good.

01:44:11.900 --> 01:44:16.220 align:middle line:84%
I might still allow for some words,
some English to the left of the URL

01:44:16.220 --> 01:44:17.810 align:middle line:90%
because I didn't use my ^ symbol.

01:44:17.810 --> 01:44:18.770 align:middle line:90%
So I'll fix that.

01:44:18.770 --> 01:44:22.450 align:middle line:84%
And any final observations
on shortcomings here?

01:44:22.450 --> 01:44:26.993 align:middle line:84%
AUDIENCE: Well, it could be an HTTP, or
there could be less than two slashes.

01:44:26.993 --> 01:44:27.660 align:middle line:90%
DAVID MALAN: OK.

01:44:27.660 --> 01:44:28.493 align:middle line:90%
So it could be HTTP.

01:44:28.493 --> 01:44:30.910 align:middle line:84%
And I think that was mentioned,
too, in terms of protocol.

01:44:30.910 --> 01:44:32.570 align:middle line:90%
There could be fewer than two slashes.

01:44:32.570 --> 01:44:34.550 align:middle line:90%
That I'm not going to worry about.

01:44:34.550 --> 01:44:38.720 align:middle line:84%
If the user gives me one instead of
two, that's really user error.

01:44:38.720 --> 01:44:41.420 align:middle line:84%
And I could be tolerant of it,
but you know what, at that point

01:44:41.420 --> 01:44:45.570 align:middle line:84%
I'm OK yelling at them with an error
message saying, please fix your input.

01:44:45.570 --> 01:44:48.890 align:middle line:84%
Otherwise, we could be here all day long
trying to handle all possible typos.

01:44:48.890 --> 01:44:51.740 align:middle line:84%
For now, I think in the
interests of usability,

01:44:51.740 --> 01:44:54.560 align:middle line:84%
or user experience,
UX, let's at least be

01:44:54.560 --> 01:44:59.130 align:middle line:84%
tolerant of all possible valid inputs
or reasonable inputs, if you will.

01:44:59.130 --> 01:45:01.940 align:middle line:84%
So let me go here, and let me
start chipping away at these here.

01:45:01.940 --> 01:45:03.530 align:middle line:90%
What are some problems we can solve?

01:45:03.530 --> 01:45:08.735 align:middle line:84%
Well, let me propose that we first
address the issue of matching

01:45:08.735 --> 01:45:10.110 align:middle line:90%
from the beginning of the string.

01:45:10.110 --> 01:45:11.900 align:middle line:90%
So let me add the ^ to the beginning.

01:45:11.900 --> 01:45:15.362 align:middle line:84%
And let me add not a $ sign
at the end, though, right?

01:45:15.362 --> 01:45:17.570 align:middle line:84%
Because I don't want to
match all the way to the end,

01:45:17.570 --> 01:45:19.950 align:middle line:84%
because I want to
tolerate a username there.

01:45:19.950 --> 01:45:23.210 align:middle line:84%
So I think we just want
the ^ symbol there.
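
NOTE
Editor's sketch of anchoring the pattern with ^ and deliberately omitting $. The function
name username_from is an assumption. With ^, the URL is removed only when it starts the
input, and leaving off $ lets the username follow the matched prefix.
```python
import re
def username_from(url):
    # ^ pins the match to the start of the string; there is no $ because the username comes after.
    return re.sub(r"^https://twitter.com/", "", url.strip())
print(username_from("https://twitter.com/davidjmalan"))
print(username_from("My username is https://twitter.com/davidjmalan"))
```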

01:45:23.210 --> 01:45:26.000 align:middle line:84%
There's a subtle bug that
no one yet mentioned.

01:45:26.000 --> 01:45:30.860 align:middle line:84%
And let me just kind of highlight it
and see if it jumps out at you now.

01:45:30.860 --> 01:45:32.730 align:middle line:90%
It's a little subtle here on my screen.

01:45:32.730 --> 01:45:37.610 align:middle line:84%
I've highlighted in
blue a final bug here--

01:45:37.610 --> 01:45:39.860 align:middle line:90%
maybe some smiles on the screen, yeah?

01:45:39.860 --> 01:45:41.400 align:middle line:90%
Can we take one hand here?

01:45:41.400 --> 01:45:46.730 align:middle line:84%
Why am I highlighting the dot in
twitter.com, even though it definitely

01:45:46.730 --> 01:45:47.900 align:middle line:90%
should be there?

01:45:47.900 --> 01:45:52.610 align:middle line:84%
AUDIENCE: So the dot without a backslash
means any character except a newline.

01:45:52.610 --> 01:45:53.990 align:middle line:90%
DAVID MALAN: Yeah, exactly.

01:45:53.990 --> 01:45:55.500 align:middle line:90%
It means any character.

01:45:55.500 --> 01:46:01.555 align:middle line:84%
So I could type in something like
twitter?com, or twitter anything com,

01:46:01.555 --> 01:46:03.660 align:middle line:90%
and that would actually be tolerated.

01:46:03.660 --> 01:46:07.230 align:middle line:84%
It's not really that bad, because
why would the user do that?

01:46:07.230 --> 01:46:09.410 align:middle line:84%
But if I want to be
correct, and I want to be

01:46:09.410 --> 01:46:13.280 align:middle line:84%
able to test my own code properly, I
should really get this detail right.

01:46:13.280 --> 01:46:16.040 align:middle line:84%
So that's an easy fix, too,
but it's a common mistake.

01:46:16.040 --> 01:46:19.190 align:middle line:84%
Anytime you're writing regular
expressions that happen to involve

01:46:19.190 --> 01:46:23.210 align:middle line:84%
special symbols, like dots
in a URL or domain name,

01:46:23.210 --> 01:46:27.230 align:middle line:84%
a $ sign in something involving
currency, remember you might, indeed,

01:46:27.230 --> 01:46:30.390 align:middle line:84%
need to escape it with a
backslash like this here.
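
NOTE
Editor's sketch of the unescaped-dot bug. The variable names and the mistyped URL are
hypothetical. An unescaped . accepts any character in that position, while \. accepts only a
literal dot. re.escape from the standard library can add the backslashes for you.
```python
import re
unescaped = r"^https://twitter.com/"
escaped = r"^https://twitter\.com/"
typo = "https://twitter?com/davidjmalan"
# The unescaped dot matches the "?" so the bogus prefix is stripped anyway.
loose_result = re.sub(unescaped, "", typo)
# The escaped dot requires a real ".", so the input is left alone.
strict_result = re.sub(escaped, "", typo)
print(loose_result)
print(strict_result)
print(re.escape("twitter.com"))
```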

01:46:30.390 --> 01:46:30.890 align:middle line:90%
All right.

01:46:30.890 --> 01:46:34.040 align:middle line:84%
Let me ask the group about
the protocol specifically.

01:46:34.040 --> 01:46:36.690 align:middle line:90%
So HTTPS is a good thing in the world.

01:46:36.690 --> 01:46:37.860 align:middle line:90%
It means secure.

01:46:37.860 --> 01:46:39.360 align:middle line:90%
There is encryption being used.

01:46:39.360 --> 01:46:41.840 align:middle line:90%
So generally, you like to see HTTPS.

01:46:41.840 --> 01:46:46.370 align:middle line:84%
But you still see people
typing or copy-pasting HTTP.

01:46:46.370 --> 01:46:50.960 align:middle line:84%
What would be the simplest fix here
to tolerate, as has been proposed,

01:46:50.960 --> 01:46:54.380 align:middle line:90%
both HTTP and HTTPS?

01:46:54.380 --> 01:46:56.600 align:middle line:84%
I'm going to propose
that I could do this.

01:46:56.600 --> 01:47:02.630 align:middle line:84%
I could do HTTP vertical bar or
HTTPS, which, again, means A or B.

01:47:02.630 --> 01:47:04.490 align:middle line:90%
But I think I can be smarter than that.

01:47:04.490 --> 01:47:06.770 align:middle line:84%
I can keep my code a
little more succinct.

01:47:06.770 --> 01:47:13.400 align:middle line:84%
Any recommendations here for
tolerating HTTP or HTTPS?

01:47:13.400 --> 01:47:16.845 align:middle line:84%
AUDIENCE: We could try to put
in question mark behind the S.

01:47:16.845 --> 01:47:17.720 align:middle line:90%
DAVID MALAN: Perfect.

01:47:17.720 --> 01:47:19.340 align:middle line:90%
Just use a question mark.

01:47:19.340 --> 01:47:21.110 align:middle line:90%
Both of those would be viable solutions.

01:47:21.110 --> 01:47:23.330 align:middle line:84%
If you want to be super
explicit in your code, fine.

01:47:23.330 --> 01:47:28.730 align:middle line:84%
Use parentheses and say HTTP or HTTPS,
so that you, the reader, your boss,

01:47:28.730 --> 01:47:31.410 align:middle line:84%
your teacher just know
exactly what you're doing.

01:47:31.410 --> 01:47:35.090 align:middle line:84%
But if you keep taking the more
verbose approach all the time,

01:47:35.090 --> 01:47:37.760 align:middle line:84%
it might actually become
less readable, certainly

01:47:37.760 --> 01:47:40.580 align:middle line:84%
once your regular expressions
get this big instead of this big.

01:47:40.580 --> 01:47:42.290 align:middle line:90%
So let's save space where we can.

01:47:42.290 --> 01:47:45.030 align:middle line:84%
And I would argue that this
is pretty reasonable, so

01:47:45.030 --> 01:47:47.640 align:middle line:84%
long as you're in the habit
of reading regular expressions

01:47:47.640 --> 01:47:50.390 align:middle line:84%
and know that question mark does
not mean a literal question mark,

01:47:50.390 --> 01:47:52.970 align:middle line:84%
but it means zero or
one of the thing before.

01:47:52.970 --> 01:47:56.510 align:middle line:84%
I think we've effectively
made the S optional here.
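
NOTE
Editor's sketch of the two ways to accept both protocols: the succinct s? and the more
explicit (http|https) alternation. Both function names are assumptions made for comparison.
```python
import re
def username_succinct(url):
    # s? means zero or one "s", so http and https both match.
    return re.sub(r"^https?://twitter\.com/", "", url.strip())
def username_explicit(url):
    # The parentheses group the two alternatives separated by the vertical bar.
    return re.sub(r"^(http|https)://twitter\.com/", "", url.strip())
for sample in ["http://twitter.com/davidjmalan", "https://twitter.com/davidjmalan"]:
    print(username_succinct(sample), username_explicit(sample))
```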

01:47:56.510 --> 01:47:58.410 align:middle line:90%
Now, what else can I do?

01:47:58.410 --> 01:48:03.860 align:middle line:84%
Well, suppose we want to tolerate the
www dot, which may or may not be there,

01:48:03.860 --> 01:48:06.050 align:middle line:90%
but it will work if you go to a browser.

01:48:06.050 --> 01:48:07.220 align:middle line:90%
I could do this--

01:48:07.220 --> 01:48:11.720 align:middle line:84%
www dot-- wait, I want a
backslash there so I don't

01:48:11.720 --> 01:48:13.310 align:middle line:90%
repeat the same mistake as before.

01:48:13.310 --> 01:48:19.220 align:middle line:84%
But this is no good either, because
I want to tolerate being there or not

01:48:19.220 --> 01:48:19.760 align:middle line:90%
being there.

01:48:19.760 --> 01:48:21.890 align:middle line:84%
And now I've just
required that it be there.

01:48:21.890 --> 01:48:24.290 align:middle line:84%
But I think I can take
the same approach.

01:48:24.290 --> 01:48:25.550 align:middle line:90%
Any recommendations?

01:48:25.550 --> 01:48:27.200 align:middle line:90%
How do I make the www.

01:48:27.200 --> 01:48:30.230 align:middle line:90%
optional, just to hammer this home?

01:48:30.230 --> 01:48:32.480 align:middle line:90%
AUDIENCE: We can group--

01:48:32.480 --> 01:48:35.835 align:middle line:90%
make a square and a question mark.

01:48:35.835 --> 01:48:36.710 align:middle line:90%
DAVID MALAN: Perfect.

01:48:36.710 --> 01:48:38.825 align:middle line:84%
So question mark is
the short answer again.

01:48:38.825 --> 01:48:40.700 align:middle line:84%
But we have to be a
little smarter this time.

01:48:40.700 --> 01:48:43.130 align:middle line:84%
As Maria has noted, we
need parentheses now.

01:48:43.130 --> 01:48:46.160 align:middle line:84%
Because if I just put a
question mark after the dot,

01:48:46.160 --> 01:48:48.147 align:middle line:90%
that just means the dot is optional.

01:48:48.147 --> 01:48:50.480 align:middle line:84%
And that's wrong, because we
don't want the user to type

01:48:50.480 --> 01:48:56.690 align:middle line:84%
in W-W-W-T-W-I-T-T-E-R. We want the dot
to be there or just not at all with no

01:48:56.690 --> 01:48:57.490 align:middle line:90%
www.

01:48:57.490 --> 01:49:00.080 align:middle line:84%
So we need to group this
whole thing together,

01:49:00.080 --> 01:49:04.160 align:middle line:84%
put a parenthesis there, and then a
parenthesis, not after the third W,

01:49:04.160 --> 01:49:09.920 align:middle line:84%
after the dot, so that that whole thing
is either there or it's not there.
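
NOTE
Editor's sketch: why the parentheses matter when making "www." optional.
The two patterns below are assumptions reconstructed from the spoken
description; the misspelled URL is a made-up test input.
```python
import re
# Without grouping, "?" applies only to the escaped dot.
dot_only = r"^https?://www\.?twitter\.com/"
# With grouping, the whole "www." is either present or absent.
grouped = r"^https?://(www\.)?twitter\.com/"
typo = "https://wwwtwitter.com/davidjmalan"
typo_dot_only = re.search(dot_only, typo) is not None
typo_grouped = re.search(grouped, typo) is not None
bare_grouped = re.search(grouped, "https://twitter.com/davidjmalan") is not None
www_grouped = re.search(grouped, "https://www.twitter.com/davidjmalan") is not None
print(typo_dot_only, typo_grouped, bare_grouped, www_grouped)  # True False True True
```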

01:49:09.920 --> 01:49:12.338 align:middle line:90%
And what else could we still do here?

01:49:12.338 --> 01:49:14.630 align:middle line:84%
There's going to be one other
thing we should tolerate.

01:49:14.630 --> 01:49:16.922 align:middle line:84%
And it's been said before,
and I'll pluck this one off.

01:49:16.922 --> 01:49:18.260 align:middle line:90%
What about the protocol?

01:49:18.260 --> 01:49:23.805 align:middle line:84%
Like, what if the user just doesn't
type or doesn't copy-paste the http://

01:49:23.805 --> 01:49:26.660 align:middle line:90%
or an https://?

01:49:26.660 --> 01:49:28.460 align:middle line:84%
Honestly, you and I
are not in the habit,

01:49:28.460 --> 01:49:31.730 align:middle line:84%
generally, of even typing the
protocol anymore nowadays.

01:49:31.730 --> 01:49:34.010 align:middle line:84%
You just let the browser
figure it out for you,

01:49:34.010 --> 01:49:36.590 align:middle line:90%
and automatically add it instead.

01:49:36.590 --> 01:49:38.900 align:middle line:84%
So this one's going to look
like more of a mouthful.

01:49:38.900 --> 01:49:43.520 align:middle line:84%
But if I want this whole thing
here in blue to be optional,

01:49:43.520 --> 01:49:46.880 align:middle line:84%
it's actually the same solution
as Maria offered a moment ago.

01:49:46.880 --> 01:49:49.550 align:middle line:84%
I'm going to go ahead and
put a parenthesis over here,

01:49:49.550 --> 01:49:53.960 align:middle line:84%
and a parenthesis after the two
slashes, and then a question

01:49:53.960 --> 01:49:57.120 align:middle line:84%
mark so as to make that
whole thing optional as well.

01:49:57.120 --> 01:49:58.320 align:middle line:90%
And this is OK.

01:49:58.320 --> 01:50:00.920 align:middle line:84%
It's totally fine to
make this whole thing

01:50:00.920 --> 01:50:06.480 align:middle line:84%
optional, or inside of it, this little
thing, just the S optional as well.

01:50:06.480 --> 01:50:09.350 align:middle line:84%
So long as I'm applying the
same principles again and again,

01:50:09.350 --> 01:50:11.390 align:middle line:84%
either on a small scale
or a bigger scale,

01:50:11.390 --> 01:50:16.680 align:middle line:84%
it's totally fine to nest one
of these inside of the other.
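
NOTE
Editor's sketch: the nested optional groups described here, used with re.sub
to strip the prefix. The exact pattern is reconstructed from the narration and
may differ slightly from what was on screen; the input list is illustrative.
```python
import re
# The whole protocol is optional, and inside it the "s" is optional too.
pattern = r"^(https?://)?(www\.)?twitter\.com/"
inputs = [
    "https://www.twitter.com/davidjmalan",
    "http://twitter.com/davidjmalan",
    "twitter.com/davidjmalan",
]
# Replace the matched prefix with nothing, leaving only the username.
usernames = [re.sub(pattern, "", url) for url in inputs]
print(usernames)  # ['davidjmalan', 'davidjmalan', 'davidjmalan']
```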

01:50:16.680 --> 01:50:20.730 align:middle line:84%
Questions now on any
of these refinements

01:50:20.730 --> 01:50:23.730 align:middle line:84%
to this parsing, this
analyzing of Twitter?

01:50:23.730 --> 01:50:29.850 align:middle line:84%
AUDIENCE: What if we put a
vertical bar besides this www dot?

01:50:29.850 --> 01:50:31.930 align:middle line:84%
DAVID MALAN: What if we
use a vertical bar there?

01:50:31.930 --> 01:50:34.110 align:middle line:90%
So we could do something like that, too.

01:50:34.110 --> 01:50:36.690 align:middle line:90%
We could do something like this.

01:50:36.690 --> 01:50:41.370 align:middle line:84%
Instead of the question mark,
I could do www dot or nothing

01:50:41.370 --> 01:50:43.680 align:middle line:90%
and just leave that in the parentheses.

01:50:43.680 --> 01:50:45.160 align:middle line:90%
That, too, would be fine.

01:50:45.160 --> 01:50:47.743 align:middle line:84%
I personally tend not to like
that, because it's a little less

01:50:47.743 --> 01:50:49.035 align:middle line:90%
obvious to me-- wait a minute.

01:50:49.035 --> 01:50:52.260 align:middle line:84%
Is that deliberate, or did I forget to
finish my thought by putting something

01:50:52.260 --> 01:50:53.460 align:middle line:90%
after the vertical bar?

01:50:53.460 --> 01:50:57.630 align:middle line:84%
But that, too, would be allowed there
as well, if that's what you mean.
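
NOTE
Editor's sketch: the vertical-bar alternative from the question, compared with
the question-mark form. Both patterns are assumptions based on the discussion;
the protocol is left out here to keep the comparison short.
```python
import re
with_question_mark = r"^(www\.)?twitter\.com/"
# "www." or nothing: the empty alternative after "|" is deliberate.
with_empty_alternative = r"^(www\.|)twitter\.com/"
samples = [
    "www.twitter.com/davidjmalan",
    "twitter.com/davidjmalan",
    "wwwtwitter.com/davidjmalan",
]
results = [
    (re.search(with_question_mark, s) is not None,
     re.search(with_empty_alternative, s) is not None)
    for s in samples
]
print(results)  # [(True, True), (True, True), (False, False)]
```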

01:50:57.630 --> 01:50:59.790 align:middle line:84%
Other questions on where
we left things here,

01:50:59.790 --> 01:51:03.090 align:middle line:84%
where we made the
protocol optional, too?

01:51:03.090 --> 01:51:07.260 align:middle line:84%
AUDIENCE: What happens
if we have parenthesis,

01:51:07.260 --> 01:51:10.173 align:middle line:84%
and inside we have another
parenthesis, and another parenthesis?

01:51:10.173 --> 01:51:11.590 align:middle line:90%
Will it interfere with each other?

01:51:11.590 --> 01:51:14.298 align:middle line:84%
DAVID MALAN: If you have parentheses
inside of parentheses, that,

01:51:14.298 --> 01:51:15.660 align:middle line:90%
too, is totally fine.

01:51:15.660 --> 01:51:19.680 align:middle line:84%
And indeed, that should be one
of the reassuring lessons today.

01:51:19.680 --> 01:51:23.670 align:middle line:84%
As complicated as each of these regular
expressions has admittedly gotten,

01:51:23.670 --> 01:51:27.570 align:middle line:84%
I'm just applying the exact same
principles and the exact same syntax

01:51:27.570 --> 01:51:29.110 align:middle line:90%
again and again.

01:51:29.110 --> 01:51:31.988 align:middle line:84%
So it's totally fine to have
parentheses inside of parentheses

01:51:31.988 --> 01:51:33.780 align:middle line:84%
if they're each solving
different problems.

01:51:33.780 --> 01:51:37.200 align:middle line:84%
And in fact, the lesson I would
really emphasize the most today

01:51:37.200 --> 01:51:41.250 align:middle line:84%
is that you will not be
happy if you try to write out

01:51:41.250 --> 01:51:44.820 align:middle line:84%
a whole complicated regular
expression all at once.

01:51:44.820 --> 01:51:47.310 align:middle line:84%
Like, if you're anything
like me, you will fail,

01:51:47.310 --> 01:51:49.428 align:middle line:84%
and you will have trouble
finding the mistake.

01:51:49.428 --> 01:51:50.970 align:middle line:90%
Because my god, look at these things.

01:51:50.970 --> 01:51:53.880 align:middle line:84%
They are, even to me all
these years later, cryptic.

01:51:53.880 --> 01:51:57.240 align:middle line:84%
The better way, I would argue,
whether you're new to programming

01:51:57.240 --> 01:52:01.110 align:middle line:84%
or as old to it as I am,
is to just take these baby

01:52:01.110 --> 01:52:03.750 align:middle line:84%
steps, these incremental steps
where you do something simple,

01:52:03.750 --> 01:52:04.710 align:middle line:90%
you make sure it works.

01:52:04.710 --> 01:52:07.080 align:middle line:84%
You add one more feature,
make sure it works.

01:52:07.080 --> 01:52:09.120 align:middle line:84%
Add one more feature,
make sure it works.

01:52:09.120 --> 01:52:12.360 align:middle line:84%
And hopefully, by the end, because
you've done each of those steps one

01:52:12.360 --> 01:52:15.490 align:middle line:84%
at a time, the whole thing
will make sense to you.

01:52:15.490 --> 01:52:20.310 align:middle line:84%
But you'll also have gotten each of
those steps correct at each turn.

01:52:20.310 --> 01:52:23.970 align:middle line:84%
So please, do avoid
the inclination to try

01:52:23.970 --> 01:52:26.550 align:middle line:84%
to come up with long,
sophisticated regular expressions

01:52:26.550 --> 01:52:29.580 align:middle line:84%
all at once, because it's
just not a good use of time

01:52:29.580 --> 01:52:32.100 align:middle line:84%
if you then stare at it trying
to find a mistake that you

01:52:32.100 --> 01:52:35.230 align:middle line:84%
could have caught if you did
things more incrementally instead.

01:52:35.230 --> 01:52:35.730 align:middle line:90%
All right.

01:52:35.730 --> 01:52:38.160 align:middle line:84%
There still remains,
arguably, at least one problem

01:52:38.160 --> 01:52:40.050 align:middle line:84%
with this solution in
that even though I'm

01:52:40.050 --> 01:52:44.040 align:middle line:84%
calling re.sub to substitute
the URL with nothing,

01:52:44.040 --> 01:52:47.410 align:middle line:84%
quote, unquote, I then in my
final line of code, line 6,

01:52:47.410 --> 01:52:49.590 align:middle line:84%
am just blindly assuming
that it all worked,

01:52:49.590 --> 01:52:52.200 align:middle line:84%
and I'm going to go ahead
and print out the username.

01:52:52.200 --> 01:52:53.520 align:middle line:90%
But what if the user--

01:52:53.520 --> 01:52:56.310 align:middle line:84%
if I clear my screen here and
run python of twitter.py--

01:52:56.310 --> 01:52:58.110 align:middle line:90%
doesn't even type a Twitter URL?

01:52:58.110 --> 01:53:02.805 align:middle line:84%
What if they do something
like https://google.com/,

01:53:02.805 --> 01:53:06.090 align:middle line:84%
like completely unrelated,
for whatever reason,

01:53:06.090 --> 01:53:08.970 align:middle line:84%
Enter, that is not
their Twitter username.
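
NOTE
Editor's sketch of the problem described here: when the pattern does not
match, re.sub returns the input unchanged, so the "username" is just the
unrelated URL. The pattern is reconstructed from the narration.
```python
import re
pattern = r"^(https?://)?(www\.)?twitter\.com/"
url = "https://google.com/"
# No match means no substitution, so the whole URL comes back.
username = re.sub(pattern, "", url)
unchanged = username == url
print(f"Username: {username}")  # Username: https://google.com/
```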

01:53:08.970 --> 01:53:12.300 align:middle line:84%
So we need to have some
conditional logic, I would argue,

01:53:12.300 --> 01:53:15.690 align:middle line:84%
so that for this program's
sake, we're only printing out

01:53:15.690 --> 01:53:19.920 align:middle line:84%
or, in a back end system, we're only
saving into our database or a CSV

01:53:19.920 --> 01:53:24.090 align:middle line:84%
file the username if we actually
matched the proper pattern.

01:53:24.090 --> 01:53:29.010 align:middle line:84%
So rather than use re.sub, which
is useful for cleaning up data,

01:53:29.010 --> 01:53:32.340 align:middle line:84%
as we've done here to get rid of
something we don't want there,

01:53:32.340 --> 01:53:37.080 align:middle line:84%
why don't we go back to
re.search, where we began today,

01:53:37.080 --> 01:53:41.100 align:middle line:84%
and use it to solve this same problem
but in a way that's conditional,

01:53:41.100 --> 01:53:44.490 align:middle line:84%
whereby I can confidently say, yes
or no, at the end of my program,

01:53:44.490 --> 01:53:47.260 align:middle line:90%
here's the username, or here it is not?

01:53:47.260 --> 01:53:48.300 align:middle line:90%
So let me go ahead now.

01:53:48.300 --> 01:53:50.340 align:middle line:90%
And I'll clear my terminal window here.

01:53:50.340 --> 01:53:52.560 align:middle line:90%
I'm going to keep most of--

01:53:52.560 --> 01:53:55.800 align:middle line:84%
I'm going to keep the first two
lines the same, where I import re,

01:53:55.800 --> 01:53:57.520 align:middle line:90%
and I get the URL from the user.

01:53:57.520 --> 01:53:59.010 align:middle line:90%
But this time, let's do this.

01:53:59.010 --> 01:54:03.630 align:middle line:84%
Let's this time search for, using
re.search instead of re.sub,

01:54:03.630 --> 01:54:04.470 align:middle line:90%
the following.

01:54:04.470 --> 01:54:09.510 align:middle line:84%
I'm going to start matching at the
beginning of the string, https,

01:54:09.510 --> 01:54:13.380 align:middle line:84%
question mark to make the S
optional, colon, slash, slash,

01:54:13.380 --> 01:54:19.710 align:middle line:84%
I'm going to make my www optional by
putting that in parentheses with a question mark,

01:54:19.710 --> 01:54:24.000 align:middle line:84%
then a twitter.com with a literal dot
there so I stay ahead of that issue,

01:54:24.000 --> 01:54:26.640 align:middle line:90%
too, then a slash.

01:54:26.640 --> 01:54:30.330 align:middle line:84%
And then well, this is where
davidjmalan is supposed to go.

01:54:30.330 --> 01:54:31.710 align:middle line:90%
How do I detect this?

01:54:31.710 --> 01:54:35.580 align:middle line:84%
Well, I think I'll just tolerate
anything at the end of the URL here.

01:54:35.580 --> 01:54:38.532 align:middle line:84%
All right, $ sign at the
very end, close quote.

01:54:38.532 --> 01:54:40.740 align:middle line:84%
For the moment, I'm going
to stipulate that we're not

01:54:40.740 --> 01:54:43.830 align:middle line:84%
going to worry about question
marks at the end or hashes,

01:54:43.830 --> 01:54:45.600 align:middle line:90%
like for fragment IDs in URLs.

01:54:45.600 --> 01:54:48.630 align:middle line:84%
We're going to assume for
simplicity now that the URL just

01:54:48.630 --> 01:54:50.610 align:middle line:90%
ends with the username alone.

01:54:50.610 --> 01:54:52.110 align:middle line:90%
Now what am I going to do?

01:54:52.110 --> 01:54:54.330 align:middle line:84%
Well, I want to search
for this URL specifically,

01:54:54.330 --> 01:54:58.230 align:middle line:84%
and I'm going to ignore
case, so re.IGNORECASE,

01:54:58.230 --> 01:55:00.840 align:middle line:84%
applying that same lesson
learned from before.

01:55:00.840 --> 01:55:05.717 align:middle line:84%
re.search, recall, will return to
you the matches you've captured.

01:55:05.717 --> 01:55:07.050 align:middle line:90%
Well, what do I want to capture?

01:55:07.050 --> 01:55:12.420 align:middle line:84%
Well, I want to capture everything to
the right of the twitter.com URL here.

01:55:12.420 --> 01:55:17.560 align:middle line:84%
So let me surround what should be
the user's username with parentheses,

01:55:17.560 --> 01:55:21.580 align:middle line:84%
not for making them optional but to
say, "capture this set of characters."

01:55:21.580 --> 01:55:24.730 align:middle line:84%
Now, re.search, recall,
returns an answer.

01:55:24.730 --> 01:55:28.600 align:middle line:84%
matches will be my variable name again,
but I could call it anything I want.

01:55:28.600 --> 01:55:29.950 align:middle line:90%
And then I can do this.

01:55:29.950 --> 01:55:33.680 align:middle line:90%
If matches, now I know I can do this.

01:55:33.680 --> 01:55:36.370 align:middle line:84%
Let's print out the format
string, username colon.

01:55:36.370 --> 01:55:40.190 align:middle line:90%
And then what do I want to print out?

01:55:40.190 --> 01:55:44.440 align:middle line:84%
Well, I think I want to print out
matches.group 1 for my matched

01:55:44.440 --> 01:55:45.700 align:middle line:90%
username.

01:55:45.700 --> 01:55:46.210 align:middle line:90%
All right.

01:55:46.210 --> 01:55:47.980 align:middle line:90%
So what am I doing just to recap?

01:55:47.980 --> 01:55:49.960 align:middle line:90%
Line 1, I'm importing the library.

01:55:49.960 --> 01:55:52.280 align:middle line:84%
Line 2, I'm getting
the URL from the user.

01:55:52.280 --> 01:55:53.230 align:middle line:90%
So nothing new there.

01:55:53.230 --> 01:55:59.740 align:middle line:84%
Line 5, I'm searching the user's URL, as
indicated here as the second argument,

01:55:59.740 --> 01:56:03.220 align:middle line:84%
for this regular
expression, this pattern.

01:56:03.220 --> 01:56:07.720 align:middle line:84%
I have surrounded the
dot + with parentheses

01:56:07.720 --> 01:56:11.380 align:middle line:84%
so that they are captured
ultimately, so I can extract,

01:56:11.380 --> 01:56:14.320 align:middle line:84%
in this final scenario,
the user's username.

01:56:14.320 --> 01:56:18.580 align:middle line:84%
If I indeed got a match,
and matches is not None,

01:56:18.580 --> 01:56:23.470 align:middle line:84%
it is actually containing some match,
then and only then, print out username.

01:56:23.470 --> 01:56:25.420 align:middle line:90%
In this way, let me try this now.

01:56:25.420 --> 01:56:31.110 align:middle line:84%
If I run python of twitter.py and
type in https://www.google.com/,

01:56:31.110 --> 01:56:33.370 align:middle line:90%
now nothing gets printed.

01:56:33.370 --> 01:56:36.010 align:middle line:84%
So I've at least solved
the mistake we just saw,

01:56:36.010 --> 01:56:38.050 align:middle line:84%
where I was just assuming
that my code worked.

01:56:38.050 --> 01:56:44.000 align:middle line:84%
Now I'm making sure that I have searched
for and found the Twitter URL prefix.
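
NOTE
Editor's sketch of the conditional check: re.search returns None when the
pattern is not found, so nothing is printed for an unrelated URL. The pattern
is reconstructed from the narration; the else branch is added for illustration.
```python
import re
pattern = r"^https?://(www\.)?twitter\.com/(.+)$"
matches = re.search(pattern, "https://www.google.com/", re.IGNORECASE)
no_match = matches is None
if matches:
    print("Matched a Twitter URL")
else:
    print("Not a Twitter URL")  # this branch runs
```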

01:56:44.000 --> 01:56:44.500 align:middle line:90%
All right.

01:56:44.500 --> 01:56:45.917 align:middle line:90%
Well, let's run this for real now.

01:56:45.917 --> 01:56:51.730 align:middle line:84%
Python of twitter.py
https://twitter.com/davidjmalan.

01:56:51.730 --> 01:56:55.420 align:middle line:84%
But note, I could use
HTTP, I could use www.

01:56:55.420 --> 01:56:58.430 align:middle line:84%
I'm just going to go
ahead here and hit Enter.

01:56:58.430 --> 01:57:01.730 align:middle line:90%
Huh, None.

01:57:01.730 --> 01:57:05.480 align:middle line:90%
What has gone wrong?

01:57:05.480 --> 01:57:08.060 align:middle line:90%
This one's a bit more subtle.

01:57:08.060 --> 01:57:13.027 align:middle line:84%
But why does matches.group
1 contain nothing?

01:57:13.027 --> 01:57:13.610 align:middle line:90%
Wait a minute.

01:57:13.610 --> 01:57:15.450 align:middle line:90%
Let me-- maybe I did this wrong.

01:57:15.450 --> 01:57:17.707 align:middle line:90%
Maybe-- maybe do we need the www?

01:57:17.707 --> 01:57:18.540 align:middle line:90%
Let me run it again.

01:57:18.540 --> 01:57:24.740 align:middle line:84%
So here we go. https://, let's
add a www.twitter.com/davidjmalan.

01:57:24.740 --> 01:57:25.500 align:middle line:90%
All right.

01:57:25.500 --> 01:57:26.470 align:middle line:90%
Enter.

01:57:26.470 --> 01:57:28.550 align:middle line:90%
Ho, ho, ho.

01:57:28.550 --> 01:57:31.170 align:middle line:90%
What is going on?

01:57:31.170 --> 01:57:32.720 align:middle line:90%
AUDIENCE: You have to say group 2.

01:57:32.720 --> 01:57:34.520 align:middle line:90%
DAVID MALAN: I have to say group 2?

01:57:34.520 --> 01:57:39.140 align:middle line:84%
Well, wait-- oh, right, because
we had the subdomain was optional.

01:57:39.140 --> 01:57:42.560 align:middle line:84%
And to make it optional, I
needed to use parentheses here.

01:57:42.560 --> 01:57:44.070 align:middle line:90%
And so I then said zero or one.

01:57:44.070 --> 01:57:44.570 align:middle line:90%
OK.

01:57:44.570 --> 01:57:49.910 align:middle line:84%
So that means that actually, I'm
unintentionally but by design

01:57:49.910 --> 01:57:54.710 align:middle line:84%
capturing the www dot, or none
of it if it wasn't there before,

01:57:54.710 --> 01:57:56.645 align:middle line:84%
but I have a second
match over here because I

01:57:56.645 --> 01:57:58.020 align:middle line:90%
have a second set of parentheses.

01:57:58.020 --> 01:58:00.350 align:middle line:84%
So I think, yep, let me
change matches.group 1

01:58:00.350 --> 01:58:02.300 align:middle line:90%
to matches.group 2, and let's run this.

01:58:02.300 --> 01:58:07.460 align:middle line:84%
Python of twitter.py
https://www.twitter--

01:58:07.460 --> 01:58:13.070 align:middle line:84%
let's do this,
twitter.com/davidjmalan, Enter,

01:58:13.070 --> 01:58:15.920 align:middle line:84%
and now we've got
access to the username.
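
NOTE
Editor's sketch of the group numbering: groups are numbered by their opening
parenthesis, left to right, so the optional "www." is group 1 and the username
is group 2. Variable names here are illustrative.
```python
import re
pattern = r"^https?://(www\.)?twitter\.com/(.+)$"
with_www = re.search(pattern, "https://www.twitter.com/davidjmalan", re.IGNORECASE)
without_www = re.search(pattern, "https://twitter.com/davidjmalan", re.IGNORECASE)
first = with_www.group(1)       # "www."
second = with_www.group(2)      # "davidjmalan"
# An optional group that did not take part in the match is None.
missing = without_www.group(1)  # None
print(first, second, missing)
```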

01:58:15.920 --> 01:58:19.040 align:middle line:84%
Let me go ahead and tighten
it up a little bit further.

01:58:19.040 --> 01:58:21.513 align:middle line:90%
If you like our new friend--

01:58:21.513 --> 01:58:22.430 align:middle line:90%
it's hard not to like.

01:58:22.430 --> 01:58:26.060 align:middle line:84%
If we like our old friend the
walrus operator, let's go ahead

01:58:26.060 --> 01:58:27.740 align:middle line:90%
and add this just to tighten things up.

01:58:27.740 --> 01:58:31.460 align:middle line:84%
Let me go back to VS Code here, and let
me get rid of the unnecessary condition

01:58:31.460 --> 01:58:34.580 align:middle line:84%
there and combine it up
here, if matches equals that.

01:58:34.580 --> 01:58:38.090 align:middle line:84%
But let's change the single assignment
operator to the walrus operator.

01:58:38.090 --> 01:58:40.040 align:middle line:90%
Now I've tightened things up further.
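
NOTE
Editor's sketch of the walrus version (Python 3.8 or later). Wrapping it in a
function named extract is an assumption made here so the result can be
checked; the lecture prints directly instead.
```python
import re
def extract(url):
    # Assign and test in one expression with the walrus operator.
    if matches := re.search(r"^https?://(www\.)?twitter\.com/(.+)$", url, re.IGNORECASE):
        return matches.group(2)
    return None
print(extract("https://www.twitter.com/davidjmalan"))  # davidjmalan
print(extract("https://www.google.com/"))  # None
```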

01:58:40.040 --> 01:58:43.940 align:middle line:84%
But I bet, I bet, I bet there
might be another solution here.

01:58:43.940 --> 01:58:50.630 align:middle line:84%
And indeed, it turns out that we can
come back to this final set of syntax.

01:58:50.630 --> 01:58:52.940 align:middle line:84%
Recall that when we
introduced these parentheses,

01:58:52.940 --> 01:58:56.720 align:middle line:84%
we did it so that we could do A or B,
for instance, with the vertical bar.

01:58:56.720 --> 01:58:59.060 align:middle line:84%
Then you can even combine
more than just one bar.

01:58:59.060 --> 01:59:02.900 align:middle line:84%
We use the group to combine
ideas like the www dot.

01:59:02.900 --> 01:59:07.760 align:middle line:84%
And then there's this admittedly weird
syntax at the bottom here, up until now

01:59:07.760 --> 01:59:08.690 align:middle line:90%
not used.

01:59:08.690 --> 01:59:12.230 align:middle line:84%
There is a non-capturing
version of parentheses

01:59:12.230 --> 01:59:15.050 align:middle line:84%
if you want to use parentheses
logically because you need to,

01:59:15.050 --> 01:59:18.080 align:middle line:84%
but you don't want to
bother capturing the result.

01:59:18.080 --> 01:59:20.450 align:middle line:84%
And this would arguably
be a better solution

01:59:20.450 --> 01:59:23.630 align:middle line:84%
here, because, yes, if I
go back to VS Code, I do

01:59:23.630 --> 01:59:27.560 align:middle line:84%
need to surround the www dot
with parentheses, at least

01:59:27.560 --> 01:59:30.170 align:middle line:84%
as I've written my regex
here, because I wanted

01:59:30.170 --> 01:59:31.910 align:middle line:90%
to put the question mark after it.

01:59:31.910 --> 01:59:35.120 align:middle line:84%
But I don't need the
www dot coming back.

01:59:35.120 --> 01:59:37.580 align:middle line:84%
In fact, let's only extract
the data we care about,

01:59:37.580 --> 01:59:40.280 align:middle line:84%
just so there's no confusion
down the road, for me,

01:59:40.280 --> 01:59:42.120 align:middle line:90%
or my colleagues, or my teachers.

01:59:42.120 --> 01:59:43.860 align:middle line:90%
So what could I do?

01:59:43.860 --> 01:59:48.800 align:middle line:84%
Well, the syntax per this slide is
to use a question mark and a colon

01:59:48.800 --> 01:59:51.410 align:middle line:90%
immediately after the open parenthesis.

01:59:51.410 --> 01:59:52.910 align:middle line:90%
It looks weird admittedly.

01:59:52.910 --> 01:59:55.040 align:middle line:84%
Those of you who have prior
programming experience

01:59:55.040 --> 01:59:59.300 align:middle line:84%
might recognize the syntax from ternary
operators, doing an if else all in one

01:59:59.300 --> 01:59:59.960 align:middle line:90%
line.

01:59:59.960 --> 02:00:04.190 align:middle line:84%
A question mark colon at the
beginning of that parenthetical

02:00:04.190 --> 02:00:08.160 align:middle line:84%
means, yes, I'm using parentheses
to group these things together,

02:00:08.160 --> 02:00:11.640 align:middle line:84%
but no, you do not need
to capture them instead.

02:00:11.640 --> 02:00:15.500 align:middle line:84%
So I can change my code
back now to matches.group 1.

02:00:15.500 --> 02:00:18.260 align:middle line:84%
I'll clear my screen here,
run python of twitter.py.

02:00:18.260 --> 02:00:24.350 align:middle line:84%
I'll again run here
https://twitter.com/davidjmalan

02:00:24.350 --> 02:00:26.480 align:middle line:90%
with or without the www.

02:00:26.480 --> 02:00:30.590 align:middle line:84%
And now, I indeed get
back that username.
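
NOTE
Editor's sketch of the non-capturing group: "(?:...)" groups without
capturing, so the username becomes group 1 again. The pattern is
reconstructed from the narration.
```python
import re
pattern = re.compile(r"^https?://(?:www\.)?twitter\.com/(.+)$", re.IGNORECASE)
# Only one capturing group remains in the compiled pattern.
group_count = pattern.groups
with_www = pattern.search("https://www.twitter.com/davidjmalan")
without_www = pattern.search("https://twitter.com/davidjmalan")
print(group_count, with_www.group(1), without_www.group(1))  # 1 davidjmalan davidjmalan
```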

02:00:30.590 --> 02:00:37.280 align:middle line:84%
Any questions, then, on
these final techniques?

02:00:37.280 --> 02:00:40.940 align:middle line:84%
AUDIENCE: So first of all,
could we move the ^ right

02:00:40.940 --> 02:00:44.270 align:middle line:84%
at the beginning of Twitter, and
then just start reading from there,

02:00:44.270 --> 02:00:49.700 align:middle line:84%
and then get rid of everything else
before that, the kind of www issues

02:00:49.700 --> 02:00:50.930 align:middle line:90%
that we had?

02:00:50.930 --> 02:00:56.240 align:middle line:84%
And then my second question is,
how would we use kind of, I guess,

02:00:56.240 --> 02:01:01.640 align:middle line:84%
either a list or a dictionary
to sort the .com kind of thing,

02:01:01.640 --> 02:01:05.120 align:middle line:84%
because we have .co.uk,
and that kind of stuff.

02:01:05.120 --> 02:01:08.330 align:middle line:84%
How would we bring that
into the re function?

02:01:08.330 --> 02:01:09.830 align:middle line:90%
DAVID MALAN: A good question but no.

02:01:09.830 --> 02:01:15.560 align:middle line:84%
If I move the ^ before twitter.com and
throw away the protocol and the www,

02:01:15.560 --> 02:01:20.960 align:middle line:84%
then the user is going to have to type
in literally twitter.com/username.

02:01:20.960 --> 02:01:23.040 align:middle line:84%
They can't even type
in that other stuff.

02:01:23.040 --> 02:01:25.170 align:middle line:84%
So that would be a
regression, a step back.

02:01:25.170 --> 02:01:29.120 align:middle line:84%
As for the .com, the .org,
and .edu, and so forth,

02:01:29.120 --> 02:01:31.970 align:middle line:84%
the short answer is there's
many different solutions here.

02:01:31.970 --> 02:01:37.190 align:middle line:84%
If I wanted to be stringent about .com--
and suppose that Twitter probably owns

02:01:37.190 --> 02:01:40.620 align:middle line:84%
multiple domain names, even though
they tend to use just this one.

02:01:40.620 --> 02:01:43.800 align:middle line:84%
Suppose they have something
like .org as well.

02:01:43.800 --> 02:01:47.810 align:middle line:84%
You could use more parentheses here and
do something like this-- com or org.

02:01:47.810 --> 02:01:50.270 align:middle line:84%
I'd probably want to go
in and add a question mark

02:01:50.270 --> 02:01:53.060 align:middle line:84%
colon to make it non-capturing,
because I don't care which

02:01:53.060 --> 02:01:55.100 align:middle line:90%
it is, I just want to tolerate both.

02:01:55.100 --> 02:01:58.220 align:middle line:90%
Alternatively, we could capture that.

02:01:58.220 --> 02:02:01.850 align:middle line:84%
We could do something like
this, where we do dot + so as

02:02:01.850 --> 02:02:03.410 align:middle line:90%
to actually capture that.

02:02:03.410 --> 02:02:05.570 align:middle line:84%
And then we could do
something like this.

02:02:05.570 --> 02:02:13.640 align:middle line:84%
If matches.group 1 now equals equals
com, then we could support this.

02:02:13.640 --> 02:02:18.020 align:middle line:84%
So you could imagine factoring out the
logic just by extracting the Top-Level

02:02:18.020 --> 02:02:21.410 align:middle line:84%
Domain, or TLD, and then just using
Python code, maybe a list, maybe

02:02:21.410 --> 02:02:24.860 align:middle line:84%
a dictionary, to validate
elsewhere, outside of the regex,

02:02:24.860 --> 02:02:26.780 align:middle line:90%
if it's, in fact, what you expect.
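
NOTE
Editor's sketch of validating the top-level domain outside the regex. The
names ALLOWED_TLDS and extract are hypothetical, the .org entry follows the
"suppose" in the lecture, and the class [a-z.]+ is a narrower stand-in for
the "dot plus" mentioned aloud.
```python
import re
ALLOWED_TLDS = ["com", "org"]  # hypothetical list for illustration
def extract(url):
    # Group 1 captures the TLD, group 2 the username.
    matches = re.search(r"^https?://(?:www\.)?twitter\.([a-z.]+)/(.+)$", url, re.IGNORECASE)
    if matches and matches.group(1).lower() in ALLOWED_TLDS:
        return matches.group(2)
    return None
print(extract("https://twitter.com/davidjmalan"))  # davidjmalan
print(extract("https://twitter.edu/davidjmalan"))  # None
```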

02:02:26.780 --> 02:02:28.700 align:middle line:90%
For now, though, we kept things simple.

02:02:28.700 --> 02:02:31.860 align:middle line:84%
We focused only on
the .com in this case.

02:02:31.860 --> 02:02:33.767 align:middle line:84%
Let's make one final
change to this program

02:02:33.767 --> 02:02:36.350 align:middle line:84%
so that we're being a little
more specific with the definition

02:02:36.350 --> 02:02:37.640 align:middle line:90%
of a Twitter username.

02:02:37.640 --> 02:02:41.000 align:middle line:84%
It turns out that we're being a little
too generous over here, whereby we're

02:02:41.000 --> 02:02:43.280 align:middle line:90%
accepting one or more of any character.

02:02:43.280 --> 02:02:45.050 align:middle line:90%
I checked the documentation for Twitter.

02:02:45.050 --> 02:02:48.890 align:middle line:84%
And Twitter only supports letters
of the alphabet, a through z,

02:02:48.890 --> 02:02:53.370 align:middle line:84%
numbers 0 through 9, or
underscores, so not just dot,

02:02:53.370 --> 02:02:55.020 align:middle line:90%
which is literally anything.

02:02:55.020 --> 02:02:57.230 align:middle line:84%
So let me go ahead and
be more precise here.

02:02:57.230 --> 02:02:59.870 align:middle line:84%
At the end of my string,
let me go ahead and say,

02:02:59.870 --> 02:03:03.510 align:middle line:90%
this set of symbols in square brackets.

02:03:03.510 --> 02:03:08.058 align:middle line:84%
I'm going to go ahead and say a through
z, 0 through 9, and an underscore.

02:03:08.058 --> 02:03:10.100 align:middle line:84%
Because, again, those are
the only valid symbols.

02:03:10.100 --> 02:03:12.740 align:middle line:84%
I don't need to bother with both
uppercase and lowercase letters,

02:03:12.740 --> 02:03:16.140 align:middle line:84%
because we're using
re.IGNORECASE over here.

02:03:16.140 --> 02:03:19.760 align:middle line:84%
But I want to make sure now that
I tolerate not only one or more

02:03:19.760 --> 02:03:24.260 align:middle line:84%
of these symbols here but also maybe
some other stuff at the end of the URL.

02:03:24.260 --> 02:03:27.710 align:middle line:84%
I'm now going to be OK with there
being a slash, or a question mark,

02:03:27.710 --> 02:03:31.730 align:middle line:84%
or a hash at the end of the URL, all
of which are valid symbols in a URL,

02:03:31.730 --> 02:03:34.130 align:middle line:84%
but I know from
Twitter's documentation,

02:03:34.130 --> 02:03:36.390 align:middle line:90%
are not part of the username.

02:03:36.390 --> 02:03:36.890 align:middle line:90%
All right.

02:03:36.890 --> 02:03:39.770 align:middle line:84%
Now I'm going to go ahead and
run python of twitter.py one

02:03:39.770 --> 02:03:46.610 align:middle line:84%
final time, typing in
https://twitter.com/davidjmalan, maybe

02:03:46.610 --> 02:03:48.320 align:middle line:90%
with, maybe without a trailing slash.

02:03:48.320 --> 02:03:52.070 align:middle line:84%
But hopefully, with my biggest fingers
crossed here, I'm going to go ahead now

02:03:52.070 --> 02:03:56.630 align:middle line:84%
and hit Enter, and thankfully my
username is, indeed, davidjmalan.
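
NOTE
Editor's sketch of the final pattern as described: a character class for the
username and no "$" anchor, so a trailing slash, query string, or fragment is
tolerated. The function name extract and the extra sample URLs are
assumptions for illustration.
```python
import re
def extract(url):
    # Letters, digits, and underscores only; re.IGNORECASE covers uppercase.
    if matches := re.search(r"^https?://(?:www\.)?twitter\.com/([a-z0-9_]+)", url, re.IGNORECASE):
        return matches.group(1)
    return None
print(extract("https://twitter.com/davidjmalan"))   # davidjmalan
print(extract("https://twitter.com/davidjmalan/"))  # davidjmalan
print(extract("https://twitter.com/davidjmalan?lang=en"))  # davidjmalan
```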

02:03:56.630 --> 02:03:59.300 align:middle line:84%
So what more is there in the
world of regular expressions

02:03:59.300 --> 02:04:00.320 align:middle line:90%
and this re library?

02:04:00.320 --> 02:04:04.340 align:middle line:84%
Beyond re.search and re.sub,
there are other functions, too.

02:04:04.340 --> 02:04:07.850 align:middle line:84%
There's re.split, via which
you can split a string, not just

02:04:07.850 --> 02:04:11.480 align:middle line:84%
using a specific character or
characters like a comma and a space,

02:04:11.480 --> 02:04:14.010 align:middle line:90%
but multiple characters as well.
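
NOTE
A small sketch of re.split, assuming a made-up input string. The separator
here is a pattern (a comma or semicolon with optional surrounding
whitespace) rather than one fixed string, which str.split alone cannot express.

NOTE
```python
import re
# Hypothetical messy input with inconsistent separators and spacing.
raw = "a, b;c ,d"
# Split wherever a comma or semicolon appears, absorbing any whitespace around it.
parts = re.split(r"\s*[,;]\s*", raw)
print(parts)  # ['a', 'b', 'c', 'd']
```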

02:04:14.010 --> 02:04:16.550 align:middle line:84%
And there's even
functions like re.findall,

02:04:16.550 --> 02:04:20.540 align:middle line:84%
which can allow you to search for
multiple copies of the same pattern

02:04:20.540 --> 02:04:23.120 align:middle line:84%
in different places in a
string so that you can perhaps

02:04:23.120 --> 02:04:25.200 align:middle line:90%
manipulate more than just one.
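
NOTE
A small sketch of re.findall, assuming a made-up sentence containing several
@-style usernames. With a single capturing group in the pattern, re.findall
returns a list of just the captured text for every match in the string.

NOTE
```python
import re
# Hypothetical text for illustration; not from the lecture.
text = "Follow @davidjmalan and @cs50 for updates"
# Reuse the same username character class, capturing what follows each @.
usernames = re.findall(r"@([a-z0-9_]+)", text, re.IGNORECASE)
print(usernames)  # ['davidjmalan', 'cs50']
```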

02:04:25.200 --> 02:04:28.820 align:middle line:84%
So at the end of the day now, you've
really learned a whole other language,

02:04:28.820 --> 02:04:31.700 align:middle line:84%
like that of regular expressions,
and we've used them in Python.

02:04:31.700 --> 02:04:35.670 align:middle line:84%
But these regular expressions actually
exist in so many languages, too,

02:04:35.670 --> 02:04:38.930 align:middle line:84%
among them JavaScript, and
Java, and Ruby, and more.

02:04:38.930 --> 02:04:42.300 align:middle line:84%
So with this new language, even
though it's admittedly cryptic

02:04:42.300 --> 02:04:45.050 align:middle line:84%
when you use it for the first time,
you have this newfound ability

02:04:45.050 --> 02:04:48.800 align:middle line:84%
to express these patterns that,
again, you can use to validate data,

02:04:48.800 --> 02:04:53.310 align:middle line:84%
to clean up data, or even extract
data, and from any data set

02:04:53.310 --> 02:04:54.470 align:middle line:90%
you might have in mind.

02:04:54.470 --> 02:04:55.830 align:middle line:90%
That's it for this week.

02:04:55.830 --> 02:04:58.570 align:middle line:90%
We will see you next time.

02:04:58.570 --> 02:05:00.000 align:middle line:90%