[Seminar: Pattern Matching with Regular Expressions] [John Mussman-Harvard University] [This is CS50.-CS50.TV] Okay. Well, welcome everyone. This is CS50 2012. My name is John, and I will be talking today about regular expressions. Regular expressions is primarily a tool, but also sometimes used in code actively to essentially match patterns and strings. So here's a web comic from xkcd. In this comic there is a murder mystery where the killer has  followed somebody on vacation, and the protagonists have to search through 200 megabytes of emails looking for an address. And they are about to give up when someone who knows regular expressions-- presumably a superhero--swoops down and writes some code and solves the murder mystery. So presumably that will be something that you will be empowered to do after this seminar. We are just going to provide a concise introduction to the language and give you enough wherewithal to go after more resources on your own. 

So regular expressions look basically like this. This is a regular expression in Ruby. It is not terribly different across languages. We have just on slashes to begin and mark the regular expression in Ruby. And this is a regular expression to look for in email address pattern. So we see at the first bit looks for any alphanumeric character. That is because email addresses often have to start with an alphabetical character. And then any special character followed by the @ symbol. And then the same thing for domain name. And then between 2 and 4 characters to look for the .com, .net, and so on. So that is another example of regular expression. So regular expressions are protocols for finding patters in text. They do comparisons, selections, and replacements. So a third example is finding all the phone numbers ending in 54 in a directory. So before David rips up the CS50 directory we could search for a pattern where we have parentheses then 3 numbers then end parenthesis, 3 more numbers, a dash, 2 numbers, and then 54. And that would be essentially how we come up with a regular expression to search for that. 

So there are--we have done some things in CS50 that are a little bit like regular expressions, so--for example--in the dictionary.C file  for the spell check problem set you may have used fscanf to read in a word from the dictionary. And you can see the percentage 45s is looking for a string of 45 characters. So it is somewhat like a rudimentary regular expression. And you can have any 45 characters that fit the bill in there and pick those up. And then the second example in the most recent web programming problem set in the distro code for php we actually do have a simple regular expression. And this one is just simply looking to check if the web page that is passed in matches either login or logout register .PHP. And then returning true or false based on that regular expression matching. 

So when do you use regular expression? Why are you here today? So you don't want to use regular expression when there's something that does the job for you even more easily. So XML and HTML are actually pretty tricky to write regular expressions for as we will see in a little bit. So there are dedicated parsers for those languages. You also need to be okay with the trade offs and accuracy frequently. If you are trying--so we saw a regular expression for an email address, but say you wanted a specific email address and gradually the  regular expression might become more complex as it became more precise. So that would be one trade off. You have to be sure that you are okay making with the regular expression. If you know exactly what you are looking for it might make more sense to put in the time and write a more effective parser. And finally there is a historical issue with the regularity of expressions and languages. Regular expressions are actually much more powerful than regular expressions per say in a formal sense. 

So I don't want to go too far into the formal theory, but most languages that we code in actually are not regular. And this is why regular expressions sometimes are not considered all that secure. So basically there is a Chomsky hierarchy for languages, and regular expressions are build up using union, concatenation,  and the Kleene star operation that we will see in a few minutes. If you are interested in theory there is quite a lot going on there under the hood. 

So a brief history--just for the context here--regular sets came up in the 1950s, and then we had simple editors that incorporated regular expressions--just searching for strings. Grep--which is a command line tool--was one of the first  very popular tools that incorporated regular expressions in the 1960s. In the '80s, Perl was built--is a programming language that incorporates regular expressions very prominently. And then more recently we have had Perl compatible regular expression protocols basically in other languages that use much of the same syntax. Of course the most important event was in 2008 where there was the first National Regular Expressions Day, which I believe is June 1 if you want to celebrate that. 

Again, just a little bit more theory here. So there are a couple different ways of constructing regular expressions. One simple way is to build the expression that you are going to run on the string interpret--basically build a little mini-program that will analyze pieces of a string and see, "Oh, does this fit the regular expression or not?" And then run that. So if you have a very small regular expression, this is probably  the most efficient way to do it. And then if you--another option is to keep reconstructing the expression as you go, and that is the simulate possibility. And these early attempts at regular expression algorithms were relatively simple and relatively fast, but didn't have a lot of flexibility. So to do even some of the things that we are going to look at today we have had to do more complex regular expression implementations that are potentially much slower; so that is something to bear in mind There's also a regular expressions denial of attack variety that exploit the potential for these newer implementations of  regular expressions to become very complex. And in much the same sense that we saw in buffer overflow attacks, you have attacks that work by making recursive loops that overrun the capacity of memory. And by the way Regexen is one of the official plurals of regular expression by analogy to oxen in the Anglo-Saxon. 

Okay, so the Python Library many of you here in person have Macs, so you can actually pull this up on your screen. Regular expressions are built into Python. And so Python is preloaded on Macs and also available online at this link. So if you are watching you can pause and make sure you have Python as we play around here. There is a manual online, so if you just type Python into your computer you will see that the version comes up in the terminal. So I provided a link to the manual for Version 2 of Python as well as a cheat sheet. There is a Version 3 of Python, but your Mac doesn't necessarily come with that preloaded. So not terribly different. Okay, so some basics of using regular expressions in Python. 

So here I used a very simple expression, so I did Python import re and then took the result of re.search. And the search takes 2 arguments. The first is the regular expression, and the second is the text or string you want to analyze. And then I printed out the result.group. So these are the 2 basic functions we are going to see today in learning about regular expressions. So just breaking down this regular expression here h and then \w and then m so \w just accepts any alphabetical character in there. So here we are looking for an "h" and then another alphabetical character and then m, so here that would match ham in, "Abraham Lincoln and ham sandwiches." This is the result of that group. Another thing that we can do is use our before strings of text in Python. So I guess I will go ahead and pull that up here. Python import re. And if I were to do the same thing--let us say text is, "Abraham," let us zoom in--there we go. Text is, "Abraham eats ham." Okay, and then result = re.search. And then our expression can be h, and then I will do dot m. So dot just takes any character that is not a new line including numbers, percentage signs, anything like that. And then text--boom--and then result.group--yeah. So that is just how to implement basic functionality here. If we had a text ring that--that crazy text--included say lots of back slashes and strings inside and things that could look like escape sequences, then we probably want to use the raw text input to make sure that is accepted. And that just looks like that. So if we were looking for each of them in there we shouldn't find anything. But that is how you would implement it; just before the string of  the regular expression you put the letter r. 

Okay, so let us keep going. All right--so let us look at a couple repetitive patterns here. So one thing that you want to do is repeat things as you are searching through text. So to do a followed by any number of b--you do ab*. And then there are a series of other rules too. And you can look all of these up; I'll just run through some of the most commonly used ones. So ab+ is a followed by any N greater than 0 of b. ab? is a followed by 0 or 1 of b. ab{N} is a followed by N of b, and then so on. If you have 2 numbers in the curly braces you are specifying a range that can be possibly matched. So we will look more at a couple repetitive patterns in a minute. So 2 things to keep in mind when using these pattern matching tools here. So say we want to look at the h.m of, "Abraham Lincoln makes ham sandwiches." So I changed Abraham Lincoln's name to Abraham. And now we are looking for what is returned by this search function, and it only returns ham in this case. And it does that because search just naturally takes the left most queue. And all regular expressions unless you specify otherwise will do that. If we wanted to find all there is a function for that--find all. So that could just look like all = re.findall('h.m', text)  and then all.group(). All produces both ham and ham; in this case both of the strings in Abraham each ham. So that is another option. 

Great. The other thing to keep in mind is that regular expressions take the largest intuitively. Let us look at this example. We did that left most search here, and then I attempted a larger search using the Kleene star operator. So for, "Abraham Lincoln makes ham sandwiches," and I only got back m as a result. The reason for that mistake was that I could have taken any number of h's because I didn't specify anything to go in between h and m. The only example there that had m--the only examples there with m in it and any number of h's were just the string m. Then I tried it again; I said, "Okay, let us get the actual largest group here." And then I did h.*m, so that just returns any number of characters between h and m. And if you are just starting out and thinking, "Oh, okay, well this will get me ham," it actually takes everything from the h in Abraham Lincoln all the way up to the end of ham. It is greedy; it sees h--all this other text--m, and that is what it takes in. This is a particularly egregious--this is a feature we can also  specify for it not be greedy using other functions. But this is something we have to keep in mind especially  when looking at HTML text, which is one reason that regular expressions are difficult for HTML. Because if you have an HTML open tag and then lots of stuff in the middle and then some other HTML closed tag much later in the program, you have just eaten up a lot of your HTML code possibly by mistake. 

All right--so more special characters, like many other languages, we escape using the slash. So we can use the dot to specify any character except for a new line. We can use the escape w to specify any alphabetical character. And by analogy escape d for any integer--numerical character. We can specify--we can use brackets to specify related expressions. So this would accept a, b, or c. And we can also specify or options for either a or b. For example--if we were looking for multiple possibilities in brackets we could use the or operator as in-- so let us go back to this example here. And now let us take--let us go back to this example here, and then take ae--so this should return--I guess this is still Abraham. So this--if we do all--great. So let us update the text here. "Abraham eats ham while hemming his--while hemming." Great. All. Great. Now we get ham, ham, and hem. While hemming--while humming to him--while humming to hem him. Great. Same thing. Now all returns still just ham, ham, and hem without picking up on the hum or the him. Great--so what if we wanted to look at either that--so we could also do him or--we will come back to that. Okay--so--all right--in positions you can also use the caret or the dollar sign to specify that you are looking for something at the start or the end of a string. Or the start or the end of a word. That is one way to use that. 

Okay--so let us play around with a slightly larger block of text. Let us say this row here--this statement here. The  power of regular expression is that they can specify patterns not just fixed characters. Let us make--let us call this block. Then we will read all of that in. And then have a--let us make all =; so what are some things we could search in here profitably?  We could look for the expression ear. Not very interesting. How about that? We'll see what happens. I gave it a problem. So any number of things before re and all. So that should return everything from the beginning up to all re perhaps a couple times. And then here we have the power of regular expressions is that they can specify patterns not just characters here are. So all the way up to the final re, it started with the left most and was greedy. Let us see--what else could we look for. I guess one thing if you were interested in looking for the pronouns she and he, you could check for s being equal to 0 or 1  and the expression he, and that is probably not going to return-- oh, I guess it returned he because there we are looking at the power, that day, here are. Let us try specifying that this has to come at the start of something. Let us see if that drops off. So we can do fat, and there we don't get anything because she and he don't occur in this phrase. Great. Okay--so back to the cat here. So complex patterns is hurting the brain. So that is why we use regular expressions to avoid these issues. 

So here are some other useful modes you can play around with. We looked at search today, but you can also use match, split, findall, and groups. So other cool things you can do with regular expressions besides just looking for patterns is taking a pattern and holding all the matches-- its variables--and then using those in your code later on. That can be quite helpful. Other things might be counting. So we can count the number of instances of a regular expression pattern, and that is what we can use groups for. And other modes as well are also possible. So I just want to talk a little bit more about other ways you can use regular expressions. 

So one more advanced application is in fuzzy matching. So if you are looking for a text for the expression, Julius Caesar,  and you see either Gaius Julius Caesar or the name Julius Caesar  in other languages, then you might also want to assign some weight to those values. And if it is close enough--if it crosses a certain threshold--then you want to be able to accept Julius Caesar. So there are a couple different implementations for that in a few other languages as well. Here are some other tools, Regex Pal--a handy little app online to check if your regular expressions are composed correctly. There are also standalone tools that you can run from your desktop like Ultra Pico, and as well as just cookbooks. So if you are doing a project that involves a ton of regular expressions this is probably the place to go outside the scope of today. And then just to give you a sense of how common it is there is grep in Unix, Perl has built-in, and C there is PCRE for C. And then all these other languages also have regular expression packages that operate with essentially the same syntax we got a taste of today. PHP, Java, Ruby, and so on. 

Google Code Search is actually worth mentioning; it is one of the relatively few applications out there that allows the public to access its database using regular expressions. So if you look on Google Code Search you can find code if you are looking for an instance of how a function might be used, you can use a regular expression to find that function being used in all sorts of different cases. You could look for fwrite, and then you could look for the flag of write or read if you wanted an example of fwrite being used in that case. So the same thing there, and here are some references. This will be available online as well, so going forwards if you want to look at Python, grep, Perl--you just want to get some inspiration or if you want to look more at the theory here are some good jumping off places. Thank you very much. [CS50.TV]