{"captions":[{"content":"[Seminar: Pattern Matching with Regular Expressions]","startTime":0,"duration":2000,"startOfParagraph":false},{"content":"[John Mussman-Harvard University]","startTime":2000,"duration":2000,"startOfParagraph":false},{"content":"[This is CS50.-CS50.TV]","startTime":4000,"duration":3220,"startOfParagraph":false},{"content":"Okay. Well, welcome everyone. This is CS50 2012.","startTime":7780,"duration":3830,"startOfParagraph":false},{"content":"My name is John, and I will be talking today about regular expressions.","startTime":11780,"duration":4830,"startOfParagraph":false},{"content":"Regular expressions is primarily a tool, but also sometimes used","startTime":16610,"duration":5920,"startOfParagraph":false},{"content":"in code actively to essentially match patterns and strings.","startTime":22530,"duration":6120,"startOfParagraph":false},{"content":"So here's a web comic from xkcd.","startTime":28650,"duration":5150,"startOfParagraph":false},{"content":"In this comic there is a murder mystery where the killer has ","startTime":34440,"duration":7930,"startOfParagraph":false},{"content":"followed somebody on vacation, and the protagonists have to","startTime":42370,"duration":5490,"startOfParagraph":false},{"content":"search through 200 megabytes of emails looking for an address.","startTime":47860,"duration":4640,"startOfParagraph":false},{"content":"And they are about to give up when someone who knows regular expressions--","startTime":52500,"duration":3590,"startOfParagraph":false},{"content":"presumably a superhero--swoops down and writes some code","startTime":56090,"duration":4460,"startOfParagraph":false},{"content":"and solves the murder mystery.","startTime":60550,"duration":2420,"startOfParagraph":false},{"content":"So presumably that will be something that you will be empowered to do","startTime":62970,"duration":4400,"startOfParagraph":false},{"content":"after this 
seminar.","startTime":67370,"duration":2000,"startOfParagraph":false},{"content":"We are just going to provide a concise introduction to the language","startTime":69370,"duration":2880,"startOfParagraph":false},{"content":"and give you enough wherewithal to go after more resources on your own.","startTime":72250,"duration":4520,"startOfParagraph":false},{"content":"So regular expressions look basically like this.","startTime":77680,"duration":4020,"startOfParagraph":true},{"content":"This is a regular expression in Ruby.","startTime":82930,"duration":2620,"startOfParagraph":false},{"content":"It is not terribly different across languages.","startTime":85550,"duration":3730,"startOfParagraph":false},{"content":"We just have slashes to begin and mark the regular expression in Ruby.","startTime":89690,"duration":7940,"startOfParagraph":false},{"content":"And this is a regular expression to look for an email address pattern.","startTime":97630,"duration":5250,"startOfParagraph":false},{"content":"So we see that the first bit looks for any alphanumeric character.","startTime":102880,"duration":6280,"startOfParagraph":false},{"content":"That is because email addresses often have to start with an alphabetical character.","startTime":110500,"duration":4380,"startOfParagraph":false},{"content":"And then any special character followed by the @ symbol.","startTime":115460,"duration":3870,"startOfParagraph":false},{"content":"And then the same thing for the domain name.","startTime":119330,"duration":3930,"startOfParagraph":false},{"content":"And then between 2 and 4 characters to look for the .com, .net, and so on.","startTime":123260,"duration":6770,"startOfParagraph":false},{"content":"So that is another example of a regular expression.","startTime":130850,"duration":2350,"startOfParagraph":false},{"content":"So regular expressions are protocols for finding patterns in text.","startTime":133200,"duration":4070,"startOfParagraph":false},{"content":"They do comparisons, selections, 
and replacements.","startTime":137270,"duration":3860,"startOfParagraph":false},{"content":"So a third example is finding all the phone numbers ending in 54 in a directory.","startTime":141690,"duration":6280,"startOfParagraph":false},{"content":"So before David rips up the CS50 directory we could search for","startTime":147970,"duration":6390,"startOfParagraph":false},{"content":"a pattern where we have parentheses then 3 numbers then end parenthesis,","startTime":154360,"duration":6090,"startOfParagraph":false},{"content":"3 more numbers, a dash, 2 numbers, and then 54.","startTime":160450,"duration":3620,"startOfParagraph":false},{"content":"And that would be essentially how we come up with a regular expression to search for that.","startTime":164070,"duration":4240,"startOfParagraph":false},{"content":"So there are--we have done some things in CS50 that are a little bit like","startTime":169150,"duration":3810,"startOfParagraph":true},{"content":"regular expressions, so--for example--in the dictionary.C file ","startTime":172960,"duration":6780,"startOfParagraph":false},{"content":"for the spell check problem set you may have used fscanf","startTime":179740,"duration":4980,"startOfParagraph":false},{"content":"to read in a word from the dictionary.","startTime":184720,"duration":3210,"startOfParagraph":false},{"content":"And you can see the percentage 45s is looking for a string of 45 characters.","startTime":187930,"duration":8310,"startOfParagraph":false},{"content":"So it is somewhat like a rudimentary regular expression.","startTime":196240,"duration":3780,"startOfParagraph":false},{"content":"And you can have any 45 characters that fit the bill in there","startTime":201150,"duration":4910,"startOfParagraph":false},{"content":"and pick those up.","startTime":206060,"duration":2020,"startOfParagraph":false},{"content":"And then the second example in the most recent web programming 
problem","startTime":208080,"duration":5400,"startOfParagraph":false},{"content":"set in the distro code for php we actually do have a simple regular expression.","startTime":213480,"duration":7280,"startOfParagraph":false},{"content":"And this one is just simply looking to check if the web page that is passed in","startTime":220760,"duration":6030,"startOfParagraph":false},{"content":"matches either login or logout register .PHP.","startTime":226790,"duration":5150,"startOfParagraph":false},{"content":"And then returning true or false based on that regular expression matching.","startTime":232220,"duration":5690,"startOfParagraph":false},{"content":"So when do you use regular expression?","startTime":239400,"duration":2340,"startOfParagraph":true},{"content":"Why are you here today?","startTime":241740,"duration":3080,"startOfParagraph":false},{"content":"So you don't want to use regular expression when there's something that","startTime":245330,"duration":3150,"startOfParagraph":false},{"content":"does the job for you even more easily.","startTime":248480,"duration":3160,"startOfParagraph":false},{"content":"So XML and HTML are actually pretty tricky","startTime":251640,"duration":3870,"startOfParagraph":false},{"content":"to write regular expressions for as we will see in a little bit.","startTime":255510,"duration":2970,"startOfParagraph":false},{"content":"So there are dedicated parsers for those languages.","startTime":259110,"duration":4170,"startOfParagraph":false},{"content":"You also need to be okay with the trade offs and accuracy frequently.","startTime":264170,"duration":5890,"startOfParagraph":false},{"content":"If you are trying--so we saw a regular expression for an email address,","startTime":270060,"duration":6160,"startOfParagraph":false},{"content":"but say you wanted a specific email address and gradually the ","startTime":277370,"duration":5220,"startOfParagraph":false},{"content":"regular expression might become more complex as it became more 
precise.","startTime":282590,"duration":5980,"startOfParagraph":false},{"content":"So that would be one trade-off","startTime":289580,"duration":2680,"startOfParagraph":false},{"content":"you have to be sure that you are okay making with the regular expression.","startTime":292260,"duration":3070,"startOfParagraph":false},{"content":"If you know exactly what you are looking for it might make more sense","startTime":295330,"duration":2590,"startOfParagraph":false},{"content":"to put in the time and write a more effective parser.","startTime":297920,"duration":4150,"startOfParagraph":false},{"content":"And finally there is a historical issue with the regularity","startTime":302070,"duration":4910,"startOfParagraph":false},{"content":"of expressions and languages.","startTime":306980,"duration":1960,"startOfParagraph":false},{"content":"Regular expressions are actually much more powerful than","startTime":308940,"duration":4020,"startOfParagraph":false},{"content":"regular expressions per se in a formal sense.","startTime":312960,"duration":3490,"startOfParagraph":false},{"content":"So I don't want to go too far into the formal theory,","startTime":317130,"duration":3020,"startOfParagraph":true},{"content":"but most languages that we code in actually are not regular.","startTime":320150,"duration":3850,"startOfParagraph":false},{"content":"And this is why regular expressions sometimes are not considered all that secure.","startTime":324000,"duration":5110,"startOfParagraph":false},{"content":"So basically there is a Chomsky hierarchy for languages,","startTime":329670,"duration":3480,"startOfParagraph":false},{"content":"and regular expressions are built up using union, concatenation, ","startTime":333150,"duration":5250,"startOfParagraph":false},{"content":"and the Kleene star operation that we will see in a few minutes.","startTime":338400,"duration":3410,"startOfParagraph":false},{"content":"If you are interested in theory there is quite a lot going on there 
under the hood.","startTime":343130,"duration":5730,"startOfParagraph":false},{"content":"So a brief history--just for the context here--regular sets came up","startTime":350360,"duration":5520,"startOfParagraph":true},{"content":"in the 1950s, and then we had simple editors that","startTime":355880,"duration":3700,"startOfParagraph":false},{"content":"incorporated regular expressions--just searching for strings.","startTime":359580,"duration":3720,"startOfParagraph":false},{"content":"Grep--which is a command line tool--was one of the first ","startTime":363570,"duration":5540,"startOfParagraph":false},{"content":"very popular tools that incorporated regular expressions in the 1960s.","startTime":369110,"duration":5050,"startOfParagraph":false},{"content":"In the '80s, Perl was built--is a programming language that","startTime":374160,"duration":6400,"startOfParagraph":false},{"content":"incorporates regular expressions very prominently.","startTime":380560,"duration":3550,"startOfParagraph":false},{"content":"And then more recently we have had Perl compatible regular expression","startTime":384550,"duration":5580,"startOfParagraph":false},{"content":"protocols basically in other languages that use much of the same syntax.","startTime":390130,"duration":5740,"startOfParagraph":false},{"content":"Of course the most important event was in 2008","startTime":396630,"duration":3210,"startOfParagraph":false},{"content":"where there was the first National Regular Expressions Day,","startTime":399840,"duration":3200,"startOfParagraph":false},{"content":"which I believe is June 1 if you want to celebrate that.","startTime":403040,"duration":4310,"startOfParagraph":false},{"content":"Again, just a little bit more theory here.","startTime":408430,"duration":2410,"startOfParagraph":true},{"content":"So there are a couple different ways of constructing regular expressions.","startTime":412180,"duration":3140,"startOfParagraph":false},{"content":"One simple way is to build the 
expression that you are going to","startTime":415950,"duration":6100,"startOfParagraph":false},{"content":"run on the string interpret--basically build a little mini-program that","startTime":422050,"duration":5450,"startOfParagraph":false},{"content":"will analyze pieces of a string and see, \"Oh, does this fit the regular expression or not?\"","startTime":427500,"duration":4370,"startOfParagraph":false},{"content":"And then run that.","startTime":432250,"duration":2000,"startOfParagraph":false},{"content":"So if you have a very small regular expression, this is probably ","startTime":434250,"duration":3050,"startOfParagraph":false},{"content":"the most efficient way to do it.","startTime":437300,"duration":2080,"startOfParagraph":false},{"content":"And then if you--another option is to keep reconstructing the","startTime":440090,"duration":5330,"startOfParagraph":false},{"content":"expression as you go, and that is the simulate possibility.","startTime":445420,"duration":4840,"startOfParagraph":false},{"content":"And these early attempts at regular expression algorithms were","startTime":450440,"duration":7250,"startOfParagraph":false},{"content":"relatively simple and relatively fast, but didn't have a lot of flexibility.","startTime":457690,"duration":6640,"startOfParagraph":false},{"content":"So to do even some of the things that we are going to look at","startTime":464330,"duration":3170,"startOfParagraph":false},{"content":"today we have had to do more complex regular expression","startTime":467500,"duration":5360,"startOfParagraph":false},{"content":"implementations that are potentially much slower; so that is something to bear in mind.","startTime":472860,"duration":3790,"startOfParagraph":false},{"content":"There's also a regular expression denial-of-service attack variety","startTime":477510,"duration":5410,"startOfParagraph":false},{"content":"that exploits the potential for these newer implementations of 
","startTime":482920,"duration":5410,"startOfParagraph":false},{"content":"regular expressions to become very complex.","startTime":488330,"duration":2600,"startOfParagraph":false},{"content":"And in much the same sense that we saw in buffer overflow attacks,","startTime":491570,"duration":4080,"startOfParagraph":false},{"content":"you have attacks that work by making recursive loops that","startTime":495650,"duration":5960,"startOfParagraph":false},{"content":"overrun the capacity of memory.","startTime":501610,"duration":2790,"startOfParagraph":false},{"content":"And by the way Regexen is one of the official plurals of regular expression","startTime":504780,"duration":4760,"startOfParagraph":false},{"content":"by analogy to oxen in the Anglo-Saxon.","startTime":509540,"duration":3350,"startOfParagraph":false},{"content":"Okay, so the Python Library many of you here in person have Macs,","startTime":513500,"duration":6669,"startOfParagraph":true},{"content":"so you can actually pull this up on your screen.","startTime":520169,"duration":3691,"startOfParagraph":false},{"content":"Regular expressions are built into Python.","startTime":523860,"duration":3620,"startOfParagraph":false},{"content":"And so Python is preloaded on Macs and also available online at this link.","startTime":528070,"duration":4950,"startOfParagraph":false},{"content":"So if you are watching you can pause and make sure you have Python","startTime":533770,"duration":3580,"startOfParagraph":false},{"content":"as we play around here.","startTime":538080,"duration":2090,"startOfParagraph":false},{"content":"There is a manual online, so if you just type Python into your computer","startTime":540780,"duration":5640,"startOfParagraph":false},{"content":"you will see that the version comes up in the terminal.","startTime":546420,"duration":4080,"startOfParagraph":false},{"content":"So I provided a link to the manual for Version 2 of Python as well as a cheat 
sheet.","startTime":551070,"duration":6650,"startOfParagraph":false},{"content":"There is a Version 3 of Python, but your Mac doesn't necessarily","startTime":557720,"duration":5380,"startOfParagraph":false},{"content":"come with that preloaded.","startTime":563100,"duration":2030,"startOfParagraph":false},{"content":"So not terribly different.","startTime":565130,"duration":2230,"startOfParagraph":false},{"content":"Okay, so some basics of using regular expressions in Python.","startTime":567360,"duration":5910,"startOfParagraph":false},{"content":"So here I used a very simple expression, so I did Python import re","startTime":574080,"duration":8570,"startOfParagraph":true},{"content":"and then took the result of re.search.","startTime":583750,"duration":3320,"startOfParagraph":false},{"content":"And the search takes 2 arguments.","startTime":587070,"duration":2840,"startOfParagraph":false},{"content":"The first is the regular expression, and the second is the text","startTime":589910,"duration":6130,"startOfParagraph":false},{"content":"or string you want to analyze.","startTime":596040,"duration":2250,"startOfParagraph":false},{"content":"And then I printed out the result.group.","startTime":598290,"duration":2920,"startOfParagraph":false},{"content":"So these are the 2 basic functions we are going to see today","startTime":601580,"duration":4280,"startOfParagraph":false},{"content":"in learning about regular expressions.","startTime":606790,"duration":3380,"startOfParagraph":false},{"content":"So just breaking down this regular expression here","startTime":610170,"duration":2710,"startOfParagraph":false},{"content":"h and then \\w and then m so \\w just accepts any alphabetical character in there.","startTime":612880,"duration":8890,"startOfParagraph":false},{"content":"So here we are looking for an \"h\" and then another alphabetical character","startTime":621850,"duration":4970,"startOfParagraph":false},{"content":"and then m, so here that would match 
ham","startTime":626820,"duration":3240,"startOfParagraph":false},{"content":"in, \"Abraham Lincoln and ham sandwiches.\"","startTime":630060,"duration":4420,"startOfParagraph":false},{"content":"This is the result of that group.","startTime":635040,"duration":2110,"startOfParagraph":false},{"content":"Another thing that we can do is use r before strings of text in Python.","startTime":637680,"duration":5450,"startOfParagraph":false},{"content":"So I guess I will go ahead and pull that up here.","startTime":643130,"duration":3090,"startOfParagraph":false},{"content":"Python import re.","startTime":646220,"duration":2990,"startOfParagraph":false},{"content":"And if I were to do the same thing--let us say text is,","startTime":650070,"duration":3930,"startOfParagraph":false},{"content":"\"Abraham,\" let us zoom in--there we go.","startTime":655390,"duration":5410,"startOfParagraph":false},{"content":"Text is, \"Abraham eats ham.\"","startTime":661610,"duration":4820,"startOfParagraph":false},{"content":"Okay, and then result = re.search.","startTime":667460,"duration":7800,"startOfParagraph":false},{"content":"And then our expression can be h, and then I will do dot m.","startTime":676260,"duration":5760,"startOfParagraph":false},{"content":"So dot just takes any character that is not a newline including numbers,","startTime":682020,"duration":4260,"startOfParagraph":false},{"content":"percentage signs, anything like that.","startTime":686280,"duration":2370,"startOfParagraph":false},{"content":"And then text--boom--and then result.group--yeah.","startTime":688650,"duration":9380,"startOfParagraph":false},{"content":"So that is just how to implement basic functionality here.","startTime":698030,"duration":3790,"startOfParagraph":false},{"content":"If we had a text string that--that crazy text--included say lots of backslashes","startTime":702300,"duration":12810,"startOfParagraph":false},{"content":"and strings inside and things that could look like escape 
sequences,","startTime":715110,"duration":6070,"startOfParagraph":false},{"content":"then we probably want to use the raw text input to make sure that is accepted.","startTime":721180,"duration":7300,"startOfParagraph":false},{"content":"And that just looks like that.","startTime":728480,"duration":5640,"startOfParagraph":false},{"content":"So if we were looking for each of them in there we shouldn't find anything.","startTime":734120,"duration":3690,"startOfParagraph":false},{"content":"But that is how you would implement it; just before the string of ","startTime":739070,"duration":2610,"startOfParagraph":false},{"content":"the regular expression you put the letter r.","startTime":741680,"duration":3310,"startOfParagraph":false},{"content":"Okay, so let us keep going.","startTime":746150,"duration":4110,"startOfParagraph":true},{"content":"All right--so let us look at a couple repetitive patterns here.","startTime":750260,"duration":3470,"startOfParagraph":false},{"content":"So one thing that you want to do is repeat things","startTime":754750,"duration":4400,"startOfParagraph":false},{"content":"as you are searching through text.","startTime":760040,"duration":2440,"startOfParagraph":false},{"content":"So to do a followed by any number of b--you do ab*.","startTime":762480,"duration":5820,"startOfParagraph":false},{"content":"And then there are a series of other rules too.","startTime":768630,"duration":2990,"startOfParagraph":false},{"content":"And you can look all of these up; I'll just run through some of the","startTime":771620,"duration":2760,"startOfParagraph":false},{"content":"most commonly used ones.","startTime":774380,"duration":3250,"startOfParagraph":false},{"content":"So ab+ is a followed by any N greater than 0 of b.","startTime":777630,"duration":6290,"startOfParagraph":false},{"content":"ab? 
is a followed by 0 or 1 of b.","startTime":784510,"duration":3490,"startOfParagraph":false},{"content":"ab{N} is a followed by N of b, and then so on.","startTime":789190,"duration":9390,"startOfParagraph":false},{"content":"If you have 2 numbers in the curly braces you are specifying a range","startTime":798580,"duration":4240,"startOfParagraph":false},{"content":"that can be possibly matched.","startTime":803300,"duration":2140,"startOfParagraph":false},{"content":"So we will look more at a couple repetitive patterns in a minute.","startTime":806390,"duration":4030,"startOfParagraph":false},{"content":"So 2 things to keep in mind when using these pattern matching tools here.","startTime":811960,"duration":10340,"startOfParagraph":false},{"content":"So say we want to look at the h.m of, \"Abraham Lincoln makes ham sandwiches.\"","startTime":822300,"duration":9820,"startOfParagraph":false},{"content":"So I changed Abraham Lincoln's name to Abraham.","startTime":832120,"duration":3110,"startOfParagraph":false},{"content":"And now we are looking for what is returned by this search function,","startTime":835230,"duration":5060,"startOfParagraph":false},{"content":"and it only returns ham in this case.","startTime":840290,"duration":2980,"startOfParagraph":false},{"content":"And it does that because search just naturally takes the leftmost match.","startTime":843620,"duration":4460,"startOfParagraph":false},{"content":"And all regular expressions unless you specify otherwise will do that.","startTime":848080,"duration":4050,"startOfParagraph":false},{"content":"If we wanted to find all there is a function for that--findall.","startTime":852830,"duration":6050,"startOfParagraph":false},{"content":"So that could just look like all = re.findall('h.m', text) ","startTime":858880,"duration":16220,"startOfParagraph":false},{"content":"and then all.group().","startTime":875100,"duration":9440,"startOfParagraph":false},{"content":"All produces both ham and ham; in this case 
both of the strings in Abraham eats ham.","startTime":884540,"duration":6500,"startOfParagraph":false},{"content":"So that is another option.","startTime":891610,"duration":3500,"startOfParagraph":false},{"content":"Great. The other thing to keep in mind is that regular expressions take the largest intuitively.","startTime":896250,"duration":10690,"startOfParagraph":true},{"content":"Let us look at this example.","startTime":906940,"duration":2580,"startOfParagraph":false},{"content":"We did that leftmost search here, and then I attempted a larger search","startTime":910200,"duration":5870,"startOfParagraph":false},{"content":"using the Kleene star operator.","startTime":916070,"duration":2730,"startOfParagraph":false},{"content":"So for, \"Abraham Lincoln makes ham sandwiches,\" and I only got back","startTime":918800,"duration":5380,"startOfParagraph":false},{"content":"m as a result.","startTime":924180,"duration":2100,"startOfParagraph":false},{"content":"The reason for that mistake was that I could have taken any number of","startTime":926280,"duration":5390,"startOfParagraph":false},{"content":"h's because I didn't specify anything to go in between h and m.","startTime":931670,"duration":4470,"startOfParagraph":false},{"content":"The only example there that had m--the only examples there with m in it","startTime":936140,"duration":5870,"startOfParagraph":false},{"content":"and any number of h's were just the string m.","startTime":942010,"duration":4210,"startOfParagraph":false},{"content":"Then I tried it again; I said, \"Okay, let us get the actual largest group here.\"","startTime":946490,"duration":5360,"startOfParagraph":false},{"content":"And then I did h.*m, so that just returns any number of characters between h and m.","startTime":951850,"duration":7820,"startOfParagraph":false},{"content":"And if you are just starting out and thinking, \"Oh, okay, well this will","startTime":960280,"duration":2670,"startOfParagraph":false},{"content":"get me ham,\" 
it actually takes everything from the h in Abraham Lincoln","startTime":962950,"duration":8610,"startOfParagraph":false},{"content":"all the way up to the end of ham.","startTime":971560,"duration":2130,"startOfParagraph":false},{"content":"It is greedy; it sees h--all this other text--m,","startTime":974040,"duration":4070,"startOfParagraph":false},{"content":"and that is what it takes in.","startTime":978110,"duration":3170,"startOfParagraph":false},{"content":"This is a particularly egregious--this is a feature we can also ","startTime":982060,"duration":5420,"startOfParagraph":false},{"content":"specify for it not to be greedy using other functions.","startTime":987480,"duration":3190,"startOfParagraph":false},{"content":"But this is something we have to keep in mind especially ","startTime":991480,"duration":3010,"startOfParagraph":false},{"content":"when looking at HTML text, which is one reason that","startTime":994490,"duration":4230,"startOfParagraph":false},{"content":"regular expressions are difficult for HTML.","startTime":998720,"duration":2780,"startOfParagraph":false},{"content":"Because if you have an HTML opening tag and then lots of stuff in the middle","startTime":1002460,"duration":3850,"startOfParagraph":false},{"content":"and then some other HTML closing tag much later in the program,","startTime":1006310,"duration":3510,"startOfParagraph":false},{"content":"you have just eaten up a lot of your HTML code possibly by mistake.","startTime":1009820,"duration":5600,"startOfParagraph":false},{"content":"All right--so more special characters, like many other languages,","startTime":1016200,"duration":5640,"startOfParagraph":true},{"content":"we escape using the backslash.","startTime":1021840,"duration":2940,"startOfParagraph":false},{"content":"So we can use the dot to specify any character except for a newline.","startTime":1024780,"duration":5549,"startOfParagraph":false},{"content":"We can use the escape w to specify any alphabetical 
character.","startTime":1030329,"duration":4221,"startOfParagraph":false},{"content":"And by analogy escape d for any integer--numerical character.","startTime":1034550,"duration":5779,"startOfParagraph":false},{"content":"We can specify--we can use brackets to specify related expressions.","startTime":1040630,"duration":6810,"startOfParagraph":false},{"content":"So this would accept a, b, or c.","startTime":1047440,"duration":3530,"startOfParagraph":false},{"content":"And we can also specify or options for either a or b.","startTime":1051320,"duration":5680,"startOfParagraph":false},{"content":"For example--if we were looking for multiple possibilities","startTime":1057000,"duration":4110,"startOfParagraph":false},{"content":"in brackets we could use the or operator as in--","startTime":1061110,"duration":3830,"startOfParagraph":false},{"content":"so let us go back to this example here.","startTime":1064940,"duration":7540,"startOfParagraph":false},{"content":"And now let us take--let us go back to this example here, and then","startTime":1073000,"duration":6790,"startOfParagraph":false},{"content":"take ae--so this should return--I guess this is still Abraham.","startTime":1079790,"duration":12500,"startOfParagraph":false},{"content":"So this--if we do all--great.","startTime":1092290,"duration":5120,"startOfParagraph":false},{"content":"So let us update the text here.","startTime":1097410,"duration":5290,"startOfParagraph":false},{"content":"\"Abraham eats ham while hemming his--while hemming.\" Great.","startTime":1102700,"duration":11990,"startOfParagraph":false},{"content":"All. Great. Now we get ham, ham, and hem.","startTime":1124090,"duration":3240,"startOfParagraph":false},{"content":"While hemming--while humming to him--while humming to hem him. 
Great.","startTime":1128510,"duration":10860,"startOfParagraph":false},{"content":"Same thing.","startTime":1140350,"duration":2900,"startOfParagraph":false},{"content":"Now all returns still just ham, ham, and hem without picking up on the hum or the him.","startTime":1143820,"duration":5360,"startOfParagraph":false},{"content":"Great--so what if we wanted to look at either that--so we could also do","startTime":1149940,"duration":12660,"startOfParagraph":false},{"content":"him or--we will come back to that.","startTime":1163510,"duration":10300,"startOfParagraph":false},{"content":"Okay--so--all right--in positions you can also use the caret or the dollar sign","startTime":1174810,"duration":10950,"startOfParagraph":false},{"content":"to specify that you are looking for something at the start or the end of a string.","startTime":1185760,"duration":3590,"startOfParagraph":false},{"content":"Or the start or the end of a word.","startTime":1190260,"duration":2000,"startOfParagraph":false},{"content":"That is one way to use that.","startTime":1192400,"duration":2070,"startOfParagraph":false},{"content":"Okay--so let us play around with a slightly larger block of text.","startTime":1195630,"duration":5530,"startOfParagraph":true},{"content":"Let us say this row here--this statement here.","startTime":1203950,"duration":4360,"startOfParagraph":false},{"content":"The power of regular expression is that they can specify patterns","startTime":1208310,"duration":3050,"startOfParagraph":false},{"content":"not just fixed characters.","startTime":1211360,"duration":2030,"startOfParagraph":false},{"content":"Let us make--let us call this block.","startTime":1214900,"duration":3890,"startOfParagraph":false},{"content":"Then we will read all of that in.","startTime":1222400,"duration":4710,"startOfParagraph":false},{"content":"And then have a--let us make all =; so what are some things we could search in here profitably? 
","startTime":1228890,"duration":21930,"startOfParagraph":false},{"content":"We could look for the expression ear.","startTime":1250820,"duration":3250,"startOfParagraph":false},{"content":"Not very interesting. How about that? We'll see what happens.","startTime":1255050,"duration":6470,"startOfParagraph":false},{"content":"I gave it a problem.","startTime":1263710,"duration":2000,"startOfParagraph":false},{"content":"So any number of things before re and all.","startTime":1266380,"duration":4370,"startOfParagraph":false},{"content":"So that should return everything from the beginning up to all re perhaps a couple times.","startTime":1270750,"duration":4880,"startOfParagraph":false},{"content":"And then here we have the power of regular expressions is that they","startTime":1278800,"duration":3170,"startOfParagraph":false},{"content":"can specify patterns not just characters here are.","startTime":1281970,"duration":2930,"startOfParagraph":false},{"content":"So all the way up to the final re, it started with the left most and was greedy.","startTime":1284900,"duration":3610,"startOfParagraph":false},{"content":"Let us see--what else could we look for.","startTime":1290710,"duration":2000,"startOfParagraph":false},{"content":"I guess one thing if you were interested in looking for the pronouns she and he,","startTime":1292710,"duration":7150,"startOfParagraph":false},{"content":"you could check for s being equal to 0 or 1 ","startTime":1299860,"duration":4740,"startOfParagraph":false},{"content":"and the expression he, and that is probably not going to return--","startTime":1304600,"duration":5110,"startOfParagraph":false},{"content":"oh, I guess it returned he because there we are looking at the power, that day, here are.","startTime":1309710,"duration":8310,"startOfParagraph":false},{"content":"Let us try specifying that this has to come at the start of something.","startTime":1320590,"duration":5680,"startOfParagraph":false},{"content":"Let us see if that drops 
off.","startTime":1326640,"duration":2890,"startOfParagraph":false},{"content":"So we can do fat, and there we don't get anything because she and he","startTime":1329530,"duration":10100,"startOfParagraph":false},{"content":"don't occur in this phrase.","startTime":1339630,"duration":3240,"startOfParagraph":false},{"content":"Great. Okay--so back to the cat here.","startTime":1344960,"duration":5450,"startOfParagraph":false},{"content":"So complex patterns is hurting the brain.","startTime":1350410,"duration":5310,"startOfParagraph":false},{"content":"So that is why we use regular expressions to avoid these issues.","startTime":1355720,"duration":4780,"startOfParagraph":false},{"content":"So here are some other useful modes you can play around with.","startTime":1360820,"duration":2700,"startOfParagraph":true},{"content":"We looked at search today, but you can also use match, split, findall, and groups.","startTime":1363520,"duration":6770,"startOfParagraph":false},{"content":"So other cool things you can do with regular expressions besides just","startTime":1370290,"duration":3680,"startOfParagraph":false},{"content":"looking for patterns is taking a pattern and holding all the matches--","startTime":1373970,"duration":4900,"startOfParagraph":false},{"content":"as variables--and then using those in your code later on.","startTime":1378870,"duration":3660,"startOfParagraph":false},{"content":"That can be quite helpful. 
Other things might be counting.","startTime":1382850,"duration":3130,"startOfParagraph":false},{"content":"So we can count the number of instances of a regular expression pattern,","startTime":1385980,"duration":5740,"startOfParagraph":false},{"content":"and that is what we can use groups for.","startTime":1391720,"duration":2240,"startOfParagraph":false},{"content":"And other modes are also possible.","startTime":1393960,"duration":3590,"startOfParagraph":false},{"content":"So I just want to talk a little bit more about other ways you can use regular expressions.","startTime":1398040,"duration":4940,"startOfParagraph":false},{"content":"So one more advanced application is in fuzzy matching.","startTime":1402980,"duration":6120,"startOfParagraph":true},{"content":"So if you are looking through a text for the expression Julius Caesar, ","startTime":1409100,"duration":4350,"startOfParagraph":false},{"content":"and you see either Gaius Julius Caesar or the name Julius Caesar in other languages,","startTime":1413450,"duration":4290,"startOfParagraph":false},{"content":"then you might also want to assign some weight to those values.","startTime":1417740,"duration":6660,"startOfParagraph":false},{"content":"And if it is close enough--if it crosses a certain threshold--then you want","startTime":1424400,"duration":4530,"startOfParagraph":false},{"content":"to be able to accept Julius Caesar.","startTime":1428930,"duration":1930,"startOfParagraph":false},{"content":"So there are a couple different implementations for that in a few other languages as well.","startTime":1430860,"duration":9720,"startOfParagraph":false},{"content":"Here are some other tools: Regex Pal--a handy little app online to","startTime":1442580,"duration":5840,"startOfParagraph":false},{"content":"check if your regular expressions are composed correctly.","startTime":1448420,"duration":3770,"startOfParagraph":false},{"content":"There are also standalone tools that you can run from your 
desktop","startTime":1452190,"duration":6310,"startOfParagraph":false},{"content":"like Ultra Pico, as well as cookbooks.","startTime":1458500,"duration":3600,"startOfParagraph":false},{"content":"So if you are doing a project that involves a ton of regular expressions","startTime":1462100,"duration":3310,"startOfParagraph":false},{"content":"this is probably the place to go, outside the scope of today.","startTime":1465410,"duration":4400,"startOfParagraph":false},{"content":"And then just to give you a sense of how common it is","startTime":1471520,"duration":4250,"startOfParagraph":false},{"content":"there is grep in Unix, Perl has it built in, and for C there is PCRE.","startTime":1475770,"duration":8320,"startOfParagraph":false},{"content":"And then all these other languages also have regular expression packages","startTime":1484090,"duration":4800,"startOfParagraph":false},{"content":"that operate with essentially the same syntax we got a taste of today.","startTime":1488890,"duration":3130,"startOfParagraph":false},{"content":"PHP, Java, Ruby, and so on.","startTime":1492020,"duration":2770,"startOfParagraph":false},{"content":"Google Code Search is actually worth mentioning; it is one of the","startTime":1496080,"duration":2900,"startOfParagraph":true},{"content":"relatively few applications out there that allow the public to access","startTime":1498980,"duration":6740,"startOfParagraph":false},{"content":"its database using regular expressions.","startTime":1505720,"duration":2080,"startOfParagraph":false},{"content":"So if you look on Google Code Search you can find code","startTime":1507800,"duration":5120,"startOfParagraph":false},{"content":"if you are looking for an instance of how a function might be used,","startTime":1512920,"duration":3960,"startOfParagraph":false},{"content":"you can use a regular expression to find that function being used in all sorts of different 
cases.","startTime":1516880,"duration":4730,"startOfParagraph":false},{"content":"You could look for fwrite, and then you could look for the flag of write or read","startTime":1521610,"duration":6390,"startOfParagraph":false},{"content":"if you wanted an example of fwrite being used in that case.","startTime":1528000,"duration":4000,"startOfParagraph":false},{"content":"So the same thing there, and here are some references.","startTime":1533530,"duration":3480,"startOfParagraph":false},{"content":"This will be available online as well, so going forwards if","startTime":1537010,"duration":3980,"startOfParagraph":false},{"content":"you want to look at Python, grep, Perl--you just want to get some inspiration","startTime":1540990,"duration":4570,"startOfParagraph":false},{"content":"or if you want to look more at the theory here are some good jumping off places.","startTime":1545560,"duration":5090,"startOfParagraph":false},{"content":"Thank you very much.","startTime":1550650,"duration":3220,"startOfParagraph":false},{"content":"[CS50.TV]","startTime":1558470,"duration":1440,"startOfParagraph":false}]}