1 00:00:00,000 --> 00:00:02,000 [Seminar: Pattern Matching with Regular Expressions] 2 00:00:02,000 --> 00:00:04,000 [John Mussman-Harvard University] 3 00:00:04,000 --> 00:00:07,220 [This is CS50.-CS50.TV] 4 00:00:07,780 --> 00:00:11,610 Okay. Well, welcome everyone. This is CS50 2012. 5 00:00:11,780 --> 00:00:16,610 My name is John, and I will be talking today about regular expressions. 6 00:00:16,610 --> 00:00:22,530 Regular expressions are primarily a tool, but are also sometimes used 7 00:00:22,530 --> 00:00:28,650 actively in code, essentially to match patterns in strings. 8 00:00:28,650 --> 00:00:33,800 So here's a web comic from xkcd. 9 00:00:34,440 --> 00:00:42,370 In this comic there is a murder mystery where the killer has 10 00:00:42,370 --> 00:00:47,860 followed somebody on vacation, and the protagonists have to 11 00:00:47,860 --> 00:00:52,500 search through 200 megabytes of emails looking for an address. 12 00:00:52,500 --> 00:00:56,090 And they are about to give up when someone who knows regular expressions-- 13 00:00:56,090 --> 00:01:00,550 presumably a superhero--swoops down and writes some code 14 00:01:00,550 --> 00:01:02,970 and solves the murder mystery. 15 00:01:02,970 --> 00:01:07,370 So presumably that will be something that you will be empowered to do 16 00:01:07,370 --> 00:01:09,370 after this seminar. 17 00:01:09,370 --> 00:01:12,250 We are just going to provide a concise introduction to the language 18 00:01:12,250 --> 00:01:16,770 and give you enough wherewithal to go after more resources on your own. 19 00:01:17,680 --> 00:01:21,700 >> So regular expressions look basically like this. 20 00:01:22,930 --> 00:01:25,550 This is a regular expression in Ruby. 21 00:01:25,550 --> 00:01:29,280 It is not terribly different across languages. 22 00:01:29,690 --> 00:01:37,630 We just have slashes to mark the beginning and end of the regular expression in Ruby. 23 00:01:37,630 --> 00:01:42,880 And this is a regular expression to look for an email address pattern. 
24 00:01:42,880 --> 00:01:49,160 So we see that the first bit looks for any alphanumeric character. 25 00:01:50,500 --> 00:01:54,880 That is because email addresses often have to start with an alphabetical character. 26 00:01:55,460 --> 00:01:59,330 And then any special character followed by the @ symbol. 27 00:01:59,330 --> 00:02:03,260 And then the same thing for the domain name. 28 00:02:03,260 --> 00:02:10,030 And then between 2 and 4 characters to look for the .com, .net, and so on. 29 00:02:10,850 --> 00:02:13,200 So that is another example of a regular expression. 30 00:02:13,200 --> 00:02:17,270 So regular expressions are protocols for finding patterns in text. 31 00:02:17,270 --> 00:02:21,130 They do comparisons, selections, and replacements. 32 00:02:21,690 --> 00:02:27,970 So a third example is finding all the phone numbers ending in 54 in a directory. 33 00:02:27,970 --> 00:02:34,360 So before David rips up the CS50 directory we could search for 34 00:02:34,360 --> 00:02:40,450 a pattern where we have parentheses then 3 numbers then end parenthesis, 35 00:02:40,450 --> 00:02:44,070 3 more numbers, a dash, 2 numbers, and then 54. 36 00:02:44,070 --> 00:02:48,310 And that would be essentially how we come up with a regular expression to search for that. 37 00:02:49,150 --> 00:02:52,960 >> So there are--we have done some things in CS50 that are a little bit like 38 00:02:52,960 --> 00:02:59,740 regular expressions, so--for example--in the dictionary.c file 39 00:02:59,740 --> 00:03:04,720 for the spell check problem set you may have used fscanf 40 00:03:04,720 --> 00:03:07,930 to read in a word from the dictionary. 41 00:03:07,930 --> 00:03:16,240 And you can see the %45s is looking for a string of 45 characters. 42 00:03:16,240 --> 00:03:20,020 So it is somewhat like a rudimentary regular expression. 43 00:03:21,150 --> 00:03:26,060 And you can have any 45 characters that fit the bill in there 44 00:03:26,060 --> 00:03:28,080 and pick those up. 
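[Editor's sketch, not part of the talk] The phone-number pattern described above (parenthesis, 3 numbers, end parenthesis, 3 more numbers, a dash, 2 numbers, then 54) might look roughly like this in Python. The directory entries are made up, the optional space after the area code is an assumption, and the code is written for Python 3.

```python
import re

# Hypothetical directory entries, invented for illustration.
directory = [
    "(617) 555-0154",
    "(617) 555-0123",
    "(212) 555-0199",
    "(415) 555-0154",
]

# \( and \) match literal parentheses; \d{3} is exactly three digits.
# The tail is two digits followed by 54, with $ anchoring the end of the entry.
pattern = re.compile(r"\(\d{3}\) ?\d{3}-\d{2}54$")

matches = [entry for entry in directory if pattern.search(entry)]
print(matches)
```

Only the two entries whose last two digits are 54 should come back.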
45 00:03:28,080 --> 00:03:33,480 And then the second example in the most recent web programming problem 46 00:03:33,480 --> 00:03:40,760 set in the distro code for PHP we actually do have a simple regular expression. 47 00:03:40,760 --> 00:03:46,790 And this one is just simply looking to check if the web page that is passed in 48 00:03:46,790 --> 00:03:51,940 matches either login, logout, or register.php. 49 00:03:52,220 --> 00:03:57,910 And then returning true or false based on that regular expression matching. 50 00:03:59,400 --> 00:04:01,740 >> So when do you use regular expressions? 51 00:04:01,740 --> 00:04:04,820 Why are you here today? 52 00:04:05,330 --> 00:04:08,480 So you don't want to use regular expressions when there's something that 53 00:04:08,480 --> 00:04:11,640 does the job for you even more easily. 54 00:04:11,640 --> 00:04:15,510 So XML and HTML are actually pretty tricky 55 00:04:15,510 --> 00:04:18,480 to write regular expressions for as we will see in a little bit. 56 00:04:19,110 --> 00:04:23,280 So there are dedicated parsers for those languages. 57 00:04:24,170 --> 00:04:30,060 You also need to be okay with the trade-offs in accuracy, frequently. 58 00:04:30,060 --> 00:04:36,220 If you are trying--so we saw a regular expression for an email address, 59 00:04:37,370 --> 00:04:42,590 but say you wanted a specific email address and gradually the 60 00:04:42,590 --> 00:04:48,570 regular expression might become more complex as it became more precise. 61 00:04:49,580 --> 00:04:52,260 So that would be one trade-off 62 00:04:52,260 --> 00:04:55,330 you have to be sure that you are okay making with the regular expression. 63 00:04:55,330 --> 00:04:57,920 If you know exactly what you are looking for it might make more sense 64 00:04:57,920 --> 00:05:02,070 to put in the time and write a more effective parser. 
65 00:05:02,070 --> 00:05:06,980 And finally there is a historical issue with the regularity 66 00:05:06,980 --> 00:05:08,940 of expressions and languages. 67 00:05:08,940 --> 00:05:12,960 Regular expressions are actually much more powerful than 68 00:05:12,960 --> 00:05:16,450 regular expressions per se in a formal sense. 69 00:05:17,130 --> 00:05:20,150 >> So I don't want to go too far into the formal theory, 70 00:05:20,150 --> 00:05:24,000 but most languages that we code in actually are not regular. 71 00:05:24,000 --> 00:05:29,110 And this is why regular expressions sometimes are not considered all that secure. 72 00:05:29,670 --> 00:05:33,150 So basically there is a Chomsky hierarchy for languages, 73 00:05:33,150 --> 00:05:38,400 and regular expressions are built up using union, concatenation, 74 00:05:38,400 --> 00:05:41,810 and the Kleene star operation that we will see in a few minutes. 75 00:05:43,130 --> 00:05:48,860 If you are interested in theory there is quite a lot going on there under the hood. 76 00:05:50,360 --> 00:05:55,880 >> So a brief history--just for the context here--regular sets came up 77 00:05:55,880 --> 00:05:59,580 in the 1950s, and then we had simple editors that 78 00:05:59,580 --> 00:06:03,300 incorporated regular expressions--just searching for strings. 79 00:06:03,570 --> 00:06:09,110 Grep--which is a command line tool--was one of the first 80 00:06:09,110 --> 00:06:14,160 very popular tools that incorporated regular expressions in the 1960s. 81 00:06:14,160 --> 00:06:20,560 In the '80s, Perl was built--it is a programming language that 82 00:06:20,560 --> 00:06:24,110 incorporates regular expressions very prominently. 83 00:06:24,550 --> 00:06:30,130 And then more recently we have had Perl compatible regular expression 84 00:06:30,130 --> 00:06:35,870 protocols basically in other languages that use much of the same syntax. 
85 00:06:36,630 --> 00:06:39,840 Of course the most important event was in 2008 86 00:06:39,840 --> 00:06:43,040 where there was the first National Regular Expressions Day, 87 00:06:43,040 --> 00:06:47,350 which I believe is June 1 if you want to celebrate that. 88 00:06:48,430 --> 00:06:50,840 >> Again, just a little bit more theory here. 89 00:06:52,180 --> 00:06:55,320 So there are a couple different ways of constructing regular expressions. 90 00:06:55,950 --> 00:07:02,050 One simple way is to build the expression that you are going to 91 00:07:02,050 --> 00:07:07,500 run on the string interpret--basically build a little mini-program that 92 00:07:07,500 --> 00:07:11,870 will analyze pieces of a string and see, "Oh, does this fit the regular expression or not?" 93 00:07:12,250 --> 00:07:14,250 And then run that. 94 00:07:14,250 --> 00:07:17,300 So if you have a very small regular expression, this is probably 95 00:07:17,300 --> 00:07:19,380 the most efficient way to do it. 96 00:07:20,090 --> 00:07:25,420 And then if you--another option is to keep reconstructing the 97 00:07:25,420 --> 00:07:30,260 expression as you go, and that is the simulate possibility. 98 00:07:30,440 --> 00:07:37,690 And these early attempts at regular expression algorithms were 99 00:07:37,690 --> 00:07:44,330 relatively simple and relatively fast, but didn't have a lot of flexibility. 100 00:07:44,330 --> 00:07:47,500 So to do even some of the things that we are going to look at 101 00:07:47,500 --> 00:07:52,860 today we have had to do more complex regular expression 102 00:07:52,860 --> 00:07:56,650 implementations that are potentially much slower; so that is something to bear in mind. 103 00:07:57,510 --> 00:08:02,920 There's also a regular expression denial-of-service variety of attack 104 00:08:02,920 --> 00:08:08,330 that exploits the potential for these newer implementations of 105 00:08:08,330 --> 00:08:10,930 regular expressions to become very complex. 
106 00:08:11,570 --> 00:08:15,650 And in much the same sense that we saw in buffer overflow attacks, 107 00:08:15,650 --> 00:08:21,610 you have attacks that work by making recursive loops that 108 00:08:21,610 --> 00:08:24,400 overrun the capacity of memory. 109 00:08:24,780 --> 00:08:29,540 And by the way Regexen is one of the official plurals of regular expression 110 00:08:29,540 --> 00:08:32,890 by analogy to oxen in the Anglo-Saxon. 111 00:08:33,500 --> 00:08:40,169 >> Okay, so the Python Library many of you here in person have Macs, 112 00:08:40,169 --> 00:08:43,860 so you can actually pull this up on your screen. 113 00:08:43,860 --> 00:08:47,480 Regular expressions are built into Python. 114 00:08:48,070 --> 00:08:53,020 And so Python is preloaded on Macs and also available online at this link. 115 00:08:53,770 --> 00:08:57,350 So if you are watching you can pause and make sure you have Python 116 00:08:58,080 --> 00:09:00,170 as we play around here. 117 00:09:00,780 --> 00:09:06,420 There is a manual online, so if you just type Python into your computer 118 00:09:06,420 --> 00:09:10,500 you will see that the version comes up in the terminal. 119 00:09:11,070 --> 00:09:17,720 So I provided a link to the manual for Version 2 of Python as well as a cheat sheet. 120 00:09:17,720 --> 00:09:23,100 There is a Version 3 of Python, but your Mac doesn't necessarily 121 00:09:23,100 --> 00:09:25,130 come with that preloaded. 122 00:09:25,130 --> 00:09:27,360 So not terribly different. 123 00:09:27,360 --> 00:09:33,270 Okay, so some basics of using regular expressions in Python. 124 00:09:34,080 --> 00:09:42,650 >> So here I used a very simple expression, so I did Python import re 125 00:09:43,750 --> 00:09:47,070 and then took the result of re.search. 126 00:09:47,070 --> 00:09:49,910 And the search takes 2 arguments. 
127 00:09:49,910 --> 00:09:56,040 The first is the regular expression, and the second is the text 128 00:09:56,040 --> 00:09:58,290 or string you want to analyze. 129 00:09:58,290 --> 00:10:01,210 And then I printed out the result.group(). 130 00:10:01,580 --> 00:10:05,860 So these are the 2 basic functions we are going to see today 131 00:10:06,790 --> 00:10:10,170 in learning about regular expressions. 132 00:10:10,170 --> 00:10:12,880 So just breaking down this regular expression here 133 00:10:12,880 --> 00:10:21,770 h and then \w and then m; \w just accepts any alphabetical character in there. 134 00:10:21,850 --> 00:10:26,820 So here we are looking for an "h" and then another alphabetical character 135 00:10:26,820 --> 00:10:30,060 and then m, so here that would match ham 136 00:10:30,060 --> 00:10:34,480 in, "Abraham Lincoln and ham sandwiches." 137 00:10:35,040 --> 00:10:37,150 This is the result of that group. 138 00:10:37,680 --> 00:10:43,130 Another thing that we can do is use r before strings of text in Python. 139 00:10:43,130 --> 00:10:46,220 So I guess I will go ahead and pull that up here. 140 00:10:46,220 --> 00:10:49,210 Python import re. 141 00:10:50,070 --> 00:10:54,000 And if I were to do the same thing--let us say text is, 142 00:10:55,390 --> 00:11:00,800 "Abraham," let us zoom in--there we go. 143 00:11:01,610 --> 00:11:06,430 Text is, "Abraham eats ham." 144 00:11:07,460 --> 00:11:15,260 Okay, and then result = re.search. 145 00:11:16,260 --> 00:11:22,020 And then our expression can be h, and then I will do dot m. 146 00:11:22,020 --> 00:11:26,280 So dot just takes any character that is not a newline, including numbers, 147 00:11:26,280 --> 00:11:28,650 percentage signs, anything like that. 148 00:11:28,650 --> 00:11:38,030 And then text--boom--and then result.group--yeah. 149 00:11:38,030 --> 00:11:41,820 So that is just how to implement basic functionality here. 
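[Editor's sketch, not part of the talk] The re.search and group calls walked through above could be typed in roughly as follows. This assumes Python 3, where print is a function (the demo used Python 2); the check for None is an addition, not something shown in the demo.

```python
import re

text = "Abraham Lincoln and ham sandwiches"

# h, then one "word" character (\w: a letter, digit, or underscore), then m.
result = re.search(r"h\wm", text)
first = result.group()          # the text that actually matched
print(first)

# The dot matches any single character except a newline.
text2 = "Abraham eats ham"
result2 = re.search(r"h.m", text2)
print(result2.group(), result2.start())

# search returns None when nothing matches, so check before calling group().
missing = re.search(r"h\dm", text2)
print(missing)
```

In both strings the match is the "ham" inside "Abraham", since that is the first place the pattern fits.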
150 00:11:42,300 --> 00:11:55,110 If we had a text string that--that crazy text--included say lots of backslashes 151 00:11:55,110 --> 00:12:01,180 and strings inside and things that could look like escape sequences, 152 00:12:01,180 --> 00:12:08,480 then we probably want to use the raw text input to make sure that is accepted. 153 00:12:08,480 --> 00:12:14,120 And that just looks like that. 154 00:12:14,120 --> 00:12:17,810 So if we were looking for each of them in there we shouldn't find anything. 155 00:12:19,070 --> 00:12:21,680 But that is how you would implement it; just before the string of 156 00:12:21,680 --> 00:12:24,990 the regular expression you put the letter r. 157 00:12:26,150 --> 00:12:30,260 >> Okay, so let us keep going. 158 00:12:30,260 --> 00:12:33,730 All right--so let us look at a couple repetitive patterns here. 159 00:12:34,750 --> 00:12:39,150 So one thing that you want to do is repeat things 160 00:12:40,040 --> 00:12:42,480 as you are searching through text. 161 00:12:42,480 --> 00:12:48,300 So to do a followed by any number of b--you do ab*. 162 00:12:48,630 --> 00:12:51,620 And then there are a series of other rules too. 163 00:12:51,620 --> 00:12:54,380 And you can look all of these up; I'll just run through some of the 164 00:12:54,380 --> 00:12:57,630 most commonly used ones. 165 00:12:57,630 --> 00:13:03,920 So ab+ is a followed by any N greater than 0 of b. 166 00:13:04,510 --> 00:13:08,000 ab? is a followed by 0 or 1 of b. 167 00:13:09,190 --> 00:13:18,580 ab{N} is a followed by N of b, and then so on. 168 00:13:18,580 --> 00:13:22,820 If you have 2 numbers in the curly braces you are specifying a range 169 00:13:23,300 --> 00:13:25,440 that can be possibly matched. 170 00:13:26,390 --> 00:13:30,420 So we will look more at a couple repetitive patterns in a minute. 171 00:13:31,960 --> 00:13:42,300 So 2 things to keep in mind when using these pattern matching tools here. 
172 00:13:42,300 --> 00:13:52,120 So say we want to look at the h.m of, "Abraham Lincoln makes ham sandwiches." 173 00:13:52,120 --> 00:13:55,230 So I changed Abraham Lincoln's name to Abraham. 174 00:13:55,230 --> 00:14:00,290 And now we are looking for what is returned by this search function, 175 00:14:00,290 --> 00:14:03,270 and it only returns ham in this case. 176 00:14:03,620 --> 00:14:08,080 And it does that because search just naturally takes the leftmost match. 177 00:14:08,080 --> 00:14:12,130 And all regular expressions unless you specify otherwise will do that. 178 00:14:12,830 --> 00:14:18,880 If we wanted to find all there is a function for that--findall. 179 00:14:18,880 --> 00:14:35,100 So that could just look like all = re.findall('h.m', text) 180 00:14:35,100 --> 00:14:44,540 and then all.group(). 181 00:14:44,540 --> 00:14:51,040 All produces both ham and ham; in this case both of the strings in Abraham eats ham. 182 00:14:51,610 --> 00:14:55,110 So that is another option. 183 00:14:56,250 --> 00:15:06,940 >> Great. The other thing to keep in mind is that regular expressions take the largest match, intuitively. 184 00:15:06,940 --> 00:15:09,520 Let us look at this example. 185 00:15:10,200 --> 00:15:16,070 We did that leftmost search here, and then I attempted a larger search 186 00:15:16,070 --> 00:15:18,800 using the Kleene star operator. 187 00:15:18,800 --> 00:15:24,180 So for, "Abraham Lincoln makes ham sandwiches," and I only got back 188 00:15:24,180 --> 00:15:26,280 m as a result. 189 00:15:26,280 --> 00:15:31,670 The reason for that mistake was that I could have taken any number of 190 00:15:31,670 --> 00:15:36,140 h's because I didn't specify anything to go in between h and m. 191 00:15:36,140 --> 00:15:42,010 The only example there that had m--the only examples there with m in it 192 00:15:42,010 --> 00:15:46,220 and any number of h's were just the string m. 
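[Editor's sketch, not part of the talk] A rough Python 3 version of the leftmost-match, findall, and h*m points just made, plus the repetition operators listed earlier (ab*, ab+, ab?, ab{N}). The string "a ab abb abbb" is invented for illustration. Note that findall hands back a plain list, so its result is used directly here rather than through group().

```python
import re

text = "Abraham Lincoln makes ham sandwiches"

# search stops at the leftmost match: the "ham" inside "Abraham".
leftmost = re.search(r"h.m", text).group()

# findall returns a list of every non-overlapping match.
every = re.findall(r"h.m", text)

# h*m means "zero or more h, then m", so a bare "m" already satisfies it.
star = re.search(r"h*m", text).group()

# The repetition operators, tried against a made-up string.
sample = "a ab abb abbb"
counts = {
    "ab*": re.findall(r"ab*", sample),        # a, then zero or more b
    "ab+": re.findall(r"ab+", sample),        # a, then one or more b
    "ab?": re.findall(r"ab?", sample),        # a, then zero or one b
    "ab{2}": re.findall(r"ab{2}", sample),    # a, then exactly two b
    "ab{1,2}": re.findall(r"ab{1,2}", sample) # a, then one to two b
}

print(leftmost, every, star)
for pattern, found in counts.items():
    print(pattern, found)
```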
193 00:15:46,490 --> 00:15:51,850 Then I tried it again; I said, "Okay, let us get the actual largest group here." 194 00:15:51,850 --> 00:15:59,670 And then I did h.*m, so that just returns any number of characters between h and m. 195 00:16:00,280 --> 00:16:02,950 And if you are just starting out and thinking, "Oh, okay, well this will 196 00:16:02,950 --> 00:16:11,560 get me ham," it actually takes everything from the h in Abraham Lincoln 197 00:16:11,560 --> 00:16:13,690 all the way up to the end of ham. 198 00:16:14,040 --> 00:16:18,110 It is greedy; it sees h--all this other text--m, 199 00:16:18,110 --> 00:16:21,280 and that is what it takes in. 200 00:16:22,060 --> 00:16:27,480 This is a particularly egregious--this is a feature we can also 201 00:16:27,480 --> 00:16:30,670 specify for it not be greedy using other functions. 202 00:16:31,480 --> 00:16:34,490 But this is something we have to keep in mind especially 203 00:16:34,490 --> 00:16:38,720 when looking at HTML text, which is one reason that 204 00:16:38,720 --> 00:16:41,500 regular expressions are difficult for HTML. 205 00:16:42,460 --> 00:16:46,310 Because if you have an HTML open tag and then lots of stuff in the middle 206 00:16:46,310 --> 00:16:49,820 and then some other HTML closed tag much later in the program, 207 00:16:49,820 --> 00:16:55,420 you have just eaten up a lot of your HTML code possibly by mistake. 208 00:16:56,200 --> 00:17:01,840 >> All right--so more special characters, like many other languages, 209 00:17:01,840 --> 00:17:04,780 we escape using the slash. 210 00:17:04,780 --> 00:17:10,329 So we can use the dot to specify any character except for a new line. 211 00:17:10,329 --> 00:17:14,550 We can use the escape w to specify any alphabetical character. 212 00:17:14,550 --> 00:17:20,329 And by analogy escape d for any integer--numerical character. 213 00:17:20,630 --> 00:17:27,440 We can specify--we can use brackets to specify related expressions. 
214 00:17:27,440 --> 00:17:30,970 So this would accept a, b, or c. 215 00:17:31,320 --> 00:17:37,000 And we can also specify or options for either a or b. 216 00:17:37,000 --> 00:17:41,110 For example--if we were looking for multiple possibilities 217 00:17:41,110 --> 00:17:44,940 in brackets we could use the or operator as in-- 218 00:17:44,940 --> 00:17:52,480 so let us go back to this example here. 219 00:17:53,000 --> 00:17:59,790 And now let us take--let us go back to this example here, and then 220 00:17:59,790 --> 00:18:12,290 take ae--so this should return--I guess this is still Abraham. 221 00:18:12,290 --> 00:18:17,410 So this--if we do all--great. 222 00:18:17,410 --> 00:18:22,700 So let us update the text here. 223 00:18:22,700 --> 00:18:34,690 "Abraham eats ham while hemming his--while hemming." Great. 224 00:18:44,090 --> 00:18:47,330 All. Great. Now we get ham, ham, and hem. 225 00:18:48,510 --> 00:18:59,370 While hemming--while humming to him--while humming to hem him. Great. 226 00:19:00,350 --> 00:19:03,250 Same thing. 227 00:19:03,820 --> 00:19:09,180 Now all returns still just ham, ham, and hem without picking up on the hum or the him. 228 00:19:09,940 --> 00:19:22,600 Great--so what if we wanted to look at either that--so we could also do 229 00:19:23,510 --> 00:19:33,810 him or--we will come back to that. 230 00:19:34,810 --> 00:19:45,760 Okay--so--all right--in positions you can also use the caret or the dollar sign 231 00:19:45,760 --> 00:19:49,350 to specify that you are looking for something at the start or the end of a string. 232 00:19:50,260 --> 00:19:52,260 Or the start or the end of a word. 233 00:19:52,400 --> 00:19:54,470 That is one way to use that. 234 00:19:55,630 --> 00:20:01,160 >> Okay--so let us play around with a slightly larger block of text. 235 00:20:03,950 --> 00:20:08,310 Let us say this row here--this statement here. 
236 00:20:08,310 --> 00:20:11,360 The power of regular expressions is that they can specify patterns 237 00:20:11,360 --> 00:20:13,390 not just fixed characters. 238 00:20:14,900 --> 00:20:18,790 Let us make--let us call this block. 239 00:20:22,400 --> 00:20:27,110 Then we will read all of that in. 240 00:20:28,890 --> 00:20:50,820 And then have a--let us make all =; so what are some things we could search in here profitably? 241 00:20:50,820 --> 00:20:54,070 We could look for the expression ear. 242 00:20:55,050 --> 00:21:01,520 Not very interesting. How about that? We'll see what happens. 243 00:21:03,710 --> 00:21:05,710 I gave it a problem. 244 00:21:06,380 --> 00:21:10,750 So any number of things before re and all. 245 00:21:10,750 --> 00:21:15,630 So that should return everything from the beginning up to all re perhaps a couple times. 246 00:21:18,800 --> 00:21:21,970 And then here we have the power of regular expressions is that they 247 00:21:21,970 --> 00:21:24,900 can specify patterns not just characters here are. 248 00:21:24,900 --> 00:21:28,510 So all the way up to the final re, it started with the leftmost and was greedy. 249 00:21:30,710 --> 00:21:32,710 Let us see--what else could we look for? 250 00:21:32,710 --> 00:21:39,860 I guess one thing if you were interested in looking for the pronouns she and he, 251 00:21:39,860 --> 00:21:44,600 you could check for s being equal to 0 or 1 252 00:21:44,600 --> 00:21:49,710 and the expression he, and that is probably not going to return-- 253 00:21:49,710 --> 00:21:58,020 oh, I guess it returned he because there we are looking at the power, that they, here are. 254 00:22:00,590 --> 00:22:06,270 Let us try specifying that this has to come at the start of something. 255 00:22:06,640 --> 00:22:09,530 Let us see if that drops off. 256 00:22:09,530 --> 00:22:19,630 So we can do that, and there we don't get anything because she and he 257 00:22:19,630 --> 00:22:22,870 don't occur in this phrase. 
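[Editor's sketch, not part of the talk] The experiments on the block of text above might be reproduced along these lines in Python 3. The block here is only the one sentence quoted in the demo (the text on screen appears to have been longer), and the non-greedy `.*?` and word-boundary `\b` forms are standard features of Python's re module that the demo did not type out.

```python
import re

# Assumption: just the sentence quoted in the demo, not the full on-screen text.
block = ("The power of regular expressions is that they can specify "
         "patterns, not just fixed characters.")

# Greedy: .* grabs as much as it can, so the match runs to the LAST "re".
greedy = re.search(r".*re", block).group()

# Non-greedy: .*? stops at the first "re" it can reach.
lazy = re.search(r".*?re", block).group()

# s?he also matches inside longer words such as "The" and "they".
loose = re.findall(r"s?he", block)

# \b marks a word boundary, so only the standalone words she/he would match.
strict = re.findall(r"\bs?he\b", block)

print(repr(greedy))
print(repr(lazy))
print(loose, strict)
```

With the boundaries in place nothing is returned, which agrees with the observation that she and he do not occur as words in this phrase.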
258 00:22:24,960 --> 00:22:30,410 Great. Okay--so back to the cat here. 259 00:22:30,410 --> 00:22:35,720 So complex patterns are hurting the brain. 260 00:22:35,720 --> 00:22:40,500 So that is why we use regular expressions to avoid these issues. 261 00:22:40,820 --> 00:22:43,520 >> So here are some other useful modes you can play around with. 262 00:22:43,520 --> 00:22:50,290 We looked at search today, but you can also use match, split, findall, and groups. 263 00:22:50,290 --> 00:22:53,970 So other cool things you can do with regular expressions besides just 264 00:22:53,970 --> 00:22:58,870 looking for patterns is taking a pattern and holding all the matches-- 265 00:22:58,870 --> 00:23:02,530 as variables--and then using those in your code later on. 266 00:23:02,850 --> 00:23:05,980 That can be quite helpful. Other things might be counting. 267 00:23:05,980 --> 00:23:11,720 So we can count the number of instances of a regular expression pattern, 268 00:23:11,720 --> 00:23:13,960 and that is what we can use groups for. 269 00:23:13,960 --> 00:23:17,550 And other modes as well are also possible. 270 00:23:18,040 --> 00:23:22,980 So I just want to talk a little bit more about other ways you can use regular expressions. 271 00:23:22,980 --> 00:23:29,100 >> So one more advanced application is in fuzzy matching. 272 00:23:29,100 --> 00:23:33,450 So if you are looking through a text for the expression, Julius Caesar, 273 00:23:33,450 --> 00:23:37,740 and you see either Gaius Julius Caesar or the name Julius Caesar in other languages, 274 00:23:37,740 --> 00:23:44,400 then you might also want to assign some weight to those values. 275 00:23:44,400 --> 00:23:48,930 And if it is close enough--if it crosses a certain threshold--then you want 276 00:23:48,930 --> 00:23:50,860 to be able to accept Julius Caesar. 277 00:23:50,860 --> 00:24:00,580 So there are a couple different implementations for that in a few other languages as well. 
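[Editor's sketch, not part of the talk] The brackets, or operator, anchors, and the other modes mentioned earlier (match, split, findall, groups, counting) could be tried out roughly like this in Python 3. The sentence is adapted from the demo, and counting via len() of findall's list is just one way to do it, chosen for this sketch.

```python
import re

# Sample sentence adapted from the demo; treat it as illustrative only.
text = "Abraham eats ham while humming to hem him"

# Brackets list the characters allowed at one position: a or e here.
ae_only = re.findall(r"h[ae]m", text)

# The | operator chooses between whole alternatives.
either = re.findall(r"him|hum", text)

# ^ anchors at the start of the string and $ at the end.
starts = re.search(r"^Abraham", text) is not None
ends = re.search(r"him$", text) is not None

# match only succeeds at the very start of the string, so this is None.
no_match = re.match(r"ham", text)

# split cuts the string wherever the pattern matches (runs of whitespace).
words = re.split(r"\s+", text)

# Parentheses capture pieces of a match for later use as variables.
eater, food = re.search(r"(\w+) eats (\w+)", text).groups()

# One way to count: take the length of the list findall returns.
count = len(re.findall(r"h[aeiou]m", text))

print(ae_only, either, starts, ends, no_match)
print(words, eater, food, count)
```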
278 00:24:02,580 --> 00:24:08,420 Here are some other tools, Regex Pal--a handy little app online to 279 00:24:08,420 --> 00:24:12,190 check if your regular expressions are composed correctly. 280 00:24:12,190 --> 00:24:18,500 There are also standalone tools that you can run from your desktop 281 00:24:18,500 --> 00:24:22,100 like Ultra Pico, as well as cookbooks. 282 00:24:22,100 --> 00:24:25,410 So if you are doing a project that involves a ton of regular expressions 283 00:24:25,410 --> 00:24:29,810 this is probably the place to go outside the scope of today. 284 00:24:31,520 --> 00:24:35,770 And then just to give you a sense of how common it is 285 00:24:35,770 --> 00:24:44,090 there is grep in Unix, Perl has it built in, and for C there is PCRE. 286 00:24:44,090 --> 00:24:48,890 And then all these other languages also have regular expression packages 287 00:24:48,890 --> 00:24:52,020 that operate with essentially the same syntax we got a taste of today. 288 00:24:52,020 --> 00:24:54,790 PHP, Java, Ruby, and so on. 289 00:24:56,080 --> 00:24:58,980 >> Google Code Search is actually worth mentioning; it is one of the 290 00:24:58,980 --> 00:25:05,720 relatively few applications out there that allows the public to access 291 00:25:05,720 --> 00:25:07,800 its database using regular expressions. 292 00:25:07,800 --> 00:25:12,920 So if you look on Google Code Search you can find code 293 00:25:12,920 --> 00:25:16,880 if you are looking for an instance of how a function might be used, 294 00:25:16,880 --> 00:25:21,610 you can use a regular expression to find that function being used in all sorts of different cases. 295 00:25:21,610 --> 00:25:28,000 You could look for fwrite, and then you could look for the flag of write or read 296 00:25:28,000 --> 00:25:32,000 if you wanted an example of fwrite being used in that case. 297 00:25:33,530 --> 00:25:37,010 So the same thing there, and here are some references. 
298 00:25:37,010 --> 00:25:40,990 This will be available online as well, so going forwards if 299 00:25:40,990 --> 00:25:45,560 you want to look at Python, grep, Perl--you just want to get some inspiration 300 00:25:45,560 --> 00:25:50,650 or if you want to look more at the theory here are some good jumping off places. 301 00:25:50,650 --> 00:25:53,870 Thank you very much. 302 00:25:58,470 --> 00:25:59,910 [CS50.TV]