BRIAN YU: Now that we've figured out how to compare two files based on how many lines they have in common, now let's think about how to compare two files based on the number of sentences that they have in common. Now, what might that look like, and what do you have to do? First, just like in the last function, you'll take in string inputs a and b, each one of which will be the textual representation of some file. Then instead of splitting each string into lines, you'll split each string into sentences. Then you'll calculate the list of sentences that appear in both a and also in b. And finally, you'll return a list that contains all of the sentences that appear in both of the original strings. The challenge here is how do you take a string and convert it into a list of all of the sentences that make it up. If, for example, we had a string like "Hello there! How are you?" we would want to split that up into "Hello there!" and "How are you?" knowing that "Hello there!" is one sentence and "How are you?" is another sentence.
And here complicated issues like dealing with different types of punctuation (whether they're periods, or exclamation points, or question marks) might come into play. However, luckily for us, someone's already implemented this functionality for us. And we can stand on their shoulders in order to take advantage of the ability to split a string into sentences and use it for our own purposes. For this, we're going to use a Python library called NLTK, or Natural Language Toolkit, which, within it, defines a function called sent_tokenize (for sentence tokenize), which takes a string and splits it up into all of the sentences that make it up. In order to use it, you can import the function using a line that looks something like this: "from nltk.tokenize import sent_tokenize," which will allow you to use the sent_tokenize function in order to take a string and split it up into its component sentences. Once you've done that, the last step is going to be to find the sentences that are in common and return that as a list. So as usual, make sure you avoid duplicates. If a sentence appears multiple times, you only want it to appear once in your final list.
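As a rough illustration of the splitting step described here, the sketch below wraps the import from the lecture in a helper. The helper name split_into_sentences and the regular-expression fallback are assumptions added for this example, not part of the lecture; the fallback only exists so the sketch still runs when NLTK or its tokenizer data is not installed.

```python
import re


def split_into_sentences(text):
    """Return the sentences in text as a list of strings."""
    try:
        # The import shown in the lecture.
        from nltk.tokenize import sent_tokenize
        return sent_tokenize(text)
    except (ImportError, LookupError):
        # Assumption, not from the lecture: if NLTK or its tokenizer data
        # is unavailable, fall back to a rough split after ., ! or ?.
        return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


print(split_into_sentences("Hello there! How are you?"))
# ['Hello there!', 'How are you?']
```

The fallback is far cruder than NLTK (it would, for instance, split after abbreviations such as "Dr."), which is one reason to lean on the library rather than handle punctuation by hand.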
And of course, make sure that the data type that you return at the end of the sentences function is, in fact, a list. After that, you should have a list of matching sentences that you can then return back to your program.
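Putting the steps together, one possible shape for the function is sketched below. This is a minimal sketch rather than the reference solution: using sets for deduplication is one choice among several, and the regular-expression stand-in for sent_tokenize is an assumption added only so the example runs without NLTK or its tokenizer data.

```python
import re

try:
    from nltk.tokenize import sent_tokenize
    sent_tokenize("Ready?")  # raises LookupError if tokenizer data is missing
except (ImportError, LookupError):
    # Assumption, not from the lecture: a rough stand-in that splits
    # after ., ! or ? when NLTK cannot be used.
    def sent_tokenize(text):
        return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def sentences(a, b):
    """Return a list of the sentences that appear in both a and b."""
    # Converting to sets drops duplicate sentences within each string.
    sentences_a = set(sent_tokenize(a))
    sentences_b = set(sent_tokenize(b))
    # Intersect, then convert back to a list, the required return type.
    return list(sentences_a & sentences_b)


common = sentences("Hi there! How are you? How are you?",
                   "How are you? I am fine.")
print(common)
# ['How are you?']
```

Because sets are unordered, the order of the returned list is not guaranteed when more than one sentence matches.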