1 00:00:00,000 --> 00:00:00,150 2 00:00:00,150 --> 00:00:02,009 BRIAN YU: Let's dive into readability. 3 00:00:02,009 --> 00:00:05,760 In readability, your task is going to be to write a program in C that 4 00:00:05,760 --> 00:00:09,870 takes as input some text and outputs the approximate US grade 5 00:00:09,870 --> 00:00:13,500 level that would be the appropriate reading level for that text. 6 00:00:13,500 --> 00:00:16,710 For example, you might run your readability program by calling 7 00:00:16,710 --> 00:00:18,600 ./readability. 8 00:00:18,600 --> 00:00:21,300 Your program will then prompt you to type in some text, 9 00:00:21,300 --> 00:00:23,400 where you could type in a couple of sentences. 10 00:00:23,400 --> 00:00:25,650 And then your program would analyze that text 11 00:00:25,650 --> 00:00:29,010 and conclude, for example, that these sentences are at a third grade reading 12 00:00:29,010 --> 00:00:29,778 level. 13 00:00:29,778 --> 00:00:32,070 Or if you typed in something a little more complicated, 14 00:00:32,070 --> 00:00:35,310 you might get that it's a fifth grade reading level, or something else. 15 00:00:35,310 --> 00:00:37,950 How do you actually calculate this reading level? 16 00:00:37,950 --> 00:00:40,170 Well, first, we can make a couple of observations 17 00:00:40,170 --> 00:00:43,410 about what makes something easier or harder to read. 18 00:00:43,410 --> 00:00:48,010 One thing to notice is that longer words tend to mean a higher reading level, 19 00:00:48,010 --> 00:00:51,480 and another thing that you might notice is that more words per sentence-- 20 00:00:51,480 --> 00:00:53,490 in other words, longer sentences-- 21 00:00:53,490 --> 00:00:57,240 might also mean that a particular text is at a higher reading level. 
22 00:00:57,240 --> 00:00:59,160 We can take that information and actually 23 00:00:59,160 --> 00:01:03,960 plug it into a readability test, a formula that takes a text and computes 24 00:01:03,960 --> 00:01:06,120 what grade level it's appropriate for. 25 00:01:06,120 --> 00:01:09,300 One such example is the Coleman-Liau index, 26 00:01:09,300 --> 00:01:12,690 which takes the number of letters and words and sentences in a text 27 00:01:12,690 --> 00:01:17,100 and is able to conclude what US grade level it approximately corresponds to. 28 00:01:17,100 --> 00:01:18,910 The formula looks like this. 29 00:01:18,910 --> 00:01:26,280 The Coleman-Liau index value is equal to 0.0588 times L minus 0.296 times S minus 30 00:01:26,280 --> 00:01:32,430 15.8, where here L is the average number of letters per 100 words in the text 31 00:01:32,430 --> 00:01:37,230 and S is the average number of sentences per 100 words in the text. 32 00:01:37,230 --> 00:01:40,950 So to compute the Coleman-Liau index value for a particular text, 33 00:01:40,950 --> 00:01:45,120 you'll first need to count up how many letters, words, and sentences there 34 00:01:45,120 --> 00:01:48,240 are in that particular text, plug them into the formula, 35 00:01:48,240 --> 00:01:51,990 and use the result to determine what US grade reading level is 36 00:01:51,990 --> 00:01:54,300 appropriate for this particular text. 37 00:01:54,300 --> 00:01:57,690 Let's start by trying to count up the number of letters in a particular text, 38 00:01:57,690 --> 00:01:58,920 for example. 39 00:01:58,920 --> 00:02:00,870 In order to do that, you'll want to keep track 40 00:02:00,870 --> 00:02:03,720 of the number of both uppercase and lowercase letters 41 00:02:03,720 --> 00:02:06,960 that appear in the text, which isn't going to be every character. 42 00:02:06,960 --> 00:02:09,930 You should ignore spaces and punctuation, for example. 43 00:02:09,930 --> 00:02:11,800 But how are you going to do that? 
44 00:02:11,800 --> 00:02:14,610 Well, remember that a string of text you can think of as really 45 00:02:14,610 --> 00:02:18,760 just an array of characters you can iterate over one at a time. 46 00:02:18,760 --> 00:02:20,580 So in order to do so, you'll probably want 47 00:02:20,580 --> 00:02:24,090 to keep some sort of variable that's going to keep track of how many letters 48 00:02:24,090 --> 00:02:26,010 you've encountered so far. 49 00:02:26,010 --> 00:02:28,620 That variable probably initially will be set to 0 50 00:02:28,620 --> 00:02:30,810 because before you start looking through the string, 51 00:02:30,810 --> 00:02:32,880 you haven't seen any letters. 52 00:02:32,880 --> 00:02:35,820 But if you loop through the string one character at a time, 53 00:02:35,820 --> 00:02:38,340 you might start with the character n and realize that it 54 00:02:38,340 --> 00:02:40,200 is, in fact, an alphabetic character. 55 00:02:40,200 --> 00:02:44,350 It's a letter, so you should increment your letter count from 0 to 1. 56 00:02:44,350 --> 00:02:47,990 And you can do so for every subsequent character, checking if it's a letter 57 00:02:47,990 --> 00:02:50,670 and increasing the letter count if so. 58 00:02:50,670 --> 00:02:53,460 But as soon as you encounter a character that isn't a letter, 59 00:02:53,460 --> 00:02:56,040 you'll want to be careful to not increase the letter count, 60 00:02:56,040 --> 00:02:57,798 and in fact, to leave it the same. 61 00:02:57,798 --> 00:03:00,340 And then as you get to the next character, if it is a letter, 62 00:03:00,340 --> 00:03:02,430 then you can continue to increase the count. 63 00:03:02,430 --> 00:03:06,270 And you'll continue to repeat that for each of the characters in the string 64 00:03:06,270 --> 00:03:09,780 so that by the end of it, you have an accurate count of how many letters 65 00:03:09,780 --> 00:03:12,270 there are in this text. 
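[Editor's sketch] The letter-counting loop just described might look something like the following in C. This is only one possible approach, not the course's reference solution: the function name count_letters is hypothetical, and it assumes the standard isalpha function from ctype.h as the test for whether a character is a letter.

```c
#include <ctype.h>

// Hypothetical helper (name chosen for illustration): counts the
// alphabetic characters in a null-terminated string.
int count_letters(const char *text)
{
    int letters = 0; // before looking at the string, no letters have been seen

    // Walk the string one character at a time until the terminating '\0'
    for (int i = 0; text[i] != '\0'; i++)
    {
        // isalpha expects a value representable as unsigned char,
        // so cast before calling it
        if (isalpha((unsigned char) text[i]))
        {
            letters++; // only letters increase the count
        }
        // spaces, digits, and punctuation leave the count unchanged
    }
    return letters;
}
```

Calling count_letters("Hi, you!") would skip the comma, the space, and the exclamation point and count only the five letters.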
66 00:03:12,270 --> 00:03:15,610 How do you determine whether or not a character is a letter? 67 00:03:15,610 --> 00:03:19,350 Well, there are ways to do this using ASCII, remembering that every character 68 00:03:19,350 --> 00:03:21,000 has a numeric value. 69 00:03:21,000 --> 00:03:24,660 But you also might find it helpful to take a look at a C header file called 70 00:03:24,660 --> 00:03:27,570 ctype.h, which includes several functions that 71 00:03:27,570 --> 00:03:31,230 are helpful for determining the type of a particular character. 72 00:03:31,230 --> 00:03:34,950 That might help you to figure out how many letters there are in the text. 73 00:03:34,950 --> 00:03:37,650 After you've calculated how many letters are in the text, 74 00:03:37,650 --> 00:03:41,010 the next step is to figure out how many words are in that text. 75 00:03:41,010 --> 00:03:43,260 And what really is a word? 76 00:03:43,260 --> 00:03:45,510 Well, for the purposes of this program, you're 77 00:03:45,510 --> 00:03:47,850 going to count the number of words in a sentence 78 00:03:47,850 --> 00:03:49,920 by assuming that any sequence of characters 79 00:03:49,920 --> 00:03:54,840 separated by one or more spaces is going to count as a word. 80 00:03:54,840 --> 00:03:57,290 So let's take a look at an example. 81 00:03:57,290 --> 00:04:01,700 Here we again have a string, an array of characters representing some text. 82 00:04:01,700 --> 00:04:03,520 And we have a variable called words which 83 00:04:03,520 --> 00:04:06,650 is going to keep track of how many words we've encountered. 84 00:04:06,650 --> 00:04:08,930 As soon as we hit the first alphabetical character, 85 00:04:08,930 --> 00:04:12,620 the letter A, the fact that we've hit this first non-space character 86 00:04:12,620 --> 00:04:14,690 at the start of the string indicates to us 87 00:04:14,690 --> 00:04:17,149 that this is, in fact, the start of the first word, 88 00:04:17,149 --> 00:04:19,279 and we've now found one word. 
89 00:04:19,279 --> 00:04:21,380 When we encounter other alphabetical characters, 90 00:04:21,380 --> 00:04:24,680 we're not going to increment the word count just yet because words 91 00:04:24,680 --> 00:04:27,140 have to be separated by spaces. 92 00:04:27,140 --> 00:04:30,530 As soon as we do hit a space, though, the fact that we've hit a space-- 93 00:04:30,530 --> 00:04:32,240 that marks the end of a word. 94 00:04:32,240 --> 00:04:34,490 It means the next word is coming if we ever 95 00:04:34,490 --> 00:04:36,977 encounter another alphabetic character. 96 00:04:36,977 --> 00:04:38,810 And as soon as we get to the next character, 97 00:04:38,810 --> 00:04:41,300 we do, in fact, encounter an alphabetic character, 98 00:04:41,300 --> 00:04:44,840 so we can increment the word count from 1 to 2. 99 00:04:44,840 --> 00:04:48,650 The non-space character here marks the start of a new word. 100 00:04:48,650 --> 00:04:49,820 And we can keep going. 101 00:04:49,820 --> 00:04:53,210 When we detect the space again, that means a new word is coming so that when 102 00:04:53,210 --> 00:04:56,450 we hit another alphabetical character-- in this case, W-- 103 00:04:56,450 --> 00:04:59,790 we increment the word count from 2 to 3. 104 00:04:59,790 --> 00:05:03,150 Notice that the punctuation after the word "by" here doesn't mean there's 105 00:05:03,150 --> 00:05:04,230 a new word yet. 106 00:05:04,230 --> 00:05:06,380 It's still part of the existing word. 107 00:05:06,380 --> 00:05:09,300 The space means that a new word is coming. 108 00:05:09,300 --> 00:05:13,260 But imagine, for example, there are two spaces in a row in the string. 109 00:05:13,260 --> 00:05:14,800 What happens then? 110 00:05:14,800 --> 00:05:18,270 Well, multiple spaces in a row shouldn't count as a new word yet. 111 00:05:18,270 --> 00:05:20,010 We've still only seen four words. 112 00:05:20,010 --> 00:05:22,170 We haven't yet seen five words. 
113 00:05:22,170 --> 00:05:25,860 So you want to wait until we get to the next alphabetic character. 114 00:05:25,860 --> 00:05:29,490 Once we get to the letter T, which is, in fact, a non-space character, 115 00:05:29,490 --> 00:05:32,110 that should indicate to us that we've found another word, 116 00:05:32,110 --> 00:05:36,240 and we can increment the word count from four to five, for example. 117 00:05:36,240 --> 00:05:38,790 We can continue to do that for the rest of the string 118 00:05:38,790 --> 00:05:43,830 so that we can conclude that in this string, there are, in fact, five words. 119 00:05:43,830 --> 00:05:46,080 So that's how we might go about counting words. 120 00:05:46,080 --> 00:05:48,270 But after we've counted letters and words, 121 00:05:48,270 --> 00:05:52,110 the last piece of information we need to plug into that Coleman-Liau index 122 00:05:52,110 --> 00:05:55,590 is the number of sentences that are present in the string. 123 00:05:55,590 --> 00:05:57,750 And this is, in fact, a little bit tricky. 124 00:05:57,750 --> 00:05:59,610 But for the purpose of this problem, we'll 125 00:05:59,610 --> 00:06:03,390 let you assume that any period, exclamation point, or question 126 00:06:03,390 --> 00:06:06,960 mark that appears in the string indicates a sentence. 127 00:06:06,960 --> 00:06:08,820 In reality, this might not be the case. 128 00:06:08,820 --> 00:06:11,040 Consider, for example, Mr., where you might 129 00:06:11,040 --> 00:06:14,310 see a period that doesn't actually indicate the end of a sentence. 130 00:06:14,310 --> 00:06:18,240 But for simplicity, it's safe to assume that generally, periods and exclamation 131 00:06:18,240 --> 00:06:21,850 points and question marks are going to mark sentence boundaries. 
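[Editor's sketch] The word and sentence rules described so far could be sketched in C roughly as follows. The function names count_words and count_sentences and the in_word flag are hypothetical, chosen for illustration. The sketch assumes that only the space character separates words, and that every period, exclamation point, or question mark ends a sentence, as the simplification above allows.

```c
#include <stdbool.h>

// Hypothetical helper: counts words, where a word is any run of
// non-space characters separated by one or more spaces.
int count_words(const char *text)
{
    int words = 0;
    bool in_word = false; // true while we are inside a word

    for (int i = 0; text[i] != '\0'; i++)
    {
        if (text[i] == ' ')
        {
            // A space ends the current word; repeated spaces change nothing
            in_word = false;
        }
        else if (!in_word)
        {
            // First non-space character after a space (or at the start
            // of the string) marks the start of a new word
            in_word = true;
            words++;
        }
        // Any other non-space character, including punctuation,
        // is still part of the current word
    }
    return words;
}

// Hypothetical helper: counts '.', '!' and '?' as sentence endings.
int count_sentences(const char *text)
{
    int sentences = 0;
    for (int i = 0; text[i] != '\0'; i++)
    {
        if (text[i] == '.' || text[i] == '!' || text[i] == '?')
        {
            sentences++;
        }
    }
    return sentences;
}
```

Under these assumptions, a string with two spaces in a row still yields one word per run of non-space characters, and an abbreviation such as "Mr." is counted as a sentence ending, which is the known limitation mentioned above.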
132 00:06:21,850 --> 00:06:24,450 So in a string like this, for example, if we 133 00:06:24,450 --> 00:06:27,450 look for all the periods and question marks and exclamation points, 134 00:06:27,450 --> 00:06:28,800 we find two of them. 135 00:06:28,800 --> 00:06:33,155 So we can conclude that this string has two sentences in it. 136 00:06:33,155 --> 00:06:35,280 After we've done all of these steps, you should now 137 00:06:35,280 --> 00:06:39,270 have accurate counts of the number of letters, words, and sentences 138 00:06:39,270 --> 00:06:41,160 that appear inside of the text. 139 00:06:41,160 --> 00:06:45,910 And the last step is to calculate the value of the Coleman-Liau index. 140 00:06:45,910 --> 00:06:47,683 So how are you going to do that? 141 00:06:47,683 --> 00:06:49,350 Well, once you have these three values-- 142 00:06:49,350 --> 00:06:51,750 letters, words, and sentences-- 143 00:06:51,750 --> 00:06:55,770 you can plug them into the formula to compute what the index value should be. 144 00:06:55,770 --> 00:06:59,670 Remember that the index value is based on L, the average number of letters 145 00:06:59,670 --> 00:07:04,380 per 100 words, and S, the average number of sentences per 100 words. 146 00:07:04,380 --> 00:07:07,860 But now that you have a count of the number of words, the number of letters, 147 00:07:07,860 --> 00:07:09,750 and the number of sentences, you should be 148 00:07:09,750 --> 00:07:14,490 able to calculate L and S and plug that information into the Coleman-Liau index 149 00:07:14,490 --> 00:07:18,500 formula to figure out what the reading level should be. 150 00:07:18,500 --> 00:07:20,260 What should your program then output? 151 00:07:20,260 --> 00:07:22,900 Well, your formula might give you a decimal number, 152 00:07:22,900 --> 00:07:26,140 so you'll want to be sure to round the score to the nearest whole number 153 00:07:26,140 --> 00:07:29,562 first because you want to approximate a US grade level. 
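[Editor's sketch] Turning the three counts into an index value might look like this in C. The function name coleman_liau is hypothetical, the coefficients are the standard published ones for the Coleman-Liau index (0.0588, 0.296, and 15.8), and the zero-words guard is an added assumption so the sketch never divides by zero.

```c
// Hypothetical helper: computes the Coleman-Liau index from raw counts.
double coleman_liau(int letters, int words, int sentences)
{
    // Assumption for this sketch: with no words there is nothing to
    // average over, so return 0 instead of dividing by zero
    if (words == 0)
    {
        return 0.0;
    }

    // Cast to double first so the division is not integer division
    double L = (double) letters / words * 100.0;   // letters per 100 words
    double S = (double) sentences / words * 100.0; // sentences per 100 words

    return 0.0588 * L - 0.296 * S - 15.8;
}
```

For instance, a text with 65 letters, 14 words, and 4 sentences gives L of about 464.29 and S of about 28.57, so the index comes out to roughly 3.04, which rounds to grade 3. The casts matter: without them, 65 / 14 would be truncated to 4 before being multiplied by 100.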
154 00:07:29,562 --> 00:07:31,270 You want your program to output something 155 00:07:31,270 --> 00:07:36,520 like grade x, where x is the grade level appropriate for this particular text. 156 00:07:36,520 --> 00:07:41,110 Of course, what happens if the number is remarkably low or especially high? 157 00:07:41,110 --> 00:07:43,720 Well, if the output number is less than 1, 158 00:07:43,720 --> 00:07:46,640 then you should instead output before grade 1 159 00:07:46,640 --> 00:07:50,450 to indicate that the reading level is earlier than grade 1. 160 00:07:50,450 --> 00:07:52,840 Meanwhile, if the output is 16 or higher, 161 00:07:52,840 --> 00:07:55,270 approximately that of a college senior or higher, 162 00:07:55,270 --> 00:07:58,535 you should just output grade 16 plus to indicate 163 00:07:58,535 --> 00:08:00,910 that it's the highest reading level that we'll keep track 164 00:08:00,910 --> 00:08:03,010 of for the purpose of this program. 165 00:08:03,010 --> 00:08:06,550 Once you've done that, you should be able to run your readability program, 166 00:08:06,550 --> 00:08:10,540 type in some text, and see as output the approximate reading level that would be 167 00:08:10,540 --> 00:08:13,330 appropriate for this particular text. 168 00:08:13,330 --> 00:08:17,490 My name is Brian, and this was readability. 169 00:08:17,490 --> 00:08:18,113
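[Editor's sketch] The rounding and output rules above could be sketched in C as follows. The function names clamp_grade and print_grade are hypothetical, and the exact capitalization of the output strings ("Before Grade 1", "Grade 16+") is an assumption based on the spoken description. Rounding is done by adding 0.5 and truncating, which is a simple stand-in for the round function in math.h and is only used here for values that are already known to be positive.

```c
#include <stdio.h>

// Hypothetical helper: rounds the index to the nearest whole grade,
// returning 0 for anything below grade 1 and 16 for grade 16 or above.
int clamp_grade(double index)
{
    if (index < 0.5)
    {
        return 0; // rounds to less than 1
    }
    if (index >= 15.5)
    {
        return 16; // rounds to 16 or higher
    }
    // index is between 0.5 and 15.5 here, so adding 0.5 and truncating
    // rounds to the nearest whole number
    return (int) (index + 0.5);
}

// Hypothetical helper: prints the grade in the three forms described.
void print_grade(double index)
{
    int grade = clamp_grade(index);
    if (grade < 1)
    {
        printf("Before Grade 1\n");
    }
    else if (grade >= 16)
    {
        printf("Grade 16+\n");
    }
    else
    {
        printf("Grade %d\n", grade);
    }
}
```

With an index of about 3.04, print_grade would print the grade 3 form, while a negative index falls into the before-grade-1 case and an index of 40 falls into the 16-plus case.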