1
00:00:00,000 --> 00:00:00,150


2
00:00:00,150 --> 00:00:02,009
BRIAN YU: Let's dive into readability.

3
00:00:02,009 --> 00:00:05,760
In readability, your task is going
to be to write a program in C that

4
00:00:05,760 --> 00:00:09,870
takes as input some text and
outputs the approximate US grade

5
00:00:09,870 --> 00:00:13,500
level that would be the appropriate
reading level for that text.

6
00:00:13,500 --> 00:00:16,710
For example, you might run your
readability program by calling

7
00:00:16,710 --> 00:00:18,600
./readability.

8
00:00:18,600 --> 00:00:21,300
Your program will then prompt
you to type in some text,

9
00:00:21,300 --> 00:00:23,400
where you could type in
a couple of sentences.

10
00:00:23,400 --> 00:00:25,650
And then your program
would analyze that text

11
00:00:25,650 --> 00:00:29,010
and conclude, for example, that these
sentences are at a third grade reading

12
00:00:29,010 --> 00:00:29,778
level.

13
00:00:29,778 --> 00:00:32,070
Or if you typed in something
a little more complicated,

14
00:00:32,070 --> 00:00:35,310
you might get that it's a fifth grade
reading level, or something else.

15
00:00:35,310 --> 00:00:37,950
How do you actually
calculate this reading level?

16
00:00:37,950 --> 00:00:40,170
Well, first, we can make
a couple of observations

17
00:00:40,170 --> 00:00:43,410
about what makes something
easier or harder to read.

18
00:00:43,410 --> 00:00:48,010
One thing to notice is that longer words
tend to mean a higher reading level,

19
00:00:48,010 --> 00:00:51,480
and another thing that you might notice
is that more words per sentence--

20
00:00:51,480 --> 00:00:53,490
in other words, longer sentences--

21
00:00:53,490 --> 00:00:57,240
might also mean that a particular
text is at a higher reading level.

22
00:00:57,240 --> 00:00:59,160
We can take that
information and actually

23
00:00:59,160 --> 00:01:03,960
plug it into a readability test, a
formula that takes a text and computes

24
00:01:03,960 --> 00:01:06,120
what grade level it's appropriate for.

25
00:01:06,120 --> 00:01:09,300
One such example is
the Coleman-Liau index,

26
00:01:09,300 --> 00:01:12,690
which takes the number of letters
and words and sentences in a text

27
00:01:12,690 --> 00:01:17,100
and is able to conclude what US grade
level it approximately corresponds to.

28
00:01:17,100 --> 00:01:18,910
The formula looks like this.

29
00:01:18,910 --> 00:01:26,280
The Coleman-Liau index value is equal to
0.0588 times L minus 0.296 times S minus

30
00:01:26,280 --> 00:01:32,430
15.8, where here L is the average number
of letters per 100 words in the text

31
00:01:32,430 --> 00:01:37,230
and S is the average number of
sentences per 100 words in the text.
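
Written out, the formula reads as follows. This uses the standard coefficients of the Coleman-Liau index. The names "letters", "words", and "sentences" are just labels for the three counts taken from the text.

```latex
\[
\text{index} = 0.0588 \cdot L - 0.296 \cdot S - 15.8,
\qquad
L = \frac{\text{letters}}{\text{words}} \times 100,
\qquad
S = \frac{\text{sentences}}{\text{words}} \times 100
\]
```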

32
00:01:37,230 --> 00:01:40,950
So to compute the Coleman-Liau
index value for a particular text,

33
00:01:40,950 --> 00:01:45,120
you'll first need to count up how many
letters, words, and sentences there

34
00:01:45,120 --> 00:01:48,240
are in that particular text,
plug them into the formula,

35
00:01:48,240 --> 00:01:51,990
and use the result to determine
which US grade reading level is

36
00:01:51,990 --> 00:01:54,300
appropriate for this particular text.

37
00:01:54,300 --> 00:01:57,690
Let's start by trying to count up the
number of letters in a particular text,

38
00:01:57,690 --> 00:01:58,920
for example.

39
00:01:58,920 --> 00:02:00,870
In order to do that,
you'll want to keep track

40
00:02:00,870 --> 00:02:03,720
of the number of both
uppercase and lowercase letters

41
00:02:03,720 --> 00:02:06,960
that appear in the text, which
isn't going to be every character.

42
00:02:06,960 --> 00:02:09,930
You should ignore spaces and
punctuation, for example.

43
00:02:09,930 --> 00:02:11,800
But how are you going to do that?

44
00:02:11,800 --> 00:02:14,610
Well, remember that you can think of
a string of text as really

45
00:02:14,610 --> 00:02:18,760
just an array of characters you
can iterate over one at a time.

46
00:02:18,760 --> 00:02:20,580
So in order to do so,
you'll probably want

47
00:02:20,580 --> 00:02:24,090
to keep some sort of variable that's
going to keep track of how many letters

48
00:02:24,090 --> 00:02:26,010
you've encountered so far.

49
00:02:26,010 --> 00:02:28,620
That variable probably
initially will be set to 0

50
00:02:28,620 --> 00:02:30,810
because before you start
looking through the string,

51
00:02:30,810 --> 00:02:32,880
you haven't seen any letters.

52
00:02:32,880 --> 00:02:35,820
But if you loop through the
string one character at a time,

53
00:02:35,820 --> 00:02:38,340
you might start with the
character n and realize that it

54
00:02:38,340 --> 00:02:40,200
is, in fact, an alphabetic character.

55
00:02:40,200 --> 00:02:44,350
It's a letter, so you should increment
your letter count from 0 to 1.

56
00:02:44,350 --> 00:02:47,990
And you can do so for every subsequent
character, checking if it's a letter

57
00:02:47,990 --> 00:02:50,670
and increasing the letter count if so.

58
00:02:50,670 --> 00:02:53,460
But as soon as you encounter a
character that isn't a letter,

59
00:02:53,460 --> 00:02:56,040
you'll want to be careful to
not increase the letter count,

60
00:02:56,040 --> 00:02:57,798
and in fact, to leave it the same.

61
00:02:57,798 --> 00:03:00,340
And then as you get to the next
character, if it is a letter,

62
00:03:00,340 --> 00:03:02,430
then you can continue
to increase the count.

63
00:03:02,430 --> 00:03:06,270
And you'll continue to repeat that for
each of the characters in the string

64
00:03:06,270 --> 00:03:09,780
so that by the end of it, you have
an accurate count of how many letters

65
00:03:09,780 --> 00:03:12,270
there are in this text.

66
00:03:12,270 --> 00:03:15,610
How do you determine whether
a character is a letter?

67
00:03:15,610 --> 00:03:19,350
Well, there are ways to do this using
ASCII, remembering that every character

68
00:03:19,350 --> 00:03:21,000
has a numeric value.

69
00:03:21,000 --> 00:03:24,660
But you also might find it helpful to
take a look at a C header file called

70
00:03:24,660 --> 00:03:27,570
ctype.h, which includes
several functions that

71
00:03:27,570 --> 00:03:31,230
are helpful for determining the
type of a particular character.

72
00:03:31,230 --> 00:03:34,950
That might help you to figure out how
many letters there are in the text.
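
A minimal sketch of the letter-counting loop described above, assuming the text is available as an ordinary C string. The function name count_letters is hypothetical; isalpha is one of the ctype.h functions.

```c
#include <ctype.h>

// Hypothetical helper: counts the alphabetic characters in text.
int count_letters(const char *text)
{
    int letters = 0; // no letters seen before the loop starts

    for (int i = 0; text[i] != '\0'; i++)
    {
        // Cast to unsigned char: passing a negative value to isalpha
        // is undefined behavior.
        if (isalpha((unsigned char) text[i]))
        {
            letters++;
        }
        // Spaces, digits, and punctuation leave the count unchanged.
    }
    return letters;
}
```

For example, count_letters("Hello, world!") gives 10, since the comma, the space, and the exclamation point are skipped.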

73
00:03:34,950 --> 00:03:37,650
After you've calculated how
many letters are in the text,

74
00:03:37,650 --> 00:03:41,010
the next step is to figure out
how many words are in that text.

75
00:03:41,010 --> 00:03:43,260
And what really is a word?

76
00:03:43,260 --> 00:03:45,510
Well, for the purposes
of this program, you're

77
00:03:45,510 --> 00:03:47,850
going to count the number
of words in a sentence

78
00:03:47,850 --> 00:03:49,920
by assuming that any
sequence of characters

79
00:03:49,920 --> 00:03:54,840
separated by one or more spaces
is going to count as a word.

80
00:03:54,840 --> 00:03:57,290
So let's take a look at an example.

81
00:03:57,290 --> 00:04:01,700
Here we again have a string, an array
of characters representing some text.

82
00:04:01,700 --> 00:04:03,520
And we have a variable
called words which

83
00:04:03,520 --> 00:04:06,650
is going to keep track of how
many words we've encountered.

84
00:04:06,650 --> 00:04:08,930
As soon as we hit the first
alphabetical character,

85
00:04:08,930 --> 00:04:12,620
the letter A, the fact that we've
hit this first non-space character

86
00:04:12,620 --> 00:04:14,690
at the start of the
string indicates to us

87
00:04:14,690 --> 00:04:17,149
that this is, in fact, the
start of the first word,

88
00:04:17,149 --> 00:04:19,279
and we've now found one word.

89
00:04:19,279 --> 00:04:21,380
When we encounter other
alphabetical characters,

90
00:04:21,380 --> 00:04:24,680
we're not going to increment the
word count just yet because words

91
00:04:24,680 --> 00:04:27,140
have to be separated by spaces.

92
00:04:27,140 --> 00:04:30,530
As soon as we do hit a space, though,
the fact that we've hit a space--

93
00:04:30,530 --> 00:04:32,240
that marks the end of a word.

94
00:04:32,240 --> 00:04:34,490
It means the next word
is coming if we ever

95
00:04:34,490 --> 00:04:36,977
encounter another alphabetic character.

96
00:04:36,977 --> 00:04:38,810
And as soon as we get
to the next character,

97
00:04:38,810 --> 00:04:41,300
we do, in fact, encounter
an alphabetic character,

98
00:04:41,300 --> 00:04:44,840
so we can increment the
word count from 1 to 2.

99
00:04:44,840 --> 00:04:48,650
The non-space character here
marks the start of a new word.

100
00:04:48,650 --> 00:04:49,820
And we can keep going.

101
00:04:49,820 --> 00:04:53,210
When we detect the space again, that
means a new word is coming so that when

102
00:04:53,210 --> 00:04:56,450
we hit another alphabetical
character-- in this case, W--

103
00:04:56,450 --> 00:04:59,790
we increment the word count from 2 to 3.

104
00:04:59,790 --> 00:05:03,150
Notice that the punctuation after the
word "by" here doesn't mean there's

105
00:05:03,150 --> 00:05:04,230
a new word yet.

106
00:05:04,230 --> 00:05:06,380
It's still part of the existing word.

107
00:05:06,380 --> 00:05:09,300
The space means that
a new word is coming.

108
00:05:09,300 --> 00:05:13,260
But imagine, for example, there are
two spaces in a row in the string.

109
00:05:13,260 --> 00:05:14,800
What happens then?

110
00:05:14,800 --> 00:05:18,270
Well, multiple spaces in a row
shouldn't count as a new word yet.

111
00:05:18,270 --> 00:05:20,010
We've still only seen four words.

112
00:05:20,010 --> 00:05:22,170
We haven't yet seen five words.

113
00:05:22,170 --> 00:05:25,860
So you want to wait until we get
to the next alphabetic character.

114
00:05:25,860 --> 00:05:29,490
Once we get to the letter T, which
is, in fact, a non-space character,

115
00:05:29,490 --> 00:05:32,110
that should indicate to us
that we've found another word,

116
00:05:32,110 --> 00:05:36,240
and we can increment the word count
from 4 to 5, for example.

117
00:05:36,240 --> 00:05:38,790
We can continue to do that
for the rest of the string

118
00:05:38,790 --> 00:05:43,830
so that we can conclude that in this
string, there are, in fact, five words.

119
00:05:43,830 --> 00:05:46,080
So that's how we might
go about counting words.
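
One way to sketch the word-counting walk above is with a flag that remembers whether the loop is currently inside a word. The names count_words and in_word are hypothetical, and treating every whitespace character (not just the space) as a separator is an assumption made here.

```c
#include <ctype.h>
#include <stdbool.h>

// Hypothetical helper: counts runs of non-space characters in text.
int count_words(const char *text)
{
    int words = 0;
    bool in_word = false; // true while the loop is inside a word

    for (int i = 0; text[i] != '\0'; i++)
    {
        if (isspace((unsigned char) text[i]))
        {
            // A space ends the current word; repeated spaces change nothing.
            in_word = false;
        }
        else if (!in_word)
        {
            // First non-space character after a space (or at the start
            // of the string) marks the start of a new word.
            in_word = true;
            words++;
        }
    }
    return words;
}
```

Because the count only goes up on the first character of a word, punctuation attached to a word and multiple spaces in a row do not add extra words.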

120
00:05:46,080 --> 00:05:48,270
But after we've counted
letters and words,

121
00:05:48,270 --> 00:05:52,110
the last piece of information we need
to plug into that Coleman-Liau index

122
00:05:52,110 --> 00:05:55,590
is the number of sentences
that are present in the string.

123
00:05:55,590 --> 00:05:57,750
And this is, in fact,
a little bit tricky.

124
00:05:57,750 --> 00:05:59,610
But for the purpose
of this problem, we'll

125
00:05:59,610 --> 00:06:03,390
let you assume that any period,
exclamation point, or question

126
00:06:03,390 --> 00:06:06,960
mark that appears in the
string indicates a sentence.

127
00:06:06,960 --> 00:06:08,820
In reality, this might not be the case.

128
00:06:08,820 --> 00:06:11,040
Consider, for example,
Mr., where you might

129
00:06:11,040 --> 00:06:14,310
see a period that doesn't actually
indicate the end of a sentence.

130
00:06:14,310 --> 00:06:18,240
But for simplicity, it's safe to assume
that generally, periods and exclamation

131
00:06:18,240 --> 00:06:21,850
points and question marks are
going to mark sentence boundaries.

132
00:06:21,850 --> 00:06:24,450
So in a string like
this, for example, if we

133
00:06:24,450 --> 00:06:27,450
look for all the periods and question
marks and exclamation points,

134
00:06:27,450 --> 00:06:28,800
we find two of them.

135
00:06:28,800 --> 00:06:33,155
So we can conclude that this
string has two sentences in it.
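
A minimal sketch of that sentence count, under the simplifying assumption stated above that every period, exclamation point, or question mark ends a sentence. The function name count_sentences is hypothetical.

```c
// Hypothetical helper: counts '.', '!' and '?' characters in text.
int count_sentences(const char *text)
{
    int sentences = 0;

    for (int i = 0; text[i] != '\0'; i++)
    {
        char c = text[i];
        if (c == '.' || c == '!' || c == '?')
        {
            sentences++;
        }
    }
    return sentences;
}
```

Under this rule a string such as "Mr. Smith left." counts as two sentences, which is the known inaccuracy mentioned above.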

136
00:06:33,155 --> 00:06:35,280
After we've done all of
these steps, you should now

137
00:06:35,280 --> 00:06:39,270
have accurate counts of the number
of letters, words, and sentences

138
00:06:39,270 --> 00:06:41,160
that appear inside of the text.

139
00:06:41,160 --> 00:06:45,910
And the last step is to calculate
the value of the Coleman-Liau index.

140
00:06:45,910 --> 00:06:47,683
So how are you going to do that?

141
00:06:47,683 --> 00:06:49,350
Well, once you have these three values--

142
00:06:49,350 --> 00:06:51,750
letters, words, and sentences--

143
00:06:51,750 --> 00:06:55,770
you can plug them into the formula to
compute what the index value should be.

144
00:06:55,770 --> 00:06:59,670
Remember that the index value is based
on L, the average number of letters

145
00:06:59,670 --> 00:07:04,380
per 100 words, and S, the average
number of sentences per 100 words.

146
00:07:04,380 --> 00:07:07,860
But now that you have a count of the
number of words, the number of letters,

147
00:07:07,860 --> 00:07:09,750
and the number of
sentences, you should be

148
00:07:09,750 --> 00:07:14,490
able to calculate L and S and plug that
information into the Coleman-Liau index

149
00:07:14,490 --> 00:07:18,500
formula to figure out what
the reading level should be.
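
A sketch of that calculation, assuming the three counts are already available as integers. The function name coleman_liau is hypothetical, and returning 0 for a text with no words is an assumption made only to avoid dividing by zero.

```c
// Hypothetical helper: computes the Coleman-Liau index from raw counts.
double coleman_liau(int letters, int words, int sentences)
{
    if (words == 0)
    {
        return 0.0; // assumption: empty text, nothing to divide by
    }

    // Cast before dividing so the division is not truncated to an integer.
    double L = (double) letters / words * 100.0;   // letters per 100 words
    double S = (double) sentences / words * 100.0; // sentences per 100 words

    return 0.0588 * L - 0.296 * S - 15.8;
}
```

For a text with 65 letters, 14 words, and 4 sentences, this gives a value of roughly 3.04. Dividing the counts as plain integers would drop the fractional part and could shift the result by a whole grade, which is why the cast matters.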

150
00:07:18,500 --> 00:07:20,260
What should your program then output?

151
00:07:20,260 --> 00:07:22,900
Well, your formula might
give you a decimal number,

152
00:07:22,900 --> 00:07:26,140
so you'll want to be sure to round
the score to the nearest whole number

153
00:07:26,140 --> 00:07:29,562
first because you want to
approximate a US grade level.

154
00:07:29,562 --> 00:07:31,270
You want your program
to output something

155
00:07:31,270 --> 00:07:36,520
like "Grade X," where X is the grade level
appropriate for this particular text.

156
00:07:36,520 --> 00:07:41,110
Of course, what happens if the number
is remarkably low or especially high?

157
00:07:41,110 --> 00:07:43,720
Well, if the output
number is less than 1,

158
00:07:43,720 --> 00:07:46,640
then you should instead
output "Before Grade 1"

159
00:07:46,640 --> 00:07:50,450
to indicate that the reading
level is earlier than grade 1.

160
00:07:50,450 --> 00:07:52,840
Meanwhile, if the
output is 16 or higher,

161
00:07:52,840 --> 00:07:55,270
approximate that of a
college senior or higher,

162
00:07:55,270 --> 00:07:58,535
you should just output
"Grade 16+" to indicate

163
00:07:58,535 --> 00:08:00,910
that it's the highest reading
level that we'll keep track

164
00:08:00,910 --> 00:08:03,010
of for the purpose of this program.
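
The output rules above might be sketched like this. The names round_to_grade and print_grade are hypothetical. The rounding is done by hand here so the sketch does not depend on the math library; the round function from math.h would do the same job.

```c
#include <stdio.h>

// Hypothetical helper: rounds the index to the nearest whole number,
// with halves rounded away from zero.
int round_to_grade(double index)
{
    return index >= 0 ? (int) (index + 0.5) : (int) (index - 0.5);
}

// Hypothetical helper: prints the grade line for a given index value.
void print_grade(double index)
{
    int grade = round_to_grade(index);

    if (grade < 1)
    {
        printf("Before Grade 1\n");
    }
    else if (grade >= 16)
    {
        printf("Grade 16+\n");
    }
    else
    {
        printf("Grade %i\n", grade);
    }
}
```

Rounding first and comparing afterwards means an index of 15.5 rounds to 16 and is reported as "Grade 16+".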

165
00:08:03,010 --> 00:08:06,550
Once you've done that, you should be
able to run your readability program,

166
00:08:06,550 --> 00:08:10,540
type in some text, and see as output the
approximate reading level that would be

167
00:08:10,540 --> 00:08:13,330
appropriate for this particular text.

168
00:08:13,330 --> 00:08:17,490
My name is Brian, and
this was readability.

169
00:08:17,490 --> 00:08:18,113