1 00:00:00,000 --> 00:00:02,210 [Walkthrough - Problem Set 6] 2 00:00:02,210 --> 00:00:04,810 [Zamyla Chan - Harvard University] 3 00:00:04,810 --> 00:00:07,240 [This is CS50. - CS50.TV] 4 00:00:07,240 --> 00:00:12,180 >> Hello, everyone, and welcome to Walkthrough 6: Huff'n Puff. 5 00:00:12,180 --> 00:00:17,440 In Huff'n Puff what we are doing is going to be dealing with a Huffman compressed file 6 00:00:17,440 --> 00:00:20,740 and then puffing it back up, so decompressing it, 7 00:00:20,740 --> 00:00:25,810 so that we can translate from the 0s and 1s that the user sends us 8 00:00:25,810 --> 00:00:30,660 and convert it back into the original text. 9 00:00:30,660 --> 00:00:34,360 Pset 6 is going to be pretty cool because you're going to see some of the tools 10 00:00:34,360 --> 00:00:41,730 that you used in pset 4 and pset 5 and kind of combining them into 1 pretty neat concept 11 00:00:41,730 --> 00:00:43,830 when you come to think about it. 12 00:00:43,830 --> 00:00:50,110 >> Also, arguably, pset 4 and 5 were the most challenging psets that we had to offer. 13 00:00:50,110 --> 00:00:53,950 So from now, we have this 1 more pset in C, 14 00:00:53,950 --> 00:00:56,480 and then after that we're on to web programming. 15 00:00:56,480 --> 00:01:02,310 So congratulate yourselves for getting over the toughest hump in CS50. 16 00:01:03,630 --> 00:01:09,760 >> Moving on for Huff'n Puff, our toolbox for this pset are going to be Huffman trees, 17 00:01:09,760 --> 00:01:14,700 so understanding not only how binary trees work but also specifically Huffman trees, 18 00:01:14,700 --> 00:01:16,240 how they're constructed. 
19 00:01:16,240 --> 00:01:20,210 And then we're going to have a lot of distribution code in this pset, 20 00:01:20,210 --> 00:01:22,480 and we'll come to see that actually some of the code 21 00:01:22,480 --> 00:01:24,670 we might not be able to fully understand yet, 22 00:01:24,670 --> 00:01:30,080 and so those will be the .c files, but then their accompanying .h files 23 00:01:30,080 --> 00:01:34,300 will give us enough understanding that we need so that we know how those functions work 24 00:01:34,300 --> 00:01:38,100 or at least what they are supposed to do--their inputs and outputs-- 25 00:01:38,100 --> 00:01:40,760 even if we don't know what's happening in the black box 26 00:01:40,760 --> 00:01:44,090 or don't understand what's happening in the black box within. 27 00:01:44,090 --> 00:01:49,400 And then finally, as usual, we are dealing with new data structures, 28 00:01:49,400 --> 00:01:51,840 specific types of nodes that point to certain things, 29 00:01:51,840 --> 00:01:56,080 and so here having a pen and paper not only for the design process 30 00:01:56,080 --> 00:01:58,470 and when you're trying to figure out how your pset should work 31 00:01:58,470 --> 00:02:00,520 but also during debugging. 32 00:02:00,520 --> 00:02:06,140 You can have GDB alongside your pen and paper while you take down what the values are, 33 00:02:06,140 --> 00:02:09,320 where your arrows are pointing, and things like that. 34 00:02:09,320 --> 00:02:13,720 >> First let's look at Huffman trees. 35 00:02:13,720 --> 00:02:19,600 Huffman trees are binary trees, meaning that each node only has 2 children. 36 00:02:19,600 --> 00:02:24,870 In Huffman trees the characteristic is that the most frequent values 37 00:02:24,870 --> 00:02:27,140 are represented by the fewest bits. 38 00:02:27,140 --> 00:02:32,690 We saw in lecture examples of Morse code, which kind of consolidated some letters. 
39 00:02:32,690 --> 00:02:38,030 If you're trying to translate an A or an E, for example, 40 00:02:38,030 --> 00:02:43,940 you're translating that often, so instead of having to use the full set of bits 41 00:02:43,940 --> 00:02:48,640 allocated for that usual data type, you compress it down to fewer, 42 00:02:48,640 --> 00:02:53,730 and then those letters who are represented less often are represented with longer bits 43 00:02:53,730 --> 00:02:59,840 because you can afford that when you weigh out the frequencies that those letters appear. 44 00:02:59,840 --> 00:03:03,020 We have the same idea here in Huffman trees 45 00:03:03,020 --> 00:03:12,360 where we are making a chain, a kind of path to get to the certain characters. 46 00:03:12,360 --> 00:03:14,470 And then the characters who have the most frequency 47 00:03:14,470 --> 00:03:17,940 are going to be represented with the fewest bits. 48 00:03:17,940 --> 00:03:22,020 >> The way that you construct a Huffman tree 49 00:03:22,020 --> 00:03:27,430 is by placing all of the characters that appear in the text 50 00:03:27,430 --> 00:03:30,630 and calculating their frequency, how often they appear. 51 00:03:30,630 --> 00:03:33,880 This could either be a count of how many times those letters appear 52 00:03:33,880 --> 00:03:40,270 or perhaps a percentage of out of all the characters how many each one appears. 53 00:03:40,270 --> 00:03:44,270 And so what you do is once you have all of that mapped out, 54 00:03:44,270 --> 00:03:49,060 then you look for the 2 lowest frequencies and then join them as siblings 55 00:03:49,060 --> 00:03:55,660 where then the parent node has a frequency which is the sum of its 2 children. 56 00:03:55,660 --> 00:04:00,870 And then you by convention say that the left node, 57 00:04:00,870 --> 00:04:03,770 you follow that by following the 0 branch, 58 00:04:03,770 --> 00:04:08,140 and then the rightmost node is the 1 branch. 
59 00:04:08,140 --> 00:04:16,040 As we saw in Morse code, the one gotcha was that if you had just a beep and the beep 60 00:04:16,040 --> 00:04:18,120 it was ambiguous. 61 00:04:18,120 --> 00:04:22,430 It could either be 1 letter or it could be a sequence of 2 letters. 62 00:04:22,430 --> 00:04:27,790 And so what Huffman trees does is because by nature of the characters 63 00:04:27,790 --> 00:04:34,140 or our final actual characters being the last nodes on the branch-- 64 00:04:34,140 --> 00:04:39,300 we refer to those as leaves--by virtue of that there can't be any ambiguity 65 00:04:39,300 --> 00:04:45,160 in terms of which letter you're trying to encode with the series of bits 66 00:04:45,160 --> 00:04:50,670 because nowhere along the bits that represent 1 letter 67 00:04:50,670 --> 00:04:55,960 will you encounter another whole letter, and there won't be any confusion there. 68 00:04:55,960 --> 00:04:58,430 But we'll go into examples that you guys can actually see that 69 00:04:58,430 --> 00:05:02,120 instead of us just telling you that that's true. 70 00:05:02,120 --> 00:05:06,390 >> Let's look at a simple example of a Huffman tree. 71 00:05:06,390 --> 00:05:09,380 I have a string here that is 12 characters long. 72 00:05:09,380 --> 00:05:14,010 I have 4 As, 6 Bs, and 2 Cs. 73 00:05:14,010 --> 00:05:17,270 My first step would be to count. 74 00:05:17,270 --> 00:05:20,760 How many times does A appear? It appears 4 times in the string. 75 00:05:20,760 --> 00:05:25,060 B appears 6 times, and C appears 2 times. 76 00:05:25,060 --> 00:05:28,970 Naturally, I'm going to say I'm using B most often, 77 00:05:28,970 --> 00:05:35,970 so I want to represent B with the fewest number of bits, the fewest number of 0s and 1s. 78 00:05:35,970 --> 00:05:42,600 And then I'm also going to expect C to require the most amount of 0s and 1s as well. 79 00:05:42,600 --> 00:05:48,550 First what I did here is I placed them in ascending order in terms of frequency. 
80 00:05:48,550 --> 00:05:52,710 We see that the C and the A, those are our 2 lowest frequencies. 81 00:05:52,710 --> 00:06:00,290 We create a parent node, and that parent node doesn't have a letter associated with it, 82 00:06:00,290 --> 00:06:05,070 but it does have a frequency, which is the sum. 83 00:06:05,070 --> 00:06:08,780 The sum becomes 2 + 4, which is 6. 84 00:06:08,780 --> 00:06:10,800 Then we follow the left branch. 85 00:06:10,800 --> 00:06:14,970 If we were at that 6 node, then we would follow 0 to get to C 86 00:06:14,970 --> 00:06:17,450 and then 1 to get to A. 87 00:06:17,450 --> 00:06:20,300 So now we have 2 nodes. 88 00:06:20,300 --> 00:06:23,920 We have the value 6 and then we also have another node with the value 6. 89 00:06:23,920 --> 00:06:28,550 And so those 2 are not only the 2 lowest but also just the 2 that are left, 90 00:06:28,550 --> 00:06:33,820 so we join those by another parent, with the sum being 12. 91 00:06:33,820 --> 00:06:36,300 So here we have our Huffman tree 92 00:06:36,300 --> 00:06:40,020 where to get to B, that would just be the bit 1 93 00:06:40,020 --> 00:06:45,430 and then to get to A we would have 01 and then C having 00. 94 00:06:45,430 --> 00:06:51,300 So here we see that basically we're representing these chars with either 1 or 2 bits 95 00:06:51,300 --> 00:06:55,160 where the B, as predicted, has the least. 96 00:06:55,160 --> 00:07:01,730 And then we had expected C to have the most, but since it's such a small Huffman tree, 97 00:07:01,730 --> 00:07:06,020 then the A is also represented by 2 bits as opposed to somewhere in the middle. 98 00:07:07,820 --> 00:07:11,070 >> Just to go over another simple example of the Huffman tree, 99 00:07:11,070 --> 00:07:19,570 say you have the string "Hello." 100 00:07:19,570 --> 00:07:25,360 What you do is first you would say how many times does H appear in this? 
101 00:07:25,360 --> 00:07:34,200 H appears once and then e appears once and then we have l appearing twice 102 00:07:34,200 --> 00:07:36,580 and o appearing once. 103 00:07:36,580 --> 00:07:44,310 And so then we expect which letter to be represented by the least number of bits? 104 00:07:44,310 --> 00:07:47,450 [student] l. >>l. Yeah. l is right. 105 00:07:47,450 --> 00:07:50,730 We expect l to be represented by the least number of bits 106 00:07:50,730 --> 00:07:55,890 because l is used most in the string "Hello." 107 00:07:55,890 --> 00:08:04,280 What I'm going to do now is draw out these nodes. 108 00:08:04,280 --> 00:08:15,580 I have 1, which is H, and then another 1, which is e, and then a 1, which is o-- 109 00:08:15,580 --> 00:08:23,410 right now I'm putting them in order--and then 2, which is l. 110 00:08:23,410 --> 00:08:32,799 Then I say the way that I build a Huffman tree is to find the 2 nodes with the least frequencies 111 00:08:32,799 --> 00:08:38,010 and make them siblings by creating a parent node. 112 00:08:38,010 --> 00:08:41,850 Here we have 3 nodes with the lowest frequency. They're all 1. 113 00:08:41,850 --> 00:08:50,620 So here we choose which one we're going to link first. 114 00:08:50,620 --> 00:08:54,850 Let's say I choose the H and the e. 115 00:08:54,850 --> 00:09:01,150 The sum of 1 + 1 is 2, but this node doesn't have a letter associated with it. 116 00:09:01,150 --> 00:09:04,440 It just holds the value. 117 00:09:04,440 --> 00:09:10,950 Now we look at the next 2 lowest frequencies. 118 00:09:10,950 --> 00:09:15,590 That's 2 and 1. That could be either of those 2, but I'm going to choose this one. 119 00:09:15,590 --> 00:09:18,800 The sum is 3. 120 00:09:18,800 --> 00:09:26,410 And then finally, I only have 2 left, so then that becomes 5. 121 00:09:26,410 --> 00:09:32,010 Then here, as expected, if I fill in the encoding for that, 122 00:09:32,010 --> 00:09:37,480 1s are always the right branch and 0s are the left one. 
123 00:09:37,480 --> 00:09:45,880 Then we have l represented by just 1 bit and then the o by 2 124 00:09:45,880 --> 00:09:52,360 and then the e by 3 and then the H also falls down to 3 bits. 125 00:09:52,360 --> 00:09:59,750 So you can transmit this message "Hello" instead of actually using the characters 126 00:09:59,750 --> 00:10:02,760 by just 0s and 1s. 127 00:10:02,760 --> 00:10:07,910 However, remember that in several cases we had ties with our frequency. 128 00:10:07,910 --> 00:10:11,900 We could have either joined the H and the o first maybe. 129 00:10:11,900 --> 00:10:15,730 Or then later on when we had the l represented by 2 130 00:10:15,730 --> 00:10:19,410 as well as the joined one represented by 2, we could have linked either one. 131 00:10:19,410 --> 00:10:23,630 >> And so when you send the 0s and 1s, that actually doesn't guarantee 132 00:10:23,630 --> 00:10:27,090 that the recipient can fully read your message right off the bat 133 00:10:27,090 --> 00:10:30,490 because they might not know which decision you made. 134 00:10:30,490 --> 00:10:34,920 So when we're dealing with Huffman compression, 135 00:10:34,920 --> 00:10:40,090 somehow we have to tell the recipient of our message how we decided-- 136 00:10:40,090 --> 00:10:43,470 They need to know some kind of extra information 137 00:10:43,470 --> 00:10:46,580 in addition to the compressed message. 138 00:10:46,580 --> 00:10:51,490 They need to understand what the tree actually looks like, 139 00:10:51,490 --> 00:10:55,450 how we actually made those decisions. 140 00:10:55,450 --> 00:10:59,100 >> Here we were just doing examples based on the actual count, 141 00:10:59,100 --> 00:11:01,550 but sometimes you can also have a Huffman tree 142 00:11:01,550 --> 00:11:05,760 based on the frequency at which letters appear, and it's the exact same process. 
143 00:11:05,760 --> 00:11:09,090 Here I'm expressing it in terms of percentages or a fraction, 144 00:11:09,090 --> 00:11:11,290 and so here the exact same thing. 145 00:11:11,290 --> 00:11:15,300 I find the 2 lowest, sum them, the next 2 lowest, sum them, 146 00:11:15,300 --> 00:11:19,390 until I have a full tree. 147 00:11:19,390 --> 00:11:23,610 Even though we could do it either way, when we're dealing with percentages, 148 00:11:23,610 --> 00:11:27,760 that means we're dividing things and dealing with decimals or rather floats 149 00:11:27,760 --> 00:11:30,900 if we're thinking about data structures ahead. 150 00:11:30,900 --> 00:11:32,540 What do we know about floats? 151 00:11:32,540 --> 00:11:35,180 What's a common problem when we're dealing with floats? 152 00:11:35,180 --> 00:11:38,600 [student] Imprecise arithmetic. >>Yeah. Imprecision. 153 00:11:38,600 --> 00:11:43,760 Because of floating point imprecision, for this pset, so that we make sure 154 00:11:43,760 --> 00:11:49,450 that we don't lose any values, we're actually going to be dealing with the count. 155 00:11:49,450 --> 00:11:54,880 So if you were to think of a Huffman node, if you look back to the structure here, 156 00:11:54,880 --> 00:12:01,740 if you look at the green ones it has a frequency associated with it 157 00:12:01,740 --> 00:12:08,760 as well as it points to a node to its left as well as a node to its right. 158 00:12:08,760 --> 00:12:13,970 And then the red ones there also have a character associated with them. 159 00:12:13,970 --> 00:12:18,900 We're not going to make separate ones for the parents and then the final nodes, 160 00:12:18,900 --> 00:12:23,680 which we refer to as leaves, but rather those will just have NULL values. 161 00:12:23,680 --> 00:12:31,050 For every node we'll have a character, the symbol that that node represents, 162 00:12:31,050 --> 00:12:40,490 then a frequency as well as a pointer to its left child as well as its right child. 
163 00:12:40,490 --> 00:12:45,680 The leaves, which are at the very bottom, would also have node pointers 164 00:12:45,680 --> 00:12:49,550 to their left and to their right, but since those values aren't pointing to actual nodes, 165 00:12:49,550 --> 00:12:53,970 what would their value be? >>[student] NULL. >>NULL. Exactly. 166 00:12:53,970 --> 00:12:58,430 Here's an example of how you might represent the frequency in floats, 167 00:12:58,430 --> 00:13:02,130 but we're going to be dealing with it with integers, 168 00:13:02,130 --> 00:13:06,780 so all I did is change the data type there. 169 00:13:06,780 --> 00:13:09,700 >> Let's go on to a little bit more of a complex example. 170 00:13:09,700 --> 00:13:13,360 But now that we've done the simple ones, it's just the same process. 171 00:13:13,360 --> 00:13:20,290 You find the 2 lowest frequencies, sum the frequencies 172 00:13:20,290 --> 00:13:22,450 and that's the new frequency of your parent node, 173 00:13:22,450 --> 00:13:29,310 which then points to its left with the 0 branch and the right with the 1 branch. 174 00:13:29,310 --> 00:13:34,200 If we have the string "This is cs50," then we count how many times is T mentioned, 175 00:13:34,200 --> 00:13:38,420 h mentioned, i, s, c, 5, 0. 176 00:13:38,420 --> 00:13:42,010 Then what I did here is with the red nodes I just planted, 177 00:13:42,010 --> 00:13:48,530 I said I'm going to have these characters eventually at the bottom of my tree. 178 00:13:48,530 --> 00:13:51,740 Those are going to be all of the leaves. 179 00:13:51,740 --> 00:13:58,200 Then what I did is I sorted them by frequency in ascending order, 180 00:13:58,200 --> 00:14:02,950 and this is actually the way that the pset code does it 181 00:14:02,950 --> 00:14:07,550 is it sorts it by frequency and then alphabetically. 182 00:14:07,550 --> 00:14:13,870 So it has the numbers first and then alphabetically by the frequency. 
183 00:14:13,870 --> 00:14:18,520 Then what I would do is I would find the 2 lowest. That's 0 and 5. 184 00:14:18,520 --> 00:14:22,390 I would sum them, and that's 2. Then I would continue, find the next 2 lowest. 185 00:14:22,390 --> 00:14:26,100 Those are the two 1s, and then those become 2 as well. 186 00:14:26,100 --> 00:14:31,570 Now I know that my next step is going to be joining the lowest number, 187 00:14:31,570 --> 00:14:41,380 which is the T, the 1, and then choosing one of the nodes that has 2 as the frequency. 188 00:14:41,380 --> 00:14:44,560 So here we have 3 options. 189 00:14:44,560 --> 00:14:47,980 What I'm going to do for the slide is just visually rearrange them for you 190 00:14:47,980 --> 00:14:51,790 so that you can see how I'm building it up. 191 00:14:51,790 --> 00:14:59,040 What the code and your distribution code is going to do would be join the T one 192 00:14:59,040 --> 00:15:01,410 with the 0 and 5 node. 193 00:15:01,410 --> 00:15:05,060 So then that sums to 3, and then we continue the process. 194 00:15:05,060 --> 00:15:08,660 The 2 and the 2 now are the lowest, so then those sum to 4. 195 00:15:08,660 --> 00:15:12,560 Everyone following so far? Okay. 196 00:15:12,560 --> 00:15:16,410 Then after that we have the 3 and the 3 that need to be added up, 197 00:15:16,410 --> 00:15:21,650 so again I'm just switching it so that you can see visually so that it doesn't get too messy. 198 00:15:21,650 --> 00:15:25,740 Then we have a 6, and then our final step is now that we only have 2 nodes 199 00:15:25,740 --> 00:15:30,440 we sum those to make the root of our tree, which is 10. 200 00:15:30,440 --> 00:15:34,100 And the number 10 makes sense because each node represented, 201 00:15:34,100 --> 00:15:40,750 their value, their frequency number, was how many times they appeared in the string, 202 00:15:40,750 --> 00:15:46,350 and then we have 10 characters in our string, so that makes sense. 
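That repeated find-the-two-lowest-and-join step can be sketched as a loop. This is a rough stand-in, not the distribution code (which keeps a sorted forest rather than an array), and all the names here are made up:

```c
#include <stdlib.h>

// Stand-in tree node for the sketch.
typedef struct tree
{
    char symbol;
    int frequency;
    struct tree *left;
    struct tree *right;
}
tree;

tree *make_leaf(char symbol, int frequency)
{
    tree *t = calloc(1, sizeof(tree));
    t->symbol = symbol;
    t->frequency = frequency;
    return t;
}

// Join two trees under a new parent whose frequency is the sum of theirs.
tree *join(tree *a, tree *b)
{
    tree *parent = calloc(1, sizeof(tree));
    parent->frequency = a->frequency + b->frequency;
    parent->left = a;    // the 0 branch
    parent->right = b;   // the 1 branch
    return parent;
}

// Repeat the pick-two-lowest-and-join step until one tree remains.
// Ties can be broken either way; only the code lengths may differ.
tree *build(tree *forest[], int n)
{
    while (n > 1)
    {
        // move the lowest-frequency tree to the front
        int lo = 0;
        for (int i = 1; i < n; i++)
            if (forest[i]->frequency < forest[lo]->frequency)
                lo = i;
        tree *tmp = forest[0]; forest[0] = forest[lo]; forest[lo] = tmp;

        // find the next lowest among the rest
        int lo2 = 1;
        for (int i = 2; i < n; i++)
            if (forest[i]->frequency < forest[lo2]->frequency)
                lo2 = i;

        // join them; the parent takes one slot, the last tree fills the other
        forest[0] = join(forest[0], forest[lo2]);
        forest[lo2] = forest[n - 1];
        n--;
    }
    return forest[0];
}
```

Running this on leaves with the counts from "This is cs50" produces a root whose frequency is 10, the sum of all the counts, just as on the slide.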
203 00:15:48,060 --> 00:15:52,320 If we look up at how we would actually encode it, 204 00:15:52,320 --> 00:15:56,580 as expected, the i and the s, which appear the most often 205 00:15:56,580 --> 00:16:01,350 are represented by the fewest number of bits. 206 00:16:03,660 --> 00:16:05,660 >> Be careful here. 207 00:16:05,660 --> 00:16:09,780 In Huffman trees the case actually matters. 208 00:16:09,780 --> 00:16:13,670 An uppercase S is different than a lowercase s. 209 00:16:13,670 --> 00:16:21,260 If we had "This is CS50" with capital letters, then the lowercase s would only appear twice, 210 00:16:21,260 --> 00:16:27,120 would be a node with 2 as its value, and then uppercase S would only be once. 211 00:16:27,120 --> 00:16:33,440 So then your tree would change structures because you actually have an extra leaf here. 212 00:16:33,440 --> 00:16:36,900 But the sum would still be 10. 213 00:16:36,900 --> 00:16:39,570 That's what we're actually going to be calling the checksum, 214 00:16:39,570 --> 00:16:44,060 the addition of all of the counts. 215 00:16:46,010 --> 00:16:50,990 >> Now that we've covered Huffman trees, we can dive into Huff'n Puff, the pset. 216 00:16:50,990 --> 00:16:52,900 We're going to start with a section of questions, 217 00:16:52,900 --> 00:16:57,990 and this is going to get you accustomed with binary trees and how to operate around that: 218 00:16:57,990 --> 00:17:03,230 drawing nodes, creating your own typedef struct for a node, 219 00:17:03,230 --> 00:17:07,230 and seeing how you might insert into a binary tree, one that's sorted, 220 00:17:07,230 --> 00:17:09,050 traversing it, and things like that. 221 00:17:09,050 --> 00:17:14,560 That knowledge is definitely going to help you when you dive into the Huff'n Puff portion 222 00:17:14,560 --> 00:17:17,089 of the pset. 
223 00:17:19,150 --> 00:17:26,329 In the standard edition of the pset, your task is to implement Puff, 224 00:17:26,329 --> 00:17:30,240 and in the hacker version your task is to implement Huff. 225 00:17:30,240 --> 00:17:38,490 What Huff does is it takes text and then it translates it into the 0s and 1s, 226 00:17:38,490 --> 00:17:41,990 so the process that we did above where we counted the frequencies 227 00:17:41,990 --> 00:17:50,970 and then made the tree and then said, "How do I get T?" 228 00:17:50,970 --> 00:17:54,840 T is represented by 100, things like that, 229 00:17:54,840 --> 00:17:58,860 and then Huff would take the text and then output that binary. 230 00:17:58,860 --> 00:18:04,920 But also because we know that we want to allow our recipient of the message 231 00:18:04,920 --> 00:18:11,790 to recreate the exact same tree, it also includes information about the frequency counts. 232 00:18:11,790 --> 00:18:17,980 Then with Puff we are given a binary file of 0s and 1s 233 00:18:17,980 --> 00:18:21,740 and given also the information about the frequencies. 234 00:18:21,740 --> 00:18:26,740 We translate all of those 0s and 1s back into the original message that was, 235 00:18:26,740 --> 00:18:29,350 so we're decompressing that. 236 00:18:29,350 --> 00:18:36,450 If you're doing the standard edition, you don't need to implement Huff, 237 00:18:36,450 --> 00:18:39,290 so then you can just use the staff implementation of Huff. 238 00:18:39,290 --> 00:18:42,080 There are instructions in the spec on how to do that. 239 00:18:42,080 --> 00:18:48,780 You can run the staff implementation of Huff upon a certain text file 240 00:18:48,780 --> 00:18:53,270 and then use that output as your input to Puff. 241 00:18:53,270 --> 00:18:59,330 >> As I mentioned before, we have a lot of distribution code for this one. 242 00:18:59,330 --> 00:19:01,810 I'm going to start going through it. 
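Puff's essential move, following the transmitted bits down the tree and emitting a character each time a leaf is reached, can be sketched like this. The struct and function names are illustrative, not the staff's implementation:

```c
#include <string.h>

// Stand-in node: left is the 0 branch, right is the 1 branch.
typedef struct tree
{
    char symbol;
    struct tree *left;
    struct tree *right;
}
tree;

// Follow one bit at a time from the root; each time we land on a leaf,
// emit its symbol and jump back up to the root.
void decode(const tree *root, const char *bits, char *out)
{
    const tree *cursor = root;
    for (int i = 0; bits[i] != '\0'; i++)
    {
        cursor = (bits[i] == '0') ? cursor->left : cursor->right;
        if (cursor->left == NULL && cursor->right == NULL)
        {
            *out++ = cursor->symbol;
            cursor = root;
        }
    }
    *out = '\0';
}
```

With the earlier tree where B is 1, A is 01, and C is 00, the bits 01100 decode unambiguously to ABC; because characters live only at leaves, no code is a prefix of another, which is exactly the no-ambiguity property discussed above.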
243 00:19:01,810 --> 00:19:04,400 I'm going to spend most of the time on the .h files 244 00:19:04,400 --> 00:19:07,660 because in the .c files, because we have the .h 245 00:19:07,660 --> 00:19:11,650 and that provides us with the prototypes of the functions, 246 00:19:11,650 --> 00:19:15,520 we don't fully need to understand exactly-- 247 00:19:15,520 --> 00:19:20,280 If you don't understand what's going on in the .c files, then don't worry too much, 248 00:19:20,280 --> 00:19:23,600 but definitely try to take a look because it might give some hints 249 00:19:23,600 --> 00:19:29,220 and it's useful to get used to reading other people's code. 250 00:19:38,940 --> 00:19:48,270 >> Looking at huffile.h, in the comments it declares a layer of abstraction for Huffman-coded files. 251 00:19:48,270 --> 00:20:01,660 If we go down, we see that there is a maximum of 256 symbols that we might need codes for. 252 00:20:01,660 --> 00:20:05,480 This includes all the letters of the alphabet--uppercase and lowercase-- 253 00:20:05,480 --> 00:20:08,250 and then symbols and numbers, etc. 254 00:20:08,250 --> 00:20:11,930 Then here we have a magic number identifying a Huffman-coded file. 255 00:20:11,930 --> 00:20:15,890 Within a Huffman code they're going to have a certain magic number 256 00:20:15,890 --> 00:20:18,560 associated with the header. 257 00:20:18,560 --> 00:20:21,110 This might look like just a random magic number, 258 00:20:21,110 --> 00:20:27,160 but if you actually translate it into ASCII, then it actually spells out HUFF. 259 00:20:27,160 --> 00:20:34,290 Here we have a struct for a Huffman-encoded file. 260 00:20:34,290 --> 00:20:39,670 There's all of these characteristics associated with a Huff file. 261 00:20:39,670 --> 00:20:47,080 Then down here we have the header for a Huff file, so we call it Huffeader 262 00:20:47,080 --> 00:20:50,810 instead of adding the extra h because it sounds the same anyway. 263 00:20:50,810 --> 00:20:52,720 Cute. 
264 00:20:52,720 --> 00:20:57,790 We have a magic number associated with it. 265 00:20:57,790 --> 00:21:09,040 If it's an actual Huff file, it's going to be the number up above, this magic one. 266 00:21:09,040 --> 00:21:14,720 And then it will have an array. 267 00:21:14,720 --> 00:21:18,750 So for each symbol, of which there are 256, 268 00:21:18,750 --> 00:21:24,760 it's going to list what the frequency of those symbols are within the Huff file. 269 00:21:24,760 --> 00:21:28,090 And then finally, we have a checksum for the frequencies, 270 00:21:28,090 --> 00:21:32,160 which should be the sum of those frequencies. 271 00:21:32,160 --> 00:21:36,520 So that's what a Huffeader is. 272 00:21:36,520 --> 00:21:44,600 Then we have some functions that return the next bit in the Huff file 273 00:21:44,600 --> 00:21:52,580 as well as writes a bit to the Huff file, and then this function here, hfclose, 274 00:21:52,580 --> 00:21:54,650 that actually closes the Huff file. 275 00:21:54,650 --> 00:21:57,290 Before, we were dealing with straight just fclose, 276 00:21:57,290 --> 00:22:01,190 but when you have a Huff file, instead of fclosing it 277 00:22:01,190 --> 00:22:06,080 what you're actually going to do is hfclose and hfopen it. 278 00:22:06,080 --> 00:22:13,220 Those are specific functions to the Huff files that we're going to be dealing with. 279 00:22:13,220 --> 00:22:19,230 Then here we read in the header and then write the header. 280 00:22:19,230 --> 00:22:25,700 >> Just by reading the .h file we can kind of get a sense of what a Huff file might be, 281 00:22:25,700 --> 00:22:32,480 what characteristics it has, without actually going into the huffile.c, 282 00:22:32,480 --> 00:22:36,750 which, if we dive in, is going to be a bit more complex. 283 00:22:36,750 --> 00:22:41,270 It has all of the file I/O here dealing with pointers. 284 00:22:41,270 --> 00:22:48,010 Here we see that when we call hfread, for instance, it's still dealing with fread. 
285 00:22:48,010 --> 00:22:53,050 We're not getting rid of those functions entirely, but we're sending those to be taken care of 286 00:22:53,050 --> 00:22:59,760 inside the Huff file instead of doing all of it ourselves. 287 00:22:59,760 --> 00:23:02,300 You can feel free to scan through this if you're curious 288 00:23:02,300 --> 00:23:08,410 and go and peel the layer back a little bit. 289 00:23:20,650 --> 00:23:24,060 >> The next file that we're going to look at is tree.h. 290 00:23:24,060 --> 00:23:30,210 Before in the Walkthrough slides we said we expect a Huffman node 291 00:23:30,210 --> 00:23:32,960 and we made a typedef struct node. 292 00:23:32,960 --> 00:23:38,360 We expect it to have a symbol, a frequency, and then 2 node stars. 293 00:23:38,360 --> 00:23:41,870 In this case what we're doing is this is essentially the same 294 00:23:41,870 --> 00:23:46,880 except instead of node we're going to call them trees. 295 00:23:48,790 --> 00:23:56,760 We have a function that when you call make tree it returns you a tree pointer. 296 00:23:56,760 --> 00:24:03,450 Back to Speller, when you were making a new node 297 00:24:03,450 --> 00:24:11,410 you said node* new word = malloc(sizeof) and things like that. 298 00:24:11,410 --> 00:24:17,510 Basically, mktree is going to be dealing with that for you. 299 00:24:17,510 --> 00:24:20,990 Similarly, when you want to remove a tree, 300 00:24:20,990 --> 00:24:24,810 so that's essentially freeing the tree when you're done with it, 301 00:24:24,810 --> 00:24:33,790 instead of explicitly calling free on that, you're actually just going to use the function rmtree 302 00:24:33,790 --> 00:24:40,360 where you pass in the pointer to that tree and then tree.c will take care of that for you. 303 00:24:40,360 --> 00:24:42,490 >> We look into tree.c. 304 00:24:42,490 --> 00:24:47,240 We expect the same functions except to see the implementation as well. 
305 00:24:47,240 --> 00:24:57,720 As we expected, when you call mktree it mallocs the size of a tree into a pointer, 306 00:24:57,720 --> 00:25:03,190 initializes all of the values to the NULL value, so 0s or NULLs, 307 00:25:03,190 --> 00:25:08,280 and then returns the pointer to that tree that you've just malloc'd to you. 308 00:25:08,280 --> 00:25:13,340 Here when you call remove tree it first makes sure that you're not double freeing. 309 00:25:13,340 --> 00:25:18,320 It makes sure that you actually have a tree that you want to remove. 310 00:25:18,320 --> 00:25:23,330 Here because a tree also includes its children, 311 00:25:23,330 --> 00:25:29,560 what this does is it recursively calls remove tree on the left node of the tree 312 00:25:29,560 --> 00:25:31,650 as well as the right node. 313 00:25:31,650 --> 00:25:37,790 Before it frees the parent, it needs to free the children as well. 314 00:25:37,790 --> 00:25:42,770 Parent is also interchangeable with root. 315 00:25:42,770 --> 00:25:46,500 The first ever parent, so like the great-great-great-great-grandfather 316 00:25:46,500 --> 00:25:52,130 or grandmother tree, first we have to free down the levels first. 317 00:25:52,130 --> 00:25:58,490 So traverse to the bottom, free those, and then come back up, free those, etc. 318 00:26:00,400 --> 00:26:02,210 So that's tree. 319 00:26:02,210 --> 00:26:04,240 >> Now we look at forest. 320 00:26:04,240 --> 00:26:09,860 Forest is where you place all of your Huffman trees. 321 00:26:09,860 --> 00:26:12,910 It's saying that we're going to have something called a plot 322 00:26:12,910 --> 00:26:22,320 that contains a pointer to a tree as well as a pointer to a plot called next. 323 00:26:22,320 --> 00:26:28,480 What structure does this kind of look like? 324 00:26:29,870 --> 00:26:32,490 It kind of says it over there. 325 00:26:34,640 --> 00:26:36,700 Right over here. 326 00:26:37,340 --> 00:26:39,170 A linked list. 
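Stepping back to tree.c for a second, the children-first teardown that rmtree performs can be sketched roughly like this. The struct is a stand-in, and the count it returns is only for illustration; the real rmtree returns nothing:

```c
#include <stdlib.h>

// Stand-in node struct for the sketch.
typedef struct tree
{
    struct tree *left;
    struct tree *right;
}
tree;

// Free the children before the parent: once the parent is freed,
// we'd have no pointers left with which to reach them.
// Returns how many nodes were freed (for illustration only).
int rm_tree(tree *t)
{
    if (t == NULL)
        return 0;
    int freed = rm_tree(t->left) + rm_tree(t->right);
    free(t);
    return freed + 1;
}
```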
327 00:26:39,170 --> 00:26:44,590 We see that when we have a plot it's like a linked list of plots. 328 00:26:44,590 --> 00:26:53,020 A forest is defined as a linked list of plots, 329 00:26:53,020 --> 00:26:58,100 and so the structure of forest is we're just going to have a pointer to our first plot 330 00:26:58,100 --> 00:27:02,740 and that plot has a tree within it or rather points to a tree 331 00:27:02,740 --> 00:27:06,190 and then points to the next plot, so on and so forth. 332 00:27:06,190 --> 00:27:11,100 To make a forest we call mkforest. 333 00:27:11,100 --> 00:27:14,930 Then we have some pretty useful functions here. 334 00:27:14,930 --> 00:27:23,240 We have pick where you pass in a forest and then the return value is a Tree*, 335 00:27:23,240 --> 00:27:25,210 a pointer to a tree. 336 00:27:25,210 --> 00:27:29,370 What pick will do is it will go into the forest that you're pointing to 337 00:27:29,370 --> 00:27:35,240 then remove a tree with the lowest frequency from that forest 338 00:27:35,240 --> 00:27:38,330 and then give you the pointer to that tree. 339 00:27:38,330 --> 00:27:43,030 Once you call pick, the tree won't exist in the forest anymore, 340 00:27:43,030 --> 00:27:48,550 but the return value is the pointer to that tree. 341 00:27:48,550 --> 00:27:50,730 Then you have plant. 342 00:27:50,730 --> 00:27:57,420 Provided that you pass in a pointer to a tree that has a non-0 frequency, 343 00:27:57,420 --> 00:28:04,040 what plant will do is it will take the forest, take the tree, and plant that tree inside of the forest. 344 00:28:04,040 --> 00:28:06,370 Here we have rmforest. 345 00:28:06,370 --> 00:28:11,480 Similar to remove tree, which basically freed all of our trees for us, 346 00:28:11,480 --> 00:28:16,600 remove forest will free everything contained in that forest. 
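Here is a rough sketch of how a sorted linked list of plots makes pick and plant work. These are stand-in structs and names; the real definitions live in the distribution code, which also breaks frequency ties alphabetically:

```c
#include <stdbool.h>
#include <stdlib.h>

typedef struct tree { int frequency; } tree;   // stand-in for the real tree

typedef struct plot
{
    tree *t;              // the tree planted in this plot
    struct plot *next;    // the next plot in the list
}
plot;

typedef struct { plot *first; } forest;

// plant keeps the list sorted by frequency: walk until we find a plot
// whose tree weighs more, then splice the new plot in just before it.
bool plant(forest *f, tree *t)
{
    if (t == NULL || t->frequency == 0)
        return false;
    plot *p = malloc(sizeof(plot));
    p->t = t;
    plot **cur = &f->first;
    while (*cur != NULL && (*cur)->t->frequency < t->frequency)
        cur = &(*cur)->next;
    p->next = *cur;
    *cur = p;
    return true;
}

// Because the list stays sorted, pick just detaches the first plot
// and hands back the lowest-frequency tree.
tree *pick(forest *f)
{
    if (f->first == NULL)
        return NULL;
    plot *p = f->first;
    tree *t = p->t;
    f->first = p->next;
    free(p);
    return t;
}
```

Planting trees with frequencies 3, 1, and 2 and then picking three times hands them back in the order 1, 2, 3, which is what makes the build-the-tree loop so simple.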
347 00:28:16,600 --> 00:28:24,890 >> If we look into forest.c, we'll expect to see at least 1 rmtree command in there, 348 00:28:24,890 --> 00:28:30,090 because to free memory in the forest if a forest has trees in it, 349 00:28:30,090 --> 00:28:32,930 then eventually you're going to have to remove those trees too. 350 00:28:32,930 --> 00:28:41,020 If we look into forest.c, we have our mkforest, which is as we expect. 351 00:28:41,020 --> 00:28:42,890 We malloc things. 352 00:28:42,890 --> 00:28:51,740 We initialize the first plot in the forest as NULL because it's empty to begin with, 353 00:28:51,740 --> 00:29:05,940 then we see pick, which returns the tree with the lowest weight, the lowest frequency, 354 00:29:05,940 --> 00:29:13,560 and then gets rid of that particular node that points to that tree and the next one, 355 00:29:13,560 --> 00:29:16,760 so it takes that out of the linked list of the forest. 356 00:29:16,760 --> 00:29:24,510 And then here we have plant, which inserts a tree into the linked list. 357 00:29:24,510 --> 00:29:29,960 What forest does is it nicely keeps it sorted for us. 358 00:29:29,960 --> 00:29:37,910 And then finally, we have rmforest and, as expected, we have rmtree called there. 359 00:29:46,650 --> 00:29:55,440 >> Looking at the distribution code so far, huffile.c was probably by far the hardest to understand, 360 00:29:55,440 --> 00:29:59,990 whereas the other files themselves were pretty simple to follow. 361 00:29:59,990 --> 00:30:03,090 With our knowledge of pointers and linked lists and such, 362 00:30:03,090 --> 00:30:04,860 we were able to follow pretty well. 
363 00:30:04,860 --> 00:30:10,500 But what we really need to make sure that we fully understand is the .h files 364 00:30:10,500 --> 00:30:15,840 because you need to be calling those functions, dealing with those return values, 365 00:30:15,840 --> 00:30:20,590 so make sure that you fully understand what action is going to be performed 366 00:30:20,590 --> 00:30:24,290 whenever you call one of those functions. 367 00:30:24,290 --> 00:30:33,020 But actually understanding what's inside of it isn't quite necessary because we have those .h files. 368 00:30:35,170 --> 00:30:39,490 We have 2 more files left in our distribution code. 369 00:30:39,490 --> 00:30:41,640 >> Let's look at dump. 370 00:30:41,640 --> 00:30:47,230 Dump by its comment here takes a Huffman-compressed file 371 00:30:47,230 --> 00:30:55,580 and then translates and dumps all of its content out. 372 00:31:01,010 --> 00:31:04,260 Here we see that it's calling hfopen. 373 00:31:04,260 --> 00:31:10,770 This kind of mirrors FILE* input = fopen, 374 00:31:10,770 --> 00:31:13,500 and then you pass in the information. 375 00:31:13,500 --> 00:31:18,240 It's almost identical except instead of a FILE* you're passing in a Huffile; 376 00:31:18,240 --> 00:31:22,030 instead of fopen you're passing in hfopen. 377 00:31:22,030 --> 00:31:29,280 Here we read in the header first, which is kind of similar to how we read in the header 378 00:31:29,280 --> 00:31:33,580 for a bitmap file. 379 00:31:33,580 --> 00:31:38,000 What we're doing here is checking to see whether the header information 380 00:31:38,000 --> 00:31:44,330 contains the right magic number that indicates that it's an actual Huff file, 381 00:31:44,330 --> 00:31:53,610 then all of these checks make sure that the file that we open is an actual huffed file. 382 00:31:53,610 --> 00:32:05,330 What this does is it outputs the frequencies of all of the symbols that we can see 383 00:32:05,330 --> 00:32:09,790 within a terminal into a graphical table.
384 00:32:09,790 --> 00:32:15,240 This part is going to be useful. 385 00:32:15,240 --> 00:32:24,680 It declares a variable bit, reads the file in bit by bit, and then prints each bit out. 386 00:32:28,220 --> 00:32:35,430 So if I were to call dump on hth.bin, which is the result of huffing a file 387 00:32:35,430 --> 00:32:39,490 using the staff solution, I would get this. 388 00:32:39,490 --> 00:32:46,000 It's outputting all of these characters and then putting the frequency at which they appear. 389 00:32:46,000 --> 00:32:51,180 If we look, most of them are 0s except for this: H, which appears twice, 390 00:32:51,180 --> 00:32:54,820 and then T, which appears once. 391 00:32:54,820 --> 00:33:07,860 And then here we have the actual message in 0s and 1s. 392 00:33:07,860 --> 00:33:15,450 If we look at hth.txt, which is presumably the original message that was huffed, 393 00:33:15,450 --> 00:33:22,490 we expect to see some Hs and Ts in there. 394 00:33:22,490 --> 00:33:28,720 Specifically, we expect to see just 1 T and 2 Hs. 395 00:33:32,510 --> 00:33:37,440 Here we are in hth.txt. It indeed has HTH. 396 00:33:37,440 --> 00:33:41,270 Included in there, although we can't see it, is a newline character. 397 00:33:41,270 --> 00:33:53,190 The Huff file hth.bin encodes the newline character as well. 398 00:33:55,680 --> 00:34:01,330 Here because we know that the order is HTH and then newline, 399 00:34:01,330 --> 00:34:07,340 we can see that probably the H is represented by just a single 1 400 00:34:07,340 --> 00:34:17,120 and then the T is probably 01 and then the next H is 1 as well 401 00:34:17,120 --> 00:34:21,139 and then we have a newline indicated by two 0s. 402 00:34:22,420 --> 00:34:24,280 Cool.
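As a sanity check on that reading of the bits, here's a tiny sketch that concatenates the codes we just inferred for H, T, and newline. These codes are guesses read off the dump output, not pulled from the staff's actual tree.

```c
#include <string.h>

// Codes inferred from the dump output: H -> "1", T -> "01",
// newline -> "00". Concatenating them for the message "HTH\n"
// should reproduce the bit string that dump printed.
void encode_hth(char *out)
{
    out[0] = '\0';
    strcat(out, "1");   // H
    strcat(out, "01");  // T
    strcat(out, "1");   // H
    strcat(out, "00");  // newline
}
```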
403 00:34:26,530 --> 00:34:31,600 >> And then finally, because we're dealing with multiple .c and .h files, 404 00:34:31,600 --> 00:34:36,350 we're going to have a pretty complex argument to the compiler, 405 00:34:36,350 --> 00:34:40,460 and so here we have a Makefile that makes dump for you. 406 00:34:40,460 --> 00:34:47,070 But actually, you have to go about making your own puff.c file. 407 00:34:47,070 --> 00:34:54,330 The Makefile actually doesn't deal with making puff.c for you. 408 00:34:54,330 --> 00:34:59,310 We're leaving that up to you to edit the Makefile. 409 00:34:59,310 --> 00:35:05,930 When you enter a command like make all, for instance, it will make all of them for you. 410 00:35:05,930 --> 00:35:10,760 Feel free to look at the examples of Makefile from the past pset 411 00:35:10,760 --> 00:35:17,400 as well as going off of this one to see how you might be able to make your Puff file 412 00:35:17,400 --> 00:35:20,260 by editing this Makefile. 413 00:35:20,260 --> 00:35:22,730 That's about it for our distribution code. 414 00:35:22,730 --> 00:35:28,380 >> Once we've gotten through that, then here's just another reminder 415 00:35:28,380 --> 00:35:30,980 of how we're going to be dealing with the Huffman nodes. 416 00:35:30,980 --> 00:35:35,400 We're not going to be calling them nodes anymore; we're going to be calling them trees 417 00:35:35,400 --> 00:35:39,260 where we're going to be representing their symbol with a char, 418 00:35:39,260 --> 00:35:43,340 their frequency, the number of occurrences, with an integer. 419 00:35:43,340 --> 00:35:47,370 We're using that because it's more precise than a float. 420 00:35:47,370 --> 00:35:52,980 And then we have another pointer to the left child as well as the right child. 421 00:35:52,980 --> 00:35:59,630 A forest, as we saw, is just a linked list of trees. 
422 00:35:59,630 --> 00:36:04,670 Ultimately, when we're building up our Huff file, 423 00:36:04,670 --> 00:36:07,580 we want our forest to contain just 1 tree-- 424 00:36:07,580 --> 00:36:12,420 1 tree, 1 root with multiple children. 425 00:36:12,420 --> 00:36:20,840 Earlier on when we were just making our Huffman trees, 426 00:36:20,840 --> 00:36:25,360 we started out by placing all of the nodes onto our screen 427 00:36:25,360 --> 00:36:27,790 and saying we're going to have these nodes, 428 00:36:27,790 --> 00:36:32,920 eventually they're going to be the leaves, and this is their symbol, this is their frequency. 429 00:36:32,920 --> 00:36:42,070 In our forest if we just have 3 letters, that's a forest of 3 trees. 430 00:36:42,070 --> 00:36:45,150 And then as we go on, when we added the first parent, 431 00:36:45,150 --> 00:36:48,080 we made a forest of 2 trees. 432 00:36:48,080 --> 00:36:54,930 We removed 2 of those children from our forest and then replaced it with a parent node 433 00:36:54,930 --> 00:36:58,820 that had those 2 nodes as children. 434 00:36:58,820 --> 00:37:05,600 And then finally, our last step with making our example with the As, Bs, and Cs 435 00:37:05,600 --> 00:37:08,030 would be to make the final parent, 436 00:37:08,030 --> 00:37:13,190 and so then that would bring our total count of trees in the forest to 1. 437 00:37:13,190 --> 00:37:18,140 Does everyone see how you start out with multiple trees in your forest 438 00:37:18,140 --> 00:37:22,520 and end up with 1? Okay. Cool. 439 00:37:25,530 --> 00:37:28,110 >> What do we need to do for Puff? 440 00:37:28,110 --> 00:37:37,110 What we need to do is ensure that, as always, they give us the right type of input 441 00:37:37,110 --> 00:37:39,090 so that we can actually run the program. 
442 00:37:39,090 --> 00:37:43,130 In this case they're going to be giving us after their first command-line argument 443 00:37:43,130 --> 00:37:53,440 2 more: the file that we want to decompress and the output of the decompressed file. 444 00:37:53,440 --> 00:38:00,410 But once we make sure that they pass us in the right amount of values, 445 00:38:00,410 --> 00:38:05,820 we want to ensure that the input is a Huff file or not. 446 00:38:05,820 --> 00:38:10,420 And then once we guarantee that it's a Huff file, then we want to build our tree, 447 00:38:10,420 --> 00:38:20,940 build up the tree such that it matches the tree that the person who sent the message built. 448 00:38:20,940 --> 00:38:25,840 Then after we build the tree, then we can deal with the 0s and 1s that they passed in, 449 00:38:25,840 --> 00:38:29,590 follow those along our tree because it's identical, 450 00:38:29,590 --> 00:38:33,510 and then write that message out, interpret the bits back into chars. 451 00:38:33,510 --> 00:38:35,880 And then at the end because we're dealing with pointers here, 452 00:38:35,880 --> 00:38:38,110 we want to make sure that we don't have any memory leaks 453 00:38:38,110 --> 00:38:41,330 and that we free everything. 454 00:38:42,820 --> 00:38:46,430 >> Ensuring proper usage is old hat for us by now. 455 00:38:46,430 --> 00:38:51,980 We take in an input, which is going to be the name of the file to puff, 456 00:38:51,980 --> 00:38:56,010 and then we specify an output, 457 00:38:56,010 --> 00:39:01,580 so the name of the file for the puffed output, which will be the text file. 458 00:39:03,680 --> 00:39:08,820 That's usage. And now we want to ensure that the input is huffed or not. 459 00:39:08,820 --> 00:39:16,420 Thinking back, was there anything in the distribution code that might help us 460 00:39:16,420 --> 00:39:21,570 with understanding whether a file is huffed or not? 461 00:39:21,570 --> 00:39:26,910 There was information in huffile.c about the Huffeader. 
462 00:39:26,910 --> 00:39:33,430 We know that every Huff file has a Huffeader associated with it with a magic number 463 00:39:33,430 --> 00:39:37,240 as well as an array of the frequencies for each symbol 464 00:39:37,240 --> 00:39:39,570 as well as a checksum. 465 00:39:39,570 --> 00:39:43,180 We know that, but we also took a peek at dump.c, 466 00:39:43,180 --> 00:39:49,120 in which it was reading into a Huff file. 467 00:39:49,120 --> 00:39:53,990 And so to do that, it had to check whether it really was huffed or not. 468 00:39:53,990 --> 00:40:03,380 So perhaps we could use dump.c as a structure for our puff.c. 469 00:40:03,380 --> 00:40:12,680 Back to pset 4 when we had the file copy.c that copied in RGB triples 470 00:40:12,680 --> 00:40:14,860 and we interpreted that for Whodunit and Resize, 471 00:40:14,860 --> 00:40:20,390 similarly, what you could do is just run the command like cp dump.c puff.c 472 00:40:20,390 --> 00:40:23,600 and use some of the code there. 473 00:40:23,600 --> 00:40:28,210 However, it's not going to be as straightforward of a process 474 00:40:28,210 --> 00:40:33,010 for translating your dump.c into puff.c, 475 00:40:33,010 --> 00:40:36,160 but at least it gives you somewhere to start 476 00:40:36,160 --> 00:40:40,540 on how to ensure that the input is actually huffed or not 477 00:40:40,540 --> 00:40:43,240 as well as a few other things. 478 00:40:45,930 --> 00:40:50,250 We have ensured proper usage and ensured that the input is huffed. 479 00:40:50,250 --> 00:40:53,570 Every time that we've done that we have done our proper error checking, 480 00:40:53,570 --> 00:41:01,520 so returning and quitting the function if some failure occurs, if there's a problem. 481 00:41:01,520 --> 00:41:07,170 >> Now what we want to do is build the actual tree. 482 00:41:08,840 --> 00:41:12,640 If we look in Forest, there are 2 main functions 483 00:41:12,640 --> 00:41:15,800 that we're going to want to become very familiar with. 
484 00:41:15,800 --> 00:41:23,870 There's the Boolean function plant that plants a non-0 frequency tree inside our forest. 485 00:41:23,870 --> 00:41:29,250 And so there you pass in a pointer to a forest and a pointer to a tree. 486 00:41:32,530 --> 00:41:40,340 Quick question: How many forests will you have when you're building a Huffman tree? 487 00:41:44,210 --> 00:41:46,650 Our forest is like our canvas, right? 488 00:41:46,650 --> 00:41:50,800 So we're only going to have 1 forest, but we're going to have multiple trees. 489 00:41:50,800 --> 00:41:57,590 So before you call plant, you're presumably going to want to make your forest. 490 00:41:57,590 --> 00:42:04,430 There is a command for that if you look into forest.h on how you can make a forest. 491 00:42:04,430 --> 00:42:09,270 You can plant a tree. We know how to do that. 492 00:42:09,270 --> 00:42:11,590 And then you can also pick a tree from the forest, 493 00:42:11,590 --> 00:42:17,540 removing a tree with the lowest weight and giving you the pointer to that. 494 00:42:17,540 --> 00:42:23,090 Thinking back to when we were doing the examples ourselves, 495 00:42:23,090 --> 00:42:27,980 when we were drawing it out, we simply just added the links. 496 00:42:27,980 --> 00:42:31,680 But here instead of just adding the links, 497 00:42:31,680 --> 00:42:40,630 think of it more as you're removing 2 of those nodes and then replacing it by another one. 498 00:42:40,630 --> 00:42:44,200 To express that in terms of picking and planting, 499 00:42:44,200 --> 00:42:48,840 you're picking 2 trees and then planting another tree 500 00:42:48,840 --> 00:42:54,060 that has those 2 trees that you picked as children. 501 00:42:57,950 --> 00:43:05,280 To build Huffman's tree, you can read in the symbols and frequencies in order 502 00:43:05,280 --> 00:43:10,790 because the Huffeader gives that to you, 503 00:43:10,790 --> 00:43:14,250 gives you an array of the frequencies. 
504 00:43:14,250 --> 00:43:19,660 So you can go ahead and just ignore anything with the 0 in it 505 00:43:19,660 --> 00:43:23,760 because we don't want 256 leaves at the end of it. 506 00:43:23,760 --> 00:43:27,960 We only want the number of leaves that are characters 507 00:43:27,960 --> 00:43:31,600 that are actually used in the file. 508 00:43:31,600 --> 00:43:37,590 You can read in those symbols, and each of those symbols that have non-0 frequencies, 509 00:43:37,590 --> 00:43:40,440 those are going to be trees. 510 00:43:40,440 --> 00:43:45,990 What you can do is every time you read in a non-0 frequency symbol, 511 00:43:45,990 --> 00:43:50,660 you can plant that tree in the forest. 512 00:43:50,660 --> 00:43:56,620 Once you plant the trees in the forest, you can join those trees as siblings, 513 00:43:56,620 --> 00:44:01,130 so going back to planting and picking where you pick 2 and then plant 1, 514 00:44:01,130 --> 00:44:05,820 where that 1 that you plant is the parent of the 2 children that you picked. 515 00:44:05,820 --> 00:44:11,160 So then your end result is going to be a single tree in your forest. 516 00:44:16,180 --> 00:44:18,170 That's how you build your tree. 517 00:44:18,170 --> 00:44:21,850 >> There are several things that could go wrong here 518 00:44:21,850 --> 00:44:26,580 because we're dealing with making new trees and dealing with pointers and things like that. 519 00:44:26,580 --> 00:44:30,450 Before when we were dealing with pointers, 520 00:44:30,450 --> 00:44:36,580 whenever we malloc'd we wanted to make sure that it didn't return us a NULL pointer value. 521 00:44:36,580 --> 00:44:42,770 So at several steps within this process there are going to be several cases 522 00:44:42,770 --> 00:44:45,920 where your program could fail. 
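The whole plant-the-leaves, then pick-two-and-plant-one loop can be sketched end to end. The compact plant and pick below are stand-ins for the distribution code's versions so the sketch compiles on its own, and error handling is kept minimal.

```c
#include <stdbool.h>
#include <stdlib.h>

// Compact stand-ins for the distribution code's types.
typedef struct tree
{
    char symbol;
    int frequency;
    struct tree *left;
    struct tree *right;
}
Tree;

typedef struct plot
{
    Tree *tree;
    struct plot *next;
}
Plot;

typedef struct
{
    Plot *first;
}
Forest;

// plant: sorted insert of a non-0 frequency tree (stand-in for forest.c's).
bool plant(Forest *f, Tree *t)
{
    if (f == NULL || t == NULL || t->frequency == 0)
        return false;
    Plot *p = malloc(sizeof(Plot));
    if (p == NULL)
        return false;
    p->tree = t;
    Plot **cur = &f->first;
    while (*cur != NULL && (*cur)->tree->frequency < t->frequency)
        cur = &(*cur)->next;
    p->next = *cur;
    *cur = p;
    return true;
}

// pick: unlink and return the lowest-frequency tree, the head of the list.
Tree *pick(Forest *f)
{
    if (f == NULL || f->first == NULL)
        return NULL;
    Plot *p = f->first;
    Tree *t = p->tree;
    f->first = p->next;
    free(p);
    return t;
}

// Join trees until one remains: pick the two lightest, plant a parent
// whose frequency is their sum and whose children they become.
Tree *build(Forest *f)
{
    Tree *first = pick(f);
    Tree *second = pick(f);
    while (second != NULL)
    {
        Tree *parent = malloc(sizeof(Tree));
        if (parent == NULL)
            return NULL;
        parent->symbol = 0;
        parent->frequency = first->frequency + second->frequency;
        parent->left = first;
        parent->right = second;
        if (!plant(f, parent))
            return NULL;
        first = pick(f);
        second = pick(f);
    }
    return first;
}
```

Each pass through the loop shrinks the forest by one tree, which is exactly the 3-trees, then 2, then 1 progression from the A, B, C example.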
523 00:44:45,920 --> 00:44:51,310 What you want to do is you want to make sure that you handle those errors, 524 00:44:51,310 --> 00:44:54,580 and in the spec it says to handle them gracefully, 525 00:44:54,580 --> 00:45:00,280 so like print out a message to the user telling them why the program has to quit 526 00:45:00,280 --> 00:45:03,050 and then promptly quit it. 527 00:45:03,050 --> 00:45:09,490 To do this error handling, remember that you want to check it 528 00:45:09,490 --> 00:45:12,160 every single time that there could be a failure. 529 00:45:12,160 --> 00:45:14,660 Every single time that you're making a new pointer 530 00:45:14,660 --> 00:45:17,040 you want to make sure that that's successful. 531 00:45:17,040 --> 00:45:20,320 Before what we used to do is make a new pointer and malloc it, 532 00:45:20,320 --> 00:45:22,380 and then we would check whether that pointer is NULL. 533 00:45:22,380 --> 00:45:25,670 So there are going to be some instances where you can just do that, 534 00:45:25,670 --> 00:45:28,610 but sometimes you're actually calling a function 535 00:45:28,610 --> 00:45:33,100 and within that function, that's the one that's doing the mallocing. 536 00:45:33,100 --> 00:45:39,110 In that case, if we look back to some of the functions within the code, 537 00:45:39,110 --> 00:45:42,260 some of them are Boolean functions. 538 00:45:42,260 --> 00:45:48,480 In the abstract case if we have a Boolean function called foo, 539 00:45:48,480 --> 00:45:54,580 basically, we can assume that in addition to doing whatever foo does, 540 00:45:54,580 --> 00:45:57,210 since it's a Boolean function, it returns true or false-- 541 00:45:57,210 --> 00:46:01,300 true if successful, false if not. 542 00:46:01,300 --> 00:46:06,270 So we want to check whether the return value of foo is true or false. 
543 00:46:06,270 --> 00:46:10,400 If it's false, that means that we're going to want to print some kind of message 544 00:46:10,400 --> 00:46:14,390 and then quit the program. 545 00:46:14,390 --> 00:46:18,530 What we want to do is check the return value of foo. 546 00:46:18,530 --> 00:46:23,310 If foo returns false, then we know that we encountered some kind of error 547 00:46:23,310 --> 00:46:25,110 and we need to quit our program. 548 00:46:25,110 --> 00:46:35,600 A way to do this is have a condition where the actual function itself is your condition. 549 00:46:35,600 --> 00:46:39,320 Say foo takes in x. 550 00:46:39,320 --> 00:46:43,390 We can have as a condition if (foo(x)). 551 00:46:43,390 --> 00:46:50,900 Basically, that means if at the end of executing foo it returns true, 552 00:46:50,900 --> 00:46:57,390 then we can do this because the function has to evaluate foo 553 00:46:57,390 --> 00:47:00,500 in order to evaluate the whole condition. 554 00:47:00,500 --> 00:47:06,500 So then that's how you can do something if the function returns true and is successful. 555 00:47:06,500 --> 00:47:11,800 But when you're error checking, you only want to quit if your function returns false. 556 00:47:11,800 --> 00:47:16,090 What you could do is just add an == false or just add a bang in front of it 557 00:47:16,090 --> 00:47:21,010 and then you have if (!foo). 558 00:47:21,010 --> 00:47:29,540 Within that body of that condition you would have all of the error handling, 559 00:47:29,540 --> 00:47:36,940 so like, "Could not create this tree" and then return 1 or something like that. 560 00:47:36,940 --> 00:47:43,340 What that does, though, is that even though foo returned false-- 561 00:47:43,340 --> 00:47:46,980 Say foo returns true. 562 00:47:46,980 --> 00:47:51,060 Then you don't have to call foo again. That's a common misconception. 
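That pattern, calling the boolean function inside the condition and bailing out on false, looks like this. As in the discussion above, foo is a hypothetical stand-in for plant, pick, or any helper that returns true on success.

```c
#include <stdbool.h>
#include <stdio.h>

// foo: hypothetical success/failure helper, true on success.
bool foo(int x)
{
    return x >= 0; // pretend a negative input is a failure
}

// Putting the call inside the condition both runs foo and tests its
// result: on failure we report and quit, and we never call foo twice.
int run(int x)
{
    if (!foo(x))
    {
        fprintf(stderr, "Could not do the thing\n");
        return 1;
    }
    // foo has already executed successfully by this point;
    // no second call is needed.
    return 0;
}
```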
563 00:47:51,060 --> 00:47:54,730 Because it was in your condition, it's already evaluated, 564 00:47:54,730 --> 00:47:59,430 so you already have the result if you're using make tree or something like that 565 00:47:59,430 --> 00:48:01,840 or plant or pick or something. 566 00:48:01,840 --> 00:48:07,460 It already has that value. It's already executed. 567 00:48:07,460 --> 00:48:10,730 So it's useful to use Boolean functions as the condition 568 00:48:10,730 --> 00:48:13,890 because whether or not you actually execute the body of the loop, 569 00:48:13,890 --> 00:48:18,030 it executes the function anyway. 570 00:48:22,070 --> 00:48:27,330 >> Our second to last step is writing the message to the file. 571 00:48:27,330 --> 00:48:33,070 Once we build the Huffman tree, then writing the message to the file is pretty straightforward. 572 00:48:33,070 --> 00:48:39,260 It's pretty straightforward now to just follow the 0s and 1s. 573 00:48:39,260 --> 00:48:45,480 And so by convention we know that in a Huffman tree the 0s indicate left 574 00:48:45,480 --> 00:48:48,360 and the 1s indicate right. 575 00:48:48,360 --> 00:48:53,540 So then if you read in bit by bit, every time that you get a 0 576 00:48:53,540 --> 00:48:59,100 you'll follow the left branch, and then every time you read in a 1 577 00:48:59,100 --> 00:49:02,100 you're going to follow the right branch. 578 00:49:02,100 --> 00:49:07,570 And then you're going to continue until you hit a leaf 579 00:49:07,570 --> 00:49:11,550 because the leaves are going to be at the end of the branches. 580 00:49:11,550 --> 00:49:16,870 How can we tell whether we've hit a leaf or not? 581 00:49:19,800 --> 00:49:21,690 We said it before. 582 00:49:21,690 --> 00:49:24,040 [student] If the pointers are NULL. >>Yeah. 583 00:49:24,040 --> 00:49:32,220 We can tell if we've hit a leaf if the pointers to both the left and right trees are NULL. 584 00:49:32,220 --> 00:49:34,110 Perfect. 
585 00:49:34,110 --> 00:49:40,320 We know that we want to read in bit by bit into our Huff file. 586 00:49:43,870 --> 00:49:51,220 As we saw before in dump.c, what they did is they read in bit by bit into the Huff file 587 00:49:51,220 --> 00:49:54,560 and just printed out what those bits were. 588 00:49:54,560 --> 00:49:58,430 We're not going to be doing that. We're going to be doing something that's a bit more complex. 589 00:49:58,430 --> 00:50:03,620 But what we can do is we can take that bit of code that reads in to the bit. 590 00:50:03,620 --> 00:50:10,250 Here we have the integer bit representing the current bit that we're on. 591 00:50:10,250 --> 00:50:15,520 This takes care of iterating all of the bits in the file until you hit the end of the file. 592 00:50:15,520 --> 00:50:21,270 Based on that, then you're going to want to have some kind of iterator 593 00:50:21,270 --> 00:50:26,760 to traverse your tree. 594 00:50:26,760 --> 00:50:31,460 And then based on whether the bit is 0 or 1, 595 00:50:31,460 --> 00:50:36,920 you're going to want to either move that iterator to the left or move it to the right 596 00:50:36,920 --> 00:50:44,080 all the way until you hit a leaf, so all the way until that node that you're on 597 00:50:44,080 --> 00:50:48,260 doesn't point to any more nodes. 598 00:50:48,260 --> 00:50:54,300 Why can we do this with a Huffman file but not Morse code? 599 00:50:54,300 --> 00:50:56,610 Because in Morse code there's a bit of ambiguity. 600 00:50:56,610 --> 00:51:04,440 We could be like, oh wait, we've hit a letter along the way, so maybe this is our letter, 601 00:51:04,440 --> 00:51:08,150 whereas if we continued just a bit longer, then we would have hit another letter. 
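Putting the bit-following and the leaf check together, a decoding loop might look like this sketch. The real puff reads actual bits from the Huffile with the distribution code's helpers, so a string of '0' and '1' characters stands in here, and the input is assumed well formed.

```c
#include <stddef.h>

// Minimal node for this sketch: symbol plus two children.
typedef struct tree
{
    char symbol;
    struct tree *left;
    struct tree *right;
}
Tree;

// Walk the tree bit by bit: 0 means go left, 1 means go right, and a
// node whose left and right children are both NULL is a leaf, so emit
// its symbol and jump back to the root for the next character.
size_t decode(const Tree *root, const char *bits, char *out)
{
    size_t n = 0;
    const Tree *cur = root;
    for (const char *b = bits; *b != '\0'; b++)
    {
        cur = (*b == '0') ? cur->left : cur->right;
        if (cur->left == NULL && cur->right == NULL)
        {
            out[n++] = cur->symbol;
            cur = root;
        }
    }
    out[n] = '\0';
    return n;
}
```

Because Huffman codes are prefix-free, hitting a leaf is unambiguous: no code is a prefix of another, which is exactly the property Morse code lacks.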
602 00:51:08,150 --> 00:51:13,110 But that's not going to happen in Huffman encoding, 603 00:51:13,110 --> 00:51:17,540 so we can rest assured that the only way that we're going to hit a character 604 00:51:17,540 --> 00:51:23,480 is if that node's left and right children are NULL. 605 00:51:28,280 --> 00:51:32,350 >> Finally, we want to free all of our memory. 606 00:51:32,350 --> 00:51:37,420 We want to both close the Huff file that we've been dealing with 607 00:51:37,420 --> 00:51:41,940 as well as remove all of the trees in our forest. 608 00:51:41,940 --> 00:51:46,470 Based on your implementation, you're probably going to want to call remove forest 609 00:51:46,470 --> 00:51:49,780 instead of actually going through all of the trees yourself. 610 00:51:49,780 --> 00:51:53,430 But if you made any temporary trees, you'll want to free that. 611 00:51:53,430 --> 00:51:59,060 You know your code best, so you know where you're allocating memory. 612 00:51:59,060 --> 00:52:04,330 And so if you go in, start by even Control F'ing for malloc, 613 00:52:04,330 --> 00:52:08,330 seeing whenever you malloc and making sure that you free all of that 614 00:52:08,330 --> 00:52:10,190 but then just going through your code, 615 00:52:10,190 --> 00:52:14,260 understanding where you might have allocated memory. 616 00:52:14,260 --> 00:52:21,340 Usually you might just say, "At the end of a file I'm just going to remove forest on my forest," 617 00:52:21,340 --> 00:52:23,850 so basically clear that memory, free that, 618 00:52:23,850 --> 00:52:28,310 "and then I'm also going to close the file and then my program is going to quit." 619 00:52:28,310 --> 00:52:33,810 But is that the only time that your program quits? 620 00:52:33,810 --> 00:52:37,880 No, because sometimes there might have been an error that happened. 
621 00:52:37,880 --> 00:52:42,080 Maybe we couldn't open a file or we couldn't make another tree 622 00:52:42,080 --> 00:52:49,340 or some kind of error happened in the memory allocation process and so it returned NULL. 623 00:52:49,340 --> 00:52:56,710 An error happened and then we returned and quit. 624 00:52:56,710 --> 00:53:02,040 So then you want to make sure that any possible time that your program can quit, 625 00:53:02,040 --> 00:53:06,980 you want to free all of your memory there. 626 00:53:06,980 --> 00:53:13,370 It's not just going to be at the very end of the main function that you quit your code. 627 00:53:13,370 --> 00:53:20,780 You want to look back to every instance that your code potentially might return prematurely 628 00:53:20,780 --> 00:53:25,070 and then free whatever memory makes sense. 629 00:53:25,070 --> 00:53:30,830 Say you had called make forest and that returned false. 630 00:53:30,830 --> 00:53:34,230 Then you probably won't need to remove your forest 631 00:53:34,230 --> 00:53:37,080 because you don't have a forest yet. 632 00:53:37,080 --> 00:53:42,130 But at every point in the code where you might return prematurely 633 00:53:42,130 --> 00:53:46,160 you want to make sure that you free any possible memory. 634 00:53:46,160 --> 00:53:50,020 >> So when we're dealing with freeing memory and having potential leaks, 635 00:53:50,020 --> 00:53:55,440 we want to not only use our judgment and our logic 636 00:53:55,440 --> 00:54:01,850 but also use Valgrind to determine whether we've freed all of our memory properly or not. 637 00:54:01,850 --> 00:54:09,460 You can either run Valgrind on Puff and then you have to also pass it 638 00:54:09,460 --> 00:54:14,020 the right number of command-line arguments to Valgrind. 639 00:54:14,020 --> 00:54:18,100 You can run that, but the output is a bit cryptic. 
640 00:54:18,100 --> 00:54:21,630 We've gotten a bit used to it with Speller, but we still need a bit more help, 641 00:54:21,630 --> 00:54:26,450 so then running it with a few more flags like --leak-check=full, 642 00:54:26,450 --> 00:54:32,040 that will probably give us some more helpful output on Valgrind. 643 00:54:32,040 --> 00:54:39,040 >> Then another useful tip when you're debugging is the diff command. 644 00:54:39,040 --> 00:54:48,520 You can access the staff's implementation of Huff, run that on a text file, 645 00:54:48,520 --> 00:54:55,400 and then output it to a binary file, a binary Huff file, to be specific. 646 00:54:55,400 --> 00:54:59,440 Then if you run your own puff on that binary file, 647 00:54:59,440 --> 00:55:03,950 then ideally, your outputted text file is going to be identical 648 00:55:03,950 --> 00:55:08,200 to the original one that you passed in. 649 00:55:08,200 --> 00:55:15,150 Here I'm using hth.txt as the example, and that's the one talked about in your spec. 650 00:55:15,150 --> 00:55:21,040 That's literally just HTH and then a newline. 651 00:55:21,040 --> 00:55:30,970 But definitely feel free and you are definitely encouraged to use longer examples 652 00:55:30,970 --> 00:55:32,620 for your text file. 653 00:55:32,620 --> 00:55:38,110 >> You can even take a shot at maybe compressing and then decompressing 654 00:55:38,110 --> 00:55:41,600 some of the files that you used in Speller like War and Peace 655 00:55:41,600 --> 00:55:46,710 or Jane Austen or something like that--that would be kind of cool--or Austin Powers, 656 00:55:46,710 --> 00:55:51,880 kind of dealing with larger files, because with small ones we wouldn't see the benefit 657 00:55:51,880 --> 00:55:55,590 if we used the next tool here, ls -l. 658 00:55:55,590 --> 00:56:01,150 We're used to ls, which basically lists all the contents in our current directory. 659 00:56:01,150 --> 00:56:07,860 Passing in the flag -l actually displays the size of those files.
660 00:56:07,860 --> 00:56:12,690 If you go through the pset spec, it actually walks you through creating the binary file, 661 00:56:12,690 --> 00:56:16,590 of huffing it, and you see that for very small files 662 00:56:16,590 --> 00:56:23,910 the space cost of compressing it and translating all of that information 663 00:56:23,910 --> 00:56:26,980 of all the frequencies and things like that outweighs the actual benefit 664 00:56:26,980 --> 00:56:30,000 of compressing the file in the first place. 665 00:56:30,000 --> 00:56:37,450 But if you run it on some longer text files, then you might see that you start to get some benefit 666 00:56:37,450 --> 00:56:40,930 in compressing those files. 667 00:56:40,930 --> 00:56:46,210 >> And then finally, we have our old pal GDB, which is definitely going to come in handy too. 668 00:56:48,360 --> 00:56:55,320 >> Do we have any questions on Huff trees or the process perhaps of making the trees 669 00:56:55,320 --> 00:56:58,590 or any other questions on Huff'n Puff? 670 00:57:00,680 --> 00:57:02,570 Okay. I'll stay around for a bit. 671 00:57:02,570 --> 00:57:06,570 >> Thanks, everyone. This was Walkthrough 6. And good luck. 672 00:57:08,660 --> 00:57:10,000 >> [CS50.TV]