1 00:00:00,000 --> 00:00:02,210 [Walkthrough - Problem Set 6] 2 00:00:02,210 --> 00:00:04,810 [Zamyla Chan - Harvard University] 3 00:00:04,810 --> 00:00:07,240 [This is CS50. - CS50.TV] 4 00:00:07,240 --> 00:00:12,180 >> Hello, everyone, and welcome to Walkthrough 6: Huff'n Puff. 5 00:00:12,180 --> 00:00:17,440 In Huff'n Puff what we are doing is going to be dealing with a Huffman compressed file 6 00:00:17,440 --> 00:00:20,740 and then puffing it back up, so decompressing it, 7 00:00:20,740 --> 00:00:25,810 so that we can translate from the 0s and 1s that the user sends us 8 00:00:25,810 --> 00:00:30,660 and convert it back into the original text. 9 00:00:30,660 --> 00:00:34,360 Pset 6 is going to be pretty cool because you're going to see some of the tools 10 00:00:34,360 --> 00:00:41,730 that you used in pset 4 and pset 5 and kind of combining them into 1 pretty neat concept 11 00:00:41,730 --> 00:00:43,830 when you come to think about it. 12 00:00:43,830 --> 00:00:50,110 >> Also, arguably, pset 4 and 5 were the most challenging psets that we had to offer. 13 00:00:50,110 --> 00:00:53,950 So from now, we have this 1 more pset in C, 14 00:00:53,950 --> 00:00:56,480 and then after that we're on to web programming. 15 00:00:56,480 --> 00:01:02,310 So congratulate yourselves for getting over the toughest hump in CS50. 16 00:01:03,630 --> 00:01:09,760 >> Moving on for Huff'n Puff, our toolbox for this pset are going to be Huffman trees, 17 00:01:09,760 --> 00:01:14,700 so understanding not only how binary trees work but also specifically Huffman trees, 18 00:01:14,700 --> 00:01:16,240 how they're constructed. 
19 00:01:16,240 --> 00:01:20,210 And then we're going to have a lot of distribution code in this pset, 20 00:01:20,210 --> 00:01:22,480 and we'll come to see that actually some of the code 21 00:01:22,480 --> 00:01:24,670 we might not be able to fully understand yet, 22 00:01:24,670 --> 00:01:30,080 and so those will be the .c files, but then their accompanying .h files 23 00:01:30,080 --> 00:01:34,300 will give us enough understanding that we need so that we know how those functions work 24 00:01:34,300 --> 00:01:38,100 or at least what they are supposed to do--their inputs and outputs-- 25 00:01:38,100 --> 00:01:40,760 even if we don't know what's happening in the black box 26 00:01:40,760 --> 00:01:44,090 or don't understand what's happening in the black box within. 27 00:01:44,090 --> 00:01:49,400 And then finally, as usual, we are dealing with new data structures, 28 00:01:49,400 --> 00:01:51,840 specific types of nodes that point to certain things, 29 00:01:51,840 --> 00:01:56,080 and so here having a pen and paper not only for the design process 30 00:01:56,080 --> 00:01:58,470 and when you're trying to figure out how your pset should work 31 00:01:58,470 --> 00:02:00,520 but also during debugging. 32 00:02:00,520 --> 00:02:06,140 You can have GDB alongside your pen and paper while you take down what the values are, 33 00:02:06,140 --> 00:02:09,320 where your arrows are pointing, and things like that. 34 00:02:09,320 --> 00:02:13,720 >> First let's look at Huffman trees. 35 00:02:13,720 --> 00:02:19,600 Huffman trees are binary trees, meaning that each node only has 2 children. 36 00:02:19,600 --> 00:02:24,870 In Huffman trees the characteristic is that the most frequent values 37 00:02:24,870 --> 00:02:27,140 are represented by the fewest bits. 38 00:02:27,140 --> 00:02:32,690 We saw in lecture examples of Morse code, which kind of consolidated some letters. 
39 00:02:32,690 --> 00:02:38,030 If you're trying to translate an A or an E, for example, 40 00:02:38,030 --> 00:02:43,940 you're translating that often, so instead of having to use the full set of bits 41 00:02:43,940 --> 00:02:48,640 allocated for that usual data type, you compress it down to fewer, 42 00:02:48,640 --> 00:02:53,730 and then those letters who are represented less often are represented with longer bits 43 00:02:53,730 --> 00:02:59,840 because you can afford that when you weigh out the frequencies that those letters appear. 44 00:02:59,840 --> 00:03:03,020 We have the same idea here in Huffman trees 45 00:03:03,020 --> 00:03:12,360 where we are making a chain, a kind of path to get to the certain characters. 46 00:03:12,360 --> 00:03:14,470 And then the characters who have the most frequency 47 00:03:14,470 --> 00:03:17,940 are going to be represented with the fewest bits. 48 00:03:17,940 --> 00:03:22,020 >> The way that you construct a Huffman tree 49 00:03:22,020 --> 00:03:27,430 is by placing all of the characters that appear in the text 50 00:03:27,430 --> 00:03:30,630 and calculating their frequency, how often they appear. 51 00:03:30,630 --> 00:03:33,880 This could either be a count of how many times those letters appear 52 00:03:33,880 --> 00:03:40,270 or perhaps a percentage of out of all the characters how many each one appears. 53 00:03:40,270 --> 00:03:44,270 And so what you do is once you have all of that mapped out, 54 00:03:44,270 --> 00:03:49,060 then you look for the 2 lowest frequencies and then join them as siblings 55 00:03:49,060 --> 00:03:55,660 where then the parent node has a frequency which is the sum of its 2 children. 56 00:03:55,660 --> 00:04:00,870 And then you by convention say that the left node, 57 00:04:00,870 --> 00:04:03,770 you follow that by following the 0 branch, 58 00:04:03,770 --> 00:04:08,140 and then the rightmost node is the 1 branch. 
59 00:04:08,140 --> 00:04:16,040 As we saw in Morse code, the one gotcha was that if you had just a beep and the beep 60 00:04:16,040 --> 00:04:18,120 it was ambiguous. 61 00:04:18,120 --> 00:04:22,430 It could either be 1 letter or it could be a sequence of 2 letters. 62 00:04:22,430 --> 00:04:27,790 And so what Huffman trees does is because by nature of the characters 63 00:04:27,790 --> 00:04:34,140 or our final actual characters being the last nodes on the branch-- 64 00:04:34,140 --> 00:04:39,300 we refer to those as leaves--by virtue of that there can't be any ambiguity 65 00:04:39,300 --> 00:04:45,160 in terms of which letter you're trying to encode with the series of bits 66 00:04:45,160 --> 00:04:50,670 because nowhere along the bits that represent 1 letter 67 00:04:50,670 --> 00:04:55,960 will you encounter another whole letter, and there won't be any confusion there. 68 00:04:55,960 --> 00:04:58,430 But we'll go into examples that you guys can actually see that 69 00:04:58,430 --> 00:05:02,120 instead of us just telling you that that's true. 70 00:05:02,120 --> 00:05:06,390 >> Let's look at a simple example of a Huffman tree. 71 00:05:06,390 --> 00:05:09,380 I have a string here that is 12 characters long. 72 00:05:09,380 --> 00:05:14,010 I have 4 As, 6 Bs, and 2 Cs. 73 00:05:14,010 --> 00:05:17,270 My first step would be to count. 74 00:05:17,270 --> 00:05:20,760 How many times does A appear? It appears 4 times in the string. 75 00:05:20,760 --> 00:05:25,060 B appears 6 times, and C appears 2 times. 76 00:05:25,060 --> 00:05:28,970 Naturally, I'm going to say I'm using B most often, 77 00:05:28,970 --> 00:05:35,970 so I want to represent B with the fewest number of bits, the fewest number of 0s and 1s. 78 00:05:35,970 --> 00:05:42,600 And then I'm also going to expect C to require the most amount of 0s and 1s as well. 79 00:05:42,600 --> 00:05:48,550 First what I did here is I placed them in ascending order in terms of frequency. 
80 00:05:48,550 --> 00:05:52,710 We see that the C and the A, those are our 2 lowest frequencies. 81 00:05:52,710 --> 00:06:00,290 We create a parent node, and that parent node doesn't have a letter associated with it, 82 00:06:00,290 --> 00:06:05,070 but it does have a frequency, which is the sum. 83 00:06:05,070 --> 00:06:08,780 The sum becomes 2 + 4, which is 6. 84 00:06:08,780 --> 00:06:10,800 Then we follow the left branch. 85 00:06:10,800 --> 00:06:14,970 If we were at that 6 node, then we would follow 0 to get to C 86 00:06:14,970 --> 00:06:17,450 and then 1 to get to A. 87 00:06:17,450 --> 00:06:20,300 So now we have 2 nodes. 88 00:06:20,300 --> 00:06:23,920 We have the value 6 and then we also have another node with the value 6. 89 00:06:23,920 --> 00:06:28,550 And so those 2 are not only the 2 lowest but also just the 2 that are left, 90 00:06:28,550 --> 00:06:33,820 so we join those by another parent, with the sum being 12. 91 00:06:33,820 --> 00:06:36,300 So here we have our Huffman tree 92 00:06:36,300 --> 00:06:40,020 where to get to B, that would just be the bit 1 93 00:06:40,020 --> 00:06:45,430 and then to get to A we would have 01 and then C having 00. 94 00:06:45,430 --> 00:06:51,300 So here we see that basically we're representing these chars with either 1 or 2 bits 95 00:06:51,300 --> 00:06:55,160 where the B, as predicted, has the least. 96 00:06:55,160 --> 00:07:01,730 And then we had expected C to have the most, but since it's such a small Huffman tree, 97 00:07:01,730 --> 00:07:06,020 then the A is also represented by 2 bits as opposed to somewhere in the middle. 98 00:07:07,820 --> 00:07:11,070 >> Just to go over another simple example of the Huffman tree, 99 00:07:11,070 --> 00:07:19,570 say you have the string "Hello." 100 00:07:19,570 --> 00:07:25,360 What you do is first you would say how many times does H appear in this? 
101 00:07:25,360 --> 00:07:34,200 H appears once and then e appears once and then we have l appearing twice 102 00:07:34,200 --> 00:07:36,580 and o appearing once. 103 00:07:36,580 --> 00:07:44,310 And so then we expect which letter to be represented by the least number of bits? 104 00:07:44,310 --> 00:07:47,450 [student] l. >>l. Yeah. l is right. 105 00:07:47,450 --> 00:07:50,730 We expect l to be represented by the least number of bits 106 00:07:50,730 --> 00:07:55,890 because l is used most in the string "Hello." 107 00:07:55,890 --> 00:08:04,280 What I'm going to do now is draw out these nodes. 108 00:08:04,280 --> 00:08:15,580 I have 1, which is H, and then another 1, which is e, and then a 1, which is o-- 109 00:08:15,580 --> 00:08:23,410 right now I'm putting them in order--and then 2, which is l. 110 00:08:23,410 --> 00:08:32,799 Then I say the way that I build a Huffman tree is to find the 2 nodes with the least frequencies 111 00:08:32,799 --> 00:08:38,010 and make them siblings by creating a parent node. 112 00:08:38,010 --> 00:08:41,850 Here we have 3 nodes with the lowest frequency. They're all 1. 113 00:08:41,850 --> 00:08:50,620 So here we choose which one we're going to link first. 114 00:08:50,620 --> 00:08:54,850 Let's say I choose the H and the e. 115 00:08:54,850 --> 00:09:01,150 The sum of 1 + 1 is 2, but this node doesn't have a letter associated with it. 116 00:09:01,150 --> 00:09:04,440 It just holds the value. 117 00:09:04,440 --> 00:09:10,950 Now we look at the next 2 lowest frequencies. 118 00:09:10,950 --> 00:09:15,590 That's 2 and 1. That could be either of those 2, but I'm going to choose this one. 119 00:09:15,590 --> 00:09:18,800 The sum is 3. 120 00:09:18,800 --> 00:09:26,410 And then finally, I only have 2 left, so then that becomes 5. 121 00:09:26,410 --> 00:09:32,010 Then here, as expected, if I fill in the encoding for that, 122 00:09:32,010 --> 00:09:37,480 1s are always the right branch and 0s are the left one. 
123 00:09:37,480 --> 00:09:45,880 Then we have l represented by just 1 bit and then the o by 2 124 00:09:45,880 --> 00:09:52,360 and then the e by 3 and then the H also falls down to 3 bits. 125 00:09:52,360 --> 00:09:59,750 So you can transmit this message "Hello" instead of actually using the characters 126 00:09:59,750 --> 00:10:02,760 by just 0s and 1s. 127 00:10:02,760 --> 00:10:07,910 However, remember that in several cases we had ties with our frequency. 128 00:10:07,910 --> 00:10:11,900 We could have either joined the H and the o first maybe. 129 00:10:11,900 --> 00:10:15,730 Or then later on when we had the l represented by 2 130 00:10:15,730 --> 00:10:19,410 as well as the joined one represented by 2, we could have linked either one. 131 00:10:19,410 --> 00:10:23,630 >> And so when you send the 0s and 1s, that actually doesn't guarantee 132 00:10:23,630 --> 00:10:27,090 that the recipient can fully read your message right off the bat 133 00:10:27,090 --> 00:10:30,490 because they might not know which decision you made. 134 00:10:30,490 --> 00:10:34,920 So when we're dealing with Huffman compression, 135 00:10:34,920 --> 00:10:40,090 somehow we have to tell the recipient of our message how we decided-- 136 00:10:40,090 --> 00:10:43,470 They need to know some kind of extra information 137 00:10:43,470 --> 00:10:46,580 in addition to the compressed message. 138 00:10:46,580 --> 00:10:51,490 They need to understand what the tree actually looks like, 139 00:10:51,490 --> 00:10:55,450 how we actually made those decisions. 140 00:10:55,450 --> 00:10:59,100 >> Here we were just doing examples based on the actual count, 141 00:10:59,100 --> 00:11:01,550 but sometimes you can also have a Huffman tree 142 00:11:01,550 --> 00:11:05,760 based on the frequency at which letters appear, and it's the exact same process. 
143 00:11:05,760 --> 00:11:09,090 Here I'm expressing it in terms of percentages or a fraction, 144 00:11:09,090 --> 00:11:11,290 and so here the exact same thing. 145 00:11:11,290 --> 00:11:15,300 I find the 2 lowest, sum them, the next 2 lowest, sum them, 146 00:11:15,300 --> 00:11:19,390 until I have a full tree. 147 00:11:19,390 --> 00:11:23,610 Even though we could do it either way, when we're dealing with percentages, 148 00:11:23,610 --> 00:11:27,760 that means we're dividing things and dealing with decimals or rather floats 149 00:11:27,760 --> 00:11:30,900 if we're thinking about data structures ahead. 150 00:11:30,900 --> 00:11:32,540 What do we know about floats? 151 00:11:32,540 --> 00:11:35,180 What's a common problem when we're dealing with floats? 152 00:11:35,180 --> 00:11:38,600 [student] Imprecise arithmetic. >>Yeah. Imprecision. 153 00:11:38,600 --> 00:11:43,760 Because of floating point imprecision, for this pset, so that we make sure 154 00:11:43,760 --> 00:11:49,450 that we don't lose any values, we're actually going to be dealing with the count. 155 00:11:49,450 --> 00:11:54,880 So if you were to think of a Huffman node, if you look back to the structure here, 156 00:11:54,880 --> 00:12:01,740 if you look at the green ones it has a frequency associated with it 157 00:12:01,740 --> 00:12:08,760 as well as it points to a node to its left as well as a node to its right. 158 00:12:08,760 --> 00:12:13,970 And then the red ones there also have a character associated with them. 159 00:12:13,970 --> 00:12:18,900 We're not going to make separate ones for the parents and then the final nodes, 160 00:12:18,900 --> 00:12:23,680 which we refer to as leaves, but rather those will just have NULL values. 161 00:12:23,680 --> 00:12:31,050 For every node we'll have a character, the symbol that that node represents, 162 00:12:31,050 --> 00:12:40,490 then a frequency as well as a pointer to its left child as well as its right child. 
163 00:12:40,490 --> 00:12:45,680 The leaves, which are at the very bottom, would also have node pointers 164 00:12:45,680 --> 00:12:49,550 to their left and to their right, but since those values aren't pointing to actual nodes, 165 00:12:49,550 --> 00:12:53,970 what would their value be? >>[student] NULL. >>NULL. Exactly. 166 00:12:53,970 --> 00:12:58,430 Here's an example of how you might represent the frequency in floats, 167 00:12:58,430 --> 00:13:02,130 but we're going to be dealing with it with integers, 168 00:13:02,130 --> 00:13:06,780 so all I did is change the data type there. 169 00:13:06,780 --> 00:13:09,700 >> Let's go on to a little bit more of a complex example. 170 00:13:09,700 --> 00:13:13,360 But now that we've done the simple ones, it's just the same process. 171 00:13:13,360 --> 00:13:20,290 You find the 2 lowest frequencies, sum the frequencies 172 00:13:20,290 --> 00:13:22,450 and that's the new frequency of your parent node, 173 00:13:22,450 --> 00:13:29,310 which then points to its left with the 0 branch and the right with the 1 branch. 174 00:13:29,310 --> 00:13:34,200 If we have the string "This is cs50," then we count how many times is T mentioned, 175 00:13:34,200 --> 00:13:38,420 h mentioned, i, s, c, 5, 0. 176 00:13:38,420 --> 00:13:42,010 Then what I did here is with the red nodes I just planted, 177 00:13:42,010 --> 00:13:48,530 I said I'm going to have these characters eventually at the bottom of my tree. 178 00:13:48,530 --> 00:13:51,740 Those are going to be all of the leaves. 179 00:13:51,740 --> 00:13:58,200 Then what I did is I sorted them by frequency in ascending order, 180 00:13:58,200 --> 00:14:02,950 and this is actually the way that the pset code does it 181 00:14:02,950 --> 00:14:07,550 is it sorts it by frequency and then alphabetically. 182 00:14:07,550 --> 00:14:13,870 So it has the numbers first and then alphabetically by the frequency. 
183 00:14:13,870 --> 00:14:18,520 Then what I would do is I would find the 2 lowest. That's 0 and 5. 184 00:14:18,520 --> 00:14:22,390 I would sum them, and that's 2. Then I would continue, find the next 2 lowest. 185 00:14:22,390 --> 00:14:26,100 Those are the two 1s, and then those become 2 as well. 186 00:14:26,100 --> 00:14:31,570 Now I know that my next step is going to be joining the lowest number, 187 00:14:31,570 --> 00:14:41,380 which is the T, the 1, and then choosing one of the nodes that has 2 as the frequency. 188 00:14:41,380 --> 00:14:44,560 So here we have 3 options. 189 00:14:44,560 --> 00:14:47,980 What I'm going to do for the slide is just visually rearrange them for you 190 00:14:47,980 --> 00:14:51,790 so that you can see how I'm building it up. 191 00:14:51,790 --> 00:14:59,040 What the code and your distribution code is going to do would be join the T one 192 00:14:59,040 --> 00:15:01,410 with the 0 and 5 node. 193 00:15:01,410 --> 00:15:05,060 So then that sums to 3, and then we continue the process. 194 00:15:05,060 --> 00:15:08,660 The 2 and the 2 now are the lowest, so then those sum to 4. 195 00:15:08,660 --> 00:15:12,560 Everyone following so far? Okay. 196 00:15:12,560 --> 00:15:16,410 Then after that we have the 3 and the 3 that need to be added up, 197 00:15:16,410 --> 00:15:21,650 so again I'm just switching it so that you can see visually so that it doesn't get too messy. 198 00:15:21,650 --> 00:15:25,740 Then we have a 6, and then our final step is now that we only have 2 nodes 199 00:15:25,740 --> 00:15:30,440 we sum those to make the root of our tree, which is 10. 200 00:15:30,440 --> 00:15:34,100 And the number 10 makes sense because each node represented, 201 00:15:34,100 --> 00:15:40,750 their value, their frequency number, was how many times they appeared in the string, 202 00:15:40,750 --> 00:15:46,350 and then we have 10 characters in our string, so that makes sense. 
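That repeated find-the-two-lowest-and-join step can be sketched as a loop. This is a rough stand-in, not the distribution code (which keeps a sorted forest rather than an array), and all the names here are made up:

```c
#include <stdlib.h>

// Stand-in tree node for the sketch.
typedef struct tree
{
    char symbol;
    int frequency;
    struct tree *left;
    struct tree *right;
}
tree;

tree *make_leaf(char symbol, int frequency)
{
    tree *t = calloc(1, sizeof(tree));
    t->symbol = symbol;
    t->frequency = frequency;
    return t;
}

// Join two trees under a new parent whose frequency is the sum of theirs.
tree *join(tree *a, tree *b)
{
    tree *parent = calloc(1, sizeof(tree));
    parent->frequency = a->frequency + b->frequency;
    parent->left = a;    // the 0 branch
    parent->right = b;   // the 1 branch
    return parent;
}

// Repeat the pick-two-lowest-and-join step until one tree remains.
// Ties can be broken either way; only the code lengths may differ.
tree *build(tree *forest[], int n)
{
    while (n > 1)
    {
        // move the lowest-frequency tree to the front
        int lo = 0;
        for (int i = 1; i < n; i++)
            if (forest[i]->frequency < forest[lo]->frequency)
                lo = i;
        tree *tmp = forest[0]; forest[0] = forest[lo]; forest[lo] = tmp;

        // find the next lowest among the rest
        int lo2 = 1;
        for (int i = 2; i < n; i++)
            if (forest[i]->frequency < forest[lo2]->frequency)
                lo2 = i;

        // join them; the parent takes one slot, the last tree fills the other
        forest[0] = join(forest[0], forest[lo2]);
        forest[lo2] = forest[n - 1];
        n--;
    }
    return forest[0];
}
```

Running this on leaves with the counts from "This is cs50" produces a root whose frequency is 10, the sum of all the counts, just as on the slide.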
203 00:15:48,060 --> 00:15:52,320 If we look up at how we would actually encode it, 204 00:15:52,320 --> 00:15:56,580 as expected, the i and the s, which appear the most often 205 00:15:56,580 --> 00:16:01,350 are represented by the fewest number of bits. 206 00:16:03,660 --> 00:16:05,660 >> Be careful here. 207 00:16:05,660 --> 00:16:09,780 In Huffman trees the case actually matters. 208 00:16:09,780 --> 00:16:13,670 An uppercase S is different than a lowercase s. 209 00:16:13,670 --> 00:16:21,260 If we had "This is CS50" with capital letters, then the lowercase s would only appear twice, 210 00:16:21,260 --> 00:16:27,120 would be a node with 2 as its value, and then uppercase S would only be once. 211 00:16:27,120 --> 00:16:33,440 So then your tree would change structures because you actually have an extra leaf here. 212 00:16:33,440 --> 00:16:36,900 But the sum would still be 10. 213 00:16:36,900 --> 00:16:39,570 That's what we're actually going to be calling the checksum, 214 00:16:39,570 --> 00:16:44,060 the addition of all of the counts. 215 00:16:46,010 --> 00:16:50,990 >> Now that we've covered Huffman trees, we can dive into Huff'n Puff, the pset. 216 00:16:50,990 --> 00:16:52,900 We're going to start with a section of questions, 217 00:16:52,900 --> 00:16:57,990 and this is going to get you accustomed with binary trees and how to operate around that: 218 00:16:57,990 --> 00:17:03,230 drawing nodes, creating your own typedef struct for a node, 219 00:17:03,230 --> 00:17:07,230 and seeing how you might insert into a binary tree, one that's sorted, 220 00:17:07,230 --> 00:17:09,050 traversing it, and things like that. 221 00:17:09,050 --> 00:17:14,560 That knowledge is definitely going to help you when you dive into the Huff'n Puff portion 222 00:17:14,560 --> 00:17:17,089 of the pset. 
223 00:17:19,150 --> 00:17:26,329 In the standard edition of the pset, your task is to implement Puff, 224 00:17:26,329 --> 00:17:30,240 and in the hacker version your task is to implement Huff. 225 00:17:30,240 --> 00:17:38,490 What Huff does is it takes text and then it translates it into the 0s and 1s, 226 00:17:38,490 --> 00:17:41,990 so the process that we did above where we counted the frequencies 227 00:17:41,990 --> 00:17:50,970 and then made the tree and then said, "How do I get T?" 228 00:17:50,970 --> 00:17:54,840 T is represented by 100, things like that, 229 00:17:54,840 --> 00:17:58,860 and then Huff would take the text and then output that binary. 230 00:17:58,860 --> 00:18:04,920 But also because we know that we want to allow our recipient of the message 231 00:18:04,920 --> 00:18:11,790 to recreate the exact same tree, it also includes information about the frequency counts. 232 00:18:11,790 --> 00:18:17,980 Then with Puff we are given a binary file of 0s and 1s 233 00:18:17,980 --> 00:18:21,740 and given also the information about the frequencies. 234 00:18:21,740 --> 00:18:26,740 We translate all of those 0s and 1s back into the original message that was, 235 00:18:26,740 --> 00:18:29,350 so we're decompressing that. 236 00:18:29,350 --> 00:18:36,450 If you're doing the standard edition, you don't need to implement Huff, 237 00:18:36,450 --> 00:18:39,290 so then you can just use the staff implementation of Huff. 238 00:18:39,290 --> 00:18:42,080 There are instructions in the spec on how to do that. 239 00:18:42,080 --> 00:18:48,780 You can run the staff implementation of Huff upon a certain text file 240 00:18:48,780 --> 00:18:53,270 and then use that output as your input to Puff. 241 00:18:53,270 --> 00:18:59,330 >> As I mentioned before, we have a lot of distribution code for this one. 242 00:18:59,330 --> 00:19:01,810 I'm going to start going through it. 
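Puff's essential move, following the transmitted bits down the tree and emitting a character each time a leaf is reached, can be sketched like this. The struct and function names are illustrative, not the staff's implementation:

```c
#include <string.h>

// Stand-in node: left is the 0 branch, right is the 1 branch.
typedef struct tree
{
    char symbol;
    struct tree *left;
    struct tree *right;
}
tree;

// Follow one bit at a time from the root; each time we land on a leaf,
// emit its symbol and jump back up to the root.
void decode(const tree *root, const char *bits, char *out)
{
    const tree *cursor = root;
    for (int i = 0; bits[i] != '\0'; i++)
    {
        cursor = (bits[i] == '0') ? cursor->left : cursor->right;
        if (cursor->left == NULL && cursor->right == NULL)
        {
            *out++ = cursor->symbol;
            cursor = root;
        }
    }
    *out = '\0';
}
```

With the earlier tree where B is 1, A is 01, and C is 00, the bits 01100 decode unambiguously to ABC; because characters live only at leaves, no code is a prefix of another, which is exactly the no-ambiguity property discussed above.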
243 00:19:01,810 --> 00:19:04,400 I'm going to spend most of the time on the .h files 244 00:19:04,400 --> 00:19:07,660 because in the .c files, because we have the .h 245 00:19:07,660 --> 00:19:11,650 and that provides us with the prototypes of the functions, 246 00:19:11,650 --> 00:19:15,520 we don't fully need to understand exactly-- 247 00:19:15,520 --> 00:19:20,280 If you don't understand what's going on in the .c files, then don't worry too much, 248 00:19:20,280 --> 00:19:23,600 but definitely try to take a look because it might give some hints 249 00:19:23,600 --> 00:19:29,220 and it's useful to get used to reading other people's code. 250 00:19:38,940 --> 00:19:48,270 >> Looking at huffile.h, in the comments it declares a layer of abstraction for Huffman-coded files. 251 00:19:48,270 --> 00:20:01,660 If we go down, we see that there is a maximum of 256 symbols that we might need codes for. 252 00:20:01,660 --> 00:20:05,480 This includes all the letters of the alphabet--uppercase and lowercase-- 253 00:20:05,480 --> 00:20:08,250 and then symbols and numbers, etc. 254 00:20:08,250 --> 00:20:11,930 Then here we have a magic number identifying a Huffman-coded file. 255 00:20:11,930 --> 00:20:15,890 Within a Huffman code they're going to have a certain magic number 256 00:20:15,890 --> 00:20:18,560 associated with the header. 257 00:20:18,560 --> 00:20:21,110 This might look like just a random magic number, 258 00:20:21,110 --> 00:20:27,160 but if you actually translate it into ASCII, then it actually spells out HUFF. 259 00:20:27,160 --> 00:20:34,290 Here we have a struct for a Huffman-encoded file. 260 00:20:34,290 --> 00:20:39,670 There's all of these characteristics associated with a Huff file. 261 00:20:39,670 --> 00:20:47,080 Then down here we have the header for a Huff file, so we call it Huffeader 262 00:20:47,080 --> 00:20:50,810 instead of adding the extra h because it sounds the same anyway. 263 00:20:50,810 --> 00:20:52,720 Cute. 
264 00:20:52,720 --> 00:20:57,790 We have a magic number associated with it. 265 00:20:57,790 --> 00:21:09,040 If it's an actual Huff file, it's going to be the number up above, this magic one. 266 00:21:09,040 --> 00:21:14,720 And then it will have an array. 267 00:21:14,720 --> 00:21:18,750 So for each symbol, of which there are 256, 268 00:21:18,750 --> 00:21:24,760 it's going to list what the frequency of those symbols are within the Huff file. 269 00:21:24,760 --> 00:21:28,090 And then finally, we have a checksum for the frequencies, 270 00:21:28,090 --> 00:21:32,160 which should be the sum of those frequencies. 271 00:21:32,160 --> 00:21:36,520 So that's what a Huffeader is. 272 00:21:36,520 --> 00:21:44,600 Then we have some functions that return the next bit in the Huff file 273 00:21:44,600 --> 00:21:52,580 as well as writes a bit to the Huff file, and then this function here, hfclose, 274 00:21:52,580 --> 00:21:54,650 that actually closes the Huff file. 275 00:21:54,650 --> 00:21:57,290 Before, we were dealing with straight just fclose, 276 00:21:57,290 --> 00:22:01,190 but when you have a Huff file, instead of fclosing it 277 00:22:01,190 --> 00:22:06,080 what you're actually going to do is hfclose and hfopen it. 278 00:22:06,080 --> 00:22:13,220 Those are specific functions to the Huff files that we're going to be dealing with. 279 00:22:13,220 --> 00:22:19,230 Then here we read in the header and then write the header. 280 00:22:19,230 --> 00:22:25,700 >> Just by reading the .h file we can kind of get a sense of what a Huff file might be, 281 00:22:25,700 --> 00:22:32,480 what characteristics it has, without actually going into the huffile.c, 282 00:22:32,480 --> 00:22:36,750 which, if we dive in, is going to be a bit more complex. 283 00:22:36,750 --> 00:22:41,270 It has all of the file I/O here dealing with pointers. 284 00:22:41,270 --> 00:22:48,010 Here we see that when we call hfread, for instance, it's still dealing with fread. 
285 00:22:48,010 --> 00:22:53,050 We're not getting rid of those functions entirely, but we're sending those to be taken care of 286 00:22:53,050 --> 00:22:59,760 inside the Huff file instead of doing all of it ourselves. 287 00:22:59,760 --> 00:23:02,300 You can feel free to scan through this if you're curious 288 00:23:02,300 --> 00:23:08,410 and go and peel the layer back a little bit. 289 00:23:20,650 --> 00:23:24,060 >> The next file that we're going to look at is tree.h. 290 00:23:24,060 --> 00:23:30,210 Before in the Walkthrough slides we said we expect a Huffman node 291 00:23:30,210 --> 00:23:32,960 and we made a typedef struct node. 292 00:23:32,960 --> 00:23:38,360 We expect it to have a symbol, a frequency, and then 2 node stars. 293 00:23:38,360 --> 00:23:41,870 In this case what we're doing is this is essentially the same 294 00:23:41,870 --> 00:23:46,880 except instead of node we're going to call them trees. 295 00:23:48,790 --> 00:23:56,760 We have a function that when you call make tree it returns you a tree pointer. 296 00:23:56,760 --> 00:24:03,450 Back to Speller, when you were making a new node 297 00:24:03,450 --> 00:24:11,410 you said node* new word = malloc(sizeof) and things like that. 298 00:24:11,410 --> 00:24:17,510 Basically, mktree is going to be dealing with that for you. 299 00:24:17,510 --> 00:24:20,990 Similarly, when you want to remove a tree, 300 00:24:20,990 --> 00:24:24,810 so that's essentially freeing the tree when you're done with it, 301 00:24:24,810 --> 00:24:33,790 instead of explicitly calling free on that, you're actually just going to use the function rmtree 302 00:24:33,790 --> 00:24:40,360 where you pass in the pointer to that tree and then tree.c will take care of that for you. 303 00:24:40,360 --> 00:24:42,490 >> We look into tree.c. 304 00:24:42,490 --> 00:24:47,240 We expect the same functions except to see the implementation as well. 
305 00:24:47,240 --> 00:24:57,720 As we expected, when you call mktree it mallocs the size of a tree into a pointer, 306 00:24:57,720 --> 00:25:03,190 initializes all of the values to the NULL value, so 0s or NULLs, 307 00:25:03,190 --> 00:25:08,280 and then returns the pointer to that tree that you've just malloc'd to you. 308 00:25:08,280 --> 00:25:13,340 Here when you call remove tree it first makes sure that you're not double freeing. 309 00:25:13,340 --> 00:25:18,320 It makes sure that you actually have a tree that you want to remove. 310 00:25:18,320 --> 00:25:23,330 Here because a tree also includes its children, 311 00:25:23,330 --> 00:25:29,560 what this does is it recursively calls remove tree on the left node of the tree 312 00:25:29,560 --> 00:25:31,650 as well as the right node. 313 00:25:31,650 --> 00:25:37,790 Before it frees the parent, it needs to free the children as well. 314 00:25:37,790 --> 00:25:42,770 Parent is also interchangeable with root. 315 00:25:42,770 --> 00:25:46,500 The first ever parent, so like the great-great-great-great-grandfather 316 00:25:46,500 --> 00:25:52,130 or grandmother tree, first we have to free down the levels first. 317 00:25:52,130 --> 00:25:58,490 So traverse to the bottom, free those, and then come back up, free those, etc. 318 00:26:00,400 --> 00:26:02,210 So that's tree. 319 00:26:02,210 --> 00:26:04,240 >> Now we look at forest. 320 00:26:04,240 --> 00:26:09,860 Forest is where you place all of your Huffman trees. 321 00:26:09,860 --> 00:26:12,910 It's saying that we're going to have something called a plot 322 00:26:12,910 --> 00:26:22,320 that contains a pointer to a tree as well as a pointer to a plot called next. 323 00:26:22,320 --> 00:26:28,480 What structure does this kind of look like? 324 00:26:29,870 --> 00:26:32,490 It kind of says it over there. 325 00:26:34,640 --> 00:26:36,700 Right over here. 326 00:26:37,340 --> 00:26:39,170 A linked list. 
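Stepping back to tree.c for a second, the children-first teardown that rmtree performs can be sketched roughly like this. The struct is a stand-in, and the count it returns is only for illustration; the real rmtree returns nothing:

```c
#include <stdlib.h>

// Stand-in node struct for the sketch.
typedef struct tree
{
    struct tree *left;
    struct tree *right;
}
tree;

// Free the children before the parent: once the parent is freed,
// we'd have no pointers left with which to reach them.
// Returns how many nodes were freed (for illustration only).
int rm_tree(tree *t)
{
    if (t == NULL)
        return 0;
    int freed = rm_tree(t->left) + rm_tree(t->right);
    free(t);
    return freed + 1;
}
```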
327 00:26:39,170 --> 00:26:44,590 We see that when we have a plot it's like a linked list of plots. 328 00:26:44,590 --> 00:26:53,020 A forest is defined as a linked list of plots, 329 00:26:53,020 --> 00:26:58,100 and so the structure of forest is we're just going to have a pointer to our first plot 330 00:26:58,100 --> 00:27:02,740 and that plot has a tree within it or rather points to a tree 331 00:27:02,740 --> 00:27:06,190 and then points to the next plot, so on and so forth. 332 00:27:06,190 --> 00:27:11,100 To make a forest we call mkforest. 333 00:27:11,100 --> 00:27:14,930 Then we have some pretty useful functions here. 334 00:27:14,930 --> 00:27:23,240 We have pick where you pass in a forest and then the return value is a Tree*, 335 00:27:23,240 --> 00:27:25,210 a pointer to a tree. 336 00:27:25,210 --> 00:27:29,370 What pick will do is it will go into the forest that you're pointing to 337 00:27:29,370 --> 00:27:35,240 then remove a tree with the lowest frequency from that forest 338 00:27:35,240 --> 00:27:38,330 and then give you the pointer to that tree. 339 00:27:38,330 --> 00:27:43,030 Once you call pick, the tree won't exist in the forest anymore, 340 00:27:43,030 --> 00:27:48,550 but the return value is the pointer to that tree. 341 00:27:48,550 --> 00:27:50,730 Then you have plant. 342 00:27:50,730 --> 00:27:57,420 Provided that you pass in a pointer to a tree that has a non-0 frequency, 343 00:27:57,420 --> 00:28:04,040 what plant will do is it will take the forest, take the tree, and plant that tree inside of the forest. 344 00:28:04,040 --> 00:28:06,370 Here we have rmforest. 345 00:28:06,370 --> 00:28:11,480 Similar to remove tree, which basically freed all of our trees for us, 346 00:28:11,480 --> 00:28:16,600 remove forest will free everything contained in that forest. 
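Here is a rough sketch of how a sorted linked list of plots makes pick and plant work. These are stand-in structs and names; the real definitions live in the distribution code, which also breaks frequency ties alphabetically:

```c
#include <stdbool.h>
#include <stdlib.h>

typedef struct tree { int frequency; } tree;   // stand-in for the real tree

typedef struct plot
{
    tree *t;              // the tree planted in this plot
    struct plot *next;    // the next plot in the list
}
plot;

typedef struct { plot *first; } forest;

// plant keeps the list sorted by frequency: walk until we find a plot
// whose tree weighs more, then splice the new plot in just before it.
bool plant(forest *f, tree *t)
{
    if (t == NULL || t->frequency == 0)
        return false;
    plot *p = malloc(sizeof(plot));
    p->t = t;
    plot **cur = &f->first;
    while (*cur != NULL && (*cur)->t->frequency < t->frequency)
        cur = &(*cur)->next;
    p->next = *cur;
    *cur = p;
    return true;
}

// Because the list stays sorted, pick just detaches the first plot
// and hands back the lowest-frequency tree.
tree *pick(forest *f)
{
    if (f->first == NULL)
        return NULL;
    plot *p = f->first;
    tree *t = p->t;
    f->first = p->next;
    free(p);
    return t;
}
```

Planting trees with frequencies 3, 1, and 2 and then picking three times hands them back in the order 1, 2, 3, which is what makes the build-the-tree loop so simple.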
347 00:28:16,600 --> 00:28:24,890 >> If we look into forest.c, we'll expect to see at least 1 rmtree command in there, 348 00:28:24,890 --> 00:28:30,090 because to free memory in the forest if a forest has trees in it, 349 00:28:30,090 --> 00:28:32,930 then eventually you're going to have to remove those trees too. 350 00:28:32,930 --> 00:28:41,020 If we look into forest.c, we have our mkforest, which is as we expect. 351 00:28:41,020 --> 00:28:42,890 We malloc things. 352 00:28:42,890 --> 00:28:51,740 We initialize the first plot in the forest as NULL because it's empty to begin with, 353 00:28:51,740 --> 00:29:05,940 then we see pick, which returns the tree with the lowest weight, the lowest frequency, 354 00:29:05,940 --> 00:29:13,560 and then gets rid of that particular node that points to that tree and the next one, 355 00:29:13,560 --> 00:29:16,760 so it takes that out of the linked list of the forest. 356 00:29:16,760 --> 00:29:24,510 And then here we have plant, which inserts a tree into the linked list. 357 00:29:24,510 --> 00:29:29,960 What forest does is it nicely keeps it sorted for us. 358 00:29:29,960 --> 00:29:37,910 And then finally, we have rmforest and, as expected, we have rmtree called there. 359 00:29:46,650 --> 00:29:55,440 >> Looking at the distribution code so far, huffile.c was probably by far the hardest to understand, 360 00:29:55,440 --> 00:29:59,990 whereas the other files themselves were pretty simple to follow. 361 00:29:59,990 --> 00:30:03,090 With our knowledge of pointers and linked lists and such, 362 00:30:03,090 --> 00:30:04,860 we were able to follow pretty well. 
363 00:30:04,860 --> 00:30:10,500 But what we really need to make sure that we fully understand is the .h files 364 00:30:10,500 --> 00:30:15,840 because you need to be calling those functions, dealing with those return values, 365 00:30:15,840 --> 00:30:20,590 so make sure that you fully understand what action is going to be performed 366 00:30:20,590 --> 00:30:24,290 whenever you call one of those functions. 367 00:30:24,290 --> 00:30:33,020 But actually understanding what's inside of it isn't quite necessary because we have those .h files. 368 00:30:35,170 --> 00:30:39,490 We have 2 more files left in our distribution code. 369 00:30:39,490 --> 00:30:41,640 >> Let's look at dump. 370 00:30:41,640 --> 00:30:47,230 Dump by its comment here takes a Huffman-compressed file 371 00:30:47,230 --> 00:30:55,580 and then translates and dumps all of its content out. 372 00:31:01,010 --> 00:31:04,260 Here we see that it's calling hfopen. 373 00:31:04,260 --> 00:31:10,770 This kind of mirrors FILE* input = fopen, 374 00:31:10,770 --> 00:31:13,500 and then you pass in the information. 375 00:31:13,500 --> 00:31:18,240 It's almost identical except instead of a FILE* you're passing in a Huffile; 376 00:31:18,240 --> 00:31:22,030 instead of fopen you're passing in hfopen. 377 00:31:22,030 --> 00:31:29,280 Here we read in the header first, which is kind of similar to how we read in the header 378 00:31:29,280 --> 00:31:33,580 for a bitmap file. 379 00:31:33,580 --> 00:31:38,000 What we're doing here is checking to see whether the header information 380 00:31:38,000 --> 00:31:44,330 contains the right magic number that indicates that it's an actual Huff file, 381 00:31:44,330 --> 00:31:53,610 then all of these checks make sure that the file that we open is an actual huffed file. 382 00:31:53,610 --> 00:32:05,330 What this does is it outputs the frequencies of all of the symbols that we can see 383 00:32:05,330 --> 00:32:09,790 within a terminal into a graphical table.
384 00:32:09,790 --> 00:32:15,240 This part is going to be useful. 385 00:32:15,240 --> 00:32:24,680 It declares a variable bit, reads the file in bit by bit, and then prints each bit out. 386 00:32:28,220 --> 00:32:35,430 So if I were to call dump on hth.bin, which is the result of huffing a file 387 00:32:35,430 --> 00:32:39,490 using the staff solution, I would get this. 388 00:32:39,490 --> 00:32:46,000 It's outputting all of these characters and then putting the frequency at which they appear. 389 00:32:46,000 --> 00:32:51,180 If we look, most of them are 0s except for this: H, which appears twice, 390 00:32:51,180 --> 00:32:54,820 and then T, which appears once. 391 00:32:54,820 --> 00:33:07,860 And then here we have the actual message in 0s and 1s. 392 00:33:07,860 --> 00:33:15,450 If we look at hth.txt, which is presumably the original message that was huffed, 393 00:33:15,450 --> 00:33:22,490 we expect to see some Hs and Ts in there. 394 00:33:22,490 --> 00:33:28,720 Specifically, we expect to see just 1 T and 2 Hs. 395 00:33:32,510 --> 00:33:37,440 Here we are in hth.txt. It indeed has HTH. 396 00:33:37,440 --> 00:33:41,270 Included in there, although we can't see it, is a newline character. 397 00:33:41,270 --> 00:33:53,190 The Huff file hth.bin encodes the newline character as well. 398 00:33:55,680 --> 00:34:01,330 Here because we know that the order is HTH and then newline, 399 00:34:01,330 --> 00:34:07,340 we can see that probably the H is represented by just a single 1 400 00:34:07,340 --> 00:34:17,120 and then the T is probably 01 and then the next H is 1 as well 401 00:34:17,120 --> 00:34:21,139 and then we have a newline indicated by two 0s. 402 00:34:22,420 --> 00:34:24,280 Cool.
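As a sanity check on that reading of the bits, here's a tiny sketch that concatenates the codes we just inferred for H, T, and newline. These codes are guesses read off the dump output, not pulled from the staff's actual tree.

```c
#include <string.h>

// Codes inferred from the dump output: H -> "1", T -> "01",
// newline -> "00". Concatenating them for the message "HTH\n"
// should reproduce the bit string that dump printed.
void encode_hth(char *out)
{
    out[0] = '\0';
    strcat(out, "1");   // H
    strcat(out, "01");  // T
    strcat(out, "1");   // H
    strcat(out, "00");  // newline
}
```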
403 00:34:26,530 --> 00:34:31,600 >> And then finally, because we're dealing with multiple .c and .h files, 404 00:34:31,600 --> 00:34:36,350 we're going to have a pretty complex argument to the compiler, 405 00:34:36,350 --> 00:34:40,460 and so here we have a Makefile that makes dump for you. 406 00:34:40,460 --> 00:34:47,070 But actually, you have to go about making your own puff.c file. 407 00:34:47,070 --> 00:34:54,330 The Makefile actually doesn't deal with making puff.c for you. 408 00:34:54,330 --> 00:34:59,310 We're leaving that up to you to edit the Makefile. 409 00:34:59,310 --> 00:35:05,930 When you enter a command like make all, for instance, it will make all of them for you. 410 00:35:05,930 --> 00:35:10,760 Feel free to look at the examples of Makefile from the past pset 411 00:35:10,760 --> 00:35:17,400 as well as going off of this one to see how you might be able to make your Puff file 412 00:35:17,400 --> 00:35:20,260 by editing this Makefile. 413 00:35:20,260 --> 00:35:22,730 That's about it for our distribution code. 414 00:35:22,730 --> 00:35:28,380 >> Once we've gotten through that, then here's just another reminder 415 00:35:28,380 --> 00:35:30,980 of how we're going to be dealing with the Huffman nodes. 416 00:35:30,980 --> 00:35:35,400 We're not going to be calling them nodes anymore; we're going to be calling them trees 417 00:35:35,400 --> 00:35:39,260 where we're going to be representing their symbol with a char, 418 00:35:39,260 --> 00:35:43,340 their frequency, the number of occurrences, with an integer. 419 00:35:43,340 --> 00:35:47,370 We're using that because it's more precise than a float. 420 00:35:47,370 --> 00:35:52,980 And then we have another pointer to the left child as well as the right child. 421 00:35:52,980 --> 00:35:59,630 A forest, as we saw, is just a linked list of trees. 
422 00:35:59,630 --> 00:36:04,670 Ultimately, when we're building up our Huff file, 423 00:36:04,670 --> 00:36:07,580 we want our forest to contain just 1 tree-- 424 00:36:07,580 --> 00:36:12,420 1 tree, 1 root with multiple children. 425 00:36:12,420 --> 00:36:20,840 Earlier on when we were just making our Huffman trees, 426 00:36:20,840 --> 00:36:25,360 we started out by placing all of the nodes onto our screen 427 00:36:25,360 --> 00:36:27,790 and saying we're going to have these nodes, 428 00:36:27,790 --> 00:36:32,920 eventually they're going to be the leaves, and this is their symbol, this is their frequency. 429 00:36:32,920 --> 00:36:42,070 In our forest if we just have 3 letters, that's a forest of 3 trees. 430 00:36:42,070 --> 00:36:45,150 And then as we go on, when we added the first parent, 431 00:36:45,150 --> 00:36:48,080 we made a forest of 2 trees. 432 00:36:48,080 --> 00:36:54,930 We removed 2 of those children from our forest and then replaced it with a parent node 433 00:36:54,930 --> 00:36:58,820 that had those 2 nodes as children. 434 00:36:58,820 --> 00:37:05,600 And then finally, our last step with making our example with the As, Bs, and Cs 435 00:37:05,600 --> 00:37:08,030 would be to make the final parent, 436 00:37:08,030 --> 00:37:13,190 and so then that would bring our total count of trees in the forest to 1. 437 00:37:13,190 --> 00:37:18,140 Does everyone see how you start out with multiple trees in your forest 438 00:37:18,140 --> 00:37:22,520 and end up with 1? Okay. Cool. 439 00:37:25,530 --> 00:37:28,110 >> What do we need to do for Puff? 440 00:37:28,110 --> 00:37:37,110 What we need to do is ensure that, as always, they give us the right type of input 441 00:37:37,110 --> 00:37:39,090 so that we can actually run the program. 
442 00:37:39,090 --> 00:37:43,130 In this case they're going to be giving us after their first command-line argument 443 00:37:43,130 --> 00:37:53,440 2 more: the file that we want to decompress and the output of the decompressed file. 444 00:37:53,440 --> 00:38:00,410 But once we make sure that they pass us in the right amount of values, 445 00:38:00,410 --> 00:38:05,820 we want to ensure that the input is a Huff file or not. 446 00:38:05,820 --> 00:38:10,420 And then once we guarantee that it's a Huff file, then we want to build our tree, 447 00:38:10,420 --> 00:38:20,940 build up the tree such that it matches the tree that the person who sent the message built. 448 00:38:20,940 --> 00:38:25,840 Then after we build the tree, then we can deal with the 0s and 1s that they passed in, 449 00:38:25,840 --> 00:38:29,590 follow those along our tree because it's identical, 450 00:38:29,590 --> 00:38:33,510 and then write that message out, interpret the bits back into chars. 451 00:38:33,510 --> 00:38:35,880 And then at the end because we're dealing with pointers here, 452 00:38:35,880 --> 00:38:38,110 we want to make sure that we don't have any memory leaks 453 00:38:38,110 --> 00:38:41,330 and that we free everything. 454 00:38:42,820 --> 00:38:46,430 >> Ensuring proper usage is old hat for us by now. 455 00:38:46,430 --> 00:38:51,980 We take in an input, which is going to be the name of the file to puff, 456 00:38:51,980 --> 00:38:56,010 and then we specify an output, 457 00:38:56,010 --> 00:39:01,580 so the name of the file for the puffed output, which will be the text file. 458 00:39:03,680 --> 00:39:08,820 That's usage. And now we want to ensure that the input is huffed or not. 459 00:39:08,820 --> 00:39:16,420 Thinking back, was there anything in the distribution code that might help us 460 00:39:16,420 --> 00:39:21,570 with understanding whether a file is huffed or not? 461 00:39:21,570 --> 00:39:26,910 There was information in huffile.c about the Huffeader. 
462 00:39:26,910 --> 00:39:33,430 We know that every Huff file has a Huffeader associated with it with a magic number 463 00:39:33,430 --> 00:39:37,240 as well as an array of the frequencies for each symbol 464 00:39:37,240 --> 00:39:39,570 as well as a checksum. 465 00:39:39,570 --> 00:39:43,180 We know that, but we also took a peek at dump.c, 466 00:39:43,180 --> 00:39:49,120 in which it was reading into a Huff file. 467 00:39:49,120 --> 00:39:53,990 And so to do that, it had to check whether it really was huffed or not. 468 00:39:53,990 --> 00:40:03,380 So perhaps we could use dump.c as a structure for our puff.c. 469 00:40:03,380 --> 00:40:12,680 Back to pset 4 when we had the file copy.c that copied in RGB triples 470 00:40:12,680 --> 00:40:14,860 and we interpreted that for Whodunit and Resize, 471 00:40:14,860 --> 00:40:20,390 similarly, what you could do is just run the command like cp dump.c puff.c 472 00:40:20,390 --> 00:40:23,600 and use some of the code there. 473 00:40:23,600 --> 00:40:28,210 However, it's not going to be as straightforward of a process 474 00:40:28,210 --> 00:40:33,010 for translating your dump.c into puff.c, 475 00:40:33,010 --> 00:40:36,160 but at least it gives you somewhere to start 476 00:40:36,160 --> 00:40:40,540 on how to ensure that the input is actually huffed or not 477 00:40:40,540 --> 00:40:43,240 as well as a few other things. 478 00:40:45,930 --> 00:40:50,250 We have ensured proper usage and ensured that the input is huffed. 479 00:40:50,250 --> 00:40:53,570 Every time that we've done that we have done our proper error checking, 480 00:40:53,570 --> 00:41:01,520 so returning and quitting the function if some failure occurs, if there's a problem. 481 00:41:01,520 --> 00:41:07,170 >> Now what we want to do is build the actual tree. 482 00:41:08,840 --> 00:41:12,640 If we look in Forest, there are 2 main functions 483 00:41:12,640 --> 00:41:15,800 that we're going to want to become very familiar with. 
484 00:41:15,800 --> 00:41:23,870 There's the Boolean function plant that plants a non-0 frequency tree inside our forest. 485 00:41:23,870 --> 00:41:29,250 And so there you pass in a pointer to a forest and a pointer to a tree. 486 00:41:32,530 --> 00:41:40,340 Quick question: How many forests will you have when you're building a Huffman tree? 487 00:41:44,210 --> 00:41:46,650 Our forest is like our canvas, right? 488 00:41:46,650 --> 00:41:50,800 So we're only going to have 1 forest, but we're going to have multiple trees. 489 00:41:50,800 --> 00:41:57,590 So before you call plant, you're presumably going to want to make your forest. 490 00:41:57,590 --> 00:42:04,430 There is a command for that if you look into forest.h on how you can make a forest. 491 00:42:04,430 --> 00:42:09,270 You can plant a tree. We know how to do that. 492 00:42:09,270 --> 00:42:11,590 And then you can also pick a tree from the forest, 493 00:42:11,590 --> 00:42:17,540 removing a tree with the lowest weight and giving you the pointer to that. 494 00:42:17,540 --> 00:42:23,090 Thinking back to when we were doing the examples ourselves, 495 00:42:23,090 --> 00:42:27,980 when we were drawing it out, we simply just added the links. 496 00:42:27,980 --> 00:42:31,680 But here instead of just adding the links, 497 00:42:31,680 --> 00:42:40,630 think of it more as you're removing 2 of those nodes and then replacing it by another one. 498 00:42:40,630 --> 00:42:44,200 To express that in terms of picking and planting, 499 00:42:44,200 --> 00:42:48,840 you're picking 2 trees and then planting another tree 500 00:42:48,840 --> 00:42:54,060 that has those 2 trees that you picked as children. 501 00:42:57,950 --> 00:43:05,280 To build Huffman's tree, you can read in the symbols and frequencies in order 502 00:43:05,280 --> 00:43:10,790 because the Huffeader gives that to you, 503 00:43:10,790 --> 00:43:14,250 gives you an array of the frequencies. 
504 00:43:14,250 --> 00:43:19,660 So you can go ahead and just ignore anything with the 0 in it 505 00:43:19,660 --> 00:43:23,760 because we don't want 256 leaves at the end of it. 506 00:43:23,760 --> 00:43:27,960 We only want the number of leaves that are characters 507 00:43:27,960 --> 00:43:31,600 that are actually used in the file. 508 00:43:31,600 --> 00:43:37,590 You can read in those symbols, and each of those symbols that have non-0 frequencies, 509 00:43:37,590 --> 00:43:40,440 those are going to be trees. 510 00:43:40,440 --> 00:43:45,990 What you can do is every time you read in a non-0 frequency symbol, 511 00:43:45,990 --> 00:43:50,660 you can plant that tree in the forest. 512 00:43:50,660 --> 00:43:56,620 Once you plant the trees in the forest, you can join those trees as siblings, 513 00:43:56,620 --> 00:44:01,130 so going back to planting and picking where you pick 2 and then plant 1, 514 00:44:01,130 --> 00:44:05,820 where that 1 that you plant is the parent of the 2 children that you picked. 515 00:44:05,820 --> 00:44:11,160 So then your end result is going to be a single tree in your forest. 516 00:44:16,180 --> 00:44:18,170 That's how you build your tree. 517 00:44:18,170 --> 00:44:21,850 >> There are several things that could go wrong here 518 00:44:21,850 --> 00:44:26,580 because we're dealing with making new trees and dealing with pointers and things like that. 519 00:44:26,580 --> 00:44:30,450 Before when we were dealing with pointers, 520 00:44:30,450 --> 00:44:36,580 whenever we malloc'd we wanted to make sure that it didn't return us a NULL pointer value. 521 00:44:36,580 --> 00:44:42,770 So at several steps within this process there are going to be several cases 522 00:44:42,770 --> 00:44:45,920 where your program could fail. 
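The whole plant-the-leaves, then pick-two-and-plant-one loop can be sketched end to end. The compact plant and pick below are stand-ins for the distribution code's versions so the sketch compiles on its own, and error handling is kept minimal.

```c
#include <stdbool.h>
#include <stdlib.h>

// Compact stand-ins for the distribution code's types.
typedef struct tree
{
    char symbol;
    int frequency;
    struct tree *left;
    struct tree *right;
}
Tree;

typedef struct plot
{
    Tree *tree;
    struct plot *next;
}
Plot;

typedef struct
{
    Plot *first;
}
Forest;

// plant: sorted insert of a non-0 frequency tree (stand-in for forest.c's).
bool plant(Forest *f, Tree *t)
{
    if (f == NULL || t == NULL || t->frequency == 0)
        return false;
    Plot *p = malloc(sizeof(Plot));
    if (p == NULL)
        return false;
    p->tree = t;
    Plot **cur = &f->first;
    while (*cur != NULL && (*cur)->tree->frequency < t->frequency)
        cur = &(*cur)->next;
    p->next = *cur;
    *cur = p;
    return true;
}

// pick: unlink and return the lowest-frequency tree, the head of the list.
Tree *pick(Forest *f)
{
    if (f == NULL || f->first == NULL)
        return NULL;
    Plot *p = f->first;
    Tree *t = p->tree;
    f->first = p->next;
    free(p);
    return t;
}

// Join trees until one remains: pick the two lightest, plant a parent
// whose frequency is their sum and whose children they become.
Tree *build(Forest *f)
{
    Tree *first = pick(f);
    Tree *second = pick(f);
    while (second != NULL)
    {
        Tree *parent = malloc(sizeof(Tree));
        if (parent == NULL)
            return NULL;
        parent->symbol = 0;
        parent->frequency = first->frequency + second->frequency;
        parent->left = first;
        parent->right = second;
        if (!plant(f, parent))
            return NULL;
        first = pick(f);
        second = pick(f);
    }
    return first;
}
```

Each pass through the loop shrinks the forest by one tree, which is exactly the 3-trees, then 2, then 1 progression from the A, B, C example.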
523 00:44:45,920 --> 00:44:51,310 What you want to do is you want to make sure that you handle those errors, 524 00:44:51,310 --> 00:44:54,580 and in the spec it says to handle them gracefully, 525 00:44:54,580 --> 00:45:00,280 so like print out a message to the user telling them why the program has to quit 526 00:45:00,280 --> 00:45:03,050 and then promptly quit it. 527 00:45:03,050 --> 00:45:09,490 To do this error handling, remember that you want to check it 528 00:45:09,490 --> 00:45:12,160 every single time that there could be a failure. 529 00:45:12,160 --> 00:45:14,660 Every single time that you're making a new pointer 530 00:45:14,660 --> 00:45:17,040 you want to make sure that that's successful. 531 00:45:17,040 --> 00:45:20,320 Before what we used to do is make a new pointer and malloc it, 532 00:45:20,320 --> 00:45:22,380 and then we would check whether that pointer is NULL. 533 00:45:22,380 --> 00:45:25,670 So there are going to be some instances where you can just do that, 534 00:45:25,670 --> 00:45:28,610 but sometimes you're actually calling a function 535 00:45:28,610 --> 00:45:33,100 and within that function, that's the one that's doing the mallocing. 536 00:45:33,100 --> 00:45:39,110 In that case, if we look back to some of the functions within the code, 537 00:45:39,110 --> 00:45:42,260 some of them are Boolean functions. 538 00:45:42,260 --> 00:45:48,480 In the abstract case if we have a Boolean function called foo, 539 00:45:48,480 --> 00:45:54,580 basically, we can assume that in addition to doing whatever foo does, 540 00:45:54,580 --> 00:45:57,210 since it's a Boolean function, it returns true or false-- 541 00:45:57,210 --> 00:46:01,300 true if successful, false if not. 542 00:46:01,300 --> 00:46:06,270 So we want to check whether the return value of foo is true or false. 
543 00:46:06,270 --> 00:46:10,400 If it's false, that means that we're going to want to print some kind of message 544 00:46:10,400 --> 00:46:14,390 and then quit the program. 545 00:46:14,390 --> 00:46:18,530 What we want to do is check the return value of foo. 546 00:46:18,530 --> 00:46:23,310 If foo returns false, then we know that we encountered some kind of error 547 00:46:23,310 --> 00:46:25,110 and we need to quit our program. 548 00:46:25,110 --> 00:46:35,600 A way to do this is have a condition where the actual function itself is your condition. 549 00:46:35,600 --> 00:46:39,320 Say foo takes in x. 550 00:46:39,320 --> 00:46:43,390 We can have as a condition if (foo(x)). 551 00:46:43,390 --> 00:46:50,900 Basically, that means if at the end of executing foo it returns true, 552 00:46:50,900 --> 00:46:57,390 then we can do this because the function has to evaluate foo 553 00:46:57,390 --> 00:47:00,500 in order to evaluate the whole condition. 554 00:47:00,500 --> 00:47:06,500 So then that's how you can do something if the function returns true and is successful. 555 00:47:06,500 --> 00:47:11,800 But when you're error checking, you only want to quit if your function returns false. 556 00:47:11,800 --> 00:47:16,090 What you could do is just add an == false or just add a bang in front of it 557 00:47:16,090 --> 00:47:21,010 and then you have if (!foo). 558 00:47:21,010 --> 00:47:29,540 Within that body of that condition you would have all of the error handling, 559 00:47:29,540 --> 00:47:36,940 so like, "Could not create this tree" and then return 1 or something like that. 560 00:47:36,940 --> 00:47:43,340 What that does, though, is that even though foo returned false-- 561 00:47:43,340 --> 00:47:46,980 Say foo returns true. 562 00:47:46,980 --> 00:47:51,060 Then you don't have to call foo again. That's a common misconception. 
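That pattern, calling the boolean function inside the condition and bailing out on false, looks like this. As in the discussion above, foo is a hypothetical stand-in for plant, pick, or any helper that returns true on success.

```c
#include <stdbool.h>
#include <stdio.h>

// foo: hypothetical success/failure helper, true on success.
bool foo(int x)
{
    return x >= 0; // pretend a negative input is a failure
}

// Putting the call inside the condition both runs foo and tests its
// result: on failure we report and quit, and we never call foo twice.
int run(int x)
{
    if (!foo(x))
    {
        fprintf(stderr, "Could not do the thing\n");
        return 1;
    }
    // foo has already executed successfully by this point;
    // no second call is needed.
    return 0;
}
```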
563 00:47:51,060 --> 00:47:54,730 Because it was in your condition, it's already evaluated, 564 00:47:54,730 --> 00:47:59,430 so you already have the result if you're using make tree or something like that 565 00:47:59,430 --> 00:48:01,840 or plant or pick or something. 566 00:48:01,840 --> 00:48:07,460 It already has that value. It's already executed. 567 00:48:07,460 --> 00:48:10,730 So it's useful to use Boolean functions as the condition 568 00:48:10,730 --> 00:48:13,890 because whether or not you actually execute the body of the loop, 569 00:48:13,890 --> 00:48:18,030 it executes the function anyway. 570 00:48:22,070 --> 00:48:27,330 >> Our second to last step is writing the message to the file. 571 00:48:27,330 --> 00:48:33,070 Once we build the Huffman tree, then writing the message to the file is pretty straightforward. 572 00:48:33,070 --> 00:48:39,260 It's pretty straightforward now to just follow the 0s and 1s. 573 00:48:39,260 --> 00:48:45,480 And so by convention we know that in a Huffman tree the 0s indicate left 574 00:48:45,480 --> 00:48:48,360 and the 1s indicate right. 575 00:48:48,360 --> 00:48:53,540 So then if you read in bit by bit, every time that you get a 0 576 00:48:53,540 --> 00:48:59,100 you'll follow the left branch, and then every time you read in a 1 577 00:48:59,100 --> 00:49:02,100 you're going to follow the right branch. 578 00:49:02,100 --> 00:49:07,570 And then you're going to continue until you hit a leaf 579 00:49:07,570 --> 00:49:11,550 because the leaves are going to be at the end of the branches. 580 00:49:11,550 --> 00:49:16,870 How can we tell whether we've hit a leaf or not? 581 00:49:19,800 --> 00:49:21,690 We said it before. 582 00:49:21,690 --> 00:49:24,040 [student] If the pointers are NULL. >>Yeah. 583 00:49:24,040 --> 00:49:32,220 We can tell if we've hit a leaf if the pointers to both the left and right trees are NULL. 584 00:49:32,220 --> 00:49:34,110 Perfect. 
585 00:49:34,110 --> 00:49:40,320 We know that we want to read in bit by bit into our Huff file. 586 00:49:43,870 --> 00:49:51,220 As we saw before in dump.c, what they did is they read in bit by bit into the Huff file 587 00:49:51,220 --> 00:49:54,560 and just printed out what those bits were. 588 00:49:54,560 --> 00:49:58,430 We're not going to be doing that. We're going to be doing something that's a bit more complex. 589 00:49:58,430 --> 00:50:03,620 But what we can do is we can take that bit of code that reads in to the bit. 590 00:50:03,620 --> 00:50:10,250 Here we have the integer bit representing the current bit that we're on. 591 00:50:10,250 --> 00:50:15,520 This takes care of iterating all of the bits in the file until you hit the end of the file. 592 00:50:15,520 --> 00:50:21,270 Based on that, then you're going to want to have some kind of iterator 593 00:50:21,270 --> 00:50:26,760 to traverse your tree. 594 00:50:26,760 --> 00:50:31,460 And then based on whether the bit is 0 or 1, 595 00:50:31,460 --> 00:50:36,920 you're going to want to either move that iterator to the left or move it to the right 596 00:50:36,920 --> 00:50:44,080 all the way until you hit a leaf, so all the way until that node that you're on 597 00:50:44,080 --> 00:50:48,260 doesn't point to any more nodes. 598 00:50:48,260 --> 00:50:54,300 Why can we do this with a Huffman file but not Morse code? 599 00:50:54,300 --> 00:50:56,610 Because in Morse code there's a bit of ambiguity. 600 00:50:56,610 --> 00:51:04,440 We could be like, oh wait, we've hit a letter along the way, so maybe this is our letter, 601 00:51:04,440 --> 00:51:08,150 whereas if we continued just a bit longer, then we would have hit another letter. 
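Putting the bit-following and the leaf check together, a decoding loop might look like this sketch. The real puff reads actual bits from the Huffile with the distribution code's helpers, so a string of '0' and '1' characters stands in here, and the input is assumed well formed.

```c
#include <stddef.h>

// Minimal node for this sketch: symbol plus two children.
typedef struct tree
{
    char symbol;
    struct tree *left;
    struct tree *right;
}
Tree;

// Walk the tree bit by bit: 0 means go left, 1 means go right, and a
// node whose left and right children are both NULL is a leaf, so emit
// its symbol and jump back to the root for the next character.
size_t decode(const Tree *root, const char *bits, char *out)
{
    size_t n = 0;
    const Tree *cur = root;
    for (const char *b = bits; *b != '\0'; b++)
    {
        cur = (*b == '0') ? cur->left : cur->right;
        if (cur->left == NULL && cur->right == NULL)
        {
            out[n++] = cur->symbol;
            cur = root;
        }
    }
    out[n] = '\0';
    return n;
}
```

Because Huffman codes are prefix-free, hitting a leaf is unambiguous: no code is a prefix of another, which is exactly the property Morse code lacks.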
602 00:51:08,150 --> 00:51:13,110 But that's not going to happen in Huffman encoding, 603 00:51:13,110 --> 00:51:17,540 so we can rest assured that the only way that we're going to hit a character 604 00:51:17,540 --> 00:51:23,480 is if that node's left and right children are NULL. 605 00:51:28,280 --> 00:51:32,350 >> Finally, we want to free all of our memory. 606 00:51:32,350 --> 00:51:37,420 We want to both close the Huff file that we've been dealing with 607 00:51:37,420 --> 00:51:41,940 as well as remove all of the trees in our forest. 608 00:51:41,940 --> 00:51:46,470 Based on your implementation, you're probably going to want to call remove forest 609 00:51:46,470 --> 00:51:49,780 instead of actually going through all of the trees yourself. 610 00:51:49,780 --> 00:51:53,430 But if you made any temporary trees, you'll want to free that. 611 00:51:53,430 --> 00:51:59,060 You know your code best, so you know where you're allocating memory. 612 00:51:59,060 --> 00:52:04,330 And so if you go in, start by even Control F'ing for malloc, 613 00:52:04,330 --> 00:52:08,330 seeing whenever you malloc and making sure that you free all of that 614 00:52:08,330 --> 00:52:10,190 but then just going through your code, 615 00:52:10,190 --> 00:52:14,260 understanding where you might have allocated memory. 616 00:52:14,260 --> 00:52:21,340 Usually you might just say, "At the end of a file I'm just going to remove forest on my forest," 617 00:52:21,340 --> 00:52:23,850 so basically clear that memory, free that, 618 00:52:23,850 --> 00:52:28,310 "and then I'm also going to close the file and then my program is going to quit." 619 00:52:28,310 --> 00:52:33,810 But is that the only time that your program quits? 620 00:52:33,810 --> 00:52:37,880 No, because sometimes there might have been an error that happened. 
621 00:52:37,880 --> 00:52:42,080 Maybe we couldn't open a file or we couldn't make another tree 622 00:52:42,080 --> 00:52:49,340 or some kind of error happened in the memory allocation process and so it returned NULL. 623 00:52:49,340 --> 00:52:56,710 An error happened and then we returned and quit. 624 00:52:56,710 --> 00:53:02,040 So then you want to make sure that any possible time that your program can quit, 625 00:53:02,040 --> 00:53:06,980 you want to free all of your memory there. 626 00:53:06,980 --> 00:53:13,370 It's not just going to be at the very end of the main function that you quit your code. 627 00:53:13,370 --> 00:53:20,780 You want to look back to every instance that your code potentially might return prematurely 628 00:53:20,780 --> 00:53:25,070 and then free whatever memory makes sense. 629 00:53:25,070 --> 00:53:30,830 Say you had called make forest and that returned false. 630 00:53:30,830 --> 00:53:34,230 Then you probably won't need to remove your forest 631 00:53:34,230 --> 00:53:37,080 because you don't have a forest yet. 632 00:53:37,080 --> 00:53:42,130 But at every point in the code where you might return prematurely 633 00:53:42,130 --> 00:53:46,160 you want to make sure that you free any possible memory. 634 00:53:46,160 --> 00:53:50,020 >> So when we're dealing with freeing memory and having potential leaks, 635 00:53:50,020 --> 00:53:55,440 we want to not only use our judgment and our logic 636 00:53:55,440 --> 00:54:01,850 but also use Valgrind to determine whether we've freed all of our memory properly or not. 637 00:54:01,850 --> 00:54:09,460 You can either run Valgrind on Puff and then you have to also pass it 638 00:54:09,460 --> 00:54:14,020 the right number of command-line arguments to Valgrind. 639 00:54:14,020 --> 00:54:18,100 You can run that, but the output is a bit cryptic. 
640 00:54:18,100 --> 00:54:21,630 We've gotten a bit used to it with Speller, but we still need a bit more help, 641 00:54:21,630 --> 00:54:26,450 so then running it with a few more flags like --leak-check=full, 642 00:54:26,450 --> 00:54:32,040 that will probably give us some more helpful output on Valgrind. 643 00:54:32,040 --> 00:54:39,040 >> Then another useful tip when you're debugging is the diff command. 644 00:54:39,040 --> 00:54:48,520 You can access the staff's implementation of Huff, run that on a text file, 645 00:54:48,520 --> 00:54:55,400 and then output it to a binary file, a binary Huff file, to be specific. 646 00:54:55,400 --> 00:54:59,440 Then if you run your own puff on that binary file, 647 00:54:59,440 --> 00:55:03,950 then ideally, your outputted text file is going to be identical 648 00:55:03,950 --> 00:55:08,200 to the original one that you passed in. 649 00:55:08,200 --> 00:55:15,150 Here I'm using hth.txt as the example, and that's the one talked about in your spec. 650 00:55:15,150 --> 00:55:21,040 That's literally just HTH and then a newline. 651 00:55:21,040 --> 00:55:30,970 But definitely feel free and you are definitely encouraged to use longer examples 652 00:55:30,970 --> 00:55:32,620 for your text file. 653 00:55:32,620 --> 00:55:38,110 >> You can even take a shot at maybe compressing and then decompressing 654 00:55:38,110 --> 00:55:41,600 some of the files that you used in Speller like War and Peace 655 00:55:41,600 --> 00:55:46,710 or Jane Austen or something like that--that would be kind of cool--or Austin Powers, 656 00:55:46,710 --> 00:55:51,880 kind of dealing with larger files, because with small ones we wouldn't see the benefit 657 00:55:51,880 --> 00:55:55,590 if we used the next tool here, ls -l. 658 00:55:55,590 --> 00:56:01,150 We're used to ls, which basically lists all the contents in our current directory. 659 00:56:01,150 --> 00:56:07,860 Passing in the flag -l actually displays the size of those files.
660 00:56:07,860 --> 00:56:12,690 If you go through the pset spec, it actually walks you through creating the binary file, 661 00:56:12,690 --> 00:56:16,590 of huffing it, and you see that for very small files 662 00:56:16,590 --> 00:56:23,910 the space cost of compressing it and translating all of that information 663 00:56:23,910 --> 00:56:26,980 of all the frequencies and things like that outweighs the actual benefit 664 00:56:26,980 --> 00:56:30,000 of compressing the file in the first place. 665 00:56:30,000 --> 00:56:37,450 But if you run it on some longer text files, then you might see that you start to get some benefit 666 00:56:37,450 --> 00:56:40,930 in compressing those files. 667 00:56:40,930 --> 00:56:46,210 >> And then finally, we have our old pal GDB, which is definitely going to come in handy too. 668 00:56:48,360 --> 00:56:55,320 >> Do we have any questions on Huff trees or the process perhaps of making the trees 669 00:56:55,320 --> 00:56:58,590 or any other questions on Huff'n Puff? 670 00:57:00,680 --> 00:57:02,570 Okay. I'll stay around for a bit. 671 00:57:02,570 --> 00:57:06,570 >> Thanks, everyone. This was Walkthrough 6. And good luck. 672 00:57:08,660 --> 00:57:10,000 >> [CS50.TV]