1 00:00:00,000 --> 00:00:03,000 [Week 4] 2 00:00:03,000 --> 00:00:05,000 [David J. Malan] [Harvard University] 3 00:00:05,000 --> 00:00:08,000 [This is CS50.] [CS50.TV] 4 00:00:08,000 --> 00:00:12,000 >> All right, this is CS50, and this is the start of week 4, 5 00:00:12,000 --> 00:00:16,000 and this is one of the slowest possible sorting algorithms. 6 00:00:16,000 --> 00:00:19,000 Which one was it that we just watched there? 7 00:00:19,000 --> 00:00:24,000 That was bubble sort, in order big O(n^2) + sum, 8 00:00:24,000 --> 00:00:28,000 and indeed we are not the only ones in this world to seem to know 9 00:00:28,000 --> 00:00:30,000 what bubble sort is or its running time. 10 00:00:30,000 --> 00:00:33,000 Indeed, this was an interview with Eric Schmidt of Google 11 00:00:33,000 --> 00:00:45,000 and former senator Barack Obama just a few years ago. 12 00:00:45,000 --> 00:00:48,000 >> Now, Senator, you're here at Google, 13 00:00:48,000 --> 00:00:54,000 and I like to think of the presidency as a job interview. 14 00:00:54,000 --> 00:00:58,000 Now, it's hard to get a job as president, and you're going through the rigors now. 15 00:00:58,000 --> 00:01:00,000 It's also hard to get a job at Google. 16 00:01:00,000 --> 00:01:05,000 We have questions, and we ask our candidates questions, 17 00:01:05,000 --> 00:01:10,000 and this one is from Larry Schwimmer. 18 00:01:10,000 --> 00:01:14,000 You guys think I'm kidding? It's right here. 19 00:01:14,000 --> 00:01:18,000 What is the most efficient way to sort a million 32-bit integers? 20 00:01:18,000 --> 00:01:21,000 [laughter] 21 00:01:21,000 --> 00:01:24,000 Well— 22 00:01:24,000 --> 00:01:26,000 I'm sorry.>>No, no, no, no. 23 00:01:26,000 --> 00:01:34,000 I think the bubble sort would be the wrong way to go. 24 00:01:34,000 --> 00:01:39,000 >> Come on, who told him this? 25 00:01:39,000 --> 00:01:43,000 Last week recall we took a break from code, at least for a day, 26 00:01:43,000 --> 00:01:46,000 and started focusing on some higher level ideas and problem solving more generally 27 00:01:46,000 --> 00:01:49,000 in the context of searching and sorting, 28 00:01:49,000 --> 00:01:53,000 and we introduced something that we didn't slap this name on last week, 29 00:01:53,000 --> 00:01:56,000 but asymptotic notation, the Big O, the Big Omega, 30 00:01:56,000 --> 00:02:00,000 and sometimes the Big Theta notation, and these were simply ways 31 00:02:00,000 --> 00:02:02,000 of describing the running time of algorithms, 32 00:02:02,000 --> 00:02:05,000 how much time it takes for an algorithm to run. 33 00:02:05,000 --> 00:02:08,000 >> And you may recall that you talked about the running time in terms of the size 34 00:02:08,000 --> 00:02:11,000 of the input, which we generally call n, whatever the problem may be, 35 00:02:11,000 --> 00:02:13,000 where n is the number of people in the room, 36 00:02:13,000 --> 00:02:17,000 the number of pages in a phone book, and we started to write things out 37 00:02:17,000 --> 00:02:21,000 like O(n^2) or O(n) or O(n log n), 38 00:02:21,000 --> 00:02:24,000 and even when the math didn't quite work out so perfectly 39 00:02:24,000 --> 00:02:28,000 and it was n² - n/2 or something like that 40 00:02:28,000 --> 00:02:31,000 we would instead just throw away some of the lower order terms, 41 00:02:31,000 --> 00:02:34,000 and the motivation there is that we really want a 42 00:02:34,000 --> 00:02:37,000 sort of objective way of evaluating 43 00:02:37,000 --> 00:02:39,000 the performance of programs or the performance of algorithms 44 00:02:39,000 --> 00:02:42,000 that at the end of the day has nothing to do, for instance, 45 00:02:42,000 --> 00:02:45,000 with the speed of your computer today. 46 00:02:45,000 --> 00:02:47,000 >> For instance, if you implement bubble sort, 47 00:02:47,000 --> 00:02:50,000 or you implement merge sort or selection sort on today's computer, 48 00:02:50,000 --> 00:02:53,000 a 2 GHz computer, and you run it, 49 00:02:53,000 --> 00:02:56,000 and it takes some number of seconds, next year there's a 3 GHz 50 00:02:56,000 --> 00:02:59,000 or a 4 GHz computer, and you might then claim that "Wow, my algorithm 51 00:02:59,000 --> 00:03:03,000 is now twice as fast," when in reality that's obviously not the case. 52 00:03:03,000 --> 00:03:06,000 It's just the hardware has gotten faster, but your computer 53 00:03:06,000 --> 00:03:10,000 has not, and so we really want to throw away things like 54 00:03:10,000 --> 00:03:13,000 multiples of 2 or multiples of 3 when it comes to describing 55 00:03:13,000 --> 00:03:17,000 how fast or how slow an algorithm is and really just focus 56 00:03:17,000 --> 00:03:20,000 on n or some factor thereof, 57 00:03:20,000 --> 00:03:24,000 some power thereof as in the case of the sorts from last week. 58 00:03:24,000 --> 00:03:27,000 And recall that with the help of merge sort 59 00:03:27,000 --> 00:03:31,000 we were able to do so much better than bubble sort and selection sort 60 00:03:31,000 --> 00:03:33,000 and even insertion sort. 61 00:03:33,000 --> 00:03:36,000 >> We got down to n log n, and again, 62 00:03:36,000 --> 00:03:39,000 recall that log n generally refers to something that grows 63 00:03:39,000 --> 00:03:43,000 more slowly then n, so n log n thus far was good 64 00:03:43,000 --> 00:03:45,000 because it was less than n². 65 00:03:45,000 --> 00:03:47,000 But to achieve n log n with merge sort 66 00:03:47,000 --> 00:03:51,000 what was the basic germ of an idea that we had to leverage 67 00:03:51,000 --> 00:03:54,000 that we also leveraged back in week 0? 68 00:03:54,000 --> 00:03:58,000 How did we tackle the sorting problem cleverly with merge sort? 69 00:03:58,000 --> 00:04:04,000 What was the key insight, perhaps? 70 00:04:04,000 --> 00:04:07,000 Anyone at all. 71 00:04:07,000 --> 00:04:09,000 Okay, let's take a step back. 72 00:04:09,000 --> 00:04:11,000 Describe merge sort in your own words. 73 00:04:11,000 --> 00:04:15,000 How did it work? 74 00:04:15,000 --> 00:04:17,000 Okay, we'll row back to week 0. 75 00:04:17,000 --> 00:04:19,000 Okay, yeah. 76 00:04:19,000 --> 00:04:22,000 [inaudible-student] 77 00:04:22,000 --> 00:04:26,000 Okay, good, so we divided the array of numbers into 2 pieces. 78 00:04:26,000 --> 00:04:29,000 We sorted each of those pieces, and then we merged them, 79 00:04:29,000 --> 00:04:33,000 and we've seen this idea before of taking a problem that's this big 80 00:04:33,000 --> 00:04:36,000 and chopping it up into a problem that's this big or this big. 81 00:04:36,000 --> 00:04:38,000 >> Recall the phone book example. 82 00:04:38,000 --> 00:04:42,000 Recall the self-counting algorithm from weeks ago, 83 00:04:42,000 --> 00:04:45,000 so merge sort was summarized by this pseudocode here. 84 00:04:45,000 --> 00:04:48,000 When you're given n elements, first it was sanity check. 85 00:04:48,000 --> 00:04:51,000 If n < 2 then don't do anything at all 86 00:04:51,000 --> 00:04:55,000 because if n < 2 then n is obviously 0 or 1, 87 00:04:55,000 --> 00:04:57,000 and so if it's either 0 or 1 there's nothing to sort. 88 00:04:57,000 --> 00:04:59,000 You're done. 89 00:04:59,000 --> 00:05:01,000 Your list is already trivially sorted. 90 00:05:01,000 --> 00:05:04,000 But otherwise if you've got 2 or more elements go ahead and divide them 91 00:05:04,000 --> 00:05:06,000 into 2 halves, left and right. 92 00:05:06,000 --> 00:05:09,000 Sort each of those halves, and then merge the sorted halves. 93 00:05:09,000 --> 00:05:13,000 But the problem here is that at first glance this feels like we're punting. 94 00:05:13,000 --> 00:05:17,000 This is a circular definition in that if I've asked you to sort these n elements 95 00:05:17,000 --> 00:05:22,000 and you're telling me "All right, fine, we'll sort those n/2 and those n/2 elements," 96 00:05:22,000 --> 00:05:27,000 then my next question is going to be "Fine, how do you sort the n/2 elements?" 97 00:05:27,000 --> 00:05:30,000 >> But because of the structure of this program, 98 00:05:30,000 --> 00:05:33,000 because there is this base case, so to speak, 99 00:05:33,000 --> 00:05:39,000 this special case that says if n is < some fixed value like 2 return immediately. 100 00:05:39,000 --> 00:05:42,000 Don't respond with that same circular answer. 101 00:05:42,000 --> 00:05:46,000 This process, this cyclicity will eventually end. 102 00:05:46,000 --> 00:05:50,000 If I ask you "Sort these n elements," and you say, "Fine, sort these n/2," 103 00:05:50,000 --> 00:05:53,000 then you say, "Fine, sort these n/4, n/8, n/16," 104 00:05:53,000 --> 00:05:56,000 eventually you'll divide by a big enough number 105 00:05:56,000 --> 00:05:59,000 that you'll have just 1 element left, at which point you can say, 106 00:05:59,000 --> 00:06:02,000 "Here, here is a sorted single element." 107 00:06:02,000 --> 00:06:06,000 Then the brilliance of this algorithm up here is to derive from the fact 108 00:06:06,000 --> 00:06:09,000 that once you have all of these individually sorted lists, 109 00:06:09,000 --> 00:06:12,000 all of which are of size 1, which seems to be useless, 110 00:06:12,000 --> 00:06:15,000 once you start merging them and merging them 111 00:06:15,000 --> 00:06:19,000 you build up finally as Rob did in the video a finally sorted list. 112 00:06:19,000 --> 00:06:22,000 >> But this idea extends far beyond sorting. 113 00:06:22,000 --> 00:06:26,000 There is this idea embedded in this program known as recursion, 114 00:06:26,000 --> 00:06:29,000 the idea whereby you are a program, 115 00:06:29,000 --> 00:06:32,000 and to solve some problem you call yourself, 116 00:06:32,000 --> 00:06:36,000 or put in the context of programming languages you are a function, 117 00:06:36,000 --> 00:06:39,000 and in order to solve a problem, you the function call yourself 118 00:06:39,000 --> 00:06:42,000 again and again and again, but you the function 119 00:06:42,000 --> 00:06:44,000 can't call yourself infinitely many times. 120 00:06:44,000 --> 00:06:47,000 Eventually you have to bottom out, so to speak, 121 00:06:47,000 --> 00:06:49,000 and have some hard-coded base condition that says 122 00:06:49,000 --> 00:06:53,000 at this point stop calling yourself so that the whole process 123 00:06:53,000 --> 00:06:56,000 finally does in fact stop. 124 00:06:56,000 --> 00:06:58,000 What does this really mean, to recurse? 125 00:06:58,000 --> 00:07:01,000 >> Let's see, if we can do a simple, trivial example with, say, 126 00:07:01,000 --> 00:07:03,000 3 people with me up here on stage, if someone is comfortable. 127 00:07:03,000 --> 00:07:06,000 1, come on up, 2 and 3. 128 00:07:06,000 --> 00:07:09,000 If you 3 want to come up here. 129 00:07:09,000 --> 00:07:12,000 If you want to stand right next to me here in a line, suppose that the problem at hand 130 00:07:12,000 --> 00:07:15,000 is very trivially count the number of people who are here. 131 00:07:15,000 --> 00:07:18,000 But frankly, I'm tired of all these counting examples. 132 00:07:18,000 --> 00:07:21,000 This is going to take some time, 1, 2, and dot, dot, dot. 133 00:07:21,000 --> 00:07:23,000 It's going to take forever. 134 00:07:23,000 --> 00:07:25,000 I'd rather just punt this problem altogether with the help of—what's your name? 135 00:07:25,000 --> 00:07:27,000 Sara.>>Sara, all right. 136 00:07:27,000 --> 00:07:29,000 Kelly.>>Kelly and? 137 00:07:29,000 --> 00:07:31,000 >> Willy.>>Willy, Sara, Kelly, and Willy. 138 00:07:31,000 --> 00:07:34,000 Right now I have been asked the question by someone 139 00:07:34,000 --> 00:07:37,000 how many people are up on this stage, and I have no idea. 140 00:07:37,000 --> 00:07:40,000 This is a really long list, and so instead I'm going to do this trick. 141 00:07:40,000 --> 00:07:43,000 I'm going to ask the person next to me to do most of the work, 142 00:07:43,000 --> 00:07:46,000 and once she is done doing most of the work 143 00:07:46,000 --> 00:07:49,000 I'm going to do the least amount of work possible and just add 1 144 00:07:49,000 --> 00:07:51,000 to whatever her answer is, so here we go. 145 00:07:51,000 --> 00:07:54,000 I've been asked how many people are on stage. 146 00:07:54,000 --> 00:07:57,000 How many people are on stage to the left of you? 147 00:07:57,000 --> 00:08:00,000 The left of me?>>Okay, but don't cheat. 148 00:08:00,000 --> 00:08:04,000 That's good, that's correct, but if we want to continue this logic 149 00:08:04,000 --> 00:08:08,000 let's assume that you similarly want to punt this problem to the left of you, 150 00:08:08,000 --> 00:08:11,000 so rather than answer directly go ahead and just pass the buck. 151 00:08:11,000 --> 00:08:14,000 Oh, how many people are to the left of me? 152 00:08:14,000 --> 00:08:16,000 How many people are to the left? 153 00:08:16,000 --> 00:08:18,000 1. 154 00:08:18,000 --> 00:08:27,000 [laughter] 155 00:08:27,000 --> 00:08:30,000 Okay, so 0, so what now Willy has done 156 00:08:30,000 --> 00:08:33,000 is you've returned your answer this direction saying 0. 157 00:08:33,000 --> 00:08:36,000 Now, what should you do?>>1. 158 00:08:36,000 --> 00:08:39,000 Okay, so you're the 1, so you say, "All right, I'm going to add 1 159 00:08:39,000 --> 00:08:41,000 to whatever Willy's count was," so 1 + 0. 160 00:08:41,000 --> 00:08:43,000 You're now 1 so your answer to the right is now— 161 00:08:43,000 --> 00:08:45,000 1.>>And mine would be 2. 162 00:08:45,000 --> 00:08:48,000 Good, so you're taking the previous answer of 1, 163 00:08:48,000 --> 00:08:51,000 adding the minimal amount of work you want to do, which is +1. 164 00:08:51,000 --> 00:08:55,000 You now have 2, and you then hand me which value? 165 00:08:55,000 --> 00:08:57,000 3, I mean, sorry, 2. 166 00:08:57,000 --> 00:08:59,000 Good. 167 00:08:59,000 --> 00:09:02,000 >> Well, we had 0 to the left. 168 00:09:02,000 --> 00:09:05,000 Then we had 1, and then we add 2, 169 00:09:05,000 --> 00:09:07,000 and now you're handing me the number 2, 170 00:09:07,000 --> 00:09:10,000 and so I'm saying, okay, +1, 3. 171 00:09:10,000 --> 00:09:13,000 There's indeed 3 people standing next to me on this stage, 172 00:09:13,000 --> 00:09:16,000 so we could have obviously done this very linearly, 173 00:09:16,000 --> 00:09:19,000 very much in the obvious fashion, but what did we really do? 174 00:09:19,000 --> 00:09:21,000 We took a problem of size 3 initially. 175 00:09:21,000 --> 00:09:24,000 We then broke it down into a problem of size 2, 176 00:09:24,000 --> 00:09:27,000 then a problem of size 1, and then finally the base case 177 00:09:27,000 --> 00:09:29,000 was really, oh, there's no one there, 178 00:09:29,000 --> 00:09:33,000 at which point Willy returned effectively a hard-coded answer a couple of times, 179 00:09:33,000 --> 00:09:36,000 and the second one was then bubbled up, bubbled up, bubbled up, 180 00:09:36,000 --> 00:09:39,000 and then by adding in this one additional 1 181 00:09:39,000 --> 00:09:41,000 we've implemented this basic idea of recursion. 182 00:09:41,000 --> 00:09:44,000 >> Now, in this case it didn't really solve a problem 183 00:09:44,000 --> 00:09:46,000 any more effectively then we've seen thus far. 184 00:09:46,000 --> 00:09:48,000 But think about the algorithms we've done on stage thus far. 185 00:09:48,000 --> 00:09:51,000 We had 8 pieces of paper on the chalkboard, 186 00:09:51,000 --> 00:09:55,000 on video when Sean was looking for the number 7, and what did he really do? 187 00:09:55,000 --> 00:09:58,000 Well, he didn't do any kind of divide and conquer. 188 00:09:58,000 --> 00:10:01,000 He didn't do any kind of recursion. 189 00:10:01,000 --> 00:10:03,000 Rather he just did this linear algorithm. 190 00:10:03,000 --> 00:10:07,000 But when we introduced the idea of sorted numbers on stage live last week 191 00:10:07,000 --> 00:10:09,000 then we had this instinct of going to the middle, 192 00:10:09,000 --> 00:10:13,000 at which point we had a smaller list of size 4 or another list of size 4, 193 00:10:13,000 --> 00:10:17,000 and then we had the exact same problem, so we repeated, repeated, repeated. 194 00:10:17,000 --> 00:10:19,000 In other words, we recursed. 195 00:10:19,000 --> 00:10:24,000 Thank you very much to our 3 volunteers here for demonstrating recursion with us. 196 00:10:24,000 --> 00:10:28,000 >> Let's see if we can't make this now a little more concrete, 197 00:10:28,000 --> 00:10:30,000 solving a problem that again we could do pretty easily, 198 00:10:30,000 --> 00:10:34,000 but we'll use it as a stepping stone to implementing this basic idea. 199 00:10:34,000 --> 00:10:37,000 If I want to compute the summation of a bunch of numbers, 200 00:10:37,000 --> 00:10:39,000 for instance, if you pass in the number 3, 201 00:10:39,000 --> 00:10:42,000 I want to give you the value of sigma 3, 202 00:10:42,000 --> 00:10:46,000 so the sum of 3 + 2 + 1 + 0. 203 00:10:46,000 --> 00:10:48,000 I want to get back the answer 6, 204 00:10:48,000 --> 00:10:51,000 so we'll implement this sigma function, this summation function 205 00:10:51,000 --> 00:10:54,000 that, again, takes in input, and then returns the summation 206 00:10:54,000 --> 00:10:57,000 of that number all the way down to 0. 207 00:10:57,000 --> 00:10:59,000 We could do this pretty simply, right? 208 00:10:59,000 --> 00:11:01,000 We could do this with some kind of looping structure, 209 00:11:01,000 --> 00:11:04,000 so let me go ahead and get this started. 210 00:11:04,000 --> 00:11:07,000 >> Include stdio.h. 211 00:11:07,000 --> 00:11:09,000 Let me get myself into main to work with here. 212 00:11:09,000 --> 00:11:12,000 Let's save this as sigma.c. 213 00:11:12,000 --> 00:11:14,000 Then I'm going to go in here, and I'm going to declare an int n, 214 00:11:14,000 --> 00:11:18,000 and I'm going to do the following while the user does not cooperate. 215 00:11:18,000 --> 00:11:22,000 While the user has not given me a positive number 216 00:11:22,000 --> 00:11:26,000 let me go ahead and prompt them for n = GetInt, 217 00:11:26,000 --> 00:11:28,000 and let me give them some instructions as to what to do, 218 00:11:28,000 --> 00:11:33,000 so printf("Positive integer please"). 219 00:11:33,000 --> 00:11:39,000 Just something relatively simple like this so that by the time we hit line 14 220 00:11:39,000 --> 00:11:42,000 we now have a positive integer presumably in n. 221 00:11:42,000 --> 00:11:44,000 >> Now let's do something with it. 222 00:11:44,000 --> 00:11:50,000 Let me go ahead and compute the summation, so int sum = sigma(n). 223 00:11:50,000 --> 00:11:54,000 Sigma is just summation, so I'm just writing it in the fancier way. 224 00:11:54,000 --> 00:11:56,000 We'll just call it sigma there. 225 00:11:56,000 --> 00:11:58,000 That's the sum, and now I'm going to print out the result, 226 00:11:58,000 --> 00:12:08,000 printf("The sum is %d, \n", sum). 227 00:12:08,000 --> 00:12:11,000 And then I'll return 0 for good measure. 228 00:12:11,000 --> 00:12:15,000 We've done everything that this program requires except the interesting part, 229 00:12:15,000 --> 00:12:18,000 which is to actually implement the sigma function. 230 00:12:18,000 --> 00:12:22,000 >> Let me go down here to the bottom, and let me declare function sigma. 231 00:12:22,000 --> 00:12:26,000 It's got to take a variable that's of type integer, 232 00:12:26,000 --> 00:12:30,000 and what data type do I want to return presumably from sigma? 233 00:12:30,000 --> 00:12:34,000 Int, because I want it to match my expectations on line 15. 234 00:12:34,000 --> 00:12:37,000 In here let me go ahead and implement this 235 00:12:37,000 --> 00:12:41,000 in a pretty straightforward way. 236 00:12:41,000 --> 00:12:45,000 >> Let's go ahead and say int sum = 0, 237 00:12:45,000 --> 00:12:47,000 and now I'm going to go have a little for loop here 238 00:12:47,000 --> 00:12:50,000 that's going to say something like this, 239 00:12:50,000 --> 00:13:01,000 for (int i = 0; I <= number; i++) sum += i. 240 00:13:01,000 --> 00:13:05,000 And then I'm going to return sum. 241 00:13:05,000 --> 00:13:07,000 I could have implemented this in any number of ways. 242 00:13:07,000 --> 00:13:09,000 I could have used a while loop. 243 00:13:09,000 --> 00:13:11,000 I could have skipped using the sum variable if I really wanted to, 244 00:13:11,000 --> 00:13:15,000 but in short, we just have a function that if I didn't goof declares sum is 0. 245 00:13:15,000 --> 00:13:18,000 Then it iterates from 0 on up through the number, 246 00:13:18,000 --> 00:13:23,000 and on each iteration it adds that current value to sum and then returns sum. 247 00:13:23,000 --> 00:13:25,000 >> Now, there's a slight optimization here. 248 00:13:25,000 --> 00:13:29,000 This is probably a wasted step, but so be it. That's fine for now. 249 00:13:29,000 --> 00:13:32,000 We're at least being thorough and going 0 all the way on up. 250 00:13:32,000 --> 00:13:34,000 Not very hard and pretty straightforward, 251 00:13:34,000 --> 00:13:37,000 but it turns out that with the sigma function we have the same opportunity 252 00:13:37,000 --> 00:13:39,000 as we did here on stage. 253 00:13:39,000 --> 00:13:42,000 On stage we just counted how many people were next to me, 254 00:13:42,000 --> 00:13:47,000 but instead if we wanted to count the number 3 + 2 + 1 255 00:13:47,000 --> 00:13:51,000 on down to 0 we could similarly punt to a function 256 00:13:51,000 --> 00:13:55,000 that I'll instead describe as being recursive. 257 00:13:55,000 --> 00:13:57,000 Here let's do a quick sanity check and make sure I didn't goof. 258 00:13:57,000 --> 00:14:00,000 >> I know there's at least one thing in this program that I did do wrong. 259 00:14:00,000 --> 00:14:04,000 When I hit enter am I going to get any kind of yelling at me? 260 00:14:04,000 --> 00:14:06,000 What am I going to be yelled at about? 261 00:14:06,000 --> 00:14:11,000 Yeah, I forgot the prototype, so I'm using a function called sigma on line 15, 262 00:14:11,000 --> 00:14:16,000 but it's not declared until line 22, so I best proactively go up here 263 00:14:16,000 --> 00:14:22,000 and declare a prototype, and I'll say int sigma(int number), and that's it. 264 00:14:22,000 --> 00:14:24,000 It's implemented at the bottom. 265 00:14:24,000 --> 00:14:27,000 >> Or another way I could solve this, 266 00:14:27,000 --> 00:14:30,000 I could move the function up there, which isn't bad, 267 00:14:30,000 --> 00:14:32,000 but at least when your programs start to get long, frankly, 268 00:14:32,000 --> 00:14:35,000 I think there's some value in always having main at the top 269 00:14:35,000 --> 00:14:38,000 so that you in the reader can open the file and then immediately see 270 00:14:38,000 --> 00:14:40,000 what the program is doing without having to search through it 271 00:14:40,000 --> 00:14:42,000 looking for that main function. 272 00:14:42,000 --> 00:14:49,000 Let's go down to my terminal window here, try making sigma make sigma, 273 00:14:49,000 --> 00:14:51,000 and I screwed up here too. 274 00:14:51,000 --> 00:14:55,000 Implicit declaration of function GetInt means I've forgotten to do what else? 275 00:14:55,000 --> 00:14:57,000 [inaudible-student] 276 00:14:57,000 --> 00:15:00,000 Good, so apparently a common mistake, so let's put this up here, 277 00:15:00,000 --> 00:15:04,000 cs50.h, and now let's go back to my terminal window. 278 00:15:04,000 --> 00:15:08,000 >> I'll clear the screen, and I'll rerun make sigma. 279 00:15:08,000 --> 00:15:11,000 It seems to have compiled. Let me now run sigma. 280 00:15:11,000 --> 00:15:15,000 I'll type in the number 3, and I did get 6, so not a rigorous check, 281 00:15:15,000 --> 00:15:18,000 but at least it seems to be working at first glance, but now let's rip it apart, 282 00:15:18,000 --> 00:15:21,000 and let's actually leverage the idea of recursion, again, 283 00:15:21,000 --> 00:15:24,000 in a very simple context so that in a few weeks' time 284 00:15:24,000 --> 00:15:27,000 when we start exploring fancier data structures than arrays 285 00:15:27,000 --> 00:15:30,000 we have another tool in the toolkit with which to 286 00:15:30,000 --> 00:15:33,000 manipulate those data structures as we'll see. 287 00:15:33,000 --> 00:15:36,000 This is the iterative approach, the loop-based approach. 288 00:15:36,000 --> 00:15:39,000 >> Let me instead now do this. 289 00:15:39,000 --> 00:15:44,000 Let me instead say that the summation of number 290 00:15:44,000 --> 00:15:48,000 on down to 0 is really the same thing as 291 00:15:48,000 --> 00:15:53,000 number + sigma(number - 1). 292 00:15:53,000 --> 00:15:57,000 In other words, just like on stage I punted to each of the people next to me, 293 00:15:57,000 --> 00:16:00,000 and they in turn kept punting until we finally bottomed out at Willy, 294 00:16:00,000 --> 00:16:03,000 who had to return a hard-coded answer like 0. 295 00:16:03,000 --> 00:16:07,000 Here now we're similarly punting to sigma 296 00:16:07,000 --> 00:16:10,000 the same function as was originally called, but the key insight here 297 00:16:10,000 --> 00:16:12,000 is that we're not calling sigma identically. 298 00:16:12,000 --> 00:16:14,000 We're not passing in n. 299 00:16:14,000 --> 00:16:17,000 We're clearly passing in number - 1, 300 00:16:17,000 --> 00:16:20,000 so a slightly smaller problem, slightly smaller problem. 301 00:16:20,000 --> 00:16:23,000 >> Unfortunately, this isn't quite a solution yet, and before we fix 302 00:16:23,000 --> 00:16:26,000 what might be jumping out as obvious at some of you 303 00:16:26,000 --> 00:16:28,000 let me go ahead and rerun make. 304 00:16:28,000 --> 00:16:30,000 It seems to compile okay. 305 00:16:30,000 --> 00:16:32,000 Let me rerun sigma with 6. 306 00:16:32,000 --> 00:16:37,000 Whoops, let me rerun sigma with 6. 307 00:16:37,000 --> 00:16:42,000 We've seen this before, albeit accidentally last time as well. 308 00:16:42,000 --> 00:16:48,000 Why did I get this cryptic segmentation fault? Yeah. 309 00:16:48,000 --> 00:16:50,000 [inaudible-student] 310 00:16:50,000 --> 00:16:53,000 There's no base case, and more specifically, what probably happened? 311 00:16:53,000 --> 00:16:58,000 This is a symptom of what behavior? 312 00:16:58,000 --> 00:17:00,000 Say it a little louder. 313 00:17:00,000 --> 00:17:02,000 [inaudible-student] 314 00:17:02,000 --> 00:17:05,000 It's an infinite loop effectively, and the problem with infinite loops 315 00:17:05,000 --> 00:17:08,000 when they involve recursion in this case, a function calling itself, 316 00:17:08,000 --> 00:17:10,000 what happens every time you call a function? 317 00:17:10,000 --> 00:17:13,000 Well, think back to how we laid out the memory in a computer. 318 00:17:13,000 --> 00:17:16,000 We said that there's this chunk of memory called the stack that's at the bottom, 319 00:17:16,000 --> 00:17:19,000 and every time you call a function a little more memory gets put 320 00:17:19,000 --> 00:17:24,000 on this so-called stack containing that function's local variables or parameters, 321 00:17:24,000 --> 00:17:27,000 so if sigma calls sigma calls sigma calls sigma 322 00:17:27,000 --> 00:17:29,000 calls sigma where does this story end? 323 00:17:29,000 --> 00:17:31,000 >> Well, it eventually overruns the total amount 324 00:17:31,000 --> 00:17:33,000 of memory that you have available to your computer. 325 00:17:33,000 --> 00:17:37,000 You overrun the segment that you're supposed to stay within, 326 00:17:37,000 --> 00:17:40,000 and you get this segmentation fault, core dumped, 327 00:17:40,000 --> 00:17:43,000 and what core dumped means is that I now have a file called core 328 00:17:43,000 --> 00:17:46,000 which is a file containing zeros and ones 329 00:17:46,000 --> 00:17:49,000 that actually in the future will be diagnostically useful. 330 00:17:49,000 --> 00:17:52,000 If it's not obvious to you where your bug is 331 00:17:52,000 --> 00:17:54,000 you can actually do a bit of forensic analysis, so to speak, 332 00:17:54,000 --> 00:17:58,000 on this core dump file, which, again, is just a whole bunch of zeros and ones 333 00:17:58,000 --> 00:18:02,000 that essentially represents the state of your program in memory 334 00:18:02,000 --> 00:18:05,000 the moment it crashed in this way. 335 00:18:05,000 --> 00:18:11,000 >> The fix here is that we can't just blindly return sigma, 336 00:18:11,000 --> 00:18:14,000 the number + sigma of a slightly smaller problem. 337 00:18:14,000 --> 00:18:16,000 We need to have some kind of base case here, 338 00:18:16,000 --> 00:18:19,000 and what should the base case probably be? 339 00:18:19,000 --> 00:18:22,000 [inaudible-student] 340 00:18:22,000 --> 00:18:25,000 Okay, so long as the number is positive we should actually return this, 341 00:18:25,000 --> 00:18:29,000 or put another way, if number is, say, <= to 0 342 00:18:29,000 --> 00:18:32,000 you know what, I'll go ahead and return 0, 343 00:18:32,000 --> 00:18:36,000 much like Willy did, and else, I'm going to go ahead 344 00:18:36,000 --> 00:18:41,000 and return this, so it's not that much shorter 345 00:18:41,000 --> 00:18:44,000 than the iterative version that we whipped up first using a for loop, 346 00:18:44,000 --> 00:18:48,000 but notice that there's this sort of elegance to it. 347 00:18:48,000 --> 00:18:51,000 Instead of returning some number and performing all this math 348 00:18:51,000 --> 00:18:54,000 and adding things up with local variables 349 00:18:54,000 --> 00:18:57,000 you're instead saying "Okay, if this is a super easy problem, 350 00:18:57,000 --> 00:19:01,000 like the number is < 0, let me immediately return 0." 351 00:19:01,000 --> 00:19:03,000 >> We're not going to bother supporting negative numbers, 352 00:19:03,000 --> 00:19:05,000 so I'm going to hard code the value of 0. 353 00:19:05,000 --> 00:19:08,000 But otherwise, to implement this idea of summing 354 00:19:08,000 --> 00:19:11,000 all of these numbers together you can effectively take a small bite 355 00:19:11,000 --> 00:19:14,000 out of the problem, much like we did here on stage, 356 00:19:14,000 --> 00:19:18,000 then punt the rest of the problem to the next person, 357 00:19:18,000 --> 00:19:20,000 but in this case the next person is yourself. 358 00:19:20,000 --> 00:19:22,000 It's an identically named function. 359 00:19:22,000 --> 00:19:25,000 Just pass it a smaller and smaller and smaller problem each time, 360 00:19:25,000 --> 00:19:28,000 and even though we haven't quite formalized things in code here 361 00:19:28,000 --> 00:19:33,000 this is exactly what was going on in week 0 with the phone book. 362 00:19:33,000 --> 00:19:36,000 This is exactly what was going on in past weeks with Sean 363 00:19:36,000 --> 00:19:39,000 and with our demonstrations of searching for numbers. 364 00:19:39,000 --> 00:19:42,000 It's taking a problem and dividing it again and again. 365 00:19:42,000 --> 00:19:44,000 >> In other words, there's a way now of translating 366 00:19:44,000 --> 00:19:47,000 this real world construct, this higher level construct 367 00:19:47,000 --> 00:19:51,000 of divide and conquer and doing something again and again 368 00:19:51,000 --> 00:19:56,000 in code, so this is something we will see again over time. 369 00:19:56,000 --> 00:20:00,000 Now, as an aside, if you're new to recursion you should at least understand now 370 00:20:00,000 --> 00:20:02,000 why this is funny. 371 00:20:02,000 --> 00:20:05,000 I'm going to go to google.com, 372 00:20:05,000 --> 00:20:17,000 and I'm going to search for some tips and tricks on recursion, enter. 373 00:20:17,000 --> 00:20:21,000 Tell the person next to you if they weren't laughing just now. 374 00:20:21,000 --> 00:20:23,000 Did you mean recursion? 375 00:20:23,000 --> 00:20:25,000 Did you mean—ah, there we go. 376 00:20:25,000 --> 00:20:28,000 Okay, now that's the rest of everyone. 377 00:20:28,000 --> 00:20:30,000 A little Easter egg embedded somewhere there in Google. 378 00:20:30,000 --> 00:20:33,000 As an aside, one of the links we put on the course's website 379 00:20:33,000 --> 00:20:36,000 for today is just this grid of various sorting algorithms, 380 00:20:36,000 --> 00:20:39,000 some of which we looked at last week, but what's nice about this visualization 381 00:20:39,000 --> 00:20:43,000 as you try to wrap your mind around various things related to algorithms 382 00:20:43,000 --> 00:20:46,000 know that you can very easily now start with different types of inputs. 383 00:20:46,000 --> 00:20:50,000 The inputs all reversed, the inputs mostly sorted, the inputs random and so forth. 384 00:20:50,000 --> 00:20:53,000 As you try to, again, distinguish these things in your mind 385 00:20:53,000 --> 00:20:57,000 realize that this URL on the course's website on the Lectures page 386 00:20:57,000 --> 00:21:00,000 might help you reason through some of those. 387 00:21:00,000 --> 00:21:05,000 >> Today we finally get to solve this problem from a while back, 388 00:21:05,000 --> 00:21:08,000 which was that this swap function just didn't work, 389 00:21:08,000 --> 00:21:12,000 and what was the fundamental problem with this function swap, 390 00:21:12,000 --> 00:21:15,000 the goal of which was, again, to exchange a value here and here 391 00:21:15,000 --> 00:21:17,000 such that this happens? 392 00:21:17,000 --> 00:21:20,000 This didn't actually work. Why? 393 00:21:20,000 --> 00:21:22,000 Yeah. 394 00:21:22,000 --> 00:21:28,000 [inaudible-student] 395 00:21:28,000 --> 00:21:31,000 Exactly, the explanation for this bugginess 396 00:21:31,000 --> 00:21:34,000 simply was because when you call functions in C 397 00:21:34,000 --> 00:21:38,000 and those functions take arguments, like a and b here, 398 00:21:38,000 --> 00:21:42,000 you are passing in copies of whatever value you're providing to that function. 399 00:21:42,000 --> 00:21:46,000 You are not providing the original values themselves, 400 00:21:46,000 --> 00:21:49,000 so we saw this in the context of buggyc, 401 00:21:49,000 --> 00:21:52,000 buggy3.c, which looked a little something like this. 402 00:21:52,000 --> 00:21:57,000 >> Recall that we had x and y initialized to 1 and 2, respectively. 403 00:21:57,000 --> 00:21:59,000 We then printed out what they were. 404 00:21:59,000 --> 00:22:03,000 I then claimed that I was swapping them by calling swap of x, y. 405 00:22:03,000 --> 00:22:06,000 But the problem was that the swapping worked, 406 00:22:06,000 --> 00:22:10,000 but only in the scope of the swap function itself. 407 00:22:10,000 --> 00:22:13,000 As soon as we hit line 40 those swapped values 408 00:22:13,000 --> 00:22:16,000 were thrown away, and so nothing 409 00:22:16,000 --> 00:22:21,000 in the original function main was actually changed at all, 410 00:22:21,000 --> 00:22:26,000 so if you think back then as to what this looks like in terms of our memory 411 00:22:26,000 --> 00:22:29,000 if this left-hand side of the board represents— 412 00:22:29,000 --> 00:22:33,000 and I'll do my best for everyone to see this—if this left-hand side of the board 413 00:22:33,000 --> 00:22:37,000 represents, say, your RAM, and the stack is going to grow on up this way, 414 00:22:37,000 --> 00:22:43,000 and we call a function like main, and main has 2 local variables, x and y, 415 00:22:43,000 --> 00:22:48,000 let's describe those as x here, and let's describe these as y here, 416 00:22:48,000 --> 00:22:55,000 and let's put in the values 1 and 2, so this here is main, 417 00:22:55,000 --> 00:22:58,000 and when main calls the swap function the operating system 418 00:22:58,000 --> 00:23:02,000 gives the swap function its own swath of memory on the stack, 419 00:23:02,000 --> 00:23:04,000 its own frame on the stack, so to speak. 420 00:23:04,000 --> 00:23:08,000 It also allocates 32 bits for these ints. 421 00:23:08,000 --> 00:23:11,000 It happens to call them a and b, but that's totally arbitrary. 422 00:23:11,000 --> 00:23:13,000 It could have called them whatever it wants, but what happens when main 423 00:23:13,000 --> 00:23:19,000 calls swap is it takes this 1, puts a copy there, puts a copy there. 424 00:23:19,000 --> 00:23:23,000 >> There is 1 other local variable in swap, though, called what?>>Tmp. 425 00:23:23,000 --> 00:23:27,000 Tmp, so let me give myself another 32 bits here, 426 00:23:27,000 --> 00:23:29,000 and what did I do in this function? 427 00:23:29,000 --> 00:23:34,000 I said int tmp gets a, so a has 1, so I did this when we last played with this example. 428 00:23:34,000 --> 00:23:39,000 Then a gets b, so b is 2, so now this becomes 2, 429 00:23:39,000 --> 00:23:42,000 and now b gets temp, so temp is 1, 430 00:23:42,000 --> 00:23:44,000 so now b becomes this. 431 00:23:44,000 --> 00:23:46,000 That's great. It worked. 432 00:23:46,000 --> 00:23:49,000 But then as soon as the function returns 433 00:23:49,000 --> 00:23:52,000 swap's memory effectively disappears so that it can be reused 434 00:23:52,000 --> 00:23:58,000 by some other function in the future, and main is obviously completely unchanged. 435 00:23:58,000 --> 00:24:00,000 We need a way of fundamentally solving this problem, 436 00:24:00,000 --> 00:24:03,000 and today we'll finally have a way of doing this whereby 437 00:24:03,000 --> 00:24:06,000 we can introduce something called a pointer. 438 00:24:06,000 --> 00:24:09,000 It turns out that we can solve this problem 439 00:24:09,000 --> 00:24:12,000 not by passing in copies of x and y 440 00:24:12,000 --> 00:24:18,000 but instead by passing in what, do you think, to the swap function? 441 00:24:18,000 --> 00:24:20,000 Yeah, what about the address? 442 00:24:20,000 --> 00:24:22,000 We haven't really talked about addresses in much detail, 443 00:24:22,000 --> 00:24:25,000 but if this blackboard represents my computer's memory 444 00:24:25,000 --> 00:24:28,000 we could certainly start numbering the bytes in my RAM 445 00:24:28,000 --> 00:24:31,000 and say this is byte #1, this is byte #2, byte #3, 446 00:24:31,000 --> 00:24:35,000 byte #4, byte #...2 billion if I have 2 gigabytes of RAM, 447 00:24:35,000 --> 00:24:38,000 so we could certainly come up with some arbitrary numbering scheme 448 00:24:38,000 --> 00:24:41,000 for all the individual bytes in my computer's memory. 449 00:24:41,000 --> 00:24:43,000 >> What if instead when I call swap 450 00:24:43,000 --> 00:24:47,000 rather than pass in copies of x and y 451 00:24:47,000 --> 00:24:51,000 why don't I instead pass in the address of x here, 452 00:24:51,000 --> 00:24:55,000 the address of y here, essentially the postal address 453 00:24:55,000 --> 00:24:59,000 of x and y because then swap, if he's informed 454 00:24:59,000 --> 00:25:01,000 of the address in memory of x and y, 455 00:25:01,000 --> 00:25:04,000 then swap, if we trained him a little bit, 456 00:25:04,000 --> 00:25:07,000 he could potentially drive to that address, so to speak, 457 00:25:07,000 --> 00:25:11,000 x, and change the number there, then drive to the address of y, 458 00:25:11,000 --> 00:25:16,000 change the number there, even while not actually getting copies of those values himself, 459 00:25:16,000 --> 00:25:19,000 so even though we talked about this as being main's memory 460 00:25:19,000 --> 00:25:23,000 and this as being swap's memory the powerful and the dangerous part of C 461 00:25:23,000 --> 00:25:28,000 is that any function can touch memory anywhere in the computer, 462 00:25:28,000 --> 00:25:32,000 and this is powerful in that you can do very fancy things with computer programs in C. 463 00:25:32,000 --> 00:25:36,000 This is dangerous because you can also screw up very easily. 464 00:25:36,000 --> 00:25:39,000 In fact, one of the most common ways for programs these days to be exploited 465 00:25:39,000 --> 00:25:42,000 still is for a programmer not to realize 466 00:25:42,000 --> 00:25:45,000 that he or she is allowing a data 467 00:25:45,000 --> 00:25:49,000 to be written in a location in memory that wasn't intended. 468 00:25:49,000 --> 00:25:51,000 >> For instance, he or she declares an array of size 10 469 00:25:51,000 --> 00:25:56,000 but then accidentally tries to put 11 bytes into that array of memory, 470 00:25:56,000 --> 00:25:59,000 and you start touching parts of memory that are no longer valid. 471 00:25:59,000 --> 00:26:02,000 Just to contextual this, some of you might know that 472 00:26:02,000 --> 00:26:06,000 software often prompts you for serial numbers or registration keys, 473 00:26:06,000 --> 00:26:08,000 Photoshop and Word and programs like this. 474 00:26:08,000 --> 00:26:12,000 There exist cracks, as some of you know, online where you can run a little program, 475 00:26:12,000 --> 00:26:14,000 and voila, no more request for a serial number. 476 00:26:14,000 --> 00:26:16,000 How is that working? 477 00:26:16,000 --> 00:26:21,000 In many cases these things are simply finding in the computers 478 00:26:21,000 --> 00:26:24,000 text segments in the computer's actual zeros and ones 479 00:26:24,000 --> 00:26:28,000 where is that function where the serial number is requested, 480 00:26:28,000 --> 00:26:31,000 and you overwrite that space, or while the program is running 481 00:26:31,000 --> 00:26:33,000 you can figure out where the key is actually stored 482 00:26:33,000 --> 00:26:37,000 using something called a debugger, and you can crack software that way. 483 00:26:37,000 --> 00:26:40,000 This isn't to say that this is our objective for the next couple of days, 484 00:26:40,000 --> 00:26:42,000 but it has very real-world ramifications. 485 00:26:42,000 --> 00:26:45,000 That one happens to involve theft of software, 486 00:26:45,000 --> 00:26:47,000 but there's also compromise of entire machines. 487 00:26:47,000 --> 00:26:50,000 >> In fact, when websites these days are exploited 488 00:26:50,000 --> 00:26:53,000 and compromised and data is leaked and passwords are stolen 489 00:26:53,000 --> 00:26:58,000 this very often relates to poor management of one's memory, 490 00:26:58,000 --> 00:27:01,000 or, in the case of databases, failure to anticipate 491 00:27:01,000 --> 00:27:03,000 adversarial input, so more on that in the weeks to come, 492 00:27:03,000 --> 00:27:07,000 but for now just a sneak preview of the sort of damage that you can do 493 00:27:07,000 --> 00:27:11,000 by not quite understanding how things work underneath the hood. 494 00:27:11,000 --> 00:27:14,000 Let's go about understanding why this is broken 495 00:27:14,000 --> 00:27:17,000 with a tool that will become more and more useful 496 00:27:17,000 --> 00:27:19,000 as our programs get more complex. 497 00:27:19,000 --> 00:27:21,000 Thus far when you've had a bug in your program 498 00:27:21,000 --> 00:27:23,000 how have you gone about debugging it? 499 00:27:23,000 --> 00:27:25,000 What have your techniques been thus far, whether taught by your TF 500 00:27:25,000 --> 00:27:27,000 or just self-taught? 501 00:27:27,000 --> 00:27:29,000 [Student] Printf. 502 00:27:29,000 --> 00:27:31,000 Printf, so printf has probably been your friend in that if you want to see 503 00:27:31,000 --> 00:27:33,000 what's going on inside of your program 504 00:27:33,000 --> 00:27:36,000 you just put printf here, printf here, printf here. 505 00:27:36,000 --> 00:27:38,000 Then you run it, and you get a whole bunch of stuff on the screen 506 00:27:38,000 --> 00:27:43,000 that you can use to then deduce what is actually going wrong in your program. 507 00:27:43,000 --> 00:27:45,000 >> Printf tends to be a very powerful thing, 508 00:27:45,000 --> 00:27:47,000 but it's a very manual process. 509 00:27:47,000 --> 00:27:49,000 You have to put a printf here, a printf here, 510 00:27:49,000 --> 00:27:51,000 and if you put it inside of a loop you might get 100 lines 511 00:27:51,000 --> 00:27:53,000 of output that you then have to sift through. 512 00:27:53,000 --> 00:27:58,000 It's not a very user-friendly or interactive mechanism for debugging programs, 513 00:27:58,000 --> 00:28:00,000 but thankfully there exists alternatives. 514 00:28:00,000 --> 00:28:03,000 There's a program, for instance, called GDB, the GNU Debugger, 515 00:28:03,000 --> 00:28:06,000 which is a little arcane in how you use it. 516 00:28:06,000 --> 00:28:08,000 It's a little complex, but frankly, 517 00:28:08,000 --> 00:28:11,000 this is one of those things where if you put in this week and next 518 00:28:11,000 --> 00:28:14,000 the extra hour to understand something like GDB 519 00:28:14,000 --> 00:28:18,000 it will save you probably tens of hours in the long run, 520 00:28:18,000 --> 00:28:21,000 so with that, let me give you a teaser of how this thing works. 521 00:28:21,000 --> 00:28:23,000 >> I'm in my terminal window. 522 00:28:23,000 --> 00:28:26,000 Let me go ahead and compile this program, buggy3. 523 00:28:26,000 --> 00:28:28,000 It's already up to date. 524 00:28:28,000 --> 00:28:31,000 Let me run it just like we did a while back, and indeed, it's broken. 525 00:28:31,000 --> 00:28:34,000 But why is this? Maybe I screwed up the swap function. 526 00:28:34,000 --> 00:28:37,000 Maybe it's a and b. I'm not quite moving them around correctly. 527 00:28:37,000 --> 00:28:39,000 Let me go ahead and do this. 528 00:28:39,000 --> 00:28:43,000 Rather than just run buggy3 let me instead run this program GDB, 529 00:28:43,000 --> 00:28:48,000 and I'm going to tell it to run buggy3, 530 00:28:48,000 --> 00:28:52,000 and I'm going to include a command line argument, -tui, 531 00:28:52,000 --> 00:28:55,000 and we'll put this in future problems at spec to remind. 532 00:28:55,000 --> 00:28:57,000 And now this black and white interface popped up that, again, 533 00:28:57,000 --> 00:28:59,000 is a little overwhelming at first because there's all this 534 00:28:59,000 --> 00:29:02,000 warranty information down here, but at least there's something familiar. 535 00:29:02,000 --> 00:29:04,000 In the top of the window is my actual code, 536 00:29:04,000 --> 00:29:08,000 and if I scroll up here let me scroll to the very top of my file, 537 00:29:08,000 --> 00:29:11,000 and indeed, there's buggy3.c, and notice at the bottom of this window 538 00:29:11,000 --> 00:29:13,000 I have this GDB prompt. 539 00:29:13,000 --> 00:29:16,000 >> This is not the same as my normal John Harvard prompt. 540 00:29:16,000 --> 00:29:19,000 This is a prompt that's going to allow me to control GDB. 541 00:29:19,000 --> 00:29:21,000 GDB is a debugger. 542 00:29:21,000 --> 00:29:24,000 A debugger is a program that lets you walk through 543 00:29:24,000 --> 00:29:27,000 execution of your program line by line by line, 544 00:29:27,000 --> 00:29:30,000 along the way doing anything you want to the program, 545 00:29:30,000 --> 00:29:33,000 even calling functions, or looking, more importantly, 546 00:29:33,000 --> 00:29:35,000 at various variable's values. 547 00:29:35,000 --> 00:29:37,000 Let's go ahead and do this. 548 00:29:37,000 --> 00:29:40,000 I'm going to go ahead and type in run at GDB's prompt, 549 00:29:40,000 --> 00:29:43,000 so notice at the bottom left of the screen I've typed run, 550 00:29:43,000 --> 00:29:45,000 and I've hit enter, and what did that do? 551 00:29:45,000 --> 00:29:50,000 It literally ran my program, but I didn't actually see much go on here 552 00:29:50,000 --> 00:29:55,000 because I have not actually told the debugger 553 00:29:55,000 --> 00:29:57,000 to pause at a particular moment in time. 554 00:29:57,000 --> 00:29:59,000 Just typing run runs the program. 555 00:29:59,000 --> 00:30:01,000 I don't actually see anything. I can't manipulate it. 556 00:30:01,000 --> 00:30:03,000 >> Instead let me do this. 557 00:30:03,000 --> 00:30:08,000 At this GDB prompt let me instead type break, enter. 558 00:30:08,000 --> 00:30:10,000 That's not what I meant to type. 559 00:30:10,000 --> 00:30:13,000 Let's instead type break main. 560 00:30:13,000 --> 00:30:15,000 In other words, I want to set something called a breakpoint, 561 00:30:15,000 --> 00:30:18,000 which is aptly named because it will break or pause 562 00:30:18,000 --> 00:30:21,000 execution of your program at that particular place. 563 00:30:21,000 --> 00:30:23,000 Main is the name of my function. 564 00:30:23,000 --> 00:30:25,000 Notice that GDB is pretty smart. 565 00:30:25,000 --> 00:30:28,000 It figured out that main happens to start roughly at line 18 566 00:30:28,000 --> 00:30:32,000 of buggy3.c, and then notice here at top left 567 00:30:32,000 --> 00:30:34,000 b+ is right next to line 18. 568 00:30:34,000 --> 00:30:38,000 That's reminding me that I have set a breakpoint at line 18. 569 00:30:38,000 --> 00:30:42,000 This time when I type run, I'm going to run my program 570 00:30:42,000 --> 00:30:45,000 up until it hits that breakpoint, 571 00:30:45,000 --> 00:30:48,000 so the program will pause for me at line 18. 572 00:30:48,000 --> 00:30:50,000 Here we go, run. 573 00:30:50,000 --> 00:30:53,000 Nothing appears to have happened, but notice at bottom left 574 00:30:53,000 --> 00:30:58,000 starting program, buggy3, breakpoint 1 in main at buggy3.c line 18. 575 00:30:58,000 --> 00:31:00,000 What can I do now? 576 00:31:00,000 --> 00:31:03,000 >> Notice I can start typing things like print, 577 00:31:03,000 --> 00:31:08,000 not printf, print x, and now that's strange. 578 00:31:08,000 --> 00:31:11,000 The $1 is just a curiosity, as we'll see 579 00:31:11,000 --> 00:31:14,000 every time you print something you get a new $ value. 580 00:31:14,000 --> 00:31:18,000 That's so that you can refer back to previous values just in case, 581 00:31:18,000 --> 00:31:21,000 but for now what print is telling me is that the value of x at this point in the story 582 00:31:21,000 --> 00:31:26,000 is apparently 134514032. 583 00:31:26,000 --> 00:31:29,000 What? Where did that even come from? 584 00:31:29,000 --> 00:31:31,000 [inaudible-student] 585 00:31:31,000 --> 00:31:34,000 Indeed, this is what we'll call a garbage value, and we've not talked about this yet, 586 00:31:34,000 --> 00:31:37,000 but the reason that you initialize variables 587 00:31:37,000 --> 00:31:40,000 is obviously so that they have some value that you want them to have. 588 00:31:40,000 --> 00:31:44,000 But the catch is recall that you can declare variables 589 00:31:44,000 --> 00:31:46,000 like I did a moment ago in my sigma example 590 00:31:46,000 --> 00:31:48,000 without actually giving them a value. 591 00:31:48,000 --> 00:31:50,000 Recall what I did over here in sigma. 592 00:31:50,000 --> 00:31:52,000 I declared n, but what value did I give it? 593 00:31:52,000 --> 00:31:56,000 None, because I knew that in the next few lines 594 00:31:56,000 --> 00:31:59,000 GetInt would take care of the problem of putting a value inside of n. 595 00:31:59,000 --> 00:32:02,000 >> But at this point in the story of line 11 596 00:32:02,000 --> 00:32:05,000 and line 12 and line 13 and line 14 597 00:32:05,000 --> 00:32:08,000 throughout those several lines what is the value of n? 598 00:32:08,000 --> 00:32:10,000 In C you just don't know. 599 00:32:10,000 --> 00:32:14,000 It's generally some garbage value, some completely random number 600 00:32:14,000 --> 00:32:17,000 that's left over essentially from some previous function 601 00:32:17,000 --> 00:32:21,000 having been run, so as your program runs 602 00:32:21,000 --> 00:32:24,000 recall that function gets function, function, function. 603 00:32:24,000 --> 00:32:27,000 All these frames get put on memory, and then those functions return, 604 00:32:27,000 --> 00:32:31,000 and just like I suggested with the eraser their memory is eventually reused. 605 00:32:31,000 --> 00:32:37,000 Well, it just so happens that this variable x in this program 606 00:32:37,000 --> 00:32:41,000 seems to have contained some garbage value like 134514032 607 00:32:41,000 --> 00:32:44,000 from some previous function, not one that I wrote. 608 00:32:44,000 --> 00:32:47,000 It could be something that comes effectively with the operating system, 609 00:32:47,000 --> 00:32:49,000 some function underneath the hood. 610 00:32:49,000 --> 00:32:52,000 >> Okay, that's fine, but let's now advance to the next line. 611 00:32:52,000 --> 00:32:55,000 If I type "next" at my GDB prompt and I hit enter, 612 00:32:55,000 --> 00:32:58,000 notice that the highlighting moves down to line 19, 613 00:32:58,000 --> 00:33:01,000 but the logical implication is that line 18 614 00:33:01,000 --> 00:33:06,000 has now finished executing, so if I again type "print x" 615 00:33:06,000 --> 00:33:10,000 I should now see 1, and indeed, I do. 616 00:33:10,000 --> 00:33:14,000 Again, the $ stuff is a way of GDB reminding you 617 00:33:14,000 --> 00:33:17,000 what the history of prints are that you've done. 618 00:33:17,000 --> 00:33:21,000 Now let me go ahead and print out y, and indeed, y is some crazy value as well, 619 00:33:21,000 --> 00:33:24,000 but no big deal because in line 19 we're about to assign it 620 00:33:24,000 --> 00:33:27,000 the value 2, so let me type "next" again. 621 00:33:27,000 --> 00:33:29,000 And now we're on the printf line. 622 00:33:29,000 --> 00:33:31,000 Let me do print x. 623 00:33:31,000 --> 00:33:34,000 Let me do print y. Frankly, I'm getting a little tired of printing this. 624 00:33:34,000 --> 00:33:38,000 Let me instead type "display x" and "display y," 625 00:33:38,000 --> 00:33:41,000 and now every time I type a command in the future 626 00:33:41,000 --> 00:33:45,000 I will be reminded of what's x and y, what's x and y, what's x and y. 627 00:33:45,000 --> 00:33:48,000 >> I can also, as an aside, type in "info locals." 628 00:33:48,000 --> 00:33:50,000 Info is a special command. 629 00:33:50,000 --> 00:33:52,000 Locals means it shows me the local variables. 630 00:33:52,000 --> 00:33:55,000 Just in case I forget or this is a crazy, complicated function 631 00:33:55,000 --> 00:33:57,000 that I or someone else wrote info locals will tell you 632 00:33:57,000 --> 00:34:00,000 what are all the local variables inside this local function 633 00:34:00,000 --> 00:34:03,000 that you might care about if you want to poke around. 634 00:34:03,000 --> 00:34:07,000 Now, printf is about to execute, so let me go ahead and just type "next." 635 00:34:07,000 --> 00:34:10,000 Because we're in this environment we're not actually seeing it 636 00:34:10,000 --> 00:34:14,000 execute down here, but notice it's getting a little mangled here. 637 00:34:14,000 --> 00:34:17,000 But notice it's overriding the screen there, 638 00:34:17,000 --> 00:34:21,000 so it's not a perfect program here, but that's okay because I can always poke around 639 00:34:21,000 --> 00:34:23,000 using print if I want. 640 00:34:23,000 --> 00:34:26,000 >> Let me type next again, and now here's the interesting part. 641 00:34:26,000 --> 00:34:29,000 At this point in the story y is 2, and x is 1, 642 00:34:29,000 --> 00:34:32,000 as suggested here, and again, 643 00:34:32,000 --> 00:34:35,000 the reason this is automatically displaying now is because I used the command 644 00:34:35,000 --> 00:34:40,000 display x and display y, so the moment I type next 645 00:34:40,000 --> 00:34:43,000 in theory x and y should become swapped. 646 00:34:43,000 --> 00:34:45,000 Now, we already know that's not going to be the case, 647 00:34:45,000 --> 00:34:49,000 but we'll see in a moment how we can dive deeper to figure out why that's true. 648 00:34:49,000 --> 00:34:54,000 Next, and unfortunately, y is still 2 and x is still 1, and I can confirm as much. 649 00:34:54,000 --> 00:34:56,000 Print x, print y. 650 00:34:56,000 --> 00:34:59,000 Indeed, no swapping has actually happened, so let's start this over. 651 00:34:59,000 --> 00:35:01,000 Clearly swap is broken. 652 00:35:01,000 --> 00:35:04,000 Let's instead type "run" again. 653 00:35:04,000 --> 00:35:07,000 Let me say yes, I want to restart it from the beginning, enter. 654 00:35:07,000 --> 00:35:09,000 >> Now I'm back up at line 18. 655 00:35:09,000 --> 00:35:11,000 Now notice x and y are garbage values again. 656 00:35:11,000 --> 00:35:15,000 Next, next, next, next. 657 00:35:15,000 --> 00:35:17,000 If I get bored I can also just type n for next. 658 00:35:17,000 --> 00:35:21,000 You can abbreviate it to the shortest possible sequence of characters. 659 00:35:21,000 --> 00:35:23,000 Swap is now broken. 660 00:35:23,000 --> 00:35:25,000 Let's dive in, so instead of typing next, 661 00:35:25,000 --> 00:35:30,000 now I'm going to type step so that I'm stepping inside of this function 662 00:35:30,000 --> 00:35:33,000 so that I can walk through it, so I hit step and then enter. 663 00:35:33,000 --> 00:35:37,000 Notice that the highlighting jumps down lower in my program to line 36. 664 00:35:37,000 --> 00:35:39,000 Now what are the local variables? 665 00:35:39,000 --> 00:35:41,000 Info locals. 666 00:35:41,000 --> 00:35:43,000 Nothing just yet because we've not gotten to that line, 667 00:35:43,000 --> 00:35:47,000 so let's go ahead and say "next." 668 00:35:47,000 --> 00:35:50,000 Now we seem to have tmp, print tmp. 669 00:35:50,000 --> 00:35:52,000 Garbage value, right? I think so. 670 00:35:52,000 --> 00:35:55,000 How about print a, print b, 1 and 2? 671 00:35:55,000 --> 00:35:58,000 In a moment, as soon as I type next again 672 00:35:58,000 --> 00:36:02,000 tmp is going to take on a value of 1, hopefully, 673 00:36:02,000 --> 00:36:05,000 because tmp is going to be assigned the value of a. 674 00:36:05,000 --> 00:36:08,000 >> Now let's do print a, print b, 675 00:36:08,000 --> 00:36:11,000 but now print tmp, and it's indeed 1. 676 00:36:11,000 --> 00:36:14,000 Let me do next. Let me do next. 677 00:36:14,000 --> 00:36:16,000 I've finished the swap function. 678 00:36:16,000 --> 00:36:19,000 I'm still inside of it in line 40, so let me print a, 679 00:36:19,000 --> 00:36:22,000 print b, and I don't care what tmp is. 680 00:36:22,000 --> 00:36:27,000 It looks like swap is correct when it comes to swapping a and b. 681 00:36:27,000 --> 00:36:31,000 But if I now type next, I jump back to line 25, 682 00:36:31,000 --> 00:36:34,000 and of course, if I type in x and print y 683 00:36:34,000 --> 00:36:38,000 they're still unchanged, so we haven't fixed the problem. 684 00:36:38,000 --> 00:36:41,000 But diagnostically now perhaps with this GDB program 685 00:36:41,000 --> 00:36:44,000 we've at least gotten one step closer to understanding 686 00:36:44,000 --> 00:36:47,000 what's going wrong without having to litter our code by putting a printf here, 687 00:36:47,000 --> 00:36:50,000 printf here, printf here and then running it again and again 688 00:36:50,000 --> 00:36:52,000 trying to figure out what's going wrong. 689 00:36:52,000 --> 00:36:55,000 >> I'm going to go ahead and quit out of this altogether with quit. 690 00:36:55,000 --> 00:36:57,000 It's going to then say, "Quit anyway?" Yes. 691 00:36:57,000 --> 00:37:00,000 Now I'm back at my normal prompt, and I'm done using GDB. 692 00:37:00,000 --> 00:37:03,000 As an aside, you don't need to use this -tui flag. 693 00:37:03,000 --> 00:37:07,000 In fact, if you omit it you get essentially the bottom half of the screen. 694 00:37:07,000 --> 00:37:11,000 If I then type break main and then run 695 00:37:11,000 --> 00:37:15,000 I can still run my program, but what it will do is more textually 696 00:37:15,000 --> 00:37:18,000 just show me the current line one at a time. 697 00:37:18,000 --> 00:37:21,000 The -tui, textual user interface, 698 00:37:21,000 --> 00:37:25,000 just shows you more of the program at once, which is probably a bit conceptually easier. 699 00:37:25,000 --> 00:37:27,000 But indeed, I can just do next, next, next, 700 00:37:27,000 --> 00:37:30,000 and I'm going to see one line at a time, and if I really want to see what's going on 701 00:37:30,000 --> 00:37:35,000 I can type list and see a whole bunch of neighboring lines. 702 00:37:35,000 --> 00:37:39,000 >> There's a video that we've asked that you watch for problem sets 3 703 00:37:39,000 --> 00:37:43,000 in which Nate covers some of the intricacies of GDB, 704 00:37:43,000 --> 00:37:46,000 and this is one of those things, honestly, where some non-trivial percentage of you 705 00:37:46,000 --> 00:37:49,000 will never touch GDB, and that will be a bad thing 706 00:37:49,000 --> 00:37:53,000 because literally you will end up spending more time later this semester 707 00:37:53,000 --> 00:37:56,000 chasing down bugs then you would if you put in that half hour/hour 708 00:37:56,000 --> 00:38:00,000 this week and next learning to get comfortable with GDB. 709 00:38:00,000 --> 00:38:02,000 Printf was your friend. 710 00:38:02,000 --> 00:38:05,000 GDB should now be your friend. 711 00:38:05,000 --> 00:38:08,000 >> Any questions on GDB? 712 00:38:08,000 --> 00:38:12,000 And here's a quick list of some of the most powerful and useful commands. 713 00:38:12,000 --> 00:38:15,000 Yeah.>>Can you print a string? 714 00:38:15,000 --> 00:38:17,000 Can you print a string? Absolutely. 715 00:38:17,000 --> 00:38:19,000 It doesn't have to just be integers. 716 00:38:19,000 --> 00:38:22,000 If a variable s is a string just type in print s. 717 00:38:22,000 --> 00:38:24,000 It will show you what that string variable is. 718 00:38:24,000 --> 00:38:26,000 [inaudible-student] 719 00:38:26,000 --> 00:38:28,000 It will give you the address and the string itself. 720 00:38:28,000 --> 00:38:32,000 It will show you both. 721 00:38:32,000 --> 00:38:34,000 And one last thing, just because these are good to know too. 722 00:38:34,000 --> 00:38:37,000 Backtrace and frame, let me dive into this one last time, 723 00:38:37,000 --> 00:38:39,000 same exact program with GDB. 724 00:38:39,000 --> 00:38:44,000 Let me go ahead and run the textual user interface version, 725 00:38:44,000 --> 00:38:46,000 break main. 726 00:38:46,000 --> 00:38:49,000 Let me go ahead and run again. Here I am. 727 00:38:49,000 --> 00:38:55,000 Now let me go next, next, next, next, next, step, enter. 728 00:38:55,000 --> 00:39:00,000 >> And now suppose I'm now in swap deliberately, but I'm like "Damn, what was the value of x?" 729 00:39:00,000 --> 00:39:02,000 I can't do x anymore. 730 00:39:02,000 --> 00:39:05,000 I can't do y because they're not in scope. 731 00:39:05,000 --> 00:39:07,000 They're not in context, but no problem. 732 00:39:07,000 --> 00:39:09,000 I can type backtrace. 733 00:39:09,000 --> 00:39:13,000 That shows me all of the functions that have executed up to this point in time. 734 00:39:13,000 --> 00:39:16,000 Notice that the one on the bottom, main, lines up with main 735 00:39:16,000 --> 00:39:18,000 being on the bottom of our picture here. 736 00:39:18,000 --> 00:39:22,000 The fact that swap is above it lines up with swap being above it in memory here, 737 00:39:22,000 --> 00:39:26,000 and if I want to get back to main temporarily I can say "frame." 738 00:39:26,000 --> 00:39:30,000 What number? Main is frame #1. 739 00:39:30,000 --> 00:39:32,000 I'm going to go ahead and say "frame 1." 740 00:39:32,000 --> 00:39:36,000 >> Now I'm back in main, and I can print x, and I can print y, 741 00:39:36,000 --> 00:39:40,000 but I can't print a or b. 742 00:39:40,000 --> 00:39:43,000 But I can if I say, "Okay, wait a minute. Where was the swap?" 743 00:39:43,000 --> 00:39:46,000 Let me go ahead and say "frame 0." 744 00:39:46,000 --> 00:39:48,000 Now I'm back where I want to be, and as an aside, 745 00:39:48,000 --> 00:39:52,000 there's other commands too, like if you're really getting bored typing next, next, next, next, 746 00:39:52,000 --> 00:39:56,000 you can generally say things like "next 10," and that will step through the next 10 lines. 747 00:39:56,000 --> 00:39:59,000 You can also write "continue" when you really get fed up with stepping through it. 748 00:39:59,000 --> 00:40:05,000 Continue will run your program without interruption until it hits another breakpoint, 749 00:40:05,000 --> 00:40:07,000 whether in a loop or lower down in your program. 750 00:40:07,000 --> 00:40:11,000 >> In this case we continued to the end, and the program exited normally. 751 00:40:11,000 --> 00:40:13,000 This is a fancy way, inferior process. 752 00:40:13,000 --> 00:40:16,000 Just your program exited normally. 753 00:40:16,000 --> 00:40:24,000 More on that in the video and in debugging sessions to come. 754 00:40:24,000 --> 00:40:26,000 That was a lot. 755 00:40:26,000 --> 00:40:35,000 Let's take our 5-minute break here, and we'll return with structs and files. 756 00:40:35,000 --> 00:40:38,000 >> If you have dived into this week's pset already 757 00:40:38,000 --> 00:40:41,000 you'll know that we use in the distribution code, 758 00:40:41,000 --> 00:40:45,000 the source code that we provide to you as a starting point, some new techniques. 759 00:40:45,000 --> 00:40:50,000 In particular, we introduced this new keyword called struct, for structure, 760 00:40:50,000 --> 00:40:53,000 so that we can create customized variables of sorts. 761 00:40:53,000 --> 00:40:57,000 We also introduced the notion of file I/O, file input and output, 762 00:40:57,000 --> 00:41:00,000 and this is so that we can save the state 763 00:41:00,000 --> 00:41:03,000 of your Scramble board to a file on disc 764 00:41:03,000 --> 00:41:06,000 so that the teaching fellows and I can understand 765 00:41:06,000 --> 00:41:09,000 what's going on inside of your program without having to manually play 766 00:41:09,000 --> 00:41:11,000 dozens of games of Scramble. 767 00:41:11,000 --> 00:41:13,000 We can do this more automatedly. 768 00:41:13,000 --> 00:41:18,000 >> This idea of a struct solves a fairly compelling problem. 769 00:41:18,000 --> 00:41:21,000 Suppose that we want to implement some program 770 00:41:21,000 --> 00:41:25,000 that somehow keeps track of information on students, 771 00:41:25,000 --> 00:41:28,000 and students might have, for instance, an ID, a name 772 00:41:28,000 --> 00:41:31,000 and a house at a place like Harvard, so these are 3 pieces of information 773 00:41:31,000 --> 00:41:34,000 we want to keep around, so let me go ahead and start writing a little program here, 774 00:41:34,000 --> 00:41:38,000 include stdio.h. 775 00:41:38,000 --> 00:41:42,000 Let me do include cs50.h. 776 00:41:42,000 --> 00:41:44,000 And then start my main function. 777 00:41:44,000 --> 00:41:46,000 I won't bother with any command line arguments, 778 00:41:46,000 --> 00:41:49,000 and here I want to have a student, so I'm going to say 779 00:41:49,000 --> 00:41:54,000 a student has a name, so I'm going to say "string name." 780 00:41:54,000 --> 00:41:59,000 Then I'm going to say a student also has an ID, so int id, 781 00:41:59,000 --> 00:42:03,000 and a student has a house, so I'm also going to say "string house." 782 00:42:03,000 --> 00:42:06,000 Then I'll order these a little more cleanly like this. 783 00:42:06,000 --> 00:42:11,000 Okay, now I have 3 variables with which to represent a student, so "a student." 784 00:42:11,000 --> 00:42:15,000 >> And now I want to populate these values, so let me go ahead and say something like 785 00:42:15,000 --> 00:42:18,000 "id = 123." 786 00:42:18,000 --> 00:42:21,000 Name is going to get David. 787 00:42:21,000 --> 00:42:24,000 Let's say house is going to get Mather, 788 00:42:24,000 --> 00:42:31,000 and then I'm going to do something arbitrarily like printf("%s, 789 00:42:31,000 --> 00:42:37,000 whose ID is %d, lives in %s. 790 00:42:37,000 --> 00:42:41,000 And now, what do I want to plug in here, one after the other? 791 00:42:41,000 --> 00:42:47,000 Name, id, house; return 0. 792 00:42:47,000 --> 00:42:50,000 Okay, unless I screwed up somewhere here 793 00:42:50,000 --> 00:42:54,000 I think we have a pretty good program that stores one student. 794 00:42:54,000 --> 00:42:57,000 Of course, this is not all that interesting. What if I want to have 2 students? 795 00:42:57,000 --> 00:42:59,000 That's no big deal. I can support 2 people. 796 00:42:59,000 --> 00:43:03,000 Let me go ahead and highlight this and go down here, 797 00:43:03,000 --> 00:43:09,000 and I can say "id = 456" for someone like Rob who lives in Kirkland. 798 00:43:09,000 --> 00:43:12,000 >> Okay, wait, but I can't call these the same thing, 799 00:43:12,000 --> 00:43:15,000 and it looks like I'm going to have to copy this, 800 00:43:15,000 --> 00:43:19,000 so let me say that these will be David's variables, 801 00:43:19,000 --> 00:43:23,000 and let me get some copies of these for Rob. 802 00:43:23,000 --> 00:43:27,000 We'll call these Rob's but this isn't going to work now 803 00:43:27,000 --> 00:43:33,000 because I have—wait, let's change me to id1, name1 and house1. 804 00:43:33,000 --> 00:43:35,000 Rob will be 2, 2. 805 00:43:35,000 --> 00:43:42,000 I've got to change this here, here, here, here, here, here. 806 00:43:42,000 --> 00:43:45,000 Wait, what about Tommy? Let's do this again. 807 00:43:45,000 --> 00:43:49,000 Obviously if you still think this is a good way of doing this, it's not, 808 00:43:49,000 --> 00:43:52,000 so copy/paste bad. 809 00:43:52,000 --> 00:43:55,000 But we solved this a week ago. 810 00:43:55,000 --> 00:43:59,000 >> What was our solution when we wanted to have multiple instances of the same data type? 811 00:43:59,000 --> 00:44:01,000 [Students] An array. 812 00:44:01,000 --> 00:44:03,000 An array, so let me try to clean this up. 813 00:44:03,000 --> 00:44:07,000 Let me make some room for myself at the top, and let me instead do this here. 814 00:44:07,000 --> 00:44:12,000 We'll call these people, and instead I'm going to say "int ids," 815 00:44:12,000 --> 00:44:14,000 and I'm going to support 3 of us for now. 816 00:44:14,000 --> 00:44:18,000 I'm going to say "string names," and I'll support 3 of us, 817 00:44:18,000 --> 00:44:22,000 and then I'm going to say "string houses," and I'm going to support 3 of us. 818 00:44:22,000 --> 00:44:26,000 Now in here instead of David getting his own local variables 819 00:44:26,000 --> 00:44:28,000 we can get rid of those. 820 00:44:28,000 --> 00:44:30,000 That feels good that we're cleaning this up. 821 00:44:30,000 --> 00:44:35,000 I can then say David is going to be [0] and names[0] 822 00:44:35,000 --> 00:44:38,000 and houses[0]. 823 00:44:38,000 --> 00:44:41,000 And then Rob we can similarly save on this. 824 00:44:41,000 --> 00:44:46,000 Let's put this down here, so he's going to arbitrarily be ids[1]. 825 00:44:46,000 --> 00:44:50,000 He's going to be names[1], 826 00:44:50,000 --> 00:44:53,000 and then lastly, houses[1]. 827 00:44:53,000 --> 00:44:57,000 >> Still a little tedious, and now I have to figure this out, 828 00:44:57,000 --> 00:45:03,000 so let's say "names[0], id[0], houses[0], 829 00:45:03,000 --> 00:45:06,000 and let's pluralize this. 830 00:45:06,000 --> 00:45:09,000 Ids, ids, ids. 831 00:45:09,000 --> 00:45:12,000 And again, I'm doing it, so again, I'm already resorting to copy/paste again, 832 00:45:12,000 --> 00:45:14,000 so odds are there's another solution here. 833 00:45:14,000 --> 00:45:18,000 I can probably clean this up further with a loop or something like that, 834 00:45:18,000 --> 00:45:21,000 so in short, it's a little better but still feels like 835 00:45:21,000 --> 00:45:24,000 I'm resorting to copy/paste, but even this, I claim, 836 00:45:24,000 --> 00:45:27,000 is not really fundamentally the right solution because 837 00:45:27,000 --> 00:45:29,000 what if sometime we decide you know what? 838 00:45:29,000 --> 00:45:32,000 We really should have been storing email addresses for David and Rob 839 00:45:32,000 --> 00:45:34,000 and everyone else in this program. 840 00:45:34,000 --> 00:45:36,000 We should also store phone numbers. 841 00:45:36,000 --> 00:45:39,000 We should also store emergency contact numbers. 842 00:45:39,000 --> 00:45:41,000 We have all these pieces of data that we want to store, 843 00:45:41,000 --> 00:45:43,000 so how do you go about doing that? 844 00:45:43,000 --> 00:45:46,000 >> You declare another array at the top, and then you manually add 845 00:45:46,000 --> 00:45:49,000 an email address[0], email address[1] 846 00:45:49,000 --> 00:45:51,000 for David and Rob and so forth. 847 00:45:51,000 --> 00:45:56,000 But there's really just an assumption underlying this design 848 00:45:56,000 --> 00:45:59,000 that I am using the honor system to know that 849 00:45:59,000 --> 00:46:03,000 [i] in each of the several arrays 850 00:46:03,000 --> 00:46:06,000 just so happens to refer to the same person, 851 00:46:06,000 --> 00:46:10,000 so [0] in ids is number 123, 852 00:46:10,000 --> 00:46:13,000 and I'm going to assume that names[0] 853 00:46:13,000 --> 00:46:16,000 is the same person's name and houses[0] 854 00:46:16,000 --> 00:46:21,000 is the same person's house and so forth for all of the various arrays that I create. 855 00:46:21,000 --> 00:46:24,000 But notice that there's no fundamental linkage 856 00:46:24,000 --> 00:46:27,000 among those 3 pieces of information, id, name and house, 857 00:46:27,000 --> 00:46:32,000 even though the entity we're trying to model in this program is not arrays. 858 00:46:32,000 --> 00:46:35,000 Arrays are just this programmatic way of doing this. 859 00:46:35,000 --> 00:46:38,000 What we really want to model in our program is a person 860 00:46:38,000 --> 00:46:41,000 like David, a person like Rob inside of which 861 00:46:41,000 --> 00:46:46,000 or encapsulating is a name and ID and a house. 862 00:46:46,000 --> 00:46:49,000 >> Can we somehow express this idea of encapsulation 863 00:46:49,000 --> 00:46:52,000 whereby a person has an ID, a name and a house 864 00:46:52,000 --> 00:46:55,000 and not resort to really this hack whereby we just 865 00:46:55,000 --> 00:46:58,000 trust that bracket something 866 00:46:58,000 --> 00:47:02,000 refers to the same human entity in each of these disparate arrays? 867 00:47:02,000 --> 00:47:04,000 We can actually do this. 868 00:47:04,000 --> 00:47:08,000 Let me go above main for now, and let me create my own data type 869 00:47:08,000 --> 00:47:10,000 for really the first time. 870 00:47:10,000 --> 00:47:14,000 We used this technique in Scramble, 871 00:47:14,000 --> 00:47:17,000 but here I'm going to go ahead and create a data type, 872 00:47:17,000 --> 00:47:19,000 and you know what, I'm going to call it student or person, 873 00:47:19,000 --> 00:47:23,000 and I'm going to use typedef for define a type. 874 00:47:23,000 --> 00:47:25,000 I'm going to say that this is a structure, 875 00:47:25,000 --> 00:47:29,000 and then this structure is going to be of type student, we'll say, 876 00:47:29,000 --> 00:47:31,000 even though it's a little dated now for me. 877 00:47:31,000 --> 00:47:33,000 We'll say "int id." 878 00:47:33,000 --> 00:47:35,000 We'll say "string name." 879 00:47:35,000 --> 00:47:37,000 Then we'll say "string house," 880 00:47:37,000 --> 00:47:40,000 so now by the end of these few lines of code 881 00:47:40,000 --> 00:47:45,000 I have just taught clang that there exists 882 00:47:45,000 --> 00:47:49,000 a data type besides ints, besides strings, besides doubles, besides floats. 883 00:47:49,000 --> 00:47:54,000 >> As of this moment in time line 11, there is now a new data type called students, 884 00:47:54,000 --> 00:47:58,000 and now I can declare a student variable anywhere I want, 885 00:47:58,000 --> 00:48:01,000 so let me scroll down here to people. 886 00:48:01,000 --> 00:48:05,000 Now I can get rid of this, and I can go back down to David here, 887 00:48:05,000 --> 00:48:10,000 and for David I can actually say that David, 888 00:48:10,000 --> 00:48:13,000 we can literally name the variable after myself, 889 00:48:13,000 --> 00:48:16,000 is going to be of type student. 890 00:48:16,000 --> 00:48:18,000 This might look a little weird, but this isn't all that different 891 00:48:18,000 --> 00:48:22,000 from declaring something as an int or a string or a float. 892 00:48:22,000 --> 00:48:24,000 It just so happens to be called student now, 893 00:48:24,000 --> 00:48:28,000 and if I want to put something inside of this structure 894 00:48:28,000 --> 00:48:31,000 I now have to use a new piece of syntax, but it's pretty straightforward, 895 00:48:31,000 --> 00:48:39,000 david.id = 123, david.name = "David" in capital D, 896 00:48:39,000 --> 00:48:42,000 and david.house = "Mather," 897 00:48:42,000 --> 00:48:46,000 and now I can get rid of this stuff here. 898 00:48:46,000 --> 00:48:51,000 Notice we've now redesigned our program in really a much better way 899 00:48:51,000 --> 00:48:54,000 in that now our program mirrors the real world. 900 00:48:54,000 --> 00:48:57,000 >> There's a real-world notion of a person or a student. 901 00:48:57,000 --> 00:49:02,000 Here we have now a C version of a person or more specifically a student. 902 00:49:02,000 --> 00:49:05,000 Inside of that person are these relevant characteristics, 903 00:49:05,000 --> 00:49:10,000 ID, name and house, so Rob essentially becomes the same thing down here, 904 00:49:10,000 --> 00:49:14,000 so student rob, and now rob.id = 456, 905 00:49:14,000 --> 00:49:17,000 rob.name = "Rob." 906 00:49:17,000 --> 00:49:20,000 The fact that the variable is called Rob is sort of meaningless. 907 00:49:20,000 --> 00:49:22,000 We could have called it x or y or z. 908 00:49:22,000 --> 00:49:25,000 We just named it Rob to be semantically consistent, 909 00:49:25,000 --> 00:49:28,000 but really the name is inside of that field itself, 910 00:49:28,000 --> 00:49:30,000 so now I have this. 911 00:49:30,000 --> 00:49:33,000 This too doesn't feel like the best design in that I've hard coded David. 912 00:49:33,000 --> 00:49:35,000 I've hard coded Rob. 913 00:49:35,000 --> 00:49:39,000 And I still have to resort to some copy and paste every time I want new variables. 914 00:49:39,000 --> 00:49:43,000 Moreover, I have to apparently give each of these variables a name, 915 00:49:43,000 --> 00:49:46,000 even though I'd much rather describe these variables 916 00:49:46,000 --> 00:49:48,000 more generically as students. 917 00:49:48,000 --> 00:49:52,000 >> Now we can merge the ideas that have been working well for us 918 00:49:52,000 --> 00:49:56,000 and instead say, "You know what, give me a variable called students, 919 00:49:56,000 --> 00:50:01,000 and let's have it be of size 3," so now I can refine this further, 920 00:50:01,000 --> 00:50:04,000 get rid of the manually declared David, 921 00:50:04,000 --> 00:50:08,000 and I can instead say something like students[0] here. 922 00:50:08,000 --> 00:50:11,000 I can then say students[0] here, 923 00:50:11,000 --> 00:50:14,000 students[0] here, and so forth, and I can go around 924 00:50:14,000 --> 00:50:16,000 and clean that up for Rob. 925 00:50:16,000 --> 00:50:19,000 I could also go about now maybe adding a loop 926 00:50:19,000 --> 00:50:23,000 and using GetString and GetInt to actually get these values from the user. 927 00:50:23,000 --> 00:50:27,000 I could go about adding a constant because this is generally bad practice 928 00:50:27,000 --> 00:50:29,000 to hard code some arbitrary number like 3 right here 929 00:50:29,000 --> 00:50:33,000 and then just remember that you should put no more than 3 students in it. 930 00:50:33,000 --> 00:50:36,000 It would probably be better to use #define at the top of my file 931 00:50:36,000 --> 00:50:40,000 and factor that out, so indeed, let me go ahead and generalize this. 932 00:50:40,000 --> 00:50:43,000 >> Let me open up an example that's among today's 933 00:50:43,000 --> 00:50:46,000 examples in advance, structs1. 934 00:50:46,000 --> 00:50:49,000 This is a more complete program that uses #define up here 935 00:50:49,000 --> 00:50:51,000 and says we're going to have 3 students by default. 936 00:50:51,000 --> 00:50:54,000 Here I'm declaring a class worth of students, 937 00:50:54,000 --> 00:50:57,000 so a classroom of students, and now I'm using a loop 938 00:50:57,000 --> 00:51:00,000 just to make the code a little more elegant, populate the class 939 00:51:00,000 --> 00:51:05,000 with the user's input, so iterate from i = 0 on up to students, which is 3. 940 00:51:05,000 --> 00:51:07,000 And then I prompt the user in this version 941 00:51:07,000 --> 00:51:10,000 what's the student's ID, and I get it with GetInt. 942 00:51:10,000 --> 00:51:13,000 What's the student's name, and then I get it with GetString. 943 00:51:13,000 --> 00:51:15,000 What's the student's house? I get it with GetString. 944 00:51:15,000 --> 00:51:19,000 And then at the bottom here I just decided to change 945 00:51:19,000 --> 00:51:22,000 how I'm printing these out and to actually use a loop, 946 00:51:22,000 --> 00:51:24,000 and who am I printing? 947 00:51:24,000 --> 00:51:27,000 According to the comment I'm printing anyone in Mather, 948 00:51:27,000 --> 00:51:30,000 and that's it so Rob and Tommy and so forth—actually Tommy's in Mather. 949 00:51:30,000 --> 00:51:34,000 Tommy and David would be printed in this case, but how is this working? 950 00:51:34,000 --> 00:51:40,000 We haven't seen this function before, but take a guess as to what this does. 951 00:51:40,000 --> 00:51:42,000 Compares strings. 952 00:51:42,000 --> 00:51:45,000 >> It's a little non-obvious how it compares strings because it turns out 953 00:51:45,000 --> 00:51:49,000 if it returns 0 that means the strings are equal. 954 00:51:49,000 --> 00:51:53,000 If it returns a -1 that means one comes alphabetically before the other, 955 00:51:53,000 --> 00:51:57,000 and if it returns +1 that means the other word comes alphabetically 956 00:51:57,000 --> 00:52:00,000 before the other, and you can look online or at the man page 957 00:52:00,000 --> 00:52:04,000 to see exactly which way is which, but all this is now doing is it's saying 958 00:52:04,000 --> 00:52:09,000 if the [i].house is equal to "Mather" 959 00:52:09,000 --> 00:52:13,000 then go ahead and print out so and so is in Mather. 960 00:52:13,000 --> 00:52:16,000 But here's something we haven't seen before, and we'll come back to this. 961 00:52:16,000 --> 00:52:21,000 I don't recall ever having to do this in any of my programs. 962 00:52:21,000 --> 00:52:24,000 Free is apparently referring to memory, freeing memory, 963 00:52:24,000 --> 00:52:31,000 but what memory am I apparently freeing in this loop at the bottom of this program? 964 00:52:31,000 --> 00:52:34,000 It looks like I'm freeing a person's name 965 00:52:34,000 --> 00:52:37,000 and a person's house, but why is that? 966 00:52:37,000 --> 00:52:41,000 >> It turns out all these weeks that you've been using GetString 967 00:52:41,000 --> 00:52:45,000 we've kind of been introducing a bug into every one of your programs. 968 00:52:45,000 --> 00:52:51,000 GetString by design allocates memory so that it can return to you a string, 969 00:52:51,000 --> 00:52:55,000 like David, or Rob, and you can then do whatever you want 970 00:52:55,000 --> 00:52:59,000 with that string in your program because we've reserved the memory for you. 971 00:52:59,000 --> 00:53:02,000 The problem is all this time every time you call GetString 972 00:53:02,000 --> 00:53:05,000 we, the authors of GetString, have been asking the operating system 973 00:53:05,000 --> 00:53:07,000 to give us a bit of RAM for this string. 974 00:53:07,000 --> 00:53:09,000 Give us a bit of RAM for this next string. 975 00:53:09,000 --> 00:53:11,000 Give us some more RAM for this next string. 976 00:53:11,000 --> 00:53:13,000 What you, the programmer, have never been doing 977 00:53:13,000 --> 00:53:15,000 is giving us that memory back, 978 00:53:15,000 --> 00:53:17,000 so for these several weeks all of the programs you've written 979 00:53:17,000 --> 00:53:20,000 have had what's called a memory leap whereby they keep using 980 00:53:20,000 --> 00:53:24,000 more and more memory every time you call GetString, and that's fine. 981 00:53:24,000 --> 00:53:27,000 We deliberately do that in the first weeks because it's not that interesting 982 00:53:27,000 --> 00:53:29,000 to have to worry about where the string is coming from. 983 00:53:29,000 --> 00:53:34,000 All you want is the word Rob to come back when the user types it in. 984 00:53:34,000 --> 00:53:38,000 >> But moving forward we now have to start getting more sophisticated about this. 985 00:53:38,000 --> 00:53:42,000 Any time we allocate memory we better eventually hand it back. 986 00:53:42,000 --> 00:53:45,000 Otherwise in the real world on your Mac or PC you might have occasionally experienced 987 00:53:45,000 --> 00:53:50,000 symptoms where your computer is grinding to a halt eventually 988 00:53:50,000 --> 00:53:54,000 or the stupid spinning beach ball is just occupying the computer's 989 00:53:54,000 --> 00:53:56,000 entire attention and you can't do things. 990 00:53:56,000 --> 00:54:00,000 That can be explained by any number of bugs, but among those possible bugs 991 00:54:00,000 --> 00:54:03,000 are things called memory leaks whereby someone who wrote that piece of software 992 00:54:03,000 --> 00:54:07,000 you're using didn't remember to free memory 993 00:54:07,000 --> 00:54:10,000 that he or she asked the operating system for, 994 00:54:10,000 --> 00:54:14,000 not using GetString, because that's a CS50 thing, but using similar functions 995 00:54:14,000 --> 00:54:16,000 that ask the operating system for memory. 996 00:54:16,000 --> 00:54:19,000 If you or they screw up and never actually return that memory 997 00:54:19,000 --> 00:54:24,000 a symptom of that can be that a program slows and slows and slows down 998 00:54:24,000 --> 00:54:26,000 unless you remember to call free. 999 00:54:26,000 --> 00:54:28,000 >> We'll come back to when and why you would call free, 1000 00:54:28,000 --> 00:54:32,000 but let's go ahead just for good measure and try running this particular program. 1001 00:54:32,000 --> 00:54:35,000 This was called structs1, enter. 1002 00:54:35,000 --> 00:54:40,000 Let me go ahead and run structs1, 123, David Mather, 1003 00:54:40,000 --> 00:54:47,000 456, Rob Kirkland, 789, 1004 00:54:47,000 --> 00:54:50,000 Tommy Mather, and we see David's in Mather, Tommy's in Mather. 1005 00:54:50,000 --> 00:54:53,000 This is just a little sanity check that the program is working. 1006 00:54:53,000 --> 00:54:56,000 Now, unfortunately, this program is a little frustrating in that 1007 00:54:56,000 --> 00:55:00,000 I did all that work, I typed in 9 different strings, hit enter, 1008 00:55:00,000 --> 00:55:04,000 was told who was in Mather, yet obviously I knew who was in Mather already because I typed it. 1009 00:55:04,000 --> 00:55:07,000 It would be at least nice if this program is more like a database 1010 00:55:07,000 --> 00:55:10,000 and it actually remembers what I have typed in 1011 00:55:10,000 --> 00:55:12,000 so I never again have to input these student records. 1012 00:55:12,000 --> 00:55:15,000 Maybe it's like a registrarial system. 1013 00:55:15,000 --> 00:55:21,000 >> We can do this using this technique known as file I/O, file input and output, 1014 00:55:21,000 --> 00:55:24,000 a very generic way of saying any time you want to read files or write files 1015 00:55:24,000 --> 00:55:26,000 you can do this with a certain set of functions. 1016 00:55:26,000 --> 00:55:29,000 Let me go ahead and open this example structs2.c, 1017 00:55:29,000 --> 00:55:33,000 which is almost identical, but let's see what it now does. 1018 00:55:33,000 --> 00:55:36,000 At the top of the file I declare a class of students. 1019 00:55:36,000 --> 00:55:38,000 I then populate the class with the user's input, 1020 00:55:38,000 --> 00:55:41,000 so those lines of code are exactly like before. 1021 00:55:41,000 --> 00:55:45,000 Then if I scroll down here I print everyone who is in Mather arbitrarily like before, 1022 00:55:45,000 --> 00:55:47,000 but this is an interesting new feature. 1023 00:55:47,000 --> 00:55:51,000 These lines of code are new, and they introduce something here, 1024 00:55:51,000 --> 00:55:55,000 FILE, all caps, and it has * in here as well. 1025 00:55:55,000 --> 00:55:58,000 Let me move this over here, a * over here as well. 1026 00:55:58,000 --> 00:56:00,000 >> This function we haven't seen before, fopen, 1027 00:56:00,000 --> 00:56:03,000 but it means file open, so let's skim through these, 1028 00:56:03,000 --> 00:56:05,000 and this is something we'll come back to in future psets, 1029 00:56:05,000 --> 00:56:10,000 but this line here essentially opens a file called database, 1030 00:56:10,000 --> 00:56:13,000 and it specifically opens it in such a way that it can do what to it? 1031 00:56:13,000 --> 00:56:15,000 [inaudible-student] 1032 00:56:15,000 --> 00:56:19,000 Right, so "w" just means it's telling the operating system 1033 00:56:19,000 --> 00:56:21,000 open this file in such a way that I can write to it. 1034 00:56:21,000 --> 00:56:23,000 I don't want to read it. I don't want to just look at it. 1035 00:56:23,000 --> 00:56:26,000 I want to change it and add stuff potentially to it, 1036 00:56:26,000 --> 00:56:28,000 and the file is going to be called database. 1037 00:56:28,000 --> 00:56:30,000 This could be called anything. 1038 00:56:30,000 --> 00:56:32,000 This could be database.txt. This could be .db. 1039 00:56:32,000 --> 00:56:37,000 This could be a word like foo, but I arbitrarily chose to name the file database. 1040 00:56:37,000 --> 00:56:42,000 This is a little sanity check that we'll come back to in great detail over time, 1041 00:56:42,000 --> 00:56:47,000 if fp, for file pointer, does not equal NULL that means all is well. 1042 00:56:47,000 --> 00:56:51,000 >> Long story short, functions like fopen sometimes fail. 1043 00:56:51,000 --> 00:56:53,000 Maybe the file doesn't exist. Maybe you're out of disc space. 1044 00:56:53,000 --> 00:56:55,000 Maybe you don't have permission to that folder, 1045 00:56:55,000 --> 00:56:58,000 so if fopen returns null something bad happened. 1046 00:56:58,000 --> 00:57:02,000 Conversely, if fopen does not return null all is well 1047 00:57:02,000 --> 00:57:04,000 and I can start writing to this file. 1048 00:57:04,000 --> 00:57:06,000 Here's a new trick. 1049 00:57:06,000 --> 00:57:08,000 This is a for loop that's iterating over each of my students, 1050 00:57:08,000 --> 00:57:10,000 and this looks so similar to what we've done before, 1051 00:57:10,000 --> 00:57:15,000 but this function is a cousin of printf called fprintf for file printf, 1052 00:57:15,000 --> 00:57:18,000 and notice it's different in only 2 ways. 1053 00:57:18,000 --> 00:57:20,000 One, it starts with f instead of p, 1054 00:57:20,000 --> 00:57:23,000 but then its first argument is apparently what? 1055 00:57:23,000 --> 00:57:25,000 [Students] File.>>It's a file. 1056 00:57:25,000 --> 00:57:30,000 This thing called fp, which we'll eventually tease apart what a file pointer is, 1057 00:57:30,000 --> 00:57:35,000 but for now fp simply represents the file that I have opened, 1058 00:57:35,000 --> 00:57:41,000 so fprintf here is saying print this user's ID to the file, not to the screen. 1059 00:57:41,000 --> 00:57:44,000 Print the user's name to the file, not to the screen, 1060 00:57:44,000 --> 00:57:47,000 the house to the file, not to the screen, and then down here, obviously, 1061 00:57:47,000 --> 00:57:50,000 close the file, and then down here free the memory. 1062 00:57:50,000 --> 00:57:53,000 >> The only difference between this version 2 and version 1 1063 00:57:53,000 --> 00:57:58,000 is the introduction of fopen and this FILE with * 1064 00:57:58,000 --> 00:58:01,000 and this notion of fprintf, so let's see what the end result is. 1065 00:58:01,000 --> 00:58:03,000 Let me go into my terminal window. 1066 00:58:03,000 --> 00:58:06,000 Let me run structs2, enter. 1067 00:58:06,000 --> 00:58:09,000 Looks like all is well. Let's rerun structs2. 1068 00:58:09,000 --> 00:58:15,000 123, David Mather, 456, Rob Kirkland, 1069 00:58:15,000 --> 00:58:19,000 789, Tommy Mather, enter. 1070 00:58:19,000 --> 00:58:23,000 Looks like it behaved the same, but if I now do ls 1071 00:58:23,000 --> 00:58:28,000 notice what file is in here among all my code, database, 1072 00:58:28,000 --> 00:58:32,000 so let's open that, gedit of database, and look at that. 1073 00:58:32,000 --> 00:58:34,000 It's not the sexiest of file formats. 1074 00:58:34,000 --> 00:58:38,000 It really is one piece of data line per line per line, 1075 00:58:38,000 --> 00:58:42,000 but those of you who use Excel or CSV files, comma separated values, 1076 00:58:42,000 --> 00:58:47,000 I could certainly have used fprintf to instead maybe do something like this 1077 00:58:47,000 --> 00:58:50,000 so that I could actually create the equivalent of an Excel file 1078 00:58:50,000 --> 00:58:53,000 by separating things with commas, not just new lines. 1079 00:58:53,000 --> 00:58:56,000 >> In this case if I had instead used commas instead of new lines 1080 00:58:56,000 --> 00:59:01,000 I could literally open this database file in Excel if I instead made it look like this. 1081 00:59:01,000 --> 00:59:03,000 In short, now that we have the power to write to files 1082 00:59:03,000 --> 00:59:07,000 we can now start persisting data, keeping it around on disc 1083 00:59:07,000 --> 00:59:10,000 so that we can keep information around again and again. 1084 00:59:10,000 --> 00:59:14,000 Notice a couple of other things that are now a bit more familiar. 1085 00:59:14,000 --> 00:59:16,000 At the top of this C file we have a typedef 1086 00:59:16,000 --> 00:59:21,000 because we wanted to create a data type that represents a word, 1087 00:59:21,000 --> 00:59:25,000 so this type is called word, and inside of this structure 1088 00:59:25,000 --> 00:59:27,000 it's a little fancier now. 1089 00:59:27,000 --> 00:59:30,000 Why is a word made up of apparently an array? 1090 00:59:30,000 --> 00:59:33,000 What is a word just intuitively? 1091 00:59:33,000 --> 00:59:35,000 >> It's an array of characters. 1092 00:59:35,000 --> 00:59:37,000 It's a sequence of characters back to back to back. 1093 00:59:37,000 --> 00:59:41,000 LETTERS in all caps happens to be we arbitrarily say the maximum length 1094 00:59:41,000 --> 00:59:44,000 of any word in the dictionary that we're using for Scramble. 1095 00:59:44,000 --> 00:59:46,000 Why do I have a +1? 1096 00:59:46,000 --> 00:59:48,000 The null character. 1097 00:59:48,000 --> 00:59:51,000 Recall when we did the Bananagrams example we needed a special value 1098 00:59:51,000 --> 00:59:55,000 at the end of the word in order to keep track 1099 00:59:55,000 --> 00:59:59,000 of where words actually ended, and as the problem set specification says 1100 00:59:59,000 --> 01:00:03,000 here we're associating with a given word a boolean value, 1101 01:00:03,000 --> 01:00:05,000 a flag, so to speak, true or false. 1102 01:00:05,000 --> 01:00:09,000 Have you found this word already, because we realize 1103 01:00:09,000 --> 01:00:13,000 we really need a way of remembering not only what a word is in Scramble 1104 01:00:13,000 --> 01:00:15,000 but whether or not you, the human, have found it 1105 01:00:15,000 --> 01:00:20,000 so that if you do find the word "the" you can't just type the, enter, the, enter, the, enter 1106 01:00:20,000 --> 01:00:23,000 and get 3 points, 3 points, 3 points, 3 points. 1107 01:00:23,000 --> 01:00:26,000 We want to be able to blacklist that word by setting a bool 1108 01:00:26,000 --> 01:00:29,000 to true if you've already found it, and so that's why we 1109 01:00:29,000 --> 01:00:31,000 encapsulated it in this structure. 1110 01:00:31,000 --> 01:00:35,000 >> Now, down here in Scramble there's this other struct called dictionary. 1111 01:00:35,000 --> 01:00:39,000 Absent here is the word typedef because in this case 1112 01:00:39,000 --> 01:00:43,000 we needed to encapsulate the idea of a dictionary, 1113 01:00:43,000 --> 01:00:46,000 and a dictionary contains a whole bunch of words, 1114 01:00:46,000 --> 01:00:49,000 as implied by this array, and how many of those words are there? 1115 01:00:49,000 --> 01:00:51,000 Well, whatever this variable called size says. 1116 01:00:51,000 --> 01:00:53,000 But we just need one dictionary. 1117 01:00:53,000 --> 01:00:55,000 We don't need a data type called dictionary. 1118 01:00:55,000 --> 01:00:58,000 We just need one of them, so it turns out in C 1119 01:00:58,000 --> 01:01:03,000 that if you don't say typedef, you just say struct, then inside the curly braces 1120 01:01:03,000 --> 01:01:05,000 you put your variables, then you put the name. 1121 01:01:05,000 --> 01:01:09,000 This is declaring one variable called dictionary 1122 01:01:09,000 --> 01:01:11,000 that looks like this. 1123 01:01:11,000 --> 01:01:16,000 By contrast, these lines are creating a reusable data structure called word 1124 01:01:16,000 --> 01:01:19,000 that you can create multiple copies of, just like we created 1125 01:01:19,000 --> 01:01:22,000 multiple copies of students. 1126 01:01:22,000 --> 01:01:24,000 >> What does this ultimately allow us to do? 1127 01:01:24,000 --> 01:01:30,000 Let me go back into, let's say, a simpler example from simpler times, 1128 01:01:30,000 --> 01:01:34,000 and let me open up, let's say, compare1.c. 1129 01:01:34,000 --> 01:01:38,000 The problem here at hand is to actually peel back 1130 01:01:38,000 --> 01:01:41,000 the layer of a string and start taking off these training wheels 1131 01:01:41,000 --> 01:01:44,000 because it turns out that a string all this time 1132 01:01:44,000 --> 01:01:47,000 is as we promised in week 1 really just a nickname, 1133 01:01:47,000 --> 01:01:51,000 a synonym from the CS50 library for something that looks a little more cryptic, 1134 01:01:51,000 --> 01:01:53,000 char*, and we've seen this star before. 1135 01:01:53,000 --> 01:01:55,000 We saw it in the context of files. 1136 01:01:55,000 --> 01:01:59,000 >> Let's now see why we've been hiding this detail for some time now. 1137 01:01:59,000 --> 01:02:02,000 Here is a file called compare1.c, 1138 01:02:02,000 --> 01:02:07,000 and it apparently asks the user for 2 strings, s and t, 1139 01:02:07,000 --> 01:02:11,000 and then it tries to compare those strings for equality in line 26, 1140 01:02:11,000 --> 01:02:14,000 and if they're equal it says, "You typed the same thing," 1141 01:02:14,000 --> 01:02:17,000 and if they're not equal it says, "You typed different things." 1142 01:02:17,000 --> 01:02:19,000 Let me go ahead and run this program. 1143 01:02:19,000 --> 01:02:23,000 Let me go into my source directory, make a compare1. It compiled okay. 1144 01:02:23,000 --> 01:02:25,000 Let me run compare1. 1145 01:02:25,000 --> 01:02:27,000 I'll zoom in, enter. 1146 01:02:27,000 --> 01:02:29,000 Say something. HELLO. 1147 01:02:29,000 --> 01:02:32,000 I'll say something again. HELLO. 1148 01:02:32,000 --> 01:02:34,000 I definitely didn't type different things. 1149 01:02:34,000 --> 01:02:37,000 >> Let me try this again. BYE BYE. 1150 01:02:37,000 --> 01:02:40,000 Definitely not different, so what's going on here? 1151 01:02:40,000 --> 01:02:44,000 Well, what is really being compared in line 26? 1152 01:02:44,000 --> 01:02:46,000 [inaudible-student] 1153 01:02:46,000 --> 01:02:49,000 Yes, so it turns out that a string, data type, is kind of a white lie. 1154 01:02:49,000 --> 01:02:53,000 A string is a char*, but what is a char*? 1155 01:02:53,000 --> 01:02:56,000 A char*, as they say, is a pointer, 1156 01:02:56,000 --> 01:03:00,000 and a pointer is effectively an address, 1157 01:03:00,000 --> 01:03:05,000 a sum location in memory, and if you happen to have typed in a word like HELLO, 1158 01:03:05,000 --> 01:03:08,000 recall from past discussions of strings 1159 01:03:08,000 --> 01:03:16,000 this is like the word HELLO. 1160 01:03:16,000 --> 01:03:19,000 Remember that a word like HELLO can be represented 1161 01:03:19,000 --> 01:03:22,000 as an array of characters like this 1162 01:03:22,000 --> 01:03:25,000 and then with a special character at the end called the null character, 1163 01:03:25,000 --> 01:03:27,000 as the \ denotes. 1164 01:03:27,000 --> 01:03:29,000 What is actually a string? 1165 01:03:29,000 --> 01:03:32,000 Notice that this is multiple chunks of memory, 1166 01:03:32,000 --> 01:03:36,000 and in fact, the end of it is only known once you look through the whole string 1167 01:03:36,000 --> 01:03:38,000 looking for the special null character. 1168 01:03:38,000 --> 01:03:41,000 But if this is a chunk of memory from my computer's memory, 1169 01:03:41,000 --> 01:03:44,000 let's arbitrarily say that this string just got lucky, 1170 01:03:44,000 --> 01:03:47,000 and it got placed at the very beginning of my computer's RAM. 1171 01:03:47,000 --> 01:03:54,000 This is byte 0, 1, 2, 3, 4, 5, 6... 1172 01:03:54,000 --> 01:04:02,000 >> When I say something like GetString and I do string s = GetString 1173 01:04:02,000 --> 01:04:04,000 what's really being returned? 1174 01:04:04,000 --> 01:04:08,000 For these past several weeks, what's really being stored in s 1175 01:04:08,000 --> 01:04:13,000 is not this string per se, but in this case what's being stored is 1176 01:04:13,000 --> 01:04:18,000 the number 0 because what GetString actually does 1177 01:04:18,000 --> 01:04:20,000 is it doesn't physically return a string. 1178 01:04:20,000 --> 01:04:22,000 That doesn't even really make conceptual sense. 1179 01:04:22,000 --> 01:04:24,000 What it does return is a number. 1180 01:04:24,000 --> 01:04:28,000 That number is the address of HELLO in memory, 1181 01:04:28,000 --> 01:04:32,000 and string s then, if we peel back this layer, string doesn't really exist. 1182 01:04:32,000 --> 01:04:35,000 It's only a simplification in the CS50 library. 1183 01:04:35,000 --> 01:04:38,000 >> This really is something called char*. 1184 01:04:38,000 --> 01:04:41,000 Char makes sense because what's a word, like HELLO? 1185 01:04:41,000 --> 01:04:44,000 Well, it's a series of chars, a series of characters. 1186 01:04:44,000 --> 01:04:47,000 Char* means the address of a character, 1187 01:04:47,000 --> 01:04:50,000 so what does it mean to return a string? 1188 01:04:50,000 --> 01:04:53,000 A nice, simple way of returning a string 1189 01:04:53,000 --> 01:04:57,000 is rather than try to figure out how I return to 5 or 6 different bytes 1190 01:04:57,000 --> 01:05:01,000 let me return to the address of which byte? 1191 01:05:01,000 --> 01:05:03,000 The first one. 1192 01:05:03,000 --> 01:05:06,000 In other words, let me give you the address of a character in memory. 1193 01:05:06,000 --> 01:05:10,000 That's what char* represents, the address of one single character in memory. 1194 01:05:10,000 --> 01:05:12,000 Call that variable s. 1195 01:05:12,000 --> 01:05:15,000 Store in s that particular address, which I arbitrarily said is 0, 1196 01:05:15,000 --> 01:05:19,000 just to keep things simple, but in reality it's generally a bigger number. 1197 01:05:19,000 --> 01:05:21,000 >> Wait a minute. 1198 01:05:21,000 --> 01:05:23,000 If you're only giving me the address of the first character, how do I know what the address is 1199 01:05:23,000 --> 01:05:25,000 of the second character, the third, the fourth and the fifth? 1200 01:05:25,000 --> 01:05:27,000 [inaudible-student] 1201 01:05:27,000 --> 01:05:31,000 You only know where the end of the string is by way of this handy trick, 1202 01:05:31,000 --> 01:05:35,000 so when you use something like printf, what printf literally takes as its argument, 1203 01:05:35,000 --> 01:05:39,000 recall that we use this %s placeholder, and then you pass in 1204 01:05:39,000 --> 01:05:41,000 the variable that's storing a string. 1205 01:05:41,000 --> 01:05:47,000 What you're really passing is the address of the first character of that string. 1206 01:05:47,000 --> 01:05:50,000 Printf then uses a for loop or a while loop upon receiving that address, 1207 01:05:50,000 --> 01:05:53,000 for instance, 0, so let me do this now, 1208 01:05:53,000 --> 01:06:02,000 printf("%s\n,"s); 1209 01:06:02,000 --> 01:06:07,000 When I call printf("%s\n,"s); what I'm really providing printf with 1210 01:06:07,000 --> 01:06:13,000 is the address of the first character in s, which in this arbitrary case is H. 1211 01:06:13,000 --> 01:06:16,000 >> How does printf know what exactly to display on the screen? 1212 01:06:16,000 --> 01:06:19,000 The person who implemented printf implemented a while loop or a for loop 1213 01:06:19,000 --> 01:06:23,000 that says does this character equal the special null character? 1214 01:06:23,000 --> 01:06:25,000 If not, print it. How about this one? 1215 01:06:25,000 --> 01:06:28,000 If not print it, print it, print it, print it. 1216 01:06:28,000 --> 01:06:32,000 Oh, this one is special. Stop printing and return to the user. 1217 01:06:32,000 --> 01:06:35,000 And that's literally all that's been happening underneath the hood, 1218 01:06:35,000 --> 01:06:38,000 and that's a lot to digest in the first day of a class, 1219 01:06:38,000 --> 01:06:43,000 but for now it's really the building block of understanding everything 1220 01:06:43,000 --> 01:06:46,000 that's been going on inside of our computer's memory, 1221 01:06:46,000 --> 01:06:49,000 and eventually we'll tease this apart with a little help 1222 01:06:49,000 --> 01:06:51,000 from one of our friends at Stanford. 1223 01:06:51,000 --> 01:06:56,000 >> Professor Nick Parlante at Stanford has done this wonderful video sequence 1224 01:06:56,000 --> 01:06:58,000 from all sorts of different languages that introduced 1225 01:06:58,000 --> 01:07:00,000 this little Claymation character Binky. 1226 01:07:00,000 --> 01:07:03,000 The voice you're about to hear in just a few second sneak preview 1227 01:07:03,000 --> 01:07:05,000 is that of a Stanford professor, and you're getting 1228 01:07:05,000 --> 01:07:07,000 only 5 or 6 seconds of this right now, 1229 01:07:07,000 --> 01:07:09,000 but this is the note on which we'll conclude today 1230 01:07:09,000 --> 01:07:11,000 and begin on Wednesday. 1231 01:07:11,000 --> 01:07:15,000 I give you Pointer Fun with Binky, the preview. 1232 01:07:15,000 --> 01:07:18,000 [♪ Music ♪] [Professor Parlante] Hey, Binky. 1233 01:07:18,000 --> 01:07:21,000 Wake up. It's time for pointer fun. 1234 01:07:21,000 --> 01:07:24,000 [Binky] What's that? Learn about pointers? 1235 01:07:24,000 --> 01:07:26,000 Oh, goody! 1236 01:07:26,000 --> 01:07:29,000 >> We will see you on Wednesday. 1237 01:07:29,000 --> 01:07:32,000 [CS50.TV]