1 00:00:00,000 --> 00:00:02,490 [CS50 Library] 2 00:00:02,490 --> 00:00:04,220 [Nate Hardison] [Harvard University] 3 00:00:04,220 --> 00:00:07,260 [This is CS50. CS50.TV] 4 00:00:07,260 --> 00:00:11,510 The CS50 library is a helpful tool that we have installed on the appliance 5 00:00:11,510 --> 00:00:15,870 to make it easier for you to write programs that prompt users for input. 6 00:00:15,870 --> 00:00:21,670 In this video, we'll pull back the curtain and look at what exactly is in the CS50 library. 7 00:00:21,670 --> 00:00:25,520 >> In the video on C libraries, we talk about how you #include headers files 8 00:00:25,520 --> 00:00:27,570 of the library in your source code, 9 00:00:27,570 --> 00:00:31,150 and then you link with a binary library file during the linking phase 10 00:00:31,150 --> 00:00:33,140 of the compilation process. 11 00:00:33,140 --> 00:00:36,440 The header files specify the interface of the library. 12 00:00:36,440 --> 00:00:41,280 That is, they detail all of the resources that the library has available for you to use, 13 00:00:41,280 --> 00:00:45,250 like function declarations, constants, and data types. 14 00:00:45,250 --> 00:00:48,890 The binary library file contains the implementation of the library, 15 00:00:48,890 --> 00:00:54,580 which is compiled from the library's header files and the library's .c source code files. 16 00:00:54,580 --> 00:00:59,820 >> The binary library file isn't very interesting to look at since it's, well, in binary. 17 00:00:59,820 --> 00:01:03,300 So, let's take a look at the header files for the library instead. 18 00:01:03,300 --> 00:01:07,710 In this case, there's only one header file called cs50.h. 19 00:01:07,710 --> 00:01:11,040 We've installed it in the user include directory 20 00:01:11,040 --> 00:01:15,150 along with the other system libraries' header files. 21 00:01:15,150 --> 00:01:21,530 >> One of the first things you'll notice is that cs50.h #includes header files from other libraries-- 22 00:01:21,530 --> 00:01:25,670 float, limits, standard bool, and standard lib. 23 00:01:25,670 --> 00:01:28,800 Again, following the principle of not reinventing the wheel, 24 00:01:28,800 --> 00:01:33,490 we've built the CS0 library using tools that other provided for us. 25 00:01:33,490 --> 00:01:38,690 >> The next thing you'll see in the library is that we define a new type called "string." 26 00:01:38,690 --> 00:01:42,330 This line really just creates an alias for the char* type, 27 00:01:42,330 --> 00:01:46,000 so it doesn't magically imbue the new string type with attributes 28 00:01:46,000 --> 00:01:49,650 commonly associated with string objects in other languages, 29 00:01:49,650 --> 00:01:50,850 such as length. 30 00:01:50,850 --> 00:01:55,180 The reason we've done this is to shield new programmers from the gory details 31 00:01:55,180 --> 00:01:57,580 of pointers until they're ready. 32 00:01:57,580 --> 00:02:00,130 >> The next part of the header file is the declaration of the functions 33 00:02:00,130 --> 00:02:04,410 that the CS50 library provides along with documentation. 34 00:02:04,410 --> 00:02:06,940 Notice the level of detail in the comments here. 35 00:02:06,940 --> 00:02:10,560 This is super important so that people know how to use these functions. 36 00:02:10,560 --> 00:02:19,150 We declare, in turn, functions to prompt the user and return chars, doubles, floats, ints, 37 00:02:19,150 --> 00:02:24,160 long longs, and strings, using our own string type. 38 00:02:24,160 --> 00:02:26,260 Following the principle of information hiding, 39 00:02:26,260 --> 00:02:31,640 we have put our definition in a separate .c implementation file--cs50.c-- 40 00:02:31,640 --> 00:02:35,110 located in the user source directory. 41 00:02:35,110 --> 00:02:38,040 We've provided that file so that you can take a look at it, 42 00:02:38,040 --> 00:02:41,490 learn from it, and recompile it on different machines if you wish, 43 00:02:41,490 --> 00:02:45,510 even though we think it's better to work on the appliance for this class. 44 00:02:45,510 --> 00:02:47,580 Anyway, let's take a look at it now. 45 00:02:49,020 --> 00:02:54,620 >> The functions GetChar, GetDouble, GetFloat, GetInt, and GetLongLong 46 00:02:54,620 --> 00:02:58,160 are all built on top of the GetString function. 47 00:02:58,160 --> 00:03:01,510 It turns out that they all follow essentially the same pattern. 48 00:03:01,510 --> 00:03:04,870 They use a while loop to prompt the user for one line of input. 49 00:03:04,870 --> 00:03:08,430 They return a special value if the user inputs an empty line. 50 00:03:08,430 --> 00:03:11,750 They attempt to parse the user's input as the appropriate type, 51 00:03:11,750 --> 00:03:15,010 be it a char, a double, a float, etc. 52 00:03:15,010 --> 00:03:18,710 And then they either return the result if the input was successfully parsed 53 00:03:18,710 --> 00:03:21,330 or they reprompt the user. 54 00:03:21,330 --> 00:03:24,230 >> At a high level, there is nothing really tricky here. 55 00:03:24,230 --> 00:03:28,760 You might have written similarly structured code yourself in the past. 56 00:03:28,760 --> 00:03:34,720 Perhaps the most cryptic-looking part is the sscanf call that parses the user's input. 57 00:03:34,720 --> 00:03:38,160 Sscanf is part of the input format conversion family. 58 00:03:38,160 --> 00:03:42,300 It lives in standard io.h, and its job is to parse a C string, 59 00:03:42,300 --> 00:03:46,520 according to a particular format, storing the parse results in variable 60 00:03:46,520 --> 00:03:48,720 provided by the caller. 61 00:03:48,720 --> 00:03:53,570 Since the input format conversion functions are very useful, widely used functions 62 00:03:53,570 --> 00:03:56,160 that aren't super intuitive at first, 63 00:03:56,160 --> 00:03:58,300 we'll go over how sscanf works. 64 00:03:58,300 --> 00:04:03,330 >> The first argument to sscanf is a char*--a pointer to a character. 65 00:04:03,330 --> 00:04:05,150 For the function to work properly, 66 00:04:05,150 --> 00:04:08,340 that character should be the first character of a C string, 67 00:04:08,340 --> 00:04:12,270 terminated with the null\0 character. 68 00:04:12,270 --> 00:04:15,120 This is the string to parse 69 00:04:15,120 --> 00:04:18,269 The second argument to sscanf is a format string, 70 00:04:18,269 --> 00:04:20,839 typically passed in as a string constant, 71 00:04:20,839 --> 00:04:24,040 and you might have seen a string like this before when using printf. 72 00:04:24,040 --> 00:04:28,650 A percent sign in the format string indicates a conversion specifier. 73 00:04:28,650 --> 00:04:30,850 The character immediately following a percent sign, 74 00:04:30,850 --> 00:04:35,430 indicates the C type that we want sscanf to convert to. 75 00:04:35,430 --> 00:04:40,090 In GetInt, you see that there is a %d and a %c. 76 00:04:40,090 --> 00:04:48,690 This means that sscanf will try to a decimal int--the %d--and a char--the %c. 77 00:04:48,690 --> 00:04:51,510 For each conversion specifier in the format string, 78 00:04:51,510 --> 00:04:56,620 sscanf expects a corresponding argument later in its argument list. 79 00:04:56,620 --> 00:05:00,850 That argument must point to an appropriately typed location 80 00:05:00,850 --> 00:05:04,000 in which to store the result of the conversion. 81 00:05:04,000 --> 00:05:08,910 >> The typical way of doing this is to create a variable on the stack before the sscanf call 82 00:05:08,910 --> 00:05:11,440 for each item that you want to parse from the string 83 00:05:11,440 --> 00:05:15,520 and then use the address operator--the ampersand--to pass pointers 84 00:05:15,520 --> 00:05:19,100 to those variables to the sscanf call. 85 00:05:19,100 --> 00:05:22,720 You can see that in GetInt we do exactly this. 86 00:05:22,720 --> 00:05:28,240 Right before the sscanf call, we declare an int called n and a char call c on the stack, 87 00:05:28,240 --> 00:05:32,340 and we pass pointers to them into the sscanf call. 88 00:05:32,340 --> 00:05:35,800 Putting these variables on the stack is preferred over using space allocated 89 00:05:35,800 --> 00:05:39,350 on the heap with malloc, since you avoid the overhead of the malloc call, 90 00:05:39,350 --> 00:05:43,060 and you don't have to worry about leaking memory. 91 00:05:43,060 --> 00:05:47,280 Characters not prefixed by a percent sign don't prompt conversion. 92 00:05:47,280 --> 00:05:50,380 Rather they just add to the format specification. 93 00:05:50,380 --> 00:05:56,500 >> For example, if the format string in GetInt were a%d instead, 94 00:05:56,500 --> 00:05:59,800 sscanf would look for the letter a followed by an int, 95 00:05:59,800 --> 00:06:04,360 and while it would attempt to convert the int, it wouldn't do anything else with the a. 96 00:06:04,360 --> 00:06:07,440 The only exception to this is whitespace. 97 00:06:07,440 --> 00:06:11,030 White space characters in the format string match any amount of whitespace-- 98 00:06:11,030 --> 00:06:12,890 even none at all. 99 00:06:12,890 --> 00:06:18,100 So, that's why the comment mentions possibly with leading and/or trailing whitespace. 100 00:06:18,100 --> 00:06:22,910 So, at this point it looks like our sscanf call will try to parse the user's input string 101 00:06:22,910 --> 00:06:25,380 by checking for possible leading whitespace, 102 00:06:25,380 --> 00:06:29,300 followed by a int that will be converted and stored in the int variable n 103 00:06:29,300 --> 00:06:33,090 followed by some amount of whitespace, and followed by a character 104 00:06:33,090 --> 00:06:35,810 stored in the char variable c. 105 00:06:35,810 --> 00:06:37,790 >> What about the return value? 106 00:06:37,790 --> 00:06:41,560 Sscanf will parse the input line from start to finish, 107 00:06:41,560 --> 00:06:44,860 stopping when it reaches the end or when a character in the input 108 00:06:44,860 --> 00:06:49,320 doesn't match a format character or when it can't make a conversion. 109 00:06:49,320 --> 00:06:52,690 It's return value is used to single when it stopped. 110 00:06:52,690 --> 00:06:55,670 If it stopped, because it reached the end of the input string 111 00:06:55,670 --> 00:07:00,630 before making any conversions and before failing to match part of the format string, 112 00:07:00,630 --> 00:07:04,840 then the special constant EOF is returned. 113 00:07:04,840 --> 00:07:08,200 Otherwise, it returns the number of successful conversions, 114 00:07:08,200 --> 00:07:14,380 which could be 0, 1, or 2, since we've asked for two conversions. 115 00:07:14,380 --> 00:07:19,000 In our case, we want to make sure that the user typed in an int and only an int. 116 00:07:19,000 --> 00:07:23,370 >> So, we want sscanf to return 1. See why? 117 00:07:23,370 --> 00:07:26,850 If sscanf returned 0, then no conversions were made, 118 00:07:26,850 --> 00:07:31,690 so the user typed something other than an int at the beginning of the input. 119 00:07:31,690 --> 00:07:37,100 If sscanf returns 2, then the user did properly type it in at the beginning of the input, 120 00:07:37,100 --> 00:07:41,390 but they then typed in some non-whitespace character afterwards 121 00:07:41,390 --> 00:07:44,940 since the %c conversion succeeded. 122 00:07:44,940 --> 00:07:49,570 Wow, that's quite a lengthy explanation for one function call. 123 00:07:49,570 --> 00:07:53,460 Anyway, if you want more information on sscanf and its siblings, 124 00:07:53,460 --> 00:07:57,130 check out the man pages, Google, or both. 125 00:07:57,130 --> 00:07:58,780 There are lots of format string options, 126 00:07:58,780 --> 00:08:03,830 and these can save you a lot of manual labor when trying to parse strings in C. 127 00:08:03,830 --> 00:08:07,180 >> The final function in the library to look at is GetString. 128 00:08:07,180 --> 00:08:10,310 It turns out that GetString is a tricky function to write properly, 129 00:08:10,310 --> 00:08:14,290 even though it seems like such a simple, common task. 130 00:08:14,290 --> 00:08:16,170 Why is this the case? 131 00:08:16,170 --> 00:08:21,380 Well, let's think about how we're going to store the line that the user types in. 132 00:08:21,380 --> 00:08:23,880 Since a string is a sequence of chars, 133 00:08:23,880 --> 00:08:26,430 we might want to store it in an array on the stack, 134 00:08:26,430 --> 00:08:31,250 but we would need to know how long the array is going to be when we declare it. 135 00:08:31,250 --> 00:08:34,030 Likewise, if we want to put it on the heap, 136 00:08:34,030 --> 00:08:38,090 we need to pass to malloc the number of bytes we want to reserve, 137 00:08:38,090 --> 00:08:39,730 but this is impossible. 138 00:08:39,730 --> 00:08:42,760 We have no idea how many chars the user will type in 139 00:08:42,760 --> 00:08:46,590 before the user actually does type them. 140 00:08:46,590 --> 00:08:50,720 >> A naive solution to this problem is to just reserve a big chunk of space, say, 141 00:08:50,720 --> 00:08:54,540 a block of 1000 chars for the user's input, 142 00:08:54,540 --> 00:08:57,980 assuming that the user would never type in a string that long. 143 00:08:57,980 --> 00:09:00,810 This is a bad idea for two reasons. 144 00:09:00,810 --> 00:09:05,280 First, assuming that users typically don't type in strings that long, 145 00:09:05,280 --> 00:09:07,610 you could waste a lot of memory. 146 00:09:07,610 --> 00:09:10,530 On modern machines, this might not be an issue if you do this 147 00:09:10,530 --> 00:09:13,890 in one or two isolated instances, 148 00:09:13,890 --> 00:09:17,630 but if you're taking user's input in a loop and storing for later use, 149 00:09:17,630 --> 00:09:20,870 you can quickly suck up a ton of memory. 150 00:09:20,870 --> 00:09:24,450 Additionally, if the program you're writing is for a smaller computer-- 151 00:09:24,450 --> 00:09:28,100 a device like a smartphone or something else with limited memory-- 152 00:09:28,100 --> 00:09:32,060 this solution will cause problems a lot faster. 153 00:09:32,060 --> 00:09:36,450 The second, more serious reason to not do this is that it leaves your program vulnerable 154 00:09:36,450 --> 00:09:39,710 to what's called a buffer overflow attack. 155 00:09:39,710 --> 00:09:45,840 In programming, a buffer is memory used to temporarily store input or output data, 156 00:09:45,840 --> 00:09:48,980 which in this case is our 1000-char block. 157 00:09:48,980 --> 00:09:53,370 A buffer overflow occurs when data is written past the end of the block. 158 00:09:53,370 --> 00:09:57,790 >> For example, if a user actually does type in more than 1000 chars. 159 00:09:57,790 --> 00:10:01,570 You might have experienced this accidentally when programming with arrays. 160 00:10:01,570 --> 00:10:05,620 If you have an array of 10 ints, nothing stops you from trying to read or write 161 00:10:05,620 --> 00:10:07,810 the 15th int. 162 00:10:07,810 --> 00:10:10,000 There are no compiler warnings or errors. 163 00:10:10,000 --> 00:10:13,250 The program just blunders straight ahead and accesses the memory 164 00:10:13,250 --> 00:10:18,150 where it thinks the 15th int will be, and this can overwrite your other variables. 165 00:10:18,150 --> 00:10:22,040 In the worst case, you can overwrite some of your program's internal 166 00:10:22,040 --> 00:10:26,820 control mechanisms, causing your program to actually execute different instructions 167 00:10:26,820 --> 00:10:28,340 than you intended. 168 00:10:28,340 --> 00:10:31,360 >> Now, it's not common to do this accidentally, 169 00:10:31,360 --> 00:10:35,150 but this is a fairly common technique that bad guys use to break programs 170 00:10:35,150 --> 00:10:39,080 and put malicious code on other people's computers. 171 00:10:39,080 --> 00:10:42,910 Therefore, we can't just use our naive solution. 172 00:10:42,910 --> 00:10:45,590 We need a way to prevent our programs from being vulnerable 173 00:10:45,590 --> 00:10:47,880 to a buffer overflow attack. 174 00:10:47,880 --> 00:10:51,430 To do this, we need to make sure that our buffer can grow as we read 175 00:10:51,430 --> 00:10:53,850 more input from the user. 176 00:10:53,850 --> 00:10:57,440 The solution? We use a heap allocated buffer. 177 00:10:57,440 --> 00:10:59,950 Since we can resize it using the resize the realloc function, 178 00:10:59,950 --> 00:11:04,580 and we keep track of two numbers--the index of the next empty slot in the buffer 179 00:11:04,580 --> 00:11:08,390 and the length or capacity of the buffer. 180 00:11:08,390 --> 00:11:13,210 We read in chars from the user one at a time using the fgetc function. 181 00:11:13,210 --> 00:11:19,360 The argument the fgetc function takes--stdin--is a reference to the standard input string, 182 00:11:19,360 --> 00:11:23,810 which is a preconnected input channel that is used to transfer the user's input 183 00:11:23,810 --> 00:11:26,270 from the terminal to the program. 184 00:11:26,270 --> 00:11:29,890 >> Whenever the user types in a new character, we check to see if the index 185 00:11:29,890 --> 00:11:35,810 of the next free slot plus 1 is greater than the capacity of the buffer. 186 00:11:35,810 --> 00:11:39,690 The +1 comes in because if the next free index is 5, 187 00:11:39,690 --> 00:11:44,150 then our buffer's length must be 6 thanks to 0 indexing. 188 00:11:44,150 --> 00:11:48,350 If we've run out of space in the buffer, then we attempt to resize it, 189 00:11:48,350 --> 00:11:51,690 doubling it so that we cut down on the number of times that we resize 190 00:11:51,690 --> 00:11:54,760 if the user is typing in a really long string. 191 00:11:54,760 --> 00:11:57,950 If the string has gotten too long or if we run out of heap memory, 192 00:11:57,950 --> 00:12:01,350 we free our buffer and return null. 193 00:12:01,350 --> 00:12:04,170 >> Finally, we append the char to the buffer. 194 00:12:04,170 --> 00:12:08,200 Once the user hits enter or return, signalling a new line, 195 00:12:08,200 --> 00:12:12,050 or the special char--control d--which signals an end of input, 196 00:12:12,050 --> 00:12:16,240 we do a check to see if the user actually typed in anything at all. 197 00:12:16,240 --> 00:12:18,820 If not, we return null. 198 00:12:18,820 --> 00:12:22,280 Otherwise, because our buffer is probably larger than we need, 199 00:12:22,280 --> 00:12:24,830 in the worst case it's almost twice as large as we need 200 00:12:24,830 --> 00:12:27,830 since we double every time we resize, 201 00:12:27,830 --> 00:12:31,840 we make a new copy of the string using just the amount of space that we need. 202 00:12:31,840 --> 00:12:34,220 We add an extra 1 to the malloc call, 203 00:12:34,220 --> 00:12:37,810 so that there's space for the special null terminator character--the \0, 204 00:12:37,810 --> 00:12:41,990 which we append to the string once we copy in the rest of the characters, 205 00:12:41,990 --> 00:12:45,060 using strncpy instead of strcpy 206 00:12:45,060 --> 00:12:48,830 so that we can specify exactly how many chars we want to copy. 207 00:12:48,830 --> 00:12:51,690 Strcpy copies until it hits a \0. 208 00:12:51,690 --> 00:12:55,740 Then we free our buffer and return the copy to the caller. 209 00:12:55,740 --> 00:12:59,840 >> Who knew such a simple-seeming function could be so complicated? 210 00:12:59,840 --> 00:13:02,820 Now you know what goes into the CS50 library. 211 00:13:02,820 --> 00:13:06,470 >> My name is Nate Hardison, and this is CS50. 212 00:13:06,470 --> 00:13:08,350 [CS50.TV]