[CS50 Library] [Nate Hardison] [Harvard University] [This is CS50. CS50.TV] The CS50 library is a helpful tool that we have installed on the appliance to make it easier for you to write programs that prompt users for input. In this video, we'll pull back the curtain and look at what exactly is in the CS50 library. In the video on C libraries, we talk about how you #include header files of the library in your source code, and then you link with a binary library file during the linking phase of the compilation process. The header files specify the interface of the library. That is, they detail all of the resources that the library has available for you to use, like function declarations, constants, and data types. The binary library file contains the implementation of the library, which is compiled from the library's header files and the library's .c source code files. The binary library file isn't very interesting to look at since it's, well, in binary. So, let's take a look at the header files for the library instead. In this case, there's only one header file called cs50.h. We've installed it in the user include directory along with the other system libraries' header files. One of the first things you'll notice is that cs50.h #includes header files from other libraries--float, limits, standard bool, and standard lib. Again, following the principle of not reinventing the wheel, we've built the CS50 library using tools that others have provided for us. The next thing you'll see in the library is that we define a new type called "string." This line really just creates an alias for the char* type, so it doesn't magically imbue the new string type with attributes commonly associated with string objects in other languages, such as length. The reason we've done this is to shield new programmers from the gory details of pointers until they're ready. The next part of the header file is the declaration of the functions that the CS50 library provides along with documentation.
Notice the level of detail in the comments here. This is super important so that people know how to use these functions. We declare, in turn, functions to prompt the user and return chars, doubles, floats, ints, long longs, and strings, using our own string type. Following the principle of information hiding, we have put our definitions in a separate .c implementation file--cs50.c--located in the user source directory. We've provided that file so that you can take a look at it, learn from it, and recompile it on different machines if you wish, even though we think it's better to work on the appliance for this class. Anyway, let's take a look at it now. The functions GetChar, GetDouble, GetFloat, GetInt, and GetLongLong are all built on top of the GetString function. It turns out that they all follow essentially the same pattern. They use a while loop to prompt the user for one line of input. They return a special value if the user inputs an empty line. They attempt to parse the user's input as the appropriate type, be it a char, a double, a float, etc. And then they either return the result if the input was successfully parsed or they reprompt the user. At a high level, there is nothing really tricky here. You might have written similarly structured code yourself in the past. Perhaps the most cryptic-looking part is the sscanf call that parses the user's input. Sscanf is part of the input format conversion family. It lives in stdio.h, and its job is to parse a C string according to a particular format, storing the parse results in variables provided by the caller. Since the input format conversion functions are very useful, widely used functions that aren't super intuitive at first, we'll go over how sscanf works. The first argument to sscanf is a char*--a pointer to a character. For the function to work properly, that character should be the first character of a C string, terminated with the null character, \0.
This is the string to parse. The second argument to sscanf is a format string, typically passed in as a string constant, and you might have seen a string like this before when using printf. A percent sign in the format string indicates a conversion specifier. The character immediately following a percent sign indicates the C type that we want sscanf to convert to. In GetInt, you see that there is a %d and a %c. This means that sscanf will try to parse a decimal int--the %d--and a char--the %c. For each conversion specifier in the format string, sscanf expects a corresponding argument later in its argument list. That argument must point to an appropriately typed location in which to store the result of the conversion. The typical way of doing this is to create a variable on the stack before the sscanf call for each item that you want to parse from the string and then use the address operator--the ampersand--to pass pointers to those variables to the sscanf call. You can see that in GetInt we do exactly this. Right before the sscanf call, we declare an int called n and a char called c on the stack, and we pass pointers to them into the sscanf call. Putting these variables on the stack is preferred over using space allocated on the heap with malloc, since you avoid the overhead of the malloc call, and you don't have to worry about leaking memory. Characters not prefixed by a percent sign don't prompt conversion. Rather, they just add to the format specification. For example, if the format string in GetInt were "a%d" instead, sscanf would look for the letter a followed by an int, and while it would attempt to convert the int, it wouldn't do anything else with the a. The only exception to this is whitespace. Whitespace characters in the format string match any amount of whitespace--even none at all. So, that's why the comment mentions possibly with leading and/or trailing whitespace.
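The call described above can be sketched as follows. This is a minimal illustration, not the library's own code: the helper name parse_int_then_char and its out-parameters are hypothetical, and the format string " %d %c" is an assumption based on the description of a %d and a %c separated by optional whitespace.

```c
#include <stdio.h>

// Hypothetical helper (not part of the CS50 library): tries to parse an int,
// then optional whitespace, then a single char from line.
// Returns whatever sscanf returns.
int parse_int_then_char(const char *line, int *n_out, char *c_out)
{
    int n = 0;      // stack variable that receives the %d conversion
    char c = '\0';  // stack variable that receives the %c conversion

    // Each space in the format matches any amount of whitespace, even none.
    // The & operator passes pointers to n and c so sscanf can store results.
    int converted = sscanf(line, " %d %c", &n, &c);

    *n_out = n;
    *c_out = c;
    return converted;
}
```

Called on "  42   x", this helper stores 42 and 'x'; called on "7y", it stores 7 and 'y', since the whitespace in the format is allowed to match nothing.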
So, at this point it looks like our sscanf call will try to parse the user's input string by checking for possible leading whitespace, followed by an int that will be converted and stored in the int variable n, followed by some amount of whitespace, and followed by a character stored in the char variable c. What about the return value? Sscanf will parse the input line from start to finish, stopping when it reaches the end or when a character in the input doesn't match a format character or when it can't make a conversion. Its return value is used to signal when it stopped. If it stopped because it reached the end of the input string before making any conversions and before failing to match part of the format string, then the special constant EOF is returned. Otherwise, it returns the number of successful conversions, which could be 0, 1, or 2, since we've asked for two conversions. In our case, we want to make sure that the user typed in an int and only an int. So, we want sscanf to return 1. See why? If sscanf returned 0, then no conversions were made, so the user typed something other than an int at the beginning of the input. If sscanf returns 2, then the user did properly type an int at the beginning of the input, but they then typed in some non-whitespace character afterwards since the %c conversion succeeded. Wow, that's quite a lengthy explanation for one function call. Anyway, if you want more information on sscanf and its siblings, check out the man pages, Google, or both. There are lots of format string options, and these can save you a lot of manual labor when trying to parse strings in C. The final function in the library to look at is GetString. It turns out that GetString is a tricky function to write properly, even though it seems like such a simple, common task. Why is this the case? Well, let's think about how we're going to store the line that the user types in.
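As a brief aside before continuing, the sscanf return-value cases described earlier in this passage can be checked with a small sketch. The function name parse_int_only is hypothetical and the format string " %d %c" is assumed from the description. It is not the library's GetInt, which also prompts and reprompts the user.

```c
#include <stdbool.h>
#include <stdio.h>

// Hypothetical helper (not part of the CS50 library): succeeds only when
// line holds an int and nothing else, apart from surrounding whitespace.
bool parse_int_only(const char *line, int *out)
{
    int n;
    char c;

    // EOF: input ended before any conversion (for example, an empty line).
    // 0:   the input does not start with an int.
    // 1:   an int was converted and no non-whitespace char followed it.
    // 2:   an int was converted, but a stray char followed it.
    if (sscanf(line, " %d %c", &n, &c) == 1)
    {
        *out = n;
        return true;
    }
    return false;
}
```

With this sketch, "42" and "  42  " are accepted, while "42x", "abc", and the empty string are rejected, so a caller could reprompt in those cases.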
Since a string is a sequence of chars, we might want to store it in an array on the stack, but we would need to know how long the array is going to be when we declare it. Likewise, if we want to put it on the heap, we need to pass to malloc the number of bytes we want to reserve, but this is impossible. We have no idea how many chars the user will type in before the user actually does type them. A naive solution to this problem is to just reserve a big chunk of space, say, a block of 1000 chars for the user's input, assuming that the user would never type in a string that long. This is a bad idea for two reasons. First, assuming that users typically don't type in strings that long, you could waste a lot of memory. On modern machines, this might not be an issue if you do this in one or two isolated instances, but if you're taking users' input in a loop and storing it for later use, you can quickly suck up a ton of memory. Additionally, if the program you're writing is for a smaller computer--a device like a smartphone or something else with limited memory--this solution will cause problems a lot faster. The second, more serious reason to not do this is that it leaves your program vulnerable to what's called a buffer overflow attack. In programming, a buffer is memory used to temporarily store input or output data, which in this case is our 1000-char block. A buffer overflow occurs when data is written past the end of the block, for example, if a user actually does type in more than 1000 chars. You might have experienced this accidentally when programming with arrays. If you have an array of 10 ints, nothing stops you from trying to read or write the 15th int. There are no compiler warnings or errors. The program just blunders straight ahead and accesses the memory where it thinks the 15th int will be, and this can overwrite your other variables.
In the worst case, you can overwrite some of your program's internal control mechanisms, causing your program to actually execute different instructions than you intended. Now, it's not common to do this accidentally, but this is a fairly common technique that bad guys use to break programs and put malicious code on other people's computers. Therefore, we can't just use our naive solution. We need a way to prevent our programs from being vulnerable to a buffer overflow attack. To do this, we need to make sure that our buffer can grow as we read more input from the user. The solution? We use a heap-allocated buffer, since we can resize it using the realloc function, and we keep track of two numbers--the index of the next empty slot in the buffer and the length or capacity of the buffer. We read in chars from the user one at a time using the fgetc function. The argument the fgetc function takes--stdin--is a reference to the standard input stream, which is a preconnected input channel that is used to transfer the user's input from the terminal to the program. Whenever the user types in a new character, we check to see if the index of the next free slot plus 1 is greater than the capacity of the buffer. The +1 comes in because if the next free index is 5, then our buffer's length must be 6 thanks to 0 indexing. If we've run out of space in the buffer, then we attempt to resize it, doubling it so that we cut down on the number of times that we resize if the user is typing in a really long string. If the string has gotten too long or if we run out of heap memory, we free our buffer and return null. Finally, we append the char to the buffer. Once the user hits enter or return, signalling a new line, or the special char--control d--which signals an end of input, we do a check to see if the user actually typed in anything at all. If not, we return null.
Otherwise, because our buffer is probably larger than we need, in the worst case it's almost twice as large as we need since we double every time we resize, we make a new copy of the string using just the amount of space that we need. We add an extra 1 to the malloc call, so that there's space for the special null terminator character--the \0, which we append to the string once we copy in the rest of the characters, using strncpy instead of strcpy so that we can specify exactly how many chars we want to copy. Strcpy copies until it hits a \0. Then we free our buffer and return the copy to the caller. Who knew such a simple-seeming function could be so complicated? Now you know what goes into the CS50 library. My name is Nate Hardison, and this is CS50. [CS50.TV]
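The GetString steps described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the library's actual source: the name read_line is hypothetical, it takes a FILE * parameter instead of always reading stdin so that it can be exercised on any stream, and the initial capacity of 32 is an arbitrary choice.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

// Hypothetical sketch of a GetString-style reader (not the CS50 source).
// Returns a heap-allocated line without its newline, or NULL if nothing
// was typed or memory ran out. The caller frees the result.
char *read_line(FILE *stream)
{
    char *buffer = NULL;  // growable heap buffer
    size_t capacity = 0;  // current capacity of buffer, in chars
    size_t n = 0;         // index of the next free slot
    int c;

    // Read one char at a time until newline or end of input.
    while ((c = fgetc(stream)) != '\n' && c != EOF)
    {
        if (n + 1 > capacity)
        {
            // Grow by doubling; 32 is an assumed starting size.
            size_t new_capacity;
            if (capacity == 0)
                new_capacity = 32;
            else if (capacity <= SIZE_MAX / 2)
                new_capacity = capacity * 2;
            else
            {
                free(buffer);  // string has gotten too long
                return NULL;
            }

            char *temp = realloc(buffer, new_capacity);
            if (temp == NULL)
            {
                free(buffer);  // out of heap memory
                return NULL;
            }
            buffer = temp;
            capacity = new_capacity;
        }
        buffer[n++] = (char) c;  // append the char
    }

    // Nothing typed at all: return NULL, as described above.
    if (n == 0)
    {
        free(buffer);
        return NULL;
    }

    // Copy into a block of exactly the right size, plus 1 for the \0.
    char *line = malloc(n + 1);
    if (line == NULL)
    {
        free(buffer);
        return NULL;
    }
    strncpy(line, buffer, n);  // buffer has no \0, so give an exact count
    line[n] = '\0';
    free(buffer);
    return line;
}
```

A caller might write char *s = read_line(stdin); and check for NULL before using s. Keeping the realloc result in a temporary pointer means the old buffer can still be freed if the resize fails.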