Andrew Sellergren
crayon
).string
to be a char *
until we could reveal to you what a pointer is. Going forward, we’ll rely less on getting input from the command-line prompt and more on getting input from files and network connections.The CS50 Library includes a number of other libraries, including stdbool.h
, which contains the definitions of true
and false
, as well as stdlib.h
, which contains the definition of malloc
. The following line in cs50.h
defines the string
data type as a char *
:
typedef char* string;
cs50.h
is actually an example of a header file
. Thus far, we’ve been writing our prototypes at the tops of our files. However, we could also have written these prototypes in separate files named with the .h
extension. The definitions of the functions themselves would then live in a file of the same name but with the .c
extension.GetInt
notes that if a line can’t be read, it will return INT_MAX
. In the case that the user types a number that is much too large to be stored in 32 bits (a googol, for example), the function will return this sentinel value. As you might guess, INT_MAX
is a constant representing the largest possible 32-bit signed integer, or around 2 billion. We use this number to signify that something went wrong. Knowing this, your programs should actually be checking to see if the return value of GetInt
is equal to INT_MAX
, which would indicate that an error occurred.cs50.c
is actually pre-compiled into the Appliance so that you need only include cs50.h
and link to it at compile-time if you want to use its functions.The implementation of GetInt
is as follows:
/*
* Reads a line of text from standard input and returns it as an
* int in the range of [-2^31 + 1, 2^31 - 2], if possible; if text
* does not represent such an int, user is prompted to retry. Leading
* and trailing whitespace is ignored. For simplicity, overflow is not
* detected. If line can't be read, returns INT_MAX.
*/
int GetInt(void)
{
// try to get an int from user
while (true)
{
// get line of text, returning INT_MAX on failure
string line = GetString();
if (line == NULL)
return INT_MAX;
// return an int if only an int (possibly with
// leading and/or trailing whitespace) was provided
int n; char c;
if (sscanf(line, " %d %c", &n, &c) == 1)
{
free(line);
return n;
}
else
{
free(line);
printf("Retry: ");
}
}
}
GetString
returns NULL
on failure, so we first check that here and return INT_MAX
to pass the error along.GetString
has returned a non-NULL
value, we pass the user’s input, stored in line
, to a function sscanf
which looks for format strings within it. You can think of sscanf
as the opposite of printf
: it reads in as opposed to writes out. In this case, we’re looking for an integer followed by a character. What we’re checking, however, is if only an integer was found. In that case, sscanf
will return 1, the if condition will evaluate to true, and GetInt
will return the integer. If both an integer and a character are captured, sscanf
will return 2 and the user will be asked to retry. This is a clever bit of error detection: if the user types “123 f,” then both n
and c
will be filled and we will ask him to retry; if the user types “123,” then only n
will be filled and GetInt
will return the integer 123. As you can infer, sscanf
returns the number of placeholders it filled. Because we put a space before and after the %d
, the user is allowed to type a space when entering an integer; sscanf
will ignore it. If the user types only a character, sscanf
will fill neither n
nor c
so it will return 0. Note we’re passing pointers to n
and c
so that the memory they point to can actually be filled by sscanf
.GetInt
and many other CS50 Library functions call GetString
to do their dirty work. Let’s take a look at the implementation of GetString
:
/*
* Reads a line of text from standard input and returns it as a
* string (char*), sans trailing newline character. (Ergo, if
* user inputs only "\n", returns "" not NULL.) Returns NULL
* upon error or no input whatsoever (i.e., just EOF). Leading
* and trailing whitespace is not ignored. Stores string on heap
* (via malloc); memory must be freed by caller to avoid leak.
*/
string GetString(void)
{
// growable buffer for chars
string buffer = NULL;
// capacity of buffer
unsigned int capacity = 0;
// number of chars actually in buffer
unsigned int n = 0;
// character read or EOF
int c;
// iteratively get chars from standard input
while ((c = fgetc(stdin)) != '\n' && c != EOF)
{
// grow buffer if necessary
if (n + 1 > capacity)
{
// determine new capacity: start at 32 then double
if (capacity == 0)
capacity = 32;
else if (capacity <= (UINT_MAX / 2))
capacity *= 2;
else
{
free(buffer);
return NULL;
}
// extend buffer's capacity
string temp = realloc(buffer, capacity * sizeof(char));
if (temp == NULL)
{
free(buffer);
return NULL;
}
buffer = temp;
}
// append current character to buffer
buffer[n++] = c;
}
// return NULL if user provided no input
if (n == 0 && c == EOF)
return NULL;
// minimize buffer
string minimal = malloc((n + 1) * sizeof(char));
strncpy(minimal, buffer, n);
free(buffer);
// terminate string
minimal[n] = '\0';
// return string
return minimal;
}
c
. Although this will store a character, which only requires a maximum of 256, we’ll use an int
to store is so that we can use one additional value as an error indicator.fgetc
with stdin
as a single argument. stdin
represents keyboard input, as opposed to stdout
, which represents output to the screen, and stderr
, which represents a special type of output to the screen. fgetc
reads a single charater of keyboard input and returns it to be stored in c
. We then check that c
doesn’t equal the newline character (if the user has hit Enter) or the EOF
(end-of-file) character (if the user has hit Ctrl+D).n
holds the number of characters that the user has entered. If n
is greater than capacity
, then we have more characters to store than memory in which to store them. Thus, we need to ask the operating system for some more using malloc
. Actually, we also use a function named realloc
to ask the operating system for memory at the end of the currently allocated string, if possible.GetString
, you’ll see that it doubles the size of the memory chunk until there’s enough to store the user’s input. This doubling will cause the size of the memory chunk to grow very rapidly meaning that realloc
, which tends to be slow, will only need to be called a few times. At the end of the function, we ask for a memory chunk that is exactly the size that we need so we don’t waste memory.Now that we know more about memory allocation in C, we need to begin to think about how it can be abused. Consider the following snippet of code:
#include <string.h>
void foo(char* bar)
{
char c[12];
memcpy(c, bar, strlen(bar));
}
int main(int argc, char* argv[])
{
foo(argv[1]);
return 0;
}
char
might be more than 1 byte. Second, we don’t account for running the program without providing a command-line argument.memcpy
takes as its third argument the maximum number of characters to copy. However, if bar
is longer than 12, then we’ll be writing more than 12 characters into only 12 bytes of memory. This is what’s generally known as a buffer overrun attack.foo
as defined above:main
calls foo
, it informs foo
of its return address. That way, when foo
finishes executing, it knows where in memory to jump to continue the program’s execution. This is represented by the red block in the diagram above.c
is growing downward. If we write more than 12 bytes to it, we might overwrite bar
, something called the saved frame pointer, and the return address. If this return address were to be overwritten with a valid memory address of malicious code, then foo
will execute this code when it returns. We might represent this like so:foo
returns, it will jump to this address in memory, which is actually the start of the block of blue A’s at the top, the same block where this malicious user’s input was stored. Thus, the malicious user can type in some attack code followed by this memory address so that when foo
returns, it will jump to the beginning of this block of blue A’s and execute the attack code. Yikes!