Week 4, continued

Files, Headers, and Hex00:00

One of the topics for today is digital forensics, recovering information, and Problem Set 4 will be in this domain.
David used to work in the district attorney’s office, attempting to find evidence from hard drive and memory cards that the police brought in.
- The portrayal of this process in TV and movies might look like this, where characters, in clips from various shows and films, say phrases like "zoom" and "enhance," that magically cause images to reveal details previously unseen.
But when we really try to "enhance" images, like that of Daven, we eventually see the pixels that compose the image, because there are only a finite of bits in the image. (The bad guy in the reflection in his eye will only be 6 pixels, no matter how far we try to zoom!)
We’ll explore this in Problem Set 4 with file I/O, where I/O just means input/output. None of the programs we’ve worked with so far have been saving to disk by creating or changing files.
An example of a file is a JPEG, which is simply an image file.
What’s interesting is that files can typically be identified by certain patterns of bits. Different files, like a JPEG or a PNG (image file) or a GIF (image file) or a Word document or a Excel spreadsheet, will have different patterns of bits, and those patterns are usually at the top of the file, so when a computer opens it, it can recognize, say, a JPEG as an image, and display it to the user as a graphic. Or, it might look like a Word doc, so let’s show it to the user as an essay.
For instance, the first three bytes of a JPEG are:
```
              255  216  255
```
Next week we’ll be poking more deeply into files to see what’s always been there.
But first, realize that "what’s there" generally isn’t written in decimal numbers like above.
Computer scientists actually tend to express numbers in hexadecimal, as opposed to decimal or binary.
Recall that decimal uses 10 digits, 0-9, while binary is composed of 2 digits, 0 and 1.
Hexadecimal means that we will have 16 such digits, 0-9 and a, b, c, d, e, f.
- "a" is 10, "b" is 11, and so on.

How can this be useful? Well let’s write out the bits that represent these numbers:

             255    216     255
        11111111  11011000  11111111

This is interesting because a byte has 8 bits, and if we break each byte into two chunks of 4 bits, each set of 4 bits will correspond to exactly one hexadecimal digit:
```
              255     216     255
        1111 1111  1101 1000  1111 1111
           f    f     d 8     f    f
```

To make this more readable, we remove the whitespace and add 0x, just to signify that the characters in the last row are in hexadecimal:

              255     216     255
        1111 1111  1101 1000  1111 1111
           f    f     d 8     f    f
             0xff     0xd8    0xff

Note that we can also convert two hexadecimal digits to 8 bits in binary, or one byte, making it especially useful for representing binary data.
Another image file format is a bitmap file, BMP. One example of an image in that format is bliss.bmp, a very familiar rolling green hill set against a blue cloudy sky (the default Windows XP wallpaper on millions of PCs).
- A bitmap is just a pattern of pixels, or dots, a "map of bits," if you will.
What’s interesting, though, is that its beginnings are more than just a few bytes. Its header has a whole bunch of numbers, bytes, with their orders and values determined years ago by its author, Microsoft. Indeed, Microsoft has named the types of those values things like WORD and DWORD and LONG, but those are simply data types like int, different names for the same thing.
So when someone clicks on a BMP file, the image is only shown because the operating system (or image-viewing program, really) noticed all of these bits at the beginning of the file and recognized that it was a BMP. More on this later.

Structs09:50

Let’s look at a simpler file first:

 1#include <cs50.h>
 2#include <stdio.h>
 3#include <string.h>
 4
 5#include "structs.h"
 6
 7// number of students
 8#define STUDENTS 3
 9
10int main(void)
11{
12    // TODO
13}

Let’s say we want to start creating a database of every student, and start by saving the name of a student and their house.

We see that there are 3 such STUDENTS, so we can probably use something like a for loop to GetString a name and house for each of them. And we can start by instinctively:

 1#include <cs50.h>
 2#include <stdio.h>
 3#include <string.h>
 4
 5#include "structs.h"
 6
 7// number of students
 8#define STUDENTS 3
 9
10int main(void)
11{
12    string names[STUDENTS];
13    string houses[STUDENTS];
14
15    for (int i = 0; i < STUDENTS; i++)
16    {
17        names[i] = GetString();
18        houses[i] = GetString();
19    }
20
21    // TODO later...
22
23}

So this is correct, as we create arrays to store the names and houses, and iterate through for the number of students, even if it’s not very user-friendly.

But this is not very good design. What if they had an ID number or email or Twitter handle, or more details to associate? To add it we’d have to do something like this:

 1#include <cs50.h>
 2#include <stdio.h>
 3#include <string.h>
 4
 5#include "structs.h"
 6
 7// number of students
 8#define STUDENTS 3
 9
10int main(void)
11{
12    string names[STUDENTS];
13    string houses[STUDENTS];
14    int ids[STUDENTS];
15    string twitters[STUDENTS];
16    ...
17}

But soon, if we keep adding, we’ll have something pretty big and unwieldy.

We can use a higher-level data structure to hold something of a type student, and we see an example of this in structs.h:
```
 1#include <cs50.h>
 2
 3// structure representing a student
 4typedef struct
 5{
 6    string name;
 7    string house;
 8}
 9student;
```
- In fact, we’ve already been using structs in Problem Set 3, since there’s no such thing as a GRect or GOval in C. The SPL has those data types, implemented with the approach shown above.
The keywords typedef and struct on line 4 just mean define a type — a structure — that is a container for multiple things, and inside that structure will be a string called name and a string called house, and the entire structure will be called student for convenience.
student is now a data type just like int and string and GRect and others.

Now we can do something like this, in structs-0.c:

 1#include <cs50.h>
 2#include <stdio.h>
 3#include <string.h>
 4
 5#include "structs.h"
 6
 7// number of students
 8#define STUDENTS 3
 9
10int main(void)
11{
12    // declare students
13    student students[STUDENTS];
14...

Note that we have one array named students, with each element of the type student. There are STUDENTS (which we’ve defined in line 8 to be 3) elements in the students array.

How do we access name and house and other fields, or items, in a struct? Well we scroll down in structs-0.c:

 1...
 2int main(void)
 3{
 4    // declare students
 5    student students[STUDENTS];
 6
 7    // populate students with user's input
 8    for (int i = 0; i < STUDENTS; i++)
 9    {
10        printf("Student's name: ");
11        students[i].name = GetString();
12
13        printf("Student's house: ");
14        students[i].house = GetString();
15    }
16...

We index into the array in line 11, and use a new syntax of .name to get the field called name.
We’ll return to structs soon!
- As an aside, the scene in the Windows XP wallpaper is now yellow and gloomy - or it was when someone went back to get another photo!

Quick Reminder15:21

Speaking of images, here’s another one of Daven at Fire and Ice, a reminder that CS50 Lunch is Friday at 1:15pm as usual, RSVP at http://cs50.harvard.edu/rsvp.

GDB15:31

So where did we leave off on Monday? We introduced this problem of swapping two variables a and b with a temporary variable called tmp:
```
void swap(int a, int b)
{
    int tmp = a;
    a = b;
    b = tmp;
}
```
- But remember that the problem is that it only swaps the variables locally, in the function’s own memory (recall the trays from Annenberg that represent each function’s slice of memory, and how the swap function doesn’t have access to the variables in main, but rather copies).

Let’s open our friend noswap.c:

 1/**
 2 * noswap.c
 3 *
 4 * David J. Malan
 5 * malan@harvard.edu
 6 *
 7 * Should swap two variables' values, but doesn't!  How come?
 8 */
 9
10#include <stdio.h>
11
12void swap(int a, int b);
13
14int main(void)
15{
16    int x = 1;
17    int y = 2;
18
19    printf("x is %i\n", x);
20    printf("y is %i\n", y);
21    printf("Swapping...\n");
22    swap(x, y);
23    printf("Swapped!\n");
24    printf("x is %i\n", x);
25    printf("y is %i\n", y);
26}
27
28/**
29 * Fails to swap arguments' values.
30 */
31void swap(int a, int b)
32{
33    int tmp = a;
34    a = b;
35    b = tmp;
36}

In lines 16 and 17 we initialized x and y, had printfs, and then called swap on line 22. swap, on line 31, might look correct on first glance, but isn’t.

Let’s warm up by investigating with our new friend gdb:

jharvard@appliance (~/Dropbox/src4w): gdb ./noswap
Reading symbols from ./noswap...done.

Now let’s just run it:

(gdb) run
Starting program: /home/jharvard/noswap
x is 1
y is 2
Swapping...
Swapped!
x is 1
y is 2
[Inferior 1 (process 4922) exited normally]

That wasn’t quite so useful, since it just ran the program. Let’s pause execution with break, and we can specify a line like break 10, but let’s just break at the main function:
```
(gdb) break main
Breakpoint 1 at 0x80484ac: file noswap.c, line 16.
```
- Indeed, if we glance back at our source code, line 16 is the first command in main.

Now we can run the program again:

(gdb) run
Starting program: /home/jharvard/noswap
Breakpoint 1, main () at noswap.c:16
16      int x = 1;

And it’s paused. Let’s print x:
```
(gdb) print x
$1 = 0
```
- But it’s 0. gdb has paused execution to just before line 16, so x has no assigned value but rather whatever was left in memory prior (a garbage value), and here we got lucky with a clean value of 0.

Let’s say next and print x again:

(gdb) next
17      int y = 2;
(gdb) print x
$2 = 1

And indeed we see 1. What if we print y?
```
(gdb) print y
$3 = 134514064
```
We see another garbage value as expected, with those bits from some other program that used the memory last to store something. Not to worry, we can proceed:
```
(gdb) next
19      printf("x is %i\n", x);
(gdb) print y
$4 = 2
```

And y is 2 as expected. Moving on:

(gdb) n
x is 1
20      printf("y is %i\n", y);
(gdb) n
y is 2
21      printf("Swapping...\n");
(gdb) n
Swapping...
22      swap(x, y);

But now we want to go inside swap, so we use the step command:

(gdb) step
swap (a=1, b=2) at noswap.c:33
33      int tmp = a;
(gdb) print tmp
$5 = -1209908752

tmp has a garbage value again, but we can see it’s correct after we initialize it.
And remember we have a and b from the source code where swap is declared as void swap(int a, int b), so it refers to the values passed in as a and b, and remember that those are copies of x and y held by main:
```
(gdb) next
34      a = b;
(gdb) print tmp
$6 = 1
(gdb) print a
$7 = 1
(gdb) next
35      b = tmp;
(gdb) print a
$8 = 2
```
Now we see that a is 2, after we said next and executed line 34. We can also check that b is 1 and tmp is still there:
```
(gdb) next
36  }
(gdb) print a
$9 = 2
(gdb) print b
$10 = 1
(gdb) print tmp
$11 = 1
```

Let’s say continue to finish the program:

(gdb) continue
Continuing.
Swapped!
x is 1
y is 2
[Inferior 1 (process 4946) exited normally]
(gdb)

gdb didn’t fix the problem, but helped us realize that our code didn’t have an impact.

Strings20:35

So let’s solve this problem. We’re now peeling back a layer of abstraction, and realizing that a string doesn’t actually exist, and instead is a char* with a name of string.

Let’s open compare-0.c:

 1#include <cs50.h>
 2#include <stdio.h>
 3
 4int main(void)
 5{
 6    // get line of text
 7    printf("Say something: ");
 8    string s = GetString();
 9
10    // get another line of text
11    printf("Say something: ");
12    string t = GetString();
13
14    // try (and fail) to compare strings
15    if (s == t)
16    {
17        printf("You typed the same thing!\n");
18    }
19    else
20    {
21        printf("You typed different things!\n");
22    }
23}

We get two strings, and try to compare them. We make it and try it out:

jharvard@appliance (~/Dropbox/src4w): make compare-0
clang -ggdb3 -O0 -std=c99 -Wall -Werror    compare-0.c  -lcs50 -lm -o compare-0
jharvard@appliance (~/Dropbox/src4w): ./compare-0
Say something: Daven
Say something: Rob
You typed different things!
jharvard@appliance (~/Dropbox/src4w): ./compare-0
Say something: Gabe
Say something: Gabe
You typed different things!
jharvard@appliance (~/Dropbox/src4w): ./compare-0
Say something: Zamyla
Say something: Zamyla
You typed different things!
jharvard@appliance (~/Dropbox/src4w):

So what’s going on? The first time we got what we expected, but then we passed in two strings that were the same! All we’re doing in the source code is getting them and checking if they are ==.
Well it turns out that, when we say something like:
```
string s = GetString();
  -----
  |   |
  -----
```
- where the box below s is storing s.

Now GetString, in our first attempt, returned to us Daven, and also a \0, like so:

string s = GetString();
  -----    -------------------------
  |   |    | D | a | v | e | n |\0 |
  -----    -------------------------

It looks like Daven\0 is made up of many bytes, and if a string is only 4 bytes, 32 bits, how can we fit the entire string inside s? Well, the row of squares containing Daven\0 are just bytes in memory. We can think of each byte as having a certain address, just as buildings might have an address like 33 Oxford Street, 34 Oxford Street, or 35 Oxford Street, and here, as an example, we start numbering each byte in memory:
```
string s = GetString();
  -----    -------------------------
  |   |    | D | a | v | e | n |\0 |
  -----    -------------------------
            0x1 0x2 0x3 0x4 0x5 0x6
```
And now, we can sort of guess that GetString returns not the entire string, but rather the address to the string: Daven "lives" at 0x1. So s only contains the address to that string:
```
string s = GetString();
  -----    -------------------------
  |0x1|    | D | a | v | e | n |\0 |
  -----    -------------------------
            0x1 0x2 0x3 0x4 0x5 0x6
```
And this has been going on since we first introduced a string. All we get when we ask for a string is the location of where it begins.
- Incidentally, programmers can put any address into a variable and try to jump to that area in memory, and we’ll see how that could be problematic next time.
But how do we know where a string ends, and the next string begins? Well, the \0, special null character, tells us when a string ends.

So now if we look at the code of compare.c:

...
    // try (and fail) to compare strings
    if (s == t)
    {
        printf("You typed the same thing!\n");
    }
...

we see that this fails since s and t are pointing to different addresses, since t is another string, and we’re comparing the locations rather than the first character of each one, then the next, and so on:

string s = GetString();
  -----    -------------------------
  |0x1|    | D | a | v | e | n |\0 |
  -----    -------------------------
            0x1 0x2 0x3 0x4 0x5 0x6

string t = GetString();
  -----    -------------------------
  |...|    |   |   |   |   |   |   |
  -----    -------------------------
            ...

So let’s fix this problem. If we had to implement it ourselves, we might compare letters in the two strings, one at a time, until we reached the end of one or both of them. But we don’t need to, thanks to the strcmp function as shown in compare-1.c:

 1#include <cs50.h>
 2#include <stdio.h>
 3#include <string.h>
 4
 5int main(void)
 6{
 7    // get line of text
 8    printf("Say something: ");
 9    char* s = GetString();
10
11    // get another line of text
12    printf("Say something: ");
13    char* t = GetString();
14
15    // try to compare strings
16    if (s != NULL && t != NULL)
17    {
18        if (strcmp(s, t) == 0)
19        {
20            printf("You typed the same thing!\n");
21        }
22        else
23        {
24            printf("You typed different things!\n");
25        }
26    }
27}

Notice that we use strcmp in line 18, which will return a negative number, or a positive number, or zero. Zero would mean that they are equal, and a positive or negative number would mean something like greater than or less than, if you wanted to alphabetize those strings.
We’ve also switched to char* in lines 9 and 13, and really that’s the same as string, which we made up. For now, just know that the * means the address of something, and so a char*, as opposed to int*, means an address of a char.

Going back to the board,

string s = GetString();
  -----    -------------------------
  |0x1|    | D | a | v | e | n |\0 |
  -----    -------------------------
            0x1 0x2 0x3 0x4 0x5 0x6

string t = GetString();
  -----    -------------------------
  |...|    |   |   |   |   |   |   |
  -----    -------------------------
            ...

that box with 0x1 is really a char*.

Let’s open copy-0.c:

 1#include <cs50.h>
 2#include <ctype.h>
 3#include <stdio.h>
 4#include <string.h>
 5
 6int main(void)
 7{
 8    // get line of text
 9    printf("Say something: ");
10    string s = GetString();
11    if (s == NULL)
12    {
13        return 1;
14    }
15
16    // try (and fail) to copy string
17    string t = s;
18
19    // change "copy"
20    printf("Capitalizing copy...\n");
21    if (strlen(t) > 0)
22    {
23        t[0] = toupper(t[0]);
24    }
25
26    // print original and "copy"
27    printf("Original: %s\n", s);
28    printf("Copy:     %s\n", t);
29
30    // success
31    return 0;
32}

So in lines 10-14 we get a string s and checks that it’s not NULL in case something went wrong. Otherwise, we might start going to invalid addresses in memory, and cause more and more problems.
Let’s try to copy the string in line 17, and capitalize the first character of t, t[0], in line 23. And then we print them.
But what’s really happening, and where is the bug? Let’s go back to line 17, where we set string t to s:
```
string t  =  s;
  ------     ------
  |0x50|     |0x50|
  ------     ------
```

So we’re setting t to point to the same address as s, but that just means when we change t[0], the first letter in t, we also change s[0] since s points to the same thing:

string t  =  s;
  ------     ------
  |0x50|     |0x50|
  ------     ------

             ---------------------------------
        ...  | g | a | b | e |\0 |   |   |   |
             ---------------------------------
             0x50

Let’s run copy-0:

jharvard@appliance (~/Dropbox/src4w): ./copy-0
Say something: gabe
Capitalizing copy...
Original: Gabe
Copy:     Gabe

Memory Allocation33:32

Hm, we can address this problem with copy-1.c:

 1#include <cs50.h>
 2#include <ctype.h>
 3#include <stdio.h>
 4#include <string.h>
 5
 6int main(void)
 7{
 8    // get line of text
 9    printf("Say something: ");
10    char* s = GetString();
11    if (s == NULL)
12    {
13        return 1;
14    }
15
16    // allocate enough space for copy
17    char* t = malloc((strlen(s) + 1) * sizeof(char));
18    if (t == NULL)
19    {
20        return 1;
21    }
22
23    // copy string, including '\0' at end
24    for (int i = 0, n = strlen(s); i <= n; i++)
25    {
26        t[i] = s[i];
27    }
28
29    // change copy
30    printf("Capitalizing copy...\n");
31    if (strlen(t) > 0)
32    {
33        t[0] = toupper(t[0]);
34    }
35
36    // print original and copy
37    printf("Original: %s\n", s);
38    printf("Copy:     %s\n", t);
39
40    // free memory
41    free(s);
42    free(t);
43
44    // success
45    return 0;
46}

This looks really complicated, but let’s talk about the concept first. We’ll use a loop to copy it character by character, but now we need to explicitly allocate memory for t:

string s = GetString();
  ------   ---------------------
  |0x50|   | g | a | b | e |\0 |
  ------   ---------------------
           0x50

string t
  ------   ---------------------
  |    |   |                   |
  ------   ---------------------

We declared a string t, but how do we assign a value to it? Well, let’s look at line 17, reproduced below:
```
char* t = malloc((strlen(s) + 1) * sizeof(char));
```
- We started by writing char* t, which is creating a variable t that will store an address to a character, creating that left-hand box for t. Then we call malloc, a function that allocates a certain amount of memory. The argument we pass in is the size of the memory that we want, which is strlen(s), the number of characters in s, + 1 for the \0 character ending the string, all multiplied by the sizeof a char. (We know a character is one byte, but that may, rarely, vary from computer to computer.) Finally, malloc will return the address of that chunk of memory, which might be anywhere:
  string s = GetString(); ------ --------------------- |0x50| | g | a | b | e |\0 | ------ --------------------- 0x50 string t ------ --------------------- |0x88| | | ------ --------------------- 0x88
Now we can access the memory as an array in a for loop, reproduced below:
```
1    // copy string, including '\0' at end
2    for (int i = 0, n = strlen(s); i <= n; i++)
3    {
4        t[i] = s[i];
5    }
```
- We can do this because each string is stored with characters next to one another, so we can access them with this array notation.
To recap, a string all this time was just an address of a character, a pointer, which in turn is just a number, that we conventionally write in hexadecimal.
We also check if t == NULL because we might ask for more memory than malloc is able to give.
And one final thing, if we return to what we were just looking at, we can replace line 4 below with line 5:
```
1    // copy string, including '\0' at end
2    for (int i = 0, n = strlen(s); i <= n; i++)
3    {
4        // t[i] = s[i];
5        *(t + i) = *(s + i);
6    }
```
- The * symbol can actually be used for two purposes. We’ve seen char* t = … which is declaring that t is a pointer to a char, but if we use * without a word like char in front of it, it becomes a dereference operator. That just means "go there" - if an address, like 33 Oxford Street, was written on paper like *(33 Oxford Street), then we would just go there.
- t is the address of the new piece of memory, and s is the address of the original piece, and i goes from 0 to 1 to 2 to 3 etc, so t + i is just another number, since these are all addresses with number values.
- So on the first pass of the loop, with i = 0, we’re going to copy g from 0x50 to 0x88:
  string s = GetString(); ------ --------------------- |0x50| | g | a | b | e |\0 | ------ --------------------- 0x50 string t ------ --------------------- |0x88| | g | | | | | ------ --------------------- 0x88
- On the next pass, i = 1, we’ll copy a from 0x50 + 1, 0x51, to 0x88 + 1, 0x89, and you can see how it’s going to proceed:
  string s = GetString(); ------ --------------------- |0x50| | g | a | b | e |\0 | ------ --------------------- 0x50 string t ------ --------------------- |0x88| | g | a | | | | ------ --------------------- 0x88
Let’s look at a final program:
```
int main(void)
{
    int* x;
    int* y;

    x = malloc(sizeof(int));

    *x = 42;

    *y = 13;

    y = x;

    *y = 13;
}
```
- It first declares two variables, x and y that aren’t integers, but pointers to integers. Then we say x = malloc(sizeof(int));, or "give me enough memory to store an int", and the address returned by malloc will be stored in x.
- Meanwhile, *x = 42 is going to the address stored in x, and putting 42 in it.
- Then we do the same with y, going to its address and putting 13 in it. But wait, y is probably a garbage value, some number left over from previous programs, but not contain an address to memory we should use to store an int. It’s like trying to go into a building you don’t own or have permission to enter, and bad things will happen.
As an aside, David still remembers where he was when he understood pointers, sitting with his TF in the back of Eliot dining hall. So don’t worry if none of this makes sense just yet (though I hope these notes are helpful)!
Let’s watch Pointer Fun with Binky.
- Binky is a clay … figure that talks about this code with a narrator, using a "magic wand of dereferencing" to show what we just explained, in a different way.
- There are three basic rules:
  - "Pointer and pointee are separate - don’t forget to set up the pointee." (Don’t forget to malloc something for y!)
  - "Dereference a pointer to access its pointee." (Use *x to go to the address stored in x!)
  - "Assignment (=) between pointers makes them point to the same pointee." (x = y sets them to the same address.)