Week 2, continued

Introduction

Rob Bowden teaches today, and makes fun of David for sending a vertical video to apologize for missing today’s lecture.
Today we’ll talk more about data representation and cryptography, or scrambling information, but first a story from yesteryear.
- Radio Orphan Annie’s Secret Decoder Ring is a child-friendly form of cryptography: two discs rotate independently, the outer ring carrying the letters A-Z and the inner ring the numbers 1-26, so a message can be encoded by mapping letters to numbers.
- This clip from A Christmas Story shows a child, Ralphie, excitedly decoding the secret message from the radio, only to find that it is an advertisement for Ovaltine, a beverage popular many years ago.
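
The letter-to-number mapping of the decoder ring can be sketched in C. This is only an illustration: it assumes the two rings are lined up so that A sits opposite 1, and the function names `ring_encode` and `ring_decode` are made up for this example.

```c
#include <ctype.h>

// Hypothetical helper: map a letter to its position on the ring
// (A or a -> 1, ..., Z or z -> 26). Returns 0 for anything that is not a letter.
// Assumes the rings are aligned so that A sits opposite 1.
int ring_encode(char c)
{
    if (!isalpha((unsigned char) c))
    {
        return 0;
    }
    return toupper((unsigned char) c) - 'A' + 1;
}

// Hypothetical inverse: map 1 through 26 back to 'A' through 'Z'.
// Returns '?' for any number that is not on the ring.
char ring_decode(int n)
{
    if (n < 1 || n > 26)
    {
        return '?';
    }
    return (char) ('A' + n - 1);
}
```

With this alignment, `ring_encode('A')` gives 1 and `ring_decode(15)` gives `'O'`. Rotating the real ring would shift every letter by the same amount.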

More on Strings

Recall from Monday the following representation of a string containing Zamyla’s name, with the individual `char`s split up into separate boxes.
```
-------------------------
| Z | a | m | y | l | a |
-------------------------
```
- We can access these individual characters using bracket notation, so if we assign this string to a variable called s, we can get the first character using s[0], the second with s[1], and so on up to s[5] (the final a).
- This string is of length 6, but positions in a string are 0-indexed, so it contains indices 0 through 5.
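
As a small sketch of bracket notation in use, the function below walks a string from index 0 to its last index and counts how often a given character appears. The name `count_char` is invented for this example, and it takes a `const char *` in place of the CS50 `string` type so that it compiles without cs50.h.

```c
#include <string.h>

// Hypothetical helper: count how many times c appears in s,
// visiting indices 0 through strlen(s) - 1.
int count_char(const char *s, char c)
{
    int count = 0;
    for (int i = 0, n = strlen(s); i < n; i++)
    {
        if (s[i] == c)
        {
            count++;
        }
    }
    return count;
}
```

For the string "Zamyla", `count_char(s, 'a')` returns 2, because s[1] and s[5] both hold an a.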

Now let’s look at this string in a larger context of your computer’s memory:

---------------------------------
| Z | a | m | y | l | a |   |   |
---------------------------------
|   |   |   |   |   |   |   |   |
---------------------------------
|   |   |   |   |   |   |   |   |
---------------------------------
|   |   |   |   |   |   |   |   |
---------------------------------

The computer’s memory is basically one long sequence of bytes, but we’ll represent it as a grid to save space.

A volunteer from the audience acts as a computer implementing the following code and storing the strings in memory:

#include <stdio.h>
#include <cs50.h>

int main(void)
{
    // get four strings from the user
    string s1 = GetString();
    string s2 = GetString();
    string s3 = GetString();
    string s4 = GetString();

    // print the first string entered
    printf("string s1 is %s\n", s1);
}

Rob, as the user, provides the strings DEAN, HANNAH, MARIA, and ROB, and our volunteer fills in the computer’s memory like so:

-------------------------
|...| D | E | A | N |   |
-------------------------
| H | A | N | N | A | H |
-------------------------
|   | M | A | R | I | A |
-------------------------
|   | R | O | B |   |...|
-------------------------

Our volunteer left spaces between each string so the computer can tell where each string ends (otherwise, when we try to print just s1, we would get DEANHANNAHMARIAROB rather than just DEAN!).

Rather than spaces, the computer actually uses a special terminator character, \0, to represent the end of a string:

-------------------------
|...| D | E | A | N |\0 |
-------------------------
| H | A | N | N | A | H |
-------------------------
|\0 | M | A | R | I | A |
-------------------------
|\0 | R | O | B |\0 |...|
-------------------------

The strlen function also relies on the presence of \0 - it just iterates over the characters of the string until it finds a \0.
In memory, the character \0 is actually represented by a byte of all 0s (so 8 consecutive 0 bits). So what about the character 0? Remember ASCII, the system that maps characters to underlying byte values, where A maps to 65 and so on; the character 0 (the digit you type on the keyboard) is represented by ASCII 48, which is not the same thing as \0.
We’ll refer to the \0 character as a nul terminator.
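
To make the idea concrete, here is one way a strlen-like function could be written. It is a sketch for illustration, not the library's actual source, and the name `my_strlen` is made up; the real strlen is declared in string.h.

```c
#include <stddef.h>

// Hypothetical re-implementation of strlen: step through the characters
// one at a time and stop at the first '\0'. The terminator is not counted.
size_t my_strlen(const char *s)
{
    size_t n = 0;
    while (s[n] != '\0')
    {
        n++;
    }
    return n;
}
```

Called on "DEAN", this returns 4. If the \0 were missing, the loop would keep reading into whatever bytes come next in memory.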

Arrays

So this general idea of storing items in boxes, as we do under the hood in a string, is known as an array. An array is a type of data structure that holds a number of values of the same type contiguously in memory, back-to-back. A string is just an array of char variables, but we can put any of our other data types in an array as well.

Now say we wanted to get the ages of a number of people in the room. We might start with:

#include <cs50.h>
#include <stdio.h>

int main(void)
{
    int age0 = GetInt();
    int age1 = GetInt();
    int age2 = GetInt();

    // do something with those numbers ...
}

But this forces us to copy-paste the GetInt line, and then edit and recompile our code, for every additional person whose age we want.
We can solve this problem by using an array. The general format for declaring an array of a given type and size as a variable called name is:
```
type name[size];
```
Let’s look at how we do this in ages.c:
```
 1#include <cs50.h>
 2#include <stdio.h>
 3
 4int main(void)
 5{
 6    // determine number of people
 7    int n;
 8    do
 9    {
10        printf("Number of people in room: ");
11        n = GetInt();
12    }
13    while (n < 1);
14
15    // declare array in which to store everyone's age
16    int ages[n];
17
18    // get everyone's age
19    for (int i = 0; i < n; i++)
20    {
21        printf("Age of person #%i: ", i + 1);
22        ages[i] = GetInt();
23    }
24
25    // report everyone's age a year hence
26    printf("Time passes...\n");
27    for (int i = 0; i < n; i++)
28    {
29        printf("A year from now, person #%i will be %i years old.\n", i + 1, ages[i] + 1);
30    }
31}
```
- In line 16, we declare an array that stores exactly n integers. Here the number in brackets is the size we want the array to have, whereas earlier, when we used s[i], the number in brackets was the index of an item in an array that already existed.
  - This means that somewhere in memory, we have n integer-sized (32-bit, or 4-byte) boxes in a row.
- Then we GetInt for each person, storing it in ages[i] as we go through the loop, meaning the ages will be placed in the first box, second box, and so on of the ages[] array.
  - Now it should make a little more sense why the convention is to start for loops from 0 rather than 1 - we very often use a for loop to iterate over the indices of an array, and arrays are 0-indexed.
- Finally, we iterate through the array again and print out each age, with 1 added to demonstrate what we can do after we retrieve the int from the array.
- What happens if we try to store another int past the end of the array, say in ages[n] or ages[n+1]? The last valid index is n - 1, so neither box belongs to the array. Just as we saw when we tried to read many bytes past the end of a string, this can result in a segmentation fault, an error that indicates we’ve touched a segment of memory that doesn’t belong to us. C does not define what happens here, so the program might also appear to work while silently overwriting other data.
- What if we didn’t check whether the user entered a positive number for the number of people in the room? An array cannot have a negative size. If you declare an array with a negative size directly in your code, it will not compile, so if the user gets away with entering a negative size for the array, the program is likely to crash with a segmentation fault as well.
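
One way to guard against writing outside an array is to check the index before using it. The sketch below assumes a helper named `store_age`, which is invented for this example and is not part of ages.c.

```c
#include <stdbool.h>

// Hypothetical helper: store value in ages[i] only if i is a valid index
// for an array of n elements (valid indices are 0 through n - 1).
// Returns true if the value was stored, false if i was out of bounds.
bool store_age(int ages[], int n, int i, int value)
{
    if (i < 0 || i >= n)
    {
        return false;
    }
    ages[i] = value;
    return true;
}
```

With an array of size 3, storing at index 0 or 2 succeeds, while index 3 or -1 is rejected instead of touching memory outside the array.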

Command-Line Arguments

Now that we’ve seen arrays, we can start to work with command-line arguments, which we’ll need for Problem Set 2.
Commands like cd don’t ask for input; instead they take arguments from the command line, so cd pset1 changes the directory to pset1 without a separate prompt for input. Likewise, mkdir pset2 makes a directory called pset2, and make hello builds a program called hello.
- clang -o hello hello.c has three such arguments (-o, hello, and hello.c).
We’ve been writing programs that look like this, whereby main does not take any arguments (as implied by the presence of void):
```
#include <cs50.h>
#include <stdio.h>

int main(void)
{
    // TODO
}
```
- This means that no other words can be typed after the program’s name and accessed within main, and the only way to provide input is by a function running after the program is started, like with GetString.

We will start adding code like this:

#include <cs50.h>
#include <stdio.h>

int main(int argc, string argv[])
{
    // TODO
}

We see that main now takes two arguments: argc, which is an int, and argv, which is an array of strings. We specify the name, argv (short for argument vector, or array of arguments), but not the size, so an array of any length can be passed in to main.
- This is a slightly different use of the bracket notation we haven’t seen before - rather than indicating the size of an array or a position within an array, these empty brackets just mean that argv is an array (of unspecified size), and we know it’s an array of strings because it’s declared as string argv[].
- We need the size of argv to be unspecified because many programs (such as clang) can take different numbers of command-line arguments depending on what you’re trying to do with them.

So command-line arguments look like this in memory:

argc
------------
|          |
------------

argv
------------------------------------------------
|          |           |           |           |  ...
------------------------------------------------

argv is a chunk of memory that stores one string after another, and argc is a single chunk of memory that holds an int.

We can access each string individually:

argc
------------
|          |
------------

argv[0]    argv[1]     argv[2]     argv[3]
------------------------------------------------
|          |           |           |           |  ...
------------------------------------------------

If we run a program with ./hello, the contents of argc and argv[0] would be as follows:

argc
------------
| 1        |
------------

argv[0]     argv[1]     argv[2]     argv[3]
-------------------------------------------------
| ./hello   |           |           |           |  ...
-------------------------------------------------

If we run clang -o hello hello.c, however, we get:

argc
------------
| 4        |
------------

argv[0]     argv[1]     argv[2]     argv[3]
-------------------------------------------------
| clang     | -o        | hello     | hello.c   |  ...
-------------------------------------------------

Since we don’t know where argv will end by itself, we need argc to tell us where to stop looking.

Let’s write a program that uses these arguments. What about a program that says hello without using GetString? Instead, it’ll take arguments like this:

argc
------------
| 2        |
------------

argv[0]     argv[1]     argv[2]     argv[3]
-------------------------------------------------
| ./hello   | Zamyla    |           |           |  ...
-------------------------------------------------

We’ll call this hello-3.c:

#include <cs50.h>
#include <stdio.h>

int main(int argc, string argv[])
{
    printf("hello, %s\n", argv[1]);
}

argv[1] contains whatever string is passed in after the name of our program.

But what happens if we don’t type someone’s name in?
```
jharvard@ide50:~/workspace/src2w $ make hello-3
clang -ggdb3 -O0 -std=c99 -Wall -Werror    hello-3.c  -lcs50 -lm -o hello-3
jharvard@ide50:~/workspace/src2w $ ./hello-3
hello, (null)
```
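- printf is printing (null) because argv[1] holds no string here, only the special value NULL. C does not guarantee that passing NULL for %s will work; this printf just happens to print (null).
- When we run just ./hello-3, argc is 1, so argv holds a single argument, and the only index that refers to an actual string is argv[0].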

So let’s look at how we can prevent something like this happening in hello-4.c:

 1#include <cs50.h>
 2#include <stdio.h>
 3
 4int main(int argc, string argv[])
 5{
 6    if (argc == 2)
 7    {
 8        printf("hello, %s\n", argv[1]);
 9    }
10    else
11    {
12        printf("hello, you\n");
13    }
14}

On line 6 we make sure that argc has a value of 2, and if so, we know we have a single name string to plug in and say hello to the user. If not, we instead substitute a generic message of "hello, you".

jharvard@ide50:~/workspace/src2w $ ./hello-4
hello, you
jharvard@ide50:~/workspace/src2w $ ./hello-4 Rob
hello, Rob
jharvard@ide50:~/workspace/src2w $ ./hello-4 Rob Maria
hello, you
jharvard@ide50:~/workspace/src2w $

Let’s look at argv-1.c:

#include <cs50.h>
#include <stdio.h>

int main(int argc, string argv[])
{
    // print arguments
    for (int i = 0; i < argc; i++)
    {
        printf("%s\n", argv[i]);
    }
}

We’re iterating over the argv array, using its length, argc, to know when to stop.

This will print each argument, one per line:

jharvard@ide50:~/workspace/src2w $ make argv-1
clang -ggdb3 -O0 -std=c99 -Wall -Werror    argv-1.c  -lcs50 -lm -o argv-1
jharvard@ide50:~/workspace/src2w $ ./argv-1
./argv-1
jharvard@ide50:~/workspace/src2w $ ./argv-1 Rob
./argv-1
Rob
jharvard@ide50:~/workspace/src2w $ ./argv-1 Rob Maria Hannah
./argv-1
Rob
Maria
Hannah

Note that argc can never be less than 1, because there will always at least be the name of the program itself.

We can take this further in argv-2.c:

 1#include <cs50.h>
 2#include <stdio.h>
 3#include <string.h>
 4
 5int main(int argc, string argv[])
 6{
 7    // print arguments
 8    for (int i = 0; i < argc; i++)
 9    {
10        for (int j = 0, n = strlen(argv[i]); j < n; j++)
11        {
12            printf("%c\n", argv[i][j]);
13        }
14        printf("\n");
15    }
16}

Now we go through each argument with line 8, but in line 10 we check the length of the argument stored in argv[i], store it in n, and use j as a counter to iterate through argv[i] since i was already used.

Then in line 12 we use a new syntax, argv[i][j], which first gets the i’th string in argv and then the j’th character in that string. This relies on the fact that argv is an array of strings, which are themselves arrays of characters, so argv is a nested array of arrays.

jharvard@ide50:~/workspace/src2w $ make argv-2
clang -ggdb3 -O0 -std=c99 -Wall -Werror    argv-2.c  -lcs50 -lm -o argv-2
jharvard@ide50:~/workspace/src2w $ ./argv-2
.
/
a
r
g
v
-
2

jharvard@ide50:~/workspace/src2w $ ./argv-2 foo bar
.
/
a
r
g
v
-
2

f
o
o

b
a
r
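
The same nested indexing can do more than print. As a sketch, the function below counts how many times a character appears across all of the arguments, using argv[i][j] to reach each character. The name `count_in_args` is made up for this example, and it takes `char *argv[]` rather than the CS50 `string` type so that it compiles without cs50.h.

```c
#include <string.h>

// Hypothetical helper: count occurrences of c across every argument,
// including the program's name in argv[0].
int count_in_args(int argc, char *argv[], char c)
{
    int count = 0;
    for (int i = 0; i < argc; i++)
    {
        // argv[i] is the i'th string; argv[i][j] is its j'th character
        for (int j = 0, n = strlen(argv[i]); j < n; j++)
        {
            if (argv[i][j] == c)
            {
                count++;
            }
        }
    }
    return count;
}
```

For the arguments ./argv-2 foo bar, counting 'o' gives 2, and counting 'a' also gives 2 (one in ./argv-2 and one in bar).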

One thing to note is that command-line arguments are generally separated by spaces, as you’d expect, but if you want to pass an argument that itself contains a space, you can do it as follows:
```
jharvard@ide50:~/workspace/src2w $ ./argv-2 Rob Maria 'Hannah Blumberg'
```
- By enclosing Hannah Blumberg in quotes, we tell the shell that it’s not two separate arguments, but rather only one, so the program receives it as a single string.

Cryptography

In Problem Set 2 we introduce you to cryptography, specifically secret-key crypto, in which a message can only be decoded by someone who knows the secret key that was used to encode it.
In the Hacker Edition, we’ll give you some usernames and encrypted (well, "hashed") passwords that look like {crypt}$1$LlBcWwQn$pxTB3yAjbVS/HTD2xuXFI0, challenging you to crack them and find the original passwords.
After this problem set, you’ll be able to decode what this means:
```
or fher gb qevax lbhe Binygvar
```
As well as this URL, which you may remember from Week 0:
```
uggcf://lbhgh.or/bUt5FWLEUN0
```