Like Magic

Recall from Recover in Problem Set 3 that the first three bytes of any JPEG are, in order, 0xff, 0xd8, and 0xff. The fourth byte, meanwhile, is either 0xe0, 0xe1, 0xe2, 0xe3, 0xe4, 0xe5, 0xe6, 0xe7, 0xe8, 0xe9, 0xea, 0xeb, 0xec, 0xed, 0xee, or 0xef.

Recall also from Whodunit in Problem Set 3 that a BMP begins with a BITMAPFILEHEADER. It turns out that the first two bytes of any BMP and, in turn, the first two bytes of that struct (aka bfType) are 0x42 and 0x4d.

The first four bytes of any PDF, meanwhile, are, in order, 0x25, 0x50, 0x44, and 0x46.

These are all "magic numbers": sequences of bytes (unrelated to the hard-coded constants that go by the same name in C) whose presence at the start of a file likely indicates its type, a signature of sorts.
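
To see a file's signature for yourself, it can help to look at its first few bytes in hexadecimal. Below is a minimal sketch of how that might be done, not a prescribed approach; the helper names read_prefix and print_hex are hypothetical and appear nowhere in the problem set's distribution code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

// Hypothetical helper: copies up to n bytes from the start of the file at
// path into buffer. Returns the number of bytes actually read (possibly
// fewer than n for a short file), or -1 if the file cannot be opened.
long read_prefix(const char *path, unsigned char *buffer, size_t n)
{
    assert(path != NULL && buffer != NULL);

    // Open in binary mode so that no bytes are translated on read
    FILE *file = fopen(path, "rb");
    if (file == NULL)
    {
        return -1;
    }

    size_t count = fread(buffer, 1, n, file);
    fclose(file);
    return (long) count;
}

// Hypothetical helper: prints count bytes as two-digit hexadecimal values
// separated by spaces, followed by a newline.
void print_hex(const unsigned char *bytes, size_t count)
{
    for (size_t i = 0; i < count; i++)
    {
        printf("%s0x%02x", i == 0 ? "" : " ", (unsigned int) bytes[i]);
    }
    printf("\n");
}
```

Given a 4-byte buffer, calling read_prefix on a file and then passing the buffer and the returned count to print_hex shows the bytes to compare against the signatures above. Declaring the buffer as unsigned char matters here: with plain char, which may be signed, a byte like 0xff could compare or print as a negative value.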

Answer the below in magic.md and magic.c.

Questions

  1. (1 point.) If you treat the first two bytes of a BMP as ASCII characters, with what (non-terminated) string does a BMP begin?

  2. (1 point.) If you treat the first four bytes of a PDF as ASCII characters, with what (non-terminated) string does a PDF begin?

  3. (2 points.) Why might the presence of a magic number at the start of a file likely, but not necessarily, indicate its type?

  4. (3 points.) Consider the C code below, with which you could check whether some file is likely a JPEG using && and ||, otherwise known as "logical AND" and "logical OR," respectively, both "logical operators."

    if (buffer[0] == 0xff &&
        buffer[1] == 0xd8 &&
        buffer[2] == 0xff &&
        (buffer[3] == 0xe0 ||
         buffer[3] == 0xe1 ||
         buffer[3] == 0xe2 ||
         buffer[3] == 0xe3 ||
         buffer[3] == 0xe4 ||
         buffer[3] == 0xe5 ||
         buffer[3] == 0xe6 ||
         buffer[3] == 0xe7 ||
         buffer[3] == 0xe8 ||
         buffer[3] == 0xe9 ||
         buffer[3] == 0xea ||
         buffer[3] == 0xeb ||
         buffer[3] == 0xec ||
         buffer[3] == 0xed ||
         buffer[3] == 0xee ||
         buffer[3] == 0xef))
    {
        // Likely a JPEG
    }

    Notice, though, how each of those possible fourth bytes (sixteen of them in total) begins with 0xe (which, incidentally, is 1110 in binary). Zamyla noticed the same and thus, in the walkthrough for Recover, instead proposed the more succinct code below, using not only && but also &, a "bitwise operator" otherwise known as "bitwise AND."

    if (buffer[0] == 0xff &&
        buffer[1] == 0xd8 &&
        buffer[2] == 0xff &&
        (buffer[3] & 0xf0) == 0xe0)
    {
        // Likely a JPEG
    }

    But how exactly does this code work? Read up on bitwise operators, particularly &, and explain how

    (buffer[3] & 0xf0) == 0xe0

    checks whether that fourth byte is any of those sixteen values.
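
    As a starting point for that reading, here is a small sketch of what & does to a single byte, using an arbitrary value (0xab) rather than the JPEG case, so the explanation asked for above is still yours to write. The helper names apply_mask and to_binary are hypothetical.

```c
#include <stdio.h>

// Hypothetical helper: keeps only those bits of byte that are also 1 in mask;
// every other bit of the result is 0.
unsigned char apply_mask(unsigned char byte, unsigned char mask)
{
    return byte & mask;
}

// Hypothetical helper: writes byte as eight '0'/'1' characters, most
// significant bit first, plus a terminating '\0', into out.
void to_binary(unsigned char byte, char out[9])
{
    for (int i = 0; i < 8; i++)
    {
        // Test bit 7 first, then bit 6, and so on down to bit 0
        out[i] = (byte & (0x80 >> i)) ? '1' : '0';
    }
    out[8] = '\0';
}
```

    For instance, 0xab is 10101011 in binary. Masking it with 0xf0 (11110000) yields 0xa0 (10100000), while masking it with 0x0f (00001111) yields 0x0b (00001011): in each case the bits under the mask's 1s survive and the rest are cleared.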

  5. (2 points.) Why is Zamyla’s proposed code, which uses bitwise AND, more efficient than the code that uses logical OR?

  6. (6 points.) Implement, in magic.c, a program in C that checks whether a file is a BMP, JPEG, or PDF, relying only on its first several bytes, irrespective of the file’s name (or extension).

    Your program should accept, as its sole command-line argument, the name (or path) of a file. And it should print

    • BMP\n if the file is likely a BMP,

    • JPEG\n (not JPG\n) if the file is likely a JPEG,

    • PDF\n if the file is likely a PDF,

    • \n otherwise.

    Moreover, your program should exit with a status code of

    • 1 if argc is not 2,

    • 1 if argv[1] cannot be opened or read, or

    • 0 otherwise.

    Your program must not leak any memory or potentially segfault. And your program must compile with make magic without any warnings or errors.

    Here are some (small) files with which you can test your program:

Debrief

  1. Which resources, if any, did you find helpful in answering this problem’s questions?

  2. About how long, in minutes, did you spend on this problem’s questions?