#functions

Recall that a hash function maps a (potentially infinite) domain of inputs to a finite range of outputs. A "perfect" hash function, meanwhile, is one that maps a (finite) set of inputs to a range of distinct outputs without any collisions.

Answer the below in functions.md.

Questions

  1. (2 points.) Consider the hash function below.

    int hash(char *s)
    {
        if (s[0] >= 'A' && s[0] <= 'Z')
        {
            return s[0] - 'A';
        }
        else
        {
            return s[0] - 'a';
        }
    }

    Why might a hash table, if implemented using this hash function, prove to be slow? Assume that s is an entirely alphabetical string of non-zero length.

  2. (4 points) Recall that a string is just a sequence of characters, each of which is a sequence of bits. In ASCII, A is thus 01000001, AB is thus 0100000101000010, and ABC is thus 010000010100001001000011. Suppose that a hash function, given a string in ASCII as input, outputs the (decimal) integer represented by its sequence of bits. Given A as input, this hash function would output 65; given AB as input, this hash function would output 16706; and given ABC as input, this hash function would output 4276803. Why is this hash function "perfect" by definition but imperfect in practice if implemented in C?

  3. (4 points.) The input to a hash function could also be a file. After all, a file is just a sequence of bits, much like a string. As it turns out, check50 uses hashes to check the correctness of recover for Problem Set 4. Rather than compare the 50 files outputted by your implementation of recover against the original 50 JPEGs from card.raw (as is possible, recall, with programs like diff), check50 instead hashes each of your 50 files, the output of which is a 64-digit hexadecimal string (i.e., 256 bits), comparing each such hash against 64-digit hexadecimal strings hard-coded into the source code of check50's checks. If curious, you (and anyone on the internet!) can see them atop https://github.com/cs50/checks/blob/master/cs50/2017/fall/recover/check50/__init__.py, though no need to understand the Python code therein. Why might we be storing those 50 hashes in that repository instead of the original 50 JPEGs themselves?

  4. (4 points.) Recall that a trie iteratively hashes each of the letters in words in order to index into arrays. And even a hash table might hash all of the letters in words in order to index into an array. Why, though, is lookup of words said to be in O(1) for a trie but O(n) for a hash table?

Debrief

  1. Which resources, if any, did you find helpful in answering this problem’s questions?

  2. About how long did you spend on this problem’s questions?