NATE HARDISON: In the video on binary, we show how to represent the set of whole numbers, from zero on up, using only the digits zero and one. In this video, we're going to use binary notation to represent text, letters and such, as well. Why would we bother to do this? Well, under the hood, a computer only really understands zeros and ones, the binary digits, since these can be represented easily with electrical components. For example, think of your computer's memory like a long string of light bulbs, whereby each individual bulb represents a zero if it's turned off and a one if it's turned on. Instead of using a bunch of light bulbs, some modern memory does this using capacitors that hold a low charge to represent a zero and a high charge to represent a one. There are other techniques as well. Anyway, in order to store anything in memory, we need to first convert it into something that can actually be represented in the physical hardware.

So let's think about how we might represent letters with binary notation. In English, we've got 26 letters in the alphabet, A, B, C, D, and so on, up through Z. We can assign each one of these a number, say zero through 25, and then, using binary notation, we can represent each number as a sequence of zeros and ones. That's not too bad. However, that's not going to be enough. With this system, we can't actually distinguish between uppercase and lowercase letters. If we want our computer to be able to differentiate between the two cases, then we need an additional 26 numbers. And what about periods, commas, and other punctuation marks? On my keyboard, I've got 32 of those, including all of the special characters like the caret and the ampersand. That's not including the digit characters, zero through nine, since we still want to be able to type numbers in decimal notation on the computer, even if the computer only really understands binary notation under the hood. And finally, we'll need to represent a space character so that our Space Bar works. So figuring out how to represent text on the computer takes a little more than we might have thought initially. Additionally, suppose we then come up with our own encoding scheme to represent characters as numbers. Whatever encoding we decide on will inevitably be arbitrary, as we saw earlier when we talked about using the numbers zero through 25 to represent the letters A through Z. Why not use 10 through 35 so that we can save zero through nine for the digit characters? There's no real reason; we just chose whatever seemed best for us.

Back in the early 1960s, this was a real problem. Different computer manufacturers were using different encoding schemes, and this made communication between different machines a very difficult task. The American National Standards Institute, ANSI, formed a committee to develop a common scheme. And in 1963, the American Standard Code for Information Interchange, more commonly known as ASCII, was born. ASCII was designed as a seven-bit encoding, which means that each character is represented by a combination of seven zeros and ones. With those two possible values, zero or one, for each of the seven bits, there are two to the seventh, or 128, characters that can be represented with the ASCII encoding scheme.
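To make those numbers concrete, here's a minimal C sketch, offered purely as a hypothetical illustration rather than any standard scheme: it prints the naive A-through-Z-to-0-through-25 mapping described above and confirms that seven bits give two to the seventh, or 128, distinct values.

    #include <stdio.h>

    int main(void)
    {
        // Hypothetical scheme from above: map 'A' through 'Z' to 0 through 25.
        // In C, a char is already stored as a small number, so the arithmetic is direct.
        for (char c = 'A'; c <= 'Z'; c++)
        {
            printf("%c -> %d\n", c, c - 'A');
        }

        // Seven bits give 2^7 = 128 distinct values -- the size of the ASCII set.
        printf("7 bits can represent %d distinct values\n", 1 << 7);

        return 0;
    }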
So 128 characters sounds like a lot, right? Well, remember that there are 26 lowercase letters in English, another 26 uppercase letters, 10 digit characters, 32 punctuation and special characters, and one space character. That puts us at 95, so we have another 33 characters that we can represent. So what's left? Well, in the days of the development of ASCII, teletype machines, which are typewriters that are used to send messages across a network, were widespread. And these machines had additional characters used to control them, for example, to tell them when to move the print head down a line, the line feed or newline key, when to move to the left margin, the carriage return, or simply return key, and when to go back one space, the backspace character, and so on. These characters are called control characters, and they constitute the rest of the ASCII set.

So if we look at an ASCII table, we see that the first 32 numbers, zero through 31, are reserved for control characters. But we just said that there were 33 control characters. What's the deal? Well, the numbers zero and 127, the first and last of the ASCII set, have special bit patterns, all zeros and all ones, respectively. The designers of ASCII decided, therefore, to reserve these numbers for extra special characters, namely the null character and the DEL character. Null and DEL were intended for paper tape editing, which used to be a common way of storing data. Paper tape was literally just a long strip of paper, and at regular intervals on the tape, you'd punch holes to store data. Depending on the width of the tape, each column would be able to accommodate five, six, seven, or eight bits. To represent a zero bit, you'd do nothing to the tape; you'd just leave a blank space. For a one bit, you'd punch a hole. The null character would just leave a blank column, indicating all zeros. And the DEL character would punch a column full of holes through your tape. As a result, you could use the DEL character to delete information. Imagine taking a filled-out election ballot and then punching all the unpunched holes. You invalidate the ballot because it's impossible to tell what the original votes were. While the DEL character is still used as the modern Delete key, the null character came to be used as the termination character for C strings and some other data formats. You might know it as the backslash zero character, since that's how we represent it in writing.

So back to our ASCII table. After the first 32 control characters come the 95 printable characters. There are a couple of cool design decisions worth talking about here. First, the decimal digit characters, zero through nine, correspond to the numbers 48 through 57, which seems unremarkable until we look at the numbers 48 through 57 written in binary notation. If we do that, then we see that the digit character zero corresponds to 0110000, one maps to 0110001, two to 0110010, and so on. See the pattern? Each digit character is mapped to its corresponding equivalent in binary notation, prefixed with 011.

Next up, you'll notice that the uppercase letters start at 65, with uppercase A, but the lowercase letters don't start until 97, so there are 32 spaces in between. That seems weird. There are only 26 letters in the alphabet. Why split them up like this? Again, if we look at the binary representations, we can see a pattern. Uppercase A is represented by 1000001, and lowercase a is represented by 1100001. Uppercase B is represented by 1000010, and lowercase b is represented by 1100010. Can you tell what's going on here? The bit that's second from the left, in the two-to-the-fifth, or 32s, position, is 0 for all of the uppercase letters and 1 for all of the lowercase letters.
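Both of those design decisions are easy to demonstrate in a few lines of C. The following is a minimal sketch, just for illustration: subtracting the code for the character zero recovers a digit character's numeric value, and toggling the 32s bit (here with XOR) switches a letter between uppercase and lowercase.

    #include <stdio.h>

    int main(void)
    {
        // Digit characters '0' through '9' occupy codes 48 through 57,
        // so subtracting '0' (code 48) recovers the digit's numeric value.
        char digit = '7';
        printf("'%c' has code %d and value %d\n", digit, digit, digit - '0');

        // Uppercase and lowercase letters differ only in the 32s (2^5) bit,
        // so toggling that bit flips the letter's case.
        char upper = 'A';
        char lower = upper ^ 32;    // 'A' (65) becomes 'a' (97)
        printf("%c (%d) <-> %c (%d)\n", upper, upper, lower, lower);

        return 0;
    }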
That means converting from uppercase to lowercase, and vice versa, is a matter of a simple bit flip. So that brings us to the end of the ASCII table. Can you think of anything we've forgotten? Well, what about the Spanish eñe, or the Greek or Cyrillic alphabets? And how about Chinese characters? There's a lot that's been left out of ASCII. However, another standard called Unicode has been developed to cover all of these characters and many more. But that's a subject for another time. My name is Nate Hardison. This is CS50.