1 00:00:07,220 --> 00:00:09,290 NATE HARDISON: In the video on binary, we show how to 2 00:00:09,290 --> 00:00:12,540 represent the set of whole numbers, from zero on up, 3 00:00:12,540 --> 00:00:15,110 using only the digits zero and one. 4 00:00:15,110 --> 00:00:17,890 In this video, we're going to use binary notation to 5 00:00:17,890 --> 00:00:21,160 represent text, letters and such, as well. 6 00:00:21,160 --> 00:00:22,810 >> Why would we bother to do this? 7 00:00:22,810 --> 00:00:25,450 Well, under the hood, a computer only really 8 00:00:25,450 --> 00:00:29,070 understands zeros and ones, the binary digits, since these 9 00:00:29,070 --> 00:00:32,100 can be represented easily with electromagnetic things. 10 00:00:32,100 --> 00:00:35,040 >> For example, think of your computer's memory like a long 11 00:00:35,040 --> 00:00:37,810 string of light bulbs, whereby each individual bulb 12 00:00:37,810 --> 00:00:40,680 represents a zero if it's turned off, and a one 13 00:00:40,680 --> 00:00:42,230 if it's turned on. 14 00:00:42,230 --> 00:00:44,730 Instead of using a bunch of light bulbs, some modern 15 00:00:44,730 --> 00:00:46,990 memory does this using capacitors that hold a low 16 00:00:46,990 --> 00:00:49,120 charge to represent a zero and a high charge 17 00:00:49,120 --> 00:00:50,780 to represent a one. 18 00:00:50,780 --> 00:00:52,510 >> There are other techniques as well. 19 00:00:52,510 --> 00:00:55,500 Anyway, in order to store anything in memory, we need to 20 00:00:55,500 --> 00:00:57,590 first convert it into something that can be actually 21 00:00:57,590 --> 00:01:00,140 represented in the physical hardware. 22 00:01:00,140 --> 00:01:02,450 So let's think about how we might represent letters with 23 00:01:02,450 --> 00:01:04,230 binary notation. 24 00:01:04,230 --> 00:01:08,141 In English, we've got 26 letters in the alphabet, A, 25 00:01:08,141 --> 00:01:12,930 >> B, C, D, and so on, up through Z.
We can assign each one of 26 00:01:12,930 --> 00:01:16,650 these a number, say zero through 25, and then using 27 00:01:16,650 --> 00:01:18,880 binary notation, we can represent each number as a 28 00:01:18,880 --> 00:01:20,890 sequence of zeros and ones. 29 00:01:20,890 --> 00:01:22,420 That's not too bad. 30 00:01:22,420 --> 00:01:25,050 However, that's not going to be enough. 31 00:01:25,050 --> 00:01:27,680 With this system, we can't actually distinguish between 32 00:01:27,680 --> 00:01:29,830 upper and lowercase letters. 33 00:01:29,830 --> 00:01:32,140 If we want our computer to be able to differentiate between 34 00:01:32,140 --> 00:01:36,020 the two cases, then we need an additional 26 numbers. 35 00:01:36,020 --> 00:01:38,700 And what about periods, commas, and 36 00:01:38,700 --> 00:01:40,390 other punctuation marks? 37 00:01:40,390 --> 00:01:43,560 >> On my keyboard, I've got 32 of those, including all of the 38 00:01:43,560 --> 00:01:46,800 special characters like the caret and the ampersand. 39 00:01:46,800 --> 00:01:49,700 That's not including the digit characters, zero through nine, 40 00:01:49,700 --> 00:01:51,840 since we still want to be able to type numbers in decimal 41 00:01:51,840 --> 00:01:54,840 notation on the computer, even if the computer only really 42 00:01:54,840 --> 00:01:57,830 understands binary notation under the hood. 43 00:01:57,830 --> 00:02:00,620 >> And finally, we'll need to represent a space character so 44 00:02:00,620 --> 00:02:02,450 that our Space Bar works. 45 00:02:02,450 --> 00:02:04,920 So figuring out how to represent text on the computer 46 00:02:04,920 --> 00:02:08,400 takes a little more than we might have thought initially. 47 00:02:08,400 --> 00:02:11,710 Additionally, assume we then come up with our own encoding 48 00:02:11,710 --> 00:02:14,560 scheme to represent characters as numbers. 
49 00:02:14,560 --> 00:02:17,470 However we decide to encode characters will inevitably be 50 00:02:17,470 --> 00:02:20,630 arbitrary, as we saw earlier when we talked about using the 51 00:02:20,630 --> 00:02:23,730 numbers zero through 25 to represent the letters A 52 00:02:23,730 --> 00:02:26,850 through Z. Why not use 10 through 35 so that we can save 53 00:02:26,850 --> 00:02:29,350 zero through nine for the digit characters? 54 00:02:29,350 --> 00:02:31,590 >> There's no real reason, we just chose whatever seemed 55 00:02:31,590 --> 00:02:33,770 best for us. 56 00:02:33,770 --> 00:02:37,650 Back in the early 1960s, this was a real problem. 57 00:02:37,650 --> 00:02:39,370 Different computer manufacturers were using 58 00:02:39,370 --> 00:02:41,910 different encoding schemes, and this made communication 59 00:02:41,910 --> 00:02:44,340 between different machines a very difficult task. 60 00:02:44,340 --> 00:02:47,810 The American National Standards Institute, ANSI, 61 00:02:47,810 --> 00:02:50,210 formed a committee to develop a common scheme. 62 00:02:50,210 --> 00:02:53,780 And in 1963, the American Standard Code for Information 63 00:02:53,780 --> 00:02:58,600 Interchange, more commonly known as ASCII, was born. 64 00:02:58,600 --> 00:03:01,360 >> ASCII was designed as a seven-bit encoding, which 65 00:03:01,360 --> 00:03:03,800 means that each character is represented by a combination 66 00:03:03,800 --> 00:03:06,070 of seven zeros and ones. 67 00:03:06,070 --> 00:03:09,670 With those two possible values, zero or one, for each 68 00:03:09,670 --> 00:03:14,040 of the seven bits, there are two to the seventh or 128 69 00:03:14,040 --> 00:03:16,120 characters that can be represented with the ASCII 70 00:03:16,120 --> 00:03:18,140 encoding scheme. 71 00:03:18,140 --> 00:03:21,480 So 128 characters sounds like a lot, right? 
72 00:03:21,480 --> 00:03:24,180 Well, remember that there are 26 lowercase letters in 73 00:03:24,180 --> 00:03:29,260 English, another 26 uppercase letters, 10 digit characters, 74 00:03:29,260 --> 00:03:31,470 32 punctuation and special characters, 75 00:03:31,470 --> 00:03:33,430 and one space character. 76 00:03:33,430 --> 00:03:37,050 >> That puts us at 95, so we have another 33 characters that we 77 00:03:37,050 --> 00:03:38,400 can represent. 78 00:03:38,400 --> 00:03:39,900 >> So what's left? 79 00:03:39,900 --> 00:03:43,130 Well, in the days of the development of ASCII, teletype 80 00:03:43,130 --> 00:03:45,080 machines, which are typewriters that are used to 81 00:03:45,080 --> 00:03:48,040 send messages across a network, were widespread. 82 00:03:48,040 --> 00:03:50,030 And these machines had additional characters used to 83 00:03:50,030 --> 00:03:52,890 control them, for example, to tell them when to move the 84 00:03:52,890 --> 00:03:57,620 print head down a line, the line feed or new line key, 85 00:03:57,620 --> 00:04:00,440 when to move to the left margin, the carriage return, 86 00:04:00,440 --> 00:04:04,890 or simply return key, and when to go back one space, the 87 00:04:04,890 --> 00:04:07,760 backspace character, and so on. 88 00:04:07,760 --> 00:04:10,250 >> These characters are called control characters, and they 89 00:04:10,250 --> 00:04:12,680 constitute the rest of the ASCII set. 90 00:04:12,680 --> 00:04:15,230 So if we look at an ASCII table, we see that the first 91 00:04:15,230 --> 00:04:18,800 32 numbers, zero through 31, are reserved for control 92 00:04:18,800 --> 00:04:20,200 characters. 93 00:04:20,200 --> 00:04:23,420 But we just said that there were 33 control characters. 94 00:04:23,420 --> 00:04:24,780 What's the deal? 
95 00:04:24,780 --> 00:04:29,350 Well, the numbers zero and 127, the first and last of the 96 00:04:29,350 --> 00:04:32,560 ASCII set, have special bit patterns, all zeros and all 97 00:04:32,560 --> 00:04:34,710 ones, respectively. 98 00:04:34,710 --> 00:04:36,860 >> The designers of ASCII decided, therefore, to 99 00:04:36,860 --> 00:04:39,610 reserve these numbers for extra special characters, 100 00:04:39,610 --> 00:04:43,310 namely the null character and the DEL character. 101 00:04:43,310 --> 00:04:46,340 Null and DEL were intended for paper tape editing, which used 102 00:04:46,340 --> 00:04:48,930 to be a common way of storing data. 103 00:04:48,930 --> 00:04:51,850 Paper tape was literally just a long strip of paper, and at 104 00:04:51,850 --> 00:04:53,760 regular intervals on the tape, you'd punch 105 00:04:53,760 --> 00:04:55,430 holes to store data. 106 00:04:55,430 --> 00:04:58,720 Depending on the width of the tape, each column would be 107 00:04:58,720 --> 00:05:03,186 able to accommodate five, six, seven, or eight bits. 108 00:05:03,186 --> 00:05:05,930 >> To represent a zero bit, you'd do nothing to the tape, you'd 109 00:05:05,930 --> 00:05:07,930 just leave a blank space. 110 00:05:07,930 --> 00:05:10,560 For a one bit, you'd punch a hole. 111 00:05:10,560 --> 00:05:12,980 The null character would just leave a blank column, 112 00:05:12,980 --> 00:05:14,480 indicating all zeros. 113 00:05:14,480 --> 00:05:17,250 And the DEL character would punch a column full of holes 114 00:05:17,250 --> 00:05:18,550 through your tape. 115 00:05:18,550 --> 00:05:21,300 As a result, you could use the DEL character to delete 116 00:05:21,300 --> 00:05:22,440 information. 117 00:05:22,440 --> 00:05:25,060 Imagine taking a filled-out election ballot and then 118 00:05:25,060 --> 00:05:27,180 punching all the unpunched holes.
119 00:05:27,180 --> 00:05:29,410 >> You invalidate the ballot because it's impossible to 120 00:05:29,410 --> 00:05:31,820 tell what the original votes were. 121 00:05:31,820 --> 00:05:34,720 While the DEL character is still used as the modern 122 00:05:34,720 --> 00:05:37,980 Delete key, the null character came to be used as the 123 00:05:37,980 --> 00:05:40,010 termination character for C strings and 124 00:05:40,010 --> 00:05:41,990 some other data formats. 125 00:05:41,990 --> 00:05:45,140 You might know it as the backslash zero character, 126 00:05:45,140 --> 00:05:47,720 since that's how we represent it in writing. 127 00:05:47,720 --> 00:05:49,580 So back to our ASCII table. 128 00:05:49,580 --> 00:05:52,770 After the first 32 control characters come the 95 129 00:05:52,770 --> 00:05:54,280 printable characters. 130 00:05:54,280 --> 00:05:55,800 >> There are a couple cool design decisions worth 131 00:05:55,800 --> 00:05:57,330 talking about here. 132 00:05:57,330 --> 00:06:00,810 First, the decimal digit characters, zero through nine, 133 00:06:00,810 --> 00:06:04,050 correspond to the numbers 48 through 57, which seems 134 00:06:04,050 --> 00:06:06,980 unremarkable until we look at the numbers 48 through 57 135 00:06:06,980 --> 00:06:09,080 written in binary notation. 136 00:06:09,080 --> 00:06:11,530 If we do that, then we see that the digit character, 137 00:06:11,530 --> 00:06:22,320 zero, corresponds to 0110000, one maps to 0110001, two to 138 00:06:22,320 --> 00:06:26,640 0110010, and so on. 139 00:06:26,640 --> 00:06:27,950 See the pattern? 140 00:06:27,950 --> 00:06:30,170 Each digit character is mapped to its corresponding 141 00:06:30,170 --> 00:06:35,170 equivalent in binary notation, prefixed with 011. 142 00:06:35,170 --> 00:06:38,820 Next up, you notice that the uppercase letters start at 65, 143 00:06:38,820 --> 00:06:41,310 with uppercase A, but the lowercase letters 144 00:06:41,310 --> 00:06:43,010 don't start until 97.
145 00:06:43,010 --> 00:06:45,580 So there are 32 spaces in between. 146 00:06:45,580 --> 00:06:47,000 That seems weird. 147 00:06:47,000 --> 00:06:49,500 There are only 26 letters in the alphabet. 148 00:06:49,500 --> 00:06:51,410 >> Why split them up like this? 149 00:06:51,410 --> 00:06:53,960 Again, if we look at the binary representations, we can 150 00:06:53,960 --> 00:06:55,230 see a pattern. 151 00:06:55,230 --> 00:07:01,360 Uppercase A is represented by 1000001, and lowercase a is 152 00:07:01,360 --> 00:07:05,810 represented by 1100001. 153 00:07:05,810 --> 00:07:12,770 Uppercase B is represented by 1000010, and lowercase b is 154 00:07:12,770 --> 00:07:17,280 represented by 1100010. 155 00:07:17,280 --> 00:07:19,440 Can you tell what's going on here? 156 00:07:19,440 --> 00:07:22,470 The bit that's the second from the left, in the two to the 157 00:07:22,470 --> 00:07:26,510 fifth, or 32s, position, is 0 for all of the uppercase 158 00:07:26,510 --> 00:07:30,120 letters, and 1 for all of the lowercase letters. 159 00:07:30,120 --> 00:07:33,130 >> That means converting from uppercase to lowercase, and 160 00:07:33,130 --> 00:07:36,000 vice versa, is a matter of a simple bit flip. 161 00:07:36,000 --> 00:07:38,380 So that brings us to the end of the ASCII table. 162 00:07:38,380 --> 00:07:40,700 Can you think of anything we've forgotten? 163 00:07:40,700 --> 00:07:42,510 Well, what about the Spanish enye, or the 164 00:07:42,510 --> 00:07:44,630 Greek or Cyrillic alphabets? 165 00:07:44,630 --> 00:07:46,610 And how about Chinese characters? 166 00:07:46,610 --> 00:07:49,050 There's a lot that's been left out of ASCII. 167 00:07:49,050 --> 00:07:51,920 However, another standard called Unicode has been 168 00:07:51,920 --> 00:07:53,040 developed to cover all of these 169 00:07:53,040 --> 00:07:54,840 characters and many more. 170 00:07:54,840 --> 00:07:57,040 >> But that's a subject for another time.
171 00:07:57,040 --> 00:07:58,500 My name is Nate Hardison. 172 00:07:58,500 --> 00:08:00,650 This is CS50.