1 00:00:07,200 --> 00:00:09,100 ROB BOWDEN: Let's talk about compilers. 2 00:00:09,100 --> 00:00:11,490 Until this point, you've just typed up your source code into 3 00:00:11,490 --> 00:00:14,260 some files, sent them through this big black box that is 4 00:00:14,260 --> 00:00:16,890 Clang, and out comes your executable file that does 5 00:00:16,890 --> 00:00:19,430 exactly what you wrote in your source code. 6 00:00:19,430 --> 00:00:22,170 As magical as that's been, we're going to take a closer 7 00:00:22,170 --> 00:00:23,590 look at what's actually happening 8 00:00:23,590 --> 00:00:25,220 when we compile a file. 9 00:00:25,220 --> 00:00:28,580 So what does it mean to compile something? 10 00:00:28,580 --> 00:00:31,150 >> Well, in the most general sense, it just means 11 00:00:31,150 --> 00:00:32,580 transforming code written in one 12 00:00:32,580 --> 00:00:34,680 programming language to another. 13 00:00:34,680 --> 00:00:37,550 But usually when people say they compile something, they 14 00:00:37,550 --> 00:00:39,660 mean they're taking it from a higher level programming 15 00:00:39,660 --> 00:00:42,460 language to a lower level programming language. 16 00:00:42,460 --> 00:00:44,960 These may seem like very subjective terms. 17 00:00:44,960 --> 00:00:48,090 For example, you probably don't think of C as a high 18 00:00:48,090 --> 00:00:51,440 level programming language, but you do compile it. 19 00:00:51,440 --> 00:00:52,730 But it's all relative. 20 00:00:52,730 --> 00:00:55,790 As we'll see, the assembly code and eventually machine 21 00:00:55,790 --> 00:00:59,270 code that we compile down to is undeniably a lower level 22 00:00:59,270 --> 00:01:00,700 than C. 23 00:01:00,700 --> 00:01:03,310 Although we'll be using Clang in today's demonstration, a 24 00:01:03,310 --> 00:01:06,360 lot of the ideas here carry over to other compilers. 
25 00:01:06,360 --> 00:01:09,160 >> For Clang, there are four major steps in the overall 26 00:01:09,160 --> 00:01:10,200 compilation. 27 00:01:10,200 --> 00:01:15,430 These are one, preprocessing done by the preprocessor; two, 28 00:01:15,430 --> 00:01:19,530 compilation done by the compiler; three, assembling 29 00:01:19,530 --> 00:01:22,010 done by the assembler; and four, 30 00:01:22,010 --> 00:01:24,640 linking done by the linker. 31 00:01:24,640 --> 00:01:27,600 It may be confusing that one of the substeps of the overall 32 00:01:27,600 --> 00:01:30,980 Clang compiler is called the compiler, but 33 00:01:30,980 --> 00:01:32,530 we'll get to that. 34 00:01:32,530 --> 00:01:35,050 We'll be using a simple hello world program as our example 35 00:01:35,050 --> 00:01:36,270 throughout this video. 36 00:01:36,270 --> 00:01:38,380 Let's take a look. 37 00:01:38,380 --> 00:01:40,330 >> The first step is preprocessing. 38 00:01:40,330 --> 00:01:42,520 What does the preprocessor do? 39 00:01:42,520 --> 00:01:45,560 In pretty much every C program you've ever read or written, 40 00:01:45,560 --> 00:01:48,310 you've used lines of code that begin with a hash. 41 00:01:48,310 --> 00:01:51,730 I'll call it hash, but you may also call it pound, number 42 00:01:51,730 --> 00:01:53,280 sign, or sharp. 43 00:01:53,280 --> 00:01:56,840 Any such line is a preprocessor directive. 44 00:01:56,840 --> 00:02:00,650 You've probably seen #define and #include before, but there 45 00:02:00,650 --> 00:02:03,690 are several more that the preprocessor recognizes. 46 00:02:03,690 --> 00:02:07,340 Let's add a #define to our hello world example. 47 00:02:07,340 --> 00:02:11,690 Now let's run just the preprocessor on this file. 48 00:02:11,690 --> 00:02:16,150 By passing Clang the -E flag, you're instructing it to run 49 00:02:16,150 --> 00:02:17,880 just the preprocessor. 50 00:02:17,880 --> 00:02:19,130 Let's see what happens.
51 00:02:22,250 --> 00:02:24,020 It looks like Clang just spits out everything 52 00:02:24,020 --> 00:02:25,200 at the command line. 53 00:02:25,200 --> 00:02:27,800 In order to save all of this output to a new file called 54 00:02:27,800 --> 00:02:33,850 hello2.c, we'll append > hello2.c to our command. 55 00:02:33,850 --> 00:02:37,800 Now let's take a look at our preprocessed file. 56 00:02:37,800 --> 00:02:40,810 >> Whoa, what happened to our short little program? 57 00:02:40,810 --> 00:02:43,890 If we go all the way to the bottom of this file, we'll see 58 00:02:43,890 --> 00:02:46,070 some of the code that we actually wrote. 59 00:02:46,070 --> 00:02:49,800 Notice that the #define is gone and all instances of name 60 00:02:49,800 --> 00:02:51,950 have been replaced with exactly what we specified in 61 00:02:51,950 --> 00:02:53,590 the #define line. 62 00:02:53,590 --> 00:02:56,530 So what are all these typedefs and function declarations 63 00:02:56,530 --> 00:02:58,140 at the top of the file? 64 00:02:58,140 --> 00:03:00,820 Notice that the #define wasn't the only preprocessor 65 00:03:00,820 --> 00:03:02,390 directive that we specified. 66 00:03:02,390 --> 00:03:05,280 We also have #include <stdio.h>. 67 00:03:05,280 --> 00:03:09,560 So all of the crazy lines are actually just stdio.h copied 68 00:03:09,560 --> 00:03:11,810 and pasted into the top of this file. 69 00:03:11,810 --> 00:03:14,110 That's why header files are so useful for function 70 00:03:14,110 --> 00:03:15,160 declarations. 71 00:03:15,160 --> 00:03:17,740 Instead of needing to copy and paste all of the function 72 00:03:17,740 --> 00:03:21,050 declarations you plan on using at the top of your file, the 73 00:03:21,050 --> 00:03:22,990 preprocessor will copy and paste them from the header 74 00:03:22,990 --> 00:03:24,140 file for you. 75 00:03:24,140 --> 00:03:26,480 >> Now that we're done preprocessing, we move on to 76 00:03:26,480 --> 00:03:27,680 compilation.
77 00:03:27,680 --> 00:03:30,725 The reason we call this step compilation is because this is 78 00:03:30,725 --> 00:03:34,130 the step where Clang actually does its compiling from C to 79 00:03:34,130 --> 00:03:35,370 assembly code. 80 00:03:35,370 --> 00:03:38,280 In order to have Clang compile a file down to assembly, but 81 00:03:38,280 --> 00:03:42,030 continue no further, pass it the -S flag 82 00:03:42,030 --> 00:03:43,560 at the command line. 83 00:03:43,560 --> 00:03:44,790 Let's take a look at the assembly 84 00:03:44,790 --> 00:03:47,390 file that was outputted. 85 00:03:47,390 --> 00:03:49,740 It looks like quite a different language. 86 00:03:49,740 --> 00:03:52,660 Assembly code is very processor specific. 87 00:03:52,660 --> 00:03:55,440 In this case, since the CS50 appliance is running on a 88 00:03:55,440 --> 00:04:00,470 virtual x86 processor, this is x86 assembly code. 89 00:04:00,470 --> 00:04:03,450 Very few people write directly in assembly code these days, 90 00:04:03,450 --> 00:04:06,490 but every C program you ever write gets transformed down 91 00:04:06,490 --> 00:04:07,940 into assembly. 92 00:04:07,940 --> 00:04:11,440 Again, we call this step compiling the C into assembly 93 00:04:11,440 --> 00:04:14,170 since we are going from a higher level to a lower level 94 00:04:14,170 --> 00:04:15,480 programming language. 95 00:04:15,480 --> 00:04:17,880 >> What makes assembly lower level than C? 96 00:04:17,880 --> 00:04:21,660 Well, in assembly, we are very limited in what we can do. 97 00:04:21,660 --> 00:04:25,120 There are no if's, while's, for's, or loops of any kind. 98 00:04:25,120 --> 00:04:27,560 But you can accomplish the same things that these control 99 00:04:27,560 --> 00:04:30,270 structures offer using the limited operations that 100 00:04:30,270 --> 00:04:32,350 assembly does provide. 
101 00:04:32,350 --> 00:04:35,960 But to see just how low level assembly really is, let's go 102 00:04:35,960 --> 00:04:39,320 one step further in our compilation, assembling. 103 00:04:39,320 --> 00:04:41,890 It's the assembler's job to transform the assembly code 104 00:04:41,890 --> 00:04:44,740 into object or machine code. 105 00:04:44,740 --> 00:04:47,610 Remember that the assembler does not output assembly; 106 00:04:47,610 --> 00:04:51,080 rather, it takes in assembly and outputs machine code. 107 00:04:51,080 --> 00:04:54,040 Machine code is the actual 1's and 0's that a CPU can 108 00:04:54,040 --> 00:04:57,290 understand, although we still have a tiny bit of work left 109 00:04:57,290 --> 00:04:59,380 before we can run our program. 110 00:04:59,380 --> 00:05:01,400 Let's assemble our assembly code by passing 111 00:05:01,400 --> 00:05:04,080 Clang the -c flag. 112 00:05:04,080 --> 00:05:06,410 Now let's see what's in the assembled file. 113 00:05:06,410 --> 00:05:09,220 >> Well, that doesn't help us very much. 114 00:05:09,220 --> 00:05:11,340 Remember that the machine code is the ones and zeros that 115 00:05:11,340 --> 00:05:13,240 your computer can understand. 116 00:05:13,240 --> 00:05:16,080 That doesn't mean it's easy for us to understand. 117 00:05:16,080 --> 00:05:19,160 So exactly how low level is assembly? 118 00:05:19,160 --> 00:05:21,480 It's nearly identical to object code. 119 00:05:21,480 --> 00:05:24,300 Going from assembly to object code is much more of a 120 00:05:24,300 --> 00:05:27,540 translation than a transformation, which is why 121 00:05:27,540 --> 00:05:29,310 one might not consider the assembler to 122 00:05:29,310 --> 00:05:31,400 do any actual compiling. 123 00:05:31,400 --> 00:05:34,110 In fact, it's pretty easy to manually translate from 124 00:05:34,110 --> 00:05:36,050 assembly to machine code. 
125 00:05:36,050 --> 00:05:39,040 Looking at the assembly for a main function, that first line 126 00:05:39,040 --> 00:05:42,100 happens to correspond to hexadecimal 0x55. 127 00:05:42,100 --> 00:05:45,470 In binary, that's 01010101. 128 00:05:45,470 --> 00:05:49,300 The second line happens to correspond to hexadecimal 0x89E5. 129 00:05:49,300 --> 00:05:51,290 And the next, 0x56. 130 00:05:51,290 --> 00:05:53,730 Given a relatively simple table, you could translate 131 00:05:53,730 --> 00:05:57,130 assembly into the code that machines can understand too. 132 00:05:57,130 --> 00:05:58,810 >> So there's one remaining step in 133 00:05:58,810 --> 00:06:01,150 compilation, which is linking. 134 00:06:01,150 --> 00:06:04,530 Linking combines a bunch of object files into one big file 135 00:06:04,530 --> 00:06:06,380 that you can actually execute. 136 00:06:06,380 --> 00:06:08,570 Linking is very system dependent. 137 00:06:08,570 --> 00:06:11,030 So the easiest way to get Clang to just link object 138 00:06:11,030 --> 00:06:13,920 files together is to call Clang on all of the files that 139 00:06:13,920 --> 00:06:15,190 you want to link together. 140 00:06:15,190 --> 00:06:18,740 If you specify .o files, then it won't need to preprocess, 141 00:06:18,740 --> 00:06:21,680 compile, and assemble all of your source code. 142 00:06:21,680 --> 00:06:23,960 Let's throw a math function into our file, so we have 143 00:06:23,960 --> 00:06:25,210 something to link in. 144 00:06:34,220 --> 00:06:37,010 Now let's compile it back down to object code and 145 00:06:37,010 --> 00:06:38,260 call Clang on it. 146 00:06:40,560 --> 00:06:41,420 Oops. 147 00:06:41,420 --> 00:06:43,790 Since we included a math function, we need to link in 148 00:06:43,790 --> 00:06:46,610 the math library with -lm.
149 00:06:46,610 --> 00:06:48,990 >> If we wanted to link together a bunch of .o files that we 150 00:06:48,990 --> 00:06:51,420 wrote on our own, we'd just specify them all at the 151 00:06:51,420 --> 00:06:52,460 command line. 152 00:06:52,460 --> 00:06:55,320 The restriction is that exactly one of these files must 153 00:06:55,320 --> 00:06:57,790 actually specify a main function, or else the 154 00:06:57,790 --> 00:06:59,930 resulting executable wouldn't know where to start 155 00:06:59,930 --> 00:07:00,910 running your code. 156 00:07:00,910 --> 00:07:03,360 What's the difference between specifying a file to link in 157 00:07:03,360 --> 00:07:06,600 with -l and just specifying a file directly? 158 00:07:06,600 --> 00:07:07,440 Nothing. 159 00:07:07,440 --> 00:07:09,850 It's just that Clang happens to know exactly what file 160 00:07:09,850 --> 00:07:12,560 something like -lm happens to refer to. 161 00:07:12,560 --> 00:07:14,700 If you knew that file yourself, you could specify it 162 00:07:14,700 --> 00:07:15,930 explicitly. 163 00:07:15,930 --> 00:07:18,990 Just remember that all -l flags have to come at the end 164 00:07:18,990 --> 00:07:20,770 of your Clang command. 165 00:07:20,770 --> 00:07:22,300 >> And that's all there is to it. 166 00:07:22,300 --> 00:07:24,940 When you just run Clang on some files, this is what it's 167 00:07:24,940 --> 00:07:26,350 actually doing. 168 00:07:26,350 --> 00:07:29,490 My name is Rob Bowden, and this is CS50.