1 00:00:00,000 --> 00:00:02,000 [File I/O] 2 00:00:02,000 --> 00:00:04,000 [Jason Hirschhorn, Harvard University] 3 00:00:04,000 --> 00:00:07,000 [This is CS50, CS50.TV] 4 00:00:07,000 --> 00:00:11,000 When we think of a file, what comes to mind is a Microsoft Word document, 5 00:00:11,000 --> 00:00:14,000 a JPEG image, or an MP3 song, 6 00:00:14,000 --> 00:00:17,000 and we interact with each of these types of files in different ways. 7 00:00:17,000 --> 00:00:20,000 For example, in a Word document we add text 8 00:00:20,000 --> 00:00:24,000 while with a JPEG image we might crop out the edges or retouch the colors. 9 00:00:24,000 --> 00:00:28,000 Yet under the hood all of the files in our computer are nothing more 10 00:00:28,000 --> 00:00:31,000 than a long sequence of zeros and ones. 11 00:00:31,000 --> 00:00:33,000 It's up to the specific application that interacts with the file 12 00:00:33,000 --> 00:00:38,000 to decide how to process this long sequence and present it to the user. 13 00:00:38,000 --> 00:00:41,000 On one hand, a document may look at just one byte, 14 00:00:41,000 --> 00:00:45,000 or 8 zeros and ones, and display an ASCII character on the screen. 15 00:00:45,000 --> 00:00:48,000 On the other hand, a bitmap image may look at 3 bytes, 16 00:00:48,000 --> 00:00:50,000 or 24 zeros and ones, 17 00:00:50,000 --> 00:00:53,000 and interpret them as 3 hexadecimal numbers 18 00:00:53,000 --> 00:00:56,000 that represent the values for red, green, and blue 19 00:00:56,000 --> 00:00:58,000 in one pixel of an image. 20 00:00:58,000 --> 00:01:01,000 Whatever they may look like on your screen, at their core, 21 00:01:01,000 --> 00:01:05,000 files are nothing more than a sequence of zeros and ones. 22 00:01:05,000 --> 00:01:08,000 So let's dive in and look at how we actually manipulate these zeros and ones 23 00:01:08,000 --> 00:01:12,000 when it comes to writing to and reading from a file. 
24 00:01:12,000 --> 00:01:15,000 >> I'll start by breaking it down into a simple 3-part process. 25 00:01:15,000 --> 00:01:19,000 Next, I'll dive into two code examples that demonstrate these three parts. 26 00:01:19,000 --> 00:01:23,000 Finally, I'll review the process and some of its most important details. 27 00:01:23,000 --> 00:01:25,000 As with any file that sits on your desktop, 28 00:01:25,000 --> 00:01:28,000 the first thing to do is to open it. 29 00:01:28,000 --> 00:01:31,000 In C we do this by declaring a pointer to a predefined struct 30 00:01:31,000 --> 00:01:33,000 that represents a file on disk. 31 00:01:33,000 --> 00:01:38,460 In this function call, we also decide whether we want to write to or read from the file. 32 00:01:38,460 --> 00:01:41,660 Next, we do the actual reading and writing. 33 00:01:41,660 --> 00:01:44,800 There are a number of specialized functions we can use in this part, 34 00:01:44,800 --> 00:01:48,790 and almost all of them start with the letter F, which stands for file. 35 00:01:48,790 --> 00:01:53,560 Last, akin to the little red X in the top corner of the files open on your computer, 36 00:01:53,560 --> 00:01:56,680 we close the file with a final function call. 37 00:01:56,680 --> 00:01:59,540 Now that we have a general idea of what we're going to do, 38 00:01:59,540 --> 00:02:02,000 let's dive into the code. 39 00:02:02,000 --> 00:02:06,100 >> In this directory, we have two C files and their corresponding executable files. 40 00:02:06,100 --> 00:02:09,710 The typewriter program takes one command line argument, 41 00:02:09,710 --> 00:02:12,060 the name of the document we want to create. 42 00:02:12,060 --> 00:02:16,160 In this case, we'll call it doc.txt. 43 00:02:16,160 --> 00:02:19,080 Let's run the program and enter a couple of lines. 44 00:02:19,080 --> 00:02:23,660 Hi. My name is Jason. 45 00:02:23,660 --> 00:02:26,710 Finally, we'll type "quit." 
46 00:02:26,710 --> 00:02:29,720 If we now list all of the files in this directory, 47 00:02:29,720 --> 00:02:33,770 we see that a new document exists called doc.txt. 48 00:02:34,190 --> 00:02:36,110 That's the file this program just created. 49 00:02:36,110 --> 00:02:40,520 And of course, it too is nothing more than a long sequence of zeros and ones. 50 00:02:41,100 --> 00:02:43,260 If we open this new file, 51 00:02:43,260 --> 00:02:45,870 we see the 3 lines of code we entered into our program-- 52 00:02:46,060 --> 00:02:49,060 Hi. My name is Jason. 53 00:02:49,580 --> 00:02:52,090 But what's actually going on when typewriter.c runs? 54 00:02:52,810 --> 00:02:55,520 The first line of interest for us is line 24. 55 00:02:55,560 --> 00:02:58,490 In this line, we declare our file pointer. 56 00:02:59,080 --> 00:03:03,140 The function that returns this pointer, fopen, takes two arguments. 57 00:03:03,140 --> 00:03:07,440 The first is the file name including the file extension if appropriate. 58 00:03:07,440 --> 00:03:10,980 Recall that a file extension does not influence the file at its lowest level. 59 00:03:10,980 --> 00:03:14,640 We're always dealing with a long sequence of zeros and ones. 60 00:03:14,640 --> 00:03:19,630 But it does influence how files are interpreted and what applications are used to open them. 61 00:03:19,630 --> 00:03:22,290 The second argument to fopen is a single letter 62 00:03:22,290 --> 00:03:25,300 that stands for what we plan to do after we open the file. 63 00:03:25,300 --> 00:03:30,630 There are three options for this argument--w, r, and a. 64 00:03:30,630 --> 00:03:34,900 We've chosen w in this case because we want to write to the file. 65 00:03:34,900 --> 00:03:38,820 r, as you can probably guess, is for reading from the file. 66 00:03:38,820 --> 00:03:41,760 And a is for appending to the file. 
67 00:03:41,760 --> 00:03:44,960 While both w and a may be used for writing to files, 68 00:03:44,960 --> 00:03:47,460 w will start writing from the beginning of the file 69 00:03:47,460 --> 00:03:50,810 and potentially overwrite any data that have previously been stored. 70 00:03:50,810 --> 00:03:54,070 By default, the file we open, if it doesn't already exist, 71 00:03:54,070 --> 00:03:57,180 is created in our present working directory. 72 00:03:57,180 --> 00:04:00,540 However, if we want to access or create a file in a different location, 73 00:04:00,540 --> 00:04:02,650 in the first argument of fopen, 74 00:04:02,650 --> 00:04:05,840 we may specify a file path in addition to the file name. 75 00:04:05,840 --> 00:04:09,490 While the first part of this process is only one line of code long, 76 00:04:09,490 --> 00:04:12,350 it's always good practice to include another set of lines 77 00:04:12,350 --> 00:04:15,930 that check to ensure that the file was successfully opened or created. 78 00:04:15,930 --> 00:04:20,300 If fopen returns null, we wouldn't want to forge ahead with our program, 79 00:04:20,300 --> 00:04:23,270 and this may happen if the operating system is out of memory 80 00:04:23,270 --> 00:04:27,940 or if we try to open a file in a directory for which we didn't have the proper permissions. 81 00:04:27,940 --> 00:04:31,780 >> Part two of the process takes place in typewriter's while loop. 82 00:04:31,780 --> 00:04:35,000 We use a CS50 library function to get input from the user, 83 00:04:35,000 --> 00:04:37,190 and assuming they don't want to quit the program, 84 00:04:37,190 --> 00:04:41,940 we use the function fputs to take the string and write it to the file. 85 00:04:41,940 --> 00:04:46,700 fputs is only one of the many functions we could use to write to the file. 86 00:04:46,700 --> 00:04:51,920 Others include fwrite, fputc, and even fprintf. 
87 00:04:51,920 --> 00:04:54,840 Regardless of the particular function we end up using, though, 88 00:04:54,840 --> 00:04:57,480 all of them need to know, via their arguments, 89 00:04:57,480 --> 00:04:59,670 at least two things-- 90 00:04:59,670 --> 00:05:03,140 what needs to be written and where it needs to be written to. 91 00:05:03,140 --> 00:05:07,240 In our case, input is the string that needs to be written 92 00:05:07,240 --> 00:05:11,290 and fp is the pointer that directs us to where we're writing. 93 00:05:11,290 --> 00:05:15,330 In this program, part two of the process is rather straightforward. 94 00:05:15,330 --> 00:05:17,360 We're simply taking a string from the user 95 00:05:17,360 --> 00:05:22,120 and adding it directly to our file with little-to-no input validation or security checks. 96 00:05:22,120 --> 00:05:26,160 Often, however, part two will take up the bulk of your code. 97 00:05:26,160 --> 00:05:30,580 Finally, part three is on line 58, where we close the file. 98 00:05:30,580 --> 00:05:34,860 Here we call fclose and pass it our original file pointer. 99 00:05:34,860 --> 00:05:39,500 In the subsequent line, we return zero, signalling the end of our program. 100 00:05:39,500 --> 00:05:42,630 And, yes, part three is as simple as that. 101 00:05:42,630 --> 00:05:45,260 >> Let's move on to reading from files. 102 00:05:45,260 --> 00:05:48,220 Back in our directory we have a file called printer.c. 103 00:05:48,220 --> 00:05:50,910 Let's run it with the file we just created-- 104 00:05:50,910 --> 00:05:53,350 doc.txt. 105 00:05:53,350 --> 00:05:58,150 This program, as the name suggests, will simply print out the contents of the file passed to it. 106 00:05:58,150 --> 00:06:00,230 And there we have it. 107 00:06:00,230 --> 00:06:03,780 The lines of code we had typed earlier and saved in doc.txt. 108 00:06:03,780 --> 00:06:06,980 Hi. My name is Jason. 
109 00:06:06,980 --> 00:06:09,120 If we dive into printer.c, 110 00:06:09,120 --> 00:06:13,570 we see that a lot of the code looks similar to what we just walked through in typewriter.c. 111 00:06:13,570 --> 00:06:16,720 Indeed line 22, where we opened the file, 112 00:06:16,720 --> 00:06:19,220 and line 39, where we closed the file, 113 00:06:19,220 --> 00:06:23,890 are both almost identical to typewriter.c, save for fopen's second argument. 114 00:06:23,890 --> 00:06:26,510 This time we're reading from a file, 115 00:06:26,510 --> 00:06:29,040 so we have chosen r instead of w. 116 00:06:29,040 --> 00:06:31,950 Thus, let's focus on the second part of the process. 117 00:06:31,950 --> 00:06:36,060 In line 35, as the second condition in our for loop, 118 00:06:36,060 --> 00:06:38,590 we make a call to fgets, 119 00:06:38,590 --> 00:06:42,190 the companion function to fputs from before. 120 00:06:42,190 --> 00:06:44,660 This time we have three arguments. 121 00:06:44,660 --> 00:06:48,810 The first is the pointer to the array of characters where the string will be stored. 122 00:06:48,810 --> 00:06:52,670 The second is the maximum number of characters to be read. 123 00:06:52,670 --> 00:06:56,010 And the third is the pointer to the file with which we're working. 124 00:06:56,010 --> 00:07:00,780 You'll notice that the for loop ends when fgets returns null. 125 00:07:00,780 --> 00:07:02,940 There are two reasons that this may have happened. 126 00:07:02,940 --> 00:07:05,380 First, an error may have occurred. 127 00:07:05,380 --> 00:07:10,740 Second, and more likely, the end of the file was reached and no more characters were read. 128 00:07:10,740 --> 00:07:14,040 In case you're wondering, two functions do exist that allow us to tell 129 00:07:14,040 --> 00:07:17,160 which reason is the cause for this particular null pointer. 
130 00:07:17,160 --> 00:07:21,090 And, not surprisingly, since they have to do with working with files, 131 00:07:21,090 --> 00:07:26,940 both the ferror function and the feof function start with the letter f. 132 00:07:26,940 --> 00:07:32,130 >> Finally, before we conclude, one quick note about the end of file function, 133 00:07:32,130 --> 00:07:36,690 which, as just mentioned, is written as feof. 134 00:07:36,690 --> 00:07:41,550 Often you'll find yourself using while and for loops to progressively read your way through files. 135 00:07:41,550 --> 00:07:45,790 Thus, you'll need a way to end these loops after you reach the end of these files. 136 00:07:45,790 --> 00:07:50,510 Calling feof on your file pointer and checking to see if it's true 137 00:07:50,510 --> 00:07:52,310 would do just that. 138 00:07:52,310 --> 00:07:59,820 Thus, a while loop with the condition (!feof(fp)) might seem like a perfectly appropriate solution. 139 00:07:59,820 --> 00:08:03,770 However, say we have one line left in our text file. 140 00:08:03,770 --> 00:08:07,130 We'll enter our while loop and everything will work out as planned. 141 00:08:07,130 --> 00:08:12,750 On the next round through, our program will check to see if feof of fp is true, 142 00:08:12,750 --> 00:08:15,430 but--and this is the crucial point to understand here-- 143 00:08:15,430 --> 00:08:17,770 it won't be true just yet. 144 00:08:17,770 --> 00:08:21,110 That's because the purpose of feof is not to check 145 00:08:21,110 --> 00:08:24,400 if the next call to a read function will hit the end of the file, 146 00:08:24,400 --> 00:08:28,190 but rather to check whether or not the end of the file has already been reached. 147 00:08:28,190 --> 00:08:30,140 In the case of this example, 148 00:08:30,140 --> 00:08:32,780 reading the last line of our file goes perfectly smoothly, 149 00:08:32,780 --> 00:08:36,210 but the program doesn't yet know that we've hit the end of our file. 
150 00:08:36,210 --> 00:08:40,549 It's not until it does one additional read that it encounters the end of the file. 151 00:08:40,549 --> 00:08:43,210 Thus, a correct condition would be the following: 152 00:08:43,210 --> 00:08:49,330 fgets and its three arguments--output, size of output, and fp-- 153 00:08:49,330 --> 00:08:52,570 and all of that not equal to null. 154 00:08:52,570 --> 00:08:55,260 This is the approach we took in printer.c, 155 00:08:55,260 --> 00:08:57,890 and in this case, after the loop exits, 156 00:08:57,890 --> 00:09:04,290 you could call feof or ferror to inform the user as to the specific reasoning for exiting this loop. 157 00:09:04,290 --> 00:09:08,100 >> Writing to and reading from a file is, at its most basic, 158 00:09:08,100 --> 00:09:10,150 a simple 3-part process. 159 00:09:10,150 --> 00:09:12,530 First, we open the file. 160 00:09:12,530 --> 00:09:16,740 Second, we put some things into our file or take some things out of it. 161 00:09:16,740 --> 00:09:19,200 Third, we close the file. 162 00:09:19,200 --> 00:09:21,170 The first and last parts are easy. 163 00:09:21,170 --> 00:09:23,920 The middle part is where the tricky stuff lies. 164 00:09:23,920 --> 00:09:27,760 And though underneath the hood we're always dealing with a long sequence of zeros and ones, 165 00:09:27,760 --> 00:09:30,710 it does help when coding to add a layer of abstraction 166 00:09:30,710 --> 00:09:35,350 that turns the sequence into something that more closely resembles what we're used to seeing. 167 00:09:35,350 --> 00:09:39,570 For example, if we're working with a 24-bit bitmap file, 168 00:09:39,570 --> 00:09:43,290 we'll likely be reading or writing three bytes at a time. 169 00:09:43,290 --> 00:09:46,450 In which case, it would make sense to define and appropriately name 170 00:09:46,450 --> 00:09:48,980 a struct that is 3 bytes large. 
171 00:09:48,980 --> 00:09:51,410 >> Though working with files may seem complicated, 172 00:09:51,410 --> 00:09:54,530 utilizing them allows us to do something truly remarkable. 173 00:09:54,530 --> 00:09:58,880 We can change the state of the world outside our program, 174 00:09:58,880 --> 00:10:01,730 we can create something that lives beyond the life of our program, 175 00:10:01,730 --> 00:10:07,190 or we can even change something that was created before our program started running. 176 00:10:07,190 --> 00:10:11,210 Interacting with files is a truly powerful part of programming in C, 177 00:10:11,210 --> 00:10:15,300 and I'm excited to see what you're going to create with it in the code to come. 178 00:10:15,300 --> 00:10:19,770 My name is Jason Hirschhorn. This is CS50. 179 00:10:19,770 --> 00:10:21,770 [CS50.TV] 180 00:10:21,770 --> 00:10:25,940 >> [Laughter] 181 00:10:25,940 --> 00:10:29,330 Okay. One take. Here we go. 182 00:10:49,000 --> 00:10:52,140 When we think of a file-- >>Oh, wait. Sorry. 183 00:10:52,140 --> 00:10:56,800 [Laughter] Okay. 184 00:11:06,620 --> 00:11:09,970 Hey there. 185 00:11:13,670 --> 00:11:16,310 When we think of a file-- 186 00:11:17,610 --> 00:11:20,710 When you think of a file-- Okay. Tell me when you're ready. 187 00:11:20,710 --> 00:11:22,520 Oh, great. 188 00:11:22,520 --> 00:11:26,180 Though reading from a teleprompter may seem--no. My bad.