[File I/O] [Jason Hirschhorn, Harvard University] [This is CS50, CS50.TV]

When we think of a file, what comes to mind is a Microsoft Word document, a JPEG image, or an MP3 song, and we interact with each of these types of files in different ways. For example, in a Word document we add text, while with a JPEG image we might crop out the edges or retouch the colors. Yet under the hood all of the files in our computer are nothing more than a long sequence of zeros and ones. It's up to the specific application that interacts with the file to decide how to process this long sequence and present it to the user. On one hand, a text editor may look at just one byte, or 8 zeros and ones, and display an ASCII character on the screen. On the other hand, a program displaying a bitmap image may look at 3 bytes, or 24 zeros and ones, and interpret them as 3 numbers, typically written in hexadecimal, that represent the values for red, green, and blue in one pixel of an image. Whatever they may look like on your screen, at their core, files are nothing more than a sequence of zeros and ones.

So let's dive in and look at how we actually manipulate these zeros and ones when it comes to writing to and reading from a file. I'll start by breaking it down into a simple 3-part process. Next, I'll dive into two code examples that demonstrate these three parts. Finally, I'll review the process and some of its most important details.

As with any file that sits on your desktop, the first thing to do is to open it. In C we do this by declaring a pointer to a predefined struct that represents a file on disk and assigning it the result of a function call. In this function call, we also decide whether we want to write to or read from the file. Next, we do the actual reading and writing. There are a number of specialized functions we can use in this part, and almost all of them start with the letter f, which stands for file. Last, akin to the little red X in the top corner of the files open on your computer, we close the file with a final function call.
Now that we have a general idea of what we're going to do, let's dive into the code. In this directory, we have two C files and their corresponding executable files. The typewriter program takes one command line argument, the name of the document we want to create. In this case, we'll call it doc.txt. Let's run the program and enter a couple of lines. Hi. My name is Jason. Finally, we'll type "quit." If we now list all of the files in this directory, we see that a new document exists called doc.txt. That's the file this program just created. And of course, it too is nothing more than a long sequence of zeros and ones. If we open this new file, we see the 3 lines of text we entered into our program: Hi. My name is Jason.

But what's actually going on when typewriter.c runs? The first line of interest for us is line 24. In this line, we declare our file pointer. The function that returns this pointer, fopen, takes two arguments. The first is the file name, including the file extension if appropriate. Recall that a file extension does not influence the file at its lowest level. We're always dealing with a long sequence of zeros and ones. But it does influence how files are interpreted and what applications are used to open them.

The second argument to fopen is a mode string, here a single letter, that stands for what we plan to do after we open the file. There are three basic options for this argument: w, r, and a. We've chosen w in this case because we want to write to the file. r, as you can probably guess, is for reading from the file. And a is for appending to the file. While both w and a may be used for writing to files, w starts writing from the beginning of the file and discards any data previously stored there, whereas a adds to the end of what is already there. By default, the file we open, if it doesn't already exist, is created in our present working directory.
However, if we want to access or create a file in a different location, in the first argument of fopen, we may specify a file path in addition to the file name. While the first part of this process is only one line of code long, it's always good practice to include another set of lines that check to ensure that the file was successfully opened or created. If fopen returns NULL, we wouldn't want to forge ahead with our program, and this may happen if the operating system is out of memory or if we try to open a file in a directory for which we don't have the proper permissions.

Part two of the process takes place in typewriter's while loop. We use a CS50 library function to get input from the user, and assuming they don't want to quit the program, we use the function fputs to take the string and write it to the file. fputs is only one of the many functions we could use to write to the file. Others include fwrite, fputc, and even fprintf. Regardless of the particular function we end up using, though, all of them need to know, via their arguments, at least two things: what needs to be written and where it needs to be written to. In our case, input is the string that needs to be written and fp is the pointer that directs us to where we're writing. In this program, part two of the process is rather straightforward. We're simply taking a string from the user and adding it directly to our file with little-to-no input validation or security checks. Often, however, part two will take up the bulk of your code.

Finally, part three is on line 58, where we close the file. Here we call fclose and pass it our original file pointer. In the subsequent line, we return zero, signalling the end of our program. And, yes, part three is as simple as that.

Let's move on to reading from files. Back in our directory we have a file called printer.c. Let's run it with the file we just created, doc.txt.
This program, as the name suggests, will simply print out the contents of the file passed to it. And there we have it. The lines of text we had typed earlier and saved in doc.txt. Hi. My name is Jason.

If we dive into printer.c, we see that a lot of the code looks similar to what we just walked through in typewriter.c. Indeed line 22, where we opened the file, and line 39, where we closed the file, are both almost identical to typewriter.c, save for fopen's second argument. This time we're reading from a file, so we have chosen r instead of w. Thus, let's focus on the second part of the process.

In line 35, as the second condition in our for loop, we make a call to fgets, the companion function to fputs from before. This time we have three arguments. The first is the pointer to the array of characters where the string will be stored. The second is the size of that array: fgets reads at most one fewer character than this number, leaving room for the terminating null character. And the third is the pointer to the file with which we're working. You'll notice that the for loop ends when fgets returns NULL. There are two reasons that this may have happened. First, an error may have occurred. Second, and more likely, the end of the file was reached and no more characters were read. In case you're wondering, two functions do exist that allow us to tell which reason is the cause for this particular null pointer. And, not surprisingly, since they have to do with working with files, both the ferror function and the feof function start with the letter f.

Finally, before we conclude, one quick note about the end of file function, which, as just mentioned, is written as feof. Often you'll find yourself using while and for loops to progressively read your way through files. Thus, you'll need a way to end these loops after you reach the end of these files. Calling feof on your file pointer and checking to see if it's true would do just that. Thus, a while loop with the condition (!feof(fp)) might seem like a perfectly appropriate solution.
However, say we have one line left in our text file. We'll enter our while loop and everything will work out as planned. On the next round through, our program will check to see if feof of fp is true, but, and this is the crucial point to understand here, it won't be true just yet. That's because the purpose of feof is not to check if the next call to a read function will hit the end of the file, but rather to check whether or not the end of the file has already been reached. In the case of this example, reading the last line of our file goes perfectly smoothly, but the program doesn't yet know that we've hit the end of our file. It's not until it does one additional read that it encounters the end of the file. Thus, a correct condition would be the following: fgets and its three arguments (output, size of output, and fp), and all of that not equal to NULL, that is, fgets(output, sizeof(output), fp) != NULL. This is the approach we took in printer.c, and in this case, after the loop exits, you could call feof or ferror to inform the user as to the specific reason for exiting this loop.

Writing to and reading from a file is, at its most basic, a simple 3-part process. First, we open the file. Second, we put some things into our file or take some things out of it. Third, we close the file. The first and last parts are easy. The middle part is where the tricky stuff lies. And though underneath the hood we're always dealing with a long sequence of zeros and ones, it does help when coding to add a layer of abstraction that turns the sequence into something that more closely resembles what we're used to seeing. For example, if we're working with a 24-bit bitmap file, we'll likely be reading or writing three bytes at a time. In which case, it would make sense to define and appropriately name a struct that is 3 bytes large.

Though working with files may seem complicated, utilizing them allows us to do something truly remarkable.
We can change the state of the world outside our program, we can create something that lives beyond the life of our program, or we can even change something that was created before our program started running. Interacting with files is a truly powerful part of programming in C, and I'm excited to see what you're going to create with it in the code to come. My name is Jason Hirschhorn. This is CS50. [CS50.TV]