[MUSIC PLAYING] CARTER ZENKE: Well, hello, one and all, and welcome to CS50's Introduction to Programming with R. My name is Carter Zenke. And I'm so excited to embark on this journey to learn this language called R with you. Now, it's likely that you have never programmed before. And if so, that's OK. But you might be asking, what is a programming language, actually, anyway? Well, it turns out a programming language is something that we humans have created to actually talk to computers and have them solve problems for us. And you might have heard of lines of code, things that actually tell computers what to do. And you may have heard of programs being lines and lines and lines of code, telling computers step by step what it is we want them to do. And you might have heard as well there are other languages you could learn, like C, like Python, like R, like JavaScript. And so you could be asking yourself, why would I learn R? Well, R, it turns out, is this language built from the ground up to work with data. And so, if you are interested in data or some of these fields here, like data science, data visualization, research, or statistics, R could be a language for you. And although this course won't actually teach data science or statistics or the math behind them, you will emerge being able to use R for these disciplines. Now, actually, just recently, researchers used R to model how COVID-19 was spread. And this is a visualization built entirely in R, to model how COVID-19 was spread on a cruise ship. You might have also heard of FiveThirtyEight, these data journalists who actually write articles with data. They use R to analyze data, to write their articles and share visualizations like this one so we can actually understand insights from data, too. So without further ado, let's actually begin by writing our very first R program. And to do that, we'll need what's called an Integrated Development Environment, or an IDE for short. 
Now, code, at the end of the day, is just text. And you could use any text editor to write code. But it turns out that when you actually want to write lots and lots of code, it's helpful to have a tool to do that with. And that's what this IDE will be doing for us. Now let me introduce you over here to this IDE called RStudio. Now, RStudio is an IDE built exclusively and particularly for R. And here it is. You'll notice off the bat that I have this kind of greater than sign here with a blinking cursor. And this indicates what we're going to call the R console. It's a place I can actually type R statements and have them run kind of line by line by line. It's a great place for me to actually execute or to run one line of code at a time. So let's actually type our very first line of R code here to create this file called hello.R, in which we'll write an actual full-fledged program. So to create a file using this console in RStudio, I can type file.create, open parentheses and closed parentheses, and then, inside them, the name of the file I want to create. So here I'm going to create, in this case, hello.R, just like that. Now, notice that hello.R ends in dot R, and that is actually very particular. You might have heard of files ending in, like, dot jpg for images, or maybe dot csv for data files. In R, we denote our R programs with this dot capital R here. So let me run this R statement here by hitting Enter. And now I should see-- well, I see TRUE kind of yelling at me right here. What this means is that this file was created. And we'll also see this kind of bracket 1 here, which will become more apparent as we go on, but it turns out that because R works a lot with data and lists of data, it's often handy, when you have lots and lots of data, to give us an indicator of where in that list we are. So we'll see this come in handy a bit more later. But for now, with this list of one value, like TRUE, it just has 1 for now. Well, where was this file created? I claim it was created. 
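As a minimal sketch of the console command just described: the lecture creates hello.R directly in the working directory, while this version assumes a temporary folder (via tempdir() and file.path(), which are not part of the lecture) so that running it leaves no stray files behind. The object names path and created are only for illustration.

```r
# Build a path to hello.R inside a temporary folder (an assumption made here;
# the lecture simply types file.create("hello.R") in the console).
path <- file.path(tempdir(), "hello.R")

# file.create returns TRUE when the file was created successfully.
created <- file.create(path)

print(created)            # the TRUE value, shown with the [1] index in front
print(file.exists(path))  # confirms the file is really there
```

The [1] in front of each printed value is the same position indicator described above.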
But where was it created? So RStudio has this file explorer that I can open on the side here. And this file explorer shows me the contents of some particular folder on my computer. Even if you don't know much about computers, you probably know about files and folders. And so here, it seems like I am inside of the User/jharvard folder. And inside of that folder, RStudio and R created for me this file called hello.R. Now, RStudio works by default in a single folder. And that folder, it turns out, is called our working directory. If I ask R to create a file for me, it will create that file in the working directory. Or if I ask it to find some data file and read it into my R program, it will first search in that working directory for me here. So let's open up this R file and actually write our first R program. I'll hit this hello.R file icon here. Let me close my file explorer for now because it's not important at this point. But now I'm introduced to this new component of RStudio, in this case, my file editor. So as we saw before, this R console down here is great for writing single R statements, single lines of R code. But this hello.R file is great for writing more than one line of code, tens of lines, hundreds of lines, creating a full-fledged program here. So I have this blinking cursor, which means I can actually just go ahead and type some text. And let's type in some text that will be our very first R program. I'll type print, P-R-I-N-T, followed by an opening parenthesis and a closing one. And inside those parentheses, let me type "hello, world", just like this. And now I've typed in some text, my first line of R code. I can save this file by clicking on this little Save icon here. Or, on Mac, I could do Command-S. Or, on Windows, I could do Control-S. So let me hit the Save button here. And now this program is saved on my computer. I could run it if I wanted to. 
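A short sketch of the two ideas from this part: the working directory and the one-line program. getwd() is base R's way of asking which folder is the working directory; the lecture shows that folder in the file explorer rather than calling this function, so treat its use here as an addition.

```r
# Ask R which folder it is currently working in (the working directory).
print(getwd())

# The entire contents of hello.R at this point: one call to print.
print("hello, world")
```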
Now, you may be used to running programs by double-clicking on them or finding some icon and clicking on that to open the program and run it, but I don't see those icons here. And that is intentional. This is a program we've made ourselves, which will require a different way of running it for us. And even if you don't know much about computers, you probably know computers speak this language called binary, or 1's and 0's. And right now, we just have print("hello, world"). That doesn't look like 1's and 0's to me. So there has to be a way to translate or interpret this R text we've written into 1's and 0's the computer actually understands. So R is more than a language. It's also an interpreter that takes this text we've written and converts it to the 1's and 0's the computer understands. Now, to kick off that process, that interpretation process, I can actually click this button over here that says Run. So let me go down to the console first. And let me clear the console so it's very clear what's happening here. I'll type Control-L to clear the console. And now, if I go to this line of code, line 1, and hit Run, I should see my first R program saying hello to the world over here. So let's take a step back first, though, and think about what it is we have just done. So here in R, we have created our very first program. And we've done so using something that's called a function. Now, in languages like R, and lots of others, too, you'll have access to these things called functions. And functions let you tell the computer to take some action, do something to solve some particular problem. In this case, our problem was displaying some text. And this function that we saw called print helped us do just that. So let's go back and show you this program again. I'm going to come back to RStudio. We saw that this function is called print. You can probably guess that yourself here. And we know it's a function because I used these parentheses here. 
This is the convention in R to denote a function by its name, followed by parentheses, followed by some particular input to that function. Now, print, when the R developers designed it, doesn't print some predetermined piece of text. It doesn't always print "hello, world". It doesn't always print "hello" to somebody else. Instead, it prints the text that I want it to print. And so we'll see, as we learn programming together, that these functions can take inputs. And more precisely, these inputs are called arguments to the function. They're inputs that change how the function actually runs. Now, what is it that print did? Well, down below in the console here, I see that it printed "hello, world". And this is what's known as the side effect of the function, something visual that happened. Side effects could also be something that happens via audio, something else I see. Whatever happens as the function is running, that is known as its side effect. And this is our first program in R so far. But let's pause here and ask what questions we have on R, RStudio, or this first program we just wrote called "hello, world". So a question about why would you use RStudio as opposed to VS Code-- well, if you're familiar with VS Code, you would know that it can work with a variety of languages, like Python and C and others. RStudio, though, is tailor-made to work with R. And as we'll see later on in the course, there are a lot of features of RStudio that make working with R much easier, particularly with visualizations and plotting. So we'll see that later on in the course as well. Also, a question about why would you use R instead of Python-- well, Python tends to be this really big language that can be used for lots and lots of things. When you think of libraries in Python, like NumPy, you can certainly do work like you would do in R with those libraries. R, though, is more of a precise tool. 
It's built from the ground up to work with data, whereas Python is more of a general tool for solving lots of different problems. R then comes with some optimizations that make working with data a little more efficient. But in general, you could use either Python, with NumPy or other libraries, or R to work with data overall. Now a question from Luis-- LUIS: Would you show us how to change the working directory for the console? CARTER ZENKE: So a great question about actually changing the working directory. We saw earlier that RStudio defaults to having some kind of working directory. But you could change that if you wanted to. Well, in R, there is a function dedicated to changing that working directory. And I can show you what that looks like over here. Let me come back to my RStudio environment and, in particular, my console. So changing the working directory, that actually only requires one line of R code. So I could then clear my console down below here and execute that line of code in the console. The function to do so is setwd, just like this. And then in parentheses, you can then type the path for the folder you want to change your working directory to. So if you do want to do that, you can do it using the setwd function here. Now, we've completed our very first "hello, world" program. But odds are things aren't always going to go as smoothly for you if you're just beginning with programming. And even if you're more experienced, you might encounter these things called bugs, or these errors in your code. So let me reproduce, kind of intentionally here, a bug that could happen in my program. Let's say, as I was typing, I didn't type print. I instead typed prin. Let me save this program here. And let me run it to see what happens. I'll hit Run. And, well, now I see down in my console, Error in prin("hello, world"), could not find function "prin", in this case. So this error is hopefully telling me, look, there is no function called prin. 
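Both ideas from this stretch, changing the working directory and the prin typo, can be sketched together. A few things here are assumptions beyond the lecture: tempdir() stands in for whatever folder path you would actually pass to setwd, and tryCatch with conditionMessage is used only so the error can be captured and shown without stopping the script. The names old_dir and message_text are illustrative.

```r
# Remember the current working directory so it can be restored afterwards.
old_dir <- getwd()

# Change the working directory; tempdir() is a stand-in for any folder path.
setwd(tempdir())
print(getwd())

# Switch back to where we started.
setwd(old_dir)

# Reproduce the typo: prin does not exist, so calling it raises an error.
# tryCatch captures the error message instead of halting the script.
message_text <- tryCatch(
  prin("hello, world"),
  error = function(e) conditionMessage(e)
)
print(message_text)
```

In the console, without tryCatch, the same mistake shows up as the "Error in prin..." message quoted above.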
And hopefully, by seeing this, I should then actually know, well, oh, I didn't mean to type prin. I meant to type print instead. So I can go back and fix it. And this process of actually going back and fixing our programs is known as debugging, finding those bugs in our code and getting rid of them by looking through these errors and talking to a colleague or trying to fix them overall. So here, let me go back and clear my terminal, my console down below. I'll run this line of R code, or run this line of this program here again. And I'll see, once again, "hello, world". So this is a pretty good program, but I argue we could do a little bit better with it. Like, we don't just have to say "hello" to everyone. We could try to say "hello" to a particular user. And so our next step here will be to actually ask the user for their name and then say "hello" to that particular user. Now, to get user input, there is a function in R for that. It's not print. It's instead called readline. So let's use the readline function here. I'll type R-E-A-D, readline-- this is all one word-- and then parentheses, opening and closing. And it turns out that readline takes an argument, too. This argument will then be the prompt it prompts the user with. So I'll ask the user, "What's your name?" just like this. And I'll save the file here. So let me go down and clear my console. I'll then run this line of R code, and I'll see "What's your name?" I could type, in this case, my name is Carter, hit Enter, and I see "Carter". So it's not so much a greeting as much as it is just kind of saying my name back to me. I think we could still do better than that as well. Let me clear my console, and ideally I want to say something like this. I want to say, hello, Carter. So I could use print again. Maybe on line 2 now, I choose to say print and then "Hello, Carter". And I'll save this file again. Well, now I have not just one line of code but two. 
And when R goes to interpret this program, it will read these lines of code top to bottom, left to right, executing each function along the way. But it turns out that this Run button over here, it says Run, but what it really does is it only runs one line of R code at a time. For example, I could run just line 1, or I could run just line 2, but what we have here is really a full-fledged program. It's more than one line of code. It's two. So to run the entire program, I need to use a different approach, different button in this case, and that button is this Source button up top here. Source in this case means to run the entire source file here. So let me go ahead and source. And let me say, what's my name? My name is Carter. And I see "Hello, Carter". But there might still be a bug in this program. Let's try it one more time. Let me clear the console. I'll source again. And what if my name was, like, Mario from the Nintendo universe. I could type, well, my name is Mario. Hmm. "Hello, Carter". OK, let's try it again. I'll do Source. And I'll do-- maybe my name is Princess Peach, so Princess Peach here. And, well, still "Hello, Carter". So it seems to me like we need to do something more dynamic than just printing out, "Hello, Carter". We need to actually print what the user has given us at the console. So for that, we'll actually need this new concept in programming called a return value. Functions don't just have arguments and side effects. They also have return values. And in this case, readline will return to us, as its return value, the user input. You could think of it a bit like a metaphor of asking your friend to go out and ask somebody for their name. And they might write it down and return it back to you, the programmer, so you can use it later on in your code. That's kind of what return values are in this case. But if I have a return value, I'm going to need someplace to store it, someplace to reuse it later on in my code. 
And for that we'll need something called a variable or an object. A variable is some name for a value that could change. So let's see how we could use both return values and objects in R to make this program more dynamic. Let me go to line 1 here. And let me try to give this return value of readline a name I could reuse later on. Well, if the user types in Carter, I want to refer to that value via some name. And maybe the right name is just simply name. So I'll type name here. And now, if I want to store the return value of readline, that is whatever the user typed in, inside this object called name, I could use this syntax, these particular characters here, the less than sign and a dash. And notice, if we read this right to left, first readline will run as a function. It will prompt the user for their input. The user will type that input in. And then readline will come back to us and give us back whatever the value is that the user typed in. And then these lines of code over here-- name, space, less than, this dash here-- will store it underneath this name called name. This arrow, this leftward arrow, is called the assignment operator. We're assigning whatever return value from readline to this new object called name. So let me just, for now, get rid of this line 2 and just focus on this particular piece right now. Let me source this file. And now let me say my name is Mario. I'll hit Enter. And I see nothing yet because there's no print. But now, if I open up this new pane in RStudio, let me go ahead and go to my environment pane, I'll actually see what the value is for this object we created called name. So it seems like RStudio is telling me that I have this object called name, and its value now is Mario. This is part of our environment. The environment is the place we actually store our objects while our program is running so we can then reuse them later on in our code. So I've kind of captured this user input and stored it in this object called name. 
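Here is a rough sketch of storing readline's return value with the assignment operator. One caveat that goes beyond the lecture: readline only waits for typing in an interactive session such as the RStudio console, so this version assumes a fallback value of "Mario" when the script is run non-interactively.

```r
# In an interactive session, prompt the user and store what they type.
# Otherwise, use a fixed stand-in value (hypothetical, for illustration only).
if (interactive()) {
  name <- readline("What's your name? ")
} else {
  name <- "Mario"
}

# The object called name now holds the text and can be reused later.
print(name)
```

After this runs in RStudio, name appears in the environment pane with whatever value was typed.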
But let's see how we could use it now. I'll come back to RStudio. And let me go back to line 2. And let me now print, in this case, let me print "Hello, name", just like this. I'll kind of close my environment over here. And now let me source this particular file. I'll type Mario as my name. And now I'll hit Enter. And I see, well, "Hello, name". So this isn't exactly what we wanted. I actually printed out literally "Hello, name". So I think we'll need to find some other solution here. We solved one problem, which was getting the user input and storing it somewhere. But, I mean, how do we reuse that later on? Well, if I'm being observant, I might notice that what I'm really trying to do is combine some pieces of text. Like, this text here is Hello comma space. And the text I'm trying to combine is whatever the user typed in, or that is whatever is being stored in this name object I have over here. So I'm trying to combine Hello comma space and this text from the user, Mario. Now, this is a common problem in programming, trying to combine pieces of text, so common that it actually has its own particular name. This is called string concatenation, where concatenation means combining together these various pieces of text, or these strings. So let's break it down. And here, of course, we have the very first part of our greeting. This is some piece of text, or from here on out called a string, a string because it's characters strung together into one piece of text. Strings begin and end with these double quotes here. So I have Hello comma space, and that is one particular string. But then the user comes in, and they type in their own string, let's say, like, Carter. And now my task is to combine these together into a single string and print that back out to the user. So my goal here is effectively this: turn these two separate strings into one individual string. Now, R, handily enough, comes with a function to do just this. So let's explore that function here. 
This function is actually called paste, paste as in P-A-S-T-E. And paste allows me to concatenate various strings. So let's try this one here. If I want to use paste, I use it the same way I would any other function. I could use the function name, and then open and closing parentheses. And then paste will take as input any number of strings I want to concatenate, or paste together in this case. So let's say the first string, as we said, is Hello comma space. Now, the next string is something new. It's actually going to be whatever is stored in this name object over here. So if I want to provide another input to paste, I should actually separate it now with a comma. So after the first input to paste, the first argument, I'll then give it a second input, or a second argument. And this one will be literally name, in this case, this object we stored that has that value the user themselves typed in. And now paste, its return value will be the combined version of these two strings. So let's try it. Maybe I'll store the return value inside its own object called greeting. I'll use the left arrow, this assignment operator, to store the return value of paste in this object called greeting. And then, down in print here, I won't print "Hello, name" literally. I'll instead print whatever is stored in the object called greeting. So as I run this, let me open up my environment. So we can see how these values change over time. Let me go ahead and click Source here. My name, in this case, is Carter. I'll hit Enter. And I see "Hello, Carter". So if I go back to my environment here, I see I not only have this object called name storing "Carter". I also have this object called greeting storing the concatenated versions of, in this case, the string Hello comma space and then the string "Carter" itself. But if you're being particularly observant here, what do you notice as a bug in this program? Let me ask our audience here. What do you notice as a bug in this program? 
AUDIENCE: There are two spaces. CARTER ZENKE: Yeah. There are two spaces. So I only wanted one space. But it seems I have somehow gotten two. Greeting here has Hello comma space space Carter. So why is that? Well, I could spend a lot of time kind of banging my head wondering, why is this not working? Or I could look at something called documentation. So programmers, when they write functions like paste, they also write documentation that tells me exactly how to use paste and what the expected output of paste might be. So let's look then at the documentation for paste. I can actually access documentation if I use this special character in R called the question mark. And if you want to remember this, I tend to think of just being confused. Like, what do I do? The question mark is that symbol. Then I follow it with the function name, in this case paste. So now, over on the right-hand side-- let me make this bigger for us over here-- I'll actually see the documentation for paste. Whoever created this function called paste helpfully wrote this documentation to guide me on how to use paste itself. So I'll see up top, the goal of paste is to concatenate strings, like we just talked about. That much is pretty obvious. But down below, I think, is the helpful part. This down below will tell me what kinds of inputs paste could potentially take. And I'll see the same thing we saw before, paste followed by some parentheses with various what we'll call parameters here. So these are still inputs to paste. But they're potential inputs. And because they're potential ones, we'll call them parameters. Arguments are the actual values we pass to paste. Parameters are the potential ones here. Now, I see, if I go down here, some dot dot dots. These dot dot dots mean that paste could take really any particular number of arguments. That could be any number of strings I want to concatenate in this case. I then have over here, though, this named parameter called sep. 
And it says sep = quote unquote with a space in the middle. Now, this is what's called a named parameter. It has a given name because it has a special use case in this case of paste. Now, the equal sign means the default value for this parameter is going to be this value here, the quote space unquote. So it seems like this parameter called sep might be what's making this extra space actually happen in my output. And if I go back to the documentation over here, let me scroll down a little bit so we can see, I'll go down to the arguments. And you can see that sep is, in fact, a character string to separate the terms. So my job now is to think about, what would I change the value of sep to be to remove that extra space? Well, I can go back to my program. And ideally, I don't want anything, any character, to separate the strings by default. I just want to have them put together with no spaces in between. So to use this named parameter called sep, I can add another input to paste. I could follow it with a comma, just like this, and then use that named parameter sep, in this case, and set it equal to some new kind of value, in this case quote unquote, or really, no spaces at all. So let's try this now. I'll clear my console. I'll run source. And then I'll type Carter again. And now I'll see Hello comma space, only one space, Carter. Now, if you're like me, you might think, I'm going to often want to concatenate strings that don't have any spaces in between them. And it's going to be a lot of typing to always type comma sep = quote unquote. If you're writing many lines of code, you don't want to do this over and over and over again. And in fact, some R users got tired of just this, and they wrote their own function where the default is sep = quote unquote, nothing. This function is called simply paste0, paste0, where 0 means there's nothing in between these concatenated strings. 
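The three variations discussed so far can be put side by side. This is only a sketch: name is set to a fixed example value here rather than read from the user, and the object names two_spaces, with_sep, and with_paste0 are made up for illustration.

```r
name <- "Carter"  # example value standing in for what the user typed

# Default sep is a single space, so "Hello, " plus that space gives two spaces.
two_spaces <- paste("Hello, ", name)
print(two_spaces)   # "Hello,  Carter"

# Setting sep to an empty string removes the extra space.
with_sep <- paste("Hello, ", name, sep = "")
print(with_sep)     # "Hello, Carter"

# paste0 behaves like paste with sep = "" already filled in.
with_paste0 <- paste0("Hello, ", name)
print(with_paste0)  # "Hello, Carter"
```

Typing ?paste in the console opens the same documentation page that lists sep and its default.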
I now don't need to supply the input sep, because the default will always be no particular space at all. Now I'll rerun this program, source, and say Carter. And I'll get that same output. Now one other way to do this, because there's always more than one way to do something, is I could maybe just omit the space altogether. Like, I could say Hello comma and make that my string. And I could then assume that paste will go ahead and actually add the space in for me. I could run source, just like this. Type in Carter. And now I'll see "Hello, Carter" again, completing our program here. So let me pause and ask what questions we have on paste, string concatenation, or our program so far. AUDIENCE: So what's the difference between the paste function and the cat function? Because I think both of them are used for string concatenation. CARTER ZENKE: Yeah. Good. So if you are familiar with R, you might have also heard of this function called cat. And cat itself stands for concatenation. Cat and paste have two similar use cases. They both involve combining strings together. But they have slightly different outputs. So cat has the side effect of printing whatever you've concatenated to the console. Paste, on the other hand, does not. Paste only returns to you, kind of silently, the concatenated version therein. And you can then use that later on in your code. Cat, I believe, does not actually return to you the result. It just kind of prints it to the screen as a side effect, so two very different use cases but the same kind of goal here. All right. Let's continue, and let's keep improving our program here. So one thing you might notice is that I have this object called greeting and I'm just, on the next line, using that very same object. And this seems just a little bit redundant because I'm just storing this value from paste in greeting and immediately giving it back to this function called print. 
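To make the paste versus cat distinction from the question above concrete, here is a small sketch. It relies on cat returning NULL rather than the combined text, which matches base R's documented behavior; the object names greeting and result are illustrative.

```r
# paste returns the combined string and prints nothing on its own.
greeting <- paste("Hello,", "Carter")

# cat prints its arguments to the console as a side effect.
# Storing its return value shows that nothing useful comes back.
result <- cat("Hello,", "Carter", "\n")

print(greeting)         # "Hello, Carter"
print(is.null(result))  # TRUE
```

So paste is the one to reach for when the combined text needs to be reused later in the program.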
Well, to have this same result and actually reduce some number of lines of code, I could do the following. I could actually remove the idea of storing the result of paste in any given object. And I could simply run paste and immediately pass the value, or the return value from paste, as the input to print. Now, this is arguably the most complicated line of code we've seen so far. So let's break it down just a little bit. What we're doing right here is known as function composition. I'm making one function run and then immediately passing the output of that function as input to the next. So this is our line of R code. And we said before that R runs lines of code kind of line by line, top to bottom, left to right. And that's mostly true. But in this case, we see there's more than one function to run or action to take on this particular line of code. So what is R to do? Well, what R will always do is look first for the function that is innermost in the parentheses. So in this case, that is the paste0 function that is concatenating or combining "Hello, " and then name over here. Now, what this will do is make the return value Hello comma space, let's say, Carter, and then pass that immediately as input into print, just like this. And once that's done, print can then do its job and, of course, just print out something like "Hello, Carter". So always think about the innermost function running first and passing its return value as the input to the next innermost function, and so on and so forth. So let's go ahead and try this out. I'll come back to RStudio here. And here I have paste as opposed to paste0, but kind of the same thing, as we saw before. Let me go ahead and click Source here. I'll type Carter. And we'll see that I get the very same result without storing, in this case, an additional object. Now, an extension of this might be the following. 
I could take readline, and notice how it's just simply storing the value in name, which I immediately pass as input to paste. I could do this. I could take readline and put this right there. And now I have three functions nested inside of each other. But let me actually ask you, why might this not be a good idea? Let's think about other people who might read this code, or think about working together on projects. Like, why might I not want to do this or go this far with the design of my code here? AUDIENCE: It's, I think, because it doesn't explain the code perfectly to the user. CARTER ZENKE: Yeah. It's kind of hard to read. Like, if I saw this line of code here, I would have to think to myself, OK, which function is happening first? Well, it looks like it might be readline. And then what happens next? OK. Paste happens next. So it's a lot for me to think of as I'm reading this program. And even though it is shorter, I would say it's not necessarily better. So these are questions about the design of programs. Which way to write the code is better? We have the same result. So they're both correct. But there are still different ways to design it and trade-offs to consider in terms of readability as well. So let's come back, and let's try to fix up this program a little bit. I would argue that it's probably cleaner if we instead have readline on a separate line of code. I'll put this first on line 1. And I'll go back to storing this object called name. And I'll pass it in as input to paste here, just like this. Now, we just talked about the idea of making our code more readable. And it turns out that R comes with a feature that can let me do just that. I can actually leave myself some notes to self called comments that will help me understand my code using the English language. So if I want to write a comment, or a note to myself, in my code, I can do so by typing this hashtag here, followed by a space. And I can then type the comment I want to type. 
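The trade-off just discussed, storing an intermediate object versus composing functions, can be sketched as follows. name is given a fixed example value here because readline needs an interactive session; in the lecture's program it comes from the user. Each line beginning with # is a comment of the kind described above.

```r
name <- "Carter"  # example value; the lecture's program gets this from readline

# Two steps: store the return value of paste, then print it.
greeting <- paste("Hello,", name)
print(greeting)

# Composed: the innermost function, paste, runs first, and its
# return value becomes the input to print.
composed <- print(paste("Hello,", name))
```

Both versions print the same greeting; the choice between them is about readability rather than correctness.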
I could say maybe this line asks user, this line asks user for, asks user for name, just like this. And the next line, well, what does this line of code do? This line of code says hello to user, just like that. Now, comments, by convention, go on the line above, the line of code they are talking about. So in this case, I know this comment on line 1 refers to the code on line 2. And this comment on line 4 refers to the line of code on line 5. But comments are very helpful when you actually are working on a larger project. You come back later. Don't know what you did. Comments can then help you understand exactly what to do, and what you had done prior as well. So this was our "Hello, world" program. We said hello to the world. We said hello to some users. Let's get working with some data now. And in one case, we might want to work with data in terms of, let's say, counting votes for an election. So let's go ahead and try to simulate an election between some fictional characters from the Nintendo universe, in this case Mario, Peach, and Bowser. So to create this new program, I'll go to my console again, and I'll type file.create. And in this case, I want to count some votes. So I'll call this program count.R, just like this. I'll hit Enter. And I'll see that this file was created. So now, if I open up, open up my window over here, and go to files, I should now see that count.R is available to me as a file to write this program in. I'll open up count.R. And now I have a blank slate of a program to write. So we have three candidates to keep track of votes for, Mario, Peach, and Bowser. So let's let the user actually type in those votes and return to them or print out the total number of votes that happened in this election. So maybe I will ask the user, using readline, to enter votes for Mario, just like this. And I'll also ask the user to enter votes for Peach, just like this, Princess Peach. 
And I'll also use readline to ask the user to enter votes for Bowser, just like this. Now, it's likely I'll want to use whatever the user types in later on in my code. So why don't I store the return value of readline in an object I could re-use later on. Maybe I'll call this one Mario and this one Peach. And this one, let's go for Bowser, just like this. So I'll save it. And again the goal was to kind of add up the total number of votes. Well, maybe I'll make a new object called total to store the total number of votes. And I'll have something that will assign that total number here. Well, it turns out that to actually add data together, I can use, in R, this plus sign, this plus operator. So I'll say mario + peach + bowser. And that should return to me the total number of votes the user has actually entered in the console. And if I want to then print that back out to the user, well, I could use print and paste again. So I'll use print, paste, and then Total votes, no space, because paste will actually add it for me, and then total itself down below here. So a few things we've seen before and a few new things. What's new is this arithmetic. We've seen now, we just used this plus operator to add together some numbers. And R has more than just the plus operator. It has several others as well. It has addition, as we just saw with a plus sign, subtraction with this minus sign, or a dash, multiplication with this star or asterisk operator, and division, just like this, with the forward slash here. There are other operators, too, that we'll talk about later on in the course. But for now, these four will help you do some basic arithmetic that can help us solve some really interesting problems in this case. So let's come back, and let's run our program. I'll come back to RStudio. And I think that this should work. I'll go ahead and run source. And I'll enter votes, in this case, for Mario. Maybe I'll say Mario has 100 votes. And Peach has 150. And Bowser has 120. 
And now I should see the total number of votes that were cast in this election. I'll hit Enter. And I'll see one other error. It says error in mario + peach, non-numeric argument to binary operator. Hmm. The other error, I'll admit, was easier to understand. This one's less easy. So at least it tells me where the error happened. It says it happened in mario + peach, so it seems like maybe on line 5 here, when I tried to add the user's input for Mario to the user's input for Peach. And it says, the reason, a non-numeric argument to the binary operator. So the binary operator, I'll tell you, is this plus sign here. But the non-numeric argument, it seems like it's telling us that Mario and Peach, those aren't numbers at all. So let's take a peek at our environment where we stored those actual objects. Let me take a peek over here. And what will we see? If I go to Environment, maybe remove this down below here, I see some old, some old things here, like greeting and name, that I didn't get rid of before. But I also see Bowser, Mario, Peach. And what do you notice? Well, it seems like before, we had, let's say, greeting. That was a character string. And we knew it was a character string because it had quotes around it. But we see the same thing now for bowser and for mario and for peach, which implies to me that these are still character strings. They're not so much numbers. Now, I think R is now telling me that it needs numbers to be able to add these things together. It can't add a character, 120, with the character 100. They need to be actual numbers. So let's see what we can do about that. Let's come back to RStudio here, and let's actually introduce this new idea of a data type or a storage mode. In R, we have various ways of storing data. We've seen one so far called a character string, but there are lots of others, too. Among them are these. Characters, we just saw, and then double and integer. These are both numbers. 
Double refers to a decimal number, like a 1.5, for instance. Integer refers to a whole number, like just 1, plain and simple. And there are more, too, but these are the ones that matter here. So it seems to me like readline, when it returns us the input from the user, it returned a data type of character, or a storage mode of character. But what I really need to add these numbers together is a double or an integer, these numeric storage modes down here. So let's see if there aren't functions that could help us. There actually are. So among them, this idea of as.character, as.double, and as.integer. These are functions that can actually take some particular object and convert them to the storage mode we want. So I could give as input to as.integer some object, and it will return to me then that same object but now as an integer. And this is known as coercion, changing the storage mode of an object, using a function like this to convert it to some particular new storage mode. So let's try these out. I'll come to RStudio again. And I will then try to convert this data to an integer before I actually add it together. On line 5 and below, let me go ahead and use as.integer. I'll type as.integer(mario) and as.integer(peach) and as.integer(bowser). Well, in this case, hopefully that should work. Let me go and hit Source again. And let me clear my terminal first. I'll enter 100 votes for Mario, 150 for Peach, and 120 for Bowser. And I still see that they're not numeric. So one common mistake is that simply running the function here is not enough to change this particular object. I need to then reassign the return value of a function to the object itself. So for instance, if I want to update the value of mario, I need to reassign it as the return value of as.integer. Or I need to update the value of peach by reassigning it as the return value of this function here. And same with bowser as well. 
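The reassignment step described above can be sketched in a few lines. This is a minimal illustration, not the lecture's exact file: the string "100" is an assumed stand-in for what readline would return, so that the snippet runs without user input.

```r
# Assumed stand-in for user input: readline() always returns a character string.
mario <- "100"

# Calling as.integer() without assigning its result leaves mario unchanged.
as.integer(mario)
typeof(mario)  # still "character"

# Reassigning the return value is what actually updates the object.
mario <- as.integer(mario)
typeof(mario)  # now "integer"
```

The first call computes an integer and then discards it, which is why the addition still failed in the lecture until the return value was assigned back to the object.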
And now, I think, if I run this, fingers crossed-- let me come to my console again, run source, and I'll choose 100 for Mario, 150 for Peach, 120 for Bowser. And I'll see the total votes was 370. Now, this is a bit of a longer program than we've seen before. This is like 11 lines total. There's probably a way to actually clean this up a little bit, though. One way would be to immediately try to convert the input from the user to an integer. So I could use function composition. And I could instead immediately pass the return value of readline as the input of as.integer, and same for peach, and same for bowser. Let me actually clean this up, have a parenthesis there and a parenthesis here. And now I can get rid of lines 5 through 7. And now it's just a little bit shorter. And I'd argue that this is actually a good use of this particular function composition, because now I'm immediately seeing that, OK, I want an integer from the user. Let me go ahead and try this again. I'll click Source. And now I'll see 100, 150, 120, and I still see 370. Now, this is pretty good, but let's think of a corner case or some other scenario where there are more than three candidates. Well, in this case, I don't want to be stuck always typing plus, plus, plus some new candidate. What I want to use instead is likely a function called sum. Now, R, because it works so often with data, has this function called sum that can take as input any number of numeric arguments and sum them up for me. So let's use sum instead. In total here, I actually want to return the result of calling sum with three arguments, three inputs. The first is mario, second is peach, and the third is bowser. So now sum will look at all three of these numbers, add them up, and store them now in total. So I'll clear the console, run source, and I'll type in 100 votes for Mario, 150 for Peach, and 120 for Bowser. And now I'll see total votes was 370 as well. 
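The finished count.R might look roughly like the sketch below. The prompt text is an assumption, and the three `_input` strings are hypothetical stand-ins for readline so the example runs non-interactively. In the interactive version, each line would instead read something like `mario <- as.integer(readline("Enter votes for Mario: "))`.

```r
# Hypothetical stand-ins for what readline() would return (character strings).
mario_input <- "100"
peach_input <- "150"
bowser_input <- "120"

# Convert each input to an integer as soon as it arrives.
mario <- as.integer(mario_input)
peach <- as.integer(peach_input)
bowser <- as.integer(bowser_input)

# sum() accepts any number of numeric arguments.
total <- sum(mario, peach, bowser)

# paste() inserts a space between its arguments by default.
print(paste("Total votes:", total))
```

Using sum here instead of chained plus signs means that adding a fourth candidate only requires one more argument.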
So we've improved our program so far and we've seen how to use these storage modes to add data together. Now what questions do we have on this program here, or storage modes in general? AUDIENCE: Can we enter an argument like an array to this sum function? CARTER ZENKE: So I heard you mentioned this idea of an array. And if you're familiar with other programming languages, you might have heard this idea of an array, like some list of data. And the question is, could we give sum not these three separate values but actually an array or some list of data? In fact, we can, and let me suggest we actually take like a five-minute break and come back to learn more about these structures we can use to represent data just like that. See you in five. Well, we're back. And so we've seen so far how to write programs that take user input. But odds are, as you write more R programs, you won't rely so much on the user to actually enter data for you. You'll instead read data from a file, like a CSV file for instance. So let's take a look at ways you can actually represent data and how to use those representations now in R. Well, you often find data is stored in these things called tables. And here is an example table. I'm trying to represent here candidates, like Mario, Peach, and Bowser, and the number of votes they received at the poll, so this is actual, physical polling location, and via mail, from mail in ballots, let's say. So notice how this table has both rows, this kind of horizontal orientation, and columns, this vertical orientation. In particular, there are three columns with names. So one is candidate, where I have the names of my candidates, in this case, Mario, Peach, and Bowser. Over here, I have this column called poll, representing the number of ballots or votes that Mario, Peach, and Bowser received at the actual, physical polling location. So let's say Mario got 37 votes, Peach 43, and Bowser 84 now. 
Well, for the mail column, there's also going to be some numbers here as well. Let's say Mario got 63 mail-in votes, Peach got 107 mail-in votes, and Bowser, 36. So this then is our table of rows and columns. And one kind of analysis we might want to do on this table is called a tabulation. That is figuring out how many votes we received by poll or by mail and also how many votes each candidate received. So we could ask the question, how many votes did Mario receive overall? That would be a tabulation along these rows here. We could also ask, how many votes did we receive at the actual, physical polling location? That would be a tabulation along this column here. So these are two questions we can actually answer using R. But R, at least immediately, doesn't give us a way to represent a table exactly like this. If you want to store this data kind of long term, you need to do so in a file. And one popular way of representing data like this inside of a file is to use a CSV file, or a comma-separated values file. So here is the same representation of that data but now as a CSV file. Notice here I have those same column names, candidate, poll, and mail, and I still have those same column values, Mario, Peach, Bowser, 37, 43, 84, and so on. But what you might notice is that the columns are now separated using these commas. And that makes sense. This is a comma-separated values file, or a CSV file. Every row is still on its own line in the file, but now these columns are separated with these commas. So let's see how we could use R to read in this CSV file and give us an actual table of data to work with. I'll come to RStudio here, and one of the first things I actually might want to do is clean up my working space. If I want to see what's currently in my environment, I can type this function, ls, at the console and hit Enter. And now I'll see all the objects that I still have in my environment, some from prior programs. 
Now, I probably want to get rid of these as I'm writing some brand new program. So to do that I could use this function called rm, which stands for remove, whereas ls stands for list. I could use rm, and it turns out that rm takes a named argument called list that is the list of values I want to remove from my environment. And I'll say that this list is, well, it's the result of calling ls. That is, it will include bowser, greeting, mario, name, peach, and total, all of these prior objects I no longer want anymore. So I'll hit Enter on this. And I'll type ls again. And now I'll see character(0), which basically says there's nothing here right now. It's an empty character vector, nothing at all in my environment. So now my environment is clean. There are no objects here. Let me actually create a new program, one called tabulate. So I'll do file.create("tabulate.R") to represent how we're going to tabulate this table of data and find the number of votes for each candidate and each voting method. Let me do Enter here. And I'll see that file was created for me. I'll go to my file explorer and open up tabulate.R. So actually, notice here in my file explorer, I do have this file named votes.csv. And if I click on it, I can actually see, if I click View File here, what's inside this file. And notice here I have this same exact thing we saw on the slides, candidate comma poll comma mail, and then one row for every row in my data set. So our goal then is to read this CSV and store it in R so you can actually get back a table of data to work with, now entirely in R. Well, one function I could use to read data from a file like this is actually called read.table, read.table. And I can give read.table the name of the file I want to read, or to open, to load inside of R. So that file name was votes.csv. And because votes.csv is in that working directory, I can just refer to it by its plain and simple name. So read.table has a return value. 
It's going to give me back a table of data. So I'm going to actually store that, let's say, in this table called votes. And now let me run just this line of R code. I can do that by hitting Run over here. Or on Mac, I could type Command-Enter. On Windows, I could type Control-Enter. I'll do Command-Enter on Mac. And now I see, according to the console, I have now read the votes data table here. So if I want to see what it looks like, there are a few ways to do that. I could actually look at my environment. Let's try that first. I'll go to Environment. And I'll see the following, that votes seems to have four observations of one variable. Hmm. So observations actually refers to the number of rows in this table I've gotten back. And variable refers to the number of columns I've gotten back. And you might already be thinking, this doesn't seem right, because I thought I had at least three rows, and I thought I had at least three columns, and I seem to have four rows and one column, so something might be wrong here. If we want to see exactly what happened, I can use this function called View, capital V, and I can pass as input the object I want to view in this case. If I run this line of R code, I should now see a separate tab that shows me exactly what is stored in this object. And I would say these results are not good. This is not what I want it to look like. Because, again, we only have one column, that R seems to have named V1, and instead of three rows there are four, where one row is actually the names of the columns that I wanted to be the case. This is just not what we want. So it seems like read.table needs more information on how to read this particular file. And one thing it might need to know is, what is the separator between each of my columns? Well, because this is a CSV file, that separator is none other than a comma. So read.table, like paste, takes a named argument called sep. And this we'll set to be an actual comma. 
So now read.table knows to look for these commas and use those to identify what is a column inside of this data file. So let me then rerun this line of code, and now view it. And I'll see we're getting better. So here I have three columns, although one is called V1, one is called V2, the other is called V3, and I still have four rows. So it's not quite there, but we're getting close. One other argument to read.table is, in fact, this one called header, header. So header can be either true or false, yes or no. Do the column names exist in this file? In this case, they do. So I'll say header = TRUE. I'm essentially saying that, yes, the column names are inside this file. You should look for them, and you should use them. So let me rerun this. I'll rerun line 1, and now line 2. And now I think we're in a pretty good place. So R actually has stored inside its environment this table for me to use. I see three columns, candidate, poll, and mail, and now three rows. I could do something to make this a little more readable. Often, when we have lots of arguments to these functions, it's better to put them on separate lines. So according to the style guide for R that kind of tells me how I should be structuring my file, I should do something a bit like this. I should put each argument on a new line in my code and then make sure that this closing parentheses is all the way against the left-hand side. So this allows me to more quickly see what arguments I have supplied to read.table, but the result is exactly the same, of course. Now, CSV files are pretty popular. And it doesn't make sense to me to always be writing sep = comma, header = TRUE. And in fact, people who work with R, they come up with their own function called read.csv to make this much easier for us as programmers. So instead of read.table, which can work on a variety of actual data files, I might instead use read.csv because, of course, I have simply a CSV here. So let's try this. 
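To compare the two approaches side by side, here is a small sketch. It assumes the contents of votes.csv match the table shown earlier, and it writes them to a temporary file (a hypothetical `csv_path`) so the example does not depend on the working directory.

```r
# Write the lecture's table to a temporary CSV file (assumed contents of votes.csv).
csv_path <- tempfile(fileext = ".csv")
writeLines(
  c(
    "candidate,poll,mail",
    "Mario,37,63",
    "Peach,43,107",
    "Bowser,84,36"
  ),
  csv_path
)

# read.table needs to be told about the separator and the header row.
votes_table <- read.table(
  csv_path,
  sep = ",",
  header = TRUE
)

# read.csv uses a comma separator and header = TRUE by default.
votes_csv <- read.csv(csv_path)

# Both calls should produce three rows and three named columns.
dim(votes_table)
colnames(votes_csv)
```

Each argument to read.table sits on its own line, with the closing parenthesis back at the left margin, following the layout described above.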
I'll run just read.csv, given the file name. Hit Enter here and Enter here. And it's the same result but now with much less typing, much fewer arguments. read.csv just seems to know more naturally how to read these files called CSV files. OK. So we've successfully read this CSV file. But the next question is, what exactly has it given back to us? Certainly, it's a table. But in R, this table has a special name. And this special name is a data frame. Now, a data frame is what we're going to call a data structure. Some way of organizing our data that allows us to do things with it much more quickly. And those things, for a data frame, might involve accessing the columns, or accessing the rows, or performing things like tabulations, as we'll see in just a little bit. So here again is our data frame. But now it's called votes, just like this. And let's say I actually want to access some particular columns or some particular rows of this data frame. Well, because it is a data frame, R gives me some special syntax, some special actual characters I could type to access those columns and those rows. Now, one way is to use what we call bracket notation, where I could take the name of this data frame, use brackets, and then supply the number of the row and the number of the column that I want to see. So for instance, let's say I wanted to access this particular value, Mario, in the first column and the first row. Well, in that case, I could type votes bracket 1 comma space 1 for the first value in the first row and the first column. But more often, you'll want to access not just one particular value but all the values in a column or all the values in a row. And to do that, you can simply omit one or the other, the row or the column value. So here to access all the candidates, I could type votes bracket comma space 1, omitting the row number but only supplying the column number, the first column here. That will give me this list of Mario, Peach, and Bowser. 
What if I wanted the poll numbers? Well, I could do votes bracket comma 2, and that would give me 37, 43, 84, and same for mail, but now with the number 3. So let's try it in R. I'll come back over to RStudio and go back to my program. And our goal, again, was to sum up let's say the number of votes we got at the polls. So I could at least see those values if I do votes bracket and then comma 2, just like this. Let me clear my console and run this line. And now I see those same values, 37, 43, 84. Same thing here, 37, 43, 84, even though in my console, they're kind of turned on their side like this, these are, in fact, the same values I've seen in my table. But what could go wrong here? Well, if we think about this, what's to stop me from rearranging these columns, from maybe making mail the second column and poll the third. If my program can't update based on that kind of rearrangement, it's not a very good program. So there is another way to actually access columns, in this case using their names. So instead of the number of the column I want to access, I can use its name, which is much more robust. If I change the ordering of the columns, I can still access the column I want to access. Now, the syntax for this looks a bit as follows. I would use the data frame's name, votes, followed by a dollar sign. And then I would get back, in this case-- I would actually type the name of the column I want to access. And I would then get access to that particular column. So votes$candidate, that gives me access to this column of candidates. Same thing with poll, votes$poll, and same thing with mail, votes$mail. This dollar sign has nothing to do with currency. It's just a way to actually access a column by a particular name that it has. So let's come back to RStudio and try that out now. Let's say I want to access the poll column. Well, instead of using the bracket notation with the comma 2, I could use votes$poll. 
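The robustness argument can be checked with a short sketch. As an assumption for illustration, the data frame is built inline with data.frame and c rather than read from votes.csv, and the `reordered` object, which selects columns by a vector of names, is hypothetical and not part of the lecture's program.

```r
# Build the lecture's table inline (assumption: same values as votes.csv).
votes <- data.frame(
  candidate = c("Mario", "Peach", "Bowser"),
  poll = c(37, 43, 84),
  mail = c(63, 107, 36)
)

# Two ways to reach the poll column: by position and by name.
by_position <- votes[, 2]
by_name <- votes$poll

# Hypothetical rearrangement: mail becomes the second column, poll the third.
reordered <- votes[, c("candidate", "mail", "poll")]

# Position 2 now points at mail, while the name still finds poll.
reordered[, 2]
reordered$poll
```

After the rearrangement, code written with `[, 2]` silently reads the wrong column, while code written with `$poll` keeps working.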
And now let me clear my terminal, my console down below. Let me run source. And oops. Let me actually, instead, run this particular line here. I'll get 37, 43, 84, just as we saw before. So let me pause here and ask, what questions do we have about these data frames so far? AUDIENCE: Can I ask if I can get some particular value from some particular row by accessing the votes or the main column, votes bracket or dollar sign, the column, then dollar sign, the value? CARTER ZENKE: Great question. So we saw earlier there is some syntax that can get us access to some particular value in this data frame. Let's see that in action a little bit here, too. So I'll come back to RStudio. And let's say we want to access Mario's number of votes they received at the polls. So this would be the second column and the first row. And in R, we tend to index things, that is start counting, from one, so the second column and the first row. So here, let me go ahead and try to access that particular value. I could say votes, like we saw before, open and closed brackets. And I know, again, this is the first row. So I'll put that as the 1 here and then the column number, which in this case was the second column, counting from 1, so 2. This should, if I hit Enter on this line of code, or Command-Enter, should show me that, in fact, Mario has 37 votes at the polls. I could do the same for Peach, let's say. Peach has 43. Bowser, Bowser has 84. So that is a way to actually access individual values in our data frame. One other way to do this, though, is to take advantage of another data structure. So it turns out that when we access the columns of data frame, what we're getting back is no longer a data frame. What we're getting back instead is what R calls a vector. A vector is simply a list of data that is all of the same storage mode. If you've heard of arrays in C or lists in Python, a vector is a similar idea. But it's simply a list of values all of the same storage mode. 
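A brief sketch of the two ideas above: reaching a single value by row and column number, and checking that a column pulled out of a data frame is a vector. The data frame is built inline as an assumption, in place of reading votes.csv.

```r
# Build the lecture's table inline (assumption: same values as votes.csv).
votes <- data.frame(
  candidate = c("Mario", "Peach", "Bowser"),
  poll = c(37, 43, 84),
  mail = c(63, 107, 36)
)

# Row first, then column, both counted from 1.
mario_poll <- votes[1, 2]
peach_poll <- votes[2, 2]
bowser_poll <- votes[3, 2]

# A single column is no longer a data frame; it is a vector.
poll_column <- votes$poll
is.data.frame(poll_column)
is.vector(poll_column)
```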
So to visualize this, let's go back to our data frame here. And let's say I want to access this particular column of votes. Well, I could access it using votes$candidate. And when I do that, what I really get back is this separate structure that is just the values from that particular column. And now, because this is a new structure, I could actually use that same bracket notation to ask for particular values from this list of data. Now, again, we start counting from one in R, so this is our first value, second, and third. If I want the first value, the first candidate, I could use votes$candidate and then bracket 1. That will give me Mario. I could use votes$candidate and then bracket 2. That would give me Peach, and same thing with Bowser here. So we're kind of taking out or extracting this new vector. And we're able to access individual values in it. Now, one example I like to use is the example of building blocks here. So here we have our very own data frame, composed of nine individual pieces of data. Now, when I ask for some particular column from this data frame, what I'm effectively doing is taking out one column and treating it as a separate object I can use in my program. Now, again, if I want the first value in this vector, I would simply take the one from the top; the second, the second one down; the third one, the third one down. Now, in R, we also see that when we print out vectors, see them in our console, they aren't always kind of vertically arranged like this. Often, they'll be a bit more like this, where I might take, in this case, put it on its side, a bit like this. Or I would now get A, B, and C, left to right like this. Even though in our data frame it was top to bottom, we might get it represented side by side, a bit like this. So same thing, ultimately, and there's this idea of kind of extracting this vector from our data frame. So let's try that now in R. I'll come back to RStudio. 
And let's go ahead and try to access maybe the first value of the poll column. Now, we saw before, I could simply use bracket 1 on this new vector that I've created by accessing the poll column of votes. Let me hit Command-Enter, and I'll see 37. Let me do the second value, and I'll see 43. So I think, with this, we could start to answer one question we had, which was, how many votes did we get at the polls in total? Well, one way to answer this is to say, let's sum up votes$poll bracket 1, the first value in that poll vector. Then let's find that second value as well. Then let's find that third value, just like this. And sum all of those up. And if I hit Command-Enter here, I'll see I have 164 votes at the polls. But just like we saw before, there's probably something that could be better designed about this particular line of code here. And that is, if I had more than three candidates, I'd be typing a really, really long line of code here. So what I could instead do is this. I could give sum, to our earlier question, the vector itself. Simply votes$poll, which we know is a list of values, a vector, I could give that entire vector to sum, which will then know what to do with it and return to me, in fact, the sum of each of those elements of the vector. So we call sum here vectorized. Sum knows what to do when it gets not just a single value as input but an entire vector. And R is really built from the ground up with these vectorized functions that can take whole lists of data and operate on them very, very efficiently in this case. So here we've answered that question of, how many votes do we get at the poll? Let's take that next one, which was, how many votes do we get in the mail? Well, I could do now sum, sum, and then give the mail column, which is, in fact, a vector, and have that summed up. And we'll see we got 206 votes now in the mail. So we've seen vectors. And we've seen data frames. 
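The two ways of totalling a column described above can be written out as follows. The data frame is constructed inline as an assumption, standing in for the one read from votes.csv.

```r
# Build the lecture's table inline (assumption: same values as votes.csv).
votes <- data.frame(
  candidate = c("Mario", "Peach", "Bowser"),
  poll = c(37, 43, 84),
  mail = c(63, 107, 36)
)

# The long way: pass each element of the poll vector separately.
poll_long <- sum(votes$poll[1], votes$poll[2], votes$poll[3])

# The shorter way: hand sum() the whole vector.
poll_total <- sum(votes$poll)
mail_total <- sum(votes$mail)

poll_total  # 164
mail_total  # 206
```

The second form does not change if more candidates are added to the table, which is the design point made in the lecture.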
But we still have one other question to answer, which was how many votes did each candidate get? Now, to do that we have to sum values across columns here. 37 plus 63 is Mario's votes. 43 plus 107 is Peach's votes. And 84 plus 36, that's Bowser's votes. So let's try it. I could maybe treat two separate columns here. I could do votes and, what's that one, poll, votes poll. Get the first value for Mario. And add up, let's say, the first value in mail, a bit like this. And that, I would argue, is Mario's total number of votes. What if we did maybe the second value in poll for Peach and, let's say, the second value in mail, also for Peach. 150, so that's Peach's total number of votes. Let's do votes and the third element in the poll column and the third element in the mail column, and that would then be Bowser's votes. And I hope you can tell by me being kind of bored while I'm doing this, this is not the best way to do this. There is actually a better way that takes advantage of a feature of vectors in R, which is vector arithmetic. So not only are vectors handy at representing lists of data. We can use them to efficiently perform math as well. So if I wanted to sum up all of these vectors and find out how many votes each candidate got, I could simplify this and simply type the following, votes$poll plus, in this case, votes$mail. And that would be it. But let's visualize this and see why exactly this works. So we said, we had this idea of vector arithmetic. And here, again, is our data frame. So I want to effectively sum up these two columns and return to myself a new total for each candidate for every row, in fact, that I have. Well, to do that, I can take out these two vectors, poll and mail, and think of them separately now. Now I want to add these together. Like we said, 37 plus 63, that's Mario's votes. 43 plus 107, that's Peach's votes. 
Well, to add these I could use the plus sign, just like this, but if you're new, it's not quite obvious what's going to happen here. I mean, how could I add up these three numbers and these three numbers? And what is the structure of what I get back in the end? Well, it turns out that R uses a new vector as the result of this. And it actually computes the new values element-wise, element-wise meaning it goes top to bottom in each vector and adds those two corresponding elements together. So first it looks at the first element of each column here, the poll column and the mail column. 37 plus 63 is 100. Then it goes to the next one. 43 plus 107 is 150. Then it goes to the next one here. 84 plus 36 is now 120. And we now have a brand new vector by adding up or summing together these two distinct vectors overall. Let's go back to our visualization here of these blocks. So you could think of me taking a data frame, just like this, and kind of extracting these two vectors here. Let's say this one, and let's say this one. Let me put this one to the side for now. And I want to do some math with these. So in this case, I'll look at the first element in each vector, in this case 4, and in this case 1, or 4 and 1. What is the addition here? Well, 4 plus 1 is 5. So the first element of my new vector would be 5. Then I go to the next two elements here. In this case, I have 5 and 2. What's 5 plus 2? That's 7. So the second element in my new vector would be 7. And then, later on, what do I have? I have 6 and 3. Those added together would be 9. The third element in my new vector would, in fact, be 9. So vector arithmetic gives us, in the end, a new vector by adding these elements together, kind of element-wise in the end. So let's go back and see how this looks in RStudio. I'll come back to RStudio here. And I will then hit Enter on this line of code. And we'll see that I get back a vector of three elements, 100, 150, and 120. 
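Both worked examples of element-wise addition, the votes and the building blocks, can be reproduced with a short sketch. The data frame and the two block vectors are built inline with c as an assumption for illustration.

```r
# Build the lecture's table inline (assumption: same values as votes.csv).
votes <- data.frame(
  candidate = c("Mario", "Peach", "Bowser"),
  poll = c(37, 43, 84),
  mail = c(63, 107, 36)
)

# Adding two vectors of the same length works element by element.
totals <- votes$poll + votes$mail
totals  # 100 150 120

# The building-blocks example: 4 + 1, 5 + 2, 6 + 3.
blocks <- c(4, 5, 6) + c(1, 2, 3)
blocks  # 5 7 9
```

The result in each case is a new vector of the same length as the inputs, with one sum per position.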
So let's pause here and ask, what questions do we have on these vectors or vector arithmetic? AUDIENCE: How is the sum here being printed on a terminal without using the print function? CARTER ZENKE: Ah, good question. So you've noticed that when I press Command-Enter, for instance, I'm seeing the results, whatever is stored in this object, in my console without using print. And that, in fact, is a feature of R and RStudio, that when I run some particular line of code, I can then see the return value of that particular line, or whatever it computes for me, down to my console. It's kind of a handy feature of these things called IDEs that let me actually understand what's going on in my code all the more clearly. Good question. AUDIENCE: I have a question in data frames. So in the dollar notation, it returns a vector. What about in the bracket notation? Suppose we are taking, for a row, 1 comma. Does it also return a vector? CARTER ZENKE: A great question, so we just saw with the dollar notation, we get back a vector. Does it do the same thing then with bracket notation? So in fact, it does. Let me show you how it works in RStudio. I'll come back over here. And let's say I wanted to sum up these two columns still, but I don't want to use their names for whatever reason. I'll instead use their bracket representations here. So I'll take the second column of my table here, using comma 2. And I'll add together, in this case, the third column, just like this. I'll hit Enter. Whoops. Hit Enter. And I'll see I get back the same result. So these, each one of these is in fact a vector. Where you should be careful, though, is that there is some notation that looks a bit like this, votes, in this case, bracket 1. Notice how I'm not actually asking for a row and a column. I'm only asking for some number 1. This will give me back, in this case, the data frame but now only the first column of that data frame. So this, to be clear, is not a vector. 
This is the same data frame, but now it's only that single column of that data frame. So be careful here. You won't often use this notation, but you could get back a data frame if you use that kind of notation overall. A good question, and let's now keep going towards solving our problem of tabulating this data. So we saw before that simply adding the poll column in votes and the mail column in votes would give us the total number of votes for each candidate. But I would argue that if I run line 3 here, and I just get back some numbers, it's not super clear which candidate these votes belong to. So I can actually go ahead and add a new column to my data frame that includes these particular values here, one for Mario, one for Peach, and one for Bowser. And as long as my vector is of the same length as the number of rows in my data frame, I should be able to actually add this as a new column to my data frame. Now to create this new column, I could simply kind of wish it into existence, a bit like this. I could say votes$total. And I know there is no column called total, but I can make one. I could say, let's assign this vector to be the new column total in the votes data frame. If I run this line, I'll see I don't get a result, but I do, at least in votes, see there is a new column called total that now includes those same elements in my vector, top to bottom. So we've solved that problem. And our next step might be to now save this file again, to keep track of it later on, to share it with a friend. And to do that, I could use the write.csv function. We saw read.csv. But we also have write.csv, to actually save this file as a CSV. Let me try it. I'll say write.csv. And I'll call this one, let's say, totals.csv, for the total number of votes. Now, write.csv actually needs two arguments. One is the file name, like we just said here. But even before that, it needs to know what data frame to write to the file. 
So in this case, it's our votes data frame. Now, by convention, in its documentation, we see the first argument to write.csv is that data frame. The second is the name of the file itself. So let me clear my console. Let me run write.csv. And now, if I go to my file explorer, I should see down below-- oops-- totals.csv. And now let me View File. And I should see something a bit like my data frame. There are a few other features here, though. I see that there's these numbers here, 1, 2, 3, and this empty number, or this empty value up here. These are, in fact, the row names of this data frame. So not only do data frames have column names, like we saw here. They also have row names. And by default, they start counting at 1 and go on to 2, then 3, then 4, and so on. And I can actually tell write.csv whether or not I want those row names printed in my CSV. You often won't, because it kind of makes a wacky format where you have this empty value up here. And it's probably just not worth it for you. So you could specify the following. You could say row.names, row.names = FALSE, meaning that write.csv should not write any row names to the CSV, even though they exist in the data frame. So let me go ahead and clear my console. I'll do Command-Enter here. And now I should see, if I go back to totals.csv, I now have a much cleaner CSV without those particular row names. If I did, though, want to access those in R, I could use these functions. I could say colnames(votes), which will return to me the columns I have inside of my votes data frame, or rownames(votes), which gives me access to the row names of this particular data frame as well. All right. So we've seen here how we've been able to save this data frame, how to read data from a file and save it back to a file as well. What we will do next is actually take advantage of online data sets that other folks have written for us and use those to explore data in R as well. See you all in five. Well, we're back. 
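To pull the steps from before the break together, here is a hedged sketch of adding the total column and writing the CSV. The `votes` data frame is again a hand-built stand-in with an assumed `candidate` column, and the file goes to a temporary directory (an assumption made here so the sketch leaves no file behind) rather than to the working directory used in the lecture.

```r
# Hand-built stand-in for the lecture's votes data frame
votes <- data.frame(
  candidate = c("Mario", "Peach", "Bowser"),
  poll = c(37, 43, 84),
  mail = c(63, 107, 36)
)

# Assigning to a column that does not exist yet creates it
votes$total <- votes$poll + votes$mail

# Data frame first, file name second; row.names = FALSE leaves out the 1, 2, 3
path <- file.path(tempdir(), "totals.csv")
write.csv(votes, path, row.names = FALSE)

# Column and row names are still available inside R
column_names <- colnames(votes)
row_names <- rownames(votes)

# Reading the file back is one way to confirm what was written
written <- read.csv(path)
```

Leaving out `row.names = FALSE` would add an extra unnamed first column holding 1, 2, 3, which is the "wacky format" mentioned above.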
And so we've seen a few examples of how to use R. So far we've seen how to take user input and actually deal with that in our own programs. We've also found how to use data we've put in our own file for ourselves. But often, when you use R, you'll be working with not your own data but somebody else's. And so we'll do exactly that here for the rest of lecture. Now, you might have heard of FiveThirtyEight, as mentioned earlier in lecture. They work a lot with data. And they often put that data online for others to use and analyze themselves. In fact, they worked on a poll that asked people in the US about their voting habits. Do they plan to vote? How often do they vote? What encourages them or motivates them to vote? Or what are the reasons they actually don't vote in the first place? So here I have the URL that actually includes this data file online, on the internet somewhere else. And if you look at the end here, you'll see the file name nonvoters_data.csv. So this tells me that this data, even though it is online, is still stored in that same familiar file format called a CSV. And now, on the second line of code, I can still use read.csv. All I'm doing now, though, is telling read.csv where to look, where to go find this data file on the internet, and bring it down for me here in my own R environment. And I'm storing the result now, that data frame, as a data frame called voters. So let's take a peek and see what's inside. I'll come to RStudio over here. And let me do the function View, View. And I'll view voters. And whoops. Let me first run these lines of code. So let me run line 1 to create a new object called url. Let me run line 2 to actually load that data. And now, once I have those objects loaded, let me go ahead and view voters. And this is the data frame. It is a big one. If we scroll all the way to the side, you will see lots of numbers, lots of columns. If you scroll down, you'll see lots and lots of rows. 
And one question you might have is, well, just how many rows and columns are there in this data frame? One way to answer that, at least in R code itself, is to use these functions nrow and ncol, which stands for number of rows or number of columns respectively. So let me try nrow to get a sense of how many rows we have in this data frame. We'll do nrow(voters). And it seems like down on my console here, there are 5,836 rows of data. Now, each row of data represents a particular voter or maybe a non-voter, somebody who lives in the US and was asked to fill out this poll where they talk about why they vote or why they don't vote. But in the end, each row is its own voter. Now let's see how many columns we have, ncol(voters). It seems like we have 119 columns, which is a lot of columns. Each column in this case represents a question asked to that voter. And if I go back to the voters view here, I might see Q1, for instance. It seems like the voter who was given the ID 470001 answered 1 to question one. And so it's a little arcane right now. But thankfully, FiveThirtyEight has given us a tool we can use to interpret some of this data. In fact, that tool is called a codebook. And you might often use the same in your own work with data analysis, where a codebook tells you exactly what questions or columns mean in some particular context. So in this case, I do see Q1 or Q2_1 or Q2_2. And if I'm new to this data set, I don't know what the heck that means. But I could look in the codebook that FiveThirtyEight gave me, in which case they say Q1 asked participants this, Q2 asked participants this, and so on and so forth for all questions we have inside our data set. So let's actually take a peek at one particular column. I know this column exists because I looked in the codebook to find it out. There's a column called voter_category that actually tells us how each voter represented themselves in terms of their voting habits. 
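For a feel of what nrow and ncol report, here is a small sketch on a made-up stand-in. The real voters data frame is far larger, and the `id` column name and all values other than 470001 are hypothetical.

```r
# Tiny hypothetical stand-in for the much larger voters data frame
voters <- data.frame(
  id = c(470001, 470002, 470003),
  Q1 = c(1, 1, 2),
  voter_category = c("always", "sporadic", "rarely/never")
)

# One row per respondent, one column per question or attribute
n_rows <- nrow(voters)
n_cols <- ncol(voters)
```

On this stand-in both counts are 3; on the lecture's data they are the 5,836 and 119 quoted above.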
So to access that column, as we saw before, I could use voters, the name of the data frame, followed by dollar sign, followed by the name of the column, voter_category. And if I hit Command-Enter here, well, I'll see at least some of the results from that column, some of the elements here. And now you'll actually see why that bracket 1 that we saw initially is kind of useful now. As we work with bigger and bigger data sets, notice that that bracket notation tells us exactly where in our list or our vector we are. Notice how this one here means, this "always", that is the first element in our vector of all of these responses. This "always", well, that's the seventh element in our vector of all these responses. And same thing for this "rarely/never", that is the 13th element, and so on and so forth, all the way down. Now, there's more than what looks to be 900 or about 1,000 entries in our vector here. We just can't print all of them because it would be a really, really long output to our console. Now, I'm curious. I saw a lot of values here, but I want to know exactly what ways voters could have categorized themselves. Like, out of all of these responses, what are the unique ones? And it turns out R has a function called unique that I can use to actually figure out what are those unique values in this vector. So let me give this vector called voter_category, or the column voter_category, from voters to this function called unique, just like this. I'll do Command-Enter to run this line. And now I'll see I only get back three elements, "always", "sporadic", or "rarely/never". So it seems like voters could have categorized themselves in one of three categories. They always vote. They sporadically or occasionally vote. Or they rarely/never vote at all, so interesting data here. There's another actual column I'm interested in, too, which happens to be question 22. So if I look at that here, I'll type voters question 22 and hit command Enter. 
And here, I'll see all the responses for this particular question. And it seems to me like I'm seeing some numbers but also these NAs. And before we dive into that, let's take a peek at what this question is actually asking users. So it turns out this question 22 is asking users this. You previously indicated that you are not registered to vote. Which of the following reasons best describes why you are not registered to vote? And users were given several responses, one through seven, I believe, where one was, I don't have time. Two was, I don't trust the political system. Three was, I don't know, and so on and so forth. And it seems to me like when we saw this data set, we had some numbers here, one through seven, but we also had a lot of these NAs. And I'm curious if you have a sense of what those NAs might be. They're certainly not any of these options. What do you think those NAs might represent in this particular column of data here? AUDIENCE: Not answered or not [? customized ?] to be in the list. CARTER ZENKE: A great guess, so maybe it means something like not answered, or there's just no data there. And in fact, if we look at the context of this question, we can actually figure it out like you did. It says, "You previously indicated that you are not registered to vote." So in the US, we often have to register ourselves to be able to vote. But some people have done that, and some people haven't. It seems like this question was asked to only those participants who did not register to vote or who were not currently registered to vote. So it would make sense then that people who actually are registered couldn't have answered that question, and in which case, we don't want to have an empty string or some sort of number there. We could instead use one of R's special values. Now, R comes with these special values to indicate something very special. In this case, NA means not available. There could be data here, but there isn't. 
That's what NA in particular means. There are others, too. You might see Inf and -Inf. These mean infinity or negative infinity, numbers that are so big R can't represent them appropriately well. We saw NA here for not available. We also have NaN, which stands for not a number. You could think of if I tried to ask you, well, what is infinity divided by infinity? And you'd probably look at me confused. You would say, that's not a number. And I would say, yes. That's actually correct. R says NaN is that result in this particular case. And we also have this value called NULL, capital N-U-L-L, which is similar in spirit to NA but slightly different. NA, as we said before, stands for not available. There could be data here, but there isn't. NULL is simply a special value meaning absolutely nothing. I can have a vector of NAs, meaning that maybe I have five NAs in this vector, there are five places data could be, but I can't have a vector of NULL. NULL is literally nothing. It means absolutely nothing at all, so different slightly from NA in this case. So let's go back to our RStudio here, and let's take a peek at this particular column. So it seems like I have a lot of NAs. And one thing I'm interested in is, what kinds of unique values are we seeing inside of this particular column? So I could use unique again and hit Command-Enter. And now I'll see, if I clear my terminal, run it again, I'll see clearly that I have several different possible values here. I have NA, 7, 6, 2, negative 1, 1, 4, 5, and 3. And this mostly makes sense. We saw earlier that people were given several options, 1 through 7. And we have NA for people who weren't asked the question, and 1 through 7 for those other responses there. What wasn't among those options, actually, is negative 1. And I actually tried kind of hard to figure out why negative 1 is here, and I couldn't. So sometimes data is just messy. There are values you don't know and can't deal with. 
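The special values and the unique call from this part of the lecture can be tried on a small made-up vector. A sketch only: `q22` below is a hypothetical handful of responses, not the real column.

```r
# Hypothetical handful of responses, including missing values and a stray -1
q22 <- c(NA, 7, 6, 2, -1, 1, NA, 7, 4, 5, 3, NA)

# unique keeps the first occurrence of each value, NA included
distinct_values <- unique(q22)

# Inf, -Inf, and NaN arise from arithmetic
positive_infinity <- 1 / 0
negative_infinity <- -1 / 0
not_a_number <- Inf / Inf

# A vector of NAs still has a length; combining NULLs leaves nothing at all
na_length <- length(c(NA, NA, NA))
null_length <- length(c(NULL, NULL))
```

The last two lines show the difference drawn above: three NAs occupy three places, while NULL occupies none.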
So just keep that in mind as you work with data sets other people have given you. Now, one more column that I found interesting was this one. So it was voters question 21, question 21. And the question 21 asked this. It said, "There will be an election for president, members of the US Senate, House of Representatives, and other state and local offices. Do you plan to vote in that election?" And it gave people a few potential options here. 1 meant yes. 2 meant no. 3 meant unsure or undecided; I don't know if I'll vote in that election. So let's see what we actually got back as responses for this particular question. I'll come back to RStudio here. And why don't we take a peek at voters question 21. I'll hit Command-Enter. And here is the first 1,000 or so entries to this particular question. So we see a lot of 1's, which could mean yeses. I see some 2's, which could mean no, 3, unsure, undecided. I do see a negative 1 down here, which seems to be showing up for some unknown reason. But we could get a better sense if I used unique. So I'll use unique voters question 21. And now I'll see, again, my unique responses are 1, 2, 3, or negative 1. So at this point I'm kind of getting tired of looking at my vector here and trying to convert in my head between these 1's, 2's, 3's, negative 1's, to the actual response these participants gave. And it turns out there is a way in R to make it much easier for myself to work with data that's exactly like this. This data has some set number of categories of responses, as we saw with unique here. The categories are, in this case, 1, 2, 3, or negative 1. Those are the only possible values that could be stored in this vector. And we also know that 1, 2, and 3 correspond to some real-life, actual phrase that participants actually responded to in this poll. So to help us represent this kind of data, we can use something that R calls a factor, a factor. Now, I can create a factor using the factor function. 
And the factor function takes a vector, like the one I just have, the one I just had here, and converts it into a factor. So let's try that. I'll say factor(voters$Q21), just like this. And I'll hit Enter again, Command-Enter. And I'll see a very similar thing. If I scroll up, I have all the same data. Everything is just as it was, but now down below, I see levels, levels. And what does that look like to you? Well, we just saw before, we have several unique categories of data inside this vector, negative 1, 1, 2, and 3. And it seems like this factor took those unique values and made them this factor's levels, so the unique possible categories of data inside of this factor here. So factors are great for representing this categorical data, data that can actually be in one category or another. Now, one thing I could do is try to make it easier to read these categories as English phrases. And to do that, I could pass factor another argument, this one called labels. So not only do I give it my vector that has certain categories of data. I also give it some labels to apply to each of those categories, one after the other. So here, let me use labels as the argument. And let me give it a vector itself. So we saw before, down below, that we have a set of levels, number of categories in this data, negative 1, 1, 2, and 3. But now I could actually give those levels, those categories, some particular name. So I'll give it a vector itself. And I'll make a vector in R using this c function, which stands for combine. I'll create this vector where negative 1, let's say, maps to or corresponds to this value as a question mark. I don't know what the heck it is. Maybe 1, though, we saw was yes. And 2 was no. And 3 was unsure or undecided, exactly like this. So now notice that when I look at my levels, negative 1, that corresponds to the first label, question mark. What the heck is it? I don't know. Then, yes, that corresponds to level 1. Then level 2 corresponds to no. 
Level 3 corresponds to unsure or undecided. And now, if I clear my console and run this particular line of code, well, now I have something much more user-friendly. I can actually see not 1's, 2's, and 3's but now labels for those 1's, 2's, and 3's. And here I see Yes, Unsure/Undecided, No. Let me see if I can find a question mark here. Yep. There's one. So 517 seems to be a question mark. So now I've converted this vector into my own factor. Now, it's also not a good idea to have data we don't know what to do with. Like I said, this negative 1, I really don't know where it came from. And in that case, I could tell this factor when it's being created to exclude that value altogether. I could use the exclude parameter, the argument here, and say I want to exclude some vector of given values that are part of this vector that I want to just take out altogether. So I'll say negative 1 is the value I want to exclude. This is a vector of length 1. But the value I want to exclude in this case is negative 1. And now I can actually remove this label here because my only categories will now be Yes, No, Unsure/Undecided, or underneath the hood, that 1, 2, or 3. So let me clear my console again. Let me rerun factor here. And now let me see. Well, now my data is getting much cleaner. I have Yes, No, Unsure/Undecided. And if I go back to that 517 that we saw a question mark in up here-- let me see if I can find it again. 517, almost there, I see now it's an NA, which stands for, of course, not available. We excluded it altogether from this particular factor. So what questions do we have now on taking vectors and converting them to these factors that involve categories of data and labeling those categories of data? AUDIENCE: Why have we used factor instead of as.factor like before? CARTER ZENKE: So we've just seen here this function called factor. But there is a function called as.factor, similar to what we saw before, as.character, or as.integer, or as.double. 
We could use, I think, either one to get the same result. In general, though-- let me look at the documentation. Let me go back to RStudio here. Let me pull up the documentation for factor. Let me clear my console. Let me do ?factor. So this is, here, how we're supposed to use that factor function. We can see here some of the arguments, the actual parameters I gave, like the labels here as well. Let's now try to find the documentation for as.factor, as.factor. Let me go back over here, type as.factor. And now I'll actually see-- I'll see the same documentation page. And so, to me, this is symbolizing that these are very closely related. If I were to scroll down here, I could see, I think I see it here, is.factor, as.factor are the membership and coercion functions for these classes. So essentially, if you want to make a new factor from a vector, you would use factor. If you have something that you want to convert to a factor that is already something else entirely, like maybe not even a vector, you might be able to use as.factor. So I hope that at least gives you some idea of the difference between as.factor and factor, as the function itself. All right. What other questions do we have? Let's take a few more. AUDIENCE: How is a factor different from a table? CARTER ZENKE: A good question, so we saw that a table has rows and columns in it. A factor, though, is simply a list of data, a bit like a vector. It's one-dimensional. And let me show you some slides that can hopefully, actually help you visualize factors in general. So if I come back over here, let me find the right slide to go to. Here, in general, is this idea of a factor. So we had, of course, this vector that was previously only 1's, 2's, 3's, and 2's and 1's and negative 1. But when we converted it to a factor, we did the following. We found the unique categories of data in this vector, just like this, 1, 2, 3, and negative 1. 
What we then did is we sorted those categories and called them, essentially, levels. The categories of our data are the levels here. And we applied some label to each of those levels, basically saying negative 1 becomes "?". 1 becomes "Yes". 2 becomes "No". 3 becomes "Unsure/Undecided". Now notice that this factor here is still a vector. Well, still kind of one-dimensional, it's a list of data. It's not two-dimensional like a data frame was or a table was. It's only one-dimensional. And when we apply these labels, we convert these 1's, 2's, 3's, negative 1's instead to "Yes", "No", "Unsure", and "?". And if we exclude something, like we did with negative 1, we would then get NA instead of negative 1 in the end. So I hope that helps at least a little bit answering that question. Let's take one more here. AUDIENCE: I wanted to understand whether-- just like you showed how to exclude a negative value, what if I want to-- there are a number of negative values in the data, and I want to exclude all of the negative values. CARTER ZENKE: Ah, a good question, so the first part is you want to represent maybe negative values, and how do you do that. And the second part is how would we exclude those negative values, let's say. Another kind of similar case is, let's say we have lots of NA values. How do we get rid of them? So let me answer your first one and actually tee up that second question for next lecture. So come back to RStudio. And let's try representing some negative values here. We saw before that I had this idea of negative 1. If I want to have some negative value in RStudio, I could simply put a negative or a dash in front of any particular value. So here, let's say, is negative 1. But to your question then of how we could exclude these values we don't want, that's a problem of filtering, or subsetting our data, which we'll actually learn about next lecture. 
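To recap the factor steps from this lecture in one place, here is a sketch on a short made-up vector. `q21` is a hypothetical stand-in for the Q21 column; the labels are the ones used above.

```r
# Hypothetical stand-in for the Q21 responses
q21 <- c(1, 2, 3, 1, -1, 2, 1)

# With no extra arguments, the sorted unique values become the levels
plain <- factor(q21)
plain_levels <- levels(plain)

# labels are applied to the levels in order: -1, 1, 2, 3
labeled <- factor(q21, labels = c("?", "Yes", "No", "Unsure/Undecided"))

# exclude drops -1 from the levels, so only three labels are needed
# and the excluded entries show up as NA
cleaned <- factor(
  q21,
  labels = c("Yes", "No", "Unsure/Undecided"),
  exclude = c(-1)
)
cleaned
```

Because labels are matched to levels by position, the number of labels has to line up with the number of levels that remain after excluding.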
So we've seen so far how to represent data in R, how to take all kinds of data like votes, tables of votes, and represent them in R to manipulate them and so on. What we'll see next time is how to actually transform that data, removing values we don't want, adding data we do. And for that, we'll see you all next time.