1 00:00:00,000 --> 00:00:02,994 [MUSIC PLAYING] 2 00:00:02,994 --> 00:00:19,960 3 00:00:19,960 --> 00:00:23,440 CARTER ZENKE: Well, hello, one and all, and welcome to CS50's Introduction 4 00:00:23,440 --> 00:00:26,260 to Programming with R. My name is Carter Zenke. 5 00:00:26,260 --> 00:00:29,950 And I'm so excited to embark on this journey to learn this language called R 6 00:00:29,950 --> 00:00:30,790 with you. 7 00:00:30,790 --> 00:00:33,640 Now, it's likely that you have never programmed before. 8 00:00:33,640 --> 00:00:35,320 And if so, that's OK. 9 00:00:35,320 --> 00:00:38,560 But you might be asking, what is a programming language, actually, anyway? 10 00:00:38,560 --> 00:00:40,810 Well, it turns out a programming language is something 11 00:00:40,810 --> 00:00:43,300 that we humans have created to actually talk to computers 12 00:00:43,300 --> 00:00:45,400 and have them solve problems for us. 13 00:00:45,400 --> 00:00:47,932 And you might have heard of lines of code, things that 14 00:00:47,932 --> 00:00:49,390 actually tell computers what to do. 15 00:00:49,390 --> 00:00:52,600 And you may have heard of programs being lines and lines and lines of code, 16 00:00:52,600 --> 00:00:55,350 telling computers step by step what it is we want them to do. 17 00:00:55,350 --> 00:00:57,100 And you might have heard as well there are 18 00:00:57,100 --> 00:01:02,740 other languages you could learn, like C, like Python, like R, like JavaScript. 19 00:01:02,740 --> 00:01:06,010 And so you could be asking yourself, why would I learn R? 20 00:01:06,010 --> 00:01:09,760 Well, R, it turns out, is this language built from the ground up 21 00:01:09,760 --> 00:01:11,170 to work with data. 22 00:01:11,170 --> 00:01:14,380 And so, if you are interested in data or some of these fields 23 00:01:14,380 --> 00:01:18,550 here, like data science, data visualization, research, or statistics, 24 00:01:18,550 --> 00:01:21,110 R could be a language for you. 
25 00:01:21,110 --> 00:01:23,420 And although this course won't actually teach data 26 00:01:23,420 --> 00:01:25,580 science or statistics or the math behind them, 27 00:01:25,580 --> 00:01:30,380 you will emerge being able to use R for these disciplines. 28 00:01:30,380 --> 00:01:33,890 Now, actually, just recently, researchers used R to model 29 00:01:33,890 --> 00:01:35,660 how COVID-19 was spread. 30 00:01:35,660 --> 00:01:38,570 And this is a visualization built entirely in R, 31 00:01:38,570 --> 00:01:41,810 to model how COVID-19 was spread on a cruise ship. 32 00:01:41,810 --> 00:01:45,320 You might have also heard of FiveThirtyEight, these data journalists 33 00:01:45,320 --> 00:01:47,810 who actually write articles with data. 34 00:01:47,810 --> 00:01:51,080 They use R to analyze data, to write their articles 35 00:01:51,080 --> 00:01:54,770 and share visualizations like this one so we can actually understand insights 36 00:01:54,770 --> 00:01:56,460 from data, too. 37 00:01:56,460 --> 00:01:58,370 So without further ado, let's actually begin 38 00:01:58,370 --> 00:02:00,530 by writing our very first R program. 39 00:02:00,530 --> 00:02:03,890 And to do that, we'll need what's called an Integrated Development 40 00:02:03,890 --> 00:02:06,890 Environment, or an IDE for short. 41 00:02:06,890 --> 00:02:10,039 Now, code, at the end of the day, is just text. 42 00:02:10,039 --> 00:02:13,220 And you could use any text editor to write code. 43 00:02:13,220 --> 00:02:16,700 But it turns out that when you actually want to write lots and lots of code, 44 00:02:16,700 --> 00:02:19,490 it's helpful to have a tool to do that with. 45 00:02:19,490 --> 00:02:22,270 And that's what this IDE will be doing for us. 46 00:02:22,270 --> 00:02:27,670 Now let me introduce you over here to this IDE called RStudio. 47 00:02:27,670 --> 00:02:33,700 Now, RStudio is an IDE built exclusively and particularly for R. And here it is. 
48 00:02:33,700 --> 00:02:37,090 You'll notice off the bat that I have this kind of greater than sign 49 00:02:37,090 --> 00:02:39,070 here with a blinking cursor. 50 00:02:39,070 --> 00:02:43,120 And this indicates what we're going to call the R console. 51 00:02:43,120 --> 00:02:45,670 It's a place I can actually type R statements 52 00:02:45,670 --> 00:02:48,670 and have them run kind of line by line by line. 53 00:02:48,670 --> 00:02:52,960 It's a great place for me to actually execute or to run one line of code 54 00:02:52,960 --> 00:02:53,950 at a time. 55 00:02:53,950 --> 00:02:56,680 So let's actually type our very first line of R code 56 00:02:56,680 --> 00:03:00,640 here to create this file called hello.R, in which 57 00:03:00,640 --> 00:03:02,930 we'll write an actual full-fledged program. 58 00:03:02,930 --> 00:03:06,730 So to create a file using this console in RStudio, 59 00:03:06,730 --> 00:03:12,250 I can type file.create, open parentheses and closed parentheses, 60 00:03:12,250 --> 00:03:15,070 and then the name of the file I want to create. 61 00:03:15,070 --> 00:03:19,870 So here I'm going to create, in this case, hello.R, just like that. 62 00:03:19,870 --> 00:03:25,820 Now, notice that hello.R ends in dot R, and that is actually very particular. 63 00:03:25,820 --> 00:03:29,750 You might have heard of files ending in, like, dot jpg for images, 64 00:03:29,750 --> 00:03:32,420 or maybe dot csv for data files. 65 00:03:32,420 --> 00:03:37,650 In R, we denote our R programs with this dot capital R here. 66 00:03:37,650 --> 00:03:42,380 So let me run this R statement here by hitting Enter. 67 00:03:42,380 --> 00:03:43,820 And now I should see-- 68 00:03:43,820 --> 00:03:46,910 well, I see TRUE kind of yelling at me right here. 69 00:03:46,910 --> 00:03:49,820 What this means is that this file was created.
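As a minimal sketch of the console step just described: the call below creates an empty file named hello.R in whatever folder R is currently working in. The object name `created` is only an illustrative choice here and is not part of the lecture; the lecture types the `file.create` call directly.

```r
# Create an empty file called hello.R and keep the function's answer.
# file.create gives back TRUE when the file was created successfully.
created <- file.create("hello.R")

# Show the answer in the console; a successful run shows TRUE after the [1] marker.
print(created)
```

Typing `file.create("hello.R")` on its own at the console, as in the lecture, shows the same TRUE without needing `print`.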
70 00:03:49,820 --> 00:03:52,280 And we'll also see this kind of bracket 1 71 00:03:52,280 --> 00:03:54,770 here, which will become more apparent as we go on, 72 00:03:54,770 --> 00:03:59,330 but it turns out that because R works a lot with data and lists of data, 73 00:03:59,330 --> 00:04:01,610 it's often handy, when you have lots and lots of data, 74 00:04:01,610 --> 00:04:04,220 to give us an indicator of where in that list we are. 75 00:04:04,220 --> 00:04:06,690 So we'll see this come in handy a bit more later. 76 00:04:06,690 --> 00:04:11,150 But for now, with this list of 1 value, like TRUE, it just has one for now. 77 00:04:11,150 --> 00:04:13,850 Well, where was this file created? 78 00:04:13,850 --> 00:04:15,080 I claim it was created. 79 00:04:15,080 --> 00:04:16,910 But where was it created? 80 00:04:16,910 --> 00:04:21,860 So RStudio has this file explorer that I can open on the side here. 81 00:04:21,860 --> 00:04:26,720 And this file explorer shows me the contents of some particular folder 82 00:04:26,720 --> 00:04:27,467 on my computer. 83 00:04:27,467 --> 00:04:29,300 Even if you don't know much about computers, 84 00:04:29,300 --> 00:04:31,610 you probably know about files and folders. 85 00:04:31,610 --> 00:04:36,800 And so here, it seems like I am inside of the User/jharvard folder. 86 00:04:36,800 --> 00:04:41,300 And inside of that folder, RStudio and R created for me 87 00:04:41,300 --> 00:04:44,300 this file called hello.R. 88 00:04:44,300 --> 00:04:48,530 Now, RStudio works by default in a single folder. 89 00:04:48,530 --> 00:04:52,250 And that folder, it turns out, is called our working directory. 90 00:04:52,250 --> 00:04:55,340 If I ask R to create a file for me, it will create 91 00:04:55,340 --> 00:04:57,590 that file in the working directory. 
92 00:04:57,590 --> 00:05:02,300 Or if I ask it to find some data file and read it into my R program, 93 00:05:02,300 --> 00:05:06,120 it will first search in that working directory for me here. 94 00:05:06,120 --> 00:05:10,910 So let's open up this R file and actually write our first R program. 95 00:05:10,910 --> 00:05:14,270 I'll hit this hello.R file icon here. 96 00:05:14,270 --> 00:05:18,350 Let me close my file explorer for now because it's not important at this point. 97 00:05:18,350 --> 00:05:22,120 But now I'm introduced to this new component of RStudio, 98 00:05:22,120 --> 00:05:24,490 in this case, my file editor. 99 00:05:24,490 --> 00:05:27,850 So as we saw before, this R console down here 100 00:05:27,850 --> 00:05:31,960 is great for writing single R statements, single lines of R code. 101 00:05:31,960 --> 00:05:35,080 But this hello.R file is great for writing 102 00:05:35,080 --> 00:05:38,920 more than one line of code, tens of lines, hundreds of lines, 103 00:05:38,920 --> 00:05:41,440 creating a full-fledged program here. 104 00:05:41,440 --> 00:05:44,440 So I have this blinking cursor, which means I can actually just go ahead 105 00:05:44,440 --> 00:05:45,700 and type some text. 106 00:05:45,700 --> 00:05:49,490 And let's type in some text that will be our very first R program. 107 00:05:49,490 --> 00:05:55,460 I'll type print, P-R-I-N-T, followed by an opening parenthesis and a closing 108 00:05:55,460 --> 00:05:55,960 one. 109 00:05:55,960 --> 00:06:03,530 And inside those parentheses, let me type "hello, world", just like this. 110 00:06:03,530 --> 00:06:08,200 And now I've typed in some text, my first line of R code. 111 00:06:08,200 --> 00:06:12,160 I can save this file by clicking on this little Save icon here. 112 00:06:12,160 --> 00:06:15,850 Or, on Mac, I could do Command-S. Or, on Windows, I could do Control-S. So 113 00:06:15,850 --> 00:06:17,740 let me hit the Save button here.
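For reference, the whole of hello.R at this point is the single line typed above. The comment line is an addition for illustration and is not part of what was typed in the lecture.

```r
# hello.R: display a piece of text in the console.
print("hello, world")
# shows: [1] "hello, world"
```

The double quotes mark where the text begins and ends, and the parentheses belong to `print`.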
114 00:06:17,740 --> 00:06:24,040 And now this program is saved on my computer. I could run it if I wanted to. 115 00:06:24,040 --> 00:06:28,030 Now, you may be used to running programs by double-clicking on them 116 00:06:28,030 --> 00:06:32,170 or finding some icon and clicking on that to open the program and run it, 117 00:06:32,170 --> 00:06:34,930 but I don't see those icons here. 118 00:06:34,930 --> 00:06:36,850 And that is intentional. 119 00:06:36,850 --> 00:06:38,740 This is a program we've made ourselves, which 120 00:06:38,740 --> 00:06:41,180 will require a different way of running it for us. 121 00:06:41,180 --> 00:06:43,180 And even if you don't know much about computers, 122 00:06:43,180 --> 00:06:48,640 you probably know computers speak this language called binary, or 1's and 0's. 123 00:06:48,640 --> 00:06:53,890 And right now, we just have print("hello, world"). 124 00:06:53,890 --> 00:06:56,060 That doesn't look like 1's and 0's to me. 125 00:06:56,060 --> 00:06:59,470 So there has to be a way to translate or interpret 126 00:06:59,470 --> 00:07:05,110 this R text we've written into 1's and 0's the computer actually understands. 127 00:07:05,110 --> 00:07:07,060 So R is more than a language. 128 00:07:07,060 --> 00:07:09,790 It's also an interpreter that takes this text we've written 129 00:07:09,790 --> 00:07:14,230 and converts it to the 1's and 0's the computer understands. 130 00:07:14,230 --> 00:07:17,830 Now, to kick off that process, that interpretation process, 131 00:07:17,830 --> 00:07:21,380 I can actually click this button over here that says Run. 132 00:07:21,380 --> 00:07:23,223 So let me go down to the console first. 133 00:07:23,223 --> 00:07:26,140 And let me clear the console so it's very clear what's happening here. 134 00:07:26,140 --> 00:07:28,780 I'll type Control-L to clear the console.
135 00:07:28,780 --> 00:07:34,090 And now, if I go to this line of code, line 1, and hit Run, 136 00:07:34,090 --> 00:07:39,830 I should see my first R program saying hello to the world over here. 137 00:07:39,830 --> 00:07:42,250 So let's take a step back first, though, and think 138 00:07:42,250 --> 00:07:44,510 about what it is we have just done. 139 00:07:44,510 --> 00:07:48,160 So here in R, we have created our very first program. 140 00:07:48,160 --> 00:07:51,760 And we've done so using something that's called a function. 141 00:07:51,760 --> 00:07:54,460 Now, in languages like R, and lots of others, 142 00:07:54,460 --> 00:07:57,550 too, you'll have access to these things called functions. 143 00:07:57,550 --> 00:08:00,550 And functions let you tell the computer to take some action, 144 00:08:00,550 --> 00:08:02,950 do something to solve some particular problem. 145 00:08:02,950 --> 00:08:06,070 In this case, our problem was displaying some text. 146 00:08:06,070 --> 00:08:10,370 And this function that we saw called print helped us do just that. 147 00:08:10,370 --> 00:08:13,210 So let's go back and show you this program again. 148 00:08:13,210 --> 00:08:15,340 I'm going to come back to RStudio. 149 00:08:15,340 --> 00:08:18,280 We saw that this function is called print. 150 00:08:18,280 --> 00:08:20,080 You can probably guess that yourself here. 151 00:08:20,080 --> 00:08:23,590 And we know it's a function because I used these parentheses here. 152 00:08:23,590 --> 00:08:26,380 This is convention in R to denote a function 153 00:08:26,380 --> 00:08:31,660 by its name, followed by parentheses, followed by some particular input 154 00:08:31,660 --> 00:08:33,250 to that function. 155 00:08:33,250 --> 00:08:37,159 Now, print, when the R developers designed it, 156 00:08:37,159 --> 00:08:40,164 doesn't print some predetermined piece of text. 157 00:08:40,164 --> 00:08:41,789 It doesn't always print "hello, world". 
158 00:08:41,789 --> 00:08:43,880 It doesn't always print "hello" to somebody else. 159 00:08:43,880 --> 00:08:48,050 Instead, it prints the text that I want it to print. 160 00:08:48,050 --> 00:08:50,540 And so we'll see, as we learn programming together, 161 00:08:50,540 --> 00:08:52,700 that these functions can take inputs. 162 00:08:52,700 --> 00:08:57,140 And more precisely, these inputs are called arguments to the function. 163 00:08:57,140 --> 00:09:01,610 They're inputs that change how the function actually runs. 164 00:09:01,610 --> 00:09:04,040 Now, what is it that print did? 165 00:09:04,040 --> 00:09:09,170 Well, down below in the console here, I see that it printed "hello, world". 166 00:09:09,170 --> 00:09:12,230 And this is what's known as the side effect of the function, something 167 00:09:12,230 --> 00:09:13,730 visual that happened. 168 00:09:13,730 --> 00:09:15,950 Side effects could also be something that happens 169 00:09:15,950 --> 00:09:18,410 via audio, something else I see. 170 00:09:18,410 --> 00:09:21,260 Whatever happens as the function is running, 171 00:09:21,260 --> 00:09:23,960 that is known as its side effect. 172 00:09:23,960 --> 00:09:27,650 And this is our first program in R so far. 173 00:09:27,650 --> 00:09:32,960 But let's pause here and ask what questions we have on R, RStudio, 174 00:09:32,960 --> 00:09:36,540 or this first program we just wrote called "hello, world". 175 00:09:36,540 --> 00:09:40,850 So a question about why would you use RStudio as opposed to VS Code-- 176 00:09:40,850 --> 00:09:42,680 well, if you're familiar with VS Code, you 177 00:09:42,680 --> 00:09:45,020 would know that it can work with a variety of languages, 178 00:09:45,020 --> 00:09:47,300 like Python and C and others. 179 00:09:47,300 --> 00:09:51,020 RStudio, though, is tailor-made to work with R. 
180 00:09:51,020 --> 00:09:54,470 And as we'll see later on in the course, there's a lot of features of RStudio 181 00:09:54,470 --> 00:09:58,250 that make working with R much easier, particularly with visualizations 182 00:09:58,250 --> 00:09:59,120 and plotting. 183 00:09:59,120 --> 00:10:02,480 So we'll see that later on in the course as well. 184 00:10:02,480 --> 00:10:05,690 Also, a question about why would you use R instead of Python-- 185 00:10:05,690 --> 00:10:08,600 well, Python tends to be this really big language that can 186 00:10:08,600 --> 00:10:10,580 be used for lots and lots of things. 187 00:10:10,580 --> 00:10:13,370 When you think of libraries in Python, like NumPy, 188 00:10:13,370 --> 00:10:17,930 you can certainly do work like you would do in R with those libraries. 189 00:10:17,930 --> 00:10:21,380 R, though, is more of a precise tool. 190 00:10:21,380 --> 00:10:23,900 It's built from the ground up to work with data, 191 00:10:23,900 --> 00:10:27,290 whereas Python is more of a general tool for solving lots of different problems. 192 00:10:27,290 --> 00:10:30,920 R then comes with some optimizations that make working with data a little more 193 00:10:30,920 --> 00:10:31,910 efficient. 194 00:10:31,910 --> 00:10:35,480 But in general, you could use either Python, with NumPy or other libraries, 195 00:10:35,480 --> 00:10:38,780 or R to work with data overall. 196 00:10:38,780 --> 00:10:40,640 Now a question from Luis-- 197 00:10:40,640 --> 00:10:44,030 LUIS: Would you show us how to change the working directory for the console? 198 00:10:44,030 --> 00:10:46,905 CARTER ZENKE: So a great question about actually changing the working 199 00:10:46,905 --> 00:10:49,400 directory, we saw earlier that RStudio defaults 200 00:10:49,400 --> 00:10:51,200 to having some kind of working directory. 201 00:10:51,200 --> 00:10:53,660 But you could change that if you wanted to.
202 00:10:53,660 --> 00:10:56,270 Well, in R, there is a function dedicated 203 00:10:56,270 --> 00:10:57,920 to changing that working directory. 204 00:10:57,920 --> 00:11:00,230 And I can show you what that looks like over here. 205 00:11:00,230 --> 00:11:03,980 Let me come back to my RStudio environment and, in particular, 206 00:11:03,980 --> 00:11:04,910 my console. 207 00:11:04,910 --> 00:11:07,520 So changing the working directory, that actually 208 00:11:07,520 --> 00:11:09,690 only requires one line of R code. 209 00:11:09,690 --> 00:11:11,840 So I could then clear my console down below here 210 00:11:11,840 --> 00:11:14,690 and execute that line of code in the console. 211 00:11:14,690 --> 00:11:19,290 The function to do so is setwd, just like this. 212 00:11:19,290 --> 00:11:22,220 And then in parentheses, you can then type the path 213 00:11:22,220 --> 00:11:26,060 for the folder you want to change your working directory to. 214 00:11:26,060 --> 00:11:31,160 So if you do want to do that, you can do it using the setwd function here. 215 00:11:31,160 --> 00:11:35,030 Now, we've completed our very first "hello, world" program. 216 00:11:35,030 --> 00:11:38,270 But odds are things aren't always going to go as smoothly for you if you're 217 00:11:38,270 --> 00:11:39,450 just beginning with programming. 218 00:11:39,450 --> 00:11:41,367 And even if you're more experienced, you might 219 00:11:41,367 --> 00:11:44,470 encounter these things called bugs, or these errors in your code. 220 00:11:44,470 --> 00:11:48,120 So let me reproduce, kind of intentionally here, 221 00:11:48,120 --> 00:11:50,550 a bug that could happen in my program. 222 00:11:50,550 --> 00:11:53,640 Let's say, as I was typing, I didn't type print. 223 00:11:53,640 --> 00:11:56,250 I instead typed prin. 224 00:11:56,250 --> 00:11:58,380 Let me save this program here. 225 00:11:58,380 --> 00:12:00,220 And let me run it to see what happens. 226 00:12:00,220 --> 00:12:01,470 I'll hit Run. 
227 00:12:01,470 --> 00:12:08,970 And, well, now I see down in my console, Error in prin("hello, world"), 228 00:12:08,970 --> 00:12:12,160 could not find function "prin", in this case. 229 00:12:12,160 --> 00:12:16,650 So this error is hopefully telling me, look, there is no function called prin. 230 00:12:16,650 --> 00:12:19,620 And hopefully, by seeing this, I should then actually know, well, oh, 231 00:12:19,620 --> 00:12:20,745 I didn't mean to type prin. 232 00:12:20,745 --> 00:12:22,660 I meant to type print instead. 233 00:12:22,660 --> 00:12:24,420 So I can go back and fix it. 234 00:12:24,420 --> 00:12:27,660 And this process of actually going back and fixing our programs 235 00:12:27,660 --> 00:12:31,050 is known as debugging, finding those bugs in our code 236 00:12:31,050 --> 00:12:34,050 and getting rid of them by looking through these errors 237 00:12:34,050 --> 00:12:37,150 and talking to a colleague or trying to fix them overall. 238 00:12:37,150 --> 00:12:40,830 So here, let me go back and clear my terminal, my console down below. 239 00:12:40,830 --> 00:12:45,570 I'll run this line of R code, or run this line of this program here again. 240 00:12:45,570 --> 00:12:49,630 And I'll see, once again, "hello, world". 241 00:12:49,630 --> 00:12:53,770 So this is a pretty good program, but I argue we could do a little bit better 242 00:12:53,770 --> 00:12:54,270 with it. 243 00:12:54,270 --> 00:12:56,760 Like, we don't just have to say "hello" to everyone. 244 00:12:56,760 --> 00:12:59,910 We could try to say "hello" to a particular user. 245 00:12:59,910 --> 00:13:04,020 And so our next step here will be to actually ask the user for their name 246 00:13:04,020 --> 00:13:07,290 and then say "hello" to that particular user. 247 00:13:07,290 --> 00:13:12,300 Now, to get user input, there is a function in R for that. 248 00:13:12,300 --> 00:13:13,650 It's not print. 249 00:13:13,650 --> 00:13:16,200 It's instead called readline. 
250 00:13:16,200 --> 00:13:18,690 So let's use the readline function here. 251 00:13:18,690 --> 00:13:21,810 I'll type R-E-A-D, readline-- 252 00:13:21,810 --> 00:13:22,920 this is all one word-- 253 00:13:22,920 --> 00:13:25,710 and then parentheses, opening and closing. 254 00:13:25,710 --> 00:13:29,130 And it turns out that readline takes an argument, too. 255 00:13:29,130 --> 00:13:33,090 This argument will then be the prompt it prompts the user with. 256 00:13:33,090 --> 00:13:37,600 So I'll ask the user, "What's your name?" just like this. 257 00:13:37,600 --> 00:13:39,550 And I'll save the file here. 258 00:13:39,550 --> 00:13:41,790 So let me go down and clear my console. 259 00:13:41,790 --> 00:13:46,440 I'll then run this line of R code, and I'll see "What's your name?" 260 00:13:46,440 --> 00:13:51,863 I could type, in this case, my name is Carter, hit Enter, and I see "Carter". 261 00:13:51,863 --> 00:13:54,030 So it's not so much a greeting as much as it is just 262 00:13:54,030 --> 00:13:55,780 kind of saying my name back to me. 263 00:13:55,780 --> 00:13:58,780 I think we could still do better than that as well. 264 00:13:58,780 --> 00:14:02,560 Let me clear my console, and ideally I want to say something like this. 265 00:14:02,560 --> 00:14:04,570 I want to say, hello, Carter. 266 00:14:04,570 --> 00:14:05,850 So I could use print again. 267 00:14:05,850 --> 00:14:12,150 Maybe on line 2 now, I choose to say print and then "Hello, Carter". 268 00:14:12,150 --> 00:14:14,370 And I'll save this file again. 269 00:14:14,370 --> 00:14:19,740 Well, now I have not just one line of code but two. 270 00:14:19,740 --> 00:14:23,280 And when R goes to interpret this program, 271 00:14:23,280 --> 00:14:26,880 it will read these lines of code top to bottom, left to right, 272 00:14:26,880 --> 00:14:29,940 executing each function along the way. 
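The two-line program described above might look like the sketch below. The trailing space inside the prompt is an assumption made here so that the typed name does not sit directly against the question mark; the comments are also additions.

```r
# Line 1: show a prompt and wait for the user to type something.
# Run on its own in the console, this line shows the typed text back.
readline("What's your name? ")

# Line 2: print a fixed greeting, whatever the user typed on line 1.
print("Hello, Carter")
```

Because line 2 contains a fixed piece of text, it greets Carter no matter who answers the prompt.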
273 00:14:29,940 --> 00:14:35,320 But it turns out that this Run button over here, it says Run, 274 00:14:35,320 --> 00:14:40,440 but what it really does is it only runs one line of R code at a time. 275 00:14:40,440 --> 00:14:43,590 For example, I could run just line 1, or I 276 00:14:43,590 --> 00:14:48,737 could run just line 2, but what we have here is really a full-fledged program. 277 00:14:48,737 --> 00:14:50,070 It's more than one line of code. 278 00:14:50,070 --> 00:14:50,910 It's two. 279 00:14:50,910 --> 00:14:54,540 So to run the entire program, I need to use a different approach, 280 00:14:54,540 --> 00:14:58,660 different button in this case, and that button is this Source button up top 281 00:14:58,660 --> 00:14:59,160 here. 282 00:14:59,160 --> 00:15:03,850 Source in this case means to run the entire source file here. 283 00:15:03,850 --> 00:15:05,490 So let me go ahead and source. 284 00:15:05,490 --> 00:15:07,500 And let me say, what's my name? 285 00:15:07,500 --> 00:15:08,610 My name is Carter. 286 00:15:08,610 --> 00:15:11,310 And I see "Hello, Carter". 287 00:15:11,310 --> 00:15:15,085 But there might still be a bug in this program. 288 00:15:15,085 --> 00:15:16,210 Let's try it one more time. 289 00:15:16,210 --> 00:15:17,700 Let me clear the console. 290 00:15:17,700 --> 00:15:19,200 I'll source again. 291 00:15:19,200 --> 00:15:22,680 And what if my name was, like, Mario from the Nintendo universe. 292 00:15:22,680 --> 00:15:24,990 I could type, well, my name is Mario. 293 00:15:24,990 --> 00:15:25,830 Hmm. 294 00:15:25,830 --> 00:15:26,582 "Hello, Carter". 295 00:15:26,582 --> 00:15:27,540 OK, let's try it again. 296 00:15:27,540 --> 00:15:28,530 I'll do Source. 297 00:15:28,530 --> 00:15:32,970 And I'll do-- maybe my name is Princess Peach, so Princess Peach here. 298 00:15:32,970 --> 00:15:36,250 And, well, still "Hello, Carter". 
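To illustrate what running a whole file top to bottom means, here is a small sketch that uses R's built-in `source` function, which runs every line of a file in order. Treating it as the console counterpart of the Source button is an assumption on my part, and the temporary file and its two lines are hypothetical stand-ins for hello.R.

```r
# Write a two-line program to a temporary file (a stand-in for hello.R).
script <- tempfile(fileext = ".R")
writeLines(c('print("first line")', 'print("second line")'), script)

# Run the entire file, top to bottom, rather than one line at a time.
source(script)
# shows: [1] "first line"
#        [1] "second line"
```

Both lines run in order from a single command, which is the difference from Run described above.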
299 00:15:36,250 --> 00:15:39,320 So it seems to me like we need to do something more dynamic than just 300 00:15:39,320 --> 00:15:41,210 printing out, "Hello, Carter". 301 00:15:41,210 --> 00:15:46,370 We need to actually print what the user has given us at the console. 302 00:15:46,370 --> 00:15:49,610 So for that, we'll actually need this new concept 303 00:15:49,610 --> 00:15:52,130 in programming called a return value. 304 00:15:52,130 --> 00:15:55,910 Functions don't just have arguments and side effects. 305 00:15:55,910 --> 00:15:58,430 They also have return values. 306 00:15:58,430 --> 00:16:01,490 And in this case, readline will return to us, 307 00:16:01,490 --> 00:16:03,928 as its return value, the user input. 308 00:16:03,928 --> 00:16:05,720 You could think of it a bit like a metaphor 309 00:16:05,720 --> 00:16:08,780 of asking your friend to go out and ask somebody for their name. 310 00:16:08,780 --> 00:16:12,290 And they might write it down and return it back to you, the programmer, 311 00:16:12,290 --> 00:16:14,390 so you can use it later on in your code. 312 00:16:14,390 --> 00:16:17,540 That's kind of what return values are in this case. 313 00:16:17,540 --> 00:16:22,100 But if I have a return value, I'm going to need someplace to store it, 314 00:16:22,100 --> 00:16:25,280 someplace to reuse it later on in my code. 315 00:16:25,280 --> 00:16:29,120 And for that we'll need something called a variable or an object. 316 00:16:29,120 --> 00:16:32,940 A variable is some name for a value that could change. 317 00:16:32,940 --> 00:16:37,670 So let's see how we could use both return values and objects in R 318 00:16:37,670 --> 00:16:40,320 to make this program more dynamic. 319 00:16:40,320 --> 00:16:42,210 Let me go to line 1 here. 320 00:16:42,210 --> 00:16:45,360 And let me try to give this return value of readline 321 00:16:45,360 --> 00:16:47,610 a name I could reuse later on. 
322 00:16:47,610 --> 00:16:50,310 Well, if the user types in Carter, I want 323 00:16:50,310 --> 00:16:52,890 to refer to that value via some name. 324 00:16:52,890 --> 00:16:55,510 And maybe the right name is just simply name. 325 00:16:55,510 --> 00:16:57,210 So I'll type name here. 326 00:16:57,210 --> 00:17:01,230 And now, if I want to store the return value of readline, 327 00:17:01,230 --> 00:17:05,280 that is whatever the user typed in, inside this object called name, 328 00:17:05,280 --> 00:17:10,619 I could use this syntax, these particular characters here, the less 329 00:17:10,619 --> 00:17:12,660 than sign and a dash. 330 00:17:12,660 --> 00:17:15,359 And notice, if we read this right to left, 331 00:17:15,359 --> 00:17:18,599 first readline will run as a function. 332 00:17:18,599 --> 00:17:21,030 It will prompt the user for their input. 333 00:17:21,030 --> 00:17:22,680 The user will type that input in. 334 00:17:22,680 --> 00:17:25,319 And then readline will come back to us and give us back 335 00:17:25,319 --> 00:17:27,690 whatever the value is that the user typed in. 336 00:17:27,690 --> 00:17:31,230 And then these lines of code over here-- name, space, 337 00:17:31,230 --> 00:17:37,200 less than, this dash here-- will store it underneath this name called name. 338 00:17:37,200 --> 00:17:40,770 This arrow, this leftward arrow, is called the assignment operator. 339 00:17:40,770 --> 00:17:45,060 We're assigning whatever return value from readline to this new object 340 00:17:45,060 --> 00:17:46,570 called name. 341 00:17:46,570 --> 00:17:50,370 So let me just, for now, get rid of this line 2 342 00:17:50,370 --> 00:17:53,880 and just focus on this particular piece right now. 343 00:17:53,880 --> 00:17:56,130 Let me source this file. 344 00:17:56,130 --> 00:17:59,340 And now let me say my name is Mario. 345 00:17:59,340 --> 00:18:00,300 I'll hit Enter. 346 00:18:00,300 --> 00:18:02,460 And I see nothing yet because there's no print. 
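The assignment described above can be sketched as a single line. As before, the trailing space in the prompt and the comments are additions for readability.

```r
# Read right to left: readline runs first and returns what the user typed;
# the assignment operator <- then stores that return value in the object name.
name <- readline("What's your name? ")
```

Nothing is displayed after the user presses Enter, because the value is stored in `name` and the program never prints it.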
347 00:18:02,460 --> 00:18:07,500 But now, if I open up this new pane in RStudio, let me go ahead 348 00:18:07,500 --> 00:18:10,620 and go to my environment pane, I'll actually 349 00:18:10,620 --> 00:18:16,390 see what the value is for this object we created called name. 350 00:18:16,390 --> 00:18:21,270 So it seems like RStudio is telling me that I have this object called name, 351 00:18:21,270 --> 00:18:24,180 and its value now is Mario. 352 00:18:24,180 --> 00:18:26,190 This is part of our environment. 353 00:18:26,190 --> 00:18:29,910 The environment is the place we actually store our objects 354 00:18:29,910 --> 00:18:35,500 while our program is running so we can then reuse them later on in our code. 355 00:18:35,500 --> 00:18:38,130 So I've kind of captured this user input and stored it 356 00:18:38,130 --> 00:18:39,480 in this object called name. 357 00:18:39,480 --> 00:18:41,220 But let's see how we could use it now. 358 00:18:41,220 --> 00:18:43,350 I'll come back to RStudio. 359 00:18:43,350 --> 00:18:45,390 And let me go back to line 2. 360 00:18:45,390 --> 00:18:51,670 And let me now print, in this case, let me print "Hello, name", just like this. 361 00:18:51,670 --> 00:18:54,030 I'll kind of close my environment over here. 362 00:18:54,030 --> 00:18:57,270 And now let me source this particular file. 363 00:18:57,270 --> 00:19:00,060 I'll type Mario as my name. 364 00:19:00,060 --> 00:19:02,220 And now I'll hit Enter. 365 00:19:02,220 --> 00:19:05,020 And I see, well, "Hello, name". 366 00:19:05,020 --> 00:19:06,600 So this isn't exactly what we wanted. 367 00:19:06,600 --> 00:19:10,210 I actually printed out literally "Hello, name". 368 00:19:10,210 --> 00:19:13,110 So I think we'll need to find some other solution here. 369 00:19:13,110 --> 00:19:15,510 We solved one problem, which was getting the user input 370 00:19:15,510 --> 00:19:17,010 and storing it somewhere. 371 00:19:17,010 --> 00:19:20,370 But, I mean, how do we reuse that later on? 
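The difference between text in quotes and the name of an object can be seen in a short sketch. Here the value "Mario" is assigned directly as a stand-in for what the user would type at the prompt; that substitution is an assumption made so the example runs without input.

```r
# Stand-in for: name <- readline("What's your name? ")
name <- "Mario"

# Inside double quotes, name is just four characters of text.
print("Hello, name")
# shows: [1] "Hello, name"

# Without quotes, name refers to the object, so its stored value is printed.
print(name)
# shows: [1] "Mario"
```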
372 00:19:20,370 --> 00:19:22,500 Well, if I'm being observant, I might notice 373 00:19:22,500 --> 00:19:26,850 that what I'm really trying to do is combine some pieces of text. 374 00:19:26,850 --> 00:19:30,540 Like, this text here is Hello comma space. 375 00:19:30,540 --> 00:19:34,440 And the text I'm trying to combine is whatever the user typed in, 376 00:19:34,440 --> 00:19:39,310 or that is whatever is being stored in this name object I have over here. 377 00:19:39,310 --> 00:19:43,440 So I'm trying to combine Hello comma space and this text 378 00:19:43,440 --> 00:19:45,120 from the user, Mario. 379 00:19:45,120 --> 00:19:47,520 Now, this is a common problem in programming, 380 00:19:47,520 --> 00:19:49,830 trying to combine pieces of text, so common 381 00:19:49,830 --> 00:19:52,200 that it actually has its own particular name. 382 00:19:52,200 --> 00:19:55,560 This is called string concatenation, where 383 00:19:55,560 --> 00:19:59,940 concatenation means combining together these various pieces of text, 384 00:19:59,940 --> 00:20:01,750 or these strings. 385 00:20:01,750 --> 00:20:03,150 So let's break it down. 386 00:20:03,150 --> 00:20:06,690 And here, of course, we have the very first part of our greeting. 387 00:20:06,690 --> 00:20:09,690 This is some piece of text, or from here on out called 388 00:20:09,690 --> 00:20:13,290 a string, a string because it's characters strung together 389 00:20:13,290 --> 00:20:14,940 into one piece of text. 390 00:20:14,940 --> 00:20:18,040 Strings begin and end with these double quotes here. 391 00:20:18,040 --> 00:20:22,080 So I have Hello comma space, and that is one particular string. 392 00:20:22,080 --> 00:20:25,710 But then the user comes in, and they type in their own string, 393 00:20:25,710 --> 00:20:27,690 let's say, like, Carter. 394 00:20:27,690 --> 00:20:31,740 And now my task is to combine these together into a single string 395 00:20:31,740 --> 00:20:33,910 and print that back out to the user. 
396 00:20:33,910 --> 00:20:37,680 So my goal here is to effectively do this, turn these two separate strings 397 00:20:37,680 --> 00:20:40,210 into one individual string. 398 00:20:40,210 --> 00:20:44,650 Now, R, handily enough, comes with a function to do just this. 399 00:20:44,650 --> 00:20:46,570 So let's explore that function here. 400 00:20:46,570 --> 00:20:54,310 This function is actually called paste, paste as in P-A-S-T-E. 401 00:20:54,310 --> 00:20:58,820 And paste allows me to concatenate various strings. 402 00:20:58,820 --> 00:21:00,730 So let's try this one here. 403 00:21:00,730 --> 00:21:05,530 If I want to use paste, I use it the same way I would any other function. 404 00:21:05,530 --> 00:21:09,760 I could use the function name, and then open and closing parentheses. 405 00:21:09,760 --> 00:21:14,080 And then paste will take as input any number 406 00:21:14,080 --> 00:21:18,590 of strings I want to concatenate, or paste together in this case. 407 00:21:18,590 --> 00:21:22,750 So let's say the first string, as we said, is Hello comma space. 408 00:21:22,750 --> 00:21:25,090 Now, the next string is something new. 409 00:21:25,090 --> 00:21:30,350 It's actually going to be whatever is stored in this name object over here. 410 00:21:30,350 --> 00:21:34,420 So if I want to provide another input to paste, 411 00:21:34,420 --> 00:21:37,660 I should actually separate it now with a comma. 412 00:21:37,660 --> 00:21:41,150 So after the first input to paste, the first argument, 413 00:21:41,150 --> 00:21:44,630 I'll then give it a second input, or a second argument. 414 00:21:44,630 --> 00:21:47,990 And this one will be literally name, in this case, 415 00:21:47,990 --> 00:21:53,090 this object we stored that has that value the user themselves typed in. 416 00:21:53,090 --> 00:21:57,380 And now paste, its return value will be the combined version 417 00:21:57,380 --> 00:21:59,250 of these two strings. 418 00:21:59,250 --> 00:22:00,530 So let's try it.
419 00:22:00,530 --> 00:22:05,600 Maybe I'll store the return value inside its own object called greeting. 420 00:22:05,600 --> 00:22:08,360 I'll use the left arrow, this assignment operator, 421 00:22:08,360 --> 00:22:11,870 to store the return value of paste in this object called greeting. 422 00:22:11,870 --> 00:22:15,980 And then, down in print here, I won't print "Hello, name" literally. 423 00:22:15,980 --> 00:22:20,930 I'll instead print whatever is stored in the object called greeting. 424 00:22:20,930 --> 00:22:25,020 So as I run this, let me open up my environment. 425 00:22:25,020 --> 00:22:28,500 So we can see how these values change over time. 426 00:22:28,500 --> 00:22:30,650 Let me go ahead and click Source here. 427 00:22:30,650 --> 00:22:33,200 My name, in this case, is Carter. 428 00:22:33,200 --> 00:22:34,460 I'll hit Enter. 429 00:22:34,460 --> 00:22:36,540 And I see "Hello, Carter". 430 00:22:36,540 --> 00:22:40,150 So if I go back to my environment here, I 431 00:22:40,150 --> 00:22:44,680 see I not only have this object called name storing "Carter". 432 00:22:44,680 --> 00:22:49,090 I also have this object called greeting storing the concatenated versions 433 00:22:49,090 --> 00:22:54,370 of, in this case, the string Hello comma space and then the string 434 00:22:54,370 --> 00:22:56,440 "Carter" itself. 435 00:22:56,440 --> 00:23:00,850 But if you're being particularly observant here, 436 00:23:00,850 --> 00:23:04,540 what do you notice as a bug in this program? 437 00:23:04,540 --> 00:23:06,430 Let me ask our audience here. 438 00:23:06,430 --> 00:23:10,330 What do you notice as a bug in this program? 439 00:23:10,330 --> 00:23:12,328 AUDIENCE: There are two spaces. 440 00:23:12,328 --> 00:23:13,120 CARTER ZENKE: Yeah. 441 00:23:13,120 --> 00:23:14,140 There are two spaces. 442 00:23:14,140 --> 00:23:16,780 So I only wanted one space. 443 00:23:16,780 --> 00:23:19,120 But it seems I have somehow gotten two. 
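The program as it stands at this point can be sketched roughly as follows. The value "Carter" is assigned directly as a stand-in for the readline call, so the snippet runs on its own.

```r
# Stand-in for the user's input (assumption: the user typed Carter)
name <- "Carter"

# paste() joins its arguments and returns the combined string
greeting <- paste("Hello, ", name)

# Prints "Hello,  Carter" with two spaces between the comma and the name
print(greeting)
```

One space comes from the end of the first string, and the other is added by paste itself.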
444 00:23:19,120 --> 00:23:23,540 Greeting here has Hello comma space space Carter. 445 00:23:23,540 --> 00:23:25,280 So why is that? 446 00:23:25,280 --> 00:23:28,150 Well, I could spend a lot of time kind of banging my head wondering, 447 00:23:28,150 --> 00:23:29,500 why is this not working? 448 00:23:29,500 --> 00:23:32,950 Or I could look at something called documentation. 449 00:23:32,950 --> 00:23:36,670 So programmers, when they write functions like paste, 450 00:23:36,670 --> 00:23:41,600 they also write documentation that tells me exactly how to use paste 451 00:23:41,600 --> 00:23:44,090 and what the expected output of paste might be. 452 00:23:44,090 --> 00:23:48,080 So let's look then at the documentation for paste. 453 00:23:48,080 --> 00:23:53,720 I can actually access documentation if I use this special character in R 454 00:23:53,720 --> 00:23:55,190 called the question mark. 455 00:23:55,190 --> 00:23:58,940 And if you want to remember this, I tend to think of just being confused. 456 00:23:58,940 --> 00:23:59,990 Like, what do I do? 457 00:23:59,990 --> 00:24:02,030 The question mark is that symbol. 458 00:24:02,030 --> 00:24:06,860 Then I follow it with the function name, in this case paste. 459 00:24:06,860 --> 00:24:10,760 So now, over on the right-hand side-- let me make this bigger for us over 460 00:24:10,760 --> 00:24:11,330 here-- 461 00:24:11,330 --> 00:24:15,080 I'll actually see the documentation for paste. 462 00:24:15,080 --> 00:24:18,860 Whoever created this function called paste helpfully wrote 463 00:24:18,860 --> 00:24:23,070 this documentation to guide me on how to use paste itself. 464 00:24:23,070 --> 00:24:26,417 So I'll see up top, the goal of paste is to concatenate strings, 465 00:24:26,417 --> 00:24:27,500 like we just talked about. 466 00:24:27,500 --> 00:24:29,420 That much is pretty obvious. 467 00:24:29,420 --> 00:24:32,150 But down below, I think, is the helpful part. 
468 00:24:32,150 --> 00:24:38,060 This down below will tell me what kinds of inputs paste could potentially take. 469 00:24:38,060 --> 00:24:41,030 And I'll see the same thing we saw before, paste 470 00:24:41,030 --> 00:24:43,345 followed by some parentheses with various 471 00:24:43,345 --> 00:24:45,930 what we'll call parameters here. 472 00:24:45,930 --> 00:24:48,290 So these are still inputs to paste. 473 00:24:48,290 --> 00:24:50,120 But they're potential inputs. 474 00:24:50,120 --> 00:24:52,850 And because they're potential ones, we'll call them parameters. 475 00:24:52,850 --> 00:24:55,790 Arguments are the actual values we pass to paste. 476 00:24:55,790 --> 00:24:58,220 Parameters are the potential ones here. 477 00:24:58,220 --> 00:25:01,670 Now, I see, if I go down here, some dot dot dots. 478 00:25:01,670 --> 00:25:05,150 These dot dot dots mean that paste could take really 479 00:25:05,150 --> 00:25:07,730 any particular number of arguments. 480 00:25:07,730 --> 00:25:12,240 That could be any number of strings I want to concatenate in this case. 481 00:25:12,240 --> 00:25:17,030 I then have over here, though, this named parameter called sep. 482 00:25:17,030 --> 00:25:21,590 And it says sep = quote unquote with a space in the middle. 483 00:25:21,590 --> 00:25:24,830 Now, this is what's called a named parameter. 484 00:25:24,830 --> 00:25:29,780 It has a given name because it has a special use case in this case of paste. 485 00:25:29,780 --> 00:25:34,580 Now, the equal sign means the default value for this parameter 486 00:25:34,580 --> 00:25:38,670 is going to be this value here, the quote space unquote. 487 00:25:38,670 --> 00:25:41,520 So it seems like this parameter called sep 488 00:25:41,520 --> 00:25:46,080 might be what's making this extra space actually happen in my output. 
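As a small aside, the default value of sep can also be checked from code. This sketch uses formals, a base R function that is not part of the lecture, to look up the default recorded for the sep parameter of paste.

```r
# In RStudio, this opens the documentation for paste in the Help pane:
# ?paste

# formals() lists a function's parameters together with their default values
default_sep <- formals(paste)$sep

# The default separator is a single space
print(default_sep)
```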
489 00:25:46,080 --> 00:25:48,750 And if I go back to the documentation over here, 490 00:25:48,750 --> 00:25:52,830 let me scroll down a little bit so we can see, I'll go down to the arguments. 491 00:25:52,830 --> 00:25:58,650 And you can see that sep is, in fact, a character string to separate the terms. 492 00:25:58,650 --> 00:26:04,440 So my job now is to think about, what would I change the value of sep 493 00:26:04,440 --> 00:26:07,320 to be to remove that extra space? 494 00:26:07,320 --> 00:26:10,500 Well, I can go back to my program. 495 00:26:10,500 --> 00:26:14,280 And ideally, I don't want anything, any character, 496 00:26:14,280 --> 00:26:16,200 to separate the strings by default. 497 00:26:16,200 --> 00:26:19,050 I just want to have them put together with no spaces in between. 498 00:26:19,050 --> 00:26:22,440 So to use this named parameter called sep, 499 00:26:22,440 --> 00:26:25,140 I can then have another input to paste. 500 00:26:25,140 --> 00:26:27,850 I could follow it with a comma, just like this, 501 00:26:27,850 --> 00:26:31,800 and then use that named parameter sep, in this case, 502 00:26:31,800 --> 00:26:37,200 and set it equal to some new kind of value, in this case quote unquote, 503 00:26:37,200 --> 00:26:39,770 or really, no spaces at all. 504 00:26:39,770 --> 00:26:41,440 So let's try this now. 505 00:26:41,440 --> 00:26:43,480 I'll clear my console. 506 00:26:43,480 --> 00:26:44,710 I'll run source. 507 00:26:44,710 --> 00:26:47,360 And then I'll type Carter again. 508 00:26:47,360 --> 00:26:53,560 And now I'll see Hello comma space, only one space, Carter. 509 00:26:53,560 --> 00:26:57,070 Now, if you're like me, you might think, I'm 510 00:26:57,070 --> 00:26:59,770 going to often want to concatenate strings that 511 00:26:59,770 --> 00:27:01,600 don't have any spaces in between them. 512 00:27:01,600 --> 00:27:06,940 And it's going to be a lot of typing to always type comma sep = quote unquote.
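The fix just described can be sketched like this, again with "Carter" assigned directly as a stand-in for the user's input.

```r
# Stand-in for the user's input (assumption: the user typed Carter)
name <- "Carter"

# Setting sep to an empty string tells paste to put nothing between the pieces
greeting <- paste("Hello, ", name, sep = "")

# Prints "Hello, Carter" with a single space, the one inside the first string
print(greeting)
```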
513 00:27:06,940 --> 00:27:08,980 If you're writing many lines of code, you 514 00:27:08,980 --> 00:27:11,530 don't want to do this over and over and over again. 515 00:27:11,530 --> 00:27:14,620 And in fact, some R users got tired of just this, 516 00:27:14,620 --> 00:27:19,540 and they wrote their own function where the default is sep = quote unquote, 517 00:27:19,540 --> 00:27:20,200 nothing. 518 00:27:20,200 --> 00:27:24,280 This function is called simply paste0, paste0, 519 00:27:24,280 --> 00:27:28,510 where 0 means there's nothing in between these concatenated strings. 520 00:27:28,510 --> 00:27:31,480 I now don't need to supply the input sep, 521 00:27:31,480 --> 00:27:35,380 because the default will always be no particular space at all. 522 00:27:35,380 --> 00:27:40,000 Now I'll rerun this program, source, and say Carter. 523 00:27:40,000 --> 00:27:42,580 And I'll get that same output. 524 00:27:42,580 --> 00:27:45,640 Now one other way to do this, because there's always more than one way 525 00:27:45,640 --> 00:27:49,420 to do something, is I could maybe just omit the space altogether. 526 00:27:49,420 --> 00:27:52,870 Like, I could say Hello comma and make that my string. 527 00:27:52,870 --> 00:27:56,620 And I could then assume that paste will go ahead and actually 528 00:27:56,620 --> 00:27:58,300 add the space in for me. 529 00:27:58,300 --> 00:28:00,610 I could run source, just like this. 530 00:28:00,610 --> 00:28:01,630 Type in Carter. 531 00:28:01,630 --> 00:28:07,370 And now I'll see "Hello, Carter" again, completing our program here. 532 00:28:07,370 --> 00:28:09,940 So let me pause and ask what questions we 533 00:28:09,940 --> 00:28:16,060 have on paste, string concatenation, or our program so far. 534 00:28:16,060 --> 00:28:20,350 AUDIENCE: So what's the difference between the paste function and the cat 535 00:28:20,350 --> 00:28:20,980 function? 536 00:28:20,980 --> 00:28:25,018 Because I think both of them are used for string concatenation.
537 00:28:25,018 --> 00:28:25,810 CARTER ZENKE: Yeah. 538 00:28:25,810 --> 00:28:26,310 Good. 539 00:28:26,310 --> 00:28:29,110 So if you are familiar with R, you might have also heard 540 00:28:29,110 --> 00:28:30,880 of this function called cat. 541 00:28:30,880 --> 00:28:34,360 And cat itself stands for concatenation. 542 00:28:34,360 --> 00:28:37,690 Cat and paste have two similar use cases. 543 00:28:37,690 --> 00:28:40,100 They both involve combining strings together. 544 00:28:40,100 --> 00:28:41,950 But they have slightly different outputs. 545 00:28:41,950 --> 00:28:45,970 So cat has the side effect of printing whatever 546 00:28:45,970 --> 00:28:48,370 you've concatenated to the console. 547 00:28:48,370 --> 00:28:51,760 Paste, on the other hand, does not. Paste only returns 548 00:28:51,760 --> 00:28:54,760 to you, kind of silently, the concatenated version therein. 549 00:28:54,760 --> 00:28:57,490 And you can then use that later on in your code. 550 00:28:57,490 --> 00:29:00,670 Cat, I believe, does not actually return to you the result. 551 00:29:00,670 --> 00:29:04,360 It just kind of prints it to the screen as a side effect, so two very different 552 00:29:04,360 --> 00:29:07,990 use cases but the same kind of goal here. 553 00:29:07,990 --> 00:29:08,530 All right. 554 00:29:08,530 --> 00:29:12,380 Let's continue, and let's keep improving our program here. 555 00:29:12,380 --> 00:29:18,250 So one thing you might notice is that I have this object called greeting 556 00:29:18,250 --> 00:29:22,330 and I'm just, on the next line, using that very same object. 557 00:29:22,330 --> 00:29:24,580 And this seems just a little bit redundant 558 00:29:24,580 --> 00:29:27,760 because I'm just storing this value from paste 559 00:29:27,760 --> 00:29:32,410 in greeting and immediately giving it back to this function called print. 
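The difference between paste and cat described in that answer can be sketched as follows. The object names `pasted` and `printed` are just illustrative, and "Carter" is a stand-in for the user's input.

```r
# Stand-in for the user's input (assumption: the user typed Carter)
name <- "Carter"

# paste0() prints nothing on its own; it returns the combined string
pasted <- paste0("Hello, ", name)

# cat() prints to the console as a side effect and returns NULL
printed <- cat("Hello, ", name, "\n", sep = "")

# The string from paste0 can be reused later; cat gave back nothing to reuse
print(pasted)
print(is.null(printed))
```

So paste and paste0 fit when the combined string is needed later in the program, while cat fits when the only goal is to show text on the screen.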
560 00:29:32,410 --> 00:29:35,530 Well, to have this same result and actually reduce 561 00:29:35,530 --> 00:29:38,980 some number of lines of code, I could do the following. 562 00:29:38,980 --> 00:29:43,560 I could actually remove the idea of storing the result of paste 563 00:29:43,560 --> 00:29:45,180 in any given object. 564 00:29:45,180 --> 00:29:49,290 And I could simply run paste and immediately 565 00:29:49,290 --> 00:29:54,810 pass the value, or the return value from paste, as the input to print. 566 00:29:54,810 --> 00:29:57,750 Now, this is arguably the most complicated line of code 567 00:29:57,750 --> 00:29:58,890 we've seen so far. 568 00:29:58,890 --> 00:30:02,250 So let's break it down just a little bit. 569 00:30:02,250 --> 00:30:05,160 What we're doing right here is known as function composition. 570 00:30:05,160 --> 00:30:09,150 I'm making one function run and then immediately passing the output 571 00:30:09,150 --> 00:30:12,090 of that function as input to the next. 572 00:30:12,090 --> 00:30:15,150 So this is our line of R code. 573 00:30:15,150 --> 00:30:19,290 And we said before that R runs lines of code kind of line by line, 574 00:30:19,290 --> 00:30:21,000 top to bottom, left to right. 575 00:30:21,000 --> 00:30:22,560 And that's mostly true. 576 00:30:22,560 --> 00:30:27,240 But in this case, we see there's more than one function to run or action 577 00:30:27,240 --> 00:30:29,980 to take on this particular line of code. 578 00:30:29,980 --> 00:30:31,570 So what is R to do? 579 00:30:31,570 --> 00:30:34,500 Well, what R will always do is look first 580 00:30:34,500 --> 00:30:37,250 for the function that is innermost in the parentheses. 581 00:30:37,250 --> 00:30:41,090 So in this case, that is the paste0 function that is concatenating 582 00:30:41,090 --> 00:30:45,650 or combining "Hello, " and then name over here. 
583 00:30:45,650 --> 00:30:50,060 Now, what this will do is make the return value of Hello comma, 584 00:30:50,060 --> 00:30:54,650 let's say, space Carter, and then pass that immediately as input into print, 585 00:30:54,650 --> 00:30:55,800 just like this. 586 00:30:55,800 --> 00:30:59,570 And once that's done, print can then do its job and, of course, 587 00:30:59,570 --> 00:31:02,700 just print out something like "Hello, Carter". 588 00:31:02,700 --> 00:31:06,710 So always think about the innermost function running first and passing 589 00:31:06,710 --> 00:31:10,790 its return value as the input to the next innermost function, 590 00:31:10,790 --> 00:31:12,510 and so on and so forth. 591 00:31:12,510 --> 00:31:14,820 So let's go ahead and try this out. 592 00:31:14,820 --> 00:31:16,820 I'll come back to RStudio here. 593 00:31:16,820 --> 00:31:20,000 And here I have paste as opposed to paste0, but kind of the same thing, 594 00:31:20,000 --> 00:31:21,020 as we saw before. 595 00:31:21,020 --> 00:31:23,390 Let me go ahead and click Source here. 596 00:31:23,390 --> 00:31:24,350 I'll type Carter. 597 00:31:24,350 --> 00:31:28,100 And we'll see that I get the very same result without storing, 598 00:31:28,100 --> 00:31:30,830 in this case, an additional object. 599 00:31:30,830 --> 00:31:33,050 Now, an extension of this might be the following. 600 00:31:33,050 --> 00:31:34,790 I could take readline. 601 00:31:34,790 --> 00:31:37,220 I could take readline, and notice how it's just 602 00:31:37,220 --> 00:31:41,600 simply storing the value in name, which I immediately pass as input to paste. 603 00:31:41,600 --> 00:31:42,810 I could do this. 604 00:31:42,810 --> 00:31:46,460 I could take readline and put this right there. 605 00:31:46,460 --> 00:31:51,590 And now I have three functions nested inside of each other. 606 00:31:51,590 --> 00:31:56,960 But let me actually ask you, why might this not be a good idea? 
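The two-function composition walked through above can be sketched like this, with "Carter" assigned directly as a stand-in for the user's input.

```r
# Stand-in for the user's input (assumption: the user typed Carter)
name <- "Carter"

# Two steps: store the result of paste0, then print it
greeting <- paste0("Hello, ", name)
print(greeting)

# One step: paste0 runs first, and its return value becomes the input to print
print(paste0("Hello, ", name))
```

Both versions print the same greeting. The second simply skips the intermediate object.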
607 00:31:56,960 --> 00:32:01,010 Let's think about other people who might read this code, 608 00:32:01,010 --> 00:32:03,410 or think about working together on projects. 609 00:32:03,410 --> 00:32:08,780 Like, why might I not want to do this or go this far with the design of my code 610 00:32:08,780 --> 00:32:10,010 here? 611 00:32:10,010 --> 00:32:14,300 AUDIENCE: It's, I think, because it doesn't explain the code perfectly 612 00:32:14,300 --> 00:32:15,278 to the user. 613 00:32:15,278 --> 00:32:16,070 CARTER ZENKE: Yeah. 614 00:32:16,070 --> 00:32:17,300 It's kind of hard to read. 615 00:32:17,300 --> 00:32:21,470 Like, if I saw this line of code here, I would have to think to myself, OK, 616 00:32:21,470 --> 00:32:22,940 which function is happening first? 617 00:32:22,940 --> 00:32:25,370 Well, it looks like it might be readline. 618 00:32:25,370 --> 00:32:26,495 And then what happens next? 619 00:32:26,495 --> 00:32:26,995 OK. 620 00:32:26,995 --> 00:32:27,920 Paste happens next. 621 00:32:27,920 --> 00:32:31,610 So it's a lot for me to think of as I'm reading this program. 622 00:32:31,610 --> 00:32:35,520 And even though it is shorter, I would say it's not necessarily better. 623 00:32:35,520 --> 00:32:39,080 So these are questions about the design of programs. 624 00:32:39,080 --> 00:32:41,660 Which way to write the code is better? 625 00:32:41,660 --> 00:32:44,420 We have the same result. So they're both correct. 626 00:32:44,420 --> 00:32:47,600 But there are still different ways to design it and trade-offs to consider 627 00:32:47,600 --> 00:32:50,250 in terms of readability as well. 628 00:32:50,250 --> 00:32:53,450 So let's come back, and let's try to fix up this program a little bit. 629 00:32:53,450 --> 00:32:56,570 I would argue that it's probably cleaner if we instead have 630 00:32:56,570 --> 00:32:59,060 readline on a separate line of code. 631 00:32:59,060 --> 00:33:01,130 I'll put this first on line 1. 
632 00:33:01,130 --> 00:33:04,490 And I'll go back to storing this object called name. 633 00:33:04,490 --> 00:33:08,370 And I'll pass it in as input to paste here, just like this. 634 00:33:08,370 --> 00:33:12,290 Now, we just talked about the idea of making our code more readable. 635 00:33:12,290 --> 00:33:17,210 And it turns out that R comes with a feature that can let me do just that. 636 00:33:17,210 --> 00:33:20,150 I can actually leave myself some notes to self 637 00:33:20,150 --> 00:33:23,000 called comments that will help me understand 638 00:33:23,000 --> 00:33:25,430 my code using the English language. 639 00:33:25,430 --> 00:33:29,390 So if I want to write a comment, or a note to myself, in my code, 640 00:33:29,390 --> 00:33:33,020 I can do so by typing this hashtag here, followed by a space. 641 00:33:33,020 --> 00:33:35,090 And I can then type the comment I want to type. 642 00:33:35,090 --> 00:33:38,780 I could say maybe this line asks user, this line 643 00:33:38,780 --> 00:33:43,260 asks user for, asks user for name, just like this. 644 00:33:43,260 --> 00:33:46,670 And the next line, well, what does this line of code do? 645 00:33:46,670 --> 00:33:51,440 This line of code says hello to user, just like that. 646 00:33:51,440 --> 00:33:54,770 Now, comments, by convention, go on the line 647 00:33:54,770 --> 00:33:57,450 above the line of code they are talking about. 648 00:33:57,450 --> 00:34:02,810 So in this case, I know this comment on line 1 refers to the code on line 2. 649 00:34:02,810 --> 00:34:07,700 And this comment on line 4 refers to the line of code on line 5. 650 00:34:07,700 --> 00:34:09,727 But comments are very helpful when you actually 651 00:34:09,727 --> 00:34:11,060 are working on a larger project. 652 00:34:11,060 --> 00:34:11,960 You come back later. 653 00:34:11,960 --> 00:34:13,040 Don't know what you did. 
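A rough sketch of the commented program described above. The prompt text passed to readline is an assumption for illustration, since the exact prompt is not spelled out in this part of the lecture.

```r
# Ask user for name
name <- readline("What's your name? ")

# Say hello to user
print(paste("Hello,", name))
```

Each comment sits on the line directly above the code it describes, with a blank line separating the two steps.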
654 00:34:13,040 --> 00:34:16,820 Comments can then help you understand exactly what to do, 655 00:34:16,820 --> 00:34:20,340 and what you had done prior as well. 656 00:34:20,340 --> 00:34:22,580 So this was our "Hello, world" program. 657 00:34:22,580 --> 00:34:23,719 We said hello to the world. 658 00:34:23,719 --> 00:34:25,219 We said hello to some users. 659 00:34:25,219 --> 00:34:27,530 Let's get working with some data now. 660 00:34:27,530 --> 00:34:32,090 And in one case, we might want to work with data in terms of, let's say, 661 00:34:32,090 --> 00:34:34,070 counting votes for an election. 662 00:34:34,070 --> 00:34:36,596 So let's go ahead and try to simulate an election 663 00:34:36,596 --> 00:34:39,679 between some fictional characters from the Nintendo universe, in this case 664 00:34:39,679 --> 00:34:41,690 Mario, Peach, and Bowser. 665 00:34:41,690 --> 00:34:45,920 So to create this new program, I'll go to my console again, 666 00:34:45,920 --> 00:34:48,320 and I'll type file.create. 667 00:34:48,320 --> 00:34:50,840 And in this case, I want to count some votes. 668 00:34:50,840 --> 00:34:55,050 So I'll call this program count.R, just like this. 669 00:34:55,050 --> 00:34:56,000 I'll hit Enter. 670 00:34:56,000 --> 00:34:58,400 And I'll see that this file was created. 671 00:34:58,400 --> 00:35:03,590 So now, if I open up, open up my window over here, and go to files, 672 00:35:03,590 --> 00:35:07,010 I should now see that count.R is available to me 673 00:35:07,010 --> 00:35:09,050 as a file to write this program in. 674 00:35:09,050 --> 00:35:14,420 I'll open up count.R. And now I have a blank slate of a program to write. 675 00:35:14,420 --> 00:35:17,480 So we have three candidates to keep track of votes 676 00:35:17,480 --> 00:35:20,180 for, Mario, Peach, and Bowser. 
677 00:35:20,180 --> 00:35:24,170 So let's let the user actually type in those votes and return to them 678 00:35:24,170 --> 00:35:27,920 or print out the total number of votes that happened in this election. 679 00:35:27,920 --> 00:35:31,520 So maybe I will ask the user, using readline, 680 00:35:31,520 --> 00:35:35,010 to enter votes for Mario, just like this. 681 00:35:35,010 --> 00:35:39,680 And I'll also ask the user to enter votes for Peach, just 682 00:35:39,680 --> 00:35:41,240 like this, Princess Peach. 683 00:35:41,240 --> 00:35:46,340 And I'll also use readline to ask the user to enter votes for Bowser, just 684 00:35:46,340 --> 00:35:47,400 like this. 685 00:35:47,400 --> 00:35:51,710 Now, it's likely I'll want to use whatever the user types in later 686 00:35:51,710 --> 00:35:53,130 on in my code. 687 00:35:53,130 --> 00:35:55,940 So why don't I store the return value of readline 688 00:35:55,940 --> 00:35:58,430 in an object I could re-use later on. 689 00:35:58,430 --> 00:36:02,960 Maybe I'll call this one Mario and this one Peach. 690 00:36:02,960 --> 00:36:06,900 And this one, let's go for Bowser, just like this. 691 00:36:06,900 --> 00:36:08,060 So I'll save it. 692 00:36:08,060 --> 00:36:12,920 And again the goal was to kind of add up the total number of votes. 693 00:36:12,920 --> 00:36:15,830 Well, maybe I'll make a new object called total 694 00:36:15,830 --> 00:36:17,780 to store the total number of votes. 695 00:36:17,780 --> 00:36:21,750 And I'll have something that will assign that total number here. 696 00:36:21,750 --> 00:36:24,950 Well, it turns out that to actually add data together, 697 00:36:24,950 --> 00:36:28,820 I can use, in R, this plus sign, this plus operator. 698 00:36:28,820 --> 00:36:32,450 So I'll say mario + peach + bowser. 699 00:36:32,450 --> 00:36:35,780 And that should return to me the total number of votes the user has actually 700 00:36:35,780 --> 00:36:38,070 entered in the console. 
701 00:36:38,070 --> 00:36:40,710 And if I want to then print that back out to the user, 702 00:36:40,710 --> 00:36:42,660 well, I could use print and paste again. 703 00:36:42,660 --> 00:36:47,420 So I'll use print, paste, and then Total votes, no space, 704 00:36:47,420 --> 00:36:49,170 because paste will actually add it for me, 705 00:36:49,170 --> 00:36:52,210 and then total itself down below here. 706 00:36:52,210 --> 00:36:56,130 So a few things we've seen before and a few new things. 707 00:36:56,130 --> 00:36:58,560 What's new is this arithmetic. 708 00:36:58,560 --> 00:37:03,570 We've seen now, we just used this plus operator to add together some numbers. 709 00:37:03,570 --> 00:37:06,090 And R has more than just the plus operator. 710 00:37:06,090 --> 00:37:07,800 It has several others as well. 711 00:37:07,800 --> 00:37:10,260 It has addition, as we just saw with a plus sign, 712 00:37:10,260 --> 00:37:15,690 subtraction with this minus sign, or a dash, multiplication with this star 713 00:37:15,690 --> 00:37:20,550 or asterisk operator, and division, just like this, with the forward slash here. 714 00:37:20,550 --> 00:37:22,523 There are other operators, too, that we'll 715 00:37:22,523 --> 00:37:23,940 talk about later on in the course. 716 00:37:23,940 --> 00:37:27,390 But for now, these four will help you do some basic arithmetic that 717 00:37:27,390 --> 00:37:31,090 can help us solve some really interesting problems in this case. 718 00:37:31,090 --> 00:37:33,610 So let's come back, and let's run our program. 719 00:37:33,610 --> 00:37:35,910 I'll come back to RStudio. 720 00:37:35,910 --> 00:37:37,530 And I think that this should work. 721 00:37:37,530 --> 00:37:39,060 I'll go ahead and run source. 722 00:37:39,060 --> 00:37:41,790 And I'll enter votes, in this case, for Mario. 723 00:37:41,790 --> 00:37:44,310 Maybe I'll say Mario has 100 votes. 724 00:37:44,310 --> 00:37:46,470 And Peach has 150. 
725 00:37:46,470 --> 00:37:48,420 And Bowser has 120. 726 00:37:48,420 --> 00:37:53,610 And now I should see the total number of votes that were cast in this election. 727 00:37:53,610 --> 00:37:54,900 I'll hit Enter. 728 00:37:54,900 --> 00:37:56,940 And I'll see one other error. 729 00:37:56,940 --> 00:38:01,770 It says error in mario + peach, non-numeric argument 730 00:38:01,770 --> 00:38:04,336 to binary operator. 731 00:38:04,336 --> 00:38:05,550 Hmm. 732 00:38:05,550 --> 00:38:08,100 The other error, I'll admit, was easier to understand. 733 00:38:08,100 --> 00:38:09,090 This one's less easy. 734 00:38:09,090 --> 00:38:11,310 So at least it tells me where the error happened. 735 00:38:11,310 --> 00:38:15,930 It says it happened in mario + peach, so it seems like maybe on line 736 00:38:15,930 --> 00:38:21,390 5 here, when I tried to add the user's input for Mario to the user's input 737 00:38:21,390 --> 00:38:22,320 for Peach. 738 00:38:22,320 --> 00:38:27,540 And it says, the reason, a non-numeric argument to the binary operator. 739 00:38:27,540 --> 00:38:30,630 So the binary operator, I'll tell you, is this plus sign here. 740 00:38:30,630 --> 00:38:35,220 But the non-numeric argument, it seems like it's telling us that Mario 741 00:38:35,220 --> 00:38:38,020 and Peach, those aren't numbers at all. 742 00:38:38,020 --> 00:38:43,110 So let's take a peek at our environment where we stored those actual objects. 743 00:38:43,110 --> 00:38:45,180 Let me take a peek over here. 744 00:38:45,180 --> 00:38:47,010 And what will we see? 745 00:38:47,010 --> 00:38:51,330 If I go to Environment, maybe remove this down below here, 746 00:38:51,330 --> 00:38:55,740 I see some old, some old things here, like greeting and name, 747 00:38:55,740 --> 00:38:57,150 that I didn't get rid of before. 748 00:38:57,150 --> 00:39:01,350 But I also see Bowser, Mario, Peach. 749 00:39:01,350 --> 00:39:03,750 And what do you notice? 
750 00:39:03,750 --> 00:39:07,590 Well, it seems like before, we had, let's say, greeting. 751 00:39:07,590 --> 00:39:09,420 That was a character string. 752 00:39:09,420 --> 00:39:12,750 And we knew it was a character string because it had quotes around it. 753 00:39:12,750 --> 00:39:16,920 But we see the same thing now for bowser and for mario 754 00:39:16,920 --> 00:39:21,300 and for peach, which implies to me that these are still character strings. 755 00:39:21,300 --> 00:39:23,160 They're not so much numbers. 756 00:39:23,160 --> 00:39:26,790 Now, I think R is now telling me that it needs numbers to be 757 00:39:26,790 --> 00:39:28,740 able to add these things together. 758 00:39:28,740 --> 00:39:33,720 It can't add a character, 120, with the character 100. 759 00:39:33,720 --> 00:39:36,400 They need to be actual numbers. 760 00:39:36,400 --> 00:39:38,730 So let's see what we can do about that. 761 00:39:38,730 --> 00:39:42,000 Let's come back to RStudio here, and let's 762 00:39:42,000 --> 00:39:47,700 actually introduce this new idea of a data type or a storage mode. 763 00:39:47,700 --> 00:39:51,360 In R, we have various ways of storing data. 764 00:39:51,360 --> 00:39:55,890 We've seen one so far called a character string, but there are lots of others, 765 00:39:55,890 --> 00:39:56,520 too. 766 00:39:56,520 --> 00:39:58,020 Among them are these. 767 00:39:58,020 --> 00:40:01,620 Characters, we just saw, and then double and integer. 768 00:40:01,620 --> 00:40:03,300 These are both numbers. 769 00:40:03,300 --> 00:40:07,650 Double refers to a decimal number, like a 1.5, for instance. 770 00:40:07,650 --> 00:40:11,730 Integer refers to a whole number, like just 1, plain and simple. 771 00:40:11,730 --> 00:40:15,310 And there are more, too, but these are the ones that matter here. 
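To make the storage modes concrete, here is a small sketch using typeof, a base R function that is not used in the lecture itself, to ask R how a value is stored. The vote counts are assigned as strings to mimic what readline returned, and tryCatch is used only so the snippet keeps running after the expected error.

```r
# Stand-ins for what readline() returned: character strings, not numbers
mario <- "100"
peach <- "150"

# typeof() reports the storage mode of a value
print(typeof(mario))  # character
print(typeof(1.5))    # double
print(typeof(1L))     # integer (the L suffix marks a whole number)

# Adding two character strings fails; tryCatch captures the error instead of stopping
outcome <- tryCatch(mario + peach, error = function(e) "addition failed")
print(outcome)
```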
772 00:40:15,310 --> 00:40:20,280 So it seems to me like readline, when it returns us the input from the user, 773 00:40:20,280 --> 00:40:26,040 it returned a data type of character, or a storage mode of character. 774 00:40:26,040 --> 00:40:28,650 But what I really need to add these numbers together 775 00:40:28,650 --> 00:40:33,223 is a double or an integer, these numeric storage modes down here. 776 00:40:33,223 --> 00:40:35,640 So let's see if there aren't functions that could help us. 777 00:40:35,640 --> 00:40:36,432 There actually are. 778 00:40:36,432 --> 00:40:41,320 So among them, this idea of as.character, as.double, 779 00:40:41,320 --> 00:40:42,610 and as.integer. 780 00:40:42,610 --> 00:40:46,210 These are functions that can actually take some particular object 781 00:40:46,210 --> 00:40:49,670 and convert them to the storage mode we want. 782 00:40:49,670 --> 00:40:53,830 So I could give as input to as.integer some object, 783 00:40:53,830 --> 00:40:57,910 and it will return to me then that same object but now as an integer. 784 00:40:57,910 --> 00:41:02,860 And this is known as coercion, changing the storage mode of an object, 785 00:41:02,860 --> 00:41:07,570 using a function like this to convert it to some particular new storage mode. 786 00:41:07,570 --> 00:41:09,350 So let's try these out. 787 00:41:09,350 --> 00:41:11,500 I'll come to RStudio again. 788 00:41:11,500 --> 00:41:16,990 And I will then try to convert this data to an integer 789 00:41:16,990 --> 00:41:19,900 before I actually add it together. 790 00:41:19,900 --> 00:41:24,970 On line 5 and below, let me go ahead and use as.integer. 791 00:41:24,970 --> 00:41:32,440 I'll type as.integer(mario) and as.integer(peach) 792 00:41:32,440 --> 00:41:35,950 and as.integer(bowser). 793 00:41:35,950 --> 00:41:38,720 Well, in this case, hopefully that should work. 794 00:41:38,720 --> 00:41:41,000 Let me go and hit Source again. 
795 00:41:41,000 --> 00:41:43,110 And let me clear my terminal first. 796 00:41:43,110 --> 00:41:49,040 I'll enter 100 votes for Mario, 150 for Peach, and 120 for Bowser. 797 00:41:49,040 --> 00:41:51,560 And I still see that they're not numeric. 798 00:41:51,560 --> 00:41:55,520 So one common mistake is that simply running the function 799 00:41:55,520 --> 00:41:59,600 here is not enough to change this particular object. 800 00:41:59,600 --> 00:42:06,660 I need to then reassign the return value of a function to the object itself. 801 00:42:06,660 --> 00:42:09,620 So for instance, if I want to update the value of mario, 802 00:42:09,620 --> 00:42:14,540 I need to reassign it as the return value of as.integer. 803 00:42:14,540 --> 00:42:18,920 Or I need to update the value of peach by reassigning it as the return 804 00:42:18,920 --> 00:42:20,450 value of this function here. 805 00:42:20,450 --> 00:42:22,500 And same with bowser as well. 806 00:42:22,500 --> 00:42:25,760 And now, I think, if I run this, fingers crossed-- 807 00:42:25,760 --> 00:42:31,010 let me come to my console again, run source, and I'll choose 100 for Mario, 808 00:42:31,010 --> 00:42:34,250 150 for Peach, 120 for Bowser. 809 00:42:34,250 --> 00:42:39,170 And I'll see the total votes was 370. 810 00:42:39,170 --> 00:42:41,900 Now, this is a bit of a longer program than we've seen before. 811 00:42:41,900 --> 00:42:43,453 This is like 11 lines total. 812 00:42:43,453 --> 00:42:46,370 There's probably a way to actually clean this up a little bit, though. 813 00:42:46,370 --> 00:42:49,790 One way would be to immediately try to convert 814 00:42:49,790 --> 00:42:53,100 the input from the user to an integer. 815 00:42:53,100 --> 00:42:55,400 So I could use function composition. 816 00:42:55,400 --> 00:43:00,020 And I could instead immediately pass the return value of readline 817 00:43:00,020 --> 00:43:06,680 as the input of as.integer, and same for peach, and same for bowser.
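The reassignment step can be sketched as follows. To keep the example runnable without a person at the keyboard, the three strings are hypothetical stand-ins for what readline would return, and the prompt text in the final comment is an assumption rather than the lecture's exact wording.

```r
# Stand-ins for user input; readline() always returns character strings.
mario <- "100"
peach <- "150"
bowser <- "120"

# Calling as.integer() alone returns a converted copy and changes nothing.
as.integer(mario)
typeof(mario)  # still "character"

# Reassigning the return value is what actually updates each object.
mario <- as.integer(mario)
peach <- as.integer(peach)
bowser <- as.integer(bowser)

total <- mario + peach + bowser
total

# With function composition, reading and converting happen in one step:
# mario <- as.integer(readline("Enter votes for Mario: "))
```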
818 00:43:06,680 --> 00:43:09,982 Let me actually clean this up, have a parenthesis there 819 00:43:09,982 --> 00:43:10,940 and a parenthesis here. 820 00:43:10,940 --> 00:43:13,845 And now I can get rid of lines 5 through 7. 821 00:43:13,845 --> 00:43:15,470 And now it's just a little bit shorter. 822 00:43:15,470 --> 00:43:19,850 And I'd argue that this is actually a good use of this particular function 823 00:43:19,850 --> 00:43:22,130 composition, because now I'm immediately seeing that, 824 00:43:22,130 --> 00:43:25,190 OK, I want an integer from the user. 825 00:43:25,190 --> 00:43:27,000 Let me go ahead and try this again. 826 00:43:27,000 --> 00:43:27,830 I'll click Source. 827 00:43:27,830 --> 00:43:35,060 And now I'll see 100, 150, 120, and I still see 370. 828 00:43:35,060 --> 00:43:37,670 Now, this is pretty good, but let's think 829 00:43:37,670 --> 00:43:40,520 of a corner case or some other scenario where 830 00:43:40,520 --> 00:43:42,770 there are more than three candidates. 831 00:43:42,770 --> 00:43:46,010 Well, in this case, I don't want to be stuck always typing 832 00:43:46,010 --> 00:43:49,220 plus, plus, plus some new candidate. 833 00:43:49,220 --> 00:43:53,720 What I want to use instead is likely a function called sum. 834 00:43:53,720 --> 00:43:56,180 Now, R, because it works so often with data, 835 00:43:56,180 --> 00:43:58,760 has this function called sum that can take 836 00:43:58,760 --> 00:44:03,500 as input any number of numeric arguments and sum them up for me. 837 00:44:03,500 --> 00:44:05,840 So let's use sum instead. 838 00:44:05,840 --> 00:44:11,750 In total here, I actually want to return the result of calling sum 839 00:44:11,750 --> 00:44:14,060 with three arguments, three inputs. 840 00:44:14,060 --> 00:44:19,520 The first is mario, second is peach, and the third is bowser.
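Here is a short sketch of sum with several arguments. The values are assigned directly rather than read from the user, and the fourth candidate, luigi, is purely hypothetical, added only to show that sum accepts more inputs without any extra plus signs.

```r
# Vote counts assigned directly so the example runs without user input.
mario <- 100L
peach <- 150L
bowser <- 120L

# sum() accepts any number of numeric arguments.
total <- sum(mario, peach, bowser)
total

# Hypothetical fourth candidate: just pass one more argument.
luigi <- 90L
total_with_luigi <- sum(mario, peach, bowser, luigi)
total_with_luigi
```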
841 00:44:19,520 --> 00:44:22,760 So now sum will look at all three of these numbers, 842 00:44:22,760 --> 00:44:25,700 add them up, and store them now in total. 843 00:44:25,700 --> 00:44:30,140 So I'll clear the console, run source, and I'll type in 100 votes for Mario, 844 00:44:30,140 --> 00:44:33,440 150 for Peach, and 120 for Bowser. 845 00:44:33,440 --> 00:44:37,740 And now I'll see total votes was 370 as well. 846 00:44:37,740 --> 00:44:40,730 So we've improved our program so far and we've 847 00:44:40,730 --> 00:44:44,930 seen how to use these storage modes to add data together. 848 00:44:44,930 --> 00:44:47,930 Now what questions do we have on this program here, 849 00:44:47,930 --> 00:44:50,690 or storage modes in general? 850 00:44:50,690 --> 00:44:57,110 AUDIENCE: Can we enter an argument like an array to this sum function? 851 00:44:57,110 --> 00:44:59,720 CARTER ZENKE: So I heard you mention this idea of an array. 852 00:44:59,720 --> 00:45:02,053 And if you're familiar with other programming languages, 853 00:45:02,053 --> 00:45:05,180 you might have heard this idea of an array, like some list of data. 854 00:45:05,180 --> 00:45:10,160 And the question is, could we give sum not these three separate values 855 00:45:10,160 --> 00:45:13,250 but actually an array or some list of data? 856 00:45:13,250 --> 00:45:16,760 In fact, we can, and let me suggest we actually take like a five-minute break 857 00:45:16,760 --> 00:45:19,135 and come back to learn more about these structures we can 858 00:45:19,135 --> 00:45:21,020 use to represent data just like that. 859 00:45:21,020 --> 00:45:22,730 See you in five. 860 00:45:22,730 --> 00:45:23,930 Well, we're back. 861 00:45:23,930 --> 00:45:28,130 And so we've seen so far how to write programs that take user input. 862 00:45:28,130 --> 00:45:31,370 But odds are, as you write more R programs, 863 00:45:31,370 --> 00:45:35,300 you won't rely so much on the user to actually enter data for you.
864 00:45:35,300 --> 00:45:39,740 You'll instead read data from a file, like a CSV file for instance. 865 00:45:39,740 --> 00:45:42,580 So let's take a look at ways you can actually represent data 866 00:45:42,580 --> 00:45:45,910 and how to use those representations now in R. 867 00:45:45,910 --> 00:45:50,020 Well, you often find data is stored in these things called tables. 868 00:45:50,020 --> 00:45:52,330 And here is an example table. 869 00:45:52,330 --> 00:45:56,680 I'm trying to represent here candidates, like Mario, Peach, and Bowser, 870 00:45:56,680 --> 00:45:59,290 and the number of votes they received at the poll, 871 00:45:59,290 --> 00:46:01,750 so this is actual, physical polling location, 872 00:46:01,750 --> 00:46:05,540 and via mail, from mail in ballots, let's say. 873 00:46:05,540 --> 00:46:10,960 So notice how this table has both rows, this kind of horizontal orientation, 874 00:46:10,960 --> 00:46:14,080 and columns, this vertical orientation. 875 00:46:14,080 --> 00:46:16,880 In particular, there are three columns with names. 876 00:46:16,880 --> 00:46:21,340 So one is candidate, where I have the names of my candidates, in this case, 877 00:46:21,340 --> 00:46:23,530 Mario, Peach, and Bowser. 878 00:46:23,530 --> 00:46:26,590 Over here, I have this column called poll, 879 00:46:26,590 --> 00:46:29,260 representing the number of ballots or votes 880 00:46:29,260 --> 00:46:33,040 that Mario, Peach, and Bowser received at the actual, physical polling 881 00:46:33,040 --> 00:46:33,820 location. 882 00:46:33,820 --> 00:46:39,860 So let's say Mario got 37 votes, Peach 43, and Bowser 84 now. 883 00:46:39,860 --> 00:46:44,340 Well, for the mail column, there's also going to be some numbers here as well. 884 00:46:44,340 --> 00:46:48,500 Let's say Mario got 63 mail-in votes, Peach 885 00:46:48,500 --> 00:46:52,910 got 107 mail-in votes, and Bowser, 36. 886 00:46:52,910 --> 00:46:57,020 So this then is our table of rows and columns. 
887 00:46:57,020 --> 00:47:01,160 And one kind of analysis we might want to do on this table 888 00:47:01,160 --> 00:47:03,020 is called a tabulation. 889 00:47:03,020 --> 00:47:08,750 That is figuring out how many votes we received by poll or by mail 890 00:47:08,750 --> 00:47:12,260 and also how many votes each candidate received. 891 00:47:12,260 --> 00:47:16,340 So we could ask the question, how many votes did Mario receive overall? 892 00:47:16,340 --> 00:47:19,590 That would be a tabulation along these rows here. 893 00:47:19,590 --> 00:47:24,350 We could also ask, how many votes did we receive at the actual, physical polling 894 00:47:24,350 --> 00:47:25,100 location? 895 00:47:25,100 --> 00:47:28,470 That would be a tabulation along this column here. 896 00:47:28,470 --> 00:47:32,810 So these are two questions we can actually answer using R. 897 00:47:32,810 --> 00:47:35,870 But R, at least immediately, doesn't give us a way 898 00:47:35,870 --> 00:47:37,880 to represent a table exactly like this. 899 00:47:37,880 --> 00:47:43,640 If you want to store this data kind of long term, you need to do so in a file. 900 00:47:43,640 --> 00:47:47,750 And one popular way of representing data like this inside of a file 901 00:47:47,750 --> 00:47:52,790 is to use a CSV file, or a comma-separated values file. 902 00:47:52,790 --> 00:47:58,760 So here is the same representation of that data but now as a CSV file. 903 00:47:58,760 --> 00:48:04,040 Notice here I have those same column names, candidate, poll, and mail, 904 00:48:04,040 --> 00:48:09,545 and I still have those same column values, Mario, Peach, Bowser, 37, 43, 905 00:48:09,545 --> 00:48:11,760 84, and so on. 906 00:48:11,760 --> 00:48:14,420 But what you might notice is that the columns are now 907 00:48:14,420 --> 00:48:16,875 separated using these commas. 908 00:48:16,875 --> 00:48:17,750 And that makes sense. 909 00:48:17,750 --> 00:48:21,980 This is a comma-separated values file, or a CSV file. 
910 00:48:21,980 --> 00:48:24,920 Every row is still in its own row in the file, 911 00:48:24,920 --> 00:48:28,700 but now these columns are presented now with these commas. 912 00:48:28,700 --> 00:48:33,140 So let's see how we could use R to read in this CSV file 913 00:48:33,140 --> 00:48:36,500 and give us an actual table of data to work with. 914 00:48:36,500 --> 00:48:41,560 I'll come to RStudio here, and one of the first things I actually might want 915 00:48:41,560 --> 00:48:44,380 to do is clean up my working space. 916 00:48:44,380 --> 00:48:47,320 If I want to see what's currently in my environment, 917 00:48:47,320 --> 00:48:51,970 I can type this function, ls, at the console and hit Enter. 918 00:48:51,970 --> 00:48:56,770 And now I'll see all the objects that I still have in my environment 919 00:48:56,770 --> 00:48:58,540 from some prior programs. 920 00:48:58,540 --> 00:49:02,480 Now, I probably want to get rid of these as I'm writing some brand new program. 921 00:49:02,480 --> 00:49:04,700 So to do that I could use this function called 922 00:49:04,700 --> 00:49:08,950 rm, which stands for remove, whereas ls stands for list. 923 00:49:08,950 --> 00:49:14,560 I could use rm, and it turns out that rm takes a named argument called 924 00:49:14,560 --> 00:49:19,390 list that is the list of values I want to remove from my environment. 925 00:49:19,390 --> 00:49:23,950 And I'll say that this list is, well, it's the result of calling ls. 926 00:49:23,950 --> 00:49:28,210 That is, it will include bowser, greeting, mario, name, peach, 927 00:49:28,210 --> 00:49:32,020 and total, all of these prior objects I no longer want anymore. 928 00:49:32,020 --> 00:49:33,430 So I'll hit Enter on this. 929 00:49:33,430 --> 00:49:35,300 And I'll type ls again. 930 00:49:35,300 --> 00:49:39,140 And now I'll see character(0), which basically 931 00:49:39,140 --> 00:49:41,180 says there's nothing here right now.
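The clean-up step can be sketched like this. One assumption to flag: so that running the example does not wipe out anything in your own workspace, it works inside a separate scratch environment created with new.env, which the lecture does not use. At the console, the equivalent one-liner for your main environment is rm(list = ls()).

```r
# A scratch environment (an assumption of this sketch) keeps the demo
# from deleting objects in your real workspace.
scratch <- new.env()
assign("mario", 100L, envir = scratch)
assign("peach", 150L, envir = scratch)
assign("bowser", 120L, envir = scratch)

# ls() lists the names of the objects in an environment.
ls(scratch)

# rm() takes a named argument, list, holding the names to remove.
rm(list = ls(scratch), envir = scratch)

# Now nothing is left, so ls() returns an empty character vector.
ls(scratch)
```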
932 00:49:41,180 --> 00:49:44,700 It's an empty character vector, nothing at all in my environment. 933 00:49:44,700 --> 00:49:46,730 So now my environment is clean. 934 00:49:46,730 --> 00:49:48,380 There are no objects here. 935 00:49:48,380 --> 00:49:52,400 Let me actually create a new program, one called tabulate. 936 00:49:52,400 --> 00:49:57,710 So I'll do file.create("tabulate.R") to represent how we're going to tabulate 937 00:49:57,710 --> 00:50:01,910 this table of data and find the number of votes for each candidate and each 938 00:50:01,910 --> 00:50:03,080 voting method. 939 00:50:03,080 --> 00:50:04,520 Let me do Enter here. 940 00:50:04,520 --> 00:50:06,950 And I'll see that file was created for me. 941 00:50:06,950 --> 00:50:11,810 I'll go to my file explorer and open up tabulate.R. 942 00:50:11,810 --> 00:50:14,420 So actually, notice here in my file explorer, 943 00:50:14,420 --> 00:50:18,320 I do have this file named votes.csv. 944 00:50:18,320 --> 00:50:22,130 And if I click on it, I can actually see, if I click View File here, 945 00:50:22,130 --> 00:50:24,320 what's inside this file. 946 00:50:24,320 --> 00:50:26,600 And notice here I have this same exact thing 947 00:50:26,600 --> 00:50:30,080 we saw on the slides, candidate comma poll comma mail, 948 00:50:30,080 --> 00:50:34,040 and then one row for every row in my data set. 949 00:50:34,040 --> 00:50:39,890 So our goal then is to read this CSV and store it in R so you can actually 950 00:50:39,890 --> 00:50:44,810 get back a table of data to work with, now entirely in R. Well, 951 00:50:44,810 --> 00:50:48,350 one function I could use to read data from a file 952 00:50:48,350 --> 00:50:53,600 like this is actually called read.table, read.table. 953 00:50:53,600 --> 00:50:57,170 And I can give read.table the name of the file 954 00:50:57,170 --> 00:51:00,680 I want to read, or to open, to load inside of R. 955 00:51:00,680 --> 00:51:03,920 So that file name was votes.csv.
956 00:51:03,920 --> 00:51:07,760 And because votes.csv is in that working directory, 957 00:51:07,760 --> 00:51:11,580 I can just refer to it by its plain and simple name. 958 00:51:11,580 --> 00:51:14,420 So read.table has a return value. 959 00:51:14,420 --> 00:51:16,370 It's going to give me back a table of data. 960 00:51:16,370 --> 00:51:20,570 So I'm going to actually store that, let's say, in this table called votes. 961 00:51:20,570 --> 00:51:23,960 And now let me run just this line of R code. 962 00:51:23,960 --> 00:51:25,820 I can do that by hitting Run over here. 963 00:51:25,820 --> 00:51:28,100 Or on Mac, I could type Command-Enter. 964 00:51:28,100 --> 00:51:30,230 On Windows, I could type Control-Enter. 965 00:51:30,230 --> 00:51:32,060 I'll do Command-Enter on Mac. 966 00:51:32,060 --> 00:51:35,180 And now I see, according to the console, I have now 967 00:51:35,180 --> 00:51:39,060 read the votes data table here. 968 00:51:39,060 --> 00:51:43,490 So if I want to see what it looks like, there are a few ways to do that. 969 00:51:43,490 --> 00:51:45,740 I could actually look at my environment. 970 00:51:45,740 --> 00:51:47,010 Let's try that first. 971 00:51:47,010 --> 00:51:48,140 I'll go to Environment. 972 00:51:48,140 --> 00:51:52,190 And I'll see the following, that votes seems to have 973 00:51:52,190 --> 00:51:55,710 four observations of one variable. 974 00:51:55,710 --> 00:51:56,210 Hmm. 975 00:51:56,210 --> 00:51:58,790 So observations actually refers to the number 976 00:51:58,790 --> 00:52:01,580 of rows in this table I've gotten back. 977 00:52:01,580 --> 00:52:05,750 And variable refers to the number of columns I've gotten back.
978 00:52:05,750 --> 00:52:09,020 And you might already be thinking, this doesn't seem right, 979 00:52:09,020 --> 00:52:11,330 because I thought I had at least three rows, 980 00:52:11,330 --> 00:52:13,640 and I thought I had at least three columns, 981 00:52:13,640 --> 00:52:15,505 and I seem to have four rows and one column, 982 00:52:15,505 --> 00:52:16,880 so something might be wrong here. 983 00:52:16,880 --> 00:52:19,280 If we want to see exactly what happened, I 984 00:52:19,280 --> 00:52:22,550 can use this function called View, capital V, 985 00:52:22,550 --> 00:52:27,710 and I can pass as input the object I want to view in this case. 986 00:52:27,710 --> 00:52:30,740 If I run this line of R code, I should now 987 00:52:30,740 --> 00:52:35,990 see a separate tab that shows me exactly what is stored in this object. 988 00:52:35,990 --> 00:52:38,923 And I would say these results are not good. 989 00:52:38,923 --> 00:52:40,590 This is not what I want it to look like. 990 00:52:40,590 --> 00:52:45,990 Because, again, we only have one column, that R seems to have named V1, 991 00:52:45,990 --> 00:52:49,110 and instead of three rows there are four, 992 00:52:49,110 --> 00:52:52,110 where one row is actually the names of the columns 993 00:52:52,110 --> 00:52:54,120 that I wanted to be the case. 994 00:52:54,120 --> 00:52:55,660 This is just not what we want. 995 00:52:55,660 --> 00:52:59,460 So it seems like read.table needs more information 996 00:52:59,460 --> 00:53:02,160 on how to read this particular file. 997 00:53:02,160 --> 00:53:07,410 And one thing it might need to know is, what is the separator between each 998 00:53:07,410 --> 00:53:08,610 of my columns? 999 00:53:08,610 --> 00:53:14,250 Well, because this is a CSV file, that separator is none other than a comma. 1000 00:53:14,250 --> 00:53:19,410 So read.table, like paste, takes a named argument called sep. 1001 00:53:19,410 --> 00:53:23,610 And this we'll set to be an actual comma. 
1002 00:53:23,610 --> 00:53:27,390 So now read.table knows to look for these commas 1003 00:53:27,390 --> 00:53:32,790 and use those to identify what is a column inside of this data file. 1004 00:53:32,790 --> 00:53:37,500 So let me then rerun this line of code, and now view it. 1005 00:53:37,500 --> 00:53:39,100 And I'll see we're getting better. 1006 00:53:39,100 --> 00:53:44,490 So here I have three columns, although one is called V1, one is called V2, 1007 00:53:44,490 --> 00:53:47,128 the other is called V3, and I still have four rows. 1008 00:53:47,128 --> 00:53:49,170 So it's not quite there, but we're getting close. 1009 00:53:49,170 --> 00:53:55,290 One other argument to read.table is, in fact, this one called header, header. 1010 00:53:55,290 --> 00:53:59,220 So header can be either true or false, yes or no. 1011 00:53:59,220 --> 00:54:02,310 Do the column names exist in this file? 1012 00:54:02,310 --> 00:54:03,570 In this case, they do. 1013 00:54:03,570 --> 00:54:05,490 So I'll say header = TRUE. 1014 00:54:05,490 --> 00:54:09,480 I'm essentially saying that, yes, the column names are inside this file. 1015 00:54:09,480 --> 00:54:11,970 You should look for them, and you should use them. 1016 00:54:11,970 --> 00:54:13,360 So let me rerun this. 1017 00:54:13,360 --> 00:54:16,140 I'll rerun line 1, and now line 2. 1018 00:54:16,140 --> 00:54:18,630 And now I think we're in a pretty good place. 1019 00:54:18,630 --> 00:54:23,970 So R actually has stored inside its environment this table for me to use. 1020 00:54:23,970 --> 00:54:29,790 I see three columns, candidate, poll, and mail, and now three rows. 1021 00:54:29,790 --> 00:54:32,340 I could do something to make this a little more readable. 1022 00:54:32,340 --> 00:54:34,950 Often, when we have lots of arguments to these functions, 1023 00:54:34,950 --> 00:54:36,930 it's better to put them on separate lines. 
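The three attempts at reading the file can be sketched side by side. As an assumption for the sake of a self-contained example, the CSV contents from the slides are first written to a temporary file (csv_path is a hypothetical name), rather than read from votes.csv in the working directory.

```r
# Write the CSV from the slides to a temporary file so the sketch is
# self-contained; in the lecture this file is votes.csv.
csv_path <- tempfile(fileext = ".csv")
writeLines(
  c("candidate,poll,mail", "Mario,37,63", "Peach,43,107", "Bowser,84,36"),
  csv_path
)

# Attempt 1: no extra arguments. Each whole line lands in one column, V1.
raw <- read.table(csv_path)
dim(raw)  # rows, then columns

# Attempt 2: tell read.table the separator. Three columns, but the
# header line is still treated as a row of data.
with_sep <- read.table(csv_path, sep = ",")
dim(with_sep)

# Attempt 3: also say the first line holds the column names.
votes <- read.table(csv_path, sep = ",", header = TRUE)
dim(votes)
names(votes)
```

Comparing the three dim results shows the same progression seen in the View tab: four rows and one column, then four rows and three columns, then three rows and three named columns.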
1024 00:54:36,930 --> 00:54:41,010 So according to the style guide for R that kind of tells me 1025 00:54:41,010 --> 00:54:44,740 how I should be structuring my file, I should do something a bit like this. 1026 00:54:44,740 --> 00:54:48,275 I should put each argument on a new line in my code 1027 00:54:48,275 --> 00:54:50,400 and then make sure that this closing parenthesis is 1028 00:54:50,400 --> 00:54:52,750 all the way against the left-hand side. 1029 00:54:52,750 --> 00:54:55,050 So this allows me to more quickly see what 1030 00:54:55,050 --> 00:54:58,800 arguments I have supplied to read.table, but the result 1031 00:54:58,800 --> 00:55:02,220 is exactly the same, of course. 1032 00:55:02,220 --> 00:55:05,580 Now, CSV files are pretty popular. 1033 00:55:05,580 --> 00:55:08,280 And it doesn't make sense to me to always 1034 00:55:08,280 --> 00:55:11,160 be writing sep = comma, header = TRUE. 1035 00:55:11,160 --> 00:55:14,340 And in fact, people who work with R, they 1036 00:55:14,340 --> 00:55:19,170 came up with their own function called read.csv to make this much easier 1037 00:55:19,170 --> 00:55:20,760 for us as programmers. 1038 00:55:20,760 --> 00:55:26,190 So instead of read.table, which can work on a variety of actual data files, 1039 00:55:26,190 --> 00:55:32,530 I might instead use read.csv because, of course, I have simply a CSV here. 1040 00:55:32,530 --> 00:55:34,090 So let's try this. 1041 00:55:34,090 --> 00:55:37,620 I'll run just read.csv, given the file name. 1042 00:55:37,620 --> 00:55:39,270 Hit Enter here and Enter here. 1043 00:55:39,270 --> 00:55:44,310 And it's the same result but now with much less typing, much fewer arguments. 1044 00:55:44,310 --> 00:55:48,630 read.csv just seems to know more naturally how to read these files 1045 00:55:48,630 --> 00:55:51,720 called CSV files. 1046 00:55:51,720 --> 00:55:52,860 OK. 1047 00:55:52,860 --> 00:55:56,970 So we've successfully read this CSV file.
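As a sketch of both points, the snippet below lays out the read.table call with one argument per line and then checks that read.csv gives back the same table. The temporary file and the names csv_path, votes_table, and votes_csv are assumptions of this example; the lecture reads votes.csv directly.

```r
# Recreate the CSV from the slides in a temporary file.
csv_path <- tempfile(fileext = ".csv")
writeLines(
  c("candidate,poll,mail", "Mario,37,63", "Peach,43,107", "Bowser,84,36"),
  csv_path
)

# One argument per line, closing parenthesis back at the left margin.
votes_table <- read.table(
  csv_path,
  sep = ",",
  header = TRUE
)

# read.csv assumes a comma separator and a header line by default.
votes_csv <- read.csv(csv_path)

# For this file, the two calls produce the same data frame.
all.equal(votes_table, votes_csv)
```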
1048 00:55:56,970 --> 00:56:00,960 But the next question is, what exactly has it given back to us? 1049 00:56:00,960 --> 00:56:02,340 Certainly, it's a table. 1050 00:56:02,340 --> 00:56:05,940 But in R, this table has a special name. 1051 00:56:05,940 --> 00:56:09,390 And this special name is a data frame. 1052 00:56:09,390 --> 00:56:12,900 Now, a data frame is what we're going to call a data structure. 1053 00:56:12,900 --> 00:56:15,960 Some way of organizing our data that allows us 1054 00:56:15,960 --> 00:56:18,390 to do things with it much more quickly. 1055 00:56:18,390 --> 00:56:22,230 And those things, for a data frame, might involve accessing the columns, 1056 00:56:22,230 --> 00:56:25,980 or accessing the rows, or performing things like tabulations, 1057 00:56:25,980 --> 00:56:27,970 as we'll see in just a little bit. 1058 00:56:27,970 --> 00:56:30,660 So here again is our data frame. 1059 00:56:30,660 --> 00:56:33,100 But now it's called votes, just like this. 1060 00:56:33,100 --> 00:56:37,050 And let's say I actually want to access some particular columns 1061 00:56:37,050 --> 00:56:40,380 or some particular rows of this data frame. 1062 00:56:40,380 --> 00:56:43,170 Well, because it is a data frame, R gives me 1063 00:56:43,170 --> 00:56:45,420 some special syntax, some special actual characters 1064 00:56:45,420 --> 00:56:49,170 I could type to access those columns and those rows. 1065 00:56:49,170 --> 00:56:52,800 Now, one way is to use what we call bracket notation, where 1066 00:56:52,800 --> 00:56:56,070 I could take the name of this data frame, use brackets, 1067 00:56:56,070 --> 00:57:00,300 and then supply the number of the row and the number of the column 1068 00:57:00,300 --> 00:57:01,620 that I want to see. 1069 00:57:01,620 --> 00:57:06,270 So for instance, let's say I wanted to access this particular value, Mario, 1070 00:57:06,270 --> 00:57:08,730 in the first column and the first row. 
1071 00:57:08,730 --> 00:57:13,380 Well, in that case, I could type votes bracket 1 comma space 1 1072 00:57:13,380 --> 00:57:16,950 for the first value in the first row and the first column. 1073 00:57:16,950 --> 00:57:20,880 But more often, you'll want to access not just 1074 00:57:20,880 --> 00:57:26,250 one particular value but all the values in a column or all the values in a row. 1075 00:57:26,250 --> 00:57:30,750 And to do that, you can simply omit one or the other, the row or the column 1076 00:57:30,750 --> 00:57:31,480 value. 1077 00:57:31,480 --> 00:57:35,190 So here to access all the candidates, I could type votes 1078 00:57:35,190 --> 00:57:38,820 bracket comma space 1, omitting the row number 1079 00:57:38,820 --> 00:57:42,210 but only supplying the column number, the first column here. 1080 00:57:42,210 --> 00:57:45,570 That will give me this list of Mario, Peach, and Bowser. 1081 00:57:45,570 --> 00:57:47,130 What if I wanted the poll numbers? 1082 00:57:47,130 --> 00:57:53,010 Well, I could do votes comma 2, and that would give me 37, 43, 84, and same 1083 00:57:53,010 --> 00:57:55,830 for mail, but now with the number 3. 1084 00:57:55,830 --> 00:57:59,940 So let's try it in R. I'll come back over to RStudio 1085 00:57:59,940 --> 00:58:02,040 and go back to my program. 1086 00:58:02,040 --> 00:58:06,090 And our goal, again, was to sum up let's say the number of votes 1087 00:58:06,090 --> 00:58:06,960 we got at the polls. 1088 00:58:06,960 --> 00:58:11,700 So I could at least see those values if I do votes bracket and then 1089 00:58:11,700 --> 00:58:13,960 comma 2, just like this. 1090 00:58:13,960 --> 00:58:17,010 Let me clear my console and run this line. 1091 00:58:17,010 --> 00:58:22,560 And now I see those same values, 37, 43, 84. 
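Bracket notation can be tried out with the sketch below. To stay self-contained, it builds the votes data frame from inline text using the text argument of read.csv, which is an assumption of this example; the lecture reads the same data from votes.csv.

```r
# Build the same table from inline text instead of a file on disk.
votes <- read.csv(
  text = "candidate,poll,mail\nMario,37,63\nPeach,43,107\nBowser,84,36"
)

# Row number first, then column number.
votes[1, 1]  # first row, first column: Mario

# Omit the row number to get every value in a column.
votes[, 1]   # all candidates
votes[, 2]   # all poll counts
votes[, 3]   # all mail counts

# Omit the column number to get every value in a row.
votes[1, ]   # Mario's whole row
```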
1092 00:58:22,560 --> 00:58:27,150 Same thing here, 37, 43, 84, even though in my console, 1093 00:58:27,150 --> 00:58:29,400 they're kind of turned on their side like this, 1094 00:58:29,400 --> 00:58:34,290 these are, in fact, the same values I've seen in my table. 1095 00:58:34,290 --> 00:58:36,550 But what could go wrong here? 1096 00:58:36,550 --> 00:58:42,090 Well, if we think about this, what's to stop me from rearranging these columns, 1097 00:58:42,090 --> 00:58:46,290 from maybe making mail the second column and poll the third. 1098 00:58:46,290 --> 00:58:50,610 If my program can't update based on that kind of rearrangement, 1099 00:58:50,610 --> 00:58:52,300 it's not a very good program. 1100 00:58:52,300 --> 00:58:55,350 So there is another way to actually access columns, 1101 00:58:55,350 --> 00:58:57,790 in this case using their names. 1102 00:58:57,790 --> 00:59:00,690 So instead of the number of the column I want to access, 1103 00:59:00,690 --> 00:59:03,720 I can use its name, which is much more robust. 1104 00:59:03,720 --> 00:59:06,570 If I change the ordering of the columns, I can still 1105 00:59:06,570 --> 00:59:09,120 access the column I want to access. 1106 00:59:09,120 --> 00:59:12,450 Now, the syntax for this looks a bit as follows. 1107 00:59:12,450 --> 00:59:17,550 I would use the data frame's name, votes, followed by a dollar sign. 1108 00:59:17,550 --> 00:59:20,620 And then I would get back, in this case-- 1109 00:59:20,620 --> 00:59:23,940 I would actually type the name of the column I want to access. 1110 00:59:23,940 --> 00:59:26,680 And I would then get access to that particular column. 1111 00:59:26,680 --> 00:59:31,950 So votes$candidate, that gives me access to this column of candidates. 1112 00:59:31,950 --> 00:59:38,010 Same thing with poll, votes$poll, and same thing with mail, votes$mail, 1113 00:59:38,010 --> 00:59:40,500 this dollar sign has nothing to do with currency.
1114 00:59:40,500 --> 00:59:45,150 It's just a way to actually access a column by a particular name 1115 00:59:45,150 --> 00:59:46,260 that it has. 1116 00:59:46,260 --> 00:59:49,740 So let's come back to RStudio and try that out now. 1117 00:59:49,740 --> 00:59:54,720 Let's say I want to access the poll column. 1118 00:59:54,720 --> 00:59:58,770 Well, instead of using the bracket notation with the comma 2, 1119 00:59:58,770 --> 01:00:01,560 I could use votes$poll dollar sign poll. 1120 01:00:01,560 --> 01:00:04,470 And now let me clear my terminal, my console down below. 1121 01:00:04,470 --> 01:00:05,760 Let me run source. 1122 01:00:05,760 --> 01:00:06,660 And oops. 1123 01:00:06,660 --> 01:00:10,590 Let me actually, instead, run this particular line here. 1124 01:00:10,590 --> 01:00:16,390 I'll get 37, 43, 84, just as we saw before. 1125 01:00:16,390 --> 01:00:19,110 So let me pause here and ask, what questions do we 1126 01:00:19,110 --> 01:00:22,470 have about these data frames so far? 1127 01:00:22,470 --> 01:00:26,430 AUDIENCE: Can I ask if I can get some particular value 1128 01:00:26,430 --> 01:00:33,690 from some particular row by accessing the votes or the main column, votes 1129 01:00:33,690 --> 01:00:38,482 bracket or dollar sign, the column, then dollar sign, the value? 1130 01:00:38,482 --> 01:00:39,690 CARTER ZENKE: Great question. 1131 01:00:39,690 --> 01:00:41,482 So we saw earlier there is some syntax that 1132 01:00:41,482 --> 01:00:44,720 can get us access to some particular value in this data frame. 1133 01:00:44,720 --> 01:00:46,890 Let's see that in action a little bit here, too. 1134 01:00:46,890 --> 01:00:49,010 So I'll come back to RStudio. 1135 01:00:49,010 --> 01:00:54,080 And let's say we want to access Mario's number of votes 1136 01:00:54,080 --> 01:00:55,580 they received at the polls. 1137 01:00:55,580 --> 01:00:59,630 So this would be the second column and the first row. 
1138 01:00:59,630 --> 01:01:04,010 And in R, we tend to index things, that is start counting, from one, 1139 01:01:04,010 --> 01:01:06,690 so the second column and the first row. 1140 01:01:06,690 --> 01:01:10,640 So here, let me go ahead and try to access that particular value. 1141 01:01:10,640 --> 01:01:14,330 I could say votes, like we saw before, open and closed brackets. 1142 01:01:14,330 --> 01:01:16,890 And I know, again, this is the first row. 1143 01:01:16,890 --> 01:01:21,020 So I'll put that as the 1 here and then the column number, which in this case 1144 01:01:21,020 --> 01:01:24,890 was the second column, counting from 1, so 2. 1145 01:01:24,890 --> 01:01:28,940 This should, if I hit Enter on this line of code, or Command-Enter, 1146 01:01:28,940 --> 01:01:34,400 should show me that, in fact, Mario has 37 votes at the polls. 1147 01:01:34,400 --> 01:01:37,740 I could do the same for Peach, let's say. 1148 01:01:37,740 --> 01:01:40,140 Peach has 43. 1149 01:01:40,140 --> 01:01:43,710 Bowser, Bowser has 84. 1150 01:01:43,710 --> 01:01:48,960 So that is a way to actually access individual values in our data frame. 1151 01:01:48,960 --> 01:01:51,930 One other way to do this, though, is to take 1152 01:01:51,930 --> 01:01:53,670 advantage of another data structure. 1153 01:01:53,670 --> 01:01:59,040 So it turns out that when we access the columns of data frame, what we're 1154 01:01:59,040 --> 01:02:01,950 getting back is no longer a data frame. 1155 01:02:01,950 --> 01:02:06,150 What we're getting back instead is what R calls a vector. 1156 01:02:06,150 --> 01:02:11,400 A vector is simply a list of data that is all of the same storage mode. 1157 01:02:11,400 --> 01:02:18,480 If you've heard of arrays in C or lists in Python, a vector is a similar idea. 1158 01:02:18,480 --> 01:02:23,010 But it's simply a list of values all of the same storage mode. 1159 01:02:23,010 --> 01:02:27,210 So to visualize this, let's go back to our data frame here. 
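Here is a sketch that pulls these ideas together: access by name, access to one value by row and column, and the fact that a single column comes back as a vector. The inline text passed to read.csv and the reordered data frame are assumptions added for illustration, as are the checks with is.data.frame and is.vector.

```r
# Same table as in the lecture, built from inline text.
votes <- read.csv(
  text = "candidate,poll,mail\nMario,37,63\nPeach,43,107\nBowser,84,36"
)

# Access a column by name with the dollar sign.
votes$poll

# Access single values: row number, then column number, counting from 1.
votes[1, 2]  # Mario at the polls
votes[2, 2]  # Peach at the polls
votes[3, 2]  # Bowser at the polls

# Hypothetical rearrangement: mail is now second and poll is third.
reordered <- votes[, c("candidate", "mail", "poll")]
reordered[, 2]  # by number, this is now the mail column
reordered$poll  # by name, this is still the poll column

# A single column is no longer a data frame; it is a vector.
is.data.frame(votes)
is.data.frame(votes$poll)
is.vector(votes$poll)
```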
1160 01:02:27,210 --> 01:02:31,980 And let's say I want to access this particular column of votes. 1161 01:02:31,980 --> 01:02:35,610 Well, I could access it using votes$candidate. 1162 01:02:35,610 --> 01:02:38,100 And when I do that, what I really get back 1163 01:02:38,100 --> 01:02:43,380 is this separate structure that is just the values from that particular column. 1164 01:02:43,380 --> 01:02:46,500 And now, because this is a new structure, 1165 01:02:46,500 --> 01:02:51,960 I could actually use that same bracket notation to ask for particular values 1166 01:02:51,960 --> 01:02:53,880 from this list of data. 1167 01:02:53,880 --> 01:02:56,550 Now, again, we start counting from one in R, 1168 01:02:56,550 --> 01:02:59,190 so this is our first value, second, and third. 1169 01:02:59,190 --> 01:03:01,680 If I want the first value, the first candidate, 1170 01:03:01,680 --> 01:03:05,640 I could use votes$candidate and then bracket 1. 1171 01:03:05,640 --> 01:03:07,770 That will give me Mario. 1172 01:03:07,770 --> 01:03:11,610 I could use votes$candidate and then bracket 2. 1173 01:03:11,610 --> 01:03:15,160 That would give me Peach, and same thing with Bowser here. 1174 01:03:15,160 --> 01:03:18,630 So we're kind of taking out or extracting this new vector. 1175 01:03:18,630 --> 01:03:22,560 And we're able to access individual values in it. 1176 01:03:22,560 --> 01:03:25,960 Now, one example I like to use is the example of building blocks here. 1177 01:03:25,960 --> 01:03:29,280 So here we have our very own data frame, composed 1178 01:03:29,280 --> 01:03:33,030 of nine individual pieces of data. 1179 01:03:33,030 --> 01:03:37,530 Now, when I ask for some particular column from this data frame, what
1182 01:03:44,580 --> 01:03:48,150 Now, again, if I want the first value in this vector, 1183 01:03:48,150 --> 01:03:51,960 I would simply take the one from the top; the second, the second one down; 1184 01:03:51,960 --> 01:03:53,940 the third one, the third one down. 1185 01:03:53,940 --> 01:03:58,300 Now, in R, we also see that when we print out vectors, 1186 01:03:58,300 --> 01:04:01,260 see them in our console, they aren't always kind of vertically 1187 01:04:01,260 --> 01:04:02,430 arranged like this. 1188 01:04:02,430 --> 01:04:07,590 Often, they'll be a bit more like this, where I might take, in this case, 1189 01:04:07,590 --> 01:04:09,420 put it on its side, a bit like this. 1190 01:04:09,420 --> 01:04:12,990 Or I would now get A, B, and C, left to right like this. 1191 01:04:12,990 --> 01:04:15,930 Even though in our data frame it was top to bottom, 1192 01:04:15,930 --> 01:04:19,840 we might get it represented side by side, a bit like this. 1193 01:04:19,840 --> 01:04:22,080 So same thing, ultimately, and there's this idea 1194 01:04:22,080 --> 01:04:25,680 of kind of extracting this vector from our data frame. 1195 01:04:25,680 --> 01:04:29,970 So let's try that now in R. I'll come back to RStudio. 1196 01:04:29,970 --> 01:04:35,940 And let's go ahead and try to access maybe the first value of the poll 1197 01:04:35,940 --> 01:04:36,640 column. 1198 01:04:36,640 --> 01:04:41,730 Now, we saw before, I could simply use bracket 1 on this new vector 1199 01:04:41,730 --> 01:04:45,390 that I've created by accessing the poll column of votes. 1200 01:04:45,390 --> 01:04:48,630 Let me hit Command-Enter, and I'll see 37. 1201 01:04:48,630 --> 01:04:52,350 Let me do the second value, and I'll see 43. 1202 01:04:52,350 --> 01:04:57,360 So I think, with this, we could start to answer one question we had, which was, 1203 01:04:57,360 --> 01:05:00,870 how many votes did we get at the polls in total? 
1204 01:05:00,870 --> 01:05:06,360 Well, one way to answer this is to say, let's sum up votes$poll, 1205 01:05:06,360 --> 01:05:08,940 the first value in that poll vector. 1206 01:05:08,940 --> 01:05:12,630 Then let's find that second value as well. 1207 01:05:12,630 --> 01:05:16,150 Then let's find that third value, just like this. 1208 01:05:16,150 --> 01:05:17,580 And sum all of those up. 1209 01:05:17,580 --> 01:05:23,640 And if I hit Command-Enter here, I'll see I have 164 votes at the polls. 1210 01:05:23,640 --> 01:05:26,430 But just like we saw before, there's probably 1211 01:05:26,430 --> 01:05:30,340 something that could be better designed about this particular line of code 1212 01:05:30,340 --> 01:05:30,840 here. 1213 01:05:30,840 --> 01:05:34,200 And that is, if I had more than three candidates, 1214 01:05:34,200 --> 01:05:36,740 I'd be typing a really, really long line of code here. 1215 01:05:36,740 --> 01:05:39,260 So what I could instead do is this. 1216 01:05:39,260 --> 01:05:44,110 I could give sum, to our earlier question, the vector itself. 1217 01:05:44,110 --> 01:05:49,000 Simply votes$poll, which we know is a list of values, a vector, 1218 01:05:49,000 --> 01:05:53,080 I could give that entire vector to sum, that will then know what to do with it 1219 01:05:53,080 --> 01:05:59,230 and return to me, in fact, the sum of each of those elements of the vector. 1220 01:05:59,230 --> 01:06:02,500 So we call sum here vectorized. 1221 01:06:02,500 --> 01:06:06,880 Sum knows what to do when it gets not just a single value as input 1222 01:06:06,880 --> 01:06:08,650 but an entire vector. 1223 01:06:08,650 --> 01:06:13,300 And R is really built from the ground up with these vectorized functions that 1224 01:06:13,300 --> 01:06:16,880 can take whole lists of data and operate on them very, 1225 01:06:16,880 --> 01:06:19,400 very efficiently in this case.
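To compare the two approaches side by side, here is a small sketch that sums the poll column first value by value and then by handing the whole vector to sum. The data frame is rebuilt inline as an assumption, and the names long_way and short_way are hypothetical.

```r
# Rebuild the example data frame inline (assumption: the lecture loads it from a file)
votes <- data.frame(
  candidate = c("Mario", "Peach", "Bowser"),
  poll = c(37, 43, 84),
  mail = c(63, 107, 36),
  stringsAsFactors = FALSE
)

# The long way: name every element individually
long_way <- sum(votes$poll[1], votes$poll[2], votes$poll[3])

# The vectorized way: give sum the entire vector
short_way <- sum(votes$poll)

long_way   # 164
short_way  # 164
```

The second form keeps working unchanged if more candidates are added, which is the design point made above.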
1226 01:06:19,400 --> 01:06:23,830 So here we've answered that question of, how many votes do we get at the poll? 1227 01:06:23,830 --> 01:06:27,250 Let's take that next one, which was, how many votes do we get in the mail? 1228 01:06:27,250 --> 01:06:31,750 Well, I could do now sum, sum, and then give the mail 1229 01:06:31,750 --> 01:06:35,410 column, which is, in fact, a vector, and have that summed up. 1230 01:06:35,410 --> 01:06:41,710 And we'll see we got 206 votes now in the mail. 1231 01:06:41,710 --> 01:06:43,210 So we've seen vectors. 1232 01:06:43,210 --> 01:06:44,770 And we've seen data frames. 1233 01:06:44,770 --> 01:06:47,740 But we still have one other question to answer, which was 1234 01:06:47,740 --> 01:06:50,590 how many votes did each candidate get? 1235 01:06:50,590 --> 01:06:55,090 Now, to do that we have to sum values across columns here. 1236 01:06:55,090 --> 01:06:58,180 37 plus 63 is Mario's votes. 1237 01:06:58,180 --> 01:07:00,730 43 plus 107 is Peach's votes. 1238 01:07:00,730 --> 01:07:03,910 And 84 plus 36, that's Bowser's votes. 1239 01:07:03,910 --> 01:07:05,810 So let's try it. 1240 01:07:05,810 --> 01:07:08,920 I could maybe treat two separate columns here. 1241 01:07:08,920 --> 01:07:13,630 I could do votes and, what's that one, poll, votes poll. 1242 01:07:13,630 --> 01:07:15,550 Get the first value for Mario. 1243 01:07:15,550 --> 01:07:20,570 And add up, let's say, the first value in mail, a bit like this. 1244 01:07:20,570 --> 01:07:24,460 And that, I would argue, is Mario's total number of votes. 1245 01:07:24,460 --> 01:07:29,200 What if we did maybe the second value in poll 1246 01:07:29,200 --> 01:07:34,090 for Peach and, let's say, the second value in mail, also for Peach. 1247 01:07:34,090 --> 01:07:36,640 150, so that's Peach's total number of votes.
1248 01:07:36,640 --> 01:07:41,650 Let's do votes and the third element in the poll column 1249 01:07:41,650 --> 01:07:44,650 and the third element in the mail column, 1250 01:07:44,650 --> 01:07:46,972 and that would then be Bowser's votes. 1251 01:07:46,972 --> 01:07:49,930 And I hope you can tell by me being kind of bored while I'm doing this, 1252 01:07:49,930 --> 01:07:52,360 this is not the best way to do this. 1253 01:07:52,360 --> 01:07:57,220 There is actually a better way that takes advantage of a feature of vectors 1254 01:07:57,220 --> 01:07:59,950 in R, which is vector arithmetic. 1255 01:07:59,950 --> 01:08:02,830 So not only are vectors handy at representing lists of data. 1256 01:08:02,830 --> 01:08:06,260 We can use them to efficiently perform math as well. 1257 01:08:06,260 --> 01:08:09,850 So if I wanted to sum up all of these vectors and find out how many votes 1258 01:08:09,850 --> 01:08:13,840 each candidate got, I could simplify this and simply type the following, 1259 01:08:13,840 --> 01:08:20,080 votes$poll plus, in this case, votes$mail. 1260 01:08:20,080 --> 01:08:21,580 And that would be it. 1261 01:08:21,580 --> 01:08:25,250 But let's visualize this and see why exactly this works. 1262 01:08:25,250 --> 01:08:28,330 So we said, we had this idea of vector arithmetic. 1263 01:08:28,330 --> 01:08:31,270 And here, again, is our data frame. 1264 01:08:31,270 --> 01:08:35,770 So I want to effectively sum up these two columns 1265 01:08:35,770 --> 01:08:40,149 and return to myself a new total for each candidate for every row, 1266 01:08:40,149 --> 01:08:41,290 in fact, that I have. 1267 01:08:41,290 --> 01:08:44,590 Well, to do that, I can take out these two vectors, poll and mail, 1268 01:08:44,590 --> 01:08:46,689 and think of them separately now. 1269 01:08:46,689 --> 01:08:48,250 Now I want to add these together. 1270 01:08:48,250 --> 01:08:51,609 Like we said, 37 plus 63, that's Mario's votes.
1271 01:08:51,609 --> 01:08:54,010 43 plus 107, that's Peach's votes. 1272 01:08:54,010 --> 01:08:57,279 Well, to add these I could use the plus sign, just like this, 1273 01:08:57,279 --> 01:09:01,479 but if you're new, it's not quite obvious what's going to happen here. 1274 01:09:01,479 --> 01:09:04,779 I mean, how could I add up these three numbers and these three numbers? 1275 01:09:04,779 --> 01:09:07,930 And what is the structure of what I get back in the end? 1276 01:09:07,930 --> 01:09:13,510 Well, it turns out that R uses a new vector as the result of this. 1277 01:09:13,510 --> 01:09:19,090 And it actually computes the new values element-wise, element-wise meaning 1278 01:09:19,090 --> 01:09:24,069 it goes top to bottom in each vector and adds those two corresponding elements 1279 01:09:24,069 --> 01:09:24,770 together. 1280 01:09:24,770 --> 01:09:30,250 So first it looks at the first element of each column here, 1281 01:09:30,250 --> 01:09:32,140 the poll column and the mail column. 1282 01:09:32,140 --> 01:09:34,450 37 plus 63 is 100. 1283 01:09:34,450 --> 01:09:35,890 Then it goes to the next one. 1284 01:09:35,890 --> 01:09:38,439 43 plus 107 is 150. 1285 01:09:38,439 --> 01:09:40,090 Then it goes to the next one here. 1286 01:09:40,090 --> 01:09:43,149 84 plus 36 is now 120. 1287 01:09:43,149 --> 01:09:47,590 And we now have a brand new vector by adding up or summing together 1288 01:09:47,590 --> 01:09:50,950 these two distinct vectors overall. 1289 01:09:50,950 --> 01:09:53,620 To go back to our visualization here of these blocks. 1290 01:09:53,620 --> 01:09:57,620 So you could think of me taking a data frame, just like this, 1291 01:09:57,620 --> 01:10:00,460 and kind of extracting these two vectors here. 1292 01:10:00,460 --> 01:10:02,710 Let's say this one, and let's say this one. 1293 01:10:02,710 --> 01:10:05,030 Let me put this one to the side for now. 1294 01:10:05,030 --> 01:10:07,360 And I want to do some math with these. 
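The element-wise addition walked through above can be sketched as follows. The data frame is rebuilt inline as an assumption, and totals is a hypothetical name for the resulting vector.

```r
# Rebuild the example data frame inline (assumption: the lecture loads it from a file)
votes <- data.frame(
  candidate = c("Mario", "Peach", "Bowser"),
  poll = c(37, 43, 84),
  mail = c(63, 107, 36),
  stringsAsFactors = FALSE
)

# Adding two vectors of the same length pairs up corresponding elements:
# 37 + 63, then 43 + 107, then 84 + 36
totals <- votes$poll + votes$mail

totals          # 100 150 120
length(totals)  # 3, one total per row of the data frame
```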
1295 01:10:07,360 --> 01:10:12,430 So in this case, I'll look at the first element in each vector, in this case 4, 1296 01:10:12,430 --> 01:10:14,800 and in this case 1, or 4 and 1. 1297 01:10:14,800 --> 01:10:16,130 What is the addition here? 1298 01:10:16,130 --> 01:10:17,950 Well, 4 plus 1 is 5. 1299 01:10:17,950 --> 01:10:20,710 So the first element of my new vector would be 5. 1300 01:10:20,710 --> 01:10:22,690 Then I go to the next two elements here. 1301 01:10:22,690 --> 01:10:24,850 In this case, I have 5 and 2. 1302 01:10:24,850 --> 01:10:26,080 What's 5 plus 2? 1303 01:10:26,080 --> 01:10:26,980 That's 7. 1304 01:10:26,980 --> 01:10:29,530 So the second element in my new vector would be 7. 1305 01:10:29,530 --> 01:10:31,630 And then, later on, what do I have? 1306 01:10:31,630 --> 01:10:33,280 I have 6 and 3. 1307 01:10:33,280 --> 01:10:35,350 Those added together would be 9. 1308 01:10:35,350 --> 01:10:39,280 My third element in my new vector would, in fact, be 9. 1309 01:10:39,280 --> 01:10:43,060 So vector arithmetic gives us, in the end, a new vector 1310 01:10:43,060 --> 01:10:48,790 by actually adding, or in this case, adding these elements together, 1311 01:10:48,790 --> 01:10:51,320 kind of element-wise in the end. 1312 01:10:51,320 --> 01:10:54,550 So let's go back and see how this looks in RStudio. 1313 01:10:54,550 --> 01:10:56,500 I'll come back to RStudio here. 1314 01:10:56,500 --> 01:10:59,320 And I will then hit Enter on this line of code. 1315 01:10:59,320 --> 01:11:07,180 And we'll see that I get back a vector of three elements, 100, 150, and 120. 1316 01:11:07,180 --> 01:11:10,780 So let's pause here and ask, what questions do we have 1317 01:11:10,780 --> 01:11:14,380 on these vectors or vector arithmetic? 1318 01:11:14,380 --> 01:11:17,320 AUDIENCE: How is the sum here being printed on a terminal 1319 01:11:17,320 --> 01:11:19,447 without using the print function? 1320 01:11:19,447 --> 01:11:20,780 CARTER ZENKE: Ah, good question.
1321 01:11:20,780 --> 01:11:24,760 So you've noticed that when I press Command-Enter, for instance, 1322 01:11:24,760 --> 01:11:28,030 I'm seeing the results, whatever is stored in this object, 1323 01:11:28,030 --> 01:11:30,610 in my console without using print. 1324 01:11:30,610 --> 01:11:34,060 And that, in fact, is a feature of R and RStudio, 1325 01:11:34,060 --> 01:11:36,860 that when I run some particular line of code, 1326 01:11:36,860 --> 01:11:40,190 I can then see the return value of that particular line, 1327 01:11:40,190 --> 01:11:42,557 or whatever it computes for me, down to my console. 1328 01:11:42,557 --> 01:11:44,390 It's kind of a handy feature of these things 1329 01:11:44,390 --> 01:11:47,330 called IDEs that let me actually understand what's going on 1330 01:11:47,330 --> 01:11:49,700 in my code all the more clearly. 1331 01:11:49,700 --> 01:11:50,990 Good question. 1332 01:11:50,990 --> 01:11:54,020 AUDIENCE: I have a question in data frames. 1333 01:11:54,020 --> 01:11:59,960 So in the dollar notation, it returns a vector. 1334 01:11:59,960 --> 01:12:01,850 What about in the bracket notation? 1335 01:12:01,850 --> 01:12:04,910 Suppose we are taking, for a row, 1 comma. 1336 01:12:04,910 --> 01:12:06,590 Does it also return a vector? 1337 01:12:06,590 --> 01:12:09,590 CARTER ZENKE: A great question, so we just saw with the dollar notation, 1338 01:12:09,590 --> 01:12:11,090 we get back a vector. 1339 01:12:11,090 --> 01:12:13,310 Does it do the same thing then with bracket notation? 1340 01:12:13,310 --> 01:12:14,360 So in fact, it does. 1341 01:12:14,360 --> 01:12:16,250 Let me show you how it works in RStudio. 1342 01:12:16,250 --> 01:12:17,880 I'll come back over here. 1343 01:12:17,880 --> 01:12:22,430 And let's say I wanted to sum up these two columns still, 1344 01:12:22,430 --> 01:12:25,010 but I don't want to use their names for whatever reason. 1345 01:12:25,010 --> 01:12:29,760 I'll instead use their bracket representations here. 
1346 01:12:29,760 --> 01:12:36,200 So I'll take the second column of my table here, using comma 2. 1347 01:12:36,200 --> 01:12:40,260 And I'll add together, in this case, the third column, just like this. 1348 01:12:40,260 --> 01:12:41,570 I'll hit Enter. 1349 01:12:41,570 --> 01:12:42,560 Whoops. 1350 01:12:42,560 --> 01:12:43,640 Hit Enter. 1351 01:12:43,640 --> 01:12:48,320 And I'll see I get back the same result. So these, each one of these 1352 01:12:48,320 --> 01:12:50,167 is in fact a vector. 1353 01:12:50,167 --> 01:12:52,250 Where you should be careful, though, is that there 1354 01:12:52,250 --> 01:12:58,250 is some notation that looks a bit like this, votes, in this case, bracket 1. 1355 01:12:58,250 --> 01:13:01,850 Notice how I'm not actually asking for a row and a column. 1356 01:13:01,850 --> 01:13:04,430 I'm only asking for some number 1. 1357 01:13:04,430 --> 01:13:09,290 This will give me back, in this case, the data frame but now 1358 01:13:09,290 --> 01:13:12,050 only the first column of that data frame. 1359 01:13:12,050 --> 01:13:14,270 So this, to be clear, is not a vector. 1360 01:13:14,270 --> 01:13:17,660 This is the same data frame, but now it's only 1361 01:13:17,660 --> 01:13:20,300 that single column of that data frame. 1362 01:13:20,300 --> 01:13:21,770 So be careful here. 1363 01:13:21,770 --> 01:13:23,870 You won't often use this notation, but you 1364 01:13:23,870 --> 01:13:29,180 could get back a data frame if you use that kind of notation overall. 1365 01:13:29,180 --> 01:13:32,180 A good question, and let's now keep going towards solving 1366 01:13:32,180 --> 01:13:34,400 our problem of tabulating this data. 1367 01:13:34,400 --> 01:13:42,110 So we saw before that simply adding the poll column and votes, and the mail 1368 01:13:42,110 --> 01:13:47,000 column and votes would give us the total number of votes for each candidate. 
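The difference between the two bracket forms discussed in this answer can be checked with a short sketch. The data frame is rebuilt inline as an assumption, and by_position and first_col are hypothetical names.

```r
# Rebuild the example data frame inline (assumption: the lecture loads it from a file)
votes <- data.frame(
  candidate = c("Mario", "Peach", "Bowser"),
  poll = c(37, 43, 84),
  mail = c(63, 107, 36),
  stringsAsFactors = FALSE
)

# With a comma, selecting a single column by position gives back a vector
by_position <- votes[, 2] + votes[, 3]
by_position          # 100 150 120
is.vector(votes[, 2])  # TRUE

# Without a comma, a single number gives back a one-column data frame
first_col <- votes[1]
is.data.frame(first_col)  # TRUE
ncol(first_col)           # 1
```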
1369 01:13:47,000 --> 01:13:51,740 But I would argue that if I run line 3 here, and I just get back some numbers, 1370 01:13:51,740 --> 01:13:55,520 it's not super clear which candidate these votes belong to. 1371 01:13:55,520 --> 01:14:00,890 So I can actually go ahead and add a new column to my data frame 1372 01:14:00,890 --> 01:14:05,660 that includes these particular values here, one for Mario, one for Peach, 1373 01:14:05,660 --> 01:14:06,710 and one for Bowser. 1374 01:14:06,710 --> 01:14:09,680 And as long as my vector is of the same number, 1375 01:14:09,680 --> 01:14:13,430 of the same length as my number of rows in my data frame, 1376 01:14:13,430 --> 01:14:18,590 I should be able to actually add this as a new column to my data frame. 1377 01:14:18,590 --> 01:14:23,570 Now to create this new column, I could simply kind of wish it into existence, 1378 01:14:23,570 --> 01:14:24,240 a bit like this. 1379 01:14:24,240 --> 01:14:27,080 I could say votes$total. 1380 01:14:27,080 --> 01:14:31,850 And I know there is no column called total, but I can make one. 1381 01:14:31,850 --> 01:14:37,370 I could say, let's assign this vector to be the new column 1382 01:14:37,370 --> 01:14:39,620 total in the votes data frame. 1383 01:14:39,620 --> 01:14:46,010 If I run this line, I'll see I don't get a result, but I do, at least in votes, 1384 01:14:46,010 --> 01:14:51,650 see there is a new column called total that now includes those same elements 1385 01:14:51,650 --> 01:14:55,780 in my vector, top to bottom. 1386 01:14:55,780 --> 01:14:58,030 So we've solved that problem. 1387 01:14:58,030 --> 01:15:02,770 And our next step might be to now save this file again, to keep track of it 1388 01:15:02,770 --> 01:15:04,300 later on, to share it with a friend. 1389 01:15:04,300 --> 01:15:08,170 And to do that, I could use the write.csv function. 1390 01:15:08,170 --> 01:15:10,060 We saw read.csv. 
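The column assignment described above can be sketched like this. The data frame is rebuilt inline as an assumption; the column name total is the one used in the lecture.

```r
# Rebuild the example data frame inline (assumption: the lecture loads it from a file)
votes <- data.frame(
  candidate = c("Mario", "Peach", "Bowser"),
  poll = c(37, 43, 84),
  mail = c(63, 107, 36),
  stringsAsFactors = FALSE
)

# Assigning to a column name that does not exist yet creates that column.
# The vector on the right has one element per row of the data frame.
votes$total <- votes$poll + votes$mail

colnames(votes)  # "candidate" "poll" "mail" "total"
votes$total      # 100 150 120
```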
1391 01:15:10,060 --> 01:15:15,580 But we also have write.csv, to actually save this file as a CSV. 1392 01:15:15,580 --> 01:15:16,490 Let me try it. 1393 01:15:16,490 --> 01:15:18,550 I'll say write.csv. 1394 01:15:18,550 --> 01:15:22,930 And I'll call this one, let's say, totals.csv, 1395 01:15:22,930 --> 01:15:24,850 for the total number of votes. 1396 01:15:24,850 --> 01:15:29,020 Now, write.csv actually needs two arguments. 1397 01:15:29,020 --> 01:15:31,690 One is the file name, like we just said here. 1398 01:15:31,690 --> 01:15:37,210 But even before that, it needs to know what data frame to write to the file. 1399 01:15:37,210 --> 01:15:39,850 So in this case, it's our votes data frame. 1400 01:15:39,850 --> 01:15:42,430 Now, by convention, in its documentation, 1401 01:15:42,430 --> 01:15:47,350 we see the first argument to write.csv is that data frame. 1402 01:15:47,350 --> 01:15:50,390 The second is the name of the file itself. 1403 01:15:50,390 --> 01:15:51,910 So let me clear my console. 1404 01:15:51,910 --> 01:15:54,160 Let me run write.csv. 1405 01:15:54,160 --> 01:15:58,900 And now, if I go to my file explorer, I should see down below-- 1406 01:15:58,900 --> 01:16:00,910 oops-- totals.csv. 1407 01:16:00,910 --> 01:16:03,820 And now let me View File. 1408 01:16:03,820 --> 01:16:07,000 And I should see something a bit like my data frame. 1409 01:16:07,000 --> 01:16:09,010 There's a few other features here, though. 1410 01:16:09,010 --> 01:16:15,970 I see that there's these numbers here, 1, 2, 3, and this empty number, 1411 01:16:15,970 --> 01:16:17,470 or this empty value up here. 1412 01:16:17,470 --> 01:16:21,850 These are, in fact, the row names of this data frame. 1413 01:16:21,850 --> 01:16:26,260 So not only do data frames have column names, like we saw here. 1414 01:16:26,260 --> 01:16:27,970 They also have row names.
1415 01:16:27,970 --> 01:16:32,260 And by default, they start counting at one and going to 2, then 3, then 4, 1416 01:16:32,260 --> 01:16:33,100 and so on. 1417 01:16:33,100 --> 01:16:37,030 And I can actually tell write.csv whether or not 1418 01:16:37,030 --> 01:16:40,180 I want those row names printed in my CSV. 1419 01:16:40,180 --> 01:16:43,480 You often won't, because it kind of makes a wacky format where 1420 01:16:43,480 --> 01:16:44,980 you have this empty value up here. 1421 01:16:44,980 --> 01:16:46,930 And it's probably just not worth it for you. 1422 01:16:46,930 --> 01:16:48,850 So you could specify the following. 1423 01:16:48,850 --> 01:16:54,610 You could say row.names, row.names = FALSE, 1424 01:16:54,610 --> 01:16:59,690 meaning that write.csv should not write any row names to the CSV, 1425 01:16:59,690 --> 01:17:02,180 even though they exist in the data frame. 1426 01:17:02,180 --> 01:17:03,980 So let me go ahead and clear my console. 1427 01:17:03,980 --> 01:17:05,630 I'll do Command-Enter here. 1428 01:17:05,630 --> 01:17:08,690 And now I should see, if I go back to totals.csv, 1429 01:17:08,690 --> 01:17:13,610 I now have a much cleaner CSV without those particular row names. 1430 01:17:13,610 --> 01:17:17,720 If I did, though, want to access those in R, I could use these functions. 1431 01:17:17,720 --> 01:17:23,630 I could say colnames(votes), which will return to me the columns I have inside 1432 01:17:23,630 --> 01:17:27,170 of my votes data frame, or ronames(votes), 1433 01:17:27,170 --> 01:17:32,720 which gives me access to the row names of this particular data frame as well. 1434 01:17:32,720 --> 01:17:33,320 All right. 1435 01:17:33,320 --> 01:17:36,680 So we've seen here how we've been able to save this data frame, 1436 01:17:36,680 --> 01:17:40,820 how to read data from a file and save it back to a file as well. 
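Putting the saving steps together, a minimal sketch might look like the following. Two things here are assumptions: the data frame is rebuilt inline rather than loaded, and the file is written to a temporary directory (the hypothetical path object) instead of the working directory used in the lecture.

```r
# Rebuild the example data frame inline (assumption: the lecture loads it from a file)
votes <- data.frame(
  candidate = c("Mario", "Peach", "Bowser"),
  poll = c(37, 43, 84),
  mail = c(63, 107, 36),
  stringsAsFactors = FALSE
)
votes$total <- votes$poll + votes$mail

# Assumption: write into a temporary directory so the sketch leaves no stray files
path <- file.path(tempdir(), "totals.csv")

# First argument is the data frame, second is the file name;
# row.names = FALSE leaves the 1, 2, 3 row names out of the file
write.csv(votes, path, row.names = FALSE)

# Reading the file back shows only the four real columns
saved <- read.csv(path)
colnames(saved)  # "candidate" "poll" "mail" "total"

# The row names still exist on the data frame in R
rownames(votes)  # "1" "2" "3"
```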
1437 01:17:40,820 --> 01:17:44,390 What we will do next is actually take advantage of online data sets 1438 01:17:44,390 --> 01:17:48,140 that other folks have written for us and use those to explore data in R as well. 1439 01:17:48,140 --> 01:17:49,700 See you all in five. 1440 01:17:49,700 --> 01:17:51,140 Well, we're back. 1441 01:17:51,140 --> 01:17:54,710 And so we've seen a few examples of how to use R. So far 1442 01:17:54,710 --> 01:17:56,990 we've seen how to take user input and actually 1443 01:17:56,990 --> 01:17:58,850 deal with that in our own programs. 1444 01:17:58,850 --> 01:18:03,470 We've also found how to use data we've put in our own file for ourselves. 1445 01:18:03,470 --> 01:18:07,520 But often, when you use R, you'll be working with not your own data 1446 01:18:07,520 --> 01:18:08,900 but somebody else's. 1447 01:18:08,900 --> 01:18:12,560 And so we'll do exactly that here for the rest of lecture. 1448 01:18:12,560 --> 01:18:14,660 Now, you might have heard of FiveThirtyEight, 1449 01:18:14,660 --> 01:18:16,370 as mentioned earlier in lecture. 1450 01:18:16,370 --> 01:18:18,050 They work a lot with data. 1451 01:18:18,050 --> 01:18:23,000 And they often put that data online for others to use and analyze themselves. 1452 01:18:23,000 --> 01:18:27,500 In fact, they worked on a poll that asked people in the US 1453 01:18:27,500 --> 01:18:29,210 about their voting habits. 1454 01:18:29,210 --> 01:18:30,560 Do they plan to vote? 1455 01:18:30,560 --> 01:18:32,150 How often do they vote? 1456 01:18:32,150 --> 01:18:34,670 What encourages them or motivates them to vote? 1457 01:18:34,670 --> 01:18:38,210 Or what are the reasons they actually don't vote in the first place? 1458 01:18:38,210 --> 01:18:43,580 So here I have the URL that actually includes this data file online, 1459 01:18:43,580 --> 01:18:45,260 on the internet somewhere else. 
1460 01:18:45,260 --> 01:18:51,530 And if you look at the end here, you'll see the file name nonvoters_data.csv. 1461 01:18:51,530 --> 01:18:55,130 So this tells me that this data, even though it is online, 1462 01:18:55,130 --> 01:19:00,110 is still stored in that same familiar file format called a CSV. 1463 01:19:00,110 --> 01:19:04,700 And now, on the second line of code, I can still use read.csv. 1464 01:19:04,700 --> 01:19:07,850 All I'm doing now, though, is telling read.csv 1465 01:19:07,850 --> 01:19:11,420 where to look, where to go find this data file on the internet, 1466 01:19:11,420 --> 01:19:15,500 and bring it down for me here in my own R environment. 1467 01:19:15,500 --> 01:19:17,390 And I'm storing the result now, that data 1468 01:19:17,390 --> 01:19:20,630 frame, as a data frame called voters. 1469 01:19:20,630 --> 01:19:22,970 So let's take a peek and see what's inside. 1470 01:19:22,970 --> 01:19:24,590 I'll come to RStudio over here. 1471 01:19:24,590 --> 01:19:28,760 And let me do the function View, View. 1472 01:19:28,760 --> 01:19:30,380 And I'll view voters. 1473 01:19:30,380 --> 01:19:32,300 And whoops. 1474 01:19:32,300 --> 01:19:34,680 Let me first run these lines of code. 1475 01:19:34,680 --> 01:19:38,270 So let me run line 1 to create a new object called url. 1476 01:19:38,270 --> 01:19:41,750 Let me run line 2 to actually load that data. 1477 01:19:41,750 --> 01:19:44,180 And now, once I have those objects loaded, 1478 01:19:44,180 --> 01:19:46,460 let me go ahead and view voters. 1479 01:19:46,460 --> 01:19:48,860 And this is the data frame. 1480 01:19:48,860 --> 01:19:50,150 It is a big one. 1481 01:19:50,150 --> 01:19:51,890 If we scroll all the way to the side, you 1482 01:19:51,890 --> 01:19:54,100 will see lots of numbers, lots of columns. 1483 01:19:54,100 --> 01:19:57,040 If you scroll down, you'll see lots and lots of rows. 
1484 01:19:57,040 --> 01:19:59,560 And one question you might have is, well, 1485 01:19:59,560 --> 01:20:03,670 just how many rows and columns are there in this data frame? 1486 01:20:03,670 --> 01:20:07,090 One way to answer that, at least in R code itself, 1487 01:20:07,090 --> 01:20:10,780 is to use these functions nrow and ncol, which 1488 01:20:10,780 --> 01:20:14,350 stands for number of rows or number of columns respectively. 1489 01:20:14,350 --> 01:20:20,050 So let me try nrow to get a sense of how many rows we have in this data frame. 1490 01:20:20,050 --> 01:20:21,880 We'll do nrow(voters). 1491 01:20:21,880 --> 01:20:28,600 And it seems like down on my console here, there are 5,836 rows of data. 1492 01:20:28,600 --> 01:20:33,730 Now, each row of data represents a particular voter or maybe a non-voter, 1493 01:20:33,730 --> 01:20:37,270 somebody who lives in the US and was asked to fill out this poll 1494 01:20:37,270 --> 01:20:40,990 where they talk about why they vote or why they don't vote. 1495 01:20:40,990 --> 01:20:44,710 But in the end, each row is its own voter. 1496 01:20:44,710 --> 01:20:48,430 Now let's see how many columns we have, ncol(voters). 1497 01:20:48,430 --> 01:20:53,380 It seems like we have 119 columns, which is a lot of columns. 1498 01:20:53,380 --> 01:20:58,100 Each column in this case represents a question asked to that voter. 1499 01:20:58,100 --> 01:21:03,710 And if I go back to the voters view here, I might see Q1, for instance. 1500 01:21:03,710 --> 01:21:11,120 It seems like the voter who was given the ID 470001 answered 1 1501 01:21:11,120 --> 01:21:12,650 to question one. 1502 01:21:12,650 --> 01:21:14,460 And so it's a little arcane right now. 1503 01:21:14,460 --> 01:21:16,610 But thankfully, FiveThirtyEight has given us 1504 01:21:16,610 --> 01:21:20,300 a tool we can use to interpret some of this data. 1505 01:21:20,300 --> 01:21:23,240 In fact, that tool is called a codebook. 
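As a rough illustration of nrow and ncol, the sketch below uses the small votes data frame from earlier in the lecture rather than the online data set, so the counts differ from the 5,836 rows and 119 columns mentioned above. Rebuilding the data frame inline is an assumption made to keep the example self-contained.

```r
# Small stand-in data frame (assumption: used here instead of the online data set)
votes <- data.frame(
  candidate = c("Mario", "Peach", "Bowser"),
  poll = c(37, 43, 84),
  mail = c(63, 107, 36),
  stringsAsFactors = FALSE
)

nrow(votes)  # 3 rows, one per candidate
ncol(votes)  # 3 columns: candidate, poll, mail
```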
1506 01:21:23,240 --> 01:21:25,670 And you might often use the same in your own work 1507 01:21:25,670 --> 01:21:28,670 with data analysis, where a codebook tells you 1508 01:21:28,670 --> 01:21:33,410 exactly what questions or columns mean in some particular context. 1509 01:21:33,410 --> 01:21:39,110 So in this case, I do see Q1 or Q2_1 or Q2_2. 1510 01:21:39,110 --> 01:21:42,500 And if I'm new to this data set, I don't know what the heck that means. 1511 01:21:42,500 --> 01:21:47,210 But I could look in the codebook that FiveThirtyEight gave me, in which case 1512 01:21:47,210 --> 01:21:52,010 they say Q1 asked participants this, Q2 asked participants this, 1513 01:21:52,010 --> 01:21:57,090 and so on and so forth for all questions we have inside our data set. 1514 01:21:57,090 --> 01:22:01,280 So let's actually take a peek at one particular column. 1515 01:22:01,280 --> 01:22:04,940 I know this column exists because I looked in the codebook to find it out. 1516 01:22:04,940 --> 01:22:09,560 There's a column called voter_category that actually tells us 1517 01:22:09,560 --> 01:22:13,370 how each voter represented themselves in terms of their voting habits. 1518 01:22:13,370 --> 01:22:16,220 So to access that column, as we saw before, I 1519 01:22:16,220 --> 01:22:20,240 could use voters, the name of the data frame, followed by dollar sign, 1520 01:22:20,240 --> 01:22:23,600 followed by the name of the column, voter_category. 1521 01:22:23,600 --> 01:22:28,430 And if I hit Command-Enter here, well, I'll see at least some 1522 01:22:28,430 --> 01:22:32,150 of the results from that column, some of the elements here. 1523 01:22:32,150 --> 01:22:37,220 And now you'll actually see why that bracket 1 that we saw initially 1524 01:22:37,220 --> 01:22:39,140 is kind of useful now. 
1525 01:22:39,140 --> 01:22:41,360 As we work with bigger and bigger data sets, 1526 01:22:41,360 --> 01:22:43,910 notice that that bracket notation tells us 1527 01:22:43,910 --> 01:22:48,050 exactly where in our list or our vector we are. 1528 01:22:48,050 --> 01:22:51,140 Notice how this one here means, this "always", 1529 01:22:51,140 --> 01:22:55,280 that is the first element in our vector of all of these responses. 1530 01:22:55,280 --> 01:22:58,028 This "always", well, that's the seventh element 1531 01:22:58,028 --> 01:22:59,570 in our vector of all these responses. 1532 01:22:59,570 --> 01:23:02,120 And same thing for this "rarely/never", that 1533 01:23:02,120 --> 01:23:05,720 is the 13th element, and so on and so forth, all the way down. 1534 01:23:05,720 --> 01:23:10,340 Now, there's more than what looks to be 900 or about 1,000 1535 01:23:10,340 --> 01:23:11,660 entries in our vector here. 1536 01:23:11,660 --> 01:23:13,410 We just can't print all of them because it 1537 01:23:13,410 --> 01:23:16,520 would be a really, really long output to our console. 1538 01:23:16,520 --> 01:23:17,870 Now, I'm curious. 1539 01:23:17,870 --> 01:23:22,850 I saw a lot of values here, but I want to know exactly what ways voters 1540 01:23:22,850 --> 01:23:24,800 could have categorized themselves. 1541 01:23:24,800 --> 01:23:29,030 Like, out of all of these responses, what are the unique ones? 1542 01:23:29,030 --> 01:23:32,420 And it turns out R has a function called unique 1543 01:23:32,420 --> 01:23:34,460 that I can use to actually figure out what 1544 01:23:34,460 --> 01:23:37,760 are those unique values in this vector. 1545 01:23:37,760 --> 01:23:43,070 So let me give this vector called voter_category, or the column 1546 01:23:43,070 --> 01:23:48,300 voter_category, from voters to this function called unique, just like this. 1547 01:23:48,300 --> 01:23:50,720 I'll do Command-Enter to run this line. 
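The call being run here can be sketched on a small stand-in vector. The vector responses below is hypothetical and far shorter than the real voter_category column; it only reuses the category labels seen in the output above.

```r
# Hypothetical stand-in for voters$voter_category (assumption: made-up order and length)
responses <- c("always", "sporadic", "always", "rarely/never", "sporadic", "always")

# unique drops repeated values and keeps each distinct value once,
# in order of first appearance
categories <- unique(responses)

length(responses)   # 6
length(categories)  # 3
```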
1548 01:23:50,720 --> 01:23:54,620 And now I'll see I only get back three elements, 1549 01:23:54,620 --> 01:23:58,210 "always", "sporadic", or "rarely/never". 1550 01:23:58,210 --> 01:24:01,210 So it seems like voters could have categorized themselves 1551 01:24:01,210 --> 01:24:02,860 in one of three categories. 1552 01:24:02,860 --> 01:24:04,150 They always vote. 1553 01:24:04,150 --> 01:24:07,090 They sporadically or occasionally vote. 1554 01:24:07,090 --> 01:24:12,670 Or they rarely/never vote at all, so interesting data here. 1555 01:24:12,670 --> 01:24:16,360 There's another actual column I'm interested in, too, which 1556 01:24:16,360 --> 01:24:18,520 happens to be question 22. 1557 01:24:18,520 --> 01:24:21,880 So if I look at that here, I'll type voters 1558 01:24:21,880 --> 01:24:25,180 question 22 and hit Command-Enter. 1559 01:24:25,180 --> 01:24:29,950 And here, I'll see all the responses for this particular question. 1560 01:24:29,950 --> 01:24:35,428 And it seems to me like I'm seeing some numbers but also these NAs. 1561 01:24:35,428 --> 01:24:37,720 And before we dive into that, let's take a peek at what 1562 01:24:37,720 --> 01:24:40,180 this question is actually asking users. 1563 01:24:40,180 --> 01:24:44,110 So it turns out this question 22 is asking users this. 1564 01:24:44,110 --> 01:24:47,950 You previously indicated that you are not registered to vote. 1565 01:24:47,950 --> 01:24:50,500 Which of the following reasons best describes 1566 01:24:50,500 --> 01:24:52,900 why you are not registered to vote? 1567 01:24:52,900 --> 01:24:56,870 And users were given several responses, one through seven, I believe, 1568 01:24:56,870 --> 01:24:59,000 where one was, I don't have time. 1569 01:24:59,000 --> 01:25:00,860 Two was, I don't trust the political system. 1570 01:25:00,860 --> 01:25:04,560 Three was, I don't know, and so on and so forth.
1571 01:25:04,560 --> 01:25:09,290 And it seems to me like when we saw this data set, we had some numbers here, 1572 01:25:09,290 --> 01:25:13,670 one through seven, but we also had a lot of these NAs. 1573 01:25:13,670 --> 01:25:18,290 And I'm curious if you have a sense of what those NAs might be. 1574 01:25:18,290 --> 01:25:20,960 They're certainly not any of these options. 1575 01:25:20,960 --> 01:25:23,750 What do you think those NAs might represent 1576 01:25:23,750 --> 01:25:27,050 in this particular column of data here? 1577 01:25:27,050 --> 01:25:31,545 AUDIENCE: Not answered or not [? customized ?] to be in the list. 1578 01:25:31,545 --> 01:25:34,670 CARTER ZENKE: A great guess, so maybe it means something like not answered, 1579 01:25:34,670 --> 01:25:36,687 or there's just no data there. 1580 01:25:36,687 --> 01:25:39,020 And in fact, if we look at the context of this question, 1581 01:25:39,020 --> 01:25:41,060 we can actually figure it out like you did. 1582 01:25:41,060 --> 01:25:45,020 It says, "You previously indicated that you are not registered to vote." 1583 01:25:45,020 --> 01:25:49,130 So in the US, we often have to register ourselves to be able to vote. 1584 01:25:49,130 --> 01:25:52,160 But some people have done that, and some people haven't. 1585 01:25:52,160 --> 01:25:56,670 It seems like this question was asked to only those participants who 1586 01:25:56,670 --> 01:26:00,600 did not register to vote or who were not currently registered to vote. 1587 01:26:00,600 --> 01:26:04,170 So it would make sense then that people who actually are registered 1588 01:26:04,170 --> 01:26:07,020 couldn't have answered that question, and in which case, 1589 01:26:07,020 --> 01:26:11,220 we don't want to have an empty string or some sort of number there. 1590 01:26:11,220 --> 01:26:14,850 We could instead use one of R's special values. 1591 01:26:14,850 --> 01:26:18,060 Now, R comes with these special values to indicate something very special.
1592 01:26:18,060 --> 01:26:21,540 In this case, NA means not available. 1593 01:26:21,540 --> 01:26:24,090 There could be data here, but there isn't. 1594 01:26:24,090 --> 01:26:26,070 That's what NA in particular means. 1595 01:26:26,070 --> 01:26:27,630 There are others, too. 1596 01:26:27,630 --> 01:26:30,120 You might see Inf and -Inf. 1597 01:26:30,120 --> 01:26:33,000 These mean infinity or negative infinity, 1598 01:26:33,000 --> 01:26:37,170 numbers that are so big R can't represent them appropriately well. 1599 01:26:37,170 --> 01:26:40,050 We saw NA here for not available. 1600 01:26:40,050 --> 01:26:45,810 We also have NaN, which stands for not a number. 1601 01:26:45,810 --> 01:26:47,970 You could think of if I tried to ask you, well, 1602 01:26:47,970 --> 01:26:50,965 what is infinity divided by infinity? 1603 01:26:50,965 --> 01:26:52,590 And you'd probably look at me confused. 1604 01:26:52,590 --> 01:26:53,850 You would say, that's not a number. 1605 01:26:53,850 --> 01:26:54,420 And I would say, yes. 1606 01:26:54,420 --> 01:26:55,420 That's actually correct. 1607 01:26:55,420 --> 01:26:58,980 R says NaN is that result in this particular case. 1608 01:26:58,980 --> 01:27:03,950 And we also have this value called NULL, capital N-U-L-L, 1609 01:27:03,950 --> 01:27:08,280 which is similar in spirit to NA but slightly different. 1610 01:27:08,280 --> 01:27:11,370 NA, as we said before, stands for not available. 1611 01:27:11,370 --> 01:27:13,860 There could be data here, but there isn't. 1612 01:27:13,860 --> 01:27:17,850 NULL is simply a special value meaning absolutely nothing. 1613 01:27:17,850 --> 01:27:22,770 I can have a vector of NAs, meaning that maybe I have five NAs in this vector, 1614 01:27:22,770 --> 01:27:27,300 there are five places data could be, but I can't have a vector of NULL. 1615 01:27:27,300 --> 01:27:28,645 NULL is literally nothing.
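A short sketch of the special values just described. The responses vector is made up for illustration and is not taken from the voters data; the functions used (is.na, is.infinite, is.nan, is.null, length) are base R.

```r
# NA: a slot where data could be, but isn't. It still occupies a position.
responses <- c(1, NA, 3, NA, 7)   # hypothetical responses
missing_count <- sum(is.na(responses))
total_slots <- length(responses)

# Inf and -Inf: what R gives for results too large to represent, such as 1 / 0.
pos_inf <- 1 / 0
neg_inf <- -1 / 0

# NaN: "not a number", for example infinity divided by infinity.
not_a_number <- Inf / Inf

# NULL: nothing at all. Combining NULLs does not produce a vector of NULLs.
nothing <- c(NULL, NULL)

print(missing_count)
print(total_slots)
print(c(pos_inf, neg_inf, not_a_number))
print(is.null(nothing))
print(length(nothing))
```

The contrast to notice is in the lengths: the vector with two NAs still has five positions, while combining NULLs leaves something of length zero.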
1616 01:27:28,645 --> 01:27:31,020 It means absolutely nothing at all, so different slightly 1617 01:27:31,020 --> 01:27:33,490 from NA in this case. 1618 01:27:33,490 --> 01:27:37,410 So let's go back to our RStudio here, and let's 1619 01:27:37,410 --> 01:27:41,050 take a peek at this particular column. 1620 01:27:41,050 --> 01:27:43,230 So it seems like I have a lot of NAs. 1621 01:27:43,230 --> 01:27:47,010 And one thing I'm interested in is, what kinds of unique values 1622 01:27:47,010 --> 01:27:50,350 are we seeing inside of this particular column? 1623 01:27:50,350 --> 01:27:54,300 So I could use unique again and hit Command-Enter. 1624 01:27:54,300 --> 01:27:57,030 And now I'll see, if I clear my terminal, run it again, 1625 01:27:57,030 --> 01:28:00,450 I'll see clearly that I have several different possible values here. 1626 01:28:00,450 --> 01:28:06,240 I have NA, 7, 6, 2, negative 1, 1, 4, 5, and 3. 1627 01:28:06,240 --> 01:28:07,530 And this makes sense. 1628 01:28:07,530 --> 01:28:11,580 We saw earlier that people were given several options, 1 through 7. 1629 01:28:11,580 --> 01:28:15,960 And we have people who didn't respond, either with NA or 1 through 7 1630 01:28:15,960 --> 01:28:17,430 for those other responses there. 1631 01:28:17,430 --> 01:28:20,400 What we don't have actually is negative 1. 1632 01:28:20,400 --> 01:28:23,670 And I actually tried kind of hard to figure out why negative 1 is here, 1633 01:28:23,670 --> 01:28:24,330 and I couldn't. 1634 01:28:24,330 --> 01:28:25,800 So sometimes data is just messy. 1635 01:28:25,800 --> 01:28:28,222 There are values you don't know and can't deal with. 1636 01:28:28,222 --> 01:28:30,180 So just keep that in mind as you work with data 1637 01:28:30,180 --> 01:28:33,420 sets other people have given you. 1638 01:28:33,420 --> 01:28:37,560 Now, one more column that I found interesting was this one. 1639 01:28:37,560 --> 01:28:42,630 So it was voters question 21, question 21. 
1640 01:28:42,630 --> 01:28:45,660 And the question 21 asked this. 1641 01:28:45,660 --> 01:28:49,010 It said, "There will be an election for president, members of the US Senate, 1642 01:28:49,010 --> 01:28:51,510 House of Representatives, and other state and local offices. 1643 01:28:51,510 --> 01:28:54,690 Do you plan to vote in that election?" 1644 01:28:54,690 --> 01:28:57,990 And it gave people a few potential options here. 1645 01:28:57,990 --> 01:28:59,820 1 meant yes. 1646 01:28:59,820 --> 01:29:01,320 2 meant no. 1647 01:29:01,320 --> 01:29:05,340 3 meant unsure or undecided; I don't know if I'll vote in that election. 1648 01:29:05,340 --> 01:29:07,980 So let's see what we actually got back as responses 1649 01:29:07,980 --> 01:29:09,660 for this particular question. 1650 01:29:09,660 --> 01:29:11,970 I'll come back to RStudio here. 1651 01:29:11,970 --> 01:29:16,920 And why don't we take a peek at voters question 21. 1652 01:29:16,920 --> 01:29:18,330 I'll hit Command-Enter. 1653 01:29:18,330 --> 01:29:24,730 And here is the first 1,000 or so entries to this particular question. 1654 01:29:24,730 --> 01:29:28,170 So we see a lot of 1's, which could mean yeses. 1655 01:29:28,170 --> 01:29:32,460 I see some 2's, which could mean no, 3, unsure, undecided. 1656 01:29:32,460 --> 01:29:35,070 I do see a negative 1 down here, which seems to be showing up 1657 01:29:35,070 --> 01:29:36,450 for some unknown reason. 1658 01:29:36,450 --> 01:29:39,240 But we could get a better sense if I used unique. 1659 01:29:39,240 --> 01:29:41,820 So I'll use unique voters question 21. 1660 01:29:41,820 --> 01:29:48,510 And now I'll see, again, my unique responses are 1, 2, 3, or negative 1. 
1661 01:29:48,510 --> 01:29:53,310 So at this point I'm kind of getting tired of looking at my vector 1662 01:29:53,310 --> 01:29:58,810 here and trying to convert in my head between these 1's, 2's, 3's, 1663 01:29:58,810 --> 01:30:03,040 negative 1's, to the actual response these participants gave. 1664 01:30:03,040 --> 01:30:08,110 And it turns out there is a way in R to make it much easier for myself to work 1665 01:30:08,110 --> 01:30:10,150 with data that's exactly like this. 1666 01:30:10,150 --> 01:30:14,230 This data has some set number of categories of responses, 1667 01:30:14,230 --> 01:30:15,640 as we saw with unique here. 1668 01:30:15,640 --> 01:30:20,110 The categories are, in this case, 1, 2, 3, or negative 1. 1669 01:30:20,110 --> 01:30:23,920 Those are the only possible values that could be stored in this vector. 1670 01:30:23,920 --> 01:30:27,460 And we also know that 1, 2, and 3 correspond 1671 01:30:27,460 --> 01:30:30,700 to some real-life, actual phrase that participants actually 1672 01:30:30,700 --> 01:30:33,730 responded to in this poll. 1673 01:30:33,730 --> 01:30:37,390 So to help us represent this kind of data, 1674 01:30:37,390 --> 01:30:41,425 we can use something that R calls a factor, a factor. 1675 01:30:41,425 --> 01:30:45,100 Now, I can create a factor using the factor function. 1676 01:30:45,100 --> 01:30:48,940 And a factor function takes a vector, like the one I just have, 1677 01:30:48,940 --> 01:30:52,600 the one I just had here, and converts it into a factor. 1678 01:30:52,600 --> 01:30:53,520 So let's try that. 1679 01:30:53,520 --> 01:30:57,930 I'll say factor(voters$Q21), just like this. 1680 01:30:57,930 --> 01:31:00,050 And I'll hit Enter again, Command-Enter. 1681 01:31:00,050 --> 01:31:02,060 And I'll see a very similar thing. 1682 01:31:02,060 --> 01:31:03,860 If I scroll up, I have all the same data. 1683 01:31:03,860 --> 01:31:09,530 Everything is just as it was, but now down below, I see levels, levels. 
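A minimal sketch of what factor does with a vector like this one. The q21 vector below is a small hypothetical sample, not the real voters$Q21 column.

```r
# Hypothetical sample of coded responses (1, 2, 3, and the unexplained -1).
q21 <- c(1, 2, 1, 3, -1, 1)

# factor() finds the unique values and stores them, sorted, as the levels.
q21_factor <- factor(q21)

print(q21_factor)
q21_levels <- levels(q21_factor)
print(q21_levels)
```

The printed factor shows the same data as before, followed by a Levels line listing the distinct categories. Note that levels are stored as character strings, even when the original values were numbers.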
1684 01:31:09,530 --> 01:31:11,120 And what does that look like to you? 1685 01:31:11,120 --> 01:31:14,690 Well, we just saw before, we have several unique categories 1686 01:31:14,690 --> 01:31:18,590 of data inside this vector, negative 1, 1, 2, and 3. 1687 01:31:18,590 --> 01:31:22,430 And it seems like this factor took those unique values 1688 01:31:22,430 --> 01:31:26,810 and made them this factor's levels, so the unique possible categories of data 1689 01:31:26,810 --> 01:31:29,460 inside of this factor here. 1690 01:31:29,460 --> 01:31:33,260 So factors are great for representing this categorical data, data that can 1691 01:31:33,260 --> 01:31:37,100 actually be in one category or another. 1692 01:31:37,100 --> 01:31:40,760 Now, one thing I could do is try to make it easier 1693 01:31:40,760 --> 01:31:44,000 to read these categories as English phrases. 1694 01:31:44,000 --> 01:31:48,080 And to do that, I could pass factor another argument, 1695 01:31:48,080 --> 01:31:49,910 this one called labels. 1696 01:31:49,910 --> 01:31:54,530 So not only do I give it my vector that has certain categories of data. 1697 01:31:54,530 --> 01:31:59,540 I also give it some labels to apply to each of those categories, one 1698 01:31:59,540 --> 01:32:00,870 after the other. 1699 01:32:00,870 --> 01:32:04,940 So here, let me use labels as the argument. 1700 01:32:04,940 --> 01:32:07,470 And let me give it a vector itself. 1701 01:32:07,470 --> 01:32:10,520 So we saw before, down below, that we have 1702 01:32:10,520 --> 01:32:16,490 a set of levels, number of categories in this data, negative 1, 1, 2, and 3. 1703 01:32:16,490 --> 01:32:20,540 But now I could actually give those levels, those categories, 1704 01:32:20,540 --> 01:32:21,960 some particular name. 1705 01:32:21,960 --> 01:32:23,360 So I'll give it a vector itself. 1706 01:32:23,360 --> 01:32:27,620 And I'll make a vector in R using this c function, which stands for Combine.
1707 01:32:27,620 --> 01:32:31,760 I'll create this vector where negative 1, let's say, maps to 1708 01:32:31,760 --> 01:32:34,790 or corresponds to this value as a question mark. 1709 01:32:34,790 --> 01:32:36,200 I don't know what the heck it is. 1710 01:32:36,200 --> 01:32:39,080 Maybe 1, though, we saw was yes. 1711 01:32:39,080 --> 01:32:41,300 And 2 was no. 1712 01:32:41,300 --> 01:32:46,080 And 3 was unsure or undecided, exactly like this. 1713 01:32:46,080 --> 01:32:49,430 So now notice that when I look at my levels, 1714 01:32:49,430 --> 01:32:53,090 negative 1, that corresponds to the first label, question mark. 1715 01:32:53,090 --> 01:32:53,960 What the heck is it? 1716 01:32:53,960 --> 01:32:54,800 I don't know. 1717 01:32:54,800 --> 01:32:58,520 Then, yes, that corresponds to level 1. 1718 01:32:58,520 --> 01:33:00,680 Then level 2 corresponds to no. 1719 01:33:00,680 --> 01:33:03,830 Level 3 corresponds to unsure or undecided. 1720 01:33:03,830 --> 01:33:08,300 And now, if I clear my console and run this particular line of code, 1721 01:33:08,300 --> 01:33:10,460 well, now I have something much more user-friendly. 1722 01:33:10,460 --> 01:33:15,050 I can actually see not 1's, 2's, and 3's but now labels for those 1723 01:33:15,050 --> 01:33:16,310 1's, 2's, and 3's. 1724 01:33:16,310 --> 01:33:20,090 And here I see Yes, Unsure/Undecided, No. 1725 01:33:20,090 --> 01:33:22,830 Let me see if I can find a question mark here. 1726 01:33:22,830 --> 01:33:23,330 Yep. 1727 01:33:23,330 --> 01:33:23,830 There's one. 1728 01:33:23,830 --> 01:33:26,270 So 517 seems to be a question mark. 1729 01:33:26,270 --> 01:33:31,910 So now I've converted this vector into my own factor. 1730 01:33:31,910 --> 01:33:35,930 Now, it's also not a good idea to have data we don't know what to do with. 1731 01:33:35,930 --> 01:33:38,900 Like I said, this negative 1, I really don't know where it came from.
1732 01:33:38,900 --> 01:33:42,200 And in that case, I could tell this factor 1733 01:33:42,200 --> 01:33:45,650 when it's being created to exclude that value altogether. 1734 01:33:45,650 --> 01:33:49,340 I could use the exclude parameter, the argument here, 1735 01:33:49,340 --> 01:33:53,097 and say I want to exclude some vector of given values that 1736 01:33:53,097 --> 01:33:55,680 are part of this vector that I want to just take out altogether. 1737 01:33:55,680 --> 01:33:58,970 So I'll say negative 1 is the value I want to exclude. 1738 01:33:58,970 --> 01:34:00,740 This is a vector of length 1. 1739 01:34:00,740 --> 01:34:04,400 But the value I want to exclude in this case is negative 1. 1740 01:34:04,400 --> 01:34:07,400 And now I can actually remove this label here 1741 01:34:07,400 --> 01:34:11,840 because my only categories will now be Yes, No, Unsure/Undecided, 1742 01:34:11,840 --> 01:34:15,020 or underneath the hood, that 1, 2, or 3. 1743 01:34:15,020 --> 01:34:16,670 So let me clear my console again. 1744 01:34:16,670 --> 01:34:18,080 Let me rerun factor here. 1745 01:34:18,080 --> 01:34:19,610 And now let me see. 1746 01:34:19,610 --> 01:34:22,010 Well, now my data is getting much cleaner. 1747 01:34:22,010 --> 01:34:24,320 I have Yes, No, Unsure/Undecided. 1748 01:34:24,320 --> 01:34:28,670 And if I go back to that 517 that we saw a question mark in up here-- 1749 01:34:28,670 --> 01:34:30,560 let me see if I can find it again. 1750 01:34:30,560 --> 01:34:36,320 517, almost there, I see now it's an NA, which stands for, 1751 01:34:36,320 --> 01:34:38,180 of course, not available. 1752 01:34:38,180 --> 01:34:42,050 We excluded it altogether from this particular factor. 1753 01:34:42,050 --> 01:34:47,570 So what questions do we have now on taking vectors and converting them 1754 01:34:47,570 --> 01:34:50,570 to these factors that involve categories of data 1755 01:34:50,570 --> 01:34:53,840 and labeling those categories of data?
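The labels and exclude steps above can be sketched as follows. As before, q21 is a small hypothetical sample standing in for voters$Q21; the label text follows the lecture.

```r
# Hypothetical sample of coded responses, including the unexplained -1.
q21 <- c(1, 2, 1, 3, -1, 1)

# labels are applied to the sorted levels in order: -1, 1, 2, 3.
labeled <- factor(q21, labels = c("?", "Yes", "No", "Unsure/Undecided"))
print(labeled)

# exclude drops -1 from the levels, so only three labels are needed.
# Any element that held an excluded value becomes NA in the factor.
cleaned <- factor(
  q21,
  labels = c("Yes", "No", "Unsure/Undecided"),
  exclude = c(-1)
)
print(cleaned)
```

One design point worth noticing: the number of labels has to match the number of levels that remain, which is why the "?" label is removed once -1 is excluded.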
1756 01:34:53,840 --> 01:34:58,320 AUDIENCE: Why have we used factor instead of as.factor like before? 1757 01:34:58,320 --> 01:35:01,070 CARTER ZENKE: So we've just seen here this function called factor. 1758 01:35:01,070 --> 01:35:04,400 But there is a function called as.factor, 1759 01:35:04,400 --> 01:35:08,300 similar to what we saw before, as.character, or as.integer, 1760 01:35:08,300 --> 01:35:09,530 or as.double. 1761 01:35:09,530 --> 01:35:12,860 We could use, I think, either one to get the same result. 1762 01:35:12,860 --> 01:35:15,468 In general, though-- let me look at the documentation. 1763 01:35:15,468 --> 01:35:16,760 Let me go back to RStudio here. 1764 01:35:16,760 --> 01:35:20,300 Let me pull up the documentation for factor. 1765 01:35:20,300 --> 01:35:21,980 Let me clear my console. 1766 01:35:21,980 --> 01:35:23,810 Let me do ?factor. 1767 01:35:23,810 --> 01:35:28,250 So this is, here, how we're supposed to use that factor function. 1768 01:35:28,250 --> 01:35:32,900 We can see here some of the arguments, the actual parameters I gave, 1769 01:35:32,900 --> 01:35:35,760 like the labels here as well. 1770 01:35:35,760 --> 01:35:40,640 Let's now try to find the documentation for as.factor, as.factor. 1771 01:35:40,640 --> 01:35:44,060 Let me go back over here, type as.factor. 1772 01:35:44,060 --> 01:35:46,340 And now I'll actually see-- 1773 01:35:46,340 --> 01:35:48,170 I'll see the same documentation page. 1774 01:35:48,170 --> 01:35:51,770 And so, to me, this is symbolizing that these are very closely related. 1775 01:35:51,770 --> 01:35:57,383 If I were to scroll down here, I could see, I think I see it here, is.factor, 1776 01:35:57,383 --> 01:36:00,300 as.factor are the membership and coercion functions for these classes. 1777 01:36:00,300 --> 01:36:04,560 So essentially, if you want to make a new factor from a vector, 1778 01:36:04,560 --> 01:36:05,730 you would use factor.
1779 01:36:05,730 --> 01:36:08,880 If you have something that you want to convert to a factor that is already 1780 01:36:08,880 --> 01:36:11,130 something else entirely, like maybe not even a vector, 1781 01:36:11,130 --> 01:36:13,590 you might be able to use as.factor. 1782 01:36:13,590 --> 01:36:15,420 So I hope that at least gives you some idea 1783 01:36:15,420 --> 01:36:21,270 of the difference between as.factor and factor, as the function itself. 1784 01:36:21,270 --> 01:36:21,810 All right. 1785 01:36:21,810 --> 01:36:24,120 What other questions do we have? 1786 01:36:24,120 --> 01:36:27,050 Let's take a few more. 1787 01:36:27,050 --> 01:36:30,320 AUDIENCE: How is a factor different from a table? 1788 01:36:30,320 --> 01:36:34,130 CARTER ZENKE: A good question, so we saw that a table has 1789 01:36:34,130 --> 01:36:35,900 rows and columns in it. 1790 01:36:35,900 --> 01:36:39,890 A factor, though, is simply a list of data, a bit like a vector. 1791 01:36:39,890 --> 01:36:40,765 It's one-dimensional. 1792 01:36:40,765 --> 01:36:43,473 And let me show you some slides that can hopefully, actually help 1793 01:36:43,473 --> 01:36:45,020 you visualize factors in general. 1794 01:36:45,020 --> 01:36:49,610 So if I come back over here, let me find the right slide to go to. 1795 01:36:49,610 --> 01:36:53,210 Here, in general, is this idea of a factor. 1796 01:36:53,210 --> 01:36:56,120 So we had, of course, this vector that was previously 1797 01:36:56,120 --> 01:36:59,690 only 1's, 2's, 3's, and 2's and 1's and negative 1. 1798 01:36:59,690 --> 01:37:02,240 But when we converted it to a factor, we did the following. 1799 01:37:02,240 --> 01:37:05,870 We found the unique categories of data in this vector, 1800 01:37:05,870 --> 01:37:09,200 just like this, 1, 2, 3, and negative 1. 1801 01:37:09,200 --> 01:37:13,808 What we then did is we sought out those categories 1802 01:37:13,808 --> 01:37:15,350 and called them, essentially, levels. 
1803 01:37:15,350 --> 01:37:17,420 The categories of our data are the levels here. 1804 01:37:17,420 --> 01:37:21,230 And we applied some label to each of those levels, 1805 01:37:21,230 --> 01:37:23,900 basically saying negative 1 becomes "?". 1806 01:37:23,900 --> 01:37:24,890 1 becomes "Yes". 1807 01:37:24,890 --> 01:37:25,650 2 becomes "No". 1808 01:37:25,650 --> 01:37:27,930 3 becomes "Unsure/Undecided". 1809 01:37:27,930 --> 01:37:32,665 Now notice that this factor here is still a vector. 1810 01:37:32,665 --> 01:37:35,040 Well, still kind of one-dimensional, it's a list of data. 1811 01:37:35,040 --> 01:37:37,623 It's not two-dimensional like a data frame was or a table was. 1812 01:37:37,623 --> 01:37:39,000 It's only one-dimensional. 1813 01:37:39,000 --> 01:37:42,690 And when we apply these labels, we convert these 1's, 2's, 3's, negative 1814 01:37:42,690 --> 01:37:46,645 1's instead to "Yes", "No", "Unsure", and "?". 1815 01:37:46,645 --> 01:37:49,020 And if we exclude something, like we did with negative 1, 1816 01:37:49,020 --> 01:37:53,590 we would then get NA instead of negative 1 in the end. 1817 01:37:53,590 --> 01:37:56,530 So I hope that helps at least a little bit answering that question. 1818 01:37:56,530 --> 01:37:58,080 Let's take one more here. 1819 01:37:58,080 --> 01:38:01,320 AUDIENCE: I wanted to understand whether-- 1820 01:38:01,320 --> 01:38:06,750 just like you showed how to exclude a negative value, what if I want to-- 1821 01:38:06,750 --> 01:38:09,360 there are a number of negative values in the data, 1822 01:38:09,360 --> 01:38:13,170 and I want to exclude all of the negative values. 1823 01:38:13,170 --> 01:38:15,450 CARTER ZENKE: Ah, a good question, so the first part 1824 01:38:15,450 --> 01:38:18,780 is you want to represent maybe negative values, and how do you do that. 1825 01:38:18,780 --> 01:38:22,530 And the second part is how would we exclude those negative values, 1826 01:38:22,530 --> 01:38:23,820 let's say. 
1827 01:38:23,820 --> 01:38:27,780 Another kind of similar case is, let's say we have lots of NA values. 1828 01:38:27,780 --> 01:38:29,440 How do we get rid of them? 1829 01:38:29,440 --> 01:38:31,950 So let me answer your first one and actually tee up 1830 01:38:31,950 --> 01:38:33,990 that second question for next lecture. 1831 01:38:33,990 --> 01:38:36,000 So come back to RStudio. 1832 01:38:36,000 --> 01:38:39,450 And let's try representing some negative values here. 1833 01:38:39,450 --> 01:38:42,270 We saw before that I had this idea of negative 1. 1834 01:38:42,270 --> 01:38:45,180 If I want to have some negative value in RStudio, 1835 01:38:45,180 --> 01:38:49,210 I could simply put a negative or a dash in front of any particular value. 1836 01:38:49,210 --> 01:38:51,750 So here, let's say, is negative 1. 1837 01:38:51,750 --> 01:38:55,360 But to your question then of how we could exclude these values 1838 01:38:55,360 --> 01:38:59,130 we don't want, that's a problem of filtering, or subsetting our data, 1839 01:38:59,130 --> 01:39:01,050 which we'll actually learn about next lecture. 1840 01:39:01,050 --> 01:39:03,780 So we've seen so far how to represent data 1841 01:39:03,780 --> 01:39:08,490 in R, how to take all kinds of data like votes, tables of votes, 1842 01:39:08,490 --> 01:39:11,280 and represent them in R to manipulate them and so on. 1843 01:39:11,280 --> 01:39:13,830 What we'll see next time is how to actually transform 1844 01:39:13,830 --> 01:39:16,860 that data, removing values we don't want, adding data we do. 1845 01:39:16,860 --> 01:39:20,180 And for that, we'll see you all next time. 1846 01:39:20,180 --> 01:39:21,000