1 00:00:00,000 --> 00:00:03,332 [MUSIC PLAYING] 2 00:00:03,332 --> 00:00:05,533 3 00:00:05,533 --> 00:00:06,950 SPEAKER: Well, hello, one and all. 4 00:00:06,950 --> 00:00:09,720 And welcome to our short on data frames. 5 00:00:09,720 --> 00:00:12,320 Now data frames are a convenient way in R 6 00:00:12,320 --> 00:00:15,660 to store data in terms of rows and columns. 7 00:00:15,660 --> 00:00:17,780 So I have here a table-- 8 00:00:17,780 --> 00:00:22,020 really a data frame that stores data in rows and columns. 9 00:00:22,020 --> 00:00:26,460 I have here two columns, one called name and one called distance. 10 00:00:26,460 --> 00:00:28,910 And each of my rows represents, in this case, 11 00:00:28,910 --> 00:00:33,690 a man-made spacecraft exploring the furthest reaches of outer space. 12 00:00:33,690 --> 00:00:37,430 So I have here Voyager 1, which is a probe that is currently 13 00:00:37,430 --> 00:00:42,830 about 163 astronomical units away from Earth, Voyager 2, which 14 00:00:42,830 --> 00:00:45,620 is 136 astronomical units away from Earth, 15 00:00:45,620 --> 00:00:49,970 and Pioneer 10, which is, right now, about 80 astronomical units away 16 00:00:49,970 --> 00:00:50,960 from Earth. 17 00:00:50,960 --> 00:00:54,560 So let's say I wanted to take a table like this 18 00:00:54,560 --> 00:00:58,700 and actually create it in R for manipulation, to use in my programs, 19 00:00:58,700 --> 00:00:59,520 and so on. 20 00:00:59,520 --> 00:01:04,080 We'll see how we can do just that using a function called data.frame. 21 00:01:04,080 --> 00:01:07,770 So I have here a program called spacecraft.R. 22 00:01:07,770 --> 00:01:10,530 And my goal, first and foremost, is to really create 23 00:01:10,530 --> 00:01:14,690 that same table we just saw now here in R. Like I 24 00:01:14,690 --> 00:01:19,560 said before, we can use a function built into R called data.frame.
25 00:01:19,560 --> 00:01:23,910 It can conveniently create data frames for us based on some given vectors 26 00:01:23,910 --> 00:01:25,600 we provide as input. 27 00:01:25,600 --> 00:01:28,380 So I'll type here data.frame. 28 00:01:28,380 --> 00:01:33,480 And data.frame takes as input any number of named arguments 29 00:01:33,480 --> 00:01:38,010 where the names for those arguments become our column names. 30 00:01:38,010 --> 00:01:40,050 And the value for each of those arguments 31 00:01:40,050 --> 00:01:43,270 becomes the values to fill in that particular column. 32 00:01:43,270 --> 00:01:49,630 So for instance, if we recall, we had a column named name just like this. 33 00:01:49,630 --> 00:01:53,400 If I wanted to fill in that column with, let's say, 34 00:01:53,400 --> 00:01:56,580 some vector of information, I could do so just like this. 35 00:01:56,580 --> 00:02:01,170 I could say name equals and, then, provide, as input, some given vector 36 00:02:01,170 --> 00:02:05,950 to be able to fill in the values for this column here. 37 00:02:05,950 --> 00:02:11,460 So I could type, for instance, Voyager 1, followed by Voyager 2, 38 00:02:11,460 --> 00:02:14,170 followed by Pioneer 10. 39 00:02:14,170 --> 00:02:18,160 So this is the data that should fill in that first column of information. 40 00:02:18,160 --> 00:02:21,120 I'll separate these arguments with a comma here. 41 00:02:21,120 --> 00:02:24,930 And I'll say my next column was distance. 42 00:02:24,930 --> 00:02:30,600 And it seemed like Voyager 1 was 163 au away from Earth. 43 00:02:30,600 --> 00:02:35,610 Voyager 2 was 136 au, or astronomical units, away from Earth. 44 00:02:35,610 --> 00:02:38,590 And Pioneer 10 was about 80. 45 00:02:38,590 --> 00:02:41,080 So this, I think, is my data frame. 46 00:02:41,080 --> 00:02:44,700 Let me go ahead and store it in an object called spacecraft.
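For reference, here is a minimal sketch of what has been typed into spacecraft.R at this point. The column names and values come from the table described above; the exact layout of the file on screen is an assumption.

```r
# Build a data frame from two vectors of equal length.
# Each argument name (name, distance) becomes a column name.
spacecraft <- data.frame(
  name = c("Voyager 1", "Voyager 2", "Pioneer 10"),
  distance = c(163, 136, 80)  # astronomical units, as quoted in the video
)

# Typing the object's name prints the data frame.
spacecraft
```

Each vector passed to data.frame needs the same number of elements, one per row.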
47 00:02:44,700 --> 00:02:50,310 If I run, let's say line one here, and go ahead and print out spacecraft 48 00:02:50,310 --> 00:02:54,480 by just viewing the object, I should now see down below 49 00:02:54,480 --> 00:02:57,900 that I have myself a table with two columns-- 50 00:02:57,900 --> 00:03:00,310 one called name, one called distance. 51 00:03:00,310 --> 00:03:03,510 And each of the rows seems to be exactly what we had in that table 52 00:03:03,510 --> 00:03:04,690 earlier as well. 53 00:03:04,690 --> 00:03:09,120 So this is my data frame thanks to data.frame. 54 00:03:09,120 --> 00:03:11,880 Now, notice here that we have those columns. 55 00:03:11,880 --> 00:03:16,810 But we also have these numbers on the left-hand side, like 1, 2 and 3. 56 00:03:16,810 --> 00:03:21,990 These are the row names that data.frame automatically provides for us 57 00:03:21,990 --> 00:03:23,920 when creating some new data frame. 58 00:03:23,920 --> 00:03:26,280 But more on those in a little bit. 59 00:03:26,280 --> 00:03:29,430 So certainly, I can view this data frame by typing out 60 00:03:29,430 --> 00:03:31,920 its name, in this case, spacecraft. 61 00:03:31,920 --> 00:03:34,900 But I might also want to access individual columns. 62 00:03:34,900 --> 00:03:38,730 Well, we've seen we can do that using the dollar sign syntax here, 63 00:03:38,730 --> 00:03:41,230 spacecraft, dollar sign, name. 64 00:03:41,230 --> 00:03:44,880 And that gives me, in this case, the vector that I actually 65 00:03:44,880 --> 00:03:47,830 gave as input for the name column. 66 00:03:47,830 --> 00:03:49,620 So I'll run line 8 here. 67 00:03:49,620 --> 00:03:52,900 And we'll see Voyager 1, Voyager 2, and Pioneer 10. 68 00:03:52,900 --> 00:03:59,140 This is a vector that composes the first column of our spacecraft data frame. 69 00:03:59,140 --> 00:04:02,490 Same thing, in fact, for the distance column. 
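A short sketch of the dollar sign syntax for both columns, rebuilding the same data frame so the snippet runs on its own. It assumes R 4.0 or later, where text columns stay character vectors by default; in older versions the name column would come back as a factor unless stringsAsFactors = FALSE is passed.

```r
spacecraft <- data.frame(
  name = c("Voyager 1", "Voyager 2", "Pioneer 10"),
  distance = c(163, 136, 80)
)

# $ pulls a single column out of the data frame as a plain vector.
spacecraft$name      # the three spacecraft names
spacecraft$distance  # the three distances
```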
70 00:04:02,490 --> 00:04:06,210 The distance column will now get that vector corresponding, in this case, 71 00:04:06,210 --> 00:04:08,650 to the distance column. 72 00:04:08,650 --> 00:04:12,190 And notice, too, that these vectors are of different types. 73 00:04:12,190 --> 00:04:13,960 This is a numeric vector. 74 00:04:13,960 --> 00:04:15,900 This is a character string vector. 75 00:04:15,900 --> 00:04:19,269 But they can all live in the same data frame. 76 00:04:19,269 --> 00:04:23,490 What I can't do, though, still is combine different data types 77 00:04:23,490 --> 00:04:26,200 in the same vector. 78 00:04:26,200 --> 00:04:29,608 So distance is strictly numeric, and name is strictly characters. 79 00:04:29,608 --> 00:04:31,650 But they can be combined into the same data frame 80 00:04:31,650 --> 00:04:33,810 despite being different types. 81 00:04:33,810 --> 00:04:40,090 So let's see how else we can try to access columns of our data frame. 82 00:04:40,090 --> 00:04:45,090 Well, one way we can do so is by not using this syntax but by using, 83 00:04:45,090 --> 00:04:49,030 let's say, the indexes, the indices of our various columns. 84 00:04:49,030 --> 00:04:54,998 If I type spacecraft, bracket 1, you might think-- 85 00:04:54,998 --> 00:04:56,040 well, maybe a few things. 86 00:04:56,040 --> 00:04:59,640 But you might think, for instance, that maybe this is referring 87 00:04:59,640 --> 00:05:02,580 to the first column of spacecraft. 88 00:05:02,580 --> 00:05:04,585 And you would be correct. 89 00:05:04,585 --> 00:05:07,580 But we'll get a bit of an unexpected result here. 90 00:05:07,580 --> 00:05:09,040 I'll run line 11. 91 00:05:09,040 --> 00:05:14,200 And we'll see that I do see that first column of spacecraft, 92 00:05:14,200 --> 00:05:17,300 but I see a few other hints here that this isn't quite a vector. 93 00:05:17,300 --> 00:05:19,780 I see the column name, which is still name. 
94 00:05:19,780 --> 00:05:24,290 And I see those row names, which correspond to our data frame. 95 00:05:24,290 --> 00:05:27,820 So it seems like, when I have a data frame like spacecraft 96 00:05:27,820 --> 00:05:31,030 and I do something like bracket 1 or bracket 2, 97 00:05:31,030 --> 00:05:35,470 some index within these individual brackets here, well, I 98 00:05:35,470 --> 00:05:40,360 get back a subset of that data frame-- some number of rows 99 00:05:40,360 --> 00:05:43,160 that I asked for or some number of columns that I asked for. 100 00:05:43,160 --> 00:05:46,130 But the end result is still a data frame. 101 00:05:46,130 --> 00:05:47,240 This is not a vector. 102 00:05:47,240 --> 00:05:51,670 It is still a data frame but one of only one column. 103 00:05:51,670 --> 00:05:55,630 So let's say I wanted the vector instead for that first column. 104 00:05:55,630 --> 00:05:58,930 I can access that using bracket bracket 1-- 105 00:05:58,930 --> 00:05:59,960 bracket bracket 1. 106 00:05:59,960 --> 00:06:05,550 These double brackets give me access to the vector composing that first column, 107 00:06:05,550 --> 00:06:06,330 just like this. 108 00:06:06,330 --> 00:06:08,850 And we'll see this is, in fact, a vector. 109 00:06:08,850 --> 00:06:10,740 So be careful when you write your programs. 110 00:06:10,740 --> 00:06:14,300 If you have a data frame, recall that bracket notation 111 00:06:14,300 --> 00:06:19,540 with this number in here will give you access to, in this case, 112 00:06:19,540 --> 00:06:24,590 a subset of your data frame and not a vector itself. 113 00:06:24,590 --> 00:06:27,030 Let's try something like this, though. 114 00:06:27,030 --> 00:06:29,710 Maybe I want to access some particular column. 115 00:06:29,710 --> 00:06:33,270 Well, I can do so using some syntax we're probably familiar with by now. 116 00:06:33,270 --> 00:06:36,173 I can use a comma, space, 1.
117 00:06:36,173 --> 00:06:38,090 And that will give me access to, in this case, 118 00:06:38,090 --> 00:06:41,660 the first vector that I have in my data frame, in this case 119 00:06:41,660 --> 00:06:44,220 Voyager 1, Voyager 2, and Pioneer 10. 120 00:06:44,220 --> 00:06:47,420 So a similar way of accessing information 121 00:06:47,420 --> 00:06:50,420 to spacecraft bracket bracket 1 and 122 00:06:50,420 --> 00:06:53,720 spacecraft$name if we wanted to get things by name here. 123 00:06:53,720 --> 00:06:58,513 Going down below, what if we wanted rows from our spacecraft data frame? 124 00:06:58,513 --> 00:06:59,430 Well, I could do this. 125 00:06:59,430 --> 00:07:03,210 I could do spacecraft and then 1, comma, space. 126 00:07:03,210 --> 00:07:05,690 And this gives me access to, in this case, 127 00:07:05,690 --> 00:07:08,820 the very first row of my data frame. 128 00:07:08,820 --> 00:07:14,370 So spacecraft, bracket 1, comma, space, that'll give me access to the first row. 129 00:07:14,370 --> 00:07:19,820 And in general, with this bracket syntax, when I have a comma in the middle, 130 00:07:19,820 --> 00:07:23,970 I can actually ask for some particular value at some particular location, 131 00:07:23,970 --> 00:07:29,000 in this case, the first row and the first column, which is, in this case, 132 00:07:29,000 --> 00:07:30,120 Voyager 1. 133 00:07:30,120 --> 00:07:35,047 So suffice to say, a lot of ways to access data in data frames like these. 134 00:07:35,047 --> 00:07:37,880 And you'll get familiar with the syntax as you just practice it more 135 00:07:37,880 --> 00:07:40,940 and see what the results might be. 136 00:07:40,940 --> 00:07:44,730 Now let's look at these row names in particular. 137 00:07:44,730 --> 00:07:48,020 So I might want to just play around with them 138 00:07:48,020 --> 00:07:51,960 and see what they can do for me just to get a sense of what data frames can do.
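To recap the bracket forms covered so far, here is a small sketch that puts them side by side. The variable names on the left (first_col_df and so on) are made up for illustration and do not appear in the video; the snippet assumes R 4.0 or later so the name column is a character vector.

```r
spacecraft <- data.frame(
  name = c("Voyager 1", "Voyager 2", "Pioneer 10"),
  distance = c(163, 136, 80)
)

first_col_df  <- spacecraft[1]     # single brackets: still a data frame, one column
first_col_vec <- spacecraft[[1]]   # double brackets: the vector itself
also_vec      <- spacecraft[, 1]   # [row, column] with the row left blank: the same vector
first_row     <- spacecraft[1, ]   # [row, column] with the column left blank: a one-row data frame
one_value     <- spacecraft[1, 1]  # first row, first column: a single value

is.data.frame(first_col_df)   # TRUE
is.data.frame(first_col_vec)  # FALSE
```

The pattern to remember is that a single index in single brackets keeps the data frame wrapper, while double brackets or the dollar sign hand back the underlying vector.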
139 00:07:51,960 --> 00:07:56,720 If we look back at our spacecraft data frame, notice how it did automatically 140 00:07:56,720 --> 00:07:59,090 give us row names like these. 141 00:07:59,090 --> 00:08:02,370 But maybe you want to set the row names yourself. 142 00:08:02,370 --> 00:08:05,940 Well, you can do that within the function data.frame. 143 00:08:05,940 --> 00:08:11,630 In fact, one of the arguments to data.frame is one called row.names-- 144 00:08:11,630 --> 00:08:12,650 row.names. 145 00:08:12,650 --> 00:08:17,660 And I can give a vector as input to this particular argument 146 00:08:17,660 --> 00:08:21,210 here, maybe, in this case, the names of our spaceships. 147 00:08:21,210 --> 00:08:25,500 So notice how I've removed that column we called name. 148 00:08:25,500 --> 00:08:28,700 I'm now instead using this argument row.names 149 00:08:28,700 --> 00:08:34,100 and giving it the same vector of all of our names of our spaceships here. 150 00:08:34,100 --> 00:08:38,820 Let's see the result. I'll go ahead and fill in this comma here, and I'll run 151 00:08:38,820 --> 00:08:41,820 line 1 and then line 6. 152 00:08:41,820 --> 00:08:46,050 And we still have a data frame, but it looks a little bit different. 153 00:08:46,050 --> 00:08:48,230 Notice how I only have one column-- 154 00:08:48,230 --> 00:08:49,260 distance. 155 00:08:49,260 --> 00:08:53,510 And on the left-hand side here, I see these values 156 00:08:53,510 --> 00:08:56,370 that are on the left-hand side of my data frame. 157 00:08:56,370 --> 00:08:58,040 Well, these are the row names-- 158 00:08:58,040 --> 00:09:00,930 Voyager 1, Voyager 2, and Pioneer 10. 159 00:09:00,930 --> 00:09:04,520 So by default, R will give you some ascending list of numbers. 160 00:09:04,520 --> 00:09:09,440 But I can override that if I wanted to using row.names.
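A minimal sketch of the row.names version described above, with the name column removed and its values supplied as row names instead. The file layout is an assumption; the argument name row.names is the one the speaker uses.

```r
# The spacecraft names now label the rows rather than filling a column.
spacecraft <- data.frame(
  distance = c(163, 136, 80),
  row.names = c("Voyager 1", "Voyager 2", "Pioneer 10")
)

spacecraft            # prints one column, distance, with the names on the left
rownames(spacecraft)  # the row names as a character vector
```

Row names have to be unique, so this approach suits labels like these that identify exactly one row each.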
161 00:09:09,440 --> 00:09:14,840 Well, of course, spacecraft$name, the name column within spacecraft, 162 00:09:14,840 --> 00:09:16,170 that no longer exists. 163 00:09:16,170 --> 00:09:20,180 If I go ahead and I run line six, I'll get back null. 164 00:09:20,180 --> 00:09:25,280 There is no column named name in the spacecraft data frame. 165 00:09:25,280 --> 00:09:29,750 But if I wanted to now, I could make use of these row names 166 00:09:29,750 --> 00:09:32,790 to access some particular row that I want. 167 00:09:32,790 --> 00:09:38,450 I could type spacecraft and then the row name like Voyager 1-- 168 00:09:38,450 --> 00:09:41,940 Voyager 1-- followed by a comma. 169 00:09:41,940 --> 00:09:44,570 And then, if I go ahead and hit Enter on this, 170 00:09:44,570 --> 00:09:47,900 I should see that I get back the particular row 171 00:09:47,900 --> 00:09:51,560 I was looking for, in this case, 163, which 172 00:09:51,560 --> 00:09:54,060 corresponds to the distance column. 173 00:09:54,060 --> 00:09:59,240 So this is a way of accessing information from our data frames. 174 00:09:59,240 --> 00:10:02,480 Why don't we go ahead and add another column here 175 00:10:02,480 --> 00:10:05,580 to just kind of see what this gives us more precisely. 176 00:10:05,580 --> 00:10:11,550 I can add a column, maybe one like type, so we can figure out 177 00:10:11,550 --> 00:10:13,200 what type each of these are. 178 00:10:13,200 --> 00:10:17,940 So each of these is a probe, probe, probe, probe. 179 00:10:17,940 --> 00:10:21,360 So here I am now, adding a new column called type. 180 00:10:21,360 --> 00:10:26,200 And I'll go ahead and rerun line 1 to update my data frame. 181 00:10:26,200 --> 00:10:28,620 Here is what it looks like. 182 00:10:28,620 --> 00:10:31,600 Now, I have those row names with two columns. 
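Here is a sketch of the row-name lookup against both versions of the data frame. The names labels, one_col, and two_col are hypothetical, chosen so both versions can sit in one snippet (the video reuses the name spacecraft), and the type column is given three values, one per row.

```r
labels <- c("Voyager 1", "Voyager 2", "Pioneer 10")

# Version with only the distance column.
one_col <- data.frame(distance = c(163, 136, 80), row.names = labels)
one_col$name            # NULL: there is no longer a column called name
one_col["Voyager 1", ]  # with a single column, this comes back as the bare value

# Version with the extra type column.
two_col <- data.frame(
  distance = c(163, 136, 80),
  type = c("probe", "probe", "probe"),
  row.names = labels
)
voyager_row <- two_col["Voyager 1", ]  # a one-row data frame with both columns
voyager_row
```

With one column R simplifies the result to a plain value, whereas with two or more columns the same lookup returns a one-row data frame.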
183 00:10:31,600 --> 00:10:36,120 If I now use the row name to access my data frame, 184 00:10:36,120 --> 00:10:39,780 I'll get back, in this case, that particular row that I'm 185 00:10:39,780 --> 00:10:46,110 looking for from my data frame with both of those columns now involved as well. 186 00:10:46,110 --> 00:10:49,530 So this was our brief foray into data frames, 187 00:10:49,530 --> 00:10:53,670 creating them from scratch using this function called data.frame, 188 00:10:53,670 --> 00:10:56,020 as well as accessing them in various ways. 189 00:10:56,020 --> 00:10:58,380 You'll get the hang of it as you do more practice. 190 00:10:58,380 --> 00:11:03,140 This, then, was our short on data frames, and we'll see you next time. 191 00:11:03,140 --> 00:11:04,000