1 00:00:00,000 --> 00:00:03,479 [MUSIC PLAYING] 2 00:00:03,479 --> 00:00:19,557 3 00:00:19,557 --> 00:00:21,140 CARTER ZENKE: Well, hello one and all. 4 00:00:21,140 --> 00:00:24,410 And welcome back to CS50's Introduction to Programming with R. 5 00:00:24,410 --> 00:00:27,950 My name is Carter Zenke, and this is our lecture on visualizing data. 6 00:00:27,950 --> 00:00:30,560 Now, a good visualization can help you see 7 00:00:30,560 --> 00:00:34,100 trends you wouldn't have seen otherwise, help you compare groups in your data, 8 00:00:34,100 --> 00:00:36,050 and help you share your findings with others. 9 00:00:36,050 --> 00:00:39,170 So we're going to do all of that and more today with the help of R 10 00:00:39,170 --> 00:00:41,630 and with this package called ggplot2. 11 00:00:41,630 --> 00:00:43,670 It is part of the tidyverse. 12 00:00:43,670 --> 00:00:46,320 Now, this name, ggplot, is a bit weird. 13 00:00:46,320 --> 00:00:47,930 But there is a reasoning behind it. 14 00:00:47,930 --> 00:00:50,785 The "plot" in ggplot means we're going to plot 15 00:00:50,785 --> 00:00:52,910 our data, which is another word for visualizing it, 16 00:00:52,910 --> 00:00:57,110 translating it from a table into some picture to represent that same data. 17 00:00:57,110 --> 00:01:01,800 And the "gg" in ggplot stands for this grammar of graphics, which, put simply, 18 00:01:01,800 --> 00:01:04,069 is a way of expressing ourselves graphically. 19 00:01:04,069 --> 00:01:06,230 And in particular, this grammar of graphics 20 00:01:06,230 --> 00:01:08,510 gives us some individual components we can 21 00:01:08,510 --> 00:01:13,310 combine to create plots or visualizations from our data. 22 00:01:13,310 --> 00:01:15,710 But what are those components? 23 00:01:15,710 --> 00:01:18,500 Well, the first one, of course, is going to be data itself. 24 00:01:18,500 --> 00:01:21,690 Before we can do anything, we need to have data to visualize. 
25 00:01:21,690 --> 00:01:25,500 And so let me propose we use this data here, a table of candidates 26 00:01:25,500 --> 00:01:27,540 and the votes for each of these candidates. 27 00:01:27,540 --> 00:01:30,360 So I have three of them, Mario, Peach, and Bowser, 28 00:01:30,360 --> 00:01:33,210 and each has some number of votes. 29 00:01:33,210 --> 00:01:36,450 Now, of course, data is just data. 30 00:01:36,450 --> 00:01:37,980 It's not a visualization yet. 31 00:01:37,980 --> 00:01:41,730 Our goal is to translate this table into some picture. 32 00:01:41,730 --> 00:01:45,300 So we'll need another component of our visualization to take this data 33 00:01:45,300 --> 00:01:47,580 and convert it into that picture. 34 00:01:47,580 --> 00:01:50,670 Now, what we'll need is what we call a geometry, more formally. 35 00:01:50,670 --> 00:01:54,540 And a geometry is simply a way of saying what kind of visualization do we want? 36 00:01:54,540 --> 00:01:56,950 I think it's best shown through example here. 37 00:01:56,950 --> 00:01:59,670 So I'll show you three different plots, each 38 00:01:59,670 --> 00:02:01,680 with their own different geometries. 39 00:02:01,680 --> 00:02:04,140 Notice here I have one involving columns, 40 00:02:04,140 --> 00:02:08,669 good for data that involves groups and values associated with those groups. 41 00:02:08,669 --> 00:02:12,600 I have one plot that uses points here, good for representing relationships 42 00:02:12,600 --> 00:02:15,240 between two columns in your data set perhaps. 43 00:02:15,240 --> 00:02:17,640 And I have here a line geometry, one that 44 00:02:17,640 --> 00:02:20,130 can actually show us change over time. 
45 00:02:20,130 --> 00:02:22,558 And so I'm curious, given these three geometries, 46 00:02:22,558 --> 00:02:25,350 of which there are many more-- we'll focus on these three for now-- 47 00:02:25,350 --> 00:02:28,650 which one do you think would be good or best for the data 48 00:02:28,650 --> 00:02:33,150 we have here, where we have three candidates and some number of votes? 49 00:02:33,150 --> 00:02:35,250 Let's see what our audience thinks. 50 00:02:35,250 --> 00:02:38,490 What kind of geometry among these three here 51 00:02:38,490 --> 00:02:43,080 would you use to visualize this data in particular? 52 00:02:43,080 --> 00:02:48,742 AUDIENCE: I think we will use columns to present it in more visualized example. 53 00:02:48,742 --> 00:02:50,200 CARTER ZENKE: I like your thinking. 54 00:02:50,200 --> 00:02:53,580 So we could use these columns here to visualize our candidates 55 00:02:53,580 --> 00:02:55,110 and their number of votes. 56 00:02:55,110 --> 00:02:58,530 And more particularly, let me say that maybe each candidate gets 57 00:02:58,530 --> 00:02:59,580 their own column. 58 00:02:59,580 --> 00:03:02,340 And how high or low the bar goes, the column 59 00:03:02,340 --> 00:03:06,120 goes, that would be the number of votes that candidate received. 60 00:03:06,120 --> 00:03:10,680 So notice here how we're specifying not just the data involved 61 00:03:10,680 --> 00:03:15,600 and the kind of plot we want but also how that data references 62 00:03:15,600 --> 00:03:18,240 or is represented on our plot here. 63 00:03:18,240 --> 00:03:21,030 How many columns we have, how high or low those columns go, 64 00:03:21,030 --> 00:03:24,390 those are all associated with some part of our data. 65 00:03:24,390 --> 00:03:28,552 Now, we call this association, more formally, this aesthetic mapping, 66 00:03:28,552 --> 00:03:29,760 which is a bit of a big term. 67 00:03:29,760 --> 00:03:31,843 But we can break it down into smaller pieces here. 
68 00:03:31,843 --> 00:03:35,340 An aesthetic is some visual feature of our plot, 69 00:03:35,340 --> 00:03:39,690 maybe how many columns there are, how high or low each of those columns go. 70 00:03:39,690 --> 00:03:43,230 And a mapping is really another word for a relationship. 71 00:03:43,230 --> 00:03:44,880 Well, what are we relating here? 72 00:03:44,880 --> 00:03:49,700 We're relating our data to some visual features of our plot. 73 00:03:49,700 --> 00:03:51,790 I think this is best shown by example. 74 00:03:51,790 --> 00:03:53,850 So here I have a plot. 75 00:03:53,850 --> 00:03:58,470 And you've often seen plots with a both vertical line and a horizontal line 76 00:03:58,470 --> 00:03:58,990 here. 77 00:03:58,990 --> 00:04:01,440 And if I look at this plot kind of naively 78 00:04:01,440 --> 00:04:06,240 and you ask me to draw some columns, I might say, well, how? 79 00:04:06,240 --> 00:04:08,100 I mean, do they go left to right? 80 00:04:08,100 --> 00:04:09,180 Do they go up to down? 81 00:04:09,180 --> 00:04:13,410 I'm not quite sure how to draw these columns even if you want me to. 82 00:04:13,410 --> 00:04:17,190 So thankfully, there are certain aesthetic or visual features 83 00:04:17,190 --> 00:04:20,550 of this plot that we could talk about and relate our data to. 84 00:04:20,550 --> 00:04:25,780 Namely, most plots tend to have what we call an x-axis on this horizontal line 85 00:04:25,780 --> 00:04:26,280 here. 86 00:04:26,280 --> 00:04:28,560 This is known as our x-axis. 87 00:04:28,560 --> 00:04:33,970 And most plots, too, have what's known as a y-axis, a vertical axis here. 88 00:04:33,970 --> 00:04:39,210 And if we have both a y, or a vertical, axis and an x, or a horizontal, axis, 89 00:04:39,210 --> 00:04:43,740 we could talk about, well, which parts of our data go on the x-axis 90 00:04:43,740 --> 00:04:47,020 and which parts of our data go on the y-axis here. 91 00:04:47,020 --> 00:04:49,170 So let me show you again our data. 
92 00:04:49,170 --> 00:04:54,060 We had here candidates, one column, and votes, one column. 93 00:04:54,060 --> 00:04:58,260 Let me ask more precisely now, which column of our data 94 00:04:58,260 --> 00:05:03,885 do you think should be mapped to or represented by the x-axis of our plot? 95 00:05:03,885 --> 00:05:05,870 AUDIENCE: Our candidate column? 96 00:05:05,870 --> 00:05:07,620 CARTER ZENKE: Maybe our candidates, right? 97 00:05:07,620 --> 00:05:11,100 Because we could have the candidates on the x-axis representing now 98 00:05:11,100 --> 00:05:13,300 one individual column for each of them. 99 00:05:13,300 --> 00:05:17,400 So we will map, let's say, our candidate column to this x-axis. 100 00:05:17,400 --> 00:05:21,428 And by process of elimination, it seems like votes should go now on the y-axis. 101 00:05:21,428 --> 00:05:23,220 And so we can see what it looks like to now 102 00:05:23,220 --> 00:05:26,200 map these columns to our plot over here. 103 00:05:26,200 --> 00:05:31,960 Well, if I wanted to map my candidate column to the x-axis, what could I do? 104 00:05:31,960 --> 00:05:35,790 I could kind of write these candidates' names on the x-axis here. 105 00:05:35,790 --> 00:05:39,120 I'll start with Mario down below. 106 00:05:39,120 --> 00:05:40,890 And then I'll follow it with Peach, who's 107 00:05:40,890 --> 00:05:43,660 our second candidate, just like this. 108 00:05:43,660 --> 00:05:47,910 And now I'll follow that with Bowser too, our third candidate, 109 00:05:47,910 --> 00:05:49,150 just like that. 110 00:05:49,150 --> 00:05:55,440 So now we could say our column candidate is mapped to our x-axis. 111 00:05:55,440 --> 00:05:57,480 But now we need the y-axis. 112 00:05:57,480 --> 00:06:01,200 And we said the y-axis is going to be represented-- is going 113 00:06:01,200 --> 00:06:04,060 to have this votes column on it here. 114 00:06:04,060 --> 00:06:05,200 So how could we do that?
115 00:06:05,200 --> 00:06:09,000 Well, notice that these votes here fall into some range between, 116 00:06:09,000 --> 00:06:11,490 let's say, 0 to 200 at the maximum. 117 00:06:11,490 --> 00:06:17,280 So maybe on this y-axis I could say, let's start at the very bottom with 0. 118 00:06:17,280 --> 00:06:20,800 And all the way at the top, we'll put 200. 119 00:06:20,800 --> 00:06:26,020 So now the height of these columns should correspond to the y-axis. 120 00:06:26,020 --> 00:06:29,020 And I think now that I have these x and y-axes, 121 00:06:29,020 --> 00:06:31,840 I know much more how to draw these columns. 122 00:06:31,840 --> 00:06:35,380 If you told me draw columns representing the number of votes each candidate got, 123 00:06:35,380 --> 00:06:36,977 I would know how to do exactly that. 124 00:06:36,977 --> 00:06:38,810 I would know they should start at the bottom 125 00:06:38,810 --> 00:06:40,780 and move all the way up in terms of height. 126 00:06:40,780 --> 00:06:42,610 So let's draw these now. 127 00:06:42,610 --> 00:06:44,860 I see Mario has 100 votes. 128 00:06:44,860 --> 00:06:47,380 So I want to put a column on here for Mario. 129 00:06:47,380 --> 00:06:50,170 Well, Mario's column should be of the height 130 00:06:50,170 --> 00:06:53,020 on the y-axis that is equal to 100, which should be 131 00:06:53,020 --> 00:06:55,300 right in the middle between 0 and 200. 132 00:06:55,300 --> 00:06:57,388 So I'll draw Mario's column a bit like this. 133 00:06:57,388 --> 00:06:58,930 I'll put it right in the middle here. 134 00:06:58,930 --> 00:07:01,930 And I'll draw Mario a column just like that. 135 00:07:01,930 --> 00:07:03,650 That's Mario's column. 136 00:07:03,650 --> 00:07:05,380 Now I want to draw Peach's column. 137 00:07:05,380 --> 00:07:07,540 Well, Peach had 200 votes. 138 00:07:07,540 --> 00:07:11,410 I'll make Peach's column go all the way to the top of our y-axis, 139 00:07:11,410 --> 00:07:12,970 where we put the 200 votes.
140 00:07:12,970 --> 00:07:16,810 And I'll make sure that that column reaches all the way up there. 141 00:07:16,810 --> 00:07:21,940 And finally, for Bowser, well, Bowser had 150, somewhere between Mario's 100 142 00:07:21,940 --> 00:07:22,900 and Peach's 200. 143 00:07:22,900 --> 00:07:30,230 So I will put Bowser's column right in between those two, just like this. 144 00:07:30,230 --> 00:07:32,590 So this now is our complete plot. 145 00:07:32,590 --> 00:07:33,970 We've worked with data. 146 00:07:33,970 --> 00:07:36,580 We've worked with our aesthetic mappings, taking our columns 147 00:07:36,580 --> 00:07:38,590 and aligning them to the x and the y-axes. 148 00:07:38,590 --> 00:07:40,480 And now we've worked with our geometries, 149 00:07:40,480 --> 00:07:45,320 this actual visual representation of now our data in terms of columns. 150 00:07:45,320 --> 00:07:49,310 So all that's left to do now is do this in code. 151 00:07:49,310 --> 00:07:51,340 So we'll come back now to RStudio. 152 00:07:51,340 --> 00:07:56,020 And we'll look at how we can use this package, ggplot2, to do just that. 153 00:07:56,020 --> 00:07:59,650 I have open here a file called votes.R. And I'm 154 00:07:59,650 --> 00:08:03,220 reading in this CSV called votes.csv. 155 00:08:03,220 --> 00:08:06,430 Notice here how it goes into a table that I'm calling "votes." 156 00:08:06,430 --> 00:08:09,280 And this is the same table we saw before on the slides. 157 00:08:09,280 --> 00:08:12,482 We have a candidate named Mario, a candidate named Peach, 158 00:08:12,482 --> 00:08:13,690 and a candidate named Bowser. 159 00:08:13,690 --> 00:08:16,300 And each has some number of votes. 160 00:08:16,300 --> 00:08:20,990 Well, if I want to create a new blank plot, like what I had before here, 161 00:08:20,990 --> 00:08:25,850 I'll go ahead and I'll use this function, ggplot, just like this. 
162 00:08:25,850 --> 00:08:28,660 And so long as I have installed the tidyverse 163 00:08:28,660 --> 00:08:31,990 and loaded it using library down below-- 164 00:08:31,990 --> 00:08:33,830 tidyverse, just like this-- 165 00:08:33,830 --> 00:08:37,570 I can have access to this function called ggplot that creates for me 166 00:08:37,570 --> 00:08:39,760 a new blank plot. 167 00:08:39,760 --> 00:08:41,799 So let me run line 3 here. 168 00:08:41,799 --> 00:08:43,030 And what do we see? 169 00:08:43,030 --> 00:08:45,880 Well, in the plots column or in the plots tab here, 170 00:08:45,880 --> 00:08:48,850 I'll see this blank plot that I can then draw, 171 00:08:48,850 --> 00:08:51,940 put brushstrokes on to visualize my data. 172 00:08:51,940 --> 00:08:56,080 But we haven't actually given our plot anything at all. 173 00:08:56,080 --> 00:08:59,320 This ggplot function has no input whatsoever. 174 00:08:59,320 --> 00:09:01,750 So I think it's worth thinking about what kinds of inputs 175 00:09:01,750 --> 00:09:04,900 we need to give to our plot so that it can visualize 176 00:09:04,900 --> 00:09:07,240 this data we have in our table here. 177 00:09:07,240 --> 00:09:10,000 Now, the first input, as we saw in our grammar of graphics, 178 00:09:10,000 --> 00:09:12,400 is going to be the data to visualize. 179 00:09:12,400 --> 00:09:16,420 So let's give as input now to this ggplot function, the data itself. 180 00:09:16,420 --> 00:09:18,010 I'll come back now to RStudio. 181 00:09:18,010 --> 00:09:21,760 And it turns out that the first argument to this ggplot function which creates 182 00:09:21,760 --> 00:09:24,940 me a new plot is the data itself. 183 00:09:24,940 --> 00:09:28,450 I'll provide this first argument now, this votes data frame. 184 00:09:28,450 --> 00:09:34,090 And now, if I run line 3, I still see nothing.
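The step just described, a plot that has been given data but nothing else, might look roughly like the sketch below. It assumes the ggplot2 package is installed, loads it directly rather than through the whole tidyverse, and builds the table inline as a stand-in for votes.csv (the column names candidate and votes and the three vote counts come from the lecture; the inline data frame itself is an assumption, since the CSV file is not shown).

```r
library(ggplot2)  # also loaded by library(tidyverse)

# Stand-in for the table read from votes.csv
votes <- data.frame(
  candidate = c("Mario", "Peach", "Bowser"),
  votes = c(100, 200, 150)
)

# Data only: no geometry and no aesthetic mappings yet,
# so the result is still an empty canvas
p <- ggplot(votes)
p
```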
185 00:09:34,090 --> 00:09:37,840 And that is kind of expected because we said that data, at the end of the day, 186 00:09:37,840 --> 00:09:39,100 is just data. 187 00:09:39,100 --> 00:09:43,930 We still need a way to map this data to certain aesthetic features of our plot, 188 00:09:43,930 --> 00:09:45,160 certain visual features. 189 00:09:45,160 --> 00:09:48,550 And we need to say what kind of plot we want. 190 00:09:48,550 --> 00:09:50,890 Well, let's do just that now in code. 191 00:09:50,890 --> 00:09:52,840 I'll come back now to RStudio. 192 00:09:52,840 --> 00:09:57,430 And it turns out that one other piece we need here is our geometry. 193 00:09:57,430 --> 00:10:00,520 We've given it some data, but we need some geometry to specify 194 00:10:00,520 --> 00:10:03,040 what kind of plot we want now. 195 00:10:03,040 --> 00:10:07,450 Well, the way we can do this syntactically in R, and in ggplot 196 00:10:07,450 --> 00:10:10,720 more generally, is to use this plus sign here 197 00:10:10,720 --> 00:10:13,810 and to follow it with the kind of geometry we want, 198 00:10:13,810 --> 00:10:16,240 which is going to be a column geometry. 199 00:10:16,240 --> 00:10:21,710 Now, to get a column geometry, I can use this function here, geom_col. 200 00:10:21,710 --> 00:10:25,670 And there are other geom functions, like geom_point or geom_line. 201 00:10:25,670 --> 00:10:29,850 But for now, we'll use geom_col to create ourselves some columns. 202 00:10:29,850 --> 00:10:32,690 But notice here this plus. 203 00:10:32,690 --> 00:10:37,100 So you've seen plus when we've been adding up some numbers, like 1 plus 1. 204 00:10:37,100 --> 00:10:38,900 But this is not that kind of addition. 205 00:10:38,900 --> 00:10:41,210 In fact, the creators of ggplot have done what's 206 00:10:41,210 --> 00:10:43,520 called overloading this plus sign. 207 00:10:43,520 --> 00:10:45,080 It no longer means addition. 208 00:10:45,080 --> 00:10:47,060 It means something else entirely.
209 00:10:47,060 --> 00:10:52,040 In fact, it means we're going to add a new layer to this blank plot. 210 00:10:52,040 --> 00:10:54,360 Let me show you what I mean with this prop over here. 211 00:10:54,360 --> 00:10:58,850 So we had here our visualization, which had columns on top of it. 212 00:10:58,850 --> 00:11:04,790 But I drew these columns on the same layer, let's say, that I drew my axes. 213 00:11:04,790 --> 00:11:08,450 Turns out that in ggplot, our plots have multiple layers. 214 00:11:08,450 --> 00:11:11,120 So instead of everything being on the same page, so to speak, 215 00:11:11,120 --> 00:11:14,030 what we instead do is the following. 216 00:11:14,030 --> 00:11:15,950 Let me erase my columns that I have here. 217 00:11:15,950 --> 00:11:18,420 And let me instead do this. 218 00:11:18,420 --> 00:11:22,890 Let me take out a new layer I could put on top of my blank plot here, 219 00:11:22,890 --> 00:11:24,150 my x and y-axes. 220 00:11:24,150 --> 00:11:27,630 And on this layer could I draw my columns. 221 00:11:27,630 --> 00:11:29,820 Let me go ahead and do the same thing we did before. 222 00:11:29,820 --> 00:11:32,280 Mario had about 100 votes. 223 00:11:32,280 --> 00:11:34,650 I'll put these halfway up the y-axis. 224 00:11:34,650 --> 00:11:36,210 Peach had 200. 225 00:11:36,210 --> 00:11:38,730 I'll put those all the way up at 200. 226 00:11:38,730 --> 00:11:43,650 And Bowser had 150, somewhere between Mario and Peach's votes. 227 00:11:43,650 --> 00:11:45,810 And now notice what I've done. 228 00:11:45,810 --> 00:11:48,600 I have separated my plot now into layers. 229 00:11:48,600 --> 00:11:51,900 One layer, of course, is my columns here. 230 00:11:51,900 --> 00:11:56,250 And one layer seems to be my axes, same thing I'm doing here. 231 00:11:56,250 --> 00:12:00,120 We've started now with this blank piece of paper, thanks to ggplot, 232 00:12:00,120 --> 00:12:04,210 and we're now adding a new layer, one that involves columns. 
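One way to see the layering idea in code is to count the layers before and after the plus sign is used. This is only a sketch: it assumes ggplot2 is installed, rebuilds the votes table inline as a stand-in for votes.csv, and peeks at the layers element of the plot object, which is an inspection detail not mentioned in the lecture. Neither plot is printed here, because no aesthetics have been mapped yet.

```r
library(ggplot2)

# Stand-in for the table read from votes.csv
votes <- data.frame(
  candidate = c("Mario", "Peach", "Bowser"),
  votes = c(100, 200, 150)
)

# The blank sheet of paper: a plot with no layers
base <- ggplot(votes)

# The overloaded plus sign puts a column layer on top of it
with_columns <- base + geom_col()

length(base$layers)          # number of layers on the blank plot
length(with_columns$layers)  # number of layers after adding geom_col()
```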
233 00:12:04,210 --> 00:12:05,770 So let's see what happens. 234 00:12:05,770 --> 00:12:07,980 I'll come back now to RStudio. 235 00:12:07,980 --> 00:12:13,380 And let me go ahead and run this here and see what we see. 236 00:12:13,380 --> 00:12:14,880 Nothing yet. 237 00:12:14,880 --> 00:12:18,570 I do see this error that the geom_col function 238 00:12:18,570 --> 00:12:22,980 requires the following missing aesthetics, x and y. 239 00:12:22,980 --> 00:12:28,050 And so I think ggplot is running into this same problem that I did earlier. 240 00:12:28,050 --> 00:12:32,730 If I don't know how to map my data to the visual features of this plot, well, 241 00:12:32,730 --> 00:12:34,780 how can I even draw these columns? 242 00:12:34,780 --> 00:12:39,270 So we'll need to specify these aesthetic features, x and y, before ggplot 243 00:12:39,270 --> 00:12:42,180 knows how to draw these columns here. 244 00:12:42,180 --> 00:12:45,510 Now, it turns out we can specify these aesthetic features using 245 00:12:45,510 --> 00:12:50,220 a function, one called aes, which stands for aesthetics here. 246 00:12:50,220 --> 00:12:53,400 Now, aesthetics, the function, looks a bit like this-- 247 00:12:53,400 --> 00:12:59,280 aes, and it takes as input any number of pairings between aesthetic values, 248 00:12:59,280 --> 00:13:02,400 like x or y, and the columns in my data. 249 00:13:02,400 --> 00:13:03,840 For instance, like this. 250 00:13:03,840 --> 00:13:06,570 This would be one aesthetic mapping. 251 00:13:06,570 --> 00:13:09,000 I'm taking one column of my data and assigning 252 00:13:09,000 --> 00:13:12,990 it to control some visual feature now of my plot. 253 00:13:12,990 --> 00:13:17,160 So let's do the same thing now using aes, this aesthetic function, 254 00:13:17,160 --> 00:13:21,210 to actually make it happen for us inside of R and ggplot. 255 00:13:21,210 --> 00:13:23,730 So I'll come back to RStudio here.
256 00:13:23,730 --> 00:13:26,280 And let's go ahead and use the aes function 257 00:13:26,280 --> 00:13:30,960 to tell this plot exactly how to map our data to certain visual features 258 00:13:30,960 --> 00:13:32,610 of our plot more generally. 259 00:13:32,610 --> 00:13:35,160 So it turns out, by reading the documentation, 260 00:13:35,160 --> 00:13:39,840 I know the second argument to ggplot is this aes function, which 261 00:13:39,840 --> 00:13:43,650 we know takes as input certain pairings between these aesthetic features, 262 00:13:43,650 --> 00:13:46,890 like x and y, and columns in our data. 263 00:13:46,890 --> 00:13:50,130 So one aesthetic feature is the x-axis. 264 00:13:50,130 --> 00:13:53,220 Which column should go on the x-axis of our data? 265 00:13:53,220 --> 00:13:58,270 Well, we know we're going to put the candidates column on that x-axis here. 266 00:13:58,270 --> 00:14:02,040 I'll go ahead and I'll say, x equals candidate, just like this. 267 00:14:02,040 --> 00:14:05,100 This means that the candidate column in my data 268 00:14:05,100 --> 00:14:09,330 should now be mapped to the x-axis of my plot. 269 00:14:09,330 --> 00:14:10,980 Now, I'll do the same for the y column. 270 00:14:10,980 --> 00:14:12,780 Well, the y-- or the y aesthetic. 271 00:14:12,780 --> 00:14:17,610 The y aesthetic is going to be mapped to the votes column and vice versa. 272 00:14:17,610 --> 00:14:21,060 So the votes column should control the y-axis, effectively. 273 00:14:21,060 --> 00:14:23,490 I'll go ahead and say that y equals votes. 274 00:14:23,490 --> 00:14:26,100 And now, with these aesthetic mappings specified, 275 00:14:26,100 --> 00:14:28,270 I should be able to visualize this. 276 00:14:28,270 --> 00:14:29,910 I'll go ahead and run line 3. 
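Put together, the data, the aesthetic mappings, and the column geometry described so far could be written as in the sketch below. It assumes ggplot2 is installed and uses an inline data frame as a stand-in for votes.csv; the mappings x = candidate and y = votes are the ones named in the lecture.

```r
library(ggplot2)

# Stand-in for the table read from votes.csv
votes <- data.frame(
  candidate = c("Mario", "Peach", "Bowser"),
  votes = c(100, 200, 150)
)

# Data first, aesthetic mappings second, then a layer of columns
p <- ggplot(votes, aes(x = candidate, y = votes)) +
  geom_col()
p
```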
277 00:14:29,910 --> 00:14:33,180 And voila, we have our very first plot we've now 278 00:14:33,180 --> 00:14:36,600 created with ggplot, thanks to starting with data, 279 00:14:36,600 --> 00:14:39,840 defining how that data controls visual features of our plot, 280 00:14:39,840 --> 00:14:45,310 like the x and y-axes, and now using a new layer, adding in these columns. 281 00:14:45,310 --> 00:14:51,400 So let me ask you now, what questions do we have on what we've done so far? 282 00:14:51,400 --> 00:14:59,760 AUDIENCE: If we look into the x-axis of the plot that we did now in ggplot2, 283 00:14:59,760 --> 00:15:06,480 the order of our candidates is slightly different from what 284 00:15:06,480 --> 00:15:09,240 we drew in the board. 285 00:15:09,240 --> 00:15:11,130 So why is that? 286 00:15:11,130 --> 00:15:16,443 And how do we expect ggplot to order our labels? 287 00:15:16,443 --> 00:15:18,360 CARTER ZENKE: Yeah, a really good observation. 288 00:15:18,360 --> 00:15:20,235 I was hoping you might notice this, actually. 289 00:15:20,235 --> 00:15:22,680 So notice here on the plot, our names are not 290 00:15:22,680 --> 00:15:24,810 in the same order we saw them in our data. 291 00:15:24,810 --> 00:15:27,600 In our data, it was Mario, Peach, and then Bowser. 292 00:15:27,600 --> 00:15:31,710 But it seems like on the x-axis here, it's Bowser, Mario, and Peach. 293 00:15:31,710 --> 00:15:33,630 Well, what order is this? 294 00:15:33,630 --> 00:15:35,800 It turns out this is alphabetical order. 295 00:15:35,800 --> 00:15:39,780 So Bowser with B comes first, followed by Mario, followed by Peach. 296 00:15:39,780 --> 00:15:43,920 So we can generally expect ggplot to sort our data on either axis. 297 00:15:43,920 --> 00:15:46,050 In this case, because we're working with names, 298 00:15:46,050 --> 00:15:49,290 sorted alphabetically down below on the x-axis. 299 00:15:49,290 --> 00:15:51,510 But a good observation to make. 
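The alphabetical ordering can be previewed in plain R, without drawing anything. This is an illustration by analogy, not a look inside ggplot2: it simply sorts the distinct names in the stand-in votes table, which gives the same left-to-right order seen on the x-axis.

```r
# Stand-in for the table read from votes.csv, in its original row order
votes <- data.frame(
  candidate = c("Mario", "Peach", "Bowser"),
  votes = c(100, 200, 150)
)

# Sorting the distinct names puts Bowser first, then Mario, then Peach
axis_order <- sort(unique(votes$candidate))
axis_order
```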
300 00:15:51,510 --> 00:15:53,490 Now, let's keep going. 301 00:15:53,490 --> 00:15:57,450 And ggplot gives us some pretty good defaults for our plots. 302 00:15:57,450 --> 00:16:01,140 I'll notice here that on the labels here for my axes, 303 00:16:01,140 --> 00:16:03,540 I see the name of the column, like candidate. 304 00:16:03,540 --> 00:16:07,290 And on the y-axis, I see votes for the y-axis. 305 00:16:07,290 --> 00:16:11,700 And even on the y-axis, it has decided for me that the range of values 306 00:16:11,700 --> 00:16:14,160 should be 0 to 200. 307 00:16:14,160 --> 00:16:16,110 So these are pretty good defaults. 308 00:16:16,110 --> 00:16:19,860 But sometimes you might want to override or change 309 00:16:19,860 --> 00:16:24,240 those defaults, which you can do by adding new layers to your plot. 310 00:16:24,240 --> 00:16:26,110 So let's consider how to do just that. 311 00:16:26,110 --> 00:16:29,850 Well, here, one thing I want to do is change this plot. 312 00:16:29,850 --> 00:16:33,600 So I have a little more headroom up here on Peach's votes column. 313 00:16:33,600 --> 00:16:39,780 Maybe I want my y-axis to not stop at 200 but to go up to, let's say, 250. 314 00:16:39,780 --> 00:16:43,110 Well, to do that, we'll need to learn about this other concept in plots, 315 00:16:43,110 --> 00:16:46,380 and in ggplot2 more generally, one called scale. 316 00:16:46,380 --> 00:16:48,840 So let's learn about scales a little bit together. 317 00:16:48,840 --> 00:16:54,600 A scale is simply a way now of specifying how our values actually 318 00:16:54,600 --> 00:16:56,838 control our aesthetic mappings. 319 00:16:56,838 --> 00:16:59,130 And for instance, there are really two kinds of scales, 320 00:16:59,130 --> 00:17:03,540 one called a continuous scale and one called a discrete scale. 321 00:17:03,540 --> 00:17:07,560 Now, a continuous scale is for values that fall in some range. 
322 00:17:07,560 --> 00:17:12,060 You could think of our y-axis here with some number of votes between 0 323 00:17:12,060 --> 00:17:13,950 and, let's say, 200 or 250. 324 00:17:13,950 --> 00:17:19,079 That is a continuous scale because some votes could fall in some range. 325 00:17:19,079 --> 00:17:21,630 A discrete scale, though, is for values that 326 00:17:21,630 --> 00:17:23,670 are what we might call categorical. 327 00:17:23,670 --> 00:17:25,200 They fall into categories. 328 00:17:25,200 --> 00:17:29,310 For instance, our x-axis has a discrete scale. 329 00:17:29,310 --> 00:17:31,680 There are three distinct candidates. 330 00:17:31,680 --> 00:17:34,560 It wouldn't quite make sense to put them on any given scale. 331 00:17:34,560 --> 00:17:35,740 They're just names. 332 00:17:35,740 --> 00:17:38,370 So this is what we call a discrete scale. 333 00:17:38,370 --> 00:17:42,120 Now, scales that are continuous have what are called limits. 334 00:17:42,120 --> 00:17:48,180 And exactly what we've seen here is our y-axis, our y scale starts at 0 335 00:17:48,180 --> 00:17:50,730 and goes up to 200 currently. 336 00:17:50,730 --> 00:17:54,390 Those are the limits of this scale, from 0 to 200. 337 00:17:54,390 --> 00:17:59,320 But we want to change the limit here from 0 to 250 instead. 338 00:17:59,320 --> 00:18:04,680 And so it turns out that ggplot gives us functions to modify different scales, 339 00:18:04,680 --> 00:18:06,490 using some of these right here. 340 00:18:06,490 --> 00:18:09,630 So we see scale_x_continuous is a function, 341 00:18:09,630 --> 00:18:14,340 scale_y_continuous is a function, scale_x_discrete, and scale_y_discrete. 342 00:18:14,340 --> 00:18:18,300 These change various different scales on our x or y-axis, 343 00:18:18,300 --> 00:18:21,780 depending on if they are continuous or discrete. 344 00:18:21,780 --> 00:18:24,953 Now, we said before, we want to change our y-axis. 345 00:18:24,953 --> 00:18:26,370 So that narrows our possibilities. 
346 00:18:26,370 --> 00:18:29,790 It's either this one here, scale_y_continuous, or this one 347 00:18:29,790 --> 00:18:32,250 here, scale_y_discrete. 348 00:18:32,250 --> 00:18:35,850 We said our y-axis is a continuous scale. 349 00:18:35,850 --> 00:18:37,200 It has a range of values. 350 00:18:37,200 --> 00:18:38,533 It doesn't have discrete values. 351 00:18:38,533 --> 00:18:39,610 It has a range of values. 352 00:18:39,610 --> 00:18:42,660 So we could probably use scale_y_continuous 353 00:18:42,660 --> 00:18:45,870 to modify now our y-axis. 354 00:18:45,870 --> 00:18:49,710 Let's go ahead and do just that by adding a new layer to our plot. 355 00:18:49,710 --> 00:18:51,250 I'll come back over here. 356 00:18:51,250 --> 00:18:54,280 And let's say I go back to RStudio. 357 00:18:54,280 --> 00:18:57,450 Well, I want to change now this y-axis's defaults. 358 00:18:57,450 --> 00:19:00,900 And I can do so, as I said before, by adding a new layer. 359 00:19:00,900 --> 00:19:04,050 I'll use this plus here to add a new layer to my plot. 360 00:19:04,050 --> 00:19:08,850 And let me now kind of overwrite the scale I currently see on the board. 361 00:19:08,850 --> 00:19:15,060 I'll use scale_y_continuous, which will allow me to adjust my continuous y 362 00:19:15,060 --> 00:19:16,530 scale that I currently see. 363 00:19:16,530 --> 00:19:19,140 And it turns out, that by reading the documentation I know, 364 00:19:19,140 --> 00:19:24,010 scale_y_continuous takes this argument called limits, just like this. 365 00:19:24,010 --> 00:19:27,570 And limits takes a value which is a vector of length two, where 366 00:19:27,570 --> 00:19:31,320 the first value is where the scale starts, let's say 0, 367 00:19:31,320 --> 00:19:34,110 and where the scale ends, let's say 250. 368 00:19:34,110 --> 00:19:38,520 So I'll say-- I'll give limits here, this vector, 0 to 250. 369 00:19:38,520 --> 00:19:43,470 And that should now change for me my scale on my y-axis. 
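As a sketch of the layer just described, the limits argument can be added to the column plot like this. It assumes ggplot2 is installed and uses an inline stand-in for votes.csv; scale_y_continuous and its limits argument are the ones named in the lecture.

```r
library(ggplot2)

# Stand-in for the table read from votes.csv
votes <- data.frame(
  candidate = c("Mario", "Peach", "Bowser"),
  votes = c(100, 200, 150)
)

# A vector of length two: where the y scale starts and where it ends
p <- ggplot(votes, aes(x = candidate, y = votes)) +
  geom_col() +
  scale_y_continuous(limits = c(0, 250))
p
```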
370 00:19:43,470 --> 00:19:45,960 Let me go ahead and redraw this plot, which I'll 371 00:19:45,960 --> 00:19:48,060 do by running this line of code here. 372 00:19:48,060 --> 00:19:51,390 And now we'll see my axis has changed. 373 00:19:51,390 --> 00:19:52,830 So what have we done? 374 00:19:52,830 --> 00:19:56,130 We have first defined this blank plot, given it 375 00:19:56,130 --> 00:20:00,390 some data and some mappings between that data and its visual features. 376 00:20:00,390 --> 00:20:04,260 We have then drawn some columns on top of it, thanks to geom_col. 377 00:20:04,260 --> 00:20:09,780 And then we've kind of overridden the default scale, moving it from 0 to 200 378 00:20:09,780 --> 00:20:12,810 to instead 0 to 250. 379 00:20:12,810 --> 00:20:17,160 We're kind of building up our plot as we go now, using these layers. 380 00:20:17,160 --> 00:20:21,600 Well, one more thing we could do is override the labels here. 381 00:20:21,600 --> 00:20:24,750 I'll say this one is candidate with a lowercase C. 382 00:20:24,750 --> 00:20:27,840 This one is votes with a lowercase v. I'd love to make them 383 00:20:27,840 --> 00:20:29,670 capital to make them more professional. 384 00:20:29,670 --> 00:20:32,520 And also, I want to add a title so people can look at this 385 00:20:32,520 --> 00:20:35,370 and know exactly what they're looking at immediately. 386 00:20:35,370 --> 00:20:39,450 So I can add a new layer now to my plot to add in the labels 387 00:20:39,450 --> 00:20:43,740 and either override some defaults or add some new labels altogether. 388 00:20:43,740 --> 00:20:48,060 Well, it turns out the function I use to do this is one called labs. 389 00:20:48,060 --> 00:20:50,340 Labs is short for labels. 390 00:20:50,340 --> 00:20:54,630 I'll go ahead and add a new layer now to my plot called labs. 391 00:20:54,630 --> 00:21:00,690 And labs takes as arguments the kinds of labels I want to adjust on my plot.
392 00:21:00,690 --> 00:21:06,180 Now, for instance, I could use x equals and set it equal to a character string 393 00:21:06,180 --> 00:21:08,940 that I want to be the label for my x-axis. 394 00:21:08,940 --> 00:21:12,000 I'll call this one Candidate with a capital C. 395 00:21:12,000 --> 00:21:14,700 I could also give it a label for the y-axis 396 00:21:14,700 --> 00:21:17,830 by saying y equals and then some character string here. 397 00:21:17,830 --> 00:21:22,030 I'll go ahead and call this Votes with a capital V. And then for my title, 398 00:21:22,030 --> 00:21:25,750 why don't I use this, Election Results, making 399 00:21:25,750 --> 00:21:28,240 it very clear exactly what we're visualizing 400 00:21:28,240 --> 00:21:30,490 for any viewer who comes in. 401 00:21:30,490 --> 00:21:33,230 I'll go ahead and now build up my plot step by step again 402 00:21:33,230 --> 00:21:34,990 by running all of these lines of code. 403 00:21:34,990 --> 00:21:38,350 And now I should see my plot as it should be. 404 00:21:38,350 --> 00:21:43,180 I have all my candidates on the x-axis, my votes on the y-axis, 405 00:21:43,180 --> 00:21:44,920 and those labeled now appropriately. 406 00:21:44,920 --> 00:21:50,060 I even see my title up top, Election Results. 407 00:21:50,060 --> 00:21:55,360 So we have started with a blank plot, added columns to it, 408 00:21:55,360 --> 00:21:58,090 adjusted our scale, and added labels. 409 00:21:58,090 --> 00:22:00,760 Let me ask, what questions do we have so far 410 00:22:00,760 --> 00:22:04,495 on how we've built up this plot using ggplot's layers here? 411 00:22:04,495 --> 00:22:07,540 AUDIENCE: Are we going in this lecture to know 412 00:22:07,540 --> 00:22:12,130 how to design or to change the layout of these aesthetics, 413 00:22:12,130 --> 00:22:15,640 like the votes and the candidates columns 414 00:22:15,640 --> 00:22:19,460 and the columns representation to the user? 415 00:22:19,460 --> 00:22:20,710 CARTER ZENKE: A good question.
416 00:22:20,710 --> 00:22:23,502 So you might be asking, well, how could we change these aesthetics? 417 00:22:23,502 --> 00:22:25,740 And often, you'll get to this plot and say, 418 00:22:25,740 --> 00:22:28,560 I don't exactly like how this works. Maybe I 419 00:22:28,560 --> 00:22:31,260 actually want my columns to be on the-- 420 00:22:31,260 --> 00:22:33,690 going across horizontally, let's say. 421 00:22:33,690 --> 00:22:36,000 We could absolutely change our aesthetics here 422 00:22:36,000 --> 00:22:39,010 to change what this column-- what this plot is doing here. 423 00:22:39,010 --> 00:22:41,910 We could do so by changing these values x and y. 424 00:22:41,910 --> 00:22:45,210 Maybe I would make x equal to votes and y equal to candidate. 425 00:22:45,210 --> 00:22:48,870 And that would map my columns so they go now left to right as opposed to 426 00:22:48,870 --> 00:22:49,860 up to down. 427 00:22:49,860 --> 00:22:52,170 That's one way we could change our plot visually. 428 00:22:52,170 --> 00:22:54,270 The next way we'll see, though, is how we could 429 00:22:54,270 --> 00:22:56,400 change our plot in terms of colors. 430 00:22:56,400 --> 00:22:58,420 And I think that's what you're alluding to here, 431 00:22:58,420 --> 00:23:02,460 which is our plot looks pretty nice, but it's pretty gray. 432 00:23:02,460 --> 00:23:03,780 It's like black and white. 433 00:23:03,780 --> 00:23:06,870 We could probably do better in terms of colors. 434 00:23:06,870 --> 00:23:08,820 So let's do just that. 435 00:23:08,820 --> 00:23:11,310 Well, I'll come back now to RStudio. 436 00:23:11,310 --> 00:23:15,960 And let's see how we could change the fill color, that is, 437 00:23:15,960 --> 00:23:20,380 the color filling of these columns depending on the candidate's name. 438 00:23:20,380 --> 00:23:23,590 Well, notice here I said I want the color to depend 439 00:23:23,590 --> 00:23:27,460 on the candidate's name, which seems a lot like these aesthetic mappings.
440 00:23:27,460 --> 00:23:30,550 I have some visual feature, in this case the fill color 441 00:23:30,550 --> 00:23:34,900 of my columns, that I now want to be dependent on the value 442 00:23:34,900 --> 00:23:37,000 in the candidate column. 443 00:23:37,000 --> 00:23:40,930 Well, I can do this using the aes function like we saw earlier. 444 00:23:40,930 --> 00:23:43,960 And there are more aesthetics than just x and y. 445 00:23:43,960 --> 00:23:46,270 There is, in fact, an aesthetic called fill 446 00:23:46,270 --> 00:23:50,390 that can change the fill color of each of these columns. 447 00:23:50,390 --> 00:23:54,430 So here I want to change the fill color of these columns themselves. 448 00:23:54,430 --> 00:24:00,610 So what I'll do is pass in the aes function as input to geom_col itself. 449 00:24:00,610 --> 00:24:04,810 And here I'll specify I want to change the fill aesthetic, the color that 450 00:24:04,810 --> 00:24:06,070 fills these columns. 451 00:24:06,070 --> 00:24:10,630 And I want it to depend on the candidate column in my data. 452 00:24:10,630 --> 00:24:15,130 So what I've said here is I want to specify a new aesthetic mapping, one 453 00:24:15,130 --> 00:24:16,750 that applies only to columns. 454 00:24:16,750 --> 00:24:20,350 I want to change their fill color and have it depend on now the candidate 455 00:24:20,350 --> 00:24:21,070 column. 456 00:24:21,070 --> 00:24:24,460 I'll go ahead and I'll run this update to our plot. 457 00:24:24,460 --> 00:24:27,220 And now we'll see some color, which is pretty nice. 458 00:24:27,220 --> 00:24:30,910 Notice how every candidate has their own color, which 459 00:24:30,910 --> 00:24:35,680 is because as input to geom_col, we said the fill aesthetic 460 00:24:35,680 --> 00:24:38,740 should depend on now the candidate column we have. 461 00:24:38,740 --> 00:24:42,970 Each candidate should, effectively, get their own color.
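The fill mapping described above could be sketched like this. The data frame name `votes` and the vote counts are assumptions for illustration; the `fill` aesthetic passed through `aes` inside `geom_col` is what the lecture describes.

```r
library(ggplot2)

# Hypothetical data: the counts are illustrative, not taken from the lecture.
votes <- data.frame(
  candidate = c("Mario", "Peach", "Bowser"),
  votes = c(100, 210, 150)
)

p <- ggplot(votes, aes(x = candidate, y = votes)) +
  # This mapping applies only to the column layer:
  # each distinct candidate value gets its own fill color.
  geom_col(aes(fill = candidate))

p
```

Because `fill` is mapped to a column rather than set to a fixed value, ggplot2 also creates a legend that explains which color belongs to which candidate.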
462 00:24:42,970 --> 00:24:46,945 But when we work with color, it's important to be mindful 463 00:24:46,945 --> 00:24:48,820 that people who might look at this plot might 464 00:24:48,820 --> 00:24:50,945 be looking at it with some form of color blindness. 465 00:24:50,945 --> 00:24:53,170 And so when we convey information with color, 466 00:24:53,170 --> 00:24:55,780 we should make sure we do it as accessibly as we can. 467 00:24:55,780 --> 00:24:58,000 And thankfully, in R and in ggplot, there 468 00:24:58,000 --> 00:25:00,155 are ways to adjust colors to make sure they're 469 00:25:00,155 --> 00:25:03,280 friendly to those who might look at this with some form of color blindness. 470 00:25:03,280 --> 00:25:05,260 So let's see how to do that now. 471 00:25:05,260 --> 00:25:07,270 I'll come back to our plot here. 472 00:25:07,270 --> 00:25:12,190 And actually, notice on the right-hand side, I've created kind of a new scale. 473 00:25:12,190 --> 00:25:16,540 I have here different colors assigned to different candidates. 474 00:25:16,540 --> 00:25:20,330 This is, in fact, its very own scale, one called a fill scale. 475 00:25:20,330 --> 00:25:22,700 We have our x scale and our y scale. 476 00:25:22,700 --> 00:25:25,490 And now we have a new one called a fill scale, 477 00:25:25,490 --> 00:25:29,310 determining what colors will belong to each of these candidates here. 478 00:25:29,310 --> 00:25:32,720 Well, if I want to change this scale, I could do it 479 00:25:32,720 --> 00:25:36,050 in a way that's very similar to the way I changed the y scale, by adding 480 00:25:36,050 --> 00:25:39,570 a new layer to my plot and overriding the default. 481 00:25:39,570 --> 00:25:43,610 And actually, thankfully, R comes with this scale called 482 00:25:43,610 --> 00:25:47,330 the viridis scale, which is known to be very friendly to many different forms 483 00:25:47,330 --> 00:25:48,470 of color blindness. 
484 00:25:48,470 --> 00:25:52,720 If I want to use the viridis scale, I can do so using this function here. 485 00:25:52,720 --> 00:25:54,470 I'll go back to my plot, and I'll go ahead 486 00:25:54,470 --> 00:25:57,650 and add this additional colorblind-friendly scale in. 487 00:25:57,650 --> 00:26:02,660 I'll say I want to adjust the fill scale I just created, the one you 488 00:26:02,660 --> 00:26:04,700 see on the very right-hand side. 489 00:26:04,700 --> 00:26:08,360 And I want it to instead be the viridis scale, which 490 00:26:08,360 --> 00:26:11,360 is going to be set to the discrete version of that scale. 491 00:26:11,360 --> 00:26:14,510 So recall how we had both continuous and discrete scales, continuous being 492 00:26:14,510 --> 00:26:17,000 on a range, discrete being individual values? 493 00:26:17,000 --> 00:26:20,360 There's a special scale called viridis discrete, 494 00:26:20,360 --> 00:26:22,730 or viridis_d for short, that allows me now 495 00:26:22,730 --> 00:26:26,330 to say I want to take these discrete, colorblind-friendly colors 496 00:26:26,330 --> 00:26:29,720 and make them the colors I'll see for each of my candidates. 497 00:26:29,720 --> 00:26:32,360 I can do this followed with a plus sign, and now I've 498 00:26:32,360 --> 00:26:35,570 added in this new layer that overrides the default colors 499 00:26:35,570 --> 00:26:38,560 and makes them more colorblind friendly. 500 00:26:38,560 --> 00:26:40,310 Let me go ahead and build this plot again. 501 00:26:40,310 --> 00:26:43,027 And here I'll see my colors change to be more 502 00:26:43,027 --> 00:26:45,110 friendly now to those who might be looking at this 503 00:26:45,110 --> 00:26:46,727 with some form of color blindness. 504 00:26:46,727 --> 00:26:48,560 And there's one more thing we could do here. 505 00:26:48,560 --> 00:26:52,520 Notice how on the right-hand side, this scale has a name, candidate. 
506 00:26:52,520 --> 00:26:55,610 But I want this to be capitalized, so I could change this 507 00:26:55,610 --> 00:27:01,153 by passing into scale_fill_viridis_d the title I want for this scale. 508 00:27:01,153 --> 00:27:02,570 Let me come back and do just that. 509 00:27:02,570 --> 00:27:05,810 I'll come back over here and say I want this viridis scale 510 00:27:05,810 --> 00:27:08,630 to instead be named now Candidate. 511 00:27:08,630 --> 00:27:10,640 I'll go ahead and rebuild my plot. 512 00:27:10,640 --> 00:27:14,750 And I'll see on the right-hand side I've changed not just the scale's colors 513 00:27:14,750 --> 00:27:17,390 but also its title. 514 00:27:17,390 --> 00:27:19,400 Now, while we're thinking about aesthetics 515 00:27:19,400 --> 00:27:24,200 and how nice this plot looks, one more thing I could do is change its theme. 516 00:27:24,200 --> 00:27:26,000 Ggplot comes with several themes. 517 00:27:26,000 --> 00:27:28,550 And by default or by convention, these themes 518 00:27:28,550 --> 00:27:31,460 are often applied at the end of our plot. 519 00:27:31,460 --> 00:27:35,930 After we've added our layers of columns and adjusting scales and adding labels, 520 00:27:35,930 --> 00:27:38,060 we can change the theme of our plot. 521 00:27:38,060 --> 00:27:41,480 So I'll go ahead and add a new layer, one final one to my plot here. 522 00:27:41,480 --> 00:27:45,922 And I can use this family of functions that all begin with theme_. 523 00:27:45,922 --> 00:27:47,630 And there are many themes to choose from. 524 00:27:47,630 --> 00:27:50,810 You could do theme_bw or theme_classic. 525 00:27:50,810 --> 00:27:54,560 These are all available in a reference on the ggplot package website. 526 00:27:54,560 --> 00:27:58,140 But here, I'll use theme_classic, which is more minimalistic here. 527 00:27:58,140 --> 00:28:02,030 Let me go ahead and say I want my theme to be the classic theme in ggplot. 528 00:28:02,030 --> 00:28:03,980 I'll go ahead and rebuild my plot now. 
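A minimal sketch of the two layers just discussed, the colorblind-friendly fill scale and the classic theme. The data frame name `votes` and the vote counts are made up for illustration, and passing the title through the named `name` argument is one assumed way to supply it; the function names are the ones given in the lecture.

```r
library(ggplot2)

# Hypothetical data: the counts are illustrative, not taken from the lecture.
votes <- data.frame(
  candidate = c("Mario", "Peach", "Bowser"),
  votes = c(100, 210, 150)
)

p <- ggplot(votes, aes(x = candidate, y = votes)) +
  geom_col(aes(fill = candidate)) +
  # Replace the default fill colors with the discrete viridis palette
  # and give the legend a capitalized title.
  scale_fill_viridis_d(name = "Candidate") +
  # By convention the theme comes last.
  theme_classic()

p
```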
529 00:28:03,980 --> 00:28:07,310 And here I'll see that I've kind of changed the aesthetics more 530 00:28:07,310 --> 00:28:11,810 so visually. I've dropped the gray background. I've simplified my scales. 531 00:28:11,810 --> 00:28:14,390 And I think things just look now much prettier, 532 00:28:14,390 --> 00:28:17,880 thanks to this theme layer here. 533 00:28:17,880 --> 00:28:21,260 So now we've updated the aesthetics by changing now 534 00:28:21,260 --> 00:28:23,630 the colors of each of these candidates and changing 535 00:28:23,630 --> 00:28:26,030 the general theme of the plot. 536 00:28:26,030 --> 00:28:29,990 What questions do we have on how we built this plot up from a blank page 537 00:28:29,990 --> 00:28:32,720 to adding columns to changing scales to adding labels 538 00:28:32,720 --> 00:28:35,720 and now changing the theme at the very end? 539 00:28:35,720 --> 00:28:38,150 What questions do we have? 540 00:28:38,150 --> 00:28:40,490 AUDIENCE: Can the layers be in any order? 541 00:28:40,490 --> 00:28:42,758 Or do we have to follow a specific order for them? 542 00:28:42,758 --> 00:28:44,300 CARTER ZENKE: A really good question. 543 00:28:44,300 --> 00:28:46,280 Can these layers be in any particular order, 544 00:28:46,280 --> 00:28:49,140 or is there some defined one we have to use? 545 00:28:49,140 --> 00:28:53,810 Turns out the only thing that has to come first is this ggplot function. 546 00:28:53,810 --> 00:28:58,410 This gives us metaphorically that blank page to write everything else on top 547 00:28:58,410 --> 00:28:58,910 of. 548 00:28:58,910 --> 00:29:02,840 We could put these other layers in any other order we wanted to, 549 00:29:02,840 --> 00:29:05,240 but there is some convention. 550 00:29:05,240 --> 00:29:08,360 Generally speaking, we go from a blank page 551 00:29:08,360 --> 00:29:11,930 to adding our geometries, like these columns here. 552 00:29:11,930 --> 00:29:15,080 Afterwards, we'll adjust our scales if we need to.
553 00:29:15,080 --> 00:29:18,740 Here I adjusted the fill scale, the one we see on the right-hand side. 554 00:29:18,740 --> 00:29:22,850 And I adjusted the y scale, the one going vertically now here. 555 00:29:22,850 --> 00:29:26,990 After we adjust our scales, we should adjust our labels if we want to. 556 00:29:26,990 --> 00:29:28,220 Here I did just that. 557 00:29:28,220 --> 00:29:30,560 I adjusted the labels here, setting my x-axis 558 00:29:30,560 --> 00:29:32,390 as the Candidate name down below. 559 00:29:32,390 --> 00:29:35,780 My y-axis has the Votes name on this left-hand side here. 560 00:29:35,780 --> 00:29:37,730 And my title is Election Results. 561 00:29:37,730 --> 00:29:40,580 And by convention, at the end we add in our theme 562 00:29:40,580 --> 00:29:44,270 here to say I want to take all this I've just done and style it 563 00:29:44,270 --> 00:29:47,120 in some particular way, in this case this classic theme, which 564 00:29:47,120 --> 00:29:49,250 is more minimalist in spirit. 565 00:29:49,250 --> 00:29:53,580 But a really good question on the ordering of these layers here. 566 00:29:53,580 --> 00:29:57,770 Let's take one more question on what we've just done so far. 567 00:29:57,770 --> 00:30:02,180 AUDIENCE: Is there a way to get rid of the legend on the right 568 00:30:02,180 --> 00:30:05,340 since it is redundant with the x-axis labels? 569 00:30:05,340 --> 00:30:06,840 CARTER ZENKE: Yeah, a good question. 570 00:30:06,840 --> 00:30:10,880 So here you might notice that I have one color for each of my candidates, 571 00:30:10,880 --> 00:30:14,000 and I have this so-called legend on the right-hand side called 572 00:30:14,000 --> 00:30:17,450 Candidate that tells me which colors these are associated with.
573 00:30:17,450 --> 00:30:20,050 Well, because I only have one color for each candidate, 574 00:30:20,050 --> 00:30:21,790 and I already have their names down here, 575 00:30:21,790 --> 00:30:23,890 you could probably argue that this is redundant. 576 00:30:23,890 --> 00:30:25,210 So we want to remove this. 577 00:30:25,210 --> 00:30:29,680 And it turns out I can do that using a parameter of geom_col, 578 00:30:29,680 --> 00:30:34,432 one called show.legend, which by default is true, but we could change to false. 579 00:30:34,432 --> 00:30:36,140 So let me go ahead and do that over here. 580 00:30:36,140 --> 00:30:37,900 I'll come back to RStudio. 581 00:30:37,900 --> 00:30:42,850 And let's say I want to make sure that this fill aesthetic of my column 582 00:30:42,850 --> 00:30:46,180 does not produce a legend on the right-hand side. 583 00:30:46,180 --> 00:30:50,470 Well, I could say I want to specify the show.legend parameter 584 00:30:50,470 --> 00:30:54,552 and have it be not true by default but false instead. 585 00:30:54,552 --> 00:30:56,510 And because this is getting a little long here, 586 00:30:56,510 --> 00:31:00,010 let me go ahead and put each of these arguments now on their own line, 587 00:31:00,010 --> 00:31:03,730 just like this, and move this parenthesis to its own line. 588 00:31:03,730 --> 00:31:07,870 And now I should see, if I were to rebuild my plot top to bottom, 589 00:31:07,870 --> 00:31:09,670 that I've removed that legend. 590 00:31:09,670 --> 00:31:13,810 And now I just have those candidates on the x-axis and different colors 591 00:31:13,810 --> 00:31:15,910 now for each of them here. 592 00:31:15,910 --> 00:31:18,920 But there's one more problem I would say with this too, which 593 00:31:18,920 --> 00:31:22,080 is in our original plot, we had something like this. 594 00:31:22,080 --> 00:31:26,450 We had Mario and then Peach and then Bowser.
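The show.legend change described above might look like the following sketch, with each argument on its own line. The data frame name `votes` and the vote counts are assumptions made up for illustration; `show.legend` is the parameter named in the lecture.

```r
library(ggplot2)

# Hypothetical data: the counts are illustrative, not taken from the lecture.
votes <- data.frame(
  candidate = c("Mario", "Peach", "Bowser"),
  votes = c(100, 210, 150)
)

p <- ggplot(votes, aes(x = candidate, y = votes)) +
  geom_col(
    aes(fill = candidate),
    show.legend = FALSE  # keep the colors, drop the redundant legend
  ) +
  scale_fill_viridis_d() +
  theme_classic()

p
```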
595 00:31:26,450 --> 00:31:29,690 If you look here, we had different ordering than we see in our plot. 596 00:31:29,690 --> 00:31:32,450 Here I have Mario, Peach, and Bowser, but on this plot here 597 00:31:32,450 --> 00:31:35,630 I have Bowser, Mario, and Peach because, by default, 598 00:31:35,630 --> 00:31:39,770 as we said before, ggplot will order these values in alphabetical order 599 00:31:39,770 --> 00:31:42,740 by name, Bowser, Mario, and then Peach. 600 00:31:42,740 --> 00:31:45,590 There is a way to reorder these explicitly 601 00:31:45,590 --> 00:31:48,165 if we wanted to, using what we'll call factors. 602 00:31:48,165 --> 00:31:49,790 But we'll save that a little bit more-- 603 00:31:49,790 --> 00:31:52,760 I'll save that for a little bit more later on. 604 00:31:52,760 --> 00:31:56,910 Now, we have here our plot, let's say, good enough for now. 605 00:31:56,910 --> 00:31:59,990 And I want to save it so I could share it with a friend. 606 00:31:59,990 --> 00:32:02,870 Here, this is currently all in RStudio. 607 00:32:02,870 --> 00:32:06,440 I'm seeing it here, but I want to maybe get an image file 608 00:32:06,440 --> 00:32:08,130 I could share with somebody else. 609 00:32:08,130 --> 00:32:11,896 Well, I could do that as well with a special function called ggsave. 610 00:32:11,896 --> 00:32:16,080 It lets me save my ggplot to my own computer. 611 00:32:16,080 --> 00:32:18,650 So let's see ggsave in action now. 612 00:32:18,650 --> 00:32:19,950 I'll come back over here. 613 00:32:19,950 --> 00:32:23,630 And why don't I take this entire plot I've built 614 00:32:23,630 --> 00:32:26,180 and now store it in an object? 
615 00:32:26,180 --> 00:32:30,860 And by convention for plots, we use this p name for plot objects, where 616 00:32:30,860 --> 00:32:34,550 I'll say everything we've done here, starting with a blank plot, 617 00:32:34,550 --> 00:32:38,810 adding in each of these layers, will be stored now under the name p, 618 00:32:38,810 --> 00:32:40,130 this object here. 619 00:32:40,130 --> 00:32:43,640 And I could, later on in my code, still add more layers. 620 00:32:43,640 --> 00:32:47,870 As long as I've saved this plot as this object p, I could say, well, p-- 621 00:32:47,870 --> 00:32:50,385 and let's go ahead and add in some new layer down below. 622 00:32:50,385 --> 00:32:51,260 But we won't do that. 623 00:32:51,260 --> 00:32:53,510 We'll instead save our plot. 624 00:32:53,510 --> 00:32:57,650 Well, we have a function called ggsave that lets us save our plots. 625 00:32:57,650 --> 00:32:59,760 And ggsave works a bit like this. 626 00:32:59,760 --> 00:33:02,480 I'll type ggsave here as the function name. 627 00:33:02,480 --> 00:33:07,110 And it takes quite a few arguments to save this plot to my computer. 628 00:33:07,110 --> 00:33:12,050 The first one is going to be the name of this file, in this case votes.png, 629 00:33:12,050 --> 00:33:12,860 let's say. 630 00:33:12,860 --> 00:33:18,260 And the next one is going to be the plot I want to save under this file name. 631 00:33:18,260 --> 00:33:21,770 Well, here I want it to be the plot p I just made. 632 00:33:21,770 --> 00:33:26,390 So I'll say the plot parameter to ggsave is equal to p. This argument here, 633 00:33:26,390 --> 00:33:28,190 we want this to be the plot we save. 634 00:33:28,190 --> 00:33:33,110 I'll then say how wide and how tall I want this image to be.
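The ggsave call being assembled here could be sketched as follows. The width, height, and units values are the ones the lecture goes on to choose; the data frame name `votes` and the vote counts are made up for illustration, and pixel units assume a reasonably recent ggplot2.

```r
library(ggplot2)

# Hypothetical data: the counts are illustrative, not taken from the lecture.
votes <- data.frame(
  candidate = c("Mario", "Peach", "Bowser"),
  votes = c(100, 210, 150)
)

# Store the whole plot, blank page plus layers, in an object named p.
p <- ggplot(votes, aes(x = candidate, y = votes)) +
  geom_col(aes(fill = candidate), show.legend = FALSE) +
  labs(x = "Candidate", y = "Votes", title = "Election Results") +
  theme_classic()

# Write the stored plot to an image file in the working directory.
ggsave(
  "votes.png",
  plot = p,
  width = 1200,
  height = 900,
  units = "px"
)
```

Because the plot lives in `p`, more layers could still be added later with `p + ...` before saving.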
635 00:33:33,110 --> 00:33:35,270 Here, I played around with this a bit before, 636 00:33:35,270 --> 00:33:38,870 and I found that if the width is around 1200 pixels 637 00:33:38,870 --> 00:33:43,430 and the height is around 900, so a four-by-three kind of square, 638 00:33:43,430 --> 00:33:45,800 this looks pretty good for this kind of plot here. 639 00:33:45,800 --> 00:33:51,450 And I'll also specify that these units are pixels, just like this. 640 00:33:51,450 --> 00:33:54,330 They also have inches and so on or centimeters and so on, 641 00:33:54,330 --> 00:33:56,240 but here I'll use pixels. 642 00:33:56,240 --> 00:33:58,820 And so thanks to all of these parameters here, 643 00:33:58,820 --> 00:34:03,260 the filename, the plot to save, how wide and how tall, and what units 644 00:34:03,260 --> 00:34:08,060 we're working with, if I were to run ggsave and go to my file explorer, 645 00:34:08,060 --> 00:34:10,429 well, now I would see votes.png. 646 00:34:10,429 --> 00:34:15,110 And if I click on it here, I'll see my own file here now saved to my computer, 647 00:34:15,110 --> 00:34:17,239 visualizing all of this data. 648 00:34:17,239 --> 00:34:20,749 So we've seen so far how to visualize our data using columns. 649 00:34:20,749 --> 00:34:23,179 When we come back, we'll see how to use another geometry, 650 00:34:23,179 --> 00:34:25,310 one called a point to visualize relationships 651 00:34:25,310 --> 00:34:27,260 among columns in our data. 652 00:34:27,260 --> 00:34:28,670 See you all in a few. 653 00:34:28,670 --> 00:34:29,989 Well, we're back. 654 00:34:29,989 --> 00:34:33,659 And so we've seen so far how to visualize our data using columns. 655 00:34:33,659 --> 00:34:37,699 But now we'll take a look at a new kind of geometry, namely the point. 656 00:34:37,699 --> 00:34:41,060 A point is good for when you want to visualize a relationship between two 657 00:34:41,060 --> 00:34:42,770 columns you might have in your data. 
658 00:34:42,770 --> 00:34:45,710 And those columns are both on a continuous scale. 659 00:34:45,710 --> 00:34:49,250 So to illustrate this point, no pun intended, we have here this data 660 00:34:49,250 --> 00:34:50,750 set of candy. 661 00:34:50,750 --> 00:34:55,280 Namely, we have names of candy and their price percentile and their sugar 662 00:34:55,280 --> 00:34:55,969 percentile. 663 00:34:55,969 --> 00:34:57,660 Well, what does that mean? 664 00:34:57,660 --> 00:35:00,660 Well, a price percentile is best illustrated through example here. 665 00:35:00,660 --> 00:35:03,507 So let's say I bought this Hershey's Milk Chocolate bar. 666 00:35:03,507 --> 00:35:05,090 And I went and bought it at the store. 667 00:35:05,090 --> 00:35:07,130 And it turns out that this milk chocolate bar 668 00:35:07,130 --> 00:35:10,700 is more expensive than 92% of candy. 669 00:35:10,700 --> 00:35:13,670 So the Hershey's Milk Chocolate bar, this is a pretty expensive candy 670 00:35:13,670 --> 00:35:14,720 overall. 671 00:35:14,720 --> 00:35:17,480 On the other hand, a Reese's Peanut Butter Cup, well, this 672 00:35:17,480 --> 00:35:20,690 is more expensive than 65% of other candies, 673 00:35:20,690 --> 00:35:24,500 so a little less expensive than this but still not that cheap either. 674 00:35:24,500 --> 00:35:27,380 Now, on the other hand, Sour Patch Kids is a candy-- 675 00:35:27,380 --> 00:35:31,520 this is more expensive than 12% of candy, so a little bit on the cheaper 676 00:35:31,520 --> 00:35:33,480 end as far as candies go. 677 00:35:33,480 --> 00:35:35,420 So this is price percentile. 678 00:35:35,420 --> 00:35:37,910 It's a relative measure of how much this candy costs 679 00:35:37,910 --> 00:35:39,560 among other candies in general. 680 00:35:39,560 --> 00:35:41,840 But we also have sugar percentile. 
681 00:35:41,840 --> 00:35:45,050 Well, if we take this same candy, this Hershey's Milk Chocolate bar, 682 00:35:45,050 --> 00:35:49,430 it turns out that it has more sugar than 43% of other candies. 683 00:35:49,430 --> 00:35:52,670 And this Reese's Peanut Butter Cup seems to have even more sugar. 684 00:35:52,670 --> 00:35:55,940 It has more sugar than 72% of candies. 685 00:35:55,940 --> 00:36:00,560 So in general here, a higher number for any of these columns, either price 686 00:36:00,560 --> 00:36:04,370 percentile or sugar percentile, means this candy is more expensive 687 00:36:04,370 --> 00:36:07,640 or has more sugar than other candies comparatively. 688 00:36:07,640 --> 00:36:10,940 So let's see how we could visualize now this data. 689 00:36:10,940 --> 00:36:15,290 Well, I have here this plot where on the x-axis, I have price, 690 00:36:15,290 --> 00:36:17,630 and on the y-axis, I have sugar. 691 00:36:17,630 --> 00:36:20,630 And it might make sense for us to go through these candies one 692 00:36:20,630 --> 00:36:24,750 by one to plot them as individual points on this plot here. 693 00:36:24,750 --> 00:36:27,500 So here let's look at again this Hershey's Milk Chocolate bar, 694 00:36:27,500 --> 00:36:32,380 which has a price percentile of 92 and a sugar percentile of 43. 695 00:36:32,380 --> 00:36:35,180 Where would this candy go on this plot? 696 00:36:35,180 --> 00:36:38,750 Well, I could look first, let's say, at my x-axis, which has this price 697 00:36:38,750 --> 00:36:40,880 percentile variable here. 698 00:36:40,880 --> 00:36:44,690 If I see that this candy has a 92 price percentile, that 699 00:36:44,690 --> 00:36:47,840 would be kind of over on the right of my price axis here, 700 00:36:47,840 --> 00:36:50,330 this x-axis, close to 100. 
701 00:36:50,330 --> 00:36:53,120 Now, it has here a sugar percentile of 43, 702 00:36:53,120 --> 00:36:55,880 which seems like it would go a little bit below 50, so maybe 703 00:36:55,880 --> 00:36:59,990 somewhere, if I point at the x-axis and the y-axis, somewhere right 704 00:36:59,990 --> 00:37:01,080 around here. 705 00:37:01,080 --> 00:37:04,850 I'll go ahead and draw that point here for the Hershey's Milk Chocolate bar. 706 00:37:04,850 --> 00:37:07,730 And maybe, as we add these points, we could 707 00:37:07,730 --> 00:37:10,760 ask ourselves, if we pay more for these candies, 708 00:37:10,760 --> 00:37:13,580 do we actually get more sugar, assuming sugar is what we want? 709 00:37:13,580 --> 00:37:14,270 We'll see. 710 00:37:14,270 --> 00:37:17,690 So the next one was this Reese's Peanut Butter Cup here. 711 00:37:17,690 --> 00:37:19,910 It turns out that compared to other candies, 712 00:37:19,910 --> 00:37:25,520 this is in the 65th percentile for price and the 72nd percentile for sugar 713 00:37:25,520 --> 00:37:27,620 so a good amount of sugar in these. 714 00:37:27,620 --> 00:37:30,540 Let's go ahead and plot this point on our plot here. 715 00:37:30,540 --> 00:37:35,900 Well, on the x-axis, it's the number 65, so a little bit past 50, 716 00:37:35,900 --> 00:37:37,790 let's say, maybe right around here. 717 00:37:37,790 --> 00:37:41,750 And the sugar percentile is 72, so probably somewhere 718 00:37:41,750 --> 00:37:43,700 in the top right or so. 719 00:37:43,700 --> 00:37:48,590 Why don't we say maybe right around here would be our Reese's Peanut Butter Cup 720 00:37:48,590 --> 00:37:51,075 now plotted on our plot here. 721 00:37:51,075 --> 00:37:52,700 Well, there's still more candies to go. 722 00:37:52,700 --> 00:37:55,640 We have as well Sour Patch Kids, just like 723 00:37:55,640 --> 00:37:58,280 this, which is relatively less expensive and also 724 00:37:58,280 --> 00:38:00,890 doesn't have that much sugar compared to other candies. 
725 00:38:00,890 --> 00:38:05,180 So this would probably go somewhere in the bottom left, I would say. 726 00:38:05,180 --> 00:38:07,580 It's pretty low on both the x and the y-axes. 727 00:38:07,580 --> 00:38:11,060 So if we look here, it has 12 on the price percentile, 728 00:38:11,060 --> 00:38:14,750 so kind of closer over here, and a 7 on the sugar 729 00:38:14,750 --> 00:38:17,090 percentile, so also pretty low over here. 730 00:38:17,090 --> 00:38:22,190 I'd say we could plot this point right about here for Sour Patch Kids. 731 00:38:22,190 --> 00:38:25,700 So seems like we're seeing a bit of a trend maybe going on here. 732 00:38:25,700 --> 00:38:28,190 What if we tried now Swedish Fish? 733 00:38:28,190 --> 00:38:31,010 Well, Swedish Fish, similar to this Reese's Peanut Butter Cup, 734 00:38:31,010 --> 00:38:34,490 they're pretty high in price but also pretty high in sugar. 735 00:38:34,490 --> 00:38:36,540 So let's plot this one too. 736 00:38:36,540 --> 00:38:37,680 I'll come back over here. 737 00:38:37,680 --> 00:38:42,350 And say, well, 76 is somewhere in the middle between 50 and 100, 738 00:38:42,350 --> 00:38:44,630 so around here on the x-axis. 739 00:38:44,630 --> 00:38:49,040 And the 60 is just above the 50 here on the y-axis, so let me go ahead 740 00:38:49,040 --> 00:38:52,940 and plot this as our Swedish Fish point. 741 00:38:52,940 --> 00:38:55,460 So maybe some relationship here. 742 00:38:55,460 --> 00:39:00,590 We see as price goes up, maybe our sugar intake goes up as well. 743 00:39:00,590 --> 00:39:01,820 Well, we'll see. 744 00:39:01,820 --> 00:39:04,280 Now, one thing we could do with this plot 745 00:39:04,280 --> 00:39:07,280 is think about some edge cases, which is, here, I 746 00:39:07,280 --> 00:39:09,650 have plotted four different candies. 747 00:39:09,650 --> 00:39:16,280 But let's say along comes another candy, this one called Hershey's Special Dark. 
748 00:39:16,280 --> 00:39:19,250 What do you notice about Hershey's Special Dark as 749 00:39:19,250 --> 00:39:23,360 compared to other candies on our list so far? 750 00:39:23,360 --> 00:39:26,330 One thing I notice here is that Hershey's Special Dark 751 00:39:26,330 --> 00:39:31,760 has actually the same price and sugar percentile as Hershey's Milk Chocolate. 752 00:39:31,760 --> 00:39:36,110 So it brings the question, how do we plot this data point? 753 00:39:36,110 --> 00:39:39,920 Well, if I look at this chart here and want to plot Hershey's Special Dark, 754 00:39:39,920 --> 00:39:44,630 I'll go ahead and look at it and say, well, the price percentile is about 92 755 00:39:44,630 --> 00:39:48,750 and the sugar percentile is 43, that would put it right here. 756 00:39:48,750 --> 00:39:51,420 So I'll go ahead and plot it just like this. 757 00:39:51,420 --> 00:39:56,180 And it seems like these points overlap. 758 00:39:56,180 --> 00:40:00,530 So although we've plotted supposedly five candies, 759 00:40:00,530 --> 00:40:04,100 I see here 1, 2, 3, 4 points. 760 00:40:04,100 --> 00:40:08,030 So it seems like I'm now missing a point because these two here are overlapping. 761 00:40:08,030 --> 00:40:09,980 So there are a few ways to solve this. 762 00:40:09,980 --> 00:40:12,650 One, as we'll see in ggplot, is actually to do 763 00:40:12,650 --> 00:40:14,570 what's called jittering these points, where 764 00:40:14,570 --> 00:40:16,640 you might be familiar with jittering, like if you have the jitters 765 00:40:16,640 --> 00:40:18,432 from being nervous or excited here. 766 00:40:18,432 --> 00:40:20,500 It's kind of the same idea with these points. 767 00:40:20,500 --> 00:40:24,470 I could take this point here, and I could do what's called jittering it. 768 00:40:24,470 --> 00:40:27,820 I could say, well, I don't really care if it's exactly where it needs to be.
769 00:40:27,820 --> 00:40:30,340 As long as these two points are just slightly separated, 770 00:40:30,340 --> 00:40:31,810 that's good enough for me. 771 00:40:31,810 --> 00:40:34,910 I'll erase this point here but keeping track where I had it. 772 00:40:34,910 --> 00:40:39,490 I'll put one point slightly below and one point slightly above. 773 00:40:39,490 --> 00:40:43,600 And now we see two distinct points, even though they have the same value. 774 00:40:43,600 --> 00:40:48,620 We jittered them around so they have a nonoverlapping visual here. 775 00:40:48,620 --> 00:40:50,920 So as you go about plotting these points, 776 00:40:50,920 --> 00:40:53,230 keep in mind if your data has these overlaps 777 00:40:53,230 --> 00:40:55,718 and you care about seeing each individual point, 778 00:40:55,718 --> 00:40:57,760 you might want to do what's called jittering them 779 00:40:57,760 --> 00:41:01,720 so you can see all of them and not just some of them in terms of overlaps 780 00:41:01,720 --> 00:41:02,750 as well. 781 00:41:02,750 --> 00:41:06,740 So one thing left to do now is to translate this into code. 782 00:41:06,740 --> 00:41:08,620 So let's go back now to RStudio and see if we 783 00:41:08,620 --> 00:41:11,800 can make a plot like this with ggplot. 784 00:41:11,800 --> 00:41:13,150 Come back now over here. 785 00:41:13,150 --> 00:41:14,410 And let's take a look. 786 00:41:14,410 --> 00:41:19,460 Here I have a file called candy.R. And the first thing it does 787 00:41:19,460 --> 00:41:23,810 is load for me this file called candy.RData. 788 00:41:23,810 --> 00:41:27,980 Now, inside candy.RData is this data frame called candy. 789 00:41:27,980 --> 00:41:31,190 So if I were to run line 1, I'll now see that I 790 00:41:31,190 --> 00:41:33,140 have this data frame called candy. 791 00:41:33,140 --> 00:41:35,510 And inside is much more than just five candies. 
792 00:41:35,510 --> 00:41:40,040 I have lots of different candies in here and their price and sugar percentiles. 793 00:41:40,040 --> 00:41:43,100 So let's see if we could visualize them using ggplot. 794 00:41:43,100 --> 00:41:46,250 I'll come back now to candy.R. And of course, we'll 795 00:41:46,250 --> 00:41:52,320 begin with our ggplot function to begin with a new blank plot, just like this. 796 00:41:52,320 --> 00:41:56,150 I now have my blank canvas that I can add layers to. 797 00:41:56,150 --> 00:41:58,220 Well, the first thing I might want to do is 798 00:41:58,220 --> 00:42:01,790 say to ggplot, what data frame are you using here? 799 00:42:01,790 --> 00:42:03,530 In this case, the candy data frame. 800 00:42:03,530 --> 00:42:06,050 So I'll pass this first input here, candy. 801 00:42:06,050 --> 00:42:09,290 And then I'll also assign some aesthetics. 802 00:42:09,290 --> 00:42:13,820 I should probably assign the x and the y-axes as we saw here on my chart. 803 00:42:13,820 --> 00:42:16,730 I'll say aesthetically, I want the x-axis 804 00:42:16,730 --> 00:42:19,430 to match the price percentile column. 805 00:42:19,430 --> 00:42:22,310 And I want the y aesthetic, that vertical bar, 806 00:42:22,310 --> 00:42:25,727 to match the sugar percentile column, just like that. 807 00:42:25,727 --> 00:42:27,560 And this is getting a little long as a line. 808 00:42:27,560 --> 00:42:31,070 So I will on one line put the candy data frame 809 00:42:31,070 --> 00:42:34,250 and on the next line put these aesthetic mappings, let's say. 810 00:42:34,250 --> 00:42:37,400 And this is the very beginning of my plot. 811 00:42:37,400 --> 00:42:40,040 In fact, if I were to run what I have right now, 812 00:42:40,040 --> 00:42:42,920 I would actually see that ggplot has constructed 813 00:42:42,920 --> 00:42:45,680 for me a plot that has these axes. 814 00:42:45,680 --> 00:42:48,380 Notice how on the bottom, I see price percentile. 
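A minimal sketch of the step just described, not the lecture's own file: passing a data frame and aesthetic mappings to `ggplot()` gives axes but no points yet. The small data frame below stands in for `candy.RData` and uses only the values read out earlier in the lecture. The column names `candy`, `price_percentile`, and `sugar_percentile` are assumptions based on how the columns are described.

```r
library(ggplot2)

# Stand-in for the candy data frame loaded from candy.RData.
# Column names are assumed; values are the ones read out in the lecture.
candy <- data.frame(
  candy = c("Sour Patch Kids", "Swedish Fish",
            "Hershey's Milk Chocolate", "Hershey's Special Dark"),
  price_percentile = c(12, 76, 92, 92),
  sugar_percentile = c(7, 60, 43, 43)
)

# Data frame on one line, aesthetic mappings on the next.
# With no geometry added, this draws only the axes.
p <- ggplot(
  candy,
  aes(x = price_percentile, y = sugar_percentile)
)
p
```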
815 00:42:48,380 --> 00:42:50,923 And on the y-axis, I see sugar percentile, 816 00:42:50,923 --> 00:42:53,840 very similar to what we have in our chart here but just different kind 817 00:42:53,840 --> 00:42:56,720 of numbers that we're seeing on the individual axes here. 818 00:42:56,720 --> 00:42:58,760 Here, I have 0, 50, 100. 819 00:42:58,760 --> 00:43:00,410 Here, I have 0, 50, 100. 820 00:43:00,410 --> 00:43:03,930 Here, I have 0, 25, 50, 75, 100, and so on. 821 00:43:03,930 --> 00:43:07,440 So same thing but different numbers on this plot here. 822 00:43:07,440 --> 00:43:09,310 So what more could we do? 823 00:43:09,310 --> 00:43:14,060 We want to visualize these points, which we can do using a new kind of geom. 824 00:43:14,060 --> 00:43:16,340 We saw geom_col last time. 825 00:43:16,340 --> 00:43:19,010 Let's see what geom_point could do for us. 826 00:43:19,010 --> 00:43:21,710 I will add a new layer to my plot, let's say. 827 00:43:21,710 --> 00:43:26,060 I'll come back over here, and I will add this geom_point layer, 828 00:43:26,060 --> 00:43:30,770 which will draw for me these points according to every individual row 829 00:43:30,770 --> 00:43:31,850 that I have. 830 00:43:31,850 --> 00:43:35,960 Geom_point will look at my data frame and make a new point 831 00:43:35,960 --> 00:43:40,045 for every individual row I have of candy inside this data frame. 832 00:43:40,045 --> 00:43:41,420 I'll go ahead, and I'll run this. 833 00:43:41,420 --> 00:43:42,470 And what do I see? 834 00:43:42,470 --> 00:43:46,820 Well, now I see lots of individual points representing the candies I have, 835 00:43:46,820 --> 00:43:49,100 their price, and their sugar content. 836 00:43:49,100 --> 00:43:52,910 Now, the relationship seems a little bit less clear here. 837 00:43:52,910 --> 00:43:55,680 If you pay more, maybe you get more sugar, maybe not. 838 00:43:55,680 --> 00:43:58,220 It depends, it seems, on the candy.
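As a rough illustration of adding the point layer, the sketch below uses a four-row stand-in for the candy data frame (column names are assumptions; values are those mentioned in the lecture). `geom_point()` draws one point per row, so the two Hershey's rows land on exactly the same spot.

```r
library(ggplot2)

# Stand-in data; column names are assumed, not taken from candy.RData.
candy <- data.frame(
  candy = c("Sour Patch Kids", "Swedish Fish",
            "Hershey's Milk Chocolate", "Hershey's Special Dark"),
  price_percentile = c(12, 76, 92, 92),
  sugar_percentile = c(7, 60, 43, 43)
)

# The + adds a layer; geom_point() makes one point for every row.
p <- ggplot(
  candy,
  aes(x = price_percentile, y = sugar_percentile)
) +
  geom_point()
p

# layer_data() returns the coordinates ggplot2 will actually draw.
drawn <- layer_data(p)
```

Here `drawn` has four rows, but rows 3 and 4 share the same `x` and `y`, which is why only three distinct points are visible.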
839 00:43:58,220 --> 00:44:01,070 But I'd argue we're running into the same issue 840 00:44:01,070 --> 00:44:03,590 we just saw with this physical plot here, which 841 00:44:03,590 --> 00:44:07,940 is if any two candies have the same price and sugar percentile, 842 00:44:07,940 --> 00:44:10,580 well, they're going to be overlapping each other. 843 00:44:10,580 --> 00:44:13,580 And in fact, I can show you a few points that do just that. 844 00:44:13,580 --> 00:44:18,650 If I come back now to my table and show you in the candy data frame-- 845 00:44:18,650 --> 00:44:21,980 let me sort this by price percentile and scroll down a little bit more. 846 00:44:21,980 --> 00:44:26,240 Here I'll see that between Hershey's Krackel, Milk Chocolate, Special Dark, 847 00:44:26,240 --> 00:44:29,720 these all have the same price and sugar percentile. 848 00:44:29,720 --> 00:44:33,890 So these would appear as only one point on my plot 849 00:44:33,890 --> 00:44:36,990 when I ideally want to separate them even just a little bit. 850 00:44:36,990 --> 00:44:42,230 So to do just that, I can use a geometry not called geom_point 851 00:44:42,230 --> 00:44:44,257 but one called geom_jitter. 852 00:44:44,257 --> 00:44:47,090 Like we said, we're going to jitter our points, meaning kind of move 853 00:44:47,090 --> 00:44:49,190 them around a little bit randomly but still 854 00:44:49,190 --> 00:44:52,340 so they're still in the roughly the same place that they're supposed to be. 855 00:44:52,340 --> 00:44:53,840 So I'll use geom_jitter here. 856 00:44:53,840 --> 00:44:55,860 And I'll run this top to bottom. 857 00:44:55,860 --> 00:44:59,430 And I'll see ever so slightly my points have changed. 858 00:44:59,430 --> 00:45:02,390 And I now am more able to see individual points. 859 00:45:02,390 --> 00:45:05,872 Particularly around these, you might see just a bit of separation. 860 00:45:05,872 --> 00:45:08,330 Around these, you might see a bit more separation here too. 
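A hedged sketch of the swap from `geom_point()` to `geom_jitter()`, again on a small stand-in data frame with assumed column names. The call to `set.seed()` is an addition for this example only, intended to make the random offsets repeatable.

```r
library(ggplot2)

# Stand-in data; column names are assumed, not taken from candy.RData.
candy <- data.frame(
  candy = c("Sour Patch Kids", "Swedish Fish",
            "Hershey's Milk Chocolate", "Hershey's Special Dark"),
  price_percentile = c(12, 76, 92, 92),
  sugar_percentile = c(7, 60, 43, 43)
)

# Jitter is random, so fix R's seed first (not part of the lecture code).
set.seed(50)

# geom_jitter() nudges each point by a small random amount.
# It also accepts width and height arguments; the defaults are used here.
p <- ggplot(
  candy,
  aes(x = price_percentile, y = sugar_percentile)
) +
  geom_jitter()
p

drawn <- layer_data(p)
```

After jittering, the two Hershey's rows no longer share identical coordinates in `drawn`, so both can be seen.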
861 00:45:08,330 --> 00:45:14,610 You can tell ggplot how much to widen or to change the height of these points 862 00:45:14,610 --> 00:45:15,110 here. 863 00:45:15,110 --> 00:45:17,240 But for now, we'll leave it as the default. 864 00:45:17,240 --> 00:45:20,963 We can now see a little bit more of these individual points. 865 00:45:20,963 --> 00:45:23,130 Now, I'll go ahead and improve this chart some more. 866 00:45:23,130 --> 00:45:25,020 Here, I want to probably add some labels, 867 00:45:25,020 --> 00:45:28,410 like renaming my x-axis and my y-axis, adding a title. 868 00:45:28,410 --> 00:45:30,510 And let's go ahead and set our theme too. 869 00:45:30,510 --> 00:45:32,580 I'll come back now to RStudio. 870 00:45:32,580 --> 00:45:34,920 And let me change this. 871 00:45:34,920 --> 00:45:38,010 I will add in a label layer. 872 00:45:38,010 --> 00:45:43,320 And I'll set the x label equal to Price, the y label equal to Sugar. 873 00:45:43,320 --> 00:45:49,380 And the title of this chart will be simply Price and Sugar. 874 00:45:49,380 --> 00:45:52,500 And then, finally, I'll go ahead and adjust my theme. 875 00:45:52,500 --> 00:45:54,840 I'll again use the classic theme here. 876 00:45:54,840 --> 00:45:58,830 And we should see my chart is now coming into shape. 877 00:45:58,830 --> 00:46:00,510 So what have we done? 878 00:46:00,510 --> 00:46:04,710 We have started again with this blank canvas and given as input 879 00:46:04,710 --> 00:46:06,270 our candy data frame. 880 00:46:06,270 --> 00:46:11,130 We've told ggplot we want to map the price percentile column to the x-axis, 881 00:46:11,130 --> 00:46:12,300 as we just did here. 882 00:46:12,300 --> 00:46:16,860 And we want to map the sugar percentile column, as we just did, on the y-axis 883 00:46:16,860 --> 00:46:17,400 here. 
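The label and theme layers described above might look like the following sketch. The data frame is a small stand-in with assumed column names; `labs()` and `theme_classic()` are the ggplot2 functions for labels and the classic theme.

```r
library(ggplot2)

# Stand-in data; column names are assumed, not taken from candy.RData.
candy <- data.frame(
  candy = c("Sour Patch Kids", "Swedish Fish",
            "Hershey's Milk Chocolate", "Hershey's Special Dark"),
  price_percentile = c(12, 76, 92, 92),
  sugar_percentile = c(7, 60, 43, 43)
)

p <- ggplot(
  candy,
  aes(x = price_percentile, y = sugar_percentile)
) +
  geom_jitter() +
  # Rename both axes and give the chart a title.
  labs(
    x = "Price",
    y = "Sugar",
    title = "Price and Sugar"
  ) +
  # Plain white background with axis lines and no grid.
  theme_classic()
p
```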
884 00:46:17,400 --> 00:46:19,990 We're going to go ahead and add in these points 885 00:46:19,990 --> 00:46:23,380 and jitter them just a little bit so we can see those individual points too. 886 00:46:23,380 --> 00:46:26,830 We're going to add labels and set our theme. 887 00:46:26,830 --> 00:46:32,470 Let me ask, what questions do we have so far on this point geometry, jittering 888 00:46:32,470 --> 00:46:35,770 our points, or anything more related to this plot here? 889 00:46:35,770 --> 00:46:41,770 AUDIENCE: When does-- when does this plot is preferred over the column? 890 00:46:41,770 --> 00:46:43,588 CARTER ZENKE: Yeah, a really good question, 891 00:46:43,588 --> 00:46:45,880 and it's one you probably will run into very frequently 892 00:46:45,880 --> 00:46:48,760 when you're visualizing your data is, what type of visualization 893 00:46:48,760 --> 00:46:49,930 is best here? 894 00:46:49,930 --> 00:46:55,063 Now, when we have data that involves both categorical and continuous 895 00:46:55,063 --> 00:46:57,230 variables, like we saw in our candidates-- remember, 896 00:46:57,230 --> 00:47:01,090 we had called the x-axis a discrete scale or a categorical scale? 897 00:47:01,090 --> 00:47:03,940 It had individual candidates, and each of those candidates 898 00:47:03,940 --> 00:47:06,970 had some continuous value associated with them, 899 00:47:06,970 --> 00:47:10,300 the number of votes-- that's a good kind of data to use 900 00:47:10,300 --> 00:47:13,180 for a bar chart or a column chart. 901 00:47:13,180 --> 00:47:16,990 Here, though, we have actually two continuous variables, 902 00:47:16,990 --> 00:47:18,400 two continuous scales. 903 00:47:18,400 --> 00:47:22,990 On the x-axis, I see a range of values between 0 and 100. 904 00:47:22,990 --> 00:47:25,660 And on the y-axis, I see the same thing.
905 00:47:25,660 --> 00:47:28,420 Well, if you have a continuous range of values 906 00:47:28,420 --> 00:47:31,840 on both your x-axis and your y-axis, that's going to be a good hint. 907 00:47:31,840 --> 00:47:35,510 You probably want to use something like points, for instance. 908 00:47:35,510 --> 00:47:38,015 You could imagine using columns here, but I'm not 909 00:47:38,015 --> 00:47:39,640 exactly sure what that would look like. 910 00:47:39,640 --> 00:47:43,660 Maybe I need to have two columns for each of my candies, 911 00:47:43,660 --> 00:47:46,600 one for sugar content and one for price content, 912 00:47:46,600 --> 00:47:50,680 which wouldn't quite show me the relationship between price and sugar. 913 00:47:50,680 --> 00:47:54,520 Here, I argue, that this shows us much better the relationship between price 914 00:47:54,520 --> 00:47:58,870 and sugar when both are continuous, that is, on this individual scale 915 00:47:58,870 --> 00:48:01,240 here between 0 and 100. 916 00:48:01,240 --> 00:48:03,340 But a good question and one to consider as you 917 00:48:03,340 --> 00:48:06,760 go off and design your own plots too. 918 00:48:06,760 --> 00:48:09,310 Now, let's go ahead and tidy this up a little bit more. 919 00:48:09,310 --> 00:48:11,590 And maybe I can go ahead and actually do one thing, 920 00:48:11,590 --> 00:48:13,540 which is the title doesn't appear to be showing here. 921 00:48:13,540 --> 00:48:15,710 I have Price and Sugar, but I didn't set the title of it here. 922 00:48:15,710 --> 00:48:17,800 So let me go ahead and go back and change that. 923 00:48:17,800 --> 00:48:23,760 I will update the label layer to instead say the title is Price and Sugar. 924 00:48:23,760 --> 00:48:27,180 Let me rerun this, and we'll see Price and Sugar up top. 925 00:48:27,180 --> 00:48:31,650 But now one kind of fun thing I could do is change the color 926 00:48:31,650 --> 00:48:33,660 that I see on these points. 
927 00:48:33,660 --> 00:48:38,040 And it turns out that there is an aesthetic that I can apply to my points 928 00:48:38,040 --> 00:48:41,460 here with geom_jitter, one called color. 929 00:48:41,460 --> 00:48:46,290 Now, one color I kind of like is this one called dark orchid. 930 00:48:46,290 --> 00:48:49,890 And in R, you actually have access to certain colors 931 00:48:49,890 --> 00:48:51,537 that are known by given names. 932 00:48:51,537 --> 00:48:53,370 So there's one called dark orchid, and we'll 933 00:48:53,370 --> 00:48:57,510 see here that RStudio has automatically shown me what color dark orchid is, 934 00:48:57,510 --> 00:49:03,400 but you also have more primary colors, like blue or like red or like green, 935 00:49:03,400 --> 00:49:04,320 and so on. 936 00:49:04,320 --> 00:49:09,030 And it turns out that this color aesthetic, when applied to this point 937 00:49:09,030 --> 00:49:12,450 here, will actually change the color of these points we're seeing. 938 00:49:12,450 --> 00:49:15,720 So I'll try to make these points this color, dark orchid, which kind of 939 00:49:15,720 --> 00:49:17,550 evokes some candy here for us. 940 00:49:17,550 --> 00:49:19,200 Let me go ahead and see what that does. 941 00:49:19,200 --> 00:49:20,570 I'll go ahead and run this. 942 00:49:20,570 --> 00:49:25,570 And now I'll see that my points have changed to this color, dark orchid. 943 00:49:25,570 --> 00:49:29,890 Now, notice here that we're not specifying an aesthetic mapping. 944 00:49:29,890 --> 00:49:34,600 An aesthetic mapping tells ggplot to map some column in our data 945 00:49:34,600 --> 00:49:39,280 to some given aesthetic, like x or y or even color if we wanted to. 946 00:49:39,280 --> 00:49:43,030 What I'm doing here instead is saying all points should actually 947 00:49:43,030 --> 00:49:44,860 get this same color. 
948 00:49:44,860 --> 00:49:48,190 If I want to apply any aesthetic to have only one value, 949 00:49:48,190 --> 00:49:50,830 I don't need to use the aes function, I can simply 950 00:49:50,830 --> 00:49:53,860 say the aesthetic name and then the value I want it to have 951 00:49:53,860 --> 00:49:56,710 inside the geometry I want to apply it to. 952 00:49:56,710 --> 00:49:59,530 So in this case, I want to change the color of these points that 953 00:49:59,530 --> 00:50:04,210 are jittering about and change it to dark orchid, in particular. 954 00:50:04,210 --> 00:50:08,110 Now, one other step we could change to is one called size. 955 00:50:08,110 --> 00:50:10,720 Well, these points have a certain size, a certain radius, 956 00:50:10,720 --> 00:50:12,970 a certain size on this page here, that I could change 957 00:50:12,970 --> 00:50:14,662 using this aesthetic called size. 958 00:50:14,662 --> 00:50:16,120 So let me go back and do just that. 959 00:50:16,120 --> 00:50:17,410 I'll come back to RStudio. 960 00:50:17,410 --> 00:50:22,050 And let's try changing the size of these points to be maybe a little bit bigger. 961 00:50:22,050 --> 00:50:25,830 I could set size, which is by default somewhere between 1 and 2, 962 00:50:25,830 --> 00:50:27,660 let's say 1.5 or so-- 963 00:50:27,660 --> 00:50:31,470 I could change this perhaps to maybe 2, make it just a little bit bigger. 964 00:50:31,470 --> 00:50:32,970 I'll go ahead and visualize this. 965 00:50:32,970 --> 00:50:35,853 And I'll see my points are now just a little bit bigger. 966 00:50:35,853 --> 00:50:37,020 I can make them even bigger. 967 00:50:37,020 --> 00:50:40,270 I can make them maybe size 4 or so, just like this. 968 00:50:40,270 --> 00:50:43,230 And now we're better able to see our points because they're bigger, 969 00:50:43,230 --> 00:50:45,370 but there's still a lot of overlap here. 970 00:50:45,370 --> 00:50:49,170 So one thing you could do is maybe make them smaller to reduce that overlap. 
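To make the distinction concrete, here is a small sketch of setting `color` and `size` to single fixed values inside the geometry, outside of `aes()`. The data frame is a stand-in with assumed column names; `"darkorchid"` is one of R's built-in named colors.

```r
library(ggplot2)

# Stand-in data; column names are assumed, not taken from candy.RData.
candy <- data.frame(
  candy = c("Sour Patch Kids", "Swedish Fish",
            "Hershey's Milk Chocolate", "Hershey's Special Dark"),
  price_percentile = c(12, 76, 92, 92),
  sugar_percentile = c(7, 60, 43, 43)
)

p <- ggplot(
  candy,
  aes(x = price_percentile, y = sugar_percentile)
) +
  # Fixed values, not mappings: every point gets the same color and size.
  geom_jitter(
    color = "darkorchid",
    size = 2
  )
p

drawn <- layer_data(p)
```

Because these values sit outside `aes()`, no column is consulted and no legend is produced; every row in `drawn` carries the same `colour` and `size`.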
971 00:50:49,170 --> 00:50:52,290 I'll come back over and say let's change this from 4 972 00:50:52,290 --> 00:50:56,100 to maybe like a 0.5, make them pretty small. 973 00:50:56,100 --> 00:51:00,030 And now I'll be better able to see these individual points. 974 00:51:00,030 --> 00:51:02,880 I think I'll leave it somewhere around 2 or so 975 00:51:02,880 --> 00:51:05,310 to make this chart more visually interesting. 976 00:51:05,310 --> 00:51:08,640 But you can have access to this aesthetic called size 977 00:51:08,640 --> 00:51:12,270 to change what your plot looks like in terms of how big these dots actually 978 00:51:12,270 --> 00:51:13,410 are. 979 00:51:13,410 --> 00:51:15,510 Now, a few more we can work with-- 980 00:51:15,510 --> 00:51:19,330 one is going to be called in this case fill. 981 00:51:19,330 --> 00:51:21,130 And one is going to be called shape. 982 00:51:21,130 --> 00:51:25,070 We've actually seen fill already when we filled in our columns. 983 00:51:25,070 --> 00:51:28,030 But we can also supply the fill aesthetic 984 00:51:28,030 --> 00:51:30,890 if we change the shape of some of our dots here. 985 00:51:30,890 --> 00:51:34,030 So let's go back to looking at our dots, and let me play around 986 00:51:34,030 --> 00:51:36,340 with this aesthetic called shape. 987 00:51:36,340 --> 00:51:38,655 Let me go actually and put this above size 988 00:51:38,655 --> 00:51:40,780 to keep these arguments in alphabetical order here. 989 00:51:40,780 --> 00:51:43,690 I'll say that shape is equal to-- 990 00:51:43,690 --> 00:51:45,340 well, what do I want? 991 00:51:45,340 --> 00:51:50,510 I mean, it turns out I can't specify something like triangle like this. 992 00:51:50,510 --> 00:51:52,910 And I can't use something like square like this. 993 00:51:52,910 --> 00:51:55,490 I actually have to actually put in some numbers here. 
994 00:51:55,490 --> 00:51:58,570 And if I look up on the ggplot reference what numbers correspond 995 00:51:58,570 --> 00:52:01,120 to which shapes, I can actually see which shape I want 996 00:52:01,120 --> 00:52:03,070 and type in the corresponding number. 997 00:52:03,070 --> 00:52:05,945 Let's go through a few of these here just to see what they look like. 998 00:52:05,945 --> 00:52:08,650 I'll change shape to be 1 and see what I get. 999 00:52:08,650 --> 00:52:10,570 I'll go ahead and rebuild this chart. 1000 00:52:10,570 --> 00:52:15,520 And now I see my dots are still there, but they're a little more translucent. 1001 00:52:15,520 --> 00:52:18,160 I see here that they have the color on the outside. 1002 00:52:18,160 --> 00:52:20,140 And on the inside, they seem to be transparent. 1003 00:52:20,140 --> 00:52:23,830 I actually see through to the white background at the bottom of this page, 1004 00:52:23,830 --> 00:52:25,150 if you will. 1005 00:52:25,150 --> 00:52:27,040 I could change to a different shape too. 1006 00:52:27,040 --> 00:52:30,370 Let me come back and try maybe the second shape option we have, 1007 00:52:30,370 --> 00:52:32,380 shape equals 2. 1008 00:52:32,380 --> 00:52:35,560 And now I get triangles, which is kind of cool. 1009 00:52:35,560 --> 00:52:39,250 I could also use maybe shape 3, and now I get some plus signs. 1010 00:52:39,250 --> 00:52:42,130 There are lots of shapes you can play around with and use 1011 00:52:42,130 --> 00:52:44,710 depending on how you want to visualize your data 1012 00:52:44,710 --> 00:52:47,380 and change things aesthetically. 1013 00:52:47,380 --> 00:52:51,160 Now, one shape I like in particular, particularly for this kind of data, 1014 00:52:51,160 --> 00:52:52,455 is going to be shape 21. 1015 00:52:52,455 --> 00:52:54,580 And I only know that because I looked it up online. 1016 00:52:54,580 --> 00:52:56,080 I went through the shapes in ggplot. 
1017 00:52:56,080 --> 00:52:59,740 And I found that this shape corresponds to the number 21. 1018 00:52:59,740 --> 00:53:01,240 So let's see what that looks like. 1019 00:53:01,240 --> 00:53:02,660 I'll come back over here. 1020 00:53:02,660 --> 00:53:06,850 And I'll try shape equals, in this case, 21. 1021 00:53:06,850 --> 00:53:08,740 Let me go ahead and rebuild my plot. 1022 00:53:08,740 --> 00:53:11,890 And I'll see those translucent points again. 1023 00:53:11,890 --> 00:53:15,670 But it turns out that, actually, this shape, 21, 1024 00:53:15,670 --> 00:53:21,280 allows us to specify both a color and a fill, where the color, to be clear, 1025 00:53:21,280 --> 00:53:25,090 is this color on the outside of the dot and the fill 1026 00:53:25,090 --> 00:53:27,790 is the color on the inside of the dot. 1027 00:53:27,790 --> 00:53:31,073 So I could have kind of a two-tone dot here with some border around it 1028 00:53:31,073 --> 00:53:34,240 that's a little bit darker and some fill that's a little bit lighter to make 1029 00:53:34,240 --> 00:53:35,740 it more aesthetically pleasing. 1030 00:53:35,740 --> 00:53:37,880 Well, let's try setting the fill aesthetic here. 1031 00:53:37,880 --> 00:53:39,910 I'll come back to RStudio. 1032 00:53:39,910 --> 00:53:43,630 And let me change, in this case, the fill aesthetic 1033 00:53:43,630 --> 00:53:50,260 to be a color that I found in R's manual called just regular old orchid, 1034 00:53:50,260 --> 00:53:51,320 like this. 1035 00:53:51,320 --> 00:53:55,990 So my color, my border of these dots will be dark orchid. 1036 00:53:55,990 --> 00:53:58,450 And their fill, the color on the inside of them, 1037 00:53:58,450 --> 00:54:01,780 will be this kind of pinkish-purplish color called orchid. 1038 00:54:01,780 --> 00:54:03,850 I'll go ahead and rebuild this plot. 1039 00:54:03,850 --> 00:54:06,370 And now I'll see some kind of two-toned dots. 
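A sketch of the two-tone point described here, assuming the same stand-in data frame and column names as before. Shape 21 is a circle that takes both a border `color` and an interior `fill`.

```r
library(ggplot2)

# Stand-in data; column names are assumed, not taken from candy.RData.
candy <- data.frame(
  candy = c("Sour Patch Kids", "Swedish Fish",
            "Hershey's Milk Chocolate", "Hershey's Special Dark"),
  price_percentile = c(12, 76, 92, 92),
  sugar_percentile = c(7, 60, 43, 43)
)

p <- ggplot(
  candy,
  aes(x = price_percentile, y = sugar_percentile)
) +
  geom_jitter(
    color = "darkorchid",  # border of each dot
    fill = "orchid",       # inside of each dot
    shape = 21,            # circle with separate border and fill
    size = 2
  ) +
  labs(x = "Price", y = "Sugar", title = "Price and Sugar") +
  theme_classic()
p

drawn <- layer_data(p)
```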
1040 00:54:06,370 --> 00:54:09,760 I see here that I have that ring around them in dark orchid. 1041 00:54:09,760 --> 00:54:14,020 And in the middle, I see that color orchid for their center fill. 1042 00:54:14,020 --> 00:54:17,620 So here we've now played around with different aesthetics for our dots, 1043 00:54:17,620 --> 00:54:20,230 including color, fill, shape, and size. 1044 00:54:20,230 --> 00:54:24,580 There are more at your disposal too but more on that another time. 1045 00:54:24,580 --> 00:54:26,500 And this, I think, is a pretty good plot. 1046 00:54:26,500 --> 00:54:29,620 We've built it up from scratch, adding in our dots, our aesthetics, 1047 00:54:29,620 --> 00:54:31,180 our labels, and our themes. 1048 00:54:31,180 --> 00:54:36,490 Let me ask, what questions do we have on designing plots now with points? 1049 00:54:36,490 --> 00:54:40,090 AUDIENCE: Is there a way to randomize the dots 1050 00:54:40,090 --> 00:54:44,380 and the color of the dots in the plot? 1051 00:54:44,380 --> 00:54:46,870 CARTER ZENKE: To randomize the color of the dots? 1052 00:54:46,870 --> 00:54:47,980 Yes. 1053 00:54:47,980 --> 00:54:51,040 So one way we could do that is specifying a new aesthetic. 1054 00:54:51,040 --> 00:54:55,570 So if we ever want to vary a certain aesthetic, like let's say color 1055 00:54:55,570 --> 00:54:59,800 or fill or shape or size based on some data, whether it's random or not, 1056 00:54:59,800 --> 00:55:01,880 we would need to specify an aesthetic here. 1057 00:55:01,880 --> 00:55:04,840 So you could imagine specifying a new aesthetic mapping, one 1058 00:55:04,840 --> 00:55:08,080 that involves color and is associated not with a column in your data set 1059 00:55:08,080 --> 00:55:09,760 but some random data you give it. 1060 00:55:09,760 --> 00:55:12,135 And you could certainly do that to make sure the color is 1061 00:55:12,135 --> 00:55:14,050 randomized across these different dots. 1062 00:55:14,050 --> 00:55:16,150 But a good question too. 
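One possible reading of that answer, sketched under assumptions: add a column of randomly drawn labels and map `color` to it inside `aes()`. The column name `random_group`, the labels, and the seed are all hypothetical choices for this example, and the data frame is a stand-in with assumed column names.

```r
library(ggplot2)

# Stand-in data; column names are assumed, not taken from candy.RData.
candy <- data.frame(
  candy = c("Sour Patch Kids", "Swedish Fish",
            "Hershey's Milk Chocolate", "Hershey's Special Dark"),
  price_percentile = c(12, 76, 92, 92),
  sugar_percentile = c(7, 60, 43, 43)
)

# Hypothetical random data: one randomly chosen label per row.
set.seed(50)
candy$random_group <- sample(c("a", "b", "c"), nrow(candy), replace = TRUE)

p <- ggplot(
  candy,
  aes(x = price_percentile, y = sugar_percentile)
) +
  # Mapping inside aes(): color now varies with the data in random_group.
  geom_jitter(aes(color = random_group), size = 2)
p

drawn <- layer_data(p)
```

Unlike the fixed `color = "darkorchid"` setting, this is an aesthetic mapping, so ggplot2 picks the colors from a scale and adds a legend.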
1063 00:55:16,150 --> 00:55:20,070 One thing I'm seeing is I think people are trying this and seeing that shape, 1064 00:55:20,070 --> 00:55:22,300 I can actually specify a text input to it. 1065 00:55:22,300 --> 00:55:23,050 So let's try that. 1066 00:55:23,050 --> 00:55:24,100 I'll come back over here. 1067 00:55:24,100 --> 00:55:30,570 And I said we couldn't do something like type in square or triangle 1068 00:55:30,570 --> 00:55:31,510 or things like that. 1069 00:55:31,510 --> 00:55:33,000 But let's just try it and see. 1070 00:55:33,000 --> 00:55:35,160 I'll go ahead and type in square. 1071 00:55:35,160 --> 00:55:36,480 And I do get squares. 1072 00:55:36,480 --> 00:55:39,870 Maybe triangle here, and I'll type in triangle. 1073 00:55:39,870 --> 00:55:41,272 Oops, triangle. 1074 00:55:41,272 --> 00:55:42,480 And I'll see I get triangles. 1075 00:55:42,480 --> 00:55:45,720 I could try circle, too, just to see what we could go off and do. 1076 00:55:45,720 --> 00:55:49,920 And now I see circles, so it seems like some basic shapes you can specify. 1077 00:55:49,920 --> 00:55:53,580 But in general, what we'll tend to do is specify these shapes by numbers, 1078 00:55:53,580 --> 00:55:55,890 looking up and cross reference now to determine 1079 00:55:55,890 --> 00:55:58,500 which shape it is we want, like 21, that has 1080 00:55:58,500 --> 00:56:01,890 both this fill and this color aesthetic here. 1081 00:56:01,890 --> 00:56:03,510 Pretty cool. 1082 00:56:03,510 --> 00:56:06,810 OK, so we've seen now how to visualize relationships 1083 00:56:06,810 --> 00:56:10,770 between two continuous variables, in this case price and sugar. 1084 00:56:10,770 --> 00:56:13,650 When we come back, we'll see how to visualize change over time 1085 00:56:13,650 --> 00:56:15,480 in the context of hurricanes. 1086 00:56:15,480 --> 00:56:17,560 We'll see you all in five. 1087 00:56:17,560 --> 00:56:18,760 Well, we're back. 
1088 00:56:18,760 --> 00:56:22,240 And what we'll do next is visualize data that changes over time, 1089 00:56:22,240 --> 00:56:24,760 otherwise known as time series data. 1090 00:56:24,760 --> 00:56:27,850 And we'll do so in the context of a particular hurricane named 1091 00:56:27,850 --> 00:56:32,230 Hurricane Anita that happened in the Atlantic in 1977. 1092 00:56:32,230 --> 00:56:34,750 Now, here's a picture of Anita making landfall in Mexico. 1093 00:56:34,750 --> 00:56:37,750 And thankfully, it did so in an area that wasn't very heavily populated, 1094 00:56:37,750 --> 00:56:40,330 but it still unfortunately did much damage. 1095 00:56:40,330 --> 00:56:42,740 So by looking at this data and visualizing it, 1096 00:56:42,740 --> 00:56:45,880 we can actually hope to learn how hurricanes like these evolve and change 1097 00:56:45,880 --> 00:56:48,010 over time so that we can better prepare for them 1098 00:56:48,010 --> 00:56:50,860 and ultimately respond better to them in turn. 1099 00:56:50,860 --> 00:56:54,970 Now, here are some observations of how Hurricane Anita grew over the days 1100 00:56:54,970 --> 00:56:55,960 that it was active. 1101 00:56:55,960 --> 00:56:59,232 Here I have a column called wind speed, or just called wind. 1102 00:56:59,232 --> 00:57:00,940 And it's representing wind speed in terms 1103 00:57:00,940 --> 00:57:05,080 of knots, this kind of nautical term for how fast the wind is blowing. 1104 00:57:05,080 --> 00:57:09,400 I here have a column called timestamp too that tells me on what date 1105 00:57:09,400 --> 00:57:12,110 this observation was taken and what time too. 1106 00:57:12,110 --> 00:57:16,150 So here I'll see that this one is taken in 1977 on August 30, 1107 00:57:16,150 --> 00:57:17,500 around 12:00 noon. 1108 00:57:17,500 --> 00:57:21,340 And the wind speed of Anita was known to be about 50 knots in total. 
1109 00:57:21,340 --> 00:57:25,060 On the next day, on August 31, also at 12:00 noon, 1110 00:57:25,060 --> 00:57:28,360 the wind speed was about 75 knots in speed. 1111 00:57:28,360 --> 00:57:31,870 So here we can see how Hurricane Anita is evolving over the days, 1112 00:57:31,870 --> 00:57:34,150 that it's growing actively too. 1113 00:57:34,150 --> 00:57:36,700 Now, how could we plot this data? 1114 00:57:36,700 --> 00:57:38,560 Well, we could do it very similar to what 1115 00:57:38,560 --> 00:57:42,220 we saw before, putting points on our plot, like we did with candies here. 1116 00:57:42,220 --> 00:57:46,120 Maybe on the x-axis we have our timestamp, and on the y-axis 1117 00:57:46,120 --> 00:57:49,070 we have the wind speed that's exactly what we have over here. 1118 00:57:49,070 --> 00:57:51,730 So here I have a plot where on the x-axis, 1119 00:57:51,730 --> 00:57:55,690 I've put the date, in this case, August 30, August 31, September 1, 1120 00:57:55,690 --> 00:57:56,890 September 2. 1121 00:57:56,890 --> 00:58:00,130 And on the y-axis, I now have the wind speed in knots, 1122 00:58:00,130 --> 00:58:03,100 going all the way up to 160. 1123 00:58:03,100 --> 00:58:08,110 Now, to plot this data, I could go point by point and add it to this chart here. 1124 00:58:08,110 --> 00:58:12,310 Maybe for the first one, I see this happened on August 30, 12:00 noon. 1125 00:58:12,310 --> 00:58:14,200 The wind speed was 50. 1126 00:58:14,200 --> 00:58:18,790 So if I look at this plot, I might look and see, well, August 30 and the wind 1127 00:58:18,790 --> 00:58:21,100 speed being about 50 or right up here. 1128 00:58:21,100 --> 00:58:26,350 Now, true to how ggplot works, let's go ahead and add a new layer to our plot 1129 00:58:26,350 --> 00:58:28,990 here, one for these points we're adding. 1130 00:58:28,990 --> 00:58:31,120 I'll go ahead and put this over my axes. 1131 00:58:31,120 --> 00:58:33,130 And let me go ahead and draw this first point. 
1132 00:58:33,130 --> 00:58:36,670 I'll say August 30 is equal to wind speed of about 50. 1133 00:58:36,670 --> 00:58:39,670 So I'll put it above August 30 and kind of beside, 1134 00:58:39,670 --> 00:58:43,450 let's say, where 50 might be, right around here, let's say, 1135 00:58:43,450 --> 00:58:45,340 for our first observation. 1136 00:58:45,340 --> 00:58:50,410 On the next day, Anita strengthened, and it was about 75 knots on August 31. 1137 00:58:50,410 --> 00:58:52,730 So I'll go ahead and add that here. 1138 00:58:52,730 --> 00:58:57,610 I'll go over to August 31 and say it was about 75 knots. 1139 00:58:57,610 --> 00:59:02,560 I'll go maybe just below 80 on my y-axis, somewhere right around there. 1140 00:59:02,560 --> 00:59:07,880 And then on September 1, well, Anita blew about 90 knots here. 1141 00:59:07,880 --> 00:59:15,250 So 90 would go well above September 1 and maybe between 80 and 120, 1142 00:59:15,250 --> 00:59:18,160 so somewhere around there, let's say. 1143 00:59:18,160 --> 00:59:21,700 September 2, Anita blew 120 knots. 1144 00:59:21,700 --> 00:59:23,560 So let's go over and add that one. 1145 00:59:23,560 --> 00:59:26,590 I'll put that one kind of right beside 120. 1146 00:59:26,590 --> 00:59:28,060 Let me put that one right here. 1147 00:59:28,060 --> 00:59:31,810 And let me lower this one just a little bit to be sure, 1148 00:59:31,810 --> 00:59:33,280 make sure we're accurate here. 1149 00:59:33,280 --> 00:59:36,400 And this, I think, represents the points that Anita would have 1150 00:59:36,400 --> 00:59:39,940 as it grew in wind speed over time. 1151 00:59:39,940 --> 00:59:44,200 Now, one thing I could do to make this even more apparent as change over time 1152 00:59:44,200 --> 00:59:46,810 is maybe connect these dots with a line. 1153 00:59:46,810 --> 00:59:51,310 And so I could very much do that by adding a new layer here to my plot. 1154 00:59:51,310 --> 00:59:52,510 I'll add this on top.
1155 00:59:52,510 --> 00:59:55,990 And I'll decide, well, I want to connect these points to show 1156 00:59:55,990 --> 00:59:58,060 how Anita grew over time. 1157 00:59:58,060 --> 01:00:02,590 I'll start by connecting these first two points here, this one between August 30 1158 01:00:02,590 --> 01:00:03,910 and August 31. 1159 01:00:03,910 --> 01:00:06,400 I'll draw this line here. 1160 01:00:06,400 --> 01:00:10,000 And now I've seen how Anita changed between August 30 1161 01:00:10,000 --> 01:00:15,370 and August 31. I'll do the same now for August 31 to September 1, 1162 01:00:15,370 --> 01:00:19,940 just like this, and the same again for September 1 to September 2, 1163 01:00:19,940 --> 01:00:21,270 just like this. 1164 01:00:21,270 --> 01:00:24,710 And now, I argue, I'm better visualizing change over time. 1165 01:00:24,710 --> 01:00:27,410 I'm seeing how this hurricane strengthened over the days 1166 01:00:27,410 --> 01:00:31,250 that it occurred and how it grew to be a full-fledged hurricane. 1167 01:00:31,250 --> 01:00:37,190 So let's see how we can make a plot a bit like this now using ggplot itself. 1168 01:00:37,190 --> 01:00:40,580 I'll come back now to RStudio, and let me go ahead 1169 01:00:40,580 --> 01:00:44,990 and open up this data file called anita.RData. 1170 01:00:44,990 --> 01:00:48,440 And inside anita.RData is this table here, 1171 01:00:48,440 --> 01:00:51,650 one called Anita, that tells me many observations 1172 01:00:51,650 --> 01:00:53,420 of how Hurricane Anita grew. 1173 01:00:53,420 --> 01:00:56,750 And here, see, I have more than the observations we saw on our slides. 1174 01:00:56,750 --> 01:00:57,800 I have ones-- 1175 01:00:57,800 --> 01:00:59,460 I have multiple for each day even. 1176 01:00:59,460 --> 01:00:59,960 Let's see. 1177 01:00:59,960 --> 01:01:05,270 On August 30, I have the midnight observation, the 6:00 AM observation, 1178 01:01:05,270 --> 01:01:08,960 the 12:00 noon, the 6 o'clock observation.
1179 01:01:08,960 --> 01:01:13,670 There's a lot of observations here in more detail of how Anita grew. 1180 01:01:13,670 --> 01:01:16,760 But we still have those same columns, timestamp and wind. 1181 01:01:16,760 --> 01:01:20,870 So let's use them now to visualize this data in terms of a plot. 1182 01:01:20,870 --> 01:01:23,930 I'll start as I usually do, with our ggplot function, 1183 01:01:23,930 --> 01:01:26,720 to give myself this blank canvas to work with. 1184 01:01:26,720 --> 01:01:30,890 I'll then pass in as the first argument to ggplot the Anita data 1185 01:01:30,890 --> 01:01:32,250 frame, just like this. 1186 01:01:32,250 --> 01:01:34,760 And I'll assign these aesthetic mappings. 1187 01:01:34,760 --> 01:01:39,260 I want to make sure that the timestamp column falls on the x-axis, 1188 01:01:39,260 --> 01:01:43,010 and the wind column here falls on the y-axis. 1189 01:01:43,010 --> 01:01:46,190 So I'll assign them in terms of these aesthetic mappings now. 1190 01:01:46,190 --> 01:01:52,380 I'll say that x equals timestamp and y equals wind, just like this. 1191 01:01:52,380 --> 01:01:56,180 And of course, given what I have now, I have my scales 1192 01:01:56,180 --> 01:01:57,650 on either the x and the y-axis. 1193 01:01:57,650 --> 01:02:01,670 But now I want to add in my points that I just made originally over here. 1194 01:02:01,670 --> 01:02:05,630 I'll add a new layer, now syntactically in ggplot, with this plus sign. 1195 01:02:05,630 --> 01:02:06,920 And I'll say geom_point. 1196 01:02:06,920 --> 01:02:09,830 It's what I want to add to this first layer. 1197 01:02:09,830 --> 01:02:12,710 And now, with this more complete set of observations, 1198 01:02:12,710 --> 01:02:15,830 do we see exactly how Anita grew over the days 1199 01:02:15,830 --> 01:02:21,500 that it was considered a storm, all the way from August 30 or so to September 3 1200 01:02:21,500 --> 01:02:22,820 or so. 
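The first step described here, a blank canvas with aesthetic mappings plus a layer of points, might look roughly like the sketch below. Since anita.RData isn't reproduced in this transcript, the `anita` data frame is a small stand-in built from the four slide values; the year and the noon timestamps for September 1 and 2 are assumptions made only so the example runs.

```r
library(ggplot2)

# Stand-in for the `anita` table from anita.RData (assumption: only the
# four slide values, with an assumed year and noon timestamps).
anita <- data.frame(
  timestamp = as.POSIXct(
    c("1977-08-30 12:00", "1977-08-31 12:00",
      "1977-09-01 12:00", "1977-09-02 12:00"),
    tz = "UTC"
  ),
  wind = c(50, 75, 90, 120)
)

# Map timestamp to the x-axis and wind to the y-axis, then add points.
p <- ggplot(anita, aes(x = timestamp, y = wind)) +
  geom_point()

p
```

With the full table loaded from the data file, the same two lines of plotting code would draw every observation rather than just these four.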
1201 01:02:22,820 --> 01:02:27,770 But we ideally want to connect these points, showing change over time. 1202 01:02:27,770 --> 01:02:30,740 And thankfully, we do have another geometry we could use, 1203 01:02:30,740 --> 01:02:32,960 one called geom_line. 1204 01:02:32,960 --> 01:02:37,460 So let's do just that and use geom_line to connect these points. 1205 01:02:37,460 --> 01:02:40,190 Well, similar to what we just did over in this demo table 1206 01:02:40,190 --> 01:02:43,610 here by adding a new layer to our plot, I could do the same. 1207 01:02:43,610 --> 01:02:47,300 I could have more than one geometry on my plot. 1208 01:02:47,300 --> 01:02:52,640 I could have one for points and one afterwards for, let's say, lines, just 1209 01:02:52,640 --> 01:02:53,300 like this. 1210 01:02:53,300 --> 01:02:57,653 Geom_line draws lines between all of the data points that we have. 1211 01:02:57,653 --> 01:02:59,570 So let me go ahead and run this here, and I'll 1212 01:02:59,570 --> 01:03:03,200 see now all of these dots are connected by lines. 1213 01:03:03,200 --> 01:03:06,590 And in fact, my plot has multiple layers similar to this one, 1214 01:03:06,590 --> 01:03:08,690 one layer for these lines. 1215 01:03:08,690 --> 01:03:11,060 One layer is our dots here. 1216 01:03:11,060 --> 01:03:15,090 And the bottom layer is our aesthetic mappings and our blank plot here. 1217 01:03:15,090 --> 01:03:17,390 So I'll go ahead and remake this plot right here. 1218 01:03:17,390 --> 01:03:21,560 And we'll see the same over in ggplot to be sure. 1219 01:03:21,560 --> 01:03:23,570 Now, let's spruce this up a little bit more. 1220 01:03:23,570 --> 01:03:26,730 I want to maybe change the labels here, give it a title. 1221 01:03:26,730 --> 01:03:27,860 So I'll do just that. 1222 01:03:27,860 --> 01:03:31,580 I'll come back over to RStudio and add in a label layer.
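Stacking two geometries on one plot, as described above, could be sketched like this. The `anita` data frame is again a hypothetical stand-in built from the slide values (year and noon times assumed), not the contents of anita.RData.

```r
library(ggplot2)

# Stand-in for the `anita` table (assumption: four slide values only).
anita <- data.frame(
  timestamp = as.POSIXct(
    c("1977-08-30 12:00", "1977-08-31 12:00",
      "1977-09-01 12:00", "1977-09-02 12:00"),
    tz = "UTC"
  ),
  wind = c(50, 75, 90, 120)
)

# Each `+` adds a layer: points first, then lines connecting them.
p <- ggplot(anita, aes(x = timestamp, y = wind)) +
  geom_point() +
  geom_line()

p
```

Both layers inherit the x and y mappings from the `ggplot()` call, which is why neither geometry needs its own `aes()` here.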
1223 01:03:31,580 --> 01:03:33,656 I'll say let me go ahead and add some labels 1224 01:03:33,656 --> 01:03:37,100 and make sure the x-axis is-- maybe let's call it Date, 1225 01:03:37,100 --> 01:03:38,210 like we did over here. 1226 01:03:38,210 --> 01:03:41,480 And I'll go ahead and say the wind-- 1227 01:03:41,480 --> 01:03:43,310 the wind column is called Wind, but we'll 1228 01:03:43,310 --> 01:03:46,280 call it maybe Wind Speed in Knots, to be more specific. 1229 01:03:46,280 --> 01:03:48,320 And then we'll go ahead and say the title is 1230 01:03:48,320 --> 01:03:52,430 going to be Hurricane Anita to be clear what we're visualizing here. 1231 01:03:52,430 --> 01:03:57,140 Let me rebuild my plot, and I'll see all of those labels now in place. 1232 01:03:57,140 --> 01:03:58,940 So we're getting pretty far. 1233 01:03:58,940 --> 01:04:03,380 But what else can we do to improve the design of this plot? 1234 01:04:03,380 --> 01:04:05,960 Well, I'd argue that we could play around with some colors 1235 01:04:05,960 --> 01:04:07,940 here and make sure it looks a little bit more-- 1236 01:04:07,940 --> 01:04:10,440 a little more amusing, a little more interesting to look at. 1237 01:04:10,440 --> 01:04:14,570 And one thing we could do is experiment with the color for these points. 1238 01:04:14,570 --> 01:04:18,050 So we saw before that we might want to change colors of points. 1239 01:04:18,050 --> 01:04:22,610 We can do so by setting this color aesthetic inside of our geom_point 1240 01:04:22,610 --> 01:04:23,420 or geom_jitter. 1241 01:04:23,420 --> 01:04:25,790 Both of those are drawing points for us. 1242 01:04:25,790 --> 01:04:28,460 Inside, let's say, geom_point, I could go ahead 1243 01:04:28,460 --> 01:04:31,880 and say I want to set this color to be equal to, well, the one 1244 01:04:31,880 --> 01:04:34,850 I found is called deepskyblue4. 1245 01:04:34,850 --> 01:04:35,985 I looked this up online. 
1246 01:04:35,985 --> 01:04:39,110 And it was a pretty cool R color because it symbolizes hurricanes, at least 1247 01:04:39,110 --> 01:04:39,770 for me. 1248 01:04:39,770 --> 01:04:42,050 I'll go ahead and reload this plot, and we'll 1249 01:04:42,050 --> 01:04:47,390 see I now have these dots here, now colored too. 1250 01:04:47,390 --> 01:04:51,350 But if you look a little bit closely, I'll 1251 01:04:51,350 --> 01:04:55,910 see that this line is actually overlapping these dots. 1252 01:04:55,910 --> 01:04:57,410 And I'm not sure that's what I want. 1253 01:04:57,410 --> 01:05:02,480 I think what I really want is for this line to be behind these dots. 1254 01:05:02,480 --> 01:05:05,660 And so similar to thinking of our plot in layers, 1255 01:05:05,660 --> 01:05:10,010 it seems like I drew the points first underneath the line layer, 1256 01:05:10,010 --> 01:05:14,360 and so the line, of course, will kind of overwrite or be on top of the points. 1257 01:05:14,360 --> 01:05:17,773 If I want it vice versa, I should change the order of these here. 1258 01:05:17,773 --> 01:05:19,690 So to the question of ordering earlier, if you 1259 01:05:19,690 --> 01:05:22,720 want your geometries in certain order, one on top of the other, well, 1260 01:05:22,720 --> 01:05:24,740 ordering does matter in that case. 1261 01:05:24,740 --> 01:05:28,030 So let me switch now geom_line with geom_point. 1262 01:05:28,030 --> 01:05:29,240 I'll come back over here. 1263 01:05:29,240 --> 01:05:35,170 And I'll decide now to change this to have first the lines drawn, just 1264 01:05:35,170 --> 01:05:38,920 like this, and then the points drawn on top of them. 1265 01:05:38,920 --> 01:05:40,120 Let me rebuild my chart. 1266 01:05:40,120 --> 01:05:44,080 And now, suddenly, we'll see that the lines are behind the points, 1267 01:05:44,080 --> 01:05:46,610 and the points are now in front of them. 
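Putting the labels, the point color, and the corrected layer order together might look something like the following. The data frame is a stand-in built from the slide values, and the exact label strings are a guess at what was typed in the demo.

```r
library(ggplot2)

# Stand-in for the `anita` table (assumption: four slide values only).
anita <- data.frame(
  timestamp = as.POSIXct(
    c("1977-08-30 12:00", "1977-08-31 12:00",
      "1977-09-01 12:00", "1977-09-02 12:00"),
    tz = "UTC"
  ),
  wind = c(50, 75, 90, 120)
)

# Layers are drawn in the order they are added, so the line goes first
# and the colored points are drawn on top of it.
p <- ggplot(anita, aes(x = timestamp, y = wind)) +
  geom_line() +
  geom_point(color = "deepskyblue4") +
  labs(
    x = "Date",
    y = "Wind Speed in Knots",   # label text assumed from the narration
    title = "Hurricane Anita"
  )

p
```

Swapping the `geom_line()` and `geom_point()` lines would put the line back on top of the points, which is the overlap noticed above.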
1268 01:05:46,610 --> 01:05:52,720 So here we've seen our very first chart using both geometries, point and line. 1269 01:05:52,720 --> 01:05:57,100 Let me ask, what questions do we have about how we visualize this hurricane 1270 01:05:57,100 --> 01:05:58,945 and its growth so far? 1271 01:05:58,945 --> 01:06:01,240 AUDIENCE: It's been formatted on the graph, 1272 01:06:01,240 --> 01:06:04,500 but it is not well formatted in the data file. 1273 01:06:04,500 --> 01:06:07,000 CARTER ZENKE: A good question about the format of your data. 1274 01:06:07,000 --> 01:06:08,980 And so it kind of harkens back to last time, 1275 01:06:08,980 --> 01:06:10,270 where we learned about clean data, making 1276 01:06:10,270 --> 01:06:12,937 sure our data is clean before we actually can visualize it here. 1277 01:06:12,937 --> 01:06:15,340 The reason I'm able to visualize this plot 1278 01:06:15,340 --> 01:06:18,170 is because I have my data in a certain order. 1279 01:06:18,170 --> 01:06:23,030 I have each individual column that I can then map to individual axes here. 1280 01:06:23,030 --> 01:06:26,180 If my data were not in that shape or in that format, 1281 01:06:26,180 --> 01:06:28,157 I couldn't do what I'm doing here. 1282 01:06:28,157 --> 01:06:29,990 So it is important to make sure your data is 1283 01:06:29,990 --> 01:06:32,390 in the right format in order for you to visualize it 1284 01:06:32,390 --> 01:06:33,980 in the way you want to visualize it. 1285 01:06:33,980 --> 01:06:39,110 But more on that, actually, last time, when we saw how to clean data as well. 1286 01:06:39,110 --> 01:06:42,290 Well, let's keep going and making our plot more visually interesting. 1287 01:06:42,290 --> 01:06:46,100 And we've seen now these points, but what about these lines? 1288 01:06:46,100 --> 01:06:47,960 How could we change how they look?
1289 01:06:47,960 --> 01:06:51,410 Well, it turns out that geom_line, or this line geometry, 1290 01:06:51,410 --> 01:06:53,495 has its own aesthetics we can play with too. 1291 01:06:53,495 --> 01:06:55,370 And I'll show you two of them, in particular. 1292 01:06:55,370 --> 01:07:00,170 Let's come back now to RStudio and play around with a few of these aesthetics. 1293 01:07:00,170 --> 01:07:06,230 One of them is called linetype, and one of them is called linewidth. 1294 01:07:06,230 --> 01:07:10,220 And kind of true to their name, linetype changes the type of line, 1295 01:07:10,220 --> 01:07:13,430 let's say whether it's solid or dashed, for instance. 1296 01:07:13,430 --> 01:07:17,630 And linewidth changes how wide this line might look. 1297 01:07:17,630 --> 01:07:20,690 So I'll come back now to RStudio. 1298 01:07:20,690 --> 01:07:23,990 Let me try to adjust the style of this line. 1299 01:07:23,990 --> 01:07:27,350 Well, maybe I'll first play around with the type of line. 1300 01:07:27,350 --> 01:07:31,040 And I can probably do so by specifying some number, 1301 01:07:31,040 --> 01:07:32,900 much like we did for shape. 1302 01:07:32,900 --> 01:07:36,800 There are a few different line types, among them the solid one that we 1303 01:07:36,800 --> 01:07:39,797 see here, dashed, dot dash, and so on. 1304 01:07:39,797 --> 01:07:41,630 Let's just see what a few of them look like. 1305 01:07:41,630 --> 01:07:46,460 Here, linetype 1, if I build this, well, linetype 1 1306 01:07:46,460 --> 01:07:47,990 is exactly what we already have. 1307 01:07:47,990 --> 01:07:49,865 Linetype 2, what's that? 1308 01:07:49,865 --> 01:07:51,360 We'll visualize this. 1309 01:07:51,360 --> 01:07:53,780 And now we'll see kind of a dashed line. 1310 01:07:53,780 --> 01:07:56,690 I'll see that there are some translucent parts to this line. 
1311 01:07:56,690 --> 01:08:00,440 And I see dashes now between each individual little dot 1312 01:08:00,440 --> 01:08:02,000 and really on the line in general. 1313 01:08:02,000 --> 01:08:05,750 Let's try now maybe linetype 3 and see what that looks like. 1314 01:08:05,750 --> 01:08:10,370 Linetype 3 seems to be, well, more so dotted. 1315 01:08:10,370 --> 01:08:15,770 I see not full dashes in my line but now individual dots separated by spaces. 1316 01:08:15,770 --> 01:08:17,333 So there are many line types. 1317 01:08:17,333 --> 01:08:20,000 I encourage you to play around with them, look in the reference, 1318 01:08:20,000 --> 01:08:24,500 and see what kinds of types of lines you can create with ggplot. 1319 01:08:24,500 --> 01:08:28,850 Now, one other one we saw earlier was linewidth, how wide or thick 1320 01:08:28,850 --> 01:08:29,899 should this line be. 1321 01:08:29,899 --> 01:08:31,319 Let's play around with that too. 1322 01:08:31,319 --> 01:08:33,500 I'll come back now to RStudio. 1323 01:08:33,500 --> 01:08:35,779 And let me change my linetype back to 1. 1324 01:08:35,779 --> 01:08:37,520 I kind of like that one the most. 1325 01:08:37,520 --> 01:08:42,037 And I'll use linewidth now, where linewidth might be, 1326 01:08:42,037 --> 01:08:44,120 well, let's just try 1 and see what that gives us. 1327 01:08:44,120 --> 01:08:45,170 I'll hit Enter here. 1328 01:08:45,170 --> 01:08:48,740 And my line is a little bit thicker than I'd say it was before. 1329 01:08:48,740 --> 01:08:49,970 It is pretty thick here. 1330 01:08:49,970 --> 01:08:52,069 I can see it kind of connecting these lines still. 1331 01:08:52,069 --> 01:08:55,237 I can make it even more thick or probably, I'd argue, 1332 01:08:55,237 --> 01:08:56,779 I want it to be a little bit thinner. 1333 01:08:56,779 --> 01:09:01,760 So I'll make it smaller than 1, maybe something like 0.5 or so. 1334 01:09:01,760 --> 01:09:06,710 I'll go back over to linewidth and say I want it to be 0.5 in size. 
1335 01:09:06,710 --> 01:09:10,279 And I'll see this seems more reasonable in terms of a width for my lines. 1336 01:09:10,279 --> 01:09:12,620 And I encourage you to experiment with these line widths 1337 01:09:12,620 --> 01:09:16,550 and see which one actually best represents your data. 1338 01:09:16,550 --> 01:09:19,883 So here, we've played around with these line geometries, 1339 01:09:19,883 --> 01:09:22,550 but we could probably still improve the chart a little bit more. 1340 01:09:22,550 --> 01:09:24,920 I think I want these dots to be a little bit bigger. 1341 01:09:24,920 --> 01:09:26,630 And we saw how to do this just last time. 1342 01:09:26,630 --> 01:09:32,180 I could specify in geom_point a size for each of those points, not just a color. 1343 01:09:32,180 --> 01:09:35,840 But also I want to say the size of these is just a little bit bigger than usual, 1344 01:09:35,840 --> 01:09:37,890 maybe a 2 instead. 1345 01:09:37,890 --> 01:09:40,550 And now, just barely, these dots are a little bit bigger, 1346 01:09:40,550 --> 01:09:43,790 and I think we're now seeing our data just a little bit better. 1347 01:09:43,790 --> 01:09:45,920 I'd say this looks pretty good. 1348 01:09:45,920 --> 01:09:49,130 Now, let's go ahead and add in our theme here. 1349 01:09:49,130 --> 01:09:51,859 We can go to the bottom and say I've added in all 1350 01:09:51,859 --> 01:09:54,229 of my geometries, my points, my labels. 1351 01:09:54,229 --> 01:09:58,070 Let me go ahead and say I want this classic theme to tidy things up here. 1352 01:09:58,070 --> 01:10:03,020 And now I'll see my chart as I'm pretty sure I want it to be. 1353 01:10:03,020 --> 01:10:05,120 Now, this is pretty good as a chart. 1354 01:10:05,120 --> 01:10:07,550 But I think there's more we could do to it. 1355 01:10:07,550 --> 01:10:11,390 One thing we could do is figure out when exactly 1356 01:10:11,390 --> 01:10:14,060 Hurricane Anita became a hurricane. 
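The styling choices settled on so far (a solid line of width 0.5, slightly larger colored points, and the classic theme) could be combined as in this sketch. The data frame is a hypothetical stand-in built from the slide values, and the label text is assumed from the narration.

```r
library(ggplot2)

# Stand-in for the `anita` table (assumption: four slide values only).
anita <- data.frame(
  timestamp = as.POSIXct(
    c("1977-08-30 12:00", "1977-08-31 12:00",
      "1977-09-01 12:00", "1977-09-02 12:00"),
    tz = "UTC"
  ),
  wind = c(50, 75, 90, 120)
)

p <- ggplot(anita, aes(x = timestamp, y = wind)) +
  # linetype 1 is solid, 2 is dashed, 3 is dotted
  geom_line(linetype = 1, linewidth = 0.5) +
  geom_point(color = "deepskyblue4", size = 2) +
  labs(
    x = "Date",
    y = "Wind Speed in Knots",   # label text assumed from the narration
    title = "Hurricane Anita"
  ) +
  theme_classic()

p
```

Changing `linetype` to 2 or 3, or `linewidth` to 1, reproduces the alternatives tried along the way. Note that `linewidth` is assumed to be available, which requires a reasonably recent ggplot2 release.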
1357 01:10:14,060 --> 01:10:17,018 In fact, hurricanes start as these lesser storms 1358 01:10:17,018 --> 01:10:19,060 known as tropical depressions or tropical storms. 1359 01:10:19,060 --> 01:10:21,790 And they grow to be hurricanes. 1360 01:10:21,790 --> 01:10:26,830 Well, here, we don't quite have a sense of exactly when Hurricane Anita became 1361 01:10:26,830 --> 01:10:27,730 a hurricane. 1362 01:10:27,730 --> 01:10:30,550 But it turns out that hurricanes are considered hurricanes 1363 01:10:30,550 --> 01:10:34,810 when they reach a wind speed of 65 knots. 1364 01:10:34,810 --> 01:10:40,900 So it seems like any dots that are above this 65 mark on my y-axis, 1365 01:10:40,900 --> 01:10:45,460 well, that indicates when Anita was a full hurricane. 1366 01:10:45,460 --> 01:10:50,830 So if I want to add not just data points but some arbitrary line, 1367 01:10:50,830 --> 01:10:53,320 I could do that just as well here too. 1368 01:10:53,320 --> 01:10:57,880 Effectively, what I would do is add a new layer now to my plot. 1369 01:10:57,880 --> 01:10:59,770 I'll do that on our visual here. 1370 01:10:59,770 --> 01:11:01,420 I'll take this plot. 1371 01:11:01,420 --> 01:11:06,520 And why don't I draw a line indicating when this hurricane became a hurricane? 1372 01:11:06,520 --> 01:11:08,980 And we know it does so when the wind speed is 1373 01:11:08,980 --> 01:11:11,770 greater than or equal to 65 knots. 1374 01:11:11,770 --> 01:11:18,490 So I could draw maybe somewhere between 40 and 80, right around here or so, 1375 01:11:18,490 --> 01:11:23,470 a line, maybe a dotted line saying that above this line-- 1376 01:11:23,470 --> 01:11:28,240 above this line, Hurricane Anita was, in fact, a hurricane. 1377 01:11:28,240 --> 01:11:32,080 Now, similar to what I just did by adding a new layer to my plot, 1378 01:11:32,080 --> 01:11:34,060 I can do the same in ggplot. 1379 01:11:34,060 --> 01:11:38,770 And I'll use this geometry called an hline for a horizontal line.
1380 01:11:38,770 --> 01:11:41,260 We also have a vline for a vertical line. 1381 01:11:41,260 --> 01:11:44,110 But here we'll focus on this horizontal line here. 1382 01:11:44,110 --> 01:11:46,360 Let's come back now to RStudio and see what 1383 01:11:46,360 --> 01:11:50,860 it would look like to add this hline, this horizontal line. 1384 01:11:50,860 --> 01:11:56,260 Well, I probably want it to come after I add in my lines and my points. 1385 01:11:56,260 --> 01:12:00,880 I'll go ahead and add this layer after I specify my points here. 1386 01:12:00,880 --> 01:12:05,890 And I can do so by using geom_hline for this horizontal line I 1387 01:12:05,890 --> 01:12:07,990 want to add to my chart. 1388 01:12:07,990 --> 01:12:09,400 I'll finish this off with a plus. 1389 01:12:09,400 --> 01:12:11,440 I'll make sure to add my labels later on. 1390 01:12:11,440 --> 01:12:16,210 And now, there are a few parameters I can specify in terms of this hline. 1391 01:12:16,210 --> 01:12:18,670 One I can specify is still the line type. 1392 01:12:18,670 --> 01:12:21,820 It is a line, so it has that same aesthetic of a line type. 1393 01:12:21,820 --> 01:12:25,210 I could change that for hline, perhaps to this dotted one 1394 01:12:25,210 --> 01:12:28,120 we saw earlier, which was linetype 3. 1395 01:12:28,120 --> 01:12:30,470 But let me go ahead and try to visualize this. 1396 01:12:30,470 --> 01:12:31,600 I'll go ahead and run. 1397 01:12:31,600 --> 01:12:35,230 And I'll see I actually get a warning or really an error. 1398 01:12:35,230 --> 01:12:41,950 Geom_hline requires the following missing aesthetics, yintercept. 1399 01:12:41,950 --> 01:12:44,290 What is a y-intercept? 1400 01:12:44,290 --> 01:12:46,990 Well, as the name kind of implies, it is the place 1401 01:12:46,990 --> 01:12:51,760 that this line intercepts or crosses with this y-axis. 
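A horizontal reference line needs to know where it crosses the y-axis, which is what the missing `yintercept` error is about. A minimal sketch that supplies it, using the 65-knot threshold mentioned earlier in the lecture, might look like this. The data frame is a stand-in built from the slide values, not the real anita.RData.

```r
library(ggplot2)

# Stand-in for the `anita` table (assumption: four slide values only).
anita <- data.frame(
  timestamp = as.POSIXct(
    c("1977-08-30 12:00", "1977-08-31 12:00",
      "1977-09-01 12:00", "1977-09-02 12:00"),
    tz = "UTC"
  ),
  wind = c(50, 75, 90, 120)
)

p <- ggplot(anita, aes(x = timestamp, y = wind)) +
  geom_line(linetype = 1, linewidth = 0.5) +
  geom_point(color = "deepskyblue4", size = 2) +
  # Dotted horizontal line at the lecture's 65-knot hurricane threshold.
  geom_hline(linetype = 3, yintercept = 65) +
  theme_classic()

p
```

Unlike `x` and `y`, `yintercept` here is a fixed value rather than a column mapping, so it is passed directly to `geom_hline()` instead of going inside `aes()`.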
1402 01:12:51,760 --> 01:12:56,230 In our case, we said that whenever a storm gets to 65 knots or higher, 1403 01:12:56,230 --> 01:12:57,520 that means it is a hurricane. 1404 01:12:57,520 --> 01:13:00,670 So it seems like the place that this line intercepts 1405 01:13:00,670 --> 01:13:04,180 the y-axis is, well, 65 knots. 1406 01:13:04,180 --> 01:13:10,960 So I could as a parameter to geom_hline say that the y-intercept should be 65, 1407 01:13:10,960 --> 01:13:12,610 meaning 65 knots. 1408 01:13:12,610 --> 01:13:15,460 I'll come back now to RStudio and do exactly that. 1409 01:13:15,460 --> 01:13:18,970 Let me go ahead and say that the yintercept, 1410 01:13:18,970 --> 01:13:24,130 the yintercept of this line, should be 65, exactly that. 1411 01:13:24,130 --> 01:13:27,230 And then I'll go ahead and say-- let me run this top to bottom. 1412 01:13:27,230 --> 01:13:29,650 And now I have a pretty neat graph. 1413 01:13:29,650 --> 01:13:32,170 I see the evolution of Hurricane Anita. 1414 01:13:32,170 --> 01:13:35,410 I see what days it was considered a full-fledged hurricane. 1415 01:13:35,410 --> 01:13:41,120 And I also see what days it was not a hurricane but a tropical storm instead. 1416 01:13:41,120 --> 01:13:46,030 So we've gone from an empty plot to adding in many geometries. 1417 01:13:46,030 --> 01:13:49,480 We've added in dots and lines and horizontal lines and so on. 1418 01:13:49,480 --> 01:13:53,530 We've added in our aesthetics in terms of x and y and color and so on. 1419 01:13:53,530 --> 01:13:58,390 We've added in our labels and our themes too, making this final plot. 1420 01:13:58,390 --> 01:14:02,425 What questions do we have on what we've done so far? 1421 01:14:02,425 --> 01:14:03,220 AUDIENCE: Hi. 1422 01:14:03,220 --> 01:14:10,000 I was asking if it is possible that we do the colorblind and color 1423 01:14:10,000 --> 01:14:10,870 visualization? 
1424 01:14:10,870 --> 01:14:15,490 Like anything above the 65, we change the color 1425 01:14:15,490 --> 01:14:17,943 or maybe change the line color or something. 1426 01:14:17,943 --> 01:14:20,110 CARTER ZENKE: I really like the way you're thinking. 1427 01:14:20,110 --> 01:14:22,360 Yeah, so you're considering, what can we do 1428 01:14:22,360 --> 01:14:24,577 to make this plot more colorblind friendly? 1429 01:14:24,577 --> 01:14:26,410 And that's a really good consideration, when 1430 01:14:26,410 --> 01:14:29,230 you're working with different colors, having those colors mean 1431 01:14:29,230 --> 01:14:31,240 something distinct in your data. 1432 01:14:31,240 --> 01:14:34,510 I would argue that in this data set or this plot, 1433 01:14:34,510 --> 01:14:37,240 the colors are more aesthetically pleasing. 1434 01:14:37,240 --> 01:14:39,580 They aren't really showing me information in this plot. 1435 01:14:39,580 --> 01:14:41,500 But if I wanted to, I could choose colors 1436 01:14:41,500 --> 01:14:43,540 that are friendly to those who are colorblind. 1437 01:14:43,540 --> 01:14:46,360 And actually, a good thing I could do is if I use multiple colors. 1438 01:14:46,360 --> 01:14:48,110 Beyond just, in this case, black and blue, 1439 01:14:48,110 --> 01:14:51,068 well, I could choose colors that are distinguishable to those who might 1440 01:14:51,068 --> 01:14:52,707 have some form of color blindness. 1441 01:14:52,707 --> 01:14:54,790 And for that, I could actually look up what colors 1442 01:14:54,790 --> 01:14:56,915 I might be able to use to make sure that that works 1443 01:14:56,915 --> 01:14:58,570 for various forms of color blindness. 1444 01:14:58,570 --> 01:15:01,345 The viridis scale might be a good place to start. 1445 01:15:01,345 --> 01:15:03,220 But here, because I just have black and blue, 1446 01:15:03,220 --> 01:15:06,050 I'd argue that this is going to be good enough for now. 1447 01:15:06,050 --> 01:15:08,800 But a great question here. 
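The idea raised in the question, coloring observations by whether they sit at or above the 65-knot line, wasn't demonstrated in the lecture, but one possible sketch uses ggplot2's discrete viridis scale. The `status` column, its labels, and the `end = 0.8` setting are all hypothetical choices for illustration, and the data frame is a stand-in built from the slide values.

```r
library(ggplot2)

# Stand-in for the `anita` table (assumption: four slide values only).
anita <- data.frame(
  timestamp = as.POSIXct(
    c("1977-08-30 12:00", "1977-08-31 12:00",
      "1977-09-01 12:00", "1977-09-02 12:00"),
    tz = "UTC"
  ),
  wind = c(50, 75, 90, 120)
)

# Hypothetical helper column: classify each observation by the
# lecture's 65-knot threshold.
anita$status <- ifelse(anita$wind >= 65, "Hurricane", "Tropical storm")

p <- ggplot(anita, aes(x = timestamp, y = wind)) +
  geom_line(linewidth = 0.5) +
  # Here color is a mapping (inside aes), so it varies with the data.
  geom_point(aes(color = status), size = 2) +
  geom_hline(linetype = 3, yintercept = 65) +
  # Viridis palettes are designed to stay distinguishable under common
  # forms of color blindness; end = 0.8 avoids the palest yellow.
  scale_color_viridis_d(end = 0.8) +
  theme_classic()

p
```

Because color now carries information, ggplot adds a legend automatically. Pairing the color with a second cue, such as point shape, would be one way to make the distinction even more robust.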
1448 01:15:08,800 --> 01:15:11,980 OK, so we've seen now all in all today how 1449 01:15:11,980 --> 01:15:15,052 to visualize groups of data and values associated with them. 1450 01:15:15,052 --> 01:15:16,760 We've seen how to visualize relationships 1451 01:15:16,760 --> 01:15:18,970 in two different columns in our data. 1452 01:15:18,970 --> 01:15:21,967 And we've also seen how to visualize data over time. 1453 01:15:21,967 --> 01:15:24,550 When we come back, we'll see how to actually test our programs 1454 01:15:24,550 --> 01:15:26,008 to make sure they work as intended. 1455 01:15:26,008 --> 01:15:28,090 But more on that next time. 1456 01:15:28,090 --> 01:15:30,480 We'll see you later on. 1457 01:15:30,480 --> 01:15:32,000