[MUSIC PLAYING] CARTER ZENKE: Well, hello one and all. And welcome back to CS50's Introduction to Programming with R. My name is Carter Zenke, and this is our lecture on visualizing data. Now, a good visualization can help you see trends you wouldn't have seen otherwise, help you compare groups in your data, and help you share your findings with others. So we're going to do all of that and more today with the help of R and with this package called ggplot2. It is part of the tidyverse. Now, this name, ggplot, is a bit weird. But there is a reasoning behind it. The "plot" in ggplot means we're going to plot our data, which is another word for visualizing it, translating it from a table into some picture to represent that same data. And the "gg" in ggplot stands for this grammar of graphics, which, put simply, is a way of expressing ourselves graphically. And in particular, this grammar of graphics gives us some individual components we can combine to create plots or visualizations from our data. But what are those components? Well, the first one, of course, is going to be data itself. Before we can do anything, we need to have data to visualize. And so let me propose we use this data here, a table of candidates and the votes for each of these candidates. So I have three of them, Mario, Peach, and Bowser, and each has some number of votes. Now, of course, data is just data. It's not a visualization yet. Our goal is to translate this table into some picture. So we'll need another component of our visualization to take this data and convert it into that picture. Now, what we'll need is what we call a geometry, more formally. And a geometry is simply a way of saying what kind of visualization do we want? I think it's best shown through example here. So I'll show you three different plots, each with their own different geometries. Notice here I have one involving columns, good for data that involves groups and values associated with those groups. 
I have one plot that uses points here, good for representing relationships between two columns in your data set perhaps. And I have here a line geometry, one that can actually show us change over time. And so I'm curious, given these three geometries, of which there are many more-- we'll focus on these three for now-- which one do you think would be good or best for the data we have here, where we have three candidates and some number of votes? Let's see what our audience thinks. What kind of geometry among these three here would you use to visualize this data in particular? AUDIENCE: I think we will use columns to present it in more visualized example. CARTER ZENKE: I like your thinking. So we could use these columns here to visualize our candidates and their number of votes. And more particularly, let me say that maybe each candidate gets their own column. And how high or low the bar goes, the column goes, that would be the number of votes that candidate received. So notice here how we're specifying not just the data involved and the kind of plot we want but also how that data references or is represented on our plot here. How many columns we have, how high or low those columns go, those are all associated with some part of our data. Now, we call this association, more formally, this aesthetic mapping, which is a bit of a big term. But we can break it down into smaller pieces here. An aesthetic is some visual feature of our plot, maybe how many columns there are, how high or low each of those columns go. And a mapping is really another word for a relationship. Well, what are we relating here? We're relating our data to some visual features of our plot. I think this is best shown by example. So here I have a plot. And you've often seen plots with a both vertical line and a horizontal line here. And if I look at this plot kind of naively and you ask me to draw some columns, I might say, well, how? I mean, do they go left to right? Do they go up to down? 
I'm not quite sure how to draw these columns even if you want me to. So thankfully, there are certain aesthetic or visual features of this plot that we could talk about and relate our data to. Namely, most plots tend to have what we call an x-axis on this horizontal line here. This is known as our x-axis. And most plots, too, have what's known as a y-axis, a vertical axis here. And if we have both a y, or a vertical, axis and an x, or a horizontal, axis, we could talk about, well, which parts of our data go on the x-axis and which parts of our data go on the y-axis here. So let me show you again our data. We had here candidates, one column, and votes, one column. Let me ask more precisely now, which column of our data do you think should be mapped to or represented by the x-axis of our data? AUDIENCE: Our candidate column? CARTER ZENKE: Maybe our candidates, right? Because we could have the candidates on the x-axis representing now one individual column for each of them. So we will map, let's say, our candidate column to this x-axis. And by process of elimination, it seems like votes should go now on the y-axis. And so we can see what it looks like to now map these columns to our plot over here. Well, if I wanted to map my candidate column to the x-axis, what could I do? I could kind of write these candidates' names on the x-axis here. I'll start with Mario down below. And then I'll follow it with Peach, who's our second candidate, just like this. And now I'll follow that with Bowser too, our third candidate, just like that. So now we could say our column candidate is mapped to our x-axis. But now we need the y-axis. And we said the y-axis is going to be represented-- is going to have this votes column on it here. So how could we do that? Well, notice that these votes here fall into some range between, let's say, 0 to 200 at the maximum. So maybe on this y-axis I could say, let's start at the very bottom with 0. And all the way at the top, we'll put 200. 
So now the height of these columns should correspond to the y-axis. And I think now that I have these x and y-axes, I know much more how to draw these columns. If you told me draw columns representing the number of votes each candidate got, I would know how to do exactly that. I would know they should start at the bottom and move all the way up in terms of height. So let's draw these now. I see Mario has 100 votes. So I want to put a column on here for Mario. Well, Mario's column should be of the height on the y-axis that is equal to 100, which should be right in the middle between 0 and 200. So I'll draw Mario's column a bit like this. I'll put it right in the middle here. And I'll draw Mario a column just like that. That's Mario's column. Now I want to draw Peach's column. Well, Peach had 200 votes. I'll make Peach's column go all the way to the top of our y-axis, where we put the 200 votes. And I'll make sure that that column reaches all the way up there. And finally, for Bowser, well, Bowser had 150, somewhere between Mario's 100 and Peach's 200. So I will put Bowser's column right in between those two, just like this. So this now is our complete plot. We've worked with data. We've worked with our aesthetic mappings, taking our columns and aligning them to the x and the y-axes. And now we've worked with our geometries, this actual visual representation of now our data in terms of columns. So all that's left to do now is do this in code. So we'll come back now to RStudio. And we'll look at how we can use this package, ggplot2, to do just that. I have open here a file called votes.R. And I'm reading in this CSV called votes.csv. Notice here how it goes into a table that I'm calling "votes." And this is the same table we saw before on the slides. We have a candidate named Mario, a candidate named Peach, and a candidate named Bowser. And each has some number of votes. 
Well, if I want to create a new blank plot, like what I had before here, I'll go ahead and I'll use this function, ggplot, just like this. And so long as I have installed the tidyverse and loaded it using library down below-- tidyverse, just like this-- I can have access to this function called ggplot that creates for me a new blank plot. So let me run line 3 here. And what do we see? Well, in the plots column or in the plots tab here, I'll see this blank plot that I can then draw, put brushstrokes on to visualize my data. But we haven't actually given our plot anything at all. This ggplot function has no input whatsoever. So I think it's worth thinking about what kinds of inputs we need to give to our plot so that it can visualize this data we have in our table here. Now, the first input, as we saw in our grammar of graphics, is going to be the data to visualize. So let's give as input now to this ggplot function the data itself. I'll come back now to RStudio. And it turns out that the first argument to this ggplot function, which creates for me a new plot, is the data itself. I'll provide this first argument now, this votes data frame. And now, if I run line 3, I still see nothing. And that is kind of expected because we said that data, at the end of the day, is just data. We still need a way to map this data to certain aesthetic features of our plot, certain visual features. And we need to say what kind of plot we want. Well, let's do just that now in code. I'll come back now to RStudio. And it turns out that one other piece we need here is our geometry. We've given it some data, but we need some geometry to specify what kind of plot we want now. Well, the way we can do this syntactically in R, and in ggplot more generally, is to use this plus sign here and to follow it with the part of geometry we want, which is going to be a column geometry. Now, to get a column geometry, I can use this function here, geom_col. 
And there are other geom functions, like geom_point or geom_line. But for now, we'll use geom_col to create ourselves some columns. But notice here this plus. So you've seen plus when we've been adding up some numbers, like 1 plus 1. But this is not that kind of addition. In fact, the creators of ggplot have done what's called overloading this plus sign. It no longer means addition. It means something else entirely. In fact, it means we're going to add a new layer to this blank plot. Let me show you what I mean with this prop over here. So we had here our visualization, which had columns on top of it. But I drew these columns on the same layer, let's say, that I drew my axes. Turns out that in ggplot, our plots have multiple layers. So instead of everything being on the same page, so to speak, what we instead do is the following. Let me erase my columns that I have here. And let me instead do this. Let me take out a new layer I could put on top of my blank plot here, my x and y-axes. And on this layer could I draw my columns. Let me go ahead and do the same thing we did before. Mario had about 100 votes. I'll put these halfway up the y-axis. Peach had 200. I'll put those all the way up at 200. And Bowser had 150, somewhere between Mario and Peach's votes. And now notice what I've done. I have separated my plot now into layers. One layer, of course, is my columns here. And one layer seems to be my axes, same thing I'm doing here. We've started now with this blank piece of paper, thanks to ggplot, and we're now adding a new layer, one that involves columns. So let's see what happens. I'll come back now to RStudio. And let me go ahead and run this here and see what we see. Nothing yet. I do see this error that the geom_col function requires the following missing aesthetics, x and y. And so I think ggplot is running into this same problem that I did earlier. If I don't know how to map my data to the visual features of this plot, well, how can I even draw these columns? 
So we'll need to specify these aesthetic features, x and y, before ggplot knows how to draw these columns here. Now, it turns out we can specify these aesthetic features using a function, one called aes, which stands for aesthetics here. Now, aesthetics, the function, looks a bit like this-- aes, and it takes as input any number of pairings between aesthetic values, like x or y, and the columns in my data. For instance, like this. This would be one aesthetic mapping. I'm taking one column of my data and assigning it to control some visual feature now of my plot. So let's do the same thing now using aes, this aesthetic function, to actually make it happen for us inside of R and ggplot. So I'll come back to RStudio here. And let's go ahead and use the aes function to tell this plot exactly how to map our data to certain visual features of our plot more generally. So it turns out, by reading the documentation, I know the second argument to ggplot is this aes function, which we know takes as input certain pairings between these aesthetic features, like x and y, and columns in our data. So one aesthetic feature is the x-axis. Which column should go on the x-axis of our data? Well, we know we're going to put the candidates column on that x-axis here. I'll go ahead and I'll say, x equals candidate, just like this. This means that the candidate column in my data should now be mapped to the x-axis of my plot. Now, I'll do the same for the y column. Well, the y-- or the y aesthetic. The y aesthetic is going to be mapped to the votes column and vice versa. So the votes column should control the y-axis, effectively. I'll go ahead and say that y equals votes. And now, with these aesthetic mappings specified, I should be able to visualize this. I'll go ahead and run line 3. 
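The plot built up to this point can be sketched in a few lines of R. This is a minimal sketch rather than the exact contents of votes.R: the data frame is typed in by hand as a stand-in for reading votes.csv, and only ggplot2 is loaded instead of the whole tidyverse (both are assumptions made so the example runs on its own).

```r
library(ggplot2)  # ggplot2 is part of the tidyverse; library(tidyverse) also works

# Assumed stand-in for votes.csv: the candidates and vote counts from the slides
votes <- data.frame(
  candidate = c("Mario", "Peach", "Bowser"),
  votes = c(100, 200, 150)
)

# Data first, then the aesthetic mappings, then a layer of columns
ggplot(votes, aes(x = candidate, y = votes)) +
  geom_col()
```

Leaving out the aes() call here should reproduce the complaint about the missing x and y aesthetics.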
And voila, we have our very first plot we've now created with ggplot, thanks to starting with data, defining how that data controls visual features of our plot, like the x and y-axes, and now using a new layer, adding in these columns. So let me ask you now, what questions do we have on what we've done so far? AUDIENCE: If we look into the x-axis of the plot that we did now in ggplot2, the order of our candidates is slightly different from what we drew in the board. So why is that? And how do we expect ggplot to order our labels? CARTER ZENKE: Yeah, a really good observation. I was hoping you might notice this, actually. So notice here on the plot, our names are not in the same order we saw them in our data. In our data, it was Mario, Peach, and then Bowser. But it seems like on the x-axis here, it's Bowser, Mario, and Peach. Well, what order is this? It turns out this is alphabetical order. So Bowser with B comes first, followed by Mario, followed by Peach. So we can generally expect ggplot to sort our data on either axis. In this case, because we're working with names, sorted alphabetically down below on the x-axis. But a good observation to make. Now, let's keep going. And ggplot gives us some pretty good defaults for our plots. I'll notice here that on the labels here for my axes, I see the name of the column, like candidate. And on the y-axis, I see votes for the y-axis. And even on the y-axis, it has decided for me that the range of values should be 0 to 200. So these are pretty good defaults. But sometimes you might want to override or change those defaults, which you can do by adding new layers to your plot. So let's consider how to do just that. Well, here, one thing I want to do is change this plot. So I have a little more headroom up here on Peach's votes column. Maybe I want my y-axis to not stop at 200 but to go up to, let's say, 250. 
Well, to do that, we'll need to learn about this other concept in plots, and in ggplot2 more generally, one called scale. So let's learn about scales a little bit together. A scale is simply a way now of specifying how our values actually control our aesthetic mappings. And for instance, there are really two kinds of scales, one called a continuous scale and one called a discrete scale. Now, a continuous scale is for values that fall in some range. You could think of our y-axis here with some number of votes between 0 and, let's say, 200 or 250. That is a continuous scale because some votes could fall in some range. A discrete scale, though, is for values that are what we might call categorical. They fall into categories. For instance, our x-axis has a discrete scale. There are three distinct candidates. It wouldn't quite make sense to put them on any given scale. They're just names. So this is what we call a discrete scale. Now, scales that are continuous have what are called limits. And exactly what we've seen here is our y-axis, our y scale starts at 0 and goes up to 200 currently. Those are the limits of this scale, from 0 to 200. But we want to change the limit here from 0 to 250 instead. And so it turns out that ggplot gives us functions to modify different scales, using some of these right here. So we see scale_x_continuous is a function, scale_y_continuous is a function, scale_x_discrete, and scale_y_discrete. These change various different scales on our x or y-axis, depending on if they are continuous or discrete. Now, we said before, we want to change our y-axis. So that narrows our possibilities. It's either this one here, scale_y_continuous, or this one here, scale_y_discrete. We said our y-axis is a continuous scale. It has a range of values. It doesn't have discrete values. It has a range of values. So we could probably use scale_y_continuous to modify now our y-axis. Let's go ahead and do just that by adding a new layer to our plot. 
I'll come back over here. And let's say I go back to RStudio. Well, I want to change now this y-axis's defaults. And I can do so, as I said before, by adding a new layer. I'll use this plus here to add a new layer to my plot. And let me now kind of overwrite the scale I currently see on the board. I'll use scale_y_continuous, which will allow me to adjust my continuous y scale that I currently see. And it turns out, as I know from reading the documentation, that scale_y_continuous takes this argument called limits, just like this. And limits takes a value which is a vector of length two, where the first value is where the scale starts, let's say 0, and the second is where the scale ends, let's say 250. So I'll say-- I'll give limits here, this vector, 0 to 250. And that should now change for me my scale on my y-axis. Let me go ahead and redraw this plot, which I'll do by running this line of code here. And now we'll see my axis has changed. So what have we done? We have first defined this blank plot, given it some data and some mappings between that data and its visual features. We have then drawn some columns on top of it, thanks to geom_col. And then we've kind of overridden the default scale, moving it from 0 to 200 to instead 0 to 250. We're kind of building up our plot as we go now, using these layers. Well, one more thing we could do is override the labels here. I see this one is candidate with a lowercase c. This one is votes with a lowercase v. I'd love to make them capital to make them more professional. And also, I want to add a title so people can look at this and know exactly what they're looking at immediately. So I can add a new layer now to my plot to add in the labels and either override some defaults or add some new labels altogether. Well, it turns out the function I use to do this is one called labs. Labs is short for labels. I'll go ahead and add a new layer now to my plot called labs. 
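As a rough sketch of the scale adjustment just described, the limits argument takes a two-element vector built with c(). The hand-typed data frame is again an assumed stand-in for votes.csv.

```r
library(ggplot2)

# Assumed stand-in for votes.csv
votes <- data.frame(
  candidate = c("Mario", "Peach", "Bowser"),
  votes = c(100, 200, 150)
)

ggplot(votes, aes(x = candidate, y = votes)) +
  geom_col() +
  scale_y_continuous(limits = c(0, 250))  # y-axis runs from 0 to 250 instead of the default
```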
And labs takes as arguments the kinds of labels I want to adjust on my plot. Now, for instance, I could use x equals and set it equal to a character string that I want to be the label for my x-axis. I'll call this one Candidate with a capital C. I could also give it a label for the y-axis by saying y equals and then some character string here. I'll go ahead and call this Votes with a capital V. And then for my title, why don't I use this, Election Results, making it very clear exactly what we're visualizing for any viewer who comes in. I'll go ahead and now build up my plot step by step again by running all of these lines of code. And now I should see my plot as it should be. I have all my candidates on the x-axis, my votes on the y-axis, and those labeled now appropriately. I even see my title up top, Election Results. So we have started with a blank plot, added columns to it, adjusted our scale, and added labels. Let me ask, what questions do we have so far on how we've built up this plot using ggplot's layers here? AUDIENCE: Are we going in this lecture to know how to design or to change the layout of these aesthetics, like the votes and the candidates columns and the columns representation to the user? CARTER ZENKE: A good question. So you might be asking, well, how could we change these aesthetics? And often, you'll get to this plot and say, I don't exactly like how this works. Maybe I actually want my columns to be going across horizontally, let's say. We could absolutely change our aesthetics here to change what this plot is doing here. We could do so by changing these values x and y. Maybe I would make x equal to votes and y equal to candidate. And that would map my columns so they go now left to right as opposed to up to down. That's one way we could change our plot visually. The next way we'll see, though, is how we could change our plot in terms of colors. 
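The label layer described above might look like the following sketch, which also keeps the earlier columns and y scale. The data frame is an assumed stand-in for votes.csv.

```r
library(ggplot2)

# Assumed stand-in for votes.csv
votes <- data.frame(
  candidate = c("Mario", "Peach", "Bowser"),
  votes = c(100, 200, 150)
)

ggplot(votes, aes(x = candidate, y = votes)) +
  geom_col() +
  scale_y_continuous(limits = c(0, 250)) +
  labs(
    x = "Candidate",            # overrides the default label "candidate"
    y = "Votes",                # overrides the default label "votes"
    title = "Election Results"  # adds a title, which the plot did not have
  )
```

To try the left-to-right columns mentioned in the answer, the mappings can be swapped to aes(x = votes, y = candidate); the y scale and axis labels would then need to be swapped to match.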
And I think that's what you're alluding to here, which is our plot looks pretty nice, but it's pretty gray. It's like black and white. We could probably do better in terms of colors. So let's do just that. Well, I'll come back now to RStudio. And let's see how we could change the fill color, that is, the color filling of these columns depending on the candidate's name. Well, notice here I said I want the color to depend on the candidate's name, which seems a lot like these aesthetic mappings. I have some visual feature, in this case the fill color of my columns, that I now want to be dependent on the value in the candidate column. Well, I can do this using the aes function like we saw earlier. And there are more aesthetics than just x and y. There is, in fact, an aesthetic called fill that can change the fill color of each of these columns. So here I want to change the fill color of these columns themselves. So what I'll do is pass in the aes function as input to geom_col itself. And here I'll specify I want to change the fill aesthetic, the color that fills these columns. And I want it to depend on the candidate column in my data. So what I've said here is I want to specify a new aesthetic mapping, one that applies only to columns. I want to change their fill color and have it depend on now the candidate column. I'll go ahead and I'll run this update to our plot. And now we'll see some color, which is pretty nice. Notice how every candidate has their own color, which is because as input to geom_col, we said the fill aesthetic should depend on now the candidate column we have. Each candidate should, effectively, get their own color. But when we work with color, it's important to be mindful that people who might look at this plot might be looking at it with some form of color blindness. And so when we convey information with color, we should make sure we do it as accessibly as we can. 
And thankfully, in R and in ggplot, there are ways to adjust colors to make sure they're friendly to those who might look at this with some form of color blindness. So let's see how to do that now. I'll come back to our plot here. And actually, notice on the right-hand side, I've created kind of a new scale. I have here different colors assigned to different candidates. This is, in fact, its very own scale, one called a fill scale. We have our x scale and our y scale. And now we have a new one called a fill scale, determining what colors will belong to each of these candidates here. Well, if I want to change this scale, I could do it in a way that's very similar to the way I changed the y scale, by adding a new layer to my plot and overriding the default. And actually, thankfully, R comes with this scale called the viridis scale, which is known to be very friendly to many different forms of color blindness. If I want to use the viridis scale, I can do so using this function here. I'll go back to my plot, and I'll go ahead and add this additional colorblind-friendly scale in. I'll say I want to adjust the fill scale I just created, the one you see on the very right-hand side. And I want it to instead be the viridis scale, which is going to be set to the discrete version of that scale. So recall how we had both continuous and discrete scales, continuous being on a range, discrete being individual values? There's a special scale called viridis discrete, or viridis_d for short, that allows me now to say I want to take these discrete, colorblind-friendly colors and make them the colors I'll see for each of my candidates. I can do this followed with a plus sign, and now I've added in this new layer that overrides the default colors and makes them more colorblind friendly. Let me go ahead and build this plot again. And here I'll see my colors change to be more friendly now to those who might be looking at this with some form of color blindness. 
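A sketch of the two color steps together: a fill aesthetic mapped inside the column layer, followed by the discrete viridis fill scale. The function scale_fill_viridis_d is provided by ggplot2; the data frame is an assumed stand-in for votes.csv.

```r
library(ggplot2)

# Assumed stand-in for votes.csv
votes <- data.frame(
  candidate = c("Mario", "Peach", "Bowser"),
  votes = c(100, 200, 150)
)

ggplot(votes, aes(x = candidate, y = votes)) +
  geom_col(aes(fill = candidate)) +  # this mapping applies to the column layer only
  scale_fill_viridis_d() +           # discrete, colorblind-friendly fill colors
  scale_y_continuous(limits = c(0, 250)) +
  labs(x = "Candidate", y = "Votes", title = "Election Results")
```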
And there's one more thing we could do here. Notice how on the right-hand side, this scale has a name, candidate. But I want this to be capitalized, so I could change this by passing into scale_fill_viridis_d the title I want for this scale. Let me come back and do just that. I'll come back over here and say I want this viridis scale to instead be named now Candidate. I'll go ahead and rebuild my plot. And I'll see on the right-hand side I've changed not just the scale's colors but also its title. Now, while we're thinking about aesthetics and how nice this plot looks, one more thing I could do is change its theme. Ggplot comes with several themes. And by default or by convention, these themes are often applied at the end of our plot. After we've added our layers of columns and adjusting scales and adding labels, we can change the theme of our plot. So I'll go ahead and add a new layer, one final one to my plot here. And I can use this family of functions that all begin with theme_. And there are many themes to choose from. You could do theme_bw or theme_classic. These are all available in a reference on the ggplot package website. But here, I'll use theme_classic, which is more minimalistic here. Let me go ahead and say I want my theme to be the classic theme in ggplot. I'll go ahead and rebuild my plot now. And here I'll see that I've kind of changed the aesthetics more so visually. I've dropped the gray background. I've simplified my scales. And I think things just look now much prettier, thanks to this theme layer here. So now we've updated the aesthetics by changing now the colors of each of these candidates and changing the general theme of the plot. What questions do we have on how we built this plot up from a blank page to adding columns to changing scales to adding labels and now changing the theme at the very end? What questions do we have? AUDIENCE: Can the layers be in any order? Or do we have to follow a specific order for them? 
CARTER ZENKE: A really good question. Can these layers be in any particular order, or is there some defined order we have to use? Turns out the only thing that has to come first is this ggplot function. This gives us metaphorically that blank page to write everything else on top of. We could put these other layers in any other order we wanted to, but there is some convention. Generally speaking, we go from a blank page to adding our geometries, like these columns here. Afterwards, we'll adjust our scales if we need to. Here I adjusted the fill scale, the one we see on the right-hand side. And I adjusted the y scale, the one going vertically now here. After we adjust our scales, we should adjust our labels if we want to. Here I did just that. I adjusted the labels here, setting my x-axis as the candidate name down below. My y-axis has the Votes name on this left-hand side here. And my title is Election Results. And by convention, at the end we add in our theme here to say I want to take all this I've just done and style it in some particular way, in this case this classic theme, which is more minimalist in spirit. But a really good question on the ordering of these layers here. Let's take one more question on what we've just done so far. AUDIENCE: Is there a way to get rid of the legend on the right since it is redundant with the x-axis labels? CARTER ZENKE: Yeah, a good question. So here you might notice that I have one color for each of my candidates, and I have this so-called legend on the right-hand side called Candidate that tells me which colors these are associated with. Well, because I only have one color for each candidate, and I already have their names down here, you could probably argue that this is redundant. So we might want to remove this. And it turns out I can do that using a parameter of geom_col, one called show.legend, which by default is true, but we could change to false. So let me go ahead and do that over here. I'll come back to RStudio. 
And let's say I want to make sure that this fill aesthetic of my column does not produce a legend on the right-hand side. Well, I could say I want to specify the show.legend parameter and have it be not true by default but false instead. And because this is getting a little long here, let me go ahead and put each of these arguments now on their own line, just like this, and move this parenthesis to its own line. And now I should see, if I were to rebuild my plot top to bottom, that I've removed that legend. And now I just have those candidates on the x-axis and different colors now for each of them here. But there's one more problem I would say with this too, which is in our original plot, we had something like this. We had Mario and then Peach and then Bowser. If you look here, we had different ordering than we see in our plot. Here I have Mario, Peach, and Bowser, but on this plot here I have Bowser, Mario, and Peach because, by default, as we said before, ggplot will order these values in alphabetical order by name, Bowser, Mario, and then Peach. There is a way to reorder these explicitly if we wanted to, using what we'll call factors. But we'll save that a little bit more-- I'll save that for a little bit more later on. Now, we have here our plot, let's say, good enough for now. And I want to save it so I could share it with a friend. Here, this is currently all in RStudio. I'm seeing it here, but I want to maybe get an image file I could share with somebody else. Well, I could do that as well with a special function called ggsave. It lets me save my ggplot to my own computer. So let's see ggsave in action now. I'll come back over here. And why don't I take this entire plot I've built and now store it in an object? And by default for plots, we use this p name for plot objects, where I'll say everything we've done here, starting with a blank plot, adding in each of these layers, will be stored now under the name p, this object here. 
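Putting the pieces together, the finished plot stored under the name p might look like the sketch below, with layers in the conventional order: geometry, scales, labels, theme. The data frame is an assumed stand-in for votes.csv, and the scale title is passed with an explicit name = argument so the call does not depend on argument position.

```r
library(ggplot2)

# Assumed stand-in for votes.csv
votes <- data.frame(
  candidate = c("Mario", "Peach", "Bowser"),
  votes = c(100, 200, 150)
)

p <- ggplot(votes, aes(x = candidate, y = votes)) +
  geom_col(
    aes(fill = candidate),
    show.legend = FALSE                    # drop the legend that repeats the x-axis labels
  ) +
  scale_fill_viridis_d(name = "Candidate") +
  scale_y_continuous(limits = c(0, 250)) +
  labs(
    x = "Candidate",
    y = "Votes",
    title = "Election Results"
  ) +
  theme_classic()

p  # typing the object's name draws the stored plot
```

Because the plot is stored as an object, another layer can still be added later with p + followed by that layer.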
And I could, later on in my code, still add more layers. As long as I've saved this plot as this object p, I could say, well, p-- and let's go ahead and add in some new layer down below. But we won't do that. We'll instead save our plot. Well, we have a function called ggsave that lets us save our plots. And ggsave works a bit like this. I'll type ggsave here as the function name. And it takes quite a few arguments to save this plot to my computer. The first one is going to be the name of this file, in this case votes.png, let's say. And the next one is going to be the plot I want to save under this file name. Well, here I want it to be the plot p I just made. So I'll say the plot parameter to ggsave is equal to p. This argument here is the plot we want to save. I'll then say how wide and how tall I want this image to be. Here, I played around with this a bit before, and I found that if the width is around 1200 pixels and the height is around 900, so a four-by-three kind of rectangle, this looks pretty good for this kind of plot here. And I'll also specify that these units are pixels, just like this. There are also inches or centimeters and so on, but here I'll use pixels. And so thanks to all of these parameters here, the filename, the plot to save, how wide and how tall, and what units we're working with, if I were to run ggsave and go to my file explorer, well, now I would see votes.png. And if I click on it here, I'll see my own file here now saved to my computer, visualizing all of this data. So we've seen so far how to visualize our data using columns. When we come back, we'll see how to use another geometry, one called a point, to visualize relationships among columns in our data. See you all in a few. Well, we're back. And so we've seen so far how to visualize our data using columns. But now we'll take a look at a new kind of geometry, namely the point. 
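The ggsave call described before the break can be sketched as follows. To keep the example short, p here is a pared-down version of the election plot built from a hand-typed data frame (an assumption standing in for votes.csv); the file name, width, height, and units are the ones mentioned in the lecture.

```r
library(ggplot2)

# Assumed stand-in for votes.csv and a pared-down version of the plot
votes <- data.frame(
  candidate = c("Mario", "Peach", "Bowser"),
  votes = c(100, 200, 150)
)
p <- ggplot(votes, aes(x = candidate, y = votes)) +
  geom_col()

# Writes votes.png to the current working directory
ggsave(
  "votes.png",
  plot = p,
  width = 1200,
  height = 900,
  units = "px"
)
```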
A point is good for when you want to visualize a relationship between two columns you might have in your data. And those columns are both on a continuous scale. So to illustrate this point, no pun intended, we have here this data set of candy. Namely, we have names of candy and their price percentile and their sugar percentile. Well, what does that mean? Well, a price percentile is best illustrated through example here. So let's say I bought this Hershey's Milk Chocolate bar. And I went and bought it at the store. And it turns out that this milk chocolate bar is more expensive than 92% of candy. So the Hershey's Milk Chocolate bar, this is a pretty expensive candy overall. On the other hand, a Reese's Peanut Butter Cup, well, this is more expensive than 65% of other candies, so a little less expensive than this but still not that cheap either. Now, on the other hand, Sour Patch Kids is a candy-- this is more expensive than 12% of candy, so a little bit on the cheaper end as far as candies go. So this is price percentile. It's a relative measure of how much this candy costs among other candies in general. But we also have sugar percentile. Well, if we take this same candy, this Hershey's Milk Chocolate bar, it turns out that it has more sugar than 43% of other candies. And this Reese's Peanut Butter Cup seems to have even more sugar. It has more sugar than 72% of candies. So in general here, a higher number for any of these columns, either price percentile or sugar percentile, means this candy is more expensive or has more sugar than other candies comparatively. So let's see how we could visualize now this data. Well, I have here this plot where on the x-axis, I have price, and on the y-axis, I have sugar. And it might make sense for us to go through these candies one by one to plot them as individual points on this plot here. So here let's look at again this Hershey's Milk Chocolate bar, which has a price percentile of 92 and a sugar percentile of 43. 
Where would this candy go on this plot? Well, I could look first, let's say, at my x-axis, which has this price percentile variable here. If I see that this candy has a 92 price percentile, that would be kind of over on the right of my price axis here, this x-axis, close to 100. Now, it has here a sugar percentile of 43, which seems like it would go a little bit below 50, so maybe somewhere, if I point at the x-axis and the y-axis, somewhere right around here. I'll go ahead and draw that point here for the Hershey's Milk Chocolate bar. And maybe, as we add these points, we could ask ourselves, if we pay more for these candies, do we actually get more sugar, assuming sugar is what we want? We'll see. So the next one was this Reese's Peanut Butter Cup here. It turns out that compared to other candies, this is in the 65th percentile for price and the 72nd percentile for sugar so a good amount of sugar in these. Let's go ahead and plot this point on our plot here. Well, on the x-axis, it's the number 65, so a little bit past 50, let's say, maybe right around here. And the sugar percentile is 72, so probably somewhere in the top right or so. Why don't we say maybe right around here would be our Reese's Peanut Butter Cup now plotted on our plot here. Well, there's still more candies to go. We have as well Sour Patch Kids, just like this, which is relatively less expensive and also doesn't have that much sugar compared to other candies. So this would probably go somewhere in the bottom left, I would say. It's pretty low on both the x and the y-axes. So if we look here, it has 12 on the price percentile, so kind of closer over here, and a 7 on the sugar percentile, so also pretty low over here. I'd say we could plot this point right about here for Sour Patch Kids. So seems like we're seeing a bit of a trend maybe going on here. What if we tried now Swedish Fish? 
Well, Swedish Fish, similar to this Reese's Peanut Butter Cup, they're pretty high in price but also pretty high in sugar. So let's plot this one too. I'll come back over here. And say, well, 76 is somewhere in the middle between 50 and 100, so around here on the x-axis. And the 60 is just above the 50 here on the y-axis, so let me go ahead and plot this as our Swedish Fish point. So maybe some relationship here. We see as price goes up, maybe our sugar intake goes up as well. Well, we'll see. Now, one thing we could do with this plot is think about some edge cases, which is, here, I have plotted four different candies. But let's say along comes another candy, this one called Hershey's Special Dark. What do you notice about Hershey's Special Dark as compared to other candies on our list so far? One thing I notice here is that Hershey's Special Dark has actually the same price and sugar percentile as Hershey's Milk Chocolate. So it brings the question, how do we plot this data point? Well, if I look at this chart here and want to plot Hershey's Special Dark, I'll go ahead and look at it and say, well, the price percentile is about 92 and the sugar percentile is 43, that would put it right here. So I'll go ahead and plot it just like this. And it seems like these points overlap. So although we've plotted supposedly five candies, I see here 1, 2, 3, 4 points. So it seems like I'm now missing a point because these two here are overlapping. So there are a few ways to solve this. One, as we'll see in ggplot, is actually to do what's called jittering these points, where you might be familiar with jittering, like if you have the jitters about you're being nervous or excited here. It's kind of the same idea with these points. I could take this point here, and I could do what's called jittering it. I could say, well, I don't really care if it's exactly where it needs to be. As long as these two points are just slightly separated, that's good enough for me. 
I'll erase this point here but keeping track where I had it. I'll put one point slightly below and one point slightly above. And now we see two distinct points, even though they have the same value. We jittered them around so they have a nonoverlapping visual here. So as you go about plotting these points, keep in mind if your data has these overlaps and you care about seeing each individual point, you might want to do what's called jittering them so you can see all of them and not just some of them in terms of overlaps as well. So one thing left to do now is to translate this into code. So let's go back now to RStudio and see if we can make a plot like this with ggplot. Come back now over here. And let's take a look. Here I have a file called candy.R. And the first thing it does is load for me this file called candy.RData. Now, inside candy.RData is this data frame called candy. So if I were to run line 1, I'll now see that I have this data frame called candy. And inside is much more than just five candies. I have lots of different candies in here and their price and sugar percentiles. So let's see if we could visualize them using ggplot. I'll come back now to candy.R. And of course, we'll begin with our ggplot function to begin with a new blank plot, just like this. I now have my blank canvas that I can add layers to. Well, the first thing I might want to do is say to ggplot, what data frame are you using here? In this case, the candy data frame. So I'll pass this first input here, candy. And then I'll also assign some aesthetics. I should probably assign the x and the y-axes as we saw here on my chart. I'll say aesthetically, I want the x-axis to match the price percentile column. And I want the y aesthetic, that vertical bar, to match the sugar percentile column, just like that. And this is getting a little long as a line. So I will on one line put the candy data frame and on the next line put these aesthetic mappings, let's say. 
And this is the very beginning of my plot. In fact, if I were to run what I have right now, I would actually see that ggplot has constructed for me a plot that has these axes. Notice how on the bottom, I see price percentile. And on the y-axis, I see sugar percentile, very similar to what we have in our chart here but just different kind of numbers that we're seeing on the individual axes here. Here, I have 0, 50, 100. Here, I have 0, 50, 100. Here, I have 0, 25, 50, 75, 100, and so on. So same thing but different numbers on this plot here. So what more could we do? We want to visualize these points, which we can do using a new kind of geom. We saw geom_col last time. Let's see what geom_point could do for us. I will add a new layer to my plot, let's say. I'll come back over here, and I will add this geom_point layer, which will draw for me these points according to every individual row that I have. Geom_point will look at my data frame and make a new point for every individual row I have of candy inside this data frame. I'll go ahead, and I'll run this. And what do I see? Well, now I see lots of individual points representing the candies I have, their price, and their sugar content. Now, the relationship seems a little bit less clear here. If you pay more, maybe you get more sugar, maybe not. It depends, it seems, on the candy. But I'd argue we're running into the same issue we just saw with this physical plot here, which is if any two candies have the same price and sugar percentile, well, they're going to be overlapping each other. And in fact, I can show you a few points that do just that. If I come back now to my table and show you in the candy data frame-- let me sort this by price percentile and scroll down a little bit more. Here I'll see that between Hershey's Krackel, Milk Chocolate, Special Dark, these all have the same price and sugar percentile. 
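A minimal sketch of the plot built so far, assuming the columns are named `price_percentile` and `sugar_percentile` (the lecture only describes them in words). The five rows use the values quoted earlier in the lecture; the real candy data frame has many more.

```r
library(ggplot2)

# Five candies with the percentiles mentioned in the lecture.
candy <- data.frame(
  name = c(
    "Hershey's Milk Chocolate", "Reese's Peanut Butter Cup",
    "Sour Patch Kids", "Swedish Fish", "Hershey's Special Dark"
  ),
  price_percentile = c(92, 65, 12, 76, 92),
  sugar_percentile = c(43, 72, 7, 60, 43)
)

p <- ggplot(
  candy,
  aes(x = price_percentile, y = sugar_percentile)
) +
  geom_point()   # one point per row of the data frame

p
```

With this small sample, the two Hershey's bars share the same coordinates, so five rows produce only four visible points.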
So these would appear as only one point on my plot when I ideally want to separate them even just a little bit. So to do just that, I can use a geometry not called geom_point but one called geom_jitter. Like we said, we're going to jitter our points, meaning kind of move them around a little bit randomly but still so they're still in the roughly the same place that they're supposed to be. So I'll use geom_jitter here. And I'll run this top to bottom. And I'll see ever so slightly my points have changed. And I now am more able to see individual points. Particularly around these, you might see just a bit of separation. Around these, you might see a bit more separation here too. You can tell ggplot how much to widen or to change the height of these points here. But for now, we'll leave it as the default. We can now see a little bit more of these individual points. Now, I'll go ahead and improve this chart some more. Here, I want to probably add some labels, like renaming my x-axis and my y-axis, adding a title. And let's go ahead and set our theme too. I'll come back now to RStudio. And let me change this. I will add in a label layer. And I'll set the x label equal to Price, the y label equal to Sugar. And the title of this chart will be simply Price and Sugar. And then, finally, I'll go ahead and adjust my theme. I'll again use the classic theme here. And we should see my chart is now coming into shape. So what have we done? We have started again with this blank canvas and given as input our candy data frame. We've told ggplot we want to map the price percentile column to the x-axis, as we just did here. And we want to map the sugar percentile column, as we just did, on the y-axis here. We're going to go ahead and add in these points and jitter them just a little bit so we can see those individual points too. We're going to add labels and set our theme. 
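The jittered version with labels and a theme might look something like this. The column names and the five-row sample are assumptions carried over for illustration; `set.seed()` is added only so the random jitter is repeatable.

```r
library(ggplot2)

set.seed(50)  # jitter is random; a seed makes the sketch reproducible

candy <- data.frame(
  name = c(
    "Hershey's Milk Chocolate", "Reese's Peanut Butter Cup",
    "Sour Patch Kids", "Swedish Fish", "Hershey's Special Dark"
  ),
  price_percentile = c(92, 65, 12, 76, 92),
  sugar_percentile = c(43, 72, 7, 60, 43)
)

p <- ggplot(
  candy,
  aes(x = price_percentile, y = sugar_percentile)
) +
  geom_jitter() +              # nudge each point slightly so overlaps separate
  labs(
    x = "Price",
    y = "Sugar",
    title = "Price and Sugar"
  ) +
  theme_classic()

p
```

geom_jitter also accepts `width` and `height` arguments to control how far points may move, which is the adjustment alluded to above.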
Let me ask, what questions do we have so far on this point geometry, jittering, our points, or anything more related to this plot here? AUDIENCE: When does-- when does this plot is preferred over the column? CARTER ZENKE: Yeah, a really good question, and it's one you probably will run into very frequently when you're visualizing your data is, what type of visualization is best here? Now, when we have data that involves both categorical and continuous variables, like we saw in our candidates-- remember, we had called the x-axis a discrete scale or a categorical scale? It had individual candidates, and each of those candidates had some continuous value associated with them, the number of votes-- that's a good kind of data to use for a bar chart or a column chart. Here, though, we have actually two continuous variables, two continuous scales. On the x-axis, I see a range of values between 0 and 100. And on the y-axis, I see the same thing. Well, if you have a continuous range of values on both your x-axis and your y-axis, that's going to be a good hint. You probably want to use something like points, for instance. You could imagine using columns here, but I'm not exactly sure what that would look like. Maybe I need to have two columns for each of my candies, one for sugar content and one for price content, which wouldn't quite show me the relationship between price and sugar. Here, I argue, that this shows us much better the relationship between price and sugar when both are continuous, that is, on this individual scale here between 0 and 100. But a good question and one to consider as you go off and design your own plots too. Now, let's go ahead and tidy this up a little bit more. And maybe I can go ahead and actually do one thing, which is the title doesn't appear to be showing here. I have Price and Sugar, but I didn't set the title of it here. So let me go ahead and go back and change that. 
I will update the label layer to instead say the title is Price and Sugar. Let me rerun this, and we'll see Price and Sugar up top. But now one kind of fun thing I could do is change the color that I see on these points. And it turns out that there is an aesthetic that I can apply to my points here with geom_jitter, one called color. Now, one color I kind of like is this one called dark orchid. And in R, you actually have access to certain colors that are known by given names. So there's one called dark orchid, and we'll see here that RStudio has automatically shown me what color dark orchid is, but you also have more primary colors, like blue or like red or like green, and so on. And it turns out that this color aesthetic, when applied to this point here, will actually change the color of these points we're seeing. So I'll try to make these points this color, dark orchid, which kind of evokes some candy here for us. Let me go ahead and see what that does. I'll go ahead and run this. And now I'll see that my points have changed to this color, dark orchid. Now, notice here that we're not specifying an aesthetic mapping. An aesthetic mapping tells ggplot to map some column in our data to some given aesthetic, like x or y or even color if we wanted to. What I'm doing here instead is saying all points should actually get this same color. If I want to apply any aesthetic to have only one value, I don't need to use the aes function, I can simply say the aesthetic name and then the value I want it to have inside the geometry I want to apply it to. So in this case, I want to change the color of these points that are jittering about and change it to dark orchid, in particular. Now, one other step we could change to is one called size. Well, these points have a certain size, a certain radius, a certain size on this page here, that I could change using this aesthetic called size. So let me go back and do just that. I'll come back to RStudio. 
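To make the distinction concrete, here is a small sketch of setting a constant color rather than mapping one. In R the color is written as the single word `"darkorchid"`; the data frame and column names are illustrative assumptions.

```r
library(ggplot2)

candy <- data.frame(
  price_percentile = c(92, 65, 12, 76, 92),
  sugar_percentile = c(43, 72, 7, 60, 43)
)

p <- ggplot(
  candy,
  aes(x = price_percentile, y = sugar_percentile)
) +
  # color sits outside aes(), so every point gets the same value
  geom_jitter(color = "darkorchid") +
  labs(x = "Price", y = "Sugar", title = "Price and Sugar") +
  theme_classic()

p
```

Had `color` been placed inside `aes()`, ggplot would instead treat it as a mapping from data to color.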
And let's try changing the size of these points to be maybe a little bit bigger. I could set size, which is by default somewhere between 1 and 2, let's say 1.5 or so-- I could change this perhaps to maybe 2, make it just a little bit bigger. I'll go ahead and visualize this. And I'll see my points are now just a little bit bigger. I can make them even bigger. I can make them maybe size 4 or so, just like this. And now we're better able to see our points because they're bigger, but there's still a lot of overlap here. So one thing you could do is maybe make them smaller to reduce that overlap. I'll come back over and say let's change this from 4 to maybe like a 0.5, make them pretty small. And now I'll be better able to see these individual points. I think I'll leave it somewhere around 2 or so to make this chart more visually interesting. But you can have access to this aesthetic called size to change what your plot looks like in terms of how big these dots actually are. Now, a few more we can work with-- one is going to be called in this case fill. And one is going to be called shape. We've actually seen fill already when we filled in our columns. But we can also supply the fill aesthetic if we change the shape of some of our dots here. So let's go back to looking at our dots, and let me play around with this aesthetic called shape. Let me go actually and put this above size to keep these arguments in alphabetical order here. I'll say that shape is equal to-- well, what do I want? I mean, it turns out I can't specify something like triangle like this. And I can't use something like square like this. I actually have to actually put in some numbers here. And if I look up on the ggplot reference what numbers correspond to which shapes, I can actually see which shape I want and type in the corresponding number. Let's go through a few of these here just to see what they look like. I'll change shape to be 1 and see what I get. I'll go ahead and rebuild this chart. 
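A quick sketch of the size adjustment, again with an assumed five-row sample and assumed column names:

```r
library(ggplot2)

candy <- data.frame(
  price_percentile = c(92, 65, 12, 76, 92),
  sugar_percentile = c(43, 72, 7, 60, 43)
)

p <- ggplot(
  candy,
  aes(x = price_percentile, y = sugar_percentile)
) +
  geom_jitter(
    color = "darkorchid",
    size = 2          # the default for points is 1.5; try 4 or 0.5 to compare
  )

p
```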
And now I see my dots are still there, but they're a little more translucent. I see here that they have the color on the outside. And on the inside, they seem to be transparent. I actually see through to the white background at the bottom of this page, if you will. I could change to a different shape too. Let me come back and try maybe the second shape option we have, shape equals 2. And now I get triangles, which is kind of cool. I could also use maybe shape 3, and now I get some plus signs. There are lots of shapes you can play around with and use depending on how you want to visualize your data and change things aesthetically. Now, one shape I like in particular, particularly for this kind of data, is going to be shape 21. And I only know that because I looked it up online. I went through the shapes in ggplot. And I found that this shape corresponds to the number 21. So let's see what that looks like. I'll come back over here. And I'll try shape equals, in this case, 21. Let me go ahead and rebuild my plot. And I'll see those translucent points again. But it turns out that, actually, this shape, 21, allows us to specify both a color and a fill, where the color, to be clear, is this color on the outside of the dot and the fill is the color on the inside of the dot. So I could have kind of a two-tone dot here with some border around it that's a little bit darker and some fill that's a little bit lighter to make it more aesthetically pleasing. Well, let's try setting the fill aesthetic here. I'll come back to RStudio. And let me change, in this case, the fill aesthetic to be a color that I found in R's manual called just regular old orchid, like this. So my color, my border of these dots will be dark orchid. And their fill, the color on the inside of them, will be this kind of pinkish-purplish color called orchid. I'll go ahead and rebuild this plot. And now I'll see some kind of two-toned dots. I see here that I have that ring around them in dark orchid. 
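The two-tone dots described here could be produced along these lines. Shape 21 is a filled circle whose border takes `color` and whose interior takes `fill`; the data and column names are assumptions for the sake of a runnable example.

```r
library(ggplot2)

candy <- data.frame(
  price_percentile = c(92, 65, 12, 76, 92),
  sugar_percentile = c(43, 72, 7, 60, 43)
)

p <- ggplot(
  candy,
  aes(x = price_percentile, y = sugar_percentile)
) +
  geom_jitter(
    color = "darkorchid",  # border of each dot
    fill = "orchid",       # interior of each dot
    shape = 21,            # a circle that supports both color and fill
    size = 2
  )

p
```

Swapping `shape = 21` for 1, 2, or 3 gives the hollow circles, triangles, and plus signs tried above; those shapes ignore `fill`.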
And in the middle, I see that color orchid for their center fill. So here we've now played around with different aesthetics for our dots, including color, fill, shape, and size. There are more at your disposal too but more on that another time. And this, I think, is a pretty good plot. We've built it up from scratch, adding in our dots, our aesthetics, our labels, and our themes. Let me ask, what questions do we have on designing plots now with points? AUDIENCE: Is there a way to randomize the dots and the color of the dots in the plot? CARTER ZENKE: To randomize the color of the dots? Yes. So one way we could do that is specifying a new aesthetic. So if we ever want to vary a certain aesthetic, like let's say color or fill or shape or size based on some data, whether it's random or not, we would need to specify an aesthetic here. So you could imagine specifying a new aesthetic mapping, one that involves color and is associated not with a column in your data set but some random data you give it. And you could certainly do that to make sure the color is randomized across these different dots. But a good question too. One thing I'm seeing is I think people are trying this and seeing that shape, I can actually specify a text input to it. So let's try that. I'll come back over here. And I said we couldn't do something like type in square or triangle or things like that. But let's just try it and see. I'll go ahead and type in square. And I do get squares. Maybe triangle here, and I'll type in triangle. Oops, triangle. And I'll see I get triangles. I could try circle, too, just to see what we could go off and do. And now I see circles, so it seems like some basic shapes you can specify. But in general, what we'll tend to do is specify these shapes by numbers, looking up and cross reference now to determine which shape it is we want, like 21, that has both this fill and this color aesthetic here. Pretty cool. 
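Both ideas from this exchange can be sketched together: a shape given by name, and a color that varies because it is mapped to data. The `random_group` column is a made-up helper for this example, one possible reading of "random data you give it", and the rest of the data frame is the assumed sample used for illustration.

```r
library(ggplot2)

set.seed(50)  # make the random labels and the jitter repeatable

candy <- data.frame(
  price_percentile = c(92, 65, 12, 76, 92),
  sugar_percentile = c(43, 72, 7, 60, 43)
)

# Hypothetical column of random labels to drive the color mapping.
candy$random_group <- sample(c("a", "b", "c"), nrow(candy), replace = TRUE)

p <- ggplot(
  candy,
  aes(x = price_percentile, y = sugar_percentile)
) +
  geom_jitter(
    aes(color = random_group),  # a mapping, so color varies by row
    shape = "square",           # a shape named in words instead of a number
    size = 2
  )

p
```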
OK, so we've seen now how to visualize relationships between two continuous variables, in this case price and sugar. When we come back, we'll see how to visualize change over time in the context of hurricanes. We'll see you all in five. Well, we're back. And what we'll do next is visualize data that changes over time, otherwise known as time series data. And we'll do so in the context of a particular hurricane named Hurricane Anita that happened in the Atlantic in 1977. Now, here's a picture of Anita making landfall in Mexico. And thankfully, it did so in an area that wasn't very heavily populated, but it still unfortunately did much damage. So by looking at this data and visualizing it, we can actually hope to learn how hurricanes like these evolve and change over time so that we can better prepare for them and ultimately respond better to them in turn. Now, here are some observations of how Hurricane Anita grew over the days that it was active. Here I have a column called wind speed, or just called wind. And it's representing wind speed in terms of knots, this kind of nautical term for how fast the wind is blowing. I here have a column called timestamp too that tells me on what date this observation was taken and what time too. So here I'll see that this one is taken in 1977 on August 30, around 12:00 noon. And the wind speed of Anita was known to be about 50 knots in total. On the next day, on August 31, also at 12:00 noon, the wind speed was about 75 knots in speed. So here we can see how Hurricane Anita is evolving over the days, that it's growing actively too. Now, how could we plot this data? Well, we could do it very similar to what we saw before, putting points on our plot, like we did with candies here. Maybe on the x-axis we have our timestamp, and on the y-axis we have the wind speed that's exactly what we have over here. So here I have a plot where on the x-axis, I've put the date, in this case, August 30, August 31, September 1, September 2. 
And on the y-axis, I now have the wind speed in knots, going all the way up to 160. Now, to plot this data, I could go point by point and add it to this chart here. Maybe for the first one, I see this happened on August 30, 12:00 noon. The wind speed was 50. So if I look at this plot, I might look and see, well, August 30 and the wind speed being about 50 are right up here. Now, true to how ggplot works, let's go ahead and add a new layer to our plot here, one for these points we're adding. I'll go ahead and put this over my axes. And let me go ahead and draw this first point. I'll say August 30 corresponds to a wind speed of about 50. So I'll put it above August 30 and kind of beside, let's say, where 50 might be, right around here, let's say, for our first observation. On the next day, Anita strengthened, and it was about 75 knots on August 31. So I'll go ahead and add that here. I'll go over to August 31 and say it was about 75 knots. I'll go maybe just below 80 on my y-axis, somewhere right around there. And then on September 1, well, Anita blew about 90 knots here. So 90 would go well above September 1 and maybe between 80 and 120, so somewhere around there, let's say. September 2, Anita blew 120 knots. So let's go over and add that one. I'll put that one kind of right beside 120. Let me put that one right here. And let me lower this one just a little bit to be sure, make sure we're accurate here. And this, I think, represents the points that Anita would have as it grew in wind speed over time. Now, one thing I could do to make this even more apparent as change over time is maybe connect these dots with a line. And so I could very much do that by adding a new layer here to my plot. I'll add this on top. And I'll decide, well, I want to connect these points to show how Anita grew over time. I'll start by connecting these first two points here, this one between August 30 and August 31. I'll draw this line here. 
And now I've seen how Anita changed between August 30 and August 31. I'll do the same now for August 31 to September 1, just like this, and the same again for September 1 to September 2, just like this. And now, I argue, I'm better visualizing change over time. I'm seeing how this hurricane strengthened over the days that it occurred and how it grew to be a full-fledged hurricane. So let's see how we can make a plot a bit like this now using ggplot itself. I'll come back now to RStudio, and let me go ahead and open up this data file called anita.RData. And inside anita.RData is this table here, one called Anita, that tells me many observations of how Hurricane Anita grew. And here, see, I have more than the observations we saw on our slides. I have ones-- I have multiple for each day even. Let's see. On August 30, I have the midnight observation, the 6:00 AM observation, the 12:00 noon, the 6 o'clock observation. There's a lot of observations here in more detail of how Anita grew. But we still have those same columns, timestamp and wind. So let's use them now to visualize this data in terms of a plot. I'll start as I usually do, with our ggplot function, to give myself this blank canvas to work with. I'll then pass in as the first argument to ggplot the Anita data frame, just like this. And I'll assign these aesthetic mappings. I want to make sure that the timestamp column falls on the x-axis, and the wind column here falls on the y-axis. So I'll assign them in terms of these aesthetic mappings now. I'll say that x equals timestamp and y equals wind, just like this. And of course, given what I have now, I have my scales on both the x and the y-axis. But now I want to add in my points that I just made originally over here. I'll add a new layer, now syntactically in ggplot, with this plus sign. And I'll say geom_point. That's what I want to add as this first layer. 
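A small sketch of this first version of the hurricane plot. The column names `timestamp` and `wind` come from the lecture; the data frame name `anita`, the four-row sample, and the noon timestamps for September 1 and 2 are assumptions (the slides give only dates for those two readings).

```r
library(ggplot2)

# Four observations quoted in the lecture; the real data has several per day.
anita <- data.frame(
  timestamp = as.POSIXct(
    c("1977-08-30 12:00", "1977-08-31 12:00",
      "1977-09-01 12:00", "1977-09-02 12:00"),
    tz = "UTC"
  ),
  wind = c(50, 75, 90, 120)  # wind speed in knots
)

p <- ggplot(anita, aes(x = timestamp, y = wind)) +
  geom_point()   # one point per observation

p
```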
And now, with this more complete set of observations, we see exactly how Anita grew over the days that it was considered a storm, all the way from August 30 or so to September 3 or so. But we ideally want to connect these points, showing change over time. And thankfully, we do have another geometry we could use, one called geom_line. So let's do just that and use geom_line to connect these points. Well, similar to what we just did over in this demo table here by adding a new layer to our plot, I could do the same. I could have more than one geometry on my plot. I could have one for points and one afterwards for, let's say, lines, just like this. Geom_line draws lines between all of the data points that we have. So let me go ahead and run this here, and I'll see now all of these dots are connected by lines. And in fact, my plot has multiple layers similar to this one, one layer for these lines. One layer is our dots here. And the bottom layer is our aesthetic mappings and our blank plot here. So I'll go ahead and remake this plot right here. And we'll see the same over in ggplot to be sure. Now, let's spruce this up a little bit more. I want to maybe change the labels here, give it a title. So I'll do just that. I'll come back over to RStudio and add in a label layer. I'll say let me go ahead and add some labels and make sure the x-axis is-- maybe let's call it Date, like we did over here. And I'll go ahead and say the wind-- the wind column is called Wind, but we'll call it maybe Wind Speed in Knots, to be more specific. And then we'll go ahead and say the title is going to be Hurricane Anita to be clear what we're visualizing here. Let me rebuild my plot, and I'll see all of those labels now in place. So we're getting pretty far. But what else can we do to improve the design of this plot? Well, I'd argue that we could play around with some colors here and make sure it looks a little bit more-- a little more amusing, a little more interesting to look at. 
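Putting the two geometries and the labels together might look like this, using the same assumed four-row sample of the Anita data:

```r
library(ggplot2)

anita <- data.frame(
  timestamp = as.POSIXct(
    c("1977-08-30 12:00", "1977-08-31 12:00",
      "1977-09-01 12:00", "1977-09-02 12:00"),
    tz = "UTC"
  ),
  wind = c(50, 75, 90, 120)
)

p <- ggplot(anita, aes(x = timestamp, y = wind)) +
  geom_point() +   # first layer: the observations
  geom_line() +    # second layer: lines connecting them in order of x
  labs(
    x = "Date",
    y = "Wind Speed in Knots",
    title = "Hurricane Anita"
  )

p
```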
And one thing we could do is experiment with the color for these points. So we saw before that we might want to change colors of points. We can do so by setting this color aesthetic inside of our geom_point or geom_jitter. Both of those are drawing points for us. Inside, let's say, geom_point, I could go ahead and say I want to set this color to be equal to, well, the one I found is called deepskyblue4. I looked this up online. And it was a pretty cool R color because it symbolizes hurricanes, at least for me. I'll go ahead and reload this plot, and we'll see I now have these dots here, now colored too. But if you look a little bit closely, I'll see that this line is actually overlapping these dots. And I'm not sure that's what I want. I think what I really want is for this line to be behind these dots. And so similar to thinking of our plot in layers, it seems like I drew the points first underneath the line layer, and so the line, of course, will kind of overwrite or be on top of the points. If I want it vice versa, I should change the order of these here. So to the question of ordering earlier, if you want your geometries in certain order, one on top of the other, well, ordering does matter in that case. So let me switch now geom_line with geom_point. I'll come back over here. And I'll decide now to change this to have first the lines drawn, just like this, and then the points drawn on top of them. Let me rebuild my chart. And now, suddenly, we'll see that the lines are behind the points, and the points are now in front of them. So here we've seen our very first chart using both geometries, point and line. Let me ask, what questions do we have about how we visualize this hurricane and its growth so far? AUDIENCE: It's been formatted on the graph, but it is not well formatted in the data file. CARTER ZENKE: A good question about the format of your data. 
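Since layers are drawn in the order they are added, a sketch of the reordered plot would put geom_line before geom_point. The data frame is the same assumed sample as before.

```r
library(ggplot2)

anita <- data.frame(
  timestamp = as.POSIXct(
    c("1977-08-30 12:00", "1977-08-31 12:00",
      "1977-09-01 12:00", "1977-09-02 12:00"),
    tz = "UTC"
  ),
  wind = c(50, 75, 90, 120)
)

p <- ggplot(anita, aes(x = timestamp, y = wind)) +
  geom_line() +                          # drawn first, so it sits underneath
  geom_point(color = "deepskyblue4") +   # drawn second, so it sits on top
  labs(
    x = "Date",
    y = "Wind Speed in Knots",
    title = "Hurricane Anita"
  )

p
```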
And so it kind of harkens back to last time, where we learned about clean data, making sure our data is clean before we actually can visualize it here. The reason I'm able to visualize this plot is because I have my data in a certain order. I have each individual column that I can then map to individual axes here. If my data were not in that shape or in that format, I couldn't do what I'm doing here. So it is important to make sure your data is in the right format in order for you to visualize it in the way you want to visualize it. But more on that, actually, last time, when we saw how to clean data as well. Well, let's keep going and making our plot more visually interesting. And we've seen now these points, but what about these lines? How could we change how they look? Well, it turns out that geom_line, or this line geometry, has its own aesthetics we can play with too. And I'll show you two of them, in particular. Let's come back now to RStudio and play around with a few of these aesthetics. One of them is called linetype, and one of them is called linewidth. And kind of true to their name, linetype changes the type of line, let's say whether it's solid or dashed, for instance. And linewidth changes how wide this line might look. So I'll come back now to RStudio. Let me try to adjust the style of this line. Well, maybe I'll first play around with the type of line. And I can probably do so by specifying some number, much like we did for shape. There are a few different line types, among them the solid one that we see here, dashed, dot dash, and so on. Let's just see what a few of them look like. Here, linetype 1, if I build this, well, linetype 1 is exactly what we already have. Linetype 2, what's that? We'll visualize this. And now we'll see kind of a dashed line. I'll see that there are some translucent parts to this line. And I see dashes now between each individual little dot and really on the line in general. 
Let's try now maybe linetype 3 and see what that looks like. Linetype 3 seems to be, well, more so dotted. I see not full dashes in my line but now individual dots separated by spaces. So there are many line types. I encourage you to play around with them, look in the reference, and see what kinds of types of lines you can create with ggplot. Now, one other one we saw earlier was linewidth, how wide or thick should this line be. Let's play around with that too. I'll come back now to RStudio. And let me change my linetype back to 1. I kind of like that one the most. And I'll use linewidth now, where linewidth might be, well, let's just try 1 and see what that gives us. I'll hit Enter here. And my line is a little bit thicker than I'd say it was before. It is pretty thick here. I can see it kind of connecting these points still. I can make it even more thick or probably, I'd argue, I want it to be a little bit thinner. So I'll make it smaller than 1, maybe something like 0.5 or so. I'll go back over to linewidth and say I want it to be 0.5 in size. And I'll see this seems more reasonable in terms of a width for my lines. And I encourage you to experiment with these line widths and see which one actually best represents your data. So here, we've played around with these line geometries, but we could probably still improve the chart a little bit more. I think I want these dots to be a little bit bigger. And we saw how to do this just last time. I could specify in geom_point a size for each of those points, not just a color. But also I want to say the size of these is just a little bit bigger than usual, maybe a 2 instead. And now, just barely, these dots are a little bit bigger, and I think we're now seeing our data just a little bit better. I'd say this looks pretty good. Now, let's go ahead and add in our theme here. We can go to the bottom and say I've added in all of my geometries, my points, my labels. 
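The layering and styling steps described so far can be sketched roughly as follows. This is a minimal sketch rather than the lecture's actual file: the `storm` data frame, its `timestamp` and `wind` column names, and every value in it are made up for illustration. It also assumes a version of ggplot2 recent enough to support `linewidth` (3.4.0 or later).

```r
library(ggplot2)

# Hypothetical stand-in data: column names and readings are invented
storm <- data.frame(
  timestamp = as.POSIXct("1977-08-29 00:00", tz = "UTC") + 3600 * 6 * 0:7,
  wind = c(25, 30, 40, 55, 70, 95, 120, 150)
)

p <- ggplot(storm, aes(x = timestamp, y = wind)) +
  # The line layer is added first, so it is drawn underneath
  geom_line(linetype = 1, linewidth = 0.5) +
  # The point layer is added second, so the points sit on top of the line
  geom_point(color = "deepskyblue4", size = 2)

print(p)
```

Swapping the order of `geom_line` and `geom_point` would put the line back on top of the points, and changing `linetype` to 2 or 3 would give the dashed or dotted lines mentioned above.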
Let me go ahead and say I want this classic theme to tidy things up here. And now I'll see my chart as I'm pretty sure I want it to be. Now, this is pretty good as a chart. But I think there's more we could do to it. One thing we could do is figure out when exactly Hurricane Anita became a hurricane. In fact, hurricanes start as these lesser storms known as tropical depressions or tropical storms. And they grow to be hurricanes. Well, here, we don't quite have a sense of exactly when Hurricane Anita became a hurricane. But it turns out that hurricanes are considered hurricanes when they reach a wind speed of 65 knots. So it seems like any dots that are above this 65 mark on my y-axis, well, that indicates when Anita was a full hurricane. So if I want to add not just data points but some arbitrary line, I could do that just as well here too. Effectively, what I would do is add a new layer now to my plot. I'll do that on our visual here. I'll take this plot. And why don't I draw a line indicating when this hurricane became a hurricane? And we know it does so when the wind speed is greater than or equal to 65 knots. So I could draw maybe somewhere between 40 and 80, right around here or so, a line, maybe a dotted line saying that above this line, Hurricane Anita was, in fact, a hurricane. Now, what I just did by adding a new layer to my plot, I can do the same in ggplot. And I'll use this geometry called an hline for a horizontal line. We also have a vline for a vertical line. But here we'll focus on this horizontal line here. Let's come back now to RStudio and see what it would look like to add this hline, this horizontal line. Well, I probably want it to come after I add in my lines and my points. I'll go ahead and add this layer after I specify my points here. And I can do so by using geom_hline for this horizontal line I want to add to my chart. I'll finish this off with a plus. I'll make sure to add my labels later on. 
And now, there are a few parameters I can specify in terms of this hline. One I can specify is still the line type. It is a line, so it has that same aesthetic of a line type. I could change that for hline, perhaps to this dotted one we saw earlier, which was linetype 3. But let me go ahead and try to visualize this. I'll go ahead and run. And I'll see I actually get a warning or really an error: geom_hline requires the following missing aesthetics, yintercept. What is a y-intercept? Well, as the name kind of implies, it is the place that this line intercepts or crosses with this y-axis. In our case, we said that whenever a storm gets to 65 knots or higher, that means it is a hurricane. So it seems like the place that this line intercepts the y-axis is, well, 65 knots. So I could as a parameter to geom_hline say that the y-intercept should be 65, meaning 65 knots. I'll come back now to RStudio and do exactly that. Let me go ahead and say that the yintercept, the yintercept of this line, should be 65, exactly that. And then I'll go ahead and say-- let me run this top to bottom. And now I have a pretty neat graph. I see the evolution of Hurricane Anita. I see what days it was considered a full-fledged hurricane. And I also see what days it was not a hurricane but a tropical storm instead. So we've gone from an empty plot to adding in many geometries. We've added in dots and lines and horizontal lines and so on. We've added in our aesthetics in terms of x and y and color and so on. We've added in our labels and our themes too, making this final plot. What questions do we have on what we've done so far? AUDIENCE: Hi. I was asking if it is possible that we do the colorblind and color visualization? Like anything above the 65, we change the color or maybe change the line color or something. CARTER ZENKE: I really like the way you're thinking. Yeah, so you're considering, what can we do to make this plot more colorblind friendly? 
And that's a really good consideration, when you're working with different colors, having those colors mean something distinct in your data. I would argue that in this data set or this plot, the colors are more aesthetically pleasing. They aren't really showing me information in this plot. But if I wanted to, I could choose colors that are friendly to those who are colorblind. And actually, a good thing I could do, if I use multiple colors beyond just, in this case, black and blue, is choose colors that are distinguishable to those who might have some form of color blindness. And for that, I could actually look up what colors I might be able to use to make sure that that works for various forms of color blindness. The viridis scale might be a good place to start. But here, because I just have black and blue, I'd argue that this is going to be good enough for now. But a great question here. OK, so we've seen now all in all today how to visualize groups of data and values associated with them. We've seen how to visualize relationships in two different columns in our data. And we've also seen how to visualize data over time. When we come back, we'll see how to actually test our programs to make sure they work as intended. But more on that next time. We'll see you later on.
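Putting the pieces together, the finished plot might look roughly like the sketch below. The `storm` data frame, its column names, its values, and the label text are all invented for illustration; only the layers themselves (a line, points, a dotted horizontal line at 65 knots, labels, and the classic theme) come from the walkthrough. The second plot is one possible take on the audience question and was not shown in the lecture: it assumes a hypothetical `status` column and maps point color to it using ggplot2's discrete viridis scale.

```r
library(ggplot2)

# Hypothetical stand-in data: column names and readings are invented
storm <- data.frame(
  timestamp = as.POSIXct("1977-08-29 00:00", tz = "UTC") + 3600 * 6 * 0:7,
  wind = c(25, 30, 40, 55, 70, 95, 120, 150)
)

# Plot 1: the layers described in the walkthrough, in drawing order
finished <- ggplot(storm, aes(x = timestamp, y = wind)) +
  geom_line(linetype = 1, linewidth = 0.5) +
  geom_point(color = "deepskyblue4", size = 2) +
  # yintercept is required: it is where the line crosses the y-axis
  geom_hline(yintercept = 65, linetype = 3) +
  labs(x = "Date", y = "Wind speed (knots)", title = "Storm wind speed (made-up data)") +
  theme_classic()

# Plot 2 (an assumption, not from the lecture): color points by whether
# the reading is at or above the 65-knot threshold used in the lecture
storm$status <- ifelse(storm$wind >= 65, "Hurricane", "Below hurricane strength")

by_status <- ggplot(storm, aes(x = timestamp, y = wind)) +
  geom_line(linewidth = 0.5) +
  geom_point(aes(color = status), size = 2) +
  geom_hline(yintercept = 65, linetype = 3) +
  # viridis palettes are designed to stay distinguishable under
  # common forms of color blindness
  scale_color_viridis_d(end = 0.8) +
  labs(x = "Date", y = "Wind speed (knots)", color = "Status") +
  theme_classic()

print(finished)
print(by_status)
```

In the second plot, color sits inside `aes()` because it now depends on the data, whereas in the first plot it is a fixed value set directly on `geom_point`.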