[MUSIC PLAYING] DAVID MALAN: All right, this is CS50. This is the end of week eight. And today, we start to fill in some pieces when it comes to building things on the web. So, recall that on Monday we spend a lot more time on PHP, which is this dynamic programming language that lets us output, among other things, HTML and other such content that we'll want to see. But we haven't really looked at how we're going to store any information. Indeed, almost any of that the super interesting websites you visit today have some kind of database on the back end, right? Facebook certainly stores lots of data about all of us and Gmail stores all of your emails. And so, many other sites aren't just static content that's informational. It's actually dynamic in some way. You provide input, it updates the pages for other people. You get messages, you send messages, and so forth. So today, we look closer at the underpinnings of a project that you'll dive into next week, CS50 Finance, which is actually going to have you build something not in C, but in PHP. A website that looks a little something like this that allows to buy and sell stocks that are actually going to draw upon real time stock data from Yahoo Finance. And so ultimately, you'll have the illusion for yourself and for users that you're actually buying and selling stocks and getting nearly real time updates, managing a portfolio, all of which is going to require having, ultimately, a database of users. So, in your own words, especially if you're not super familiar with computer science or databases, what do you know a database to be right now, in nontechnical terms? What is it? How would you describe it to a roommate or a friend? AUDIENCE: [INAUDIBLE] information [INAUDIBLE] DAVID MALAN: So, a list of information, or a store-- a list of information that you might want to store about something, like a user. And what do users have associated with them? If you're a user on Facebook or Gmail, what are the characteristics that all of us users have? Like, what might be some of the columns in the spreadsheet to which we alluded last time? Because again, you can think of a database really as a fancy Excel file or Google Spreadsheet or Apple Numbers file. So, what do you think of when you think of a user? What do they have? What's that? AUDIENCE: A name. DAVID MALAN: A name. So if name, like, David Malan would be the name of some user. What else does a user have? AUDIENCE: An ID. DAVID MALAN: An ID. So, like an ID number, like your Harvard ID or your Yale Net ID or the like. What else might a user have? AUDIENCE: Password. DAVID MALAN: A password, maybe an address, maybe a phone number, maybe an email address. So, there's bunches of fields and this could sort of spiral out of control quickly as soon as you start realizing, oh, let's store this and let's store this and that. But how do we actually do that? So again, the mental model to have for today as we dive into actual SQL, Structured Query Language, is a database that looks like this. It's just rows and columns. And you can imagine Google Spreadsheets or any number of other programs. But what's key about MySQL, which is the database software we're going to use, the freely openly available-- Facebook uses it and any number of other websites-- database stores things relationally. And a relational database just means one that literally stores its data in rows and columns. It's as simple as that. So, even something like Oracle that you might have generally heard of is a relational database. And underneath the hood, it stores data in rows and columns. And Oracle charges you a lot of money to do that, whereas MySQL charges you nothing for the same. So, SQL is going to give us at least four operations. The ability to select data, like read data, insert, delete, and update data. In other words, those are really the four key operations that are going to allow us to change stuff in those rows and columns. The tool that we'll use today especially to learn SQL and to play with it is again called PHP MyAdmin. It's web based tool. Total coincidence that it's written in PHP. But it's going to give us a graphical user interface so that we can actually create these rows and columns and then talk to them via code. So, let's now start to what I think is frankly kind of the fun process of building the back end of websites, the parts that users don't see, but surely do care about, because that's rather data's going. So, similar to C and a little less like PHP, SQL, or a database that supports SQL, has at least these data types and bunches of others. CHAR, VARCHAR, INT, BIGINT, DECIMAL, and DATETIME. And there's a whole bunch of other features, but let's do this by way of actual example. I'm going to go into CS50 IDE where, in advance, I've logged in and I've also visited a URL for this tool called PHP MyAdmin. And in problem set seven, we'll tell you exactly how to get to this interface as well. On the top left hand corner, notice it says lecture. And that just means that in advance, I created an empty database called lecture that has no spreadsheets in it yet. There's no rows and columns. Because the first thing we're going to do is start to create a table that's going to store our users. So, literally over here to the right, I'm going to tell the database I want a table called Users. So, this is like the file that I want to store all of my data in. And how many columns? Well, let's keep it simple for now. I just want to store like a username and a name for a user. We'll start small. So, I want two columns total. And I'm going to go ahead and click Go. And then, for these columns, what I'm going to do-- if this internet cooperates-- all right, so we're going to try that again. I'm going to create a table called Users with two columns, click Go, OK. Now we've got it really fast. Thank you, very well done. All right, so what do we want these columns to be called? So, one's going to be called Username. So, all I see here-- and the interface frankly gets a little ugly eventually, once you start typing in all this data. But what's nice is that sort of paradoxically, I'm creating columns, but the tool has foolishly laid them out in rows so that I can configure these columns. So, there's two blanks there under Name. And one of these fields I want to called Username, and the other field I want to call Name. And now I have to choose data types for these things. So, whereas in Excel and Google Spreadsheets, if you want a column, you literally just type Name or Username, hit Enter. Maybe you make it bold face just for clarity, but that's it. You don't specify the types of the columns. Now in Google Spreadsheets or Excel, you might specify how the data is rendered. You could go to the Format menu, and you can specify show this like dollar sign, show this like a floating point value. So, it's similar in spirit to that what we're about to do, but this is actually going to force the data to be a certain type. Now, even though a moment ago I said there's only a few data types, there's actually a whole lot, and they're in varying degrees of specificity. And as an aside, you can even do fancy things like storage geometries inside of a database. You can store things like GPS coordinates and actually find, mathematically, points that are close to others. But we're going to keep this super simple and go up to here, all of the so-called string types. So, here's a list of a whole bunch of options. CHAR, VARCHAR, TINYTEXT, MEDIUMTEXT, LONGTEXT. And it's kind of overwhelming. And unfortunately, somewhat paradoxically to C, a CHAR is not really a CHAR. If you specify in a database that your data type is CHAR, that means that yes, it's a CHAR, but it's one or more CHARs. And you have to specify how many CHARs you want. So, what's a typical length for a username? Is there a limit typically? AUDIENCE: [INAUDIBLE] DAVID MALAN: 16 maybe? Something like that. You know, back in the day, it used to be eight. Sometimes it's 16, sometimes it's even more than that. And so, this doesn't mean give me one CHAR. This means I have to specify the length of the field, and now I might say something like 16. And there's a trade off here. So, we'll see in a moment that this means one, every username has to be 16 characters. But wait a minute, M-A-L-A-N. If that's my username and I'm only using five, what would you propose that the database to do for the other 11 characters that I've reserved space for? What would you do? AUDIENCE: [INAUDIBLE] DAVID MALAN: Yeah, just make them all null. Make them spaces. But probably null, so a lot of backslash zeros. So, on the one hand, we've now made sure that my username can be no more than 16 characters. And the flip side of that is that if I had a really long name or wanted a really long username like some of you guys might have in that college or at Yale.edu, you can't have one. And so in fact, if you've ever registered for a website and you get yelled at saying your password's too long or your username's too long, it's simply because a programmer, when configuring his or her database, decided that this field will be no longer than this length. All right, so what if we proceed to name? How long should a typical human's name be? How many characters, 16? I'm guessing we could find someone in this room where by his or her first plus last name is longer than 16 characters. So, what's better than that, 17? 18? 25? Bigger? 30? AUDIENCE: [INAUDIBLE] DAVID MALAN: 5,000, oh my God. So, that's probably a decent upper bound, shall we say. And here we kind of have to make a judgment call. Like, there's no right answer here. Infinite is not quite possible, because we're eventually going to have-- we're going to run out of memory. So, we have to make a judgment call at some point. Very common would be, for instance, to use-- and let me specify CHAR here as before-- 255 was literally the upper limit on this database software years ago. And so, a lot of humans would just say, fine. 255's the limit. Let's just use the maximum. And this is pretty ridiculous. Like, if you're typing someone's name for 200 plus characters, that a little ridiculous. But, remember that ASCII is not the only system for characters. And so, especially in a lot of Asian languages where there's characters we can't express on keyboards like my US keyboard, some characters actually take up 16 bits instead of eight bits. And so, this actually isn't all that unreasonable that we need more space if we want to fit larger characters than the very US centric ones we've tended to discuss. So, we need some upper bound. I don't know what the best one is, but 255 is generally a common one. 25 feels low. 16, 32 feel low. I would err on the side of something higher. But there's a trade off, as always. What is the, perhaps, obvious trade off of reserving 255 CHARs for everyone's name in my database? AUDIENCE: [INAUDIBLE] DAVID MALAN: What's that? AUDIENCE: [INAUDIBLE] DAVID MALAN: It's a lot of memory, right? M-A-L-A-N. I have just wasted 250 characters just to store my name defensively, just in case someone in the class has a really long name. That seems like an undue tradeoff. So, it turns out that SQL, this database language, actually supports something called VARCHAR, or Variable CHAR. And this is kind of nice in that this allows you to specify not a fixed width, but rather, a variable width. And more specifically, a maximum width of the field. So, this means that a name can be no more than 250 characters, but it can certainly be fewer. And the database is going to be smart. If you do put in M-A-L-A-N, it's only going to use five, maybe six bytes for like a trailing null character, and not spend an additional 249 or 250 bytes unnecessarily. So, this seems like I should have started with this story. But there's always a tradeoff. So, on the one hand, a username I've specified to be hard coded at 16, and maybe that wasn't the right call, maybe it is, but why not use VARCHARs for everything? It exists for a reason. Why not use VARCHARs for every field whose length you don't know in advance if it seems to be a great thing, right? Use only as much space as you need up to this limit? AUDIENCE: Slower. DAVID MALAN: Speller? AUDIENCE: Makes it slower? DAVID MALAN: Oh, it's slower. Good, that's almost always the answer, frankly. Like, what's the tradeoff? It either costs more space or it costs more time. So, in this case, it might be slower. Why? AUDIENCE: [INAUDIBLE] determining [INAUDIBLE]. DAVID MALAN: Good. So, you might recall from even PSED5, playing with your approach to the dictionary, if you have to allocate memory dynamically or keep growing a buffer, that can actually be slow. If you have to call malloc underneath the hood and maybe that's what MySQL is doing, so surely that could be the case. And if you think way back to PSet-- or even weeks two, when we did things like binary search or even linear search, one of the nice things about every word in a database or every word in a column being exactly the same length, even if a whole bunch of those characters are blank, is that you can use random access on your data, right? If you know that every word is 16 characters away, you can use pointer arithmetic, so to speak, and go to us 16, 32, 48, 64, and you can just jump instantly using arithmetic to any of the words in your database. Whereas if it's a VARCHAR, what do you instead have to do? [PHONE RINGING] If it's a VARCHAR, you can't use random access. What you have to look for or do? Yeah? AUDIENCE: [INAUDIBLE] DAVID MALAN: Look through the whole-- trace through the whole list looking for what, most likely? What kind of special value? AUDIENCE: [INAUDIBLE] DAVID MALAN: Looking for the null terminators that demarcate the separation of words. So again, a tradeoff, and there's no right answer. But this is where, especially when your users get to be many and your load on your servers, the number of people using it gets high, these are actually nontrivial decisions. So, we can leave these as this, but let's scroll down over to the right here. Now, there's a couple of columns where we have to make a judgment call. Does it make sense to allow a user's name, a user's username or a user's name, to be null? That is, just blank. Feels a little nonsensical, so I'm not going to check those boxes. But it turns out in a database, you can say, someone can optionally have this value. This column does not have to actually be there. Now, there's this drop down menu. And notice I'm still in the first row there, so I'm talking about username now. And it turns out that a database, unlike a simple mere spreadsheet, has powerful features called indexes. And an index is a way of telling the database in advance that I the human am smarter than you. I know what kinds of queries, select or insert or delete or update, that my code is going to end up doing on this database. I want to read a lot of data. I want to insert a lot of data. I want to constantly delete a lot of data. If I know that I'm going to be accessing a field like Username a lot, I can preemptively tell the database, I know more than you, and I want to decree that you should index this field. Where indexing a field or a column means that the database in advance should borrow some ideas from, like, week four and five and six from CS50 and actually build up something like a binary search tree or something generally called a B tree that you would learn in a class like CS124 at Harvard, an algorithms class, or any number of other places. The database and the smart people who implemented it will figure out how to store that table of information in memory so that searches and other operations are super fast. You don't have to do it. You don't have to implement linear search or binary search or merge sort or selection sort, any of that. The database does it for you if you tell it preemptively to index this field. And you can see too, there's some other characteristics we can tell the database to enforce. What might it mean if I choose Unique from this menu, just intuitively? Yeah? AUDIENCE: [INAUDIBLE] DAVID MALAN: Yeah, the username has to be unique. Is this a good thing or a bad thing for a database, for a website with users? Should usernames be unique? Yeah, probably. If that's what the field we use to log in, you don't really want to people having the same feel or the same username. So, we can have the database enforce that so that now in my PHP code or any language, I don't have to, for instance, check necessarily does this username exist before I let someone register? The database won't let two people named David or Malans register in this case. And as an aside, even though this menu only lets you select one, a unique index is one that's indexed for super fast performance, but it also enforces uniqueness. And we'll come back to what the other two mean in just a moment. Meanwhile, if I go to my second row, which is the user's name, should I specify that the name should be unique? No, because you could certainly have-- there's no two David Malans in this room, most likely. But if we choose a different name, we might surely have collisions. Think back to hash tables and the like. So, we certainly don't want to make the name field unique. So, we're just going to leave that as dash, dash, dash, nothing. And I'm going to leave everything else alone. Indeed, most of these fields we won't have to care about. And when I'm ready to save this, if the internet cooperates, I click Save, and very, very, very slowly does the database get saved. And now I'm back to this interface, which admittedly, is overwhelming at first glance. But all I'm going to do is click on the word Users at top left. I'm going to go up here, click Users, and by default, it has executed some SQL, but more on that in a moment. Here's just a summary of what I did. And not to worry that you see mention of Latin and Swedish here. Those are just the default settings, because MySQL originally, or PHP MyAdmin, one of the two happened to be written by some Swedish people. But it's irrelevant in our case here. All right, so why is this all interesting? It turns out, I can insert data into a database by writing code. And I'm go ahead and in my file here, I'm going to go ahead and pretend like this is wired to that database, which it isn't at the moment, but it will be when we get to problem set seven. And I'm going to go ahead and execute a function called query, which we will give you in problem set seven's distribution code, that takes at least one argument, which is just a string. A string of SQL code. So, you're about to learn how to write Structured Query Language. If I want to insert a new row into my database because someone has submitted a form to my code, I would literally write INSERT INTO users the following fields: username, comma, name, the VALUES, and now I need to insert something like Malan, and quote, unquote 'David Malan.' And now even for those unfamiliar with SQL, why am I using single quotes inside of this green string? What might be the reason here? Notice I'm co-mingling two languages. Query is a PHP function, but it takes an argument. And that argument has to itself be written in another language called SQL, Structured Query Language. So, everything that I have just highlighted here is this language called SQL. So, what's with the single quotes, just as a quick sanity check? Go ahead. They're strings. So, quote, unquote Malan and quote, unquote David Malan are strings. And just thinking intuitively now, knowing what you know about C and PHP, why did I not do this, which I usually used double quotes for strings? Why did I not want to do that? Yeah? AUDIENCE: [INAUDIBLE] DAVID MALAN: Exactly. Because I'm already using double quotes on the way outside of the argument to the PHP function, I would just confuse the interpreter. It won't know, do these go together? Do these go together? Do these go together? So, I alternate instead. Or I could do something like this, backslash quote or backslash quote. Frankly, that just starts to get very unreadable and ugly. But that would achieve the same result as well. So, if I were to execute this query now, let's see what happens. I'm going to go ahead now and rather than execute the PHP code, which is where you'll play in problem set seven, I'm going to instead go to PHP MyAdmin. And I'm manually going to go to the SQL tab, and let me zoom in on the interface. And I'm going to paste in the thing I just typed. And the color coding has changed a little bit now, just because the program formats things a little differently. But notice that all I've done is I've said, insert into Users. I've specified, then, in a comma separated parenthesized list the two fields that I want to insert, and then I've literally said values followed by another paren, and then the two values I want to plug-in, and now for good measure, I'll put a semicolon at the end. So, this is not C. This is not PHP. This is now SQL, and I'm pasting it into this web based interface that's just going to let me, as soon as I click Go, execute this query on the database running inside of CS50 IDE. So this is good. Notice that said one row inserted, went super fast, 0.0054 seconds to insert that data. So, that sounds pretty healthy. It reformatted my query for me here just to see it in sort of color coded version. But now if I click Browse, notice that, even though there's a lot of clutter on the screen, my table now has two rows. So, let me go ahead and do another. Instead of this, let me go to the SQL tab again. And this time I'll insert something like Rob and his name will be Rob Bowden. Bowden. Let's click Save. Oops, rather Go. Click Browse again, and now notice I have two rows. So, this is just a way more complex way of opening up Google Spreadsheets and just typing a row into a column. But what's key is that we now have the syntax with which to write code so that ultimately, we could actually do some and this. Recall that PHP supports super global variables. What is inside of dollar sign underscore GET in PHP? We took a look at one or two simple examples. And in PSet6, recall you have hello dot PHP which uses this variable. What goes in there? Or what is it? A little louder. AUDIENCE: [INAUDIBLE] DAVID MALAN: It's a snow seed of array, which is just a fancy way of saying an array that has key value pairs. And the keys aren't numeric. They're words or strings. And specifically, what are those key value pairs? Where do they come from? Sorry? AUDIENCE: [INAUDIBLE] DAVID MALAN: No? Where do those key value pairs come from? Say again? Again? Am I the only one hearing something? [LAUGHTER] That's right, yes? AUDIENCE: [INAUDIBLE] DAVID MALAN: Yeah, they come from the query string. So, if you rewind in time to when we've played with Google and we've gone to Google.com slash search question mark q equals cats, if I were to hit Enter and if Google were implemented in PHP, PHP code that Google wrote would have access to dollar sign underscore GET inside of which is a key called Q and a value called cats that it can then use used to do an actual search with. So, in fact, what I'm going to do now is go back to my PHP code that you'll again see more of in PSet7. And instead of plugging in hard coded values which doesn't seem like a very dynamic website, I'm going to give you a teaser of what your actual code would do. You would put in two question marks like this. I don't know what the username is. I don't know what the name is going to be, but I do know I can get them dynamically. So, if the code we're writing now is the code running on Google's servers, or if this is hello dot PHP, which comes with PSet6, I'm going to pass into the query function just like printf, two other arguments. GET, quote, unquote username, and GET, quote, unquote name. And now, notice what the general structure is here. I've got on the left hand side of the call, this function called query in PHP. I still have as a first argument, just a string of text. But that string of text is written in a language called SQL. And frankly, it's not a big language. We're only going to talk about it formally today, really. And then in problem set seven, there's relatively few features that we're going to leverage. The question marks, though, mean plug in a value here and plug in another value here. And notice, I've omitted what from around the quote-- damn it-- around the quotation marks this time. I've omitted the quotation marks around the question mark, sorry, this time around. So, what's nice about this question mark feature which PHP tends to support, Ruby and Python and other languages, this just means plug in some value here and you know what? You figure out whether to use single quotes or double quotes. Don't bother me with those intellectually uninteresting details. But, make sure it's correct so that my code is ultimately operational and safe, which will have a meaning before long. Now, how many arguments total, just to be clear, is the query function taking? Anyone want to vote for more than two? Three? Sure, why? Why three? AUDIENCE: [INAUDIBLE] DAVID MALAN: Exactly. The first part is the string. The second argument is dollar sign underscore GET bracket username. And the third argument is the same thing, but just the name. So in other words, now if I had a web form that had to text fields, one for the user's username, one for his or her name, just like you would see in a website when you register for some website, this might be the code on the back end that actually does the insertion now into the database. Now by contrast, let's fast forward. Suppose a user is now logging in and you want to write PHP code that checks whether the person who's just logged in is actually a user, you can use pretty simple syntax. You can say SELECT, let's say star, where star means everything. I don't know what I want, so just give me all the columns from the table called users where, and this is nice. Select supports what's called a predicate, which is like a way of qualifying what you want. Where username equals quote, unquote Malan. So here too, I've embedded inside the argument to a PHP function, a line of SQL code. And that SQL code this time is literally going to search for quote, unquote Malan. Now that's not all that useful, so I'm going to skip that and I'm going to put away this tip from Brady, and go and plug-in instead a question mark here. So, just to be clear, what should my second argument be if someone has just logged in and I want to check if he or she is actually a user? AUDIENCE: [INAUDIBLE] DAVID MALAN: Yeah. I hear dollar sign underscore GET quote, unquote username. And that should return to me any of the rows in my database that have a username of Malan. Now hopefully, I'm going to get back zero if Malan's never been here, or one if he has. I shouldn't get back two or three or four. Why? AUDIENCE: [INAUDIBLE] DAVID MALAN: I said unique, right? Simple reason. Because I said it's got to be unique, just logically, you can only have zero or one Malans in this particular database table. Now as an aside, just so you've seen it, even though I keep using GET and even though PSet6 only used GET, you can certainly have POST. And recall that Post is another technique for submitting information from a form, but it doesn't show up in the URL. It's a little more secure certainly for things like usernames and passwords, which PSet7 will, in fact, involve. So, let's do this in PHP MyAdmin and see what happens. I'm going to go to MySQL tab. And notice that the default value for PHP MyAdmin, just to try to be helpful, is to select star from users where one. Well, one is always true, so this has the silly effective of just select everything. But I'm going to be a little more pedantic and manually type out SELECT star FROM users. Now technically, you can quote the name of the tables. It's rare that you have to, but notice these are not your normal quotes on the US keyboard. This is the so-called backtick, which is generally on the top left hand corner of your keyboard. But it's rare that you'll actually need to bother with that, so I'll just omit them anyway. So now, let me go ahead and hit go. And how many rows should I get back when I select star from users? AUDIENCE: [INAUDIBLE] DAVID MALAN: The number of rows, sure. But how many in this concrete story right now? Two, because there was me and there was Rob. So, if I click Go, I see visually that I've gotten back, indeed, two rows. There's a lot of clutter on the screen, but I only see two rows. By contrast, if I do this again and do SELECT star FROM users, where username equals quote, unquote Malan, now if I click Go, I'm only going to get back one row. And lastly, if I do something like this, suppose that I don't care about getting everything, which is kind of meaningless now, because there's only two columns. It's not like I'm selecting a huge amount of data. Suppose I go ahead and do SELECT name FROM users, where username equals Malan, what's nice about SQL honestly, is that it really just does what you tell it to do. It's pretty succinct, but you literally just tell it what you want to do. Select name from users where the username equals Malan. And it really is that explicit. So, now if I hit Go, how many rows am I going to get back? One, because it's just Malan, hopefully. Or zero if he's not there, but one maximally. And how many columns will I get back? How many columns? This time, I'm just going to get one because I didn't select star, which is everything. Now I'm selecting just name, so I just get back one column and one row. And it looks sort of appropriately ridiculous, just looking super small like this. So, what's really happening? When you execute a SQL query using select, what you're getting back from the database is like a temporary table with rows and columns, maybe, but that omit anything that wasn't actually selected by you. So, it's like if someone had a big spreadsheet of all the students registered for some student group, and you say, give me all of the freshman who've registered for our student group, what your colleague in the student group might do is they could just hand you the whole spreadsheet. That's like saying select star. And it's a little annoying if you only wanted the freshman. And so, if you instead said, select star from database table where year equals quote, unquote freshman, it's as though your friend in the student group literally highlighted and copied only the freshman rows, pasted them into a new Google Spreadsheet or an Excel file, and handed you back the resulting file only. That's all that's going on conceptually here. So in the end, we can do some pretty fancy things by storing things like usernames and passwords and the like. But, it turns out, we should do a little differently than this. It's not that smart to only store a username and a password. Someone earlier, I think down here, suggested an ID. Now an ID could be like a Harvard ID or Yale's Net ID, but it could be even simpler in our database case. And indeed, the common case is to have another column. And I'm going to go ahead and edit my table. And if you play around with this interface for PSet7, you'll see that you can check this button here and add a field at the beginning of the table. And now if I click Go, it's going to give me one of those forms from earlier. I'm going to add a field called ID. And I'm going to make it a numeric type. I have a whole bunch of values for numerics. I'm just going to choose an INT and not worry about the disparate sizes. I don't have to specify a length or a value, because it's going to be 32 bits no matter what. Attributes, we didn't see before. Any interest in any of these menu options this time? For an INT? What did you propose? No? Do any of these make sense? Yeah. Yeah, unsigned, right? Generally, if we're going to give everyone a unique number, which is where this story is going, I really only want a person to have the number like zero and one and two and three and four. I don't need to deal with negative numbers. It just seems like undue complexity. I want four billion possible values, not four billion possible values, so I just doubled the capacity of my INT. As an aside, if you want to relate this to something like Facebook, back in sort of my day when Facebook first came out, I believe what they were using in their MySQL database to store a user's identifier, was just an INT. But of course, there's a lot of real people in the world. There's a lot of fake Facebook accounts in the world. And so eventually, Facebook overflowed the size of an INT, a four billion value. Which is why, if you look around and there's websites that can tell you what your unique ID is. And if you never chose a username in Facebook, you'll see your unique ID. I think it's profile dot PHP question mark ID equals something. That is now something like a big INT, or a long long if you will, which is a 64-bit value or something comparable. So, even in the real world do these issues ultimately sometimes matter. And it turns out here, if I'm giving all of my users a unique ID, I want to be super explicit and minimally make this field unique. But it turns out there's one piece of nomenclature today too that's a primary key. If you're designing a database table and you know in advance that one of the columns in that table should and will uniquely identify rows in the table, you want to specify it and tell the database, this is my primary key. There might be duplicates in other fields, but I'm telling the database that this is my primary, my most important field, that's guaranteed to be unique. Now, this seems redundant. I am now proposing that we add, by clicking Save here, a field called-- and I'm going to go ahead and click AI, we'll come back to that in a moment, Save. I am proposing now that my table look like this. I have an INT field called ID, a CHAR field called Username, a VARCHAR field called Name, but ID, if it's primary and therefore unique, why did I just waste time introducing what effectively is a second unique field called ID that's an INT? Username, recall, was already unique, we said. So just logically, you don't need any database experience to reason through this, why might I have introduced an INT as my unique identifier as well? What's this-- say again? AUDIENCE: [INAUDIBLE] DAVID MALAN: Random access is easier, why? AUDIENCE: [INAUDIBLE] DAVID MALAN: Yeah, it's just accessing numbers. So, if you think of this truly is a table, like an array, now I have unique identifiers that I can jump around. And better than that still is that how big is an INT going to be again? 32 bits or four bytes. How big is my username going to be? Maximally? 16 bytes. So, if you are really caring about the performance of your code, think back to PSet5, would you prefer to search for a four byte value or a 16 byte value, right? It really is as simple as that. You have to do four times as much work to search for usernames because those are 16 bytes. So, you have to literally compare all 16 bytes to be sure yes, this is username I want. Whereas for an INT, you can do it with just four bytes. And as an aside for those interested in computer hardware, it turns out you can fit something like an INT or a 32-bit value in something called a register in a computer CPU, which means it's super, super fast, even at the lowest level of the computer's hardware. So, there's just advantages all around. So, what does this mean? In fact, when you're designing a database table, almost all of the time are you going to have not only the data you care about, but also something like a unique identifier because this is going to let us do other things. And let's trip over one problem here. Suppose that users have not just usernames and names, but they also have things like cities and states and zip codes, at least here in the US. So, I'm going to go ahead and just quickly say, give me three more columns at the end of the table. And this is going to be City, this is going to be State, and this is going to be Zip. Now City, what data types should this be, perhaps? VARCHAR? I don't know what the longest name city is. Somewhere in America, there's probably some ridiculously long word, so let's just go with 255, somewhat historically or arbitrarily. State, what you want to do? Judgment call, right? What's maybe the most efficient? How many characters? Maybe just two, if we can get away with doing just, like, MA for Massachusetts and so forth. So, I'm going to go a CHAR value of two. Zip code's an interesting one. We're here in 02138, so that suggests we should use what? It's an INT, right? INT, INT, short? Short would work. No? CHAR or five, but I want an INT. Why push back on an INT? Persuade me from this. What's stupid about an INT, my idea? Yeah. AUDIENCE: Take up more memory. DAVID MALAN: Take up more memory. Four bytes, but you're proposing a zip code as five bytes or someone was as a CHAR, which feels like eh, that's not really the case. Well, fun story. Years ago, when I used to use Microsoft Outlook for my email, I eventually wanted to switch to Gmail. And so, I exported all of my contacts from Outlook as a CSV file. Comma separated values, which just meant I had all my friends names and last names and phone numbers and zip codes and all of that. And then I made the mistake of opening it up in Excel, which is a spreadsheet program that understands CSV files as we've seen. But then, I must have hit, like, Command or Control S at one point. And Excel apparently at the time had a feature whereby any time it saw a number, it tried to be helpful. And if that number started with zeros, it would just get rid of them. Why do you need leading zeros on integers? They're meaningless, mathematically. They're not meaningless in the US Postal system. So, I've had for years, to this day, I still have friends that when the rare case that I need someone's address these days, I'll still see that I have a friend in Cambridge, Massachusetts, 2138. And it's annoying if you're trying to sort of programmatically generate envelopes or just jot it down. And that's because of this reason, I chose the wrong data type. So, I love your idea. Let's use a CHAR field. Five characters, except there is a corner case. If you still send mail, sometimes zip codes these days, they're, like, plus four. So, we need a hyphen and then we need four more numbers. So to be honest, it could go many different ways. For now, I'm going to keep it simple and I'm just going to say that it's a five CHAR value and we're going to skip the whole dash plus four. But these are the kinds of tradeoffs. And you can think of the same problems arising with phone numbers or other fields. And now, this is actually a foolish road to go down. Suppose both Rob and I and Hannah and Maria and [? Davon ?] and Andy and others on the staff all live in Cambridge, Massachusetts, 02138. This actually feel stupid that I'm adding to my users table, city, state, and zip. Why? AUDIENCE: [INAUDIBLE] DAVID MALAN: Say again? AUDIENCE: [INAUDIBLE] DAVID MALAN: They're always going to go together, right? When it turns out, we used to think this was the case until we exhaustively searched the whole US, and turns out that there are some inconsistencies where multiple towns have the same zip, which is weird. But, if we stipulate for now that 02138 is always Cambridge, Massachusetts, why in the world would you store in your database Cambridge and MA and 02138 for me and for Hannah and for Rob and for [? Davon ?] and for others who live here in Cambridge, it's perfectly redundant. We should get away with just storing what? Just the zip code. But then, if we store just the zip code, I do want, probably, for my website to know where 02138 is. So, I need another table. And that's OK. And in fact, this is one of the design processes of designing tables that you'll do in PSet7 as well whereby you want to factor out common data. Just like we've been factoring out common code and factoring out common styles from CSS, here too in the database, if I only need 02138 to uniquely identify someone's hometown, don't store Cambridge, Mass for every darn user in your table. Instead, have a separate table called Zips that should have what columns? Probably an ID field, just because, for the principles we're talking about now. Probably a zip field for 02138. And then probably what other columns? City and state, but only have one row for 02138, one row for 02139, one row for 90210. And that is literally all the zip codes I know. So now, what can you do? This is problematic, because now I've got two tables. So, my users are mostly over here, but their city state information's over here. So, it turns out with SQL, there's actually a way to join information, and you'll see this in the PSet. But it turns out you can do something like this. SELECT star FROM users, JOIN zips ON users dot zip equals zips dot zip. Which is a little wordy, admittedly, but this just means select everything from the process of taking my users table and my zips table. Join them on the one field they have in column. So, literally doing something like this, and give me back a new temporary table that's wider, that's bigger, that has all of the columns from both of them. And that, quite simply, would be the syntax for doing something like this. So, there's this ahead, but there's going to be other design decisions you'll have to make, not only with indexes but also running into challenges. In fact, there's a challenge in any database design whereby sometimes two people might want to access the same rows of the database table. So, this is something that we'll encounter in PSet7 as well. But I thought I'd look at one attack that's possible in SQL. What are some of the problems that can arise? So, you'll encounter this in PSet7. And we tell you outright what the coding solution for this problem is. But if you take a higher level class, especially in operating systems, you're going to encounter an issue of atomicity, the problem of trying to do multiple things all at once without interruption. And I thought I'd introduce this idea for PSet7 with a metaphor that I learned myself in Margo Seltzer's CS164 operating systems class years ago. Suppose that you have one of these dorm fridges in your dorm room or house, and you have a real penchant for milk. And so, you come home from classes one day, you open the fridge. Oh, damn it. There's no milk in the fridge. So, you close the fridge, lock the door, lock your dorm, walk around the corner to CVS, get in line, and start checking out for some milk. And it's going to take a while, because those damn self checkout counters take forever to use anyway. So meanwhile, your roommate comes home. He or she really likes milk as well. They come into the dorm room, open the fridge, oh, darn it. There's no more milk. So, he or she also goes around the corner. But now, since there's like two or three or four CVSes nearby, they happen to go to one of the different ones in the square. And so now, a few minutes later, both of you come home and ugh, worst problem ever. Now you have too much milk because it's going to go sour. And you like milk, but you don't really like milk. So now, this was an expensive mistake because both of you made a decision based on the state of some variable that was in the process of being changed by you, the initiator of going to get milk. So, what is perhaps a human solution to that problem? AUDIENCE: [INAUDIBLE] DAVID MALAN: Leave a note, right? Always leave a note, if you're familiar with that show. Yes, there are two of us. So, always leave a note, or literally lock the refrigerator with some kind of padlock or something over the top like that. But that's actually going to be key problem with database design, especially when you might have multiple browsers, multiple laptops, multiple users all trying to update information at once. Particularly sensitive information like financial information, whereby with a stock trading website like you'll be building, what if you want to check how much money you have and then if you have enough, buy some stock? But what if someone else who has a joint account with you is simultaneously trying to buy some stock? So, he or she is checking the account balance, both of you get back the same answer, there's no milk. Or both of you get back the answer, you have $100 in the account. Both of you try to make the decision to buy one share of some company stock. And now, what happens? You have two shares? You have no shares? Problems like that can arise. So, we'll encounter that. SQL injection attacks, thankfully, are something we'll help you with, but these are atrociously common these days still. So, this is just an example. I make no claims that Harvard PIN system is vulnerable to this particular attack. We've tried. But, you know that we have a field like this. And Yale's Net ID has a similar looking screen these days. And it turns out, that maybe the PIN system is implemented in PHP. And if it were-- it's not-- they might have code that looks like this. They have two variables. Give me the username and password from the post super global variable that we talked about earlier. Maybe Harvard has a query like SELECT star FROM users where username equals that and password equals that. And notice that I'm just plugging it in using the curly brace notation from the other day, which means just plug in a value here. I'm not using the question mark technique. I don't have any second or third arguments. I'm just literally constructing the string myself. The problem, though, is that if someone like a scroob, which is a reference to a film, logs in with something like this, and I've removed the dots that usually cover up passwords, what if he is particularly malicious and his password maybe is 12345, per the movie called "Spaceballs," but he critically types a single quote after the five, then literally the word or in the space, and then quote, unquote one equals quote one, but notice he's omitted what? He's omitted the quote on the right and he's omitted the quote on the left. Because if this attacker scroob's presumption is that the people who wrote this PHP code weren't so bright, maybe they just have some single quotes around the interpolation of a variable in curly braces? And so maybe, he could kind of complete their thought for them, but in a way that's going to let him hacked into the PIN system. In other words, suppose that this is the code and we now plug in what scroob typed. And it's red, because it's bad. And the underlying text is what he typed in, scroob could trick Harvard's server into constructing a SQL query string that looks like this. Password equals 12345 or one equals one. The result of which, logically, is that this will log scroob in if his password is 12345 or if one equals one, which is of course always true, which means scroob always gets in. And so, the way to fix this, as in a lot of cases, would be to write more defensively. To use something like our actual query function, which you'll see in PSet7, where we plug in something like question marks here. And the beauty of the query function that we give you is it defends against these so-called SQL injection attacks, where someone is tricking your code into injecting his or her own SQL code. Because what the query function we give you will actually do, if you use the question mark syntax and a second and a third argument here, is what did it add to the input that the user provided? Those backslash quotes. So, it escapes any potentially dangerous characters. This looks weird now, but it's not vulnerable because it doesn't change the logic anymore because that whole password is now a single quote that's not, in fact, scroob's password. So, there's been some jokes about this over the years. So, this was a photo taken of some geek in a parking lot whereby you might know that some cities and states try to scan your license plate to bill you or to ticket you if you go through without, like, the E-Z Pass thing. So, this person presumed that maybe the people writing the E-Z Pass system were not so bright, and maybe they just concatenated together a string, so that he or she couldn't maliciously not just complete their thought, but actually execute a bad command, which we've not mentioned yet, but you can probably guess. That in addition to delete and insert and update and select, there's also a keyword called drop, which literally deletes everything in the database, which is particularly bad. We can zoom in on this if it's a little tough to see. This, now, is a famous cartoon that's wonderfully clever now and understandable. [LAUGHTER] Yeah, cool. Kind of geeking out. So these, then, are SQL injection attacks. And they're so easy to avoid by using the right code or the right libraries. And you'll see in PSet7, that's why we give you the query function. So, a couple of teasers that we thought we'd give you here in our remaining minutes together. So, as you remember from week zero, we introduced these two light bulbs which are nice, not just because they're pretty and are colorful, but because they support something called an API, an Application Programming Interface And in CS50 thus far, we've mostly focused on GET and POST, but it turns out there's other HTTP verbs like PUT. And in fact, this was a slide from week zero whereby if you write code that sends a la PSet6 an HTTP request that looks like this with this chunk of text at the bottom, which is called JSON, or JavaScript Object Notation that we'll talk about next week, you can turn on or turn off or change the color of lights like those. So if CS50 also has in addition to some of those light bulbs here in New Haven if you'd like to borrow them for final projects, also some Microsoft Bands, which are like watches that you wear around your wrist that similarly have an API so that you can write your own software for them. We have an account with Apple's iOS code so that if you have an Apple Watch or an iPhone or an iPad or an iPod, you can write code that actually runs on those. We have a whole bunch of Arduinos, which are tiny little computers without cases, essentially, that you can connect via USB, typically to your own Mac or PC, write code that runs on these physical devices that often have sensors on them so you can interact with the real world. We have a whole bunch of Leap Motion devices, which are USB devices for Macs and PCs, here and again, in New Haven. And if you connect it to your Mac, you can actually control your computer by writing software that via infrared beams, figures out where your human hands are, even without touching your keyboard. We thought we'd share a quick glimpse at this, for instance. [MUSIC PLAYING] So, we have a whole bunch of these things, too, called Myo arm bands which you put over your forearm and then you can control the real world or the virtual world like this. [MUSIC PLAYING] Or, we also have some Google Cardboard, which is literally, like, a cardboard box you could put on your face, but slide in your phone into it so that you put the glass of your phone really close to your eyes. And Google Cardboard is pretty cheap at $10 or $20. And it has little lenses that slightly off shift the image on the screen for your human eyes to give you a sense of depth so that you actually have a 3D environment in front of you. We also have some Samsung Gear, which is the more expensive version of this, but that can similarly slide in an Android phone and give you the illusion of-- or give the experience of virtual reality . And in our final two minutes, we thought we'd try to do this. If I can project what Colton has here just to whet your appetite, let me go ahead and throw up on the big screen here. Let me kill the lights. Colton, do you want to go ahead and put on your cell for a moment and come on over to the middle of the stage? And do you want to project-- this is what Colton sees. Now, the Wi-Fi in here is not so strong for this device that this is super compelling, but Colton is literally in this magical futuristic place. He only sees one image. You are seeing his left and right eye that his brain are stitching together in a three dimensional environment on his face. He's just selected a menu option here. And so again, he's wearing this headset with a Samsung phone on it that's wirelessly projecting to our overhead. Now you're on Mars, I think? COLTON: I think so. I'm not sure [INAUDIBLE]. [LAUGHTER] DAVID MALAN: Turns out Mars has these menus. COLTON: [INAUDIBLE] some cool places if we want to go to-- DAVID MALAN: Where do we want to go? COLTON: [INAUDIBLE] DAVID MALAN: And let's see where Colton's taking us now. COLTON: [INAUDIBLE] DAVID MALAN: So, there's so many different places you can take yourself. There's FAPIs via which you can write games or interactions that run, ultimately, on the phone. So, you really just writing a mobile phone app. But thanks to the software and the graphics capabilities, now Colton is in this tiny little cottage. And at the risk of overwhelming ourselves, Colton and I'll stick around for while at the end of class here today if you'd like to come up and play. And we'll bring them back next week as well. Without further, ado that's it for today. We'll see you next week. [MUSIC - RAGGA TWINS, "BAD MAN"]