WEBVTT X-TIMESTAMP-MAP=LOCAL:00:00:00.000,MPEGTS:900000 00:00:17.600 --> 00:00:19.880 DAVID MALAN: We've seen how with languages like Python 00:00:19.880 --> 00:00:23.240 can we implement business logic on a server and, even if we want, 00:00:23.240 --> 00:00:24.680 generate web pages. 00:00:24.680 --> 00:00:27.770 And we've seen, on the browser side, the so-called client side, 00:00:27.770 --> 00:00:31.340 how you can render information or data to users. 00:00:31.340 --> 00:00:33.770 And with CSS can we style it, and with JavaScript can 00:00:33.770 --> 00:00:35.480 we even make it interactive. 00:00:35.480 --> 00:00:38.120 But when building a business or any application that 00:00:38.120 --> 00:00:42.060 is interacting with users, where is all that data being stored? 00:00:42.060 --> 00:00:44.060 Well, if you're running your own small business, 00:00:44.060 --> 00:00:46.040 you might simply be using a spreadsheet. 00:00:46.040 --> 00:00:48.207 Of course, spreadsheets, whether they're implemented 00:00:48.207 --> 00:00:51.050 as Microsoft Excel, or Google Spreadsheets, or Apple Numbers, 00:00:51.050 --> 00:00:53.930 generally store data in rows and columns. 00:00:53.930 --> 00:00:56.990 And if you're like me, you probably use that first row 00:00:56.990 --> 00:01:00.230 to represent the names of the columns and the types of the data 00:01:00.230 --> 00:01:01.220 that you're storing. 00:01:01.220 --> 00:01:04.010 And as you continue to add more and more data to that spreadsheet, 00:01:04.010 --> 00:01:08.120 you probably, like me, continue to add row after row after row. 
00:01:08.120 --> 00:01:11.720 And meanwhile, if you've got so much data or so many different types of data 00:01:11.720 --> 00:01:14.510 that it doesn't really belong in one sheet, 00:01:14.510 --> 00:01:16.833 you might actually create a second sheet or a tab 00:01:16.833 --> 00:01:19.250 along the bottom of the screen, for instance, in which you 00:01:19.250 --> 00:01:22.430 can store additional data as well. 00:01:22.430 --> 00:01:27.440 So indeed, Microsoft Excel allows you to store data relationally, so to speak. 00:01:27.440 --> 00:01:31.550 You might have one sheet with customer info, another sheet with invoices, 00:01:31.550 --> 00:01:35.480 another sheet with products, or any number of other types of data. 00:01:35.480 --> 00:01:40.610 And you can relate those to one another by somehow having some commonality 00:01:40.610 --> 00:01:45.360 among them, some identifier, some name, some customer ID or the like. 00:01:45.360 --> 00:01:48.050 And so data that tends to be stored in rows and columns 00:01:48.050 --> 00:01:49.502 can have these relationships. 00:01:49.502 --> 00:01:51.710 And you can do this certainly in the cloud these days 00:01:51.710 --> 00:01:55.580 as well with Google Docs, simply a web-based version of the same. 00:01:55.580 --> 00:01:58.940 But when designing your spreadsheets, or more 00:01:58.940 --> 00:02:01.945 generally, your data's schema, the design thereof, 00:02:01.945 --> 00:02:04.070 there's a number of questions you need to consider. 00:02:04.070 --> 00:02:05.600 What data do you want to store? 00:02:05.600 --> 00:02:07.100 And where do you want to store it? 00:02:07.100 --> 00:02:09.623 And what type of data is it, in fact? 00:02:09.623 --> 00:02:12.290 Because whether using Excel, or Numbers, or Google Spreadsheets, 00:02:12.290 --> 00:02:16.970 odds are you sometimes configure the columns to be displayed or formatted 00:02:16.970 --> 00:02:17.930 in some ways. 
00:02:17.930 --> 00:02:20.150 Numbers might be displayed with or without commas. 00:02:20.150 --> 00:02:23.330 Numbers might be displayed with dollar signs or some other symbol. 00:02:23.330 --> 00:02:25.700 You might separate one field from another 00:02:25.700 --> 00:02:27.990 using special symbols or punctuation. 00:02:27.990 --> 00:02:31.580 So these are all configuration options in a spreadsheet that might 00:02:31.580 --> 00:02:33.680 be among the first decisions you make. 00:02:33.680 --> 00:02:35.930 In fact, let's consider a sample spreadsheet 00:02:35.930 --> 00:02:39.930 wherein I might want to represent users in some web application. 00:02:39.930 --> 00:02:42.680 All of us have, of course, registered for websites or applications 00:02:42.680 --> 00:02:43.710 these days. 00:02:43.710 --> 00:02:46.190 And what are some of the questions you're asked? 00:02:46.190 --> 00:02:49.850 Well, you might be asked for your user name by some application. 00:02:49.850 --> 00:02:52.405 You might be asked for your actual name. 00:02:52.405 --> 00:02:53.780 What else might you be asked for? 00:02:53.780 --> 00:02:56.180 Perhaps your email address. 00:02:56.180 --> 00:02:58.970 And perhaps if you're buying something, or registering 00:02:58.970 --> 00:03:01.940 for something, or someone that needs to know a bit more about you, 00:03:01.940 --> 00:03:05.450 perhaps they'll even ask you for your address, maybe your phone number, 00:03:05.450 --> 00:03:07.590 or perhaps even your age. 00:03:07.590 --> 00:03:10.670 So you can certainly imagine there being even other fields that you need. 00:03:10.670 --> 00:03:15.740 But let's consider now what types of data each of these fields is. 00:03:15.740 --> 00:03:17.600 Now, what should a user name be? 00:03:17.600 --> 00:03:21.050 Well, by definition, this should be a unique value that belongs to you 00:03:21.050 --> 00:03:22.910 and only you on a system. 
00:03:22.910 --> 00:03:25.010 For instance, if I'm registering for a website 00:03:25.010 --> 00:03:28.400 and that website is storing some data in, for instance, a spreadsheet, 00:03:28.400 --> 00:03:31.730 I might ideally try to claim Malan if it's available. 00:03:31.730 --> 00:03:33.800 My name meanwhile, of course, will be just 00:03:33.800 --> 00:03:36.710 David Malan, which someone else can certainly have as well. 00:03:36.710 --> 00:03:39.320 Hopefully my email address is only mine. 00:03:39.320 --> 00:03:42.020 And that, too, will be stored in a column of its own. 00:03:42.020 --> 00:03:46.860 And then my address, for instance, here on campus, 33 Oxford Street, Cambridge, 00:03:46.860 --> 00:03:49.940 Massachusetts, 02138. 00:03:49.940 --> 00:03:54.710 A phone number by which I can be reached, 617-495-5000. 00:03:54.710 --> 00:03:56.450 And then my age shall be-- 00:03:56.450 --> 00:03:58.430 well, we'll leave that blank just for now. 00:03:58.430 --> 00:04:02.390 So notice how each of these pieces of data is just a little bit different. 00:04:02.390 --> 00:04:04.100 Some of them seem to be numeric. 00:04:04.100 --> 00:04:06.020 Some of them seem to be alphabetic. 00:04:06.020 --> 00:04:08.900 Some of them even have some punctuation therein. 00:04:08.900 --> 00:04:11.120 Now, you might or might not in a spreadsheet 00:04:11.120 --> 00:04:13.820 care to display this data a bit differently. 00:04:13.820 --> 00:04:16.700 Indeed, the only number that I'm sweeping under the rug 00:04:16.700 --> 00:04:18.019 here is, in fact, my age. 00:04:18.019 --> 00:04:21.390 But hopefully I don't need one or more commas in that value. 00:04:21.390 --> 00:04:25.190 So there isn't really any need for special formatting here. 00:04:25.190 --> 00:04:27.350 But it turns out that when you're actually 00:04:27.350 --> 00:04:30.950 building a website or software-based application, 00:04:30.950 --> 00:04:34.970 you're probably not storing your data ultimately in just spreadsheets. 
00:04:34.970 --> 00:04:39.470 You might graduate eventually from being a small shop to needing more than just 00:04:39.470 --> 00:04:41.478 Microsoft Excel, or Numbers, or even something 00:04:41.478 --> 00:04:43.020 cloud-based like Google Spreadsheets. 00:04:43.020 --> 00:04:43.700 Why? 00:04:43.700 --> 00:04:47.580 Well, you have more rows than those programs can generally handle. 00:04:47.580 --> 00:04:49.670 In fact, on your own Mac or PC, odds are, 00:04:49.670 --> 00:04:53.600 when opening up big files, whether a spreadsheet or any other document, 00:04:53.600 --> 00:04:55.430 sometimes you might actually feel that. 00:04:55.430 --> 00:04:57.380 The computer might start to slow, and you 00:04:57.380 --> 00:05:00.200 might start to see a spinning beach ball or hourglass 00:05:00.200 --> 00:05:02.090 because you start to hit the limits of what 00:05:02.090 --> 00:05:05.240 a local computer can do with just client-side software, 00:05:05.240 --> 00:05:06.950 like a spreadsheet tool. 00:05:06.950 --> 00:05:10.730 And so eventually you might actually need to use a proper database. 00:05:10.730 --> 00:05:13.460 And a database is really just a piece of software 00:05:13.460 --> 00:05:15.470 that can absolutely run on your Mac or PC. 00:05:15.470 --> 00:05:19.010 But very commonly it runs on a server or somewhere else 00:05:19.010 --> 00:05:21.670 in the cloud to which your own software connects. 00:05:21.670 --> 00:05:24.170 For instance, if you're building a website or an application 00:05:24.170 --> 00:05:27.110 in a language like Python, you can, in Python, 00:05:27.110 --> 00:05:32.570 write code that talks or communicates with that database to pull data down 00:05:32.570 --> 00:05:34.730 and to send data back up. 00:05:34.730 --> 00:05:37.170 But spreadsheets are wonderfully straightforward. 
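The kind of Python-to-database conversation just described might look like the sketch below, using Python's built-in sqlite3 module; the table and column names here are hypothetical, and an in-memory database stands in for a real file or server:

```python
import sqlite3

# Connect to an in-memory SQLite database; a file path like "store.db"
# (or, with other libraries, a server in the cloud) would work the same way.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (username TEXT, email TEXT)")

# Send data up to the database...
db.execute("INSERT INTO users (username, email) VALUES (?, ?)",
           ("malan", "malan@example.com"))

# ...and pull data back down.
row = db.execute("SELECT email FROM users WHERE username = ?",
                 ("malan",)).fetchone()
print(row[0])  # malan@example.com
```

The `?` placeholders let the sqlite3 library substitute values safely, rather than pasting them into the query string by hand.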
00:05:37.170 --> 00:05:40.310 It's just so intuitive how you store the data in rows and columns 00:05:40.310 --> 00:05:43.070 and just more and more rows as you have more and more data. 00:05:43.070 --> 00:05:48.410 And so what's nice about some databases is that they mimic exactly this design. 00:05:48.410 --> 00:05:51.560 There exists in the world what are called relational databases. 00:05:51.560 --> 00:05:55.197 And indeed, this is among the most common ways to store data relationally. 00:05:55.197 --> 00:05:57.530 The jargon is a bit different in the world of databases. 00:05:57.530 --> 00:05:59.720 You no longer call these things spreadsheets. 00:05:59.720 --> 00:06:01.160 We call them databases. 00:06:01.160 --> 00:06:03.830 And we don't call the individual tabs sheets. 00:06:03.830 --> 00:06:05.240 We call them tables. 00:06:05.240 --> 00:06:08.240 But we continue to call the structure of the data therein 00:06:08.240 --> 00:06:11.090 rows for each additional piece of data and columns 00:06:11.090 --> 00:06:13.430 for the different types of data that we have. 00:06:13.430 --> 00:06:18.500 But in a relational database, such as Oracle, Microsoft Access, SQL Server, 00:06:18.500 --> 00:06:23.030 MySQL, Postgres, or something smaller and lighter-weight called SQLite, 00:06:23.030 --> 00:06:26.930 the burden is on you, the designer of the database, 00:06:26.930 --> 00:06:29.000 or the programmer, or the business person 00:06:29.000 --> 00:06:32.690 to actually decide on what types of data you are going to store. 
00:06:32.690 --> 00:06:35.660 Because the database, in order to handle more and more data 00:06:35.660 --> 00:06:39.350 than a typical spreadsheet can support, needs a bit of help from you, 00:06:39.350 --> 00:06:42.680 needs to know what type of data you're storing so that it can search it 00:06:42.680 --> 00:06:45.170 more efficiently and sort it more effectively 00:06:45.170 --> 00:06:48.530 and make it easier for you ultimately to add and remove data 00:06:48.530 --> 00:06:51.075 from that particular database. 00:06:51.075 --> 00:06:53.450 So in other words, you can do all of the same operations. 00:06:53.450 --> 00:06:56.420 But generally, in a database, you have so much more data, 00:06:56.420 --> 00:07:00.150 you're going to need to have the database help you help yourself. 00:07:00.150 --> 00:07:02.232 So what does that actually mean here? 00:07:02.232 --> 00:07:04.190 Well, let's consider some of these fields here. 00:07:04.190 --> 00:07:08.630 Were we to migrate my spreadsheet with many, many, many rows of users, 00:07:08.630 --> 00:07:12.570 for instance, to a proper database relationally, 00:07:12.570 --> 00:07:15.090 I'm going to have to make a few more decisions as well. 00:07:15.090 --> 00:07:18.530 And it turns out that I need to choose various types for these columns. 00:07:18.530 --> 00:07:22.740 And generally, in a database, you only have a limited menu of options. 00:07:22.740 --> 00:07:26.740 So let's take a look at what those might be. 
00:07:26.740 --> 00:07:28.960 Here in many relational databases are just 00:07:28.960 --> 00:07:31.940 some of the data types that are available to you; 00:07:31.940 --> 00:07:35.560 an integer if you want to represent something like 1, 2, 3, or perhaps 00:07:35.560 --> 00:07:39.250 even a negative, a real number, otherwise known as a floating point 00:07:39.250 --> 00:07:41.200 value, for instance, in Python that actually 00:07:41.200 --> 00:07:44.530 has a decimal point and perhaps some numbers thereafter, 00:07:44.530 --> 00:07:47.530 numeric, which is more of a catch-all and might handle things like dates 00:07:47.530 --> 00:07:50.800 and times that are numbers in some sense but have some more 00:07:50.800 --> 00:07:53.770 formal structure to them, and then the more general text, when you just 00:07:53.770 --> 00:07:56.830 have words or phrases, characters or whole paragraphs 00:07:56.830 --> 00:08:00.070 or more that you might want to store as well in some column. 00:08:00.070 --> 00:08:03.400 And then lastly, you might have the cutely named BLOB, 00:08:03.400 --> 00:08:07.030 or binary large object, when you actually want to store zeros and ones, 00:08:07.030 --> 00:08:10.730 that is to say binary data, like actual files in your database. 00:08:10.730 --> 00:08:12.670 However, you needn't do this typically. 00:08:12.670 --> 00:08:16.690 You can actually store files, of course, on a file system on your own hard drive 00:08:16.690 --> 00:08:20.260 somewhere on the server, but that option exists for you. 00:08:20.260 --> 00:08:23.110 Now, this happens to be the list of data types supported 00:08:23.110 --> 00:08:27.460 by the simplest of relational databases, something called SQLite. 00:08:27.460 --> 00:08:31.300 And as its name implies, it actually supports a very specific language 00:08:31.300 --> 00:08:36.890 via which you or anyone can request data from the database and store 00:08:36.890 --> 00:08:39.760 data in the database, or update it, or delete it. 
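Those five SQLite types might be sketched in a single hypothetical table like the one below, created via Python's built-in sqlite3 module; the table and its columns are invented purely for illustration:

```python
import sqlite3

db = sqlite3.connect(":memory:")
# One column per SQLite data type discussed above.
db.execute("""
    CREATE TABLE samples (
        quantity INTEGER,  -- whole numbers like 1, 2, 3, or negatives
        price    REAL,     -- floating-point values like 9.99
        created  NUMERIC,  -- catch-all, e.g. dates or Booleans
        note     TEXT,     -- words, phrases, even whole paragraphs
        photo    BLOB      -- raw binary data, like a file's bytes
    )
""")
db.execute("INSERT INTO samples VALUES (?, ?, ?, ?, ?)",
           (3, 9.99, "2007-11-20", "hello", b"\x89PNG"))
row = db.execute("SELECT quantity, price, note FROM samples").fetchone()
print(row)  # (3, 9.99, 'hello')
```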
00:08:39.760 --> 00:08:44.440 And that language is called SQL, structured query language. 00:08:44.440 --> 00:08:47.200 SQL is a language via which you can do exactly 00:08:47.200 --> 00:08:51.670 that, retrieve data from a database, put data into a database, and so much more. 00:08:51.670 --> 00:08:54.867 It's an alternative to the pointing and clicking with which you're probably 00:08:54.867 --> 00:08:57.700 familiar from something like Microsoft Excel, or Google Spreadsheets, 00:08:57.700 --> 00:08:59.830 or Apple Numbers, where it's really you, the human, 00:08:59.830 --> 00:09:03.250 doing all the work via a graphical user interface. 00:09:03.250 --> 00:09:05.710 Now, to be fair, in all of these spreadsheet programs 00:09:05.710 --> 00:09:08.110 are there functions or macros that you might 00:09:08.110 --> 00:09:11.540 be able to write so that you can program in those environments 00:09:11.540 --> 00:09:13.690 but with some limitations. 00:09:13.690 --> 00:09:17.680 And indeed with SQL and with SQL databases, more generally, 00:09:17.680 --> 00:09:20.680 that is, databases that understand this language, can you 00:09:20.680 --> 00:09:24.430 achieve far higher performance, can you store far more data, 00:09:24.430 --> 00:09:26.380 and nonetheless get at it quickly. 00:09:26.380 --> 00:09:29.860 So it's sort of the next evolution of what you might otherwise do only 00:09:29.860 --> 00:09:31.810 within your own computer's spreadsheet. 00:09:31.810 --> 00:09:36.490 But many other databases, Oracle, SQL Server, Microsoft Access, MySQL, 00:09:36.490 --> 00:09:39.100 Postgres and more, support more data 00:09:39.100 --> 00:09:42.760 types than these, ones that allow you to help the database help you even more. 
00:09:42.760 --> 00:09:46.700 Because the more the database knows about your data, the smarter decisions 00:09:46.700 --> 00:09:49.637 it can make at a lower level below these abstractions 00:09:49.637 --> 00:09:51.970 and storing that data so that, when you have a question, 00:09:51.970 --> 00:09:54.260 it can answer it quickly. 00:09:54.260 --> 00:09:57.760 And so how might you refine what we mean by integer? 00:09:57.760 --> 00:09:59.890 Well, in some databases, you might want to tell it 00:09:59.890 --> 00:10:03.370 that this is just a small integer in this column one 00:10:03.370 --> 00:10:04.990 after the other, row by row. 00:10:04.990 --> 00:10:08.020 Or maybe it's just an integer somewhere a little bigger 00:10:08.020 --> 00:10:12.070 than that, or perhaps a big int, which means it can be even larger. 00:10:12.070 --> 00:10:14.860 Now, these actually map to very well-defined values. 00:10:14.860 --> 00:10:19.630 If you think back to how we considered data itself is stored in a computer, 00:10:19.630 --> 00:10:22.300 it's ultimately with zeros and ones or bits. 00:10:22.300 --> 00:10:26.110 And indeed, an integer, so to speak, generally takes up 00:10:26.110 --> 00:10:28.930 just 32 bits or four bytes. 00:10:28.930 --> 00:10:31.930 But, of course, that few bits actually translates, 00:10:31.930 --> 00:10:35.530 if you do a bit of the math, to four billion possible values. 00:10:35.530 --> 00:10:39.160 So if you were to specify that a column in your database 00:10:39.160 --> 00:10:43.300 is of type integer, that means you could type in the number from zero 00:10:43.300 --> 00:10:46.570 all the way on up to four billion, or if you want negative numbers, 00:10:46.570 --> 00:10:51.400 from negative two billion roughly to positive two billion as well. 00:10:51.400 --> 00:10:54.580 But beyond that three billion or four, you're out of luck 00:10:54.580 --> 00:10:56.320 if you're only using 32 bits. 00:10:56.320 --> 00:10:57.460 Now, that might seem huge. 
00:10:57.460 --> 00:11:02.170 And that is a good problem to have if I have four billion users, or sales, 00:11:02.170 --> 00:11:04.280 or anything in my database. 00:11:04.280 --> 00:11:08.080 But for certain large companies these days, having four billion 00:11:08.080 --> 00:11:12.040 or more records of some sort, transactions or logs, 00:11:12.040 --> 00:11:13.420 not at all uncommon. 00:11:13.420 --> 00:11:15.700 And for those purposes does there exist big int, 00:11:15.700 --> 00:11:20.080 a bigger data type that uses 64 bits or eight bytes via which 00:11:20.080 --> 00:11:22.743 you can count much, much, much higher, so high 00:11:22.743 --> 00:11:24.910 that I'm not even sure how to pronounce that number. 00:11:24.910 --> 00:11:29.110 But indeed, it should be large enough for almost all of our purposes. 00:11:29.110 --> 00:11:32.890 So ultimately, it's not enough to say that the field in a database 00:11:32.890 --> 00:11:38.440 is an integer but to specify just how big maximally that integer can be. 00:11:38.440 --> 00:11:42.400 Now, fortunately, I think for my age we could probably get away with small int. 00:11:42.400 --> 00:11:45.510 But indeed, with a small int, you typically have 16 bits. 00:11:45.510 --> 00:11:52.270 So even then could your user be as old as 65,535 years old. 00:11:52.270 --> 00:11:53.750 Now, what about real numbers? 00:11:53.750 --> 00:11:56.260 These are floating point values with decimal points. 00:11:56.260 --> 00:12:00.100 And here, too, you have fewer choices, typically, but along the same lines. 00:12:00.100 --> 00:12:03.310 A real number would typically take 32 bits, 00:12:03.310 --> 00:12:07.150 which allows you only some amount of precision, some number of digits 00:12:07.150 --> 00:12:08.470 after that decimal point. 
00:12:08.470 --> 00:12:11.440 If you want to go even further and be ever more precise 00:12:11.440 --> 00:12:14.590 can you use twice as much space, 64 bits or eight 00:12:14.590 --> 00:12:17.740 bytes, and use double precision instead. 00:12:17.740 --> 00:12:20.920 Well, it would seem that we sort of start with an OK 00:12:20.920 --> 00:12:23.360 value and an even better value here. 00:12:23.360 --> 00:12:25.355 But what's the trade-off, as there always is? 00:12:25.355 --> 00:12:27.480 Well, it's fine if you want to be able to count higher, 00:12:27.480 --> 00:12:29.820 whether with integers or real numbers. 00:12:29.820 --> 00:12:32.190 But you are literally going to pay a price. 00:12:32.190 --> 00:12:34.770 Because if you want to count higher than four billion 00:12:34.770 --> 00:12:36.900 in the context of integers, you're going to have 00:12:36.900 --> 00:12:40.890 to spend twice as much space from four bytes to eight 00:12:40.890 --> 00:12:42.390 in order to store those values. 00:12:42.390 --> 00:12:46.560 And if you're never actually going to use values larger than four billion, 00:12:46.560 --> 00:12:49.420 you're just spending twice as much space as you need to. 00:12:49.420 --> 00:12:51.810 And if you don't really need to store floating point 00:12:51.810 --> 00:12:56.220 values to terribly many digits of precision, you're just wasting space. 00:12:56.220 --> 00:12:59.240 Now, for small databases, this might be inconsequential. 00:12:59.240 --> 00:13:01.740 But for the Googles, and Microsofts, and others of the world 00:13:01.740 --> 00:13:07.050 that deal in gigabytes and terabytes of data, all of these bits and bytes 00:13:07.050 --> 00:13:08.670 will surely add up. 
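The integer widths just described imply concrete ranges, sketched below under the assumption of two's-complement (signed) storage, which is why a 32-bit integer reaches roughly negative two billion to positive two billion:

```python
# Range implied by each integer width, assuming signed (two's-complement)
# storage: one bit for the sign, the rest for the magnitude.
for name, bits in [("smallint", 16), ("integer", 32), ("bigint", 64)]:
    low, high = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    print(f"{name}: {low} to {high}")
# smallint: -32768 to 32767
# integer: -2147483648 to 2147483647
# bigint: -9223372036854775808 to 9223372036854775807
```

An unsigned interpretation of the same bits would instead run from zero up to about four billion for 32 bits, matching the other way of counting mentioned above.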
00:13:08.670 --> 00:13:12.010 As for numeric, more of a catch-all numerically, so to speak, 00:13:12.010 --> 00:13:16.110 whereby you have Booleans, zeros or ones, or false and true values, 00:13:16.110 --> 00:13:18.990 and dates, which come in a prescribed format, typically 00:13:18.990 --> 00:13:24.270 year, year, year, year, dash, month, month, dash, day, day. 00:13:24.270 --> 00:13:26.382 Indeed, even though across the world there 00:13:26.382 --> 00:13:29.340 are different formats in different communities for representing dates, 00:13:29.340 --> 00:13:33.820 in SQL do you specify them exactly in that format only. 00:13:33.820 --> 00:13:38.460 Date time adds to that a space followed by hour, hour, colon, minute, minute, 00:13:38.460 --> 00:13:42.240 colon, second, second, whereby if you want to store both a date and a time, 00:13:42.240 --> 00:13:45.720 you should store it in exactly that format in your database. Even though, 00:13:45.720 --> 00:13:48.930 thanks to code, whether Python or SQL itself, 00:13:48.930 --> 00:13:52.080 you might want to display that date in a geographically 00:13:52.080 --> 00:13:54.810 localized and conventional way, you nonetheless 00:13:54.810 --> 00:13:57.580 store it in this way in your database. 00:13:57.580 --> 00:14:01.680 Now, sometimes you need to fight against the imprecision that's 00:14:01.680 --> 00:14:04.480 inherent as a risk with real numbers. 00:14:04.480 --> 00:14:07.110 And so you can use the so-called numeric data type 00:14:07.110 --> 00:14:09.570 to specify a so-called scale and precision. 00:14:09.570 --> 00:14:13.290 That is, two numbers that specify really how many digits 00:14:13.290 --> 00:14:15.490 you ultimately want to support. 
00:14:15.490 --> 00:14:17.640 So if you want to store things like dollar amounts, 00:14:17.640 --> 00:14:20.610 maybe to cents or hundreds of cents, two or four decimal 00:14:20.610 --> 00:14:23.520 places, can you specify exactly that, and you 00:14:23.520 --> 00:14:26.460 will get exactly that much precision. 00:14:26.460 --> 00:14:30.090 As for time and timestamp, well, time is quite simply hour, hour, colon, minute, 00:14:30.090 --> 00:14:32.050 minute, colon, second, second. 00:14:32.050 --> 00:14:34.650 But timestamp's a little more interesting. 00:14:34.650 --> 00:14:37.200 It, too, can be used to represent time. 00:14:37.200 --> 00:14:40.860 But as with many languages, Python and JavaScript among them, 00:14:40.860 --> 00:14:43.560 it does so typically by remembering or counting 00:14:43.560 --> 00:14:48.690 some number of milliseconds or seconds from a particular point in time. 00:14:48.690 --> 00:14:51.570 In fact, some years ago, it was conventional to simply start 00:14:51.570 --> 00:14:56.370 counting time from January 1st, 1970, the so-called epoch, as opposed 00:14:56.370 --> 00:14:58.950 to, say, 0 BC or AD. 00:14:58.950 --> 00:15:03.690 Unfortunately, it turns out that you generally represent these timestamps 00:15:03.690 --> 00:15:08.250 with four bytes or 32 bits, which gives you access to four billion 00:15:08.250 --> 00:15:09.900 or so possible values. 00:15:09.900 --> 00:15:13.170 And it turns out that in the year 2038, we 00:15:13.170 --> 00:15:16.770 will have run out of bits with which to represent time. 00:15:16.770 --> 00:15:20.460 So if you recall the so-called Y2K problem when we ran into this issue 00:15:20.460 --> 00:15:24.180 around the year 2000, so have we rather painted ourselves 00:15:24.180 --> 00:15:28.290 into a corner in many systems by not allocating enough space 00:15:28.290 --> 00:15:30.120 to get us past that year. 00:15:30.120 --> 00:15:33.110 Well, what's the solution invariably going to be? 
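Both ideas above, the prescribed date formats and the 32-bit timestamp running out in 2038, can be sketched with Python's standard datetime module; the particular date chosen here is arbitrary:

```python
from datetime import datetime, timezone

# SQL expects dates as YYYY-MM-DD and datetimes as YYYY-MM-DD HH:MM:SS;
# strftime can emit exactly those formats for display or storage.
moment = datetime(2007, 11, 20, 13, 30, 0)
print(moment.strftime("%Y-%m-%d"))           # 2007-11-20
print(moment.strftime("%Y-%m-%d %H:%M:%S"))  # 2007-11-20 13:30:00

# A signed 32-bit count of seconds since the epoch (January 1st, 1970)
# tops out at 2**31 - 1 seconds -- early in the year 2038.
overflow = datetime.fromtimestamp(2 ** 31 - 1, tz=timezone.utc)
print(overflow.isoformat())  # 2038-01-19T03:14:07+00:00
```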
00:15:33.110 --> 00:15:35.700 Well, use more space and, thus, cost. 00:15:35.700 --> 00:15:38.400 But that is inherent in many systems. 00:15:38.400 --> 00:15:41.880 And it's one of these things that us humans will have to address. 00:15:41.880 --> 00:15:43.040 How about text? 00:15:43.040 --> 00:15:46.088 Well, text can, too, come in multiple forms. 00:15:46.088 --> 00:15:47.880 Whether it's English or any other language, 00:15:47.880 --> 00:15:49.770 you might have different lengths thereof. 00:15:49.770 --> 00:15:52.890 And so within the world of text values and databases, 00:15:52.890 --> 00:15:55.800 need you specify typically how many characters 00:15:55.800 --> 00:15:57.750 you want to store in that column. 00:15:57.750 --> 00:16:00.750 And this is distinct from something like a spreadsheet program 00:16:00.750 --> 00:16:04.410 where you can probably keep typing and typing or pasting as many characters 00:16:04.410 --> 00:16:05.490 as you'd like. 00:16:05.490 --> 00:16:09.810 A database, in so far as its purpose in life is to give you better performance 00:16:09.810 --> 00:16:12.510 and more scalability than a spreadsheet allows, 00:16:12.510 --> 00:16:14.760 wants you to tell it a bit more. 00:16:14.760 --> 00:16:16.620 It's going to ask you to decide whether you 00:16:16.620 --> 00:16:19.320 want to store a fixed number of chars. 00:16:19.320 --> 00:16:22.650 So char followed by a number here represented at its end 00:16:22.650 --> 00:16:29.418 means how many characters exactly shall you store in every cell in this column. 00:16:29.418 --> 00:16:30.835 Now, why might that be applicable? 00:16:30.835 --> 00:16:33.120 Well, in the US, for instance, for states, 00:16:33.120 --> 00:16:35.250 we have a standardization of two characters 00:16:35.250 --> 00:16:36.970 for every state in the country. 
00:16:36.970 --> 00:16:39.570 And so you might say char (2), thereby allowing 00:16:39.570 --> 00:16:43.230 you to store CT for Connecticut, and FL for Florida, 00:16:43.230 --> 00:16:47.280 and MA for Massachusetts and so on, because you know that every value is 00:16:47.280 --> 00:16:49.800 going to be that finite length, two. 00:16:49.800 --> 00:16:51.840 But sometimes you don't know the max length. 00:16:51.840 --> 00:16:56.640 And indeed, in my own table of users, my own name might have D-A-V-I-D 00:16:56.640 --> 00:17:00.060 and a space and M-A-L-A-N, so 11 for me. 00:17:00.060 --> 00:17:02.437 But your name might be shorter or longer. 00:17:02.437 --> 00:17:04.770 Or some user that hasn't even registered for our website 00:17:04.770 --> 00:17:07.980 yet might have an even longer name than that. 00:17:07.980 --> 00:17:11.609 So varchar, or variable number of chars, exists, too, 00:17:11.609 --> 00:17:13.859 where you can specify not a fixed number of characters 00:17:13.859 --> 00:17:19.109 to use but a maximum upper bound, perhaps something like 30, 00:17:19.109 --> 00:17:22.230 perhaps something like 300, or some number of characters 00:17:22.230 --> 00:17:26.079 that you decide will be the upper limit for any name you might actually 00:17:26.079 --> 00:17:29.410 see on your site, with as many characters as that. 00:17:29.410 --> 00:17:32.350 But varchar is smart, as are the databases that use it, 00:17:32.350 --> 00:17:35.110 because it will use that many characters maximally. 00:17:35.110 --> 00:17:37.220 But it's not going to waste space. 00:17:37.220 --> 00:17:42.010 In fact, even if you allow up to 30 characters for every name in your table, 00:17:42.010 --> 00:17:45.890 well, it's only going to store as many characters as are 00:17:45.890 --> 00:17:49.870 necessary, perhaps plus a bit of overhead or accounting 00:17:49.870 --> 00:17:51.340 to keep track of that value. 
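That space-saving difference between char and varchar can be sketched as below; the one byte of per-value overhead is an assumption for illustration, not any particular database's actual accounting:

```python
# char(30) pads every value to the full fixed width, while varchar(30)
# stores only the characters actually present plus a little bookkeeping
# (assumed here to be one byte per value).
names = ["David Malan", "Ada", "Grace Hopper"]

char_bytes = sum(30 for name in names)                # always the full 30
varchar_bytes = sum(len(name) + 1 for name in names)  # actual length + 1
print(char_bytes, varchar_bytes)  # 90 29
```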
00:17:51.340 --> 00:17:56.800 But that way you save space but still can handle large text or large strings. 00:17:56.800 --> 00:18:00.550 Lastly, if you have particularly large text, whole paragraphs, or essays, 00:18:00.550 --> 00:18:02.440 or documents that someone might paste in, 00:18:02.440 --> 00:18:05.232 or the contents of an entire web page that you want to search, 00:18:05.232 --> 00:18:07.190 well, you can have a text field more generally, 00:18:07.190 --> 00:18:11.290 which tends to support tens of thousands or more characters in total. 00:18:11.290 --> 00:18:14.260 But it does so in a way that's not quite as efficient. 00:18:14.260 --> 00:18:17.720 Rather than keep it right in the column itself, so to speak, 00:18:17.720 --> 00:18:20.500 it puts it somewhere else on the server, the result of which 00:18:20.500 --> 00:18:23.620 is that you can fit more room over here, so to speak. 00:18:23.620 --> 00:18:27.520 But it takes more time and effort to go search or find that data, so, 00:18:27.520 --> 00:18:30.610 again, a trade-off of space and time. 00:18:30.610 --> 00:18:33.170 But here, too, as with integers and real, 00:18:33.170 --> 00:18:37.000 you seem to have multiple choices, one of which at first glance is better. 00:18:37.000 --> 00:18:41.543 After all, why use char ever if you could also just say varchar 00:18:41.543 --> 00:18:42.460 with that same amount? 00:18:42.460 --> 00:18:45.550 For instance, for those two character codes for states, 00:18:45.550 --> 00:18:50.320 why not just say varchar and not just char (2)? 00:18:50.320 --> 00:18:53.650 Well, it turns out that if you promise the database 00:18:53.650 --> 00:18:58.900 that every cell in your column will use exactly the same number of bytes, 00:18:58.900 --> 00:19:03.220 it turns out that you'll have very straight or justified edges, 00:19:03.220 --> 00:19:05.890 in some sense, conceptually for that column. 
00:19:05.890 --> 00:19:08.560 For instance, if every cell in a column takes up 00:19:08.560 --> 00:19:11.740 exactly the same amount of space, then you might know, 00:19:11.740 --> 00:19:15.250 if you start numbering the bytes that are represented by those cells, 00:19:15.250 --> 00:19:18.910 this might be byte zero up here and then two characters away 00:19:18.910 --> 00:19:22.670 will be address two, and then four, and then six, and then eight. 00:19:22.670 --> 00:19:26.320 And if you think back to how algorithms like, say, binary search 00:19:26.320 --> 00:19:30.610 are implemented, as soon as you have the ability to know arithmetically 00:19:30.610 --> 00:19:34.780 where the next value, or the last value, or, best yet, the middle value 00:19:34.780 --> 00:19:39.880 is, numerically can you jump right to it in so-called constant time, 00:19:39.880 --> 00:19:44.260 thereby enabling something like binary search and ultimately logarithmic time. 00:19:44.260 --> 00:19:45.820 So this is a powerful thing. 00:19:45.820 --> 00:19:48.220 If you commit to the database that all of your values 00:19:48.220 --> 00:19:50.740 will take up the exact same amount of space, the database 00:19:50.740 --> 00:19:54.820 can search that value faster for you. 00:19:54.820 --> 00:20:00.520 Varchar, unfortunately, is rather akin to having only a left-justified column, 00:20:00.520 --> 00:20:03.713 whereby you might have data that looks very straight on one edge 00:20:03.713 --> 00:20:05.380 with all of the characters left aligned. 00:20:05.380 --> 00:20:07.630 But because the lengths of those values vary, 00:20:07.630 --> 00:20:11.080 this cell might be this wide, and this one this wide, and this one this wide, 00:20:11.080 --> 00:20:12.430 and this one this wide. 
00:20:12.430 --> 00:20:15.790 And as such, because you have a very ragged edge, so to speak, 00:20:15.790 --> 00:20:18.610 you can't just do simple math and add two and add two 00:20:18.610 --> 00:20:21.350 to get to the next value in that column. 00:20:21.350 --> 00:20:24.370 So in the worst case, if a column is only a varchar, 00:20:24.370 --> 00:20:27.520 the best the database can do by default is just linear search, 00:20:27.520 --> 00:20:31.810 or big O of n, which, recall, was notably slower than something 00:20:31.810 --> 00:20:34.250 like log n or logarithmic time. 00:20:34.250 --> 00:20:37.250 So the more help you can provide to the database, the better off you are. 00:20:37.250 --> 00:20:39.792 But of course you don't want to err too far on the other side 00:20:39.792 --> 00:20:45.190 and say, oh, well, I'm going to have char 300 for every cell 00:20:45.190 --> 00:20:50.200 in my Names column, because then you're using 300 characters by definition 00:20:50.200 --> 00:20:54.970 for just David Malan who needs just, say, 11. 00:20:54.970 --> 00:20:58.570 So lastly is there this thing BLOB, binary large object, 00:20:58.570 --> 00:21:03.160 which stores binary data, which can be used for images or files. 00:21:03.160 --> 00:21:08.230 But those, again, tend to be best stored somewhere else on the system. 00:21:08.230 --> 00:21:10.720 So with these possibilities in mind, how might we 00:21:10.720 --> 00:21:13.450 go about designing the schema for this data? 00:21:13.450 --> 00:21:15.790 We already have the names of my columns. 00:21:15.790 --> 00:21:18.040 And I've already gone ahead and put in one sample row. 00:21:18.040 --> 00:21:21.130 So let's consider in the context of all of those SQL types 00:21:21.130 --> 00:21:22.960 which ones apply here. 
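The arithmetic behind those "straight edges" can be sketched without any database at all. Here is a minimal sketch in Python, assuming a hypothetical column of two-character state codes packed into fixed-width records; because every record is the same size, the i-th record starts at byte i times the record size, which is exactly the constant-time jump described above:

```python
# Sketch: why fixed-width (char-style) columns allow constant-time jumps.
# With records of exactly RECORD_SIZE bytes, the i-th record starts at
# byte i * RECORD_SIZE -- simple arithmetic, no scanning required.
RECORD_SIZE = 2  # e.g., a char(2) state code

data = b"MANHCATXRI"  # five two-character state codes packed together

def record_at(index: int) -> bytes:
    # Jump straight to the record in O(1), the way binary search
    # jumps to a middle element.
    start = index * RECORD_SIZE
    return data[start:start + RECORD_SIZE]

print(record_at(0))  # b'MA'
print(record_at(3))  # b'TX'
```

With variable-length (varchar-style) records, no such formula exists, and in the worst case you must walk the records one by one, which is the linear-versus-logarithmic trade-off the transcript describes.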
00:21:22.960 --> 00:21:25.180 Well, for user name, something like Malan, 00:21:25.180 --> 00:21:27.490 it's probably not numeric because that's not 00:21:27.490 --> 00:21:29.230 something I'd be so inclined to type. 00:21:29.230 --> 00:21:30.700 But it probably is text. 00:21:30.700 --> 00:21:33.460 But within text we have the choice of char or varchar 00:21:33.460 --> 00:21:36.070 or larger blocks of text, more generally. 00:21:36.070 --> 00:21:37.760 So what should this be? 00:21:37.760 --> 00:21:39.040 Well, this one depends. 00:21:39.040 --> 00:21:41.170 A user name tends to be fairly short. 00:21:41.170 --> 00:21:44.500 Because after all, we humans probably don't want to type very long strings 00:21:44.500 --> 00:21:46.090 just to log into some site. 00:21:46.090 --> 00:21:49.540 Historically, maximum lengths of eight characters were common. 00:21:49.540 --> 00:21:52.060 But that tends to be fairly constrained. 00:21:52.060 --> 00:21:54.370 And you can no longer express very much of your name 00:21:54.370 --> 00:21:56.950 if you tend to have a long name yourself. 00:21:56.950 --> 00:22:01.030 So here we might say something like char, because we know it will be small. 00:22:01.030 --> 00:22:03.710 And we'd like to be able to search on this field efficiently, 00:22:03.710 --> 00:22:06.880 especially to log someone in fast, but probably not quite as 00:22:06.880 --> 00:22:10.300 short as eight, so maybe 16 or 20. 00:22:10.300 --> 00:22:12.940 Or if it's hard to guess there and you want more dynamism, 00:22:12.940 --> 00:22:17.050 maybe you would say varchar of something like 255. 00:22:17.050 --> 00:22:18.000 Why that value? 00:22:18.000 --> 00:22:22.220 Well, recall that with eight bits can you count 256 possible values. 00:22:22.220 --> 00:22:25.740 But if you start counting at zero, you can only go as high as 255. 
00:22:25.740 --> 00:22:29.280 And historically have many databases had limits of, say, 00:22:29.280 --> 00:22:32.770 255 for a varchar, though that's no longer the case. 00:22:32.770 --> 00:22:35.040 But you'll still see this very commonly. 00:22:35.040 --> 00:22:36.960 So what's the best answer? 00:22:36.960 --> 00:22:37.750 I don't know. 00:22:37.750 --> 00:22:40.140 It depends on your data or someone else's. 00:22:40.140 --> 00:22:43.200 For here, I'll go ahead and leave it as varchar (255) 00:22:43.200 --> 00:22:44.910 just to give myself flexibility. 00:22:44.910 --> 00:22:48.510 But a very good argument could be made for char of some smaller value 00:22:48.510 --> 00:22:51.780 or even varchar of some smaller value, too. 00:22:51.780 --> 00:22:54.810 As for name as well, I'm not really sure what this one should be. 00:22:54.810 --> 00:22:58.170 I don't know how many characters your own name has. 00:22:58.170 --> 00:23:00.390 Something like 11 is probably too few, even 00:23:00.390 --> 00:23:02.740 though it satisfies my name just fine. 00:23:02.740 --> 00:23:05.520 30 feels a bit tight. And frankly, I 00:23:05.520 --> 00:23:09.300 bet if we Google the longest name in the world, something 00:23:09.300 --> 00:23:13.110 tells me there's someone out there with quite a few characters, hopefully 00:23:13.110 --> 00:23:15.210 no more than, say, 255. 00:23:15.210 --> 00:23:18.240 But there, too, we might want to do some due diligence. 00:23:18.240 --> 00:23:20.097 With email, too, this seems to be easy. 00:23:20.097 --> 00:23:22.680 This, too, is just characters, even though you could certainly 00:23:22.680 --> 00:23:23.550 have numbers. 00:23:23.550 --> 00:23:26.010 I don't know how long the maximum email address will be. 00:23:26.010 --> 00:23:29.500 But frankly, it's probably going to be variable in length, 00:23:29.500 --> 00:23:31.120 so I'm going to go with a default. 00:23:31.120 --> 00:23:32.160 Why 255? 
00:23:32.160 --> 00:23:34.140 Again, it's probably way more than we need. 00:23:34.140 --> 00:23:35.533 But varchar is smart. 00:23:35.533 --> 00:23:37.200 It's not going to waste that many bytes. 00:23:37.200 --> 00:23:41.490 It's just a reasonable upper bound in the absence of better intuition. 00:23:41.490 --> 00:23:43.680 At least we're being consistent. 00:23:43.680 --> 00:23:45.580 Now, address, that might be decently long, 00:23:45.580 --> 00:23:48.370 especially if it's going on an envelope on, say, multiple lines. 00:23:48.370 --> 00:23:51.840 But here, too, this is probably just text, and so I'll go ahead here 00:23:51.840 --> 00:23:54.358 and say this, too, 255. 00:23:54.358 --> 00:23:54.900 Phone number. 00:23:54.900 --> 00:23:57.540 Finally, a number by name. 00:23:57.540 --> 00:23:59.700 Well, what should a phone number field be? 00:23:59.700 --> 00:24:04.410 Well, we had integer, or real, or, more specifically, small int or integer, 00:24:04.410 --> 00:24:06.030 or big int. 00:24:06.030 --> 00:24:09.870 But the funny thing here is even though a phone number is indeed 00:24:09.870 --> 00:24:13.260 called by us humans a number, it's really 00:24:13.260 --> 00:24:18.300 just a symbolic representation of a unique identifier for someone's phone. 00:24:18.300 --> 00:24:22.030 And so these hyphens might be common in one community or culture. 00:24:22.030 --> 00:24:25.890 Maybe you'd have parentheses or even pluses for country codes. 00:24:25.890 --> 00:24:30.390 So frankly, very quickly does even a phone number become not so much 00:24:30.390 --> 00:24:33.040 a number as a textual string. 00:24:33.040 --> 00:24:35.070 So here I have some discretion. 00:24:35.070 --> 00:24:37.890 And maybe I could be presumptuous and assume 00:24:37.890 --> 00:24:40.020 I'm only going to have customers or users for now, 00:24:40.020 --> 00:24:43.343 say, in the US for whatever business constraints. 00:24:43.343 --> 00:24:44.760 And so I might say, you know what? 
00:24:44.760 --> 00:24:49.740 This is a candidate to actually do something like char, say, 10, 00:24:49.740 --> 00:24:53.220 three-digit area code, a three-digit exchange, and then four digits 00:24:53.220 --> 00:24:54.300 thereafter. 00:24:54.300 --> 00:24:58.470 But that doesn't leave room for those hyphens, so I could make a char 12. 00:24:58.470 --> 00:25:01.170 Or frankly, if they're just going to be there all the time, 00:25:01.170 --> 00:25:04.170 why don't I leave them as char 10 and just 00:25:04.170 --> 00:25:07.440 get rid of those hyphens in my data? 00:25:07.440 --> 00:25:11.207 Or alternatively, I could support parentheses or pluses as well. 00:25:11.207 --> 00:25:13.290 It really depends on how I want to store the data. 00:25:13.290 --> 00:25:19.500 But I like the idea of a textual type, something like char or maybe varchar 00:25:19.500 --> 00:25:20.970 as opposed to an integer. 00:25:20.970 --> 00:25:23.220 Because at least if you've called certain communities 00:25:23.220 --> 00:25:26.660 or dialed out from some businesses, sometimes you have to type unusual numbers. 00:25:26.660 --> 00:25:28.410 At least in other countries, for instance, 00:25:28.410 --> 00:25:30.930 if we generalize beyond this data set here do you 00:25:30.930 --> 00:25:34.320 type zero to actually connect to someone local. 00:25:34.320 --> 00:25:37.440 And the problem with zero is that, mathematically, it's 00:25:37.440 --> 00:25:39.690 meaningless to start a value with zero. 00:25:39.690 --> 00:25:41.910 And unfortunately, a computer takes that to heart. 00:25:41.910 --> 00:25:45.190 And if you were to store phone as an integer 00:25:45.190 --> 00:25:48.450 but a number were in some community to start with a zero, 00:25:48.450 --> 00:25:51.090 your database would probably just get rid of it. 00:25:51.090 --> 00:25:54.152 In fact, try that yourself in Microsoft Excel, or Apple Numbers, 00:25:54.152 --> 00:25:55.110 or Google Spreadsheets. 
00:25:55.110 --> 00:25:58.530 And depending on how you formatted the column, you can type as many zeros 00:25:58.530 --> 00:26:00.280 as you want followed by other digits. 00:26:00.280 --> 00:26:04.380 And odds are, when you hit Enter, those zeros are gone. 00:26:04.380 --> 00:26:09.540 As for age, here we perhaps finally have a compelling candidate for a number. 00:26:09.540 --> 00:26:11.850 Small int should probably get the job done. 00:26:11.850 --> 00:26:16.890 Integer would work as well, or big int, but increasingly wasteful. 00:26:16.890 --> 00:26:17.760 But you know what? 00:26:17.760 --> 00:26:19.920 Even here it's not that obvious. 00:26:19.920 --> 00:26:24.930 I probably shouldn't even use an integer type here at all. 00:26:24.930 --> 00:26:25.830 Why? 00:26:25.830 --> 00:26:30.270 Well, I might indeed ask a human for his or her age upon registering. 00:26:30.270 --> 00:26:34.620 The catch is that age might change the next day or the day 00:26:34.620 --> 00:26:38.490 after, or the day after, because, of course, time is advancing. 00:26:38.490 --> 00:26:41.310 And unless I also stored the date and ideally 00:26:41.310 --> 00:26:45.600 time at which the user registered, I can't really 00:26:45.600 --> 00:26:49.500 even do any math to figure out, oh, you registered a year ago. 00:26:49.500 --> 00:26:52.150 Let me assume your age is that plus one. 00:26:52.150 --> 00:26:55.170 So what would have been better from the get-go than age? 00:26:55.170 --> 00:26:58.830 Probably just something like date of birth, DOB. 00:26:58.830 --> 00:27:01.530 And, of course, in SQL do we have the ability 00:27:01.530 --> 00:27:03.850 to store dates, even date times. 00:27:03.850 --> 00:27:06.120 So here we probably have a better candidate 00:27:06.120 --> 00:27:09.700 for exactly one of those numeric types, so to speak. 
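Both of those points can be demonstrated concretely: a leading zero survives only in a textual type, and storing a date of birth, unlike an age, lets us compute a current age any day. Here is a minimal sketch using Python's built-in sqlite3 module; the table, column names, and sample values are made up for illustration:

```python
import sqlite3
import datetime

# Sketch: a zip code stored as an integer silently drops its leading zero,
# while a textual type preserves it.  Table and columns are hypothetical.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (zip_as_int INTEGER, zip_as_text TEXT, dob TEXT)")
db.execute("INSERT INTO users VALUES (?, ?, ?)", (int("02138"), "02138", "1990-06-15"))

row = db.execute("SELECT zip_as_int, zip_as_text FROM users").fetchone()
print(row)  # (2138, '02138') -- the integer lost its zero

# Storing date of birth rather than age lets us compute age on any day:
dob = datetime.date.fromisoformat(db.execute("SELECT dob FROM users").fetchone()[0])
today = datetime.date(2024, 1, 1)  # fixed date so the result is reproducible
age = today.year - dob.year - ((today.month, today.day) < (dob.month, dob.day))
print(age)  # 33
```

The age arithmetic subtracts one when this year's birthday hasn't happened yet, which is exactly the math an age column alone could never support.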
00:27:09.700 --> 00:27:13.097 Now, we've only just begun to scratch the surface of available data types 00:27:13.097 --> 00:27:14.430 and data we might want to store. 00:27:14.430 --> 00:27:18.150 In fact, our spreadsheet or database could contain even more types. 00:27:18.150 --> 00:27:21.540 But now let's just suppose that we're happy with our types. 00:27:21.540 --> 00:27:24.480 And the goal now at hand is to start searching our data 00:27:24.480 --> 00:27:27.940 and storing that data in an actual database. 00:27:27.940 --> 00:27:31.600 I've been using here Google Spreadsheets just to lay out my schema 00:27:31.600 --> 00:27:34.130 sort of as a worksheet, if you will. 00:27:34.130 --> 00:27:37.000 But now let's graduate to actual SQL syntax 00:27:37.000 --> 00:27:39.250 and suppose that this same data is stored 00:27:39.250 --> 00:27:43.780 not in Google Spreadsheets or any other but in some database 00:27:43.780 --> 00:27:47.080 elsewhere, a piece of software running somewhere else on my own computer 00:27:47.080 --> 00:27:50.320 or somewhere in the cloud that knows how to organize this data, still 00:27:50.320 --> 00:27:56.760 in rows and columns but in such a way that I can use this new language, SQL, 00:27:56.760 --> 00:28:01.430 pronounced S-Q-L or sequel, in order to access my data. 00:28:01.430 --> 00:28:03.850 So what are the fundamental operations that 00:28:03.850 --> 00:28:07.510 SQL supports, or a relational database, more generally? 00:28:07.510 --> 00:28:09.910 It turns out that throughout computer science is there 00:28:09.910 --> 00:28:13.630 a pattern of operations that many different systems support. 00:28:13.630 --> 00:28:16.000 In the world of databases, you have fairly crassly 00:28:16.000 --> 00:28:20.710 what's called CRUD, the ability to create data, read data, update, 00:28:20.710 --> 00:28:21.490 and delete. 
00:28:21.490 --> 00:28:26.680 But more specifically, in the context of SQL, this new language, our last, 00:28:26.680 --> 00:28:31.540 you have the ability to insert data, a.k.a. create; select data, a.k.a. 00:28:31.540 --> 00:28:35.050 read; update data, a.k.a. 00:28:35.050 --> 00:28:38.560 update; and delete or drop data, a.k.a. 00:28:38.560 --> 00:28:39.460 delete. 00:28:39.460 --> 00:28:42.490 So whereas in SQL you have these very specific terms, 00:28:42.490 --> 00:28:45.880 they are just representative of a class of operations 00:28:45.880 --> 00:28:48.520 that you might see throughout computer science. 00:28:48.520 --> 00:28:50.830 So how do we go about using SQL? 00:28:50.830 --> 00:28:53.740 Well, short of buying and installing something like Oracle 00:28:53.740 --> 00:28:56.230 or downloading something free like MySQL, 00:28:56.230 --> 00:28:58.870 we can simply use something that's indeed, by definition, 00:28:58.870 --> 00:29:01.540 lighter weight called SQLite. 00:29:01.540 --> 00:29:05.200 SQLite is an implementation of SQL, this database language, 00:29:05.200 --> 00:29:08.430 one that you can install on your own Mac or PC, 00:29:08.430 --> 00:29:13.270 and it uses not special software or servers but rather just a file 00:29:13.270 --> 00:29:16.330 on your own Mac or PC that stores all of those tables 00:29:16.330 --> 00:29:18.300 and, in turn, rows and columns. 00:29:18.300 --> 00:29:20.440 It thereafter creates an abstraction of sorts 00:29:20.440 --> 00:29:22.960 as though you have a full-fledged server running 00:29:22.960 --> 00:29:26.350 on your own machine with which you can communicate using SQL. 00:29:26.350 --> 00:29:31.360 Now, any number of programs can be used to actually talk to a SQLite database. 00:29:31.360 --> 00:29:35.200 You might use a purely textual interface, so-called command line 00:29:35.200 --> 00:29:37.090 interface, using only your keyboard. 
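Those four CRUD operations can be exercised against SQLite directly from Python, whose standard library bundles the engine in its sqlite3 module. A minimal sketch, with a hypothetical users table standing in for the spreadsheet:

```python
import sqlite3

# Sketch of the four CRUD operations against SQLite, via Python's
# built-in sqlite3 module.  The users table here is hypothetical.
db = sqlite3.connect(":memory:")  # an in-memory database; a .db filename works too
db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, username TEXT)")  # create (a table)
db.execute("INSERT INTO users (username) VALUES (?)", ("malan",))         # create (a row)
rows = db.execute("SELECT * FROM users").fetchall()                       # read
print(rows)  # [(1, 'malan')]
db.execute("UPDATE users SET username = ? WHERE id = ?", ("dmalan", 1))   # update
db.execute("DELETE FROM users WHERE id = ?", (1,))                       # delete
remaining = db.execute("SELECT COUNT(*) FROM users").fetchone()[0]
print(remaining)  # 0
```

The `?` placeholders let the driver substitute values safely, rather than pasting them into the SQL string by hand.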
00:29:37.090 --> 00:29:41.210 Or you might use something graphical, a GUI, a graphical user interface. 00:29:41.210 --> 00:29:43.030 In fact, one of the easiest programs to use 00:29:43.030 --> 00:29:46.600 is this one here, DB Browser, freely available for Macs, 00:29:46.600 --> 00:29:50.740 for PCs running Windows, or Linux, or any number of operating systems 00:29:50.740 --> 00:29:51.670 as well. 00:29:51.670 --> 00:29:56.110 In advance, I've gone ahead and opened up a database in a file whose name ends 00:29:56.110 --> 00:29:59.830 in .sqlite or also commonly .db. 00:29:59.830 --> 00:30:01.900 This then is, again, just a file, a binary 00:30:01.900 --> 00:30:04.510 file filled with zeros and ones that collectively 00:30:04.510 --> 00:30:09.100 represent patterns of rows and columns and the tables that contain them. 00:30:09.100 --> 00:30:13.120 And if I open this program, I see all of the tables or, if you will, 00:30:13.120 --> 00:30:15.640 sheets from our old spreadsheet world. 00:30:15.640 --> 00:30:18.370 This happens to be a database all about music. 00:30:18.370 --> 00:30:22.960 And indeed, I have a table about musical albums and artists, customers 00:30:22.960 --> 00:30:26.560 and employees, and genres, and invoices, invoice lines that 00:30:26.560 --> 00:30:29.920 represent actual songs bought, the types of media involved, 00:30:29.920 --> 00:30:33.340 and playlists and playlist tracks via which users can actually 00:30:33.340 --> 00:30:37.807 customize those songs, and then lastly, the tracks or the songs themselves. 00:30:37.807 --> 00:30:39.640 In other words, someone else on the internet 00:30:39.640 --> 00:30:42.250 has gone to the trouble of aggregating all of this data 00:30:42.250 --> 00:30:47.770 about music and its artists and organized it into an SQL database. 00:30:47.770 --> 00:30:48.550 Now, how? 
00:30:48.550 --> 00:30:50.342 Well, if you look to the right here, you'll 00:30:50.342 --> 00:30:54.040 see a number of esoteric commands all involving CREATE TABLE. 00:30:54.040 --> 00:30:57.490 And indeed, CREATE is one of the four fundamental operations 00:30:57.490 --> 00:30:58.660 that SQL supports. 00:30:58.660 --> 00:31:02.860 And without going into the weeds of how these tables were created, 00:31:02.860 --> 00:31:07.070 they were created in such a way that the author had to decide on every table's 00:31:07.070 --> 00:31:09.640 columns' names as well as their types. 00:31:09.640 --> 00:31:11.440 And that part is interesting. 00:31:11.440 --> 00:31:16.090 For instance, let me go into the Albums table here and expand this one now. 00:31:16.090 --> 00:31:21.370 Here I have Album ID, and Title, and, curiously, Artist ID. 00:31:21.370 --> 00:31:25.450 But you'll notice in the Album table we have no mention of artist's name. 00:31:25.450 --> 00:31:27.970 In fact, if we follow the white rabbit here, 00:31:27.970 --> 00:31:31.450 we see in the Artist table that, oh, in Artist do 00:31:31.450 --> 00:31:33.700 we have an artist ID and name. 00:31:33.700 --> 00:31:37.150 And indeed, if we keep digging, we'll see that in each of these tables 00:31:37.150 --> 00:31:41.050 there's not quite as much information as I might like. 00:31:41.050 --> 00:31:44.830 In fact, IDs, whatever those are seem to be more prevalent. 00:31:44.830 --> 00:31:46.930 And that is because, in the world of databases, 00:31:46.930 --> 00:31:49.390 and perhaps even some of your own spreadsheets, 00:31:49.390 --> 00:31:53.560 it's conventional and daresay best practice to normalize your data. 00:31:53.560 --> 00:31:58.240 You indeed should try to avoid having redundancies within tables and even 00:31:58.240 --> 00:31:59.620 across tables. 00:31:59.620 --> 00:32:00.730 Now, what does that mean? 
00:32:00.730 --> 00:32:02.830 Well, consider our own spreadsheet of users 00:32:02.830 --> 00:32:05.410 that we're trying to get into a database. 00:32:05.410 --> 00:32:07.960 I had a column there called Address. 00:32:07.960 --> 00:32:12.170 And I proposed that my own address was 33 Oxford Street in Cambridge, 00:32:12.170 --> 00:32:13.930 Massachusetts 02138. 00:32:13.930 --> 00:32:17.680 Now, it turns out that there's quite a few other people at my workplace, 00:32:17.680 --> 00:32:21.260 and so they might have that exact same address as well. 00:32:21.260 --> 00:32:23.770 And in fact, that address doesn't really lend itself 00:32:23.770 --> 00:32:27.340 to very nice printing on envelopes because all I did 00:32:27.340 --> 00:32:30.280 was separate things with a comma, not actually hit Enter. 00:32:30.280 --> 00:32:33.190 And in fact, because everything was all in one column, 00:32:33.190 --> 00:32:36.550 it would seem a little difficult to search by, say, zip code. 00:32:36.550 --> 00:32:41.470 Find for me all of the users from Cambridge, Massachusetts in 02138. 00:32:41.470 --> 00:32:44.330 Well, you could do it sort of searching free form 00:32:44.330 --> 00:32:46.840 all of the values in some column. 00:32:46.840 --> 00:32:49.510 But it wouldn't be particularly clean or efficient because you'd 00:32:49.510 --> 00:32:53.770 have to look at and then ignore all of those other words and numbers, 33 00:32:53.770 --> 00:32:57.550 Oxford Street, Cambridge, Mass., if you only care about that zip. 00:32:57.550 --> 00:32:59.380 So before we forge ahead, let's see if we 00:32:59.380 --> 00:33:03.490 can't infer why this database is seemingly, at first glance, 00:33:03.490 --> 00:33:05.650 more complicated than you might like. 00:33:05.650 --> 00:33:09.040 Well, if I go back into my spreadsheet here, what really should 00:33:09.040 --> 00:33:11.050 I have done with Address? 
00:33:11.050 --> 00:33:13.750 I probably should have quantized that value 00:33:13.750 --> 00:33:17.830 into streets and then city, then state and zip code, 00:33:17.830 --> 00:33:20.680 not having just one column but several. 00:33:20.680 --> 00:33:24.220 In fact, here, what I've gone ahead and done is separate out Address 00:33:24.220 --> 00:33:27.050 into Street and City and State and Zip. 00:33:27.050 --> 00:33:30.040 And for each of those have I specified a very precise type. 00:33:30.040 --> 00:33:34.000 I've gone ahead and proposed that Street is still varchar (255), as is City, 00:33:34.000 --> 00:33:37.360 because I don't really know an upper bound, so we'll at least be consistent. 00:33:37.360 --> 00:33:41.530 For State, I've hopefully been smart in at least assuming users are in the US. 00:33:41.530 --> 00:33:44.800 I've said char (2) for just that two-character code. 00:33:44.800 --> 00:33:47.110 And for Zip, too, I'm preemptively trying 00:33:47.110 --> 00:33:51.670 to avoid a mistake with even Cambridge, whose zip codes start with a zero. 00:33:51.670 --> 00:33:54.610 Were I to specify again that that's just an integer, 00:33:54.610 --> 00:33:57.760 I might actually lose mathematically that first digit. 00:33:57.760 --> 00:34:02.440 But by storing it as a char with five characters and no hyphen or four others 00:34:02.440 --> 00:34:07.120 can I ensure that that 02138 remains exactly that. 00:34:07.120 --> 00:34:10.060 But here, too, we have a bit of inefficiency. 00:34:10.060 --> 00:34:12.880 Imagine if more and more users from my own building 00:34:12.880 --> 00:34:15.340 register for this particular application. 00:34:15.340 --> 00:34:18.670 You might have again and again these exact same values 00:34:18.670 --> 00:34:21.219 if all of my colleagues are in the same building. 
00:34:21.219 --> 00:34:23.500 And here, as with programming, as soon as you 00:34:23.500 --> 00:34:27.639 start to see a lot of redundancy, the same data again and again, 00:34:27.639 --> 00:34:30.520 there's surely an opportunity for better design. 00:34:30.520 --> 00:34:31.330 This is correct. 00:34:31.330 --> 00:34:33.880 My colleagues might also live here if we added 00:34:33.880 --> 00:34:35.679 their names and their distinct emails. 00:34:35.679 --> 00:34:39.940 But they don't necessarily need to have the same exact text stored 00:34:39.940 --> 00:34:41.830 again and again. 00:34:41.830 --> 00:34:45.580 So what might you do in this case even in the world of spreadsheets? 00:34:45.580 --> 00:34:49.989 Well, on the sheet, I might just rename this actually more explicitly to users. 00:34:49.989 --> 00:34:50.739 And you know what? 00:34:50.739 --> 00:34:53.406 Let me go ahead and create another sheet in my spreadsheet world 00:34:53.406 --> 00:34:55.239 and call this, say, Cities. 00:34:55.239 --> 00:35:00.980 And my cities might actually have a city name and perhaps a state and a zip. 00:35:00.980 --> 00:35:04.600 But in this leftmost column, I could be smart 00:35:04.600 --> 00:35:09.730 and start to assign each of these cities some unique ID or identifier. 00:35:09.730 --> 00:35:13.870 And so here might I have just a unique identifier, typically an integer. 00:35:13.870 --> 00:35:16.560 City might again be varchar (255). 00:35:16.560 --> 00:35:18.790 And State might, again, be char (2). 00:35:18.790 --> 00:35:21.250 Zip Code, meanwhile, can stay as char (5). 00:35:21.250 --> 00:35:25.900 But now what I can do is presume this, that if Cambridge, Massachusetts, 00:35:25.900 --> 00:35:31.720 02138 is in this sheet or, if you will, table let's arbitrarily but reasonably 00:35:31.720 --> 00:35:33.920 give it a unique identifier as one. 
00:35:33.920 --> 00:35:37.480 And if I were to add more cities here, like Allston, Massachusetts and its zip 00:35:37.480 --> 00:35:40.660 code, I could assign it a unique ID of two. 00:35:40.660 --> 00:35:43.420 Because now that I have this data, here's 00:35:43.420 --> 00:35:45.340 where you get that relational aspect. 00:35:45.340 --> 00:35:50.050 I can relate this sheet or table to my other as follows. 00:35:50.050 --> 00:35:54.040 Back in this table now can I go ahead and delete much of this redundancy 00:35:54.040 --> 00:35:58.420 here and actually go ahead and say city is not really a city but it's a city 00:35:58.420 --> 00:35:59.200 ID. 00:35:59.200 --> 00:36:02.410 And here now it can be a proper integer, because the city 00:36:02.410 --> 00:36:06.460 in which all of my colleagues live is the exact same one as mine. 00:36:06.460 --> 00:36:09.433 Now, here there's still an opportunity for improvement, to be fair. 00:36:09.433 --> 00:36:11.350 Because if all of us are in the same building, 00:36:11.350 --> 00:36:14.000 maybe that should be factored out as well. 00:36:14.000 --> 00:36:16.150 And if I really wanted to go down this rabbit hole, 00:36:16.150 --> 00:36:18.460 I could add another sheet or table called, 00:36:18.460 --> 00:36:24.430 say, Buildings and factor out also that 33 Oxford Street, give it a unique ID, 00:36:24.430 --> 00:36:26.560 and only store numbers. 00:36:26.560 --> 00:36:29.552 So in short, the more redundancy and the more waste 00:36:29.552 --> 00:36:32.260 that you end up having in your data, the more opportunities there 00:36:32.260 --> 00:36:34.150 are to, so to speak, normalize it. 00:36:34.150 --> 00:36:36.400 To factor out those commonalities and create 00:36:36.400 --> 00:36:39.580 relations between some pieces of data and others. 00:36:39.580 --> 00:36:43.180 And the convention in computing is to use, quite simply, numbers. 00:36:43.180 --> 00:36:43.680 Why? 
00:36:43.680 --> 00:36:48.340 Well, it turns out it's a lot more efficient to just relate some data 00:36:48.340 --> 00:36:51.250 to others by relying on small integers. 00:36:51.250 --> 00:36:53.020 Four bytes is not all that much. 00:36:53.020 --> 00:36:55.390 And in fact, inside of a computer CPU are 00:36:55.390 --> 00:36:58.390 small registers, tiny pieces of memory that 00:36:58.390 --> 00:37:00.940 can be used on the most basic of operations; 00:37:00.940 --> 00:37:04.330 additions, subtractions, and comparisons for equality. 00:37:04.330 --> 00:37:07.180 And so with small values like integers can you 00:37:07.180 --> 00:37:12.170 very quickly reassemble or relate some data to others. 00:37:12.170 --> 00:37:15.430 And so here we have a general principle of database design 00:37:15.430 --> 00:37:18.690 to normalize it by factoring things out. 00:37:18.690 --> 00:37:21.940 And so if we go back into our musical database, 00:37:21.940 --> 00:37:26.920 you can perhaps infer why the author of this data did that preemptively. 00:37:26.920 --> 00:37:31.990 They have their albums having album IDs, a number like 1, 2, and 3, a title, 00:37:31.990 --> 00:37:34.210 which is the actual title of that album. 00:37:34.210 --> 00:37:38.650 And then to associate those albums with artists, 00:37:38.650 --> 00:37:41.620 they've used not the artist's name but an ID. 00:37:41.620 --> 00:37:45.940 In this way can an album have the same artist as another 00:37:45.940 --> 00:37:48.430 without storing that artist's name twice. 00:37:48.430 --> 00:37:51.190 Moreover, if the artist happens to change his or her name, 00:37:51.190 --> 00:37:53.860 as is not uncommon in the musical world, you 00:37:53.860 --> 00:37:56.740 can change that name in just one place and not 00:37:56.740 --> 00:38:00.540 have to scour your tables for multiple copies. 
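That normalization can be sketched concretely: factor the repeated city out into its own table and relate users to it by a small integer ID. A minimal sketch in Python's built-in sqlite3 module; the schema and sample rows here are illustrative, not the lecture's actual data:

```python
import sqlite3

# Sketch: factoring a repeated city out into its own table and relating
# users to it by a small integer ID.  Schema and data are illustrative.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE cities (id INTEGER PRIMARY KEY, city TEXT, state TEXT, zip TEXT);
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, city_id INTEGER,
                        FOREIGN KEY (city_id) REFERENCES cities (id));
    INSERT INTO cities VALUES (1, 'Cambridge', 'MA', '02138');
    INSERT INTO users VALUES (1, 'David', 1), (2, 'Colleague', 1);
""")

# The city's text is stored once; each user stores only the integer ID.
# If the city's data ever changes, we update one row, not every user:
db.execute("UPDATE cities SET zip = '02139' WHERE id = 1")
row = db.execute("""
    SELECT users.name, cities.zip FROM users
    JOIN cities ON users.city_id = cities.id WHERE users.id = 2
""").fetchone()
print(row)  # ('Colleague', '02139')
```

Every user automatically sees the corrected zip, because the fact lives in exactly one place.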
00:38:00.540 --> 00:38:03.630 And so if we continue this logic, we'll see in more and more tables 00:38:03.630 --> 00:38:07.740 that we have this principle of naming the data but then assigning it an ID. 00:38:07.740 --> 00:38:10.230 And if you want to relate some of that data to another, 00:38:10.230 --> 00:38:14.520 you simply store the ID, not the actual values. 00:38:14.520 --> 00:38:19.140 Of course, this is decreasingly useful as we go, because now some of my data 00:38:19.140 --> 00:38:21.990 is in this table, and that, and this other table, and here. 00:38:21.990 --> 00:38:26.310 And so while very academically clean and refined, it doesn't anymore 00:38:26.310 --> 00:38:29.910 seem useful to the data scientist in front of his or her computer 00:38:29.910 --> 00:38:32.610 that just wants to answer questions about data. 00:38:32.610 --> 00:38:35.490 And yet it's all over the place, whereas before, with Excel, 00:38:35.490 --> 00:38:37.710 you could just pull up a window. 00:38:37.710 --> 00:38:39.510 But that's where SQL itself comes in. 00:38:39.510 --> 00:38:44.010 SQL does not just prescribe how to type your data 00:38:44.010 --> 00:38:46.530 but rather how to query it as well. 00:38:46.530 --> 00:38:49.010 And in this particular program here, DB Browser, 00:38:49.010 --> 00:38:52.320 I can actually go over to this tab here for Execute SQL. 00:38:52.320 --> 00:38:55.830 And I can begin to execute those actual commands, SELECT, 00:38:55.830 --> 00:39:00.030 and CREATE, and UPDATE, DELETE, and others and actually see the results. 00:39:00.030 --> 00:39:03.660 What SQL allows you to do is express yourself programmatically 00:39:03.660 --> 00:39:07.200 using a fairly small language, albeit new, that 00:39:07.200 --> 00:39:11.160 allows you to create temporary tables, just the results, just the result 00:39:11.160 --> 00:39:15.130 sets, so to speak, that you want, only those rows that you care about. 
00:39:15.130 --> 00:39:18.240 So for instance, if I want to see all of the albums in this database, 00:39:18.240 --> 00:39:21.897 well, in the world of spreadsheets, I would just double click in and peruse. 00:39:21.897 --> 00:39:24.480 But in the world of SQL, I'm actually going to type a command. 00:39:24.480 --> 00:39:28.740 I'm going to go ahead here and say SELECT star, for give me everything, 00:39:28.740 --> 00:39:33.417 from the table called Album, and then semicolon to finish my thought. 00:39:33.417 --> 00:39:36.000 I'm going to go ahead and click the graphical Play button here 00:39:36.000 --> 00:39:37.330 to execute this command. 00:39:37.330 --> 00:39:40.110 And you'll see suddenly that here are all of the albums 00:39:40.110 --> 00:39:45.930 apparently in this table, 347 of them in total in this particular database. 00:39:45.930 --> 00:39:48.330 And notice that all of the titles are in one column, 00:39:48.330 --> 00:39:53.190 the Album ID is to the left, and the Artist ID, of course, to the right. 00:39:53.190 --> 00:39:57.510 Well, if you're now curious who is the artist behind the album called 00:39:57.510 --> 00:40:00.240 For Those About to Rock, We Salute You, well, I 00:40:00.240 --> 00:40:02.670 can just make a mental note that the artist ID is one. 00:40:02.670 --> 00:40:03.540 And you know what? 00:40:03.540 --> 00:40:06.600 With SQL, it turns out you can use predicates. 00:40:06.600 --> 00:40:08.802 You can qualify the data you want. 00:40:08.802 --> 00:40:10.260 You don't have to say, give me all. 00:40:10.260 --> 00:40:11.820 You can say, give me this. 00:40:11.820 --> 00:40:13.210 So how do I do that? 00:40:13.210 --> 00:40:18.690 Well, I can actually say SELECT star from Artist, the other table, where-- 00:40:18.690 --> 00:40:25.450 and here comes my predicate-- artist ID equals one, a semicolon again to finish 00:40:25.450 --> 00:40:26.950 that thought, hit Play, and voila. 
00:40:26.950 --> 00:40:32.410 It turns out it's AC/DC, the artist behind that particular album. 00:40:32.410 --> 00:40:34.280 Of course, this felt a bit manual. 00:40:34.280 --> 00:40:36.790 And this seems no better than a spreadsheet wherein 00:40:36.790 --> 00:40:38.920 now more of the work seems to be on me. 00:40:38.920 --> 00:40:41.230 But SQL's more expressive than this. 00:40:41.230 --> 00:40:44.140 Not only can you create, and select, and update, and delete data, 00:40:44.140 --> 00:40:48.050 you can also join data from one table and another. 00:40:48.050 --> 00:40:51.760 So in fact, let me go ahead and do this a little more explicitly. 00:40:51.760 --> 00:40:56.650 I want to go ahead and select everything from that original album table, 00:40:56.650 --> 00:41:00.100 but I'd like to first join it with the artist table. 00:41:00.100 --> 00:41:00.880 How? 00:41:00.880 --> 00:41:03.850 Well, I want to join those two tables, this one and this one, 00:41:03.850 --> 00:41:07.810 kind of like this, conceptually, so to speak. 00:41:07.810 --> 00:41:09.010 On what fields? 00:41:09.010 --> 00:41:12.640 Well, in Album, I recall there's an artist ID. 00:41:12.640 --> 00:41:17.290 I want that to be equal to artist.artistid. 00:41:17.290 --> 00:41:21.730 In other words, if you imagine this hand to be albums and this hand to be artist 00:41:21.730 --> 00:41:25.510 and the tips of my fingers each represent those artist IDs, 00:41:25.510 --> 00:41:29.230 we essentially want the SQL to line up my fingers 00:41:29.230 --> 00:41:32.290 so that I have on the left my albums and I have on the right the artist. 00:41:32.290 --> 00:41:36.040 And every one of my rows now has both. 00:41:36.040 --> 00:41:39.100 Let me finish my thought with a semicolon here and hit Play. 00:41:39.100 --> 00:41:43.750 And voila, now we see even more information but all together. 
00:41:43.750 --> 00:41:47.020 We see that album number one, For Those About to Rock, We Salute You, 00:41:47.020 --> 00:41:51.370 has an artist ID of one and, to its right, not just that same artist ID but also a name. 00:41:51.370 --> 00:41:52.420 Well, what's happened? 00:41:52.420 --> 00:41:56.230 Well, both of these tables have kind of been concatenated together but joined 00:41:56.230 --> 00:42:00.010 intelligently such that the artist IDs in both tables 00:42:00.010 --> 00:42:02.463 line up on the left and the right. 00:42:02.463 --> 00:42:05.380 Of course, at this point, I really don't care about all these numbers. 00:42:05.380 --> 00:42:08.560 And I definitely don't need the temporary duplication of data, 00:42:08.560 --> 00:42:10.270 so I don't have to just keep saying star, 00:42:10.270 --> 00:42:13.480 which is the so-called wild card, which means give me everything. 00:42:13.480 --> 00:42:17.380 I can actually instead just say, give me the title of albums 00:42:17.380 --> 00:42:20.170 and the names of the artists by specifying with commas 00:42:20.170 --> 00:42:23.290 the names of the columns that I actually want. 00:42:23.290 --> 00:42:26.320 And if I now click Play, I get much simpler results, 00:42:26.320 --> 00:42:32.030 just the titles of albums and just the names of those artists. 00:42:32.030 --> 00:42:34.460 Now, how else can we express ourselves with SQL? 00:42:34.460 --> 00:42:37.820 Well, there are other keywords besides WHERE and besides JOIN. 00:42:37.820 --> 00:42:41.990 You can also group values by something and specify that they must all 00:42:41.990 --> 00:42:43.460 have something as well. 00:42:43.460 --> 00:42:46.910 For instance, let me go back to my data here and consider 00:42:46.910 --> 00:42:49.610 which artists have multiple albums. 
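That join, first with the wildcard and then with just the two columns we care about, can be sketched as follows; the miniature Album and Artist tables here are invented stand-ins for the real database:

```python
import sqlite3

# Toy versions of the Album and Artist tables.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE Artist (ArtistId INTEGER PRIMARY KEY, Name TEXT);
CREATE TABLE Album  (AlbumId INTEGER PRIMARY KEY, Title TEXT, ArtistId INTEGER);
INSERT INTO Artist VALUES (1, 'AC/DC'), (2, 'Accept');
INSERT INTO Album  VALUES (1, 'For Those About To Rock We Salute You', 1),
                          (2, 'Balls to the Wall', 2);
""")

# JOIN lines up rows from both tables wherever the artist IDs match, so each
# result row carries the album's columns and the artist's columns together.
rows = db.execute("""
    SELECT * FROM Album
    JOIN Artist ON Album.ArtistId = Artist.ArtistId;
""").fetchall()
print(rows)

# Naming just the columns we want drops the duplicated IDs from the result.
titles = db.execute("""
    SELECT Title, Name FROM Album
    JOIN Artist ON Album.ArtistId = Artist.ArtistId;
""").fetchall()
print(titles)
```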
00:42:49.610 --> 00:42:52.070 Well, if we consider the results that we had earlier do 00:42:52.070 --> 00:42:56.600 we have AC/DC as the artist behind For Those About to Rock, We Salute You, 00:42:56.600 --> 00:42:59.180 but also behind Let There Be Rock. 00:42:59.180 --> 00:43:02.400 Moreover, this band Accept has multiple albums as well. 00:43:02.400 --> 00:43:04.970 And if we scrolled further, we'd probably see others. 00:43:04.970 --> 00:43:08.600 So if we'd like to see those bands, those artists that have 00:43:08.600 --> 00:43:10.760 multiple albums, how can we do this? 00:43:10.760 --> 00:43:16.490 Well, what if I somehow collapsed all mentions of AC/DC into just one row, 00:43:16.490 --> 00:43:19.280 and next to that row I put an actual count? 00:43:19.280 --> 00:43:21.420 Well, I could refine this query as follows. 00:43:21.420 --> 00:43:24.260 I can say, yes, join these tables together, but then 00:43:24.260 --> 00:43:28.710 collapse them, if you will, based on grouping some common value. 00:43:28.710 --> 00:43:31.430 Let's go ahead now and group the data by name 00:43:31.430 --> 00:43:33.650 so that any artist that appears multiple times 00:43:33.650 --> 00:43:36.050 will effectively be collapsed into one. 00:43:36.050 --> 00:43:40.850 But I'd like to remember how many rows got collapsed into just one. 00:43:40.850 --> 00:43:43.850 And so rather than select the albums themselves this time, 00:43:43.850 --> 00:43:48.080 I'm going to select the artist's name and then the count thereof, 00:43:48.080 --> 00:43:52.190 thereby specifying show me the name and show me the count of the rows 00:43:52.190 --> 00:43:54.350 that we grouped by under that name. 00:43:54.350 --> 00:43:57.980 If I go ahead now and finish my thought and click Execute, 00:43:57.980 --> 00:44:02.390 I'll see that, indeed, AC/DC had two albums and Aaron Goldberg had one. 
00:44:02.390 --> 00:44:05.210 And if we keep scrolling, we'd see all of the bands' names 00:44:05.210 --> 00:44:09.350 that had one or more albums and the count for each of those. 00:44:09.350 --> 00:44:11.570 If I want to filter out now maybe those bands that 00:44:11.570 --> 00:44:14.570 only had one hit album in this database, I can instead say, 00:44:14.570 --> 00:44:17.840 you know what, go ahead and group by the group's name, 00:44:17.840 --> 00:44:22.400 but then show me only those bands having more than one album. 00:44:22.400 --> 00:44:25.790 Well, here, too, can I simply filter my results saying literally quite 00:44:25.790 --> 00:44:32.840 that, HAVING a COUNT of name greater than one, semicolon. 00:44:32.840 --> 00:44:36.735 Hitting Play now, you see the results immediately eliminate all 00:44:36.735 --> 00:44:38.360 of those bands that had just one album. 00:44:38.360 --> 00:44:40.610 And now if I scroll through, we'll see all those bands 00:44:40.610 --> 00:44:42.650 that had two or really more. 00:44:42.650 --> 00:44:45.740 And so with SQL can you begin to express yourself 00:44:45.740 --> 00:44:47.870 certainly more arcanely than you could with just 00:44:47.870 --> 00:44:52.310 a click of the mouse in a spreadsheet but ever so much more powerfully. 00:44:52.310 --> 00:44:55.340 But the key is to build these queries up piecemeal. 00:44:55.340 --> 00:44:58.160 Indeed, this now already looks quite complicated. 00:44:58.160 --> 00:45:02.990 But we began by just selecting all data and then refining, refining, refining, 00:45:02.990 --> 00:45:06.020 thereby working at this level and only getting into the weeds 00:45:06.020 --> 00:45:09.110 once we knew what we wanted. 00:45:09.110 --> 00:45:11.960 Notice now just how fast these operations have been. 00:45:11.960 --> 00:45:15.560 Indeed, at the bottom here do I see that 56 rows were returned. 00:45:15.560 --> 00:45:16.130 How fast? 00:45:16.130 --> 00:45:18.320 In one millisecond. 
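The grouping and filtering just described amount to a GROUP BY plus a HAVING clause. A sketch against a toy version of the data (the rows are invented for illustration):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE Artist (ArtistId INTEGER PRIMARY KEY, Name TEXT);
CREATE TABLE Album  (AlbumId INTEGER PRIMARY KEY, Title TEXT, ArtistId INTEGER);
INSERT INTO Artist VALUES (1, 'AC/DC'), (2, 'Accept'), (3, 'Aaron Goldberg');
INSERT INTO Album  VALUES
    (1, 'For Those About To Rock We Salute You', 1),
    (2, 'Restless and Wild', 2),
    (3, 'Let There Be Rock', 1),
    (4, 'Balls to the Wall', 2),
    (5, 'Worlds', 3);
""")

# GROUP BY collapses each artist's rows into one, COUNT remembers how many
# rows were collapsed, and HAVING filters on that aggregate after grouping
# (WHERE, by contrast, filters rows before they are grouped).
multi = db.execute("""
    SELECT Name, COUNT(*) FROM Album
    JOIN Artist ON Album.ArtistId = Artist.ArtistId
    GROUP BY Name
    HAVING COUNT(*) > 1
    ORDER BY Name;
""").fetchall()
print(multi)  # [('AC/DC', 2), ('Accept', 2)] -- the one-album artist is gone
```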
00:45:18.320 --> 00:45:20.450 And indeed, even though it might take me longer 00:45:20.450 --> 00:45:23.450 to describe in English what it is I want, in fact, 00:45:23.450 --> 00:45:27.290 the computer can find this data so quickly. 00:45:27.290 --> 00:45:30.080 But that's because we've done at least one thing already. 00:45:30.080 --> 00:45:33.290 This data is already organized with these proper types. 00:45:33.290 --> 00:45:36.650 And it also has a few other key characteristics as well. 00:45:36.650 --> 00:45:38.990 When storing your data in a SQL database, 00:45:38.990 --> 00:45:41.450 you're also asked to answer some questions. 00:45:41.450 --> 00:45:44.900 For every table you're asked to specify effectively what, 00:45:44.900 --> 00:45:47.900 if any, is this table's primary key. 00:45:47.900 --> 00:45:50.810 These are key words in SQL that you can apply 00:45:50.810 --> 00:45:53.690 to some column that says, hey, database, this column 00:45:53.690 --> 00:45:59.750 is my primary value that uniquely identifies every row in this table. 00:45:59.750 --> 00:46:02.420 In the context then of our user spreadsheet 00:46:02.420 --> 00:46:05.930 with which we began this discussion, that identifier for City 00:46:05.930 --> 00:46:07.640 was a primary key. 00:46:07.640 --> 00:46:10.370 I might very well have used a city's name 00:46:10.370 --> 00:46:12.500 as unique or perhaps even the zip code. 00:46:12.500 --> 00:46:15.050 But far more efficient, especially if you 00:46:15.050 --> 00:46:17.380 want to avoid ambiguities or duplication, 00:46:17.380 --> 00:46:19.130 is to just use that integer. 00:46:19.130 --> 00:46:23.420 And so here a primary key is almost always a numeric value, at least 00:46:23.420 --> 00:46:27.140 in the most optimized layouts of data. 00:46:27.140 --> 00:46:29.960 But it guarantees to the database that there will be 00:46:29.960 --> 00:46:32.480 no duplicates on this particular value. 
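That no-duplicates guarantee is enforced by the database itself, not by programmer discipline. A quick sketch, using a hypothetical City table like the one from the earlier spreadsheet discussion:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE City (CityId INTEGER PRIMARY KEY, Name TEXT)")
db.execute("INSERT INTO City VALUES (1, 'Cambridge')")

# A second row reusing primary key 1 is rejected outright by the database.
try:
    db.execute("INSERT INTO City VALUES (1, 'Boston')")
    duplicated = True
except sqlite3.IntegrityError:
    duplicated = False

print(duplicated)  # False: the duplicate never made it in
print(db.execute("SELECT COUNT(*) FROM City").fetchone()[0])  # 1
```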
00:46:32.480 --> 00:46:38.270 But more powerfully, you can define in one table a column to be a primary key, 00:46:38.270 --> 00:46:43.490 but then in another table that same value to be a so-called foreign key. 00:46:43.490 --> 00:46:47.420 In other words, throughout this example in the actual SQL database, 00:46:47.420 --> 00:46:50.360 I had Albums in one table and Artists in others. 00:46:50.360 --> 00:46:54.050 And that Artist table had an Artist ID column 00:46:54.050 --> 00:46:57.020 that was within that table known as a primary key. 00:46:57.020 --> 00:47:00.800 But when you saw that artist ID in the Albums table, 00:47:00.800 --> 00:47:04.370 it was contextually there a foreign key. 00:47:04.370 --> 00:47:08.000 Now, beyond semantics, this is an actual and valuable property. 00:47:08.000 --> 00:47:10.490 It ensures that the database knows how to link 00:47:10.490 --> 00:47:13.430 those two tables and those two columns efficiently. 00:47:13.430 --> 00:47:17.810 Moreover, you have even fancier features available to you when you declare keys. 00:47:17.810 --> 00:47:19.910 You can also tell the database, you know what? 00:47:19.910 --> 00:47:24.800 If I ever delete this artist, go ahead and delete all of that artist's albums 00:47:24.800 --> 00:47:25.310 as well. 00:47:25.310 --> 00:47:27.530 And you can configure a database automatically 00:47:27.530 --> 00:47:31.520 to have this so-called cascade effect whereby data is updated 00:47:31.520 --> 00:47:36.320 and your data is consistent at the end of the day based on those relations, 00:47:36.320 --> 00:47:37.220 if you will. 00:47:37.220 --> 00:47:40.640 Now, in columns of data can you also specify 00:47:40.640 --> 00:47:42.770 that every value's got to be unique. 00:47:42.770 --> 00:47:45.260 It doesn't necessarily need to be your primary key, 00:47:45.260 --> 00:47:47.480 but it might still be unique. 00:47:47.480 --> 00:47:48.440 Like what? 
00:47:48.440 --> 00:47:51.650 Well, in our Users table that we were creating on the fly, 00:47:51.650 --> 00:47:54.200 an email address might, by human convention, 00:47:54.200 --> 00:47:57.200 be unique, assuming I don't share it with someone else. 00:47:57.200 --> 00:48:01.070 But using an email address, multiple characters, many possible characters, 00:48:01.070 --> 00:48:04.340 tends not to be the most efficient way to search on data. 00:48:04.340 --> 00:48:08.540 So even in my Users table might I have added for best practice 00:48:08.540 --> 00:48:13.280 a numeric column as well, probably called ID, as my primary key. 00:48:13.280 --> 00:48:17.330 But I might still specify when moving that data from my Google spreadsheet 00:48:17.330 --> 00:48:20.030 into a SQL database that, you know what? 00:48:20.030 --> 00:48:23.330 Please ensure that this Email column's unique 00:48:23.330 --> 00:48:25.970 so that I or some other programmer doesn't accidentally 00:48:25.970 --> 00:48:28.760 insert duplicate data into this table. 00:48:28.760 --> 00:48:31.340 And moreover, the database then can search it 00:48:31.340 --> 00:48:35.180 more efficiently because it knows how many, if any, there 00:48:35.180 --> 00:48:36.380 are of any one value. 00:48:36.380 --> 00:48:39.590 There's one and only one maximally. 00:48:39.590 --> 00:48:42.830 Lastly, there's this keyword here, index. 00:48:42.830 --> 00:48:46.610 Well, the other feature you get so powerfully from proper databases 00:48:46.610 --> 00:48:50.180 is the ability to search and sort values efficiently. 00:48:50.180 --> 00:48:54.020 But the database doesn't know a priori how to do either 00:48:54.020 --> 00:48:55.250 on the data you care about. 00:48:55.250 --> 00:48:58.820 Because only if you tell the database what data you plan to search on 00:48:58.820 --> 00:49:03.210 and sort frequently can it help you in advance. 
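The foreign key, cascading delete, and uniqueness constraints just described can all be declared when creating the tables. A sketch in SQLite; the Users table and its Email column are the hypothetical ones from this discussion, and note that SQLite enforces foreign keys only when asked via a pragma:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when asked
db.executescript("""
CREATE TABLE Artist (ArtistId INTEGER PRIMARY KEY, Name TEXT);
CREATE TABLE Album (
    AlbumId  INTEGER PRIMARY KEY,
    Title    TEXT,
    ArtistId INTEGER,
    FOREIGN KEY (ArtistId) REFERENCES Artist(ArtistId) ON DELETE CASCADE
);
CREATE TABLE Users (
    Id    INTEGER PRIMARY KEY,
    Email TEXT UNIQUE   -- the database refuses duplicate addresses
);
INSERT INTO Artist VALUES (1, 'AC/DC');
INSERT INTO Album  VALUES (1, 'Let There Be Rock', 1);
""")

# Deleting the artist cascades to that artist's albums automatically.
db.execute("DELETE FROM Artist WHERE ArtistId = 1")
albums_left = db.execute("SELECT COUNT(*) FROM Album").fetchone()[0]
print(albums_left)  # 0

# And UNIQUE catches accidental duplicate inserts before they happen.
db.execute("INSERT INTO Users (Email) VALUES ('someone@example.com')")
caught = False
try:
    db.execute("INSERT INTO Users (Email) VALUES ('someone@example.com')")
except sqlite3.IntegrityError:
    caught = True
print(caught)  # True: the duplicate email was rejected
```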
00:49:03.210 --> 00:49:08.000 And so if when creating a database table you tell the database server, go ahead 00:49:08.000 --> 00:49:12.590 and index this column, what it will do is use a database structure, 00:49:12.590 --> 00:49:17.210 a tree structure not unlike our so-called binary search trees, 00:49:17.210 --> 00:49:20.960 that pulls all the data up in an artist's rendition thereof, 00:49:20.960 --> 00:49:23.720 thereby ensuring that it doesn't take as many steps 00:49:23.720 --> 00:49:27.560 to find some email address or something else because you 00:49:27.560 --> 00:49:29.430 have indexed that particular column. 00:49:29.430 --> 00:49:32.310 It won't store it linearly top to bottom or left to right. 00:49:32.310 --> 00:49:34.060 It will store it in a two-dimensional tree 00:49:34.060 --> 00:49:38.500 structure of some sort, often known as a B-tree, that 00:49:38.500 --> 00:49:44.260 allows you to grab the data in hopefully logarithmic and not linear time. 00:49:44.260 --> 00:49:46.420 Well, it turns out there are even more features you 00:49:46.420 --> 00:49:49.840 get from actual databases like SQLite. 00:49:49.840 --> 00:49:53.530 Well, you have the ability to specify when creating a table, please go ahead 00:49:53.530 --> 00:49:56.050 and auto increment this column. 00:49:56.050 --> 00:49:57.380 Well, what does that mean? 00:49:57.380 --> 00:49:59.320 Well, I very manually a moment ago assigned 00:49:59.320 --> 00:50:02.230 Cambridge the unique identifier of one. 00:50:02.230 --> 00:50:04.330 But why should I, the programmer, even have 00:50:04.330 --> 00:50:09.280 to worry or care about what the unique values of my inputs are? 00:50:09.280 --> 00:50:11.560 I just need that key to exist. 00:50:11.560 --> 00:50:14.920 I do not need to care about what that value is, just that it exists 00:50:14.920 --> 00:50:16.030 and it's unique. 
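The effect of indexing a column described above can be seen by asking SQLite for its query plan before and after creating the index. A sketch with a hypothetical Users table; the exact wording of the plan varies by SQLite version:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE Users (Id INTEGER PRIMARY KEY, Email TEXT)")
db.executemany("INSERT INTO Users (Email) VALUES (?)",
               [(f"user{i}@example.com",) for i in range(1000)])

# Without an index, a search by email must scan every row linearly.
before = db.execute("EXPLAIN QUERY PLAN SELECT * FROM Users WHERE Email = ?",
                    ("user500@example.com",)).fetchall()
print(before)  # plan mentions a SCAN of Users

# Telling the database to index the column builds a B-tree over it...
db.execute("CREATE INDEX idx_users_email ON Users (Email)")

# ...after which the same query walks that tree in logarithmic time.
plan = db.execute("EXPLAIN QUERY PLAN SELECT * FROM Users WHERE Email = ?",
                  ("user500@example.com",)).fetchall()
print(plan)  # plan mentions a SEARCH using idx_users_email
```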
00:50:16.030 --> 00:50:18.740 So you can tell the database on its own, please go ahead 00:50:18.740 --> 00:50:21.490 and, any time I add a new row to this table, 00:50:21.490 --> 00:50:25.010 increment that value automatically. 00:50:25.010 --> 00:50:26.980 You can also specify to a database, please 00:50:26.980 --> 00:50:31.480 ensure that no values in my database are null, that is, empty, 00:50:31.480 --> 00:50:35.560 thereby ensuring that a bug in your code or some missing user input 00:50:35.560 --> 00:50:38.110 doesn't accidentally put into your database 00:50:38.110 --> 00:50:41.860 a row that's only sparsely filled with real data. 00:50:41.860 --> 00:50:44.980 The database can help you with these things just as Python can as well, 00:50:44.980 --> 00:50:49.980 but it's a final layer of defense before your data. 00:50:49.980 --> 00:50:51.370 And then functions as well. 00:50:51.370 --> 00:50:53.830 SQL itself is a programming language. 00:50:53.830 --> 00:50:56.950 It might not necessarily have as high a ceiling as something like Python, 00:50:56.950 --> 00:50:58.120 as much functionality. 00:50:58.120 --> 00:51:01.480 But built into SQL are any number of functions. 00:51:01.480 --> 00:51:03.940 If you want to select the average revenue 00:51:03.940 --> 00:51:08.680 across any number of purchase orders, you can use the average function. 00:51:08.680 --> 00:51:13.900 And in my SQL query can I select data but pass it into one of these functions 00:51:13.900 --> 00:51:16.960 and get back an answer without having to paste it into, say, 00:51:16.960 --> 00:51:19.120 a spreadsheet, let alone calculator. 00:51:19.120 --> 00:51:20.920 I can count rows just as I did. 00:51:20.920 --> 00:51:24.520 I wanted to count the number of albums that a given artist had, 00:51:24.520 --> 00:51:27.220 and COUNT was a function supported by SQL. 00:51:27.220 --> 00:51:28.540 You can get maxes and mins. 
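Auto-incrementing keys, NOT NULL, and built-in functions like AVG, COUNT, MAX, and MIN can all be sketched together; the Invoices table and its totals here are invented for illustration:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE Invoices (
    Id    INTEGER PRIMARY KEY AUTOINCREMENT,  -- the database assigns the IDs
    Total REAL NOT NULL                       -- empty values are rejected
);
""")

# No ID supplied: the database increments it on its own with each insert.
for total in (10.0, 20.0, 30.0):
    db.execute("INSERT INTO Invoices (Total) VALUES (?)", (total,))
ids = db.execute("SELECT Id FROM Invoices").fetchall()
print(ids)  # [(1,), (2,), (3,)]

# NOT NULL is that final layer of defense against sparsely filled rows.
try:
    db.execute("INSERT INTO Invoices (Total) VALUES (NULL)")
except sqlite3.IntegrityError:
    print("NULL rejected")

# Built-in functions compute answers without a spreadsheet or calculator.
stats = db.execute(
    "SELECT AVG(Total), COUNT(*), MAX(Total), MIN(Total) FROM Invoices"
).fetchone()
print(stats)  # (20.0, 3, 30.0, 10.0)
```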
00:51:28.540 --> 00:51:31.330 You can get summations of values and so many more 00:51:31.330 --> 00:51:33.600 features built into this language. 00:51:33.600 --> 00:51:35.350 And while the tool you might use might not 00:51:35.350 --> 00:51:39.850 be DB Browser, perhaps it's just a textual interface or something 00:51:39.850 --> 00:51:42.190 even more graphical, it ultimately is just 00:51:42.190 --> 00:51:47.110 executing on your behalf the SQL queries and handing them off to the database 00:51:47.110 --> 00:51:48.415 for execution. 00:51:48.415 --> 00:51:51.040 Now, with all of these features that you get with the database, 00:51:51.040 --> 00:51:52.850 it all sounds too good to be true. 00:51:52.850 --> 00:51:54.910 You can scale, you can eliminate redundancy, 00:51:54.910 --> 00:51:57.670 and you can still select all the data you want. 00:51:57.670 --> 00:52:00.610 But unfortunately, you have to start to think harder 00:52:00.610 --> 00:52:02.470 about the design of your system. 00:52:02.470 --> 00:52:06.940 Because databases are sometimes vulnerable to mistakes, if you will. 00:52:06.940 --> 00:52:09.310 Consider, for instance, something like Twitter 00:52:09.310 --> 00:52:12.790 that tends to keep track of how many times something's retweeted. 00:52:12.790 --> 00:52:14.890 Or consider an app like Instagram, which keeps 00:52:14.890 --> 00:52:19.990 track of how many likes or upvotes or views some post has gotten. 00:52:19.990 --> 00:52:23.890 On the most popular or viral of media, those counters, 00:52:23.890 --> 00:52:26.980 those integers might be getting incremented ever so quickly. 00:52:26.980 --> 00:52:31.090 If you and I both happen to view or like something at nearly the same time, 00:52:31.090 --> 00:52:35.140 well, that interface from us into the system 00:52:35.140 --> 00:52:38.470 might actually trigger some instruction on some server 00:52:38.470 --> 00:52:43.210 somewhere to tell Instagram's database to increment some value. 
00:52:43.210 --> 00:52:46.730 But how does a database go about incrementing a value? 00:52:46.730 --> 00:52:50.650 Well, if the value of views or the value of counts 00:52:50.650 --> 00:52:54.790 is somehow stored in a database, a column of type integer, 00:52:54.790 --> 00:52:59.020 and you go ahead and execute a SQL SELECT in order to get that value, 00:52:59.020 --> 00:53:03.730 and, for instance, 100 people before me have liked some post, 00:53:03.730 --> 00:53:07.640 well, the value of that result comes back as 100. 00:53:07.640 --> 00:53:10.180 I then do some math in my code, perhaps Python. 00:53:10.180 --> 00:53:15.490 I increment the 100 to 101, and then I use a SQL UPDATE, as you can, 00:53:15.490 --> 00:53:18.520 to push the data back into the database. 00:53:18.520 --> 00:53:21.220 But suppose both you and I anywhere in the world 00:53:21.220 --> 00:53:25.240 both happen to like a post at the same or nearly the same time, 00:53:25.240 --> 00:53:29.320 as can certainly happen when posts are especially popular. 00:53:29.320 --> 00:53:33.040 Unfortunately, a computer can sometimes do multiple things 00:53:33.040 --> 00:53:35.440 at once or at least in such rapid succession 00:53:35.440 --> 00:53:39.250 that it appears to be at the same time, but a race of sorts 00:53:39.250 --> 00:53:42.650 can happen, a race condition, if you will, as follows. 
00:53:42.650 --> 00:53:45.550 If both my button and your button are pressed 00:53:45.550 --> 00:53:50.830 at nearly the same time and that induces execution of code on Instagram's server 00:53:50.830 --> 00:53:54.130 that selects for both of us the current count of views, 00:53:54.130 --> 00:53:58.630 suppose that both of the threads, so to speak, both of the SQL operations 00:53:58.630 --> 00:54:02.620 that select that data both come back with the value 100. 00:54:02.620 --> 00:54:05.800 Each of the blocks of code serving our requests 00:54:05.800 --> 00:54:10.360 goes ahead and increments that value to 101 and then, via SQL UPDATE, 00:54:10.360 --> 00:54:13.690 pushes that value back to the database. 00:54:13.690 --> 00:54:19.510 Unfortunately, because both you and I induced an evaluation of that math 00:54:19.510 --> 00:54:22.900 at nearly the same time, what the database might end up storing 00:54:22.900 --> 00:54:26.470 is not 102 but 101. 00:54:26.470 --> 00:54:32.110 In other words, if two people's input is triggering a race to update data, 00:54:32.110 --> 00:54:37.600 the database had better keep track of who asked for that update and when. 00:54:37.600 --> 00:54:39.250 Otherwise, you lose data. 00:54:39.250 --> 00:54:43.840 Now, in the case of tweets or likes, it's probably not all that problematic. 00:54:43.840 --> 00:54:46.510 Though, frankly, that is their business. 00:54:46.510 --> 00:54:50.380 But you can certainly imagine that with banks or financial institutions, where 00:54:50.380 --> 00:54:53.140 the numbers matter ever so more, you certainly 00:54:53.140 --> 00:54:56.800 don't want to accidentally lose track of some dollars. 00:54:56.800 --> 00:55:02.310 And so how do we go about solving this in the case of a SQL database? 00:55:02.310 --> 00:55:06.370 Well, it turns out that there are fairly fundamental primitives or solutions 00:55:06.370 --> 00:55:07.360 you can use. 
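The lost update just described can be simulated in a few lines: two requests each read the stale counter before either writes its increment back. The Posts table and its numbers are invented for illustration:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE Posts (Id INTEGER PRIMARY KEY, Likes INTEGER)")
db.execute("INSERT INTO Posts VALUES (1, 100)")

# Two requests interleave: both SELECT the counter before either UPDATEs it.
yours = db.execute("SELECT Likes FROM Posts WHERE Id = 1").fetchone()[0]  # 100
mine  = db.execute("SELECT Likes FROM Posts WHERE Id = 1").fetchone()[0]  # 100

# Each request increments its own stale copy and pushes it back...
db.execute("UPDATE Posts SET Likes = ? WHERE Id = 1", (yours + 1,))
db.execute("UPDATE Posts SET Likes = ? WHERE Id = 1", (mine + 1,))

# ...so one of the two likes is silently lost: 101 is stored, not 102.
final = db.execute("SELECT Likes FROM Posts WHERE Id = 1").fetchone()[0]
print(final)  # 101
```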
00:55:07.360 --> 00:55:12.790 Consider a metaphor in the real world, such as, say, a familiar refrigerator. 00:55:12.790 --> 00:55:15.190 And suppose that you and your significant other 00:55:15.190 --> 00:55:19.300 happened to both like something to drink at the end of the day, like some milk. 00:55:19.300 --> 00:55:22.410 And so you go ahead when you get home, and the other person's not, 00:55:22.410 --> 00:55:25.255 and you open the fridge and you see, oh, darn it, we're out of milk. 00:55:25.255 --> 00:55:27.700 And so you close the fridge and you head downstairs 00:55:27.700 --> 00:55:29.530 and you walk to the nearest store. 00:55:29.530 --> 00:55:32.440 Meanwhile, that other person comes home and, wanting some milk, 00:55:32.440 --> 00:55:36.430 opens the fridge, and darn it if we aren't out of milk as well. 00:55:36.430 --> 00:55:39.670 And so that person now heads out, perhaps in a different car, 00:55:39.670 --> 00:55:43.420 in a different route, and heads to the store, some other store to get milk. 00:55:43.420 --> 00:55:45.490 Fast forward some amount of time and both of you 00:55:45.490 --> 00:55:49.390 come home, and darn it if you don't now have twice as much milk as you need, 00:55:49.390 --> 00:55:50.980 and it does go bad. 00:55:50.980 --> 00:55:55.270 And so you've both ended up buying milk when really only one of you needed to. 00:55:55.270 --> 00:55:57.520 And this is similar in spirit, but now you've 00:55:57.520 --> 00:56:02.440 got more data than you actually wanted, but it's not the right amount of data. 00:56:02.440 --> 00:56:03.830 So why did that happen? 00:56:03.830 --> 00:56:08.170 Well, both of you, like Instagram, inspected the state of some value 00:56:08.170 --> 00:56:10.900 and made a decision on it before the other person 00:56:10.900 --> 00:56:14.170 was done acting on that information. 
00:56:14.170 --> 00:56:17.170 So in our very real world of milk, how could you 00:56:17.170 --> 00:56:22.540 go about avoiding that conflict, that race, to restock the fridge? 00:56:22.540 --> 00:56:26.290 Well, you could quite simply grab a pen and paper and leave a note, 00:56:26.290 --> 00:56:29.260 so to speak, on the fridge telling someone else, gone for milk, 00:56:29.260 --> 00:56:31.390 and hopefully they then would not do the same. 00:56:31.390 --> 00:56:35.290 Or perhaps more dramatically you could lock the refrigerator in some sense 00:56:35.290 --> 00:56:38.740 so that they can't even get into it and inspect that state. 00:56:38.740 --> 00:56:41.800 But ultimately, you need your act of checking the fridge 00:56:41.800 --> 00:56:44.950 and restocking it to be what we'll call atomic. 00:56:44.950 --> 00:56:48.880 And databases can and hopefully do provide atomicity, 00:56:48.880 --> 00:56:54.670 that property, the ability to do multiple things together or not at all 00:56:54.670 --> 00:56:58.300 but not be interrupted by someone else's work. 00:56:58.300 --> 00:57:00.160 And in fact, in the database world, these 00:57:00.160 --> 00:57:04.270 are generally known as actual locks whereby you say to the database, 00:57:04.270 --> 00:57:07.630 don't let anyone else write to this table or row 00:57:07.630 --> 00:57:11.780 until I am ready to release or unlock it. 00:57:11.780 --> 00:57:14.530 That, of course, though, tends to be a very heavy-handed solution. 00:57:14.530 --> 00:57:17.140 Say don't let anyone else touch this data. 00:57:17.140 --> 00:57:19.660 Better to do it on a more fine-grained control 00:57:19.660 --> 00:57:21.940 so that you don't slow your whole system down. 
00:57:21.940 --> 00:57:25.660 And so SQL databases tend to support what are more formally known 00:57:25.660 --> 00:57:30.530 as transactions, whereby you can execute one or more commands 00:57:30.530 --> 00:57:33.430 back to back but make sure that all of them 00:57:33.430 --> 00:57:36.220 go through at once before, say, the commands 00:57:36.220 --> 00:57:41.260 that someone else's input induced are allowed to get executed. 00:57:41.260 --> 00:57:43.960 Now, honestly, even in the most high-tech 00:57:43.960 --> 00:57:47.260 of places like Instagram and Twitter, this is a hard problem. 00:57:47.260 --> 00:57:51.490 Because at some point, even waiting for my operations 00:57:51.490 --> 00:57:54.290 to complete before yours can go in can be a challenge 00:57:54.290 --> 00:57:56.210 and a bottleneck for everyone else. 00:57:56.210 --> 00:58:00.360 And so in the case of the most viral of posts, what can systems these days do? 00:58:00.360 --> 00:58:03.100 Well, you could just kind of wait and write 00:58:03.100 --> 00:58:06.790 that data back to the database so that it is updated eventually. 00:58:06.790 --> 00:58:09.100 And indeed, another property of databases 00:58:09.100 --> 00:58:13.870 is known as just that, eventual consistency, a property that says, 00:58:13.870 --> 00:58:19.900 don't lose any data, but only eventually make sure it's reflected on the server. 00:58:19.900 --> 00:58:23.170 Eventually get the value right, but do get it right. 00:58:23.170 --> 00:58:25.720 And so what Instagram and Twitter and others might do 00:58:25.720 --> 00:58:30.690 is just cache or buffer that data, waiting until things have quieted down 00:58:30.690 --> 00:58:35.140 'til the post is no longer viral or most users have gone to sleep. 00:58:35.140 --> 00:58:38.050 Now, that alone might not be the best of solutions, 00:58:38.050 --> 00:58:43.990 but it avoids having to get the highest powered and most expensive hardware. 
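One common fix for the lost update is to make the read-and-increment a single atomic statement, and, more generally, to wrap related statements in a transaction. A sketch; isolation_level=None is a sqlite3-specific detail that turns off the module's implicit transactions so we can issue BEGIN and COMMIT ourselves:

```python
import sqlite3

db = sqlite3.connect(":memory:", isolation_level=None)
db.execute("CREATE TABLE Posts (Id INTEGER PRIMARY KEY, Likes INTEGER)")
db.execute("INSERT INTO Posts VALUES (1, 100)")

# Fix one: a single atomic statement. The database itself does the read and
# the increment, so no stale value is ever handed back out to our code.
db.execute("UPDATE Posts SET Likes = Likes + 1 WHERE Id = 1")
db.execute("UPDATE Posts SET Likes = Likes + 1 WHERE Id = 1")

# Fix two, more generally: a transaction groups statements so that they all
# take effect together or not at all, with no one else's work interleaved.
db.execute("BEGIN")
db.execute("UPDATE Posts SET Likes = Likes - 1 WHERE Id = 1")
db.execute("UPDATE Posts SET Likes = Likes + 1 WHERE Id = 1")
db.execute("COMMIT")

likes = db.execute("SELECT Likes FROM Posts WHERE Id = 1").fetchone()[0]
print(likes)  # 102: both likes counted this time
```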
00:58:43.990 --> 00:58:46.720 Of course, in other contexts, that might be the better solution. 00:58:46.720 --> 00:58:48.760 In the world of finance, sometimes it comes 00:58:48.760 --> 00:58:53.140 down to the actual length of cables or distance from some server to another 00:58:53.140 --> 00:58:55.930 to ensure that the data gets there so fast 00:58:55.930 --> 00:58:59.030 that you don't run into these sorts of challenges. 00:58:59.030 --> 00:59:01.810 So databases can solve this, but the developers and designers 00:59:01.810 --> 00:59:07.740 that use those databases need to know how to do it and that they should. 00:59:07.740 --> 00:59:10.230 Lastly, there's another challenge as well, 00:59:10.230 --> 00:59:12.780 unfortunately all too commonly done these days 00:59:12.780 --> 00:59:17.280 because folks just don't defend against it via common mechanisms. 00:59:17.280 --> 00:59:21.810 It turns out that a bad actor somewhere on the internet or your own network 00:59:21.810 --> 00:59:24.960 can potentially, if you're not careful, trick 00:59:24.960 --> 00:59:29.400 a database into executing commands that you didn't intend. 00:59:29.400 --> 00:59:32.160 For instance, suppose in the context of Python 00:59:32.160 --> 00:59:34.740 you have some code that looks a bit like this. 00:59:34.740 --> 00:59:37.280 Here is a program written in a mix of pseudocode 00:59:37.280 --> 00:59:40.380 and Python that's designed to allow a user to input 00:59:40.380 --> 00:59:43.560 the title of an album for which they want to search. 00:59:43.560 --> 00:59:48.720 And so here I use the Python function INPUT to prompt the user for just that. 00:59:48.720 --> 00:59:51.570 On the left-hand side do I declare a variable called Title, and then 00:59:51.570 --> 00:59:55.800 assign, from right to left, the user's response to that variable. 
00:59:55.800 --> 00:59:59.610 Then suppose for the sake of discussion there is some function called EXECUTE 00:59:59.610 --> 01:00:05.020 whose purpose in life is to take input that itself represents a SQL command. 01:00:05.020 --> 01:00:08.430 That SQL command might be this, SELECT star from Album 01:00:08.430 --> 01:00:11.190 where Title equals something. 01:00:11.190 --> 01:00:12.490 Now, what is that something? 01:00:12.490 --> 01:00:15.270 Well, if I have the user's input in a variable called Title, 01:00:15.270 --> 01:00:19.680 I can use the plus operator in Python, not to add but concatenate 01:00:19.680 --> 01:00:24.390 two strings together, quoting them singly and completing that thought. 01:00:24.390 --> 01:00:28.800 The problem, though, with SQL is that users are not really to be trusted. 01:00:28.800 --> 01:00:32.580 And whether the user's input is coming from a keyboard on a Mac, 01:00:32.580 --> 01:00:36.990 or PC, or perhaps, more compellingly, from an app or website, 01:00:36.990 --> 01:00:40.260 you probably should not trust all your users. 01:00:40.260 --> 01:00:44.880 Because suppose that your user typed in not the album name for which they 01:00:44.880 --> 01:00:48.090 want to search, Let There Be Rock, but rather they type something 01:00:48.090 --> 01:00:51.990 like Let There Be Rock, semicolon, DELETE, 01:00:51.990 --> 01:00:56.160 thereby using SQL's apparent DELETE command in order 01:00:56.160 --> 01:01:00.360 to trick your database into executing not one but two 01:01:00.360 --> 01:01:02.430 commands, a SELECT and DELETE. 01:01:02.430 --> 01:01:05.130 And indeed, this is what's known as a SQL injection 01:01:05.130 --> 01:01:08.100 attack, the ability for an adversary, a bad actor 01:01:08.100 --> 01:01:12.060 out there, to somehow trick your database and your code 01:01:12.060 --> 01:01:16.500 into executing some command that you didn't intend. 01:01:16.500 --> 01:01:17.690 How is that possible? 
01:01:17.690 --> 01:01:21.330 Well, some of these characters are dangerous, so to speak. 01:01:21.330 --> 01:01:25.680 A semicolon in SQL tends to separate one command from another. 01:01:25.680 --> 01:01:27.180 It finishes your thought. 01:01:27.180 --> 01:01:30.930 And if you yourself don't anticipate that some human, this bad actor, 01:01:30.930 --> 01:01:33.960 might type in themselves a semicolon when they really 01:01:33.960 --> 01:01:38.040 shouldn't be typing SQL at all, you might mistake that semicolon 01:01:38.040 --> 01:01:42.330 for the actual terminus between one command and another. 01:01:42.330 --> 01:01:45.240 And if you just blindly pass it into your server 01:01:45.240 --> 01:01:49.080 and let it execute as usual, you might execute not just that SELECT 01:01:49.080 --> 01:01:51.750 but that DELETE or anything else as well. 01:01:51.750 --> 01:01:55.260 And in this way can an adversary not only delete data from your database 01:01:55.260 --> 01:01:59.430 but maybe select more than you intended, or update or insert. 01:01:59.430 --> 01:02:03.030 It's ultimately up to you to defend against these threats. 01:02:03.030 --> 01:02:04.680 So, how? 01:02:04.680 --> 01:02:08.610 Well, it turns out that there are libraries, code written by other people 01:02:08.610 --> 01:02:11.460 that, frankly, via very easy-to-use functions, 01:02:11.460 --> 01:02:17.760 just make it easy to sanitize or scrub, so to speak, user's input. 01:02:17.760 --> 01:02:20.700 What do these libraries or these functions typically do? 01:02:20.700 --> 01:02:24.280 Honestly, they just escape, so to speak, these dangerous characters. 
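In Python's sqlite3 module, that escaping is done for you when you use a ? placeholder rather than concatenating the user's input into the command string. A sketch with an invented one-row Album table:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE Album (AlbumId INTEGER PRIMARY KEY, Title TEXT)")
db.execute("INSERT INTO Album VALUES (1, 'Let There Be Rock')")

malicious = "Let There Be Rock'; DELETE FROM Album; --"

# Building the query by concatenation is exactly the injection hazard
# described above (sqlite3's execute() happens to refuse a second statement,
# but many database drivers do not):
unsafe = "SELECT * FROM Album WHERE Title = '" + malicious + "';"

# The safe habit: a ? placeholder, so the driver escapes the input and the
# whole string is treated as a value to search for, never as SQL.
rows = db.execute("SELECT * FROM Album WHERE Title = ?", (malicious,)).fetchall()
print(rows)  # []: a nonsensical title matches nothing, and no DELETE runs

remaining = db.execute("SELECT COUNT(*) FROM Album").fetchone()[0]
print(remaining)  # 1: the row survived
```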
01:02:24.280 --> 01:02:27.750 Something like a semicolon or perhaps a single apostrophe that 01:02:27.750 --> 01:02:32.230 might, in SQL, have some special and dangerous potential meaning, 01:02:32.230 --> 01:02:36.330 they just escape them as by putting a backslash, a special character in front 01:02:36.330 --> 01:02:40.080 of them so that if the human were to type in Let There Be Rock, 01:02:40.080 --> 01:02:46.020 semicolon, DELETE, that would actually be interpreted safely by your database 01:02:46.020 --> 01:02:51.840 as a search for an album called Let There Be Rock, semicolon, DELETE, 01:02:51.840 --> 01:02:55.380 which of course most likely is not the name of an album. 01:02:55.380 --> 01:03:00.060 So that query would probably not return or select any results. 01:03:00.060 --> 01:03:02.280 But more importantly, it would not be tricked 01:03:02.280 --> 01:03:05.280 into executing two SQL commands. 01:03:05.280 --> 01:03:10.200 Rather, it would execute only the SELECT but with a nonsensical value. 01:03:10.200 --> 01:03:13.260 Lastly, consider what a database is. 01:03:13.260 --> 01:03:16.170 It's really a piece of software running on some computer 01:03:16.170 --> 01:03:20.820 somewhere, be it on my own Mac, or PC, or some server in the cloud. 01:03:20.820 --> 01:03:24.780 But if you have just one database, as I seem to keep presuming, 01:03:24.780 --> 01:03:27.180 you have this so-called single point of failure, 01:03:27.180 --> 01:03:31.290 again, just as we had in the world of cloud computing more generally. 01:03:31.290 --> 01:03:35.330 And so with something like data where you don't want to lose it 01:03:35.330 --> 01:03:37.830 and you certainly don't want all of your web servers or apps 01:03:37.830 --> 01:03:43.170 to go offline just because one server, your database server, has gone out, 01:03:43.170 --> 01:03:48.327 it's time to revisit the architecture or topology of systems more generally. 
01:03:48.327 --> 01:03:50.160 Something tells me that it's not going to be 01:03:50.160 --> 01:03:52.560 sufficient to have just one database. 01:03:52.560 --> 01:03:54.570 You probably want two. 01:03:54.570 --> 01:03:59.010 But if you have two databases, now how do you decide where to put the data? 01:03:59.010 --> 01:04:02.580 Do you put it here, or over here, or maybe in both places? 01:04:02.580 --> 01:04:06.050 If you put it in both places, though, you're then using twice as much space, 01:04:06.050 --> 01:04:08.510 so already we've opened a can of worms 01:04:08.510 --> 01:04:11.730 to solve one simple problem: don't have a single point of failure. 01:04:11.730 --> 01:04:15.980 But that's going to cost you some time, or some money, or certainly space. 01:04:15.980 --> 01:04:18.020 So what can you do if you're architecting 01:04:18.020 --> 01:04:22.670 a system that has now not just web servers but, say, databases? 01:04:22.670 --> 01:04:26.510 Well, odds are you're going to want to have not just the one, pictured here 01:04:26.510 --> 01:04:29.690 as a cylinder, as is the convention, but you're probably 01:04:29.690 --> 01:04:32.600 going to want to have a second as well. 01:04:32.600 --> 01:04:35.120 But of course, if you have two of them, odds 01:04:35.120 --> 01:04:38.200 are it's not sufficient just to store half of your data on one 01:04:38.200 --> 01:04:40.670 and half of your data on the other, because, of course, 01:04:40.670 --> 01:04:42.890 you've not solved the single point of failure. 01:04:42.890 --> 01:04:45.890 You now just have two single points of failure because half of your data 01:04:45.890 --> 01:04:48.590 could be lost here or half of it here. 01:04:48.590 --> 01:04:52.400 So odds are you're going to want to start having backups of data.
01:04:52.400 --> 01:04:55.130 But you don't want to necessarily have to waste time restoring 01:04:55.130 --> 01:04:57.650 from backup, especially if you want to maintain 01:04:57.650 --> 01:05:00.740 as many as five nines of uptime. 01:05:00.740 --> 01:05:04.160 So odds are you're going to want to have these databases really 01:05:04.160 --> 01:05:06.830 be duplicates of one another. 01:05:06.830 --> 01:05:09.140 And whenever you write data to one database, 01:05:09.140 --> 01:05:12.170 you should probably write it to the other in parallel. 01:05:12.170 --> 01:05:15.140 So, yes, admittedly, you have just spent twice as much space 01:05:15.140 --> 01:05:17.190 and, frankly, twice as much money. 01:05:17.190 --> 01:05:20.390 But at some point those sorts of costs are inevitable. 01:05:20.390 --> 01:05:23.000 But there are other ways to scale here, too. 01:05:23.000 --> 01:05:26.720 You can, of course, hit a ceiling on vertical scaling 01:05:26.720 --> 01:05:28.520 even when it comes to databases. 01:05:28.520 --> 01:05:32.570 After all, if a database is just a program running on some server 01:05:32.570 --> 01:05:36.620 and there is only so much RAM or disk space or CPU in that server, 01:05:36.620 --> 01:05:39.770 eventually you're not going to be able to store as much data, 01:05:39.770 --> 01:05:42.000 or as quickly, as you want. 01:05:42.000 --> 01:05:43.340 So what can you do? 01:05:43.340 --> 01:05:45.620 Well, you could, for instance, shard your data 01:05:45.620 --> 01:05:50.060 and have not just two but maybe four or more servers and put all of the users 01:05:50.060 --> 01:05:55.040 whose names start from A to M on one half of your cluster of servers, 01:05:55.040 --> 01:05:57.170 so to speak, but maybe everyone else from N 01:05:57.170 --> 01:06:00.500 to Z based on, say, their last name can go on the others.
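The dual-write idea above, writing every change to both databases in parallel, might be sketched as follows, with two in-memory SQLite databases standing in for two real servers (the Customer table is an assumption for illustration):

```python
import sqlite3

# Two databases that should remain exact duplicates of one another.
replicas = [sqlite3.connect(":memory:") for _ in range(2)]
for db in replicas:
    db.execute("CREATE TABLE Customer (Name TEXT)")

def write_everywhere(sql, params=()):
    # In production these writes might happen concurrently and
    # transactionally; a simple loop shows the idea.
    for db in replicas:
        db.execute(sql, params)

write_everywhere("INSERT INTO Customer VALUES (?)", ("Alice",))

# Either database can now answer reads, even if the other goes down.
for db in replicas:
    print(db.execute("SELECT COUNT(*) FROM Customer").fetchone()[0])  # 1
```

Twice the space and twice the money, as noted, but no single point of failure for the data itself.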
01:06:00.500 --> 01:06:04.220 To shard a database means to split the data in some predictable way 01:06:04.220 --> 01:06:05.990 that you can repeat again and again. 01:06:05.990 --> 01:06:10.070 But even there, too, even if only the As through Ms are going to the left, 01:06:10.070 --> 01:06:14.060 you want to make sure that you still have that backup or replica. 01:06:14.060 --> 01:06:17.600 And this arrow suggests that they really should be intercommunicating, 01:06:17.600 --> 01:06:20.450 not unlike the load balancers we've seen. 01:06:20.450 --> 01:06:23.120 But there's another way you can scale your databases as well. 01:06:23.120 --> 01:06:27.440 You don't have to have databases doing both reading and writing. 01:06:27.440 --> 01:06:31.340 To read data from a database or any server means to take it from its memory 01:06:31.340 --> 01:06:32.720 and read it into yours. 01:06:32.720 --> 01:06:36.140 And to write means to do the opposite, to save it. 01:06:36.140 --> 01:06:39.140 Well, what you can do actually in the world of databases 01:06:39.140 --> 01:06:42.710 is also replicate your databases multiple times. 01:06:42.710 --> 01:06:46.800 And you might have connected to these two primary databases 01:06:46.800 --> 01:06:52.610 multiple other databases that are just copies in one direction and not both. 01:06:52.610 --> 01:06:57.030 And what you might then do is use these two primary databases not only to read 01:06:57.030 --> 01:06:59.720 but to write, abbreviated here RW. 01:06:59.720 --> 01:07:02.000 But these other databases down here, which 01:07:02.000 --> 01:07:04.730 are just copies of the ones to which they're connected, 01:07:04.730 --> 01:07:06.440 are just called read replicas. 01:07:06.440 --> 01:07:11.450 They exist solely to be read from, again and again. 01:07:11.450 --> 01:07:13.200 When might this make sense?
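The key property of sharding is that the split is deterministic: the same key always routes to the same shard, so data written there can be found again. The last-name split just described might be routed like this (the cutoff letter and shard names are assumptions for illustration):

```python
def shard_for(last_name):
    # Deterministic and repeatable: the same name always lands on the
    # same shard, which is what makes the data findable later.
    return "shard-A-M" if last_name[:1].upper() <= "M" else "shard-N-Z"

print(shard_for("Malan"))       # shard-A-M
print(shard_for("Zuckerberg"))  # shard-N-Z
```

Real systems often shard on a hash of the key instead, so data spreads evenly even when names cluster around certain letters, but the routing idea is the same.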
01:07:13.200 --> 01:07:15.320 Well, in some contexts, like social media, 01:07:15.320 --> 01:07:17.930 like Facebook, it's probably the case that there 01:07:17.930 --> 01:07:20.690 are more reads than there are writes. 01:07:20.690 --> 01:07:24.800 That is to say you probably read far more content than you 01:07:24.800 --> 01:07:29.310 yourself post, as do most other users. 01:07:29.310 --> 01:07:32.180 And so if the data for your business follows 01:07:32.180 --> 01:07:36.560 that pattern whereby writes are maybe common but reads are way more common, 01:07:36.560 --> 01:07:41.030 you can adopt exactly this model and replicate again and again, honestly, 01:07:41.030 --> 01:07:44.420 as a tree structure for efficiency so that it doesn't all 01:07:44.420 --> 01:07:46.280 have to replicate one at a time. 01:07:46.280 --> 01:07:50.000 But then you can write software, be it in Python or something else, 01:07:50.000 --> 01:07:54.440 that writes data only to these one or two servers but reads 01:07:54.440 --> 01:07:56.720 from any number of them as well. 01:07:56.720 --> 01:07:59.450 But this, too, is a bit of a rabbit hole, because at some point 01:07:59.450 --> 01:08:01.580 you want to have this redundancy not in one 01:08:01.580 --> 01:08:05.990 location but others, east coast and west coast, one country and another. 01:08:05.990 --> 01:08:10.400 And at that point, you might actually run into the limitations of time. 01:08:10.400 --> 01:08:13.910 Because after all, it takes a non-zero number of milliseconds or seconds 01:08:13.910 --> 01:08:16.310 for data to travel long distances. 01:08:16.310 --> 01:08:18.050 Consider after all how long it might take 01:08:18.050 --> 01:08:21.590 data to go from Cambridge, Massachusetts to somewhere in Japan. 01:08:21.590 --> 01:08:26.210 That's far longer than it might take to just go down the road to MIT.
01:08:26.210 --> 01:08:30.979 So here, too, we can revisit all of the problems we saw in the world of cloud 01:08:30.979 --> 01:08:32.540 computing and servers more generally. 01:08:32.540 --> 01:08:35.990 They're back to be revisited in the context of databases. 01:08:35.990 --> 01:08:40.399 But with databases, you care ever more that these things not go down, 01:08:40.399 --> 01:08:44.630 or if they do, that you have spares, backups, or replicas. 01:08:44.630 --> 01:08:48.680 Because now that we're storing our data in this centralized place, 01:08:48.680 --> 01:08:52.069 we have to think hard not only about how we're scaling computationally 01:08:52.069 --> 01:08:56.359 but how we're scaling in terms of our data as well. 01:08:56.359 --> 01:08:59.090 So consider where then we began. 01:08:59.090 --> 01:09:02.770 We started by laying out data in a spreadsheet, be it Microsoft Excel, 01:09:02.770 --> 01:09:04.939 or Apple Numbers, or Google Spreadsheets. 01:09:04.939 --> 01:09:07.130 From there we considered what types of data 01:09:07.130 --> 01:09:09.950 we might store there so that if we want to upgrade, 01:09:09.950 --> 01:09:12.410 so to speak, from a spreadsheet to a database, 01:09:12.410 --> 01:09:15.260 we know what types we can specify. 01:09:15.260 --> 01:09:19.189 And in SQL, whether implemented in SQLite, Oracle, or MySQL, 01:09:19.189 --> 01:09:21.560 or something else, there tend to be standard types that 01:09:21.560 --> 01:09:25.279 are common across platforms, albeit with some variations, 01:09:25.279 --> 01:09:27.260 so that we can think hard about these types 01:09:27.260 --> 01:09:31.277 and then ultimately help the database help us be performant. 01:09:31.277 --> 01:09:34.069 Because if I know that I'm going to be searching or selecting based 01:09:34.069 --> 01:09:36.470 on certain data, I can tell the database, for instance, 01:09:36.470 --> 01:09:39.680 to make it unique or at least index it.
01:09:39.680 --> 01:09:43.520 And then using SQL constructions like SELECT, and INSERT, and UPDATE, 01:09:43.520 --> 01:09:46.670 and DELETE, and yet others can I manipulate that data and get 01:09:46.670 --> 01:09:52.069 at it even faster, frankly, than the human me could with a mere spreadsheet. 01:09:52.069 --> 01:09:54.860 But with the design of any system, as with databases, 01:09:54.860 --> 01:09:58.430 we start to open new cans of worms and new problems 01:09:58.430 --> 01:10:01.400 as we start to explore more sophisticated challenges. 01:10:01.400 --> 01:10:04.880 But here, too, many, if not all, of these problems 01:10:04.880 --> 01:10:09.290 can be solved by simply reducing the problems to first principles 01:10:09.290 --> 01:10:11.450 and considering: what is the problem to be solved? 01:10:11.450 --> 01:10:12.770 How is our data represented? 01:10:12.770 --> 01:10:14.150 Where is it stored? 01:10:14.150 --> 01:10:18.290 And ultimately, what business constraints or legal constraints 01:10:18.290 --> 01:10:22.700 do we have when manipulating that data, and what tools of the trade 01:10:22.700 --> 01:10:24.050 are available to us? 01:10:24.050 --> 01:10:27.740 This then is database design and, more generally, design unto itself, 01:10:27.740 --> 01:10:32.450 not just focusing on the correctness of implementing solutions to problems 01:10:32.450 --> 01:10:37.670 but the efficiency and scalability thereof as well.