1 00:00:00,000 --> 00:00:17,600 2 00:00:17,600 --> 00:00:19,880 DAVID MALAN: We've seen how with languages like Python 3 00:00:19,880 --> 00:00:23,240 can we implement business logic on a server and, even if we want, 4 00:00:23,240 --> 00:00:24,680 generate web pages. 5 00:00:24,680 --> 00:00:27,770 And we've seen, on the browser side, the so-called client side, 6 00:00:27,770 --> 00:00:31,340 how you can render information or data to users. 7 00:00:31,340 --> 00:00:33,770 And with CSS can we style it, and with JavaScript can 8 00:00:33,770 --> 00:00:35,480 we even make it interactive. 9 00:00:35,480 --> 00:00:38,120 But when building a business or any application that 10 00:00:38,120 --> 00:00:42,060 is interacting with users, where is all that data being stored? 11 00:00:42,060 --> 00:00:44,060 Well, if you're running your own small business, 12 00:00:44,060 --> 00:00:46,040 you might simply be using a spreadsheet. 13 00:00:46,040 --> 00:00:48,207 Of course, spreadsheets, whether they're implemented 14 00:00:48,207 --> 00:00:51,050 as Microsoft Excel, or Google Spreadsheets, or Apple Numbers, 15 00:00:51,050 --> 00:00:53,930 generally stores data in rows and columns. 16 00:00:53,930 --> 00:00:56,990 And if you're like me, you probably use that first row 17 00:00:56,990 --> 00:01:00,230 to represent the names of the columns and the types of the data 18 00:01:00,230 --> 00:01:01,220 that you're storing. 19 00:01:01,220 --> 00:01:04,010 And as you continue to add more and more data to that spreadsheet, 20 00:01:04,010 --> 00:01:08,120 you probably, like me, continue to add row after row after row. 21 00:01:08,120 --> 00:01:11,720 And meanwhile, if you've got so much data or so many different types of data 22 00:01:11,720 --> 00:01:14,510 that it doesn't really belong in one sheet, 23 00:01:14,510 --> 00:01:16,833 you might actually create a second sheet or a tab 24 00:01:16,833 --> 00:01:19,250 along the bottom of the screen, for instance, in which you 25 00:01:19,250 --> 00:01:22,430 can store additional data as well. 26 00:01:22,430 --> 00:01:27,440 So indeed, Microsoft Excel allows you to store data relationally, so to speak. 27 00:01:27,440 --> 00:01:31,550 You might have one sheet with customer info, another sheet with invoices, 28 00:01:31,550 --> 00:01:35,480 another sheet with products, or any number of other types of data. 29 00:01:35,480 --> 00:01:40,610 And you can relate those to another by somehow having some commonality 30 00:01:40,610 --> 00:01:45,360 among them, some identifier, some name, some customer ID or the like. 31 00:01:45,360 --> 00:01:48,050 And so data that tends to be stored in rows and columns 32 00:01:48,050 --> 00:01:49,502 can have these relationships. 33 00:01:49,502 --> 00:01:51,710 And you can do this certainly in the cloud these days 34 00:01:51,710 --> 00:01:55,580 as well with Google Docs, simply a web-based version of the same. 35 00:01:55,580 --> 00:01:58,940 But when designing your spreadsheets, or more 36 00:01:58,940 --> 00:02:01,945 generally, your data's schema, the design thereof, 37 00:02:01,945 --> 00:02:04,070 there's a number of questions you need to consider. 38 00:02:04,070 --> 00:02:05,600 What data do you want to store? 39 00:02:05,600 --> 00:02:07,100 And where do you want to store it? 40 00:02:07,100 --> 00:02:09,623 And what type of data is it, in fact? 41 00:02:09,623 --> 00:02:12,290 Because whether using Excel, or Numbers, or Google Spreadsheets, 42 00:02:12,290 --> 00:02:16,970 odds are you sometimes configure the columns to be displayed or formatted 43 00:02:16,970 --> 00:02:17,930 in some ways. 44 00:02:17,930 --> 00:02:20,150 Numbers might be displayed with or without commas. 45 00:02:20,150 --> 00:02:23,330 Numbers might be displayed with dollar signs or some other symbol. 46 00:02:23,330 --> 00:02:25,700 You might separate one field from another 47 00:02:25,700 --> 00:02:27,990 using special symbols or punctuation. 48 00:02:27,990 --> 00:02:31,580 So these are all configuration options in a spreadsheet that might 49 00:02:31,580 --> 00:02:33,680 be among the first decisions you make. 50 00:02:33,680 --> 00:02:35,930 In fact, let's consider a sample spreadsheet 51 00:02:35,930 --> 00:02:39,930 wherein I might want to represent users in some web application. 52 00:02:39,930 --> 00:02:42,680 All of us have, of course, registered for websites or applications 53 00:02:42,680 --> 00:02:43,710 these days. 54 00:02:43,710 --> 00:02:46,190 And what are some of the questions you're asked? 55 00:02:46,190 --> 00:02:49,850 Well, you might be asked for your user name by some application. 56 00:02:49,850 --> 00:02:52,405 You might be asked for your actual name. 57 00:02:52,405 --> 00:02:53,780 What else might you be asked for? 58 00:02:53,780 --> 00:02:56,180 Perhaps your email address. 59 00:02:56,180 --> 00:02:58,970 And perhaps if you're buying something, or registering 60 00:02:58,970 --> 00:03:01,940 for something, or someone that needs to know a bit more about you, 61 00:03:01,940 --> 00:03:05,450 perhaps they'll even ask you for your address, maybe your phone number, 62 00:03:05,450 --> 00:03:07,590 or perhaps even your age. 63 00:03:07,590 --> 00:03:10,670 So you can certainly imagine there being even other fields that you need. 64 00:03:10,670 --> 00:03:15,740 But let's consider now what types of data each of these fields is. 65 00:03:15,740 --> 00:03:17,600 Now, what should a user name be? 66 00:03:17,600 --> 00:03:21,050 Well, by definition, this should be a unique value that belongs to you 67 00:03:21,050 --> 00:03:22,910 and only you on a system. 68 00:03:22,910 --> 00:03:25,010 For instance, if I'm registering for a website 69 00:03:25,010 --> 00:03:28,400 and that website is storing some data in, for instance, a spreadsheet, 70 00:03:28,400 --> 00:03:31,730 I might ideally try to claim Malan if it's available. 71 00:03:31,730 --> 00:03:33,800 My name meanwhile, of course, will be just 72 00:03:33,800 --> 00:03:36,710 David Malan, which someone else can certainly have as well. 73 00:03:36,710 --> 00:03:39,320 Hopefully my email address is only mine. 74 00:03:39,320 --> 00:03:42,020 And that, too, will be stored in a column of its own. 75 00:03:42,020 --> 00:03:46,860 And then my address, for instance, here on campus, 33 Oxford Street, Cambridge, 76 00:03:46,860 --> 00:03:49,940 Massachusetts, 02138. 77 00:03:49,940 --> 00:03:54,710 A phone number by which I can be reached, 617-495-5000. 78 00:03:54,710 --> 00:03:56,450 And then my age shall be-- 79 00:03:56,450 --> 00:03:58,430 well, we'll leave that blank just for now. 80 00:03:58,430 --> 00:04:02,390 So notice how each of these pieces of data is just a little bit different. 81 00:04:02,390 --> 00:04:04,100 Some of them seem to be numeric. 82 00:04:04,100 --> 00:04:06,020 Some of them seem to be alphabetic. 83 00:04:06,020 --> 00:04:08,900 Some of them even have some punctuation therein. 84 00:04:08,900 --> 00:04:11,120 Now, you might or might not in a spreadsheet 85 00:04:11,120 --> 00:04:13,820 care to display this data a bit differently. 86 00:04:13,820 --> 00:04:16,700 Indeed, the only number that I'm sweeping under the rug 87 00:04:16,700 --> 00:04:18,019 here is, in fact, my age. 88 00:04:18,019 --> 00:04:21,390 But hopefully I don't need one or more commas in that value. 89 00:04:21,390 --> 00:04:25,190 So there isn't really any need for special formatting here. 90 00:04:25,190 --> 00:04:27,350 But it turns out that when you're actually 91 00:04:27,350 --> 00:04:30,950 building a website or software-based application, 92 00:04:30,950 --> 00:04:34,970 you're probably not storing your data ultimately in just spreadsheets. 93 00:04:34,970 --> 00:04:39,470 You might graduate eventually from being a small shop to needing more than just 94 00:04:39,470 --> 00:04:41,478 Microsoft Excel, or Numbers, or even something 95 00:04:41,478 --> 00:04:43,020 cloud-based like Google Spreadsheets. 96 00:04:43,020 --> 00:04:43,700 Why? 97 00:04:43,700 --> 00:04:47,580 Well, you have more rows than those programs can generally handle. 98 00:04:47,580 --> 00:04:49,670 In fact, on your own Mac or PC, odds are, 99 00:04:49,670 --> 00:04:53,600 when opening up big files, whether a spreadsheet or any other document, 100 00:04:53,600 --> 00:04:55,430 sometimes you might actually feel that. 101 00:04:55,430 --> 00:04:57,380 The computer might start to slow, and you 102 00:04:57,380 --> 00:05:00,200 might start to see a spinning beach ball or hourglass 103 00:05:00,200 --> 00:05:02,090 because you start to hit the limits of what 104 00:05:02,090 --> 00:05:05,240 a local computer can do with just client-side software, 105 00:05:05,240 --> 00:05:06,950 like a spreadsheet tool. 106 00:05:06,950 --> 00:05:10,730 And so eventually you might actually need to use a proper database. 107 00:05:10,730 --> 00:05:13,460 And a database is really just a piece of software 108 00:05:13,460 --> 00:05:15,470 that can absolutely run on your Mac or PC. 109 00:05:15,470 --> 00:05:19,010 But very commonly it runs on a server or somewhere else 110 00:05:19,010 --> 00:05:21,670 in the cloud to which your own software connects. 111 00:05:21,670 --> 00:05:24,170 For instance, if you're building a website or an application 112 00:05:24,170 --> 00:05:27,110 in a language like Python, you can, in Python, 113 00:05:27,110 --> 00:05:32,570 write code that talks or communicates with that database to pull data down 114 00:05:32,570 --> 00:05:34,730 and to send data back up. 115 00:05:34,730 --> 00:05:37,170 But spreadsheets are wonderfully straightforward. 116 00:05:37,170 --> 00:05:40,310 It's just so intuitive how you store the data in rows and columns 117 00:05:40,310 --> 00:05:43,070 and just more and more rows as you have more and more data. 118 00:05:43,070 --> 00:05:48,410 And so what's nice about some databases is that they mimic exactly this design. 119 00:05:48,410 --> 00:05:51,560 There exists in the world what are called relational databases. 120 00:05:51,560 --> 00:05:55,197 And indeed, this is among the most common ways to store data relationally. 121 00:05:55,197 --> 00:05:57,530 The jargon is a bit different in the world of databases. 122 00:05:57,530 --> 00:05:59,720 You no longer call these things spreadsheets. 123 00:05:59,720 --> 00:06:01,160 We call them databases. 124 00:06:01,160 --> 00:06:03,830 And we don't call the individual tabs sheets. 125 00:06:03,830 --> 00:06:05,240 We call them tables. 126 00:06:05,240 --> 00:06:08,240 But we continue to call the structure of the data therein 127 00:06:08,240 --> 00:06:11,090 rows for each additional piece of data and columns 128 00:06:11,090 --> 00:06:13,430 for the different types of data that we have. 129 00:06:13,430 --> 00:06:18,500 But in a relational database, such as Oracle, Microsoft Access, SQL Server, 130 00:06:18,500 --> 00:06:23,030 My SQL, Postgres, or something smaller and lighter weight called SQLlite, 131 00:06:23,030 --> 00:06:26,930 the burden is on you, the designer of the database, 132 00:06:26,930 --> 00:06:29,000 or the programmer, or the business person 133 00:06:29,000 --> 00:06:32,690 to actually decide on what types of data you are going to store. 134 00:06:32,690 --> 00:06:35,660 Because the database, in order to handle more and more data 135 00:06:35,660 --> 00:06:39,350 than a typical spreadsheet can support, needs a bit of help from you, 136 00:06:39,350 --> 00:06:42,680 needs to know what type of data you're storing so that it can search it 137 00:06:42,680 --> 00:06:45,170 more efficiently and sort it more effectively 138 00:06:45,170 --> 00:06:48,530 and make it easier for you ultimately to add and remove data 139 00:06:48,530 --> 00:06:51,075 from that particular database. 140 00:06:51,075 --> 00:06:53,450 So in other words, you can do all of the same operations. 141 00:06:53,450 --> 00:06:56,420 But generally, in a database, you have so much more data, 142 00:06:56,420 --> 00:07:00,150 you're going to need to have the database help you help yourself. 143 00:07:00,150 --> 00:07:02,232 So what does that actually mean here? 144 00:07:02,232 --> 00:07:04,190 Well, let's consider some of these fields here. 145 00:07:04,190 --> 00:07:08,630 Were we to migrate my spreadsheet with many, many, many rows of users, 146 00:07:08,630 --> 00:07:12,570 for instance, to a proper database relationally, 147 00:07:12,570 --> 00:07:15,090 I'm going to have to make a few more decisions as well. 148 00:07:15,090 --> 00:07:18,530 And it turns out that I need to choose various types for these columns. 149 00:07:18,530 --> 00:07:22,740 And generally, in a database, you only have a limited menu of options. 150 00:07:22,740 --> 00:07:26,740 So let's take a look at what those might be. 151 00:07:26,740 --> 00:07:28,960 Here in many relational databases are just 152 00:07:28,960 --> 00:07:31,940 some of the data types that are available to you; 153 00:07:31,940 --> 00:07:35,560 an integer if you want to represent something like 1, 2, 3, or perhaps 154 00:07:35,560 --> 00:07:39,250 even a negative, a real number, otherwise known as a floating point 155 00:07:39,250 --> 00:07:41,200 value, for instance, in Python that actually 156 00:07:41,200 --> 00:07:44,530 has a decimal point and perhaps some numbers thereafter, 157 00:07:44,530 --> 00:07:47,530 numeric, which is more of a catch-all and might handle things like dates 158 00:07:47,530 --> 00:07:50,800 and times that are numbers in some sense but have some more 159 00:07:50,800 --> 00:07:53,770 formal structure to them, and then the more general text, when you just 160 00:07:53,770 --> 00:07:56,830 have words or phrases, characters or whole paragraphs 161 00:07:56,830 --> 00:08:00,070 or more that you might want to store as well in some column. 162 00:08:00,070 --> 00:08:03,400 And then lastly, you might have the cutely named BLOB, 163 00:08:03,400 --> 00:08:07,030 or binary large object, when you actually want to store zeros and ones, 164 00:08:07,030 --> 00:08:10,730 that is to say binary data, like actual files in your database. 165 00:08:10,730 --> 00:08:12,670 However, you needn't do this typically. 166 00:08:12,670 --> 00:08:16,690 You can actually store files, of course, on a file system on your own hard drive 167 00:08:16,690 --> 00:08:20,260 somewhere on the server, but that option exists for you. 168 00:08:20,260 --> 00:08:23,110 Now, this happens to be the list of data types supported 169 00:08:23,110 --> 00:08:27,460 by the simplest of relational databases, something called SQLite. 170 00:08:27,460 --> 00:08:31,300 And as its name implies, it actually supports a very specific language 171 00:08:31,300 --> 00:08:36,890 via which you or anyone can request data from the database and store 172 00:08:36,890 --> 00:08:39,760 data in the database, or update it, or delete it. 173 00:08:39,760 --> 00:08:44,440 And that language is called SQL, structured query language. 174 00:08:44,440 --> 00:08:47,200 SQL is a language via which you can do exactly 175 00:08:47,200 --> 00:08:51,670 that, retrieve data from a database, put data into a database, and so much more. 176 00:08:51,670 --> 00:08:54,867 It's an alternative to the pointing and clicking with which you're probably 177 00:08:54,867 --> 00:08:57,700 familiar with something like Microsoft Excel, or Google spreadsheet, 178 00:08:57,700 --> 00:08:59,830 or Apple Numbers, where it's really you, the human, 179 00:08:59,830 --> 00:09:03,250 doing all the work via a graphical user interface. 180 00:09:03,250 --> 00:09:05,710 Now, to be fair, in all of this spreadsheet programs 181 00:09:05,710 --> 00:09:08,110 are there functions or macros that you might 182 00:09:08,110 --> 00:09:11,540 be able to write so that you can program in those environments 183 00:09:11,540 --> 00:09:13,690 but with some limitations. 184 00:09:13,690 --> 00:09:17,680 And indeed with SQL and with SQL databases, more generally, 185 00:09:17,680 --> 00:09:20,680 that is databases that understand this language, can you 186 00:09:20,680 --> 00:09:24,430 achieve far higher performance, can you store far more data, 187 00:09:24,430 --> 00:09:26,380 and nonetheless get at it quickly. 188 00:09:26,380 --> 00:09:29,860 So it's sort of the next evolution of what you might otherwise do only 189 00:09:29,860 --> 00:09:31,810 within your own computer spreadsheet. 190 00:09:31,810 --> 00:09:36,490 But many other databases, Oracle, SQL Server, Microsoft Access, MySQL, 191 00:09:36,490 --> 00:09:39,100 Postgres and more, support more than these data 192 00:09:39,100 --> 00:09:42,760 types that allow you to help the database help you even more. 193 00:09:42,760 --> 00:09:46,700 Because the more the database knows about your data, the smarter decisions 194 00:09:46,700 --> 00:09:49,637 it can make at a lower level below these abstractions 195 00:09:49,637 --> 00:09:51,970 and storing that data so that, when you have a question, 196 00:09:51,970 --> 00:09:54,260 it can answer it quickly. 197 00:09:54,260 --> 00:09:57,760 And so how might you refine what we mean by integer? 198 00:09:57,760 --> 00:09:59,890 Well, in some databases, you might want to tell it 199 00:09:59,890 --> 00:10:03,370 that this is just a small integer in this column one 200 00:10:03,370 --> 00:10:04,990 after the other, row by row. 201 00:10:04,990 --> 00:10:08,020 Or maybe it's just an integer somewhere a little bigger 202 00:10:08,020 --> 00:10:12,070 than that, or perhaps a big int, which means it can be even larger. 203 00:10:12,070 --> 00:10:14,860 Now, these actually map to very well-defined values. 204 00:10:14,860 --> 00:10:19,630 If you think back to how we considered data itself is stored in a computer, 205 00:10:19,630 --> 00:10:22,300 it's ultimately with zeros and ones or bits. 206 00:10:22,300 --> 00:10:26,110 And indeed, an integer, so to speak, generally takes up 207 00:10:26,110 --> 00:10:28,930 just 32 bits or four bytes. 208 00:10:28,930 --> 00:10:31,930 But, of course, that few bits actually translates, 209 00:10:31,930 --> 00:10:35,530 if you do a bit of the math, to four billion possible values. 210 00:10:35,530 --> 00:10:39,160 So if you were to specify that a column in your database 211 00:10:39,160 --> 00:10:43,300 is of type integer, that means you could type in the number from zero 212 00:10:43,300 --> 00:10:46,570 all the way on up to four billion, or if you want negative numbers, 213 00:10:46,570 --> 00:10:51,400 from negative two billion roughly to positive two billion as well. 214 00:10:51,400 --> 00:10:54,580 But beyond that three billion or four, you're out of luck 215 00:10:54,580 --> 00:10:56,320 if you're only using 32 bits. 216 00:10:56,320 --> 00:10:57,460 Now, that might seem huge. 217 00:10:57,460 --> 00:11:02,170 And that is a good problem to have if I have four billion users, or sales, 218 00:11:02,170 --> 00:11:04,280 or anything in my database. 219 00:11:04,280 --> 00:11:08,080 But for certain large companies these days, having four billion 220 00:11:08,080 --> 00:11:12,040 or more records of some sort, transactions or logs, 221 00:11:12,040 --> 00:11:13,420 not at all uncommon. 222 00:11:13,420 --> 00:11:15,700 And for those purposes does there exist big int, 223 00:11:15,700 --> 00:11:20,080 a bigger data type that uses 64 bits or eight bytes via which 224 00:11:20,080 --> 00:11:22,743 you can count much, much, much higher, so high 225 00:11:22,743 --> 00:11:24,910 that I'm not even sure how to pronounce that number. 226 00:11:24,910 --> 00:11:29,110 But indeed, it should be large enough for almost all of our purposes. 227 00:11:29,110 --> 00:11:32,890 So ultimately, it's not enough to say that the field in a database 228 00:11:32,890 --> 00:11:38,440 is an integer but to specify just how big maximally that integer can be. 229 00:11:38,440 --> 00:11:42,400 Now, fortunately, I think for my age we could probably get away with small int. 230 00:11:42,400 --> 00:11:45,510 But indeed, with a small int, you typically have 16 bits. 231 00:11:45,510 --> 00:11:52,270 So even then could your user be as old as 65,535 years old. 232 00:11:52,270 --> 00:11:53,750 Now, what about real numbers? 233 00:11:53,750 --> 00:11:56,260 These are floating point values with decimal points. 234 00:11:56,260 --> 00:12:00,100 And here, too, you have fewer choices, typically, but along the same lines. 235 00:12:00,100 --> 00:12:03,310 A real number would typically take 32 bits, 236 00:12:03,310 --> 00:12:07,150 which allows you only some amount of precision, some number of digits 237 00:12:07,150 --> 00:12:08,470 after that decimal point. 238 00:12:08,470 --> 00:12:11,440 If you want to go even further and be ever more precise 239 00:12:11,440 --> 00:12:14,590 can you use twice as much space, 64 bits or eight 240 00:12:14,590 --> 00:12:17,740 bytes, and use double precision instead. 241 00:12:17,740 --> 00:12:20,920 Well, it would seem that we sort of start with an OK 242 00:12:20,920 --> 00:12:23,360 value and an even better value here. 243 00:12:23,360 --> 00:12:25,355 But what's the trade-off, as there always is? 244 00:12:25,355 --> 00:12:27,480 Well, it's fine if want to be able to count higher, 245 00:12:27,480 --> 00:12:29,820 whether with integers or real numbers. 246 00:12:29,820 --> 00:12:32,190 But you are literally going to pay a price. 247 00:12:32,190 --> 00:12:34,770 Because if you want to count higher than four billion 248 00:12:34,770 --> 00:12:36,900 in the context of integers, you're going to have 249 00:12:36,900 --> 00:12:40,890 to spend twice as much space from four bytes to eight 250 00:12:40,890 --> 00:12:42,390 in order to store those values. 251 00:12:42,390 --> 00:12:46,560 And if you're never actually going to use values larger than four billion, 252 00:12:46,560 --> 00:12:49,420 you're just spending twice as much space as you need to. 253 00:12:49,420 --> 00:12:51,810 And if you don't really need to store floating point 254 00:12:51,810 --> 00:12:56,220 values to terribly many digits of precision, you're just wasting space. 255 00:12:56,220 --> 00:12:59,240 Now, for small databases, this might be inconsequential. 256 00:12:59,240 --> 00:13:01,740 But for the Googles, and Microsofts, and others of the world 257 00:13:01,740 --> 00:13:07,050 to deal in gigabytes and terabytes of data, all of these bits and bytes 258 00:13:07,050 --> 00:13:08,670 will surely add up. 259 00:13:08,670 --> 00:13:12,010 As for numeric, more of a catch-all numerically, so to speak, 260 00:13:12,010 --> 00:13:16,110 whereby you have Booleans, zeros or ones, or false and true values, 261 00:13:16,110 --> 00:13:18,990 dates which comes in a prescribed format, typically 262 00:13:18,990 --> 00:13:24,270 year, year, year, year, dash, month, month, dash, day, day. 263 00:13:24,270 --> 00:13:26,382 Indeed, even though across the world there 264 00:13:26,382 --> 00:13:29,340 are different formats and different communities for representing dates, 265 00:13:29,340 --> 00:13:33,820 in SQL do you specify them exactly in that format only. 266 00:13:33,820 --> 00:13:38,460 Date time adds to that a space followed by hour, hour, colon, minute, minute, 267 00:13:38,460 --> 00:13:42,240 colon, second, second, whereby if you want to store both date and a time, 268 00:13:42,240 --> 00:13:45,720 you should store it in exactly that format in your database, even though, 269 00:13:45,720 --> 00:13:48,930 thanks to code, whether Python or SQL itself, 270 00:13:48,930 --> 00:13:52,080 you want to display that date in a geographically 271 00:13:52,080 --> 00:13:54,810 localized and conventional way, you nonetheless 272 00:13:54,810 --> 00:13:57,580 store it in this way in your database. 273 00:13:57,580 --> 00:14:01,680 Now, sometimes you need to fight against the imprecision that's 274 00:14:01,680 --> 00:14:04,480 inherent as a risk with real numbers. 275 00:14:04,480 --> 00:14:07,110 And so you can use the so-called numeric data type 276 00:14:07,110 --> 00:14:09,570 to specify a so-called scale and precision. 277 00:14:09,570 --> 00:14:13,290 That is two numbers that specify really how many digits 278 00:14:13,290 --> 00:14:15,490 you ultimately want to support. 279 00:14:15,490 --> 00:14:17,640 So if you want to store things like dollar amounts, 280 00:14:17,640 --> 00:14:20,610 maybe to cents or hundreds of cents, two or four decimal 281 00:14:20,610 --> 00:14:23,520 places, can you specify exactly that, and you 282 00:14:23,520 --> 00:14:26,460 will get exactly that much precision. 283 00:14:26,460 --> 00:14:30,090 As for time and timestamp, well, time is quite simply hour, hour, colon, minute, 284 00:14:30,090 --> 00:14:32,050 minute, colon, second, second. 285 00:14:32,050 --> 00:14:34,650 But timestamp's a little more interesting. 286 00:14:34,650 --> 00:14:37,200 It, too, can be used to represent time. 287 00:14:37,200 --> 00:14:40,860 But as with many languages, Python and JavaScript among them, 288 00:14:40,860 --> 00:14:43,560 it does so typically by remembering or counting 289 00:14:43,560 --> 00:14:48,690 some number of milliseconds or seconds from a particular point in time. 290 00:14:48,690 --> 00:14:51,570 In fact, some years ago, it was conventional to simply start 291 00:14:51,570 --> 00:14:56,370 counting time from January 1st, 1970, the so-called epic as opposed 292 00:14:56,370 --> 00:14:58,950 to say 0 BC or AD. 293 00:14:58,950 --> 00:15:03,690 Unfortunately, it turns out that you generally represent these timestamps 294 00:15:03,690 --> 00:15:08,250 with four bytes or 32 bits, which gives you access to four billion 295 00:15:08,250 --> 00:15:09,900 or so possible values. 296 00:15:09,900 --> 00:15:13,170 And it turns out that in the year 2038, we 297 00:15:13,170 --> 00:15:16,770 will have run out of bits with which to represent time. 298 00:15:16,770 --> 00:15:20,460 So if you recall the so-called Y2K problem when we ran into this issue 299 00:15:20,460 --> 00:15:24,180 around the year 2000, so have we rather painted ourselves 300 00:15:24,180 --> 00:15:28,290 into a corner in many systems by not allocating enough space 301 00:15:28,290 --> 00:15:30,120 to get us past that year. 302 00:15:30,120 --> 00:15:33,110 Well, what's the solution invariably going to be? 303 00:15:33,110 --> 00:15:35,700 Well, use more space and, thus, cost. 304 00:15:35,700 --> 00:15:38,400 But that is inherent in many systems. 305 00:15:38,400 --> 00:15:41,880 And it's one of these things that us humans will have to address. 306 00:15:41,880 --> 00:15:43,040 How about text? 307 00:15:43,040 --> 00:15:46,088 Well, text can, too, come in multiple forms. 308 00:15:46,088 --> 00:15:47,880 Whether it's English or any other language, 309 00:15:47,880 --> 00:15:49,770 you might have different lengths thereof. 310 00:15:49,770 --> 00:15:52,890 And so within the world of text values and databases, 311 00:15:52,890 --> 00:15:55,800 need you specify typically how many characters 312 00:15:55,800 --> 00:15:57,750 you want to store in that column. 313 00:15:57,750 --> 00:16:00,750 And this is distinct from something like a spreadsheet program 314 00:16:00,750 --> 00:16:04,410 where you can probably keep typing and typing or pasting as many characters 315 00:16:04,410 --> 00:16:05,490 as you'd like. 316 00:16:05,490 --> 00:16:09,810 A database, in so far as its purpose in life is to give you better performance 317 00:16:09,810 --> 00:16:12,510 and more scalability than a spreadsheet allows, 318 00:16:12,510 --> 00:16:14,760 wants you to tell it a bit more. 319 00:16:14,760 --> 00:16:16,620 It's going to ask you to decide whether you 320 00:16:16,620 --> 00:16:19,320 want to store a fixed number of chars. 321 00:16:19,320 --> 00:16:22,650 So char followed by a number here represented at its end 322 00:16:22,650 --> 00:16:29,418 means how many characters exactly shall you store in every cell in this column. 323 00:16:29,418 --> 00:16:30,835 Now, why might that be applicable? 324 00:16:30,835 --> 00:16:33,120 Well, in the US, for instance, for states, 325 00:16:33,120 --> 00:16:35,250 we have a standardization of two characters 326 00:16:35,250 --> 00:16:36,970 for every state in the country. 327 00:16:36,970 --> 00:16:39,570 And so you might say char (2), thereby allowing 328 00:16:39,570 --> 00:16:43,230 you to store CT for Connecticut, and FL for Florida, 329 00:16:43,230 --> 00:16:47,280 and MA for Massachusetts and so on, because you know that every value is 330 00:16:47,280 --> 00:16:49,800 going to be that finite length, two. 331 00:16:49,800 --> 00:16:51,840 But sometimes you don't know the max length. 332 00:16:51,840 --> 00:16:56,640 And indeed, in my own table of users, my own name might have D-A-V-I-D 333 00:16:56,640 --> 00:17:00,060 and a space and M-A-L-A-N, so 11 for me. 334 00:17:00,060 --> 00:17:02,437 But your name might be shorter or longer. 335 00:17:02,437 --> 00:17:04,770 Or some user that hasn't even registered for our website 336 00:17:04,770 --> 00:17:07,980 yet might have an even longer name than that. 337 00:17:07,980 --> 00:17:11,609 So varchar, or variable number of chars, exists, too, 338 00:17:11,609 --> 00:17:13,859 where you can specify not a fixed number of characters 339 00:17:13,859 --> 00:17:19,109 to use but a maximum upper bound, perhaps something like 30, 340 00:17:19,109 --> 00:17:22,230 perhaps something like 300, or some number of characters 341 00:17:22,230 --> 00:17:26,079 that you decide that's going to be the upper limit of any human I actually 342 00:17:26,079 --> 00:17:29,410 see in my sight who might have a name as many as that characters. 343 00:17:29,410 --> 00:17:32,350 But varchar is smart and the databases that use it, 344 00:17:32,350 --> 00:17:35,110 because it will use that many characters maximally. 345 00:17:35,110 --> 00:17:37,220 But it's not going to waste space. 346 00:17:37,220 --> 00:17:42,010 In fact, if you're using 30 characters for every name in your table, 347 00:17:42,010 --> 00:17:45,890 well, it's only going to store as many characters are 348 00:17:45,890 --> 00:17:49,870 necessary, perhaps plus a bit of overhead or accounting 349 00:17:49,870 --> 00:17:51,340 to keep track of that value. 350 00:17:51,340 --> 00:17:56,800 But that way you save space but still can handle large text or large strings. 351 00:17:56,800 --> 00:18:00,550 Lastly, if you have particularly large text, whole paragraphs, or essays, 352 00:18:00,550 --> 00:18:02,440 or documents that someone might paste in, 353 00:18:02,440 --> 00:18:05,232 or the contents of an entire web page that you want to search, 354 00:18:05,232 --> 00:18:07,190 well, you can have a text field more generally, 355 00:18:07,190 --> 00:18:11,290 which tends to support tens of thousands or more characters in total. 356 00:18:11,290 --> 00:18:14,260 But it does so in a way that's not quite as efficient. 357 00:18:14,260 --> 00:18:17,720 Rather than keep it right in the column itself, so to speak, 358 00:18:17,720 --> 00:18:20,500 it puts it somewhere else on the server, the result of which 359 00:18:20,500 --> 00:18:23,620 is that you can fit more room over here, so to speak. 360 00:18:23,620 --> 00:18:27,520 But it takes more time and effort to go search or find that data, so, 361 00:18:27,520 --> 00:18:30,610 again, a trade-off of space and time. 362 00:18:30,610 --> 00:18:33,170 But here, too, as with integers and real, 363 00:18:33,170 --> 00:18:37,000 you seem to have multiple choices, one of which at first glance is better. 364 00:18:37,000 --> 00:18:41,543 After all, why use char ever if you could also just say varchar 365 00:18:41,543 --> 00:18:42,460 with that same amount? 366 00:18:42,460 --> 00:18:45,550 For instance, for those two character codes for states, 367 00:18:45,550 --> 00:18:50,320 why not just say varchar and not just char (2)? 368 00:18:50,320 --> 00:18:53,650 Well, it turns out that if you promise the database 369 00:18:53,650 --> 00:18:58,900 that every cell in your column will use exactly the same number of bytes, 370 00:18:58,900 --> 00:19:03,220 it turns out that you'll have very straight or justified edges, 371 00:19:03,220 --> 00:19:05,890 in some sense, conceptually for that column. 372 00:19:05,890 --> 00:19:08,560 For instance, if every cell in a column takes up 373 00:19:08,560 --> 00:19:11,740 exactly the same amount of space, then you might know, 374 00:19:11,740 --> 00:19:15,250 if you start numbering the bytes that are represented by those cells, 375 00:19:15,250 --> 00:19:18,910 this might be byte zero up here and then two characters away 376 00:19:18,910 --> 00:19:22,670 will be address two, and then four, and then six, and then eight. 377 00:19:22,670 --> 00:19:26,320 And if you think back to how algorithms like, say, binary search 378 00:19:26,320 --> 00:19:30,610 are implemented, as soon as you have the ability to know arithmetically 379 00:19:30,610 --> 00:19:34,780 where the next value, or the last value, or, best yet, the middle value 380 00:19:34,780 --> 00:19:39,880 is, numerically can you jump right to it in so-called constant time, 381 00:19:39,880 --> 00:19:44,260 thereby enabling something like binary search and ultimately logarithmic time. 382 00:19:44,260 --> 00:19:45,820 So this is a powerful thing. 383 00:19:45,820 --> 00:19:48,220 If you commit to the database that all of your values 384 00:19:48,220 --> 00:19:50,740 will take up the exact amount of space, the database 385 00:19:50,740 --> 00:19:54,820 can search that value faster for you. 386 00:19:54,820 --> 00:20:00,520 Varchar, unfortunately, is rather akin to having only a left justified column, 387 00:20:00,520 --> 00:20:03,713 whereby you might have data that looks very straight on one edge 388 00:20:03,713 --> 00:20:05,380 with all of the characters left aligned. 389 00:20:05,380 --> 00:20:07,630 But because the length of those characters vary, 390 00:20:07,630 --> 00:20:11,080 this cell might be this wide, and this one this wide, and this one this wide, 391 00:20:11,080 --> 00:20:12,430 and this one this wide. 392 00:20:12,430 --> 00:20:15,790 And as such, because you have a very ragged edge, so to speak, 393 00:20:15,790 --> 00:20:18,610 you can't just do simple math and add two and add two 394 00:20:18,610 --> 00:20:21,350 to get to the next value in that column. 395 00:20:21,350 --> 00:20:24,370 So in the worst case, if a column is only a varchar, 396 00:20:24,370 --> 00:20:27,520 the best the database can do by default is just linear search, 397 00:20:27,520 --> 00:20:31,810 or big O of n, which, recall, was notably slower than something 398 00:20:31,810 --> 00:20:34,250 like log N or logarithmic time. 399 00:20:34,250 --> 00:20:37,250 So the more help you can provide to the database the better off you are. 400 00:20:37,250 --> 00:20:39,792 But of course you don't want to air too far on the other side 401 00:20:39,792 --> 00:20:45,190 and say, oh, well, I'm going to have char 300 for every cell 402 00:20:45,190 --> 00:20:50,200 in my Names column, because then you're using 300 characters by definition 403 00:20:50,200 --> 00:20:54,970 for just David Malan who needs just, say, 11. 404 00:20:54,970 --> 00:20:58,570 So lastly is there this thing BLOB, binary large object, 405 00:20:58,570 --> 00:21:03,160 which can only store binary data, which can be used for images or files. 406 00:21:03,160 --> 00:21:08,230 But those, again, tend to be best stored somewhere else on the system. 407 00:21:08,230 --> 00:21:10,720 So with these possibilities in mind, how might we 408 00:21:10,720 --> 00:21:13,450 go about designing the schema for this data? 409 00:21:13,450 --> 00:21:15,790 We already have the names of my columns. 410 00:21:15,790 --> 00:21:18,040 And I've already gone ahead and put in one sample row. 411 00:21:18,040 --> 00:21:21,130 So let's consider in the context of all of those SQL types 412 00:21:21,130 --> 00:21:22,960 which ones apply here. 413 00:21:22,960 --> 00:21:25,180 Well, for user name, something like Malan, 414 00:21:25,180 --> 00:21:27,490 it's probably not numeric because that's not 415 00:21:27,490 --> 00:21:29,230 something I'd be so inclined to type. 416 00:21:29,230 --> 00:21:30,700 But it probably is text. 417 00:21:30,700 --> 00:21:33,460 But within text we have the choice of char or varchar 418 00:21:33,460 --> 00:21:36,070 or larger blocks of text, more generally. 419 00:21:36,070 --> 00:21:37,760 So what should this be? 420 00:21:37,760 --> 00:21:39,040 Well, this one depends. 421 00:21:39,040 --> 00:21:41,170 A user name tends to be fairly short. 422 00:21:41,170 --> 00:21:44,500 Because after all, we humans probably don't want to type very long strings 423 00:21:44,500 --> 00:21:46,090 just to log into some site. 424 00:21:46,090 --> 00:21:49,540 Historically, maximal values of eight characters was common. 425 00:21:49,540 --> 00:21:52,060 But that tends to be fairly constrained. 426 00:21:52,060 --> 00:21:54,370 And you can no longer express very much of your name 427 00:21:54,370 --> 00:21:56,950 if you tend to have a long name yourself. 428 00:21:56,950 --> 00:22:01,030 So here we might say something like char, because we know it will be small. 429 00:22:01,030 --> 00:22:03,710 And we'd like to be able to search on this field efficiently, 430 00:22:03,710 --> 00:22:06,880 especially to log someone in fast, but probably not quite as 431 00:22:06,880 --> 00:22:10,300 short as eight, so maybe 16 or 20. 432 00:22:10,300 --> 00:22:12,940 Or if it's hard to guess there and you want more dynamism, 433 00:22:12,940 --> 00:22:17,050 maybe you would say varchar of something like 255. 434 00:22:17,050 --> 00:22:18,000 Why that value? 435 00:22:18,000 --> 00:22:22,220 Well, recall that with eight bits can you count 256 possible values. 436 00:22:22,220 --> 00:22:25,740 But if you start counting at zero, you can only go as high as 255. 437 00:22:25,740 --> 00:22:29,280 And historically have many databases had limits of, say, 438 00:22:29,280 --> 00:22:32,770 255 for a varchar, though that's no longer the case. 439 00:22:32,770 --> 00:22:35,040 But you'll still see this very commonly. 440 00:22:35,040 --> 00:22:36,960 So what's the best answer? 441 00:22:36,960 --> 00:22:37,750 I don't know. 442 00:22:37,750 --> 00:22:40,140 It depends on your data or someone else's. 443 00:22:40,140 --> 00:22:43,200 For here, I'll go ahead and leave it as varchar (255) 444 00:22:43,200 --> 00:22:44,910 just to give myself flexibility. 445 00:22:44,910 --> 00:22:48,510 But a very good argument could be made for char of some smaller value 446 00:22:48,510 --> 00:22:51,780 or even varchar of some smaller value, too. 447 00:22:51,780 --> 00:22:54,810 As for name as well, I'm not really sure what this one should be. 448 00:22:54,810 --> 00:22:58,170 I don't know how many characters your own name has. 449 00:22:58,170 --> 00:23:00,390 Something like 11 is probably too few, even 450 00:23:00,390 --> 00:23:02,740 though it satisfies my name just fine. 451 00:23:02,740 --> 00:23:05,520 30 feels a bit [? type. ?] And frankly, I 452 00:23:05,520 --> 00:23:09,300 bet if we Google longest name in world, something 453 00:23:09,300 --> 00:23:13,110 tells me there's someone out there with quite a few characters, hopefully 454 00:23:13,110 --> 00:23:15,210 no more than, say, 255. 455 00:23:15,210 --> 00:23:18,240 But there, too, we might want to do some due diligence. 456 00:23:18,240 --> 00:23:20,097 With email, too, this seems to be easy. 457 00:23:20,097 --> 00:23:22,680 This, too, is just characters, even though you could certainly 458 00:23:22,680 --> 00:23:23,550 have numbers. 459 00:23:23,550 --> 00:23:26,010 I don't know how long the maximum email address will be. 460 00:23:26,010 --> 00:23:29,500 But frankly, it's probably going to be variable in length, 461 00:23:29,500 --> 00:23:31,120 so I'm going to go with a default. 462 00:23:31,120 --> 00:23:32,160 Why 255? 463 00:23:32,160 --> 00:23:34,140 Again, it's probably way more than we need. 464 00:23:34,140 --> 00:23:35,533 But varchar is smart. 465 00:23:35,533 --> 00:23:37,200 It's not going to waste that many bytes. 466 00:23:37,200 --> 00:23:41,490 It's just a reasonable upper bound in the absence of better intuition. 467 00:23:41,490 --> 00:23:43,680 At least we're being consistent. 468 00:23:43,680 --> 00:23:45,580 Now, address, that might be decently long, 469 00:23:45,580 --> 00:23:48,370 especially if it's going on envelope on, say, multiple lines. 470 00:23:48,370 --> 00:23:51,840 But here, too, this is probably just text, and so I'll go ahead here 471 00:23:51,840 --> 00:23:54,358 and say this, too, 255. 472 00:23:54,358 --> 00:23:54,900 Phone number. 473 00:23:54,900 --> 00:23:57,540 Finally, a number by name. 474 00:23:57,540 --> 00:23:59,700 Well, what should a phone numbers field be? 475 00:23:59,700 --> 00:24:04,410 Well, we had integer, or real, or, more specifically, small int or integer, 476 00:24:04,410 --> 00:24:06,030 or big int. 477 00:24:06,030 --> 00:24:09,870 But the funny thing here is even though a phone number is indeed 478 00:24:09,870 --> 00:24:13,260 called by us humans a number, it's really 479 00:24:13,260 --> 00:24:18,300 just a symbolic representation of a unique identifier for someone's phone. 480 00:24:18,300 --> 00:24:22,030 And so these hyphens might be common in one community or culture. 481 00:24:22,030 --> 00:24:25,890 Maybe you'd have parentheses or even pluses for country codes. 482 00:24:25,890 --> 00:24:30,390 So frankly, very quickly does even a phone number not become so much 483 00:24:30,390 --> 00:24:33,040 a number but a textual string. 484 00:24:33,040 --> 00:24:35,070 So here I have some discretion. 485 00:24:35,070 --> 00:24:37,890 And maybe I could be presumptuous and assume 486 00:24:37,890 --> 00:24:40,020 I'm only going to have customers or users for now, 487 00:24:40,020 --> 00:24:43,343 say, in the US for whatever business constraints. 488 00:24:43,343 --> 00:24:44,760 And so I might say, you know what? 489 00:24:44,760 --> 00:24:49,740 This is a candidate to actually do something like char, say, 10, 490 00:24:49,740 --> 00:24:53,220 three-digit area code, a three-digit exchange, and then four digits 491 00:24:53,220 --> 00:24:54,300 thereafter. 492 00:24:54,300 --> 00:24:58,470 But that doesn't leave room for those hyphens, so I could make a char 12. 493 00:24:58,470 --> 00:25:01,170 Or frankly, if they're just going to be there all the time, 494 00:25:01,170 --> 00:25:04,170 why don't I leave them as char 10 and just 495 00:25:04,170 --> 00:25:07,440 get rid of those values in my data? 496 00:25:07,440 --> 00:25:11,207 Or alternatively, I could support parentheses or pluses as well. 497 00:25:11,207 --> 00:25:13,290 It really depends on how I want to store the data. 498 00:25:13,290 --> 00:25:19,500 But I like the idea of a textual type, something like char or maybe varchar 499 00:25:19,500 --> 00:25:20,970 as opposed to an integer. 500 00:25:20,970 --> 00:25:23,220 Because at least if I've called certain communities 501 00:25:23,220 --> 00:25:26,660 or out from some businesses, sometimes you have to type unusual numbers. 502 00:25:26,660 --> 00:25:28,410 At least in other countries, for instance, 503 00:25:28,410 --> 00:25:30,930 if we generalize beyond this data set here do you 504 00:25:30,930 --> 00:25:34,320 type zero to actually connect to someone local. 505 00:25:34,320 --> 00:25:37,440 And the problem with zero is that, mathematically, it's 506 00:25:37,440 --> 00:25:39,690 meaningless to start a value with zero. 507 00:25:39,690 --> 00:25:41,910 And unfortunately, a computer takes that to heart. 508 00:25:41,910 --> 00:25:45,190 And if you were to store phone as an integer 509 00:25:45,190 --> 00:25:48,450 but a number were in some community to start with a zero, 510 00:25:48,450 --> 00:25:51,090 your database would probably just get rid of it. 511 00:25:51,090 --> 00:25:54,152 In fact, try that yourself in Microsoft Excel, or Apple Numbers, 512 00:25:54,152 --> 00:25:55,110 or Google Spreadsheets. 513 00:25:55,110 --> 00:25:58,530 And depending on how you formatted the column, you can type as many zeros 514 00:25:58,530 --> 00:26:00,280 as you want followed by other digits. 515 00:26:00,280 --> 00:26:04,380 And odds are, when you hit Enter, those zeros are gone. 516 00:26:04,380 --> 00:26:09,540 As for age, here we perhaps finally have a compelling candidate for a number. 517 00:26:09,540 --> 00:26:11,850 Small int should probably get the job done. 518 00:26:11,850 --> 00:26:16,890 Integer would work as well, or big int, but increasingly wasteful. 519 00:26:16,890 --> 00:26:17,760 But you know what? 520 00:26:17,760 --> 00:26:19,920 Even here it's not that obvious. 521 00:26:19,920 --> 00:26:24,930 I probably shouldn't even use an integer type here at all. 522 00:26:24,930 --> 00:26:25,830 Why? 523 00:26:25,830 --> 00:26:30,270 Well, I might indeed ask a human for his or her age upon registering. 524 00:26:30,270 --> 00:26:34,620 The catch is that age might change the next day or the day 525 00:26:34,620 --> 00:26:38,490 after, or the day after, because, of course, time is advancing. 526 00:26:38,490 --> 00:26:41,310 And unless I also stored the date and ideally 527 00:26:41,310 --> 00:26:45,600 time at which the user registered, I can't really 528 00:26:45,600 --> 00:26:49,500 even do any math to figure out, oh, you registered a year ago. 529 00:26:49,500 --> 00:26:52,150 Let me assume your age is that plus one. 530 00:26:52,150 --> 00:26:55,170 So what would have been better from the get-go than age? 531 00:26:55,170 --> 00:26:58,830 Probably just something like date of birth, DOB. 532 00:26:58,830 --> 00:27:01,530 And, of course, in SQL do we have the ability 533 00:27:01,530 --> 00:27:03,850 to store dates, even date times. 534 00:27:03,850 --> 00:27:06,120 So here we probably have a better candidate 535 00:27:06,120 --> 00:27:09,700 for exactly one of those numeric types, so to speak. 536 00:27:09,700 --> 00:27:13,097 Now, we've only just begun to scratch the surface of available data types 537 00:27:13,097 --> 00:27:14,430 and data we might want to store. 538 00:27:14,430 --> 00:27:18,150 In fact, our spreadsheet or database could contain even more types. 539 00:27:18,150 --> 00:27:21,540 But now let's just suppose that we're happy with our types. 540 00:27:21,540 --> 00:27:24,480 And the goal now at hand is to start searching our data 541 00:27:24,480 --> 00:27:27,940 and storing that data in an actual database. 542 00:27:27,940 --> 00:27:31,600 I've been using here Google Spreadsheets just to lay out my schema 543 00:27:31,600 --> 00:27:34,130 sort of as a worksheet, if you will. 544 00:27:34,130 --> 00:27:37,000 But now let's graduate to actual SQL syntax 545 00:27:37,000 --> 00:27:39,250 and suppose that this same data is stored 546 00:27:39,250 --> 00:27:43,780 not in Google Spreadsheets or any other but in some database 547 00:27:43,780 --> 00:27:47,080 elsewhere, a piece of software running somewhere else on my own computer 548 00:27:47,080 --> 00:27:50,320 or somewhere in the cloud that knows how to organize this data, still 549 00:27:50,320 --> 00:27:56,760 in rows and columns but in such a way that I can use this new language, SQL 550 00:27:56,760 --> 00:28:01,430 or SQL, in order to access my data. 551 00:28:01,430 --> 00:28:03,850 So what are the fundamental operations that 552 00:28:03,850 --> 00:28:07,510 SQL supports, or a relational database, more generally? 553 00:28:07,510 --> 00:28:09,910 It turns out that throughout computer science is there 554 00:28:09,910 --> 00:28:13,630 a pattern of operations that many different systems support. 555 00:28:13,630 --> 00:28:16,000 In the world of databases, you have fairly crassly 556 00:28:16,000 --> 00:28:20,710 what's called CRUD, the ability to create data, read data, update, 557 00:28:20,710 --> 00:28:21,490 and delete. 558 00:28:21,490 --> 00:28:26,680 But more specifically, in the context of SQL, this new language, our last, 559 00:28:26,680 --> 00:28:31,540 you have the ability to create data, select data, a.k.a. 560 00:28:31,540 --> 00:28:35,050 READ data, update or insert data, a.k.a. 561 00:28:35,050 --> 00:28:38,560 UPDATE, or delete or drop data, a.k.a. 562 00:28:38,560 --> 00:28:39,460 DELETE. 563 00:28:39,460 --> 00:28:42,490 So whereas in SQL you have these very specific terms, 564 00:28:42,490 --> 00:28:45,880 they are just representative of a class of operations 565 00:28:45,880 --> 00:28:48,520 that you might see throughout computer science. 566 00:28:48,520 --> 00:28:50,830 So how do we go about using SQL? 567 00:28:50,830 --> 00:28:53,740 Well, short of buying and installing something like Oracle 568 00:28:53,740 --> 00:28:56,230 or downloading something free like MySQL, 569 00:28:56,230 --> 00:28:58,870 we can simply use something that's indeed, by definition, 570 00:28:58,870 --> 00:29:01,540 lighter weight called SQLite. 571 00:29:01,540 --> 00:29:05,200 SQLite is an implementation of SQL, this database language, 572 00:29:05,200 --> 00:29:08,430 via which you can install it on your own Mac or PC, 573 00:29:08,430 --> 00:29:13,270 and it uses not special software or servers but rather just a file 574 00:29:13,270 --> 00:29:16,330 on your own Mac or PC that stores all of those tables 575 00:29:16,330 --> 00:29:18,300 and, in turn, rows and columns. 576 00:29:18,300 --> 00:29:20,440 It thereafter creates an abstraction of sorts 577 00:29:20,440 --> 00:29:22,960 as though you have a full-fledged server running 578 00:29:22,960 --> 00:29:26,350 on your own machine with which you can communicate using SQL. 579 00:29:26,350 --> 00:29:31,360 Now, any number of programs can be used to actually talk to a SQLite database. 580 00:29:31,360 --> 00:29:35,200 You might use a purely textual interface, so-called command line 581 00:29:35,200 --> 00:29:37,090 interface, using only your keyboard. 582 00:29:37,090 --> 00:29:41,210 Or you might use something graphical, a GUI, a graphical user interface. 583 00:29:41,210 --> 00:29:43,030 In fact, one of the easiest programs to use 584 00:29:43,030 --> 00:29:46,600 is this one here, DB Browser, freely available for Macs, 585 00:29:46,600 --> 00:29:50,740 for PCS running Windows, or Linux, or any number of operating systems 586 00:29:50,740 --> 00:29:51,670 as well. 587 00:29:51,670 --> 00:29:56,110 In advance, I've gone ahead and opened up a database in a file whose name ends 588 00:29:56,110 --> 00:29:59,830 in .sqllite or also commonly .db. 589 00:29:59,830 --> 00:30:01,900 This then is, again, just a file, a binary 590 00:30:01,900 --> 00:30:04,510 file filled with zeros and ones that collectively 591 00:30:04,510 --> 00:30:09,100 represent patterns of rows and columns and the tables that contain them. 592 00:30:09,100 --> 00:30:13,120 And if I open this program, I see all of the tables or, if you will, 593 00:30:13,120 --> 00:30:15,640 sheets from our old spreadsheet world. 594 00:30:15,640 --> 00:30:18,370 This happens to be a database all about music. 595 00:30:18,370 --> 00:30:22,960 And indeed, I have a table about musical albums and artists, customers 596 00:30:22,960 --> 00:30:26,560 and employees, and genres, and invoices, invoice lines that 597 00:30:26,560 --> 00:30:29,920 represent actual songs bought, the types of media involved, 598 00:30:29,920 --> 00:30:33,340 and playlists and playlists tracks via which users can actually 599 00:30:33,340 --> 00:30:37,807 customize those songs, and then lastly, the tracks or the songs themselves. 600 00:30:37,807 --> 00:30:39,640 In other words, someone else on the internet 601 00:30:39,640 --> 00:30:42,250 has gone to the trouble of aggregating all of this data 602 00:30:42,250 --> 00:30:47,770 about music and their authors and organized it into an SQL database. 603 00:30:47,770 --> 00:30:48,550 Now, how? 604 00:30:48,550 --> 00:30:50,342 Well, if you look to the right here, you'll 605 00:30:50,342 --> 00:30:54,040 see a number of esoteric commands all involving CREATE TABLE. 606 00:30:54,040 --> 00:30:57,490 And indeed, CREATE is one of the four fundamental operations 607 00:30:57,490 --> 00:30:58,660 that SQL supports. 608 00:30:58,660 --> 00:31:02,860 And without going into the weeds of how these tables were created, 609 00:31:02,860 --> 00:31:07,070 they were created in such a way that the author had to decide on every table's 610 00:31:07,070 --> 00:31:09,640 columns' names as well as their types. 611 00:31:09,640 --> 00:31:11,440 And that part is interesting. 612 00:31:11,440 --> 00:31:16,090 For instance, let me go into the Albums table here and expand this one now. 613 00:31:16,090 --> 00:31:21,370 Here I have Album ID, and Title, and, curiously, Artist ID. 614 00:31:21,370 --> 00:31:25,450 But you'll notice in the Album table we have no mention of artist's name. 615 00:31:25,450 --> 00:31:27,970 In fact, if we follow the white rabbit here, 616 00:31:27,970 --> 00:31:31,450 we see in the Artist table that, oh, in Artist do 617 00:31:31,450 --> 00:31:33,700 we have an artist ID and name. 618 00:31:33,700 --> 00:31:37,150 And indeed, if we keep digging, we'll see that in each of these tables 619 00:31:37,150 --> 00:31:41,050 there's not quite as much information as I might like. 620 00:31:41,050 --> 00:31:44,830 In fact, IDs, whatever those are seem to be more prevalent. 621 00:31:44,830 --> 00:31:46,930 And that is because, in the world of databases, 622 00:31:46,930 --> 00:31:49,390 and perhaps even some of your own spreadsheets, 623 00:31:49,390 --> 00:31:53,560 it's conventional and daresay best practice to normalize your data. 624 00:31:53,560 --> 00:31:58,240 You indeed should try to avoid having redundancies within tables and even 625 00:31:58,240 --> 00:31:59,620 across tables. 626 00:31:59,620 --> 00:32:00,730 Now, what does that mean? 627 00:32:00,730 --> 00:32:02,830 Well, consider our own spreadsheet of users 628 00:32:02,830 --> 00:32:05,410 that we're trying to get into a database. 629 00:32:05,410 --> 00:32:07,960 I had a column there called Address. 630 00:32:07,960 --> 00:32:12,170 And I proposed that my own address was 33 Oxford Street in Cambridge, 631 00:32:12,170 --> 00:32:13,930 Massachusetts 02138. 632 00:32:13,930 --> 00:32:17,680 Now, it turns out that there's quite a few other people at my workplace, 633 00:32:17,680 --> 00:32:21,260 and so they might have that exact same address as well. 634 00:32:21,260 --> 00:32:23,770 And in fact, that address doesn't really lend itself 635 00:32:23,770 --> 00:32:27,340 to very nice printing on envelopes because all I did 636 00:32:27,340 --> 00:32:30,280 was separate things with a comma, not actually hit Enter. 637 00:32:30,280 --> 00:32:33,190 And in fact, because everything was all in one column, 638 00:32:33,190 --> 00:32:36,550 it would seem a little difficult to search by, say, zip code. 639 00:32:36,550 --> 00:32:41,470 Find for me all of the users from Cambridge, Massachusetts in 02138. 640 00:32:41,470 --> 00:32:44,330 Well, you could do it sort of searching free form 641 00:32:44,330 --> 00:32:46,840 all of the values in some column. 642 00:32:46,840 --> 00:32:49,510 But it wouldn't be particularly clean or efficient because you'd 643 00:32:49,510 --> 00:32:53,770 have to look at and then ignore all of those other words and numbers, 33 644 00:32:53,770 --> 00:32:57,550 Oxford Street, Cambridge mass, if you only care about that zip. 645 00:32:57,550 --> 00:32:59,380 So before we forge ahead, let's see if we 646 00:32:59,380 --> 00:33:03,490 can't infer why this database is seemingly, at first glance, 647 00:33:03,490 --> 00:33:05,650 more complicated than you might like. 648 00:33:05,650 --> 00:33:09,040 Well, if I go back into my spreadsheet here, what really should 649 00:33:09,040 --> 00:33:11,050 I have done with Address? 650 00:33:11,050 --> 00:33:13,750 I probably should have quantized that value 651 00:33:13,750 --> 00:33:17,830 into streets and then city, then state and zip code, 652 00:33:17,830 --> 00:33:20,680 not having just one column but several. 653 00:33:20,680 --> 00:33:24,220 In fact, here, what I've gone ahead and done is separate out Address 654 00:33:24,220 --> 00:33:27,050 into Street and City and State and Zip. 655 00:33:27,050 --> 00:33:30,040 And for each of those have I specified a very precise type. 656 00:33:30,040 --> 00:33:34,000 I've gone ahead and proposed that Street is still varchar (255), as is city, 657 00:33:34,000 --> 00:33:37,360 because I don't really know an upper bound, so we'll at least be consistent. 658 00:33:37,360 --> 00:33:41,530 For state, I've hopefully been smart in at least assuming users are in the US. 659 00:33:41,530 --> 00:33:44,800 I've said char (2) for just that two-character code. 660 00:33:44,800 --> 00:33:47,110 And for Zip, too, I'm preemptively trying 661 00:33:47,110 --> 00:33:51,670 to avoid a mistake with even Cambridge whose zip codes start with a zero. 662 00:33:51,670 --> 00:33:54,610 Where I had to specify again that that's just an integer, 663 00:33:54,610 --> 00:33:57,760 I might actually lose mathematically that first digit. 664 00:33:57,760 --> 00:34:02,440 But by storing it as a char with five characters and no hyphen or four others 665 00:34:02,440 --> 00:34:07,120 can I ensure that that 02138 remains exactly that. 666 00:34:07,120 --> 00:34:10,060 But here, too, we have a bit of inefficiency. 667 00:34:10,060 --> 00:34:12,880 Imagine if more and more users from my own building 668 00:34:12,880 --> 00:34:15,340 register for this particular application. 669 00:34:15,340 --> 00:34:18,670 You might have again and again these exact same values 670 00:34:18,670 --> 00:34:21,219 if all of my colleagues are in the same building. 671 00:34:21,219 --> 00:34:23,500 And here, as with programming, as soon as you 672 00:34:23,500 --> 00:34:27,639 start to see a lot of redundancy, the same data again and again, 673 00:34:27,639 --> 00:34:30,520 there's surely an opportunity for better design. 674 00:34:30,520 --> 00:34:31,330 This is correct. 675 00:34:31,330 --> 00:34:33,880 My colleagues might also live here if we added 676 00:34:33,880 --> 00:34:35,679 their names and their distinct emails. 677 00:34:35,679 --> 00:34:39,940 But they don't necessarily need to have the same exact text stored 678 00:34:39,940 --> 00:34:41,830 again and again. 679 00:34:41,830 --> 00:34:45,580 So what might you do in this case even in the world of spreadsheets? 680 00:34:45,580 --> 00:34:49,989 Well, on the sheet, I might just rename this actually more explicitly to users. 681 00:34:49,989 --> 00:34:50,739 And you know what? 682 00:34:50,739 --> 00:34:53,406 Let me go ahead and create another sheet in my spreadsheet world 683 00:34:53,406 --> 00:34:55,239 and call this, say, Cities. 684 00:34:55,239 --> 00:35:00,980 And my cities might actually have a city name and perhaps a state and a zip. 685 00:35:00,980 --> 00:35:04,600 But in this leftmost column, I could be smart 686 00:35:04,600 --> 00:35:09,730 and start to assign each of these cities some unique ID or identifier. 687 00:35:09,730 --> 00:35:13,870 And so here might I have just a unique identifier, typically an integer. 688 00:35:13,870 --> 00:35:16,560 City might again be varchar (255). 689 00:35:16,560 --> 00:35:18,790 And State might, again, be char (2). 690 00:35:18,790 --> 00:35:21,250 Zip Code, meanwhile, can stay as char (5). 691 00:35:21,250 --> 00:35:25,900 But now what I can do is presume this, that if Cambridge, Massachusetts, 692 00:35:25,900 --> 00:35:31,720 02138 is in this sheet or, if you will, table let's arbitrarily but reasonably 693 00:35:31,720 --> 00:35:33,920 give it a unique identifier as one. 694 00:35:33,920 --> 00:35:37,480 And if I were to add more cities here, like Allston, Massachusetts and its zip 695 00:35:37,480 --> 00:35:40,660 code, I could sign it a unique ID of two. 696 00:35:40,660 --> 00:35:43,420 Because now that I have this data, here's 697 00:35:43,420 --> 00:35:45,340 where you get that relational aspect. 698 00:35:45,340 --> 00:35:50,050 I can relate this sheet or table to my other as follows. 699 00:35:50,050 --> 00:35:54,040 Back in this table now can I go ahead and delete much of this redundancy 700 00:35:54,040 --> 00:35:58,420 here and actually go ahead and say city is not really a city but it's a city 701 00:35:58,420 --> 00:35:59,200 ID. 702 00:35:59,200 --> 00:36:02,410 And here now it can be a proper integer, because the city 703 00:36:02,410 --> 00:36:06,460 in which all of my colleagues live is the exact same one as mine. 704 00:36:06,460 --> 00:36:09,433 Now, here there's still an opportunity for improvement, to be fair. 705 00:36:09,433 --> 00:36:11,350 Because if all of us are in the same building, 706 00:36:11,350 --> 00:36:14,000 maybe that should be factored out as well. 707 00:36:14,000 --> 00:36:16,150 And if I really wanted to go down this rabbit hole, 708 00:36:16,150 --> 00:36:18,460 I could add another sheet or table called, 709 00:36:18,460 --> 00:36:24,430 say, Buildings and factor out also that 33 Oxford Street, give it a unique ID, 710 00:36:24,430 --> 00:36:26,560 and only store numbers. 711 00:36:26,560 --> 00:36:29,552 So in short, the more redundancy and the more waste 712 00:36:29,552 --> 00:36:32,260 that you end up having in your data, the more opportunities there 713 00:36:32,260 --> 00:36:34,150 are to, so to speak, normalize it. 714 00:36:34,150 --> 00:36:36,400 To factor out those commonalities and create 715 00:36:36,400 --> 00:36:39,580 relations between some pieces of data and others 716 00:36:39,580 --> 00:36:43,180 and the convention in computing is to do, quite simply, numbers. 717 00:36:43,180 --> 00:36:43,680 Why? 718 00:36:43,680 --> 00:36:48,340 Well, it turns out it's a lot more efficient to just relate some data 719 00:36:48,340 --> 00:36:51,250 to others by relying on small integers. 720 00:36:51,250 --> 00:36:53,020 Four bytes is not all that much. 721 00:36:53,020 --> 00:36:55,390 And in fact, inside of a computer CPU are 722 00:36:55,390 --> 00:36:58,390 small registers, tiny pieces of memory that 723 00:36:58,390 --> 00:37:00,940 can be used on the most basic of operations; 724 00:37:00,940 --> 00:37:04,330 additions, subtractions, and comparisons for equality. 725 00:37:04,330 --> 00:37:07,180 And so with small values like integers can you 726 00:37:07,180 --> 00:37:12,170 very quickly reassemble or relate some data to others. 727 00:37:12,170 --> 00:37:15,430 And so here we have a general principle of database design 728 00:37:15,430 --> 00:37:18,690 to normalize it by factoring things out. 729 00:37:18,690 --> 00:37:21,940 And so if we go back into our musical database, 730 00:37:21,940 --> 00:37:26,920 you can perhaps infer why the author of this data did that preemptively. 731 00:37:26,920 --> 00:37:31,990 They have their albums having album IDs, a number like 1, 2, and 3, a title, 732 00:37:31,990 --> 00:37:34,210 which is the actual title of that album. 733 00:37:34,210 --> 00:37:38,650 And then to associate those albums with artists, 734 00:37:38,650 --> 00:37:41,620 they've used not the artist's name but an ID. 735 00:37:41,620 --> 00:37:45,940 In this way can an album have the same artist as another 736 00:37:45,940 --> 00:37:48,430 without storing that artist's name twice. 737 00:37:48,430 --> 00:37:51,190 Moreover, if the artist happens to change his or her name, 738 00:37:51,190 --> 00:37:53,860 as is not uncommon in the musical world, you 739 00:37:53,860 --> 00:37:56,740 can change that name in just one place and not 740 00:37:56,740 --> 00:38:00,540 have to scour your tables for multiple copies. 741 00:38:00,540 --> 00:38:03,630 And so if we continue this logic, we'll see in more and more tables 742 00:38:03,630 --> 00:38:07,740 that we have this principle of naming the data but then assigning it an ID. 743 00:38:07,740 --> 00:38:10,230 And if you want to relate some of that data to another, 744 00:38:10,230 --> 00:38:14,520 you simply store the ID, not the actual values. 745 00:38:14,520 --> 00:38:19,140 Of course, this is decreasingly useful as we go, because now some of my data 746 00:38:19,140 --> 00:38:21,990 is in this table, and that, and this other table, and here. 747 00:38:21,990 --> 00:38:26,310 And so while very academically clean and refined, it doesn't anymore 748 00:38:26,310 --> 00:38:29,910 seem useful to the data scientist in front of his or her computer 749 00:38:29,910 --> 00:38:32,610 that just wants to answer questions about data. 750 00:38:32,610 --> 00:38:35,490 And yet it's all over the place, whereas before, with Excel, 751 00:38:35,490 --> 00:38:37,710 you could just pull up a window. 752 00:38:37,710 --> 00:38:39,510 But that's where SQL itself comes in. 753 00:38:39,510 --> 00:38:44,010 SQL does not just prescribe how to type your data 754 00:38:44,010 --> 00:38:46,530 but rather how to query it as well. 755 00:38:46,530 --> 00:38:49,010 And in this particular program here, DB Browser, 756 00:38:49,010 --> 00:38:52,320 I can actually go over to this tab here for Execute SQL. 757 00:38:52,320 --> 00:38:55,830 And I can begin to execute those actual commands, SELECT, 758 00:38:55,830 --> 00:39:00,030 and CREATE, and UPDATE, DELETE, and others and actually see the results. 759 00:39:00,030 --> 00:39:03,660 What SQL allows you to do is express yourself programmatically 760 00:39:03,660 --> 00:39:07,200 using a fairly small language, albeit new, that 761 00:39:07,200 --> 00:39:11,160 allows you to create temporary tables, just the results, just the result 762 00:39:11,160 --> 00:39:15,130 sets, so to speak, that you want, only those rows that you care about. 763 00:39:15,130 --> 00:39:18,240 So for instance, if I want to see all of the albums in this database, 764 00:39:18,240 --> 00:39:21,897 well, in the world of spreadsheets, I would just double click in and peruse. 765 00:39:21,897 --> 00:39:24,480 But in the world of SQL, I'm actually going to type a command. 766 00:39:24,480 --> 00:39:28,740 I'm going to go ahead here and say SELECT star, for give me everything, 767 00:39:28,740 --> 00:39:33,417 from the table called Album, and then semicolon to finish my thought. 768 00:39:33,417 --> 00:39:36,000 I'm going to go ahead and click the graphical Play button here 769 00:39:36,000 --> 00:39:37,330 to execute this command. 770 00:39:37,330 --> 00:39:40,110 And you'll see suddenly that here are all of the albums 771 00:39:40,110 --> 00:39:45,930 apparently in this table, 347 of them in total in this particular database. 772 00:39:45,930 --> 00:39:48,330 And notice that all of the titles are in one column, 773 00:39:48,330 --> 00:39:53,190 the Album ID is to the left, and the Artist ID, of course, to the right. 774 00:39:53,190 --> 00:39:57,510 Well, if you're now curious who is the artist behind the album called 775 00:39:57,510 --> 00:40:00,240 For Those About to Rock, We Salute You, well, I 776 00:40:00,240 --> 00:40:02,670 can just make a mental note that the artist ID is one. 777 00:40:02,670 --> 00:40:03,540 And you know what? 778 00:40:03,540 --> 00:40:06,600 With SQL, it turns out you can use predicates. 779 00:40:06,600 --> 00:40:08,802 You can qualify the data you want. 780 00:40:08,802 --> 00:40:10,260 You don't have to say, give me all. 781 00:40:10,260 --> 00:40:11,820 You can say, give me this. 782 00:40:11,820 --> 00:40:13,210 So how do I do that? 783 00:40:13,210 --> 00:40:18,690 Well, I can actually say SELECT star from Artist, the other table, where-- 784 00:40:18,690 --> 00:40:25,450 and here comes my predicate-- artist ID equals one, a semicolon again to finish 785 00:40:25,450 --> 00:40:26,950 that thought, hit Play, and voila. 786 00:40:26,950 --> 00:40:32,410 It turns out it's AC/DC, the artist behind that particular album. 787 00:40:32,410 --> 00:40:34,280 Of course, this felt a bit manual. 788 00:40:34,280 --> 00:40:36,790 And this seems no better than a spreadsheet wherein 789 00:40:36,790 --> 00:40:38,920 now more of the work seems to be on me. 790 00:40:38,920 --> 00:40:41,230 But SQL's more expressive than this. 791 00:40:41,230 --> 00:40:44,140 Not only can you create, and select, an update, and delete data, 792 00:40:44,140 --> 00:40:48,050 you can also join data from one table and another. 793 00:40:48,050 --> 00:40:51,760 So in fact, let me go ahead and do this a little more explicitly. 794 00:40:51,760 --> 00:40:56,650 I want to go ahead and select everything from that original album table, 795 00:40:56,650 --> 00:41:00,100 but I'd like to first join it with the artist table. 796 00:41:00,100 --> 00:41:00,880 How? 797 00:41:00,880 --> 00:41:03,850 Well, I want to join those two tables, this one and this one, 798 00:41:03,850 --> 00:41:07,810 kind of like this, conceptually, so to speak. 799 00:41:07,810 --> 00:41:09,010 On what fields? 800 00:41:09,010 --> 00:41:12,640 Well, in Album, I recall there's an artist ID. 801 00:41:12,640 --> 00:41:17,290 I want that to be equal to artist.artistid. 802 00:41:17,290 --> 00:41:21,730 In other words, if you imagine this hand to be albums and this hand to be artist 803 00:41:21,730 --> 00:41:25,510 and the tips of my fingers each represent those artist IDs, 804 00:41:25,510 --> 00:41:29,230 we essentially want the SQL to line up my fingers 805 00:41:29,230 --> 00:41:32,290 so that I have on the left my albums and I have on the right the artist. 806 00:41:32,290 --> 00:41:36,040 And every one of my rows now has both. 807 00:41:36,040 --> 00:41:39,100 Let me finish my thought with a semicolon here and hit Play. 808 00:41:39,100 --> 00:41:43,750 And voila, now we see even more information but all together. 809 00:41:43,750 --> 00:41:47,020 We see that album number one, For Those About to Rock, We Salute You, 810 00:41:47,020 --> 00:41:51,370 has an artist ID of one and clearly an artist ID of one but a name. 811 00:41:51,370 --> 00:41:52,420 Well, what's happened? 812 00:41:52,420 --> 00:41:56,230 Well, both of these tables have kind of been concatenated together but joined 813 00:41:56,230 --> 00:42:00,010 intelligently such that the artist IDs in both tables 814 00:42:00,010 --> 00:42:02,463 line up on the left and the right. 815 00:42:02,463 --> 00:42:05,380 Of course, at this point, I really don't care about all these numbers. 816 00:42:05,380 --> 00:42:08,560 And I definitely don't need the temporary duplication of data, 817 00:42:08,560 --> 00:42:10,270 so I don't have to just keep saying star, 818 00:42:10,270 --> 00:42:13,480 which is the so-called wild card, which means give me everything. 819 00:42:13,480 --> 00:42:17,380 I can actually instead just say, give me the title of albums 820 00:42:17,380 --> 00:42:20,170 and the names of the artists by specifying with commas 821 00:42:20,170 --> 00:42:23,290 the names of the columns that I actually want. 822 00:42:23,290 --> 00:42:26,320 And if I now click Play, I get much simpler results, 823 00:42:26,320 --> 00:42:32,030 just the titles of albums and just the names of those artists. 824 00:42:32,030 --> 00:42:34,460 Now, how else can we express ourselves with SQL? 825 00:42:34,460 --> 00:42:37,820 Well, there are other keywords besides WHERE and besides JOIN. 826 00:42:37,820 --> 00:42:41,990 You can also group values by something and specify that they must all 827 00:42:41,990 --> 00:42:43,460 have something as well. 828 00:42:43,460 --> 00:42:46,910 For instance, let me go back to my data here and consider 829 00:42:46,910 --> 00:42:49,610 which artists have multiple albums. 830 00:42:49,610 --> 00:42:52,070 Well, if we consider the results that we had earlier do 831 00:42:52,070 --> 00:42:56,600 we have AC/DC as the artist behind For Those About to Rock, We Salute You, 832 00:42:56,600 --> 00:42:59,180 but also behind Let There Be Rock. 833 00:42:59,180 --> 00:43:02,400 Moreover, this band Accept has multiple albums as well. 834 00:43:02,400 --> 00:43:04,970 And if we scrolled further, we'd probably see others. 835 00:43:04,970 --> 00:43:08,600 So if we'd like to see those bands, those artists that have 836 00:43:08,600 --> 00:43:10,760 multiple albums, how can we do this? 837 00:43:10,760 --> 00:43:16,490 Well, what if I somehow collapsed all mentions of AC/DC into just one row, 838 00:43:16,490 --> 00:43:19,280 and next to that row I put an actual count? 839 00:43:19,280 --> 00:43:21,420 Well, I could refine this query as follows. 840 00:43:21,420 --> 00:43:24,260 I can say, yes, join these tables together, but then 841 00:43:24,260 --> 00:43:28,710 collapse them, if you will, based on grouping some common value. 842 00:43:28,710 --> 00:43:31,430 Let's go ahead now and group the data by name 843 00:43:31,430 --> 00:43:33,650 so that any artist that appears multiple times 844 00:43:33,650 --> 00:43:36,050 will effectively be collapsed into one. 845 00:43:36,050 --> 00:43:40,850 But I'd like to remember how many rows got collapsed into just one. 846 00:43:40,850 --> 00:43:43,850 And so rather than select the albums themselves this time, 847 00:43:43,850 --> 00:43:48,080 I'm going to select the album's name and then the count there of, 848 00:43:48,080 --> 00:43:52,190 thereby specifying show me the name and show me the count of that name 849 00:43:52,190 --> 00:43:54,350 before we grouped by. 850 00:43:54,350 --> 00:43:57,980 If I go ahead now and finish my thought and click Execute, 851 00:43:57,980 --> 00:44:02,390 I'll see that, indeed, AC/DC had two names and Aaron Goldberg had one. 852 00:44:02,390 --> 00:44:05,210 And if we keep scrolling, we'd see all of the bands' names 853 00:44:05,210 --> 00:44:09,350 that had one or more albums and the count for each of those. 854 00:44:09,350 --> 00:44:11,570 If I want to filter out now maybe those bands that 855 00:44:11,570 --> 00:44:14,570 only had one hit album in this database, I can instead say, 856 00:44:14,570 --> 00:44:17,840 you know what, go ahead and group by the group's name, 857 00:44:17,840 --> 00:44:22,400 but then show me only those bands having more than one album. 858 00:44:22,400 --> 00:44:25,790 Well, here, too, can I simply filter my results saying literally quite 859 00:44:25,790 --> 00:44:32,840 that, having some number count name greater than one, semicolon. 860 00:44:32,840 --> 00:44:36,735 Hitting Play now and you see the results immediately eliminate all 861 00:44:36,735 --> 00:44:38,360 of those bands that had just one album. 862 00:44:38,360 --> 00:44:40,610 And now if I scroll through, we'll see all those bands 863 00:44:40,610 --> 00:44:42,650 that had two or really more. 864 00:44:42,650 --> 00:44:45,740 And so with SQL can you begin to express yourself 865 00:44:45,740 --> 00:44:47,870 certainly more arcanely than you could with just 866 00:44:47,870 --> 00:44:52,310 a click of the mouse in a spreadsheet but ever so much more powerfully. 867 00:44:52,310 --> 00:44:55,340 But the key is to build these queries up piecemeal. 868 00:44:55,340 --> 00:44:58,160 Indeed, this now already looks quite complicated. 869 00:44:58,160 --> 00:45:02,990 But we began by just selecting all data and then refining, refining, refining, 870 00:45:02,990 --> 00:45:06,020 thereby working at this level and only getting into the weeds 871 00:45:06,020 --> 00:45:09,110 once we knew what we wanted. 872 00:45:09,110 --> 00:45:11,960 Notice now just how fast these operations have been. 873 00:45:11,960 --> 00:45:15,560 Indeed, at the bottom here do I see that 56 rows were returned. 874 00:45:15,560 --> 00:45:16,130 How fast? 875 00:45:16,130 --> 00:45:18,320 In one millisecond. 876 00:45:18,320 --> 00:45:20,450 And indeed, even though it might take me longer 877 00:45:20,450 --> 00:45:23,450 to describe in English what it is I want, in fact, 878 00:45:23,450 --> 00:45:27,290 the computer can find this data so quickly. 879 00:45:27,290 --> 00:45:30,080 But that's because we've done at least one thing already. 880 00:45:30,080 --> 00:45:33,290 This data is already organized with these proper types. 881 00:45:33,290 --> 00:45:36,650 And it also has a few other key characteristics as well. 882 00:45:36,650 --> 00:45:38,990 When storing your data in a SQL database, 883 00:45:38,990 --> 00:45:41,450 you're also asked to answer some questions. 884 00:45:41,450 --> 00:45:44,900 For every table you're asked to specify effectively what, 885 00:45:44,900 --> 00:45:47,900 if any, is this table's primary key. 886 00:45:47,900 --> 00:45:50,810 These are key words in SQL that you can apply 887 00:45:50,810 --> 00:45:53,690 to some column that says, hey, database, this column 888 00:45:53,690 --> 00:45:59,750 is my primary value that uniquely identifies every row in this table. 889 00:45:59,750 --> 00:46:02,420 In the context then of our user spreadsheet 890 00:46:02,420 --> 00:46:05,930 with which we began this discussion, that identifier for City 891 00:46:05,930 --> 00:46:07,640 was a primary key. 892 00:46:07,640 --> 00:46:10,370 I might very well have used a city's name 893 00:46:10,370 --> 00:46:12,500 as unique or perhaps even the zip code. 894 00:46:12,500 --> 00:46:15,050 But far more efficient, especially if you 895 00:46:15,050 --> 00:46:17,380 want to avoid ambiguities or duplication, 896 00:46:17,380 --> 00:46:19,130 is to just use that integer. 897 00:46:19,130 --> 00:46:23,420 And so here a primary key is almost always a numeric value, at least 898 00:46:23,420 --> 00:46:27,140 in the most optimized layouts of data. 899 00:46:27,140 --> 00:46:29,960 But it guarantees to the database that there will be 900 00:46:29,960 --> 00:46:32,480 no duplicates on this particular value. 901 00:46:32,480 --> 00:46:38,270 But more powerfully, you can define in one table a column to be a primary key, 902 00:46:38,270 --> 00:46:43,490 but then in another table that same value to be a so-called foreign key. 903 00:46:43,490 --> 00:46:47,420 In other words, throughout this example in the actual SQL database, 904 00:46:47,420 --> 00:46:50,360 I had Albums in one table and Artists in others. 905 00:46:50,360 --> 00:46:54,050 And that Artist table had an Artist ID column 906 00:46:54,050 --> 00:46:57,020 that was within that table known as a primary key. 907 00:46:57,020 --> 00:47:00,800 But when you saw that artist ID in the Albums table, 908 00:47:00,800 --> 00:47:04,370 it was contextually there a foreign key. 909 00:47:04,370 --> 00:47:08,000 Now, beyond semantics, this is an actual and valuable property. 910 00:47:08,000 --> 00:47:10,490 It ensures that the database knows how to link 911 00:47:10,490 --> 00:47:13,430 and how to link those two columns efficiently. 912 00:47:13,430 --> 00:47:17,810 Moreover, you have even fancier features available to you when you declare keys. 913 00:47:17,810 --> 00:47:19,910 You can also tell the database, you know what? 914 00:47:19,910 --> 00:47:24,800 If I ever delete this artist, go ahead and delete all of that artist's albums 915 00:47:24,800 --> 00:47:25,310 as well. 916 00:47:25,310 --> 00:47:27,530 And you can configure a database automatically 917 00:47:27,530 --> 00:47:31,520 to have this so-called cascade effect whereby data is updated 918 00:47:31,520 --> 00:47:36,320 and your data is consistent at the end of the day based on those relations, 919 00:47:36,320 --> 00:47:37,220 if you will. 920 00:47:37,220 --> 00:47:40,640 Now, in columns of data can you also specify 921 00:47:40,640 --> 00:47:42,770 that every value's got to be unique. 922 00:47:42,770 --> 00:47:45,260 It doesn't necessarily need to be your primary key, 923 00:47:45,260 --> 00:47:47,480 but it might still be unique. 924 00:47:47,480 --> 00:47:48,440 Like what? 925 00:47:48,440 --> 00:47:51,650 Well, in our Users table that we were creating on the fly, 926 00:47:51,650 --> 00:47:54,200 an email address might, by human convention, 927 00:47:54,200 --> 00:47:57,200 be unique, assuming I don't share it with someone else. 928 00:47:57,200 --> 00:48:01,070 But using an email address, multiple characters, many possible characters, 929 00:48:01,070 --> 00:48:04,340 tends not to be the most efficient way to search on data. 930 00:48:04,340 --> 00:48:08,540 So even in my Users table might I have added for best practice 931 00:48:08,540 --> 00:48:13,280 a numeric column as well, probably called ID, as my primary key. 932 00:48:13,280 --> 00:48:17,330 But I might still specify when moving that data from my Google spreadsheet 933 00:48:17,330 --> 00:48:20,030 into a SQL database that, you know what? 934 00:48:20,030 --> 00:48:23,330 Please ensure that this Email column's unique 935 00:48:23,330 --> 00:48:25,970 so that I or some other programmer doesn't accidentally 936 00:48:25,970 --> 00:48:28,760 insert duplicate data into this table. 937 00:48:28,760 --> 00:48:31,340 And moreover, the database then can search it 938 00:48:31,340 --> 00:48:35,180 more efficiently because it knows how many, if any, there 939 00:48:35,180 --> 00:48:36,380 are of any one value. 940 00:48:36,380 --> 00:48:39,590 There's one and only one maximally. 941 00:48:39,590 --> 00:48:42,830 Lastly, there's this keyword here, index. 942 00:48:42,830 --> 00:48:46,610 Well, the other feature you get so powerfully from proper databases 943 00:48:46,610 --> 00:48:50,180 is the ability to search and sort values efficiently. 944 00:48:50,180 --> 00:48:54,020 But the database doesn't know a priori how to do either 945 00:48:54,020 --> 00:48:55,250 on the data you care about. 946 00:48:55,250 --> 00:48:58,820 Because only if you tell the database what data you plan to search on 947 00:48:58,820 --> 00:49:03,210 and sort frequently can it help you in advance. 948 00:49:03,210 --> 00:49:08,000 And so if when creating a database table you tell the database server, go ahead 949 00:49:08,000 --> 00:49:12,590 and index this column, what it will do is use a database structure, 950 00:49:12,590 --> 00:49:17,210 a tree structure not unlike our so-called binary search trees, 951 00:49:17,210 --> 00:49:20,960 that pulls all the data up in an artist's rendition thereof, 952 00:49:20,960 --> 00:49:23,720 thereby ensuring that it doesn't take as many steps 953 00:49:23,720 --> 00:49:27,560 to find some email address or something else because you 954 00:49:27,560 --> 00:49:29,430 have indexed that particular column. 955 00:49:29,430 --> 00:49:32,310 It won't store it linearly top to bottom or left to right. 956 00:49:32,310 --> 00:49:34,060 It will store it in a two-dimensional tree 957 00:49:34,060 --> 00:49:38,500 structure of some sort, often known as a B-tree, that 958 00:49:38,500 --> 00:49:44,260 allows you to grab the data in hopefully logarithmic and not linear time. 959 00:49:44,260 --> 00:49:46,420 Well, turns out there are even more features you 960 00:49:46,420 --> 00:49:49,840 get from actual databases like SQLite. 961 00:49:49,840 --> 00:49:53,530 Well, you have the ability to specify when creating a table, please go ahead 962 00:49:53,530 --> 00:49:56,050 and auto increment this column. 963 00:49:56,050 --> 00:49:57,380 Well, what does that mean? 964 00:49:57,380 --> 00:49:59,320 Well, I very manually a moment ago assigned 965 00:49:59,320 --> 00:50:02,230 Cambridge the unique identifier of one. 966 00:50:02,230 --> 00:50:04,330 But why should I, the programmer, even have 967 00:50:04,330 --> 00:50:09,280 to worry or care about what the unique values of my inputs are? 968 00:50:09,280 --> 00:50:11,560 I just need that that key exists. 969 00:50:11,560 --> 00:50:14,920 I do not need to care about what that value is, just that it exists 970 00:50:14,920 --> 00:50:16,030 and it's unique. 971 00:50:16,030 --> 00:50:18,740 So you can tell the database on its own, please go ahead 972 00:50:18,740 --> 00:50:21,490 and, any time I add a new row to this table, 973 00:50:21,490 --> 00:50:25,010 increment that value automatically. 974 00:50:25,010 --> 00:50:26,980 You can also specify to a database, please 975 00:50:26,980 --> 00:50:31,480 ensure that no values in my database are null that is empty, 976 00:50:31,480 --> 00:50:35,560 thereby ensuring that a bug in your code or some missing user input 977 00:50:35,560 --> 00:50:38,110 doesn't accidentally put into your database 978 00:50:38,110 --> 00:50:41,860 a row that's only sparsely filled with real data. 979 00:50:41,860 --> 00:50:44,980 The database can help you with these things just as Python can as well, 980 00:50:44,980 --> 00:50:49,980 but it's a final layer of a defense before your data. 981 00:50:49,980 --> 00:50:51,370 And then functions as well. 982 00:50:51,370 --> 00:50:53,830 SQL itself is a programming language. 983 00:50:53,830 --> 00:50:56,950 It might not necessarily have as high ceiling as something like Python, 984 00:50:56,950 --> 00:50:58,120 as much functionality. 985 00:50:58,120 --> 00:51:01,480 But built into SQL are any number of functions. 986 00:51:01,480 --> 00:51:03,940 If you want to select the average revenue 987 00:51:03,940 --> 00:51:08,680 across any number of purchase orders, you can use the average function. 988 00:51:08,680 --> 00:51:13,900 And in MySQL Query can I select data but pass it into one of these functions 989 00:51:13,900 --> 00:51:16,960 and get back an answer without having to paste it into, say, 990 00:51:16,960 --> 00:51:19,120 a spreadsheet, let alone calculator. 991 00:51:19,120 --> 00:51:20,920 I can count rows just as I did. 992 00:51:20,920 --> 00:51:24,520 I wanted to count the number of albums that a given artist had, 993 00:51:24,520 --> 00:51:27,220 and COUNT was a function supported by SQL. 994 00:51:27,220 --> 00:51:28,540 You can get maxes and mins. 995 00:51:28,540 --> 00:51:31,330 You can get summations of value and so many more 996 00:51:31,330 --> 00:51:33,600 features built into this language. 997 00:51:33,600 --> 00:51:35,350 And while the tool you might use might not 998 00:51:35,350 --> 00:51:39,850 be DB Browser, perhaps it's just a textual interface or even something 999 00:51:39,850 --> 00:51:42,190 even more graphical, it ultimately is just 1000 00:51:42,190 --> 00:51:47,110 executing on your behalf the SQL queries and handing them off to the database 1001 00:51:47,110 --> 00:51:48,415 for execution. 1002 00:51:48,415 --> 00:51:51,040 Now, with all of these features that you get with the database, 1003 00:51:51,040 --> 00:51:52,850 it all sounds too good to be true. 1004 00:51:52,850 --> 00:51:54,910 You can scale, you can eliminate redundancy, 1005 00:51:54,910 --> 00:51:57,670 and you can still select all the data you want. 1006 00:51:57,670 --> 00:52:00,610 But unfortunately, you have to start to think harder 1007 00:52:00,610 --> 00:52:02,470 about the design of your system. 1008 00:52:02,470 --> 00:52:06,940 Because databases are sometimes vulnerable to mistakes, if you will. 1009 00:52:06,940 --> 00:52:09,310 Consider, for instance, something like Twitter 1010 00:52:09,310 --> 00:52:12,790 that tends to keep track of how many times something's retweeted. 1011 00:52:12,790 --> 00:52:14,890 Or consider an app like Instagram, which keeps 1012 00:52:14,890 --> 00:52:19,990 track of how many likes or upvotes or views some post has gotten. 1013 00:52:19,990 --> 00:52:23,890 On the most popular or viral of media, those counters, 1014 00:52:23,890 --> 00:52:26,980 those integers might be getting incremented ever so quickly. 1015 00:52:26,980 --> 00:52:31,090 If you and I both happen to view or like something at nearly the same time, 1016 00:52:31,090 --> 00:52:35,140 well, that interface from us into the system 1017 00:52:35,140 --> 00:52:38,470 might actually trigger some instruction on some server 1018 00:52:38,470 --> 00:52:43,210 somewhere to tell Instagram's database to increment some value. 1019 00:52:43,210 --> 00:52:46,730 But how does a database go about incrementing a value? 1020 00:52:46,730 --> 00:52:50,650 Well, if the value of views or the value of counts 1021 00:52:50,650 --> 00:52:54,790 is somehow stored in a database, a column of type integer, 1022 00:52:54,790 --> 00:52:59,020 and you go ahead and execute a SQL SELECT in order to get that value, 1023 00:52:59,020 --> 00:53:03,730 and, for instance, 100 people before me has liked some post, 1024 00:53:03,730 --> 00:53:07,640 well, the value of that result comes back as 100. 1025 00:53:07,640 --> 00:53:10,180 I then do some math in my code, perhaps Python. 1026 00:53:10,180 --> 00:53:15,490 I increment the 100 to 101, and then I use a SQL UPDATE, as you can, 1027 00:53:15,490 --> 00:53:18,520 to push the data back into the database. 1028 00:53:18,520 --> 00:53:21,220 But suppose both you and I anywhere in the world 1029 00:53:21,220 --> 00:53:25,240 both happen to like a post at the same or nearly the same time, 1030 00:53:25,240 --> 00:53:29,320 as can certainly happen when posts are especially popular. 1031 00:53:29,320 --> 00:53:33,040 Unfortunately, a computer can sometimes do multiple things 1032 00:53:33,040 --> 00:53:35,440 at once or at least in such rapid succession 1033 00:53:35,440 --> 00:53:39,250 that it appears to be at the same time, but a race of sorts 1034 00:53:39,250 --> 00:53:42,650 can happen, a race condition, if you will, as follows. 1035 00:53:42,650 --> 00:53:45,550 If both my button and your button is pressed 1036 00:53:45,550 --> 00:53:50,830 at nearly the same time and that induces execution of code on Instagram server 1037 00:53:50,830 --> 00:53:54,130 that selects for both of us the current count of views, 1038 00:53:54,130 --> 00:53:58,630 suppose that both of the threads, so to speak, both of the SQL operations 1039 00:53:58,630 --> 00:54:02,620 that select that data both come back with the value 100, 1040 00:54:02,620 --> 00:54:05,800 each of the blocks of code serving our requests 1041 00:54:05,800 --> 00:54:10,360 go ahead and increment that value to 101 and then, via SQL UPDATE, 1042 00:54:10,360 --> 00:54:13,690 pushes that value back to the database. 1043 00:54:13,690 --> 00:54:19,510 Unfortunately, because both you and I induced an evaluation of that math 1044 00:54:19,510 --> 00:54:22,900 at nearly the same time, what the database might end up storing 1045 00:54:22,900 --> 00:54:26,470 is not 102 but 101. 1046 00:54:26,470 --> 00:54:32,110 In other words, if two people's input is triggering a race to update data, 1047 00:54:32,110 --> 00:54:37,600 the database had better keep track of who and when asked for that update. 1048 00:54:37,600 --> 00:54:39,250 Otherwise, you lose data. 1049 00:54:39,250 --> 00:54:43,840 Now, in the case of tweets or likes, it's probably not all that problematic. 1050 00:54:43,840 --> 00:54:46,510 Though, frankly, that is their business. 1051 00:54:46,510 --> 00:54:50,380 But you can certainly imagine that with banks or financial institutions, where 1052 00:54:50,380 --> 00:54:53,140 the numbers matter ever so more, you certainly 1053 00:54:53,140 --> 00:54:56,800 don't want to accidentally lose track of some dollars. 1054 00:54:56,800 --> 00:55:02,310 And so how do we go about solving this in the case of a SQL database? 1055 00:55:02,310 --> 00:55:06,370 Well, it turns out that there is fairly fundamental primitives or solutions 1056 00:55:06,370 --> 00:55:07,360 you can use. 1057 00:55:07,360 --> 00:55:12,790 Consider a metaphor in the real world, such as, say, a familiar refrigerator. 1058 00:55:12,790 --> 00:55:15,190 And suppose that you and your significant other 1059 00:55:15,190 --> 00:55:19,300 happened to both like something to drink at the end of the day, like some milk. 1060 00:55:19,300 --> 00:55:22,410 And so you go ahead when you get home, and the other person's not, 1061 00:55:22,410 --> 00:55:25,255 and you open the fridge and you see, oh, darn it, we're out of milk. 1062 00:55:25,255 --> 00:55:27,700 And so you close the fridge and you head downstairs 1063 00:55:27,700 --> 00:55:29,530 and you walk to the nearest store. 1064 00:55:29,530 --> 00:55:32,440 Meanwhile, that other person comes home and, wanting some milk, 1065 00:55:32,440 --> 00:55:36,430 opens the fridge, and darn it if we aren't out of milk as well. 1066 00:55:36,430 --> 00:55:39,670 And so that person now heads out, perhaps in a different car, 1067 00:55:39,670 --> 00:55:43,420 in a different route, and heads to the store, some other store to get milk. 1068 00:55:43,420 --> 00:55:45,490 Fast forward some amount of time and both of you 1069 00:55:45,490 --> 00:55:49,390 come home, and darn it if you don't now have twice as much milk as you need, 1070 00:55:49,390 --> 00:55:50,980 and it does go bad. 1071 00:55:50,980 --> 00:55:55,270 And so you've both ended up buying milk when really only one of you needed to. 1072 00:55:55,270 --> 00:55:57,520 And this is similar in spirit, but now you've 1073 00:55:57,520 --> 00:56:02,440 got more data than you actually wanted, but it's not the right amount of data. 1074 00:56:02,440 --> 00:56:03,830 So why did that happen? 1075 00:56:03,830 --> 00:56:08,170 Well, both of you, like Instagram, inspected the state of some value 1076 00:56:08,170 --> 00:56:10,900 and made a decision on it before the other person 1077 00:56:10,900 --> 00:56:14,170 was done acting on that information. 1078 00:56:14,170 --> 00:56:17,170 So in our very real world of milk, how could you 1079 00:56:17,170 --> 00:56:22,540 go about avoiding that conflict, that race, to restock the fridge? 1080 00:56:22,540 --> 00:56:26,290 Well, you could quite simply grab a pen and paper and leave a note, 1081 00:56:26,290 --> 00:56:29,260 so to speak, on the fridge telling someone else, gone for milk, 1082 00:56:29,260 --> 00:56:31,390 and hopefully they then would not do the same. 1083 00:56:31,390 --> 00:56:35,290 Or perhaps more dramatically you could lock the refrigerator in some sense 1084 00:56:35,290 --> 00:56:38,740 so that they can't even get into it and inspect that state. 1085 00:56:38,740 --> 00:56:41,800 But ultimately, you need your act of checking the fridge 1086 00:56:41,800 --> 00:56:44,950 and restocking it to be what we'll call atomic. 1087 00:56:44,950 --> 00:56:48,880 And databases can and hopefully do provide atomicity, 1088 00:56:48,880 --> 00:56:54,670 that property, the ability to do multiple things together or not at all 1089 00:56:54,670 --> 00:56:58,300 but not be interrupted by someone else's work. 1090 00:56:58,300 --> 00:57:00,160 And in fact, in the database world, these 1091 00:57:00,160 --> 00:57:04,270 are generally known as actual locks whereby you say to the database, 1092 00:57:04,270 --> 00:57:07,630 don't let anyone else write to this table or row 1093 00:57:07,630 --> 00:57:11,780 until I am ready to release or unlock it. 1094 00:57:11,780 --> 00:57:14,530 That, of course, though, tends to be a very heavy-handed solution. 1095 00:57:14,530 --> 00:57:17,140 Say don't let anyone else touch this data. 1096 00:57:17,140 --> 00:57:19,660 Better to do it on a more fine-grained control 1097 00:57:19,660 --> 00:57:21,940 so that you don't slow your whole system down. 1098 00:57:21,940 --> 00:57:25,660 And so SQL databases tend to support what are more finally known 1099 00:57:25,660 --> 00:57:30,530 as transactions whereby you can execute one or more commands again and again 1100 00:57:30,530 --> 00:57:33,430 and again back to back but make sure that all of them 1101 00:57:33,430 --> 00:57:36,220 go through it once before, say, your commands 1102 00:57:36,220 --> 00:57:41,260 that your user input induced actually is allowed to get executed. 1103 00:57:41,260 --> 00:57:43,960 Now, honestly, even in the most high-tech 1104 00:57:43,960 --> 00:57:47,260 of places like Instagram and Twitter, this is a hard problem. 1105 00:57:47,260 --> 00:57:51,490 Because at some point, even waiting for my operations 1106 00:57:51,490 --> 00:57:54,290 to complete before yours can go in can be a challenge 1107 00:57:54,290 --> 00:57:56,210 and a bottleneck for everyone else. 1108 00:57:56,210 --> 00:58:00,360 And so in the case of the most viral of posts, what can systems these days do? 1109 00:58:00,360 --> 00:58:03,100 Well, you could just kind of wait and write 1110 00:58:03,100 --> 00:58:06,790 that data back to the database that is updated eventually. 1111 00:58:06,790 --> 00:58:09,100 And indeed, another property of databases 1112 00:58:09,100 --> 00:58:13,870 is known as just that, eventual consistency, a property that says, 1113 00:58:13,870 --> 00:58:19,900 don't lose any data, but only eventually make sure it's reflected on the server. 1114 00:58:19,900 --> 00:58:23,170 Eventually get the value right, but do get it right. 1115 00:58:23,170 --> 00:58:25,720 And so what Instagram and Twitter and others might do 1116 00:58:25,720 --> 00:58:30,690 is just cache or buffer that data, waiting until things have quieted down 1117 00:58:30,690 --> 00:58:35,140 'til the post is no longer viral or most users have gone to sleep. 1118 00:58:35,140 --> 00:58:38,050 Now, that alone might not be to the best of solutions, 1119 00:58:38,050 --> 00:58:43,990 but it avoids having to get the highest powered and most expensive hardware. 1120 00:58:43,990 --> 00:58:46,720 Of course, in other contexts, that might be the better solution. 1121 00:58:46,720 --> 00:58:48,760 In the world of finance, sometimes it comes 1122 00:58:48,760 --> 00:58:53,140 down to the actual length of cables or distance from some server to another 1123 00:58:53,140 --> 00:58:55,930 to ensure that the data gets there so fast 1124 00:58:55,930 --> 00:58:59,030 that you don't run into these sorts of challenges. 1125 00:58:59,030 --> 00:59:01,810 So databases can solve this, but the developers and designers 1126 00:59:01,810 --> 00:59:07,740 that use those databases need to know how to do it and that they should. 1127 00:59:07,740 --> 00:59:10,230 Lastly, there's another challenge as well, 1128 00:59:10,230 --> 00:59:12,780 unfortunately all too commonly done these days 1129 00:59:12,780 --> 00:59:17,280 because folks just don't defend against it via common mechanisms. 1130 00:59:17,280 --> 00:59:21,810 It turns out that a bad actor somewhere on the internet or your own network 1131 00:59:21,810 --> 00:59:24,960 can potentially, if you're not careful, trick 1132 00:59:24,960 --> 00:59:29,400 a database into executing commands that you didn't intend. 1133 00:59:29,400 --> 00:59:32,160 For instance, suppose in the context of Python 1134 00:59:32,160 --> 00:59:34,740 you have some code that looks a bit like this. 1135 00:59:34,740 --> 00:59:37,280 Here is a program written in a mix of pseudocode 1136 00:59:37,280 --> 00:59:40,380 and Python that's designed to allow a user to input 1137 00:59:40,380 --> 00:59:43,560 the title of an album for which they want to search. 1138 00:59:43,560 --> 00:59:48,720 And so here I use the Python function INPUT to prompt the user for just that. 1139 00:59:48,720 --> 00:59:51,570 On the left-hand side do I clear a variable called Title, and then 1140 00:59:51,570 --> 00:59:55,800 assigned from right to left, the user's response to that variable. 1141 00:59:55,800 --> 00:59:59,610 Then suppose for the sake of discussion there is some function called EXECUTE 1142 00:59:59,610 --> 01:00:05,020 whose purpose in life is to take input that itself represents a SQL command. 1143 01:00:05,020 --> 01:00:08,430 That SQL command might be this, so like star from Artist 1144 01:00:08,430 --> 01:00:11,190 where Title equals something. 1145 01:00:11,190 --> 01:00:12,490 Now, what is that something? 1146 01:00:12,490 --> 01:00:15,270 Well, if I have the user's input in a variable called Title, 1147 01:00:15,270 --> 01:00:19,680 I can use the plus operator in Python, not to add but concatenate 1148 01:00:19,680 --> 01:00:24,390 two strings together, coding them singly and completing that thought. 1149 01:00:24,390 --> 01:00:28,800 The problem, though, with SQL is that user's not really to be trusted. 1150 01:00:28,800 --> 01:00:32,580 And whether the user's input is coming from a keyboard on a Mac, 1151 01:00:32,580 --> 01:00:36,990 or PC, or perhaps, more compellingly, from an app or website, 1152 01:00:36,990 --> 01:00:40,260 you probably should not trust all your users. 1153 01:00:40,260 --> 01:00:44,880 Because suppose that your user typed in not the album name for which they 1154 01:00:44,880 --> 01:00:48,090 want to search, Let There Be Rock, but rather they type something 1155 01:00:48,090 --> 01:00:51,990 like Let There Be Rock, semicolon, DELETE, 1156 01:00:51,990 --> 01:00:56,160 thereby using SQL's apparent DELETE command in order 1157 01:00:56,160 --> 01:01:00,360 to trick your database into executing not one but two 1158 01:01:00,360 --> 01:01:02,430 commands, a SELECT and DELETE. 1159 01:01:02,430 --> 01:01:05,130 And indeed, this is what's known as a SQL injection 1160 01:01:05,130 --> 01:01:08,100 attack, the ability for an adversary, a bad actor 1161 01:01:08,100 --> 01:01:12,060 out there, to somehow trick your database and your code 1162 01:01:12,060 --> 01:01:16,500 into executing some command that you didn't intend. 1163 01:01:16,500 --> 01:01:17,690 How is that possible? 1164 01:01:17,690 --> 01:01:21,330 Well, some of these characters are dangerous, so to speak. 1165 01:01:21,330 --> 01:01:25,680 A semicolon in SQL tends to separate one command from another. 1166 01:01:25,680 --> 01:01:27,180 It finishes your thought. 1167 01:01:27,180 --> 01:01:30,930 And if you yourself don't anticipate that some human, this bad actor, 1168 01:01:30,930 --> 01:01:33,960 might type in themselves a semicolon when they really 1169 01:01:33,960 --> 01:01:38,040 shouldn't be typing SQL at all, you might mistake that semicolon 1170 01:01:38,040 --> 01:01:42,330 for the actual terminus between one command and another. 1171 01:01:42,330 --> 01:01:45,240 And if you just blindly pass it into your server 1172 01:01:45,240 --> 01:01:49,080 and let it execute as usual, you might execute not just that SELECT 1173 01:01:49,080 --> 01:01:51,750 but that DELETE or anything else as well. 1174 01:01:51,750 --> 01:01:55,260 And in this way can an adversary not only delete data from your database 1175 01:01:55,260 --> 01:01:59,430 but maybe select more than you intended, or update or insert. 1176 01:01:59,430 --> 01:02:03,030 It's ultimately up to you to defend against these threats. 1177 01:02:03,030 --> 01:02:04,680 So, how? 1178 01:02:04,680 --> 01:02:08,610 Well, it turns out that there are libraries, code written by other people 1179 01:02:08,610 --> 01:02:11,460 that, frankly, via very easy-to-use functions, 1180 01:02:11,460 --> 01:02:17,760 just make it easy to sanitize or scrub, so to speak, user's input. 1181 01:02:17,760 --> 01:02:20,700 What do these libraries or these functions typically do? 1182 01:02:20,700 --> 01:02:24,280 Honestly, they just escape, so to speak, these dangerous characters. 1183 01:02:24,280 --> 01:02:27,750 Something like a semicolon or perhaps a single apostrophe that 1184 01:02:27,750 --> 01:02:32,230 might, in SQL, have some special and dangerous potential meaning, 1185 01:02:32,230 --> 01:02:36,330 they just escape them as by putting a backslash, a special character in front 1186 01:02:36,330 --> 01:02:40,080 of them so that if the human were to type in Let There Be Rock, 1187 01:02:40,080 --> 01:02:46,020 semicolon, DELETE, that would actually be interpreted safely by your database 1188 01:02:46,020 --> 01:02:51,840 as a search for an album called Let There Be Rock, semicolon, DELETE, 1189 01:02:51,840 --> 01:02:55,380 which of course most likely is not the name of an album. 1190 01:02:55,380 --> 01:03:00,060 So that query would probably not return or select any results. 1191 01:03:00,060 --> 01:03:02,280 But more importantly, it would not be tricked 1192 01:03:02,280 --> 01:03:05,280 into executing two SQL commands. 1193 01:03:05,280 --> 01:03:10,200 Rather, it would execute only the SELECT but with a nonsensical value. 1194 01:03:10,200 --> 01:03:13,260 Lastly, consider what a database is. 1195 01:03:13,260 --> 01:03:16,170 It's really a piece of software running on some computer 1196 01:03:16,170 --> 01:03:20,820 somewhere, be it on my own Mac, or PC, or some server in the cloud. 1197 01:03:20,820 --> 01:03:24,780 But if you have just one database, as I seem to keep presuming, 1198 01:03:24,780 --> 01:03:27,180 you have this so-called single point of failure, 1199 01:03:27,180 --> 01:03:31,290 again, just as we had in the world of cloud computing more generally. 1200 01:03:31,290 --> 01:03:35,330 And so with something like data where you don't want to lose it 1201 01:03:35,330 --> 01:03:37,830 and you certainly don't want all of your web servers or apps 1202 01:03:37,830 --> 01:03:43,170 to go offline just because one server, your database server, has gone out, 1203 01:03:43,170 --> 01:03:48,327 it's time to revisit the architecture or topology of systems more generally. 1204 01:03:48,327 --> 01:03:50,160 Something tells me that it's not going to be 1205 01:03:50,160 --> 01:03:52,560 sufficient to have just one database. 1206 01:03:52,560 --> 01:03:54,570 You probably want two. 1207 01:03:54,570 --> 01:03:59,010 But if you have two databases, now how do you decide where to put the data? 1208 01:03:59,010 --> 01:04:02,580 Do you put it here, or over here, or maybe in both places? 1209 01:04:02,580 --> 01:04:06,050 If you put it in both places, though, you're then using twice as much space, 1210 01:04:06,050 --> 01:04:08,510 so already we've opened a can of worms. 1211 01:04:08,510 --> 01:04:11,730 To solve one simple problem, don't be a single point of failure. 1212 01:04:11,730 --> 01:04:15,980 But that's going to cost you some time, or some money, or certainly space. 1213 01:04:15,980 --> 01:04:18,020 So what can you do if you're architecting 1214 01:04:18,020 --> 01:04:22,670 a system that has now not just web servers but, say, databases? 1215 01:04:22,670 --> 01:04:26,510 Well, odds are you're going to want to have not just the one, pictured here 1216 01:04:26,510 --> 01:04:29,690 as a cylinder, as this convention, but you're probably 1217 01:04:29,690 --> 01:04:32,600 going to want to have a second as well. 1218 01:04:32,600 --> 01:04:35,120 But of course, if you have two of them, odds 1219 01:04:35,120 --> 01:04:38,200 are it's not sufficient just to store half of your data on one 1220 01:04:38,200 --> 01:04:40,670 and half of your data on the other, because, of course, 1221 01:04:40,670 --> 01:04:42,890 you've not solved the single point of failure. 1222 01:04:42,890 --> 01:04:45,890 You now just have two single points of failure because half of your data 1223 01:04:45,890 --> 01:04:48,590 could be lost here or half of it here. 1224 01:04:48,590 --> 01:04:52,400 So odds are you're going to want to start having backups of data. 1225 01:04:52,400 --> 01:04:55,130 But you don't want to necessarily have to waste time restoring 1226 01:04:55,130 --> 01:04:57,650 from backup, especially if you want to maintain 1227 01:04:57,650 --> 01:05:00,740 as many as five nines of uptime. 1228 01:05:00,740 --> 01:05:04,160 So odds are you're going to want to have these databases really 1229 01:05:04,160 --> 01:05:06,830 be duplicates of one another. 1230 01:05:06,830 --> 01:05:09,140 And whenever you write data to one database, 1231 01:05:09,140 --> 01:05:12,170 you should probably write it to the other in parallel. 1232 01:05:12,170 --> 01:05:15,140 So, yes, admittedly, you have just spent twice as much space 1233 01:05:15,140 --> 01:05:17,190 and, frankly, twice as much money. 1234 01:05:17,190 --> 01:05:20,390 But at some point those sorts of costs are inevitable. 1235 01:05:20,390 --> 01:05:23,000 But there's other ways to scale here, too. 1236 01:05:23,000 --> 01:05:26,720 You can, of course, hit a ceiling on vertical scaling 1237 01:05:26,720 --> 01:05:28,520 even when it comes to databases. 1238 01:05:28,520 --> 01:05:32,570 After all, if a database is just a program running on some server 1239 01:05:32,570 --> 01:05:36,620 and there is only so much RAM or disk space or CPU in that server, 1240 01:05:36,620 --> 01:05:39,770 eventually you're not going to be able to store as much data 1241 01:05:39,770 --> 01:05:42,000 or as quickly as you want. 1242 01:05:42,000 --> 01:05:43,340 So what can you do? 1243 01:05:43,340 --> 01:05:45,620 Well, you could, for instance, shard your data 1244 01:05:45,620 --> 01:05:50,060 and have not just two but maybe four or more servers and put all of the users 1245 01:05:50,060 --> 01:05:55,040 whose names start from A to M on one half of your cluster of servers, 1246 01:05:55,040 --> 01:05:57,170 so to speak, but maybe everyone else from M 1247 01:05:57,170 --> 01:06:00,500 to Z based on, say, their last name can go on the others. 1248 01:06:00,500 --> 01:06:04,220 To shard a database means to split the data in some predictable way 1249 01:06:04,220 --> 01:06:05,990 that you can repeat again and again. 1250 01:06:05,990 --> 01:06:10,070 But even there, too, even if only the As through Ms are going to the left, 1251 01:06:10,070 --> 01:06:14,060 you want to make sure that you still have that backup or replica. 1252 01:06:14,060 --> 01:06:17,600 And this arrow suggests that they really should be intercommunicating, 1253 01:06:17,600 --> 01:06:20,450 not unlike load balancers we've seen. 1254 01:06:20,450 --> 01:06:23,120 But there's another way you can scale your databases as well. 1255 01:06:23,120 --> 01:06:27,440 You don't have to have databases doing both reading and writing. 1256 01:06:27,440 --> 01:06:31,340 To read data from a database or any server means to take it from its memory 1257 01:06:31,340 --> 01:06:32,720 and read it into yours. 1258 01:06:32,720 --> 01:06:36,140 And to write means to do the opposite, to save it. 1259 01:06:36,140 --> 01:06:39,140 Well, what you can do actually in the world of databases 1260 01:06:39,140 --> 01:06:42,710 is also replicate your databases multiple times. 1261 01:06:42,710 --> 01:06:46,800 And you might have connected to these two primary databases 1262 01:06:46,800 --> 01:06:52,610 multiple other databases that are just copies in one direction and not both. 1263 01:06:52,610 --> 01:06:57,030 And what you might then do is use these two primary databases not only to read 1264 01:06:57,030 --> 01:06:59,720 but to write, abbreviated here RW. 1265 01:06:59,720 --> 01:07:02,000 But these other databases down here, which 1266 01:07:02,000 --> 01:07:04,730 are just copies of the ones to which they're connected, 1267 01:07:04,730 --> 01:07:06,440 are just called read replicas. 1268 01:07:06,440 --> 01:07:11,450 They exist solely for the purpose to read from them again and again. 1269 01:07:11,450 --> 01:07:13,200 When might this make sense? 1270 01:07:13,200 --> 01:07:15,320 Well, in some contexts, like social media, 1271 01:07:15,320 --> 01:07:17,930 like Facebook, it's probably the case that there 1272 01:07:17,930 --> 01:07:20,690 are more reads than there are writes. 1273 01:07:20,690 --> 01:07:24,800 That is to say you probably know more people who post more content than you 1274 01:07:24,800 --> 01:07:29,310 but you probably still read or look at theirs. 1275 01:07:29,310 --> 01:07:32,180 And so if the data for your business follows 1276 01:07:32,180 --> 01:07:36,560 that pattern whereby writes are maybe common but reads are way more common, 1277 01:07:36,560 --> 01:07:41,030 you can do exactly this model and replicate again and again, honestly, 1278 01:07:41,030 --> 01:07:44,420 as a tree structure for efficiency so that it doesn't all 1279 01:07:44,420 --> 01:07:46,280 have to replicate one at a time. 1280 01:07:46,280 --> 01:07:50,000 But then you can write software, be it in Python or something else, 1281 01:07:50,000 --> 01:07:54,440 that writes data only to these one or two servers but reads 1282 01:07:54,440 --> 01:07:56,720 from any number of them as well. 1283 01:07:56,720 --> 01:07:59,450 But this, too, is a bit of a rabbit hole, because at some point 1284 01:07:59,450 --> 01:08:01,580 you want to have this redundancy not in one 1285 01:08:01,580 --> 01:08:05,990 location but others, east coast and west coast, one country and another. 1286 01:08:05,990 --> 01:08:10,400 And at that point, you might actually run into the limitations of time. 1287 01:08:10,400 --> 01:08:13,910 Because after all, it takes a non-zero number of milliseconds or seconds 1288 01:08:13,910 --> 01:08:16,310 for data to travel long distance. 1289 01:08:16,310 --> 01:08:18,050 Consider after all how long it might take 1290 01:08:18,050 --> 01:08:21,590 data to go from Cambridge, Massachusetts to somewhere in Japan. 1291 01:08:21,590 --> 01:08:26,210 That's far longer than it might take to just go down the road to MIT. 1292 01:08:26,210 --> 01:08:30,979 So here, too, we can revisit all of the problems we saw in the world of cloud 1293 01:08:30,979 --> 01:08:32,540 computing and servers more generally. 1294 01:08:32,540 --> 01:08:35,990 They're back to revisit in the context of databases. 1295 01:08:35,990 --> 01:08:40,399 But with databases, you care ever more that these things not go down, 1296 01:08:40,399 --> 01:08:44,630 or if they do, that you have spares, backups, or replicas. 1297 01:08:44,630 --> 01:08:48,680 Because now that we're storing our data in this centralized place, 1298 01:08:48,680 --> 01:08:52,069 we have to think hard not only about how we're scaling computationally 1299 01:08:52,069 --> 01:08:56,359 but how we're scaling in terms of our data as well. 1300 01:08:56,359 --> 01:08:59,090 So consider where then we began. 1301 01:08:59,090 --> 01:09:02,770 We started by laying out data in a spreadsheet, be it Microsoft Excel, 1302 01:09:02,770 --> 01:09:04,939 or Apple Numbers, or Google Spreadsheets. 1303 01:09:04,939 --> 01:09:07,130 From there we considered what types of data 1304 01:09:07,130 --> 01:09:09,950 we might store there so that if we want to upgrade, 1305 01:09:09,950 --> 01:09:12,410 so to speak, from a spreadsheet to database, 1306 01:09:12,410 --> 01:09:15,260 we know what types we can specify. 1307 01:09:15,260 --> 01:09:19,189 And in SQL, whether implemented in SQLite, Oracle, or MySQL, 1308 01:09:19,189 --> 01:09:21,560 or something else, they tend to be standard types that 1309 01:09:21,560 --> 01:09:25,279 tend to be common across platforms, albeit with some variations, 1310 01:09:25,279 --> 01:09:27,260 so that we can think hard about these types 1311 01:09:27,260 --> 01:09:31,277 and then ultimately help the database help us be performant. 1312 01:09:31,277 --> 01:09:34,069 Because if I know that I'm going to be searching or selecting based 1313 01:09:34,069 --> 01:09:36,470 on certain data, I can tell the database, for instance, 1314 01:09:36,470 --> 01:09:39,680 to make it unique or at least index it. 1315 01:09:39,680 --> 01:09:43,520 And then using SQL constructions like SELECT, and INSERT, and UPDATE, 1316 01:09:43,520 --> 01:09:46,670 and DELETE, and yet others can I manipulate that data and get 1317 01:09:46,670 --> 01:09:52,069 at it even faster, frankly, than the human me could with a mere spreadsheet. 1318 01:09:52,069 --> 01:09:54,860 But with the design of any system, as with databases, 1319 01:09:54,860 --> 01:09:58,430 we start to open new cans of worms and new problems 1320 01:09:58,430 --> 01:10:01,400 as we start to explore more sophisticated challenges. 1321 01:10:01,400 --> 01:10:04,880 But here, too, many, if not all, of these problems 1322 01:10:04,880 --> 01:10:09,290 can be solved by simply reducing the problems to first principles 1323 01:10:09,290 --> 01:10:11,450 and consider, what is the problem to be solved? 1324 01:10:11,450 --> 01:10:12,770 How is our data represented? 1325 01:10:12,770 --> 01:10:14,150 Where is it stored? 1326 01:10:14,150 --> 01:10:18,290 And consider ultimately what business constraints or legal constraints 1327 01:10:18,290 --> 01:10:22,700 we have when manipulating that data and consider what tools of the trade 1328 01:10:22,700 --> 01:10:24,050 are available to us. 1329 01:10:24,050 --> 01:10:27,740 This then is database design and, more generally, design unto itself, 1330 01:10:27,740 --> 01:10:32,450 not just focusing on the correctness of implementing solutions to problems 1331 01:10:32,450 --> 01:10:37,670 but the efficiency and the scalability as well thereof. 1332 01:10:37,670 --> 01:10:39,196