DAVID MALAN: We've seen how with languages like Python can we implement business logic on a server and, even if we want, generate web pages. And we've seen, on the browser side, the so-called client side, how you can render information or data to users. And with CSS can we style it, and with JavaScript can we even make it interactive. But when building a business or any application that is interacting with users, where is all that data being stored? Well, if you're running your own small business, you might simply be using a spreadsheet. Of course, spreadsheets, whether they're implemented as Microsoft Excel, or Google Spreadsheets, or Apple Numbers, generally stores data in rows and columns. And if you're like me, you probably use that first row to represent the names of the columns and the types of the data that you're storing. And as you continue to add more and more data to that spreadsheet, you probably, like me, continue to add row after row after row. And meanwhile, if you've got so much data or so many different types of data that it doesn't really belong in one sheet, you might actually create a second sheet or a tab along the bottom of the screen, for instance, in which you can store additional data as well. So indeed, Microsoft Excel allows you to store data relationally, so to speak. You might have one sheet with customer info, another sheet with invoices, another sheet with products, or any number of other types of data. And you can relate those to another by somehow having some commonality among them, some identifier, some name, some customer ID or the like. And so data that tends to be stored in rows and columns can have these relationships. And you can do this certainly in the cloud these days as well with Google Docs, simply a web-based version of the same. But when designing your spreadsheets, or more generally, your data's schema, the design thereof, there's a number of questions you need to consider. What data do you want to store? And where do you want to store it? And what type of data is it, in fact? Because whether using Excel, or Numbers, or Google Spreadsheets, odds are you sometimes configure the columns to be displayed or formatted in some ways. Numbers might be displayed with or without commas. Numbers might be displayed with dollar signs or some other symbol. You might separate one field from another using special symbols or punctuation. So these are all configuration options in a spreadsheet that might be among the first decisions you make. In fact, let's consider a sample spreadsheet wherein I might want to represent users in some web application. All of us have, of course, registered for websites or applications these days. And what are some of the questions you're asked? Well, you might be asked for your user name by some application. You might be asked for your actual name. What else might you be asked for? Perhaps your email address. And perhaps if you're buying something, or registering for something, or someone that needs to know a bit more about you, perhaps they'll even ask you for your address, maybe your phone number, or perhaps even your age. So you can certainly imagine there being even other fields that you need. But let's consider now what types of data each of these fields is. Now, what should a user name be? Well, by definition, this should be a unique value that belongs to you and only you on a system. For instance, if I'm registering for a website and that website is storing some data in, for instance, a spreadsheet, I might ideally try to claim Malan if it's available. My name meanwhile, of course, will be just David Malan, which someone else can certainly have as well. Hopefully my email address is only mine. And that, too, will be stored in a column of its own. And then my address, for instance, here on campus, 33 Oxford Street, Cambridge, Massachusetts, 02138. A phone number by which I can be reached, 617-495-5000. And then my age shall be-- well, we'll leave that blank just for now. So notice how each of these pieces of data is just a little bit different. Some of them seem to be numeric. Some of them seem to be alphabetic. Some of them even have some punctuation therein. Now, you might or might not in a spreadsheet care to display this data a bit differently. Indeed, the only number that I'm sweeping under the rug here is, in fact, my age. But hopefully I don't need one or more commas in that value. So there isn't really any need for special formatting here. But it turns out that when you're actually building a website or software-based application, you're probably not storing your data ultimately in just spreadsheets. You might graduate eventually from being a small shop to needing more than just Microsoft Excel, or Numbers, or even something cloud-based like Google Spreadsheets. Why? Well, you have more rows than those programs can generally handle. In fact, on your own Mac or PC, odds are, when opening up big files, whether a spreadsheet or any other document, sometimes you might actually feel that. The computer might start to slow, and you might start to see a spinning beach ball or hourglass because you start to hit the limits of what a local computer can do with just client-side software, like a spreadsheet tool. And so eventually you might actually need to use a proper database. And a database is really just a piece of software that can absolutely run on your Mac or PC. But very commonly it runs on a server or somewhere else in the cloud to which your own software connects. For instance, if you're building a website or an application in a language like Python, you can, in Python, write code that talks or communicates with that database to pull data down and to send data back up. But spreadsheets are wonderfully straightforward. It's just so intuitive how you store the data in rows and columns and just more and more rows as you have more and more data. And so what's nice about some databases is that they mimic exactly this design. There exists in the world what are called relational databases. And indeed, this is among the most common ways to store data relationally. The jargon is a bit different in the world of databases. You no longer call these things spreadsheets. We call them databases. And we don't call the individual tabs sheets. We call them tables. But we continue to call the structure of the data therein rows for each additional piece of data and columns for the different types of data that we have. But in a relational database, such as Oracle, Microsoft Access, SQL Server, My SQL, Postgres, or something smaller and lighter weight called SQLlite, the burden is on you, the designer of the database, or the programmer, or the business person to actually decide on what types of data you are going to store. Because the database, in order to handle more and more data than a typical spreadsheet can support, needs a bit of help from you, needs to know what type of data you're storing so that it can search it more efficiently and sort it more effectively and make it easier for you ultimately to add and remove data from that particular database. So in other words, you can do all of the same operations. But generally, in a database, you have so much more data, you're going to need to have the database help you help yourself. So what does that actually mean here? Well, let's consider some of these fields here. Were we to migrate my spreadsheet with many, many, many rows of users, for instance, to a proper database relationally, I'm going to have to make a few more decisions as well. And it turns out that I need to choose various types for these columns. And generally, in a database, you only have a limited menu of options. So let's take a look at what those might be. Here in many relational databases are just some of the data types that are available to you; an integer if you want to represent something like 1, 2, 3, or perhaps even a negative, a real number, otherwise known as a floating point value, for instance, in Python that actually has a decimal point and perhaps some numbers thereafter, numeric, which is more of a catch-all and might handle things like dates and times that are numbers in some sense but have some more formal structure to them, and then the more general text, when you just have words or phrases, characters or whole paragraphs or more that you might want to store as well in some column. And then lastly, you might have the cutely named BLOB, or binary large object, when you actually want to store zeros and ones, that is to say binary data, like actual files in your database. However, you needn't do this typically. You can actually store files, of course, on a file system on your own hard drive somewhere on the server, but that option exists for you. Now, this happens to be the list of data types supported by the simplest of relational databases, something called SQLite. And as its name implies, it actually supports a very specific language via which you or anyone can request data from the database and store data in the database, or update it, or delete it. And that language is called SQL, structured query language. SQL is a language via which you can do exactly that, retrieve data from a database, put data into a database, and so much more. It's an alternative to the pointing and clicking with which you're probably familiar with something like Microsoft Excel, or Google spreadsheet, or Apple Numbers, where it's really you, the human, doing all the work via a graphical user interface. Now, to be fair, in all of this spreadsheet programs are there functions or macros that you might be able to write so that you can program in those environments but with some limitations. And indeed with SQL and with SQL databases, more generally, that is databases that understand this language, can you achieve far higher performance, can you store far more data, and nonetheless get at it quickly. So it's sort of the next evolution of what you might otherwise do only within your own computer spreadsheet. But many other databases, Oracle, SQL Server, Microsoft Access, MySQL, Postgres and more, support more than these data types that allow you to help the database help you even more. Because the more the database knows about your data, the smarter decisions it can make at a lower level below these abstractions and storing that data so that, when you have a question, it can answer it quickly. And so how might you refine what we mean by integer? Well, in some databases, you might want to tell it that this is just a small integer in this column one after the other, row by row. Or maybe it's just an integer somewhere a little bigger than that, or perhaps a big int, which means it can be even larger. Now, these actually map to very well-defined values. If you think back to how we considered data itself is stored in a computer, it's ultimately with zeros and ones or bits. And indeed, an integer, so to speak, generally takes up just 32 bits or four bytes. But, of course, that few bits actually translates, if you do a bit of the math, to four billion possible values. So if you were to specify that a column in your database is of type integer, that means you could type in the number from zero all the way on up to four billion, or if you want negative numbers, from negative two billion roughly to positive two billion as well. But beyond that three billion or four, you're out of luck if you're only using 32 bits. Now, that might seem huge. And that is a good problem to have if I have four billion users, or sales, or anything in my database. But for certain large companies these days, having four billion or more records of some sort, transactions or logs, not at all uncommon. And for those purposes does there exist big int, a bigger data type that uses 64 bits or eight bytes via which you can count much, much, much higher, so high that I'm not even sure how to pronounce that number. But indeed, it should be large enough for almost all of our purposes. So ultimately, it's not enough to say that the field in a database is an integer but to specify just how big maximally that integer can be. Now, fortunately, I think for my age we could probably get away with small int. But indeed, with a small int, you typically have 16 bits. So even then could your user be as old as 65,535 years old. Now, what about real numbers? These are floating point values with decimal points. And here, too, you have fewer choices, typically, but along the same lines. A real number would typically take 32 bits, which allows you only some amount of precision, some number of digits after that decimal point. If you want to go even further and be ever more precise can you use twice as much space, 64 bits or eight bytes, and use double precision instead. Well, it would seem that we sort of start with an OK value and an even better value here. But what's the trade-off, as there always is? Well, it's fine if want to be able to count higher, whether with integers or real numbers. But you are literally going to pay a price. Because if you want to count higher than four billion in the context of integers, you're going to have to spend twice as much space from four bytes to eight in order to store those values. And if you're never actually going to use values larger than four billion, you're just spending twice as much space as you need to. And if you don't really need to store floating point values to terribly many digits of precision, you're just wasting space. Now, for small databases, this might be inconsequential. But for the Googles, and Microsofts, and others of the world to deal in gigabytes and terabytes of data, all of these bits and bytes will surely add up. As for numeric, more of a catch-all numerically, so to speak, whereby you have Booleans, zeros or ones, or false and true values, dates which comes in a prescribed format, typically year, year, year, year, dash, month, month, dash, day, day. Indeed, even though across the world there are different formats and different communities for representing dates, in SQL do you specify them exactly in that format only. Date time adds to that a space followed by hour, hour, colon, minute, minute, colon, second, second, whereby if you want to store both date and a time, you should store it in exactly that format in your database, even though, thanks to code, whether Python or SQL itself, you want to display that date in a geographically localized and conventional way, you nonetheless store it in this way in your database. Now, sometimes you need to fight against the imprecision that's inherent as a risk with real numbers. And so you can use the so-called numeric data type to specify a so-called scale and precision. That is two numbers that specify really how many digits you ultimately want to support. So if you want to store things like dollar amounts, maybe to cents or hundreds of cents, two or four decimal places, can you specify exactly that, and you will get exactly that much precision. As for time and timestamp, well, time is quite simply hour, hour, colon, minute, minute, colon, second, second. But timestamp's a little more interesting. It, too, can be used to represent time. But as with many languages, Python and JavaScript among them, it does so typically by remembering or counting some number of milliseconds or seconds from a particular point in time. In fact, some years ago, it was conventional to simply start counting time from January 1st, 1970, the so-called epic as opposed to say 0 BC or AD. Unfortunately, it turns out that you generally represent these timestamps with four bytes or 32 bits, which gives you access to four billion or so possible values. And it turns out that in the year 2038, we will have run out of bits with which to represent time. So if you recall the so-called Y2K problem when we ran into this issue around the year 2000, so have we rather painted ourselves into a corner in many systems by not allocating enough space to get us past that year. Well, what's the solution invariably going to be? Well, use more space and, thus, cost. But that is inherent in many systems. And it's one of these things that us humans will have to address. How about text? Well, text can, too, come in multiple forms. Whether it's English or any other language, you might have different lengths thereof. And so within the world of text values and databases, need you specify typically how many characters you want to store in that column. And this is distinct from something like a spreadsheet program where you can probably keep typing and typing or pasting as many characters as you'd like. A database, in so far as its purpose in life is to give you better performance and more scalability than a spreadsheet allows, wants you to tell it a bit more. It's going to ask you to decide whether you want to store a fixed number of chars. So char followed by a number here represented at its end means how many characters exactly shall you store in every cell in this column. Now, why might that be applicable? Well, in the US, for instance, for states, we have a standardization of two characters for every state in the country. And so you might say char (2), thereby allowing you to store CT for Connecticut, and FL for Florida, and MA for Massachusetts and so on, because you know that every value is going to be that finite length, two. But sometimes you don't know the max length. And indeed, in my own table of users, my own name might have D-A-V-I-D and a space and M-A-L-A-N, so 11 for me. But your name might be shorter or longer. Or some user that hasn't even registered for our website yet might have an even longer name than that. So varchar, or variable number of chars, exists, too, where you can specify not a fixed number of characters to use but a maximum upper bound, perhaps something like 30, perhaps something like 300, or some number of characters that you decide that's going to be the upper limit of any human I actually see in my sight who might have a name as many as that characters. But varchar is smart and the databases that use it, because it will use that many characters maximally. But it's not going to waste space. In fact, if you're using 30 characters for every name in your table, well, it's only going to store as many characters are necessary, perhaps plus a bit of overhead or accounting to keep track of that value. But that way you save space but still can handle large text or large strings. Lastly, if you have particularly large text, whole paragraphs, or essays, or documents that someone might paste in, or the contents of an entire web page that you want to search, well, you can have a text field more generally, which tends to support tens of thousands or more characters in total. But it does so in a way that's not quite as efficient. Rather than keep it right in the column itself, so to speak, it puts it somewhere else on the server, the result of which is that you can fit more room over here, so to speak. But it takes more time and effort to go search or find that data, so, again, a trade-off of space and time. But here, too, as with integers and real, you seem to have multiple choices, one of which at first glance is better. After all, why use char ever if you could also just say varchar with that same amount? For instance, for those two character codes for states, why not just say varchar and not just char (2)? Well, it turns out that if you promise the database that every cell in your column will use exactly the same number of bytes, it turns out that you'll have very straight or justified edges, in some sense, conceptually for that column. For instance, if every cell in a column takes up exactly the same amount of space, then you might know, if you start numbering the bytes that are represented by those cells, this might be byte zero up here and then two characters away will be address two, and then four, and then six, and then eight. And if you think back to how algorithms like, say, binary search are implemented, as soon as you have the ability to know arithmetically where the next value, or the last value, or, best yet, the middle value is, numerically can you jump right to it in so-called constant time, thereby enabling something like binary search and ultimately logarithmic time. So this is a powerful thing. If you commit to the database that all of your values will take up the exact amount of space, the database can search that value faster for you. Varchar, unfortunately, is rather akin to having only a left justified column, whereby you might have data that looks very straight on one edge with all of the characters left aligned. But because the length of those characters vary, this cell might be this wide, and this one this wide, and this one this wide, and this one this wide. And as such, because you have a very ragged edge, so to speak, you can't just do simple math and add two and add two to get to the next value in that column. So in the worst case, if a column is only a varchar, the best the database can do by default is just linear search, or big O of n, which, recall, was notably slower than something like log N or logarithmic time. So the more help you can provide to the database the better off you are. But of course you don't want to air too far on the other side and say, oh, well, I'm going to have char 300 for every cell in my Names column, because then you're using 300 characters by definition for just David Malan who needs just, say, 11. So lastly is there this thing BLOB, binary large object, which can only store binary data, which can be used for images or files. But those, again, tend to be best stored somewhere else on the system. So with these possibilities in mind, how might we go about designing the schema for this data? We already have the names of my columns. And I've already gone ahead and put in one sample row. So let's consider in the context of all of those SQL types which ones apply here. Well, for user name, something like Malan, it's probably not numeric because that's not something I'd be so inclined to type. But it probably is text. But within text we have the choice of char or varchar or larger blocks of text, more generally. So what should this be? Well, this one depends. A user name tends to be fairly short. Because after all, we humans probably don't want to type very long strings just to log into some site. Historically, maximal values of eight characters was common. But that tends to be fairly constrained. And you can no longer express very much of your name if you tend to have a long name yourself. So here we might say something like char, because we know it will be small. And we'd like to be able to search on this field efficiently, especially to log someone in fast, but probably not quite as short as eight, so maybe 16 or 20. Or if it's hard to guess there and you want more dynamism, maybe you would say varchar of something like 255. Why that value? Well, recall that with eight bits can you count 256 possible values. But if you start counting at zero, you can only go as high as 255. And historically have many databases had limits of, say, 255 for a varchar, though that's no longer the case. But you'll still see this very commonly. So what's the best answer? I don't know. It depends on your data or someone else's. For here, I'll go ahead and leave it as varchar (255) just to give myself flexibility. But a very good argument could be made for char of some smaller value or even varchar of some smaller value, too. As for name as well, I'm not really sure what this one should be. I don't know how many characters your own name has. Something like 11 is probably too few, even though it satisfies my name just fine. 30 feels a bit [? type. ?] And frankly, I bet if we Google longest name in world, something tells me there's someone out there with quite a few characters, hopefully no more than, say, 255. But there, too, we might want to do some due diligence. With email, too, this seems to be easy. This, too, is just characters, even though you could certainly have numbers. I don't know how long the maximum email address will be. But frankly, it's probably going to be variable in length, so I'm going to go with a default. Why 255? Again, it's probably way more than we need. But varchar is smart. It's not going to waste that many bytes. It's just a reasonable upper bound in the absence of better intuition. At least we're being consistent. Now, address, that might be decently long, especially if it's going on envelope on, say, multiple lines. But here, too, this is probably just text, and so I'll go ahead here and say this, too, 255. Phone number. Finally, a number by name. Well, what should a phone numbers field be? Well, we had integer, or real, or, more specifically, small int or integer, or big int. But the funny thing here is even though a phone number is indeed called by us humans a number, it's really just a symbolic representation of a unique identifier for someone's phone. And so these hyphens might be common in one community or culture. Maybe you'd have parentheses or even pluses for country codes. So frankly, very quickly does even a phone number not become so much a number but a textual string. So here I have some discretion. And maybe I could be presumptuous and assume I'm only going to have customers or users for now, say, in the US for whatever business constraints. And so I might say, you know what? This is a candidate to actually do something like char, say, 10, three-digit area code, a three-digit exchange, and then four digits thereafter. But that doesn't leave room for those hyphens, so I could make a char 12. Or frankly, if they're just going to be there all the time, why don't I leave them as char 10 and just get rid of those values in my data? Or alternatively, I could support parentheses or pluses as well. It really depends on how I want to store the data. But I like the idea of a textual type, something like char or maybe varchar as opposed to an integer. Because at least if I've called certain communities or out from some businesses, sometimes you have to type unusual numbers. At least in other countries, for instance, if we generalize beyond this data set here do you type zero to actually connect to someone local. And the problem with zero is that, mathematically, it's meaningless to start a value with zero. And unfortunately, a computer takes that to heart. And if you were to store phone as an integer but a number were in some community to start with a zero, your database would probably just get rid of it. In fact, try that yourself in Microsoft Excel, or Apple Numbers, or Google Spreadsheets. And depending on how you formatted the column, you can type as many zeros as you want followed by other digits. And odds are, when you hit Enter, those zeros are gone. As for age, here we perhaps finally have a compelling candidate for a number. Small int should probably get the job done. Integer would work as well, or big int, but increasingly wasteful. But you know what? Even here it's not that obvious. I probably shouldn't even use an integer type here at all. Why? Well, I might indeed ask a human for his or her age upon registering. The catch is that age might change the next day or the day after, or the day after, because, of course, time is advancing. And unless I also stored the date and ideally time at which the user registered, I can't really even do any math to figure out, oh, you registered a year ago. Let me assume your age is that plus one. So what would have been better from the get-go than age? Probably just something like date of birth, DOB. And, of course, in SQL do we have the ability to store dates, even date times. So here we probably have a better candidate for exactly one of those numeric types, so to speak. Now, we've only just begun to scratch the surface of available data types and data we might want to store. In fact, our spreadsheet or database could contain even more types. But now let's just suppose that we're happy with our types. And the goal now at hand is to start searching our data and storing that data in an actual database. I've been using here Google Spreadsheets just to lay out my schema sort of as a worksheet, if you will. But now let's graduate to actual SQL syntax and suppose that this same data is stored not in Google Spreadsheets or any other but in some database elsewhere, a piece of software running somewhere else on my own computer or somewhere in the cloud that knows how to organize this data, still in rows and columns but in such a way that I can use this new language, SQL or SQL, in order to access my data. So what are the fundamental operations that SQL supports, or a relational database, more generally? It turns out that throughout computer science is there a pattern of operations that many different systems support. In the world of databases, you have fairly crassly what's called CRUD, the ability to create data, read data, update, and delete. But more specifically, in the context of SQL, this new language, our last, you have the ability to create data, select data, a.k.a. READ data, update or insert data, a.k.a. UPDATE, or delete or drop data, a.k.a. DELETE. So whereas in SQL you have these very specific terms, they are just representative of a class of operations that you might see throughout computer science. So how do we go about using SQL? Well, short of buying and installing something like Oracle or downloading something free like MySQL, we can simply use something that's indeed, by definition, lighter weight called SQLite. SQLite is an implementation of SQL, this database language, via which you can install it on your own Mac or PC, and it uses not special software or servers but rather just a file on your own Mac or PC that stores all of those tables and, in turn, rows and columns. It thereafter creates an abstraction of sorts as though you have a full-fledged server running on your own machine with which you can communicate using SQL. Now, any number of programs can be used to actually talk to a SQLite database. You might use a purely textual interface, so-called command line interface, using only your keyboard. Or you might use something graphical, a GUI, a graphical user interface. In fact, one of the easiest programs to use is this one here, DB Browser, freely available for Macs, for PCS running Windows, or Linux, or any number of operating systems as well. In advance, I've gone ahead and opened up a database in a file whose name ends in .sqllite or also commonly .db. This then is, again, just a file, a binary file filled with zeros and ones that collectively represent patterns of rows and columns and the tables that contain them. And if I open this program, I see all of the tables or, if you will, sheets from our old spreadsheet world. This happens to be a database all about music. And indeed, I have a table about musical albums and artists, customers and employees, and genres, and invoices, invoice lines that represent actual songs bought, the types of media involved, and playlists and playlists tracks via which users can actually customize those songs, and then lastly, the tracks or the songs themselves. In other words, someone else on the internet has gone to the trouble of aggregating all of this data about music and their authors and organized it into an SQL database. Now, how? Well, if you look to the right here, you'll see a number of esoteric commands all involving CREATE TABLE. And indeed, CREATE is one of the four fundamental operations that SQL supports. And without going into the weeds of how these tables were created, they were created in such a way that the author had to decide on every table's columns' names as well as their types. And that part is interesting. For instance, let me go into the Albums table here and expand this one now. Here I have Album ID, and Title, and, curiously, Artist ID. But you'll notice in the Album table we have no mention of artist's name. In fact, if we follow the white rabbit here, we see in the Artist table that, oh, in Artist do we have an artist ID and name. And indeed, if we keep digging, we'll see that in each of these tables there's not quite as much information as I might like. In fact, IDs, whatever those are seem to be more prevalent. And that is because, in the world of databases, and perhaps even some of your own spreadsheets, it's conventional and daresay best practice to normalize your data. You indeed should try to avoid having redundancies within tables and even across tables. Now, what does that mean? Well, consider our own spreadsheet of users that we're trying to get into a database. I had a column there called Address. And I proposed that my own address was 33 Oxford Street in Cambridge, Massachusetts 02138. Now, it turns out that there's quite a few other people at my workplace, and so they might have that exact same address as well. And in fact, that address doesn't really lend itself to very nice printing on envelopes because all I did was separate things with a comma, not actually hit Enter. And in fact, because everything was all in one column, it would seem a little difficult to search by, say, zip code. Find for me all of the users from Cambridge, Massachusetts in 02138. Well, you could do it sort of searching free form all of the values in some column. But it wouldn't be particularly clean or efficient because you'd have to look at and then ignore all of those other words and numbers, 33 Oxford Street, Cambridge mass, if you only care about that zip. So before we forge ahead, let's see if we can't infer why this database is seemingly, at first glance, more complicated than you might like. Well, if I go back into my spreadsheet here, what really should I have done with Address? I probably should have quantized that value into streets and then city, then state and zip code, not having just one column but several. In fact, here, what I've gone ahead and done is separate out Address into Street and City and State and Zip. And for each of those have I specified a very precise type. I've gone ahead and proposed that Street is still varchar (255), as is city, because I don't really know an upper bound, so we'll at least be consistent. For state, I've hopefully been smart in at least assuming users are in the US. I've said char (2) for just that two-character code. And for Zip, too, I'm preemptively trying to avoid a mistake with even Cambridge whose zip codes start with a zero. Where I had to specify again that that's just an integer, I might actually lose mathematically that first digit. But by storing it as a char with five characters and no hyphen or four others can I ensure that that 02138 remains exactly that. But here, too, we have a bit of inefficiency. Imagine if more and more users from my own building register for this particular application. You might have again and again these exact same values if all of my colleagues are in the same building. And here, as with programming, as soon as you start to see a lot of redundancy, the same data again and again, there's surely an opportunity for better design. This is correct. My colleagues might also live here if we added their names and their distinct emails. But they don't necessarily need to have the same exact text stored again and again. So what might you do in this case even in the world of spreadsheets? Well, on the sheet, I might just rename this actually more explicitly to users. And you know what? Let me go ahead and create another sheet in my spreadsheet world and call this, say, Cities. And my cities might actually have a city name and perhaps a state and a zip. But in this leftmost column, I could be smart and start to assign each of these cities some unique ID or identifier. And so here might I have just a unique identifier, typically an integer. City might again be varchar (255). And State might, again, be char (2). Zip Code, meanwhile, can stay as char (5). But now what I can do is presume this, that if Cambridge, Massachusetts, 02138 is in this sheet or, if you will, table let's arbitrarily but reasonably give it a unique identifier as one. And if I were to add more cities here, like Allston, Massachusetts and its zip code, I could sign it a unique ID of two. Because now that I have this data, here's where you get that relational aspect. I can relate this sheet or table to my other as follows. Back in this table now can I go ahead and delete much of this redundancy here and actually go ahead and say city is not really a city but it's a city ID. And here now it can be a proper integer, because the city in which all of my colleagues live is the exact same one as mine. Now, here there's still an opportunity for improvement, to be fair. Because if all of us are in the same building, maybe that should be factored out as well. And if I really wanted to go down this rabbit hole, I could add another sheet or table called, say, Buildings and factor out also that 33 Oxford Street, give it a unique ID, and only store numbers. So in short, the more redundancy and the more waste that you end up having in your data, the more opportunities there are to, so to speak, normalize it. To factor out those commonalities and create relations between some pieces of data and others and the convention in computing is to do, quite simply, numbers. Why? Well, it turns out it's a lot more efficient to just relate some data to others by relying on small integers. Four bytes is not all that much. And in fact, inside of a computer CPU are small registers, tiny pieces of memory that can be used on the most basic of operations; additions, subtractions, and comparisons for equality. And so with small values like integers can you very quickly reassemble or relate some data to others. And so here we have a general principle of database design to normalize it by factoring things out. And so if we go back into our musical database, you can perhaps infer why the author of this data did that preemptively. They have their albums having album IDs, a number like 1, 2, and 3, a title, which is the actual title of that album. And then to associate those albums with artists, they've used not the artist's name but an ID. In this way can an album have the same artist as another without storing that artist's name twice. Moreover, if the artist happens to change his or her name, as is not uncommon in the musical world, you can change that name in just one place and not have to scour your tables for multiple copies. And so if we continue this logic, we'll see in more and more tables that we have this principle of naming the data but then assigning it an ID. And if you want to relate some of that data to another, you simply store the ID, not the actual values. Of course, this is decreasingly useful as we go, because now some of my data is in this table, and that, and this other table, and here. And so while very academically clean and refined, it doesn't anymore seem useful to the data scientist in front of his or her computer that just wants to answer questions about data. And yet it's all over the place, whereas before, with Excel, you could just pull up a window. But that's where SQL itself comes in. SQL does not just prescribe how to type your data but rather how to query it as well. And in this particular program here, DB Browser, I can actually go over to this tab here for Execute SQL. And I can begin to execute those actual commands, SELECT, and CREATE, and UPDATE, DELETE, and others and actually see the results. What SQL allows you to do is express yourself programmatically using a fairly small language, albeit new, that allows you to create temporary tables, just the results, just the result sets, so to speak, that you want, only those rows that you care about. So for instance, if I want to see all of the albums in this database, well, in the world of spreadsheets, I would just double click in and peruse. But in the world of SQL, I'm actually going to type a command. I'm going to go ahead here and say SELECT star, for give me everything, from the table called Album, and then semicolon to finish my thought. I'm going to go ahead and click the graphical Play button here to execute this command. And you'll see suddenly that here are all of the albums apparently in this table, 347 of them in total in this particular database. And notice that all of the titles are in one column, the Album ID is to the left, and the Artist ID, of course, to the right. Well, if you're now curious who is the artist behind the album called For Those About to Rock, We Salute You, well, I can just make a mental note that the artist ID is one. And you know what? With SQL, it turns out you can use predicates. You can qualify the data you want. You don't have to say, give me all. You can say, give me this. So how do I do that? Well, I can actually say SELECT star from Artist, the other table, where-- and here comes my predicate-- artist ID equals one, a semicolon again to finish that thought, hit Play, and voila. It turns out it's AC/DC, the artist behind that particular album. Of course, this felt a bit manual. And this seems no better than a spreadsheet wherein now more of the work seems to be on me. But SQL's more expressive than this. Not only can you create, and select, an update, and delete data, you can also join data from one table and another. So in fact, let me go ahead and do this a little more explicitly. I want to go ahead and select everything from that original album table, but I'd like to first join it with the artist table. How? Well, I want to join those two tables, this one and this one, kind of like this, conceptually, so to speak. On what fields? Well, in Album, I recall there's an artist ID. I want that to be equal to artist.artistid. In other words, if you imagine this hand to be albums and this hand to be artist and the tips of my fingers each represent those artist IDs, we essentially want the SQL to line up my fingers so that I have on the left my albums and I have on the right the artist. And every one of my rows now has both. Let me finish my thought with a semicolon here and hit Play. And voila, now we see even more information but all together. We see that album number one, For Those About to Rock, We Salute You, has an artist ID of one and clearly an artist ID of one but a name. Well, what's happened? Well, both of these tables have kind of been concatenated together but joined intelligently such that the artist IDs in both tables line up on the left and the right. Of course, at this point, I really don't care about all these numbers. And I definitely don't need the temporary duplication of data, so I don't have to just keep saying star, which is the so-called wild card, which means give me everything. I can actually instead just say, give me the title of albums and the names of the artists by specifying with commas the names of the columns that I actually want. And if I now click Play, I get much simpler results, just the titles of albums and just the names of those artists. Now, how else can we express ourselves with SQL? Well, there are other keywords besides WHERE and besides JOIN. You can also group values by something and specify that they must all have something as well. For instance, let me go back to my data here and consider which artists have multiple albums. Well, if we consider the results that we had earlier do we have AC/DC as the artist behind For Those About to Rock, We Salute You, but also behind Let There Be Rock. Moreover, this band Accept has multiple albums as well. And if we scrolled further, we'd probably see others. So if we'd like to see those bands, those artists that have multiple albums, how can we do this? Well, what if I somehow collapsed all mentions of AC/DC into just one row, and next to that row I put an actual count? Well, I could refine this query as follows. I can say, yes, join these tables together, but then collapse them, if you will, based on grouping some common value. Let's go ahead now and group the data by name so that any artist that appears multiple times will effectively be collapsed into one. But I'd like to remember how many rows got collapsed into just one. And so rather than select the albums themselves this time, I'm going to select the album's name and then the count there of, thereby specifying show me the name and show me the count of that name before we grouped by. If I go ahead now and finish my thought and click Execute, I'll see that, indeed, AC/DC had two names and Aaron Goldberg had one. And if we keep scrolling, we'd see all of the bands' names that had one or more albums and the count for each of those. If I want to filter out now maybe those bands that only had one hit album in this database, I can instead say, you know what, go ahead and group by the group's name, but then show me only those bands having more than one album. Well, here, too, can I simply filter my results saying literally quite that, having some number count name greater than one, semicolon. Hitting Play now and you see the results immediately eliminate all of those bands that had just one album. And now if I scroll through, we'll see all those bands that had two or really more. And so with SQL can you begin to express yourself certainly more arcanely than you could with just a click of the mouse in a spreadsheet but ever so much more powerfully. But the key is to build these queries up piecemeal. Indeed, this now already looks quite complicated. But we began by just selecting all data and then refining, refining, refining, thereby working at this level and only getting into the weeds once we knew what we wanted. Notice now just how fast these operations have been. Indeed, at the bottom here do I see that 56 rows were returned. How fast? In one millisecond. And indeed, even though it might take me longer to describe in English what it is I want, in fact, the computer can find this data so quickly. But that's because we've done at least one thing already. This data is already organized with these proper types. And it also has a few other key characteristics as well. When storing your data in a SQL database, you're also asked to answer some questions. For every table you're asked to specify effectively what, if any, is this table's primary key. These are key words in SQL that you can apply to some column that says, hey, database, this column is my primary value that uniquely identifies every row in this table. In the context then of our user spreadsheet with which we began this discussion, that identifier for City was a primary key. I might very well have used a city's name as unique or perhaps even the zip code. But far more efficient, especially if you want to avoid ambiguities or duplication, is to just use that integer. And so here a primary key is almost always a numeric value, at least in the most optimized layouts of data. But it guarantees to the database that there will be no duplicates on this particular value. But more powerfully, you can define in one table a column to be a primary key, but then in another table that same value to be a so-called foreign key. In other words, throughout this example in the actual SQL database, I had Albums in one table and Artists in others. And that Artist table had an Artist ID column that was within that table known as a primary key. But when you saw that artist ID in the Albums table, it was contextually there a foreign key. Now, beyond semantics, this is an actual and valuable property. It ensures that the database knows how to link and how to link those two columns efficiently. Moreover, you have even fancier features available to you when you declare keys. You can also tell the database, you know what? If I ever delete this artist, go ahead and delete all of that artist's albums as well. And you can configure a database automatically to have this so-called cascade effect whereby data is updated and your data is consistent at the end of the day based on those relations, if you will. Now, in columns of data can you also specify that every value's got to be unique. It doesn't necessarily need to be your primary key, but it might still be unique. Like what? Well, in our Users table that we were creating on the fly, an email address might, by human convention, be unique, assuming I don't share it with someone else. But using an email address, multiple characters, many possible characters, tends not to be the most efficient way to search on data. So even in my Users table might I have added for best practice a numeric column as well, probably called ID, as my primary key. But I might still specify when moving that data from my Google spreadsheet into a SQL database that, you know what? Please ensure that this Email column's unique so that I or some other programmer doesn't accidentally insert duplicate data into this table. And moreover, the database then can search it more efficiently because it knows how many, if any, there are of any one value. There's one and only one maximally. Lastly, there's this keyword here, index. Well, the other feature you get so powerfully from proper databases is the ability to search and sort values efficiently. But the database doesn't know a priori how to do either on the data you care about. Because only if you tell the database what data you plan to search on and sort frequently can it help you in advance. And so if when creating a database table you tell the database server, go ahead and index this column, what it will do is use a database structure, a tree structure not unlike our so-called binary search trees, that pulls all the data up in an artist's rendition thereof, thereby ensuring that it doesn't take as many steps to find some email address or something else because you have indexed that particular column. It won't store it linearly top to bottom or left to right. It will store it in a two-dimensional tree structure of some sort, often known as a B-tree, that allows you to grab the data in hopefully logarithmic and not linear time. Well, turns out there are even more features you get from actual databases like SQLite. Well, you have the ability to specify when creating a table, please go ahead and auto increment this column. Well, what does that mean? Well, I very manually a moment ago assigned Cambridge the unique identifier of one. But why should I, the programmer, even have to worry or care about what the unique values of my inputs are? I just need that that key exists. I do not need to care about what that value is, just that it exists and it's unique. So you can tell the database on its own, please go ahead and, any time I add a new row to this table, increment that value automatically. You can also specify to a database, please ensure that no values in my database are null that is empty, thereby ensuring that a bug in your code or some missing user input doesn't accidentally put into your database a row that's only sparsely filled with real data. The database can help you with these things just as Python can as well, but it's a final layer of a defense before your data. And then functions as well. SQL itself is a programming language. It might not necessarily have as high ceiling as something like Python, as much functionality. But built into SQL are any number of functions. If you want to select the average revenue across any number of purchase orders, you can use the average function. And in MySQL Query can I select data but pass it into one of these functions and get back an answer without having to paste it into, say, a spreadsheet, let alone calculator. I can count rows just as I did. I wanted to count the number of albums that a given artist had, and COUNT was a function supported by SQL. You can get maxes and mins. You can get summations of value and so many more features built into this language. And while the tool you might use might not be DB Browser, perhaps it's just a textual interface or even something even more graphical, it ultimately is just executing on your behalf the SQL queries and handing them off to the database for execution. Now, with all of these features that you get with the database, it all sounds too good to be true. You can scale, you can eliminate redundancy, and you can still select all the data you want. But unfortunately, you have to start to think harder about the design of your system. Because databases are sometimes vulnerable to mistakes, if you will. Consider, for instance, something like Twitter that tends to keep track of how many times something's retweeted. Or consider an app like Instagram, which keeps track of how many likes or upvotes or views some post has gotten. On the most popular or viral of media, those counters, those integers might be getting incremented ever so quickly. If you and I both happen to view or like something at nearly the same time, well, that interface from us into the system might actually trigger some instruction on some server somewhere to tell Instagram's database to increment some value. But how does a database go about incrementing a value? Well, if the value of views or the value of counts is somehow stored in a database, a column of type integer, and you go ahead and execute a SQL SELECT in order to get that value, and, for instance, 100 people before me has liked some post, well, the value of that result comes back as 100. I then do some math in my code, perhaps Python. I increment the 100 to 101, and then I use a SQL UPDATE, as you can, to push the data back into the database. But suppose both you and I anywhere in the world both happen to like a post at the same or nearly the same time, as can certainly happen when posts are especially popular. Unfortunately, a computer can sometimes do multiple things at once or at least in such rapid succession that it appears to be at the same time, but a race of sorts can happen, a race condition, if you will, as follows. If both my button and your button is pressed at nearly the same time and that induces execution of code on Instagram server that selects for both of us the current count of views, suppose that both of the threads, so to speak, both of the SQL operations that select that data both come back with the value 100, each of the blocks of code serving our requests go ahead and increment that value to 101 and then, via SQL UPDATE, pushes that value back to the database. Unfortunately, because both you and I induced an evaluation of that math at nearly the same time, what the database might end up storing is not 102 but 101. In other words, if two people's input is triggering a race to update data, the database had better keep track of who and when asked for that update. Otherwise, you lose data. Now, in the case of tweets or likes, it's probably not all that problematic. Though, frankly, that is their business. But you can certainly imagine that with banks or financial institutions, where the numbers matter ever so more, you certainly don't want to accidentally lose track of some dollars. And so how do we go about solving this in the case of a SQL database? Well, it turns out that there is fairly fundamental primitives or solutions you can use. Consider a metaphor in the real world, such as, say, a familiar refrigerator. And suppose that you and your significant other happened to both like something to drink at the end of the day, like some milk. And so you go ahead when you get home, and the other person's not, and you open the fridge and you see, oh, darn it, we're out of milk. And so you close the fridge and you head downstairs and you walk to the nearest store. Meanwhile, that other person comes home and, wanting some milk, opens the fridge, and darn it if we aren't out of milk as well. And so that person now heads out, perhaps in a different car, in a different route, and heads to the store, some other store to get milk. Fast forward some amount of time and both of you come home, and darn it if you don't now have twice as much milk as you need, and it does go bad. And so you've both ended up buying milk when really only one of you needed to. And this is similar in spirit, but now you've got more data than you actually wanted, but it's not the right amount of data. So why did that happen? Well, both of you, like Instagram, inspected the state of some value and made a decision on it before the other person was done acting on that information. So in our very real world of milk, how could you go about avoiding that conflict, that race, to restock the fridge? Well, you could quite simply grab a pen and paper and leave a note, so to speak, on the fridge telling someone else, gone for milk, and hopefully they then would not do the same. Or perhaps more dramatically you could lock the refrigerator in some sense so that they can't even get into it and inspect that state. But ultimately, you need your act of checking the fridge and restocking it to be what we'll call atomic. And databases can and hopefully do provide atomicity, that property, the ability to do multiple things together or not at all but not be interrupted by someone else's work. And in fact, in the database world, these are generally known as actual locks whereby you say to the database, don't let anyone else write to this table or row until I am ready to release or unlock it. That, of course, though, tends to be a very heavy-handed solution. Say don't let anyone else touch this data. Better to do it on a more fine-grained control so that you don't slow your whole system down. And so SQL databases tend to support what are more finally known as transactions whereby you can execute one or more commands again and again and again back to back but make sure that all of them go through it once before, say, your commands that your user input induced actually is allowed to get executed. Now, honestly, even in the most high-tech of places like Instagram and Twitter, this is a hard problem. Because at some point, even waiting for my operations to complete before yours can go in can be a challenge and a bottleneck for everyone else. And so in the case of the most viral of posts, what can systems these days do? Well, you could just kind of wait and write that data back to the database that is updated eventually. And indeed, another property of databases is known as just that, eventual consistency, a property that says, don't lose any data, but only eventually make sure it's reflected on the server. Eventually get the value right, but do get it right. And so what Instagram and Twitter and others might do is just cache or buffer that data, waiting until things have quieted down 'til the post is no longer viral or most users have gone to sleep. Now, that alone might not be to the best of solutions, but it avoids having to get the highest powered and most expensive hardware. Of course, in other contexts, that might be the better solution. In the world of finance, sometimes it comes down to the actual length of cables or distance from some server to another to ensure that the data gets there so fast that you don't run into these sorts of challenges. So databases can solve this, but the developers and designers that use those databases need to know how to do it and that they should. Lastly, there's another challenge as well, unfortunately all too commonly done these days because folks just don't defend against it via common mechanisms. It turns out that a bad actor somewhere on the internet or your own network can potentially, if you're not careful, trick a database into executing commands that you didn't intend. For instance, suppose in the context of Python you have some code that looks a bit like this. Here is a program written in a mix of pseudocode and Python that's designed to allow a user to input the title of an album for which they want to search. And so here I use the Python function INPUT to prompt the user for just that. On the left-hand side do I clear a variable called Title, and then assigned from right to left, the user's response to that variable. Then suppose for the sake of discussion there is some function called EXECUTE whose purpose in life is to take input that itself represents a SQL command. That SQL command might be this, so like star from Artist where Title equals something. Now, what is that something? Well, if I have the user's input in a variable called Title, I can use the plus operator in Python, not to add but concatenate two strings together, coding them singly and completing that thought. The problem, though, with SQL is that user's not really to be trusted. And whether the user's input is coming from a keyboard on a Mac, or PC, or perhaps, more compellingly, from an app or website, you probably should not trust all your users. Because suppose that your user typed in not the album name for which they want to search, Let There Be Rock, but rather they type something like Let There Be Rock, semicolon, DELETE, thereby using SQL's apparent DELETE command in order to trick your database into executing not one but two commands, a SELECT and DELETE. And indeed, this is what's known as a SQL injection attack, the ability for an adversary, a bad actor out there, to somehow trick your database and your code into executing some command that you didn't intend. How is that possible? Well, some of these characters are dangerous, so to speak. A semicolon in SQL tends to separate one command from another. It finishes your thought. And if you yourself don't anticipate that some human, this bad actor, might type in themselves a semicolon when they really shouldn't be typing SQL at all, you might mistake that semicolon for the actual terminus between one command and another. And if you just blindly pass it into your server and let it execute as usual, you might execute not just that SELECT but that DELETE or anything else as well. And in this way can an adversary not only delete data from your database but maybe select more than you intended, or update or insert. It's ultimately up to you to defend against these threats. So, how? Well, it turns out that there are libraries, code written by other people that, frankly, via very easy-to-use functions, just make it easy to sanitize or scrub, so to speak, user's input. What do these libraries or these functions typically do? Honestly, they just escape, so to speak, these dangerous characters. Something like a semicolon or perhaps a single apostrophe that might, in SQL, have some special and dangerous potential meaning, they just escape them as by putting a backslash, a special character in front of them so that if the human were to type in Let There Be Rock, semicolon, DELETE, that would actually be interpreted safely by your database as a search for an album called Let There Be Rock, semicolon, DELETE, which of course most likely is not the name of an album. So that query would probably not return or select any results. But more importantly, it would not be tricked into executing two SQL commands. Rather, it would execute only the SELECT but with a nonsensical value. Lastly, consider what a database is. It's really a piece of software running on some computer somewhere, be it on my own Mac, or PC, or some server in the cloud. But if you have just one database, as I seem to keep presuming, you have this so-called single point of failure, again, just as we had in the world of cloud computing more generally. And so with something like data where you don't want to lose it and you certainly don't want all of your web servers or apps to go offline just because one server, your database server, has gone out, it's time to revisit the architecture or topology of systems more generally. Something tells me that it's not going to be sufficient to have just one database. You probably want two. But if you have two databases, now how do you decide where to put the data? Do you put it here, or over here, or maybe in both places? If you put it in both places, though, you're then using twice as much space, so already we've opened a can of worms. To solve one simple problem, don't be a single point of failure. But that's going to cost you some time, or some money, or certainly space. So what can you do if you're architecting a system that has now not just web servers but, say, databases? Well, odds are you're going to want to have not just the one, pictured here as a cylinder, as this convention, but you're probably going to want to have a second as well. But of course, if you have two of them, odds are it's not sufficient just to store half of your data on one and half of your data on the other, because, of course, you've not solved the single point of failure. You now just have two single points of failure because half of your data could be lost here or half of it here. So odds are you're going to want to start having backups of data. But you don't want to necessarily have to waste time restoring from backup, especially if you want to maintain as many as five nines of uptime. So odds are you're going to want to have these databases really be duplicates of one another. And whenever you write data to one database, you should probably write it to the other in parallel. So, yes, admittedly, you have just spent twice as much space and, frankly, twice as much money. But at some point those sorts of costs are inevitable. But there's other ways to scale here, too. You can, of course, hit a ceiling on vertical scaling even when it comes to databases. After all, if a database is just a program running on some server and there is only so much RAM or disk space or CPU in that server, eventually you're not going to be able to store as much data or as quickly as you want. So what can you do? Well, you could, for instance, shard your data and have not just two but maybe four or more servers and put all of the users whose names start from A to M on one half of your cluster of servers, so to speak, but maybe everyone else from M to Z based on, say, their last name can go on the others. To shard a database means to split the data in some predictable way that you can repeat again and again. But even there, too, even if only the As through Ms are going to the left, you want to make sure that you still have that backup or replica. And this arrow suggests that they really should be intercommunicating, not unlike load balancers we've seen. But there's another way you can scale your databases as well. You don't have to have databases doing both reading and writing. To read data from a database or any server means to take it from its memory and read it into yours. And to write means to do the opposite, to save it. Well, what you can do actually in the world of databases is also replicate your databases multiple times. And you might have connected to these two primary databases multiple other databases that are just copies in one direction and not both. And what you might then do is use these two primary databases not only to read but to write, abbreviated here RW. But these other databases down here, which are just copies of the ones to which they're connected, are just called read replicas. They exist solely for the purpose to read from them again and again. When might this make sense? Well, in some contexts, like social media, like Facebook, it's probably the case that there are more reads than there are writes. That is to say you probably know more people who post more content than you but you probably still read or look at theirs. And so if the data for your business follows that pattern whereby writes are maybe common but reads are way more common, you can do exactly this model and replicate again and again, honestly, as a tree structure for efficiency so that it doesn't all have to replicate one at a time. But then you can write software, be it in Python or something else, that writes data only to these one or two servers but reads from any number of them as well. But this, too, is a bit of a rabbit hole, because at some point you want to have this redundancy not in one location but others, east coast and west coast, one country and another. And at that point, you might actually run into the limitations of time. Because after all, it takes a non-zero number of milliseconds or seconds for data to travel long distance. Consider after all how long it might take data to go from Cambridge, Massachusetts to somewhere in Japan. That's far longer than it might take to just go down the road to MIT. So here, too, we can revisit all of the problems we saw in the world of cloud computing and servers more generally. They're back to revisit in the context of databases. But with databases, you care ever more that these things not go down, or if they do, that you have spares, backups, or replicas. Because now that we're storing our data in this centralized place, we have to think hard not only about how we're scaling computationally but how we're scaling in terms of our data as well. So consider where then we began. We started by laying out data in a spreadsheet, be it Microsoft Excel, or Apple Numbers, or Google Spreadsheets. From there we considered what types of data we might store there so that if we want to upgrade, so to speak, from a spreadsheet to database, we know what types we can specify. And in SQL, whether implemented in SQLite, Oracle, or MySQL, or something else, they tend to be standard types that tend to be common across platforms, albeit with some variations, so that we can think hard about these types and then ultimately help the database help us be performant. Because if I know that I'm going to be searching or selecting based on certain data, I can tell the database, for instance, to make it unique or at least index it. And then using SQL constructions like SELECT, and INSERT, and UPDATE, and DELETE, and yet others can I manipulate that data and get at it even faster, frankly, than the human me could with a mere spreadsheet. But with the design of any system, as with databases, we start to open new cans of worms and new problems as we start to explore more sophisticated challenges. But here, too, many, if not all, of these problems can be solved by simply reducing the problems to first principles and consider, what is the problem to be solved? How is our data represented? Where is it stored? And consider ultimately what business constraints or legal constraints we have when manipulating that data and consider what tools of the trade are available to us. This then is database design and, more generally, design unto itself, not just focusing on the correctness of implementing solutions to problems but the efficiency and the scalability as well thereof.