[MUSIC PLAYING] RICK HOULIHAN: All right. Hi, everybody. My name is Rick Houlihan. I'm a senior principal solutions architect at AWS. I focus on NoSQL and DynamoDB technologies. I'm here today to talk to you a little bit about those. My background is primarily in the data layer. I spent half my development career writing database and data access solutions for various applications. I've been in Cloud virtualization for about 20 years. So before the Cloud was the Cloud, we used to call it utility computing. And the idea was, it's like PG&E, you pay for what you use. Today we call it the cloud. But over the years, I've worked for a couple of companies you've probably never heard of. But I've compiled a list of technical accomplishments, I guess you'd say. I have eight patents in Cloud systems virtualization, microprocessor design, complex event processing, and other areas as well. So these days, I focus mostly on NoSQL technologies and the next generation database. And that's generally what I'm going to be here talking to you today about. So what you can expect from this session, we'll go through a brief history of data processing. It's always helpful to understand where we came from and why we're where we are. And we'll talk a little bit about NoSQL technology from a fundamental standpoint. We will get into some of the DynamoDB internals. DynamoDB is AWS's NoSQL flavor. It's a fully managed and hosted NoSQL solution. And we'll talk a little bit about table structure, APIs, data types, indexes, and some of the internals of that DynamoDB technology. We'll get into some of the design patterns and best practices. We'll talk about how you use this technology for some of today's applications. And then we'll talk a little bit about the evolution, or the emergence, of a new paradigm in programming called event-driven applications and how DynamoDB plays in that as well. And we'll leave you with a little bit of a reference architecture discussion so we can talk about some of the ways you can use DynamoDB. So first off-- this is a question I hear a lot-- what's a database? A lot of people think they know what a database is. If you Google it, you'll see this: it's a structured set of data held in a computer, especially one that is accessible in various ways. I suppose that's a good definition of a modern database. But I don't like it, because it implies a couple of things. It implies structure. And it implies that it's on a computer. And databases didn't always exist on computers. Databases actually existed in many ways. So a better definition of a database is something like this: a database is an organized mechanism for storing, managing, and retrieving information. This is from About.com. So I like this because it really talks about a database being a repository, a repository of information, not necessarily something that sits on a computer. And throughout history, we haven't always had computers. Now, if I ask the average developer today what's a database, that's the answer I get: somewhere I can stick stuff. Right? And it's true. But it's unfortunate. Because the database is really the foundation of the modern app. It's the foundation of every application. And how you build that database, how you structure that data is going to dictate how that application performs as you scale. So a lot of my job today is dealing with what happens when developers take this approach and dealing with the aftermath of an application that is now scaling beyond the original intent and suffering from bad design. 
So hopefully when you walk away today, you'll have a couple of tools in your belt that'll keep you from making those same mistakes. All right. So let's talk about a little bit of the timeline of database technology. I think I read an article not that long ago and it said something along the lines of-- it's a very poetic statement-- it said the history of data processing is full of high watermarks of data abundance. OK. Now, I guess that's kind of true. But I actually look at it as a history filled with high watermarks of data pressure. Because the rate of data ingestion never goes down. It only goes up. And innovation occurs when we see data pressure, which is the amount of data that is now coming into the system and cannot be processed efficiently, either in time or in cost. And that's when we see innovation. So when we look at the first database, this is the one that was between our ears. We're all born with it. It's a nice database. It has high availability. It's always on. You can always get to it. But it's single user. I can't share my thoughts with you. You can't get my thoughts when you want them. And the durability is not so good. We forget things. Every now and then, one of us leaves and moves on to another existence, and we lose everything that was in that database. So that's not all that good. And this worked well over time, back in the day when all we really needed to know was where we were going to go tomorrow or where to gather the best food. But as we started to grow as a civilization, and governments started to come into being, and businesses started to evolve, we started to realize we needed a little more than what we could put in our heads. All right? We needed systems of record. We needed places to be able to store data. So we started writing documents, creating libraries and archives. We started developing a system of ledger accounting. And that system of ledger accounting ran the world for many centuries, and maybe even millennia, as we kind of grew to the point where that data load surpassed the ability of those systems to contain it. And this actually happened in the 1880s. Right? In the 1880 US Census. This is really the turning point of modern data processing. This is the point at which the amount of data that was being collected by the US government got to the point where it took eight years to process. Now, eight years-- as you know, the census runs every 10 years-- so it's pretty obvious that by the time we got to the 1890 census, the amount of data that was going to be processed by the government was going to exceed the 10 years it would take before the next census launched. This was a problem. So a guy named Herman Hollerith came along and he invented unit record punch cards, the punch card reader, the punch card tabulator, and the collating mechanisms for this technology. And the company that he formed at the time, along with a couple of others, actually became one of the pillars of a small company we know today called IBM. So IBM originally was in the database business. And that's really what they did. They did data processing. And so came the proliferation of punch cards, and ingenious mechanisms for leveraging that technology to pull sorted result sets. You can see in this picture there-- it's a little small-- but you can see a very ingenious mechanical mechanism where we have a punch card deck. 
And somebody's taking a little screwdriver and sticking it through the slots and lifting it up to get that match, that sorted result set. This is an aggregation. We do this all the time today in the computer, where you do it in the database. We used to do it manually, right? People put these things together. And it was the proliferation of these punch cards into what we called data drums and data reels, paper tape. The data processing industry took a lesson from the player pianos. Player pianos, back at the turn of the century, used to use paper reels with slots on them to tell the piano which keys to play. So that technology was adapted eventually to store digital data, because they could put that data onto those paper tape reels. Now, as a result, how you accessed this data was directly dependent on how you stored it. So if I put the data on a tape, I had to access the data linearly. I had to roll the whole tape to access all the data. If I put the data in punch cards, I could access it in a little more random fashion, maybe not as quickly. But there were limitations in how we accessed the data based on how it was stored. And so this was a problem going into the '50s. Again, we can start to see that as we develop new technologies to process the data, right, it opens up the door for new solutions, for new programs, new applications for that data. And really, government may have been the reason why we developed some of these systems. But business rapidly became the driver behind the evolution of the modern database and the modern file system. So the next thing that came up, in the '50s, was the file system and the development of random access storage. This was beautiful. Now, all of a sudden, we can put our files anywhere on these hard drives and we can access this data randomly. We can parse that information out of files. And we solved all the world's problems with data processing. And that lasted about 20 or 30 years, until the evolution of the relational database, which is when the world decided we now need to have a repository that defeats the sprawl of data across the file systems that we've built. Right? Too much data distributed in too many places, the duplication of data, and the cost of storage was enormous. In the '70s, the most expensive resource that a computer had was the storage. The processor was viewed as a fixed cost. When I buy the box, the CPU does some work. It's going to be spinning whether it's actually working or not. That's really a sunk cost. But what costs me as a business is storage. If I have to buy more disks next month, that's a real cost that I pay. And that storage is expensive. Now we fast forward 40 years and we have a different problem. The compute is now the most expensive resource. The storage is cheap. I mean, we can go anywhere on the cloud and we can find cheap storage. But what I can't find is cheap compute. So the evolution of today's technology, of database technology, is really focused around distributed databases that don't suffer from the same type of scale limitations as relational databases. We'll talk a little bit about what that actually means. But one of the reasons and the driver behind this-- we talked about the data pressure. Data pressure is something that drives innovation. And if you look at the last five years, this is a chart of what the data load across the general enterprise looks like over the last five years. 
And the general rule of thumb these days-- if you go Google it-- is that 90% of the data that we store today was generated within the last two years. OK. Now, this is not a trend that's new. This is a trend that's been going on for 100 years. Ever since Herman Hollerith developed the punch card, we've been building data repositories and gathering data at phenomenal rates. So over the last 100 years, we've seen this trend. That's not going to change. Going forward, we're going to see this same trend, if not an accelerated one. And you can see what that looks like. If a business in 2010 had one terabyte of data under management, today that means they're managing 6.5 petabytes of data. That's 6,500 times more data. And I know this. I work with these businesses every day. Five years ago, I would talk to companies who would talk to me about what a pain it is to manage terabytes of data. And they would talk to me about how we see that this is probably going to be a petabyte or two within a couple of years. These same companies today I'm meeting with, and they're talking to me about the problems they're having managing tens, 20 petabytes of data. So the explosion of the data in the industry is driving the enormous need for better solutions. And the relational database is just not living up to the demand. And so there's a linear correlation between data pressure and technical innovation. History has shown us this, that over time, whenever the volume of data that needs to be processed exceeds the capacity of the system to process it in a reasonable time or at a reasonable cost, then new technologies are invented to solve those problems. Those new technologies, in turn, open the door to another set of problems, which is gathering even more data. Now, we're not going to stop this. Right? We're not going to stop this. Why? Because you can't know everything there is to know in the universe. And as long as we've been alive, throughout the history of man, we have always driven to know more. So it seems like every inch we move down the path of scientific discovery, we are multiplying the amount of data that we need to process exponentially as we uncover more and more and more about the inner workings of life, about how the universe works, about driving the scientific discovery and the invention that we're doing today. The volume of data just continually increases. So being able to deal with this problem is enormous. So one of the things we look at is, why NoSQL? How does NoSQL solve this problem? Well, relational databases, Structured Query Language, SQL-- that's really a construct of the relational database-- these things are optimized for storage. Back in the '70s, again, disk was expensive. The provisioning exercise of storage in the enterprise is never-ending. I know. I lived it. I wrote storage drivers for an enterprise superserver company back in the '90s. And the bottom line is racking another storage array was just something that happened every day in the enterprise. And it never stopped. Higher density storage, demand for high density storage, for more efficient storage devices-- it's never stopped. And SQL is a great technology because it normalizes the data. It de-duplicates the data. It puts the data in a structure that is agnostic to every access pattern. Multiple applications can hit that SQL database, run ad hoc queries, and get data in the shape that they need to process for their workloads. That sounds fantastic. 
But the bottom line is with any system, if it's agnostic to everything, it is optimized for nothing. OK? And that's what we get with the relational database. It's optimized for storage. It's normalized. It's relational. It supports ad hoc queries. And it scales vertically. If I need a bigger SQL database or a more powerful SQL database, I go buy a bigger piece of iron. OK? I've worked with a lot of customers that have been through major upgrades in their SQL infrastructure only to find out six months later, they're hitting the wall again. And the answer from Oracle or MSSQL or anybody else is get a bigger box. Well, sooner or later, you can't buy a bigger box, and that's a real problem. We need to actually change things. So where does this work? It works well for offline analytics, OLAP-type workloads. And that's really where SQL belongs. Now, it's used today in many online transactional processing-type applications. And it works just fine at some level of utilization, but it just doesn't scale the way that NoSQL does. And we'll talk a little bit about why that is. Now, NoSQL, on the other hand, is more optimized for compute. OK? It is not agnostic to the access pattern. It's what we call a de-normalized structure, or a hierarchical structure. The data in a relational database is joined together from multiple tables to produce the view that you need. The data in a NoSQL database is stored in a document that contains the hierarchical structure. All of the data that would normally be joined together to produce that view is stored in a single document. And we'll talk a little bit about how that works in a couple of charts. But the idea here is that you store your data as these instantiated views. OK? You scale horizontally. Right? If I need to increase the size of my NoSQL cluster, I don't need to get a bigger box. I get another box. And I cluster those together, and I can shard that data. We'll talk a bit about what sharding is, to be able to scale that database across multiple physical devices and remove the barrier that requires me to scale vertically. So it's really built for online transaction processing and scale. There's a big distinction here with reporting, right? With reporting, I don't know the questions I'm going to ask. Right? Reporting-- if someone from my marketing department wants to know how many of my customers have this particular characteristic who bought on this day-- I don't know what query they're going to ask. So I need to be agnostic. Now, in an online transactional application, I know what questions I'm asking. I built the application for a very specific workflow. OK? So if I optimize the data store to support that workflow, it's going to be faster. And that's why NoSQL can really accelerate the delivery of those types of services. All right. So we're going to get into a little bit of theory here. And some of you, your eyes might roll back a little bit. But I'll try to keep it as high level as I can. So if you're in project management, there's a construct called the triangle of constraints. OK. The triangle of constraints dictates you can't have everything all the time. You can't have your pie and eat it too. So in project management, that triangle of constraints is you can have it cheap, you can have it fast, or you can have it good. Pick two. Because you can't have all three. Right? OK. So you hear about this a lot. 
It's the triple constraint, the triangle of triple constraints, or the iron triangle-- oftentimes when you talk to project managers, they'll talk about this. Now, databases have their own iron triangle. And the iron triangle of data is what we call CAP theorem. OK? CAP theorem dictates how databases operate under a very specific condition. And we'll talk about what that condition is. But the three points of the triangle, so to speak, are C, consistency. OK? So in CAP, consistency means that all clients who can access the database will always have a very consistent view of data. Nobody's gonna see two different things. OK? If I see the database, I'm seeing the same view as my partner who sees the same database. That's consistency. Availability means that if the database is online, if it can be reached, then all clients will always be able to read and write. OK? So every client that can reach the database will always be able to read data and write data. And if that's the case, it's an available system. And the third point is what we call partition tolerance. OK? Partition tolerance means that the system works well despite physical network partitions between the nodes. OK? So if nodes in the cluster can't talk to each other, what happens? All right. So relational databases choose-- you can pick two of these. OK. So relational databases choose to be consistent and available. If the partition happens between the data nodes in the data store, the database crashes. Right? It just goes down. OK. And this is why they have to grow with bigger boxes. Right? Because there's no-- usually, a clustered database-- there's not very many of them that operate that way. But most databases scale vertically within a single box. Because they need to be consistent and available. If a partition were to be injected, then you would have to make a choice. You have to make a choice between being consistent and available. And that's what NoSQL databases do. All right. So a NoSQL database, it comes in two flavors. We have-- well, it comes in many flavors, but it comes with two basic characteristics-- what we would call a CP database, or a consistent and partition tolerant system. These guys make the choice that when the nodes lose contact with each other, we're not going to allow people to write anymore. OK? Until that partition is removed, write access is blocked. That means they're not available. They're consistent. When we see that partition inject itself, we are now consistent, because we're not going to allow the data to change on two sides of the partition independently of each other. We will have to reestablish communication before any update to the data is allowed. OK? The next flavor would be an AP system, or an available and partition tolerant system. These guys don't care. Right? Any node that gets a write, we'll take it. So I'm replicating my data across multiple nodes. These nodes get a client; a client comes in and says, I'm going to write some data. The node says, no problem. The node next to him gets a write on the same record, and he's going to say no problem. Somewhere back on the back end, that data's going to replicate. And then the system will realize, uh-oh, there's been an update to two sides. What do we do? And what they do then is they do something which allows them to resolve that data state. And we'll talk about that in the next chart. One thing to point out here. And I'm not going to get too much into this, because this gets into deep data theory. 
But there's a transactional framework that runs in a relational system that allows me to safely make updates to multiple entities in the database. And those updates will occur all at once or not at all. And this is called ACID transactions. OK? ACID gives us atomicity, consistency, isolation, and durability. OK? Atomicity means my transactions are atomic: all my updates either happen or they don't. Consistency means that the database will always be brought into a consistent state after an update. I will never leave the database in a bad state after applying an update. OK? So it's a little different than CAP consistency. CAP consistency means all my clients can always see the same data. ACID consistency means that when a transaction's done, the data's good. My relationships are all good. I'm not going to delete a parent row and leave a bunch of orphan children in some other table. It can't happen if I'm consistent in an ACID transaction. Isolation means that transactions will always occur one after the other. The end result of the data will be the same state as if transactions that were issued concurrently were executed serially. So it's concurrency control in the database. So basically, I can't have two operations increment the same value at the same time and step on each other. If I say add 1 to this value, and two transactions come in and try to do it, one's going to get there first and the other one's going to get there after. So in the end, I added two. You see what I mean? OK. Durability is pretty straightforward. When the transaction is acknowledged, it's going to be there even if the system crashes. When that system recovers, that transaction that was committed is actually going to be there. So those are the guarantees of ACID transactions. Those are pretty nice guarantees to have on a database, but they come at a cost. Right? Because the problem with this framework is if there is a partition in the data store, I have to make a decision. I'm going to have to allow updates on one side or the other. And if that happens, then I'm no longer going to be able to maintain those characteristics. They won't be consistent. They won't be isolated. This is where it breaks down for relational databases. This is the reason relational databases scale vertically. On the other hand, we have what's called BASE technology. And these are your NoSQL databases. All right. So we have our CP and AP databases. And these are what you call basically available, soft state, eventually consistent. OK? Basically available, because they're partition tolerant. They will always be there, even if there's a network partition between the nodes. If I can talk to a node, I'm going to be able to read data. OK? I might not always be able to write data if I'm a consistent platform. But I'll be able to read data. The soft state indicates that when I read that data, it might not be the same as on other nodes. If a write was issued on a node somewhere else in the cluster and it hasn't replicated across the cluster yet when I read that data, that state might not be consistent. However, it will be eventually consistent, meaning that when a write is made to the system, it will replicate across the nodes. And eventually, that state will be brought into order, and it will be a consistent state. Now, CAP theorem really only comes into play in one condition. That condition is when this happens. Because whenever the system is operating in normal mode, there's no partition, and everything's consistent and available. You only worry about CAP when we have that partition. So those are rare. 
But how the system reacts when those occur dictates what type of system we're dealing with. So let's take a look at what that looks like for AP systems. OK? AP systems come in two flavors. They come in the flavor that is master-master, 100%, always available. And they come in the other flavor, which says, you know what, I'm going to worry about this partitioning thing when an actual partition occurs. Otherwise, there are going to be primary nodes that are going to take the writes. OK? So if we look at something like Cassandra-- Cassandra would be master-master, lets me write to any node. So what happens? So I have an object in the database that exists on two nodes. Let's call that object S. So we have state for S. We have some operations on S that are ongoing. Cassandra allows me to write to multiple nodes. So let's say I get a write for S to two nodes. Well, what ends up happening is we call that a partitioning event. There may not be a physical network partition. But because of the design of the system, it's actually partitioning as soon as I get a write on two nodes. It's not forcing me to write all through one node. I'm writing on two nodes. OK? So now I have two states. OK? What's going to happen is sooner or later, there's going to be a replication event. There's going to be what we call a partition recovery, which is where these two states come back together and there's going to be an algorithm that runs inside the database that decides what to do. OK? By default, last update wins in most AP systems. So there's usually a default algorithm, what they call a callback function, something that will be called when this condition is detected to execute some logic to resolve that conflict. OK? The default callback and default resolver in most AP databases is, guess what, timestamp wins. This was the last update. I'm going to put that update in there. I may dump the record that I bumped off into a recovery log so that the user can come back later and say, hey, there was a collision. What happened? And you can actually dump a record of all the collisions and the rollbacks and see what happened. Now, as a user, you can also include logic in that callback. So you can change that callback operation. You can say, hey, I want to remediate this data. And I want to try and merge those two records. But that's up to you. The database doesn't know how to do that by default. Most of the time, the only thing the database knows how to do is say, this one was the last record. That's the one that's going to win, and that's the value I'm going to put in. Once that partition recovery and replication occurs, we have our state, which is now S prime, which is the merged state of all those objects. So AP systems have this. CP systems don't need to worry about this. Because as soon as a partition comes into play, they just stop taking writes. OK? It's very easy to be consistent when you don't accept any updates. That's what CP systems do. All right. So let's talk a little bit about access patterns. When we talk about NoSQL, it's all about the access pattern. Now, SQL is ad hoc queries. It's a relational store. We don't have to worry about the access pattern. I write a very complex query. It goes and gets the data. That's what this looks like, normalization. So in this particular structure, we're looking at a products catalog. I have different types of products. I have books. I have albums. I have videos. The relationship between products and any one of these books, albums, and videos tables is 1:1. All right? 
I've got a product ID, and that ID corresponds to a book, an album, or a video. OK? That's a 1:1 relationship across those tables. Now, books-- all they have is root properties. No problem. That's great. One-to-one relationship, I get all the data I need to describe that book. Albums-- albums have tracks. This is what we call one to many. Every album could have many tracks. So for every track on the album, I could have another record in this child table. So I create one record in my albums table. I create multiple records in the tracks table. One-to-many relationship. This relationship is what we call many-to-many. OK? You see that actors could be in many movies, many videos. So what we do is we put this mapping table between those, which just maps the actor ID to the video ID. Now I can create a query that joins videos through actor-video to actors, and it gives me a nice list of all the movies and all the actors who were in that movie. OK. So here we go. One-to-one is the top-level relationship; one-to-many, albums to tracks; many-to-many. Those are the three top-level relationships in any database. If you know how those relationships work together, then you know a lot about databases already. So NoSQL works a little differently. Let's think about for a second what it looks like to go get all my products. In a relational store, I want to get all my products, a list of all my products. That's a lot of queries. I got a query for all my books. I got a query for my albums. And I got a query for all my videos. And I got to put it all together in a list and serve it back up to the application that's requesting it. To get my books, I join Products and Books. To get my albums, I got to join Products, Albums, and Tracks. And to get my videos, I have to join Products to Videos, join through Actor Videos, and bring in the Actors. So that's three queries. Very complex queries to assemble one result set. That's less than optimal. This is why when we talk about a data structure that's built to be agnostic to the access pattern-- well, that's great. And you can see this is really nice how we've organized the data. And you know what? I only have one record for an actor. That's cool. I've deduplicated all my actors, and I maintained my associations in this mapping table. However, getting the data out becomes expensive. I'm sending the CPU all over the system joining these data structures together to be able to pull that data back. So how do I get around that? In NoSQL it's about aggregation, not normalization. So we want to support the access pattern. If the access pattern of the application is, I need to get all my products, let's put all the products in one table. If I put all the products in one table, I can just select all the products from that table and I get it all. Well, how do I do that? Well, in NoSQL there's no structure to the table. We'll talk a little bit about how this works in DynamoDB. But you don't have the same attributes and the same properties in every single row, in every single item, like you do in an SQL table. And what this allows me to do is a lot of things, and it gives me a lot of flexibility. In this particular case, I have my product documents. And in this particular example, everything is a document in the Products table. And the product for a book might have a type ID that specifies a book. And the application would switch on that ID. At the application tier, I'm going to say, oh, what record type is this? Oh, it's a book record. 
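To make that type-switching idea concrete, here is a minimal Python sketch; the single Products table layout, the attribute names, and the handler functions are hypothetical illustrations under that "one table, many item types" assumption, not something taken from the talk.

```python
# Hypothetical items pulled from a single "Products" table. Each item is
# free-form: only the key is fixed, and the rest of the shape varies by type.
products = [
    {"product_id": 1, "type": "book", "title": "The Art of War", "author": "Sun Tzu"},
    {"product_id": 2, "type": "album", "title": "Kind of Blue",
     "tracks": [{"no": 1, "name": "So What"}, {"no": 2, "name": "Freddie Freeloader"}]},
    {"product_id": 3, "type": "video", "title": "Casablanca",
     "actors": ["Humphrey Bogart", "Ingrid Bergman"]},
]

def handle_book(item):
    print("book:", item["title"], "by", item["author"])

def handle_album(item):
    # The relational child table (Tracks) lives inside the item as an array.
    print("album:", item["title"], "with", len(item["tracks"]), "tracks")

def handle_video(item):
    print("video:", item["title"], "starring", ", ".join(item["actors"]))

handlers = {"book": handle_book, "album": handle_album, "video": handle_video}

# One "give me all products" query, one loop; the application does the switching.
for item in products:
    handlers[item["type"]](item)
```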
Book records have these properties. Let me create a book object. So I'm going to fill the book object with this item. The next item comes in and says, what's this item? Well, this item is an album. Oh, I got a whole different processing routine for that, because it's an album. You see what I mean? So at the application tier-- I just select all these records. They all start coming in. They could be all different types. And it's the application's logic that switches across those types and decides how to process them. Again, so we're optimizing the schema for the access pattern. We're doing it by collapsing those tables. We're basically taking these normalized structures, and we're building hierarchical structures. Inside each one of these records I'm going to see array properties. Inside this document for Albums, I'm seeing arrays of tracks. Those tracks now become-- it's basically that child table that exists right here in this structure. So you can do this in DynamoDB. You can do this in MongoDB. You can do this in any NoSQL database. Create these types of hierarchical data structures that allow you to retrieve data very quickly, because now I don't have to conform. In a relational database, when I insert a row into the Tracks table or a row into the Albums table, I have to conform to that schema. I have to have the attribute or the property that is defined on that table-- every one of them-- when I insert that row. That's not the case in NoSQL. I can have totally different properties in every document that I insert into the collection. So it's a very powerful mechanism. And it's really how you optimize the system. Because now that query, instead of joining all these tables and executing a half a dozen queries to pull back the data I need, I'm executing one query. And I'm iterating across the result set. It gives you an idea of the power of NoSQL. I'm going to kind of go sideways here and talk a little bit about this. This is more kind of the marketing or technology-- the marketing of technology type of discussion. But it's important to understand. Because if we look at the top here at this chart, what we're looking at is what we call the technology hype curve. And what this means is new stuff comes into play. People think it's great. I've solved all my problems. This could be the end-all, be-all of everything. And they start using it. And they say, this stuff doesn't work. This is not right. The old stuff was better. And they go back to doing things the way they were. And then eventually they go, you know what? This stuff is not so bad. Oh, that's how it works. And once they figure out how it works, they start getting better. And the funny thing about it is, it kind of lines up to what we call the Technology Adoption Curve. So what happens is we have some sort of technology trigger. In the case of databases, it's data pressure. We talked about the high watermarks of data pressure throughout time. When that data pressure hits a certain point, that's a technology trigger. It's getting too expensive. It takes too long to process the data. We need something better. You get the innovators out there running around, trying to find out what's the solution. What's the new idea? What's the next best way to do this thing? And they come up with something. And the people with the real pain, the guys at the bleeding edge, they'll jump all over it, because they need an answer. Now what inevitably happens-- and it's happening right now in NoSQL. I see it all the time. 
What inevitably happens is people start using the new tool the same way they used the old tool. And they find out it doesn't work so well. I can't remember who I was talking to earlier today. But it's like, when the jackhammer was invented, people didn't swing it over their head to smash the concrete. But that is what's happening with NoSQL today. If you walk into most shops, they are trying to be NoSQL shops. What they're doing is they're using NoSQL, and they're loading it full of relational schema. Because that's how they design databases. And they're wondering, why is it not performing very well? Boy, this thing stinks. I had to maintain all my joins in-- it's like, no, no. Maintain joins? Why are you joining data? You don't join data in NoSQL. You aggregate it. So if you want to avoid this, learn how the tool works before you actually start using it. Don't try and use the new tools the same way you used the old tools. You're going to have a bad experience. And every single time, that's what this is about. When we start coming up here, it's because people figured out how to use the tools. They did the same thing when relational databases were invented, and they were replacing file systems. They tried to build file systems with relational databases because that's what people understood. It didn't work. So understanding the best practices of the technology you're working with is huge. Very important. So we're going to get into DynamoDB. DynamoDB is AWS's fully-managed NoSQL platform. What does fully-managed mean? It means you don't need to really worry about anything. You come in, you tell us, I need a table. It needs this much capacity. You hit the button, and we provision all the infrastructure behind the scenes. Now that is enormous. Because when you talk about scaling a database-- NoSQL data clusters at scale, running petabytes, running millions of transactions per second-- these things are not small clusters. We're talking thousands of instances. Managing thousands of instances, even virtual instances, is a real pain in the butt. I mean, think about every time an operating system patch comes out or a new version of the database. What does that mean to you operationally? That means you got 1,200 servers that need to be updated. Now even with automation, that can take a long time. That can cause a lot of operational headaches, because I might have services down. As I update these databases, I might do blue-green deployments where I deploy and upgrade half my nodes, and then upgrade the other half. Take those down. So managing the infrastructure at scale is enormously painful. And AWS takes that pain out of it. And NoSQL databases can be extraordinarily painful because of the way they scale. They scale horizontally. If you want to get a bigger NoSQL database, you buy more nodes. Every node you buy is another operational headache. So let somebody else do that for you. AWS can do that. We support document and key-value models. Now, we didn't go too much into this on the other chart. There's a lot of different flavors of NoSQL. They're all kind of getting munged together at this point. You can look at DynamoDB and say, yes, we're both a document and a key-value store at this point. And you can argue the features of one over the other. To me, a lot of this is really six of one, half a dozen of the other. Every one of these technologies is a fine technology and a fine solution. I wouldn't say MongoDB is better or worse than Couch, than Cassandra, than Dynamo, or vice versa. I mean, these are just options. 
It's fast and it's consistent at any scale. So this is one of the biggest bonuses you get with AWS. With DynamoDB, it's the ability to get low single-digit millisecond latency at any scale. That was a design goal of the system. And we have customers that are doing millions of transactions per second. Now I'll go through some of those use cases in a few minutes here. Integrated access control-- we have what we call Identity and Access Management, or IAM. It permeates every system, every service that AWS offers. DynamoDB is no exception. You can control access to the DynamoDB tables across all your AWS accounts by defining access roles and permissions in the IAM infrastructure. And it's a key and integral component in what we call event-driven programming. Now this is a new paradigm. AUDIENCE: How's your rate of true positives versus false negatives on your access control system? RICK HOULIHAN: True positives versus false negatives? AUDIENCE: Returning what you should be returning? As opposed to once in a while it doesn't return when it should validate? RICK HOULIHAN: I couldn't tell you that. If there's any failures whatsoever on that, I'm not the person to ask that particular question. But that's a good question. I would be curious to know that myself, actually. And so then, again, the new paradigm is event-driven programming. This is the idea that you can deploy complex applications that can operate at a very, very high scale without any infrastructure whatsoever. Without any fixed infrastructure whatsoever. And we'll talk a little bit about what that means as we get on to the next couple of charts. The first thing we'll do is we'll talk about tables, APIs, and data types for Dynamo. And the first thing you'll notice when you look at this, if you're familiar with any database, is that databases really have two kinds of APIs, I'd call it. Or two sets of APIs. One of those would be the administrative API-- the things that take care of the functions of the database: configuring the storage engine, setting up and adding tables, creating database catalogs and instances. In DynamoDB, you have a very short, short list. So in other databases, you might see dozens of commands, of administrative commands, for configuring these additional options. In DynamoDB you don't need those, because you don't configure the system, we do. So the only thing you need to do is tell me what size table do I need. So you get a very limited set of commands. You get Create Table, Update Table, Delete Table, and Describe Table. Those are the only things you need for DynamoDB. You don't need a storage engine configuration. I don't need to worry about replication. I don't need to worry about sharding. I don't need to worry about any of this stuff. We do it all for you. So that's a huge amount of overhead that's just lifted off your plate. Then we have the CRUD operators. CRUD is what we call in databases the Create, Read, Update, Delete operators. These are your common database operations. Things like put item, get item, update item, delete item, batch query, scan. If you want to scan the entire table, pull everything off the table, one of the nice things about DynamoDB is it allows parallel scanning. So you can actually tell us how many threads you want to run on that scan, and we can run those threads. We can spin that scan up across multiple threads so you can scan the entire table space very, very quickly in DynamoDB. The other API we have is what we call our Streams API. 
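Before the Streams discussion, here is a minimal sketch of that parallel scan using the boto3 SDK. The Scan parameters (Segment, TotalSegments, ExclusiveStartKey) are the real API; the table name, segment count, and processing logic are just placeholders.

```python
import threading
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("Products")  # placeholder table name

TOTAL_SEGMENTS = 4  # one worker thread per logical segment of the key space

def scan_segment(segment):
    # Each worker scans only its own slice of the table, in parallel with the others.
    kwargs = {"Segment": segment, "TotalSegments": TOTAL_SEGMENTS}
    while True:
        page = table.scan(**kwargs)
        for item in page["Items"]:
            pass  # process the item here
        if "LastEvaluatedKey" not in page:
            break
        kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]

threads = [threading.Thread(target=scan_segment, args=(s,)) for s in range(TOTAL_SEGMENTS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```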
We're not going to talk too much about this right now. I've got some content later on in the deck about this. But Streams is really a running-- think of it as a time-ordered and partitioned change log. Everything that's happening on the table shows up on the stream. Every write to the table shows up on the stream. You can read that stream, and you can do things with it. We'll talk about the types of things you can do with it-- things like replication, creating secondary indexes. All kinds of really cool things you can do with that. Data types. In DynamoDB, we support both key-value and document data types. On the left hand side of the screen here, we've got our basic types, the key-value types. These are strings, numbers, and binaries. So just three basic types. And then you can have sets of those. One of the nice things about NoSQL is you can contain arrays as properties. And with DynamoDB you can contain arrays of basic types as a root property. And then there's the document types. How many people are familiar with JSON? You guys are familiar with JSON? It's basically JavaScript Object Notation. It allows you to basically define a hierarchical structure. You can store a JSON document in DynamoDB using common components or building blocks that are available in most programming languages. So if you have Java, you're looking at maps and lists. I can create objects that are a map. A map has key values stored as properties. And it might have lists of values within those properties. You can store this complex hierarchical structure as a single attribute of a DynamoDB item. So tables in DynamoDB, like in most NoSQL databases, have items. In MongoDB you would call these documents. And it would be the same in Couchbase, also a document database-- you call these documents. Documents or items have attributes. Attributes can exist or not exist on the item. In DynamoDB, there's one mandatory attribute. Just like in a relational database, where you have a primary key on the table, DynamoDB has what we call a hash key. The hash key must be unique. So when I define a hash-only table, basically what I'm saying is every item will have a hash key. And every hash key must be unique. Every item is defined by that unique hash key. And there can only be one. This is OK, but oftentimes what people want is for this hash key to do a little bit more than just be a unique identifier. Oftentimes we want to use that hash key as the top level aggregation bucket. And the way we do that is by adding what we call a range key. So if it's a hash-only table, the hash must be unique. If it's a hash and range table, the combination of the hash and the range must be unique. So think about it this way. If I have a forum-- and the forum has topics, it has posts, and it has responses-- I might have a hash key, which is the topic ID. And I might have a range key, which is the response ID. That way, if I want to get all the responses for a particular topic, I can just query the hash. I can just say, give me all the items that have this hash. And I'm going to get every question or post for that particular topic. These top level aggregations are very important. They support the primary access pattern of the application. Generally speaking, this is what we want to do. As you load the table, we want to structure the data within the table in such a way that the application can very quickly retrieve those results. And oftentimes the way to do that is to maintain these aggregations as we insert the data. 
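A minimal boto3 sketch of that forum example-- hash key on the topic, range key on the response-- with made-up table and attribute names, plus a document attribute of the maps-and-lists kind just described:

```python
import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
forum = dynamodb.Table("Forum")  # hypothetical table: hash = TopicId, range = ResponseId

# Items in the same topic share the hash key and differ by the range key.
forum.put_item(Item={
    "TopicId": "dynamodb-design",
    "ResponseId": "2015-09-01T10:15:00#rick",
    "Body": "Model the table around your access patterns.",
    # A document attribute: a list of maps stored as a single attribute.
    "Tags": [{"name": "nosql"}, {"name": "modeling"}],
})

# "Give me all the items that have this hash" -- the top-level aggregation.
responses = forum.query(
    KeyConditionExpression=Key("TopicId").eq("dynamodb-design")
)["Items"]
```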
Basically, we're spreading the data into the right bucket as it comes in. Range keys allow me-- hash keys have to be equality. When I query a hash, I have to say, give me a hash that equals this. When I query a range, I can say, give me a range using any kind of rich operator that we support. Give me all the items for a hash-- is it equal, greater than, less than, does it begin with, does it exist between these two values? So those are the types of range queries that we're always interested in. Now one thing about data: when you access the data, it's always about an aggregation. It's always about the records that are related to this. Give me everything here that's-- all the transactions on this credit card for the last month. That's an aggregation. Almost everything you do in the database is some kind of aggregation. So being able to define these buckets and give you these range attributes to be able to query on-- those rich queries support many, many, many application access patterns. So the other thing the hash key does is it gives us a mechanism to be able to spread the data around. NoSQL databases work best when the data is evenly distributed across the cluster. How many people are familiar with hashing algorithms? When I say hash and hashing-- a hashing algorithm is a way of being able to generate a random value from any given value. So in this particular case, the hash algorithm we run is MD5 based. And if I have an ID, and this is my hash key, I have 1, 2, 3. When I run the hash algorithm, it's going to come back and say, well, 1 equals 7B, 2 equals 48, 3 equals CD. They're spread all over the key space. And why do you do this? Because that makes sure that I can put the records across multiple nodes. If I'm doing this incrementally, 1, 2, 3. And I have a hash range that runs, in this particular case, a small hash space-- it runs from 00 to FF-- then the records are going to come in and they're going to go 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12. What happens? Every insert is going to the same node. You see what I mean? Because when I split the space, and I spread these records across, and I partition, I'm going to say partition 1 has key space 00 to 54. Partition 2 is 55 to A9. Partition 3 is AA to FF. So if I'm using linearly incrementing IDs, you can see what's happening. 1, 2, 3, 4, 5, 6, all the way up to 54. So as I'm hammering the records into the system, everything ends up going to one node. That's not good. That's an antipattern. In MongoDB they have this problem if you don't use a hash key. MongoDB gives you the option of hashing the key value. You should always do that if you're using an incrementing hash key in MongoDB, or you'll be nailing every write to one node, and you will be limiting your write throughput badly. AUDIENCE: Is that A9, 169 in decimal? RICK HOULIHAN: Yeah, it's somewhere around there. A9, I don't know. You'd have to get my binary to decimal calculator. My brain doesn't work like that. AUDIENCE: Just a quick one on your Mongo comments. So does the object ID that comes natively with Mongo do that? RICK HOULIHAN: Does it do that? If you specify it. With MongoDB, you have the option. You can specify-- every document in MongoDB has to have an underscore ID. That's the unique value. In MongoDB you can specify whether to hash it or not. They just give you the option. If you know that it's random, no problem. You don't need to do that. If you know that it's not random, that it's incrementing, then do the hash. 
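A minimal Python sketch of that key-distribution point, using the toy 00-FF key space and the three partitions from the example; MD5 here just stands in for whatever hash the service actually uses, so the bytes won't match the 7B/48/CD values quoted above.

```python
import hashlib

# The toy key space 0x00-0xFF, split into the three partitions from the example.
partitions = [(0x00, 0x54), (0x55, 0xA9), (0xAA, 0xFF)]

def partition_for(byte_value):
    for i, (lo, hi) in enumerate(partitions):
        if lo <= byte_value <= hi:
            return i

for item_id in range(1, 7):
    raw = item_id % 256                                      # incrementing IDs stay low...
    hashed = hashlib.md5(str(item_id).encode()).digest()[0]  # ...hashed IDs scatter
    print(f"id {item_id}: raw -> partition {partition_for(raw)}, "
          f"hashed -> partition {partition_for(hashed)}")
```

With these six IDs, every raw value lands on partition 0, while the hashed values scatter across all three partitions-- which is the whole point of hashing the key.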
Now the thing about hashing: once you hash a value-- and this is why hash keys are always equality queries-- because I've changed the value, now I can't do a range query. I can't say, is this between this or that, because the hash value is not going to be equivalent to the actual value. So when you hash that key, it's equality only. This is why in DynamoDB hash key queries are always equality only. So now with a range key-- when I add that range key, those range key records all come in and they get stored on the same partition. So they are very quickly, easily retrieved, because this is the hash, this is the range. And you see everything with the same hash gets stored on the same partition space. You can use that range key to help locate your data close to its parent. So what am I really doing here? This is a one-to-many relationship. The relationship between a hash key and the range key is one to many. I can have multiple hash keys. I can have multiple range keys within every hash key. The hash defines the parent, the range defines the children. So you can see there's an analog here between the relational construct and the same types of constructs in NoSQL. People talk about NoSQL as nonrelational. It's not nonrelational. Data always has relationships. Those relationships are just modeled differently. Let's talk a little bit about durability. When you write to DynamoDB, writes are always three-way replicated. Meaning that we have three AZs. AZs are Availability Zones. You can think of an Availability Zone as a data center or a collection of data centers. These things are geographically isolated from each other, across different fault zones, across different power grids and floodplains. A failure in one AZ is not going to take down another. They are also linked together with dark fiber. It supports sub-1-millisecond latency between AZs. So real-time data replication is possible in multi-AZ. And oftentimes multi-AZ deployments meet the high availability requirements of most enterprise organizations. So DynamoDB is spread across three AZs by default. We're only going to acknowledge the write when two of those three nodes come back and say, yeah, I got it. Why is that? Because on the read side we're only going to give you the data back when we get it from two nodes. If I'm replicating across three, and I'm reading from two, I'm always guaranteed to have at least one of those reads be the most current copy of data. That's what makes DynamoDB consistent. Now you can choose to turn those consistent reads off. In which case I'm going to say, I'll only read from one node. And I can't guarantee it's going to be the most current data. So if a write is coming in, and it hasn't replicated yet, you're going to get that older copy. That's an eventually consistent read. And that is half the cost. So this is something to think about. When you're reading out of DynamoDB, and you're setting up your read capacity units, if you choose eventually consistent reads, it's a lot cheaper-- it's about half the cost. And so it saves you money. But that's your choice, if you want a consistent read or an eventually consistent read. That's something that you can choose. Let's talk about indexes. So we mentioned that top level aggregation. We've got hash keys, and we've got range keys. That's nice. And that's on the primary table: I got one hash key, I got one range key. What does that mean? I've got one attribute that I can run rich queries against. It's the range key. 
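Backing up to that read-consistency choice for a moment, here is a minimal boto3 sketch; ConsistentRead is the real parameter, while the table and key names are the placeholders from the earlier forum sketch.

```python
import boto3

dynamodb = boto3.resource("dynamodb")
forum = dynamodb.Table("Forum")  # placeholder table from the earlier sketch

key = {"TopicId": "dynamodb-design", "ResponseId": "2015-09-01T10:15:00#rick"}

# Strongly consistent read: reflects all acknowledged writes, full read-capacity cost.
strong = forum.get_item(Key=key, ConsistentRead=True)

# Eventually consistent read (the default): may briefly lag a just-acknowledged
# write, but consumes half the read capacity.
eventual = forum.get_item(Key=key, ConsistentRead=False)
```

With that aside out of the way, back to the primary table and its single range key.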
The other attributes on that item-- I can filter on those attributes. But I can't do things like, it begins with, or is greater than. How do I do that? I create an index. There's two types of indexes in DynamoDB. An index is really another view of the table. The first one we'll talk about is the local secondary index. So local secondaries are co-located on the same partition as the data. And as such, they are on the same physical node. They are what we call consistent. Meaning, they will acknowledge the write along with the table. When the write comes in, we'll write through the index. We'll write up to the table, and then we will acknowledge. So that's consistent. Once the write has been acknowledged from the table, it's guaranteed that the local secondary index will have the same view of the data. But what they allow you to do is define alternate range keys. They have to use the same hash key as the primary table, because they are co-located on the same partition, and they're consistent. But I can create an index with different range keys. So for example, if I had a manufacturer that had a raw parts table coming in. And raw parts come in, and they're aggregated by assembly. And maybe there's a recall. Any part that was made by this manufacturer after this date, I need to pull from my line. I can spin up an index that would be aggregating on the date of manufacture of that particular part. So if my top level table was already hashed by manufacturer, maybe it was ranged on part ID, I can create an index off that table that's hashed by manufacturer and ranged on date of manufacture. And that way I could say, anything that was manufactured between these dates, I need to pull from the line. So that's a local secondary index. These have the effect of limiting your hash key space. Because they're co-located on the same storage node, they limit the hash key space to 10 gigabytes. DynamoDB, under the covers, will partition your table every 10 gigabytes. When you put 10 gigs of data in, we go [PHH], and we add another node. We will not split the LSI across multiple partitions. We'll split the table. But we won't split the LSI. So that's something important to understand: if you're doing very, very, very large aggregations, then you're going to be limited to 10 gigabytes on your LSIs. If that's the case, we can use global secondaries. Global secondaries are really another table. They exist completely off to the side of your primary table. And they allow me to define a completely different structure. So think of it as data being inserted into two different tables, structured in two different ways. I can define a totally different hash key. I can define a totally different range key. And I can run this completely independently. As a matter of fact, I provision my read capacity and write capacity for my global secondary indexes completely independently of my primary table. If I define that index, I tell it how much read and write capacity it's going to be using. And that is separate from my primary table. Now both of the indexes allow us to not only define hash and range keys, but they allow us to project additional values. So if I want to read off the index, and I want to get some set of data, I don't need to go back to the main table to get the additional attributes. I can project those additional attributes into the index to support the access pattern. I know we're probably getting into some really, really-- getting into the weeds here on some of this stuff. Now I got to drift out of this. 
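Here is a minimal boto3 sketch of that manufacturer/parts example-- an LSI that re-sorts within the same hash key, and a GSI with its own keys and its own provisioned throughput. The table, index, and attribute names are made up for illustration; one real constraint worth noting is that LSIs can only be defined when the table is created.

```python
import boto3

client = boto3.client("dynamodb")

client.create_table(
    TableName="Parts",  # hypothetical
    AttributeDefinitions=[
        {"AttributeName": "ManufacturerId", "AttributeType": "S"},
        {"AttributeName": "PartId", "AttributeType": "S"},
        {"AttributeName": "ManufactureDate", "AttributeType": "S"},
    ],
    # Primary table: hashed by manufacturer, ranged on part ID.
    KeySchema=[
        {"AttributeName": "ManufacturerId", "KeyType": "HASH"},
        {"AttributeName": "PartId", "KeyType": "RANGE"},
    ],
    # LSI: same hash key, alternate range key (date of manufacture).
    LocalSecondaryIndexes=[{
        "IndexName": "ByManufactureDate",
        "KeySchema": [
            {"AttributeName": "ManufacturerId", "KeyType": "HASH"},
            {"AttributeName": "ManufactureDate", "KeyType": "RANGE"},
        ],
        "Projection": {"ProjectionType": "KEYS_ONLY"},
    }],
    # GSI: completely different keys, with its own read/write capacity.
    GlobalSecondaryIndexes=[{
        "IndexName": "ByDate",
        "KeySchema": [
            {"AttributeName": "ManufactureDate", "KeyType": "HASH"},
            {"AttributeName": "PartId", "KeyType": "RANGE"},
        ],
        "Projection": {"ProjectionType": "ALL"},
        "ProvisionedThroughput": {"ReadCapacityUnits": 5, "WriteCapacityUnits": 5},
    }],
    ProvisionedThroughput={"ReadCapacityUnits": 10, "WriteCapacityUnits": 10},
)
```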
AUDIENCE: [INAUDIBLE] --table key meant was a hash? The original hash? Multi-slats? RICK HOULIHAN: Yes. Yes. The table key basically points back to the item. So an index is a pointer back to the original items on the table. Now you can choose to build an index that only has the table key, and no other properties. And why might I do that? Well, maybe I have very large items. I really only need to know which-- my access pattern might say, which items contain this property? I don't need to return the item. I just need to know which items contain it. So you can build indexes that only have the table key. But that's primarily what an index in a database is for. It's for being able to quickly identify which records, which rows, which items in the table have the properties that I'm searching for. GSIs, so how do they work? GSIs basically are asynchronous. The update comes into the table, and the table then asynchronously updates all of your GSIs. This is why GSIs are eventually consistent. It is important to note that when you're building GSIs, you understand you're creating another dimension of aggregation. Now, a good example here is a manufacturer. I think I might have talked about a medical device manufacturer. Medical device manufacturers oftentimes have serialized parts. The parts that go into a hip replacement all have a little serial number on them. And they could have millions and millions and billions of parts in all the devices that they ship. Well, they need to aggregate under different dimensions: all the parts in an assembly, all the parts that were made on a certain line, all the parts that came in from a certain manufacturer on a certain date. And these aggregations sometimes get up into the billions. So I work with some of these guys who are suffering because they're creating these ginormous aggregations in their secondary indexes. They might have a raw parts table that comes as hash only. Every part has a unique serial number. I use the serial number as the hash. It's beautiful. My raw data table is spread all across the key space. My write ingestion is awesome. I take a lot of data. Then what they do is they create a GSI. And I say, you know what, I need to see all the parts for this manufacturer. Well, all of a sudden I'm taking a billion rows and stuffing them onto one node, because when I aggregate with the manufacturer ID as the hash, and part number as the range, then all of a sudden I'm putting a billion parts into the bucket of what this manufacturer has delivered me. That can cause a lot of pressure on the GSI, again, because I'm hammering one node. I'm putting all these inserts into one node. And that's a real problematic use case. Now, I got a good design pattern for how you avoid that. And that's one of the problems that I always work with. But what happens is the GSI might not have enough write capacity to be able to push all those rows into a single node. And what happens then is the primary, the client table, the primary table will be throttled because the GSI can't keep up. So my insert rate will fall on the primary table as my GSI tries to keep up. All right, so GSIs, LSIs, which one should I use? LSIs are consistent. GSIs are eventually consistent. If that's OK, I recommend using a GSI-- they're much more flexible. An LSI can be modeled as a GSI. And if the data size per hash key in your collection exceeds 10 gigabytes, then you're going to want to use that GSI, because it's just a hard limit. All right, so scaling. 
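One common mitigation for that hot-GSI aggregation problem-- a general DynamoDB pattern, not necessarily the exact one he alludes to-- is write sharding: suffix the hot hash key with a small random shard number so the giant aggregation spreads across several partitions, then fan the query out across the suffixes when you read. A minimal boto3 sketch, with a hypothetical Parts table and a hypothetical GSI named ByManufacturer:

```python
import random
import boto3
from boto3.dynamodb.conditions import Key

SHARDS = 10  # spread each manufacturer's parts across 10 GSI hash keys

dynamodb = boto3.resource("dynamodb")
parts = dynamodb.Table("Parts")  # hypothetical table; GSI "ByManufacturer" on GsiHash/PartId

def put_part(manufacturer_id, part_id, attrs):
    # Writes land on one of SHARDS GSI partitions instead of hammering a single node.
    shard = random.randint(0, SHARDS - 1)
    item = {"PartId": part_id, "GsiHash": f"{manufacturer_id}#{shard}", **attrs}
    parts.put_item(Item=item)

def parts_for_manufacturer(manufacturer_id):
    # Reads fan out across the shards and merge the results (pagination omitted).
    items = []
    for shard in range(SHARDS):
        resp = parts.query(
            IndexName="ByManufacturer",
            KeyConditionExpression=Key("GsiHash").eq(f"{manufacturer_id}#{shard}"),
        )
        items.extend(resp["Items"])
    return items
```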
Throughput in DynamoDB-- you can provision [INAUDIBLE] throughput to a table. We have customers that have provisioned 60 billion-- they're doing 60 billion requests a day, regularly running at over a million requests per second on our tables. There's really no theoretical limit to how much and how fast a table can run in DynamoDB. There are some soft limits on your account that we put in there so that you don't go crazy. If you want more than that, not a problem-- you come tell us, we'll turn up the dial. Every account is limited to some level in every service, just off the bat, so people don't go crazy and get themselves into trouble. No limit in size: you can put any number of items in a table. The size of an item is limited to 400 kilobytes each; that's the item, not each attribute. So the sum of all attributes is limited to 400 kilobytes. And then again, we have that little LSI issue with the 10 gigabyte limit per hash. AUDIENCE: Small number, I'm missing what you're telling me, that is-- RICK HOULIHAN: Oh, 400 kilobytes is the maximum size per item. So an item has all the attributes. So 400K is the total size of that item-- all the attributes combined, all the data that's in all those attributes, rolled up into a total size. Currently today, the item limit is 400K. So scaling, again, is achieved through partitioning. Throughput is provisioned at the table level, and there are really two knobs: we have read capacity and write capacity. These are adjusted independently of each other. RCUs measure strictly consistent reads. So if you say, I want 1,000 RCUs, those are strictly consistent reads. If you say, I want eventually consistent reads, with 1,000 RCUs provisioned you're going to get 2,000 eventually consistent reads-- and it's half the price for those eventually consistent reads. Again, they're adjusted independently of each other. And on the throughput side, if you consume 100% of your RCUs, you're not going to impact the availability of your writes. They are completely independent of each other. All right, so one of the things that I mentioned briefly was throttling. Throttling is bad. Throttling indicates bad NoSQL. There are things we can do to help you alleviate the throttling that you're experiencing, but the best solution is, let's take a look at what you're doing, because there's an anti-pattern in play here. Things like non-uniform workloads, hot keys, hot partitions-- I'm hitting a particular key space very hard for some particular reason. Why am I doing this? Let's figure that out. Or I'm mixing my hot data with cold data. I'm letting my tables get huge, but there's really only some subset of the data that's really interesting to me. Log data, for example-- a lot of customers get log data every day, a huge amount of log data. If you're just dumping all that log data into one big table, over time that table is going to get massive. But I'm really only interested in the last 24 hours, the last seven days, the last 30 days. Whatever the window of time I'm interested in, looking for the event that bothers me or the event that's interesting to me, that's the only window of time I need. So why am I putting 10 years' worth of log data in the table? What that causes is the table to fragment. It gets huge. It starts spreading out across thousands of nodes. And since your capacity is so low, you're actually rate limiting on each one of those individual nodes.
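Going back to those two knobs for a second, here's roughly what dialing them looks like, and how you opt into eventually consistent reads per request. The table name, key, and numbers are all made up:

```python
import boto3

client = boto3.client("dynamodb")

# Read capacity and write capacity are dialed independently of each other.
client.update_table(
    TableName="EventLog",
    ProvisionedThroughput={
        "ReadCapacityUnits": 1000,   # measured as strictly consistent reads
        "WriteCapacityUnits": 5000,
    },
)

# Eventually consistent reads are a per-request choice: the same 1,000 RCUs
# stretch to roughly twice as many reads, at half the cost per read.
events = boto3.resource("dynamodb").Table("EventLog")
item = events.get_item(
    Key={"EventId": "evt-0001"},
    ConsistentRead=False,   # the default; set True when you need read-after-write
)
```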
So let's start looking at how we roll that table over, how we manage that data a little better to avoid these problems. And what does that look like? This is what that looks like. This is what bad NoSQL looks like. I've got a hot key here. If you look on the side here, these are all my partitions-- I've got 16 partitions on this particular database. We do this all the time; I run this for customers all the time. It's called the heat map. The heat map tells me how you're accessing your key space. And what this is telling me is that there's one particular hash that this guy likes an awful lot, because he's hitting it really, really hard. The blue is nice. We like blue. We don't like red. Red's where the pressure gets up to 100%-- at 100%, now you're going to be throttled. So whenever you see red lines like this-- and it's not just DynamoDB, every NoSQL database has this problem-- there are anti-patterns that can drive these types of conditions. What I do is work with customers to alleviate those conditions. And what does that look like? This is getting the most out of DynamoDB throughput, but it's really getting the most out of NoSQL. This is not restricted to Dynamo-- I used to work at Mongo, I'm familiar with many NoSQL platforms, and every one has these types of hot key problems. To get the most out of any NoSQL database, and specifically DynamoDB, you want to create tables where the hash key element has a large number of distinct values, a high degree of cardinality. Because that means I'm writing to lots of different buckets. The more buckets I'm writing to, the more likely I am to spread that write load or read load out across multiple nodes, and the more likely I am to have high throughput on the table. And then I want the values to be requested fairly evenly over time, as uniformly and randomly as possible. Well, that's kind of interesting, because I can't really control when the users come. So suffice it to say, if we spread things out across the key space, we'll probably be in better shape. There's a certain amount of the time dimension that you're not going to be able to control. But those are really the two dimensions we have: space-- access evenly spread across the key space-- and time-- requests arriving evenly spaced in time. And if those two conditions are being met, then this is what it's going to look like. This is much nicer. We're really happy here. We've got a very even access pattern. Yeah, maybe you're getting a little pressure every now and then, but nothing really too extensive. It's amazing how many times, when I work with customers, that first graph with the big red bar and all that ugly yellow all over the place-- we get done with the exercise, after a couple of months of re-architecture, they're running the exact same workload at the exact same load, and this is what it looks like now. So what you get with NoSQL is a data schema that is absolutely tied to the access pattern. And you can optimize that data schema to support that access pattern. If you don't, then you're going to see those types of problems with hot keys. AUDIENCE: Well, inevitably some places are going to be hotter than others. RICK HOULIHAN: Always. Always. Yeah, I mean, there's always-- and again, there are some design patterns we'll get to that talk about how you deal with these super large aggregations. I've got to have them, so how do we deal with them? I've got a pretty good use case that we'll talk about for that. All right, so let's talk about some customers now.
These guys are AdRoll. I don't know if you're familiar with AdRoll. You probably see them a lot in the browser. They're ad re-targeting-- the largest ad re-targeting business out there. They regularly run over 60 billion transactions per day. They're doing over a million transactions per second. They've got a pretty simple table structure for the busiest table: it's basically just a hash key which is the cookie, the range is the demographic category, and then the third attribute is the score. So we all have cookies in our browsers from these guys. And when you go to a participating merchant, they basically score you across various demographic categories. When you go to a website and it wants to show you an ad, it goes and gets that ad from AdRoll. AdRoll looks you up in their table. They find your cookie. The advertiser is telling them, I want somebody who's middle-aged, a 40-year-old man, into sports. And they score you in those demographics and decide whether or not that's a good ad for you. Now, they have an SLA with their advertising providers to provide sub-10 millisecond response on every single request. So they're using DynamoDB for this. They're hitting us at a million requests per second. They're able to do all their lookups, triage all that data, and get that ad link back to the advertiser in under 10 milliseconds. It's a really pretty phenomenal implementation that they have. Actually-- I'm not sure if it was these guys or somebody else-- but I was working with a customer who told me that now that they've gone to DynamoDB, they're spending more money on snacks for their development team every month than they spend on their database. So that gives you an idea: the cost savings you can get with DynamoDB are huge. All right, Dropcam is another company. These guys are kind of-- if you think of the internet of things, Dropcam is basically internet security video. You put your camera out there. The camera has a motion detector. Someone comes along and triggers a cue point. The camera starts recording for a while, till it doesn't detect any motion anymore, then puts that video up on the internet. Dropcam basically switched to DynamoDB because they were experiencing enormous growing pains. What they told us: suddenly, petabytes of data. They had no idea their service would be so successful. These guys are getting more inbound video than YouTube. They use DynamoDB to track all the metadata on all their video key points. So they have S3 buckets they push all the binary artifacts into, and then they have DynamoDB records that point people to those S3 objects. When they need to look at a video, they look up the record in DynamoDB, they click the link, and they pull down the video from S3. So that's kind of what this looks like. And this is straight from their team: DynamoDB reduced their delivery time for video events-- from five to 10 seconds in their old relational store, where they had to execute multiple complex queries to figure out which videos to pull down, to less than 50 milliseconds. So it's amazing, amazing how much performance you can get when you optimize and tune the underlying database to support the access pattern. Halfbrick-- these guys, what is it, Fruit Ninja I guess is their thing. That all runs on DynamoDB.
And these guys, they are a great development team, a great development shop. Not a big ops team-- they didn't have a lot of operations resources. They were struggling trying to keep their application infrastructure up and running. They came to us. They looked at DynamoDB. They said, that's for us. They built their whole application framework on it. Some really nice comments here from the team on their ability to now focus on building the game and not having to maintain the infrastructure, which was becoming an enormous amount of overhead for their team. So that's the kind of benefit you get from DynamoDB. All right, getting into data modeling here. We talked a little about one to one, one to many, and many to many type relationships. How do you maintain those in DynamoDB? In DynamoDB we use indexes, generally speaking, to rotate the data from one flavor to the other: hash keys, range keys, and indexes. In this particular example-- most states have a licensing requirement of only one driver's license per person. You can't get two driver's licenses in the state of Massachusetts. I can't do it in Texas. That's kind of the way it is. So at the DMV, we have lookups. We want to look up the driver's license by the social security number, and we want to look up the user details by the driver's license number. So we might have a users table that has a hash key on the social security number and various attributes defined on the item. Now, on that table I could define a GSI that flips that around, that says I want a hash key on the license number and then all the other attributes. Now if I want to query and find the license number for any given social security number, I can query the main table. If I want to query and get the social security number or other attributes by a license number, I can query the GSI. That models the one to one relationship-- just a very simple GSI to flip those things around. Now, let's talk about one to many. One to many is basically your hash-range key structure. Where we see a lot of this use case is monitor data. Monitor data comes in at regular intervals, like the internet of things-- we're always getting all these records coming in all the time, and I want to find all the readings between a particular time period. It's a very common query in monitoring infrastructure. The way to go about that is to define a simple table structure, one table: a device measurements table with a hash key on the device ID and a range key on the timestamp, or in this case, the epoch. And that allows me to execute queries against that range key and return the records that are relevant to the result set I'm looking for. It builds that one to many relationship into the primary table using the hash key, range key structure. So that's kind of built into the table in DynamoDB-- when I define a hash and range table, I'm defining a one to many relationship. It's a parent-child relationship. Let's talk about many to many relationships. For this particular example, again, we're going to use GSIs. Let's take a gaming scenario where I have a given user, and I want to find out all the games that he's registered for or playing in. And for a given game, I want to find all the users. So how do I do that? In my user games table, I'm going to have a hash key of user ID and a range key of the game. So a user can have multiple games-- it's a one to many relationship between the user and the games he plays. And then on the GSI, I'll flip that around.
I'll hash on the game and I'll range on the user. So if I want to get all the games the user's playing in, I query the main table. If I want to get all the users that are playing a particular game, I query the GSI. So you see how we do this? You build these GSIs to support the use case-- the application, the access pattern. If I need to query on this dimension, let me create an index on that dimension. If I don't, I don't care. And depending on the use case, I may need the index or I might not. If it's a simple one to many, the primary table is fine. If I need to do these many to manys, or I need to do one to ones, then maybe I do need a secondary index. So it all depends on what I'm trying to do and what I'm trying to get accomplished. I'm probably not going to spend too much time talking about documents-- that gets a little deeper than we need to go into. Let's talk a little bit about rich query expressions. In DynamoDB we have the ability to create what we call projection expressions. Projection expressions are simply picking the fields or the values that you want to display. So I make a query against DynamoDB and I say, you know what, show me only the five star reviews for this particular product. That's all I want to see. I don't want to see all the other attributes of the row, I just want to see this. It's just like in SQL: when you say select star from table, you get everything; when I say select name from table, I only get one attribute. It's the same kind of thing in DynamoDB and other NoSQL databases. Filter expressions allow me to basically cut the result set down. So I make a query. The query may come back with 500 items, but I only want the items that have an attribute that says this. OK, so let's filter out the items that don't match that particular condition. So we have filter expressions. Filter expressions can be run on any attribute. They're not like range queries-- range queries are more selective. Filter queries require me to go get the entire result set and then carve out the data I don't want. Why is that important? Because I read it all. In a query, I'm going to read a giant amount of data, and then I'm going to carve out what I need. If I'm only carving out a couple of rows, that's OK, it's not so inefficient. But if I'm reading a whole pile of data just to carve out one item, then I'm going to be better off using a range query, because it's much more selective. It's going to save me a lot of money, because I pay for that read. The result set that comes back across the wire might be smaller, but I'm paying for the read. So understand how you're getting the data-- that's very important in DynamoDB. Conditional expressions-- this is what you might call optimistic locking. Update if it exists, or if this value is equivalent to what I specify. And if I have a timestamp on a record, I might read the data, I might change that data, and I might go write that data back to the database. If somebody has changed the record, the timestamp will have changed. And that way my conditional update can say, update if the timestamp equals this. Otherwise the update will fail, because somebody updated the record in the meantime. That's what we call optimistic locking. It means that somebody can come in and change it, and I'm going to detect it when I go back to write. And then I can actually read that data and say, oh, he changed this, I need to account for that.
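Putting some code to those three ideas-- projection expressions, filter expressions, and that timestamp-based conditional update-- here's a rough sketch against an invented product reviews table; none of these names come from a real schema:

```python
import boto3
from boto3.dynamodb.conditions import Attr, Key

reviews = boto3.resource("dynamodb").Table("ProductReviews")  # hypothetical

# Projection expression: only return the attributes the view needs.
# Filter expression: the query still reads every review for the product
# (and you pay for that read); the filter just trims what comes back.
resp = reviews.query(
    KeyConditionExpression=Key("ProductId").eq("B000123"),
    FilterExpression=Attr("StarRating").eq(5),
    ProjectionExpression="ReviewId, ReviewTitle, StarRating, UpdatedAt",
)
mine = resp["Items"][0]

# Conditional expression as optimistic locking: only apply the update if the
# timestamp is still what I read. If someone got there first, this raises
# ConditionalCheckFailedException and I can re-read, merge, and retry.
reviews.update_item(
    Key={"ProductId": "B000123", "ReviewId": mine["ReviewId"]},
    UpdateExpression="SET ReviewTitle = :t, UpdatedAt = :now",
    ConditionExpression="UpdatedAt = :old",
    ExpressionAttributeValues={
        ":t": "Edited title",
        ":now": "2015-09-01T12:00:00Z",
        ":old": mine["UpdatedAt"],
    },
)
```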
When that happens, I can merge the change into my record and apply another update. So you can catch those incremental updates that occur between the time that you read the data and the time you might write the data. AUDIENCE: And the filter expression actually means not in the number or not-- [INTERPOSING VOICES] RICK HOULIHAN: I won't get too much into this. This is about reserved keywords. Views is a reserved keyword in DynamoDB-- every database has its own reserved names you can't use. In DynamoDB, if you put a pound sign in front of it, you can define that name up above as a reference. It's probably not the best syntax to have up there for this discussion, because it gets into some real-- I'd have to be talking about that at a deeper level. But suffice it to say, this could be a query or scan where views-- or rather, pound-views-- is greater than 10. It is a numerical value, yes. If you want, we can talk about that after the discussion. All right, so we're getting into some scenarios and best practices where we're going to talk about some apps here. What are the use cases for DynamoDB? What are the design patterns in DynamoDB? And the first one we're going to talk about is the internet of things. So we get a lot of-- I guess, what is it-- more than 50% of the traffic on the internet these days is actually generated by machines, automated processes, not by humans. I mean, this thing that you carry around in your pocket-- how much data that thing is actually sending around without you knowing it is absolutely amazing. Your location, information about how fast you're going. How do you think Google Maps works when it tells you what the traffic is? It's because there are millions and millions of people driving around with phones that are sending data all over the place all the time. So one of the things about this type of data that comes in-- monitor data, log data, time series data-- is that it's usually only interesting for a little bit of time. After that time, it's not so interesting. So we talked about, don't let those tables grow without bounds. The idea here is that maybe I've got 24 hours' worth of events in my hot table. And that hot table is going to be provisioned at a very high rate, because it's taking a lot of data in and I'm reading it a lot-- I've got a lot of operational queries running against that data. After 24 hours, hey, you know what, I don't care. So maybe every midnight I roll my table over to a new table and I deprovision the old table. I'll take the RCUs and WCUs down, because 24 hours later I'm not running as many queries against that data, so I'm going to save money. And maybe 30 days later I don't even need to care about it at all. I could take the WCUs all the way down to one, because you know what, it's never going to get written to-- the data is 30 days old, it never changes. And it's almost never going to get read, so let's just take that RCU down to 10. And I'm saving a ton of money on this data, only paying for my hot data. So that's the important thing to look at when you look at time series data coming in in volume. These are strategies. Now, I could just let it all go to the same table and let that table grow. Eventually, I'm going to see performance issues. I'm going to have to start archiving some of that data off the table, whatnot. It's much better to design your application so that you can operate this way, so it's just automatic in the application code.
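What that nightly rollover might look like in code-- a minimal sketch, assuming one table per day with an invented naming convention and made-up capacity numbers:

```python
import datetime
import boto3

client = boto3.client("dynamodb")

def roll_tables(today: datetime.date) -> None:
    """Create tonight's hot table and dial yesterday's table way down."""
    hot = "events_" + today.strftime("%Y_%m_%d")
    cold = "events_" + (today - datetime.timedelta(days=1)).strftime("%Y_%m_%d")

    # The new hot table gets the high provisioned throughput.
    client.create_table(
        TableName=hot,
        AttributeDefinitions=[
            {"AttributeName": "DeviceId", "AttributeType": "S"},
            {"AttributeName": "Epoch", "AttributeType": "N"},
        ],
        KeySchema=[
            {"AttributeName": "DeviceId", "KeyType": "HASH"},
            {"AttributeName": "Epoch", "KeyType": "RANGE"},
        ],
        ProvisionedThroughput={"ReadCapacityUnits": 10000, "WriteCapacityUnits": 10000},
    )

    # Yesterday's table is read-mostly and much colder now; a 30-day-old table
    # could go all the way down to 10 RCUs and 1 WCU.
    client.update_table(
        TableName=cold,
        ProvisionedThroughput={"ReadCapacityUnits": 1000, "WriteCapacityUnits": 100},
    )
```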
At midnight every night it rolls the table. Maybe what I need is a sliding window of 24 hours of data. Then on a regular basis I'm culling data off the table, trimming it with a cron job, and putting it onto these other tables-- whatever you need. So if a rollover works, that's great. If not, trim it. But let's keep that hot data away from your cold data. It'll save you a lot of money and make your tables more performant. So the next thing we'll talk about is the product catalog. The product catalog is a pretty common use case, and this is actually a very common pattern that we see in a variety of things. Twitter, for example: a hot tweet, everyone's coming and grabbing that tweet. Product catalog: I've got a sale, a hot sale, and I've got 70,000 requests per second coming for one product description out of my product catalog. We see this in retail operations quite a bit. So how do we deal with that? There's no way to deal with that at the table level-- all my users want to see the same piece of data. They're coming in concurrently, and they're all making requests for the same piece of data. This gives me that hot key, that big red stripe on my chart that we don't like. And that's what that looks like: across my key space I'm getting hammered on the sale items, and I'm getting nothing anywhere else. How do I alleviate this problem? Well, we alleviate it with a cache. A cache is basically an in-memory partition you put in front of the database. We have managed ElastiCache, or you can set up your own cache, memcached, whatever you want. Put that in front of the database, and that way you can store the data for those hot keys up in that cache space and read through the cache. And then most of your reads start looking like this: I've got all these cache hits up here and nothing going on down here, because the database is sitting behind the cache and the reads never come through. If I change the data in the database, I have to update the cache. We can use something like streams to do that, and I'll explain how that works. All right, messaging. Email-- we all use email. This is a pretty good example. We've got some sort of messages table, and we've got an inbox and an outbox. This is what the SQL would look like to build that inbox. We use the same kind of strategy with GSIs-- GSIs for my inbox and my outbox. So I've got raw messages coming into my messages table. And the first approach to this might be to say, OK, no problem, I've got raw messages. Messages come in [INAUDIBLE], message ID, that's great. That's my unique hash. I'm going to create two GSIs, one for my inbox, one for my outbox. And the first thing I'll do is say my hash key is going to be the recipient and I'm going to range on the date. This is fantastic. I've got my nice view here. But there's a little issue here, and you run into this in relational databases as well. It's called vertical partitioning: you want to keep your big data away from your little data. And the reason why is because I've got to go read the items to get the attributes. And if the message bodies are all in here, then even reading just a few items-- if my body length is averaging 256 kilobytes each-- the math gets pretty ugly. Say I want to read David's inbox. David's inbox has 50 items. The average size is 256 kilobytes. My conversion ratio for RCUs is four kilobytes. OK, let's go with eventually consistent reads. I'm still eating 1,600 RCUs just to read David's inbox. Ouch. OK, now let's think about how the app works.
If I'm in an email app and I'm looking at my inbox, do I look at the body of every message? No, I'm looking at the summaries, only the headers. So let's build a table structure that looks more like that. So here's the information that my workflow needs, and it's in my inbox GSI: the date, the sender, the subject, and then the message ID, which points back to the messages table where I can get the body. Well, these would be record IDs. They point back to the item IDs on the DynamoDB table. Every index always carries the item ID-- that comes with the index. All right. AUDIENCE: It tells it where it's stored? RICK HOULIHAN: Yes, that's exactly what it does. It says, here's my record, and it points back to my record. Exactly. OK, so now my inbox is actually much smaller. And this actually supports the workflow of an email app. So in my inbox, I click along, and when I click on a message, that's when I need to go get the body, because I'm going to go to a different view. So if you think about an MVC type of framework-- model, view, controller-- the model contains the data that the view needs and the controller interacts with. When I change the frame, when I change the perspective, it's OK to go back to the server and repopulate the model, because that's what the user expects. When they change views, that's when we can go back to the database. So email: click, I'm looking for the body, round trip, go get the body. I read a lot less data. I'm only reading the bodies that David needs when he needs them, and I'm not burning 1,600 RCUs just to show his inbox. So this is the way that the GSI would work out. We've got our hash on the recipient, we've got the range key on the date, and we've got the projected attributes that we need only to support the view. We rotate that for the outbox: hash on the sender. And in essence, we have a very nice, clean view. We have this nice messages table that's being spread nicely because it's hash only, hashed on message ID, and we have two indexes rotating off of that table. All right, so the idea here is don't keep the big data and the small data together. Partition vertically, partition those tables. Don't read data you don't have to. All right, gaming. We all like games-- at least I like games, anyway. So some of the things that we deal with when we're thinking about gaming-- gaming these days, especially mobile gaming, is all about thinking in terms of APIs. And I'm going to rotate here a little bit away from DynamoDB and bring in some of the discussion around some of the other AWS technologies. But the idea about gaming is to think in terms of APIs-- APIs that are, generally speaking, HTTP and JSON. That's how mobile games interact with their back ends. They do JSON posts, they get data back, and it's all, generally speaking, nice JSON APIs. Things like get friends, get the leaderboard, exchange data, user generated content pushed back up to the system-- these are the types of things that we're going to do. Binary asset data-- this data might not sit in the database. It might sit in an object store, but the database is going to end up telling the system, telling the application, where to go get it. And inevitably, multiplayer servers and back end infrastructure, designed for high availability and scalability. These are things that we all want in gaming infrastructure today. So let's take a look at what that looks like.
We've got a core back end, very straightforward. We've got a system here with multiple availability zones. We talked about AZs-- think of them as separate data centers. There can be more than one data center per AZ, but that's OK; just think of them as separate data centers that are geographically and fault isolated. We're going to have a couple of EC2 instances and some back end servers. Maybe, if you're on a legacy architecture, you're using what we call RDS, relational database services-- could be MSSQL, MySQL, or something like that. This is the way a lot of applications are designed today. Where we might want to go when we scale out is something like this. We'll go ahead and put the S3 bucket up there. And that S3 bucket, instead of serving up those objects from our servers-- we could do that; you could put all your binary objects on your servers and use those server instances to serve that data up, but that's pretty expensive. A better way is to go ahead and put those objects in an S3 bucket. S3 is an object repository. It's built specifically for serving up these types of things. And let those clients request directly from those object buckets-- offload the servers. So we're starting to scale out here. Now we've got users all over the world. I've got users, and I need to have content located close to those users. I've created an S3 bucket as my source repository, and I'll front that with a CloudFront distribution. CloudFront is a CDN, a content delivery network. Basically it takes data that you specify and caches it all over the internet, so users everywhere can have a very quick response when they request those objects. So you get the idea-- you're kind of leveraging all the aspects of AWS here to get this done. And eventually, we throw in an auto scaling group. So our EC2 instances, our game servers, as they start to get busier and busier, they'll just spin up another instance, another instance, another instance. The technology AWS has allows you to specify the parameters around which your servers will grow. So you can have n number of servers out there at any given time. If your load goes away, the number will shrink. And if the load comes back, it'll grow back out, elastically. So this looks great. We've got a lot of EC2 instances. We can put a cache in front of the databases to try and accelerate them. The next pressure point people typically see is when they scale a game using a relational database system: jeez, the database performance is terrible, how do we improve that? Let's try putting a cache in front of it. Well, cache doesn't work so great in games. For games, writing is painful. Games are very write heavy, and cache doesn't work when you're write heavy, because you've always got to update the cache. If you're always updating the cache, the caching is irrelevant-- it's actually just extra work. So where do we go here? You've got a big bottleneck down there in the database. And the place to go, obviously, is partitioning. Partitioning is not easy to do when you're dealing with relational databases. With relational databases, you're responsible for managing, effectively, the key space. You're saying users between A and M go here, between N and Z go there, and you're switching across the application. So you're dealing with this partitioned data source. You have transactional constraints that don't span partitions. You've got all kinds of messiness that you're dealing with down there, trying to scale out and build a larger infrastructure.
It's just no fun. AUDIENCE: So are you saying that increasing source points speeds up the process? RICK HOULIHAN: Increasing? AUDIENCE: Source points. RICK HOULIHAN: Source points? AUDIENCE: From the information, where the information is coming from? RICK HOULIHAN: No. What I'm saying is that increasing the number of partitions in the data store improves throughput. So what's happening here is, users come into the EC2 instance up here; well, if I need a user that's A to M, I'll go here. From N to P, I'll go here. From P to Z, I'll go here. AUDIENCE: OK, so those are all stored in different nodes? RICK HOULIHAN: Yes. Think of these as different silos of data. So you're having to do this. If you're trying to scale on a relational platform, this is what you're doing: you're taking data and cutting it down, you're partitioning it across multiple instances of the database, and you're managing all that at the application tier. It's no fun. So where do we want to go? We want to go to DynamoDB: a fully managed NoSQL data store with provisioned throughput. We use secondary indexes. It's basically an HTTP API, and it includes document support. So you don't have to worry about any of that partitioning. We do it all for you. So now, instead, you just write to the table. If the table needs to be partitioned, that happens behind the scenes. You're completely insulated from that as a developer. So let's talk about some of the use cases that we run into in gaming, common gaming scenarios-- leaderboard. So you've got users coming in, the BoardNames that they're on, and the scores for each user. We might be hashing on the UserID and ranging on the game. So every user wants to see all the games he's played and his top score across all of them-- that's his personal leaderboard. Now I want to go in and get the top score across all users. So how do I do that? My record is hashed on the UserID and ranged on the game, so I'm going to go ahead and create a GSI and restructure that data. Now I'm going to hash on the BoardName, which is the game, and I'm going to range on the top score. I've now created different buckets. I'm using the same table, the same item data, but I'm creating a bucket that gives me an aggregation of top score by game, and I can query that index to get that information. So I've set that query pattern up to be supported by a secondary index-- now the items can be sorted by BoardName and by TopScore, depending on what I need. So you can see, these are the types of use cases you get in gaming. Another good use case we get in gaming is awards and who's won the awards. And this is a great use case for what we call sparse indexes. A sparse index is the ability to generate an index that doesn't necessarily contain every single item in the table. Why not? Because the attribute that's being indexed doesn't exist on every item. So in this particular use case, I'm saying, you know what, I'm going to create an attribute called Award, and I'm going to give that attribute to every user that has an award. Users that don't have awards won't have that attribute. So when I create the index, the only users that show up in the index are the ones that have actually won awards. That's a great way to create filtered indexes that are very, very selective and don't have to index the entire table. So we're getting low on time here.
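A quick sketch before we move on: the top-ten query against that leaderboard GSI would look something like this, with invented table, index, and attribute names:

```python
import boto3
from boto3.dynamodb.conditions import Key

scores = boto3.resource("dynamodb").Table("UserGameScores")  # hypothetical

# GSI hashed on BoardName and ranged on TopScore: walk the range key backwards
# to get the highest scores first.
top_ten = scores.query(
    IndexName="BoardName-TopScore-index",
    KeyConditionExpression=Key("BoardName").eq("fruit_blaster"),
    ScanIndexForward=False,  # descending by the range key (TopScore)
    Limit=10,
)

# The sparse awards index works the same way: only items that actually carry
# an Award attribute ever appear in an index keyed on Award, so querying it
# touches just the winners, never the whole table.
```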
I'm going to go ahead and skip this scenario. Talk a little bit about-- AUDIENCE: Can I ask a quick question? Which one is write heavy? RICK HOULIHAN: What is? AUDIENCE: Write heavy. RICK HOULIHAN: Write heavy. Let me see. AUDIENCE: Or is that not something you can just voice in a matter of seconds? RICK HOULIHAN: We can go through the voting scenario. It's not that bad. Do you guys have a few minutes? OK. So we'll talk about voting. Real time voting-- we have requirements for voting. The requirements are that we allow each person to vote only once, we want nobody to be able to change their vote, and we want real-time aggregation and analytics for demographics that we're going to be showing to users on the site. Think of this scenario. We work with a lot of reality TV shows where they're doing these exact types of things. So you can picture the scenario: we have millions and millions of teenage girls with their cell phones, voting, and voting, and voting for whoever they find to be the most popular. So these are some of the requirements we run into. And the first take at solving this problem would be to build a very simple application. So I've got this app. I have some voters out there. They come in, they hit the voting app. I've got some raw votes table I'll just dump those votes into. I'll have some aggregate votes table that will do my analytics and demographics, and we'll put all of this in there. And this is great. Life is good. Life's good until we find out that there are always only one or two people that are popular in an election. There are only one or two things that people really care about. And if you're voting at scale, all of a sudden I'm going to be hammering the hell out of one or two candidates-- a very limited number of items people find to be popular. This is not a good design pattern. This is actually a very bad design pattern, because it creates exactly what we talked about: hot keys. Hot keys are something we don't like. So how do we fix that? The way to fix this is by taking those candidate buckets, and for each candidate we have, we append a random value-- a random value between one and 100, between 100 and 1,000, or between one and 1,000, however many random values you want to append onto the end of that candidate. And what have I really done then? If I'm using the candidate ID as the bucket to aggregate votes, and I've added a random number to the end of it, I've now created 10 buckets, a hundred buckets, a thousand buckets that I'm aggregating votes across. So when I have millions, and millions, and millions of records coming in for these candidates, I am now spreading those votes across Candidate A_1 through Candidate A_100, because every time a vote comes in, I'm generating a random value between one and 100, tacking it onto the end of the candidate that person's voting for, and dumping it into that bucket. Now, on the backside, I know that I've got a hundred buckets. So when I want to go ahead and aggregate the votes, I read from all those buckets and add them up. I do the scatter gather, where I go out and say, hey, you know what, this candidate's key space is a hundred buckets. I'm going to gather all the votes from those hundred buckets, I'm going to aggregate them, and I'm going to say, Candidate A now has a total vote count of x. Now both the write query and the read query are nicely distributed, because I'm writing across and reading across hundreds of keys.
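Here's a minimal sketch of that write-sharding pattern-- the hundred-bucket count, the table name, and the attribute names are all just for illustration:

```python
import random
import boto3

SHARDS = 100
votes = boto3.resource("dynamodb").Table("AggregateVotes")  # hypothetical

def record_vote(candidate_id: str) -> None:
    # Spread the hot candidate across 100 buckets: Candidate_A_1 ... Candidate_A_100.
    bucket = "{}_{}".format(candidate_id, random.randint(1, SHARDS))
    votes.update_item(
        Key={"CandidateBucket": bucket},
        UpdateExpression="ADD VoteCount :one",   # atomic counter per bucket
        ExpressionAttributeValues={":one": 1},
    )

def total_votes(candidate_id: str) -> int:
    # Scatter-gather: read all 100 buckets back and add them up.
    total = 0
    for shard in range(1, SHARDS + 1):
        resp = votes.get_item(Key={"CandidateBucket": "{}_{}".format(candidate_id, shard)})
        total += int(resp.get("Item", {}).get("VoteCount", 0))
    return total
```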
Now I'm not writing and reading against one key anymore. So that's a great pattern. This is actually probably one of the most important design patterns for scale in NoSQL. You will see this type of design pattern in every flavor-- MongoDB, DynamoDB, it doesn't matter, we all have to do this. Because when you're dealing with those huge aggregations, you have to figure out a way to spread them out across buckets. So this is the way you do that. All right, so what you're doing here is trading off read cost for write scalability. The cost of my read is a little more complex-- I have to go read from a hundred buckets instead of one-- but I'm able to write, and my write throughput is incredible. So it's usually a valuable technique for scaling DynamoDB, or any NoSQL database for that matter. So we figured out how to scale it, and we figured out how to eliminate our hot keys. And this is fantastic. We've got this nice system, and it's giving us very correct voting, because we have vote de-duplication built into DynamoDB. We talked about conditional writes. When a voter comes in and puts an insert on the table, they insert with their voter ID. If they try to insert another vote, I do a conditional write: only write this if this doesn't exist. So as soon as I see that that vote's hit the table, nobody's going to be able to put another vote in for that voter. And that's fantastic. And we're incrementing our candidate counters, and we're doing our demographics and all that. But what happens if my application falls over? Now all of a sudden votes are coming in, and I don't know if they're getting processed into my analytics and demographics anymore. And when the application comes back up, how the hell do I know what votes have been processed and where do I start? So this is a real problem when you start to look at this type of scenario. How do we solve it? We solve it with what we call DynamoDB Streams. Streams is a time ordered and partitioned change log of every write access to the table. Any data that's written to the table shows up on the stream. It's basically a 24 hour queue. Items hit the stream, they live for 24 hours, and they can be read multiple times. They're guaranteed to be delivered only once to the stream, but can be read n number of times. So however many processes you want to consume that data, you can consume it. Every update, every write, will appear only once on the stream, so you don't have to worry about processing it twice from the same process. It's strictly ordered per item. When we say time ordered and partitioned, per partition on the stream you will see updates in order. We are not guaranteeing that on the stream you're going to get every transaction in order across items. So streams are idempotent. Do we all know what idempotent means? Idempotent means you can do it over, and over, and over again, and the result's going to be the same. Streams are idempotent, but they have to be played from the starting point, wherever you choose, to the end, or they will not result in the same values. Same thing with MongoDB. MongoDB has a construct they call the oplog. It is the exact same construct. Many NoSQL databases have this construct. They use it to do things like replication, which is exactly what we do with streams. AUDIENCE: Maybe a heretical question, but you talk about apps going down and so forth. Are streams guaranteed to never possibly go down? RICK HOULIHAN: Yeah, streams are guaranteed to never go down.
We manage the infrastructure behind them. Streams automatically deploy in their own auto scaling group. We'll go through a little bit about what happens. I shouldn't say they're guaranteed to never go down. The elements are guaranteed to appear in the stream, and the stream will be accessible. So whatever goes down or comes back up, that happens underneath the covers-- it's OK. All right, so you get different view types off the stream. The view types that are important to a programmer are, first, the old image. When an update hits the table, it'll push the old image to the stream, so you can archive data, or do change control, change identification, change management. The new image-- what it is now, after the update-- is another type of view you can get. You can get both the old and new images. Maybe I want them both: I want to see what it was, and I want to see what it changed to. Maybe I have a compliance type of process that runs, and it needs to verify that when these things change, they're within certain limits or within certain parameters. And then maybe I only need to know which items changed. I don't care what they changed to, I don't need to know which attributes changed, I just need to know that the items are being touched. So these are the types of views that you get off the stream and can interact with. The application that consumes the stream-- this is kind of the way this works. DynamoDB clients push data to the tables. Streams deploy on what we call shards. Shards are scaled independently of the table. They don't line up exactly with the partitions of your table, and the reason why is because they line up with the capacity, the current capacity, of the table. They deploy in their own auto scaling group, and they start to spin out depending on how many writes are coming in, how many reads-- really it's writes, there are no reads-- how many writes are coming in. And then on the back end, we have what we call the KCL, or Kinesis Client Library. Kinesis is a stream data processing technology from Amazon, and Streams is built on that. So you use a KCL-enabled application to read the stream. The Kinesis Client Library actually manages the workers for you. And it also does some interesting things. It will create some tables up in your DynamoDB tablespace to track which items have been processed. So this way, if it falls over and gets stood back up, it can determine where it was in processing the stream. That's very important when you're talking about replication. I need to know what data has been processed and what data has yet to be processed. So the KCL library for streams will give you a lot of that functionality. It takes care of all the housekeeping. It stands up a worker for every shard. It creates an administrative table for every shard, for every worker. And as those workers fire, they maintain those tables, so you know this record was read and processed. And then that way, if the process dies and comes back online, it can resume right where it left off. So we use this for cross-region replication. A lot of customers have the need to move data, or parts of their data tables, around to different regions. There are nine regions all around the world. So there might be a need-- I might have users in Asia, users on the East Coast of the United States. They have different data that needs to be locally distributed. And maybe a user flies from Asia over to the United States, and I want to replicate his data with him.
So when he gets off the plane, he has a good experience using his mobile app. You can use the cross-region replication library to do this. Basically we have provided two technologies. One is a console application you can stand up on your own EC2 instance; it runs pure replication. And then we gave you the library. The library you can use to build your own application if you want to do crazy things with that data-- filter it, replicate only part of it, rotate the data, move it into a different table, so on and so forth. So that's kind of what that looks like. DynamoDB Streams can be processed by what we call Lambda. We mentioned a little bit about event driven application architectures. Lambda is a key component of that. Lambda is code that fires on demand in response to a particular event. One of those events could be a record appearing on the stream. If a record appears on the stream, we'll call this function. This one is JavaScript-- Lambda supports Node.js, Java, Python, and will soon support other languages as well. And suffice it to say, it's pure code. In Java, you define a class, you push the JAR up into Lambda, and then you specify which class to call in response to which event. And then the Lambda infrastructure behind that will run that code. That code can process records off the stream. It can do anything it wants with them. In this particular example, all we're really doing is logging the attributes. But this is just code. Code can do anything, right? So you can rotate that data, you can create a derivative view, if it's a document structure you can flatten the structure, you can create alternate indexes-- all kinds of things you can do with DynamoDB Streams. And really, that's what that looks like. So you get those updates coming in. They come off the stream, they're read by the Lambda function, and it's rotating the data and pushing it into derivative tables, notifying external systems of change, and pushing data into ElastiCache. We talked about putting the cache in front of the database for that sales scenario. Well, what happens if I update the item description? If I had a Lambda function running on that table, when I update the item description it'll pick up the record off the stream and it'll update the ElastiCache instance with the new data. So that's a lot of what we do with Lambda. It's glue code, connectors. And it actually gives you the ability to launch and run very complex applications without a dedicated server infrastructure, which is really cool. So let's go back to our real-time voting architecture. This is new and improved with our Streams and KCL-enabled application. Same as before, we can handle any scale of election. We like this. We're doing our scatter gathers across multiple buckets. We've got optimistic locking going on. We can keep our voters from changing their votes. They can only vote once. This is fantastic. Real-time fault tolerance, scalable aggregation now. If the thing falls over, it knows where to restart itself when it comes back up, because we're using the KCL app. And then we can also use that KCL application to push data out to Redshift for other analytics, or use Elastic MapReduce to run real-time streaming aggregations off of that data. So these are things we haven't talked about much, but they're additional technologies that come to bear when you're looking at these types of scenarios. All right, so that's about analytics with DynamoDB Streams.
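For reference, since Lambda speaks Python too, a stream-triggered handler might look roughly like this. It just logs the new image, the same as the JavaScript example; anything past that-- updating a derivative table or refreshing a cache entry-- would go where the comment sits:

```python
import json

def handler(event, context):
    # DynamoDB Streams hands Lambda a batch of records; each carries the keys
    # plus whichever image types the stream is configured to emit.
    for record in event["Records"]:
        change = record["dynamodb"]
        if record["eventName"] in ("INSERT", "MODIFY"):
            new_image = change.get("NewImage", {})
            # Just logging the attributes here -- this is where you'd rotate the
            # data, write a derivative table, or update a cache entry instead.
            print(json.dumps(new_image))
        elif record["eventName"] == "REMOVE":
            print("removed: " + json.dumps(change.get("Keys", {})))
    return "processed {} records".format(len(event["Records"]))
```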
You can collect and de-dupe data, do all kinds of nice stuff, aggregate data in memory, create those derivative tables. That's a huge use case that a lot of customers are involved with: taking the nested properties of those JSON documents and creating additional indexes. We're at the end. Thank you for bearing with me. So let's talk about reference architecture. DynamoDB sits in the middle of so much of the AWS infrastructure. Basically you can hook it up to anything you want: applications built using Dynamo can bring in Lambda, ElastiCache, CloudSearch, push the data out into Elastic MapReduce, import and export from DynamoDB into S3-- all kinds of workflows. But probably the best thing to talk about, and what's really interesting, is event driven applications. This is an example of an internal project that we have where we're actually publishing surveys to gather survey results. So in an email link that we send out, there'll be a little link saying, click here to respond to the survey. And when a person clicks that link, what happens is they pull down a secure HTML survey form from S3. There's no server. This is just an S3 object. That form comes up and loads up in the browser. It's got Backbone, it's got complex JavaScript that it's running. So it's a very rich application running in the client's browser. They don't know that they're not interacting with a back end server. At this point, it's all browser. They publish the results to what we call the Amazon API Gateway. API Gateway is simply a web API that you can define and hook up to whatever you want. In this particular case, we're hooked up to a Lambda function. So my POST operation is happening with no server. Basically that API Gateway just sits there. It costs me nothing until people start POSTing to it. The Lambda function just sits there, and it costs me nothing until people start hitting it. So you can see, as the volume increases, that's when the charges come. I'm not running a server 24/7. So I pull the form down out of the bucket, and I POST through the API Gateway into the Lambda function. And then the Lambda function says, you know what, I've got some PII, some personally identifiable information, in these responses. I've got comments coming from users. I've got email addresses. I've got usernames. Let me split this off. I'm going to generate some metadata off this record, and I'm going to push the metadata into DynamoDB. I could encrypt all the data and push it into DynamoDB if I want, but it's easier for me, in this use case, to go ahead and say, I'm going to push the raw data into an encrypted S3 bucket. So I use built-in S3 server side encryption and Amazon's Key Management Service, so that I have a key that I can rotate on a regular interval, and I can protect that PII data as part of this whole workflow. So what have I done? I've just deployed a whole application, and I have no server. That's what event driven application architecture does for you. Now, if you think about the use case for this-- we have other customers I'm talking to about this exact architecture who run phenomenally large campaigns, who are looking at this and going, oh my. Because now they can basically push it out there, let that campaign just sit there until it launches, and not have to worry a fig about what kind of infrastructure is going to be there to support it. And then as soon as the campaign is done, it's like the infrastructure just immediately goes away, because there really is no infrastructure. It's just code that sits on Lambda.
It's just data that sits in DynamoDB. It is an amazing way to build applications. AUDIENCE: So is it more ephemeral than it would be if it was stored on an actual server? RICK HOULIHAN: Absolutely. Because that server instance would have to be up 24/7. It has to be available for somebody to respond to. Well, guess what? S3 is available 24/7. S3 always responds. And S3 is very, very good at serving up objects. Those objects can be HTML files, or JavaScript files, or whatever you want. You can run very rich web applications out of S3 buckets, and people do. And so that's the idea here-- to get away from the way we used to think about it. We all used to think in terms of servers and hosts. It's not about that anymore. It's about infrastructure as code. Deploy the code to the cloud and let the cloud run it for you. And that's what AWS is trying to do. AUDIENCE: So your gold box in the middle, the API Gateway, is not server-like, but instead is just-- RICK HOULIHAN: You can think of it as a server facade. All it does is take an HTTP request and map it to another process. That's all it does. And in this case, we're mapping it to a Lambda function. All right, so that's all I've got. Thank you very much. I appreciate it. I know we went a little bit over time. And hopefully you guys got a little bit of information that you can take away today. I apologize if I went over some of your heads, but there's a good lot of fundamental, foundational knowledge there that I think is very valuable for you. So thank you for having me. [APPLAUSE] AUDIENCE: [INAUDIBLE] is when you were saying you had to go through the thing from the beginning to the end to get the right values or the same values, how would the values change if [INAUDIBLE]? RICK HOULIHAN: Oh, idempotent? How would the values change? Well, because if I didn't run it all the way to the end, then I don't know what changes were made in the last mile. It's not going to be the same data as what I saw. AUDIENCE: Oh, so you just haven't gotten the entire input. RICK HOULIHAN: Right. You have to go from beginning to end, and then it's going to be a consistent state. Cool. AUDIENCE: So you showed us DynamoDB can do documents or key value. And we spent a lot of time on the key value with a hash and the ways to flip it around. When you looked at those tables, is that leaving behind the document approach? RICK HOULIHAN: I wouldn't say leaving it behind. AUDIENCE: They were separated from the-- RICK HOULIHAN: With the document approach, the document type in DynamoDB-- just think of it as another attribute. It's an attribute that contains a hierarchical data structure. And then in the queries, you can use the properties of those objects using object notation. So I can filter on a nested property of the JSON document. AUDIENCE: So any time I do a document approach, I can sort of arrive at the tabular-- RICK HOULIHAN: Absolutely. AUDIENCE: --indexes and things you just talked about. RICK HOULIHAN: Yeah, the indexes and all that-- when you want to index the properties of the JSON, the way we'd have to do that is, if you insert a JSON object or a document into Dynamo, you would use streams. Streams would read the input. You'd get that JSON object and you'd say, OK, what's the property I want to index? You create a derivative table. Now, that's the way it works right now. We don't allow you to directly index those properties. AUDIENCE: Tabularizing your documents. RICK HOULIHAN: Exactly, flattening it, tabularizing it, exactly. That's what you do with it.
AUDIENCE: Thank you. RICK HOULIHAN: Yep, absolutely, thank you. AUDIENCE: So it's kind of Mongo meets Redis classifiers. RICK HOULIHAN: Yeah, it's a lot like that. That's a good description for it. Cool.